Commit d22bfedc authored by François Laurent

more sensible choice of sums of squares

parent 09b58ac6
%% Cell type:code id:4e16caf7 tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
from patsy import dmatrices
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import diagnostic
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.outliers_influence import OLSInfluence
```
%% Cell type:markdown id:358dce7a tags:
# Multi-way ANOVA
%% Cell type:markdown id:a1face9f tags:
## Q
Load the `titanic.csv` data file, insert the natural logarithm of `1+Fare` as a new column in the dataframe (*e.g.* with column name `'LogFare'`), and plot this new variable as a function of `Age`, `Pclass` and `Sex`.
%% Cell type:markdown id:965cc75c tags:
## A
%% Cell type:code id:7eb30cc9 tags:
``` python
```
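The answer cell is collapsed in this diff. Below is a minimal sketch of what it could look like; the real notebook loads `titanic.csv`, but since that file is not available here, a small synthetic stand-in with the same columns is generated instead (an assumption, not the actual data):

```python
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Stand-in for pd.read_csv('titanic.csv') -- synthetic data (assumption)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'Age': rng.uniform(1, 80, n),
    'Pclass': rng.choice([1, 2, 3], n),
    'Sex': rng.choice(['male', 'female'], n),
    'Fare': rng.lognormal(2.5, 1.0, n),
})

# log(1 + Fare): log1p is well defined for Fare == 0
df['LogFare'] = np.log1p(df['Fare'])

# One panel per Pclass, colors per Sex, LogFare as a function of Age
sns.relplot(data=df, x='Age', y='LogFare', col='Pclass', hue='Sex')
plt.show()
```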
%% Cell type:markdown id:154c1460 tags:
## Q
Fit a linear model to these data to explain our synthetic variable `LogFare` as a function of `Age`, `Pclass` and `Sex`.
Treat `Pclass` and `Sex` as factors.
Print ANOVA tables for the different types of sums of squares.
%% Cell type:markdown id:3ad89dec tags:
## A
%% Cell type:code id:39f58924 tags:
``` python
```
%% Cell type:markdown id:ecf3bcd9 tags:
## Q
Because we have a large sample, we will ignore the non-normal residuals and play with post-hoc tests instead.
Split the ANOVA by the levels of `Pclass` and `Sex`, perform all pairwise comparisons where it makes sense, and correct for multiple comparisons.
First, proceed with type-3 sums of squares.
We are not interested in the significance of the slope of `Age` for the different levels of the factors.
%% Cell type:markdown id:00ace9f5 tags:
## A
%% Cell type:code id:09794922 tags:
``` python
```
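One hedged recipe for the post-hoc step, as an illustration rather than the notebook's exact method: all pairwise comparisons of `LogFare` between the levels of `Pclass`, corrected with `multipletests` (`Sex` has only two levels, so no correction is needed for it). Synthetic stand-in data again:

```python
from itertools import combinations
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Synthetic stand-in for titanic.csv (assumption)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'Pclass': rng.choice([1, 2, 3], n),
    'Sex': rng.choice(['male', 'female'], n),
})
df['LogFare'] = 4 - 0.8 * df['Pclass'] + rng.normal(0, 0.5, n)

# All pairwise comparisons between Pclass levels
labels, pvals = [], []
for a, b in combinations(sorted(df['Pclass'].unique()), 2):
    res = stats.ttest_ind(df.loc[df['Pclass'] == a, 'LogFare'],
                          df.loc[df['Pclass'] == b, 'LogFare'])
    labels.append(f'Pclass {a} vs {b}')
    pvals.append(res.pvalue)

# Holm correction for multiple comparisons
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='holm')
for lab, p, r in zip(labels, p_adj, reject):
    print(f'{lab}: adjusted p = {p:.3g}', '(reject)' if r else '')
```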
%% Cell type:markdown id:d30cdb9c tags:
## Q
Let us suppose we want to use type-1 sums of squares instead.
Proceed again, entering `Sex` first, `Pclass` second, and `Age` last.
In the post-hoc comparisons, we will disregard the effect of the slope of `Age`.
%% Cell type:markdown id:7a417d76 tags:
## A
%% Cell type:code id:034acc94 tags:
``` python
```
%% Cell type:markdown id:b7b20012 tags:
# Linear model with multiple variables
%% Cell type:markdown id:144b0584 tags:
## Q
Load the `mi.csv` file and plot the variable `Temperature` versus `HeartRate` and `PhysicalActivity`.
We will try to "explain" `Temperature` from `HeartRate` and `PhysicalActivity`.
%% Cell type:markdown id:35949307 tags:
## A
%% Cell type:code id:47cf88a3 tags:
``` python
```
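A sketch of the collapsed answer cell. The real notebook reads `mi.csv`; a synthetic stand-in with the same column names (an assumption) is used here:

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Stand-in for pd.read_csv('mi.csv') -- synthetic data (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'HeartRate': rng.normal(70, 10, n),
    'PhysicalActivity': rng.lognormal(1.0, 1.2, n),
})
df['Temperature'] = 36.5 + 0.01 * (df['HeartRate'] - 70) + rng.normal(0, 0.2, n)

# Temperature versus each candidate explanatory variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].scatter(df['HeartRate'], df['Temperature'], s=10)
axes[0].set(xlabel='HeartRate', ylabel='Temperature')
axes[1].scatter(df['PhysicalActivity'], df['Temperature'], s=10)
axes[1].set(xlabel='PhysicalActivity', ylabel='Temperature')
plt.show()
```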
%% Cell type:markdown id:62449fb6 tags:
## Q
%% Cell type:markdown id:c71d2a4c tags:
The `PhysicalActivity` variable exhibits a long-tailed, strongly asymmetric distribution. This is usually undesirable for an explanatory variable, because we cannot densely sample a large part of its domain of possible values, and therefore a model based on the data cannot be reliable.
We will transform `PhysicalActivity` using a simple natural logarithm. `log` is undefined at $0$ and diverges near $0$, which makes its straightforward application to `PhysicalActivity` inappropriate. Therefore we will also add $1$ to the `PhysicalActivity` measurements prior to applying `log`.
Plot the temperature again versus the transformed `PhysicalActivity` variable and compare the skewness of the transformed and raw variables.
%% Cell type:markdown id:21e8879e tags:
## A
%% Cell type:code id:0b8cac58 tags:
``` python
```
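A sketch of the transformation step on synthetic stand-in data (assumption), comparing skewness with `scipy.stats.skew`:

```python
import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['Temperature'] = 36.5 + rng.normal(0, 0.2, n)

# log(1 + x): defined at 0, tames the long right tail
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])

print('skewness (raw):        ', stats.skew(df['PhysicalActivity']))
print('skewness (transformed):', stats.skew(df['logPhysicalActivity']))

plt.scatter(df['logPhysicalActivity'], df['Temperature'], s=10)
plt.xlabel('log(1 + PhysicalActivity)')
plt.ylabel('Temperature')
plt.show()
```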
%% Cell type:markdown id:2c6d5225 tags:
## Q
To appreciate the increased robustness of a linear model using the transformed variable compared to the raw variable, design a simple univariate linear regression with `Temperature` as the response variable, and draw the Cook's distance of all the observations with respect to this model:
* first with the raw `PhysicalActivity` as explanatory variable,
* second with the transformed `PhysicalActivity` as explanatory variable.
%% Cell type:markdown id:c47489e6 tags:
## A
%% Cell type:code id:bf32c7e6 tags:
``` python
```
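A sketch of the Cook's distance comparison, using `OLSInfluence` (imported at the top of the notebook) on synthetic stand-in data (assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence
from matplotlib import pyplot as plt

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = 36.5 + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n)

# Cook's distances: raw versus transformed explanatory variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, x in zip(axes, ['PhysicalActivity', 'logPhysicalActivity']):
    fit = smf.ols(f'Temperature ~ {x}', data=df).fit()
    cooks_d, _ = OLSInfluence(fit).cooks_distance
    ax.stem(np.arange(n), cooks_d)
    ax.set(title=x, xlabel='observation', ylabel="Cook's distance")
plt.show()
```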
%% Cell type:markdown id:49408adc tags:
%% Cell type:markdown id:1c7871a2 tags:
## Q
Make a linear model of `Temperature` as response and `HeartRate` and `PhysicalActivity` (or its transformed variant) as explanatory variables.
Make two such models, one with interaction and one without. How would you choose between the two models?
%% Cell type:markdown id:12d7e9a5 tags:
## A (with nested Q&A)
%% Cell type:code id:82e9f8d6 tags:
``` python
```
%% Cell type:code id:c469b3d5 tags:
``` python
```
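A sketch of the two collapsed answer cells: the additive and interaction models, compared with a likelihood-ratio test (one reasonable way to choose between nested models; the notebook may use another). Synthetic stand-in data (assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

additive = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()
interact = smf.ols('Temperature ~ HeartRate * logPhysicalActivity', data=df).fit()

# The models are nested, so a likelihood-ratio test applies
lr_stat, p_value, df_diff = interact.compare_lr_test(additive)
print(f'log-likelihoods: {additive.llf:.2f} vs {interact.llf:.2f}')
print(f'LR test: stat = {lr_stat:.3f}, p = {p_value:.3f}, df = {df_diff}')
```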
%% Cell type:markdown id:66ddaa3e tags:
### Q
The interaction term is not significant, but most importantly, the increase in log-likelihood is very small; the interaction term does not help to better fit the model to the data.
To get a better intuition about the log-likelihood, plot it (with a dot plot) for different models, with one variable, with two variables, with and without interaction.
Feel free to introduce one or two extra explanatory variables such as `BMI`.
%% Cell type:markdown id:05844c56 tags:
### A
%% Cell type:code id:2588d8d6 tags:
``` python
```
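A sketch of the log-likelihood dot plot across several nested models, including a `BMI` term as suggested. Synthetic stand-in data (assumption, including the `BMI` column):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n),
                   'BMI': rng.normal(25, 4, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

formulas = [
    'Temperature ~ HeartRate',
    'Temperature ~ logPhysicalActivity',
    'Temperature ~ HeartRate + logPhysicalActivity',
    'Temperature ~ HeartRate * logPhysicalActivity',
    'Temperature ~ HeartRate + logPhysicalActivity + BMI',
]
llf = [smf.ols(f, data=df).fit().llf for f in formulas]

# Dot plot: one dot per model, log-likelihood on the x axis
plt.plot(llf, range(len(formulas)), 'o')
plt.yticks(range(len(formulas)), formulas)
plt.xlabel('log-likelihood')
plt.tight_layout()
plt.show()
```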
%% Cell type:markdown id:665a7a5c tags:
# White test for homoscedasticity
%% Cell type:markdown id:6d0bdd7b tags:
To keep things simple, let us use the `'HeartRate + PhysicalActivity'` or `'HeartRate + logPhysicalActivity'` model.
## Q
Inspect the residuals by plotting them versus each explanatory variable.
%% Cell type:markdown id:a6f37b2e tags:
## A
%% Cell type:code id:77e350be tags:
``` python
```
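A sketch of the residual inspection on synthetic stand-in data (assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

fit = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()

# Residuals versus each explanatory variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, x in zip(axes, ['HeartRate', 'logPhysicalActivity']):
    ax.scatter(df[x], fit.resid, s=10)
    ax.axhline(0, color='grey', lw=1)
    ax.set(xlabel=x, ylabel='residual')
plt.show()
```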
%% Cell type:markdown id:789bd3f7 tags:
## Q
We will further inspect the residuals for heteroscedasticity, using the White test.
`statsmodels` features an implementation of this test (`het_white`), but the documentation is scarce on details.
Try to apply the `het_white` function, but do not feel ashamed if you fail.
%% Cell type:markdown id:0a822074 tags:
## A
%% Cell type:code id:85d73982 tags:
``` python
```
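One way the collapsed answer cell could call `het_white`: the function takes the residuals and the design matrix of the fitted model (which must include the constant). Synthetic stand-in data (assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats import diagnostic

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

fit = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()

# het_white returns the LM statistic and p-value, plus an F-test variant
lm_stat, lm_pvalue, f_stat, f_pvalue = diagnostic.het_white(fit.resid, fit.model.exog)
print(f'LM statistic = {lm_stat:.3f}, p = {lm_pvalue:.3f}')
```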
%% Cell type:markdown id:e3ccb464 tags:
## Q
Instead, we will implement this test ourselves, as an application of polynomial regression.
The algorithm is simple. First part:
* take the squared residuals as a response variable,
* take the same explanatory variables as in the original model, plus all their possible interaction terms, plus their squared values,
* fit a linear model to these data.
%% Cell type:markdown id:7ce19023 tags:
## A
%% Cell type:code id:7dec1d91 tags:
``` python
```
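The auxiliary regression of the first part can be sketched as follows, on synthetic stand-in data (assumption). With two explanatory variables, the design contains the two variables, their interaction, and their squares:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

fit = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()

# Squared residuals as the response of the auxiliary regression
df['sq_resid'] = fit.resid ** 2
aux = smf.ols('sq_resid ~ HeartRate * logPhysicalActivity'
              ' + I(HeartRate ** 2) + I(logPhysicalActivity ** 2)', data=df).fit()
print(aux.summary())
```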
%% Cell type:markdown id:b588b5c1 tags:
## Q
Second part:
* get the coefficient of determination $R^2$,
* get the sample size $n$,
* set the number $k$ of degrees of freedom to the number of predictors in the auxiliary regression (intercept excluded).

The test is:

$$H_0:\; nR^2 \sim \chi^2_{k} \qquad H_A:\; nR^2 > \chi^2_{k,\,1-\alpha}$$

You do not necessarily need to compute the critical value; just note the test is one-sided.
Compute the statistic $nR^2$ and the resulting $p$-value.
Compute the statistic $nR^2$ and the resulting $p$-value.
%% Cell type:markdown id:5224316d tags:
## A
%% Cell type:code id:1bac9cde tags:
``` python
```
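The second part can be sketched as follows (self-contained, so the auxiliary regression of the first part is repeated; synthetic stand-in data is an assumption). The statistic $nR^2$ is compared to the upper tail of $\chi^2_k$:

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

fit = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()
df['sq_resid'] = fit.resid ** 2
aux = smf.ols('sq_resid ~ HeartRate * logPhysicalActivity'
              ' + I(HeartRate ** 2) + I(logPhysicalActivity ** 2)', data=df).fit()

n_obs = int(aux.nobs)            # sample size
r2 = aux.rsquared                # coefficient of determination
k = int(aux.df_model)            # number of predictors, intercept excluded
white_stat = n_obs * r2
pvalue = stats.chi2.sf(white_stat, k)  # one-sided: upper tail of chi2_k
print(f'White statistic nR^2 = {white_stat:.3f}, p = {pvalue:.4g} (k = {k})')
```

The result should closely match what `statsmodels`' `het_white` reports on the same model.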