Commit d22bfedc authored by François LAURENT

more sensible choice of sums of squares

parent 09b58ac6
%% Cell type:code id:4e16caf7 tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.stats import diagnostic
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.outliers_influence import OLSInfluence
```
%% Cell type:markdown id:358dce7a tags:
# Multi-way ANOVA
%% Cell type:markdown id:a1face9f tags:
## Q
Load the `titanic.csv` data file, insert the natural logarithm of `1+Fare` as a new column in the dataframe (*e.g.* with column name `'LogFare'`), and plot this new variable as a function of `Age`, `Pclass` and `Sex`.
%% Cell type:markdown id:965cc75c tags:
## A
%% Cell type:code id:7eb30cc9 tags:
``` python
```
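%% Cell type:markdown tags:
One possible sketch. Since `titanic.csv` is not bundled here, a small synthetic stand-in with the assumed columns (`Fare`, `Age`, `Pclass`, `Sex`) is generated so the snippet runs on its own; with the real file, replace the stand-in with `pd.read_csv('titanic.csv')`.
%% Cell type:code tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic stand-in for titanic.csv; in practice: df = pd.read_csv('titanic.csv')
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'Age': rng.uniform(1, 80, n),
    'Pclass': rng.choice([1, 2, 3], n),
    'Sex': rng.choice(['male', 'female'], n),
})
df['Fare'] = rng.lognormal(4 - df['Pclass'], 0.5)

# New column: natural logarithm of 1 + Fare
df['LogFare'] = np.log1p(df['Fare'])

# LogFare as a function of Age (scatter) and of the two factors (boxplots)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(df['Age'], df['LogFare'], s=8)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('LogFare')
df.boxplot(column='LogFare', by='Pclass', ax=axes[1])
df.boxplot(column='LogFare', by='Sex', ax=axes[2])
fig.tight_layout()
```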
%% Cell type:markdown id:154c1460 tags:
## Q
Fit a linear model to these data to explain our synthetic variable `LogFare` as a function of `Age`, `Pclass` and `Sex`.
Treat `Pclass` and `Sex` as factors.
Print an ANOVA table for different types of sums of squares.
%% Cell type:markdown id:3ad89dec tags:
## A
%% Cell type:code id:39f58924 tags:
``` python
```
%% Cell type:markdown id:ecf3bcd9 tags:
## Q
Because we have a large sample, we will ignore the non-normal residuals and play with post-hoc tests instead.
Split the ANOVA by the levels of `Pclass` and `Sex`, perform all pairwise comparisons where it makes sense, and correct for multiple comparisons.
First proceed considering type-3 sums of squares.
We are not interested in the significance of the slope of `Age` for the different levels of the factors.
%% Cell type:markdown id:00ace9f5 tags:
## A
%% Cell type:code id:09794922 tags:
``` python
```
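%% Cell type:markdown tags:
One possible approach, sketched on the synthetic stand-in from above: within each level of `Sex`, compare `LogFare` across the `Pclass` levels with Welch t-tests, then correct all the p-values at once with `multipletests`. This is only one of several defensible post-hoc strategies.
%% Cell type:code tags:
``` python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Synthetic stand-in for the titanic data (see above)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'Pclass': rng.choice([1, 2, 3], n),
    'Sex': rng.choice(['male', 'female'], n),
})
df['LogFare'] = np.log1p(rng.lognormal(4 - df['Pclass'], 0.5))

# All pairwise Pclass comparisons, within each Sex level
pvals, labels = [], []
for sex, sub in df.groupby('Sex'):
    for a, b in combinations(sorted(sub['Pclass'].unique()), 2):
        _, p = stats.ttest_ind(sub.loc[sub['Pclass'] == a, 'LogFare'],
                               sub.loc[sub['Pclass'] == b, 'LogFare'],
                               equal_var=False)
        pvals.append(p)
        labels.append(f'{sex}: class {a} vs {b}')

# Correct the whole family of comparisons at once
reject, p_corr, _, _ = multipletests(pvals, method='holm')
for lab, p, r in zip(labels, p_corr, reject):
    print(f'{lab}: corrected p = {p:.3g}, reject H0: {r}')
```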
%% Cell type:markdown id:d30cdb9c tags:
## Q
Let us suppose we want to use type-1 sums of squares instead.
Proceed again, fitting with `Sex` first, `Pclass` second, and `Age` last.
In the post-hoc comparisons, we will disregard the effect of the slope of `Age`.
%% Cell type:markdown id:7a417d76 tags:
## A
%% Cell type:code id:034acc94 tags:
``` python
```
%% Cell type:markdown id:b7b20012 tags:
# Linear model with multiple variables
%% Cell type:markdown id:144b0584 tags:
## Q
Load the `mi.csv` file and plot the variable `Temperature` versus `HeartRate` and versus `PhysicalActivity`.
We will try to «explain» `Temperature` from `HeartRate` and `PhysicalActivity`.
%% Cell type:markdown id:35949307 tags:
## A
%% Cell type:code id:47cf88a3 tags:
``` python
```
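%% Cell type:markdown tags:
One possible sketch. `mi.csv` is not bundled here, so a synthetic stand-in with the assumed columns (`Temperature`, `HeartRate`, `PhysicalActivity`) is generated; with the real file, replace it with `pd.read_csv('mi.csv')`.
%% Cell type:code tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic stand-in for mi.csv; in practice: df = pd.read_csv('mi.csv')
rng = np.random.default_rng(1)
n = 200
hr = rng.normal(70, 10, n)
pa = rng.exponential(100, n)  # long-tailed, like typical activity measurements
temp = 36.5 + 0.01 * hr + 0.1 * np.log1p(pa) + rng.normal(0, 0.2, n)
df = pd.DataFrame({'HeartRate': hr, 'PhysicalActivity': pa, 'Temperature': temp})

# Temperature versus each candidate explanatory variable
fig, axes = plt.subplots(1, 2, figsize=(9, 3.5), sharey=True)
axes[0].scatter(df['HeartRate'], df['Temperature'], s=8)
axes[0].set_xlabel('HeartRate')
axes[0].set_ylabel('Temperature')
axes[1].scatter(df['PhysicalActivity'], df['Temperature'], s=8)
axes[1].set_xlabel('PhysicalActivity')
fig.tight_layout()
```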
%% Cell type:markdown id:62449fb6 tags:
## Q
%% Cell type:markdown id:c71d2a4c tags:
The `PhysicalActivity` variable is very asymmetric. This is usually undesirable for an explanatory variable, because we cannot densely sample a large part of its domain of possible values, and therefore a model based on the data cannot be reliable.
We will transform `PhysicalActivity` using a simple natural logarithm. `log` is undefined at $0$ and tends to minus infinity near $0$, which makes applying it directly to `PhysicalActivity` inappropriate. Therefore we will add $1$ to the `PhysicalActivity` measurements before applying `log`.
Plot again the temperature versus the transformed `PhysicalActivity` variable and compare the skewness of the transformed versus raw variable.
%% Cell type:markdown id:21e8879e tags:
## A
%% Cell type:code id:0b8cac58 tags:
``` python
```
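%% Cell type:markdown tags:
A possible sketch on the synthetic stand-in from above: apply the $\log(1+x)$ transform, compare skewness before and after, and re-plot `Temperature` against the transformed variable.
%% Cell type:code tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic stand-in for mi.csv (see above)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({'PhysicalActivity': rng.exponential(100, n)})
df['Temperature'] = 36.5 + 0.1 * np.log1p(df['PhysicalActivity']) + rng.normal(0, 0.2, n)

# log(1 + x) transform; log1p is the numerically careful variant
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
print('skewness, raw:        ', df['PhysicalActivity'].skew())
print('skewness, transformed:', df['logPhysicalActivity'].skew())

plt.scatter(df['logPhysicalActivity'], df['Temperature'], s=8)
plt.xlabel('log(1 + PhysicalActivity)')
plt.ylabel('Temperature')
```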
%% Cell type:markdown id:2c6d5225 tags:
## Q
To appreciate the increased robustness of a linear model using the transformed variable compared with the raw one, design a simple univariate linear regression with `Temperature` as the response variable, and plot the Cook's distance of every observation with respect to this model:
* first with the raw `PhysicalActivity` as explanatory variable,
* second with the transformed `PhysicalActivity` as explanatory variable.
%% Cell type:markdown id:c47489e6 tags:
## A
%% Cell type:code id:bf32c7e6 tags:
``` python
```
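%% Cell type:markdown tags:
A sketch on the synthetic stand-in: fit the two univariate models and plot Cook's distances via `OLSInfluence`. With the long-tailed raw variable, a few extreme observations typically dominate.
%% Cell type:code tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import OLSInfluence

# Synthetic stand-in for mi.csv (see above)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({'PhysicalActivity': rng.exponential(100, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = 36.5 + 0.1 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n)

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5), sharey=True)
for ax, var in zip(axes, ['PhysicalActivity', 'logPhysicalActivity']):
    model = ols(f'Temperature ~ {var}', data=df).fit()
    cooks = OLSInfluence(model).cooks_distance[0]  # first element: the distances
    ax.stem(cooks)
    ax.set_title(var)
    ax.set_xlabel('observation')
    ax.set_ylabel("Cook's distance")
fig.tight_layout()
```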
%% Cell type:markdown id:1c7871a2 tags:
## Q
Make a linear model of `Temperature` as response and `HeartRate` and `PhysicalActivity` (or its transformed variant) as explanatory variables.
Make two such models, one with interaction and one without. How would you choose between the two models?
%% Cell type:markdown id:12d7e9a5 tags:
## A (with nested Q&A)
%% Cell type:code id:b891fa3e tags:
``` python
```
%% Cell type:code id:82e9f8d6 tags:
``` python
```
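%% Cell type:markdown tags:
One way to choose between the two models, sketched on the synthetic stand-in: since the additive model is nested in the interaction model, compare them with a likelihood-ratio test (`compare_lr_test`).
%% Cell type:code tags:
``` python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for mi.csv (see above)
rng = np.random.default_rng(1)
n = 200
hr = rng.normal(70, 10, n)
pa = rng.exponential(100, n)
df = pd.DataFrame({'HeartRate': hr, 'logPhysicalActivity': np.log1p(pa)})
df['Temperature'] = (36.5 + 0.01 * hr + 0.1 * df['logPhysicalActivity']
                     + rng.normal(0, 0.2, n))

m_add = ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()
m_int = ols('Temperature ~ HeartRate * logPhysicalActivity', data=df).fit()

# Likelihood-ratio test of the interaction term (models are nested)
lr_stat, lr_pval, df_diff = m_int.compare_lr_test(m_add)
print(f'log-likelihoods: {m_add.llf:.2f} (additive) vs {m_int.llf:.2f} (interaction)')
print(f'LR test: stat = {lr_stat:.3f}, p = {lr_pval:.3f}')
```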
%% Cell type:markdown id:66ddaa3e tags:
### Q
The interaction term is not significant, but most importantly, the increase in log-likelihood is very small; the interaction term does not help to better fit the model to the data.
To get a better intuition about the log-likelihood, plot it (with a dot plot) for different models, with one variable, with two variables, with and without interaction.
Feel free to introduce one or two extra explanatory variables such as `BMI`.
%% Cell type:markdown id:05844c56 tags:
### A
%% Cell type:code id:2588d8d6 tags:
``` python
```
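%% Cell type:markdown tags:
A sketch on the synthetic stand-in, with a hypothetical `BMI` column added as the optional extra variable: fit a series of nested and non-nested models and dot-plot their log-likelihoods.
%% Cell type:code tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.formula.api import ols

# Synthetic stand-in for mi.csv, with a hypothetical BMI column
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    'HeartRate': rng.normal(70, 10, n),
    'logPhysicalActivity': np.log1p(rng.exponential(100, n)),
    'BMI': rng.normal(25, 4, n),
})
df['Temperature'] = (36.5 + 0.01 * df['HeartRate']
                     + 0.1 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

formulas = [
    'Temperature ~ HeartRate',
    'Temperature ~ logPhysicalActivity',
    'Temperature ~ HeartRate + logPhysicalActivity',
    'Temperature ~ HeartRate * logPhysicalActivity',
    'Temperature ~ HeartRate + logPhysicalActivity + BMI',
]
llfs = [ols(f, data=df).fit().llf for f in formulas]

# Dot plot of the log-likelihood of each model
plt.plot(llfs, 'o')
plt.xticks(range(len(formulas)), formulas, rotation=30, ha='right')
plt.ylabel('log-likelihood')
plt.tight_layout()
```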
%% Cell type:markdown id:665a7a5c tags:
# White test for homoscedasticity
%% Cell type:markdown id:6d0bdd7b tags:
To keep things simple, let us use the `'HeartRate + PhysicalActivity'` or the `'HeartRate + logPhysicalActivity'` model.
## Q
Inspect the residuals by plotting them versus each explanatory variable.
%% Cell type:markdown id:a6f37b2e tags:
## A
%% Cell type:code id:77e350be tags:
``` python
```
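%% Cell type:markdown tags:
A minimal sketch on the synthetic stand-in: fit the two-variable model and scatter its residuals against each explanatory variable, looking for any trend or fanning-out of their spread.
%% Cell type:code tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.formula.api import ols

# Synthetic stand-in for mi.csv (see above)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    'HeartRate': rng.normal(70, 10, n),
    'logPhysicalActivity': np.log1p(rng.exponential(100, n)),
})
df['Temperature'] = (36.5 + 0.01 * df['HeartRate']
                     + 0.1 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

model = ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()

# Residuals versus each explanatory variable
fig, axes = plt.subplots(1, 2, figsize=(9, 3.5), sharey=True)
for ax, var in zip(axes, ['HeartRate', 'logPhysicalActivity']):
    ax.scatter(df[var], model.resid, s=8)
    ax.axhline(0, color='k', lw=0.5)
    ax.set_xlabel(var)
    ax.set_ylabel('residual')
fig.tight_layout()
```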
%% Cell type:markdown id:789bd3f7 tags:
## Q
We will further inspect the residuals for heteroscedasticity, using the [White test](https://itfeature.com/heteroscedasticity/white-test-for-heteroskedasticity).
`statsmodels` features an implementation of this test, but the [documentation](https://www.statsmodels.org/stable/generated/statsmodels.stats.diagnostic.het_white.html) is scarce on details.
Try to apply the `het_white` function; do not worry if you fail.
%% Cell type:markdown id:0a822074 tags:
## A
%% Cell type:code id:85d73982 tags:
``` python
```
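%% Cell type:markdown tags:
A sketch on the synthetic stand-in: `het_white` takes the residuals and the design matrix, constant column included; with a formula-based fit, the latter is available as `model.model.exog`.
%% Cell type:code tags:
``` python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats import diagnostic

# Synthetic stand-in for mi.csv (see above)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    'HeartRate': rng.normal(70, 10, n),
    'logPhysicalActivity': np.log1p(rng.exponential(100, n)),
})
df['Temperature'] = (36.5 + 0.01 * df['HeartRate']
                     + 0.1 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

model = ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()

# het_white expects the residuals and the design matrix (with the constant)
lm_stat, lm_pval, f_stat, f_pval = diagnostic.het_white(model.resid, model.model.exog)
print(f'LM statistic = {lm_stat:.3f}, p = {lm_pval:.3f}')
```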
%% Cell type:markdown id:e3ccb464 tags:
## Q
Instead, we will implement this test ourselves, as an application of polynomial regression.
First part:
* take the squared residuals as a response variable,
* take the same explanatory variables as in the original model, plus all their possible interaction terms, plus all their values squared,
* fit a linear model to these data.
%% Cell type:markdown id:7ce19023 tags:
## A
%% Cell type:code id:7dec1d91 tags:
``` python
```
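%% Cell type:markdown tags:
The steps above can be sketched as follows on the synthetic stand-in: regress the squared residuals on the original regressors, their interaction, and their squares (`HeartRate * logPhysicalActivity` expands to the main effects plus the interaction, and `I(... ** 2)` adds the squared terms).
%% Cell type:code tags:
``` python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for mi.csv (see above)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    'HeartRate': rng.normal(70, 10, n),
    'logPhysicalActivity': np.log1p(rng.exponential(100, n)),
})
df['Temperature'] = (36.5 + 0.01 * df['HeartRate']
                     + 0.1 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

model = ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()

# Auxiliary regression: squared residuals on the regressors,
# their interaction, and their squares (a degree-2 polynomial model)
aux = df.assign(sq_resid=model.resid ** 2)
aux_model = ols('sq_resid ~ HeartRate * logPhysicalActivity'
                ' + I(HeartRate ** 2) + I(logPhysicalActivity ** 2)',
                data=aux).fit()
print(aux_model.summary())
```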
%% Cell type:markdown id:b588b5c1 tags:
## Q
Second part:
* get the coefficient of determination $R^2$,
You do not necessarily need to compute the critical value. Just note the test is one-sided.
Compute the statistic $nR^2$ and the resulting $p$-value.
%% Cell type:markdown id:5224316d tags:
## A
%% Cell type:code id:1bac9cde tags:
``` python
```
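%% Cell type:markdown tags:
A sketch of the second part on the synthetic stand-in: the test statistic is $nR^2$ of the auxiliary regression, compared (one-sided, upper tail) with a $\chi^2$ distribution whose degrees of freedom equal the number of auxiliary regressors (excluding the intercept).
%% Cell type:code tags:
``` python
import numpy as np
import pandas as pd
from scipy.stats import chi2
from statsmodels.formula.api import ols

# Synthetic stand-in for mi.csv, and the auxiliary regression from the first part
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    'HeartRate': rng.normal(70, 10, n),
    'logPhysicalActivity': np.log1p(rng.exponential(100, n)),
})
df['Temperature'] = (36.5 + 0.01 * df['HeartRate']
                     + 0.1 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))
model = ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()
aux = df.assign(sq_resid=model.resid ** 2)
aux_model = ols('sq_resid ~ HeartRate * logPhysicalActivity'
                ' + I(HeartRate ** 2) + I(logPhysicalActivity ** 2)',
                data=aux).fit()

# White statistic: n R^2 of the auxiliary regression, one-sided chi-squared test
n_obs = len(aux)
white_stat = n_obs * aux_model.rsquared
dof = int(aux_model.df_model)  # 5 auxiliary regressors
pval = chi2.sf(white_stat, dof)
print(f'n R^2 = {white_stat:.3f}, df = {dof}, p-value = {pval:.4f}')
```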
Again, we can use [anova_lm](https://www.statsmodels.org/stable/generated/statsmodels.stats.anova.anova_lm.html) to print a condensed table:
 
%% Cell type:code id:cc06b3ed-a066-4de3-884c-a7c82729c359 tags:
 
``` python
anova_table = sm.stats.anova_lm(plant_model, typ=3) # typ specifies the type of sum of squares
print(anova_table)
```
 
%% Cell type:markdown id:309fa360-4a14-4f5e-bbe3-eea72e1e8dc4 tags:
 
%% Cell type:code id:60de13e5-b798-4312-8d06-0ef79f480760 tags:
 
``` python
model_with_interaction = ols('height ~ water * sun', data=plant_data).fit()
# remember `water * sun` is equivalent to `water + sun + water:sun`
print(sm.stats.anova_lm(model_with_interaction, typ=3))
```
 
%% Cell type:markdown id:bec6f02b tags:
 
Argument `typ` specifies the type of sum of squares. Type 2 is often used for ANOVA because it does not depend on the order of the factors.
dx = .2
w = model_with_interaction.params
 
y_low_daily = w['Intercept'] + w['sun[T.low]']
y_low_weekly = w['Intercept'] + w['sun[T.low]'] + w['water[T.weekly]'] + w['water[T.weekly]:sun[T.low]']
ax.plot([x[0]-dx, x[0]+dx], [y_low_daily, y_low_weekly], 'k-d', markerfacecolor='w')
 
y_med_daily = w['Intercept'] + w['sun[T.med]']
y_med_weekly = w['Intercept'] + w['sun[T.med]'] + w['water[T.weekly]'] + w['water[T.weekly]:sun[T.med]']
ax.plot([x[1]-dx, x[1]+dx], [y_med_daily, y_med_weekly], 'k-d', markerfacecolor='w')
 
 
``` python
daily_water_model = ols('height ~ sun', data=plant_data[plant_data['water']=='daily']).fit()
weekly_water_model = ols('height ~ sun', data=plant_data[plant_data['water']=='weekly']).fit()
low_sun_model = ols('height ~ water', data=plant_data[plant_data['sun']=='low']).fit()
med_sun_model = ols('height ~ water', data=plant_data[plant_data['sun']=='med']).fit()
high_sun_model = ols('height ~ water', data=plant_data[plant_data['sun']=='high']).fit()
```
 
%% Cell type:markdown id:a39e6bd5 tags:
 
This can be done with a procedure called *correction for multiple comparisons*.
 
%% Cell type:code id:e6c92946 tags:
 
``` python
sm.stats.anova_lm(model_with_interaction, typ=3)
```
 
%%%% Output: execute_result
 
sum_sq df F PR(>F)
Intercept 246.402000 1.0 302.891211 4.094979e-15
water 2.401000 1.0 2.951444 9.867817e-02
sun 8.041333 2.0 4.942430 1.593920e-02
water:sun 5.694000 2.0 3.499693 4.637649e-02
Residual 19.524000 24.0 NaN NaN
 
%% Cell type:markdown id:48d0f081 tags:
 
### multipletests
 