Commit 09b58ac6 authored by François LAURENT

StatsModels TP

parent efed6c55
......@@ -81,22 +81,23 @@
%% Cell type:markdown id:5162d7b2-625a-4699-922e-92c5c2bfa769 tags:
 
We will merely review statistical tests:
 
* Student $t$ tests
    * compare a sample mean against the population mean
    * compare the means of two independent samples
    * compare the means of paired samples
* analyses of variance (one-way)
    * compare more than two group means
* tests for other tests' assumptions
    * normality tests
    * homoscedasticity tests
* $\chi^2$ tests for discrete variables
    * goodness-of-fit test
    * homogeneity and independence tests
* correlation coefficient and linear regression
* effect sizes and test power
 
%% Cell type:markdown id:f1be90cf tags:
 
## What Python can do -- What Python cannot
 
......@@ -113,11 +114,13 @@
| **sample** | a finite set of selected individuals<br />assumed to be representative of a population |
 
Always good to get a reminder about [general considerations](https://www.coursera.org/learn/stanford-statistics/home/welcome), __prior to data collection__ and analysis.
 
* Sampling from the population,
* identifying the sources of variability,
* checking that the assumptions of a test are met,
* etc.
 
%% Cell type:markdown id:23c92344 tags:
 
### Workflow
 
......@@ -125,11 +128,13 @@
 
To determine whether there is *sufficient evidence* to conclude the treatment has an effect, we use *statistical tests*.
 
However, because experimental designs are often complex and involve multiple treatments and additional sources of variability, most studies also involve multiple tests, which are usually carried out after a so-called *omnibus* test.
 
In addition, every statistical test makes various assumptions that in turn need to be checked.
As a consequence, reaching a conclusion about the data usually involves a series of tests and procedures.
 
<table style="text-align:left;"><tr><th>
Example workflow adapted from...?
</th></tr><tr><td>
<img src="img/example_anova_workflow.png" width="70%" />
......@@ -189,18 +194,10 @@
posthoc [label="Post-hoc tests"];
 
}
]]>
 
%% Cell type:markdown id:e4db04ef tags:
### Replicability
%% Cell type:markdown id:05a435dd tags:
Not covered: tools and resources to support code and data management practices, some of which are provided by the Python ecosystem.
%% Cell type:markdown id:d6b0c2ee-4f6b-49db-8c26-dd64aede72fb tags:
 
## Data exploration
 
%% Cell type:markdown id:f911c720-5dbd-4e00-bb2b-329c4a3cd48e tags:
......@@ -510,10 +507,24 @@
 
``` python
print(f'{sample_mean:.2f} ± {1.96 * sem:.2f} years old on average')
```
 
%% Cell type:markdown id:ddef9204 tags:
`scipy` actually offers a more straightforward way of computing confidence intervals:
%% Cell type:code id:a7d15497 tags:
``` python
stats.norm(sample_mean, sem).interval(1 - alpha)
```
%%%% Output: execute_result
(45.535126163333835, 47.43629540529361)
%% Cell type:markdown id:52848e98 tags:
 
### Outliers
 
%% Cell type:markdown id:e1a2ea78-087a-4cd1-bdf1-ba67db7067de tags:
......
%% Cell type:code id:d5065981 tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
from patsy import dmatrices
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import diagnostic
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.outliers_influence import OLSInfluence
```
%% Cell type:markdown id:4ddd8902 tags:
# Multi-way ANOVA
%% Cell type:markdown id:9c4680b9 tags:
## Q
Load the `titanic.csv` data file, insert the natural logarithm of `1+Fare` as a new column in the dataframe (*e.g.* with column name `'LogFare'`), and plot this new variable as a function of `Age`, `Pclass` and `Sex`.
%% Cell type:markdown id:d2d4fa47 tags:
## A
%% Cell type:code id:ae476b31 tags:
``` python
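# A possible sketch; assumes titanic.csv sits in the working directory
# and carries the usual Fare, Age, Pclass and Sex columns.
import numpy as np
import pandas as pd
import seaborn as sns

titanic = pd.read_csv('titanic.csv')
# np.log1p computes log(1 + x), compressing the long right tail of the fares
titanic['LogFare'] = np.log1p(titanic['Fare'])

# one panel per Pclass, one colour per Sex, LogFare as a function of Age
sns.lmplot(data=titanic, x='Age', y='LogFare', col='Pclass', hue='Sex')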
```
%% Cell type:markdown id:b32ced14 tags:
## Q
Fit a linear model to these data to explain our synthetic variable `LogFare` as a function of `Age`, `Pclass` and `Sex`.
Treat `Pclass` and `Sex` as factors.
Print an ANOVA table.
%% Cell type:markdown id:24492fad tags:
## A
%% Cell type:code id:0ad3c464 tags:
``` python
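# A possible sketch; C(...) declares Pclass and Sex as factors.
import statsmodels.api as sm
import statsmodels.formula.api as smf

lm = smf.ols('LogFare ~ Age + C(Pclass) + C(Sex)', data=titanic).fit()
# type-2 ANOVA table: one F test per term
print(sm.stats.anova_lm(lm, typ=2))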
```
%% Cell type:markdown id:ded672e8 tags:
## Q
Let us ignore the non-normal residuals and play with post-hoc tests instead.
Split the ANOVA for the levels of `Pclass` and `Sex`, perform all pairwise comparisons where it makes sense, and correct for multiple comparisons.
We are not interested in the significance of the slope of `Age` for the different levels of the factors.
%% Cell type:markdown id:00ace9f5 tags:
## A
%% Cell type:code id:9f6645bb tags:
``` python
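# One possible approach (a sketch, not the only valid answer):
# Welch t tests on LogFare between Pclass levels within each Sex,
# then a Holm correction over the six resulting p-values.
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

pvalues, labels = [], []
for sex, subset in titanic.groupby('Sex'):
    for c1, c2 in combinations(sorted(subset['Pclass'].unique()), 2):
        g1 = subset.loc[subset['Pclass'] == c1, 'LogFare']
        g2 = subset.loc[subset['Pclass'] == c2, 'LogFare']
        _, p = stats.ttest_ind(g1, g2, equal_var=False)
        pvalues.append(p)
        labels.append(f'{sex}: class {c1} vs {c2}')

reject, corrected, _, _ = multipletests(pvalues, method='holm')
for label, p, r in zip(labels, corrected, reject):
    print(f"{label}: corrected p = {p:.3g}{' *' if r else ''}")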
```
%% Cell type:markdown id:9a863196 tags:
# Linear model with multiple variables
%% Cell type:markdown id:ef97cb7f tags:
## Q
Load the `mi.csv` file and plot the variables `Temperature`, `HeartRate` and `PhysicalActivity`.
We will try to "explain" `Temperature` from `HeartRate` and `PhysicalActivity`.
%% Cell type:markdown id:c66b1c3b tags:
## A
%% Cell type:code id:4bdeea0c tags:
``` python
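# A possible sketch; assumes mi.csv sits in the working directory.
import pandas as pd
import seaborn as sns

mi = pd.read_csv('mi.csv')
# pairwise scatter plots (and marginal distributions) of the three variables
sns.pairplot(mi[['Temperature', 'HeartRate', 'PhysicalActivity']])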
```
%% Cell type:markdown id:358d3903 tags:
## Q
%% Cell type:markdown id:f3003272 tags:
The `PhysicalActivity` variable exhibits a long-tailed distribution. This is usually undesirable for an explanatory variable, because we cannot densely sample a large part of its domain of possible values, and therefore a model based on the data cannot be reliable there.
We will transform `PhysicalActivity` using a simple natural logarithm. However, `log` is undefined at $0$ and tends to $-\infty$ near $0$, which makes its direct application to `PhysicalActivity` inappropriate. Therefore we will also add $1$ to the `PhysicalActivity` measurements prior to applying `log`.
Plot the temperature again versus the transformed `PhysicalActivity` variable and compare the skewness of the transformed versus the raw variable.
%% Cell type:markdown id:34dff28d tags:
## A
%% Cell type:code id:16e7787e tags:
``` python
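# A minimal sketch; log1p(x) = log(1 + x) is defined at 0.
import numpy as np
import seaborn as sns

mi['logPhysicalActivity'] = np.log1p(mi['PhysicalActivity'])

sns.scatterplot(data=mi, x='logPhysicalActivity', y='Temperature')

# the skewness of the transformed variable should be much closer to 0
print(mi[['PhysicalActivity', 'logPhysicalActivity']].skew())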
```
%% Cell type:markdown id:7887928e tags:
## Q
To appreciate the increased robustness of a linear model using the transformed variable compared with the raw variable, fit a simple univariate linear regression with `Temperature` as the response variable, and plot the Cook's distance of all the observations with respect to this model:
* first with the raw `PhysicalActivity` as explanatory variable,
* second with the transformed `PhysicalActivity` as explanatory variable.
%% Cell type:markdown id:b9ed6879 tags:
## A
%% Cell type:code id:1044c8c6 tags:
``` python
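# A possible sketch: one univariate model per variant of the variable,
# with the Cook's distances drawn as a stem plot.
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, var in zip(axes, ['PhysicalActivity', 'logPhysicalActivity']):
    fit = smf.ols(f'Temperature ~ {var}', data=mi).fit()
    # cooks_distance is a (distances, p-values) pair; keep the distances
    cooks = fit.get_influence().cooks_distance[0]
    ax.stem(cooks)
    ax.set(title=var, xlabel='observation', ylabel="Cook's distance")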
```
%% Cell type:markdown id:49408adc tags:
## Q
Build a linear model with `Temperature` as the response and `HeartRate` and `PhysicalActivity` (or its transformed variant) as explanatory variables.
Build two such models, one with interaction and one without. How would you choose between the two models?
%% Cell type:markdown id:1b977a53 tags:
## A (with nested Q&A)
%% Cell type:code id:b891fa3e tags:
``` python
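# A possible sketch, using the transformed variable; in a formula,
# '*' expands to the main effects plus the interaction term.
import statsmodels.formula.api as smf

additive = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=mi).fit()
interaction = smf.ols('Temperature ~ HeartRate * logPhysicalActivity', data=mi).fit()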
```
%% Cell type:code id:c469b3d5 tags:
``` python
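# The two models are nested, so a likelihood-ratio test applies;
# information criteria such as the AIC are an alternative.
lr_stat, p_value, df_diff = interaction.compare_lr_test(additive)
print(f'LR = {lr_stat:.3f}, p = {p_value:.3g}')
print(f'AIC: additive {additive.aic:.1f}, interaction {interaction.aic:.1f}')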
```
%% Cell type:markdown id:475eb53d tags:
### Q
To get a better intuition about the log-likelihood, plot it (with a dot plot) for different models: with one variable, with two variables, with and without interaction.
Feel free to introduce one or two extra explanatory variables such as `BMI`.
%% Cell type:markdown id:f7a51e03 tags:
### A
%% Cell type:code id:3a4295ad tags:
``` python
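# A possible sketch; .llf is the maximised log-likelihood of a fit.
# Extra variables such as BMI can be appended to the dictionary the same way.
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

formulas = {
    'HeartRate': 'Temperature ~ HeartRate',
    'logPA': 'Temperature ~ logPhysicalActivity',
    'HeartRate + logPA': 'Temperature ~ HeartRate + logPhysicalActivity',
    'HeartRate * logPA': 'Temperature ~ HeartRate * logPhysicalActivity',
}
loglik = {name: smf.ols(f, data=mi).fit().llf for name, f in formulas.items()}

# dot plot: one point per model; higher is better, but extra parameters
# mechanically increase the log-likelihood
plt.plot(list(loglik.values()), list(loglik.keys()), 'o')
plt.xlabel('log-likelihood')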
```
%% Cell type:markdown id:2474a218 tags:
# White test for homoscedasticity
%% Cell type:markdown id:1bf04c2f tags:
To keep things simple, let us use the `'HeartRate + PhysicalActivity'` or `'HeartRate + logPhysicalActivity'` model.
## Q
Inspect the residuals by plotting them against each explanatory variable.
%% Cell type:markdown id:77c0d822 tags:
## A
%% Cell type:code id:4d7f9d1f tags:
``` python
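# A possible sketch, assuming no missing values so that the residuals
# align with the rows of the dataframe.
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

model = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=mi).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, var in zip(axes, ['HeartRate', 'logPhysicalActivity']):
    ax.scatter(mi[var], model.resid, s=10)
    ax.axhline(0, color='grey', linewidth=1)
    ax.set(xlabel=var, ylabel='residual')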
```
%% Cell type:markdown id:20fdd05b tags:
## Q
We will further inspect the residuals for heteroscedasticity, using the [White test](https://itfeature.com/heteroscedasticity/white-test-for-heteroskedasticity).
`statsmodels` features an implementation of this test, but the [documentation](https://www.statsmodels.org/stable/generated/statsmodels.stats.diagnostic.het_white.html) is sparse on details.
Try to apply the `het_white` function, but do not feel ashamed if you fail.
%% Cell type:markdown id:34e7a050 tags:
## A
%% Cell type:code id:5de56d54 tags:
``` python
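# A possible call: het_white takes the residuals and the design matrix
# (with the constant term included, as the formula interface provides).
from statsmodels.stats import diagnostic

lm_stat, lm_pvalue, f_stat, f_pvalue = diagnostic.het_white(
    model.resid, model.model.exog)
print(f'LM = {lm_stat:.3f}, p = {lm_pvalue:.3g}')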
```
%% Cell type:markdown id:98ca812e tags:
## Q
Instead, we will implement this test ourselves, as an application of polynomial regression.
The algorithm is simple. First part:
* take the squared residuals as the response variable,
* take the same explanatory variables as in the original model, plus all their possible interaction terms, plus all their squared values,
* fit a linear model to these data.
%% Cell type:markdown id:3ebf1176 tags:
## A
%% Cell type:code id:46c53e4e tags:
``` python
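# A possible sketch: regress the squared residuals on the original
# regressors, their interaction, and their squares (I(...) protects
# the arithmetic inside a formula).
import statsmodels.formula.api as smf

aux = mi.copy()
aux['sq_resid'] = model.resid ** 2

aux_fit = smf.ols(
    'sq_resid ~ HeartRate + logPhysicalActivity'
    ' + HeartRate:logPhysicalActivity'
    ' + I(HeartRate ** 2) + I(logPhysicalActivity ** 2)',
    data=aux).fit()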
```
%% Cell type:markdown id:46bd2a33 tags:
## Q
Second part:
* get the coefficient of determination $R^2$,
* get the sample size $n$,
* set the number $k$ of degrees of freedom to the number of predictors (intercept excluded).

The test is:
$$
H_0: nR^2 \sim \chi_{k}^2
$$
$$
H_A: nR^2 > \mathrm{CriticalValue}(\chi_{k}^2, 1-\alpha)
$$

You do not necessarily need to compute the critical value. Just note that the test is one-sided.
Compute the statistic $nR^2$ and the resulting $p$-value.
%% Cell type:markdown id:374e25eb tags:
## A
%% Cell type:code id:08216a83 tags:
``` python
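# A possible sketch; df_model counts the predictors, intercept excluded.
from scipy import stats

n = int(aux_fit.nobs)
k = int(aux_fit.df_model)
lm = n * aux_fit.rsquared

# one-sided: survival function of the chi-squared distribution with k dof
p_value = stats.chi2(k).sf(lm)
print(f'n R^2 = {lm:.3f}, p = {p_value:.3g}')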
```