Commit 1dbef0a4 authored by François LAURENT

first day-3 exercise

parent a460c148
%% Cell type:code id:8ccafd3d tags:
``` python
import sys
!"{sys.executable}" -m pip install scipy
```
%%%% Output: stream
Requirement already satisfied: scipy in /home/flaurent/.local/lib/python3.8/site-packages (1.7.1)
Requirement already satisfied: numpy<1.23.0,>=1.16.5 in /home/flaurent/.local/lib/python3.8/site-packages (from scipy) (1.21.1)
%% Cell type:markdown id:cf7e7b4c-6fbe-4b1c-81b7-6d5a851c2835 tags:
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script async src="https://cdn.jsdelivr.net/npm/mathjax@3.0.1/es5/tex-mml-chtml.js"></script>
%% Cell type:markdown id:6d2c6ebd tags:
<h1 align='center'>Statistical tests with the SciPy library</h1>
<div style='text-align:center'><img src='https://docs.scipy.org/doc/scipy/reference/_static/scipyshiny_small.png' /></div>
[SciPy](https://docs.scipy.org/doc/scipy/reference/) is a collection of mathematical tools targeting diverse fields, with functionality split across several modules:
%% Cell type:code id:42342d74 tags:
``` python
from scipy import (
cluster, # Clustering algorithms
constants, # Physical and mathematical constants
fftpack, # Fast Fourier Transform routines
integrate, # Integration and ordinary differential equation solvers
interpolate, # Interpolation and smoothing splines
io, # Input and Output
linalg, # Linear algebra
ndimage, # N-dimensional image processing
odr, # Orthogonal distance regression
optimize, # Optimization and root-finding routines
signal, # Signal processing
sparse, # Sparse matrices and associated routines
spatial, # Spatial data structures and algorithms
special, # Special functions
stats, # Statistical distributions and functions
)
```
%% Cell type:markdown id:1091dfd5 tags:
A reminder about module loading:
Example: how can we access the `ttest_ind` function defined in the `scipy.stats` module?
%% Cell type:code id:898957c8 tags:
``` python
%%script echo skipping
import scipy.stats
scipy.stats.ttest_ind
from scipy import stats
stats.ttest_ind
from scipy.stats import *
ttest_ind
```
%%%% Output: stream
skipping
%% Cell type:markdown id:9e2fe344 tags:
`scipy.stats` content (see the [official documentation](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#module-scipy.stats)):
* [Probability distributions](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#probability-distributions)
* [Summary statistics](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#summary-statistics)
* [Frequency statistics](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#frequency-statistics)
* [Correlation functions](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#correlation-functions)
* [Statistical tests](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#statistical-tests)
* ...
`scipy.stats` offers basic functionality only; we will occasionally mention the `statsmodels` and `pingouin` libraries when we hit `scipy.stats`' limitations.
%% Cell type:markdown id:0d715671-e4ad-4d4d-b60f-9397e2b80b90 tags:
## Overview
%% Cell type:markdown id:5162d7b2-625a-4699-922e-92c5c2bfa769 tags:
We will briefly review the following statistical tests:
* Student $t$ tests
* compare a sample against the population mean
* compare two independent samples
* compare paired samples
* analyses of variance (one-way)
* compare more than two groups
* tests for other tests' assumptions
* normality tests
* homoscedasticity tests
* $\chi^2$ tests for discrete variables
* goodness-of-fit test
* homogeneity and independence tests
* correlation coefficients
%% Cell type:markdown id:f1be90cf tags:
## What Python can do -- What Python cannot
%% Cell type:markdown id:01c6c854 tags:
### Experimental design
Some vocabulary:
| term | description |
| --: | :-- |
| **treatment** | every procedure or experimental condition<br />that departs from a control condition<br />(and the control itself is a special-case treatment) |
| **population** | a set of individuals we want our conclusions to apply to |
| **sample** | a finite set of selected individuals<br />assumed to be representative of a population |
Always good to get a reminder about [general considerations](https://www.coursera.org/learn/stanford-statistics/home/welcome), __prior to data collection__ and analysis.
* Sampling from the population,
* identifying the sources of variability, etc.
%% Cell type:markdown id:23c92344 tags:
### Workflow
%% Cell type:markdown id:35bcd5a9-fd6f-4914-96a7-734d240efd8e tags:
To determine whether there is *sufficient evidence* to conclude the treatment has an effect, we use *statistical tests*.
However, because experimental designs are often complex and involve multiple treatments and additional sources of variability, most studies also involve multiple tests, usually carried out after a so-called *omnibus* test.
In addition, every statistical test makes various assumptions that in turn need to be checked. As a consequence, every statistical analysis involves a series of tests and procedures.
<table style="text-align:left;"><tr><th>
Example workflow adapted from...?
</th></tr><tr><td>
<img src="img/example_anova_workflow.png" width="70%" />
</td></tr></table>
Contrary to statistical software with a GUI, the tools featured in programming languages such as Python do not offer much guidance in following such a workflow.
<![CDATA[
# GraphViz source for https://sketchviz.com
digraph G {
graph [fontname = "Handlee"];
node [fontname = "freesans"];
edge [fontname = "Handlee"];
bgcolor=transparent;
subgraph cluster0 {
color=none;
n0 -> normality [label="yes", constraint=false];
}
subgraph cluster1 {
style=filled;
color=lightgrey;
node [style=filled,color=pink];
anova; robust_f; nonparam;
label = "*Omnibus test *";
fontsize = 20;
}
n0 -> nonparam [label="no"];
normality -> H0_0;
H0_0 -> homoscedasticity [label="not rejected"];
H0_0 -> transform [label="rejected"];
transform -> homoscedasticity [label="success"];
transform -> nonparam [label="failure"];
homoscedasticity -> H0_1;
H0_1 -> anova [label="not rejected"];
H0_1 -> robust_f [label="rejected"];
H0_2 -> posthoc [label="rejected"];
anova -> H0_2;
robust_f -> H0_2;
nonparam -> H0_2;
n0 [shape=diamond, label=<n<SUB>i</SUB>&ge;20>];
normality [label="Normality test"];
nonparam [label="Non-parametric\nomnibus test;\nK-W, Friedman"];
H0_0 [shape=diamond, label=<H<SUB>0</SUB>>];
homoscedasticity [label="Homoscedasticity test"];
transform [label="Data transform"];
H0_1 [shape=diamond, label=<H<SUB>0</SUB>>];
anova [label="Standard ANOVA"];
robust_f [label="Robust F test;\nWelch, Yuen,\nAlexander-Govern"];
H0_2 [shape=diamond, label=<H<SUB>0</SUB>>];
posthoc [label="Post-hoc tests"];
}
]]>
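%% Cell type:markdown tags:
Such a decision chain can also be scripted by hand. Below is a rough sketch only (the function name, the choice of Shapiro-Wilk and Levene for the assumption checks, and the reuse of $\alpha$ as threshold for these checks are arbitrary illustration choices, not a recommendation):
%% Cell type:code tags:
``` python
from scipy import stats

def choose_omnibus(groups, alpha=0.05):
    """Pick an omnibus test from basic assumption checks (sketch only)."""
    # normality of each group (Shapiro-Wilk)
    if all(stats.shapiro(g).pvalue >= alpha for g in groups):
        # homoscedasticity across groups (Levene)
        if stats.levene(*groups).pvalue >= alpha:
            return 'one-way ANOVA', stats.f_oneway(*groups)
    # otherwise fall back to a non-parametric omnibus test
    return 'Kruskal-Wallis', stats.kruskal(*groups)
```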
%% Cell type:markdown id:e4db04ef tags:
### Replicability
%% Cell type:markdown id:05a435dd tags:
Not covered: tools and resources to support code and data management practices, some of which are provided by the Python ecosystem.
%% Cell type:markdown id:d6b0c2ee-4f6b-49db-8c26-dd64aede72fb tags:
## Data exploration
%% Cell type:markdown id:f911c720-5dbd-4e00-bb2b-329c4a3cd48e tags:
Each measurement or variable should be checked for undesirable properties, among others:
* completeness (undefined values?)
* extreme values (outliers?)
* modes (multiple?)
Let us import a few more libraries and load some example data:
%% Cell type:code id:65104f3d tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
pd.options.display.max_columns = None
dataframe = pd.read_csv('../data/mi.csv', index_col=0)
dataframe
```
%%%% Output: execute_result
Age OwnsHouse PhysicalActivity Sex LivesWithPartner LivesWithKids \
1 22.33 Yes 3.0 Female No No
2 28.83 Yes 0.0 Female Yes No
3 23.67 Yes 0.0 Female Yes No
4 21.17 No 0.5 Female No No
5 26.17 Yes 1.5 Female No No
.. ... ... ... ... ... ...
812 21.42 Yes 6.0 Male No No
813 38.33 No 1.5 Female Yes Yes
814 32.75 Yes 2.5 Female No No
815 39.17 No 4.0 Male No Yes
816 59.00 No 0.0 Male Yes Yes
BornInCity Inbreeding BMI CMVPositiveSerology FluIgG \
1 Yes 94.9627 20.13 No 0.464319
2 Yes 79.1024 21.33 Yes -0.049817
3 Yes 117.2540 22.18 No 0.332944
4 No 94.1796 18.68 No 0.404886
5 Yes 105.1250 29.01 No -0.303782
.. ... ... ... ... ...
812 Yes 73.0073 24.75 No 0.210671
813 No 79.7187 23.53 No 0.354790
814 No 86.6744 20.07 No 0.252037
815 Yes 98.9744 22.77 Yes 0.264940
816 Yes 84.6433 27.39 Yes 0.281522
MetabolicScore LowAppetite TroubleConcentrating TroubleSleeping \
1 0 0 0 1.0
2 1 0 0 1.0
3 2 0 0 1.0
4 0 0 0 2.0
5 1 0 0 1.0
.. ... ... ... ...
812 0 0 0 3.0
813 1 0 0 2.0
814 0 0 0 1.0
815 0 0 0 2.0
816 1 0 0 0.0
HoursOfSleep Listless UsesCannabis RecentPersonalCrisis Smoking \
1 9.00 3 No No Never
2 7.05 3 No No Active
3 6.50 3 Yes No Active
4 10.00 3 No No Never
5 9.00 0 No No Never
.. ... ... ... ... ...
812 7.50 0 No No Never
813 9.00 0 No Yes Ex
814 8.00 3 No Yes Never
815 7.50 0 No No Never
816 8.00 0 No No Never
Employed Education DustExposure Income HadMeasles HadRubella \
1 No PhD No (1000-2000] No No
2 Yes Baccalaureat No (2000-3000] No No
3 Yes Baccalaureat Current (2000-3000] No No
4 No PhD No (3000-inf] No No
5 Yes Baccalaureat No [0-1000] No No
.. ... ... ... ... ... ...
812 Yes PhD Current (3000-inf] No No
813 Yes Baccalaureat No (3000-inf] Yes Yes
814 No Baccalaureat No (1000-2000] No No
815 Yes PhD No (2000-3000] No No
816 No Baccalaureat No (2000-3000] No No
HadChickenPox HadMumps HadTonsillectomy HadAppendicectomy VaccineHepA \
1 Yes No No No No
2 Yes No No No No
3 Yes No No No No
4 Yes No No No No
5 No No No No No
.. ... ... ... ... ...
812 Yes No No No No
813 Yes No No No No
814 No Yes No No No
815 No Yes No No No
816 No No Yes No No
VaccineMMR VaccineTyphoid VaccineWhoopingCough VaccineYellowFever \
1 No No Yes No
2 No No Yes No
3 No No No No
4 No No No No
5 No No Yes No
.. ... ... ... ...
812 No No No No
813 No No No No
814 No No No No
815 Yes No No No
816 No No No No
VaccineHepB VaccineFlu SUBJID DepressionScore HeartRate Temperature \
1 Yes No 2 0.0 66 36.8
2 Yes No 3 0.0 66 37.4
3 Yes No 4 0.0 62 36.9
4 Yes No 5 1.0 64 36.0
5 Yes No 8 0.0 67 36.7
.. ... ... ... ... ... ...
812 Yes No 5167 0.0 44 36.6
813 Yes No 5219 0.0 54 36.7
814 Yes No 5279 3.0 62 36.3
815 No No 5303 0.0 39 36.3
816 No No 5701 0.0 61 36.1
HourOfSampling DayOfSampling
1 8.883 40
2 9.350 40
3 8.667 40
4 9.883 40
5 8.550 81
.. ... ...
812 8.550 270
813 9.217 229
814 9.800 278
815 9.583 277
816 9.317 241
[816 rows x 43 columns]
%% Cell type:markdown id:ae8b5f34 tags:
<div class="alert alert-block alert-success">
The most important datum in a dataset is the sample size $n$. Here $n=816$.
</div>
%% Cell type:markdown id:52848e98 tags:
### Outliers
%% Cell type:markdown id:e1a2ea78-087a-4cd1-bdf1-ba67db7067de tags:
For example, undefined values are sometimes encoded as an otherwise-meaningless value:
%% Cell type:code id:69922361-f828-44ad-bb36-402f355dd0cb tags:
``` python
X = 'Age'
x = dataframe[X]
x_noisy = x.copy()
x_noisy.name = f'{X} [corrupted]'
x_noisy.iloc[[
29, 30, 73, 76, 82, 97, 112, 117, 136, 166, 196,
209, 253, 260, 266, 339, 345, 355, 414, 446, 453,
469, 486, 502, 505, 517, 579, 592, 614, 618, 623,
630, 664, 669, 683, 718, 786, 797, 804, 810,
]] = 0
sns.histplot(x=x_noisy)
sns.rugplot(x=x_noisy, color='r');
```
%%%% Output: display_data
%% Cell type:code id:ac5a7020-b903-4196-8db6-31b90070d22b tags:
``` python
Y = 'PhysicalActivity'
y = dataframe[Y].copy()
y[:] = np.log(1+y)
y.name = f'log( {Y} )'
_, axes = plt.subplots(1, 2, figsize=(13.3,4.1))
ax = axes[0]
sns.regplot(x=x_noisy, y=y, line_kws=dict(color='r'), ax=ax)
ax = axes[1]
sns.regplot(x=x, y=y, line_kws=dict(color='r'), ax=ax);
```
%%%% Output: display_data
%% Cell type:markdown id:b4f4e471-ec89-46ef-8984-fbf7b5afe84f tags:
Outliers can introduce large biases and strongly impact the outcome of the analyses.
%% Cell type:markdown id:91a55884-2ca8-4cb0-9190-6c42bea7047b tags:
### Modes
%% Cell type:markdown id:4c483874-e9db-440a-b1c8-7dd47a06008a tags:
In many cases, we want the values of a variable of interest to exhibit a single *mode*.
Otherwise, estimates of central tendency (mean or median) may not be representative of the overall sample.
%% Cell type:code id:578edd4a-bcb1-4f9e-a8ba-d3e63bbabb71 tags:
``` python
X = 'HoursOfSleep'
Y = 'Age'
y = dataframe[Y][dataframe[X] > 8]
sns.histplot(y, bins=20)
plt.axvline(y.mean(), linestyle=':', color='red');
plt.axvline(y.median(), linestyle=':', color='green');
```
%%%% Output: display_data
%% Cell type:markdown id:c101eeb7-5580-4a75-889d-0fd2e965c7d6 tags:
### Another example
%% Cell type:code id:72a790bd tags:
``` python
X = 'HoursOfSleep' # variable
x = dataframe[X] # measurements
```
%% Cell type:code id:d6ab1d43 tags:
``` python
sns.histplot(x=x, bins=10)
sns.rugplot(x=x, color='r');
```
%%%% Output: display_data
%% Cell type:markdown id:994df34d tags:
#### Fitting a normal distribution
If $X$ follows a normal distribution:
$X \sim \mathcal{N}(\mu, \sigma^2)$
then we can use the **empirical mean** and **unbiased variance** as estimators for $\mu$ and $\sigma^2$ respectively.
`scipy.stats` exposes the `norm` function to define a normal distribution:
%% Cell type:code id:92b00d5d tags:
``` python
# note: stats.norm expects the standard deviation (scale), not the variance
normal_distribution = stats.norm(x.mean(), x.std())
```
%% Cell type:markdown id:5eefc50c-44fa-403b-92c6-16aec1204d26 tags:
If unsure about how to fit the distribution parameters to the data, you can alternatively use `fit`:
%% Cell type:code id:4f114ad2-9cd2-431a-8b3e-2283006ef970 tags:
``` python
normal_distribution = stats.norm(*stats.norm.fit(x))
```
%% Cell type:markdown id:cef7bb4e-38b9-4abc-97e4-419a9650173b tags:
The object returned by `norm` features several distribution-related functions, *e.g.* `pdf`:
%% Cell type:code id:8e114e17-8edb-4674-9309-c6f566885c0a tags:
``` python
sns.histplot(x=x, bins=10, stat='density')
sns.rugplot(x=x, color='r')
grid = np.linspace(x.min(), x.max(), 100)
plt.plot(grid, normal_distribution.pdf(grid), 'g-');
```
%%%% Output: display_data
%% Cell type:code id:6eaa04b5 tags:
``` python
sns.histplot(x=x, bins=10, stat='density')
sns.rugplot(x=x, color='r')
sns.kdeplot(x, clip=(x.min(), x.max()), color='g', linestyle='-');
```
%%%% Output: display_data
%% Cell type:code id:ddfe2b2f tags:
``` python
sns.histplot(x=x, stat='density')
sns.rugplot(x=x, color='r')
sns.kdeplot(x, clip=(x.min(), x.max()), color='g', linestyle='-');
```
%%%% Output: display_data
%% Cell type:markdown id:cb8febd8 tags:
We can check whether `HoursOfSleep` follows a normal distribution with the `normaltest` function:
%% Cell type:code id:417bfc62 tags:
``` python
stats.normaltest(x)
```
%%%% Output: execute_result
NormaltestResult(statistic=17.050178832107033, pvalue=0.00019842697373015917)
%% Cell type:markdown id:655b0fc0 tags:
...and with the `kstest` function:
%% Cell type:code id:5b1b8872 tags:
``` python
stats.kstest(x, normal_distribution.cdf)
```
%%%% Output: execute_result
KstestResult(statistic=0.17414497571515752, pvalue=4.115192819897108e-22)
%% Cell type:markdown id:ddeacee0-2062-476d-b7ec-9e0f036d6c4b tags:
## Statistical testing
%% Cell type:markdown id:5f2fcb2e-1097-4bb7-ba77-fceaa3b55702 tags:
> What did we do?
We compared our **observations** `x` with some **expectation**.
We actually formulated a so-called *null hypothesis*, denoted $H_0$, that models the situation such that "nothing is going on", *i.e.* the observations meet the expectation.
We also implicitly defined an alternative hypothesis, usually denoted $H_1$ or $H_A$, that can simply be the opposite of $H_0$.
For example:
$$
\left\{
\begin{array}{ l l l }
H_0: & X \sim \mathcal{N}(\mu, \sigma^2) & \mbox{with } \mu \mbox{ assumed to be } \bar{x} \mbox{ and } \sigma^2 \mbox{ as } \frac{1}{n-1}\sum_{i=0}^{n-1} (x_i - \bar{x})^2 \\
H_A: & \mbox{not } H_0
\end{array}
\right.
$$
A test consists in contrasting the two incompatible hypotheses.
If we had a single observation – say $z=1.4$ – to compare with a distribution – say $\mathcal{N}(0,1)$ – we would simply compute the probability of drawing a value at least as extreme from this distribution:
%% Cell type:code id:8346ad8e tags:
``` python
z = 1.4
N = stats.norm(0, 1)
onesided_pvalue = N.sf(z) # sf= survival function
twosided_pvalue = 2 * min(N.cdf(z), N.sf(z))
```
%% Cell type:code id:404476b6 tags:
``` python
grid = np.linspace(N.ppf(.001), N.ppf(.999), 100)
pdf_curve, = plt.plot(grid, N.pdf(grid), 'r-')
z_line, = plt.plot([z, z], [0, N.pdf(z)], '-', zorder=1)
tail = grid[z<=grid]
plt.fill_between(tail, np.zeros_like(tail), N.pdf(tail), alpha=.2)
plt.axvline(0, linestyle='--', color='grey', linewidth=1)
plt.axhline(0, linestyle='--', color='grey', linewidth=1)
plt.xlim(grid[[0,-1]])
plt.xlabel('$X$')
plt.ylabel('probability density')
plt.legend([pdf_curve, z_line], ['$\mathcal{N}(0,1)$', '$z$'])
plt.annotate(f'$\\approx {onesided_pvalue:.2f}$', (1.8, .03), xytext=(2, .13), arrowprops=dict(arrowstyle="->"));
```
%%%% Output: display_data
%% Cell type:markdown id:99a007ac tags:
In practice, all tests boil down to comparing a single value with a reference distribution. Basically, a test expresses the discrepancy between the observations and the expectation in the shape of a *statistic*, and this statistic is supposed to follow a given distribution under $H_0$.
This is used as a basis to calculate a *p*-value that estimates the probability of erroneously rejecting $H_0$.
The experimenter also defines a significance level $\alpha$, with common values $\alpha=0.05$ or $0.01$, that sets the maximum tolerated risk of making a *type-1 error*, *i.e.* of rejecting $H_0$ by chance.
If the obtained <em>p</em>-value is lower than $\alpha$, then they can conclude there is sufficient evidence to reject $H_0$.
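%% Cell type:markdown tags:
As a quick illustration of what $\alpha$ represents: when $H_0$ is true, a well-calibrated test erroneously rejects it in roughly a fraction $\alpha$ of repeated experiments (a simulation sketch; the sample size and number of repetitions are arbitrary):
%% Cell type:code tags:
``` python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 2000
rejections = 0
for _ in range(n_experiments):
    # draw a sample for which H0 ("the population mean is 0") is true...
    sample = rng.normal(loc=0, scale=1, size=30)
    # ...and test it against that true mean
    if stats.ttest_1samp(sample, 0).pvalue < alpha:
        rejections += 1
# the observed type-1 error rate is close to alpha
print(rejections / n_experiments)
```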
%% Cell type:code id:b162e92e tags:
``` python
%%html
<style>table#typeoferrors { text-align: center; font-size: large; margin-left: 1px;} #typeoferrors td { text-align: center; font-size: large; border-right: solid 1px black; border-bottom: solid 1px black; } #typeoferrors td.border { font-size: small; border-left: solid 1px black; border-top: solid 1px black; } #typeoferrors td.wrong { color: orange; } #typeoferrors td.ok { color: green; } #typeoferrors span.sub { font-size: x-small; } #typeoferrors td.footnote { text-align: left; font-size: xx-small; border-right: 0px; border-bottom: 0px; } </style> <table id="typeoferrors"> <tr><td rowspan="2" colspan="2"></td><td colspan="2" class="border">Conclusion about $H_0$<br />from the statistical test</td></tr> <tr><td>accept</td><td>reject</td></tr> <tr><td rowspan="2" class="border">Truth about $H_0$<br />in the population</td><td>true</td><td class="ok">Correct</td><td class="wrong">Type 1 error<br /><span class="sub">observe difference<br />when none exists</span></td></tr> <tr><td>false</td><td class="wrong">Type 2 error<br /><span class="sub">fail to observe difference<br />when one exists</span></td><td class="ok">Correct</td></tr> <tr><td colspan="4" class="footnote"> <a href="https://faculty.nps.edu/rbassett/_book/hypothesis-testing-one-sample.html#fig:errorsHypTesting">https://faculty.nps.edu/rbassett/_book/hypothesis-testing-one-sample.html#fig:errorsHypTesting</a> </td></tr> </table>
```
%%%% Output: display_data
%% Cell type:markdown id:446ba63e-df67-46ca-8921-eca686462c93 tags:
## *t* tests
%% Cell type:markdown id:8f9602b8-5f64-438b-8051-c71f9c8b0b14 tags:
*t* tests derive a statistic that is supposed to follow the [Student's *t* distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) under $H_0$:
%% Cell type:code id:67d046f0-ca64-4a12-8e32-58b8e7e142a0 tags:
``` python
grid = np.linspace(-3.1, 3.1, 100)
dfs = [1, 2, 5, 20]
for df, color in zip(
dfs,
['blue', 'green', 'orange', 'red'],
):
t = stats.t(df)
plt.plot(grid, t.pdf(grid), '-', color=color)
plt.axvline(0, linestyle='--', color='grey', linewidth=1)
plt.axhline(0, linestyle='--', color='grey', linewidth=1)
plt.xlim(grid[[0,-1]])
plt.xlabel('$t$')
plt.ylabel('probability density')
plt.legend([ f'$df={df}$' for df in dfs ]);
```
%%%% Output: display_data
%% Cell type:markdown id:701ee597-2eec-4404-8a0e-601ae2121e19 tags:
At high degrees of freedom, the *t* distribution approaches the normal distribution. At lower degrees of freedom, the *t* distribution exhibits heavier tails and is less sensitive to extreme values.
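%% Cell type:markdown tags:
This convergence can be checked numerically, *e.g.* by measuring the largest pointwise gap between the *t* and standard normal densities (a quick sketch):
%% Cell type:code tags:
``` python
import numpy as np
from scipy import stats

grid = np.linspace(-3, 3, 100)
gaps = []
for df in (1, 5, 30, 1000):
    # largest pointwise difference between the t(df) and N(0,1) densities
    gaps.append(np.abs(stats.t(df).pdf(grid) - stats.norm(0, 1).pdf(grid)).max())
print(gaps)
```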
%% Cell type:markdown id:ca6bf548-cadf-4c75-8130-fe0c3ef8de9a tags:
### One-sample *t* test
%% Cell type:markdown id:0e1a462c-5c60-4438-a339-7c284ff56e15 tags:
This test compares a sample's central tendency (*sample mean*) with a reference value (*population mean*).
<table style="text-align: center;"><tr><td>
<img src='img/8mice.svg' />
</td><td>
<img src='img/Scientific_journal_icon.svg' width="96px" />
</td></tr><tr><td><center>
<code>x=[49.5 81.9 64.0 17.3 59.8 94.6 69.9 12.4]</code>
</center></td><td><center>
<code>&mu;=50</code>
</center></td></tr></table>
Let us call $\mu$ this reference value. Our expectation is that the sample mean $\bar{X}$ is close enough to $\mu$.
In other words, $H_0: \bar{X} = \mu$.
The statistic is:
$$
\frac{\bar{X} - \mu}{\mathrm{SEM}} \sim t(n-1) \quad \textrm{under} \; H_0
$$
<div class="alert alert-block alert-info" markdown="1">
<code>scipy.stats</code> provides a <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html">sem</a> function as an addition to <code>numpy</code>'s <a href="https://numpy.org/doc/stable/reference/generated/numpy.std.html">std</a>. <code>sem</code> is unbiased by default.
</div>
`scipy`'s one-sample *t* test is [ttest_1samp](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html):
`scipy.stats.ttest_1samp(a, popmean, axis=0, nan_policy='propagate', alternative='two-sided')`
%% Cell type:code id:b471633d-c9ad-455e-84e1-d32af085b32a tags:
``` python
mu = 50
x = np.array([49.47257879, 81.93967205, 64.030398, 17.25423608, 59.80082512,
94.56012514, 69.91672899, 12.39640637])
stats.ttest_1samp(x, mu)
```
%%%% Output: execute_result
Ttest_1sampResult(statistic=0.6024056396957578, pvalue=0.5658990587680466)
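%% Cell type:markdown tags:
As a sanity check, the statistic and *p*-value can be reproduced by hand from the formula above, with `stats.sem` in the denominator and the $t(n-1)$ distribution for the *p*-value:
%% Cell type:code tags:
``` python
import numpy as np
from scipy import stats

mu = 50
x = np.array([49.47257879, 81.93967205, 64.030398, 17.25423608, 59.80082512,
              94.56012514, 69.91672899, 12.39640637])
t = (x.mean() - mu) / stats.sem(x)      # (sample mean - mu) / SEM
p = 2 * stats.t(len(x) - 1).sf(abs(t))  # two-sided p-value, df = n - 1
print(t, p)
```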
%% Cell type:markdown id:dac3b6db-2ad7-46b4-937f-9dda1015b7fc tags:
If a negative (resp. positive) difference does not matter, *i.e.* we consider the danger zone to begin only above (resp. below) the expected value, we can make the test one-sided and gain statistical power.
To this end, we must choose and specify the side: here `'greater'` (resp. `'less'`):
%% Cell type:code id:b732c69f-1851-4ef2-bdc2-8f80b4c5281e tags:
``` python
stats.ttest_1samp(x, mu, alternative='greater')
```
%%%% Output: execute_result
Ttest_1sampResult(statistic=0.6024056396957578, pvalue=0.2829495293840233)
%% Cell type:code id:dc360e24-d676-4398-8e4b-dba71e9e68e8 tags:
``` python
x.mean(), stats.sem(x)
```
%%%% Output: execute_result
(56.1713713175, 10.244544391411772)
%% Cell type:markdown id:2144869e-9e4a-4e63-ae58-81c5a6be7205 tags:
### *t* test for independent samples
%% Cell type:markdown id:1e31b437-eceb-4da1-9ab3-f11d45b352c0 tags:
This test compares the means of two samples or groups, *e.g.* a control sample and a sample from a mutated population: $H_0: \bar{X_1} = \bar{X_2}$.
<table style="text-align:center;"><tr><td>
<img src="img/8mice.svg" alt="sample of the control population" />
</td><td>
<img src="img/8mutants1.svg" alt="sample of a mutated population" />
</td></tr><tr><td><center>
<code>x<sub>1</sub>=[49.5 81.9 64.0 17.3 59.8 94.6 69.9 12.4]</code>
</center></td><td><center>
<code>x<sub>2</sub>=[64.2 96.6 101.9 85.3 66.5 63.9 127.6 55.0]</code>
</center></td></tr></table>
`scipy`'s *t* test for independent samples uses the statistic $t=\frac{\bar{X_1}-\bar{X_2}}{\sqrt{(\frac{1}{n_1}+\frac{1}{n_2})\mbox{ }\textrm{PooledVariance}}}$ with $\textrm{PooledVariance} = \frac{1}{n_1+n_2-2}\sum_{j\in\{1,2\}}\sum_i (x_{ij}-\bar{x_j})^2$ and is available as [ttest_ind](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html):
`scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate', permutations=None, random_state=None, alternative='two-sided', trim=0)`
%% Cell type:code id:6231e214-ac36-4c4f-8a16-e1551c8484b4 tags:
``` python
x1 = x
x2 = np.array([64.22723692, 96.56483856, 101.94191774, 85.31918879,
66.4952999, 63.88841224, 127.63861749, 55.00527005])
stats.ttest_ind(x1, x2)
```
%%%% Output: execute_result
Ttest_indResult(statistic=-1.96174329619957, pvalue=0.06998888828308221)
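%% Cell type:markdown tags:
Here as well, the statistic can be reproduced from the formula above (reusing the same data):
%% Cell type:code tags:
``` python
import numpy as np
from scipy import stats

x1 = np.array([49.47257879, 81.93967205, 64.030398, 17.25423608, 59.80082512,
               94.56012514, 69.91672899, 12.39640637])
x2 = np.array([64.22723692, 96.56483856, 101.94191774, 85.31918879,
               66.4952999, 63.88841224, 127.63861749, 55.00527005])
n1, n2 = len(x1), len(x2)
# pooled variance from the unbiased per-group variances
pooled_variance = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
t = (x1.mean() - x2.mean()) / np.sqrt((1/n1 + 1/n2) * pooled_variance)
print(t)
```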
%% Cell type:markdown id:2703f3b1 tags:
1. Had we had a more specific alternative hypothesis, *e.g.* $\bar{X_2} > \bar{X_1}$, we could have made a one-sided test that would have successfully rejected $H_0$.
...but this should be decided prior to carrying out any test!
More important than the *p*-value is the *effect size*. A common measure of effect size for two independent samples is Cohen's $d$: $d = \frac{\bar{X_2}-\bar{X_1}}{\sqrt{\textrm{PooledVariance}}}$
%% Cell type:code id:2ba983bc-ef4c-4a15-ac23-776fffd88afb tags:
``` python
def cohen_d(x1, x2):
    n1, n2 = len(x1), len(x2)
    m1, m2 = np.mean(x1), np.mean(x2)
    # unbiased sample variances (note: np.var defaults to the biased estimate, ddof=0)
    v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    pooled_variance = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    d = (m2 - m1) / np.sqrt(pooled_variance)
    return d
cohen_d(x1, x2)
```
%%%% Output: execute_result
1.0485958993113402
%% Cell type:code id:3beb1fbb-a1ac-40aa-b4ce-b97f151392f5 tags:
``` python
from matplotlib import pyplot as plt
import ipywidgets as widgets
from ipywidgets import interact
def plot_pdfs(cohen_d):
grid = np.linspace(-3, 3+cohen_d, 100)
x1 = stats.norm(0, 1).pdf(grid)
x2 = stats.norm(cohen_d, 1).pdf(grid)
plt.fill_between(grid, x1, alpha=.5)
plt.fill_between(grid, x2, alpha=.5)
plt.show()
slider = widgets.FloatSlider(.5, min=0, max=4, step=.5)
interact(plot_pdfs, cohen_d=slider);
```
%%%% Output: display_data
%% Cell type:markdown id:f8e1face-4845-459d-95c2-8674181d39b5 tags:
1. (...)
We could have found a significant effect of size $0.3$, which may be of little practical interest.
Measurements of effect size were proposed together with [tables](https://core.ecu.edu/wuenschk/docs30/EffectSizeConventions.pdf) for interpreting size values. For example, for Cohen's $d$:
| $|d|$ | size of effect |
| :-: | :-- |
| $0.2$ | small |
| $0.5$ | medium |
| $0.8$ | large |
2. Had we found enough evidence to reject $H_0$, we could have concluded about an *association* between the mutation and the observed effect.
To further conclude in terms of *causation*, it is necessary to rule out all possible [confounders](https://en.wikipedia.org/wiki/Confounding) (supplier, cage effect, etc.).
3. `scipy`'s implementation does not require equal numbers of observations per group.
However, it assumes the groups are normally distributed (although it is fairly robust to all but «extreme non-normality») and, more importantly, that they have similar variances ($0.5<\frac{s_{X_1}}{s_{X_2}}<2$).
For heterogeneous groups, `ttest_ind` embarks various variants of the *t* test that can be selected with additional arguments:
* Welch's *t* test with `equal_var=False`;
* Yuen's *t* test with `equal_var=False` and `trim=0.2` (requires more data).
<div class="alert alert-block alert-info" markdown="1">
<em>Tip</em>: always check for the underlying assumptions in the documentation.
</div>
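%% Cell type:markdown tags:
A minimal sketch of the rule of thumb above, picking between the standard and the Welch variants from the ratio of sample standard deviations:
%% Cell type:code tags:
``` python
import numpy as np
from scipy import stats

x1 = np.array([49.47257879, 81.93967205, 64.030398, 17.25423608, 59.80082512,
               94.56012514, 69.91672899, 12.39640637])
x2 = np.array([64.22723692, 96.56483856, 101.94191774, 85.31918879,
               66.4952999, 63.88841224, 127.63861749, 55.00527005])
sd_ratio = x1.std(ddof=1) / x2.std(ddof=1)
# standard t test if the ratio lies within [0.5, 2], Welch's t test otherwise
equal_var = bool(0.5 < sd_ratio < 2)
print(sd_ratio, stats.ttest_ind(x1, x2, equal_var=equal_var))
```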
%% Cell type:markdown id:1083c04c-1221-446f-84a0-7084413a7722 tags:
### *t* test for paired samples
%% Cell type:markdown id:4743e7fc-b964-48d4-9e13-e78146626c5a tags:
<img src='img/paired1.svg' />
`scipy`'s *t* test for paired samples is [ttest_rel](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html):
`scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate', alternative='two-sided')`
This is actually a one-sample *t* test of the between-group differences against a population mean equal to zero (compare [1](https://github.com/scipy/scipy/blob/v1.7.1/scipy/stats/stats.py#L6450-L6460) and [2](https://github.com/scipy/scipy/blob/v1.7.1/scipy/stats/stats.py#L5647-L5656)).
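%% Cell type:markdown tags:
This equivalence is easy to verify numerically, treating `x1` and `x2` from above as if they were paired:
%% Cell type:code tags:
``` python
import numpy as np
from scipy import stats

x1 = np.array([49.47257879, 81.93967205, 64.030398, 17.25423608, 59.80082512,
               94.56012514, 69.91672899, 12.39640637])
x2 = np.array([64.22723692, 96.56483856, 101.94191774, 85.31918879,
               66.4952999, 63.88841224, 127.63861749, 55.00527005])
paired = stats.ttest_rel(x1, x2)
one_sample = stats.ttest_1samp(x1 - x2, 0)
print(paired.statistic, one_sample.statistic)  # identical
```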
%% Cell type:markdown id:234dc53a-dce0-426e-8f57-4a4f9bfa38fc tags:
#### Effect size
%% Cell type:markdown id:479e8f62-c653-472d-b9d5-d3bf854d09ff tags:
Adjusted Cohen's $d$:
%% Cell type:code id:25adcf4f-6611-434f-b619-27f341b3caad tags:
``` python
r, _ = stats.pearsonr(x1, x2)
cohen_d(x1, x2) / np.sqrt(1 - r)
```
%%%% Output: execute_result
1.2657389481669015
%% Cell type:markdown id:6b0fd532-87d0-4747-b3a3-322d77c2bfbd tags:
## Analysis of variance
%% Cell type:markdown id:8443a0ff-75ed-4d88-a4e8-a988e6de3716 tags:
### One-way ANOVA
Comparing the means of three or more groups, *i.e.* $H_0: \bar{X_1} = \bar{X_2} = \ldots = \bar{X_k}$, is usually carried out with an *analysis of variance*.
The total variance ($SS_{\textrm{total}}$) is decomposed as the sum of two terms: *within-group* variance ($SS_{\textrm{error}}$) and *between-group* variance ($SS_{\textrm{treatment}}$).
$$
\underbrace{\sum_j\sum_i (x_{ij} - \bar{\bar{x}})^2}_{SS_{\textrm{total}}} = \underbrace{\sum_j\sum_i (\bar{x_j} - \bar{\bar{x}})^2}_{SS_{\textrm{treatment}}} + \underbrace{\sum_j\sum_i (x_{ij} - \bar{x_j})^2}_{SS_{\textrm{error}}}
$$
Many statistical tools report the following detailed table:
| Source | Degrees of<br />freedom | Sum of squares | Mean squares | $\mbox{ }F\mbox{ }$ | $p$-value |
| :- | :-: | :-: | :-: | :-: | :-: |
| Treatment | $k-1$ | $SS_{\textrm{treatment}}$ | $MS_{\textrm{treatment}}$ | $\frac{MS_{\textrm{treatment}}}{MS_{\textrm{error}}}$ | $\mbox{ }p\mbox{ }$ |
| Error | $N-k$ | $SS_{\textrm{error}}$ | $MS_{\textrm{error}}$ | | |
| Total | $N-1$ | $SS_{\textrm{total}}$ | | | |
The statistic $F = \frac{MS_{\textrm{treatment}}}{MS_{\textrm{error}}}$ follows Fisher's [F](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html) distribution under $H_0$.
More about it at: https://www.coursera.org/learn/stanford-statistics/lecture/pskeN/the-idea-of-analysis-of-variance
The most basic [implementation](https://github.com/scipy/scipy/blob/v1.7.1/scipy/stats/mstats_basic.py#L2937-L2967) of the one-way ANOVA in SciPy is [f_oneway](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.f_oneway.html):
%% Cell type:code id:2bf0aafa-d23f-44fe-b9e0-cf4bf13e7314 tags:
``` python
A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
stats.f_oneway(A, B, C)
```
%%%% Output: execute_result
F_onewayResult(statistic=2.3575322551335636, pvalue=0.11384795345837218)
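%% Cell type:markdown tags:
As a sanity check, the entries of the ANOVA table can be reconstructed by hand from the sums of squares:
%% Cell type:code tags:
``` python
import numpy as np
from scipy import stats

groups = [np.asarray(g) for g in (
    [85, 86, 88, 75, 78, 94, 98, 79, 71, 80],
    [91, 92, 93, 85, 87, 84, 82, 88, 95, 96],
    [79, 78, 88, 94, 92, 85, 83, 85, 82, 81],
)]
k, N = len(groups), sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()
ss_treatment = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups)
ss_error = sum(((g - g.mean())**2).sum() for g in groups)
F = (ss_treatment / (k - 1)) / (ss_error / (N - k))
p = stats.f(k - 1, N - k).sf(F)
print(F, p)  # same values as f_oneway
```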
%% Cell type:markdown id:164aa1ae-5446-4680-9571-c783edaf4101 tags:
The ANOVA is an *omnibus* test and does not tell which groups exhibit differing means. Specific differences are identified afterwards using *post-hoc tests* (more about these in the next session).
%% Cell type:markdown id:8bf2c9ad-68fd-4f4b-b68c-180de69d3522 tags:
### Assumptions
%% Cell type:markdown id:bd484cda-1d0b-4468-a092-f362744f8aa0 tags:
The standard ANOVA requires the data to exhibit the following properties:
* independent observations,
* normally distributed residuals,
* all groups have equal population variance (*homoscedasticity*),
* at least 5 observations per group ($n \ge 5$), with group sizes preferably equal.
%% Cell type:markdown id:0ea4e6c8-4851-4704-b9ca-6257239ab07b tags:
#### Effect size
%% Cell type:markdown id:39925882-deb4-43e1-8d82-9fb1e68df0ba tags:
Mentioned for completeness: Cohen's $f=\sqrt{\frac{R^2}{1 - R^2}}=\sqrt{\frac{SS_{\textrm{treatment}}}{SS_{\textrm{error}}}}$ and the [$\sqrt{F}$ root-mean-square effect](https://en.wikipedia.org/wiki/Effect_size#%CE%A8,_root-mean-square_standardized_effect) are suitable for the one-way ANOVA but not widely used, as post-hoc tests offer a more natural approach to effect sizes.
`statsmodels` features [effectsize_oneway](https://www.statsmodels.org/stable/generated/statsmodels.stats.oneway.effectsize_oneway.html).
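%% Cell type:markdown tags:
For illustration, Cohen's $f$ for the example groups above can be computed directly from the sums of squares (a sketch with our own variable names):
%% Cell type:code tags:
``` python
import numpy as np

A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
groups = [np.asarray(g, dtype=float) for g in (A, B, C)]
grand_mean = np.concatenate(groups).mean()

ss_treatment = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)

cohen_f = np.sqrt(ss_treatment / ss_error)
round(cohen_f, 4)
```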
%% Cell type:markdown id:e38c759e-8807-4975-9620-9c2100b964a8 tags:
## Checking for common assumptions
%% Cell type:markdown id:e9509dde-4edf-4d5b-bcb6-7bd6bbd87918 tags:
Visually checking for desired properties like normality or equal variance is acceptable, especially if the data are generally known to exhibit these properties.
### Normality
Having this property is usually not critical, because most tests are fairly robust to non-normality.
We only need to avoid cases of "extreme non-normality".
%% Cell type:markdown id:c5fa2537-0d3b-4b1d-bba4-a479cdf2c0a9 tags:
#### Graphical approaches
%% Cell type:markdown id:f0f983f0-6143-44b0-9eee-f6d54eb5cf51 tags:
Probability plots with [probplot](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html) (or [statsmodels.api.qqplot](https://www.statsmodels.org/stable/generated/statsmodels.graphics.gofplots.qqplot.html) with one sample):
%% Cell type:code id:6bcaa795-3a85-4b8c-aad9-d554e244a2ec tags:
``` python
np.random.seed(1245619531)
x_normal = stats.norm.rvs(loc=0, scale=1, size=30) # generate 30 observations from the standard normal distribution
x_not_normal = stats.norm.rvs(loc=[-1,1], scale=[1,3], size=(15,2)).ravel() # generate 30 observations from a mixture of normal distributions
_, axes = plt.subplots(1, 2, figsize=(13.3,4.1))
stats.probplot(x_normal, plot=axes[0])
stats.probplot(x_not_normal, plot=axes[1]);
```
%%%% Output: display_data
%% Cell type:code id:e1f2356b-dd02-47d5-8e2c-82c77a6a5c97 tags:
``` python
_, axes = plt.subplots(1, 2, figsize=(13.3,4.1))
axes[0].hist(x_normal, bins=7)
axes[1].hist(x_not_normal, bins=7);
```
%%%% Output: display_data
%% Cell type:markdown id:8f46e74c-ee1f-454c-a4a1-3e551899e7e7 tags:
#### Normality tests
%% Cell type:markdown id:81a1572d-1639-42d9-834c-2ba86c7c7dd3 tags:
* D'Agostino's test: [normaltest](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html), preferably for large samples ($n>20$),
* Similar test for skewness only: [skewtest](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skewtest.html) ($n\ge8$),
%% Cell type:code id:e964be07-0696-4ff8-844a-3e7d318a7f3a tags:
``` python
skewed_dist = lambda sigma, x: np.exp( -.5*(np.log(x)/sigma)**2 ) / ( x*sigma*np.sqrt(2*np.pi) )
heavy_tailed_dist = lambda scale, x: stats.cauchy.pdf(x, 0, scale)
colors = ['blue', 'green', 'orange', 'red']
_, axes = plt.subplots(1, 2, figsize=(13.3,4.1))
grid = np.linspace(0, 3, 100)
grid = grid[1:]
ax = axes[0]
for sigma, color in zip((.25, .5, 1), colors):
ax.plot(grid, skewed_dist(sigma, grid), '-', color=color, label=f'$\sigma={sigma:.2f}$')
ax.axhline(0, linestyle='--', color='grey', linewidth=1)
ax.set_xlim(grid[[0,-1]])
ax.set_title('skewness')
grid = np.linspace(-4, 4, 100)
ax = axes[1]
for s, color in zip((.5, 1, 2), colors):
ax.plot(grid, heavy_tailed_dist(s, grid), '-', color=color, label=f'$scale={s:.2f}$')
ax.axvline(0, linestyle='--', color='grey', linewidth=1)
ax.axhline(0, linestyle='--', color='grey', linewidth=1)
ax.set_xlim(grid[[0,-1]])
ax.set_title('kurtosis')
for ax in axes:
ax.set_ylabel('probability density')
ax.legend();
```
%%%% Output: display_data
%% Cell type:markdown id:eed065d2-dd35-4fdb-ad91-8d387a06c7ea tags:
* Shapiro-Wilk test: [shapiro](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html),
* Generic goodness-of-fit tests: [kstest](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html) and [anderson](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html).
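%% Cell type:markdown tags:
As a quick sketch (the samples and seed are our own choices), these tests behave as expected on a normal and a clearly skewed sample:
%% Cell type:code tags:
``` python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x_normal = stats.norm.rvs(size=50, random_state=rng)
x_skewed = stats.lognorm.rvs(1., size=50, random_state=rng)  # log-normal, strongly skewed

for name, x in (('normal', x_normal), ('log-normal', x_skewed)):
    _, p_shapiro = stats.shapiro(x)
    _, p_dagostino = stats.normaltest(x)
    print(f"{name}: Shapiro-Wilk p={p_shapiro:.3g}, D'Agostino p={p_dagostino:.3g}")
```
A small *p*-value rejects normality; with the skewed sample it falls well below any usual $\alpha$.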
%% Cell type:markdown id:4c7a08d6-3c74-48ce-ad32-f164089b6ec7 tags:
### Equal variance (homoscedasticity)
%% Cell type:markdown id:5e5a893e-6475-401b-9335-b117629b8d0a tags:
#### Graphical approaches
Simple per-group box plots.
%% Cell type:code id:8a31b653-d891-4f07-ab8b-73562b2ed86d tags:
``` python
A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
df = pd.DataFrame(data=dict(A=A, B=B, C=C))
plt.boxplot(df, labels=df.columns);
```
%%%% Output: display_data
%% Cell type:markdown id:e985446d-d250-4d88-9a13-50a098cd016b tags:
#### Equality-of-variance tests
* Bartlett's test: [bartlett](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bartlett.html), most basic and common test,
* Levene's test: [levene](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html), better for skewed or heavy-tailed distributions,
* ...and others: the Fligner-Killeen test ([fligner](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fligner.html)), the Ansari-Bradley test ([ansari](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ansari.html)), etc.
Example:
%% Cell type:code id:280180fa-4f20-44a6-a463-2ce9b0ad75a4 tags:
``` python
# copy-pasted from https://www.statology.org/bartletts-test-python/
A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
stats.bartlett(A, B, C)
```
%%%% Output: execute_result
BartlettResult(statistic=3.3024375753550457, pvalue=0.19181598314036113)
%% Cell type:markdown id:e906820a-af04-4a9d-992f-f2e07a7aed91 tags:
In the above example, as there is not enough evidence to reject $H_0$ ($p>0.05$), we can proceed with a standard one-way ANOVA. Otherwise, we would use the Alexander-Govern test or Welch's *F* test instead.
The Alexander-Govern test is available in `scipy` as [alexandergovern](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.alexandergovern.html), but Welch's *F* test is not (neither in `scipy.stats` nor in `statsmodels`). Install the `Pingouin` package and try out the [welch_anova](https://pingouin-stats.org/generated/pingouin.welch_anova.html) function instead.
Bartlett's test statistic follows $\chi^2_{k-1}$, with $k$ the number of groups. As with most tests based on the $\chi^2$ distribution, the *p*-value is one-sided.
%% Cell type:code id:b8123d48-7285-4fa3-8473-04f316dedffc tags:
``` python
grid = np.linspace(0, 20, 200)
dfs = [2, 3, 5, 9]
_, axes = plt.subplots(1, 2, figsize=(13.3,4.1))
ax = axes[0]
for df, color in zip(
dfs,
['blue', 'green', 'orange', 'red'],
):
chi2 = stats.chi2.pdf(grid, df)
ax.plot(grid, chi2, '-', color=color)
ax.axhline(0, linestyle='--', color='grey', linewidth=1)
ax.set_xlim(grid[0],grid[-1])
ax.set_xlabel('$\chi^2$')
ax.set_ylabel('probability density')
ax.legend([ f'$df={df}$' for df in dfs ])
ax = axes[1]
df, color = 2, 'blue'
chi2 = stats.chi2.pdf(grid, df)
ax.plot(grid, chi2, '-', color=color)
ax.axhline(0, linestyle='--', color='grey', linewidth=1)
ax.set_xlim(grid[0],grid[-1])
ax.set_xlabel('$\chi^2$')
ax.set_ylabel('probability density');
A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
bartlett_statistic, bartlett_pvalue = stats.bartlett(A, B, C)
bartlett_statistic_line, = ax.plot([bartlett_statistic]*2, [0, stats.chi2.pdf(bartlett_statistic, df)], '-', zorder=1)
tail = grid[bartlett_statistic<=grid]
ax.fill_between(tail, np.zeros_like(tail), stats.chi2.pdf(tail, df), alpha=.2)
ax.annotate(f'$\\approx {bartlett_pvalue:.2f}$', (4, .02), xytext=(8, .1), arrowprops=dict(arrowstyle="->"));
```
%%%% Output: display_data
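%% Cell type:markdown tags:
For skewed or heavy-tailed data, Levene's test can be used in place of Bartlett's test, with the same call signature. A quick sketch on the same groups:
%% Cell type:code tags:
``` python
from scipy import stats

A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

# by default, levene centers each group on its median (Brown-Forsythe variant)
stats.levene(A, B, C, center='median')
```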
%% Cell type:markdown id:c09c6222-4167-4e32-b6da-4abe3a8960de tags:
## $\chi^2$ tests
%% Cell type:markdown id:1ec71b98-8d1b-4854-b86d-df738ddc8595 tags:
When the sum of the observations is fixed, *e.g.* the observations are counts over categories whose proportions sum to $1$, we use a $\chi^2$ test instead of an ANOVA.
### Goodness-of-fit
%% Cell type:markdown id:d855b1f2-3105-445a-b1a6-b99bceeb66b7 tags:
Example:
Comparing the frequencies of the different allele variants at a given locus between a reference genome and a test genome.
Another popular example: [Color proportion of M&Ms [Coursera]](https://www.coursera.org/learn/stanford-statistics/lecture/rAwbR/the-color-proportions-of-m-ms):
| blue | orange | green | yellow | red | brown |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 24% | 20% | 16% | 14% | 13% | 13% |
%% Cell type:code id:84f4e9b3-a653-422c-8fc5-83b29869ba87 tags:
``` python
expected_props = np.array([ .24, .2, .16, .14, .13, .13 ])
observed_counts = np.array([ 85, 79, 56, 64, 58, 68 ])
np.sum(observed_counts)
```
%%%% Output: execute_result
410
%% Cell type:code id:8bb2916a-e9a4-4bbe-b04f-14818bb68723 tags:
``` python
expected_counts = expected_props * np.sum(observed_counts)
expected_counts
```
%%%% Output: execute_result
array([98.4, 82. , 65.6, 57.4, 53.3, 53.3])
%% Cell type:markdown id:a75bd8d7-f61e-49bc-80b4-cb1f05ccd7c1 tags:
| | blue | orange | green | yellow | red | brown |
| --: | :-: | :-: | :-: | :-: | :-: | :-: |
| Expected | 98.4 | 82 | 65.6 | 57.4 | 53.3 | 53.3 |
| Observed | 85 | 79 | 56 | 64 | 58 | 68 |
The statistic is:
$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \mbox{ } \mbox{ } \mbox{ } \mbox{ } \mbox{ } \mbox{ } \mbox{ } \sim \chi^2_{k-1} \mbox{ } \textrm{under} \mbox{ } H_0
$$
%% Cell type:code id:a8898631-a5b5-49e8-a158-3e149345391a tags:
``` python
k = len(expected_counts)
chi2 = np.sum((observed_counts - expected_counts) ** 2 / expected_counts)
chi2
```
%%%% Output: execute_result
8.566983829178941
%% Cell type:code id:7a8af457-13dc-483a-90e3-50d3949c8edf tags:
``` python
pvalue = stats.chi2(k-1).sf(chi2)
pvalue
```
%%%% Output: execute_result
0.1276329790529603
%% Cell type:markdown id:92886c91-f6ef-419d-bfda-e28d9dfcc7bc tags:
`scipy.stats`'s implementation of the test is [chisquare](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html):
%% Cell type:code id:1e246915-e9be-4826-9677-d9f30348d0df tags:
``` python
stats.chisquare(observed_counts, expected_counts)
```
%%%% Output: execute_result
Power_divergenceResult(statistic=8.566983829178941, pvalue=0.1276329790529603)
%% Cell type:markdown id:00614391-2cdb-4e8f-9734-a1fd2df57637 tags:
#### Effect size
Cohen's $w = \sqrt{\frac{\chi^2}{N}}$, with $N$ the total number of observations:
%% Cell type:code id:f2231df3-3ce3-40c4-ae75-f6159061e3d6 tags:
``` python
cohen_w = np.sqrt(chi2 / np.sum(observed_counts))
round(cohen_w, 4)
```
%%%% Output: execute_result
0.1446
%% Cell type:markdown id:1feb29a6-eeac-4096-8e90-c663fdeea8a8 tags:
### Homogeneity and independence
%% Cell type:markdown id:2ef858d5-ec8f-405d-924e-62f5286a94c5 tags:
Example:
Comparing the frequency of cell types in cultures that differ in the treatments:
| Observed | Type A cells | Type B cells | Type C cells | Type D cells |
| --: | :-: | :-: | :-: | :-: |
| Treatment 1 | 134 | 86 | 32 | 11 |
| Treatment 2 | 101 | 92 | 38 | 8 |
| Treatment 3 | 188 | 67 | 54 | 19 |
$H_0$: the treatments have no effect on the frequency of the cell types.
https://www.coursera.org/learn/stanford-statistics/lecture/78IMJ/the-chi-square-test-for-homogeneity-and-independence
%% Cell type:code id:13145d41-7dc3-4eff-9f9c-38d0abed1e8d tags:
``` python
observed_counts = np.array([
[ 134, 86, 32, 11 ],
[ 101, 92, 38, 8 ],
[ 188, 67, 54, 19 ],
])
```
%% Cell type:code id:fe4ce304-25ac-4734-9a0a-e20db59b742f tags:
``` python
expected_props = np.sum(observed_counts, axis=0) / np.sum(observed_counts)
expected_props
```
%%%% Output: execute_result
array([0.50963855, 0.29518072, 0.14939759, 0.04578313])
%% Cell type:markdown id:7f4b4727-3a9b-474f-b6ad-b632be97b04e tags:
Under $H_0$, the expected proportions are:
| Expected | Type A cells | Type B cells | Type C cells | Type D cells |
| --: | :-: | :-: | :-: | :-: |
| Treatment 1 | 51% | 29% | 15% | 5% |
| Treatment 2 | 51% | 29% | 15% | 5% |
| Treatment 3 | 51% | 29% | 15% | 5% |
%% Cell type:code id:087ea780-304c-459d-a7c0-066817f3d82f tags:
``` python
expected_counts = np.outer(np.sum(observed_counts, axis=1), expected_props)
expected_counts
```
%%%% Output: execute_result
array([[134.03493976, 77.63253012, 39.29156627, 12.04096386],
[121.80361446, 70.54819277, 35.7060241 , 10.94216867],
[167.16144578, 96.81927711, 49.00240964, 15.01686747]])
%% Cell type:markdown id:c0f49b1a-3a7a-4ad8-8d2a-9fd3513ad2a6 tags:
| Expected | Type A cells | Type B cells | Type C cells | Type D cells |
| --: | :-: | :-: | :-: | :-: |
| Treatment 1 | 134 | 77.6 | 39.3 | 12 |
| Treatment 2 | 121.8 | 70.5 | 35.7 | 10.9 |
| Treatment 3 | 167.2 | 96.8 | 49 | 15 |
%% Cell type:code id:d9585b87-f5dd-4cf2-ad91-0a79dcc95c86 tags:
``` python
j, k = expected_counts.shape
dof = (j - 1) * (k - 1)
chi2 = np.sum((observed_counts - expected_counts) ** 2 / expected_counts)
chi2
```
%%%% Output: execute_result
26.7075512595244
%% Cell type:code id:a7cd985e-eb2b-4a72-a221-30e3e0fdf8ec tags:
``` python
stats.chi2(dof).sf(chi2)
```
%%%% Output: execute_result
0.00016426084515914902
%% Cell type:markdown id:13d7ef58-e62d-48de-a3a4-2ccc2ac588cb tags:
`scipy.stats`'s $\chi^2$ test for homogeneity/independence is [chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html):
%% Cell type:code id:4cd7ea08-110a-4ff7-9521-0f12e4169d35 tags:
``` python
stats.chi2_contingency(observed_counts)
```
%%%% Output: execute_result
(26.707551259524408,
0.0001642608451591484,
6,
array([[134.03493976, 77.63253012, 39.29156627, 12.04096386],
[121.80361446, 70.54819277, 35.7060241 , 10.94216867],
[167.16144578, 96.81927711, 49.00240964, 15.01686747]]))
%% Cell type:markdown id:45da92b9-cd4e-4d1b-8996-451285ef1896 tags:
Due to the design of the test, it does not matter which factor is hypothesized to have no effect under $H_0$:
%% Cell type:code id:d7e7fd75-068c-4f37-b981-498890c4a0d7 tags:
``` python
stats.chi2_contingency(observed_counts.T)
```
%%%% Output: execute_result
(26.707551259524408,
0.0001642608451591484,
6,
array([[134.03493976, 121.80361446, 167.16144578],
[ 77.63253012, 70.54819277, 96.81927711],
[ 39.29156627, 35.7060241 , 49.00240964],
[ 12.04096386, 10.94216867, 15.01686747]]))
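%% Cell type:markdown tags:
Mentioned for completeness: a common effect size for contingency tables is Cramér's $V = \sqrt{\frac{\chi^2}{N(\min(r, c) - 1)}}$, with $r$ and $c$ the numbers of rows and columns. A sketch with our own variable names:
%% Cell type:code tags:
``` python
import numpy as np
from scipy import stats

observed_counts = np.array([
    [ 134, 86, 32, 11 ],
    [ 101, 92, 38,  8 ],
    [ 188, 67, 54, 19 ],
])

chi2, pvalue, dof, expected = stats.chi2_contingency(observed_counts)
n = observed_counts.sum()      # total number of observations
r, c = observed_counts.shape
cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))
round(cramers_v, 4)
```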
%% Cell type:markdown id:141dc10b-c964-4621-9e53-32f36c11d2da tags:
## Correlation
%% Cell type:markdown id:6efe325b tags:
### Analyses of association (recap)
| Test | Types of variables |
| :-: | :-- |
| $\chi^2$ test<br />(ind./homo.) | categorical *vs* categorical |
| ANOVA | categorical (*e.g.* group) *vs* continuous (response) |
| ? | continuous *vs* continuous |
Note: frequencies in the $\chi^2$ tests are summary statistics and play the role of a sample size. They are NOT treated as measurements of a variable, although they could be at another conceptual level (*e.g.* across a population of bags of M&Ms).
%% Cell type:markdown id:344da42e-5707-4a0a-9b98-4cbf47781fbe tags:
### Correlation coefficient
%% Cell type:markdown id:0c8b61ee-e2da-4dc4-b3ad-02790d407d5b tags:
The Pearson correlation coefficient of two series is the covariance of the two series normalized by the standard deviation of each series:
$$
\textrm{Var}(x) = \frac{1}{n-1}\sum_i (x_i - \bar{x})(x_i - \bar{x}) \\
\textrm{Cov}(x, y) = \frac{1}{n-1}\sum_i(x_i - \bar{x})(y_i - \bar{y}) \\
-1 \le \mbox{ } \mbox{ } \mbox{ } r(x, y) = \frac{\textrm{Cov}(x, y)}{\sqrt{\textrm{Var}(x)\textrm{Var}(y)}} \mbox{ } \mbox{ } \mbox{ } \le 1
$$
The [pearsonr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) function computes the Pearson correlation coefficient together with a *p*-value:
%% Cell type:code id:f5a36568-d5b5-4388-b3f9-f38855c58c80 tags:
``` python
x1 = stats.norm.rvs(loc=46, scale=30, size=30)
x2 = stats.norm.rvs(loc=71, scale=30, size=30)
r, pv = stats.pearsonr(x1, x2)
r, pv
```
%%%% Output: execute_result
(0.3518680132574827, 0.05653920630309695)
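%% Cell type:markdown tags:
The coefficient can be recovered from the covariance formula above; a sketch with our own simulated data:
%% Cell type:code tags:
``` python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)  # build in a linear association

# r as the covariance normalized by the standard deviations
r_manual = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
r_scipy, _ = stats.pearsonr(x, y)
round(r_manual, 6), round(r_scipy, 6)
```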
%% Cell type:markdown id:a9fbd953 tags:
The correlation coefficient is a commonly used effect size for the linear relationship between two variables, similar to (but not to be confused with) a regression coefficient:
%% Cell type:code id:3581eeb0-f98d-4b2f-a0f1-bef073d86980 tags:
``` python
df = pd.DataFrame(dict(x1=x1, x2=x2));
sns.regplot(x="x1", y="x2", data=df);
```
%%%% Output: display_data