"Always good to get a reminder about [general considerations](https://www.coursera.org/learn/stanford-statistics/home/welcome), __prior to data collection__ and analysis.\n",
"\n",
"* Sampling from the population,\n",
"* identifying the sources of variability..."
"* identifying the sources of variability, etc."
]
},
{
...
...
@@ -1587,7 +1587,7 @@
"\n",
"This is used as a basis to calculate a *p*-value that estimates the probability of erroneously rejecting $H_0$.\n",
"\n",
"The experimenter also defines a significance level $\\alpha$, with common values $\\alpha=0.05$ or $0.01$, that sets the maximum tolerated risk of rejecting $H_0$ by chance.\n",
"The experimenter also defines a significance level $\\alpha$, with common values $\\alpha=0.05$ or $0.01$, that sets the maximum tolerated risk of making a *type-1 error*, *i.e.* of rejecting $H_0$ by chance.\n",
"If the obtained <em>p</em>-value is lower than $\\alpha$, then s·he can conclude there is sufficient evidence to reject $H_0$."
]
},
...
...
%% Cell type:code id:8ccafd3d tags:
``` python
import sys
!"{sys.executable}" -m pip install scipy
```
%%%% Output: stream
Requirement already satisfied: scipy in /home/flaurent/.local/lib/python3.8/site-packages (1.7.1)
Requirement already satisfied: numpy<1.23.0,>=1.16.5 in /home/flaurent/.local/lib/python3.8/site-packages (from scipy) (1.21.1)
[SciPy](https://docs.scipy.org/doc/scipy/reference/) is a collection of mathematical tools for diverse fields, with functionality split across several modules:
%% Cell type:code id:42342d74 tags:
``` python
from scipy import (
    cluster,      # Clustering algorithms
    constants,    # Physical and mathematical constants
    fftpack,      # Fast Fourier Transform routines
    integrate,    # Integration and ordinary differential equation solvers
    interpolate,  # Interpolation and smoothing splines
    io,           # Input and Output
    linalg,       # Linear algebra
    ndimage,      # N-dimensional image processing
    odr,          # Orthogonal distance regression
    optimize,     # Optimization and root-finding routines
    signal,       # Signal processing
    sparse,       # Sparse matrices and associated routines
    spatial,      # Spatial data structures and algorithms
    special,      # Special functions
    stats,        # Statistical distributions and functions
)
```
%% Cell type:markdown id:1091dfd5 tags:
Reminder about module loading:
Example: how to access the `ttest_ind` function defined in the `scipy.stats` module?
%% Cell type:code id:898957c8 tags:
``` python
%%script echo skipping
import scipy.stats
scipy.stats.ttest_ind
from scipy import stats
stats.ttest_ind
from scipy.stats import *
ttest_ind
```
%%%% Output: stream
skipping
%% Cell type:markdown id:9e2fe344 tags:
`scipy.stats` content (see the [official documentation](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#module-scipy.stats)):
`scipy.stats` features basic functionalities and we will occasionally mention the `statsmodels` and `pingouin` libraries as we will hit `scipy.stats` limitations.
| term | definition |
| :-- | :-- |
| **treatment** | every procedure or experimental condition<br/>that departs from a control condition<br/>(and the control itself is a special-case treatment) |
| **population** | a set of individuals we want our conclusions to apply to |
| **sample** | a finite set of selected individuals<br/>assumed to be representative of a population |
Always good to get a reminder about [general considerations](https://www.coursera.org/learn/stanford-statistics/home/welcome), __prior to data collection__ and analysis:

* sampling from the population,
* identifying the sources of variability, etc.
To determine whether there is *sufficient evidence* to conclude the treatment has an effect, we use *statistical tests*.
However, because experimental designs are often complex and involve multiple treatments and additional sources of variability, most studies also involve multiple tests, which are usually carried out after a so-called *omnibus* test.
In addition, every statistical test makes various assumptions that in turn need to be checked. As a consequence, every statistical analysis involves a series of tests and procedures.
Contrary to statistical software with a GUI, the tools featured in programming languages such as Python do not offer much guidance in following such a workflow.
We compared our **observations** `x` with some **expectation**.
We actually formulated a so-called *null hypothesis*, denoted $H_0$, that models the situation such that "nothing is going on", *i.e.* the observations meet the expectation.
We also implicitly defined an alternative hypothesis, usually denoted $H_1$ or $H_A$, that can simply be the opposite of $H_0$.
For example:
$$
\left\{
\begin{array}{ l l l }
H_0: & X \sim \mathcal{N}(\mu, \sigma^2) & \mbox{with } \mu \mbox{ assumed to be } \bar{x} \mbox{ and } \sigma^2 \mbox{ as } \frac{1}{n-1}\sum_{i=0}^{n-1} (x_i - \bar{x})^2 \\
H_A: & \mbox{not } H_0
\end{array}
\right.
$$
A test consists in contrasting the two incompatible hypotheses.
If we had a single observation – say $z=1.4$ – to compare with a distribution – say $\mathcal{N}(0,1)$ – we would simply compute the probability for this value to be drawn from this distribution (or not):
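A minimal sketch of this computation with `scipy.stats` (the data are just the single value $z=1.4$ from above):

``` python
from scipy import stats

z = 1.4
# Probability of drawing a value at least as extreme as z from N(0, 1),
# counting both tails (sf is the survival function, i.e. 1 - cdf):
p = 2 * stats.norm.sf(abs(z))
p  # ~0.16
```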
In practice, all tests boil down to comparing a single value with a reference distribution. Basically, a test expresses the discrepancy between the observations and the expectation in the shape of a *statistic*, and this statistic is supposed to follow a given distribution under $H_0$.
This is used as a basis to calculate a *p*-value that estimates the probability of erroneously rejecting $H_0$.
The experimenter also defines a significance level $\alpha$, with common values $\alpha=0.05$ or $0.01$, that sets the maximum tolerated risk of making a *type-1 error*, *i.e.* of rejecting $H_0$ by chance.
If the obtained <em>p</em>-value is lower than $\alpha$, then s·he can conclude there is sufficient evidence to reject $H_0$.
*t* tests derive a statistic that is supposed to follow the [Student's *t* distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) under $H_0$:
At high degrees of freedom, the *t* distribution approaches the normal distribution. At lower degrees of freedom, the *t* distribution exhibits heavier tails and is less sensitive to extreme values.
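As a quick illustration (not part of the original analysis), the tail probability beyond $2$ shrinks towards the normal one as the degrees of freedom grow:

``` python
from scipy import stats

# Tail probability P(T > 2) for increasing degrees of freedom:
for df in (2, 5, 30, 1000):
    print(df, stats.t.sf(2, df))
# ...approaches the normal tail probability:
print('normal', stats.norm.sf(2))
```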
<code>scipy.stats</code> provides a <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html">sem</a> function as an addition to <code>numpy</code>'s <a href="https://numpy.org/doc/stable/reference/generated/numpy.std.html">std</a>. <code>sem</code> is unbiased by default.
</div>
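A quick check of this, on hypothetical data `x`:

``` python
import numpy as np
from scipy import stats

x = np.array([4.1, 5.0, 3.8, 4.4, 4.9])
# stats.sem uses ddof=1 (unbiased) by default, whereas np.std defaults to ddof=0:
print(stats.sem(x))
print(np.std(x, ddof=1) / np.sqrt(len(x)))  # same value
```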
`scipy`'s one-sample *t* test is [ttest_1samp](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html):
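For example, with hypothetical measurements `x` and an expected population mean of $4.5$:

``` python
import numpy as np
from scipy import stats

x = np.array([4.1, 5.2, 3.8, 4.4, 4.9, 5.1, 4.7, 4.2])
# Two-sided by default; H0: the population mean equals 4.5
stats.ttest_1samp(x, popmean=4.5)
```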
If we do not mind a negative difference (resp. positive difference), *i.e.* we consider the danger zone to begin only above (resp. below) the expected value, we can make the test one-sided to gain statistical power.
To this end, we must choose and specify which side: here `'greater'` (resp. `'less'`):
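Continuing the hypothetical example above, testing only for a mean greater than $4.5$:

``` python
import numpy as np
from scipy import stats

x = np.array([4.1, 5.2, 3.8, 4.4, 4.9, 5.1, 4.7, 4.2])
# H1: the population mean is greater than 4.5 (one-sided)
stats.ttest_1samp(x, popmean=4.5, alternative='greater')
```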
`scipy`'s *t* test for independent samples uses the statistic $t=\frac{\bar{X_1}-\bar{X_2}}{\sqrt{(\frac{1}{n_1}+\frac{1}{n_2})\mbox{ }\textrm{PooledVariance}}}$ with $\textrm{PooledVariance} = \frac{1}{n_1+n_2-2}\sum_{j\in\{1,2\}}\sum_i (x_{ij}-\bar{x_j})^2$ and is available as [ttest_ind](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html):
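A minimal sketch with two hypothetical independent groups:

``` python
import numpy as np
from scipy import stats

group1 = np.array([4.1, 5.2, 3.8, 4.4, 4.9, 5.1])
group2 = np.array([5.4, 6.1, 5.0, 5.8, 6.3, 5.5])
# Student's t test; equal variances are assumed by default:
stats.ttest_ind(group1, group2)
```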
1. Had we had a more precise hypothesis, *e.g.* $\bar{X_2} > \bar{X_1}$, we could have made a one-sided test that would have successfully rejected $H_0$.
...but this should be defined prior to carrying out any test!
More important than the *p*-value is the *effect size*. A common measure of effect size for two independent samples is Cohen's $d$: $d = \frac{\bar{X_2}-\bar{X_1}}{\sqrt{\textrm{PooledVariance}}}$
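`scipy.stats` does not compute Cohen's $d$ itself; a minimal sketch following the formula above, on the hypothetical groups from the previous example:

``` python
import numpy as np

group1 = np.array([4.1, 5.2, 3.8, 4.4, 4.9, 5.1])
group2 = np.array([5.4, 6.1, 5.0, 5.8, 6.3, 5.5])

n1, n2 = len(group1), len(group2)
pooled_variance = ((n1 - 1) * group1.var(ddof=1)
                   + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
d = (group2.mean() - group1.mean()) / np.sqrt(pooled_variance)
d
```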
We could have found a statistically significant effect of size $0.3$ that is nonetheless too small to be of practical interest.
Measurements of effect size were proposed together with [tables](https://core.ecu.edu/wuenschk/docs30/EffectSizeConventions.pdf) for interpreting size values. For example, for Cohen's $d$:
| $|d|$ | size of effect |
| :-: | :-- |
| $0.2$ | small |
| $0.5$ | medium |
| $0.8$ | large |
2. Had we found enough evidence to reject $H_0$, we could have concluded about an *association* between the mutation and the observed effect.
To further conclude in terms of *causation*, it is necessary to rule out all possible [confounders](https://en.wikipedia.org/wiki/Confounding) (supplier, cage effect, etc.).
3. `scipy`'s implementation does not require equal numbers of observations per group.
However, it assumes the groups are normally distributed (although it is relatively robust to all but «extreme non-normality») and, more importantly, that they have similar variances ($0.5<\frac{s_{X_1}}{s_{X_2}}<2$).
For heterogeneous groups, `ttest_ind` embarks various variants of the *t* test that can be selected with additional arguments:
* Welch's *t* test with `equal_var=False`;
* Yuen's *t* test with `equal_var=False` and `trim=0.2` (requires more data).
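For instance, on hypothetical groups with unequal variances (the second group contains an extreme value):

``` python
import numpy as np
from scipy import stats

group1 = np.array([4.1, 5.2, 3.8, 4.4, 4.9, 5.1])
group2 = np.array([5.4, 6.1, 5.0, 5.8, 6.3, 9.5])

# Welch's t test (no equal-variance assumption):
print(stats.ttest_ind(group1, group2, equal_var=False))
# Yuen's t test (additionally trims 20% of each tail):
print(stats.ttest_ind(group1, group2, equal_var=False, trim=0.2))
```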
This is actually a one-sample *t* test of the between-group differences against a population mean equal to zero (compare [1](https://github.com/scipy/scipy/blob/v1.7.1/scipy/stats/stats.py#L6450-L6460) and [2](https://github.com/scipy/scipy/blob/v1.7.1/scipy/stats/stats.py#L5647-L5656)).
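This equivalence can be checked directly, with hypothetical paired measurements:

``` python
import numpy as np
from scipy import stats

before = np.array([4.1, 5.2, 3.8, 4.4, 4.9, 5.1])
after = np.array([4.6, 5.3, 4.4, 4.5, 5.6, 5.2])

print(stats.ttest_rel(before, after))                # paired t test
print(stats.ttest_1samp(before - after, popmean=0))  # same statistic and p-value
```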
Comparing three or more group means reads $H_0: \bar{X_0} = \bar{X_1} = ... = \bar{X_k}$ and is usually carried out with an *analysis of variance*.
The total variance ($SS_{\textrm{total}}$) is decomposed as the sum of two terms: *within-group* variance ($SS_{\textrm{error}}$) and *between-group* variance ($SS_{\textrm{treatment}}$).
The statistic $F = \frac{MS_{\textrm{treatment}}}{MS_{\textrm{error}}}$ follows Fisher's [F](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html) distribution under $H_0$.
More about it at: https://www.coursera.org/learn/stanford-statistics/lecture/pskeN/the-idea-of-analysis-of-variance
The most basic [implementation](https://github.com/scipy/scipy/blob/v1.7.1/scipy/stats/mstats_basic.py#L2937-L2967) of the one-way ANOVA in SciPy is [f_oneway](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.f_oneway.html):
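A minimal sketch with three hypothetical groups:

``` python
import numpy as np
from scipy import stats

g1 = np.array([4.1, 5.2, 3.8, 4.4, 4.9])
g2 = np.array([5.4, 6.1, 5.0, 5.8, 6.3])
g3 = np.array([4.8, 5.5, 4.6, 5.1, 5.7])

# H0: the three group means are equal
stats.f_oneway(g1, g2, g3)
```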
The ANOVA is an *omnibus* test and does not tell which groups exhibit differing means. Specific differences are later identified using *post-hoc tests* (more about this in the next session).
Mentioned for completeness: Cohen's $f=\sqrt{\frac{R^2}{1 - R^2}}=\sqrt{\frac{SS_{\textrm{treatment}}}{SS_{\textrm{error}}}}$ and the [$\sqrt{F}$ root mean square effect](https://en.wikipedia.org/wiki/Effect_size#%CE%A8,_root-mean-square_standardized_effect) are suitable for one-way ANOVA but not widely used, as post-hoc tests give a more natural approach to effect sizes.
`statsmodels` features [effectsize_oneway](https://www.statsmodels.org/stable/generated/statsmodels.stats.oneway.effectsize_oneway.html).
Visually checking for desired properties like normality or equal variance is acceptable, especially if the data are generally known to exhibit these properties.
### Normality
Having this property is usually not critical, because most tests are fairly robust to non-normality.
We only need to avoid cases of «extreme non-normality».
Probability plots with [probplot](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html) (or [statsmodels.api.qqplot](https://www.statsmodels.org/stable/generated/statsmodels.graphics.gofplots.qqplot.html) with one sample):
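For example, a probability plot of hypothetical normally-distributed data:

``` python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=1, size=50)

# Points close to the reference line suggest the sample is plausibly normal
stats.probplot(x, dist='norm', plot=plt)
plt.show()
```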
* D'Agostino's test: [normaltest](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.normaltest.html), preferably for large samples ($n>20$),
* Similar test for skewness only: [skewtest](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.skewtest.html) ($n\ge8$),
* Bartlett's test: [bartlett](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bartlett.html), most basic and common test,
* Levene's test: [levene](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html), better for skewed or heavy-tailed distributions,
* ...and others: Fligner-Killeen's test ([fligner](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fligner.html)), Ansari-Bradley's test ([ansari](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ansari.html)), etc.
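A sketch of how these tests are called, on hypothetical data (the three groups are those of the ANOVA example above):

``` python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=1, size=50)
# H0: the sample is drawn from a normal distribution
print(stats.normaltest(x))

g1 = np.array([4.1, 5.2, 3.8, 4.4, 4.9])
g2 = np.array([5.4, 6.1, 5.0, 5.8, 6.3])
g3 = np.array([4.8, 5.5, 4.6, 5.1, 5.7])
# H0: the groups have equal variances
print(stats.bartlett(g1, g2, g3))
print(stats.levene(g1, g2, g3))
```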
In the above example, as there is not enough evidence to reject $H_0$ ($p>0.05$), we can proceed with a standard one-way ANOVA. Otherwise, we would go for the Alexander-Govern test or Welch's *F* test instead.
The Alexander-Govern test is available in `scipy` as [alexandergovern](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.alexandergovern.html), but Welch's *F* test is not (neither in `scipy.stats` nor in `statsmodels`). Install the `Pingouin` package and try out the [welch_anova](https://pingouin-stats.org/generated/pingouin.welch_anova.html) function instead.
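A sketch of both options on the hypothetical groups above; note that `pingouin`'s `welch_anova` expects a long-format `DataFrame`:

``` python
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg

g1 = np.array([4.1, 5.2, 3.8, 4.4, 4.9])
g2 = np.array([5.4, 6.1, 5.0, 5.8, 6.3])
g3 = np.array([4.8, 5.5, 4.6, 5.1, 5.7])

# Alexander-Govern test (no equal-variance assumption):
print(stats.alexandergovern(g1, g2, g3))

# Welch's F test with pingouin:
df = pd.DataFrame({
    'value': np.concatenate([g1, g2, g3]),
    'group': ['g1'] * 5 + ['g2'] * 5 + ['g3'] * 5,
})
pg.welch_anova(data=df, dv='value', between='group')
```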
Bartlett's test statistic follows $\chi^2_{k-1}$ with $k$ the number of groups. As with most tests based on the $\chi^2$ distribution, the *p*-value is one-sided.
When the sum of the observations is known, *e.g.* the observations are frequencies (proportions that sum to $1$), we use a $\chi^2$ test instead of an ANOVA.
Example: comparing the frequencies of the different allele variants at a given locus between a reference genome and a test genome.
Another popular example: [Color proportion of M&Ms [Coursera]](https://www.coursera.org/learn/stanford-statistics/lecture/rAwbR/the-color-proportions-of-m-ms):
`scipy.stats`'s $\chi^2$ test for homogeneity/independence is [chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html):
Note: frequencies in the $\chi^2$ tests are summary statistics and play the role of a sample size. They are NOT treated as measurements of a variable, although they could be, at another conceptual level (e.g. population of the bags of M&Ms).
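A minimal sketch with a hypothetical $2 \times 3$ table of allele counts (rows: genomes; columns: variants):

``` python
import numpy as np
from scipy import stats

observed = np.array([[24, 51, 25],
                     [33, 42, 25]])

# H0: the variant proportions are the same in both genomes
chi2, p, dof, expected = stats.chi2_contingency(observed)
chi2, p, dof
```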
The [pearsonr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) function computes the Pearson correlation coefficient together with a *p*-value:
The correlation coefficient is a commonly-used effect size for the linear relationship between the two variables, similarly to (but not to be confused with) a regression coefficient:
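For example, on two hypothetical, linearly-related variables:

``` python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = 0.8 * x + rng.normal(scale=0.5, size=40)

# Pearson's r and the p-value for H0: no linear correlation
r, p = stats.pearsonr(x, y)
r, p
```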