Commit 1dbef0a4 by François LAURENT

### first day-3 exercise

parent a460c148
 %% Cell type:code id:8ccafd3d tags:  python import sys !"{sys.executable}" -m pip install scipy  %%%% Output: stream Requirement already satisfied: scipy in /home/flaurent/.local/lib/python3.8/site-packages (1.7.1) Requirement already satisfied: numpy<1.23.0,>=1.16.5 in /home/flaurent/.local/lib/python3.8/site-packages (from scipy) (1.21.1) %% Cell type:markdown id:6d2c6ebd tags:

Statistical tests with the SciPy library

[SciPy](https://docs.scipy.org/doc/scipy/reference/) is a collection of mathematical tools covering diverse fields, with functionality split across several modules: %% Cell type:code id:42342d74 tags:  python from scipy import ( cluster, # Clustering algorithms constants, # Physical and mathematical constants fftpack, # Fast Fourier Transform routines integrate, # Integration and ordinary differential equation solvers interpolate, # Interpolation and smoothing splines io, # Input and Output linalg, # Linear algebra ndimage, # N-dimensional image processing odr, # Orthogonal distance regression optimize, # Optimization and root-finding routines signal, # Signal processing sparse, # Sparse matrices and associated routines spatial, # Spatial data structures and algorithms special, # Special functions stats, # Statistical distributions and functions )  %% Cell type:markdown id:1091dfd5 tags: A reminder about module loading. Example: how do we access the ttest_ind function defined in the scipy.stats module? %% Cell type:code id:898957c8 tags:  python %%script echo skipping import scipy.stats scipy.stats.ttest_ind from scipy import stats stats.ttest_ind from scipy.stats import * ttest_ind  %%%% Output: stream skipping %% Cell type:markdown id:9e2fe344 tags: scipy.stats content (see the [official documentation](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#module-scipy.stats)): * [Probability distributions](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#probability-distributions) * [Summary statistics](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#summary-statistics) * [Frequency statistics](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#frequency-statistics) * [Correlation functions](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#correlation-functions) * [Statistical tests](https://docs.scipy.org/doc/scipy/reference/reference/stats.html#statistical-tests) * ... 
scipy.stats covers basic functionality; we will occasionally mention the statsmodels and pingouin libraries where we hit the limitations of scipy.stats. %% Cell type:markdown id:0d715671-e4ad-4d4d-b60f-9397e2b80b90 tags: ## Overview %% Cell type:markdown id:5162d7b2-625a-4699-922e-92c5c2bfa769 tags: We will briefly review statistical tests: * Student $t$ tests * compare a sample against the population mean * compare two independent samples * compare paired samples * analyses of variance (one-way) * compare more than two groups * tests for other tests' assumptions * normality tests * homoscedasticity tests * $\chi^2$ tests for discrete variables * goodness-of-fit test * homogeneity and independence tests * correlation coefficients %% Cell type:markdown id:f1be90cf tags: ## What Python can do -- What Python cannot %% Cell type:markdown id:01c6c854 tags: ### Experimental design Some vocabulary: | term | description | | --: | :-- | | **treatment** | every procedure or experimental condition that departs from a control condition (and the control itself is a special-case treatment) | | **population** | a set of individuals we want our conclusions to apply to | | **sample** | a finite set of selected individuals assumed to be representative of a population | It is always good to get a reminder about [general considerations](https://www.coursera.org/learn/stanford-statistics/home/welcome), __prior to data collection__ and analysis: * sampling from the population, * identifying the sources of variability, etc. %% Cell type:markdown id:23c92344 tags: ### Workflow %% Cell type:markdown id:35bcd5a9-fd6f-4914-96a7-734d240efd8e tags: To determine whether there is *sufficient evidence* to conclude the treatment has an effect, we use *statistical tests*. However, because experimental designs are often complex and involve multiple treatments and additional sources of variability, most studies also involve multiple tests, usually carried out after a so-called *omnibus* test. In addition, every statistical test makes various assumptions that in turn need to be checked. As a consequence, every statistical analysis involves a series of tests and procedures.
Example workflow adapted from...?
Contrary to statistical software with a GUI, the tools featured in programming languages such as Python do not offer much guidance in following such a workflow. [Workflow diagram: if $n < 20$, use a non-parametric omnibus test (Kruskal-Wallis, Friedman); otherwise run a normality test: if its $H_0$ is rejected, try a data transform, falling back to the non-parametric tests on failure. Then run a homoscedasticity test: standard ANOVA if its $H_0$ is not rejected, a robust F test (Welch, Yuen, Alexander-Govern) otherwise. If the omnibus test rejects $H_0$, proceed to post-hoc tests.] %% Cell type:markdown id:e4db04ef tags: ### Replicability %% Cell type:markdown id:05a435dd tags: Not covered: tools and resources to support code and data management practices, some of which are provided by the Python ecosystem. %% Cell type:markdown id:d6b0c2ee-4f6b-49db-8c26-dd64aede72fb tags: ## Data exploration %% Cell type:markdown id:f911c720-5dbd-4e00-bb2b-329c4a3cd48e tags: Each measurement or variable should be checked for undesirable properties, among others: * completeness (undefined values?) * extreme values (outliers?) * modes (multiple?) 
Let us import a few more libraries and load some example data: %% Cell type:code id:65104f3d tags:  python import numpy as np import pandas as pd from matplotlib import pyplot as plt import seaborn as sns pd.options.display.max_columns = None dataframe = pd.read_csv('../data/mi.csv', index_col=0) dataframe  %%%% Output: execute_result Age OwnsHouse PhysicalActivity Sex LivesWithPartner LivesWithKids \ 1 22.33 Yes 3.0 Female No No 2 28.83 Yes 0.0 Female Yes No 3 23.67 Yes 0.0 Female Yes No 4 21.17 No 0.5 Female No No 5 26.17 Yes 1.5 Female No No .. ... ... ... ... ... ... 812 21.42 Yes 6.0 Male No No 813 38.33 No 1.5 Female Yes Yes 814 32.75 Yes 2.5 Female No No 815 39.17 No 4.0 Male No Yes 816 59.00 No 0.0 Male Yes Yes BornInCity Inbreeding BMI CMVPositiveSerology FluIgG \ 1 Yes 94.9627 20.13 No 0.464319 2 Yes 79.1024 21.33 Yes -0.049817 3 Yes 117.2540 22.18 No 0.332944 4 No 94.1796 18.68 No 0.404886 5 Yes 105.1250 29.01 No -0.303782 .. ... ... ... ... ... 812 Yes 73.0073 24.75 No 0.210671 813 No 79.7187 23.53 No 0.354790 814 No 86.6744 20.07 No 0.252037 815 Yes 98.9744 22.77 Yes 0.264940 816 Yes 84.6433 27.39 Yes 0.281522 MetabolicScore LowAppetite TroubleConcentrating TroubleSleeping \ 1 0 0 0 1.0 2 1 0 0 1.0 3 2 0 0 1.0 4 0 0 0 2.0 5 1 0 0 1.0 .. ... ... ... ... 812 0 0 0 3.0 813 1 0 0 2.0 814 0 0 0 1.0 815 0 0 0 2.0 816 1 0 0 0.0 HoursOfSleep Listless UsesCannabis RecentPersonalCrisis Smoking \ 1 9.00 3 No No Never 2 7.05 3 No No Active 3 6.50 3 Yes No Active 4 10.00 3 No No Never 5 9.00 0 No No Never .. ... ... ... ... ... 812 7.50 0 No No Never 813 9.00 0 No Yes Ex 814 8.00 3 No Yes Never 815 7.50 0 No No Never 816 8.00 0 No No Never Employed Education DustExposure Income HadMeasles HadRubella \ 1 No PhD No (1000-2000] No No 2 Yes Baccalaureat No (2000-3000] No No 3 Yes Baccalaureat Current (2000-3000] No No 4 No PhD No (3000-inf] No No 5 Yes Baccalaureat No [0-1000] No No .. ... ... ... ... ... ... 
812 Yes PhD Current (3000-inf] No No 813 Yes Baccalaureat No (3000-inf] Yes Yes 814 No Baccalaureat No (1000-2000] No No 815 Yes PhD No (2000-3000] No No 816 No Baccalaureat No (2000-3000] No No HadChickenPox HadMumps HadTonsillectomy HadAppendicectomy VaccineHepA \ 1 Yes No No No No 2 Yes No No No No 3 Yes No No No No 4 Yes No No No No 5 No No No No No .. ... ... ... ... ... 812 Yes No No No No 813 Yes No No No No 814 No Yes No No No 815 No Yes No No No 816 No No Yes No No VaccineMMR VaccineTyphoid VaccineWhoopingCough VaccineYellowFever \ 1 No No Yes No 2 No No Yes No 3 No No No No 4 No No No No 5 No No Yes No .. ... ... ... ... 812 No No No No 813 No No No No 814 No No No No 815 Yes No No No 816 No No No No VaccineHepB VaccineFlu SUBJID DepressionScore HeartRate Temperature \ 1 Yes No 2 0.0 66 36.8 2 Yes No 3 0.0 66 37.4 3 Yes No 4 0.0 62 36.9 4 Yes No 5 1.0 64 36.0 5 Yes No 8 0.0 67 36.7 .. ... ... ... ... ... ... 812 Yes No 5167 0.0 44 36.6 813 Yes No 5219 0.0 54 36.7 814 Yes No 5279 3.0 62 36.3 815 No No 5303 0.0 39 36.3 816 No No 5701 0.0 61 36.1 HourOfSampling DayOfSampling 1 8.883 40 2 9.350 40 3 8.667 40 4 9.883 40 5 8.550 81 .. ... ... 812 8.550 270 813 9.217 229 814 9.800 278 815 9.583 277 816 9.317 241 [816 rows x 43 columns] %% Cell type:markdown id:ae8b5f34 tags:
The most important datum in a dataset is the sample size $n$. Here $n=816$.
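As a first sanity check, pandas can count observations and undefined values directly. A minimal sketch on a toy frame (the column names mimic the data above; with the real data you would pass `dataframe` itself):

```python
import numpy as np
import pandas as pd

# toy stand-in for the `dataframe` loaded above
df = pd.DataFrame({'Age': [22.33, 28.83, np.nan, 21.17],
                   'BMI': [20.13, 21.33, 22.18, np.nan]})

n = len(df)                    # sample size
missing = df.isna().sum()      # undefined values, per column
n_complete = len(df.dropna())  # rows without any missing value

print(n, n_complete)
print(missing)
```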
 %% Cell type:markdown id:52848e98 tags: ### Outliers %% Cell type:markdown id:e1a2ea78-087a-4cd1-bdf1-ba67db7067de tags: For example, undefined values are sometimes encoded as an otherwise-meaningless value: %% Cell type:code id:69922361-f828-44ad-bb36-402f355dd0cb tags:  python X = 'Age' x = dataframe[X] x_noisy = x.copy() x_noisy.name = f'{X} [corrupted]' x_noisy.iloc[[ 29, 30, 73, 76, 82, 97, 112, 117, 136, 166, 196, 209, 253, 260, 266, 339, 345, 355, 414, 446, 453, 469, 486, 502, 505, 517, 579, 592, 614, 618, 623, 630, 664, 669, 683, 718, 786, 797, 804, 810, ]] = 0 sns.histplot(x=x_noisy) sns.rugplot(x=x_noisy, color='r');  %%%% Output: display_data %% Cell type:code id:ac5a7020-b903-4196-8db6-31b90070d22b tags:  python Y = 'PhysicalActivity' y = dataframe[Y].copy() y[:] = np.log(1+y) y.name = f'log( {Y} )' _, axes = plt.subplots(1, 2, figsize=(13.3,4.1)) ax = axes[0] sns.regplot(x=x_noisy, y=y, line_kws=dict(color='r'), ax=ax) ax = axes[1] sns.regplot(x=x, y=y, line_kws=dict(color='r'), ax=ax);  %%%% Output: display_data %% Cell type:markdown id:b4f4e471-ec89-46ef-8984-fbf7b5afe84f tags: Outliers may introduce large biases and strongly impact the outcome of the analyses. %% Cell type:markdown id:91a55884-2ca8-4cb0-9190-6c42bea7047b tags: ### Modes %% Cell type:markdown id:4c483874-e9db-440a-b1c8-7dd47a06008a tags: In many cases, we want the values for variables of interest to exhibit a single *mode*. Otherwise, estimates of central tendency (mean or median) may not be representative of the overall sample. 
 %% Cell type:code id:578edd4a-bcb1-4f9e-a8ba-d3e63bbabb71 tags:  python X = 'HoursOfSleep' Y = 'Age' y = dataframe[Y][dataframe[X] > 8] sns.histplot(y, bins=20) plt.axvline(y.mean(), linestyle=':', color='red'); plt.axvline(y.median(), linestyle=':', color='green');  %%%% Output: display_data %% Cell type:markdown id:c101eeb7-5580-4a75-889d-0fd2e965c7d6 tags: ### Another example %% Cell type:code id:72a790bd tags:  python X = 'HoursOfSleep' # variable x = dataframe[X] # measurements  %% Cell type:code id:d6ab1d43 tags:  python sns.histplot(x=x, bins=10) sns.rugplot(x=x, color='r');  %%%% Output: display_data %% Cell type:markdown id:994df34d tags: #### Fitting a normal distribution If $X$ follows a normal distribution: $X \sim \mathcal{N}(\mu, \sigma^2)$ then we can use the **empirical mean** and **unbiased variance** as estimators for $\mu$ and $\sigma^2$ respectively. scipy.stats exposes the norm function to define a normal distribution, parameterized by the mean and the standard deviation (not the variance): %% Cell type:code id:92b00d5d tags:  python normal_distribution = stats.norm(x.mean(), x.std()) # the scale argument is the standard deviation, not the variance  %% Cell type:markdown id:5eefc50c-44fa-403b-92c6-16aec1204d26 tags: If unsure about how to fit the distribution parameters to the data, you can alternatively use fit: %% Cell type:code id:4f114ad2-9cd2-431a-8b3e-2283006ef970 tags:  python normal_distribution = stats.norm(*stats.norm.fit(x))  %% Cell type:markdown id:cef7bb4e-38b9-4abc-97e4-419a9650173b tags: The object returned by norm features several distribution-related functions, *e.g.* pdf: %% Cell type:code id:8e114e17-8edb-4674-9309-c6f566885c0a tags:  python sns.histplot(x=x, bins=10, stat='density') sns.rugplot(x=x, color='r') grid = np.linspace(x.min(), x.max(), 100) plt.plot(grid, normal_distribution.pdf(grid), 'g-');  %%%% Output: display_data %% Cell type:code id:6eaa04b5 tags:  python sns.histplot(x=x, bins=10, stat='density') sns.rugplot(x=x, color='r') sns.kdeplot(x, clip=(x.min(), x.max()), color='g', linestyle='-');  %%%% Output: display_data %% Cell type:code 
id:ddfe2b2f tags:  python sns.histplot(x=x, stat='density') sns.rugplot(x=x, color='r') sns.kdeplot(x, clip=(x.min(), x.max()), color='g', linestyle='-');  %%%% Output: display_data %% Cell type:markdown id:cb8febd8 tags: We can check whether HoursOfSleep follows a normal distribution with the normaltest function: %% Cell type:code id:417bfc62 tags:  python stats.normaltest(x)  %%%% Output: execute_result NormaltestResult(statistic=17.050178832107033, pvalue=0.00019842697373015917) %% Cell type:markdown id:655b0fc0 tags: ...and with the kstest function: %% Cell type:code id:5b1b8872 tags:  python stats.kstest(x, normal_distribution.cdf)  %%%% Output: execute_result KstestResult(statistic=0.17414497571515752, pvalue=4.115192819897108e-22) %% Cell type:markdown id:ddeacee0-2062-476d-b7ec-9e0f036d6c4b tags: ## Statistical testing %% Cell type:markdown id:5f2fcb2e-1097-4bb7-ba77-fceaa3b55702 tags: > What did we do? We compared our **observations** x with some **expectation**. We actually formulated a so-called *null hypothesis*, denoted $H_0$, that models the situation such that "nothing is going on", *i.e.* the observations meet the expectation. We also implicitly defined an alternative hypothesis, usually denoted $H_1$ or $H_A$, that can simply be the opposite of $H_0$. For example: $$\left\{ \begin{array}{ l l l } H_0: & X \sim \mathcal{N}(\mu, \sigma^2) & \mbox{with } \mu \mbox{ assumed to be } \bar{x} \mbox{ and } \sigma^2 \mbox{ as } \frac{1}{n-1}\sum_{i=0}^{n-1} (x_i - \bar{x})^2 \\ H_A: & \mbox{not } H_0 \end{array} \right.$$ A test consists in contrasting the two incompatible hypotheses. 
If we had a single observation – say $z=1.4$ – to compare with a distribution – say $\mathcal{N}(0,1)$ – we would simply compute the probability of drawing a value at least as extreme from this distribution: %% Cell type:code id:8346ad8e tags:  python z = 1.4 N = stats.norm(0, 1) onesided_pvalue = N.sf(z) # sf = survival function twosided_pvalue = 2 * min(N.cdf(z), N.sf(z))  %% Cell type:code id:404476b6 tags:  python grid = np.linspace(N.ppf(.001), N.ppf(.999), 100); pdf_curve, = plt.plot(grid, N.pdf(grid), 'r-'); z_line, = plt.plot([z, z], [0, N.pdf(z)], '-', zorder=1); tail = grid[z<=grid]; plt.fill_between(tail, np.zeros_like(tail), N.pdf(tail), alpha=.2); plt.axvline(0, linestyle='--', color='grey', linewidth=1); plt.axhline(0, linestyle='--', color='grey', linewidth=1); plt.xlim(grid[[0,-1]]); plt.xlabel('$X$'); plt.ylabel('probability density'); plt.legend([pdf_curve, z_line], ['$\mathcal{N}(0,1)$', '$z$']); plt.annotate(f'$\\approx {onesided_pvalue:.2f}$', (1.8, .03), xytext=(2, .13), arrowprops=dict(arrowstyle="->"));  %%%% Output: display_data %% Cell type:markdown id:99a007ac tags: In practice, all tests boil down to comparing a single value with a reference distribution. Basically, a test expresses the discrepancy between the observations and the expectation in the shape of a *statistic*, and this statistic is supposed to follow a given distribution under $H_0$. This is used as a basis to calculate a *p*-value that estimates the probability of erroneously rejecting $H_0$. The experimenter also defines a significance level $\alpha$, with common values $\alpha=0.05$ or $0.01$, that sets the maximum tolerated risk of making a *type-1 error*, *i.e.* of rejecting $H_0$ by chance. 
If the obtained *p*-value is lower than $\alpha$, then they can conclude there is sufficient evidence to reject $H_0$. %% Cell type:code id:b162e92e tags:  python %%html
Conclusion about $H_0$ from the statistical test (columns), against the truth about $H_0$ in the population (rows):

| Truth about $H_0$ in the population | accept | reject |
| :-- | :-: | :-: |
| true | Correct | Type 1 error: observe a difference when none exists |
| false | Type 2 error: fail to observe a difference when one exists | Correct |

https://faculty.nps.edu/rbassett/_book/hypothesis-testing-one-sample.html#fig:errorsHypTesting
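The type-1 error rate can be illustrated by simulation (a sketch, not part of the original notebook): when $H_0$ is true, repeated experiments should reject $H_0$ in roughly a fraction $\alpha$ of the runs:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_runs = 2000

# H0 is true by construction: every sample is drawn from N(0, 1)
# and tested against the true population mean mu=0
rejections = sum(
    stats.ttest_1samp(rng.normal(0., 1., size=30), 0.).pvalue < alpha
    for _ in range(n_runs)
)
print(rejections / n_runs)  # empirical type-1 error rate, close to alpha
```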
 %%%% Output: display_data %% Cell type:markdown id:446ba63e-df67-46ca-8921-eca686462c93 tags: ## *t* tests %% Cell type:markdown id:8f9602b8-5f64-438b-8051-c71f9c8b0b14 tags: *t* tests derive a statistic that is supposed to follow the [Student's *t* distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) under $H_0$: %% Cell type:code id:67d046f0-ca64-4a12-8e32-58b8e7e142a0 tags:  python grid = np.linspace(-3.1, 3.1, 100) dfs = [1, 2, 5, 20] for df, color in zip( dfs, ['blue', 'green', 'orange', 'red'], ): t = stats.t(df) plt.plot(grid, t.pdf(grid), '-', color=color) plt.axvline(0, linestyle='--', color='grey', linewidth=1) plt.axhline(0, linestyle='--', color='grey', linewidth=1) plt.xlim(grid[[0,-1]]) plt.xlabel('$t$') plt.ylabel('probability density') plt.legend([ f'$df={df}$' for df in dfs ]);  %%%% Output: display_data %% Cell type:markdown id:701ee597-2eec-4404-8a0e-601ae2121e19 tags: At high degrees of freedom, the *t* distribution approaches the normal distribution. At lower degrees of freedom, the *t* distribution exhibits heavier tails and is less sensitive to extreme values. %% Cell type:markdown id:ca6bf548-cadf-4c75-8130-fe0c3ef8de9a tags: ### One-sample *t* test %% Cell type:markdown id:0e1a462c-5c60-4438-a339-7c284ff56e15 tags: This test compares a sample's central tendency (*sample mean*) with a reference value (*population mean*).
x = [49.5, 81.9, 64.0, 17.3, 59.8, 94.6, 69.9, 12.4]
μ = 50
Let us call $\mu$ this reference value. Our expectation is that the sample mean $\bar{X}$ is close enough to $\mu$. In other words, $H_0: \bar{X} = \mu$. The statistic is: $$\frac{\bar{X} - \mu}{\mathrm{SEM}} \mbox{ } \mbox{ } \mbox{ } \mbox{ } \mbox{ } \mbox{ } \mbox{ } \sim t(n-1) \mbox{ } \textrm{under} \mbox{ } H_0$$
scipy.stats provides a sem function (standard error of the mean) to complement numpy's std; sem uses the unbiased estimator (ddof=1) by default.
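The statistic above can be computed by hand and checked against scipy's implementation. A sketch, using the same sample as the cell below:

```python
import numpy as np
from scipy import stats

mu = 50
x = np.array([49.47257879, 81.93967205, 64.030398, 17.25423608,
              59.80082512, 94.56012514, 69.91672899, 12.39640637])

t = (x.mean() - mu) / stats.sem(x)  # (sample mean - mu) / SEM
df = len(x) - 1                     # degrees of freedom
p = 2 * stats.t(df).sf(abs(t))      # two-sided p-value

result = stats.ttest_1samp(x, mu)
print(np.isclose(t, result.statistic), np.isclose(p, result.pvalue))
```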
scipy's one-sample *t* test is [ttest_1samp](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html): scipy.stats.ttest_1samp(a, popmean, axis=0, nan_policy='propagate', alternative='two-sided') %% Cell type:code id:b471633d-c9ad-455e-84e1-d32af085b32a tags:  python mu = 50 x = np.array([49.47257879, 81.93967205, 64.030398, 17.25423608, 59.80082512, 94.56012514, 69.91672899, 12.39640637]) stats.ttest_1samp(x, mu)  %%%% Output: execute_result Ttest_1sampResult(statistic=0.6024056396957578, pvalue=0.5658990587680466) %% Cell type:markdown id:dac3b6db-2ad7-46b4-937f-9dda1015b7fc tags: If we do not mind a negative difference (resp. positive difference), *i.e.* we consider the danger zone to begin only above (resp. below) the expected value, we can make the test one-sided to gain statistical power. To this aim, we must choose and specify which side: here 'greater' (resp. 'less'): %% Cell type:code id:b732c69f-1851-4ef2-bdc2-8f80b4c5281e tags:  python stats.ttest_1samp(x, mu, alternative='greater')  %%%% Output: execute_result Ttest_1sampResult(statistic=0.6024056396957578, pvalue=0.2829495293840233) %% Cell type:code id:dc360e24-d676-4398-8e4b-dba71e9e68e8 tags:  python x.mean(), stats.sem(x)  %%%% Output: execute_result (56.1713713175, 10.244544391411772) %% Cell type:markdown id:2144869e-9e4a-4e63-ae58-81c5a6be7205 tags: ### *t* test for independent samples %% Cell type:markdown id:1e31b437-eceb-4da1-9ab3-f11d45b352c0 tags: This test compares the means of two samples or groups, *e.g.* a control sample and a sample from a mutated population: $H_0: \bar{X_1} = \bar{X_2}$.
x1 = [49.5, 81.9, 64.0, 17.3, 59.8, 94.6, 69.9, 12.4]
x2 = [64.2, 96.6, 101.9, 85.3, 66.5, 63.9, 127.6, 55.0]
scipy's *t* test for independent samples uses the statistic $t=\frac{\bar{X_1}-\bar{X_2}}{\sqrt{(\frac{1}{n_1}+\frac{1}{n_2})\mbox{ }\textrm{PooledVariance}}}$ with $\textrm{PooledVariance} = \frac{1}{n_1+n_2-2}\sum_{j\in\{1,2\}}\sum_i (x_{ij}-\bar{x_j})^2$ and is available as [ttest_ind](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html): scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate', permutations=None, random_state=None, alternative='two-sided', trim=0) %% Cell type:code id:6231e214-ac36-4c4f-8a16-e1551c8484b4 tags:  python x1 = x x2 = np.array([64.22723692, 96.56483856, 101.94191774, 85.31918879, 66.4952999, 63.88841224, 127.63861749, 55.00527005]) stats.ttest_ind(x1, x2)  %%%% Output: execute_result Ttest_indResult(statistic=-1.96174329619957, pvalue=0.06998888828308221) %% Cell type:markdown id:2703f3b1 tags: 1. Had we had a more precise hypothesis, *e.g.* $\bar{X_2} > \bar{X_1}$, we could have made a one-sided test that would have successfully rejected $H_0$. ...but this should be defined prior to carrying out any test! More important than the *p*-value is the *effect size*. 
A common measure of effect size for two independent samples is Cohen's $d$: $d = \frac{\bar{X_2}-\bar{X_1}}{\sqrt{\textrm{PooledVariance}}}$ %% Cell type:code id:2ba983bc-ef4c-4a15-ac23-776fffd88afb tags:  python def cohen_d(x1, x2): n1, n2 = len(x1), len(x2) m1, m2 = np.mean(x1), np.mean(x2) v1, v2 = np.var(x1), np.var(x2) # nb: np.var defaults to ddof=0 (biased estimator); the textbook pooled variance uses ddof=1 pooled_variance = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2) d = (m2 - m1) / np.sqrt(pooled_variance) return d cohen_d(x1, x2)  %%%% Output: execute_result 1.0485958993113402 %% Cell type:code id:3beb1fbb-a1ac-40aa-b4ce-b97f151392f5 tags:  python from matplotlib import pyplot as plt import ipywidgets as widgets from ipywidgets import interact def plot_pdfs(cohen_d): grid = np.linspace(-3, 3+cohen_d, 100) x1 = stats.norm(0, 1).pdf(grid) x2 = stats.norm(cohen_d, 1).pdf(grid) plt.fill_between(grid, x1, alpha=.5) plt.fill_between(grid, x2, alpha=.5) plt.show() slider = widgets.FloatSlider(.5, min=0, max=4, step=.5) interact(plot_pdfs, cohen_d=slider);  %%%% Output: display_data %% Cell type:markdown id:f8e1face-4845-459d-95c2-8674181d39b5 tags: 1. (...) We could have found a significant effect of size $0.3$, which may be of little practical interest given such a small effect size. Measures of effect size were proposed together with [tables](https://core.ecu.edu/wuenschk/docs30/EffectSizeConventions.pdf) for interpreting size values. For example, for Cohen's $d$: | $|d|$ | size of effect | | :-: | :-- | | $0.2$ | small | | $0.5$ | medium | | $0.8$ | large | 2. Had we found enough evidence to reject $H_0$, we could have concluded that there is an *association* between the mutation and the observed effect. To further conclude in terms of *causation*, it is necessary to rule out all possible [confounders](https://en.wikipedia.org/wiki/Confounding) (supplier, cage effect, etc.). 3. scipy's implementation does not require equal numbers of observations per group. 
However, it assumes the groups are normally distributed (though it is fairly robust to anything short of «extreme non-normality») and, more importantly, that they have similar variances ($0.5<\frac{s_{X_1}}{s_{X_2}}<2$). For heterogeneous variances, ttest_ind provides variants of the *t* test that can be selected with additional arguments: * Welch's *t* test with equal_var=False; * Yuen's *t* test with equal_var=False and trim=0.2 (requires more data).
Tip: always check for the underlying assumptions in the documentation.
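For instance, a sketch reusing the two samples above (values copied from the earlier cells; the trim argument requires scipy >= 1.7):

```python
import numpy as np
from scipy import stats

x1 = np.array([49.47257879, 81.93967205, 64.030398, 17.25423608,
               59.80082512, 94.56012514, 69.91672899, 12.39640637])
x2 = np.array([64.22723692, 96.56483856, 101.94191774, 85.31918879,
               66.4952999, 63.88841224, 127.63861749, 55.00527005])

# Welch's t test: does not assume equal variances
welch = stats.ttest_ind(x1, x2, equal_var=False)

# Yuen's t test: additionally trims 20% of each tail
yuen = stats.ttest_ind(x1, x2, equal_var=False, trim=0.2)

print(welch.pvalue, yuen.pvalue)
```

With equal group sizes, Welch's statistic coincides with the standard one; only the degrees of freedom, hence the *p*-value, differ.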
 %% Cell type:markdown id:1083c04c-1221-446f-84a0-7084413a7722 tags: ### *t* test for paired samples %% Cell type:markdown id:4743e7fc-b964-48d4-9e13-e78146626c5a tags: scipy's *t* test for paired samples is [ttest_rel](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html): scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate', alternative='two-sided') This is actually a one-sample *t* test of the paired differences against a population mean equal to zero (compare [1](https://github.com/scipy/scipy/blob/v1.7.1/scipy/stats/stats.py#L6450-L6460) and [2](https://github.com/scipy/scipy/blob/v1.7.1/scipy/stats/stats.py#L5647-L5656)). %% Cell type:markdown id:234dc53a-dce0-426e-8f57-4a4f9bfa38fc tags: #### Effect size %% Cell type:markdown id:479e8f62-c653-472d-b9d5-d3bf854d09ff tags: Adjusted Cohen's $d$: %% Cell type:code id:25adcf4f-6611-434f-b619-27f341b3caad tags:  python r, _ = stats.pearsonr(x1, x2) cohen_d(x1, x2) / np.sqrt(1 - r)  %%%% Output: execute_result 1.2657389481669015 %% Cell type:markdown id:6b0fd532-87d0-4747-b3a3-322d77c2bfbd tags: ## Analysis of variance %% Cell type:markdown id:8443a0ff-75ed-4d88-a4e8-a988e6de3716 tags: ### One-way ANOVA Comparing three or more group means reads $H_0: \bar{X_1} = \bar{X_2} = ... = \bar{X_k}$ and is usually carried out with an *analysis of variance*. The total variance ($SS_{\textrm{total}}$) is decomposed as the sum of two terms: *within-group* variance ($SS_{\textrm{error}}$) and *between-group* variance ($SS_{\textrm{treatment}}$). $$\underbrace{\sum_j\sum_i (x_{ij} - \bar{\bar{x}})^2}_{SS_{\textrm{total}}} = \underbrace{\sum_j\sum_i (\bar{x_j} - \bar{\bar{x}})^2}_{SS_{\textrm{treatment}}} + \underbrace{\sum_j\sum_i (x_{ij} - \bar{x_j})^2}_{SS_{\textrm{error}}}$$ Many statistical tools give the following detailed table: | Source | Degrees of
freedom | Sum of squares | Mean squares | $\mbox{ }F\mbox{ }$ | $p$-value | | :- | :-: | :-: | :-: | :-: | :-: | | Treatment | $k-1$ | $SS_{\textrm{treatment}}$ | $MS_{\textrm{treatment}}$ | $\frac{MS_{\textrm{treatment}}}{MS_{\textrm{error}}}$ | $\mbox{ }p\mbox{ }$ | | Error | $N-k$ | $SS_{\textrm{error}}$ | $MS_{\textrm{error}}$ | | | | Total | $N-1$ | $SS_{\textrm{total}}$ | | | | The statistic $F = \frac{MS_{\textrm{treatment}}}{MS_{\textrm{error}}}$ follows Fisher's [F](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html) distribution under $H_0$. More about it at: https://www.coursera.org/learn/stanford-statistics/lecture/pskeN/the-idea-of-analysis-of-variance The most basic [implementation](https://github.com/scipy/scipy/blob/v1.7.1/scipy/stats/mstats_basic.py#L2937-L2967) of the one-way ANOVA in SciPy is [f_oneway](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.f_oneway.html): %% Cell type:code id:2bf0aafa-d23f-44fe-b9e0-cf4bf13e7314 tags:  python A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80] B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96] C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81] stats.f_oneway(A, B, C)  %%%% Output: execute_result F_onewayResult(statistic=2.3575322551335636, pvalue=0.11384795345837218) %% Cell type:markdown id:164aa1ae-5446-4680-9571-c783edaf4101 tags: The ANOVA is an *omnibus* test and does not tell which groups exhibit differing means. Specific differences are later identified using *post-hoc tests* (more on this in the next session). %% Cell type:markdown id:8bf2c9ad-68fd-4f4b-b68c-180de69d3522 tags: ### Assumptions %% Cell type:markdown id:bd484cda-1d0b-4468-a092-f362744f8aa0 tags: The standard ANOVA requires the data to exhibit the following properties: * independent observations, * normally distributed residuals, * all groups have equal population variance (*homoscedasticity*), * at least 5 observations ($n \ge 5$) per group (ideally with equal group sizes). 
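To connect f_oneway's result with the table above, the $F$ statistic can be recomputed from the sums of squares. A sketch with the same A/B/C groups:

```python
import numpy as np
from scipy import stats

A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

groups = [np.asarray(g, dtype=float) for g in (A, B, C)]
all_x = np.concatenate(groups)
grand_mean = all_x.mean()

k = len(groups)  # number of groups
N = len(all_x)   # total number of observations

# between-group and within-group sums of squares
ss_treatment = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_treatment = ss_treatment / (k - 1)
ms_error = ss_error / (N - k)
F = ms_treatment / ms_error
p = stats.f(k - 1, N - k).sf(F)

print(F, p)  # matches stats.f_oneway(A, B, C)
```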
 %% Cell type:markdown id:0ea4e6c8-4851-4704-b9ca-6257239ab07b tags: #### Effect size %% Cell type:markdown id:39925882-deb4-43e1-8d82-9fb1e68df0ba tags: Mentioned for completeness: Cohen's $f=\sqrt{\frac{R^2}{1 - R^2}}=\sqrt{\frac{SS_{\textrm{treatment}}}{SS_{\textrm{error}}}}$ and the [$\sqrt{F}$ root mean square effect](https://en.wikipedia.org/wiki/Effect_size#%CE%A8,_root-mean-square_standardized_effect) are suitable for one-way ANOVA but not widely used, as post-hoc tests give a more natural approach to effect sizes. statsmodels features [effectsize_oneway](https://www.statsmodels.org/stable/generated/statsmodels.stats.oneway.effectsize_oneway.html). %% Cell type:markdown id:e38c759e-8807-4975-9620-9c2100b964a8 tags: ## Checking for common assumptions %% Cell type:markdown id:e9509dde-4edf-4d5b-bcb6-7bd6bbd87918 tags: Visually checking for desired properties like normality or equal variance is acceptable, especially if the data are generally known to exhibit these properties. ### Normality Having this property is usually not critical, because most tests are fairly robust to non-normality. We only need to avoid cases of «extreme non-normality». 
%% Cell type:markdown id:c5fa2537-0d3b-4b1d-bba4-a479cdf2c0a9 tags: #### Graphical approaches %% Cell type:markdown id:f0f983f0-6143-44b0-9eee-f6d54eb5cf51 tags: Probability plots with [probplot](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html) (or [statsmodels.api.qqplot](https://www.statsmodels.org/stable/generated/statsmodels.graphics.gofplots.qqplot.html) with one sample): %% Cell type:code id:6bcaa795-3a85-4b8c-aad9-d554e244a2ec tags:  python np.random.seed(1245619531) x_normal = stats.norm.rvs(loc=0, scale=1, size=30) # generate 30 observations from the standard normal distribution x_not_normal = stats.norm.rvs(loc=[-1,1], scale=[1,3], size=(15,2)).ravel() # generate 30 observations from a mixture of normal distributions _, axes = plt.subplots(1, 2, figsize=(13.3,4.1)) stats.probplot(x_normal, plot=axes[0]) stats.probplot(x_not_normal, plot=axes[1]);  %%%% Output: display_data %% Cell type:code id:e1f2356b-dd02-47d5-8e2c-82c77a6a5c97 tags:  python _, axes = plt.subplots(1, 2, figsize=(13.3,4.1)) axes[0].hist(x_normal, bins=7) axes[1].hist(x_not_normal, bins=7);  %%%% Output: display_data %% Cell type:markdown id:8f46e74c-ee1f-454c-a4a1-3e551899e7e7 tags: #### Normality tests %% Cell type:markdown id:81a1572d-1639-42d9-834c-2ba86c7c7dd3 tags: * D'Agostino's test: [normaltest](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.normaltest.html), preferably for large samples ($n>20$), * Similar test for skewness only: [skewtest](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.skewtest.html) ($n\ge8$), %% Cell type:code id:e964be07-0696-4ff8-844a-3e7d318a7f3a tags:  python skewed_dist = lambda sigma, x: np.exp( -.5*(np.log(x)/sigma)**2 ) / ( x*sigma*np.sqrt(2*np.pi) ) heavy_tailed_dist = lambda scale, x: stats.cauchy.pdf(x, 0, scale) colors = ['blue', 'green', 'orange', 'red'] _, axes = plt.subplots(1, 2, figsize=(13.3,4.1)) grid = np.linspace(0, 3, 100) grid = grid[1:] ax = axes[0] 
for sigma, color in zip((.25, .5, 1), colors): ax.plot(grid, skewed_dist(sigma, grid), '-', color=color, label=f'$\sigma={sigma:.2f}$') ax.axhline(0, linestyle='--', color='grey', linewidth=1) ax.set_xlim(grid[[0,-1]]) ax.set_title('skewness') grid = np.linspace(-4, 4, 100) ax = axes[1] for s, color in zip((.5, 1, 2), colors): ax.plot(grid, heavy_tailed_dist(s, grid), '-', color=color, label=f'$scale={s:.2f}$') ax.axvline(0, linestyle='--', color='grey', linewidth=1) ax.axhline(0, linestyle='--', color='grey', linewidth=1) ax.set_xlim(grid[[0,-1]]) ax.set_title('kurtosis') for ax in axes: ax.set_ylabel('probability density') ax.legend();  %%%% Output: display_data %% Cell type:markdown id:eed065d2-dd35-4fdb-ad91-8d387a06c7ea tags: * Shapiro-Wilk's test: [shapiro](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.shapiro.html), * Generic goodness-of-fit tests: [kstest](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.kstest.html) and [anderson](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.anderson.html). %% Cell type:markdown id:4c7a08d6-3c74-48ce-ad32-f164089b6ec7 tags: ### Equal variance (homoscedasticity) %% Cell type:markdown id:5e5a893e-6475-401b-9335-b117629b8d0a tags: #### Graphical approaches Simple per-group box plots. 
%% Cell type:code id:8a31b653-d891-4f07-ab8b-73562b2ed86d tags:

``` python
A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
df = pd.DataFrame(data=dict(A=A, B=B, C=C))
plt.boxplot(df, labels=df.columns);
```

%%%% Output: display_data

%% Cell type:markdown id:e985446d-d250-4d88-9a13-50a098cd016b tags:

#### Equality-of-variance tests

* Bartlett's test: [bartlett](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bartlett.html), the most basic and common test,
* Levene's test: [levene](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html), better suited to skewed or heavy-tailed distributions,
* ...and others: the Fligner-Killeen test ([fligner](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fligner.html)), the Ansari-Bradley test ([ansari](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ansari.html)), etc.

Example:

%% Cell type:code id:280180fa-4f20-44a6-a463-2ce9b0ad75a4 tags:

``` python
# copy-pasted from https://www.statology.org/bartletts-test-python/
A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

stats.bartlett(A, B, C)
```

%%%% Output: execute_result

BartlettResult(statistic=3.3024375753550457, pvalue=0.19181598314036113)

%% Cell type:markdown id:e906820a-af04-4a9d-992f-f2e07a7aed91 tags:

In the above example, as there is not enough evidence to reject $H_0$ ($p>0.05$), we can proceed with a standard one-way ANOVA. Otherwise, we would go for the Alexander-Govern test or Welch's *F* test instead.

The Alexander-Govern test is available in scipy as [alexandergovern](https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.stats.alexandergovern.html), but Welch's *F* test is not (neither in scipy.stats nor in statsmodels).
Install the Pingouin package and try out the [welch_anova](https://pingouin-stats.org/generated/pingouin.welch_anova.html) function instead.

Bartlett's test statistic follows $\chi^2_{k-1}$, with $k$ the number of groups. As with most tests based on the $\chi^2$ distribution, the *p*-value is one-sided.

%% Cell type:code id:b8123d48-7285-4fa3-8473-04f316dedffc tags:

``` python
grid = np.linspace(0, 20, 200)
dfs = [2, 3, 5, 9]

_, axes = plt.subplots(1, 2, figsize=(13.3,4.1))

ax = axes[0]
for df, color in zip(
        dfs,
        ['blue', 'green', 'orange', 'red'],
    ):
    chi2 = stats.chi2.pdf(grid, df)
    ax.plot(grid, chi2, '-', color=color)
ax.axhline(0, linestyle='--', color='grey', linewidth=1)
ax.set_xlim(grid[0], grid[-1])
ax.set_xlabel('$\chi^2$')
ax.set_ylabel('probability density')
ax.legend([ f'$df={df}$' for df in dfs ])

ax = axes[1]
df, color = 2, 'blue'
chi2 = stats.chi2.pdf(grid, df)
ax.plot(grid, chi2, '-', color=color)
ax.axhline(0, linestyle='--', color='grey', linewidth=1)
ax.set_xlim(grid[0], grid[-1])
ax.set_xlabel('$\chi^2$')
ax.set_ylabel('probability density');

A = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
B = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
C = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
bartlett_statistic, bartlett_pvalue = stats.bartlett(A, B, C)

bartlett_statistic_line, = ax.plot([bartlett_statistic]*2, [0, stats.chi2.pdf(bartlett_statistic, df)], '-', zorder=1)
tail = grid[bartlett_statistic<=grid]
ax.fill_between(tail, np.zeros_like(tail), stats.chi2.pdf(tail, df), alpha=.2)
ax.annotate(f'$\\approx {bartlett_pvalue:.2f}$', (4, .02), xytext=(8, .1), arrowprops=dict(arrowstyle="->"));
```

%%%% Output: display_data

%% Cell type:markdown id:c09c6222-4167-4e32-b6da-4abe3a8960de tags:

## $\chi^2$ tests

%% Cell type:markdown id:1ec71b98-8d1b-4854-b86d-df738ddc8595 tags:

When the total of the observations is fixed by design, *e.g.* when the observations are frequencies (proportions that sum to $1$), we use a $\chi^2$ test instead of an ANOVA.
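The tests below all turn the $\chi^2$ statistic into a one-sided *p*-value by taking the upper tail of a $\chi^2$ distribution, for which the survival function `sf` of scipy's distribution objects is convenient. A minimal sketch (the statistic value and degrees of freedom below are made-up values, for illustration only):

```python
from scipy import stats

stat, dof = 8.57, 5  # hypothetical statistic and degrees of freedom

# sf is the survival function: sf(x) = 1 - cdf(x),
# i.e. the upper-tail (one-sided) probability
p_value = stats.chi2(dof).sf(stat)

# equivalent formulation; sf is numerically safer far in the tail
p_check = 1 - stats.chi2(dof).cdf(stat)
```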
### Goodness-of-fit

%% Cell type:markdown id:d855b1f2-3105-445a-b1a6-b99bceeb66b7 tags:

Example: Comparing the frequencies of the different allele variants at a given locus between a reference genome and a test genome.

Another popular example: [Color proportion of M&Ms [Coursera]](https://www.coursera.org/learn/stanford-statistics/lecture/rAwbR/the-color-proportions-of-m-ms):

| blue | orange | green | yellow | red | brown |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 24% | 20% | 16% | 14% | 13% | 13% |

%% Cell type:code id:84f4e9b3-a653-422c-8fc5-83b29869ba87 tags:

``` python
expected_props = np.array([ .24, .2, .16, .14, .13, .13 ])
observed_counts = np.array([ 85, 79, 56, 64, 58, 68 ])
np.sum(observed_counts)
```

%%%% Output: execute_result

410

%% Cell type:code id:8bb2916a-e9a4-4bbe-b04f-14818bb68723 tags:

``` python
expected_counts = expected_props * np.sum(observed_counts)
expected_counts
```

%%%% Output: execute_result

array([98.4, 82. , 65.6, 57.4, 53.3, 53.3])

%% Cell type:markdown id:a75bd8d7-f61e-49bc-80b4-cb1f05ccd7c1 tags:

| | blue | orange | green | yellow | red | brown |
| --: | :-: | :-: | :-: | :-: | :-: | :-: |
| Expected | 98.4 | 82 | 65.6 | 57.4 | 53.3 | 53.3 |
| Observed | 85 | 79 | 56 | 64 | 58 | 68 |

The statistic is:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \quad\quad \sim \chi^2_{k-1}\ \textrm{under}\ H_0$$

%% Cell type:code id:a8898631-a5b5-49e8-a158-3e149345391a tags:

``` python
k = len(expected_counts)
chi2 = np.sum((observed_counts - expected_counts) ** 2 / expected_counts)
chi2
```

%%%% Output: execute_result

8.566983829178941

%% Cell type:code id:7a8af457-13dc-483a-90e3-50d3949c8edf tags:

``` python
pvalue = stats.chi2(k-1).sf(chi2)
pvalue
```

%%%% Output: execute_result

0.1276329790529603

%% Cell type:markdown id:92886c91-f6ef-419d-bfda-e28d9dfcc7bc tags:

scipy.stats's implementation of the test is
[chisquare](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html):

%% Cell type:code id:1e246915-e9be-4826-9677-d9f30348d0df tags:

``` python
stats.chisquare(observed_counts, expected_counts)
```

%%%% Output: execute_result

Power_divergenceResult(statistic=8.566983829178941, pvalue=0.1276329790529603)

%% Cell type:markdown id:00614391-2cdb-4e8f-9734-a1fd2df57637 tags:

#### Effect size

Cohen's $w = \sqrt{\chi^2 / n}$, with $n$ the total number of observations:

%% Cell type:code id:f2231df3-3ce3-40c4-ae75-f6159061e3d6 tags:

``` python
n = np.sum(observed_counts)
cohen_w = np.sqrt(chi2 / n)
cohen_w
```

%%%% Output: execute_result

0.1445513...

%% Cell type:markdown id:1feb29a6-eeac-4096-8e90-c663fdeea8a8 tags:

### Homogeneity and independence

%% Cell type:markdown id:2ef858d5-ec8f-405d-924e-62f5286a94c5 tags:

Example: Comparing the frequencies of cell types in cultures that differ in treatment:

| Observed | Type A cells | Type B cells | Type C cells | Type D cells |
| --: | :-: | :-: | :-: | :-: |
| Treatment 1 | 134 | 86 | 32 | 11 |
| Treatment 2 | 101 | 92 | 38 | 8 |
| Treatment 3 | 188 | 67 | 54 | 19 |

$H_0$: the treatments have no effect on the frequency of the cell types.
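In practice, such a contingency table is often tallied from per-observation records rather than given directly. A sketch with `pandas.crosstab` (the records below are made up and unrelated to the counts above):

```python
import pandas as pd

# hypothetical raw records: one row per observed cell
records = pd.DataFrame({
    "treatment": ["T1", "T1", "T2", "T2", "T3", "T3", "T3"],
    "cell_type": ["A", "B", "A", "A", "C", "A", "B"],
})

# cross-tabulate into a treatments-by-cell-types table of counts
table = pd.crosstab(records["treatment"], records["cell_type"])
```

The resulting table (rows: treatments, columns: cell types) can then be fed to the $\chi^2$ machinery below.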
https://www.coursera.org/learn/stanford-statistics/lecture/78IMJ/the-chi-square-test-for-homogeneity-and-independence %% Cell type:code id:13145d41-7dc3-4eff-9f9c-38d0abed1e8d tags:  python observed_counts = np.array([ [ 134, 86, 32, 11 ], [ 101, 92, 38, 8 ], [ 188, 67, 54, 19 ], ])  %% Cell type:code id:fe4ce304-25ac-4734-9a0a-e20db59b742f tags:  python expected_props = np.sum(observed_counts, axis=0) / np.sum(observed_counts) expected_props  %%%% Output: execute_result array([0.50963855, 0.29518072, 0.14939759, 0.04578313]) %% Cell type:markdown id:7f4b4727-3a9b-474f-b6ad-b632be97b04e tags: Under $H_0$, the expected proportions are: | Expected | Type A cells | Type B cells | Type C cells | Type D cells | | --: | :-: | :-: | :-: | :-: | | Treatment 1 | 51% | 29% | 15% | 5% | | Treatment 2 | 51% | 29% | 15% | 5% | | Treatment 3 | 51% | 29% | 15% | 5% | %% Cell type:code id:087ea780-304c-459d-a7c0-066817f3d82f tags:  python expected_counts = np.outer(np.sum(observed_counts, axis=1), expected_props) expected_counts  %%%% Output: execute_result array([[134.03493976, 77.63253012, 39.29156627, 12.04096386], [121.80361446, 70.54819277, 35.7060241 , 10.94216867], [167.16144578, 96.81927711, 49.00240964, 15.01686747]]) %% Cell type:markdown id:c0f49b1a-3a7a-4ad8-8d2a-9fd3513ad2a6 tags: | Expected | Type A cells | Type B cells | Type C cells | Type D cells | | --: | :-: | :-: | :-: | :-: | | Treatment 1 | 134 | 77.6 | 39.3 | 12 | | Treatment 2 | 121.8 | 70.5 | 35.7 | 10.9 | | Treatment 3 | 167.2 | 96.8 | 49 | 15 | %% Cell type:code id:d9585b87-f5dd-4cf2-ad91-0a79dcc95c86 tags:  python j, k = expected_counts.shape dof = (j - 1) * (k - 1) chi2 = np.sum((observed_counts - expected_counts) ** 2 / expected_counts) chi2  %%%% Output: execute_result 26.7075512595244 %% Cell type:code id:a7cd985e-eb2b-4a72-a221-30e3e0fdf8ec tags:  python stats.chi2(dof).sf(chi2)  %%%% Output: execute_result 0.00016426084515914902 %% Cell type:markdown id:13d7ef58-e62d-48de-a3a4-2ccc2ac588cb tags: 
scipy.stats's $\chi^2$ test for homogeneity/independence is [chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html):

%% Cell type:code id:4cd7ea08-110a-4ff7-9521-0f12e4169d35 tags:

``` python
stats.chi2_contingency(observed_counts)
```

%%%% Output: execute_result

(26.707551259524408, 0.0001642608451591484, 6, array([[134.03493976, 77.63253012, 39.29156627, 12.04096386], [121.80361446, 70.54819277, 35.7060241 , 10.94216867], [167.16144578, 96.81927711, 49.00240964, 15.01686747]]))

%% Cell type:markdown id:45da92b9-cd4e-4d1b-8996-451285ef1896 tags:

Due to the symmetric design of the test, it does not matter which factor's effect is hypothesized to be null under $H_0$; transposing the contingency table yields the same statistic and *p*-value:

%% Cell type:code id:d7e7fd75-068c-4f37-b981-498890c4a0d7 tags:

``` python
stats.chi2_contingency(observed_counts.T)
```

%%%% Output: execute_result

(26.707551259524408, 0.0001642608451591484, 6, array([[134.03493976, 121.80361446, 167.16144578], [ 77.63253012, 70.54819277, 96.81927711], [ 39.29156627, 35.7060241 , 49.00240964], [ 12.04096386, 10.94216867, 15.01686747]]))

%% Cell type:markdown id:141dc10b-c964-4621-9e53-32f36c11d2da tags:

## Correlation

%% Cell type:markdown id:6efe325b tags:

### Analyses of association (recap)

| Test | Types of variables |
| :-: | :-- |
| $\chi^2$ test (ind./homo.) | categorical *vs* categorical |
| ANOVA | categorical (*e.g.* group) *vs* continuous (response) |
| ? | continuous *vs* continuous |

Note: frequencies in the $\chi^2$ tests are summary statistics and play the role of a sample size. They are NOT treated as measurements of a variable, although they could be, at another conceptual level (*e.g.* over a population of bags of M&Ms).

%% Cell type:markdown id:344da42e-5707-4a0a-9b98-4cbf47781fbe tags:

### Correlation coefficient

%% Cell type:markdown id:0c8b61ee-e2da-4dc4-b3ad-02790d407d5b tags:

The Pearson correlation coefficient of two series is their covariance normalized by the standard deviation of each series:

$$\textrm{Var}(x) = \frac{1}{n-1}\sum_i (x_i - \bar{x})(x_i - \bar{x}) \\ \textrm{Cov}(x, y) = \frac{1}{n-1}\sum_i(x_i - \bar{x})(y_i - \bar{y}) \\ -1 \le \quad r(x, y) = \frac{\textrm{Cov}(x, y)}{\sqrt{\textrm{Var}(x)\textrm{Var}(y)}} \quad \le 1$$

The [pearsonr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) function computes the Pearson correlation coefficient together with a *p*-value:

%% Cell type:code id:f5a36568-d5b5-4388-b3f9-f38855c58c80 tags:

``` python
# note: no seed is set here, so the values differ across runs
x1 = stats.norm.rvs(loc=46, scale=30, size=30)
x2 = stats.norm.rvs(loc=71, scale=30, size=30)
r, pv = stats.pearsonr(x1, x2)
r, pv
```

%%%% Output: execute_result

(0.3518680132574827, 0.05653920630309695)

%% Cell type:markdown id:a9fbd953 tags:

The correlation coefficient is a commonly used effect size for the linear relationship between the two variables, similar to (but not to be confused with) a regression coefficient:

%% Cell type:code id:3581eeb0-f98d-4b2f-a0f1-bef073d86980 tags:

``` python
df = pd.DataFrame(dict(x1=x1, x2=x2))
sns.regplot(x="x1", y="x2", data=df);
```

%%%% Output: display_data
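As a sanity check of the formula above, the coefficient can be recomputed by hand from the sample covariance matrix. A minimal sketch with freshly generated data (a fixed seed is used here, unlike the cell above, so the check is reproducible):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(46, 30, size=30)
x2 = rng.normal(71, 30, size=30)

# np.cov returns the 2x2 sample covariance matrix (n-1 normalization):
# the diagonal holds Var(x1) and Var(x2), the off-diagonal Cov(x1, x2)
cov = np.cov(x1, x2)
r_manual = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

r_scipy, _ = stats.pearsonr(x1, x2)
assert np.isclose(r_manual, r_scipy)
```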