Commit a8ff27bc authored by François  LAURENT's avatar François LAURENT
Browse files

SciPy course material complete

parent 1dbef0a4
%% Cell type:markdown id:a5a5210d tags:
Import `numpy`, `pandas`, the `pyplot` module from `matplotlib`, `seaborn`, and the `stats` module from `scipy`:
## Q
Import `numpy`, `pandas`, the `pyplot` module from `matplotlib`, `seaborn`, and the `stats` module from `scipy`.
%% Cell type:markdown id:5ac6cc32 tags:
## A
%% Cell type:code id:529c5f56 tags:
``` python
import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
import seaborn as sns
```
%% Cell type:markdown id:bb564f37 tags:
%% Cell type:markdown id:93ad4aaf tags:
# Comparison of two group means
%% Cell type:markdown id:0e4fd0d9 tags:
## Q
Load the `mi.csv` data file located in the `data` directory of the course repository.
%% Cell type:markdown id:08c1dd12 tags:
Load the `mi.csv` data file located in the `data` directory of the course repository:
## A
%% Cell type:code id:00130518 tags:
``` python
df = pd.read_csv('../data/mi.csv', index_col=0)
df.head()
```
%%%% Output: execute_result
Age OwnsHouse PhysicalActivity Sex LivesWithPartner LivesWithKids \
1 22.33 Yes 3.0 Female No No
2 28.83 Yes 0.0 Female Yes No
3 23.67 Yes 0.0 Female Yes No
4 21.17 No 0.5 Female No No
5 26.17 Yes 1.5 Female No No
BornInCity Inbreeding BMI CMVPositiveSerology ... \
1 Yes 94.9627 20.13 No ...
2 Yes 79.1024 21.33 Yes ...
3 Yes 117.2540 22.18 No ...
4 No 94.1796 18.68 No ...
5 Yes 105.1250 29.01 No ...
VaccineWhoopingCough VaccineYellowFever VaccineHepB VaccineFlu SUBJID \
1 Yes No Yes No 2
2 Yes No Yes No 3
3 No No Yes No 4
4 No No Yes No 5
5 Yes No Yes No 8
DepressionScore HeartRate Temperature HourOfSampling DayOfSampling
1 0.0 66 36.8 8.883 40
2 0.0 66 37.4 9.350 40
3 0.0 62 36.9 8.667 40
4 1.0 64 36.0 9.883 40
5 0.0 67 36.7 8.550 81
[5 rows x 43 columns]
%% Cell type:markdown id:9cc036b2 tags:
Question: anything missing?
## Q
Anything missing?
%% Cell type:markdown id:99d5dc74 tags:
## A
%% Cell type:code id:8a648a9b tags:
``` python
df.shape
```
%%%% Output: execute_result
(816, 43)
%% Cell type:code id:2f0a8116 tags:
``` python
# in Jupyter-lab, pandas is set to display dataframes with a limited number of columns
pd.options.display.max_columns = None
df.head()
```
%%%% Output: execute_result
Age OwnsHouse PhysicalActivity Sex LivesWithPartner LivesWithKids \
1 22.33 Yes 3.0 Female No No
2 28.83 Yes 0.0 Female Yes No
3 23.67 Yes 0.0 Female Yes No
4 21.17 No 0.5 Female No No
5 26.17 Yes 1.5 Female No No
BornInCity Inbreeding BMI CMVPositiveSerology FluIgG MetabolicScore \
1 Yes 94.9627 20.13 No 0.464319 0
2 Yes 79.1024 21.33 Yes -0.049817 1
3 Yes 117.2540 22.18 No 0.332944 2
4 No 94.1796 18.68 No 0.404886 0
5 Yes 105.1250 29.01 No -0.303782 1
LowAppetite TroubleConcentrating TroubleSleeping HoursOfSleep Listless \
1 0 0 1.0 9.00 3
2 0 0 1.0 7.05 3
3 0 0 1.0 6.50 3
4 0 0 2.0 10.00 3
5 0 0 1.0 9.00 0
UsesCannabis RecentPersonalCrisis Smoking Employed Education \
1 No No Never No PhD
2 No No Active Yes Baccalaureat
3 Yes No Active Yes Baccalaureat
4 No No Never No PhD
5 No No Never Yes Baccalaureat
DustExposure Income HadMeasles HadRubella HadChickenPox HadMumps \
1 No (1000-2000] No No Yes No
2 No (2000-3000] No No Yes No
3 Current (2000-3000] No No Yes No
4 No (3000-inf] No No Yes No
5 No [0-1000] No No No No
HadTonsillectomy HadAppendicectomy VaccineHepA VaccineMMR VaccineTyphoid \
1 No No No No No
2 No No No No No
3 No No No No No
4 No No No No No
5 No No No No No
VaccineWhoopingCough VaccineYellowFever VaccineHepB VaccineFlu SUBJID \
1 Yes No Yes No 2
2 Yes No Yes No 3
3 No No Yes No 4
4 No No Yes No 5
5 Yes No Yes No 8
DepressionScore HeartRate Temperature HourOfSampling DayOfSampling
1 0.0 66 36.8 8.883 40
2 0.0 66 37.4 9.350 40
3 0.0 62 36.9 8.667 40
4 1.0 64 36.0 9.883 40
5 0.0 67 36.7 8.550 81
%% Cell type:markdown id:3512f950 tags:
Show a summary table for these data:
## Q
Show a summary table for these data.
%% Cell type:markdown id:6984434b tags:
## A
%% Cell type:code id:a7a7d087 tags:
``` python
df.describe()
```
%%%% Output: execute_result
Age PhysicalActivity Inbreeding BMI FluIgG \
count 816.000000 816.000000 816.000000 816.000000 816.000000
mean 46.485711 2.751804 91.904255 24.208958 0.203601
std 13.854402 3.565008 12.936172 3.181184 0.232411
min 20.170000 0.000000 43.727000 18.500000 -0.430491
25% 35.830000 0.500000 84.077225 21.770000 0.065082
50% 47.710000 2.000000 91.862800 23.850000 0.227855
75% 58.352500 4.000000 100.008000 26.210000 0.363819
max 69.750000 49.000000 150.107000 32.000000 0.769841
MetabolicScore LowAppetite TroubleConcentrating TroubleSleeping \
count 816.000000 816.000000 816.000000 816.000000
mean 0.932598 0.512255 0.355392 1.119771
std 0.893942 1.674008 1.408535 0.931400
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000
50% 1.000000 0.000000 0.000000 1.000000
75% 1.000000 0.000000 0.000000 2.000000
max 4.000000 14.000000 14.000000 3.000000
HoursOfSleep Listless SUBJID DepressionScore HeartRate \
count 816.000000 816.000000 816.000000 816.000000 816.000000
mean 7.499246 1.290441 576.877451 0.544526 59.209559
std 1.017186 2.055716 518.489935 1.333918 9.206104
min 3.000000 0.000000 2.000000 0.000000 37.000000
25% 7.000000 0.000000 300.750000 0.000000 54.000000
50% 7.500000 0.000000 556.500000 0.000000 58.000000
75% 8.000000 3.000000 779.250000 1.000000 65.000000
max 12.000000 14.000000 5701.000000 14.000000 100.000000
Temperature HourOfSampling DayOfSampling
count 816.000000 816.000000 816.000000
mean 36.431985 9.214806 185.485294
std 0.318461 0.378376 84.971737
min 35.700000 8.433000 17.000000
25% 36.200000 8.917000 136.000000
50% 36.400000 9.233000 187.000000
75% 36.600000 9.550000 263.000000
max 37.700000 11.217000 335.000000
%% Cell type:markdown id:ec2a049f tags:
%% Cell type:markdown id:04163591 tags:
## Q
Inspect the distribution of variables `Age` and `OwnsHouse`.
Inspect the distribution of variables `Age` and `OwnsHouse`:
%% Cell type:markdown id:d6baac23 tags:
%% Cell type:code id:14572a36 tags:
## A
%% Cell type:code id:5de6412d tags:
``` python
# categorical vs continuous => boxplot, violinplot
sns.boxplot(x='OwnsHouse', y='Age', data=df);
# optional
sns.swarmplot(x='OwnsHouse', y='Age', data=df, linewidth=1, size=3);
```
%%%% Output: display_data
![]()
%% Cell type:code id:9c625646 tags:
%% Cell type:code id:f793503f tags:
``` python
!cowsay "Where on Earth ALL younger people own a house while elder people do not?"
```
%%%% Output: stream
_________________________________________
/ Where on Earth ALL younger people own a \
\ house while elder people do not? /
-----------------------------------------
\ ^__^
\ (oo)\_______
(__)\ )\/\
||----w |
|| ||
%% Cell type:markdown id:af5660e7 tags:
%% Cell type:markdown id:1e94c17b tags:
## Q
Isolate the house-owners group from the others, draw their respective age distributions and check they are normally distributed:
Isolate the house-owners group from the others, draw their respective age distributions and check they are normally distributed.
%% Cell type:code id:86252da3 tags:
%% Cell type:code id:55d18f16 tags:
``` python
group = df.groupby('OwnsHouse').groups
house_owners = group['Yes']
others = group['No']
house_owners_age = df.loc[house_owners, 'Age']
others_age = df.loc[others, 'Age']
```
%% Cell type:code id:1f4abc62 tags:
%% Cell type:code id:3d1a44e6 tags:
``` python
sns.histplot(hue='OwnsHouse', y='Age', data=df, kde=True);
```
%%%% Output: display_data
![]()
%% Cell type:code id:b67a8ea8 tags:
%% Cell type:code id:ddf5d4b0 tags:
``` python
(theoretical_quantiles, observed_quantiles), (slope, intercept, _) = stats.probplot(house_owners_age, fit=True)
plt.scatter(theoretical_quantiles, observed_quantiles, marker='+', color='b')
plt.axline((0, intercept), slope=slope, color='r');
```
%%%% Output: display_data
![]()
%% Cell type:markdown id:1d0bb402 tags:
\[CORR\]
%% Cell type:markdown id:24b49c4c tags:
The red line is fitted to the blue points and does not align well on the linear part. To better illustrate what is the linear part, we reimplement the regression (the exact implementation is out of the scope of this session):
%% Cell type:code id:8f971da3 tags:
%% Cell type:code id:0f888c53 tags:
``` python
import statsmodels.api as sm # anticipating the next class...
central = (-1<theoretical_quantiles) & (theoretical_quantiles<1)
model = sm.OLS(observed_quantiles[central], sm.add_constant(theoretical_quantiles[central])).fit()
a, b = model.params
plt.scatter(theoretical_quantiles, observed_quantiles, marker='+', color='b')
plt.axline((0, a), slope=b, color='r');
```
%%%% Output: display_data
![]()
%% Cell type:markdown id:b8ecfebf tags:
\[CORR\]
%% Cell type:markdown id:f35584b7 tags:
The misalignment of the default regression line on the central part of the distribution is indicative of some asymmetry, while the diverging tails also hint at some departure from normality (kurtosis).
The misalignment of the default regression line on the central part of the distribution is indicative of some asymmetry, while the diverging tails also hint at some departure from normality (kurtosis). The sampling procedure clearly excluded people younger than 20 years old or elder than 70, which results in truncated distributions.
Here, we have comfortable sample sizes and these departures from normality may not affect the power of the statistical test.
%% Cell type:markdown id:f6de8e66 tags:
%% Cell type:markdown id:2cc80be1 tags:
## Q
Are the sample size and variance of the two groups similar enough for running a standard $t$ test?
%% Cell type:code id:d5ac4dc5 tags:
%% Cell type:markdown id:cd58c73a tags:
## A
%% Cell type:code id:0dbb79f7 tags:
``` python
len(house_owners_age), len(others_age), np.var(house_owners_age), np.var(others_age)
len(house_owners_age), len(others_age), np.std(house_owners_age), np.std(others_age)
```
%%%% Output: execute_result
(288, 528, 219.94736689332564, 138.81976459481174)
(288, 528, 14.830622606395378, 11.782179959362857)
%% Cell type:markdown id:e177b52b tags:
%% Cell type:markdown id:3e221ea2 tags:
\[CORR\] `ttest_ind` allows variance ratios up to $2$. The groups can have different sample sizes.
`ttest_ind` allows standard deviation ratios [up to $2$](https://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes,_similar_variances_(1/2_%3C_sX1/sX2_%3C_2)). The groups can have different sample sizes.
%% Cell type:markdown id:cd273f22 tags:
%% Cell type:markdown id:d61f454a tags:
Test the group mean ages equal:
## Q
%% Cell type:code id:ed503837 tags:
Test the group mean ages equal.
%% Cell type:markdown id:b076e8e6 tags:
## A
%% Cell type:code id:1d238900 tags:
``` python
# define your significance level first!
significance_level = 0.05
# run a t-test for independent samples
stats.ttest_ind(house_owners_age, others_age)
```
%%%% Output: execute_result
Ttest_indResult(statistic=-10.858676761684935, pvalue=9.562420864768222e-26)
%% Cell type:markdown id:967b1f9f tags:
%% Cell type:markdown id:62b30b76 tags:
## Q
How would you report the result of this test?
%% Cell type:code id:de90c6e8 tags:
%% Cell type:markdown id:efeac3ab tags:
## A
%% Cell type:code id:341157b6 tags:
``` python
# we need:
# * the number of degrees of freedom, to give a complete report of the outcome of the t-test,
n1, n2 = len(house_owners_age), len(others_age)
degrees_of_freedom = n1 + n2 - 2
# * the mean difference (this is almost an effect size in itself, an intuitive one),
# * the mean difference (this is almost an effect size, not compared with the associated variability),
mean_difference = np.mean(house_owners_age) - np.mean(others_age)
# * and the effect size.
f, _ = stats.ttest_ind(house_owners_age, others_age)
cohen_d = f * np.sqrt(1/n1 + 1/n2)
t, _ = stats.ttest_ind(house_owners_age, others_age)
cohen_d = t * np.sqrt(1/n1 + 1/n2)
degrees_of_freedom, mean_difference, cohen_d
# alternatively:
import pingouin as pg
unbiased_cohen_d = pg.compute_effsize(house_owners_age, others_age)
degrees_of_freedom, mean_difference, cohen_d, unbiased_cohen_d
```
%%%% Output: execute_result
(814, -10.305953282828284, -0.7954424784394866)
%% Cell type:markdown id:21e69d05 tags:
(814, -10.305953282828284, -0.7954424784394866, -0.7954424784394866)
\[CORR\]
%% Cell type:markdown id:b79f8a5c tags:
«**In our study**, house owners ($n=288$) were found to be significantly younger than the other surveyed people ($n=528$; $10.3$ years younger on average, $t(814)=-10.9$, $p<0.05$). This effect was found to be large (Cohen's $d \approx 0.8$).»
Note: as we report the sample size for each group, we may omit the (still nice-to-have) information of the number of degrees of freedom.
%% Cell type:markdown id:54db26df tags:
%% Cell type:markdown id:f72698b7 tags:
## Q
\[optional; good for playing with Python rather than statistical methods\]
Although tractable in principle, the group difference in variance is quite large and -- had we smaller samples -- we could instead use the Welch's $t$ test that is known to better control for type-1 errors in cases of differing variances, but also a slightly lower power.
As it is now clear we have a relationship between age and owning a house, let us compute the rejection rate (or power) as a function of sample size.
Proposal:
* loop over decreasing sample sizes (*e.g.* 200, 50, 20, 10, 5),
* randomly pick a subsample of that size from each group,
* compare their means using the standard Student $t$-test and Welch $t$-test,
* observe whether each test successfully rejects $H_0$ for a constant significance level (*e.g.* 5%),
* replicate this procedure many times (*e.g.* 100)
* and compute the rejection rate for each sample size and type of test.
%% Cell type:markdown id:8a2bc253 tags:
## Help: subsampling
%% Cell type:code id:fd050fbc tags:
``` python
# let us consider an example sample
sample = others_age
# and a subsample size
n = 200
# we need a random generator
rng = np.random.default_rng()
# now we can pick n observations from the original sample
# calling the `choice` method of the random generator
subsample = rng.choice(sample, n)
# in principle the smaller sample will exhibit similar
# properties as the original sample; both are drawn from
# the population in similar ways
bins = np.arange(20, 70+1, 5)
sns.histplot(sample, bins=bins)
sns.histplot(subsample, bins=bins);
```
%%%% Output: display_data
![]()
%% Cell type:markdown id:b44a7b2b tags:
## A
%% Cell type:code id:35541f89 tags:
``` python
significance_level = 0.05
sample1 = house_owners_age
sample2 = others_age
sample_size = min(len(sample1), len(sample2))
```
%% Cell type:code id:2ae175e5 tags:
``` python
from collections import defaultdict
sample_sizes = []
test_types = []
rejection_rates = []
rng = np.random.default_rng()
for relative_sample_size in (1, .2, .1, .05, .025):
n = int(relative_sample_size * sample_size)
nreplicates = 100
rejections = defaultdict(lambda: 0)
for _ in range(nreplicates):
subsample1 = rng.choice(sample1, n)
subsample2 = rng.choice(sample2, n)
for test_type in ('Student', 'Welch'):
t, pv = stats.ttest_ind(subsample1, subsample2, equal_var=test_type=='Student')
if pv <= significance_level:
rejections[test_type] = rejections[test_type] + 1
for test_type in rejections:
rejection_rates.append(rejections[test_type] / nreplicates)
sample_sizes.append(n)
test_types.append(test_type)
result = pd.DataFrame({'sample size': sample_sizes, 'test': test_types, 'power': rejection_rates})
```