Commit a8ff27bc authored by François LAURENT

SciPy course material complete

parent 1dbef0a4
@@ -5,14 +5,28 @@
"id": "a5a5210d",
"metadata": {},
"source": [
"Import `numpy`, `pandas`, the `pyplot` module from `matplotlib`, `seaborn`, and the `stats` module from `scipy`:"
"## Q\n",
"\n",
"Import `numpy`, `pandas`, the `pyplot` module from `matplotlib`, `seaborn`, and the `stats` module from `scipy`."
]
},
{
"cell_type": "markdown",
"id": "5ac6cc32",
"metadata": {
"heading_collapsed": true
},
"source": [
"## A"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "529c5f56",
"metadata": {},
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"import numpy as np\n",
@@ -24,7 +38,7 @@
},
{
"cell_type": "markdown",
"id": "bb564f37",
"id": "93ad4aaf",
"metadata": {},
"source": [
"# Comparison of two group means"
@@ -32,17 +46,31 @@
},
{
"cell_type": "markdown",
"id": "08c1dd12",
"id": "0e4fd0d9",
"metadata": {},
"source": [
"Load the `mi.csv` data file located in the `data` directory of the course repository:"
"## Q\n",
"\n",
"Load the `mi.csv` data file located in the `data` directory of the course repository."
]
},
{
"cell_type": "markdown",
"id": "08c1dd12",
"metadata": {
"heading_collapsed": true
},
"source": [
"## A"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "00130518",
"metadata": {},
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
@@ -261,14 +289,28 @@
"id": "9cc036b2",
"metadata": {},
"source": [
"Question: anything missing?"
"## Q\n",
"\n",
"Anything missing?"
]
},
{
"cell_type": "markdown",
"id": "99d5dc74",
"metadata": {
"heading_collapsed": true
},
"source": [
"## A"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8a648a9b",
"metadata": {},
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
@@ -289,7 +331,9 @@
"cell_type": "code",
"execution_count": 4,
"id": "2f0a8116",
"metadata": {},
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
@@ -666,14 +710,28 @@
"id": "3512f950",
"metadata": {},
"source": [
"Show a summary table for these data:"
"## Q\n",
"\n",
"Show a summary table for these data."
]
},
{
"cell_type": "markdown",
"id": "6984434b",
"metadata": {
"heading_collapsed": true
},
"source": [
"## A"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a7a7d087",
"metadata": {},
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
@@ -933,17 +991,31 @@
},
{
"cell_type": "markdown",
"id": "ec2a049f",
"id": "04163591",
"metadata": {},
"source": [
"Inspect the distribution of variables `Age` and `OwnsHouse`:"
"## Q\n",
"\n",
"Inspect the distribution of variables `Age` and `OwnsHouse`."
]
},
{
"cell_type": "markdown",
"id": "d6baac23",
"metadata": {
"heading_collapsed": true
},
"source": [
"## A"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "14572a36",
"metadata": {},
"id": "5de6412d",
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
@@ -968,8 +1040,10 @@
{
"cell_type": "code",
"execution_count": 7,
"id": "9c625646",
"metadata": {},
"id": "f793503f",
"metadata": {
"hidden": true
},
"outputs": [
{
"name": "stdout",
@@ -993,17 +1067,23 @@
},
{
"cell_type": "markdown",
"id": "af5660e7",
"metadata": {},
"id": "1e94c17b",
"metadata": {
"heading_collapsed": true
},
"source": [
"Isolate the house-owners group from the others, draw their respective age distributions and check they are normally distributed:"
"## Q\n",
"\n",
"Isolate the house-owners group from the others, draw their respective age distributions and check they are normally distributed."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "86252da3",
"metadata": {},
"id": "55d18f16",
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"group = df.groupby('OwnsHouse').groups\n",
@@ -1016,8 +1096,10 @@
{
"cell_type": "code",
"execution_count": 9,
"id": "1f4abc62",
"metadata": {},
"id": "3d1a44e6",
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
@@ -1039,8 +1121,10 @@
{
"cell_type": "code",
"execution_count": 10,
"id": "b67a8ea8",
"metadata": {},
"id": "ddf5d4b0",
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
@@ -1063,19 +1147,21 @@
},
{
"cell_type": "markdown",
"id": "1d0bb402",
"metadata": {},
"id": "24b49c4c",
"metadata": {
"hidden": true
},
"source": [
"\\[CORR\\]\n",
"\n",
"The red line is fitted to the blue points and does not align well with the linear part. To better illustrate what the linear part is, we reimplement the regression (the exact implementation is beyond the scope of this session):"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "8f971da3",
"metadata": {},
"id": "0f888c53",
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
@@ -1101,34 +1187,48 @@
},
{
"cell_type": "markdown",
"id": "b8ecfebf",
"metadata": {},
"id": "f35584b7",
"metadata": {
"hidden": true
},
"source": [
"\\[CORR\\]\n",
"\n",
"The misalignment of the default regression line on the central part of the distribution is indicative of some asymmetry, while the diverging tails also hint at some departure from normality (kurtosis).\n",
"The misalignment of the default regression line on the central part of the distribution is indicative of some asymmetry, while the diverging tails also hint at some departure from normality (kurtosis). The sampling procedure clearly excluded people younger than 20 or older than 70, which results in truncated distributions.\n",
"\n",
"Here, we have comfortable sample sizes and these departures from normality may not affect the power of the statistical test."
]
},
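The asymmetry and tail behaviour discussed above can also be quantified numerically. The following is a minimal sketch on synthetic data, not the course's own solution: the `ages` array is a hypothetical stand-in for an age variable truncated to the 20 to 70 range, and the chosen mean and spread are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# hypothetical stand-in for an age variable: normal draws truncated to [20, 70]
ages = rng.normal(loc=45, scale=15, size=5000)
ages = ages[(ages >= 20) & (ages <= 70)]

# sample skewness and excess kurtosis (both are 0 for a normal distribution)
skewness = stats.skew(ages)
excess_kurtosis = stats.kurtosis(ages)

# probplot returns the theoretical and ordered quantiles,
# plus the least-squares line fitted to them (the "red line" of a Q-Q plot)
(theoretical_q, ordered_q), (slope, intercept, r) = stats.probplot(ages, dist='norm')

print(skewness, excess_kurtosis, r)
```

Truncation removes the tails, so the excess kurtosis of such a sample is expected to be negative, while the skewness stays close to zero when the truncation is symmetric around the mean.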
{
"cell_type": "markdown",
"id": "f6de8e66",
"id": "2cc80be1",
"metadata": {},
"source": [
"## Q\n",
"\n",
"Are the sample sizes and variances of the two groups similar enough to run a standard $t$ test?"
]
},
{
"cell_type": "markdown",
"id": "cd58c73a",
"metadata": {
"heading_collapsed": true
},
"source": [
"## A"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "d5ac4dc5",
"metadata": {},
"id": "0dbb79f7",
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/plain": [
"(288, 528, 219.94736689332564, 138.81976459481174)"
"(288, 528, 14.830622606395378, 11.782179959362857)"
]
},
"execution_count": 12,
@@ -1137,30 +1237,46 @@
}
],
"source": [
"len(house_owners_age), len(others_age), np.var(house_owners_age), np.var(others_age)"
"len(house_owners_age), len(others_age), np.std(house_owners_age), np.std(others_age)"
]
},
{
"cell_type": "markdown",
"id": "e177b52b",
"metadata": {},
"id": "3e221ea2",
"metadata": {
"hidden": true
},
"source": [
"\\[CORR\\] `ttest_ind` allows variance ratios up to $2$. The groups can have different sample sizes."
"As a rule of thumb, the standard $t$ test (`ttest_ind` with the default `equal_var=True`) tolerates standard deviation ratios [up to $2$](https://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes,_similar_variances_(1/2_%3C_sX1/sX2_%3C_2)). The groups can have different sample sizes."
]
},
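The rule of thumb above can be checked in a couple of lines. This is a sketch on synthetic data; `group1` and `group2` are hypothetical stand-ins for the two age samples, and `ddof=1` (the unbiased variance estimator) is a choice made here, whereas the notebook uses NumPy's default `ddof=0`. With samples of this size, the two conventions give nearly identical ratios.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-ins for the two groups (sizes borrowed from the notebook output)
group1 = rng.normal(40, 15, size=288)
group2 = rng.normal(50, 12, size=528)

# sample standard deviations
s1 = np.std(group1, ddof=1)
s2 = np.std(group2, ddof=1)
ratio = s1 / s2

# rule of thumb: 1/2 < s1/s2 < 2
similar_spread = 0.5 < ratio < 2
print(len(group1), len(group2), ratio, similar_spread)
```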
{
"cell_type": "markdown",
"id": "cd273f22",
"id": "d61f454a",
"metadata": {},
"source": [
"Test whether the group mean ages are equal:"
"## Q\n",
"\n",
"Test whether the group mean ages are equal."
]
},
{
"cell_type": "markdown",
"id": "b076e8e6",
"metadata": {
"heading_collapsed": true
},
"source": [
"## A"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "ed503837",
"metadata": {},
"id": "1d238900",
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
@@ -1183,25 +1299,39 @@
},
{
"cell_type": "markdown",
"id": "967b1f9f",
"id": "62b30b76",
"metadata": {},
"source": [
"## Q\n",
"\n",
"How would you report the result of this test?"
]
},
{
"cell_type": "markdown",
"id": "efeac3ab",
"metadata": {
"heading_collapsed": true
},
"source": [
"## A"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "de90c6e8",
"metadata": {},
"execution_count": 54,
"id": "341157b6",
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/plain": [
"(814, -10.305953282828284, -0.7954424784394866)"
"(814, -10.305953282828284, -0.7954424784394866, -0.7954424784394866)"
]
},
"execution_count": 14,
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
@@ -1212,23 +1342,27 @@
"n1, n2 = len(house_owners_age), len(others_age)\n",
"degrees_of_freedom = n1 + n2 - 2\n",
"\n",
"# * the mean difference (this is almost an effect size in itself, an intuitive one),\n",
"# * the mean difference (almost an effect size, but not scaled by the associated variability),\n",
"mean_difference = np.mean(house_owners_age) - np.mean(others_age)\n",
"\n",
"# * and the effect size.\n",
"f, _ = stats.ttest_ind(house_owners_age, others_age)\n",
"cohen_d = f * np.sqrt(1/n1 + 1/n2)\n",
"t, _ = stats.ttest_ind(house_owners_age, others_age)\n",
"cohen_d = t * np.sqrt(1/n1 + 1/n2)\n",
"\n",
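"# alternatively (note: the default eftype='cohen' yields Cohen's d; eftype='hedges' would give the bias-corrected estimate):\n",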
"import pingouin as pg\n",
"unbiased_cohen_d = pg.compute_effsize(house_owners_age, others_age)\n",
"\n",
"degrees_of_freedom, mean_difference, cohen_d"
"degrees_of_freedom, mean_difference, cohen_d, unbiased_cohen_d"
]
},
{
"cell_type": "markdown",
"id": "21e69d05",
"metadata": {},
"id": "b79f8a5c",
"metadata": {
"hidden": true
},
"source": [
"\\[CORR\\]\n",
"\n",
"«**In our study**, house owners ($n=288$) were found to be significantly younger than the other surveyed people ($n=528$; $10.3$ years younger on average, $t(814)=-10.9$, $p<0.05$). This effect was found to be large (Cohen's $d \\approx 0.8$).»\n",
"\n",
"Note: as we report the sample size for each group, we may omit the (still nice-to-have) information of the number of degrees of freedom."
@@ -1236,16 +1370,613 @@
},
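The effect size computed from the $t$ statistic is the same quantity as the mean difference divided by the pooled standard deviation. A short sketch on synthetic data illustrates this; `x` and `y` are hypothetical samples, not the course data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# hypothetical samples standing in for the two age groups
x = rng.normal(40, 15, size=288)
y = rng.normal(50, 12, size=528)
n1, n2 = len(x), len(y)

# route 1: rescale the Student t statistic
t, _ = stats.ttest_ind(x, y)
d_from_t = t * np.sqrt(1/n1 + 1/n2)

# route 2: mean difference divided by the pooled standard deviation
pooled_var = ((n1 - 1) * np.var(x, ddof=1) + (n2 - 1) * np.var(y, ddof=1)) / (n1 + n2 - 2)
d_direct = (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

print(d_from_t, d_direct)
```

Both routes agree up to floating-point error, because the Student $t$ statistic is the mean difference divided by the pooled standard deviation times $\sqrt{1/n_1 + 1/n_2}$.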
{
"cell_type": "markdown",
"id": "54db26df",
"id": "f72698b7",
"metadata": {},
"source": [
"## Q\n",
"\n",
"\\[optional; good for playing with Python rather than statistical methods\\]\n",
"\n",
"Although tractable in principle, the group difference in variance is quite large and, had we smaller samples, we could instead use Welch's $t$ test, which is known to better control type-1 errors when variances differ, at the cost of slightly lower power.\n",
"\n",
"As it is now clear we have a relationship between age and owning a house, let us compute the rejection rate (or power) as a function of sample size.\n",
"\n",
"Proposal:\n",
"* loop over decreasing sample sizes (*e.g.* 200, 50, 20, 10, 5),\n",
"* randomly pick a subsample of that size from each group,\n",
"* compare their means using the standard Student $t$-test and Welch $t$-test,\n",
"* observe whether each test successfully rejects $H_0$ for a constant significance level (*e.g.* 5%),\n",
"* replicate this procedure many times (*e.g.* 100),\n",
"* and compute the rejection rate for each sample size and type of test."
]
},
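Before running the full simulation, it may help to compare the two tests on a single pair of subsamples. The sketch below uses synthetic data (`a` and `b` are hypothetical subsamples with different spreads). When both subsamples have the same size, as in the proposal above, the Student and Welch statistics coincide and only the degrees of freedom, hence the $p$-values, differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# two hypothetical subsamples of equal size but different spread
n = 20
a = rng.normal(40, 15, size=n)
b = rng.normal(50, 8, size=n)

# standard Student t-test (pooled variance)
t_student, p_student = stats.ttest_ind(a, b, equal_var=True)

# Welch t-test (separate variances, adjusted degrees of freedom)
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(t_student, p_student)
print(t_welch, p_welch)
```

Welch's degrees of freedom never exceed $n_1 + n_2 - 2$, so for equal sample sizes its $p$-value is at least as large as Student's, which is one way to see the slightly lower power.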
{
"cell_type": "markdown",
"id": "8a2bc253",
"metadata": {
"heading_collapsed": true
},
"source": [
"## Help: subsampling"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "fd050fbc",
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"image/png": "(base64-encoded PNG omitted: overlaid histograms of `sample` and `subsample`)",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# let us consider an example sample\n",
"sample = others_age\n",
"\n",
"# and a subsample size\n",
"n = 200\n",
"\n",
"# we need a random generator\n",
"rng = np.random.default_rng()\n",
"\n",
"# now we can pick n observations from the original sample\n",
"# calling the `choice` method of the random generator\n",
"subsample = rng.choice(sample, n)\n",
"\n",
"# in principle the smaller sample will exhibit similar\n",
"# properties as the original sample; both are drawn from\n",
"# the population in similar ways\n",
"bins = np.arange(20, 70+1, 5)\n",
"sns.histplot(sample, bins=bins)\n",
"sns.histplot(subsample, bins=bins);"
]
},
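Note that `Generator.choice` draws with replacement by default, so the same observation may be picked several times. The sketch below contrasts this with `replace=False`, using a hypothetical array of distinct values so that duplicates are easy to count.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sample made of distinct values
sample = np.arange(528)
n = 200

# default behaviour: sampling with replacement (bootstrap-like)
with_replacement = rng.choice(sample, n)

# sampling without replacement: every observation is picked at most once
without_replacement = rng.choice(sample, n, replace=False)

print(len(np.unique(with_replacement)), len(np.unique(without_replacement)))
```

Either choice can be defended for the exercise; sampling without replacement requires `n` not to exceed the size of the original sample.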
{
"cell_type": "markdown",
"id": "b44a7b2b",
"metadata": {
"heading_collapsed": true
},
"source": [
"## A"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35541f89",
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"significance_level = 0.05\n",
"\n",
"sample1 = house_owners_age\n",
"sample2 = others_age\n",
"sample_size = min(len(sample1), len(sample2))"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "2ae175e5",
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"\n",
"sample_sizes = []\n",
"test_types = []\n",
"rejection_rates = []\n",
"\n",
"rng = np.random.default_rng()\n",
"\n",
"for relative_sample_size in (1, .2, .1, .05, .025):\n",
"    n = int(relative_sample_size * sample_size)\n",
"    nreplicates = 100\n",
"    rejections = defaultdict(lambda: 0)\n",
"    for _ in range(nreplicates):\n",
"        subsample1 = rng.choice(sample1, n)\n",
"        subsample2 = rng.choice(sample2, n)\n",
"        for test_type in ('Student', 'Welch'):\n",
"            t, pv = stats.ttest_ind(subsample1, subsample2, equal_var=test_type=='Student')\n",
"            if pv <= significance_level:\n",
"                rejections[test_type] = rejections[test_type] + 1\n",
"    for test_type in ('Student', 'Welch'):  # also report tests that never rejected (rate 0)\n",
"        rejection_rates.append(rejections[test_type] / nreplicates)\n",
"        sample_sizes.append(n)\n",
"        test_types.append(test_type)\n",
"\n",
"result = pd.DataFrame({'sample size': sample_sizes, 'test': test_types, 'power': rejection_rates})"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "d9641fa6",
"metadata": {
"hidden": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sample size</th>\n",
" <th>test</th>\n",
" <th>power</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>288</td>\n",