Commit d22bfedc authored by François Laurent

more sensible choice of sums of squares

parent 09b58ac6
%% Cell type:code id:4e16caf7 tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
from patsy import dmatrices
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import diagnostic
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.outliers_influence import OLSInfluence
```
%% Cell type:markdown id:358dce7a tags:
# Multi-way ANOVA
%% Cell type:markdown id:a1face9f tags:
## Q
Load the `titanic.csv` data file, insert the natural logarithm of `1+Fare` as a new column in the dataframe (*e.g.* with column name `'LogFare'`), and plot this new variable as a function of `Age`, `Pclass` and `Sex`.
%% Cell type:markdown id:965cc75c tags:
## A
%% Cell type:code id:7eb30cc9 tags:
``` python
```
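The answer cell is collapsed in this diff. Below is a minimal sketch of what it could look like; the real notebook loads `titanic.csv`, but since that file is not available here, a small synthetic stand-in with the same columns is generated instead (an assumption, not the actual data):

```python
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Stand-in for pd.read_csv('titanic.csv') -- synthetic data (assumption)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'Age': rng.uniform(1, 80, n),
    'Pclass': rng.choice([1, 2, 3], n),
    'Sex': rng.choice(['male', 'female'], n),
    'Fare': rng.lognormal(2.5, 1.0, n),
})

# log(1 + Fare): log1p is well defined for Fare == 0
df['LogFare'] = np.log1p(df['Fare'])

# One panel per Pclass, colors per Sex, LogFare as a function of Age
sns.relplot(data=df, x='Age', y='LogFare', col='Pclass', hue='Sex')
plt.show()
```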
%% Cell type:markdown id:154c1460 tags:
## Q
Fit a linear model to these data to explain our synthetic variable `LogFare` as a function of `Age`, `Pclass` and `Sex`.
Treat `Pclass` and `Sex` as factors.
Print ANOVA tables for the different types of sums of squares.
%% Cell type:markdown id:3ad89dec tags:
## A
%% Cell type:code id:39f58924 tags:
``` python
```
%% Cell type:markdown id:ecf3bcd9 tags:
## Q
Because we have a large sample, we will ignore the non-normal residuals and play with post-hoc tests instead.
Split the ANOVA by the levels of `Pclass` and `Sex`, perform all pairwise comparisons where it makes sense, and correct for multiple comparisons.
First, proceed with type-3 sums of squares.
We are not interested in the significance of the slope of `Age` for the different levels of the factors.
%% Cell type:markdown id:00ace9f5 tags:
## A
%% Cell type:code id:09794922 tags:
``` python
```
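One hedged recipe for the post-hoc step, as an illustration rather than the notebook's exact method: all pairwise comparisons of `LogFare` between the levels of `Pclass`, corrected with `multipletests` (`Sex` has only two levels, so no correction is needed for it). Synthetic stand-in data again:

```python
from itertools import combinations
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Synthetic stand-in for titanic.csv (assumption)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'Pclass': rng.choice([1, 2, 3], n),
    'Sex': rng.choice(['male', 'female'], n),
})
df['LogFare'] = 4 - 0.8 * df['Pclass'] + rng.normal(0, 0.5, n)

# All pairwise comparisons between Pclass levels
labels, pvals = [], []
for a, b in combinations(sorted(df['Pclass'].unique()), 2):
    res = stats.ttest_ind(df.loc[df['Pclass'] == a, 'LogFare'],
                          df.loc[df['Pclass'] == b, 'LogFare'])
    labels.append(f'Pclass {a} vs {b}')
    pvals.append(res.pvalue)

# Holm correction for multiple comparisons
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='holm')
for lab, p, r in zip(labels, p_adj, reject):
    print(f'{lab}: adjusted p = {p:.3g}', '(reject)' if r else '')
```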
%% Cell type:markdown id:d30cdb9c tags:
## Q
Let us suppose we want to use type-1 sums of squares instead.
Proceed again, entering `Sex` first, `Pclass` second, and `Age` last.
In the post-hoc comparisons, we will disregard the effect of the slope of `Age`.
%% Cell type:markdown id:7a417d76 tags:
## A
%% Cell type:code id:034acc94 tags:
``` python
```
%% Cell type:markdown id:b7b20012 tags:
# Linear model with multiple variables
%% Cell type:markdown id:144b0584 tags:
## Q
Load the `mi.csv` file and plot the variable `Temperature` versus `HeartRate` and `PhysicalActivity`.
We will try to "explain" `Temperature` from `HeartRate` and `PhysicalActivity`.
%% Cell type:markdown id:35949307 tags:
## A
%% Cell type:code id:47cf88a3 tags:
``` python
```
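A sketch of the collapsed answer cell. The real notebook reads `mi.csv`; a synthetic stand-in with the same column names (an assumption) is used here:

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Stand-in for pd.read_csv('mi.csv') -- synthetic data (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'HeartRate': rng.normal(70, 10, n),
    'PhysicalActivity': rng.lognormal(1.0, 1.2, n),
})
df['Temperature'] = 36.5 + 0.01 * (df['HeartRate'] - 70) + rng.normal(0, 0.2, n)

# Temperature versus each candidate explanatory variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].scatter(df['HeartRate'], df['Temperature'], s=10)
axes[0].set(xlabel='HeartRate', ylabel='Temperature')
axes[1].scatter(df['PhysicalActivity'], df['Temperature'], s=10)
axes[1].set(xlabel='PhysicalActivity', ylabel='Temperature')
plt.show()
```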
%% Cell type:markdown id:62449fb6 tags:
## Q
%% Cell type:markdown id:c71d2a4c tags:
The `PhysicalActivity` variable exhibits a long-tailed, strongly asymmetric distribution. This is usually undesirable for an explanatory variable, because we cannot densely sample a large part of its domain of possible values, and therefore a model based on the data cannot be reliable.
We will transform `PhysicalActivity` using a simple natural logarithm. `log` is undefined at $0$ and diverges near $0$, which makes its straightforward application to `PhysicalActivity` inappropriate. Therefore we will also add $1$ to the `PhysicalActivity` measurements prior to applying `log`.
Plot the temperature again versus the transformed `PhysicalActivity` variable and compare the skewness of the transformed and raw variables.
%% Cell type:markdown id:21e8879e tags:
## A
%% Cell type:code id:0b8cac58 tags:
``` python
```
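A sketch of the transformation step on synthetic stand-in data (assumption), comparing skewness with `scipy.stats.skew`:

```python
import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['Temperature'] = 36.5 + rng.normal(0, 0.2, n)

# log(1 + x): defined at 0, tames the long right tail
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])

print('skewness (raw):        ', stats.skew(df['PhysicalActivity']))
print('skewness (transformed):', stats.skew(df['logPhysicalActivity']))

plt.scatter(df['logPhysicalActivity'], df['Temperature'], s=10)
plt.xlabel('log(1 + PhysicalActivity)')
plt.ylabel('Temperature')
plt.show()
```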
%% Cell type:markdown id:2c6d5225 tags:
## Q
To appreciate the increased robustness of a linear model using the transformed variable compared to the raw variable, design a simple univariate linear regression with `Temperature` as the response variable, and draw the Cook's distance of all the observations with respect to this model:
* first with the raw `PhysicalActivity` as explanatory variable,
* second with the transformed `PhysicalActivity` as explanatory variable.
%% Cell type:markdown id:c47489e6 tags:
## A
%% Cell type:code id:bf32c7e6 tags:
``` python
```
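A sketch of the Cook's distance comparison, using `OLSInfluence` (imported at the top of the notebook) on synthetic stand-in data (assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence
from matplotlib import pyplot as plt

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = 36.5 + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n)

# Cook's distances: raw versus transformed explanatory variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, x in zip(axes, ['PhysicalActivity', 'logPhysicalActivity']):
    fit = smf.ols(f'Temperature ~ {x}', data=df).fit()
    cooks_d, _ = OLSInfluence(fit).cooks_distance
    ax.stem(np.arange(n), cooks_d)
    ax.set(title=x, xlabel='observation', ylabel="Cook's distance")
plt.show()
```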
%% Cell type:markdown id:49408adc tags:
%% Cell type:markdown id:1c7871a2 tags:
## Q
Make a linear model of `Temperature` as response and `HeartRate` and `PhysicalActivity` (or its transformed variant) as explanatory variables.
Make two such models, one with interaction and one without. How would you choose between the two models?
%% Cell type:markdown id:12d7e9a5 tags:
## A (with nested Q&A)
%% Cell type:code id:82e9f8d6 tags:
``` python
```
%% Cell type:code id:c469b3d5 tags:
``` python
```
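A sketch of the two collapsed answer cells: the additive and interaction models, compared with a likelihood-ratio test (one reasonable way to choose between nested models; the notebook may use another). Synthetic stand-in data (assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

additive = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()
interact = smf.ols('Temperature ~ HeartRate * logPhysicalActivity', data=df).fit()

# The models are nested, so a likelihood-ratio test applies
lr_stat, p_value, df_diff = interact.compare_lr_test(additive)
print(f'log-likelihoods: {additive.llf:.2f} vs {interact.llf:.2f}')
print(f'LR test: stat = {lr_stat:.3f}, p = {p_value:.3f}, df = {df_diff}')
```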
%% Cell type:markdown id:66ddaa3e tags:
### Q
The interaction term is not significant, but most importantly, the increase in log-likelihood is very small; the interaction term does not help to better fit the model to the data.
To get a better intuition about the log-likelihood, plot it (with a dot plot) for different models, with one variable, with two variables, with and without interaction.
Feel free to introduce one or two extra explanatory variables such as `BMI`.
%% Cell type:markdown id:05844c56 tags:
### A
%% Cell type:code id:2588d8d6 tags:
``` python
```
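A sketch of the log-likelihood dot plot across several nested models, including a `BMI` term as suggested. Synthetic stand-in data (assumption, including the `BMI` column):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n),
                   'BMI': rng.normal(25, 4, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

formulas = [
    'Temperature ~ HeartRate',
    'Temperature ~ logPhysicalActivity',
    'Temperature ~ HeartRate + logPhysicalActivity',
    'Temperature ~ HeartRate * logPhysicalActivity',
    'Temperature ~ HeartRate + logPhysicalActivity + BMI',
]
llf = [smf.ols(f, data=df).fit().llf for f in formulas]

# Dot plot: one dot per model, log-likelihood on the x axis
plt.plot(llf, range(len(formulas)), 'o')
plt.yticks(range(len(formulas)), formulas)
plt.xlabel('log-likelihood')
plt.tight_layout()
plt.show()
```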
%% Cell type:markdown id:665a7a5c tags:
# White test for homoscedasticity
%% Cell type:markdown id:6d0bdd7b tags:
To keep things simple, let us use the `'HeartRate + PhysicalActivity'` or `'HeartRate + logPhysicalActivity'` model.
## Q
Inspect the residuals by plotting them versus each explanatory variable.
%% Cell type:markdown id:a6f37b2e tags:
## A
%% Cell type:code id:77e350be tags:
``` python
```
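A sketch of the residual inspection on synthetic stand-in data (assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

fit = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()

# Residuals versus each explanatory variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, x in zip(axes, ['HeartRate', 'logPhysicalActivity']):
    ax.scatter(df[x], fit.resid, s=10)
    ax.axhline(0, color='grey', lw=1)
    ax.set(xlabel=x, ylabel='residual')
plt.show()
```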
%% Cell type:markdown id:789bd3f7 tags:
## Q
We will further inspect the residuals for heteroscedasticity, using the White test.
`statsmodels` features an implementation of this test (`het_white`), but the documentation is scarce on details.
Try to apply the `het_white` function, but do not feel ashamed if you fail.
%% Cell type:markdown id:0a822074 tags:
## A
%% Cell type:code id:85d73982 tags:
``` python
```
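One way the collapsed answer cell could call `het_white`: the function takes the residuals and the design matrix of the fitted model (which must include the constant). Synthetic stand-in data (assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats import diagnostic

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

fit = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()

# het_white returns the LM statistic and p-value, plus an F-test variant
lm_stat, lm_pvalue, f_stat, f_pvalue = diagnostic.het_white(fit.resid, fit.model.exog)
print(f'LM statistic = {lm_stat:.3f}, p = {lm_pvalue:.3f}')
```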
%% Cell type:markdown id:e3ccb464 tags:
## Q
Instead, we will implement this test ourselves, as an application of polynomial regression.
The algorithm is simple. First part:
* take the squared residuals as a response variable,
* take the same explanatory variables as in the original model, plus all their possible interaction terms, plus their squared values,
* fit a linear model to these data.
%% Cell type:markdown id:7ce19023 tags:
## A
%% Cell type:code id:7dec1d91 tags:
``` python
```
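The auxiliary regression of the first part can be sketched as follows, on synthetic stand-in data (assumption). With two explanatory variables, the design contains the two variables, their interaction, and their squares:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

fit = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()

# Squared residuals as the response of the auxiliary regression
df['sq_resid'] = fit.resid ** 2
aux = smf.ols('sq_resid ~ HeartRate * logPhysicalActivity'
              ' + I(HeartRate ** 2) + I(logPhysicalActivity ** 2)', data=df).fit()
print(aux.summary())
```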
%% Cell type:markdown id:b588b5c1 tags:
## Q
Second part:
* get the coefficient of determination $R^2$,
* get the sample size $n$,
* set the number $k$ of degrees of freedom to the number of predictors in the auxiliary regression (intercept excluded).

The test is:

$$H_0:\; nR^2 \sim \chi^2_{k} \qquad H_A:\; nR^2 > \chi^2_{k,\,1-\alpha}$$

You do not necessarily need to compute the critical value; just note the test is one-sided.
Compute the statistic $nR^2$ and the resulting $p$-value.
Compute the statistic $nR^2$ and the resulting $p$-value.
%% Cell type:markdown id:5224316d tags:
## A
%% Cell type:code id:1bac9cde tags:
``` python
```
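The second part can be sketched as follows (self-contained, so the auxiliary regression of the first part is repeated; synthetic stand-in data is an assumption). The statistic $nR^2$ is compared to the upper tail of $\chi^2_k$:

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# Synthetic stand-in for mi.csv (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'HeartRate': rng.normal(70, 10, n),
                   'PhysicalActivity': rng.lognormal(1.0, 1.2, n)})
df['logPhysicalActivity'] = np.log1p(df['PhysicalActivity'])
df['Temperature'] = (36.5 + 0.01 * (df['HeartRate'] - 70)
                     + 0.05 * df['logPhysicalActivity'] + rng.normal(0, 0.2, n))

fit = smf.ols('Temperature ~ HeartRate + logPhysicalActivity', data=df).fit()
df['sq_resid'] = fit.resid ** 2
aux = smf.ols('sq_resid ~ HeartRate * logPhysicalActivity'
              ' + I(HeartRate ** 2) + I(logPhysicalActivity ** 2)', data=df).fit()

n_obs = int(aux.nobs)            # sample size
r2 = aux.rsquared                # coefficient of determination
k = int(aux.df_model)            # number of predictors, intercept excluded
white_stat = n_obs * r2
pvalue = stats.chi2.sf(white_stat, k)  # one-sided: upper tail of chi2_k
print(f'White statistic nR^2 = {white_stat:.3f}, p = {pvalue:.4g} (k = {k})')
```

The result should closely match what `statsmodels`' `het_white` reports on the same model.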