diff --git a/notebooks/statsmodels_TP.ipynb b/notebooks/statsmodels_TP.ipynb index d876b1a113039f10883c8cec8b9226493fe0527f..256c2a7dcdc33957d374695e65f5a77deee3b38d 100644 --- a/notebooks/statsmodels_TP.ipynb +++ b/notebooks/statsmodels_TP.ipynb @@ -2,8 +2,8 @@ "cells": [ { "cell_type": "code", - "execution_count": null, - "id": "d5065981", + "execution_count": 1, + "id": "4e16caf7", "metadata": {}, "outputs": [], "source": [ @@ -22,7 +22,7 @@ }, { "cell_type": "markdown", - "id": "4ddd8902", + "id": "358dce7a", "metadata": {}, "source": [ "# Multi-way ANOVA" @@ -30,7 +30,7 @@ }, { "cell_type": "markdown", - "id": "9c4680b9", + "id": "a1face9f", "metadata": { "heading_collapsed": true }, @@ -42,7 +42,7 @@ }, { "cell_type": "markdown", - "id": "d2d4fa47", + "id": "965cc75c", "metadata": {}, "source": [ "## A" @@ -51,14 +51,14 @@ { "cell_type": "code", "execution_count": null, - "id": "ae476b31", + "id": "7eb30cc9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "b32ced14", + "id": "154c1460", "metadata": { "heading_collapsed": true }, @@ -69,12 +69,12 @@ "\n", "Treat `Pclass` and `Sex` as factors.\n", "\n", - "Print an ANOVA table." + "Print an ANOVA table for the different types of sums of squares."
] }, { "cell_type": "markdown", - "id": "24492fad", + "id": "3ad89dec", "metadata": {}, "source": [ "## A" @@ -83,30 +83,56 @@ { "cell_type": "code", "execution_count": null, - "id": "0ad3c464", + "id": "39f58924", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "ded672e8", - "metadata": { - "heading_collapsed": true - }, + "id": "ecf3bcd9", + "metadata": {}, "source": [ "## Q\n", "\n", - "Let us ignore the not-normal residuals and play with post-hoc tests instead.\n", + "Because we have a large sample, we will ignore the not-normal residuals and play with post-hoc tests instead.\n", "\n", - "Split the ANOVA for levels of `Pclass` and `Sex`, perform all pairwise comparisons if it make sense, and correct for multiple comparisons.\n", + "First proceed considering type-3 sums of squares." + ] + }, + { + "cell_type": "markdown", + "id": "63677a0d", + "metadata": {}, + "source": [ + "## A" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09794922", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "d30cdb9c", + "metadata": {}, + "source": [ + "## Q\n", + "\n", + "Let us suppose we want to use type-1 sums of squares instead.\n", "\n", - "We are not interested in the significance of the slope of `Age` for the different levels of the factors." + "Fit the model again with `Sex` first, `Pclass` second, and `Age` last.\n", + "\n", + "In the post-hoc comparisons, we will disregard the effect of the slope of `Age`."
] }, { "cell_type": "markdown", - "id": "00ace9f5", + "id": "7a417d76", "metadata": {}, "source": [ "## A" @@ -115,14 +141,14 @@ { "cell_type": "code", "execution_count": null, - "id": "9f6645bb", + "id": "034acc94", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "9a863196", + "id": "b7b20012", "metadata": {}, "source": [ "# Linear model with multiple variables" @@ -130,21 +156,19 @@ }, { "cell_type": "markdown", - "id": "ef97cb7f", + "id": "144b0584", "metadata": { "heading_collapsed": true }, "source": [ "## Q\n", "\n", - "Load the `mi.csv` file and plot the variables `Temperature`, `HeartRate` and `PhysicalActivity`.\n", - "\n", - "We will try to «explain» `Temperature` from `HeartRate` and `PhysicalActivity`." + "Load the `mi.csv` file and plot the variables `Temperature` vs `HeartRate` and `PhysicalActivity`." ] }, { "cell_type": "markdown", - "id": "c66b1c3b", + "id": "35949307", "metadata": {}, "source": [ "## A" @@ -153,29 +177,25 @@ { "cell_type": "code", "execution_count": null, - "id": "4bdeea0c", + "id": "47cf88a3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "358d3903", - "metadata": { - "heading_collapsed": true - }, + "id": "62449fb6", + "metadata": {}, "source": [ "## Q" ] }, { "cell_type": "markdown", - "id": "f3003272", - "metadata": { - "hidden": true - }, + "id": "c71d2a4c", + "metadata": {}, "source": [ - "The `PhysicalActivity` variable exhibit a long-tail distribution. This is usually undesirable for an explanatory variable, because we cannot densely sample a large part of its domain of possible values, and therefore a model based on the data cannot be reliable.\n", + "The `PhysicalActivity` variable is very asymmetric. 
This is usually undesirable for an explanatory variable, because we cannot densely sample a large part of its domain of possible values, and therefore a model based on the data cannot be reliable.\n", "\n", "We will proceed to transforming `PhysicalActivity` using a simple natural logarithm. `log` is undefined at $0$ and tends to the infinite near $0$, which renders its straightforward application to `PhysicalActivity` inappropriate. Therefore we will also add $1$ to the `PhysicalActivity` measurements prior to applying `log`.\n", "\n", @@ -184,7 +204,7 @@ }, { "cell_type": "markdown", - "id": "34dff28d", + "id": "21e8879e", "metadata": {}, "source": [ "## A" @@ -193,14 +213,14 @@ { "cell_type": "code", "execution_count": null, - "id": "16e7787e", + "id": "0b8cac58", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "7887928e", + "id": "2c6d5225", "metadata": { "heading_collapsed": true }, @@ -214,7 +234,7 @@ }, { "cell_type": "markdown", - "id": "b9ed6879", + "id": "c47489e6", "metadata": {}, "source": [ "## A" @@ -223,14 +243,14 @@ { "cell_type": "code", "execution_count": null, - "id": "1044c8c6", + "id": "bf32c7e6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "49408adc", + "id": "1c7871a2", "metadata": { "heading_collapsed": true }, @@ -244,10 +264,8 @@ }, { "cell_type": "markdown", - "id": "1b977a53", - "metadata": { - "heading_collapsed": true - }, + "id": "12d7e9a5", + "metadata": {}, "source": [ "## A (with nested Q&A)" ] @@ -255,32 +273,20 @@ { "cell_type": "code", "execution_count": null, - "id": "b891fa3e", - "metadata": { - "hidden": true - }, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c469b3d5", - "metadata": { - "hidden": true - }, + "id": "82e9f8d6", + "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "475eb53d", - "metadata": { - "hidden": true - }, + "id": "66ddaa3e", + "metadata": {}, 
"source": [ "### Q\n", "\n", + "The interaction term is not significant, but most importantly, the increase in log-likelihood is very small; the interaction term does not help to better fit the model to the data.\n", + "\n", "To get a better intuition about the log-likelihood, plot it (with a dot plot) for different models, with one variable, with two variables, with and without interaction.\n", "\n", "Feel free to introduce one or two extra explanatory variables such as `BMI`." @@ -288,10 +294,8 @@ }, { "cell_type": "markdown", - "id": "f7a51e03", - "metadata": { - "hidden": true - }, + "id": "05844c56", + "metadata": {}, "source": [ "### A" ] @@ -299,16 +303,14 @@ { "cell_type": "code", "execution_count": null, - "id": "3a4295ad", - "metadata": { - "hidden": true - }, + "id": "2588d8d6", + "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "2474a218", + "id": "665a7a5c", "metadata": {}, "source": [ "# White test for homoscedasticity" @@ -316,7 +318,7 @@ }, { "cell_type": "markdown", - "id": "1bf04c2f", + "id": "6d0bdd7b", "metadata": {}, "source": [ "To keep things simple, let us use the `'Heart + PhysicalActivity'` or `'Heart + logPhysicalActivity'`.\n", @@ -328,7 +330,7 @@ }, { "cell_type": "markdown", - "id": "77c0d822", + "id": "a6f37b2e", "metadata": {}, "source": [ "## A" @@ -337,14 +339,14 @@ { "cell_type": "code", "execution_count": null, - "id": "4d7f9d1f", + "id": "77e350be", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "20fdd05b", + "id": "789bd3f7", "metadata": { "heading_collapsed": true }, @@ -354,12 +356,12 @@ "We will further inspect the residuals for heteroscedasticity, using the [White test](https://itfeature.com/heteroscedasticity/white-test-for-heteroskedasticity).\n", "\n", "`statsmodels` features an implementation of this test, but the [documentation](https://www.statsmodels.org/stable/generated/statsmodels.stats.diagnostic.het_white.html) is scarce on details.\n", - "Try to 
apply the `het_white` function, but do not feel ashamed if you fail." + "Try to apply the `het_white` function (if you can 'x))." ] }, { "cell_type": "markdown", - "id": "34e7a050", + "id": "0a822074", "metadata": {}, "source": [ "## A" @@ -368,14 +370,14 @@ { "cell_type": "code", "execution_count": null, - "id": "5de56d54", + "id": "85d73982", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "98ca812e", + "id": "e3ccb464", "metadata": { "heading_collapsed": true }, @@ -393,7 +395,7 @@ }, { "cell_type": "markdown", - "id": "3ebf1176", + "id": "7ce19023", "metadata": {}, "source": [ "## A" @@ -402,15 +404,17 @@ { "cell_type": "code", "execution_count": null, - "id": "46c53e4e", + "id": "7dec1d91", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "46bd2a33", - "metadata": {}, + "id": "b588b5c1", + "metadata": { + "heading_collapsed": true + }, "source": [ "## Q\n", "\n", @@ -434,7 +438,7 @@ }, { "cell_type": "markdown", - "id": "374e25eb", + "id": "5224316d", "metadata": {}, "source": [ "## A" @@ -443,7 +447,7 @@ { "cell_type": "code", "execution_count": null, - "id": "08216a83", + "id": "1bac9cde", "metadata": {}, "outputs": [], "source": [] diff --git a/notebooks/statsmodels_TP_solutions.ipynb b/notebooks/statsmodels_TP_solutions.ipynb index 83257005c05c6324f1eac1dd6533be8e5c884aab..1e094a2a7705815459e26fbf5cb01c58f8dc8e14 100644 --- a/notebooks/statsmodels_TP_solutions.ipynb +++ b/notebooks/statsmodels_TP_solutions.ipynb @@ -43,16 +43,20 @@ { "cell_type": "markdown", "id": "965cc75c", - "metadata": {}, + "metadata": { + "heading_collapsed": true + }, "source": [ "## A" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 56, "id": "7eb30cc9", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -192,7 +196,7 @@ "4 0 373450 8.0500 NaN S " ] }, - "execution_count": 2, + "execution_count": 56, "metadata": {}, "output_type": "execute_result" } @@ -204,9 
+208,11 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 57, "id": "17f37d48", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -214,7 +220,7 @@ "(891, 12)" ] }, - "execution_count": 3, + "execution_count": 57, "metadata": {}, "output_type": "execute_result" } @@ -225,9 +231,11 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 58, "id": "31c6e99b", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ "df['LogFare'] = np.log(1+df['Fare'])" @@ -235,13 +243,15 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 59, "id": "e5b46f36", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] @@ -258,9 +268,11 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 60, "id": "7cf9b21a", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -292,32 +304,38 @@ "\n", "Treat `Pclass` and `Sex` as factors.\n", "\n", - "Print an ANOVA table." + "Print an ANOVA table for the different types of sums of squares."
] }, { "cell_type": "markdown", "id": "3ad89dec", - "metadata": {}, + "metadata": { + "heading_collapsed": true + }, "source": [ "## A" ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 63, "id": "39f58924", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ - "model = smf.ols('LogFare ~ Age * C(Pclass) * C(Sex)', df).fit()" + "model = smf.ols('LogFare ~ Age * C(Sex) * C(Pclass)', df).fit()" ] }, { "cell_type": "code", - "execution_count": 8, - "id": "bc136eb5", - "metadata": {}, + "execution_count": 66, + "id": "f09c5fea", + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -340,21 +358,140 @@ " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", - " <th>sum_sq</th>\n", " <th>df</th>\n", + " <th>sum_sq</th>\n", + " <th>mean_sq</th>\n", " <th>F</th>\n", " <th>PR(>F)</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", + " <th>C(Sex)</th>\n", + " <td>1.0</td>\n", + " <td>46.721986</td>\n", + " <td>46.721986</td>\n", + " <td>126.001800</td>\n", + " <td>5.241061e-27</td>\n", + " </tr>\n", + " <tr>\n", " <th>C(Pclass)</th>\n", - " <td>318.183365</td>\n", " <td>2.0</td>\n", - " <td>429.045083</td>\n", - " <td>1.856872e-122</td>\n", + " <td>318.408489</td>\n", + " <td>159.204245</td>\n", + " <td>429.348645</td>\n", + " <td>1.619836e-122</td>\n", " </tr>\n", " <tr>\n", + " <th>C(Sex):C(Pclass)</th>\n", + " <td>2.0</td>\n", + " <td>5.990345</td>\n", + " <td>2.995172</td>\n", + " <td>8.077506</td>\n", + " <td>3.402042e-04</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Age</th>\n", + " <td>1.0</td>\n", + " <td>12.274132</td>\n", + " <td>12.274132</td>\n", + " <td>33.101391</td>\n", + " <td>1.306440e-08</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Age:C(Sex)</th>\n", + " <td>1.0</td>\n", + " <td>1.486269</td>\n", + " <td>1.486269</td>\n", + " <td>4.008232</td>\n", + " <td>4.566288e-02</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Age:C(Pclass)</th>\n", + " <td>2.0</td>\n", + " 
<td>0.296834</td>\n", + " <td>0.148417</td>\n", + " <td>0.400257</td>\n", + " <td>6.703004e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Age:C(Sex):C(Pclass)</th>\n", + " <td>2.0</td>\n", + " <td>1.335653</td>\n", + " <td>0.667827</td>\n", + " <td>1.801023</td>\n", + " <td>1.658921e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Residual</th>\n", + " <td>702.0</td>\n", + " <td>260.304489</td>\n", + " <td>0.370804</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " df sum_sq mean_sq F PR(>F)\n", + "C(Sex) 1.0 46.721986 46.721986 126.001800 5.241061e-27\n", + "C(Pclass) 2.0 318.408489 159.204245 429.348645 1.619836e-122\n", + "C(Sex):C(Pclass) 2.0 5.990345 2.995172 8.077506 3.402042e-04\n", + "Age 1.0 12.274132 12.274132 33.101391 1.306440e-08\n", + "Age:C(Sex) 1.0 1.486269 1.486269 4.008232 4.566288e-02\n", + "Age:C(Pclass) 2.0 0.296834 0.148417 0.400257 6.703004e-01\n", + "Age:C(Sex):C(Pclass) 2.0 1.335653 0.667827 1.801023 1.658921e-01\n", + "Residual 702.0 260.304489 0.370804 NaN NaN" + ] + }, + "execution_count": 66, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sm.stats.anova_lm(model, typ=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "cf4e3edc", + "metadata": { + "hidden": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>sum_sq</th>\n", + " <th>df</th>\n", + " <th>F</th>\n", + " <th>PR(>F)</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", " <th>C(Sex)</th>\n", " 
<td>12.597516</td>\n", " <td>1.0</td>\n", @@ -362,7 +499,14 @@ " <td>8.516312e-09</td>\n", " </tr>\n", " <tr>\n", - " <th>C(Pclass):C(Sex)</th>\n", + " <th>C(Pclass)</th>\n", + " <td>318.183365</td>\n", + " <td>2.0</td>\n", + " <td>429.045083</td>\n", + " <td>1.856872e-122</td>\n", + " </tr>\n", + " <tr>\n", + " <th>C(Sex):C(Pclass)</th>\n", " <td>3.489752</td>\n", " <td>2.0</td>\n", " <td>4.705654</td>\n", @@ -376,13 +520,6 @@ " <td>1.306440e-08</td>\n", " </tr>\n", " <tr>\n", - " <th>Age:C(Pclass)</th>\n", - " <td>0.296834</td>\n", - " <td>2.0</td>\n", - " <td>0.400257</td>\n", - " <td>6.703004e-01</td>\n", - " </tr>\n", - " <tr>\n", " <th>Age:C(Sex)</th>\n", " <td>1.421046</td>\n", " <td>1.0</td>\n", @@ -390,7 +527,14 @@ " <td>5.066866e-02</td>\n", " </tr>\n", " <tr>\n", - " <th>Age:C(Pclass):C(Sex)</th>\n", + " <th>Age:C(Pclass)</th>\n", + " <td>0.296834</td>\n", + " <td>2.0</td>\n", + " <td>0.400257</td>\n", + " <td>6.703004e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Age:C(Sex):C(Pclass)</th>\n", " <td>1.335653</td>\n", " <td>2.0</td>\n", " <td>1.801023</td>\n", @@ -409,17 +553,17 @@ ], "text/plain": [ " sum_sq df F PR(>F)\n", - "C(Pclass) 318.183365 2.0 429.045083 1.856872e-122\n", "C(Sex) 12.597516 1.0 33.973508 8.516312e-09\n", - "C(Pclass):C(Sex) 3.489752 2.0 4.705654 9.331214e-03\n", + "C(Pclass) 318.183365 2.0 429.045083 1.856872e-122\n", + "C(Sex):C(Pclass) 3.489752 2.0 4.705654 9.331214e-03\n", "Age 12.274132 1.0 33.101391 1.306440e-08\n", - "Age:C(Pclass) 0.296834 2.0 0.400257 6.703004e-01\n", "Age:C(Sex) 1.421046 1.0 3.832335 5.066866e-02\n", - "Age:C(Pclass):C(Sex) 1.335653 2.0 1.801023 1.658921e-01\n", + "Age:C(Pclass) 0.296834 2.0 0.400257 6.703004e-01\n", + "Age:C(Sex):C(Pclass) 1.335653 2.0 1.801023 1.658921e-01\n", "Residual 260.304489 702.0 NaN NaN" ] }, - "execution_count": 8, + "execution_count": 67, "metadata": {}, "output_type": "execute_result" } @@ -428,10 +572,137 @@ "sm.stats.anova_lm(model, typ=2)" ] }, + { + "cell_type": "code", 
+ "execution_count": 68, + "id": "bc136eb5", + "metadata": { + "hidden": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>sum_sq</th>\n", + " <th>df</th>\n", + " <th>F</th>\n", + " <th>PR(>F)</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>Intercept</th>\n", + " <td>252.996076</td>\n", + " <td>1.0</td>\n", + " <td>682.290368</td>\n", + " <td>1.334116e-105</td>\n", + " </tr>\n", + " <tr>\n", + " <th>C(Sex)</th>\n", + " <td>0.819888</td>\n", + " <td>1.0</td>\n", + " <td>2.211108</td>\n", + " <td>1.374692e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>C(Pclass)</th>\n", + " <td>31.940629</td>\n", + " <td>2.0</td>\n", + " <td>43.069411</td>\n", + " <td>2.273902e-18</td>\n", + " </tr>\n", + " <tr>\n", + " <th>C(Sex):C(Pclass)</th>\n", + " <td>1.012273</td>\n", + " <td>2.0</td>\n", + " <td>1.364970</td>\n", + " <td>2.560654e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Age</th>\n", + " <td>0.776183</td>\n", + " <td>1.0</td>\n", + " <td>2.093243</td>\n", + " <td>1.483980e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Age:C(Sex)</th>\n", + " <td>0.179743</td>\n", + " <td>1.0</td>\n", + " <td>0.484739</td>\n", + " <td>4.865140e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Age:C(Pclass)</th>\n", + " <td>0.437814</td>\n", + " <td>2.0</td>\n", + " <td>0.590357</td>\n", + " <td>5.544041e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Age:C(Sex):C(Pclass)</th>\n", + " <td>1.335653</td>\n", + " <td>2.0</td>\n", + " <td>1.801023</td>\n", + " <td>1.658921e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Residual</th>\n", 
+ " <td>260.304489</td>\n", + " <td>702.0</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " sum_sq df F PR(>F)\n", + "Intercept 252.996076 1.0 682.290368 1.334116e-105\n", + "C(Sex) 0.819888 1.0 2.211108 1.374692e-01\n", + "C(Pclass) 31.940629 2.0 43.069411 2.273902e-18\n", + "C(Sex):C(Pclass) 1.012273 2.0 1.364970 2.560654e-01\n", + "Age 0.776183 1.0 2.093243 1.483980e-01\n", + "Age:C(Sex) 0.179743 1.0 0.484739 4.865140e-01\n", + "Age:C(Pclass) 0.437814 2.0 0.590357 5.544041e-01\n", + "Age:C(Sex):C(Pclass) 1.335653 2.0 1.801023 1.658921e-01\n", + "Residual 260.304489 702.0 NaN NaN" + ] + }, + "execution_count": 68, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sm.stats.anova_lm(model, typ=3)" + ] + }, { "cell_type": "markdown", "id": "0e98e30e", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "We can also have a look at the summary tables. This reveals that the residuals are not normally distributed." ] @@ -440,7 +711,9 @@ "cell_type": "code", "execution_count": 9, "id": "8d3ea24c", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -586,67 +859,308 @@ { "cell_type": "markdown", "id": "ecf3bcd9", - "metadata": { - "heading_collapsed": true - }, + "metadata": {}, "source": [ "## Q\n", "\n", - "Let us ignore the not-normal residuals and play with post-hoc tests instead.\n", - "\n", - "Split the ANOVA for levels of `Pclass` and `Sex`, perform all pairwise comparisons if it make sense, and correct for multiple comparisons.\n", + "Because we have a large sample, we will ignore the not-normal residuals and play with post-hoc tests instead.\n", "\n", - "We are not interested in the significance of the slope of `Age` for the different levels of the factors." + "First proceed considering type-3 sums of squares." 
] }, { "cell_type": "markdown", - "id": "7a417d76", - "metadata": {}, + "id": "06c5bc0c", + "metadata": { + "heading_collapsed": true + }, "source": [ "## A" ] }, + { + "cell_type": "markdown", + "id": "f203eea0", + "metadata": { + "hidden": true + }, + "source": [ + "We found a single main effect and no interaction. Therefore a single call to `t_test_pairwise` is enough." + ] + }, { "cell_type": "code", - "execution_count": 10, - "id": "b81407bd", + "execution_count": 69, + "id": "248c081c", + "metadata": { + "hidden": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>coef</th>\n", + " <th>std err</th>\n", + " <th>t</th>\n", + " <th>P>|t|</th>\n", + " <th>Conf. Int. Low</th>\n", + " <th>Conf. Int. 
Upp.</th>\n", + " <th>pvalue-hs</th>\n", + " <th>reject-hs</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>2-1</th>\n", + " <td>-1.460550</td>\n", + " <td>0.251403</td>\n", + " <td>-5.809591</td>\n", + " <td>9.496387e-09</td>\n", + " <td>-1.954142</td>\n", + " <td>-0.966958</td>\n", + " <td>1.899277e-08</td>\n", + " <td>True</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3-1</th>\n", + " <td>-2.016639</td>\n", + " <td>0.217384</td>\n", + " <td>-9.276845</td>\n", + " <td>2.121528e-19</td>\n", + " <td>-2.443439</td>\n", + " <td>-1.589838</td>\n", + " <td>0.000000e+00</td>\n", + " <td>True</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3-2</th>\n", + " <td>-0.556089</td>\n", + " <td>0.211313</td>\n", + " <td>-2.631590</td>\n", + " <td>8.685267e-03</td>\n", + " <td>-0.970969</td>\n", + " <td>-0.141208</td>\n", + " <td>8.685267e-03</td>\n", + " <td>True</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " coef std err t P>|t| Conf. Int. Low \\\n", + "2-1 -1.460550 0.251403 -5.809591 9.496387e-09 -1.954142 \n", + "3-1 -2.016639 0.217384 -9.276845 2.121528e-19 -2.443439 \n", + "3-2 -0.556089 0.211313 -2.631590 8.685267e-03 -0.970969 \n", + "\n", + " Conf. Int. Upp. 
pvalue-hs reject-hs \n", + "2-1 -0.966958 1.899277e-08 True \n", + "3-1 -1.589838 0.000000e+00 True \n", + "3-2 -0.141208 8.685267e-03 True " + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.t_test_pairwise('C(Pclass)').result_frame" + ] + }, + { + "cell_type": "markdown", + "id": "dc0a30ce", "metadata": {}, - "outputs": [], "source": [ - "class1_model = smf.ols(data=df, formula='Fare ~ Age * C(Sex)', subset=df['Pclass']==1).fit()\n", - "class2_model = smf.ols(data=df, formula='Fare ~ Age * C(Sex)', subset=df['Pclass']==2).fit()\n", - "class3_model = smf.ols(data=df, formula='Fare ~ Age * C(Sex)', subset=df['Pclass']==3).fit()\n", - "female_model = smf.ols(data=df, formula='Fare ~ Age * C(Pclass)', subset=df['Sex']=='female').fit()\n", - "male_model = smf.ols(data=df, formula='Fare ~ Age * C(Pclass)', subset=df['Sex']=='male').fit()" + "## Q\n", + "\n", + "Let us suppose we want to use type-1 sums of squares instead.\n", + "\n", + "Fit the model again with `Sex` first, `Pclass` second, and `Age` last.\n", + "\n", + "In the post-hoc comparisons, we will disregard the effect of the slope of `Age`."
+ ] + }, + { + "cell_type": "markdown", + "id": "7a417d76", + "metadata": { + "heading_collapsed": true + }, + "source": [ + "## A" ] }, { "cell_type": "code", - "execution_count": 11, - "id": "cafed8b7", - "metadata": {}, + "execution_count": 78, + "id": "9466f266", + "metadata": { + "hidden": true + }, "outputs": [ { "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>df</th>\n", + " <th>sum_sq</th>\n", + " <th>mean_sq</th>\n", + " <th>F</th>\n", + " <th>PR(>F)</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>C(Sex)</th>\n", + " <td>1.0</td>\n", + " <td>46.721986</td>\n", + " <td>46.721986</td>\n", + " <td>126.001800</td>\n", + " <td>5.241061e-27</td>\n", + " </tr>\n", + " <tr>\n", + " <th>C(Pclass)</th>\n", + " <td>2.0</td>\n", + " <td>318.408489</td>\n", + " <td>159.204245</td>\n", + " <td>429.348645</td>\n", + " <td>1.619836e-122</td>\n", + " </tr>\n", + " <tr>\n", + " <th>C(Sex):C(Pclass)</th>\n", + " <td>2.0</td>\n", + " <td>5.990345</td>\n", + " <td>2.995172</td>\n", + " <td>8.077506</td>\n", + " <td>3.402042e-04</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Age</th>\n", + " <td>1.0</td>\n", + " <td>12.274132</td>\n", + " <td>12.274132</td>\n", + " <td>33.101391</td>\n", + " <td>1.306440e-08</td>\n", + " </tr>\n", + " <tr>\n", + " <th>C(Sex):Age</th>\n", + " <td>1.0</td>\n", + " <td>1.486269</td>\n", + " <td>1.486269</td>\n", + " <td>4.008232</td>\n", + " <td>4.566288e-02</td>\n", + " </tr>\n", + " <tr>\n", + " <th>C(Pclass):Age</th>\n", + " <td>2.0</td>\n", + " <td>0.296834</td>\n", + " <td>0.148417</td>\n", + " <td>0.400257</td>\n", + " 
<td>6.703004e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>C(Sex):C(Pclass):Age</th>\n", + " <td>2.0</td>\n", + " <td>1.335653</td>\n", + " <td>0.667827</td>\n", + " <td>1.801023</td>\n", + " <td>1.658921e-01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Residual</th>\n", + " <td>702.0</td>\n", + " <td>260.304489</td>\n", + " <td>0.370804</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], "text/plain": [ - "((186, 4), (173, 4), (355, 4))" + " df sum_sq mean_sq F PR(>F)\n", + "C(Sex) 1.0 46.721986 46.721986 126.001800 5.241061e-27\n", + "C(Pclass) 2.0 318.408489 159.204245 429.348645 1.619836e-122\n", + "C(Sex):C(Pclass) 2.0 5.990345 2.995172 8.077506 3.402042e-04\n", + "Age 1.0 12.274132 12.274132 33.101391 1.306440e-08\n", + "C(Sex):Age 1.0 1.486269 1.486269 4.008232 4.566288e-02\n", + "C(Pclass):Age 2.0 0.296834 0.148417 0.400257 6.703004e-01\n", + "C(Sex):C(Pclass):Age 2.0 1.335653 0.667827 1.801023 1.658921e-01\n", + "Residual 702.0 260.304489 0.370804 NaN NaN" ] }, - "execution_count": 11, + "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "class1_model.model.exog.shape, class2_model.model.exog.shape, class3_model.model.exog.shape" + "sstype = 1\n", + "model = smf.ols('LogFare ~ C(Sex) * C(Pclass) * Age', df).fit()\n", + "sm.stats.anova_lm(model)" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "b81407bd", + "metadata": { + "hidden": true + }, + "outputs": [], + "source": [ + "class1_model = smf.ols(data=df, formula='Fare ~ C(Sex) * Age', subset=df['Pclass']==1).fit()\n", + "class2_model = smf.ols(data=df, formula='Fare ~ C(Sex) * Age', subset=df['Pclass']==2).fit()\n", + "class3_model = smf.ols(data=df, formula='Fare ~ C(Sex) * Age', subset=df['Pclass']==3).fit()\n", + "female_model = smf.ols(data=df, formula='Fare ~ C(Pclass) * Age', subset=df['Sex']=='female').fit()\n", + "male_model = smf.ols(data=df, formula='Fare ~ C(Pclass) * 
Age', subset=df['Sex']=='male').fit()" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 89, "id": "7a8fb3ff", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -669,8 +1183,9 @@ " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", - " <th>sum_sq</th>\n", " <th>df</th>\n", + " <th>sum_sq</th>\n", + " <th>mean_sq</th>\n", " <th>F</th>\n", " <th>PR(>F)</th>\n", " </tr>\n", @@ -678,29 +1193,33 @@ " <tbody>\n", " <tr>\n", " <th>C(Sex)</th>\n", - " <td>4.043594e+04</td>\n", " <td>1.0</td>\n", - " <td>6.627354</td>\n", - " <td>0.010838</td>\n", + " <td>6.251806e+04</td>\n", + " <td>62518.055251</td>\n", + " <td>10.246561</td>\n", + " <td>0.001616</td>\n", " </tr>\n", " <tr>\n", " <th>Age</th>\n", - " <td>3.572113e+04</td>\n", " <td>1.0</td>\n", + " <td>3.572113e+04</td>\n", + " <td>35721.126820</td>\n", " <td>5.854608</td>\n", " <td>0.016520</td>\n", " </tr>\n", " <tr>\n", - " <th>Age:C(Sex)</th>\n", - " <td>8.202655e+02</td>\n", + " <th>C(Sex):Age</th>\n", " <td>1.0</td>\n", + " <td>8.202655e+02</td>\n", + " <td>820.265529</td>\n", " <td>0.134440</td>\n", " <td>0.714299</td>\n", " </tr>\n", " <tr>\n", " <th>Residual</th>\n", - " <td>1.110449e+06</td>\n", " <td>182.0</td>\n", + " <td>1.110449e+06</td>\n", + " <td>6101.369705</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", @@ -709,27 +1228,29 @@ "</div>" ], "text/plain": [ - " sum_sq df F PR(>F)\n", - "C(Sex) 4.043594e+04 1.0 6.627354 0.010838\n", - "Age 3.572113e+04 1.0 5.854608 0.016520\n", - "Age:C(Sex) 8.202655e+02 1.0 0.134440 0.714299\n", - "Residual 1.110449e+06 182.0 NaN NaN" + " df sum_sq mean_sq F PR(>F)\n", + "C(Sex) 1.0 6.251806e+04 62518.055251 10.246561 0.001616\n", + "Age 1.0 3.572113e+04 35721.126820 5.854608 0.016520\n", + "C(Sex):Age 1.0 8.202655e+02 820.265529 0.134440 0.714299\n", + "Residual 182.0 1.110449e+06 6101.369705 NaN NaN" ] }, - "execution_count": 12, + "execution_count": 89, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ - "sm.stats.anova_lm(class1_model, typ=2)" + "sm.stats.anova_lm(class1_model, typ=sstype)" ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 90, "id": "c5285b28", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -752,8 +1273,9 @@ " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", - " <th>sum_sq</th>\n", " <th>df</th>\n", + " <th>sum_sq</th>\n", + " <th>mean_sq</th>\n", " <th>F</th>\n", " <th>PR(>F)</th>\n", " </tr>\n", @@ -761,29 +1283,33 @@ " <tbody>\n", " <tr>\n", " <th>C(Sex)</th>\n", - " <td>9.143118</td>\n", " <td>1.0</td>\n", - " <td>0.053774</td>\n", - " <td>0.816901</td>\n", + " <td>29.733469</td>\n", + " <td>29.733469</td>\n", + " <td>0.174875</td>\n", + " <td>0.676346</td>\n", " </tr>\n", " <tr>\n", " <th>Age</th>\n", - " <td>1140.725330</td>\n", " <td>1.0</td>\n", + " <td>1140.725330</td>\n", + " <td>1140.725330</td>\n", " <td>6.709083</td>\n", " <td>0.010430</td>\n", " </tr>\n", " <tr>\n", - " <th>Age:C(Sex)</th>\n", - " <td>7.201063</td>\n", + " <th>C(Sex):Age</th>\n", " <td>1.0</td>\n", + " <td>7.201063</td>\n", + " <td>7.201063</td>\n", " <td>0.042352</td>\n", " <td>0.837197</td>\n", " </tr>\n", " <tr>\n", " <th>Residual</th>\n", - " <td>28734.566043</td>\n", " <td>169.0</td>\n", + " <td>28734.566043</td>\n", + " <td>170.027018</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", @@ -792,27 +1318,29 @@ "</div>" ], "text/plain": [ - " sum_sq df F PR(>F)\n", - "C(Sex) 9.143118 1.0 0.053774 0.816901\n", - "Age 1140.725330 1.0 6.709083 0.010430\n", - "Age:C(Sex) 7.201063 1.0 0.042352 0.837197\n", - "Residual 28734.566043 169.0 NaN NaN" + " df sum_sq mean_sq F PR(>F)\n", + "C(Sex) 1.0 29.733469 29.733469 0.174875 0.676346\n", + "Age 1.0 1140.725330 1140.725330 6.709083 0.010430\n", + "C(Sex):Age 1.0 7.201063 7.201063 0.042352 0.837197\n", + "Residual 169.0 28734.566043 170.027018 NaN NaN" ] }, - "execution_count": 13, + "execution_count": 
90, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "sm.stats.anova_lm(class2_model, typ=2)" + "sm.stats.anova_lm(class2_model, typ=sstype)" ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 91, "id": "ed33eab1", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -835,8 +1363,9 @@ " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", - " <th>sum_sq</th>\n", " <th>df</th>\n", + " <th>sum_sq</th>\n", + " <th>mean_sq</th>\n", " <th>F</th>\n", " <th>PR(>F)</th>\n", " </tr>\n", @@ -844,29 +1373,33 @@ " <tbody>\n", " <tr>\n", " <th>C(Sex)</th>\n", - " <td>553.194455</td>\n", " <td>1.0</td>\n", - " <td>6.099022</td>\n", - " <td>0.014001</td>\n", + " <td>1001.995627</td>\n", + " <td>1001.995627</td>\n", + " <td>11.047098</td>\n", + " <td>0.000982</td>\n", " </tr>\n", " <tr>\n", " <th>Age</th>\n", - " <td>1970.785958</td>\n", " <td>1.0</td>\n", + " <td>1970.785958</td>\n", + " <td>1970.785958</td>\n", " <td>21.728105</td>\n", " <td>0.000004</td>\n", " </tr>\n", " <tr>\n", - " <th>Age:C(Sex)</th>\n", - " <td>896.982415</td>\n", + " <th>C(Sex):Age</th>\n", " <td>1.0</td>\n", + " <td>896.982415</td>\n", + " <td>896.982415</td>\n", " <td>9.889317</td>\n", " <td>0.001804</td>\n", " </tr>\n", " <tr>\n", " <th>Residual</th>\n", - " <td>31836.456663</td>\n", " <td>351.0</td>\n", + " <td>31836.456663</td>\n", + " <td>90.702156</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", @@ -875,27 +1408,29 @@ "</div>" ], "text/plain": [ - " sum_sq df F PR(>F)\n", - "C(Sex) 553.194455 1.0 6.099022 0.014001\n", - "Age 1970.785958 1.0 21.728105 0.000004\n", - "Age:C(Sex) 896.982415 1.0 9.889317 0.001804\n", - "Residual 31836.456663 351.0 NaN NaN" + " df sum_sq mean_sq F PR(>F)\n", + "C(Sex) 1.0 1001.995627 1001.995627 11.047098 0.000982\n", + "Age 1.0 1970.785958 1970.785958 21.728105 0.000004\n", + "C(Sex):Age 1.0 896.982415 896.982415 9.889317 0.001804\n", + "Residual 351.0 31836.456663 
90.702156 NaN NaN" ] }, - "execution_count": 14, + "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "sm.stats.anova_lm(class3_model, typ=2)" + "sm.stats.anova_lm(class3_model, typ=sstype)" ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 92, "id": "4652591d", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ "# we could also build these additional models to treat the 'Age:Sex' interaction, but we are not interested in 'Age' alone (hence the hint about «pairwise»)\n", @@ -905,9 +1440,11 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 93, "id": "ecfbec15", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -930,8 +1467,9 @@ " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", - " <th>sum_sq</th>\n", " <th>df</th>\n", + " <th>sum_sq</th>\n", + " <th>mean_sq</th>\n", " <th>F</th>\n", " <th>PR(>F)</th>\n", " </tr>\n", @@ -939,29 +1477,33 @@ " <tbody>\n", " <tr>\n", " <th>C(Pclass)</th>\n", - " <td>436677.926114</td>\n", " <td>2.0</td>\n", - " <td>109.672659</td>\n", - " <td>4.283604e-35</td>\n", + " <td>460882.452649</td>\n", + " <td>230441.226324</td>\n", + " <td>115.751681</td>\n", + " <td>1.699917e-36</td>\n", " </tr>\n", " <tr>\n", " <th>Age</th>\n", - " <td>4564.359068</td>\n", " <td>1.0</td>\n", + " <td>4564.359068</td>\n", + " <td>4564.359068</td>\n", " <td>2.292698</td>\n", " <td>1.312222e-01</td>\n", " </tr>\n", " <tr>\n", - " <th>Age:C(Pclass)</th>\n", - " <td>5386.550285</td>\n", + " <th>C(Pclass):Age</th>\n", " <td>2.0</td>\n", + " <td>5386.550285</td>\n", + " <td>2693.275142</td>\n", " <td>1.352844</td>\n", " <td>2.603528e-01</td>\n", " </tr>\n", " <tr>\n", " <th>Residual</th>\n", - " <td>507660.125010</td>\n", " <td>255.0</td>\n", + " <td>507660.125010</td>\n", + " <td>1990.824020</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", @@ -970,27 +1512,29 @@ "</div>" ], 
"text/plain": [ - " sum_sq df F PR(>F)\n", - "C(Pclass) 436677.926114 2.0 109.672659 4.283604e-35\n", - "Age 4564.359068 1.0 2.292698 1.312222e-01\n", - "Age:C(Pclass) 5386.550285 2.0 1.352844 2.603528e-01\n", - "Residual 507660.125010 255.0 NaN NaN" + " df sum_sq mean_sq F PR(>F)\n", + "C(Pclass) 2.0 460882.452649 230441.226324 115.751681 1.699917e-36\n", + "Age 1.0 4564.359068 4564.359068 2.292698 1.312222e-01\n", + "C(Pclass):Age 2.0 5386.550285 2693.275142 1.352844 2.603528e-01\n", + "Residual 255.0 507660.125010 1990.824020 NaN NaN" ] }, - "execution_count": 16, + "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "sm.stats.anova_lm(female_model, typ=2)" + "sm.stats.anova_lm(female_model, typ=sstype)" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 94, "id": "27783db9", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1013,8 +1557,9 @@ " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", - " <th>sum_sq</th>\n", " <th>df</th>\n", + " <th>sum_sq</th>\n", + " <th>mean_sq</th>\n", " <th>F</th>\n", " <th>PR(>F)</th>\n", " </tr>\n", @@ -1022,29 +1567,33 @@ " <tbody>\n", " <tr>\n", " <th>C(Pclass)</th>\n", - " <td>269207.914985</td>\n", " <td>2.0</td>\n", - " <td>90.701809</td>\n", - " <td>8.657391e-34</td>\n", + " <td>255902.065524</td>\n", + " <td>127951.032762</td>\n", + " <td>86.218789</td>\n", + " <td>2.149213e-32</td>\n", " </tr>\n", " <tr>\n", " <th>Age</th>\n", - " <td>18986.128890</td>\n", " <td>1.0</td>\n", + " <td>18986.128890</td>\n", + " <td>18986.128890</td>\n", " <td>12.793652</td>\n", " <td>3.857634e-04</td>\n", " </tr>\n", " <tr>\n", - " <th>Age:C(Pclass)</th>\n", - " <td>11620.048872</td>\n", + " <th>C(Pclass):Age</th>\n", " <td>2.0</td>\n", + " <td>11620.048872</td>\n", + " <td>5810.024436</td>\n", " <td>3.915039</td>\n", " <td>2.062721e-02</td>\n", " </tr>\n", " <tr>\n", " <th>Residual</th>\n", - " <td>663360.184028</td>\n", " 
<td>447.0</td>\n", + " <td>663360.184028</td>\n", + " <td>1484.027257</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", @@ -1053,35 +1602,39 @@ "</div>" ], "text/plain": [ - " sum_sq df F PR(>F)\n", - "C(Pclass) 269207.914985 2.0 90.701809 8.657391e-34\n", - "Age 18986.128890 1.0 12.793652 3.857634e-04\n", - "Age:C(Pclass) 11620.048872 2.0 3.915039 2.062721e-02\n", - "Residual 663360.184028 447.0 NaN NaN" + " df sum_sq mean_sq F PR(>F)\n", + "C(Pclass) 2.0 255902.065524 127951.032762 86.218789 2.149213e-32\n", + "Age 1.0 18986.128890 18986.128890 12.793652 3.857634e-04\n", + "C(Pclass):Age 2.0 11620.048872 5810.024436 3.915039 2.062721e-02\n", + "Residual 447.0 663360.184028 1484.027257 NaN NaN" ] }, - "execution_count": 17, + "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "sm.stats.anova_lm(male_model, typ=2)" + "sm.stats.anova_lm(male_model, typ=sstype)" ] }, { "cell_type": "markdown", "id": "ce321919", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "We won't include this last case, again, as we are not interested in the slope of `Age` for each level of `Pclass`." 
] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 95, "id": "d27e6f05", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1127,17 +1680,6 @@ " <td>False</td>\n", " </tr>\n", " <tr>\n", - " <th>male-female [2nd class]</th>\n", - " <td>0.432770</td>\n", - " <td>4.806509</td>\n", - " <td>0.090038</td>\n", - " <td>9.283634e-01</td>\n", - " <td>-9.055761</td>\n", - " <td>9.921302</td>\n", - " <td>9.283634e-01</td>\n", - " <td>False</td>\n", - " </tr>\n", - " <tr>\n", " <th>2-1 [female]</th>\n", " <td>-108.472618</td>\n", " <td>18.421076</td>\n", @@ -1177,27 +1719,24 @@ "text/plain": [ " coef std err t P>|t| \\\n", "male-female [1st class] -19.279419 32.487685 -0.593438 5.536250e-01 \n", - "male-female [2nd class] 0.432770 4.806509 0.090038 9.283634e-01 \n", "2-1 [female] -108.472618 18.421076 -5.888506 1.224216e-08 \n", "3-1 [female] -119.359262 15.928392 -7.493491 1.102750e-12 \n", "3-2 [female] -10.886644 15.483529 -0.703111 4.826278e-01 \n", "\n", " Conf. Int. Low Conf. Int. Upp. 
pvalue-hs \\\n", "male-female [1st class] -83.380352 44.821514 5.536250e-01 \n", - "male-female [2nd class] -9.055761 9.921302 9.283634e-01 \n", "2-1 [female] -144.749437 -72.195799 2.448432e-08 \n", "3-1 [female] -150.727213 -87.991312 3.308354e-12 \n", "3-2 [female] -41.378522 19.605233 4.826278e-01 \n", "\n", " reject-hs \n", "male-female [1st class] False \n", - "male-female [2nd class] False \n", "2-1 [female] True \n", "3-1 [female] True \n", "3-2 [female] False " ] }, - "execution_count": 18, + "execution_count": 95, "metadata": {}, "output_type": "execute_result" } @@ -1209,7 +1748,6 @@ "\n", "comparisons = pd.concat([\n", " suffix_label(class1_model.t_test_pairwise('C(Sex)').result_frame, ' [1st class]'),\n", - " suffix_label(class2_model.t_test_pairwise('C(Sex)').result_frame, ' [2nd class]'),\n", " suffix_label(female_model.t_test_pairwise('C(Pclass)').result_frame, ' [female]'),\n", "])\n", "comparisons" @@ -1217,19 +1755,20 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 96, "id": "80024bdf", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { "text/plain": [ - "(array([False, False, True, True, False]),\n", - " array([8.61512929e-01, 9.28363378e-01, 4.89686469e-08, 5.51392265e-12,\n", - " 8.61512929e-01]))" + "(array([False, True, True, False]),\n", + " array([7.32326021e-01, 3.67264854e-08, 4.41113812e-12, 7.32326021e-01]))" ] }, - "execution_count": 19, + "execution_count": 96, "metadata": {}, "output_type": "execute_result" } @@ -1240,6 +1779,17 @@ "corrected_rejections, corrected_pvalues" ] }, + { + "cell_type": "markdown", + "id": "5e3fa114", + "metadata": { + "hidden": true + }, + "source": [ + "Compared with type-3 sums of squares, we lost one effect whose test-wise $p$-value was very close to the significance threshold.\n", + "Of note, this difference comes from the type of sum of squares and not the correction for multiple comparisons." 
+ ] + }, { "cell_type": "markdown", "id": "b7b20012", @@ -1263,7 +1813,9 @@ { "cell_type": "markdown", "id": "35949307", - "metadata": {}, + "metadata": { + "heading_collapsed": true + }, "source": [ "## A" ] @@ -1272,7 +1824,9 @@ "cell_type": "code", "execution_count": 20, "id": "47cf88a3", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ "import numpy as np\n", @@ -1286,7 +1840,9 @@ "cell_type": "code", "execution_count": 21, "id": "7a8dacd7", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ "mi = pd.read_csv('../data/mi.csv', index_col=0)" @@ -1296,7 +1852,9 @@ "cell_type": "code", "execution_count": 22, "id": "bea3243d", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1513,7 +2071,9 @@ "cell_type": "code", "execution_count": 23, "id": "d55877e4", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1555,7 +2115,9 @@ { "cell_type": "markdown", "id": "21e8879e", - "metadata": {}, + "metadata": { + "heading_collapsed": true + }, "source": [ "## A" ] @@ -1564,7 +2126,9 @@ "cell_type": "code", "execution_count": 45, "id": "0b8cac58", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1586,7 +2150,9 @@ { "cell_type": "markdown", "id": "04a941a5", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "Note that one-liners such as the above expression will almost always be refactored.\n", "We may need the transformed variable again, and therefore should reify it for future reference." 
@@ -1596,7 +2162,9 @@ "cell_type": "code", "execution_count": 24, "id": "b0cc612e", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ "logPA = np.log(1 + mi['PhysicalActivity'])" @@ -1605,7 +2173,9 @@ { "cell_type": "markdown", "id": "adf75983", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "We may also append it to the dataframe, as a column, just like any other variable." ] @@ -1614,7 +2184,9 @@ "cell_type": "code", "execution_count": 25, "id": "d357ea8a", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ "extended_mi = mi.copy()\n", @@ -1624,7 +2196,9 @@ { "cell_type": "markdown", "id": "fd32ce56", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "We cannot compare the skewness of both variables with a single test.\n", "\n", @@ -1635,7 +2209,9 @@ "cell_type": "code", "execution_count": 26, "id": "87455d20", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1655,7 +2231,9 @@ { "cell_type": "markdown", "id": "1067660f", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "...although they both happen to be skewed if we perform individual skewness tests." ] @@ -1664,7 +2242,9 @@ "cell_type": "code", "execution_count": 27, "id": "645ef445", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1685,7 +2265,9 @@ "cell_type": "code", "execution_count": 28, "id": "f5984bf0", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1705,7 +2287,9 @@ { "cell_type": "markdown", "id": "478d1262", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "Note we do not need the explanatory variable to be symmetric or normally distributed for a model to be valid.\n", "The point is mainly to make our sample exhibit a good coverage (in a linear sense) of the domain of possible values for our predictors." 
@@ -1728,7 +2312,9 @@ { "cell_type": "markdown", "id": "c47489e6", - "metadata": {}, + "metadata": { + "heading_collapsed": true + }, "source": [ "## A" ] @@ -1737,7 +2323,9 @@ "cell_type": "code", "execution_count": 29, "id": "bf32c7e6", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1762,7 +2350,9 @@ "cell_type": "code", "execution_count": 30, "id": "ac8ab2cc", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1785,7 +2375,9 @@ { "cell_type": "markdown", "id": "3cd48b4f", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "In the above example, we leveraged the expressiveness of `patsy` for Wilkinson formulae. The `I` «function» is a special symbol, just like `C`, which tags a variable as categorical. `I` makes it possible to evaluate a subexpression following Python syntax instead of the Wilkinson formalism.\n", "\n", @@ -1800,7 +2392,9 @@ "cell_type": "code", "execution_count": 31, "id": "177c2ec7", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ "# already done\n", @@ -1815,7 +2409,9 @@ "cell_type": "code", "execution_count": 32, "id": "93e6b4ee", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ "import statsmodels.api as sm" @@ -1825,7 +2421,9 @@ "cell_type": "code", "execution_count": 33, "id": "d2ec65f0", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "name": "stderr", @@ -1845,7 +2443,9 @@ { "cell_type": "markdown", "id": "c39bd1c5", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "Anyway, as can be seen in the plots above, we turned an influential observation (number 362 was above $0.5$) into a non-influential one, and similarly decreased the influence of several other points.\n", "\n", @@ -1869,7 +2469,9 @@ { "cell_type": "markdown", "id": "12d7e9a5", - "metadata": {}, + "metadata": { + "heading_collapsed": true + }, "source": [ "## A (with nested Q&A)" ] @@ 
-1878,7 +2480,9 @@ "cell_type": "code", "execution_count": 34, "id": "82e9f8d6", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -1988,7 +2592,9 @@ "cell_type": "code", "execution_count": 35, "id": "faeab021", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -2103,7 +2709,9 @@ { "cell_type": "markdown", "id": "66ddaa3e", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "### Q\n", "\n", @@ -2115,7 +2723,10 @@ { "cell_type": "markdown", "id": "05844c56", - "metadata": {}, + "metadata": { + "heading_collapsed": true, + "hidden": true + }, "source": [ "### A" ] @@ -2124,7 +2735,9 @@ "cell_type": "code", "execution_count": 36, "id": "2588d8d6", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -2248,7 +2861,9 @@ { "cell_type": "markdown", "id": "c7f00836", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "In `seaborn`, a dot plot can be drawn with the `stripplot` function." ] @@ -2257,7 +2872,9 @@ "cell_type": "code", "execution_count": 37, "id": "625111f9", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -2280,7 +2897,9 @@ { "cell_type": "markdown", "id": "1ba5537f", - "metadata": {}, + "metadata": { + "hidden": true + }, "source": [ "Should we add a term, `BMI` brings more information alone than `Heart:logPA` for the same number of model parameters. 
This would be confirmed by AIC and BIC.\n", "\n", @@ -2310,7 +2929,9 @@ { "cell_type": "markdown", "id": "a6f37b2e", - "metadata": {}, + "metadata": { + "heading_collapsed": true + }, "source": [ "## A" ] @@ -2319,7 +2940,9 @@ "cell_type": "code", "execution_count": 38, "id": "77e350be", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ "# already done\n", @@ -2334,7 +2957,9 @@ "cell_type": "code", "execution_count": 39, "id": "377783b7", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -2358,7 +2983,9 @@ "cell_type": "code", "execution_count": 40, "id": "e07bc434", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -2395,7 +3022,9 @@ { "cell_type": "markdown", "id": "0a822074", - "metadata": {}, + "metadata": { + "heading_collapsed": true + }, "source": [ "## A" ] @@ -2404,7 +3033,9 @@ "cell_type": "code", "execution_count": 41, "id": "85d73982", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -2444,7 +3075,9 @@ { "cell_type": "markdown", "id": "7ce19023", - "metadata": {}, + "metadata": { + "heading_collapsed": true + }, "source": [ "## A" ] @@ -2453,7 +3086,9 @@ "cell_type": "code", "execution_count": 42, "id": "7dec1d91", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [ "logPA = np.log(1 + mi['PhysicalActivity'])\n", @@ -2473,7 +3108,9 @@ { "cell_type": "markdown", "id": "b588b5c1", - "metadata": {}, + "metadata": { + "heading_collapsed": true + }, "source": [ "## Q\n", "\n", @@ -2495,11 +3132,23 @@ "Compute the statistic $nR^2$ and the resulting $p$-value." 
] }, + { + "cell_type": "markdown", + "id": "e4207c4c", + "metadata": { + "heading_collapsed": true + }, + "source": [ + "## A" + ] + }, { "cell_type": "code", "execution_count": 43, "id": "db59adce", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [ { "data": { @@ -2525,7 +3174,9 @@ "cell_type": "code", "execution_count": null, "id": "1bac9cde", - "metadata": {}, + "metadata": { + "hidden": true + }, "outputs": [], "source": [] } diff --git a/notebooks/statsmodels_cours.ipynb b/notebooks/statsmodels_cours.ipynb index 17502d5b7f7010d8514718e893777b16088858fb..1b95f5d5d641ed85e12ba7e8f4d4bda9ff5719e8 100644 --- a/notebooks/statsmodels_cours.ipynb +++ b/notebooks/statsmodels_cours.ipynb @@ -44,7 +44,7 @@ }, { "cell_type": "code", - "execution_count": 98, + "execution_count": 151, "id": "d6590258-e1ac-4f62-8adf-3fa014376a22", "metadata": {}, "outputs": [], @@ -55,7 +55,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 152, "id": "7b4f63c1-9995-4516-aeb5-1fb4fe1db896", "metadata": {}, "outputs": [], @@ -84,7 +84,7 @@ }, { "cell_type": "code", - "execution_count": 99, + "execution_count": 153, "id": "aec2f6b1-c4fc-465e-9434-b32333770e24", "metadata": {}, "outputs": [], @@ -96,7 +96,7 @@ }, { "cell_type": "code", - "execution_count": 100, + "execution_count": 154, "id": "bd00bd2f-b7eb-47e4-8af4-187ce9473481", "metadata": {}, "outputs": [ @@ -106,7 +106,7 @@ "F_onewayResult(statistic=2.3575322551335636, pvalue=0.11384795345837218)" ] }, - "execution_count": 100, + "execution_count": 154, "metadata": {}, "output_type": "execute_result" } @@ -126,7 +126,7 @@ }, { "cell_type": "code", - "execution_count": 101, + "execution_count": 155, "id": "696ff83f-d827-4513-9801-51488e5a1df0", "metadata": {}, "outputs": [ @@ -140,7 +140,7 @@ " 'C', 'C', 'C', 'C'], dtype='<U1'))" ] }, - "execution_count": 101, + "execution_count": 155, "metadata": {}, "output_type": "execute_result" } @@ -153,7 +153,7 @@ }, { "cell_type": "code", - 
"execution_count": 102, + "execution_count": 156, "id": "c324bd0f-e769-4a0d-8e6b-d4d346e76f68", "metadata": {}, "outputs": [ @@ -371,7 +371,7 @@ "29 81 C" ] }, - "execution_count": 102, + "execution_count": 156, "metadata": {}, "output_type": "execute_result" } @@ -402,7 +402,7 @@ }, { "cell_type": "code", - "execution_count": 103, + "execution_count": 157, "id": "4a2a00ec-43c6-4299-82e3-15b807110828", "metadata": {}, "outputs": [], @@ -420,7 +420,7 @@ }, { "cell_type": "code", - "execution_count": 104, + "execution_count": 158, "id": "21bf5c1d-caa3-4072-bc86-ce860b8f33c8", "metadata": {}, "outputs": [ @@ -434,7 +434,7 @@ "Model: OLS Adj. R-squared: 0.086\n", "Method: Least Squares F-statistic: 2.358\n", "Date: Tue, 28 Sep 2021 Prob (F-statistic): 0.114\n", - "Time: 11:16:16 Log-Likelihood: -96.604\n", + "Time: 18:26:09 Log-Likelihood: -96.604\n", "No. Observations: 30 AIC: 199.2\n", "Df Residuals: 27 BIC: 203.4\n", "Df Model: 2 \n", @@ -473,7 +473,7 @@ }, { "cell_type": "code", - "execution_count": 105, + "execution_count": 159, "id": "638bdd6b-6964-4209-b762-2991fd3fb7fc", "metadata": { "tags": [] @@ -485,7 +485,7 @@ "F_onewayResult(statistic=2.3575322551335636, pvalue=0.11384795345837218)" ] }, - "execution_count": 105, + "execution_count": 159, "metadata": {}, "output_type": "execute_result" } @@ -507,7 +507,7 @@ }, { "cell_type": "code", - "execution_count": 106, + "execution_count": 166, "id": "0fe841a0-21b1-4c92-a767-c4ab3abd37f8", "metadata": { "tags": [] @@ -517,9 +517,9 @@ "name": "stdout", "output_type": "stream", "text": [ - " df sum_sq mean_sq F PR(>F)\n", - "Group 2.0 192.2 96.100000 2.357532 0.113848\n", - "Residual 27.0 1100.6 40.762963 NaN NaN\n" + " sum_sq df F PR(>F)\n", + "Group 192.2 2.0 2.357532 0.113848\n", + "Residual 1100.6 27.0 NaN NaN\n" ] } ], @@ -1736,7 +1736,7 @@ }, { "cell_type": "code", - "execution_count": 148, + "execution_count": 167, "id": "cc06b3ed-a066-4de3-884c-a7c82729c359", "metadata": {}, "outputs": [ @@ -1744,15 +1744,16 
@@ "name": "stdout", "output_type": "stream", "text": [ - " sum_sq df F PR(>F)\n", - "water 15.552000 1.0 16.034261 0.000462\n", - "sun 21.424667 2.0 11.044518 0.000337\n", - "Residual 25.218000 26.0 NaN NaN\n" + " sum_sq df F PR(>F)\n", + "Intercept 394.218750 1.0 406.443314 2.139588e-17\n", + "water 15.552000 1.0 16.034261 4.623155e-04\n", + "sun 21.424667 2.0 11.044518 3.373296e-04\n", + "Residual 25.218000 26.0 NaN NaN\n" ] } ], "source": [ - "anova_table = sm.stats.anova_lm(plant_model, typ=2)\n", + "anova_table = sm.stats.anova_lm(plant_model, typ=3) # typ specifies the type of sum of squares\n", "print(anova_table)" ] }, @@ -1886,7 +1887,7 @@ }, { "cell_type": "code", - "execution_count": 149, + "execution_count": 168, "id": "60de13e5-b798-4312-8d06-0ef79f480760", "metadata": {}, "outputs": [ @@ -1894,18 +1895,19 @@ "name": "stdout", "output_type": "stream", "text": [ - " sum_sq df F PR(>F)\n", - "water 15.552000 1.0 19.117394 0.000205\n", - "sun 21.424667 2.0 13.168203 0.000138\n", - "water:sun 5.694000 2.0 3.499693 0.046376\n", - "Residual 19.524000 24.0 NaN NaN\n" + " sum_sq df F PR(>F)\n", + "Intercept 246.402000 1.0 302.891211 4.094979e-15\n", + "water 2.401000 1.0 2.951444 9.867817e-02\n", + "sun 8.041333 2.0 4.942430 1.593920e-02\n", + "water:sun 5.694000 2.0 3.499693 4.637649e-02\n", + "Residual 19.524000 24.0 NaN NaN\n" ] } ], "source": [ "model_with_interaction = ols('height ~ water * sun', data=plant_data).fit()\n", "# remember `water * sun` is equivalent to `water + sun + water:sun`\n", - "print(sm.stats.anova_lm(model_with_interaction, typ=2))" + "print(sm.stats.anova_lm(model_with_interaction, typ=3))" ] }, { @@ -1971,7 +1973,7 @@ "\n", "y_low_daily = w['Intercept'] + w['sun[T.low]']\n", "y_low_weekly = w['Intercept'] + w['sun[T.low]'] + w['water[T.weekly]'] + w['water[T.weekly]:sun[T.low]']\n", - "ax.plot([x[0]-dx, x[0]+dx], [y_low_daily, y_low_weekly], 'k-d', markerfacecolor='w')\n", + "ax.plot([x[0]-dx, x[0]+dx], [y_low_daily, y_low_weekly], 
'k-d', markerfacecolor='w')\n", "\n", "y_med_daily = w['Intercept'] + w['sun[T.med]']\n", "y_med_weekly = w['Intercept'] + w['sun[T.med]'] + w['water[T.weekly]'] + w['water[T.weekly]:sun[T.med]']\n", @@ -2069,7 +2071,7 @@ "daily_water_model = ols('height ~ sun', data=plant_data[plant_data['water']=='daily']).fit()\n", "weekly_water_model = ols('height ~ sun', data=plant_data[plant_data['water']=='weekly']).fit()\n", "low_sun_model = ols('height ~ water', data=plant_data[plant_data['sun']=='low']).fit()\n", - "med_sun_model = ols('height ~ water', data=plant_data[plant_data['sun']=='med']).fit()\n", + "med_sun_model = ols('height ~ water', data=plant_data[plant_data['sun']=='med']).fit()\n", "high_sun_model = ols('height ~ water', data=plant_data[plant_data['sun']=='high']).fit()" ] }, @@ -2091,7 +2093,7 @@ }, { "cell_type": "code", - "execution_count": 136, + "execution_count": 169, "id": "93ccdd87", "metadata": {}, "outputs": [ @@ -2101,7 +2103,7 @@ "0.03098093333325329" ] }, - "execution_count": 136, + "execution_count": 169, "metadata": {}, "output_type": "execute_result" } @@ -2112,7 +2114,7 @@ }, { "cell_type": "code", - "execution_count": 137, + "execution_count": 170, "id": "d1392464", "metadata": {}, "outputs": [ @@ -2197,7 +2199,7 @@ "med-low 0.254253 False " ] }, - "execution_count": 137, + "execution_count": 170, "metadata": {}, "output_type": "execute_result" } @@ -2409,7 +2411,7 @@ }, { "cell_type": "code", - "execution_count": 126, + "execution_count": 171, "id": "e6c92946", "metadata": {}, "outputs": [ @@ -2434,43 +2436,45 @@ " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", - " <th>df</th>\n", " <th>sum_sq</th>\n", - " <th>mean_sq</th>\n", + " <th>df</th>\n", " <th>F</th>\n", " <th>PR(>F)</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", + " <th>Intercept</th>\n", + " <td>246.402000</td>\n", + " <td>1.0</td>\n", + " <td>302.891211</td>\n", + " <td>4.094979e-15</td>\n", + " </tr>\n", + " <tr>\n", " <th>water</th>\n", 
+ " <td>2.401000</td>\n", " <td>1.0</td>\n", - " <td>15.552000</td>\n", - " <td>15.552000</td>\n", - " <td>19.117394</td>\n", - " <td>0.000205</td>\n", + " <td>2.951444</td>\n", + " <td>9.867817e-02</td>\n", " </tr>\n", " <tr>\n", " <th>sun</th>\n", + " <td>8.041333</td>\n", " <td>2.0</td>\n", - " <td>21.424667</td>\n", - " <td>10.712333</td>\n", - " <td>13.168203</td>\n", - " <td>0.000138</td>\n", + " <td>4.942430</td>\n", + " <td>1.593920e-02</td>\n", " </tr>\n", " <tr>\n", " <th>water:sun</th>\n", - " <td>2.0</td>\n", " <td>5.694000</td>\n", - " <td>2.847000</td>\n", + " <td>2.0</td>\n", " <td>3.499693</td>\n", - " <td>0.046376</td>\n", + " <td>4.637649e-02</td>\n", " </tr>\n", " <tr>\n", " <th>Residual</th>\n", - " <td>24.0</td>\n", " <td>19.524000</td>\n", - " <td>0.813500</td>\n", + " <td>24.0</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", @@ -2479,20 +2483,21 @@ "</div>" ], "text/plain": [ - " df sum_sq mean_sq F PR(>F)\n", - "water 1.0 15.552000 15.552000 19.117394 0.000205\n", - "sun 2.0 21.424667 10.712333 13.168203 0.000138\n", - "water:sun 2.0 5.694000 2.847000 3.499693 0.046376\n", - "Residual 24.0 19.524000 0.813500 NaN NaN" + " sum_sq df F PR(>F)\n", + "Intercept 246.402000 1.0 302.891211 4.094979e-15\n", + "water 2.401000 1.0 2.951444 9.867817e-02\n", + "sun 8.041333 2.0 4.942430 1.593920e-02\n", + "water:sun 5.694000 2.0 3.499693 4.637649e-02\n", + "Residual 19.524000 24.0 NaN NaN" ] }, - "execution_count": 126, + "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "sm.stats.anova_lm(model_with_interaction)" + "sm.stats.anova_lm(model_with_interaction, typ=3)" ] }, {