Compare revisions

Bertrand NÉRON · Bertrand NÉRON · 53fb043d · 53fb043d · 3c203d8a
--- a/notebooks/Practicals/seaborn_TP.ipynb
+++ b/notebooks/Practicals/seaborn_TP.ipynb
 %% Cell type:markdown id:instrumental-personal tags:

 # <center><b>Hands-on</b></center>

 <div style="text-align:center">
-    <img src="images/seaborn.png" width="600px">
+    <img src="../images/seaborn.png" width="600px">
    <div>
-       Bertrand Néron, François Laurent, Etienne Kornobis
+       Bertrand Néron, François Laurent, Etienne Kornobis, Vincent Guillemot
       <br />
       <a href="https://research.pasteur.fr/en/team/bioinformatics-and-biostatistics-hub/">Bioinformatics and Biostatistics HUB</a>
       <br />
-       © Institut Pasteur, 2021
+       © Institut Pasteur, 2024
    </div>
 </div>

 %% Cell type:markdown id:compliant-basis tags:

-Practice your graphing skills using data from milieu intérieur in `data/mi.csv`:
+Practice your graphing skills using the [happiness 2016](https://www.kaggle.com/datasets/unsdsn/world-happiness?select=2016.csv) data.

-%% Cell type:markdown id:departmental-exhibition tags:
+(The data are already in the `data` directory as `happiness_2016.csv`)

- Do a boxplot showing the differences in temperature between females and males:
+%% Cell type:markdown id:3778963b-3bae-486d-8db7-30f23eb239ac tags:

-%% Cell type:code id:98e904b6-6e90-4c74-a463-2339d3961250 tags:
+## Import the data and have a look at them

-``` python
-```
+1. import the pandas and seaborn modules
+2. import the data

-%% Cell type:markdown id:portuguese-worse tags:
+%% Cell type:code id:minor-doctrine tags:

- Using a histogram and continuous probability density curve, display the distribution of age in the dataset
+``` python
+```

-%% Cell type:code id:55756807-e1fb-4fb5-878c-5e46acea7a11 tags:
+%% Cell type:code id:skilled-daniel tags:

 ``` python
 ```

-%% Cell type:markdown id:prepared-stephen tags:
+%% Cell type:markdown id:1360a875-cce6-4a2f-b19f-faf1b64230cc tags:

- Use a barplot to show the count of vaccinated for yellow fever (see the documentation for a countplot)
+3. have a look at them

-%% Cell type:code id:1425046c-a058-45fe-95b5-5eca6ebbd33a tags:
+%% Cell type:code id:f8729a5b-314d-42fc-b130-783ca5e2076a tags:

 ``` python
 ```

-%% Cell type:markdown id:immediate-method tags:
+%% Output

- Plot the distribution of age for the people vaccinated for the flu
+    (157, 13)

-%% Cell type:code id:d567194c-3698-44c9-b5f8-b8a3d3493b0c tags:
+%% Cell type:code id:brutal-manufacturer tags:

 ``` python
 ```

-%% Cell type:markdown id:temporal-synthesis tags:
+%% Output

- Feel free to explore more of [seaborn](https://seaborn.pydata.org/examples/index.html) !
+           Country          Region  Happiness Rank  Happiness Score  \
+    0      Denmark  Western Europe               1            7.526
+    1  Switzerland  Western Europe               2            7.509
+    2      Iceland  Western Europe               3            7.501
+    3       Norway  Western Europe               4            7.498
+    4      Finland  Western Europe               5            7.413
+    
+       Lower Confidence Interval  Upper Confidence Interval  \
+    0                      7.460                      7.592
+    1                      7.428                      7.590
+    2                      7.333                      7.669
+    3                      7.421                      7.575
+    4                      7.351                      7.475
+    
+       Economy (GDP per Capita)   Family  Health (Life Expectancy)  Freedom  \
+    0                   1.44178  1.16374                   0.79504  0.57941
+    1                   1.52733  1.14524                   0.86303  0.58557
+    2                   1.42666  1.18326                   0.86733  0.56624
+    3                   1.57744  1.12690                   0.79579  0.59609
+    4                   1.40598  1.13464                   0.81091  0.57104
+    
+       Trust (Government Corruption)  Generosity  Dystopia Residual
+    0                        0.44453     0.36171            2.73939
+    1                        0.41203     0.28083            2.69463
+    2                        0.14975     0.47678            2.83137
+    3                        0.35776     0.37895            2.66465
+    4                        0.41004     0.25492            2.82596
+
+%% Cell type:markdown id:departmental-exhibition tags:

-%% Cell type:markdown id:db56d49a-4770-4f9e-af6b-78960574d338 tags:
+## Do a boxplot showing the differences in `Happiness Score` between regions (`Region`):

-# Exploring count matrices from RNA-seq data
+%% Cell type:code id:saved-identity tags:

-%% Cell type:markdown id:5377668b-dea5-4c20-8249-5266f98774eb tags:
+``` python
+```

-<img src="images/rnaseq.png" style="margin:0 auto;width:800px">
+%% Cell type:markdown id:portuguese-worse tags:

-%% Cell type:markdown id:ebf1606b-0b21-4821-a899-551ec33c977e tags:
+## Using a histogram and continuous probability density curve, display the distribution of `Freedom` in the dataset

- Import the count_matrix tsv file from the data folder
+%% Cell type:code id:continuous-indian tags:

-%% Cell type:code id:eb53a1f5-9ea7-491e-bcfa-820cb1663af5 tags:
+``` python
+```
+
+%% Cell type:code id:understanding-vegetarian tags:

 ``` python
 ```

-%% Cell type:markdown id:c80d9947-9ccf-4499-a1c2-9194377cd054 tags:
+%% Cell type:markdown id:prepared-stephen tags:

- Simplify the dataframe to only have the "Geneid", "WTx" and "Cx" columns
+- Use a barplot to show the number of countries per `Region` (see the documentation for a countplot)

-%% Cell type:code id:56e90032-75ce-47b5-9cd3-95219cd7b26e tags:
+%% Cell type:code id:worldwide-communication tags:

 ``` python
 ```

-%% Cell type:markdown id:eb65b51f-f689-4a66-b47c-e79f0e9eba52 tags:
+%% Cell type:markdown id:4e1d16f2-57d8-4f0e-9a69-e9ab193a3ebc tags:
+
+As you can see, the labels overlap each other and are not readable.

- Format properly your DataFrame to be able to use  https://seaborn.pydata.org/generated/seaborn.clustermap.html to realize a heatmap.
+One possibility is to rotate the x-axis labels. In this case, it is better to provide the labels explicitly.
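A minimal sketch of the label rotation described above, assuming seaborn and matplotlib are installed. The table `toy_df` and its region names are made-up placeholders, not rows of `happiness_2016.csv`, and the `Agg` backend line is only there so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up placeholder rows: only the "Region" column matters for a countplot.
toy_df = pd.DataFrame(
    {
        "Region": [
            "Region with a long name A",
            "Region with a long name B",
            "Region with a long name A",
            "Region with a long name C",
            "Region with a long name A",
        ]
    }
)

# Extract the distinct regions once, in order of first appearance.
regions = list(toy_df["Region"].unique())

fig, ax = plt.subplots()
# Passing order=regions guarantees that bar i corresponds to regions[i].
sns.countplot(data=toy_df, x="Region", order=regions, ax=ax)

# Fix the tick positions, then provide the labels explicitly with a rotation.
ax.set_xticks(list(range(len(regions))))
tick_labels = ax.set_xticklabels(regions, rotation=45, ha="right")
fig.tight_layout()
```

Fixing the tick positions before setting the labels keeps each label attached to its own bar, which is why the labels are provided explicitly here.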

-%% Cell type:code id:9b422fcb-7cc1-4766-92e3-276742381ae6 tags:
+%% Cell type:code id:64ee74bb-5f37-485f-8c03-d06ac14d3010 tags:

 ``` python
+# extract the regions from the data; we will use them as labels for the figures below
 ```

-%% Cell type:markdown id:f8d6188e-3a37-4ba5-b377-a11696054e9c tags:
+%% Cell type:code id:eb7c96ac-585a-4787-a2ee-dec3d04790ca tags:

- Explore the clustermap documentation to have a more visual heatmap by standardizing the data within genes.
+``` python
+```
+
+%% Cell type:markdown id:e2eac27c-2de9-4942-82b9-294318ec5fd4 tags:

-%% Cell type:code id:06be3f98-2167-44ac-9318-955286d77903 tags:
+## On the same data (`Happiness Score` and `Region`), do a boxplot and a swarmplot to display the structure of the data
+
+%% Cell type:code id:b0fd6058-65b0-4f9b-bc59-dce0193f1580 tags:

 ``` python
 ```

-%% Cell type:markdown id:2e61a207-223a-4c01-88ea-76b1b8c3a0b9 tags:
+%% Cell type:code id:1fe079b4-013a-4d48-ac31-9daad4b4673e tags:

- Reformat the counts_df dataframe to have genes in columns and samples in rows.
- Add a "group" column defining the grouping of the samples:
-    - "WTx" samples will be from the "WT" group.
-    - "Cx" samples will be from the "C" group.
+``` python
+```

-%% Cell type:code id:eea3f521-6960-44ab-ac0b-fcf5a002237f tags:
+%% Cell type:markdown id:immediate-method tags:
+
+## Plot the distribution of `Happiness Score` for the people living in `Western Europe`
+
+%% Cell type:code id:academic-measure tags:

 ``` python
 ```

-%% Cell type:markdown id:9a88ecb1-9ed3-4160-91ee-24a30e994b71 tags:
+%% Cell type:markdown id:fd9789db-9bab-478a-bcec-1c8b6775cf20 tags:

- Display a barplot showing the mean expression for each group for a particular gene (for example "gene-LEPBI_RS00065").
+## Plot `Health (Life Expectancy)` vs `Happiness Score`, color the dots according to the region, and specify a size for the figure (9 x 7 inches)

-%% Cell type:code id:cf74e85e-eef3-4023-bb88-5a864cf3c3f9 tags:
+1. import pyplot
+2. then create a new figure and axis at the right size
+3. create the plot
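The three steps above can be sketched as follows. This is only an illustration under assumptions: `head_df` holds the five rows printed by `ha_df.head()` (so every point belongs to `Western Europe`), the choice of which variable goes on which axis is mine, and the `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt  # step 1: import pyplot
import pandas as pd
import seaborn as sns

# The five rows shown by ha_df.head(), restricted to the columns used here.
head_df = pd.DataFrame(
    {
        "Country": ["Denmark", "Switzerland", "Iceland", "Norway", "Finland"],
        "Region": ["Western Europe"] * 5,
        "Happiness Score": [7.526, 7.509, 7.501, 7.498, 7.413],
        "Health (Life Expectancy)": [0.79504, 0.86303, 0.86733, 0.79579, 0.81091],
    }
)

# Step 2: create the figure and axis at the requested size (width, height in inches).
fig, ax = plt.subplots(figsize=(9, 7))

# Step 3: draw on that axis; hue colors the dots by region.
sns.scatterplot(
    data=head_df,
    x="Health (Life Expectancy)",
    y="Happiness Score",
    hue="Region",
    ax=ax,
)
```

Passing `ax=ax` is what ties the seaborn plot to the figure created at the right size; without it seaborn draws on whatever axis is current.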
+
+%% Cell type:code id:d42f72f1-1d01-496e-89a1-68391ffa4281 tags:

 ``` python
 ```

-%% Cell type:markdown id:99e2455a-cb7d-44d5-a4a0-2cf272c814ab tags:
+%% Cell type:code id:52ae8376-3c66-4ca6-86a9-f9ae9f56076f tags:

- Try plotting a swarmplot on top of the previous barplot:
+``` python
+```
+
+%% Cell type:markdown id:temporal-synthesis tags:
+
+- Feel free to explore more of [seaborn](https://seaborn.pydata.org/examples/index.html) !

-%% Cell type:code id:7cf225f9-aea7-4cd9-ac90-a99592799527 tags:
+%% Cell type:markdown id:3063abf7-2251-48eb-b371-6c5b70b45fe7 tags:
+
+## Do a barplot of the Happiness Score for each Region
+
+%% Cell type:code id:85dd0df6-74e7-43be-9a7c-eb922a06601b tags:

 ``` python
 ```

-%% Cell type:markdown id:d200d375-362e-4c1d-a88e-130b094e6feb tags:
+%% Cell type:markdown id:3ee5741a-64f4-4690-963b-1f7e729398bf tags:
+
+## From this point on, we will focus on the Regions

- Now plot the same data using a boxplot. Can you see the problem of displaying boxplots for this kind of data ?
+### Clean the dataset: remove the irrelevant columns

-%% Cell type:code id:e4daf00e-9a2c-4ec4-9d26-aa18aae5d82d tags:
+%% Cell type:code id:0344b730-1535-47fb-82f5-07003fd223f9 tags:

 ``` python
 ```

-%% Cell type:markdown id:2e1cabe0-aab7-4f0e-888e-81aae7d5df8d tags:
+%% Cell type:markdown id:36e449d1-0add-4ebc-8903-d535219ce423 tags:

- Compute the median of each genes by groups:
+1. keep only the columns 'Region', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Freedom', 'Trust (Government Corruption)', 'Generosity'
+2. set the index to the Region
+3. have a look at your new data

-%% Cell type:code id:6ffd0f59-0fd7-41b9-a87a-c6e1a74145e8 tags:
+%% Cell type:code id:bdf897c4-b8f3-4dff-b9c0-0ad47b25ecc0 tags:

 ``` python
 ```

-%% Cell type:markdown id:308cc10b-6727-4bc5-b05d-4777037e252e tags:
+%% Cell type:markdown id:e1ae03ac-ac7c-436d-987d-113e9cca3eec tags:

-We are going now to add extra annotations to this median table in order to identify genes of interest.
- Import the annotation.csv table from the data folder:
+## Aggregate the new data region by region: for each column, use the mean over the countries of a Region as the value for that Region
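A minimal sketch of the cleaning and aggregation steps, using pandas only. The table `toy_df`, its region names and all its values are made-up placeholders chosen so the means are easy to check by hand; only two of the numeric columns are kept to stay short.

```python
import pandas as pd

# Made-up placeholder values; the real table has 157 rows and 13 columns.
toy_df = pd.DataFrame(
    {
        "Country": ["c1", "c2", "c3", "c4"],
        "Region": ["Region A", "Region A", "Region B", "Region B"],
        "Happiness Score": [7.0, 5.0, 4.0, 6.0],
        "Freedom": [0.6, 0.4, 0.2, 0.4],
    }
)

# Keep only the relevant columns and use the region as index.
region_df = toy_df[["Region", "Happiness Score", "Freedom"]].set_index("Region")

# One row per region: the mean of each column over the countries of that region.
region_means = region_df.groupby(level="Region").mean()
```

Grouping on the index level keeps `Region` as the index of the result, which is the layout a heatmap expects (one labelled row per region).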

-%% Cell type:code id:9be6ee5b-d497-47fa-8ac5-cf5514fd52c0 tags:
+%% Cell type:code id:3fc3ea89-a448-4e7b-abfb-3fa92cffc5f7 tags:

 ``` python
 ```

-%% Cell type:markdown id:50fa81a7-3f34-4160-ad2d-f77d21be9ac0 tags:
+%% Cell type:markdown id:97cb188c-3e50-4492-961f-cadea3611aaa tags:

-Annotations in this table are available for many types of loci (the "genetic_type" column), but here we will focus on the "gene" genetic_type.
- Filter the annotation dataframe to have only "gene" as "genetic_type".
+## Do a hierarchically-clustered heatmap

-%% Cell type:code id:f9a8bcf7-0bcc-43e8-828a-ec204658e528 tags:
+%% Cell type:code id:9aa21ed4-e9b2-4eb3-a693-c59ceb513552 tags:

 ``` python
 ```

-%% Cell type:markdown id:f8a4e744-e7e2-43b6-b3d4-e59feb40d3ff tags:
+%% Cell type:markdown id:88d27d29-e3b8-43d7-8324-25e50c247872 tags:

- Concatenate the dataframe with median by group and the annotation dataframe together:
+Check the data.

-%% Cell type:code id:afd8467a-33e1-4b9e-8f6d-b2229099c874 tags:
+%% Cell type:code id:0128f575-0b2a-4cbc-8f6e-8b7e22d81254 tags:

 ``` python
 ```

-%% Cell type:markdown id:af9f8e1f-5f8b-4152-b08a-44e957f13cec tags:
+%% Cell type:markdown id:f9b39ab8-0051-4840-9e87-fe2bcb8ca07a tags:
+
+The columns do not share the same range, so it is better to standardize the data before clustering.
+
+%% Cell type:code id:ff4beb57-b357-47a3-b7bd-877e05229b6b tags:

- Calculate an estimate of the gene expression fold change for each gene (by dividing the C median expressions by WT median expressions).
- Add it as a "FoldChange" column to the previous dataframe.
+``` python
+```

-%% Cell type:code id:bb617d00-2c2d-45cc-ace0-3656dc999b17 tags:
+%% Cell type:code id:e19f9472-cb9b-434b-8689-2bf09d49b902 tags:

 ``` python
 ```

-%% Cell type:markdown id:d70eb26b-0a26-4bbc-af03-ba8781b09fb5 tags:
+%% Cell type:markdown id:d64a0377-339b-4fe7-beb4-a32e4a4e0113 tags:

- Use a barplot to display fold changes and using the new gene annotation (The "Name" column)
+It is possible to do that directly in seaborn with the `z_score` option (see https://seaborn.pydata.org/generated/seaborn.clustermap.html).
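A sketch comparing the two routes, assuming seaborn, matplotlib and scipy are installed (`clustermap` needs scipy for the clustering). The table `region_means` contains made-up placeholder values, not the real regional means.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Made-up placeholder means, one row per region and one column per indicator.
region_means = pd.DataFrame(
    {
        "Happiness Score": [7.0, 5.5, 4.0, 6.0],
        "Freedom": [0.55, 0.35, 0.30, 0.45],
        "Generosity": [0.30, 0.20, 0.25, 0.15],
    },
    index=["Region A", "Region B", "Region C", "Region D"],
)

# Route 1: standardize each column by hand (subtract its mean, divide by its std).
standardized = (region_means - region_means.mean()) / region_means.std()

# Route 2: let seaborn do it; z_score=1 standardizes the columns, z_score=0 the rows.
grid = sns.clustermap(region_means, z_score=1)
```

With `z_score=1` each indicator contributes on a comparable scale, so a column with large raw values such as `Happiness Score` no longer dominates the clustering.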

-%% Cell type:code id:4dd4cbee-547f-43f1-9ed7-173f3040b8d5 tags:
+%% Cell type:code id:3b439517-5007-4fbb-828d-265f9835594f tags:

 ``` python
 ```

-%% Cell type:markdown id:34a26492-7c6b-4a07-a4de-67ec8f693cdc tags:
+%% Cell type:markdown id:f8d6188e-3a37-4ba5-b377-a11696054e9c tags:
+
+- Explore the clustermap documentation to have a more visual heatmap by standardizing the data within each column.

- By calculating the length of each gene and using a visualisation, does gene expression appears correlated with gene length ?
+%% Cell type:markdown id:a2627322-e6a5-422f-8a69-b89dbd4b777e tags:

-%% Cell type:code id:6f35b696-0807-4df4-9310-cb9197e7bf85 tags:
+## Create a function which produces a single image with four different plots of your choice and save it to a PDF file.
+
+like the image below.
+
+%% Cell type:markdown id:4121ff3d-6814-493e-a505-357ad81b0d28 tags:
+
+<img src="../images/multiple_figure.png" width="50%" />
+
+%% Cell type:code id:a322c866-9232-4fae-bcee-9a635e3fd70b tags:

 ``` python
 ```

-%% Cell type:markdown id:a2627322-e6a5-422f-8a69-b89dbd4b777e tags:
+%% Cell type:code id:044022d1-741d-4a07-ba7f-c1f863cca138 tags:

- Create a function which produce a single image with four different plots of your choice and save it to pdf file.
+``` python
+def expression_graph():
+    ...
+    return fig
+
+```

-%% Cell type:code id:70e001a1-2848-4fb7-9f33-7beb4475e0fc tags:
+%% Cell type:code id:c33bfc78-7480-4327-93a0-f8aaca0d3614 tags:

 ``` python
+my_fig = expression_graph()
+...
 ```

 %% Cell type:markdown id:0d05aba4-3c85-4cd9-85f3-5296b19308fb tags:

 # Extras

 %% Cell type:markdown id:66d6668e-683f-462e-a72f-28bdda8736f2 tags:

- Using ipywidget, make a function to display barplot of gene expression by groups with the gene being selected by the user (using a Dropdown widget for example).
+- Using ipywidgets, make a function to display a barplot of `Happiness Score` by country, with the region selected by the user (using a Dropdown widget)
+
+%% Cell type:markdown id:042bd87e-d2dc-4544-a771-51d80c565d0f tags:
+
+Import the needed modules:
+- `widgets` and `interact` from the `ipywidgets` package
+
+%% Cell type:code id:64ebeca1-1332-4585-9e5c-c1b66f82be71 tags:
+
+``` python
+from ipywidgets import widgets
+from ipywidgets import interact
+```
+
+%% Cell type:markdown id:277264e6-a173-40c5-b71e-4cd551a7fa99 tags:
+
+Create a dataframe containing the regions (without duplicate values)
+
+%% Cell type:code id:ebf7fde9-b4a1-4e8a-86ab-86ad8b1b533a tags:
+
+``` python
+```
+
+%% Cell type:markdown id:f34e5053-ccf5-4a67-96db-7457fe16bbd6 tags:
+
+1. Use this DataFrame to populate your dropdown list
+2. Use the region selected in the dropdown list as the parameter of your function
+3. Select from the whole data frame the data corresponding to this region
+4. Display the barplot
+
+%% Cell type:markdown id:feba608f-2ecb-41ae-b04a-12f075fd644b tags:
+
+Below is the code skeleton of your function:
+
+```python
+@interact(region=widgets.Dropdown(options=regions))
+def plot_counts(region):
+    data = ha_df.loc[ha_df['Region'] == region]
+    ax = sns.barplot(data= ....
+```
+
+%% Cell type:code id:fb746fda-36cc-4c35-92d8-257a489fb278 tags:
+
+``` python
+```
+
+%% Cell type:markdown id:3f4bd68e-eb26-46f8-a00f-86f9d0570580 tags:
+
+You can customize your figure like any classical seaborn/matplotlib figure,
+
+for instance, to display the value above each bar.
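A minimal sketch of writing the value above each bar, assuming matplotlib 3.4 or later (where `Axes.bar_label` is available). It is shown outside the widget function for brevity; `head_df` holds the five rows printed by `ha_df.head()`, and the two-decimal format is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# The five rows shown by ha_df.head(), restricted to the columns used here.
head_df = pd.DataFrame(
    {
        "Country": ["Denmark", "Switzerland", "Iceland", "Norway", "Finland"],
        "Happiness Score": [7.526, 7.509, 7.501, 7.498, 7.413],
    }
)

fig, ax = plt.subplots()
sns.barplot(data=head_df, x="Country", y="Happiness Score", ax=ax)

# Each container groups bars drawn together; label every bar with its height.
value_labels = []
for container in ax.containers:
    value_labels.extend(ax.bar_label(container, fmt="%.2f"))
```

Looping over `ax.containers` rather than indexing the first one keeps the sketch valid if the bars end up split across several containers, for example when a `hue` is added.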
+
+%% Cell type:code id:7bcee7c5-f1c2-4035-9b7c-e68e1d73a932 tags:
+
+``` python
+```

-%% Cell type:code id:e587f202-7ca4-43fb-ac3c-015c740c69d2 tags:
+%% Cell type:code id:d78b7b86-ecaa-4d27-80ca-2d3e46c2aca3 tags:

 ``` python
 ```

--- a/notebooks/Solutions/seaborn_TP_solutions.ipynb
+++ b/notebooks/Solutions/seaborn_TP_solutions.ipynb
 %% Cell type:markdown id:instrumental-personal tags:

 # <center><b>Hands-on</b></center>

 <div style="text-align:center">
    <img src="../images/seaborn.png" width="600px">
    <div>
       Bertrand Néron, François Laurent, Etienne Kornobis, Vincent Guillemot
       <br />
       <a href="https://research.pasteur.fr/en/team/bioinformatics-and-biostatistics-hub/">Bioinformatics and Biostatistics HUB</a>
       <br />
       © Institut Pasteur, 2024
    </div>
 </div>

 %% Cell type:markdown id:compliant-basis tags:

 Practice your graphing skills through the data of [happiness 2016](https://www.kaggle.com/datasets/unsdsn/world-happiness?select=2016.csv)

 (The data are already in data directory as `happiness_2016.csv`)

 %% Cell type:markdown id:3778963b-3bae-486d-8db7-30f23eb239ac tags:

 ## Import the data and have a look at them

+1. import the pandas and seaborn modules
+2. import the data
+
 %% Cell type:code id:minor-doctrine tags:

 ``` python
 import pandas as pd
 import seaborn as sns
 ```

 %% Cell type:code id:skilled-daniel tags:

 ``` python
 ha_df = pd.read_csv("../data/happiness_2016.csv")
 ```

+%% Cell type:markdown id:3d7b230d-ab79-4075-b3ce-248a3dc85fbb tags:
+
+3. have a look at them
+
 %% Cell type:code id:f8729a5b-314d-42fc-b130-783ca5e2076a tags:

 ``` python
 ha_df.shape
 ```

-%% Output
-
-    (157, 13)
-
 %% Cell type:code id:brutal-manufacturer tags:

 ``` python
 ha_df.head()
 ```

-%% Output
-
-           Country          Region  Happiness Rank  Happiness Score  \
-    0      Denmark  Western Europe               1            7.526
-    1  Switzerland  Western Europe               2            7.509
-    2      Iceland  Western Europe               3            7.501
-    3       Norway  Western Europe               4            7.498
-    4      Finland  Western Europe               5            7.413
-    
-       Lower Confidence Interval  Upper Confidence Interval  \
-    0                      7.460                      7.592
-    1                      7.428                      7.590
-    2                      7.333                      7.669
-    3                      7.421                      7.575
-    4                      7.351                      7.475
-    
-       Economy (GDP per Capita)   Family  Health (Life Expectancy)  Freedom  \
-    0                   1.44178  1.16374                   0.79504  0.57941
-    1                   1.52733  1.14524                   0.86303  0.58557
-    2                   1.42666  1.18326                   0.86733  0.56624
-    3                   1.57744  1.12690                   0.79579  0.59609
-    4                   1.40598  1.13464                   0.81091  0.57104
-    
-       Trust (Government Corruption)  Generosity  Dystopia Residual
-    0                        0.44453     0.36171            2.73939
-    1                        0.41203     0.28083            2.69463
-    2                        0.14975     0.47678            2.83137
-    3                        0.35776     0.37895            2.66465
-    4                        0.41004     0.25492            2.82596
-
 %% Cell type:markdown id:departmental-exhibition tags:

 ## Do a boxplot showing the differences in `Happiness Score` between regions (`Region`):

 %% Cell type:code id:saved-identity tags:

 ``` python
 sns.boxplot(data=ha_df, x="Happiness Score", y="Region")
 ```

 %% Cell type:markdown id:portuguese-worse tags:

 ## Using a histogram and continuous probability density curve, display the distribution of `Freedom` in the dataset

 %% Cell type:code id:continuous-indian tags:

 ``` python
 sns.histplot(data=ha_df, x="Freedom")
 ```

 %% Cell type:code id:understanding-vegetarian tags:

 ``` python
 sns.histplot(data=ha_df, x="Freedom", kde=True)
 ```

 %% Cell type:markdown id:prepared-stephen tags:

 - Use a barplot to show the count of countries per Region (see the documentation for a countplot)

 %% Cell type:code id:worldwide-communication tags:

 ``` python
 sns.countplot(data=ha_df, x="Region")
 ```

 %% Cell type:markdown id:4e1d16f2-57d8-4f0e-9a69-e9ab193a3ebc tags:

 As you can see, the labels overlap each other and are not readable.

 One possibility is to rotate the X-labels. In this case it is better to provide the labels explicitly.

 %% Cell type:code id:64ee74bb-5f37-485f-8c03-d06ac14d3010 tags:

 ``` python
 # extract the Region from the data, I will use them as labels for figures below
 regions = ha_df.loc[:, 'Region'].drop_duplicates()
 ```

 %% Cell type:code id:eb7c96ac-585a-4787-a2ee-dec3d04790ca tags:

 ``` python
 ax = sns.countplot(data=ha_df, x="Region")
 ax.set_xticks(regions)
 ax.set_xticklabels(regions, rotation=45, ha='right', rotation_mode='anchor')
 ```
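
As an alternative sketch for rotating the labels, assuming plain matplotlib and made-up counts (`labels` and `counts` are illustrative, not the real values of the dataset): `plt.setp` changes the existing tick labels in place, so the tick positions do not have to be set again.

```python
import matplotlib.pyplot as plt

# Illustrative values only (not the real counts of the dataset)
labels = ["Western Europe", "Sub-Saharan Africa", "Southeastern Asia"]
counts = [3, 5, 2]

fig, ax = plt.subplots()
ax.bar(labels, counts)

# Rotate the existing tick labels and anchor them on their right end
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Collect the resulting properties to check that the rotation was applied
rotations = [label.get_rotation() for label in ax.get_xticklabels()]
alignments = [label.get_ha() for label in ax.get_xticklabels()]
```

The same call should also work on the `ax` returned by a seaborn plot, since seaborn returns a regular matplotlib Axes.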

 %% Cell type:markdown id:e2eac27c-2de9-4942-82b9-294318ec5fd4 tags:

 ## On the same data `Happiness` and `Region` do a boxplot and a swarmplot to display the structure of the data

 %% Cell type:code id:b0fd6058-65b0-4f9b-bc59-dce0193f1580 tags:

 ``` python
 ax = sns.swarmplot(data=ha_df, x="Region", y="Happiness Score")
 ax.set_xticks(regions)
 ax.set_xticklabels(regions,rotation=45, ha='right', rotation_mode='anchor')
 ```

 %% Cell type:code id:1fe079b4-013a-4d48-ac31-9daad4b4673e tags:

 ``` python
 ax = sns.boxplot(data=ha_df, x="Region", y="Happiness Score", hue='Region') # see the result of the option hue
 ax.set_xticks(regions)
 ax.set_xticklabels(regions,rotation=45, ha='right', rotation_mode='anchor')
 ```

 %% Cell type:markdown id:immediate-method tags:

 ## Plot the distribution of `Happiness Score` for the people living in `Western Europe`

 %% Cell type:code id:academic-measure tags:

 ``` python
 sns.histplot(data=ha_df.query("Region == 'Western Europe'"), x="Happiness Score", kde=True)
 ```

 %% Cell type:markdown id:fd9789db-9bab-478a-bcec-1c8b6775cf20 tags:

-## Plot the `Health (Life Expectancy)` vs `Happiness Score` and color the dots according to the region
+## Plot the `Health (Life Expectancy)` vs `Happiness Score`, color the dots according to the region, and specify a size for the figure (9 x 7 inches)
+
+1. import pyplot
+2. then create a new figure and axis at the right size
+3. create the plot

 %% Cell type:code id:d42f72f1-1d01-496e-89a1-68391ffa4281 tags:

 ``` python
 import matplotlib.pyplot as plt
 ```

 %% Cell type:code id:52ae8376-3c66-4ca6-86a9-f9ae9f56076f tags:

 ``` python
 fig, ax = plt.subplots(figsize=(9, 7))
 sns.scatterplot(data=ha_df, x="Health (Life Expectancy)", y="Happiness Score", hue="Region", ax=ax)
 ```

 %% Cell type:markdown id:temporal-synthesis tags:

 - Feel free to explore more of [seaborn](https://seaborn.pydata.org/examples/index.html) !

 %% Cell type:markdown id:3063abf7-2251-48eb-b371-6c5b70b45fe7 tags:

 ## Do a barplot of the Happiness Score for each Region

 %% Cell type:code id:85dd0df6-74e7-43be-9a7c-eb922a06601b tags:

 ``` python
 sns.barplot(data=ha_df, y="Region", x="Happiness Score", hue='Region', orient='h')
 ```

 %% Cell type:markdown id:3ee5741a-64f4-4690-963b-1f7e729398bf tags:

 ## From this point on, we will focus on the Regions

 ### Clean our dataset: remove the irrelevant columns

 %% Cell type:code id:0344b730-1535-47fb-82f5-07003fd223f9 tags:

 ``` python
 ha_df.columns
 ```

 %% Cell type:markdown id:36e449d1-0add-4ebc-8903-d535219ce423 tags:

 1. keep only columns 'Region', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Freedom','Trust (Government Corruption)', 'Generosity'
 2. set the index to the Region
 3. have a look at your new data

 %% Cell type:code id:bdf897c4-b8f3-4dff-b9c0-0ad47b25ecc0 tags:

 ``` python
 region_df = ha_df.loc[:, ['Region', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Freedom','Trust (Government Corruption)', 'Generosity']]
 region_df.set_index('Region', inplace=True)
 region_df.head()
 ```

 %% Cell type:markdown id:e1ae03ac-ac7c-436d-987d-113e9cca3eec tags:

 ## Aggregate the new data region by region. Compute the mean over the countries of each Region as the value for that Region

 %% Cell type:code id:3fc3ea89-a448-4e7b-abfb-3fa92cffc5f7 tags:

 ``` python
 reg_agg = region_df.groupby('Region').agg('mean')
 reg_agg
 ```
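
A minimal sketch of what this aggregation does, on a toy table with made-up region names and scores (`toy_df` and its values are assumptions used only for illustration):

```python
import pandas as pd

# Toy data with made-up scores, only to illustrate the aggregation
toy_df = pd.DataFrame(
    {
        "Region": ["North", "North", "South"],
        "Happiness Score": [7.0, 5.0, 4.0],
        "Freedom": [0.5, 0.25, 0.25],
    }
).set_index("Region")

# One row per region: the mean of each column over the countries of that region
toy_agg = toy_df.groupby("Region").mean()
print(toy_agg)
```

The two "North" rows collapse into a single row holding their column-wise means, while "South" keeps its only values.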

 %% Cell type:markdown id:97cb188c-3e50-4492-961f-cadea3611aaa tags:

 ## Do a hierarchically-clustered heatmap

 %% Cell type:code id:9aa21ed4-e9b2-4eb3-a693-c59ceb513552 tags:

 ``` python
 sns.clustermap(data=reg_agg)
 ```

 %% Cell type:markdown id:88d27d29-e3b8-43d7-8324-25e50c247872 tags:

 Check the data.

 %% Cell type:code id:0128f575-0b2a-4cbc-8f6e-8b7e22d81254 tags:

 ``` python
 reg_agg.describe()
 ```

 %% Cell type:markdown id:f9b39ab8-0051-4840-9e87-fe2bcb8ca07a tags:

 The data are not in the same range, so it could be better to standardize the data before doing the clustering

 %% Cell type:code id:ff4beb57-b357-47a3-b7bd-877e05229b6b tags:

 ``` python
 normalized_reg = (reg_agg - reg_agg.mean()) / reg_agg.std()
 normalized_reg
 ```
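
To make the standardization concrete, here is a small sketch on made-up numbers (`toy` is a hypothetical table, not part of the dataset). Each column is centered on its mean and divided by its standard deviation; pandas `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Two made-up columns with very different ranges
toy = pd.DataFrame(
    {"score": [4.0, 6.0, 8.0], "freedom": [0.2, 0.4, 0.6]},
    index=["A", "B", "C"],
)

# Column-wise z-score: subtract the column mean, divide by the column std
standardized = (toy - toy.mean()) / toy.std()
print(standardized)
```

After this step both columns have mean 0 and standard deviation 1, so neither dominates the distance computation used for the clustering.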

 %% Cell type:code id:e19f9472-cb9b-434b-8689-2bf09d49b902 tags:

 ``` python
 sns.clustermap(data=normalized_reg, annot=True) # see the results of the annot option
 ```

 %% Cell type:markdown id:d64a0377-339b-4fe7-beb4-a32e4a4e0113 tags:

 It's possible to do that directly in seaborn with the `z_score` option (https://seaborn.pydata.org/generated/seaborn.clustermap.html)

 %% Cell type:code id:3b439517-5007-4fbb-828d-265f9835594f tags:

 ``` python
 sns.clustermap(data=reg_agg, z_score=1, annot=True)
 ```

 %% Cell type:markdown id:f8d6188e-3a37-4ba5-b377-a11696054e9c tags:

 - Explore the clustermap documentation to get a more readable heatmap by standardizing the data within each column.

 %% Cell type:markdown id:a2627322-e6a5-422f-8a69-b89dbd4b777e tags:

 ## Create a function which produces a single image with four different plots of your choice, and save it to a PDF file.

 The result should look like the image below.

 %% Cell type:markdown id:4121ff3d-6814-493e-a505-357ad81b0d28 tags:

 <img src="../images/multiple_figure.png" width="50%" />

 %% Cell type:code id:a322c866-9232-4fae-bcee-9a635e3fd70b tags:

 ``` python
 import matplotlib.pyplot as plt
 ```

 %% Cell type:code id:044022d1-741d-4a07-ba7f-c1f863cca138 tags:

 ``` python
 def expression_graph():
    fig, axs = plt.subplots(2, 2, figsize=(9, 7), constrained_layout=True)  # constrained_layout=True avoids overlap between an axes title and the X-labels of the plot above it
    sns.boxplot(data=ha_df, x="Happiness Score", y="Region", hue='Region', ax=axs[0,0])
    axs[0,0].set_title("happiness data structure")

    sns.scatterplot(data=ha_df, x="Health (Life Expectancy)", y="Happiness Score", hue="Region", legend=False, ax=axs[0,1])
    axs[0,1].set_title("Happiness vs Health")

    sns.barplot(data=ha_df, y="Region", x="Happiness Score", hue='Region', orient='h', ax=axs[1,0])
    axs[1,0].set_title("happiness through the world")

    sns.histplot(data=ha_df, x="Happiness Score", kde=True, ax=axs[1,1])
    axs[1,1].set_title("Happiness data distribution")

    return fig

 ```

 %% Cell type:code id:c33bfc78-7480-4327-93a0-f8aaca0d3614 tags:

 ``` python
 my_fig = expression_graph()
 my_fig.suptitle("Happiness Report")
 my_fig.savefig("happiness_visualization.pdf", bbox_inches="tight")  # bbox_inches="tight" avoids truncating the Y-labels on the left in the PDF
 ```
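
The grid-then-save pattern can be sketched without the dataset. The example below is an assumption-laden illustration: the panel titles and the plotted line are placeholders, and the PDF is written to an in-memory buffer instead of a file so that nothing is left on disk.

```python
import io

import matplotlib.pyplot as plt

# A 2 x 2 grid of axes; axs is a 2-D array indexed as axs[row, column]
fig, axs = plt.subplots(2, 2, figsize=(9, 7), constrained_layout=True)

# Placeholder content: one line and one title per panel
for ax, title in zip(axs.flat, ["panel 1", "panel 2", "panel 3", "panel 4"]):
    ax.plot([0, 1], [0, 1])
    ax.set_title(title)

# Save as PDF into a buffer; pass a file name instead to write to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="pdf", bbox_inches="tight")
pdf_bytes = buffer.getvalue()
```

Passing `ax=axs[row, column]` to a seaborn function, as in the solution above, draws that plot into the chosen panel.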

 %% Cell type:markdown id:0d05aba4-3c85-4cd9-85f3-5296b19308fb tags:

 # Extras

 %% Cell type:markdown id:66d6668e-683f-462e-a72f-28bdda8736f2 tags:

 - Using ipywidgets, write a function that displays a barplot of `Happiness Score` by country, with the region selected by the user (using a Dropdown widget)

 %% Cell type:markdown id:042bd87e-d2dc-4544-a771-51d80c565d0f tags:

 Import the needed modules:
 - `widgets` and `interact` from the `ipywidgets` package

 %% Cell type:code id:64ebeca1-1332-4585-9e5c-c1b66f82be71 tags:

 ``` python
 from ipywidgets import widgets
 from ipywidgets import interact
 ```

 %% Cell type:markdown id:277264e6-a173-40c5-b71e-4cd551a7fa99 tags:

 Create a DataFrame containing the regions (without duplicate values)

 %% Cell type:code id:ebf7fde9-b4a1-4e8a-86ab-86ad8b1b533a tags:

 ``` python
 regions = ha_df.loc[:, 'Region'].drop_duplicates()
 ```
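
A small sketch of this step on a toy table (the countries and the second region name are placeholders, not real rows of the dataset). `drop_duplicates()` keeps the first occurrence of each region in its original order, and converting the result to a plain list gives a form of `options` that a `Dropdown` accepts.

```python
import pandas as pd

# Toy table with placeholder rows
toy_df = pd.DataFrame(
    {
        "Country": ["Country A", "Country B", "Country C"],
        "Region": ["Western Europe", "Western Europe", "Elsewhere"],
    }
)

# Series of regions without duplicates, in order of first appearance
toy_regions = toy_df.loc[:, "Region"].drop_duplicates()

# Plain Python list, convenient for populating a dropdown
options = toy_regions.tolist()
print(options)
```

`toy_df["Region"].unique()` returns the same values as an array, if a Series is not needed.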

 %% Cell type:markdown id:f34e5053-ccf5-4a67-96db-7457fe16bbd6 tags:

 1. Use this DataFrame to populate your dropdown list
 2. Use the region selected in the dropdown list as the parameter of your function
 3. Select from the whole DataFrame the data corresponding to this region
 4. Display the barplot

 %% Cell type:markdown id:feba608f-2ecb-41ae-b04a-12f075fd644b tags:

 Below is the code skeleton of your function:

 ```python
 @interact(region=widgets.Dropdown(options=regions))
 def plot_counts(region):
    data = ha_df.loc[ha_df['Region'] == region]
    ax = sns.barplot(data= ....
 ```

 %% Cell type:code id:fb746fda-36cc-4c35-92d8-257a489fb278 tags:

 ``` python

 @interact(region=widgets.Dropdown(options=regions))
 def plot_counts(region):
    data = ha_df.loc[ha_df['Region'] == region]
    ax = sns.barplot(data=data, y='Happiness Score', x='Country')
    ax.set_xticks(data.Country)
    ax.set_xticklabels(data.Country, rotation=45, ha='right', rotation_mode='anchor')

 ```

 %% Cell type:markdown id:3f4bd68e-eb26-46f8-a00f-86f9d0570580 tags:

 You can customize your figure like any classical seaborn/matplotlib figure

 For instance, display the value above each bar:

 %% Cell type:code id:7bcee7c5-f1c2-4035-9b7c-e68e1d73a932 tags:

 ``` python

 @interact(region=widgets.Dropdown(options=regions))
 def plot_counts(region):
    data = ha_df.loc[ha_df['Region'] == region]
    ax = sns.barplot(data=data, y='Happiness Score', x='Country')
    for i in ax.containers:
        # add label on each bar https://www.geeksforgeeks.org/how-to-show-values-on-seaborn-barplot/
        ax.bar_label(i, fmt="{:.2f}", rotation='vertical', padding=3)

    ax.set_xticks(data.Country)
    ax.set_xticklabels(data.Country, rotation=45, ha='right', rotation_mode='anchor')
    ax.margins(y=0.1)  # add a margin so the labels stay inside the plot boundaries; here, 10% white space vertically
    # https://stackoverflow.com/questions/72662991/how-can-i-prevent-bar-labels-from-going-outside-the-barplot-boundaries-range

 ```
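
The labelling step can be tried outside the widget. The sketch below assumes matplotlib 3.4 or later (where `Axes.bar_label` is available) and uses the scores of the first three countries of the dataset; it relies on the printf-style `fmt="%.2f"`, which is accepted across those versions.

```python
import matplotlib.pyplot as plt

# Scores of the first three rows of the happiness data
countries = ["Denmark", "Switzerland", "Iceland"]
scores = [7.526, 7.509, 7.501]

fig, ax = plt.subplots()
bars = ax.bar(countries, scores)

# Write the value above each bar, rounded to two decimals
texts = ax.bar_label(bars, fmt="%.2f", padding=3)

# Add 10% vertical white space so the labels stay inside the axes
ax.margins(y=0.1)

shown = [t.get_text() for t in texts]
print(shown)
```

`bar_label` returns the created text objects, which is handy for checking or restyling the labels afterwards.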

 %% Cell type:code id:d78b7b86-ecaa-4d27-80ca-2d3e46c2aca3 tags:

 ``` python
 ```

--- a/notebooks/images/iris_histograms.png
+++ b/notebooks/images/iris_histograms.png