Commit a423c782 authored by Bertrand NÉRON

improve ipywidget exo/demo

parent f2e4e59b
Pipeline #140863 passed with stages
in 2 minutes
%% Cell type:markdown id:instrumental-personal tags:
# <center><b>Hands-on</b></center>
<div style="text-align:center">
<img src="images/seaborn.png" width="600px">
<div>
Bertrand Néron, François Laurent, Etienne Kornobis, Vincent Guillemot
<br />
<a href="https://research.pasteur.fr/en/team/bioinformatics-and-biostatistics-hub/">Bioinformatics and Biostatistics HUB</a>
<br />
© Institut Pasteur, 2024
</div>
</div>
%% Cell type:markdown id:compliant-basis tags:
Practice your graphing skills on the [happiness 2016](https://www.kaggle.com/datasets/unsdsn/world-happiness?select=2016.csv) data.
(The data are already in the `data` directory as `happiness_2016.csv`.)
%% Cell type:markdown id:3778963b-3bae-486d-8db7-30f23eb239ac tags:
## Import the data and have a look at them
%% Cell type:code id:minor-doctrine tags:
``` python
import pandas as pd
import seaborn as sns
```
%% Cell type:code id:skilled-daniel tags:
``` python
ha_df = pd.read_csv("data/happiness_2016.csv")
```
%% Cell type:code id:f8729a5b-314d-42fc-b130-783ca5e2076a tags:
``` python
ha_df.shape
```
%% Cell type:code id:brutal-manufacturer tags:
``` python
ha_df.head()
```
%% Cell type:markdown id:departmental-exhibition tags:
## Do a boxplot showing the differences in `Happiness Score` between regions (`Region`):
%% Cell type:code id:saved-identity tags:
``` python
sns.boxplot(data=ha_df, x="Happiness Score", y="Region")
```
%% Cell type:markdown id:portuguese-worse tags:
## Using a histogram and continuous probability density curve, display the distribution of `Freedom` in the dataset
%% Cell type:code id:continuous-indian tags:
``` python
sns.histplot(data=ha_df, x="Freedom")
```
%% Cell type:code id:understanding-vegetarian tags:
``` python
sns.histplot(data=ha_df, x="Freedom", kde=True)
```
%% Cell type:markdown id:prepared-stephen tags:
- Use a barplot to show the number of countries per Region (see the documentation for a countplot)
%% Cell type:code id:worldwide-communication tags:
``` python
sns.countplot(data=ha_df, x="Region")
```
%% Cell type:markdown id:4e1d16f2-57d8-4f0e-9a69-e9ab193a3ebc tags:
As you can see, the labels overlap each other and are not readable.
One possibility is to rotate the x-labels. In this case, it is better to provide the labels explicitly.
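
As a hedged alternative, the tick labels already on the axis can be rotated in place with matplotlib's `plt.setp`, without passing them again. The sketch below uses a toy bar chart: the category names and counts are made up for illustration and are not taken from the happiness data.

```python
import matplotlib.pyplot as plt

# made-up categories and counts, for illustration only
labels = ["Western Europe", "Sub-Saharan Africa", "Southeastern Asia"]
counts = [21, 38, 9]

fig, ax = plt.subplots()
ax.bar(labels, counts)

# rotate the tick labels that are already on the x-axis,
# anchoring each label at its right end so it stays under its bar
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# collect the rotation of every label to check the result
rotations = [label.get_rotation() for label in ax.get_xticklabels()]
```

The same call should work on the `ax` returned by a seaborn plotting function, since seaborn returns regular matplotlib axes.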
%% Cell type:code id:64ee74bb-5f37-485f-8c03-d06ac14d3010 tags:
``` python
# extract the Region from the data, I will use them as labels for figures below
regions = ha_df.loc[:, 'Region'].drop_duplicates()
```
%% Cell type:code id:eb7c96ac-585a-4787-a2ee-dec3d04790ca tags:
``` python
ax = sns.countplot(data=ha_df, x="Region")
ax.set_xticks(regions)
ax.set_xticklabels(regions, rotation=45, ha='right', rotation_mode='anchor')
```
%% Cell type:markdown id:e2eac27c-2de9-4942-82b9-294318ec5fd4 tags:
## On the same data (`Happiness Score` and `Region`), do a boxplot and a swarmplot to display the structure of the data
%% Cell type:code id:b0fd6058-65b0-4f9b-bc59-dce0193f1580 tags:
``` python
ax = sns.swarmplot(data=ha_df, x="Region", y="Happiness Score")
ax.set_xticks(regions)
ax.set_xticklabels(regions,rotation=45, ha='right', rotation_mode='anchor')
```
%% Cell type:code id:1fe079b4-013a-4d48-ac31-9daad4b4673e tags:
``` python
ax = sns.boxplot(data=ha_df, x="Region", y="Happiness Score", hue='Region') # see the result of the option hue
ax.set_xticks(regions)
ax.set_xticklabels(regions,rotation=45, ha='right', rotation_mode='anchor')
```
%% Cell type:markdown id:immediate-method tags:
## Plot the distribution of `Happiness Score` for the people living in `Western Europe`
%% Cell type:code id:academic-measure tags:
``` python
sns.histplot(data=ha_df.query("Region == 'Western Europe'"), x="Happiness Score", kde=True)
```
%% Cell type:markdown id:fd9789db-9bab-478a-bcec-1c8b6775cf20 tags:
## Plot the `Health (Life Expectancy)` vs `Happiness Score` and color the dots according to the region
%% Cell type:code id:d42f72f1-1d01-496e-89a1-68391ffa4281 tags:
``` python
import matplotlib.pyplot as plt
```
%% Cell type:code id:52ae8376-3c66-4ca6-86a9-f9ae9f56076f tags:
``` python
fig, ax = plt.subplots(figsize=(9, 7))
sns.scatterplot(data=ha_df, x="Health (Life Expectancy)", y="Happiness Score", hue="Region", ax=ax)
```
%% Cell type:markdown id:temporal-synthesis tags:
- Feel free to explore more of [seaborn](https://seaborn.pydata.org/examples/index.html)!
%% Cell type:markdown id:3063abf7-2251-48eb-b371-6c5b70b45fe7 tags:
## Do a barplot of the Happiness Score for each Region
%% Cell type:code id:85dd0df6-74e7-43be-9a7c-eb922a06601b tags:
``` python
sns.barplot(data=ha_df, y="Region", x="Happiness Score", hue='Region', orient='h')
```
%% Cell type:markdown id:3ee5741a-64f4-4690-963b-1f7e729398bf tags:
## From this point on, we will focus on the Regions
### Clean the dataset: remove the irrelevant columns
%% Cell type:code id:0344b730-1535-47fb-82f5-07003fd223f9 tags:
``` python
ha_df.columns
```
%% Cell type:markdown id:36e449d1-0add-4ebc-8903-d535219ce423 tags:
1. Keep only the columns 'Region', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Freedom', 'Trust (Government Corruption)', 'Generosity'
2. Set the index to the Region
3. Have a look at your new data
%% Cell type:code id:bdf897c4-b8f3-4dff-b9c0-0ad47b25ecc0 tags:
``` python
region_df = ha_df.loc[:, ['Region', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Freedom','Trust (Government Corruption)', 'Generosity']]
region_df.set_index('Region', inplace=True)
region_df.head()
```
%% Cell type:markdown id:e1ae03ac-ac7c-436d-987d-113e9cca3eec tags:
## Aggregate the new data region by region. For each Region, use the mean over its countries as the value
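
A small sketch of this aggregation on a toy table (the regions and values below are made up for illustration). `groupby` accepts the name of an index level, so it still works after `Region` has been moved to the index.

```python
import pandas as pd

# toy data, made up for illustration: two regions, three countries
toy = pd.DataFrame(
    {
        "Region": ["North", "North", "South"],
        "Happiness Score": [7.0, 5.0, 4.0],
        "Freedom": [0.6, 0.4, 0.3],
    }
).set_index("Region")

# one row per Region, each value being the mean over that Region's countries
toy_agg = toy.groupby("Region").mean()
print(toy_agg)
```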
%% Cell type:code id:3fc3ea89-a448-4e7b-abfb-3fa92cffc5f7 tags:
``` python
reg_agg = region_df.groupby('Region').agg('mean')
reg_agg
```
%% Cell type:markdown id:97cb188c-3e50-4492-961f-cadea3611aaa tags:
## Do a hierarchically-clustered heatmap
%% Cell type:code id:9aa21ed4-e9b2-4eb3-a693-c59ceb513552 tags:
``` python
sns.clustermap(data=reg_agg)
```
%% Cell type:markdown id:88d27d29-e3b8-43d7-8324-25e50c247872 tags:
Check the data.
%% Cell type:code id:0128f575-0b2a-4cbc-8f6e-8b7e22d81254 tags:
``` python
reg_agg.describe()
```
%% Cell type:markdown id:f9b39ab8-0051-4840-9e87-fe2bcb8ca07a tags:
The data are not in the same range, so it is better to standardize them before clustering.
%% Cell type:code id:ff4beb57-b357-47a3-b7bd-877e05229b6b tags:
``` python
normalized_reg=(reg_agg - reg_agg.mean()) / reg_agg.std()
normalized_reg
```
%% Cell type:code id:e19f9472-cb9b-434b-8689-2bf09d49b902 tags:
``` python
sns.clustermap(data=normalized_reg, annot=True) # see the results of the annot option
```
%% Cell type:markdown id:d64a0377-339b-4fe7-beb4-a32e4a4e0113 tags:
It is possible to do that directly in seaborn, with the `z_score` option (https://seaborn.pydata.org/generated/seaborn.clustermap.html).
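
To make the standardization concrete, here is a minimal pandas sketch on made-up numbers. It computes the z-score by hand, first per column and then per row. The assumption, taken from the seaborn documentation, is that `z_score=1` refers to columns and `z_score=0` to rows.

```python
import pandas as pd

# made-up values on very different scales, for illustration only
toy = pd.DataFrame(
    {"score": [7.0, 5.0, 3.0], "freedom": [0.6, 0.4, 0.2]},
    index=["A", "B", "C"],
)

# column-wise z-score: subtract each column's mean, divide by its standard deviation
by_column = (toy - toy.mean()) / toy.std()

# row-wise z-score: the same formula applied along the other axis
by_row = toy.sub(toy.mean(axis=1), axis=0).div(toy.std(axis=1), axis=0)

print(by_column)
```

After column-wise standardization every column has mean 0 and standard deviation 1, so no single variable dominates the clustering distances.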
%% Cell type:code id:3b439517-5007-4fbb-828d-265f9835594f tags:
``` python
sns.clustermap(data=reg_agg, z_score=1, annot=True)
```
%% Cell type:markdown id:f8d6188e-3a37-4ba5-b377-a11696054e9c tags:
- Explore the clustermap documentation to get a more readable heatmap by standardizing the data within each column.
%% Cell type:markdown id:a2627322-e6a5-422f-8a69-b89dbd4b777e tags:
## Create a function that produces a single image with four different plots of your choice and save it to a PDF file
The result should look like the image below.
%% Cell type:markdown id:4121ff3d-6814-493e-a505-357ad81b0d28 tags:
<img src="data/multiple_figure.png" width="50%" />
%% Cell type:code id:a322c866-9232-4fae-bcee-9a635e3fd70b tags:
``` python
import matplotlib.pyplot as plt
```
%% Cell type:code id:044022d1-741d-4a07-ba7f-c1f863cca138 tags:
``` python
def expression_graph():
    # constrained_layout=True prevents the axes titles from overlapping the x-labels of the plots above them
    fig, axs = plt.subplots(2, 2, figsize=(9, 7), constrained_layout=True)
    sns.boxplot(data=ha_df, x="Happiness Score", y="Region", hue='Region', ax=axs[0,0])
    axs[0,0].set_title("happiness data structure")
    sns.scatterplot(data=ha_df, x="Health (Life Expectancy)", y="Happiness Score", hue="Region", legend=False, ax=axs[0,1])
    axs[0,1].set_title("Happiness vs Health")
    sns.barplot(data=ha_df, y="Region", x="Happiness Score", hue='Region', orient='h', ax=axs[1,0])
    axs[1,0].set_title("happiness through the world")
    sns.histplot(data=ha_df, x="Happiness Score", kde=True, ax=axs[1,1])
    axs[1,1].set_title("Happiness data distribution")
    return fig
```
%% Cell type:code id:c33bfc78-7480-4327-93a0-f8aaca0d3614 tags:
``` python
my_fig = expression_graph()
my_fig.suptitle("Happiness Report")
my_fig.savefig("happiness_visualization.pdf", bbox_inches="tight")  # bbox_inches="tight" avoids truncating the y-labels on the left in the PDF
```
%% Cell type:markdown id:0d05aba4-3c85-4cd9-85f3-5296b19308fb tags:
# Extras
%% Cell type:markdown id:66d6668e-683f-462e-a72f-28bdda8736f2 tags:
- Using ipywidgets, write a function that displays a barplot of `Happiness Score` by country, restricted to a region selected by the user (using a Dropdown widget)
%% Cell type:markdown id:042bd87e-d2dc-4544-a771-51d80c565d0f tags:
Import the needed modules:
- `widgets` and `interact` from the `ipywidgets` package
%% Cell type:code id:64ebeca1-1332-4585-9e5c-c1b66f82be71 tags:
``` python
from ipywidgets import widgets
from ipywidgets import interact
```
%% Cell type:markdown id:277264e6-a173-40c5-b71e-4cd551a7fa99 tags:
Create a Series containing the regions (without duplicate values)
%% Cell type:code id:ebf7fde9-b4a1-4e8a-86ab-86ad8b1b533a tags:
``` python
regions = ha_df.loc[:, 'Region'].drop_duplicates()
```
%% Cell type:markdown id:f34e5053-ccf5-4a67-96db-7457fe16bbd6 tags:
1. Use this Series to populate your dropdown list
2. Use the region selected in the dropdown list as the parameter of your function
3. Select from the whole data frame the data corresponding to this region
4. Display the barplot
%% Cell type:markdown id:feba608f-2ecb-41ae-b04a-12f075fd644b tags:
Below is the code skeleton of your function:
```python
@interact(region=widgets.Dropdown(options=regions))
def plot_counts(region):
    data = ha_df.loc[ha_df['Region'] == region]
    ax = sns.barplot(data= ....
```
%% Cell type:code id:fb746fda-36cc-4c35-92d8-257a489fb278 tags:
``` python
@interact(region=widgets.Dropdown(options=regions))
def plot_counts(region):
    data = ha_df.loc[ha_df['Region'] == region]
    ax = sns.barplot(data=data, y='Happiness Score', x='Country')
    ax.set_xticks(data.Country)
    ax.set_xticklabels(data.Country, rotation=45, ha='right', rotation_mode='anchor')
```
%% Cell type:markdown id:3f4bd68e-eb26-46f8-a00f-86f9d0570580 tags:
You can customize your figure like any classical seaborn/matplotlib figure,
for instance to display the value above each bar.
%% Cell type:code id:7bcee7c5-f1c2-4035-9b7c-e68e1d73a932 tags:
``` python
@interact(region=widgets.Dropdown(options=regions))
def plot_counts(region):
    data = ha_df.loc[ha_df['Region'] == region]
    ax = sns.barplot(data=data, y='Happiness Score', x='Country')
    for i in ax.containers:
        # add a label on each bar https://www.geeksforgeeks.org/how-to-show-values-on-seaborn-barplot/
        ax.bar_label(i, fmt="{:.2f}", rotation='vertical', padding=3)
    ax.set_xticks(data.Country)
    ax.set_xticklabels(data.Country, rotation=45, ha='right', rotation_mode='anchor')
    # add a margin so the labels stay inside the plot boundaries; here 10% white space is added vertically
    # https://stackoverflow.com/questions/72662991/how-can-i-prevent-bar-labels-from-going-outside-the-barplot-boundaries-range
    ax.margins(y=0.1)
```
%% Cell type:code id:d78b7b86-ecaa-4d27-80ca-2d3e46c2aca3 tags:
``` python
```