update seaborn practicals (2)

edf49df9 · Etienne Kornobis · e3a50101 · edf49df9
Commit edf49df9 authored 2 years ago by Etienne Kornobis
--- a/notebooks/seaborn_TP.ipynb
+++ b/notebooks/seaborn_TP.ipynb
@@ -2,10 +2,10 @@
 "cells": [
  {
   "cell_type": "markdown",
-   "id": "rotary-designation",
+   "id": "instrumental-personal",
   "metadata": {},
   "source": [
-    "# <center>**TP**</center>\n",
+    "# <center><b>Hands-on</b></center>\n",
    "\n",
    "<div style=\"text-align:center\">\n",
    "    <img src=\"images/seaborn.png\" width=\"600px\">\n",
@@ -21,106 +21,392 @@
  },
  {
   "cell_type": "markdown",
-   "id": "respected-history",
+   "id": "compliant-basis",
   "metadata": {},
   "source": [
    "Practice your graphing skills using data from milieu intérieur in `data/mi.csv`:"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "departmental-exhibition",
+   "metadata": {},
+   "source": [
+    "- Do a boxplot showing the differences in temperature between females and males:"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "adolescent-spirituality",
+   "id": "98e904b6-6e90-4c74-a463-2339d3961250",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "widespread-rendering",
+   "id": "portuguese-worse",
   "metadata": {},
   "source": [
-    "- Do a boxplot showing the differences in temperature between females and males:"
+    "- Using a histogram and continuous probability density curve, display the distribution of age in the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "dressed-performer",
+   "id": "55756807-e1fb-4fb5-878c-5e46acea7a11",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "acute-debut",
+   "id": "prepared-stephen",
   "metadata": {},
   "source": [
-    "- Using an histogram, display the distribution of age in the dataset (with kde as well)"
+    "- Use a barplot to show the count of people vaccinated for yellow fever (see the documentation for a countplot)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "crucial-bracelet",
+   "id": "1425046c-a058-45fe-95b5-5eca6ebbd33a",
   "metadata": {},
   "outputs": [],
   "source": []
  },
+  {
+   "cell_type": "markdown",
+   "id": "immediate-method",
+   "metadata": {},
+   "source": [
+    "- Plot the distribution of age for the people vaccinated for the flu"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "minor-secretariat",
+   "id": "d567194c-3698-44c9-b5f8-b8a3d3493b0c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "processed-diameter",
+   "id": "temporal-synthesis",
   "metadata": {},
   "source": [
-    "- Use a barplot to show the count of vaccinated for yellow fever (see the documentation for a countplot)"
+    "- Feel free to explore more of [seaborn](https://seaborn.pydata.org/examples/index.html)!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "db56d49a-4770-4f9e-af6b-78960574d338",
+   "metadata": {},
+   "source": [
+    "# Exploring count matrices from RNA-seq data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5377668b-dea5-4c20-8249-5266f98774eb",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/rnaseq.png\" style=\"margin:0 auto;width:800px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ebf1606b-0b21-4821-a899-551ec33c977e",
+   "metadata": {},
+   "source": [
+    "- Import the count_matrix tsv file from the data folder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "indian-response",
+   "id": "eb53a1f5-9ea7-491e-bcfa-820cb1663af5",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "scenic-adoption",
+   "id": "c80d9947-9ccf-4499-a1c2-9194377cd054",
   "metadata": {},
   "source": [
-    "- Plot the distribution of age for the people vaccinated for the flu"
+    "- Simplify the dataframe to only have the \"Geneid\", \"WTx\" and \"Cx\" columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "weighted-terrain",
+   "id": "56e90032-75ce-47b5-9cd3-95219cd7b26e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "operating-union",
+   "id": "eb65b51f-f689-4a66-b47c-e79f0e9eba52",
   "metadata": {},
   "source": [
-    "- Feel free to explore more of [seaborn](https://seaborn.pydata.org/examples/index.html) !"
+    "- Format your DataFrame properly so that it can be passed to [seaborn.clustermap](https://seaborn.pydata.org/generated/seaborn.clustermap.html) to draw a heatmap."
   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9b422fcb-7cc1-4766-92e3-276742381ae6",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f8d6188e-3a37-4ba5-b377-a11696054e9c",
+   "metadata": {},
+   "source": [
+    "- Explore the clustermap documentation to make the heatmap easier to read by standardizing the data within each gene."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "06be3f98-2167-44ac-9318-955286d77903",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2e61a207-223a-4c01-88ea-76b1b8c3a0b9",
+   "metadata": {},
+   "source": [
+    "- Reformat the counts_df dataframe to have genes in columns and samples in rows.\n",
+    "- Add a \"group\" column defining the grouping of the samples:\n",
+    "    - \"WTx\" samples will be from the \"WT\" group.\n",
+    "    - \"Cx\" samples will be from the \"C\" group."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eea3f521-6960-44ab-ac0b-fcf5a002237f",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a88ecb1-9ed3-4160-91ee-24a30e994b71",
+   "metadata": {},
+   "source": [
+    "- Display a barplot showing the mean expression for each group for a particular gene (for example \"gene-LEPBI_RS00065\")."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cf74e85e-eef3-4023-bb88-5a864cf3c3f9",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "99e2455a-cb7d-44d5-a4a0-2cf272c814ab",
+   "metadata": {},
+   "source": [
+    "- Try plotting a swarmplot on top of the previous barplot:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7cf225f9-aea7-4cd9-ac90-a99592799527",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d200d375-362e-4c1d-a88e-130b094e6feb",
+   "metadata": {},
+   "source": [
+    "- Now plot the same data using a boxplot. Can you see the problem with displaying boxplots for this kind of data?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e4daf00e-9a2c-4ec4-9d26-aa18aae5d82d",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2e1cabe0-aab7-4f0e-888e-81aae7d5df8d",
+   "metadata": {},
+   "source": [
+    "- Compute the median of each gene by group:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6ffd0f59-0fd7-41b9-a87a-c6e1a74145e8",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "308cc10b-6727-4bc5-b05d-4777037e252e",
+   "metadata": {},
+   "source": [
+    "We are now going to add extra annotations to this median table in order to identify genes of interest.\n",
+    "- Import the annotation.csv table from the data folder: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9be6ee5b-d497-47fa-8ac5-cf5514fd52c0",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "50fa81a7-3f34-4160-ad2d-f77d21be9ac0",
+   "metadata": {},
+   "source": [
+    "Annotations in this table are available for many types of loci (the \"genetic_type\" column), but here we will focus on the \"gene\" genetic_type. \n",
+    "- Filter the annotation dataframe to have only \"gene\" as \"genetic_type\"."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f9a8bcf7-0bcc-43e8-828a-ec204658e528",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f8a4e744-e7e2-43b6-b3d4-e59feb40d3ff",
+   "metadata": {},
+   "source": [
+    "- Concatenate the dataframe of medians by group with the annotation dataframe:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "afd8467a-33e1-4b9e-8f6d-b2229099c874",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "af9f8e1f-5f8b-4152-b08a-44e957f13cec",
+   "metadata": {},
+   "source": [
+    "- Calculate an estimate of the gene expression fold change for each gene (by dividing the C median expression by the WT median expression).\n",
+    "- Add it as a \"FoldChange\" column to the previous dataframe."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bb617d00-2c2d-45cc-ace0-3656dc999b17",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d70eb26b-0a26-4bbc-af03-ba8781b09fb5",
+   "metadata": {},
+   "source": [
+    "- Use a barplot to display the fold changes, labelled with the new gene annotation (the \"Name\" column)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4dd4cbee-547f-43f1-9ed7-173f3040b8d5",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "34a26492-7c6b-4a07-a4de-67ec8f693cdc",
+   "metadata": {},
+   "source": [
+    "- By calculating the length of each gene and using a visualisation, does gene expression appear correlated with gene length?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6f35b696-0807-4df4-9310-cb9197e7bf85",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a2627322-e6a5-422f-8a69-b89dbd4b777e",
+   "metadata": {},
+   "source": [
+    "- Create a function which produces a single image with four different plots of your choice and saves it to a PDF file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70e001a1-2848-4fb7-9f33-7beb4475e0fc",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0d05aba4-3c85-4cd9-85f3-5296b19308fb",
+   "metadata": {},
+   "source": [
+    "# Extras"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "66d6668e-683f-462e-a72f-28bdda8736f2",
+   "metadata": {},
+   "source": [
+    "- Using ipywidgets, write a function that displays a barplot of gene expression by group, with the gene selected by the user (using a Dropdown widget, for example)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e587f202-7ca4-43fb-ac3c-015c740c69d2",
+   "metadata": {},
+   "outputs": [],
+   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
+   "display_name": "Python [conda env:dev]",
   "language": "python",
-   "name": "python3"
+   "name": "conda-env-dev-py"
  },
  "language_info": {
   "codemirror_mode": {
@@ -132,7 +418,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.8.10"
+   "version": "3.10.4"
  }
 },
 "nbformat": 4,

-%% Cell type:markdown id:rotary-designation tags:
+%% Cell type:markdown id:instrumental-personal tags:

-# <center>**TP**</center>
+# <center><b>Hands-on</b></center>

 <div style="text-align:center">
    <img src="images/seaborn.png" width="600px">
    <div>
       Bertrand Néron, François Laurent, Etienne Kornobis
       <br />
       <a href="https://research.pasteur.fr/en/team/bioinformatics-and-biostatistics-hub/">Bioinformatics and Biostatistics HUB</a>
       <br />
       © Institut Pasteur, 2021
    </div>
 </div>

-%% Cell type:markdown id:respected-history tags:
+%% Cell type:markdown id:compliant-basis tags:

 Practice your graphing skills using data from milieu intérieur in `data/mi.csv`:

-%% Cell type:code id:adolescent-spirituality tags:
+%% Cell type:markdown id:departmental-exhibition tags:
+
+- Do a boxplot showing the differences in temperature between females and males:
+
+%% Cell type:code id:98e904b6-6e90-4c74-a463-2339d3961250 tags:

 ``` python
 ```
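
One possible approach to the boxplot exercise is sketched here. The small DataFrame stands in for `data/mi.csv`, and the column names `Sex` and `Temperature` are assumptions that may not match the real file.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Hypothetical stand-in for data/mi.csv; the real column names may differ.
mi = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Temperature": [36.6, 36.4, 36.9, 36.5, 36.7, 36.2],
})

# One box per category of "Sex", summarising the "Temperature" values.
ax = sns.boxplot(data=mi, x="Sex", y="Temperature")
ax.set_ylabel("Temperature (°C)")
```

With the real data, `mi` would instead come from `pd.read_csv("data/mi.csv")`.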

-%% Cell type:markdown id:widespread-rendering tags:
+%% Cell type:markdown id:portuguese-worse tags:

- Do a boxplot showing the differences in temperature between females and males:
+- Using a histogram and continuous probability density curve, display the distribution of age in the dataset

-%% Cell type:code id:dressed-performer tags:
+%% Cell type:code id:55756807-e1fb-4fb5-878c-5e46acea7a11 tags:

 ``` python
 ```

-%% Cell type:markdown id:acute-debut tags:
+%% Cell type:markdown id:prepared-stephen tags:

- Using an histogram, display the distribution of age in the dataset (with kde as well)
+- Use a barplot to show the count of people vaccinated for yellow fever (see the documentation for a countplot)

-%% Cell type:code id:crucial-bracelet tags:
+%% Cell type:code id:1425046c-a058-45fe-95b5-5eca6ebbd33a tags:

 ``` python
 ```
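
As a hedged illustration, `sns.countplot` draws one bar per category with its number of rows, so no manual counting is needed. The column name `VaccineYellowFever` and its `Yes`/`No` coding are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Made-up vaccination status; the column name and coding are hypothetical.
mi = pd.DataFrame({"VaccineYellowFever": ["No", "No", "Yes", "No", "Yes", "No"]})

# countplot tallies the rows of each category and draws one bar per category.
ax = sns.countplot(data=mi, x="VaccineYellowFever")

# The same tally computed with pandas, useful as a cross-check.
counts = mi["VaccineYellowFever"].value_counts()
```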

-%% Cell type:code id:minor-secretariat tags:
+%% Cell type:markdown id:immediate-method tags:
+
+- Plot the distribution of age for the people vaccinated for the flu
+
+%% Cell type:code id:d567194c-3698-44c9-b5f8-b8a3d3493b0c tags:

 ``` python
 ```

-%% Cell type:markdown id:processed-diameter tags:
+%% Cell type:markdown id:temporal-synthesis tags:

- Use a barplot to show the count of vaccinated for yellow fever (see the documentation for a countplot)
+- Feel free to explore more of [seaborn](https://seaborn.pydata.org/examples/index.html)!
+
+%% Cell type:markdown id:db56d49a-4770-4f9e-af6b-78960574d338 tags:
+
+# Exploring count matrices from RNA-seq data
+
+%% Cell type:markdown id:5377668b-dea5-4c20-8249-5266f98774eb tags:
+
+<img src="images/rnaseq.png" style="margin:0 auto;width:800px">

-%% Cell type:code id:indian-response tags:
+%% Cell type:markdown id:ebf1606b-0b21-4821-a899-551ec33c977e tags:
+
+- Import the count_matrix tsv file from the data folder
+
+%% Cell type:code id:eb53a1f5-9ea7-491e-bcfa-820cb1663af5 tags:

 ``` python
 ```
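
A tab-separated file can be read with `pd.read_csv` and `sep="\t"`. The sketch below reads from an in-memory string so that it runs on its own; the columns other than `Geneid`, `WTx` and `Cx` are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical excerpt of a count matrix; only Geneid, WTx and Cx are named in the exercise.
tsv_text = (
    "Geneid\tChr\tStart\tEnd\tLength\tWT1\tWT2\tC1\tC2\n"
    "gene-A\tchr1\t1\t900\t900\t10\t12\t30\t28\n"
    "gene-B\tchr1\t1000\t1500\t501\t100\t90\t95\t105\n"
)

# With the real file, pass its path instead of the StringIO buffer.
counts_raw = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```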

-%% Cell type:markdown id:scenic-adoption tags:
+%% Cell type:markdown id:c80d9947-9ccf-4499-a1c2-9194377cd054 tags:

- Plot the distribution of age for the people vaccinated for the flu
+- Simplify the dataframe to only have the "Geneid", "WTx" and "Cx" columns

-%% Cell type:code id:weighted-terrain tags:
+%% Cell type:code id:56e90032-75ce-47b5-9cd3-95219cd7b26e tags:

 ``` python
 ```
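
Assuming that "WTx" and "Cx" mean `WT` or `C` followed by a replicate number, one option is to select columns by regular expression with `DataFrame.filter`. The extra columns here are hypothetical.

```python
import pandas as pd

# Hypothetical raw table with extra columns to discard.
counts_raw = pd.DataFrame({
    "Geneid": ["gene-A", "gene-B"],
    "Chr": ["chr1", "chr1"],
    "Length": [900, 501],
    "WT1": [10, 100],
    "WT2": [12, 90],
    "C1": [30, 95],
    "C2": [28, 105],
})

# Keep "Geneid" plus any column named WT<number> or C<number>.
# The anchors ^ and $ prevent "Chr" from matching the C pattern.
counts_df = counts_raw.filter(regex=r"^(Geneid|WT\d+|C\d+)$")
```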

-%% Cell type:markdown id:operating-union tags:
+%% Cell type:markdown id:eb65b51f-f689-4a66-b47c-e79f0e9eba52 tags:

- Feel free to explore more of [seaborn](https://seaborn.pydata.org/examples/index.html) !
+- Format your DataFrame properly so that it can be passed to [seaborn.clustermap](https://seaborn.pydata.org/generated/seaborn.clustermap.html) to draw a heatmap.
+
+%% Cell type:code id:9b422fcb-7cc1-4766-92e3-276742381ae6 tags:
+
+``` python
+```
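
`clustermap` expects rectangular numeric data, so a text column such as `Geneid` cannot stay among the values. A minimal sketch, with made-up counts, moves it into the index:

```python
import pandas as pd

# Made-up counts in the simplified layout (Geneid + sample columns).
counts_df = pd.DataFrame({
    "Geneid": ["gene-A", "gene-B", "gene-C"],
    "WT1": [10, 100, 5],
    "WT2": [12, 90, 7],
    "C1": [30, 95, 50],
    "C2": [28, 105, 44],
})

# Gene identifiers become row labels; only numeric sample columns remain.
heatmap_df = counts_df.set_index("Geneid")
```

The resulting `heatmap_df` can then be passed directly to `sns.clustermap`.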
+
+%% Cell type:markdown id:f8d6188e-3a37-4ba5-b377-a11696054e9c tags:
+
+- Explore the clustermap documentation to make the heatmap easier to read by standardizing the data within each gene.
+
+%% Cell type:code id:06be3f98-2167-44ac-9318-955286d77903 tags:
+
+``` python
+```
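
In `sns.clustermap`, `z_score=0` standardizes each row and `z_score=1` each column. With genes in rows, as assumed in this sketch on made-up counts, `z_score=0` gives every gene a mean of 0 across samples.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Made-up counts with genes in rows and samples in columns.
heatmap_df = pd.DataFrame(
    {
        "WT1": [10, 100, 5],
        "WT2": [12, 90, 7],
        "C1": [30, 95, 50],
        "C2": [28, 105, 44],
    },
    index=["gene-A", "gene-B", "gene-C"],
)

# z_score=0 standardizes within rows (genes); a diverging palette centred on 0
# then shows which samples sit above or below each gene's own average.
g = sns.clustermap(heatmap_df, z_score=0, cmap="vlag", center=0)
```

Without standardization, highly expressed genes tend to dominate the colour scale and hide differences in the others.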
+
+%% Cell type:markdown id:2e61a207-223a-4c01-88ea-76b1b8c3a0b9 tags:
+
+- Reformat the counts_df dataframe to have genes in columns and samples in rows.
+- Add a "group" column defining the grouping of the samples:
+    - "WTx" samples will be from the "WT" group.
+    - "Cx" samples will be from the "C" group.
+
+%% Cell type:code id:eea3f521-6960-44ab-ac0b-fcf5a002237f tags:
+
+``` python
+```
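
A possible sketch of this reshaping, assuming that sample names are a group prefix followed by a replicate number (for example `WT1`, `C2`); the counts are made up.

```python
import pandas as pd

# Made-up counts in the simplified layout (Geneid + sample columns).
counts_df = pd.DataFrame({
    "Geneid": ["gene-A", "gene-B"],
    "WT1": [10, 100],
    "WT2": [12, 90],
    "C1": [30, 95],
    "C2": [28, 105],
})

# Transpose so that each row is a sample and each column a gene.
samples_df = counts_df.set_index("Geneid").T
samples_df.index.name = "sample"

# Strip the trailing replicate number: "WT1" -> "WT", "C2" -> "C".
samples_df["group"] = samples_df.index.str.replace(r"\d+$", "", regex=True)
```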
+
+%% Cell type:markdown id:9a88ecb1-9ed3-4160-91ee-24a30e994b71 tags:
+
+- Display a barplot showing the mean expression for each group for a particular gene (for example "gene-LEPBI_RS00065").
+
+%% Cell type:code id:cf74e85e-eef3-4023-bb88-5a864cf3c3f9 tags:
+
+``` python
+```
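
`sns.barplot` aggregates with the mean by default, so passing the sample-level table is enough. The values below are made up; only the gene name comes from the exercise.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

gene = "gene-LEPBI_RS00065"

# Made-up expression values for three replicates per group.
samples_df = pd.DataFrame({
    gene: [10, 14, 12, 31, 27, 35],
    "group": ["WT", "WT", "WT", "C", "C", "C"],
})

# Bar height is the mean of the gene's counts within each group.
ax = sns.barplot(data=samples_df, x="group", y=gene)
```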
+
+%% Cell type:markdown id:99e2455a-cb7d-44d5-a4a0-2cf272c814ab tags:
+
+- Try plotting a swarmplot on top of the previous barplot:
+
+%% Cell type:code id:7cf225f9-aea7-4cd9-ac90-a99592799527 tags:
+
+``` python
+```
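
Two seaborn plots can share one figure by passing the first plot's axes to the second through `ax=`. A hedged sketch with made-up values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

gene = "gene-LEPBI_RS00065"

# Made-up expression values for three replicates per group.
samples_df = pd.DataFrame({
    gene: [10, 14, 12, 31, 27, 35],
    "group": ["WT", "WT", "WT", "C", "C", "C"],
})

# Draw the bars first, then overlay the individual replicates on the same axes.
ax = sns.barplot(data=samples_df, x="group", y=gene, color="lightgrey")
sns.swarmplot(data=samples_df, x="group", y=gene, color="black", ax=ax)
```

Showing the individual points makes the small number of replicates behind each bar visible.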
+
+%% Cell type:markdown id:d200d375-362e-4c1d-a88e-130b094e6feb tags:
+
+- Now plot the same data using a boxplot. Can you see the problem with displaying boxplots for this kind of data?
+
+%% Cell type:code id:e4daf00e-9a2c-4ec4-9d26-aa18aae5d82d tags:
+
+``` python
+```
+
+%% Cell type:markdown id:2e1cabe0-aab7-4f0e-888e-81aae7d5df8d tags:
+
+- Compute the median of each gene by group:
+
+%% Cell type:code id:6ffd0f59-0fd7-41b9-a87a-c6e1a74145e8 tags:
+
+``` python
+```
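
With samples in rows and a `group` column, `groupby("group").median()` gives one row per group. Transposing the result, as in this sketch on made-up values, puts genes back in rows.

```python
import pandas as pd

# Made-up sample-level table: one row per sample, one column per gene.
samples_df = pd.DataFrame(
    {
        "gene-A": [10, 12, 14, 30, 28, 35],
        "gene-B": [100, 90, 95, 95, 105, 100],
        "group": ["WT", "WT", "WT", "C", "C", "C"],
    },
    index=["WT1", "WT2", "WT3", "C1", "C2", "C3"],
)

# Median per group for every gene, then genes in rows and groups in columns.
medians = samples_df.groupby("group").median().T
```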
+
+%% Cell type:markdown id:308cc10b-6727-4bc5-b05d-4777037e252e tags:
+
+We are now going to add extra annotations to this median table in order to identify genes of interest.
+- Import the annotation.csv table from the data folder:
+
+%% Cell type:code id:9be6ee5b-d497-47fa-8ac5-cf5514fd52c0 tags:
+
+``` python
+```
+
+%% Cell type:markdown id:50fa81a7-3f34-4160-ad2d-f77d21be9ac0 tags:
+
+Annotations in this table are available for many types of loci (the "genetic_type" column), but here we will focus on the "gene" genetic_type.
+- Filter the annotation dataframe to have only "gene" as "genetic_type".
+
+%% Cell type:code id:f9a8bcf7-0bcc-43e8-828a-ec204658e528 tags:
+
+``` python
+```
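
Row filtering on a column value can be done with a boolean mask. In this sketch only `genetic_type` and `Name` come from the exercise; the `ID` column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical annotation table; "ID" and the values are invented for illustration.
annot = pd.DataFrame({
    "ID": ["gene-A", "cds-A", "gene-B"],
    "genetic_type": ["gene", "CDS", "gene"],
    "Name": ["dnaA", "dnaA", "recF"],
})

# Keep only the rows whose genetic_type is exactly "gene".
genes_annot = annot[annot["genetic_type"] == "gene"]
```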
+
+%% Cell type:markdown id:f8a4e744-e7e2-43b6-b3d4-e59feb40d3ff tags:
+
+- Concatenate the dataframe of medians by group with the annotation dataframe:
+
+%% Cell type:code id:afd8467a-33e1-4b9e-8f6d-b2229099c874 tags:
+
+``` python
+```
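
Concatenating side by side only makes sense if both tables share the same row labels. This sketch assumes the annotation table has an `ID` column (a hypothetical name) whose values match the gene identifiers of the median table.

```python
import pandas as pd

# Made-up medians indexed by gene identifier.
medians = pd.DataFrame(
    {"WT": [12.0, 95.0], "C": [30.0, 100.0]},
    index=["gene-A", "gene-B"],
)

# Hypothetical annotation rows; "ID" is an assumed key column.
genes_annot = pd.DataFrame({
    "ID": ["gene-A", "gene-B", "gene-C"],
    "Name": ["dnaA", "recF", "gyrB"],
})

# axis=1 aligns rows on the index; join="inner" keeps genes present in both tables.
annotated = pd.concat([medians, genes_annot.set_index("ID")], axis=1, join="inner")
```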
+
+%% Cell type:markdown id:af9f8e1f-5f8b-4152-b08a-44e957f13cec tags:
+
+- Calculate an estimate of the gene expression fold change for each gene (by dividing the C median expression by the WT median expression).
+- Add it as a "FoldChange" column to the previous dataframe.
+
+%% Cell type:code id:bb617d00-2c2d-45cc-ace0-3656dc999b17 tags:
+
+``` python
+```
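
The fold change described above is an element-wise division of two columns. A minimal sketch on made-up medians:

```python
import pandas as pd

# Made-up annotated median table.
annotated = pd.DataFrame(
    {"WT": [12.0, 95.0], "C": [30.0, 100.0], "Name": ["dnaA", "recF"]},
    index=["gene-A", "gene-B"],
)

# Fold change of C relative to WT, computed row by row.
annotated["FoldChange"] = annotated["C"] / annotated["WT"]
```

Genes with a WT median of 0 would give infinite or undefined values here, so they may need to be filtered out or handled separately.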
+
+%% Cell type:markdown id:d70eb26b-0a26-4bbc-af03-ba8781b09fb5 tags:
+
+- Use a barplot to display the fold changes, labelled with the new gene annotation (the "Name" column)
+
+%% Cell type:code id:4dd4cbee-547f-43f1-9ed7-173f3040b8d5 tags:
+
+``` python
+```
+
+%% Cell type:markdown id:34a26492-7c6b-4a07-a4de-67ec8f693cdc tags:
+
+- By calculating the length of each gene and using a visualisation, does gene expression appear correlated with gene length?
+
+%% Cell type:code id:6f35b696-0807-4df4-9310-cb9197e7bf85 tags:
+
+``` python
+```
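
A possible sketch, assuming the annotation provides `start` and `end` coordinates (hypothetical column names, taken here as 1-based and inclusive). A scatter plot on log axes is paired with a Spearman rank correlation as a rough numeric summary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Made-up coordinates and WT medians; "start" and "end" are assumed column names.
genes = pd.DataFrame({
    "Name": ["dnaA", "recF", "gyrB", "rpmH"],
    "start": [1, 1500, 4000, 9000],
    "end": [900, 2100, 7000, 9300],
    "WT": [12.0, 95.0, 400.0, 8.0],
})

# Length under the 1-based inclusive convention assumed above.
genes["length"] = genes["end"] - genes["start"] + 1

# Counts and lengths both span orders of magnitude, hence the log axes.
ax = sns.scatterplot(data=genes, x="length", y="WT")
ax.set(xscale="log", yscale="log")

# Rank-based correlation, less sensitive to a few very highly expressed genes.
rho = genes["length"].corr(genes["WT"], method="spearman")
```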
+
+%% Cell type:markdown id:a2627322-e6a5-422f-8a69-b89dbd4b777e tags:
+
+- Create a function which produces a single image with four different plots of your choice and saves it to a PDF file.
+
+%% Cell type:code id:70e001a1-2848-4fb7-9f33-7beb4475e0fc tags:
+
+``` python
+```
+
+%% Cell type:markdown id:0d05aba4-3c85-4cd9-85f3-5296b19308fb tags:
+
+# Extras
+
+%% Cell type:markdown id:66d6668e-683f-462e-a72f-28bdda8736f2 tags:
+
+- Using ipywidgets, write a function that displays a barplot of gene expression by group, with the gene selected by the user (using a Dropdown widget, for example).
+
+%% Cell type:code id:e587f202-7ca4-43fb-ac3c-015c740c69d2 tags:
+
+``` python
+```