Commit 7bddf7fb authored by Etienne Kornobis

minor author/cosmetic edits

parent 21a2434b
%% Cell type:markdown id:cultural-palestine tags:
%% Cell type:markdown id:functional-attraction tags:
# <center>**TP**</center>
<div style="text-align:center">
<img src="images/jupyter.png" width="600px">
<div>
Bertrand Néron, François Laurent, Etienne Kornobis
<br />
<a href="https://research.pasteur.fr/en/team/bioinformatics-and-biostatistics-hub/">Bioinformatics and Biostatistics HUB</a>
<br />
© Institut Pasteur, 2021
</div>
</div>
%% Cell type:markdown id:advance-vaccine tags:
# Introduction to JupyterLab
## Aim of this section
......@@ -104,56 +119,56 @@
Here are some examples of useful magic commands:
- Run cell with bash in subprocess:
%% Cell type:code id:public-nightlife tags:
%% Cell type:code id:respective-prize tags:
``` python
%%bash
echo "This is a bash script"
for i in {1..3}; do echo $i; done
echo "Over and out"
```
%% Cell type:markdown id:marine-arctic tags:
%% Cell type:markdown id:pharmaceutical-college tags:
- The exclamation mark character ``!`` can also be used to execute the rest of the line in a bash subprocess. For example:
%% Cell type:code id:considerable-fleet tags:
%% Cell type:code id:drawn-soldier tags:
``` python
! echo "This is executed in a bash subprocess"
```
%% Cell type:markdown id:satellite-disposal tags:
%% Cell type:markdown id:natural-submission tags:
- `%timeit` can be used to check for execution times:
%% Cell type:code id:delayed-thunder tags:
%% Cell type:code id:under-embassy tags:
``` python
%timeit for _ in range(1000): True
```
%% Cell type:markdown id:vocational-jacksonville tags:
%% Cell type:markdown id:constant-driving tags:
- Load more extensions for the notebook. For example, `autoreload` is a useful extension that automatically reloads a module imported in a Jupyter notebook when the module has changed locally:
%% Cell type:code id:physical-steering tags:
%% Cell type:code id:waiting-credit tags:
``` python
%load_ext autoreload
%autoreload 2
```
%% Cell type:markdown id:regular-tiger tags:
%% Cell type:markdown id:laden-seeker tags:
# Exercises
%% Cell type:markdown id:rotary-bouquet tags:
%% Cell type:markdown id:fitted-insert tags:
The aim here is to get comfortable in JupyterLab.
## Exercise
......@@ -161,34 +176,34 @@
- Create a new notebook with a python3 kernel.
- Create, delete and move cells around using shortcuts and graphical interface.
NB: A kernel provides support for a programming language in Jupyter. Kernels are available for Python, R, Julia, and many more.
%% Cell type:code id:chinese-values tags:
%% Cell type:code id:international-thomson tags:
``` python
```
%% Cell type:markdown id:useful-segment tags:
%% Cell type:markdown id:olympic-shoot tags:
## Exercise
In the notebook, create a code cell containing simple Python code with a
``print`` call, execute the cell and observe its output.
For example::
print("Hello World !")
%% Cell type:code id:classical-extraction tags:
%% Cell type:code id:creative-conditioning tags:
``` python
```
%% Cell type:markdown id:iraqi-wholesale tags:
%% Cell type:markdown id:bibliographic-concern tags:
## Exercise
In the notebook, create a markdown cell with:
......@@ -197,17 +212,17 @@
- A list
- A link to the Jupyter documentation, i.e. https://jupyter.org/documentation
Render (execute) the cell to display it with its formatting applied.
%% Cell type:code id:refined-relation tags:
%% Cell type:code id:ruled-bottle tags:
``` python
```
%% Cell type:markdown id:manufactured-treatment tags:
%% Cell type:markdown id:verbal-field tags:
## Exercise
Grasp the concept of cell execution by creating three cells:
......@@ -215,41 +230,41 @@
- 1 cell defining the same variable with a different value from the previous cell (e.g. `myvar=42`)
- 1 cell printing the value of the variable (`print(myvar)`).
Observe how the execution order of your cells affects the result of the cell printing the variable. This is a potential pitfall of notebooks and must be kept in mind when writing and running them.
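The effect can be sketched outside a notebook. The following is a minimal illustration, not the exercise solution: each "cell" is modelled as a function writing to one shared dictionary, which stands in for the kernel's state. The function names, the `namespace` dictionary and the first value (`1`) are assumptions made for the example.

```python
# Minimal sketch: each "cell" is a function acting on one shared namespace,
# which mimics how notebook cells share the state of a single kernel.
namespace = {}

def cell_a():
    namespace["myvar"] = 1    # hypothetical first value

def cell_b():
    namespace["myvar"] = 42   # second value, as suggested in the exercise

def cell_print():
    return namespace["myvar"]

# Running the cells from top to bottom
cell_a()
cell_b()
top_to_bottom = cell_print()

# Running the same cells in a different order
cell_b()
cell_a()
out_of_order = cell_print()

print(top_to_bottom, out_of_order)
```

The printing cell is identical in both runs; only the order in which the defining cells were executed changes its result.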
%% Cell type:code id:featured-converter tags:
%% Cell type:code id:identified-calculation tags:
``` python
```
%% Cell type:markdown id:constant-thriller tags:
%% Cell type:markdown id:unauthorized-carter tags:
## Exercise
Using a Jupyter magic command, create a cell listing the files in the current directory using a bash subprocess.
%% Cell type:code id:illegal-preserve tags:
%% Cell type:code id:abandoned-shareware tags:
``` python
```
%% Cell type:markdown id:written-bidding tags:
%% Cell type:markdown id:molecular-census tags:
## Exercise
Using the graphical interface, export your notebook as an HTML file.
%% Cell type:code id:waiting-concord tags:
%% Cell type:code id:overall-assurance tags:
``` python
```
%% Cell type:markdown id:varying-providence tags:
%% Cell type:markdown id:static-array tags:
# More documentation
JupyterLab: https://jupyterlab.readthedocs.io/en/latest/
......
%% Cell type:markdown id:right-artwork tags:
%% Cell type:markdown id:integral-thermal tags:
# <center>**TP**</center>
<img src="./images/pandas_logo.svg">
<div style="text-align:center">
Bertrand Néron
Bertrand Néron, François Laurent, Etienne Kornobis
<br />
<a href="https://research.pasteur.fr/en/team/bioinformatics-and-biostatistics-hub/">Bioinformatics and Biostatistics HUB</a>
<br />
© Institut Pasteur, 2021
</div>
%% Cell type:markdown id:sacred-breathing tags:
%% Cell type:markdown id:trained-fighter tags:
# Exploring Blast results
%% Cell type:markdown id:technical-crystal tags:
%% Cell type:markdown id:manufactured-cursor tags:
- Import the file data/blast.txt into a pandas dataframe variable (named `blast_res`). Verify that its type is a pandas
dataframe and display the dataframe in jupyterlab.
NB: The column names for this blast format are: "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"
You will need to pass an extra argument (`names`) to specify the names of the columns.
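To see why `names` matters for a file without a header line, here is a small sketch on made-up data. The two rows and the three column names below are invented for the illustration and are not taken from data/blast.txt.

```python
import io
import pandas as pd

# Two made-up tab-separated rows with no header line
raw = "q1\ts1\t98.5\nq1\ts2\t75.0\n"

# Without `names`, the first data row is consumed as the header
no_names = pd.read_csv(io.StringIO(raw), sep="\t")

# With `names`, every row is kept as data and the columns get explicit labels
with_names = pd.read_csv(
    io.StringIO(raw), sep="\t", names=["qseqid", "sseqid", "pident"]
)

print(no_names.shape)
print(with_names.shape)
```

Comparing the two shapes shows that one row is lost when pandas is left to guess the header.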
%% Cell type:code id:recreational-seller tags:
%% Cell type:code id:musical-violence tags:
``` python
import pandas as pd
```
%% Cell type:code id:major-dream tags:
%% Cell type:code id:loaded-transfer tags:
``` python
blast_colnames = ["qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore"]
blast_res = pd.read_csv("../data/blast.txt", sep="\t", names=blast_colnames)
```
%% Cell type:code id:parliamentary-heaven tags:
%% Cell type:code id:streaming-regulation tags:
``` python
type(blast_res)
```
%%%% Output: execute_result
pandas.core.frame.DataFrame
%% Cell type:code id:changing-drive tags:
%% Cell type:code id:unsigned-coast tags:
``` python
blast_res
```
......@@ -80,20 +80,20 @@
174 9 290 19 310 6.000000e-08 57.0
175 95 155 15 62 1.000000e-06 50.1
[176 rows x 12 columns]
%% Cell type:markdown id:productive-chorus tags:
%% Cell type:markdown id:dominant-knowing tags:
Explore the ``blast_res`` dataframe:
- Display the first 5 rows of the dataframe.
- Display the last 8 rows of the dataframe.
- Display an overall statistical description of the dataframe.
- Display the dimensions of the dataframe.
%% Cell type:code id:yellow-matthew tags:
%% Cell type:code id:simplified-progress tags:
``` python
blast_res.head(5)
```
......@@ -111,11 +111,11 @@
1 23 316 51 344 0.0 559.0
2 1 316 1 316 0.0 537.0
3 1 316 1 316 0.0 527.0
4 1 316 1 316 0.0 515.0
%% Cell type:code id:handled-details tags:
%% Cell type:code id:narrow-smell tags:
``` python
blast_res.tail(8)
```
......@@ -139,11 +139,11 @@
172 16 285 50 342 6.000000e-09 60.1
173 11 294 19 285 6.000000e-09 59.7
174 9 290 19 310 6.000000e-08 57.0
175 95 155 15 62 1.000000e-06 50.1
%% Cell type:code id:virgin-forestry tags:
%% Cell type:code id:identical-guest tags:
``` python
blast_res.describe()
```
......@@ -167,25 +167,25 @@
25% 4.000000 272.000000 8.750000e-100 167.000000
50% 8.000000 302.500000 1.000000e-61 205.000000
75% 11.000000 320.000000 8.000000e-48 303.000000
max 84.000000 362.000000 1.000000e-06 654.000000
%% Cell type:code id:superb-papua tags:
%% Cell type:code id:alpine-cleveland tags:
``` python
blast_res.shape
```
%%%% Output: execute_result
(176, 12)
%% Cell type:markdown id:fourth-pennsylvania tags:
%% Cell type:markdown id:imposed-squad tags:
- Extract the 3rd row from the ``blast_res`` dataframe. Which type of data structure is returned by this extraction?
%% Cell type:code id:binding-interest tags:
%% Cell type:code id:complicated-football tags:
``` python
blast_res.iloc[2]
```
......@@ -203,25 +203,25 @@
send 316
evalue 0.0
bitscore 537.0
Name: 2, dtype: object
%% Cell type:code id:careful-dining tags:
%% Cell type:code id:administrative-biodiversity tags:
``` python
type(blast_res.iloc[2])
```
%%%% Output: execute_result
pandas.core.series.Series
%% Cell type:markdown id:common-sixth tags:
%% Cell type:markdown id:equipped-amendment tags:
- Extract the *sseqid* column from the ``blast_res`` dataframe.
%% Cell type:code id:located-waters tags:
%% Cell type:code id:seasonal-europe tags:
``` python
blast_res.sseqid
# OR
blast_res['sseqid']
......@@ -242,63 +242,63 @@
173 sp|P25906|YDBC_ECOLI
174 sp|C6TBN2|AKR1_SOYBN
175 sp|P49261|CROB_LEPLU
Name: sseqid, Length: 176, dtype: object
%% Cell type:markdown id:searching-coach tags:
%% Cell type:markdown id:major-leave tags:
- Get the minimum and maximum values of the *evalue* column.
%% Cell type:code id:square-airplane tags:
%% Cell type:code id:varied-influence tags:
``` python
blast_res.evalue.min()
```
%%%% Output: execute_result
0.0
%% Cell type:code id:innovative-audio tags:
%% Cell type:code id:little-recipient tags:
``` python
blast_res.evalue.max()
```
%%%% Output: execute_result
1e-06
%% Cell type:markdown id:broad-password tags:
%% Cell type:markdown id:sitting-blackberry tags:
- Get the median and the mean of the *bitscore* column.
%% Cell type:code id:tamil-aggregate tags:
%% Cell type:code id:polyphonic-retro tags:
``` python
blast_res.bitscore.median()
```
%%%% Output: execute_result
205.0
%% Cell type:code id:sitting-metallic tags:
%% Cell type:code id:advisory-symphony tags:
``` python
blast_res.bitscore.mean()
```
%%%% Output: execute_result
231.9528409090909
%% Cell type:markdown id:excessive-tournament tags:
%% Cell type:markdown id:friendly-extra tags:
- Keep all hits with a percentage of identity (*pident*) greater than 75%.
%% Cell type:code id:duplicate-ghana tags:
%% Cell type:code id:rough-globe tags:
``` python
blast_res.loc[blast_res.pident > 75]
```
......@@ -318,11 +318,11 @@
2 1 316 1 316 0.000000e+00 537.0
3 1 316 1 316 0.000000e+00 527.0
4 1 316 1 316 0.000000e+00 515.0
5 1 316 1 316 2.000000e-177 501.0
%% Cell type:code id:developing-browser tags:
%% Cell type:code id:novel-turkey tags:
``` python
# OR
blast_res.query("pident > 75")
```
......@@ -343,15 +343,15 @@
2 1 316 1 316 0.000000e+00 537.0
3 1 316 1 316 0.000000e+00 527.0
4 1 316 1 316 0.000000e+00 515.0
5 1 316 1 316 2.000000e-177 501.0
%% Cell type:markdown id:nonprofit-fitting tags:
%% Cell type:markdown id:several-light tags:
- Based on the bitscore alone, extract only the best hit(s) (i.e. the hit(s) with the highest bitscore).
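The plural "hit(s)" matters because several rows can share the maximum. A minimal sketch on a toy dataframe (the `hits` table and its values are invented for the example) shows that comparing against the maximum keeps every tied row:

```python
import pandas as pd

# Toy data, not the real blast results: two rows share the top bitscore
hits = pd.DataFrame(
    {
        "sseqid": ["a", "b", "c", "d"],
        "bitscore": [654.0, 501.0, 654.0, 116.0],
    }
)

# Highest bitscore value in the table
max_bitscore = hits["bitscore"].max()

# Boolean mask: keeps all rows whose bitscore equals the maximum, ties included
best_hits = hits.loc[hits["bitscore"] == max_bitscore]

print(best_hits)
```

Sorting and taking only the first row would silently drop one of the tied hits, which is why the mask is used here.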
%% Cell type:code id:chronic-wallace tags:
%% Cell type:code id:arbitrary-style tags:
``` python
# Getting the highest bitscore value
max_bitscore = blast_res.bitscore.max()
# Extracting all the rows with a bitscore equal to the maximum bitscore
......@@ -364,15 +364,15 @@
0 AK1BA_HUMAN sp|O60218|AK1BA_HUMAN 100.0 316 0 0
qstart qend sstart send evalue bitscore
0 1 316 1 316 0.0 654.0
%% Cell type:markdown id:saving-homeless tags:
%% Cell type:markdown id:heated-poultry tags:
- Keep all hits that correspond to human sequences in the database (*sseqid*).
%% Cell type:code id:western-language tags:
%% Cell type:code id:failing-crossing tags:
``` python
# This could be done with a list comprehension creating a list of Booleans
blast_res.loc[["HUMAN" in x for x in blast_res.sseqid]]
```
......@@ -403,11 +403,11 @@
35 5 316 8 323 1.000000e-101 308.0
36 5 316 8 323 1.000000e-101 308.0
45 5 316 8 323 9.000000e-100 303.0
161 5 118 11 127 3.000000e-30 116.0
%% Cell type:code id:taken-palmer tags:
%% Cell type:code id:trained-durham tags:
``` python
# But pandas has a dedicated syntax for string operations on a Series: the str accessor and its contains method
blast_res.loc[blast_res.sseqid.str.contains("HUMAN")]
```
......@@ -438,11 +438,11 @@
35 5 316 8 323 1.000000e-101 308.0
36 5 316 8 323 1.000000e-101 308.0
45 5 316 8 323 9.000000e-100 303.0
161 5 118 11 127 3.000000e-30 116.0
%% Cell type:code id:tracked-reform tags:
%% Cell type:code id:structural-hybrid tags:
``` python
blast_res.query("~sseqid.str.contains('HUMAN') & pident > 75")
```
......@@ -458,15 +458,15 @@
2 1 316 1 316 0.000000e+00 537.0
3 1 316 1 316 0.000000e+00 527.0
4 1 316 1 316 0.000000e+00 515.0
5 1 316 1 316 2.000000e-177 501.0
%% Cell type:markdown id:reliable-dream tags:
%% Cell type:markdown id:incorporate-interface tags:
- Plot a histogram of the bitscores.
%% Cell type:code id:suspected-substance tags:
%% Cell type:code id:liable-wheat tags:
``` python
blast_res["bitscore"].hist()
```
......@@ -476,15 +476,15 @@
%%%% Output: display_data
![]()
%% Cell type:markdown id:attempted-development tags:
%% Cell type:markdown id:reflected-intervention tags:
- Plot a bar plot of the number of hits per species (the species is taken to be the code after the last "_" in the sseqid column).
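One possible way to get the species code, sketched here on a handful of identifiers copied from the outputs in this notebook (with one deliberately repeated so the counts differ), is to split once from the right and keep the last piece. This is an alternative illustration under that assumption, and the `sseqid`, `species` and `counts` names are local to the example.

```python
import pandas as pd

# Small sample of identifiers; the HUMAN entry is repeated on purpose
sseqid = pd.Series(
    [
        "sp|O60218|AK1BA_HUMAN",
        "sp|O60218|AK1BA_HUMAN",
        "sp|P25906|YDBC_ECOLI",
        "sp|C6TBN2|AKR1_SOYBN",
        "sp|P49261|CROB_LEPLU",
    ]
)

# Split once, starting from the right, and keep the part after the last "_"
species = sseqid.str.rsplit("_", n=1).str[-1]

# Number of hits per species, most frequent first
counts = species.value_counts()

print(counts)
```

Calling `counts.plot(kind="bar")` on the result would then draw the bar plot, provided matplotlib is installed.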
%% Cell type:code id:swiss-provider tags:
%% Cell type:code id:normal-glenn tags:
``` python
# First extract the species information from the sseqid column
hits_by_sp = blast_res.sseqid.str.split("_", expand=True)
hits_by_sp
......@@ -505,11 +505,11 @@
174 sp|C6TBN2|AKR1 SOYBN
175 sp|P49261|CROB LEPLU
[176 rows x 2 columns]
%% Cell type:code id:russian-mystery tags:
%% Cell type:code id:arranged-intervention tags:
``` python
# Then count their occurrences and do the barplot
hits_by_sp.loc[:, 1].value_counts().plot(kind="bar")
```
......@@ -520,38 +520,38 @@
%%%% Output: display_data
![]()
%% Cell type:markdown id:reliable-shark tags:
%% Cell type:markdown id:generous-regression tags:
# Extra exercise
%% Cell type:code id:fabulous-endorsement tags:
%% Cell type:code id:experienced-prediction tags:
``` python
import pandas as pd
```
%% Cell type:markdown id:boxed-basin tags:
%% Cell type:markdown id:purple-legend tags:
Read the file 'data/city_temperature.csv'.
Force the City datatype to string by passing
```
dtype={'City': str}
```
as an argument to the function that reads the file.<br />
Don't worry about the warning: it is due to the State column, which contains NaN for non-US countries, but we do not use these data.
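What the `dtype` argument does can be sketched on a tiny made-up table. The numeric-looking City labels below are hypothetical and chosen only to make type inference visible; they are not taken from city_temperature.csv.

```python
import io
import pandas as pd

# Made-up CSV content with numeric-looking city labels (hypothetical values)
raw = "City,AvgTemperature\n01000,50.2\n75001,48.1\n"

# Default behaviour: pandas infers an integer column and drops the leading zero
inferred = pd.read_csv(io.StringIO(raw), sep=",")

# With dtype, the City column is read as text, exactly as written in the file
forced = pd.read_csv(io.StringIO(raw), sep=",", dtype={"City": str})

print(inferred["City"].tolist())
print(forced["City"].tolist())
```

Only the columns listed in the `dtype` dictionary are forced; the other columns are still inferred by pandas.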
%% Cell type:code id:positive-gateway tags:
%% Cell type:code id:arctic-pickup tags:
``` python
world = pd.read_csv('data/city_temperature.csv' , sep=',', dtype={'City': str})
```
%% Cell type:code id:noble-economics tags:
%% Cell type:code id:conventional-section tags:
``` python
world.columns
```
......@@ -559,25 +559,25 @@
Index(['Region', 'Country', 'State', 'City', 'Month', 'Day', 'Year',
'AvgTemperature'],
dtype='object')
%% Cell type:markdown id:international-glenn tags:
%% Cell type:markdown id:authentic-hearts tags:
We will work only on the Europe region, so create a dataframe named `europe` containing only these data.
%% Cell type:code id:exciting-founder tags:
%% Cell type:code id:strong-skirt tags:
``` python
europe = world[world['Region'] == 'Europe']
```
%% Cell type:markdown id:dressed-carbon tags:
%% Cell type:markdown id:protected-desperate tags:
Which countries are in Europe?
%% Cell type:code id:crude-pillow tags:
%% Cell type:code id:unable-establishment tags:
``` python
europe.Country.unique()
```
......@@ -589,35 +589,35 @@
'Italy', 'Latvia', 'Macedonia', 'The Netherlands', 'Norway',
'Poland', 'Portugal', 'Romania', 'Russia', 'Serbia-Montenegro',
'Slovakia', 'Spain', 'Sweden', 'Switzerland', 'Ukraine',
'United Kingdom', 'Yugoslavia'], dtype=object)