Commit d3ce2c00 authored by Etienne Kornobis

add seaborn course

parent 075f6061
%% Cell type:markdown id:lesser-criticism tags:
%% Cell type:markdown id:horizontal-listening tags:
# <center>**Course**</center>
<div style="text-align:center">
<img src="images/pandas_logo.svg" width="600px">
<div>
Bertrand Néron, François Laurent, Etienne Kornobis
<br />
<a href="https://research.pasteur.fr/en/team/bioinformatics-and-biostatistics-hub/">Bioinformatics and Biostatistics HUB</a>
<br />
© Institut Pasteur, 2021
</div>
</div>
%% Cell type:markdown id:attempted-certificate tags:
%% Cell type:markdown id:sophisticated-concept tags:
# Intro
**Pandas** is a library to manipulate data structures and perform data analysis and visualization. Pandas is built on top of **Numpy**, a widely used library for mathematical operations, particularly on arrays and matrices. Pandas supports the whole data analysis stack, from data cleaning and formatting to analysis and visualization.
Pandas is particularly well suited to tabular data, which can be imported from different formats such as **csv**, **tsv** or even **xlsx**.
The two primary data structures in pandas are **Series** and **DataFrames**.
Pandas is designed to manipulate tabular data, whereas Numpy is designed to do computation on arrays. The main differences are:
**Numpy**
* handles one structure: the ndarray.
* an *array* can have 1, 2 or more dimensions.
* An *ndarray* handles homogeneous data: there is only one datatype in an array.
* So numpy is mostly used to do math on arrays.
**Pandas**
* A *Series* has 1 dimension, a *DataFrame* has 2 dimensions.
* *Pandas* does **not** handle structures with more than 2 dimensions.
* But a *DataFrame* can contain heterogeneous data: each column can have a different datatype.
* *Pandas* is more powerful for querying and manipulating data.
So *Numpy* is mostly used to do math, *Pandas* to explore data structured in tables.
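The contrast can be sketched with a small example. The values and column names below are made up for illustration: a Numpy array coerces mixed inputs to a single datatype, while each DataFrame column keeps its own.

```python
import numpy as np
import pandas as pd

# A Numpy array holds a single dtype: mixing integers and a string
# converts every element to a string.
arr = np.array([1, 2, "three"])
print(arr.dtype.kind)   # 'U' means unicode string

# A DataFrame keeps one dtype per column (hypothetical columns).
df_mixed = pd.DataFrame({"count": [1, 2, 3],
                         "name": ["a", "b", "c"],
                         "score": [0.5, 1.5, 2.5]})
print(df_mixed.dtypes)
```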
%% Cell type:markdown id:angry-banking tags:
%% Cell type:markdown id:velvet-payroll tags:
# Installation
For *conda* users
```shell
conda install pandas
```
For *pip* users
```shell
pip install pandas
```
%% Cell type:markdown id:british-currency tags:
%% Cell type:markdown id:falling-radar tags:
# Import Convention
%% Cell type:code id:proud-coffee tags:
%% Cell type:code id:executed-tsunami tags:
``` python
import numpy as np
import pandas as pd
```
%% Cell type:markdown id:english-subdivision tags:
%% Cell type:markdown id:foster-convert tags:
# Series
A Series is a one-dimensional array with axis labels. Labels do not need to be
unique but must be hashable.
To create a Series, use the pandas `Series` object and pass a list or tuple
of values as the first argument:
%% Cell type:code id:outer-brass tags:
%% Cell type:code id:musical-civilization tags:
``` python
serie_nolabel = pd.Series([1,2,3])
type(serie_nolabel)
```
%%%% Output: execute_result
pandas.core.series.Series
%% Cell type:code id:executive-right tags:
%% Cell type:code id:superb-relaxation tags:
``` python
serie_nolabel
```
%%%% Output: execute_result
0 1
1 2
2 3
dtype: int64
%% Cell type:markdown id:personal-cleaners tags:
%% Cell type:markdown id:coordinated-issue tags:
You can specify the labels of your Series by providing a list of labels
for the `index` argument:
%% Cell type:code id:spatial-disposal tags:
%% Cell type:code id:received-flash tags:
``` python
serie_label = pd.Series([1,2,3], index=['A', 'B', 'C'])
serie_label
```
%%%% Output: execute_result
A 1
B 2
C 3
dtype: int64
%% Cell type:markdown id:reduced-retention tags:
%% Cell type:markdown id:sorted-optimum tags:
And we can access these indices with the `index` property:
%% Cell type:code id:classical-sapphire tags:
%% Cell type:code id:immune-physiology tags:
``` python
serie_nolabel.index
```
%%%% Output: execute_result
RangeIndex(start=0, stop=3, step=1)
%% Cell type:code id:known-absorption tags:
%% Cell type:code id:systematic-working tags:
``` python
serie_label.index
```
%%%% Output: execute_result
Index(['A', 'B', 'C'], dtype='object')
%% Cell type:markdown id:amateur-secret tags:
%% Cell type:markdown id:arctic-gibson tags:
## Indexing/Slicing
In order to subset a Series based on an **integer position**, you can use the `iloc` attribute:
%% Cell type:code id:exact-accuracy tags:
%% Cell type:code id:alternate-banks tags:
``` python
serie_nolabel.iloc[1]
```
%%%% Output: execute_result
2
%% Cell type:code id:hairy-inspiration tags:
%% Cell type:code id:standing-train tags:
``` python
serie_label.iloc[1]
```
%%%% Output: execute_result
2
%% Cell type:code id:social-extra tags:
%% Cell type:code id:severe-correlation tags:
``` python
serie_label.iloc[0:2]
```
%%%% Output: execute_result
A 1
B 2
dtype: int64
%% Cell type:code id:diagnostic-flood tags:
%% Cell type:code id:raising-grenada tags:
``` python
serie_label.iloc[::-1]
```
%%%% Output: execute_result
C 3
B 2
A 1
dtype: int64
%% Cell type:markdown id:mysterious-airline tags:
%% Cell type:markdown id:blocked-roommate tags:
You can also use **labels** for subsetting, which is the most common approach, using the `loc` attribute:
%% Cell type:code id:private-profession tags:
%% Cell type:code id:accompanied-pantyhose tags:
``` python
serie_label.loc["B"]
```
%%%% Output: execute_result
2
%% Cell type:markdown id:forbidden-conjunction tags:
%% Cell type:markdown id:durable-lesson tags:
**WARNING**: With `loc`, the value is always interpreted as a label of the
index, and **never** as an integer position along the index. Use `iloc` for positions.
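The difference matters most when the labels are themselves integers. As a minimal sketch, assuming a made-up Series `s` whose integer labels do not match the positions:

```python
import pandas as pd

# Hypothetical Series: integer labels in a different order than positions.
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # label 0 is the second element
by_position = s.iloc[0]  # position 0 is the first element
print(by_label, by_position)
```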
When index labels are strings, you can also access the corresponding value using the attribute syntax `.LABEL_VALUE`:
%% Cell type:code id:hawaiian-fever tags:
%% Cell type:code id:comparative-guinea tags:
``` python
serie_label.A
```
%%%% Output: execute_result
1
%% Cell type:markdown id:prescribed-literature tags:
%% Cell type:markdown id:convenient-constitution tags:
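Attribute-style access has limits: the label must be a valid Python identifier and must not collide with an existing Series attribute or method. The sketch below uses a made-up Series `s` to illustrate this; `loc` works in every case.

```python
import pandas as pd

# Hypothetical labels: a plain one, one that shadows a method name,
# and one that is not a valid Python identifier.
s = pd.Series([1, 2, 3], index=["A", "mean", "my label"])

print(s.A)                # same value as s.loc["A"]
print(callable(s.mean))   # s.mean is still the method, not the value
print(s.loc["mean"])      # loc reaches the value stored under "mean"
print(s.loc["my label"])  # labels with spaces need loc
```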
Series objects have many attributes and methods (see [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)), many of which are shared with pandas DataFrames. We will see some of those listed below in action in the DataFrame section of this course.
Here are some attributes of interest:
|Attribute|Action|
|-|-|
|index|Return the index (axis 0 labels) of the Series|
|name|Return the name of the Series|
|shape|Return a tuple with the dimensions of the Series, i.e. `(number of elements,)`|
And some useful methods:
|Method|Action|
|-|-|
|aggregate|Aggregate using one or more operations over the specified axis|
|all|Return whether all elements are True, potentially over an axis|
|any|Return whether any element is True, potentially over an axis|
|apply|Invoke function on values of Series|
|astype|Cast a pandas object to a specified dtype|
|copy|Make a copy of this object’s indices and data|
|count|Return number of non-NA/null observations in the Series|
|describe|Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values|
|drop|Return Series with specified index labels removed|
|groupby|Group DataFrame or Series using a mapper or by a Series of columns|
|head / tail|Return the first / last n rows|
|max, min, median, mean, sum|Perform the corresponding operation on the Series|
|plot|Plot graphs from Series/DataFrame|
|reset_index|Generate a new DataFrame or Series with the index reset|
|sort_values|Sort by the values|
|str|String methods for Series|
|to_csv, to_excel|Export to csv or excel file|
|unique|Return unique values of Series object|
|value_counts|Return a Series containing counts of unique values|
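A few of the attributes and methods from the tables above can be tried on small Series. This is only a sketch on made-up data (the `genes` and `values` Series are hypothetical):

```python
import pandas as pd

# Made-up example data.
genes = pd.Series(["brca1", "tp53", "brca1", "egfr"], name="gene")
values = pd.Series([3.0, 1.0, 2.0], index=["A", "B", "C"])

print(values.shape)                   # tuple with the number of elements
print(values.sort_values())           # sorted by value, labels follow
print(values.describe())              # count, mean, std, min, quartiles, max
print(genes.value_counts())           # occurrences of each distinct value
print(genes.str.upper().unique())     # string method, then distinct values
```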
%% Cell type:markdown id:precious-green tags:
%% Cell type:markdown id:arabic-affairs tags:
## Operations on Series
Comparison operators (e.g. `==`, `<`, `<=`, `>=`, `>`) can be used on Series as well as DataFrames for subsetting.
For example, we want to see which values are greater than 1 in our previous Series:
%% Cell type:code id:optimum-drama tags:
%% Cell type:code id:million-richards tags:
``` python
serie_label > 1
```
%%%% Output: execute_result
A False
B True
C True
dtype: bool
%% Cell type:markdown id:twenty-planet tags:
%% Cell type:markdown id:unlike-monaco tags:
Since `loc` accepts a list or a Series of booleans as input, we can apply this boolean Series as a mask on our Series:
%% Cell type:code id:universal-responsibility tags:
%% Cell type:code id:ordered-rendering tags:
``` python
serie_label.loc[serie_label>1]
```
%%%% Output: execute_result
B 2
C 3
dtype: int64
%% Cell type:markdown id:pressed-clark tags:
%% Cell type:markdown id:major-intermediate tags:
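Boolean masks like the one above can also be combined. As a minimal sketch on a made-up Series `s`, the element-wise operators `&`, `|` and `~` are used instead of Python's `and`, `or` and `not`:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# The parentheses are required: &, | and ~ bind more tightly
# than the comparison operators.
both = s.loc[(s > 1) & (s < 4)]       # element-wise "and"
either = s.loc[(s == 1) | (s == 4)]   # element-wise "or"
negated = s.loc[~(s > 1)]             # element-wise "not"
print(both, either, negated, sep="\n")
```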
## Operations between Series
%% Cell type:markdown id:thick-meter tags:
%% Cell type:markdown id:suitable-focus tags:
Operations (e.g. `+`, `-`, `*`, `/`) between Series will trigger an alignment of the values
based on the index values:
%% Cell type:code id:departmental-creature tags:
%% Cell type:code id:least-cruise tags:
``` python
serie_label + serie_label
```
%%%% Output: execute_result
A 2
B 4
C 6
dtype: int64
%% Cell type:markdown id:regulation-listening tags:
%% Cell type:markdown id:herbal-collaboration tags:
In the following example, the labels are aligned before the operation is applied, so reversing one of the operands does not change the result:
%% Cell type:code id:electric-cherry tags:
%% Cell type:code id:better-blame tags:
``` python
serie_label + serie_label.iloc[::-1]
```
%%%% Output: execute_result
A 2
B 4
C 6
dtype: int64
%% Cell type:markdown id:positive-batman tags:
%% Cell type:markdown id:loved-orleans tags:
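Alignment also applies when the two Series do not share all their labels. The sketch below uses two made-up Series, `left` and `right`, to show what happens to labels found in only one operand:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["A", "B", "C"])
right = pd.Series([10, 20, 30], index=["B", "C", "D"])

# Labels present in only one operand produce NaN in the result.
total = left + right
print(total)

# Series.add with fill_value treats a missing label as 0 instead.
filled = left.add(right, fill_value=0)
print(filled)
```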
# DataFrames
A pandas DataFrame is a two-dimensional data structure with axis labels. Labels do not need to be unique but must be hashable. A DataFrame can be thought of as a dictionary-like container of Series objects.
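The dictionary analogy can be illustrated with a short sketch. The `height` and `weight` Series and their labels are made up for this example:

```python
import pandas as pd

# Two made-up Series sharing part of their index.
height = pd.Series([1.6, 1.8], index=["ann", "bob"])
weight = pd.Series([60, 80, 70], index=["ann", "bob", "cid"])

# Each dictionary key becomes a column; rows are aligned on the index,
# and a label missing from one Series gives NaN in that column.
people = pd.DataFrame({"height": height, "weight": weight})
print(people)

# Selecting a column by its key gives back a Series.
print(type(people["weight"]))
print(list(people.keys()))
```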
## DataFrame Terminology
<img src="images/pandas_dataframe.png" width="300px" />
## Create a DataFrame
[DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) in pandas are rarely created from scratch. One common approach is to create a pandas DataFrame from a dictionary or a file, but you can also create them from a list of lists or numpy ndarrays.
### From a list of lists:
%% Cell type:code id:following-houston tags:
%% Cell type:code id:regulated-ready tags:
``` python
df = pd.DataFrame([[1,2,3],
                   [4,5,6]],
                  columns=['A', 'B', 'C'],
                  index=['a', 'b'])
df
```
%%%% Output: execute_result
A B C
a 1 2 3
b 4 5 6
%% Cell type:code id:personalized-kennedy tags:
%% Cell type:code id:stable-discharge tags:
``` python
df.index
```
%%%% Output: execute_result
Index(['a', 'b'], dtype='object')
%% Cell type:code id:conceptual-boards tags:
%% Cell type:code id:configured-coral tags:
``` python
df.columns
```
%%%% Output: execute_result
Index(['A', 'B', 'C'], dtype='object')
%% Cell type:markdown id:agricultural-spotlight tags:
%% Cell type:markdown id:exclusive-brave tags:
### From a numpy ndarray
%% Cell type:code id:minor-korean tags:
%% Cell type:code id:facial-curve tags:
``` python
df = pd.DataFrame(np.arange(12).reshape(4,3))
df
```
%%%% Output: execute_result
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
%% Cell type:markdown id:still-commissioner tags:
%% Cell type:markdown id:committed-planning tags:
### From a dictionary
%% Cell type:code id:intellectual-wilson tags:
%% Cell type:code id:suspected-nirvana tags:
``` python
df = pd.DataFrame({'A': [1,2,3],
                   'B': np.arange(4,7),
                  })
df
```
%%%% Output: execute_result
A B
0 1 4
1 2 5
2 3 6
%% Cell type:markdown id:international-checkout tags:
%% Cell type:markdown id:vocational-peoples tags:
### From a file
Many options are available, to name only a few:
- [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
- [pd.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)
- [pd.read_html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html)
NB: For Excel and HTML imports, you might need to install extra libraries.
%% Cell type:code id:bronze-prayer tags:
%% Cell type:code id:sonic-shock tags:
``` python
titanic = pd.read_csv("data/titanic.csv")
```
%% Cell type:markdown id:laden-composer tags:
%% Cell type:markdown id:about-cursor tags:
We want to open the *data/bar_data.tsv* file, but the first 2 lines are comments and the separator between fields is a *tab*.
See below the first 5 lines (using the `!` Jupyter shell escape to run a bash command):
%% Cell type:code id:grave-party tags:
%% Cell type:code id:bridal-development tags:
``` python
! head -5 data/bar_data.tsv
```
%%%% Output: stream
# generated with fooo software version 12bis
# 2021/02/31
cond1 cond2 cond3 control
14.644417316782045 2.9453091400880465 24.81171864537413 5.114340165446571
12.071043262601615 4.406424332565544 21.574601309211538 2.5071180945299716
%% Cell type:code id:historical-ivory tags:
%% Cell type:code id:listed-framework tags:
``` python
bar = pd.read_csv("data/bar_data.tsv", sep="\t", comment="#")
bar.head()
```
%%%% Output: execute_result
cond1 cond2 cond3 control
0 14.644417 2.945309 24.811719 5.114340
1 12.071043 4.406424 21.574601 2.507118
2 8.227469 3.185252 20.651623 4.449593
3 8.980799 9.233560 24.859737 4.127919
4 9.080359 5.629192 18.443504 4.268572
%% Cell type:markdown id:bacterial-irrigation tags:
%% Cell type:markdown id:explicit-monitoring tags:
If the rows in the file are already indexed, as in the file below, `read_csv` reads that index as an ordinary column by default:
%% Cell type:code id:supported-health tags:
%% Cell type:code id:allied-artist tags:
``` python
! head -5 data/data_for_plt.csv
```
%%%% Output: stream
MW AlogP PSA HBA
0 0.0 1.0 72.73111270481336 1.1416684150966834
1 3.63 544.59 391.4275648686457 0.9848635571682688
2 2.11 383.4 437.4589821943501 15.040385372412596
3 1.24 162.23 480.1112629835199 11.401906578750385
%% Cell type:code id:discrete-anaheim tags:
%% Cell type:code id:limiting-tokyo tags:
``` python
data = pd.read_csv("data/data_for_plt.csv", sep="\t")
data.head(3)
```
%%%% Output: execute_result
Unnamed: 0 MW AlogP PSA HBA
0 0 0.00 1.00 72.731113 1.141668
1 1 3.63 544.59 391.427565 0.984864
2 2 2.11 383.40 437.458982 15.040385
%% Cell type:markdown id:latest-public tags:
%% Cell type:markdown id:european-tunisia tags:
To avoid having an extra column, you can specify which column to use as the index.
Distinct values in this column are recommended, so that each label identifies a single row, although pandas does not enforce it.
%% Cell type:code id:casual-buying tags:
%% Cell type:code id:crucial-flight tags:
``` python
data = pd.read_csv("data/data_for_plt.csv", sep="\t", index_col=0)
data.head()
```
%%%% Output: execute_result
MW AlogP PSA HBA
0 0.00 1.00 72.731113 1.141668
1 3.63 544.59 391.427565 0.984864
2 2.11 383.40 437.458982 15.040385
3 1.24 162.23 480.111263 11.401907
4 -1.37 361.37 448.864769 5.732690
%% Cell type:markdown id:commercial-system tags:
%% Cell type:markdown id:occasional-carnival tags:
By default, the first line is used as the header.<br />
You can specify the number of the row that holds the header with the `header` parameter,
or set this parameter to `None` if the table has no header.
%% Cell type:code id:golden-myrtle tags:
%% Cell type:code id:oriented-bleeding tags:
``` python
data = pd.read_csv("data/no_header.tsv", sep="\t", index_col=0, header=None)
data.head()
```
%%%% Output: execute_result
1 2 3 4
0
0 0.00 1.00 72.731113 1.141668
1 3.63 544.59 391.427565 0.984864
2 2.11 383.40 437.458982 15.040385
3 1.24 162.23 480.111263 11.401907
4 -1.37 361.37 448.864769 5.732690
%% Cell type:markdown id:thorough-worth tags:
%% Cell type:markdown id:reasonable-straight tags:
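The `read_csv` options used above (`sep`, `comment`, `index_col`, `header`) can be tried without any file on disk. The sketch below is an illustration only: the file content is made up and passed through `io.StringIO`, which `read_csv` accepts in place of a path.

```python
import io
import pandas as pd

# Made-up file content: two comment lines, a header, tab-separated fields.
text = (
    "# generated by a hypothetical tool\n"
    "# some comment\n"
    "id\tcond1\tcontrol\n"
    "s1\t1.5\t0.5\n"
    "s2\t2.5\t1.0\n"
)

# Comment lines are skipped and the first column becomes the index.
table = pd.read_csv(io.StringIO(text), sep="\t", comment="#", index_col=0)
print(table)

# Without a header line, header=None numbers the columns instead.
no_header = pd.read_csv(io.StringIO("s1\t1.5\ns2\t2.5\n"),
                        sep="\t", header=None)
print(no_header.columns.tolist())
```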
### Going back to np.array and list
%% Cell type:code id:competent-negative tags:
``` python
df.values
```
%%%% Output: execute_result
array([[1, 4],
[2, 5],
[3, 6]])
%% Cell type:code id:fantastic-monday tags:
``` python
df.values.tolist()
```
%%%% Output: execute_result
[[1, 4], [2, 5], [3, 6]]
%% Cell type:markdown id:formal-example tags:
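Besides the `values` attribute shown above, the pandas documentation recommends the `to_numpy()` method for converting to an array. The sketch below rebuilds a small DataFrame (named `df_small` here, a hypothetical name) and shows a few conversions back to plain Python or Numpy objects:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({"A": [1, 2, 3], "B": np.arange(4, 7)})

as_array = df_small.to_numpy()             # 2-D ndarray of the values
as_lists = as_array.tolist()               # nested lists, one per row
as_dict = df_small.to_dict(orient="list")  # one list per column
print(as_array, as_lists, as_dict, sep="\n")
```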
## Characterizing a DataFrame
Several DataFrame attributes and methods are provided to characterize your dataset. Here is a subset of the most commonly used ones.