**Pandas** is a library for manipulating data structures and performing data analysis and visualization. Pandas is built on top of **Numpy**, a widely used library for mathematical operations, particularly on arrays and matrices. Pandas supports the whole data analysis stack, from data cleaning and formatting through to analysis and visualization.
Pandas is particularly well suited to tabular data, which can be imported from different formats such as **csv**, **tsv** or even **xlsx**.
The two primary data structures in pandas are **Series** and **DataFrames**.
Pandas is designed to manipulate tabular data, whereas Numpy is designed to do computation on arrays. Here are the main differences:
**Numpy**
* handles one structure: the ndarray.
* An *array* can have 1, 2 or more dimensions.
* An *ndarray* handles homogeneous data: only one datatype per array.
* So numpy is mostly used to do math on arrays.
**Pandas**
* *Series* have 1 dimension, *DataFrames* have 2 dimensions.
* *Pandas* does **not** handle structures with more than 2 dimensions.
* But a *DataFrame* can contain heterogeneous data: each column can have a different datatype.
* *Pandas* is more powerful for querying and manipulating data.
So *Numpy* is mostly used to do math, *Pandas* to explore data structured in tables.
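
The contrast between homogeneous arrays and heterogeneous tables can be seen in a minimal sketch. The values and column names below are invented for illustration only:

```python
import numpy as np
import pandas as pd

# A Numpy array holds a single dtype: mixing ints and floats
# upcasts every element to float64.
arr = np.array([[1, 2.5], [3, 4.0]])
print(arr.dtype)  # float64

# A DataFrame keeps one dtype per column
# (hypothetical column names, not taken from the course data).
df = pd.DataFrame({"gene": ["a", "b"], "count": [10, 20], "score": [0.5, 1.5]})
print(df.dtypes)
```

Here the array is forced into one datatype, while the DataFrame keeps text, integer and float columns side by side.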
Series objects benefit from many attributes and methods (see [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)), many of them shared with pandas DataFrames. We will see some of the ones listed below in action in the DataFrame section of this course.
Here are some attributes of interest:
|Attribute|Action|
|-|-|
|index|Return the index (axis 0 labels) of the Series|
|name|Return the name of the Series|
|shape|Return a tuple of the shape of the data, e.g. `(3,)` for a Series of three elements|
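
As a quick illustration of these attributes, here is a small sketch on a hypothetical Series (its values, labels and name are assumptions made for the example):

```python
import pandas as pd

# Hypothetical Series built only for this illustration.
s = pd.Series([1, 2, 3], index=["A", "B", "C"], name="example")

print(s.index)  # the row labels A, B, C
print(s.name)   # example
print(s.shape)  # (3,)
```

Note that `shape` is a tuple, so the number of elements is `s.shape[0]` (or simply `len(s)`).
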
And some useful methods:
|Method|Action|
|-|-|
|aggregate|Aggregate using one or more operations over the specified axis|
|all|Return whether all elements are True potentially over an axis|
|any|Return whether any element is True potentially over an axis|
|apply|Invoke function on values of Series|
|astype|Cast a pandas object to a specified dtype|
|copy|Make a copy of this object’s indices and data|
|count|Return number of non-NA/null observations in the Series|
|describe|Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values|
|drop|Return Series with specified index labels removed|
|groupby|Group DataFrame or Series using a mapper or by a Series of columns|
|head / tail|Return the first / last n rows|
|max, min, median, mean, sum|Perform the corresponding operation on the Series|
|plot|Plot graphs from a Series/DataFrame|
|reset_index|Generate a new DataFrame or Series with the index reset|
|sort_values|Sort by the values|
|str|Vectorized string methods for Series|
|to_csv, to_excel|Export to csv or excel file|
|unique|Return unique values of Series object|
|value_counts|Return a Series containing counts of unique values|
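
A few of these methods can be tried on a small Series. This is only a sketch: the data below are made up for the example and are not part of the course dataset.

```python
import pandas as pd

# Made-up values for illustration.
s = pd.Series([3, 1, 3, 2], index=["A", "B", "C", "D"])

print(s.count())         # 4 non-null observations
print(s.max())           # 3
print(s.mean())          # 2.25
print(s.sort_values())   # values in ascending order
print(s.unique())        # [3 1 2]
print(s.value_counts())  # how many times each value occurs
print(s.astype(float))   # same values cast to float

# The .str accessor applies string methods element by element.
labels = pd.Series(["x1", "y2"])
print(labels.str.upper())
```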
%% Cell type:markdown id:arabic-affairs tags:
## Operations on Series
Comparison operators (i.e. `==`, `<`, `<=`, `>=`, `>`) can be used on Series as well as DataFrames for subsetting.
For example, to see which values are greater than one in our previous Series:
%% Cell type:code id:million-richards tags:
``` python
serie_label > 1
```
%%%% Output: execute_result
A    False
B     True
C     True
dtype: bool
%% Cell type:markdown id:unlike-monaco tags:
Since `loc` accepts a list or Series of booleans as input, we can then apply this Boolean Series as a mask to our Series.
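
A minimal sketch of Boolean masking with `loc` follows. The definition of `serie_label` is an assumption, reconstructed from the outputs displayed in this section (values 1, 2, 3 with labels A, B, C):

```python
import pandas as pd

# Assumed stand-in for the Series used in this section.
serie_label = pd.Series([1, 2, 3], index=["A", "B", "C"])

# The comparison returns a Boolean Series sharing the same labels.
mask = serie_label > 1

# loc keeps only the rows where the mask is True (here B and C).
print(serie_label.loc[mask])
```
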
Arithmetic operations between Series align the labels before operating. Adding the Series to its reversed copy shows this:
%% Cell type:code id:better-blame tags:
``` python
serie_label + serie_label.iloc[::-1]
```
%%%% Output: execute_result
A    2
B    4
C    6
dtype: int64
%% Cell type:markdown id:loved-orleans tags:
# DataFrames
A pandas DataFrame is a two-dimensional data structure with labeled axes (rows and columns). Labels do not need to be unique but must be hashable. A DataFrame behaves like a dictionary-like container of Series objects.
[DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) in pandas are rarely created from scratch. One common approach is to create a pandas DataFrame from a dictionary or a file, but you can also create one from a list of lists or a numpy ndarray.
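
The construction routes mentioned above can be sketched as follows. All data and column names are invented for the example, and the CSV content is read from an in-memory string rather than a real file:

```python
import io

import numpy as np
import pandas as pd

# From a dictionary: keys become column names, values become columns.
df_dict = pd.DataFrame({"name": ["x", "y"], "value": [1, 2]})

# From a list of lists: each inner list is a row, column names are given explicitly.
df_rows = pd.DataFrame([["x", 1], ["y", 2]], columns=["name", "value"])

# From a numpy ndarray: a 2D array maps directly onto rows and columns.
df_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# From a file: read_csv accepts a path or a file-like object
# (an in-memory buffer is used here so the example is self-contained).
df_csv = pd.read_csv(io.StringIO("name,value\nx,1\ny,2\n"))

# Selecting one column returns a Series, as with a dictionary lookup.
print(type(df_dict["value"]))
```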