# Snakemake workflows used in Germano Cecere's team

This repository contains Snakemake workflows for processing high-throughput
sequencing data, mainly from _C. elegans_.


## Installing

These workflows rely on external tools, and also depend on other repositories
included as git "submodules": these appear as subdirectories, but their content
is actually stored in separate repositories.

To fetch them, a few extra steps are needed after cloning this repository:

    # Clone this repository
    git clone git@gitlab.pasteur.fr:bli/bioinfo_utils.git
    # Enter it
    cd bioinfo_utils
    # Get the submodules
    # https://stackoverflow.com/a/55570998/1878788
    git submodule update --init --remote --merge

For your convenience, a `requirements.txt` file is provided to install, using
pip, the dependencies that exist as Python packages. Note, however, that these
are not the only dependencies.
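For instance, the corresponding command, to be run from the root of the cloned
repository, is sketched below (creating a virtual environment first is
optional but keeps these packages isolated from the system Python):

```shell
# Install the dependencies available as Python packages.
# (Optionally, first create and activate a virtual environment:
#  python3 -m venv .venv && . .venv/bin/activate)
install_cmd="pip3 install -r requirements.txt"
echo "${install_cmd}"
```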


### Singularity container

The `singularity` subdirectory contains recipes to build a Singularity
container in which the workflows are installed together with their dependencies
and wrappers. The container needs to be built on a Linux system, and requires
admin privileges (see `singularity/Makefile`).

It currently does not include genome and annotation files, but may still
provide a less painful experience than manually installing all the
workflows' dependencies.


## Running the workflows

### Directly using Snakemake


The workflows are implemented using
[Snakemake](https://snakemake.readthedocs.io). The workflow descriptions
consist of "snakefiles" (files ending in `.snakefile`) located in the `*-seq`
directories.

These files can be run using the `snakemake` command, specifying them with the
`--snakefile` command-line option. This command is provided by a Python package
installable using `pip3 install snakemake` (Python 3.6 or later is required, in
particular because the workflows make heavy use of Python
["f-strings"](https://www.python.org/dev/peps/pep-0498/)).

The workflows also need a configuration file in YAML format, which indicates,
among other things, where the raw data to be processed (in fastq format) is
located, which sample names to use, and where the genomic information can be
found.
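As an illustration only, such a file might contain entries along the following
lines; the key names below are made up for this sketch, and the real structure
should be taken from the example files linked hereafter:

```yaml
# Hypothetical sketch of a workflow configuration.
# Key names are illustrative and differ in the real example files.
raw_data_dir: "/path/to/fastq_files"
samples:
  - "WT_replicate_1"
  - "WT_replicate_2"
genome_dir: "/path/to/genome_preparation_output"
```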

Some example configuration files (possibly not all up to date) are available at
<https://gitlab.pasteur.fr/bli/10825_Cecere/-/tree/master/pipeline_configuration>

The configuration file should be specified using the `--configfile`
command-line option of `snakemake`.
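Putting the two options together, an invocation looks as follows (the
snakefile and configuration file names are hypothetical; the command line is
assembled into a variable here, as a sketch):

```shell
# Hypothetical file names, for illustration only.
snakefile="RNA-seq/RNA-seq.snakefile"
configfile="RNA-seq_config.yaml"

# Assemble the snakemake invocation; further options such as -n (dry run)
# or -j <N> (N parallel steps) can be appended.
run_cmd="snakemake --snakefile ${snakefile} --configfile ${configfile}"
echo "${run_cmd}"
```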


### Via shell wrappers

To facilitate the above process, a shell script `run_pipeline.sh` is provided,
together with symbolic links whose names correspond to the different
workflows. Depending on which symbolic link is used to call the script, the
appropriate "snakefile" is selected and passed to the `snakemake` command.

These wrapper scripts still require the configuration file, passed as the
first argument but without the `--configfile` option. Further command-line
options are forwarded directly to `snakemake`. The most important ones are
`-n`, which performs a dry run to check that `snakemake` can determine which
steps need to be run, and `-j`, which sets the number of steps that can run in
parallel (choose a value suitable for your system; the `htop` command, if you
can install it, may help you evaluate how busy the system currently is).
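For example, assuming a symbolic link named `RNA-seq` (a hypothetical name)
pointing to `run_pipeline.sh`, calls would look like the following sketch:

```shell
# Hypothetical symbolic link name. The configuration file comes first;
# the remaining options are forwarded to snakemake.
dry_run="./RNA-seq RNA-seq_config.yaml -n"     # check which steps would run
real_run="./RNA-seq RNA-seq_config.yaml -j 4"  # up to 4 steps in parallel
echo "${dry_run}"
echo "${real_run}"
```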

For more details, see the `*-seq/*.snakefile` workflow descriptions as well as the
`run_pipeline.sh` wrapper.


### Using the singularity container

The `Makefile` provided in the `singularity` directory builds and installs a
Singularity container that should contain all the software necessary to run the
workflows, as well as a shell wrapper and symbolic links that can be used in
the same way as described in the previous section.


### Genome preparation

A genome preparation workflow is available at
<https://gitlab.pasteur.fr/bli/10825_Cecere/-/tree/master/Genomes>

This workflow will download genomic data from Illumina's iGenomes repository as
well as data regarding repeated elements directly from UCSC.

It then performs some pre-processing steps to generate annotations in a format
suitable for the data analysis workflows, as well as a `.yaml` configuration
file that can be used for the `genome_dict` section of the data analysis
workflow configuration files.


### Configuration

The workflow configuration files include a section specifying where the results should be uploaded (using `rsync`).
If the pipeline fails, this upload still happens, into a result directory with an `_err` suffix added.
This upload-on-error behaviour can be disabled by adding `--config upload_on_err=False` to the command line.
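For instance, with a hypothetical wrapper link name, disabling the upload of
results after a failure looks like the following sketch:

```shell
# Hypothetical wrapper name; --config upload_on_err=False disables the
# rsync upload of the "_err"-suffixed result directory on pipeline failure.
cmd="./RNA-seq RNA-seq_config.yaml -j 4 --config upload_on_err=False"
echo "${cmd}"
```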


## Citing

If you use these tools, please cite the following papers:

> Barucci et al, 2020 (doi: [10.1038/s41556-020-0462-7](https://doi.org/10.1038/s41556-020-0462-7))