Author: Blaise Li

Snakemake workflows used in Germano Cecere's team

This repository contains Snakemake workflows for dealing with high-throughput sequencing data (mainly from C. elegans).

Installing

These workflows rely on external tools, and also depend on other repositories included as "submodules": they appear as subdirectories, but their content is actually stored in other repositories.

To get them, some additional steps are needed after cloning this repository:

# Clone this repository
git clone git@gitlab.pasteur.fr:bli/bioinfo_utils.git
# Enter it
cd bioinfo_utils
# Get the submodules
# https://stackoverflow.com/a/55570998/1878788
git submodule update --init --remote --merge

For your convenience, a requirements.txt file is provided to install, using pip, the dependencies that exist as Python packages. Note, however, that these are not the only dependencies.
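Assuming pip3 is available, installing these Python-level dependencies could look like this (run from the repository root, where requirements.txt is located):

```shell
# Install the Python dependencies listed in requirements.txt
# (Python >= 3.6 is needed, see "Running the workflows" below)
pip3 install -r requirements.txt
```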

Singularity container

The singularity subdirectory contains recipes to build a Singularity container in which the workflows are installed together with their dependencies and wrappers. The container needs to be built on a Linux system and requires admin privileges (see singularity/Makefile).

It currently does not include genome and annotation files, but may still provide a less painful experience than having to manually install all dependencies of the workflows.
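As a sketch only, building the container might look like the following; the actual targets and requirements are defined in singularity/Makefile, so the exact command is an assumption:

```shell
cd singularity
# Building a Singularity container requires a Linux system
# and admin privileges (hence sudo); see the Makefile for details.
sudo make
```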

Running the workflows

Directly using Snakemake

The workflows are implemented using Snakemake. The workflow descriptions consist of "snakefiles" ending in .snakefile, located in the *-seq directories.

These files can be run using the snakemake command, specifying them with the --snakefile command-line option. This command is provided by a Python package installable using pip3 install snakemake (Python 3.6 or later is required, in particular because the workflows make heavy use of Python "f-strings").

The workflows also need a configuration file in YAML format, which indicates, among other things, where the raw data to be processed (in fastq format) is located, which sample names to use, and where the genomic information can be found.
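The following sketch illustrates the general shape of such a file. The actual keys are defined by each workflow; all key names here except genome_dict (used with the genome preparation workflow, see below) are hypothetical:

```yaml
# Hypothetical configuration sketch; real key names are
# defined by each workflow's snakefile.
raw_data_dir: "/path/to/fastq_files"   # where the raw data is located
samples:                               # sample names to use
  - WT_rep1
  - WT_rep2
genome_dict:                           # genomic information
  # typically filled from the genome preparation workflow's output
  C_elegans: "/path/to/genome_config.yaml"
```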

Some example configuration files (possibly not all up to date) are available at https://gitlab.pasteur.fr/bli/10825_Cecere/-/tree/master/pipeline_configuration

The configuration file should be specified using the --configfile command-line option of snakemake.
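Putting these options together, a typical invocation might look like the following (the snakefile and configuration file names are placeholders; the real snakefiles live in the *-seq directories):

```shell
# Placeholder names: substitute an actual *-seq snakefile
# and your own YAML configuration file.
snakemake --snakefile example-seq/example.snakefile --configfile my_config.yaml
```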

Via shell wrappers

To facilitate the above process, a shell script run_pipeline.sh is provided, together with symbolic links with names corresponding to the different workflows. Depending on the symbolic link used to call the shell script, the appropriate "snakefile" will be selected and passed to the snakemake command.

These wrapper scripts still need the configuration file, provided as the first argument but without the --configfile option. Further command-line options are forwarded directly to snakemake. The most important are -n, to check that snakemake is able to determine which steps will have to be run (a "dry run"), and -j, to specify the number of steps that can be run in parallel. Choose a value suitable for your system; the htop command (install it if you can) may help you evaluate how busy your system currently is.
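For example (the symbolic link and configuration file names are placeholders; the actual names correspond to the symbolic links provided alongside run_pipeline.sh):

```shell
# Dry run: check which steps snakemake would execute
./some_workflow my_config.yaml -n
# Actual run, allowing up to 8 steps in parallel
./some_workflow my_config.yaml -j 8
```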

For more details, see the *-seq/*.snakefile workflow descriptions as well as the run_pipeline.sh wrapper.

Using the singularity container

The Makefile provided in the singularity directory builds and installs a Singularity container that should contain all the software necessary to run the workflows, as well as a shell wrapper and symbolic links that can be used in the same way as described in the previous section.

Genome preparation

A genome preparation workflow is available at https://gitlab.pasteur.fr/bli/genome_preparation

This workflow downloads genomic data from Illumina's iGenomes repository, as well as data about repeated elements directly from UCSC.

It then performs some pre-processing steps to generate annotations in a format suitable for the data analysis workflows, as well as a .yaml configuration file that can be used for the genome_dict section of the data analysis workflow configuration files.

Configuration

The workflow configuration files include a section specifying where the results should be uploaded (using rsync). Upon pipeline failure, the upload goes to a result directory with an _err suffix added. This upload-on-error behaviour can be disabled by adding --config upload_on_err=False to the command line.
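For example, using a wrapper script as described above (the link and configuration file names are placeholders):

```shell
# Run the workflow without uploading results to the _err
# directory if the pipeline fails.
./some_workflow my_config.yaml -j 8 --config upload_on_err=False
```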

Citing

If you use these tools, please cite the following papers:

Barucci et al., 2020 (doi: 10.1038/s41556-020-0462-7)