# Snakemake workflows used in Germano Cecere's team
This repository contains Snakemake workflows for processing high-throughput sequencing data, mainly from *C. elegans*.
## Installing
These workflows rely on external tools, and also depend on other repositories included as "submodules": they appear as subdirectories, but their content is actually stored in separate repositories.
To get them, a few extra commands are needed after cloning this repository:
```sh
# Clone this repository
git clone git@gitlab.pasteur.fr:bli/bioinfo_utils.git
# Enter it
cd bioinfo_utils
# Get the submodules
# https://stackoverflow.com/a/55570998/1878788
git submodule update --init --remote --merge
```
For convenience, a `requirements.txt` file is provided to install, using `pip`, the dependencies that exist as Python packages. However, these are not the only dependencies.
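For example, a minimal sketch of this installation step (you may prefer to run it inside a virtual environment):

```sh
# Install the dependencies available as Python packages
pip3 install -r requirements.txt
```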
Singularity container
The singularity
subdirectory contains recipes to build a singularity
container where the workflows are installed together with their dependencies
and wrappers. The container needs to be built on a Linux system, and requires
admin privileges (See singularity/Makefile
).
The container currently does not include genome and annotation files, but it may still be less painful than installing all the workflows' dependencies manually.
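The exact build steps are defined in `singularity/Makefile`; the following is only a hypothetical invocation, assuming the default `make` target builds the image:

```sh
# Hypothetical build invocation; see singularity/Makefile for the
# actual targets. Requires a Linux system and admin privileges.
cd singularity
sudo make
```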
## Running the workflows
### Directly using Snakemake
The workflows are implemented using Snakemake. The workflow descriptions consist of "snakefiles" ending in `.snakefile`, located in the `*-seq` directories.
These files can be run using the `snakemake` command, specifying them with the `--snakefile` command-line option. This command is provided by a Python package installable using `pip3 install snakemake` (Python version at least 3.6 is necessary, in particular due to heavy usage of Python "f-strings" in the workflows).
The workflows also need a configuration file in YAML format, which indicates, among other things, where the raw data to be processed (in fastq format) is located, what sample names to use, and where the genomic information is located.
Some example configuration files (possibly not all up-to-date) are available at https://gitlab.pasteur.fr/bli/10825_Cecere/-/tree/master/pipeline_configuration
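As an illustration only, a configuration file might look like the sketch below; all key names except `genome_dict` are hypothetical, so refer to the example files linked above for the actual schema:

```yaml
# Hypothetical configuration sketch; key names other than genome_dict
# are illustrative, not the actual schema.
raw_data_dir: /path/to/fastq_files  # where the raw fastq data is located
samples:                            # sample names to use
  - WT_replicate_1
  - mutant_replicate_1
genome_dict:                        # genomic information (see "Genome preparation" below)
  C_elegans: /path/to/genome_config.yaml
```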
The configuration file should be specified using the `--configfile` command-line option of `snakemake`.
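Putting this together, a direct invocation could look like the following sketch (the snakefile path is hypothetical; use one of the actual `*-seq` snakefiles):

```sh
# Dry run (-n) of a workflow, allowing up to 4 parallel steps (-j 4)
snakemake --snakefile RNA-seq/RNA-seq.snakefile \
    --configfile my_config.yaml -j 4 -n
```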
### Via shell wrappers
To facilitate the above process, a shell script `run_pipeline.sh` is provided, together with symbolic links whose names correspond to the different workflows. Depending on the symbolic link used to call the shell script, the appropriate "snakefile" will be selected and passed to the `snakemake` command.
These wrapper scripts, however, still need the configuration file to be provided as the first argument, but without the `--configfile` option. Further command-line options will be forwarded directly to `snakemake`. Among the most important are `-n`, to just test whether `snakemake` is able to determine which steps will have to be run, and `-j`, to specify the number of steps that can be run in parallel (choose a value suitable for your system; the `htop` command, if you can install it, may help you evaluate how busy your system currently is).
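For instance, assuming a symbolic link named after one of the workflows (the link name below is hypothetical), a typical session could be:

```sh
# First check what would be run, then execute with up to 8 parallel steps
./RNA-seq.sh my_config.yaml -n
./RNA-seq.sh my_config.yaml -j 8
```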
For more details, see the `*-seq/*.snakefile` workflow descriptions as well as the `run_pipeline.sh` wrapper.
### Using the singularity container
The `Makefile` provided in the `singularity` directory builds and installs a singularity container that should contain all the software necessary to run the workflows, as well as a shell wrapper and symbolic links that can be used in the same way as described in the previous section.
## Genome preparation
A genome preparation workflow is available at https://gitlab.pasteur.fr/bli/genome_preparation. This workflow downloads genomic data from Illumina's iGenomes repository, as well as data about repeated elements directly from UCSC.
It then performs some pre-processing steps to generate annotations in a format suitable for the data analysis workflows, as well as a `.yaml` configuration file that can be used for the `genome_dict` section of the data analysis workflow configuration files.
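The genome preparation workflow can be obtained by cloning its repository; see that repository for how to run it:

```sh
git clone https://gitlab.pasteur.fr/bli/genome_preparation.git
```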
## Configuration
The workflow configuration files include a section specifying where the results should be uploaded (using `rsync`). Upon pipeline failure, this upload will still happen, using a result directory with an `_err` suffix added.
This upload upon error can be deactivated by adding `--config upload_on_err=False` to the command line.
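For example (using the same hypothetical wrapper link name as above):

```sh
# Run the workflow without uploading results upon pipeline failure
./RNA-seq.sh my_config.yaml -j 8 --config upload_on_err=False
```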
## Citing
If you use these tools, please cite the following papers:

* Barucci et al., 2020 (doi: 10.1038/s41556-020-0462-7)