Custom DB snakemake

Snakemake workflow to run to create a CustomDB

▶ Input:

all_sequences_kraken_and_NCBIassemblies.fna
all_correspondence_headers_kraken_and_metaphlan_taxonomy.txt
taxonomy directory
→ in maestro (recommended): ln -s /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/prerequisite_files/input/kraken_db/taxonomy .
→ OR downloadable with: kraken2-build --download-taxonomy --db $DBNAME
python environment with pandas
Snakemake files: Snakefile and config.yaml, at /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/snakemake

▶ Config file parameters:

in snakefiles,
metaphlan3_paired: path to Snakefile of the metaphlan3 paired snakemake workflow
metaphlan3_single: path to Snakefile of the metaphlan3 single snakemake workflow
metaphlan2_merge: path to Snakefile of the merging of metaphlan results snakemake workflow
samples: list of sample names. Do not incorporate extension and pair prefix.
example: SAMPLE01_read1.fastq.gz and SAMPLE01_read2.fastq.gz should be listed as SAMPLE01
input_dir: path to the directory containing the samples
output_dir: path to the directory that will contain all output of this snakemake workflow.
output_database: path to directory where to create the Custom Database (must contain kraken2 taxonomy directory)
multifastafile: path to the pre-requisite multifasta file, all_sequences_kraken_and_NCBIassemblies.fna
correspondencefile: path to the pre-requisite correspondence file, all_correspondence_headers_kraken_and_metaphlan_taxonomy.txt
in metaphlan3 (step 1 in "Summary of steps/rules of the Snakemake workflow"),
threads: number of threads to use when calling metaphlan
input_type: {fastq,fasta,bowtie2out,sam} set whether the input is the FASTA file of metagenomic reads or the SAM file of the mapping of the reads against the MetaPhlAn db.
ext: extension of samples
example: for SAMPLE01_read1.fastq.gz and SAMPLE01_read2.fastq.gz, ext should be set as "fastq.gz"
paired: set to True if samples are paired like in the example above
pair_prefix: prefix before number in the sample file name
example: for SAMPLE01_read1.fastq.gz and SAMPLE01_read2.fastq.gz, pair_prefix should be set as "read"
options: All other possible options declared as you would when calling metaphlan. Check Metaphlan documentation (metaphlan -h)
in metaphlan2_merge (step 1 in "Summary of steps/rules of the Snakemake workflow"),
threads: number of threads to use for the merging of metaphlan results

▶ Steps/rules of the Snakemake workflow:

metaphlan3 and metaphlan2_merge run metaphlan3 on the samples to produce a file called merged_taxonomic_profiles.txt, with for each samples, the percentage of presence for each species detected. Only species/genus/... present in at least one samples are listed/appear in the file.
list_txids makes a list of all txids to keep. In the configuration file, an argument level allows to choose at which level between genus and species to operate. To operate at genus level means keeping all sequences from species belonging to a genus present in the samples.
filter_headers and filter_sequences filter allsequences_kraken2_and_NCBI_assemblies.fna to keep only sequences from txids from the list
add_to_library and build_database build custom database

▶ To run Snakemake:

Update config.yaml with your samples, your input_dir, your output_database etc.
Make sure that the kraken2 taxonomy directory is in or has a symbolic link (recommended) in the working directory
Create or load an environment that contains snakemake, kraken2 and metaphlan3 as well as python with pandas library.\

1st option: module load Python
python3 -m venv .snakemake_venv
source .snakemake_venv/bin/activate
module load snakemake muscle/3.8.1551 blast+/2.10.1 samtools/1.10 RAxML/8.2.12 bowtie2/2.3.5.1 fasta ruby diamond/2.0.6 trimal/1.4.1 mafft/7.467 FastTree/2.1.11 IQ-TREE/1.6.12 Mash/2.2 phylophlan/3.0.2 R/4.0.2 graphlan/1.1.3 metaphlan blast+ kraken

2nd option: mamba env create -f environment.yml -n moonbeam
conda activate moonbeam
module load muscle/3.8.1551 blast+/2.10.1 samtools/1.10 RAxML/8.2.12 bowtie2/2.3.5.1 fasta ruby diamond/2.0.6 trimal/1.4.1 mafft/7.467 FastTree/2.1.11 IQ-TREE/1.6.12 Mash/2.2 phylophlan/3.0.2 R/4.0.2 graphlan/1.1.3 metaphlan

Run snakemake (through slurm or not)

sbatch -p atm -c 1 snakemake -p -j 24 -s Snakefile -R all --cluster "sbatch -p atm -c {threads}"

If an unmapped.txt is generated when trying to build the CustomDB, then the accession needs to be added to the seqid2taxid.map file:

grep -f unmapped.txt all_correspondence_headers_kraken_and_metaphlan_taxonomy.txt | sed 's/^>$[^\.]*$\.[^\t]*\t$[0-9]*$\t.*$/\1\t\2/' >> seqid2taxid.map

and then rebuild the database:

sbatch -p atm -c 32 --mem=20G kraken2-build --build --db $DBNAME --threads 32

Edited Mar 09, 2023 by Agnes BAUD