Generation of starting files for the creation of a CustomDB snakemake workflow
These files may need to be regenerated when a new version of the metaphlan3 DB is released.
- all_sequences_kraken_and_NCBIassemblies.fna
- all_correspondence_headers_kraken_and_metaphlan_taxonomy.txt (example below)
| header | species | genus | source | metaphlan3_species | metaphlan3_genus |
| --- | --- | --- | --- | --- | --- |
| >kraken:taxid\|1937004\|NZ_CP019470.1 Methanopyrus sp. KOL6 chromosome | | 2319 | kraken2 | | |
| >NCBI01000028.1 MAG: Chromatiales bacterium 21-64-14 04302015_21_scaffold_100, whole genome shotgun sequence | 1970504 | 99007 | GCA_002255365.1\* | 1970504 | 99007 |

\* NCBI accession ID
NB: an empty species/metaphlan3_species/metaphlan3_genus cell = species not in metaphlan3 but from a genus present in the metaphlan3 DB;
an empty genus taxid cell = no information found for the genus lato sensu (no txid of rank genus, and no "no rank"/clade txid below a filled-in family txid);
an empty header cell = species not found (kept in the correspondence file so that its genus can still be retrieved when building the CustomDB at the genus level).
- assembly_summary_refseq.txt and assembly_summary_genbank.txt (downloadable: here)
- metaphlan fasta and pickle file (mpa_vX_CHOCOPhlAn_YYYYMM.fna/pkl) (downloadable by following this link > biobakeryX > metaphlan_databases > tar file)
- kraken libraries and taxonomy:
  ```
  kraken2-build --download-library $lib --db $DBNAME
  kraken2-build --download-taxonomy --db $DBNAME
  ```
- output_dir: path to the directory that will contain all output of this snakemake workflow.
- metaphlan_fastafile and metaphlan_picklefile: path to metaphlan database files (see "Input").
- path_to_kraken_db: path to the kraken database, i.e. the directory that contains or will contain (after download, step 1 in "Summary of steps/rules of the Snakemake workflow") all the kraken library directories and the kraken taxonomy directory, and that will later host the test kraken database (step 8 in "Summary of steps/rules of the Snakemake workflow").
- entrez_email: email used to connect with Bio.Entrez.
- assembly_summary: list of paths to the different assembly_summary files to use when searching for NCBI assemblies (see "Input").
- libraries: list of the different kraken libraries (to download and) to filter through.
- in search_orgatxids (step 2 in "Summary of steps/rules of the Snakemake workflow"):
  - on_NCBI: set to True if you want to search NCBI for the markers' current organism taxids; the metaphlan taxids are used if set to False
  - resume: set to True if you use on_NCBI and you fear an interruption of the process
- in download_NCBIassemblies (step 7 in "Summary of steps/rules of the Snakemake workflow"):
  - quit_snakemake_if_one_assembly_not_downloaded: set to True if you want snakemake to stop when it encounters an NCBI assembly that couldn't be downloaded
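Put together, the config.yaml described above might look like the sketch below. The exact nesting of the per-rule options is an assumption, and all paths, list entries and the email address are placeholders:

```yaml
# Hypothetical config.yaml sketch; key names follow the descriptions above.
output_dir: /path/to/output
metaphlan_fastafile: /path/to/mpa_vX_CHOCOPhlAn_YYYYMM.fna
metaphlan_picklefile: /path/to/mpa_vX_CHOCOPhlAn_YYYYMM.pkl
path_to_kraken_db: /path/to/kraken_db
entrez_email: user@example.org
assembly_summary:
  - /path/to/assembly_summary_refseq.txt
  - /path/to/assembly_summary_genbank.txt
libraries:
  - archaea
  - bacteria
search_orgatxids:
  on_NCBI: True
  resume: True
download_NCBIassemblies:
  quit_snakemake_if_one_assembly_not_downloaded: False
```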
- download_kraken_lib and download_kraken_taxonomy download all the libraries listed in config.yaml and the taxonomy directory. These two rules are only run if the libraries and taxonomy are not already in path_to_kraken_db (in config.yaml), so you might have to check the version of the kraken2 libraries and taxonomy already present there.
- search_orgatxids tries to list all organism txids present in metaphlan3:
- Either use the txids filled in for each marker of the fasta file in the pickle file.
- Or search for the markers in NCBI with the Bio.Entrez python package: when a marker has an accession ID, search through the assembly_summary files and then NCBI with Bio.Entrez; if a marker isn't found in NCBI, fall back to the txid filled in the pickle file. To use that path, set on_NCBI under search_orgatxids in config.yaml to True. Since that path can take some time, the argument resume under search_orgatxids in config.yaml allows resuming where the script left off if it was inadvertently stopped before ending. ⚠️ That path needs internet access. Use it if the metaphlan3 DB release is old and you have doubts about the topicality of the txids associated with the markers.
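The resume behaviour amounts to an append-only cache: markers already written to the output file are skipped on restart. This is a minimal illustration of the pattern, not the workflow's actual code; `lookup_txid` stands in for the real Bio.Entrez query:

```python
import os

def search_txids(markers, out_path, lookup_txid):
    """Resolve a txid for each marker, skipping markers already
    recorded in out_path so an interrupted run can resume."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as fh:
            done = {line.split("\t")[0] for line in fh if line.strip()}
    with open(out_path, "a") as out:
        for marker in markers:
            if marker in done:
                continue  # already resolved in a previous run
            txid = lookup_txid(marker)  # placeholder for the NCBI lookup
            out.write(f"{marker}\t{txid}\n")
            out.flush()  # keep the cache usable after an abrupt stop
```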
- get_species_and_genus tries to get the species txid and genus txid for each entry in the list of organism txids (organism txids are species txids or txids at lower taxonomic levels).
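The idea behind this rule can be sketched as climbing a rank-annotated taxonomy until species and genus nodes are found. The toy taxonomy dict below stands in for the real NCBI/kraken taxonomy:

```python
def species_and_genus(org_txid, taxonomy):
    """Walk up from an organism txid (species or below) and return
    (species_txid, genus_txid); either may be None if the rank is absent.
    taxonomy maps txid -> (rank, parent_txid)."""
    species = genus = None
    txid = org_txid
    while txid is not None and txid in taxonomy:
        rank, parent = taxonomy[txid]
        if rank == "species":
            species = txid
        elif rank == "genus":
            genus = txid
        txid = parent
    return species, genus

# Toy taxonomy: strain 11 -> species 10 -> genus 5 -> family 2 (root)
toy = {
    11: ("strain", 10),
    10: ("species", 5),
    5: ("genus", 2),
    2: ("family", None),
}
```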
- filter_kraken_lib is launched for as many kraken libraries as you have set under libraries in config.yaml. This rule filters the kraken library to keep only the species txids whose genus txid is listed as to keep. It creates 4 output files: species_notfound_in_{lib}.txt, genus_notfound_in_{lib}.txt, sequences.fna (a filtered version of library.fna) and correspondence_headers_taxonomy.txt.
- missing_IDs_after_kraken recapitulates which species and genus txids are still missing after the filtering of the different kraken libraries.
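The filtering step can be illustrated as: scan each FASTA record, parse the txid out of the kraken:taxid header, and keep the record only if that txid is in the keep-set. A minimal sketch under those assumptions (the real rule also writes the *_notfound_* reports and the correspondence file):

```python
import re

def filter_library(fasta_lines, keep_species):
    """Yield only the FASTA records whose kraken:taxid|<txid>| header
    refers to a species txid present in keep_species."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            m = re.match(r">kraken:taxid\|(\d+)\|", line)
            keep = bool(m) and int(m.group(1)) in keep_species
        if keep:
            yield line
```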
- filter_and_order_asssemblies takes the assembly_summary files given under assembly_summary in config.yaml and filters out the assemblies tagged in the "excluded_from_refseq" column (column 21), except for some tags ("genus undefined", "derived from environmental source", "derived from metagenome", "derived from single cell" and "sequence duplications") that are authorized. It then orders the remaining assemblies by looking first at the RefSeq category (column 5: "refseq_category"), favoring "reference genome" over "representative genome" and then "na"; then at the assembly level (column 12: "assembly_level"), preferring "Complete Genome", then "Chromosome", "Scaffold" and lastly "Contig"; and ultimately at the sequence release date (column 15: "seq_rel_date"), favoring the most recent assembly.
- search_NCBIassemblies, download_NCBIassemblies and creation_NCBI_multifasta_and_correspondencefile search for the still-missing species in the filtered and ordered super-assembly_summary file. For each genus that still has no representative, they look back through the super-assembly_summary file for one. They then download the retained NCBI assemblies, concatenate them into a multifasta file and create the correspondence file.
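The three-criteria ordering maps naturally onto a Python sort: rank by refseq_category, then assembly_level, and break ties with the most recent seq_rel_date. A sketch over rows already parsed from an assembly_summary file (dict keys mirror the NCBI column names; this is an illustration, not the workflow's code):

```python
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1, "na": 2}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}

def order_assemblies(rows):
    """rows: dicts with 'refseq_category', 'assembly_level' and
    'seq_rel_date' (YYYY/MM/DD, so string order equals date order)."""
    # Newest first, then a stable sort on (category, level) keeps that
    # newest-first order within equal category/level pairs.
    rows = sorted(rows, key=lambda r: r["seq_rel_date"], reverse=True)
    return sorted(rows, key=lambda r: (
        CATEGORY_RANK.get(r["refseq_category"], len(CATEGORY_RANK)),
        LEVEL_RANK.get(r["assembly_level"], len(LEVEL_RANK)),
    ))
```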
- add_NCBIassemblies_to_lib and check_db_building check that the NCBI assemblies can be used to build a customized kraken database. An unmapped.txt file may have been generated inside the test Kraken DB; in that case, the user needs to follow the 4th step of "To run Snakemake".
- concatenate_preprocess_files and filter_duplicates_and_only_x concatenate the n different multifasta files into one file, all_sequences_kraken_and_NCBIassemblies.fna (n = number of kraken libraries reviewed + 1 for the NCBI assemblies), as well as the n different correspondence files into all_correspondence_headers_taxonomy.txt. They also remove from both files the sequences that are duplicated and the sequences composed only of "x" (no nucleotidic information), as observed in kraken libraries.
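The duplicate and "x-only" filtering can be sketched per record: drop a sequence if it was already seen or if it contains no real nucleotide. A minimal illustration, assuming records have already been parsed into (header, sequence) pairs:

```python
def clean_records(records):
    """records: iterable of (header, sequence) pairs. Drop exact duplicate
    sequences and sequences made only of 'x'/'X' (no nucleotidic information)."""
    seen = set()
    for header, seq in records:
        if not seq.strip("xX"):      # only-x (or empty) sequence
            continue
        if seq in seen:              # duplicate of a sequence already kept
            continue
        seen.add(seq)
        yield header, seq
```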
- add_metaphlan_taxonomy adds the metaphlan taxid information to the super-correspondence file, because there might be differences between metaphlan taxids (used in metaphlan3 results) and kraken taxids (used in the correspondence file, more up to date). The species that weren't found are added under an empty header, so that species sharing their genus can still be searched.
- Update config.yaml with path to new metaphlan fasta and picklefile, output_dir etc.
- Create or load an environment that contains snakemake, mafft and RAxML
  ```
  module load Python/3.8.1
  python3 -m venv .snakemake_venv
  source .snakemake_venv/bin/activate
  module load snakemake blast+ kraken
  pip install biopython
  ```
- Run snakemake (through slurm or not) somewhere with internet access ⚠️
  ```
  snakemake -p -j 40 -s /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/prerequisite_files/snakemake/Snakefile -R all
  ```
- Check whether an unmapped.txt file has been generated inside the test Kraken DB. That would mean that some of the NCBI assembly accessions aren't registered in the two *.accession2taxid files. If that's the case, we need to generate a file that links the accessions to the txids of the species:
  ```
  echo -e "accession\taccession.version\ttaxid\tgi" > taxonomy/unmappedacc.accession2taxid
  grep -f unmapped.txt /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/prerequisite_files/output/assembly_fna.gz/correspondence_headers_taxonomy.txt | sed 's/^>\([^\.]*\)\.\([0-9]\)[^\t]*\t\([0-9]*\)\t.*$/\1\t\1.\2\t\3\t/' >> taxonomy/unmappedacc.accession2taxid
  ```
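The sed substitution above turns a correspondence line starting with `>ACC.V …` and carrying the species txid in the second tab-separated field into an accession2taxid row (`ACC`, `ACC.V`, `taxid`, empty `gi`). The same transformation in Python, for readability; it mirrors the regex and does not replace the grep step:

```python
import re

# Same pattern as the sed command: accession, single-digit version, txid.
LINE_RE = re.compile(r"^>([^.]*)\.(\d)[^\t]*\t(\d+)\t")

def to_accession2taxid(corr_line):
    """'>ACC.V rest\\ttxid\\t...' -> 'ACC\\tACC.V\\ttxid\\t' (gi left empty);
    returns None if the line doesn't match."""
    m = LINE_RE.match(corr_line)
    if not m:
        return None
    acc, version, txid = m.groups()
    return f"{acc}\t{acc}.{version}\t{txid}\t"
```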
- Try to build the test Kraken DB again (just to be sure):
  ```
  kraken2-build --build --db {path_to_test_krakenDB} --threads {threads}
  ```
NEXT: #37