Generation of starting files for the creation of a CustomDB snakemake workflow
These files may need to be regenerated when a new version of the metaphlan3 DB is released.
- all_sequences_kraken_and_NCBIassemblies.fna
- all_correspondence_headers_kraken_and_metaphlan_taxonomy.txt (example below)
| header | species | genus | source | metaphlan3_species | metaphlan3_genus |
| --- | --- | --- | --- | --- | --- |
| >kraken:taxid\|1937004\|NZ_CP019470.1 Methanopyrus sp. KOL6 chromosome | | 2319 | kraken2 | | |
| >NCBI01000028.1 MAG: Chromatiales bacterium 21-64-14 04302015_21_scaffold_100, whole genome shotgun sequence | 1970504 | 99007 | GCA_002255365.1\* | 1970504 | 99007 |

\* NCBI accession ID
NB: an empty species/metaphlan3_species/metaphlan3_genus cell = species not in metaphlan3 but from a genus present in the metaphlan3 DB;
an empty genus taxid cell = no information found for the genus lato sensu (no txid of rank genus, and no "no rank"/clade txid below a filled-in family txid);
an empty header cell = species not found (kept in the correspondence file so that its genus can still be retrieved when building the CustomDB at the genus level).
- assembly_summary_refseq.txt and assembly_summary_genbank.txt (downloadable: here)
- metaphlan fasta and pickle file (mpa_vX_CHOCOPhlAn_YYYYMM.fna/pkl) (downloadable by following this link > biobakeryX > metaphlan_databases > tar file)
- kraken libraries and taxonomy:
  ```
  kraken2-build --download-library $lib --db $DBNAME
  kraken2-build --download-taxonomy --db $DBNAME
  ```
- output_dir: path to the directory that will contain all output of this snakemake workflow.
- metaphlan_fastafile and metaphlan_picklefile: path to metaphlan database files (see "Input").
- path_to_kraken_db: path to the kraken database, i.e. the directory that contains or will contain (after download, step 1 in "Summary of steps/rules of the Snakemake workflow") all the kraken library directories and the kraken taxonomy directory, and that will later host the test kraken database (step 8 in "Summary of steps/rules of the Snakemake workflow").
- entrez_email: email used to connect with Bio.Entrez.
- assembly_summary: list of paths to the different assembly_summary files to use when searching for NCBI assemblies (see "Input").
- libraries: list of the different kraken libraries (to download and) to filter through.
- in search_orgatxids (step 2 in "Summary of steps/rules of the Snakemake workflow"):
  - on_NCBI: set to True if you want to search NCBI for the markers' current organism taxids; the metaphlan taxids are used if set to False
  - resume: set to True if you use on_NCBI and you fear an interruption of the process
- in download_NCBIassemblies (step 7 in "Summary of steps/rules of the Snakemake workflow"):
  - quit_snakemake_if_one_assembly_not_downloaded: set to True if you want snakemake to stop when it encounters an NCBI assembly that couldn't be downloaded
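Put together, the config.yaml described above might look like the sketch below. The exact nesting of the per-rule options is an assumption, and all paths, list entries and the email address are placeholders:

```yaml
# Hypothetical config.yaml sketch; key names follow the descriptions above.
output_dir: /path/to/output
metaphlan_fastafile: /path/to/mpa_vX_CHOCOPhlAn_YYYYMM.fna
metaphlan_picklefile: /path/to/mpa_vX_CHOCOPhlAn_YYYYMM.pkl
path_to_kraken_db: /path/to/kraken_db
entrez_email: user@example.org
assembly_summary:
  - /path/to/assembly_summary_refseq.txt
  - /path/to/assembly_summary_genbank.txt
libraries:
  - archaea
  - bacteria
search_orgatxids:
  on_NCBI: True
  resume: True
download_NCBIassemblies:
  quit_snakemake_if_one_assembly_not_downloaded: False
```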
- download_kraken_lib and download_kraken_taxonomy download all the libraries listed in config.yaml and the taxonomy directory. These two rules are only run if the libraries and taxonomy are not already in path_to_kraken_db (in config.yaml), so you might have to check the version of the kraken2 libraries and taxonomy already present there.
- search_orgatxids tries to list all organism txids present in metaphlan3:
- Either use the txids filled in for each marker of the fasta file in the pickle file.
- Or search for the markers in NCBI with the Bio.Entrez python package: when a marker has an accession ID, search through the assembly_summary files and then NCBI with Bio.Entrez; if a marker isn't found in NCBI, fall back to the txid filled in the pickle file. To use that path, set on_NCBI under search_orgatxids in config.yaml to True. Since that path can take some time, the argument resume under search_orgatxids in config.yaml allows resuming where the script left off if it was inadvertently stopped before ending. ⚠️ That path needs internet access. Use it if the metaphlan3 DB release is old and you have doubts about the topicality of the txids associated with the markers.
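The resume behaviour amounts to an append-only cache: markers already written to the output file are skipped on restart. This is a minimal illustration of the pattern, not the workflow's actual code; `lookup_txid` stands in for the real Bio.Entrez query:

```python
import os

def search_txids(markers, out_path, lookup_txid):
    """Resolve a txid for each marker, skipping markers already
    recorded in out_path so an interrupted run can resume."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as fh:
            done = {line.split("\t")[0] for line in fh if line.strip()}
    with open(out_path, "a") as out:
        for marker in markers:
            if marker in done:
                continue  # already resolved in a previous run
            txid = lookup_txid(marker)  # placeholder for the NCBI lookup
            out.write(f"{marker}\t{txid}\n")
            out.flush()  # keep the cache usable after an abrupt stop
```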
- get_species_and_genus tries to get the species txid and genus txid for each entry in the list of organism txids (organism txids are species txids or txids at lower taxonomic levels).
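The idea behind this rule can be sketched as climbing a rank-annotated taxonomy until species and genus nodes are found. The toy taxonomy dict below stands in for the real NCBI/kraken taxonomy:

```python
def species_and_genus(org_txid, taxonomy):
    """Walk up from an organism txid (species or below) and return
    (species_txid, genus_txid); either may be None if the rank is absent.
    taxonomy maps txid -> (rank, parent_txid)."""
    species = genus = None
    txid = org_txid
    while txid is not None and txid in taxonomy:
        rank, parent = taxonomy[txid]
        if rank == "species":
            species = txid
        elif rank == "genus":
            genus = txid
        txid = parent
    return species, genus

# Toy taxonomy: strain 11 -> species 10 -> genus 5 -> family 2 (root)
toy = {
    11: ("strain", 10),
    10: ("species", 5),
    5: ("genus", 2),
    2: ("family", None),
}
```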
- filter_kraken_lib is launched for as many kraken libraries as you have set under libraries in config.yaml. This rule filters the kraken library to keep only the species txids whose genus txid is listed as to keep. It creates 4 output files: species_notfound_in_{lib}.txt, genus_notfound_in_{lib}.txt, sequences.fna (a filtered version of library.fna) and correspondence_headers_taxonomy.txt.
- missing_IDs_after_kraken recapitulates which species and genus txids are still missing after the filtering of the different kraken libraries.
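The filtering step can be illustrated as: scan each FASTA record, parse the txid out of the kraken:taxid header, and keep the record only if that txid is in the keep-set. A minimal sketch under those assumptions (the real rule also writes the *_notfound_* reports and the correspondence file):

```python
import re

def filter_library(fasta_lines, keep_species):
    """Yield only the FASTA records whose kraken:taxid|<txid>| header
    refers to a species txid present in keep_species."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            m = re.match(r">kraken:taxid\|(\d+)\|", line)
            keep = bool(m) and int(m.group(1)) in keep_species
        if keep:
            yield line
```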
- filter_and_order_asssemblies takes the assembly_summary files given under assembly_summary in config.yaml and filters out the assemblies tagged in the "excluded_from_refseq" column (column 21), except for some tags ("genus undefined", "derived from environmental source", "derived from metagenome", "derived from single cell" and "sequence duplications") that are authorized. It then orders the remaining assemblies by looking first at the RefSeq category (column 5: "refseq_category"), favoring "reference genome" over "representative genome" and then "na"; then at the assembly level (column 12: "assembly_level"), preferring "Complete Genome", then "Chromosome", "Scaffold" and lastly "Contig"; and ultimately at the sequence release date (column 15: "seq_rel_date"), favoring the most recent assembly.
- search_NCBIassemblies, download_NCBIassemblies and creation_NCBI_multifasta_and_correspondencefile search for the still-missing species in the filtered and ordered super-assembly_summary file. For each genus that still has no representative, they look back through the super-assembly_summary file for one. They then download the retained NCBI assemblies, concatenate them into a multifasta file and create the correspondence file.
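The three-criteria ordering maps naturally onto a Python sort: rank by refseq_category, then assembly_level, and break ties with the most recent seq_rel_date. A sketch over rows already parsed from an assembly_summary file (dict keys mirror the NCBI column names; this is an illustration, not the workflow's code):

```python
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1, "na": 2}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}

def order_assemblies(rows):
    """rows: dicts with 'refseq_category', 'assembly_level' and
    'seq_rel_date' (YYYY/MM/DD, so string order equals date order)."""
    # Newest first, then a stable sort on (category, level) keeps that
    # newest-first order within equal category/level pairs.
    rows = sorted(rows, key=lambda r: r["seq_rel_date"], reverse=True)
    return sorted(rows, key=lambda r: (
        CATEGORY_RANK.get(r["refseq_category"], len(CATEGORY_RANK)),
        LEVEL_RANK.get(r["assembly_level"], len(LEVEL_RANK)),
    ))
```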
- add_NCBIassemblies_to_lib and check_db_building check that the NCBI assemblies can be used to build a customized kraken database. An unmapped.txt file may have been generated inside the test Kraken DB; in that case, the user needs to follow the 4th step of "To run Snakemake".
- concatenate_preprocess_files and filter_duplicates_and_only_x concatenate the n different multifasta files into one file, all_sequences_kraken_and_NCBIassemblies.fna (n = number of kraken libraries reviewed + 1 for the NCBI assemblies), as well as the n different correspondence files into all_correspondence_headers_taxonomy.txt. They also remove from both files the sequences that are duplicated and the sequences composed only of "x" (no nucleotidic information), as observed in kraken libraries.
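The duplicate and "x-only" filtering can be sketched per record: drop a sequence if it was already seen or if it contains no real nucleotide. A minimal illustration, assuming records have already been parsed into (header, sequence) pairs:

```python
def clean_records(records):
    """records: iterable of (header, sequence) pairs. Drop exact duplicate
    sequences and sequences made only of 'x'/'X' (no nucleotidic information)."""
    seen = set()
    for header, seq in records:
        if not seq.strip("xX"):      # only-x (or empty) sequence
            continue
        if seq in seen:              # duplicate of a sequence already kept
            continue
        seen.add(seq)
        yield header, seq
```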
- add_metaphlan_taxonomy adds the metaphlan taxid information to the super-correspondence file, because there might be differences between metaphlan taxids (used in metaphlan3 results) and kraken taxids (used in the correspondence file, more up to date). The species that weren't found are added under an empty header, so that species sharing their genus can still be searched.
- Update config.yaml with path to new metaphlan fasta and picklefile, output_dir etc.
- Create or load an environment that contains snakemake, mafft and RAxML
  ```
  module load Python/3.8.1
  python3 -m venv .snakemake_venv
  source .snakemake_venv/bin/activate
  module load snakemake blast+ kraken
  pip install biopython
  ```
- Run snakemake (through slurm or not) somewhere with internet access ⚠️
  ```
  snakemake -p -j 40 -s /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/prerequisite_files/snakemake/Snakefile -R all
  ```
- Check whether an unmapped.txt file has been generated inside the test Kraken DB. That would mean that some of the NCBI assembly accessions aren't registered in the two *.accession2taxid files. If that's the case, we need to generate a file that links the accessions to the txids of the species:
  ```
  echo -e "accession\taccession.version\ttaxid\tgi" > taxonomy/unmappedacc.accession2taxid
  grep -f unmapped.txt /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/prerequisite_files/output/assembly_fna.gz/correspondence_headers_taxonomy.txt | sed 's/^>\([^\.]*\)\.\([0-9]\)[^\t]*\t\([0-9]*\)\t.*$/\1\t\1.\2\t\3\t/' >> taxonomy/unmappedacc.accession2taxid
  ```
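The sed substitution above turns a correspondence line starting with `>ACC.V …` and carrying the species txid in the second tab-separated field into an accession2taxid row (`ACC`, `ACC.V`, `taxid`, empty `gi`). The same transformation in Python, for readability; it mirrors the regex and does not replace the grep step:

```python
import re

# Same pattern as the sed command: accession, single-digit version, txid.
LINE_RE = re.compile(r"^>([^.]*)\.(\d)[^\t]*\t(\d+)\t")

def to_accession2taxid(corr_line):
    """'>ACC.V rest\\ttxid\\t...' -> 'ACC\\tACC.V\\ttxid\\t' (gi left empty);
    returns None if the line doesn't match."""
    m = LINE_RE.match(corr_line)
    if not m:
        return None
    acc, version, txid = m.groups()
    return f"{acc}\t{acc}.{version}\t{txid}\t"
```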
- Try to build the test Kraken DB again (just to be sure):
  ```
  kraken2-build --build --db {path_to_test_krakenDB} --threads {threads}
  ```
NEXT: #37