## Phylogenetic Tree based on CustomDB snakemake workflow (metagenomics/snakemake#37)

To update at the same time as the prerequisite files (#35), e.g. for a new release of the metaphlan3 DB.

▶ <ins>Output:</ins>
- RAxML phylogenetic tree, in newick format, that contains all the Bacteria species included in the prerequisite multifasta file
<ins>NB:</ins> The generated tree's leaf labels follow the format **'{species name}, {txid}'**, or **'{species name}, {txid}*'** for species added by the add_missing_species rule. <br>
➔ Leaf labels need to be changed for the tree to be used in diversity analyses<br>
> Example of one way to do it (the import path for `SunbeamKraken2Parser` is an assumption and may differ with your moonstone version):<br>
`from moonstone.utils.phylogenetic_tree_editing import adapt_phylogenetic_tree_to_counts_df`<br>
`from moonstone.parsers.counts.taxonomy import SunbeamKraken2Parser`<br>
`parser = SunbeamKraken2Parser(krakenfile)`<br>
`kraken2_counts = parser.dataframe`<br>
`adapt_phylogenetic_tree_to_counts_df(kraken2_counts["NCBI_taxonomy_ID"], tree_file, output_tree_file)`
<br>
▶ <ins>Input:</ins>
- testDB constructed by the rule check_db_building in the Snakefile run to generate prerequisite files (#35)
- SILVA fasta dataset
> If you start with the snakemake at the first rule (associate_txids_to_headers), you'll need to get the FASTA file as follows:<br>
1) "File" > "Export" > "Export to external format"<br>
2) In "ARB EXPORT" window : "Export" -> "all"; "Compress" -> "all gaps"; "Select a format" -> "fasta_NCBI_sequin.eft"<br>
3) "Go"
- Snakemake files: Snakefile and config.yaml, at /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/phylogenetic_tree/snakemake/
▶ <ins>Config file parameters:</ins>
- **custom_db_path**: path to a database created with the whole multifasta file
- **silva_fasta_dataset**: path to the SILVA fasta dataset (see "Input")
- **entrez_email**: email used to connect with Bio.Entrez.
- **superkingdoms**: for now only works with "bacteria"
- in **associate_txids_to_headers** (step 1 in "Summary of steps/rules of the Snakemake workflow"),<br>
**user_choice**: set to True only if you aren't using slurm to launch the snakemake workflow and you want to review tricky associations between organism name and taxid.<br>
**verbosity**: 0 ≤ int ≤ 3. Integer to define the verbosity level: 0 for no message; 1 for error messages; 2 for warning and error messages; 3 for info, warning and error messages
- in **search_16S** (step 2 in "Summary of steps/rules of the Snakemake workflow"),<br>
**verbosity**: 0 ≤ int ≤ 3. Integer to define the verbosity level: 0 for no message; 1 for error messages; 2 for warning and error messages; 3 for info, warning and error messages
- in **download_NCBI_16Sseq** (step 2 in "Summary of steps/rules of the Snakemake workflow"),<br>
**quit_snakemake_if_one_16S_seq_not_downloaded**: set to True if you want snakemake to stop its process if it encounters an NCBI 16S sequence that couldn't be downloaded
- in **MAFFT** (step 3 in "Summary of steps/rules of the Snakemake workflow"),<br>
**clustalout**: set to True for clustal output format (default: fasta)<br>
**options**: All other possible options declared as you would when calling mafft. Check MAFFT documentation (mafft -h)
- in **RAxML** (step 4 in "Summary of steps/rules of the Snakemake workflow"),<br>
**tree_name**: customizable part of tree file name -> "RAxML_bestTree.{tree_name}"<br>
**model**: model of binary (Morphological), nucleotide, multi-State, or amino acid substitution<br>
**options**: All other possible options declared as you would when calling raxmlHPC-PTHREADS-AVX2. Check raxmlHPC documentation (raxmlHPC -h)
- in **add_species_sharing_accID** (step 6 in "Summary of steps/rules of the Snakemake workflow"),<br>
**edge_length**: {'auto' (default) | Real} Length between the species to add and the node where the species is added (which can be a newly added node when the initial node is a leaf node).<br>
If 'auto', (a) for a leaf node: edge_length is set to node2node_edge_length if defined, or to node.edge_length/2 (the edge length from the node to its (initial) parent node)<br>
(b) for a node with children: edge_length is set to the average distance between the node and all its children<br>
**node2node_edge_length**: {'auto' (default) | Real | percentage\*} In the case of a leaf node, the edge length that should be between the leaf node and the new posterior node.<br>
If 'auto', node2node_edge_length is set to edge_length if defined, or to node.edge_length/2 (the edge length from the node to its (initial) parent node).<br>
\* To pass a percentage, write a number from 0 to 100 followed by '%' (e.g. 80%)<br>
**verbosity**: 0 ≤ int ≤ 3. Integer to define the verbosity level: 0 for no message; 1 for error messages; 2 for warning and error messages; 3 for info, warning and error messages
- in **determine_node** (step 7 in "Summary of steps/rules of the Snakemake workflow"),<br>
**threshold_proportion_from_rank_in_descendants**: 0 ≤ float ≤ 1. Threshold on the minimum proportion of species from the rank among the descendants of the node. If no node fulfills the two threshold requirements, the MRCA of all species from the rank is used as the default node on which to include the species.<br>
NB: If set to 1.0, the MRCA is used and the 3rd step of step 7 in "Summary of steps/rules of the Snakemake workflow" is skipped<br>
**threshold_proportion_descendants_from_rank**: 0 ≤ float ≤ 1. Threshold on the minimum proportion of descendants of the node that actually are species from the rank. If no node fulfills the two threshold requirements, the MRCA of all species from the rank is used as the default node<br>
**character_to_differentiate_added_species**: character added to the end of the label of each added species, to differentiate added species from those that were initially in the tree and whose location in the tree is based on genomic information.<br>
NB: To have an asterisk (\*) at the end you need to nest the two different types of quotes, like: character_to_differentiate_added_species: "'\*'"
- in **add_missing_species** (step 7 in "Summary of steps/rules of the Snakemake workflow"),<br>
**edge_length**: {'auto' (default) | Real} Length between the species to add and the node where the species is added (which can be a newly added node when the initial node is a leaf node).<br>
If 'auto', (a) for a leaf node: edge_length is set to node2node_edge_length if defined, or to node.edge_length/2 (the edge length from the node to its (initial) parent node)<br>
(b) for a node with children: edge_length is set to the average distance between the node and all its children<br>
**node2node_edge_length**: {'auto' (default) | Real | percentage\*} In the case of a leaf node, the edge length that should be between the leaf node and the new posterior node.<br>
If 'auto', node2node_edge_length is set to edge_length if defined, or to node.edge_length/2 (the edge length from the node to its (initial) parent node).<br>
\* To pass a percentage, write a number from 0 to 100 followed by '%' (e.g. 80%)<br>
**verbosity**: 0 ≤ int ≤ 3. Integer to define the verbosity level: 0 for no message; 1 for error messages; 2 for warning and error messages; 3 for info, warning and error messages
![case_node_with_children](/uploads/5c3ec1eaa8be467780eec9b2ebb54b30/case_node_with_children.png)
![case_leaf_node](/uploads/953bc5f2db612e2e76c0f65de391a2ac/case_leaf_node.png)
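As an illustration of the `edge_length`/`node2node_edge_length` values, here is a minimal Python sketch of the 'auto' resolution rules for a leaf node. The function names are hypothetical, and interpreting a percentage as a fraction of the leaf's original edge length is an assumption:

```python
# Minimal sketch (not the workflow's actual code) of the 'auto' rules
# described above, for the leaf-node case.

def parse_length(value, reference_edge_length):
    """Resolve a value given as a number or as an 'NN%' string.

    The percentage is assumed to apply to the leaf's original edge length.
    """
    if isinstance(value, str) and value.endswith("%"):
        return float(value[:-1]) / 100 * reference_edge_length
    return float(value)

def resolve_leaf_lengths(leaf_edge_length, edge_length="auto",
                         node2node_edge_length="auto"):
    """Return (edge_length, node2node_edge_length) for inserting at a leaf."""
    if node2node_edge_length != "auto":
        n2n = parse_length(node2node_edge_length, leaf_edge_length)
    elif edge_length != "auto":
        n2n = float(edge_length)      # 'auto' n2n falls back to edge_length
    else:
        n2n = leaf_edge_length / 2    # both 'auto': half the original edge
    el = float(edge_length) if edge_length != "auto" else n2n
    return el, n2n
```

For example, with a leaf edge of 0.4 and both parameters left at 'auto', both lengths resolve to 0.2.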
▶ <ins>Steps/rules of the Snakemake workflow:</ins>
1. **associate_txids_to_headers** tries to associate the organism name in each header with a species/organism txid
> Doesn't work in every case, so if you ever find a file with species txids you can skip that rule and start with the rule search_16S, <br>
by naming your file "SILVA_header_to_txid.tsv" and putting it in output_dir/16S_collection <br>
The file needs to have at least 3 columns: <br>
* "header" (as in the fasta file) <br>
* "species_txid" <br>
* "scientific name" <br>
At some point, the script might have to discriminate between several txids that could computationally correspond to the name of the organism of the SILVA sequence. `user_choice` allows the user to review the candidate txids with their scientific names and make the choice themselves (otherwise the choice is made by the script). **NB:** This argument doesn't work when using slurm.
2. **search_16S**, **filter_SILVA_dataset**, **download_NCBI_16Sseq** and **concat_16Sseq_files** try to gather into a multifasta file the 16S sequences for all species present in the whole (non-specific) CustomDB (in the prerequisite files), by filtering the SILVA dataset and then going through NCBI<br>
The **download_NCBI_16Sseq** rule has an argument `quit_snakemake_if_one_16S_seq_not_downloaded` in the config file and `quit` in the params of the rule in the Snakefile (also `-q` when directly calling the bash script Download_16S_seq.sh), which makes the bash script exit with a non-zero exit status if any of the NCBI 16S sequences that were supposed to be downloaded (listed in name_recordid_NCBI.txt) couldn't be downloaded.<br>
⚠️ This step requires an internet access
3. **MAFFT** runs multiple alignment on 16S sequences
4. **RAxML** builds the phylogenetic tree<br>
⚠️ This step can take a **long** time (10202 sequences took ~1 week for 100 bootstraps)
5. **rename_tree_leaves** replaces the leaf names, initially the accession IDs of the sequences, with "{species_name}, {txid_name}"
6. **add_species_sharing_accID**: when several species correspond to the same NCBI accession ID (old species that have been regrouped into one new species, meaning that the taxonomy as it was used in the construction of the prerequisite files has become obsolete -> but the CustomDB phylogenetic tree is based on that version of the Kraken2 taxonomy), all the species are put at a distance of 0 from each other at the leaf node of that NCBI accession ID (edge_length=0, node2node_edge_length=0 <sup>1</sup>)
7. **get_all_leaves_of_rank**, **get_unperfect_nodes**, **divide_in_chunks**, **search_best_node**, **concatenate_nodes_instructions_files**, **add_missing_species**<sup>1</sup> search for the best node to put each missing species on (by default the MRCA, but it is possible to specify thresholds to find a more adequate node) and add them to the tree.
> This step, which consists in adding the missing species to the tree, proceeds in a few sub-steps:<br>
> 1) For each species to add, go up the taxonomic levels until finding one that already has at least one member present in the tree, and list all the species members of that rank (rule **get_all_leaves_of_rank**) (Ex: to add Collinsella sp. zg1085, list all 3 species of the same genus (Collinsella, txid=102106))
> 2) From the list of all species members of the rank, compute the Most Recent Common Ancestor (MRCA) and compute the proportion of descendants of the MRCA node that are actually from the rank (the proportion of species from the rank among the descendants of the MRCA node is 100% by definition) (rule **get_all_leaves_of_rank**)
> 3) Search for a better node if `thr_proportion_from_rank_in_descendants` < 100 (otherwise stick to the MRCA obtained in the previous step). `thr_proportion_from_rank_in_descendants` (in search_best_node_where_to_include_species.py and the Snakefile; `threshold_proportion_from_rank_in_descendants` in config.yaml) is an argument that defines the minimum proportion of species from the rank that should be among the descendants of the chosen node. <br>
`thr_proportion_descendants_from_rank` (in the python script and Snakefile; `threshold_proportion_descendants_from_rank` in the config file) defines the minimum proportion of descendants that actually are from the rank.<br>
-> Species are gradually removed from the list of species from the rank used to generate the node, removing the species that are the farthest from the others in the tree (pairwise distance = sum of edge lengths), and both proportions are examined to find the node that maximizes both. If no node respects both thresholds, the chosen node is the default MRCA node. (rule **search_best_node**)
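The two proportions and the threshold fallback used in step 7 can be sketched like this (the set-based representation of node descendants is an assumption; the real rule works on the tree itself and also gradually removes the farthest species):

```python
# Toy sketch of the two proportions behind search_best_node. Candidate
# nodes are represented (by assumption) as sets of descendant species.

def proportions(descendants, species_from_rank):
    """Return (prop. of rank species among descendants,
               prop. of descendants that belong to the rank)."""
    overlap = descendants & species_from_rank
    from_rank_in_desc = len(overlap) / len(species_from_rank)
    desc_from_rank = len(overlap) / len(descendants)
    return from_rank_in_desc, desc_from_rank

def choose_node(candidates, mrca, species_from_rank,
                thr_from_rank_in_desc=0.8, thr_desc_from_rank=0.8):
    """Pick the first candidate meeting both thresholds, else the MRCA."""
    for name, descendants in candidates.items():
        p1, p2 = proportions(descendants, species_from_rank)
        if p1 >= thr_from_rank_in_desc and p2 >= thr_desc_from_rank:
            return name
    return mrca
```

If no candidate satisfies both thresholds, the function falls back to the MRCA, mirroring the default behavior described above.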
8. **final_editing** edits the tree text file into a format readable by skbio.Tree (used in moonstone)
<sup>1</sup> The **add_species_sharing_accID** and **add_missing_species** rules both call the add_species_to_tree.py python script. The script allows the user to define where species to add should be added through the `edge_length` and `node2node_edge_length` arguments (in the config file, Snakefile and add_species_to_tree.py). When the node where to add a species is a leaf (as is the case in add_species_sharing_accID and in some cases of add_missing_species), a node is added between the leaf and its parent node at a distance of `node2node_edge_length` from the leaf node. `edge_length` defines the length of the edge between the node (new node or original inner node) and the species to add.
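The leaf-node insertion described in footnote 1 can be sketched with a toy node structure (hypothetical classes, not the workflow's actual data structures):

```python
# Toy sketch of footnote 1: a new inner node is placed on the leaf's edge,
# node2node_edge_length away from the leaf, and the added species hangs
# from it at edge_length.

class Node:
    def __init__(self, name, edge_length=0.0, children=None):
        self.name = name
        self.edge_length = edge_length  # length of the edge to the parent
        self.children = children or []

def add_species_at_leaf(parent, leaf, new_name,
                        edge_length, node2node_edge_length):
    """Insert new_name next to leaf by splitting the leaf's edge."""
    inner = Node("inner",
                 edge_length=leaf.edge_length - node2node_edge_length)
    leaf.edge_length = node2node_edge_length
    inner.children = [leaf, Node(new_name, edge_length=edge_length)]
    # re-attach: the inner node takes the leaf's place under the parent
    parent.children = [inner if c is leaf else c for c in parent.children]
    return inner
```

With `edge_length=0` and `node2node_edge_length=0` this reproduces the add_species_sharing_accID case, where all species sharing an accession ID sit at distance 0 from each other.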
▶ <ins>To run Snakemake:</ins>
1. Update config.yaml with your **custom_db_path**, your **output_dir** etc.
2. Create or load an environment that contains snakemake, mafft and RAxML
> `module load Python/3.8.1`<br>
`python3 -m venv .snakemake_venv`<br>
`source .snakemake_venv/bin/activate`<br>
`module load snakemake fasta ruby mafft RAxML`<br>
`# make sure that edirect is in your PATH`<br>
`export PATH=${PATH}:${HOME}/edirect`
3. Run snakemake (through slurm or not) somewhere with internet access ⚠️
> `sbatch --mem=600G -p atm -c 1 snakemake -p -j 40 -s /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/phylogenetic_tree/snakemake/Snakefile -R all --cluster "sbatch -p atm -c {threads}"`

## Custom DB snakemake (metagenomics/snakemake#36)

Snakemake workflow to run to create a CustomDB.
▶ <ins>Input:</ins>
- all_sequences_kraken_and_NCBIassemblies.fna
- all_correspondence_headers_kraken_and_metaphlan_taxonomy.txt
- taxonomy directory<br>
→ in maestro (recommended): `ln -s /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/prerequisite_files/input/kraken_db/taxonomy .`<br>
→ **OR** downloadable with: `kraken2-build --download-taxonomy --db $DBNAME`
- python environment with pandas
- Snakemake files: Snakefile and config.yaml, at /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/snakemake
▶ <ins>Config file parameters:</ins>
- in **snakefiles**,<br>
**metaphlan3_paired**: path to Snakefile of the metaphlan3 paired snakemake workflow<br>
**metaphlan3_single**: path to Snakefile of the metaphlan3 single snakemake workflow<br>
**metaphlan2_merge**: path to Snakefile of the merging of metaphlan results snakemake workflow<br>
- **samples**: list of sample names. Do not include the extension or the pair prefix.<br>
_example:_ SAMPLE01_read1.fastq.gz and SAMPLE01_read2.fastq.gz should be listed as SAMPLE01
- **input_dir**: path to the directory containing the samples
- **output_dir**: path to the directory that will contain all output of this snakemake workflow.
- **output_database**: path to directory where to create the Custom Database (must contain kraken2 taxonomy directory)
- **multifastafile**: path to the pre-requisite multifasta file, all_sequences_kraken_and_NCBIassemblies.fna
- **correspondencefile**: path to the pre-requisite correspondence file, all_correspondence_headers_kraken_and_metaphlan_taxonomy.txt
- in **metaphlan3** (step 1 in "Summary of steps/rules of the Snakemake workflow"),<br>
**threads**: number of threads to use when calling metaphlan<br>
**input_type**: {fastq,fasta,bowtie2out,sam} set whether the input is the FASTA file of metagenomic reads or the SAM file of the mapping of the reads against the MetaPhlAn db.<br>
**ext**: extension of samples<br>
_example:_ for SAMPLE01_read1.fastq.gz and SAMPLE01_read2.fastq.gz, ext should be set as "fastq.gz"<br>
**paired**: set to True if samples are paired like in the example above<br>
**pair_prefix**: prefix before number in the sample file name<br>
_example:_ for SAMPLE01_read1.fastq.gz and SAMPLE01_read2.fastq.gz, pair_prefix should be set as "read"<br>
**options**: All other possible options declared as you would when calling metaphlan. Check Metaphlan documentation (metaphlan -h)
- in **metaphlan2_merge** (step 1 in "Summary of steps/rules of the Snakemake workflow"),<br>
**threads**: number of threads to use for the merging of metaphlan results
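The naming convention tying **samples**, **ext** and **pair_prefix** together can be illustrated with a hypothetical helper that recovers the sample name from a file name (an illustration only, not part of the workflow):

```python
import re

# Hypothetical helper: strip "_{pair_prefix}{1|2}.{ext}" from a file name
# to recover the sample name, per the convention described above.
def sample_name(filename, pair_prefix="read", ext="fastq.gz"):
    """SAMPLE01_read1.fastq.gz -> SAMPLE01 (assumed convention)."""
    pattern = rf"(.+)_{re.escape(pair_prefix)}[12]\.{re.escape(ext)}"
    m = re.fullmatch(pattern, filename)
    if m is None:
        raise ValueError(f"{filename!r} does not match the expected pattern")
    return m.group(1)
```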
▶ <ins>Steps/rules of the Snakemake workflow:</ins>
1. **metaphlan3** and **metaphlan2_merge** run metaphlan3 on the samples to produce a file called merged_taxonomic_profiles.txt with, for each sample, the percentage (relative abundance) of each species detected. Only species/genus/... present in at least one sample are listed in the file.
2. **list_txids** makes a list of all txids to keep. In the configuration file, an argument `level` lets you choose the level, between genus and species, at which to operate. Operating at the genus level means keeping all sequences from species belonging to a genus present in the samples.
3. **filter_headers** and **filter_sequences** filter all_sequences_kraken_and_NCBIassemblies.fna to keep only sequences from the txids in the list
4. **add_to_library** and **build_database** build the custom database
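Step 3's filtering can be sketched as follows, assuming kraken-style headers that carry the taxid inline (`>kraken:taxid|<txid>|<accession> ...`); the actual rules also rely on the correspondence file for headers without an inline taxid:

```python
# Sketch (assumed header format) of filtering a multifasta down to the
# txids kept by list_txids. Headers without a "|"-delimited taxid are
# dropped here; the real workflow resolves those via the correspondence
# file instead.
def filter_fasta(lines, txids_to_keep):
    """Yield fasta lines whose record's kraken taxid is in txids_to_keep."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            parts = line.split("|")
            keep = len(parts) > 1 and parts[1] in txids_to_keep
        if keep:
            yield line
```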
▶ <ins>To run Snakemake:</ins>
1. Update config.yaml with your **samples**, your **input_dir**, your **output_database** etc.
2. Make sure that the kraken2 taxonomy directory is in or has a symbolic link (recommended) in the working directory
3. Create or load an environment that contains snakemake, kraken2 and metaphlan3, as well as python with the pandas library.
> 1st option:<br>
`module load Python`<br>
`python3 -m venv .snakemake_venv`<br>
`source .snakemake_venv/bin/activate`<br>
`module load snakemake muscle/3.8.1551 blast+/2.10.1 samtools/1.10 RAxML/8.2.12 bowtie2/2.3.5.1 fasta ruby diamond/2.0.6 trimal/1.4.1 mafft/7.467 FastTree/2.1.11 IQ-TREE/1.6.12 Mash/2.2 phylophlan/3.0.2 R/4.0.2 graphlan/1.1.3 metaphlan blast+ kraken`<br><br>
> 2nd option:<br>
`mamba env create -f environment.yml -n moonbeam`<br>
`conda activate moonbeam`<br>
`module load muscle/3.8.1551 blast+/2.10.1 samtools/1.10 RAxML/8.2.12 bowtie2/2.3.5.1 fasta ruby diamond/2.0.6 trimal/1.4.1 mafft/7.467 FastTree/2.1.11 IQ-TREE/1.6.12 Mash/2.2 phylophlan/3.0.2 R/4.0.2 graphlan/1.1.3 metaphlan`
4. Run snakemake (through slurm or not)
> `sbatch -p atm -c 1 snakemake -p -j 24 -s Snakefile -R all --cluster "sbatch -p atm -c {threads}"`
5. If an unmapped.txt is generated when trying to build the CustomDB, then the accessions need to be added to the seqid2taxid.map file:
> `grep -f unmapped.txt all_correspondence_headers_kraken_and_metaphlan_taxonomy.txt | sed 's/^>\([^\.]*\)\.[^\t]*\t\([0-9]*\)\t.*$/\1\t\2/' >> seqid2taxid.map`<br>
and then rebuild the database:
> `sbatch -p atm -c 32 --mem=20G kraken2-build --build --db $DBNAME --threads 32`

## Generation of starting files for the creation of a CustomDB snakemake workflow (metagenomics/snakemake#35)

Might need to update said files with the release of a new version of the metaphlan3 DB.
▶ <ins>Output</ins> = The 2 prerequisite files for the [creation of a CustomDB workflow](https://gitlab.pasteur.fr/metagenomics/snakemake/-/issues/36):
- all_sequences_kraken_and_NCBIassemblies.fna <br>
- all_correspondence_headers_kraken_and_metaphlan_taxonomy.txt (example below)
| header | species | genus | source | metaphlan3_species | metaphlan3_genus |
| ------ | ------ | ------ | ------ | ------ | ------ |
| >kraken:taxid\|1937004\|NZ_CP019470.1 Methanopyrus sp. KOL6 chromosome | | 2319 | kraken2 | | |
| >NCBI01000028.1 MAG: Chromatiales bacterium 21-64-14 04302015_21_scaffold_100, whole genome shotgun sequence | 1970504 | | GCA_002255365.1* | 1970504 | |
| | 99007 | | | 99007 | |
\* NCBI accessionID <br>
<ins>NB:</ins> an empty cell for species/metaphlan3_species/metaphlan3_genus = species not in metaphlan3 but from a genus present in the metaphlan3 DB<br>
an empty cell for genus txid = no information found for the genus lato sensu (txid for the genus, or for a "no rank"/clade below a filled-in family txid)<br>
an empty cell for header = species not found (kept in the correspondence file so that its genus can still be used when building a CustomDB at the `genus` level)
▶ <ins>Input:</ins>
- assembly_summary_refseq.txt and assembly_summary_genbank.txt (downloadable: [here](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/))
- metaphlan fasta and pickle file (mpa_vX_CHOCOPhlAn_YYYYMM.fna/pkl) (downloadable by following this [link](http://cmprod1.cibio.unitn.it/) > biobakeryX > metaphlan_databases > tar file)
- kraken libraries and taxonomy (`kraken2-build --download-library $lib --db $DBNAME` and `kraken2-build --download-taxonomy --db $DBNAME`)
▶ <ins>Config file parameters:</ins>
- **output_dir**: path to the directory that will contain all output of this snakemake workflow.
- **metaphlan_fastafile** and **metaphlan_picklefile**: path to metaphlan database files (see "Input").
- **path_to_kraken_db**: path to the kraken database, meaning the directory that contains or will contain (after download - step 1 in "Summary of steps/rules of the Snakemake workflow") all the kraken library directories and the kraken taxonomy directory, and that will later host the test kraken database (step 8 in "Summary of steps/rules of the Snakemake workflow")
- **entrez_email**: email used to connect with Bio.Entrez.
- **assembly summary**: list of paths to the different assembly_summary files to use to search for NCBI assemblies (see "Input").
- **libraries**: list of the different kraken libraries (to download and) to filter through.
- in **search_orgatxids** (step 2 in "Summary of steps/rules of the Snakemake workflow"),<br>
**on_NCBI**: set to True if you want to search for the markers' current organism taxid in NCBI; use metaphlan taxid if set to False<br>
**resume**: set to True if you use on_NCBI and you fear interruption of the process
- in **download_NCBIassemblies** (step 7 in "Summary of steps/rules of the Snakemake workflow"),<br>
**quit_snakemake_if_one_assembly_not_downloaded**: set to True if you want snakemake to stop its process if it encounters a NCBI assembly that couldn't be downloaded
▶ <ins>Summary of steps/rules of the Snakemake workflow:</ins>
1. **download_kraken_lib** and **download_kraken_taxonomy** download all the libraries listed in config.yaml and the taxonomy directory.<br>
These two rules are only run if the libraries and taxonomy are not already in `path_to_kraken_db` (in config.yaml), so you might have to check the version of the kraken2 libraries and taxonomy.
2. **search_orgatxids** tries to list all organism txids present in metaphlan3:
> - Either use the txids filled in for each marker of the fasta file in the pickle file.
> - Or search for the markers in NCBI with the Bio.Entrez python package: when a marker has an accession ID, search through the assembly_summary files and then NCBI with Bio.Entrez. If a marker isn't found in NCBI, fall back to the txid filled in the pickle file. To use this path, set `on_NCBI` under `search_orgatxids` in config.yaml to True. Since this path can take some time, there is an argument `resume` under `search_orgatxids` in config.yaml that allows resuming where the script left off if it was inadvertently stopped before ending. :warning:️ This path needs internet access. Use it if the metaphlan3 db release is old and you have doubts about whether the txids associated with the markers are still current.
3. **get_species_and_genus** tries to get the species txid and genus txid from the list of organism txids (organism txids are species txids or lower taxonomic levels).
4. **filter_kraken_lib** is launched for as many kraken libraries as you have set under `libraries` in config.yaml. This rule filters the kraken library to keep only the species txids whose genus txid is listed as to keep. It creates 4 output files: species_notfound_in_{lib}.txt, genus_notfound_in_{lib}.txt, sequences.fna (filtered version of library.fna) and correspondence_headers_taxonomy.txt
5. **missing_IDs_after_kraken** recapitulates which species and genus txids are still missing after the filtering of the different kraken libraries.
6. **filter_and_order_asssemblies** takes the assembly_summary files given under `assembly_summary` in the config.yaml and filters out assemblies tagged in the "excluded_from_refseq" column (column 21), apart from some tags ("genus undefined", "derived from environmental source", "derived from metagenome", "derived from single cell" and "sequence duplications") that are authorized. It then orders the remaining assemblies by looking first at the RefSeq category (column 5: "refseq_category"), favoring "reference genome" over "representative genome" and then "na"; then at the assembly level (column 12: "assembly_level"), preferring "Complete Genome", then "Chromosome", "Scaffold" and lastly "Contig"; and ultimately at the sequence release date (column 15: "seq_rel_date"), favoring the most recent assembly.
7. **search_NCBIassemblies**, **download_NCBIassemblies** and **creation_NCBI_multifasta_and_correspondencefile** search for the still-missing species in the filtered and ordered super-assembly_summary file. For genera that still don't have a representative, they look back through the super-assembly_summary file for one. They then download the retained NCBI assemblies, concatenate them into a multifasta file and create the correspondence file.
8. **add_NCBIassemblies_to_lib** and **check_db_building** check that the NCBI assemblies can be used to build a customized kraken database. An unmapped.txt file may be generated inside the test Kraken DB; the user then needs to follow the 4th step of "To run Snakemake".
9. **concatenate_preprocess_files** and **filter_duplicates_and_only_x** concatenate the n different multifasta files into one file, all_sequences_kraken_and_NCBIassemblies.fna (n = number of kraken libraries reviewed + 1 (NCBI assemblies)), as well as the n different correspondence files into all_correspondence_headers_taxonomy.txt, and remove from both files the sequences that are duplicated and the sequences composed only of "x" (no nucleotide information) (observed in kraken libraries)
10. **add_metaphlan_taxonomy** adds the metaphlan taxid information to the super-correspondence file, because there might be differences between metaphlan taxids (used in the results of metaphlan3) and kraken taxids (used in the correspondence file, more up-to-date). The species that weren't found are added under an empty header, so that species sharing their genus can still be searched.
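The ordering applied in step 6 can be expressed as a Python sort key (a sketch assuming the usual assembly_summary date format YYYY/MM/DD; unknown category or level values rank last):

```python
# Sketch of step 6's ordering as a sort key over assembly_summary rows,
# represented here (by assumption) as dicts keyed by column name.
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1, "na": 2}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1,
              "Scaffold": 2, "Contig": 3}

def assembly_sort_key(row):
    """Best assembly first: category, then level, then most recent date."""
    return (CATEGORY_RANK.get(row["refseq_category"], 3),
            LEVEL_RANK.get(row["assembly_level"], 4),
            # most recent first: negate the YYYY/MM/DD parts
            tuple(-int(p) for p in row["seq_rel_date"].split("/")))
```

`sorted(rows, key=assembly_sort_key)` then yields the candidates in the preference order described above.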
▶ <ins>To run Snakemake:</ins>
1. Update config.yaml with path to new **metaphlan fasta** and **picklefile**, **output_dir** etc.
2. Create or load an environment that contains snakemake, blast+ and kraken2
> `module load Python/3.8.1`<br>
`python3 -m venv .snakemake_venv`<br>
`source .snakemake_venv/bin/activate`<br>
`module load snakemake blast+ kraken`<br>
`pip install biopython`<br>
3. Run snakemake (through slurm or not) somewhere with internet access :warning:
> `snakemake -p -j 40 -s /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/prerequisite_files/snakemake/Snakefile -R all`
4. Check if an **unmapped.txt** file has been generated inside the test Kraken DB. That would mean that some of the NCBI assembly accessions aren't registered in the two \*.accession2taxid files. If that's the case, we need to generate a file that links the accessions to the txids of the species:
> `echo -e "accession\taccession.version\ttaxid\tgi" > taxonomy/unmappedacc.accession2taxid`<br>
`grep -f unmapped.txt /pasteur/zeus/projets/p02/metasig/DBs/Kraken2CustomDB/prerequisite_files/output/assembly_fna.gz/correspondence_headers_taxonomy.txt | sed 's/^>\([^\.]*\)\.\([0-9]\)[^\t]*\t\([0-9]*\)\t.*$/\1\t\1.\2\t\3\t/' >> taxonomy/unmappedacc.accession2taxid`
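For reference, a Python rendering of the sed transform above (same capture groups, including the quirk that only a single version digit is captured):

```python
import re

# Python equivalent of the sed one-liner: turn a correspondence-file line
# ">ACC.V <description>\t<txid>\t..." into an accession2taxid row
# "ACC\tACC.V\t<txid>\t" (gi column left empty). Like the sed expression,
# only one digit of the accession version is captured.
LINE_RE = re.compile(r"^>([^.]*)\.(\d)[^\t]*\t(\d*)\t.*$")

def to_accession2taxid(line):
    """Return the accession2taxid row for one line, or None if no match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    acc, version, taxid = m.groups()
    return f"{acc}\t{acc}.{version}\t{taxid}\t"
```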
5. Try to build the test Kraken DB again (just to be sure)
> `kraken2-build --build --db {path_to_test_krakenDB} --threads {threads}`
NEXT: #37Agnes BAUDAgnes BAUDhttps://gitlab.pasteur.fr/metagenomics/fastgenes/-/issues/26Deploy on Kubernetes2022-02-15T09:49:14+01:00Kenzo-Hugo HillionDeploy on Kuberneteshttps://rocketchat.pasteur.cloud/channel/support-kubernetes-gitlab-ci-cd?msg=n73SE45oEP39e3uzh
namespaces:
* fastgenes-dev
* fastgenes-prodhttps://rocketchat.pasteur.cloud/channel/support-kubernetes-gitlab-ci-cd?msg=n73SE45oEP39e3uzh
namespaces:
* fastgenes-dev
* fastgenes-prodhttps://gitlab.pasteur.fr/metagenomics/fastgenes/-/issues/24Return true updated payload for gene update2022-01-10T15:08:04+01:00Kenzo-Hugo HillionReturn true updated payload for gene updateAt the moment, if we update a link for a gene (for instance catalog link) it is not returned by the payload.At the moment, if we update a link for a gene (for instance catalog link) it is not returned by the payload.https://gitlab.pasteur.fr/metagenomics/fastgenes/-/issues/23Gene annotations2022-01-05T15:25:59+01:00Kenzo-Hugo HillionGene annotationshttps://gitlab.pasteur.fr/metagenomics/fastgenes/-/issues/21Make app asynchronous2021-12-17T13:27:00+01:00Kenzo-Hugo HillionMake app asynchronousCurrently app is not asynchronous.
Following the steps described here: https://testdriven.io/blog/fastapi-sqlmodel/ makes the app asynchronous.

## Better local test DB (metagenomics/fastgenes#19)

## Impose unique constraint on EggNog ID based on version (metagenomics/fastgenes#15)

At the moment, since different versions of EggNOG are supported, a unique constraint cannot be applied to eggnog_id.

## Creation of EggNOG db (metagenomics/fastgenes#12)

## Improve experiments models and make CRUD (metagenomics/fastgenes#10)

Improve Experiment models to give more information about the tools and parameters used.

## CI with docker (metagenomics/fastgenes#6)

https://gitlab.pasteur.fr/metagenomics/metagenedb/-/blob/dev/.gitlab-ci.yml

## Auth and JWT tokens (metagenomics/fastgenes#5)

## Choose name for the application (metagenomics/fastgenes#3)

We should define a name before asking for the creation of a namespace on Kubernetes.
What do you think @abaud and @skennedy?
Current name is a reference to the library used for the API, called FastAPI.