Commit 6fcc0bf5 authored by Amandine PERRIN

Merge branch 'documentation'

parents c5573fc7 2792138b
...@@ -59,7 +59,7 @@ author = 'PERRIN Amandine' ...@@ -59,7 +59,7 @@ author = 'PERRIN Amandine'
# built documents. # built documents.
# #
# The short X.Y version. # The short X.Y version.
version = '1.2' version = '1.2.0'
# The full version, including alpha/beta/rc tags. # The full version, including alpha/beta/rc tags.
release = '1.2.0' release = '1.2.0'
......
...@@ -102,7 +102,7 @@ You can see all required arguments and available options with:: ...@@ -102,7 +102,7 @@ You can see all required arguments and available options with::
The ``prepare`` module works in 3 steps: The ``prepare`` module works in 3 steps:
1) Downloading assemblies from refseq 1) Downloading assemblies from refseq or genbank
2) Quality control to filter assemblies in terms of sequence quality 2) Quality control to filter assemblies in terms of sequence quality
3) Filtering step dedicated to removing redundant and misclassified genomes, based on Mash genetic distance. 3) Filtering step dedicated to removing redundant and misclassified genomes, based on Mash genetic distance.
...@@ -113,7 +113,7 @@ Inputs ...@@ -113,7 +113,7 @@ Inputs
Your input will depend on the step from which you are starting. Your input will depend on the step from which you are starting.
- If you start from the beginning, your input is an NCBI taxid and/or an NCBI species. You can also specify which assembly level(s) you want to download - If you start from the beginning, your input is an NCBI taxid and/or an NCBI species taxid and/or an NCBI species name. You can also specify which assembly level(s) you want to download, as well as the NCBI section (genbank or refseq).
- If you start from step 2, your input will be a database of fasta sequences, in :ref:`sequences format <seq>`. - If you start from step 2, your input will be a database of fasta sequences, in :ref:`sequences format <seq>`.
- If you start from step 3, your input will be the database as previously, as well as the LSTINFO output of :ref:`step 2 <step2>`. - If you start from step 3, your input will be the database as previously, as well as the LSTINFO output of :ref:`step 2 <step2>`.
...@@ -129,7 +129,7 @@ All sequences are in fasta format, as described in :ref:`sequences format <seq>` ...@@ -129,7 +129,7 @@ All sequences are in fasta format, as described in :ref:`sequences format <seq>`
In your output directory, you will find: In your output directory, you will find:
- Only if you started from step 1: A folder called ``refseq/bacteria``, containing 1 folder per assembly (called with the assembly accession number), and, inside, the assembly sequence in fasta.gz format, and the MD5SUMS of this file. - Only if you started from step 1: A folder called ``refseq/bacteria`` (or ``genbank/bacteria`` if you downloaded all genomes from genbank), containing 1 folder per assembly (called with the assembly accession number), and, inside, the assembly sequence in fasta.gz format, and the MD5SUMS of this file.
- Only if you started from step 1: A folder called ``Database_init``, containing all assemblies downloaded from refseq in fasta format - Only if you started from step 1: A folder called ``Database_init``, containing all assemblies downloaded from refseq in fasta format
- Only if you started from step 1 or 2: A folder called ``tmp_files`` containing your genomic sequences, split at each stretch of at least 5 ``N`` (see :ref:`sequences format <seq>` for more details on the splitting part). - Only if you started from step 1 or 2: A folder called ``tmp_files`` containing your genomic sequences, split at each stretch of at least 5 ``N`` (see :ref:`sequences format <seq>` for more details on the splitting part).
...@@ -204,25 +204,35 @@ Running from step 1 ...@@ -204,25 +204,35 @@ Running from step 1
To download genomes, and then process them by the `prepare` filters, run:: To download genomes, and then process them by the `prepare` filters, run::
PanACoTA prepare [-t <NCBI species taxid> -s <NCBI species> -l <assembly_level(s)>] PanACoTA prepare [-g <NCBI species> -T <NCBI species taxid> -t <NCBI taxid> -s <genbank or refseq> -l <assembly_level(s)>]
Give at least one of ``-t`` or ``-s`` parameters. With: Give at least one of the ``-T``, ``-t`` or ``-g`` parameters (one of them is enough). With:
- ``-t <NCBI taxid>``: the taxid provided by the NCBI for the species you want to study. - ``-g <NCBI species>``: the name of the species, as written by the NCBI. Give name between quotes.
- ``-s <NCBI species>``: the name of the species, as written by the NCBI. Give name between quotes. - ``-T <NCBI species taxid>``: the taxid provided by the NCBI for the species you want to download
- ``-t <NCBI taxid>``: the taxid provided by the NCBI for the subspecies or specific strain you want to download
If you want to download all genomes in genbank, and not only the ones in refseq, use option ``-s genbank`` (default is ``-s refseq``).
If you do not want to download all assemblies in refseq, but only genomes with specific assembly levels, use option ``-l <level(s)>``. Give it a comma separated list of assembly levels you want to download, between 'all', 'complete', 'chromosome', 'scaffold', 'contig' (default is 'all'). If you do not want to download all assemblies in refseq, but only genomes with specific assembly levels, use option ``-l <level(s)>``. Give it a comma separated list of assembly levels you want to download, between 'all', 'complete', 'chromosome', 'scaffold', 'contig' (default is 'all').
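The comma-separated value expected by ``-l`` can be checked before launching a download. The following is a minimal sketch, not part of PanACoTA: ``parse_levels`` is a hypothetical helper name, and the only thing taken from the text above is the list of accepted levels.

```python
# Assembly levels listed in the documentation for the -l option.
ALLOWED_LEVELS = ("all", "complete", "chromosome", "scaffold", "contig")


def parse_levels(value="all"):
    """Hypothetical helper: split a comma-separated -l value and validate it.

    Returns the list of levels, or raises ValueError naming the unknown ones.
    """
    # Split on commas and drop surrounding spaces and empty items.
    levels = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [level for level in levels if level not in ALLOWED_LEVELS]
    if unknown:
        raise ValueError("unknown assembly level(s): " + ", ".join(unknown))
    return levels
```

A check of this kind catches a misspelled level before any assembly is requested.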
For example, if we want to download refseq assemblies of *Acetobacter orleanensis*. With the `taxonomy browser <https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=104099&lvl=3&p=has_linkout&p=blast_url&p=genome_blast&lin=f&keep=1&srchmode=1&unlock>`_, we can find its corresponding NCBI species taxid: "104099". For example, if we want to download refseq assemblies of *Acetobacter orleanensis*: With the `taxonomy browser <https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=104099&lvl=3&p=has_linkout&p=blast_url&p=genome_blast&lin=f&keep=1&srchmode=1&unlock>`_, we can find its corresponding NCBI species taxid: "104099".
To download all assembly levels:: To download all assembly levels::
PanACoTA prepare -t 104099 -s "Acetobacter orleanensis" PanACoTA prepare -T 104099 -g "Acetobacter orleanensis"
Or, to download only complete and scaffold assemblies:: Or, to download only complete and scaffold assemblies::
PanACoTA prepare -s "Acetobacter orleanensis" -l complete,scaffold PanACoTA prepare -g "Acetobacter orleanensis" -l complete,scaffold
To download the subspecies *Acetobacter pasteurianus* subsp. *pasteurianus* (taxid = 481145)::
PanACoTA prepare -t 481145
To download only the specific strain "Acetobacter orleanensis JCM 7639" (taxid = 1231342)::
PanACoTA prepare -t 1231342
Providing only one of the 'species taxid' and 'species name' arguments is enough.
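When the download is launched from a script, the step 1 options described above can be assembled programmatically. This is a minimal sketch under stated assumptions: ``build_prepare_command`` is a hypothetical helper, not part of PanACoTA, and it only uses the ``-g``, ``-T``, ``-t``, ``-s`` and ``-l`` options as documented in the new version of the text.

```python
import shlex


def build_prepare_command(species_name=None, species_taxid=None, taxid=None,
                          section="refseq", levels=None):
    """Hypothetical helper: build the argument list for `PanACoTA prepare`."""
    # The documentation asks for at least one of -g, -T or -t.
    if not (species_name or species_taxid or taxid):
        raise ValueError("give at least one of species_name, species_taxid, taxid")
    # -s selects the NCBI section; the documented default is refseq.
    if section not in ("refseq", "genbank"):
        raise ValueError("section must be 'refseq' or 'genbank'")

    cmd = ["PanACoTA", "prepare"]
    if species_name:
        cmd += ["-g", species_name]
    if species_taxid:
        cmd += ["-T", str(species_taxid)]
    if taxid:
        cmd += ["-t", str(taxid)]
    cmd += ["-s", section]
    if levels:
        # -l takes a comma-separated list of assembly levels.
        cmd += ["-l", ",".join(levels)]
    return cmd


# Example: species taxid and name from the documentation, quoted for a shell.
command_line = shlex.join(
    build_prepare_command(species_name="Acetobacter orleanensis",
                          species_taxid=104099)
)
```

Returning a list rather than a single string keeps the species name as one argument, so it can be passed directly to ``subprocess.run`` without extra quoting.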
Running from step 2 Running from step 2
......