Commit 11d2792d authored by Alexis CRISCUOLO

0.4

parent 60bcccb0
# wgetGenBankWGS
_wgetGenBankWGS_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) to download genome assembly files in FASTA format from the GenBank or RefSeq repositories.
The FASTA files to download are selected from the [GenBank](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt) or [RefSeq](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt) genome assembly reports using [extended regular expressions](https://www.gnu.org/software/grep/manual/grep.html#Regular-Expressions) as implemented by [_grep_](https://www.gnu.org/software/grep/) (with option -E).
_wgetGenBankWGS_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) to download genome assembly files from the GenBank or RefSeq repositories.
The files to download are selected from the [GenBank](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt) or [RefSeq](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt) genome assembly reports using [extended regular expressions](https://www.gnu.org/software/grep/manual/grep.html#Regular-Expressions) as implemented by [_grep_](https://www.gnu.org/software/grep/) (with option -E).
Every download is performed by the standard tool [_wget_](https://www.gnu.org/software/wget/).
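
The selection step can be pictured with a minimal sketch. It assumes the tool filters the tab-separated report with `grep -E` (selection) and `grep -vE` (exclusion), as the description above suggests. The three-row sample report, its reduced column set, the made-up accessions and the `select_rows` helper are all hypothetical and are not taken from the script itself.

```shell
#!/usr/bin/env bash
# Hypothetical, heavily simplified report: accession, taxid, organism, assembly level.
# (The real assembly_summary files have many more columns.)
report=$(mktemp)
printf 'GCF_000000001.1\t28901\tSalmonella enterica\tComplete Genome\n' >  "$report"
printf 'GCF_000000002.1\t10754\tSalmonella phage P22\tComplete Genome\n' >> "$report"
printf 'GCF_000000003.1\t1639\tListeria monocytogenes\tScaffold\n'       >> "$report"

# select_rows <report> <selection pattern> [exclusion pattern]
# Keeps the rows matching the selection pattern, then drops those matching
# the exclusion pattern (if one is given).
select_rows() {
  if [ -n "${3:-}" ]; then
    grep -E -- "$2" "$1" | grep -vE -- "$3"
  else
    grep -E -- "$2" "$1"
  fi
}

# A literal tab character; in Bash, $'\t' is an equivalent way to write it.
tab=$(printf '\t')

# Selection plus exclusion: Salmonella complete genomes, phages and viruses removed.
select_rows "$report" "Salmonella.*Complete Genome" "phage|virus"

# Wrapping a value in tabs anchors it to a whole field (here the taxid column),
# so that 1639 cannot match inside a longer number or an accession.
select_rows "$report" "${tab}1639${tab}" | wc -l
```

Because the patterns are applied to whole report lines, any column (organism name, taxid, assembly level, date, accession) can take part in the selection; tab-delimited patterns are one way to restrict a match to a single field.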
......@@ -28,43 +28,65 @@ Execute _wgetGenBankWGS_ with the following command line model:
Launch _wgetGenBankWGS_ without any option to read the following documentation:
```
wgetGenBankWGS
Downloading FASTA-formatted nucleotide sequence files corresponding to selected entries from genome assembly report files:
GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
USAGE:
wgetGenBankWGS.sh -e <pattern> [-v <pattern>] [-o <outdir>] [-t <nthreads>] [-n]
wgetGenBankWGS v.0.4.200504ac
Downloading sequence files corresponding to selected entries from genome assembly report files:
GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
Writing output files 'Species.isolate--accn--GCA' with the following content (and extension):
-f 1 genomic sequence(s) in FASTA format (.fasta)
-f 2 genomic sequence(s) in GenBank format (.gbk)
-f 3 annotations in GFF3 format (.gff)
-f 4 codon CDS in FASTA format (.fasta)
-f 5 amino acid CDS in FASTA format (.fasta)
-f 6 RNA sequences in FASTA format (.fasta)
USAGE:
wgetGenBankWGS.sh -e <pattern> [-v <pattern>] [-o <outdir>] [-f <integer>] [-n] [-z] [-t <nthreads>]
where:
-e <pattern> extended regexp selection pattern (grep -E style; mandatory)
-v <pattern> extended regexp exclusion pattern (grep -E style; default: none)
-e <pattern> extended regexp selection pattern (mandatory)
-v <pattern> extended regexp exclusion pattern (default: none)
-d <string> either 'genbank' or 'refseq' (default: genbank)
-n no download, i.e. to only print the number of selected files (default: not set)
-t type strain name(s) for each selected species gathered from straininfo.net (default: not set)
-f <integer> file type identifier (see above; default: 1)
-z no unzip, i.e. downloaded files are compressed (default: not set)
-o <outdir> output directory (default: .)
-c <nthreads> number of threads (default: 1)
-t <nthreads> number of threads (default: 1)
EXAMPLES:
+ get the total number of available complete Salmonella genomes inside RefSeq, as well as the type strain list:
+ getting the total number of available complete Salmonella genomes inside RefSeq:
wgetGenBankWGS.sh -e "Salmonella.*Complete Genome" -v "phage|virus" -d refseq -n
+ get the total number of genomes deposited in 1996 (see details in the written file summary.txt):
wgetGenBankWGS.sh -e "1996/[01-12]+/[01-31]+" -n
+ download in the directory Dermatophilaceae every available genome sequence from this family using 30 threads:
+ getting the total number of genomes inside GenBank deposited in 1996:
wgetGenBankWGS.sh -e "1996/[01-12]+/[01-31]+" -n
+ getting the total number of available SARS-CoV-2 genomes (taxid=694009) inside GenBank:
wgetGenBankWGS.sh -e $'\t'694009$'\t' -n
+ downloading the full RefSeq assembly report:
wgetGenBankWGS.sh -e "/" -d refseq -n
+ downloading the GenBank files with the assembly accessions GCF_900002335, GCF_000002415 and GCF_000002765:
wgetGenBankWGS.sh -e "GCF_900002335|GCF_000002415|GCF_000002765" -d refseq
+ downloading in the directory Dermatophilaceae every available genome sequence from this family using 30 threads:
wgetGenBankWGS.sh -e "Austwickia|Dermatophilus|Kineosphaera|Mobilicoccus|Piscicoccus|Tonsilliphilus" -o Dermatophilaceae -t 30
+ download in the current directory the non-Listeria genomes with the wgs_master starting with "PPP":
wgetGenBankWGS.sh -e $'\t'"PPP.00000000" -v "Listeria"
+ downloading the non-Listeria proteomes with the wgs_master starting with "PPP":
wgetGenBankWGS.sh -e $'\t'"PPP.00000000" -v "Listeria" -f 5
+ downloading the genome annotation of every Klebsiella type strain in compressed gff3 format using 30 threads:
wgetGenBankWGS.sh -e "Klebsiella.*type material" -f 3 -z -t 30
```
## Notes
* The output FASTA file names are created with the organism name, followed by the intraspecific and isolate names (if any), and ending with the WGS master (if any) and the assembly accession.
* The output file names are created with the organism name, followed by the intraspecific and isolate names (if any), and ending with the WGS master (if any) and the assembly accession. The file extension depends on the file type specified using option -f.
* After each run, a file `summary.txt` containing the selected row(s) of the GenBank or RefSeq tab-separated assembly report is written. If the option -n is not set, this file is supplemented with the name(s) of the written FASTA files (first column 'fasta_file').
* After each run, a file `summary.txt` containing the selected row(s) of the GenBank or RefSeq tab-separated assembly report is written. If the option -n is not set, this file is supplemented with the name(s) of the written files (first column 'file').
* Very fast running times are expected when running _wgetGenBankWGS_ on multiple threads. As a rule of thumb, using twice the maximum number of available threads generally leads to good performance with bacterial genomes (depending on the bandwidth).
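
As a rough illustration of the rule of thumb above, the thread count can be derived from the machine rather than hard-coded. This sketch assumes `getconf _NPROCESSORS_ONLN` is available (it is on Linux and macOS, although POSIX does not require it) and falls back to 1 otherwise. The variable names are illustrative, and the final command line is only printed, not run.

```shell
# Number of online hardware threads; fall back to 1 if it cannot be determined.
ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
case "$ncpu" in ''|*[!0-9]*) ncpu=1 ;; esac

# Rule of thumb from the note above: twice the available threads.
nthreads=$(( 2 * ncpu ))

# Print the command line rather than running it (no network access needed here).
echo "wgetGenBankWGS.sh -e \"Klebsiella.*type material\" -f 3 -z -t $nthreads"
```

Whether doubling actually helps depends on the bandwidth, as the note says, so the computed value is a starting point to adjust rather than a guaranteed optimum.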
......