Commit 9ff9f769 authored by Alexis CRISCUOLO

v0.2.190228ac

parent 923c2e48
# wgetGenBankWGS
a tool to download genome assembly files in FASTA format from the GenBank or RefSeq repositories
_wgetGenBankWGS_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) to download genome assembly files in FASTA format from the GenBank or RefSeq repositories.
The FASTA files to download are selected from the [GenBank](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt) or [RefSeq](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt) genome assembly reports using [extended regular expressions](https://www.gnu.org/software/grep/manual/grep.html#Regular-Expressions) as implemented by [_grep_](https://www.gnu.org/software/grep/) (with option -E).
Every download is performed by the standard tool [_wget_](https://www.gnu.org/software/wget/).
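The selection step described above can be previewed with plain _grep_. The following is only a minimal sketch, not the script's actual implementation: the function names `sample_report` and `select_rows` are hypothetical, and the sample rows are made-up, heavily truncated stand-ins for real assembly report lines (which contain many more tab-separated columns).

```bash
#!/usr/bin/env bash
# Hypothetical, truncated stand-in for a few rows of an assembly report.
# The accessions below are invented for illustration only.
sample_report() {
  printf 'GCF_000000001.1\tSalmonella enterica\tComplete Genome\n'
  printf 'GCF_000000002.1\tSalmonella phage X\tComplete Genome\n'
  printf 'GCF_000000003.1\tSalmonella enterica\tContig\n'
  printf 'GCF_000000004.1\tListeria monocytogenes\tComplete Genome\n'
}

# Keep the rows matching the selection pattern, then drop the rows matching
# the exclusion pattern (this second step is skipped when no exclusion
# pattern is given). Both patterns are extended regular expressions.
select_rows() {
  local include="$1" exclude="${2:-}"
  if [ -n "$exclude" ]; then
    grep -E -- "$include" | grep -v -E -- "$exclude"
  else
    grep -E -- "$include"
  fi
}

# Complete Salmonella genomes, excluding phage and virus entries.
sample_report | select_rows "Salmonella.*Complete Genome" "phage|virus"
```

Trying a pattern on a local copy of the assembly report in this way is a cheap check that it selects the intended rows before any download is started.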
## Installation and execution
Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/wgetGenBankWGS.git
```
Give the execute permission to the file `wgetGenBankWGS.sh`:
```bash
chmod +x wgetGenBankWGS.sh
```
Execute _wgetGenBankWGS_ with the following command line model:
```bash
./wgetGenBankWGS.sh [options]
```
## Usage
Launch _wgetGenBankWGS_ without any option to read the following documentation:
```
 wgetGenBankWGS

 Downloading FASTA-formatted nucleotide sequence files corresponding to selected entries from genome assembly report files:
   GenBank:  ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
   RefSeq:   ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt

 USAGE:
   wgetGenBankWGS.sh -e <pattern> [-v <pattern>] [-o <outdir>] [-t <nthreads>] [-n]

 where:
   -e <pattern>   extended regexp selection pattern (grep -E style; mandatory)
   -v <pattern>   extended regexp exclusion pattern (grep -E style; default: none)
   -d <string>    either 'genbank' or 'refseq' (default: genbank)
   -n             no download, i.e. to only print the number of selected files (default: not set)
   -t             type strain name(s) for each selected species gathered from straininfo.net (default: not set)
   -o <outdir>    output directory (default: .)
   -c <nthreads>  number of threads (default: 1)

 EXAMPLES:
   + get the total number of available complete Salmonella genomes inside RefSeq, as well as the type strain list:
       wgetGenBankWGS.sh -e "Salmonella.*Complete Genome" -v "phage|virus" -d refseq -n
   + get the total number of genomes deposited in 1996 (see details in the written file summary.txt):
       wgetGenBankWGS.sh -e "1996/[01-12]+/[01-31]+" -n
   + download in the directory Dermatophilaceae every available genome sequence from this family using 30 threads:
       wgetGenBankWGS.sh -e "Austwickia|Dermatophilus|Kineosphaera|Mobilicoccus|Piscicoccus|Tonsilliphilus" -o Dermatophilaceae -t 30
   + download in the current directory the non-Listeria genomes with the wgs_master starting with "PPP":
       wgetGenBankWGS.sh -e $'\t'"PPP.00000000" -v "Listeria"
```
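A note on the 1996 example: in a _grep_ -E pattern, a bracket expression such as `[01-12]` matches a single character from a set (here `0`, `1` or `2`), not a numeric range, so the pattern as written does not match a date such as `1996/05/14`. The sketch below shows a stricter alternative, assuming that dates in the assembly report are written as `YYYY/MM/DD`, as the example implies. The variable `date_1996` and the function `matches_1996` are hypothetical names used only for illustration.

```bash
#!/usr/bin/env bash
# Stricter extended regexp for any date of 1996 written as YYYY/MM/DD:
# months 01 to 12, days 01 to 31 (no per-month day check).
date_1996='1996/(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])'

# Succeed when the given string contains a date matched by the pattern.
matches_1996() {
  printf '%s\n' "$1" | grep -q -E -- "$date_1996"
}

# Report which of a few sample dates would be selected.
for d in 1996/05/14 1996/12/31 1997/01/01; do
  if matches_1996 "$d"; then
    echo "$d: selected"
  else
    echo "$d: not selected"
  fi
done
```

The same pattern can be passed to the option -e in place of the one shown in the example.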
## Notes
* The output FASTA file names are created from the organism name, followed by the intraspecific and isolate names (if any), and ending with the WGS master (if any) and the assembly accession.
* After each usage, a file `summary.txt` containing the selected row(s) of the GenBank or RefSeq tab-separated assembly report is written. If the option -n is not set, this file is supplemented with the name(s) of the written FASTA files (first column 'fasta_file').
* Very fast running times are expected when running _wgetGenBankWGS_ on multiple threads. As a rule of thumb, using twice the maximum number of available threads generally leads to good performance with bacterial genomes (depending on the bandwidth).
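The last two notes can be turned into small helpers. This is a sketch under stated assumptions rather than part of the tool: `suggested_threads` and `list_fasta_files` are hypothetical names, `nproc` is assumed to be available (GNU coreutils), and `summary.txt` is assumed to start with one header line whose first column is 'fasta_file'.

```bash
#!/usr/bin/env bash
# Rule of thumb from the notes: twice the number of available threads.
# Falls back to a single available thread when nproc cannot be run.
suggested_threads() {
  local n
  n=$(nproc 2>/dev/null || echo 1)
  echo $(( 2 * n ))
}

# Print the first column of a tab-separated summary file, skipping its first
# line. Assumption: the file was written without the option -n and begins
# with a header line, so the first column holds the FASTA file names.
list_fasta_files() {
  local summary="${1:-summary.txt}"
  tail -n +2 "$summary" | cut -f1
}

echo "suggested thread count: $(suggested_threads)"
if [ -f summary.txt ]; then
  list_fasta_files summary.txt
fi
```

The suggested value can then be passed to the thread option reported by the built-in documentation of the installed version.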