# wgetGenBankWGS

_wgetGenBankWGS_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) to download genome assembly files from the GenBank or RefSeq repositories.
The files to download are selected from the [GenBank](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt) or [RefSeq](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt) genome assembly reports using [extended regular expressions](https://www.gnu.org/software/grep/manual/grep.html#Regular-Expressions) as implemented by [_grep_](https://www.gnu.org/software/grep/) (with option -E).
Every download is performed by the standard tool [_wget_](https://www.gnu.org/software/wget/).
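
The selection step can be pictured as a `grep -E` filter applied to the tab-separated assembly report. The following is only a minimal sketch of that idea, not the actual code of the script: the three-row report and its columns are made up for illustration, and the real reports contain many more columns.

```bash
# Toy stand-in for an assembly report (hypothetical, heavily simplified rows).
report=$(mktemp)
printf 'GCF_000000001.1\tSalmonella enterica\tComplete Genome\n'  > "$report"
printf 'GCF_000000002.1\tSalmonella phage X\tComplete Genome\n'  >> "$report"
printf 'GCF_000000003.1\tKlebsiella pneumoniae\tScaffold\n'      >> "$report"

# Select rows with an extended regexp (the role of option -e),
# then discard rows matching a second one (the role of option -v).
selected=$(grep -E "Salmonella.*Complete Genome" "$report" | grep -v -E "phage|virus")

# Count the non-empty selected rows, in the spirit of a dry run with option -n.
count=$(printf '%s\n' "$selected" | grep -c .)
echo "$count selected row(s)"

rm -f "$report"
```

Because the patterns are matched against whole rows, any column of the report (organism name, assembly level, accession, date) can be used for selection or exclusion.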


## Installation and execution

Clone this repository with the following command line:

```bash
git clone https://gitlab.pasteur.fr/GIPhy/wgetGenBankWGS.git
```

Give the execute permission to the file `wgetGenBankWGS.sh`:
```bash
chmod +x wgetGenBankWGS.sh
```

Execute _wgetGenBankWGS_ with the following command line model:
```bash
./wgetGenBankWGS.sh [options]
```

## Usage

Launch _wgetGenBankWGS_ without any option to display the following documentation:

```
 wgetGenBankWGS v.0.4.200504ac

 Downloading sequence files corresponding to selected entries from genome assembly report files:
   GenBank:  ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
   RefSeq:   ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt

 Writing output files 'Species.isolate--accn--GCA' with the following content (and extension):
   -f 1      genomic sequence(s) in FASTA format (.fasta)
   -f 2      genomic sequence(s) in GenBank format (.gbk)
   -f 3      annotations in GFF3 format (.gff)
   -f 4      codon CDS in FASTA format (.fasta)
   -f 5      amino acid CDS in FASTA format (.fasta)
   -f 6      RNA sequences in FASTA format (.fasta)

 USAGE:  
    wgetGenBankWGS.sh  -e <pattern>  [-v <pattern>]  [-o <outdir>]  [-f <integer>]  [-n]  [-z]  [-t <nthreads>]
  where:
    -e <pattern>  extended regexp selection pattern (mandatory) 
    -v <pattern>  extended regexp exclusion pattern (default: none)
    -d <string>   either 'genbank' or 'refseq' (default: genbank)
    -n            no download, i.e. to only print the number of selected files (default: not set)
    -f <integer>  file type identifier (see above; default: 1)
    -z            no unzip, i.e. downloaded files are compressed (default: not set)
    -o <outdir>   output directory (default: .)
    -t <nthreads> number of threads (default: 1)

 EXAMPLES:
  + getting the total number of available complete Salmonella genomes inside RefSeq:
     wgetGenBankWGS.sh -e "Salmonella.*Complete Genome" -v "phage|virus" -d refseq -n

  + getting the total number of genomes inside GenBank deposited in 1996:
     wgetGenBankWGS.sh -e "1996/[01-12]+/[01-31]+" -n
 
  + getting the total number of available SARS-CoV-2 genomes (taxid=694009) inside GenBank:
     wgetGenBankWGS.sh -e $'\t'694009$'\t' -n
 
  + downloading the full RefSeq assembly report:
      wgetGenBankWGS.sh -e "/" -d refseq -n
 
  + downloading the GenBank files with the assembly accessions GCF_900002335, GCF_000002415 and GCF_000002765:
     wgetGenBankWGS.sh -e "GCF_900002335|GCF_000002415|GCF_000002765" -d refseq

  + downloading in the directory Dermatophilaceae every available genome sequence from this family using 30 threads:
     wgetGenBankWGS.sh -e "Austwickia|Dermatophilus|Kineosphaera|Mobilicoccus|Piscicoccus|Tonsilliphilus" -o Dermatophilaceae -t 30

  + downloading the non-Listeria proteomes with the wgs_master starting with "PPP":
     wgetGenBankWGS.sh -e $'\t'"PPP.00000000" -v "Listeria" -f 5

  + downloading the genome annotation of every Klebsiella type strain in compressed gff3 format using 30 threads:
     wgetGenBankWGS.sh -e "Klebsiella.*type material" -f 3 -z -t 30

```
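
The taxid example above wraps the number in `$'\t'`, the Bash ANSI-C quoting for a literal tab, so that the pattern only matches a complete tab-separated field. The sketch below illustrates the difference on two made-up rows; the accessions and names are hypothetical and the rows are far shorter than real report rows.

```bash
# Two made-up tab-separated rows: only the first has 694009 as a whole field.
rows=$(printf 'GCA_1\t694009\tvirus A\nGCA_2\t1694009\tvirus B\n')

# Unanchored pattern: also matches inside the longer number 1694009.
loose=$(printf '%s\n' "$rows" | grep -c -E "694009")

# Tabs on both sides restrict the match to a complete field.
strict=$(printf '%s\n' "$rows" | grep -c -E $'\t'694009$'\t')

echo "loose=$loose strict=$strict"
```

The same quoting is used in the `"PPP.00000000"` example, where a leading tab anchors the pattern to the start of a field.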


## Notes

* The output file names are created with the organism name, followed by the intraspecific and isolate names (if any), and ending with the WGS master (if any) and the assembly accession. The file extension depends on the file type specified with option -f.

* After each run, a file `summary.txt` containing the selected row(s) of the tab-separated GenBank or RefSeq assembly report is written. If option -n is not set, this file also lists the name(s) of the written files (first column 'file').

* Very fast running times are expected when running _wgetGenBankWGS_ on multiple threads. As a rule of thumb, using twice the maximum number of available threads generally leads to good performance with bacterial genomes (depending on the bandwidth).
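
The rule of thumb on threads can be turned into a short computation. The sketch below assumes that `nproc` (GNU coreutils) is installed and that the number of processing units it reports is a reasonable proxy for the available threads; the resulting command line is only printed, not executed, and its pattern is borrowed from the usage examples.

```bash
# Number of processing units reported by nproc; fall back to 1 if nproc is missing.
ncpu=$(nproc 2>/dev/null || echo 1)

# Rule of thumb: twice the number of available threads.
nthreads=$(( 2 * ncpu ))

# Print the suggested command line instead of starting a download.
echo "./wgetGenBankWGS.sh -e \"Klebsiella.*type material\" -f 3 -z -t $nthreads"
```

On a slow connection the bandwidth, not the thread count, is likely to be the limiting factor, so a smaller value may work just as well.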