wgetENAHTS
wgetENAHTS is a command-line program written in Bash to download gzipped FASTQ files from the European Nucleotide Archive (ENA) FTP repository. Every download is performed using the standard tool wget.
Installation and execution
Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/wgetENAHTS.git
Give the execute permission to the file wgetENAHTS.sh:
chmod +x wgetENAHTS.sh
Run wgetENAHTS using the following command-line pattern:
./wgetENAHTS.sh [options]
Usage
Run wgetENAHTS without any option to display the following documentation:
USAGE: wgetENAHTS.sh [[-o <dir>] [-f <infile>]
[-t <nthreads>] [-r <rate>] [-n] [-h]] [<accn> ...]
Downloads FASTQ files corresponding to the specified DRR/ERR/SRR accession(s)
Files are downloaded from the ENA ftp repository ftp.sra.ebi.ac.uk/vol1/fastq
OPTIONS:
-o <dir> output directory (default: .)
-f <file> to read accession(s) from the specified file (default: all the last
arguments)
-t <int> number of thread(s) (default: 2)
-r <int> maximum download rate per file (in kb per seconds; default: entire
available bandwidth)
-n no file download, only check (default: not set)
-h prints this help and exits
EXAMPLES:
+ downloading the SE FASTQ file corresponding to accession DRR000003:
wgetENAHTS.sh DRR000003
+ downloading the FASTQ files corresponding to accessions ERR000001 and ERR000004:
wgetENAHTS.sh ERR000001 ERR000004
+ assessing the repository existence for accessions SRR9870010-39:
wgetENAHTS.sh -n SRR98700{10..39}
+ downloading the FASTQ files (if any) corresponding to accessions SRR9870010-39:
wgetENAHTS.sh SRR98700{10..39}
+ same as above with (at most) 6 parallel downloads and saved outputs:
wgetENAHTS.sh -t 6 SRR98700{10..39} > log.txt 2> err.txt
+ downloading the FASTQ files from accessions available in the file accn.txt:
wgetENAHTS.sh -f accn.txt
+ same as above with 9 parallel downloads and 500kb/sec download rate per file:
wgetENAHTS.sh -t 9 -r 500 -f accn.txt
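Several of the examples above rely on Bash brace expansion (SRR98700{10..39}), which the shell expands into 30 separate accessions before wgetENAHTS.sh is started. The short sketch below shows that expansion and uses it to write an accession list for option -f. The one-accession-per-line layout and the temporary file are assumptions made for illustration; the help text does not describe the expected layout of the input file.

```bash
#!/usr/bin/env bash
# Brace expansion is performed by Bash itself: SRR98700{10..39} becomes
# SRR9870010 SRR9870011 ... SRR9870039 on the command line.

# Temporary file standing in for accn.txt (hypothetical location).
list_file=$(mktemp)

# Assumption: one accession per line is a suitable layout for option -f.
printf '%s\n' SRR98700{10..39} > "$list_file"

head -n 1 "$list_file"   # first accession of the range
tail -n 1 "$list_file"   # last accession of the range
wc -l < "$list_file"     # number of accessions written
```

A file written this way could then be given to wgetENAHTS.sh with option -f, provided the script accepts that layout.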
Notes
- The HTS read accessions should start with DRR, ERR or SRR (specified as final arguments, or in a text file using option -f). The output file names are identical to those available in the repository corresponding to each specified accession identifier. Every downloaded file has the file extension .fastq.gz.
- For each specified accession, a summary file (extension .weh) is written. This summary file contains the list of associated FASTQ file(s) together with their expected MD5 hash value.
- After checking the existence of a repository for each specified accession, a first step of (parallel) downloading is performed. Each downloaded file that appears incomplete (its MD5 checksum differs from the expected value) is downloaded a second time.
- No download is performed when the output directory already contains files named with the specified accessions.
- For a given DRR/ERR/SRR accession, the existence of a repository within the ENA can be easily assessed using option -n (i.e. no file download).
- Fast running times are expected when running wgetENAHTS on multiple threads (option -t). Depending on the bandwidth, the maximum download rate per file can be restricted using option -r.
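The MD5 check mentioned in the notes can also be done by hand on any downloaded file. The sketch below is not taken from wgetENAHTS: check_md5 is a hypothetical helper, it assumes that GNU md5sum is installed, and the expected hash has to be supplied by the user (for instance copied from a summary file, whose exact layout is not described here).

```bash
#!/usr/bin/env bash
# check_md5 FILE EXPECTED_MD5   (hypothetical helper, not part of wgetENAHTS)
# Returns 0 when the MD5 hash of FILE equals EXPECTED_MD5, and 1 otherwise.
check_md5() {
  local file=$1 expected=$2 observed
  [ -f "$file" ] || return 1
  # md5sum prints "<hash>  -" when reading standard input; keep the hash only.
  observed=$(md5sum < "$file" | cut -d' ' -f1)
  [ "$observed" = "$expected" ]
}

# Usage sketch on a small temporary file standing in for a FASTQ download.
demo=$(mktemp)
printf 'hello\n' > "$demo"
expected=$(md5sum < "$demo" | cut -d' ' -f1)

if check_md5 "$demo" "$expected"; then
  echo "complete"
else
  echo "incomplete: download again"
fi
```

Comparing against the expected hash, rather than only checking that the file exists, is what allows a truncated download to be detected.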