Commit 9ff9f769 authored by Alexis CRISCUOLO

v0.2.190228ac

parent 923c2e48
# wgetGenBankWGS
_wgetGenBankWGS_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) to download genome assembly files in FASTA format from the GenBank or RefSeq repositories.
The FASTA files to download are selected from the [GenBank](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt) or [RefSeq](https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt) genome assembly reports using [extended regular expressions](https://www.gnu.org/software/grep/manual/grep.html#Regular-Expressions) as implemented by [_grep_](https://www.gnu.org/software/grep/) (with option -E).
Every download is performed by the standard tool [_wget_](https://www.gnu.org/software/wget/).
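The selection step can be tried offline before any download. The snippet below is a minimal sketch, not the code of _wgetGenBankWGS_ itself: the function `select_entries` and the three mock rows are assumptions made for illustration, and the real assembly reports contain many more columns.

```bash
#!/bin/bash
# Hypothetical helper (not part of wgetGenBankWGS): filters report lines read
# from standard input with an inclusion pattern and an optional exclusion pattern.
select_entries() {
  local include="$1" exclude="${2:-^#}"   # by default, only header lines are excluded
  grep -E "$include" | grep -v -E "$exclude"
}

# Three mock tab-separated rows standing in for a real assembly report.
report=$'GCA_000000001.1\tSalmonella enterica\tComplete Genome\n'
report+=$'GCA_000000002.1\tSalmonella phage X\tComplete Genome\n'
report+=$'GCA_000000003.1\tListeria monocytogenes\tContig\n'

# Keeps the complete Salmonella genomes, then discards phage and virus entries.
printf '%s' "$report" | select_entries "Salmonella.*Complete Genome" "phage|virus"
```

Only the first mock row passes both filters: the second one is discarded by the exclusion pattern, and the third one does not match the selection pattern.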
## Installation and execution
Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/wgetGenBankWGS.git
```
Give the execute permission to the file `wgetGenBankWGS.sh`:
```bash
chmod +x wgetGenBankWGS.sh
```
Execute _wgetGenBankWGS_ with the following command line model:
```bash
./wgetGenBankWGS.sh [options]
```
## Usage
Launch _wgetGenBankWGS_ without any option to read the following documentation:
```
 wgetGenBankWGS

 Downloading FASTA-formatted nucleotide sequence files corresponding to selected entries from genome assembly report files:
   GenBank:  ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
   RefSeq:   ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt

 USAGE:
    wgetGenBankWGS.sh  -e <pattern>  [-v <pattern>]  [-d <string>]  [-o <outdir>]  [-c <nthreads>]  [-n]  [-t]

 where:
    -e <pattern>   extended regexp selection pattern (grep -E style; mandatory)
    -v <pattern>   extended regexp exclusion pattern (grep -E style; default: none)
    -d <string>    either 'genbank' or 'refseq' (default: genbank)
    -n             no download, i.e. to only print the number of selected files (default: not set)
    -t             type strain name(s) for each selected species gathered from straininfo.net (default: not set)
    -o <outdir>    output directory (default: .)
    -c <nthreads>  number of threads (default: 1)

 EXAMPLES:
  + get the total number of available complete Salmonella genomes inside RefSeq, as well as the type strain list:
      wgetGenBankWGS.sh -e "Salmonella.*Complete Genome" -v "phage|virus" -d refseq -n -t

  + get the total number of genomes deposited in 1996 (see details in the written file summary.txt):
      wgetGenBankWGS.sh -e "1996/[0-9]{2}/[0-9]{2}" -n

  + download in the directory Dermatophilaceae every available genome sequence from this family using 30 threads:
      wgetGenBankWGS.sh -e "Austwickia|Dermatophilus|Kineosphaera|Mobilicoccus|Piscicoccus|Tonsilliphilus" -o Dermatophilaceae -c 30

  + download in the current directory the non-Listeria genomes with the wgs_master starting with "PPP":
      wgetGenBankWGS.sh -e $'\t'"PPP.00000000" -v "Listeria"
```
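The last example relies on the Bash ANSI-C quoting `$'\t'`, which is expanded into a literal tab character so that the pattern only matches a field starting with "PPP". The following sketch illustrates this with two mock rows; their four-column layout is a simplified assumption and does not reproduce the real report.

```bash
#!/bin/bash
# $'\t' becomes a literal tab before grep sees the pattern, so "PPP" must be
# the very beginning of a tab-separated field.
pattern=$'\t'"PPP.00000000"

# Mock rows (assumed columns: accession, bioproject, biosample, wgs_master).
row1=$'GCA_000000001.1\tPRJNA1\tSAMN1\tPPPA00000000.1'
row2=$'GCA_000000002.1\tPRJNA2\tSAMN2\tXPPPA00000000.1'

# Prints the number of rows matched by the pattern.
printf '%s\n' "$row1" "$row2" | grep -c -E "$pattern"
```

Only the first row is counted: in the second one, "PPP" is preceded by another character and is therefore not at the start of the field.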
## Notes
* The output FASTA file names are built from the organism name, followed by the infraspecific and isolate names (if any), and end with the WGS master (if any) and the assembly accession.
* After each run, a file `summary.txt` containing the selected row(s) of the GenBank or RefSeq tab-separated assembly report is written. If the option -n is not set, the name(s) of the written FASTA files are added to this file (first column 'fasta_file').
* Running _wgetGenBankWGS_ on multiple threads (option -c) is expected to lead to very fast running times. As a rule of thumb, using twice the maximum number of available threads generally leads to good performance with bacterial genomes (depending on the bandwidth).
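The naming rule described in the first note can be sketched as follows. This is a simplified illustration that loosely mirrors the script: the function name `fasta_name`, its argument order, and the mock accessions are assumptions, and the abbreviation clean-up performed by the script (e.g. "sp.", "subsp.") is left out.

```bash
#!/bin/bash
# Hypothetical helper: builds a FASTA file name from five report fields.
fasta_name() {
  local organism="$1" infra="$2" isolate="$3" wgs="$4" accn="$5"
  local special=',/?%*:|"<>()[]#;' name strain iso
  # special characters are replaced by '_'
  name=$(printf '%s' "$organism" | tr "$special" '_')
  strain=$(printf '%s' "${infra#strain=}" | tr "$special" '_')
  iso=$(printf '%s' "$isolate" | tr "$special" '_')
  # strain and isolate are appended only when the name does not already contain them
  if [ -n "$strain" ] && [[ "$name" != *"$strain"* ]]; then name="$name.$strain"; fi
  if [ -n "$iso" ] && [[ "$name" != *"$iso"* ]]; then name="$name.$iso"; fi
  # shortened WGS master (if any), then assembly accession
  if [ -n "$wgs" ]; then name="$name--${wgs:0:5}1"; fi
  if [ -n "$accn" ]; then name="$name--$accn"; fi
  # blank spaces become dots, and successive dots are merged
  name=$(printf '%s' "$name" | tr ' ' '.' | sed 's/\.\.*/./g')
  printf '%s.fasta\n' "$name"
}

fasta_name "Listeria monocytogenes" "strain=EGD-e" "" "" "GCA_000000000.1"
```

With these mock fields, the strain name is appended to the organism name because it is not already part of it, and the WGS master part is skipped because that field is empty.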
#!/bin/bash
#############################################################################################################
# #
# wgetGenBankWGS: downloading WGS nucleotide sequences from NCBI #
# #
# Copyright (C) 2019 Alexis Criscuolo #
# #
# This program is free software: you can redistribute it and/or modify it under the terms of the GNU #
# General Public License as published by the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. #
# #
# This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even #
# the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public #
# License for more details. #
# #
# You should have received a copy of the GNU General Public License along with this program. If not, see #
# <http://www.gnu.org/licenses/>. #
# #
# Contact: #
# Institut Pasteur #
# Bioinformatics and Biostatistics Hub #
# C3BI, USR 3756 IP CNRS #
# Paris, FRANCE #
# #
# alexis.criscuolo@pasteur.fr #
# #
#############################################################################################################
#############################################################################################################
# #
# ============ #
# = VERSIONS = #
# ============ #
# #
VERSION=0.2.190228ac #
# + option -d for downloading from either genbank or refseq #
# + option -t to get the type strain name(s) for each selected species #
# #
# VERSION=0.1.190124ac #
# #
#############################################################################################################
#############################################################################################################
# #
# ============ #
# = DOC = #
# ============ #
# #
if [ "$1" = "-?" ] || [ "$1" = "-h" ] || [ $# -le 1 ] #
then #
cat <<EOF
wgetGenBankWGS v.$VERSION
Downloading FASTA-formatted nucleotide sequence files corresponding to selected entries from genome assembly report files:
GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
USAGE:
wgetGenBankWGS.sh -e <pattern> [-v <pattern>] [-d <string>] [-o <outdir>] [-c <nthreads>] [-n] [-t]
where:
-e <pattern> extended regexp selection pattern (mandatory)
-v <pattern> extended regexp exclusion pattern (default: none)
-d <string> either 'genbank' or 'refseq' (default: genbank)
-n no download, i.e. to only print the number of selected files (default: not set)
-t type strain name(s) for each selected species gathered from straininfo.net (default: not set)
-o <outdir> output directory (default: .)
-c <nthreads> number of threads (default: 1)
EXAMPLES:
+ getting the total number of available complete Salmonella genomes inside RefSeq:
wgetGenBankWGS.sh -e "Salmonella.*Complete Genome" -v "phage|virus" -d refseq -n
+ getting the total number of genomes deposited in 1996:
wgetGenBankWGS.sh -e "1996/[0-9]{2}/[0-9]{2}" -n
+ downloading in the directory Dermatophilaceae every available genome sequence from this family using 30 threads:
wgetGenBankWGS.sh -e "Austwickia|Dermatophilus|Kineosphaera|Mobilicoccus|Piscicoccus|Tonsilliphilus" -o Dermatophilaceae -c 30
+ downloading in the current directory the non-Listeria genomes with the wgs_master starting with "PPP":
wgetGenBankWGS.sh -e $'\t'"PPP.00000000" -v "Listeria"
EOF
exit 1 ; #
fi #
# #
#############################################################################################################
#############################################################################################################
# #
# =============== #
# = CONSTANTS = #
# =============== #
# #
WGETOPT="--retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 -q";
STRAININFO="http://www.straininfo.net";
# #
# #
# =============== #
# = FUNCTIONS = #
# =============== #
# #
# = gettime() arguments: ============================================================================== #
# 1. START: the starting time in seconds
# returns the elapsed time since $START
gettime() {
t=$(( $SECONDS - $1 )); sec=$(( $t % 60 )); min=$(( $t / 60 ));
if [ $sec -lt 10 ]; then sec="0$sec"; fi
if [ $min -lt 10 ]; then min="0$min"; fi
echo "[$min:$sec]" ;
}
# = randomfile() arguments: ============================================================================== #
# 1. PREFIX: prefix file name #
# returns a random file name from a given PREFIX file name #
# #
randomfile() {
rdmf=$1.$RANDOM; while [ -e $rdmf ]; do rdmf=$1.$RANDOM ; done
echo $rdmf ;
}
# #
# = dwnl() arguments: ==================================================================================== #
# 1. URL: URL of the file to download #
# 2. OUTFILE: output file name #
# downloads the file from URL and writes it into OUTFILE #
# #
dwnl() {
tmp=$(randomfile $2);
while [ 1 ]
do
wget $WGETOPT -O $tmp $1 ;
if [ $? == 0 ]; then mv $tmp $2 ; break; fi
sleep 1 ;
done
}
# #
# = dwnlgz() arguments: ================================================================================== #
# 1. URL: URL of the gz file to download #
# 2. OUTFILE: output file name #
# downloads the file from URL and unzip it into OUTFILE #
# #
dwnlgz() {
tmp=$(randomfile $2);
while [ 1 ]
do
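wget $WGETOPT -O - $1 | gunzip -c > $tmp ;
### PIPESTATUS holds the exit status of both wget and gunzip, whereas $? only reflects gunzip
if [ "${PIPESTATUS[*]}" == "0 0" ]; then mv $tmp $2 ; break; fi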
sleep 1 ;
done
}
# #
# = straininfo() arguments: ============================================================================== #
# 1. GENUS: genus name #
# 2. SPECIES: species name #
# 3. COOKIE: cookie file name (if unknown, it will be first generated) #
# returns the list of the type strain names gathered from straininfo.net #
# #
straininfo() {
if [ ! -e $3 ]
then
while [ 1 ]
do
wget $WGETOPT --keep-session-cookies --save-cookies=$3 -O /dev/null "$STRAININFO" ;
[ $? == 0 ] && break || sleep 1 ;
done
fi
while [ 1 ]
do
strainlist="$(wget $WGETOPT --load-cookies=$3 -O - "$STRAININFO/taxonGet.jsp?taxon=$1%20$2" | grep -F "is <strong>type strain</strong> of:<br/>" |
sed -e 's/<div class='"'"'popup'"'"'>//g;s/ is <strong>type strain<\/strong> of:<br\/>//g' | tr '\n' '\t' | sed 's/\t$/\n/')";
[ $? == 0 ] && break || sleep 1 ;
done
echo -e "$strainlist" ;
}
# #
#############################################################################################################
#############################################################################################################
#### ####
#### INITIALIZING PARAMETERS AND READING OPTIONS ####
#### ####
#############################################################################################################
INCLUDE_PATTERN="";
EXCLUDE_PATTERN="^#";
REPOSITORY="genbank";
OUTDIR=".";
TYPES=false;
NTHREADS=1;
DWNL=true;
WAITIME=0.5;
while getopts :e:v:o:c:d:nt option
do
case $option in
e) INCLUDE_PATTERN="$OPTARG" ;;
v) EXCLUDE_PATTERN="$OPTARG" ;;
d) REPOSITORY="$OPTARG" ;;
o) OUTDIR="$OPTARG" ;;
c) NTHREADS=$OPTARG ;;
n) DWNL=false ;;
t) TYPES=true ;;
:) echo "option $OPTARG : missing argument" ; exit 1 ;;
\?) echo "option $OPTARG : invalid option" ; exit 1 ;;
esac
done
if [ -z "$INCLUDE_PATTERN" ]; then echo "no specified pattern (option -e)" ; exit 1 ; fi
if [ $NTHREADS -lt 1 ]; then echo "incorrect number of threads (option -c): $NTHREADS" ; exit 1 ; fi
if [ "$REPOSITORY" != "genbank" ] && [ "$REPOSITORY" != "refseq" ]; then echo "incorrect repository name (option -d): $REPOSITORY" ; exit 1 ; fi
ASSEMBLY_REPORT=ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_$REPOSITORY.txt;
if [ ! -e "$OUTDIR" ]; then echo "creating output directory: $OUTDIR" ; mkdir -p "$OUTDIR" ; fi
SUMMARY=$OUTDIR/summary.txt;
t=$(( $(date +%s%N) / 1000000 ));
while [ 1 ]
do
wget $WGETOPT -O - ftp://$ASSEMBLY_REPORT | sed -n '2p' ;
[ $? == 0 ] && break || sleep 1 ;
done > $SUMMARY
f=$(( $(date +%s%N) / 1000000 - $t ));
while [ 1 ]
do
wget $WGETOPT -O - https://$ASSEMBLY_REPORT | sed -n '2p' ;
[ $? == 0 ] && break || sleep 1 ;
done > $SUMMARY
h=$(( $(date +%s%N) / 1000000 - $f - $t ));
[ $h -lt $f ] && PROTOCOL="https:" || PROTOCOL="ftp:";
#############################################################################################################
#### ####
#### SELECTING WGS ENTRIES ####
#### ####
#############################################################################################################
echo "selection criterion: $INCLUDE_PATTERN" ;
if [ "$EXCLUDE_PATTERN" != "^#" ]; then echo "exclusion criterion: $EXCLUDE_PATTERN" ; fi
tmp=$(randomfile $SUMMARY);
while [ 1 ]
do
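wget $WGETOPT -O - $PROTOCOL"//"$ASSEMBLY_REPORT | grep -E "$INCLUDE_PATTERN" | grep -v -E "$EXCLUDE_PATTERN" | grep -F "ftp://ftp.ncbi.nlm.nih.gov" > $tmp ;
### testing the exit status of wget only: grep returns 1 when no entry is selected, which should not trigger a new download
if [ ${PIPESTATUS[0]} -eq 0 ]; then cat $tmp >> $SUMMARY ; rm -f $tmp ; break; fi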
sleep 1 ;
done
n=$(grep -v -c "^#" $SUMMARY);
echo "$REPOSITORY: $n WGS nucleotide sequence FASTA files" ;
if [ $n -eq 0 ]; then exit 0 ; fi
if ! $DWNL ; then echo "see details in the report file: $SUMMARY" ; fi
#############################################################################################################
#### ####
#### GETTING TYPE STRAIN ISOLATE NAMES ####
#### ####
#############################################################################################################
if $TYPES
then
TYPESTRAINS=$OUTDIR/type.strains.txt;
COOKIE=$(randomfile $TYPESTRAINS);
awk -F"\t" '! /^#/{print$8}' $SUMMARY | awk -F" " '{print$1"\t"$2}' | sort -u |
while read -r genus species ; do echo -e "$genus $species\t$(straininfo $genus $species $COOKIE)" ; done > $TYPESTRAINS ;
echo "see type strain name(s) for each selected species in the following file: $TYPESTRAINS" ;
rm -f $COOKIE ;
fi
if ! $DWNL ; then exit 0 ; fi
#############################################################################################################
#### ####
#### DOWNLOADING WGS NUCLEOTIDE SEQUENCES ####
#### ####
#############################################################################################################
FULLSUMMARY=$(randomfile $SUMMARY);
head -1 $SUMMARY | sed 's/^# /# fasta_file\t/' > $FULLSUMMARY ;
START=$SECONDS;
i=-1;
tr '\t' '|' < $SUMMARY |
while IFS="|" read -r assembly_accession _ _ wgs_master _ _ _ organism_name infraspecific_name isolate _ _ _ _ _ _ _ _ _ ftp_path _ _
do
let i++; if [ $i -lt 1 ]; then continue; fi
>&2 echo "$(gettime $START) [$i/$n] $organism_name | $infraspecific_name | $isolate | $assembly_accession | $wgs_master | $ftp_path" ;
GZFILE=$(basename $ftp_path)"_genomic.fna.gz";
NAME=$(echo "$organism_name" | tr ',/\?%*:|"<>()[]#;' '_' | ### replacing special char. by '_'
sed -e 's/ bv\./ bv/;s/ genomosp\./ genomosp/;s/ sp\./ sp/;s/ str\./ str/;s/ subsp\./ subsp/');
STRAIN=$(echo "$infraspecific_name" | sed 's/strain=//g' | tr ',/\?%*:|"<>()[]#;' '_'); ### replacing special char. by '_'
[ -n "$STRAIN" ] && [ $(echo "$NAME" | grep -c -F "$STRAIN") -eq 0 ] && NAME="$NAME.$STRAIN";
ISOLATE=$(echo "$isolate" | tr ',/\?%*:|"<>()[]#;' '_'); ### replacing special char. by '_'
[ -n "$ISOLATE" ] && [ $(echo "$NAME" | grep -c -F "$ISOLATE") -eq 0 ] && NAME="$NAME.$ISOLATE";
accn=${wgs_master:0:5}"1";
[ -n "$wgs_master" ] && NAME="$NAME""--""$accn";
[ -n "$assembly_accession" ] && NAME="$NAME""--""$assembly_accession";
URL=$(echo $ftp_path | sed "s/ftp:/$PROTOCOL/")"/$GZFILE";
OUTFILE=$(echo "$NAME" | tr ' ' '.' | sed 's/\.\.*/\./g').fasta; ### replacing blank spaces by '.', and successive dots by only one
echo -e "$OUTFILE\t$(sed -n "$(( $i + 1 )) p" $SUMMARY)" ;
dwnlgz $URL $OUTDIR/$OUTFILE &
while [ $(jobs -r | wc -l) -ge $NTHREADS ]; do sleep $WAITIME ; done ### no more than $NTHREADS simultaneous downloads
done >> $FULLSUMMARY
wait ;
mv $FULLSUMMARY $SUMMARY ;
echo "see details in the report file: $SUMMARY" ;
exit ;