Commit 6dea2000 authored by Alexis CRISCUOLO

0.6

parent 3debe216
......@@ -14,27 +14,40 @@ git clone https://gitlab.pasteur.fr/GIPhy/wgetGenBankWGS.git
```
Give the execute permission to the file `wgetGenBankWGS.sh`:
```bash
chmod +x wgetGenBankWGS.sh
```
Execute _wgetGenBankWGS_ with the following command line model:
Run _wgetGenBankWGS_ with the following command line model:
```bash
./wgetGenBankWGS.sh [options]
```
## Usage
Launch _wgetGenBankWGS_ without option to read the following documentation:
Run _wgetGenBankWGS_ without option to read the following documentation:
```
wgetGenBankWGS v.0.4.200504ac
wgetGenBankWGS Copyright (C) 2019-2021 Institut Pasteur
Downloading sequence files corresponding to selected entries from genome assembly report files:
GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
Writing output files 'Species.isolate--accn--GCA' with the following content (and extension):
Selected entries (options -e and -v) can be restricted to a specific phylum using option -p:
-p A archaea
-p B bacteria
-p F fungi
-p I invertebrate
-p M mammalia
-p N non-mammalia vertebrate
-p P plant
-p V virus
-p Z protozoa
Output files 'Species.isolate--accn--GC' can be written with the following content (and extension):
-f 1 genomic sequence(s) in FASTA format (.fasta)
-f 2 genomic sequence(s) in GenBank format (.gbk)
-f 3 annotations in GFF3 format (.gff)
......@@ -43,11 +56,13 @@ Launch _wgetGenBankWGS_ without option to read the following documentation:
-f 6 RNA sequences in FASTA format (.fasta)
USAGE:
wgetGenBankWGS.sh -e <pattern> [-v <pattern>] [-o <outdir>] [-f <integer>] [-n] [-z] [-t <nthreads>]
wgetGenBankWGS.sh -e <pattern> [-v <pattern>] [-d <repository>] [-p <phylum>]
[-o <outdir>] [-f <integer>] [-n] [-z] [-t <nthreads>]
where:
-e <pattern> extended regexp selection pattern (mandatory)
-v <pattern> extended regexp exclusion pattern (default: none)
-d <string> either 'genbank' or 'refseq' (default: genbank)
-p <char> specific phylum (see above; default: not set)
-n no download, i.e. to only print the number of selected files (default: not set)
-f <integer> file type identifier (see above; default: 1)
-z no unzip, i.e. downloaded files are compressed (default: not set)
......@@ -55,6 +70,9 @@ Launch _wgetGenBankWGS_ without option to read the following documentation:
-t <nthreads> number of threads (default: 1)
EXAMPLES:
+ getting the total number of available fungi genomes inside RefSeq:
wgetGenBankWGS.sh -e "ftp" -d refseq -p F -n
+ getting the total number of available complete Salmonella genomes inside RefSeq:
wgetGenBankWGS.sh -e "Salmonella.*Complete Genome" -v "phage|virus" -d refseq -n
......@@ -78,7 +96,6 @@ Launch _wgetGenBankWGS_ without option to read the following documentation:
+ downloading the genome annotation of every Klebsiella type strain in compressed gff3 format using 30 threads
wgetGenBankWGS.sh -e "Klebsiella.*type material" -f 3 -z -t 30
```
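As a rough illustration of what option `-p` changes, the sketch below prints the assembly report location that a repository and phylum code resolve to, using the same code-to-directory mapping and paths as this commit. The function name `summary_url` is hypothetical and is not part of `wgetGenBankWGS.sh`; the protocol prefix is left out, as it is in the script's `ASSEMBLY_REPORT` variable.

```bash
#!/usr/bin/env bash
# Hypothetical helper (NOT part of wgetGenBankWGS.sh): prints the assembly
# report location selected by a repository name and an optional -p code.
summary_url() {
  local repository="$1" code="${2:-all}" phylum
  case "$code" in
    all) # no -p given: the global report of the repository is used
         echo "ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_${repository}.txt"
         return 0 ;;
    A) phylum="archaea" ;;
    B) phylum="bacteria" ;;
    F) phylum="fungi" ;;
    I) phylum="invertebrate" ;;
    M) phylum="vertebrate_mammalian" ;;
    N) phylum="vertebrate_other" ;;
    P) phylum="plant" ;;
    V) phylum="viral" ;;
    Z) phylum="protozoa" ;;
    *) echo "incorrect phylum (option -p): $code" >&2
       return 1 ;;
  esac
  # -p given: the per-phylum report of the repository is used
  echo "ftp.ncbi.nlm.nih.gov/genomes/${repository}/${phylum}/assembly_summary.txt"
}

summary_url refseq F   # ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt
summary_url genbank    # ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
```

Because a per-phylum report only lists entries of that phylum, a pattern that matches every line (such as `-e "ftp"` in the first example) is enough to count all of them.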
......
......@@ -4,7 +4,7 @@
# #
# wgetGenBankWGS: downloading WGS genome assembly files from NCBI #
# #
# Copyright (C) 2019,2020 Institut Pasteur #
# Copyright (C) 2019-2021 Institut Pasteur #
# #
# This program is free software: you can redistribute it and/or modify it under the terms of the GNU #
# General Public License as published by the Free Software Foundation, either version 3 of the License, or #
......@@ -33,7 +33,11 @@
# = VERSIONS = #
# ============ #
# #
VERSION=0.5.201018ac #
VERSION=0.6.201018ac #
# + takes into account the last field 'asm_not_live_date' in genome assembly report files #
# + adding option -p to select a specific phylum #
# #
# VERSION=0.5.201018ac #
# + adding flag -T- or -t- in file name for type material #
# + adding flag -w- in file name for genomes excluded from RefSeq #
# #
......@@ -67,13 +71,24 @@ if [ "$1" = "-?" ] || [ "$1" = "-h" ] || [ $# -le 1 ]
then #
cat <<EOF
wgetGenBankWGS v.$VERSION Copyright (C) 2019-2020 Institut Pasteur
wgetGenBankWGS v.$VERSION Copyright (C) 2019-2021 Institut Pasteur
Downloading sequence files corresponding to selected entries from genome assembly report files:
GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
Writing output files 'Species.isolate--accn--GC' with the following content (and extension):
Selected entries (options -e and -v) can be restricted to a specific phylum using option -p:
-p A archaea
-p B bacteria
-p F fungi
-p I invertebrate
-p M mammalia
-p N non-mammalia vertebrate
-p P plant
-p V virus
-p Z protozoa
Output files 'Species.isolate--accn--GC' can be written with the following content (and extension):
-f 1 genomic sequence(s) in FASTA format (.fasta)
-f 2 genomic sequence(s) in GenBank format (.gbk)
-f 3 annotations in GFF3 format (.gff)
......@@ -82,11 +97,13 @@ then
-f 6 RNA sequences in FASTA format (.fasta)
USAGE:
wgetGenBankWGS.sh -e <pattern> [-v <pattern>] [-o <outdir>] [-f <integer>] [-n] [-z] [-t <nthreads>]
wgetGenBankWGS.sh -e <pattern> [-v <pattern>] [-d <repository>] [-p <phylum>]
[-o <outdir>] [-f <integer>] [-n] [-z] [-t <nthreads>]
where:
-e <pattern> extended regexp selection pattern (mandatory)
-v <pattern> extended regexp exclusion pattern (default: none)
-d <string> either 'genbank' or 'refseq' (default: genbank)
-p <char> specific phylum (see above; default: not set)
-n no download, i.e. to only print the number of selected files (default: not set)
-f <integer> file type identifier (see above; default: 1)
-z no unzip, i.e. downloaded files are compressed (default: not set)
......@@ -94,6 +111,9 @@ then
-t <nthreads> number of threads (default: 1)
EXAMPLES:
+ getting the total number of available fungi genomes inside RefSeq:
wgetGenBankWGS.sh -e "ftp" -d refseq -p F -n
+ getting the total number of available complete Salmonella genomes inside RefSeq:
wgetGenBankWGS.sh -e "Salmonella.*Complete Genome" -v "phage|virus" -d refseq -n
......@@ -211,14 +231,16 @@ OUTDIR=".";
NTHREADS=1;
DWNL=true;
FTYPE=1;
PHYLUM="all";
UNZIP=true;
WAITIME=0.5;
while getopts :e:v:o:t:d:f:nz option
while getopts :e:v:o:t:d:f:p:nz option
do
case $option in
e) INCLUDE_PATTERN="$OPTARG" ;;
v) EXCLUDE_PATTERN="$OPTARG" ;;
d) REPOSITORY="$OPTARG" ;;
p) PHYLUM="$OPTARG" ;;
o) OUTDIR="$OPTARG" ;;
t) NTHREADS=$OPTARG ;;
f) FTYPE=$OPTARG ;;
......@@ -234,16 +256,31 @@ if [ "$REPOSITORY" != "genbank" ] && [ "$REPOSITORY" != "refseq" ]; then "incorr
INEXT="_genomic.fna.gz"; OUTEXT=".fasta";
if $DWNL
then
if [ "$FTYPE" == "1" ]; then echo "file type: genomic sequence(s) in FASTA format" ; FTYPE=1; INEXT="_genomic.fna.gz"; OUTEXT=".fasta";
elif [ "$FTYPE" == "2" ]; then echo "file type: genomic sequence(s) in GenBank format" ; FTYPE=2; INEXT="_genomic.gbff.gz"; OUTEXT=".gbk";
elif [ "$FTYPE" == "3" ]; then echo "file type: annotations in GFF3 format" ; FTYPE=3; INEXT="_genomic.gff.gz"; OUTEXT=".gff";
elif [ "$FTYPE" == "4" ]; then echo "file type: codon CDS in FASTA format" ; FTYPE=4; INEXT="_cds_from_genomic.fna.gz"; OUTEXT=".fasta";
elif [ "$FTYPE" == "5" ]; then echo "file type: amino acid CDS in FASTA format" ; FTYPE=5; INEXT="_protein.faa.gz"; OUTEXT=".fasta";
elif [ "$FTYPE" == "6" ]; then echo "file type: RNA sequences in FASTA format" ; FTYPE=6; INEXT="_rna_from_genomic.fna.gz"; OUTEXT=".fasta";
if [ "$FTYPE" == "1" ]; then echo "file type: genomic sequence(s) in FASTA format" ; FTYPE=1; INEXT="_genomic.fna.gz"; OUTEXT=".fasta";
elif [ "$FTYPE" == "2" ]; then echo "file type: genomic sequence(s) in GenBank format" ; FTYPE=2; INEXT="_genomic.gbff.gz"; OUTEXT=".gbk";
elif [ "$FTYPE" == "3" ]; then echo "file type: annotations in GFF3 format" ; FTYPE=3; INEXT="_genomic.gff.gz"; OUTEXT=".gff";
elif [ "$FTYPE" == "4" ]; then echo "file type: codon CDS in FASTA format" ; FTYPE=4; INEXT="_cds_from_genomic.fna.gz"; OUTEXT=".fasta";
elif [ "$FTYPE" == "5" ]; then echo "file type: amino acid CDS in FASTA format" ; FTYPE=5; INEXT="_protein.faa.gz"; OUTEXT=".fasta";
elif [ "$FTYPE" == "6" ]; then echo "file type: RNA sequences in FASTA format" ; FTYPE=6; INEXT="_rna_from_genomic.fna.gz"; OUTEXT=".fasta";
else echo "incorrect file type (option -f): $FTYPE" ; exit 1 ;
fi
fi
if [ "$PHYLUM" != "all" ]
then
if [ "$PHYLUM" == "A" ]; then PHYLUM="archaea";
elif [ "$PHYLUM" == "B" ]; then PHYLUM="bacteria";
elif [ "$PHYLUM" == "F" ]; then PHYLUM="fungi";
elif [ "$PHYLUM" == "I" ]; then PHYLUM="invertebrate";
elif [ "$PHYLUM" == "M" ]; then PHYLUM="vertebrate_mammalian";
elif [ "$PHYLUM" == "N" ]; then PHYLUM="vertebrate_other";
elif [ "$PHYLUM" == "P" ]; then PHYLUM="plant";
elif [ "$PHYLUM" == "V" ]; then PHYLUM="viral";
elif [ "$PHYLUM" == "Z" ]; then PHYLUM="protozoa";
else echo "incorrect phylum (option -p): $PHYLUM" ; exit 1 ;
fi
fi
OUTDIR=$(dirname $OUTDIR/.);
if [ ! -e $OUTDIR ]; then echo "creating output directory: $OUTDIR" ; mkdir $OUTDIR ; fi
if [ ! -e $OUTDIR ]; then echo "creating output directory: $OUTDIR" ; mkdir $OUTDIR ; fi
trap "echo interrupting wgetGenBankWGS ; wait ; if [ \"$OUTDIR\" != "." ]; then rm -r $OUTDIR ; fi ; exit 1" INT ;
......@@ -254,8 +291,14 @@ trap "echo interrupting wgetGenBankWGS ; wait ; if [ \"$OUTDIR\" != "." ]; then
#### DOWNLOADING GENOME ASSEMBLY REPORT FILE ####
#### ####
#############################################################################################################
echo -n "downloading $REPOSITORY assembly report ... " ;
ASSEMBLY_REPORT=ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_$REPOSITORY.txt;
if [ "$PHYLUM" == "all" ]
then
echo -n "downloading $REPOSITORY assembly report ... " ;
ASSEMBLY_REPORT=ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_$REPOSITORY.txt;
else
echo -n "downloading $REPOSITORY ($PHYLUM) assembly report ... " ;
ASSEMBLY_REPORT=ftp.ncbi.nlm.nih.gov/genomes/$REPOSITORY/$PHYLUM/assembly_summary.txt;
fi
SUMMARY=$OUTDIR/summary.txt;
dwnl $PROTOCOL"//"$ASSEMBLY_REPORT $SUMMARY ;
echo "[ok]" ;
......@@ -291,7 +334,7 @@ head -1 $SUMMARY | sed 's/^# /# file\t/' > $FULLSUMMARY ;
tr '\t' '|' < $SUMMARY > $tmp ; mv $tmp $SUMMARY ; ## to deal with empty entries, not well managed using IFS=$'\t'
START=$SECONDS;
i=-1;
while IFS="|" read -r assembly_accession _ _ wgs_master _ _ _ organism_name infraspecific_name isolate _ _ _ _ _ _ _ _ _ ftp_path excluded_from_refseq relation_to_type_material
while IFS="|" read -r assembly_accession _ _ wgs_master _ _ _ organism_name infraspecific_name isolate _ _ _ _ _ _ _ _ _ ftp_path excluded_from_refseq relation_to_type_material _
do
let i++; if [ $i -lt 1 ]; then continue; fi
......
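The trailing `_` added to the `read` call in the last hunk matters because `read` assigns everything left on the line, separators included, to its final variable. The sketch below shows the effect on a shortened, made-up record (four fields instead of the full report line; the accession and values are placeholders, and both function names are hypothetical).

```bash
#!/usr/bin/env bash
# Hypothetical demo (NOT part of wgetGenBankWGS.sh) of why a trailing "_"
# is needed once the report gains an extra last column.

# Without a sink variable, the last named variable swallows the extra field.
last_field_without_sink() {
  local accession excluded relation
  IFS='|' read -r accession excluded relation <<EOF
$1
EOF
  printf '%s\n' "$relation"
}

# With a trailing "_", the extra field is absorbed and discarded.
last_field_with_sink() {
  local accession excluded relation _
  IFS='|' read -r accession excluded relation _ <<EOF
$1
EOF
  printf '%s\n' "$relation"
}

# Made-up record: accession | excluded_from_refseq | relation_to_type_material | asm_not_live_date
record='GCA_000000000.1|na|assembly from type material|na'

last_field_without_sink "$record"   # assembly from type material|na
last_field_with_sink "$record"      # assembly from type material
```

Without the sink, the value stored in `relation_to_type_material` would carry the new `asm_not_live_date` field along with it.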