Skip to content
Snippets Groups Projects
criscuol's avatar
Alexis CRISCUOLO authored
625bf08c
Name Last commit Last update
example
COPYING
README.md
eCDS.sh

eCDS

eCDS is a command line tool written in Bash to ease the extraction of coding sequence(s) (CDS) from a nucleotide sequence FASTA file (typically, a set of contig sequences). Given a specified query file (containing either an amino acid sequence or a position-specific scoring matrix), eCDS first runs a TBLASTN similarity search (e.g. Gertz et al. 2006) against the specified subject sequence(s), and next extracts the CDS corresponding to each selected BLAST hit.

eCDS runs on UNIX, Linux and most OS X operating systems.

Dependencies

You will need to install the required programs and tools listed in the following table, or to verify that they are already installed with the required version.

program package version sources
eFASTA - ≥ 1.2b gitlab.pasteur.fr/GIPhy/eFASTA
gawk - > 4.0.0 ftp.gnu.org/gnu/gawk
makeblastdb
tblastn
blast+ ≥ 2.9.0 ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST

Installation and execution

A. Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/eCDS.git

B. Give the execute permission to the file eCDS.sh:

chmod +x eCDS.sh

C. Execute eCDS with the following command line model:

./eCDS.sh  [options]

D. If at least one of the required program (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), eCDS will exit with an error message. When running eCDS without option, a documentation should be displayed; otherwise, the name of the missing program is displayed before existing. In such a case, edit the file eCDS.sh and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS (approximately lines 50-90). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block REQUIREMENTS

program variable assignment program variable assignment
eFASTA EFASTA_BIN=eFASTA; makeblastdb MAKEBLASTDB_BIN=makeblastdb;
gawk GAWK_BIN=gawk; tblastn TBLASTN_BIN=tblastn;

Note that depending on the installation of some required programs, the corresponding variable can be assigned with complex commands. For example, as eFASTA is a Java tool that can be run using a Java virtual machine, the executable jar file EFASTA.jar can be used by eCDS by editing the corresponding variable assignment instruction as follows: EFASTA_BIN="java -jar EFASTA.jar".

Usage

Run eCDS without option to read the following documentation:

 USAGE : eCDS   -q <infile>    -s <infile>   -o <basename> 
               [-p <ppos>]    [-c <qcov>]   [-b <bitscore>]
               [-S <integer>] [-f]   [-a]   [-h]

 where options are :

   -q <infile>    query file containing either a sequence (FASTA format) or a position-
                  specific scoring matrix (PSSM; ASN.1 format)
   -s <infile>    subject sequence file (FASTA format)
   -o <basename>  to write the inferred  CDS into the output  file names [basename].fna 
                  (codon) and [basename].faa (translated amino acids)                  
   -p <integer>   minimum allowed  percentage of  positive-scoring  amino acid  matches 
                  (default: 50)
   -c <integer>   minimum allowed percentage of query coverage (default: 70)
   -b <integer>   minimum allowed bit score value (default: 0)
   -f             writes only CDS inferred from the first best hit (default: not set)
   -S <integer>   to accept alternate START codons when searching for the complete CDS:
                    1 = ATG (default)
                    2 = ATG GTG TTG
                    3 = ATG GTG TTG CTG ATT
                    4 = ATG GTG TTG CTG ATT ATC ATA
   -a             appends CDS to  the end of  the FASTA output  files when they already 
                  exist (default: content replaced)
   -h             prints this help and exits

Notes

  • eCDS is able to consider as query (option -q) either a single amino acid sequence in FASTA format or a position-specific scoring matrix (PSSM) in ASN.1 format. When using a FASTA file as query, only the first sequence is considered. When using a PSSM file as query, rescaling is automatically performed to be read by the tblastn binary (i.e. scalingFactor = 1 into the ASN.1-formatted PSSM file).

  • A PSSM file summarizes a multiple amino acid alignment; it is therefore very useful to use a PSSM as a generic query against genome sequences belonging to any genera. A very large collection of PSSM can be found at ftp.ncbi.nih.gov/pub/mmdb/cdd/ (into the archive cdd.tar.gz). For a given gene name (e.g. "ribosomal protein L3"), the associated conserved domain PSSM identifier(s) can be easily retrieved using the search tool at www.ncbi.nlm.nih.gov/cdd or www.ncbi.nlm.nih.gov/proteinclusters; complete lists are also available at ftp.ncbi.nih.gov/genomes/CLUSTERS/ (files *_clusters.txt). Moreover, for a given CDS with GenBank identifier (e.g. NP_417779), the associated conserved domain PSSM identifier(s) can be retrieved using the following URL: www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi?FULL&SEQUENCE=NP_417779 (juste change the accession identifier at the end of the URL). Note that using PSSM generated from the automatically aligned sequences classified as stable clusters in the NCBI Protein Clusters database (i.e. PRK data) enables to observe accurate results when using eCDS.

  • After carrying out a TBLASTN search of the specified query (option -q) against the specified subject FASTA file (option -s), non-significant BLAST hits are filtered out according three possible criteria:
      • minimum percentage of positive-scoring amino matches (option -p; default: ppos = 50%),
      • minimum percentage of query coverage (option -c; default: qcov = 70%),
      • minimum bit score value (option -b; default: bitscore = 0).
    The first criterion cutoff can be enlarged when using a query sequence that is expected to be closely related to the subject sequence (e.g. -p 80), as well as the second criterion cutoff. The third criterion can be useful when using one of the query PSSM distributed by the NCBI, as a minimum bitscore threshold is available for each of them. For example, PRK05644, a PSSM associated to the gene gyrB, can be used as query with the bitscore threshold specified at the corresponding webpage: www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=PRK05644 ('+ Statistics' on the left); note also that updated lists of bitscore thresholds are available at ftp.ncbi.nih.gov/pub/mmdb/cdd/ (files bitscore_specific_*.txt). However, a simple approach is also to run eCDS with relaxed criteria and only consider the first best BLAST hit (option -f).

  • Finally, for each selected BLAST hit, the corresponding subject sequence region is considered to extract the smallest putative CDS containing that region, using eFASTA. Note that the determination of the CDS can be tuned by accepting alternate START codons (option -S).

  • Amino acid and codon sequences are written into FASTA files with extensions .faa and .fna, respectively. The base name the the output files should be specified using option -o. Note that a log file containing several BLAST results is also written (extension .log.txt).

Example

In order to illustrate the usefulness of eCDS, the directory example/ contains the whole chromosome sequence of Bacillus subtilis strain 168T (GenBank accession AL009126.3), as well as four PSSM files:
    • PRK00013.groEL.smp corresponding to the conserved domain PRK00013 associated to the gene groEL,
    • PRK00380 .panC.smp corresponding to the conserved domain PRK00380 associated to the gene panC,
    • PRK00405.rpoB.smp corresponding to the conserved domain PRK00405 associated to the gene rpoB,
    • PRK05644.gyrB.smp corresponding to the conserved domain PRK05644 associated to the gene gyrB.

Basically, the four following command lines can be used to extract the four groEL, panC, rpoB and gyrB CDS from the B. subtilis 168T chromosome:

eCDS.sh -q PRK00013.groEL.smp -s Bacillus.subtilis.168.fasta  -o bc.groEL 
eCDS.sh -q PRK00380.panC.smp  -s Bacillus.subtilis.168.fasta  -o bc.panC
eCDS.sh -q PRK00405.rpoB.smp  -s Bacillus.subtilis.168.fasta  -o bc.rpoB
eCDS.sh -q PRK05644.gyrB.smp  -s Bacillus.subtilis.168.fasta  -o bc.gyrB

For groEL gene, the output file bc.groEL.faa (see example/) contains the expected amino acid sequence, i.e. identical to the annotated groEL gene within the B. subtilis 168T chromosome AL009126.3 (i.e. CAB12422). For panC gene, eCDS also extracts an amino acid sequence (bc.panC.faa in example/) that is identical to the annotated one (i.e. CAB14158).

For rpoB gene, one can observe that the above command line leads to an amino acid sequence that is not starting with a START codon:

>AL009126.3::121907-125497
GVNQLTGQLVQYGRHRQRRSYARISEVLELPNLIEIQTSSYQWFLDEGLREMFQDISPIEDFTGNLSLEFIDYSLGEPKYPVEESKERDVTYSAPLRVKVRLINKETGEVKDQDVFMGDFPIMTDTGTFIINGAERVIVSQLVRSPSVYFSGKVDKNGKKGFTATVIPNRGAWLEYETDAKDVVYVRIDRTRKLPVTVLLRALGFGSDQEILDLIGENEYLRNTLDKDNTENSDKALLEIYERLRPGEPPTVENAKSLLDSRFFDPKRYDLANVGRYKINKKLHIKNRLFNQRLAETLVDPETGEILAEKGQILDRRTLDKVLPYLENGIGFRKLYPNGGVVEDEVTLQSIKIFAPTDQEGEQVINVIGNAYIEEEIKNITPADIISSISYFFNLLHGVGDTDDIDHLGNRRLRSVGELLQNQFRIGLSRMERVVRERMSIQDTNTITPQQLINIRPVIASIKEFFGSSQLSQFMDQTNPLAELTHKRRLSALGPGGLTRERAGMEVRDVHYSHYGRMCPIETPEGPNIGLINSLSSYAKVNRFGFIETPYRRVDPETGKVTGRIDYLTADEEDNYVVAQANARLDDEGAFIDDSIVARFRGENTVVSRNRVDYMDVSPKQVVSAATACIPFLENDDSNRALMGANMQRQAVPLMQPEAPFVGTGMEYVSGKDSGAAVICKHPGIVERVEAKNVWVRRYEEVDGQKVKGNLDKYSLLKFVRSNQGTCYNQRPIVSVGDEVVKGEILADGPSMELGELALGRNVMVGFMTWDGYNYEDAIIMSERLVKDDVYTSIHIEEYESEARDTKLGPEEITRDIPNVGEDALRNLDDRGIIRIGAEVKDGDLLVGKVTPKGVTELTAEERLLHAIFGEKAREVRDTSLRVPHGGGGIIHDVKVFNREDGDELPPGVNQLVRVYIVQKRKISEGDKMAGRHGNKGVISKILPEEDMPYLPDGTPIDIMLNPLGVPSRMNIGQVLELHMGMAARYLGIHIASPVFDGAREEDVWETLEEAGMSRDAKTVLYDGRTGEPFDNRVSVGIMYMIKLAHMVDDKLHARSTGPYSLVTQQPLGGKAQFGGQRFGEMEVWALEAYGAAYTLQEILTVKSDDVVGRVKTYEAIVKGDNVPEPGVPESFKVLIKELQSLGMDVKILSGDEEEIEMRDLEDEEDAKQADGLALSGDEEPEETASADVERDVVTKE

However, such a problem can be generally fixed by using alternate START codons. This can be achieved by the following example command line:

eCDS.sh -q PRK00405.rpoB.smp  -s Bacillus.subtilis.168.fasta  -S 2  -o bc.rpoB

that writes the output file bc.rpoB.faa (see example/) containing an amino acid sequence that is identical to the annotated B. subtilis 168T rpoB CDS (i.e. CAB11883).

Finally, for gyrB gene, one can observe that the above command line leads to two extracted sequences. As expected, the first one (i.e. first best BLAST hit) is identical to the B. subtilis 168T gyrB CDS (i.e. CAB11782). However, the second sequence corresponds to the gene parE (subunit B of DNA topoisomerase IV), a well-known paralog of gyrB (e.g. Poirier et al. 2018). A way to circumvent this problem is to only consider the first best BLAST hit (option -f). However, an alternative way is to use the bitscore threshold suggested for PRK05644, available at www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=PRK05644; this leads to the following example command line:

eCDS.sh -q PRK05644.gyrB.smp  -s Bacillus.subtilis.168.fasta  -p 0  -c 0  -b 1029  -o bc.gyrB

that writes only the gyrB amino acid and codon sequences into the files bc.rpoB.faa and bc.rpoB.fna, respectively (see example/).

References

Gertz EM, Yu YK, Agarwala R, Schäffer AA, Altschul SF (2006) Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST. BMC Biology, 4:41. doi:10.1186/1741-7007-4-41.

Poirier S, Rué O, Peguilhan R, Coeuret G, Zagorec M, Champomier-Vergès M-C, Loux V, Chaillou S (2018) Deciphering intra-species bacterial diversity of meat and seafood spoilage microbiota using gyrB amplicon sequencing: A comparative analysis with 16S rDNA V3-V4 amplicon sequencing. PLoS One, 13(9):e0204629. doi:10.1371/journal.pone.0204629.