Commit 665c3e1c authored by Alexis CRISCUOLO
.10

parent df57163a
# DNA2ORF
Translating the maximum number of codons using the minimum number of ORFs
_DNA2ORF_ is a command line program written in [Java](https://docs.oracle.com/en/java/) to find and translate open reading frames (ORFs) inside nucleotide sequence(s), an ORF being unequivocally defined here as _a nucleotide segment of length divisible by 3 that is free of in-frame STOP codons_ (and therefore not necessarily containing a START codon; e.g. Woodcroft et al. 2016, Sieber et al. 2018).
_DNA2ORF_ implements an algorithm that solves a specific problem: _translating the maximum number of non-STOP codons within a nucleotide sequence using the minimum number of non-overlapping ORFs_.
Most of the ORFs inferred by _DNA2ORF_ therefore satisfy ORF definition 2 of Sieber et al. (2018) (e.g. the first, longer ones are generally bounded by STOP codons), but not all of them (e.g. the last, shorter ones are often not bounded by STOP codons, in order to prevent ORF overlap).
_DNA2ORF_ was not implemented to predict coding sequences (although it can be used as a prior step for this task).
_DNA2ORF_ was developed to quickly translate most of the regions of one (or several) nucleotide sequence(s), in order to perform amino acid-/codon-based comparisons between genomes (e.g. sequence similarity, _k_-mer contents, residue frequencies).
Thanks to its generic ORF definition (see above), _DNA2ORF_ can be used indiscriminately on any kind of genome (i.e. archaea, bacteria, eukaryota, viruses).
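
As an illustration of the ORF definition used here, the following minimal Java sketch checks whether a nucleotide string is an ORF (length divisible by 3, no in-frame STOP codon) and translates it with the standard genetic code. The class and method names (`OrfDefinition`, `isOrf`, `translate`) are hypothetical and are not taken from the _DNA2ORF_ source code; the sketch also assumes upper-case A/C/G/T input and reports any other character as `X`.

```java
final class OrfDefinition {
    // Standard genetic code, codons enumerated with bases in the order T, C, A, G.
    private static final String BASES = "TCAG";
    private static final String AMINO =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /** Translates the codon starting at index i; 'X' if it holds a non-ACGT character. */
    static char aminoAcid(String seq, int i) {
        int b1 = BASES.indexOf(seq.charAt(i));
        int b2 = BASES.indexOf(seq.charAt(i + 1));
        int b3 = BASES.indexOf(seq.charAt(i + 2));
        if (b1 < 0 || b2 < 0 || b3 < 0) return 'X';
        return AMINO.charAt(16 * b1 + 4 * b2 + b3);
    }

    /** True when the length is a positive multiple of 3 and no in-frame codon is a STOP. */
    static boolean isOrf(String seq) {
        if (seq.isEmpty() || seq.length() % 3 != 0) return false;
        for (int i = 0; i < seq.length(); i += 3) {
            if (aminoAcid(seq, i) == '*') return false;
        }
        return true;
    }

    /** Translates every complete in-frame codon of seq. */
    static String translate(String seq) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 3 <= seq.length(); i += 3) {
            sb.append(aminoAcid(seq, i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String orf = "CGGCAGGCGGACAATCAGGGGCTACGTGTTAACGTTCTGACCATGATTGTC";
        System.out.println(isOrf(orf) + " " + translate(orf)); // true RQADNQGLRVNVLTMIV
        System.out.println(isOrf("TGTTAACGT"));                // false (in-frame TAA)
    }
}
```

The first test string is the shortest ORF (`246-296`) listed in the Example section. It contains `TAA` and `TGA` triplets, but only out of frame, so it still satisfies the definition.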
## Compilation and execution
The source code of _DNA2ORF_ is inside the _src_ directory and can be compiled and executed in two different ways.
#### Building an executable jar file
Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/DNA2ORF.git
```
On computers with [Oracle JDK](http://www.oracle.com/technetwork/java/javase/downloads/index.html) (**11** or higher) installed, a Java executable jar file can be created. In a command-line window, go to the _src_ directory and type:
```bash
javac DNA2ORF.java
echo Main-Class: DNA2ORF > MANIFEST.MF
jar -cmvf MANIFEST.MF DNA2ORF.jar DNA2ORF.class DNA2ORF\$ORF.class
rm MANIFEST.MF DNA2ORF.class DNA2ORF\$ORF.class
```
This will create the executable jar file `DNA2ORF.jar`, which can be run with the following command line model:
```bash
java -jar DNA2ORF.jar [options]
```
#### Building a native code binary
Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/DNA2ORF.git
```
On computers with [GraalVM](https://www.graalvm.org/downloads/) installed, a native executable can be built. In a command-line window, go to the _src_ directory, and type:
```bash
javac DNA2ORF.java
native-image DNA2ORF DNA2ORF
rm DNA2ORF.class DNA2ORF\$ORF.class
```
This will create the native executable `DNA2ORF` that can be run with the following command line model:
```bash
./DNA2ORF [options]
```
## Usage
Run _DNA2ORF_ without any option to read the following documentation:
```
DNA2ORF
USAGE: DNA2ORF -i <infile> [-l <minlgt>] [-o <outfile>]
OPTIONS:
  -i <filename>  (mandatory) FASTA-formatted nucleotide sequence file
                 name; should end with .gz when gzipped
  -l <integer>   minimum length of the inferred ORFs (default: 30 bps)
  -o <filename>  output file name (default: standard output)
```
## Notes
* In brief, _DNA2ORF_ first looks, in the 6 reading frames, for every nucleotide subsequence _S_ of length divisible by 3 (and no shorter than the minimum allowed length) that is bounded by STOP codons. Next, _DNA2ORF_ sorts all such ORFs by decreasing length, and finally iterates through each of them. When none of the codons in the current ORF _S_ has already been translated in the input sequence, _S_ is outputted. Otherwise, _S_ is shrunk until a shortened ORF _S'_ is obtained that does not overlap any previously outputted ORF: if _S'_ is shorter than the minimum allowed length, it is discarded; otherwise, it is put back into the sorted ORF set. The algorithm ends when every ORF has been either outputted or discarded.
* The minimum allowed length of inferred ORFs can be set using option `-l` (in number of nucleotides, inclusive; default: 30 bps). With this default value, _DNA2ORF_ generally runs fast, from a few seconds (bacterial genome, e.g. 5 Mbps) to a few minutes (large eukaryote genome, e.g. 3 Gbps). Speed can be improved by increasing the minimum allowed ORF length (e.g. 150 bps), but at the cost of a reduced proportion of translated nucleotides.
* The input file (option `-i`) should be in FASTA format. _DNA2ORF_ is able to read an input file compressed using [_gzip_](https://www.gnu.org/software/gzip/), provided that the file name ends with the extension `.gz`.
* _DNA2ORF_ outputs each ORF using four fields in tab-delimited format (default: standard output):<br>
&emsp; (1) sequence name (i.e. FASTA header);<br>
&emsp; (2) start-end positions (inclusive; reverse strand when start > end);<br>
&emsp; (3) codon sequence (reverse-complement when start > end);<br>
&emsp; (4) amino acid sequence (standard genetic code).
* Tab-delimited standard output can be easily used to write inferred codon or amino acid ORFs into FASTA-formatted output files, using e.g. [_gawk_](https://www.gnu.org/software/gawk/):<br>
```bash
DNA2ORF -i dna.fasta | gawk 'BEGIN{FS="\t"}{print">"$1":"$2;print$3}' > orf.codon.fasta
DNA2ORF -i dna.fasta | gawk 'BEGIN{FS="\t"}{print">"$1":"$2;print$4}' > orf.amino.fasta
```
* For each input nucleotide sequence, _DNA2ORF_ also writes further information to standard error: sequence name and length, percentage of translated nucleotides, and number of outputted ORFs.
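
The greedy procedure summarized in the first note can be sketched in Java as follows. This is a simplified illustration, not the _DNA2ORF_ source code: all names (`GreedyOrfSketch`, `candidates`, `select`) are hypothetical, a single upper-case A/C/G/T sequence is assumed, sequence ends are assumed to close a segment just like STOP codons, and "shrinking" is assumed to mean trimming whole codons from either end of the ORF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class GreedyOrfSketch {
    /** Half-open interval [start, end) in forward-strand coordinates (0-based). */
    static final class Orf {
        final int start, end;
        final boolean reverse;
        Orf(int start, int end, boolean reverse) {
            this.start = start; this.end = end; this.reverse = reverse;
        }
        int length() { return end - start; }
    }

    // Standard STOP codons, and how reverse-strand STOPs read on the forward strand.
    static final List<String> FWD_STOPS = List.of("TAA", "TAG", "TGA");
    static final List<String> REV_STOPS = List.of("TTA", "CTA", "TCA");

    /** Every maximal STOP-free segment of at least minLen nucleotides, in the 6 frames. */
    static List<Orf> candidates(String seq, int minLen) {
        int min = Math.max(minLen, 3);
        List<Orf> out = new ArrayList<>();
        for (boolean reverse : new boolean[] {false, true}) {
            List<String> stops = reverse ? REV_STOPS : FWD_STOPS;
            for (int frame = 0; frame < 3; frame++) {
                int segStart = frame;
                int i = frame;
                for (; i + 3 <= seq.length(); i += 3) {
                    if (stops.contains(seq.substring(i, i + 3))) {
                        if (i - segStart >= min) out.add(new Orf(segStart, i, reverse));
                        segStart = i + 3;
                    }
                }
                // Assumption: the end of the sequence also closes a segment.
                if (i - segStart >= min) out.add(new Orf(segStart, i, reverse));
            }
        }
        return out;
    }

    static boolean anyCovered(boolean[] covered, int from, int to) {
        for (int i = from; i < to; i++) if (covered[i]) return true;
        return false;
    }

    /** Longest-first selection of non-overlapping ORFs. */
    static List<Orf> select(String seq, int minLen) {
        PriorityQueue<Orf> queue =
            new PriorityQueue<>((x, y) -> Integer.compare(y.length(), x.length()));
        queue.addAll(candidates(seq, minLen));
        boolean[] covered = new boolean[seq.length()];
        List<Orf> selected = new ArrayList<>();
        while (!queue.isEmpty()) {
            Orf s = queue.poll();
            int a = s.start, b = s.end;
            // Trim whole codons that overlap already translated positions.
            // ORFs are emitted in non-increasing length order, so an earlier ORF
            // cannot lie strictly inside a later one: trimming both ends suffices.
            while (a < b && anyCovered(covered, a, a + 3)) a += 3;
            while (b > a && anyCovered(covered, b - 3, b)) b -= 3;
            if (a == s.start && b == s.end) {
                Arrays.fill(covered, a, b, true);   // no overlap: output S
                selected.add(s);
            } else if (b - a >= Math.max(minLen, 3)) {
                queue.add(new Orf(a, b, s.reverse)); // put the shrunk S' back
            }                                        // otherwise S' is discarded
        }
        return selected;
    }

    public static void main(String[] args) {
        String seq = "ATGAAATAGCCCGGGTTTTAACCCATGTGA";
        for (Orf o : select(seq, 6)) {
            // 1-based inclusive positions; start > end on the reverse strand.
            System.out.println(o.reverse ? o.end + "-" + (o.start + 1)
                                         : (o.start + 1) + "-" + o.end);
        }
    }
}
```

Using a priority queue keyed on length mirrors the "sorted ORF set" of the description: a shrunk ORF is always strictly shorter than the one it came from, so the loop terminates.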
## Example
The compressed FASTA file `CP003292.fasta.gz` is available in the directory _example_.
It contains the plasmid sequence [pG-EA11](https://www.ncbi.nlm.nih.gov/nuccore/CP003292.1).
To translate the maximum number of its non-STOP codons using the minimum number of non-overlapping ORFs, _DNA2ORF_ can be run using the following command line:
```bash
DNA2ORF -i CP003292.fasta.gz
```
which leads to the following output:
```
CP003292.1 817-299 GTGCTCAATGTGGCTTCCAGTTCTCTTCCGCTTCCTCGTCGATTCGCTCGCTACGCTCGGTCATCGACTTGCGGCGAGCGGGATCACCTCTCGCATGTTGTCGGTGATCCCGGTGTTGTTCTTTTGGGGATGAGTCGCCAGATAGCCCCTCAAACGCGCTCAAAAGCTCTCAGACGCACGAAACGGGTTTTAGATGCCCGCCAGGGCATGGAAGGCGCGAAAAACGCTTACAGAGGCTCTCAGGAGGTCAAGCGCGCCGATACCCCTAACCAGGGGCTTTCAGGTCGCCAGGAACCGCGCCAGCCCCCTGCAAAGCCCATTTGCACGGAACAGGCAGCCATGCGGTTGGTAGGTGGTCGTAGACCGTCGGTGTTGCAGCGATGGGGGTCTGGGGTTCTCCCCAGACGGTTTCCGGTAGCGCTCAAGGCGCGGCCGGAACCGTGGCGGCTCGGTTTTTTTTGGAGAACGGGGAGCAAAAACACGCAAAGGATTTTGAGGCGGCAAAAGAGCAATGCAGGA VLNVASSSLPLPRRFARYARSSTCGERDHLSHVVGDPGVVLLGMSRQIAPQTRSKALRRTKRVLDARQGMEGAKNAYRGSQEVKRADTPNQGLSGRQEPRQPPAKPICTEQAAMRLVGGRRPSVLQRWGSGVLPRRFPVALKARPEPWRLGFFWRTGSKNTQRILRRQKSNAG
CP003292.1 1274-1549 ACCAACGTTAGTGCATGGGATTTTCAGAGGGAAAAAATCATGTTTATTGATTCAGAAAAACGACTGAAACAACTTTCAGATGAGGCAAAGAAAAACACCGAGGATCTCGAAGAAGCAAAGAAAAATTCAAGGTTTACACAGGTATCCCCAAAAGGTTGGGAACGTGTTCGAGAGCTGCTGAAGGATAGCCAAGGCATATCAGCACTGAAGCTGTACTCATTTTTAGCGGAGCATATCGATCCTACGTGTGGCGCTGTCGTTGCGGATCAGCAATTT TNVSAWDFQREKIMFIDSEKRLKQLSDEAKKNTEDLEEAKKNSRFTQVSPKGWERVRELLKDSQGISALKLYSFLAEHIDPTCGAVVADQQF
CP003292.1 1-237 CTAGCTGAAAAACTTGGAGTTAGCAGAAGCACAATTATTCGGTGGCTCAATTACTTAGAATCAAAAAATGCATTAGTTAGAATCCCCGTTGCTGGTAAGGTTTGTGCGTATGCCCTCGATCCACATGAAGTCTGGAAGGGATACAACACTACGAAAAACCATGCAGCGTTTGTCACTAAAACACTGGTCAACAAAGACGGTGATATTCAGCGCCGAATCATGGCCATGTTTTCAAAT LAEKLGVSRSTIIRWLNYLESKNALVRIPVAGKVCAYALDPHEVWKGYNTTKNHAAFVTKTLVNKDGDIQRRIMAMFSN
CP003292.1 1041-820 CGGAGTGTATACTGGCTTACTCCTATGCAAGCCTCGCCAGCCGTGATTTTTGCCAGATTTCGTGGTTTTCACGTTCAACAGCTCCATCGTTTTTTTGCCGTCCGTTCACCGTTAAAACGCCTCAGAAATCGGGACAGGATGTGTAATCGTATAACCGCGCATGCACGCCGATACGATTACCACTTGGTCAGGGCTTCACCCCGACACCCCGCATCAGTGCGT RSVYWLTPMQASPAVIFARFRGFHVQQLHRFFAVRSPLKRLRNRDRMCNRITAHARRYDYHLVRASPRHPASVR
CP003292.1 1056-1193 CTGGGTCAGGGCTTCGCCCCGACACCCCCAAAGGGCGTTGCGTTGCACGCAACACCCTTGCCTAGAATAGATCTTTTACGCCGATTTGTAAGTGGTTGTTTTTTTAGTGTGTTTTTTATTTACCTGTCGCATCCAGGA LGQGFAPTPPKGVALHATPLPRIDLLRRFVSGCFFSVFFIYLSHPG
CP003292.1 1273-1196 TCAAATCTAAGCCATCATGACACATCTATGCAACACCTTGTTGCGTGTTGTCGCATCCAGGATGCGACAACGTTGTGT SNLSHHDTSMQHLVACCRIQDATTLC
CP003292.1 246-296 CGGCAGGCGGACAATCAGGGGCTACGTGTTAACGTTCTGACCATGATTGTC RQADNQGLRVNVLTMIV
#CP003292.1 1549 bps 98.19% translated 7 ORFs
```
The nucleotide sequence [CP003292](https://www.ncbi.nlm.nih.gov/nuccore/CP003292.1) can therefore be decomposed into 7 ORFs (of length at least 30 bps) that cover 98.19% of its total length.
The corresponding tab-delimited file `CP003292.ORF.tsv` is available in the directory _example_.
Of note, the first (longest) ORF `CP003292.1:817-299` contains a coding sequence (CDS) that is identical to the replication protein [AFS59907.1](https://www.ncbi.nlm.nih.gov/protein/AFS59907.1), whereas ORFs 2 and 3 (`CP003292.1:1274-1549` and `CP003292.1:1-237`) correspond to the only other predicted CDS of [pG-EA11](https://www.ncbi.nlm.nih.gov/nuccore/CP003292.1), [AFS77085](https://www.ncbi.nlm.nih.gov/protein/AFS77085).
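
The reported 98.19% can be checked by hand from the second output field. The short Java sketch below (hypothetical class `OrfCoverage`, not part of _DNA2ORF_) sums the lengths of the seven start-end ranges listed above and divides by the sequence length (1549 bps); it relies on the ORFs being non-overlapping, so that lengths can simply be added.

```java
import java.util.Locale;

final class OrfCoverage {
    /** Length of a "start-end" field (1-based, inclusive; start > end on the reverse strand). */
    static int length(String range) {
        String[] p = range.split("-");
        return Math.abs(Integer.parseInt(p[1]) - Integer.parseInt(p[0])) + 1;
    }

    /** Total number of nucleotides covered by non-overlapping ranges. */
    static int total(String[] ranges) {
        int sum = 0;
        for (String r : ranges) sum += length(r);
        return sum;
    }

    /** Percentage of the sequence covered, with two decimals. */
    static String percent(String[] ranges, int seqLength) {
        return String.format(Locale.ROOT, "%.2f", 100.0 * total(ranges) / seqLength);
    }

    public static void main(String[] args) {
        String[] ranges = {"817-299", "1274-1549", "1-237", "1041-820",
                           "1056-1193", "1273-1196", "246-296"};
        // 519 + 276 + 237 + 222 + 138 + 78 + 51 = 1521 translated nucleotides
        System.out.println(total(ranges) + " " + percent(ranges, 1549) + "%"); // 1521 98.19%
    }
}
```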
## References
Sieber P, Platzer M, Schuster S (2018) _The Definition of Open Reading Frame Revisited_. **Trends in Genetics**, 34(3):P167-170. [doi:10.1016/j.tig.2017.12.009](https://doi.org/10.1016/j.tig.2017.12.009).
Woodcroft BJ, Boyd JA, Tyson GW (2016) _OrfM: a fast open reading frame predictor for metagenomic data_. **Bioinformatics**, 32(17):2702-2703. [doi:10.1093/bioinformatics/btw241](https://doi.org/10.1093/bioinformatics/btw241).