# DNA2ORF


_DNA2ORF_ is a command line program written in [Java](https://docs.oracle.com/en/java/) to find and translate open reading frames (ORF) inside nucleotide sequence(s), an ORF being unequivocally defined here as _a nucleotide segment of length divisible by 3 that is free of in-frame STOP codons_ (therefore not necessarily containing a START codon, e.g. Woodcroft et al. 2016, Sieber et al. 2018).
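
As an illustration of this definition, the following Python sketch (not part of _DNA2ORF_; the function name `is_orf` is hypothetical) tests whether a nucleotide segment qualifies as an ORF. It assumes the three STOP codons of the standard genetic code and, as a further assumption, treats an empty segment as invalid:

```python
# STOP codons of the standard genetic code
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_orf(segment: str) -> bool:
    """True if the length is divisible by 3 and no in-frame codon is a STOP."""
    seq = segment.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False
    return all(seq[i:i + 3] not in STOP_CODONS for i in range(0, len(seq), 3))


print(is_orf("CGGCAGGCGGAC"))  # True: CGG CAG GCG GAC
print(is_orf("CGGTAGGCGGAC"))  # False: TAG is in frame
print(is_orf("CTAGCT"))        # True: TAG is present, but out of frame
print(is_orf("CGGCA"))         # False: length not divisible by 3
```

Note that no START codon is required, in line with the definition above.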

_DNA2ORF_ implements an algorithm that solves a specific problem: _translating the maximum number of non-STOP codons within a nucleotide sequence using the minimum number of non-overlapping ORFs_.
The ORF definition 2 established by Sieber et al. (2018) is therefore satisfied by most of the ORFs inferred by _DNA2ORF_ (e.g. the first, longer ones are generally bounded by STOP codons), but not by all of them (e.g. the last, shorter ones are often not bounded by STOP codons, in order to prevent ORF overlap).

_DNA2ORF_ was not implemented to predict coding sequences (although it can be used as a prior step for this task).
_DNA2ORF_ was developed to quickly translate most of the regions of one (or several) nucleotide sequence(s), in order to perform amino acid-/codon-based comparisons between genomes (e.g. sequence similarity, _k_-mer contents, residue frequencies). 
Thanks to its generic ORF definition (see above), _DNA2ORF_ can be used indiscriminately on any kind of genome (i.e. archaea, bacteria, eukaryota, viruses).


## Compilation and execution

The source code of _DNA2ORF_ is inside the _src_ directory and can be compiled and executed in two different ways.

#### Building an executable jar file

Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/DNA2ORF.git
```
On computers with [Oracle JDK](http://www.oracle.com/technetwork/java/javase/downloads/index.html) (**11** or higher) installed, a Java executable jar file can be created. In a command-line window, go to the _src_ directory and type:
```bash
javac DNA2ORF.java 
echo Main-Class: DNA2ORF > MANIFEST.MF 
jar -cmvf MANIFEST.MF DNA2ORF.jar DNA2ORF.class DNA2ORF\$ORF.class
rm MANIFEST.MF DNA2ORF.class DNA2ORF\$ORF.class
```
This will create the executable jar file `DNA2ORF.jar` that can be run with the following command line model:
```bash
java -jar DNA2ORF.jar [options]
```

#### Building a native code binary

Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/DNA2ORF.git
```
On computers with [GraalVM](https://www.graalvm.org/downloads/) installed, a native executable can be built. In a command-line window, go to the _src_ directory, and type:
```bash
javac DNA2ORF.java 
native-image DNA2ORF DNA2ORF
rm DNA2ORF.class DNA2ORF\$ORF.class
```
This will create the native executable `DNA2ORF` that can be run with the following command line model:
```bash
./DNA2ORF [options]
```

## Usage

Run _DNA2ORF_ without any option to read the following documentation:

```
 DNA2ORF

 USAGE:  DNA2ORF  -i <infile>  [-l <minlgt>]  [-o <outfile>]

 OPTIONS:

   -i <filename>  (mandatory) FASTA-formatted nucleotide sequence file
                  name; should end with .gz when gzipped
   -l <integer>   minimum length of the inferred ORFs (default: 30 bps)
   -o <filename>  output file name (default: standard output)
```


## Notes

* In brief, _DNA2ORF_ first looks, in the 6 reading frames, for every nucleotide subsequence _S_ of length divisible by 3 (no less than the minimum allowed length) that is bounded by STOP codons. Next, _DNA2ORF_ sorts all such ORFs by decreasing length, and finally iterates through them. When none of the codons of the current ORF _S_ has already been translated, _S_ is output. Otherwise, _S_ is shrunk until a shortened ORF _S'_ is obtained that does not overlap any previously output ORF: if _S'_ is shorter than the minimum allowed length, it is discarded; otherwise, it is put back into the sorted ORF set. The algorithm ends when every ORF has been either output or discarded.

* The minimum allowed length of inferred ORFs can be set using option `-l` (in number of nucleotides, inclusive; default: 30 bps). With this default, _DNA2ORF_ generally runs fast, from a few seconds (bacterial genome, e.g. 5 Mbps) to a few minutes (large eukaryote genome, e.g. 3 Gbps). Speed can be improved by increasing the minimum allowed ORF length (e.g. 150 bps), but at the cost of a reduced proportion of translated nucleotides.

* The input file (option `-i`) should be in FASTA format. _DNA2ORF_ is able to read an input file compressed using [_gzip_](https://www.gnu.org/software/gzip/), provided that the file name ends with the extension `.gz`.

* _DNA2ORF_ outputs each ORF using four fields in tab-delimited format (default: standard output):<br> 
  &emsp; (1) sequence name (i.e. FASTA header);<br>
  &emsp; (2) start-end positions (inclusive; reverse strand when start > end);<br>
  &emsp; (3) codon sequence (reverse-complement when start > end);<br>
  &emsp; (4) amino acid sequence (standard genetic code).

* Tab-delimited standard output can be easily used to write inferred codon or amino acid ORFs into FASTA-formatted output files, using e.g. [_gawk_](https://www.gnu.org/software/gawk/):<br>
  ```bash
  DNA2ORF -i dna.fasta | gawk 'BEGIN{FS="\t"}{print">"$1":"$2;print$3}' > orf.codon.fasta
  DNA2ORF -i dna.fasta | gawk 'BEGIN{FS="\t"}{print">"$1":"$2;print$4}' > orf.amino.fasta
  ```

* For each input nucleotide sequence, _DNA2ORF_ also writes further information to standard error: sequence name and length, percentage of translated nucleotides, and number of output ORFs.
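
The greedy procedure summarized in the first note can be sketched in Python as follows. This is a simplified illustration written for this README, not the Java implementation of _DNA2ORF_, and it rests on several assumptions: all names (`stop_free_runs`, `greedy_orfs`, ...) are hypothetical; candidate ORFs may also be bounded by the sequence ends; spans are stored as 0-based half-open intervals on the forward strand; and the shrinking step keeps the longest run of not-yet-translated codons, a rule chosen here because the note does not specify how _S_ is shrunk.

```python
import heapq

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement of an upper-case nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_runs(seq, min_len):
    """Yield (lo, hi) spans of maximal STOP-free codon runs in the 3 frames."""
    n = len(seq)
    for frame in range(3):
        lo = frame
        for i in range(frame, n - 2, 3):
            if seq[i:i + 3] in STOPS:
                if i - lo >= min_len:
                    yield lo, i
                lo = i + 3
        end = n - (n - frame) % 3  # end of the last complete codon in this frame
        if end - lo >= min_len:
            yield lo, end


def greedy_orfs(seq, min_len=30):
    """Return non-overlapping ORFs as (lo, hi, strand), longest first."""
    seq = seq.upper()
    n = len(seq)
    heap = []  # min-heap keyed on negative length, i.e. longest span first
    for lo, hi in stop_free_runs(seq, min_len):
        heapq.heappush(heap, (lo - hi, lo, hi, "+"))
    for lo, hi in stop_free_runs(revcomp(seq), min_len):
        # map reverse-strand coordinates back onto the forward strand
        heapq.heappush(heap, (lo - hi, n - hi, n - lo, "-"))
    covered = bytearray(n)  # 1 where a nucleotide is already translated
    result = []
    while heap:
        _, lo, hi, strand = heapq.heappop(heap)
        if not any(covered[lo:hi]):
            covered[lo:hi] = b"\x01" * (hi - lo)
            result.append((lo, hi, strand))
            continue
        # assumed shrinking rule: keep the longest run of untouched codons
        best, run_lo = None, None
        for i in range(lo, hi + 3, 3):
            free = i < hi and not any(covered[i:i + 3])
            if free and run_lo is None:
                run_lo = i
            elif not free and run_lo is not None:
                if best is None or i - run_lo > best[1] - best[0]:
                    best = (run_lo, i)
                run_lo = None
        if best is not None and best[1] - best[0] >= min_len:
            heapq.heappush(heap, (best[0] - best[1], best[0], best[1], strand))
    return result


for lo, hi, strand in greedy_orfs("ATGAAATAGCCC", min_len=6):
    # 1-based inclusive positions, with start > end on the reverse strand
    start, end = (lo + 1, hi) if strand == "+" else (hi, lo + 1)
    print(f"{start}-{end}")  # prints 12-1
```

In this toy input, the forward frames are interrupted by `TAG` or `TGA`, whereas the reverse-complement `GGGCTATTTCAT` is STOP-free over its whole length, so a single reverse-strand ORF covers the sequence. A heap is used so that shrunk candidates can be put back while the longest remaining candidate is always processed first.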


## Example

The compressed FASTA file `CP003292.fasta.gz` is available in the directory _example_.
It contains the plasmid sequence [pG-EA11](https://www.ncbi.nlm.nih.gov/nuccore/CP003292.1).
To translate the maximum number of its non-STOP codons using the minimum number of non-overlapping ORFs, _DNA2ORF_ can be run using the following command line:
```bash
DNA2ORF -i CP003292.fasta.gz 
```

which produces the following output:
```
CP003292.1   817-299    GTGCTCAATGTGGCTTCCAGTTCTCTTCCGCTTCCTCGTCGATTCGCTCGCTACGCTCGGTCATCGACTTGCGGCGAGCGGGATCACCTCTCGCATGTTGTCGGTGATCCCGGTGTTGTTCTTTTGGGGATGAGTCGCCAGATAGCCCCTCAAACGCGCTCAAAAGCTCTCAGACGCACGAAACGGGTTTTAGATGCCCGCCAGGGCATGGAAGGCGCGAAAAACGCTTACAGAGGCTCTCAGGAGGTCAAGCGCGCCGATACCCCTAACCAGGGGCTTTCAGGTCGCCAGGAACCGCGCCAGCCCCCTGCAAAGCCCATTTGCACGGAACAGGCAGCCATGCGGTTGGTAGGTGGTCGTAGACCGTCGGTGTTGCAGCGATGGGGGTCTGGGGTTCTCCCCAGACGGTTTCCGGTAGCGCTCAAGGCGCGGCCGGAACCGTGGCGGCTCGGTTTTTTTTGGAGAACGGGGAGCAAAAACACGCAAAGGATTTTGAGGCGGCAAAAGAGCAATGCAGGA   VLNVASSSLPLPRRFARYARSSTCGERDHLSHVVGDPGVVLLGMSRQIAPQTRSKALRRTKRVLDARQGMEGAKNAYRGSQEVKRADTPNQGLSGRQEPRQPPAKPICTEQAAMRLVGGRRPSVLQRWGSGVLPRRFPVALKARPEPWRLGFFWRTGSKNTQRILRRQKSNAG
CP003292.1   1274-1549  ACCAACGTTAGTGCATGGGATTTTCAGAGGGAAAAAATCATGTTTATTGATTCAGAAAAACGACTGAAACAACTTTCAGATGAGGCAAAGAAAAACACCGAGGATCTCGAAGAAGCAAAGAAAAATTCAAGGTTTACACAGGTATCCCCAAAAGGTTGGGAACGTGTTCGAGAGCTGCTGAAGGATAGCCAAGGCATATCAGCACTGAAGCTGTACTCATTTTTAGCGGAGCATATCGATCCTACGTGTGGCGCTGTCGTTGCGGATCAGCAATTT   TNVSAWDFQREKIMFIDSEKRLKQLSDEAKKNTEDLEEAKKNSRFTQVSPKGWERVRELLKDSQGISALKLYSFLAEHIDPTCGAVVADQQF
CP003292.1   1-237      CTAGCTGAAAAACTTGGAGTTAGCAGAAGCACAATTATTCGGTGGCTCAATTACTTAGAATCAAAAAATGCATTAGTTAGAATCCCCGTTGCTGGTAAGGTTTGTGCGTATGCCCTCGATCCACATGAAGTCTGGAAGGGATACAACACTACGAAAAACCATGCAGCGTTTGTCACTAAAACACTGGTCAACAAAGACGGTGATATTCAGCGCCGAATCATGGCCATGTTTTCAAAT   LAEKLGVSRSTIIRWLNYLESKNALVRIPVAGKVCAYALDPHEVWKGYNTTKNHAAFVTKTLVNKDGDIQRRIMAMFSN
CP003292.1   1041-820   CGGAGTGTATACTGGCTTACTCCTATGCAAGCCTCGCCAGCCGTGATTTTTGCCAGATTTCGTGGTTTTCACGTTCAACAGCTCCATCGTTTTTTTGCCGTCCGTTCACCGTTAAAACGCCTCAGAAATCGGGACAGGATGTGTAATCGTATAACCGCGCATGCACGCCGATACGATTACCACTTGGTCAGGGCTTCACCCCGACACCCCGCATCAGTGCGT   RSVYWLTPMQASPAVIFARFRGFHVQQLHRFFAVRSPLKRLRNRDRMCNRITAHARRYDYHLVRASPRHPASVR
CP003292.1   1056-1193  CTGGGTCAGGGCTTCGCCCCGACACCCCCAAAGGGCGTTGCGTTGCACGCAACACCCTTGCCTAGAATAGATCTTTTACGCCGATTTGTAAGTGGTTGTTTTTTTAGTGTGTTTTTTATTTACCTGTCGCATCCAGGA   LGQGFAPTPPKGVALHATPLPRIDLLRRFVSGCFFSVFFIYLSHPG
CP003292.1   1273-1196  TCAAATCTAAGCCATCATGACACATCTATGCAACACCTTGTTGCGTGTTGTCGCATCCAGGATGCGACAACGTTGTGT   SNLSHHDTSMQHLVACCRIQDATTLC
CP003292.1   246-296    CGGCAGGCGGACAATCAGGGGCTACGTGTTAACGTTCTGACCATGATTGTC   RQADNQGLRVNVLTMIV
#CP003292.1  1549 bps   98.19% translated   7 ORFs
```
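
The relationship between the codon sequence (field 3) and the amino acid sequence (field 4) can be checked independently of _DNA2ORF_. The Python sketch below translates the shortest ORF of this output (`246-296`) with the standard genetic code. All names are illustrative, and returning `X` for codons that contain ambiguous bases is an assumption of this sketch, not a documented behavior of _DNA2ORF_.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(codons: str) -> str:
    """Translate an in-frame codon sequence; unknown codons give 'X'."""
    seq = codons.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


orf = "CGGCAGGCGGACAATCAGGGGCTACGTGTTAACGTTCTGACCATGATTGTC"
print(translate(orf))  # RQADNQGLRVNVLTMIV
```

Because field 3 is already reverse-complemented for reverse-strand ORFs, the same function applies to every line of the output.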

The nucleotide sequence [CP003292](https://www.ncbi.nlm.nih.gov/nuccore/CP003292.1) can therefore be decomposed into 7 ORFs (of length at least 30 bps) that cover 98.19% of its total length.
The corresponding tab-delimited file `CP003292.ORF.tsv` is available in the directory _example_.
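
The reported coverage can be recomputed from the start-end fields alone. The short Python sketch below (variable and function names are illustrative) sums the lengths of the 7 spans listed above and divides by the sequence length of 1549 bps:

```python
# Start-end fields (1-based, inclusive) copied from the output above
coords = ["817-299", "1274-1549", "1-237", "1041-820",
          "1056-1193", "1273-1196", "246-296"]
seq_len = 1549


def span_length(field: str) -> int:
    """Length of a start-end span; start > end denotes the reverse strand."""
    start, end = map(int, field.split("-"))
    return abs(end - start) + 1


translated = sum(span_length(c) for c in coords)
print(translated, f"{100 * translated / seq_len:.2f}%")  # 1521 98.19%
```

Since the ORFs do not overlap, the sum of their lengths (1521 bps) directly gives the number of translated nucleotides, and every span length is a multiple of 3, as the ORF definition requires.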

Of note, the first (longest) ORF `CP003292.1:817-299` contains a coding sequence (CDS) that is identical to the replication protein [AFS59907.1](https://www.ncbi.nlm.nih.gov/protein/AFS59907.1), whereas the ORFs 2 and 3 (`CP003292.1:1274-1549` and `CP003292.1:1-237`) correspond to the unique [pG-EA11](https://www.ncbi.nlm.nih.gov/nuccore/CP003292.1) predicted CDS [AFS77085](https://www.ncbi.nlm.nih.gov/protein/AFS77085).


## References

Sieber P, Platzer M, Schuster S (2018) _The Definition of Open Reading Frame Revisited_. **Trends in Genetics**, 34(3):167-170. [doi:10.1016/j.tig.2017.12.009](https://doi.org/10.1016/j.tig.2017.12.009).

Woodcroft BJ, Boyd JA, Tyson GW (2016) _OrfM: a fast open reading frame predictor for metagenomic data_. **Bioinformatics**, 32(17):2702-2703. [doi:10.1093/bioinformatics/btw241](https://doi.org/10.1093/bioinformatics/btw241).