Commit 6c15b2ab authored by Alexis CRISCUOLO

Initial commit

# gbk2ENA
_gbk2ENA_ is a command line program written in [Python](https://www.python.org/) that converts a standard [GenBank](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/) file into an EMBL-like file suitable for submission to the European Nucleotide Archive ([ENA](https://www.ebi.ac.uk/ena/submit)).
For more details about the sequence annotation format required for submitting to [ENA](https://www.ebi.ac.uk/ena/submit), see [ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt](http://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt).
## Installation and execution
Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/gbk2ENA.git
```
Verify that [Python](https://www.python.org/downloads/) (2.7 or higher) is installed, as well as [Biopython](https://biopython.org/) (1.43 or higher).
Execute the file `gbk2ENA.py` available inside the _src_ directory with the following command line model:
```bash
python gbk2ENA.py [options]
```
## Usage
Launch _gbk2ENA_ with option `-h` to read the following documentation:
```
usage: gbk2ENA [-h] -i FILEINPUT -o FILEOUTPUT -p PROJECTID [-a AUTHORS]
               [-t TITLE] [-s SEQTOPOLOGY] [-m MOLECULETYPE]
               [-c DATACLASS] [-d TAXODIV]

This tool converts Genbank files into EMBL-like files for submission to ENA

optional arguments:
  -h, --help       show this help message and exit
  -i FILEINPUT     (mandatory) input file in genbank format
  -o FILEOUTPUT    (mandatory) output file name
  -p PROJECTID     (mandatory) project id (PR lines)
  -a AUTHORS       reference authors (RA lines); default: "Unknown"
  -t TITLE         reference title (RT lines); default: "N/A"
  -s SEQTOPOLOGY   sequence topology (ID token 3); default: "linear"
  -m MOLECULETYPE  molecule type (ID token 4); default: "genomic DNA"
  -c DATACLASS     data class (ID token 5); default: "STD"
                   (see 3.1 at ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt)
  -d TAXODIV       taxonomic division (ID token 6); default: "UNC"
                   (see 3.2 at ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt)
```
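When several Genbank files have to be converted, the command line can be assembled from a script. The sketch below only builds the argument list from the options documented above; the helper name `build_command`, the file names and the default script path are illustrative assumptions and are not part of _gbk2ENA_.

```python
def build_command(gbk, embl, project, title=None, division=None,
                  script="gbk2ENA.py"):
    """Return the argument list for one gbk2ENA run (hypothetical helper)."""
    # the three mandatory options: -i, -o and -p
    cmd = ["python", script, "-i", gbk, "-o", embl, "-p", project]
    # optional reference title (RT lines)
    if title is not None:
        cmd += ["-t", title]
    # optional taxonomic division (ID token 6)
    if division is not None:
        cmd += ["-d", division]
    return cmd


cmd = build_command("sample.gbk", "sample.embl", "PRJEB25044", division="PRO")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.check_call` to run the conversion; passing a list rather than a single string avoids quoting problems when the title contains spaces.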
## Notes
Below are some useful excerpts from [ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt](http://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt).
**Section 3.1, concerning the option `-c`**
> ```
> The data class of each entry, representing a methodological approach to the
> generation of the data or a type of data, is indicated on the first (ID) line
> of the entry. Each entry belongs to exactly one data class.
>
> Class        Definition
> -----------  -----------------------------------------------------------
> CON          Entry constructed from segment entry sequences; if unannotated,
>              annotation may be drawn from segment entries
> PAT          Patent
> EST          Expressed Sequence Tag
> GSS          Genome Survey Sequence
> HTC          High Throughput CDNA sequencing
> HTG          High Throughput Genome sequencing
> MGA          Mass Genome Annotation
> WGS          Whole Genome Shotgun
> TSA          Transcriptome Shotgun Assembly
> STS          Sequence Tagged Site
> STD          Standard (all entries not classified as above)
> ```
**Section 3.2, concerning the option `-d`**
> ```
> The entries which constitute the database are grouped into taxonomic divisions,
> the object being to create subsets of the database which reflect areas of
> interest for many users.
> In addition to the division, each entry contains a full taxonomic
> classification of the organism that was the source of the stored sequence,
> from kingdom down to genus and species (see below).
> Each entry belongs to exactly one taxonomic division. The ID line of each entry
> indicates its taxonomic division, using the three letter codes shown below:
>
> Division              Code
> -----------------     ----
> Bacteriophage         PHG
> Environmental Sample  ENV
> Fungal                FUN
> Human                 HUM
> Invertebrate          INV
> Other Mammal          MAM
> Other Vertebrate      VRT
> Mus musculus          MUS
> Plant                 PLN
> Prokaryote            PRO
> Other Rodent          ROD
> Synthetic             SYN
> Transgenic            TGN
> Unclassified          UNC
> Viral                 VRL
> ```
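The script `gbk2ENA.py` appears to write the values of `-c` and `-d` into the ID line as given, without checking them. A minimal sketch of a check that could be run beforehand is shown below; the code sets are copied from the two excerpts above, while the names `DATA_CLASSES`, `TAXO_DIVISIONS` and `check_code` are hypothetical and not part of _gbk2ENA_.

```python
# codes taken from sections 3.1 and 3.2 of usrman.txt, as quoted above
DATA_CLASSES = {"CON", "PAT", "EST", "GSS", "HTC", "HTG",
                "MGA", "WGS", "TSA", "STS", "STD"}
TAXO_DIVISIONS = {"PHG", "ENV", "FUN", "HUM", "INV", "MAM", "VRT", "MUS",
                  "PLN", "PRO", "ROD", "SYN", "TGN", "UNC", "VRL"}


def check_code(value, allowed, option):
    """Return value unchanged if it is an allowed code, else raise ValueError."""
    if value not in allowed:
        raise ValueError("invalid value %r for option %s; expected one of: %s"
                         % (value, option, ", ".join(sorted(allowed))))
    return value


# the values used in the example of this README would pass
print(check_code("STD", DATA_CLASSES, "-c"))
print(check_code("PRO", TAXO_DIVISIONS, "-d"))
```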
**Section 3.4.1, concerning the option `-s`**
> ```
> Sequence topology: 'circular' or 'linear'
>
> ```
**Section 3.4.1 Note 1, concerning the option `-m`**
> ```
> Molecule type: this represents the type of molecule as stored and can be
> any value from the list of current values for the mandatory mol_type source
> qualifier. This item should be the same as the value in the mol_type
> qualifier(s) in a given entry.
> ```
## Example
The Genbank file _F.columnare.PH-97028.gbk_ inside the directory _example_ contains the annotated draft assembly of a _Flavobacterium columnare_ strain (Criscuolo et al. 2018) created by the annotation program [_Prokka_](https://github.com/tseemann/prokka) (Seemann 2014).
The following command line creates the file _F.columnare.PH-97028.embl_, suitable for submission to the ENA under the project id PRJEB25044:
```bash
python gbk2ENA.py -i F.columnare.PH-97028.gbk -p PRJEB25044 -t "Draft genome of Flavobacterium columnare strain PH-97028 (= CIP 109753)" -d PRO -o F.columnare.PH-97028.embl
```
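To give an idea of what happens to each record, the sketch below mimics how `gbk2ENA.py` rebuilds the ID line: only the last token (the sequence length) is kept, and the other tokens are replaced by placeholders and by the values of `-s`, `-m`, `-c` and `-d`. The function name `rewrite_id_line` and the input line are illustrative assumptions, as is the standard EMBL layout in which the two-letter line code is followed by three spaces.

```python
def rewrite_id_line(line, topology="linear", molecule="genomic DNA",
                    data_class="STD", division="UNC"):
    """Rebuild an EMBL ID line in the ENA submission style (illustrative)."""
    # the last '; '-separated token holds the sequence length, e.g. '4521 BP.'
    length = line.rstrip("\n").split("; ")[-1]
    tokens = ["ID   XXX", "XXX", topology, molecule, data_class, division, length]
    return "; ".join(tokens)


# hypothetical ID line, as could be found in an intermediate EMBL file
example = "ID   contig_1; SV 1; linear; DNA; STD; UNC; 4521 BP."
print(rewrite_id_line(example, division="PRO"))
# ID   XXX; XXX; linear; genomic DNA; STD; PRO; 4521 BP.
```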
## References
Criscuolo A, Chesneau O, Clermont D, Bizet C (2018) Draft genome sequence of the fish pathogen _Flavobacterium columnare_ genomovar III strain PH-97028 (=CIP 109753). Genome Announcements, 6(14):e00222-18. [doi:10.1128/genomeA.00222-18](https://mra.asm.org/content/6/14/e00222-18)
Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068-2069. [doi:10.1093/bioinformatics/btu153](https://academic.oup.com/bioinformatics/article/30/14/2068/2390517)
#!/usr/bin/env python3.6
# -*- coding: utf-8 -*-
"""
@Author: Melanie Hennart
@PASTEUR_2018
@Python_3.6
"""
#=============================================================================#
"""
gbk2ENA: converting Genbank files into EMBL-like files for submission to ENA
[Version 1.0]
Copyright (C) 2018 Melanie Hennart, Alexis Criscuolo
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Contact:
Institut Pasteur
Biodiversity and Epidemiology of Bacterial Pathogens
Paris, FRANCE
melanie.hennart@pasteur.fr
Institut Pasteur
Bioinformatics and Biostatistics Hub
C3BI, USR 3756 IP CNRS
Paris, FRANCE
alexis.criscuolo@pasteur.fr
"""
#=============================================================================#
from Bio import SeqIO
import os
import argparse
#=== Parameters
parser = argparse.ArgumentParser(prog="gbk2ENA", description="gbk2ENA v.1.0\n\nThis tool converts Genbank files into EMBL-like files for submission to ENA." , formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument("-i", dest="fileInput" , type=str, required=True, help="(mandatory) input file in genbank format")
parser.add_argument("-o", dest="fileOutput" , type=str, required=True, help="(mandatory) output file name")
parser.add_argument("-p", dest="ProjectId" , type=str, required=True, help="(mandatory) project id (PR lines)")
parser.add_argument("-a", dest="Authors" , type=str, required=False, default="Unknown" , help='reference authors (RA lines); default: "Unknown"')
parser.add_argument("-t", dest="Title" , type=str, required=False, default="N/A" , help='reference title (RT lines); default: "N/A"')
parser.add_argument("-s", dest="SeqTopology" , type=str, required=False, default="linear" , help='sequence topology (ID token 3); default: "linear"')
parser.add_argument("-m", dest="MoleculeType", type=str, required=False, default="genomic DNA", help='molecule type (ID token 4); default: "genomic DNA"')
parser.add_argument("-c", dest="DataClass" , type=str, required=False, default="STD" , help='data class (ID token 5); default: "STD"\n(see 3.1 at ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt)')
parser.add_argument("-d", dest="TaxoDiv" , type=str, required=False, default="UNC" , help='taxonomic division (ID token 6); default: "UNC"\n(see 3.2 at ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt)\n ')
args = parser.parse_args()
InputFile = args.fileInput
OutputFile = args.fileOutput
Temp_File = OutputFile+'.temp'
Projet = args.ProjectId
Authors = args.Authors
Title = args.Title
SeqTopology = args.SeqTopology
MoleculeType = args.MoleculeType
DataClass = args.DataClass
TaxoDiv = args.TaxoDiv
#=== Convert "Genbank" => "Embl"
count = SeqIO.convert(InputFile, "genbank", Temp_File, "embl")
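#=== Convert "Embl" => "ENA"
# EMBL flat-file lines start with a two-letter code followed by three spaces
with open(Temp_File, 'r') as File, open(OutputFile, 'w') as OutFile:
    for line in File:
        if line[:2] == 'ID':
            # keep only the sequence length (last ID token, newline included)
            length = line.split('; ')[-1]
            lineID = ["ID   XXX", "XXX", SeqTopology, MoleculeType, DataClass, TaxoDiv, length]
            OutFile.write('; '.join(lineID))
            OutFile.write("XX\n")
            OutFile.write("AC   XXX;\n")
        elif line[:2] == 'AC':
            # reuse the original accession as the submitter's entry name
            Info = line.split('   ')[1]
            OutFile.write("AC * _"+Info)
            OutFile.write("XX\n")
            OutFile.write("PR   Project:"+Projet+";\n")
        elif line[:2] == "KW":
            # the KW line is replaced by the reference author and title lines
            OutFile.write("RA   Submitter, "+Authors+";\n")
            OutFile.write("RT   "+Title+";\n")
        elif line[:2] != "OS" and line[:2] != "OC":
            # OS and OC lines are dropped; every other line is copied unchanged
            OutFile.write(line)
# remove the intermediate EMBL file (portable, and safe with spaces in names)
os.remove(Temp_File)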
print("Converted %i records" % count)