Commit 6c15b2ab authored by Alexis CRISCUOLO

Initial commit

# gbk2ENA
_gbk2ENA_ is a command line program written in Python that converts a standard Genbank file into an EMBL-like file suitable for submission to the European Nucleotide Archive (ENA).
For more details about the sequence annotation format required for submitting to [ENA](, see [](
## Installation and execution
Clone this repository with the following command line:
git clone
Verify that Python (2.7 or higher) is installed, as well as Biopython (1.43 or higher).
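One way to check both prerequisites is a short script such as the sketch below. The helper name `installed_version` is hypothetical and not part of _gbk2ENA_; `Bio` is the import name of the Biopython package.

```python
import importlib
import sys


def installed_version(module_name):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Some modules do not expose __version__; report that explicitly.
    return getattr(module, "__version__", "unknown")


print("Python", sys.version.split()[0])
print("Biopython", installed_version("Bio") or "not installed")
```

If the second line prints `not installed`, Biopython has to be installed before running the converter.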
Execute the file `` available inside the _src_ directory with the following command line model:
python [options]
## Usage
Launch _gbk2ENA_ with option `-h` to read the following documentation:
This tool converts Genbank files into EMBL-like files for submission to ENA
optional arguments:
-h, --help show this help message and exit
-i FILEINPUT (mandatory) input file in genbank format
-o FILEOUTPUT (mandatory) output file name
-p PROJECTID (mandatory) project id (PR lines)
-a AUTHORS reference authors (RA lines); default: "Unknown"
-t TITLE reference title (RT lines); default: "N/A"
-s SEQTOPOLOGY sequence topology (ID token 3); default: "linear"
-m MOLECULETYPE molecule type (ID token 4); default: "genomic DNA"
-c DATACLASS data class (ID token 5); default: "STD"
(see 3.1 at
-d TAXODIV taxonomic division (ID token 6); default: "UNC"
(see 3.2 at
Below are some useful excerpts from [](
**Section 3.1, concerning the option `-c`**
> ```
> The data class of each entry, representing a methodological approach to the
> generation of the data or a type of data, is indicated on the first (ID) line
> of the entry. Each entry belongs to exactly one data class.
> Class        Definition
> -----------  -----------------------------------------------------------
> CON          Entry constructed from segment entry sequences; if unannotated,
>              annotation may be drawn from segment entries
> PAT          Patent
> EST          Expressed Sequence Tag
> GSS          Genome Survey Sequence
> HTC          High Throughput cDNA sequencing
> HTG          High Throughput Genome sequencing
> MGA          Mass Genome Annotation
> WGS          Whole Genome Shotgun
> TSA          Transcriptome Shotgun Assembly
> STS          Sequence Tagged Site
> STD          Standard (all entries not classified as above)
> ```
**Section 3.2, concerning the option `-d`**
> ```
> The entries which constitute the database are grouped into taxonomic divisions,
> the object being to create subsets of the database which reflect areas of
> interest for many users.
> In addition to the division, each entry contains a full taxonomic
> classification of the organism that was the source of the stored sequence,
> from kingdom down to genus and species (see below).
> Each entry belongs to exactly one taxonomic division. The ID line of each entry
> indicates its taxonomic division, using the three letter codes shown below:
> Division               Code
> ---------------------  ----
> Bacteriophage          PHG
> Environmental Sample   ENV
> Fungal                 FUN
> Human                  HUM
> Invertebrate           INV
> Other Mammal           MAM
> Other Vertebrate       VRT
> Mus musculus           MUS
> Plant                  PLN
> Prokaryote             PRO
> Other Rodent           ROD
> Synthetic              SYN
> Transgenic             TGN
> Unclassified           UNC
> Viral                  VRL
> ```
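The script writes the values of `-c` and `-d` into the ID line without checking them, so a quick check before running the conversion can catch typos. The following is a minimal sketch, assuming the code lists in the Section 3.1 and 3.2 excerpts above are complete; the names `DATA_CLASSES`, `TAXONOMIC_DIVISIONS` and `check_codes` are hypothetical and not part of _gbk2ENA_.

```python
# Codes copied from the Section 3.1 (data class) and 3.2 (division) excerpts.
DATA_CLASSES = {"CON", "PAT", "EST", "GSS", "HTC", "HTG",
                "MGA", "WGS", "TSA", "STS", "STD"}
TAXONOMIC_DIVISIONS = {"PHG", "ENV", "FUN", "HUM", "INV", "MAM", "VRT", "MUS",
                       "PLN", "PRO", "ROD", "SYN", "TGN", "UNC", "VRL"}


def check_codes(data_class="STD", tax_div="UNC"):
    """Return a list of error messages; the list is empty if both codes are known."""
    errors = []
    if data_class not in DATA_CLASSES:
        errors.append("unknown data class: %s" % data_class)
    if tax_div not in TAXONOMIC_DIVISIONS:
        errors.append("unknown taxonomic division: %s" % tax_div)
    return errors


print(check_codes("STD", "PRO"))  # both codes appear in the tables
print(check_codes("STD", "BAC"))  # "BAC" is not a listed division code
```

The defaults mirror the documented defaults of `-c` (`STD`) and `-d` (`UNC`).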
**Section 3.4.1, concerning the option `-s`**
> ```
> Sequence topology: 'circular' or 'linear'
> ```
**Section 3.4.1 Note 1, concerning the option `-m`**
> ```
> Molecule type: this represents the type of molecule as stored and can be
> any value from the list of current values for the mandatory mol_type source
> qualifier. This item should be the same as the value in the mol_type
> qualifier(s) in a given entry.
> ```
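To illustrate how the options `-s`, `-m`, `-c` and `-d` end up as tokens 3 to 6 of the ID line, here is a minimal sketch. It joins the tokens with `"; "` and uses `XXX` placeholders for the first two tokens, as the script does. The function name `build_id_line`, the topology check and the sample length are assumptions made for illustration, as is the three-space gap after `ID`, which follows the usual EMBL flat-file layout.

```python
def build_id_line(length_bp, topology="linear", mol_type="genomic DNA",
                  data_class="STD", tax_div="UNC"):
    """Assemble an EMBL-like ID line from its seven tokens."""
    # Section 3.4.1 allows only these two topologies (check added for illustration).
    if topology not in ("linear", "circular"):
        raise ValueError("topology must be 'linear' or 'circular'")
    tokens = [
        "ID   XXX",             # token 1: placeholder accession
        "XXX",                  # token 2: placeholder sequence version
        topology,               # token 3: option -s
        mol_type,               # token 4: option -m
        data_class,             # token 5: option -c
        tax_div,                # token 6: option -d
        "%d BP." % length_bp,   # token 7: sequence length
    ]
    return "; ".join(tokens)


# Hypothetical 1500 bp prokaryotic contig.
print(build_id_line(1500, tax_div="PRO"))
```

With the hypothetical length above, the printed line is `ID   XXX; XXX; linear; genomic DNA; STD; PRO; 1500 BP.`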
## Example
The Genbank file _F.columnare.PH-97028.gbk_ inside the directory _example_ contains the annotated draft assembly of a _Flavobacterium columnare_ strain (Criscuolo et al. 2018) created by the annotation program _Prokka_ (Seemann 2014).
The following command line creates the file _F.columnare.PH-97028.embl_, suitable for submission to the ENA under the project id PRJEB25044:
python -i F.columnare.PH-97028.gbk -p PRJEB25044 -t "Draft genome of Flavobacterium columnare strain PH-97028 (= CIP 109753)" -d PRO -o F.columnare.PH-97028.embl
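After the conversion, it can be useful to confirm that the project id was written to the PR lines of the output. The sketch below is one way to do this with plain text parsing. The function name `project_ids` and the sample lines are hypothetical; the `PR   Project:<id>;` layout is assumed from the line the script writes.

```python
import re

# A PR line is expected to look like "PR   Project:PRJEB25044;"
PR_LINE = re.compile(r"^PR\s+Project:([^;\s]+);")


def project_ids(lines):
    """Collect the project ids found on PR lines, in order of appearance."""
    ids = []
    for line in lines:
        match = PR_LINE.match(line)
        if match:
            ids.append(match.group(1))
    return ids


# Hypothetical first lines of a converted record.
sample = [
    "ID   XXX; XXX; linear; genomic DNA; STD; PRO; 1500 BP.\n",
    "AC   XXX;\n",
    "PR   Project:PRJEB25044;\n",
]
print(project_ids(sample))
```

For a real file, the same function can be applied to an open file handle, for example `project_ids(open("F.columnare.PH-97028.embl"))`, which should return one id per record.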
## References
Criscuolo A, Chesneau O, Clermont D, Bizet C (2018) Draft genome sequence of the fish pathogen _Flavobacterium columnare_ genomovar III strain PH-97028 (=CIP 109753). Genome Announcements, 6(14):e00222-18. doi:10.1128/genomeA.00222-18
Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068-2069. doi:10.1093/bioinformatics/btu153
#!/usr/bin/env python3.6
# -*- coding: utf-8 -*-
"""
@Author: Melanie Hennart

gbk2ENA: converting Genbank files into EMBL-like files for submission to ENA
[Version 1.0]

Copyright (C) 2018 Melanie Hennart, Alexis Criscuolo

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <>.

Institut Pasteur
Biodiversity and Epidemiology of Bacterial Pathogens

Institut Pasteur
Bioinformatics and Biostatistics Hub
"""
from Bio import SeqIO
import os
import argparse
#=== Parameters
parser = argparse.ArgumentParser(prog="gbk2ENA", description="gbk2ENA v.1.0\n\nThis tool converts Genbank files into EMBL-like files for submission to ENA." , formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument("-i", dest="fileInput" , type=str, required=True, help="(mandatory) input file in genbank format")
parser.add_argument("-o", dest="fileOutput" , type=str, required=True, help="(mandatory) output file name")
parser.add_argument("-p", dest="ProjectId" , type=str, required=True, help="(mandatory) project id (PR lines)")
parser.add_argument("-a", dest="Authors" , type=str, required=False, default="Unknown" , help='reference authors (RA lines); default: "Unknown"')
parser.add_argument("-t", dest="Title" , type=str, required=False, default="N/A" , help='reference title (RT lines); default: "N/A"')
parser.add_argument("-s", dest="SeqTopology" , type=str, required=False, default="linear" , help='sequence topology (ID token 3); default: "linear"')
parser.add_argument("-m", dest="MoleculeType", type=str, required=False, default="genomic DNA", help='molecule type (ID token 4); default: "genomic DNA"')
parser.add_argument("-c", dest="DataClass" , type=str, required=False, default="STD" , help='data class (ID token 5); default: "STD"\n(see 3.1 at')
parser.add_argument("-d", dest="TaxoDiv" , type=str, required=False, default="UNC" , help='taxonomic division (ID token 6); default: "UNC"\n(see 3.2 at\n ')
args = parser.parse_args()
InputFile = args.fileInput
OutputFile = args.fileOutput
Temp_File = OutputFile+'.temp'
Projet = args.ProjectId
Authors = args.Authors
Title = args.Title
SeqTopology = args.SeqTopology
MoleculeType = args.MoleculeType
DataClass = args.DataClass
TaxoDiv = args.TaxoDiv
#=== Convert "Genbank" => "Embl"
count = SeqIO.convert(InputFile, "genbank", Temp_File, "embl")
#=== Convert "Embl" => "ENA"
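# EMBL line codes (ID, AC, PR, ...) are followed by three spaces.
with open(Temp_File, 'r') as File, open(OutputFile, 'w') as OutFile:
    for line in File:
        if line[:2] == 'ID':
            # keep only the last ID token (sequence length) from the Biopython output
            length = line.split('; ')[-1]
            lineID = ["ID   XXX", "XXX", SeqTopology, MoleculeType, DataClass, TaxoDiv, length]
            OutFile.write('; '.join(lineID))
            OutFile.write("AC   XXX;\n")
        elif line[:2] == 'AC':
            # the original accession is kept on an "AC * _" line, followed by the project
            Info = line.split('   ')[1]
            OutFile.write("AC * _" + Info)
            OutFile.write("PR   Project:" + Projet + ";\n")
        elif line[:2] == "KW":
            # the reference author and title lines are written in place of the KW line
            OutFile.write("RA   Submitter, " + Authors + ";\n")
            OutFile.write("RT   " + Title + ";\n")
        elif line[:2] != "OS" and line[:2] != "OC":
            # assumed: every other line (except OS and OC) is copied unchanged
            OutFile.write(line)
# remove the intermediate EMBL file once both files are closed
os.remove(Temp_File)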
print("Converted %i records" % count)