
SimiPlot
SimiPlot is a command line program written in Bash to create SVG figures that represent the overall similarity of one or several genomes against a reference one. SimiPlot is very similar to SimPlot (Lole et al. 1999) and SimPlot++ (Samson et al. 2022). However, SimiPlot runs fast from a simple command line, does not require any prior alignment, and creates alternative pairwise similarity representations based on scatter plots.
SimiPlot runs on UNIX, Linux and most OS X operating systems.
Dependencies
You will need to install the required programs listed in the following table, or to verify that they are already installed with the required version.
program | package | version | sources |
---|---|---|---|
gawk | - | > 4.0.0 | ftp.gnu.org/gnu/gawk |
makeblastdb, blastn | blast+ | ≥ 2.12.0 | ftp.ncbi.nlm.nih.gov/blast/executables/blast+ |
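The version requirements in the table can be checked from the command line before running SimiPlot. The sketch below is not part of SimiPlot: version_ge is a hypothetical helper that compares dotted version numbers with awk, and the commented calls assume that gawk and blastn print their usual version banners.

```bash
# Hypothetical helper (not part of SimiPlot):
# succeeds when dotted version $1 is greater than or equal to $2.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".");
    n = (na > nb) ? na : nb;
    for (i = 1; i <= n; i++) {
      xi = x[i] + 0; yi = y[i] + 0;   # missing or non-numeric suffixes count as 0
      if (xi > yi) exit 0;
      if (xi < yi) exit 1;
    }
    exit 0;                           # identical versions
  }'
}

# Possible usage, assuming banners such as "GNU Awk 5.1.0, ..." and "blastn: 2.12.0+":
# version_ge "$(gawk --version | awk 'NR==1 {gsub(/,/, ""); print $3}')" 4.0.1 || echo "gawk is too old" >&2
# version_ge "$(blastn -version | awk 'NR==1 {print $2}')" 2.12.0 || echo "blast+ is too old" >&2
```

The gawk call compares against 4.0.1 because the table asks for a version strictly greater than 4.0.0.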
Installation and execution
Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/SimiPlot.git
Go to the directory SimiPlot/ and give the execute permission to the file SimiPlot.sh:
cd SimiPlot/
chmod +x SimiPlot.sh
and run it with the following command line model:
./SimiPlot.sh [options]
If at least one of the indicated programs (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), SimiPlot will exit with an error message.
To set a required program that is not available on your $PATH variable, edit the file SimiPlot.sh and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS.
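To find out in advance which binaries would trigger that error, a check along the following lines can be run in the same shell. This is only a sketch: the function name missing_programs is illustrative, and SimiPlot performs its own check independently of it.

```bash
# Illustrative helper (not part of SimiPlot):
# prints every program name given as argument that is not found on $PATH.
missing_programs() {
  for p in "$@"; do
    command -v "$p" >/dev/null 2>&1 || printf '%s\n' "$p"
  done
}

# Check the three binaries listed under Dependencies.
missing=$(missing_programs gawk makeblastdb blastn)
if [ -n "$missing" ]; then
  echo "not found on \$PATH:" $missing >&2
fi
```

Any name reported here is a candidate for a local path in the REQUIREMENTS block.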
Usage
Run SimiPlot without any option to read the following documentation:
USAGE: SimiPlot -r <reffile> -o <outfile> [options] <fasta1> [<fasta2> ...]
OPTIONS:
-r <file> FASTA file containing the reference sequence (mandatory)
-o <file> SVG output file name (mandatory)
-w <int> window size (default: auto)
-s smoothing step (default: not set)
-x <int> x-axis start (default: 0)
-X <int> x-axis end (default: reference length)
-y <int> y-axis start (default: 0)
-Y <int> y-axis end (default: 100)
-d <int> dot size factor (default: 1.0)
-a <real> aspect ratio (default: 3.0)
-t <int> number of threads (default: 2)
-h prints this help and exits
Notes
- For each non-reference input file, SimiPlot decomposes the nucleotide sequence(s) into overlapping equal-length fragments (step = half the fragment length). Each fragment is searched against the reference sequence (option -r) using blastn (Altschul et al. 1990; Camacho et al. 2008) with tuned parameters (as suggested by Goris et al. 2007). For each fragment, only the best BLAST hit is considered (E-value threshold = 0.5). The retained BLAST hits are graphically represented as a scatter plot, where x is the position of the BLAST hit within the reference, y is the percentage of similarity, and the dot radius is proportional to the aligned part of the fragment.
- Each input file should be in FASTA format, not compressed, and may contain one or several nucleotide sequences. At least one input file should be specified.
- Fragment length can be modified using option -w. By default, the fragment length is the reference sequence length divided by 1,000.
- Faster running times can be obtained by using a large number of threads (option -t; default: 2; recommended: ≥ 10).
- Specific regions can be represented by specifying start and end positions within the reference sequence using options -x and -X, respectively. By default, the whole reference sequence is represented. The Y-axis range can also be modified using options -y and -Y (default: 0% and 100% similarity, respectively).
- To obtain more readable figures with a clearer similarity representation, the smoothing option -s is often useful to reduce variability between neighboring dots. Another way is to increase the aspect ratio (i.e. width/height) of the scatter plot using option -a (default: 3.0). Dot size can also be controlled using option -d.
- A different dot color is used for each input file. The first colors are: (1) red, (2) blue, (3) orange, (4) green, (5) gray, (6) brown, (7) dark green, (8) pink, (9) light blue. To associate a given input file to a specific color, change the input file order.
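The fragment decomposition and best-hit selection described in the first note can be illustrated with two small awk helpers. This is a sketch under stated assumptions, not SimiPlot's actual code: the function names fragments and best_hits are hypothetical, the BLAST results are assumed to be in the default 12-column tabular format (blastn -outfmt 6), taking the subject start as the x position is an assumption, and the tuned blastn parameters are not reproduced.

```bash
# Hypothetical helper: prints "start fragment" for every window of length $2
# in sequence $1, moving by half the window length (only full-length windows).
fragments() {
  awk -v seq="$1" -v w="$2" 'BEGIN {
    step = int(w / 2); if (step < 1) step = 1;
    for (s = 1; s + w - 1 <= length(seq); s += step)
      print s, substr(seq, s, w);
  }'
}

# Hypothetical helper: reads BLAST tabular output (12 default columns) and keeps,
# for each query fragment, the hit with the highest bit score among those with
# E-value <= 0.5. Prints: reference position, % identity, alignment length.
best_hits() {
  awk -F'\t' '$11 + 0 <= 0.5 && (!($1 in best) || $12 + 0 > best[$1]) {
      best[$1] = $12 + 0;
      row[$1] = $9 "\t" $3 "\t" $4;   # sstart, pident, length
    }
    END { for (q in row) print row[q] }' "$@" | sort -n
}

# Toy example: a 10-bp sequence cut into 4-bp windows with a 2-bp step.
fragments ACGTACGTAC 4
```

Each line printed by best_hits corresponds to one dot of the scatter plot: its x coordinate, its y coordinate, and the quantity that drives its radius.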
Examples
The directory example/ contains several SVG files created using SimiPlot.
Comparing Enterobacterales genomes
The chromosome sequences of the four bacterial strains Klebsiella pneumoniae NTUH-K2044 (AP006725), K. pneumoniae MGH 78578 (CP000647), Salmonella enterica LT2 (AE006468) and Yersinia pestis CO92 (AL590842) can be downloaded using the following command lines:
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=";
t="K.pneumoniae"; s="NTUH-K2044"; a="AP006725"; wget -q -O $t.$s.fasta $EUTILS$a ;
t="K.pneumoniae"; s="MGH78578"; a="CP000647"; wget -q -O $t.$s.fasta $EUTILS$a ;
t="S.enterica"; s="LT2"; a="AE006468"; wget -q -O $t.$s.fasta $EUTILS$a ;
t="Y.pestis"; s="CO92"; a="AL590842"; wget -q -O $t.$s.fasta $EUTILS$a ;
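Because wget is run quietly (-q), a failed download can leave an empty or incomplete file. A quick sanity check of the downloaded FASTA files might look like the sketch below, where fasta_stats is a hypothetical helper (not part of SimiPlot) that reports the number of sequences and the total number of residues in a file.

```bash
# Hypothetical helper: prints "<number of sequences><TAB><total length>" for FASTA file $1.
fasta_stats() {
  awk '/^>/ { n++; next }                      # header line: one more sequence
       { gsub(/[ \t\r]/, ""); len += length($0) }   # sequence line: add its length
       END { printf "%d\t%d\n", n, len }' "$1"
}

# Possible usage on the files downloaded above:
# for f in *.fasta; do printf '%s\t' "$f"; fasta_stats "$f"; done
```

A file reporting 0 sequences is not a usable input for SimiPlot.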
The overall similarity of the three last chromosomes against K. pneumoniae NTUH-K2044 as reference can be easily drawn using SimiPlot with the following command line:
SimiPlot.sh -t 12 -r K.pneumoniae.NTUH-K2044.fasta -o enterobacterales.1.svg K.pneumoniae.MGH78578.fasta Y.pestis.CO92.fasta S.enterica.LT2.fasta
As the overall scatter plots of Salmonella enterica LT2 and Yersinia pestis CO92 are quite dispersed, the smoothing step (option -s) can be set to obtain a clearer figure:
SimiPlot.sh -t 12 -s -r K.pneumoniae.NTUH-K2044.fasta -o enterobacterales.2.svg K.pneumoniae.MGH78578.fasta Y.pestis.CO92.fasta S.enterica.LT2.fasta
As expected, this second figure clearly shows that Salmonella enterica LT2 is more similar to K. pneumoniae NTUH-K2044 than Yersinia pestis CO92 is.
Interestingly, the figures show that K. pneumoniae NTUH-K2044 has a specific region that is not shared by the three other chromosomes (i.e. approximately between positions 572000 and 625000). The figures also show another region that had been transferred from K. pneumoniae NTUH-K2044 to Yersinia pestis CO92 (i.e. approx. between positions 3380000 and 3490000). To have a better look at this specific region, SimiPlot options to restrict X-axis and Y-axis ranges can be useful, e.g.
SimiPlot.sh -t 12 -s -x 3300000 -X 3500000 -y 50 -r K.pneumoniae.NTUH-K2044.fasta -o enterobacterales.3.svg K.pneumoniae.MGH78578.fasta Y.pestis.CO92.fasta S.enterica.LT2.fasta
This third figure clearly shows that the K. pneumoniae NTUH-K2044 chromosome region 3396000-3472000 is not shared by the very closely related strain MGH 78578, whereas the smaller region 3396000-3430000 is transferred to Yersinia pestis CO92.
To identify the corresponding transferred region within the Yersinia pestis CO92 chromosome, another figure can be generated with SimiPlot by using it as reference, e.g.
SimiPlot.sh -t 12 -s -r Y.pestis.CO92.fasta -o enterobacterales.4.svg K.pneumoniae.NTUH-K2044.fasta
This fourth figure suggests that the transferred (and almost identical) region is around the Yersinia pestis CO92 chromosome position 2150000. However, better readability can be obtained by increasing the dot size factor (option -d), e.g.
SimiPlot.sh -t 12 -s -d 1.5 -r Y.pestis.CO92.fasta -o enterobacterales.5.svg K.pneumoniae.NTUH-K2044.fasta
Comparing Klebsiella genomes
Hennart et al. (2022) reported several cases of hybrid Klebsiella quasipneumoniae genomes containing large segments from Klebsiella pneumoniae genomes.
The chromosome sequences of K. pneumoniae NTUH-K2044 (AP006725), K. quasipneumoniae subsp. quasipneumoniae 01A030T (CP084876) and K. quasipneumoniae subsp. similipneumoniae G747 (CP034136) can be downloaded using the following command lines:
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=";
t="K.pneumoniae"; s="NTUH-K2044"; a="AP006725"; wget -q -O $t.$s.fasta $EUTILS$a ;
t="K.quasipneumoniae.subsp.quasipneumoniae"; s="01A030T"; a="CP084876"; wget -q -O $t.$s.fasta $EUTILS$a ;
t="K.quasipneumoniae.subsp.similipneumoniae"; s="G747"; a="CP034136"; wget -q -O $t.$s.fasta $EUTILS$a ;
The draft genome sequence of Klebsiella sp. 4300STDY6636950 (UFBM00000000), highlighted among others by Hennart et al. (2022), can be downloaded using the following command line:
WGSDWL="https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/wgs_aux/";
t="K.sp"; s="4300STDY6636950"; a="UFBM01"; wget -q -O - $WGSDWL${a:0:2}/${a:2:2}/$a/$a.1.fsa_nt.gz | gunzip -c > $t.$s.fasta ;
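The download URL in the command above is assembled from the WGS project prefix using the Bash substring expansions ${a:0:2} and ${a:2:2}. The sketch below spells out that construction with cut, which gives the same result in any POSIX shell; the function name wgs_url is hypothetical, and the base URL and file suffix are simply those of the command above.

```bash
# Hypothetical helper: builds the download URL for a WGS prefix such as UFBM01.
wgs_url() {
  base="https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/wgs_aux/"
  a="$1"
  d1=$(printf '%s' "$a" | cut -c1-2)   # same as ${a:0:2} in Bash: first two characters
  d2=$(printf '%s' "$a" | cut -c3-4)   # same as ${a:2:2} in Bash: next two characters
  printf '%s%s/%s/%s/%s.1.fsa_nt.gz\n' "$base" "$d1" "$d2" "$a" "$a"
}

wgs_url UFBM01
# prints https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/wgs_aux/UF/BM/UFBM01/UFBM01.1.fsa_nt.gz
```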
The overall similarity of these genomes against the K. quasipneumoniae subsp. similipneumoniae G747 chromosome can be represented by SimiPlot using the following command line:
SimiPlot.sh -t 12 -y 70 -s -r K.quasipneumoniae.subsp.similipneumoniae.G747.fasta -o klebsiella.1.svg K.quasipneumoniae.subsp.quasipneumoniae.01A030T.fasta K.pneumoniae.NTUH-K2044.fasta K.sp.4300STDY6636950.fasta
This first figure shows that the average nucleotide identity (ANI) of K. quasipneumoniae subsp. quasipneumoniae 01A030T and K. pneumoniae NTUH-K2044 against K. quasipneumoniae subsp. similipneumoniae G747 is greater and lower than 95%, respectively, therefore justifying their taxonomic classification. The figure also shows that the strain 4300STDY6636950 is broadly similar to the reference K. quasipneumoniae subsp. similipneumoniae G747. However, a large (500000 bp-long) region of the Klebsiella sp. 4300STDY6636950 genome is as far distant as the K. pneumoniae NTUH-K2044 genome. This irregularity can be verified using K. pneumoniae NTUH-K2044 as reference, e.g.
SimiPlot.sh -t 12 -y 70 -s -r K.pneumoniae.NTUH-K2044.fasta -o klebsiella.2.svg K.quasipneumoniae.subsp.quasipneumoniae.01A030T.fasta K.quasipneumoniae.subsp.similipneumoniae.G747.fasta K.sp.4300STDY6636950.fasta
The transferred K. pneumoniae NTUH-K2044 region can be better assessed with SimiPlot by restricting the X-axis range, e.g.
SimiPlot.sh -t 12 -x 3000000 -X 4000000 -y 70 -s -r K.pneumoniae.NTUH-K2044.fasta -o klebsiella.3.svg K.quasipneumoniae.subsp.quasipneumoniae.01A030T.fasta K.quasipneumoniae.subsp.similipneumoniae.G747.fasta K.sp.4300STDY6636950.fasta
However, a clearer figure can be obtained by increasing the aspect ratio (option -a) and decreasing the dot size (option -d), e.g.
SimiPlot.sh -t 12 -x 3000000 -X 4000000 -y 50 -a 6.0 -d 0.5 -s -r K.pneumoniae.NTUH-K2044.fasta -o klebsiella.4.svg K.quasipneumoniae.subsp.quasipneumoniae.01A030T.fasta K.quasipneumoniae.subsp.similipneumoniae.G747.fasta K.sp.4300STDY6636950.fasta
Comparing SARS-CoV-2 genomes
Temmam et al. (2022) described three SARS-CoV-2 bat-borne genomes (BANAL-20-52, BANAL-20-103, BANAL-20-236) that are more closely related to the reference Wuhan-Hu-1 genome than that from any other bat strain described so far, in particular the one from Rhinolophus affinis, RaTG13.
The genome sequences of SARS-CoV-2 isolates Wuhan-Hu-1 (MN908947), RaTG13 (MN996532), BANAL-20-52 (MZ937000), BANAL-20-103 (MZ937001) and BANAL-20-236 (MZ937003) can be downloaded using the following command lines:
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=";
t="SARS-CoV-2"; s="Wuhan-Hu-1"; a="MN908947"; wget -q -O $t.$s.fasta $EUTILS$a ;
t="SARS-CoV-2"; s="RaTG13"; a="MN996532"; wget -q -O $t.$s.fasta $EUTILS$a ;
t="SARS-CoV-2"; s="BANAL-20-52"; a="MZ937000"; wget -q -O $t.$s.fasta $EUTILS$a ;
t="SARS-CoV-2"; s="BANAL-20-103"; a="MZ937001"; wget -q -O $t.$s.fasta $EUTILS$a ;
t="SARS-CoV-2"; s="BANAL-20-236"; a="MZ937003"; wget -q -O $t.$s.fasta $EUTILS$a ;
The overall similarity of these genomes against the reference Wuhan-Hu-1 can be obtained using the following command line:
SimiPlot.sh -t 4 -y 60 -s -r SARS-CoV-2.Wuhan-Hu-1.fasta -o sars-cov-2.1.svg SARS-CoV-2.RaTG13.fasta SARS-CoV-2.BANAL-20-52.fasta SARS-CoV-2.BANAL-20-103.fasta SARS-CoV-2.BANAL-20-236.fasta
A clearer figure can be obtained by increasing the aspect ratio (option -a) and decreasing the dot size (option -d), e.g.
SimiPlot.sh -t 4 -y 60 -a 9.0 -d 0.4 -s -r SARS-CoV-2.Wuhan-Hu-1.fasta -o sars-cov-2.2.svg SARS-CoV-2.RaTG13.fasta SARS-CoV-2.BANAL-20-52.fasta SARS-CoV-2.BANAL-20-103.fasta SARS-CoV-2.BANAL-20-236.fasta
Another possibility to obtain a clearer figure is to reduce the number of dots by increasing the window size. As the reference Wuhan-Hu-1 genome is short (29903 bp), the shortest window size is used by SimiPlot (i.e. 31 bp). One can then run SimiPlot with larger fragments, e.g. window size of 71 bp:
SimiPlot.sh -t 4 -y 60 -w 71 -d 0.3 -s -r SARS-CoV-2.Wuhan-Hu-1.fasta -o sars-cov-2.3.svg SARS-CoV-2.RaTG13.fasta SARS-CoV-2.BANAL-20-52.fasta SARS-CoV-2.BANAL-20-103.fasta SARS-CoV-2.BANAL-20-236.fasta
Window size of 101 bp:
SimiPlot.sh -t 4 -y 60 -w 101 -d 0.25 -s -r SARS-CoV-2.Wuhan-Hu-1.fasta -o sars-cov-2.4.svg SARS-CoV-2.RaTG13.fasta SARS-CoV-2.BANAL-20-52.fasta SARS-CoV-2.BANAL-20-103.fasta SARS-CoV-2.BANAL-20-236.fasta
Window size of 151 bp:
SimiPlot.sh -t 4 -y 60 -w 151 -d 0.2 -s -r SARS-CoV-2.Wuhan-Hu-1.fasta -o sars-cov-2.5.svg SARS-CoV-2.RaTG13.fasta SARS-CoV-2.BANAL-20-52.fasta SARS-CoV-2.BANAL-20-103.fasta SARS-CoV-2.BANAL-20-236.fasta
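The default window size used for this short genome can be worked out from the two rules stated in this document: the reference length divided by 1,000, and a shortest window of 31 bp. The sketch below assumes integer division and a simple lower bound; default_window is a hypothetical helper, and SimiPlot's exact rounding may differ.

```bash
# Hypothetical helper: default window size for a reference of length $1 (in bp),
# assuming integer division by 1000 and a minimum of 31 bp.
default_window() {
  w=$(( $1 / 1000 ))
  if [ "$w" -lt 31 ]; then
    w=31
  fi
  echo "$w"
}

default_window 29903     # Wuhan-Hu-1 length: 29903 / 1000 = 29, raised to 31
default_window 5000000   # a hypothetical 5-Mb chromosome: 5000
```

Under these assumptions, any reference shorter than 31,000 bp falls back to the 31-bp window, which is why option -w is useful for viral genomes.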
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410. doi:10.1016/S0022-2836(05)80360-2
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2008) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421
Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM (2007) DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. International Journal of Systematic and Evolutionary Microbiology, 57(1):81-91. doi:10.1099/ijs.0.64483-0
Hennart M, Guglielmini J, Maiden MCJ, Jolley KA, Criscuolo A, Brisse S (2022) A dual barcoding approach to bacterial strain nomenclature: Genomic taxonomy of Klebsiella pneumoniae strains. bioRxiv, doi:10.1101/2021.07.26.453808
Lole KS, Bollinger RC, Paranjape RS, Gadkari D, Kulkarni SS, Novak NG, Ingersoll R, Sheppard HW, Ray SC (1999) Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination. Journal of Virology, 73(1):152-160. doi:10.1128/jvi.73.1.152-160.1999
Samson S, Lord É, Makarenkov V (2022) SimPlot++: a Python application for representing sequence similarity and detecting recombination. arXiv, arXiv:2112.09755
Temmam S, Vongphayloth K, Salazar EB, Munier S, Bonomi M, Regnault B, Douangboubpha B, Karami Y, Chrétien D, Sanamxay D, Xayaphet V, Paphaphanh P, Lacoste V, Somlor S, Lakeomany K, Phommavanh N, Pérot P, Dehan O, Amara F, Donati F, Bigot T, Nilges M, Rey FA, van der Werf S, Brey PT, Eloit M (2022) Bat coronaviruses related to SARS-CoV-2 and infectious for human cells. Nature. doi:10.1038/s41586-022-04532-4