_JolyTree_ (named in memory of [Nicolas Joly](https://research.pasteur.fr/en/member/nicolas-joly/)) is a command line script written in [Bash](https://www.gnu.org/software/bash/) that allows a distance-based phylogenetic tree with branch supports to be quickly inferred from non-aligned genome sequences.
_JolyTree_ runs on UNIX, Linux and most OS X operating systems.
## Installation and execution
**A.** Install the following programs and tools, or verify that they are already installed with the required version:
* [mash](http://mash.readthedocs.io/en/latest/) [(Ondov et al. 2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) version >= 1.0.2
  * binaries: [github.com/marbl/Mash/releases](https://github.com/marbl/Mash/releases)
  * sources: [github.com/marbl/Mash](https://github.com/marbl/Mash)
* [gawk](https://www.gnu.org/software/gawk/manual/) version >= 4.1.0
  * sources: [ftp.gnu.org/gnu/gawk](http://ftp.gnu.org/gnu/gawk/)
* [FastME](http://www.atgc-montpellier.fr/fastme/usersguide.php) [(Lefort et al. 2015)](https://doi.org/10.1093/molbev/msv150) version >= 2.1.5.1
  * sources: [gite.lirmm.fr/atgc/FastME](https://gite.lirmm.fr/atgc/FastME)
* [REQ](https://research.pasteur.fr/en/tool/r%ce%b5q-assessing-branch-supports-o%c6%92-a-distance-based-phylogenetic-tree-with-the-rate-o%c6%92-elementary-quartets/) version >= 1.2
  * sources: [gitlab.pasteur.fr/GIPhy/REQ](https://gitlab.pasteur.fr/GIPhy/REQ)
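Before editing anything, it can help to check which of the four tools are already reachable. The following is a minimal sketch and not part of _JolyTree_: the function names `missing_tools` and `version_ge` are hypothetical, and the binary names `mash`, `gawk`, `fastme` and `REQ` are the default values that appear in `JolyTree.sh`. Each tool reports its version in its own way, so the version strings are passed in by hand here.

```bash
# Print the name of every tool that cannot be found on $PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed when dotted version $1 is greater than or equal to dotted version $2.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = (i <= na) ? x[i] + 0 : 0   # missing fields count as 0
      yi = (i <= nb) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Report the required tools that are not on $PATH (assumed default names).
missing=$(missing_tools mash gawk fastme REQ)
if [ -n "$missing" ]; then
  echo "not found on PATH:" $missing
fi
```

For instance, `version_ge 2.1.6.1 2.1.5.1 && echo ok` prints `ok`, because the third field (6) is larger than the required one (5).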
**B.** Clone this repository (e.g. with `git clone`).
**C.** If at least one of the four required binaries (step A) is not available on your `$PATH` variable, edit the file `JolyTree.sh` and indicate the local path to the `mash`, `gawk`, `FastME` and/or `REQ` binary(ies) (approximately between lines 100 and 200):
```bash
#############################################################################################################
# #
# ================ #
# = INSTALLATION = #
# ================ #
# #
# [1] REQUIREMENTS ======================================================================================= #
# JolyTree depends on Mash, gawk, FastME and REQ (see below), each with a minimum version required. You #
# should have them installed on your computer prior to using JolyTree. Make sure that each is installed on #
# your $PATH variable, or specify below the full path to each of them. #
# #
# -- Mash: fast pairwise p-distance estimation -------------------------------------------------------- #
# VERSION >= 1.0.2 #
# src: github.com/marbl/Mash #
# ################################################
################################################
MASH=mash; ## <=== WRITE HERE THE PATH TO THE MASH ##
## BINARY (VERSION 1.0.2 MINIMUM) ##
################################################
################################################
# #
# -- gawk: fast text file processing ------------------------------------------------------------------ #
# VERSION >= 4.1.0 #
# src: ftp.gnu.org/gnu/gawk #
# ################################################
################################################
GAWK=gawk; ## <=== WRITE HERE THE PATH TO THE GAWK ##
## BINARY (VERSION 4.1.0 MINIMUM) ##
################################################
################################################
# #
# -- FastME: fast distance-based phylogenetic tree inference ------------------------------------------ #
# VERSION >= 2.1.5.1 #
# src: gite.lirmm.fr/atgc/FastME/ #
# ################################################
################################################
FASTME=fastme; ## <=== WRITE HERE THE PATH TO THE FASTME ##
## BINARY (VERSION 2.1.5.1 MINIMUM) ##
################################################
################################################
# #
# -- REQ: fast computation of the rates of elementary quartets ---------------------------------------- #
# VERSION >= 1.2 #
# src: gitlab.pasteur.fr/GIPhy/REQ #
# ################################################
################################################
REQ=REQ; ## <=== WRITE HERE THE PATH TO THE REQ ##
## BINARY (VERSION 1.2 MINIMUM) ##
################################################
################################################
# #
#############################################################################################################
```
**D.** Give the execute permission to the file `JolyTree.sh`:
```bash
chmod +x JolyTree.sh
```
**E.** Execute _JolyTree_ with the following command line model:
```bash
./JolyTree.sh [options]
```
## Usage
Run _JolyTree_ without any option to display the following documentation:
```
 USAGE:
    JolyTree.sh  [options]
  where:
    -i <directory>  directory name containing FASTA-formatted contig files; only files
                    ending with .fa, .fna, .fas or .fasta will be considered (mandatory)
    -b <basename>   basename of every written output file (mandatory)
    -s <int>        sketch size (default: 25% of the largest genome size)
    -q <double>     probability of observing a random k-mer (default: 0.0001)
    -k <int>        k-mer size (default: estimated from the average genome size with the
                    probability set by option -q)
    -c <real>       if at least one of the estimated p-distances is above this specified
                    cutoff, then a F81 correction is performed (default: 0.1)
    -n              no BME tree inference (only pairwise distance estimation)
    -r <int>        number of steps when performing the ratchet-based BME tree search
                    (default: 100)
    -t <int>        number of threads (default: 2)
```
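Option `-i` only considers files whose names end with `.fa`, `.fna`, `.fas` or `.fasta`. As a quick sanity check before a run, the sketch below lists the files in a directory that match those four suffixes. It is an illustration rather than code taken from _JolyTree_: `list_fasta_inputs` is a hypothetical helper, and it only looks at the top level of the directory, since the usage text does not say whether subdirectories are scanned.

```bash
# List regular files in directory $1 whose names end with one of the
# four suffixes named in the usage text.
list_fasta_inputs() {
  for f in "$1"/*.fa "$1"/*.fna "$1"/*.fas "$1"/*.fasta; do
    # An unmatched glob stays literal, so keep only paths that exist.
    [ -f "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

# Example: count the candidate input files in ./genomes.
list_fasta_inputs genomes | wc -l
```

A compressed file such as `sample.fa.gz` is not listed, which is a reminder that inputs should be uncompressed and carry one of the expected suffixes.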
## Notes
* Modifying the option -k is not recommended. The optimal value of _k_ is automatically estimated by equation (2) in [Ondov et al. (2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) from the desired probability _q_ of observing a random _k_-mer (option -q). Increasing _q_ (e.g. > 0.001) is not recommended, especially when dealing with distantly related genomes. Lowering _q_ (e.g. < 0.00001) leads to a larger _k_-mer size, which increases the variance of the estimated evolutionary distances.
* Increasing the sketch size (option -s) generally does not modify the inferred phylogenetic tree; on the other hand, setting a sketch size lower than 10,000 is not recommended (except for small genomic sequences, e.g. plasmids, viruses).
* Lowering the cutoff value for correcting the evolutionary distances (option -c) generally does not modify the inferred phylogenetic tree; on the other hand, increasing this cutoff value is strongly discouraged.
* The option -c allows multiple substitutions per character to be accurately estimated when an observed _p_-distance is quite large (e.g. > 0.1; see [Figure 3.1](https://books.google.fr/books?id=3Xc8DwAAQBAJ&pg=PA41) in Nei and Kumar 2000). In such cases, the F81 correction is performed using equation (4) in [Tamura and Kumar (2002)](https://academic.oup.com/mbe/article/19/10/1727/1258975), which estimates the pairwise distance based on the Equal-Input model of evolution ([Felsenstein 1981](https://link.springer.com/article/10.1007/BF01734359); [Tajima and Nei 1982](https://link.springer.com/article/10.1007/BF01810830), [1984](https://academic.oup.com/mbe/article/1/3/269/1244029)). This transformation was chosen because it can be directly computed from a _p_-distance value, and it takes into account putative unequal base frequencies and heterogeneous base composition among lineages.
* Running times are shorter when using multiple threads; of note, only the pairwise distance estimation step benefits from a large number of threads (the other steps are quite fast).
* The verbosity of _JolyTree_ can be reduced by ending the command line with `2>/dev/null`.
* To launch _JolyTree_ on multiple cores on a cluster managed by [SLURM](https://slurm.schedmd.com), edit the file `JolyTree.sh` and read subsection [3] of the _Installation_ section (approximately line 200).
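
Two of the notes above rest on short formulas that are easy to try by hand. The sketch below is illustrative only and is not extracted from `JolyTree.sh`; `kmer_size` and `equal_input_distance` are hypothetical helper names. `kmer_size` evaluates `k = ceil(log4(n(1-q)/q))`, i.e. equation (2) of Ondov et al. (2016) for a four-letter alphabet, a genome of length _n_ and a probability _q_. `equal_input_distance` evaluates the classical equal-input transformation `d = -b ln(1 - p/b)` with `b = 1 - (sum of squared base frequencies)`, under the simplifying assumption that both genomes share a single set of base frequencies; equation (4) of Tamura and Kumar (2002) additionally handles base composition that differs between the two genomes, which this simplified version does not.

```bash
# k-mer size for genome length n ($1) and random k-mer probability q ($2).
kmer_size() {
  awk -v n="$1" -v q="$2" 'BEGIN {
    x = log(n * (1 - q) / q) / log(4)
    k = int(x); if (k < x) k++      # ceiling
    print k
  }'
}

# Equal-input distance from p-distance p ($1) and base frequencies A C G T ($2..$5).
equal_input_distance() {
  awk -v p="$1" -v a="$2" -v c="$3" -v g="$4" -v t="$5" 'BEGIN {
    b = 1 - (a*a + c*c + g*g + t*t)
    printf "%.6f\n", -b * log(1 - p / b)
  }'
}

kmer_size 5500000 0.0001                      # prints 18
equal_input_distance 0.1 0.25 0.25 0.25 0.25  # prints 0.107326
```

With equal base frequencies the second function reduces to the Jukes-Cantor distance, which is why a _p_-distance of 0.1 maps to about 0.107. Lowering _q_ to 0.00001 for the same genome length raises the first result to 20, in line with the note that a smaller _q_ leads to a larger _k_.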
## Example
To illustrate the usefulness of _JolyTree_ and to describe its output files, the following example infers a phylogenetic tree of _Klebsiella_ genomes derived from the analysis of [Rodrigues et al. (2019)](https://doi.org/10.1016/j.resmic.2019.02.003).
The following [Bash](https://www.gnu.org/software/bash/) command lines download 40 _Klebsiella_ genome sequences (36 belonging to the _Klebsiella pneumoniae_ complex, Kp1 to Kp7, and 4 from outgroup species, Kog) from the [NCBI genome repository](https://www.ncbi.nlm.nih.gov/genome) into a directory named _genomes_:
```bash
mkdir genomes/ ;
EUTILS="wget -q -O- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=";
NCBIFTP="wget -q -O- https://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/";
A1='^[A-Z]{2}[0-9]*$'; A2='^[A-Z]{6}[0-9]*$'; Z=".1.fsa_nt.gz";
t="Kp1-K.pneumoniae";
echo -e "SB4-2\tCAAHFS01\nATCC13883_T\tJOOW01\nMGH78578\tCP000647\nSB1139\tCAAHFT01\n5-2\tCAAHGI01\n04A025\tCAAHFZ01\n2-3\tCAAHGH01\nKp13\tCP003999\nNTUH-K2044\tAP006725\nBJ1-GA\tCAAHGC01" |
while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t="Kp2-K.quasipneumoniae.subsp.quasipneumoniae";
echo -e "01A030_T\tCCDF01\nSB1124\tCAAHFU01\nU41\tCAAHGA01\n18A069\tCAAHGF01\n0320584\tCAAHGK01" |
while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t="Kp3-K.variicola";
echo -e "01A065\tCAAHFX01\nF2R9_T\tCAAHGE01\n342\tCP000964\nAt-22\tCP001891" |
while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t="Kp4-K.quasipneumoniae.subsp.similipneumoniae";
echo -e "09A323\tCAAHFV01\n12A476\tCAAHFY01\n07A044_T\tCBZR01\nCIP110288\tCAAHGD01\n1-1\tCAAHGG01" |
while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t="Kp5-K.variicola.subsp.tropicalensis";
echo -e "CDC4241-71\tCAAHGJ01\n814\tCAAHGL01\n885\tCAAHGM01\n1266_T\tCAAHGN01\n1283\tCAAHGO01\n1375\tCAAHGP01" |
while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t="Kp6-K.quasivariicola";
echo -e "08A119\tCAAHGB01\n10982\tAKYX01\nKPN1705\tCP022823\n01-467-2ECBU\tCAAHGR01\n01-310MBV\tCAAHGS01" |
while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t="Kp7-K.africanensis";
echo -e "200023\tCAAHGQ01" |
while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t="Kog-K";
echo -e "oxytoca.ATCC13182\tCAAHFW01\naerogenes.ATCC13048\tQVMZ01\ngrimontii.06D021\tFZTC01\nmichiganensis.DSM25444\tPRDB01" |
while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
```
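The `while read` loops above pack the download logic into a single line. As a reading aid, the sketch below spells out how that line turns an accession into a source URL. It mirrors the expressions used in the script but is not part of it: `accession_url`, `EUTILS_URL` and `WGS_URL` are hypothetical names, and `cut` and `grep` stand in for the Bash substring expansions so that each step is visible. It only builds the URL, so it can be tried without network access.

```bash
EUTILS_URL="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id="
WGS_URL="https://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/"

# Print the URL that the download loop would fetch for accession $1.
accession_url() {
  a=$1
  if printf '%s\n' "$a" | grep -Eq '^[A-Z]{2}[0-9]*$'; then
    # Two letters followed by digits (e.g. CP000647): fetched through E-utilities.
    printf '%s%s\n' "$EUTILS_URL" "$a"
  else
    # WGS prefix (e.g. JOOW01): nested directories made of two-letter chunks.
    p1=$(printf '%s' "$a" | cut -c1-2)
    p2=$(printf '%s' "$a" | cut -c3-4)
    p3=""
    # Six-letter prefixes (e.g. CAAHFS01) add a third directory level.
    if printf '%s\n' "$a" | grep -Eq '^[A-Z]{6}[0-9]*$'; then
      p3="$(printf '%s' "$a" | cut -c5-6)/"
    fi
    printf '%s%s/%s/%s%s/%s.1.fsa_nt.gz\n' "$WGS_URL" "$p1" "$p2" "$p3" "$a" "$a"
  fi
}

accession_url CP000647
accession_url JOOW01
accession_url CAAHFS01
```

In the original loops, the E-utilities response is written to the FASTA file directly, whereas the WGS file is gzip-compressed and is therefore piped through `zcat` first.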
##### Launching _JolyTree_
The following command line launches the script `JolyTree.sh` with default options on 8 threads:
```bash
./JolyTree.sh -i genomes -b klebsiella -t 8 2>/dev/null
```
Of note, the verbosity can be increased by omitting the final `2>/dev/null`.
As the basename was set to 'klebsiella', _JolyTree_ writes the following four output files within a few minutes:
* `klebsiella.acgt`: the A, C, G and T residue counts for each genome,
* `klebsiella.oepl`: every pairwise _p_-distance in [OEPL (One Entry Per Line) format](http://giphy.pasteur.fr/faq/phylogenetics/distance-matrix-file-conversion/#how-to-deal-with-the-one-entry-per-line-oepl-matrix-format),
* `klebsiella.d`: the matrix of (corrected) pairwise evolutionary distances in PHYLIP square format,
* `klebsiella.nwk`: the BME phylogenetic tree in NEWICK format with REQ confidence support at branches.
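
Once a run has finished, the distance matrix can be inspected with standard tools. The following sketch reports the largest value found in a PHYLIP square matrix. It assumes the usual layout (a first line holding the number of taxa, then one row per taxon made of a name followed by whitespace-separated distances) and taxon names without spaces; `max_distance` is a hypothetical helper, not a _JolyTree_ command.

```bash
# Print the largest distance found in PHYLIP square matrix file $1.
max_distance() {
  awk 'NR > 1 {
    for (i = 2; i <= NF; i++) if ($i + 0 > max) max = $i + 0
  }
  END { printf "%.6f\n", max + 0 }' "$1"
}

# Example on a small hand-made matrix (values are made up for illustration).
tmp=$(mktemp)
printf '3\nA 0.00 0.02 0.15\nB 0.02 0.00 0.14\nC 0.15 0.14 0.00\n' > "$tmp"
max_distance "$tmp"   # prints 0.150000
rm -f "$tmp"
```

On the example above, the same function would be called as `max_distance klebsiella.d`.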
## References
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17(6):368-376. [doi:10.1007/BF01734359](https://link.springer.com/article/10.1007/BF01734359).
Lefort V, Desper R, Gascuel O (2015) FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Molecular Biology and Evolution, 32(10):2798-2800. [doi:10.1093/molbev/msv150](https://doi.org/10.1093/molbev/msv150).
Nei M, Kumar S (2000) Molecular Evolution and Phylogenetics. Oxford University Press, Oxford. ISBN: 0-19-513584-9.
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132. [doi:10.1186/s13059-016-0997-x](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x).
Rodrigues C, Passet V, Rakotondrasoa A, Abdoulaye Diallo T, Criscuolo A, Brisse S (2019) Description of _Klebsiella africanensis_ sp. nov., _Klebsiella variicola_ subsp. _tropicalensis_ subsp. nov. and _Klebsiella variicola_ subsp. _variicola_ subsp. nov. Research in Microbiology. [doi:10.1016/j.resmic.2019.02.003](https://doi.org/10.1016/j.resmic.2019.02.003).
Tajima F, Nei M (1982) Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. Journal of Molecular Evolution, 18(2):115-120. [doi:10.1007/BF01810830](https://link.springer.com/article/10.1007/BF01810830).
Tajima F, Nei M (1984) Estimation of evolutionary distance between nucleotide sequences. Molecular Biology and Evolution, 1(3):269-285. [doi:10.1093/oxfordjournals.molbev.a040317](https://academic.oup.com/mbe/article/1/3/269/1244029).
Tamura K, Kumar S (2002) Evolutionary distance estimation under heterogeneous substitution pattern among lineages. Molecular Biology and Evolution, 19(10):1727-1736. [doi:10.1093/oxfordjournals.molbev.a003995](https://academic.oup.com/mbe/article/19/10/1727/1258975).