# JolyTree

_JolyTree_ (named in memory of Nicolas Joly) is a command-line script written in Bash that quickly infers a distance-based phylogenetic tree with branch supports from unaligned genome sequences.
_JolyTree_ runs on UNIX, Linux and most OS X operating systems.

## Installation and execution

**A.** Install the following programs and tools, or verify that they are already installed with the required version:
* mash (Ondov et al. 2016) version >= 1.0.2
  * binaries: [](
  * sources: [](
* gawk version >= 4.1.0
  * sources: [](
* FastME (Lefort et al. 2015) version >=
  * sources: [](
* REQ version >= 1.2
  * sources: [](
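
Before going further, it can help to confirm which of the four programs are reachable from `$PATH`. The following is a minimal sketch, assuming the binaries are named `mash`, `gawk`, `fastme` and `REQ` as in the configuration excerpt of step C; the helper name `missing_tools` is hypothetical, and the sketch does not check the installed versions.

```shell
# Print each command name that cannot be found on $PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Binary names assumed from the installation section; adjust them if yours differ.
missing=$(missing_tools mash gawk fastme REQ)
if [ -n "$missing" ]; then
  echo "not found on \$PATH:" $missing
else
  echo "all four required programs were found on \$PATH"
fi
```

Any program reported as missing is a candidate for the explicit path setting described in step C.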

**B.** Clone this repository with the following command line:
git clone

**C.** If at least one of the four required binaries (step A) is not available on your `$PATH` variable, edit the file `` and indicate the local path to the `mash`, `gawk`, `FastME` and/or `REQ` binary(ies) (approximately between lines 100 and 200):

#                                                                                                           #
# ================                                                                                          #
# = INSTALLATION =                                                                                          #
# ================                                                                                          #
#                                                                                                           #
# [1] REQUIREMENTS =======================================================================================  #
# JolyTree depends on Mash, gawk,  FastME and REQ (see below),  each with a minimum version required.  You  #
# should have them installed on your computer prior to using JolyTree. Make sure that each is installed on  #
# your $PATH variable, or specify below the full path to each of them.                                      #
#                                                                                                           #
# -- Mash: fast pairwise p-distance estimation --------------------------------------------------------     #
#    VERSION >= 1.0.2                                                                                       #
#    src:                                                                             #
#                                                                 ################################################
  MASH=mash;                                                      ## <=== WRITE HERE THE PATH TO THE MASH       ##
                                                                  ##      BINARY (VERSION 1.0.2 MINIMUM)        ##
#                                                                                                           #
# -- gawk: fast text file processing ------------------------------------------------------------------     #
#    VERSION >= 4.1.0                                                                                       #
#    src:                                                                              #
#                                                                 ################################################
  GAWK=gawk;                                                      ## <=== WRITE HERE THE PATH TO THE GAWK       ##
                                                                  ##      BINARY (VERSION 4.1.0 MINIMUM)        ##
#                                                                                                           #
# -- FastME: fast distance-based phylogenetic tree inference ------------------------------------------     #
#    VERSION >=                                                                                     #
#    src:                                                                        #
#                                                                 ################################################
  FASTME=fastme;                                                  ## <=== WRITE HERE THE PATH TO THE FASTME     ##
                                                                  ##      BINARY (VERSION MINIMUM)      ##
#                                                                                                           #
# -- REQ: fast computation of the rates of elementary quartets ----------------------------------------     #
#    VERSION >= 1.2                                                                                         #
#    src:                                                                       #
#                                                                 ################################################
  REQ=REQ;                                                        ## <=== WRITE HERE THE PATH TO THE REQ        ##
                                                                  ##      BINARY (VERSION 1.2 MINIMUM)          ##
#                                                                                                           #


**D.** Give the execute permission to the file ``:
chmod +x

**E.** Execute _JolyTree_ with the following command line model:
./  [options]

## Usage

Launch _JolyTree_ without any option to display the following documentation:

 USAGE:  [options]
    -i <directory>  directory name containing  FASTA-formatted contig files;  only files
                    ending with .fa, .fna, .fas or .fasta will be considered (mandatory)
    -b <basename>   basename of every written output file (mandatory)
    -s <int>        sketch size (default: 25% of the largest genome size)
    -q <double>     probability of observing a random k-mer (default: 0.0001)
    -k <int>        k-mer size (default: estimated from the average genome size with the
                    probability set by option -q)
    -c <real>       if at least one of the estimated p-distances is above this specified
                    cutoff, then a F81 correction is performed (default: 0.1)
    -n              no BME tree inference (only pairwise distance estimation)
    -r <int>        number of steps  when performing the  ratchet-based  BME tree search
                    (default: 100)
    -t <int>        number of threads (default: 2)
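
The default of option -s is documented above as 25% of the largest genome size. The sketch below shows one way to obtain a comparable value from a set of FASTA files; it is an illustration under assumptions, not the computation performed by _JolyTree_ itself, and the function names `genome_size`, `largest_genome` and `default_sketch_size` are hypothetical. Plain `awk` is used so that the sketch does not depend on gawk.

```shell
# Total number of residues in a FASTA file (header lines starting with '>' are skipped).
genome_size() {
  awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Largest genome size among the FASTA files given as arguments.
largest_genome() {
  local f size max=0
  for f in "$@"; do
    size=$(genome_size "$f")
    if [ "$size" -gt "$max" ]; then max=$size; fi
  done
  echo "$max"
}

# 25% of the largest genome size, rounded down (assumed rounding).
default_sketch_size() {
  echo $(( $(largest_genome "$@") / 4 ))
}
```

For example, `default_sketch_size genomes/*.fa` prints a value that can be compared with the sketch size passed through option -s.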

## Notes

* Modifying option -k is not recommended. The optimal value of _k_ is estimated automatically with equation (2) in Ondov et al. (2016) from the desired probability _q_ of observing a random _k_-mer (option -q). Increasing _q_ (e.g. > 0.001) is not recommended, especially when dealing with distantly related genomes. Lowering _q_ (e.g. < 0.00001) leads to a larger _k_-mer size, which increases the variance of the estimated evolutionary distances.

* Increasing the sketch size (option -s) does not generally modify the inferred phylogenetic tree; on the other hand, setting a sketch size lower than 10,000 is not recommended (except for small genomic sequences, e.g. plasmids, viruses).

* Lowering the cutoff value for correcting the evolutionary distances (option -c) generally does not modify the inferred phylogenetic tree; on the other hand, increasing this cutoff value is strongly discouraged.

* The option -c allows multiple substitutions per character to be accurately estimated when an observed _p_-distance is quite large (e.g. > 0.1; see Figure 3.1 in Nei and Kumar 2000). In such cases, the F81 correction is performed using equation (4) in Tamura and Kumar (2002), which estimates the pairwise distance based on the Equal-Input model of evolution (Felsenstein 1981; Tajima and Nei 1982, 1984). This transformation was chosen because it can be computed directly from a _p_-distance value, and it takes into account putative unequal base frequencies and heterogeneous base composition among lineages.

* Running times decrease when using multiple threads; of note, only the pairwise distance estimation step benefits from a large number of threads (the other steps are quite fast).

* The verbosity of _JolyTree_ can be reduced by ending the command line with `2>/dev/null`.
* To launch _JolyTree_ on multiple cores on a cluster managed by SLURM, edit the file `` and read subsection [3] of the _Installation_ section (approximately line 200).
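
The notes on options -k and -c can be illustrated with two small computations. The first function evaluates equation (2) of Ondov et al. (2016), k = ceil(log4(n(1 - q)/q)), for a genome size n and a probability q. The second is a deliberately simpler stand-in for the distance correction: it applies the homogeneous equal-input (F81) formula d = -b ln(1 - p/b), with b = 1 - sum of squared base frequencies, and is not equation (4) of Tamura and Kumar (2002), which additionally accounts for composition differences between the two genomes. Both function names are hypothetical, and this is a sketch rather than the code used by _JolyTree_.

```shell
# k-mer size for a genome of n bases and a desired probability q of observing
# a random k-mer: k = ceil( log4( n * (1 - q) / q ) ).
kmer_size() {   # usage: kmer_size n q
  awk -v n="$1" -v q="$2" 'BEGIN {
    x = log(n * (1 - q) / q) / log(4)
    k = int(x); if (k < x) k++        # ceiling of x
    print k
  }'
}

# Homogeneous equal-input (F81) distance from a p-distance and A, C, G, T counts:
# d = -b * ln(1 - p/b), with b = 1 - sum of squared base frequencies.
f81_distance() {   # usage: f81_distance p A C G T
  awk -v p="$1" -v a="$2" -v c="$3" -v g="$4" -v t="$5" 'BEGIN {
    s = a + c + g + t
    b = 1 - (a*a + c*c + g*g + t*t) / (s*s)
    if (p >= b) { print "inf"; exit }  # the distance is undefined at saturation
    printf "%.4f\n", -b * log(1 - p / b)
  }'
}

kmer_size 5000000 0.0001      # 5 Mb genome with the default q
f81_distance 0.1 1 1 1 1      # equal base frequencies (Jukes-Cantor case)
```

With these inputs the two calls print 18 and 0.1073: a _p_-distance of 0.1 is only slightly stretched by the correction, and the gap widens as _p_ grows.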

## Example

To illustrate the usefulness of _JolyTree_ and to describe its output files, the following example infers a phylogenetic tree of _Klebsiella_ genomes derived from the analysis of Rodrigues et al. (2019).

##### Downloading genome sequences

The following Bash command lines download 40 _Klebsiella_ genome sequences (36 belonging to the _Klebsiella pneumoniae_ complex, Kp1 to Kp7, and 4 from outgroup species, Kog) from the NCBI genome repository into a directory named _genomes_:

mkdir genomes/ ;
EUTILS="wget -q -O-";
NCBIFTP="wget -q -O-";
A1='^[A-Z]{2}[0-9]*$'; A2='^[A-Z]{6}[0-9]*$'; Z=".1.fsa_nt.gz";

# t is the group label (Kp1 to Kp7, Kog) prefixed to each output file name;
# the label given to each block below is an assumption based on the order of the blocks
t=Kp1; echo -e "SB4-2\tCAAHFS01\nATCC13883_T\tJOOW01\nMGH78578\tCP000647\nSB1139\tCAAHFT01\n5-2\tCAAHGI01\n04A025\tCAAHFZ01\n2-3\tCAAHGH01\nKp13\tCP003999\nNTUH-K2044\tAP006725\nBJ1-GA\tCAAHGC01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t=Kp2; echo -e "01A030_T\tCCDF01\nSB1124\tCAAHFU01\nU41\tCAAHGA01\n18A069\tCAAHGF01\n0320584\tCAAHGK01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t=Kp3; echo -e "01A065\tCAAHFX01\nF2R9_T\tCAAHGE01\n342\tCP000964\nAt-22\tCP001891" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t=Kp4; echo -e "09A323\tCAAHFV01\n12A476\tCAAHFY01\n07A044_T\tCBZR01\nCIP110288\tCAAHGD01\n1-1\tCAAHGG01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t=Kp5; echo -e "CDC4241-71\tCAAHGJ01\n814\tCAAHGL01\n885\tCAAHGM01\n1266_T\tCAAHGN01\n1283\tCAAHGO01\n1375\tCAAHGP01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t=Kp6; echo -e "08A119\tCAAHGB01\n10982\tAKYX01\nKPN1705\tCP022823\n01-467-2ECBU\tCAAHGR01\n01-310MBV\tCAAHGS01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t=Kp7; echo -e "200023\tCAAHGQ01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
t=Kog; echo -e "oxytoca.ATCC13182\tCAAHFW01\naerogenes.ATCC13048\tQVMZ01\ngrimontii.06D021\tFZTC01\nmichiganensis.DSM25444\tPRDB01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
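
Because each output file is created by the shell redirection before any data arrives, a failed download leaves an empty file behind. The sketch below counts the files that option -i would consider (names ending with .fa, .fna, .fas or .fasta) and lists the empty ones. The function names are hypothetical, and looking only at the top level of the directory is an assumption about how the input directory is read.

```shell
# List the files at the top level of a directory whose names end with one of
# the extensions considered by option -i.
list_fasta() {
  find "$1" -maxdepth 1 -type f \
    \( -name '*.fa' -o -name '*.fna' -o -name '*.fas' -o -name '*.fasta' \)
}

# Number of such files.
count_fasta() {
  list_fasta "$1" | wc -l | tr -d ' '
}

# Names of the files that are empty (e.g. after a failed download).
empty_fasta() {
  list_fasta "$1" | while IFS= read -r f; do
    [ -s "$f" ] || echo "$f"
  done
}
```

After the download step, `count_fasta genomes` is expected to print 40 and `empty_fasta genomes` to print nothing.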

##### Launching _JolyTree_

The following command line launches the script `` with default options on 8 threads:
./  -i genomes  -b klebsiella  -t 8  2>/dev/null
Of note, the verbosity can be increased by omitting the final `2>/dev/null`.

As the basename was set to 'klebsiella', _JolyTree_ writes the following four output files within a few minutes:

* `klebsiella.acgt`: the A, C, G and T residue counts for each genome,
* `klebsiella.oepl`: every pairwise _p_-distance in OEPL (One Entry Per Line) format,
* `klebsiella.d`: the matrix of (corrected) pairwise evolutionary distances in PHYLIP square format,
* `klebsiella.nwk`: the BME phylogenetic tree in NEWICK format with REQ confidence support at branches.
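
As a quick sanity check on the distance matrix, the largest pairwise distance can be extracted with a few lines of `awk`. This is a minimal sketch, assuming the usual PHYLIP square layout (a first line with the number of taxa, then one line per taxon made of its name followed by its distances) and taxon names without whitespace; the function name `max_distance` is hypothetical.

```shell
# Largest value found in a PHYLIP square distance matrix.
max_distance() {
  awk 'NR > 1 { for (i = 2; i <= NF; i++) if ($i + 0 > m) m = $i + 0 }
       END    { printf "%g\n", m }' "$1"
}
```

For the example above, the call would be `max_distance klebsiella.d`.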

## References

Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17(6):368-376. doi:10.1007/BF01734359

Lefort V, Desper R, Gascuel O (2015) FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Molecular Biology and Evolution, 32(10):2798-2800. doi:10.1093/molbev/msv150

Nei M, Kumar S (2000) Molecular Evolution and Phylogenetics. Oxford University Press, Oxford. ISBN: 0-19-513584-9.

Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132. doi:10.1186/s13059-016-0997-x

Rodrigues C, Passet V, Rakotondrasoa A, Abdoulaye Diallo T, Criscuolo A, Brisse S (2019) Description of _Klebsiella africanensis_ sp. nov., _Klebsiella variicola_ subsp. _tropicalensis_ subsp. nov. and _Klebsiella variicola_ subsp. _variicola_ subsp. nov. Research in Microbiology. doi:10.1016/j.resmic.2019.02.003

Tajima F, Nei M (1982) Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. Journal of Molecular Evolution, 18(2):115-120. doi:10.1007/BF01810830

Tajima F, Nei M (1984) Estimation of evolutionary distance between nucleotide sequences. Molecular Biology and Evolution, 1(3):269-285. doi:10.1093/oxfordjournals.molbev.a040317

Tamura K, Kumar S (2002) Evolutionary distance estimation under heterogeneous substitution pattern among lineages. Molecular Biology and Evolution, 19(10):1727-1736. doi:10.1093/oxfordjournals.molbev.a003995