
# JolyTree

_JolyTree_ (named in memory of [Nicolas Joly](https://research.pasteur.fr/en/member/nicolas-joly/)) is a command line script written in [Bash](https://www.gnu.org/software/bash/) that allows a distance-based phylogenetic tree with branch supports to be quickly inferred from non-aligned genome sequences.
_JolyTree_ runs on UNIX, Linux and most OS X operating systems.

## Installation and execution

**A.** Install the following programs and tools, or verify that they are already installed with the required version:
* [mash](http://mash.readthedocs.io/en/latest/) [(Ondov et al. 2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) version >= 1.0.2
  * binaries: [github.com/marbl/Mash/releases](https://github.com/marbl/Mash/releases)
  * sources: [github.com/marbl/Mash](https://github.com/marbl/Mash)
* [gawk](https://www.gnu.org/software/gawk/manual/) version >= 4.1.0
  * sources: [ftp.gnu.org/gnu/gawk](http://ftp.gnu.org/gnu/gawk/)
* [FastME](http://www.atgc-montpellier.fr/fastme/usersguide.php) [(Lefort et al. 2015)](https://doi.org/10.1093/molbev/msv150) version >= 2.1.5.1
  * sources: [gite.lirmm.fr/atgc/FastME](https://gite.lirmm.fr/atgc/FastME)
* [REQ](https://research.pasteur.fr/en/tool/r%ce%b5q-assessing-branch-supports-o%c6%92-a-distance-based-phylogenetic-tree-with-the-rate-o%c6%92-elementary-quartets/) version >= 1.2
  * sources: [gitlab.pasteur.fr/GIPhy/REQ](https://gitlab.pasteur.fr/GIPhy/REQ)
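
The availability of the four programs listed above can be checked quickly from a terminal. The following is a minimal sketch, not part of _JolyTree_: the function name `check_tools` is hypothetical, and the binary names (`mash`, `gawk`, `fastme`, `REQ`) are assumed to be the default ones set in `JolyTree.sh`. It only tests whether each program is found on `$PATH`; it does not verify the minimum versions.

```bash
# Hypothetical helper: report which of the given programs are found on $PATH.
# Returns 0 when all of them are found, 1 otherwise.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool -> $(command -v "$tool")"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Default binary names assumed from JolyTree.sh
check_tools mash gawk fastme REQ ||
  echo "some tools are missing: install them or set their paths in JolyTree.sh" >&2
```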

**B.** Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/JolyTree.git
```

**C.** If at least one of the four required binaries (step A) is not available on your `$PATH` variable, edit the file `JolyTree.sh` and indicate the local path to the `mash`, `gawk`, `FastME` and/or `REQ` binary(ies) (approximately between lines 100 and 200):

```bash
#############################################################################################################
#                                                                                                           #
# ================                                                                                          #
# = INSTALLATION =                                                                                          #
# ================                                                                                          #
#                                                                                                           #
# [1] REQUIREMENTS =======================================================================================  #
# JolyTree depends on Mash, gawk,  FastME and REQ (see below),  each with a minimum version required.  You  #
# should have them installed on your computer prior to using JolyTree. Make sure that each is installed on  #
# your $PATH variable, or specify below the full path to each of them.                                      #
#                                                                                                           #
# -- Mash: fast pairwise p-distance estimation --------------------------------------------------------     #
#    VERSION >= 1.0.2                                                                                       #
#    src: github.com/marbl/Mash                                                                             #
#                                                                 ################################################
                                                                  ################################################
  MASH=mash;                                                      ## <=== WRITE HERE THE PATH TO THE MASH       ##
                                                                  ##      BINARY (VERSION 1.0.2 MINIMUM)        ##
                                                                  ################################################
                                                                  ################################################
#                                                                                                           #
# -- gawk: fast text file processing ------------------------------------------------------------------     #
#    VERSION >= 4.1.0                                                                                       #
#    src: ftp.gnu.org/gnu/gawk                                                                              #
#                                                                 ################################################
                                                                  ################################################
  GAWK=gawk;                                                      ## <=== WRITE HERE THE PATH TO THE GAWK       ##
                                                                  ##      BINARY (VERSION 4.1.0 MINIMUM)        ##
                                                                  ################################################
                                                                  ################################################
#                                                                                                           #
# -- FastME: fast distance-based phylogenetic tree inference ------------------------------------------     #
#    VERSION >= 2.1.5.1                                                                                     #
#    src: gite.lirmm.fr/atgc/FastME/                                                                        #
#                                                                 ################################################
                                                                  ################################################
  FASTME=fastme;                                                  ## <=== WRITE HERE THE PATH TO THE FASTME     ##
                                                                  ##      BINARY (VERSION 2.1.5.1 MINIMUM)      ##
                                                                  ################################################
                                                                  ################################################
#                                                                                                           #
# -- REQ: fast computation of the rates of elementary quartets ----------------------------------------     #
#    VERSION >= 1.2                                                                                         #
#    src: gitlab.pasteur.fr/GIPhy/REQ                                                                       #
#                                                                 ################################################
                                                                  ################################################
  REQ=REQ;                                                        ## <=== WRITE HERE THE PATH TO THE REQ        ##
                                                                  ##      BINARY (VERSION 1.2 MINIMUM)          ##
                                                                  ################################################
                                                                  ################################################
#                                                                                                           #
#############################################################################################################

```

**D.** Give the execute permission to the file `JolyTree.sh`:
```bash
chmod +x JolyTree.sh
```

**E.** Execute _JolyTree_ with the following command line model:
```bash
./JolyTree.sh  [options]
```

## Usage

Launch _JolyTree_ without any option to read the following documentation:

```
 USAGE:
    JolyTree.sh  [options]
 where:
    -i <directory>  directory name containing  FASTA-formatted contig files;  only files
                    ending with .fa, .fna, .fas or .fasta will be considered (mandatory)
    -b <basename>   basename of every written output file (mandatory)
    -s <int>        sketch size (default: 25% of the largest genome size)
    -q <double>     probability of observing a random k-mer (default: 0.0001)
    -k <int>        k-mer size (default: estimated from the average genome size with the
                    probability set by option -q)
    -c <real>       if at least one of the estimated p-distances is above this specified
                    cutoff, then a F81 correction is performed (default: 0.1)
    -n              no BME tree inference (only pairwise distance estimation)
    -r <int>        number of steps  when performing the  ratchet-based  BME tree search
                    (default: 100)
    -t <int>        number of threads (default: 2)
```
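
To give an idea of how the default _k_-mer size relates to options -q and -k, the sketch below evaluates equation (2) of Ondov et al. (2016), i.e. _k_ = ceil(log<sub>4</sub>(_n_(1-_q_)/_q_)) for a genome of _n_ nucleotides. The function name `kmer_size` and the genome size of 5,000,000 bp are assumptions made for illustration only; the value actually used by _JolyTree_ is derived from the average size of the input genomes and may differ.

```bash
# Hypothetical helper: k-mer size from equation (2) in Ondov et al. (2016)
#   $1 = genome size n (in nucleotides)
#   $2 = probability q of observing a random k-mer
kmer_size() {
  awk -v n="$1" -v q="$2" 'BEGIN {
    x = log(n * (1 - q) / q) / log(4);   # log base 4 (alphabet A, C, G, T)
    k = int(x); if (k < x) k++;          # ceiling
    print k;
  }'
}

kmer_size 5000000 0.0001    # prints 18
kmer_size 5000000 0.00001   # a lower q gives a larger k
```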

## Notes

* Modifying option -k is not recommended. The optimal value of _k_ is automatically estimated by equation (2) in [Ondov et al. (2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) from the desired probability _q_ of observing a random _k_-mer (option -q). Increasing _q_ (e.g. > 0.001) is not recommended, especially when dealing with distantly related genomes. Lowering _q_ (e.g. < 0.00001) leads to a larger _k_-mer size, which increases the variance of the estimated evolutionary distances.

* Increasing the sketch size (option -s) does not generally modify the inferred phylogenetic tree; on the other hand, setting a sketch size lower than 10,000 is not recommended (except for small genomic sequences, e.g. plasmids, viruses).

* Lowering the cutoff value for correcting the evolutionary distances (option -c) does not generally modify the inferred phylogenetic tree; on the other hand, increasing this cutoff value is strongly discouraged.

* The option -c allows multiple substitutions per character to be accurately estimated when an observed _p_-distance is quite large (e.g. > 0.1; see [Figure 3.1](https://books.google.fr/books?id=3Xc8DwAAQBAJ&pg=PA41) in Nei and Kumar 2000). In such cases, the F81 correction is performed using equation (4) in [Tamura and Kumar (2002)](https://academic.oup.com/mbe/article/19/10/1727/1258975), which estimates the pairwise distance based on the Equal-Input model of evolution ([Felsenstein 1981](https://link.springer.com/article/10.1007/BF01734359); [Tajima and Nei 1982](https://link.springer.com/article/10.1007/BF01810830), [1984](https://academic.oup.com/mbe/article/1/3/269/1244029)). This transformation was chosen because it can be directly computed from a _p_-distance value, and it takes into account putative unequal base frequencies and heterogeneous base composition among lineages.

* Faster running times will be observed when using multiple threads; of note, only the pairwise distance estimation step benefits from a large number of threads (the other steps are quite fast).

* The verbosity of _JolyTree_ can be reduced by ending the command line with `2>/dev/null`.
* To launch _JolyTree_ on multiple cores on a cluster managed by [SLURM](https://slurm.schedmd.com), edit the file `JolyTree.sh` and read subsection [3] of the _Installation_ section (approximately line 200).
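
To illustrate the effect of the correction controlled by option -c, the sketch below computes the classical Equal-Input (F81) distance _d_ = -_b_ ln(1 - _p_/_b_), with _b_ = 1 - Σ π<sub>i</sub><sup>2</sup>, from a _p_-distance and a single set of base counts. This is the simpler homogeneous form, not equation (4) of Tamura and Kumar (2002), which additionally accounts for base composition differences between the two genomes; the function name `f81_distance` and the input values are assumptions made for illustration.

```bash
# Hypothetical helper: homogeneous Equal-Input (F81) correction of a p-distance
#   $1..$4 = A, C, G and T residue counts
#   $5     = observed p-distance
f81_distance() {
  awk -v a="$1" -v c="$2" -v g="$3" -v t="$4" -v p="$5" 'BEGIN {
    n = a + c + g + t;
    b = 1 - ((a/n)^2 + (c/n)^2 + (g/n)^2 + (t/n)^2);
    if (p >= b) { print "inf"; exit }    # saturation: distance is undefined
    printf "%.4f\n", -b * log(1 - p / b);
  }'
}

f81_distance 25 25 25 25 0.10   # prints 0.1073 (equal base frequencies, b = 0.75)
```

With equal base frequencies, a _p_-distance of 0.10 is corrected to about 0.107, whereas smaller _p_-distances are almost unchanged.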


## Example

In order to illustrate the usefulness of _JolyTree_ and to describe its output files, the following use case example describes its usage for inferring a phylogenetic tree of _Klebsiella_ genomes derived from the analysis of [Rodrigues et al. (2019)](https://doi.org/10.1016/j.resmic.2019.02.003).

##### Downloading genome sequences

The following [Bash](https://www.gnu.org/software/bash/) command lines allow 40 _Klebsiella_ genome sequences (36 belonging to the _Klebsiella pneumoniae_ complex &ndash;Kp1 to Kp7&ndash; and 4 from outgroup species &ndash;Kog&ndash;) to be downloaded from the [NCBI genome repository](https://www.ncbi.nlm.nih.gov/genome) into a directory named _genomes_:

```bash
mkdir genomes/ ;
EUTILS="wget -q -O- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=";
NCBIFTP="wget -q -O- https://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/"; 
A1='^[A-Z]{2}[0-9]*$'; A2='^[A-Z]{6}[0-9]*$'; Z=".1.fsa_nt.gz";

t="Kp1-K.pneumoniae";
echo -e "SB4-2\tCAAHFS01\nATCC13883_T\tJOOW01\nMGH78578\tCP000647\nSB1139\tCAAHFT01\n5-2\tCAAHGI01\n04A025\tCAAHFZ01\n2-3\tCAAHGH01\nKp13\tCP003999\nNTUH-K2044\tAP006725\nBJ1-GA\tCAAHGC01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
 
t="Kp2-K.quasipneumoniae.subsp.quasipneumoniae";
echo -e "01A030_T\tCCDF01\nSB1124\tCAAHFU01\nU41\tCAAHGA01\n18A069\tCAAHGF01\n0320584\tCAAHGK01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
 
t="Kp3-K.variicola";
echo -e "01A065\tCAAHFX01\nF2R9_T\tCAAHGE01\n342\tCP000964\nAt-22\tCP001891" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
 
t="Kp4-K.quasipneumoniae.subsp.similipneumoniae";
echo -e "09A323\tCAAHFV01\n12A476\tCAAHFY01\n07A044_T\tCBZR01\nCIP110288\tCAAHGD01\n1-1\tCAAHGG01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
 
t="Kp5-K.variicola.subsp.tropicalensis";
echo -e "CDC4241-71\tCAAHGJ01\n814\tCAAHGL01\n885\tCAAHGM01\n1266_T\tCAAHGN01\n1283\tCAAHGO01\n1375\tCAAHGP01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
 
t="Kp6-K.quasivariicola"; 
echo -e "08A119\tCAAHGB01\n10982\tAKYX01\nKPN1705\tCP022823\n01-467-2ECBU\tCAAHGR01\n01-310MBV\tCAAHGS01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
 
t="Kp7-K.africanensis";
echo -e "200023\tCAAHGQ01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
 
t="Kog-K";
echo -e "oxytoca.ATCC13182\tCAAHFW01\naerogenes.ATCC13048\tQVMZ01\ngrimontii.06D021\tFZTC01\nmichiganensis.DSM25444\tPRDB01" |
  while read -r s a; do echo $t.$s;([[ $a =~ $A1 ]]&&$EUTILS$a||$NCBIFTP${a:0:2}/${a:2:2}/$([[ $a =~ $A2 ]]&&echo ${a:4:2}/)$a/$a$Z|zcat)>genomes/$t.$s.fa; done
```
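
As a download can silently fail (e.g. network error) and leave an empty file, it may be worth checking the content of the directory before going further. The sketch below counts the non-empty files carrying one of the file extensions accepted by option -i; the function name `count_fasta` is hypothetical.

```bash
# Hypothetical helper: count the non-empty files ending with
# .fa, .fna, .fas or .fasta inside the directory given as $1.
count_fasta() {
  local dir="$1" n=0 f
  for f in "$dir"/*.fa "$dir"/*.fna "$dir"/*.fas "$dir"/*.fasta; do
    # unmatched patterns are left as-is by Bash and fail the -s test
    if [ -s "$f" ]; then n=$((n + 1)); fi
  done
  echo "$n"
}

count_fasta genomes   # 40 is expected when every download succeeded
```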

##### Launching _JolyTree_

The following command line allows the script `JolyTree.sh` to be launched with default options on 8 threads:
```bash
./JolyTree.sh  -i genomes  -b klebsiella  -t 8  2>/dev/null
```
Of note, more verbose output can be obtained by omitting the final `2>/dev/null`.

As the basename was set to 'klebsiella', _JolyTree_ writes the following four output files within a few minutes:

* `klebsiella.acgt`: the A, C, G and T residue counts for each genome,
* `klebsiella.oepl`: every pairwise _p_-distance in [OEPL (One Entry Per Line) format](http://giphy.pasteur.fr/faq/phylogenetics/distance-matrix-file-conversion/#how-to-deal-with-the-one-entry-per-line-oepl-matrix-format),
* `klebsiella.d`: the matrix of (corrected) pairwise evolutionary distances in PHYLIP square format,
* `klebsiella.nwk`: the BME phylogenetic tree in NEWICK format with REQ confidence support at branches.
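
A quick consistency check of these output files can be sketched as follows. It relies on two general properties of the formats: the first line of a PHYLIP file holds the number of taxa, and a NEWICK tree contains one comma less than its number of leaves. The function names are hypothetical, and the leaf count assumes that no genome name contains a comma.

```bash
# Hypothetical helper: number of taxa stated on the first line of a PHYLIP file
phylip_ntaxa() {
  awk 'NR == 1 { print $1 + 0; exit }' "$1"
}

# Hypothetical helper: number of leaves in a NEWICK file (commas + 1),
# assuming that no leaf name contains a comma
newick_nleaves() {
  echo $(( $(tr -cd ',' < "$1" | wc -c) + 1 ))
}

if [ -f klebsiella.d ] && [ -f klebsiella.nwk ]; then
  echo "taxa in klebsiella.d:     $(phylip_ntaxa klebsiella.d)"
  echo "leaves in klebsiella.nwk: $(newick_nleaves klebsiella.nwk)"
fi
```

For the example above, both values are expected to be equal to the number of input genomes.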


## References

Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17(6):368-376. [doi:10.1007/BF01734359](https://link.springer.com/article/10.1007/BF01734359).

Lefort V, Desper R, Gascuel O (2015) FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Molecular Biology and Evolution, 32(10):2798-2800. [doi:10.1093/molbev/msv150](https://doi.org/10.1093/molbev/msv150).

Nei M, Kumar S (2000) Molecular Evolution and Phylogenetics. Oxford University Press, Oxford. ISBN: 0-19-513584-9.

Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132. [doi:10.1186/s13059-016-0997-x](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x).

Rodrigues C, Passet V, Rakotondrasoa A, Abdoulaye Diallo T, Criscuolo A, Brisse S (2019) Description of _Klebsiella africanensis_ sp. nov., _Klebsiella variicola_ subsp. _tropicalensis_ subsp. nov. and _Klebsiella variicola_ subsp. _variicola_ subsp. nov. Research in Microbiology. [doi:10.1016/j.resmic.2019.02.003](https://doi.org/10.1016/j.resmic.2019.02.003).

Tajima F, Nei M (1982) Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. Journal of Molecular Evolution, 18(2):115-120. [doi:10.1007/BF01810830](https://link.springer.com/article/10.1007/BF01810830).

Tajima F, Nei M (1984) Estimation of evolutionary distance between nucleotide sequences. Molecular Biology and Evolution, 1(3):269-285. [doi:10.1093/oxfordjournals.molbev.a040317](https://academic.oup.com/mbe/article/1/3/269/1244029).

Tamura K, Kumar S (2002) Evolutionary distance estimation under heterogeneous substitution pattern among lineages. Molecular Biology and Evolution, 19(10):1727-1736. [doi:10.1093/oxfordjournals.molbev.a003995](https://academic.oup.com/mbe/article/19/10/1727/1258975).