_JolyTree_ (named in memory of [Nicolas Joly](https://research.pasteur.fr/en/member/nicolas-joly/)) is a command line script written in [Bash](https://www.gnu.org/software/bash/) that quickly infers a distance-based phylogenetic tree with branch supports from unaligned genome sequences.
_JolyTree_ runs on UNIX, Linux and most OS X operating systems.
## Installation and execution
**A.** Install the following programs and tools, or verify that they are already installed with the required version:
* [mash](http://mash.readthedocs.io/en/latest/) [(Ondov et al. 2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) version >= 1.0.2;
* [REQ](https://research.pasteur.fr/en/tool/r%ce%b5q-assessing-branch-supports-o%c6%92-a-distance-based-phylogenetic-tree-with-the-rate-o%c6%92-elementary-quartets/) version >= 1.2.
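
Whether the required programs are already reachable can be checked with a small helper such as the sketch below. The function name `missing_binaries` is hypothetical, and the binary names `mash`, `gawk`, `fastme` and `REQ` are assumptions that may differ on your system.

```bash
#!/usr/bin/env bash
# Print the name of every given program that cannot be found on $PATH.
# Hypothetical helper; it is not part of JolyTree.sh.
missing_binaries() {
  local name
  for name in "$@"; do
    # command -v returns a non-zero status when the program is not found
    if ! command -v "$name" >/dev/null 2>&1; then
      echo "$name"
    fi
  done
}

# Assumed binary names; adapt them to your installation.
missing=$(missing_binaries mash gawk fastme REQ)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so that all names appear on one line
  echo "not found on PATH:" $missing >&2
else
  echo "all required binaries found" >&2
fi
```

Any program reported as missing needs either to be installed or to have its local path indicated in `JolyTree.sh`.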
**C.** If at least one of the four required binaries (step A) is not available on your `$PATH` variable, edit the file `JolyTree.sh` and indicate the local path to the mash, gawk, FastME and/or REQ binary(ies) (approximately between lines 100 and 200):
**D.** Give the execute permission to the file `JolyTree.sh`:
```bash
chmod +x JolyTree.sh
```
**E.** Execute _JolyTree_ with the following command line model:
```bash
./JolyTree.sh [options]
```
## Usage
Launch _JolyTree_ without any option to read the following documentation:
```
USAGE:
  JolyTree.sh [options]
where:
  -i <directory>  directory name containing FASTA-formatted contig files; only files
                  ending with .fa, .fna, .fas or .fasta will be considered (mandatory)
  -b <basename>   basename of every written output file (mandatory)
  -s <int>        sketch size (default: 25% of the largest genome size)
  -q <double>     probability of observing a random k-mer (default: 0.0001)
  -k <int>        k-mer size (default: estimated from the average genome size with the
                  probability set by option -q)
  -c <real>       if at least one of the estimated p-distances is above this specified
                  cutoff, then a F81 correction is performed (default: 0.1)
  -n              no BME tree inference (only pairwise distance estimation)
  -r <int>        number of steps when performing the ratchet-based BME tree search
                  (default: 100)
  -t <int>        number of threads (default: 2)
```
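Because option -i only considers files ending with .fa, .fna, .fas or .fasta, it can be useful to count the matching files before launching a run. The sketch below does so with a hypothetical helper named `count_fasta`; it assumes that only files located directly inside the directory matter (not those in subdirectories), which is not stated in the documentation above.

```bash
#!/usr/bin/env bash
# Count the files in a directory that carry one of the four accepted extensions.
# Hypothetical helper; it is not part of JolyTree.sh.
count_fasta() {
  local dir=$1 n=0 f
  for f in "$dir"/*.fa "$dir"/*.fna "$dir"/*.fas "$dir"/*.fasta; do
    # an unmatched pattern stays literal and is skipped by the -f test
    [ -f "$f" ] && n=$((n + 1))
  done
  echo "$n"
}

# The directory name defaults to "genomes" here for illustration only.
echo "FASTA files found: $(count_fasta "${1:-genomes}")"
```

A count of zero, or one lower than expected, usually points to a file with a different extension (e.g. .fsa or .txt).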
## Notes
* Modifying the option -k is not recommended. The optimal value of _k_ is automatically estimated by equation (2) in [Ondov et al. (2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) from the desired probability _q_ of observing a random _k_-mer (option -q). Increasing _q_ (e.g. > 0.001) is not recommended, especially when dealing with distantly related genomes. Lowering _q_ (e.g. < 0.00001) leads to a larger _k_-mer size, which increases the variance of the estimated evolutionary distances.
* Increasing the sketch size (option -s) generally does not modify the inferred phylogenetic tree; on the other hand, setting a sketch size lower than 10,000 is not recommended (except for small genomic sequences, e.g. plasmids, viruses).
* Lowering the cutoff value for correcting the evolutionary distances (option -c) generally does not modify the inferred phylogenetic tree; on the other hand, increasing this cutoff value is strongly discouraged.
* The option -c allows multiple substitutions per character to be accurately estimated when an observed _p_-distance is quite large (e.g. > 0.1; see [Figure 3.1](https://books.google.fr/books?id=3Xc8DwAAQBAJ&pg=PA41) in Nei and Kumar 2000). In such cases, the F81 correction is performed using equation (4) in [Tamura and Kumar (2002)](https://academic.oup.com/mbe/article/19/10/1727/1258975), which estimates the pairwise distance under the Equal-Input model of evolution ([Felsenstein 1981](https://link.springer.com/article/10.1007/BF01734359); [Tajima and Nei 1982](https://link.springer.com/article/10.1007/BF01810830), [1984](https://academic.oup.com/mbe/article/1/3/269/1244029)). This transformation was chosen because it can be directly computed from a _p_-distance value and takes into account putative unequal base frequencies and heterogeneous base composition among lineages.
* Running times are shorter when using multiple threads; of note, only the pairwise distance estimation step benefits from a large number of threads (the other steps are quite fast).
* The verbosity of _JolyTree_ can be reduced by ending the command line with `2>/dev/null`.
* To launch _JolyTree_ on multiple cores on a cluster managed by [SLURM](https://slurm.schedmd.com), edit the file `JolyTree.sh` and read the subsection [3] of the _Installation_ section (approximately line 200).
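
To give a feel for how the default _k_-mer size depends on _q_, equation (2) in Ondov et al. (2016), i.e. _k_ = ⌈log<sub>4</sub>(_n_(1 − _q_)/_q_)⌉ for a genome of size _n_, can be sketched as follows. The helper name `kmer_size` and the genome size of 5.5 Mb are illustrative assumptions; this is not necessarily how `JolyTree.sh` carries out the computation internally.

```bash
#!/usr/bin/env bash
# k-mer size from equation (2) in Ondov et al. (2016):
#   k = ceil( log4( n * (1 - q) / q ) )
# $1: genome size n (bp); $2: probability q of observing a random k-mer.
# Hypothetical helper; it is not part of JolyTree.sh.
kmer_size() {
  awk -v n="$1" -v q="$2" 'BEGIN {
    x = log(n * (1 - q) / q) / log(4)   # logarithm in base 4 (alphabet size)
    k = int(x); if (k < x) k++          # ceiling of x
    print k
  }'
}

# Assumed genome size of 5.5 Mb with the default q = 0.0001
kmer_size 5500000 0.0001   # prints 18
```

With the same genome size, raising _q_ to 0.001 gives _k_ = 17 and lowering it to 0.00001 gives _k_ = 20, which matches the trend described in the first note above.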
## Example
To illustrate the usefulness of _JolyTree_ and to describe its output files, the following example infers an exploratory phylogenetic tree of _Klebsiella_ genomes.
##### Downloading genome sequences
The following command lines download the genome sequences of 39 _Klebsiella_ species from the [NCBI genome repository](https://www.ncbi.nlm.nih.gov/genome) into a directory named _genomes_:
Of note, the verbosity can be increased by omitting the final `2>/dev/null`.
As the basename was set to 'klebsiella', _JolyTree_ writes the following four output files within a few minutes:
* `klebsiella.acgt`: the A, C, G and T residue counts for each genome,
* `klebsiella.oepl`: every pairwise _p_-distance in [OEPL (One Entry Per Line) format](http://giphy.pasteur.fr/faq/phylogenetics/distance-matrix-file-conversion/#how-to-deal-with-the-one-entry-per-line-oepl-matrix-format),
* `klebsiella.d`: the matrix of (corrected) pairwise evolutionary distances in PHYLIP square format,
* `klebsiella.nwk`: the BME phylogenetic tree in NEWICK format with REQ confidence support at branches.
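
To illustrate how a _p_-distance (as in `klebsiella.oepl`) relates to a corrected distance (as in `klebsiella.d`), the sketch below applies the classical Equal-Input formula _d_ = −_b_ ln(1 − _p_/_b_) with _b_ = 1 − Σ π<sub>r</sub><sup>2</sup>, where the base frequencies π<sub>r</sub> are derived from residue counts. This is a deliberate simplification that assumes a single set of base frequencies for both genomes; it is not equation (4) of Tamura and Kumar (2002) used by _JolyTree_, and the helper name `f81_distance` is hypothetical.

```bash
#!/usr/bin/env bash
# Simplified Equal-Input (F81) distance: d = -b * ln(1 - p / b),
# with b = 1 - sum of squared base frequencies.
# $1: p-distance; $2..$5: A, C, G and T residue counts.
# Hypothetical helper; NOT the heterogeneous-composition formula used by JolyTree.
f81_distance() {
  awk -v p="$1" -v a="$2" -v c="$3" -v g="$4" -v t="$5" 'BEGIN {
    n = a + c + g + t
    b = 1 - ((a/n)^2 + (c/n)^2 + (g/n)^2 + (t/n)^2)
    if (p >= b) { print "inf"; exit }   # the formula is undefined when p >= b
    printf "%.6f\n", -b * log(1 - p / b)
  }'
}

# Equal base frequencies (b = 3/4) and p = 0.1
f81_distance 0.1 1 1 1 1   # prints 0.107326
```

The corrected value is slightly larger than the observed _p_-distance, and the gap widens as _p_ grows.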
## References
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17(6):368-376. [doi:10.1007/BF01734359](https://link.springer.com/article/10.1007/BF01734359).
Lefort V, Desper R, Gascuel O (2015) FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Molecular Biology and Evolution, 32(10):2798-2800. [doi:10.1093/molbev/msv150](https://doi.org/10.1093/molbev/msv150).
Nei M, Kumar S (2000) Molecular Evolution and Phylogenetics. Oxford University Press, Oxford. ISBN: 0-19-513584-9.
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132. [doi:10.1186/s13059-016-0997-x](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x).
Tajima F, Nei M (1982) Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. Journal of Molecular Evolution, 18(2):115-120. [doi:10.1007/BF01810830](https://link.springer.com/article/10.1007/BF01810830).
Tajima F, Nei M (1984) Estimation of evolutionary distance between nucleotide sequences. Molecular Biology and Evolution, 1(3):269-285. [doi:10.1093/oxfordjournals.molbev.a040317](https://academic.oup.com/mbe/article/1/3/269/1244029).
Tamura K, Kumar S (2002) Evolutionary distance estimation under heterogeneous substitution pattern among lineages. Molecular Biology and Evolution, 19(10):1727-1736. [doi:10.1093/oxfordjournals.molbev.a003995](https://academic.oup.com/mbe/article/19/10/1727/1258975).