_GenoMed_ is a command line tool written in [Bash](https://www.gnu.org/software/bash/) to determine the [medoid](https://en.wikipedia.org/wiki/Medoid) of a set of genomes.
_GenoMed_ computes the average evolutionary distance δ<sub>_g_</sub> of each genome _g_ to all the others, then sorts the genomes by decreasing δ<sub>_g_</sub>; the medoid is the genome that minimizes δ<sub>_g_</sub>.
_GenoMed_ runs on UNIX, Linux and most OS X operating systems.
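
The medoid selection described above can be illustrated with a minimal sketch. It is not taken from `GenoMed.sh`: the function name `avg_distances`, the three-column input layout (one line per unordered genome pair) and the toy distances are all assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GenoMed.sh): reads tab-separated lines
# "genomeA<TAB>genomeB<TAB>distance", one per unordered pair, and prints each
# genome with its average distance to all others, sorted by decreasing average.
avg_distances() {
  LC_ALL=C awk -F'\t' '
    { sum[$1] += $3; n[$1]++; sum[$2] += $3; n[$2]++ }
    END { for (g in sum) printf "%s\t%.6f\n", g, sum[g] / n[g] }
  ' | LC_ALL=C sort -t$'\t' -k2,2nr -k1,1
}

# Toy pairwise distances, made up for illustration only
pairs=$'A\tB\t0.10\nA\tC\t0.30\nB\tC\t0.25'

ranking=$(printf '%s\n' "$pairs" | avg_distances)
printf '%s\n' "$ranking"

# With a decreasing sort, the medoid (smallest average) is the last line
medoid=$(printf '%s\n' "$ranking" | tail -n 1 | cut -f1)
echo "medoid: $medoid"
```

With these toy values the averages are 0.275 (C), 0.200 (A) and 0.175 (B), so B is reported as the medoid.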
## Dependencies
You will need to install the required programs and tools listed in the following table, or to verify that they are already installed with the required version.
On some Mac OS X systems, [_flock_](https://man7.org/linux/man-pages/man1/flock.1.html) is not available by default, but it can be easily installed using [_homebrew_](https://brew.sh) (see the README of the [discoteq/flock](https://github.com/discoteq/flock) repository).<br>
It is also worth noting that [BSD _xargs_](https://www.freebsd.org/cgi/man.cgi?xargs) does not offer all the functionalities provided by [GNU _xargs_](https://www.gnu.org/software/findutils/manual/html_node/find_html/xargs-options.html) and required by _GenoMed_.
However, [GNU _xargs_](https://www.gnu.org/software/findutils/manual/html_node/find_html/xargs-options.html) (here named `gxargs`) can be easily installed using [_homebrew_](https://brew.sh), i.e. `brew install findutils`.
Of note, _GenoMed_ first looks for the `gxargs` binary on the `$PATH`, and, if missing, the `xargs` binary.
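
As an illustration of this lookup order, here is a minimal sketch of a `gxargs`-then-`xargs` search. The function name `find_xargs` and the variable `XARGS_BIN` are hypothetical and may differ from what `GenoMed.sh` actually uses.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the path of gxargs if it is on $PATH,
# otherwise the path of xargs; return non-zero if neither is found.
find_xargs() {
  local candidate
  for candidate in gxargs xargs; do
    if command -v "$candidate" >/dev/null 2>&1; then
      command -v "$candidate"
      return 0
    fi
  done
  return 1
}

if XARGS_BIN=$(find_xargs); then
  echo "using: $XARGS_BIN"
else
  echo "neither gxargs nor xargs found on \$PATH" >&2
fi
```

Running this snippet on its own is a quick way to check which binary would be picked up on a given machine.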
## Installation and execution
**A.** Clone this repository with the following command line:
**B.** Give the execute permission to the file `GenoMed.sh`:
```bash
chmod +x GenoMed.sh
```
**C.** Execute _GenoMed_ with the following command line model:
```bash
./GenoMed.sh [options]
```
**D.** If at least one of the required programs (see [Dependencies](#dependencies)) is not available on your `$PATH` variable (or if one compiled binary has a different default name), _GenoMed_ will exit with an error message.
When _GenoMed_ is run without any option, the documentation should be displayed; otherwise, the name of the missing program is displayed before exiting.
In such a case, edit the file `GenoMed.sh` and indicate the local path to the corresponding binary(ies) within the code block `REQUIREMENTS` (approximately lines 100-150).
For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block `REQUIREMENTS`:
<div align="center">
<sup>

| program | variable assignment | | program | variable assignment |
|:--------|:--------------------|-|:--------|:--------------------|

</sup>
</div>

* In brief, _GenoMed_ uses the tool [_mash_](https://mash.readthedocs.io/en/latest/) to compute all pairwise _p_-distances between genomes, and then transforms them into EI/F81 evolutionary distances (see Criscuolo 2020). To obtain accurate _p_-distance estimates with [_mash_](https://mash.readthedocs.io/en/latest/), the sketch size is set to the average genome length, and the _k_-mer length is given by _k_ = log<sub>4</sub> (_m_<sup>2</sup>-_m_), where _m_ is the maximum genome length (this optimal estimate of _k_ is derived from Formula 1 in Fofanov et al. 2004). All these pairwise evolutionary distances are finally used to compute the average distance δ<sub>_g_</sub> of each genome _g_ to all the others. The medoid genome is the one that minimizes δ<sub>_g_</sub>.
* All input files (at least 3) should be uncompressed and in FASTA format. _GenoMed_ accepts many input files specified through [filename expansion](https://tldp.org/LDP/abs/html/globbingref.html), e.g. `dirname/*.fasta`.
* Faster running times can be obtained when using multiple threads (option `-t`).
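
The sketch size and _k_-mer length rules stated above can be worked through on toy data. The snippet below is only a sketch under stated assumptions, not the code of `GenoMed.sh`: the helper `fasta_length`, the toy FASTA files, the integer division used for the average, and the rounding of _k_ up to the next integer are all choices made for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: total number of residues in a FASTA file
# (header lines starting with ">" are skipped, whitespace is ignored).
fasta_length() {
  awk '!/^>/ { gsub(/[[:space:]]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Three toy FASTA files written to a temporary directory (made-up sequences)
tmp=$(mktemp -d)
printf '>g1\nACGTACGTAC\nGTACGT\n' > "$tmp/g1.fasta"   # 16 bases
printf '>g2\nACGTACGT\n'           > "$tmp/g2.fasta"   # 8 bases
printf '>g3\nACGTACGTACGT\n'       > "$tmp/g3.fasta"   # 12 bases

max=0; total=0; count=0
for f in "$tmp"/*.fasta; do
  len=$(fasta_length "$f")
  total=$(( total + len ))
  count=$(( count + 1 ))
  if (( len > max )); then max=$len; fi
done

# Sketch size = average genome length (integer division is an assumption)
sketch_size=$(( total / count ))

# k = log4(m^2 - m), with m the maximum genome length;
# rounding up to an integer is an assumption of this sketch
kmer=$(awk -v m="$max" 'BEGIN { k = log(m * m - m) / log(4); c = int(k); if (c < k) c++; print c }')

echo "m=$max s=$sketch_size k=$kmer"
rm -rf "$tmp"
```

For these toy files the snippet prints `m=16 s=12 k=4`, since log<sub>4</sub>(16<sup>2</sup>-16) ≈ 3.95. Values obtained this way correspond to what the `-s` (sketch size) and `-k` (_k_-mer size) options of `mash sketch` expect.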
## References
Criscuolo A (2020) _On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference_. **F1000Research**, 9:1309. [doi:10.12688/f1000research.26930.1](https://doi.org/10.12688/f1000research.26930.1)
Fofanov Y, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T, Belapurkar C, Fofanov V, Li T-B, Chumakov S, Pettitt BM (2004) _How independent are the appearances of n-mers in different genomes?_ **Bioinformatics**, 20(15):2421-2428. [doi:10.1093/bioinformatics/bth266](https://doi.org/10.1093/bioinformatics/bth266)