Commit 472db87b authored by Alexis CRISCUOLO

1.0

parent d8f66df3
# GenoMed
Estimating Genome Medoid
_GenoMed_ is a command line tool written in [Bash](https://www.gnu.org/software/bash/) to determine the [medoid](https://en.wikipedia.org/wiki/Medoid) of a set of genomes.
_GenoMed_ computes the average evolutionary distance &delta;<sub>_g_</sub> of each genome _g_ to all the other genomes, then sorts the genomes by decreasing &delta;<sub>_g_</sub>; the medoid genome is the one that minimizes &delta;<sub>_g_</sub>.
_GenoMed_ runs on UNIX, Linux and most OS X operating systems.
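The selection step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the three-column, tab-separated input (genome A, genome B, distance, one line per unordered pair) and the function name `medoid_from_pairs` are hypothetical and are not taken from `GenoMed.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not GenoMed code).
# Input : lines "genomeA<TAB>genomeB<TAB>distance", one per unordered pair.
# Output: "genome<TAB>average distance", sorted by decreasing average,
#         so that the last line of this sketch's output is the medoid.
medoid_from_pairs() {
  LC_ALL=C awk -F'\t' '
    {
      # each pairwise distance contributes to the sum of both genomes
      sum[$1] += $3; n[$1]++
      sum[$2] += $3; n[$2]++
    }
    END {
      for (g in sum) printf "%s\t%.6f\n", g, sum[g] / n[g]
    }
  ' "$@" | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr
}
```

With the pairs A-B = 0.1, A-C = 0.3 and B-C = 0.2, the averages are 0.2 for A, 0.15 for B and 0.25 for C, so B is reported last, as the medoid.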
## Dependencies
You will need to install the required programs and tools listed in the following table, or to verify that they are already installed with the required version.
<div align="center">
| program | package | version | sources |
|:----------------------------------------------------------------------------------------------- |:--------------------------------------------------------:| -----------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [_flock_](https://man7.org/linux/man-pages/man1/flock.1.html) | [util-linux](https://en.wikipedia.org/wiki/Util-linux) | &ge; 2.31.1 | Linux: [github.com/util-linux/util-linux](https://github.com/util-linux/util-linux)<br>OS X: [github.com/discoteq/flock](https://github.com/discoteq/flock) |
| [_xargs_](https://www.gnu.org/software/findutils/manual/html_node/find_html/xargs-options.html) | [GNU findutils](https://www.gnu.org/software/findutils/) | &ge; 4.7.0 | [ftp.gnu.org/gnu/findutils](https://ftp.gnu.org/gnu/findutils/) |
| [_gawk_](https://www.gnu.org/software/gawk/) | - | > 4.0.0 | [ftp.gnu.org/gnu/gawk](http://ftp.gnu.org/gnu/gawk/) |
| [_mash_](https://mash.readthedocs.io/en/latest/) | - | &ge; 2.2 | [github.com/marbl/Mash](https://github.com/marbl/Mash) |
</div>
##### Note for Mac OS X
On some Mac OS X versions, [_flock_](https://man7.org/linux/man-pages/man1/flock.1.html) is not available by default, but it can be easily installed using [_homebrew_](https://brew.sh) (see the README of the [discoteq/flock git](https://github.com/discoteq/flock)).<br>
It is also worth noting that [BSD _xargs_](https://www.freebsd.org/cgi/man.cgi?xargs) does not offer all the functionalities provided by [GNU _xargs_](https://www.gnu.org/software/findutils/manual/html_node/find_html/xargs-options.html) and required by _GenoMed_.
However, [GNU _xargs_](https://www.gnu.org/software/findutils/manual/html_node/find_html/xargs-options.html) (here named `gxargs`) can be easily installed using [_homebrew_](https://brew.sh), i.e. `brew install findutils`.
Of note, _GenoMed_ first looks for the `gxargs` binary on the `$PATH`, and, if missing, the `xargs` binary.
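The lookup order described above (`gxargs` first, then `xargs`) could be written as follows. This is a minimal sketch, not the code of `GenoMed.sh`; the function name `find_xargs` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not GenoMed code): prints the first xargs-like
# binary found on $PATH, preferring GNU xargs installed as gxargs.
find_xargs() {
  local candidate
  for candidate in gxargs xargs; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "neither gxargs nor xargs found on \$PATH" >&2
  return 1
}
```

A script could then store the result once, e.g. `XARGS_BIN=$(find_xargs) || exit 1`.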
## Installation and execution
**A.** Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/GenoMed.git
```
**B.** Give the execute permission to the file `GenoMed.sh`:
```bash
chmod +x GenoMed.sh
```
**C.** Execute _GenoMed_ with the following command line model:
```bash
./GenoMed.sh [options]
```
**D.** If at least one of the required programs (see [Dependencies](#dependencies)) is not available on your `$PATH` variable (or if one compiled binary has a different default name), _GenoMed_ will exit with an error message.
When running _GenoMed_ without any option, the documentation should be displayed; otherwise, the name of the missing program is displayed before exiting.
In such a case, edit the file `GenoMed.sh` and indicate the local path to the corresponding binary(ies) within the code block `REQUIREMENTS` (approximately lines 100-150).
For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block `REQUIREMENTS`.
<div align="center">
<sup>
| program | variable assignment | | program | variable assignment |
|:---------|:------------------------------------------------- | - |:---------|:-------------------- |
| _flock_ | `FLOCK_BIN=flock;` | | _gawk_ | `GAWK_BIN=gawk;` |
| _xargs_ | `XARGS_BIN=xargs;`<br>`GXARGS_BIN=gxargs;` (OS X) | | _mash_ | `MASH_BIN=mash;` |
</sup>
</div>
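As an illustration of how such assignments can be checked, the sketch below reuses the variable names from the table and reports every binary that cannot be found. The function `missing_programs` and the example path in the comment are hypothetical; the actual `REQUIREMENTS` block of `GenoMed.sh` may be organized differently.

```bash
#!/usr/bin/env bash
# Variable names follow the table above; everything else is an
# illustrative sketch, not the code found in GenoMed.sh.
FLOCK_BIN=flock;
GAWK_BIN=gawk;
MASH_BIN=mash;   # e.g. MASH_BIN=/opt/mash/mash; (hypothetical path) when mash is not on $PATH

# Prints, one per line, each program that is neither on $PATH
# nor an existing executable path.
missing_programs() {
  local bin
  for bin in "$@"; do
    command -v "$bin" >/dev/null 2>&1 || printf '%s\n' "$bin"
  done
}

missing=$(missing_programs "$FLOCK_BIN" "$GAWK_BIN" "$MASH_BIN")
if [ -n "$missing" ]; then
  echo "missing program(s):" $missing >&2
fi
```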
## Usage
Run _GenoMed_ without any option to display the following documentation:
```
USAGE: GenoMed.sh [OPTIONS] <fasta1> <fasta2> <fasta3> [<fasta4> ...]
OPTIONS:
-t <int> number of threads (default: 2)
-h prints this help and exits
```
## Notes
* In brief, _GenoMed_ uses the tool [_mash_](https://mash.readthedocs.io/en/latest/) to compute all pairwise _p_-distances between genomes, and then transforms them into EI/F81 evolutionary distances (see Criscuolo 2020). To obtain accurate _p_-distance estimates with [_mash_](https://mash.readthedocs.io/en/latest/), the sketch size is set to the average genome length, and the _k_-mer length is given by _k_&nbsp;=&nbsp;log<sub>4</sub>&nbsp;(_m_<sup>2</sup>-_m_), where _m_ is the maximum genome length (this optimal estimate of _k_ is derived from Formula 1 in Fofanov et al. 2004). These pairwise evolutionary distances are finally used to compute the average distance &delta;<sub>_g_</sub> of each genome _g_ to all the other genomes. The medoid genome is the one that minimizes &delta;<sub>_g_</sub>.
* All input files (at least 3) should be uncompressed and in FASTA format. _GenoMed_ accepts many input files specified using [filename expansion](https://tldp.org/LDP/abs/html/globbingref.html), e.g. `dirname/*.fasta`.
* Faster running times can be obtained when using multiple threads (option `-t`).
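
The _k_-mer length formula given in the first note can be evaluated with a short script. The sketch below is an assumption-laden illustration, not an excerpt of `GenoMed.sh`: the function names `genome_length` and `kmer_length` are hypothetical, and rounding _k_ up to the next integer is an assumption, since the rounding rule is not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not GenoMed code).

# Number of sequence characters in one uncompressed FASTA file
# (header lines starting with ">" are skipped).
genome_length() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); len += length($0) } END { print len + 0 }' "$1"
}

# k = log4(m^2 - m) for a maximum genome length m (m >= 2).
# ASSUMPTION: the value is rounded up to the next integer.
kmer_length() {
  awk -v m="$1" 'BEGIN {
    k = log(m * m - m) / log(4)   # change of base: log4(x) = ln(x) / ln(4)
    c = int(k)
    if (c < k) c++
    print c
  }'
}
```

For example, a maximum genome length of 5,000,000 bp gives log<sub>4</sub>(_m_<sup>2</sup>-_m_) &approx; 22.25, i.e. 23 after rounding up. In [_mash_](https://mash.readthedocs.io/en/latest/), the _k_-mer length and the sketch size correspond to the `-k` and `-s` options of `mash sketch`.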
## References
Criscuolo A (2020) _On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference_. **F1000Research**, 9:1309. [doi:10.12688/f1000research.26930.1](https://doi.org/10.12688/f1000research.26930.1)
Fofanov Y, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T, Belapurkar C, Fofanov V, Li T-B, Chumakov S, Pettitt BM (2004) _How independent are the appearances of n-mers in different genomes?_. **Bioinformatics**, 20(15):2421-2428. [doi:10.1093/bioinformatics/bth266](https://doi.org/10.1093/bioinformatics/bth266)