1.0

472db87b · Alexis CRISCUOLO · d8f66df3
Commit 472db87b authored 3 years ago by Alexis CRISCUOLO
--- a/COPYING
+++ b/COPYING
--- a/GenoMed.sh
+++ b/GenoMed.sh
--- a/README.md
+++ b/README.md
 # GenoMed

-Estimating Genome Medoid
\ No newline at end of file
+_GenoMed_ is a command line tool written in [Bash](https://www.gnu.org/software/bash/) to determine the [medoid](https://en.wikipedia.org/wiki/Medoid) of a set of genomes.
+_GenoMed_ computes the average evolutionary distance &delta;<sub>_g_</sub> of each genome _g_ to all the others, then sorts the genomes by decreasing &delta;<sub>_g_</sub>; the medoid genome is the one that minimizes &delta;<sub>_g_</sub>.
+
+_GenoMed_ runs on UNIX, Linux and most OS X operating systems.
+
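The medoid selection described above can be sketched as follows. This is a minimal illustration, not the code of `GenoMed.sh`: the function names `average_distances` and `medoid`, and the three-column input format (one line per unordered genome pair), are assumptions made for this example. For convenience, the sketch sorts by increasing average distance so that the medoid comes first.

```shell
# Hypothetical helpers (not taken from GenoMed.sh).
# Input: one line per unordered genome pair, "genomeA genomeB distance".
# Output: "genome<TAB>delta" lines, smallest average distance first.
average_distances() {
  LC_ALL=C awk '{ sum[$1] += $3; sum[$2] += $3; n[$1]++; n[$2]++ }
       END { for (g in sum) printf "%s\t%.6f\n", g, sum[g] / n[g] }' "$@" |
    LC_ALL=C sort -t "$(printf '\t')" -k2,2n -k1,1
}

# The medoid is the genome with the smallest average distance.
medoid() {
  average_distances "$@" | head -n 1 | cut -f 1
}
```

For example, `printf 'A B 0.1\nA C 0.3\nB C 0.15\n' | medoid` selects the genome whose average distance to the two others is the smallest.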
+
+## Dependencies
+
+Install the programs listed in the following table, or verify that they are already installed with the required version.
+
+<div align="center">
+
+| program                                                                                         | package                                                  | version     | sources                                                                                                                                                             |
+|:----------------------------------------------------------------------------------------------- |:--------------------------------------------------------:| -----------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [_flock_](https://man7.org/linux/man-pages/man1/flock.1.html)                                   | [util-linux](https://en.wikipedia.org/wiki/Util-linux)   | &ge; 2.31.1 | Linux: [github.com/util-linux/util-linux](https://github.com/util-linux/util-linux)<br>OS X: [github.com/discoteq/flock](https://github.com/discoteq/flock) |
+| [_xargs_](https://www.gnu.org/software/findutils/manual/html_node/find_html/xargs-options.html) | [GNU findutils](https://www.gnu.org/software/findutils/) | &ge; 4.7.0  | [ftp.gnu.org/gnu/findutils](https://ftp.gnu.org/gnu/findutils/)                                                                                             |
+| [_gawk_](https://www.gnu.org/software/gawk/)                                                    | -                                                        | > 4.0.0     | [ftp.gnu.org/gnu/gawk](http://ftp.gnu.org/gnu/gawk/)                                                                                                        |
+| [_mash_](https://mash.readthedocs.io/en/latest/)                                                | -                                                        | &ge; 2.2    | [github.com/marbl/Mash](https://github.com/marbl/Mash)                                                                                                              |
+
+</div>
+
+##### Note for Mac OS X
+
+On some Mac OS X versions, [_flock_](https://man7.org/linux/man-pages/man1/flock.1.html) is not available by default, but it can be easily installed using [_homebrew_](https://brew.sh) (see the README of the [discoteq/flock git](https://github.com/discoteq/flock)).<br>
+It is also worth noting that [BSD _xargs_](https://www.freebsd.org/cgi/man.cgi?xargs) does not offer all the functionalities provided by [GNU _xargs_](https://www.gnu.org/software/findutils/manual/html_node/find_html/xargs-options.html) and required by _GenoMed_.
+However, [GNU _xargs_](https://www.gnu.org/software/findutils/manual/html_node/find_html/xargs-options.html) (here named `gxargs`) can be easily installed using [_homebrew_](https://brew.sh), i.e. `brew install findutils`.
+Note that _GenoMed_ first looks for the `gxargs` binary on the `$PATH` and, if it is missing, falls back to the `xargs` binary.
+
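The lookup order described above (first `gxargs`, then `xargs`) could look like the following sketch. The function name `resolve_xargs` is hypothetical and the actual implementation in `GenoMed.sh` may differ.

```shell
# Hypothetical sketch of the lookup order: prefer gxargs, fall back to xargs.
# Prints the name of the first binary found on $PATH; fails if neither exists.
resolve_xargs() {
  local candidate
  for candidate in gxargs xargs; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "neither gxargs nor xargs found on \$PATH" >&2
  return 1
}
```

A caller might use it as `xargs_bin="$(resolve_xargs)" || exit 1`, where `xargs_bin` is an illustrative variable name.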
+
+## Installation and execution
+
+**A.** Clone this repository with the following command line:
+
+```bash
+git clone https://gitlab.pasteur.fr/GIPhy/GenoMed.git
+```
+
+**B.** Give the execute permission to the file `GenoMed.sh`:
+
+```bash
+chmod +x GenoMed.sh
+```
+
+**C.** Execute _GenoMed_ with the following command line model:
+
+```bash
+./GenoMed.sh  [options]
+```
+
+**D.** If at least one of the required programs (see [Dependencies](#dependencies)) is not available on your `$PATH` variable (or if a compiled binary has a name different from the default one), _GenoMed_ will exit with an error message.
+When _GenoMed_ is run without options, the documentation should be displayed; otherwise, the name of the missing program is displayed before exiting.
+In such a case, edit the file `GenoMed.sh` and indicate the local path to the corresponding binary(ies) within the code block `REQUIREMENTS` (approximately lines 100-150).
+For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block `REQUIREMENTS`.
+
+<div align="center">
+<sup>
+
+| program  | variable assignment                               |   | program  | variable assignment  |
+|:---------|:------------------------------------------------- | - |:---------|:-------------------- |
+| _flock_  | `FLOCK_BIN=flock;`                                |   | _gawk_   | `GAWK_BIN=gawk;`     |
+| _xargs_  | `XARGS_BIN=xargs;`<br>`GXARGS_BIN=gxargs;` (OS X) |   | _mash_   | `MASH_BIN=mash;`     |
+
+</sup>
+</div>
+
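As an illustration of the variable assignments listed in the table above, the sketch below sets three of them and checks that each one resolves to an executable. The variable names come from the table; the function `check_binaries` and the path `/opt/mash/mash` in the comment are hypothetical and only serve as an example.

```shell
# Variable names as listed in the table above; values are the defaults.
FLOCK_BIN=flock;
GAWK_BIN=gawk;
MASH_BIN=mash;   # e.g. MASH_BIN=/opt/mash/mash; (hypothetical path) if mash is not on $PATH

# Hypothetical pre-flight check: reports every program that cannot be found
# and returns a non-zero status if at least one is missing.
check_binaries() {
  local bin missing=0
  for bin in "$@"; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      echo "missing program: $bin" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_binaries "$FLOCK_BIN" "$GAWK_BIN" "$MASH_BIN"` before running _GenoMed_ shows which assignments would need to be edited.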
+
+## Usage
+
+Run _GenoMed_ without options to read the following documentation:
+
+```
+ USAGE:  GenoMed.sh  [OPTIONS]  <fasta1> <fasta2> <fasta3> [<fasta4> ...]
+
+ OPTIONS:
+  -t <int>    number of threads (default: 2)
+  -h          prints this help and exits
+```
+
+
+## Notes
+
+* In brief, _GenoMed_ uses the tool [_mash_](https://mash.readthedocs.io/en/latest/) to compute all pairwise _p_-distances between genomes, and then transforms them into EI/F81 evolutionary distances (see Criscuolo 2020). To obtain accurate _p_-distance estimates with [_mash_](https://mash.readthedocs.io/en/latest/), the sketch size is defined as the average genome length, and the _k_-mer length is given by _k_&nbsp;=&nbsp;log<sub>4</sub>&nbsp;(_m_<sup>2</sup>-_m_), where _m_ is the maximum genome length (this optimal estimate of _k_ is derived from Formula 1 in Fofanov et al. 2004). All these pairwise evolutionary distances are finally used to compute the average distance &delta;<sub>_g_</sub> of each genome _g_ to all the others. The medoid genome is the one that minimizes &delta;<sub>_g_</sub>.
+
+* All input files (at least 3) should be uncompressed and in FASTA format. _GenoMed_ accepts many input files specified using [filename expansion](https://tldp.org/LDP/abs/html/globbingref.html), e.g. `dirname/*.fasta`.
+
+* Faster running times can be obtained when using multiple threads (option `-t`).
+
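The sketch size and _k_-mer length rules given in the notes above can be worked through with the following sketch. The function names `genome_length` and `mash_parameters` are hypothetical, the average length is truncated to an integer (an assumption), and _k_ is printed unrounded because the README does not state how _GenoMed_ rounds it.

```shell
# Hypothetical helper: total sequence length of one FASTA file
# (header lines starting with ">" are skipped, whitespace is ignored).
genome_length() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Hypothetical helper: reads one genome length per line on stdin and
# prints "<sketch size> <k>", where
#   sketch size = average genome length (truncated to an integer here)
#   k           = log4(m^2 - m), m being the maximum genome length
mash_parameters() {
  awk '{ sum += $1; if ($1 > m) m = $1 }
       END { printf "%d %.2f\n", sum / NR, log(m * m - m) / log(4) }'
}
```

For instance, `for f in dirname/*.fasta; do genome_length "$f"; done | mash_parameters` prints both values for a set of genomes. These would presumably correspond to the `-s` (sketch size) and `-k` (_k_-mer size) options of `mash sketch`, although the exact _mash_ invocation used by _GenoMed_ is not described here.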
+
+## References
+
+Criscuolo A (2020) _On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference_. **F1000Research**, 9:1309. [doi:10.12688/f1000research.26930.1](https://doi.org/10.12688/f1000research.26930.1) 
+ 
+Fofanov Y, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T, Belapurkar C, Fofanov V, Li T-B, Chumakov S, Pettitt BM (2004) _How independent are the appearances of n-mers in different genomes?_. **Bioinformatics**, 20(15):2421-2428. [doi:10.1093/bioinformatics/bth266](https://doi.org/10.1093/bioinformatics/bth266)