Commit 814b4c6e authored by Alexis CRISCUOLO

v0.1.190513ac

parent f520a039
# Gklust
Fast greedy clustering of genomes based on minhash similarity
_Gklust_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) that implements a fast greedy heuristic for clustering genome assembly files in FASTA format.
_Gklust_ can be useful for reducing redundancy within a large set of genomes, or to perform fast taxonomic assignations by comparing genome assemblies to known reference genomes.
_Gklust_ runs on UNIX, Linux and most OS X operating systems.
## Installation and execution
**A.** As _Gklust_ uses the program [mash](http://mash.readthedocs.io/en/latest/) [(Ondov et al. 2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) for estimating pairwise genome similarities, first install it or verify that it is already installed:
* binaries: [github.com/marbl/Mash/releases](https://github.com/marbl/Mash/releases)
* sources: [github.com/marbl/Mash](https://github.com/marbl/Mash)
**B.** Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/Gklust.git
```
**C.** If [mash](http://mash.readthedocs.io/en/latest/) is not available in your `$PATH`, edit the file `Gklust.sh` and set the local path to the `mash` binary (approximately between lines 70 and 100):
```bash
#############################################################################################################
# #
# ================ #
# = INSTALLATION = #
# ================ #
# #
# [1] REQUIREMENTS ======================================================================================= #
# Gklust depends on Mash. You should have it installed on your computer prior to using Gklust. Make sure #
# that it is installed on your $PATH variable, or specify below the full path to the Mash binary. #
# #
# -- Mash: fast pairwise minhash dissimilarity estimation --------------------------------------------- #
# + binaries: github.com/marbl/Mash/releases #
# + Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: #
# fast genome and metagenome distance estimation using MinHash. Genome Biology, 17:132. doi: #
# 10.1186/s13059-016-0997-x #
# ################################################
################################################
MASH=mash; ## <=== WRITE HERE THE PATH TO THE MASH ##
## BINARY (VERSION 1.0.2 MINIMUM) ##
################################################
################################################
# #
#############################################################################################################
```
**D.** Give the execute permission to the file `Gklust.sh`:
```bash
chmod +x Gklust.sh
```
**E.** Execute _Gklust_ with the following command line model:
```bash
./Gklust.sh [options]
```
## Usage
Launch _Gklust_ without any option to read the following documentation:
```
__ _ __ _ _ _ __ _____
/ _]| |/ /| | | || |/' _/|_ _|
| [/\| < | |_| || |'._'. | |
\__/|_|\_\|___|____||___/ |_|
USAGE:
   Gklust.sh  [options]
where:
   -i <file>  input file containing the FASTA-formatted genome file names (one
              per line) to process (mandatory)
   -r <file>  input file containing the FASTA-formatted genome file name(s)
              (one per line) to be used as reference genome(s)
   -c <real>  minhash similarity clustering cutoff (in percent; default: 95)
   -s <int>   minhash sketch size (default: 1000)
   -k <int>   minhash k-mer size (default: 21)
   -t <int>   number of threads (default: 2)
```
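The cutoff of option -c is a similarity in percent, whereas Mash reports distances, and the example output further below prints `cutoff=95% (0.0500)`. A minimal sketch of that conversion, assuming the threshold is simply 1 - c/100 (the helper name `cutoff_to_distance` is hypothetical and is not taken from `Gklust.sh`):

```bash
#!/usr/bin/env bash

# cutoff_to_distance <percent>: convert a similarity cutoff in percent into a
# distance threshold, ASSUMING distance = 1 - similarity/100.
cutoff_to_distance() {
  awk -v c="$1" 'BEGIN { printf "%.4f\n", 1 - c / 100 }'
}

cutoff_to_distance 95     # prints 0.0500
cutoff_to_distance 99.5   # prints 0.0050
```

Under this assumption, raising the cutoff towards 100 shrinks the distance threshold towards 0.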
## Notes
* In short, _Gklust_ first adds the longest genome to the set _R_ of representative genome(s).
  Next, for each of the remaining genomes _g_, sorted by decreasing length, _Gklust_ searches for the most similar genome _r_ in _R_:
  * if the nucleotide similarity between _g_ and _r_ is lower than a given cutoff (95% by default), then _g_ is added to _R_ as a new representative genome;
  * otherwise, _g_ is added to the cluster _C_<sub>_r_</sub> corresponding to _r_.

  At the end, _Gklust_ returns the representative genome(s) _r_ in _R_, as well as the content of each associated cluster _C_<sub>_r_</sub>.
  The nucleotide similarity between each pair of representative genomes from _R_ is always lower than the specified cutoff (option -c), unless closely related genomes are specified as reference ones (option -r). The nucleotide similarity between a representative genome _r_ and each genome from its cluster _C_<sub>_r_</sub> is always at least equal to the specified cutoff (option -c).
* A starting set of representative genomes (e.g. type strains, complete genomes) can be specified with option -r.
* FASTA file names containing blank spaces or the special characters `;` or `|` will not be processed by _Gklust_.
* The nucleotide similarity cutoff can be modified with option -c. Of note, the default cutoff value (i.e. 95) leads to clusters containing genomes that belong to the same species, as 95% is a well-accepted species delineation cutoff (e.g. Jain et al. 2018). Larger cutoff values (e.g. close to 100) lead to many clusters of very similar genomes.
* The options -s and -k set the sketch and k-mer sizes, respectively, used by Mash to estimate pairwise genome similarities. According to the recommendations of Ondov et al. (2016), the default values lead to fast and accurate estimates. Increasing the sketch size (option -s) will increase the overall running time, whereas decreasing it will increase the standard error of the estimates.
* Faster running times will be observed when using multiple threads (option -t).
* The verbosity of _Gklust_ can be reduced by ending the command line with `2>/dev/null`.
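
The greedy heuristic described in the first note can be sketched as follows. This is only an illustration under stated assumptions, not the code of `Gklust.sh`: the genome names, the similarity table `SIMS` and the helpers `similarity` and `greedy_cluster` are all hypothetical, and the similarities are read from a toy table instead of being estimated with Mash.

```bash
#!/usr/bin/env bash

# Hypothetical pairwise similarities (percent), one "a b similarity" triple per line.
SIMS="gA gB 98.2
gA gC 80.1
gB gC 79.5
gA gD 81.0
gB gD 80.3
gC gD 96.4"

# similarity <a> <b>: look up the toy table in either orientation.
similarity() {
  awk -v a="$1" -v b="$2" \
    '($1 == a && $2 == b) || ($1 == b && $2 == a) { print $3; exit }' <<< "$SIMS"
}

# greedy_cluster <cutoff> <genome>...: genomes must be given by decreasing length.
# Prints one line per cluster: the representative followed by its members.
greedy_cluster() {
  local cutoff="$1"; shift
  local reps=() clusters=() g i s best best_sim
  for g in "$@"; do
    best=-1; best_sim=-1
    for i in "${!reps[@]}"; do
      s=$(similarity "$g" "${reps[$i]}")
      # keep the most similar representative seen so far
      if awk -v s="$s" -v m="$best_sim" 'BEGIN { exit !(s > m) }'; then
        best=$i; best_sim=$s
      fi
    done
    if [ "$best" -ge 0 ] && awk -v s="$best_sim" -v c="$cutoff" 'BEGIN { exit !(s >= c) }'; then
      clusters[$best]="${clusters[$best]} $g"   # g joins the cluster of r
    else
      reps+=("$g"); clusters+=("$g")            # g becomes a new representative
    fi
  done
  printf '%s\n' "${clusters[@]}"
}

greedy_cluster 95 gA gB gC gD
```

With the toy table and a cutoff of 95, `gB` joins the cluster of `gA` and `gD` joins the cluster of `gC`, so two lines are printed (`gA gB` and `gC gD`). Each genome is compared only with the current representatives, never with every other genome.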
## Example
The file `GCF.list.txt` inside the directory _example/_ contains a list of 1,062 _Bordetella_ genome assembly identifiers gathered from RefSeq (June 2019).
The tool [wgetGenBankWGS](https://gitlab.pasteur.fr/GIPhy/wgetGenBankWGS) can be used to quickly download the corresponding FASTA-formatted files (~4.5 Gb) into the directory _example/_ using e.g. 20 threads:
```bash
./wgetGenBankWGS.sh -d refseq -e $(tr '\n' '|' < example/GCF.list.txt | sed 's/|$//') -o example -c 20
```
To cluster these genome assembly files using _Gklust_, an input file (e.g. _Bordetella.genomes_) can be written with the following command line:
```bash
ls example/*.fasta > Bordetella.genomes
```
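Since file names containing blank spaces, `;` or `|` are not processed by _Gklust_, it may be worth checking the list before clustering. A minimal sketch, in which the helper `list_unsupported_names` and the toy file names are hypothetical:

```bash
#!/usr/bin/env bash

# list_unsupported_names <list-file>: print the lines of a genome list that
# contain a blank, ';' or '|' (hypothetical helper, not part of Gklust).
list_unsupported_names() {
  grep -E '[[:blank:];|]' "$1"
}

# Toy demonstration on a temporary list of made-up file names.
demo=$(mktemp)
printf '%s\n' 'example/a.fasta' 'example/my genome.fasta' 'example/b|c.fasta' > "$demo"
list_unsupported_names "$demo"   # prints the second and third names
rm -f "$demo"
```

On a real list, e.g. `list_unsupported_names Bordetella.genomes`, an empty output (grep exit status 1) means that no offending name was found.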
The file `GCF.type_strains.txt` inside the directory _example/_ contains the genome assembly identifiers of the 11 _Bordetella_ type strains available from RefSeq (June 2019).
These assemblies can be used as reference genomes by setting the _Gklust_ option -r with a second input file (e.g. _Bordetella.references_) created with the following command line:
```bash
grep -F -f example/GCF.type_strains.txt Bordetella.genomes > Bordetella.references
```
Using these two input files, the following command line allows _Gklust_ to be launched on 20 threads for clustering the 1,062 _Bordetella_ genome assembly files:
```bash
./Gklust.sh -i Bordetella.genomes -r Bordetella.references -t 20 2>/dev/null
```
After a few seconds, this leads to the following standard output:
```
20 threads
k-mer size: 21
sketch size: 1000
cutoff=95% (0.0500)
1062 input files
cluster size reference
1 9 example/Bordetella.trematum.NCTC12995--FKBR01--GCF_900078335.1.fasta
2 5 example/Bordetella.pseudohinzii.8-296-03--JHEP01--GCF_000657795.2.fasta
3 1 example/Bordetella.petrii.DSM.12804--GCF_000067205.1.fasta
4 834 example/Bordetella.pertussis.18323--GCF_000306945.1.fasta
5 56 example/Bordetella.parapertussis.NCTC5952--UFUC01--GCF_900445785.1.fasta
6 33 example/Bordetella.holmesii.NCTC12912--UFTX01--GCF_900445775.1.fasta
7 13 example/Bordetella.hinzii.ATCC.51730--AWNM01--GCF_000471685.1.fasta
8 1 example/Bordetella.flabilis.AU10664--GCF_001676725.1.fasta
9 58 example/Bordetella.bronchiseptica.NBRC.13691--BCZI01--GCF_001598655.1.fasta
10 2 example/Bordetella.bronchialis.AU3182--GCF_001676705.1.fasta
11 23 example/Bordetella.avium.HAMBI_2160--QLKW01--GCF_003350095.1.fasta
12 1 example/Bordetella.genomosp.10.AU16122--NEVM01--GCF_002261225.1.fasta
13 1 example/Bordetella.sp.N--GCF_001433395.1.fasta
14 1 example/Bordetella.genomosp.11.AU8856--NEVS01--GCF_002261215.1.fasta
15 1 example/Bordetella.ansorpii.H050680373--FKIF01--GCF_900078705.1.fasta
16 1 example/Bordetella.ansorpii.NCTC13364--FKBS01--GCF_900078315.1.fasta
17 1 example/Bordetella.genomosp.9.AU21707--NEVJ01--GCF_002261425.1.fasta
18 1 example/Bordetella.genomosp.8.AU19157--GCF_002119685.1.fasta
19 2 example/Bordetella.genomosp.1.AU17610--NEVL01--GCF_002261335.1.fasta
20 1 example/Bordetella.sp.H567--GCF_001704295.1.fasta
21 2 example/Bordetella.genomosp.4.AU9919--NEVQ01--GCF_002261185.1.fasta
22 1 example/Bordetella.genomosp.13.AU7206--GCF_002119665.1.fasta
23 2 example/Bordetella.genomosp.5.AU10456--NEVP01--GCF_002261315.1.fasta
24 2 example/Bordetella.genomosp.2.AU8256--NEVT01--GCF_002261345.1.fasta
25 2 example/Bordetella.genomosp.9.AU17164--GCF_002119725.1.fasta
26 1 example/Bordetella.genomosp.12.AU6712--NEVU01--GCF_002261355.1.fasta
27 3 example/Bordetella.petrii.J49--JAEJ01--GCF_000518845.1.fasta
28 1 example/Bordetella.sp.HZ20--GCF_003058465.1.fasta
29 1 example/Bordetella.sp.FB-8--ARNH01--GCF_000382185.1.fasta
30 1 example/Bordetella.sp.J329--GCF_004006215.1.fasta
31 1 example/Bordetella.sp.3d-2-2--QETA01--GCF_003123725.1.fasta
details written into Bordetella.genomes.clust
```
The standard output shows that _Gklust_ created 31 clusters from the 1,062 _Bordetella_ genomes (the first 11 correspond to the specified type strains).
Therefore, the space-separated output file `Bordetella.genomes.clust` will contain 31 lines (i.e. one per cluster), each containing the reference genome file name (first entry) followed by the file names that belong to the corresponding cluster.
Of note, as _Gklust_ was used with the default genome similarity cutoff (i.e. 95%), the above output shows that most of the 1,062 genomes are _B. pertussis_ ones (i.e. 834), that many new _Bordetella_ (genomo)species exist, and that the strain J49 (cluster 27) likely does not belong to the species _B. petrii_.
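As the `.clust` file is described as space-separated with one cluster per line, it can be summarized with standard tools. A minimal sketch, in which the helper `cluster_sizes` and the toy file are hypothetical; it simply counts the entries on each line, and whether this count equals the `size` column shown above depends on how the reference itself is listed among the members, which is an assumption here:

```bash
#!/usr/bin/env bash

# cluster_sizes <clust-file>: for each line, print the cluster rank, the number
# of space-separated entries on the line, and the first entry (the reference).
cluster_sizes() {
  awk '{ printf "%d\t%d\t%s\n", NR, NF, $1 }' "$1"
}

# Toy demonstration on a made-up two-cluster file.
toy=$(mktemp)
printf '%s\n' 'r1.fasta a.fasta b.fasta' 'r2.fasta' > "$toy"
cluster_sizes "$toy"
rm -f "$toy"
```

On the real output, the call would be `cluster_sizes Bordetella.genomes.clust`.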
## References
Jain C, Rodriguez LM, Phillippy AM, Konstantinidis KT, Aluru S (2018) High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications, 9:5114. [doi:10.1038/s41467-018-07641-9](https://www.nature.com/articles/s41467-018-07641-9).
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132. [doi:10.1186/s13059-016-0997-x](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x).
GCF_000067205
GCF_000070465
GCF_000193515
GCF_000193535
GCF_000193555
GCF_000193575
GCF_000193595
GCF_000193615
GCF_000195675
GCF_000195695
GCF_000195715
GCF_000212975
GCF_000306945
GCF_000313065
GCF_000313085
GCF_000317335
GCF_000317935
GCF_000317955
GCF_000318015
GCF_000341465
GCF_000341485
GCF_000382185
GCF_000471685
GCF_000471705
GCF_000479395
GCF_000479415
GCF_000479435
GCF_000479455
GCF_000479475
GCF_000479495
GCF_000479515
GCF_000479535
GCF_000479555
GCF_000479575
GCF_000479595
GCF_000479615
GCF_000479635
GCF_000479655
GCF_000479675
GCF_000479695
GCF_000479715
GCF_000479735
GCF_000479755
GCF_000479775
GCF_000479795
GCF_000479815
GCF_000479855
GCF_000479875
GCF_000479895
GCF_000479915
GCF_000479935
GCF_000504325
GCF_000518845
GCF_000518965
GCF_000571975
GCF_000571995
GCF_000572015
GCF_000583025
GCF_000598125
GCF_000612485
GCF_000648965
GCF_000648985
GCF_000649005
GCF_000649025
GCF_000649045
GCF_000649065
GCF_000649145
GCF_000649165
GCF_000649185
GCF_000657695
GCF_000657715
GCF_000657735
GCF_000657755
GCF_000657775
GCF_000657795
GCF_000657815
GCF_000662055
GCF_000662095
GCF_000662115
GCF_000662135
GCF_000662155
GCF_000662175
GCF_000662195
GCF_000662215
GCF_000662235
GCF_000662255
GCF_000662275
GCF_000662295
GCF_000662335
GCF_000689675
GCF_000689695
GCF_000689715
GCF_000689735
GCF_000689755
GCF_000689775
GCF_000689795
GCF_000689815
GCF_000689835
GCF_000689855
GCF_000689875
GCF_000689915
GCF_000689935
GCF_000689955
GCF_000689995
GCF_000690015
GCF_000690095
GCF_000690115
GCF_000690135
GCF_000690155
GCF_000690175
GCF_000690195
GCF_000690215
GCF_000690235
GCF_000690255
GCF_000690275
GCF_000690295
GCF_000690315
GCF_000690335
GCF_000690355
GCF_000690375
GCF_000690395
GCF_000690415
GCF_000690435
GCF_000690455
GCF_000690475
GCF_000690495
GCF_000690515
GCF_000690615
GCF_000690635
GCF_000690755
GCF_000698985
GCF_000699925
GCF_000746565
GCF_000765395
GCF_000773675
GCF_000812165
GCF_000813405
GCF_000829175
GCF_001013565
GCF_001078275
GCF_001078295
GCF_001187405
GCF_001191795
GCF_001192455
GCF_001192535
GCF_001193155
GCF_001193455
GCF_001193515
GCF_001193835
GCF_001193855
GCF_001193935
GCF_001194095
GCF_001194225
GCF_001194565
GCF_001194605
GCF_001194985
GCF_001195245
GCF_001195265
GCF_001195625
GCF_001195885
GCF_001196005
GCF_001196085
GCF_001196395
GCF_001196455
GCF_001196575
GCF_001196855
GCF_001196975
GCF_001196995
GCF_001197455
GCF_001197495
GCF_001197815
GCF_001198095
GCF_001198195
GCF_001198375
GCF_001198715
GCF_001198815
GCF_001198955
GCF_001199115
GCF_001199415
GCF_001199955
GCF_001200055
GCF_001200095
GCF_001200175
GCF_001200315
GCF_001200435
GCF_001200475
GCF_001200675
GCF_001200735
GCF_001201155
GCF_001201315
GCF_001201375
GCF_001201395
GCF_001201635
GCF_001201995
GCF_001202375
GCF_001202455
GCF_001202495
GCF_001202575
GCF_001202695
GCF_001202875
GCF_001203115
GCF_001203235
GCF_001203935
GCF_001204435
GCF_001204675
GCF_001205255
GCF_001205275
GCF_001205335
GCF_001205915
GCF_001206235
GCF_001206335
GCF_001206615
GCF_001206895
GCF_001206935
GCF_001207095
GCF_001207655
GCF_001207695
GCF_001207905
GCF_001208005
GCF_001208065
GCF_001208145
GCF_001208185
GCF_001208205
GCF_001208285
GCF_001208905
GCF_001209065
GCF_001209245
GCF_001209845
GCF_001209985
GCF_001210045
GCF_001210145
GCF_001210185
GCF_001210765
GCF_001210865
GCF_001211005
GCF_001211225
GCF_001211645
GCF_001211785
GCF_001211805
GCF_001211905
GCF_001211945
GCF_001211985
GCF_001212045
GCF_001212285
GCF_001212485
GCF_001212505
GCF_001212525
GCF_001212625
GCF_001212645
GCF_001298875
GCF_001307525
GCF_001307565
GCF_001307585
GCF_001307605
GCF_001307625
GCF_001307645
GCF_001307665
GCF_001307685
GCF_001307705
GCF_001307725
GCF_001307745
GCF_001319225
GCF_001319245
GCF_001319265
GCF_001319285
GCF_001330995
GCF_001331015
GCF_001331035
GCF_001331055
GCF_001331135
GCF_001331155
GCF_001331175
GCF_001331215
GCF_001331235
GCF_001331255
GCF_001331275
GCF_001331295
GCF_001331335
GCF_001331355
GCF_001331375
GCF_001331395
GCF_001331415
GCF_001331435
GCF_001331475
GCF_001331495
GCF_001331515
GCF_001331535
GCF_001331555
GCF_001331575
GCF_001331595
GCF_001331615
GCF_001331635
GCF_001331655
GCF_001331675
GCF_001331695
GCF_001331715
GCF_001331735
GCF_001331755
GCF_001331775
GCF_001331795
GCF_001331815
GCF_001331835
GCF_001331855
GCF_001331875
GCF_001331895
GCF_001331915
GCF_001331955
GCF_001331975
GCF_001331995
GCF_001332015
GCF_001332035
GCF_001332055
GCF_001332095
GCF_001332115
GCF_001332135
GCF_001332155
GCF_001332175
GCF_001332195
GCF_001332215
GCF_001332255
GCF_001332275
GCF_001332295
GCF_001332315
GCF_001332335
GCF_001332375
GCF_001332435
GCF_001332455
GCF_001332475
GCF_001332495
GCF_001332515
GCF_001332535
GCF_001332555
GCF_001332575
GCF_001332595
GCF_001332615
GCF_001332655
GCF_001332675
GCF_001332695
GCF_001332715
GCF_001332735
GCF_001332755
GCF_001332795
GCF_001332835
GCF_001332875
GCF_001332895
GCF_001332915
GCF_001333035
GCF_001333055
GCF_001333075
GCF_001333115
GCF_001333135
GCF_001333155
GCF_001333175
GCF_001333195
GCF_001333215
GCF_001333235
GCF_001333255
GCF_001333275
GCF_001333295
GCF_001333315
GCF_001333335
GCF_001333355
GCF_001333375
GCF_001333395
GCF_001333415
GCF_001333435
GCF_001333455
GCF_001333495
GCF_001333515
GCF_001333535
GCF_001333555
GCF_001333575
GCF_001333595
GCF_001333615
GCF_001333635
GCF_001333655
GCF_001333695
GCF_001333755
GCF_001333775
GCF_001333795
GCF_001333815
GCF_001333835
GCF_001333855
GCF_001333875
GCF_001333895
GCF_001333915
GCF_001333935
GCF_001433395
GCF_001509895
GCF_001509915
GCF_001525545
GCF_001525555
GCF_001548475
GCF_001548485
GCF_001548495
GCF_001558395
GCF_001559055
GCF_001598615
GCF_001598655
GCF_001601775
GCF_001601785
GCF_001605035
GCF_001605055
GCF_001605075
GCF_001605095
GCF_001605115
GCF_001605135
GCF_001605155
GCF_001605175
GCF_001605195
GCF_001605215
GCF_001605235
GCF_001605255
GCF_001605275
GCF_001605295
GCF_001605345
GCF_001605365
GCF_001605385
GCF_001605405
GCF_001605425
GCF_001605445
GCF_001605465
GCF_001605485
GCF_001605505
GCF_001605525
GCF_001605545
GCF_001605565
GCF_001605585
GCF_001605605
GCF_001605625
GCF_001605645
GCF_001605665
GCF_001605685
GCF_001605705
GCF_001628735
GCF_001676705
GCF_001676725
GCF_001676745
GCF_001686365
GCF_001686405
GCF_001686425
GCF_001686445
GCF_001686465
GCF_001686485
GCF_001686505
GCF_001686525
GCF_001686545
GCF_001686565
GCF_001686585
GCF_001686605
GCF_001686625
GCF_001687385
GCF_001687405
GCF_001698185
GCF_001704215
GCF_001704255
GCF_001704295
GCF_001704355
GCF_001831395
GCF_001831415
GCF_001831435
GCF_001831455
GCF_001984925
GCF_001984945
GCF_001984965
GCF_001984985
GCF_001985005
GCF_001985025
GCF_001985045
GCF_001985065
GCF_001985085
GCF_001985105
GCF_001985125
GCF_001985145
GCF_001985165