# Gklust _Gklust_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) that implements a fast greedy heuristic for clustering genome assembly files in FASTA format. _Gklust_ can be useful for reducing redundancy within a large set of genomes, or to perform fast taxonomic assignations by comparing genome assemblies to known reference genomes. _Gklust_ runs on UNIX, Linux and most OS X operating systems. ## Installation and execution **A.** As _Gklust_ uses the program [mash](http://mash.readthedocs.io/en/latest/) [(Ondov et al. 2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) for estimating pairwise genome similarities, first install it or verify that it is already installed: * binaries: [github.com/marbl/Mash/releases](https://github.com/marbl/Mash/releases) * sources: [github.com/marbl/Mash](https://github.com/marbl/Mash) **B.** Clone this repository with the following command line: ```bash git clone https://gitlab.pasteur.fr/GIPhy/Gklust.git ``` **C.** If [mash](http://mash.readthedocs.io/en/latest/) is not available on your `$PATH` variable, edit the file `Gklust.sh` and indicate the local path to the `mash` binary (approximately between lines 70 and 100): ```bash ############################################################################################################# # # # ================ # # = INSTALLATION = # # ================ # # # # [1] REQUIREMENTS ======================================================================================= # # Gklust depends on Mash. You should have it installed on your computer prior to using Gklust. Make sure # # that it is installed on your $PATH variable, or specify below the full path to the Mash binary. # # # # -- Mash: fast pairwise minhash dissimilarity estimation --------------------------------------------- # # + binaries: github.com/marbl/Mash/releases # # + Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: # # fast genome and metagenome distance estimation using MinHash. Genome Biology, 17:132. doi: # # 10.1186/s13059-016-0997-x # # ################################################ ################################################ MASH=mash; ## <=== WRITE HERE THE PATH TO THE MASH ## ## BINARY (VERSION 1.0.2 MINIMUM) ## ################################################ ################################################ # # ############################################################################################################# ``` **D.** Give the execute permission to the file `Gklust.sh`: ```bash chmod +x Gklust.sh ``` **E.** Execute _Gklust_ with the following command line model: ```bash ./Gklust.sh [options] ``` ## Usage Launch _Gklust_ without option to read the following documentation: ``` __ _ __ _ _ _ __ _____ / _]| |/ /| | | || |/' _/|_ _| | [/\| < | |_| || |'._'. | | \__/|_|\_\|___|____||___/ |_| USAGE: Gklust.sh [options] where: -i input file containing the FASTA-formatted genome file names (one per line) to process (mandatory) -r input file containing the FASTA-formatted genome file name(s) (one per line) to be used as reference genome(s) -c minhash similarity clustering cutoff (in percent; default: 95) -s minhash sketch size (default: 1000) -k minhash k-mer size (default: 21) -t number of threads (default: 2) ``` ## Notes * In short, _Gklust_ first adds the longest genome into the set _R_ of representative genome(s). Next, for each of the remaining genomes _g_ sorted according their decreasing lengths, _Gklust_ searches for the most similar genome _r_ inside _R_: * if the nucleotide similarity between _g_ and _r_ is lower than a given cutoff (95% by default), then _g_ is added into _R_ as a new representative genome; * otherwise, _g_ is added into the cluster _C__r_ corresponding to _r_. At the end, _Gklust_ returns the different representative genome(s) _r_ inside _R_, as well as the content of each associated cluster(s) _C__r_. The nucleotide similarity between each pair of representative genomes from _R_ is always lower than the specified cutoff (option -c), unless closely related genomes are specified as reference ones (option -r). The nucleotide similarity between a representative genome _r_ and each genome from its cluster _Cr_ is always greater than the specified cutoff (option -c). * A starting set of representative genomes (e.g. type strains, complete genomes) could be specified with option -r. * FASTA file names containing blank spaces or the special characters `;` or `|` will not be processed by _Gklust_. * The nucleotide similarity cutoff could be modified with option -c. Of note, the default cutoff value (i.e. 95) leads to clusters containing genomes that belongs to the same species, as 95% is a well-accepted species delineation cutoff (e.g. Jain et al. 2018). Larger cutoff values (e.g. close to 100) lead to many clusters of very similar genomes. * The options -s and -k allow setting the sketch and k-mer sizes, respectively, used by Mash to estimate pairwise genome similarities. According to Ondov et al. (2016) recommendations, default values lead to fast and accurate estimates. Increasing the sketch size (option -s) will slow the overall running times, whereas decreasing it will increase the standard error of the estimates. * Faster running times will be observed when using multiple threads (option -t). * The verbosity of _Gklust_ could be reduced by ending the command line by `2>/dev/null`. ## Example The file `GCF.list.txt` inside the directory _example/_ contains a list of 1,062 _Bordetella_ genome assembly identifiers gathered from the RefSeq (June, 2019). The tool [wgetGenBankWGS](https://gitlab.pasteur.fr/GIPhy/wgetGenBankWGS) can be used to quickly download the corresponding FASTA-formatted files (~4.5 Gb) into the directory _example/_ using e.g. 20 threads: ```bash ./wgetGenBankWGS.sh -d refseq -e $(tr '\n' '|' < example/GCF.list.txt | sed 's/|$//') -o example -c 20 ``` To cluster these genome assembly files using _Gklust_, an input file (e.g. _Bordetella.genomes_) can be written with the following command line: ```bash ls example/*.fasta > Bordetella.genomes ``` The file `GCF.type_strains.txt` inside the directory _example/_ contains the genome assembly identifiers of the 11 _Bordetella_ type strains available from the RefSeq (June, 2019). These assemblies can be used as reference genomes by setting the _Gklust_ option -r with a second input file (e.g. _Bordetella.references_) created with the following command line: ```bash cat Bordetella.genomes | grep -F -f example/GCF.type_strains.txt > Bordetella.references ``` Using these two input files, the following command line allows _Gklust_ to be launched on 20 threads for clustering the 1,062 _Bordetella_ genome assembly files: ```bash ./Gklust.sh -i Bordetella.genomes -r Bordetella.references -t 20 2>/dev/null ``` After a few seconds, this leads to the following standard output: ``` 20 threads k-mer size: 21 sketch size: 1000 cutoff=95% (0.0500) 1062 input files cluster size reference 1 9 example/Bordetella.trematum.NCTC12995--FKBR01--GCF_900078335.1.fasta 2 5 example/Bordetella.pseudohinzii.8-296-03--JHEP01--GCF_000657795.2.fasta 3 1 example/Bordetella.petrii.DSM.12804--GCF_000067205.1.fasta 4 834 example/Bordetella.pertussis.18323--GCF_000306945.1.fasta 5 56 example/Bordetella.parapertussis.NCTC5952--UFUC01--GCF_900445785.1.fasta 6 33 example/Bordetella.holmesii.NCTC12912--UFTX01--GCF_900445775.1.fasta 7 13 example/Bordetella.hinzii.ATCC.51730--AWNM01--GCF_000471685.1.fasta 8 1 example/Bordetella.flabilis.AU10664--GCF_001676725.1.fasta 9 58 example/Bordetella.bronchiseptica.NBRC.13691--BCZI01--GCF_001598655.1.fasta 10 2 example/Bordetella.bronchialis.AU3182--GCF_001676705.1.fasta 11 23 example/Bordetella.avium.HAMBI_2160--QLKW01--GCF_003350095.1.fasta 12 1 example/Bordetella.genomosp.10.AU16122--NEVM01--GCF_002261225.1.fasta 13 1 example/Bordetella.sp.N--GCF_001433395.1.fasta 14 1 example/Bordetella.genomosp.11.AU8856--NEVS01--GCF_002261215.1.fasta 15 1 example/Bordetella.ansorpii.H050680373--FKIF01--GCF_900078705.1.fasta 16 1 example/Bordetella.ansorpii.NCTC13364--FKBS01--GCF_900078315.1.fasta 17 1 example/Bordetella.genomosp.9.AU21707--NEVJ01--GCF_002261425.1.fasta 18 1 example/Bordetella.genomosp.8.AU19157--GCF_002119685.1.fasta 19 2 example/Bordetella.genomosp.1.AU17610--NEVL01--GCF_002261335.1.fasta 20 1 example/Bordetella.sp.H567--GCF_001704295.1.fasta 21 2 example/Bordetella.genomosp.4.AU9919--NEVQ01--GCF_002261185.1.fasta 22 1 example/Bordetella.genomosp.13.AU7206--GCF_002119665.1.fasta 23 2 example/Bordetella.genomosp.5.AU10456--NEVP01--GCF_002261315.1.fasta 24 2 example/Bordetella.genomosp.2.AU8256--NEVT01--GCF_002261345.1.fasta 25 2 example/Bordetella.genomosp.9.AU17164--GCF_002119725.1.fasta 26 1 example/Bordetella.genomosp.12.AU6712--NEVU01--GCF_002261355.1.fasta 27 3 example/Bordetella.petrii.J49--JAEJ01--GCF_000518845.1.fasta 28 1 example/Bordetella.sp.HZ20--GCF_003058465.1.fasta 29 1 example/Bordetella.sp.FB-8--ARNH01--GCF_000382185.1.fasta 30 1 example/Bordetella.sp.J329--GCF_004006215.1.fasta 31 1 example/Bordetella.sp.3d-2-2--QETA01--GCF_003123725.1.fasta details written into Bordetella.genomes.clust ``` The standard output shows that _Gklust_ created 31 clusters from the 1,062 _Bordetella_ genomes (the 11 firsts correspond to the specified type strains). Therefore, the space-separated output file `Bordetella.genomes.clust` will contain 31 lines (i.e. one per cluster), each containing the reference genome file name (first entry) followed by the file names that belong to the corresponding cluster. Of note, as _Gklust_ was used with the default genome similarity cutoff (i.e. 95%), the above output allows observing that the 1,062 genomes are mainly _B. pertussis_ ones (i.e. 834), that many _Bordetella_ new (genomo)species exist, and that the strain J49 (cluster 27) does likely not belong to the species _B. petrii_. ## References Jain C, Rodriguez LM, Phillippy AM, Konstantinidis KT, Aluru S (2018) High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communication, 9:5114. [doi:10.1038/s41467-018-07641-9](https://www.nature.com/articles/s41467-018-07641-9). Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132. [doi:10.1186/s13059-016-0997-x](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x).