README.md 11.8 KB
Newer Older
Alexis  CRISCUOLO's avatar
Alexis CRISCUOLO committed
1 2
# Gklust

Alexis  CRISCUOLO's avatar
Alexis CRISCUOLO committed
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181
_Gklust_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) that implements a fast greedy heuristic for clustering genome assembly files in FASTA format.
_Gklust_ can be useful for reducing redundancy within a large set of genomes, or to perform fast taxonomic assignations by comparing genome assemblies to known reference genomes.
_Gklust_ runs on UNIX, Linux and most OS X operating systems.


## Installation and execution

**A.** As _Gklust_ uses the program [mash](http://mash.readthedocs.io/en/latest/) [(Ondov et al. 2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) for estimating pairwise genome similarities, first install it or verify that it is already installed:

  * binaries: [github.com/marbl/Mash/releases](https://github.com/marbl/Mash/releases)
  * sources: [github.com/marbl/Mash](https://github.com/marbl/Mash)

**B.** Clone this repository with the following command line:
```bash
git clone https://gitlab.pasteur.fr/GIPhy/Gklust.git
```

**C.** If [mash](http://mash.readthedocs.io/en/latest/) is not available on your `$PATH` variable, edit the file `Gklust.sh` and indicate the local path to the `mash` binary (approximately between lines 70 and 100):

```bash
#############################################################################################################
#                                                                                                           #
# ================                                                                                          #
# = INSTALLATION =                                                                                          #
# ================                                                                                          #
#                                                                                                           #
# [1] REQUIREMENTS =======================================================================================  #
#  Gklust depends on Mash. You should have it installed on your computer prior to using Gklust. Make sure   #
#  that it is installed on your $PATH variable, or specify below the full path to the Mash binary.          #
#                                                                                                           #
# -- Mash: fast pairwise minhash dissimilarity estimation ---------------------------------------------     #
#    + binaries: github.com/marbl/Mash/releases                                                             #
#    + Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash:        #
#        fast genome and  metagenome distance estimation using MinHash. Genome Biology, 17:132. doi:        #
#        10.1186/s13059-016-0997-x                                                                          #
#                                                                 ################################################
                                                                  ################################################
  MASH=mash;                                                      ## <=== WRITE HERE THE PATH TO THE MASH       ##
                                                                  ##      BINARY (VERSION 1.0.2 MINIMUM)        ##
                                                                  ################################################
                                                                  ################################################
#                                                                                                           #
#############################################################################################################

```

**D.** Give the execute permission to the file `Gklust.sh`:
```bash
chmod +x Gklust.sh
```

**E.** Execute _Gklust_ with the following command line model:
```bash
./Gklust.sh  [options]
```


## Usage

Launch _Gklust_ without option to read the following documentation:

```
   __  _  __ _   _  _   __  _____  
  / _]| |/ /| | | || |/' _/|_   _| 
 | [/\|   < | |_| || |'._'.  | |   
  \__/|_|\_\|___|____||___/  |_|  

 USAGE:
    Gklust.sh  [options]
 where:
    -i <file>   input file containing the FASTA-formatted genome file names (one
                per line) to process (mandatory)
    -r <file>   input file  containing the  FASTA-formatted  genome file name(s)
                (one per line) to be used as reference genome(s)
    -c <real>   minhash similarity clustering cutoff (in percent; default: 95)
    -s <int>    minhash sketch size (default: 1000)
    -k <int>    minhash k-mer size (default: 21)
    -t <int>    number of threads (default: 2)
```


## Notes

* In short, _Gklust_ first adds the longest genome into the set _R_ of representative genome(s).
Next, for each of the remaining genomes _g_ sorted according their decreasing lengths, _Gklust_ searches for the most similar genome _r_ inside _R_:
  * if the nucleotide similarity between _g_ and _r_ is lower than a given cutoff (95% by default), then _g_ is added into _R_ as a new representative genome; 
  * otherwise, _g_ is added into the cluster _C_<sub>_r_</sub> corresponding to _r_.

  At the end, _Gklust_ returns the different representative genome(s) _r_ inside _R_, as well as the content of each associated cluster(s) _C_<sub>_r_</sub>.
  The nucleotide similarity between each pair of representative genomes from _R_ is always lower than the specified cutoff (option -c), unless closely related genomes are specified as reference ones (option -r). The nucleotide similarity between a representative genome _r_ and each genome from its cluster _C<sub>r</sub>_ is always greater than the specified cutoff (option -c).
* A starting set of representative genomes (e.g. type strains, complete genomes) could be specified with option -r.
* FASTA file names containing blank spaces or the special characters `;` or `|` will not be processed by _Gklust_.
* The nucleotide similarity cutoff could be modified with option -c. Of note, the default cutoff value (i.e. 95) leads to clusters containing genomes that belongs to the same species, as 95% is a well-accepted species delineation cutoff (e.g. Jain et al. 2018). Larger cutoff values (e.g. close to 100) lead to many clusters of very similar genomes. 
* The options -s and -k allow setting the sketch and k-mer sizes, respectively, used by Mash to estimate pairwise genome similarities. According to Ondov et al. (2016) recommendations, default values lead to fast and accurate estimates. Increasing the sketch size (option -s) will slow the overall running times, whereas decreasing it will increase the standard error of the estimates.
* Faster running times will be observed when using multiple threads (option -t).
* The verbosity of _Gklust_ could be reduced by ending the command line by `2>/dev/null`.


## Example

The file `GCF.list.txt` inside the directory _example/_ contains a list of 1,062 _Bordetella_ genome assembly identifiers gathered from the RefSeq (June, 2019).
The tool [wgetGenBankWGS](https://gitlab.pasteur.fr/GIPhy/wgetGenBankWGS) can be used to quickly download the corresponding FASTA-formatted files (~4.5 Gb) into the directory _example/_ using e.g. 20 threads:
```bash
./wgetGenBankWGS.sh -d refseq -e $(tr '\n' '|' < example/GCF.list.txt | sed 's/|$//') -o example -c 20
```

To cluster these genome assembly files using _Gklust_, an input file (e.g. _Bordetella.genomes_) can be written with the following command line:
```bash
ls example/*.fasta > Bordetella.genomes
```

The file `GCF.type_strains.txt` inside the directory _example/_ contains the genome assembly identifiers of the 11 _Bordetella_ type strains available from the RefSeq (June, 2019).
These assemblies can be used as reference genomes by setting the _Gklust_ option -r with a second input file (e.g. _Bordetella.references_) created with the following command line:
```bash
cat Bordetella.genomes | grep -F -f example/GCF.type_strains.txt > Bordetella.references
```

Using these two input files, the following command line allows _Gklust_ to be launched on 20 threads for clustering the 1,062 _Bordetella_ genome assembly files:
```bash
./Gklust.sh  -i Bordetella.genomes  -r Bordetella.references  -t 20  2>/dev/null
```

After a few seconds, this leads to the following standard output:
```
20 threads
k-mer size: 21
sketch size: 1000
cutoff=95% (0.0500)
1062 input files

cluster	size	reference
1	9	example/Bordetella.trematum.NCTC12995--FKBR01--GCF_900078335.1.fasta
2	5	example/Bordetella.pseudohinzii.8-296-03--JHEP01--GCF_000657795.2.fasta
3	1	example/Bordetella.petrii.DSM.12804--GCF_000067205.1.fasta
4	834	example/Bordetella.pertussis.18323--GCF_000306945.1.fasta
5	56	example/Bordetella.parapertussis.NCTC5952--UFUC01--GCF_900445785.1.fasta
6	33	example/Bordetella.holmesii.NCTC12912--UFTX01--GCF_900445775.1.fasta
7	13	example/Bordetella.hinzii.ATCC.51730--AWNM01--GCF_000471685.1.fasta
8	1	example/Bordetella.flabilis.AU10664--GCF_001676725.1.fasta
9	58	example/Bordetella.bronchiseptica.NBRC.13691--BCZI01--GCF_001598655.1.fasta
10	2	example/Bordetella.bronchialis.AU3182--GCF_001676705.1.fasta
11	23	example/Bordetella.avium.HAMBI_2160--QLKW01--GCF_003350095.1.fasta
12	1	example/Bordetella.genomosp.10.AU16122--NEVM01--GCF_002261225.1.fasta
13	1	example/Bordetella.sp.N--GCF_001433395.1.fasta
14	1	example/Bordetella.genomosp.11.AU8856--NEVS01--GCF_002261215.1.fasta
15	1	example/Bordetella.ansorpii.H050680373--FKIF01--GCF_900078705.1.fasta
16	1	example/Bordetella.ansorpii.NCTC13364--FKBS01--GCF_900078315.1.fasta
17	1	example/Bordetella.genomosp.9.AU21707--NEVJ01--GCF_002261425.1.fasta
18	1	example/Bordetella.genomosp.8.AU19157--GCF_002119685.1.fasta
19	2	example/Bordetella.genomosp.1.AU17610--NEVL01--GCF_002261335.1.fasta
20	1	example/Bordetella.sp.H567--GCF_001704295.1.fasta
21	2	example/Bordetella.genomosp.4.AU9919--NEVQ01--GCF_002261185.1.fasta
22	1	example/Bordetella.genomosp.13.AU7206--GCF_002119665.1.fasta
23	2	example/Bordetella.genomosp.5.AU10456--NEVP01--GCF_002261315.1.fasta
24	2	example/Bordetella.genomosp.2.AU8256--NEVT01--GCF_002261345.1.fasta
25	2	example/Bordetella.genomosp.9.AU17164--GCF_002119725.1.fasta
26	1	example/Bordetella.genomosp.12.AU6712--NEVU01--GCF_002261355.1.fasta
27	3	example/Bordetella.petrii.J49--JAEJ01--GCF_000518845.1.fasta
28	1	example/Bordetella.sp.HZ20--GCF_003058465.1.fasta
29	1	example/Bordetella.sp.FB-8--ARNH01--GCF_000382185.1.fasta
30	1	example/Bordetella.sp.J329--GCF_004006215.1.fasta
31	1	example/Bordetella.sp.3d-2-2--QETA01--GCF_003123725.1.fasta

details written into Bordetella.genomes.clust
```

The standard output shows that _Gklust_ created 31 clusters from the 1,062 _Bordetella_ genomes (the 11 firsts correspond to the specified type strains).
Therefore, the space-separated output file `Bordetella.genomes.clust` will contain 31 lines (i.e. one per cluster), each containing the reference genome file name (first entry) followed by the file names that belong to the corresponding cluster.

Of note, as _Gklust_ was used with the default genome similarity cutoff (i.e. 95%), the above output allows observing that the 1,062 genomes are mainly _B. pertussis_ ones (i.e. 834), that many _Bordetella_ new (genomo)species exist, and that the strain J49 (cluster 27) does likely not belong to the species _B. petrii_.

## References

Jain C, Rodriguez LM, Phillippy AM, Konstantinidis KT, Aluru S (2018) High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communication, 9:5114. [doi:10.1038/s41467-018-07641-9](https://www.nature.com/articles/s41467-018-07641-9).

Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132. [doi:10.1186/s13059-016-0997-x](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x).