MSTclust
MSTclust is a command line program written in Java to compute Minimum Spanning Tree (MST)-based clustering from either profile (i.e. tuples of positive integers) or pairwise distance data. MSTclust can also be used to compute pairwise distances between profiles, single-linkage hierarchical classification trees, minimum spanning trees, and several statistics to assess the efficiency of a specific clustering.
Compilation and execution
The source code of MSTclust is inside the src directory and can be compiled and executed in two different ways.
Building an executable jar file
On computers with Oracle JDK (8 or higher) installed, a Java executable jar file could be created. In a command-line window, go to the src directory and type:
javac MSTclust.java
echo Main-Class: MSTclust > MANIFEST.MF
jar -cmvf MANIFEST.MF MSTclust.jar MSTclust.class
rm MANIFEST.MF MSTclust.class
This will create the executable jar file MSTclust.jar
that can be launched with the following command line model:
java -jar MSTclust.jar [options]
Building a native executable
On computers with GraalVM installed, a native executable can also be built. In a command-line window, go to the src directory, and type:
javac MSTclust.java
native-image MSTclust MSTclust
rm MSTclust.class
This will create the native executable MSTclust
that can be launched with the following command line model:
./MSTclust [options]
Usage
Launch MSTclust without option to read the following documentation:
MSTclust
Minimum Spanning Tree-based clustering
USAGE: MSTclust -i <infile> -o <basename> [options]
OPTIONS:
-i <infile> input file containing either tab-delimited profiles or a
lower-triangular distance matrix (mandatory)
-o <basenmae> basename for output files (mandatory)
-r <string> selecting only specified rows (default: "1-")
-l <string> (input tab-delimited profiles) field(s) containing profile
labels (default: "1")
-p <string> (input tab-delimited profiles) fields containing profiles
(default: "2-")
-e <float> (input tab-delimited profiles) maximum allowed proportion
of empty entries per profile (default: 0.25)
-c <float> inclusive cutoff to define cluster(s) (default: not set)
-S <integer> seed value for data perturbation and subsampling analyses
(default: 0)
-L <integer> profile length to carry out data perturbation analysis
(default: not set)
-B <integer> number of bins to carry out data subsampling analyses
(default: not set)
-R <integer> number of replicates for data perturbation and subsampling
analyses (default: 100)
-h writing a single-linkage hierarchical classification tree
into an output file (default: not set)
-m writing a minimum spanning tree into an output file
(default: not set)
Notes
- Profile data can be read via a tab-delimited file with some field(s) containing profile labels and other fields containing positive non-zero integers up to 65535. Some row entries can be empty and entries containing non-integer values are considered as empty. Label and profile fields can be determined using options
-l
and-p
, respectively. A distance matrix is directly computed from input profiles. - Pairwise distance data can be read via a text file containing the corresponding lower-triangular distance matrix (without the zero diagonal). Consecutive matrix entries should be separated by one blank space (no tab).
- Specific rows can be selected using option
-r
. - A minimum spanning tree and a single-linkage hierarchical classification tree can be written in GraphML and Newick output files using options
-m
and-h
, respectively. These two options require important running times when considering large datasets (e.g. > 5,000 rows). - By definition, setting small cutoff values yield clustering with a large number of classes with fast running times.
- Profile length is required to carry out data perturbation analyses (option
-L
). - Data subsampling analyses progressively subsample raw data with rate values b/(B+1) where b = 1 ... B and B > 1 is the number of bins specified using option
-B
. At least B = 9 bins (i.e. rates = 10%, 20% ... 90%) are recommended. - For more details on the profile distance computation, clustering algorithm and descriptive statistics, see the technical notes pdf file.
Example
The directory example contains a tab-delimited file profiles.L1.P2-2039.tsv downloaded from the Bordetella sequence typing database hosted by the Institut Pasteur (cg)MLST repository. This file contains 414 allelic profiles on 2,038 loci (for more details, see Bouchez et al. 2018) with label names contained in the first field.
Distance matrix from profile data
The 2,038 loci are contained in fields 2-2039 and the first line is not a profile (locus names). Therefore, a distance matrix can be computed with MSTclust using the following command line:
MSTclust -i profiles.L1.P2-2039.tsv -o data -l 1 -p 2-2039 -r 2- -e 0.05 -h -m
This will output:
no. fields 2038
reading file ... [ok]
no. elements 414
8.19% missing entries discarding profile 5 (5)
remaining elements 413
computing distances ... [ok]
distance matrix data.d
hierarchical tree data.nwk
minimum spanning tree data.graphml
Of note, using option -e 0.05
causes the removal of the fifth profile (8.19% missing entries).
The pairwise distance matrix is written into data.d (available in example/), and options -m
and -h
yield the creation of the files data.graphml and data.nwk.
MST-based clustering from distance data
A first MST-based clustering with 0.007 as cutoff (option -c
) can be carried out from the written distance matrix file data.d using the following simple command line:
MSTclust -i data.d -o clust -c 0.007
This will output:
reading file ... [ok]
no. elements 413
clustering ... [ok]
no. classes 16
silhouette 0.552998
clustering info clust.txt
This clustering is a partition of the 413 profiles into 16 classes, assessed with an overall silhouette of ~0.553. Details of the clustering is written into clust.txt.
Assessing robustness using data perturbation and subsampling analyses
The robustness of the previous clustering can be assessed by performing data perturbation and subsampling analyses (options -L
and -B
) based on 1,000 replicates (option -R
) using the following command line:
MSTclust -i data.d -o clust -c 0.007 -L 2038 -B 9 -R 1000
This will output:
reading file ... [ok]
no. elements 413
clustering ... [ok]
no. classes 16
silhouette 0.552998
clustering info clust.txt
noising ... [ok]
noise silhouette 0.544606 [0.252816 , 0.877700]
noise aWallace1 0.702105 [0.306373 , 1.000000]
noise aWallace2 0.984209 [0.754643 , 1.000000]
subsampling 1/9 ... 10.0% (41) S 0.5653 [0.0000 , 0.8534] aW1 1.0000 [1.0000 , 1.0000] aW2 0.9987 [0.9863 , 1.0000]
subsampling 2/9 ... 20.0% (83) S 0.5400 [0.4614 , 0.7296] aW1 1.0000 [1.0000 , 1.0000] aW2 0.9985 [0.9933 , 1.0000]
subsampling 3/9 ... 30.0% (124) S 0.5263 [0.4750 , 0.6392] aW1 1.0000 [1.0000 , 1.0000] aW2 0.9987 [0.9953 , 1.0000]
subsampling 4/9 ... 40.0% (165) S 0.5260 [0.4812 , 0.6056] aW1 1.0000 [1.0000 , 1.0000] aW2 0.9989 [0.9961 , 1.0000]
subsampling 5/9 ... 50.0% (206) S 0.5307 [0.4914 , 0.5882] aW1 1.0000 [1.0000 , 1.0000] aW2 0.9991 [0.9971 , 1.0000]
subsampling 6/9 ... 60.0% (248) S 0.5335 [0.4959 , 0.5843] aW1 1.0000 [1.0000 , 1.0000] aW2 0.9992 [0.9977 , 1.0000]
subsampling 7/9 ... 70.0% (289) S 0.5398 [0.5029 , 0.5821] aW1 1.0000 [1.0000 , 1.0000] aW2 0.9995 [0.9983 , 1.0000]
subsampling 8/9 ... 80.0% (330) S 0.5439 [0.5093 , 0.5788] aW1 1.0000 [1.0000 , 1.0000] aW2 0.9997 [0.9986 , 1.0000]
subsampling 9/9 ... 90.0% (372) S 0.5493 [0.5169 , 0.5738] aW1 1.0000 [1.0000 , 1.0000] aW2 0.9998 [0.9993 , 1.0000]
nAUC silhouette 0.768597 [0.729724 , 0.813833]
nAUC aWallace1 1.000000 [1.000000 , 1.000000]
nAUC aWallace2 0.999118 [0.996141 , 1.000000]
The clustering based on the cutoff 0.007 seems robust to data subsampling, as the three estimated nAUC are quite high (e.g. > 0.75). However, the same clustering seems less robust to data perturbation, as the 2.5% CI are quite small for the silhouette and the first Wallace coefficient, e.g. < 0.4.
Searching optimal clustering
An optimal MST-based clustering can be defined by the cutoff value that maximizes the different estimated statistics.
By considering every branch length from the minimum spanning tree in data.graphml (see above) as a putative cutoff, MSTclust can be used to display in standard output these statistics in a convenient tab-delimited format (option -t
):
grep "data" data.graphml | awk -F '[<>]' '{print$3}' | sort -g | uniq |
while read c ; do MSTclust -i data.d -o out -c $c -L 2038 -B 9 -t ; done 2>/dev/null
After observing the different outputted statistics (not shown), it seems that an optimal clustering can be obtained using 0.016691213 as cutoff. This can be summarized using the following command line:
MSTclust -i data.d -o clust -c 0.016691213 -L 2038 -B 9 -t 2>&1 | tail -2
This will output:
n c k silhouette noise silhouette [low avg up] noise aWallace1 [low avg up] noise aWallace2 [low avg up] nAUC silhouette [low avg up] nAUC aWallace1 [low avg up] nAUC aWallace2 [low avg up]
413 0.016691213 3 0.829778 0.763112 0.838848 0.918818 0.444911 0.977796 1.000000 0.998107 0.999338 1.000000 0.825384 0.899548 0.938995 1.000000 1.000000 1.000000 0.828055 0.978866 1.000000
Details of the corresponding clustering (3 classes) is written into file clust.txt (available in example/). A silhouette plot can be easily drawn using the output of the following command line on clust.txt:
grep -F " s=" clust.txt | sed 's/ s=//' | sed 1d
References
Bouchez V, Guglielmini J, Dazas M, Landier A, Toubiana J, Guillot S, Criscuolo A, Brisse S (2018) Genomic Sequencing of Bordetella Pertussis for Epidemiology and Global Surveillance of Whooping Cough. Emerging Infectious Diseases, 24(6):988-994. doi:10.3201/eid2406.171464