Skip to content
Snippets Groups Projects
Alexis  CRISCUOLO's avatar
Alexis CRISCUOLO authored
bfb6f93d
History
Name Last commit Last update
example
src
COPYING
README.md
Technical.Notes.pdf

MSTclust

MSTclust is a command line program written in Java to compute Minimum Spanning Tree (MST)-based clustering from either profile (i.e. tuples of positive integers) or pairwise distance data. MSTclust can also be used to compute pairwise distances between profiles, single-linkage hierarchical classification trees, minimum spanning trees, and several statistics to assess the efficiency of a specific clustering.

Compilation and execution

The source code of MSTclust is inside the src directory and can be compiled and executed in two different ways.

Building an executable jar file

On computers with Oracle JDK (8 or higher) installed, a Java executable jar file could be created. In a command-line window, go to the src directory and type:

javac MSTclust.java 
echo Main-Class: MSTclust > MANIFEST.MF 
jar -cmvf MANIFEST.MF MSTclust.jar MSTclust.class 
rm MANIFEST.MF MSTclust.class 

This will create the executable jar file MSTclust.jar that can be launched with the following command line model:

java -jar MSTclust.jar [options]

Building a native executable

On computers with GraalVM installed, a native executable can also be built. In a command-line window, go to the src directory, and type:

javac MSTclust.java 
native-image MSTclust MSTclust
rm MSTclust.class

This will create the native executable MSTclust that can be launched with the following command line model:

./MSTclust [options]

Usage

Launch MSTclust without option to read the following documentation:

 MSTclust

 Minimum Spanning Tree-based clustering

 USAGE:  MSTclust  -i <infile>  -o <basename>  [options]

 OPTIONS:

 -i <infile>    input file containing  either tab-delimited  profiles or a
                lower-triangular distance matrix (mandatory)
 -o <basenmae>  basename for output files (mandatory)
 -r <string>    selecting only specified rows (default: "1-")
 -l <string>    (input tab-delimited profiles) field(s) containing profile
                labels (default: "1")
 -p <string>    (input tab-delimited profiles)  fields containing profiles
                (default: "2-")
 -e <float>     (input tab-delimited profiles)  maximum allowed proportion
                of empty entries per profile (default: 0.25)
 -c <float>     inclusive cutoff to define cluster(s) (default: not set)
 -S <integer>   seed value for data perturbation  and subsampling analyses
                (default: 0)
 -L <integer>   profile length  to carry  out data  perturbation  analysis
                (default: not set)
 -B <integer>   number of  bins to  carry out  data  subsampling  analyses
                (default: not set)
 -R <integer>   number of replicates for data perturbation and subsampling
                analyses (default: 100)
 -h             writing a single-linkage  hierarchical classification tree
                into an output file (default: not set)
 -m             writing  a  minimum  spanning  tree  into an  output  file
                (default: not set)

Notes

  • Profile data can be read via a tab-delimited file with some field(s) containing profile labels and other fields containing positive non-zero integers up to 65535. Some row entries can be empty and entries containing non-integer values are considered as empty. Label and profile fields can be determined using options -l and -p, respectively. A distance matrix is directly computed from input profiles.
  • Pairwise distance data can be read via a text file containing the corresponding lower-triangular distance matrix (without the zero diagonal). Consecutive matrix entries should be separated by one blank space (no tab).
  • Specific rows can be selected using option -r.
  • A minimum spanning tree and a single-linkage hierarchical classification tree can be written in GraphML and Newick output files using options -m and -h, respectively. These two options require important running times when considering large datasets (e.g. > 5,000 rows).
  • By definition, setting small cutoff values yield clustering with a large number of classes with fast running times.
  • Profile length is required to carry out data perturbation analyses (option -L).
  • Data subsampling analyses progressively subsample raw data with rate values b/(B+1) where b = 1 ... B and B > 1 is the number of bins specified using option -B. At least B = 9 bins (i.e. rates = 10%, 20% ... 90%) are recommended.
  • For more details on the profile distance computation, clustering algorithm and descriptive statistics, see the technical notes pdf file.

Example

The directory example contains a tab-delimited file profiles.L1.P2-2039.tsv downloaded from the Bordetella sequence typing database hosted by the Institut Pasteur (cg)MLST repository. This file contains 414 allelic profiles on 2,038 loci (for more details, see Bouchez et al. 2018) with label names contained in the first field.

Distance matrix from profile data

The 2,038 loci are contained in fields 2-2039 and the first line is not a profile (locus names). Therefore, a distance matrix can be computed with MSTclust using the following command line:

MSTclust  -i profiles.L1.P2-2039.tsv  -o data  -l 1  -p 2-2039  -r 2-  -e 0.05  -h  -m 

This will output:

no. fields              2038
reading file ...        [ok]
no. elements            414
8.19% missing entries   discarding profile 5 (5)
remaining elements      413
computing distances ... [ok]
distance matrix         data.d
hierarchical tree       data.nwk
minimum spanning tree   data.graphml

Of note, using option -e 0.05 causes the removal of the fifth profile (8.19% missing entries). The pairwise distance matrix is written into data.d (available in example/), and options -m and -h yield the creation of the files data.graphml and data.nwk.

MST-based clustering from distance data

A first MST-based clustering with 0.007 as cutoff (option -c) can be carried out from the written distance matrix file data.d using the following simple command line:

MSTclust  -i data.d  -o clust  -c 0.007 

This will output:

reading file ...        [ok]
no. elements            413
clustering ...          [ok]
no. classes             16
silhouette              0.552998
clustering info         clust.txt

This clustering is a partition of the 413 profiles into 16 classes, assessed with an overall silhouette of ~0.553. Details of the clustering is written into clust.txt.

Assessing robustness using data perturbation and subsampling analyses

The robustness of the previous clustering can be assessed by performing data perturbation and subsampling analyses (options -L and -B) based on 1,000 replicates (option -R) using the following command line:

MSTclust  -i data.d  -o clust  -c 0.007  -L 2038  -B 9  -R 1000

This will output:

reading file ...        [ok]
no. elements            413
clustering ...          [ok]
no. classes             16
silhouette              0.552998
clustering info         clust.txt
noising ...             [ok]
noise silhouette        0.544606  [0.252816 , 0.877700]
noise aWallace1         0.702105  [0.306373 , 1.000000]
noise aWallace2         0.984209  [0.754643 , 1.000000]
subsampling 1/9 ...     10.0% (41)    S 0.5653 [0.0000 , 0.8534]  aW1 1.0000 [1.0000 , 1.0000]  aW2 0.9987 [0.9863 , 1.0000]
subsampling 2/9 ...     20.0% (83)    S 0.5400 [0.4614 , 0.7296]  aW1 1.0000 [1.0000 , 1.0000]  aW2 0.9985 [0.9933 , 1.0000]
subsampling 3/9 ...     30.0% (124)   S 0.5263 [0.4750 , 0.6392]  aW1 1.0000 [1.0000 , 1.0000]  aW2 0.9987 [0.9953 , 1.0000]
subsampling 4/9 ...     40.0% (165)   S 0.5260 [0.4812 , 0.6056]  aW1 1.0000 [1.0000 , 1.0000]  aW2 0.9989 [0.9961 , 1.0000]
subsampling 5/9 ...     50.0% (206)   S 0.5307 [0.4914 , 0.5882]  aW1 1.0000 [1.0000 , 1.0000]  aW2 0.9991 [0.9971 , 1.0000]
subsampling 6/9 ...     60.0% (248)   S 0.5335 [0.4959 , 0.5843]  aW1 1.0000 [1.0000 , 1.0000]  aW2 0.9992 [0.9977 , 1.0000]
subsampling 7/9 ...     70.0% (289)   S 0.5398 [0.5029 , 0.5821]  aW1 1.0000 [1.0000 , 1.0000]  aW2 0.9995 [0.9983 , 1.0000]
subsampling 8/9 ...     80.0% (330)   S 0.5439 [0.5093 , 0.5788]  aW1 1.0000 [1.0000 , 1.0000]  aW2 0.9997 [0.9986 , 1.0000]
subsampling 9/9 ...     90.0% (372)   S 0.5493 [0.5169 , 0.5738]  aW1 1.0000 [1.0000 , 1.0000]  aW2 0.9998 [0.9993 , 1.0000]
nAUC silhouette         0.768597  [0.729724 , 0.813833]
nAUC aWallace1          1.000000  [1.000000 , 1.000000]
nAUC aWallace2          0.999118  [0.996141 , 1.000000]

The clustering based on the cutoff 0.007 seems robust to data subsampling, as the three estimated nAUC are quite high (e.g. > 0.75). However, the same clustering seems less robust to data perturbation, as the 2.5% CI are quite small for the silhouette and the first Wallace coefficient, e.g. < 0.4.

Searching optimal clustering

An optimal MST-based clustering can be defined by the cutoff value that maximizes the different estimated statistics. By considering every branch length from the minimum spanning tree in data.graphml (see above) as a putative cutoff, MSTclust can be used to display in standard output these statistics in a convenient tab-delimited format (option -t):

grep "data" data.graphml | awk -F '[<>]' '{print$3}' | sort -g | uniq | 
  while read c ; do  MSTclust  -i data.d  -o out  -c $c  -L 2038  -B 9  -t ; done  2>/dev/null

After observing the different outputted statistics (not shown), it seems that an optimal clustering can be obtained using 0.016691213 as cutoff. This can be summarized using the following command line:

MSTclust  -i data.d  -o clust  -c 0.016691213  -L 2038  -B 9  -t  2>&1  |  tail -2

This will output:

n   c           k  silhouette  noise silhouette [low avg up]  noise aWallace1 [low avg up]  noise aWallace2 [low avg up]  nAUC silhouette [low avg up]  nAUC aWallace1 [low avg up]  nAUC aWallace2 [low avg up]
413 0.016691213 3  0.829778    0.763112 0.838848 0.918818     0.444911 0.977796 1.000000    0.998107 0.999338 1.000000    0.825384 0.899548 0.938995    1.000000 1.000000 1.000000   0.828055 0.978866 1.000000

Details of the corresponding clustering (3 classes) is written into file clust.txt (available in example/). A silhouette plot can be easily drawn using the output of the following command line on clust.txt:

grep -F " s=" clust.txt | sed 's/ s=//' | sed 1d

References

Bouchez V, Guglielmini J, Dazas M, Landier A, Toubiana J, Guillot S, Criscuolo A, Brisse S (2018) Genomic Sequencing of Bordetella Pertussis for Epidemiology and Global Surveillance of Whooping Cough. Emerging Infectious Diseases, 24(6):988-994. doi:10.3201/eid2406.171464