@@ -77,14 +77,14 @@ Launch _MSTclust_ without option to read the following documentation:
...
## Notes
* Profile data can be read from a tab-delimited file in which some field(s) contain profile labels and the other fields contain positive (non-zero) integers up to 65535. Some row entries can be empty, and entries containing non-integer values are treated as empty. Label and profile fields can be specified using options `-l` and `-p`, respectively. A distance matrix is computed directly from the input profiles.
* Pairwise distance data can be read from a text file containing the corresponding lower-triangular distance matrix (without the zero diagonal). Consecutive matrix entries should be separated by a single blank space (not a tab).
* Specific rows can be selected using option `-r`.
* A [minimum spanning tree](https://en.wikipedia.org/wiki/Minimum_spanning_tree) and a [single-linkage hierarchical classification tree](https://en.wikipedia.org/wiki/Single-linkage_clustering) can be written to [GraphML](http://graphml.graphdrawing.org/index.html) and [Newick](http://evolution.genetics.washington.edu/phylip/newicktree.html) output files using options `-m` and `-h`, respectively. These two options require long running times on large datasets (e.g. > 5,000 rows).
* By definition, setting small cutoff values yields clusterings with a large number of classes, together with fast running times.
* Profile length is required to carry out data perturbation analyses (option `-L`).
* Data subsampling analyses progressively subsample raw data with rate values _b_/(_B_+1), where _b_ = 1 ... _B_ and _B_ > 1 is the number of bins specified using option `-B`. At least _B_ = 9 bins (i.e. rates = 10%, 20% ... 90%) are recommended.
* For more details on the profile distance computation, clustering algorithm and descriptive statistics, see the technical notes PDF file.
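
As a worked illustration of the subsampling rates described in the notes above, the following shell sketch lists the values _b_/(_B_+1) for a given number of bins. The function name `subsampling_rates` is hypothetical and is not part of _MSTclust_; it only reproduces the arithmetic and assumes a POSIX shell with `awk` available.

```shell
#!/bin/sh
# Hypothetical helper (not part of MSTclust): print the subsampling
# rates b/(B+1) for b = 1 ... B, one per line, with 4 decimals.
subsampling_rates() {
  bins="$1"
  # LC_ALL=C keeps "." as the decimal separator regardless of locale.
  LC_ALL=C awk -v B="$bins" 'BEGIN {
    for (b = 1; b <= B; b++) printf "%.4f\n", b / (B + 1)
  }'
}

# Rates obtained with 9 bins
subsampling_rates 9
```

With _B_ = 9 this prints 0.1000 to 0.9000 in steps of 0.1, i.e. the 10%, 20% ... 90% rates; with _B_ = 10 the rates become multiples of 1/11 (about 9.09%).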
...
@@ -114,7 +114,7 @@ hierarchical tree data.nwk
...
minimum spanning tree data.graphml
```
Of note, using option `-e 0.05` causes the removal of the fifth profile (8.19% missing entries).
The pairwise distance matrix is written into _data.d_ (available in _src/_), and options `-m` and `-h` create the files _data.graphml_ and _data.nwk_, respectively.
**MST-based clustering from distance data**
...
@@ -136,7 +136,7 @@ Details of the clustering is written into _clust.txt_.
...
**Assessing robustness using data perturbation and subsampling analyses**
The robustness of the previous clustering can be assessed by performing data perturbation and subsampling analyses (options `-L` and `-B`) based on 1,000 replicates (option `-R`) using the following command line:
Details of the corresponding clustering (3 classes) are written into the file _clust.txt_ (available in _src/_).
A silhouette plot can be easily drawn using the output of the following command line on _clust.txt_:
```bash
grep -F " s=" clust.txt | sed 's/ s=//' | sed 1d
```
...
@@ -195,6 +199,6 @@ Details of the corresponding clustering (3 classes) is written into file _clust.
...
## References
Bouchez V, Guglielmini J, Dazas M, Landier A, Toubiana J, Guillot S, Criscuolo A, Brisse S (2018) Genomic Sequencing of _Bordetella pertussis_ for Epidemiology and Global Surveillance of Whooping Cough. Emerging Infectious Diseases, 24(6):988-994. [doi:10.3201/eid2406.171464](https://doi.org/10.3201/eid2406.171464)