The clustering based on the cutoff 0.007 seems robust to data subsampling, as the three estimated nAUC are quite high (e.g. > 0.75).
However, the same clustering seems less robust to data perturbation, as the 2.5% CI are quite small for the silhouette and the first Wallace coefficient, e.g. < 0.4.
**Searching for optimal clustering**
**Searching for optimal clustering(s)**
An optimal MST-based clustering can be defined by the cutoff value that maximizes the different estimated statistics.
By considering every branch length from the minimum spanning tree in _data.graphml_ (see above) as a putative cutoff, _MSTclust_ can write these statistics to standard output in a convenient tab-delimited format (option `-t`):
...
@@ -180,18 +180,37 @@ By considering every branch length from the minimum spanning tree in _data.graph
while read c; do MSTclust -i data.d -o out -c $c -L 2038 -B 9 -t; done 2>/dev/null
```
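Once the loop above has written one tab-delimited row per cutoff, the row that maximizes a given statistic can be picked out with a short filter. The sketch below is only illustrative: the column layout of the `-t` output is not shown here, so the column index passed to the hypothetical helper `best_row` is an assumption to adapt, and rows whose chosen field is not numeric (such as a header line) are skipped.

```shell
# best_row COL: read tab-delimited rows on stdin and print the row whose
# COL-th field is numerically largest (the first such row on ties).
# COL is an assumption: check the actual column order of the -t output.
best_row() {
  awk -F'\t' -v col="$1" '
    # skip rows whose chosen field is not a plain number (e.g. a header)
    $col !~ /^-?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/ { next }
    !seen || $col + 0 > max { max = $col + 0; best = $0; seen = 1 }
    END { if (seen) print best }
  '
}
```

For instance, appending `| best_row 3` to the loop would keep the row with the largest third field, if that field holds the statistic of interest. Since several statistics are to be maximized together, the filter would be run once per column and the resulting rows compared.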
After observing the different outputted statistics (not shown), it seems that an optimal clustering can be obtained using 0.016691213 as cutoff.
After observing the different outputted statistics (not shown), it seems that an optimal clustering can be obtained using a cutoff between 0.010060363 (approx. 20/2038) and 0.048804782 (approx. 100/2038).
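The fractions quoted for these cutoffs can be checked with a little arithmetic. The sketch below assumes that 2038 is the value passed with `-L` in the commands above and that a cutoff is read as a number of differences divided by that length; the helper names `to_count` and `to_cutoff` are hypothetical.

```shell
# to_count CUTOFF [LEN]: express a cutoff as a count over LEN
# (LEN defaults to 2038, assumed here to be the -L value).
to_count() {
  awk -v c="$1" -v len="${2:-2038}" 'BEGIN { printf "%.2f\n", c * len }'
}

# to_cutoff COUNT [LEN]: the inverse conversion, COUNT / LEN.
to_cutoff() {
  awk -v n="$1" -v len="${2:-2038}" 'BEGIN { printf "%.9f\n", n / len }'
}

to_count 0.010060363   # lower cutoff bound quoted in the text
to_count 0.048804782   # upper cutoff bound quoted in the text
to_cutoff 20
to_cutoff 100
```

Under these assumptions the two bounds come out at roughly 20.5 and 99.5 differences over 2038, which is consistent with the "approx." qualifier attached to 20/2038 and 100/2038.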
_MSTclust_ can be used again to display the statistics associated with the clustering built from each of these cutoffs:
This second set of tab-delimited statistics (not shown) demonstrates that using 0.03581943 as cutoff yields an optimal MST-based clustering (i.e. one that maximizes all statistics derived from the data perturbation and subsampling analyses).
This can be summarized using the following command line:
Of note, the previous analyses also show that the cutoff value 0.016192345 leads to a clear local optimum between 0.009813542 (20/2038) and 0.049067713 (100/2038).
Consequently, the corresponding MST-based clustering can also be considered if the number of classes obtained from the optimal one (i.e. _k_ = 2, see above) is found to be too small.
This second clustering can be summarized using the following command line: