Commit 148f4e15 authored by Alexis CRISCUOLO

0.1

parent 6a440cc6
@@ -77,14 +77,14 @@ Launch _MSTclust_ without option to read the following documentation:
 ## Notes
-* Profile data can be read via a tab-delimited file with some field(s) containing profile labels and other fields containing positive non-zero integers (some raw entries can be empty). Label and profile fields can be determined using options `-l` and `-p`, respectively. A distance matrix is directly computed from input profiles.
-* Pairwise distance data can be read via a text file containing the corresponding lower-triangular distance matrix (without the zero diagonal).
+* Profile data can be read via a tab-delimited file with some field(s) containing profile labels and other fields containing positive non-zero integers up to 65535. Some row entries can be empty and entries containing non-integer values are considered as empty. Label and profile fields can be determined using options `-l` and `-p`, respectively. A distance matrix is directly computed from input profiles.
+* Pairwise distance data can be read via a text file containing the corresponding lower-triangular distance matrix (without the zero diagonal). Consecutive matrix entries should be separated by one blank space (no tab).
 * Specific rows can be selected using option `-r`.
-* A [minimum spanning tree](https://en.wikipedia.org/wiki/Minimum_spanning_tree) and a [single-linkage hierarchical classification tree](https://en.wikipedia.org/wiki/Single-linkage_clustering) can be written in [Newick](http://evolution.genetics.washington.edu/phylip/newicktree.html) and [GraphML](http://graphml.graphdrawing.org/index.html) output files using options `-m` and `-h`, respectively. These two options can require important running times when considering large datasets (e.g. > 5,000 rows).
+* A [minimum spanning tree](https://en.wikipedia.org/wiki/Minimum_spanning_tree) and a [single-linkage hierarchical classification tree](https://en.wikipedia.org/wiki/Single-linkage_clustering) can be written in [GraphML](http://graphml.graphdrawing.org/index.html) and [Newick](http://evolution.genetics.washington.edu/phylip/newicktree.html) output files using options `-m` and `-h`, respectively. These two options require important running times when considering large datasets (e.g. > 5,000 rows).
 * By definition, setting small cutoff values yield clustering with a large number of classes with fast running times.
 * Profile length is required to carry out data perturbation analyses (option `-L`).
-* Data subsampling analyses progressively subsample raw data with rate values _b_/(_B_+1) where _b_ = 1 ... _B_ and _B_ > 1 is the number of bins specified using option `-B`. At least _B_ = 10 bins are recommended.
-* For more details on the profile distance computation, clustering algorithm, and descriptive statistics, see the technical notes pdf file.
+* Data subsampling analyses progressively subsample raw data with rate values _b_/(_B_+1) where _b_ = 1 ... _B_ and _B_ > 1 is the number of bins specified using option `-B`. At least _B_ = 9 bins (i.e. rates = 10%, 20% ... 90%) are recommended.
+* For more details on the profile distance computation, clustering algorithm and descriptive statistics, see the technical notes pdf file.
@@ -114,7 +114,7 @@ hierarchical tree data.nwk
 minimum spanning tree data.graphml
 ```
 Of note, using option `-e 0.05` causes the removal of the fifth profile (8.19% missing entries).
-The pairwise distance matrix is written into _data.d_, and options `-m` and `-h` yield the creation of the files _data.graphml_ and _data.nwk_.
+The pairwise distance matrix is written into _data.d_ (available in _src/_), and options `-m` and `-h` yield the creation of the files _data.graphml_ and _data.nwk_.
 **MST-based clustering from distance data**
@@ -136,7 +136,7 @@ Details of the clustering is written into _clust.txt_.
 **Assessing robustness using data perturbation and subsampling analyses**
-The robustness of the previous clustering can be assessed by performing data perturbation and subsampling analyses (options `-L` and `-B`) based on 1,000 replicate (option `-R`) using the following command line:
+The robustness of the previous clustering can be assessed by performing data perturbation and subsampling analyses (options `-L` and `-B`) based on 1,000 replicates (option `-R`) using the following command line:
 ```bash
 MSTclust -i data.d -o clust -c 0.007 -L 2038 -B 9 -R 1000
 ```
@@ -187,7 +187,11 @@ This will output:
 n c k silhouette noise silhouette [low avg up] noise aWallace1 [low avg up] noise aWallace2 [low avg up] nAUC silhouette [low avg up] nAUC aWallace1 [low avg up] nAUC aWallace2 [low avg up]
 413 0.016691213 3 0.829778 0.763112 0.838848 0.918818 0.444911 0.977796 1.000000 0.998107 0.999338 1.000000 0.825384 0.899548 0.938995 1.000000 1.000000 1.000000 0.828055 0.978866 1.000000
 ```
-Details of the corresponding clustering (3 classes) is written into file _clust.txt_.
+Details of the corresponding clustering (3 classes) is written into file _clust.txt_ (available in _src/_).
+A silhouette plot can be easily drawn using the output of the following command line on _clust.txt_:
+```bash
+grep -F " s=" clust.txt | sed 's/ s=//' | sed 1d
+```
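As a complement to the command above, the following sketch turns a clustering file into a three-column table (cluster, profile label, silhouette) that a plotting tool could read directly. It is an illustration only: the function name `silhouette_table` is hypothetical, and the parsing assumes the layout visible in _clust.txt_, namely `## cluster_N ...` header lines followed by `label s=value` lines, with labels containing no whitespace.

```bash
# Illustrative helper (not part of MSTclust): print one tab-separated
# line "cluster <TAB> label <TAB> silhouette" per clustered profile.
silhouette_table() {
  awk '
    /^## / { cluster = $2; next }   # cluster header: remember its name
    /^#/   { next }                 # global summary lines (n, c, k, s)
    $2 ~ /^s=/ {                    # profile line: "label s=value"
      sub(/^s=/, "", $2)
      print cluster "\t" $1 "\t" $2
    }
  ' "$1"
}

# Usage, assuming clust.txt is in the current directory:
#   silhouette_table clust.txt > silhouette.tsv
```

Keeping the cluster name in its own column lets the plot group or colour the bars per cluster without re-parsing the header lines.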
@@ -195,6 +199,6 @@ Details of the corresponding clustering (3 classes) is written into file _clust.
 ## References
-Bouchez V, Guglielmini J, Dazas M, Landier A, Toubiana J, Guillot S, Criscuolo A, Brisse S (2018) Genomic Sequencing of Bordetella Pertussis for Epidemiology and Global Surveillance of Whooping Cough. Emerging Infectious Diseases, 24(6):988-994. [doi:10.3201/eid2406.171464](https://doi.org/10.3201/eid2406.171464)
+Bouchez V, Guglielmini J, Dazas M, Landier A, Toubiana J, Guillot S, Criscuolo A, Brisse S (2018) Genomic Sequencing of _Bordetella Pertussis_ for Epidemiology and Global Surveillance of Whooping Cough. Emerging Infectious Diseases, 24(6):988-994. [doi:10.3201/eid2406.171464](https://doi.org/10.3201/eid2406.171464)
File moved
# n=413
# c=0.01669121
# k=3
# s=0.82977824
## cluster_1 n=402 s=0.83278240
143 s=0.89825432
85 s=0.89815314
37 s=0.89815145
74 s=0.89811303
343 s=0.89792441
86 s=0.89785800
396 s=0.89541756
298 s=0.89456357
124 s=0.89423503
378 s=0.89408981
98 s=0.89394730
170 s=0.89390835
41 s=0.89390809
334 s=0.88770005
322 s=0.88763977
336 s=0.88760053
169 s=0.88760047
401 s=0.88760031
157 s=0.88759993
151 s=0.88601817
235 s=0.88600377
340 s=0.88598133
226 s=0.88586892
144 s=0.88553706
405 s=0.88528186
181 s=0.88523413
191 s=0.88509641
253 s=0.88508710
296 s=0.88505816
16 s=0.88505004
408 s=0.88498494
282 s=0.88494903
7 s=0.88491086
276 s=0.88491082
66 s=0.88487400
208 s=0.88481098
237 s=0.88480514
39 s=0.88430297
153 s=0.88411530
72 s=0.88264892
126 s=0.88204060
140 s=0.88143228
252 s=0.88136205
67 s=0.88128918
158 s=0.88121773
99 s=0.88118281
73 s=0.88118281
163 s=0.88118193
185 s=0.88114746
200 s=0.88111037
15 s=0.88111037
12 s=0.88111037
393 s=0.88103797
115 s=0.87568758
364 s=0.87532500
407 s=0.87514509
338 s=0.87511094
177 s=0.87507125
192 s=0.87499870
283 s=0.87499782
173 s=0.87485384
387 s=0.87481860
374 s=0.87478150
194 s=0.87478082
178 s=0.87470751
161 s=0.87359261
76 s=0.87357888
132 s=0.87351307
345 s=0.87316932
399 s=0.87313918
81 s=0.87310261
375 s=0.87307435
92 s=0.87303104
274 s=0.87255919
6 s=0.87249452
230 s=0.87245748
55 s=0.87244258
409 s=0.87242326
406 s=0.87242326
370 s=0.87241431
128 s=0.87241266
206 s=0.87241252
290 s=0.87238702
127 s=0.87237623
391 s=0.87234363
344 s=0.87231243
278 s=0.87231170
272 s=0.87228193
201 s=0.87227681
245 s=0.87226769
295 s=0.87223913
82 s=0.87223112
121 s=0.87220443
137 s=0.87220420
232 s=0.87220357
270 s=0.87216656
52 s=0.87216656
31 s=0.87215535
179 s=0.87209883
10 s=0.87209628
231 s=0.87209521
175 s=0.87203458
47 s=0.87178727
17 s=0.87163787
382 s=0.87161931
347 s=0.87110186
389 s=0.87105454
83 s=0.87090626
164 s=0.86998089
101 s=0.86978062
119 s=0.86978012
404 s=0.86969943
113 s=0.86969838
400 s=0.86917035
380 s=0.86910031
168 s=0.86902814
166 s=0.86892374
381 s=0.86892243
352 s=0.86888720
139 s=0.86885197
155 s=0.86885189
109 s=0.86878288
122 s=0.86878133
183 s=0.86871211
159 s=0.86871209
104 s=0.86871080
216 s=0.86857762
196 s=0.86823079
297 s=0.86703724
330 s=0.86660966
78 s=0.86657057
331 s=0.86548614
266 s=0.86373299
149 s=0.86322457
142 s=0.86272856
49 s=0.86269319
42 s=0.86262110
213 s=0.86258008
280 s=0.86254361
267 s=0.86241688
293 s=0.86233045
165 s=0.86222207
285 s=0.86195927
152 s=0.86155977
138 s=0.86118748
333 s=0.86111735
383 s=0.86111599
277 s=0.86110932
46 s=0.86079404
79 s=0.86057596
148 s=0.86021880
33 s=0.86019292
367 s=0.86017428
281 s=0.86013522
392 s=0.86011796
359 s=0.86011254
243 s=0.86004086
335 s=0.85996651
18 s=0.85996132
105 s=0.85996084
365 s=0.85993197
112 s=0.85989825
372 s=0.85988819
339 s=0.85988819
264 s=0.85987062
90 s=0.85982600
304 s=0.85976483
302 s=0.85971821
180 s=0.85971797
221 s=0.85968328
217 s=0.85957884
204 s=0.85954408
291 s=0.85950954
96 s=0.85940198
107 s=0.85901557
110 s=0.85876651
286 s=0.85858493
244 s=0.85829192
303 s=0.85822478
64 s=0.85815688
68 s=0.85814321
254 s=0.85806914
141 s=0.85801638
354 s=0.85779538
97 s=0.85777240
114 s=0.85763390
398 s=0.85756234
388 s=0.85750140
84 s=0.85697218
203 s=0.85695166
133 s=0.85679821
123 s=0.85672995
265 s=0.85163380
225 s=0.85161286
316 s=0.85142122
25 s=0.85142112
301 s=0.85139621
320 s=0.85138558
269 s=0.85138553
321 s=0.85137460
262 s=0.85074922
260 s=0.85058498
14 s=0.85051893
215 s=0.85051385
70 s=0.85044842
58 s=0.85044842
40 s=0.85044412
385 s=0.85023382
182 s=0.85012973
292 s=0.85009299
284 s=0.84988056
329 s=0.84984652
54 s=0.84984591
8 s=0.84938597
299 s=0.84929314
195 s=0.84927013
59 s=0.84893027
273 s=0.84882479
38 s=0.84879064
75 s=0.84867535
60 s=0.84864384
246 s=0.84853303
50 s=0.84850404
162 s=0.84836231
100 s=0.84832772
44 s=0.84829549
186 s=0.84816327
210 s=0.84803842
188 s=0.84800315
56 s=0.84797376
35 s=0.84794210
207 s=0.84792141
348 s=0.84790449
289 s=0.84788846
51 s=0.84743654
11 s=0.84729519
20 s=0.84675544
350 s=0.84646689
271 s=0.84576159
69 s=0.84552037
279 s=0.84541764
361 s=0.84528200
116 s=0.84524732
147 s=0.84524613
351 s=0.84517876
187 s=0.84516535
222 s=0.84510934
288 s=0.84510903
236 s=0.84489492
174 s=0.84486905
341 s=0.84003932
323 s=0.83997345
256 s=0.83990112
318 s=0.83989680
319 s=0.83986228
317 s=0.83986228
314 s=0.83977109
184 s=0.83976077
403 s=0.83965524
229 s=0.83956283
71 s=0.83955506
250 s=0.83953735
197 s=0.83875174
306 s=0.83872536
108 s=0.83865536
263 s=0.83841422
189 s=0.83824859
145 s=0.83792224
275 s=0.83786784
120 s=0.83756081
355 s=0.83753576
287 s=0.83753495
211 s=0.83750092
346 s=0.83744383
369 s=0.83742965
356 s=0.83729554
373 s=0.83671387
62 s=0.83593853
368 s=0.83589778
88 s=0.83586696
223 s=0.83508608
77 s=0.83481174
353 s=0.83475770
357 s=0.83465939
402 s=0.83458450
134 s=0.83411138
363 s=0.83383968
160 s=0.83380661
305 s=0.83353679
255 s=0.83350269
146 s=0.83343447
219 s=0.83335839
309 s=0.82905428
313 s=0.82894967
315 s=0.82868625
308 s=0.82868625
251 s=0.82852992
248 s=0.82846167
247 s=0.82846130
23 s=0.82844604
3 s=0.82837868
257 s=0.82837739
342 s=0.82833741
312 s=0.82813618
130 s=0.82800002
234 s=0.82799453
167 s=0.82796398
36 s=0.82772334
233 s=0.82765762
48 s=0.82762442
95 s=0.82708935
227 s=0.82667815
156 s=0.82643620
61 s=0.82594375
94 s=0.82594231
397 s=0.82581265
242 s=0.82547137
371 s=0.82508870
80 s=0.82477636
193 s=0.82473934
117 s=0.82429320
135 s=0.82415146
240 s=0.82407542
258 s=0.82407338
238 s=0.82400528
205 s=0.82394608
228 s=0.82362416
45 s=0.82336730
202 s=0.82224984
63 s=0.82126097
310 s=0.81835772
307 s=0.81803689
311 s=0.81792943
214 s=0.81767199
199 s=0.81741562
241 s=0.81567181
65 s=0.81544331
172 s=0.81543692
218 s=0.81543690
377 s=0.81490001
220 s=0.81486646
395 s=0.81479891
249 s=0.81458737
13 s=0.81405246
106 s=0.81374803
22 s=0.81373053
360 s=0.81322170
136 s=0.81288298
261 s=0.81287908
102 s=0.81284974
111 s=0.81275787
198 s=0.81228949
358 s=0.81159486
259 s=0.81098264
386 s=0.81075611
24 s=0.80936262
171 s=0.80639256
154 s=0.80452136
93 s=0.80445272
376 s=0.80415445
87 s=0.80411933
190 s=0.80376578
150 s=0.80154019
212 s=0.80110144
294 s=0.79617692
125 s=0.79520744
57 s=0.79316846
89 s=0.79304017
103 s=0.79234143
384 s=0.79126037
129 s=0.79106778
43 s=0.78935000
414 s=0.78534625
91 s=0.78394775
53 s=0.78361285
366 s=0.78159332
394 s=0.78097067
379 s=0.77968901
390 s=0.76884389
26 s=0.74811280
4 s=0.71702322
131 s=0.69890047
328 s=0.68702138
30 s=0.68701967
209 s=0.68670817
2 s=0.66485392
239 s=0.64841072
411 s=0.62230583
337 s=0.61921847
176 s=0.61921768
412 s=0.61529783
325 s=0.51759181
362 s=0.50438460
27 s=0.47275294
413 s=0.38401053
349 s=0.37688457
327 s=0.33924291
332 s=0.31875402
34 s=0.31348939
32 s=0.30576043
224 s=0.30412398
118 s=0.27288434
## cluster_2 n=5 s=0.82793797
326 s=0.85681704
1 s=0.83456644
19 s=0.82185474
410 s=0.81560084
9 s=0.81085080
## cluster_3 n=6 s=0.63003300
29 s=0.72546691
324 s=0.72546498
300 s=0.68082799
268 s=0.61902163
28 s=0.51521003
21 s=0.51420649
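In the listing above, the `s=` value on each `## cluster_N` header appears to be the mean of the silhouette values of that cluster's members (for example, the five values under cluster_2 average to its header value). The sketch below recomputes these means so they can be compared with the headers. The function name `cluster_means` is hypothetical, and the file layout is assumed from this listing rather than from a format specification.

```bash
# Illustrative helper (not part of MSTclust): for each cluster, print
# "name <TAB> member count <TAB> mean silhouette" from a clustering file.
cluster_means() {
  awk '
    function report() {             # emit the cluster read so far, if any
      if (n > 0) printf "%s\t%d\t%.8f\n", name, n, sum / n
    }
    /^## / { report(); name = $2; n = 0; sum = 0; next }
    /^#/   { next }                 # global summary lines (n, c, k, s)
    $2 ~ /^s=/ { sub(/^s=/, "", $2); sum += $2; n++ }
    END    { report() }
  ' "$1"
}

# Usage, assuming clust.txt is in the current directory:
#   cluster_means clust.txt
```

The member counts in the second column can also be checked against the `n=` values on the cluster headers.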