_contig_info_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) for quickly estimating several standard descriptive statistics from FASTA-formatted contig files inferred by _de novo_ genome assembly methods.
Estimated statistics are:
 ▹  sequence number,
 ▹  nucleotide residue counts,
 ▹  AT- and GC-content,
 ▹  sequence lengths,
 ▹ [auN](https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity) (also called E-size, Salzberg et al. 2012) or auNG,
 ▹ [N50](https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics) (Lander et al. 2001) and the related N75 and N90 (e.g. Reinhardt et al. 2009, Craig Venter et al. 2001),
 ▹ [L50](https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics) and the related L75 and L90,
 ▹ [NG50](https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics) (Earl et al. 2011) and the related NG75, NG90, LG50, LG75, LG90.
_contig_info_ can also compute nucleotide content statistics for each contig sequence.
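
As a rough illustration of how the contiguity statistics listed above are defined, the following sketch derives N50, L50 and auN from a list of contig lengths. The function names `fasta_lengths` and `assembly_stats` are hypothetical and are not part of _contig_info_, whose own implementation may differ; the optional genome-size argument is likewise an assumption used here to obtain the NG50-style variants.

```bash
# Hypothetical helpers for illustration; they are not part of contig_info.

# Print the length of every sequence in a FASTA file (or stdin), one per line.
fasta_lengths() {
  awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
       { n += length($0) }
       END { if (seen) print n }' "$@"
}

# Read one contig length per line on stdin and print N50, L50 and auN
# (tab-separated). With an expected genome size as first argument, that size
# replaces the assembly size, which gives NG50, LG50 and auNG instead.
assembly_stats() {
  sort -rn | awk -v g="${1:-0}" '
    { len[NR] = $1; total += $1; sumsq += $1 * $1 }
    END {
      if (NR == 0) exit 1
      ref = (g > 0) ? g : total
      n50 = 0; l50 = 0; cum = 0
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        # First contig at which the cumulative length reaches half of ref
        if (2 * cum >= ref) { n50 = len[i]; l50 = i; break }
      }
      # auN is the sum of squared lengths divided by the reference size
      printf "%d\t%d\t%.1f\n", n50, l50, sumsq / ref
    }'
}
```

Assuming a FASTA file named `contigs.fasta` (a placeholder), the two functions could be chained as `fasta_lengths contigs.fasta | assembly_stats`. The 75% and 90% variants would follow the same pattern with a different threshold on the cumulative length.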
## Installation and execution
...
...
## Examples
The following [Bash](https://www.gnu.org/software/bash/) command lines download the genome sequences of the five _Mucor circinelloides_ strains 1006PhL, CBS 277.49, WJ11, B8987 and JCM 22480 from the [NCBI genome repository](https://www.ncbi.nlm.nih.gov/genome):
Note that the last column `Pval` assesses how well the GC-content of each contig agrees with that of the overall file content.
Briefly, (up to) 5,000 non-overlapping nucleotide segments of 200 bases each are first sampled from all the contig sequences, and the %GC of each segment is estimated. This yields (up to) 5,000 %GC values (i.e. the set GC<sub>all</sub>) representative of the GC-content variation within the whole genome assembly.
Next, for each contig, (up to) 500 non-overlapping nucleotide segments of 200 bases each are sampled, yielding (up to) 500 %GC values (i.e. the set GC<sub>seq</sub>) representative of the GC-content variation within that contig.
For each contig sequence, the agreement between GC<sub>seq</sub> and GC<sub>all</sub> is assessed using a Mann-Whitney (1947) _U_ test.
When _Pval_ is close to 0, the GC-content of the corresponding contig is significantly different from the overall %GC.
These _U_ test _p_-values can be used to identify artefactual or particular (e.g. plasmid, mitochondrion) contigs, as such sequences often exhibit distinctive nucleotide compositions.
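
The procedure described above can be sketched roughly as follows. This is only an illustration under several assumptions: the function names `gc_windows` and `mann_whitney_z` are hypothetical, every consecutive window is used instead of a random sample of segments, and the test is reduced to the _U_ statistic and its normal-approximation _z_-score without tie correction of the variance. How _contig_info_ itself samples segments and derives the _p_-value is not shown here.

```bash
# Hypothetical helpers for illustration; they are not part of contig_info.

# Print the %GC of each consecutive, non-overlapping window of a single
# sequence read from stdin (FASTA header lines are skipped).
# $1: window size in bases (default 200). A trailing partial window is dropped.
gc_windows() {
  awk -v w="${1:-200}" '
    /^>/ { next }
    { seq = seq toupper($0) }
    END {
      for (i = 1; i + w - 1 <= length(seq); i += w) {
        s = substr(seq, i, w)
        gc = gsub(/[GC]/, "", s)   # gsub returns the number of G/C residues
        printf "%.2f\n", 100 * gc / w
      }
    }'
}

# Compare two files holding one value per line ($1: sample A, $2: sample B).
# Prints the Mann-Whitney U statistic of sample A and its z-score under the
# normal approximation, tab-separated.
mann_whitney_z() {
  { sed 's/^/A /' "$1"; sed 's/^/B /' "$2"; } | sort -k2,2n | awk '
    { grp[NR] = $1; val[NR] = $2 + 0 }
    END {
      n = NR; i = 1
      while (i <= n) {
        j = i
        while (j < n && val[j + 1] == val[i]) j++
        r = (i + j) / 2            # average rank shared by tied values i..j
        for (k = i; k <= j; k++) if (grp[k] == "A") { ra += r; na++ }
        i = j + 1
      }
      nb = n - na
      if (na == 0 || nb == 0) exit 1
      u  = ra - na * (na + 1) / 2
      mu = na * nb / 2
      sd = sqrt(na * nb * (n + 1) / 12)
      printf "%.1f\t%.3f\n", u, (u - mu) / sd
    }'
}
```

A _z_-score far from 0 corresponds to a small two-sided _p_-value, i.e. a contig whose GC-content distribution differs from the reference set.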
## References
Craig Venter J, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Russo Wortman J, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea M, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji M-R, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik HK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Yuan Wang Z, Wang A, Wang X, Wang J, Wei M-H, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu SC, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Lai Cheng M, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi V, Reardon M, Rodriguez R, Rogers Y-H, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigó R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal 
A, Mi H, Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph R, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang Y-H, Coyne M, Dahlke C, Deslattes Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck D, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X (2001) _The Sequence of the Human Genome_. **Science**, 291(5507):1304-1351. [doi:10.1126/science.1058040](https://science.sciencemag.org/content/291/5507/1304).
Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B (2011) _Assemblathon 1: a competitive assessment of de novo short read assembly methods_. **Genome Research**, 21(12):2224-2241. [doi:10.1101/gr.126599.111](https://genome.cshlp.org/content/21/12/2224).
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, 
Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J; International Human Genome Sequencing Consortium (2001) _Initial sequencing and analysis of the human genome_. **Nature**, 409(6822):860-921. [doi:10.1038/35057062](https://www.nature.com/articles/35057062).
Mann HB, Whitney DR (1947) _On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other_. **Annals of Mathematical Statistics**, 18(1):50-60. [doi:10.1214/aoms/1177730491](https://doi.org/10.1214/aoms/1177730491).
Reinhardt JA, Baltrus DA, Nishimura MT, Jeck WR, Jones CD, Dangl JL (2009) _De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae_. **Genome Research**, 19:294-305. [doi:10.1101/gr.083311.108](https://dx.doi.org/10.1101%2Fgr.083311.108).
Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marçais G, Pop M, Yorke JA (2012) _GAGE: A critical evaluation of genome assemblies and assembly algorithms_. **Genome Research**, 22(3):557-567. [doi:10.1101/gr.131383.111](https://genome.cshlp.org/content/22/3/557.long).