diff --git a/README.md b/README.md index 143e1036b25c5b3978b806dfccbfc2ca46e81547..d20509b2820f0ffa23417d34151330a392bbe5c3 100644 --- a/README.md +++ b/README.md @@ -1,49 +1,55 @@ # contig_info -_contig_info_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) that allows several standard descriptive statistics to be quickly estimated from FASTA-formatted contig files inferred by _de novo_ genome assembly methods. -Estimated statistics are sequence number, residue counts, AT- and GC-content, sequence lengths, N50 (Lander et al. 2001), NG50 (Earl et al. 2011), and the related N75, NG75, N90, NG90, L50, LG50, L75, LG75, L90, LG90. +_contig_info_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) for quickly estimating several standard descriptive statistics from FASTA-formatted contig files inferred by _de novo_ genome assembly methods. +Estimated statistics are sequence number, residue counts, AT- and GC-content, sequence lengths, [auN](https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity) (also called E-size, Salzberg et al. 2012), [N50](https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics) (Lander et al. 2001), [NG50](https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics) (Earl et al. 2011), and the related N75, NG75, N90, NG90, L50, LG50, L75, LG75, L90, LG90. 
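As an illustration of the N50 and L50 statistics listed above, here is a minimal sketch that computes them from a list of contig lengths. It is not the code used by `contig_info.sh`; the function name `n50_l50` and the toy lengths are assumptions made for this example only. N50 is taken as the length of the contig at which the cumulative length, over contigs sorted from longest to shortest, first reaches half the total length; L50 is the rank of that contig.

```bash
#!/bin/bash
# Hypothetical helper (not part of contig_info.sh): prints N50 and L50,
# tab-separated, for contig lengths given one per line on stdin.
n50_l50() {
  sort -rn | awk '
    { len[++n] = $1; total += $1 }        # keep the sorted lengths and their sum
    END {
      half = total / 2
      for (i = 1; i <= n; i++) {
        cum += len[i]                      # cumulative length of the i longest contigs
        if (cum >= half) { print len[i] "\t" i; exit }
      }
    }'
}

# Toy example: five contigs, total length 300; 100 + 80 = 180 >= 150,
# so N50 is 80 and L50 is 2.
printf '%s\n' 100 80 60 40 20 | n50_l50
```

Replacing `half` with half of an expected genome size would give the NG50 and LG50 variants in the same way.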
## Installation and execution Give the execute permission to the file `contig_info.sh` by typing: + ```bash chmod +x contig_info.sh ``` -and launch it with the following command line model: +and run it with the following command line model: + ```bash ./contig_info.sh [options] ``` ## Usage -Launch _contig_info_ without option to read the following documentation: +Run _contig_info_ without option to read the following documentation: ``` - USAGE: contig_info.sh [options] <contig_files> + USAGE: contig_info.sh [options] <contig_files> where 'options' are: - -m <int> minimum contig length; every contig sequence of length - shorter than this cutoff will be discarded (default: 1) - -g <int> expected genome size for computing {N,L}G{50,75,90} - values instead of {N,L}{50,75,90} ones, respectively + -m <int> minimum contig length; every contig sequence of length shorter + than this cutoff will be discarded (default: 1) + -g <int> expected genome size for computing auNG and {N,L}G{50,75,90} + values instead of auN and {N,L}{50,75,90} ones, respectively -t tab-delimited output ``` ## Examples The following [Bash](https://www.gnu.org/software/bash/) command lines allows the genome sequences of the 5 _Mucor circinelloides_ strains 1006PhL, CBS 277.49, WJ11, B8987 and JCM 22480 to be downloaded from the [NCBI genome repository](https://www.ncbi.nlm.nih.gov/genome): + ```bash NCBIFTP="wget -q -O- https://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/"; Z=".1.fsa_nt.gz"; echo -e "1006PhL\tAOCY01\nCBS277.49\tAMYB01\nWJ11\tLGTF01\nB8987\tJNDM01\nJCM22480\tBCHG01" | while read -r s a; do echo -n "$s ... 
";$NCBIFTP${a:0:2}/${a:2:2}/$a/$a$Z|zcat>Mucor.$s.fasta;echo "[ok]";done ``` -The following command line allows the script `contig_info.sh` to be launched to analyze the first downloaded file _Mucor.1006PhL.fasta_: +The following command line runs `contig_info.sh` to analyze the first downloaded file _Mucor.1006PhL.fasta_: + ```bash ./contig_info.sh Mucor.1006PhL.fasta ``` + leading to the following standard output: + ``` File Mucor.1006PhL.fasta @@ -69,6 +75,7 @@ Sequence lengths: Average 23395.89 Contiguity statistics: + auN 65329 N50 58982 N75 36291 N90 18584 @@ -77,63 +84,74 @@ Contiguity statistics: L90 562 ``` -The same results could be outputted in tab-delimited format with the following command line: +The same results can be outputted in tab-delimited format using option `-t`: + ```bash ./contig_info.sh -t Mucor.1006PhL.fasta ``` ``` -#File Nseq Nres A C G T N %A %C %G %T %N %AT %GC Min Q25 Med Q75 Max Avg N50 N75 N90 L50 L75 L90 -Mucor.1006PhL.fasta 1459 34134616 10320010 6747611 6731530 10335465 0 30.23% 19.76% 19.72% 30.27% 0% 60.52% 39.48% 410 1660 6176 37608 213712 23395.89 58982 36291 18584 194 376 562 +#File Nseq Nres A C G T N %A %C %G %T %N %AT %GC Min Q25 Med Q75 Max Avg auN N50 N75 N90 L50 L75 L90 +Mucor.1006PhL.fasta 1459 34134616 10320010 6747611 6731530 10335465 0 30.23% 19.76% 19.72% 30.27% 0% 60.52% 39.48% 410 1660 6176 37608 213712 23395.89 65329 58982 36291 18584 194 376 562 ``` -Of note, the five downloaded FASTA files could be analyzed with a single command line: +Of note, the five downloaded FASTA files can be analyzed with a single command line: + ```bash ./contig_info.sh -t Mucor.*.fasta ``` ``` -#File Nseq Nres A C G T N %A %C %G %T %N %AT %GC Min Q25 Med Q75 Max Avg N50 N75 N90 L50 L75 L90 -Mucor.1006PhL.fasta 1459 34134616 10320010 6747611 6731530 10335465 0 30.23% 19.76% 19.72% 30.27% 0% 60.52% 39.48% 410 1660 6176 37608 213712 23395.89 58982 36291 18584 194 376 562 -Mucor.B8987.fasta 2210 36700617 11096810 7247117 7233795 
11122895 0 30.23% 19.74% 19.71% 30.30% 0% 60.55% 39.45% 206 839 2482 20727 258792 16606.61 58460 30025 13274 193 416 674 -Mucor.CBS277.49.fasta 21 36567582 10571030 7715901 7705901 10574750 0 28.90% 21.10% 21.07% 28.91% 0% 57.83% 42.17% 4155 41542 934259 3187354 6050249 1741313.42 4318338 3096690 1074709 4 7 9 -Mucor.JCM22480.fasta 401 36616466 10586281 6882218 6899109 10581984 1659222 28.91% 18.79% 18.84% 28.89% 4.53% 60.57% 39.43% 1038 4814 50332 135940 659822 91312.88 197059 109360 63107 61 121 183 -Mucor.WJ11.fasta 2519 33065171 9974064 6559358 6556539 9975210 0 30.16% 19.83% 19.82% 30.16% 0% 60.34% 39.66% 430 3275 7692 18010 118704 13126.30 24148 12884 5672 429 898 1455 +#File Nseq Nres A C G T N %A %C %G %T %N %AT %GC Min Q25 Med Q75 Max Avg auN N50 N75 N90 L50 L75 L90 +Mucor.1006PhL.fasta 1459 34134616 10320010 6747611 6731530 10335465 0 30.23% 19.76% 19.72% 30.27% 0% 60.52% 39.48% 410 1660 6176 37608 213712 23395.89 65329 58982 36291 18584 194 376 562 +Mucor.B8987.fasta 2210 36700617 11096810 7247117 7233795 11122895 0 30.23% 19.74% 19.71% 30.30% 0% 60.55% 39.45% 206 839 2482 20727 258792 16606.61 69144 58460 30025 13274 193 416 674 +Mucor.CBS277.49.fasta 21 36567582 10571030 7715901 7705901 10574750 0 28.90% 21.10% 21.07% 28.91% 0% 57.83% 42.17% 4155 41542 934259 3187354 6050249 1741313.42 3912950 4318338 3096690 1074709 4 7 9 +Mucor.JCM22480.fasta 401 36616466 10586281 6882218 6899109 10581984 1659222 28.91% 18.79% 18.84% 28.89% 4.53% 60.57% 39.43% 1038 4814 50332 135940 659822 91312.88 229712 197059 109360 63107 61 121 183 +Mucor.WJ11.fasta 2519 33065171 9974064 6559358 6556539 9975210 0 30.16% 19.83% 19.82% 30.16% 0% 60.34% 39.66% 430 3275 7692 18010 118704 13126.30 28368 24148 12884 5672 429 898 1455 ``` -The tab-delimited output format could be useful for focusing on specific fields like, e.g. the six contiguity statistics: +The tab-delimited output format can be useful for focusing on specific fields like, e.g. 
the seven contiguity statistics: + ```bash ./contig_info.sh -t Mucor.*.fasta | cut -f1,22- ``` ``` -#File N50 N75 N90 L50 L75 L90 -Mucor.1006PhL.fasta 58982 36291 18584 194 376 562 -Mucor.B8987.fasta 58460 30025 13274 193 416 674 -Mucor.CBS277.49.fasta 4318338 3096690 1074709 4 7 9 -Mucor.JCM22480.fasta 197059 109360 63107 61 121 183 -Mucor.WJ11.fasta 24148 12884 5672 429 898 1455 +#File auN N50 N75 N90 L50 L75 L90 +Mucor.1006PhL.fasta 65329 58982 36291 18584 194 376 562 +Mucor.B8987.fasta 69144 58460 30025 13274 193 416 674 +Mucor.CBS277.49.fasta 3912950 4318338 3096690 1074709 4 7 9 +Mucor.JCM22480.fasta 229712 197059 109360 63107 61 121 183 +Mucor.WJ11.fasta 28368 24148 12884 5672 429 898 1455 ``` -Finally, the option -g could be used to set an expected genome size for obtaining {N,L}G{50,75,90} statistics instead of {N,L}{50,75,90} ones: +Finally, the option `-g` can be used to set an expected genome size for obtaining auNG and {N,L}G{50,75,90} statistics instead of auN and {N,L}{50,75,90} ones: + ```bash ./contig_info.sh -t -g 36000000 Mucor.*.fasta | cut -f1,22- ``` ``` -#File N50 N75 N90 L50 L75 L90 ExpSize -Mucor.1006PhL.fasta 57499 32472 7652 210 417 692 36000000 -Mucor.B8987.fasta 59771 30857 15730 187 399 631 36000000 -Mucor.CBS277.49.fasta 4318338 3096690 1074709 4 7 9 36000000 -Mucor.JCM22480.fasta 197663 113006 69531 59 117 175 36000000 -Mucor.WJ11.fasta 21799 9865 2445 493 1092 2146 36000000 +#File auN N50 N75 N90 L50 L75 L90 ExpSize +Mucor.1006PhL.fasta 61944 57499 32472 7652 210 417 692 36000000 +Mucor.B8987.fasta 70490 59771 30857 15730 187 399 631 36000000 +Mucor.CBS277.49.fasta 3974642 4318338 3096690 1074709 4 7 9 36000000 +Mucor.JCM22480.fasta 233645 197663 113006 69531 59 117 175 36000000 +Mucor.WJ11.fasta 26055 21799 9865 2445 493 1092 2146 36000000 ``` ## References -Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, 
Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B (2011) Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research, 21(12):2224-2241. [doi:10.1101/gr.126599.111](https://genome.cshlp.org/content/21/12/2224). +<sub> +Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B (2011) _Assemblathon 1: a competitive assessment of de novo short read assembly methods_. **Genome Research**, 21(12):2224-2241. [doi:10.1101/gr.126599.111](https://genome.cshlp.org/content/21/12/2224). 
+</sub> -Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, 
Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J; International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409(6822):860-921. [doi:10.1038/35057062](https://www.nature.com/articles/35057062). +<sub> +Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, 
Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J; International Human Genome Sequencing Consortium (2001) _Initial sequencing and analysis of the human genome_. **Nature**, 409(6822):860-921. 
[doi:10.1038/35057062](https://www.nature.com/articles/35057062). +</sub> +<sub> +Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marçais G, Pop M, Yorke JA (2012) _GAGE: A critical evaluation of genome assemblies and assembly algorithms_. **Genome Research**, 22(3):557-567. [doi:10.1101/gr.131383.111](https://genome.cshlp.org/content/22/3/557.long). +</sub> diff --git a/contig_info.sh b/contig_info.sh index a9a081a68a8da7e26fe0f8c9451838d9ca74423b..c22e1c1a2adf1b09586cd6157e1ad4241ce2b215 100755 --- a/contig_info.sh +++ b/contig_info.sh @@ -1,114 +1,115 @@ #!/bin/bash -######################################################################################## -# # -# contig_info: a BASH script to estimate standard statistics from FASTA contig files # -# # -# Copyright (C) 2015,2018,2019 Alexis Criscuolo # -# # -# This program is free software: you can redistribute it and/or modify it under the # -# terms of the GNU General Public License as published by the Free Software # -# Foundation, either version 3 of the License, or (at your option) any later version # -# # -# This program is distributed in the hope that it will be useful, but WITHOUT ANY # -# WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A # -# PARTICULAR PURPOSE. See the GNU General Public License for more details. # -# # -# You should have received a copy of the GNU General Public License along with this # -# program. If not, see <http://www.gnu.org/licenses/>. 
# -# # -# Contact: # -# Institut Pasteur # -# Bioinformatics and Biostatistics Hub # -# C3BI, USR 3756 IP CNRS # -# Paris, FRANCE # -# # -# alexis.criscuolo@pasteur.fr # -# # -######################################################################################## - -######################################################################################## -# # -# ============ # -# = VERSIONS = # -# ============ # -# # - VERSION=1.0.190426ac # -# + options -l and -d (i.e. printing sequence lengths and length distribution, resp.) # -# are no longer supported # -# + residue count always computed (option -r discarded) # -# + ultrafast residue count (based on tr + wc) # -# + estimating %AT, %GC, L50, L75, L90 # -# + faster estimation of the sequence length statistics (100% awk) # -# + ability to read multiple input files # -# # -# VERSION=0.3.180515ac # -# # -######################################################################################## +############################################################################################################## +# # +# contig_info: a BASH script to estimate standard statistics from FASTA contig files # +# # +# Copyright (C) 2018-2021 Institut Pasteur # +# # +# This program is free software: you can redistribute it and/or modify it under the terms of the GNU # +# General Public License as published by the Free Software Foundation, either version 3 of the License, or # +# (at your option) any later version. # +# # +# This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even # +# the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public # +# License for more details. # +# # +# You should have received a copy of the GNU General Public License along with this program. If not, see # +# <http://www.gnu.org/licenses/>. 
# +# # +# Contact: # +# Alexis Criscuolo alexis.criscuolo@pasteur.fr # +# Genome Informatics & Phylogenetics (GIPhy) giphy.pasteur.fr # +# Bioinformatics and Biostatistics Hub research.pasteur.fr/en/team/hub-giphy # +# USR 3756 IP CNRS research.pasteur.fr/team/bioinformatics-and-biostatistics-hub # +# Dpt. Biologie Computationnelle research.pasteur.fr/department/computational-biology # +# Institut Pasteur, Paris, FRANCE research.pasteur.fr # +# # +############################################################################################################## + +############################################################################################################## +# # +# ============ # +# = VERSIONS = # +# ============ # +# # + VERSION=1.1.201007ac # +# + estimating auN (also called E-size) # +# # +# VERSION=1.0.190426ac # +# + options -l and -d (i.e. printing sequence lengths and length distribution) are no longer supported # +# + residue count always computed (option -r discarded) # +# + ultrafast residue count (based on tr + wc) # +# + estimating %AT, %GC, L50, L75, L90 # +# + faster estimation of the sequence length statistics (100% awk) # +# + ability to read multiple input files # +# # +# VERSION=0.3.180515ac # +# # +############################################################################################################## -######################################################################################## -# # -# ================ # -# = INSTALLATION = # -# ================ # -# # -# Just give the execute permission to the script contig_info.sh with the following # -# command line: # -# # -# chmod +x contig_info.sh # -# # -######################################################################################## - -######################################################################################## -# # -# ================ # -# = MANUAL = # -# ================ # -# # -# # 
+############################################################################################################## +# # +# ================ # +# = INSTALLATION = # +# ================ # +# # +# Just give the execute permission to the script contig_info.sh with the following command line: # +# # +# chmod +x contig_info.sh # +# # +############################################################################################################## + +############################################################################################################# +# # +# ================ # +# = MANUAL = # +# ================ # +# # +# # if [ "$1" = "-?" ] || [ $# -lt 1 ] then cat <<EOF - contig_info v.$VERSION + contig_info v.$VERSION Copyright (C) 2018-2021 Institut Pasteur USAGE: contig_info.sh [options] <contig_files> where 'options' are: - -m <int> minimum contig length; every contig sequence of length - shorter than this cutoff will be discarded (default: 1) - -g <int> expected genome size for computing {N,L}G{50,75,90} - values instead of {N,L}{50,75,90} ones, respectively + -m <int> minimum contig length; every contig sequence of length shorter + than this cutoff will be discarded (default: 1) + -g <int> expected genome size for computing auNG and {N,L}G{50,75,90} + values instead of auN and {N,L}{50,75,90} ones, respectively -t tab-delimited output EOF - exit -fi -# # -######################################################################################## - -######################################################################################## -# # -# ================ # -# = FUNCTIONS = # -# ================ # -# # -# = randomfile() ==================================================================== # -# returns a random file name within /tmp/ # -# # + exit # +fi # +# # +############################################################################################################# + 
+############################################################################################################# +# # +# ================ # +# = FUNCTIONS = # +# ================ # +# # +# = randomfile() ========================================================================================= # +# returns a random file name within /tmp/ # +# # randomfile() { rdmf=/tmp/$RANDOM; while [ -e $rdmf ]; do rdmf=/tmp/$RANDOM ; done echo $rdmf ; } -# # -######################################################################################## - -######################################################################################## -#### #### -#### INITIALIZING PARAMETERS AND READING OPTIONS #### -#### #### -######################################################################################## +# # +############################################################################################################# + +############################################################################################################# +#### #### +#### INITIALIZING PARAMETERS AND READING OPTIONS #### +#### #### +############################################################################################################# MIN_CONTIG_LGT=1; GENOME_SIZE=0; TSVOUT=false; @@ -125,14 +126,14 @@ done if [ $MIN_CONTIG_LGT -lt 1 ]; then echo " the min contig length threshold must be a positive integer (option -m)" ; exit 1 ; fi if [ $GENOME_SIZE -lt 0 ]; then echo " the expected genome size must be a positive integer (option -g)" ; exit 1 ; fi -######################################################################################## -#### #### -#### CONTIG INFO #### -#### #### -######################################################################################## +############################################################################################################# +#### #### +#### CONTIG INFO #### +#### #### 
+############################################################################################################# if $TSVOUT then - CSVCAPT="#File\tNseq\tNres\tA\tC\tG\tT\tN\t%A\t%C\t%G\t%T\t%N\t%AT\t%GC\tMin\tQ25\tMed\tQ75\tMax\tAvg\tN50\tN75\tN90\tL50\tL75\tL90"; + CSVCAPT="#File\tNseq\tNres\tA\tC\tG\tT\tN\t%A\t%C\t%G\t%T\t%N\t%AT\t%GC\tMin\tQ25\tMed\tQ75\tMax\tAvg\tauN\tN50\tN75\tN90\tL50\tL75\tL90"; [ $GENOME_SIZE -ne 0 ]&&CSVCAPT="$CSVCAPT\tExpSize"; echo -e "$CSVCAPT" ; fi @@ -150,12 +151,11 @@ do N=$(tr -cd N < $SEQS | wc -c); fN=$(bc -l <<<"scale=2;100*$N/$R" | sed 's/^\./0./'); fGC=$(bc -l <<<"scale=2;100*($C+$G)/($A+$C+$G+$T)" | sed 's/^\./0./'); fAT=$(bc -l <<<"scale=2;100-$fGC" | sed 's/^\./0./'); ER=$R; [ $GENOME_SIZE != 0 ] && ER=$GENOME_SIZE; - STATS=$(awk '{print length}' $SEQS | sort -rn | awk -v g=$ER '{l[++n]=$0}END{g50=g/2;g75=3*g/4;g90=9*g/10;i=s=n50=n75=n90=0;while(++i<=n&&n90==0){s+=l[i];n50==0&&s>=g50&&n50=l[i]+(l50=i);n75==0&&s>=g75&&n75=l[i]+(l75=i);n90==0&&s>=g90&&n90=l[i]+(l90=i)}print (n50-l50)"\t"(n75-l75)"\t"(n90-l90)"\t"l50"\t"l75"\t"l90"\t"l[1]"\t"l[int(n/4+1)]"\t"l[int(n/2+1)]"\t"l[int(3*n/4+1)]"\t"l[n]}'); - N50=$(cut -f1 <<<"$STATS"); N75=$(cut -f2 <<<"$STATS"); N90=$(cut -f3 <<<"$STATS"); - L50=$(cut -f4 <<<"$STATS"); L75=$(cut -f5 <<<"$STATS"); L90=$(cut -f6 <<<"$STATS"); - MAX=$(cut -f7 <<<"$STATS"); - Q75=$(cut -f8 <<<"$STATS"); Q50=$(cut -f9 <<<"$STATS"); Q25=$(cut -f10 <<<"$STATS"); - MIN=$(cut -f11 <<<"$STATS"); + STATS=$(awk '{print length}' $SEQS | sort -rn | awk -v g=$ER '{l[++n]=$0;aun+=$0*$0}END{g50=g/2;g75=3*g/4;g90=9*g/10;i=s=n50=n75=n90=0;while(++i<=n&&n90==0){s+=l[i];n50==0&&s>=g50&&n50=l[i]+(l50=i);n75==0&&s>=g75&&n75=l[i]+(l75=i);n90==0&&s>=g90&&n90=l[i]+(l90=i)}print (n50-l50)"\t"(n75-l75)"\t"(n90-l90)"\t"l50"\t"l75"\t"l90"\t"l[1]"\t"l[int(n/4+1)]"\t"l[int(n/2+1)]"\t"l[int(3*n/4+1)]"\t"l[n]"\t"int(0.5+aun/g)}'); + N50=$(cut -f1 <<<"$STATS"); N75=$(cut -f2 <<<"$STATS"); N90=$(cut -f3 <<<"$STATS"); + L50=$(cut -f4 
<<<"$STATS"); L75=$(cut -f5 <<<"$STATS"); L90=$(cut -f6 <<<"$STATS"); + Q75=$(cut -f8 <<<"$STATS"); Q50=$(cut -f9 <<<"$STATS"); Q25=$(cut -f10 <<<"$STATS"); + MIN=$(cut -f11 <<<"$STATS"); MAX=$(cut -f7 <<<"$STATS"); AUN=$(cut -f12 <<<"$STATS"); if ! $TSVOUT then @@ -184,6 +184,7 @@ do echo " Average $AVG" ; echo ; echo "Contiguity statistics:" ; + echo " auN $AUN" ; echo " N50 $N50" ; echo " N75 $N75" ; echo " N90 $N90" ; @@ -193,7 +194,7 @@ do if [ $GENOME_SIZE -ne 0 ]; then echo " Expected genome size $GENOME_SIZE"; fi echo ; else - CSVLINE="$(basename $INFILE)\t$S\t$R\t$A\t$C\t$G\t$T\t$N\t$fA%\t$fC%\t$fG%\t$fT%\t$fN%\t$fAT%\t$fGC%\t$MIN\t$Q25\t$Q50\t$Q75\t$MAX\t$AVG\t$N50\t$N75\t$N90\t$L50\t$L75\t$L90"; + CSVLINE="$(basename $INFILE)\t$S\t$R\t$A\t$C\t$G\t$T\t$N\t$fA%\t$fC%\t$fG%\t$fT%\t$fN%\t$fAT%\t$fGC%\t$MIN\t$Q25\t$Q50\t$Q75\t$MAX\t$AVG\t$AUN\t$N50\t$N75\t$N90\t$L50\t$L75\t$L90"; [ $GENOME_SIZE -ne 0 ]&&CSVLINE="$CSVLINE\t$ER"; echo -e "$CSVLINE" ; fi
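The awk command in this patch accumulates the squared contig lengths (`aun+=$0*$0`) and prints `int(0.5+aun/g)`, where `g` is either the total number of residues or the expected genome size given with `-g`. The sketch below isolates that computation so it can be checked on a small input. The function name `aun` and the toy lengths are assumptions made for this example and do not appear in `contig_info.sh`.

```bash
#!/bin/bash
# Hypothetical helper (not part of contig_info.sh): prints auN for contig
# lengths given one per line on stdin, or auNG when an expected genome
# size is passed as the first argument.
aun() {
  local g="${1:-0}"                        # 0 means "use the total length"
  awk -v g="$g" '
    { sq += $1 * $1; total += $1 }         # sum of squared lengths and total length
    END {
      if (g == 0) g = total                # auN: divide by the assembly size
      if (g > 0) print int(0.5 + sq / g)   # same rounding as int(0.5+aun/g) in the patch
    }'
}

# Toy example: the squared lengths sum to 22000.
printf '%s\n' 100 80 60 40 20 | aun        # 22000 / 300 = 73.33, prints 73
printf '%s\n' 100 80 60 40 20 | aun 400    # 22000 / 400 = 55, prints 55
```

Unlike N50, the value changes whenever any contig length changes, since every contig contributes to the sum of squares.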