# contig_info

_contig_info_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) that quickly computes several standard descriptive statistics from FASTA-formatted contig files produced by _de novo_ genome assembly methods.
The estimated statistics are the number of sequences, residue counts, AT- and GC-content, sequence lengths, N50 (Lander et al. 2001), NG50 (Earl et al. 2011), and the related N75, NG75, N90, NG90, L50, LG50, L75, LG75, L90 and LG90.
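
To make these contiguity statistics concrete, here is a minimal sketch (not the code used by `contig_info.sh`). With contigs sorted from longest to shortest, N50 is the length of the contig at which the cumulative length first reaches 50% of the total assembly length, and L50 is the number of contigs needed to get there. The helper name `nx_lx` is hypothetical, and the sketch assumes the "at least 50%" convention; other tools may treat ties differently.

```bash
# Illustrative helper (hypothetical, not part of contig_info.sh):
# reads contig lengths from standard input, one per line, and prints "N<p> L<p>".
nx_lx() {
  local pct="$1"                           # percentage threshold, e.g. 50, 75 or 90
  sort -rn | awk -v pct="$pct" '
    { len[NR] = $1; total += $1 }          # store the sorted lengths and sum them
    END {
      target = total * pct / 100           # number of residues that must be covered
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (cum >= target) { print len[i], i; exit }
      }
    }'
}

# Toy assembly of five contigs (total length 300)
printf '%s\n' 100 80 60 40 20 | nx_lx 50   # prints: 80 2
```

Here the two longest contigs (100 + 80 = 180) are the first to cover at least half of the 300 residues, so N50 is 80 and L50 is 2.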

## Installation and execution

Give execute permission to the file `contig_info.sh` by typing:
```bash
chmod +x contig_info.sh
```
and launch it with the following command line model:
```bash
./contig_info.sh [options]
```

## Usage

Launch _contig_info_ without any option to display the following documentation:

```
 USAGE:  contig_info.sh  [options]  <contig_files>

  where 'options' are:

   -m <int>    minimum contig length;  every contig sequence of length
               shorter than this cutoff will be discarded (default: 1)
   -g <int>    expected  genome  size  for  computing {N,L}G{50,75,90}
               values instead of {N,L}{50,75,90} ones, respectively
   -t          tab-delimited output
```
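
The effect of the `-m` cutoff can be pictured with the following sketch, which lists the length of every FASTA sequence that is at least as long as a given minimum. It is only an illustration of the documented behavior, not the implementation inside `contig_info.sh`; the function name `fasta_lengths` is hypothetical, and the input is assumed to be plain FASTA with Unix line endings.

```bash
# Illustrative helper (hypothetical): prints the length of each FASTA sequence
# read from standard input, skipping those shorter than the cutoff (default: 1).
fasta_lengths() {
  local min="${1:-1}"
  awk -v min="$min" '
    /^>/ {                                  # header line: flush the previous sequence
      if (seen && n >= min) print n
      n = 0; seen = 1; next
    }
    { n += length($0) }                     # sequence line: accumulate residues
    END { if (seen && n >= min) print n }   # flush the last sequence
  '
}

# Three toy contigs of lengths 6, 2 and 8; the cutoff 3 discards the second one
printf '>a\nACGT\nAC\n>b\nAC\n>c\nACGTACGT\n' | fasta_lengths 3
# prints:
# 6
# 8
```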

## Examples

The following [Bash](https://www.gnu.org/software/bash/) command lines download the genome sequences of the five _Mucor circinelloides_ strains 1006PhL, CBS 277.49, WJ11, B8987 and JCM 22480 from the [NCBI genome repository](https://www.ncbi.nlm.nih.gov/genome):
```bash
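# Base download command (streams each file to stdout) and common file-name suffix
NCBIFTP="wget -q -O- https://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/"; Z=".1.fsa_nt.gz";
# For each strain/accession pair, fetch the gzipped FASTA file and decompress it
echo -e "1006PhL\tAOCY01\nCBS277.49\tAMYB01\nWJ11\tLGTF01\nB8987\tJNDM01\nJCM22480\tBCHG01" |
  while read -r s a; do echo -n "$s ... "; $NCBIFTP${a:0:2}/${a:2:2}/$a/$a$Z | zcat > Mucor.$s.fasta; echo "[ok]"; done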
```

The following command line launches the script `contig_info.sh` to analyze the first downloaded file _Mucor.1006PhL.fasta_:
```bash
./contig_info.sh  Mucor.1006PhL.fasta
```
leading to the following standard output:
```
File                           Mucor.1006PhL.fasta

Number of sequences            1459

Residue counts:
  Number of A's                10320010  30.23 %
  Number of C's                6747611  19.76 %
  Number of G's                6731530  19.72 %
  Number of T's                10335465  30.27 %
  Number of N's                0  0 %
  Total                        34134616

  %AT                          60.52 %
  %GC                          39.48 %

Sequence lengths:
  Minimum                      410
  Quartile 25%                 1660
  Median                       6176
  Quartile 75%                 37608
  Maximum                      213712
  Average                      23395.89

Contiguity statistics:
  N50                          58982
  N75                          36291
  N90                          18584
  L50                          194
  L75                          376
  L90                          562
```

The same results can be output in tab-delimited format with the following command line:
```bash
./contig_info.sh  -t  Mucor.1006PhL.fasta
```

```
#File               Nseq   Nres     A        C       G       T        N    %A     %C     %G     %T     %N   %AT    %GC     Min   Q25   Med   Q75   Max    Avg       N50   N75   N90   L50 L75 L90
Mucor.1006PhL.fasta 1459   34134616 10320010 6747611 6731530 10335465 0    30.23% 19.76% 19.72% 30.27% 0%   60.52% 39.48%  410   1660  6176  37608 213712 23395.89  58982 36291 18584 194 376 562
```

Note that the five downloaded FASTA files can be analyzed with a single command line:
```bash
./contig_info.sh  -t  Mucor.*.fasta
```

```
#File                 Nseq   Nres      A        C       G       T        N       %A     %C     %G     %T     %N    %AT    %GC     Min   Q25   Med    Q75     Max     Avg         N50      N75     N90      L50 L75 L90
Mucor.1006PhL.fasta   1459   34134616  10320010 6747611 6731530 10335465 0       30.23% 19.76% 19.72% 30.27% 0%    60.52% 39.48%  410   1660  6176   37608   213712  23395.89    58982    36291   18584    194 376 562
Mucor.B8987.fasta     2210   36700617  11096810 7247117 7233795 11122895 0       30.23% 19.74% 19.71% 30.30% 0%    60.55% 39.45%  206   839   2482   20727   258792  16606.61    58460    30025   13274    193 416 674
Mucor.CBS277.49.fasta 21     36567582  10571030 7715901 7705901 10574750 0       28.90% 21.10% 21.07% 28.91% 0%    57.83% 42.17%  4155  41542 934259 3187354 6050249 1741313.42  4318338  3096690 1074709  4   7   9
Mucor.JCM22480.fasta  401    36616466  10586281 6882218 6899109 10581984 1659222 28.91% 18.79% 18.84% 28.89% 4.53% 60.57% 39.43%  1038  4814  50332  135940  659822  91312.88    197059   109360  63107    61  121 183
Mucor.WJ11.fasta      2519   33065171  9974064  6559358 6556539 9975210  0       30.16% 19.83% 19.82% 30.16% 0%    60.34% 39.66%  430   3275  7692   18010   118704  13126.30    24148    12884   5672     429 898 1455
```

The tab-delimited output format is useful for focusing on specific fields, e.g. the six contiguity statistics:
```bash
./contig_info.sh  -t  Mucor.*.fasta  |  cut -f1,22-
```

```
#File                 N50      N75     N90      L50  L75  L90
Mucor.1006PhL.fasta   58982    36291   18584    194  376  562
Mucor.B8987.fasta     58460    30025   13274    193  416  674
Mucor.CBS277.49.fasta 4318338  3096690 1074709  4    7    9
Mucor.JCM22480.fasta  197059   109360  63107    61   121  183
Mucor.WJ11.fasta      24148    12884   5672     429  898  1455
```
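
Because the fields are tab-separated, the reduced table can be passed on to other standard tools. As a sketch, the hypothetical helper `rank_by_n50` below ranks assemblies by decreasing N50. It assumes its input has the layout produced by `cut -f1,22-` above, i.e. the file name in column 1 and N50 in column 2, with a header line starting with `#`. For brevity, the sample rows (taken from the table above) contain only those two columns.

```bash
# Illustrative helper (hypothetical): drops the '#' header line and sorts the
# remaining rows on the second tab-separated column (N50), largest first.
rank_by_n50() {
  local tab
  tab="$(printf '\t')"
  grep -v '^#' | sort -t "$tab" -k2,2nr
}

# Sample rows: file name and N50 only
printf '%s\t%s\n' '#File' N50 \
    Mucor.1006PhL.fasta 58982 \
    Mucor.CBS277.49.fasta 4318338 \
    Mucor.WJ11.fasta 24148 | rank_by_n50 | cut -f1
# prints:
# Mucor.CBS277.49.fasta
# Mucor.1006PhL.fasta
# Mucor.WJ11.fasta
```

With the real program, the same helper would be appended to the pipeline, e.g. `./contig_info.sh -t Mucor.*.fasta | cut -f1,22- | rank_by_n50`.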


Finally, the option `-g` can be used to set an expected genome size in order to obtain {N,L}G{50,75,90} statistics instead of {N,L}{50,75,90} ones:
```bash
./contig_info.sh  -t  -g 36000000  Mucor.*.fasta | cut -f1,22-
```

```
#File                 N50      N75     N90      L50  L75  L90  ExpSize
Mucor.1006PhL.fasta   57499    32472   7652     210  417  692  36000000
Mucor.B8987.fasta     59771    30857   15730    187  399  631  36000000
Mucor.CBS277.49.fasta 4318338  3096690 1074709  4    7    9    36000000
Mucor.JCM22480.fasta  197663   113006  69531    59   117  175  36000000
Mucor.WJ11.fasta      21799    9865    2445     493  1092 2146 36000000
```
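
The difference between N50 and NG50 is only the reference length: NG50 measures coverage against the expected genome size rather than the total assembly length. This explains why, in the table above, values drop for assemblies shorter than 36,000,000 residues (e.g. 1006PhL, WJ11) and rise slightly for longer ones (e.g. B8987). The following minimal sketch illustrates the idea; the helper name `ngx_lgx` is hypothetical, and printing `NA` when the assembly is too short to reach the target is an assumption of this sketch, not documented behavior of `contig_info.sh`.

```bash
# Illustrative helper (hypothetical): reads contig lengths from standard input,
# one per line, and prints "NG<p> LG<p>" for the given expected genome size.
ngx_lgx() {
  local pct="$1" gsize="$2"
  sort -rn | awk -v pct="$pct" -v gsize="$gsize" '
    { len[NR] = $1 }                       # store lengths, longest first
    END {
      target = gsize * pct / 100           # residues to cover, based on genome size
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (cum >= target) { print len[i], i; exit }
      }
      print "NA", "NA"                     # assembly too short to reach the target
    }'
}

# Toy assembly of total length 300, expected genome size 400
printf '%s\n' 100 80 60 40 20 | ngx_lgx 50 400   # prints: 60 3
```

For these five contigs, 50% of the 300 assembled residues is reached at the second contig (length 80), whereas 50% of the 400 expected residues is reached only at the third one (length 60).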


## References

Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B (2011) Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research, 21(12):2224-2241. [doi:10.1101/gr.126599.111](https://genome.cshlp.org/content/21/12/2224).

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, 
Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J; International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409(6822):860-921. [doi:10.1038/35057062](https://www.nature.com/articles/35057062).