Commit 326049af authored by Alexis CRISCUOLO

1.0.190426ac

parent a4cdd168
# contig_info
_contig_info_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) that allows estimating several standard descriptive statistics from a FASTA-formatted contig file inferred by a _de novo_ genome assembly method. Estimated statistics are sequence number, residue counts, sequence length distribution, N50 (Lander et al. 2001), NG50 (Earl et al. 2011), and its related N75, NG75, N90, and NG90.
_contig_info_ is a command line program written in [Bash](https://www.gnu.org/software/bash/) that allows several standard descriptive statistics to be quickly estimated from FASTA-formatted contig files inferred by _de novo_ genome assembly methods.
Estimated statistics are sequence number, residue counts, AT- and GC-content, sequence lengths, N50 (Lander et al. 2001), NG50 (Earl et al. 2011), and the related N75, NG75, N90, NG90, L50, LG50, L75, LG75, L90, LG90.
## Installation and execution
......@@ -18,20 +19,118 @@ and launch it with the following command line model:
Launch _contig_info_ without any option to read the following documentation:
```
USAGE: contig_info.sh [options] <contig_file>
USAGE: contig_info.sh [options] <contig_files>
where 'options' are:
-m <int> minimum contig length; every contig sequence of length
shorter than this cutoff will be discarded (default: 0)
-g <int> expected genome size for computing NG50, NG75 and NG90
values instead of N50, N75 and N90 ones, respectively
-d print contig sequence length distribution
-l print length of each contig sequence
-r print residue counts
shorter than this cutoff will be discarded (default: 1)
-g <int> expected genome size for computing {N,L}G{50,75,90}
values instead of {N,L}{50,75,90} ones, respectively
-t tab-delimited output
```
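The `-m` cutoff acts on the length of each contig sequence. As a rough illustration of that idea (not the code used by `contig_info.sh`), the sketch below computes per-sequence lengths from a FASTA stream and keeps only those reaching a minimum length. The function name `fasta_lengths` is hypothetical, and the handling of wrapped lines and whitespace is an assumption.

```bash
# fasta_lengths <min_len> <fasta_file>   (use "-" to read standard input)
# Hypothetical helper, not part of contig_info.sh: prints the length of every
# sequence with at least <min_len> residues, one length per line.
fasta_lengths() {
  awk -v min="$1" '
    function emit() { if (seen && len >= min) print len }
    /^>/ { emit(); seen = 1; len = 0; next }    # a header line starts a new record
    { gsub(/[ \t\r]/, ""); len += length($0) }  # sequence line (possibly wrapped)
    END { emit() }' "$2"
}

# Toy input: c1 has 6 residues, c2 has 2, c3 has 10; with a cutoff of 3, c2 is dropped.
printf '>c1\nACGT\nAC\n>c2\nGG\n>c3\nACGTACGTAC\n' | fasta_lengths 3 -
# prints:
# 6
# 10
```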
## Examples
The following [Bash](https://www.gnu.org/software/bash/) command lines download the genome sequences of the five _Mucor circinelloides_ strains 1006PhL, CBS 277.49, WJ11, B8987 and JCM 22480 from the [NCBI genome repository](https://www.ncbi.nlm.nih.gov/genome):
```bash
NCBIFTP="wget -q -O- https://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/"; Z=".1.fsa_nt.gz";
echo -e "1006PhL\tAOCY01\nCBS277.49\tAMYB01\nWJ11\tLGTF01\nB8987\tJNDM01\nJCM22480\tBCHG01" |
while read -r s a; do echo -n "$s ... ";$NCBIFTP${a:0:2}/${a:2:2}/$a/$a$Z|zcat>Mucor.$s.fasta;echo "[ok]";done
```
The following command line launches the script `contig_info.sh` on the first downloaded file _Mucor.1006PhL.fasta_:
```bash
./contig_info.sh Mucor.1006PhL.fasta
```
leading to the following standard output:
```
File Mucor.1006PhL.fasta
Number of sequences 1459
Residue counts:
Number of A's 10320010 30.23 %
Number of C's 6747611 19.76 %
Number of G's 6731530 19.72 %
Number of T's 10335465 30.27 %
Number of N's 0 0 %
Total 34134616
%AT 60.52 %
%GC 39.48 %
Sequence lengths:
Minimum 410
Quartile 25% 1660
Median 6176
Quartile 75% 37608
Maximum 213712
Average 23395.89
Contiguity statistics:
N50 58982
N75 36291
N90 18584
L50 194
L75 376
L90 562
```
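To make the contiguity statistics concrete: under the usual definition, N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length, and L50 is the number of contigs in that set. The sketch below applies this definition to a list of lengths. The function name `n50_l50` is hypothetical, and the tie convention (`>=`) is an assumption, so it may not reproduce the exact values reported by `contig_info.sh` in every case.

```bash
# n50_l50: reads one contig length per line on standard input, prints "N50 L50".
# Hypothetical helper written for illustration only.
n50_l50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }          # lengths arrive in decreasing order
    END {
      half = total / 2; cum = 0
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (cum >= half) { print len[i], i; exit }
      }
    }'
}

# Toy example: lengths sum to 20, so the target is 10; 6 + 5 = 11 reaches it.
printf '%s\n' 2 3 4 5 6 | n50_l50
# prints: 5 2
```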
The same results can be output in tab-delimited format with the following command line:
```bash
./contig_info.sh -t Mucor.1006PhL.fasta
```
```
#File Nseq Nres A C G T N %A %C %G %T %N %AT %GC Min Q25 Med Q75 Max Avg N50 N75 N90 L50 L75 L90
Mucor.1006PhL.fasta 1459 34134616 10320010 6747611 6731530 10335465 0 30.23% 19.76% 19.72% 30.27% 0% 60.52% 39.48% 410 1660 6176 37608 213712 23395.89 58982 36291 18584 194 376 562
```
Of note, the five downloaded FASTA files can be analyzed with a single command line:
```bash
./contig_info.sh -t Mucor.*.fasta
```
```
#File Nseq Nres A C G T N %A %C %G %T %N %AT %GC Min Q25 Med Q75 Max Avg N50 N75 N90 L50 L75 L90
Mucor.1006PhL.fasta 1459 34134616 10320010 6747611 6731530 10335465 0 30.23% 19.76% 19.72% 30.27% 0% 60.52% 39.48% 410 1660 6176 37608 213712 23395.89 58982 36291 18584 194 376 562
Mucor.B8987.fasta 2210 36700617 11096810 7247117 7233795 11122895 0 30.23% 19.74% 19.71% 30.30% 0% 60.55% 39.45% 206 839 2482 20727 258792 16606.61 58460 30025 13274 193 416 674
Mucor.CBS277.49.fasta 21 36567582 10571030 7715901 7705901 10574750 0 28.90% 21.10% 21.07% 28.91% 0% 57.83% 42.17% 4155 41542 934259 3187354 6050249 1741313.42 4318338 3096690 1074709 4 7 9
Mucor.JCM22480.fasta 401 36616466 10586281 6882218 6899109 10581984 1659222 28.91% 18.79% 18.84% 28.89% 4.53% 60.57% 39.43% 1038 4814 50332 135940 659822 91312.88 197059 109360 63107 61 121 183
Mucor.WJ11.fasta 2519 33065171 9974064 6559358 6556539 9975210 0 30.16% 19.83% 19.82% 30.16% 0% 60.34% 39.66% 430 3275 7692 18010 118704 13126.30 24148 12884 5672 429 898 1455
```
The tab-delimited output format is useful for focusing on specific fields, e.g. the six contiguity statistics:
```bash
./contig_info.sh -t Mucor.*.fasta | cut -f1,22-
```
```
#File N50 N75 N90 L50 L75 L90
Mucor.1006PhL.fasta 58982 36291 18584 194 376 562
Mucor.B8987.fasta 58460 30025 13274 193 416 674
Mucor.CBS277.49.fasta 4318338 3096690 1074709 4 7 9
Mucor.JCM22480.fasta 197059 109360 63107 61 121 183
Mucor.WJ11.fasta 24148 12884 5672 429 898 1455
```
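Selecting fields with `cut -f1,22-` depends on the column positions. As a possible alternative, the sketch below selects columns by the names found in the header line. The function name `select_columns` is hypothetical, and the approach assumes that the header uses the column names shown above (including the leading `#File`) and that fields are separated by tabs.

```bash
# select_columns <name1,name2,...>: reads a tab-delimited table with a header
# line on standard input and prints only the named columns, in the given order.
# Hypothetical helper; a requested column missing from the header is printed as NA.
select_columns() {
  awk -F'\t' -v OFS='\t' -v want="$1" '
    NR == 1 {
      n = split(want, names, ",")
      for (i = 1; i <= NF; i++) pos[$i] = i          # header name -> column index
      for (j = 1; j <= n; j++) idx[j] = pos[names[j]]
    }
    {
      line = ""
      for (j = 1; j <= n; j++)
        line = line (j > 1 ? OFS : "") (idx[j] ? $(idx[j]) : "NA")
      print line
    }'
}

# Toy table standing in for the output of: ./contig_info.sh -t Mucor.*.fasta
printf '#File\tNseq\tN50\tL50\ntoy.fasta\t5\t5\t2\n' | select_columns '#File,N50,L50'
```

With the real script, the same function would be used as `./contig_info.sh -t Mucor.*.fasta | select_columns '#File,N50,L50'`.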
Finally, the option `-g` can be used to set an expected genome size, in which case the {N,L}G{50,75,90} statistics are reported instead of the {N,L}{50,75,90} ones:
```bash
./contig_info.sh -t -g 36000000 Mucor.*.fasta | cut -f1,22-
```
```
#File N50 N75 N90 L50 L75 L90 ExpSize
Mucor.1006PhL.fasta 57499 32472 7652 210 417 692 36000000
Mucor.B8987.fasta 59771 30857 15730 187 399 631 36000000
Mucor.CBS277.49.fasta 4318338 3096690 1074709 4 7 9 36000000
Mucor.JCM22480.fasta 197663 113006 69531 59 117 175 36000000
Mucor.WJ11.fasta 21799 9865 2445 493 1092 2146 36000000
```
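The NG and LG statistics follow the same cumulative-length idea as N50 and L50, except that the target is a fraction of the expected genome size rather than of the total assembly length. The sketch below illustrates this under stated assumptions: the function name `ngx_lgx` is hypothetical, the `>=` tie convention is assumed, and printing `NA` when the assembly is too short to reach the target is an illustrative choice, not necessarily what `contig_info.sh` does.

```bash
# ngx_lgx <percent> <genome_size>: reads one contig length per line on standard
# input and prints "NGx LGx" for the given percentage of the expected genome size.
# Hypothetical helper written for illustration only.
ngx_lgx() {
  sort -rn | awk -v pct="$1" -v gsize="$2" '
    BEGIN { target = gsize * pct / 100 }
    {
      cum += $1                                  # lengths arrive in decreasing order
      if (!found && cum >= target) { print $1, NR; found = 1 }
    }
    END { if (!found) print "NA", "NA" }         # assembly shorter than the target
  '
}

# Toy example: expected genome size 30, so the NG50 target is 15; 6 + 5 + 4 reaches it.
printf '%s\n' 2 3 4 5 6 | ngx_lgx 50 30
# prints: 4 3
```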
## References
Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B (2011) Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research, 21(12):2224-2241. [doi:10.1101/gr.126599.111](https://genome.cshlp.org/content/21/12/2224).
......