Commit 39ffb34a authored by Alexis CRISCUOLO

1.0

parent 7e56e2b3
# fastq_info
_fastq_info_ is a command line program written in [Bash](https://www.gnu.org/software/bash/)/[Gawk](https://www.gnu.org/software/gawk/) for quickly estimating several standard descriptive statistics from FASTQ-formatted High-Throughput Sequencing (HTS) files.
Estimated statistics are:
_fastq_info_ is a command line program written in [Bash](https://www.gnu.org/software/bash/)/[AWK](https://en.wikipedia.org/wiki/AWK) for quickly estimating several standard descriptive statistics from FASTQ-formatted High-Throughput Sequencing (HTS) files.
Estimated statistics per FASTQ file are:
  ▹   HTS read and base numbers per file,
  ▹   HTS read and base numbers,
  ▹   HTS read length distribution per file,
  ▹   HTS read length distribution,
  ▹   GC-content per HTS read position,
  ▹   Phred score distribution (quartiles per file and per HTS read position),
  ▹   Phred score distribution (global and for each HTS read position),
  ▹   average HTS read Phred score distribution (quartiles per file).
  ▹   average Phred score distribution per HTS read.
Several output result formats are available (e.g. reduced/full table, tab-delimited).
......@@ -28,20 +28,30 @@ and run it with the following command line model:
./fastq_info.sh [options]
```
Note that _fastq_info_ requires **[gawk](https://www.gnu.org/software/gawk/) version at least 4.0** (generally available on recent Linux distributions).
Moreover, to be able to deal with standard FASTQ compression formats, it is expected that the following binaries are available in the `$PATH`:
* [gzip](https://www.gnu.org/software/gzip/) (to deal with files compressed using gzip)
##### About AWK
* [bzip2](https://sourceware.org/bzip2/) (to deal with files compressed usin bzip2)
_fastq_info_ requires an AWK interpreter in the `$PATH`, which is the case for most Linux distributions.
By default, _fastq_info_ first considers [gawk](https://www.gnu.org/software/gawk/) (GNU awk, generally available on recent Linux distributions); otherwise the basic command `awk` in the `$PATH` is used.
However, alternative implementations of AWK can be specified using option `-a` (see Usage section).
In practice, _fastq_info_ is able to detect and deal with most AWK interpreters (e.g. [nawk](https://github.com/onetrueawk/), [mawk](https://invisible-island.net/mawk/), [goawk](https://github.com/benhoyt/goawk)).
* [pigz](https://zlib.net/pigz/) (to deal on multiple threads with files compressed using gzip)
* [pbzip2](http://compression.ca/pbzip2/) (to deal on multiple threads with files compressed using bzip2)
##### About compressed FASTQ files
* [DSRC 2.0](http://sun.aei.polsl.pl/dsrc) (to deal with files compressed using DSRC 2.0 RC/RC2)
To use _fastq_info_ with standard FASTQ compression formats, it is expected that the following binaries are available in the `$PATH`:
To run _fastq_info_, it is not required to intall all these binaries, but the dedicated tool(s) should be available depending on the compression format of the input files.
* [gzip](https://www.gnu.org/software/gzip/), required to deal with files compressed using gzip;
* [bzip2](https://sourceware.org/bzip2/), required to deal with files compressed using bzip2;
* [pigz](https://zlib.net/pigz/), expected to deal with files compressed using gzip on multiple threads (when not installed, [gzip](https://www.gnu.org/software/gzip/) is used instead);
* [pbzip2](http://compression.ca/pbzip2/), expected to deal with files compressed using bzip2 on multiple threads (when not installed, [bzip2](https://sourceware.org/bzip2/) is used instead);
* [DSRC 2.0](http://sun.aei.polsl.pl/dsrc), required to deal with files compressed using DSRC 2.0 RC/RC2.
To run _fastq_info_, it is not required to install all these binaries, but the dedicated tool(s) should be available depending on the compression format of the input files.
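As a quick way to see which of the tools listed above are present before running _fastq_info_, the following minimal sketch can be used. It is independent of _fastq_info_ itself (it is not the code behind option `-c`); the function name `check_tools` is hypothetical, and `dsrc` is assumed to be the name of the DSRC 2.0 binary.

```bash
# Report, for each given command name, whether it is found in $PATH.
# Stand-alone sketch; check_tools is a hypothetical helper name.
check_tools() {
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s\tfound\n' "$tool"
    else
      printf '%s\tmissing\n' "$tool"
    fi
  done
}

# Assumed binary names for the tools listed above.
check_tools gzip bzip2 pigz pbzip2 dsrc
```

Only the tools matching the compression format of the input files need to be reported as found.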
## Usage
......@@ -49,24 +59,24 @@ To run _fastq_info_, it is not required to intall all these binaries, but the de
Run _fastq_info_ without option to read the following documentation:
```
USAGE: fastq_info.sh [options] <file1> <file2> ...
USAGE: fastq_info.sh [options] [<file1> <file2> ...]
Allowed file extensions:
.fastq
.fq
.txt ..... considered as FASTQ-formated files; directly read using cat
.txt ..... considered as FASTQ-formatted files
.gz
.gzip .... considered as FASTQ-formated files compressed using gzip;
.gzip .... considered as FASTQ-formatted files compressed using gzip;
decompressed using gunzip or pigz (when available in $PATH)
.bz
.bz2
.bzip
.bzip2 ... considered as FASTQ-formated files compressed using bzip2;
.bzip2 ... considered as FASTQ-formatted files compressed using bzip2;
decompressed using bunzip2 or pbzip2 (when available in
$PATH)
.dsrc
.dsrc2 ... considered as FASTQ-formated files compressed using DSRC
.dsrc2 ... considered as FASTQ-formatted files compressed using DSRC
v2.0 (sun.aei.polsl.pl/dsrc); decompressed using DSRC v2.0
(when available in $PATH)
......@@ -79,10 +89,27 @@ Run _fastq_info_ without option to read the following documentation:
-v <char> reduced (r), full (f) or tab-delimited (t) result output
(default: r)
-t <int> number of threads for decompressing files (default: 2)
-a to specify the AWK interpreter (default: awk/gawk in $PATH)
-c checking decompressing tools (default: not set)
-d DOS end-of-lines in input file(s) (default: not set)
```
## Notes
* Each input file is decompressed (if required) and parsed. Numbers of High-Throughput Sequencing (HTS) reads and bases (NR and NB in tab-delimited output, respectively) are exact, as well as the derived average HTS read length (AL). All other descriptive statistics are based on FASTQ block subsampling (except when setting option `-s 1`). Low subsampling rate (e.g. ~10% by default) is generally sufficient to obtain results representative of the whole set of HTS reads. In outputted results, every empty entry is indicated by a dot.
* Specific AWK interpreters can be used via the option `-a`. _fastq_info_ was successfully run together with [gawk](https://www.gnu.org/software/gawk/), [nawk](https://github.com/onetrueawk/), [mawk](https://invisible-island.net/mawk/), and [goawk](https://github.com/benhoyt/goawk). However, the fastest running times were generally observed with [gawk](https://www.gnu.org/software/gawk/) versions &#8805; 4.0 (on Linux).
* Option `-c` can be useful to obtain a check list of the required/expected binaries available in the `$PATH`, as well as their respective version (especially for the AWK interpreter).
* _fastq_info_ can process many input files specified using [filename expansion](https://tldp.org/LDP/abs/html/globbingref.html), e.g. `dirname/*.fastq.gz`.
* Option `-d` can be useful when dealing with FASTQ files written on a Microsoft Windows OS.
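The exact counts mentioned in the first note (`NR`, `NB` and `AL`) can be cross-checked independently with a few lines of AWK. The sketch below is not taken from _fastq_info_; it assumes plain four-line FASTQ blocks (no sequence wrapped over several lines), and the function name `fastq_counts` is hypothetical.

```bash
# Print the number of reads, the number of bases and the average read length,
# tab-separated, for FASTQ data given as file arguments or on standard input.
# Assumes four lines per FASTQ block; fastq_counts is a hypothetical name.
fastq_counts() {
  awk '
    { sub(/\r$/, "") }                          # drop a trailing CR (DOS end-of-line)
    NR % 4 == 2 { nr++; nb += length($0) }      # 2nd line of each block is the sequence
    END { if (nr > 0) printf "%d\t%d\t%.1f\n", nr, nb, nb / nr }
  ' "$@"
}

# Usage with a gzip-compressed file:
#   gzip -dc SRR001666_2.fastq.gz | fastq_counts
```

Because every block is read, these three values do not depend on any subsampling.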
## Examples
The following [Bash](https://www.gnu.org/software/bash/) command line downloads the pair of gzipped FASTQ files *SRR001666\_1.fastq.gz* and *SRR001666\_2.fastq.gz*:
......@@ -90,6 +117,7 @@ The following [Bash](https://www.gnu.org/software/bash/) command line enables to
```bash
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666*.fastq.gz
```
#### Basic usage
The following command line runs `fastq_info.sh` to analyze the second downloaded file:
......@@ -154,15 +182,16 @@ reads 26 30 34
```
The first part of the outputted table is made up of one row per HTS read position (column `pos`).
For each `pos` value, the corresponding row indicates the percentage of HTS read of length being equal to `pos` (column `Lfreq`), the percentage of observed GC bases (column `GCfreq`), and the first, second and third quartiles of observed Phred score (columns `Q1`, `Q2` and `Q3`, respectively).
The bottom part of the table summarizes the distribution of the Phred score within all bases (three quartiles `Q1`, `Q2` and `Q3` at row `bases`), and the distribution of the average Phred score per HTS read (three quartiles `Q1`, `Q2` and `Q3`at row `reads`).
For each `pos` value (varying from 1 to the largest observed HTS read length), the corresponding row indicates the percentage of HTS reads of length equal to `pos` (column `Lfreq`), the percentage of observed GC bases (column `GCfreq`), and the 1st, 2nd and 3rd quartiles of the observed Phred scores (columns `Q1`, `Q2` and `Q3`, respectively).
The bottom part of the table summarizes the distribution of the Phred scores within all bases (row `bases`: three quartiles `Q1`, `Q2` and `Q3`), and the distribution of the average Phred score per HTS read (last row `reads`: three quartiles `Q1`, `Q2` and `Q3`).
The above example therefore shows that the majority of Phred scores fall below _Q_ = 20 at positions 28-36 (i.e. the median Phred score _Q_<sub>2</sub> is lower than 20 as of HTS read position 28).
At least 25% of all sequenced bases are associated with Phred scores of at most 18 (first quartile _Q_<sub>1</sub> = 18 in row `bases`), but at least 50% of the HTS reads have an average Phred score of at least 30 (median _Q_<sub>2</sub> = 30 in row `reads`).
The above example therefore shows that the majority of Phred scores are decreasing below Q=20 at positions 28-36 (i.e. the median Phred score Q2 is lower than 20 as of HTS read position 28).
At least 25% of all sequenced bases are associated to Phred scores of at most 18 (first quartile Q1=18 in row `bases`), but at least 50% of the HTS reads have an average Phred score of at least 30 (median Q2=30 in row `reads`)
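To relate these thresholds to error rates, recall the standard Phred definition: a score _Q_ corresponds to a base-calling error probability of 10<sup>-_Q_/10</sup>. The short sketch below only illustrates this conversion and is not part of _fastq_info_; the function name `phred_to_error` is hypothetical.

```bash
# Convert a Phred score Q into the corresponding error probability 10^(-Q/10).
# phred_to_error is a hypothetical helper name.
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_error 18   # 0.0158
phred_to_error 20   # 0.0100
phred_to_error 30   # 0.0010
```

A Phred score of 20 therefore corresponds to one expected error per 100 bases, and a score of 30 to one per 1000 bases.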
#### Advanced usage
For more details (i.e. one supplementary column per observed Phred score), a full table can be outputted using option `-v f`:
For more details (i.e. one supplementary column for each observed Phred score), a full table can be outputted using option `-v f`:
```bash
./fastq_info.sh -v f SRR001666_2.fastq.gz
......@@ -224,7 +253,7 @@ reads 26 30 34 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.1
#### Tab-delimited outputs
Finally, to help with reading, the main statistics for all files can be summarized in tab-delimited format using option `-v t`:
To help with reading, the main statistics for all files can be summarized in tab-delimited format using option `-v t`:
```bash
./fastq_info.sh -v t *.fastq.gz
......@@ -236,14 +265,16 @@ SRR001666_1.fastq.gz 7047668 253716048 36.0 30 40 40 32 35 37
SRR001666_2.fastq.gz 7047668 253716048 36.0 18 40 40 26 30 34
```
This simple output format enables reading every file name (`#File`), no. HTS reads (`NR`) and bases (`NB`), average HTS read length (`AL`), as well as the quartiles of the distribution of all Phred scores (`BQ1`, `BQ2`, `BQ3`) and the quartiles of the distribution of the average Phred score per HTS read (`RQ1`, `RQ2`, `RQ3`).
This simple output format makes it easy to read every file name (`#File`), the numbers of HTS reads (`NR`) and bases (`NB`), the average HTS read length (`AL`), as well as the three quartiles of the distribution of all Phred scores (`BQ1`, `BQ2`, `BQ3`) and the three quartiles of the distribution of the average Phred score per HTS read (`RQ1`, `RQ2`, `RQ3`).
The above example clearly shows that the overall sequencing error rate is lower in file *SRR001666\_1.fastq.gz* than in file *SRR001666\_2.fastq.gz*.
#### Note on the running time
Note that by default, all statistics are estimated from a subset of all input FASTQ blocks to obtain fast running times.
#### Note on the subsampling rate
By default, all statistics are estimated from a subset of all input FASTQ blocks to obtain fast running times (the subsampling rate is indicated when using option `-v f`).
In almost all cases, the default subsampling rate (i.e. ~10% with option `-s 5`) is sufficient to efficiently approximate the different distributions (i.e. read length, GC-content, Phred score).
For example, the command line below uses all FASTQ blocks (i.e. option `-s 1`):
......