Commit ab45b7c2 authored by Alexis CRISCUOLO

1.0

parent b16be946
......@@ -107,15 +107,15 @@ Run _fastq_info_ without option to read the following documentation:
* In outputted results, every empty entry is indicated by a dot instead of zero.
* Tab-delimited option `-v t` enabls to output only several statistics: numbers of HTS reads and bases (NR and NB, respectively), average HTS read length (AL), the three quartiles of the global Phred score distribution (BQ1, BQ2, BQ3) and the three quartiles of the average Phred score per HTS read distribution (RQ1, RQ2, RQ3). For detailled distributions per HTS read position, use options `-v r` or `-v f`.
* Tab-delimited option `-v t` enables to output only several statistics: numbers of HTS reads and bases (NR and NB, respectively), average HTS read length (AL), the three quartiles of the global Phred score distribution (BQ1, BQ2, BQ3) and the three quartiles of the average Phred score per HTS read distribution (RQ1, RQ2, RQ3). For detailled distributions per HTS read position, use options `-v r` or `-v f`.
* Specific AWK interpreters can be used via the option `-a`. _fastq_info_ was successfully run together with [gawk](https://www.gnu.org/software/gawk/), [nawk](https://github.com/onetrueawk/), [mawk](https://invisible-island.net/mawk/), and [goawk](https://github.com/benhoyt/goawk). However, faster runnning times were generally assessed using [gawk](https://www.gnu.org/software/gawk/) versions ≥ 4.0 (on Linux).
* Specific AWK interpreters can be used via the option `-a`. _fastq_info_ was successfully run together with [gawk](https://www.gnu.org/software/gawk/), [nawk](https://github.com/onetrueawk/), [mawk](https://invisible-island.net/mawk/), and [goawk](https://github.com/benhoyt/goawk). However, faster running times were generally observed using [gawk](https://www.gnu.org/software/gawk/) versions ≥ 4.0 (on Linux).
* Option `-c` can be useful to obtain a check list of the required/expected binaries available in the `$PATH`, as well as their respective version (especially for the AWK interpreter).
* _fastq_info_ is able to consider many input files summarized using [filename expansion](https://tldp.org/LDP/abs/html/globbingref.html), e.g. `dirname/*.fastq.gz`
* Option `-d` can be useful when dealing with FASTQ files written on a Microsoft Windows OS.
* Option `-d` can be useful when dealing with FASTQ files containing non-Unix end-of-lines (e.g. created under Microsoft Windows).
......@@ -130,7 +130,7 @@ wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666*.fastq.gz
#### Basic usage
The following command line runs `fastq_info.sh` to analyze the second downloaded file :
The following command line runs `fastq_info.sh` to analyze the second (i.e. R2) downloaded file :
```bash
./fastq_info.sh SRR001666_2.fastq.gz
......@@ -192,11 +192,11 @@ reads 26 30 34
```
The first part of the outputted table is made up by one row per HTS read position (column `pos`).
For each `pos` value (varying from 1 to the largest observed HTS read length), the corresponding row indicates the percentage of HTS read of length being equal to `pos` (column `Lfreq`), the percentage of observed GC bases (column `GCfreq`), and the 1st, 2nd and 3rd quartiles of observed Phred score (columns `Q1`, `Q2` and `Q3`, respectively).
The bottom part of the table summarizes the distribution of the Phred scores within all bases (row `bases`: three quartiles `Q1`, `Q2` and `Q3`), and the distribution of the average Phred score per HTS read (last row `reads`: three quartiles `Q1`, `Q2` and `Q3`).
For each `pos` value (varying from 1 to the largest observed HTS read length), the corresponding row indicates the percentage of HTS read of length being equal to `pos` (column `Lfreq`), the percentage of observed GC bases (column `GCfreq`), and the 1st, 2nd and 3rd quartiles of observed Phred scores (columns `Q1`, `Q2` and `Q3`, respectively).
The bottom part of the table summarizes the global Phred score distribution (row `bases`: three quartiles `Q1`, `Q2` and `Q3`), and the average Phred score per HTS read distribution (last row `reads`: three quartiles `Q1`, `Q2` and `Q3`).
The above example therefore shows that the majority of Phred scores are decreasing below _Q_ = 20 at positions 28-36 (i.e. the median Phred score _Q_<sub>2</sub> is lower than 20 as of HTS read position 28).
At least 25% of all sequenced bases are associated to Phred scores of at most 18 (first quartile _Q_<sub>1</sub> = 18 in row `bases`), but at least 50% of the HTS reads have an average Phred score of at least 30 (median _Q_<sub>2</sub> = 30 in row `reads`)
At least 25% of all sequenced bases are associated to Phred scores < 19 (i.e. first quartile _Q_<sub>1</sub> = 18 in row `bases`), but at least 50% of the HTS reads have an average Phred score > 29 (median _Q_<sub>2</sub> = 30 in row `reads`)
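To relate the quartiles discussed here to error rates, recall the standard Phred definition: a score _Q_ corresponds to a base-calling error probability _P_ = 10<sup>-_Q_/10</sup>. The sketch below is not part of _fastq_info_; the helper name `phred_to_perr` is hypothetical, and it only assumes an AWK interpreter is available in the `$PATH`.

```bash
# Hypothetical helper (not part of fastq_info): converts a Phred score Q
# into the corresponding error probability P = 10^(-Q/10).
phred_to_perr() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Scores taken from the example above: Q1 of row `bases` (18),
# the Q = 20 threshold, and the median of row `reads` (30).
for q in 18 20 30 ; do
  printf 'Q=%s\tP=%s\n' "$q" "$(phred_to_perr "$q")"
done
```

Under this definition, _Q_ = 20 means about one expected error per 100 base calls, and _Q_ = 30 about one per 1,000.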
#### Advanced usage
......@@ -275,7 +275,7 @@ SRR001666_1.fastq.gz 7047668 253716048 36.0 30 40 40 32 35 37
SRR001666_2.fastq.gz 7047668 253716048 36.0 18 40 40 26 30 34
```
This simple output format enables to easily read every file name (`#File`), no. HTS reads (`NR`) and bases (`NB`), average HTS read length (`AL`), as well as the three quartiles of the distribution of all Phred scores (`BQ1`, `BQ2`, `BQ3`) and the three quartiles of the distribution of the average Phred score per HTS read (`RQ1`, `RQ2`, `RQ3`).
This simple output format enables to easily read every file name (`#File`), no. HTS reads (`NR`) and bases (`NB`), average HTS read length (`AL`), as well as the three quartiles of the global Phred score distribution (`BQ1`, `BQ2`, `BQ3`) and of the average Phred score per HTS read distribution (`RQ1`, `RQ2`, `RQ3`).
The above example clearly shows that the overall sequencing error rate is lower in file *SRR001666\_1.fastq.gz* than in file *SRR001666\_2.fastq.gz*.
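Because this output is tab-delimited, it can be filtered with standard tools. The following is a minimal sketch, assuming that the header line starts with `#` and that `BQ1` is the fifth column, as in the example above. The function name `flag_low_bq1` and the threshold of 20 are illustrative choices, not part of _fastq_info_. The two data rows are copied from the example output.

```bash
# Hypothetical helper: reads tab-delimited `-v t` style rows on stdin and
# prints the name of every file whose BQ1 (column 5) is below a threshold.
# Header/comment lines starting with '#' are skipped.
flag_low_bq1() {
  awk -F'\t' -v min="${1:-20}" '!/^#/ && $5 + 0 < min { print $1 }'
}

# printf reuses its format for each group of 10 arguments (one row each).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  '#File' NR NB AL BQ1 BQ2 BQ3 RQ1 RQ2 RQ3 \
  SRR001666_1.fastq.gz 7047668 253716048 36.0 30 40 40 32 35 37 \
  SRR001666_2.fastq.gz 7047668 253716048 36.0 18 40 40 26 30 34 |
  flag_low_bq1 20
```

With these rows, only *SRR001666\_2.fastq.gz* is reported, since its `BQ1` of 18 is below 20. Note that an entry shown as a dot would be treated as zero by this filter.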
......@@ -284,8 +284,8 @@ The above example clearly shows that the overall sequencing error rate is lower
#### Note on the subsampling rate
By default, all statistics are estimated from a subset of all input FASTQ blocks to obtain fast running times (the subsampling rate is indicated when using option `-v f`).
In almost all cases, default subsampling rate (i.e. ~10% with option `-s 5`) is sufficient to efficiently approximate the different distributions (i.e. read length, GC-content, Phred score).
By default, all distributions are estimated from a subset of all input FASTQ blocks to obtain fast running times (the subsampling rate is indicated when using option `-v f`).
In almost all cases, default subsampling rate (i.e. ~10% with option `-s 5`) is sufficient to efficiently approximate the different distributions (i.e. read lengths, GC-content, Phred scores).
For example, the below command line uses all FASTQ blocks (i.e. option `-s 1`):
......@@ -299,6 +299,6 @@ SRR001666_1.fastq.gz 7047668 253716048 36.0 30 40 40 32 35 37
SRR001666_2.fastq.gz 7047668 253716048 36.0 18 40 40 26 30 34
```
As shown above, all statistics are identical to the ones previously estimated, but the overall running time was 8 times slower...
All statistics are identical to the ones previously estimated, but the overall running time was 8 times slower...
For comparison, when used with default options, _fastq_info_ is expected to run ~1.5 times faster than [FastQC](https://github.com/s-andrews/FastQC) to process one FASTQ file.