Commit a9c67ddb authored by Alexis CRISCUOLO

1.0

parent d16542c0
@@ -23,7 +23,7 @@ Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/fastq_info.git
```
Go to the directory `fastq_info` to give the execute permission to the file `fastq_info.sh`:
Go to the directory `fastq_info/` to give the execute permission to the file `fastq_info.sh`:
```bash
chmod +x fastq_info.sh
@@ -47,15 +47,19 @@ In practice, _fastq_info_ is able to detect and deal with most AWK interpreters
To use _fastq_info_ with standard FASTQ compression formats, it is expected that the following binaries are available in the `$PATH`:
* [gzip](https://www.gnu.org/software/gzip/), required to deal with files compressed using gzip;
 **+** [gzip](https://www.gnu.org/software/gzip/), required to deal with files compressed using gzip;
* [bzip2](https://sourceware.org/bzip2/), required to deal with files compressed using bzip2;
 **+** [bzip2](https://sourceware.org/bzip2/), required to deal with files compressed using bzip2;
* [pigz](https://zlib.net/pigz/), expected to deal with files compressed using gzip on multiple threads (when not installed, [gzip](https://www.gnu.org/software/gzip/) is used instead);
 **+** [pigz](https://zlib.net/pigz/), expected to deal with files compressed using gzip on multiple threads (when not installed, [gzip](https://www.gnu.org/software/gzip/) is used instead);
* [pbzip2](http://compression.ca/pbzip2/), expected to deal with files compressed using bzip2 on multiple threads (when not installed, [bzip2](https://sourceware.org/bzip2/) is used instead);
 **+** [pbzip2](http://compression.ca/pbzip2/), expected to deal with files compressed using bzip2 on multiple threads (when not installed, [bzip2](https://sourceware.org/bzip2/) is used instead);
* [DSRC 2.0](http://sun.aei.polsl.pl/dsrc), required to deal with files compressed using DSRC 2.0 RC/RC2.
 **+** [dsrc](http://sun.aei.polsl.pl/dsrc), required to deal with files compressed using DSRC 2.0 RC/RC2 (Roguski and Deorowicz 2014);
 **+** [fqzcomp](https://github.com/jkbonfield/fqzcomp), required to deal with files compressed using fqzcomp 4.0 (Bonfield and Mahoney 2013);
 **+** [quip](https://github.com/dcjones/quip), required to deal with files compressed using QUIP (Jones et al. 2012).
To run _fastq_info_, it is not required to install all these binaries, but the dedicated tool(s) should be available depending on the compression format of the input files.
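Before running _fastq_info_ on a given compression format, it can help to see which of the tools listed above are reachable from the `$PATH`. The following is a minimal sketch, not part of _fastq_info_: the function name `check_tools` is hypothetical, and the binary names are simply taken from the list above, so the name actually installed on a given system may differ.

```bash
# Hypothetical helper (not part of fastq_info): reports, for each tool name
# given as argument, whether a matching executable is found in $PATH.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      # tool name, status, resolved location
      printf '%s\tfound\t%s\n' "$tool" "$(command -v "$tool")"
    else
      printf '%s\tmissing\n' "$tool"
    fi
  done
}

# Tool names below are assumed from the README list; adjust them if needed.
check_tools gzip bzip2 pigz pbzip2 dsrc fqzcomp quip
```

A tool reported as missing only matters when input files use the corresponding compression format.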
@@ -67,14 +71,7 @@ Run _fastq_info_ without option to read the following documentation:
```
USAGE: fastq_info.sh [options] [<file1> <file2> ...]
Allowed file extensions:
.fastq
.fq
.txt ..... considered as FASTQ-formatted files
.gz
.gzip .... considered as FASTQ-formatted files compressed using gzip;
decompressed using gunzip or pigz (when available in $PATH)
Allowed file extensions (case insensitive):
.bz
.bz2
.bzip
@@ -85,36 +82,51 @@ Run _fastq_info_ without option to read the following documentation:
.dsrc2 ... considered as FASTQ-formatted files compressed using DSRC
v2.0 (sun.aei.polsl.pl/dsrc); decompressed using DSRC v2.0
(when available in $PATH)
.fastq
.fq
.txt ..... considered as uncompressed FASTQ-formatted files
.fqz ..... considered as FASTQ-formatted files compressed using
fqzcomp v4 (github.com/jkbonfield/fqzcomp); decompressed
using fqzcomp v4 (when available in $PATH)
.gz
.gzip .... considered as FASTQ-formatted files compressed using gzip;
decompressed using gunzip or pigz (when available in $PATH)
.qp ...... considered as FASTQ-formatted files compressed using QUIP
(github.com/dcjones/quip); decompressed using QUIP (when
available in $PATH)
Options:
-s <int> speed index between 1 (slower) and 9 (faster) to manage the
subsampling rate; when set to 1, 2, 3, 4 or 5, results are
based on 100%, ~33%, ~20%, ~15% or ~10% of all FASTQ blocks
(default: 5)
-p <int> Phred quality offset (default: 33)
-v <char> reduced (r), full (f) or tab-delimited (t) result output
(default: r)
-t <int> number of threads for decompressing files (default: 2)
-a to specify the AWK interpreter (default: awk/gawk in $PATH)
-c checking decompressing tools (default: not set)
-p <int> Phred quality offset (default: 33)
-d DOS end-of-lines in input file(s) (default: not set)
-t <int> number of thread(s) for decompressing files (default: 1)
-a AWK interpreter (default: gawk or awk in $PATH)
-c checks available tools (default: not set)
-h prints this help and exits
```
## Notes
* Each input file is decompressed (if required) and parsed. Numbers of High-Throughput Sequencing (HTS) reads and bases are exact, as well as the derived average HTS read length. All other descriptive statistics are based on FASTQ block subsampling (except when setting option `-s 1`). Low subsampling rate (e.g. ~10% by default) is generally sufficient to obtain results representative of the whole set of HTS reads.
* Each input file is decompressed (if required) and parsed. Numbers of High-Throughput Sequencing (HTS) reads and bases are exact, as well as the derived average HTS read length. All other descriptive statistics are estimated by an AWK program based on FASTQ block subsampling (except when setting option `-s 1`). Low subsampling rate (e.g. ~10% by default) is generally sufficient to obtain results representative of the whole set of HTS reads.
* _fastq_info_ is able to consider many input files summarized using [filename expansion](https://tldp.org/LDP/abs/html/globbingref.html), e.g. `dirname/*.fastq.gz`
* In outputted results, every empty entry is indicated by a dot instead of zero.
* Tab-delimited option `-v t` enables to output only several statistics: numbers of HTS reads and bases (NR and NB, respectively), average HTS read length (AL), the three quartiles of the global Phred score distribution (BQ1, BQ2, BQ3) and the three quartiles of the average Phred score per HTS read distribution (RQ1, RQ2, RQ3). For detailed distributions per HTS read position and/or Phred score value, use options `-v r` or `-v f`.
* Specific AWK interpreters can be used via the option `-a`. _fastq_info_ was successfully run together with [gawk](https://www.gnu.org/software/gawk/), [nawk](https://github.com/onetrueawk/), [mawk](https://invisible-island.net/mawk/), and [goawk](https://github.com/benhoyt/goawk). However, faster running times were generally observed using [gawk](https://www.gnu.org/software/gawk/) versions &#8805; 4.0 (on Linux).
* Specific AWK interpreters can be used via the option `-a` (either a name within the `$PATH` or the full path to a binary). _fastq_info_ was successfully run together with [gawk](https://www.gnu.org/software/gawk/), [nawk](https://github.com/onetrueawk/), [mawk](https://invisible-island.net/mawk/), and [goawk](https://github.com/benhoyt/goawk). However, faster running times were generally observed using [gawk](https://www.gnu.org/software/gawk/) versions &#8805; 4.0 (on Linux).
* Option `-c` can be useful to obtain a check list of the required/expected binaries available in the `$PATH`, as well as their respective version (especially for the AWK interpreter).
* _fastq_info_ is able to consider many input files summarized using [filename expansion](https://tldp.org/LDP/abs/html/globbingref.html), e.g. `dirname/*.fastq.gz`
* Option `-d` can be useful when dealing with FASTQ files containing non-Unix end-of-lines (e.g. created under Microsoft Windows).
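The exact counts mentioned in the notes above (numbers of HTS reads and bases, average read length) can be illustrated with a short AWK sketch. This is not the code used by _fastq_info_: the function name `count_fastq` is hypothetical, and the sketch assumes strict four-line FASTQ blocks. It also strips a trailing carriage return, which is one way DOS end-of-lines could otherwise inflate the base count.

```bash
# Hypothetical helper (not part of fastq_info): prints the number of reads,
# the number of bases and the average read length, tab-delimited.
# Assumes every FASTQ block spans exactly four lines.
count_fastq() {
  awk '
    NR % 4 == 2 {             # second line of each block: the sequence
      sub(/\r$/, "")          # drop a trailing carriage return (DOS end-of-line)
      reads++
      bases += length($0)
    }
    END {
      if (reads == 0) { printf "0\t0\t.\n"; exit }   # avoid dividing by zero
      printf "%d\t%d\t%.1f\n", reads, bases, bases / reads
    }' "$@"
}

# Tiny in-line example: two reads of lengths 4 and 6.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' | count_fastq
```

With a compressed file, the function can read from a pipe, e.g. `gzip -dc file.fastq.gz | count_fastq`.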
@@ -122,7 +134,7 @@ Run _fastq_info_ without option to read the following documentation:
## Examples
The following [Bash](https://www.gnu.org/software/bash/) command line enables to download the pair of gzipped FASTQ files *SRR001666\_1.fastq.gz* and *SRR001666\_2.fastq.gz*:
The following [Bash](https://www.gnu.org/software/bash/) command line enables to download the pair of gzipped FASTQ files *SRR001666\_1.fastq.gz* and *SRR001666\_2.fastq.gz* to be used as examples:
```bash
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666*.fastq.gz
@@ -266,7 +278,7 @@ reads 26 30 34 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.1
To help with reading, the main statistics for all files can be summarized in tab-delimited format using option `-v t`:
```bash
./fastq_info.sh -v t *.fastq.gz
./fastq_info.sh -v t SRR001666*.fastq.gz
```
```
@@ -285,12 +297,12 @@ The above example clearly shows that the overall sequencing error rate is lower
#### Note on the subsampling rate
By default, all distributions are estimated from a subset of all input FASTQ blocks to obtain fast running times (the subsampling rate is indicated when using option `-v f`).
In almost all cases, default subsampling rate (i.e. ~10% with option `-s 5`) is sufficient to efficiently approximate the different distributions (i.e. read lengths, GC-content, Phred scores).
In almost all cases, default subsampling rate (i.e. ~10% with option `-s 5`) is sufficient to efficiently approximate the different distributions (i.e. HTS read lengths, GC-content, Phred scores).
For example, the below command line uses all FASTQ blocks (i.e. option `-s 1`):
For example, the below command line uses all FASTQ blocks from each input file (i.e. option `-s 1`):
```bash
./fastq_info.sh -s 1 -v t *.fastq.gz
./fastq_info.sh -s 1 -v t SRR001666*.fastq.gz
```
```
@@ -299,6 +311,19 @@ SRR001666_1.fastq.gz 7047668 253716048 36.0 30 40 40 32 35 37
SRR001666_2.fastq.gz 7047668 253716048 36.0 18 40 40 26 30 34
```
All statistics are identical to the ones previously estimated, but the overall running time was 8 times slower...
For comparison, when used with default options, _fastq_info_ is expected to run ~1.5 times faster than [FastQC](https://github.com/s-andrews/FastQC) to process one FASTQ file.
All statistics are identical to the ones previously estimated (see above), but the overall running time was 8 times slower...
For comparison, when used with default options, _fastq_info_ is expected to run ~1.5 times faster than [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc) to process one FASTQ file.
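To give an intuition of why subsampling FASTQ blocks can approximate a distribution, the sketch below estimates the mean Phred score from one block out of every `step` blocks. It is an illustration only: the function name `mean_phred` is hypothetical, and regular sampling (every n-th block) is an assumption made here, not necessarily how _fastq_info_ selects blocks. It also assumes four-line FASTQ blocks and printable ASCII quality characters.

```bash
# Hypothetical helper (not part of fastq_info): mean Phred score estimated
# from one FASTQ block out of every <step> blocks, read from standard input.
#   $1: step (1 = use every block)   $2: Phred quality offset (e.g. 33)
mean_phred() {
  awk -v step="$1" -v offset="$2" '
    BEGIN {
      # lookup table: printable ASCII character -> character code
      for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 0 {                          # fourth line of a block: qualities
      block = NR / 4
      if ((block - 1) % step != 0) next    # skip blocks outside the subsample
      n = length($0)
      for (j = 1; j <= n; j++) {
        sum += ord[substr($0, j, 1)] - offset
        cnt++
      }
    }
    END { if (cnt > 0) printf "%.2f\n", sum / cnt; else print "." }'
}

# Possible usage (file name taken from the examples above):
#   gzip -dc SRR001666_1.fastq.gz | mean_phred 10 33   # about 10% of blocks
#   gzip -dc SRR001666_1.fastq.gz | mean_phred 1 33    # all blocks
```

Comparing the two outputs on the same file gives a rough idea of how close a ~10% subsample gets to the value computed from all blocks.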
## References
Bonfield JK, Mahoney MV (2013) _Compression of FASTQ and SAM format sequencing data_. PLOS One, 8(3):e59190. [doi:10.1371/journal.pone.0059190](https://doi.org/10.1371/journal.pone.0059190).
Jones DC, Ruzzo WL, Peng X, Katze MG (2012) _Compression of next-generation sequencing reads aided by highly efficient de novo assembly_. Nucleic Acids Research, 40(22):e171–e171. [doi:10.1093/nar/gks754](https://doi.org/10.1093/nar/gks754).
Roguski L, Deorowicz S (2014) _DSRC 2 - Industry-oriented compression of FASTQ files._ Bioinformatics, 30(15):2213-2215. [doi:10.1093/bioinformatics/btu208](https://doi.org/10.1093/bioinformatics/btu208).