Commit 8edb43ee authored by Alexis  CRISCUOLO's avatar Alexis CRISCUOLO
Browse files

1.0

parent a9c67ddb
# fastq_info
_fastq_info_ is a command line program written in [Bash](https://www.gnu.org/software/bash/)/[AWK](https://en.wikipedia.org/wiki/AWK) for quickly estimating several standard descriptive statistics from FASTQ-formatted High-Throughput Sequencing (HTS) files.
_fastq_info_ is a command line program written in [Bash](https://www.gnu.org/software/bash/)/[AWK](https://en.wikipedia.org/wiki/AWK) for quickly estimating several standard descriptive statistics from FASTQ-formatted High-Throughput Sequencing (HTS) read files.
Estimated statistics per FASTQ file are:
  ▹   HTS read and base numbers,
......@@ -13,7 +13,8 @@ Estimated statistics per FASTQ file are:
  ▹   average Phred score per HTS read distribution.
Several output result formats are available (e.g. reduced/full table, tab-delimited).
Different file compression formats can be handled.
Several output result formats are available (e.g. reduced/full tables, tab-delimited).
## Installation and execution
......@@ -26,6 +27,7 @@ git clone https://gitlab.pasteur.fr/GIPhy/fastq_info.git
Go to the directory `fastq_info/` to give the execute permission to the file `fastq_info.sh`:
```bash
cd fastq_info/
chmod +x fastq_info.sh
```
and run it with the following command line model:
......@@ -40,7 +42,7 @@ and run it with the following command line model:
_fastq_info_ requires an AWK interpreter in the `$PATH`, which is always the case for most Linux distributions.
By default, _fastq_info_ first considers [gawk](https://www.gnu.org/software/gawk/) (GNU awk, generally available on recent Linux distributions); otherwise the basic command `awk` in the `$PATH` is used.
However, alternative implementations of AWK can be specified using option `-a` (see Usage section).
In practice, _fastq_info_ is able to detect and deal with most AWK interpreters (e.g. [nawk](https://github.com/onetrueawk/), [mawk](https://invisible-island.net/mawk/), [goawk](https://github.com/benhoyt/goawk)).
In practice, as two POSIX codes are implemented (i.e. original One-True-AWK and extended GNU AWK), _fastq_info_ is able to detect and deal with almost all AWK interpreters (e.g. [nawk](https://github.com/onetrueawk/), [mawk](https://invisible-island.net/mawk/), [goawk](https://github.com/benhoyt/goawk)).
##### About compressed FASTQ files
......@@ -83,8 +85,7 @@ Run _fastq_info_ without option to read the following documentation:
v2.0 (sun.aei.polsl.pl/dsrc); decompressed using DSRC v2.0
(when available in $PATH)
.fastq
.fq
.txt ..... considered as uncompressed FASTQ-formatted files
.fq ...... considered as uncompressed FASTQ-formatted files
.fqz ..... considered as FASTQ-formatted files compressed using
fqzcomp v4 (github.com/jkbonfield/fqzcomp); decompressed
......@@ -107,7 +108,7 @@ Run _fastq_info_ without option to read the following documentation:
-p <int> Phred quality offset (default: 33)
-d DOS end-of-lines in input file(s) (default: not set)
-t <int> number of thread(s) for decompressing files (default: 1)
-a AWK interpreter (default: gawk or awk in $PATH)
-a <prog> AWK interpreter (default: gawk or awk in $PATH)
-c checks available tools (default: not set)
-h prints this help and exits
```
......@@ -115,15 +116,15 @@ Run _fastq_info_ without option to read the following documentation:
## Notes
* Each input file is decompressed (if required) and parsed. Numbers of High-Throughput Sequencing (HTS) reads and bases are exact, as well as the derived average HTS read length. All other descriptive statistics are estimated by an AWK program based on FASTQ block subsampling (except when setting option `-s 1`). Low subsampling rate (e.g. ~10% by default) is generally sufficient to obtain results representative of the whole set of HTS reads.
* Each input file is decompressed (if required, depending on its extension) and parsed. Numbers of High-Throughput Sequencing (HTS) reads and bases are exact, as well as the derived average HTS read length. All other descriptive statistics are estimated by an AWK program based on FASTQ block subsampling (except when setting option `-s 1`). Low subsampling rate (e.g. ~10% by default) is generally sufficient to obtain results representative of the whole set of HTS reads. For very fast running times, use option `-s 9` (i.e. subsampling rate ~6% ), but at the cost of reduced accuracy.
* _fastq_info_ is able to consider many input files summarized using [filename expansion](https://tldp.org/LDP/abs/html/globbingref.html), e.g. `dirname/*.fastq.gz`
* _fastq_info_ is able to consider many input files summarized using [filename expansion](https://tldp.org/LDP/abs/html/globbingref.html), e.g. `dirname/*.fastq.gz`. A warning is displayed when an input file is not a regular one (e.g. directory) or when its extension is unknown. All warning messages can be disabled by ending the command line with `2>/dev/null`.
* In outputted results, every empty entry is indicated by a dot instead of zero.
* Tab-delimited option `-v t` enables to output only several statistics: numbers of HTS reads and bases (NR and NB, respectively), average HTS read length (AL), the three quartiles of the global Phred score distribution (BQ1, BQ2, BQ3) and the three quartiles of the average Phred score per HTS read distribution (RQ1, RQ2, RQ3). For detailed distributions per HTS read position and/or Phred score value, use options `-v r` or `-v f`.
* Specific AWK interpreters can be used via the option `-a` (either a name within the `$PATH` or the full path to a binary). _fastq_info_ was successfully run together with [gawk](https://www.gnu.org/software/gawk/), [nawk](https://github.com/onetrueawk/), [mawk](https://invisible-island.net/mawk/), and [goawk](https://github.com/benhoyt/goawk). However, faster running times were generally observed using [gawk](https://www.gnu.org/software/gawk/) versions &#8805; 4.0 (on Linux).
* Specific AWK interpreters can be used via the option `-a` (either a name within the `$PATH` or the full path to a binary). _fastq_info_ was successfully run together with [gawk](https://www.gnu.org/software/gawk/), [nawk](https://github.com/onetrueawk/), [mawk](https://invisible-island.net/mawk/), and [goawk](https://github.com/benhoyt/goawk). However, faster running times were generally observed using [gawk](https://www.gnu.org/software/gawk/) versions &#8805; 4.0 ([mawk](https://invisible-island.net/mawk/) remains acceptable, but [goawk](https://github.com/benhoyt/goawk) is not recommended).
* Option `-c` can be useful to obtain a check list of the required/expected binaries available in the `$PATH`, as well as their respective version (especially for the AWK interpreter).
......@@ -312,7 +313,7 @@ SRR001666_2.fastq.gz 7047668 253716048 36.0 18 40 40 26 30 34
```
All statistics are identical to the ones previously estimated (see above), but the overall running time was 8 times slower...
For comparison, when used with default options, _fastq_info_ is expected to run ~1.5 times faster than [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc) to process one FASTQ file.
For comparison, when used with default options, _fastq_info_ is expected to run from 1.5 to 2 times faster than [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc) to process one FASTQ file.
......
......@@ -88,17 +88,17 @@
# #
# -- DSRC 2 (sun.aei.polsl.pl/dsrc) ----------------------------------------------------------------------- #
# #
DSRC="dsrc";
UNDSRC="dsrc d -s";
DSRC=dsrc;
UNDSRC="$DSRC d -s";
# #
# -- FQZCOMP 4 (github.com/jkbonfield/fqzcomp) ------------------------------------------------------------ #
# #
FQZCOMP="fqzcomp";
FQZCOMP=fqzcomp;
UNFQZCOMP="$FQZCOMP -d";
# #
# -- QUIP (github.com/dcjones/quip) ----------------------------------------------------------------------- #
# #
QUIP="quip";
QUIP=quip;
UNQUIP="$QUIP -d -c";
# #
##############################################################################################################
......@@ -130,8 +130,7 @@ mandoc() {
v2.0 (sun.aei.polsl.pl/dsrc); decompressed using DSRC v2.0
(when available in \$PATH)
.fastq
.fq
.txt ..... considered as uncompressed FASTQ-formatted files
.fq ...... considered as uncompressed FASTQ-formatted files
.fqz ..... considered as FASTQ-formatted files compressed using
fqzcomp v4 (github.com/jkbonfield/fqzcomp); decompressed
......@@ -154,7 +153,7 @@ mandoc() {
-p <int> Phred quality offset (default: 33)
-d DOS end-of-lines in input file(s) (default: not set)
-t <int> number of thread(s) for decompressing files (default: 1)
-a AWK interpreter (default: gawk or awk in \$PATH)
-a <prog> AWK interpreter (default: gawk or awk in \$PATH)
-c checks available tools (default: not set)
-h prints this help and exits
......@@ -194,6 +193,8 @@ do
\?) mandoc ; exit 1 ;;
esac
done
shift "$(( $OPTIND - 1 ))"
if ! [[ $PHRED =~ ^[0-9]+$ ]]; then echo " incorrect value (option -p): $PHRED" >&2 ; exit 1 ; fi
if [ $PHRED -lt 33 ]; then echo " Phred should be >32 (option -p): $PHRED" >&2 ; exit 1 ; fi
if ! [[ $SPEED =~ ^[0-9]+$ ]]; then echo " incorrect value (option -s): $SPEED" >&2 ; exit 1 ; fi
......@@ -256,38 +257,41 @@ trap "disown -a ; kill -9 $(jobs -pr) &> /dev/null ; rm -f $TMP1 $TMP2 &>/dev/nu
for INFILE in "$@"
do
if [ ! -e $INFILE ] || [ ! -f $INFILE ] || [ ! -r $INFILE ] || [ "$INFILE" == "$AWK" ]; then continue; fi
if [ ! -e $INFILE ]; then echo "[WARNING] not exist; skipping $INFILE" >&2 ; continue ; fi
if [ -d $INFILE ]; then echo "[WARNING] not a file; skipping $INFILE" >&2 ; continue ; fi
if [ ! -f $INFILE ]; then echo "[WARNING] not regular; skipping $INFILE" >&2 ; continue ; fi
if [ ! -r $INFILE ]; then echo "[WARNING] no read permission; skipping $INFILE" >&2 ; continue ; fi
EXT=${INFILE##*.}; EXT=${EXT^^};
if [ "$EXT" == "FASTQ" ] || [ "$EXT" == "FQ" ] || [ "$EXT" == "TXT" ] ########################################################
if [ "$EXT" == "FASTQ" ] || [ "$EXT" == "FQ" ] ################################################################################ fastq
then READFILE="cat $INFILE";
elif [ "$EXT" == "GZIP" ] || [ "$EXT" == "GZ" ] ###############################################################################
elif [ "$EXT" == "GZIP" ] || [ "$EXT" == "GZ" ] ################################################################################ fastq.gz
then
if [ $NTHREADS -eq 1 ]
then READFILE="$GUNZIP $INFILE ";
elif [ ! $(command -v $PGZIP) ]
then echo "[WARNING] $PGZIP non available; using $GZIP on 1 thread instead" >&2 ; READFILE="$GUNZIP $INFILE";
else READFILE="$PGUNZIP -p $NTHREADS $INFILE"; fi
elif [ "$EXT" == "BZIP" ] || [ "$EXT" == "BZ" ] || [ "$EXT" == "BZIP2" ] || [ "$EXT" == "BZ2" ] ##############################
elif [ "$EXT" == "BZIP" ] || [ "$EXT" == "BZ" ] || [ "$EXT" == "BZIP2" ] || [ "$EXT" == "BZ2" ] ############################### fastq.bz2
then
if [ $NTHREADS -eq 1 ]
then READFILE="$BUNZIP $INFILE";
elif [ ! $(command -v $PBZIP) ]
then echo "[WARNING] $PBZIP non available; using $BZIP on 1 thread instead" >&2 ; READFILE="$BUNZIP $INFILE";
else READFILE="$PBUNZIP -p $NTHREADS $INFILE"; fi
elif [ "$EXT" == "DSRC" ] || [ "$EXT" == "DSRC2" ] #############################################################################
elif [ "$EXT" == "DSRC" ] || [ "$EXT" == "DSRC2" ] ############################################################################## fastq.dsrc2
then
if [ ! $(command -v $DSRC) ]
then echo "[WARNING] $DSRC non available; skipping $INFILE" >&2 ; continue ;
else READFILE="$UNDSRC -t$NTHREADS $INFILE"; fi
elif [ "$EXT" == "FQZ" ] #######################################################################################################
elif [ "$EXT" == "FQZ" ] ######################################################################################################## fastq.fqz
then
if [ ! $(command -v $FQZCOMP) ]
then echo "[WARNING] $FQZCOMP non available; skipping $INFILE" >&2 ; continue ;
elif [ $NTHREADS -eq 1 ]
then READFILE="$UNFQZCOMP -P $INFILE";
else READFILE="$UNFQZCOMP $INFILE"; fi
elif [ "$EXT" == "QP" ] ########################################################################################################
elif [ "$EXT" == "QP" ] ######################################################################################################### fastq.qp
then
if [ ! $(command -v $QUIP) ]
then echo "[WARNING] $QUIP non available; skipping $INFILE" >&2 ; continue ;
......@@ -297,9 +301,9 @@ do
$DOS2UNIX && READFILE="$READFILE | tr -d '\r'" ;
if ! $POSIX92
then ####################################################################################################### GNU AWK v4+ #######
then ############################################################################################################################ GNU AWK v4+
$READFILE |
tee >( sed -n '4~4p' | wc -l -c > $TMP1 ) |
tee >( sed -n '4~4p' | LC_ALL=C wc -l -c > $TMP1 ) |
sed -n "2~$STEP p" |
$AWK -v phred=$PHRED '
BEGIN{
......@@ -366,9 +370,9 @@ do
printf"-------------------------------------";q=0;while(++q<=qmax)printf"------";print"";
print((nr<np)?np:nr);
}' > $TMP2 ;
else ################################################################################################## AWK POSIX 1003.2 #######
else ############################################################################################################################ AWK POSIX 1003.2
$READFILE |
tee >( sed -n '4~4p' | wc -l -c > $TMP1 ) |
tee >( sed -n '4~4p' | LC_ALL=C wc -l -c > $TMP1 ) |
sed -n "2~$STEP p" |
$AWK -v phred=$PHRED '
BEGIN{
......@@ -438,13 +442,13 @@ do
fi
wait ;
NR=$($AWK '{print$1}' $TMP1);
NB=$($AWK '{print$2}' $TMP1);
NR=`$AWK '{print$1}' $TMP1`;
NB=`$AWK '{print$2}' $TMP1`;
NB=$(( $NB - $NR ));
AL=$(bc -l <<<"scale=1;$NB/$NR");
SS=$(sed -n '$p' $TMP2);
AL=`bc -l <<<"scale=1;$NB/$NR"`;
SS=`sed -n '$p' $TMP2`;
if [ "$OUT" == "r" ] || [ "$OUT" == "f" ]
if [ "$OUT" == "r" ] || [ "$OUT" == "f" ]
then
echo "##File: $INFILE" ;
echo "#no.reads: $NR" ;
......@@ -455,19 +459,21 @@ do
sed '$d' $TMP2 | cut -c1-30 ;
else
sed '$d' $TMP2 ;
echo "#subsampling.rate: "$(bc -l <<<"scale=3;$SS/$NR" | sed 's/^0$/0.00/;s/^\./0./') ;
echo "#subsampling.rate: "`bc -l <<<"scale=3;$SS/$NR" | sed 's/^0$/0.00/;s/^\./0./'` ;
fi
echo ;
fi
if [ "$OUT" == "t" ]
elif [ "$OUT" == "t" ]
then
BQ123="$(tail $TMP2 | grep "^bases " $TMP2 | cut -c22-30 | xargs echo | tr ' ' '\t')";
RQ123="$(tail $TMP2 | grep "^reads " $TMP2 | cut -c22-30 | xargs echo | tr ' ' '\t')";
BQ123="`tail $TMP2 | grep "^bases " $TMP2 | cut -c22-30 | xargs echo | tr ' ' '\t'`";
RQ123="`tail $TMP2 | grep "^reads " $TMP2 | cut -c22-30 | xargs echo | tr ' ' '\t'`";
echo -e "$INFILE\t$NR\t$NB\t$AL\t$BQ123\t$RQ123" ;
fi
rm -f $TMP1 $TMP2 ;
echo -n > $TMP1 ;
echo -n > $TMP2 ;
done
rm -f $TMP1 $TMP2 ;
exit ;
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment