@@ -4,23 +4,24 @@ _fqCleanER_ (fastq Cleaning and Enhancing Routine) is a command line tool writte
Eight standard HTS read processing steps can be carried out using _fqCleanER_:
+ contaminating HTS read removal, using [_AlienRemover_](https://gitlab.pasteur.fr/GIPhy/AlienRemover),
 ❶ contaminating HTS read removal, using [_AlienRemover_](https://gitlab.pasteur.fr/GIPhy/AlienRemover),
+ sequencing error correction, using [_Musket_](http://musket.sourceforge.net/homepage.htm)(Liu et al. 2013),
 ❷ sequencing error correction, using [_Musket_](http://musket.sourceforge.net/homepage.htm)(Liu et al. 2013),
+ HTS read deduplication, using [_fqduplicate_]() from the [_fqtools_](http://ftp.pasteur.fr/pub/gensoft/projects/fqtools/) package,
 ❸ HTS read deduplication, using _fqduplicate_ from the [fqtools](http://ftp.pasteur.fr/pub/gensoft/projects/fqtools/) package,
+ low-coverage HTS read removal, using [_ROCK_](https://gitlab.pasteur.fr/vlegrand/ROCK),
 ❹ low-coverage HTS read removal, using [_ROCK_](https://gitlab.pasteur.fr/vlegrand/ROCK),
+ digital normalization, using [_ROCK_](https://gitlab.pasteur.fr/vlegrand/ROCK),
 ❺ digital normalization, using [_ROCK_](https://gitlab.pasteur.fr/vlegrand/ROCK),
+ paired-ends HTS read merging, using [_FLASh_](https://ccb.jhu.edu/software/FLASH/)(Magoc and Salzberg 2011),
 ❻ paired-ends HTS read merging, using [_FLASh_](https://ccb.jhu.edu/software/FLASH/)(Magoc and Salzberg 2011),
+ high-coverage (redundant) HTS read reduction, using [_ROCK_](https://gitlab.pasteur.fr/vlegrand/ROCK),
 ❼ high-coverage (redundant) HTS read reduction, using [_ROCK_](https://gitlab.pasteur.fr/vlegrand/ROCK),
+ HTS read trimming and clipping, using [_AlienTrimmer_](https://research.pasteur.fr/en/software/alientrimmer/)(Criscuolo and Brisse 2013).
 ❽ HTS read trimming and clipping, using [_AlienTrimmer_](https://research.pasteur.fr/en/software/alientrimmer/)(Criscuolo and Brisse 2013).
All these steps can be performed in any order on up to three paired- and/or single-end FASTQ files (compressed or not).
_fqCleanER_ runs on UNIX, Linux and most OS X operating systems.
...
...
@@ -35,19 +36,19 @@ You will need to install the required programs listed in the following table, or
**D.** If at least one of the required program (see Requirements) is not available on your `$PATH` variable (or if one compiled binary has a different default name), _fqCleanER_ will exit with an error message.
When running _fqCleanER_ without option, a documentation should be displayed; otherwise, the name of the missing program is displayed.
**D.** If at least one of the required programs (see [Dependencies](#dependencies)) is not available on your `$PATH` variable (or if a compiled binary has a name different from the default one), _fqCleanER_ will exit with an error message.
When _fqCleanER_ is run without any option, its documentation should be displayed if all required programs are found; otherwise, the name of the missing program is displayed before exiting.
In such a case, edit the file `fqCleanER.sh` and indicate the local path to the corresponding binary(ies) within the code block `REQUIREMENTS` (approximately lines 70-200).
For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block `REQUIREMENTS`
...
...
@@ -94,7 +95,7 @@ For each required program, the table below reports the corresponding variable as
</div>
Note that depending on the installation of some required programs, the corresponding variable can be assigned with complex commands.
For example, as _AlienTrimmer_ is a Java tool that can be run using a Java virtual machine, the executable jar file `AlienTrimmer.jar` can be used by _fqCleanER_ by editing the corresponding variable assignment instruction as follows: `ALIENTRIMMER_BIN="java -jar AlienTrimmer.jar"`.
For example, as _AlienTrimmer_ is a Java tool that can be run using a Java virtual machine, the executable jar file `AlienTrimmer.jar` can be used by _fqCleanER_ after editing the corresponding variable assignment instruction as follows: `ALIENTRIMMER_BIN="java -jar AlienTrimmer.jar"`.
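
Before editing the `REQUIREMENTS` block, it can help to check which commands are actually reachable from your `$PATH`. The following is only a minimal sketch, not code taken from `fqCleanER.sh`: the function name `check_required` is hypothetical, and it only tests the first word of each command string (e.g. `java` for `java -jar AlienTrimmer.jar`).

```shell
# Hypothetical helper (not part of fqCleanER): checks that the first word of
# each command string given as argument resolves to an executable on $PATH.
check_required() {
  for cmd in "$@"; do
    exe=${cmd%% *}   # "java -jar AlienTrimmer.jar" -> "java"
    if ! command -v "$exe" >/dev/null 2>&1; then
      echo "missing program: $exe" >&2
      return 1
    fi
  done
  return 0
}

# Example call (commented out; adapt the list to your own installation):
# check_required "gzip" "java -jar AlienTrimmer.jar" || echo "edit the REQUIREMENTS block"
```

The function stops at the first command that is not found and prints its name on the standard error stream.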
## Usage
...
...
@@ -163,7 +164,7 @@ Run _fqCleanER_ without option to read the following documentation:
-z <string> compressed output file(s) using gzip ("gz"), bzip2 ("bz2") or DSRC ("dsrc")
@@ -178,7 +179,7 @@ Run _fqCleanER_ without option to read the following documentation:
* Output files are defined by a specified prefix (mandatory option `-b`) and written in a specified output directory (mandatory option `-o`). Output files can be compressed using [_gzip_](https://www.gnu.org/software/gzip/), [_bzip2_](https://sourceware.org/bzip2/) or [_DSRC_](http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=dsrc&subpage=about) (option `-z`).
* Temporary files are written into a dedicated directory created into the `$TMPDIR` directory (when defined, otherwise `tmp/`). When possible, it is highly recommended to set a temp directory with large capacity (option `-w`).
* Temporary files are written into a dedicated directory created inside the `$TMPDIR` directory (when defined, otherwise `/tmp`). When possible, it is highly recommended to set a temp directory with large capacity (option `-w`).
* The cleaning/enhancing steps can be specified using option `-s` in any order. The same step can be specified several times (e.g. `-s DTDNEN`).
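
The `$TMPDIR` fallback described in the list above can be sketched in shell as follows. This is an illustration under assumptions, not the code used by _fqCleanER_: the function name `make_work_dir` and the `fqcleaner.XXXXXX` name template are hypothetical, the option `-w` is not modelled, and an empty `$TMPDIR` is treated like an undefined one.

```shell
# Hypothetical helper: creates a dedicated temporary directory and prints its path.
make_work_dir() {
  # Base directory: $TMPDIR when defined (and non-empty), otherwise /tmp.
  base_tmp=${TMPDIR:-/tmp}
  # mktemp -d replaces the XXXXXX suffix to create a uniquely named directory.
  mktemp -d "$base_tmp/fqcleaner.XXXXXX"
}

# Example call (commented out):
# work_dir=$(make_work_dir) && echo "temporary files go to: $work_dir"
```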
...
...
@@ -188,11 +189,11 @@ Run _fqCleanER_ without option to read the following documentation:
**[E]** Sequencing error correction (`-s E`) is performed using [_Musket_](http://musket.sourceforge.net/homepage.htm)(Liu et al. 2013) with _k_-mer length _k_ = 21. This step generally requires long running times and will benefit from a large number of threads (option `-t`).
**[L][N][R]** These three steps (`-s L`, `-s N`, `-s R`, respectively) are related to the digital normalization procedure (Brown et al. 2012), performed using [_ROCK_](https://gitlab.pasteur.fr/vlegrand/ROCK) with _k_-mer length _k_ = 25. Given a lower-bound and a upper-bound coverage depth thresholds (options `-c` and `-C`, respectively), the digital normalization selects a subset of HTS reads such that every sequenced base has a coverage depth between these two bounds. When setting a moderate upper-bound (that is lower than the overall average coverage depth; default: `-C 90`), every sequenced base from the selected HTS read subset is expected to have a coverage depth close to this bound. When setting a small lower-bound (default: `-c 4`), all HTS reads corresponding to a sequenced region with coverage depth lower than this bound will be discarded (e.g. artefactual or erroneous HTS read, low-coverage contaminating HTS read). Step N (`-s N`) uses the two bounds (options `-C` and `-c`), whereas steps L and R (`-s L` and `-s R`, respectively) use only the lower- and upper-bounds, respectively.
**[L][N][R]** These three steps (`-s L`, `-s N`, `-s R`, respectively) are related to the digital normalization procedure (e.g. Brown et al. 2012, Wedemeyer et al. 2017, Durai and Schulz 2019), performed using [_ROCK_](https://gitlab.pasteur.fr/vlegrand/ROCK) with _k_-mer length _k_ = 25. Given lower-bound and upper-bound coverage depth thresholds (options `-c` and `-C`, respectively), the digital normalization selects a subset of HTS reads such that every sequenced base has a coverage depth between these two bounds. When setting a moderate upper-bound (that is lower than the overall average coverage depth; default: `-C 90`), every sequenced base from the selected HTS read subset is expected to have a coverage depth close to this bound. When setting a small lower-bound (default: `-c 4`), all HTS reads corresponding to a sequenced region with coverage depth lower than this bound will be discarded (e.g. artefactual or erroneous HTS read, low-coverage contaminating HTS read). Step N (`-s N`) uses the two bounds (options `-C` and `-c`), whereas steps L/R (`-s L` and `-s R`, respectively) use only the lower-/upper-bound, respectively.
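
As a quick reminder of which coverage bound matters for which step, the mapping described above can be written as a small shell function. This is only a sketch: the function name `bounds_for_step` is hypothetical and nothing here is taken from `fqCleanER.sh`; the defaults (4 and 90) are the ones quoted above.

```shell
# Hypothetical helper: prints the coverage-bound options relevant to a step.
# Usage: bounds_for_step STEP [LOWER] [UPPER]
bounds_for_step() {
  step=$1
  lower=${2:-4}    # default lower-bound, as for option -c
  upper=${3:-90}   # default upper-bound, as for option -C
  case "$step" in
    L) echo "-c $lower" ;;             # low-coverage read removal: lower-bound only
    N) echo "-c $lower -C $upper" ;;   # digital normalization: both bounds
    R) echo "-C $upper" ;;             # redundant read reduction: upper-bound only
    *) echo "step $step does not use coverage bounds" >&2; return 1 ;;
  esac
}
```

For instance, `bounds_for_step N` prints `-c 4 -C 90`, whereas `bounds_for_step R 4 60` prints `-C 60`.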
**[M]** PE HTS read merging (`-s M`, only with PE input files) is performed using [_FLASh_](https://ccb.jhu.edu/software/FLASH/)(Magoc and Salzberg 2011) when the insert size is shorter than the sum of the two paired HTS read lengths. When using this step, dedicated output files are written (_.M.fastq_ file extension).
**[T]** Trimming and clipping (`-s T`) are performed using [_AlienTrimmer_](https://research.pasteur.fr/en/software/alientrimmer/)(Criscuolo and Brisse 2013). Clipping is carried out based on the specified alien oligonucleotides (option `-a`), where alien oligonucleotide sequences can be (i) set using precomputed standard library names, (ii) specified via user-defined FASTA-formatted file, or (iii) directly estimated from the input files using [_minion_](http://wwwdev.ebi.ac.uk/enright-dev/kraken/reaper/src/reaper-latest/doc/minion.html)(option`-a AUTO`). When step T is run without setting option `-a`, clipping is carried out with the four homopolymers as alien oligonucleotides. Trimming is carried out by deleting 5' and 3' regions containing many non-confident bases, where a base is considered as non-confident when its Phred score is lower than a Phred score threshold (set using option `-q`; default: 15). After trimming/clipping an HTS read, it can be discarded when the number of remaining bases is lower than a specified threshold (option `-l`; default: 50 bases) or when the percentage of remaining non-confident bases is higher than another specified threshold (option `-p`; default: 50%). Note that when HTS read discarding breaks a PE, singletons are written into dedicated output files (_.S.fastq_ file extension).
**[T]** Trimming and clipping (`-s T`) are performed using [_AlienTrimmer_](https://research.pasteur.fr/en/software/alientrimmer/)(Criscuolo and Brisse 2013). Clipping is carried out based on the specified alien oligonucleotides (option `-a`), where alien oligonucleotide sequences can be (i) set using precomputed standard library names, (ii) specified via user-defined FASTA-formatted file, or (iii) directly estimated from the input files using [_minion_](http://wwwdev.ebi.ac.uk/enright-dev/kraken/reaper/src/reaper-latest/doc/minion.html) (option `-a AUTO`). When step T is run without setting option `-a`, clipping is carried out with the four homopolymers as alien oligonucleotides. Trimming is carried out by deleting 5' and 3' regions containing many non-confident bases, where a base is considered as non-confident when its Phred score is lower than a Phred score threshold (set using option `-q`; default: 15). After trimming/clipping an HTS read, it can be discarded when the number of remaining bases is lower than a specified length threshold (option `-l`; default: 50 bases) or when the percentage of remaining non-confident bases is higher than another specified threshold (option `-p`; default: 50%). Note that when HTS read discarding breaks PE, singletons are written into dedicated output files (_.S.fastq_ file extension).
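
The discarding rule applied after trimming/clipping can be illustrated with the shell sketch below. It is not the _AlienTrimmer_ implementation: the function name `keep_read` is hypothetical, it only looks at the quality string of an already trimmed read, and it assumes the common Phred+33 quality encoding (an assumption not stated above).

```shell
# Hypothetical helper: returns 0 (keep) or 1 (discard) for a trimmed read,
# given its quality string. Assumes Phred+33 encoding.
# Usage: keep_read QUAL [Q] [MIN_LEN] [MAX_PCT]
keep_read() {
  qual=$1
  q=${2:-15}         # Phred score threshold, as for option -q
  min_len=${3:-50}   # minimum number of remaining bases, as for option -l
  max_pct=${4:-50}   # maximum percentage of non-confident bases, as for option -p
  len=${#qual}
  # Rule 1: discard when too few bases remain.
  [ "$len" -ge "$min_len" ] || return 1
  # Count non-confident bases (Phred score < q), i.e. ASCII codes 33 to 32+q.
  low=0
  if [ "$q" -gt 0 ]; then
    last=$(printf '%03o' $((32 + q)))   # octal code of the last non-confident character
    low=$(( $(printf '%s' "$qual" | tr -cd "\\041-\\$last" | wc -c) ))
  fi
  # Rule 2: discard when the percentage of non-confident bases is higher than max_pct.
  [ $((100 * low)) -le $((max_pct * len)) ]
}
```

With thresholds `-q 15`, `-l 5` and `-p 50`, the call `keep_read "###IIIII" 15 5 50` succeeds (3 non-confident bases out of 8), whereas `keep_read "#####III" 15 5 50` fails (5 out of 8).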
@@ -201,11 +202,15 @@ Run _fqCleanER_ without option to read the following documentation:
Brown TC, Howe A, Zhang Q, Pyrkosz AB, Brom TH (2012) _A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data_. [arXiv:1203.4802](https://arxiv.org/abs/1203.4802).
Criscuolo A, Brisse S (2013) _AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads_. Genomics, 102(5-6):500-506. [doi:10.1016/j.ygeno.2013.07.011](https://doi.org/10.1016/j.ygeno.2013.07.011).
Criscuolo A, Brisse S (2013) _AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads_. **Genomics**, 102(5-6):500-506. [doi:10.1016/j.ygeno.2013.07.011](https://doi.org/10.1016/j.ygeno.2013.07.011).
Durai DA, Schulz MH (2019) _Improving in-silico normalization using read weights_. **Scientific Reports**, 9:5133. [doi:10.1038/s41598-019-41502-9](https://doi.org/10.1038/s41598-019-41502-9).
Liu Y, Schröder J, Schmidt B (2013) _Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data_. **Bioinformatics**, 29(3):308-315. [doi:10.1093/bioinformatics/bts690](https://doi.org/10.1093/bioinformatics/bts690).
Liu Y, Schröder J, Schmidt B (2013) _Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data_. Bioinformatics, 29(3):308-315. [doi:10.1093/bioinformatics/bts690](https://doi.org/10.1093/bioinformatics/bts690).
Magoc T, Salzberg S (2011) _FLASH: Fast length adjustment of short reads to improve genome assemblies_. **Bioinformatics**, 27(21):2957-2963. [doi:10.1093/bioinformatics/btr507](https://doi.org/10.1093/bioinformatics/btr507).
Magoc T, Salzberg S (2011) _FLASH: Fast length adjustment of short reads to improve genome assemblies_. Bioinformatics, 27:21:2957-2963. [doi:10.1093/bioinformatics/btr507](https://doi.org/10.1093/bioinformatics/btr507).
Roguski L, Deorowicz S (2014) _DSRC 2: Industry-oriented compression of FASTQ files_. **Bioinformatics**, 30(15):2213-2215. [doi:10.1093/bioinformatics/btu208](https://doi.org/10.1093/bioinformatics/btu208).
Roguski L, Deorowicz S (2014) _DSRC 2: Industry-oriented compression of FASTQ files_. Bioinformatics, 30(15):2213-2215. [doi:10.1093/bioinformatics/btu208](https://doi.org/10.1093/bioinformatics/btu208).
Wedemeyer A, Kliemann L, Srivastav A, Schielke C, Reusch TB, Rosenstiel P (2017) _An improved filtering algorithm for big read datasets and its application to single-cell assembly_. **BMC Bioinformatics**, 18:324. [doi:10.1186/s12859-017-1724-7](https://doi.org/10.1186/s12859-017-1724-7).