_ROCK_ (Reducing Over-Covering K-mers) is a command line program written in [C++](https://isocpp.org/) that provides an alternative implementation of the _digital normalization_ method (e.g. Brown et al. 2012, Wedemeyer et al. 2017, Durai and Schulz 2019).
_ROCK_ can be used to reduce the overall coverage depth within large sets of high-throughput sequencing (HTS) reads contained in one or several FASTQ file(s).
_ROCK_ can also be used to discard low-covering HTS reads that are often artefactual, highly erroneous or contaminating sequences.
## Installation
After cloning this repository, on computers with [gcc](https://ftp.gnu.org/gnu/gcc/) (version at least 4.4.7) or [clang](https://clang.llvm.org/) (version at least 503.0.40) installed, the compilation and installation of _ROCK_ can be carried out using the following command lines:
```bash
./configure
make
make check
make install
```
This will create the executable _rock_ that can be run with the following command line model:
```bash
./rock [options] [files]
```
## Usage
Run _ROCK_ without any option to display the following documentation:
```
Reducing Over-Covering K-mers within FASTQ file(s)
USAGE: rock [options] [files]
OPTIONS:
-i <file> file containing the name(s) of the input FASTQ file(s) to
process; single-end: one file name per line; paired-end:
two file names per line separated by a comma; up to 15
FASTQ file names can be specified; of note, input file
name(s) can also be specified as program argument(s)
-o <file> file containing the name(s) of the output FASTQ file(s);
FASTQ file name(s) should be structured in the same way as
the input file name(s) specified with option -i
-k <int> length of the k-mers (default: 25)
-C <int> upper-bound k-mer coverage depth cutoff
-c <int> lower-bound k-mer coverage depth cutoff (default: 0)
-l <int> number of hashing function(s) (default: 4)
-n <int> expected total number of distinct k-mers within the input
read sequences; not compatible with option -l.
-f <float> maximum expected false positive probability when computing
the optimal number of hashing functions from the number of
distinct k-mers specified with option -n (default: 0.05).
-q <int> sets as valid only k-mers made up of nucleotides with
Phred score (+33 offset) above this cutoff (default: 0)
-m <int> minimum number of valid k-mer(s) to consider a read; all
non-considered reads are written into output file(s) with
extension undetermined.fastq (default: 1)
-v verbose mode
-h prints this message and exits
```
## Notes
* In brief, given an upper-bound _k_-mer coverage depth cutoff _κ_ (option `-C`) and a lower-bound _k_-mer coverage depth cutoff _κ'_ (option `-c`), the aim of _ROCK_ is to select an HTS read subset _S_ such that the overall _k_-mer coverage depth induced by the members of _S_ is expected to always lie between _κ'_ and _κ_. When considering FASTQ files with high redundancy (i.e. coverage depth greater than _κ_), _ROCK_ therefore returns smaller FASTQ files in which each HTS read corresponds to a genome region with a _k_-mer coverage depth of at most _κ_. Setting _κ'_ (option `-c`) makes it possible to discard HTS reads associated with a _k_-mer coverage depth lower than this lower-bound cutoff, as is often observed with artefactual, highly erroneous or contaminating HTS reads.
* After creating an empty count-min sketch (CMS; see below) to store the number of occurrences of every canonical _k_-mer shared by the input HTS reads, _ROCK_ proceeds in three main steps:
  1. Sorting the input SE/PE HTS reads from the most to the least accurate ones (as defined by the sum of their Phred scores).
  2. For each sorted SE/PE HTS read(s), approximating its _k_-mer coverage depth _c_ (defined by the median of its _k_-mer occurrence values, as returned by the CMS); when _c_ ≤ _κ_, adding the SE/PE HTS read(s) into the subset _S_ and updating the CMS for every corresponding canonical _k_-mer.
  3. (when the lower-bound cutoff _κ'_ > 0) For each SE/PE HTS read(s) in _S_, (re)approximating its _k_-mer coverage depth _c_; when _c_ ≤ _κ'_, removing the SE/PE HTS read(s) from _S_.
At the end, all SE/PE HTS read(s) inside _S_ are written into output FASTQ file(s) (by default, file extension _.rock.fastq_).
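The three steps above can be sketched in a few lines of Python. This is an illustrative simplification, not _ROCK_'s C++ implementation: it handles single-end reads only and uses a plain dictionary in place of the count-min sketch.

```python
from statistics import median

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return min(kmer, rc)

def kmers(seq, k):
    """All canonical k-mers of a read sequence."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def normalize(reads, k, kappa, kappa_prime):
    """reads: list of (sequence, phred_sum) tuples; returns the selected subset S."""
    counts = {}  # stand-in for the count-min sketch
    # Step 1: sort from most to least accurate (highest Phred sum first).
    reads = sorted(reads, key=lambda r: r[1], reverse=True)
    subset = []
    # Step 2: keep a read only while its median k-mer count is still <= kappa.
    for seq, _ in reads:
        kms = kmers(seq, k)
        if median(counts.get(km, 0) for km in kms) <= kappa:
            subset.append(seq)
            for km in kms:
                counts[km] = counts.get(km, 0) + 1
    # Step 3: discard kept reads whose final median k-mer count is <= kappa_prime.
    if kappa_prime > 0:
        subset = [s for s in subset
                  if median(counts[km] for km in kmers(s, k)) > kappa_prime]
    return subset
```

With five copies of the same read and _κ_ = 2, only the first (most accurate) copy is kept; the redundant ones are dropped as soon as the sketch reports their region as over-covered.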
* _ROCK_ stores the number of occurrences of every traversed canonical _k_-mer in a count-min sketch (CMS; e.g. Cormode and Muthukrishnan 2005), a dedicated probabilistic data structure with controllable false positive probability (FPP). By default, _ROCK_ instantiates a CMS based on four hashing functions (option `-l`), which can be sufficient for many cases (e.g. up to 10 billion canonical _k_-mers with _κ'_ = 2, _κ_ > _κ'_ and FPP < 5%). However, it is highly recommended to provide the expected number _F_<sub>0</sub> of canonical _k_-mers (option `-n`) to enable _ROCK_ to compute the optimal CMS dimensions required to store this specified number of canonical _k_-mers with a low FPP (option `-f`). For instance, the programs [_KMC_](https://github.com/refresh-bio/KMC) (Deorowicz et al. 2013, 2015; Kokot et al. 2017), [_KmerStream_](https://github.com/pmelsted/KmerStream) (Melsted and Halldórsson 2014) or [_ntCard_](https://github.com/bcgsc/ntCard) (Mohamadi et al. 2017) can be used to quickly approximate this number (_F_<sub>0</sub>).
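The CMS principle is compact enough to sketch in Python. This is a generic textbook version (Cormode and Muthukrishnan 2005), not _ROCK_'s implementation; the hash construction (salted BLAKE2b) and the table width are illustrative choices. Each of the _l_ rows holds _w_ counters; inserting an item increments one counter per row, and a point query returns the minimum across rows, since collisions can only inflate counters.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: l rows of w counters, one hash per row."""

    def __init__(self, l=4, w=1 << 20):
        self.l, self.w = l, w
        self.rows = [[0] * w for _ in range(l)]

    def _indexes(self, item):
        # One salted hash per row (illustrative choice of hash function).
        for i in range(self.l):
            h = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.w

    def add(self, item):
        for i, j in enumerate(self._indexes(item)):
            self.rows[i][j] += 1

    def count(self, item):
        # The minimum over rows is an upper bound on the true count,
        # exact with high probability when the table is large enough.
        return min(self.rows[i][j] for i, j in enumerate(self._indexes(item)))
```

Memory is fixed at l × w counters regardless of how many distinct _k_-mers are inserted, which is why supplying _F_<sub>0</sub> lets the dimensions be tuned to a target FPP.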
* Of important note, the upper- and lower-bound cutoffs (options `-C` and `-c`, respectively) each correspond to a _k_-mer coverage depth value (denoted here _c_<sub>_k_</sub>), which is quite different from the base coverage depth value (denoted here _c_<sub>_b_</sub>). However, when _L_ is the average input HTS read length, the ratios _c_<sub>_b_</sub> / _c_<sub>_k_</sub> and _L_ / (_L_ - _k_ + 1) are expected to be identical for any fixed small _k_ (e.g. Liu et al. 2013). Consequently, when an overall (base) coverage depth _c_<sub>_b_</sub> is expected, one can set _κ_ = _c_<sub>_b_</sub> (_L_ - _k_ + 1) / _L_. For example, when dealing with HTS reads of length _L_ = 144 (on average), an HTS read subset with expected base coverage depth _c_<sub>_b_</sub> = 60x can be inferred by _ROCK_ by setting _k_ = 25 (option `-k`) and _κ_ = 60 × (144 - 25 + 1) / 144 = 50 (option `-C`).
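The conversion above is a one-liner; here it is with the worked example from the text (function name is illustrative):

```python
def upper_bound_cutoff(c_b, L, k):
    """Convert a target base coverage depth c_b into the k-mer coverage
    depth cutoff kappa to pass to option -C, for reads of average length L."""
    return c_b * (L - k + 1) / L

# Worked example: L = 144, k = 25, target base coverage 60x.
kappa = upper_bound_cutoff(60, 144, 25)  # 60 * 120 / 144 = 50.0
```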
* By default, _ROCK_ uses _k_-mers of length _k_ = 25 (option `-k`). Increasing this length is not recommended when dealing with large FASTQ files (e.g. average coverage depth > 500x of genomes of length > 1 Gbps), as the total number of distinct canonical _k_-mers can quickly grow, therefore implying a very large CMS (i.e. many hashing functions) to maintain a low FPP (e.g. < 0.05). Using small _k_-mers (e.g. _k_ < 21) is also not recommended, as this decreases the overall specificity (e.g. too many identical _k_-mers arising from different sequenced genome regions).
* All _ROCK_ steps are based on the usage of valid _k_-mers, i.e. _k_-mers that do not contain any undetermined base `N`. Valid _k_-mers can also be restricted to those whose bases are associated with a Phred score greater than a specified cutoff (option `-q`). A minimum number of valid _k_-mers can be required for an SE/PE HTS read to be considered (option `-m`; default: 1). All SE/PE HTS read(s) that do not contain enough valid _k_-mers are written into FASTQ file(s) with extension _.undetermined.fastq_.
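Valid _k_-mer extraction can be sketched as follows. This is an illustrative Python sketch, assuming (as the help text suggests) that every base of a valid _k_-mer must score strictly above the `-q` cutoff:

```python
def valid_kmers(seq, quals, k, q=0):
    """Yield every k-mer of seq that contains no 'N' and whose bases all
    have a Phred score (quals string, +33 offset) strictly above q."""
    phred = [ord(c) - 33 for c in quals]
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if "N" not in window and min(phred[i:i + k]) > q:
            yield window
```

For example, with `seq = "ACGTNACGT"` and uniformly high qualities, only the two windows that avoid the `N` (the leading and trailing `ACGT`) are yielded for _k_ = 4; a read yielding fewer than `-m` such windows would go to the _.undetermined.fastq_ output.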
## References
Brown CT, Howe A, Zhang Q, Pyrkosz AB, Brom TH (2012) _A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data_. arXiv:[1203.4802v2](https://arxiv.org/abs/1203.4802v2).
Cormode G, Muthukrishnan S (2005) _An Improved Data Stream Summary: The Count-Min Sketch and its Applications_. Journal of Algorithms, 55:29-38. [doi:10.1016/j.jalgor.2003.12.001](https://doi.org/10.1016/j.jalgor.2003.12.001).
Deorowicz S, Debudaj-Grabysz A, Grabowski S (2013) _Disk-based k-mer counting on a PC_. BMC Bioinformatics, 14:160. [doi:10.1186/1471-2105-14-160](https://doi.org/10.1186/1471-2105-14-160).
Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) _KMC 2: Fast and resource-frugal k-mer counting_. Bioinformatics, 31(10):1569-1576. [doi:10.1093/bioinformatics/btv022](https://doi.org/10.1093/bioinformatics/btv022).
Durai DA, Schulz MH (2019) _Improving in-silico normalization using read weights_. Scientific Reports, 9:5133. [doi:10.1038/s41598-019-41502-9](https://doi.org/10.1038/s41598-019-41502-9).
Kokot M, Długosz M, Deorowicz S (2017) _KMC 3: counting and manipulating k-mer statistics_. Bioinformatics, 33(17):2759-2761. [doi:10.1093/bioinformatics/btx304](https://doi.org/10.1093/bioinformatics/btx304).
Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, Li Z, Chen Y, Mu D, Fan W (2013) _Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects_. arXiv:[1308.2012v2](https://arxiv.org/abs/1308.2012v2).
Melsted P, Halldórsson BV (2014) _KmerStream: streaming algorithms for k-mer abundance estimation_. Bioinformatics, 30(24):3541-3547. [doi:10.1093/bioinformatics/btu713](https://doi.org/10.1093/bioinformatics/btu713).
Mohamadi H, Khan H, Birol I (2017) _ntCard: a streaming algorithm for cardinality estimation in genomics data_. Bioinformatics, 33(9):1324-1330. [doi:10.1093/bioinformatics/btw832](https://doi.org/10.1093/bioinformatics/btw832).
Wedemeyer A, Kliemann L, Srivastav A, Schielke C, Reusch TB, Rosenstiel P (2017) _An improved filtering algorithm for big read datasets and its application to single-cell assembly_. BMC Bioinformatics, 18:324. [doi:10.1186/s12859-017-1724-7](https://doi.org/10.1186/s12859-017-1724-7).