_ROCK_ (Reducing Over-Covering K-mers) is a command line program written in [C++](https://isocpp.org/) that provides an alternative implementation of the _digital normalization_ method (e.g. Brown et al. 2012, Wedemeyer et al. 2017, Durai and Schulz 2019).
_ROCK_ can be used to reduce the overall coverage depth within large sets of high-throughput sequencing (HTS) reads contained in one or several FASTQ file(s).
_ROCK_ can also be used to discard low-covering HTS reads that are often artefactual, highly erroneous or contaminating sequences.
## Installation
After cloning this repository, on computers with [gcc](https://ftp.gnu.org/gnu/gcc/) (version at least 4.4.7) or [clang](https://clang.llvm.org/) (version at least 503.0.40) installed, the compilation and installation of _ROCK_ can be carried out using the following command lines:
```bash
./configure
make
make check
make install
```
This will create the executable _rock_ that can be run with the following command line model:
```bash
./rock [options] [files]
```
## Usage
Run _ROCK_ without any option to display the following documentation:
```
Reducing Over-Covering K-mers within FASTQ file(s)
USAGE: rock [options] [files]
OPTIONS:
-i <file> file containing the name(s) of the input FASTQ file(s) to
process; single-end: one file name per line; paired-end:
two file names per line separated by a comma; up to 15
FASTQ file names can be specified; of note, input file
name(s) can also be specified as program argument(s)
-o <file> file containing the name(s) of the output FASTQ file(s);
FASTQ file name(s) should be structured in the same way as
the input file name(s) specified with option -i
-k <int> length of the k-mers (default: 25)
-C <int> upper-bound k-mer coverage depth cutoff
-c <int> lower-bound k-mer coverage depth cutoff (default: 0)
-l <int> number of hashing function(s) (default: 4)
-n <int> expected total number of distinct k-mers within the input
read sequences; not compatible with option -l.
-f <float> maximum expected false positive probability when computing
the optimal number of hashing functions from the number of
distinct k-mers specified with option -n (default: 0.05).
-q <int> sets as valid only k-mers made up of nucleotides with
Phred score (+33 offset) above this cutoff (default: 0)
-m <int> minimum number of valid k-mer(s) to consider a read; all
non-considered reads are written into output file(s) with
extension undetermined.fastq (default: 1)
-v verbose mode
-h prints this message and exits
```
## Notes
* In brief, given an upper-bound _k_-mer coverage depth cutoff _κ_ (option `-C`) and a lower-bound _k_-mer coverage depth cutoff _κ'_ (option `-c`), the aim of _ROCK_ is to select an HTS read subset _S_ such that the overall _k_-mer coverage depth induced by the members of _S_ is expected to always lie between _κ'_ and _κ_. When considering FASTQ files with high redundancy (i.e. coverage depth greater than _κ_), _ROCK_ therefore returns smaller FASTQ files in which each HTS read corresponds to a genome region with a _k_-mer coverage depth of at most _κ_. Setting _κ'_ (option `-c`) makes it possible to discard HTS reads associated with a _k_-mer coverage depth lower than this lower-bound cutoff, as is often observed with artefactual, highly erroneous or contaminating HTS reads.
* After creating an empty count-min sketch (CMS; see below) to store the number of occurrences of every canonical _k_-mer shared by the input HTS reads, _ROCK_ proceeds in three main steps:
  1. Sorting the input SE/PE HTS reads from the most to the least accurate ones (as defined by the sum of their Phred scores).
  2. For each sorted SE/PE HTS read(s), approximating its _k_-mer coverage depth _c_ (defined by the median of its _k_-mer occurrence values, as returned by the CMS); when _c_ ≤ _κ_, adding the SE/PE HTS read(s) into the subset _S_ and updating the CMS for every corresponding canonical _k_-mer.
  3. (when the lower-bound cutoff _κ'_ > 0) For each SE/PE HTS read(s) in _S_, (re)approximating its _k_-mer coverage depth _c_; when _c_ ≤ _κ'_, removing the SE/PE HTS read(s) from _S_.
At the end, all SE/PE HTS read(s) inside _S_ are written into output FASTQ file(s) (by default, file extension _.rock.fastq_).
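The three steps above can be sketched in a few lines of Python. This is an illustrative simplification, not _ROCK_'s C++ implementation: it handles single-end reads only and uses a plain dictionary in place of the count-min sketch.

```python
from statistics import median

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return min(kmer, rc)

def kmers(seq, k):
    """All canonical k-mers of a read sequence."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def normalize(reads, k, kappa, kappa_prime):
    """reads: list of (sequence, phred_sum) tuples; returns the selected subset S."""
    counts = {}  # stand-in for the count-min sketch
    # Step 1: sort from most to least accurate (highest Phred sum first).
    reads = sorted(reads, key=lambda r: r[1], reverse=True)
    subset = []
    # Step 2: keep a read only while its median k-mer count is still <= kappa.
    for seq, _ in reads:
        kms = kmers(seq, k)
        if median(counts.get(km, 0) for km in kms) <= kappa:
            subset.append(seq)
            for km in kms:
                counts[km] = counts.get(km, 0) + 1
    # Step 3: discard kept reads whose final median k-mer count is <= kappa_prime.
    if kappa_prime > 0:
        subset = [s for s in subset
                  if median(counts[km] for km in kmers(s, k)) > kappa_prime]
    return subset
```

With five copies of the same read and _κ_ = 2, only the first (most accurate) copy is kept; the redundant ones are dropped as soon as the sketch reports their region as over-covered.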
* _ROCK_ stores the number of occurrences of every traversed canonical _k_-mer in a count-min sketch (CMS; e.g. Cormode and Muthukrishnan 2005), a dedicated probabilistic data structure with controllable false positive probability (FPP). By default, _ROCK_ instantiates a CMS based on four hashing functions (option `-l`), which can be sufficient for many cases (e.g. up to 10 billion canonical _k_-mers with _κ'_ = 2, _κ_ > _κ'_ and FPP < 5%). However, it is highly recommended to provide the expected number _F_<sub>0</sub> of canonical _k_-mers (option `-n`) to enable _ROCK_ to compute the optimal CMS dimensions required to store this specified number of canonical _k_-mers with a low FPP (option `-f`). For instance, the programs [_KMC_](https://github.com/refresh-bio/KMC) (Deorowicz et al. 2013, 2015; Kokot et al. 2017), [_KmerStream_](https://github.com/pmelsted/KmerStream) (Melsted and Halldórsson 2014) or [_ntCard_](https://github.com/bcgsc/ntCard) (Mohamadi et al. 2017) can be used to quickly approximate this number (_F_<sub>0</sub>).
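The CMS principle is compact enough to sketch in Python. This is a generic textbook version (Cormode and Muthukrishnan 2005), not _ROCK_'s implementation; the hash construction (salted BLAKE2b) and the table width are illustrative choices. Each of the _l_ rows holds _w_ counters; inserting an item increments one counter per row, and a point query returns the minimum across rows, since collisions can only inflate counters.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: l rows of w counters, one hash per row."""

    def __init__(self, l=4, w=1 << 20):
        self.l, self.w = l, w
        self.rows = [[0] * w for _ in range(l)]

    def _indexes(self, item):
        # One salted hash per row (illustrative choice of hash function).
        for i in range(self.l):
            h = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.w

    def add(self, item):
        for i, j in enumerate(self._indexes(item)):
            self.rows[i][j] += 1

    def count(self, item):
        # The minimum over rows is an upper bound on the true count,
        # exact with high probability when the table is large enough.
        return min(self.rows[i][j] for i, j in enumerate(self._indexes(item)))
```

Memory is fixed at l × w counters regardless of how many distinct _k_-mers are inserted, which is why supplying _F_<sub>0</sub> lets the dimensions be tuned to a target FPP.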
* Of important note, the upper- and lower-bound cutoffs (options `-C` and `-c`, respectively) each correspond to a _k_-mer coverage depth value (denoted here _c_<sub>_k_</sub>), which is quite different from the base coverage depth value (denoted here _c_<sub>_b_</sub>). However, when _L_ is the average input HTS read length, the ratios _c_<sub>_b_</sub> / _c_<sub>_k_</sub> and _L_ / (_L_ - _k_ + 1) are expected to be identical for any fixed small _k_ (e.g. Liu et al. 2013). Consequently, when an overall (base) coverage depth _c_<sub>_b_</sub> is expected, one can set _κ_ = _c_<sub>_b_</sub> (_L_ - _k_ + 1) / _L_. For example, when dealing with HTS reads of length _L_ = 144 (on average), an HTS read subset with expected base coverage depth _c_<sub>_b_</sub> = 60x can be inferred by _ROCK_ by setting _k_ = 25 (option `-k`) and _κ_ = 60 × (144 - 25 + 1) / 144 = 50 (option `-C`).
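The conversion above is a one-liner; here it is with the worked example from the text (function name is illustrative):

```python
def upper_bound_cutoff(c_b, L, k):
    """Convert a target base coverage depth c_b into the k-mer coverage
    depth cutoff kappa to pass to option -C, for reads of average length L."""
    return c_b * (L - k + 1) / L

# Worked example: L = 144, k = 25, target base coverage 60x.
kappa = upper_bound_cutoff(60, 144, 25)  # 60 * 120 / 144 = 50.0
```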
* By default, _ROCK_ uses _k_-mers of length _k_ = 25 (option `-k`). Increasing this length is not recommended when dealing with large FASTQ files (e.g. average coverage depth > 500x of genomes of length > 1 Gbps), as the total number of distinct canonical _k_-mers can quickly grow, therefore implying a very large CMS (i.e. many hashing functions) to maintain a low FPP (e.g. < 0.05). Using small _k_-mers (e.g. _k_ < 21) is also not recommended, as this decreases the overall specificity (e.g. too many identical _k_-mers arising from different sequenced genome regions).
* All _ROCK_ steps are based on the usage of valid _k_-mers, i.e. _k_-mers that do not contain any undetermined base `N`. Valid _k_-mers can also be restricted to those whose bases are associated with a Phred score greater than a specified cutoff (option `-q`). A minimum number of valid _k_-mers can be required for an SE/PE HTS read to be considered (option `-m`; default: 1). All SE/PE HTS read(s) that do not contain enough valid _k_-mers are written into FASTQ file(s) with extension _.undetermined.fastq_.
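Valid _k_-mer extraction can be sketched as follows. This is an illustrative Python sketch, assuming (as the help text suggests) that every base of a valid _k_-mer must score strictly above the `-q` cutoff:

```python
def valid_kmers(seq, quals, k, q=0):
    """Yield every k-mer of seq that contains no 'N' and whose bases all
    have a Phred score (quals string, +33 offset) strictly above q."""
    phred = [ord(c) - 33 for c in quals]
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if "N" not in window and min(phred[i:i + k]) > q:
            yield window
```

For example, with `seq = "ACGTNACGT"` and uniformly high qualities, only the two windows that avoid the `N` (the leading and trailing `ACGT`) are yielded for _k_ = 4; a read yielding fewer than `-m` such windows would go to the _.undetermined.fastq_ output.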
## References
Brown CT, Howe A, Zhang Q, Pyrkosz AB, Brom TH (2012) _A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data_. arXiv:[1203.4802v2](https://arxiv.org/abs/1203.4802v2).
Cormode G, Muthukrishnan S (2005) _An Improved Data Stream Summary: The Count-Min Sketch and its Applications_. Journal of Algorithms, 55:29-38. [doi:10.1016/j.jalgor.2003.12.001](https://doi.org/10.1016/j.jalgor.2003.12.001).
Deorowicz S, Debudaj-Grabysz A, Grabowski S (2013) _Disk-based k-mer counting on a PC_. BMC Bioinformatics, 14:160. [doi:10.1186/1471-2105-14-160](https://doi.org/10.1186/1471-2105-14-160).
Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) _KMC 2: Fast and resource-frugal k-mer counting_. Bioinformatics, 31(10):1569-1576. [doi:10.1093/bioinformatics/btv022](https://doi.org/10.1093/bioinformatics/btv022).
Durai DA, Schulz MH (2019) _Improving in-silico normalization using read weights_. Scientific Reports, 9:5133. [doi:10.1038/s41598-019-41502-9](https://doi.org/10.1038/s41598-019-41502-9).
Kokot M, Długosz M, Deorowicz S (2017) _KMC 3: counting and manipulating k-mer statistics_. Bioinformatics, 33(17):2759-2761. [doi:10.1093/bioinformatics/btx304](https://doi.org/10.1093/bioinformatics/btx304).
Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, Li Z, Chen Y, Mu D, Fan W (2013) _Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects_. arXiv:[1308.2012v2](https://arxiv.org/abs/1308.2012v2).
Melsted P, Halldórsson BV (2014) _KmerStream: streaming algorithms for k-mer abundance estimation_. Bioinformatics, 30(24):3541-3547. [doi:10.1093/bioinformatics/btu713](https://doi.org/10.1093/bioinformatics/btu713).
Mohamadi H, Khan H, Birol I (2017) _ntCard: a streaming algorithm for cardinality estimation in genomics data_. Bioinformatics, 33(9):1324-1330. [doi:10.1093/bioinformatics/btw832](https://doi.org/10.1093/bioinformatics/btw832).
Wedemeyer A, Kliemann L, Srivastav A, Schielke C, Reusch TB, Rosenstiel P (2017) _An improved filtering algorithm for big read datasets and its application to single-cell assembly_. BMC Bioinformatics, 18:324. [doi:10.1186/s12859-017-1724-7](https://doi.org/10.1186/s12859-017-1724-7).