From def76f827028a6d12f0eaea61471f3306ab6fb0b Mon Sep 17 00:00:00 2001
From: Alexis  CRISCUOLO <alexis.criscuolo@pasteur.fr>
Date: Fri, 6 May 2022 13:18:54 +0200
Subject: [PATCH] Updating README.md with an Examples section

---
 README.md | 168 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 153 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 94b8192..5f2faed 100644
--- a/README.md
+++ b/README.md
@@ -70,14 +70,16 @@ make install
 
 Note that this last command line can require some extra permissions (e.g. `sudo`).
 
+
 #### Installation in a mamba virtual environment
 
-It is possible to install the autotools and to compile ROCK in a mamba virtual environment.
-Here's how to do:
+It is possible to install the autotools and to compile _ROCK_ in a [mamba](https://mamba.readthedocs.io/en/latest) virtual environment using to following command lines:
+
+
 ```bash
 mamba create -n ROCK_env
 mamba init 
-cd ROCK
+cd ROCK/
 mamba activate ROCK_env
 mamba install -y autoconf
 mamba install -y automake
@@ -131,18 +133,159 @@ OPTIONS:
 
 * After creating an empty count-min sketch (CMS; see below) to store the number of occurrences of every canonical _k_-mer shared by the input HTS reads, _ROCK_ proceeds in three main steps :
   1. Sorting the input SE/PE HTS reads from the most to the less accurate ones (as defined by the sum of the Phred scores).
-  2. For each sorted SE/PE HTS read(s), approximating its _k_-mer coverage depth <em>c</em><sub><em>k</em></sub> (defined by the median of its _k_-mer occurence values, as returned by the CMS); if <em>c</em><sub><em>k</em></sub> &le; <em>&kappa;</em>, then adding the SE/PE HTS read(s) into the subset _S_<sub><em>&kappa;'</em>,<em>&kappa;</em></sub> and updating the CMS for every corresponding canonical _k_-mer.
-  3. (when the lower-bound cutoff <em>&kappa;'</em> > 0) For each SE/PE HTS read(s) in _S_<sub><em>&kappa;'</em>,<em>&kappa;</em></sub>, (re)approximating its _k_-mer coverage depth <em>c</em><sub><em>k</em></sub>; if <em>c</em><sub><em>k</em></sub> &le; <em>&kappa;'</em>, then removing the SE/PE HTS read(s) from _S_<sub><em>&kappa;'</em>,<em>&kappa;</em></sub>. <br>
+  2. For each sorted SE/PE HTS read(s), approximating its _k_-mer coverage depth <em>c</em><sub><em>k</em></sub> (defined by the median of its _k_-mer occurence values, as returned by the CMS); if <em>c</em><sub><em>k</em></sub>&nbsp;&le;&nbsp;<em>&kappa;</em>, then adding the SE/PE HTS read(s) into the subset _S_<sub><em>&kappa;'</em>,<em>&kappa;</em></sub> and updating the CMS for every corresponding canonical _k_-mer.
+  3. (when the lower-bound cutoff <em>&kappa;'</em>&nbsp;>&nbsp;0) For each SE/PE HTS read(s) in _S_<sub><em>&kappa;'</em>,<em>&kappa;</em></sub>, (re)approximating its _k_-mer coverage depth <em>c</em><sub><em>k</em></sub>; if <em>c</em><sub><em>k</em></sub>&nbsp;&le;&nbsp;<em>&kappa;'</em>, then removing the SE/PE HTS read(s) from _S_<sub><em>&kappa;'</em>,<em>&kappa;</em></sub>. <br>
 
   At the end, all SE/PE HTS read(s) inside _S_<sub><em>&kappa;'</em>,<em>&kappa;</em></sub> are written into output FASTQ file(s) (by default, file extension _.rock.fastq_). It is worth noticing that the step 1 ensures that all the HTS reads inside the returned subset _S_<sub><em>&kappa;'</em>,<em>&kappa;</em></sub> are the most accurate ones.
 
-* _ROCK_ stores the number of occurences of every traversed canonical _k_-mer in a count-min sketch (CMS; e.g. Cormode and Muthukrishnan 2005), a dedicated probabilistic data structure with controllable false positive probability (FPP). By default, _ROCK_ instantiates a CMS based on four hashing functions (option `-l`), which can be sufficient for many cases, e.g. up to 10 billions canonical _k_-mers with <em>&kappa;</em> > <em>&kappa;'</em> > 1 and FPP &le; 5%. However, as each hashing function is defined on [0, 2<sup>32</sup>[, the memory footprint of the CMS is 4<em>&lambda;</em> Gb (when _&kappa;_ &le; 255, twice otherwise), where <em>&lambda;</em> is the total number of hashing functions. It is therefore highly recommanded to provide the expected number _F_<sub>0</sub> of canonical _k_-mers (option `-n`) to enable _ROCK_ to compute the optimal CMS dimension <em>&lambda;</em> required to store this specified number of canonical _k_-mers with low FPP (option `-f`), e.g. <em>&lambda;</em> = 1 is sufficient to deal with up to 3 billions canonical _k_-mers when <em>&kappa;</em> > <em>&kappa;'</em> > 1 while ensuring FPP &le; 5%. Moreover, a CMS based on few hashing functions entails faster running times. For instance, the programs [_KMC_](https://github.com/refresh-bio/KMC) (Deorowicz et al. 2013, 2015; Kokot et al. 2017), [_KmerStream_](https://github.com/pmelsted/KmerStream) (Melsted and HalldÃ³rsson 2014) or [_ntCard_](https://github.com/bcgsc/ntCard) (Mohamadi et al. 2017) can be used to quickly approximate this number (_F_<sub>0</sub>). 
+* _ROCK_ stores the number of occurences of every traversed canonical _k_-mer in a count-min sketch (CMS; e.g. Cormode and Muthukrishnan 2005), a dedicated probabilistic data structure with controllable false positive probability (FPP). By default, _ROCK_ instantiates a CMS based on four hashing functions (option `-l`), which can be sufficient for many cases, e.g. up to 10 billions canonical _k_-mers with <em>&kappa;</em>&nbsp;>&nbsp;<em>&kappa;'</em>&nbsp;>&nbsp;1 and FPP&nbsp;&le;&nbsp;5%. However, as each hashing function is defined on [0,&nbsp;2<sup>32</sup>[, the memory footprint of the CMS is 4<em>&lambda;</em>&nbsp;Gb (when _&kappa;_&nbsp;&le;&nbsp;255, twice otherwise), where <em>&lambda;</em> is the total number of hashing functions. It is therefore highly recommanded to provide the expected number _F_<sub>0</sub> of canonical _k_-mers (option `-n`) to enable _ROCK_ to compute the optimal CMS dimension <em>&lambda;</em> required to store this specified number of canonical _k_-mers with low FPP (option `-f`), e.g. <em>&lambda;</em>&nbsp;=&nbsp;1 is sufficient to deal with up to 3 billions canonical _k_-mers when <em>&kappa;</em>&nbsp;>&nbsp;<em>&kappa;'</em>&nbsp;>&nbsp;1 while ensuring FPP&nbsp;&le;&nbsp;5%. Moreover, a CMS based on few hashing functions entails faster running times. For instance, the programs [_KMC_](https://github.com/refresh-bio/KMC) (Deorowicz et al. 2013, 2015; Kokot et al. 2017), [_KmerStream_](https://github.com/pmelsted/KmerStream) (Melsted and HalldÃ³rsson 2014) or [_ntCard_](https://github.com/bcgsc/ntCard) (Mohamadi et al. 2017) can be used to quickly approximate this number (_F_<sub>0</sub>). 
+
+* Of important note is that each of the upper- and lower-bound cutoffs (options `-C` and `-c`, respectively) corresponds to a _k_-mer coverage depth value (denoted here <em>c</em><sub><em>k</em></sub>), which is quite different to the base coverage depth value (denoted here <em>c</em><sub><em>b</em></sub>). However, when _L_ is the average input HTS read length, <em>c</em><sub><em>b</em></sub>&nbsp;/&nbsp;<em>c</em><sub><em>k</em></sub> and _L_&nbsp;/&nbsp;(_L_&nbsp;&minus;&nbsp;_k_&nbsp;+&nbsp;1) are expected to be identical for any fixed small _k_ (e.g. Liu et al. 2013). In consequence, when an overall (base) coverage depth <em>c</em><sub><em>b</em></sub> is expected, one can therefore set <em>&kappa;</em> =  <em>c</em><sub><em>b</em></sub> (_L_&nbsp;&minus;&nbsp;_k_&nbsp;+&nbsp;1)&nbsp;/&nbsp;_L_. For example, when dealing with HTS reads of length _L_&nbsp;=&nbsp;144 (on average), an HTS read subset with expected base coverage depth <em>c</em><sub><em>b</em></sub>&nbsp;=&nbsp;60x can be inferred by _ROCK_ by setting _k_&nbsp;=&nbsp;25 (option `-k`) and <em>&kappa;</em>&nbsp;=&nbsp;60&nbsp;(144&nbsp;&minus;&nbsp;25&nbsp;+&nbsp;1)&nbsp;/&nbsp;144&nbsp;=&nbsp;50 (option `-C`).
+
+* By default, _ROCK_ uses _k_-mers of length _k_&nbsp;=&nbsp;25 (option `-k`). Increasing this length is not recommanded when dealing with large FASTQ files (e.g. average coverage depth > 500x from genome size > 1 Gbps), as the total number of canonical _k_-mers can quickly grow, therefore implying a very large CMS (i.e. many hashing functions) to maintains low FPP (e.g. &le;&nbsp;0.05). Using small _k_-mers (e.g. _k_&nbsp;<&nbsp;21) is also not recommanded, as this can negatively affect the overall specificity (i.e. too many identical _k_-mers arising from different sequenced genome region).
+
+* All _ROCK_ steps are based on the usage of valid _k_-mers, i.e. _k_-mers that do not contain any undetermined base `N`. Valid _k_-mers can also be determined by bases associated to a Phred score greater than a specified threshold (option `-q`; Phred +33 offset, default:&nbsp;0). A minimum number of valid _k_-mers can be specified to consider a SE/PE HTS read(s) (option `-m`; default:&nbsp;1). 
+
+
+## Examples
+
+##### Single FASTQ file
+
+The following [Bash](https://www.gnu.org/software/bash/) command line enables to download and uncompress the FASTQ file (845 Mb) associated to the run accession [ERR6807819](https://www.ncbi.nlm.nih.gov/sra/ERR6807819):
+
+```bash
+wget -O - https://ftp.sra.ebi.ac.uk/vol1/fastq/ERR680/009/ERR6807819/ERR6807819.fastq.gz | zcat > ERR6807819.fastq
+```
+
+The downloaded FASTQ file _ERR6807819.fastq_ contains 2,310,531 Ion Torrent S5 reads (394,782,864 bases) corresponding to the whole genome sequencing of a SARS-CoV-2 strain.
+
+_ROCK_ can be directly run on this FASTQ file to perform digital normalization using default options (e.g. _k_&nbsp;=&nbsp;25, <em>&kappa;</em>&nbsp;=&nbsp;70, <em>&kappa;'</em>&nbsp;&nbsp;=&nbsp;0):
+
+```bash
+rock  ERR6807819.fastq
+```
+
+After performing the digital normalization of these single-end reads (running time: ~1 minute), _ROCK_ writes the default output FASTQ file _ERR6807819.rock.fastq_ (93,810 HTS reads, 13,467,261 bases).
+
+The default values of the upper- and lower-bound _k_-mer coverage depth cutoff  <em>&kappa;</em> and <em>&kappa;'</em> can be modified using options `-C` and `-c`, respectively:
+
+```bash
+rock  -c 5  -C 80  ERR6807819.fastq
+```
+
+The above command line leads to 76,523 HTS reads (11,608,269 bases).
+
+
+As [ERR6807819](https://www.ncbi.nlm.nih.gov/sra/ERR6807819) is the result of the sequencing of a SARS-CoV-2 genome, it is expected that its number _F_<sub>0</sub> of distinct canonical _k_-mers is far below 3 billions. In consequence, _ROCK_ can be adequately run using only <em>&lambda;</em>&nbsp;=&nbsp;1 hashing function:
+
+```bash
+rock  -c 5  -C 80  -l 1  ERR6807819.fastq
+```
+
+The above command line leads to 76,526 HTS reads (11,607,360 bases).
+
+Alternatively, _F_<sub>0</sub> can be priorly estimated (e.g. with _k_&nbsp;=&nbsp;25, [_ntCard_](https://github.com/bcgsc/ntCard) returns _F_<sub>0</sub>&nbsp;=&nbsp;4,840,553), and next specified using option `-n`:
+
+```bash
+rock  -c 5  -C 80  -n 4840553  -v  ERR6807819.fastq
+```
+
+As the option `-v` has been set, the above command line prints the following information:
+
+```
+SE input file(s)                        ERR6807819.fastq
+upper-bound coverage depth              80
+lower-bound coverage depth              5
+k-mer length                            25
+expected no. distinct k-mers            4840553
+no. hash. function(s)                   1
+expected false positive proba.          0.000000 (cov. depth > 1)
+no. buckets for hash. 1                 4283781797
+no. bits per bucket                     8
+count-min sketch size (Gb)              4
+no. input reads                         2310531
+no. input k-mers                        339330120
+no. unset buckets for hash. 1           4280838158
+approx. no. distinct k-mers             2944656
+approx. false positive proba.           0.000000
+no. selected reads                      76526
+SE output files                         ERR6807819.rock.fastq
+```
+
+This shows that <em>&lambda;</em>&nbsp;=&nbsp;1 hashing function is indeed sufficient to deal with _F_<sub>0</sub>&nbsp;=&nbsp;4,840,553 distinct canonical _k_-mers. The selected HTS read subset is therefore identical to the one obtained using `-l 1`.
+
+Valid _k_-mers (to fill the CMS) can also be assessed by specifying a minimum Phred score using option `-q`, e.g. _Q_&nbsp;&geq;&nbsp;10:
+
+```bash
+rock  -c 5  -C 80  -n 4840553  -q 10  ERR6807819.fastq
+```
+
+The above command line leads to 44,687 HTS reads (7,367,710 bases).
+
+
+##### Paired FASTQ files
+
+
+The following [Bash](https://www.gnu.org/software/bash/) command lines enable to download and uncompress the two FASTQ files (199 Mb each) associated to the run accession [ERR7756257](https://www.ncbi.nlm.nih.gov/sra/ERR7756257):
+
+```bash
+EBIURL="https://ftp.sra.ebi.ac.uk/vol1/fastq";
+wget -O - $EBIURL/ERR775/007/ERR7756257/ERR7756257_1.fastq.gz | gunzip -c > ERR7756257.1.fastq ;
+wget -O - $EBIURL/ERR775/007/ERR7756257/ERR7756257_2.fastq.gz | gunzip -c > ERR7756257.2.fastq ;
+```
+
+The downloaded FASTQ files _ERR7756257.1.fastq_ and _ERR7756257.2.fastq_ contains 571,893 Illumina MiSeq PE reads (84,146,312 and  84,156,299 bases, respectively) corresponding to the whole genome sequencing of a SARS-CoV-2 strain.
+
+_ROCK_ can be directly run on these paired FASTQ files to perform digital normalization using default options (e.g. _k_&nbsp;=&nbsp;25, <em>&kappa;</em>&nbsp;=&nbsp;70, <em>&kappa;'</em>&nbsp;&nbsp;=&nbsp;0):
+
+```bash
+rock  ERR7756257.1.fastq,ERR7756257.2.fastq
+```
+
+After performing the digital normalization of these HTS PE reads (running time: ~30 seconds), _ROCK_ writes the default output FASTQ files _ERR7756257.1.rock.fastq_ and  _ERR7756257.2.rock.fastq_ (17,984 HTS PE reads, 2,633,631 and 2,636,701 bases, respectively).
 
-* Of important note is that each of the upper- and lower-bound cutoffs (options `-C` and `-c`, respectively) corresponds to a _k_-mer coverage depth value (denoted here <em>c</em><sub><em>k</em></sub>), which is quite different to the base coverage depth value (denoted here <em>c</em><sub><em>b</em></sub>). However, when _L_ is the average input HTS read length, <em>c</em><sub><em>b</em></sub> / <em>c</em><sub><em>k</em></sub> and _L_ / (_L_-_k_+1) are expected to be identical for any fixed small _k_ (e.g. Liu et al. 2013). In consequence, when an overall (base) coverage depth <em>c</em><sub><em>b</em></sub> is expected, one can therefore set <em>&kappa;</em> =  <em>c</em><sub><em>b</em></sub> (_L_-_k_+1) / _L_. For example, when dealing with HTS reads of length _L_ = 144 (on average), an HTS read subset with expected base coverage depth <em>c</em><sub><em>b</em></sub> = 60x can be inferred by _ROCK_ by setting _k_ = 25 (option `-k`) and <em>&kappa;</em> =  60 (144-25+1) / 144 = 50 (option `-C`).
+Input and output FASTQ file names can be specified into txt files, e.g.
+
+```bash
+echo "ERR7756257.1.fastq,ERR7756257.2.fastq" > infiles.txt ;
+echo "ERR7756257.norm.R1.fastq,ERR7756257.norm.R2.fastq" > outfiles.txt ;
+rock  -i infiles.txt  -o outfiles.txt
+```
+
+The above command lines lead to the paired FASTQ output files _ERR7756257.norm.R1.fastq_ and _ERR7756257.norm.R2.fastq_. PE FASTQ files should be specified in the same line, separated by a comma (`,`), without blank space. Up to 15 FASTQ files (paired or not) can be specified. Of important note, output FASTQ files should be specified in the same way and in the same order as the corresponding input FASTQ files.
+
+
+Options described above (see Single FASTQ file) can also be used with paired FASTQ files:
+
+```bash
+rock  -c 5  -C 80  -n 2277975  -q 10  -v  -i infiles.txt  -o outfiles.txt
+```
+
+The above command line prints the following information:
+
+```
+PE input files                          ERR7756257.1.fastq ERR7756257.2.fastq
+upper-bound coverage depth              80
+lower-bound coverage depth              5
+k-mer length                            25
+expected no. distinct k-mers            2277975
+no. hash. function(s)                   1
+expected false positive proba.          0.000000 (cov. depth > 1)
+no. buckets for hash. 1                 4283781797
+no. bits per bucket                     8
+count-min sketch size (Gb)              4
+no. input reads                         571792
+no. input k-mers                        140851747
+no. unset buckets for hash. 1           4282925434
+approx. no. distinct k-mers             856450
+approx. false positive proba.           0.000000
+no. selected reads                      15804
+PE output files                         ERR7756257.norm.R1.fastq ERR7756257.norm.R2.fastq
+```
+
+The output FASTQ files contain 15,804 HTS PE reads (2,319,355 and 2,319,367 bases, respectively)
 
-* By default, _ROCK_ uses _k_-mers of length _k_ = 25 (option `-k`). Increasing this length is not recommanded when dealing with large FASTQ files (e.g. average coverage depth > 500x from genome size > 1 Gbps), as the total number of canonical _k_-mers can quickly grow, therefore implying a very large CMS (i.e. many hashing functions) to maintains low FPP (e.g. &le; 0.05). Using small _k_-mers (e.g. _k_ < 21) is also not recommanded, as this can negatively affect the overall specificity (i.e. too many identical _k_-mers arising from different sequenced genome region).
 
-* All _ROCK_ steps are based on the usage of valid _k_-mers, i.e. _k_-mers that do not contain any undetermined base `N`. Valid _k_-mers can also be determined by bases associated to a Phred score greater than a specified threshold (option `-q`; Phred +33 offset, default: 0). A minimum number of valid _k_-mers can be specified to consider a SE/PE HTS read(s) (option `-m`; default: 1). 
 
 ## References
 
@@ -173,9 +316,4 @@ Wedemeyer A, Kliemann L, Srivastav A, Schielke C, Reusch TB, Rosenstiel P (2017)
 
 ## Contributions
 
-Suggestions for improving the ROCK software or for adding new functionalities as well as merge requests are welcome.
-
-
-
-
-
+Suggestions for improving _ROCK_ or for adding new functionalities, as well as merge requests, are welcome.
-- 
GitLab