From a2dd5cf12fe4fa89de605a8b8e3c8fe88a1561b1 Mon Sep 17 00:00:00 2001 From: Veronique Legrand <vlegrand@pasteur.fr> Date: Thu, 27 May 2021 15:11:02 +0200 Subject: [PATCH] updated README with Alexis's changes and information related to the autotools. --- README.md | 107 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 55 insertions(+), 52 deletions(-) diff --git a/README.md b/README.md index 8b248ae..aabd080 100644 --- a/README.md +++ b/README.md @@ -2,82 +2,73 @@ _ROCK_ (Reducing Over-Covering K-mers) is a command line program written in [C++](https://isocpp.org/) that runs an alternative implementation of the _digital normalization_ method (e.g. Brown et al. 2012, Wedemeyer et al. 2017, Durai and Schulz 2019). -_ROCK_ can be used to reduce the overall coverage depth within large sets of high-throughput sequencing (HTS) reads contained in one or several FASTQ file(s). -_ROCK_ can also be used to discard low-covering HTS reads that are often artefactual, highly erroneous or contaminating sequences. +Given one or several [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) file(s), the main aim of _ROCK_ is to build a subset of accurate high-throughput sequencing (HTS) reads such that the induced coverage depth lies between two specified lower and upper bounds. +_ROCK_ can therefore be used to reduce and/or homogenize the overall coverage depth within large sets of HTS reads, which is often required to quickly infer accurate _de novo_ genome assemblies (e.g. Desai et al. 2013, Chen et al. 2015). +_ROCK_ can also be used to discard low-covering HTS reads, as these are often artefactual, highly erroneous or contaminating sequences. -## Installation, compilation and execution -Clone this repository with the following command line: +## Compilation and installation -```bash -git clone https://gitlab.pasteur.fr/vlegrand/ROCK.git -``` - - -_ROCK_ is developped in C++ 98 and is C++11 compatible. 
-On computers with [gcc](https://ftp.gnu.org/gnu/gcc/) (version at least 4.4.7) or [clang](https://clang.llvm.org/) (version at least 503.0.40) installed, the compilation and installation of _ROCK_ can be carried out easily: +#### Prerequisites -If you don't plan to add new files to ROCK's source code, you do not need anything else than one of the compilers mentionned above and the make tool. -First thing to do is to run: +First and foremost, clone this repository with the following command line: ```bash -./configure +git clone https://gitlab.pasteur.fr/vlegrand/ROCK.git ``` -This will generate the files needed by make to produce the executable with the default options. +_ROCK_ is developed in C++ 98 and compatible with C++ 11. +Compilation of the source code can be carried out using [gcc](https://ftp.gnu.org/gnu/gcc/) (version ≥ 4.4.7) or [clang](https://clang.llvm.org/) (version ≥ 503.0.40), together with the tool [_make_](https://www.gnu.org/software/make). +You also need autoconf 2.69 and automake 1.16. +It is possible to use other versions of these tools, but you will then have to regenerate aclocal.m4 and all the Makefile.in files (if you are using a version of automake other than 1.16) as well as configure (if you are using a version of autoconf other than 2.69). -Two things are important here: -1) To get the best performances, you must pass the -O3 flag: ```bash ./configure CFLAGS='-O3' CXXFLAGS='-O3' ``` +#### Basic compilation and execution -2) You do not need to install ROCK on your system to use it. You can simply run the executable that make generates in the src directory. -Yet, if you want to install it in your system's default location for executables; then make sure that you have sudo rights for that. -A way to install ROCK on your system without sudo rights is to provide the configure script with another directory path. This can be done with the --prefix option of the configure script. 
- -```bash -./configure --prefix=my_home/my_tools -``` - -Once You have ran ./configure with the options that suit you best, you can check that ROCK works well on your system by running: +If you have autoconf 2.69 and automake 1.16, an executable can be built by running the following three command lines: ```bash +./configure make check +make ``` -Make check will compile the code and generate an executable in a temporary location. Then, it will run all unit tests and non regression tests. -This is not a mandatory step. If eveything goes well, you can go on. Otherwise, you can report the problem to the authors. +The first command line generates the files required by [_make_](https://www.gnu.org/software/make) to compile the source code. +<br> +The second command line (optional) compiles the source code and then runs all unit and non-regression tests. +If a problem occurs during this step, please report it to the authors (see file _AUTHORS_). +<br> +The third command line finally compiles the source code to build the executable _src/rock_. -To generate the executable in the src directory, simply run: +The executable _rock_ can be run with the following command line model: ```bash -make +./rock [options] ``` -and to install it, run either +#### Advanced compilation and installation + +Better performance can be obtained by passing the `-O3` flag during the configuration step: ```bash -sudo make install +./configure CFLAGS='-O3' CXXFLAGS='-O3' ``` -or +You can also specify a location for the built executable, e.g. ```bash -make install +./configure --prefix=$path_to_output_directory ``` -depending on whether you used --prefix or not with ./configure. - -The executable _rock_ can be run with the following command line model: +Otherwise, the executable _rock_ can be installed in the system default location (e.g. 
_/usr/local/_) using the following fourth command line (after the three described above): ```bash -./rock [options] +make install ``` +Note that this last command line may require extra permissions (e.g. `sudo`). ## Usage @@ -119,42 +110,54 @@ OPTIONS: ## Notes -* In brief, given an upper-bound _k_-mer coverage depth cutoff _κ_ (option `-C`) and a lower-bound _k_-mer coverage cutoff _κ'_ (option `-c`), the aim of _ROCK_ is to select an HTS read subset _S_<sub>_κ'_,_κ_</sub> such that the overall _k_-mer coverage depth induced by the members of _S_<sub>_κ'_,_κ_</sub> is expected to be always comprised between _κ'_ and _κ_. When considering FASTQ files with high redundancy (i.e. coverage depth greater than _κ_), _ROCK_ therefore returns smaller FASTQ files such that each of its HTS read corresponds to a genome region with _k_-mer coverage depth of at most _κ_. Setting _κ'_ (option `-c`) enables to discard HTS reads associated to a _k_-mer coverage depth lower than this lower-bound cutoff, which is often observed with artefactual, highly erroneous or contaminating HTS reads. +* In brief, given an upper-bound _k_-mer coverage depth cutoff <em>κ</em> (option `-C`) and a lower-bound _k_-mer coverage cutoff <em>κ'</em> (option `-c`), the aim of _ROCK_ is to select an HTS read subset _S_<sub><em>κ'</em>,<em>κ</em></sub> such that the overall _k_-mer coverage depth induced by the members of _S_<sub><em>κ'</em>,<em>κ</em></sub> is expected to always lie between <em>κ'</em> and <em>κ</em>. When considering FASTQ files with high redundancy (i.e. coverage depth greater than <em>κ</em>), _ROCK_ therefore returns smaller FASTQ files such that each of their HTS reads corresponds to a genome region with _k_-mer coverage depth of at most <em>κ</em>. 
Setting <em>κ'</em> (option `-c`) makes it possible to discard HTS reads associated with a _k_-mer coverage depth lower than this lower-bound cutoff, which is often observed with artefactual, highly erroneous or contaminating HTS reads. -* After creating an empty count-min sketch (CMS; see below) to store the number of occurences of every canonical _k_-mer shared by the input HTS reads, _ROCK_ proceeds in three main steps : +* After creating an empty count-min sketch (CMS; see below) to store the number of occurrences of every canonical _k_-mer shared by the input HTS reads, _ROCK_ proceeds in three main steps: 1. Sorting the input SE/PE HTS reads from the most to the least accurate ones (as defined by the sum of the Phred scores). - 2. For each sorted SE/PE HTS read(s), approximating its _k_-mer coverage depth _c_<sub>_k_</sub> (defined by the median of its _k_-mer occurence values, as returned by the CMS); if _c_<sub>_k_</sub> ≤ _κ_, then adding the SE/PE HTS read(s) into the subset _S_<sub>_κ'_,_κ_</sub> and updating the CMS for every corresponding canonical _k_-mer. - 3. (when the lower-bound cutoff _κ'_ > 0) For each SE/PE HTS read(s) in _S_<sub>_κ'_,_κ_</sub>, (re)approximating its _k_-mer coverage depth _c_<sub>_k_</sub>; if _c_<sub>_k_</sub> ≤ _κ'_, then removing the SE/PE HTS read(s) from _S_<sub>_κ'_,_κ_</sub>. <br><br> + 2. For each sorted SE/PE HTS read(s), approximating its _k_-mer coverage depth <em>c</em><sub><em>k</em></sub> (defined by the median of its _k_-mer occurrence values, as returned by the CMS); if <em>c</em><sub><em>k</em></sub> ≤ <em>κ</em>, then adding the SE/PE HTS read(s) into the subset _S_<sub><em>κ'</em>,<em>κ</em></sub> and updating the CMS for every corresponding canonical _k_-mer. + 3. 
(when the lower-bound cutoff <em>κ'</em> > 0) For each SE/PE HTS read(s) in _S_<sub><em>κ'</em>,<em>κ</em></sub>, (re)approximating its _k_-mer coverage depth <em>c</em><sub><em>k</em></sub>; if <em>c</em><sub><em>k</em></sub> ≤ <em>κ'</em>, then removing the SE/PE HTS read(s) from _S_<sub><em>κ'</em>,<em>κ</em></sub>. <br> - At the end, all SE/PE HTS read(s) inside _S_<sub>_κ'_,_κ_</sub> are written into output FASTQ file(s) (by default, file extension _.rock.fastq_). It is worth noticing that the step 1 ensures that all the HTS reads inside the returned subset _S_<sub>_κ'_,_κ_</sub> are the most accurate ones. + At the end, all SE/PE HTS read(s) inside _S_<sub><em>κ'</em>,<em>κ</em></sub> are written into output FASTQ file(s) (by default, file extension _.rock.fastq_). It is worth noting that step 1 ensures that all the HTS reads inside the returned subset _S_<sub><em>κ'</em>,<em>κ</em></sub> are the most accurate ones. -* _ROCK_ stores the number of occurences of every traversed canonical _k_-mer in a count-min sketch (CMS; e.g. Cormode and Muthukrishnan 2005), a dedicated probabilistic data structure with controllable false positive probability (FPP). By default, _ROCK_ instantiates a CMS based on four hashing functions (option `-l`), which can be sufficient for many cases (e.g. up to 10 billions canonical _k_-mers with _κ'_ = 2, _κ_ < 256 and FPP < 5%). However, as each hashing function is defined on [0, 2<sup>32</sup>[, the memory footprint of the CMS is 4ℓ Gb (when _κ_ < 265, twice otherwise), where ℓ is the total number of hashing functions. It is therefore highly recommanded to provide the expected number _F_<sub>0</sub> of canonical _k_-mers (option `-n`) to enable _ROCK_ to compute the optimal CMS dimension ℓ required to store this specified number of canonical _k_-mers with low FPP (option `-f`). For instance, the programs [_KMC_](https://github.com/refresh-bio/KMC) (Deorowicz et al. 2013, 2015; Kokot et al. 
2017), [_KmerStream_](https://github.com/pmelsted/KmerStream) (Melsted and Halldórsson 2014) or [_ntCard_](https://github.com/bcgsc/ntCard) (Mohamadi et al. 2017) can be used to quickly approximate this number (_F_<sub>0</sub>). +* _ROCK_ stores the number of occurrences of every traversed canonical _k_-mer in a count-min sketch (CMS; e.g. Cormode and Muthukrishnan 2005), a dedicated probabilistic data structure with controllable false positive probability (FPP). By default, _ROCK_ instantiates a CMS based on four hashing functions (option `-l`), which can be sufficient for many cases, e.g. up to 10 billion canonical _k_-mers with <em>κ</em> > <em>κ'</em> > 1 and FPP ≤ 5%. However, as each hashing function is defined on [0, 2<sup>32</sup>[, the memory footprint of the CMS is 4<em>λ</em> GB (when _κ_ ≤ 255, twice otherwise), where <em>λ</em> is the total number of hashing functions. It is therefore highly recommended to provide the expected number _F_<sub>0</sub> of canonical _k_-mers (option `-n`) to enable _ROCK_ to compute the optimal CMS dimension <em>λ</em> required to store this specified number of canonical _k_-mers with low FPP (option `-f`), e.g. <em>λ</em> = 1 is sufficient to deal with up to 3 billion canonical _k_-mers when <em>κ</em> > <em>κ'</em> > 1 while ensuring FPP ≤ 5%. Moreover, a CMS based on fewer hashing functions entails faster running times. For instance, the programs [_KMC_](https://github.com/refresh-bio/KMC) (Deorowicz et al. 2013, 2015; Kokot et al. 2017), [_KmerStream_](https://github.com/pmelsted/KmerStream) (Melsted and Halldórsson 2014) or [_ntCard_](https://github.com/bcgsc/ntCard) (Mohamadi et al. 2017) can be used to quickly approximate this number (_F_<sub>0</sub>). 
-* Of important note is that each of the upper- and lower-bound cutoffs (options `-C` and `-c`, respectively) corresponds to a _k_-mer coverage depth value (denoted here _c_<sub>_k_</sub>), which is quite different to the base coverage depth value (denoted here _c_<sub>_b_</sub>). However, when _L_ is the average input HTS read length, _c_<sub>_b_</sub> / _c_<sub>_k_</sub> and _L_ / (_L_-_k_+1) are expected to be identical for any fixed small _k_ (e.g. Liu et al. 2013). In consequence, when an overall (base) coverage depth _c_<sub>_b_</sub> is expected, one can therefore set _κ_ = _c_<sub>_b_</sub> (_L_-_k_+1) / _L_. For example, when dealing with HTS reads of length _L_ = 144 (on average), an HTS read subset with expected base coverage depth _c_<sub>_b_</sub> = 60x can be inferred by _ROCK_ by setting _k_ = 25 (option `-k`) and _κ_ = 60 (144 -25+1) / 144 = 50 (option `-C`). +* Note that each of the upper- and lower-bound cutoffs (options `-C` and `-c`, respectively) corresponds to a _k_-mer coverage depth value (denoted here <em>c</em><sub><em>k</em></sub>), which is quite different from the base coverage depth value (denoted here <em>c</em><sub><em>b</em></sub>). However, when _L_ is the average input HTS read length, <em>c</em><sub><em>b</em></sub> / <em>c</em><sub><em>k</em></sub> and _L_ / (_L_-_k_+1) are expected to be identical for any fixed small _k_ (e.g. Liu et al. 2013). Consequently, when an overall (base) coverage depth <em>c</em><sub><em>b</em></sub> is expected, one can set <em>κ</em> = <em>c</em><sub><em>b</em></sub> (_L_-_k_+1) / _L_. For example, when dealing with HTS reads of length _L_ = 144 (on average), an HTS read subset with expected base coverage depth <em>c</em><sub><em>b</em></sub> = 60x can be inferred by _ROCK_ by setting _k_ = 25 (option `-k`) and <em>κ</em> = 60 (144-25+1) / 144 = 50 (option `-C`). -* By default, _ROCK_ uses _k_-mers of length _k_ = 25 (option `-k`). 
Increasing this length is not recommanded when dealing with large FASTQ files (e.g. average coverage depth > 500x from genome size > 1 Gbps), as the total number of canonical _k_-mers can quickly grow, therefore implying a very large CMS (i.e. many hashing functions) to maintains low FPP (e.g. < 0.05). Using small _k_-mers (e.g. _k_ < 21) is also not recommanded, as this will decrease the overall specificity (e.g. too many identical _k_-mers arising from different sequenced genome region). +* By default, _ROCK_ uses _k_-mers of length _k_ = 25 (option `-k`). Increasing this length is not recommended when dealing with large FASTQ files (e.g. average coverage depth > 500x from genome size > 1 Gbp), as the total number of canonical _k_-mers can quickly grow, therefore implying a very large CMS (i.e. many hashing functions) to maintain low FPP (e.g. ≤ 0.05). Using small _k_-mers (e.g. _k_ < 21) is also not recommended, as this can negatively affect the overall specificity (i.e. too many identical _k_-mers arising from different sequenced genome regions). -* All _ROCK_ steps are based on the usage of valid _k_-mers, i.e. _k_-mers that do not contain any undetermined base `N`. Valid _k_-mers can also be determined by bases associated to a Phred score greater than a specified threshold (option `-q`). A minimum number of valid _k_-mers can be specified to consider a SE/PE HTS read(s) (option `-m`; default 1). +* All _ROCK_ steps rely on valid _k_-mers, i.e. _k_-mers that do not contain any undetermined base `N`. Valid _k_-mers can also be restricted to bases associated with a Phred score greater than a specified threshold (option `-q`; Phred +33 offset, default: 0). A minimum number of valid _k_-mers can be specified to consider an SE/PE HTS read(s) (option `-m`; default: 1). 
All SE/PE HTS read(s) that do not contain enough valid _k_-mers are written into FASTQ file(s) with extension _.undetermined.fastq_. ## References Brown CT, Howe A, Zhang Q, Pyrkosz AB, Brom YH (2012) _A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data_. **arXiv**:[1203.4802v2](https://arxiv.org/abs/1203.4802v2). +Chen TW, Gan RC, Chang YF, Liao W-C, Wu TH, Lee C-C, Huang P-J, Lee C-Y, Chen Y-YM, Chiu CH, Tang P (2015) _Is the whole greater than the sum of its parts? De novo assembly strategies for bacterial genomes based on paired-end sequencing_. **BMC Genomics**, 16:648. [doi:10.1186/s12864-015-1859-8](https://doi.org/10.1186/s12864-015-1859-8). + Cormode G, Muthukrishnan S (2005) _An Improved Data Stream Summary: The Count-Min Sketch and its Applications_. **Journal of Algorithms**, 55:29-38. [doi:10.1016/j.jalgor.2003.12.001](https://doi.org/10.1016/j.jalgor.2003.12.001). Deorowicz S, Debudaj-Grabysz A, Grabowski S (2013) _Disk-based k-mer counting on a PC_. **BMC Bioinformatics**, 14:160. [doi:10.1186/1471-2105-14-160](https://doi.org/10.1186/1471-2105-14-160). Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) _KMC 2: Fast and resource-frugal k-mer counting_. **Bioinformatics**, 31(10):1569-1576. [doi:10.1093/bioinformatics/btv022](https://doi.org/10.1093/bioinformatics/btv022). +Desai A, Marwah VS, Yadav A, Jha V, Dhaygude K, Bangar U, Kulkarni V, Jere A (2013) _Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data_. **PLoS ONE**, 8(4):e60204. [doi:10.1371/journal.pone.0060204](https://doi.org/10.1371/journal.pone.0060204). + Durai DA, Schulz MH (2019) _Improving in-silico normalization using read weights_. **Scientific Reports**, 9:5133. [doi:10.1038/s41598-019-41502-9](https://doi.org/10.1038/s41598-019-41502-9). -Kokot M, Długosz M, Deorowicz S (2017) _KMC 3: counting and manipulating k -mer statistics_. 
**Bioinformatics**, 33(17):2759-2761. [doi:10.1093/bioinformatics/btx304](https://doi.org/10.1093/bioinformatics/btx304). +Kokot M, Długosz M, Deorowicz S (2017) _KMC 3: counting and manipulating k-mer statistics_. **Bioinformatics**, 33(17):2759-2761. [doi:10.1093/bioinformatics/btx304](https://doi.org/10.1093/bioinformatics/btx304). Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, Li Z, Chen Y, Mu D, Fan W (2013) _Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects_. **arXiv**:[1308.2012v2](https://arxiv.org/abs/1308.2012v2). -Melsted P, Halldórsson BV (2014) _KmerStream: streaming algorithms for k -mer abundance estimation_. **Bioinformatics**, 30(24):3541-3547. [doi:10.1093/bioinformatics/btu713](https://doi.org/10.1093/bioinformatics/btu713). +Melsted P, Halldórsson BV (2014) _KmerStream: streaming algorithms for k-mer abundance estimation_. **Bioinformatics**, 30(24):3541-3547. [doi:10.1093/bioinformatics/btu713](https://doi.org/10.1093/bioinformatics/btu713). Mohamadi H, Khan H, Birol I (2017) _ntCard: a streaming algorithm for cardinality estimation in genomics data_. **Bioinformatics**, 33(9):1324-1330. [doi:10.1093/bioinformatics/btw832](https://doi.org/10.1093/bioinformatics/btw832). Wedemeyer A, Kliemann L, Srivastav A, Schielke C, Reusch TB, Rosenstiel P (2017) _An improved filtering algorithm for big read datasets and its application to single-cell assembly_. **BMC Bioinformatics**, 18:324. [doi:10.1186/s12859-017-1724-7](https://doi.org/10.1186/s12859-017-1724-7). + + + + + + + + -- GitLab