ROCK
ROCK (Reducing Over-Covering K-mers) is a command line program written in C++ that implements an alternative version of the digital normalization approach (e.g. Brown et al. 2012, Wedemeyer et al. 2017, Durai and Schulz 2019).
Given one or several FASTQ file(s), the main aim of ROCK is to build a subset of accurate high-throughput sequencing (HTS) reads such that the induced coverage depth lies between two specified lower and upper bounds. ROCK can therefore be used to reduce and/or homogenize the overall coverage depth within large sets of HTS reads, which is often required to quickly infer accurate de novo genome assemblies (e.g. Desai et al. 2013, Chen et al. 2015). ROCK can also be used to discard low-covering HTS reads, as these are often artefactual, highly erroneous or contaminating sequences.
Compilation and installation
Prerequisites
First and foremost, clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/vlegrand/ROCK.git
ROCK is developed in C++98 and compatible with C++11. Compilation of the source code can be carried out using gcc (version ≥ 4.4.7) or clang (version ≥ 503.0.40), together with the tools make, autoconf (version 2.69) and automake (version 1.16).
Basic compilation and execution
An executable can be built by running the following command lines:
./autoconf
./automake
./configure
make check
make
The first three command lines generate the build scripts and then the files required by make to compile the source code.
The command make check (optional) compiles the source code and then runs all unit and non-regression tests.
If a problem occurs during this step, please report it by creating an issue at https://gitlab.pasteur.fr/vlegrand/ROCK.
The last command line, make, compiles the source code to build the executable src/rock.
The executable rock can be run with the following command line model:
./rock [options] [files]
Advanced compilation and installation
Better performance can be obtained by passing the -O3 optimization flag during the configuration step:
./configure CFLAGS='-O3' CXXFLAGS='-O3'
You can also specify the location where the executable will be installed (used by make install; see below), e.g.
./configure --prefix=$path_to_output_directory
Otherwise, the executable rock can be installed in the system default location (e.g. /usr/local/) by running the following additional command line (after the ones described above):
make install
Note that this last command line can require extra permissions (e.g. sudo).
Usage
Running ROCK without any option prints the following documentation:
Reducing Over-Covering K-mers within FASTQ file(s)
USAGE: rock [options] [files]
OPTIONS:
-i <file> file containing the name(s) of the input FASTQ file(s) to
process; single-end: one file name per line; paired-end:
two file names per line separated by a comma; up to 15
FASTQ file names can be specified; of note, input file
name(s) can also be specified as program argument(s)
-o <file> file containing the name(s) of the output FASTQ file(s);
FASTQ file name(s) should be structured in the same way as
the file specified in option -i.
-k <int> k-mer length (default 25)
-c <int> lower-bound k-mer coverage depth threshold (default: 0)
-C <int> upper-bound k-mer coverage depth threshold (default: 70)
-l <int> number of hashing function(s) (default: 4)
-n <int> expected total number of distinct k-mers within the input
read sequences; not compatible with option -l.
-f <float> maximum expected false positive probability when computing
the optimal number of hashing functions from the number of
distinct k-mers specified with option -n (default: 0.05).
-q <int> sets as valid only k-mers made up of nucleotides with
Phred score (+33 offset) above this cutoff (default: 0)
-m <int> minimum number of valid k-mer(s) to consider a read
(default: 1)
-v verbose mode
-h prints this message and exit
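For example, assuming paired-end FASTQ files whose names are listed (two comma-separated names per line) in a file of file names input_fof.txt, together with a matching output file of file names output_fof.txt, a typical invocation could look like the following command line; both file names and all option values are purely illustrative and should be adapted to the dataset at hand:
./rock -v -i input_fof.txt -o output_fof.txt -k 25 -c 2 -C 50 -n 500000000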
Notes
- In brief, given an upper-bound k-mer coverage depth cutoff κ (option -C) and a lower-bound k-mer coverage depth cutoff κ' (option -c), the aim of ROCK is to select an HTS read subset Sκ',κ such that the overall k-mer coverage depth induced by the members of Sκ',κ is expected to always lie between κ' and κ. When considering FASTQ files with high redundancy (i.e. coverage depth greater than κ), ROCK therefore returns smaller FASTQ files in which every HTS read corresponds to a genome region with a k-mer coverage depth of at most κ. Setting κ' (option -c) makes it possible to discard HTS reads associated with a k-mer coverage depth lower than this lower-bound cutoff, which is often observed with artefactual, highly erroneous or contaminating HTS reads.
- After creating an empty count-min sketch (CMS; see below) to store the number of occurrences of every canonical k-mer shared by the input HTS reads, ROCK proceeds in three main steps (a simplified sketch of this selection procedure is given after these notes):
- Sorting the input SE/PE HTS reads from the most to the least accurate ones (as assessed by the sum of their Phred scores).
- For each sorted SE/PE HTS read(s), approximating its k-mer coverage depth ck (defined by the median of its k-mer occurrence values, as returned by the CMS); if ck ≤ κ, then adding the SE/PE HTS read(s) into the subset Sκ',κ and updating the CMS for every corresponding canonical k-mer.
- (when the lower-bound cutoff κ' > 0) For each SE/PE HTS read(s) in Sκ',κ, (re)approximating its k-mer coverage depth ck; if ck ≤ κ', then removing the SE/PE HTS read(s) from Sκ',κ.
At the end, all SE/PE HTS read(s) inside Sκ',κ are written into output FASTQ file(s) (by default, with the file extension .rock.fastq). It is worth noticing that step 1 ensures that the HTS reads retained in the returned subset Sκ',κ are the most accurate ones.
- ROCK stores the number of occurrences of every traversed canonical k-mer in a count-min sketch (CMS; e.g. Cormode and Muthukrishnan 2005), a dedicated probabilistic data structure with controllable false positive probability (FPP); a minimal illustrative implementation is sketched after these notes. By default, ROCK instantiates a CMS based on four hashing functions (option -l), which can be sufficient for many cases, e.g. up to 10 billion canonical k-mers with κ > κ' > 1 and FPP ≤ 5%. However, as each hashing function is defined on [0, 2^32[, the memory footprint of the CMS is 4λ GB (when κ ≤ 255, twice as much otherwise), where λ is the total number of hashing functions. It is therefore highly recommended to provide the expected total number F0 of distinct canonical k-mers (option -n), so that ROCK can compute the optimal CMS dimension λ required to store this specified number of canonical k-mers with a low FPP (option -f); e.g. λ = 1 is sufficient to deal with up to 3 billion canonical k-mers when κ > κ' > 1 while ensuring FPP ≤ 5%. Moreover, a CMS based on few hashing functions entails faster running times. For instance, the programs KMC (Deorowicz et al. 2013, 2015; Kokot et al. 2017), KmerStream (Melsted and Halldórsson 2014) or ntCard (Mohamadi et al. 2017) can be used to quickly approximate this number F0.
- Of important note, each of the upper- and lower-bound cutoffs (options -C and -c, respectively) corresponds to a k-mer coverage depth value (denoted here ck), which is quite different from the base coverage depth value (denoted here cb). However, when L is the average input HTS read length, cb / ck and L / (L-k+1) are expected to be identical for any fixed small k (e.g. Liu et al. 2013). As a consequence, when an overall (base) coverage depth cb is expected, one can set κ = cb (L-k+1) / L. For example, when dealing with HTS reads of length L = 144 (on average), an HTS read subset with an expected base coverage depth cb = 60x can be inferred by ROCK by setting k = 25 (option -k) and κ = 60 (144-25+1) / 144 = 50 (option -C); a small helper implementing this conversion is sketched after these notes.
- By default, ROCK uses k-mers of length k = 25 (option -k). Increasing this length is not recommended when dealing with large FASTQ files (e.g. average coverage depth > 500x from a genome size > 1 Gbp), as the total number of canonical k-mers can quickly grow, therefore implying a very large CMS (i.e. many hashing functions) to maintain a low FPP (e.g. ≤ 0.05). Using small k-mers (e.g. k < 21) is also not recommended, as this can negatively affect the overall specificity (i.e. too many identical k-mers arising from different sequenced genome regions).
- All ROCK steps are based on the usage of valid k-mers, i.e. k-mers that do not contain any undetermined base N. Valid k-mers can also be restricted to those made up of bases associated with a Phred score greater than a specified threshold (option -q; Phred +33 offset, default: 0). A minimum number of valid k-mers can be required for a SE/PE HTS read(s) to be considered (option -m; default: 1).
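To make the role of the CMS more concrete, below is a minimal, illustrative count-min sketch written in C++; it is not ROCK's actual implementation. It keeps λ rows of saturating 8-bit counters (sufficient as long as κ ≤ 255), increments one counter per row when a canonical k-mer is inserted, and estimates an occurrence count as the minimum value over the rows. The row width (2^24 cells here, instead of the 2^32 mentioned above) and the seeded mixing function are simplifying assumptions.

// Minimal, illustrative count-min sketch (not ROCK's actual implementation).
// Each of the lambda rows holds 2^24 saturating 8-bit counters; the estimated
// number of occurrences of a k-mer is the minimum counter value over all rows.
#include <cstddef>
#include <cstdint>
#include <vector>

class CountMinSketch {
public:
    explicit CountMinSketch(std::size_t lambda, unsigned width_bits = 24)
        : width_(std::size_t(1) << width_bits),
          rows_(lambda, std::vector<std::uint8_t>(std::size_t(1) << width_bits, 0)) {}

    // Record one more occurrence of a canonical k-mer (encoded as a 64-bit integer).
    void insert(std::uint64_t kmer) {
        for (std::size_t i = 0; i < rows_.size(); ++i) {
            std::uint8_t &c = rows_[i][hash(kmer, i) % width_];
            if (c < 255) ++c;  // saturating counter: sufficient when kappa <= 255
        }
    }

    // Estimated occurrence count = minimum over the rows (may overestimate, never underestimates).
    std::uint8_t estimate(std::uint64_t kmer) const {
        std::uint8_t best = 255;
        for (std::size_t i = 0; i < rows_.size(); ++i) {
            std::uint8_t c = rows_[i][hash(kmer, i) % width_];
            if (c < best) best = c;
        }
        return best;
    }

private:
    // Simple seeded 64-bit mixer standing in for the lambda independent hash functions.
    static std::uint64_t hash(std::uint64_t x, std::size_t seed) {
        x ^= (static_cast<std::uint64_t>(seed) + 1) * 0x9E3779B97F4A7C15ULL;
        x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
        x ^= x >> 27; x *= 0x94D049BB133111EBULL;
        return x ^ (x >> 31);
    }

    std::size_t width_;
    std::vector<std::vector<std::uint8_t> > rows_;
};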
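The next sketch outlines the three-step selection procedure described in the notes, reusing the CountMinSketch class above. It is a deliberately simplified view under several assumptions: reads are held in memory, k-mers are extracted with a toy 2-bit encoder that only skips windows containing N (canonical k-mers and the -q/-m validity rules are not reproduced), the third step is only indicated as a comment, and all helper names are hypothetical.

// Simplified sketch of the selection procedure (hypothetical helper names;
// reuses the CountMinSketch class defined in the previous sketch).
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Read {
    std::string seq;   // nucleotide sequence
    std::string qual;  // Phred+33 quality string
};

// Sum of Phred scores: used to sort the reads from most to least accurate.
static long phredSum(const Read &r) {
    long s = 0;
    for (std::size_t i = 0; i < r.qual.size(); ++i) s += r.qual[i] - 33;
    return s;
}

static bool moreAccurate(const Read &a, const Read &b) { return phredSum(a) > phredSum(b); }

// Toy 2-bit encoder returning the k-mers of a read, skipping any window containing N.
static std::vector<std::uint64_t> kmersOf(const std::string &seq, unsigned k) {
    std::vector<std::uint64_t> out;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        std::uint64_t code = 0;
        bool valid = true;
        for (unsigned j = 0; j < k && valid; ++j) {
            switch (seq[i + j]) {
                case 'A': code = (code << 2) | 0; break;
                case 'C': code = (code << 2) | 1; break;
                case 'G': code = (code << 2) | 2; break;
                case 'T': code = (code << 2) | 3; break;
                default:  valid = false;  // undetermined base: this k-mer is not valid
            }
        }
        if (valid) out.push_back(code);
    }
    return out;
}

// k-mer coverage depth of a read = median of its k-mer occurrence values returned by the CMS.
static std::uint8_t medianDepth(const std::vector<std::uint64_t> &kmers, const CountMinSketch &cms) {
    if (kmers.empty()) return 0;
    std::vector<std::uint8_t> counts;
    for (std::size_t i = 0; i < kmers.size(); ++i) counts.push_back(cms.estimate(kmers[i]));
    std::sort(counts.begin(), counts.end());
    return counts[counts.size() / 2];
}

// Steps 1 and 2: keep a read only if its current median k-mer depth does not exceed kappa.
std::vector<Read> selectReads(std::vector<Read> reads, unsigned k, std::uint8_t kappa, CountMinSketch &cms) {
    std::sort(reads.begin(), reads.end(), moreAccurate);   // step 1: most accurate reads first
    std::vector<Read> subset;
    for (std::size_t i = 0; i < reads.size(); ++i) {
        std::vector<std::uint64_t> kmers = kmersOf(reads[i].seq, k);
        if (medianDepth(kmers, cms) <= kappa) {             // step 2: depth test against kappa
            subset.push_back(reads[i]);
            for (std::size_t j = 0; j < kmers.size(); ++j) cms.insert(kmers[j]);  // update the CMS
        }
    }
    // Step 3 (when kappa' > 0, omitted here): re-estimate the median k-mer depth of
    // each kept read and remove it from the subset if that depth is <= kappa'.
    return subset;
}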
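Finally, the conversion from an expected base coverage depth cb to the corresponding upper-bound cutoff κ (option -C) is a direct transcription of the formula given in the notes; the function name and the choice to round down are illustrative only.

// kappa = cb * (L - k + 1) / L; e.g. cbToKappa(60.0, 144, 25) returns 50.
#include <cmath>

unsigned cbToKappa(double cb, unsigned L, unsigned k) {
    return static_cast<unsigned>(std::floor(cb * (L - k + 1) / L));
}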
References
Brown CT, Howe A, Zhang Q, Pyrkosz AB, Brom TH (2012) A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv:1203.4802v2.
Chen TW, Gan RC, Chang YF, Liao W-C, Wu TH, Lee C-C, Huang P-J, Lee C-Y, Chen Y-YM, Chiu CH, Tang P (2015) Is the whole greater than the sum of its parts? De novo assembly strategies for bacterial genomes based on paired-end sequencing. BMC Genomics, 16:648. doi:10.1186/s12864-015-1859-8.
Cormode G, Muthukrishnan S (2005) An Improved Data Stream Summary: The Count-Min Sketch and its Applications. Journal of Algorithms, 55:29-38. doi:10.1016/j.jalgor.2003.12.001.
Deorowicz S, Debudaj-Grabysz A, Grabowski S (2013) Disk-based k-mer counting on a PC. BMC Bioinformatics, 14:160. doi:10.1186/1471-2105-14-160.
Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569-1576. doi:10.1093/bioinformatics/btv022.
Desai A, Marwah VS, Yadav A, Jha V, Dhaygude K, Bangar U, Kulkarni V, Jere A (2013) Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data. PLoS ONE, 8(4):e60204. doi:10.1371/journal.pone.0060204.
Durai DA, Schulz MH (2019) Improving in-silico normalization using read weights. Scientific Reports, 9:5133. doi:10.1038/s41598-019-41502-9.
Kokot M, Długosz M, Deorowicz S (2017) KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759-2761. doi:10.1093/bioinformatics/btx304.
Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, Li Z, Chen Y, Mu D, Fan W (2013) Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv:1308.2012v2.
Melsted P, Halldórsson BV (2014) KmerStream: streaming algorithms for k-mer abundance estimation. Bioinformatics, 30(24):3541-3547. doi:10.1093/bioinformatics/btu713.
Mohamadi H, Khan H, Birol I (2017) ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics, 33(9):1324-1330. doi:10.1093/bioinformatics/btw832.
Wedemeyer A, Kliemann L, Srivastav A, Schielke C, Reusch TB, Rosenstiel P (2017) An improved filtering algorithm for big read datasets and its application to single-cell assembly. BMC Bioinformatics, 18:324. doi:10.1186/s12859-017-1724-7.