---
title: 'ROCK: digital normalization of whole genome sequencing data'
tags:
- C++
- high-throughput sequencing
- digital normalization
- _k_-mer
authors:
  - affiliation: 1
    name: Véronique Legrand
  - affiliation: "2, 3"
    name: Thomas Kergrohen
  - affiliation: 1
    name: Nicolas Joly^[deceased]
  - affiliation: "4, 5"
    name: Alexis Criscuolo^[corresponding author]
    orcid: 0000-0002-8212-5215
affiliations:
  - index: 1
    name: Institut Pasteur, Université Paris Cité, Plateforme HPC, F-75015 Paris, France
  - index: 2
    name: Prédicteurs moléculaires et nouvelles cibles en oncologie, INSERM, Gustave Roussy, Université Paris-Saclay, Villejuif, France
  - index: 3
    name: Département de Cancérologie de l'Enfant et de l'Adolescent, Gustave Roussy, Université Paris-Saclay, Villejuif, France
  - index: 4
    name: Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
  - index: 5
    name: Institut Pasteur, Université Paris Cité, Plateforme de Microbiologie Mutualisée (P2M), F-75015 Paris, France
date: "7 May 2022"
bibliography: paper.bib
---

# Summary

Due to advances in high-throughput sequencing technologies, generating whole genome sequencing (WGS) data with high coverage depth has become a fast and routine task; such deep sequencing, however, yields highly redundant datasets.
To reduce redundancy within a WGS dataset, randomly downsampling the high-throughput sequencing reads (HTSRs) is trivial. Nevertheless, this obvious strategy is not efficient, as it does not minimize the variation in sequencing depth, thereby eroding the coverage depth of genome regions that are under-covered (if any). To cope with this problem, a simple greedy algorithm, named digital normalization, was designed to efficiently downsample HTSRs over genome regions that are over-covered [@Brown:2012]. Given an upper-bound threshold \kappa, this algorithm considers the HTSRs one by one and selects each of them only when the median number of occurrences of its k-mers among the already selected HTSRs does not exceed \kappa.
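
Writing \eta(w) for the number of occurrences of a k-mer w among the previously selected HTSRs, this selection rule (detailed in the Implementation section) can be summarized as:

$$
r \textrm{ is selected} \quad \Longleftrightarrow \quad \operatorname{median}\,\{\, \eta(w) : w \textrm{ is a } k\textrm{-mer of } r \,\} \;\leq\; \kappa .
$$
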
Interestingly, the digital normalization algorithm can easily be enhanced in several ways, so that the final subset contains fewer but higher-quality HTSRs.
Unfortunately, these different improvements are scattered across distinct programs.
ROCK (Reducing Over-Covering K-mers) was therefore developed with the key purpose of implementing a fast, accurate and easy-to-use digital normalization procedure.
The C++ source code is available under the GNU Affero General Public License v3.0 at https://gitlab.pasteur.fr/vlegrand/ROCK.

# Statement of need

The digital normalization algorithm is based on a count-min sketch (CMS), a probabilistic data structure able to store the number of occurrences of the canonical k-mers (i.e. short oligonucleotides of fixed length k) composing the selected HTSRs. Noteworthy implementations of this algorithm include the Python script normalize-by-median from the package khmer [@Crusoe:2015], the Java program BBnorm from the package BBTools [@Bushnell:2014], and the C++ program BigNorm [@Wedemeyer:2017].
Of note, since the HTSRs are selected stepwise, the digital normalization algorithm may be improved by first considering the longest HTSRs, therefore yielding a subset containing fewer but longer HTSRs; such a strategy is implemented in ORNA [@Durai:2019].
Finally, given a small lower-bound threshold \kappa', a second pass can be performed on the selected HTSRs to also discard those whose k-mers remain infrequent (e.g. HTSRs containing many sequencing errors); such a refinement is implemented in BBnorm.
The main purpose of ROCK
is to pool these complementary strategies to obtain a complete and efficient digital normalization procedure.
It was designed to be used as a preprocessing step prior to performing fast genome de novo assembly.
ROCK
is a command-line program that may consider up to 15 input FASTQ files containing HTSRs from different single-end (SE) or paired-end (PE) sequencing.
Both lower and upper thresholds \kappa' and \kappa can be set by the user, as well as the size of the CMS; ROCK is also able to automatically calculate the optimal CMS size, provided that the number of distinct canonical k-mers is specified.

# Implementation

ROCK first reads the input FASTQ file(s) to store the location of each HTSR into a container \textbf{M} based on `std::map`, where \textbf{M}[x][y] corresponds to the y-th HTSR(s) of quality x.
The CMS is a two-dimensional array \textbf{C} of \lambda rows, each associated with a dedicated hashing function h_i (i\! =\! 1, \dots, \lambda) and containing m counters, where m is set to `UINT_MAX` (i.e. 2^{32}\! -\! 1).
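
As an illustration only, a minimal sketch of such a container, assuming a hypothetical read-location record and a precomputed quality score, could look like this:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical location of a SE/PE HTSR within the input FASTQ file(s).
struct ReadLocation {
    std::uint64_t offset;   // byte offset of the FASTQ record(s)
    std::uint32_t length;   // total number of bases
};

// M[x] holds every HTSR of quality x; std::map keeps keys sorted, so a
// reverse traversal goes from the highest to the lowest quality.
using ReadIndex = std::map<std::uint64_t, std::vector<ReadLocation>>;

void traverse_by_decreasing_quality(const ReadIndex& M) {
    for (auto it = M.rbegin(); it != M.rend(); ++it)
        for (const ReadLocation& r : it->second) {
            // ... query/update the CMS and decide whether r is selected ...
            (void)r;
        }
}
```
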
Every k-mer is associated with one entry \textbf{C}[i][j] for each row i, where j is the hashing value returned by h_i for that k-mer.
To add a canonical k-mer occurrence into the CMS, each of its \lambda associated entries is incremented by one.
As a consequence, when \kappa\! \leq\! 255, the CMS is a two-dimensional array of `unsigned char` with a memory footprint of 4\lambda\! GB; when \kappa\! >\! 255, the `unsigned short` type is used instead, thereby doubling the memory footprint.
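
For illustration, a toy CMS with \lambda rows of saturating 8-bit counters, in the spirit of the description above, might be sketched as follows (the hash family used here is a crude stand-in; it is not ROCK's actual implementation of the h_i functions):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy count-min sketch: lambda rows of m 8-bit counters (suitable when kappa <= 255).
class CountMinSketch {
public:
    CountMinSketch(std::size_t lambda, std::size_t m)
        : lambda_(lambda), m_(m), rows_(lambda, std::vector<std::uint8_t>(m, 0)) {}

    // Increment the lambda entries associated with a canonical k-mer.
    void add(const std::string& kmer) {
        for (std::size_t i = 0; i < lambda_; ++i) {
            std::uint8_t& c = rows_[i][hash(kmer, i) % m_];
            if (c < 255) ++c;                    // saturating counter
        }
    }

    // Estimated number of occurrences: minimum over the lambda associated entries.
    std::uint8_t count(const std::string& kmer) const {
        std::uint8_t eta = 255;
        for (std::size_t i = 0; i < lambda_; ++i)
            eta = std::min(eta, rows_[i][hash(kmer, i) % m_]);
        return eta;
    }

private:
    // Placeholder hash family standing in for the h_i functions.
    static std::size_t hash(const std::string& kmer, std::size_t i) {
        return std::hash<std::string>{}(kmer) * (2 * i + 1);
    }

    std::size_t lambda_, m_;
    std::vector<std::vector<std::uint8_t>> rows_;
};
```
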
By definition, the number of occurrences of any given canonical k-mer is estimated from the CMS as the minimum value \tilde{\eta} among its \lambda associated entries.
However, \tilde{\eta} can be strictly higher than the true number \eta of occurrences with a probability p_\textrm{\tiny FP} [@Cormode:2005], henceforth a false positive (FP) probability.
Following @Kim:2019, when \lambda\!=\! 1, the probability that an arbitrary entry of \textbf{C} is strictly higher than a threshold \theta after inserting n distinct k-mers is \pi_{\theta,n}\! =\! 1\! -\! \sum_{z \leq \theta} b_{n,1/m}(z), where b_{n,p} is the probability mass function of the binomial distribution B(n,p).
Therefore, one gets p_\textrm{\tiny FP}\! = \! \pi_{\theta,n}^\lambda with \theta\! =\! \kappa (or \theta\! =\! \kappa' if \kappa'\! \ne\! 0).
By default, ROCK uses \lambda\! =\! 4 and a FP probability cutoff \tau\! =\! 5%, which can cope with up to n\! =\! 10 billion distinct canonical k-mers while verifying p_\textrm{\tiny FP}\! <\! \tau for any \kappa\! >\! \kappa'\! >\! 1.
However, when the number n of distinct canonical k-mers is provided by the user, ROCK
automatically computes the smallest integer \lambda such that \pi_{\theta,n}^\lambda\! <\! \tau, which often results in \lambda\! =\! 1 with WGS data from small genomes (see Example).
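
How ROCK carries out this computation internally is not detailed here; the following sketch illustrates the principle under the standard assumption that the B(n, 1/m) tail can be approximated by a Poisson(n/m) tail, n being large and 1/m small:

```cpp
#include <cmath>
#include <iostream>

// Tail probability P(X > theta) for X ~ Poisson(mu), used as an
// approximation of the Binomial(n, 1/m) tail (pi_{theta,n} in the text).
double poisson_tail(double mu, unsigned theta) {
    double term = std::exp(-mu);   // P(X = 0)
    double cdf = term;
    for (unsigned z = 1; z <= theta; ++z) {
        term *= mu / z;            // P(X = z) from P(X = z - 1)
        cdf += term;
    }
    return 1.0 - cdf;
}

// Smallest lambda such that pi_{theta,n}^lambda < tau.
unsigned smallest_lambda(double n, double m, unsigned theta, double tau) {
    const double pi = poisson_tail(n / m, theta);
    unsigned lambda = 1;
    double p_fp = pi;
    while (p_fp >= tau) { ++lambda; p_fp *= pi; }
    return lambda;
}

int main() {
    const double m = 4294967295.0;   // number of counters per CMS row (UINT_MAX)
    // n estimated by ntCard in the Example section, theta = kappa = 45, tau = 5%:
    std::cout << smallest_lambda(105584331.0, m, 45, 0.05) << "\n";   // prints 1
}
```
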
After instantiating a CMS with all entries set to 0, ROCK
performs the digital normalization procedure by traversing \textbf{M}[x][y] from the highest to the lowest quality x.
For every SE/PE HTSR(s) r, the number \alpha of its canonical k-mers with occurrence value \tilde{\eta}\! >\! \kappa (as returned by querying the CMS) is computed, and next compared to the total number \beta of its canonical k-mers.
If 2 \alpha\! >\! \beta, then the median of the k-mer occurrence values is strictly higher than \kappa, and r is/are not selected; otherwise, r is/are selected and the corresponding k-mers are added into the CMS.
When \kappa'\! \ne\! 0, the second pass is next performed on the selected HTSRs following the same approach.
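
Under these definitions, the selection test for one SE/PE HTSR r may be sketched as below (a minimal illustration reusing the toy sketch above; the function name and interface are ours, not ROCK's API):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Median test: r is discarded when more than half of its canonical k-mers
// already occur more than kappa times; otherwise r is selected and its
// k-mers are recorded. Sketch is any type exposing count() and add(),
// e.g. the toy CountMinSketch shown earlier.
template <typename Sketch>
bool select_and_update(Sketch& cms,
                       const std::vector<std::string>& canonical_kmers,
                       unsigned kappa) {
    std::size_t alpha = 0;                           // k-mers with count > kappa
    const std::size_t beta = canonical_kmers.size();
    for (const std::string& kmer : canonical_kmers)
        if (cms.count(kmer) > kappa) ++alpha;
    if (2 * alpha > beta)                            // median occurrence > kappa
        return false;                                // r is not selected
    for (const std::string& kmer : canonical_kmers)
        cms.add(kmer);                               // r selected: record its k-mers
    return true;
}
```
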
ROCK runs the entire procedure using a single thread.
At the end, all selected HTSRs are written into FASTQ-formatted output file(s).
Of note, to avoid mixing up HTSRs that derive from different WGS strategies (e.g. different insert sizes), each input file corresponds to its own output file.

# Example

To illustrate the performance of ROCK on a real high-coverage WGS dataset, we considered the Illumina PE sequencing (run accession SRR2079909) of the _Clostridium botulinum_ ATCC 9564 genome (assembly accession GCA_001273225), of length l\! =\! 3,813,606 base pairs (bps).
The two FASTQ files were first processed using AlienTrimmer
[@Criscuolo:2013] to trim off 5'/3' regions containing many sequencing errors (i.e. Phred score cutoff Q \! =\! 20) or stretches of identical nucleotides.
After discarding HTSRs that were too short (i.e. <\! 100 bps) or unpaired, a total of 4,973,401 PE HTSRs remained (2,756,008,939 bps, average length \ell\! =\! 277 bps), corresponding to a coverage depth of ~722\times.
To reduce this high coverage depth to c \! = \! 50\times, ROCK
(v1.9.6) was run with k \! =\! 25 and \kappa\! =\! c / \varepsilon\! \approx\! 45.
To assess the optimal CMS size \lambda, the total number n\! =\! 105,584,331 of distinct canonical k-mers was estimated using ntCard [@Mohamadi:2017], and next specified to ROCK, leading to \lambda\! =\! 1.
In theory, using \kappa\! =\! 45 is expected to yield a subset of cl/(2\ell)\! \approx\! 344,000 PE HTSRs totaling cl \! \approx\! 191 Mbps.
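
Here, \varepsilon can be taken as the usual factor \ell / (\ell\! -\! k\! +\! 1) converting base coverage depth into k-mer coverage depth; with this reading, the values above are recovered as follows:

$$
\kappa \;=\; \frac{c}{\varepsilon} \;=\; c \, \frac{\ell - k + 1}{\ell} \;=\; 50 \times \frac{253}{277} \;\approx\; 45.7,
\qquad
\frac{c\,l}{2\,\ell} \;=\; \frac{50 \times 3{,}813{,}606}{2 \times 277} \;\approx\; 344{,}000,
\qquad
c\,l \;\approx\; 191 \textrm{ Mbps}.
$$
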
The results obtained are summarized in Table \ref{Table 1}.
Table: Running times (min:sec), and numbers of PE HTSRs and corresponding base pairs (bps) returned by ROCK (on 1 thread) with \kappa\! =\! 45 and varying \kappa' values. \label{Table 1}

| $~$              | **run. time** | **no. PE HTSRs** | **no. bps**  |
|:----------------:|:-------------:|:----------------:|:------------:|
| \kappa'\! =\! 0  | 3:16          | 424,040          | 244,188,971  |
| \kappa'\! =\! 2  | 3:22          | 386,665          | 227,069,995  |
| \kappa'\! =\! 4  | 3:23          | 371,834          | 220,241,468  |
| \kappa'\! =\! 6  | 3:24          | 370,336          | 219,612,591  |
| \kappa'\! =\! 8  | 3:24          | 369,994          | 219,471,025  |
For the sake of comparison, standard digital normalizations (\kappa\! =\! 45 and k\! =\! 25) were also carried out on the same computer (AMD Epyc 2.2 GHz processor, 128 GB RAM) using other dedicated tools (see Statement of need).
They returned slightly larger subsets, containing from 498,899 (275 Mbps; BBnorm) to 693,978 (367 Mbps; normalize-by-median) PE HTSRs, with slower running times as compared to ROCK, varying from 5:17 (BBnorm on 12 threads) to 21:29 (normalize-by-median).

# Authors' contribution

AC devised the project, the main conceptual ideas and algorithmic improvements.
VL coded most parts of ROCK.
NJ implemented the data structures with VL, and participated in the coding.
TK and VL ran benchmarks to assess and improve the overall performance of ROCK.
All authors provided critical feedback to optimize the program.
AC wrote the manuscript in consultation with VL and TK.