AlienRemover
AlienRemover is a command line program written in Java that quickly removes alien (i.e. undesirable) high-throughput sequencing (HTS) reads from FASTQ files. An HTS read is considered alien when it likely arises from specified alien sequences (e.g. host genome, cloning vectors, ΦX174, ...).
Installation
Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/AlienRemover.git
Compilation and execution
The source code of AlienRemover is inside the src directory. It requires Java 13 (or higher) to be compiled.
Building an executable jar file
On computers with Oracle JDK (13 or higher) installed, a Java executable jar file can be created by typing the following command lines inside the src directory:
javac LHBF.java AlienRemover.java
echo Main-Class: AlienRemover > MANIFEST.MF
jar -cmvf MANIFEST.MF AlienRemover.jar AlienRemover.class LHBF.class
rm MANIFEST.MF AlienRemover.class LHBF.class
This will create the executable jar file AlienRemover.jar, which can be run with the following command line model:
java -jar AlienRemover.jar [options]
Usage
Run AlienRemover without any option to display the following documentation:
AlienRemover
Fast removal of alien reads (contaminant, host, ...) from FASTQ file(s)
USAGE:
AlienRemover -a <alienfile> [-b <modelfile>] [-o <basename>] [-k <int>]
AlienRemover -a <alienfile> -i <FASTQ> [-o <basename>] [-k <int>] [-c <float>] [-p <float>] [...]
AlienRemover -a <alienfile> -1 <FASTQ> -2 <FASTQ> [-o <basename>] [-k <int>] [-c <float>] [-p <float>] [...]
OPTIONS:
-a <infile> FASTA file containing alien sequence(s); filename should end with .gz when gzipped
-a <infile> input file containing alien k-mers generated by AlienRemover from FASTA-formatted alien
sequence(s); filename should end with .kmr or .kmz
-i <infile> [SE] FASTQ-formatted input file; filename should end with .gz when gzipped
-1 <infile> [PE] FASTQ-formatted R1 input file; filename should end with .gz when gzipped
-2 <infile> [PE] FASTQ-formatted R2 input file; filename should end with .gz when gzipped
-o <name> outfile basename; output files have the following extensions:
+ alien k-mers: <name>.km<r|z>
+ SE reads: <name>.fastq[.gz] (.gz is added when using option -z)
+ PE reads: <name>.1.fastq[.gz] <name>.2.fastq[.gz] (.gz is added when using option -z)
-k [10-31] k-mer length for alien sequence occurence searching; must lie between 10 and 31 (default: 25)
-p <float> Bloom filter false positive probability cutoff (default: 0.05)
-n <integer> expected number of canonical k-mers (default: estimated from the alien file size)
-l use less bits and more hashing functions, whenever possible (default: not set)
-c <float> criterion to remove a read (default: 0.15)
-s compute Bloom filter statistics (default: not set)
-w write Bloom filter into output file (default: not set)
-r write removed reads into output file(s) (default: not set)
-z gzipped output files (default: not set)
EXAMPLES:
AlienRemover -a alien.fasta -o alien -k 30
AlienRemover -a alien.kmr -i reads.fastq -o flt_reads --p64 -z
AlienRemover -a alien.kmr -1 r1.fq -2 r2.fq -c 0.3 -r
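The interplay between options -n and -p can be illustrated with the textbook approximation for a Bloom filter that uses a single hash function: storing n items in m bits gives a false positive probability of about 1 - exp(-n/m). The Python sketch below is not taken from the AlienRemover source code; the function names are hypothetical, and the power-of-two sizing is an assumption made here because it yields footprints in line with the figures listed in the Notes table.

```python
import math


def one_hash_fpp(n_kmers: int, n_bits: int) -> float:
    """Approximate false positive probability of a Bloom filter storing
    n_kmers items in n_bits bits with a single hash function."""
    return 1.0 - math.exp(-n_kmers / n_bits)


def min_bits_one_hash(n_kmers: int, fpp: float = 0.05) -> int:
    """Smallest power-of-two number of bits keeping the one-hash FPP below fpp.
    Power-of-two sizing is an assumption of this sketch."""
    bits = 1
    while one_hash_fpp(n_kmers, bits) >= fpp:
        bits *= 2
    return bits


if __name__ == "__main__":
    # Values of n taken from the first rows of the Notes table.
    for n in (220_000_000, 440_000_000, 880_000_000):
        bits = min_bits_one_hash(n)
        gigabytes = bits / 8 / 2**30
        print(f"n={n:,}  bits=2^{bits.bit_length() - 1}  "
              f"memory={gigabytes:.1f} GB  FPP={one_hash_fpp(n, bits):.4f}")
```

Under these assumptions, 220,000,000 k-mers need 2^32 bits (0.5 gigabytes) to stay below the default cutoff of 0.05, and each doubling of n doubles the footprint.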
Notes
- In brief, AlienRemover first stores every alien k-mer into a (less-hash) Bloom filter. Next, AlienRemover determines the alien k-mer content of each HTS read by querying the Bloom filter, and removes the reads that are made up of a large proportion of alien k-mers (as ruled by option -c).
- When the alien (genome) sequence is short (e.g. < 2 Mb), it can be directly set via option -a without affecting the overall running times. For larger alien sequences, it is recommended to first build and save the set of alien k-mers (i.e. the Bloom filter) via option -w; such an alien k-mer file (<name>.kmr, or <name>.kmz when using option -z) can then be directly used via option -a for faster running times.
- When creating an alien k-mer file (option -w), it is highly recommended to set the (approximate) total number of distinct canonical k-mers in the alien sequences. For instance, the programs KMC or KmerStream (F0) can be used to quickly approximate this number. AlienRemover will use such a number (option -n) to efficiently compute a (less-hash) Bloom filter with optimal dimensions for storing every alien k-mer.
- To obtain faster running times, a simple trick is to force AlienRemover to use only one hashing function (at the cost of a larger memory footprint) by setting option -n with some large value. In practice, for dealing with a total of x canonical k-mers with only one hashing function (while keeping false positive probability FPP < 5%), use the lowest integer value (if any) that is greater than x from the table below:
integer value | -n | memory footprint |
---|---|---|
220,000,000 | 220000000 | 0.5 GB |
440,000,000 | 440000000 | 1 GB |
880,000,000 | 880000000 | 2 GB |
1,700,000,000 | 1700000000 | 4 GB |
3,500,000,000 | 3500000000 | 8 GB |
- Conversely, to reduce the overall memory footprint, use option -l to (try to) increase the number of hashing functions.
- Default options are expected to give accurate results in most cases, especially the k-mer length (i.e. 25) and the removal criterion cutoff (i.e. 0.15). Increasing the removal criterion cutoff (option -c; e.g. 0.25) ensures that the True Positive Rate (TPR; i.e. the proportion of non-alien reads that are not removed) will be very close to 100%, but at the cost of a significant increase of the False Positive Rate (FPR; i.e. the proportion of alien reads that are not removed). Decreasing the removal criterion cutoff (e.g. < 0.15) is not recommended, as this causes a rapid decrease of the TPR.
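The removal criterion described in these notes can be sketched in a few lines of Python. This is only an illustration, not AlienRemover's implementation: a plain Python set stands in for the Bloom filter, all function names are hypothetical, and several details are assumptions (canonical k-mer taken as the lexicographically smaller of a k-mer and its reverse complement, k-mers containing non-ACGT characters skipped, a strict "greater than" comparison against the cutoff).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int):
    """Yield the canonical form of every k-mer of seq (assumed convention:
    the smaller of the k-mer and its reverse complement)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: skip k-mers with N, etc.
            yield min(kmer, reverse_complement(kmer))


def alien_fraction(read: str, alien_kmers: set, k: int) -> float:
    """Proportion of the read's k-mers that are found among the alien k-mers."""
    total = hits = 0
    for kmer in canonical_kmers(read, k):
        total += 1
        hits += kmer in alien_kmers
    return hits / total if total else 0.0


def is_alien(read: str, alien_kmers: set, k: int = 25, cutoff: float = 0.15) -> bool:
    """Flag a read for removal when its alien k-mer proportion exceeds cutoff."""
    return alien_fraction(read, alien_kmers, k) > cutoff


if __name__ == "__main__":
    alien = "ACGTACGGTTCAGATCCGATTGACCTGAAG"   # toy alien sequence
    k = 10                                     # smallest value allowed by -k
    alien_kmers = set(canonical_kmers(alien, k))
    for read in (alien[5:25], reverse_complement(alien[5:25]), "T" * 20):
        print(read, alien_fraction(read, alien_kmers, k), is_alien(read, alien_kmers, k))
```

Because k-mers are stored in canonical form, a read sampled from either strand of the alien sequence is detected. With a real Bloom filter, some non-alien k-mers would also be reported as present (at a rate bounded by option -p), which is one reason why a read is only removed above a proportion cutoff rather than on a single hit.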