GPLv3 license JAVA publication

AlienTrimmer

AlienTrimmer is a command-line program written in Java that quickly trims off non-confident bases and alien oligonucleotide sequences (e.g. adapters, primers) from sequencing reads. For more details, see the associated publication (Criscuolo and Brisse 2013).

Since AlienTrimmer v2.0, this GitLab repository replaces the previous ftp repository.

AlienTrimmer v2.0 (and higher) implements the same alien k-mer search algorithm as the initial release (see S1.2 here), but uses slightly different trimming criteria (see below). Of note, AlienTrimmer v2.0 has simplified options, runs noticeably faster (especially when compiled with GraalVM), and is able to read gzipped FASTQ files.

Note that alien oligo-nucleotide sequence(s) can be easily inferred without any prior knowledge using AlienDiscover.

Installation

Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/AlienTrimmer.git

Compilation and execution

The source code of AlienTrimmer is inside the src directory. It requires Java 11 (or higher) and can be compiled and executed in two different ways.

Building an executable jar file

On computers with Oracle JDK (11 or higher) installed, a Java executable jar file can be created. In a command-line window, go to the src directory and type:

javac AlienTrimmer.java 
echo Main-Class: AlienTrimmer > MANIFEST.MF 
jar -cmvf MANIFEST.MF AlienTrimmer.jar AlienTrimmer.class 
rm MANIFEST.MF AlienTrimmer.class 

This will create the executable jar file AlienTrimmer.jar that can be run with the following command line model:

java -jar AlienTrimmer.jar [options]

Building a native code binary

On computers with GraalVM installed, a native executable can be built. In a command-line window, go to the src directory, and type:

javac AlienTrimmer.java 
native-image AlienTrimmer AlienTrimmer
rm AlienTrimmer.class

This will create the native executable AlienTrimmer that can be run with the following command line model:

./AlienTrimmer [options]

Usage

Run AlienTrimmer without any option to display the following documentation:

 AlienTrimmer

 Fast trimming to  filter out  non-confident  nucleotides and alien  oligo-nucleotide sequences
 (adapters, primers) in both 5' and 3' read ends
 Criscuolo and Brisse (2013) doi:10.1016/j.ygeno.2013.07.011

 USAGE:  AlienTrimmer  [options]

 OPTIONS:
    -i <infile>   [SE] FASTQ-formatted input file; filename should end with .gz when gzipped
    -1 <infile>   [PE] FASTQ-formatted R1 input file; filename should end with .gz when gzipped
    -2 <infile>   [PE] FASTQ-formatted R2 input file; filename should end with .gz when gzipped
    -a <infile>   [SE/PE] input file name containing alien sequence(s);  one line per sequence;
                  lines starting with '>', '%' or '#' are not considered
  --a1 <infile>   [PE] same as -a for only R1 reads
  --a2 <infile>   [PE] same as -a for only R2 reads
    -o <name>     outfile basename: [SE] <name>.fastq[.gz]  or  [PE] <name>.{1,2,S}.fastq[.gz];
                  .gz is added when using option -z
    -k [5-15]     k-mer length k for alien sequence occurrence searching; must lie between 5 and
                  15 (default: 10)
    -q [0-40]     Phred quality score  cutoff to define  low-quality bases;  must lie between 0
                  and 40 (default: 13)
    -m <int>      maximum no. allowed successive  non-troublesome bases in the 5'/3' regions to
                  be trimmed (default: k-1)
    -l <int>      minimum allowed read length (default: 50)
    -p [0-100]    maximum allowed percentage of low-quality bases per read (default: 50)
 --p64            Phred+64 FASTQ input file(s) (default: Phred+33)
    -z            gzipped output files (default: not set)
    -v            verbose mode (default: not set)

 EXAMPLES:
  [SE]  AlienTrimmer -i reads.fq -a aliens.fa -o trim -l 30 -p 20
  [SE]  AlienTrimmer -i reads.fq.gz -a aliens.txt -o trim -k 9 -q 13
  [PE]  AlienTrimmer -1 r1.fq -2 r2.fq -a aliens.fa -o trim -m 8 -p 25 -v
  [PE]  AlienTrimmer -1 r1.fq.gz -2 r2.fq.gz --a1 fwd.fa --a2 rev.fa -o trim -z
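
The alien file format (option -a) and the k-mer length (option -k) described above can be illustrated with a minimal Python sketch. It is not AlienTrimmer's actual code: the function names are hypothetical, and the handling of blank lines, letter case and reverse complements is an assumption (blank lines are skipped, sequences are upper-cased, reverse complements are not generated). The sequences in the example are made up.

```python
def read_alien_sequences(lines):
    """Return one sequence per line, ignoring lines that start with '>', '%' or '#'."""
    sequences = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines are skipped as well.
        if not line or line[0] in ">%#":
            continue
        sequences.append(line.upper())  # Assumption: case is normalized.
    return sequences


def alien_kmers(sequences, k=10):
    """Collect the distinct k-mers (5 <= k <= 15) occurring in the alien sequences."""
    if not 5 <= k <= 15:
        raise ValueError("k must lie between 5 and 15")
    kmers = set()
    for seq in sequences:
        # A sequence shorter than k contributes no k-mer.
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


# Hypothetical alien file content (not the example/aliens.fa file).
example_lines = [
    ">made-up oligo",
    "ACGTACGTACGTA",
    "# a comment line",
    "AAAAAAAAAAAA",
]
example_sequences = read_alien_sequences(example_lines)
example_kmers = alien_kmers(example_sequences, k=10)
```

Here the 13-base oligo yields four distinct 10-mers and the poly-A segment a single one, so `example_kmers` holds five k-mers. The "no. alien k-mers" value reported by AlienTrimmer in verbose mode may be computed differently.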

Notes

  • Since v2.0, trimming criteria to filter out troublesome bases (i.e. alien k-mers and/or non-confident bases) have been modified:

    ## 3' 
    read:         ....ACGTACTACGTACGTAGTACGTACGT
    troublesome:       *      *  ************
    trimming:                 <<<<<<<<<<<<<<<<<<
                              |             |   |
                              e             t   l
    
    t: index of the last troublesome base
    e: index <= t such that base e is troublesome
    l: read length
    
    AlienTrimmer trims off the longest suffix [e,l-1] such that:
     + the proportion x of troublesome bases in [e,l-1] is > 50%
     + no more than m successive non-troublesome bases occur in [e,t] (option -m)
    
    ## 5'  
    read:         ACGTACTACGTACGTAGTACGTACGT....
    troublesome:     * ****** *  *      *
    trimming:     >>>>>>>>>>>>>>>>
                     |            |
                     t            s
    
    t: index of the first troublesome base
    s: index >= t (and < e) such that base s-1 is troublesome
    
    AlienTrimmer trims off the longest prefix [0,s-1] such that:
     + the proportion x of troublesome bases in [0,s-1] is > 50%
     + no more than m successive non-troublesome bases occur in [t,s-1] (option -m)
  • Reading gzipped FASTQ files does not seem to slow down the overall running time. However, writing gzipped FASTQ output files (option -z) makes AlienTrimmer run approximately 5 times slower.

  • The verbose mode (option -v) can be used to observe the trimming details.
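
The 3' and 5' trimming criteria stated in the first note can be sketched in Python as follows. This is only one reading of the two listed conditions, not AlienTrimmer's implementation: the function names are hypothetical, and the 5' case is obtained by applying the 3' rule to the reversed read. The troublesome positions are those of the two diagrams, counting the four dots as ordinary non-troublesome bases (read length 30). The note does not say which m the diagrams assume, so m = 5 is chosen here for illustration.

```python
def trim_start_3prime(troublesome, m):
    """Return e, the first index of the trimmed suffix [e, l-1]; l if nothing is trimmed.

    troublesome: list of booleans, one per base (alien k-mer and/or non-confident base).
    m: maximum number of successive non-troublesome bases allowed in [e, t].
    """
    l = len(troublesome)
    # t: index of the last troublesome base.
    t = next((i for i in range(l - 1, -1, -1) if troublesome[i]), None)
    if t is None:
        return l
    best = l
    count = 0  # troublesome bases in [i, t] (none occur after t)
    gap = 0    # current run of successive non-troublesome bases
    for i in range(t, -1, -1):
        if troublesome[i]:
            count += 1
            gap = 0
            # Proportion of troublesome bases in [i, l-1] must exceed 50%.
            if 2 * count > l - i:
                best = i  # scanning leftwards, so the last hit is the longest suffix
        else:
            gap += 1
            if gap > m:
                break  # any smaller e would contain more than m successive clean bases
    return best


def trim_end_5prime(troublesome, m):
    """Return s, such that the prefix [0, s-1] is trimmed; 0 if nothing is trimmed."""
    return len(troublesome) - trim_start_3prime(troublesome[::-1], m)


# Troublesome positions read off the 3' and 5' diagrams (0-based).
marks_3 = [i in {5, 12, *range(15, 27)} for i in range(30)]
marks_5 = [i in {3, 12, 15, 22, *range(5, 11)} for i in range(30)]

e = trim_start_3prime(marks_3, m=5)  # 12, the position labelled e in the 3' diagram
s = trim_end_5prime(marks_5, m=5)    # 16, the position labelled s in the 5' diagram
```

Under this reading, raising m to 9 on the 3' toy example extends the trimmed suffix to start at index 5, because the run of six clean bases before e is then tolerated and the troublesome proportion (14 of 25) still exceeds 50%.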

Example

The two files example.fastq.gz and aliens.fa can be found in the directory example/. The gzipped file example.fastq.gz contains five FASTQ-formatted reads, and the FASTA-formatted file aliens.fa contains three oligonucleotide sequences to be trimmed off: an indexed TruSeq adapter and the two homopolymeric segments poly-A and poly-C. Of note, it is highly recommended to use AlienTrimmer with these last two homopolymeric segments as aliens; indeed, library preparation oligonucleotides occurring in sequencing reads are often followed by a stretch of A's or G's that should also be trimmed off (see e.g. Criscuolo and Brisse 2014).

The following command line can be run to trim off troublesome bases located at both the 5' and 3' ends of the reads:

AlienTrimmer -i example.fastq.gz -a aliens.fa -o example.trim -v

As the verbose mode is set (option -v), the following information can be read:

AlienTrimmer 

FASTQ file:         example.fastq.gz
main options:       -k 10  -m 9  -q 13  -l 50  -p 50
outfile:            example.trim.fastq
no. alien k-mers:   118

@HUB022:69:LATEUSH:2:1102:17743:8484 1:N:0:ATCCTTCC
                                                                                                               ========================================
TTCGTAATTGAGTTCCATCAAGAGCAAACTTATCGAGATCGAGTCAATTATTAACGTGTTCAATCAGTGCTTTTCCTAATTCAGCAGCTTCTGAATCGCCGCTATAGGTGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCCTT
             *                              *                                            *                                                   *  *      
                                                                                                               <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

@HUB022:69:LATEUSH:2:1103:25789:13463 1:N:0:ATCTTTCC
                                                                   ===================================== ============================       ===========
TTGCACATCAATGTAGTCAAACTCGCAAATGGAAAGAATTAGAAAAAGATTTCTTTAAAAAATTATGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCTTTCCATCTCGTATGCCGTCTTCTGCTTGAAAATGTGGGGGGGGGGG
  *        *    *             *      *           * *          ** ** *   *              *        *       *   * * *         ** * *        *  *  ** *  ***
                                                              <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

@HUB022:69:LATEUSH:2:1104:16893:2253 1:N:0:ATCCTTCC
                                   ===========                                                                                                         
ATTTGCGCTCAAAAAAAGACAACAAAGATAATTGATTTTTTTTTTTAAACATCAGAAGAAAACTTCTCCACACAACGAACAAACATTTCTACACCCATAGACAAAACAGTTTCATCAAAATCAAAACGAGGATGATGATGAGGATAAGCTA
 * * **  *** ****                                 *                 *                                   *                 ** *** * *** *  ***  *    *  
>>>>>>>>>>>>>>>>>                                                                                                         <<<<<<<<<<<<<<<<<<<<<<<<<<<<<

@HUB022:69:LATEUSH:2:1105:17562:2597 1:N:0:ATCCTTCC   [DISCARDED]
              ====================     =======================                                                 ========================================
TTGTTGAAACATCACCCCCCCCCCCCCCCCCCCCACCCACCCCCCCCCCCCCCCCCCCCCCCACCAATTAAAAAAAAACCAAAAATAGGAAACATAAAAATAATATATAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
             *                 *   ***********                ********** ** * ****** **** *** ** ** ****** **** * *   *   *  *             * *    *   *
             <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

infile:             5 reads   755 bases
outfile:            4 reads   429 bases

Total running time: 0min 00sec

Each trimmed read is displayed with highlighted troublesome bases, i.e. alien k-mers (=) and non-confident bases (*). Trimmed regions are indicated with > (5') and < (3').
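
As an illustration of how non-confident bases (*) relate to options -q and --p64, the following Python sketch decodes a FASTQ quality string and flags low-quality bases. The function names are hypothetical. Whether AlienTrimmer treats a base whose score equals the cutoff as low-quality is not stated here; the sketch assumes that only scores strictly below the cutoff are flagged.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string; offset is 33 (Phred+33) or 64 (Phred+64, --p64)."""
    return [ord(c) - offset for c in quality]


def non_confident(quality, q=13, offset=33):
    """Flag bases whose Phred score is below the cutoff q (option -q).

    Assumption: the comparison is strict (score < q).
    """
    return [score < q for score in phred_scores(quality, offset)]


# Phred+33: 'I' encodes 40, '#' encodes 2, '.' encodes 13, '-' encodes 12.
flags = non_confident("I#.-", q=13)
```

With the default cutoff of 13, `flags` marks the second and fourth bases only.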

The first two reads end with the specified TruSeq adapter, which is trimmed off accordingly.

The third read contains many non-confident bases at both the 5' and 3' ends, which are trimmed off accordingly.

The fourth read seems artefactual, as it is made up of homopolymeric regions and non-confident bases. After trimming, only 13 bases remain; the read is therefore discarded because it is shorter than the minimum length of 50 bases (option -l).
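
The read-level filters behind this decision (options -l and -p) can be sketched in Python as follows. The function name is hypothetical, and it is an assumption that the percentage of low-quality bases is computed on the bases remaining after trimming; AlienTrimmer may compute it differently.

```python
def keep_read(low_quality_flags, min_length=50, max_low_quality_pct=50):
    """Decide whether a trimmed read is written to the output.

    low_quality_flags: one boolean per remaining base (True = low-quality base).
    min_length: minimum allowed read length (option -l).
    max_low_quality_pct: maximum allowed percentage of low-quality bases (option -p).
    """
    n = len(low_quality_flags)
    if n < min_length:
        return False
    # Integer arithmetic avoids rounding issues when comparing percentages.
    return 100 * sum(low_quality_flags) <= max_low_quality_pct * n


# A read reduced to 13 confident bases, as the fourth read above, is discarded.
fourth_read_kept = keep_read([False] * 13)
```

With the default settings, `fourth_read_kept` is `False`, whereas a 60-base read with no low-quality base would be kept.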

References

Criscuolo A, Brisse S (2013) ALIENTRIMMER: A tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics, 102(5-6):500-506. doi:10.1016/j.ygeno.2013.07.011

Criscuolo A, Brisse S (2014) AlienTrimmer removes adapter oligonucleotides with high sensitivity in short-insert paired-end reads. Commentary on Turner (2014) Assessment of insert sizes and adapter content in FASTQ data from NexteraXT libraries. Frontiers in Genetics, 5:130. doi:10.3389/fgene.2014.00130