This project contains the source code and the OpenCL/GPU benchmark for the PhageTerm project. I will leave it to the authors (Marc and Julian) to describe PhageTerm itself and will only cover the OpenCL benchmark here.

1. Mapping of reads

In the original version of PhageTerm, mapping the reads and then randomly choosing a mapping position for each read took approximately 80% of the execution time. This was done with the Python regexp package. I considered several ways to reduce the execution time:

* Optimize the regexp or use it differently.

I looked into this first, but I didn't find anything relevant apart from precompiling the regexp, which is of no use here since the pattern changes with each read.

* Use another Python package with specialized text-searching algorithms.

I thought of the Knuth-Morris-Pratt algorithm, which I found implemented in the tryalgo package. There is also an algorithms package implementing it, but it doesn't seem to be maintained, so I didn't try it (we want medium/long-term solutions).

* Use the string.find() method (not faster than regexp according to forums). Not sure it is worth a try.

* Use GPU technology with OpenCL/PyOpenCL.

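
To make the CPU-side options above more concrete, here is a minimal Python sketch that finds every occurrence of a read seed in a reference sequence in three ways: with the standard `re` module, with `str.find()`, and with a hand-written Knuth-Morris-Pratt search. It is not PhageTerm's actual code and it does not use tryalgo; the function names and the toy sequences are hypothetical and only there for illustration. The last helper mirrors the idea of picking one mapping position at random.

```python
import random
import re


def find_with_regexp(seed, sequence):
    """All start positions of seed in sequence, using the standard re module."""
    # The lookahead lets overlapping occurrences be reported; re.escape keeps
    # the seed literal. The pattern has to be rebuilt for every read.
    pattern = re.compile("(?=" + re.escape(seed) + ")")
    return [match.start() for match in pattern.finditer(sequence)]


def find_with_str_find(seed, sequence):
    """All start positions of seed in sequence, using str.find()."""
    positions = []
    start = sequence.find(seed)
    while start != -1:
        positions.append(start)
        start = sequence.find(seed, start + 1)
    return positions


def find_with_kmp(seed, sequence):
    """All start positions of seed in sequence, using Knuth-Morris-Pratt."""
    if not seed:
        return []
    # failure[i] = length of the longest proper prefix of seed[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(seed)
    k = 0
    for i in range(1, len(seed)):
        while k > 0 and seed[i] != seed[k]:
            k = failure[k - 1]
        if seed[i] == seed[k]:
            k += 1
        failure[i] = k
    # Scan the sequence once, never moving backwards in it.
    positions = []
    k = 0
    for i, char in enumerate(sequence):
        while k > 0 and char != seed[k]:
            k = failure[k - 1]
        if char == seed[k]:
            k += 1
        if k == len(seed):
            positions.append(i - k + 1)
            k = failure[k - 1]
    return positions


def choose_mapping_position(positions, rng=random):
    """Pick one mapping position at random, or None if the seed never matched."""
    return rng.choice(positions) if positions else None


# Toy data (hypothetical), much shorter than a real 20-character seed.
toy_sequence = "ACACAGTACAT"
toy_seed = "ACA"
toy_positions = find_with_kmp(toy_seed, toy_sequence)
```

All three functions return the same list of positions, so they can be swapped in and out when timing them against each other.
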
Here are the results of the tests with the different implementations.

For the tests, I used the files in the data-virome directory.

My aim was to benchmark execution time and to see how OpenCL or another Python package could improve things compared to regexp, so I didn't bother with paired-end reads. I searched for the first 20 characters (the default seed length in PhageTerm) of each read in SRR4295172_1_div6.fastq in all the sequences in Contigs_30min.fasta.

| original regexp | tryalgo   | OpenCL |
| --------------- |:---------:| ------:|
|                 | >24 hours | 120 s  |
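
The benchmark setup described above (take the first 20 characters of each read and look for them in every contig) can be sketched as follows. This is only an illustration under stated assumptions, not the benchmark code itself: the small FASTQ/FASTA parsers, the function names and the in-memory toy data are hypothetical, the parsers assume well-formed four-line FASTQ records, and plain `str.find()` stands in for whichever search implementation is being timed.

```python
import io
import time

SEED_LENGTH = 20  # default seed length in PhageTerm, as stated in the text


def read_fastq_seeds(handle, seed_length=SEED_LENGTH):
    """Yield the first seed_length characters of each read (4-line records)."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # the sequence is the second line of a record
            yield line.strip()[:seed_length]


def read_fasta_sequences(handle):
    """Return the sequences of a FASTA file as a list of strings."""
    sequences, chunks = [], []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                sequences.append("".join(chunks))
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        sequences.append("".join(chunks))
    return sequences


def count_seed_hits(seeds, contigs):
    """Count the (seed, contig) pairs where the seed occurs at least once."""
    return sum(
        1 for seed in seeds for contig in contigs if contig.find(seed) != -1
    )


# Hypothetical toy input standing in for the FASTQ and FASTA files.
toy_fastq = io.StringIO(
    "@read1\nACGTACGTACGTACGTACGTTTTT\n+\n" + "I" * 24 + "\n"
    "@read2\nGGGGGGGGGGGGGGGGGGGGGGGG\n+\n" + "I" * 24 + "\n"
)
toy_fasta = io.StringIO(
    ">contig1\nTTACGTACGTACGT\nACGTACGTTTAA\n>contig2\nCCCCCCCCCCCCCCCC\n"
)

seeds = list(read_fastq_seeds(toy_fastq))
contigs = read_fasta_sequences(toy_fasta)

# Time only the search itself, not the file parsing.
start = time.perf_counter()
hits = count_seed_hits(seeds, contigs)
elapsed = time.perf_counter() - start
print(f"{hits} seed hit(s) in {elapsed:.6f} s")
```

Timing only the search loop keeps file parsing out of the measurement, which makes the figures for different search implementations easier to compare.
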