This project contains the source code and the OpenCL/GPU benchmark for the phageterm project. I will leave it to the authors (Marc and Julian) to describe phageterm and will just talk about the OpenCL benchmark part.
- Mapping of reads: In the original version of phageterm, mapping the reads and then randomly choosing a mapping position for each read took approximately 80% of the execution time. It was done with the regexp python package. I considered several possibilities to reduce the execution time; here they are (illustrative code sketches follow the list).
- Optimize the regexp search or use it differently. I thought of that first, but I didn't find anything relevant in the Python literature nor on Google.
- Use another Python package with specialized text-searching algorithms. I thought of the Knuth-Morris-Pratt algorithm, which I found implemented in the tryalgo package (a minimal KMP sketch is given after this list). There is also an algorithms package implementing it, but it doesn't seem to be maintained, so I didn't try it (we want middle/long-term solutions).
- Use the string.find() method (not faster than regexp according to forums). Not sure it is worth a try.
- Use GPU technology with OpenCL/pyOpenCL (see the pyOpenCL sketch after the results table below).
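
To make the comparison concrete, here is a rough sketch of the regexp baseline and of the string.find() variant mentioned above. It is written for this README with Python's built-in re module; the function names, the single-contig interface and the hard-coded seed length are illustrative assumptions, not the actual phageterm code.

```python
import re

SEED_LEN = 20  # default seed length in phageterm

def seed_positions_re(contig, read):
    """Positions of the read's seed in the contig, using a compiled regexp."""
    seed = re.compile(re.escape(read[:SEED_LEN]))
    return [m.start() for m in seed.finditer(contig)]

def seed_positions_find(contig, read):
    """Same search with str.find(), looping over successive occurrences
    (unlike finditer, this also catches overlapping matches)."""
    seed = read[:SEED_LEN]
    positions, start = [], contig.find(seed)
    while start != -1:
        positions.append(start)
        start = contig.find(seed, start + 1)
    return positions
```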
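
For the Knuth-Morris-Pratt option, here is a minimal plain-Python sketch of the algorithm, only to show the kind of search tryalgo implements; it is not the tryalgo code itself.

```python
def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    # Failure table: fail[j] = length of the longest proper border of pattern[:j + 1].
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k

    # Scan the text, never moving backwards in it.
    k = 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - len(pattern) + 1
    return -1

print(kmp_search("ACGTACGTTTACGT", "TTTA"))  # -> 7
```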
Here are the results of the tests with the different implementations. For the tests, I used the files in the data-virome directory. My aim was to benchmark execution time and to see how OpenCL or another Python package could improve things compared to regexp, so I didn't bother with paired-end reads. I searched for the first 20 characters (the default seed value in phageterm) of each read in SRR4295172_1_div6.fastq in all the sequences of Contigs_30min.fasta. Tests ran on myriad-n403 or on my machine (OpenCL part).
Original regexp | tryalgo | OpenCL (iMac) | OpenCL (myriad, K40) | OpenCL (tars, K80) | OpenCL (tars, M40) | OpenCL (tars, P100) |
---|---|---|---|---|---|---|
17 min 10 s | >16 hours | 64 s | 4 min 8 s (CUDA 7.5) | 4 min (CUDA 8) | 56 s | 32.5 s |
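
As an illustration of the brute-force search that the OpenCL versions time, here is a minimal pyOpenCL sketch: one work-item per position of the reference sequence, each one checking whether the seed matches at that position. The kernel, the buffer handling and the function name are assumptions made for this README, not the benchmark code itself.

```python
import numpy as np
import pyopencl as cl

KERNEL_SRC = """
__kernel void seed_search(__global const char *ref, const int ref_len,
                          __global const char *seed, const int seed_len,
                          __global int *hits)
{
    int i = get_global_id(0);
    if (i > ref_len - seed_len)
        return;
    for (int j = 0; j < seed_len; ++j)
        if (ref[i + j] != seed[j])
            return;
    hits[i] = 1;  /* the seed matches the reference at position i */
}
"""

def gpu_seed_positions(reference, seed):
    """Illustrative only: positions of `seed` in `reference`, computed on the GPU."""
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    # Copy the reference, the seed and a zero-filled hit array to the device.
    ref = np.frombuffer(reference.encode("ascii"), dtype=np.int8)
    sd = np.frombuffer(seed.encode("ascii"), dtype=np.int8)
    hits = np.zeros(len(ref), dtype=np.int32)

    ref_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=ref)
    seed_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=sd)
    hits_buf = cl.Buffer(ctx, mf.WRITE_ONLY | mf.COPY_HOST_PTR, hostbuf=hits)

    # One work-item per reference position, then read the hit array back.
    prg = cl.Program(ctx, KERNEL_SRC).build()
    prg.seed_search(queue, (len(ref),), None,
                    ref_buf, np.int32(len(ref)),
                    seed_buf, np.int32(len(sd)), hits_buf)
    cl.enqueue_copy(queue, hits, hits_buf)
    return np.flatnonzero(hits)

print(gpu_seed_positions("ACGTACGTTTACGT", "ACGT"))  # -> [ 0  4 10]
```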