Title: Fast Exact String Matching On the GPU
1. Fast Exact String Matching On the GPU
- Michael C. Schatz and Cole Trapnell
- May 8, 2007
- CMSC 740 Computer Graphics
2. String Matching Applications
- A very common problem in computational biology is to find all occurrences (or approximate occurrences) of one string in another string
  - Genome Assembly, Gene Finding, Comparative Genomics, Functional analysis of proteins, Motif discovery, SNP analysis, Phylogenetic analysis, Primer Design
  - Short Read Resequencing: 200 million 50 bp reads
- Sequence databases are huge, and growing exponentially
  - We need ever faster methods for string matching
3. Suffix Trees to the Rescue
- Tree of all suffixes of string S
  - Suffix i encoded on path to leaf i
  - Nodes: positions where suffixes diverge
  - Edges: substrings of S
  - Leaves: starting position of suffix
  - Suffix links: traverse to next suffix
- O(n) construction
  - Ukkonen's algorithm
  - Exploits inter-suffix relationships and suffix links
- O(k) query match (see the node sketch below)
  - Every substring S[i..j] is a prefix of suffix i
  - Walk from the root following the characters in the query Q
  - One leaf for each occurrence of Q in T
Suffix tree of ACATAC (858E Algorithms for Biosequence Analysis)
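To make the layout concrete, here is a minimal sketch of one way such a node could be represented in C; the field names, the index-based pointers, and the fixed four-way child array for a DNA alphabet are illustrative assumptions, not the actual Cmatch structures.

    /* Illustrative suffix tree node for a DNA alphabet; hypothetical layout,
     * not the Cmatch data structure. */
    typedef struct STNode {
        int start, end;     /* edge label stored as offsets into S: S[start..end] */
        int children[4];    /* child node index per base A,C,G,T, or -1 if absent */
        int suffixLink;     /* node reached by the suffix link, -1 if none        */
        int leafId;         /* starting position of the suffix at a leaf, else -1 */
    } STNode;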
4. Suffix Tree Search
Searching for ATA
(Figure: suffix tree of ACATAC)
5. Suffix Tree Search
Searching for ATA
(Figure: suffix tree of ACATAC)
6. Suffix Tree Search
Searching for ATA
(Figure: suffix tree of ACATAC)
7. Suffix Tree Search
Searching for ATA: found at position 3!
(Figure: suffix tree of ACATAC)
8. Suffix Tree Search
Searching for AC: found at positions 1 and 5
Searching for ACT: falls off the tree => not in S
(Figure: suffix tree of ACATAC)
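The walk illustrated on the preceding slides can be written down directly. Below is a rough C sketch of the O(k) lookup over the node layout sketched earlier; it is illustrative only, not the Cmatch matching code. A mismatch at any point is the "falls off the tree" case seen for ACT.

    /* Node layout as in the earlier sketch (suffix link omitted; unused here). */
    typedef struct {
        int start, end;     /* edge label S[start..end]                   */
        int children[4];    /* child index per base A,C,G,T; -1 if absent */
        int leafId;         /* suffix start position at a leaf, else -1   */
    } STNode;

    static int baseToIndex(char c)
    {
        switch (c) { case 'A': return 0; case 'C': return 1;
                     case 'G': return 2; default:  return 3; }
    }

    /* Walk query Q (length k) down the tree rooted at nodes[0].
     * Returns the index of the last node reached, or -1 if Q falls off the
     * tree (Q does not occur in S).  Every leaf below the returned node is
     * one occurrence of Q: for S = ACATAC, ATA occurs (position 3), AC
     * occurs (positions 1 and 5), and ACT returns -1. */
    int matchQuery(const STNode *nodes, const char *S, const char *Q, int k)
    {
        int cur = 0;                     /* start at the root              */
        int qi  = 0;                     /* characters of Q matched so far */
        while (qi < k) {
            int next = nodes[cur].children[baseToIndex(Q[qi])];
            if (next < 0)
                return -1;               /* no edge starts with Q[qi]      */
            for (int s = nodes[next].start; s <= nodes[next].end && qi < k; s++, qi++)
                if (S[s] != Q[qi])
                    return -1;           /* mismatch in the middle of an edge */
            cur = next;
        }
        return cur;
    }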
9. GPGPU Programming
- Utilize the highly parallel SIMD architecture of the GPU
  - Nominally used for parallel triangle rendering and texture application
  - Each processor executes the same kernel
- Dramatic runtime improvement for scientific applications
- CUDA Architecture
  - API and runtime library to implement C-style programming of stream processors
- nVidia GeForce 8800 GTX (G80)
  - 16 multiprocessors w/ 8 processors each
  - 128 stream processors @ 1.35 GHz
  - 768 MB total on-board RAM
  - 2D texture cache for large read-only data
Image from CUDA Programming Guide
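As a minimal illustration of this model (not the actual Cmatch kernel), every thread executes the same __global__ function on a different element; the names and the placeholder per-query work below are assumptions for the sketch.

    // Minimal CUDA sketch of the "every processor runs the same kernel" model.
    // Each thread handles one query index; the per-query work is a placeholder.
    __global__ void perQueryKernel(const int *queryOffsets, int numQueries, int *results)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
        if (i < numQueries)
            results[i] = queryOffsets[i];                // placeholder per-query work
    }

    // Host-side launch: enough 128-thread blocks to cover every query.
    // perQueryKernel<<<(numQueries + 127) / 128, 128>>>(dOffsets, numQueries, dResults);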
10. Cmatch GPU Algorithm
- Load reference string
- Create suffix tree
- Load query strings
- Transfer data to GPU
- Execute query kernel
  - Up to 128 simultaneous matches on GPU
- Fetch results from GPU
- Output results
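The steps above map onto a conventional CUDA host program. The following is a hedged sketch using the CUDA runtime API; the kernel body is a placeholder, steps 1-3 are elided, and none of the names are taken from the Cmatch source.

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Placeholder standing in for the Cmatch query kernel: one thread per
    // query, each result slot would hold the id of the last visited node.
    __global__ void matchKernel(const char *queries, int queryLen,
                                int numQueries, int *results)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numQueries)
            results[i] = -1;    // placeholder: real kernel walks the suffix tree
    }

    int main(void)
    {
        // 1. Load reference string          (elided: read sequence from disk)
        // 2. Create suffix tree on the host (elided: Ukkonen's algorithm)
        // 3. Load query strings; a tiny hard-coded stand-in here
        const int queryLen = 3, numQueries = 2;
        const char queries[] = "ATA" "ACT";              // concatenated, fixed length

        // 4. Transfer data to the GPU
        char *dQueries; int *dResults;
        cudaMalloc(&dQueries, sizeof(queries));
        cudaMalloc(&dResults, numQueries * sizeof(int));
        cudaMemcpy(dQueries, queries, sizeof(queries), cudaMemcpyHostToDevice);

        // 5. Execute the query kernel: one thread per query, 128 threads per block
        matchKernel<<<(numQueries + 127) / 128, 128>>>(dQueries, queryLen,
                                                       numQueries, dResults);

        // 6. Fetch results from the GPU
        int results[numQueries];
        cudaMemcpy(results, dResults, sizeof(results), cudaMemcpyDeviceToHost);

        // 7. Output results (elided: map node ids back to match positions)
        for (int i = 0; i < numQueries; i++)
            printf("query %d -> node %d\n", i, results[i]);

        cudaFree(dQueries);
        cudaFree(dResults);
        return 0;
    }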
11. Data Structures on the GPU
- Suffix tree nodes => 2D texture
  - Encode node information and children pointers as the RGBA color of a texel
  - Arrange nodes in 32x32 blocks along a space-filling curve
  - Optimize near the root for inter-thread caching, further down for an individual thread
- Reference string => 2D texture
  - Access many successive characters along an edge
- Query strings => on-board RAM
  - Array of Q offsets into one large array of strings
- Results buffer => on-board RAM
  - Array of Q entries holding the id of the last visited node for query i
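One plausible realization of the node-per-texel idea, using the texture reference API from early CUDA releases, is sketched below; the specific packing of fields into the four 32-bit channels is an assumption for illustration and not necessarily the Cmatch encoding.

    #include <cuda_runtime.h>

    // Sketch using the legacy CUDA texture-reference API (early CUDA releases).
    // Packing one node per uint4 texel is an illustrative assumption.
    texture<uint4, 2, cudaReadModeElementType>         nodeTex;  // suffix tree nodes
    texture<unsigned char, 2, cudaReadModeElementType> refTex;   // reference string

    struct NodeTexel {
        unsigned int edgeStart;   // x channel: start of edge label in the reference
        unsigned int edgeEnd;     // y channel: end of edge label
        unsigned int firstChild;  // z channel: texel address of the first child
        unsigned int leafOrFlags; // w channel: leaf position / bookkeeping bits
    };

    __device__ NodeTexel fetchNode(int x, int y)
    {
        uint4 t = tex2D(nodeTex, x, y);       // one cached texture fetch per node
        NodeTexel n = { t.x, t.y, t.z, t.w };
        return n;
    }

    __device__ unsigned char fetchRefChar(int pos, int refWidth)
    {
        // The reference string is stored row-major in a 2D texture, so successive
        // characters along an edge fall into nearby, cached texels.
        return tex2D(refTex, pos % refWidth, pos / refWidth);
    }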
12. Experimental Protocol
- Comparing running time of (serial) CPU versus (parallel) GPU programs
  - CPU: 3.0 GHz Intel Xeon
  - GPU: nVidia GeForce 8800 GTX (128 processors @ 1.35 GHz)
- Simulate short read resequencing projects by extracting substrings of reference sequences (sketched after this list)
- References
  - Genome of Bacillus anthracis (5.20 Mbp)
  - Genome of Yersinia pestis (4.6 Mbp)
  - BAC-sized portion of Human Chromosome 2 (200 kbp)
- Query sets (250 Mbp total)
  - 10 million x 25 bp
  - 5 million x 50 bp
  - 1.25 million x 200 bp
  - 312,500 x 800 bp
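A hedged sketch of the read simulation step: error-free substrings of a fixed length drawn at random positions from the reference. Function and parameter names are illustrative, not from the actual protocol scripts.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // Illustrative read simulator: print nReads error-free substrings of length
    // readLen drawn uniformly at random from the reference sequence.
    void simulateReads(const char *ref, size_t refLen, size_t readLen, size_t nReads)
    {
        char *read = (char *)malloc(readLen + 1);
        for (size_t i = 0; i < nReads; i++) {
            size_t start = (size_t)rand() % (refLen - readLen + 1);  // uniform start
            memcpy(read, ref + start, readLen);
            read[readLen] = '\0';
            printf("%s\n", read);
        }
        free(read);
    }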
13. Query Time Results
Speedup of the GPU match kernel versus the CPU match program.
14. Long Read Query Time Results
Future work: improve the cache hit rate for longer reads.
15. Processing Time
GPU Cmatch is bounded by the time to construct the suffix tree and by I/O processing time.
16. Conclusions
- We have reduced the computation time for short read resequencing from hours to minutes
  - Make sure you have sufficient cooling available
- Low arithmetic intensity GPGPU programs can have dramatic performance improvements (35x) over CPU execution
- Utilizing the texture cache with careful node placement, and minimizing register use, were essential to high performance
- A single GPU can supply the same processing power as a small computer cluster at a fraction of the cost
- Installing GPUs into an existing cluster can provide an order of magnitude increase in computing capacity
- More information: http://www.cbcb.umd.edu/software/cmatch
17. Texture Space Filling Curve
- The texture cache is organized in 2x2 blocks
- Try to place all children of a node in the same cache block
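A hedged sketch of one such placement scheme: node ids are assigned so that a parent's (up to four) children occupy consecutive ids, consecutive groups of four share a 2x2 texel block, and blocks are packed into 32x32 tiles of the texture. The actual Cmatch curve may differ.

    // Illustrative mapping from a linear node id to 2D texel coordinates.
    // Assumes a node's children were assigned four consecutive ids, so they
    // land in the same 2x2 block; blocks are packed into 32x32 texel tiles.
    void nodeIdToTexel(int id, int texWidthTexels, int *x, int *y)
    {
        const int TILE = 32;                           // texels per side of a tile
        int block         = id / 4;                    // 2x2 block holding the node
        int within        = id % 4;                    // position inside that block
        int blocksPerTile = (TILE / 2) * (TILE / 2);   // 256 blocks per 32x32 tile
        int tile          = block / blocksPerTile;
        int bInTile       = block % blocksPerTile;
        int tilesPerRow   = texWidthTexels / TILE;
        int tileX = (tile % tilesPerRow) * TILE;       // top-left corner of the tile
        int tileY = (tile / tilesPerRow) * TILE;
        *x = tileX + 2 * (bInTile % (TILE / 2)) + (within & 1);
        *y = tileY + 2 * (bInTile / (TILE / 2)) + (within >> 1);
    }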