Title: Fast Exact String Matching On the GPU
1. Fast Exact String Matching On the GPU
- Michael C. Schatz and Cole Trapnell
- May 8, 2007
- CMSC 740 Computer Graphics
2. String Matching Applications
- A very common problem in computational biology is to find all occurrences (or approximate occurrences) of one string in another string
  - Genome Assembly, Gene Finding, Comparative Genomics, Functional analysis of proteins, Motif discovery, SNP analysis, Phylogenetic analysis, Primer Design
  - Short Read Resequencing: 200 million 50 bp reads
- Sequence databases are huge, and growing exponentially
  - We need ever faster methods for string matching
3. Suffix Trees to the Rescue
- Tree of all suffixes of string S
  - Suffix i encoded on path to leaf i
  - Nodes: positions where suffixes diverge
  - Edges: substrings of S
  - Leaves: starting position of suffix
  - Suffix links: traverse to next suffix
- O(n) construction
  - Ukkonen's algorithm
  - Exploits inter-suffix relationships and suffix links
- O(k) query match (see the node sketch below)
  - Every substring S[i..j] is a prefix of suffix i
  - Walk from the root following the characters in the query Q
  - One leaf for each occurrence of Q in T
Suffix tree of ACATAC (858E Algorithms for Biosequence Analysis)
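To make the layout concrete, here is a minimal sketch of one way such a node could be represented in C; the field names, the index-based pointers, and the fixed four-way child array for a DNA alphabet are illustrative assumptions, not the actual Cmatch structures.

    /* Illustrative suffix tree node for a DNA alphabet; hypothetical layout,
     * not the Cmatch data structure. */
    typedef struct STNode {
        int start, end;     /* edge label stored as offsets into S: S[start..end] */
        int children[4];    /* child node index per base A,C,G,T, or -1 if absent */
        int suffixLink;     /* node reached by the suffix link, -1 if none        */
        int leafId;         /* starting position of the suffix at a leaf, else -1 */
    } STNode;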
4. Suffix Tree Search
Searching for ATA
(Figure: suffix tree of ACATAC)
5. Suffix Tree Search
Searching for ATA
(Figure: suffix tree of ACATAC)
6. Suffix Tree Search
Searching for ATA
(Figure: suffix tree of ACATAC)
7. Suffix Tree Search
Searching for ATA: found at position 3!
(Figure: suffix tree of ACATAC)
8. Suffix Tree Search
Searching for AC: found at positions 1 and 5
Searching for ACT: falls off the tree => not in S
(Figure: suffix tree of ACATAC)
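The walk illustrated on the preceding slides can be written down directly. Below is a rough C sketch of the O(k) lookup over the node layout sketched earlier; it is illustrative only, not the Cmatch matching code. A mismatch at any point is the "falls off the tree" case seen for ACT.

    /* Node layout as in the earlier sketch (suffix link omitted; unused here). */
    typedef struct {
        int start, end;     /* edge label S[start..end]                   */
        int children[4];    /* child index per base A,C,G,T; -1 if absent */
        int leafId;         /* suffix start position at a leaf, else -1   */
    } STNode;

    static int baseToIndex(char c)
    {
        switch (c) { case 'A': return 0; case 'C': return 1;
                     case 'G': return 2; default:  return 3; }
    }

    /* Walk query Q (length k) down the tree rooted at nodes[0].
     * Returns the index of the last node reached, or -1 if Q falls off the
     * tree (Q does not occur in S).  Every leaf below the returned node is
     * one occurrence of Q: for S = ACATAC, ATA occurs (position 3), AC
     * occurs (positions 1 and 5), and ACT returns -1. */
    int matchQuery(const STNode *nodes, const char *S, const char *Q, int k)
    {
        int cur = 0;                     /* start at the root              */
        int qi  = 0;                     /* characters of Q matched so far */
        while (qi < k) {
            int next = nodes[cur].children[baseToIndex(Q[qi])];
            if (next < 0)
                return -1;               /* no edge starts with Q[qi]      */
            for (int s = nodes[next].start; s <= nodes[next].end && qi < k; s++, qi++)
                if (S[s] != Q[qi])
                    return -1;           /* mismatch in the middle of an edge */
            cur = next;
        }
        return cur;
    }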
9. GPGPU Programming
- Utilize the highly parallel SIMD architecture of the GPU
  - Nominally used for parallel triangle rendering and texture application
  - Each processor executes the same kernel
- Dramatic runtime improvement for scientific applications
- CUDA Architecture
  - API and runtime library to implement C-style programming of stream processors
- nVidia GeForce 8800 GTX (G80)
  - 16 multiprocessors w/ 8 processors each
  - 128 stream processors @ 1.35 GHz
  - 768 MB total on-board RAM
  - 2D texture cache for large read-only data
Image from CUDA Programming Guide
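As a minimal illustration of this model (not the actual Cmatch kernel), every thread executes the same __global__ function on a different element; the names and the placeholder per-query work below are assumptions for the sketch.

    // Minimal CUDA sketch of the "every processor runs the same kernel" model.
    // Each thread handles one query index; the per-query work is a placeholder.
    __global__ void perQueryKernel(const int *queryOffsets, int numQueries, int *results)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
        if (i < numQueries)
            results[i] = queryOffsets[i];                // placeholder per-query work
    }

    // Host-side launch: enough 128-thread blocks to cover every query.
    // perQueryKernel<<<(numQueries + 127) / 128, 128>>>(dOffsets, numQueries, dResults);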
10. Cmatch GPU Algorithm
- Load reference string
- Create suffix tree
- Load query strings
- Transfer data to GPU
- Execute query kernel
  - Up to 128 simultaneous matches on GPU
- Fetch results from GPU
- Output results
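The steps above map onto a conventional CUDA host program. The following is a hedged sketch using the CUDA runtime API; the kernel body is a placeholder, steps 1-3 are elided, and none of the names are taken from the Cmatch source.

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Placeholder standing in for the Cmatch query kernel: one thread per
    // query, each result slot would hold the id of the last visited node.
    __global__ void matchKernel(const char *queries, int queryLen,
                                int numQueries, int *results)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numQueries)
            results[i] = -1;    // placeholder: real kernel walks the suffix tree
    }

    int main(void)
    {
        // 1. Load reference string          (elided: read sequence from disk)
        // 2. Create suffix tree on the host (elided: Ukkonen's algorithm)
        // 3. Load query strings; a tiny hard-coded stand-in here
        const int queryLen = 3, numQueries = 2;
        const char queries[] = "ATA" "ACT";              // concatenated, fixed length

        // 4. Transfer data to the GPU
        char *dQueries; int *dResults;
        cudaMalloc(&dQueries, sizeof(queries));
        cudaMalloc(&dResults, numQueries * sizeof(int));
        cudaMemcpy(dQueries, queries, sizeof(queries), cudaMemcpyHostToDevice);

        // 5. Execute the query kernel: one thread per query, 128 threads per block
        matchKernel<<<(numQueries + 127) / 128, 128>>>(dQueries, queryLen,
                                                       numQueries, dResults);

        // 6. Fetch results from the GPU
        int results[numQueries];
        cudaMemcpy(results, dResults, sizeof(results), cudaMemcpyDeviceToHost);

        // 7. Output results (elided: map node ids back to match positions)
        for (int i = 0; i < numQueries; i++)
            printf("query %d -> node %d\n", i, results[i]);

        cudaFree(dQueries);
        cudaFree(dResults);
        return 0;
    }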
11. Data Structures on the GPU
- Suffix tree nodes => 2D texture
  - Encode node information and children pointers as the RGBA color of a texel
  - Arrange nodes in 32x32 blocks along a space-filling curve
  - Optimize near the root for inter-thread caching, further down for an individual thread
- Reference string => 2D texture
  - Access many successive characters along an edge
- Query strings => on-board RAM
  - Array of Q offsets into one large array of strings
- Results buffer => on-board RAM
  - Array of Q entries holding the id of the last visited node for query i
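One plausible realization of the node-per-texel idea, using the texture reference API from early CUDA releases, is sketched below; the specific packing of fields into the four 32-bit channels is an assumption for illustration and not necessarily the Cmatch encoding.

    #include <cuda_runtime.h>

    // Sketch using the legacy CUDA texture-reference API (early CUDA releases).
    // Packing one node per uint4 texel is an illustrative assumption.
    texture<uint4, 2, cudaReadModeElementType>         nodeTex;  // suffix tree nodes
    texture<unsigned char, 2, cudaReadModeElementType> refTex;   // reference string

    struct NodeTexel {
        unsigned int edgeStart;   // x channel: start of edge label in the reference
        unsigned int edgeEnd;     // y channel: end of edge label
        unsigned int firstChild;  // z channel: texel address of the first child
        unsigned int leafOrFlags; // w channel: leaf position / bookkeeping bits
    };

    __device__ NodeTexel fetchNode(int x, int y)
    {
        uint4 t = tex2D(nodeTex, x, y);       // one cached texture fetch per node
        NodeTexel n = { t.x, t.y, t.z, t.w };
        return n;
    }

    __device__ unsigned char fetchRefChar(int pos, int refWidth)
    {
        // The reference string is stored row-major in a 2D texture, so successive
        // characters along an edge fall into nearby, cached texels.
        return tex2D(refTex, pos % refWidth, pos / refWidth);
    }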
12. Experimental Protocol
- Comparing running time of (serial) CPU versus (parallel) GPU programs
  - CPU: 3.0 GHz Intel Xeon
  - GPU: nVidia GeForce 8800 GTX (128 processors @ 1.35 GHz)
- Simulate short read resequencing projects by extracting substrings of reference sequences (sketched after this list)
- References
  - Genome of Bacillus anthracis (5.20 Mbp)
  - Genome of Yersinia pestis (4.6 Mbp)
  - BAC-sized portion of Human Chromosome 2 (200 kbp)
- Query sets (250 Mbp total)
  - 10 million x 25 bp
  - 5 million x 50 bp
  - 1.25 million x 200 bp
  - 312,500 x 800 bp
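A hedged sketch of the read simulation step: error-free substrings of a fixed length drawn at random positions from the reference. Function and parameter names are illustrative, not from the actual protocol scripts.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // Illustrative read simulator: print nReads error-free substrings of length
    // readLen drawn uniformly at random from the reference sequence.
    void simulateReads(const char *ref, size_t refLen, size_t readLen, size_t nReads)
    {
        char *read = (char *)malloc(readLen + 1);
        for (size_t i = 0; i < nReads; i++) {
            size_t start = (size_t)rand() % (refLen - readLen + 1);  // uniform start
            memcpy(read, ref + start, readLen);
            read[readLen] = '\0';
            printf("%s\n", read);
        }
        free(read);
    }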
13. Query Time Results
Speedup of the GPU match kernel versus the CPU match program.
14. Long Read Query Time Results
Future work: improve the cache hit rate for longer reads.
15. Processing Time
GPU Cmatch is bounded by the time to construct the suffix tree and by I/O processing time.
16. Conclusions
- We have reduced the computation time for short read resequencing from hours to minutes
  - Make sure you have sufficient cooling available
- Low arithmetic intensity GPGPU programs can have dramatic performance improvements (35x) over CPU execution
- Utilizing the texture cache with careful node placement, and minimizing register use, were essential to high performance
- A single GPU can supply the same processing power as a small computer cluster at a fraction of the cost
- Installing GPUs into an existing cluster can provide an order of magnitude increase in computing capacity
- More information: http://www.cbcb.umd.edu/software/cmatch
17. Texture Space Filling Curve
- The texture cache is organized in 2x2 blocks
- Try to place all children of a node in the same cache block
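A hedged sketch of one such placement scheme: node ids are assigned so that a parent's (up to four) children occupy consecutive ids, consecutive groups of four share a 2x2 texel block, and blocks are packed into 32x32 tiles of the texture. The actual Cmatch curve may differ.

    // Illustrative mapping from a linear node id to 2D texel coordinates.
    // Assumes a node's children were assigned four consecutive ids, so they
    // land in the same 2x2 block; blocks are packed into 32x32 texel tiles.
    void nodeIdToTexel(int id, int texWidthTexels, int *x, int *y)
    {
        const int TILE = 32;                           // texels per side of a tile
        int block         = id / 4;                    // 2x2 block holding the node
        int within        = id % 4;                    // position inside that block
        int blocksPerTile = (TILE / 2) * (TILE / 2);   // 256 blocks per 32x32 tile
        int tile          = block / blocksPerTile;
        int bInTile       = block % blocksPerTile;
        int tilesPerRow   = texWidthTexels / TILE;
        int tileX = (tile % tilesPerRow) * TILE;       // top-left corner of the tile
        int tileY = (tile / tilesPerRow) * TILE;
        *x = tileX + 2 * (bInTile % (TILE / 2)) + (within & 1);
        *y = tileY + 2 * (bInTile / (TILE / 2)) + (within >> 1);
    }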