Title: ClawHMMer
1. ClawHMMer: Streaming protein search
- Daniel Horn
- Mike Houston
- Pat Hanrahan
- Stanford University
2. Data proliferation in biology
- Protein databases
- SWISS-PROT (200,000 annotated proteins)
- NCBI Nonredundant DB (2.5 million proteins)
- UniprotKB/TrEMBL (2.3 million proteins)
- DNA sequences
- DDBJ Japan (42 million genes)
- NCBI GenBank (10 million sequence records)
3. Protein matching
- Problem
- Different proteins, same function
- Common amino acid patterns
- Solution
- Fuzzy string matching
[Figure: two similar protein structures placed side by side to show their similarity]
4. Current methods
- BLAST [Altschul 90]
- Ad hoc gap penalties
- Matches depend on penalty
- Fast
- HMMer [Krogh 93]
- Robust
- Probabilistic model
- Slow
5. New architectures
- IBM Cell
- Graphics hardware
6. New architectures
- Traditional CPUs
- P4 3.0 GHz: 12 GFLOPS
- G5 2.5 GHz: 10 GFLOPS
- New architectures offer more compute power
- ATI X1800XT: 120 GFLOPS
- NVIDIA G70: 176 GFLOPS
- Cell: 250 GFLOPS
- However, new architectures are parallel
7. Fragment Processor
[Diagram: a single fragment processor and its memory resources]
- Texture memory (256-512 MB)
- 128 constant values
- 8 interpolated values
- 32 temporary R/W registers
- 16 floats eventually written sequentially to texture memory
8. Fragment Processors
- Have 16-24 fragment processors
- Need 10,000 threads executing in parallel
- Hides memory latency
- Shared program counter (PC)
- Newer hardware has more (1 per 4 processors)
[Diagram: several fragment processors, each with R/W registers, sharing one PC; texture memory (256-512 MB), 128 constant values, and interpolants feed the processors; data is written sequentially to texture memory]
9. Brook for GPUs
- Abstraction for streaming architectures [Buck 04]
- Encourages data-parallel applications
- Streams
- Arrays of data
- Kernel
- Small function that can run on a fragment processor
- No access to global variables
- Fixed number of stream inputs, outputs, lookup tables
- Mapping operation
- Runs kernel on each element of input data stream
- Writes to output stream
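The stream/kernel/map model can be sketched in plain Python (illustrative names, not actual Brook syntax): a kernel is a pure function with fixed inputs and a lookup table, and the map operation applies it to every element of an input stream.

```python
def map_kernel(kernel, input_stream, lookup_table):
    # The mapping operation: run the kernel on each element of the
    # input stream and write results to an output stream. The runtime
    # is free to run invocations in parallel, since kernels cannot
    # touch global state.
    return [kernel(element, lookup_table) for element in input_stream]

def scale_kernel(element, lookup_table):
    # A toy kernel: fixed inputs, no globals, one output per element.
    return element * lookup_table["scale"]

output_stream = map_kernel(scale_kernel, [1.0, 2.0, 3.0], {"scale": 2.0})
```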
10. Challenges
- Current search algorithm doesn't map well
- Requires lots of fine-grained parallelism
- 10,000 kernel invocations in parallel
- Requires high arithmetic intensity
- 8:1 compute-to-fetch ratio
- Fragment processors have limited resources
- Kernels need to be small
11. Hidden Markov Model
- State machine represents unobservable world state
[Diagram: two states, Rainy and Sunny]
12. Transition probabilities
[Diagram: P(Rainy to Rainy) = .7, P(Rainy to Sunny) = .3, P(Sunny to Sunny) = .6, P(Sunny to Rainy) = .4]
13. Observations
- Observable clues as to the world state
- Probabilities of observation:
- P(coat|Rainy) = .4, P(nothing|Rainy) = .1, P(umbrella|Rainy) = .5
- P(coat|Sunny) = .3, P(nothing|Sunny) = .6, P(umbrella|Sunny) = .1
[Diagram: Rainy/Sunny state machine with transition probabilities .7, .3, .4, .6]
14. Viterbi algorithm
- Given an HMM and an observation sequence, Viterbi finds
- Most likely path through world state
- Probability of this path
- Dynamic programming
- Fills probability table
- Per observation
- Per state
- Max over incoming transitions
- Probability of the state machine taking the most likely path to this current state while emitting the given observation sequence
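The table-filling recurrence can be sketched in Python using the weather model from the surrounding slides (a sketch of the recurrence only, not HMMer's implementation):

```python
def viterbi_probabilities(states, trans, emit, start, observations):
    # Dynamic programming: one update per observation, per state,
    # taking the max over incoming transitions.
    current = dict(start)
    for obs in observations:
        current = {
            s: emit[s][obs] * max(current[p] * trans[p][s] for p in states)
            for s in states
        }
    return current  # probability of the best path ending in each state

# Weather HMM from the example slides.
STATES = ["Rainy", "Sunny"]
TRANS = {"Rainy": {"Rainy": .7, "Sunny": .3},
         "Sunny": {"Rainy": .4, "Sunny": .6}}
EMIT = {"Rainy": {"coat": .4, "nothing": .1, "umbrella": .5},
        "Sunny": {"coat": .3, "nothing": .6, "umbrella": .1}}
START = {"Rainy": 0.0, "Sunny": 1.0}

final = viterbi_probabilities(STATES, TRANS, EMIT, START,
                              ["umbrella", "coat", "nothing"])
# final["Sunny"] is about .01008, final["Rainy"] about .00392
```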
15. Viterbi Example
[Table: days Sunday-Wednesday; start probabilities Rainy = 0.0, Sunny = 1.0]
- Monday update: Rainy = max(0 × .7, 1 × .4) × P(Umbrella|Rainy)
- Monday update: Sunny = max(0 × .3, 1 × .6) × P(Umbrella|Sunny)
16. Viterbi Example
[Table: days Sunday-Wednesday, observations Umbrella, Coat, Nothing; rows Rainy and Sunny starting at 0.0 and 1.0]
- Viterbi path: Sunny, Rainy, Rainy, Sunny
- .01008 > .00392
17. HMM Training
- Propose an alignment
- Supply training proteins to HMMer
- Compute probabilities of insertions and deletes
[Figure: proteins in a family aligned column by column (residues such as F, R, G, N, T, P); insertions and deletions marked within the region of interest; the alignment yields the probabilistic model]
18. Probabilistic Model: HMM
- State machine represents unobservable world state
- For HMMer, this is the string-match progress
- Observations are amino acids
[Diagram: chain of match states emitting residues such as F, R, G, N, T, P, plus states with no emission]
19. HMMer Searches a Database
- Query: a probabilistic model
- Run against the protein database
- If the probability score is high
- Perform traceback for alignment
20. Viterbi Algorithm Code
- for obs in sequence            // observation loop
-   for s in states              // next-state loop
-     for predecessor in states  // transition loop
-       if max-likelihood path to s came from predecessor
-         fill table[obs][s]
21. Parallelizing Viterbi Over a Database
- parallel_for observ_seq in database  // DB loop
-   for obs in observ_seq              // observation loop
-     for s in states                  // state loop
-       for predecessor in states      // transition loop
-         if max-likelihood path to s came from predecessor
-           fill table[observ_seq][obs][s]
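The outer parallel_for is an ordinary data-parallel map: every sequence's Viterbi run is independent. A minimal sketch in Python (the toy score_sequence stands in for the inner Viterbi loops; the helper names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def score_sequence(observ_seq, model):
    # Stand-in for the inner observation/state/transition loops:
    # any per-sequence computation that touches no shared state.
    return sum(model.get(symbol, 0.0) for symbol in observ_seq)

def search_database(database, model, workers=4):
    # The parallel_for over the database: sequences are scored
    # independently, so the loop iterations may run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda seq: score_sequence(seq, model), database))
```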
22. Flexible parallelism
- Pull database loop inside
- Reduces size/state of inner-loop code
- More flexibility in choosing
- Coarse-grained parallelism
- Data parallel
- Fine-grained parallelism
- SIMD instructions
23. Loop Reordering
- for i = 1 to length(longest observ_seq)  // observation loop
-   for s in states                        // next-state loop
-     Kernel:
-       for pred in states                 // transition loop
-         if max-likelihood path to s came from pred
-           fill table[observ_seq][i][s]
24. Loop Reordering (cont.)
25. Further Optimizations: HMM
- Unroll transition loop completely
- Unroll state loop by 3
- Gain fine-grained data-level parallelism by processing 3 related states (a triplet) at once
[Diagram: HMM from Start to End with a triplet of states highlighted]
26. Further State Loop Unrolling Increases Arithmetic Intensity
- Neighboring probability computations (multi-triplet)
- Read similar inputs
- Use each other's intermediate results
- Fewer fetches, same math
- Higher arithmetic intensity
[Diagram: HMM from Start to End with multiple adjacent triplets highlighted]
27. Implementation on Graphics Hardware (GPU)
- Wrote 1-triplet and 4-triplet state-processing kernels in Brook for GPUs [Buck 04]
- Downloaded as shader program
- Amino acid sequences
- Read-only textures
- State probabilities
- Read/write textures
- Transition probabilities
- Constant registers
- Emission probabilities
- Small lookup table
- If necessary, traceback performed on CPU
28. Experimental Methodology
- Tested systems
- Sean Eddy's HMMer 2.3.2 [Eddy 03]
- Pentium IV 3.0 GHz
- Erik Lindahl's AltiVec v2 implementation [Lindahl 05]
- 2.4 GHz G5 PowerPC
- Our Brook implementation
- ATI X1800 XT
- NVIDIA 7800GT
- Data set
- NCBI Nonredundant database with 2.5 million proteins
- Representative HMMs from Pfam database
29. Adenovirus Performance
- ATI hardware outperforms PowerPC by a factor of 2
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future
30. Performance
- ATI hardware outperforms PowerPC by a factor of 2
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future
31. Performance analysis
- Comparison: ATI X800, 17.5 GB/s, 8.3 Gops
- 1-triplet: bandwidth limited
- 4-triplet: instruction-issue limited
- Conclusions
- 4-triplet kernel achieves 90% of peak performance
- Unrolling the kernel is important
32. Scales Linearly
- Results on
- 16-node cluster
- Gig-E interconnect
- ATI Radeon 9800
- Dual CPU
- Split 2.5 million protein database across nodes
- Linear scalability
- Need 10,000 proteins per cluster to keep GPUs busy
33. HMMer-Pfam
- Solves the inverse problem of HMMer-search
- Given a protein of interest
- Search a database of HMMs
- Problem for HMMer-pfam
- Requires traceback for every HMM
- Not enough memory on GPU
- GPU requires 10,000 elements in parallel
- Requires storage for 10,000 × num_states × sequence_length floating-point probabilities
- Need architecture with finer parallelism
34. Summary
- Streaming version of HMMer
- Implemented on GPUs
- Performs well
- 2-3x a PowerPC G5
- 20-30x a Pentium 4
- Well suited for other parallel architectures
- Cell
35. Acknowledgements
- Funding
- Sheila Vaidya, John Johnstone @ LLNL
- People
- Erik Lindahl
- ATI folks
- NVIDIA folks
- Sean Eddy
- Ian Buck and the Brook team
36. Questions?
- danielrh@graphics.stanford.edu
- mhouston@graphics.stanford.edu
37. Importance of Streaming HMMer Algorithm
- Scales to many types of parallel architectures
- By adjusting the width of the parallel_for loops
- Cell, multicore, clusters
- Cell
- Addressable read/write memory
- Fine-grained parallelism
- Pfam possibility
- Tens, not thousands, of parallel DB queries
- On Cell, 16-64 database entries could be processed
- Each kernel could process all states
- Only returns a probability
38. General HMM Performance
- Arbitrary transition matrix
- Exactly like repeated matrix-vector multiply
- Each matrix is the transition matrix
- Scaled by emission probability
- max() supplants add() in the matrix-vector analogy
- Brook for GPUs paper shows
- Matrix-vector multiply
- Good for streaming architectures
- Each element is touched once
- Streaming over large quantities of memory
- GPU memory controller designed for this mode
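The analogy can be made concrete: one Viterbi step is a matrix-vector product in which max() replaces the additive accumulation, followed by an elementwise scale by the emission probabilities. A sketch using the weather example's numbers (assumed indexing: row = predecessor state, column = successor state):

```python
def max_times_matvec(trans, vec):
    # Like a matrix-vector multiply, but max() supplants add():
    # result[s] = max over predecessors p of vec[p] * trans[p][s].
    n = len(vec)
    return [max(vec[p] * trans[p][s] for p in range(n)) for s in range(n)]

def viterbi_step(trans, emissions, vec):
    # One Viterbi update: max-times "matvec", scaled by emissions.
    return [e * v for e, v in zip(emissions, max_times_matvec(trans, vec))]

# Weather example: states [Rainy, Sunny], observation Umbrella.
trans = [[.7, .3],   # from Rainy
         [.4, .6]]   # from Sunny
step = viterbi_step(trans, [.5, .1], [0.0, 1.0])
# step is approximately [.2, .06], the first column of the worked example
```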
39. Filling Probability Table
[Diagram: probability table with columns for states A, B, C, D per database sequence; Sequences A-D from the sequence database stream their amino acids (F, R, N, T, G, P, ...) through the table, with example probabilities such as .8, .2, .3, .1]
40. Competing with CPU Cache
- Probability at each state requires max() over incoming state probabilities
- Incoming state probabilities live in memory on graphics hardware
- Incoming state probabilities live in L1 cache on CPU
- Only the final probability must be written to memory
- Only tiny database entries must be read
- Luckily, the 12-way version is instruction limited, not bandwidth limited, on GPU
- CPU is also instruction limited
41. Viterbi Example
- Observations: Umbrella, Coat, Nothing
- Start: Rainy = 0, Sunny = 1.0
- Rainy: P_emit(Rainy, Umbrella) × max(0 × .7, 1 × .4) = .5 × max(0, .4) = .2
- Sunny: P_emit(Sunny, Umbrella) × max(0 × .3, 1 × .6) = .1 × max(0, .6) = .06
42. Viterbi Example
- Observations: Umbrella, Coat, Nothing
- Previous column: Rainy = .2, Sunny = .06
- Rainy: P_emit(Rainy, Coat) × max(.2 × .7, .06 × .4) = .4 × max(.14, .024) = .056
- Sunny: P_emit(Sunny, Coat) × max(.2 × .3, .06 × .6) = .3 × max(.06, .036) = .018
43. Viterbi Example
- Observations: Umbrella, Coat, Nothing
- Previous column: Rainy = .056, Sunny = .018
- Rainy: P_emit(Rainy, Nothing) × max(.056 × .7, .018 × .4) = .1 × max(.0392, .0072) = .00392
- Sunny: P_emit(Sunny, Nothing) × max(.056 × .3, .018 × .6) = .6 × max(.0168, .0108) = .01008
44. Performance
- ATI hardware outperforms PowerPC by a factor of 2
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future
45. Transition Probabilities
- Represent chance of the world changing state
- Trained from example data
[Diagram: proteins in family, with transition probabilities .5, .25, .25]
- 25% chance of correct match
- 50% chance of insertion
- 25% chance of deletion
46. HMMer Scores a Protein
- Returns probable alignment of protein to model
[Diagram: a database protein (F, R, G, N, T, F, T, P) and the probabilistic model feed into HMMer, which outputs the most likely alignment to the model and a probability score]
47. Probabilistic Model: HMM
- Specialized layout of states
- Trained to randomly generate amino-acid chains
- Chains likely to belong in a desired family
- States with a circle emit no observation
[Diagram: HMM topology from Start to End]
48. GPU Limitations
- High-granularity parallelism of 10,000 elements
- No room to store entire state table
- Download of database
- Readback of results
- Few registers
- Only mechanism of fast read/write storage
- Limits how many state triplets a kernel may process
- Number of kernel outputs
49. Viterbi Traceback Example
[Table: days Sunday-Wednesday, observations Umbrella, Coat, Nothing; rows Rainy (starting at 0) and Sunny (starting at 1.0)]
- Monday: 1.0 × .4 > 0.0 × .7, so Rainy's best predecessor is Sunny
- Tuesday: .06 × .4 < .2 × .7, so Rainy's best predecessor is Rainy
- Wednesday: .018 × .6 < .056 × .3, so Sunny's best predecessor is Rainy
- Final: .01008 > .00392, so the path ends in Sunny
- Viterbi path: Sunny, Rainy, Rainy, Sunny
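The traceback can be sketched by recording, at each update, which predecessor won the max(), then walking those pointers backwards from the most likely final state (weather model as in the example; a sketch, not HMMer's traceback code):

```python
def viterbi_with_traceback(states, trans, emit, start, observations):
    current = dict(start)
    backpointers = []  # one {state: best predecessor} dict per observation
    for obs in observations:
        nxt, ptr = {}, {}
        for s in states:
            # Record which predecessor supplied the max().
            best = max(states, key=lambda p: current[p] * trans[p][s])
            ptr[s] = best
            nxt[s] = current[best] * trans[best][s] * emit[s][obs]
        current = nxt
        backpointers.append(ptr)
    # Walk backwards from the most likely final state.
    path = [max(current, key=current.get)]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

TRANS = {"Rainy": {"Rainy": .7, "Sunny": .3},
         "Sunny": {"Rainy": .4, "Sunny": .6}}
EMIT = {"Rainy": {"coat": .4, "nothing": .1, "umbrella": .5},
        "Sunny": {"coat": .3, "nothing": .6, "umbrella": .1}}
path = viterbi_with_traceback(["Rainy", "Sunny"], TRANS, EMIT,
                              {"Rainy": 0.0, "Sunny": 1.0},
                              ["umbrella", "coat", "nothing"])
# path: ["Sunny", "Rainy", "Rainy", "Sunny"], as on the slide
```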