Title: ClawHMMer
1. ClawHMMer: Streaming protein search
- Daniel Horn
- Mike Houston
- Pat Hanrahan
- Stanford University
2. Data proliferation in biology
- Protein databases
- SWISS-PROT (200,000 annotated proteins)
- NCBI Nonredundant DB (2.5 million proteins)
- UniprotKB/TrEMBL (2.3 million proteins)
- DNA sequences
- DDBJ Japan (42 million genes)
- NCBI GenBank (10 million sequence records)
3. Protein matching
- Problem
- Different proteins, same function
- Common amino acid patterns
- Solution
- Fuzzy string matching
[Figure: two similar protein structures placed side by side to show their similarity]
4. Current methods
- BLAST [Altschul 90]
- Ad hoc gap penalties
- Matches depend on penalty
- Fast
- HMMer [Krogh 93]
- Robust
- Probabilistic model
- Slow
5. New architectures
- IBM Cell
- Graphics hardware
6. New architectures
- Traditional CPUs
- P4 3.0 GHz: 12 GFLOPS
- G5 2.5 GHz: 10 GFLOPS
- New architectures offer more compute power
- ATI X1800XT: 120 GFLOPS
- NVIDIA G70: 176 GFLOPS
- Cell: 250 GFLOPS
- However, new architectures are parallel
7. Fragment Processor
[Diagram: a single fragment processor and its memory resources]
- Texture memory (256-512 MB)
- 128 constant values
- 8 interpolated values
- 32 temporary R/W registers
- 16 floats eventually written sequentially to texture memory
8. Fragment Processors
- Have 16-24 fragment processors
- Need 10,000 threads executing in parallel
- Hides memory latency
- Shared program counter (PC)
- Newer hardware has more (1 per 4 processors)
[Diagram: several fragment processors, each with R/W registers, sharing one PC; texture memory (256-512 MB), 128 constant values, and interpolants feed the processors; data is written sequentially to texture memory]
9. Brook for GPUs
- Abstraction for streaming architectures [Buck 04]
- Encourages data-parallel applications
- Streams
- Arrays of data
- Kernel
- Small function that can run on a fragment processor
- No access to global variables
- Fixed number of stream inputs, outputs, lookup tables
- Mapping operation
- Runs kernel on each element of input data stream
- Writes to output stream
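The stream/kernel/map model can be sketched in plain Python (illustrative names, not actual Brook syntax): a kernel is a pure function with fixed inputs and a lookup table, and the map operation applies it to every element of an input stream.

```python
def map_kernel(kernel, input_stream, lookup_table):
    # The mapping operation: run the kernel on each element of the
    # input stream and write results to an output stream. The runtime
    # is free to run invocations in parallel, since kernels cannot
    # touch global state.
    return [kernel(element, lookup_table) for element in input_stream]

def scale_kernel(element, lookup_table):
    # A toy kernel: fixed inputs, no globals, one output per element.
    return element * lookup_table["scale"]

output_stream = map_kernel(scale_kernel, [1.0, 2.0, 3.0], {"scale": 2.0})
```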
10. Challenges
- Current search algorithm doesn't map well
- Requires lots of fine-grained parallelism
- 10,000 kernel invocations in parallel
- Requires high arithmetic intensity
- 8:1 compute-to-fetch ratio
- Fragment processors have limited resources
- Kernels need to be small
11. Hidden Markov Model
- State machine represents unobservable world state
[Diagram: two states, Rainy and Sunny]
12. Transition probabilities
[Diagram: P(Rainy to Rainy) = .7, P(Rainy to Sunny) = .3, P(Sunny to Sunny) = .6, P(Sunny to Rainy) = .4]
13. Observations
- Observable clues as to the world state
- Probabilities of observation:
- P(coat|Rainy) = .4, P(nothing|Rainy) = .1, P(umbrella|Rainy) = .5
- P(coat|Sunny) = .3, P(nothing|Sunny) = .6, P(umbrella|Sunny) = .1
[Diagram: Rainy/Sunny state machine with transition probabilities .7, .3, .4, .6]
14. Viterbi algorithm
- Given an HMM and an observation sequence, Viterbi finds
- Most likely path through world state
- Probability of this path
- Dynamic programming
- Fills probability table
- Per observation
- Per state
- Max over incoming transitions
- Probability of the state machine taking the most likely path to this current state while emitting the given observation sequence
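The table-filling recurrence can be sketched in Python using the weather model from the surrounding slides (a sketch of the recurrence only, not HMMer's implementation):

```python
def viterbi_probabilities(states, trans, emit, start, observations):
    # Dynamic programming: one update per observation, per state,
    # taking the max over incoming transitions.
    current = dict(start)
    for obs in observations:
        current = {
            s: emit[s][obs] * max(current[p] * trans[p][s] for p in states)
            for s in states
        }
    return current  # probability of the best path ending in each state

# Weather HMM from the example slides.
STATES = ["Rainy", "Sunny"]
TRANS = {"Rainy": {"Rainy": .7, "Sunny": .3},
         "Sunny": {"Rainy": .4, "Sunny": .6}}
EMIT = {"Rainy": {"coat": .4, "nothing": .1, "umbrella": .5},
        "Sunny": {"coat": .3, "nothing": .6, "umbrella": .1}}
START = {"Rainy": 0.0, "Sunny": 1.0}

final = viterbi_probabilities(STATES, TRANS, EMIT, START,
                              ["umbrella", "coat", "nothing"])
# final["Sunny"] is about .01008, final["Rainy"] about .00392
```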
15. Viterbi Example
[Table: days Sunday-Wednesday; start probabilities Rainy = 0.0, Sunny = 1.0]
- Monday update: Rainy = max(0 × .7, 1 × .4) × P(Umbrella|Rainy)
- Monday update: Sunny = max(0 × .3, 1 × .6) × P(Umbrella|Sunny)
16. Viterbi Example
[Table: days Sunday-Wednesday, observations Umbrella, Coat, Nothing; rows Rainy and Sunny starting at 0.0 and 1.0]
- Viterbi path: Sunny, Rainy, Rainy, Sunny
- .01008 > .00392
17. HMM Training
- Propose an alignment
- Supply training proteins to HMMer
- Compute probabilities of insertions and deletes
[Figure: proteins in a family aligned column by column (residues such as F, R, G, N, T, P); insertions and deletions marked within the region of interest; the alignment yields the probabilistic model]
18. Probabilistic Model: HMM
- State machine represents unobservable world state
- For HMMer, this is the string-match progress
- Observations are amino acids
[Diagram: chain of match states emitting residues such as F, R, G, N, T, P, plus states with no emission]
19. HMMer Searches a Database
- Query: a probabilistic model
- Run against the protein database
- If the probability score is high
- Perform traceback for alignment
20. Viterbi Algorithm Code
- for obs in sequence            // observation loop
-   for s in states              // next-state loop
-     for predecessor in states  // transition loop
-       if max-likelihood path to s came from predecessor
-         fill table[obs][s]
21. Parallelizing Viterbi Over a Database
- parallel_for observ_seq in database  // DB loop
-   for obs in observ_seq              // observation loop
-     for s in states                  // state loop
-       for predecessor in states      // transition loop
-         if max-likelihood path to s came from predecessor
-           fill table[observ_seq][obs][s]
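The outer parallel_for is an ordinary data-parallel map: every sequence's Viterbi run is independent. A minimal sketch in Python (the toy score_sequence stands in for the inner Viterbi loops; the helper names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def score_sequence(observ_seq, model):
    # Stand-in for the inner observation/state/transition loops:
    # any per-sequence computation that touches no shared state.
    return sum(model.get(symbol, 0.0) for symbol in observ_seq)

def search_database(database, model, workers=4):
    # The parallel_for over the database: sequences are scored
    # independently, so the loop iterations may run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda seq: score_sequence(seq, model), database))
```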
22. Flexible parallelism
- Pull database loop inside
- Reduces size/state of inner-loop code
- More flexibility in choosing
- Coarse-grained parallelism
- Data parallel
- Fine-grained parallelism
- SIMD instructions
23. Loop Reordering
- for i = 1 to length(longest observ_seq)  // observation loop
-   for s in states                        // next-state loop
-     Kernel:
-       for pred in states                 // transition loop
-         if max-likelihood path to s came from pred
-           fill table[observ_seq][i][s]
24. Loop Reordering (cont.)
25. Further Optimizations: HMM
- Unroll transition loop completely
- Unroll state loop by 3
- Gain fine-grained data-level parallelism by processing 3 related states (a triplet) at once
[Diagram: HMM from Start to End with a triplet of states highlighted]
26. Further State Loop Unrolling Increases Arithmetic Intensity
- Neighboring probability computations (multi-triplet)
- Read similar inputs
- Use each other's intermediate results
- Fewer fetches, same math
- Higher arithmetic intensity
[Diagram: HMM from Start to End with multiple adjacent triplets highlighted]
27. Implementation on Graphics Hardware (GPU)
- Wrote 1-triplet and 4-triplet state-processing kernels in Brook for GPUs [Buck 04]
- Downloaded as shader program
- Amino acid sequences
- Read-only textures
- State probabilities
- Read/write textures
- Transition probabilities
- Constant registers
- Emission probabilities
- Small lookup table
- If necessary, traceback performed on CPU
28. Experimental Methodology
- Tested systems
- Sean Eddy's HMMer 2.3.2 [Eddy 03]
- Pentium IV 3.0 GHz
- Erik Lindahl's AltiVec v2 implementation [Lindahl 05]
- 2.4 GHz G5 PowerPC
- Our Brook implementation
- ATI X1800 XT
- NVIDIA 7800GT
- Data set
- NCBI Nonredundant database with 2.5 million proteins
- Representative HMMs from Pfam database
29. Adenovirus Performance
- ATI hardware outperforms PowerPC by a factor of 2
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future
30. Performance
- ATI hardware outperforms PowerPC by a factor of 2
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future
31. Performance analysis
- Comparison: ATI X800, 17.5 GB/s, 8.3 Gops
- 1-triplet: bandwidth limited
- 4-triplet: instruction-issue limited
- Conclusions
- 4-triplet kernel achieves 90% of peak performance
- Unrolling the kernel is important
32. Scales Linearly
- Results on
- 16-node cluster
- Gig-E interconnect
- ATI Radeon 9800
- Dual CPU
- Split 2.5 million protein database across nodes
- Linear scalability
- Need 10,000 proteins per cluster to keep GPUs busy
33. HMMer-Pfam
- Solves the inverse problem of HMMer-search
- Given a protein of interest
- Search a database of HMMs
- Problem for HMMer-pfam
- Requires traceback for every HMM
- Not enough memory on GPU
- GPU requires 10,000 elements in parallel
- Requires storage for 10,000 × num_states × sequence_length floating-point probabilities
- Need architecture with finer parallelism
34. Summary
- Streaming version of HMMer
- Implemented on GPUs
- Performs well
- 2-3x a PowerPC G5
- 20-30x a Pentium 4
- Well suited for other parallel architectures
- Cell
35. Acknowledgements
- Funding
- Sheila Vaidya, John Johnstone @ LLNL
- People
- Erik Lindahl
- ATI folks
- NVIDIA folks
- Sean Eddy
- Ian Buck and the Brook team
36. Questions?
- danielrh@graphics.stanford.edu
- mhouston@graphics.stanford.edu
37. Importance of Streaming HMMer Algorithm
- Scales to many types of parallel architectures
- By adjusting the width of the parallel_for loops
- Cell, multicore, clusters
- Cell
- Addressable read/write memory
- Fine-grained parallelism
- Pfam possibility
- Tens, not thousands, of parallel DB queries
- On Cell, 16-64 database entries could be processed
- Each kernel could process all states
- Only returns a probability
38. General HMM Performance
- Arbitrary transition matrix
- Exactly like repeated matrix-vector multiply
- Each matrix is the transition matrix
- Scaled by emission probability
- max() supplants add() in the matrix-vector analogy
- Brook for GPUs paper shows
- Matrix-vector multiply
- Good for streaming architectures
- Each element is touched once
- Streaming over large quantities of memory
- GPU memory controller designed for this mode
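The analogy can be made concrete: one Viterbi step is a matrix-vector product in which max() replaces the additive accumulation, followed by an elementwise scale by the emission probabilities. A sketch using the weather example's numbers (assumed indexing: row = predecessor state, column = successor state):

```python
def max_times_matvec(trans, vec):
    # Like a matrix-vector multiply, but max() supplants add():
    # result[s] = max over predecessors p of vec[p] * trans[p][s].
    n = len(vec)
    return [max(vec[p] * trans[p][s] for p in range(n)) for s in range(n)]

def viterbi_step(trans, emissions, vec):
    # One Viterbi update: max-times "matvec", scaled by emissions.
    return [e * v for e, v in zip(emissions, max_times_matvec(trans, vec))]

# Weather example: states [Rainy, Sunny], observation Umbrella.
trans = [[.7, .3],   # from Rainy
         [.4, .6]]   # from Sunny
step = viterbi_step(trans, [.5, .1], [0.0, 1.0])
# step is approximately [.2, .06], the first column of the worked example
```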
39. Filling Probability Table
[Diagram: probability table with columns for states A, B, C, D per database sequence; Sequences A-D from the sequence database stream their amino acids (F, R, N, T, G, P, ...) through the table, with example probabilities such as .8, .2, .3, .1]
40. Competing with CPU Cache
- Probability at each state requires max() over incoming state probabilities
- Incoming state probabilities live in memory on graphics hardware
- Incoming state probabilities live in L1 cache on CPU
- Only the final probability must be written to memory
- Only tiny database entries must be read
- Luckily, the 12-way version is instruction limited, not bandwidth limited, on GPU
- CPU is also instruction limited
41. Viterbi Example
- Observations: Umbrella, Coat, Nothing
- Start: Rainy = 0, Sunny = 1.0
- Rainy: P_emit(Rainy, Umbrella) × max(0 × .7, 1 × .4) = .5 × max(0, .4) = .2
- Sunny: P_emit(Sunny, Umbrella) × max(0 × .3, 1 × .6) = .1 × max(0, .6) = .06
42. Viterbi Example
- Observations: Umbrella, Coat, Nothing
- Previous column: Rainy = .2, Sunny = .06
- Rainy: P_emit(Rainy, Coat) × max(.2 × .7, .06 × .4) = .4 × max(.14, .024) = .056
- Sunny: P_emit(Sunny, Coat) × max(.2 × .3, .06 × .6) = .3 × max(.06, .036) = .018
43. Viterbi Example
- Observations: Umbrella, Coat, Nothing
- Previous column: Rainy = .056, Sunny = .018
- Rainy: P_emit(Rainy, Nothing) × max(.056 × .7, .018 × .4) = .1 × max(.0392, .0072) = .00392
- Sunny: P_emit(Sunny, Nothing) × max(.056 × .3, .018 × .6) = .6 × max(.0168, .0108) = .01008
44. Performance
- ATI hardware outperforms PowerPC by a factor of 2
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future
45. Transition Probabilities
- Represent chance of the world changing state
- Trained from example data
[Diagram: proteins in family, with transition probabilities .5, .25, .25]
- 25% chance of correct match
- 50% chance of insertion
- 25% chance of deletion
46. HMMer Scores a Protein
- Returns probable alignment of protein to model
[Diagram: a database protein (F, R, G, N, T, F, T, P) and the probabilistic model feed into HMMer, which outputs the most likely alignment to the model and a probability score]
47. Probabilistic Model: HMM
- Specialized layout of states
- Trained to randomly generate amino-acid chains
- Chains likely to belong in a desired family
- States with a circle emit no observation
[Diagram: HMM topology from Start to End]
48. GPU Limitations
- High-granularity parallelism of 10,000 elements
- No room to store entire state table
- Download of database
- Readback of results
- Few registers
- Only mechanism of fast read/write storage
- Limits how many state triplets a kernel may process
- Number of kernel outputs
49. Viterbi Traceback Example
[Table: days Sunday-Wednesday, observations Umbrella, Coat, Nothing; rows Rainy (starting at 0) and Sunny (starting at 1.0)]
- Monday: 1.0 × .4 > 0.0 × .7, so Rainy's best predecessor is Sunny
- Tuesday: .06 × .4 < .2 × .7, so Rainy's best predecessor is Rainy
- Wednesday: .018 × .6 < .056 × .3, so Sunny's best predecessor is Rainy
- Final: .01008 > .00392, so the path ends in Sunny
- Viterbi path: Sunny, Rainy, Rainy, Sunny
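The traceback can be sketched by recording, at each update, which predecessor won the max(), then walking those pointers backwards from the most likely final state (weather model as in the example; a sketch, not HMMer's traceback code):

```python
def viterbi_with_traceback(states, trans, emit, start, observations):
    current = dict(start)
    backpointers = []  # one {state: best predecessor} dict per observation
    for obs in observations:
        nxt, ptr = {}, {}
        for s in states:
            # Record which predecessor supplied the max().
            best = max(states, key=lambda p: current[p] * trans[p][s])
            ptr[s] = best
            nxt[s] = current[best] * trans[best][s] * emit[s][obs]
        current = nxt
        backpointers.append(ptr)
    # Walk backwards from the most likely final state.
    path = [max(current, key=current.get)]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

TRANS = {"Rainy": {"Rainy": .7, "Sunny": .3},
         "Sunny": {"Rainy": .4, "Sunny": .6}}
EMIT = {"Rainy": {"coat": .4, "nothing": .1, "umbrella": .5},
        "Sunny": {"coat": .3, "nothing": .6, "umbrella": .1}}
path = viterbi_with_traceback(["Rainy", "Sunny"], TRANS, EMIT,
                              {"Rainy": 0.0, "Sunny": 1.0},
                              ["umbrella", "coat", "nothing"])
# path: ["Sunny", "Rainy", "Rainy", "Sunny"], as on the slide
```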