1
ClawHMMer
  • Streaming protein search
  • Daniel Horn
  • Mike Houston
  • Pat Hanrahan
  • Stanford University

2
Data proliferation in biology
  • Protein databases
  • SWISS-PROT (200,000 annotated proteins)
  • NCBI Nonredundant DB (2.5 million proteins)
  • UniprotKB/TrEMBL (2.3 million proteins)
  • DNA sequences
  • DDBJ Japan (42 million genes)
  • NCBI GenBank (10 million sequence records)

3
Protein matching
  • Problem
  • Different proteins, same function
  • common amino acid patterns
  • Solution
  • Fuzzy string matching

[Figure: two amino acid sequences placed side by side, showing similar patterns]
4
Current methods
  • BLAST [Altschul 90]
    - Ad hoc gap penalties
    - Matches depend on penalty
    + Fast
  • HMMer [Krogh 93]
    + Robust
    + Probabilistic model
    - Slow

5
New architectures
IBM Cell
Graphics Hardware
6
New architectures
  • Traditional CPUs
    • P4 3.0 GHz: 12 GFlops
    • G5 2.5 GHz: 10 GFlops
  • New architectures offer more compute power
    • ATI X1800XT: 120 GFlops
    • NVIDIA G70: 176 GFlops
    • Cell: 250 GFlops
  • However, new architectures are parallel

7
Fragment Processor
[Diagram: the fragment processor reads texture memory (256-512 MB), 128 constant values, and 8 interpolated values; it has 32 temporary R/W registers, and up to 16 floats are eventually written sequentially to texture memory]
8
Fragment Processors
  • GPUs have 16-24 fragment processors
  • Need 10,000 threads executing in parallel
    • Hides memory latency
  • Shared program counter (PC)
    • Newer hardware has more (1 per 4 processors)
[Diagram: several fragment processors with their R/W registers share one program counter, texture memory (256-512 MB), 128 constant values, and interpolants; data is written sequentially to texture memory]
9
Brook For GPUs
  • Abstraction for streaming architectures [Buck 04]
  • Encourages data-parallel applications
  • Streams
  • Arrays of data
  • Kernel
  • Small function that can run on a fragment
    processor
  • No access to global variables
  • Fixed number of stream inputs, outputs, lookup
    tables
  • Mapping operation
  • Runs kernel on each element of input data stream
  • Writes to output stream
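
A rough Python analogy for these three concepts (illustrative only; Brook itself is a C-like language): a stream is an array, a kernel is a small pure function, and the mapping operation runs the kernel independently on every element.

    def kernel(x):
        # kernel: a small pure function -- no globals, fixed inputs and outputs
        return 2.0 * x + 1.0

    stream_in = [0.5, 1.5, 2.5]                  # stream: an array of data
    stream_out = [kernel(x) for x in stream_in]  # map: run the kernel per element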

10
Challenges
  • Current search algorithm doesn't map well
  • Requires lots of fine-grained parallelism
    • 10,000 kernel invocations in parallel
  • Requires high arithmetic intensity
    • 8:1 compute-to-fetch ratio
  • Fragment processors have limited resources
    • Kernels need to be small

11
Hidden Markov Model
  • State machine represents unobservable world state

[Diagram: two hidden states, Rainy and Sunny]
12
Transition probabilities
[Diagram: P(Rainy→Rainy) = .7, P(Rainy→Sunny) = .3, P(Sunny→Sunny) = .6, P(Sunny→Rainy) = .4]
13
Observations
[Diagram: the same two states with their transition probabilities, now annotated with emissions]
P(coat | Rainy) = .4   P(nothing | Rainy) = .1   P(umbrella | Rainy) = .5
P(coat | Sunny) = .3   P(nothing | Sunny) = .6   P(umbrella | Sunny) = .1
Probabilities of observation
  • Observable clues as to the world state

14
Viterbi algorithm
  • Given an HMM and an observation sequence, Viterbi finds
    • The most likely path through world states
    • The probability of this path
  • Dynamic programming (sketched in code below)
    • Fills a probability table
      • Per observation
      • Per state
    • Max over incoming transitions
    • Entry: probability of the state machine taking the most likely path to the current state while emitting the observation sequence so far
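
A minimal, runnable Python sketch of this table fill plus the traceback (the names viterbi, trans, and emit are illustrative; the talk's actual implementation is in Brook):

    def viterbi(obs_seq, states, start, trans, emit):
        # start[s]    -- probability of state s before any observation
        # trans[p][s] -- probability of moving from state p to state s
        # emit[s][o]  -- probability of state s emitting observation o
        table = [dict(start)]          # one row per observation, plus the start
        back = []                      # backpointers for the traceback
        for obs in obs_seq:            # observation loop
            row, ptrs = {}, {}
            for s in states:           # next-state loop
                # max over incoming transitions (the transition loop)
                pred = max(states, key=lambda p: table[-1][p] * trans[p][s])
                row[s] = table[-1][pred] * trans[pred][s] * emit[s][obs]
                ptrs[s] = pred
            table.append(row)
            back.append(ptrs)
        s = max(states, key=lambda q: table[-1][q])   # best final state
        best, path = table[-1][s], [s]
        for ptrs in reversed(back):                   # walk the backpointers
            s = ptrs[s]
            path.append(s)
        return best, path[::-1]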

15
Viterbi Example
Sunday, Monday, Tuesday, Wednesday
  • Observations: Umbrella (Mon), Coat (Tue), Nothing (Wed)

Initial probabilities (Sunday): Rainy = 0.0, Sunny = 1.0
P(Rainy, Monday) = max(0 × .7, 1 × .4) × p(Umbrella | Rainy)
P(Sunny, Monday) = max(0 × .3, 1 × .6) × p(Umbrella | Sunny)
16
Viterbi Example
Sunday, Monday, Tuesday, Wednesday
  • Observations: Umbrella (Mon), Coat (Tue), Nothing (Wed)

[Trellis: one Rainy and one Sunny node per day, starting from Rainy = 0.0, Sunny = 1.0 on Sunday]
Viterbi path: Sunny, Rainy, Rainy, Sunny
Final probabilities: .01008 (Sunny) > .00392 (Rainy)
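
Feeding the slides' weather model into the viterbi sketch above reproduces these numbers:

    states = ["Rainy", "Sunny"]
    start = {"Rainy": 0.0, "Sunny": 1.0}                 # Sunday
    trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
             "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
    emit = {"Rainy": {"coat": 0.4, "nothing": 0.1, "umbrella": 0.5},
            "Sunny": {"coat": 0.3, "nothing": 0.6, "umbrella": 0.1}}

    prob, path = viterbi(["umbrella", "coat", "nothing"],
                         states, start, trans, emit)
    print(round(prob, 5), path)  # 0.01008 ['Sunny', 'Rainy', 'Rainy', 'Sunny']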
17
HMM Training
[Figure: proteins in the family are aligned column by column over the region of interest to form a proposed alignment; most columns agree on a consensus residue (F, R, N, T, P, ...), with an occasional insertion (a stray G) and deletion, and the alignment feeds a probabilistic model]
  • Propose an alignment
  • Supply training proteins to HMMer
  • Compute probabilities of insertions and deletions

18
Probabilistic model: HMM
[Diagram: a chain of HMM states, each emitting one amino acid of the match (Emit F, Emit G, Emit P, Emit N, Emit T, Emit R); one state emits nothing]
  • State machine represents unobservable world state
  • For HMMer, this is the string match progress
  • Observations are amino acids

19
HMMer Searches a Database
[Diagram: each sequence in the database is run as a query against the probabilistic model]
  • If the probability score is high
    • Perform traceback to recover the alignment
20
Viterbi Algorithm Code
for obs in sequence:                  # observation loop
    for s in states:                  # next-state loop
        for predecessor in states:    # transition loop
            if max-likelihood path to s came from predecessor:
                fill table[obs][s]

21
Parallelizing Viterbi Over a Database
parallel_for observ_seq in database:      # DB loop
    for obs in observ_seq:                # observation loop
        for s in states:                  # state loop
            for predecessor in states:    # transition loop
                if max-likelihood path to s came from predecessor:
                    fill table[observ_seq][obs][s]
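
A coarse-grained sketch of this database-level parallel_for, reusing the viterbi function sketched earlier; Python worker processes stand in for the parallel hardware, and search_database is a hypothetical name:

    from functools import partial
    from multiprocessing import Pool

    def search_database(database, states, start, trans, emit):
        # parallel_for over database entries: each worker runs a full Viterbi
        # (assumes the viterbi sketch above is defined at module top level)
        score = partial(viterbi, states=states, start=start,
                        trans=trans, emit=emit)
        with Pool() as pool:
            return pool.map(score, database)  # one (probability, path) per entry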

22
Flexible parallelism
  • Pull the database loop inside
    • Reduces size/state of inner loop code
  • More flexibility in choosing
    • Coarse-grained parallelism
      • Data parallel
    • Fine-grained parallelism
      • SIMD instructions

23
Loop Reordering

for i = 1 to length(longest observ_seq):   # observation loop
    for s in states:                       # next-state loop
        # Kernel, mapped over every observ_seq in the database in parallel:
        for pred in states:                # transition loop
            if max-likelihood path to s came from pred:
                fill table[observ_seq][i][s]
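
One way to picture the reordered loops is a NumPy sketch (viterbi_step is a hypothetical helper, not the actual Brook kernel): one call advances every database sequence by one observation, so the array axis plays the role of the thousands of parallel kernel invocations.

    import numpy as np

    def viterbi_step(prev, obs_col, trans, emit):
        # prev:    (n_seqs, n_states) probabilities from the previous column
        # obs_col: (n_seqs,) observation index i for each sequence
        # trans:   (n_states, n_states) transition matrix, trans[pred, s]
        # emit:    (n_states, n_obs) emission probabilities
        # Transition loop: max over predecessors, for all sequences at once.
        best = np.max(prev[:, :, None] * trans[None, :, :], axis=1)
        return best * emit[:, obs_col].T  # scale by each sequence's emission

The host then supplies the observation loop, calling this step once per column i.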
25
Further Optimizations: HMM
  • Unroll the transition loop completely
  • Unroll the state loop by 3
    • Gain fine-grained data-level parallelism by processing 3 related states (a triplet) at once

[Diagram: the profile HMM from Start to End]
26
Further State Loop Unrolling Increases Arithmetic Intensity
  • Neighboring probability computations (multi-triplet)
    • Read similar inputs
    • Use each other's intermediate results
  • Fewer fetches, same math
    • Higher arithmetic intensity

[Diagram: the profile HMM from Start to End]
27
Implementation on Graphics Hardware (GPU)
  • Wrote 1-triplet and 4-triplet state-processing kernels in Brook for GPUs [Buck 04]
    • Downloaded as a shader program
  • Amino acid sequences
    • Read-only textures
  • State probabilities
    • Read/write textures
  • Transition probabilities
    • Constant registers
  • Emission probabilities
    • Small lookup table
  • If necessary, traceback performed on CPU

28
Experimental Methodology
  • Tested Systems
  • Sean Eddy's HMMer 2.3.2 [Eddy 03]
    • Pentium IV 3.0 GHz
  • Erik Lindahl's AltiVec v2 implementation [Lindahl 05]
    • 2.4 GHz G5 PowerPC
  • Our Brook Implementation
  • ATI X1800 XT
  • NVIDIA 7800GT
  • Data set
  • NCBI Nonredundant database with 2.5 million
    proteins
  • Representative HMMs from Pfam database

29
Adenovirus Performance
  • ATI hardware outperforms the PowerPC by a factor of 2
  • Graphics hardware performs 10-25 times better than HMMer on x86
  • Careful optimization may improve x86 in the future

31
Performance analysis
  • Comparison: ATI X800, 17.5 GB/s, 8.3 Gops
  • 1-triplet kernel: bandwidth limited
  • 4-triplet kernel: instruction-issue limited
  • Conclusions
    • The 4-triplet kernel achieves 90% of peak performance
    • Unrolling the kernel is important

32
Scales Linearly
  • Results on
    • 16-node cluster
    • Gig-E interconnect
    • ATI Radeon 9800
    • Dual CPU
  • Split the 2.5 million protein database across nodes
  • Linear scalability
  • Need 10,000 proteins per node to keep the GPUs busy

33
HMMer-Pfam
  • Solves the inverse problem of HMMer-search
    • Given a protein of interest
    • Search a database of HMMs
  • Problem for HMMer-pfam
    • Requires a traceback for every HMM
    • Not enough memory on the GPU
      • The GPU requires 10,000 elements in parallel
      • That requires storage for 10,000 × num_states × sequence_length floating point probabilities (see below)
  • Need an architecture with finer parallelism
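
For scale (illustrative numbers, not from the talk): a 300-state HMM scored against a 400-residue protein would need 10,000 × 300 × 400 four-byte probabilities, roughly 4.8 GB, against the 256-512 MB of texture memory available.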

34
Summary
  • Streaming version of HMMer
  • Implemented on GPUs
  • Performs well
  • 2-3x a PowerPC G5
  • 20-30x a Pentium 4
  • Well suited for other parallel architectures
  • Cell

35
Acknowledgements
  • Funding
    • Sheila Vaidya, John Johnstone @ LLNL
  • People
    • Erik Lindahl
    • ATI folks
    • NVIDIA folks
    • Sean Eddy
    • Ian Buck and the Brook team

36
Questions?
  • danielrh@graphics.stanford.edu
  • mhouston@graphics.stanford.edu

37
Importance of Streaming HMMer Algorithm
  • Scales to many types of parallel architectures
    • By adjusting the width of the parallel_for loops
    • Cell, multicore, clusters
  • Cell
    • Addressable read/write memory
    • Fine-grained parallelism
    • Pfam possibility
      • Tens, not thousands, of parallel DB queries
      • On Cell, 16-64 database entries could be processed
      • Each kernel could process all states
      • Only returns a probability

38
General HMM Performance
  • Arbitrary transition matrix
    • Exactly like a repeated matrix-vector multiply (see the sketch below)
    • Each matrix is the transition matrix, scaled by the emission probability
    • max() supplants the add() operation in the matrix-vector analogy
  • The Brook for GPUs paper shows matrix-vector multiply is good for streaming architectures
    • Each element is touched once
    • Streaming over large quantities of memory
    • The GPU memory controller is designed for this access mode
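
The analogy in a NumPy sketch (v is the current state-probability vector, T the transition matrix, e the emissions for the current observation; names are illustrative):

    import numpy as np

    def matvec(T, v):
        # ordinary matrix-vector multiply: sum over the incoming axis
        return np.sum(v[:, None] * T, axis=0)

    def viterbi_matvec(T, v, e):
        # one Viterbi step: max() supplants the add(), then scale by emissions
        return e * np.max(v[:, None] * T, axis=0)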

39
Filling Probability Table
[Diagram: filling the probability table; for each sequence in the database (Sequence A-D), columns for states A-D are filled as that sequence's residues (F, R, N, T, ...) stream in]
40
Competing with CPU Cache
  • The probability at each state requires a max() over incoming state probabilities
    • On graphics hardware these sit in memory
    • On the CPU these sit in the L1 cache
  • Only the final probability must be written to memory
  • Only tiny database entries must be read
  • Luckily, the 12-way (4-triplet) version is instruction limited, not bandwidth limited, on the GPU
    • The CPU is also instruction limited

41
Viterbi Example
  • Observations: Umbrella, Coat, Nothing

Step 1 (Umbrella), from Rainy = 0, Sunny = 1.0:
P(Rainy) = p_emit(Rainy, Umbrella) × max(0 × .7, 1 × .4) = .5 × max(0, .4) = .2
P(Sunny) = p_emit(Sunny, Umbrella) × max(0 × .3, 1 × .6) = .1 × max(0, .6) = .06
42
Viterbi Example
  • Observations: Umbrella, Coat, Nothing

Step 2 (Coat), from Rainy = .2, Sunny = .06:
P(Rainy) = p_emit(Rainy, Coat) × max(.2 × .7, .06 × .4) = .4 × max(.14, .024) = .056
P(Sunny) = p_emit(Sunny, Coat) × max(.2 × .3, .06 × .6) = .3 × max(.06, .036) = .018
43
Viterbi Example
  • Observations: Umbrella, Coat, Nothing

Step 3 (Nothing), from Rainy = .056, Sunny = .018:
P(Rainy) = p_emit(Rainy, Nothing) × max(.056 × .7, .018 × .4) = .1 × max(.0392, .0072) = .00392
P(Sunny) = p_emit(Sunny, Nothing) × max(.056 × .3, .018 × .6) = .6 × max(.0168, .0108) = .01008
45
Transition Probabilities
  • Represent chance of the world changing state
  • Trained from example data

[Diagram: proteins in the family determine the transition probabilities at one model position: .5, .25, .25]
  • 25% chance of correct match
  • 50% chance of insertion
  • 25% chance of deletion
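
These probabilities come from counts over the aligned training proteins; for example (illustrative numbers), if at this position 1 of 4 training sequences matches the consensus, 2 of 4 carry an insertion, and 1 of 4 a deletion, the trained probabilities are .25, .5, and .25.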
46
HMMer Scores a Protein
  • Returns probable alignment of protein to model

[Diagram: a database protein and the probabilistic model (states emitting F, R, G, N, T, ...) feed into HMMer, which outputs a probability score and the most likely alignment to the model]
47
Probabilistic Model: HMM
  • Specialized layout of states
  • Trained to randomly generate amino-acid chains
  • Chains likely to belong in a desired family
  • States with a circle emit no observation

[Diagram: the profile HMM layout from Start to End]
48
GPU Limitations
  • High-granularity parallelism of 10,000 elements
    • No room to store the entire state table
  • Download of the database and readback of results
  • Few registers
    • The only mechanism for fast read/write storage
    • Limits how many state triplets a kernel may process
  • Limited number of kernel outputs

49
Viterbi Traceback Example
Sunday, Monday, Tuesday, Wednesday
  • Observations: Umbrella, Coat, Nothing

Traceback, starting from the best final state (Sunny: .01008 > .00392):
  • Wednesday Sunny came from Rainy: .018 × .6 < .056 × .3
  • Tuesday Rainy came from Rainy: .06 × .4 < .2 × .7
  • Monday Rainy came from Sunny: 1.0 × .4 > 0.0 × .7
Viterbi path: Sunny, Rainy, Rainy, Sunny