Fast Inference and Learning in Large-State-Space HMMs
Sajid M. Siddiqi and Andrew W. Moore, The Auton Lab, Carnegie Mellon University (www.autonlab.org)

Transcript and Presenter's Notes
1
Fast Inference and Learning in Large-State-Space
HMMs
  • Sajid M. Siddiqi
  • Andrew W. Moore
  • The Auton Lab
  • Carnegie Mellon University

3
Sajid Siddiqi: Happy
Sajid Siddiqi: Discontented
4
Hidden Markov Models
(State-transition diagram; edge labels show transition probabilities such as 1/3 and 1.)
5
Hidden Markov Models

i   P(q_t+1=s_1 | q_t=s_i)   P(q_t+1=s_2 | q_t=s_i)   …   P(q_t+1=s_j | q_t=s_i)   …   P(q_t+1=s_N | q_t=s_i)
1   a_11   a_12   …   a_1j   …   a_1N
2   a_21   a_22   …   a_2j   …   a_2N
3   a_31   a_32   …   a_3j   …   a_3N
…
i   a_i1   a_i2   …   a_ij   …   a_iN
…
N   a_N1   a_N2   …   a_Nj   …   a_NN

Each of these probability tables is identical
6
Observation Model
(Diagram: hidden state sequence emitting observations O_0 … O_4.)
7
Observation Model
Notation: b_i(k) = P(O_t = k | q_t = s_i)

i   P(O_t=1 | q_t=s_i)   P(O_t=2 | q_t=s_i)   …   P(O_t=k | q_t=s_i)   …   P(O_t=M | q_t=s_i)
1   b_1(1)   b_1(2)   …   b_1(k)   …   b_1(M)
2   b_2(1)   b_2(2)   …   b_2(k)   …   b_2(M)
3   b_3(1)   b_3(2)   …   b_3(k)   …   b_3(M)
…
i   b_i(1)   b_i(2)   …   b_i(k)   …   b_i(M)
…
N   b_N(1)   b_N(2)   …   b_N(k)   …   b_N(M)
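For concreteness, here is a minimal sketch (not from the slides; the names and sizes are illustrative only) of how the parameters described above — the initial distribution, the transition table a_ij, and the observation table b_i(k) — might be held as NumPy arrays:

```python
import numpy as np

# Hypothetical containers for the HMM parameters (illustrative sketch).
N, M = 4, 6                             # N states, M discrete observation symbols
rng = np.random.default_rng(0)

pi = np.full(N, 1.0 / N)                # initial state distribution
A = rng.dirichlet(np.ones(N), size=N)   # A[i, j] = a_ij = P(q_{t+1}=s_j | q_t=s_i)
B = rng.dirichlet(np.ones(M), size=N)   # B[i, k] = b_i(k) = P(O_t=k | q_t=s_i)

# Each row of A and B is a probability distribution
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```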
8
Some Famous HMM Tasks
  • Question 1: State Estimation
  • What is P(q_T = S_i | O_1 O_2 … O_T)?

11
Some Famous HMM Tasks
  • Question 1: State Estimation
  • What is P(q_T = S_i | O_1 O_2 … O_T)?
  • Question 2: Most Probable Path
  • Given O_1 O_2 … O_T, what is the most probable path
    that I took?

13
Some Famous HMM Tasks
  • Question 1: State Estimation
  • What is P(q_T = S_i | O_1 O_2 … O_T)?
  • Question 2: Most Probable Path
  • Given O_1 O_2 … O_T, what is the most probable path
    that I took?

Woke up at 8:35, got on the bus at 9:46, sat in
lecture 10:05–11:22
14
Some Famous HMM Tasks
  • Question 1: State Estimation
  • What is P(q_T = S_i | O_1 O_2 … O_T)?
  • Question 2: Most Probable Path
  • Given O_1 O_2 … O_T, what is the most probable path
    that I took?
  • Question 3: Learning HMMs
  • Given O_1 O_2 … O_T, what is the maximum-likelihood
    HMM that could have produced this string of
    observations?

16
Some Famous HMM Tasks
  • Question 1: State Estimation
  • What is P(q_T = S_i | O_1 O_2 … O_T)?
  • Question 2: Most Probable Path
  • Given O_1 O_2 … O_T, what is the most probable path
    that I took?
  • Question 3: Learning HMMs
  • Given O_1 O_2 … O_T, what is the maximum-likelihood
    HMM that could have produced this string of
    observations?

(State-transition diagram over states Eat, Bus, Walk with transition
probabilities a_AA, a_AB, a_BA, a_BB, a_BC, a_CB, a_CC and observation
probabilities b_A(O_t-1), b_B(O_t), b_C(O_t+1).)
17
Basic Operations in HMMs
  • For an observation sequence O = O_1 … O_T, the three
    basic HMM operations are:

Problem      Goal                                     Algorithm          Complexity
Evaluation   Calculating P(O | λ)                     Forward-Backward   O(TN²)
Inference    Computing Q* = argmax_Q P(O, Q | λ)      Viterbi Decoding   O(TN²)
Learning     Computing λ* = argmax_λ P(O | λ)         Baum-Welch (EM)    O(TN²)
(T = number of timesteps, N = number of states)
18
Basic Operations in HMMs
  • For an observation sequence O = O_1 … O_T, the three
    basic HMM operations are:

Problem      Goal                                     Algorithm          Complexity
Evaluation   Calculating P(O | λ)                     Forward-Backward   O(TN²)
Inference    Computing Q* = argmax_Q P(O, Q | λ)      Viterbi Decoding   O(TN²)
Learning     Computing λ* = argmax_λ P(O | λ)         Baum-Welch (EM)    O(TN²)

This talk: a simple approach to reducing the
complexity in N
(T = number of timesteps, N = number of states)
19
Reducing the Quadratic-in-N Penalty
  • Why does it matter?
  • Quadratic HMM algorithms hinder HMM computations
    when N is large
  • Several promising applications for efficient
    large-state-space HMM algorithms in
  • biological sequence analysis
  • speech recognition
  • real-time HMM systems such as for activity
    monitoring

20
Idea One: Sparse Transition Matrix
  • Only K << N non-zero next-state probabilities

22
Idea One: Sparse Transition Matrix
Only O(TNK)
  • Only K << N non-zero next-state probabilities

23
Idea One: Sparse Transition Matrix
Only O(TNK)
  • But can get very badly confused by impossible
    transitions
  • Cannot learn the sparse structure (once chosen,
    it cannot change)
  • Only K << N non-zero next-state probabilities

24
Dense-Mostly-Constant Transitions
  • K non-constant probabilities per row
  • DMC HMMs comprise a richer and more expressive
    class of models than sparse HMMs

a DMC transition matrix with K = 2
25
Dense-Mostly-Constant Transitions
  • The transition model for state i now comprises:
  • NC_i = { j : s_i → s_j is a non-constant transition
    probability }
  • c_i = the constant transition probability from s_i to all
    states not in NC_i
  • a_ij = the non-constant transition probability for
    s_i → s_j, for j in NC_i

NC_3 = {2, 5},  c_3 = 0.05,  a_32 = 0.25,  a_35 = 0.6
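To make the DMC parameterization concrete, here is a minimal Python sketch (illustrative only, not the paper's code) of one row of a DMC transition model and how a full row of probabilities would be reconstructed from it:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DMCRow:
    """One row of a DMC transition matrix: K non-constant entries plus a shared constant."""
    N: int                      # total number of states
    c: float                    # constant probability for every state outside NC
    a: Dict[int, float] = field(default_factory=dict)  # a[j] for j in NC (the K non-constant entries)

    def prob(self, j: int) -> float:
        """Transition probability from this state to state j."""
        return self.a.get(j, self.c)

    def full_row(self):
        """Expand to a dense length-N row (sums to 1 if the parameters are consistent)."""
        return [self.prob(j) for j in range(self.N)]

# Example matching the slide, in 0-based indices: NC_3 = {2, 5} -> {1, 4},
# c_3 = 0.05, a_32 = 0.25, a_35 = 0.6; with N = 5 the row sums to 1.
row = DMCRow(N=5, c=0.05, a={1: 0.25, 4: 0.6})
assert abs(sum(row.full_row()) - 1.0) < 1e-12
```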
26
HMM Filtering
  • P(q_t = s_i | O_1, O_2 … O_t)

27
HMM Filtering
  • P(q_t = s_i | O_1, O_2 … O_t)
  • Where

29
HMM Filtering
  • P(q_t = s_i | O_1, O_2 … O_t)
  • Where

t   α_t(1)   α_t(2)   α_t(3)   …   α_t(N)
(table filled in row by row for t = 1 … 9)
31
HMM Filtering
  • P(q_t = s_i | O_1, O_2 … O_t)
  • Where

t   α_t(1)   α_t(2)   α_t(3)   …   α_t(N)
(table filled in row by row for t = 1 … 9)
  • Cost: O(TN²)
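The recursion behind the filled-in α table is not transcribed above; it is presumably the standard forward recursion, written here for reference:

```latex
\alpha_1(j) = \pi_j\, b_j(O_1), \qquad
\alpha_{t+1}(j) = b_j(O_{t+1}) \sum_{i=1}^{N} \alpha_t(i)\, a_{ij}, \qquad
P(q_t = s_i \mid O_1 \dots O_t) = \frac{\alpha_t(i)}{\sum_{k} \alpha_t(k)} .
```

Each entry α_{t+1}(j) needs a sum over N predecessors, giving the O(TN²) cost quoted on the slide.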

32
Fast Evaluation in DMC HMMs
33
Fast Evaluation in DMC HMMs
O(N), but common to all j per timestep t
O(K) for each α_t(j)
  • This yields O(TNK) complexity for the evaluation
    problem.
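The trick the two annotations above point at can be written out explicitly. A minimal sketch of one DMC forward step (assuming the DMCRow representation sketched earlier and NumPy arrays; names are illustrative):

```python
import numpy as np

def dmc_forward_step(alpha_t, rows, B, obs):
    """One forward step under the DMC assumption.

    alpha_t : length-N array of alpha values at time t
    rows    : list of N DMCRow objects (rows[i] is the transition row of state i)
    B       : N x M observation matrix, B[j, k] = b_j(k)
    obs     : observation symbol at time t+1
    Cost is O(N + NK) per timestep instead of O(N^2).
    """
    N = len(alpha_t)
    # O(N): mass that flows through the constant entries, shared by every destination j
    const_mass = sum(alpha_t[i] * rows[i].c for i in range(N))
    alpha_next = np.full(N, const_mass)
    # O(K) correction per source state: replace the constant by the true a_ij for j in NC_i
    for i in range(N):
        for j, a_ij in rows[i].a.items():
            alpha_next[j] += alpha_t[i] * (a_ij - rows[i].c)
    return B[:, obs] * alpha_next
```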

34
Fast Inference in DMC HMMs
  • The Viterbi algorithm uses dynamic programming to
    calculate the globally optimal state sequence
    Q* = argmax_Q P(Q, O | λ).

Define δ_t(i) as
The δ variables can be computed in O(TN²) time,
with the O(N) inductive step
Under the DMC assumption, this step can be
carried out in O(K) time
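The definition of δ_t(i) and the inductive step are not transcribed; for reference, the standard Viterbi quantities (which the slide presumably uses) are:

```latex
\delta_t(i) = \max_{q_1 \dots q_{t-1}} P(q_1 \dots q_{t-1},\, q_t = s_i,\, O_1 \dots O_t \mid \lambda),
\qquad
\delta_{t+1}(j) = b_j(O_{t+1}) \max_{i} \delta_t(i)\, a_{ij}.
```

The inner maximization over i is the O(N) inductive step; the slide's claim is that under the DMC assumption it can be reorganized, much like the forward sum, so that only the non-constant entries need individual attention.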
35
Learning a DMC HMM
36
Learning a DMC HMM
  • Idea One
  • Ask user to tell us the DMC structure
  • Learn the parameters using EM

37
Learning a DMC HMM
  • Idea One
  • Ask user to tell us the DMC structure
  • Learn the parameters using EM
  • Simple
  • But in general, we don't know the DMC structure

38
Learning a DMC HMM
  • Idea Two
  • Use EM to learn the DMC structure too
  1. Guess DMC structure
  2. Find expected transition counts and observation
     parameters, given current model and observations
  3. Find maximum likelihood DMC model given counts
  4. Goto 2

39
Learning a DMC HMM
  • Idea Two
  • Use EM to learn the DMC structure too
  1. Guess DMC structure
  2. Find expected transition counts and observation
     parameters, given current model and observations
  3. Find maximum likelihood DMC model given counts
  4. Goto 2

DMC structure can (and does) change!
40
Learning a DMC HMM
  • Idea Two
  • Use EM to learn the DMC structure too
  1. Guess DMC structure
  2. Find expected transition counts and observation
     parameters, given current model and observations
  3. Find maximum likelihood DMC model given counts
  4. Goto 2

In fact, just start with an all-constant
transition model
DMC structure can (and does) change!
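Step 3 above — finding the maximum-likelihood DMC model given the expected counts — is what lets the DMC structure change between iterations. A hedged sketch of one plausible way to do it (not necessarily the paper's exact procedure; `expected_counts_row` is a hypothetical length-N array of expected transition counts out of state i):

```python
import numpy as np

def dmc_m_step_row(expected_counts_row, K):
    """Re-estimate one DMC transition row from expected transition counts.

    Sketch of one plausible M-step: the K destinations with the largest
    expected counts become the non-constant set NC_i; the remaining
    probability mass is spread uniformly over the other states as c_i.
    """
    N = expected_counts_row.shape[0]
    probs = expected_counts_row / expected_counts_row.sum()  # ordinary ML row estimate
    nc = np.argsort(probs)[-K:]                              # indexes of the K largest entries
    a = {int(j): float(probs[j]) for j in nc}                # non-constant entries
    c = (1.0 - sum(a.values())) / (N - K)                    # shared constant for all other states
    return nc, c, a
```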
41
Learning a DMC HMM
  1. Find expected transition counts and observation
    parameters, given current model and observations

42
We want
new estimate of
45
We want
new estimate of
where
46
(No Transcript)
47
(Diagram: the a and b matrices, each of size T × N.)
48
Can get each of these in O(TN) time
(Diagram: the a and b matrices, each of size T × N.)
49
We want
where
Can get each of these in O(TN) time
(Diagram: the a and r matrices, each of size T × N.)
50
We want
where
(Diagram: the a and r matrices, each of size T × N.)
51
We want
where
(Diagram: the N × N matrix S; its entry S_24 is the dot product of
column 2 of a with column 4 of r, each of length T.)
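The equations referred to by "We want … where …" are not transcribed. From the picture (an N × N matrix S whose entries are dot products of columns of the two T × N matrices labelled a and r) and the standard Baum-Welch update, a plausible reconstruction is that a holds the forward variables α_t(i), r holds the backward-and-observation terms, and:

```latex
r_t(j) = b_j(O_{t+1})\, \beta_{t+1}(j), \qquad
S_{ij} = \sum_{t=1}^{T-1} \alpha_t(i)\, r_t(j), \qquad
\hat{a}_{ij} \;\propto\; a_{ij}\, S_{ij}.
```

Under that reading, only the column dot products S_ij are needed to re-estimate the transition probabilities; computed naively, the N² dot products of length-T columns cost O(TN²), which is the bottleneck the following slides attack.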
52
We want
where
O(TN²)
(Diagram: the N × N matrix S as dot products of columns of a and r.)
53
We want
where
  • Speedups
  • Strassen?
O(TN²)
(Diagram: the N × N matrix S as dot products of columns of a and r.)
54
We want
where
  • Speedups
  • Strassen
  • Approximate a by DMC
O(TN²)
(Diagram: the N × N matrix S as dot products of columns of a and r.)
55
We want
where
  • Speedups
  • Strassen
  • Approximate a by DMC
  • Approximate randomized AᵀB
O(TN²)
(Diagram: the N × N matrix S as dot products of columns of a and r.)
56
We want
where
  • Speedups
  • Strassen
  • Approximate a by DMC
  • Approximate randomized AᵀB
  • Sparse structure fine?
O(TN²)
(Diagram: the N × N matrix S as dot products of columns of a and r.)
57
We want
where
  • Speedups
  • Strassen
  • Approximate a by DMC
  • Approximate randomized AᵀB
  • Sparse structure fine
  • Fixed DMC is fine?
O(TN²)
(Diagram: the N × N matrix S as dot products of columns of a and r.)
58
We want
where
  • Speedups
  • Strassen
  • Approximate a by DMC
  • Approximate randomized AᵀB
  • Sparse structure fine
  • Fixed DMC is fine
  • Speedup without approximation
O(TN²)
(Diagram: the N × N matrix S as dot products of columns of a and r.)
59
We want
where
  • Insight One: we only need the top K entries in each
    row of S
  • Insight Two: values in rows of a and r are often
    very skewed
(Diagram: the a and r matrices, each T × N, and the N × N matrix S.)
60
For i = 1..N, store the indexes of the R largest values
in the ith column of a: a-biggies(i)
For j = 1..N, store the indexes of the R largest values
in the jth column of r: r-biggies(j)
(Diagram: the a and r matrices, each T × N, with the R largest
entries of each column marked.)
There's an important detail I'm omitting here, to
do with prescaling the rows of a and r.
61
For i = 1..N, store the indexes of the R largest values
in the ith column of a: a-biggies(i)
For j = 1..N, store the indexes of the R largest values
in the jth column of r: r-biggies(j)
R << T: takes O(TN) time to build all the indexes
(Diagram: the a and r matrices, each T × N, with the R largest
entries of each column marked.)
There's an important detail I'm omitting here, to
do with prescaling the rows of a and r.
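A minimal sketch of this precomputation (illustrative only; `mat` stands for either of the T × N matrices a and r as NumPy arrays, and the prescaling detail mentioned above is deliberately skipped):

```python
import numpy as np

def column_biggies(mat, R):
    """For each column of a T x N matrix, return the row indexes of its R largest values.

    np.argpartition finds the top-R set in O(T) per column, so building all
    N index lists is O(TN) overall (sorting just R items adds O(R log R) each).
    """
    T, N = mat.shape
    top = np.argpartition(mat, T - R, axis=0)[-R:, :]       # R x N, unsorted top-R row indexes
    vals = np.take_along_axis(mat, top, axis=0)             # their values
    order = np.argsort(vals, axis=0)[::-1, :]                # largest-first ordering per column
    return np.take_along_axis(top, order, axis=0)            # R x N

# a_biggies[:, i] : indexes of the R largest entries in column i of a
# r_biggies[:, j] : indexes of the R largest entries in column j of r
# a_biggies = column_biggies(a, R)
# r_biggies = column_biggies(r, R)
```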
65
For i = 1..N, store the indexes of the R largest values
in the ith column of a: a-biggies(i)
For j = 1..N, store the indexes of the R largest values
in the jth column of r: r-biggies(j)
R << T: takes O(TN) time to build all the indexes
Rth largest value in the ith column of a: O(1) time to obtain
Each bound: an O(R) computation
O(1) time to obtain (precached for all j in time O(TN))
(Diagram: the a and r matrices, each T × N.)
There's an important detail I'm omitting here, to
do with prescaling the rows of a and r.
66
Computing the ith row of S
In O(NR) time, we can put upper and lower bounds
on S_ij for j = 1, 2 … N
(Chart: bars for S_ij with upper/lower bounds, plotted against
j = 1, 2, 3 … N.)
67
Computing the ith row of S
In O(NR) time, we can put upper and lower bounds
on S_ij for j = 1, 2 … N
Only need exact values of S_ij for the K largest
values within the row
(Chart: bars for S_ij with upper/lower bounds, plotted against
j = 1, 2, 3 … N.)
68
Computing the ith row of S
In O(NR) time, we can put upper and lower bounds
on S_ij for j = 1, 2 … N
Only need exact values of S_ij for the K largest
values within the row
Ignore j's that can't be the best
(Chart: bars for S_ij with upper/lower bounds, plotted against
j = 1, 2, 3 … N.)
69
Computing the ith row of S
In O(NR) time, we can put upper and lower bounds
on S_ij for j = 1, 2 … N
Only need exact values of S_ij for the K largest
values within the row
Ignore j's that can't be the best
Be exact for the rest: O(N) time each.
(Chart: bars for S_ij with upper/lower bounds, plotted against
j = 1, 2, 3 … N.)
70
Computing the ith row of S
In O(NR) time, we can put upper and lower bounds
on S_ij for j = 1, 2 … N
Only need exact values of S_ij for the K largest
values within the row
Ignore j's that can't be the best
Be exact for the rest: O(N) time each.
If there's enough pruning, total time is O(TN + RN²)
(Chart: bars for S_ij with upper/lower bounds, plotted against
j = 1, 2, 3 … N.)
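To show the shape of this pruning step, here is an illustrative reconstruction in Python of computing the top-K entries of one row of S with bounds — not the paper's exact bounds (which also involve the prescaling detail), just the generic bound-then-prune pattern under the assumption that all entries are nonnegative:

```python
import numpy as np

def top_k_row_of_S(a, r, i, K, a_big, r_big):
    """Sketch: the K largest entries of row i of S = a^T r, using bound-based pruning.

    a, r   : T x N nonnegative matrices
    a_big  : indexes of the R largest entries of column i of a (e.g. a_biggies[:, i])
    r_big  : length-N list/array; r_big[j] = indexes of the R largest entries of column j of r
    """
    T, N = a.shape
    a_col = a[:, i]
    a_rest_max = a_col[a_big].min()          # any unindexed entry of a_col is <= this
    lower = np.empty(N)
    upper = np.empty(N)
    for j in range(N):
        idx = np.union1d(a_big, r_big[j])    # rows where either factor is known to be large
        partial = a_col[idx] @ r[idx, j]     # exact contribution of those rows: a lower bound
        lower[j] = partial
        # crude upper bound on the remaining T - len(idx) terms
        upper[j] = partial + (T - len(idx)) * a_rest_max * r[r_big[j], j].min()
    threshold = np.sort(lower)[-K]           # K-th best lower bound in the row
    candidates = np.flatnonzero(upper >= threshold)   # j's that could still be in the top K
    exact = {int(j): float(a_col @ r[:, j]) for j in candidates}  # exact dot products
    return dict(sorted(exact.items(), key=lambda kv: -kv[1])[:K])
```

The bound loop costs O(NR) per row, matching the slide; only the surviving candidates pay for an exact dot product.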
71
Evaluation and Inference Speedup
Dataset: synthetic data with T = 2000 time steps
72
Parameter Learning Speedup
Dataset: synthetic data with T = 2000 time steps
73
Performance Experiments
  • DMC-friendly dataset
  • 2-D Gaussian 20-state DMC HMM with K = 5 (20,000
    train, 5,000 test)
  • Anti-DMC dataset
  • 2-D Gaussian 20-state regular HMM with steadily
    varying, well-distributed transition
    probabilities (20,000 train, 5,000 test)
  • Motionlogger dataset
  • Accelerometer data from two sensors worn over
    several days (10,000 train, 4,720 test)
  • Regular and DMC HMMs
  • 20 states
  • Small HMM
  • 5-state regular HMM
  • Uniform HMM
  • 20-state HMM with uniform transition
    probabilities

74
Learning Curves for DMC-friendly data
81
Learning Curves for Anti-DMC data
88
Learning Curves for Motionlogger data
95
Tradeoffs between N and K
  • We vary N and K while keeping the number of
    transition parameters (N × K) constant
  • Increasing N and decreasing K allows more states
    for modeling data features but fewer parameters
    per state for temporal structure

96
Tradeoffs between N and K
  • Average test-set log-likelihoods at convergence
  • Datasets:
  • A: DMC-friendly
  • B: Anti-DMC
  • C: Motionlogger
  • Each dataset has a different optimal N-vs-K
    tradeoff

97
Regularization with DMC HMMs
  • # of transition parameters in a regular 100-state
    HMM: 10,000
  • # of transition parameters in a DMC 100-state HMM
    with K = 5: 500
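The counts follow directly from the model sizes:

```latex
\underbrace{100 \times 100}_{\text{regular: } N^2} = 10{,}000
\qquad\qquad
\underbrace{100 \times 5}_{\text{DMC: } N \times K} = 500
```

(Each DMC row's shared constant c_i is then fixed by normalization, so the non-constant entries are the only transition parameters that need to be learned.)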

98
Conclusions
  • DMC HMMs are an important class of models that
    allow parameterized complexity-vs-efficiency
    tradeoffs in large state spaces
  • The speedup can be several orders of magnitude
  • Even for non-DMC domains, DMC HMMs yield higher
    scores than baseline models
  • The DMC HMM model can be applied to arbitrary
    state spaces and observation densities

99
Related Work
  • Felzenszwalb et al. (2003): fast HMM algorithms
    when transition probabilities can be expressed as
    distances in an underlying parameter space
  • Murphy and Paskin (2002): fast inference in
    hierarchical HMMs cast as DBNs
  • Salakhutdinov et al. (2003): combined EM and
    conjugate gradient for faster HMM learning when
    the amount of missing information is high
  • Beam Search: a widely used heuristic in word
    recognition for speech systems

100
Future Work
  • Investigate DMC HMMs as a regularization mechanism
  • Eliminate the R parameter using an automatic backoff
    evaluation approach
  • Devise ways to automatically set the K parameter, or
    have per-row K parameters

101
The End
102
(No Transcript)