Title: Active Learning for Hidden Markov Models
1. Active Learning for Hidden Markov Models
- Brigham Anderson, Andrew Moore
- brigham@cmu.edu, awm@cs.cmu.edu
- Computer Science
- Carnegie Mellon University
2. Outline
- Active Learning
- Hidden Markov Models
- Active Learning for Hidden Markov Models
3. Notation
- We Have
- Dataset, D
- Model parameter space, W
- Query algorithm, q
4. Dataset (D) Example
5. Notation
- We Have
- Dataset, D
- Model parameter space, W
- Query algorithm, q
6. Model Example
(Diagram: probabilistic classifier; class St generates observations Ot)
Notation:
- T: number of examples
- Ot: vector of features of example t
- St: class of example t
7. Model Example
- Patient state (St): St = DiseaseState
- Patient observations (Ot): Ot1 = Gender, Ot2 = Age, Ot3 = TestA, Ot4 = TestB, Ot5 = TestC
8. Possible Model Structures
9. Model Space
(Diagram: model with class St generating observations Ot; the model parameters are P(St) and P(Ot | St))
Generative model: must be able to compute P(St = i, Ot = ot | w)
10. Model Parameter Space (W)
- W: space of possible parameter values
- Prior on parameters, P(W)
- Posterior over models, P(W | D)
11. Notation
- We Have
- Dataset, D
- Model parameter space, W
- Query algorithm, q
q(W,D) returns t, the next sample to label
12. Game
- While not done:
  - Learn P(W | D)
  - q chooses the next example to label
  - Expert adds the label to D
13. Simulation
(Diagram: an HMM with hidden states S1 ... S7 emitting observations O1 ... O7; the query algorithm q picks which state to ask the expert to label)
14. Active Learning Flavors
- Pool
- (random access to patients)
- Sequential
- (must decide as patients walk in the door)
15. q?
- Recall: q(W,D) returns the most interesting unlabelled example.
- Well, what makes a doctor curious about a patient?
16. (1994)
17. Score Function (uncertainty sampling): score(t) = H(St)
18. Uncertainty Sampling Example
(Diagram: examples with one labelled FALSE)
19. Uncertainty Sampling Example
(Diagram: examples labelled FALSE and TRUE)
21. Uncertainty Sampling
- GOOD: couldn't be easier
- GOOD: often performs pretty well
- BAD: H(St) measures information gain about the samples, not the model
- BAD: sensitive to noisy samples
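As a minimal sketch of the idea (not code from the talk; the helper names and toy predictions are illustrative), uncertainty sampling just picks the unlabelled example whose predicted class distribution has the highest entropy:

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def uncertainty_sample(predictions):
    """Pick the unlabelled example t whose predicted class
    distribution P(St | Ot) has the highest entropy H(St)."""
    return max(range(len(predictions)), key=lambda t: entropy(predictions[t]))

# Three unlabelled examples; the 50/50 one is the most uncertain.
preds = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
print(uncertainty_sample(preds))  # -> 1
```

Note that the 50/50 example wins even when the model is sure the example is inherently noisy, which is exactly the objection above.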
22. Can we do better than uncertainty sampling?
23. (1992)
24. Strategy 2: Query by Committee
- Temporary assumptions:
  - Pool → Sequential
  - P(W | D) → Version Space
  - Probabilistic → Noiseless
- QBC attacks the size of the Version Space
25. (Diagram: Model 1 and Model 2 both classify the incoming example FALSE; no disagreement)
26. (Diagram: Model 1 and Model 2 both classify the incoming example TRUE; still no disagreement)
27. (Diagram: Model 1 says FALSE, Model 2 says TRUE)
Ooh, now we're going to learn something for sure! One of them is definitely wrong.
28. The Original QBC Algorithm
- As each example arrives:
  - Choose a committee, C (usually of size 2), randomly from the Version Space
  - Have each member of C classify it
  - If the committee disagrees, select it
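A sketch of that selection rule under the sequential, noiseless assumptions (committee members modeled as plain predict functions; the names and toy thresholds are mine):

```python
def qbc_select(stream, committee):
    """Original QBC: as each example arrives, query it iff the
    committee members' classifications disagree."""
    queried = []
    for x in stream:
        votes = {h(x) for h in committee}
        if len(votes) > 1:      # disagreement -> informative example
            queried.append(x)
    return queried

# Two hypotheses drawn from the version space: thresholds at 3 and 5.
h1 = lambda x: x > 3
h2 = lambda x: x > 5
print(qbc_select([1, 2, 3, 4, 5, 6, 7], [h1, h2]))  # -> [4, 5]
```

Only the examples falling between the two thresholds get queried: exactly the region where one of the two hypotheses must be wrong.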
29. (1992)
30. QBC: Choose Controversial Examples
- STOP! Doesn't model disagreement mean uncertainty? Why not use Uncertainty Sampling?
(Diagram: Version Space)
31.
- Remember our whiny objection to Uncertainty Sampling? H(St) measures information gain about the samples, not the model.
- BUT: if the source of the sample uncertainty is model uncertainty, then they are equivalent!
- Why? Symmetry of mutual information.
32. (1995)
33. Dagan-Engelson QBC
- For each example:
  - Choose a committee, C (usually of size 2), randomly from P(W | D)
  - Have each member of C classify it
  - Compute the Vote Entropy to measure disagreement
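The vote entropy itself is a one-liner; this sketch (names mine) treats each committee member's classification as a vote and takes the entropy of the empirical vote distribution:

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Dagan-Engelson disagreement measure: entropy (in bits) of the
    committee's empirical vote distribution over class labels."""
    n = len(votes)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(votes).values())

print(vote_entropy(["T", "T"]))  # -> 0.0 (no disagreement)
print(vote_entropy(["T", "F"]))  # -> 1.0 (maximal disagreement)
```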
34. How to Generate the Committee?
- This important point is not covered in the talk.
- Vague suggestions:
  - Good conjugate priors for the parameters
  - Importance sampling
35.
- OK, we could keep extending QBC, but let's cut to the chase.
36. (1992)
37. Model Entropy
(Diagrams: three posteriors P(W | D) plotted over W, ranging from H(W) high to H(W) = 0; lower is better)
38. Information Gain
- Choose the example that is expected to most reduce H(W)
- I.e., maximize H(W) - H(W | St)
39. Score Function: score(t) = H(W) - H(W | St)
40.
- We usually can't just sum over all models to get H(St | W)...
- ...but we can sample from P(W | D)
41. Conditional Model Entropy
42. Score Function
44. Amazing Entropy Fact
Symmetry of Mutual Information: MI(A;B) = H(A) - H(A|B) = H(B) - H(B|A)
45. Score Function
score(t) = H(St) - H(St | W). Familiar?
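This score can be estimated directly from a sampled committee. The sketch below (my names and toy inputs) computes H(St) from the committee-averaged prediction and subtracts the average per-member entropy, a Monte-Carlo version of H(St) - H(St | W):

```python
import math

def H(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def score_ig(committee_preds):
    """Information-gain score for one example: committee_preds[c] is
    P(St | w_c) for one model w_c sampled from P(W | D).  H(St) uses
    the committee-averaged prediction; H(St | W) averages entropies."""
    C = len(committee_preds)
    K = len(committee_preds[0])
    mean = [sum(p[k] for p in committee_preds) / C for k in range(K)]
    return H(mean) - sum(H(p) for p in committee_preds) / C

# Two confident but contradictory models score high...
print(score_ig([[1.0, 0.0], [0.0, 1.0]]))  # -> 1.0
# ...while two models that agree on 50/50 score zero: the
# uncertainty is in the sample, not the model.
print(score_ig([[0.5, 0.5], [0.5, 0.5]]))  # -> 0.0
```

When every member agrees, the score is zero even if each member is individually uncertain, so sample noise no longer triggers a query.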
46. Uncertainty Sampling vs. Information Gain
47.
- The information gain framework is cleaner than the QBC framework, and easy to build on.
- For instance, we don't need to restrict St to be the class variable.
48. Any Missing Feature is Fair Game
49. Outline
- Active Learning
- Hidden Markov Models
- Active Learning for Hidden Markov Models
50. HMMs
Model parameters: W = {p0, A, B}
(Diagram: HMM with hidden states S0 ... S3 emitting observations O0 ... O3)
51. HMM: Light Switch
- INPUT: binary stream of motion / no-motion
- OUTPUT: probability distribution over Absent, Meeting, Computer, and Other
- E.g., "There is an 86% chance that the user is in a meeting right now."
52. Light Switch HMM
53. Canonical HMM Tasks
- State Estimation: for each timestep today, what were the probabilities of each state? P(St | O1 O2 O3 ... OT, W)
- ML Path: given today's observations, what was the most likely path? S* = argmax_S P(S | O1 O2 O3 ... OT, W)
- ML Model Learning: given the last 30 days of data, what are the best model parameters? W* = argmax_W P(O1 O2 O3 ... OT | W)
(State estimation is computed with the Forward-Backward algorithm.)
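A compact Forward-Backward sketch for the state-estimation task (the algorithm is standard, but the parameter layout and the toy light-switch numbers are my own):

```python
import numpy as np

def forward_backward(obs, p0, A, B):
    """Smoothed posteriors gamma[t, i] = P(St = i | O1:T, W) for a
    discrete HMM with W = {p0, A, B}:
    A[i, j] = P(St+1 = j | St = i), B[i, o] = P(Ot = o | St = i)."""
    T, N = len(obs), len(p0)
    alpha = np.zeros((T, N))            # forward messages
    beta = np.ones((T, N))              # backward messages
    alpha[0] = p0 * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()      # normalise to avoid underflow
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy 2-state light-switch-style HMM over motion / no-motion.
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # sticky state transitions
B = np.array([[0.8, 0.2], [0.1, 0.9]])   # state-dependent motion rates
p0 = np.array([0.5, 0.5])
gamma = forward_backward([0, 0, 1, 1], p0, A, B)
print(gamma.shape)  # -> (4, 2); each row is a distribution over states
```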
54. HMM: Light Switch
55. Outline
- Active Learning
- Hidden Markov Models
- Active Learning for Hidden Markov Models
56. Active Learning!
- "Good morning, sir! Here's the video footage of yesterday. Could you just go through it and label each frame?"
- "Good morning, sir! Can you tell me what you are doing in this frame of video?"
57. HMMs and Active Learning
(Diagram: an HMM over a binary observation stream 1, 0, 0, 1, 1, 0, 1 with hidden states S1 ... S7; q queries individual timesteps)
58.
- Note: the dependencies between states do not affect the basic algorithm!
- The only change is in how we compute P(St | O1:T): we have to use Forward-Backward.
59. HMM Active Learning
- Choose a committee, C, randomly from P(W | D)
- Run Forward-Backward for each member of C
- For each timestep, compute H(St) - H(St | C)
- Done!
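The per-timestep scoring can be sketched as follows (helper names mine; hand-written committee posteriors stand in for real Forward-Backward output):

```python
import math

def H(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def timestep_scores(committee_gammas):
    """committee_gammas[c][t] = P(St | O1:T, w_c), i.e. one
    Forward-Backward run per committee member.  Returns, for each
    timestep t, the disagreement score H(St) - H(St | C)."""
    C = len(committee_gammas)
    T = len(committee_gammas[0])
    scores = []
    for t in range(T):
        preds = [g[t] for g in committee_gammas]
        mean = [sum(p[k] for p in preds) / C for k in range(len(preds[0]))]
        scores.append(H(mean) - sum(H(p) for p in preds) / C)
    return scores

# Two committee members agree at t=0 and clash at t=1:
g1 = [[0.9, 0.1], [1.0, 0.0]]
g2 = [[0.9, 0.1], [0.0, 1.0]]
scores = timestep_scores([g1, g2])
print(max(range(len(scores)), key=scores.__getitem__))  # -> 1: query t=1
```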
60. Actively Selecting Excerpts
"Good morning, sir! I'm still trying to learn your HMM. Could you please label the following scene from yesterday..."
61.
- Finding the optimal scene is useful for:
  - Selecting scenes from video
  - Selecting utterances from audio
  - Selecting excerpts from text
  - Selecting sequences from DNA
62. Which sequence should I get labelled? There are O(T^2) of them!
63. Excerpt Selection
- Let's maximize H(S) - H(S | C)
- Trick question: which subsequence maximizes H(S) - H(S | C)? (The whole sequence, of course.)
64. Sequence Selection
We have to include the cost incurred when we force an expert to sit down and label 1000 examples.
65. What is the Entropy of a Sequence?
66. Amazing Entropy Fact
The Chain Rule: H(A,B,C,D) = H(A) + H(B|A) + H(C|A,B) + H(D|A,B,C)
67. (Diagram: Markov chain A → B → C → D)
For a Markov chain: H(A,B,C,D) = H(A) + H(B|A) + H(C|B) + H(D|C)
68. Entropy of a Sequence
We still get the components of these expressions, P(St = i | O1:T) and P(St+1 = i | St = j, O1:T), from a Forward-Backward run.
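Assuming those two quantities have been extracted into plain lists (the layout and names are mine), the chain-rule entropy of an excerpt is:

```python
import math

def H(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def seq_entropy(marginals, conditionals):
    """H(Sa, ..., Sb) by the Markov chain rule,
    H(Sa) + sum_t H(St+1 | St), where
    marginals[t][j]       = P(St = j | O1:T)           and
    conditionals[t][j][i] = P(St+1 = i | St = j, O1:T)."""
    h = H(marginals[0])
    for t in range(len(conditionals)):
        h += sum(marginals[t][j] * H(conditionals[t][j])
                 for j in range(len(marginals[t])))
    return h

# Two timesteps: the first state is a fair coin and the second copies
# it exactly, so the whole sequence carries exactly one bit.
m = [[0.5, 0.5], [0.5, 0.5]]
c = [[[1.0, 0.0], [0.0, 1.0]]]
print(seq_entropy(m, c))  # -> 1.0
```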
69. Score of a Sequence
70. Finding the Best Excerpt of Length k
71. Find Best Sequence of Length k
- Draw a committee C from P(W | D)
- Run Forward-Backward for each c
- Scan the entire sequence using scoreseqIG(S) with k = 5: O(T)!
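The O(T) scan can be sketched as a sliding window, under the simplifying assumption that the sequence score decomposes into additive per-timestep scores (a simplification of scoreseqIG; names mine):

```python
def best_window(scores, k):
    """O(T) sliding-window scan: returns the start index of the
    length-k window with the largest total per-timestep score."""
    run = sum(scores[:k])
    best, best_a = run, 0
    for a in range(1, len(scores) - k + 1):
        run += scores[a + k - 1] - scores[a - 1]  # slide the window by 1
        if run > best:
            best, best_a = run, a
    return best_a

# Per-timestep H(St) - H(St|C) scores for a 6-step sequence:
print(best_window([0.0, 0.1, 0.9, 0.8, 0.1, 0.0], 2))  # -> 2
```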
72. Find Best Excerpt of Any Length
73. Find Best Sequence of Any Length
- Score all possible intervals
- Pick the best one
Hmm... that's O(T^2). We could cleverly cache some of the computation as we go, but we're still going to be O(T^2).
74. Similar Problem
Find the interval that has the largest integral of f(t).
(Note: this was a Google interview question!)
75. Similar Problem
Can be done using Dynamic Programming in O(T)!
76.
- [a, b]: best interval so far
- atemp: start of the best interval ending at t
- sum(a, b), sum(atemp, t): running interval sums
Rules:
- if sum(atemp, t-1) + y(t) < 0, then atemp ← t
- if sum(atemp, t) > sum(a, b), then [a, b] ← [atemp, t]
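These rules are the maximum-subarray (Kadane-style) dynamic program. A sketch (my formulation resets the running sum when it goes negative, an equivalent variant that finds the same maximum):

```python
def best_interval(y):
    """Single O(T) pass returning (a, b, total) for the contiguous
    interval of y with the largest sum.  'atemp' tracks the start of
    the best interval ending at the current position t."""
    best_a = best_b = 0
    best = y[0]
    atemp, run = 0, 0
    for t, v in enumerate(y):
        if run < 0:            # running sum went negative: restart at t
            run, atemp = 0, t
        run += v
        if run > best:         # new best interval [atemp, t]
            best, best_a, best_b = run, atemp, t
    return best_a, best_b, best

print(best_interval([1, -2, 3, 4, -1]))  # -> (2, 3, 7)
```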
77. Find Best Sequence of Any Length
- Draw a committee C from P(W | D)
- Run Forward-Backward for each c
- Find the best-scoring interval using DP
78. Not Just HMMs
- The max-MI excerpt method can be applied to any sequential process with the Markov property
- E.g., Kalman filters
79. Aside: Active Diagnosis
- What if we're not trying to learn a model?
- What if we have a good model already, and we just want to learn the most about the sequence itself?
- E.g., an HMM is trying to translate a news broadcast. It doesn't want to learn the model; it just wants the best transcription possible.
80. ...We can use the same DP trick to find the optimal subsequence, too.
81. Conclusion
- Uncertainty sampling is sometimes correct
- QBC is an approximation to Information Gain
- Finding the most informative subsequence of a Markov time series is O(T)
83. Light Switch HMM