Title: Active Learning for Hidden Markov Models
1. Active Learning for Hidden Markov Models
- Brigham Anderson, Andrew Moore
- brigham@cmu.edu, awm@cs.cmu.edu
- Computer Science
- Carnegie Mellon University
2. Outline
- Active Learning
- Hidden Markov Models
- Active Learning for Hidden Markov Models
3. Notation
- We Have
- Dataset, D
- Model parameter space, W
- Query algorithm, q
4. Dataset (D) Example
5. Notation
- We Have
- Dataset, D
- Model parameter space, W
- Query algorithm, q
6. Model Example
(Diagram: probabilistic classifier; class St generates observations Ot)
Notation:
- T: number of examples
- Ot: vector of features of example t
- St: class of example t
7. Model Example
- Patient state (St): St = DiseaseState
- Patient observations (Ot): Ot1 = Gender, Ot2 = Age, Ot3 = TestA, Ot4 = TestB, Ot5 = TestC
8. Possible Model Structures
9. Model Space
(Diagram: model with class St generating observations Ot; the model parameters are P(St) and P(Ot | St))
Generative model: must be able to compute P(St = i, Ot = ot | w)
10. Model Parameter Space (W)
- W: space of possible parameter values
- Prior on parameters, P(W)
- Posterior over models, P(W | D)
11. Notation
- We Have
- Dataset, D
- Model parameter space, W
- Query algorithm, q
q(W,D) returns t, the next sample to label
12. Game
- While not done:
  - Learn P(W | D)
  - q chooses the next example to label
  - Expert adds the label to D
13. Simulation
(Diagram: an HMM with hidden states S1 ... S7 emitting observations O1 ... O7; the query algorithm q picks which state to ask the expert to label)
14. Active Learning Flavors
- Pool
- (random access to patients)
- Sequential
- (must decide as patients walk in the door)
15. q?
- Recall: q(W,D) returns the most interesting unlabelled example.
- Well, what makes a doctor curious about a patient?
16. (1994)
17. Score Function (uncertainty sampling): score(t) = H(St)
18. Uncertainty Sampling Example
(Diagram: examples with one labelled FALSE)
19. Uncertainty Sampling Example
(Diagram: examples labelled FALSE and TRUE)
21. Uncertainty Sampling
- GOOD: couldn't be easier
- GOOD: often performs pretty well
- BAD: H(St) measures information gain about the samples, not the model
- BAD: sensitive to noisy samples
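As a minimal sketch of the idea (not code from the talk; the helper names and toy predictions are illustrative), uncertainty sampling just picks the unlabelled example whose predicted class distribution has the highest entropy:

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def uncertainty_sample(predictions):
    """Pick the unlabelled example t whose predicted class
    distribution P(St | Ot) has the highest entropy H(St)."""
    return max(range(len(predictions)), key=lambda t: entropy(predictions[t]))

# Three unlabelled examples; the 50/50 one is the most uncertain.
preds = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
print(uncertainty_sample(preds))  # -> 1
```

Note that the 50/50 example wins even when the model is sure the example is inherently noisy, which is exactly the objection above.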
22. Can we do better than uncertainty sampling?
23. (1992)
24. Strategy 2: Query by Committee
- Temporary assumptions:
  - Pool → Sequential
  - P(W | D) → Version Space
  - Probabilistic → Noiseless
- QBC attacks the size of the Version Space
25. (Diagram: Model 1 and Model 2 both classify the incoming example FALSE; no disagreement)
26. (Diagram: Model 1 and Model 2 both classify the incoming example TRUE; still no disagreement)
27. (Diagram: Model 1 says FALSE, Model 2 says TRUE)
Ooh, now we're going to learn something for sure! One of them is definitely wrong.
28. The Original QBC Algorithm
- As each example arrives:
  - Choose a committee, C (usually of size 2), randomly from the Version Space
  - Have each member of C classify it
  - If the committee disagrees, select it
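A sketch of that selection rule under the sequential, noiseless assumptions (committee members modeled as plain predict functions; the names and toy thresholds are mine):

```python
def qbc_select(stream, committee):
    """Original QBC: as each example arrives, query it iff the
    committee members' classifications disagree."""
    queried = []
    for x in stream:
        votes = {h(x) for h in committee}
        if len(votes) > 1:      # disagreement -> informative example
            queried.append(x)
    return queried

# Two hypotheses drawn from the version space: thresholds at 3 and 5.
h1 = lambda x: x > 3
h2 = lambda x: x > 5
print(qbc_select([1, 2, 3, 4, 5, 6, 7], [h1, h2]))  # -> [4, 5]
```

Only the examples falling between the two thresholds get queried: exactly the region where one of the two hypotheses must be wrong.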
29. (1992)
30. QBC: Choose Controversial Examples
- STOP! Doesn't model disagreement mean uncertainty? Why not use Uncertainty Sampling?
(Diagram: Version Space)
31.
- Remember our whiny objection to Uncertainty Sampling? H(St) measures information gain about the samples, not the model.
- BUT: if the source of the sample uncertainty is model uncertainty, then they are equivalent!
- Why? Symmetry of mutual information.
32. (1995)
33. Dagan-Engelson QBC
- For each example:
  - Choose a committee, C (usually of size 2), randomly from P(W | D)
  - Have each member of C classify it
  - Compute the Vote Entropy to measure disagreement
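The vote entropy itself is a one-liner; this sketch (names mine) treats each committee member's classification as a vote and takes the entropy of the empirical vote distribution:

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Dagan-Engelson disagreement measure: entropy (in bits) of the
    committee's empirical vote distribution over class labels."""
    n = len(votes)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(votes).values())

print(vote_entropy(["T", "T"]))  # -> 0.0 (no disagreement)
print(vote_entropy(["T", "F"]))  # -> 1.0 (maximal disagreement)
```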
34. How to Generate the Committee?
- This important point is not covered in the talk.
- Vague suggestions:
  - Good conjugate priors for the parameters
  - Importance sampling
35.
- OK, we could keep extending QBC, but let's cut to the chase.
36. (1992)
37. Model Entropy
(Diagrams: three posteriors P(W | D) plotted over W, ranging from H(W) high to H(W) = 0; lower is better)
38. Information Gain
- Choose the example that is expected to most reduce H(W)
- I.e., maximize H(W) - H(W | St)
39. Score Function: score(t) = H(W) - H(W | St)
40.
- We usually can't just sum over all models to get H(St | W)...
- ...but we can sample from P(W | D)
41. Conditional Model Entropy
42. Score Function
44. Amazing Entropy Fact
Symmetry of Mutual Information: MI(A;B) = H(A) - H(A|B) = H(B) - H(B|A)
45. Score Function
score(t) = H(St) - H(St | W). Familiar?
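This score can be estimated directly from a sampled committee. The sketch below (my names and toy inputs) computes H(St) from the committee-averaged prediction and subtracts the average per-member entropy, a Monte-Carlo version of H(St) - H(St | W):

```python
import math

def H(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def score_ig(committee_preds):
    """Information-gain score for one example: committee_preds[c] is
    P(St | w_c) for one model w_c sampled from P(W | D).  H(St) uses
    the committee-averaged prediction; H(St | W) averages entropies."""
    C = len(committee_preds)
    K = len(committee_preds[0])
    mean = [sum(p[k] for p in committee_preds) / C for k in range(K)]
    return H(mean) - sum(H(p) for p in committee_preds) / C

# Two confident but contradictory models score high...
print(score_ig([[1.0, 0.0], [0.0, 1.0]]))  # -> 1.0
# ...while two models that agree on 50/50 score zero: the
# uncertainty is in the sample, not the model.
print(score_ig([[0.5, 0.5], [0.5, 0.5]]))  # -> 0.0
```

When every member agrees, the score is zero even if each member is individually uncertain, so sample noise no longer triggers a query.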
46. Uncertainty Sampling vs. Information Gain
47.
- The information gain framework is cleaner than the QBC framework, and easy to build on.
- For instance, we don't need to restrict St to be the class variable.
48. Any Missing Feature is Fair Game
49. Outline
- Active Learning
- Hidden Markov Models
- Active Learning for Hidden Markov Models
50. HMMs
Model parameters: W = {p0, A, B}
(Diagram: HMM with hidden states S0 ... S3 emitting observations O0 ... O3)
51. HMM: Light Switch
- INPUT: binary stream of motion / no-motion
- OUTPUT: probability distribution over Absent, Meeting, Computer, and Other
- E.g., "There is an 86% chance that the user is in a meeting right now."
52. Light Switch HMM
53. Canonical HMM Tasks
- State Estimation: for each timestep today, what were the probabilities of each state? P(St | O1 O2 O3 ... OT, W)
- ML Path: given today's observations, what was the most likely path? S* = argmax_S P(S | O1 O2 O3 ... OT, W)
- ML Model Learning: given the last 30 days of data, what are the best model parameters? W* = argmax_W P(O1 O2 O3 ... OT | W)
(State estimation is computed with the Forward-Backward algorithm.)
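A compact Forward-Backward sketch for the state-estimation task (the algorithm is standard, but the parameter layout and the toy light-switch numbers are my own):

```python
import numpy as np

def forward_backward(obs, p0, A, B):
    """Smoothed posteriors gamma[t, i] = P(St = i | O1:T, W) for a
    discrete HMM with W = {p0, A, B}:
    A[i, j] = P(St+1 = j | St = i), B[i, o] = P(Ot = o | St = i)."""
    T, N = len(obs), len(p0)
    alpha = np.zeros((T, N))            # forward messages
    beta = np.ones((T, N))              # backward messages
    alpha[0] = p0 * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()      # normalise to avoid underflow
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy 2-state light-switch-style HMM over motion / no-motion.
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # sticky state transitions
B = np.array([[0.8, 0.2], [0.1, 0.9]])   # state-dependent motion rates
p0 = np.array([0.5, 0.5])
gamma = forward_backward([0, 0, 1, 1], p0, A, B)
print(gamma.shape)  # -> (4, 2); each row is a distribution over states
```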
54. HMM: Light Switch
55. Outline
- Active Learning
- Hidden Markov Models
- Active Learning for Hidden Markov Models
56. Active Learning!
- "Good morning, sir! Here's the video footage of yesterday. Could you just go through it and label each frame?"
- "Good morning, sir! Can you tell me what you are doing in this frame of video?"
57. HMMs and Active Learning
(Diagram: an HMM over a binary observation stream 1, 0, 0, 1, 1, 0, 1 with hidden states S1 ... S7; q queries individual timesteps)
58.
- Note: the dependencies between states do not affect the basic algorithm!
- The only change is in how we compute P(St | O1:T): we have to use Forward-Backward.
59. HMM Active Learning
- Choose a committee, C, randomly from P(W | D)
- Run Forward-Backward for each member of C
- For each timestep, compute H(St) - H(St | C)
- Done!
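The per-timestep scoring can be sketched as follows (helper names mine; hand-written committee posteriors stand in for real Forward-Backward output):

```python
import math

def H(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def timestep_scores(committee_gammas):
    """committee_gammas[c][t] = P(St | O1:T, w_c), i.e. one
    Forward-Backward run per committee member.  Returns, for each
    timestep t, the disagreement score H(St) - H(St | C)."""
    C = len(committee_gammas)
    T = len(committee_gammas[0])
    scores = []
    for t in range(T):
        preds = [g[t] for g in committee_gammas]
        mean = [sum(p[k] for p in preds) / C for k in range(len(preds[0]))]
        scores.append(H(mean) - sum(H(p) for p in preds) / C)
    return scores

# Two committee members agree at t=0 and clash at t=1:
g1 = [[0.9, 0.1], [1.0, 0.0]]
g2 = [[0.9, 0.1], [0.0, 1.0]]
scores = timestep_scores([g1, g2])
print(max(range(len(scores)), key=scores.__getitem__))  # -> 1: query t=1
```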
60. Actively Selecting Excerpts
"Good morning, sir! I'm still trying to learn your HMM. Could you please label the following scene from yesterday..."
61.
- Finding the optimal scene is useful for:
  - Selecting scenes from video
  - Selecting utterances from audio
  - Selecting excerpts from text
  - Selecting sequences from DNA
62. Which sequence should I get labelled? There are O(T^2) of them!
63. Excerpt Selection
- Let's maximize H(S) - H(S | C)
- Trick question: which subsequence maximizes H(S) - H(S | C)? (The whole sequence, of course.)
64. Sequence Selection
We have to include the cost incurred when we force an expert to sit down and label 1000 examples.
65. What is the Entropy of a Sequence?
66. Amazing Entropy Fact
The Chain Rule: H(A,B,C,D) = H(A) + H(B|A) + H(C|A,B) + H(D|A,B,C)
67. (Diagram: Markov chain A → B → C → D)
For a Markov chain: H(A,B,C,D) = H(A) + H(B|A) + H(C|B) + H(D|C)
68. Entropy of a Sequence
We still get the components of these expressions, P(St = i | O1:T) and P(St+1 = i | St = j, O1:T), from a Forward-Backward run.
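Assuming those two quantities have been extracted into plain lists (the layout and names are mine), the chain-rule entropy of an excerpt is:

```python
import math

def H(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def seq_entropy(marginals, conditionals):
    """H(Sa, ..., Sb) by the Markov chain rule,
    H(Sa) + sum_t H(St+1 | St), where
    marginals[t][j]       = P(St = j | O1:T)           and
    conditionals[t][j][i] = P(St+1 = i | St = j, O1:T)."""
    h = H(marginals[0])
    for t in range(len(conditionals)):
        h += sum(marginals[t][j] * H(conditionals[t][j])
                 for j in range(len(marginals[t])))
    return h

# Two timesteps: the first state is a fair coin and the second copies
# it exactly, so the whole sequence carries exactly one bit.
m = [[0.5, 0.5], [0.5, 0.5]]
c = [[[1.0, 0.0], [0.0, 1.0]]]
print(seq_entropy(m, c))  # -> 1.0
```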
69. Score of a Sequence
70. Finding the Best Excerpt of Length k
71. Find Best Sequence of Length k
- Draw a committee C from P(W | D)
- Run Forward-Backward for each c
- Scan the entire sequence using scoreseqIG(S) with k = 5: O(T)!
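The O(T) scan can be sketched as a sliding window, under the simplifying assumption that the sequence score decomposes into additive per-timestep scores (a simplification of scoreseqIG; names mine):

```python
def best_window(scores, k):
    """O(T) sliding-window scan: returns the start index of the
    length-k window with the largest total per-timestep score."""
    run = sum(scores[:k])
    best, best_a = run, 0
    for a in range(1, len(scores) - k + 1):
        run += scores[a + k - 1] - scores[a - 1]  # slide the window by 1
        if run > best:
            best, best_a = run, a
    return best_a

# Per-timestep H(St) - H(St|C) scores for a 6-step sequence:
print(best_window([0.0, 0.1, 0.9, 0.8, 0.1, 0.0], 2))  # -> 2
```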
72. Find Best Excerpt of Any Length
73. Find Best Sequence of Any Length
- Score all possible intervals
- Pick the best one
Hmm... that's O(T^2). We could cleverly cache some of the computation as we go, but we're still going to be O(T^2).
74. Similar Problem
Find the interval that has the largest integral of f(t).
(Note: this was a Google interview question!)
75. Similar Problem
Can be done using Dynamic Programming in O(T)!
76.
- [a, b]: best interval so far
- atemp: start of the best interval ending at t
- sum(a, b), sum(atemp, t): running interval sums
Rules:
- if sum(atemp, t-1) + y(t) < 0, then atemp ← t
- if sum(atemp, t) > sum(a, b), then [a, b] ← [atemp, t]
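These rules are the maximum-subarray (Kadane-style) dynamic program. A sketch (my formulation resets the running sum when it goes negative, an equivalent variant that finds the same maximum):

```python
def best_interval(y):
    """Single O(T) pass returning (a, b, total) for the contiguous
    interval of y with the largest sum.  'atemp' tracks the start of
    the best interval ending at the current position t."""
    best_a = best_b = 0
    best = y[0]
    atemp, run = 0, 0
    for t, v in enumerate(y):
        if run < 0:            # running sum went negative: restart at t
            run, atemp = 0, t
        run += v
        if run > best:         # new best interval [atemp, t]
            best, best_a, best_b = run, atemp, t
    return best_a, best_b, best

print(best_interval([1, -2, 3, 4, -1]))  # -> (2, 3, 7)
```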
77. Find Best Sequence of Any Length
- Draw a committee C from P(W | D)
- Run Forward-Backward for each c
- Find the best-scoring interval using DP
78. Not Just HMMs
- The max-MI excerpt method can be applied to any sequential process with the Markov property
- E.g., Kalman filters
79. Aside: Active Diagnosis
- What if we're not trying to learn a model?
- What if we have a good model already, and we just want to learn the most about the sequence itself?
- E.g., an HMM is trying to translate a news broadcast. It doesn't want to learn the model; it just wants the best transcription possible.
80. ...We can use the same DP trick to find the optimal subsequence, too.
81. Conclusion
- Uncertainty sampling is sometimes correct
- QBC is an approximation to Information Gain
- Finding the most informative subsequence of a Markov time series is O(T)
83. Light Switch HMM