1. Homotopy-based Semi-Supervised Hidden Markov Models for Sequence Labeling
- Gholamreza Haffari, Anoop Sarkar
- Presenter: Milan Tofiloski
- Natural Language Lab
- Simon Fraser University
2. Outline
- Motivation and Contributions
- Experiments
- Homotopy method
- More experiments
3. Maximum Likelihood Principle
- Find the parameter setting of the joint input-output model that maximizes the probability of the given data (see the reconstruction below)
- L: labeled data
- U: unlabeled data
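The slide's formula did not survive this transcript; a standard statement of the semi-supervised maximum-likelihood objective it describes, with L the labeled set and U the unlabeled set, is:

\[
\theta^{*} = \arg\max_{\theta} \; \sum_{(x,y)\in L} \log P(x,y \mid \theta) \;+\; \sum_{x\in U} \log \sum_{y} P(x,y \mid \theta)
\]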
4. Deficiency of MLE
- When U is much larger than L, MLE focuses on modeling the input distribution P(x)
- This means the input-output relationship is ignored when estimating the parameters!
- But we are interested in modeling the joint distribution P(x,y)
5. Remedy for the Deficiency
- Balance the effect of the labeled and unlabeled data with a weight λ
- Find the λ that maximally takes advantage of the labeled and unlabeled data; MLE corresponds to one fixed weighting (see the reconstruction below)
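The weighted objective is likewise missing from the transcript; a common form, with λ trading off the two terms (plain MLE then corresponds to one fixed weighting determined by the relative sizes of L and U), is:

\[
\theta_{\lambda} = \arg\max_{\theta} \; (1-\lambda) \sum_{(x,y)\in L} \log P(x,y \mid \theta) \;+\; \lambda \sum_{x\in U} \log \sum_{y} P(x,y \mid \theta)
\]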
6. An Experiment with HMM
[Plot: MLE performance; lower is better]
- MLE can hurt performance
- Balancing the labeled- and unlabeled-data terms is beneficial
7. Our Contributions
- A principled way to choose λ for HMMs in sequence labeling (tagging) tasks
- An efficient dynamic programming algorithm to compute second-order statistics in HMMs
8. Outline
- Motivation and Contributions
- Experiments
- Homotopy method
- More experiments
9. Task
- Field segmentation in information extraction
- 13 field tags: AUTHOR, TITLE, ...
10. Experimental Setup
- Use an HMM with 13 states
- Freeze the transition (state→state) probabilities to those observed in the labeled data
- Use the homotopy method to learn only the emission (state→alphabet) probabilities (see the sketch after this list)
- Apply add-ε smoothing to the initial values of the emission and transition probabilities
- Data statistics:
  - Average sequence length: 36.7
  - Average number of segments per sequence: 5.4
  - Size of labeled/unlabeled data: 300/700
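A minimal sketch of this initialization, assuming simple count matrices taken from the labeled data; the function name, array names, and the exact smoothing constant are illustrative, not from the paper:

```python
import numpy as np

def init_params(trans_counts, emit_counts, eps=0.1):
    # Add-eps smoothing of counts observed in the labeled data.
    # trans_counts: K x K state->state counts (frozen after this step);
    # emit_counts:  K x V state->alphabet counts (re-estimated by homotopy).
    A = trans_counts + eps
    A = A / A.sum(axis=1, keepdims=True)
    B = emit_counts + eps
    B = B / B.sum(axis=1, keepdims=True)
    return A, B
```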
11. Baselines
- Held-out: set aside part of the labeled data as a held-out set, and use it to choose λ
- Oracle: choose λ based on the test data, using per-position accuracy
- Supervised: ignore the unlabeled data and use only the labeled data
12. Homotopy vs. Baselines
[Plot: performance of homotopy vs. baselines; higher is better]
- Even very small values of λ can be useful: homotopy picks λ ≈ 0.004, while supervised corresponds to λ = 0
- Decoding: sequence of most probable states (SMS; see slide 30)
- See the paper for more results
13. Outline
- Motivation and Contributions
- Experiments
- Homotopy method
- More experiments
14. Path of Solutions
- Look at the solution θ_λ as λ changes from 0 to 1
- Choose the best λ based on the path
15. EM_λ for HMM
- Let each event be a state→state or state→observation event in our HMM
- EM_λ finds the parameter values θ that (locally) maximize the objective function for a fixed λ
- Repeat the EM_λ update until convergence (a sketch follows)
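A minimal sketch of one EM_λ iteration under the setup of slide 10 (transitions frozen, emissions re-estimated), mixing observed labeled counts weighted by 1−λ with expected unlabeled counts weighted by λ. The brute-force E-step over all state sequences is for illustration only; in practice the expected counts come from the standard forward-backward recursions:

```python
import itertools
import numpy as np

def seq_prob(x, s, pi, A, B):
    # Joint probability P(x, s) of observations x and state sequence s.
    p = pi[s[0]] * B[s[0], x[0]]
    for t in range(1, len(x)):
        p *= A[s[t-1], s[t]] * B[s[t], x[t]]
    return p

def em_lambda_step(B, pi, A, labeled, unlabeled, lam):
    # One EM_lambda update of the emission matrix B (K states x V symbols).
    K, V = B.shape
    counts = np.zeros((K, V))
    # Observed emission counts from labeled pairs (x, y), weight 1 - lam.
    for x, y in labeled:
        for xi, yi in zip(x, y):
            counts[yi, xi] += 1 - lam
    # Expected emission counts from unlabeled sequences x, weight lam.
    for x in unlabeled:
        paths = list(itertools.product(range(K), repeat=len(x)))
        probs = np.array([seq_prob(x, s, pi, A, B) for s in paths])
        post = probs / probs.sum()            # posterior P(s | x)
        for s, w in zip(paths, post):
            for t, xi in enumerate(x):
                counts[s[t], xi] += lam * w
    return counts / counts.sum(axis=1, keepdims=True)
```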
16. Fixed Points of EM_λ
- Useful fact: at the fixed points, θ = EM_λ(θ) (reconstructed below)
- This makes tracing the fixed points similar to using homotopy for root finding
- The same numerical techniques should be applicable here
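The fixed-point equation is missing from the transcript; the natural reconstruction, which casts the problem as root finding, is:

\[
\theta = \mathrm{EM}_{\lambda}(\theta)
\quad\Longleftrightarrow\quad
G(\theta, \lambda) := \mathrm{EM}_{\lambda}(\theta) - \theta = 0
\]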
17. Homotopy for Root Finding
- To find a root of G(θ):
  - start from a root of a simple problem F(θ)
  - trace the roots of the intermediate problems while morphing F into G
- To find the θ that satisfy the above, setting the derivative to zero gives a differential equation (reconstructed below)
- Numerically solve the resulting differential equation
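The slide's equations are lost; the standard convex homotopy between F and G, and the differential equation obtained by requiring H = 0 along the path, are (plausibly with F and G the fixed-point residuals of EM at λ = 0 and λ = 1):

\[
H(\theta, \lambda) = (1-\lambda)\,F(\theta) + \lambda\,G(\theta) = 0
\]

\[
\frac{d}{d\lambda} H(\theta(\lambda), \lambda)
= \frac{\partial H}{\partial \theta}\,\frac{d\theta}{d\lambda}
+ \frac{\partial H}{\partial \lambda} = 0
\]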
18. Solving the Differential Equation
(The matrix M involves the Jacobian of EM_1; see the next slide.)
- Repeat until λ reaches 1:
  - Update (θ, λ) in a proper direction parallel to v ∈ Kernel(M)
  - Update M
- (A path-tracing sketch follows.)
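A minimal sketch of this kernel-following loop, assuming a helper M_fn that returns the matrix M at the current point z = (θ, λ) and a fixed step size (both hypothetical; the paper's exact update rule may differ):

```python
import numpy as np

def trace_path(z0, M_fn, step=1e-3):
    # Follow the solution path z = (theta, lambda) by stepping along
    # the kernel (null space) of M(z) until lambda reaches 1.
    z, prev_v = np.array(z0, dtype=float), None
    while z[-1] < 1.0:                    # last coordinate of z is lambda
        M = M_fn(z)
        _, _, Vt = np.linalg.svd(M)
        v = Vt[-1]                        # right singular vector of the smallest
                                          # singular value spans Kernel(M)
        if prev_v is None:
            if v[-1] < 0:                 # start by increasing lambda
                v = -v
        elif v @ prev_v < 0:              # keep a consistent direction
            v = -v
        z = z + step * v                  # update (theta, lambda) ...
        prev_v = v                        # ... then M is recomputed next round
    return z
```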
19. Jacobian of EM_1
- See the paper for details
- We need to compute the covariance matrix of the events
- The entry in row i and column j of the covariance matrix is given below
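The slide's formula is lost; the standard (i, j) entry of a covariance matrix of event counts c_i, with expectations taken under the model on the unlabeled data, is:

\[
\Sigma_{ij} = E[\,c_i\,c_j\,] - E[\,c_i\,]\,E[\,c_j\,]
\]

The E[c_i c_j] terms are exactly the expected quadratic counts computed on the next slide.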
20. Expected Quadratic Counts for HMM
- A dynamic programming algorithm efficiently computes the expected quadratic counts (EQC); see the sketch after this list
- Pre-compute a table Z_x for each sequence
- Given the table Z_x, the EQC can be computed efficiently
- The time complexity depends on K, the number of states in the HMM (see the paper for the exact bound and more details)
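The paper's table-based dynamic program is not reproduced in this transcript; as a reference for what is being computed, the brute-force sketch below evaluates E[c c^T] for the transition events of a tiny HMM. It is exponential in the sequence length, so it serves only as a toy check of the quantity:

```python
import itertools
import numpy as np

def expected_quadratic_counts(x, pi, A, B):
    # E[c c^T], where c holds the count of each state->state event in a
    # state sequence drawn from the posterior P(s | x); emission events
    # are handled analogously.
    K = A.shape[0]

    def seq_prob(s):                      # joint P(x, s)
        p = pi[s[0]] * B[s[0], x[0]]
        for t in range(1, len(x)):
            p *= A[s[t-1], s[t]] * B[s[t], x[t]]
        return p

    paths = list(itertools.product(range(K), repeat=len(x)))
    probs = np.array([seq_prob(s) for s in paths])
    post = probs / probs.sum()            # posterior P(s | x)
    eqc = np.zeros((K * K, K * K))
    for s, w in zip(paths, post):
        c = np.zeros(K * K)
        for t in range(1, len(x)):
            c[s[t-1] * K + s[t]] += 1     # count of transition event (i, j)
        eqc += w * np.outer(c, c)
    return eqc   # covariance = eqc - outer(E[c], E[c])
```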
21. How to Choose λ Based on the Path
- monotone: the first point at which the monotonicity of θ_λ changes
- maxEnt: choose the λ for which the model has maximum entropy on the unlabeled data
- minEig: when solving the differential equation, track the minimum singular value of the matrix M; across rounds, choose the λ for which this minimum singular value is smallest
- (A selection sketch follows.)
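A minimal sketch of the three selection rules over a traced path, assuming the per-round entropies and minimum singular values have already been recorded (the argument names are illustrative):

```python
import numpy as np

def pick_lambda(lams, thetas, entropies, min_svals, rule):
    # lams: lambda values along the path; thetas: parameter vectors;
    # entropies: model entropy on the unlabeled data per round;
    # min_svals: minimum singular value of M per round.
    if rule == "maxEnt":
        return lams[int(np.argmax(entropies))]
    if rule == "minEig":
        return lams[int(np.argmin(min_svals))]
    if rule == "monotone":
        d = np.diff(np.asarray(thetas), axis=0)   # per-round parameter deltas
        for i in range(1, len(d)):                # first change of direction
            if np.any(np.sign(d[i]) != np.sign(d[i - 1])):
                return lams[i]
        return lams[-1]
    raise ValueError(rule)
```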
22. Outline
- Motivation and Contributions
- Experiments
- Homotopy method
- More experiments
23. Varying the Size of Unlabeled Data
- The three homotopy-based methods outperform EM
- maxEnt outperforms minEig and monotone
- minEig and monotone perform similarly
- Size of the labeled data: 100
24. Picked λ Values
25. Picked λ Values
- EM gives higher weight to the unlabeled data than the homotopy-based methods
- The λ values selected by maxEnt are much smaller than those selected by minEig and monotone
- minEig and monotone are close
26. Conclusion and Future Work
- Using EM can hurt performance when L << U
- We proposed a method to alleviate this problem for HMMs in sequence labeling tasks
- To speed up the method:
  - use sampling to approximate the covariance matrix
  - use faster methods for recovering the solution path, e.g., predictor-corrector
27. Questions?
28. Is Oracle Outperformed by Homotopy?
- No!
- The performance measure used to select λ in the oracle method may differ from the one used when comparing homotopy and oracle
- The decoding algorithm used in the oracle method may differ from the one used when comparing homotopy and oracle
29. Why Not Just Fix λ to a Small Value?
- This ad hoc way of setting λ has two drawbacks:
  - It may still hurt performance: the proper λ may be much smaller than that value.
  - In some situations, the right choice of λ may be large; a small fixed λ is very conservative and does not fully take advantage of the available unlabeled data.
30. Homotopy vs. Baselines
- Viterbi decoding: the single most probable sequence of states
- SMS decoding: the sequence of most probable states, chosen position by position (a sketch of both follows)
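A minimal sketch contrasting the two decoders, using the standard Viterbi and forward-backward recursions (unscaled, so suitable only for short sequences):

```python
import numpy as np

def viterbi(x, pi, A, B):
    # Most probable state sequence argmax_s P(s | x), in log space.
    T, K = len(x), A.shape[0]
    delta = np.log(pi) + np.log(B[:, x[0]])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)   # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, x[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def sms(x, pi, A, B):
    # Sequence of individually most probable states, from the posteriors
    # P(s_t | x) given by the forward-backward recursions.
    T, K = len(x), A.shape[0]
    fwd, bwd = np.zeros((T, K)), np.zeros((T, K))
    fwd[0] = pi * B[:, x[0]]
    for t in range(1, T):
        fwd[t] = (fwd[t - 1] @ A) * B[:, x[t]]
    bwd[-1] = 1.0
    for t in range(T - 2, -1, -1):
        bwd[t] = A @ (B[:, x[t + 1]] * bwd[t + 1])
    return list((fwd * bwd).argmax(axis=1))   # argmax per position
```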