Title: Discriminative Methods with Structure
1. Discriminative Methods with Structure
- Simon Lacoste-Julien
- UC Berkeley
- joint work with
- March 21, 2008
2. Discriminative method
- Decision theoretic framework
3. … with structure on outputs
Input → Output (the space of outputs is huge!)
- Handwriting recognition: image of a handwritten word → "brace"
- Machine translation: "Ce n'est pas un autre problème de classification." → "This is not another classification problem."
4. … with structure on inputs
new representation
5. Structure on outputs: the Discriminative Word Alignment project
- (joint work with Ben Taskar, Dan Klein, and Mike Jordan)
6. Word Alignment
- Key step in most machine translation systems
- Example sentence pair (x) and its alignment (y):
  - French: En vertu des nouvelles propositions, quel est le coût prévu de perception des droits ?
  - English: What is the anticipated cost of collecting fees under the new proposal ?
7. Overview
- Review of large-margin word alignment [Taskar et al., EMNLP 05]
- Two new extensions to the basic model:
  - Fertility features
  - First-order interactions using quadratic assignment
- Results on the Hansards dataset
8. Feature-Based Alignment
En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?
What is the anticipated cost of collecting fees under the new proposal ?
(j, k index word positions in the two sentences)
- Features for a candidate link (j, k):
  - Association: MI = 3.2, Dice = 4.1
  - Lexical pair: ID(proposal, proposition) = 1
  - Position in sentence: AbsDist = 5, RelDist = 0.3
  - Orthography: ExactMatch = 0, Similarity = 0.8
  - Resources: PairInDictionary
  - Other models (IBM2, IBM4)
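Each candidate link (j, k) carries a feature vector like the one above, and its score is a dot product with learned weights. A minimal sketch, with hypothetical weight values (the real w is learned; feature names follow the slide):

```python
# Score one candidate alignment edge as w . f(j, k).
# Both the weights and the feature values here are toy numbers.
def link_score(features, w):
    return sum(w.get(name, 0.0) * value for name, value in features.items())

w = {"Dice": 1.0, "RelDist": -0.5, "ExactMatch": 2.0}   # hypothetical weights
f_jk = {"Dice": 4.1, "RelDist": 0.3, "ExactMatch": 0.0}  # features of one link
print(link_score(f_jk, w))  # 4.1*1.0 + 0.3*(-0.5) + 0 = 3.95
```

Unseen feature names simply contribute zero, so sparse feature dictionaries work directly.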
9. Scoring Whole Alignments
[Figure: the same sentence pair; the score of a whole alignment is the sum of the scores of its individual links (j, k).]
10. Prediction as a Linear Program
- LP relaxation: maximize Σ_{j,k} s_{jk} y_{jk} subject to the degree constraints Σ_k y_{jk} ≤ 1, Σ_j y_{jk} ≤ 1, and 0 ≤ y_{jk} ≤ 1
- Still guaranteed to have integral solutions y
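The prediction problem can be illustrated on a toy instance. The talk solves it as an LP whose relaxation stays integral; the sketch below instead finds the same optimum by exhaustive search over degree-constrained alignments, just to make the objective concrete:

```python
import itertools
import numpy as np

# Choose links maximizing total score subject to the degree constraints
# (each word linked at most once). Exhaustive search, toy sizes only.
def best_alignment(S):
    m, n = S.shape
    best_score, best_y = 0.0, np.zeros((m, n), dtype=int)
    # choice[j] = French word linked to English word j, or None for unlinked
    for choice in itertools.product([None] + list(range(n)), repeat=m):
        used = [k for k in choice if k is not None]
        if len(used) != len(set(used)):
            continue  # degree constraint on the French side
        score = sum(S[j, k] for j, k in enumerate(choice) if k is not None)
        if score > best_score:
            y = np.zeros((m, n), dtype=int)
            for j, k in enumerate(choice):
                if k is not None:
                    y[j, k] = 1
            best_score, best_y = score, y
    return best_score, best_y

S = np.array([[2.0, 0.5], [1.0, 3.0]])  # toy link scores s_{jk}
score, y = best_alignment(S)
print(score, y.tolist())  # 5.0 [[1, 0], [0, 1]]
```

The degree-constraint matrix of the LP is totally unimodular, which is why the relaxation returns the same 0/1 solution this search finds.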
11. Learning w
- Supervised training data
- Training methods
- Maximum likelihood/entropy
- Perceptron
- Maximum margin
12. Maximum Likelihood/Entropy
- Probabilistic approach
- Problem: the normalizing denominator (a sum over all alignments) is #P-complete to compute [Valiant 79; Jerrum & Sinclair 93]
- Can't find maximum likelihood parameters efficiently
13. (Averaged) Perceptron
- Perceptron for structured output [Collins 2002]
- For each example (x_i, y_i):
  - Predict: ŷ = argmax_y w·f(x_i, y)
  - Update: w ← w + f(x_i, y_i) − f(x_i, ŷ)
- Output averaged parameters
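The predict/update loop above can be sketched in a few lines. This is a minimal Collins-style averaged perceptron; the feature map, candidate generator, and data are toy stand-ins (the "structure" is just a binary label), not the alignment model:

```python
import numpy as np

def averaged_perceptron(data, feats, candidates, dim, epochs=5):
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    steps = 0
    for _ in range(epochs):
        for x, y_true in data:
            # Predict: highest-scoring structure under the current weights
            y_hat = max(candidates(x), key=lambda y: w @ feats(x, y))
            if y_hat != y_true:
                # Move toward the gold structure, away from the prediction
                w = w + feats(x, y_true) - feats(x, y_hat)
            w_sum += w
            steps += 1
    return w_sum / steps  # averaged parameters

# Toy problem: label 1 iff x > 0
feats = lambda x, y: np.array([x if y == 1 else -x])
candidates = lambda x: [0, 1]
data = [(1.0, 1), (2.0, 1), (-1.0, 0)]
w_avg = averaged_perceptron(data, feats, candidates, dim=1)
print(w_avg)  # positive weight: positive inputs get label 1
```

Averaging the weight vectors over all updates, rather than keeping only the final one, is what makes the algorithm stable in practice.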
14. Large Margin Estimation
- Require that the true score beat every other score by the loss: w·f(x_i, y_i) ≥ w·f(x_i, y) + ℓ_i(y) for all y
- Equivalent min-max formulation [Taskar et al. 04, 05]
- The inner maximization over y is a simple LP
15. Min-max formulation → QP
- LP duality turns the inner maximization into a minimization
- Result: a QP of polynomial size! → solved with Mosek
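A hedged LaTeX sketch of the formulation, with notation assumed from Taskar et al. (ℓ_i is the loss against the true alignment y_i, and 𝒴_i the set of feasible alignments for example i):

```latex
% Large-margin constraint: the true score beats every other score by the loss
\min_{w}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{s.t.} \quad
w^{\top} f(x_i, y_i) \;\ge\;
\max_{y \in \mathcal{Y}_i} \bigl[\, w^{\top} f(x_i, y) + \ell_i(y) \,\bigr]
\quad \forall i.
% The inner max is an LP over the matching polytope; replacing it by its
% LP dual turns the whole problem into a single polynomial-size QP.
```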
16. Experimental Setup
- Canadian Hansards corpus (French-English)
- Word-level aligned:
  - 200 sentence pairs (training data)
  - 37 sentence pairs (validation data)
  - 247 sentence pairs (test data)
- Sentence-level aligned: 1M sentence pairs
  - generate association-based features
  - learn unsupervised IBM models
- Learn using large margin
- Evaluate alignment quality using the standard metric: AER (Alignment Error Rate), closely related to F1
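The evaluation metric can be computed directly from the annotated links. In the standard (Och & Ney style) setup, the gold data has sure links S and possible links P (with S ⊆ P), and a predicted set A is scored as AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|); the link sets below are toy values:

```python
def aer(sure, possible, predicted):
    """Alignment Error Rate; 0.0 is a perfect alignment."""
    sure, possible, predicted = set(sure), set(possible), set(predicted)
    return 1.0 - (len(predicted & sure) + len(predicted & possible)) \
                 / (len(predicted) + len(sure))

S = {(0, 0), (1, 1)}          # sure links as (English index, French index)
P = S | {(1, 2)}              # sure links plus one possible link
A = {(0, 0), (1, 2)}          # a predicted alignment
print(aer(S, P, A))  # 1 - (1 + 2) / (2 + 2) = 0.25
```

Because misses against P are not penalized, AER rewards precision on possible links while still demanding recall on sure links.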
17. Old Results
[Table: Precision / Recall and AER; values not recoverable from the transcript.]
18. Improving the basic model
- We would like to model:
  - Fertility: alignments are not necessarily 1-to-1
  - First-order interactions: alignments are mostly locally diagonal; we would like to score a link depending on its neighbors
- Strategy: extensions that keep the prediction model an LP
19. Modeling Fertility
- Add a fertility penalty to the score
- Example of a node feature for word w: fraction of the time it had fertility > k on the training set
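A hypothetical sketch of how such a penalty could enter the objective: words may take several links, but each link beyond the first pays a cost (in the model the weight is learned, possibly per word via node features like the one above; the flat 0.7 here is a toy value):

```python
from collections import Counter

def fertility_penalty(alignment, penalty=0.7):
    """Total penalty: `penalty` per link beyond the first, on either side."""
    eng = Counter(j for j, _ in alignment)  # fertility of each English word
    fra = Counter(k for _, k in alignment)  # fertility of each French word
    extra = sum(c - 1 for c in eng.values()) + sum(c - 1 for c in fra.values())
    return penalty * extra

A = [(0, 0), (0, 1), (2, 3)]   # English word 0 has fertility 2
print(fertility_penalty(A))    # one extra link -> 0.7
```

Subtracting this penalty from the total link score discourages, without forbidding, alignments that are not 1-to-1.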
20-21. Fertility Results
[Tables: Precision / Recall and AER; values not recoverable from the transcript.]
22. Fertility example
[Figure: sure, possible, and predicted alignments on an example sentence pair.]
23. Modeling First-Order Effects
- Want: score each link jointly with its neighbors (pairwise terms, quadratic in y)
- Restrict the interactions to local ones
- Solve via an LP relaxation
24. Integer program
- Quadratic assignment: NP-complete in general
- On real-world sentences (2 to 30 words), takes a few seconds using Mosek (~1k variables)
- Interestingly, on our dataset:
  - 80% of examples yield an integer solution when solved via the linear relaxation
  - same AER when using the relaxation!
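The first-order objective can be sketched as follows: each link keeps its linear score, and every pair of links that continues the diagonal, (j, k) followed by (j+1, k+1), earns a bonus. This quadratic term is what turns the matching LP into a quadratic assignment problem (the scores and bonus weight below are toy values):

```python
def qap_score(S, alignment, diag_bonus=1.0):
    """Linear link scores plus a bonus for each adjacent diagonal pair."""
    links = set(alignment)
    linear = sum(S[j][k] for j, k in links)
    quadratic = diag_bonus * sum(1 for j, k in links
                                 if (j + 1, k + 1) in links)
    return linear + quadratic

S = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy link scores
diag = [(0, 0), (1, 1), (2, 2)]
print(qap_score(S, diag))  # 3.0 linear + 2 diagonal bonuses = 5.0
```

With the bonus weight learned, the model can prefer locally diagonal alignments without hard-coding monotonicity.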
25-28. New Results
[Tables: Precision / Recall and AER across the model variants; values not recoverable from the transcript.]
29-30. Fertility + QAP examples
[Figures: example alignments from the fertility + quadratic-assignment model.]
31. Conclusions
- Feature-based word alignment
- Efficient algorithms for supervised learning
- Exploit unsupervised data via features and other models
- Surprisingly accurate with simple features
- With the fertility model and first-order interactions: 38% AER reduction over intersected Model 4
- Lowest published AER on this dataset
- High-recall alignments → promising for MT
32. Structure on inputs: the discLDA project
- (work in progress)
- (joint work with Fei Sha and Mike Jordan)
33. Unsupervised dimensionality reduction
new representation
34. Analogy: PCA vs. FDA
[Figure: PCA direction vs. FDA direction.]
35. Goal: supervised dimensionality reduction
- latent variable model with supervised information
new representation
36. Review: the LDA model
37. Discriminative version of LDA
- Ultimately, want to learn the topics discriminatively
- → but high-dimensional, non-convex objective: hard to optimize!
- Instead, propose to learn a class-dependent linear transformation of the common topic proportions
- New generative model
- Equivalently, a transformation on the topic parameters
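A hedged numeric sketch of that construction: each class y has a linear map T_y from the shared topic simplex to a larger one, so the class reshapes which topics a document's shared proportions reach. The dimensions and the particular T matrices below are toy choices, not the paper's:

```python
import numpy as np

K0, K = 3, 5                               # shared and transformed topic counts
theta = np.array([0.5, 0.3, 0.2])          # shared topic proportions

def class_map(rows):
    """Column-stochastic 0/1 map sending shared topic j to topic rows[j]."""
    T = np.zeros((K, K0))
    T[rows, range(K0)] = 1.0
    return T

T_y = {0: class_map([0, 1, 2]),   # class 0 uses topics 0-2
       1: class_map([0, 3, 4])}   # class 1 shares topic 0, owns topics 3-4

for y in (0, 1):
    theta_y = T_y[y] @ theta
    print(y, theta_y.tolist())    # each T_y @ theta stays on the simplex
```

Because each column of T_y sums to one, T_y θ is again a distribution over topics, so the usual LDA word-generation step applies unchanged.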
38. Simplex Geometry
[Figure: the topic simplex embedded in the word simplex with vertices w1, w2, w3.]
39. Interpretation 1
- Shared topics vs. class-specific topics
40. Interpretation 2
- Generative model from T: add a new latent variable u
41. Comparison with the AT model
- Author-Topic model [Rosen-Zvi et al. 2004]
[Figure: AT model vs. discLDA graphical models.]
42. Inference and learning
43. Learning
- For fixed T, learn the topic parameters by sampling (z, u): Rao-Blackwellized Gibbs sampling
- For fixed topic parameters, update T using stochastic gradient ascent on the conditional log-likelihood, in an online fashion
  - get an approximate gradient using Monte Carlo EM
  - use the Harmonic Mean estimator for the likelihood estimate
- Currently, results are noisy
44. Inference (dimensionality reduction)
- Given the learned T and topic parameters:
  - estimate the label posterior using the Harmonic Mean estimator
  - compute the topic proportions by marginalizing over y, giving the new representation of the document
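The marginalization step can be sketched numerically: weight each class-transformed topic vector by an estimated label posterior and average, giving one label-free representation of the document. All numbers below are toy values; the posterior would come from the Harmonic Mean estimate:

```python
import numpy as np

def reduced_representation(theta, Ts, p_y):
    """E[T_y @ theta] under the label posterior p_y."""
    return sum(p * (T @ theta) for p, T in zip(p_y, Ts))

theta = np.array([0.5, 0.3, 0.2])            # shared topic proportions
T0 = np.zeros((5, 3)); T0[[0, 1, 2], range(3)] = 1.0  # toy class-0 map
T1 = np.zeros((5, 3)); T1[[0, 3, 4], range(3)] = 1.0  # toy class-1 map
p_y = [0.8, 0.2]                             # estimated label posterior
r = reduced_representation(theta, [T0, T1], p_y)
print(r.tolist())  # mixes the class-0 and class-1 placements of theta
```

The result is still a distribution over the transformed topics, so it can feed directly into a downstream classifier.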
45. Preliminary Experiments
46. 20 Newsgroups dataset
- 11k training documents, 7.5k test documents, 50k-word vocabulary
- Used a fixed T
- Get the reduced representation → train a linear SVM on it
- hence 110 topics
47. Classification results
- discLDA + SVM: 20% error
- LDA + SVM: 25% error
- discLDA predictions: 20% error
48. Newsgroup embedding (LDA)
49. Newsgroup embedding (discLDA)
50. Using t-SNE (on discLDA)
- thanks to Laurens van der Maaten (Hinton's group) for the figure!
51. Using t-SNE (on LDA)
- thanks to Laurens van der Maaten (Hinton's group) for the figure!
52. Learned topics
53. Another embedding
- NIPS papers vs. Psychology abstracts
[Figure: LDA embedding of the two collections.]
54. 13-scenes dataset [Fei-Fei 2005]
- train: 100 images per category
- test: 2558 images
55. Vocabulary (visual words)
56. Topics
57. Conclusion
- A fixed transformation T enables exploration of topic sharing
- Get a reduced representation which preserves predictive power
- Noisy gradient estimates: still work in progress
- Will probably try a variational approach instead