1
Discriminative Methods with Structure
  • Simon Lacoste-Julien
  • UC Berkeley
  • joint work with Fei Sha, Ben Taskar, Dan Klein, and Mike Jordan
  • March 21, 2008

2
"Discriminative method"
  • Decision-theoretic framework
  • Contrast function

3
"with structure" on outputs
  • Handwriting recognition: input image → output "brace"
  • Machine translation: input "Ce n'est pas un autre problème de classification." → output "This is not another classification problem."
  • The output space is huge!
4
"with structure" on inputs
  • latent variable model → new representation → classification

5
Structure on outputs: Discriminative Word Alignment project
  • (joint work with Ben Taskar, Dan Klein and Mike
    Jordan)

6
Word Alignment
  • Key step in most machine translation systems

x: the sentence pair
En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?
What is the anticipated cost of collecting fees under the new proposal ?
y: the word alignment between the two sentences
7
Overview
  • Review of large-margin word alignment [Taskar et al., EMNLP 05]
  • Two new extensions to the basic model:
  • Fertility features
  • First-order interactions using quadratic assignment
  • Results on the Hansards dataset

8
Feature-Based Alignment
En vertu de les nouvelles propositions , quel
est le coût prévu de perception de les
droits ?
  • Features:
  • Association: MI = 3.2; Dice = 4.1
  • Lexical pair: ID(proposal, proposition) = 1
  • Position in sentence: AbsDist = 5; RelDist = 0.3
  • Orthography: ExactMatch = 0; Similarity = 0.8
  • Resources: PairInDictionary
  • Other models (IBM2, IBM4)

What is the anticipated cost of collecting fees
under the new proposal ?
[Figure: the sentence pair with one candidate link (j, k) highlighted]
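For illustration, a minimal sketch of how such per-link features could be assembled and scored against a learned weight vector w (hypothetical helper names, not the authors' code):

    import numpy as np

    def link_features(mi, dice, in_dict, abs_dist, rel_dist, exact_match, similarity):
        """Assemble the feature vector f(x, j, k) for one candidate link."""
        return np.array([mi, dice, in_dict, abs_dist, rel_dist, exact_match, similarity])

    def link_score(w, f):
        """Score of one candidate link: w . f(x, j, k)."""
        return float(w @ f)

    # e.g. the (proposal, proposition) link from this slide:
    f = link_features(mi=3.2, dice=4.1, in_dict=1, abs_dist=5, rel_dist=0.3,
                      exact_match=0, similarity=0.8)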
9
Scoring Whole Alignments
[Figure: the example sentence pair, with every candidate link (j, k) scored]
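Written out (in the notation of Taskar et al. 2005, assumed here), the score of a whole alignment decomposes over its links:

    s_w(x, y) = Σ_{j,k} y_{jk} · wᵀ f(x, j, k),

where y_{jk} ∈ {0, 1} indicates whether positions j and k are aligned.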
10
Prediction as a Linear Program
  • LP relaxation of the integer matching problem, with degree constraints (each word in at most one link)
  • Still guaranteed to have integral solutions y (the bipartite constraint matrix is totally unimodular; see the sketch below)
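For concreteness, a minimal sketch of this prediction LP (illustrative only, using scipy in place of a commercial solver):

    import numpy as np
    from scipy.optimize import linprog

    def predict_alignment(s):
        """s: (m, n) matrix of link scores w . f(x, j, k)."""
        m, n = s.shape
        c = -s.ravel()                      # linprog minimizes, so negate the scores
        A = np.zeros((m + n, m * n))
        for j in range(m):                  # degree constraints: sum_k y[j,k] <= 1
            A[j, j * n:(j + 1) * n] = 1
        for k in range(n):                  # degree constraints: sum_j y[j,k] <= 1
            A[m + k, k::n] = 1
        b = np.ones(m + n)
        res = linprog(c, A_ub=A, b_ub=b, bounds=(0, 1))
        # total unimodularity of the bipartite degree constraints guarantees an
        # integral optimum; rounding only cleans up floating-point noise
        return np.round(res.x.reshape(m, n))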
11
Learning w
  • Supervised training data
  • Training methods
  • Maximum likelihood/entropy
  • Perceptron
  • Maximum margin

12
Maximum Likelihood/Entropy
  • Probabilistic approach
  • Problem: the denominator (the partition function, a matrix permanent) is #P-complete to compute
  • [Valiant 79; Jerrum & Sinclair 93]
  • Can't find maximum-likelihood parameters exactly

13
(Averaged) Perceptron
  • Perceptron for structured outputs [Collins 2002]
  • For each example (x_i, y_i):
  • Predict: ŷ = arg max_y wᵀf(x_i, y)
  • Update: w ← w + f(x_i, y_i) − f(x_i, ŷ)
  • Output averaged parameters (sketch below)
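A minimal sketch of the procedure (illustrative, not the presentation's code; predict and features are assumed helpers implementing the arg-max and the joint feature map):

    import numpy as np

    def averaged_perceptron(data, n_features, predict, features, epochs=10):
        """data: list of (x, y_true) pairs; returns averaged parameters."""
        w = np.zeros(n_features)
        w_sum = np.zeros(n_features)
        t = 0
        for _ in range(epochs):
            for x, y_true in data:
                y_hat = predict(w, x)                  # arg-max alignment under current w
                if not np.array_equal(y_hat, y_true):  # mistake-driven update
                    w += features(x, y_true) - features(x, y_hat)
                w_sum += w
                t += 1
        return w_sum / t                               # averaged parameters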

14
Large Margin Estimation
  • Equivalent min-max formulation [Taskar et al. 04, 05]:

    min_w ½‖w‖²  s.t.  wᵀf(x, y)  ≥  max_{y'} [ wᵀf(x, y') + ℓ(y, y') ]
                       (true score)           (other score)     (loss)

  • The inner max over y' is a simple LP
15
Min-max formulation → QP
  • By LP duality, the inner max is replaced by its dual minimization
  • QP of polynomial size! → Mosek
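A sketch of the duality step (notation assumed here, following Taskar et al., not taken from the slides): the inner max is an LP over the relaxed alignment polytope {z : Az ≤ b, z ≥ 0}, so

    max_{Az ≤ b, z ≥ 0} (Fᵀw + ℓ)ᵀ z  =  min_{Aᵀλ ≥ Fᵀw + ℓ, λ ≥ 0} bᵀλ,

and substituting the dual turns the min-max problem into a single joint minimization over (w, λ): a QP with polynomially many variables and constraints, which an off-the-shelf solver such as Mosek handles.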
16
Experimental Setup
  • English-French Canadian Hansards corpus
  • Word-level aligned:
  • 200 sentence pairs (training data)
  • 37 sentence pairs (validation data)
  • 247 sentence pairs (test data)
  • Sentence-level aligned:
  • 1M sentence pairs
  • used to generate association-based features
  • and to learn unsupervised IBM models
  • Learn w using the large-margin method
  • Evaluate alignment quality using the standard AER (Alignment Error Rate), akin to 1 − F1

17
Old Results
  • 200 train/247 test split

[Table: Precision / Recall and AER for the baseline models]
18
Improving the basic model
  • We would like to model:
  • Fertility: alignments are not necessarily 1-to-1
  • First-order interactions: alignments are mostly locally diagonal, so we would like to score a link depending on its neighbors
  • Strategy: extensions that keep the prediction model an LP

19
Modeling Fertility
  • Add a fertility penalty to the score (a sketch of one LP construction follows below)
  • Example of a node feature: for word w, the fraction of the time it had fertility > k on the training set
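One way to keep prediction an LP while allowing fertility > 1 (a sketch under assumptions, not necessarily the paper's exact construction): relax the row degree constraints with penalized slack variables, extending the earlier LP sketch.

    import numpy as np
    from scipy.optimize import linprog

    def predict_with_fertility(s, c, D=2):
        """s: (m, n) link scores; c: (m, D) penalties per extra unit of fertility."""
        m, n = s.shape
        nv = m * n + m * D                    # variables: links y, then slacks z
        obj = -np.concatenate([s.ravel(), -c.ravel()])  # maximize sum(s*y) - sum(c*z)
        A = np.zeros((m + n, nv))
        for j in range(m):                    # sum_k y[j,k] - sum_d z[j,d] <= 1
            A[j, j * n:(j + 1) * n] = 1
            A[j, m * n + j * D:m * n + (j + 1) * D] = -1
        for k in range(n):                    # columns stay at fertility <= 1 in this sketch
            A[m + k, k:m * n:n] = 1
        b = np.ones(m + n)
        res = linprog(obj, A_ub=A, b_ub=b, bounds=(0, 1))
        return res.x[:m * n].reshape(m, n)    # may be fractional; integrality not guaranteed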
20
Fertility Results
  • 200 train/247 test split

[Table: Precision / Recall and AER]
22
Fertility example
[Alignment figure: sure alignments vs. possible alignments vs. predicted alignments]

23
Modeling First-Order Effects
  • Want: also score pairs of neighboring links together (quadratic terms y_{jk} · y_{j'k'})
  • Restrict the pairwise terms to adjacent links
  • Take a linear relaxation of the resulting quadratic program (standard linearization sketched below)
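The relaxation step can use the standard linearization of a product of binary variables (notation assumed here): introduce z_{jk,j'k'} to stand for y_{jk} · y_{j'k'}, with

    z ≤ y_{jk},   z ≤ y_{j'k'},   z ≥ y_{jk} + y_{j'k'} − 1,   0 ≤ z ≤ 1,

which is exact whenever y is integral and keeps every constraint linear.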
24
Integer program
  • Quadratic assignment is NP-complete
  • On real-world sentences (2 to 30 words), it takes a few seconds using Mosek (~1k variables)
  • Interestingly, in our dataset:
  • 80% of examples yield an integer solution when solved via the linear relaxation
  • same AER when using the relaxation!

25
New Results
  • 200 train/247 test split

[Table: Precision / Recall and AER]
29
Fertility + QAP example
30
Fertility + QAP example
31
Conclusions
  • Feature-based word alignment
  • Efficient algorithms for supervised learning
  • Exploit unsupervised data via features and other models
  • Surprisingly accurate with simple features
  • Fertility model and first-order interactions included
  • 38% AER reduction over intersected IBM Model 4
  • Lowest published AER on this dataset
  • High-recall alignments → promising for MT

32
Structure on inputs: discLDA project
  • (work in progress)
  • (joint work with Fei Sha and Mike Jordan)

33
Unsupervised dimensionality reduction
  • latent variable model → new representation → classification

34
Analogy: PCA vs. FDA
[Figure: 2-D data with the PCA direction vs. the FDA direction]

35
Goal: supervised dim. reduction
  • latent variable model with supervised information → new representation → classification

36
Review: the LDA model
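For reference, a minimal sketch of standard LDA's generative process [Blei et al. 2003] (illustrative code, not from the presentation):

    import numpy as np

    def generate_lda_document(alpha, phi, doc_len, rng=None):
        """alpha: (K,) Dirichlet prior; phi: (K, V) topic-word distributions."""
        rng = np.random.default_rng() if rng is None else rng
        theta = rng.dirichlet(alpha)              # the document's topic proportions
        words = []
        for _ in range(doc_len):
            z = rng.choice(len(alpha), p=theta)   # draw a topic for this position
            words.append(rng.choice(phi.shape[1], p=phi[z]))  # draw a word from topic z
        return words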
37
Discriminative version of LDA
  • Ultimately, want to learn the parameters discriminatively
  • → but a high-dimensional, non-convex objective; hard to optimize!
  • Instead, propose to learn class-dependent linear transformations T^y of a common θ
  • New generative model (sketch below)
  • Equivalently, a transformation on the topic matrix Φ
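As a rough illustration of the transformed model (a sketch under assumed notation: one stochastic matrix T^y per class applied to the common θ; not the authors' code):

    import numpy as np

    def generate_disclda_document(alpha, phi, T, y, doc_len, rng=None):
        """alpha: (K0,) prior; phi: (K, V) topics; T: (C, K, K0) class transformations."""
        rng = np.random.default_rng() if rng is None else rng
        theta = rng.dirichlet(alpha)              # common topic proportions
        mixed = T[y] @ theta                      # class-dependent proportions T^y theta
        words = []
        for _ in range(doc_len):
            z = rng.choice(len(mixed), p=mixed)   # topic under the transformed proportions
            words.append(rng.choice(phi.shape[1], p=phi[z]))
        return words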

38
Simplex Geometry
[Figure: the topic simplex embedded in the word simplex, with word vertices w1, w2, w3]
39
Interpretation 1
  • Shared topics vs. class-specific topics
[Figure: block structure of T separating class-specific topics from shared topics]
40
Interpretation 2
  • Generative model: from T, add a new latent variable u

41
Compare with the AT model
  • Author-Topic model [Rosen-Zvi et al. 2004]
[Figure: the AT graphical model next to discLDA's]
42
Inference and learning
43
Learning
  • For fixed T, learn the topic parameters Φ by sampling (z, u): Rao-Blackwellized Gibbs sampling
  • For fixed Φ, update T using stochastic gradient ascent on the conditional log-likelihood:
  • in an online fashion
  • get an approximate gradient using Monte Carlo EM
  • use the Harmonic Mean estimator to estimate the likelihood
  • Currently, results are noisy

44
Inference (dimensionality reduction)
  • Given the learned T and Φ:
  • estimate the document's likelihood under each class using the Harmonic Mean estimator (sketched below)
  • compute the topic posterior by marginalizing over y to get the new representation of the document
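For concreteness, a minimal sketch of the harmonic-mean evidence estimate referred to above (its standard form, not the authors' code): given log-likelihoods log p(w | z_s) for posterior samples z_s,

    import numpy as np

    def harmonic_mean_log_evidence(log_liks):
        """Estimate log p(w) from per-sample log p(w | z_s), computed stably."""
        log_liks = np.asarray(log_liks)
        neg = -log_liks
        m = neg.max()
        # log p(w) ~= -log( (1/S) * sum_s exp(-log p(w | z_s)) )
        return -(m + np.log(np.exp(neg - m).sum()) - np.log(len(log_liks)))

The harmonic-mean estimator is known to have high variance, which is consistent with the noisy results noted on the previous slide.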

45
Preliminary Experiments
46
20 Newsgroups dataset
  • 11k train, 7.5k test; vocabulary: 50k
  • Used a fixed T, hence 110 topics in total
  • Get the reduced representation → train a linear SVM on it
47
Classification results
  • discLDA + SVM: 20% error
  • LDA + SVM: 25% error
  • discLDA predictions: 20% error

48
Newsgroup embedding (LDA)
49
Newsgroup embedding (discLDA)
50
using t-SNE (on discLDA)
  • thanks to Laurens van der Maaten (Hinton's group) for the figure!
51
using t-SNE (on LDA)
  • thanks to Laurens van der Maaten (Hinton's group) for the figure!

52
Learned topics
53
Another embedding
  • NIPS papers vs. Psychology abstracts
[Figures: the LDA embedding vs. the discLDA embedding]
54
13 scenes dataset [Fei-Fei 2005]
  • train: 100 per category
  • test: 2,558

55
Vocabulary (visual words)
56
Topics
57
Conclusion
  • a fixed transformation T enables exploring topic sharing
  • yields a reduced representation that preserves predictive power
  • noisy gradient estimates: still work in progress
  • will probably try a variational approach instead
