Title: Discriminative Methods with Structure
1. Discriminative Methods with Structure
- Simon Lacoste-Julien
- UC Berkeley
- joint work with
- March 21, 2008
2. Discriminative method
- Decision theoretic framework
3. … with structure on outputs
Input → Output (the space of outputs is huge!)
- Handwriting recognition: image of a handwritten word → "brace"
- Machine translation: "Ce n'est pas un autre problème de classification." → "This is not another classification problem."
4. … with structure on inputs
new representation
5. Structure on outputs: the Discriminative Word Alignment project
- (joint work with Ben Taskar, Dan Klein, and Mike Jordan)
6. Word Alignment
- Key step in most machine translation systems
- Example sentence pair (x) and its alignment (y):
  - French: En vertu des nouvelles propositions, quel est le coût prévu de perception des droits ?
  - English: What is the anticipated cost of collecting fees under the new proposal ?
7. Overview
- Review of large-margin word alignment [Taskar et al., EMNLP 05]
- Two new extensions to the basic model:
  - Fertility features
  - First-order interactions using quadratic assignment
- Results on the Hansards dataset
8. Feature-Based Alignment
En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?
What is the anticipated cost of collecting fees under the new proposal ?
(j, k index word positions in the two sentences)
- Features for a candidate link (j, k):
  - Association: MI = 3.2, Dice = 4.1
  - Lexical pair: ID(proposal, proposition) = 1
  - Position in sentence: AbsDist = 5, RelDist = 0.3
  - Orthography: ExactMatch = 0, Similarity = 0.8
  - Resources: PairInDictionary
  - Other models (IBM2, IBM4)
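Each candidate link (j, k) carries a feature vector like the one above, and its score is a dot product with learned weights. A minimal sketch, with hypothetical weight values (the real w is learned; feature names follow the slide):

```python
# Score one candidate alignment edge as w . f(j, k).
# Both the weights and the feature values here are toy numbers.
def link_score(features, w):
    return sum(w.get(name, 0.0) * value for name, value in features.items())

w = {"Dice": 1.0, "RelDist": -0.5, "ExactMatch": 2.0}   # hypothetical weights
f_jk = {"Dice": 4.1, "RelDist": 0.3, "ExactMatch": 0.0}  # features of one link
print(link_score(f_jk, w))  # 4.1*1.0 + 0.3*(-0.5) + 0 = 3.95
```

Unseen feature names simply contribute zero, so sparse feature dictionaries work directly.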
9. Scoring Whole Alignments
[Figure: the same sentence pair; the score of a whole alignment is the sum of the scores of its individual links (j, k).]
10. Prediction as a Linear Program
- LP relaxation: maximize Σ_{j,k} s_{jk} y_{jk} subject to the degree constraints Σ_k y_{jk} ≤ 1, Σ_j y_{jk} ≤ 1, and 0 ≤ y_{jk} ≤ 1
- Still guaranteed to have integral solutions y
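The prediction problem can be illustrated on a toy instance. The talk solves it as an LP whose relaxation stays integral; the sketch below instead finds the same optimum by exhaustive search over degree-constrained alignments, just to make the objective concrete:

```python
import itertools
import numpy as np

# Choose links maximizing total score subject to the degree constraints
# (each word linked at most once). Exhaustive search, toy sizes only.
def best_alignment(S):
    m, n = S.shape
    best_score, best_y = 0.0, np.zeros((m, n), dtype=int)
    # choice[j] = French word linked to English word j, or None for unlinked
    for choice in itertools.product([None] + list(range(n)), repeat=m):
        used = [k for k in choice if k is not None]
        if len(used) != len(set(used)):
            continue  # degree constraint on the French side
        score = sum(S[j, k] for j, k in enumerate(choice) if k is not None)
        if score > best_score:
            y = np.zeros((m, n), dtype=int)
            for j, k in enumerate(choice):
                if k is not None:
                    y[j, k] = 1
            best_score, best_y = score, y
    return best_score, best_y

S = np.array([[2.0, 0.5], [1.0, 3.0]])  # toy link scores s_{jk}
score, y = best_alignment(S)
print(score, y.tolist())  # 5.0 [[1, 0], [0, 1]]
```

The degree-constraint matrix of the LP is totally unimodular, which is why the relaxation returns the same 0/1 solution this search finds.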
11. Learning w
- Supervised training data
- Training methods
- Maximum likelihood/entropy
- Perceptron
- Maximum margin
12. Maximum Likelihood/Entropy
- Probabilistic approach
- Problem: the normalizing denominator (a sum over all alignments) is #P-complete to compute [Valiant 79; Jerrum & Sinclair 93]
- Can't find maximum likelihood parameters efficiently
13. (Averaged) Perceptron
- Perceptron for structured output [Collins 2002]
- For each example (x_i, y_i):
  - Predict: ŷ = argmax_y w·f(x_i, y)
  - Update: w ← w + f(x_i, y_i) − f(x_i, ŷ)
- Output averaged parameters
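The predict/update loop above can be sketched in a few lines. This is a minimal Collins-style averaged perceptron; the feature map, candidate generator, and data are toy stand-ins (the "structure" is just a binary label), not the alignment model:

```python
import numpy as np

def averaged_perceptron(data, feats, candidates, dim, epochs=5):
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    steps = 0
    for _ in range(epochs):
        for x, y_true in data:
            # Predict: highest-scoring structure under the current weights
            y_hat = max(candidates(x), key=lambda y: w @ feats(x, y))
            if y_hat != y_true:
                # Move toward the gold structure, away from the prediction
                w = w + feats(x, y_true) - feats(x, y_hat)
            w_sum += w
            steps += 1
    return w_sum / steps  # averaged parameters

# Toy problem: label 1 iff x > 0
feats = lambda x, y: np.array([x if y == 1 else -x])
candidates = lambda x: [0, 1]
data = [(1.0, 1), (2.0, 1), (-1.0, 0)]
w_avg = averaged_perceptron(data, feats, candidates, dim=1)
print(w_avg)  # positive weight: positive inputs get label 1
```

Averaging the weight vectors over all updates, rather than keeping only the final one, is what makes the algorithm stable in practice.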
14. Large Margin Estimation
- Require that the true score beat every other score by the loss: w·f(x_i, y_i) ≥ w·f(x_i, y) + ℓ_i(y) for all y
- Equivalent min-max formulation [Taskar et al. 04, 05]
- The inner maximization over y is a simple LP
15. Min-max formulation → QP
- LP duality turns the inner maximization into a minimization
- Result: a QP of polynomial size! → solved with Mosek
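A hedged LaTeX sketch of the formulation, with notation assumed from Taskar et al. (ℓ_i is the loss against the true alignment y_i, and 𝒴_i the set of feasible alignments for example i):

```latex
% Large-margin constraint: the true score beats every other score by the loss
\min_{w}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{s.t.} \quad
w^{\top} f(x_i, y_i) \;\ge\;
\max_{y \in \mathcal{Y}_i} \bigl[\, w^{\top} f(x_i, y) + \ell_i(y) \,\bigr]
\quad \forall i.
% The inner max is an LP over the matching polytope; replacing it by its
% LP dual turns the whole problem into a single polynomial-size QP.
```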
16. Experimental Setup
- Canadian Hansards corpus (French-English)
- Word-level aligned:
  - 200 sentence pairs (training data)
  - 37 sentence pairs (validation data)
  - 247 sentence pairs (test data)
- Sentence-level aligned: 1M sentence pairs
  - generate association-based features
  - learn unsupervised IBM models
- Learn using large margin
- Evaluate alignment quality using the standard metric: AER (Alignment Error Rate), closely related to F1
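The evaluation metric can be computed directly from the annotated links. In the standard (Och & Ney style) setup, the gold data has sure links S and possible links P (with S ⊆ P), and a predicted set A is scored as AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|); the link sets below are toy values:

```python
def aer(sure, possible, predicted):
    """Alignment Error Rate; 0.0 is a perfect alignment."""
    sure, possible, predicted = set(sure), set(possible), set(predicted)
    return 1.0 - (len(predicted & sure) + len(predicted & possible)) \
                 / (len(predicted) + len(sure))

S = {(0, 0), (1, 1)}          # sure links as (English index, French index)
P = S | {(1, 2)}              # sure links plus one possible link
A = {(0, 0), (1, 2)}          # a predicted alignment
print(aer(S, P, A))  # 1 - (1 + 2) / (2 + 2) = 0.25
```

Because misses against P are not penalized, AER rewards precision on possible links while still demanding recall on sure links.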
17. Old Results
[Table: Precision / Recall and AER; values not recoverable from the transcript.]
18. Improving the basic model
- We would like to model:
  - Fertility: alignments are not necessarily 1-to-1
  - First-order interactions: alignments are mostly locally diagonal; we would like to score a link depending on its neighbors
- Strategy: extensions that keep the prediction model an LP
19. Modeling Fertility
- Add a fertility penalty to the score
- Example of a node feature for word w: fraction of the time it had fertility > k on the training set
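A hypothetical sketch of how such a penalty could enter the objective: words may take several links, but each link beyond the first pays a cost (in the model the weight is learned, possibly per word via node features like the one above; the flat 0.7 here is a toy value):

```python
from collections import Counter

def fertility_penalty(alignment, penalty=0.7):
    """Total penalty: `penalty` per link beyond the first, on either side."""
    eng = Counter(j for j, _ in alignment)  # fertility of each English word
    fra = Counter(k for _, k in alignment)  # fertility of each French word
    extra = sum(c - 1 for c in eng.values()) + sum(c - 1 for c in fra.values())
    return penalty * extra

A = [(0, 0), (0, 1), (2, 3)]   # English word 0 has fertility 2
print(fertility_penalty(A))    # one extra link -> 0.7
```

Subtracting this penalty from the total link score discourages, without forbidding, alignments that are not 1-to-1.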
20-21. Fertility Results
[Tables: Precision / Recall and AER; values not recoverable from the transcript.]
22. Fertility example
[Figure: sure, possible, and predicted alignments on an example sentence pair.]
23. Modeling First-Order Effects
- Want: score each link jointly with its neighbors (pairwise terms, quadratic in y)
- Restrict the interactions to local ones
- Solve via an LP relaxation
24. Integer program
- Quadratic assignment: NP-complete in general
- On real-world sentences (2 to 30 words), takes a few seconds using Mosek (~1k variables)
- Interestingly, on our dataset:
  - 80% of examples yield an integer solution when solved via the linear relaxation
  - same AER when using the relaxation!
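The first-order objective can be sketched as follows: each link keeps its linear score, and every pair of links that continues the diagonal, (j, k) followed by (j+1, k+1), earns a bonus. This quadratic term is what turns the matching LP into a quadratic assignment problem (the scores and bonus weight below are toy values):

```python
def qap_score(S, alignment, diag_bonus=1.0):
    """Linear link scores plus a bonus for each adjacent diagonal pair."""
    links = set(alignment)
    linear = sum(S[j][k] for j, k in links)
    quadratic = diag_bonus * sum(1 for j, k in links
                                 if (j + 1, k + 1) in links)
    return linear + quadratic

S = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy link scores
diag = [(0, 0), (1, 1), (2, 2)]
print(qap_score(S, diag))  # 3.0 linear + 2 diagonal bonuses = 5.0
```

With the bonus weight learned, the model can prefer locally diagonal alignments without hard-coding monotonicity.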
25-28. New Results
[Tables: Precision / Recall and AER across the model variants; values not recoverable from the transcript.]
29-30. Fertility + QAP examples
[Figures: example alignments from the fertility + quadratic-assignment model.]
31. Conclusions
- Feature-based word alignment
- Efficient algorithms for supervised learning
- Exploit unsupervised data via features and other models
- Surprisingly accurate with simple features
- With the fertility model and first-order interactions: 38% AER reduction over intersected Model 4
- Lowest published AER on this dataset
- High-recall alignments → promising for MT
32. Structure on inputs: the discLDA project
- (work in progress)
- (joint work with Fei Sha and Mike Jordan)
33. Unsupervised dimensionality reduction
new representation
34. Analogy: PCA vs. FDA
[Figure: PCA direction vs. FDA direction.]
35. Goal: supervised dimensionality reduction
- latent variable model with supervised information
new representation
36. Review: the LDA model
37. Discriminative version of LDA
- Ultimately, want to learn the topics discriminatively
- → but high-dimensional, non-convex objective: hard to optimize!
- Instead, propose to learn a class-dependent linear transformation of the common topic proportions
- New generative model
- Equivalently, a transformation on the topic parameters
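A hedged numeric sketch of that construction: each class y has a linear map T_y from the shared topic simplex to a larger one, so the class reshapes which topics a document's shared proportions reach. The dimensions and the particular T matrices below are toy choices, not the paper's:

```python
import numpy as np

K0, K = 3, 5                               # shared and transformed topic counts
theta = np.array([0.5, 0.3, 0.2])          # shared topic proportions

def class_map(rows):
    """Column-stochastic 0/1 map sending shared topic j to topic rows[j]."""
    T = np.zeros((K, K0))
    T[rows, range(K0)] = 1.0
    return T

T_y = {0: class_map([0, 1, 2]),   # class 0 uses topics 0-2
       1: class_map([0, 3, 4])}   # class 1 shares topic 0, owns topics 3-4

for y in (0, 1):
    theta_y = T_y[y] @ theta
    print(y, theta_y.tolist())    # each T_y @ theta stays on the simplex
```

Because each column of T_y sums to one, T_y θ is again a distribution over topics, so the usual LDA word-generation step applies unchanged.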
38. Simplex Geometry
[Figure: the topic simplex embedded in the word simplex with vertices w1, w2, w3.]
39. Interpretation 1
- Shared topics vs. class-specific topics
40. Interpretation 2
- Generative model from T: add a new latent variable u
41. Comparison with the AT model
- Author-Topic model [Rosen-Zvi et al. 2004]
[Figure: AT model vs. discLDA graphical models.]
42. Inference and learning
43. Learning
- For fixed T, learn the topic parameters by sampling (z, u): Rao-Blackwellized Gibbs sampling
- For fixed topic parameters, update T using stochastic gradient ascent on the conditional log-likelihood, in an online fashion
  - get an approximate gradient using Monte Carlo EM
  - use the Harmonic Mean estimator for the likelihood estimate
- Currently, results are noisy
44. Inference (dimensionality reduction)
- Given the learned T and topic parameters:
  - estimate the label posterior using the Harmonic Mean estimator
  - compute the topic proportions by marginalizing over y, giving the new representation of the document
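The marginalization step can be sketched numerically: weight each class-transformed topic vector by an estimated label posterior and average, giving one label-free representation of the document. All numbers below are toy values; the posterior would come from the Harmonic Mean estimate:

```python
import numpy as np

def reduced_representation(theta, Ts, p_y):
    """E[T_y @ theta] under the label posterior p_y."""
    return sum(p * (T @ theta) for p, T in zip(p_y, Ts))

theta = np.array([0.5, 0.3, 0.2])            # shared topic proportions
T0 = np.zeros((5, 3)); T0[[0, 1, 2], range(3)] = 1.0  # toy class-0 map
T1 = np.zeros((5, 3)); T1[[0, 3, 4], range(3)] = 1.0  # toy class-1 map
p_y = [0.8, 0.2]                             # estimated label posterior
r = reduced_representation(theta, [T0, T1], p_y)
print(r.tolist())  # mixes the class-0 and class-1 placements of theta
```

The result is still a distribution over the transformed topics, so it can feed directly into a downstream classifier.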
45. Preliminary Experiments
46. 20 Newsgroups dataset
- 11k training documents, 7.5k test documents, 50k-word vocabulary
- Used a fixed T
- Get the reduced representation → train a linear SVM on it
- hence 110 topics
47. Classification results
- discLDA + SVM: 20% error
- LDA + SVM: 25% error
- discLDA predictions: 20% error
48. Newsgroup embedding (LDA)
49. Newsgroup embedding (discLDA)
50. Using t-SNE (on discLDA)
- thanks to Laurens van der Maaten (Hinton's group) for the figure!
51. Using t-SNE (on LDA)
- thanks to Laurens van der Maaten (Hinton's group) for the figure!
52. Learned topics
53. Another embedding
- NIPS papers vs. Psychology abstracts
[Figure: LDA embedding of the two collections.]
54. 13-scenes dataset [Fei-Fei 2005]
- train: 100 images per category
- test: 2558 images
55. Vocabulary (visual words)
56. Topics
57. Conclusion
- A fixed transformation T enables exploration of topic sharing
- Get a reduced representation which preserves predictive power
- Noisy gradient estimates: still work in progress
- Will probably try a variational approach instead