Title: Learning Structured Classifiers for Statistical Dependency Parsing
1. Learning Structured Classifiers for Statistical Dependency Parsing
Qin Iris Wang
Joint work with Dekang Lin and Dale Schuurmans
University of Alberta
September 11, 2007
2. Ambiguities In NLP
I saw her duck.
How about "I saw her duck with a telescope"?
3. Dependency Trees vs. Constituency Trees
[Figure: a constituency tree for "Mike ate the cake", with S, NP, VP, V, Dt and N nodes.]
4. Dependency Trees
- A dependency tree structure for a sentence represents syntactic relations between word pairs in the sentence
[Figure: a dependency tree for "I saw her duck with a telescope", with labeled relations subj, obj, gen, mod, det.]
Over 100 possible trees!
1M trees for a 20-word sentence
5. Dependency Parsing
- An increasingly active research area (Yamada & Matsumoto 2003, McDonald et al. 2005, McDonald & Pereira 2006, Corston-Oliver et al. 2006, Smith & Eisner 2005, Smith & Eisner 2006)
- Dependency trees are much easier to understand and annotate than other syntactic representations
- Dependency relations have been widely used in
  - Machine translation (Fox 2002, Cherry & Lin 2003, Ding & Palmer 2005)
  - Information extraction (Culotta & Sorensen 2004)
  - Question answering (Pinchak & Lin 2006)
  - Coreference resolution (Bergsma & Lin 2006)
6. Overview
- Dependency parsing model
- Learning dependency parsers
- Strictly lexicalized dependency parsing
- Structured boosting
- Experimental results
- Conclusion
7. Dependency Parsing Model
- W: an input sentence; T: a candidate dependency tree
- A dependency link connects a word i to a word j
- The model is defined over the set of possible dependency trees for W
- Can be applied to both probabilistic and non-probabilistic models
- Edge/link-based factorization: the score of a tree is the sum of the scores of its links
8. Scoring Functions
- The score of a link is the dot product of a vector of feature weights and a vector of features (sketched below)
- Feature weights can be learned either locally or globally
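A minimal sketch of the edge/link-based factorization with a linear scoring function. The feature extractor `link_features` is a placeholder for whichever arc features are used; the function names here are assumptions for illustration, not the authors' code.

```python
def link_score(weights, sentence, i, j, link_features):
    """Score of a single dependency link (i, j): dot product of weights and features."""
    return sum(weights.get(name, 0.0) * value
               for name, value in link_features(sentence, i, j).items())

def tree_score(weights, sentence, tree, link_features):
    """Score of a candidate tree T: the sum of the scores of its links."""
    # tree is a collection of (head, modifier) index pairs
    return sum(link_score(weights, sentence, i, j, link_features)
               for (i, j) in tree)
```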
9. Features for an arc
- Word-pair indicator
- Part-of-Speech (POS) tags of the word pair
- Pointwise Mutual Information (PMI) for that word pair
- Distance between the words
These features are numerous and sparse, so some abstraction or smoothing is needed.
10. Score of Each Link / Word Pair
- The score of each link is based on the features of the word pair, e.g. (skipped, regularly):
  - POS(skipped, regularly) = (VBD, RB)
  - PMI(skipped, regularly) = 0.27
  - dist(skipped, regularly) = 2
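A sketch of extracting the arc features listed above for a word pair. The precomputed POS tags and the PMI lookup table passed in here are hypothetical inputs, assumed only for illustration.

```python
def link_features(sentence, i, j, pos_tags, pmi_table):
    """Arc features for the word pair (sentence[i], sentence[j])."""
    w1, w2 = sentence[i], sentence[j]
    return {
        "W1W2_%s_%s" % (w1, w2): 1.0,                    # word-pair indicator
        "T1T2_%s_%s" % (pos_tags[i], pos_tags[j]): 1.0,  # POS tags of the pair
        "PMI": pmi_table.get((w1, w2), 0.0),             # pointwise mutual information
        "Dist_%d" % abs(i - j): 1.0,                     # distance between the words
    }

# e.g. link_features(["The", "boy", "skipped", "school", "regularly"],
#                    2, 4, ["DT", "NN", "VBD", "NN", "RB"],
#                    {("skipped", "regularly"): 0.27})
```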
11. Score of a Tree
[Figure: a dependency tree for "The boy skipped school regularly"; its score is the sum of the scores of its links.]
12. My Work on Dependency Parsing
- 1. Strictly Lexicalized Dependency Parsing (Wang et al. 2005)
  - MLE + similarity-based smoothing
- 2. Improving Large Margin Training (Wang et al. 2006)
  - Local constraints capture local errors in a parse tree
  - Laplacian regularization enforces similar links to have similar weights, introducing the similarity-based smoothing technique into the large margin framework
- 3. Structured Boosting (Wang et al. 2007)
  - Global optimization, efficient and flexible
Focus of this talk
13. Overview
- Dependency parsing model
- Learning dependency parsers
- Strictly lexicalized dependency parsing
- Structured boosting
- Experimental results
- Conclusion
14. Strictly Lexicalized Dependency Parsing
-- IWPT 2005
15. POS Tags for Handling Sparseness in Parsing
- All previous parsers use a POS lexicon
- Natural language data is sparse
  - Bikel (2004) found that only 1.49% of the needed bi-gram statistics were observed in the Penn Treebank
- Words belonging to the same POS are expected to have the same syntactic behavior
16. However,
- POS tags are not part of natural text
  - They need to be annotated by human effort
  - This introduces more noise into the training data
- For some languages, such as Chinese or Japanese, POS tags are not clearly defined
  - A single word is often combined with other words
Can we use another smoothing technique rather than POS?
17. Strictly Lexicalized Dependency Parsing
- All the features are based on word statistics; no POS tags are needed
- Similarity-based smoothing is used to deal with data sparseness
18. An Alternative Method to Deal with Sparseness
- Distributional word similarities
  - Distributional Hypothesis: words that appear in similar contexts have similar meanings (Harris, 1968)
  - Soft clusters of words, whereas POS tags are hard clusters of words
- Advantages of using similarity smoothing
  - Computed automatically from raw text
  - Makes treebank construction easier
19. Similarity Between Word Pairs
- Similarity between two words, Sim(w1, w2) (Lin, 1998)
  - Construct a feature vector for w that contains a set of words occurring within a small context window
  - Compute the similarity between the two feature vectors using the cosine measure
  - e.g., Sim(skipped, missed) = 0.134966
- Similarity between word pairs: the geometric average of the word similarities (sketched below)
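A sketch of the two similarity measures above: cosine similarity between two words' context-feature vectors, and word-pair similarity as the geometric average of the two word similarities. Feature vectors are plain dictionaries here; the exact feature weighting scheme is left out.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(v1[f] * v2.get(f, 0.0) for f in v1)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def word_sim(vectors, w1, w2):
    return cosine(vectors[w1], vectors[w2])

def pair_sim(vectors, pair1, pair2):
    """Sim((u1, u2), (v1, v2)) = sqrt(Sim(u1, v1) * Sim(u2, v2))."""
    s1 = word_sim(vectors, pair1[0], pair2[0])
    s2 = word_sim(vectors, pair1[1], pair2[1])
    return math.sqrt(max(s1, 0.0) * max(s2, 0.0))
```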
20. Similarity-based Smoothing
[Figure: the example sentence "The kid skipped school regularly" with frequency counts for its words and word pairs, motivating similarity-based smoothing for rare pairs.]
21. Similarity-based Smoothing
Estimating P( · | skipped, regularly) from similar contexts:
S(regularly) = {frequently 0.365862, routinely 0.286178, periodically 0.273665, often 0.24077, constantly 0.234693, occasionally 0.226324, who 0.200348, continuously 0.194026, repeatedly 0.177434}
S(skipped) = {skipping 0.229951, skip 0.197991, skips 0.169982, sprinted 0.140535, bounced 0.139547, missed 0.134966, cruised 0.133933, scooted 0.13387, jogged 0.133638}
22. Similarity-based Smoothing
(Continues the example: the similar-word lists S(skipped) and S(regularly) from the previous slide.)
23. Similarity-based Smoothing
Estimating P( · | skipped, regularly) from similar word pairs that were seen in the training data:
skip frequently, skip routinely, skip repeatedly, bounced often, bounced who, bounced repeatedly
These are the similar contexts of (skipped, regularly).
24. Similarity-based Smoothing
[Figure: the smoothed estimate for the context (skipped, regularly) combines the MLE-based probability of the context itself with a similarity-based probability computed from its similar contexts (skip frequently, skip routinely, skip repeatedly, bounced often, bounced who, bounced repeatedly).]
25. Similarity-based Smoothing
P(E | C) = a * P_MLE(E | C) + (1 - a) * P_SIM(E | C)
- a depends on |C|, the frequency count of the corresponding context C in the training data
- e.g., |(skipped, regularly)| = 1, |(The, kid)| = 95
A sketch of this interpolation follows below.
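A minimal sketch of the interpolation above. The exact forms of the mixing weight a and of P_SIM are assumptions here: a grows with the frequency of the context C, and P_SIM averages the MLE estimates over similar contexts, weighted by their similarity to C.

```python
def p_smoothed(event, context, counts, similar_contexts, p_mle):
    """Interpolate MLE and similarity-based estimates of P(event | context)."""
    freq = counts.get(context, 0)
    alpha = freq / (freq + 5.0)            # hypothetical: more data -> trust MLE more
    sims = similar_contexts(context)        # list of (similar_context, similarity) pairs
    total = sum(s for _, s in sims) or 1.0
    p_sim = sum(s * p_mle(event, c) for c, s in sims) / total
    return alpha * p_mle(event, context) + (1.0 - alpha) * p_sim
```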
26. Comparison With an Unlexicalized Model
- In the unlexicalized model, the input to the parser is the sequence of POS tags, the opposite of our model, which uses only words
- Using gold standard POS tags
- Accuracy of the unlexicalized model: 71.1%
- Our strictly lexicalized model: 79.9%
27. Overview
- Dependency parsing model
- Learning dependency parsers
- Strictly lexicalized dependency parsing
- Structured boosting
- Experimental results
- Conclusion
28. Simple Training of Dependency Parsers via Structured Boosting
-- IJCAI 2007
29. Our Contributions
- Structured boosting: a simple approach to training structured classifiers by applying a boosting-like procedure to standard supervised training methods
- Advantages
  - Simple
  - Inexpensive
  - General
- Successfully applied to dependency parsing
30. Local Training Examples
- Given training data (S, T), generate local examples:

  Word-pair          Link-label  Weight  Features
  The-boy            L           1       W1_The, W2_boy, W1W2_The_boy, T1_DT, T2_NN, T1T2_DT_NN, Dist_1, ...
  boy-skipped        L           1       W1_boy, W2_skipped, ...
  skipped-school     R           1       W1_skipped, W2_school, ...
  skipped-regularly  R           1       W1_skipped, W2_regularly, ...
  The-skipped        N           1       W1_The, W2_skipped, ...
  The-school         N           1       W1_The, W2_school, ...

  (L = left link, R = right link, N = no link)
31. Local Training Methods
- Learn a local link classifier given a set of features defined on the local examples
- For each word pair in a sentence: no link, left link, or right link?
  - A 3-class classification problem (a sketch follows below)
- Any classifier can be used as the link classifier for parsing
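A sketch of turning a training sentence and its gold tree into the local 3-class examples shown above (L = left link, R = right link, N = no link), one per word pair, each with an initial weight of 1. The dictionary layout is an assumption for illustration.

```python
def local_examples(sentence, gold_links):
    """gold_links: set of (head_index, modifier_index) pairs for the gold tree."""
    examples = []
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence)):
            if (j, i) in gold_links:
                label = "L"      # head on the right, modifier on the left
            elif (i, j) in gold_links:
                label = "R"      # head on the left, modifier on the right
            else:
                label = "N"      # no dependency between the two words
            examples.append({"pair": (sentence[i], sentence[j]),
                             "label": label, "weight": 1.0})
    return examples
```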
32. Combining Local Training with a Parsing Algorithm
[Flow diagram: training sentences (S, T) -> local training examples -> local link classifier h -> link scores -> dependency parsing algorithm -> dependency trees. A standard application of machine learning.]
33. Parsing With a Local Link Classifier
- Learn the weight vector over a set of features defined on the local examples
  - Maximum entropy models (Ratnaparkhi 1999, Charniak 2000)
  - Support vector machines (Yamada and Matsumoto 2003)
- The parameters of the local model are not trained with global parsing accuracy in mind
- Global training can do better
34. Global Training for Parsing
- Directly capture the relations between the links of an output tree
- Incorporate the effects of the parser directly into the training algorithm
  - Structured SVMs (Tsochantaridis et al. 2004)
  - Max-margin parsing (Taskar et al. 2004)
  - Online large-margin training (McDonald et al. 2005)
  - Improving large-margin training (Wang et al. 2006)
35. But, Drawbacks
- Unfortunately, these structured training techniques are
  - Expensive
  - Specialized
  - Complex to implement
- They require a great deal of refinement and computational resources to apply to parsing
We need efficient global training algorithms!
36. Structured Boosting
- A simple variant of standard boosting algorithms such as AdaBoost.M1 (Freund & Schapire 1997)
- Global optimization
- As efficient as local methods
- General: can use any local classifier
- Can also be easily applied to other tasks
37. Standard Boosting for Classification
[Diagram: a local predictor h is repeatedly trained on the training examples; after each round the weights of mis-classified examples are increased, yielding a sequence of classifiers h1, h2, h3, ..., hk.]
38. Structured Boosting (An Example)
[Figure: the gold dependency tree for "I saw her duck with a telescope" compared with the parser's output over boosting iterations. At iterations 1, 2, ... the output is wrong and the weights of the mis-parsed local examples are increased ("Try harder!"); by iteration T the output matches the gold tree ("Good job!").]
39. Structured Boosting (An Example)
[Figure: how the weights of local examples for "I saw her duck with a telescope" (e.g., saw-with and duck-with) change over boosting iterations 1, 2, ..., T.]
40. Structured Boosting for Dependency Parsing
[Flow diagram: training sentences (S, T) -> local training examples -> local link classifier h -> link scores -> dependency parsing algorithm -> dependency trees. The parsed trees are compared with the gold standard trees, the mis-parsed local examples are re-weighted, and the loop is repeated, producing classifiers h1, h2, h3, ..., hk. This gives global training that remains efficient.]
41. Overview
- Dependency parsing model
- Learning dependency parsers
- Strictly lexicalized dependency parsing
- Structured boosting
- Experimental results
- Conclusion
42. Experimental Design
- Data sets (Linguistic Data Consortium), split into training, development and test sets
  - PTB3: 50K sentences (standard split)
  - CTB4: 15K sentences (split as in Wang et al. 2005)
  - CTB5: 19K sentences (split as in Corston-Oliver et al. 2006)
- Features
  - Word-pair indicator, POS pair, PMI, context features, distance
43. Experimental Design (Cont.)
- Local link classifier
  - Logistic regression model / maximum entropy
- Boosting method
  - A variant of AdaBoost.M1 (Freund & Schapire 1997)
44. Results
[Table: accuracy on Chinese and English (%).]
Dependency accuracy = the percentage of words that have the correct head.
45. Comparison with State of the Art
[Table: comparison with the state of the art; results from IWPT 2005 and IJCAI 2007 on Chinese Treebank 4.0 and Chinese Treebank 5.0.]
46. Overview
- Dependency parsing model
- Learning dependency parsers
- Strictly lexicalized dependency parsing
- Structured boosting
- Experimental results
- Conclusion
47. Conclusion
- Similarity-based smoothing is an alternative to POS tags for dealing with data sparseness
- Structured boosting is an efficient and effective approach to coordinating local link classifiers with global parsing accuracy
- Both of the above techniques have been successfully applied to dependency parsing
48. More on Structured Learning
- Improved Estimation for Unsupervised Part-of-Speech Tagging (Wang & Schuurmans 2005)
- Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization (Wang, Cherry, Lizotte and Schuurmans 2006)
- Learning Noun Phrase Query Segmentation (Bergsma and Wang 2007)
- Semi-supervised Topic Segmentation of Web Docs (in progress)
49. Thanks!
Questions?
50. [Diagram: this work sits at the intersection of NLP, which contributes features and linguistic intuitions, and machine learning, which contributes models, training criteria, regularization and smoothing.]
51. What Have I Worked On?
Structured learning: the output is a set of inter-dependent labels with a specific structure (e.g., a parse tree)
- Dependency parsing
- Part-of-Speech tagging
- Query segmentation
52. Ambiguities In NLP
I like eating sushi with tuna.
(Courtesy of Aravind Joshi)
53. Dependency Tree
- A dependency tree structure for a sentence represents syntactic relationships between word pairs in the sentence
[Figure: two alternative dependency trees for "I like eating sushi with tuna", with labeled relations subj, obj, mod; in one, "with tuna" attaches to "sushi", and in the other it attaches to "eating".]
54. Structured Boosting
- Train a local link predictor, h1
- Re-parse the training data using h1
- Re-weight the local examples
  - Compare the parser's outputs with the gold standard trees
  - Increase the weight of mis-parsed local examples
- Re-train the local link predictor, obtaining h2
- Finally we have h1, h2, ..., hk (a sketch follows below)
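A sketch of the structured boosting loop described above. The local learner `fit`, the parser `parse` (which returns a predicted link label per word pair), and `make_examples` are placeholders for whichever components are plugged in; they are assumptions here, not the authors' implementation.

```python
def structured_boost(train_sents, gold_trees, make_examples, fit, parse,
                     rounds=10, step=1.0):
    """Boost a local link classifier against global parsing errors."""
    examples = [make_examples(s, t) for s, t in zip(train_sents, gold_trees)]
    classifiers = []
    for _ in range(rounds):
        # train the local link classifier on the (weighted) local examples
        h = fit([ex for sent_ex in examples for ex in sent_ex])
        classifiers.append(h)
        for sent, sent_ex in zip(train_sents, examples):
            predicted = parse(sent, h)           # dict: word pair -> predicted label
            for ex in sent_ex:                   # re-weight mis-parsed local examples
                if predicted.get(ex["pair"]) != ex["label"]:
                    ex["weight"] += step         # "try harder" on this example
    return classifiers                           # h1, ..., hk, combined at test time
```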
55. However
- Exponential number of constraints (the number of incorrect trees)
- The loss ignores the local errors of the parse tree
- Over-fitting the training corpus
  - The large number of bi-lexical/word-pair features needs a good smoothing (regularization) method
56. Variants of Structured Boosting
- Using alternative boosting algorithms for structured boosting
  - AdaBoost.M2 (Freund & Schapire 1997)
    - Re-weighting class labels
  - The logistic regression form of boosting (Collins et al. 2002)
57. Improving Large Margin Training (Wang et al. 2006)
- A margin is created between the correct dependency tree and each incorrect dependency tree, at least as large as the loss of the incorrect tree
- Our contributions
  - Local constraints to capture local errors in a parse tree
  - Laplacian regularization to deal with data sparseness, which introduces the similarity-based smoothing technique into the large margin framework
58. (Existing) Large Margin Training
- Has been used for parsing
  - Tsochantaridis et al. 2004
  - Taskar et al. 2004
- State-of-the-art performance in dependency parsing
  - McDonald et al. 2005a, 2005b, 2006
59. Large Margin Training
- Minimize a regularized loss (Hastie et al., 2004), where i indexes the training sentences, Ti is the target tree, Li is a candidate tree, and the loss involves the distance between the two trees (a formula is sketched below)
60. Large Margin Training (McDonald 2005)
- Exponential number of constraints (the number of incorrect trees)
- The loss ignores the local errors of the parse tree
- Over-fitting the training corpus
  - The large number of bi-lexical/word-pair features needs a good smoothing (regularization) method
61. Local Constraints (an example)
[Figure: the correct dependency tree for "The boy skipped school regularly", with its links numbered and incorrect alternative links marked with their loss.]
Each local constraint requires a correct link to outscore an incorrect link by a margin, e.g.:
  score(The, boy) >= score(The, skipped) + 1
  score(boy, skipped) >= score(The, skipped) + 1
  score(skipped, school) >= score(school, regularly) + 1
  score(skipped, regularly) >= score(school, regularly) + 1
Only a polynomial number of constraints!
62. Local Constraints
[Figure: each local constraint pairs a correct link with a missing/incorrect link that shares a common node (w1); the resulting constraints are convex. A sketch of generating such constraints follows below.]
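A sketch of generating the local constraints illustrated in the example: for every correct link and every incorrect link that shares a word with it, require the correct link to outscore the incorrect one by a margin of 1. The exact constraint set used by the authors may differ; this simply follows the example above.

```python
def local_constraints(n_words, gold_links):
    """gold_links: set of (head, modifier) index pairs of the correct tree.
    Returns (correct_link, incorrect_link) pairs meaning
    score(correct_link) >= score(incorrect_link) + 1."""
    all_links = {(i, j) for i in range(n_words) for j in range(n_words) if i != j}
    constraints = []
    for good in gold_links:
        for bad in all_links - gold_links:
            if set(good) & set(bad):              # the two links share a common word
                constraints.append((good, bad))
    return constraints                            # polynomially many constraints
```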
63. Objective with Local Constraints
- The corresponding new quadratic program has only a polynomial number of constraints
- j: the number of constraints in the constraint set A
64. Laplacian Regularization
- Enforce similar links (word pairs) to have similar weights (a sketch follows below)
- L(S) = D(S) - S, where D(S) is a diagonal matrix, S is the similarity matrix of word pairs, and L(S) is the Laplacian matrix of S
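A sketch of the Laplacian regularizer, assuming a symmetric word-pair similarity matrix S and D(S) diagonal with the row sums of S; how this penalty is weighted inside the full objective is not shown here.

```python
import numpy as np

def laplacian(S):
    """L(S) = D(S) - S, with D(S) = diag of the row sums of S."""
    D = np.diag(S.sum(axis=1))
    return D - S

def laplacian_penalty(theta, S):
    """theta^T L(S) theta = 0.5 * sum_ij S_ij (theta_i - theta_j)^2 for symmetric S,
    which is small exactly when similar word pairs have similar weights."""
    return theta @ laplacian(S) @ theta
```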
65. Similarity Between Word Pairs
- Similarity between two words, Sim(w1, w2)
  - Cosine measure
- Similarity between word pairs: the geometric average of the word similarities
66. Refined Large Margin Objective
- The Laplacian regularizer applies only to the bi-lexical features
67. Unsupervised POS Tagging (Wang & Schuurmans 2005)
- Weaknesses of the transition model and emission model in HMM tagging
  - Poorly learned transition parameters
  - No form of parameter tying over the emission model
- Our ideas
  - Transition model: marginally constrained HMMs
  - Emission model: similarity-based smoothing
68. Parameters of an HMM
[Figure: the HMM graphical model, with hidden tag states T1, ..., Ti-1, Ti, Ti+1, ..., Tn emitting words W1, ..., Wi-1, Wi, Wi+1, ..., Wn.]
69. We Did Better
- Improved Estimation for Unsupervised Part-of-Speech Tagging (Wang & Schuurmans, 2005)
- Full/unfiltered lexicon
  - 77.2% (Banko and Moore 2004)
  - 90.5% (our model)
- Reduced/filtered lexicon
  - 95.9% (Banko and Moore 2004)
  - 94.7% (our model)
70. Query Segmentation (Bergsma & Wang 2007)
- Input: a search engine query
- Output: the query separated into phrases
- Goal: improve information retrieval
- Approach: supervised machine learning
  - A classifier makes the segmentation decisions
- Conclusion: richer features allow for large increases in segmentation performance
71. Query Segmentation
- Example query
  - two man power saw
- Output segmentations
  - several alternative bracketings of "two man power saw" into phrases, differing in where the phrase boundaries are placed
  - etc.
72. Query Segmentation
- Unsegmented
  - "two man power saw" is treated as the separate words: two, man, power, saw
73. Query Segmentation
74. Query Segmentation
75. Semi-supervised Dependency Parsing
- Unsupervised/semi-supervised dependency parsing
  - EM (expensive)
  - Using a discriminative, convex, unsupervised structured learning algorithm (Xu et al. 2006)
- Combining a supervised structured large margin loss with a cheap unsupervised least squares loss on unlabeled data (much cheaper)
76. Topic Segmentation of Web Docs
- A structured classification problem
  - Input: a document containing a sequence of k sentences
  - Output: a sequence of break decisions (each sentence boundary is a possible segmentation point)
- Goal: segment a document into a few blocks according to subtopic
- Approach: semi-supervised training (combining a supervised large-margin training loss with an unsupervised least squares loss)
77. Experimental Results
[Table: dependency accuracy on Chinese Treebank (CTB) 4.0; an undirected tree plus a root gives a directed tree.]
78. Dynamic Features
- Also known as non-local features
- Take into account the link labels of the surrounding word pairs when predicting the label of the current pair
- Commonly used in sequential labeling (McCallum et al. 2000, Toutanova et al. 2003)
- A simple but useful idea for improving parsing accuracy
  - Wang et al. 2005
  - McDonald and Pereira 2006
79. Dynamic Features
[Figure: two dependency trees, one for "I saw her duck with a telescope" and one for "I saw her duck with a spot", contrasting the attachment of "with".]
- Define a canonical order so that a word's children are generated first, before it modifies another word
- "telescope"/"spot" are the dynamic features for deciding whether to generate a link between "saw" and "with" or between "duck" and "with"
80. Results - 1
[Table 1: boosting with static features.]
81. Results - 2
[Table 2: boosting with dynamic features.]
82. Feature Representation
- Represent a word w by a feature vector
- The value of a feature c is a function of P(w, c), the probability that w and c co-occur in a context window (see the note below)
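The slide references a formula for the feature value. A common choice in Lin (1998)-style distributional word representations, and consistent with the PMI features used earlier in the deck, is the pointwise mutual information between the word and the context; this exact form is an assumption here:

$$
f_c(w) = \mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}
$$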
83. Similarity-based Smoothing
- Similarity measure: cosine
- As in (Dagan et al., 1999)