Title: Learning Structured Classifiers for Statistical Dependency Parsing
1. Learning Structured Classifiers for Statistical Dependency Parsing
- Qin Iris Wang
- Supervisors: Dekang Lin, Dale Schuurmans
- University of Alberta
- May 7th, 2008
2. Ambiguities in NLP
I saw her duck.
How about: "I saw her duck with a telescope"?
3. Dependency Trees
- A dependency tree structure for a sentence represents syntactic relations between word pairs in the sentence
- head → modifier; each word has one and only one head
(Figure: a dependency tree for "I saw her duck with a telescope".)
Over 100 possible trees!
About 1 million trees for a 20-word sentence
4. My Thesis Research
- Goal
  - Improve dependency parsing via different statistical machine learning approaches
- Method
  - Tackle the problem from different angles
  - Employ and develop advanced supervised and semi-supervised machine learning techniques
- Achievement
  - State-of-the-art accuracy for English and Chinese
5. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
6. Dependency Parsing Model
- X: an input sentence; Y: a candidate dependency tree
- (i, j): a dependency link from word i to word j
- T(X): the set of possible dependency trees over X
- Edge/link-based factorization: score(X, Y) = Σ_{(i, j) ∈ Y} w · f(i, j), where f(i, j) is a vector of features and w is a vector of feature weights
(A minimal scoring sketch follows.)
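To make the edge-factored model concrete, here is a minimal sketch in Python. The feature templates, weights, and the toy sentence are illustrative assumptions, not the exact features or implementation used in the thesis.

```python
# Minimal sketch of edge/link-factored scoring: the tree score is the sum of
# per-link scores, each a dot product of a weight vector and a feature vector.

from typing import Dict, List, Tuple

def arc_features(words: List[str], head: int, mod: int) -> List[str]:
    """Features for a single dependency link (head -> modifier)."""
    return [
        f"pair={words[head]}_{words[mod]}",   # word-pair indicator
        f"dist={head - mod}",                 # signed distance between the words
    ]

def tree_score(words: List[str],
               tree: List[Tuple[int, int]],
               w: Dict[str, float]) -> float:
    """score(X, Y) = sum over links (i, j) in Y of w . f(i, j)."""
    return sum(w.get(feat, 0.0)
               for (h, m) in tree
               for feat in arc_features(words, h, m))

# Toy usage: "I saw her" with links saw->I and saw->her (0-indexed).
words = ["I", "saw", "her"]
tree = [(1, 0), (1, 2)]
weights = {"pair=saw_I": 1.2, "pair=saw_her": 0.8, "dist=1": 0.1, "dist=-1": 0.1}
print(tree_score(words, tree, weights))
```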
7. Features for an Arc
Lots of features, and very sparse!
- Word-pair indicator
- Part-of-speech (POS) tags of the word pair
- Pointwise mutual information (PMI) for that word pair
- Distance between the words
- ...
Abstraction or smoothing (e.g., the POS and PMI features) helps with this sparsity; a feature sketch follows.
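A hedged sketch of the arc feature templates listed above (word-pair indicator, POS pair, PMI, distance). The POS tags and the PMI table below are made up for illustration, apart from the PMI value 0.27 that appears later in the slides.

```python
# Sparse feature vector for one candidate link, combining lexical indicators
# with more abstract (POS, PMI, distance) features.

from typing import Dict, List, Tuple

def arc_feature_vector(words: List[str],
                       tags: List[str],
                       pmi: Dict[Tuple[str, str], float],
                       head: int,
                       mod: int) -> Dict[str, float]:
    """Return a sparse feature vector (name -> value) for one candidate link."""
    w_h, w_m = words[head], words[mod]
    return {
        f"pair={w_h}_{w_m}": 1.0,               # word-pair indicator (sparse)
        f"pos={tags[head]}_{tags[mod]}": 1.0,   # POS pair (abstraction)
        "pmi": pmi.get((w_h, w_m), 0.0),        # real-valued PMI (smoothing)
        f"dist={abs(head - mod)}": 1.0,         # distance between the words
    }

words = ["I", "skipped", "school", "regularly"]
tags = ["PRP", "VBD", "NN", "RB"]
pmi = {("skipped", "regularly"): 0.27}
print(arc_feature_vector(words, tags, pmi, 1, 3))
```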
8. My Work on Dependency Parsing
- 1. Strictly Lexicalized Dependency Parsing (Wang et al. 2005)
  - MLE with similarity-based smoothing
- 2. Improving Large Margin Training (Wang et al. 2006)
  - Local constraints capture local errors in a parse tree
  - Laplacian regularization: enforce similar links to have similar weights (similarity smoothing)
  - Semi-supervised large margin training
- 3. Structured Boosting (Wang et al. 2007)
  - Global optimization, efficient and flexible
- 4. Semi-supervised Convex Training (Wang et al. 2008)
  - Combine a large margin loss with a least squares loss
9. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
10. Strictly Lexicalized Dependency Parsing
-- IWPT 2005
11. Contributions
- All features are based on word statistics; no POS tags or grammatical categories are needed
- Similarity-based smoothing is used to deal with data sparseness
12. POS Tags for Handling Sparseness in Parsing
- All previous parsers use a POS lexicon
- Natural language data is sparse
  - Bikel (2004) found that, of all the needed bigram statistics, only 1.49% were observed in the Penn Treebank
- Words belonging to the same POS are expected to have the same syntactic behavior
13. An Alternative Method to Deal with Sparseness
- Distributional word similarities
  - Distributional Hypothesis: words that appear in similar contexts have similar meanings (Harris, 1968)
  - Soft clusters of words, whereas POS tags are hard clusters of words
- Advantages of using similarity smoothing
  - Computed automatically from raw text
  - Makes the construction of a treebank easier
14. Similarity-based Smoothing
- Goal: estimate a probability for the context (skipped, regularly), using the distributionally similar words of each word:
- S(regularly): frequently 0.365862, routinely 0.286178, periodically 0.273665, often 0.24077, constantly 0.234693, occasionally 0.226324, who 0.200348, continuously 0.194026, repeatedly 0.177434
- S(skipped): skipping 0.229951, skip 0.197991, skips 0.169982, sprinted 0.140535, bounced 0.139547, missed 0.134966, cruised 0.133933, scooted 0.13387, jogged 0.133638
15. Similarity-based Smoothing
- For the word pair (skipped, regularly), similar contexts are word pairs actually seen in the training data, e.g.:
  - (skip, frequently), (skip, routinely), (skip, repeatedly), (bounced, often), (bounced, who), (bounced, repeatedly)
16. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
17. Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization
-- CoNLL 2006
18. Contributions
- Large margin vs. generative training
- Included distance and PMI features, but fewer dynamic features
- Local constraints to capture local errors in a parse tree
- Laplacian regularization (based on distributional word similarity) to deal with data sparseness
- Semi-supervised large margin training
19. (Existing) Large Margin Training
Exponentially many constraints!
- However,
  - Exponential number of constraints
  - The loss ignores the local errors of the parse tree
  - A large number of bi-lexical features, over-fitting the training corpus
20. Local Constraints (An Example)
(Figure: the sentence "The boy skipped school regularly" with its gold dependency links and the per-link loss of an incorrect parse.)
Each correct link should out-score an incorrect link that shares a node with it, by a margin of 1 (a sketch that generates such constraints follows):
- score(The, boy) ≥ score(The, skipped) + 1
- score(boy, skipped) ≥ score(The, skipped) + 1
- score(skipped, school) ≥ score(school, regularly) + 1
- score(skipped, regularly) ≥ score(school, regularly) + 1
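A hedged sketch of how such local constraints could be enumerated: each correct link is paired with incorrect links that share a word with it, and is required to out-score them by a margin of 1. The exact pairing used in the thesis may be more restrictive.

```python
# Enumerate (correct_link, incorrect_link) pairs, each standing for the
# constraint score(correct) >= score(incorrect) + 1.

from typing import List, Set, Tuple

Link = Tuple[int, int]   # (head index, modifier index)

def local_constraints(n_words: int, gold: Set[Link]) -> List[Tuple[Link, Link]]:
    constraints = []
    all_links = {(h, m) for h in range(n_words) for m in range(n_words) if h != m}
    for correct in gold:
        for wrong in all_links - gold:
            # keep only incorrect links that share a word with the correct link
            if set(correct) & set(wrong):
                constraints.append((correct, wrong))
    return constraints

# "The(0) boy(1) skipped(2) school(3) regularly(4)"
gold = {(1, 0), (2, 1), (2, 3), (2, 4)}   # boy->The, skipped->boy/school/regularly
print(len(local_constraints(5, gold)))
```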
21. Laplacian Regularization
- Enforce similar links (word pairs) to have similar weights (see the sketch below)
- S: similarity matrix of the word pairs
- D(S): a diagonal matrix with D(S)_ii = Σ_j S_ij
- L(S): the Laplacian matrix of S, L(S) = D(S) - S
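A minimal sketch of the Laplacian regularizer, assuming the standard quadratic form w^T L(S) w over the bi-lexical feature weights; numpy is used purely for illustration.

```python
# w^T L(S) w = (1/2) * sum_{p,q} S[p][q] * (w[p] - w[q])^2, which is small when
# similar word pairs (high S[p][q]) receive similar weights.

import numpy as np

def laplacian(S: np.ndarray) -> np.ndarray:
    """L(S) = D(S) - S, with D(S) diagonal and D_ii = sum_j S_ij."""
    return np.diag(S.sum(axis=1)) - S

def laplacian_penalty(w: np.ndarray, S: np.ndarray) -> float:
    """Quadratic penalty w^T L(S) w on the bi-lexical feature weights."""
    return float(w @ laplacian(S) @ w)

# Three bi-lexical features; the first two correspond to similar word pairs.
S = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
print(laplacian_penalty(np.array([1.0, 1.1, -0.5]), S))   # small: similar weights
print(laplacian_penalty(np.array([1.0, -1.0, -0.5]), S))  # larger: dissimilar weights
```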
22. Refined Large Margin Objective
- The Laplacian regularization term applies only to the bi-lexical features
- Polynomially many constraints!
23. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
24. Simple Training of Dependency Parsers via Structured Boosting
-- IJCAI 2007
25. Contributions
- Structured boosting: a simple approach to training structured classifiers by applying a boosting-like procedure to standard supervised training methods
- Advantages
  - Global training
  - Simple and efficient
  - General; can be easily applied to other tasks
  - Successfully applied to dependency parsing
26. Local Training Examples
- Given training data (X, Y), generate one local example per word pair (a sketch of this step follows the table)

Word-pair          Link-label  Instance_weight  Features
The-boy            L           1                W1_The, W2_boy, W1W2_The_boy, T1_DT, T2_NN, T1T2_DT_NN, Dist_1, ...
boy-skipped        L           1                W1_boy, W2_skipped, ...
skipped-school     R           1                W1_skipped, W2_school, ...
skipped-regularly  R           1                W1_skipped, W2_regularly, ...
The-skipped        N           1                W1_The, W2_skipped, ...
The-school         N           1                W1_The, W2_school, ...

L = left link, R = right link, N = no link
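A hedged sketch of how the local training examples above could be generated: every word pair becomes a 3-class instance (L, R, or N). The feature names follow the W1_/W2_/T1_/T2_/Dist_ templates from the table; the exact template set in the thesis may differ.

```python
# Build one weighted 3-class instance per word pair of a sentence.

from typing import Dict, List, Set, Tuple

def local_examples(words: List[str], tags: List[str],
                   gold: Set[Tuple[int, int]]) -> List[Dict]:
    """gold contains (head, modifier) index pairs."""
    examples = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if (j, i) in gold:
                label = "L"          # link points left: the right word is the head
            elif (i, j) in gold:
                label = "R"          # link points right: the left word is the head
            else:
                label = "N"          # no link between the two words
            feats = [f"W1_{words[i]}", f"W2_{words[j]}",
                     f"W1W2_{words[i]}_{words[j]}",
                     f"T1_{tags[i]}", f"T2_{tags[j]}",
                     f"T1T2_{tags[i]}_{tags[j]}", f"Dist_{j - i}"]
            examples.append({"pair": (words[i], words[j]), "label": label,
                             "weight": 1.0, "features": feats})
    return examples

words = ["The", "boy", "skipped", "school", "regularly"]
tags = ["DT", "NN", "VBD", "NN", "RB"]
gold = {(1, 0), (2, 1), (2, 3), (2, 4)}
for ex in local_examples(words, tags, gold)[:3]:
    print(ex["pair"], ex["label"])
```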
27. Local Training Methods
- Learn a local link classifier given a set of features defined on the local examples
- For each word pair in a sentence: no link, left link, or right link? (3-class classification)
- However, the parameters of the local model are not trained to consider global parsing accuracy
28. Global Training for Parsing
- Incorporate the effects of the parser directly into the training algorithm, i.e., directly capture the relations between the links of an output tree
- Unfortunately, available structured training techniques
  - are expensive, specialized, and complex to implement
  - require a lot of effort to apply to parsing
We need efficient global training algorithms!
29. Structured Boosting for Dependency Parsing
Global training and efficient?
(Diagram: training sentences (X, Y) → local training examples → local link classifier c (c1, c2, c3, ..., ck) → link scores → dependency parsing algorithm → dependency trees → compare with the gold-standard trees → re-weight the mis-parsed examples → retrain the classifier.)
30. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
31. Semi-supervised Convex Training for Dependency Parsing
-- ACL 2008
32. Contributions
- Combined a structured large margin loss on labeled data with a least squares loss on unlabeled data
- Obtained an efficient, convex, semi-supervised large margin training algorithm for learning dependency parsers
33. More Data Is Better Data
- The Penn Treebank (supervised learning)
  - 4.5 million words
  - About 200 thousand sentences
  - Annotation: 30 person-minutes per sentence
  - Limited and expensive!
- Raw text data (semi-/unsupervised learning)
  - News wire
  - Wikipedia
  - Web resources
  - ...
  - Plentiful and free
34. Semi-supervised Structured SVM
- Non-convex! Both the feature weights and the dependency trees of the unlabeled sentences are optimization variables
35. Our Approach
- Convex!
36. Efficient Optimization Algorithm
- Uses stochastic gradient steps
- Parameters are updated locally on each labeled and unlabeled sentence
- The subproblem over the (relaxed) tree structure of an unlabeled sentence is solved by calling CPLEX
- The objective is globally minimized after a few iterations
(The slide shows the gradient on labeled sentence i and the gradient on unlabeled sentence j; a schematic sketch follows.)
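A very schematic sketch of the stochastic update loop, under strong assumptions: labeled sentences contribute a structured hinge (sub)gradient, and unlabeled sentences contribute a least-squares gradient that pulls the edge scores toward the currently inferred relaxed tree. The toy stand-ins below are not the thesis implementation, which solves the tree subproblem exactly (via CPLEX).

```python
# Alternate stochastic updates on labeled and unlabeled sentences.

import numpy as np

def sgd_step_labeled(w, feats_gold, feats_best_wrong, margin_violation, lr):
    """Subgradient of the structured hinge loss for one labeled sentence."""
    if margin_violation > 0:
        w = w + lr * (feats_gold - feats_best_wrong)
    return w

def sgd_step_unlabeled(w, arc_feats, relaxed_tree, lr):
    """Gradient of a least-squares loss ||relaxed_tree - arc_feats @ w||^2."""
    residual = relaxed_tree - arc_feats @ w
    return w + lr * (arc_feats.T @ residual)

rng = np.random.default_rng(0)
w = np.zeros(4)
for it in range(10):
    # one labeled "sentence": toy gold / best-wrong feature vectors
    w = sgd_step_labeled(w, rng.normal(size=4), rng.normal(size=4),
                         margin_violation=1.0, lr=0.1)
    # one unlabeled "sentence": 3 candidate arcs with 4 features each
    w = sgd_step_unlabeled(w, rng.normal(size=(3, 4)),
                           relaxed_tree=rng.uniform(size=3), lr=0.05)
print(w)
```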
37. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
38. Experimental Design
- Data sets (Linguistic Data Consortium), split into training, development, and test sets
  - English
    - PTB: 50K sentences (standard split)
  - Chinese
    - CTB4: 15K sentences (split as in Wang et al. 2005)
    - CTB5: 19K sentences (split as in Corston-Oliver et al. 2006)
- Features
  - Word-pair indicator, POS-pair, PMI, context features, distance
39. Experimental Design (Cont.)
- Dependency parsing algorithm
  - A CKY-like chart parser
- Evaluation measure
  - Dependency accuracy: the percentage of words that have the correct head (a small example follows)
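A minimal sketch of the evaluation measure: dependency accuracy is the percentage of words whose predicted head matches the gold head. The head encoding below (0 for the artificial root) is an assumption for illustration.

```python
# Dependency accuracy = 100 * (# words with correct head) / (# words).

from typing import List

def dependency_accuracy(pred_heads: List[int], gold_heads: List[int]) -> float:
    """Heads are given per word; index 0 denotes the artificial root."""
    assert len(pred_heads) == len(gold_heads)
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return 100.0 * correct / len(gold_heads)

# "The boy skipped school regularly": gold heads are boy, skipped, root, skipped, skipped
gold = [2, 3, 0, 3, 3]
pred = [2, 3, 0, 5, 3]   # one wrong head
print(dependency_accuracy(pred, gold))  # 80.0
```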
40. Results - 1 (IWPT 2005)
Evaluation results on CTB 4.0 (%)
(Table: accuracy for an undirected tree vs. a directed tree with a root.)
41. Results - 2 (CoNLL 2006)
Evaluation results on CTB4-10 (%), with a much simpler feature set
42. Results - 3 (IJCAI 2007)
Evaluation results on Chinese and English (%)
43. Results - 4 (ACL 2008)
Evaluation results on Chinese and English (%)
44. Comparison with the State of the Art
(Tables: the IWPT 2005 and IJCAI 2007 results compared with prior work on Chinese Treebank 4.0 and Chinese Treebank 5.0.)
45. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
46. Conclusion
- I have developed several statistical approaches to learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Achieved state-of-the-art accuracy for English and Chinese
47. Thanks!
48. (Diagram: the thesis connects NLP and machine learning; features, linguistic intuitions, and models on the NLP side, and training criteria, regularization, and smoothing on the machine learning side.)
49. Ambiguities in NLP
I like eating sushi with tuna. (Courtesy of Aravind Joshi)
50. Dependency Trees vs. Constituency Trees
(Figure: a constituency tree for "Mike ate the cake": S → NP VP; NP → N (Mike); VP → V (ate) NP; NP → Dt (the) N (cake).)
51. Dependency Tree
- A dependency tree structure for a sentence represents syntactic relationships between word pairs in the sentence
(Figure: a labeled dependency tree for "I like eating sushi with tuna", with subj, obj, and mod links.)
52. Dependency Parsing
- An increasingly active research area (Yamada & Matsumoto 2003; McDonald et al. 2005; McDonald & Pereira 2006; Corston-Oliver et al. 2006; Smith & Eisner 2005, 2006; Wang et al. 2006, 2007, 2008; Koo et al. 2008)
- Dependency trees are much easier to understand and annotate than other syntactic representations
- Dependency relations have been widely used in
  - Machine translation (Fox 2002; Cherry & Lin 2003; Ding & Palmer 2005)
  - Information extraction (Culotta & Sorensen 2004)
  - Question answering (Pinchak & Lin 2006)
  - Coreference resolution (Bergsma & Lin 2006)
53. Scoring Functions
- The score of a link from word i to word j is the dot product of a vector of feature weights and a vector of features: score(i, j) = w · f(i, j)
- Feature weights can be learned via either a local or a global training approach
54. Score of Each Link / Word Pair
- The score of each link is based on the features
- Consider the word pair (skipped, regularly):
  - word-pair indicator (skipped, regularly) = 1
  - POS-pair indicator (skipped, regularly) = (VBD, RB) = 1
  - PMI(skipped, regularly) = 0.27
  - dist(skipped, regularly) = 2, dist²(skipped, regularly) = 4
55. However,
- POS tags are not part of natural text
  - They need to be annotated by human effort
  - They introduce more noise into the training data
- For some languages, such as Chinese or Japanese, POS tags are not clearly defined
  - A single word is often combined with other words
Can we use another smoothing technique rather than POS tags?
56. Feature Representation
- Represent a word w by a feature vector
- The value of a feature c is derived from P(w, c), the probability that w and c co-occur in a context window
57. Similarity-based Smoothing
- Similarity measure: cosine, as in (Dagan et al., 1999); a small sketch follows
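A hedged sketch of the distributional similarity computation: each word is represented by a vector of context features and similarity is the cosine of the two vectors. Using raw co-occurrence weights as feature values is a simplification; the toy vectors below are invented.

```python
# Cosine similarity between sparse context-feature vectors.

import math
from typing import Dict

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(u[c] * v.get(c, 0.0) for c in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy context vectors (context word -> co-occurrence weight)
skipped = {"class": 0.4, "school": 0.5, "meeting": 0.3}
missed  = {"class": 0.3, "school": 0.4, "bus": 0.5}
print(cosine(skipped, missed))
```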
58. Comparison with an Unlexicalized Model
- In the unlexicalized model, the input to the parser is the sequence of POS tags, the opposite of our model
- Using gold-standard POS tags:
  - Accuracy of the unlexicalized model: 71.1%
  - Our strictly lexicalized model: 79.9%
59. Large Margin Training
- Minimizing a regularized loss (Hastie et al., 2004)
- Notation: i indexes the training sentences; T_i is the target tree; L_i is a candidate tree; Δ(T_i, L_i) is the distance between the two trees
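A hedged reconstruction of the regularized structured large margin objective implied by the notation above; the exact formulation in the thesis may differ in the regularizer and loss scaling:

```latex
\min_{\mathbf{w}} \;\; \frac{\beta}{2}\,\|\mathbf{w}\|^{2}
\;+\; \sum_{i} \max_{L_i \in \mathcal{T}(X_i)}
\Big[\, \Delta(T_i, L_i) \;-\; \mathbf{w} \cdot \big(\mathbf{f}(X_i, T_i) - \mathbf{f}(X_i, L_i)\big) \,\Big]
```

Here β trades off the regularizer against the margin loss, and the max term is non-negative because choosing L_i = T_i makes it zero.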
60. Objective with Local Constraints
- The corresponding new quadratic program
- Polynomially many constraints!
- j: the number of constraints in A
61. Combining Local Training with a Parsing Algorithm
62. Standard Boosting for Classification
(Diagram: training examples → local predictor h; at each round, increase the weight of mis-classified examples and retrain, producing h1, h2, h3, ..., hk.)
63. Structured Boosting
- Train a local link predictor, h1
- Re-parse the training data using h1
- Re-weight the local examples
  - Compare the parser outputs with the gold-standard trees
  - Increase the weight of mis-parsed local examples
- Re-train the local link predictor, getting h2
- Finally we have h1, h2, ..., hk (a sketch of this loop follows)
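A hedged sketch of the structured boosting loop above. The local link classifier and the parser are toy stand-ins (a weighted count model and a greedy head picker); the thesis uses a real link classifier and a dependency parsing algorithm, but the re-weighting structure is the same idea.

```python
# Train, re-parse, re-weight mis-parsed local examples, repeat.

from collections import defaultdict
from typing import Dict, List, Tuple

Sentence = List[str]
Tree = Dict[int, int]                      # modifier index -> head index

def train_local(examples) -> Dict:
    """Toy classifier: weighted vote for/against each (head_word, mod_word)."""
    scores = defaultdict(float)
    for h_word, m_word, label, weight in examples:
        scores[(h_word, m_word)] += weight * (1.0 if label else -1.0)
    return scores

def parse(words: Sentence, scores: Dict) -> Tree:
    """Stand-in parser: greedily pick the highest-scoring head for each word."""
    return {m: max((h for h in range(len(words)) if h != m),
                   key=lambda h: scores.get((words[h], words[m]), 0.0))
            for m in range(1, len(words))}  # word 0 is an artificial root

def structured_boosting(data: List[Tuple[Sentence, Tree]], rounds: int = 3):
    weights = defaultdict(lambda: 1.0)      # instance weight per (sent, head, mod)
    scores = {}
    for _ in range(rounds):
        examples = [(s[h], s[m], 1 if gold[m] == h else 0, weights[(i, h, m)])
                    for i, (s, gold) in enumerate(data)
                    for m in gold for h in range(len(s)) if h != m]
        scores = train_local(examples)      # (re-)train the local classifier
        for i, (s, gold) in enumerate(data):
            pred = parse(s, scores)         # re-parse the training data
            for m in gold:
                if pred[m] != gold[m]:      # boost the mis-parsed examples
                    weights[(i, gold[m], m)] *= 2.0
                    weights[(i, pred[m], m)] *= 2.0
    return scores

data = [(["<root>", "I", "skipped", "school"], {1: 2, 2: 0, 3: 2})]
print(structured_boosting(data))
```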
64. Dynamic Features
- Also known as non-local features
- Take into account the link labels of the surrounding word pairs when predicting the label of the current pair
- Commonly used in sequential labeling (McCallum et al. 2000; Toutanova et al. 2003)
- A simple but useful idea for improving parsing accuracy
  - Wang et al. 2005
  - McDonald and Pereira 2006
65. Dynamic Features
(Figure: two parses, "I saw her duck with a telescope" and "I saw her duck with a spot", where "with" attaches either to "saw" or to "duck".)
- Define a canonical order so that a word's children are generated first, before it modifies another word
- "telescope"/"spot" are the dynamic features for deciding whether to generate a link between "saw" and "with" or between "duck" and "with"
66. Results - 1
(Table 1: Boosting with static features.)
67. Variants of Structured Boosting
- Using alternative boosting algorithms for structured boosting
  - AdaBoost.M2 (Freund & Schapire, 1997): re-weighting class labels
  - A logistic-regression form of boosting (Collins et al., 2002)
68. Structured Boosting (An Example)
(Figure: the gold tree for "I saw her duck with a telescope" alongside the parser's output at successive iterations. At each iteration, the instance weights of mis-parsed examples are increased ("Try harder!"); by iteration T the parser's output matches the gold tree ("Good job!").)
69. Similarity Between Word Pairs
- Similarity between two words, Sim(w1, w2) (Lin, 1998)
  - Construct a feature vector for w that contains the set of words occurring within a small context window
  - Compute the similarity between the two feature vectors using the cosine measure
  - e.g., Sim(skipped, missed) = 0.134966
- Similarity between word pairs: the geometric average (a small sketch follows), e.g.
  - Sim((skipped, regularly), (missed, often)) = sqrt(Sim(skipped, missed) × Sim(regularly, often))
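A small sketch of the word-pair similarity as the geometric average of the component word similarities. The value for (skipped, missed) is taken from this slide and the value for (regularly, often) from the earlier similarity list; any other pair would need to be looked up.

```python
# Sim((w1, w2), (w1', w2')) = sqrt(Sim(w1, w1') * Sim(w2, w2')).

import math
from typing import Dict, Tuple

word_sim: Dict[Tuple[str, str], float] = {
    ("skipped", "missed"): 0.134966,   # from the slide
    ("regularly", "often"): 0.24077,   # from the similarity list for "regularly"
}

def pair_sim(p: Tuple[str, str], q: Tuple[str, str]) -> float:
    return math.sqrt(word_sim.get((p[0], q[0]), 0.0) *
                     word_sim.get((p[1], q[1]), 0.0))

print(pair_sim(("skipped", "regularly"), ("missed", "often")))
```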
70. A Generative Parsing Model
(Figure: a dependency tree over the sentence "The(1) kid(2) skipped(3) school(4) regularly(5)", with an artificial root at position 0.)
71. Similarity-based Smoothing
(Figure: the MLE-based probability for the context (skipped, regularly) is replaced by a similarity-based probability computed over similar contexts of C, i.e., pairs seen in the training data such as (skip, frequently), (skip, routinely), (skip, repeatedly), (bounced, often), (bounced, who), (bounced, repeatedly).)
72. Similarity-based Smoothing
P(E | C) = α · P_MLE(E | C) + (1 - α) · P_SIM(E | C)
- |C| is the frequency count of the corresponding context C in the training data, e.g., |(skipped, regularly)| = 1, |(The, kid)| = 95 (a small sketch follows)
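A hedged sketch of the interpolated estimate. The exact form of the mixing weight α in the thesis is not reproduced here; the sketch simply assumes α grows with the frequency count of the context, which is the usual behaviour of such smoothing schemes.

```python
# P(E|C) = alpha * P_MLE(E|C) + (1 - alpha) * P_SIM(E|C)

def smoothed_prob(p_mle: float, p_sim: float, context_count: int,
                  k: float = 5.0) -> float:
    alpha = context_count / (context_count + k)   # assumed form of alpha
    return alpha * p_mle + (1 - alpha) * p_sim

# Rare context (seen once): lean on the similarity-based estimate.
print(smoothed_prob(p_mle=1.0, p_sim=0.2, context_count=1))
# Frequent context (seen 95 times): trust the MLE estimate.
print(smoothed_prob(p_mle=0.6, p_sim=0.2, context_count=95))
```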
73. Dependency Parsing Algorithms
- Use a constituency parsing algorithm
  - Simply treat dependencies as constituents
  - With O(n^5) complexity
- Eisner's dependency parsing algorithm (Eisner, 1996), with O(n^3) complexity
  - It stores spans instead of subtrees
  - Only the end-words are active (still need a head)
74. Existing Semi-/Unsupervised Algorithms
- EM / self-training
  - Local minima
  - The disconnect between likelihood and accuracy
  - The same mistakes can be amplified at the next iteration
- Standard semi-supervised SVM
  - Non-convex objective on the unlabeled data
  - Available solutions are sophisticated and expensive (e.g., Xu et al. 2006)
75. Structured Boosting
- A simple variant of standard boosting algorithms, e.g., AdaBoost.M1 (Freund & Schapire, 1997)
- Global optimization
- As efficient as local methods
- General: can use any local classifier
- Also, can be easily applied to other tasks
76. Dependency Trees
- A dependency tree structure for a sentence represents syntactic relations between word pairs in the sentence
(Figure: a labeled dependency tree for "I saw her duck with a telescope", with subj, obj, gen, det, and mod links.)
Over 100 possible trees!
About 1 million trees for a 20-word sentence
77. Constraints on Y
- Y is represented as an adjacency matrix
- Rows denote heads and columns denote children
- Constraints on Y (a sketch that checks them follows)
  - All entries in Y are between 0 and 1
  - The sum over all entries in each column is 1 (one-head rule)
  - All entries on the diagonal are zeros (no-self-link rule)
  - Anti-symmetric rule: a link cannot go in both directions between the same two words
  - Connectedness (no-cycle rule)
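A hedged sketch that checks the relaxed adjacency-matrix constraints listed above. The treatment of the artificial root column and the encoding of the anti-symmetric rule are assumptions; the connectedness/no-cycle condition is omitted because its exact encoding is not shown on the slide.

```python
# Check entries in [0, 1], one head per word, zero diagonal, no double links.

import numpy as np

def check_relaxed_tree(Y: np.ndarray, tol: float = 1e-6) -> bool:
    """Y is square over tokens; index 0 is assumed to be an artificial root,
    so the one-head rule is applied to every column except the root's own."""
    ok = bool(np.all(Y >= -tol) and np.all(Y <= 1 + tol))          # entries in [0, 1]
    ok &= bool(np.allclose(Y[:, 1:].sum(axis=0), 1.0, atol=tol))   # one head per word
    ok &= bool(np.allclose(np.diag(Y), 0.0, atol=tol))             # no self-links
    ok &= bool(np.all(Y + Y.T <= 1 + tol))                         # not both directions
    return ok

# Tokens: <root>(0) I(1) saw(2) her(3); links: root->saw, saw->I, saw->her.
Y = np.zeros((4, 4))
Y[0, 2] = Y[2, 1] = Y[2, 3] = 1.0
print(check_relaxed_tree(Y))   # True
```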
78. Structured Boosting (An Example)
(Figure: for "I saw her duck with a telescope", the weights of the local examples; the instance weights of the saw-with and duck-with examples plotted over boosting iterations 1, 2, 3, ..., T.)
79. Local Constraints
(Figure: a correct link and a missing link that share a common node, w1.)
Convex!
80. Future Work - 1
- Using alternative boosting algorithms for structured boosting
  - AdaBoost.M2 (Freund & Schapire, 1997): re-weighting class labels
  - A logistic-regression form of boosting (Collins et al., 2002)
81. Future Work - 2
- Multilingual dependency parsing
  - Apply our techniques to other languages, such as Czech, German, Spanish, and French
82. Future Work - 3
- Domain adaptation
  - Apply my parsers to other domains (e.g., biomedical data)
  - Lack of annotated resources in these domains
  - Related work: Blitzer et al. 2006; McClosky et al. 2006