Title: Learning Structured Classifiers for Statistical Dependency Parsing
1. Learning Structured Classifiers for Statistical Dependency Parsing
- Qin Iris Wang
- Supervisors: Dekang Lin, Dale Schuurmans
- University of Alberta
- May 5th, 2008
2. Ambiguities In NLP
I saw her duck.
How about "I saw her duck with a telescope"?
3. Dependency Trees vs. Constituency Trees
[Figure: a constituency tree for "Mike ate the cake", with nodes S, NP, VP, V, Dt, N]
4. My Thesis Research
- Goal
  - Improve dependency parsing via statistical machine learning approaches
- Method
  - Tackle the problem from different angles
  - Employ and develop advanced supervised and semi-supervised machine learning techniques
- Achievement
  - State-of-the-art accuracy for English and Chinese
5. Dependency Trees
- A dependency tree structure for a sentence represents
  - Syntactic relations between word pairs in the sentence
[Figure: a dependency tree for "I saw her duck with a telescope", with labeled links subj, obj, gen, det, mod, obj]
- Over 100 possible trees for this sentence!
- About 1 million trees for a 20-word sentence
6. Dependency Parsing Algorithms
- Use a constituency parsing algorithm
  - Simply treat dependencies as constituents
  - With O(n^5) complexity
- Eisner's dependency parsing algorithm (Eisner 1996)
  - It stores spans, instead of subtrees
  - Only the end-words are active (still need a head)
7. Dependency Parsing
- An increasingly active research area (Yamada & Matsumoto 2003, McDonald et al. 2005, McDonald & Pereira 2006, Corston-Oliver et al. 2006, Smith & Eisner 2005/06, Wang et al. 2006/07/08, Koo et al. 2008)
- Dependency trees are much easier to understand and annotate than other syntactic representations
- Dependency relations have been widely used in
  - Machine translation (Fox 2002, Cherry & Lin 2003, Ding & Palmer 2005)
  - Information extraction (Culotta & Sorensen 2004)
  - Question answering (Pinchak & Lin 2006)
  - Coreference resolution (Bergsma & Lin 2006)
8. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
9. Dependency Parsing Model
- W: an input sentence; T: a candidate dependency tree
- (i, j): a dependency link from word i to word j
- T(W): the set of possible dependency trees over W
- Edge/link-based factorization: the score of a tree is the sum of the scores of its links, Score(W, T) = sum of score(i, j) over all links (i, j) in T
- Can be applied to both probabilistic and non-probabilistic models
10. Scoring Functions
- score(i, j) = w · f(i, j)
  - w: a vector of feature weights
  - f(i, j): a vector of features for the link (i, j)
- Feature weights can be learned via either a local or a global training approach (see the sketch below)
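To make the edge-factored scoring concrete, here is a minimal Python sketch of the link and tree scores as sparse dot products; the function names and feature representation are illustrative, not taken from the thesis.

```python
from typing import Callable, Dict, List, Tuple

def link_score(weights: Dict[str, float], features: Dict[str, float]) -> float:
    """Score of one dependency link: dot product of feature weights and features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def tree_score(weights: Dict[str, float],
               links: List[Tuple[int, int]],
               feature_fn: Callable[[int, int], Dict[str, float]]) -> float:
    """Edge-factored tree score: the sum of the scores of the tree's links."""
    return sum(link_score(weights, feature_fn(i, j)) for (i, j) in links)
```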
11. Features for an Arc
- Word-pair indicator
- Part-of-speech (POS) tags of the word pair
- Pointwise mutual information (PMI) for that word pair
- Distance between the words
- ...
Lots of features, and sparse! They need abstraction or smoothing.
12. Score of Each Link / Word Pair
- The score of each link is based on its features
- Consider the word pair (skipped, regularly):
  - word-pair(skipped, regularly) = 1
  - POS(skipped, regularly) = (VBD, RB): 1
  - PMI(skipped, regularly) = 0.27
  - dist(skipped, regularly) = 2; dist^2(skipped, regularly) = 4
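A small sketch of a feature extractor that produces the kind of sparse features listed above; the template names and the way a precomputed PMI value is passed in are assumptions made for illustration.

```python
def word_pair_features(w1, w2, pos1, pos2, index1, index2, pmi=None):
    """Sparse features for one candidate link, mirroring the templates above.
    The feature names are illustrative, not the thesis's exact feature set."""
    dist = abs(index2 - index1)
    feats = {
        f"pair={w1}_{w2}": 1.0,       # word-pair indicator
        f"pos={pos1}_{pos2}": 1.0,    # POS-pair indicator
        f"dist={dist}": 1.0,          # distance between the words
        "dist_squared": float(dist * dist),
    }
    if pmi is not None:
        feats["pmi"] = pmi            # precomputed PMI for the word pair
    return feats

# Example for (skipped, regularly), two words apart:
print(word_pair_features("skipped", "regularly", "VBD", "RB", 3, 5, pmi=0.27))
```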
13. My Work on Dependency Parsing
- 1. Strictly lexicalized dependency parsing (Wang et al. 2005)
  - MLE with similarity-based smoothing
- 2. Improving large margin training (Wang et al. 2006)
  - Local constraints capture local errors in a parse tree
  - Laplacian regularization (semi-supervised large margin training) enforces similar links to have similar weights -- similarity smoothing
- 3. Structured boosting (Wang et al. 2007)
  - Global optimization; efficient and flexible
- 4. Semi-supervised convex training (Wang et al. 2008)
  - Combines a large margin loss with a least squares loss
14. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
15. Strictly Lexicalized Dependency Parsing
-- IWPT 2005
16. Contributions
- All features are based on word statistics; no POS tags or grammatical categories are needed
- Similarity-based smoothing is used to deal with data sparseness
17. POS Tags for Handling Sparseness in Parsing
- All previous parsers use a POS lexicon
- Natural language data is sparse
  - Bikel (2004) found that of all the needed bi-gram statistics, only 1.49% were observed in the Penn Treebank
- Words belonging to the same POS are expected to have the same syntactic behavior
18. However,
- POS tags are not part of natural text
  - They need to be annotated by human effort
  - This introduces more noise into the training data
- For some languages, such as Chinese or Japanese, POS tags are not clearly defined
  - A single word is often combined with other words
Can we use another smoothing technique rather than POS?
19. An Alternative Method to Deal with Sparseness
- Distributional word similarities
  - Distributional Hypothesis: words that appear in similar contexts have similar meanings (Harris, 1968)
  - Soft clusters of words, whereas POS tags are hard clusters of words
- Advantages of using similarity smoothing
  - Computed automatically from raw text
  - Makes the construction of a treebank easier
20. Similarity Between Word Pairs
- Similarity between two words, Sim(w1, w2) (Lin, 1998)
  - Construct a feature vector for w that contains the words occurring within a small context window
  - Compute the similarity between the two feature vectors using the cosine measure
  - e.g., Sim(skipped, missed) = 0.134966
- Similarity between word pairs: the geometric average of the word similarities
  - e.g., Sim((skipped, regularly), (missed, often)) = sqrt( Sim(skipped, missed) * Sim(regularly, often) )
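A minimal sketch of the two similarity computations described above, assuming sparse context-window count vectors; the thesis's exact feature weighting may differ.

```python
import math
from collections import Counter

def cosine(v1: Counter, v2: Counter) -> float:
    """Cosine similarity between two sparse context-feature vectors."""
    dot = sum(v1[f] * v2[f] for f in v1 if f in v2)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def pair_similarity(sim_first_words: float, sim_second_words: float) -> float:
    """Similarity between two word pairs: the geometric average of the
    similarities of the corresponding words."""
    return math.sqrt(sim_first_words * sim_second_words)
```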
21. A Generative Parsing Model
[Figure: a dependency tree for "The kid skipped school regularly", with a root symbol at position 0 and the words at positions 1-5]
22. Similarity-based Smoothing
- Estimating P(E | skipped, regularly), where the word pair (skipped, regularly), at positions 3 and 5, is the context
- S(skipped): skipping 0.229951, skip 0.197991, skips 0.169982, sprinted 0.140535, bounced 0.139547, missed 0.134966, cruised 0.133933, scooted 0.13387, jogged 0.133638
- S(regularly): frequently 0.365862, routinely 0.286178, periodically 0.273665, often 0.24077, constantly 0.234693, occasionally 0.226324, who 0.200348, continuously 0.194026, repeatedly 0.177434
23. Similarity-based Smoothing (cont.)
[Same build as slide 22: P(E | skipped, regularly) with the similar-word lists S(skipped) and S(regularly)]
24. Similarity-based Smoothing
- Estimating P(E | skipped, regularly)
- Similar contexts: pairs seen in the training data -- skip frequently, skip routinely, skip repeatedly, bounced often, bounced who, bounced repeatedly
25. Similarity-based Smoothing
[Figure: the similarity-based probability for the context (skipped, regularly) is computed from the MLE-based probabilities of its similar contexts -- skip frequently, skip routinely, skip repeatedly, bounced often, bounced who, bounced repeatedly]
26. Similarity-based Smoothing
- P(E | C) = a * P_MLE(E | C) + (1 - a) * P_SIM(E | C)
- The weight a depends on |C|, the frequency count of the corresponding context C in the training data (see the sketch below)
  - e.g., |(skipped, regularly)| = 1, |(The, kid)| = 95
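A sketch of the interpolation above; tying the weight a to the context count via a Witten-Bell-style ratio with a hypothetical threshold is an assumption, not the thesis's exact scheme.

```python
def smoothed_prob(event, context, p_mle, p_sim, context_count, threshold=5.0):
    """P(E|C) = a * P_MLE(E|C) + (1 - a) * P_SIM(E|C).
    Frequent contexts (large count) trust the MLE estimate; rare contexts
    (e.g., (skipped, regularly) with count 1) lean on the similarity-based
    estimate.  The formula for a below is only one plausible choice."""
    a = context_count / (context_count + threshold)
    return a * p_mle(event, context) + (1.0 - a) * p_sim(event, context)
```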
27. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
28. Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization
-- CoNLL 2006
29. Contributions
- Large margin vs. generative training
- Included distance and PMI features, but fewer dynamic features
- Local constraints to capture local errors in a parse tree
- Laplacian regularization (based on distributional word similarity) to deal with data sparseness
- Semi-supervised large margin training
30. (Existing) Large Margin Training
- An exponential number of constraints!
- Has been used for parsing
  - Tsochantaridis et al. 2004, Taskar et al. 2004
- State-of-the-art performance in dependency parsing
  - McDonald et al. 2005a, 2005b, 2006
31. However
- An exponential number of constraints
- The loss ignores the local errors of the parse tree
- Over-fitting the training corpus
  - A large number of bi-lexical features, which need a good smoothing (regularization) method
32. Local Constraints (an example)
[Figure: the gold dependency tree for "The boy skipped school regularly" and the per-link loss of an incorrect candidate tree]
- score(The, boy) > score(The, skipped) + 1
- score(boy, skipped) > score(The, skipped) + 1
- score(skipped, school) > score(school, regularly) + 1
- score(skipped, regularly) > score(school, regularly) + 1
- A polynomial number of constraints!
33. Local Constraints
[Figure: a local constraint relating a correct link and a missing link that share a common node w1 -- the resulting constraints are convex]
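A sketch of how the local constraints illustrated on slide 32 could be enumerated: every correct link should outscore, by a margin of 1, each incorrect candidate link that shares a node with it. This mirrors the examples above; the exact constraint set used in the thesis may differ.

```python
def local_constraints(gold_links, candidate_links):
    """Return pairs (correct_link, incorrect_link) encoding the constraints
    score(correct_link) >= score(incorrect_link) + 1 for links sharing a node."""
    gold = set(gold_links)
    constraints = []
    for (h, c) in gold:
        for (h2, c2) in candidate_links:
            if (h2, c2) in gold:
                continue                      # only compare against incorrect links
            if {h, c} & {h2, c2}:             # the two links share a common node
                constraints.append(((h, c), (h2, c2)))
    return constraints
```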
34. Laplacian Regularization
- Enforce similar links (word pairs) to have similar weights (see the sketch below)
- L(S) = D(S) - S
  - S: the similarity matrix of word pairs
  - D(S): a diagonal matrix (the row sums of S)
  - L(S): the Laplacian matrix of S
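A small numpy sketch of the regularizer, assuming w holds only the bi-lexical feature weights; because w^T L(S) w = 0.5 * sum_ij S_ij (w_i - w_j)^2, word pairs with high similarity are pushed toward similar weights.

```python
import numpy as np

def laplacian(S: np.ndarray) -> np.ndarray:
    """Laplacian of a word-pair similarity matrix: L(S) = D(S) - S,
    where D(S) is diagonal and holds the row sums of S."""
    return np.diag(S.sum(axis=1)) - S

def laplacian_penalty(w: np.ndarray, S: np.ndarray) -> float:
    """Regularization term w^T L(S) w over the bi-lexical weights w."""
    return float(w @ laplacian(S) @ w)
```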
35. Refined Large Margin Objective
- The Laplacian regularizer applies only to the bi-lexical features
- A polynomial number of constraints!
36. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
37. Simple Training of Dependency Parsers via Structured Boosting
-- IJCAI 2007
38. Contributions
- Structured boosting: a simple approach to training structured classifiers by applying a boosting-like procedure to standard supervised training methods
- Advantages
  - Simple
  - Inexpensive
  - General
- Successfully applied to dependency parsing
39. Local Training Examples
- Given training data (S, T), generate one local example per word pair:

Word pair           Link label   Instance weight   Features
The-boy             L            1                 W1_The, W2_boy, W1W2_The_boy, T1_DT, T2_NN, T1T2_DT_NN, Dist_1, ...
boy-skipped         L            1                 W1_boy, W2_skipped, ...
skipped-school      R            1                 W1_skipped, W2_school, ...
skipped-regularly   R            1                 W1_skipped, W2_regularly, ...
The-skipped         N            1                 W1_The, W2_skipped, ...
The-school          N            1                 W1_The, W2_school, ...

L = left link, R = right link, N = no link
40. Local Training Methods
- Learn a local link classifier given a set of features defined on the local examples (see the sketch below)
- For each word pair in a sentence:
  - No link, left link or right link?
  - A 3-class classification problem
- Any classifier can be used as a link classifier for parsing
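A sketch of building the 3-class local examples of slide 39 and fitting one possible local link classifier; scikit-learn's logistic regression stands in here for the maximum entropy or SVM learners mentioned on the next slides, and the feature templates are deliberately simplified.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def build_local_examples(words, gold_links):
    """One 3-class example (L / R / N) per word pair; gold_links holds
    (head_index, child_index) pairs (an assumed encoding)."""
    examples, labels = [], []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            feats = {f"W1_{words[i]}": 1, f"W2_{words[j]}": 1, f"Dist_{j - i}": 1}
            if (j, i) in gold_links:
                label = "L"        # head on the right, link points left
            elif (i, j) in gold_links:
                label = "R"        # head on the left, link points right
            else:
                label = "N"        # no link
            examples.append(feats)
            labels.append(label)
    return examples, labels

# Fit a maximum-entropy-style local link classifier on one toy sentence.
X, y = build_local_examples("The boy skipped school regularly".split(),
                            {(1, 0), (2, 1), (2, 3), (2, 4)})
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
```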
41. Combining Local Training with a Parsing Algorithm
42. Parsing With a Local Link Classifier
- Learn the weight vector over a set of features defined on the local examples
  - Maximum entropy models (Ratnaparkhi 1999, Charniak 2000)
  - Support vector machines (Yamada and Matsumoto 2003)
- The parameters of the local model are not trained to consider the global parsing accuracy
- Global training can do better
43. Global Training for Parsing
- Directly capture the relations between the links of an output tree
- Incorporate the effects of the parser directly into the training algorithm
  - Structured SVMs (Tsochantaridis et al. 2004)
  - Max-margin parsing (Taskar et al. 2004)
  - Online large-margin training (McDonald et al. 2005)
  - Improving large-margin training (Wang et al. 2006)
44. But, Drawbacks
- Unfortunately, these structured training techniques are
  - Expensive
  - Specialized
  - Complex to implement
- They require a great deal of refinement and computational resources to apply to parsing
We need efficient global training algorithms!
45. Structured Boosting
- A simple variant of standard boosting algorithms such as AdaBoost.M1 (Freund & Schapire 1997)
- Global optimization
- As efficient as local methods
- General: can use any local classifier
- Can also be easily applied to other tasks
46. Standard Boosting for Classification
[Figure: the boosting loop -- a local predictor h is trained on weighted training examples, the weights of mis-classified examples are increased, and repeating the process yields predictors h1, h2, h3, ..., hk]
47. Structured Boosting (An Example)
[Figure: the gold dependency tree for "I saw her duck with a telescope" compared with the parser's output tree at each iteration. At each iteration, the instance weights of mis-parsed word pairs are increased ("Try harder!"); by iteration T the output matches the gold tree ("Good job!")]
48. Structured Boosting for Dependency Parsing
[Figure: global training, efficiently? A local link classifier h is trained on local training examples derived from the training sentences (S, T); its link scores drive the dependency parsing algorithm; the resulting dependency trees are compared with the gold-standard trees; the mis-parsed examples are re-weighted and the classifier is re-trained, yielding h1, h2, h3, ..., hk]
49. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
50. Semi-supervised Convex Training for Dependency Parsing
-- ACL 2008
51. Contributions
- Combined a structured large margin loss on labeled data with a least squares loss on unlabeled data
- Obtained an efficient, convex, semi-supervised large margin training algorithm for learning dependency parsers
52. More Data Is Better Data
- The Penn Treebank (supervised learning): limited and expensive!
  - 4.5 million words
  - About 200 thousand sentences
  - Annotation: 30 person-minutes per sentence
- Raw text data (semi-supervised / unsupervised learning): plentiful and free!
  - Newswire
  - Wikipedia
  - Web resources
  - ...
53. Existing Semi-supervised / Unsupervised Algorithms
- EM / self-training
  - Local minima
  - The disconnect between likelihood and accuracy
  - The same mistakes can be amplified at the next iteration
- Standard semi-supervised SVM
  - Non-convex objective on the unlabeled data
  - Available solutions are sophisticated and expensive (e.g., Xu et al. 2006)
54. Semi-supervised Structured SVM
- Non-convex!
- Both the weight vector and the unlabeled trees Yj are variables
55. Our Approach
- Convex!
- Constraints on Yj
  - All entries in Yj are between 0 and 1
  - The sum over all entries in each column is 1 (one-head rule)
  - All entries on the diagonal are zero (no self-link rule)
  - (anti-symmetric rule)
56. Our Approach
- Convex!
57. Our Approach (cont.)
- Yj is represented as an adjacency matrix
  - Rows denote heads and columns denote children
- Constraints on Yj
  - All entries in Yj are between 0 and 1
  - The sum over all entries in each column is 1 (one-head rule)
  - All entries on the diagonal are zero (no self-link rule)
  - (anti-symmetric rule)
  - Connectedness (no cycles)
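A sketch that checks the relaxed constraints above on a candidate adjacency matrix Yj (rows as heads, columns as children). The anti-symmetric rule's formula is not reproduced in the text, so the form Y + Y^T <= 1 below is an assumption, and the connectedness condition is omitted.

```python
import numpy as np

def satisfies_link_constraints(Y: np.ndarray, tol: float = 1e-6) -> bool:
    """Check the relaxed adjacency-matrix constraints (sketch only)."""
    in_range = np.all(Y >= -tol) and np.all(Y <= 1 + tol)     # entries in [0, 1]
    one_head = np.allclose(Y.sum(axis=0), 1.0, atol=tol)      # each column sums to 1
    no_self  = np.allclose(np.diag(Y), 0.0, atol=tol)         # zero diagonal
    anti_sym = bool(np.all(Y + Y.T <= 1 + tol))               # assumed anti-symmetric rule
    return bool(in_range and one_head and no_self and anti_sym)
```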
58. Efficient Optimization Algorithm
- Use stochastic gradient steps
  - Parameters are updated locally on each sentence
  - The objective is globally minimized after a few iterations
- At each step, take a gradient step for the labeled sentence i and a gradient step for the unlabeled sentence j (see the sketch below)
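A sketch of the stochastic update scheme described above; the two loss gradients are abstracted as callables (the unlabeled one would internally also handle the Yj variables), and the learning rate and epoch count are illustrative.

```python
import random

def sgd_semi_supervised(w, labeled, unlabeled, grad_margin, grad_lsq,
                        lr=0.01, epochs=10):
    """Alternate stochastic gradient steps: a structured large-margin step on
    each labeled sentence, a least-squares step on each unlabeled sentence."""
    for _ in range(epochs):
        for sentence in random.sample(labeled, len(labeled)):
            w = w - lr * grad_margin(w, sentence)     # labeled update
        for sentence in random.sample(unlabeled, len(unlabeled)):
            w = w - lr * grad_lsq(w, sentence)        # unlabeled update
    return w
```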
59. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
60. Experimental Design
- Data sets (Linguistic Data Consortium), split into training, development and test sets
  - English
    - PTB: 50K sentences (standard split)
  - Chinese
    - CTB4: 15K sentences (split as in Wang et al. 2005)
    - CTB5: 19K sentences (split as in Corston-Oliver et al. 2006)
- Features
  - Word-pair indicator, POS pair, PMI, context features, distance
61. Experimental Design (cont.)
- Dependency parsing algorithm
  - A CKY-like chart parser
- Evaluation measure
  - Dependency accuracy: the percentage of words that have the correct head
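The evaluation measure is simple enough to state directly as code (head indices are assumed, with 0 for the root):

```python
def dependency_accuracy(gold_heads, predicted_heads):
    """Percentage of words whose predicted head matches the gold-standard head."""
    assert len(gold_heads) == len(predicted_heads)
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)

print(dependency_accuracy([2, 0, 2, 2], [2, 0, 2, 1]))   # 75.0
```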
62. Results - 1 (IWPT 05)
- Evaluation results on CTB 4.0 (%)
- [Table omitted: columns for an undirected tree, root, and a directed tree]
63. Results - 2 (CoNLL 06)
- A much simpler feature set
- [Table omitted: evaluation results on CTB4 - 10 (%)]
64. Results - 3 (IJCAI 07)
- [Table omitted: evaluation results on Chinese and English (%)]
65. Results - 4 (ACL 08)
- [Table omitted: evaluation results on Chinese and English (%)]
66. Comparison with State of the Art
- [Table omitted: comparison involving IWPT 2005 and IJCAI 2007 on Chinese Treebank 4.0 and Chinese Treebank 5.0]
67. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
68. Conclusion
- I have developed several statistical approaches to improve the learning of dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to large margin training
  - Training dependency parsers via structured boosting
  - Semi-supervised convex training for dependency parsing
- Achieved state-of-the-art accuracy for English and Chinese
69. Thanks!
Questions?
70. [Figure: this work lies at the intersection of NLP (features, linguistic intuitions, models) and machine learning (training criteria, regularization, smoothing)]
71. What Have I Worked On?
[Figure: structured learning, where the output is a set of inter-dependent labels with a specific structure (e.g., a parse tree), applied to dependency parsing, part-of-speech tagging and query segmentation]
72. Ambiguities In NLP
- I like eating sushi with tuna. (Courtesy of Aravind Joshi)
73. Dependency Tree
- A dependency tree structure for a sentence represents the syntactic relationships between word pairs in the sentence
[Figure: two alternative dependency trees for "I like eating sushi with tuna", with labeled links subj, obj, mod, obj, differing in where "with tuna" attaches]
74. Feature Representation
- Represent a word w by a feature vector
- The value of a feature c is derived from P(w, c), the probability that w and c co-occur in a context window
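The exact feature-value formula is not reproduced in the text; pointwise mutual information is a common weighting for this kind of distributional representation and is assumed in this sketch.

```python
import math
from collections import Counter, defaultdict

def pmi_vectors(cooc: Counter, word_counts: Counter, ctx_counts: Counter, total: int):
    """Feature vector for each word w, one dimension per context word c,
    valued by PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ) -- an assumed choice."""
    vectors = defaultdict(dict)
    for (w, c), n in cooc.items():
        pmi = math.log((n / total) / ((word_counts[w] / total) * (ctx_counts[c] / total)))
        vectors[w][c] = pmi
    return vectors
```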
75. Similarity-based Smoothing
- Similarity measure: cosine, as in (Dagan et al., 1999)
76. Comparison With An Unlexicalized Model
- In the unlexicalized model, the input to the parser is the sequence of POS tags, the opposite of our model
- Using gold standard POS tags:
  - Accuracy of the unlexicalized model: 71.1%
  - Our strictly lexicalized model: 79.9%
77. Large Margin Training
- Minimize a regularized loss (Hastie et al., 2004)
  - i: the index of the training sentences
  - Ti: the target tree for sentence i
  - Li: a candidate tree
  - Delta(Ti, Li): the distance between the two trees
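The objective itself is not reproduced in the text; the standard regularized structured large-margin form it refers to looks roughly as follows, with beta a regularization constant and s(.; w) the edge-factored score, though the thesis's exact formulation may differ:

```latex
\min_{w}\;\; \frac{\beta}{2}\,\lVert w \rVert^{2}
\;+\; \sum_{i} \max_{L_i \in \mathcal{T}(W_i)}
\Big[ \Delta(T_i, L_i) + s(L_i; w) - s(T_i; w) \Big]_{+}
```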
78. Objective with Local Constraints
- The corresponding new quadratic program
- A polynomial number of constraints!
- j indexes the constraints in A
79. Structured Boosting
- Train a local link predictor, h1
- Re-parse the training data using h1
- Re-weight the local examples
  - Compare the parser outputs with the gold-standard trees
  - Increase the weight of mis-parsed local examples
- Re-train the local link predictor, obtaining h2
- Finally we have h1, h2, ..., hk (see the sketch below)
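A compact sketch of the loop above, with the local learner, the parser, and the error detector passed in as components; the multiplicative re-weighting factor is a placeholder rather than the thesis's exact update.

```python
def structured_boosting(sentences, gold_trees, make_examples, train_local,
                        parse, find_errors, rounds=10):
    """make_examples: dict keyed by (sentence_index, word_pair) -> features
    train_local(examples, weights): fits a local link classifier
    parse(sentence, classifier): builds a tree from the classifier's link scores
    find_errors(predicted, gold): word pairs whose links disagree"""
    examples = make_examples(sentences, gold_trees)
    weights = {key: 1.0 for key in examples}
    predictors = []
    for _ in range(rounds):
        h = train_local(examples, weights)
        predictors.append(h)
        for idx, (sent, gold) in enumerate(zip(sentences, gold_trees)):
            for pair in find_errors(parse(sent, h), gold):
                weights[(idx, pair)] *= 1.5      # "try harder" on mis-parsed pairs
    return predictors
```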
80. Structured Boosting (An Example)
[Figure: for "I saw her duck with a telescope", the instance weights of the local examples saw-with and duck-with are plotted over boosting iterations 1, 2, 3, ..., T]
81. Variants of Structured Boosting
- Alternative boosting algorithms can be used for structured boosting
  - AdaBoost.M2 (Freund & Schapire 1997), which re-weights class labels
  - The logistic regression form of boosting (Collins et al. 2002)
82. Dynamic Features
- Also known as non-local features
- Take into account the link labels of the surrounding word pairs when predicting the label of the current pair
- Commonly used in sequential labeling (McCallum et al. 2000, Toutanova et al. 2003)
- A simple but useful idea for improving parsing accuracy
  - Wang et al. 2005
  - McDonald and Pereira 2006
83. Dynamic Features
[Figure: dependency trees for "I saw her duck with a telescope" and "I saw her duck with a spot"]
- Define a canonical order so that a word's children are generated first, before it modifies another word
- telescope/spot are the dynamic features for deciding whether to generate a link between saw-with or duck-with (see the sketch below)
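A tiny sketch of how a dynamic feature such as telescope/spot could enter the feature set when scoring the competing links saw-with and duck-with; the feature names and the attached-children bookkeeping are illustrative only.

```python
def dynamic_features(head, modifier, attached_children):
    """Features for the candidate link head -> modifier, including the children
    already attached to the modifier under the canonical generation order."""
    feats = {f"pair={head}_{modifier}": 1.0}
    for child in attached_children.get(modifier, []):
        feats[f"pair_child={head}_{modifier}_{child}"] = 1.0
    return feats

# "with" has already generated its child "telescope":
attached = {"with": ["telescope"]}
print(dynamic_features("saw", "with", attached))
print(dynamic_features("duck", "with", attached))
```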
84. Results - 1
- Table 1: boosting with static features [table omitted]
85. More on Structured Learning
- Improved Estimation for Unsupervised Part-of-Speech Tagging (Wang and Schuurmans 2005)
- Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization (Wang, Cherry, Lizotte and Schuurmans 2006)
- Learning Noun Phrase Query Segmentation (Bergsma and Wang 2007)
- Semi-supervised Convex Training for Dependency Parsing (Wang, Schuurmans and Lin 2008)