Title: Learning Structured Classifiers for Statistical Dependency Parsing
1. Learning Structured Classifiers for Statistical Dependency Parsing
- Qin Iris Wang
- Supervisors: Dekang Lin, Dale Schuurmans
- University of Alberta
- May 5th, 2008
2. Ambiguities In NLP
I saw her duck.
How about "I saw her duck with a telescope"?
3. Dependency Trees vs. Constituency Trees
[Figure: a constituency tree for "Mike ate the cake", with nodes S, NP, VP, V, Dt, N]
4. My Thesis Research
- Goal
  - Improve dependency parsing via statistical machine learning approaches
- Method
  - Tackle the problem from different angles
  - Employ and develop advanced supervised and semi-supervised machine learning techniques
- Achievement
  - State-of-the-art accuracy for English and Chinese
5. Dependency Trees
- A dependency tree structure for a sentence represents
  - Syntactic relations between word pairs in the sentence
[Figure: a dependency tree for "I saw her duck with a telescope", with labeled links subj, obj, gen, det, mod, obj]
- Over 100 possible trees for this sentence!
- About 1 million trees for a 20-word sentence
6. Dependency Parsing Algorithms
- Use a constituency parsing algorithm
  - Simply treat dependencies as constituents
  - With O(n^5) complexity
- Eisner's dependency parsing algorithm (Eisner 1996)
  - It stores spans, instead of subtrees
  - Only the end-words are active (still need a head)
7. Dependency Parsing
- An increasingly active research area (Yamada & Matsumoto 2003, McDonald et al. 2005, McDonald & Pereira 2006, Corston-Oliver et al. 2006, Smith & Eisner 2005/06, Wang et al. 2006/07/08, Koo et al. 2008)
- Dependency trees are much easier to understand and annotate than other syntactic representations
- Dependency relations have been widely used in
  - Machine translation (Fox 2002, Cherry & Lin 2003, Ding & Palmer 2005)
  - Information extraction (Culotta & Sorensen 2004)
  - Question answering (Pinchak & Lin 2006)
  - Coreference resolution (Bergsma & Lin 2006)
8. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
9. Dependency Parsing Model
- W: an input sentence; T: a candidate dependency tree
- (i, j): a dependency link from word i to word j
- T(W): the set of possible dependency trees over W
- Edge/link-based factorization: the score of a tree is the sum of the scores of its links, Score(W, T) = sum of score(i, j) over all links (i, j) in T
- Can be applied to both probabilistic and non-probabilistic models
10. Scoring Functions
- score(i, j) = w · f(i, j)
  - w: a vector of feature weights
  - f(i, j): a vector of features for the link (i, j)
- Feature weights can be learned via either a local or a global training approach (see the sketch below)
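To make the edge-factored scoring concrete, here is a minimal Python sketch of the link and tree scores as sparse dot products; the function names and feature representation are illustrative, not taken from the thesis.

```python
from typing import Callable, Dict, List, Tuple

def link_score(weights: Dict[str, float], features: Dict[str, float]) -> float:
    """Score of one dependency link: dot product of feature weights and features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def tree_score(weights: Dict[str, float],
               links: List[Tuple[int, int]],
               feature_fn: Callable[[int, int], Dict[str, float]]) -> float:
    """Edge-factored tree score: the sum of the scores of the tree's links."""
    return sum(link_score(weights, feature_fn(i, j)) for (i, j) in links)
```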
11. Features for an Arc
- Word-pair indicator
- Part-of-speech (POS) tags of the word pair
- Pointwise mutual information (PMI) for that word pair
- Distance between the words
- ...
Lots of features, and sparse! They need abstraction or smoothing.
12. Score of Each Link / Word Pair
- The score of each link is based on its features
- Consider the word pair (skipped, regularly):
  - word-pair(skipped, regularly) = 1
  - POS(skipped, regularly) = (VBD, RB): 1
  - PMI(skipped, regularly) = 0.27
  - dist(skipped, regularly) = 2; dist^2(skipped, regularly) = 4
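A small sketch of a feature extractor that produces the kind of sparse features listed above; the template names and the way a precomputed PMI value is passed in are assumptions made for illustration.

```python
def word_pair_features(w1, w2, pos1, pos2, index1, index2, pmi=None):
    """Sparse features for one candidate link, mirroring the templates above.
    The feature names are illustrative, not the thesis's exact feature set."""
    dist = abs(index2 - index1)
    feats = {
        f"pair={w1}_{w2}": 1.0,       # word-pair indicator
        f"pos={pos1}_{pos2}": 1.0,    # POS-pair indicator
        f"dist={dist}": 1.0,          # distance between the words
        "dist_squared": float(dist * dist),
    }
    if pmi is not None:
        feats["pmi"] = pmi            # precomputed PMI for the word pair
    return feats

# Example for (skipped, regularly), two words apart:
print(word_pair_features("skipped", "regularly", "VBD", "RB", 3, 5, pmi=0.27))
```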
13. My Work on Dependency Parsing
- 1. Strictly lexicalized dependency parsing (Wang et al. 2005)
  - MLE with similarity-based smoothing
- 2. Improving large margin training (Wang et al. 2006)
  - Local constraints capture local errors in a parse tree
  - Laplacian regularization (semi-supervised large margin training) enforces similar links to have similar weights -- similarity smoothing
- 3. Structured boosting (Wang et al. 2007)
  - Global optimization; efficient and flexible
- 4. Semi-supervised convex training (Wang et al. 2008)
  - Combines a large margin loss with a least squares loss
14. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
15. Strictly Lexicalized Dependency Parsing
-- IWPT 2005
16. Contributions
- All features are based on word statistics; no POS tags or grammatical categories are needed
- Similarity-based smoothing is used to deal with data sparseness
17. POS Tags for Handling Sparseness in Parsing
- All previous parsers use a POS lexicon
- Natural language data is sparse
  - Bikel (2004) found that of all the needed bi-gram statistics, only 1.49% were observed in the Penn Treebank
- Words belonging to the same POS are expected to have the same syntactic behavior
18. However,
- POS tags are not part of natural text
  - They need to be annotated by human effort
  - This introduces more noise into the training data
- For some languages, such as Chinese or Japanese, POS tags are not clearly defined
  - A single word is often combined with other words
Can we use another smoothing technique rather than POS?
19. An Alternative Method to Deal with Sparseness
- Distributional word similarities
  - Distributional Hypothesis: words that appear in similar contexts have similar meanings (Harris, 1968)
  - Soft clusters of words, whereas POS tags are hard clusters of words
- Advantages of using similarity smoothing
  - Computed automatically from raw text
  - Makes the construction of a treebank easier
20. Similarity Between Word Pairs
- Similarity between two words, Sim(w1, w2) (Lin, 1998)
  - Construct a feature vector for w that contains the words occurring within a small context window
  - Compute the similarity between the two feature vectors using the cosine measure
  - e.g., Sim(skipped, missed) = 0.134966
- Similarity between word pairs: the geometric average of the word similarities
  - e.g., Sim((skipped, regularly), (missed, often)) = sqrt( Sim(skipped, missed) * Sim(regularly, often) )
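A minimal sketch of the two similarity computations described above, assuming sparse context-window count vectors; the thesis's exact feature weighting may differ.

```python
import math
from collections import Counter

def cosine(v1: Counter, v2: Counter) -> float:
    """Cosine similarity between two sparse context-feature vectors."""
    dot = sum(v1[f] * v2[f] for f in v1 if f in v2)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def pair_similarity(sim_first_words: float, sim_second_words: float) -> float:
    """Similarity between two word pairs: the geometric average of the
    similarities of the corresponding words."""
    return math.sqrt(sim_first_words * sim_second_words)
```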
21. A Generative Parsing Model
[Figure: a dependency tree for "The kid skipped school regularly", with a root symbol at position 0 and the words at positions 1-5]
22. Similarity-based Smoothing
- Estimating P(E | skipped, regularly), where the word pair (skipped, regularly), at positions 3 and 5, is the context
- S(skipped): skipping 0.229951, skip 0.197991, skips 0.169982, sprinted 0.140535, bounced 0.139547, missed 0.134966, cruised 0.133933, scooted 0.13387, jogged 0.133638
- S(regularly): frequently 0.365862, routinely 0.286178, periodically 0.273665, often 0.24077, constantly 0.234693, occasionally 0.226324, who 0.200348, continuously 0.194026, repeatedly 0.177434
23. Similarity-based Smoothing (cont.)
[Same build as slide 22: P(E | skipped, regularly) with the similar-word lists S(skipped) and S(regularly)]
24. Similarity-based Smoothing
- Estimating P(E | skipped, regularly)
- Similar contexts: pairs seen in the training data -- skip frequently, skip routinely, skip repeatedly, bounced often, bounced who, bounced repeatedly
25. Similarity-based Smoothing
[Figure: the similarity-based probability for the context (skipped, regularly) is computed from the MLE-based probabilities of its similar contexts -- skip frequently, skip routinely, skip repeatedly, bounced often, bounced who, bounced repeatedly]
26. Similarity-based Smoothing
- P(E | C) = a * P_MLE(E | C) + (1 - a) * P_SIM(E | C)
- The weight a depends on |C|, the frequency count of the corresponding context C in the training data (see the sketch below)
  - e.g., |(skipped, regularly)| = 1, |(The, kid)| = 95
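A sketch of the interpolation above; tying the weight a to the context count via a Witten-Bell-style ratio with a hypothetical threshold is an assumption, not the thesis's exact scheme.

```python
def smoothed_prob(event, context, p_mle, p_sim, context_count, threshold=5.0):
    """P(E|C) = a * P_MLE(E|C) + (1 - a) * P_SIM(E|C).
    Frequent contexts (large count) trust the MLE estimate; rare contexts
    (e.g., (skipped, regularly) with count 1) lean on the similarity-based
    estimate.  The formula for a below is only one plausible choice."""
    a = context_count / (context_count + threshold)
    return a * p_mle(event, context) + (1.0 - a) * p_sim(event, context)
```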
27. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
28. Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization
-- CoNLL 2006
29. Contributions
- Large margin vs. generative training
- Included distance and PMI features, but fewer dynamic features
- Local constraints to capture local errors in a parse tree
- Laplacian regularization (based on distributional word similarity) to deal with data sparseness
- Semi-supervised large margin training
30. (Existing) Large Margin Training
- An exponential number of constraints!
- Has been used for parsing
  - Tsochantaridis et al. 2004, Taskar et al. 2004
- State-of-the-art performance in dependency parsing
  - McDonald et al. 2005a, 2005b, 2006
31. However
- An exponential number of constraints
- The loss ignores the local errors of the parse tree
- Over-fitting the training corpus
  - A large number of bi-lexical features, which need a good smoothing (regularization) method
32. Local Constraints (an example)
[Figure: the gold dependency tree for "The boy skipped school regularly" and the per-link loss of an incorrect candidate tree]
- score(The, boy) > score(The, skipped) + 1
- score(boy, skipped) > score(The, skipped) + 1
- score(skipped, school) > score(school, regularly) + 1
- score(skipped, regularly) > score(school, regularly) + 1
- A polynomial number of constraints!
33. Local Constraints
[Figure: a local constraint relating a correct link and a missing link that share a common node w1 -- the resulting constraints are convex]
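A sketch of how the local constraints illustrated on slide 32 could be enumerated: every correct link should outscore, by a margin of 1, each incorrect candidate link that shares a node with it. This mirrors the examples above; the exact constraint set used in the thesis may differ.

```python
def local_constraints(gold_links, candidate_links):
    """Return pairs (correct_link, incorrect_link) encoding the constraints
    score(correct_link) >= score(incorrect_link) + 1 for links sharing a node."""
    gold = set(gold_links)
    constraints = []
    for (h, c) in gold:
        for (h2, c2) in candidate_links:
            if (h2, c2) in gold:
                continue                      # only compare against incorrect links
            if {h, c} & {h2, c2}:             # the two links share a common node
                constraints.append(((h, c), (h2, c2)))
    return constraints
```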
34. Laplacian Regularization
- Enforce similar links (word pairs) to have similar weights (see the sketch below)
- L(S) = D(S) - S
  - S: the similarity matrix of word pairs
  - D(S): a diagonal matrix (the row sums of S)
  - L(S): the Laplacian matrix of S
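A small numpy sketch of the regularizer, assuming w holds only the bi-lexical feature weights; because w^T L(S) w = 0.5 * sum_ij S_ij (w_i - w_j)^2, word pairs with high similarity are pushed toward similar weights.

```python
import numpy as np

def laplacian(S: np.ndarray) -> np.ndarray:
    """Laplacian of a word-pair similarity matrix: L(S) = D(S) - S,
    where D(S) is diagonal and holds the row sums of S."""
    return np.diag(S.sum(axis=1)) - S

def laplacian_penalty(w: np.ndarray, S: np.ndarray) -> float:
    """Regularization term w^T L(S) w over the bi-lexical weights w."""
    return float(w @ laplacian(S) @ w)
```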
35. Refined Large Margin Objective
- The Laplacian regularizer applies only to the bi-lexical features
- A polynomial number of constraints!
36. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
37. Simple Training of Dependency Parsers via Structured Boosting
-- IJCAI 2007
38. Contributions
- Structured boosting: a simple approach to training structured classifiers by applying a boosting-like procedure to standard supervised training methods
- Advantages
  - Simple
  - Inexpensive
  - General
- Successfully applied to dependency parsing
39. Local Training Examples
- Given training data (S, T), generate one local example per word pair:

Word pair           Link label   Instance weight   Features
The-boy             L            1                 W1_The, W2_boy, W1W2_The_boy, T1_DT, T2_NN, T1T2_DT_NN, Dist_1, ...
boy-skipped         L            1                 W1_boy, W2_skipped, ...
skipped-school      R            1                 W1_skipped, W2_school, ...
skipped-regularly   R            1                 W1_skipped, W2_regularly, ...
The-skipped         N            1                 W1_The, W2_skipped, ...
The-school          N            1                 W1_The, W2_school, ...

L = left link, R = right link, N = no link
40. Local Training Methods
- Learn a local link classifier given a set of features defined on the local examples (see the sketch below)
- For each word pair in a sentence:
  - No link, left link or right link?
  - A 3-class classification problem
- Any classifier can be used as a link classifier for parsing
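A sketch of building the 3-class local examples of slide 39 and fitting one possible local link classifier; scikit-learn's logistic regression stands in here for the maximum entropy or SVM learners mentioned on the next slides, and the feature templates are deliberately simplified.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def build_local_examples(words, gold_links):
    """One 3-class example (L / R / N) per word pair; gold_links holds
    (head_index, child_index) pairs (an assumed encoding)."""
    examples, labels = [], []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            feats = {f"W1_{words[i]}": 1, f"W2_{words[j]}": 1, f"Dist_{j - i}": 1}
            if (j, i) in gold_links:
                label = "L"        # head on the right, link points left
            elif (i, j) in gold_links:
                label = "R"        # head on the left, link points right
            else:
                label = "N"        # no link
            examples.append(feats)
            labels.append(label)
    return examples, labels

# Fit a maximum-entropy-style local link classifier on one toy sentence.
X, y = build_local_examples("The boy skipped school regularly".split(),
                            {(1, 0), (2, 1), (2, 3), (2, 4)})
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
```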
41. Combining Local Training with a Parsing Algorithm
42. Parsing With a Local Link Classifier
- Learn the weight vector over a set of features defined on the local examples
  - Maximum entropy models (Ratnaparkhi 1999, Charniak 2000)
  - Support vector machines (Yamada and Matsumoto 2003)
- The parameters of the local model are not trained to consider the global parsing accuracy
- Global training can do better
43. Global Training for Parsing
- Directly capture the relations between the links of an output tree
- Incorporate the effects of the parser directly into the training algorithm
  - Structured SVMs (Tsochantaridis et al. 2004)
  - Max-margin parsing (Taskar et al. 2004)
  - Online large-margin training (McDonald et al. 2005)
  - Improving large-margin training (Wang et al. 2006)
44. But, Drawbacks
- Unfortunately, these structured training techniques are
  - Expensive
  - Specialized
  - Complex to implement
- They require a great deal of refinement and computational resources to apply to parsing
We need efficient global training algorithms!
45. Structured Boosting
- A simple variant of standard boosting algorithms such as AdaBoost.M1 (Freund & Schapire 1997)
- Global optimization
- As efficient as local methods
- General: can use any local classifier
- Can also be easily applied to other tasks
46. Standard Boosting for Classification
[Figure: the boosting loop -- a local predictor h is trained on weighted training examples, the weights of mis-classified examples are increased, and repeating the process yields predictors h1, h2, h3, ..., hk]
47. Structured Boosting (An Example)
[Figure: the gold dependency tree for "I saw her duck with a telescope" compared with the parser's output tree at each iteration. At each iteration, the instance weights of mis-parsed word pairs are increased ("Try harder!"); by iteration T the output matches the gold tree ("Good job!")]
48. Structured Boosting for Dependency Parsing
[Figure: global training, efficiently? A local link classifier h is trained on local training examples derived from the training sentences (S, T); its link scores drive the dependency parsing algorithm; the resulting dependency trees are compared with the gold-standard trees; the mis-parsed examples are re-weighted and the classifier is re-trained, yielding h1, h2, h3, ..., hk]
49. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
50. Semi-supervised Convex Training for Dependency Parsing
-- ACL 2008
51. Contributions
- Combined a structured large margin loss on labeled data with a least squares loss on unlabeled data
- Obtained an efficient, convex, semi-supervised large margin training algorithm for learning dependency parsers
52. More Data Is Better Data
- The Penn Treebank (supervised learning): limited and expensive!
  - 4.5 million words
  - About 200 thousand sentences
  - Annotation: 30 person-minutes per sentence
- Raw text data (semi-supervised / unsupervised learning): plentiful and free!
  - Newswire
  - Wikipedia
  - Web resources
  - ...
53. Existing Semi-supervised / Unsupervised Algorithms
- EM / self-training
  - Local minima
  - The disconnect between likelihood and accuracy
  - The same mistakes can be amplified at the next iteration
- Standard semi-supervised SVM
  - Non-convex objective on the unlabeled data
  - Available solutions are sophisticated and expensive (e.g., Xu et al. 2006)
54. Semi-supervised Structured SVM
- Non-convex!
- Both the weight vector and the unlabeled trees Yj are variables
55. Our Approach
- Convex!
- Constraints on Yj
  - All entries in Yj are between 0 and 1
  - The sum over all entries in each column is 1 (one-head rule)
  - All entries on the diagonal are zero (no self-link rule)
  - (anti-symmetric rule)
56. Our Approach
- Convex!
57. Our Approach (cont.)
- Yj is represented as an adjacency matrix
  - Rows denote heads and columns denote children
- Constraints on Yj
  - All entries in Yj are between 0 and 1
  - The sum over all entries in each column is 1 (one-head rule)
  - All entries on the diagonal are zero (no self-link rule)
  - (anti-symmetric rule)
  - Connectedness (no cycles)
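A sketch that checks the relaxed constraints above on a candidate adjacency matrix Yj (rows as heads, columns as children). The anti-symmetric rule's formula is not reproduced in the text, so the form Y + Y^T <= 1 below is an assumption, and the connectedness condition is omitted.

```python
import numpy as np

def satisfies_link_constraints(Y: np.ndarray, tol: float = 1e-6) -> bool:
    """Check the relaxed adjacency-matrix constraints (sketch only)."""
    in_range = np.all(Y >= -tol) and np.all(Y <= 1 + tol)     # entries in [0, 1]
    one_head = np.allclose(Y.sum(axis=0), 1.0, atol=tol)      # each column sums to 1
    no_self  = np.allclose(np.diag(Y), 0.0, atol=tol)         # zero diagonal
    anti_sym = bool(np.all(Y + Y.T <= 1 + tol))               # assumed anti-symmetric rule
    return bool(in_range and one_head and no_self and anti_sym)
```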
58. Efficient Optimization Algorithm
- Use stochastic gradient steps
  - Parameters are updated locally on each sentence
  - The objective is globally minimized after a few iterations
- At each step, take a gradient step for the labeled sentence i and a gradient step for the unlabeled sentence j (see the sketch below)
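A sketch of the stochastic update scheme described above; the two loss gradients are abstracted as callables (the unlabeled one would internally also handle the Yj variables), and the learning rate and epoch count are illustrative.

```python
import random

def sgd_semi_supervised(w, labeled, unlabeled, grad_margin, grad_lsq,
                        lr=0.01, epochs=10):
    """Alternate stochastic gradient steps: a structured large-margin step on
    each labeled sentence, a least-squares step on each unlabeled sentence."""
    for _ in range(epochs):
        for sentence in random.sample(labeled, len(labeled)):
            w = w - lr * grad_margin(w, sentence)     # labeled update
        for sentence in random.sample(unlabeled, len(unlabeled)):
            w = w - lr * grad_lsq(w, sentence)        # unlabeled update
    return w
```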
59. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
60. Experimental Design
- Data sets (Linguistic Data Consortium), split into training, development and test sets
  - English
    - PTB: 50K sentences (standard split)
  - Chinese
    - CTB4: 15K sentences (split as in Wang et al. 2005)
    - CTB5: 19K sentences (split as in Corston-Oliver et al. 2006)
- Features
  - Word-pair indicator, POS pair, PMI, context features, distance
61. Experimental Design (cont.)
- Dependency parsing algorithm
  - A CKY-like chart parser
- Evaluation measure
  - Dependency accuracy: the percentage of words that have the correct head
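The evaluation measure is simple enough to state directly as code (head indices are assumed, with 0 for the root):

```python
def dependency_accuracy(gold_heads, predicted_heads):
    """Percentage of words whose predicted head matches the gold-standard head."""
    assert len(gold_heads) == len(predicted_heads)
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)

print(dependency_accuracy([2, 0, 2, 2], [2, 0, 2, 1]))   # 75.0
```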
62. Results - 1 (IWPT 05)
- Evaluation results on CTB 4.0 (%)
- [Table omitted: columns for an undirected tree, root, and a directed tree]
63. Results - 2 (CoNLL 06)
- A much simpler feature set
- [Table omitted: evaluation results on CTB4 - 10 (%)]
64. Results - 3 (IJCAI 07)
- [Table omitted: evaluation results on Chinese and English (%)]
65. Results - 4 (ACL 08)
- [Table omitted: evaluation results on Chinese and English (%)]
66. Comparison with State of the Art
- [Table omitted: comparison involving IWPT 2005 and IJCAI 2007 on Chinese Treebank 4.0 and Chinese Treebank 5.0]
67. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
68. Conclusion
- I have developed several statistical approaches to improve the learning of dependency parsers
  - Strictly lexicalized dependency parsing
  - Extensions to large margin training
  - Training dependency parsers via structured boosting
  - Semi-supervised convex training for dependency parsing
- Achieved state-of-the-art accuracy for English and Chinese
69. Thanks!
Questions?
70. [Figure: this work lies at the intersection of NLP (features, linguistic intuitions, models) and machine learning (training criteria, regularization, smoothing)]
71. What Have I Worked On?
[Figure: structured learning, where the output is a set of inter-dependent labels with a specific structure (e.g., a parse tree), applied to dependency parsing, part-of-speech tagging and query segmentation]
72. Ambiguities In NLP
- I like eating sushi with tuna. (Courtesy of Aravind Joshi)
73. Dependency Tree
- A dependency tree structure for a sentence represents the syntactic relationships between word pairs in the sentence
[Figure: two alternative dependency trees for "I like eating sushi with tuna", with labeled links subj, obj, mod, obj, differing in where "with tuna" attaches]
74. Feature Representation
- Represent a word w by a feature vector
- The value of a feature c is derived from P(w, c), the probability that w and c co-occur in a context window
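The exact feature-value formula is not reproduced in the text; pointwise mutual information is a common weighting for this kind of distributional representation and is assumed in this sketch.

```python
import math
from collections import Counter, defaultdict

def pmi_vectors(cooc: Counter, word_counts: Counter, ctx_counts: Counter, total: int):
    """Feature vector for each word w, one dimension per context word c,
    valued by PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ) -- an assumed choice."""
    vectors = defaultdict(dict)
    for (w, c), n in cooc.items():
        pmi = math.log((n / total) / ((word_counts[w] / total) * (ctx_counts[c] / total)))
        vectors[w][c] = pmi
    return vectors
```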
75. Similarity-based Smoothing
- Similarity measure: cosine, as in (Dagan et al., 1999)
76. Comparison With An Unlexicalized Model
- In the unlexicalized model, the input to the parser is the sequence of POS tags, the opposite of our model
- Using gold standard POS tags:
  - Accuracy of the unlexicalized model: 71.1%
  - Our strictly lexicalized model: 79.9%
77. Large Margin Training
- Minimize a regularized loss (Hastie et al., 2004)
  - i: the index of the training sentences
  - Ti: the target tree for sentence i
  - Li: a candidate tree
  - Delta(Ti, Li): the distance between the two trees
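The objective itself is not reproduced in the text; the standard regularized structured large-margin form it refers to looks roughly as follows, with beta a regularization constant and s(.; w) the edge-factored score, though the thesis's exact formulation may differ:

```latex
\min_{w}\;\; \frac{\beta}{2}\,\lVert w \rVert^{2}
\;+\; \sum_{i} \max_{L_i \in \mathcal{T}(W_i)}
\Big[ \Delta(T_i, L_i) + s(L_i; w) - s(T_i; w) \Big]_{+}
```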
78. Objective with Local Constraints
- The corresponding new quadratic program
- A polynomial number of constraints!
- j indexes the constraints in A
79. Structured Boosting
- Train a local link predictor, h1
- Re-parse the training data using h1
- Re-weight the local examples
  - Compare the parser outputs with the gold-standard trees
  - Increase the weight of mis-parsed local examples
- Re-train the local link predictor, obtaining h2
- Finally we have h1, h2, ..., hk (see the sketch below)
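A compact sketch of the loop above, with the local learner, the parser, and the error detector passed in as components; the multiplicative re-weighting factor is a placeholder rather than the thesis's exact update.

```python
def structured_boosting(sentences, gold_trees, make_examples, train_local,
                        parse, find_errors, rounds=10):
    """make_examples: dict keyed by (sentence_index, word_pair) -> features
    train_local(examples, weights): fits a local link classifier
    parse(sentence, classifier): builds a tree from the classifier's link scores
    find_errors(predicted, gold): word pairs whose links disagree"""
    examples = make_examples(sentences, gold_trees)
    weights = {key: 1.0 for key in examples}
    predictors = []
    for _ in range(rounds):
        h = train_local(examples, weights)
        predictors.append(h)
        for idx, (sent, gold) in enumerate(zip(sentences, gold_trees)):
            for pair in find_errors(parse(sent, h), gold):
                weights[(idx, pair)] *= 1.5      # "try harder" on mis-parsed pairs
    return predictors
```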
80. Structured Boosting (An Example)
[Figure: for "I saw her duck with a telescope", the instance weights of the local examples saw-with and duck-with are plotted over boosting iterations 1, 2, 3, ..., T]
81. Variants of Structured Boosting
- Alternative boosting algorithms can be used for structured boosting
  - AdaBoost.M2 (Freund & Schapire 1997), which re-weights class labels
  - The logistic regression form of boosting (Collins et al. 2002)
82. Dynamic Features
- Also known as non-local features
- Take into account the link labels of the surrounding word pairs when predicting the label of the current pair
- Commonly used in sequential labeling (McCallum et al. 2000, Toutanova et al. 2003)
- A simple but useful idea for improving parsing accuracy
  - Wang et al. 2005
  - McDonald and Pereira 2006
83. Dynamic Features
[Figure: dependency trees for "I saw her duck with a telescope" and "I saw her duck with a spot"]
- Define a canonical order so that a word's children are generated first, before it modifies another word
- telescope/spot are the dynamic features for deciding whether to generate a link between saw-with or duck-with (see the sketch below)
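A tiny sketch of how a dynamic feature such as telescope/spot could enter the feature set when scoring the competing links saw-with and duck-with; the feature names and the attached-children bookkeeping are illustrative only.

```python
def dynamic_features(head, modifier, attached_children):
    """Features for the candidate link head -> modifier, including the children
    already attached to the modifier under the canonical generation order."""
    feats = {f"pair={head}_{modifier}": 1.0}
    for child in attached_children.get(modifier, []):
        feats[f"pair_child={head}_{modifier}_{child}"] = 1.0
    return feats

# "with" has already generated its child "telescope":
attached = {"with": ["telescope"]}
print(dynamic_features("saw", "with", attached))
print(dynamic_features("duck", "with", attached))
```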
84. Results - 1
- Table 1: boosting with static features [table omitted]
85. More on Structured Learning
- Improved Estimation for Unsupervised Part-of-Speech Tagging (Wang and Schuurmans 2005)
- Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization (Wang, Cherry, Lizotte and Schuurmans 2006)
- Learning Noun Phrase Query Segmentation (Bergsma and Wang 2007)
- Semi-supervised Convex Training for Dependency Parsing (Wang, Schuurmans and Lin 2008)