Title: Learning Structured Classifiers for Statistical Dependency Parsing
1. Learning Structured Classifiers for Statistical Dependency Parsing
Qin Iris Wang
Joint work with Dekang Lin and Dale Schuurmans
University of Alberta
September 11, 2007
2. Ambiguities In NLP
I saw her duck.
How about "I saw her duck with a telescope"?
3. Dependency Trees vs. Constituency Trees
[Figure: a constituency tree for "Mike ate the cake", with S, NP, VP, V, Dt and N nodes.]
4. Dependency Trees
- A dependency tree structure for a sentence represents syntactic relations between word pairs in the sentence
[Figure: a dependency tree for "I saw her duck with a telescope", with labeled relations subj, obj, gen, mod, det.]
Over 100 possible trees!
1M trees for a 20-word sentence
5. Dependency Parsing
- An increasingly active research area (Yamada & Matsumoto 2003, McDonald et al. 2005, McDonald & Pereira 2006, Corston-Oliver et al. 2006, Smith & Eisner 2005, Smith & Eisner 2006)
- Dependency trees are much easier to understand and annotate than other syntactic representations
- Dependency relations have been widely used in
  - Machine translation (Fox 2002, Cherry & Lin 2003, Ding & Palmer 2005)
  - Information extraction (Culotta & Sorensen 2004)
  - Question answering (Pinchak & Lin 2006)
  - Coreference resolution (Bergsma & Lin 2006)
6. Overview
- Dependency parsing model
- Learning dependency parsers
- Strictly lexicalized dependency parsing
- Structured boosting
- Experimental results
- Conclusion
7. Dependency Parsing Model
- W: an input sentence; T: a candidate dependency tree
- A dependency link connects a word i to a word j
- The model is defined over the set of possible dependency trees for W
- Can be applied to both probabilistic and non-probabilistic models
- Edge/link-based factorization: the score of a tree is the sum of the scores of its links
8. Scoring Functions
- The score of a link is the dot product of a vector of feature weights and a vector of features (sketched below)
- Feature weights can be learned either locally or globally
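A minimal sketch of the edge/link-based factorization with a linear scoring function. The feature extractor `link_features` is a placeholder for whichever arc features are used; the function names here are assumptions for illustration, not the authors' code.

```python
def link_score(weights, sentence, i, j, link_features):
    """Score of a single dependency link (i, j): dot product of weights and features."""
    return sum(weights.get(name, 0.0) * value
               for name, value in link_features(sentence, i, j).items())

def tree_score(weights, sentence, tree, link_features):
    """Score of a candidate tree T: the sum of the scores of its links."""
    # tree is a collection of (head, modifier) index pairs
    return sum(link_score(weights, sentence, i, j, link_features)
               for (i, j) in tree)
```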
9. Features for an arc
- Word-pair indicator
- Part-of-Speech (POS) tags of the word pair
- Pointwise Mutual Information (PMI) for that word pair
- Distance between the words
These features are numerous and sparse, so some abstraction or smoothing is needed.
10. Score of Each Link / Word Pair
- The score of each link is based on the features of the word pair, e.g. (skipped, regularly):
  - POS(skipped, regularly) = (VBD, RB)
  - PMI(skipped, regularly) = 0.27
  - dist(skipped, regularly) = 2
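A sketch of extracting the arc features listed above for a word pair. The precomputed POS tags and the PMI lookup table passed in here are hypothetical inputs, assumed only for illustration.

```python
def link_features(sentence, i, j, pos_tags, pmi_table):
    """Arc features for the word pair (sentence[i], sentence[j])."""
    w1, w2 = sentence[i], sentence[j]
    return {
        "W1W2_%s_%s" % (w1, w2): 1.0,                    # word-pair indicator
        "T1T2_%s_%s" % (pos_tags[i], pos_tags[j]): 1.0,  # POS tags of the pair
        "PMI": pmi_table.get((w1, w2), 0.0),             # pointwise mutual information
        "Dist_%d" % abs(i - j): 1.0,                     # distance between the words
    }

# e.g. link_features(["The", "boy", "skipped", "school", "regularly"],
#                    2, 4, ["DT", "NN", "VBD", "NN", "RB"],
#                    {("skipped", "regularly"): 0.27})
```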
11. Score of a Tree
[Figure: a dependency tree for "The boy skipped school regularly"; its score is the sum of the scores of its links.]
12. My Work on Dependency Parsing
- 1. Strictly Lexicalized Dependency Parsing (Wang et al. 2005)
  - MLE + similarity-based smoothing
- 2. Improving Large Margin Training (Wang et al. 2006)
  - Local constraints capture local errors in a parse tree
  - Laplacian regularization enforces similar links to have similar weights, introducing the similarity-based smoothing technique into the large margin framework
- 3. Structured Boosting (Wang et al. 2007)
  - Global optimization, efficient and flexible
Focus of this talk
13. Overview
- Dependency parsing model
- Learning dependency parsers
- Strictly lexicalized dependency parsing
- Structured boosting
- Experimental results
- Conclusion
14. Strictly Lexicalized Dependency Parsing
-- IWPT 2005
15. POS Tags for Handling Sparseness in Parsing
- All previous parsers use a POS lexicon
- Natural language data is sparse
  - Bikel (2004) found that only 1.49% of the needed bi-gram statistics were observed in the Penn Treebank
- Words belonging to the same POS are expected to have the same syntactic behavior
16. However,
- POS tags are not part of natural text
  - They need to be annotated by human effort
  - This introduces more noise into the training data
- For some languages, such as Chinese or Japanese, POS tags are not clearly defined
  - A single word is often combined with other words
Can we use another smoothing technique rather than POS?
17. Strictly Lexicalized Dependency Parsing
- All the features are based on word statistics; no POS tags are needed
- Similarity-based smoothing is used to deal with data sparseness
18. An Alternative Method to Deal with Sparseness
- Distributional word similarities
  - Distributional Hypothesis: words that appear in similar contexts have similar meanings (Harris, 1968)
  - Soft clusters of words, whereas POS tags are hard clusters of words
- Advantages of using similarity smoothing
  - Computed automatically from raw text
  - Makes treebank construction easier
19. Similarity Between Word Pairs
- Similarity between two words, Sim(w1, w2) (Lin, 1998)
  - Construct a feature vector for w that contains a set of words occurring within a small context window
  - Compute the similarity between the two feature vectors using the cosine measure
  - e.g., Sim(skipped, missed) = 0.134966
- Similarity between word pairs: the geometric average of the word similarities (sketched below)
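A sketch of the two similarity measures above: cosine similarity between two words' context-feature vectors, and word-pair similarity as the geometric average of the two word similarities. Feature vectors are plain dictionaries here; the exact feature weighting scheme is left out.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(v1[f] * v2.get(f, 0.0) for f in v1)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def word_sim(vectors, w1, w2):
    return cosine(vectors[w1], vectors[w2])

def pair_sim(vectors, pair1, pair2):
    """Sim((u1, u2), (v1, v2)) = sqrt(Sim(u1, v1) * Sim(u2, v2))."""
    s1 = word_sim(vectors, pair1[0], pair2[0])
    s2 = word_sim(vectors, pair1[1], pair2[1])
    return math.sqrt(max(s1, 0.0) * max(s2, 0.0))
```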
20. Similarity-based Smoothing
[Figure: the example sentence "The kid skipped school regularly" with frequency counts for its words and word pairs, motivating similarity-based smoothing for rare pairs.]
21. Similarity-based Smoothing
Estimating P( · | skipped, regularly) from similar contexts:
S(regularly) = {frequently 0.365862, routinely 0.286178, periodically 0.273665, often 0.24077, constantly 0.234693, occasionally 0.226324, who 0.200348, continuously 0.194026, repeatedly 0.177434}
S(skipped) = {skipping 0.229951, skip 0.197991, skips 0.169982, sprinted 0.140535, bounced 0.139547, missed 0.134966, cruised 0.133933, scooted 0.13387, jogged 0.133638}
22. Similarity-based Smoothing
(Continues the example: the similar-word lists S(skipped) and S(regularly) from the previous slide.)
23. Similarity-based Smoothing
Estimating P( · | skipped, regularly) from similar word pairs that were seen in the training data:
skip frequently, skip routinely, skip repeatedly, bounced often, bounced who, bounced repeatedly
These are the similar contexts of (skipped, regularly).
24. Similarity-based Smoothing
[Figure: the smoothed estimate for the context (skipped, regularly) combines the MLE-based probability of the context itself with a similarity-based probability computed from its similar contexts (skip frequently, skip routinely, skip repeatedly, bounced often, bounced who, bounced repeatedly).]
25. Similarity-based Smoothing
P(E | C) = a * P_MLE(E | C) + (1 - a) * P_SIM(E | C)
- a depends on |C|, the frequency count of the corresponding context C in the training data
- e.g., |(skipped, regularly)| = 1, |(The, kid)| = 95
A sketch of this interpolation follows below.
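A minimal sketch of the interpolation above. The exact forms of the mixing weight a and of P_SIM are assumptions here: a grows with the frequency of the context C, and P_SIM averages the MLE estimates over similar contexts, weighted by their similarity to C.

```python
def p_smoothed(event, context, counts, similar_contexts, p_mle):
    """Interpolate MLE and similarity-based estimates of P(event | context)."""
    freq = counts.get(context, 0)
    alpha = freq / (freq + 5.0)            # hypothetical: more data -> trust MLE more
    sims = similar_contexts(context)        # list of (similar_context, similarity) pairs
    total = sum(s for _, s in sims) or 1.0
    p_sim = sum(s * p_mle(event, c) for c, s in sims) / total
    return alpha * p_mle(event, context) + (1.0 - alpha) * p_sim
```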
26. Comparison With an Unlexicalized Model
- In the unlexicalized model, the input to the parser is the sequence of POS tags, the opposite of our model, which uses only words
- Using gold standard POS tags
- Accuracy of the unlexicalized model: 71.1%
- Our strictly lexicalized model: 79.9%
27. Overview
- Dependency parsing model
- Learning dependency parsers
- Strictly lexicalized dependency parsing
- Structured boosting
- Experimental results
- Conclusion
28. Simple Training of Dependency Parsers via Structured Boosting
-- IJCAI 2007
29. Our Contributions
- Structured boosting: a simple approach to training structured classifiers by applying a boosting-like procedure to standard supervised training methods
- Advantages
  - Simple
  - Inexpensive
  - General
- Successfully applied to dependency parsing
30. Local Training Examples
- Given training data (S, T), generate local examples:

  Word-pair          Link-label  Weight  Features
  The-boy            L           1       W1_The, W2_boy, W1W2_The_boy, T1_DT, T2_NN, T1T2_DT_NN, Dist_1, ...
  boy-skipped        L           1       W1_boy, W2_skipped, ...
  skipped-school     R           1       W1_skipped, W2_school, ...
  skipped-regularly  R           1       W1_skipped, W2_regularly, ...
  The-skipped        N           1       W1_The, W2_skipped, ...
  The-school         N           1       W1_The, W2_school, ...

  (L = left link, R = right link, N = no link)
31. Local Training Methods
- Learn a local link classifier given a set of features defined on the local examples
- For each word pair in a sentence: no link, left link, or right link?
  - A 3-class classification problem (a sketch follows below)
- Any classifier can be used as the link classifier for parsing
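A sketch of turning a training sentence and its gold tree into the local 3-class examples shown above (L = left link, R = right link, N = no link), one per word pair, each with an initial weight of 1. The dictionary layout is an assumption for illustration.

```python
def local_examples(sentence, gold_links):
    """gold_links: set of (head_index, modifier_index) pairs for the gold tree."""
    examples = []
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence)):
            if (j, i) in gold_links:
                label = "L"      # head on the right, modifier on the left
            elif (i, j) in gold_links:
                label = "R"      # head on the left, modifier on the right
            else:
                label = "N"      # no dependency between the two words
            examples.append({"pair": (sentence[i], sentence[j]),
                             "label": label, "weight": 1.0})
    return examples
```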
32. Combining Local Training with a Parsing Algorithm
[Flow diagram: training sentences (S, T) -> local training examples -> local link classifier h -> link scores -> dependency parsing algorithm -> dependency trees. A standard application of machine learning.]
33. Parsing With a Local Link Classifier
- Learn the weight vector over a set of features defined on the local examples
  - Maximum entropy models (Ratnaparkhi 1999, Charniak 2000)
  - Support vector machines (Yamada and Matsumoto 2003)
- The parameters of the local model are not trained with global parsing accuracy in mind
- Global training can do better
34. Global Training for Parsing
- Directly capture the relations between the links of an output tree
- Incorporate the effects of the parser directly into the training algorithm
  - Structured SVMs (Tsochantaridis et al. 2004)
  - Max-margin parsing (Taskar et al. 2004)
  - Online large-margin training (McDonald et al. 2005)
  - Improving large-margin training (Wang et al. 2006)
35. But, Drawbacks
- Unfortunately, these structured training techniques are
  - Expensive
  - Specialized
  - Complex to implement
- They require a great deal of refinement and computational resources to apply to parsing
We need efficient global training algorithms!
36. Structured Boosting
- A simple variant of standard boosting algorithms such as AdaBoost.M1 (Freund & Schapire 1997)
- Global optimization
- As efficient as local methods
- General: can use any local classifier
- Can also be easily applied to other tasks
37. Standard Boosting for Classification
[Diagram: a local predictor h is repeatedly trained on the training examples; after each round the weights of mis-classified examples are increased, yielding a sequence of classifiers h1, h2, h3, ..., hk.]
38. Structured Boosting (An Example)
[Figure: the gold dependency tree for "I saw her duck with a telescope" compared with the parser's output over boosting iterations. At iterations 1, 2, ... the output is wrong and the weights of the mis-parsed local examples are increased ("Try harder!"); by iteration T the output matches the gold tree ("Good job!").]
39. Structured Boosting (An Example)
[Figure: how the weights of local examples for "I saw her duck with a telescope" (e.g., saw-with and duck-with) change over boosting iterations 1, 2, ..., T.]
40. Structured Boosting for Dependency Parsing
[Flow diagram: training sentences (S, T) -> local training examples -> local link classifier h -> link scores -> dependency parsing algorithm -> dependency trees. The parsed trees are compared with the gold standard trees, the mis-parsed local examples are re-weighted, and the loop is repeated, producing classifiers h1, h2, h3, ..., hk. This gives global training that remains efficient.]
41. Overview
- Dependency parsing model
- Learning dependency parsers
- Strictly lexicalized dependency parsing
- Structured boosting
- Experimental results
- Conclusion
42. Experimental Design
- Data sets (Linguistic Data Consortium), split into training, development and test sets
  - PTB3: 50K sentences (standard split)
  - CTB4: 15K sentences (split as in Wang et al. 2005)
  - CTB5: 19K sentences (split as in Corston-Oliver et al. 2006)
- Features
  - Word-pair indicator, POS pair, PMI, context features, distance
43. Experimental Design (Cont.)
- Local link classifier
  - Logistic regression model / maximum entropy
- Boosting method
  - A variant of AdaBoost.M1 (Freund & Schapire 1997)
44. Results
[Table: accuracy on Chinese and English (%).]
Dependency accuracy = the percentage of words that have the correct head.
45. Comparison with State of the Art
[Table: comparison with the state of the art; results from IWPT 2005 and IJCAI 2007 on Chinese Treebank 4.0 and Chinese Treebank 5.0.]
46. Overview
- Dependency parsing model
- Learning dependency parsers
- Strictly lexicalized dependency parsing
- Structured boosting
- Experimental results
- Conclusion
47. Conclusion
- Similarity-based smoothing is an alternative to POS tags for dealing with data sparseness
- Structured boosting is an efficient and effective approach to coordinating local link classifiers with global parsing accuracy
- Both of the above techniques have been successfully applied to dependency parsing
48. More on Structured Learning
- Improved Estimation for Unsupervised Part-of-Speech Tagging (Wang & Schuurmans 2005)
- Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization (Wang, Cherry, Lizotte and Schuurmans 2006)
- Learning Noun Phrase Query Segmentation (Bergsma and Wang 2007)
- Semi-supervised Topic Segmentation of Web Docs (in progress)
49. Thanks!
Questions?
50. [Diagram: this work sits at the intersection of NLP, which contributes features and linguistic intuitions, and machine learning, which contributes models, training criteria, regularization and smoothing.]
51. What Have I Worked On?
Structured learning: the output is a set of inter-dependent labels with a specific structure (e.g., a parse tree)
- Dependency parsing
- Part-of-Speech tagging
- Query segmentation
52. Ambiguities In NLP
I like eating sushi with tuna.
(Courtesy of Aravind Joshi)
53. Dependency Tree
- A dependency tree structure for a sentence represents syntactic relationships between word pairs in the sentence
[Figure: two alternative dependency trees for "I like eating sushi with tuna", with labeled relations subj, obj, mod; in one, "with tuna" attaches to "sushi", and in the other it attaches to "eating".]
54. Structured Boosting
- Train a local link predictor, h1
- Re-parse the training data using h1
- Re-weight the local examples
  - Compare the parser's outputs with the gold standard trees
  - Increase the weight of mis-parsed local examples
- Re-train the local link predictor, obtaining h2
- Finally we have h1, h2, ..., hk (a sketch follows below)
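A sketch of the structured boosting loop described above. The local learner `fit`, the parser `parse` (which returns a predicted link label per word pair), and `make_examples` are placeholders for whichever components are plugged in; they are assumptions here, not the authors' implementation.

```python
def structured_boost(train_sents, gold_trees, make_examples, fit, parse,
                     rounds=10, step=1.0):
    """Boost a local link classifier against global parsing errors."""
    examples = [make_examples(s, t) for s, t in zip(train_sents, gold_trees)]
    classifiers = []
    for _ in range(rounds):
        # train the local link classifier on the (weighted) local examples
        h = fit([ex for sent_ex in examples for ex in sent_ex])
        classifiers.append(h)
        for sent, sent_ex in zip(train_sents, examples):
            predicted = parse(sent, h)           # dict: word pair -> predicted label
            for ex in sent_ex:                   # re-weight mis-parsed local examples
                if predicted.get(ex["pair"]) != ex["label"]:
                    ex["weight"] += step         # "try harder" on this example
    return classifiers                           # h1, ..., hk, combined at test time
```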
55. However
- Exponential number of constraints (the number of incorrect trees)
- The loss ignores the local errors of the parse tree
- Over-fitting the training corpus
  - The large number of bi-lexical/word-pair features needs a good smoothing (regularization) method
56. Variants of Structured Boosting
- Using alternative boosting algorithms for structured boosting
  - AdaBoost.M2 (Freund & Schapire 1997)
    - Re-weighting class labels
  - The logistic regression form of boosting (Collins et al. 2002)
57. Improving Large Margin Training (Wang et al. 2006)
- A margin is created between the correct dependency tree and each incorrect dependency tree, at least as large as the loss of the incorrect tree
- Our contributions
  - Local constraints to capture local errors in a parse tree
  - Laplacian regularization to deal with data sparseness, which introduces the similarity-based smoothing technique into the large margin framework
58. (Existing) Large Margin Training
- Has been used for parsing
  - Tsochantaridis et al. 2004
  - Taskar et al. 2004
- State-of-the-art performance in dependency parsing
  - McDonald et al. 2005a, 2005b, 2006
59. Large Margin Training
- Minimize a regularized loss (Hastie et al., 2004), where i indexes the training sentences, Ti is the target tree, Li is a candidate tree, and the loss involves the distance between the two trees (a formula is sketched below)
60. Large Margin Training (McDonald 2005)
- Exponential number of constraints (the number of incorrect trees)
- The loss ignores the local errors of the parse tree
- Over-fitting the training corpus
  - The large number of bi-lexical/word-pair features needs a good smoothing (regularization) method
61. Local Constraints (an example)
[Figure: the correct dependency tree for "The boy skipped school regularly", with its links numbered and incorrect alternative links marked with their loss.]
Each local constraint requires a correct link to outscore an incorrect link by a margin, e.g.:
  score(The, boy) >= score(The, skipped) + 1
  score(boy, skipped) >= score(The, skipped) + 1
  score(skipped, school) >= score(school, regularly) + 1
  score(skipped, regularly) >= score(school, regularly) + 1
Only a polynomial number of constraints!
62. Local Constraints
[Figure: each local constraint pairs a correct link with a missing/incorrect link that shares a common node (w1); the resulting constraints are convex. A sketch of generating such constraints follows below.]
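A sketch of generating the local constraints illustrated in the example: for every correct link and every incorrect link that shares a word with it, require the correct link to outscore the incorrect one by a margin of 1. The exact constraint set used by the authors may differ; this simply follows the example above.

```python
def local_constraints(n_words, gold_links):
    """gold_links: set of (head, modifier) index pairs of the correct tree.
    Returns (correct_link, incorrect_link) pairs meaning
    score(correct_link) >= score(incorrect_link) + 1."""
    all_links = {(i, j) for i in range(n_words) for j in range(n_words) if i != j}
    constraints = []
    for good in gold_links:
        for bad in all_links - gold_links:
            if set(good) & set(bad):              # the two links share a common word
                constraints.append((good, bad))
    return constraints                            # polynomially many constraints
```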
63. Objective with Local Constraints
- The corresponding new quadratic program has only a polynomial number of constraints
- j: the number of constraints in the constraint set A
64. Laplacian Regularization
- Enforce similar links (word pairs) to have similar weights (a sketch follows below)
- L(S) = D(S) - S, where D(S) is a diagonal matrix, S is the similarity matrix of word pairs, and L(S) is the Laplacian matrix of S
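A sketch of the Laplacian regularizer, assuming a symmetric word-pair similarity matrix S and D(S) diagonal with the row sums of S; how this penalty is weighted inside the full objective is not shown here.

```python
import numpy as np

def laplacian(S):
    """L(S) = D(S) - S, with D(S) = diag of the row sums of S."""
    D = np.diag(S.sum(axis=1))
    return D - S

def laplacian_penalty(theta, S):
    """theta^T L(S) theta = 0.5 * sum_ij S_ij (theta_i - theta_j)^2 for symmetric S,
    which is small exactly when similar word pairs have similar weights."""
    return theta @ laplacian(S) @ theta
```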
65. Similarity Between Word Pairs
- Similarity between two words, Sim(w1, w2)
  - Cosine measure
- Similarity between word pairs: the geometric average of the word similarities
66. Refined Large Margin Objective
- The Laplacian regularizer applies only to the bi-lexical features
67. Unsupervised POS Tagging (Wang & Schuurmans 2005)
- Weaknesses of the transition model and emission model in HMM tagging
  - Poorly learned transition parameters
  - No form of parameter tying over the emission model
- Our ideas
  - Transition model: marginally constrained HMMs
  - Emission model: similarity-based smoothing
68. Parameters of an HMM
[Figure: the HMM graphical model, with hidden tag states T1, ..., Ti-1, Ti, Ti+1, ..., Tn emitting words W1, ..., Wi-1, Wi, Wi+1, ..., Wn.]
69. We Did Better
- Improved Estimation for Unsupervised Part-of-Speech Tagging (Wang & Schuurmans, 2005)
- Full/unfiltered lexicon
  - 77.2% (Banko and Moore 2004)
  - 90.5% (our model)
- Reduced/filtered lexicon
  - 95.9% (Banko and Moore 2004)
  - 94.7% (our model)
70. Query Segmentation (Bergsma & Wang 2007)
- Input: a search engine query
- Output: the query separated into phrases
- Goal: improve information retrieval
- Approach: supervised machine learning
  - A classifier makes the segmentation decisions
- Conclusion: richer features allow for large increases in segmentation performance
71. Query Segmentation
- Example query
  - two man power saw
- Output segmentations
  - several alternative bracketings of "two man power saw" into phrases, differing in where the phrase boundaries are placed
  - etc.
72. Query Segmentation
- Unsegmented
  - "two man power saw" is treated as the separate words: two, man, power, saw
73. Query Segmentation
74. Query Segmentation
75. Semi-supervised Dependency Parsing
- Unsupervised/semi-supervised dependency parsing
  - EM (expensive)
  - Using a discriminative, convex, unsupervised structured learning algorithm (Xu et al. 2006)
- Combining a supervised structured large margin loss with a cheap unsupervised least squares loss on unlabeled data (much cheaper)
76. Topic Segmentation of Web Docs
- A structured classification problem
  - Input: a document containing a sequence of k sentences
  - Output: a sequence of break decisions (each sentence boundary is a possible segmentation point)
- Goal: segment a document into a few blocks according to subtopic
- Approach: semi-supervised training (combining a supervised large-margin training loss with an unsupervised least squares loss)
77. Experimental Results
[Table: dependency accuracy on Chinese Treebank (CTB) 4.0; an undirected tree plus a root gives a directed tree.]
78. Dynamic Features
- Also known as non-local features
- Take into account the link labels of the surrounding word pairs when predicting the label of the current pair
- Commonly used in sequential labeling (McCallum et al. 2000, Toutanova et al. 2003)
- A simple but useful idea for improving parsing accuracy
  - Wang et al. 2005
  - McDonald and Pereira 2006
79. Dynamic Features
[Figure: two dependency trees, one for "I saw her duck with a telescope" and one for "I saw her duck with a spot", contrasting the attachment of "with".]
- Define a canonical order so that a word's children are generated first, before it modifies another word
- "telescope"/"spot" are the dynamic features for deciding whether to generate a link between "saw" and "with" or between "duck" and "with"
80. Results - 1
[Table 1: boosting with static features.]
81. Results - 2
[Table 2: boosting with dynamic features.]
82. Feature Representation
- Represent a word w by a feature vector
- The value of a feature c is a function of P(w, c), the probability that w and c co-occur in a context window (see the note below)
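The slide references a formula for the feature value. A common choice in Lin (1998)-style distributional word representations, and consistent with the PMI features used earlier in the deck, is the pointwise mutual information between the word and the context; this exact form is an assumption here:

$$
f_c(w) = \mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}
$$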
83. Similarity-based Smoothing
- Similarity measure: cosine
- As in (Dagan et al., 1999)