Title: Learning Structured Classifiers for Statistical Dependency Parsing
1. Learning Structured Classifiers for Statistical Dependency Parsing
- Qin Iris Wang
- Supervisors: Dekang Lin, Dale Schuurmans
- University of Alberta
- May 7th, 2008
2. Ambiguities in NLP
I saw her duck.
How about: "I saw her duck with a telescope"?
3. Dependency Trees
- A dependency tree structure for a sentence represents syntactic relations between word pairs in the sentence
- head → modifier; each word has one and only one head
(Figure: a dependency tree for "I saw her duck with a telescope".)
Over 100 possible trees!
About 1 million trees for a 20-word sentence
4. My Thesis Research
- Goal
  - Improve dependency parsing via different statistical machine learning approaches
- Method
  - Tackle the problem from different angles
  - Employ and develop advanced supervised and semi-supervised machine learning techniques
- Achievement
  - State-of-the-art accuracy for English and Chinese
5. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
6. Dependency Parsing Model
- X: an input sentence; Y: a candidate dependency tree
- (i, j): a dependency link from word i to word j
- T(X): the set of possible dependency trees over X
- Edge/link-based factorization: score(X, Y) = Σ_{(i, j) ∈ Y} w · f(i, j), where f(i, j) is a vector of features and w is a vector of feature weights
(A minimal scoring sketch follows.)
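To make the edge-factored model concrete, here is a minimal sketch in Python. The feature templates, weights, and the toy sentence are illustrative assumptions, not the exact features or implementation used in the thesis.

```python
# Minimal sketch of edge/link-factored scoring: the tree score is the sum of
# per-link scores, each a dot product of a weight vector and a feature vector.

from typing import Dict, List, Tuple

def arc_features(words: List[str], head: int, mod: int) -> List[str]:
    """Features for a single dependency link (head -> modifier)."""
    return [
        f"pair={words[head]}_{words[mod]}",   # word-pair indicator
        f"dist={head - mod}",                 # signed distance between the words
    ]

def tree_score(words: List[str],
               tree: List[Tuple[int, int]],
               w: Dict[str, float]) -> float:
    """score(X, Y) = sum over links (i, j) in Y of w . f(i, j)."""
    return sum(w.get(feat, 0.0)
               for (h, m) in tree
               for feat in arc_features(words, h, m))

# Toy usage: "I saw her" with links saw->I and saw->her (0-indexed).
words = ["I", "saw", "her"]
tree = [(1, 0), (1, 2)]
weights = {"pair=saw_I": 1.2, "pair=saw_her": 0.8, "dist=1": 0.1, "dist=-1": 0.1}
print(tree_score(words, tree, weights))
```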
7. Features for an Arc
Lots of features, and very sparse!
- Word-pair indicator
- Part-of-speech (POS) tags of the word pair
- Pointwise mutual information (PMI) for that word pair
- Distance between the words
- ...
Abstraction or smoothing (e.g., the POS and PMI features) helps with this sparsity; a feature sketch follows.
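A hedged sketch of the arc feature templates listed above (word-pair indicator, POS pair, PMI, distance). The POS tags and the PMI table below are made up for illustration, apart from the PMI value 0.27 that appears later in the slides.

```python
# Sparse feature vector for one candidate link, combining lexical indicators
# with more abstract (POS, PMI, distance) features.

from typing import Dict, List, Tuple

def arc_feature_vector(words: List[str],
                       tags: List[str],
                       pmi: Dict[Tuple[str, str], float],
                       head: int,
                       mod: int) -> Dict[str, float]:
    """Return a sparse feature vector (name -> value) for one candidate link."""
    w_h, w_m = words[head], words[mod]
    return {
        f"pair={w_h}_{w_m}": 1.0,               # word-pair indicator (sparse)
        f"pos={tags[head]}_{tags[mod]}": 1.0,   # POS pair (abstraction)
        "pmi": pmi.get((w_h, w_m), 0.0),        # real-valued PMI (smoothing)
        f"dist={abs(head - mod)}": 1.0,         # distance between the words
    }

words = ["I", "skipped", "school", "regularly"]
tags = ["PRP", "VBD", "NN", "RB"]
pmi = {("skipped", "regularly"): 0.27}
print(arc_feature_vector(words, tags, pmi, 1, 3))
```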
8. My Work on Dependency Parsing
- 1. Strictly Lexicalized Dependency Parsing (Wang et al. 2005)
  - MLE with similarity-based smoothing
- 2. Improving Large Margin Training (Wang et al. 2006)
  - Local constraints capture local errors in a parse tree
  - Laplacian regularization: enforce similar links to have similar weights (similarity smoothing)
  - Semi-supervised large margin training
- 3. Structured Boosting (Wang et al. 2007)
  - Global optimization, efficient and flexible
- 4. Semi-supervised Convex Training (Wang et al. 2008)
  - Combine a large margin loss with a least squares loss
9. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
10. Strictly Lexicalized Dependency Parsing
-- IWPT 2005
11. Contributions
- All features are based on word statistics; no POS tags or grammatical categories are needed
- Similarity-based smoothing is used to deal with data sparseness
12. POS Tags for Handling Sparseness in Parsing
- All previous parsers use a POS lexicon
- Natural language data is sparse
  - Bikel (2004) found that, of all the needed bigram statistics, only 1.49% were observed in the Penn Treebank
- Words belonging to the same POS are expected to have the same syntactic behavior
13. An Alternative Method to Deal with Sparseness
- Distributional word similarities
  - Distributional Hypothesis: words that appear in similar contexts have similar meanings (Harris, 1968)
  - Soft clusters of words, whereas POS tags are hard clusters of words
- Advantages of using similarity smoothing
  - Computed automatically from raw text
  - Makes the construction of a treebank easier
14. Similarity-based Smoothing
- Goal: estimate a probability for the context (skipped, regularly), using the distributionally similar words of each word:
- S(regularly): frequently 0.365862, routinely 0.286178, periodically 0.273665, often 0.24077, constantly 0.234693, occasionally 0.226324, who 0.200348, continuously 0.194026, repeatedly 0.177434
- S(skipped): skipping 0.229951, skip 0.197991, skips 0.169982, sprinted 0.140535, bounced 0.139547, missed 0.134966, cruised 0.133933, scooted 0.13387, jogged 0.133638
15. Similarity-based Smoothing
- For the word pair (skipped, regularly), similar contexts are word pairs actually seen in the training data, e.g.:
  - (skip, frequently), (skip, routinely), (skip, repeatedly), (bounced, often), (bounced, who), (bounced, repeatedly)
16. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
17. Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization
-- CoNLL 2006
18. Contributions
- Large margin vs. generative training
- Included distance and PMI features, but fewer dynamic features
- Local constraints to capture local errors in a parse tree
- Laplacian regularization (based on distributional word similarity) to deal with data sparseness
- Semi-supervised large margin training
19. (Existing) Large Margin Training
Exponentially many constraints!
- However,
  - Exponential number of constraints
  - The loss ignores the local errors of the parse tree
  - A large number of bi-lexical features, over-fitting the training corpus
20. Local Constraints (An Example)
(Figure: the sentence "The boy skipped school regularly" with its gold dependency links and the per-link loss of an incorrect parse.)
Each correct link should out-score an incorrect link that shares a node with it, by a margin of 1 (a sketch that generates such constraints follows):
- score(The, boy) ≥ score(The, skipped) + 1
- score(boy, skipped) ≥ score(The, skipped) + 1
- score(skipped, school) ≥ score(school, regularly) + 1
- score(skipped, regularly) ≥ score(school, regularly) + 1
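A hedged sketch of how such local constraints could be enumerated: each correct link is paired with incorrect links that share a word with it, and is required to out-score them by a margin of 1. The exact pairing used in the thesis may be more restrictive.

```python
# Enumerate (correct_link, incorrect_link) pairs, each standing for the
# constraint score(correct) >= score(incorrect) + 1.

from typing import List, Set, Tuple

Link = Tuple[int, int]   # (head index, modifier index)

def local_constraints(n_words: int, gold: Set[Link]) -> List[Tuple[Link, Link]]:
    constraints = []
    all_links = {(h, m) for h in range(n_words) for m in range(n_words) if h != m}
    for correct in gold:
        for wrong in all_links - gold:
            # keep only incorrect links that share a word with the correct link
            if set(correct) & set(wrong):
                constraints.append((correct, wrong))
    return constraints

# "The(0) boy(1) skipped(2) school(3) regularly(4)"
gold = {(1, 0), (2, 1), (2, 3), (2, 4)}   # boy->The, skipped->boy/school/regularly
print(len(local_constraints(5, gold)))
```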
21. Laplacian Regularization
- Enforce similar links (word pairs) to have similar weights (see the sketch below)
- S: similarity matrix of the word pairs
- D(S): a diagonal matrix with D(S)_ii = Σ_j S_ij
- L(S): the Laplacian matrix of S, L(S) = D(S) - S
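A minimal sketch of the Laplacian regularizer, assuming the standard quadratic form w^T L(S) w over the bi-lexical feature weights; numpy is used purely for illustration.

```python
# w^T L(S) w = (1/2) * sum_{p,q} S[p][q] * (w[p] - w[q])^2, which is small when
# similar word pairs (high S[p][q]) receive similar weights.

import numpy as np

def laplacian(S: np.ndarray) -> np.ndarray:
    """L(S) = D(S) - S, with D(S) diagonal and D_ii = sum_j S_ij."""
    return np.diag(S.sum(axis=1)) - S

def laplacian_penalty(w: np.ndarray, S: np.ndarray) -> float:
    """Quadratic penalty w^T L(S) w on the bi-lexical feature weights."""
    return float(w @ laplacian(S) @ w)

# Three bi-lexical features; the first two correspond to similar word pairs.
S = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
print(laplacian_penalty(np.array([1.0, 1.1, -0.5]), S))   # small: similar weights
print(laplacian_penalty(np.array([1.0, -1.0, -0.5]), S))  # larger: dissimilar weights
```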
22. Refined Large Margin Objective
- The Laplacian regularization term applies only to the bi-lexical features
- Polynomially many constraints!
23. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
24. Simple Training of Dependency Parsers via Structured Boosting
-- IJCAI 2007
25. Contributions
- Structured boosting: a simple approach to training structured classifiers by applying a boosting-like procedure to standard supervised training methods
- Advantages
  - Global training
  - Simple and efficient
  - General; can be easily applied to other tasks
  - Successfully applied to dependency parsing
26. Local Training Examples
- Given training data (X, Y), generate one local example per word pair (a sketch of this step follows the table)

Word-pair          Link-label  Instance_weight  Features
The-boy            L           1                W1_The, W2_boy, W1W2_The_boy, T1_DT, T2_NN, T1T2_DT_NN, Dist_1, ...
boy-skipped        L           1                W1_boy, W2_skipped, ...
skipped-school     R           1                W1_skipped, W2_school, ...
skipped-regularly  R           1                W1_skipped, W2_regularly, ...
The-skipped        N           1                W1_The, W2_skipped, ...
The-school         N           1                W1_The, W2_school, ...

L = left link, R = right link, N = no link
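A hedged sketch of how the local training examples above could be generated: every word pair becomes a 3-class instance (L, R, or N). The feature names follow the W1_/W2_/T1_/T2_/Dist_ templates from the table; the exact template set in the thesis may differ.

```python
# Build one weighted 3-class instance per word pair of a sentence.

from typing import Dict, List, Set, Tuple

def local_examples(words: List[str], tags: List[str],
                   gold: Set[Tuple[int, int]]) -> List[Dict]:
    """gold contains (head, modifier) index pairs."""
    examples = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if (j, i) in gold:
                label = "L"          # link points left: the right word is the head
            elif (i, j) in gold:
                label = "R"          # link points right: the left word is the head
            else:
                label = "N"          # no link between the two words
            feats = [f"W1_{words[i]}", f"W2_{words[j]}",
                     f"W1W2_{words[i]}_{words[j]}",
                     f"T1_{tags[i]}", f"T2_{tags[j]}",
                     f"T1T2_{tags[i]}_{tags[j]}", f"Dist_{j - i}"]
            examples.append({"pair": (words[i], words[j]), "label": label,
                             "weight": 1.0, "features": feats})
    return examples

words = ["The", "boy", "skipped", "school", "regularly"]
tags = ["DT", "NN", "VBD", "NN", "RB"]
gold = {(1, 0), (2, 1), (2, 3), (2, 4)}
for ex in local_examples(words, tags, gold)[:3]:
    print(ex["pair"], ex["label"])
```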
27. Local Training Methods
- Learn a local link classifier given a set of features defined on the local examples
- For each word pair in a sentence: no link, left link, or right link? (3-class classification)
- However, the parameters of the local model are not trained to consider global parsing accuracy
28. Global Training for Parsing
- Incorporate the effects of the parser directly into the training algorithm, i.e., directly capture the relations between the links of an output tree
- Unfortunately, available structured training techniques
  - are expensive, specialized, and complex to implement
  - require a lot of effort to apply to parsing
We need efficient global training algorithms!
29. Structured Boosting for Dependency Parsing
Global training and efficient?
(Diagram: training sentences (X, Y) → local training examples → local link classifier c (c1, c2, c3, ..., ck) → link scores → dependency parsing algorithm → dependency trees → compare with the gold-standard trees → re-weight the mis-parsed examples → retrain the classifier.)
30. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
31. Semi-supervised Convex Training for Dependency Parsing
-- ACL 2008
32. Contributions
- Combined a structured large margin loss on labeled data with a least squares loss on unlabeled data
- Obtained an efficient, convex, semi-supervised large margin training algorithm for learning dependency parsers
33. More Data Is Better Data
- The Penn Treebank (supervised learning)
  - 4.5 million words
  - About 200 thousand sentences
  - Annotation: 30 person-minutes per sentence
  - Limited and expensive!
- Raw text data (semi-/unsupervised learning)
  - News wire
  - Wikipedia
  - Web resources
  - ...
  - Plentiful and free
34. Semi-supervised Structured SVM
- Non-convex! Both the feature weights and the dependency trees of the unlabeled sentences are optimization variables
35. Our Approach
- Convex!
36. Efficient Optimization Algorithm
- Uses stochastic gradient steps
- Parameters are updated locally on each labeled and unlabeled sentence
- The subproblem over the (relaxed) tree structure of an unlabeled sentence is solved by calling CPLEX
- The objective is globally minimized after a few iterations
(The slide shows the gradient on labeled sentence i and the gradient on unlabeled sentence j; a schematic sketch follows.)
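A very schematic sketch of the stochastic update loop, under strong assumptions: labeled sentences contribute a structured hinge (sub)gradient, and unlabeled sentences contribute a least-squares gradient that pulls the edge scores toward the currently inferred relaxed tree. The toy stand-ins below are not the thesis implementation, which solves the tree subproblem exactly (via CPLEX).

```python
# Alternate stochastic updates on labeled and unlabeled sentences.

import numpy as np

def sgd_step_labeled(w, feats_gold, feats_best_wrong, margin_violation, lr):
    """Subgradient of the structured hinge loss for one labeled sentence."""
    if margin_violation > 0:
        w = w + lr * (feats_gold - feats_best_wrong)
    return w

def sgd_step_unlabeled(w, arc_feats, relaxed_tree, lr):
    """Gradient of a least-squares loss ||relaxed_tree - arc_feats @ w||^2."""
    residual = relaxed_tree - arc_feats @ w
    return w + lr * (arc_feats.T @ residual)

rng = np.random.default_rng(0)
w = np.zeros(4)
for it in range(10):
    # one labeled "sentence": toy gold / best-wrong feature vectors
    w = sgd_step_labeled(w, rng.normal(size=4), rng.normal(size=4),
                         margin_violation=1.0, lr=0.1)
    # one unlabeled "sentence": 3 candidate arcs with 4 features each
    w = sgd_step_unlabeled(w, rng.normal(size=(3, 4)),
                           relaxed_tree=rng.uniform(size=3), lr=0.05)
print(w)
```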
37. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
38. Experimental Design
- Data sets (Linguistic Data Consortium), split into training, development, and test sets
  - English
    - PTB: 50K sentences (standard split)
  - Chinese
    - CTB4: 15K sentences (split as in Wang et al. 2005)
    - CTB5: 19K sentences (split as in Corston-Oliver et al. 2006)
- Features
  - Word-pair indicator, POS-pair, PMI, context features, distance
39. Experimental Design (Cont.)
- Dependency parsing algorithm
  - A CKY-like chart parser
- Evaluation measure
  - Dependency accuracy: the percentage of words that have the correct head (a small example follows)
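A minimal sketch of the evaluation measure: dependency accuracy is the percentage of words whose predicted head matches the gold head. The head encoding below (0 for the artificial root) is an assumption for illustration.

```python
# Dependency accuracy = 100 * (# words with correct head) / (# words).

from typing import List

def dependency_accuracy(pred_heads: List[int], gold_heads: List[int]) -> float:
    """Heads are given per word; index 0 denotes the artificial root."""
    assert len(pred_heads) == len(gold_heads)
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return 100.0 * correct / len(gold_heads)

# "The boy skipped school regularly": gold heads are boy, skipped, root, skipped, skipped
gold = [2, 3, 0, 3, 3]
pred = [2, 3, 0, 5, 3]   # one wrong head
print(dependency_accuracy(pred, gold))  # 80.0
```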
40. Results - 1 (IWPT 2005)
Evaluation results on CTB 4.0 (%)
(Table: accuracy for an undirected tree vs. a directed tree with a root.)
41. Results - 2 (CoNLL 2006)
Evaluation results on CTB4-10 (%), with a much simpler feature set
42. Results - 3 (IJCAI 2007)
Evaluation results on Chinese and English (%)
43. Results - 4 (ACL 2008)
Evaluation results on Chinese and English (%)
44. Comparison with the State of the Art
(Tables: the IWPT 2005 and IJCAI 2007 results compared with prior work on Chinese Treebank 4.0 and Chinese Treebank 5.0.)
45. Overview
- Dependency parsing model
- Learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Results
- Conclusion
46. Conclusion
- I have developed several statistical approaches to learning dependency parsers
  - Strictly Lexicalized Dependency Parsing
  - Extensions to Large Margin Training
  - Training Dependency Parsers via Structured Boosting
  - Semi-supervised Convex Training for Dependency Parsing
- Achieved state-of-the-art accuracy for English and Chinese
47. Thanks!
48. (Diagram: the thesis connects NLP and machine learning; features, linguistic intuitions, and models on the NLP side, and training criteria, regularization, and smoothing on the machine learning side.)
49. Ambiguities in NLP
I like eating sushi with tuna. (Courtesy of Aravind Joshi)
50. Dependency Trees vs. Constituency Trees
(Figure: a constituency tree for "Mike ate the cake": S → NP VP; NP → N (Mike); VP → V (ate) NP; NP → Dt (the) N (cake).)
51. Dependency Tree
- A dependency tree structure for a sentence represents syntactic relationships between word pairs in the sentence
(Figure: a labeled dependency tree for "I like eating sushi with tuna", with subj, obj, and mod links.)
52. Dependency Parsing
- An increasingly active research area (Yamada & Matsumoto 2003; McDonald et al. 2005; McDonald & Pereira 2006; Corston-Oliver et al. 2006; Smith & Eisner 2005, 2006; Wang et al. 2006, 2007, 2008; Koo et al. 2008)
- Dependency trees are much easier to understand and annotate than other syntactic representations
- Dependency relations have been widely used in
  - Machine translation (Fox 2002; Cherry & Lin 2003; Ding & Palmer 2005)
  - Information extraction (Culotta & Sorensen 2004)
  - Question answering (Pinchak & Lin 2006)
  - Coreference resolution (Bergsma & Lin 2006)
53. Scoring Functions
- The score of a link from word i to word j is the dot product of a vector of feature weights and a vector of features: score(i, j) = w · f(i, j)
- Feature weights can be learned via either a local or a global training approach
54. Score of Each Link / Word Pair
- The score of each link is based on the features
- Consider the word pair (skipped, regularly):
  - word-pair indicator (skipped, regularly) = 1
  - POS-pair indicator (skipped, regularly) = (VBD, RB) = 1
  - PMI(skipped, regularly) = 0.27
  - dist(skipped, regularly) = 2, dist²(skipped, regularly) = 4
55. However,
- POS tags are not part of natural text
  - They need to be annotated by human effort
  - They introduce more noise into the training data
- For some languages, such as Chinese or Japanese, POS tags are not clearly defined
  - A single word is often combined with other words
Can we use another smoothing technique rather than POS tags?
56. Feature Representation
- Represent a word w by a feature vector
- The value of a feature c is derived from P(w, c), the probability that w and c co-occur in a context window
57. Similarity-based Smoothing
- Similarity measure: cosine, as in (Dagan et al., 1999); a small sketch follows
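A hedged sketch of the distributional similarity computation: each word is represented by a vector of context features and similarity is the cosine of the two vectors. Using raw co-occurrence weights as feature values is a simplification; the toy vectors below are invented.

```python
# Cosine similarity between sparse context-feature vectors.

import math
from typing import Dict

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(u[c] * v.get(c, 0.0) for c in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy context vectors (context word -> co-occurrence weight)
skipped = {"class": 0.4, "school": 0.5, "meeting": 0.3}
missed  = {"class": 0.3, "school": 0.4, "bus": 0.5}
print(cosine(skipped, missed))
```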
58. Comparison with an Unlexicalized Model
- In the unlexicalized model, the input to the parser is the sequence of POS tags, the opposite of our model
- Using gold-standard POS tags:
  - Accuracy of the unlexicalized model: 71.1%
  - Our strictly lexicalized model: 79.9%
59. Large Margin Training
- Minimizing a regularized loss (Hastie et al., 2004)
- Notation: i indexes the training sentences; T_i is the target tree; L_i is a candidate tree; Δ(T_i, L_i) is the distance between the two trees
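A hedged reconstruction of the regularized structured large margin objective implied by the notation above; the exact formulation in the thesis may differ in the regularizer and loss scaling:

```latex
\min_{\mathbf{w}} \;\; \frac{\beta}{2}\,\|\mathbf{w}\|^{2}
\;+\; \sum_{i} \max_{L_i \in \mathcal{T}(X_i)}
\Big[\, \Delta(T_i, L_i) \;-\; \mathbf{w} \cdot \big(\mathbf{f}(X_i, T_i) - \mathbf{f}(X_i, L_i)\big) \,\Big]
```

Here β trades off the regularizer against the margin loss, and the max term is non-negative because choosing L_i = T_i makes it zero.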
60. Objective with Local Constraints
- The corresponding new quadratic program
- Polynomially many constraints!
- j: the number of constraints in A
61. Combining Local Training with a Parsing Algorithm
62. Standard Boosting for Classification
(Diagram: training examples → local predictor h; at each round, increase the weight of mis-classified examples and retrain, producing h1, h2, h3, ..., hk.)
63. Structured Boosting
- Train a local link predictor, h1
- Re-parse the training data using h1
- Re-weight the local examples
  - Compare the parser outputs with the gold-standard trees
  - Increase the weight of mis-parsed local examples
- Re-train the local link predictor, getting h2
- Finally we have h1, h2, ..., hk (a sketch of this loop follows)
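A hedged sketch of the structured boosting loop above. The local link classifier and the parser are toy stand-ins (a weighted count model and a greedy head picker); the thesis uses a real link classifier and a dependency parsing algorithm, but the re-weighting structure is the same idea.

```python
# Train, re-parse, re-weight mis-parsed local examples, repeat.

from collections import defaultdict
from typing import Dict, List, Tuple

Sentence = List[str]
Tree = Dict[int, int]                      # modifier index -> head index

def train_local(examples) -> Dict:
    """Toy classifier: weighted vote for/against each (head_word, mod_word)."""
    scores = defaultdict(float)
    for h_word, m_word, label, weight in examples:
        scores[(h_word, m_word)] += weight * (1.0 if label else -1.0)
    return scores

def parse(words: Sentence, scores: Dict) -> Tree:
    """Stand-in parser: greedily pick the highest-scoring head for each word."""
    return {m: max((h for h in range(len(words)) if h != m),
                   key=lambda h: scores.get((words[h], words[m]), 0.0))
            for m in range(1, len(words))}  # word 0 is an artificial root

def structured_boosting(data: List[Tuple[Sentence, Tree]], rounds: int = 3):
    weights = defaultdict(lambda: 1.0)      # instance weight per (sent, head, mod)
    scores = {}
    for _ in range(rounds):
        examples = [(s[h], s[m], 1 if gold[m] == h else 0, weights[(i, h, m)])
                    for i, (s, gold) in enumerate(data)
                    for m in gold for h in range(len(s)) if h != m]
        scores = train_local(examples)      # (re-)train the local classifier
        for i, (s, gold) in enumerate(data):
            pred = parse(s, scores)         # re-parse the training data
            for m in gold:
                if pred[m] != gold[m]:      # boost the mis-parsed examples
                    weights[(i, gold[m], m)] *= 2.0
                    weights[(i, pred[m], m)] *= 2.0
    return scores

data = [(["<root>", "I", "skipped", "school"], {1: 2, 2: 0, 3: 2})]
print(structured_boosting(data))
```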
64. Dynamic Features
- Also known as non-local features
- Take into account the link labels of the surrounding word pairs when predicting the label of the current pair
- Commonly used in sequential labeling (McCallum et al. 2000; Toutanova et al. 2003)
- A simple but useful idea for improving parsing accuracy
  - Wang et al. 2005
  - McDonald and Pereira 2006
65. Dynamic Features
(Figure: two parses, "I saw her duck with a telescope" and "I saw her duck with a spot", where "with" attaches either to "saw" or to "duck".)
- Define a canonical order so that a word's children are generated first, before it modifies another word
- "telescope"/"spot" are the dynamic features for deciding whether to generate a link between "saw" and "with" or between "duck" and "with"
66. Results - 1
(Table 1: Boosting with static features.)
67. Variants of Structured Boosting
- Using alternative boosting algorithms for structured boosting
  - AdaBoost.M2 (Freund & Schapire, 1997): re-weighting class labels
  - A logistic-regression form of boosting (Collins et al., 2002)
68. Structured Boosting (An Example)
(Figure: the gold tree for "I saw her duck with a telescope" alongside the parser's output at successive iterations. At each iteration, the instance weights of mis-parsed examples are increased ("Try harder!"); by iteration T the parser's output matches the gold tree ("Good job!").)
69. Similarity Between Word Pairs
- Similarity between two words, Sim(w1, w2) (Lin, 1998)
  - Construct a feature vector for w that contains the set of words occurring within a small context window
  - Compute the similarity between the two feature vectors using the cosine measure
  - e.g., Sim(skipped, missed) = 0.134966
- Similarity between word pairs: the geometric average (a small sketch follows), e.g.
  - Sim((skipped, regularly), (missed, often)) = sqrt(Sim(skipped, missed) × Sim(regularly, often))
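A small sketch of the word-pair similarity as the geometric average of the component word similarities. The value for (skipped, missed) is taken from this slide and the value for (regularly, often) from the earlier similarity list; any other pair would need to be looked up.

```python
# Sim((w1, w2), (w1', w2')) = sqrt(Sim(w1, w1') * Sim(w2, w2')).

import math
from typing import Dict, Tuple

word_sim: Dict[Tuple[str, str], float] = {
    ("skipped", "missed"): 0.134966,   # from the slide
    ("regularly", "often"): 0.24077,   # from the similarity list for "regularly"
}

def pair_sim(p: Tuple[str, str], q: Tuple[str, str]) -> float:
    return math.sqrt(word_sim.get((p[0], q[0]), 0.0) *
                     word_sim.get((p[1], q[1]), 0.0))

print(pair_sim(("skipped", "regularly"), ("missed", "often")))
```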
70. A Generative Parsing Model
(Figure: a dependency tree over the sentence "The(1) kid(2) skipped(3) school(4) regularly(5)", with an artificial root at position 0.)
71. Similarity-based Smoothing
(Figure: the MLE-based probability for the context (skipped, regularly) is replaced by a similarity-based probability computed over similar contexts of C, i.e., pairs seen in the training data such as (skip, frequently), (skip, routinely), (skip, repeatedly), (bounced, often), (bounced, who), (bounced, repeatedly).)
72. Similarity-based Smoothing
P(E | C) = α · P_MLE(E | C) + (1 - α) · P_SIM(E | C)
- |C| is the frequency count of the corresponding context C in the training data, e.g., |(skipped, regularly)| = 1, |(The, kid)| = 95 (a small sketch follows)
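A hedged sketch of the interpolated estimate. The exact form of the mixing weight α in the thesis is not reproduced here; the sketch simply assumes α grows with the frequency count of the context, which is the usual behaviour of such smoothing schemes.

```python
# P(E|C) = alpha * P_MLE(E|C) + (1 - alpha) * P_SIM(E|C)

def smoothed_prob(p_mle: float, p_sim: float, context_count: int,
                  k: float = 5.0) -> float:
    alpha = context_count / (context_count + k)   # assumed form of alpha
    return alpha * p_mle + (1 - alpha) * p_sim

# Rare context (seen once): lean on the similarity-based estimate.
print(smoothed_prob(p_mle=1.0, p_sim=0.2, context_count=1))
# Frequent context (seen 95 times): trust the MLE estimate.
print(smoothed_prob(p_mle=0.6, p_sim=0.2, context_count=95))
```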
73. Dependency Parsing Algorithms
- Use a constituency parsing algorithm
  - Simply treat dependencies as constituents
  - With O(n^5) complexity
- Eisner's dependency parsing algorithm (Eisner, 1996), with O(n^3) complexity
  - It stores spans instead of subtrees
  - Only the end-words are active (still need a head)
74. Existing Semi-/Unsupervised Algorithms
- EM / self-training
  - Local minima
  - The disconnect between likelihood and accuracy
  - The same mistakes can be amplified at the next iteration
- Standard semi-supervised SVM
  - Non-convex objective on the unlabeled data
  - Available solutions are sophisticated and expensive (e.g., Xu et al. 2006)
75. Structured Boosting
- A simple variant of standard boosting algorithms, e.g., AdaBoost.M1 (Freund & Schapire, 1997)
- Global optimization
- As efficient as local methods
- General: can use any local classifier
- Also, can be easily applied to other tasks
76. Dependency Trees
- A dependency tree structure for a sentence represents syntactic relations between word pairs in the sentence
(Figure: a labeled dependency tree for "I saw her duck with a telescope", with subj, obj, gen, det, and mod links.)
Over 100 possible trees!
About 1 million trees for a 20-word sentence
77. Constraints on Y
- Y is represented as an adjacency matrix
- Rows denote heads and columns denote children
- Constraints on Y (a sketch that checks them follows)
  - All entries in Y are between 0 and 1
  - The sum over all entries in each column is 1 (one-head rule)
  - All entries on the diagonal are zeros (no-self-link rule)
  - Anti-symmetric rule: a link cannot go in both directions between the same two words
  - Connectedness (no-cycle rule)
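A hedged sketch that checks the relaxed adjacency-matrix constraints listed above. The treatment of the artificial root column and the encoding of the anti-symmetric rule are assumptions; the connectedness/no-cycle condition is omitted because its exact encoding is not shown on the slide.

```python
# Check entries in [0, 1], one head per word, zero diagonal, no double links.

import numpy as np

def check_relaxed_tree(Y: np.ndarray, tol: float = 1e-6) -> bool:
    """Y is square over tokens; index 0 is assumed to be an artificial root,
    so the one-head rule is applied to every column except the root's own."""
    ok = bool(np.all(Y >= -tol) and np.all(Y <= 1 + tol))          # entries in [0, 1]
    ok &= bool(np.allclose(Y[:, 1:].sum(axis=0), 1.0, atol=tol))   # one head per word
    ok &= bool(np.allclose(np.diag(Y), 0.0, atol=tol))             # no self-links
    ok &= bool(np.all(Y + Y.T <= 1 + tol))                         # not both directions
    return ok

# Tokens: <root>(0) I(1) saw(2) her(3); links: root->saw, saw->I, saw->her.
Y = np.zeros((4, 4))
Y[0, 2] = Y[2, 1] = Y[2, 3] = 1.0
print(check_relaxed_tree(Y))   # True
```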
78. Structured Boosting (An Example)
(Figure: for "I saw her duck with a telescope", the weights of the local examples; the instance weights of the saw-with and duck-with examples plotted over boosting iterations 1, 2, 3, ..., T.)
79. Local Constraints
(Figure: a correct link and a missing link that share a common node, w1.)
Convex!
80. Future Work - 1
- Using alternative boosting algorithms for structured boosting
  - AdaBoost.M2 (Freund & Schapire, 1997): re-weighting class labels
  - A logistic-regression form of boosting (Collins et al., 2002)
81. Future Work - 2
- Multilingual dependency parsing
  - Apply our techniques to other languages, such as Czech, German, Spanish, and French
82. Future Work - 3
- Domain adaptation
  - Apply my parsers to other domains (e.g., biomedical data)
  - Lack of annotated resources in these domains
  - Related work: Blitzer et al. 2006; McClosky et al. 2006