Title: Large Margin Dependency Parsing: Local Constraints and Laplacian Regularization
1. Large Margin Dependency Parsing: Local Constraints and Laplacian Regularization
- Qin Iris Wang
- Colin Cherry
- Dan Lizotte
- Dale Schuurmans
- University of Alberta
- {wqin, colinc, dlizotte, dale}@cs.ualberta.ca
2. Large Margin Training in Parsing
- Discriminative training in parsing
- Taskar et al. 2004, Tsochantaridis et al. 2004, McDonald et al. 2005a
- State-of-the-art performance in dependency parsing
- McDonald et al. 2005a, 2005b, 2006
- But they didn't consider
- The error of any particular component in a tree (only the global loss of the whole parse tree)
- Smoothing methods
3. Our Contributions
- Two ideas for improving large margin training
- Using local constraints to capture local errors in a parse tree
- Using Laplacian regularization (based on distributional word similarity) to deal with data sparseness
4. Outline
- Dependency parsing model
- Large margin training
- Training with local constraints
- Laplacian regularization
- Experimental results
- Related work and conclusions
5. Dependency Tree
- A dependency tree structure for an example sentence
- Syntactic relationships between word pairs in a sentence
6. Dependency Parsing Model
- W = (w1, ..., wn): an input sentence
- T: a candidate dependency tree
- T(W): the set of possible dependency trees spanning W
- Eisner 1996; McDonald et al. 2005 (scoring model sketched below)
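A sketch of the edge-factored scoring model these definitions suggest (the dot-product form and the feature map notation f are assumptions, following McDonald-style arc-factored parsing):

    s(w_i, w_j) = \theta \cdot f(w_i, w_j)
    score(W, T) = \sum_{(w_i, w_j) \in T} \theta \cdot f(w_i, w_j)
    T^{*} = \mathrm{argmax}_{T \in \mathcal{T}(W)} \; score(W, T)

The argmax over all spanning trees can be computed with Eisner's (1996) dynamic program, as in McDonald et al. (2005).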
7. Features for an Arc
- Lots! (a sketch of the resulting feature vector follows below)
- Word pair indicator
- Pointwise Mutual Information (PMI) for that word pair
- Distance between the words
- No part-of-speech features
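A rough sketch of the feature vector for an arc (w_i, w_j) implied by the list above (whether the PMI and distance values are used directly or binned into indicator features is an assumption):

    f(w_i, w_j) = \big[\, \mathbb{1}[\text{pair} = (w_i, w_j)],\; \mathrm{PMI}(w_i, w_j),\; \mathrm{dist}(w_i, w_j) \,\big]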
8. Score of Each Word Pair
[Figure: dependency tree for the example sentence "The boy skipped school regularly."]
- The score of each word pair is based on the features
- Consider the word pair (skipped, regularly)
- PMI(skipped, regularly) = 0.27
- dist(skipped, regularly) = 2
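Plugging these values into the arc score from slide 6 (how the indicator and real-valued features are combined is an assumption):

    s(\mathrm{skipped}, \mathrm{regularly}) = \theta \cdot f(\mathrm{skipped}, \mathrm{regularly}),
    \quad \text{where } f \text{ encodes the pair identity, } \mathrm{PMI} = 0.27 \text{ and } \mathrm{dist} = 2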
9. Outline
- Dependency parsing model
- Large margin training
- Training with local constraints
- Laplacian regularization
- Experimental results
- Related work and conclusions
10. Large Margin Training
- Minimizing a regularized loss (Hastie et al., 2004), sketched below
- i: the index of the training sentences
- Ti: the target tree for sentence i
- Li: a candidate tree
- Δ(Ti, Li): the distance between the two trees
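A reconstruction of the regularized loss this slide appears to describe, using the notation above (the form of the margin term follows standard structured large-margin training and is stated here as an assumption):

    \min_{\theta} \; \frac{\beta}{2} \|\theta\|^{2} + \sum_{i} \max_{L_i \in \mathcal{T}(W_i)} \Big[ \Delta(T_i, L_i) + score(W_i, L_i) - score(W_i, T_i) \Big]

Since L_i = T_i makes the bracketed term zero, each max is non-negative, so this is a margin-rescaled hinge loss over whole trees.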
11. Large Margin Training
- Equivalent to solving the quadratic program of McDonald et al. 2005 (sketched below)
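A sketch of that quadratic program (the slack variables ξ_i are an assumption, following the standard max-margin reformulation):

    \min_{\theta, \xi \ge 0} \; \frac{\beta}{2} \|\theta\|^{2} + \sum_{i} \xi_i
    \quad \text{s.t.} \quad score(W_i, T_i) - score(W_i, L_i) \ge \Delta(T_i, L_i) - \xi_i
    \quad \forall i,\; \forall L_i \in \mathcal{T}(W_i)

One constraint per candidate tree L_i gives the exponential constraint set noted on the next slide.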
12. However...
- Exponential number of constraints
- The loss ignores the local errors within the parse tree
- Over-fitting the training corpus
- A large number of bi-lexical features, so a good smoothing (regularization) method is needed
13. Outline
- Dependency parsing model
- Large margin training
- Training with local constraints
- Laplacian regularization
- Experimental results
- Related work and conclusions
14. Local Constraints (an example)
[Figure: the sentence "The boy skipped school regularly" with candidate dependency arcs numbered 1-6]
score(The, boy) > score(The, skipped) + 1
score(boy, skipped) > score(The, skipped) + 1
score(skipped, school) > score(school, regularly) + 1
score(skipped, regularly) > score(school, regularly) + 1
15. Local Constraints
Convex!
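The general constraint form did not transcribe; a sketch consistent with the example on the previous slide, comparing a correct arc of the target tree against a competing incorrect arc with a margin of 1 and a slack variable (the exact rule for which incorrect arcs are paired with which correct arcs is an assumption):

    \theta \cdot f(w_a, w_b) \ge \theta \cdot f(w_c, w_d) + 1 - \xi_{i,k},
    \qquad (w_a, w_b) \in T_i,\;\; (w_c, w_d) \notin T_i,\;\; \xi_{i,k} \ge 0

Each constraint is linear in θ, so the training problem remains convex.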
16. Objective with Local Constraints
- The corresponding new quadratic program (sketched below)
- Polynomial number of constraints!
- j: the number of constraints in A
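A sketch of the resulting quadratic program, with the exponential tree constraints replaced by the local constraints above (indexing the slacks by sentence i and constraint k in a set A_i is an assumption):

    \min_{\theta, \xi \ge 0} \; \frac{\beta}{2} \|\theta\|^{2} + \sum_{i} \sum_{k \in A_i} \xi_{i,k}
    \quad \text{s.t.} \quad \theta \cdot f(w_a, w_b) - \theta \cdot f(w_c, w_d) \ge 1 - \xi_{i,k}
    \quad \forall i,\; \forall k \in A_i

The number of constraints is now polynomial in sentence length rather than exponential.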
17. Outline
- Dependency parsing model
- Large margin training
- Training with local constraints
- Laplacian regularization
- Experimental results
- Related work and conclusions
18. Distributional Word Similarity
- Words that tend to appear in the same contexts tend to have similar meanings (Harris, 1968)
- Represent a word by a feature vector of contexts
- Similarity of two words: cosine similarity of their vectors
- Similarity between word pairs (one possible form sketched below)
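The word-pair similarity formula did not transcribe; one natural construction, stated here only as an assumption, combines the two word-level cosine similarities multiplicatively:

    S\big((u, v), (u', v')\big) = \cos(f_u, f_{u'}) \cdot \cos(f_v, f_{v'})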
19. Laplacian Regularization
- Enforce similar links (word pairs) to have similar weights (see the quadratic form below)
- L(S) = D(S) - S
- S: the similarity matrix of word pairs
- D(S): a diagonal matrix containing the row sums of S
- L(S): the Laplacian matrix of S
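The connection between the Laplacian and the "similar links get similar weights" idea can be made explicit through the standard quadratic-form identity (here θ denotes the vector of word-pair weights):

    \theta^{\top} L(S)\, \theta = \frac{1}{2} \sum_{j,k} S_{jk}\,(\theta_j - \theta_k)^{2}

Penalizing this quantity pulls the weights of highly similar links toward each other.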
20. Refined Large Margin Objective
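The objective on this slide did not transcribe; a sketch of how the Laplacian could replace the plain L2 penalty on the word-pair weights in the QP with local constraints (splitting θ into word-pair weights θ_pair and remaining weights θ_other is an assumption):

    \min_{\theta, \xi \ge 0} \; \frac{\beta}{2}\, \theta_{\mathrm{pair}}^{\top} L(S)\, \theta_{\mathrm{pair}}
    + \frac{\beta}{2}\, \|\theta_{\mathrm{other}}\|^{2}
    + \sum_{i} \sum_{k \in A_i} \xi_{i,k}
    \quad \text{subject to the local constraints above}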
22. Outline
- Dependency parsing model
- Large margin training
- Training with local constraints
- Laplacian regularization
- Experimental results
- Related work and conclusions
23. Experimental Setup
- Chinese Treebank (CTB) (Xue et al., 2004), with the same data split as Bikel (2004)
- Training: Sections 1-270
- Development: Sections 301-325
- Testing: Sections 271-300
- Dependency trees
- Converted from the constituency trees (Bikel, 2004)
- CTB-10, CTB-15
- Word similarities
- Computed from the Chinese Gigaword corpus
24. Experimental Details
- For any unseen link, the weight is computed as the similarity-weighted average of similar links seen in the training corpus (formula sketched below)
- Parsing accuracy
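Two formulas these bullets likely refer to, reconstructed here as assumptions: the similarity-weighted average used for an unseen link l, and the usual dependency accuracy measure:

    \theta_l = \frac{\sum_{k \in \mathrm{seen}} S_{lk}\, \theta_k}{\sum_{k \in \mathrm{seen}} S_{lk}},
    \qquad
    \mathrm{accuracy} = \frac{\#\,\text{words whose correct head is identified}}{\#\,\text{words in total}}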
25. Experimental Results - 1
- Accuracy Results on the Dev Set (%)
26. Experimental Results - 2
- Accuracy Results on the Test Set (%)
27. Comparison with Other Work
- The probabilistic approach to Chinese dependency parsing (Wang et al. 2005)
- 61.04 (dev set)
- 76.31 (test set)
- Our approach
- 65.71 (dev set)
- 68.27 (test set)
- Uses a much simpler feature set
28. Outline
- Dependency parsing model
- Large margin training
- Training with local constraints
- Laplacian regularization
- Experimental results
- Related work and conclusions
29. Related Work
- Large margin training for parsing
- McDonald et al. 2005
- Taskar et al. 2004, Tsochantaridis et al. 2004
- Yamada and Matsumoto 2003
- Maximize conditional likelihood (maximum entropy)
- Charniak 2000
- Ratnaparkhi 1999
- Dependency parsing on Chinese
- Wang et al. 2005 (Also purely bilexical)
- Bikel and Chiang (2000)
- Levy and Manning (2003)
30. Conclusions
- Two contributions to the standard large margin training approach
- Applied refined local constraints to the large margin criterion
- Smoothed the parameters according to word similarities, via Laplacian regularization
- Extensions
- Consider directed features and contextual features
- Parse English and other languages
31. Thanks!
Questions?
- Updated paper available online at http://www.cs.ualberta.ca/~wqin/
32. Lexicalized Dependency Parsing
- Word-based parameters
- No POS tags or grammatical categories needed
- Advantages
- Makes treebank annotation easier
- Beneficial for languages such as Chinese
33. Experimental Results - 3
- Accuracy Results on the Training Set (%)
34. Features for an Arc
Lots!
- Word pair features
- PMI features
- Distance features
35. Distributional Word Similarity
- Words that tend to appear in the same contexts tend to have similar meanings (Harris, 1968)
- Represent a word w by a feature vector f of contexts
- P(w, c): the probability that w and c co-occur
- Similarity measure: cosine (formulas below)
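A sketch of the standard construction these bullets point to (using PMI as the value of each context feature is an assumption; cosine is the measure named on the slide):

    f_c(w) = \mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)},
    \qquad
    \cos(f(w_1), f(w_2)) = \frac{f(w_1) \cdot f(w_2)}{\|f(w_1)\|\;\|f(w_2)\|}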