1
Learning Structured Classifiers for Statistical
Dependency Parsing
  • Qin Iris Wang
  • Supervisors: Dekang Lin and Dale Schuurmans
  • University of Alberta
  • May 7th, 2008

2
Ambiguities in NLP
I saw her duck.
How about "I saw her duck with a telescope"?
3
Dependency Trees
head → modifier; each word has one and only one head
  • A dependency tree structure for a sentence
    represents
  • Syntactic relations between word pairs in the
    sentence

(Figure) A dependency tree over the words of "I saw her duck with a telescope"
Over 100 possible trees!
1 million trees for a 20-word sentence
4
My Thesis Research
  • Goal: improve dependency parsing via different statistical machine learning approaches
  • Method: tackle the problem from different angles; employ and develop advanced supervised and semi-supervised machine learning techniques
  • Achievement: state-of-the-art accuracy for English and Chinese

5
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

6
Dependency Parsing Model
  • X: an input sentence; Y: a candidate dependency tree
  • A dependency link points from word i to word j
  • The search space is the set of possible dependency trees over X

Edge/link-based factorization: the score of a tree decomposes over its links, each scored by a vector of feature weights dotted with a vector of features.
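
Written out, with w for the weight vector, f(i, j, X) for the feature vector of the link from word i to word j, and T(X) for the set of candidate trees (the symbol names are chosen here for illustration), the edge-factored score and the parsing decision are:

    score(X, Y) = Σ_{(i,j) ∈ Y} w · f(i, j, X)
    Y* = argmax_{Y ∈ T(X)} score(X, Y)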
7
Features for an Arc
Lots of features, and they are sparse!
  • Word pair indicator
  • Part-of-Speech (POS) tags of the word pair
  • Pointwise Mutual Information (PMI) for that word
    pair
  • Distance between words

Abstraction or smoothing
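
As a concrete sketch, arc feature extraction might look as follows in Python; the feature names, the pmi_table lookup, and the exact templates are illustrative assumptions, not the thesis's actual feature set:

    # Hypothetical sketch: features for one arc between head word h and modifier m.
    def arc_features(sentence, pos_tags, h, m, pmi_table):
        w_h, w_m = sentence[h], sentence[m]
        t_h, t_m = pos_tags[h], pos_tags[m]
        feats = {
            f"pair={w_h}_{w_m}": 1.0,   # word-pair indicator
            f"pos={t_h}_{t_m}": 1.0,    # POS-tag pair indicator
            f"dist={m - h}": 1.0,       # (signed) distance between the words
        }
        # Pointwise mutual information of the word pair, if it was observed.
        if (w_h, w_m) in pmi_table:
            feats["pmi"] = pmi_table[(w_h, w_m)]
        return feats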
8
My Work on Dependency Parsing
  • 1. Strictly Lexicalized Dependency Parsing (Wang
    et al. 2005)
  • MLE + similarity-based smoothing
  • 2. Improving Large Margin Training (Wang et al.
    2006)
  • Local constraints: capture local errors in a parse tree
  • Laplacian regularization (semi-supervised large margin training): enforce similar links to have similar weights -- similarity smoothing
  • 3. Structured Boosting (Wang et al. 2007)
  • Global optimization, efficient and flexible
  • 4. Semi-supervised Convex Training (Wang et al.
    2008)
  • Combine large margin loss with least squares loss

9
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

10
Strictly Lexicalized Dependency Parsing
-- IWPT 2005
11
Contributions
  • All features are based on word statistics; no POS tags or grammatical categories are needed
  • Using similarity-based smoothing to deal with
    data sparseness

12
POS Tags for Handling Sparseness in Parsing
  • All previous parsers use a POS lexicon
  • Natural language data is sparse
  • Bikel (2004) found that of all the needed bi-gram statistics, only 1.49% were observed in the Penn Treebank
  • Words belonging to the same POS are expected to
    have the same syntactic behavior

13
An Alternative Method to Deal with Sparseness
  • Distributional word similarities
  • Distributional Hypothesis --- Words that appear
    in similar contexts have similar meanings
    (Harris, 1968)
  • Soft clusters of words
  • Advantages of using similarity smoothing
  • Computed automatically from the raw text, making treebank construction easier

(In contrast, POS tags are hard clusters of words.)
14
Similarity-based Smoothing
Example: estimating P(skipped_3, regularly_5). S(w) is the list of words distributionally similar to w, with similarity scores:

S(regularly) = frequently 0.365862, routinely 0.286178, periodically 0.273665, often 0.24077, constantly 0.234693, occasionally 0.226324, who 0.200348, continuously 0.194026, repeatedly 0.177434
S(skipped) = skipping 0.229951, skip 0.197991, skips 0.169982, sprinted 0.140535, bounced 0.139547, missed 0.134966, cruised 0.133933, scooted 0.13387, jogged 0.133638
15
Similarity-based Smoothing
To estimate P(skipped_3, regularly_5), use similar word pairs seen in the training data (similar contexts): (skip, frequently), (skip, routinely), (skip, repeatedly), (bounced, often), (bounced, who), (bounced, repeatedly).
16
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

17
Improved Large Margin Dependency Parsing via
Local Constraints and Laplacian Regularization
-- CoNLL 2006
18
Contributions
  • Large margin vs. generative training
  • Included distance and PMI features, but fewer
    dynamic features
  • Local constraints to capture local errors in a
    parse tree
  • Laplacian regularization (based on distributional
    word similarity) to deal with data sparseness
  • Semi-supervised large margin training

19
(Existing) Large Margin Training
Exponential constraints!
  • However,
  • Exponential number of constraints
  • The loss ignores local errors within the parse tree
  • A large number of bi-lexical features, which over-fit the training corpus

20
Local Constraints (An Example)
(Figure) Sentence: The(1) boy(2) skipped(3) school(4) regularly(5)

For each pair of links sharing a common word, the correct link must out-score the incorrect one by a margin:

score(The, boy) > score(The, skipped) + 1
score(boy, skipped) > score(The, skipped) + 1
score(skipped, school) > score(school, regularly) + 1
score(skipped, regularly) > score(school, regularly) + 1
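
In general form (a sketch based on the example above and the backup slide on local constraints): for a correct link l and an incorrect link l' that share a common node, training asks for a margin of 1, softened by a slack variable ξ ≥ 0:

    score(l) ≥ score(l') + 1 − ξ,   where score(l) = w · f(l)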
21
Laplacian Regularization
  • Enforce similar links (word pairs) to have similar weights

S: similarity matrix over word pairs
L(S): the Laplacian matrix of S
D(S): a diagonal matrix (the row sums of S)
L(S) = D(S) - S
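
The reason this enforces similar links to have similar weights is the standard quadratic form of a graph Laplacian: for a weight vector w indexed by word pairs,

    wᵀ L(S) w = ½ Σ_{k,l} S_{kl} (w_k − w_l)²

so word pairs with a high similarity S_{kl} are penalized unless their weights w_k and w_l are close.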
22
Refined Large Margin Objective
The Laplacian regularizer applies only to the bi-lexical features.
Polynomial number of constraints!
23
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

24
Simple Training of Dependency Parsers via
Structured Boosting
-- IJCAI 2007
25
Contributions
  • Structured Boosting: a simple approach to training structured classifiers by applying a boosting-like procedure to standard supervised training methods
  • Advantages
  • Global training
  • Simple and efficient
  • General, can be easily applied to other tasks
  • Successfully applied to dependency parsing

26
Local Training Examples
  • Given training data (X, Y)

Local examples:

Word-pair          Link-label  Instance_weight  Features
The-boy            L           1                W1_The, W2_boy, W1W2_The_boy, T1_DT, T2_NN, T1T2_DT_NN, Dist_1, ...
boy-skipped        L           1                W1_boy, W2_skipped, ...
skipped-school     R           1                W1_skipped, W2_school, ...
skipped-regularly  R           1                W1_skipped, W2_regularly, ...
The-skipped        N           1                W1_The, W2_skipped, ...
The-school         N           1                W1_The, W2_school, ...

L = left link, R = right link, N = no link
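
A minimal sketch of how such local examples could be generated from a sentence and its gold tree; the head-array representation and the arc_features helper (sketched earlier) are assumptions for illustration:

    # Hypothetical sketch: build 3-class local examples (L / R / N) for every word pair.
    # heads[k] is the index of the head of word k in the gold dependency tree.
    def local_examples(sentence, pos_tags, heads, pmi_table):
        examples = []
        for i in range(len(sentence)):
            for j in range(i + 1, len(sentence)):
                if heads[i] == j:
                    label = "L"   # left link: the right word j is the head of word i
                elif heads[j] == i:
                    label = "R"   # right link: the left word i is the head of word j
                else:
                    label = "N"   # no link between this pair
                feats = arc_features(sentence, pos_tags, i, j, pmi_table)
                examples.append((feats, label, 1.0))   # initial instance weight is 1
        return examples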
27
Local Training Methods
  • Learn a local link classifier given a set of
    features defined on the local examples
  • For each word pair in a sentence
  • No link, left link or right link? (3-class
    classification)
  • However,
  • The parameters of the local model are not trained to optimize global parsing accuracy

28
Global Training for Parsing
  • Incorporate the effects of the parser directly
    into the training algorithm --- directly capture
    the relations between the links of an output tree
  • Unfortunately, available structured training
    techniques are
  • Expensive, specialized, complex to implement
  • Require a lot of effort to apply to parsing

Need efficient global training algorithms!
29
Structured Boosting for Dependency Parsing
Global training + efficient?

(Diagram) Training loop: training sentences (X, Y) → local training examples → local link classifier c (c1, c2, c3, ..., ck) → link scores → dependency parsing algorithm → dependency trees → compare with the gold-standard trees → re-weight the mis-parsed examples → retrain the local classifier.
30
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

31
Semi-supervised Convex Training for Dependency
Parsing
-- ACL 2008
32
Contributions
  • Combined a structured large margin loss on
    labeled data and a least squares loss on
    unlabeled data
  • Obtained an efficient, convex, semi-supervised
    large margin training algorithm for learning
    dependency parsers

33
More Data Is Better Data
  • The Penn Treebank
  • 4.5 million words
  • About 200 thousand sentences
  • Annotation: 30 person-minutes per sentence
  • Limited and expensive!
  • Raw text data
  • News wire
  • Wikipedia
  • Web resources
  • Plentiful and free!

Treebank data → supervised learning; raw text data → semi-supervised / unsupervised learning
34
Semi-supervised Structured SVM
Non-convex!
  • Both the weight vector and the dependency trees of the unlabeled sentences are optimization variables

35
Our Approach
Convex!
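
The exact objective is given in the ACL 2008 paper; only its rough shape is sketched here, under the assumption that the unlabeled trees are relaxed to the constrained matrices Y_j described in the backup slides: a structured large margin loss over the labeled sentences plus a least squares loss over the unlabeled sentences, minimized jointly over w and the Y_j,

    min over w, {Y_j}:  β/2 ||w||² + Σ_i margin-loss(w; X_i, T_i) + λ Σ_j Σ_{(s,t)} ( Y_j[s,t] − w · f(s, t, X_j) )²

Because the margin loss is convex in w and the least squares term is jointly convex in w and the relaxed Y_j, the combined objective is convex.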
36
Efficient Optimization Alg.
  • Using stochastic gradient steps
  • Parameters are updated locally on each labeled
    and unlabeled sentence
  • The inner quadratic program is solved by calling CPLEX
  • The objective is globally minimized after a few
    iterations
  • Gradient on labeled sentence i

Gradient on unlabeled sentence j
37
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

38
Experimental Design
  • Data set (Linguistic Data Consortium), split into
    training, development and test sets
  • English
  • PTB: 50K sentences (standard split)
  • Chinese
  • CTB4: 15K sentences (split as in Wang et al. 2005)
  • CTB5: 19K sentences (split as in Corston-Oliver et al. 2006)
  • Features
  • Word-pair indicator, POS-pair, PMI, context
    features, distance

39
Experimental Design Cont.
  • Dependency parsing algorithm
  • A CKY-like chart parser, treating dependencies as constituents
  • Evaluation measure
  • Dependency accuracy: the percentage of words that have the correct head
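
For reference, a minimal sketch of this measure, assuming each sentence is represented by an array of predicted and gold head indices:

    # Dependency accuracy: the percentage of words whose predicted head is correct.
    def dependency_accuracy(predicted_heads, gold_heads):
        correct = sum(1 for p, g in zip(predicted_heads, gold_heads) if p == g)
        return 100.0 * correct / len(gold_heads)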

40
Results - 1 (IWPT 05)
Evaluation Results on CTB 4.0 (%)
(Table: undirected-tree, root, and directed-tree accuracies)
41
Results - 2 (CoNLL 06)
much simpler feature set
Evaluation Results on CTB4 - 10 (%)
42
Results - 3 (IJCAI 07)
Evaluation Results on Chinese and English (%)
43
Results - 4 (ACL 08)
Evaluation Results on Chinese and English (%)
44
Comparison with State-of-the-art
(Tables) IWPT 2005 system on Chinese Treebank 4.0; IJCAI 2007 system on Chinese Treebank 5.0
45
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

46
Conclusion
  • I have developed several statistical approaches
    to learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Achieved state-of-the-art accuracy for English
    and Chinese

47
Thanks!
48
(Diagram) NLP contributes features, linguistic intuitions, and models; machine learning contributes training criteria, regularization, and smoothing.
49
Ambiguities In NLP
Courtesy of Aravind Joshi
I like eating sushi with tuna.
50
Dependency Trees vs. Constituency Trees
(Figure) A constituency tree for "Mike ate the cake": S → NP VP; NP → N (Mike); VP → V (ate) NP; NP → Dt (the) N (cake).
51
Dependency Tree
  • A dependency tree structure for a sentence represents
  • Syntactic relationships between word pairs in the sentence

(Figure) Dependency tree for "I like eating sushi with tuna", with arcs labeled subj, obj, and mod.
52
Dependency Parsing
  • An increasingly active research area (Yamada & Matsumoto 2003, McDonald et al. 2005, McDonald & Pereira 2006, Corston-Oliver et al. 2006, Smith & Eisner 2005/06, Wang et al. 2006/07/08, Koo et al. 2008)
  • Dependency trees are much easier to understand and annotate than other syntactic representations
  • Dependency relations have been widely used in
  • Machine translation (Fox 2002, Cherry & Lin 2003, Ding & Palmer 2005)
  • Information extraction (Culotta & Sorensen 2004)
  • Question answering (Pinchak & Lin 2006)
  • Coreference resolution (Bergsma & Lin 2006)

53
Scoring Functions
The link score is the dot product of a vector of feature weights and a vector of features.
  • Feature weights can be learned via either
    a local or a global training approach

54
Score of Each Link / Word-pair
  • The score of each link is based on the features
  • Considering the word pair (skipped, regularly):
  • word-pair(skipped, regularly) = 1
  • POS-pair(skipped, regularly) = (VBD, RB) = 1
  • PMI(skipped, regularly) = 0.27
  • dist(skipped, regularly) = 2; dist^2(skipped, regularly) = 4

55
However,
  • POS tags are not part of natural text
  • Need to be annotated by human effort
  • Introduce more noise to training data
  • For some languages, POS tags are not clearly
    defined
  • Such as Chinese or Japanese
  • A single word is often combined with other words

Can we use a smoothing technique other than POS tags?
56
Feature Representation
  • Represent a word w by a feature vector
  • The value of a feature c is

where P(w, c) is the probability that w and c co-occur in a context window
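
A form consistent with the PMI features used elsewhere in this work, and assumed here for illustration, is the pointwise mutual information between the word and the context feature:

    f_c(w) = PMI(w, c) = log [ P(w, c) / ( P(w) P(c) ) ]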
57
Similarity-based Smoothing
  • Similarity measure: cosine
  • As in (Dagan et al., 1999)
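
With each word w represented by its feature vector v(w) from the previous slide, the cosine measure is

    Sim(w1, w2) = ( v(w1) · v(w2) ) / ( ||v(w1)|| ||v(w2)|| )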

58
Comparison With An Unlexicalized Model
  • In the unlexicalized model, the input to the parser is the sequence of POS tags, in contrast to our strictly lexicalized model
  • Using gold standard POS tags
  • Accuracy of the unlexicalized model: 71.1%
  • Our strictly lexicalized model: 79.9%

59
Large Margin Training
  • Minimizing a regularized loss (Hastie et al., 2004)

i: the index of the training sentences
Ti: the target tree
Li: a candidate tree
Δ(Ti, Li): the distance between the two trees
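
Using the link-factored score s(X, Y) = Σ_{(i,j) ∈ Y} w · f(i, j, X) from earlier, and writing Δ(Ti, Li) for the tree distance defined above, the regularized loss takes the standard margin-rescaled hinge form (a sketch; the constants follow the paper):

    min over w:  β/2 ||w||² + Σ_i max over Li ∈ T(Xi) of [ Δ(Ti, Li) − ( s(Xi, Ti) − s(Xi, Li) ) ]₊

where [x]₊ = max(x, 0). Every candidate tree Li contributes one constraint, which is why the number of constraints is exponential.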
60
Objective with Local Constraints
  • The corresponding new quadratic program

Polynomial number of constraints!
(j: the number of constraints in A)
61
Combining Local Training with a Parsing Algorithm
62
Standard Boosting for Classification
(Diagram) Standard boosting loop: training examples → local predictor h → increase the weight of mis-classified examples → retrain, yielding h1, h2, h3, ..., hk.
63
Structured Boosting
  • Train a local link predictor, h1
  • Re-parse training data using h1
  • Re-weight local examples
  • Compare the parser outputs with the gold standard
    trees
  • Increase the weight of mis-parsed local examples
  • Re-train local link predictor, getting h2
  • Finally, we have h1, h2, ..., hk
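
A minimal sketch of this loop in Python; train_local_classifier, parse, pair_labels, and local_examples stand in for the local learner, the dependency parsing algorithm, the conversion of a parse tree to per-pair L/R/N labels, and the example generator sketched earlier, and the multiplicative re-weighting is an illustrative assumption:

    # Hypothetical sketch of structured boosting for dependency parsing.
    def structured_boosting(train_sents, gold_trees, num_rounds, boost_factor=2.0):
        per_sent = [local_examples(sent, gold) for sent, gold in zip(train_sents, gold_trees)]
        classifiers = []
        for _ in range(num_rounds):
            # Train the local link classifier on the (weighted) local examples.
            c = train_local_classifier([ex for sent_ex in per_sent for ex in sent_ex])
            classifiers.append(c)
            for sent, gold, sent_ex in zip(train_sents, gold_trees, per_sent):
                tree = parse(sent, c)                     # run the dependency parser
                predicted = pair_labels(tree, len(sent))  # L / R / N label per word pair
                for k, (feats, label, weight) in enumerate(sent_ex):
                    if predicted[k] != label:             # mis-parsed local example
                        sent_ex[k] = (feats, label, weight * boost_factor)
        return classifiers   # h1, h2, ..., hk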

64
Dynamic Features
  • Also known as non-local features
  • Take into account the link labels of the surrounding word pairs when predicting the label of the current pair
  • Commonly used in sequential labeling (McCallum et al. 2000, Toutanova et al. 2003)
  • A simple but useful idea for improving parsing
    accuracy
  • Wang et al. 2005
  • McDonald and Pereira 2006

65
Dynamic Features
(Figure) Two parses: "I saw her duck with a telescope" and "I saw her duck with a spot"; the word under "with" (telescope vs. spot) changes where "with" attaches.
  • Define a canonical order so that a word's children are generated first, before it modifies another word
  • telescope/spot are the dynamic features for deciding whether to generate a link between saw-with or duck-with

66
Results - 1
Table 1: Boosting with static features
67
Variants of Structured Boosting
  • Using alternative boosting algorithms for
    structured boosting
  • AdaBoost.M2 (Freund & Schapire 1997)
  • Re-weighting class labels
  • Logistic regression form of boosting (Collins et
    al. 2002)

68
Structured Boosting (An Example)
(Figure) Gold tree for "I saw her duck with a telescope" vs. the parser's output at successive iterations. At each iteration, the instance weights of mis-parsed examples are increased ("Try harder!"); by iteration T the parser's output matches the gold tree ("Good job!").
69
Similarity Between Word Pairs
  • Similarity between two words: Sim(w1, w2) (Lin, 1998)
  • Construct a feature vector for w that contains a
    set of words occurring within a small context
    window
  • Compute similarities between the two feature
    vectors
  • Using cosine measure
  • e.g., Sim(skipped, missed) = 0.134966
  • Similarity between word pairs: the geometric average, e.g.

Sim((skipped, regularly), (missed, often)) = sqrt( Sim(skipped, missed) × Sim(regularly, often) )
70
A Generative Parsing Model
(Figure) A dependency tree for "The(1) kid(2) skipped(3) school(4) regularly(5)", generated from an artificial root symbol at position 0.
71
Similarity-based Smoothing
(Figure) The MLE-based probability of the link (skipped_3, regularly_5) is interpolated with a similarity-based probability estimated from similar contexts of C: (skip, frequently), (skip, routinely), (skip, repeatedly), (bounced, often), (bounced, who), (bounced, repeatedly).
72
Similarity-based Smoothing
  • Finally,

P(E | C) = α · P_MLE(E | C) + (1 − α) · P_SIM(E | C)
where |C| is the frequency count of the corresponding context C in the training data,
e.g. |(skipped, regularly)| = 1, |(The, kid)| = 95
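
A minimal sketch of this interpolation in Python; the data structures, the choice of α as a function of |C|, and the weighting inside P_SIM are illustrative assumptions:

    # Hypothetical sketch of similarity-based smoothing.
    # p_mle:   dict mapping (event, context) -> MLE probability from the treebank
    # count:   dict mapping context -> frequency count |C| in the training data
    # similar: dict mapping context -> list of (similar_context, similarity) pairs
    def p_sim(event, context, p_mle, similar):
        pairs = similar.get(context, [])
        total = sum(s for _, s in pairs)
        if total == 0.0:
            return 0.0
        # Similarity-weighted average of MLE probabilities over similar contexts.
        return sum(s * p_mle.get((event, c), 0.0) for c, s in pairs) / total

    def p_smoothed(event, context, p_mle, count, similar, k=5.0):
        # P(E|C) = alpha * P_MLE(E|C) + (1 - alpha) * P_SIM(E|C);
        # alpha is assumed to grow with the context count |C| (k is a constant).
        n = count.get(context, 0)
        alpha = n / (n + k)
        return alpha * p_mle.get((event, context), 0.0) + \
               (1.0 - alpha) * p_sim(event, context, p_mle, similar)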
73
Dependency Parsing Algorithms
  • Use a constituency parsing algorithm
  • Simply treat dependencies as constituents
  • Higher complexity
  • Eisner's dependency parsing algorithm (Eisner 1996)
  • It stores spans, instead of subtrees
  • Only the end-words are active (still need a head)
74
Existing Semi/unsupervised Alg.
  • EM / self-training
  • Local minima
  • The disconnect between likelihood and accuracy
  • The same mistakes can be amplified at the next iteration
  • Standard semi-supervised SVM
  • Non-convex objective on the unlabeled data
  • Available solutions are sophisticated and expensive (e.g., Xu et al. 2006)

75
Structured Boosting
  • A simple variant of standard boosting algorithms such as AdaBoost.M1 (Freund & Schapire 1997)
  • Global optimization
  • As efficient as local methods
  • General, can use any local classifier
  • Also, can be easily applied to other tasks

76
Dependency Trees
  • A dependency tree structure for a sentence
    represents
  • Syntactic relations between word pairs in the
    sentence

(Figure) Dependency tree for "I saw her duck with a telescope", with arcs labeled subj, obj, gen, det, and mod. Over 100 possible trees for this sentence; about 1 million trees for a 20-word sentence.
77
Constraints on the Relaxed Tree Variable
  • The tree variable is represented as an adjacency matrix
  • Use rows to denote heads and columns to denote children
  • Constraints on the matrix:
  • All entries are between 0 and 1
  • The sum over all entries in each column is 1 (one-head rule)
  • All entries on the diagonal are zeros (no self-link rule)

  • No word pair may be linked in both directions (anti-symmetric rule)
  • Connectedness (no-cycle)
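
Writing the relaxed tree variable as a matrix Y, with Y_st for the entry in row s (head) and column t (child), the rules above can be written as (a sketch; the exact connectedness formulation follows the paper):

    0 ≤ Y_st ≤ 1                    (all entries between 0 and 1)
    Σ_s Y_st = 1 for every child t  (one-head rule: column sums)
    Y_tt = 0                        (no self-link rule: zero diagonal)
    Y_st + Y_ts ≤ 1                 (anti-symmetric rule)
    plus a connectedness (no-cycle) condition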

78
Structured Boosting (An Example)
(Figure) For "I saw her duck with a telescope", the instance weights of the local examples saw-with and duck-with are adjusted across iterations 1, 2, 3, ..., T.
79
Local Constraints
(Figure) A correct link and a missing link that share a common node w1.
Convex!
  • With slack variables, the constraints can be violated at a cost
80
Future Work - 1
  • Using alternative boosting algorithms for
    structured boosting
  • AdaBoost.M2 (Freund & Schapire 1997)
  • Re-weighting class labels
  • Logistic regression form of boosting (Collins et
    al. 2002)

81
Future Work - 2
  • Multilingual dependency parsing
  • Apply our techniques to other languages, such as
    Czech, German, Spanish, French

82
Future Work - 3
  • Domain adaptation
  • Apply my parsers to other domains (e.g.,
    biomedical data)
  • Lack of annotated resources in these domains
  • Blitzer et al. 2006, McClosky et al. 2006