Learning Structured Classifiers for Statistical Dependency Parsing (PowerPoint transcript)
1
Learning Structured Classifiers for Statistical
Dependency Parsing
  • Qin Iris Wang
  • Supervisors: Dekang Lin, Dale Schuurmans
  • University of Alberta
  • May 5th, 2008

2
Ambiguities In NLP
"I saw her duck."
How about "I saw her duck with a telescope"?
3
Dependency Trees vs. Constituency Trees
[Figure: a constituency tree for "Mike ate the cake": S splits into NP (N: Mike)
and VP; the VP splits into V (ate) and NP (Dt: the, N: cake).]
4
My Thesis Research
  • Goal
  • Improve dependency parsing via statistical
    machine learning approaches
  • Method
  • Tackle the problem from different angles
  • Employ and develop advanced supervised and
    semi-supervised machine learning techniques
  • Achievement
  • State-of-the-art accuracy for English and Chinese

5
Dependency Trees
  • A dependency tree structure for a sentence
    represents
  • Syntactic relations between word pairs in the
    sentence

mod
obj
subj
obj
det
gen
Over 100 possible trees!
with
a
telescope
duck
saw
I
her
1 million trees for a 20-word sentence
6
Dependency Parsing Algorithms
  • Use a constituency parsing algorithm
  • Simply treat dependencies as constituents
  • With O(n^5) complexity
  • Eisner's dependency parsing algorithm
    (Eisner 1996), with O(n^3) complexity
  • It stores spans, instead of subtrees
  • Only the end-words are active (still need a head)
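Below is a minimal Python sketch of this span-based dynamic program, assuming a precomputed arc-score matrix `score[h][m]` (head h, modifier m) with an artificial root at index 0. It is an illustration, not the thesis implementation, and returns only the best tree score; a real parser would also keep back-pointers to recover the tree.

```python
# Eisner's O(n^3) algorithm over spans rather than lexicalized subtrees.
NEG_INF = float("-inf")

def eisner_best_score(score):
    n = len(score)  # tokens including the artificial root at index 0
    # complete[1][i][j]: best structure spanning i..j headed at i
    # complete[0][i][j]: best structure spanning i..j headed at j
    complete = [[[0.0] * n for _ in range(n)] for _ in range(2)]
    incomplete = [[[NEG_INF] * n for _ in range(n)] for _ in range(2)]

    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Incomplete spans: join two complete halves and add an arc.
            best = max(complete[1][i][r] + complete[0][r + 1][j]
                       for r in range(i, j))
            incomplete[0][i][j] = best + score[j][i]  # arc j -> i
            incomplete[1][i][j] = best + score[i][j]  # arc i -> j
            # Complete spans: absorb a finished sub-span on one side.
            complete[0][i][j] = max(complete[0][i][r] + incomplete[0][r][j]
                                    for r in range(i, j))
            complete[1][i][j] = max(incomplete[1][i][r] + complete[1][r][j]
                                    for r in range(i + 1, j + 1))
    return complete[1][0][n - 1]  # best projective tree rooted at index 0
```

Storing spans instead of lexicalized subtrees is what brings the complexity down from O(n^5) to O(n^3).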

7
Dependency Parsing
  • An increasingly active research area (Yamada &
    Matsumoto 2003, McDonald et al. 2005, McDonald &
    Pereira 2006, Corston-Oliver et al. 2006, Smith &
    Eisner 2005/06, Wang et al. 2006/07/08, Koo et
    al. 2008)
  • Dependency trees are much easier to understand
    and annotate than other syntactic representations
  • Dependency relations have been widely used in
  • Machine translation (Fox 2002, Cherry & Lin 2003,
    Ding & Palmer 2005)
  • Information extraction (Culotta & Sorensen 2004)
  • Question answering (Pinchak & Lin 2006)
  • Coreference resolution (Bergsma & Lin 2006)

8
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    Boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

9
Dependency Parsing Model
  • W: an input sentence; T: a candidate dependency
    tree
  • (i, j): a dependency link from word i to
    word j
  • the set of possible dependency trees
    over W
  • Can be applied to both probabilistic and
    non-probabilistic models

Edge/link-based factorization:
score(W, T) = Σ_{(i,j) ∈ T} score(i, j)
10
Scoring Functions
score(i, j) = w · f(i, j)
w: a vector of feature weights; f(i, j): a vector of features
  • Feature weights can be learned via either
    a local or a global training approach

11
Features for an arc
  • Word pair indicator
  • Part-of-Speech (POS) tags of the word pair
  • Pointwise Mutual Information (PMI) for that word
    pair
  • Distance between words

Many features, and sparse!
Abstraction or smoothing is needed
12
Score of Each Link / Word-pair
  • The score of each link is based on the features
  • Considering the word pair (skipped, regularly):
  • pair(skipped, regularly) = 1
  • POS(skipped, regularly) = (VBD, RB) = 1
  • PMI(skipped, regularly) = 0.27
  • dist(skipped, regularly) = 2; dist²(skipped,
    regularly) = 4

13
My Work on Dependency Parsing
  • 1. Strictly Lexicalized Dependency Parsing (Wang
    et al. 2005)
  • MLE + similarity-based smoothing
  • 2. Improving Large Margin Training (Wang et al.
    2006)
  • Local constraints: capture local errors in a
    parse tree
  • Laplacian regularization for semi-supervised large
    margin training: enforce similar links to have
    similar weights (similarity smoothing)
  • 3. Structured Boosting (Wang et al. 2007)
  • Global optimization, efficient and flexible
  • 4. Semi-supervised Convex Training (Wang et al.
    2008)
  • Combine large margin loss with least squares loss

14
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    Boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

15
Strictly Lexicalized Dependency Parsing
-- IWPT 2005
16
Contributions
  • All the features are based on word statistics;
    no POS tags or grammatical categories are needed
  • Similarity-based smoothing is used to deal with
    data sparseness

17
POS Tags for Handling Sparseness in Parsing
  • All previous parsers use a POS lexicon
  • Natural language data is sparse
  • Bikel (2004) found that of all the needed bi-gram
    statistics, only 1.49% were observed in the Penn
    Treebank
  • Words belonging to the same POS are expected to
    have the same syntactic behavior

18
However,
  • POS tags are not part of natural text
  • Need to be annotated by human effort
  • Introduce more noise to training data
  • For some languages, POS tags are not clearly
    defined
  • Such as Chinese or Japanese
  • A single word is often combined with other words

Can we use a smoothing technique other than POS?
19
An Alternative Method to Deal with Sparseness
  • Distributional word similarities
  • Distributional Hypothesis: words that appear
    in similar contexts have similar meanings
    (Harris, 1968)
  • Soft clusters of words
  • Advantages of using similarity smoothing
  • Computed automatically from raw text
  • Makes treebank construction easier

(POS tags, by contrast, are hard clusters of words)
20
Similarity Between Word Pairs
  • Similarity between two words Sim(w1, w2) (Lin,
    1998)
  • Construct a feature vector for w that contains a
    set of words occurring within a small context
    window
  • Compute similarities between the two feature
    vectors
  • Using cosine measure
  • e.g., Sim(skipped, missed) = 0.134966
  • Similarity between word pairs: geometric average

Sim((skipped, regularly), (missed, often)) =
√( Sim(skipped, missed) · Sim(regularly, often) )
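A minimal sketch of this computation, assuming each word already has a context-feature vector (toy dicts here): cosine gives Sim(w1, w2), and the word-pair similarity is the geometric average of the two word-level similarities.

```python
import math

def cosine(u, v):
    # cosine measure between two sparse feature vectors (dicts)
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def pair_similarity(w1, w2, v1, v2, vectors):
    # geometric average of Sim(w1, v1) and Sim(w2, v2)
    return math.sqrt(cosine(vectors[w1], vectors[v1]) *
                     cosine(vectors[w2], vectors[v2]))

vectors = {  # toy context-feature vectors
    "skipped":   {"class": 2.0, "school": 3.0, "he": 1.0},
    "missed":    {"class": 1.0, "school": 2.0, "bus": 1.0},
    "regularly": {"exercise": 2.0, "visits": 1.0},
    "often":     {"exercise": 1.0, "visits": 2.0},
}
print(pair_similarity("skipped", "regularly", "missed", "often", vectors))
```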
21
A Generative Parsing Model
[Figure: a generative model builds the dependency tree for
"The(1) kid(2) skipped(3) school(4) regularly(5)" with an
artificial root word (0).]
22
Similarity-based Smoothing
Estimating P(E | C) for the context C = (skipped(3), regularly(5)),
using the similar-word lists of the two context words:

S(skipped) = skipping 0.229951, skip 0.197991, skips 0.169982,
sprinted 0.140535, bounced 0.139547, missed 0.134966,
cruised 0.133933, scooted 0.13387, jogged 0.133638

S(regularly) = frequently 0.365862, routinely 0.286178,
periodically 0.273665, often 0.24077, constantly 0.234693,
occasionally 0.226324, who 0.200348, continuously 0.194026,
repeatedly 0.177434
24
Similarity-based Smoothing
Estimating P(E | skipped(3), regularly(5)): similar contexts
seen in the training data include

skip frequently, skip routinely, skip repeatedly,
bounced often, bounced who, bounced repeatedly
25
Similarity-based Smoothing
The similarity-based probability for the unseen context
C = (skipped(3), regularly(5)) is a similarity-weighted
average of MLE-based probabilities over the similar contexts
of C: skip frequently, skip routinely, skip repeatedly,
bounced often, bounced who, bounced repeatedly.
26
Similarity-based Smoothing
  • Finally,

P(E | C) = α · P_MLE(E | C) + (1 − α) · P_SIM(E | C)

where α depends on the frequency count of the context C in
the training data, e.g. count(skipped, regularly) = 1 vs.
count(The, kid) = 95
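A small sketch of this interpolation; the exact schedule for the mixing weight α as a function of the context count is an assumption here (any increasing schedule that trusts the MLE more for frequent contexts fits the slide).

```python
def smoothed_prob(p_mle, p_sim, context_count, k=5.0):
    # alpha -> 1 for frequent contexts (trust MLE), -> 0 for rare ones;
    # this particular schedule is an illustrative assumption
    alpha = context_count / (context_count + k)
    return alpha * p_mle + (1.0 - alpha) * p_sim

# Rare context (skipped, regularly), count 1: lean on similarity smoothing
print(smoothed_prob(p_mle=0.5, p_sim=0.2, context_count=1))
# Frequent context (The, kid), count 95: trust the MLE
print(smoothed_prob(p_mle=0.5, p_sim=0.2, context_count=95))
```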
27
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    Boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

28
Improved Large Margin Dependency Parsing via
Local Constraints and Laplacian Regularization
-- CoNLL 2006
29
Contributions
  • Large margin vs. generative training
  • Included distance and PMI features, but fewer
    dynamic features
  • Local constraints to capture local errors in a
    parse tree
  • Laplacian regularization (based on distributional
    word similarity) to deal with data sparseness
  • Semi-supervised large margin training

30
(Existing) Large Margin Training
Exponential constraints!
  • Has been used for parsing
  • Tsochantaridis et al. 2004, Taskar et al. 2004
  • State-of-the-art performance in dependency
    parsing
  • McDonald et al. 2005a, 2005b, 2006

31
However
  • Exponential number of constraints
  • The loss ignores local errors in the parse tree
  • Over-fitting the training corpus
  • A large number of bi-lexical features needs a good
    smoothing (regularization) method

32
Local Constraints (an example)
[Figure: the sentence "The(1) boy(2) skipped(3) school(4)
regularly(5)" with correct links contrasted against incorrect
candidate links and their loss.]

score(The, boy) > score(The, skipped) + 1
score(boy, skipped) > score(The, skipped) + 1
score(skipped, school) > score(school, regularly) + 1
score(skipped, regularly) > score(school, regularly) + 1

Polynomially many constraints!
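A sketch of how such local constraints could be enumerated: for every correct link, every incorrect candidate link that shares a word with it yields one margin constraint, giving polynomially many constraints in the sentence length. The exact constraint set used in the thesis may differ; this is an illustration.

```python
def local_constraints(n, gold_links):
    """n: number of words; gold_links: set of (head, mod) index pairs.
    Returns (correct, incorrect) link pairs, each read as the constraint
    score(correct) > score(incorrect) + 1."""
    gold = set(gold_links)
    candidates = [(h, m) for h in range(n) for m in range(n)
                  if h != m and (h, m) not in gold]

    def shares_node(a, b):
        return bool(set(a) & set(b))

    return [(g, c) for g in gold for c in candidates if shares_node(g, c)]

# "The(0) boy(1) skipped(2) school(3) regularly(4)"
gold = {(1, 0), (2, 1), (2, 3), (2, 4)}
print(len(local_constraints(5, gold)))
```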
33
Local Constraints
[Figure: each constraint compares a correct link with a
missing link that shares a common node.]
Convex!
  • With slack variables

34
Laplacian Regularization
  • Enforce similar links (word pairs) to have
    similar weights

L(S) = D(S) − S, where S is the similarity matrix of word
pairs, D(S) is a diagonal (degree) matrix, and L(S) is the
Laplacian matrix of S
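A minimal numeric sketch of this regularizer, assuming a symmetric word-pair similarity matrix S: the penalty wᵀL(S)w equals ½ Σᵢⱼ Sᵢⱼ(wᵢ − wⱼ)², so it is small exactly when similar bi-lexical features carry similar weights.

```python
import numpy as np

def laplacian(S):
    D = np.diag(S.sum(axis=1))  # diagonal degree matrix
    return D - S

S = np.array([[0.0, 0.8, 0.1],   # toy similarities between three word pairs
              [0.8, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
L = laplacian(S)
w = np.array([1.0, 1.1, -0.5])   # weights of three bi-lexical features
print(w @ L @ w)  # = 0.5 * sum_ij S_ij (w_i - w_j)^2
```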
35
Refined Large Margin Objective
  • Laplacian regularization applies only to the
    bi-lexical features
  • Still only polynomially many constraints!
36
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    Boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

37
Simple Training of Dependency Parsers via
Structured Boosting
-- IJCAI 2007
38
Contributions
  • Structured Boosting: a simple approach to
    training structured classifiers by applying a
    boosting-like procedure to standard supervised
    training methods
  • Advantages
  • Simple
  • Inexpensive
  • General
  • Successfully applied to dependency parsing

39
Local Training Examples
  • Given training data (S, T)

Local examples (L = left link, R = right link, N = no link):

Word-pair          Link-label  Instance_weight  Features
The-boy            L           1                W1_The, W2_boy, W1W2_The_boy, T1_DT, T2_NN, T1T2_DT_NN, Dist_1, ...
boy-skipped        L           1                W1_boy, W2_skipped, ...
skipped-school     R           1                W1_skipped, W2_school, ...
skipped-regularly  R           1                W1_skipped, W2_regularly, ...
The-skipped        N           1                W1_The, W2_skipped, ...
The-school         N           1                W1_The, W2_school, ...
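A sketch of how these local examples could be generated from a sentence and its gold links; feature extraction is reduced to word-identity features for brevity, and the label convention follows the table above.

```python
def local_examples(words, links):
    """words: token list; links: set of (head_idx, mod_idx) pairs."""
    examples = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if (j, i) in links:
                label = "L"   # head on the right, link points left
            elif (i, j) in links:
                label = "R"   # head on the left, link points right
            else:
                label = "N"   # no link between this pair
            feats = {f"W1_{words[i]}": 1, f"W2_{words[j]}": 1}
            examples.append((feats, label, 1.0))  # instance weight 1 initially
    return examples

sent = ["The", "boy", "skipped", "school", "regularly"]
gold = {(1, 0), (2, 1), (2, 3), (2, 4)}
for feats, label, w in local_examples(sent, gold):
    print(label, sorted(feats))
```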
40
Local Training Methods
  • Learn a local link classifier given a set of
    features defined on the local examples
  • For each word pair in a sentence
  • No link, left link or right link ?
  • 3-class classification
  • Any classifier can be used as a link classifier
    for parsing

41
Combining Local Training with a Parsing Algorithm
42
Parsing With a Local Link Classifier
  • Learn the weight vector over a set of
    features defined on the local examples
  • Maximum entropy models (Ratnaparkhi 1999,
    Charniak 2000)
  • Support vector machines (Yamada and Matsumoto
    2003)
  • The parameters of the local model are not trained
    to account for global parsing accuracy
  • Global training can do better

43
Global Training for Parsing
  • Directly capture the relations between the links
    of an output tree
  • Incorporate the effects of the parser directly
    into the training algorithm
  • Structured SVMs (Tsochantaridis et al. 2004)
  • Max-Margin Parsing (Taskar et al. 2004)
  • Online large-margin training (McDonald et al.
    2005)
  • Improving large-margin training (Wang et al. 2006)

44
But, Drawbacks
  • Unfortunately, these structured training
    techniques are
  • Expensive
  • Specialized
  • Complex to implement
  • Require a great deal of refinement and
    computational resources to apply to parsing

Need efficient global training algorithms!
45
Structured Boosting
  • A simple variant of standard boosting algorithms
    such as AdaBoost.M1 (Freund & Schapire 1997)
  • Global optimization
  • As efficient as local methods
  • General, can use any local classifier
  • Also, can be easily applied to other tasks

46
Standard Boosting for Classification
[Figure: the standard boosting loop. A local predictor h is
trained on weighted training examples; the weights of
mis-classified examples are increased; iterating yields
predictors h1, h2, h3, ..., hk.]
47
Structured Boosting (An Example)
[Figure: the gold tree for "I saw her duck with a telescope"
compared with the parser's output across boosting iterations.
At iterations 1 and 2 the output attaches links incorrectly
and the instance weights of mis-parsed pairs are increased
("Try harder!"); by iteration T the parser produces the gold
tree ("Good job!").]
48
Structured Boosting for Dependency Parsing
Global training, and efficient:

[Figure: training sentences (S, T) yield local training
examples; a local link classifier h produces link scores fed
to the dependency parsing algorithm; the output dependency
trees are compared with the gold-standard trees and
mis-parsed examples are re-weighted, producing classifiers
h1, h2, h3, ..., hk.]
49
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    Boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

50
Semi-supervised Convex Training for Dependency
Parsing
-- ACL 2008
51
Contributions
  • Combined a structured large margin loss on
    labeled data and a least squares loss on
    unlabeled data
  • Obtained an efficient, convex, semi-supervised
    large margin training algorithm for learning
    dependency parsers

52
More Data Is Better Data
  • The Penn Treebank
  • 4.5 million words
  • About 200 thousand sentences
  • Annotation: 30 person-minutes per sentence
  • Limited and expensive! (supervised learning)
  • Raw text data
  • News wire
  • Wikipedia
  • Web resources
  • Plentiful and free! (semi-/unsupervised learning)
53
Existing Semi-/Unsupervised Algorithms
  • EM / self-training
  • Local minima
  • The disconnect between likelihood and accuracy
  • The same mistakes can be amplified at the next
    iteration
  • Standard semi-supervised SVM
  • Non-convex objective on the unlabeled data
  • Available solutions are sophisticated and
    expensive (e.g., Xu et al. 2006)

54
Semi-supervised Structured SVM
Non-convex! Both the weight vector w and the unlabeled
trees Yj are optimization variables in the objective.

55
Our Approach
Convex! Combine the structured large margin loss on the
labeled sentences with a least squares loss over relaxed
trees Yj on the unlabeled sentences; the constraints that
keep each Yj tree-like are detailed on the next slide.

57
Our Approach Cont.
  • Yj is represented as an adjacency matrix
  • Rows denote heads and columns denote children
  • Constraints on Yj
  • All entries in Yj are between 0 and 1
  • The sum over all entries in each column is 1
    (one-head rule)
  • All entries on the diagonal are zero (no
    self-link rule)
  • A link and its reverse cannot both be selected
    (anti-symmetric rule)
  • Connectedness (no-cycle rule)
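A small sketch that checks these constraints on a relaxed adjacency matrix. Treating index 0 as an artificial root (so the one-head rule applies to the real-word columns) and reading the anti-symmetric rule as Y_ij + Y_ji ≤ 1 are assumptions of this illustration.

```python
import numpy as np

def satisfies_constraints(Y, tol=1e-8):
    # Index 0 is assumed to be the artificial root; columns 1..n-1 are words.
    return bool(
        np.all(Y >= -tol) and np.all(Y <= 1 + tol)    # entries in [0, 1]
        and np.allclose(Y[:, 1:].sum(axis=0), 1.0)    # one-head rule
        and np.allclose(np.diag(Y), 0.0)              # no self-link rule
        and np.all(Y + Y.T <= 1 + tol)                # assumed anti-symmetry
    )

# A fractional (relaxed) tree over root + two words: still feasible.
Y = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
print(satisfies_constraints(Y))  # True
```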

58
Efficient Optimization Alg.
  • Use stochastic gradient steps
  • Parameters are updated locally on each sentence
  • The objective is globally minimized after a few
    iterations
  • Take a gradient step on each labeled sentence i
    (large margin loss) and each unlabeled sentence j
    (least squares loss)
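A schematic sketch of this optimization loop; `margin_grad` and `lsq_grad` are hypothetical stand-ins for the (sub)gradients of the structured large margin loss and the least squares loss.

```python
# w: parameter vector (e.g., a numpy array); labeled: (sentence, tree) pairs;
# unlabeled: sentences only. One pass over the data.
def sgd_epoch(w, labeled, unlabeled, margin_grad, lsq_grad, lr=0.01, reg=1e-4):
    for x, y in labeled:
        w = w - lr * (margin_grad(w, x, y) + reg * w)   # large margin step
    for x in unlabeled:
        w = w - lr * (lsq_grad(w, x) + reg * w)         # least squares step
    return w
```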
59
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    Boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

60
Experimental Design
  • Data sets (Linguistic Data Consortium), split into
    training, development and test sets
  • English
  • PTB: 50K sentences (standard split)
  • Chinese
  • CTB4: 15K sentences (split of Wang et al. 2005)
  • CTB5: 19K sentences (split of Corston-Oliver et
    al. 2006)
  • Features
  • Word-pair indicator, POS-pair, PMI, context
    features, distance

61
Experimental Design Cont.
  • Dependency parsing algorithm
  • A CKY-like chart parser
  • Evaluation measure
  • Dependency accuracy: the percentage of words that
    have the correct head
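A minimal sketch of this evaluation measure:

```python
def dependency_accuracy(gold_heads, pred_heads):
    """Both arguments: list of head indices, one per word."""
    assert len(gold_heads) == len(pred_heads)
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return 100.0 * correct / len(gold_heads)

print(dependency_accuracy([2, 2, 0, 2, 2], [2, 2, 0, 2, 3]))  # 80.0
```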

62
Results - 1 (IWPT 05)
[Table: evaluation results on CTB 4.0 (%); an undirected tree
plus a root gives a directed tree.]
63
Results - 2 (CoNLL 06)
(with a much simpler feature set)
[Table: evaluation results on CTB4-10 (%).]
64
Results - 3 (IJCAI 07)
[Table: evaluation results on Chinese and English (%).]
65
Results - 4 (ACL 08)
[Table: evaluation results on Chinese and English (%).]
66
Comparison with State of the Art
[Table: comparison of the IWPT 2005 and IJCAI 2007 results
against the state of the art on Chinese Treebank 4.0 and
Chinese Treebank 5.0.]
67
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    Boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Results
  • Conclusion

68
Conclusion
  • I have developed several statistical approaches
    to improve learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Extensions to Large Margin Training
  • Training Dependency Parsers via Structured
    Boosting
  • Semi-supervised Convex Training for Dependency
    Parsing
  • Achieved state-of-the-art accuracy for English
    and Chinese

69
Thanks!
Questions?
70
[Figure: my work sits at the intersection of NLP (features,
linguistic intuitions, models) and machine learning (training
criteria, regularization, smoothing).]
71
What Have I Worked On?
[Figure: dependency parsing, Part-of-Speech tagging and query
segmentation as instances of structured learning, where the
output is a set of inter-dependent labels with a specific
structure (e.g., a parse tree).]
72
Ambiguities In NLP
Courtesy of Aravind Joshi
I like eating sushi with tuna.
73
Dependency Tree
  • A dependency tree structure for a sentence captures
    the syntactic relationships between word pairs in
    the sentence

[Figure: two dependency trees for "I like eating sushi with
tuna", with links labeled subj, obj, mod; the trees differ in
whether "with tuna" attaches to "sushi" or to "eating".]
74
Feature Representation
  • Represent a word w by a feature vector
  • The value of a feature c is its association with w,
    computed from P(w, c), the probability that w and c
    co-occur in a context window
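A sketch of this feature value, reconstructed as pointwise mutual information; this reading is consistent with the PMI features used elsewhere in the talk, but the exact formula on the original slide is not preserved in this transcript.

```python
import math

def pmi(count_wc, count_w, count_c, total):
    # value(w, c) = log P(w, c) / (P(w) * P(c)), estimated from counts
    p_wc = count_wc / total
    p_w, p_c = count_w / total, count_c / total
    return math.log(p_wc / (p_w * p_c)) if count_wc else float("-inf")

print(pmi(count_wc=20, count_w=100, count_c=200, total=10000))
```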
75
Similarity-based Smoothing
  • Similarity measure: cosine
  • Following (Dagan et al., 1999)
76
Comparison With An Unlexicalized Model
  • In the unlexicalized model, the input to the
    parser is the sequence of POS tags, the opposite
    of our model
  • Using gold-standard POS tags
  • Accuracy of the unlexicalized model: 71.1%
  • Our strictly lexicalized model: 79.9%

77
Large Margin Training
  • Minimizing a regularized loss (Hastie et al.,
    2004)

i: the index of the training sentences; Ti: the target tree;
Li: a candidate tree; the margin is scaled by the distance
between the two trees
78
Objective with Local Constraints
  • The corresponding new quadratic program has only
    polynomially many constraints
    (j: the number of constraints in A)
79
Structured Boosting
  • Train a local link predictor, h1
  • Re-parse the training data using h1
  • Re-weight the local examples
  • Compare the parser outputs with the gold-standard
    trees
  • Increase the weight of mis-parsed local examples
  • Re-train the local link predictor, getting h2
  • Finally we have h1, h2, ..., hk
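A schematic sketch of this loop; `train_local`, `parse_with` and `reweight` are hypothetical stand-ins for any local link classifier trainer, any dependency parser driven by its link scores, and the example re-weighting step.

```python
def structured_boosting(examples, sentences, gold_trees,
                        train_local, parse_with, reweight, T=10):
    predictors = []
    for _ in range(T):
        h = train_local(examples)            # fit the local link classifier
        trees = [parse_with(h, s) for s in sentences]
        # raise the weights of local examples the parser got wrong
        examples = reweight(examples, trees, gold_trees)
        predictors.append(h)
    return predictors                        # h1, ..., hk
```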

80
Structured Boosting (An Example)
[Figure: the instance weights of the local examples saw-with
and duck-with for "I saw her duck with a telescope" change
across boosting iterations 1, 2, 3, ..., T.]
81
Variants of Structured Boosting
  • Using alternative boosting algorithms for
    structured boosting
  • AdaBoost.M2 (Freund & Schapire 1997)
  • Re-weighting class labels
  • The logistic regression form of boosting (Collins
    et al. 2002)

82
Dynamic Features
  • Also known as non-local features
  • Take into account the link labels of the
    surrounding word pairs when predicting the label
    of the current pair
  • Commonly used in sequential labeling (McCallum
    et al. 2000, Toutanova et al. 2003)
  • A simple but useful idea for improving parsing
    accuracy
  • Wang et al. 2005
  • McDonald and Pereira 2006

83
Dynamic Features
[Figure: two sentences, "I saw her duck with a telescope" and
"I saw her duck with a spot".]
  • Define a canonical order so that a word's
    children are generated first, before it modifies
    another word
  • telescope/spot are the dynamic features for
    deciding whether to generate a link between
    saw-with or duck-with

84
Results - 1
[Table 1: boosting with static features.]
85
More on Structured Learning
  • Improved Estimation for Unsupervised
    Part-of-Speech Tagging (Wang & Schuurmans 2005)
  • Improved Large Margin Dependency Parsing via
    Local Constraints and Laplacian Regularization
    (Wang, Cherry, Lizotte and Schuurmans 2006)
  • Learning Noun Phrase Query Segmentation
    (Bergsma and Wang 2007)
  • Semi-supervised Convex Training for Dependency
    Parsing (Wang, Schuurmans and Lin 2008)