Title: Dependency Parsing: Machine Learning Approaches
1. Dependency Parsing: Machine Learning Approaches
January 7, 2008
- Yuji Matsumoto
- Graduate School of Information Science
- Nara Institute of Science and Technology
- (NAIST, Japan)
2. Basic Language Analyses (POS Tagging, Phrase Chunking, Parsing)
Raw sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Base phrase chunking
Base phrase-chunked sentence
[He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only 1.8 billion]NP [in]PP [September]NP .
Dependency parsing
Dependency parsed sentence
3. Word Dependency Parsing (Unlabeled)
Raw sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Word dependency parsed sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September .
4. Word Dependency Parsing (Labeled)
Raw sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Word dependency parsed sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September .
5. A Phrase Structure Tree and a Dependency Tree
(figure: a phrase structure tree and the corresponding dependency tree for the same sentence)
6. Flattened Representation of a Dependency Tree
(figure: the same dependency tree drawn flat, with arcs over the word sequence)
7. Dependency Structure Terminology
(figure: an example dependency arc with the label SUBJ, drawn between "This" and "is")
- The direction of arrows may be drawn from head to child.
- When there is an arrow from w to v, we write w → v.
- When there is a path (a series of arrows) from w to v, we write w →* v.
8. Definition of Dependency Trees
- Single head: except for the root (EOS), every word has exactly one parent.
- Connected: the structure must be a connected tree.
- Acyclic: if wi →* wj, then wj →* wi never holds.
- Projective: if wi → wj, then every word wk between i and j satisfies wk →* wi or wk →* wj (no crossing between dependencies); a small check of this condition is sketched below.
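To make the projectivity condition concrete, here is a minimal check (mine, not from the slides) for a tree given as a head array, where head[k-1] is the parent position of word k and 0 marks the root:

```python
# A minimal sketch: checking the non-crossing (projectivity) condition above.

def is_projective(head):
    n = len(head)

    def dominates(anc, k):
        # follow the chain of parents upward from word k; True if we reach anc
        while k != 0:
            if k == anc:
                return True
            k = head[k - 1]
        return False

    for dep in range(1, n + 1):
        h = head[dep - 1]
        if h == 0:
            continue                         # the root arc imposes no constraint here
        lo, hi = min(h, dep), max(h, dep)
        for k in range(lo + 1, hi):
            # every word strictly between the two endpoints must ultimately
            # depend on one of them (the slide's definition)
            if not (dominates(h, k) or dominates(dep, k)):
                return False
    return True

print(is_projective([2, 0, 6, 6, 6, 8, 8, 2]))   # True: a nested (projective) tree
print(is_projective([2, 0, 4, 2, 3]))            # False: arcs 2-4 and 3-5 cross
```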
9. Projective Dependency Tree
(figure: a projective dependency tree)
Projectiveness: all the words between the two ends of a dependency ultimately depend on one of those ends, here either "was" or "." (e.g., light → was).
10. Non-projective Dependency Tree
(figure: a dependency tree with crossing arcs and an explicit root node)
Direction of edges: from a child to the parent.
11. Non-projective Dependency Tree
(figure taken from R. McDonald and F. Pereira, "Online Learning of Approximate Dependency Parsing Algorithms," EACL 2006)
Direction of edges: from a parent to the children.
12. Two Different Strategies for Structured Language Analysis
- Sentences have structure
  - Linear sequences: POS tagging, phrase/named entity chunking
  - Tree structures: phrase structure trees, dependency trees
- Two statistical approaches to structure analysis
  - Global optimization
    - e.g., Hidden Markov Models and Conditional Random Fields for sequential tagging problems
    - Probabilistic context-free parsing
    - Maximum Spanning Tree parsing (graph-based)
  - Repetition of local optimization
    - Chunking with Support Vector Machines
    - Deterministic parsing (transition-based)
13. Statistical Dependency Parsers
- Eisner (COLING 96, Penn Technical Report 96)
- Kudo & Matsumoto (VLC 00, CoNLL 02)
- Yamada & Matsumoto (IWPT 03)
- Nivre (IWPT 03, COLING 04, ACL 05)
- Cheng, Asahara & Matsumoto (IJCNLP 04)
- McDonald, Crammer & Pereira (ACL 05a, EMNLP 05b, EACL 06)
The slide groups these under two headings: global optimization versus repetition of local optimization.
14. Dependency Parsing Used as the CoNLL Shared Task
- CoNLL (Conference on Natural Language Learning)
- Multilingual Dependency Parsing Track
  - 10 languages: Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian, Turkish
- Domain Adaptation Track
  - Dependency-annotated data in one domain and large unannotated data in other domains (biomedical/chemical abstracts, parent-child dialogue) are available.
  - Objective: use the large-scale unannotated target-domain data to enhance a dependency parser learned in the original domain so that it works well in the new domain.
Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., Yuret, D., "The CoNLL 2007 Shared Task on Dependency Parsing," Proceedings of EMNLP-CoNLL 2007, pp. 915-932, June 2007.
15. Statistical Dependency Parsers (to Be Introduced in This Lecture)
- Kudo & Matsumoto (VLC 00, CoNLL 02): Japanese
- Yamada & Matsumoto (IWPT 03)
- Nivre (IWPT 03, COLING 04, ACL 05)
- McDonald, Crammer & Pereira (EMNLP 05a, ACL 05b, EACL 06)
Most of them (except Nivre 05 and McDonald 05a) assume projective dependency parsing.
16. Japanese Syntactic Dependency Analysis
- Analysis of the relationships between phrasal units (bunsetsu segments)
- Two constraints:
  - Each segment modifies one of the segments to its right (Japanese is a head-final language)
  - Dependencies do not cross one another (projectiveness)
17. An Example of Japanese Syntactic Dependency Analysis
18. Model 1: Probabilistic Model [Kudo & Matsumoto 00]
Input:
私は1 / 彼女と2 / 京都に3 / 行きます4
(I-top / with her / to Kyoto-loc / go)
19. Problems of the Probabilistic Model (1)
- Selection of training examples
  - All pairs of segments in a sentence are used:
    - depending pairs → positive examples
    - non-depending pairs → negative examples
  - This produces a total of n(n-1)/2 training examples per sentence (n is the number of segments in the sentence)
- In Model 1
  - All positive and negative examples are used to train an SVM
  - A test example is given to the SVM, and its distance from the separating hyperplane is transformed into a pseudo-probability by the sigmoid function (sketched below)
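As a rough illustration of that last step, the mapping from an SVM decision value to a pseudo-probability might look like this (a minimal sketch; the scaling constant beta is a hypothetical choice, not a value from the paper):

```python
import math

def pseudo_probability(distance, beta=1.0):
    # Map an SVM decision value (signed distance from the hyperplane) to (0, 1).
    # beta is a hypothetical scaling constant, not taken from the paper.
    return 1.0 / (1.0 + math.exp(-beta * distance))

# A pair far on the "depends" side of the hyperplane gets a high pseudo-probability.
print(pseudo_probability(2.3))    # ~0.91
print(pseudo_probability(-0.4))   # ~0.40
```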
20. Problems of the Probabilistic Model (2)
- The number of training examples is large
- O(n³) time is necessary for complete parsing
- The classification cost of an SVM is much higher than that of other ML algorithms such as Maximum Entropy models and decision trees
21. Model 2: Cascaded Chunking Model [Kudo & Matsumoto 02]
- Parses a sentence deterministically, deciding only whether the current segment modifies the segment on its immediate right-hand side (see the sketch after this slide)
- Training examples are extracted using the same parsing algorithm
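A simplified sketch of this control loop (my reconstruction, not the authors' code); `classify` stands for the trained binary classifier, and the oracle below uses a plausible gold tree for the example sentence of the next slides:

```python
def cascaded_chunking_parse(segments, classify):
    """segments: segment ids in sentence order.  Returns {dependent: head};
    the last remaining segment is the head of the sentence."""
    heads = {}
    remaining = list(segments)
    while len(remaining) > 1:
        kept, removed_any = [], False
        for i, seg in enumerate(remaining):
            if i + 1 < len(remaining) and classify(seg, remaining[i + 1], remaining) == "D":
                heads[seg] = remaining[i + 1]   # seg modifies its immediate right neighbour
                removed_any = True              # decided segments drop out of later passes
            else:
                kept.append(seg)
        if not removed_any:                     # safety net so the loop always terminates
            heads[kept[0]] = kept[1]
            kept = kept[1:]
        remaining = kept
    return heads

# Oracle for a plausible gold tree of the example sentence, {1->5, 2->4, 3->4, 4->5}:
# it answers "D" only when the gold head is the right neighbour AND no
# still-remaining segment depends on the current one -- the same condition
# used when extracting training examples with this algorithm.
gold = {1: 5, 2: 4, 3: 4, 4: 5}
def oracle(a, b, remaining):
    pending = any(gold.get(s) == a for s in remaining if s != a)
    return "D" if gold.get(a) == b and not pending else "O"

print(cascaded_chunking_parse([1, 2, 3, 4, 5], oracle))   # recovers the gold tree
```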
22. Example: Training Phase
Annotated sentence:
彼は1 彼女の2 温かい3 真心に4 感動した。5
(he-top / her / warm / heart-by / was-moved: "He was moved by her warm heart.")
Pairs of a tag (D or O) and its context (features) are stored as training data for the SVMs.
Training data
23. Example: Test Phase
Test sentence:
彼は1 彼女の2 温かい3 真心に4 感動した。5
(he-top / her / warm / heart-by / was-moved: "He was moved by her warm heart.")
The tag of each pair is decided by the SVMs built in the training phase.
24. Advantages of the Cascaded Chunking Model
- Efficiency
  - O(n³) (probabilistic model) vs. O(n²) (cascaded chunking model)
  - In practice even lower than O(n²), since most segments modify the segment on their immediate right-hand side
  - The number of training examples is much smaller
- Independence from ML methods
  - Can be combined with any ML algorithm that works as a binary classifier
  - Probabilities of dependency are not necessary
25. Features Used in the Implementation
Modify or not?
彼の1 友人は2 この本を3 持っている4 女性を5 探している6
(his / friend-top / this book-acc / have / lady-acc / is looking for)
(the figure marks one segment as the modifier and a candidate segment as its head)
"His friend is looking for a lady who has this book."
- Static features
  - Of the modifier/modifiee segments: head word and functional word (surface form, POS, POS subcategory, inflection type, inflection form), brackets, quotations, punctuation
  - Between the segments: distance, case particles, brackets, quotations, punctuation
- Dynamic features
  - A, B: static features of the functional word
  - C: static features of the head word
26. Settings of the Experiments
- Kyoto University Corpus 2.0/3.0
  - Standard data set
    - Training: 7,958 sentences / Test: 1,246 sentences
    - Same data as Uchimoto et al. 98 and Kudo & Matsumoto 00
  - Large data set
    - 2-fold cross-validation using all 38,383 sentences
- Kernel function: 3rd-degree polynomial
- Evaluation measures
  - Dependency accuracy
  - Sentence accuracy
27. Results
Data set                     | Standard (8,000 sentences)          | Large (20,000 sentences)
Model                        | Probabilistic | Cascaded Chunking   | Probabilistic | Cascaded Chunking
Dependency accuracy (%)      | 89.09         | 89.29               | N/A           | 90.45
Sentence accuracy (%)        | 46.17         | 47.53               | N/A           | 53.16
# of training sentences      | 7,956         | 7,956               | 19,191        | 19,191
# of training examples       | 459,105       | 110,355             | 1,074,316     | 251,254
Training time (hours)        | 336           | 8                   | N/A           | 48
Parsing time (sec./sentence) | 2.1           | 0.5                 | N/A           | 0.7
28. Probabilistic vs. Cascaded Chunking Models
         | Probabilistic                                          | Cascaded Chunking
Strategy | Maximize the sentence probability                      | Deterministic shift-reduce
Merit    | Can see all candidates of dependency                   | Simple, efficient and scalable; as accurate as the probabilistic model
Demerit  | Inefficient; commits to unnecessary training examples  | Cannot see all the (posterior) candidates of dependency
29. Smoothing Effect (in the Cascaded Chunking Model)
- No need to cut off low-frequency words
Cut-off frequency | Dependency accuracy (%) | Sentence accuracy (%)
1                 | 89.3                    | 47.8
2                 | 88.7                    | 46.3
4                 | 87.8                    | 44.6
6                 | 87.6                    | 44.8
8                 | 87.4                    | 42.5
30. Combination of Features
- Polynomial kernels are used to take combinations of features into account (tested with a small corpus of 2,000 sentences); a small illustration follows the table
Degree of polynomial kernel | Dependency accuracy (%) | Sentence accuracy (%)
1                           | N/A                     | N/A
2                           | 86.9                    | 40.6
3                           | 87.7                    | 42.9
4                           | 87.7                    | 42.8
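As a tiny illustration (not from the slides) of why a polynomial kernel captures feature combinations: for binary feature vectors x and z, (x·z + 1)^d expands into a term for every conjunction of up to d features shared by x and z, so the SVM implicitly uses those conjunctions without enumerating them.

```python
import numpy as np

def poly_kernel(x, z, d=3):
    # (x.z + 1)^d: polynomial kernel of degree d
    return (np.dot(x, z) + 1) ** d

x = np.array([1, 0, 1, 1])        # toy binary feature vectors (made up)
z = np.array([1, 1, 1, 0])
print(poly_kernel(x, z, d=2))     # (2 + 1)^2 = 9
```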
31. Deterministic Dependency Parser Based on SVMs [Yamada & Matsumoto 03]
- Three possible actions
  - Right: for the two adjacent words, modification goes from the left word to the right word
  - Left: for the two adjacent words, modification goes from the right word to the left word
  - Shift: no action is taken for the pair, and the focus moves to the right
    - There are two possibilities in this situation:
      - there really is no modification relation between the pair, or
      - there actually is a modification relation between them, but it must wait until the surrounding analysis has been finished
    - The second situation can be treated as a separate class (called Wait)
- This process is applied to the input sentence from beginning to end and repeated until a single word remains (a sketch of the loop follows this slide)
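The control loop of this process can be sketched as follows (my reconstruction, not the authors' implementation); `predict` stands for the trained SVM classifier over the features of slide 35, and the oracle and 4-word sentence below are made-up illustrations:

```python
def ym_parse(words, predict):
    """words: word ids in sentence order; predict(left, right, nodes) returns
    'right', 'left' or 'shift'.  Returns {dependent: head}."""
    heads = {}
    nodes = list(words)
    while len(nodes) > 1:
        i, constructed = 0, False
        while i < len(nodes) - 1:
            action = predict(nodes[i], nodes[i + 1], nodes)
            if action == "right":               # left word modifies the right word
                heads[nodes[i]] = nodes[i + 1]
                del nodes[i]
                constructed = True
            elif action == "left":              # right word modifies the left word
                heads[nodes[i + 1]] = nodes[i]
                del nodes[i + 1]
                constructed = True
            else:                               # shift: move the focus to the right
                i += 1
        if not constructed:                     # nothing built in a whole pass:
            break                               # stop rather than loop forever
    return heads

# Toy oracle for a hypothetical 4-word sentence "I saw a girl"
# with gold tree I->saw, a->girl, girl->saw.
gold = {1: 2, 3: 4, 4: 2}
def oracle(left, right, nodes):
    def ready(w):                               # w has no unattached dependents left
        return not any(gold.get(d) == w for d in nodes if d != w)
    if gold.get(left) == right and ready(left):
        return "right"
    if gold.get(right) == left and ready(right):
        return "left"
    return "shift"

print(ym_parse([1, 2, 3, 4], oracle))           # {1: 2, 3: 4, 4: 2}
```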
32. Right Action
33. Left Action
34. Shift Action
35. The Features Used in Learning
An SVM is used for classification, either as a 3-class model (right, left, shift) or as a 4-class model (right, left, shift, wait).
36. SVM Learning of Actions
- The best action for each configuration is learned by SVMs
- Since this is a 3-class or 4-class classification problem, either the pairwise or the one-vs-rest method is employed (see the sketch below)
  - Pairwise method: for each pair of classes, learn an SVM; the best class is decided by voting over all the SVMs
  - One-vs-rest method: for each class, an SVM is learned to discriminate that class from the others; the best class is decided by the SVM that gives the highest value
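For reference, the two strategies can be sketched with off-the-shelf tools (this is not the lecture's code); the feature vectors X and action labels y below are made-up toy data, and scikit-learn's linear SVM stands in for the SVM learner:

```python
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]          # toy feature vectors
y = ["right", "left", "shift", "right"]                    # toy action labels

pairwise = OneVsOneClassifier(LinearSVC()).fit(X, y)       # one SVM per pair of classes, then voting
one_vs_rest = OneVsRestClassifier(LinearSVC()).fit(X, y)   # one SVM per class vs. the rest

print(pairwise.predict([[1, 0, 1]]))
print(one_vs_rest.predict([[1, 0, 1]]))
```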
37. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
(figure: the walkthrough parses the sentence "the boy hits the dog with a rod", showing the partial trees after each action)
Action: right
38. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: right
39. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
40. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: right
41. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
42. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
43. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: right
44. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: left
45. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
46. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: left
47. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: left
48. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
End of parsing: only "hits" remains.
49. The Accuracy of Parsing
(table: accuracies for dependency relation, root identification, and complete analysis)
- Learned from 30,000 English sentences
- no children: no child information is considered
- word, POS: only word/POS information is used
- all: all information is used
50. Deterministic Linear-Time Dependency Parser Based on Shift-Reduce Parsing [Nivre 03, 04]
- There are two data structures, a stack S and a queue Q
- Initialization: S = [w1], Q = [w2, w3, ..., wn]
- Termination: Q is empty
- Parsing actions
  - Shift: ⟨S, wi|Q⟩ ⇒ ⟨S|wi, Q⟩
  - Left-Arc: ⟨S|wi, wj|Q⟩ ⇒ ⟨S, wj|Q⟩ (wi becomes a dependent of wj)
  - Right-Arc: ⟨S|wi, wj|Q⟩ ⇒ ⟨S|wi|wj, Q⟩ (wj becomes a dependent of wi)
  - Reduce: ⟨S|wi|wj, Q⟩ ⇒ ⟨S|wi, Q⟩ (wj, whose head has already been found, is popped)
Though the original parser uses memory-based learning, recent implementations use SVMs to select actions (a sketch of the loop in code follows).
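A compact sketch of this arc-eager loop (my reconstruction, not Nivre's implementation); `choose_action` stands for the trained classifier and must return one of "shift", "left", "right", "reduce":

```python
def nivre_parse(words, choose_action):
    """words: word ids in sentence order.  Returns {dependent: head}."""
    # The stack starts empty here; the slide's initialization S = [w1] simply
    # corresponds to performing one shift before the loop.
    S, Q, heads = [], list(words), {}
    while Q:
        action = choose_action(S, Q, heads)
        if action == "left" and S and S[-1] not in heads:
            heads[S.pop()] = Q[0]        # Left-Arc: top of S depends on the front of Q
        elif action == "right" and S:
            heads[Q[0]] = S[-1]          # Right-Arc: front of Q depends on the top of S
            S.append(Q.pop(0))
        elif action == "reduce" and S and S[-1] in heads:
            S.pop()                      # Reduce: pop a word whose head is already found
        else:
            S.append(Q.pop(0))           # Shift (also the fallback when an action is illegal)
    return heads
```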
51. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
(figure: the walkthrough parses the sentence "the boy hits the dog with a rod", showing the stack S and the queue Q at each step)
Action: left
52. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
53. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: left
54. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
55. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
56. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: left
57. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: right
58. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: reduce
59. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: right
60. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
61. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: left
62. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: right
63. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: terminate
64. Computational Costs
- Cascaded chunking (Kudo Model 2, Yamada): O(n²)
- Deterministic shift-reduce parsing (Nivre): O(n)
- CKY-based parsing (Eisner, Kudo Model 1): O(n³)
- Maximum spanning tree parsing (McDonald): O(n²)
65. McDonald's MST Parsing
- R. McDonald, F. Pereira, K. Ribarov, and J. Hajič, "Non-projective Dependency Parsing Using Spanning Tree Algorithms," in Proceedings of the Joint Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), 2005.
MST = Maximum Spanning Tree
66. Definition of Non-projective Dependency Trees
- Single head: except for the root, all words have a single parent
- Connected: the structure must be a connected tree
- Acyclic: if wi →* wj, then wj →* wi never holds
(figure taken from R. McDonald and F. Pereira, "Online Learning of Approximate Dependency Parsing Algorithms," EACL 2006)
67. Notation and Definitions
- T = {(x_t, y_t)}: the training data
- x_t: the t-th sentence in T
- y_t: the dependency structure of the t-th sentence (defined by a set of edges)
- f(i, j): the vector of feature functions for an edge between words i and j
- dt(x): the set of all dependency trees producible from the sentence x
68. The Basic Idea
The score function of a dependency tree is defined as the sum of the scores of its edges, each of which is a weighted sum of feature functions (see the formulas below).
(figure: an example dependency tree for "I saw a girl")
The target optimization problem: learn w so that the correct tree outscores every other candidate tree by a margin of at least L(y, y').
L(y, y') is defined as the number of words in y' whose parent differs from their parent in y.
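Written out in the standard edge-factored notation (a reconstruction of the formulas the slide refers to, using the symbols explained on the next slide), the idea is roughly:

```latex
% Edge scores and tree score (edge-factored model)
s(i,j) = \mathbf{w} \cdot \mathbf{f}(i,j), \qquad
s(\mathbf{x},\mathbf{y}) = \sum_{(i,j) \in \mathbf{y}} s(i,j)
                         = \sum_{(i,j) \in \mathbf{y}} \mathbf{w} \cdot \mathbf{f}(i,j)

% Target optimization problem (large-margin constraints over candidate trees)
\min \|\mathbf{w}\| \quad \text{s.t.} \quad
s(\mathbf{x}_t,\mathbf{y}_t) - s(\mathbf{x}_t,\mathbf{y}') \ge L(\mathbf{y}_t,\mathbf{y}')
\qquad \forall\, \mathbf{y}' \in dt(\mathbf{x}_t),\ \forall\, t
```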
69. Explanation of the Formulas
- s(i, j): the score of an edge connecting nodes i and j
- f(i, j): a feature vector representing information about words i and j and the relation between them
- w: the weight vector over the feature vectors, learned from the training examples so as to maximize the objective above
70. Online Learning Algorithm (MIRA)
- w(0) = 0; v = 0; i = 0
- for n = 1..N
-   for t = 1..T
-     w(i+1) = update w(i) with (x_t, y_t)
-     v = v + w(i+1)
-     i = i + 1
- w = v / (N * T)
N: predefined number of iterations; T: number of training examples
(a schematic rendering in code follows)
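The loop above can be rendered schematically as follows (assumptions mine); `mira_update` is a placeholder for the per-example MIRA step described on the next slide:

```python
import numpy as np

def train_averaged(examples, dim, n_iters, mira_update):
    w = np.zeros(dim)                    # current weight vector
    v = np.zeros(dim)                    # running sum of all intermediate weight vectors
    for _ in range(n_iters):             # n_iters corresponds to N on the slide
        for x, y in examples:            # examples corresponds to the T training pairs
            w = mira_update(w, x, y)     # make the gold tree y outscore its rivals for x
            v = v + w
    return v / (n_iters * len(examples)) # averaged weights, as in the last line of the slide
```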
71. Update of the Weight Vector
Since this formulation requires constraints over all possible trees, it is modified so that only the k-best trees are taken into consideration (k = 5 to 20).
72. Features Used in Training
- Unigram features: the word/POS tag of the parent and the word/POS tag of the child (when a word is longer than 5 characters, only its 5-character prefix is used; this applies to all the other features as well)
- Bigram features: the pair of word/POS tags of the parent and the child
- Trigram features: the POS tags of the parent and the child, plus the POS tag of a word appearing in between
- Context features: the POS tags of the parent and the child, plus the two POS tags before and after them (backed off to trigrams)
73. Settings of the Experiments
- The Penn Treebank is converted into dependency trees based on Yamada's head rules: training data (sections 02-21), development data (22), test data (23)
- POS tagging is done by Ratnaparkhi's MaxEnt tagger
- The k-best trees are constructed with Eisner's parsing algorithm
74. Dependency Parsing by the Maximum Spanning Tree Algorithm
- Learn the optimal w from the training data
- For a given sentence:
  - calculate the scores of all word pairs in the sentence
  - get the spanning tree with the largest score
- There is an algorithm (the Chu-Liu-Edmonds algorithm) that requires only O(n²) time to obtain the maximum spanning tree (n: the number of nodes, i.e., words)
75. The Image of Getting the Maximum Spanning Tree
(figure: scores for all pairs of words in "John saw Mary", where root is a dummy root node, and the maximum spanning tree extracted from them; a code sketch follows)
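A small decoding sketch (not McDonald's code): build a dense directed graph of edge scores and extract its maximum spanning arborescence. networkx's Edmonds implementation plays the role of the Chu-Liu-Edmonds step here, and the scores are made-up values for "John saw Mary":

```python
import networkx as nx

scores = {                                  # score(head -> dependent), hypothetical values
    ("root", "saw"): 10, ("root", "John"): 9, ("root", "Mary"): 9,
    ("saw", "John"): 30, ("saw", "Mary"): 30, ("John", "saw"): 20,
    ("Mary", "saw"): 0,  ("John", "Mary"): 3, ("Mary", "John"): 11,
}

G = nx.DiGraph()
for (head, dep), s in scores.items():
    G.add_edge(head, dep, weight=s)

tree = nx.maximum_spanning_arborescence(G)  # Edmonds / Chu-Liu-Edmonds step
print(sorted(tree.edges()))                 # [('root', 'saw'), ('saw', 'John'), ('saw', 'Mary')]
```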
76. Results of the Experiments
Model              | Accuracy | Root | Complete
Yamada & Matsumoto | 90.3     | 91.6 | 38.4
Nivre              | 87.3     | 84.3 | 30.4
Avg. Perceptron    | 90.6     | 94.0 | 36.5
MIRA               | 90.9     | 94.2 | 37.5
Accuracy: proportion of correct edges; Root: accuracy of root words; Complete: proportion of sentences analyzed completely correctly.
77. Available Dependency Parsers
- Maximum Spanning Tree Parser (McDonald)
  - http://ryanmcd.googlepages.com/MSTParser.html
- Data-driven Dependency Parser, MaltParser (Nivre)
  - http://w3.msi.vxu.se/~nivre/research/MaltParser.html
- CaboCha: SVM-based Japanese Dependency Parser (Kudo)
  - http://chasen.org/~taku/software/cabocha/
78. References
- Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto, "Deterministic Dependency Analyzer for Chinese," IJCNLP-04: The First International Joint Conference on Natural Language Processing, pp. 135-140, 2004.
- Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto, "Chinese Deterministic Dependency Analyzer: Examining Effects of Global Features and Root Node Finder," Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pp. 17-24, October 2005.
- Jason M. Eisner, "Three New Probabilistic Models for Dependency Parsing: An Exploration," COLING-96: The 16th International Conference on Computational Linguistics, pp. 340-345, 1996.
- Taku Kudo, Yuji Matsumoto, "Japanese Dependency Structure Analysis Based on Support Vector Machines," Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 18-25, October 2000.
- Taku Kudo, Yuji Matsumoto, "Japanese Dependency Analysis Using Cascaded Chunking," Proceedings of the 6th Conference on Natural Language Learning (CoNLL-02), pp. 63-69, 2002.
- Ryan McDonald, Koby Crammer, Fernando Pereira, "Online Large-Margin Training of Dependency Parsers," Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 91-98, 2005.
- Ryan McDonald, Fernando Pereira, Kiril Ribarov, Jan Hajič, "Non-Projective Dependency Parsing Using Spanning Tree Algorithms," HLT-EMNLP, 2005.
- Ryan McDonald, Fernando Pereira, "Online Learning of Approximate Dependency Parsing Algorithms," Proceedings of the European Chapter of the Association for Computational Linguistics, 2006.
- Joakim Nivre, "An Efficient Algorithm for Projective Dependency Parsing," Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pp. 149-160, 2003.
- Joakim Nivre, Mario Scholz, "Deterministic Dependency Parsing of English Text," COLING 2004: 20th International Conference on Computational Linguistics, pp. 64-70, 2004.
- Joakim Nivre, Jens Nilsson, "Pseudo-Projective Dependency Parsing," ACL-05: 43rd Annual Meeting of the Association for Computational Linguistics, pp. 99-106, 2005.
- Joakim Nivre, Inductive Dependency Parsing, Text, Speech and Language Technology Vol. 34, Springer, 2006.
- Hiroyasu Yamada, Yuji Matsumoto, "Statistical Dependency Analysis with Support Vector Machines," Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pp. 195-206, 2003.