Title: Dependency Parsing: Machine Learning Approaches
1. Dependency Parsing: Machine Learning Approaches
January 7, 2008
- Yuji Matsumoto
- Graduate School of Information Science
- Nara Institute of Science and Technology
- (NAIST, Japan)
2. Basic Language Analyses (POS Tagging, Phrase Chunking, Parsing)
Raw sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Base phrase chunking
Base phrase-chunked sentence
[He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only 1.8 billion]NP [in]PP [September]NP .
Dependency parsing
Dependency parsed sentence
3. Word Dependency Parsing (Unlabeled)
Raw sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Word dependency parsed sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September .
4. Word Dependency Parsing (Labeled)
Raw sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Word dependency parsed sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September .
5. A Phrase Structure Tree and a Dependency Tree
(figure: a phrase structure tree and the corresponding dependency tree for the same sentence)
6. Flattened Representation of a Dependency Tree
(figure: the same dependency tree drawn flat, with arcs over the word sequence)
7. Dependency Structure Terminology
(figure: an example dependency arc with the label SUBJ, drawn between "This" and "is")
- The direction of arrows may be drawn from head to child.
- When there is an arrow from w to v, we write w → v.
- When there is a path (a series of arrows) from w to v, we write w →* v.
8. Definition of Dependency Trees
- Single head: except for the root (EOS), every word has exactly one parent.
- Connected: the structure must be a connected tree.
- Acyclic: if wi →* wj, then wj →* wi never holds.
- Projective: if wi → wj, then every word wk between i and j satisfies wk →* wi or wk →* wj (no crossing between dependencies); a small check of this condition is sketched below.
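To make the projectivity condition concrete, here is a minimal check (mine, not from the slides) for a tree given as a head array, where head[k-1] is the parent position of word k and 0 marks the root:

```python
# A minimal sketch: checking the non-crossing (projectivity) condition above.

def is_projective(head):
    n = len(head)

    def dominates(anc, k):
        # follow the chain of parents upward from word k; True if we reach anc
        while k != 0:
            if k == anc:
                return True
            k = head[k - 1]
        return False

    for dep in range(1, n + 1):
        h = head[dep - 1]
        if h == 0:
            continue                         # the root arc imposes no constraint here
        lo, hi = min(h, dep), max(h, dep)
        for k in range(lo + 1, hi):
            # every word strictly between the two endpoints must ultimately
            # depend on one of them (the slide's definition)
            if not (dominates(h, k) or dominates(dep, k)):
                return False
    return True

print(is_projective([2, 0, 6, 6, 6, 8, 8, 2]))   # True: a nested (projective) tree
print(is_projective([2, 0, 4, 2, 3]))            # False: arcs 2-4 and 3-5 cross
```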
9. Projective Dependency Tree
(figure: a projective dependency tree)
Projectiveness: all the words between the two ends of a dependency ultimately depend on one of those ends, here either "was" or "." (e.g., light → was).
10. Non-projective Dependency Tree
(figure: a dependency tree with crossing arcs and an explicit root node)
Direction of edges: from a child to the parent.
11. Non-projective Dependency Tree
(figure taken from R. McDonald and F. Pereira, "Online Learning of Approximate Dependency Parsing Algorithms," EACL 2006)
Direction of edges: from a parent to the children.
12. Two Different Strategies for Structured Language Analysis
- Sentences have structure
  - Linear sequences: POS tagging, phrase/named entity chunking
  - Tree structures: phrase structure trees, dependency trees
- Two statistical approaches to structure analysis
  - Global optimization
    - e.g., Hidden Markov Models and Conditional Random Fields for sequential tagging problems
    - Probabilistic context-free parsing
    - Maximum Spanning Tree parsing (graph-based)
  - Repetition of local optimization
    - Chunking with Support Vector Machines
    - Deterministic parsing (transition-based)
13. Statistical Dependency Parsers
- Eisner (COLING 96, Penn Technical Report 96)
- Kudo & Matsumoto (VLC 00, CoNLL 02)
- Yamada & Matsumoto (IWPT 03)
- Nivre (IWPT 03, COLING 04, ACL 05)
- Cheng, Asahara & Matsumoto (IJCNLP 04)
- McDonald, Crammer & Pereira (ACL 05a, EMNLP 05b, EACL 06)
The slide groups these under two headings: global optimization versus repetition of local optimization.
14. Dependency Parsing Used as the CoNLL Shared Task
- CoNLL (Conference on Natural Language Learning)
- Multilingual Dependency Parsing Track
  - 10 languages: Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian, Turkish
- Domain Adaptation Track
  - Dependency-annotated data in one domain and large unannotated data in other domains (biomedical/chemical abstracts, parent-child dialogue) are available.
  - Objective: use the large-scale unannotated target-domain data to enhance a dependency parser learned in the original domain so that it works well in the new domain.
Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., Yuret, D., "The CoNLL 2007 Shared Task on Dependency Parsing," Proceedings of EMNLP-CoNLL 2007, pp. 915-932, June 2007.
15. Statistical Dependency Parsers (to Be Introduced in This Lecture)
- Kudo & Matsumoto (VLC 00, CoNLL 02): Japanese
- Yamada & Matsumoto (IWPT 03)
- Nivre (IWPT 03, COLING 04, ACL 05)
- McDonald, Crammer & Pereira (EMNLP 05a, ACL 05b, EACL 06)
Most of them (except Nivre 05 and McDonald 05a) assume projective dependency parsing.
16. Japanese Syntactic Dependency Analysis
- Analysis of the relationships between phrasal units (bunsetsu segments)
- Two constraints:
  - Each segment modifies one of the segments to its right (Japanese is a head-final language)
  - Dependencies do not cross one another (projectiveness)
17. An Example of Japanese Syntactic Dependency Analysis
18. Model 1: Probabilistic Model [Kudo & Matsumoto 00]
Input:
私は1 / 彼女と2 / 京都に3 / 行きます4
(I-top / with her / to Kyoto-loc / go)
19. Problems of the Probabilistic Model (1)
- Selection of training examples
  - All pairs of segments in a sentence are used:
    - depending pairs → positive examples
    - non-depending pairs → negative examples
  - This produces a total of n(n-1)/2 training examples per sentence (n is the number of segments in the sentence)
- In Model 1
  - All positive and negative examples are used to train an SVM
  - A test example is given to the SVM, and its distance from the separating hyperplane is transformed into a pseudo-probability by the sigmoid function (sketched below)
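As a rough illustration of that last step, the mapping from an SVM decision value to a pseudo-probability might look like this (a minimal sketch; the scaling constant beta is a hypothetical choice, not a value from the paper):

```python
import math

def pseudo_probability(distance, beta=1.0):
    # Map an SVM decision value (signed distance from the hyperplane) to (0, 1).
    # beta is a hypothetical scaling constant, not taken from the paper.
    return 1.0 / (1.0 + math.exp(-beta * distance))

# A pair far on the "depends" side of the hyperplane gets a high pseudo-probability.
print(pseudo_probability(2.3))    # ~0.91
print(pseudo_probability(-0.4))   # ~0.40
```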
20. Problems of the Probabilistic Model (2)
- The number of training examples is large
- O(n³) time is necessary for complete parsing
- The classification cost of an SVM is much higher than that of other ML algorithms such as Maximum Entropy models and decision trees
21. Model 2: Cascaded Chunking Model [Kudo & Matsumoto 02]
- Parses a sentence deterministically, deciding only whether the current segment modifies the segment on its immediate right-hand side (see the sketch after this slide)
- Training examples are extracted using the same parsing algorithm
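A simplified sketch of this control loop (my reconstruction, not the authors' code); `classify` stands for the trained binary classifier, and the oracle below uses a plausible gold tree for the example sentence of the next slides:

```python
def cascaded_chunking_parse(segments, classify):
    """segments: segment ids in sentence order.  Returns {dependent: head};
    the last remaining segment is the head of the sentence."""
    heads = {}
    remaining = list(segments)
    while len(remaining) > 1:
        kept, removed_any = [], False
        for i, seg in enumerate(remaining):
            if i + 1 < len(remaining) and classify(seg, remaining[i + 1], remaining) == "D":
                heads[seg] = remaining[i + 1]   # seg modifies its immediate right neighbour
                removed_any = True              # decided segments drop out of later passes
            else:
                kept.append(seg)
        if not removed_any:                     # safety net so the loop always terminates
            heads[kept[0]] = kept[1]
            kept = kept[1:]
        remaining = kept
    return heads

# Oracle for a plausible gold tree of the example sentence, {1->5, 2->4, 3->4, 4->5}:
# it answers "D" only when the gold head is the right neighbour AND no
# still-remaining segment depends on the current one -- the same condition
# used when extracting training examples with this algorithm.
gold = {1: 5, 2: 4, 3: 4, 4: 5}
def oracle(a, b, remaining):
    pending = any(gold.get(s) == a for s in remaining if s != a)
    return "D" if gold.get(a) == b and not pending else "O"

print(cascaded_chunking_parse([1, 2, 3, 4, 5], oracle))   # recovers the gold tree
```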
22. Example: Training Phase
Annotated sentence:
彼は1 彼女の2 温かい3 真心に4 感動した。5
(he-top / her / warm / heart-by / was-moved: "He was moved by her warm heart.")
Pairs of a tag (D or O) and its context (features) are stored as training data for the SVMs.
Training data
23. Example: Test Phase
Test sentence:
彼は1 彼女の2 温かい3 真心に4 感動した。5
(he-top / her / warm / heart-by / was-moved: "He was moved by her warm heart.")
The tag of each pair is decided by the SVMs built in the training phase.
24. Advantages of the Cascaded Chunking Model
- Efficiency
  - O(n³) (probabilistic model) vs. O(n²) (cascaded chunking model)
  - In practice even lower than O(n²), since most segments modify the segment on their immediate right-hand side
  - The number of training examples is much smaller
- Independence from ML methods
  - Can be combined with any ML algorithm that works as a binary classifier
  - Probabilities of dependency are not necessary
25. Features Used in the Implementation
Modify or not?
彼の1 友人は2 この本を3 持っている4 女性を5 探している6
(his / friend-top / this book-acc / have / lady-acc / is looking for)
(the figure marks one segment as the modifier and a candidate segment as its head)
"His friend is looking for a lady who has this book."
- Static features
  - Of the modifier/modifiee segments: head word and functional word (surface form, POS, POS subcategory, inflection type, inflection form), brackets, quotations, punctuation
  - Between the segments: distance, case particles, brackets, quotations, punctuation
- Dynamic features
  - A, B: static features of the functional word
  - C: static features of the head word
26. Settings of the Experiments
- Kyoto University Corpus 2.0/3.0
  - Standard data set
    - Training: 7,958 sentences / Test: 1,246 sentences
    - Same data as Uchimoto et al. 98 and Kudo & Matsumoto 00
  - Large data set
    - 2-fold cross-validation using all 38,383 sentences
- Kernel function: 3rd-degree polynomial
- Evaluation measures
  - Dependency accuracy
  - Sentence accuracy
27. Results
Data set                     | Standard (8,000 sentences)          | Large (20,000 sentences)
Model                        | Probabilistic | Cascaded Chunking   | Probabilistic | Cascaded Chunking
Dependency accuracy (%)      | 89.09         | 89.29               | N/A           | 90.45
Sentence accuracy (%)        | 46.17         | 47.53               | N/A           | 53.16
# of training sentences      | 7,956         | 7,956               | 19,191        | 19,191
# of training examples       | 459,105       | 110,355             | 1,074,316     | 251,254
Training time (hours)        | 336           | 8                   | N/A           | 48
Parsing time (sec./sentence) | 2.1           | 0.5                 | N/A           | 0.7
28. Probabilistic vs. Cascaded Chunking Models
         | Probabilistic                                          | Cascaded Chunking
Strategy | Maximize the sentence probability                      | Deterministic shift-reduce
Merit    | Can see all candidates of dependency                   | Simple, efficient and scalable; as accurate as the probabilistic model
Demerit  | Inefficient; commits to unnecessary training examples  | Cannot see all the (posterior) candidates of dependency
29. Smoothing Effect (in the Cascaded Chunking Model)
- No need to cut off low-frequency words
Cut-off frequency | Dependency accuracy (%) | Sentence accuracy (%)
1                 | 89.3                    | 47.8
2                 | 88.7                    | 46.3
4                 | 87.8                    | 44.6
6                 | 87.6                    | 44.8
8                 | 87.4                    | 42.5
30. Combination of Features
- Polynomial kernels are used to take combinations of features into account (tested with a small corpus of 2,000 sentences); a small illustration follows the table
Degree of polynomial kernel | Dependency accuracy (%) | Sentence accuracy (%)
1                           | N/A                     | N/A
2                           | 86.9                    | 40.6
3                           | 87.7                    | 42.9
4                           | 87.7                    | 42.8
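As a tiny illustration (not from the slides) of why a polynomial kernel captures feature combinations: for binary feature vectors x and z, (x·z + 1)^d expands into a term for every conjunction of up to d features shared by x and z, so the SVM implicitly uses those conjunctions without enumerating them.

```python
import numpy as np

def poly_kernel(x, z, d=3):
    # (x.z + 1)^d: polynomial kernel of degree d
    return (np.dot(x, z) + 1) ** d

x = np.array([1, 0, 1, 1])        # toy binary feature vectors (made up)
z = np.array([1, 1, 1, 0])
print(poly_kernel(x, z, d=2))     # (2 + 1)^2 = 9
```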
31. Deterministic Dependency Parser Based on SVMs [Yamada & Matsumoto 03]
- Three possible actions
  - Right: for the two adjacent words, modification goes from the left word to the right word
  - Left: for the two adjacent words, modification goes from the right word to the left word
  - Shift: no action is taken for the pair, and the focus moves to the right
    - There are two possibilities in this situation:
      - there really is no modification relation between the pair, or
      - there actually is a modification relation between them, but it must wait until the surrounding analysis has been finished
    - The second situation can be treated as a separate class (called Wait)
- This process is applied to the input sentence from beginning to end and repeated until a single word remains (a sketch of the loop follows this slide)
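The control loop of this process can be sketched as follows (my reconstruction, not the authors' implementation); `predict` stands for the trained SVM classifier over the features of slide 35, and the oracle and 4-word sentence below are made-up illustrations:

```python
def ym_parse(words, predict):
    """words: word ids in sentence order; predict(left, right, nodes) returns
    'right', 'left' or 'shift'.  Returns {dependent: head}."""
    heads = {}
    nodes = list(words)
    while len(nodes) > 1:
        i, constructed = 0, False
        while i < len(nodes) - 1:
            action = predict(nodes[i], nodes[i + 1], nodes)
            if action == "right":               # left word modifies the right word
                heads[nodes[i]] = nodes[i + 1]
                del nodes[i]
                constructed = True
            elif action == "left":              # right word modifies the left word
                heads[nodes[i + 1]] = nodes[i]
                del nodes[i + 1]
                constructed = True
            else:                               # shift: move the focus to the right
                i += 1
        if not constructed:                     # nothing built in a whole pass:
            break                               # stop rather than loop forever
    return heads

# Toy oracle for a hypothetical 4-word sentence "I saw a girl"
# with gold tree I->saw, a->girl, girl->saw.
gold = {1: 2, 3: 4, 4: 2}
def oracle(left, right, nodes):
    def ready(w):                               # w has no unattached dependents left
        return not any(gold.get(d) == w for d in nodes if d != w)
    if gold.get(left) == right and ready(left):
        return "right"
    if gold.get(right) == left and ready(right):
        return "left"
    return "shift"

print(ym_parse([1, 2, 3, 4], oracle))           # {1: 2, 3: 4, 4: 2}
```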
32. Right Action
33. Left Action
34. Shift Action
35. The Features Used in Learning
An SVM is used for classification, either as a 3-class model (right, left, shift) or as a 4-class model (right, left, shift, wait).
36. SVM Learning of Actions
- The best action for each configuration is learned by SVMs
- Since this is a 3-class or 4-class classification problem, either the pairwise or the one-vs-rest method is employed (see the sketch below)
  - Pairwise method: for each pair of classes, learn an SVM; the best class is decided by voting over all the SVMs
  - One-vs-rest method: for each class, an SVM is learned to discriminate that class from the others; the best class is decided by the SVM that gives the highest value
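For reference, the two strategies can be sketched with off-the-shelf tools (this is not the lecture's code); the feature vectors X and action labels y below are made-up toy data, and scikit-learn's linear SVM stands in for the SVM learner:

```python
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]          # toy feature vectors
y = ["right", "left", "shift", "right"]                    # toy action labels

pairwise = OneVsOneClassifier(LinearSVC()).fit(X, y)       # one SVM per pair of classes, then voting
one_vs_rest = OneVsRestClassifier(LinearSVC()).fit(X, y)   # one SVM per class vs. the rest

print(pairwise.predict([[1, 0, 1]]))
print(one_vs_rest.predict([[1, 0, 1]]))
```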
37. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
(figure: the walkthrough parses the sentence "the boy hits the dog with a rod", showing the partial trees after each action)
Action: right
38. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: right
39. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
40. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: right
41. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
42. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
43. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: right
44. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: left
45. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
46. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: left
47. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: left
48. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
End of parsing: only "hits" remains.
49. The Accuracy of Parsing
(table: accuracies for dependency relation, root identification, and complete analysis)
- Learned from 30,000 English sentences
- no children: no child information is considered
- word, POS: only word/POS information is used
- all: all information is used
50. Deterministic Linear-Time Dependency Parser Based on Shift-Reduce Parsing [Nivre 03, 04]
- There are two data structures, a stack S and a queue Q
- Initialization: S = [w1], Q = [w2, w3, ..., wn]
- Termination: Q is empty
- Parsing actions
  - Shift: ⟨S, wi|Q⟩ ⇒ ⟨S|wi, Q⟩
  - Left-Arc: ⟨S|wi, wj|Q⟩ ⇒ ⟨S, wj|Q⟩ (wi becomes a dependent of wj)
  - Right-Arc: ⟨S|wi, wj|Q⟩ ⇒ ⟨S|wi|wj, Q⟩ (wj becomes a dependent of wi)
  - Reduce: ⟨S|wi|wj, Q⟩ ⇒ ⟨S|wi, Q⟩ (wj, whose head has already been found, is popped)
Though the original parser uses memory-based learning, recent implementations use SVMs to select actions (a sketch of the loop in code follows).
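A compact sketch of this arc-eager loop (my reconstruction, not Nivre's implementation); `choose_action` stands for the trained classifier and must return one of "shift", "left", "right", "reduce":

```python
def nivre_parse(words, choose_action):
    """words: word ids in sentence order.  Returns {dependent: head}."""
    # The stack starts empty here; the slide's initialization S = [w1] simply
    # corresponds to performing one shift before the loop.
    S, Q, heads = [], list(words), {}
    while Q:
        action = choose_action(S, Q, heads)
        if action == "left" and S and S[-1] not in heads:
            heads[S.pop()] = Q[0]        # Left-Arc: top of S depends on the front of Q
        elif action == "right" and S:
            heads[Q[0]] = S[-1]          # Right-Arc: front of Q depends on the top of S
            S.append(Q.pop(0))
        elif action == "reduce" and S and S[-1] in heads:
            S.pop()                      # Reduce: pop a word whose head is already found
        else:
            S.append(Q.pop(0))           # Shift (also the fallback when an action is illegal)
    return heads
```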
51. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
(figure: the walkthrough parses the sentence "the boy hits the dog with a rod", showing the stack S and the queue Q at each step)
Action: left
52. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
53. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: left
54. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
55. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
56. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: left
57. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: right
58. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: reduce
59. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: right
60. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
61. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: left
62. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: right
63. An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: terminate
64. Computational Costs
- Cascaded chunking (Kudo Model 2, Yamada): O(n²)
- Deterministic shift-reduce parsing (Nivre): O(n)
- CKY-based parsing (Eisner, Kudo Model 1): O(n³)
- Maximum spanning tree parsing (McDonald): O(n²)
65. McDonald's MST Parsing
- R. McDonald, F. Pereira, K. Ribarov, and J. Hajič, "Non-projective Dependency Parsing Using Spanning Tree Algorithms," in Proceedings of the Joint Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), 2005.
MST = Maximum Spanning Tree
66. Definition of Non-projective Dependency Trees
- Single head: except for the root, all words have a single parent
- Connected: the structure must be a connected tree
- Acyclic: if wi →* wj, then wj →* wi never holds
(figure taken from R. McDonald and F. Pereira, "Online Learning of Approximate Dependency Parsing Algorithms," EACL 2006)
67. Notation and Definitions
- T = {(x_t, y_t)}: the training data
- x_t: the t-th sentence in T
- y_t: the dependency structure of the t-th sentence (defined by a set of edges)
- f(i, j): the vector of feature functions for an edge between words i and j
- dt(x): the set of all dependency trees producible from the sentence x
68. The Basic Idea
The score function of a dependency tree is defined as the sum of the scores of its edges, each of which is a weighted sum of feature functions (see the formulas below).
(figure: an example dependency tree for "I saw a girl")
The target optimization problem: learn w so that the correct tree outscores every other candidate tree by a margin of at least L(y, y').
L(y, y') is defined as the number of words in y' whose parent differs from their parent in y.
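Written out in the standard edge-factored notation (a reconstruction of the formulas the slide refers to, using the symbols explained on the next slide), the idea is roughly:

```latex
% Edge scores and tree score (edge-factored model)
s(i,j) = \mathbf{w} \cdot \mathbf{f}(i,j), \qquad
s(\mathbf{x},\mathbf{y}) = \sum_{(i,j) \in \mathbf{y}} s(i,j)
                         = \sum_{(i,j) \in \mathbf{y}} \mathbf{w} \cdot \mathbf{f}(i,j)

% Target optimization problem (large-margin constraints over candidate trees)
\min \|\mathbf{w}\| \quad \text{s.t.} \quad
s(\mathbf{x}_t,\mathbf{y}_t) - s(\mathbf{x}_t,\mathbf{y}') \ge L(\mathbf{y}_t,\mathbf{y}')
\qquad \forall\, \mathbf{y}' \in dt(\mathbf{x}_t),\ \forall\, t
```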
69. Explanation of the Formulas
- s(i, j): the score of an edge connecting nodes i and j
- f(i, j): a feature vector representing information about words i and j and the relation between them
- w: the weight vector over the feature vectors, learned from the training examples so as to maximize the objective above
70. Online Learning Algorithm (MIRA)
- w(0) = 0; v = 0; i = 0
- for n = 1..N
-   for t = 1..T
-     w(i+1) = update w(i) with (x_t, y_t)
-     v = v + w(i+1)
-     i = i + 1
- w = v / (N * T)
N: predefined number of iterations; T: number of training examples
(a schematic rendering in code follows)
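The loop above can be rendered schematically as follows (assumptions mine); `mira_update` is a placeholder for the per-example MIRA step described on the next slide:

```python
import numpy as np

def train_averaged(examples, dim, n_iters, mira_update):
    w = np.zeros(dim)                    # current weight vector
    v = np.zeros(dim)                    # running sum of all intermediate weight vectors
    for _ in range(n_iters):             # n_iters corresponds to N on the slide
        for x, y in examples:            # examples corresponds to the T training pairs
            w = mira_update(w, x, y)     # make the gold tree y outscore its rivals for x
            v = v + w
    return v / (n_iters * len(examples)) # averaged weights, as in the last line of the slide
```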
71. Update of the Weight Vector
Since this formulation requires constraints over all possible trees, it is modified so that only the k-best trees are taken into consideration (k = 5 to 20).
72. Features Used in Training
- Unigram features: the word/POS tag of the parent and the word/POS tag of the child (when a word is longer than 5 characters, only its 5-character prefix is used; this applies to all the other features as well)
- Bigram features: the pair of word/POS tags of the parent and the child
- Trigram features: the POS tags of the parent and the child, plus the POS tag of a word appearing in between
- Context features: the POS tags of the parent and the child, plus the two POS tags before and after them (backed off to trigrams)
73. Settings of the Experiments
- The Penn Treebank is converted into dependency trees based on Yamada's head rules: training data (sections 02-21), development data (22), test data (23)
- POS tagging is done by Ratnaparkhi's MaxEnt tagger
- The k-best trees are constructed with Eisner's parsing algorithm
74. Dependency Parsing by the Maximum Spanning Tree Algorithm
- Learn the optimal w from the training data
- For a given sentence:
  - calculate the scores of all word pairs in the sentence
  - get the spanning tree with the largest score
- There is an algorithm (the Chu-Liu-Edmonds algorithm) that requires only O(n²) time to obtain the maximum spanning tree (n: the number of nodes, i.e., words)
75. The Image of Getting the Maximum Spanning Tree
(figure: scores for all pairs of words in "John saw Mary", where root is a dummy root node, and the maximum spanning tree extracted from them; a code sketch follows)
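A small decoding sketch (not McDonald's code): build a dense directed graph of edge scores and extract its maximum spanning arborescence. networkx's Edmonds implementation plays the role of the Chu-Liu-Edmonds step here, and the scores are made-up values for "John saw Mary":

```python
import networkx as nx

scores = {                                  # score(head -> dependent), hypothetical values
    ("root", "saw"): 10, ("root", "John"): 9, ("root", "Mary"): 9,
    ("saw", "John"): 30, ("saw", "Mary"): 30, ("John", "saw"): 20,
    ("Mary", "saw"): 0,  ("John", "Mary"): 3, ("Mary", "John"): 11,
}

G = nx.DiGraph()
for (head, dep), s in scores.items():
    G.add_edge(head, dep, weight=s)

tree = nx.maximum_spanning_arborescence(G)  # Edmonds / Chu-Liu-Edmonds step
print(sorted(tree.edges()))                 # [('root', 'saw'), ('saw', 'John'), ('saw', 'Mary')]
```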
76. Results of the Experiments
Model              | Accuracy | Root | Complete
Yamada & Matsumoto | 90.3     | 91.6 | 38.4
Nivre              | 87.3     | 84.3 | 30.4
Avg. Perceptron    | 90.6     | 94.0 | 36.5
MIRA               | 90.9     | 94.2 | 37.5
Accuracy: proportion of correct edges; Root: accuracy of root words; Complete: proportion of sentences analyzed completely correctly.
77. Available Dependency Parsers
- Maximum Spanning Tree Parser (McDonald)
  - http://ryanmcd.googlepages.com/MSTParser.html
- Data-driven Dependency Parser, MaltParser (Nivre)
  - http://w3.msi.vxu.se/~nivre/research/MaltParser.html
- CaboCha: SVM-based Japanese Dependency Parser (Kudo)
  - http://chasen.org/~taku/software/cabocha/
78. References
- Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto, "Deterministic Dependency Analyzer for Chinese," IJCNLP-04: The First International Joint Conference on Natural Language Processing, pp. 135-140, 2004.
- Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto, "Chinese Deterministic Dependency Analyzer: Examining Effects of Global Features and Root Node Finder," Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pp. 17-24, October 2005.
- Jason M. Eisner, "Three New Probabilistic Models for Dependency Parsing: An Exploration," COLING-96: The 16th International Conference on Computational Linguistics, pp. 340-345, 1996.
- Taku Kudo, Yuji Matsumoto, "Japanese Dependency Structure Analysis Based on Support Vector Machines," Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 18-25, October 2000.
- Taku Kudo, Yuji Matsumoto, "Japanese Dependency Analysis Using Cascaded Chunking," Proceedings of the 6th Conference on Natural Language Learning (CoNLL-02), pp. 63-69, 2002.
- Ryan McDonald, Koby Crammer, Fernando Pereira, "Online Large-Margin Training of Dependency Parsers," Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 91-98, 2005.
- Ryan McDonald, Fernando Pereira, Kiril Ribarov, Jan Hajič, "Non-Projective Dependency Parsing Using Spanning Tree Algorithms," HLT-EMNLP, 2005.
- Ryan McDonald, Fernando Pereira, "Online Learning of Approximate Dependency Parsing Algorithms," Proceedings of the European Chapter of the Association for Computational Linguistics, 2006.
- Joakim Nivre, "An Efficient Algorithm for Projective Dependency Parsing," Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pp. 149-160, 2003.
- Joakim Nivre, Mario Scholz, "Deterministic Dependency Parsing of English Text," COLING 2004: 20th International Conference on Computational Linguistics, pp. 64-70, 2004.
- Joakim Nivre, Jens Nilsson, "Pseudo-Projective Dependency Parsing," ACL-05: 43rd Annual Meeting of the Association for Computational Linguistics, pp. 99-106, 2005.
- Joakim Nivre, Inductive Dependency Parsing, Text, Speech and Language Technology Vol. 34, Springer, 2006.
- Hiroyasu Yamada, Yuji Matsumoto, "Statistical Dependency Analysis with Support Vector Machines," Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pp. 195-206, 2003.