Dependency Parsing: Machine Learning Approaches (PowerPoint presentation transcript)
Transcript and Presenter's Notes

Title: Dependency Parsing: Machine Learning Approaches


1
Dependency Parsing: Machine Learning Approaches
January 7, 2008
  • Yuji Matsumoto
  • Graduate School of Information Science
  • Nara Institute of Science and Technology
  • (NAIST, Japan)

2
Basic Language Analyses (POS-tagging, phrase
chunking, parsing)
Raw sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Base phrase chunking
Base phrase-chunked sentence
[He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only 1.8 billion]NP [in]PP [September]NP .
Dependency parsing
Dependency parsed sentence
3
Word Dependency Parsing (unlabeled)
Raw sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Word dependency parsed sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September .
4
Word Dependency Parsing (labeled)
Raw sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Word dependency parsed sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September .
5
A phrase structure tree and a dependency tree
(tree figures omitted in this transcript)
6
Flattened representation of a dependency tree
(figure omitted in this transcript)
7
Dependency structure terminology
(figure: an example arc labeled SUBJ between "This" and "is")
  • Child / Dependent / Modifier
  • Parent / Governor / Head
  • The direction of arrows may be drawn from the head to the child
  • When there is an arrow from w to v, we write w → v.
  • When there is a path (a series of arrows) from w to v, we write w →* v.

8
Definition of Dependency Trees
  • Single head: except for the root (EOS), every word has exactly one parent
  • Connected: the structure must be a connected tree
  • Acyclic: if wi →* wj, then wj → wi never holds
  • Projective: if wi → wj, then for every k between i and j, either wk →* wi or wk →* wj holds (no crossing between dependencies)
  • (a small sketch of checking these properties follows below)
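A minimal sketch of these well-formedness checks, not taken from the slides: it assumes a tree is given as a Python list heads where heads[i] is the parent index of word i and 0 stands for the root (the list representation itself enforces the single-head condition).

```python
def has_cycle(heads):
    """Acyclicity check: following parent links from every word must reach the root (0)."""
    n = len(heads) - 1                      # heads[0] is a dummy entry for the root
    for start in range(1, n + 1):
        node, steps = start, 0
        while node != 0:
            node = heads[node]
            steps += 1
            if steps > n:                   # more than n steps means we are in a loop
                return True
    return False


def is_projective(heads):
    """Projectivity check: for every arc (head, dep), each word strictly between
    them must be (transitively) dominated by the head, i.e. no crossing arcs."""
    n = len(heads) - 1

    def dominates(ancestor, node):
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node]
        return ancestor == 0                # the root dominates everything

    for dep in range(1, n + 1):
        head = heads[dep]
        lo, hi = min(head, dep), max(head, dep)
        if any(not dominates(head, k) for k in range(lo + 1, hi)):
            return False
    return True


# Example: a 4-word sentence whose root is word 2.
heads = [0, 2, 0, 2, 3]
print(has_cycle(heads), is_projective(heads))   # False True
```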

9
Projective dependency tree
(tree figure omitted in this transcript)
Projectiveness: all the words in between ultimately depend on either "was" or "." (e.g., light →* was)
10
Non-projective dependency tree
(tree figure omitted in this transcript)
Direction of edges: from a child to its parent
11
Non-projective dependency tree
(tree figure omitted in this transcript)
taken from R. McDonald and F. Pereira, Online
Learning of Approximate Dependency Parsing
Algorithms, European Chapter of Association
for Computational Linguistics, 2006.
Direction of edges from a parent to the children
12
Two Different Strategies for Structured Language
Analysis
  • Sentences have structure
  • Linear sequences: POS tagging, phrase/named entity chunking
  • Tree structures: phrase structure trees, dependency trees
  • Two statistical approaches to structure analysis
  • Global optimization
  • E.g., Hidden Markov Models and Conditional Random Fields for sequential tagging problems
  • Probabilistic context-free parsing
  • Maximum spanning tree parsing (graph-based)
  • Repetition of local optimization
  • Chunking with Support Vector Machines
  • Deterministic parsing (transition-based)

13
Statistical dependency parsers
  • Eisner (COLING 96, Penn Technical Report 96)
  • Kudo & Matsumoto (VLC 00, CoNLL 02)
  • Yamada & Matsumoto (IWPT 03)
  • Nivre (IWPT 03, COLING 04, ACL 05)
  • Cheng, Asahara & Matsumoto (IJCNLP 04)
  • McDonald, Crammer & Pereira (ACL 05a, EMNLP 05b, EACL 06)

(Figure legend: the parsers are grouped into "global optimization" vs. "repetition of local optimization".)
14
Dependency Parsing as a CoNLL Shared Task
  • CoNLL (Conference on Natural Language Learning)
  • Multilingual Dependency Parsing Track
  • 10 languages: Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian, Turkish
  • Domain Adaptation Track
  • Dependency-annotated data in one domain and large unannotated data in other domains (biomedical/chemical abstracts, parent-child dialogue) are available.
  • Objective: use large-scale unannotated target data to enhance a dependency parser learned in the original domain so that it works well in the new domain.

Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., Yuret, D., "The CoNLL 2007 Shared Task on Dependency Parsing," Proceedings of EMNLP-CoNLL 2007, pp.915-932, June 2007.
15
Statistical dependency parsers (to be introduced in this lecture)
  • Kudo & Matsumoto (VLC 00, CoNLL 02): Japanese
  • Yamada & Matsumoto (IWPT 03)
  • Nivre (IWPT 03, COLING 04, ACL 05)
  • McDonald, Crammer & Pereira (EMNLP 05a, ACL 05b, EACL 06)

All of them except Nivre 05 and McDonald 05a assume projective dependency parsing.
16
Japanese Syntactic Dependency Analysis
  • Analysis of the relationships between phrasal units (bunsetsu segments)
  • Two constraints:
  • Each segment modifies one of the segments to its right (Japanese is a head-final language)
  • Dependencies do not cross one another (projectiveness)

17
An Example of Japanese Syntactic Dependency
Analysis
18
Model 1: Probabilistic Model
Kudo & Matsumoto 00
Input (4 bunsetsu segments; the Japanese text was lost in this transcript), glossed: I-TOP / with her / to Kyoto-LOC / go
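The model's formula did not survive extraction. A hedged reconstruction, following the usual formulation of this kind of probabilistic dependency model (each segment pair is scored independently and the structure score is the product of pairwise dependency probabilities):

```latex
% D: a dependency structure over the segment sequence S = s_1 ... s_n
% f_{ij}: features of the segment pair (s_i, s_j)
P(D \mid S) \;\approx\; \prod_{(i,j) \in D} P\bigl(\mathrm{dep}(i,j) \mid f_{ij}\bigr),
\qquad
\hat{D} \;=\; \operatorname*{arg\,max}_{D} P(D \mid S)
```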
19
Problems of the Probabilistic Model (1)
  • Selection of training examples
  • All pairs of segments in a sentence are used:
  • depending pairs → positive examples
  • non-depending pairs → negative examples
  • This produces a total of n(n-1)/2 training examples per sentence (n is the number of segments in the sentence)
  • In Model 1:
  • all positive and negative examples are used to train an SVM
  • a test example is given to the SVM, and its distance from the separating hyperplane is transformed into a pseudo-probability using the sigmoid function (see the sketch below)
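A minimal sketch of that last step; the scaling parameters a and b are placeholders (the slides do not give the exact sigmoid, so this is only illustrative):

```python
import math

def margin_to_pseudo_probability(distance, a=1.0, b=0.0):
    """Map a signed distance from the SVM separating hyperplane to a
    pseudo-probability with a sigmoid (a and b would be tuned on held-out data)."""
    return 1.0 / (1.0 + math.exp(-(a * distance + b)))

print(margin_to_pseudo_probability(1.5))    # ~0.82: the pair likely depends
print(margin_to_pseudo_probability(-2.0))   # ~0.12: the pair likely does not
```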

20
Problems of the Probabilistic Model (2)
  • The number of training examples is large
  • O(n³) time is necessary for complete parsing
  • The classification cost of an SVM is much higher than that of other ML algorithms such as Maximum Entropy models and decision trees

21
Model 2: Cascaded Chunking Model
Kudo & Matsumoto 02
  • Parse a sentence deterministically, deciding only whether the current segment modifies the segment immediately to its right (a sketch of the loop follows below)
  • Training examples are extracted using the same parsing algorithm
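A simplified sketch of the cascaded chunking loop, assuming a function classify(left, right, remaining) that stands in for the SVM (it returns True for the "D" tag, i.e. left modifies right); the features and removal conditions of the real parser are a little more involved:

```python
def cascaded_chunking_parse(segments, classify):
    """Return a dict mapping each segment to its head (the last segment has none).

    segments: the bunsetsu segments of one sentence, left to right
    classify(left, right, remaining): True if `left` is judged to modify `right`
        ("D" tag), False otherwise ("O" tag); an SVM in the actual parser.
    Heads are always to the right (Japanese is head-final)."""
    heads = {}
    remaining = list(segments)
    while len(remaining) > 1:
        kept, decided = [], False
        for i, seg in enumerate(remaining[:-1]):
            if classify(seg, remaining[i + 1], remaining):
                heads[seg] = remaining[i + 1]   # dependency decided: drop `seg`
                decided = True
            else:
                kept.append(seg)                # keep undecided segments for the next pass
        kept.append(remaining[-1])              # the rightmost segment never modifies anything
        remaining = kept
        if not decided:                         # guard against a classifier that never says "D"
            break
    return heads


# Toy usage: every segment modifies its immediate right neighbour.
print(cascaded_chunking_parse(["s1", "s2", "s3"], lambda l, r, ctx: True))
# {'s1': 's2', 's2': 's3'}
```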

22
Example Training Phase
Annotated sentence (5 bunsetsu segments; the Japanese text was lost in this transcript), glossed: he / her / warm / heart / was-moved ("He was moved by her warm heart.")
(figure: the dependency decisions made while parsing the annotated sentence)
Pairs of a tag (D or O) and the context (features) are stored as training data for the SVMs
Training data
23
Example Test Phase
Test sentence (5 bunsetsu segments; the Japanese text was lost in this transcript), glossed: he / her / warm / heart / was-moved ("He was moved by her warm heart.")
(figure: the parsing process on the test sentence)
Each tag is decided by the SVMs built in the training phase
24
Advantages of Cascaded Chunking model
  • Efficiency
  • O(n³) (probabilistic model) vs. O(n²) (cascaded chunking model)
  • In practice lower than O(n²), since most segments modify the segment immediately to their right
  • The number of training examples is much smaller
  • Independence from the ML method
  • Can be combined with any ML algorithm that works as a binary classifier
  • Probabilities of dependency are not necessary

25
Features used in implementation
Modify or not?
Example sentence (6 bunsetsu segments; the Japanese text was lost in this transcript), glossed: his / friend-TOP / this book-ACC / have / lady-ACC / be-looking-for
(the figure marks one segment as the modifier and another as the head)
"His friend is looking for a lady who has this book."
  • Static features
  • of the modifier / modifiee segment: head/functional word surface form, POS, POS subcategory, inflection type, inflection form, brackets, quotations, punctuation
  • between the segments: distance, case particles, brackets, quotations, punctuation
  • Dynamic features
  • A, B: static features of the functional word
  • C: static features of the head word

26
Settings of Experiments
  • Kyoto University Corpus 2.0/3.0
  • Standard data set
  • Training: 7,958 sentences / Test: 1,246 sentences
  • Same data as Uchimoto et al. 98 and Kudo & Matsumoto 00
  • Large data set
  • 2-fold cross-validation using all 38,383 sentences
  • Kernel function: 3rd-degree polynomial
  • Evaluation measures
  • Dependency accuracy
  • Sentence accuracy

27
Results
Data set                    Standard (8,000 sentences)                Large (20,000 sentences)
Model                       Probabilistic   Cascaded Chunking         Probabilistic   Cascaded Chunking
Dependency acc. (%)         89.09           89.29                     N/A             90.45
Sentence acc. (%)           46.17           47.53                     N/A             53.16
# of training sentences     7,956           7,956                     19,191          19,191
# of training examples      459,105         110,355                   1,074,316       251,254
Training time (hours)       336             8                         N/A             48
Parsing time (sec./sent.)   2.1             0.5                       N/A             0.7
28
Probabilistic vs. Cascaded Chunking Models

            Probabilistic                                          Cascaded Chunking
Strategy    Maximize the sentence probability                      Shift-reduce, deterministic
Merit       Can see all candidates of a dependency                 Simple, efficient and scalable; as accurate as the probabilistic model
Demerit     Inefficient; commits to unnecessary training examples  Cannot see all the (posterior) candidates of a dependency
29
Smoothing Effect (in cascade model)
  • No need to cut off low-frequency words
Cut-off frequency   Dependency accuracy (%)   Sentence accuracy (%)
1                   89.3                      47.8
2                   88.7                      46.3
4                   87.8                      44.6
6                   87.6                      44.8
8                   87.4                      42.5
30
Combination of features
  • Polynomial kernels are used to take combinations of features into account (tested with a small corpus of 2,000 sentences)

Degree of polynomial kernel   Dependency accuracy (%)   Sentence accuracy (%)
1                             N/A                       N/A
2                             86.9                      40.6
3                             87.7                      42.9
4                             87.7                      42.8
31
Deterministic Dependency Parser based on SVM
Yamada & Matsumoto 03
  • Three possible actions:
  • Right: for the two adjacent words in focus, modification goes from the left word to the right word
  • Left: for the two adjacent words in focus, modification goes from the right word to the left word
  • Shift: no action is taken for the pair, and the focus moves to the right
  • There are two possibilities in this situation:
  • there is really no modification relation between the pair
  • there is actually a modification relation between them, but it has to wait until the surrounding analysis is finished
  • The second situation can be put into a separate class (called Wait)
  • Run this process over the input sentence from beginning to end, and repeat it until a single word remains (see the sketch after this list)
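A simplified sketch of that loop, assuming a function choose_action(left, right, nodes) that stands in for the multi-class SVM (only the 3-action variant is shown, and details such as the exact focus movement after a construction are glossed over):

```python
def yamada_matsumoto_parse(words, choose_action):
    """Return dependency arcs as (head, dependent) pairs.

    words: the tokens of one sentence
    choose_action(left, right, nodes): "right", "left" or "shift" for the two
        adjacent subtree roots currently in focus (an SVM in the real parser)."""
    nodes = list(range(len(words)))          # indices of the current subtree roots
    arcs = []
    while len(nodes) > 1:
        i, constructed = 0, False
        while i < len(nodes) - 1:
            left, right = nodes[i], nodes[i + 1]
            action = choose_action(left, right, nodes)
            if action == "right":            # left word modifies the right word
                arcs.append((right, left))
                del nodes[i]
                constructed = True
            elif action == "left":           # right word modifies the left word
                arcs.append((left, right))
                del nodes[i + 1]
                constructed = True
            else:                            # shift: move the focus one position right
                i += 1
        if not constructed:                  # a full pass with no construction: stop
            break
    return arcs
```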

32
Right action
33
Left action
34
Shift action
35
The features used in learning
(feature table omitted in this transcript)
An SVM is used for the classification, either as a 3-class model (right, left, shift) or as a 4-class model (right, left, shift, wait)
36
SVM Learning of Actions
  • The best action for each configuration is learned by SVMs
  • Since this is a 3-class or 4-class classification problem, either the pairwise or the one-vs-rest method is employed (see the sketch below)
  • Pairwise method: for each pair of classes, learn an SVM; the best class is decided by voting over all the SVMs
  • One-vs-rest method: for each class, an SVM is learned to discriminate that class from the others; the best class is decided by the SVM that gives the highest value

37
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: right
(figure: parser state; words shown: with, a, boy, the, hits, the, dog, rod)
38
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: right
(figure: parser state; words shown: with, a, hits, the, dog, rod)
39
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
(figure: parser state; words shown: with, a, hits, the, dog, rod)
40
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: right
(figure: parser state; words shown: with, a, hits, the, dog, rod)
41
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
(figure: parser state; words shown: with, a, hits, dog, rod, the)
42
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
(figure: parser state; words shown: with, a, hits, dog, rod, the)
43
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: right
(figure: parser state; words shown: with, a, hits, dog, rod, the)
44
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: left
(figure: parser state; words shown: with, hits, rod, a)
45
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: shift
(figure: parser state; words shown: with, hits, rod, a)
46
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: left
(figure: parser state; words shown: with, hits)
47
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Action: left
(figure: parser state; word shown: hits)
48
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
End of parsing
(figure: final tree rooted at "hits")
49
The Accuracy of Parsing
Accuracies for dependency relations, root identification, and complete analyses (the results table was lost in this transcript)
  • Trained on 30,000 English sentences
  • "no children": no child information is considered
  • "word, POS": only word/POS information is used
  • "all": all information is used

50
Deterministic linear-time dependency parser based on shift-reduce parsing
Nivre 03, 04
  • The parser uses a stack S and an input queue Q
  • Initialization: S = [w1], Q = [w2, w3, ..., wn]
  • Termination: Q = [] (the input queue is empty)
  • Parsing actions (wi is the top of S, wj is the front of Q):
  • Shift: (S, wj|Q) ⇒ (S|wj, Q)
  • Left-Arc: (S|wi, wj|Q) ⇒ (S, wj|Q), adding an arc that makes wj the head of wi
  • Right-Arc: (S|wi, wj|Q) ⇒ (S|wi|wj, Q), adding an arc that makes wi the head of wj
  • Reduce: (S|wi, Q) ⇒ (S, Q), provided wi already has a head

Though the original parser uses memory-based learning, recent implementations use SVMs to select the actions (a sketch follows below).
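A compact sketch of these four transitions, with a function choose standing in for the classifier; here the stack starts empty, so the first move is effectively the Shift that the slide's initialization already performs:

```python
def nivre_parse(words, choose):
    """Return dependency arcs as (head, dependent) pairs.

    words: tokens w1..wn (1-indexed internally; 0 is reserved for the root)
    choose(stack, queue, arcs): one of "shift", "left-arc", "right-arc", "reduce"
        (an SVM over the current configuration in recent implementations)."""
    stack, queue, arcs = [], list(range(1, len(words) + 1)), []
    has_head = set()
    while queue:
        action = choose(stack, queue, arcs)
        if action == "left-arc" and stack and stack[-1] not in has_head:
            dep = stack.pop()
            arcs.append((queue[0], dep))     # the queue front becomes the head
            has_head.add(dep)
        elif action == "right-arc" and stack:
            dep = queue.pop(0)
            arcs.append((stack[-1], dep))    # the stack top becomes the head
            has_head.add(dep)
            stack.append(dep)
        elif action == "reduce" and stack and stack[-1] in has_head:
            stack.pop()
        else:                                # "shift", or a fallback for an illegal action
            stack.append(queue.pop(0))
    return arcs
```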
51
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: left
(figure: stack S and queue Q; words shown: boy, hits, the, dog, with, a, rod, the)
52
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
(figure: stack S and queue Q; words shown: hits, the, dog, with, a, rod)
53
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: left
(figure: stack S and queue Q; words shown: hits, the, dog, with, a, rod)
54
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
(figure: stack S and queue Q; words shown: the, dog, with, a, rod)
55
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
(figure: stack S and queue Q; words shown: the, dog, with, a, rod)
56
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: left
(figure: stack S and queue Q; words shown: the, dog, with, a, rod)
57
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: right
(figure: stack S and queue Q; words shown: with, a, rod)
58
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: reduce
(figure: stack S and queue Q; words shown: with, a, rod)
59
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: right
(figure: stack S and queue Q; words shown: with, a, rod)
60
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: shift
(figure: stack S and queue Q; words shown: with, a, rod)
61
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: left
(figure: stack S and queue Q; words shown: with, a, rod)
62
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Action: right
(figure: stack S and queue Q; word shown: with)
63
An Example of Deterministic Dependency Parsing (Nivre's Algorithm)
Terminate
(figure: stack S and queue Q; word shown: with)
64
Computational Costs
  • Cascaded chunking (Kudo 2, Yamada): O(n²)
  • Deterministic shift-reduce parsing (Nivre): O(n)
  • CKY-based parsing (Eisner, Kudo 1): O(n³)
  • Maximum spanning tree parsing (McDonald): O(n²)

65
McDonald's MST Parsing
  • R. McDonald, F. Pereira, K. Ribarov, and J. Hajič, "Non-projective Dependency Parsing Using Spanning Tree Algorithms," in Proceedings of the Joint Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), 2005.

MST: Maximum Spanning Tree
66
Definition of Non-projective Dependency Trees
  • Single head: except for the root, every word has exactly one parent
  • Connected: the structure must be a connected tree
  • Acyclic: if wi →* wj, then wj → wi never holds

taken from R. McDonald and F. Pereira, Online
Learning of Approximate Dependency Parsing
Algorithms, European Chapter of Association
for Computational Linguistics, 2006.
67
Notation and definitions (the mathematical symbols on this slide were lost in extraction; standard notation is restored below)
  • T = {(xt, yt)}: the training data
  • xt: the t-th sentence in T
  • yt: the dependency structure for the t-th sentence (defined by a set of edges)
  • f(i, j): the vector of feature functions for an edge (i, j)
  • dt(x): the set of all dependency trees producible from x
68
The Basic Idea
The score of a dependency tree is defined as the sum of the scores of its edges, and each edge score is a weighted sum of feature functions (a reconstruction of the formulas follows below).
(figure: an example dependency tree over "I saw a girl")
The target optimization problem: learn weights that separate the correct tree from every other tree by a margin given by its loss.
L(y, y') is defined as the number of words in y' that have a different parent than in y.
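The equations themselves did not survive extraction; a hedged reconstruction following McDonald, Crammer and Pereira (ACL 05):

```latex
% Tree score: sum of edge scores, each a weighted sum of edge features
s(x, y) \;=\; \sum_{(i,j)\in y} s(i,j) \;=\; \sum_{(i,j)\in y} \mathbf{w}\cdot\mathbf{f}(i,j)

% Parsing: pick the highest-scoring tree among all trees dt(x) for the sentence x
\hat{y} \;=\; \operatorname*{arg\,max}_{y \in dt(x)} \; s(x, y)

% Learning (large-margin): the correct tree y_t must outscore every other tree y'
% by at least the loss L(y_t, y')
s(x_t, y_t) \;-\; s(x_t, y') \;\ge\; L(y_t, y') \qquad \forall\, y' \in dt(x_t)
```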
69
Explanation of formulas
  • s(i, j): the score of the edge connecting nodes i and j
  • f(i, j): the feature vector representing information about the words i and j and the relation between them
  • w: the weight vector over the feature vectors, learned from the training examples
70
Online learning algorithm (MIRA)
  1. w(0) = 0; v = 0; i = 0
  2. for n = 1..N
  3.   for t = 1..T
  4.     w(i+1) = update of w(i) with (xt, yt)
  5.     v = v + w(i+1)
  6.     i = i + 1
  7. w = v / (N·T)

N: the predefined number of iterations; T: the number of training examples
71
Update of the weight vector
Since this formulation requires enumerating all possible trees, it is modified so that only the k-best trees are taken into consideration (small k, e.g. 5-20).
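The update formula itself was lost; a hedged reconstruction of the MIRA step used in line 4 of the algorithm above:

```latex
% Keep the new weights as close as possible to the old ones while separating
% the correct tree from each of the k best-scoring trees by its loss
\mathbf{w}^{(i+1)} \;=\; \operatorname*{arg\,min}_{\mathbf{w}} \bigl\lVert \mathbf{w} - \mathbf{w}^{(i)} \bigr\rVert
\quad \text{s.t.} \quad
s(x_t, y_t) - s(x_t, y') \;\ge\; L(y_t, y')
\quad \forall\, y' \in \mathrm{best}_k\bigl(x_t; \mathbf{w}^{(i)}\bigr)
```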
72
Features used in training
  • Unigram features: word/POS tag of the parent and word/POS tag of the child (when a word is longer than 5 letters, only its 5-letter prefix is used; this applies to all the other features as well)
  • Bigram features: the pair of word/POS tags of the parent and the child
  • Trigram features: POS tags of the parent and the child, plus the POS tag of a word appearing in between
  • Context features: POS tags of the parent and the child, plus the two POS tags before and after them (backed off to trigrams)

73
Settings of Experiments
  • The Penn Treebank is converted into dependency trees based on Yamada's head rules: training data (chapters 02-21), development data (22), test data (23)
  • POS tagging is done by Ratnaparkhi's MaxEnt tagger
  • The k-best trees are constructed with Eisner's parsing algorithm

74
Dependency Parsing by the Maximum Spanning Tree Algorithm
  • Learn the optimal w from the training data
  • For a given sentence:
  • calculate the scores of all word pairs in the sentence
  • get the spanning tree with the largest score (see the sketch after this list)
  • There is an algorithm (the Chu-Liu-Edmonds algorithm) that needs only O(n²) time to obtain the maximum spanning tree (n = the number of nodes, i.e. words)
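A sketch of the decoding step under these assumptions: feature_vec(h, d) builds f(h, d) and w is the learned weight vector (both hypothetical names). Only the first, greedy stage of Chu-Liu-Edmonds is shown; the full algorithm additionally contracts any cycles and repeats, staying within O(n²) overall:

```python
import numpy as np

def mst_decode(words, feature_vec, w):
    """Score every (head, dependent) pair with w . f(h, d) and greedily let each
    word pick its best head; index 0 plays the role of the dummy root node."""
    n = len(words)
    scores = np.full((n + 1, n + 1), -np.inf)
    for h in range(n + 1):
        for d in range(1, n + 1):
            if h != d:
                scores[h, d] = w @ feature_vec(h, d)    # s(h, d) = w . f(h, d)

    # Greedy first stage of Chu-Liu-Edmonds: if the result has no cycles it is
    # already the maximum spanning tree; otherwise cycles must be contracted.
    heads = [0] + [int(np.argmax(scores[:, d])) for d in range(1, n + 1)]
    return heads, scores
```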

75
The Image of Getting the Maximum Spanning Tree
(figure: "John saw Mary" with scores for all pairs of words; "root" is the dummy root node)
The maximum spanning tree
76
Results of experiments
Model                Accuracy   Root   Comp.
Yamada & Matsumoto   90.3       91.6   38.4
Nivre                87.3       84.3   30.4
Avg. Perceptron      90.6       94.0   36.5
MIRA                 90.9       94.2   37.5

Accuracy: proportion of correct edges; Root: accuracy on root words; Comp.: proportion of sentences analyzed completely correctly
77
Available Dependency Parsers
  • Maximum Spanning Tree Parser (McDonald)
  • http://ryanmcd.googlepages.com/MSTParser.html
  • Data-driven Dependency Parser, MaltParser (Nivre)
  • http://w3.msi.vxu.se/~nivre/research/MaltParser.html
  • CaboCha: SVM-based Japanese Dependency Parser (Kudo)
  • http://chasen.org/~taku/software/cabocha/

78
  • References
  • Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto, "Deterministic Dependency Analyzer for Chinese," IJCNLP-04: The First International Joint Conference on Natural Language Processing, pp.135-140, 2004.
  • Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto, "Chinese Deterministic Dependency Analyzer: Examining Effects of Global Features and Root Node Finder," Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pp.17-24, October 2005.
  • Jason M. Eisner, "Three New Probabilistic Models for Dependency Parsing: An Exploration," COLING-96: The 16th International Conference on Computational Linguistics, pp.340-345, 1996.
  • Taku Kudo, Yuji Matsumoto, "Japanese Dependency Structure Analysis Based on Support Vector Machines," Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp.18-25, October 2000.
  • Taku Kudo, Yuji Matsumoto, "Japanese Dependency Analysis Using Cascaded Chunking," Proceedings of the 6th Conference on Natural Language Learning (CoNLL-02), pp.63-69, 2002.
  • Ryan McDonald, Koby Crammer, Fernando Pereira, "Online Large-Margin Training of Dependency Parsers," Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp.91-98, 2005.
  • Ryan McDonald, Fernando Pereira, Kiril Ribarov, Jan Hajič, "Non-Projective Dependency Parsing Using Spanning Tree Algorithms," Proceedings of HLT-EMNLP, 2005.
  • Ryan McDonald, Fernando Pereira, "Online Learning of Approximate Dependency Parsing Algorithms," Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006.
  • Joakim Nivre, "An Efficient Algorithm for Projective Dependency Parsing," Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pp.149-160, 2003.
  • Joakim Nivre, Mario Scholz, "Deterministic Dependency Parsing of English Text," COLING 2004: 20th International Conference on Computational Linguistics, pp.64-70, 2004.
  • Joakim Nivre, Jens Nilsson, "Pseudo-Projective Dependency Parsing," ACL-05: 43rd Annual Meeting of the Association for Computational Linguistics, pp.99-106, 2005.
  • Joakim Nivre, Inductive Dependency Parsing, Text, Speech and Language Technology, Vol. 34, Springer, 2006.
  • Hiroyasu Yamada, Yuji Matsumoto, "Statistical Dependency Analysis with Support Vector Machines," Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pp.195-206, 2003.