Title: Strictly Lexical Dependency Parsing
1Strictly Lexical Dependency Parsing
- Qin Iris Wang and Dale Schuurmans
- University of Alberta
- {wqin, dale}@cs.ualberta.ca
- Dekang Lin, Google, Inc., lindek@google.com
2Lexical Statistics In Parsing
- Lexical statistics are widely used in previous statistical parsers, such as
  - Collins (1996, 1997, 1999)
  - Charniak (2000)
- But they have been shown to be of little benefit
  - Gildea (2001)
  - Bikel (2004)
- Unlexicalized parsing
  - Klein and Manning (2003)
3Strictly Lexicalized Parsing
- A dependency parsing model
- All the parameters are based on word statistics
- No POS tags or grammatical categories needed
- Advantages
  - Makes the construction of treebanks easier
  - Especially beneficial for languages (such as Chinese) where POS tags are not as clearly defined as in English
4POS tags in Parsing
- All previous parsers use a POS lexicon
- Natural language data is sparse
  - Bikel (2004) found that, of all the needed bigram statistics, only 1.49% were observed in the treebank
- Part-of-speech tags
  - Words belonging to the same part of speech are expected to have the same syntactic behavior
5An Alternative Approach
- Distributional word similarities
  - Words tend to have similar meanings if they tend to appear in the same contexts
- Soft clusters of words
  - Computed automatically from a large corpus
  - Have not been used in parsing before
6Outline
- A probabilistic dependency parsing model
- Similarity-based smoothing
- Experimental results
- Related work and conclusions
7An Example Dependency Tree
- A dependency tree structure for the sentence: "The kid skipped school regularly."
- [Figure: dependency tree over the indexed words root (0), The (1), kid (2), skipped (3), school (4), regularly (5)]
8Probabilistic Dependency Parsing
- S: an input sentence; T: a candidate dependency tree; F(S): the set of possible dependency trees spanning S
- The goal of parsing: find the most probable tree in F(S) given S
- The tree T is constructed in steps G1, G2, ..., GN (N is the number of words in the sentence)
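The decomposition implied above can be written out as follows (a sketch of the standard stepwise factorization; the exact conditioning context of each step is spelled out on the later slides):

```latex
T^* = \operatorname*{argmax}_{T \in F(S)} P(T \mid S),
\qquad
P(T \mid S) = \prod_{i=1}^{N} P\bigl(G_i \mid S,\, G_1, \ldots, G_{i-1}\bigr)
```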
9Different Sequences of Steps May Lead to the Same
Dependency Tree
- [Figure: the same dependency tree for "The kid skipped school regularly." (root 0, The 1, kid 2, skipped 3, school 4, regularly 5) built by two different orderings of its link-creation steps, labeled (3, 5, 1, 2, 4) and (4, 5, 1, 2, 3)]
10Canonical Order of the Links
- Left to right
- Bottom-up
- Head outward
- Right attaching first
11What's Involved in Each Step?
- Each step involves four events, conditioned on the context
- [Figure: step G1 for the example sentence, showing the events created and the context they are conditioned on]
12What's Involved in Each Step?
- Example step: attaching "regularly" (5) to "skipped" (3)
- P( Link_R(skipped, regularly) | skipped, regularly, C^R_skipped = 1 )
- C^R_skipped = 1 is the number of modifiers "skipped" has already created
13What's Involved in Each Step?
- [Figure: step G4 for the example sentence, with its events and conditioning context]
14
- Suppose Gi corresponds to a dependency link (u, v, L)
- Maximum Likelihood estimates of P(event | context) from treebank counts → Sparse data!
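A minimal sketch of how such Maximum Likelihood estimates could be computed by counting (event, context) pairs; the input iterable and the exact context encoding are hypothetical stand-ins, not the authors' code:

```python
from collections import defaultdict

def mle_tables(event_context_pairs):
    """Count (event, context) pairs and return a P_MLE(event | context) lookup.

    event_context_pairs: iterable of (event, context) tuples, e.g.
        (("Link_R", "skipped", "regularly"), ("skipped", "regularly", 1))
    """
    joint = defaultdict(int)     # count(event, context)
    context = defaultdict(int)   # count(context)
    for e, c in event_context_pairs:
        joint[(e, c)] += 1
        context[c] += 1

    def p_mle(e, c):
        # Relative frequency; returns 0.0 for unseen contexts (the sparse-data problem).
        return joint[(e, c)] / context[c] if context[c] else 0.0

    return p_mle, context

# Usage (toy data):
pairs = [(("Link_R", "skipped", "regularly"), ("skipped", "regularly", 1)),
         (("Link_R", "skipped", "school"),    ("skipped", "school", 0))]
p_mle, context_counts = mle_tables(pairs)
print(p_mle(("Link_R", "skipped", "regularly"), ("skipped", "regularly", 1)))  # 1.0
```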
15Outline
- A probabilistic dependency parsing model
- Similarity-based smoothing
- Experimental results
- Related work and conclusion
16Similarity-based Smoothing
P( Link_R(skipped, regularly) | skipped, regularly, C^R_skipped = 1 )
17Similarity-based Smoothing
P( Link_R(skipped, regularly) | skipped, regularly, C^R_skipped = 1 )
- S(skipped): skipping 0.229951, skip 0.197991, skips 0.169982, sprinted 0.140535, bounced 0.139547, missed 0.134966, cruised 0.133933, scooted 0.13387, jogged 0.133638, wandered 0.132721
- S(regularly): frequently 0.365862, routinely 0.286178, periodically 0.273665, often 0.24077, constantly 0.234693, occasionally 0.226324, who 0.200348, continuously 0.194026, continually 0.177632, repeatedly 0.177434
18Similarity-based Smoothing
- (Same content as the previous slide: P( Link_R(skipped, regularly) | skipped, regularly, C^R_skipped = 1 ) together with the similarity lists S(skipped) and S(regularly))
19Similarity-based Smoothing
P( Link_R(skipped, regularly) | skipped, regularly, C^R_skipped = 1 )
- Similar contexts: (skip, frequently), (skip, routinely), (skip, repeatedly), (bounced, often), (bounced, who), (bounced, repeatedly)
20Similarity-based Smoothing
- S(C): the similar contexts of C
- P_SIM( Link_R(skipped, regularly) | C ) = Σ_{C' ∈ S(C)} sim(C, C') / norm(C) × P_MLE( Link_R(skipped, regularly) | C' )
- An event is more likely to occur after context C if it tends to occur after similar contexts of C
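A minimal sketch of the similarity-based estimate above, assuming `p_mle` from the earlier sketch, a `similar_contexts(ctx)` generator, and a context-level `sim(ctx, ctx2)` score (all hypothetical names); the weights are normalized over S(C):

```python
def p_sim(event, ctx, p_mle, similar_contexts, sim):
    """Similarity-based estimate: a weighted average of P_MLE over contexts similar to ctx."""
    neighbors = list(similar_contexts(ctx))          # S(C)
    norm = sum(sim(ctx, c2) for c2 in neighbors)     # normalizing constant over S(C)
    if norm == 0.0:
        return 0.0
    return sum(sim(ctx, c2) / norm * p_mle(event, c2) for c2 in neighbors)
```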
21Similarity-based Smoothing
- Finally, P(E | C) = α × P_MLE(E | C) + (1 − α) × P_SIM(E | C)
- α depends on |C|, the frequency count of the corresponding context C in the training data
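A sketch of the final interpolated estimate; the slide does not give the exact form of α, so the count-based schedule below (α = |C| / (|C| + 1)) is only an illustrative assumption:

```python
def p_smoothed(event, ctx, p_mle, p_sim_value, context_counts):
    """Interpolate the MLE and similarity-based estimates.

    alpha grows with |C|, the training-data frequency of context ctx, so frequent
    contexts rely mostly on P_MLE and rare ones on P_SIM.
    The schedule |C| / (|C| + 1) is an assumed placeholder, not the paper's exact formula.
    """
    count_c = context_counts.get(ctx, 0)       # |C|
    alpha = count_c / (count_c + 1.0)
    return alpha * p_mle(event, ctx) + (1.0 - alpha) * p_sim_value
```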
23Outline
- A probabilistic dependency parsing model
- Similarity-based smoothing
- Experimental results
- Related work and conclusion
24Experimental Setup
- Word similarities
  - Computed from the Chinese Gigaword corpus
- Chinese Treebank (CTB 3.0), same data split as Bikel (2004)
  - Training: Sections 1-270 and 400-931
  - Development: Sections 301-325
  - Testing: Sections 271-300
- Dependency trees
  - Converted from the constituency trees (Bikel, 2004)
25Experimental Results - 1
[Table: Evaluation results on Chinese Treebank (CTB) 3.0]
- Performance is highly correlated with the length of the sentences
26Comparison With an Unlexicalized Model
- For the unlexicalized model, the input to the parser is the sequence of POS tags
27Comparison With a Strictly Lexicalized Joint Model
- The joint model defines P(S, T) as a product over dependency links, where hi and mi are the head and the modifier of the i-th dependency link
28Comparison With a Strictly Less Lexicalized Conditional Model
- Only one word in a similar context of C may be different from the corresponding word in C
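A sketch of the restricted similar-context definition above, where at most one of the two words is replaced by a distributionally similar word; `similar_words` (the S(w) lists) and the context encoding are illustrative names:

```python
def restricted_similar_contexts(ctx, similar_words):
    """Similar contexts that differ from ctx = (u, v, count) in at most one word."""
    u, v, count = ctx
    for u2, _score in similar_words.get(u, []):
        yield (u2, v, count)          # replace only the first word
    for v2, _score in similar_words.get(v, []):
        yield (u, v2, count)          # replace only the second word
```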
29Experimental Results - 2
Performance of Alternative Models (sentence length < 40)
30Outline
- A probabilistic dependency parsing model
- Similarity-based smoothing
- Experimental results
- Related work and conclusion
31Related Work
- Maximize the joint probability
  - Collins (1997)
  - Charniak (2000)
- Maximize the conditional probability
  - Clark et al. (2002): CCG grammar
  - Ratnaparkhi (1999): maximize the probability at each step
- Dependency parsing models
  - Yamada and Matsumoto (2002)
  - Eisner (1996)
  - McDonald et al. (2005)
  - Klein and Manning (2004): the DMV model
- Parsing Chinese with the Penn Chinese Treebank
  - Bikel and Chiang (2000)
  - Levy and Manning (2003)
32Conclusions
- The first work on parsing without using part-of-speech tags
- The strictly lexicalized parser outperformed its unlexicalized counterpart
- Takes advantage of similarity-based smoothing, which had not been successfully applied to parsing before
33Questions?
Thanks!
34Notation
- E^L_w / E^R_w: no more modifiers on the left/right of a word w
- C^L_w / C^R_w: the current number of modifiers w has taken on its left/right
- (u, v, d): a dependency link with direction d; u and v are integers denoting the indices of the words (u < v)
- Link_R(u, v): a link from u to v
- Link_L(u, v): a link from v to u
35Assumptions
- E^d_w depends only on w and C^d_w
- Link_R(u, v) depends only on u, v, and C^R_u
- Link_L(u, v) depends only on u, v, and C^L_v
- Suppose the dependency link created in step i is (u, v, d)
  - If d = L, Gi is the conjunction of E^L_u, E^R_u, and Link_L(u, v)
  - If d = R, Gi is the conjunction of E^L_v, E^R_v, and Link_R(u, v)
36Feature Representation
- Represent a word w by a feature vector
- The features of w are the set of words that occur within a small context window of w in a large corpus
- The value of a feature w' is the pointwise mutual information PMI(w, w') = log( P(w, w') / (P(w) × P(w')) ), where P(w, w') is the probability that w and w' co-occur in a context window
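A minimal sketch of such distributional feature vectors, assuming co-occurrence counts collected from context windows over a large corpus; the PMI weighting follows the formula above, while the positive-PMI filter and the cosine comparison between vectors are illustrative assumptions (the slides do not state the exact word-similarity measure):

```python
import math
from collections import defaultdict

def pmi_vectors(cooc, marginal, total):
    """Build PMI-weighted feature vectors for each word.

    cooc[w][f]  : count of feature word f appearing in w's context windows
    marginal[w] : total co-occurrence events involving w
    total       : total number of co-occurrence events in the corpus
    """
    vectors = defaultdict(dict)
    for w, feats in cooc.items():
        for f, c in feats.items():
            p_wf = c / total
            p_w = marginal.get(w, 0) / total
            p_f = marginal.get(f, 0) / total
            if p_w == 0 or p_f == 0:
                continue
            pmi = math.log(p_wf / (p_w * p_f))
            if pmi > 0:                       # keep positively associated features (an assumed filter)
                vectors[w][f] = pmi
    return vectors

def cosine(v1, v2):
    """Cosine similarity between two sparse PMI vectors (an assumed similarity measure)."""
    dot = sum(x * v2.get(f, 0.0) for f, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```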
37Similarity-based Smoothing
- The parameters in our model consist of conditional probabilities P(E | C), where E is the binary variable Link_d(u, v) or E^d_w, and the context C is either (w, C^d_w) or (u, v, C^R_u) / (u, v, C^L_v)
38Similarity-based Smoothing
- In our model, the similar contexts of C are obtained by replacing the words in C with their distributionally similar words (the modifier-count component of C stays fixed)
- We compute the similarity between two contexts from the similarities between the corresponding words
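A sketch of generating similar contexts from the S(w) word lists shown earlier, with an assumed context-similarity score (the product of the two word similarities, which is an illustrative combination, not necessarily the paper's exact definition):

```python
def similar_contexts(ctx, similar_words):
    """Generate contexts similar to ctx = (u, v, count), with their scores.

    similar_words[w] is a list of (w', score) pairs, e.g. S(skipped), S(regularly)
    from the earlier slides. The context score is the product of the two word
    similarities -- an assumed combination.
    """
    u, v, count = ctx
    for u2, su in similar_words.get(u, []):
        for v2, sv in similar_words.get(v, []):
            yield (u2, v2, count), su * sv

# Usage with (truncated) similarity lists from the slides:
S = {"skipped":   [("skip", 0.197991), ("bounced", 0.139547)],
     "regularly": [("frequently", 0.365862), ("often", 0.24077), ("who", 0.200348)]}
for c, score in similar_contexts(("skipped", "regularly", 1), S):
    print(c, round(score, 4))
```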