Title: Construct State Modification in the Arabic Treebank
1 Construct State Modification in the Arabic Treebank
- Ryan Gabbard and Seth Kulick
- University of Pennsylvania
2 Outline
- Construct State (iDAfa إضافة) in Arabic
- What it is
- The problem of attachment within an iDAfa
- A Machine Learning Approach
- Definition, Features, Results
- Conclusion and Future Work
3 Construct State (iDAfa)
- 2 words grouped tightly together
- Like English compound or possessive
- NOUN with NP complement (recursive)
(NP $awAriE "streets"
    (NP madiynap "city"
        (NP luwnog byt$ "Long Beach")))
شوارع مدينة لونغ بيتش
4 Construct State (iDAfa)
(NP $awAriE "streets"
    (NP (NP madiynap "city"
            (NP luwnog byt$ "Long Beach"))
        (PP fiy "in"
            (NP wilAyap "state"
                (NP kAliyfuwrniyA)))))
شوارع مدينة لونغ بيتش في ولاية كاليفورنيا
- (Multiple) Modification at any level
- Modifiers stacked up at end
- No clear pattern of attachment level
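The notion of an attachment "level" can be made concrete with a small sketch. The code below is not from the talk: it assumes a simplified bracketing with English glosses in place of the Arabic words, and the helper names `parse` and `idafa_levels` are hypothetical. It walks down the iDAfa spine and records, for each noun, the labels of any modifiers adjoined at that level.

```python
def parse(s):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def idafa_levels(tree):
    """Return [(head words, [modifier labels])], outermost level first.

    Assumes two node shapes only (a simplification for illustration):
      (NP word... (NP ...))      noun with NP complement
      (NP (NP ...) MOD ...)      NP with adjoined modifiers
    """
    levels = []
    pending = []  # modifiers adjoined to the NP we are about to enter
    node = tree
    while True:
        _, children = node
        subtrees = [c for c in children if isinstance(c, tuple)]
        words = [c for c in children if isinstance(c, str)]
        if words:
            levels.append((" ".join(words), pending))
            pending = []
            complements = [t for t in subtrees if t[0] == "NP"]
            if not complements:
                break
            node = complements[0]
        else:
            if not subtrees:
                break
            pending = pending + [t[0] for t in subtrees[1:]]
            node = subtrees[0]
    return levels


example = "(NP streets (NP (NP city (NP Long Beach)) (PP in (NP state (NP California)))))"
levels = idafa_levels(parse(example))
```

For the example tree, the PP is recorded at the middle level ("city") rather than at the innermost noun, which is the kind of non-base attachment the slide describes.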
5 Restriction on PP attachment in PTB
- Multiple PP modifiers at same level
Allowed:      (NP (NP ...) (PP ...) (PP ...))
Not allowed:  (NP (NP (NP ...) (PP ...)) (PP ...))
- Parser can learn that PPs attach to base (non-recursive) NPs (Collins, 1999)
- Not true for ATB, because of the iDAfa.
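To illustrate the base NP constraint, here is a minimal sketch (not the authors' code) that checks, for every NP with an adjoined PP, whether the modified NP is base, meaning it contains no other NP. Trees are written as nested `(label, children)` tuples, the Arabic example uses English glosses, and the function names are hypothetical.

```python
def contains_np(node):
    """True if any descendant of this node is an NP."""
    _, children = node
    return any(
        isinstance(c, tuple) and (c[0] == "NP" or contains_np(c))
        for c in children
    )


def is_base_np(node):
    """A base NP is an NP that dominates no other NP."""
    return node[0] == "NP" and not contains_np(node)


def pp_attachment_sites(node):
    """Yield (is_base, modified_np) for each NP that has a PP adjoined to it."""
    label, children = node
    subtrees = [c for c in children if isinstance(c, tuple)]
    if (
        label == "NP"
        and subtrees
        and subtrees[0][0] == "NP"
        and any(t[0] == "PP" for t in subtrees[1:])
    ):
        yield is_base_np(subtrees[0]), subtrees[0]
    for t in subtrees:
        yield from pp_attachment_sites(t)


# PTB-style English: both PPs adjoin to the base NP "the city".
english = ("NP", [
    ("NP", ["the", "city"]),
    ("PP", ["of", ("NP", ["Long", "Beach"])]),
    ("PP", ["in", ("NP", [("NP", ["the", "state"]),
                          ("PP", ["of", ("NP", ["California"])])])]),
])

# ATB-style iDAfa (glossed): the PP adjoins to "city (Long Beach)", a non-base NP.
arabic = ("NP", ["streets", ("NP", [
    ("NP", ["city", ("NP", ["Long", "Beach"])]),
    ("PP", ["in", ("NP", ["state", ("NP", ["California"])])]),
])])

english_sites = [is_base for is_base, _ in pp_attachment_sites(english)]
arabic_sites = [is_base for is_base, _ in pp_attachment_sites(arabic)]
```

In the English tree every PP attachment site is a base NP, while in the glossed iDAfa tree the single attachment site is not, so a parser that has learned "PPs attach to base NPs" has nothing to rely on there.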
6 Modification of non-base NPs
(NP $awAriE "streets"
    (NP (NP madiynap "city"
            (NP luwnog byt$ "Long Beach"))
        (PP fiy "in"
            (NP wilAyap "state"
                (NP kAliyfuwrniyA)))))
(NP (NP streets)
    (PP of
        (NP (NP the city)
            (PP of (NP Long Beach))
            (PP in
                (NP (NP the state)
                    (PP of California))))))
7 Problem Summary and Approach
- PP, ADJP attachment harder in ATB
- Cannot rely on base NP constraint
- PP attachment to a non-base NP is nearly non-existent in PTB
- It is the 16th most frequent dependency in ATB
- PP attachment worse for ATB (Kulick, Gabbard, Marcus, 2006)
- Treat attachment within iDAfa as a problem independent of the parser
8 The Task as a Machine Learning Problem
- Definition
- Instances are attachments
- Extract idafas and modifiers from corpus
- Labels are the level to attach at
- Constraint: no attachments crossing levels
- Technique
- MaxEnt model to label attachments
- Dynamic programming to enforce constraint
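The combination of a per-attachment classifier with a dynamic program can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: levels are numbered from 0 (outermost noun) inward, modifiers are taken in surface order, and "no crossing" is read as "a later modifier may not attach deeper than an earlier one". The scores stand in for log-probabilities from a MaxEnt model and are made up for the example; `best_noncrossing` is a hypothetical name.

```python
import math


def best_noncrossing(scores):
    """Find the highest-scoring non-crossing assignment of levels.

    scores[i][l] is the log-score for modifier i attaching at level l,
    where level 0 is the outermost noun. The returned list gives one
    level per modifier and is non-increasing in depth from left to right.
    """
    n_levels = len(scores[0])
    best = [list(scores[0])]  # best[i][l]: best total ending with modifier i at l
    back = []                 # backpointers for modifiers 1..n-1
    for i in range(1, len(scores)):
        row, ptr = [], []
        for l in range(n_levels):
            # The previous modifier must sit at the same level or deeper.
            prev = max(range(l, n_levels), key=lambda k: best[-1][k])
            row.append(scores[i][l] + best[-1][prev])
            ptr.append(prev)
        best.append(row)
        back.append(ptr)
    level = max(range(n_levels), key=lambda k: best[-1][k])
    path = [level]
    for ptr in reversed(back):
        level = ptr[level]
        path.append(level)
    return path[::-1]


# Two modifiers, three levels; illustrative probabilities only.
scores = [
    [math.log(0.1), math.log(0.6), math.log(0.3)],
    [math.log(0.2), math.log(0.1), math.log(0.7)],
]
independent = [max(range(3), key=lambda k: row[k]) for row in scores]
constrained = best_noncrossing(scores)
```

Labeling each modifier independently here picks levels 1 and then 2, which would cross under the assumed ordering; the dynamic program instead returns the best assignment that respects the constraint.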
9 Machine Learning Features
- Baseline: only level of attachment
- Non-Baseline Features
- AttSym: POS tag or nonterminal label of modifier
- Lex: (noun being modified, head word of modifier)
- TotDepth: (baseline, total depth of idafa, AttSym)
- Simple GenAgr: (AttSym, gender suffixes of the words corresponding to Lex)
- Full GenAgr: Simple GenAgr, also with number suffixes
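One way the feature templates above could be turned into classifier input is sketched below. The string encoding, the function name `extract_features`, and the choice to conjoin each feature with the candidate level are assumptions made for illustration. The agreement features (GenAgr) are left out because they need gender and number suffixes from a morphological analysis that is not shown here.

```python
def extract_features(nouns, att_sym, mod_head, level, use=()):
    """Build feature strings for attaching one modifier at a candidate level.

    nouns    -- nouns of the idafa, outermost first
    att_sym  -- POS tag or nonterminal label of the modifier
    mod_head -- head word of the modifier
    level    -- candidate attachment level (0 = outermost noun)
    use      -- which non-baseline templates to include
    """
    feats = [f"Base:level={level}"]
    if "AttSym" in use:
        feats.append(f"AttSym:level={level}|sym={att_sym}")
    if "Lex" in use:
        feats.append(f"Lex:level={level}|noun={nouns[level]}|head={mod_head}")
    if "TotDepth" in use:
        feats.append(f"TotDepth:level={level}|depth={len(nouns)}|sym={att_sym}")
    return feats


# Glossed example: "streets (of) city (of) Long Beach" with the PP "in ...".
nouns = ["streets", "city", "Long Beach"]
candidates = [
    extract_features(nouns, "PP", "in", level, use=("AttSym", "Lex", "TotDepth"))
    for level in range(len(nouns))
]
```

Each candidate level gets its own feature list, so a classifier can score all levels of one idafa and pass the scores on to a constraint-enforcing step.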
10 Machine Learning Results
Features                    Accuracy (%)
Base                        39.7
Base+AttSym                 76.1
Base+Lex                    58.4
Base+Lex+AttSym             79.9
Base+Lex+AttSym+TotDepth    78.7
Base+Lex+AttSym+GenAgr      79.3
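The differences between feature sets are easier to read as gains over the baseline. The short sketch below only re-expresses the accuracies from the table; the "+" in the keys marks a combination of feature sets, and the error-reduction column is an added convenience that was not reported in the talk.

```python
results = {
    "Base": 39.7,
    "Base+AttSym": 76.1,
    "Base+Lex": 58.4,
    "Base+Lex+AttSym": 79.9,
    "Base+Lex+AttSym+TotDepth": 78.7,
    "Base+Lex+AttSym+GenAgr": 79.3,
}


def compare_to_baseline(results, baseline="Base"):
    """Return {name: (absolute gain in points, relative error reduction in %)}."""
    base = results[baseline]
    rows = {}
    for name, acc in results.items():
        gain = round(acc - base, 1)
        error_reduction = round(100.0 * (acc - base) / (100.0 - base), 1)
        rows[name] = (gain, error_reduction)
    return rows


rows = compare_to_baseline(results)
best = max(results, key=results.get)
for name, (gain, reduction) in rows.items():
    print(f"{name:28s} +{gain:4.1f} points  {reduction:5.1f}% error reduction")
```

Read this way, AttSym accounts for most of the improvement, Lex adds a smaller amount on top of it, and neither TotDepth nor GenAgr improves on Base+Lex+AttSym in these numbers.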
11 Future Work
- For ML problem in this talk
- More feature investigation
- Improved analysis of subclasses of iDAfas.
- In context of real system
- Analysis of iDAfa and attachment accuracy in current parsing
- Get attachment problem out of parser
- Use current work as module after parsing