Title: Learning Morphological Disambiguation Rules for Turkish
1. Learning Morphological Disambiguation Rules for Turkish
- Deniz Yuret
- Ferhan Türe
- Koç University, Istanbul
2. Overview
- Turkish morphology
- The morphological disambiguation task
- The Greedy Prepend Algorithm
- Training
- Evaluation
3. Turkish Morphology
- Turkish is an agglutinative language. Many syntactic phenomena expressed by function words and word order in English are expressed by morphology in Turkish.
- I will be able to go.
- (go) (able to) (will) (I)
- git ebil ecek im
- Gidebileceğim.
4. Fun with Turkish Morphology
Avrupalılaştıramadıklarımızdanmışsınız
- Avrupa Europe
- lı European
- laş become
- tır make
- ama not able to
- dık we were
- larımız those that
- dan from
- mış were
- sınız you
- (You are said to be one of those we could not Europeanize.)
5. So how long can words be?
- uyu sleep
- uyut make X sleep
- uyuttur have Y make X sleep
- uyutturt have Z have Y make X sleep
- uyutturttur have W have Z have Y make X sleep
- uyutturtturt have Q have W have Z have Y make X sleep
6. Morphological Analyzer for Turkish
- masalı
- masal+Noun+A3sg+Pnon+Acc (the story)
- masal+Noun+A3sg+P3sg+Nom (his story)
- masa+Noun+A3sg+Pnon+Nom^DB+Adj+With (with tables)
- Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing.
- Oflazer, K., Hakkani-Tür, D. Z., and Tür, G. (1999). Design for a Turkish treebank. EACL'99.
- Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications.
7. Features, IGs and Tags
masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
- 8 unique tags
- 11,084 distinct tags observed in a 1M-word training corpus
- 126 unique features
- 9,129 unique IGs
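To make the tag / IG / feature terminology concrete, here is a minimal Python sketch that splits a tag like the example above into its inflectional groups and features. The helper name and string handling are illustrative assumptions, not part of the original system.

# Minimal sketch: splitting a morphological tag into its inflectional groups
# (IGs) and features. Derivation boundaries (^DB) separate IGs; '+' separates
# the root and the features. Illustrative only.

def parse_tag(tag):
    igs = tag.split("^DB")                      # derivation boundaries separate IGs
    first = igs[0].split("+")
    root, first_features = first[0], first[1:]
    rest = [ig.lstrip("+").split("+") for ig in igs[1:]]
    return root, [first_features] + rest

root, igs = parse_tag("masa+Noun+A3sg+Pnon+Nom^DB+Adj+With")
print(root)   # masa
print(igs)    # [['Noun', 'A3sg', 'Pnon', 'Nom'], ['Adj', 'With']]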
8. Why not just do POS tagging?
[Figure from Oflazer (1999)]
9. Why not just do POS tagging?
- Inflectional groups can independently act as heads or modifiers in syntactic dependencies.
- Full morphological analysis is essential for further syntactic analysis.
10. Morphological disambiguation
- Ambiguity is rare in English
- lives: live+s or life+s
- More serious in Turkish
- 42.1% of the tokens are ambiguous
- 1.8 parses per token on average
- 3.8 parses for ambiguous tokens
11. Morphological disambiguation
- Task: pick the correct parse given the context
- masal+Noun+A3sg+Pnon+Acc
- masal+Noun+A3sg+P3sg+Nom
- masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
- Uzun masalı anlat (Tell the long story)
- Uzun masalı bitti (His long story ended)
- Uzun masalı oda (Room with a long table)
12. Morphological disambiguation
- Task: pick the correct parse given the context
- masal+Noun+A3sg+Pnon+Acc
- masal+Noun+A3sg+P3sg+Nom
- masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
- Key Idea
- Build a separate classifier for each feature.
13. Decision Lists
- Rule 1: If (W = çok) and (R1 = +DA) Then W has +Det
- Rule 2: If (L1 = pek) Then W has +Det
- Rule 3: If (W = +AzI) Then W does not have +Det
- Rule 4: If (W = çok) Then W does not have +Det
- Rule 5: If TRUE Then W has +Det
- pek çok alanda (matches Rule 1)
- pek çok insan (matches Rule 2)
- insan çok daha (matches Rule 4)
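As an illustration of how such a decision list is applied (rules are tried top to bottom and the first one whose conditions all hold determines the answer), here is a minimal Python sketch. The attribute encoding is an assumption made for illustration, not the authors' code.

# Minimal sketch: applying the +Det decision list above. An instance is a set of
# (position, attribute) pairs; a rule fires when its conditions are a subset of them.
det_rules = [
    ({("W", "çok"), ("R1", "+DA")}, True),   # Rule 1: W has +Det
    ({("L1", "pek")}, True),                 # Rule 2: W has +Det
    ({("W", "+AzI")}, False),                # Rule 3: W does not have +Det
    ({("W", "çok")}, False),                 # Rule 4: W does not have +Det
    (set(), True),                           # Rule 5: default (If TRUE)
]

def classify(instance, rules):
    """Return the class of the first rule whose conditions are all satisfied."""
    for conditions, label in rules:
        if conditions <= instance:
            return label

# "pek çok alanda": target word çok, left neighbor pek, right neighbor carries +DA
print(classify({("W", "çok"), ("L1", "pek"), ("R1", "+DA")}, det_rules))  # True (Rule 1)
# "insan çok daha": Rules 1-3 fail, Rule 4 fires
print(classify({("W", "çok"), ("L1", "insan"), ("R1", "daha")}, det_rules))  # False (Rule 4)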
14Greedy Prepend Algorithm
GPA(data) 1 dlist NIL 2 default-class
Most-Common-Class(data) 3 rule If TRUE Then
default-class 4 while Gain(rule, dlist, data) gt
0 5 do dlist prepend(rule, dlist) 6
rule Max-Gain-Rule(dlist, data) 7 return
dlist
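Below is a minimal runnable Python sketch of GPA under some simplifying assumptions of mine: instances are frozensets of attributes, candidate rules are formed by adding one attribute to a rule already in the list, and Gain is the change in the number of correctly classified training instances. It is one reading of the pseudocode above, not the authors' implementation.

# Minimal GPA sketch (assumptions noted above; illustrative only).
from collections import Counter

def classify(dlist, instance):
    for conditions, label in dlist:           # first matching rule wins
        if conditions <= instance:
            return label

def correct(dlist, data):
    return sum(classify(dlist, x) == y for x, y in data)

def gain(rule, dlist, data):
    return correct([rule] + dlist, data) - correct(dlist, data)

def max_gain_rule(dlist, data, labels):
    attributes = {a for x, _ in data for a in x}
    best, best_gain = None, float("-inf")
    for conditions, _ in dlist:               # specialize an existing rule by one attribute
        for attr in attributes - conditions:
            for label in labels:
                candidate = (conditions | {attr}, label)
                g = gain(candidate, dlist, data)
                if g > best_gain:
                    best, best_gain = candidate, g
    return best

def gpa(data):
    labels = {y for _, y in data}
    dlist = []                                            # dlist <- NIL
    default = Counter(y for _, y in data).most_common(1)[0][0]
    rule = (frozenset(), default)                         # If TRUE Then default-class
    while rule is not None and gain(rule, dlist, data) > 0:
        dlist = [rule] + dlist                            # prepend the max-gain rule
        rule = max_gain_rule(dlist, data, labels)
    return dlist

# Tiny example: learn when the target word should get +Det
data = [
    (frozenset({("W", "çok"), ("L1", "pek"), ("R1", "+DA")}), True),
    (frozenset({("W", "çok"), ("L1", "pek")}), True),
    (frozenset({("W", "çok"), ("R1", "daha")}), False),
]
print(gpa(data))  # learned list: If (R1, daha) then False; If TRUE then True (default)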
15. Training Data
- 1M words of news material
- Semi-automatically disambiguated
- Created 126 separate training sets, one for each feature
- Each training set only contains instances which have the corresponding feature in at least one of their parses
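A minimal sketch of how such per-feature training sets could be assembled. The data layout (each token given as its attributes, its candidate parses, and the correct parse, with parses represented as feature sets) is an assumption for illustration.

# Minimal sketch: one training set per feature. A token contributes an instance
# to feature f only if f occurs in at least one of its candidate parses; the
# label records whether the correct parse contains f. Data layout is assumed.
from collections import defaultdict

def build_training_sets(tokens):
    training_sets = defaultdict(list)
    for attributes, candidate_parses, correct_parse in tokens:
        for feature in set().union(*candidate_parses):
            training_sets[feature].append((attributes, feature in correct_parse))
    return training_sets

# Example token: "masalı" with its three candidate parses
token = (frozenset({("W", "masalı")}),
         [{"Noun", "A3sg", "Pnon", "Acc"},
          {"Noun", "A3sg", "P3sg", "Nom"},
          {"Noun", "A3sg", "Pnon", "Nom", "Adj", "With"}],
         {"Noun", "A3sg", "Pnon", "Acc"})          # correct parse
sets = build_training_sets([token])
print(sorted(sets))          # every feature seen in some parse gets this instance
print(sets["Acc"])           # [(frozenset({('W', 'masalı')}), True)]
print(sets["P3sg"])          # [(frozenset({('W', 'masalı')}), False)]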
16. Input attributes
- For a five-word window
- The exact word string (e.g. W=Ali'nin)
- The lowercase version (e.g. W=ali'nin)
- All suffixes (e.g. W=n, W=In, W=nIn, W='nIn, etc.)
- Character types (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST)
- Average 40 attributes per instance.
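A minimal sketch of this attribute extraction for a single word position. The attribute naming follows the examples above; the vowel-harmony generalization of suffixes (capitalized archiphonemes such as I and A) is omitted, and the code is illustrative rather than the authors' implementation.

# Minimal sketch: input attributes for one word position ("W"; the same is done
# for the neighbors L2, L1, R1, R2 in the five-word window). Suffixes are kept
# literal rather than generalized with archiphonemes.

def char_type(c, position):
    if c == "'":
        return "APOS-" + position
    if c.isupper():
        return "UPPER-" + position
    if c.islower():
        return "LOWER-" + position
    return "OTHER-" + position

def word_attributes(word, prefix="W"):
    attrs = {f"{prefix}={word}", f"{prefix}={word.lower()}"}   # exact and lowercase forms
    for i in range(1, len(word)):                              # all proper suffixes
        attrs.add(f"{prefix}={word.lower()[i:]}")
    if word:                                                   # character-type attributes
        attrs.add(f"{prefix}={char_type(word[0], 'FIRST')}")
        attrs.add(f"{prefix}={char_type(word[-1], 'LAST')}")
        for c in word[1:-1]:
            attrs.add(f"{prefix}={char_type(c, 'MID')}")
    return attrs

print(sorted(word_attributes("Ali'nin")))
# includes W=Ali'nin, W=ali'nin, W=n, W=in, W=nin, W='nin, ...,
# W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST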
17. Sample decision lists
Acc:
  0
  1  W=InI
  1  W=yI
  1  W=UPPER0
  1  W=IzI
  1  L1=bu
  1  W=onu
  1  R1=mAK
  1  W=beni
  0  W=günü
  1  W=InlArI
  1  W=onları
  0  W=olAyI
  0  W=sorunu
  (672 rules)
Prop:
  1
  0  W=STFIRST
  0  W=Türk
  1  W=STFIRST R1=UCFIRST
  0  L1=.
  0  W=AnAl
  1  R1=,
  0  W=yAD
  1  W=UPPER0
  0  W=lAD
  0  W=AK
  1  R1=UPPER
  0  W=Milli
  1  W=STFIRST R1=UPPER0
  (3476 rules)
18. Models for individual features
19. Combining models
- masal+Noun+A3sg+P3sg+Nom
- masal+Noun+A3sg+Pnon+Acc
- Decision list results and confidence (only distinguishing features necessary):
- P3sg: yes (89.53%)
- Nom: no (93.92%)
- Pnon: no (95.03%)
- Acc: yes (89.24%)
- score(P3sg+Nom) = 0.8953 × (1 - 0.9392)
- score(Pnon+Acc) = (1 - 0.9503) × 0.8924
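A minimal sketch of this scoring step, using the predictions and confidences shown above; the function name and data layout are illustrative assumptions.

# Minimal sketch: score each candidate parse by multiplying, over its
# distinguishing features, the classifier's confidence that the feature is
# present (or 1 - confidence when the classifier voted "absent").

# (prediction, confidence) per feature, from the example above
predictions = {"P3sg": (True, 0.8953), "Nom": (False, 0.9392),
               "Pnon": (False, 0.9503), "Acc": (True, 0.8924)}

def parse_score(features, predictions):
    score = 1.0
    for f in features:
        present, conf = predictions[f]
        score *= conf if present else (1 - conf)
    return score

# Only the distinguishing features of the two parses are needed here.
print(parse_score(["P3sg", "Nom"], predictions))   # 0.8953 * (1 - 0.9392) ≈ 0.0544
print(parse_score(["Pnon", "Acc"], predictions))   # (1 - 0.9503) * 0.8924 ≈ 0.0444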
20. Evaluation
- Test corpus: 1000 words, hand-tagged
- Accuracy: 95.87% (confidence interval 94.57-97.08)
- Better than the training data!?
21. Other Experiments
- Retraining on own output: 96.03%
- Training on unambiguous data: 82.57%
- Forget disambiguation, let's do tagging with a single decision list: 91.23%, 10,000 rules
22. Contributions
- Learning morphological disambiguation rules using the GPA decision list learner.
- Reducing data sparseness and increasing noise tolerance by using separate models for individual output features.
- ECOC, WSD, etc.