Title: Morphological Disambiguation
1Context-Based Morphological Disambiguation with Random Fields
- Noah A. Smith, David A. Smith, Roy W. Tromble
- Johns Hopkins University
2The Cast
4Morphological analysis
????
ta/IV2MS+tahim/IV
ta/IV3FS+tahim/IV
tthm/NOUN_PROP
tu/IV2MS+taham/IV_PASS
tu/IV3FS+taham/IV_PASS
??
?/VJ+?/ECS
?/VJ+?/EFN
?/VX+?/ECS
?/VX+?/EFN
??/ADV
The analyzer might be encoded as a finite-state transducer.
klimatizovaná
Adj Neuter Plural Accusative/klimatizovaný
Adj Neuter Plural Vocative/klimatizovaný
Adj Feminine Singular Vocative/klimatizovaný
Adj Feminine Singular Nominative/klimatizovaný
Adj Neuter Plural Nominative/klimatizovaný
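The analyzer's role here is to map a surface word to every analysis it licenses. As a rough sketch, a plain lookup table can stand in for the finite-state transducer. The table `LEXICON` and the function `analyze` are illustrative names, not part of the talk; only the Czech entries come from the slide.

```python
from typing import Dict, List, Tuple

# An analysis is a (tag, lemma) pair, as in the Czech example on the slide.
Analysis = Tuple[str, str]

# Hypothetical toy lexicon; a real analyzer would be a finite-state transducer.
LEXICON: Dict[str, List[Analysis]] = {
    "klimatizovaná": [
        ("Adj Neuter Plural Accusative", "klimatizovaný"),
        ("Adj Neuter Plural Vocative", "klimatizovaný"),
        ("Adj Feminine Singular Vocative", "klimatizovaný"),
        ("Adj Feminine Singular Nominative", "klimatizovaný"),
        ("Adj Neuter Plural Nominative", "klimatizovaný"),
    ],
}


def analyze(word: str) -> List[Analysis]:
    """Return every analysis the toy lexicon licenses for `word` (empty if unknown)."""
    return list(LEXICON.get(word, []))


if __name__ == "__main__":
    for tag, lemma in analyze("klimatizovaná"):
        print(f"{tag}/{lemma}")
```

Disambiguation then amounts to choosing one entry from each word's list using the surrounding context.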
6Morphological disambiguation
????? ?? ??? ?? .
. ?????? ????? ??????? ?????? ???? ????? 1998
Klimatizovaná jídelna, světlá místnost pro snídaně.
7Morphological disambiguation
Morpheme tokenization, POS tagging, lemmatization
????? ?? ??? ?? .
. ?????? ????? ??????? ?????? ???? ????? 1998
POS tagging, lemmatization
Klimatizovaná jídelna, světlá místnost pro snídaně.
8Corpora and tools
9Source model: unambiguous, tagged, tokenized morphemes with implicit word boundaries
Channel model: little noise (optionality)
Decoding
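If the channel is nearly deterministic, decoding reduces to picking the best-scoring path through the lattice of analyses the analyzer allows for each word. The sketch below assumes a first-order (bigram) scoring function and plain Viterbi search. The function name `viterbi`, the `score` callback, and the start symbol are hypothetical, and the talk's actual model may condition on more context.

```python
from typing import Callable, List, Sequence, Tuple


def viterbi(
    lattice: Sequence[Sequence[str]],
    score: Callable[[str, str], float],
    start: str = "<s>",
) -> Tuple[List[str], float]:
    """Find the highest-scoring path through a lattice of candidate analyses.

    lattice[i] lists the (non-empty) candidate analyses for word i;
    score(prev, cur) is an assumed log-score for cur following prev.
    """
    if not lattice:
        return [], 0.0
    # best[i][k] = (score of best path ending in lattice[i][k], backpointer)
    best = [[(score(start, a), -1) for a in lattice[0]]]
    for i in range(1, len(lattice)):
        column = []
        for cur in lattice[i]:
            options = [
                (best[i - 1][k][0] + score(prev, cur), k)
                for k, prev in enumerate(lattice[i - 1])
            ]
            column.append(max(options))
        best.append(column)
    # Pick the best final state, then follow backpointers to the start.
    k = max(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path: List[str] = []
    for i in range(len(lattice) - 1, -1, -1):
        path.append(lattice[i][k])
        k = best[i][k][1]
    path.reverse()
    return path, total
```

Restricting each column to the analyzer's output is what a 0-or-1 channel amounts to in this sketch: analyses the analyzer does not license never enter the search.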
10Channel Model
usually 0 or 1
11Source Model
12Estimation of the source model
Gaussian prior (0, σ²)
Prior work: (Kudo, Yamamoto, and Matsumoto 2004), (Smith and Smith 2004)
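To make the role of the Gaussian prior concrete, here is a minimal sketch of a penalized log-likelihood for a log-linear model. It assumes, for simplicity, that each word's candidate analyses are scored independently (the talk's model is a sequence model) and that the prior has mean 0 and variance σ² on every weight. The function names and the feature-dictionary representation are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Features = Dict[str, float]


def log_linear_probs(candidates: Sequence[Features], weights: Dict[str, float]) -> List[float]:
    """Normalize exp(weights . features) over the candidate analyses."""
    scores = [
        sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for feats in candidates
    ]
    top = max(scores)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return [math.exp(s - log_z) for s in scores]


def penalized_log_likelihood(
    data: Sequence[Tuple[Sequence[Features], int]],
    weights: Dict[str, float],
    sigma2: float,
) -> float:
    """Log-likelihood of the gold analyses minus the Gaussian-prior penalty.

    Each item in `data` is (candidate feature vectors, index of the gold candidate).
    """
    log_lik = sum(
        math.log(log_linear_probs(candidates, weights)[gold])
        for candidates, gold in data
    )
    penalty = sum(w * w for w in weights.values()) / (2.0 * sigma2)
    return log_lik - penalty
```

A smaller σ² pulls the weights harder toward zero; training would maximize this objective with a gradient-based optimizer.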
13Regarding vocabulary
- Words unknown to the analyzer
  - are replaced with an OOV symbol,
  - receive all analyses associated with OOV training words,
  - have their morphemes replaced with OOV.
Word: ?? → OOV
Analysis: ?/VJ+?/EFN → OOV/VJ+OOV/EFN
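The OOV treatment described above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: an analysis is assumed to be a tuple of (morpheme, tag) pairs, and the helper names `oov_analysis`, `collect_oov_analyses`, and `analyses_for` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

OOV = "OOV"
Morph = Tuple[str, str]        # (morpheme, tag)
Analysis = Tuple[Morph, ...]   # one analysis = a sequence of tagged morphemes


def oov_analysis(analysis: Analysis) -> Analysis:
    """Replace every morpheme with the OOV symbol, keeping the tags."""
    return tuple((OOV, tag) for _, tag in analysis)


def collect_oov_analyses(
    training: Iterable[Tuple[str, Analysis]], known: Set[str]
) -> Set[Analysis]:
    """Pool the OOV-ified gold analyses of training words the analyzer does not know."""
    return {oov_analysis(gold) for word, gold in training if word not in known}


def analyses_for(
    word: str, lexicon: Dict[str, List[Analysis]], oov_pool: Set[Analysis]
) -> Tuple[str, List[Analysis]]:
    """Return the (possibly OOV-replaced) word and its candidate analyses."""
    if word in lexicon:
        return word, list(lexicon[word])
    return OOV, sorted(oov_pool)
```

Under these assumptions, an unknown test word ends up with the same candidate set as every other unknown word, so only context can separate the candidates.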
14Concatenative analyses
15Source model features
Tag
Morpheme
16Korean results
49K words training, 5K words test
???/NNC+?/PAD+?/PAU ??/DAN ??/NNC+?/PCA ?/VJ+?/EFN ./SFN
20Korean results
49K words training, 5K words test
See paper for additional results.
Prior work: (Cha, Lee, and Lee 1998), (Smith and Smith 2004)
21Arabic results
114K words training, 13K words test
See paper for additional results.
Prior work: (Habash and Rambow 2005)
22Inflectional analyses
23Source model features
t_{i-2}
t_{i-1}
Tag
Gender
Number
Case
POS
Lemma
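One way to read this slide is that the source model combines n-gram features over full tags with n-gram features over individual fields, plus lemma features. The sketch below is only an assumed set of templates (a full-tag trigram, one trigram per field, and a lemma unigram). The exact templates used in the talk are not recoverable from the slide, and the names `FIELDS`, `BOS`, and `features_at` are hypothetical.

```python
from collections import Counter
from typing import Dict, List

FIELDS = ("POS", "Gender", "Number", "Case")
BOS = "<s>"  # padding symbol for positions before the sentence start


def features_at(tags: List[Dict[str, str]], lemmas: List[str], i: int) -> Counter:
    """Count the assumed feature templates that fire at position i."""

    def field(j: int, name: str) -> str:
        return tags[j].get(name, "-") if j >= 0 else BOS

    def full(j: int) -> str:
        return "|".join(field(j, f) for f in FIELDS) if j >= 0 else BOS

    feats: Counter = Counter()
    # Trigram over complete tags: t_{i-2}, t_{i-1}, t_i.
    feats[f"tag3={full(i - 2)},{full(i - 1)},{full(i)}"] += 1
    # One trigram per field, so that e.g. Case can be modeled on its own.
    for name in FIELDS:
        feats[f"{name}3={field(i - 2, name)},{field(i - 1, name)},{field(i, name)}"] += 1
    # Lemma unigram.
    feats[f"lemma={lemmas[i]}"] += 1
    return feats
```

Per-field features fire far more often than full-tag features, which is one motivation for factoring a large inflectional tag set.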
24Separate training
25Czech results
768K words training, 109K words test
See paper for additional results.
26Suggested improvements
- OOV handling
  - spelling features, e.g.
  - in the channel
- Better models of fields
  - Case, e.g.
- Classifier combination for factored models
  - e.g. logarithmic opinion pools (LOPs) (Smith, Cohn, and Osborne 2005)
27Conclusion
- Given a morphological analyzer, we can apply log-linear sequence models to produce a disambiguator.
- Variations on our model allow multiple analysis paradigms.
- Factoring speeds up training.
- Too many Smiths!
28Thanks!
- Jan Hajic and Pavel Krbec