Title: Morphological Disambiguation


1
Context-Based Morphological Disambiguation with Random Fields
  • Noah A. Smith, David A. Smith, Roy W. Tromble
  • Johns Hopkins University

2
The Cast
3
Morphological analysis
4
Morphological analysis
????
ta/IV2MS+tahim/IV
ta/IV3FS+tahim/IV
tthm/NOUN_PROP
tu/IV2MS+taham/IV_PASS
tu/IV3FS+taham/IV_PASS
??
?/VJ+?/ECS
?/VJ+?/EFN
?/VX+?/ECS
?/VX+?/EFN
??/ADV
klimatizovaná
Adj Neuter Plural Accusative / klimatizovaný
Adj Neuter Plural Vocative / klimatizovaný
Adj Feminine Singular Vocative / klimatizovaný
Adj Feminine Singular Nominative / klimatizovaný
Adj Neuter Plural Nominative / klimatizovaný
The analyzer might be encoded as a finite-state transducer.
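The idea of an analyzer as a mapping from one surface word to a set of candidate analyses can be sketched in a few lines. This is only a toy illustration: the dictionary `ANALYSES` and the function `analyze` are hypothetical names, the single entry is copied from the Czech example on the slide, and a real analyzer (for instance a finite-state transducer) would not be a lookup table.

```python
# Toy stand-in for a morphological analyzer: a plain dictionary lookup.
# The single entry is copied from the slide; a real analyzer covers a whole language.
ANALYSES = {
    "klimatizovaná": [
        ("Adj Neuter Plural Accusative", "klimatizovaný"),
        ("Adj Neuter Plural Vocative", "klimatizovaný"),
        ("Adj Feminine Singular Vocative", "klimatizovaný"),
        ("Adj Feminine Singular Nominative", "klimatizovaný"),
        ("Adj Neuter Plural Nominative", "klimatizovaný"),
    ],
}


def analyze(word):
    """Return every (tag, lemma) candidate for a word, or [] if it is unknown."""
    return ANALYSES.get(word.lower(), [])


for tag, lemma in analyze("Klimatizovaná"):
    print(f"{tag} / {lemma}")
```

The analyzer alone cannot choose among the five candidates; that choice is what disambiguation in context has to supply.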
6
Morphological disambiguation
????? ?? ??? ?? .
. ?????? ????? ??????? ?????? ???? ????? 1998
Klimatizovaná jídelna, světlá místnost pro snídaně.
7
Morphological disambiguation
Morpheme tokenization, POS tagging, lemmatization:
????? ?? ??? ?? .
. ?????? ????? ??????? ?????? ???? ????? 1998
POS tagging, lemmatization:
Klimatizovaná jídelna, světlá místnost pro snídaně.
8
Corpora and tools
9
Source model: unambiguous, tagged, tokenized morphemes; implicit word boundaries
Channel model: little noise (optionality)
Decoding
10
Channel Model
usually 0 or 1
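The "usually 0 or 1" remark has a useful consequence for decoding. A sketch of the reasoning, assuming the standard noisy-channel factorization (the symbols x for the observed word sequence and y for the sequence of tagged morphemes are our notation, not taken from the slide):

```latex
\hat{y} \;=\; \arg\max_{y} \; p_{\text{source}}(y)\, p_{\text{channel}}(x \mid y)
\;=\; \arg\max_{y \,:\, p_{\text{channel}}(x \mid y) = 1} \; p_{\text{source}}(y)
\qquad \text{when } p_{\text{channel}}(x \mid y) \in \{0, 1\}.
```

In that case the channel only filters out analyses that are incompatible with the surface words, and the source model does all of the ranking.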
11
Source Model
12
Estimation of the source model
?
Gaussian prior (0, σ²)
Prior work: (Kudo, Yamamoto, and Matsumoto 2004), (Smith and Smith 2004)
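The formula on this slide did not survive extraction. For orientation only, a common way to estimate a log-linear model with a zero-mean Gaussian prior is to maximize a penalized conditional log-likelihood. The form below is that textbook objective, written under the assumption that training is conditional on the observed words; the notation (θ for the weights, x⁽ⁱ⁾ and y⁽ⁱ⁾ for the i-th training sentence and its correct analysis) is ours and may differ from the slide.

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \; \sum_{i} \log p_{\theta}\!\left(y^{(i)} \mid x^{(i)}\right) \;-\; \frac{\lVert \theta \rVert^{2}}{2\sigma^{2}}
```

A smaller σ² pulls the weights more strongly toward zero; a larger σ² lets the model fit the training data more closely.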
13
Regarding vocabulary
  • Words unknown to the analyzer
      • are replaced with an OOV symbol,
      • receive all analyses associated with OOV training words,
      • have their morphemes replaced with OOV.

?? → OOV
?/VJ+?/EFN → OOV/VJ+OOV/EFN
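The three OOV rules above can be sketched as follows. This is a minimal illustration under stated assumptions: an analysis is represented as a list of (morpheme, tag) pairs, the analyzer is a plain dictionary, and the names `collect_oov_analyses` and `candidates` are hypothetical rather than taken from the authors' system.

```python
OOV = "OOV"


def collect_oov_analyses(training_pairs, analyzer):
    """Pool the analyses of training words the analyzer does not know.

    training_pairs: iterable of (word, analysis), where analysis is a list
    of (morpheme, tag) pairs. Every morpheme is replaced with the OOV symbol.
    """
    pool = set()
    for word, analysis in training_pairs:
        if word not in analyzer:
            pool.add(tuple((OOV, tag) for _, tag in analysis))
    return pool


def candidates(word, analyzer, oov_pool):
    """Known words keep the analyzer's output; unknown words get the whole pool."""
    if word in analyzer:
        return analyzer[word]
    return sorted(oov_pool)


# Made-up toy data: "zzz" is unknown to the analyzer, "known" is not.
analyzer = {"known": [[("kn", "VJ"), ("own", "ECS")]]}
training = [
    ("zzz", [("z", "VJ"), ("zz", "EFN")]),
    ("known", [("kn", "VJ"), ("own", "ECS")]),
]
pool = collect_oov_analyses(training, analyzer)
print(candidates("unseen", analyzer, pool))
```

The unseen word receives the single pooled analysis with tags VJ and EFN and both morphemes replaced by OOV, which mirrors the OOV/VJ, OOV/EFN example on the slide.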
14
Concatenative analyses
15
Source model features
Tag
Morpheme
16
Korean results
49K words training, 5K words test
???/NNC+?/PAD+?/PAU ??/DAN ??/NNC+?/PCA ?/VJ+?/EFN ./SFN
20
Korean results
49K words training, 5K words test
See paper for additional results.
Prior work: (Cha, Lee, and Lee 1998), (Smith and Smith 2004)
21
Arabic results
114K words training, 13K words test
See paper for additional results.
Prior work: (Habash and Rambow 2005)
22
Inflectional analyses
23
Source model features
t_{i-2}
t_{i-1}
Tag
Gender
Number
Case
POS
Lemma
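One way to picture features over the two previous tags and the separate fields is sketched below. The exact feature templates were in a figure that did not survive, so everything here is an assumption made for illustration: the field names in `FIELDS`, the choice of one trigram feature per field plus one over the full tag, the padding symbol, and the function name `field_trigram_features`. Lemma features are left out to keep the sketch short.

```python
# Hypothetical field names; the slide lists Gender, Number, Case, POS and Lemma.
FIELDS = ("pos", "gender", "number", "case")
# Filler used for positions before the start of the sentence.
PAD = {field: "<s>" for field in FIELDS}


def full_tag(tag):
    """Join all fields of one tag into a single string."""
    return "/".join(tag[field] for field in FIELDS)


def field_trigram_features(tags, i):
    """Return feature names for position i built from tags i-2, i-1 and i.

    tags: list of dicts mapping each field name to its value.
    """
    prev2 = tags[i - 2] if i >= 2 else PAD
    prev1 = tags[i - 1] if i >= 1 else PAD
    cur = tags[i]
    features = []
    for field in FIELDS:
        # One trigram feature per field, so each field can be modeled on its own.
        features.append(f"{field}:{prev2[field]},{prev1[field]},{cur[field]}")
    # One trigram feature over the full tag, i.e. all fields taken together.
    features.append(f"tag:{full_tag(prev2)},{full_tag(prev1)},{full_tag(cur)}")
    return features


tags = [
    {"pos": "Adj", "gender": "Feminine", "number": "Singular", "case": "Nominative"},
    {"pos": "Noun", "gender": "Feminine", "number": "Singular", "case": "Nominative"},
]
for name in field_trigram_features(tags, 1):
    print(name)
```

Splitting the tag into fields keeps each feature set small, at the cost of ignoring some interactions between fields unless a full-tag feature is also included.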
24
Separate training
25
Czech results
768K words training, 109K words test
See paper for additional results.
26
Suggested improvements
  • OOV handling
      • spelling features, e.g.
      • in the channel
  • Better models of fields
      • Case, e.g.
  • Classifier combination for factored models
      • e.g., logarithmic opinion pools (LOPs) (Smith, Cohn, and Osborne 2005)

27
Conclusion
  • Given a morphological analyzer, we can apply
    log-linear sequence models to produce a
    disambiguator.
  • Variations on our model allow multiple analysis
    paradigms.
  • Factoring speeds up training.
  • Too many Smiths!
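The first conclusion can be pictured as a search over the analyzer's output: each word contributes a column of candidate analyses, and a sequence model scores paths through those columns. The sketch below is a generic first-order Viterbi search, not the authors' implementation; the function `viterbi`, the `score` callback, and the toy weights are all hypothetical, and the model in the talk may condition on more context than the single previous analysis used here.

```python
def viterbi(lattice, score):
    """Pick one analysis per word so that the summed scores are maximal.

    lattice: list of non-empty lists; lattice[i] holds the candidates of word i.
    score(prev, cur): hypothetical log-linear score of `cur` following `prev`
                      (prev is None at the start of the sentence).
    """
    if not lattice:
        return []
    # best[i][k] = (best score of a path ending in lattice[i][k], backpointer)
    best = [[(score(None, cand), None) for cand in lattice[0]]]
    for i in range(1, len(lattice)):
        column = []
        for cand in lattice[i]:
            options = [
                (best[i - 1][j][0] + score(prev, cand), j)
                for j, prev in enumerate(lattice[i - 1])
            ]
            column.append(max(options))
        best.append(column)
    # Trace back from the best-scoring final candidate.
    k = max(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for i in range(len(lattice) - 1, -1, -1):
        path.append(lattice[i][k])
        k = best[i][k][1]
    return path[::-1]


# Made-up weights for a two-word toy lattice.
weights = {(None, "B"): 1.0, ("A", "C"): 5.0, ("B", "D"): 1.0}


def toy_score(prev, cur):
    return weights.get((prev, cur), 0.0)


print(viterbi([["A", "B"], ["C", "D"]], toy_score))
```

In the toy lattice the locally best first choice is "B", but the path through "A" wins overall, which is the kind of context effect a sequence model is meant to capture.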

28
Thanks!
  • Jan Hajič and Pavel Krbec