Title: Applying Conditional Random Fields to Japanese Morphological Analysis
1Applying Conditional Random Fields to Japanese Morphological Analysis
- Taku Kudo 1, Kaoru Yamamoto 2, Yuji Matsumoto 1
- 1 Nara Institute of Science and Technology
- 2 CREST, Tokyo Institute of Technology
- Currently at NTT Communication Science Labs.
2Background
- Conditional Random Fields Lafferty 01
- A variant of Markov Random Fields
- Many applications
- POS tagging Lafferty 01, shallow parsing Sha 03, NE recognition McCallum 03, IE Pinto 03, Peng 04
- Japanese Morphological Analysis
- Must cope with word segmentation
- Must incorporate many features
- Must minimize the influence of the length bias
3Japanese Morphological Analysis
INPUT: 東京都に住む (I live in the Tokyo Metropolis.)
OUTPUT: 東京 / 都 / に / 住む
東京 (Tokyo) NOUN-PROPER-LOC-GENERAL
都 (Metro.) NOUN-SUFFIX-LOC
に (in) PARTICLE-GENERAL
住む (live) VERB, BASE-FORM
- word segmentation (no explicit spaces in Japanese)
- POS tagging
- lemmatization, stemming
4Simple approach for JMA
- Character-based begin/inside tagging (see the sketch below)
- a non-standard method in JMA
- cannot directly reflect lexicons
- over 90% accuracy can be achieved using naive longest-prefix matching with a lexicon
- decoding is slow
東 京 / 都 / に / 住 む
B I B B B I
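A minimal sketch (not from the paper) of how a word segmentation maps to the character-level B/I tags shown above:

```python
def to_bi_tags(words):
    # one "B" for each word's first character, "I" for the rest
    tags = []
    for w in words:
        tags.append("B")
        tags.extend("I" * (len(w) - 1))
    return tags

print(to_bi_tags(["東京", "都", "に", "住む"]))  # ['B', 'I', 'B', 'B', 'B', 'I']
```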
5Our approach for JMA
- Assume that a lexicon is available
- word lattice
- represents all candidate outputs
- reduces redundant outputs
- Unknown word processing
- invoked when no matching word can be found in the lexicon
- uses character types
- e.g., Chinese characters (kanji), hiragana, katakana, numbers, etc.
6Problem Setting
Lexicon: に (particle, verb), 東 (noun), 京 (noun), 京都 (noun), 東京 (noun), ...
Input: 東京都に住む (I live in the Tokyo Metropolis)
Lattice: all candidate tokens spanning the input, between BOS and EOS:
BOS / 東 (east) noun / 東京 (Tokyo) noun / 京 (capital) noun / 京都 (Kyoto) noun / 都 (Metro.) suffix / に (in) particle / に (resemble) verb / 住む (live) verb / EOS
NOTE: the number of tokens |Y| varies
(A toy sketch of building such a lattice from a lexicon follows.)
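A toy sketch of building the lattice by common-prefix lookup; the lexicon below is illustrative only, and the real system additionally generates unknown-word candidates from character types:

```python
# toy lexicon; a real system would use JUMAN/IPADIC entries plus
# character-type-based unknown-word candidates
LEXICON = {
    "東": ["noun"], "京": ["noun"], "東京": ["noun"], "京都": ["noun"],
    "都": ["suffix"], "に": ["particle", "verb"], "住む": ["verb"],
}

def build_lattice(sentence, max_word_len=4):
    # lattice[i] lists every lexicon word starting at character position i
    lattice = [[] for _ in range(len(sentence))]
    for i in range(len(sentence)):
        for j in range(i + 1, min(i + max_word_len, len(sentence)) + 1):
            surface = sentence[i:j]
            for pos in LEXICON.get(surface, []):
                lattice[i].append((surface, pos, j))  # (word, POS, end position)
    return lattice

for i, candidates in enumerate(build_lattice("東京都に住む")):
    print(i, candidates)
```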
7Long-standing Problems in JMA
8Complex tagset
京都 (Kyoto): Noun - Proper - Loc - General, base form 京都
- Hierarchical tagset
- HMMs cannot capture them
- How to select the hidden classes?
- TOP level → lack of granularity
- Bottom level → data sparseness
- Some functional particles should be lexicalized
- Semi-automatic hidden class selections
- Asahara 00
9Complex tagset, cont.
- Must capture a variety of features
京都 (Kyoto): noun - proper - loc - general, base form 京都
に (in): particle - general, base form に
住む (live): verb - independent, base form 住む, inflection: base-form
overlapping features
POS hierarchy
character types
prefixes, suffixes
lexicalization
inflections
These features are important to JMA (a small extraction sketch follows)
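A hypothetical sketch (feature names are invented for illustration) of extracting the unigram features listed above from a single lattice token:

```python
def unigram_features(surface, pos_path, base, inflection):
    feats = []
    # POS hierarchy: emit every prefix of the hierarchical tag
    for depth in range(1, len(pos_path) + 1):
        feats.append("pos=" + "-".join(pos_path[:depth]))
    feats.append("base=" + base)              # lexicalization
    feats.append("inflection=" + inflection)  # inflection (conjugation) form
    feats.append("prefix1=" + surface[0])     # character-level clues
    feats.append("suffix1=" + surface[-1])
    return feats

print(unigram_features("住む", ["verb", "independent"], "住む", "base-form"))
# ['pos=verb', 'pos=verb-independent', 'base=住む', 'inflection=base-form', 'prefix1=住', 'suffix1=む']
```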
10JMA with MEMMs Uchimoto 00-03
- Use a discriminative model, e.g., a maximum entropy model, to capture a variety of features
- sequential application of ME models
11Problems of MEMMs
(Figure: a toy lattice. BOS branches to A with probability 0.6 or B with probability 0.4; A leads to C (0.4) or D (0.6); B leads to E (1.0); C, D and E each lead to EOS with probability 1.0.)
P(A, D | x) = 0.6 * 0.6 * 1.0 = 0.36
P(B, E | x) = 0.4 * 1.0 * 1.0 = 0.4
P(A, D | x) < P(B, E | x)
→ paths with low entropy are preferred (label bias)
12Problems of MEMMs in JMA
(Figure: the same toy lattice, but B is now a single long word covering the span of both A and D: BOS → A (0.6) or B (0.4); A → C (0.4) or D (0.6); B, C and D each → EOS (1.0).)
P(A, D | x) = 0.6 * 0.6 * 1.0 = 0.36
P(B | x) = 0.4 * 1.0 = 0.4
P(A, D | x) < P(B | x)
→ long words are preferred: this length bias has been ignored in JMA!
(The toy numbers are worked through in the sketch below.)
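The toy numbers from the two slides, worked through explicitly; because each MEMM step is normalized locally, the single long word wins:

```python
# toy numbers from the slides above: local normalization at every step
# lets the one-token path B outscore the two-token path A -> D
p_two_words = 0.6 * 0.6 * 1.0   # BOS->A, A->D, D->EOS = 0.36
p_one_word  = 0.4 * 1.0         # BOS->B, B->EOS      = 0.40
print(p_two_words < p_one_word)  # True: the longer word is preferred (length bias)
```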
13Long-standing problems
- must incorporate a variety of features
- overlapping features, POS hierarchy, lexicalization, character types
- HMMs are not sufficient
- must minimize the influence of length bias
- another bias observed especially in JMA
- MEMMs are not sufficient
14Use of CRFs for JMA
15CRFs for word lattice
A single path encodes a variety of unigram and bigram features, e.g.:
  BOS - noun
  noun - suffix
  noun / Tokyo
Global feature vector: F(Y, X) = (1, 1, 1, ...)
Parameter vector: Λ = (3, 20, 20, ...)
(A small path-scoring sketch follows.)
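A minimal sketch of how a path is scored under this representation; the feature names and weights are the illustrative values above, not learned parameters:

```python
import math

def path_score(active_features, weights):
    # sum of weights for the features that fire on this path, i.e. Λ·F(Y, X)
    return sum(weights.get(f, 0.0) for f in active_features)

weights = {"BOS-noun": 3.0, "noun-suffix": 20.0, "noun/Tokyo": 20.0}
path = ["BOS-noun", "noun/Tokyo", "noun-suffix"]
score = path_score(path, weights)
print(score, math.exp(score))  # exp(Λ·F) is the path's unnormalized weight
```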
16CRFs for word lattice, cont.
- a single exponential model over the entire path (formula below)
- fewer restrictions in the feature design
- can incorporate a variety of features
- can solve the problems of HMMs
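For reference, the single exponential model in its standard CRF form (notation chosen to match the slides' F(Y, X) and Λ; see Lafferty 01):

```latex
P_{\Lambda}(Y \mid X) = \frac{\exp\big(\Lambda \cdot F(Y, X)\big)}{Z_X},
\qquad
Z_X = \sum_{Y' \in \mathcal{Y}(X)} \exp\big(\Lambda \cdot F(Y', X)\big)
```

Here F(Y, X) sums the unigram and bigram feature counts along the path Y, and the normalization runs over all candidate paths in the lattice for X.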
17Encoding
- Maximum Likelihood estimation
- all candidate paths are taken into account during encoding (training)
- the influence of the length bias is minimized
- can solve the problems of MEMMs
- A variant of the Forward-Backward algorithm Lafferty 01 can also be applied to the word lattice (sketch below)
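A minimal sketch (node names and scores are illustrative) of the forward pass over a word lattice, the building block of the lattice Forward-Backward computation:

```python
from collections import defaultdict

def partition_function(nodes, edges, edge_score):
    # nodes must be topologically ordered from "BOS" to "EOS";
    # edge_score[(u, v)] is exp(Λ·f) for the features on edge u -> v
    alpha = defaultdict(float)   # forward scores
    alpha["BOS"] = 1.0
    for u in nodes:
        for v in edges.get(u, []):
            alpha[v] += alpha[u] * edge_score[(u, v)]
    return alpha["EOS"]          # Z(X): total weight of all lattice paths

# toy lattice: BOS -> {A, B}, A -> D, B -> EOS, D -> EOS
nodes = ["BOS", "A", "B", "D", "EOS"]
edges = {"BOS": ["A", "B"], "A": ["D"], "B": ["EOS"], "D": ["EOS"]}
edge_score = {("BOS", "A"): 1.5, ("BOS", "B"): 2.0,
              ("A", "D"): 1.2, ("B", "EOS"): 1.0, ("D", "EOS"): 1.0}
print(partition_function(nodes, edges, edge_score))  # 1.5*1.2*1.0 + 2.0*1.0 = 3.8
```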
18MAP estimation
- L2-CRF (Gaussian prior)
- non-sparse solution (all features have non-zero weights)
- good if most given features are relevant
- unconstrained optimizers, e.g., L-BFGS, are used
- L1-CRF (Laplacian prior)
- sparse solution (most features have zero-weight)
- good if most given features are irrelevant
- constrained optimizers, e.g., L-BFGS-B, are used
- C is a hyper-parameter controlling the strength of the prior (the two objectives are sketched below)
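A sketch of the two MAP objectives; the exact placement of the hyper-parameter C in the penalty terms may differ from the original formulation:

```latex
\mathcal{L}_{L2} = \sum_{j}\log P_{\Lambda}(Y_j \mid X_j) - \frac{\lVert\Lambda\rVert_2^2}{2C}
\qquad
\mathcal{L}_{L1} = \sum_{j}\log P_{\Lambda}(Y_j \mid X_j) - \frac{\lVert\Lambda\rVert_1}{C}
```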
19Decoding
- Viterbi algorithm
- essentially the same architecture as for HMMs and MEMMs (a small lattice Viterbi sketch follows)
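A minimal sketch of Viterbi decoding over a word lattice; node names and scores are illustrative, and scores are kept in the log domain:

```python
def viterbi(nodes, edges, edge_score):
    # nodes are topologically ordered; edge_score holds log-scale scores Λ·f
    best = {"BOS": 0.0}   # best log score reaching each node
    back = {}             # back-pointer to the best predecessor
    for u in nodes:
        if u not in best:
            continue      # skip nodes unreachable in this toy lattice
        for v in edges.get(u, []):
            s = best[u] + edge_score[(u, v)]
            if v not in best or s > best[v]:
                best[v], back[v] = s, u
    path, v = [], "EOS"   # follow back-pointers to recover the best path
    while v != "BOS":
        path.append(v)
        v = back[v]
    return ["BOS"] + path[::-1]

nodes = ["BOS", "A", "B", "D", "EOS"]
edges = {"BOS": ["A", "B"], "A": ["D"], "B": ["EOS"], "D": ["EOS"]}
edge_score = {("BOS", "A"): 0.4, ("BOS", "B"): 0.7,
              ("A", "D"): 0.2, ("B", "EOS"): 0.0, ("D", "EOS"): 0.0}
print(viterbi(nodes, edges, edge_score))  # ['BOS', 'B', 'EOS']
```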
20Experiments
21Data
KC and RWCP, widely-used Japanese annotated corpora
KC:
  source: Mainichi News Article '95
  lexicon (size): JUMAN 3.61 (1,983,173)
  POS structure: 2-level POS, c-form, c-type, base form
  training sentences: 7,958
  training tokens: 198,514
  test sentences: 1,246
  test tokens: 31,302
  features: 791,798
22Features
京都 (Kyoto): noun - proper - loc - general, base form 京都
に (in): particle - general, base form に
住む (live): verb - independent, base form 住む, inflection: base-form
Feature types used: overlapping features, POS hierarchy, character types, prefixes/suffixes, lexicalization, inflections
23Evaluation
F = 2 * recall * precision / (recall + precision)
recall = # of correct tokens / # of tokens in test corpus
precision = # of correct tokens / # of tokens in system output
- three criteria of correctness
- seg: word segmentation only
- top: word segmentation + top level of POS
- all: all information
(A small sketch of these token-level metrics follows.)
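A small sketch of the metrics defined above; the counts passed in are illustrative, not the paper's results:

```python
def token_prf(n_correct, n_system, n_gold):
    precision = n_correct / n_system
    recall = n_correct / n_gold
    f = 2 * recall * precision / (recall + precision)
    return precision, recall, f

# illustrative counts only (not the paper's numbers)
print(token_prf(n_correct=30_900, n_system=31_250, n_gold=31_302))
```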
24Results
Significance tests: McNemar's paired test on the labeling disagreements
F1 (%)     seg     top     all
L2-CRFs    98.96   98.31   96.75
L1-CRFs    98.80   98.14   96.55
HMMs       96.22   94.99   91.85
MEMMs      96.44   95.81   94.28
- L1/L2-CRFs outperform HMM and MEMM
- L2-CRFs outperform L1-CRFs
25Influence of the length bias
           long word err.   short word err.
HMMs       306 (44%)        387 (56%)
L2-CRFs    79 (40%)         120 (60%)
MEMMs      416 (70%)        183 (30%)
- for HMMs and CRFs, the relative error ratios are not much different
- for MEMMs, the proportion of long-word errors is large
- → MEMMs are influenced by the length bias
26L1-CRFs vs. L2-CRFs
- L2-CRFs > L1-CRFs
- most given features are relevant (POS hierarchies, suffixes/prefixes, character types)
- L1-CRFs produce a compact model
- # of active features: L2 791,798 vs. L1 90,163 (about 11%)
- L1-CRFs are worth examining if there are practical constraints
27Conclusions
- An application of CRFs to JMA
- Uses a word lattice with a lexicon rather than character-based begin/inside tags
- CRFs offer an elegant solution to the problems with HMMs and MEMMs
- can use a wide variety of features (hierarchical POS tags, inflections, character types, etc.)
- can minimize the influence of the length bias (which has been ignored in JMA!)
28Future work
- Tri-gram features
- Using all tri-grams is impractical, as they make decoding significantly slower
- need a practical feature selection method, e.g., McCallum 03
- Apply to other non-segmented languages
- e.g., Chinese or Thai
29CRFs encoding
- A variant of the Forward-Backward algorithm Lafferty 01 can also be applied to the word lattice
30Influence of the length bias, cont.
Example 1 ("The romance they bet on the sea is ..."): MEMMs select a path containing the long word glossed "romanticist"; the correct analysis uses the shorter word "romance" followed by a particle.
Example 2 ("A heart which beats rough waves is ..."): MEMMs select a path containing the long word glossed "one's heart"; the correct analysis uses the shorter word "heart", in the context of "rough waves", a particle, "loose", and "not".
- these errors are caused by the influence of the length bias
- (CRFs can correctly analyze these sentences)
31Cause of label and length bias
- MEMMs use only the correct path in encoding (training)
- transition probabilities of unobserved paths will be distributed uniformly