Applying Conditional Random Fields to Japanese Morphological Analysis
Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto (Nara Institute of Science and Technology / CREST, Tokyo Institute of Technology)
Slides: 32 | Source: http://chasen.org

Transcript and Presenter's Notes
1
Applying Conditional Random Fields to Japanese Morphological Analysis
  • Taku Kudo 1*, Kaoru Yamamoto 2, Yuji Matsumoto 1
  • 1 Nara Institute of Science and Technology
  • 2 CREST, Tokyo Institute of Technology
  • * Currently at NTT Communication Science Labs.

2
Background
  • Conditional Random Fields [Lafferty 01]
  • A variant of Markov Random Fields
  • Many applications: POS tagging [Lafferty 01], shallow parsing [Sha 03], NE recognition [McCallum 03], IE [Pinto 03, Peng 04]
  • Japanese Morphological Analysis (JMA)
  • Must cope with word segmentation
  • Must incorporate many features
  • Must minimize the influence of the length bias

3
Japanese Morphological Analysis
INPUT: 東京都に住む (I live in the Metropolis of Tokyo.)
OUTPUT: 東京 / 都 / に / 住む
  東京 (Tokyo) NOUN-PROPER-LOC-GENERAL
  都 (Metro.) NOUN-SUFFIX-LOC
  に (in) PARTICLE-GENERAL
  住む (live) VERB BASE-FORM
  • word segmentation (no explicit spaces in Japanese)
  • POS tagging
  • lemmatization, stemming

4
Simple approach for JMA
  • Character-based begin / inside (B/I) tagging
  • a non-standard method in JMA
  • cannot directly reflect lexicons
  • over 90% accuracy can be achieved using naïve longest-prefix matching with a lexicon
  • decoding is slow

東 京 / 都 / に / 住 む
B  I   B   B   B  I
5
Our approach for JMA
  • Assume that a lexicon is available
  • word lattice
  • represents all candidate outputs
  • reduces redundant outputs
  • Unknown word processing
  • invoked when no matching word can be found in the lexicon
  • based on character types, e.g., Chinese characters (kanji), hiragana, katakana, numbers, etc. (sketched below)
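As a rough illustration of the character-type idea for unknown word candidates, here is a minimal Python sketch (not the presented system's implementation; the function names, Unicode ranges, and the same-type-run grouping rule are assumptions for illustration only):

    def char_type(ch: str) -> str:
        """Very rough character-type classifier (illustrative categories)."""
        code = ord(ch)
        if 0x4E00 <= code <= 0x9FFF:
            return "KANJI"        # Chinese characters
        if 0x3040 <= code <= 0x309F:
            return "HIRAGANA"
        if 0x30A0 <= code <= 0x30FF:
            return "KATAKANA"
        if ch.isdigit():
            return "NUMBER"
        return "OTHER"

    def unknown_word_candidates(text: str) -> list[str]:
        """Group runs of the same character type into candidate unknown words."""
        candidates, start = [], 0
        for i in range(1, len(text) + 1):
            if i == len(text) or char_type(text[i]) != char_type(text[start]):
                candidates.append(text[start:i])
                start = i
        return candidates

    print(unknown_word_candidates("スターバックスで2005年"))
    # -> ['スターバックス', 'で', '2005', '年']

Grouping by character type gives reasonable candidates for katakana loanwords and numbers, which is also why character types reappear as features later in the talk.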

6
Problem Setting
lexicon
に particle, verb    東 noun    京 noun    京都 noun    東京 noun
Input: 東京都に住む (I live in the Metropolis of Tokyo)

Lattice (candidate words between BOS and EOS):
  京都 (Kyoto) noun
  に (in) particle
  東 (east) noun
  都 (Metro.) suffix
  住む (live) verb
  京 (capital) noun
  に (resemble) verb
  東京 (Tokyo) noun

NOTE: the number of output tokens #Y varies from path to path (a lattice-construction sketch follows)
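As a minimal sketch of how such a lattice can be built by dictionary lookup at every character position (Python; the toy LEXICON dictionary, the node representation, and the single-character fallback for unmatched positions are illustrative assumptions, not the actual analyzer):

    # Toy lexicon: surface form -> possible POS tags (entries taken from this slide).
    LEXICON = {
        "東": ["noun"], "京": ["noun"], "京都": ["noun"], "東京": ["noun"],
        "都": ["suffix"], "に": ["particle", "verb"], "住む": ["verb"],
    }

    def build_lattice(sentence: str):
        """Enumerate every lexicon word starting at each character position.
        Returns a dict: start position -> list of (surface, POS, end position).
        Any chain of adjacent nodes from position 0 to len(sentence) is one
        candidate output Y, so the number of tokens differs across paths."""
        lattice = {i: [] for i in range(len(sentence))}
        for start in range(len(sentence)):
            for end in range(start + 1, len(sentence) + 1):
                surface = sentence[start:end]
                for pos in LEXICON.get(surface, []):
                    lattice[start].append((surface, pos, end))
            if not lattice[start]:
                # Unknown word processing would be invoked here; as a crude
                # fallback, add a single-character "unknown" node.
                lattice[start].append((sentence[start], "unknown", start + 1))
        return lattice

    for start, nodes in build_lattice("東京都に住む").items():
        print(start, nodes)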
7
Long-standing Problems in JMA
8
Complex tagset
京都 (Kyoto): Noun - Proper - Loc - General - 京都
  • Hierarchical tagset
  • HMMs cannot capture it
  • How to select the hidden classes?
  • top level → lack of granularity
  • bottom level → data sparseness
  • Some functional particles should be lexicalized
  • Semi-automatic hidden class selection [Asahara 00]

9
Complex tagset, cont.
  • Must capture a variety of features

京都 (Kyoto)   noun      proper       loc  general  京都
に (in)        particle  general      φ    φ        に
住む (live)    verb      independent  φ    φ        住む (live)  base-form

  • overlapping features
  • POS hierarchy
  • character types
  • prefixes, suffixes
  • lexicalization
  • inflections
These features are important for JMA
10
JMA with MEMMs [Uchimoto 00-03]
  • Use a discriminative model, e.g., a maximum entropy model, to capture a variety of features
  • sequential application of ME models

11
Problems of MEMMs
  • Label bias [Lafferty 01]

[Lattice figure: BOS→A (0.6), BOS→B (0.4), A→C (0.4), A→D (0.6), B→E (1.0), C→EOS (1.0), D→EOS (1.0), E→EOS (1.0)]
P(A, D | x) = 0.6 × 0.6 × 1.0 = 0.36
P(B, E | x) = 0.4 × 1.0 × 1.0 = 0.4
P(A, D | x) < P(B, E | x)
→ paths with low-entropy transitions are preferred
12
Problems of MEMMs in JMA
  • Length bias

[Lattice figure: BOS→A (0.6), BOS→B (0.4), A→C (0.4), A→D (0.6), C→EOS (1.0), D→EOS (1.0), B→EOS (1.0); B is a single longer word that reaches EOS directly]
P(A, D | x) = 0.6 × 0.6 × 1.0 = 0.36
P(B | x) = 0.4 × 1.0 = 0.4
P(A, D | x) < P(B | x)
→ long words are preferred: the length bias, which has been ignored in JMA!
13
Long-standing problems
  • must incorporate a variety of features
  • overlapping features, POS hierarchy, lexicalization, character types
  • → HMMs are not sufficient
  • must minimize the influence of the length bias
  • another bias, observed especially in JMA
  • → MEMMs are not sufficient

14
Use of CRFs for JMA
15
CRFs for word lattice
encodes a variety of unigram and bigram features along a path, e.g.:
  bigram: BOS - noun
  bigram: noun - suffix
  unigram: noun / Tokyo (lexicalized)
Global feature vector: F(Y, X) = (1, 1, 1, ...)
Parameter vector: Λ = (3, 20, 20, ...)
(the resulting path probability is spelled out below)
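For reference, the single exponential model over whole lattice paths that the next slide describes has the standard CRF form (standard notation, restated here rather than quoted from the slide):

    P(Y | X) = (1 / Z_X) · exp( Λ · F(Y, X) ) = (1 / Z_X) · exp( Σ_k λ_k f_k(Y, X) )
    Z_X = Σ over all candidate paths Y' in the lattice of exp( Λ · F(Y', X) )

With the toy numbers above, the three features shown contribute Λ · F(Y, X) = 3·1 + 20·1 + 20·1 = 43 to the path score, before the elided features and the normalization by Z_X.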
16
CRFs for word lattice, cont.
  • a single exponential model over entire paths (whole sentences)
  • fewer restrictions on feature design
  • can incorporate a variety of features
  • → solves the problems of HMMs

17
Encoding
  • Maximum likelihood estimation
  • all candidate paths in the lattice are taken into account in training
  • → the influence of the length bias is minimized
  • → solves the problems of MEMMs
  • A variant of the Forward-Backward algorithm [Lafferty 01] can also be applied to the word lattice (a sketch follows)
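A minimal sketch of the forward pass over such a word lattice (Python; the edge representation and the hand-picked scores are assumptions for illustration, with each edge score standing in for Λ·f of that token). Summing exp-scores over all candidate paths gives the partition function Z_X needed during training:

    import math
    from collections import defaultdict

    # Each edge: (start_pos, end_pos, score) for 東京都に住む (positions 0-6).
    EDGES = [
        (0, 2, 2.0),   # 東京 (Tokyo)
        (0, 1, 0.5),   # 東 (east)
        (1, 3, 0.5),   # 京都 (Kyoto)
        (2, 3, 1.5),   # 都 (Metro.)
        (3, 4, 1.0),   # に (in)
        (4, 6, 2.0),   # 住む (live)
    ]

    def partition_function(edges, length):
        """Forward pass: alpha[i] = sum of exp(path score) over all paths
        from position 0 to position i; Z_X = alpha[length]."""
        alpha = defaultdict(float)
        alpha[0] = 1.0
        for pos in range(length):
            if alpha[pos] == 0.0:
                continue
            for start, end, score in edges:
                if start == pos:
                    alpha[end] += alpha[pos] * math.exp(score)
        return alpha[length]

    print(partition_function(EDGES, 6))

Because every candidate path contributes to Z_X, long and short segmentations compete on an equal footing, which is how the length bias is reduced compared with MEMMs.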

18
MAP estimation
  • L2-CRF (Gaussian prior)
  • non-sparse solution (all features have non-zero weights)
  • good if most given features are relevant
  • unconstrained optimizers, e.g., L-BFGS, are used
  • L1-CRF (Laplacian prior)
  • sparse solution (most features have zero weight)
  • good if most given features are irrelevant
  • constrained optimizers, e.g., L-BFGS-B, are used
  • C is a hyper-parameter controlling the strength of the prior (objectives sketched below)
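For concreteness, the two MAP objectives amount to regularized log-likelihoods of roughly the following form (standard formulations; the exact placement of the hyper-parameter C may differ slightly from the original paper):

    L2-CRF:  maximize  Σ_j log P(Y_j | X_j)  −  (1 / 2C) Σ_k λ_k²      (Gaussian prior)
    L1-CRF:  maximize  Σ_j log P(Y_j | X_j)  −  (1 / C)  Σ_k |λ_k|     (Laplacian prior)

A larger C weakens the prior. The L1 penalty is non-differentiable at zero, which is why a constrained optimizer such as L-BFGS-B is used and why many weights end up exactly zero, giving the sparse solution mentioned above.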

19
Decoding
  • Viterbi algorithm over the word lattice
  • essentially the same search architecture as in HMMs and MEMMs (a decoding sketch follows)
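A minimal Viterbi sketch over the same edge representation as the forward-pass sketch (Python; the scores again stand in for Λ·f of unigram features, and bigram/transition features are omitted for brevity, so this is an illustration rather than the actual decoder):

    # Edges: (start_pos, end_pos, score), as in the forward-pass sketch.
    EDGES = [(0, 2, 2.0), (0, 1, 0.5), (1, 3, 0.5),
             (2, 3, 1.5), (3, 4, 1.0), (4, 6, 2.0)]

    def viterbi(edges, length):
        """Return the highest-scoring path (list of edges) from 0 to length."""
        best = {0: 0.0}   # best additive score reaching each position
        back = {}         # position -> edge used on the best path to it
        for pos in range(length):
            if pos not in best:
                continue
            for edge in edges:
                start, end, score = edge
                if start != pos:
                    continue
                cand = best[pos] + score
                if cand > best.get(end, float("-inf")):
                    best[end] = cand
                    back[end] = edge
        # Trace back from the end of the sentence to BOS.
        path, pos = [], length
        while pos > 0:
            edge = back[pos]
            path.append(edge)
            pos = edge[0]
        return list(reversed(path))

    print(viterbi(EDGES, 6))
    # -> [(0, 2, 2.0), (2, 3, 1.5), (3, 4, 1.0), (4, 6, 2.0)],
    #    i.e. 東京 / 都 / に / 住む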

20
Experiments
21
Data
KC and RWCP: widely-used Japanese annotated corpora

KC
  source: Mainichi News Articles '95
  lexicon (size): JUMAN 3.61 (1,983,173)
  POS structure: 2-level POS, c-form, c-type, base form
  training sentences: 7,958
  training tokens: 198,514
  test sentences: 1,246
  test tokens: 31,302
  features: 791,798
22
Features
京都 (Kyoto)   noun      proper       loc  general  京都
に (in)        particle  general      φ    φ        に
住む (live)    verb      independent  φ    φ        住む (live)  base-form

  • overlapping features
  • POS hierarchy
  • character types
  • prefixes, suffixes
  • lexicalization
  • inflections
23
Evaluation
F = (2 × recall × precision) / (recall + precision)
recall = (# correct tokens) / (# tokens in test corpus)
precision = (# correct tokens) / (# tokens in system output)
  • three criteria of correctness
  • seg: word segmentation only
  • top: word segmentation + top level of POS
  • all: all information

24
Results
Significance tests: McNemar's paired test on the labeling disagreements

          seg     top     all
L2-CRFs   98.96   98.31   96.75
L1-CRFs   98.80   98.14   96.55
HMMs      96.22   94.99   91.85
MEMMs     96.44   95.81   94.28

  • L1/L2-CRFs outperform HMMs and MEMMs
  • L2-CRFs outperform L1-CRFs

25
Influence of the length bias
            long word err.   short word err.
HMMs        306 (44%)        387 (56%)
L2-CRFs      79 (40%)        120 (60%)
MEMMs       416 (70%)        183 (30%)

  • HMMs and CRFs: the relative ratios are not much different
  • MEMMs: the ratio of long-word errors is large
  • → influenced by the length bias

26
L1-CRFs vs. L2-CRFs
  • L2-CRFs > L1-CRFs in accuracy
  • most given features are relevant (POS hierarchies, suffixes/prefixes, character types)
  • L1-CRFs produce a compact model (number of active features)
  • L2: 791,798 vs. L1: 90,163 (about 11%)
  • L1-CRFs are worth examining if there are practical constraints

27
Conclusions
  • An application of CRFs to JMA
  • does not use character-based begin/inside tags, but a word lattice built with a lexicon
  • CRFs offer an elegant solution to the problems with HMMs and MEMMs
  • can use a wide variety of features (hierarchical POS tags, inflections, character types, etc.)
  • can minimize the influence of the length bias (a bias that has been ignored in JMA!)

28
Future work
  • Tri-gram features
  • using all tri-grams is impractical, as they make decoding significantly slower
  • need practical feature selection, e.g., [McCallum 03]
  • Apply to other non-segmented languages
  • e.g., Chinese or Thai

29
CRFs encoding
  • A variant of the Forward-Backward algorithm [Lafferty 01] can also be applied to the word lattice

30
Influence of the length bias, cont.
MEMMs select:
  ???? (romanticist)
  ? (sea)
  ? (particle)
  ??? (bet)
  ??? (romance)
  ? (particle)
  The romance on the sea they bet is
MEMMs select:
  ??? (one's heart)
  ?? (rough waves)
  ? (particle)
  ?? (loose)
  ?? (not)
  ? (heart)
  A heart which beats rough waves is
  • these errors are caused rather by the influence of the length bias
  • (CRFs can correctly analyze these sentences)

31
Cause of label and length bias
  • MEMMs use only the correct path in training (encoding)
  • transition probabilities of unobserved paths are distributed uniformly