Title: Graphical models for part of speech tagging
1. Graphical models for part of speech tagging
2. Different Models for POS tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields
3. POS Tagging: A Sequence Labeling Problem
- Input and Output
- Input sequence x = x1 x2 … xn
- Output sequence y = y1 y2 … ym
- Labels of the input sequence
- Semantic representation of the input
- Other Applications
- Automatic speech recognition
- Text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.
4. Hidden Markov Models
- Doubly stochastic models
- Efficient dynamic programming algorithms exist for:
- Finding Pr(S)
- Finding the highest-probability path P that maximizes Pr(S, P) (Viterbi; see the sketch below)
- Training the model (Baum-Welch algorithm)
[Figure: a four-state HMM with states S1-S4; each state emits symbols A and C with state-specific probabilities (0.6/0.4, 0.9/0.1, 0.5/0.5, 0.3/0.7).]
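As a concrete illustration of the Viterbi step above, here is a minimal sketch in Python for a discrete HMM. The dictionary-based interface (`start`, `trans`, `emit`) is an assumption of this sketch, not something fixed by the slides.

```python
# Minimal Viterbi sketch for a discrete HMM (interface is illustrative).
# start[s] = P(s at t=0), trans[r][s] = P(s | r), emit[s][o] = P(o | s).

def viterbi(obs, states, start, trans, emit):
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # choose the best predecessor state for s
            prev, p = max(
                ((r, best[t - 1][r] * trans[r][s]) for r in states),
                key=lambda x: x[1],
            )
            best[t][s] = p * emit[s][obs[t]]
            back[t][s] = prev
    # backtrack from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

For example, with two states emitting A/C as in the figure, `viterbi("AC", ["S1", "S2"], start, trans, emit)` returns the most probable state sequence for the observation AC.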
5. Hidden Markov Model (HMM): Generative Modeling
- Source model P(Y), e.g., a 1st-order Markov chain
- Noisy channel P(X|Y)
- Parameter estimation: maximize the joint likelihood of the training examples
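Concretely, for the 1st-order chain the joint likelihood factorizes into source and channel terms (standard HMM algebra, with $P(y_1 \mid y_0)$ read as the initial distribution $P(y_1)$):

$$P(x, y) = \prod_{k=1}^{n} P(y_k \mid y_{k-1})\, P(x_k \mid y_k),$$

and maximum likelihood training chooses the transition and emission parameters that maximize $\sum_i \log P(x^{(i)}, y^{(i)})$ over the training pairs.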
6. Dependency (1st order)
7. Different Models for POS tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields
8. Disadvantage of HMMs (1)
- No Rich Feature Information
- Rich information is required
- When xk is complex
- When data for xk is sparse
- Example: POS Tagging
- How to estimate P(wk | tk) for unknown words wk? (see the sketch below)
- Useful features
- Suffix, e.g., -ed, -tion, -ing, etc.
- Capitalization
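A minimal sketch of how such spelling features could be computed for an unknown word; the feature names and the exact suffix list are illustrative assumptions, not part of the slides.

```python
# Illustrative spelling features for an unknown word wk.
# Feature names and the suffix list are made up for this sketch.
def spelling_features(word):
    feats = {}
    for suffix in ("ed", "tion", "ing", "s"):
        if word.lower().endswith(suffix):
            feats["suffix=-" + suffix] = 1
    if word[:1].isupper():
        feats["capitalized"] = 1
    return feats

# spelling_features("Taxation") -> {"suffix=-tion": 1, "capitalized": 1}
```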
9. Disadvantage of HMMs (2)
- Generative Model
- Parameter estimation: maximize the joint likelihood of the training examples
- Better Approach
- Discriminative model, which models P(y|x) directly
- Maximize the conditional likelihood of the training examples
10. Maximum Entropy Markov Model
- Discriminative Sub-Models
- Unify the two parameters of the generative model into one conditional model
- Two parameters in the generative model: the parameter P(yk | yk-1) in the source model and the parameter P(xk | yk) in the noisy channel
- Unified conditional model: P(yk | yk-1, xk)
- Employ the maximum entropy principle
- Maximum Entropy Markov Model
11. General Maximum Entropy Model
- Model
- Model the distribution P(Y | X) with a set of features f1, f2, …, fl defined on X and Y
- Idea
- Collect statistics of the features from the training data
- Assume nothing about the distribution P(Y | X) other than the collected statistics
- Maximize the entropy as a criterion
12. Features
- Features
- 0-1 indicator functions
- 1 if (x, y) satisfies a predefined condition
- 0 if not
- Example: POS Tagging (see the sketch below)
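For instance, a hypothetical indicator for POS tagging might fire only when the word carries a given suffix and the candidate tag matches:

```python
# Hypothetical 0-1 indicator feature f(x, y): fires iff the word ends
# in "ing" and the proposed tag is VBG (gerund).
def f_ing_vbg(x, y):
    return 1 if x.endswith("ing") and y == "VBG" else 0

# f_ing_vbg("running", "VBG") -> 1;  f_ing_vbg("running", "NN") -> 0
```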
13. Constraints
- Empirical Information
- Statistics from the training data T
- Expected Value
- From the distribution P(Y | X) we want to model
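In the standard maximum entropy formulation (Berger et al. 1996), each feature f_i yields one constraint, equating its empirical expectation over T with its expectation under the model:

$$\sum_{x,y} \tilde{p}(x,y)\, f_i(x,y) \;=\; \sum_{x,y} \tilde{p}(x)\, P(y \mid x)\, f_i(x,y), \qquad i = 1, \dots, l,$$

where $\tilde{p}$ denotes empirical frequencies in the training data T.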
14. Maximum Entropy Objective
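In the standard formulation of Berger et al. (1996), the objective is to pick, among all distributions satisfying the constraints above, the one with maximum conditional entropy:

$$P^{*} = \arg\max_{P \in \mathcal{C}} H(P), \qquad H(P) = -\sum_{x,y} \tilde{p}(x)\, P(y \mid x) \log P(y \mid x),$$

where $\mathcal{C}$ is the set of conditional distributions meeting all l expectation constraints.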
15. Dual Problem
- Dual Problem
- Conditional model
- Maximum likelihood of the conditional data
- Solution
- Improved iterative scaling (IIS) (Berger et al. 1996)
- Generalized iterative scaling (GIS) (McCallum et al. 2000)
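Solving this dual yields the familiar log-linear form, whose weights $\lambda_i$ are exactly what IIS and GIS estimate:

$$P_{\lambda}(y \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\Big(\sum_{i=1}^{l} \lambda_i f_i(x, y)\Big), \qquad Z_{\lambda}(x) = \sum_{y'} \exp\Big(\sum_{i=1}^{l} \lambda_i f_i(x, y')\Big).$$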
16. Maximum Entropy Markov Model
- Use the maximum entropy approach to model P(yk | yk-1, xk) (1st order)
- Features
- Basic features (like the parameters in an HMM)
- Bigram (1st order) or trigram (2nd order) features in the source model
- State-output pair feature (Xk = xk, Yk = yk)
- Advantage: incorporate other advanced features on (xk, yk)
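Putting these pieces together, one common way to write the 1st-order MEMM is as a single log-linear model per transition, normalized separately for each previous state (the exact feature argument list here is an assumption of this sketch):

$$P(y_k \mid y_{k-1}, x_k) = \frac{1}{Z(x_k, y_{k-1})} \exp\Big(\sum_i \lambda_i f_i(x_k, y_k, y_{k-1})\Big).$$

The per-state normalizer $Z(x_k, y_{k-1})$ is what later gives rise to the label bias problem.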
17. HMM vs. MEMM (1st order)
[Figure: graphical structures of the 1st-order HMM and of the Maximum Entropy Markov Model (MEMM)]
18. Performance in POS Tagging
- POS Tagging
- Data set: WSJ
- Features
- HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al. 2001)
- 1st-order HMM
- 94.31% accuracy, 54.01% OOV accuracy
- 1st-order MEMM
- 95.19% accuracy, 73.01% OOV accuracy
19. Different Models for POS tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields
20. Disadvantage of MEMMs (1)
- Complex Algorithms for the Maximum Entropy Solution
- Both IIS and GIS are difficult to implement
- Require many implementation tricks
- Slow in Training
- Time-consuming when the data set is large
- Especially so for MEMMs
21. Disadvantage of MEMMs (2)
- Maximum Entropy Markov Model
- Maximum entropy model as a sub-model
- Entropy is optimized on the sub-models, not on the global model
- Label Bias Problem
- Conditional models with per-state normalization
- The effect of the observations is weakened for states with fewer outgoing transitions
22. Label Bias Problem
- Training data (X → Y):
- rib → 1 2 3 (3 times)
- rob → 4 5 6 (2 times)
- New input: rob
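A short worked account of why decoding goes wrong here (the standard explanation of this example): with per-state normalization, any state with a single outgoing transition must give that transition probability 1 whatever the observation, so along both candidate paths only the first step actually depends on the input:

$$P(1\,2\,3 \mid rob) \propto P(1 \mid r), \qquad P(4\,5\,6 \mid rob) \propto P(4 \mid r).$$

Since 3 of the 5 training sequences start with label 1 and only 2 start with label 4, the decoder outputs 1 2 3 for rob; the observations o and i are effectively ignored.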
23. Solution
- Global Optimization
- Optimize the parameters in one global model simultaneously, not in separate sub-models
- Alternatives
- Conditional random fields
- Application of the perceptron algorithm
24. Conditional Random Field (CRF) (1)
- Let G = (V, E) be a graph such that Y = (Yv), v ∈ V, is indexed by the vertices of G
- Then (X, Y) is a conditional random field if the random variables Yv, conditioned globally on X, obey the Markov property with respect to the graph:
- P(Yv | X, Yw, w ≠ v) = P(Yv | X, Yw, w ~ v), where w ~ v means w and v are neighbors in G
25. Conditional Random Field (CRF) (2)
- Exponential Model
- When G is a tree (or, more specifically, a chain) with cliques given by its edges and vertices:
- p(y | x) ∝ exp( Σe,k λk fk(e, y|e, x) + Σv,k μk gk(v, y|v, x) )
- The edge features fk are determined by state transitions; the vertex features gk are state-determined
- Parameter Estimation
- Maximize the conditional likelihood of the training examples (see the sketch below)
- IIS or GIS
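A minimal sketch of global scoring in a linear-chain CRF, under an assumed feature interface (`feats(x, k, y_prev, y_k)` returning the indices of active features). The normalizer Z(x) is brute-forced here for clarity; a real implementation computes it with the forward algorithm.

```python
import math
from itertools import product

# Illustrative linear-chain CRF (names and the feature interface are
# assumptions of this sketch). feats(x, k, y_prev, y_k) returns the
# indices of the features active on the clique at position k; w is the
# weight vector.

def sequence_score(w, feats, x, y, start="<s>"):
    # global (unnormalized) score: sum of clique scores along the chain
    total, y_prev = 0.0, start
    for k, y_k in enumerate(y):
        total += sum(w[i] for i in feats(x, k, y_prev, y_k))
        y_prev = y_k
    return total

def log_prob(w, feats, x, y, labels):
    # log P(y | x) with ONE global normalizer Z(x) -- unlike the MEMM's
    # per-state normalization. Z(x) is brute-forced over all label
    # sequences here; real implementations use the forward algorithm.
    log_z = math.log(sum(
        math.exp(sequence_score(w, feats, x, y2))
        for y2 in product(labels, repeat=len(y))
    ))
    return sequence_score(w, feats, x, y) - log_z
```

The single global Z(x) is the structural difference from the MEMM and is what removes the label bias problem.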
26. MEMM vs. CRF
- Similarities
- Both employ maximum entropy principle
- Both incorporate rich feature information
- Differences
- Conditional random fields are always globally conditioned on X, resulting in a globally optimized model
27. Performance in POS Tagging
- POS Tagging
- Data set: WSJ
- Features
- HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al. 2001)
- 1st-order MEMM
- 95.19% accuracy, 73.01% OOV accuracy
- Conditional random fields
- 95.73% accuracy, 76.24% OOV accuracy
28. Comparison of the three approaches to POS Tagging
- Results (Lafferty et al. 2001)
- 1st-order HMM
- 94.31% accuracy, 54.01% OOV accuracy
- 1st-order MEMM
- 95.19% accuracy, 73.01% OOV accuracy
- Conditional random fields
- 95.73% accuracy, 76.24% OOV accuracy
29. References
- A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.
- J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289.