Title: Graphical models for part of speech tagging
1. Graphical models for part of speech tagging
2. Different Models for POS tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields
3. POS Tagging: A Sequence Labeling Problem
- Input and Output
- Input sequence x = x1 x2 … xn
- Output sequence y = y1 y2 … ym
- Labels of the input sequence
- Semantic representation of the input
- Other Applications
- Automatic speech recognition
- Text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.
4. Hidden Markov Models
- Doubly stochastic models
- Efficient dynamic programming algorithms exist for:
- Finding Pr(S)
- Finding the highest-probability path P that maximizes Pr(S, P) (Viterbi; see the sketch below)
- Training the model (Baum-Welch algorithm)
[Figure: a four-state HMM with states S1-S4; each state emits symbols A and C with state-specific probabilities (0.6/0.4, 0.9/0.1, 0.5/0.5, 0.3/0.7).]
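As a concrete illustration of the Viterbi step above, here is a minimal sketch in Python for a discrete HMM. The dictionary-based interface (`start`, `trans`, `emit`) is an assumption of this sketch, not something fixed by the slides.

```python
# Minimal Viterbi sketch for a discrete HMM (interface is illustrative).
# start[s] = P(s at t=0), trans[r][s] = P(s | r), emit[s][o] = P(o | s).

def viterbi(obs, states, start, trans, emit):
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # choose the best predecessor state for s
            prev, p = max(
                ((r, best[t - 1][r] * trans[r][s]) for r in states),
                key=lambda x: x[1],
            )
            best[t][s] = p * emit[s][obs[t]]
            back[t][s] = prev
    # backtrack from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

For example, with two states emitting A/C as in the figure, `viterbi("AC", ["S1", "S2"], start, trans, emit)` returns the most probable state sequence for the observation AC.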
5. Hidden Markov Model (HMM): Generative Modeling
- Source model P(Y), e.g., a 1st-order Markov chain
- Noisy channel P(X|Y)
- Parameter estimation: maximize the joint likelihood of the training examples
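Concretely, for the 1st-order chain the joint likelihood factorizes into source and channel terms (standard HMM algebra, with $P(y_1 \mid y_0)$ read as the initial distribution $P(y_1)$):

$$P(x, y) = \prod_{k=1}^{n} P(y_k \mid y_{k-1})\, P(x_k \mid y_k),$$

and maximum likelihood training chooses the transition and emission parameters that maximize $\sum_i \log P(x^{(i)}, y^{(i)})$ over the training pairs.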
6. Dependency (1st order)
7. Different Models for POS tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields
8. Disadvantage of HMMs (1)
- No Rich Feature Information
- Rich information is required
- When xk is complex
- When data for xk is sparse
- Example: POS Tagging
- How to estimate P(wk | tk) for unknown words wk? (see the sketch below)
- Useful features
- Suffix, e.g., -ed, -tion, -ing, etc.
- Capitalization
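A minimal sketch of how such spelling features could be computed for an unknown word; the feature names and the exact suffix list are illustrative assumptions, not part of the slides.

```python
# Illustrative spelling features for an unknown word wk.
# Feature names and the suffix list are made up for this sketch.
def spelling_features(word):
    feats = {}
    for suffix in ("ed", "tion", "ing", "s"):
        if word.lower().endswith(suffix):
            feats["suffix=-" + suffix] = 1
    if word[:1].isupper():
        feats["capitalized"] = 1
    return feats

# spelling_features("Taxation") -> {"suffix=-tion": 1, "capitalized": 1}
```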
9. Disadvantage of HMMs (2)
- Generative Model
- Parameter estimation: maximize the joint likelihood of the training examples
- Better Approach
- Discriminative model, which models P(y|x) directly
- Maximize the conditional likelihood of the training examples
10. Maximum Entropy Markov Model
- Discriminative Sub-Models
- Unify the two parameters of the generative model into one conditional model
- Two parameters in the generative model: the parameter P(yk | yk-1) in the source model and the parameter P(xk | yk) in the noisy channel
- Unified conditional model: P(yk | yk-1, xk)
- Employ the maximum entropy principle
- Maximum Entropy Markov Model
11. General Maximum Entropy Model
- Model
- Model the distribution P(Y | X) with a set of features f1, f2, …, fl defined on X and Y
- Idea
- Collect statistics of the features from the training data
- Assume nothing about the distribution P(Y | X) other than the collected statistics
- Maximize the entropy as a criterion
12. Features
- Features
- 0-1 indicator functions
- 1 if (x, y) satisfies a predefined condition
- 0 if not
- Example: POS Tagging (see the sketch below)
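For instance, a hypothetical indicator for POS tagging might fire only when the word carries a given suffix and the candidate tag matches:

```python
# Hypothetical 0-1 indicator feature f(x, y): fires iff the word ends
# in "ing" and the proposed tag is VBG (gerund).
def f_ing_vbg(x, y):
    return 1 if x.endswith("ing") and y == "VBG" else 0

# f_ing_vbg("running", "VBG") -> 1;  f_ing_vbg("running", "NN") -> 0
```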
13. Constraints
- Empirical Information
- Statistics from the training data T
- Expected Value
- From the distribution P(Y | X) we want to model
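In the standard maximum entropy formulation (Berger et al. 1996), each feature f_i yields one constraint, equating its empirical expectation over T with its expectation under the model:

$$\sum_{x,y} \tilde{p}(x,y)\, f_i(x,y) \;=\; \sum_{x,y} \tilde{p}(x)\, P(y \mid x)\, f_i(x,y), \qquad i = 1, \dots, l,$$

where $\tilde{p}$ denotes empirical frequencies in the training data T.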
14. Maximum Entropy Objective
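In the standard formulation of Berger et al. (1996), the objective is to pick, among all distributions satisfying the constraints above, the one with maximum conditional entropy:

$$P^{*} = \arg\max_{P \in \mathcal{C}} H(P), \qquad H(P) = -\sum_{x,y} \tilde{p}(x)\, P(y \mid x) \log P(y \mid x),$$

where $\mathcal{C}$ is the set of conditional distributions meeting all l expectation constraints.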
15. Dual Problem
- Dual Problem
- Conditional model
- Maximum likelihood of the conditional data
- Solution
- Improved iterative scaling (IIS) (Berger et al. 1996)
- Generalized iterative scaling (GIS) (McCallum et al. 2000)
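Solving this dual yields the familiar log-linear form, whose weights $\lambda_i$ are exactly what IIS and GIS estimate:

$$P_{\lambda}(y \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\Big(\sum_{i=1}^{l} \lambda_i f_i(x, y)\Big), \qquad Z_{\lambda}(x) = \sum_{y'} \exp\Big(\sum_{i=1}^{l} \lambda_i f_i(x, y')\Big).$$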
16. Maximum Entropy Markov Model
- Use the maximum entropy approach to model P(yk | yk-1, xk) (1st order)
- Features
- Basic features (like the parameters in an HMM)
- Bigram (1st order) or trigram (2nd order) features in the source model
- State-output pair feature (Xk = xk, Yk = yk)
- Advantage: incorporate other advanced features on (xk, yk)
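Putting these pieces together, one common way to write the 1st-order MEMM is as a single log-linear model per transition, normalized separately for each previous state (the exact feature argument list here is an assumption of this sketch):

$$P(y_k \mid y_{k-1}, x_k) = \frac{1}{Z(x_k, y_{k-1})} \exp\Big(\sum_i \lambda_i f_i(x_k, y_k, y_{k-1})\Big).$$

The per-state normalizer $Z(x_k, y_{k-1})$ is what later gives rise to the label bias problem.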
17. HMM vs. MEMM (1st order)
[Figure: graphical structures of the 1st-order HMM and of the Maximum Entropy Markov Model (MEMM)]
18. Performance in POS Tagging
- POS Tagging
- Data set: WSJ
- Features
- HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al. 2001)
- 1st-order HMM
- 94.31% accuracy, 54.01% OOV accuracy
- 1st-order MEMM
- 95.19% accuracy, 73.01% OOV accuracy
19. Different Models for POS tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields
20. Disadvantage of MEMMs (1)
- Complex Algorithms for the Maximum Entropy Solution
- Both IIS and GIS are difficult to implement
- Require many implementation tricks
- Slow in Training
- Time-consuming when the data set is large
- Especially so for MEMMs
21. Disadvantage of MEMMs (2)
- Maximum Entropy Markov Model
- Maximum entropy model as a sub-model
- Entropy is optimized on the sub-models, not on the global model
- Label Bias Problem
- Conditional models with per-state normalization
- The effect of the observations is weakened for states with fewer outgoing transitions
22. Label Bias Problem
- Training data (X → Y):
- rib → 1 2 3 (3 times)
- rob → 4 5 6 (2 times)
- New input: rob
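A short worked account of why decoding goes wrong here (the standard explanation of this example): with per-state normalization, any state with a single outgoing transition must give that transition probability 1 whatever the observation, so along both candidate paths only the first step actually depends on the input:

$$P(1\,2\,3 \mid rob) \propto P(1 \mid r), \qquad P(4\,5\,6 \mid rob) \propto P(4 \mid r).$$

Since 3 of the 5 training sequences start with label 1 and only 2 start with label 4, the decoder outputs 1 2 3 for rob; the observations o and i are effectively ignored.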
23. Solution
- Global Optimization
- Optimize the parameters in one global model simultaneously, not in separate sub-models
- Alternatives
- Conditional random fields
- Application of the perceptron algorithm
24. Conditional Random Field (CRF) (1)
- Let G = (V, E) be a graph such that Y = (Yv), v ∈ V, is indexed by the vertices of G
- Then (X, Y) is a conditional random field if the random variables Yv, conditioned globally on X, obey the Markov property with respect to the graph:
- P(Yv | X, Yw, w ≠ v) = P(Yv | X, Yw, w ~ v), where w ~ v means w and v are neighbors in G
25. Conditional Random Field (CRF) (2)
- Exponential Model
- When G is a tree (or, more specifically, a chain) with cliques given by its edges and vertices:
- p(y | x) ∝ exp( Σe,k λk fk(e, y|e, x) + Σv,k μk gk(v, y|v, x) )
- The edge features fk are determined by state transitions; the vertex features gk are state-determined
- Parameter Estimation
- Maximize the conditional likelihood of the training examples (see the sketch below)
- IIS or GIS
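A minimal sketch of global scoring in a linear-chain CRF, under an assumed feature interface (`feats(x, k, y_prev, y_k)` returning the indices of active features). The normalizer Z(x) is brute-forced here for clarity; a real implementation computes it with the forward algorithm.

```python
import math
from itertools import product

# Illustrative linear-chain CRF (names and the feature interface are
# assumptions of this sketch). feats(x, k, y_prev, y_k) returns the
# indices of the features active on the clique at position k; w is the
# weight vector.

def sequence_score(w, feats, x, y, start="<s>"):
    # global (unnormalized) score: sum of clique scores along the chain
    total, y_prev = 0.0, start
    for k, y_k in enumerate(y):
        total += sum(w[i] for i in feats(x, k, y_prev, y_k))
        y_prev = y_k
    return total

def log_prob(w, feats, x, y, labels):
    # log P(y | x) with ONE global normalizer Z(x) -- unlike the MEMM's
    # per-state normalization. Z(x) is brute-forced over all label
    # sequences here; real implementations use the forward algorithm.
    log_z = math.log(sum(
        math.exp(sequence_score(w, feats, x, y2))
        for y2 in product(labels, repeat=len(y))
    ))
    return sequence_score(w, feats, x, y) - log_z
```

The single global Z(x) is the structural difference from the MEMM and is what removes the label bias problem.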
26. MEMM vs. CRF
- Similarities
- Both employ maximum entropy principle
- Both incorporate rich feature information
- Differences
- Conditional random fields are always globally conditioned on X, resulting in a globally optimized model
27. Performance in POS Tagging
- POS Tagging
- Data set: WSJ
- Features
- HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al. 2001)
- 1st-order MEMM
- 95.19% accuracy, 73.01% OOV accuracy
- Conditional random fields
- 95.73% accuracy, 76.24% OOV accuracy
28. Comparison of the three approaches to POS Tagging
- Results (Lafferty et al. 2001)
- 1st-order HMM
- 94.31% accuracy, 54.01% OOV accuracy
- 1st-order MEMM
- 95.19% accuracy, 73.01% OOV accuracy
- Conditional random fields
- 95.73% accuracy, 76.24% OOV accuracy
29. References
- A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.
- J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289.