1
Graphical models for part of speech tagging
2
Different Models for POS tagging
  • HMM
  • Maximum Entropy Markov Models
  • Conditional Random Fields

3
POS Tagging: A Sequence Labeling Problem
  • Input and Output
  • Input sequence x = x1 x2 … xn
  • Output sequence y = y1 y2 … yn
  • Labels of the input sequence
  • Semantic representation of the input
  • Example: x = "The dog barks", y = DT NN VBZ
  • Other Applications
  • Automatic speech recognition
  • Text processing, e.g., tagging, named entity
    recognition, summarization by exploiting layout
    structure of text, etc.

4
Hidden Markov Models
  • Doubly stochastic models
  • Efficient dynamic programming algorithms exist
    for
  • Computing Pr(S), the probability of an
    observation sequence S (forward algorithm)
  • Finding the path P that maximizes Pr(S, P)
    (Viterbi algorithm; see the sketch below)
  • Training the model (Baum-Welch algorithm)

[Figure: a four-state HMM (S1-S4); each state has its own
emission probabilities for the symbols A and C, e.g., 0.6/0.4,
0.9/0.1, 0.5/0.5, and 0.3/0.7]
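A minimal Viterbi sketch in Python; the two-state model, its
probability tables, and all numbers below are illustrative
assumptions, not taken from the slides:

```python
# Viterbi decoding for a 1st-order HMM: finds the state path P
# that maximizes Pr(S, P) for an observation sequence S.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # delta[t][s]: probability of the best path ending in state s at time t
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    backptr = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        backptr.append({})
        for s in states:
            prev = max(states, key=lambda r: delta[t - 1][r] * trans_p[r][s])
            delta[t][s] = delta[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            backptr[t][s] = prev
    # Backtrace from the most probable final state
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path))

# Toy 2-state model emitting symbols A and C (all numbers are made up)
states = ["S1", "S2"]
start_p = {"S1": 0.5, "S2": 0.5}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"A": 0.6, "C": 0.4}, "S2": {"A": 0.9, "C": 0.1}}
print(viterbi(["A", "C", "A"], states, start_p, trans_p, emit_p))
```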
5
Hidden Markov Model (HMM): Generative Modeling
  • Source model P(Y), e.g., a 1st-order Markov chain
  • Noisy channel P(X|Y)
  • Together they define the joint model
    P(x, y) = P(y) P(x|y) = ∏k P(yk|yk-1) P(xk|yk)
  • Parameter estimation: maximize the joint
    likelihood of training examples (see the sketch
    below)
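A minimal sketch of the factorization above; the parameter
tables (start_p, trans_p, emit_p) are assumed to be the same
kind of probability dicts as in the Viterbi sketch earlier:

```python
# Joint probability of a tagged sequence under a 1st-order HMM:
# P(x, y) = P(y1) P(x1|y1) * prod_{k>=2} P(yk|yk-1) P(xk|yk)
def joint_prob(x, y, start_p, trans_p, emit_p):
    p = start_p[y[0]] * emit_p[y[0]][x[0]]
    for k in range(1, len(x)):
        p *= trans_p[y[k - 1]][y[k]] * emit_p[y[k]][x[k]]
    return p
```

Maximizing the joint likelihood of a fully labeled training set
then reduces to counting: each parameter is a relative
frequency of the corresponding transition or emission event.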
6
Dependency (1st order)
7
Different Models for POS tagging
  • HMM
  • Maximum Entropy Markov Models
  • Conditional Random Fields

8
Disadvantage of HMMs (1)
  • No rich feature information
  • Rich features are required
  • When xk is complex
  • When data for xk is sparse
  • Example: POS tagging
  • How do we estimate P(wk|tk) for unknown words wk?
  • Useful features (see the sketch below)
  • Suffix, e.g., -ed, -tion, -ing, etc.
  • Capitalization
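A sketch of such spelling features for unknown words; the
feature names and the exact feature set are illustrative
assumptions, not from the slides:

```python
# Spelling features of a word, usable by a discriminative tagger
# to handle unknown (OOV) words.
def word_features(word):
    feats = {
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
    }
    for suffix in ("ed", "tion", "ing", "s"):
        feats["suffix-" + suffix] = word.lower().endswith(suffix)
    return feats

print(word_features("Normalization"))  # fires is_capitalized and suffix-tion
```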

9
Disadvantage of HMMs (2)
  • Generative model
  • Parameter estimation maximizes the joint
    likelihood P(x, y) of training examples
  • Better approach
  • A discriminative model that models P(y|x) directly
  • Maximize the conditional likelihood of training
    examples

10
Maximum Entropy Markov Model
  • Discriminative sub-models
  • Unify the two parameter sets of the generative
    model into one conditional model
  • Two parameters in the generative model:
  • the transition parameter P(yk|yk-1) in the source
    model and the emission parameter P(xk|yk) in the
    noisy channel
  • Unified conditional model: P(yk|yk-1, xk)
  • Employ the maximum entropy principle
  • Maximum Entropy Markov Model (MEMM)

11
General Maximum Entropy Model
  • Model
  • Model the distribution P(Y|X) with a set of
    features f1, f2, …, fl defined on X and Y
  • Idea
  • Collect feature statistics from the training data
  • Assume nothing about the distribution P(Y|X)
    other than the collected statistics
  • Use maximum entropy as the selection criterion

12
Features
  • Features
  • 0-1 indicator functions
  • 1 if (x, y) satisfies a predefined condition
  • 0 otherwise
  • Example: POS tagging (see the sketch below)
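A sketch of one such 0-1 indicator; the specific condition
(-ing suffix paired with the VBG tag) is an illustrative
assumption in place of the slide's own example:

```python
# f(x, y) = 1 if the current word ends in "-ing" and the proposed
# tag is VBG, 0 otherwise. A maximum entropy model is built from
# many such indicator features.
def f_ing_vbg(x, y):
    return 1 if x.endswith("ing") and y == "VBG" else 0

print(f_ing_vbg("running", "VBG"))  # 1
print(f_ing_vbg("running", "NN"))   # 0
```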

13
Constraints
  • Empirical information
  • Feature statistics from the training data T
  • Expected value
  • Under the distribution P(Y|X) we want to model
  • Constraints: the two expectations must match
    (see below)
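In standard maximum entropy notation (following Berger et al.
1996), the empirical expectation, the model expectation, and
the constraint tying them together are:

```latex
\tilde{E}[f_i] = \sum_{(x,y)\in T} \tilde{P}(x,y)\, f_i(x,y),
\qquad
E[f_i] = \sum_{x} \tilde{P}(x) \sum_{y} P(y \mid x)\, f_i(x,y),
\qquad
E[f_i] = \tilde{E}[f_i] \quad (i = 1, \dots, l)
```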

14
Maximum Entropy Objective
  • Entropy of the conditional model
  • Maximization problem: maximize the entropy
    subject to the constraints (see below)
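The standard form of the objective is the conditional entropy,
maximized over all distributions satisfying the constraints:

```latex
H(P) = -\sum_{x} \tilde{P}(x) \sum_{y} P(y \mid x)\, \log P(y \mid x),
\qquad
P^{*} = \arg\max_{P \,:\, E[f_i] = \tilde{E}[f_i]} H(P)
```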

15
Dual Problem
  • Dual Problem
  • The dual is a conditional model of exponential
    form (see below)
  • Equivalent to maximum likelihood estimation on
    the conditional training data
  • Solution
  • Improved iterative scaling (IIS) (Berger et al.
    1996)
  • Generalized iterative scaling (GIS), as used for
    MEMMs by McCallum et al. (2000)
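The dual solution has the familiar exponential (log-linear)
form, with one weight λi per feature and a per-input
normalizer Z(x):

```latex
P_{\Lambda}(y \mid x) = \frac{1}{Z_{\Lambda}(x)}
  \exp\Big( \sum_{i=1}^{l} \lambda_i f_i(x, y) \Big),
\qquad
Z_{\Lambda}(x) = \sum_{y'} \exp\Big( \sum_{i=1}^{l} \lambda_i f_i(x, y') \Big)
```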

16
Maximum Entropy Markov Model
  • Use the maximum entropy approach to model the
    1st-order transitions (see the model form below)
  • Features
  • Basic features (analogous to the HMM parameters)
  • Bigram (1st-order) or trigram (2nd-order)
    features from the source model
  • State-observation pair features (Xk = xk, Yk = yk)
  • Advantage: can incorporate other advanced
    features of (xk, yk)
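Putting this together, the 1st-order MEMM models each
transition with a maximum entropy distribution normalized per
state; following McCallum et al. (2000), there is one such
model per previous state yk-1:

```latex
P(y_k \mid y_{k-1}, x_k) = \frac{1}{Z(x_k, y_{k-1})}
  \exp\Big( \sum_{i} \lambda_i f_i(x_k, y_k) \Big)
```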

17
HMM vs MEMM (1st order)
[Figure: side-by-side dependency graphs of the 1st-order HMM
and the Maximum Entropy Markov Model (MEMM)]
18
Performance in POS Tagging
  • POS tagging
  • Data set: WSJ
  • Features
  • HMM features plus spelling features (like -ed,
    -tion, -s, -ing, etc.)
  • Results (Lafferty et al. 2001)
  • 1st-order HMM
  • 94.31% accuracy, 54.01% accuracy on
    out-of-vocabulary (OOV) words
  • 1st-order MEMM
  • 95.19% accuracy, 73.01% OOV accuracy

19
Different Models for POS tagging
  • HMM
  • Maximum Entropy Markov Models
  • Conditional Random Fields

20
Disadvantage of MEMMs (1)
  • Complex maximum entropy training algorithms
  • Both IIS and GIS are difficult to implement
  • Require many tricks in implementation
  • Slow training
  • Time-consuming when the data set is large
  • Especially so for MEMMs

21
Disadvantage of MEMMs (2)
  • Maximum entropy Markov model
  • Uses a maximum entropy model as a sub-model
  • Entropy is optimized per sub-model, not on the
    global model
  • Label bias problem
  • Conditional models with per-state normalization
  • The effect of observations is weakened for states
    with few outgoing transitions

22
Label Bias Problem
Training data (X, Y): rib → 1 2 3 (three times),
rob → 4 5 6 (twice)
New input: rob
With per-state normalization, states 1 and 4 each have only
one outgoing transition observed in training, so that
transition receives probability 1 whatever the observation;
the decision is made entirely at the first branch, where
r → 1 (3/5) beats r → 4 (2/5), so "rob" is mislabeled
1 2 3 (see the sketch below).
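A small numeric sketch of this effect; the state numbering
follows the example above, and treating the scores as simple
relative frequencies is an assumption:

```python
# Per-state normalized scores on the input "rob".
# From the start state, 'r' led to state 1 (the "rib" path) in 3/5
# of the training sequences and to state 4 (the "rob" path) in 2/5.
p_start = {"1": 3 / 5, "4": 2 / 5}

# States 1 and 4 each have exactly ONE outgoing transition, so after
# per-state normalization the next state gets probability 1 no matter
# what the observation is -- the 'o' in "rob" is effectively ignored.
score_rib_path = p_start["1"] * 1.0 * 1.0  # path 1-2-3: 0.6
score_rob_path = p_start["4"] * 1.0 * 1.0  # path 4-5-6: 0.4
print(score_rib_path > score_rob_path)     # True: "rob" is mislabeled 1-2-3
```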
23
Solution
  • Global optimization
  • Optimize the parameters of one global model
    simultaneously, not sub-models separately
  • Alternatives
  • Conditional random fields
  • Applying the perceptron algorithm to sequence
    labeling

24
Conditional Random Field (CRF) (1)
  • Let G = (V, E) be a graph such that Y = (Yv),
    v ∈ V, is indexed by the vertices of G
  • Then (X, Y) is a conditional random field if,
    conditioned globally on X, the variables Yv obey
    the Markov property with respect to G:
    P(Yv | X, Yw, w ≠ v) = P(Yv | X, Yw, w ~ v),
    where w ~ v means w and v are neighbors in G

25
Conditional Random Field (CRF) (2)
  • Exponential model
  • G is a tree (or more specifically, a chain) whose
    cliques are its edges and vertices
  • Edge features are determined by state
    transitions; vertex features are determined by
    single states (see the formula below)
  • Parameter estimation
  • Maximize the conditional likelihood of training
    examples
  • IIS or GIS
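For a chain-structured G, the exponential model of Lafferty et
al. (2001) has edge features fk and vertex features gk, with a
single global normalization over whole label sequences:

```latex
P_{\theta}(y \mid x) \propto
  \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x)
  \;+\; \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)
```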

26
MEMM vs CRF
  • Similarities
  • Both employ maximum entropy principle
  • Both incorporate rich feature information
  • Differences
  • Conditional random fields are always globally
    conditioned on X, resulting in a globally
    optimized model

27
Performance in POS Tagging
  • POS tagging
  • Data set: WSJ
  • Features
  • HMM features plus spelling features (like -ed,
    -tion, -s, -ing, etc.)
  • Results (Lafferty et al. 2001)
  • 1st-order MEMM
  • 95.19% accuracy, 73.01% OOV accuracy
  • Conditional random fields
  • 95.73% accuracy, 76.24% OOV accuracy

28
Comparison of the three approaches to POS Tagging
  • Results (Lafferty et al. 2001)
  • 1st-order HMM
  • 94.31% accuracy, 54.01% OOV accuracy
  • 1st-order MEMM
  • 95.19% accuracy, 73.01% OOV accuracy
  • Conditional random fields
  • 95.73% accuracy, 76.24% OOV accuracy

29
References
  • A. Berger, S. Della Pietra, and V. Della Pietra
    (1996). A Maximum Entropy Approach to Natural
    Language Processing. Computational Linguistics,
    22(1), 39-71.
  • J. Lafferty, A. McCallum, and F. Pereira (2001).
    Conditional Random Fields: Probabilistic Models
    for Segmenting and Labeling Sequence Data. In
    Proc. ICML-2001, 282-289.
  • A. McCallum, D. Freitag, and F. Pereira (2000).
    Maximum Entropy Markov Models for Information
    Extraction and Segmentation. In Proc. ICML-2000,
    591-598.