Title: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
1. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
- J. Lafferty, A. McCallum, F. Pereira
- Presentation: Inna Weiner
2. Outline
- Labeling sequence data problem
- Classification with probabilistic models
- Generative and Discriminative models
- Why HMMs and MEMMs are not good enough
- Conditional Random Field model
- Experimental Results
3. Labeling Sequence Data Problem
- X is a random variable over data sequences
- Y is a random variable over label sequences
- Y_i is assumed to range over a finite label alphabet A
- The problem: learn how to assign labels from a closed set Y to a data sequence X
4. Labeling Sequence Data Problem
- The lab setup: let a monkey do some behavioral task while recording movement and neural activity
- Motor task: reach to target
- Goal: map neural activity to behavior
- In our notation:
- X: neural data
- Y: hand movements
5. Generative Probabilistic Models
- Learning problem:
- Choose θ to maximize the joint likelihood:
- L(θ) = Σ_i log p_θ(y_i, x_i)
- The goal: maximization of the joint likelihood of training examples
- y* = argmax_y p(y|x) = argmax_y p(y,x)/p(x)
- Needs to enumerate all possible observation sequences
6. Markov Model
- A Markov process or model assumes that we can predict the future based just on the present (or on a limited horizon into the past)
- Let X_1, ..., X_T be a sequence of random variables taking values in {1, ..., N}; then the Markov properties are:
- Limited horizon:
- P(X_{t+1} | X_1, ..., X_t) = P(X_{t+1} | X_t)
- Time invariant (stationary):
- P(X_{t+1} | X_t) = P(X_2 | X_1) for all t
7. Describing a Markov Chain
- A Markov chain can be described by the transition matrix A and the initial probabilities q:
- A_ij = P(X_{t+1} = j | X_t = i)
- q_i = P(X_1 = i)
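As a minimal sketch of this definition (the two-state chain, its transition matrix A, and initial probabilities q below are made-up illustrations, not from the slides), the probability of a state sequence factors as q(s_1) · Π_t A[s_t, s_{t+1}]:

```python
import numpy as np

# A hypothetical 2-state Markov chain: A[i, j] = P(X_{t+1}=j | X_t=i),
# q[i] = P(X_1=i). The numbers are illustrative, not from the slides.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
q = np.array([0.5, 0.5])

def sequence_probability(states, A, q):
    """P(X_1=s_1, ..., X_T=s_T) = q(s_1) * prod_t A[s_t, s_{t+1}]."""
    p = q[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev, nxt]
    return float(p)

print(sequence_probability([0, 0, 1], A, q))  # 0.5 * 0.9 * 0.1 = 0.045
```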
8. Hidden Markov Model
- In a Hidden Markov Model (HMM) we do not observe the sequence that the model passed through (X), but only some probabilistic function of it (Y). Thus, it is a Markov model with the addition of emission probabilities:
- B_ik = P(Y_t = k | X_t = i)
9. The Three Problems of HMM
- Likelihood: given a series of observations y and a model λ = (A, B, q), compute the likelihood p(y | λ)
- Inference: given a series of observations y and a model λ, compute the most likely series of hidden states x
- Learning: given a series of observations, learn the best model λ
10. Likelihood in HMMs
- Given a model λ = (A, B, q), we can compute the likelihood by:
- P(y) = p(y | λ) = Σ_x p(x) p(y|x) = Σ_x q(x_1) Π_t A(x_{t+1} | x_t) Π_t B(y_t | x_t)
- But the complexity of this computation is O(N^T), where each x_i ranges over N states ⇒ impossible in practice
11. Forward-Backward algorithm
- To compute likelihood:
- Need to enumerate over all paths in the lattice (all possible instantiations of X_1 ... X_T). But some starting sub-path (blue) is common to many continuing paths (blue-red)
- The idea: using dynamic programming, calculate a path in terms of shorter sub-paths
12. Forward-Backward algorithm (cont'd)
- We build a matrix of the probability of being at time t in state i: α_t(i) = P(x_t = i, y_1 y_2 ... y_t). Each column is a function of the previous column (forward procedure)
13. Forward-Backward algorithm (cont'd)
- We can similarly define a backward procedure for filling the matrix: β_t(i) = P(y_{t+1} ... y_T | x_t = i)
14. Forward-Backward algorithm (cont'd)
- And we can easily combine:
- P(y, x_t = i) = P(x_t = i, y_1 y_2 ... y_t) · P(y_{t+1} ... y_T | x_t = i) = α_t(i) β_t(i)
- And then we get:
- P(y) = Σ_i P(y, x_t = i) = Σ_i α_t(i) β_t(i)
- Summary: we presented a polynomial algorithm for computing likelihood in HMMs
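The forward and backward procedures above can be sketched in Python. The toy HMM parameters (A, B, q) and the observation sequence are illustrative assumptions, and the brute-force O(N^T) enumeration from slide 10 is included only to check that Σ_i α_t(i) β_t(i) reproduces P(y):

```python
import itertools
import numpy as np

# Toy HMM parameters (illustrative, not from the slides):
# A[i, j] = P(X_{t+1}=j | X_t=i), B[i, k] = P(Y_t=k | X_t=i), q[i] = P(X_1=i).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
q = np.array([0.6, 0.4])

def forward(y):
    """alpha[t, i] = P(x_t = i, y_1 ... y_t)."""
    T, N = len(y), len(q)
    alpha = np.zeros((T, N))
    alpha[0] = q * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
    return alpha

def backward(y):
    """beta[t, i] = P(y_{t+1} ... y_T | x_t = i)."""
    T, N = len(y), len(q)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    return beta

def likelihood(y):
    """P(y) = sum_i alpha[t, i] * beta[t, i], the same for any t; O(T*N^2)."""
    alpha, beta = forward(y), backward(y)
    return float((alpha[0] * beta[0]).sum())

def brute_force(y):
    """Enumerate all N^T hidden paths: the O(N^T) computation from slide 10."""
    T, N = len(y), len(q)
    total = 0.0
    for path in itertools.product(range(N), repeat=T):
        p = q[path[0]] * B[path[0], y[0]]
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], y[t]]
        total += p
    return total

y = [0, 1, 0, 0]
print(likelihood(y), brute_force(y))  # the two numbers should agree
```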
15. HMM: why not?
- Advantages:
- Estimation is very easy
- Closed-form solution
- The parameters can be estimated with relatively high confidence from small samples
- But:
- The model represents all possible (x, y) sequences and defines a joint probability over all possible observation and label sequences ⇒ needless effort
16. Discriminative Probabilistic Models
(figure: generative vs. discriminative graphical models)
- Solve the problem you need to solve:
- The traditional approach inappropriately uses a generative joint model in order to solve a conditional problem in which the observations are given
- To classify we need p(y|x); there's no need to implicitly approximate p(x)
17. Discriminative Models: Estimation
- Choose θ_y to maximize the conditional likelihood:
- L(θ_y) = Σ_i log p_θy(y_i | x_i)
- Estimation usually doesn't have a closed form
- Example: MinMI discriminative approach (2nd week lecture)
18. Maximum Entropy Markov Model
- MEMM: a conditional model that represents the probability of reaching a state given an observation and the previous state
- These conditional probabilities are specified by exponential models based on arbitrary observation features
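A minimal sketch of such a per-state exponential model, with made-up weights and feature vectors: each local distribution P(y | y_prev, x) is a softmax over weighted features, and normalizes to 1 on its own (this per-state normalization is what causes the label bias problem discussed next):

```python
import numpy as np

# Hedged sketch of an MEMM's local conditional model. The weights and
# feature vectors below are illustrative assumptions, not from the slides.
def memm_local(w, feats):
    """P(y | y_prev, x) = exp(w . f(y, y_prev, x)) / Z, a softmax.

    feats[y] is the feature vector f(y, y_prev, x) for candidate label y.
    """
    scores = np.array([w @ f for f in feats])
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

w = np.array([1.0, -0.5])
feats = [np.array([1.0, 0.0]),      # features for candidate label 0
         np.array([0.0, 1.0])]      # features for candidate label 1
p = memm_local(w, feats)
print(p.sum())  # each local distribution sums to 1 on its own
```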
19. The Label Bias Problem
- The probability mass that arrives at a state must be distributed among the possible successor states
- Potential victims: discriminative models
20. The Label Bias Problem: Solutions
- Determinization of the finite-state machine
- Not always possible
- May lead to combinatorial explosion
- Start with a fully connected model and let the training procedure find a good structure
- Prior structural knowledge has proven to be valuable in information extraction tasks
21. Random Field Model: Definition
- Let G = (V, E) be a finite graph, and let A be a finite alphabet
- The configuration space Ω is the set of all labelings of the vertices in V by letters in A
- If C is a subset of V and ω ∈ Ω is a configuration, then ω_C denotes the configuration restricted to C
- A random field on G is a probability distribution on Ω
22. Random Field Model: The Problem
- Assume that a finite number of features can define a class
- The features f_i(ω) are given and fixed
- The goal: estimate the parameters to maximize the likelihood of the training examples
23. Conditional Random Field: Definition
- X: random variable over data sequences
- Y: random variable over label sequences
- Y_i is assumed to range over a finite label alphabet A
- Discriminative approach: we construct a conditional model p(y|x) and do not explicitly model the marginal p(x)
24. CRF: Definition
- Let G = (V, E) be a finite graph, and let A be a finite alphabet
- Y is indexed by the vertices of G
- Then (X, Y) is a conditional random field if the random variables Y_v, conditioned on X, obey the Markov property with respect to the graph:
- p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v),
- where w ~ v means that w and v are neighbors in G
25. CRF on a Simple Chain Graph
- We will handle the case when G is a simple chain: G = (V = {1, ..., m}, E = {(i, i+1)})
(figure: graphical structures of the HMM (generative), MEMM (discriminative), and CRF)
26. Fundamental Theorem of Random Fields (Hammersley-Clifford)
- Assumption: the structure of G is a tree, of which a simple chain is a special case
27. CRF: the Learning Problem
- Assumption: the features f_k and g_k are given and fixed
- For example, a boolean feature g_k is TRUE if the word X_i is upper case and the label Y_i is a noun
- The learning problem:
- We need to determine the parameters θ = (λ_1, λ_2, ...; μ_1, μ_2, ...) from training data D = {(x(i), y(i))} with empirical distribution p̃(x, y)
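The slide's upper-case/noun example can be sketched as boolean feature functions. The signatures (edge features f_k(y_prev, y_cur, x, i) and vertex features g_k(y_cur, x, i)) follow the chain-CRF convention, and the label names and second feature are illustrative assumptions:

```python
# Hedged sketch of the kind of fixed boolean features the slide describes.
def g_capitalized_noun(y_cur, x, i):
    """The slide's example: word x_i is capitalized and label y_i is NOUN."""
    return 1.0 if x[i][0].isupper() and y_cur == "NOUN" else 0.0

def f_det_then_noun(y_prev, y_cur, x, i):
    """An illustrative edge feature: a DET label followed by a NOUN label."""
    return 1.0 if y_prev == "DET" and y_cur == "NOUN" else 0.0

x = ["The", "Dog", "barks"]
print(g_capitalized_noun("NOUN", x, 1))      # fires: "Dog" is capitalized
print(f_det_then_noun("DET", "NOUN", x, 1))  # fires: DET followed by NOUN
```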
28. CRF: Estimation
- And we return to the log-likelihood maximization problem; this time we need to find the θ that maximizes the conditional log-likelihood:
- L(θ) = Σ_i log p_θ(y(i) | x(i))
29. CRF: Estimation
- From now on we assume that the dependencies of Y, conditioned on X, form a chain
- To simplify some expressions, we add special start and stop states Y_0 = start and Y_{n+1} = stop
30. CRF: Estimation
- Suppose that p(Y|X) is a CRF. For each position i in the observation sequence X, we define the |Y|×|Y| matrix random variable M_i(x) = [M_i(y', y | x)] by
- M_i(y', y | x) = exp( Σ_k λ_k f_k(e_i, y|e_i = (y', y), x) + Σ_k μ_k g_k(v_i, y|v_i = y, x) )
- where e_i is the edge with labels (Y_{i-1}, Y_i) and v_i is the vertex with label Y_i
31. CRF: Estimation
- The normalization function Z(x) is the (start, stop) entry of the matrix product: Z(x) = (M_1(x) M_2(x) · · · M_{n+1}(x))_{start, stop}
- The conditional probability of a label sequence y is written as
- p(y | x) = Π_{i=1}^{n+1} M_i(y_{i-1}, y_i | x) / Z(x)
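A hedged sketch of these two formulas: random positive matrices stand in for M_i(y', y | x) = exp(...), Z(x) is the (start, stop) entry of their product, and it equals the sum of unnormalized path scores over all label sequences (entries into start and out of stop are zeroed so that only genuine start → labels → stop paths are counted):

```python
import itertools
import numpy as np

# Illustrative setup, not from the slides: chain length n, binary labels,
# plus start and stop states appended to the label alphabet.
rng = np.random.default_rng(0)
n, n_labels = 3, 2
size = n_labels + 2
START, STOP = n_labels, n_labels + 1

# n + 1 matrices M_1 .. M_{n+1}; positive entries stand in for exp(features).
M = rng.uniform(0.5, 1.5, size=(n + 1, size, size))
M[:, :, START] = 0.0   # nothing may transition back into the start state
M[:, STOP, :] = 0.0    # nothing may leave the stop state

def Z(M):
    """Z(x) = (M_1 M_2 ... M_{n+1})[start, stop]."""
    prod = np.eye(size)
    for Mi in M:
        prod = prod @ Mi
    return float(prod[START, STOP])

def score(y, M):
    """Unnormalized score: product of M_i(y_{i-1}, y_i | x), with
    y_0 = start and y_{n+1} = stop, so p(y|x) = score(y, M) / Z(M)."""
    path = [START] + list(y) + [STOP]
    s = 1.0
    for i, Mi in enumerate(M):
        s *= Mi[path[i], path[i + 1]]
    return float(s)

# Summing the scores of all label sequences recovers Z(x).
total = sum(score(y, M) for y in itertools.product(range(n_labels), repeat=n))
print(total, Z(M))  # the two numbers should agree
```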
32. Parameter Estimation for CRFs
- The parameter vector θ that maximizes the log-likelihood is found using an iterative scaling algorithm
- We define standard HMM-like forward and backward vectors α and β, which allow polynomial-time calculations
- For example, the forward recursion: α_i(x) = α_{i-1}(x) M_i(x), and similarly β_i(x) = M_{i+1}(x) β_{i+1}(x) for the backward vectors
33. Experimental Results: Set 1
- Set 1: modeling label bias
- Data was generated from a simple HMM which encodes a noisy version of the finite-state network (rib / rob)
- Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32
- We train both an MEMM and a CRF
- The observation features are simply the identity of the observation symbols
- 2,000 training and 500 test samples were used
- Results:
- CRF error: 4.6%
- MEMM error: 42%
- Conclusion: the MEMM fails to discriminate between the two branches, and we get the label bias problem
34. Experimental Results: Set 2
- Set 2: modeling mixed-order sources
- Data was generated from a mixed-order HMM with state transition probabilities given by:
- p(y_i | y_{i-1}, y_{i-2}) = a · p2(y_i | y_{i-1}, y_{i-2}) + (1 - a) · p1(y_i | y_{i-1})
- Similarly, emission probabilities are given by:
- p(x_i | y_i, x_{i-1}) = a · p2(x_i | y_i, x_{i-1}) + (1 - a) · p1(x_i | y_i)
- Thus, for a = 0 we have a standard first-order HMM
- For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing
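The interpolation above can be sketched directly. The tables p1 and p2 below are random stochastic matrices standing in for the paper's randomly generated models, with an illustrative mixing weight a:

```python
import numpy as np

# Hedged sketch of the mixed-order transition model:
# p(y_i | y_{i-1}, y_{i-2}) = a * p2(y_i | y_{i-1}, y_{i-2})
#                           + (1 - a) * p1(y_i | y_{i-1}).
rng = np.random.default_rng(1)
n_states, a = 3, 0.5

p1 = rng.dirichlet(np.ones(n_states), size=n_states)              # p1[y_prev]
p2 = rng.dirichlet(np.ones(n_states), size=(n_states, n_states))  # p2[y_prev2][y_prev]

def mixed_transition(y_prev, y_prev2):
    """Convex combination of second-order and first-order transitions."""
    return a * p2[y_prev2, y_prev] + (1 - a) * p1[y_prev]

dist = mixed_transition(0, 1)
print(dist.sum())  # a convex combination of distributions is a distribution
```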
35. Experimental Results: Set 2
36. Experimental Results: Set 3
- Set 3: Part-of-Speech tagging experiments
37. Conclusions
- Conditional random fields offer a unique combination of properties:
- discriminatively trained models for sequence segmentation and labeling
- combination of arbitrary and overlapping observation features from both the past and future
- efficient training and decoding based on dynamic programming for a simple chain graph
- parameter estimation guaranteed to find the global optimum
- CRFs' main current limitation is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient