Title: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
1. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
- J. Lafferty, A. McCallum, F. Pereira
- Presentation: Inna Weiner
2. Outline
- Labeling sequence data problem
- Classification with probabilistic models
- Generative and Discriminative models
- Why HMMs and MEMMs are not good enough
- Conditional Random Field model
- Experimental Results
3. Labeling Sequence Data Problem
- X is a random variable over data sequences
- Y is a random variable over label sequences
- Y_i is assumed to range over a finite label alphabet A
- The problem: learn how to assign labels from a closed set Y to a data sequence X
4. Labeling Sequence Data Problem
- The lab setup: let a monkey do some behavioral task while recording movement and neural activity
- Motor task: reach to target
- Goal: map neural activity to behavior
- In our notation:
- X: neural data
- Y: hand movements
5. Generative Probabilistic Models
- Learning problem:
- Choose θ to maximize the joint likelihood:
- L(θ) = Σ_i log p_θ(y_i, x_i)
- The goal: maximization of the joint likelihood of training examples
- y* = argmax_y p(y|x) = argmax_y p(y,x)/p(x)
- Needs to enumerate all possible observation sequences
6. Markov Model
- A Markov process or model assumes that we can predict the future based just on the present (or on a limited horizon into the past)
- Let X_1, ..., X_T be a sequence of random variables taking values in {1, ..., N}; then the Markov properties are:
- Limited horizon:
- P(X_{t+1} | X_1, ..., X_t) = P(X_{t+1} | X_t)
- Time invariant (stationary):
- P(X_{t+1} | X_t) = P(X_2 | X_1) for all t
7. Describing a Markov Chain
- A Markov chain can be described by the transition matrix A and the initial probabilities q:
- A_ij = P(X_{t+1} = j | X_t = i)
- q_i = P(X_1 = i)
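As a minimal sketch of this definition (the two-state chain, its transition matrix A, and initial probabilities q below are made-up illustrations, not from the slides), the probability of a state sequence factors as q(s_1) · Π_t A[s_t, s_{t+1}]:

```python
import numpy as np

# A hypothetical 2-state Markov chain: A[i, j] = P(X_{t+1}=j | X_t=i),
# q[i] = P(X_1=i). The numbers are illustrative, not from the slides.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
q = np.array([0.5, 0.5])

def sequence_probability(states, A, q):
    """P(X_1=s_1, ..., X_T=s_T) = q(s_1) * prod_t A[s_t, s_{t+1}]."""
    p = q[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev, nxt]
    return float(p)

print(sequence_probability([0, 0, 1], A, q))  # 0.5 * 0.9 * 0.1 = 0.045
```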
8. Hidden Markov Model
- In a Hidden Markov Model (HMM) we do not observe the sequence that the model passed through (X), but only some probabilistic function of it (Y). Thus, it is a Markov model with the addition of emission probabilities:
- B_ik = P(Y_t = k | X_t = i)
9. The Three Problems of HMM
- Likelihood: given a series of observations y and a model λ = (A, B, q), compute the likelihood p(y | λ)
- Inference: given a series of observations y and a model λ, compute the most likely series of hidden states x
- Learning: given a series of observations, learn the best model λ
10. Likelihood in HMMs
- Given a model λ = (A, B, q), we can compute the likelihood by:
- P(y) = p(y | λ) = Σ_x p(x) p(y|x) = Σ_x q(x_1) Π_t A(x_{t+1} | x_t) Π_t B(y_t | x_t)
- But the complexity of this computation is O(N^T), where each x_i ranges over N states ⇒ impossible in practice
11. Forward-Backward algorithm
- To compute likelihood:
- Need to enumerate over all paths in the lattice (all possible instantiations of X_1 ... X_T). But some starting sub-path (blue) is common to many continuing paths (blue-red)
- The idea: using dynamic programming, calculate a path in terms of shorter sub-paths
12. Forward-Backward algorithm (cont'd)
- We build a matrix of the probability of being at time t in state i: α_t(i) = P(x_t = i, y_1 y_2 ... y_t). Each column is a function of the previous column (forward procedure)
13. Forward-Backward algorithm (cont'd)
- We can similarly define a backward procedure for filling the matrix: β_t(i) = P(y_{t+1} ... y_T | x_t = i)
14. Forward-Backward algorithm (cont'd)
- And we can easily combine:
- P(y, x_t = i) = P(x_t = i, y_1 y_2 ... y_t) · P(y_{t+1} ... y_T | x_t = i) = α_t(i) β_t(i)
- And then we get:
- P(y) = Σ_i P(y, x_t = i) = Σ_i α_t(i) β_t(i)
- Summary: we presented a polynomial algorithm for computing likelihood in HMMs
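The forward and backward procedures above can be sketched in Python. The toy HMM parameters (A, B, q) and the observation sequence are illustrative assumptions, and the brute-force O(N^T) enumeration from slide 10 is included only to check that Σ_i α_t(i) β_t(i) reproduces P(y):

```python
import itertools
import numpy as np

# Toy HMM parameters (illustrative, not from the slides):
# A[i, j] = P(X_{t+1}=j | X_t=i), B[i, k] = P(Y_t=k | X_t=i), q[i] = P(X_1=i).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
q = np.array([0.6, 0.4])

def forward(y):
    """alpha[t, i] = P(x_t = i, y_1 ... y_t)."""
    T, N = len(y), len(q)
    alpha = np.zeros((T, N))
    alpha[0] = q * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
    return alpha

def backward(y):
    """beta[t, i] = P(y_{t+1} ... y_T | x_t = i)."""
    T, N = len(y), len(q)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    return beta

def likelihood(y):
    """P(y) = sum_i alpha[t, i] * beta[t, i], the same for any t; O(T*N^2)."""
    alpha, beta = forward(y), backward(y)
    return float((alpha[0] * beta[0]).sum())

def brute_force(y):
    """Enumerate all N^T hidden paths: the O(N^T) computation from slide 10."""
    T, N = len(y), len(q)
    total = 0.0
    for path in itertools.product(range(N), repeat=T):
        p = q[path[0]] * B[path[0], y[0]]
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], y[t]]
        total += p
    return total

y = [0, 1, 0, 0]
print(likelihood(y), brute_force(y))  # the two numbers should agree
```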
15. HMM: why not?
- Advantages:
- Estimation is very easy
- Closed-form solution
- The parameters can be estimated with relatively high confidence from small samples
- But:
- The model represents all possible (x, y) sequences and defines a joint probability over all possible observation and label sequences ⇒ needless effort
16. Discriminative Probabilistic Models
(figure: generative vs. discriminative graphical models)
- Solve the problem you need to solve:
- The traditional approach inappropriately uses a generative joint model in order to solve a conditional problem in which the observations are given
- To classify we need p(y|x); there's no need to implicitly approximate p(x)
17. Discriminative Models: Estimation
- Choose θ_y to maximize the conditional likelihood:
- L(θ_y) = Σ_i log p_θy(y_i | x_i)
- Estimation usually doesn't have a closed form
- Example: MinMI discriminative approach (2nd week lecture)
18. Maximum Entropy Markov Model
- MEMM: a conditional model that represents the probability of reaching a state given an observation and the previous state
- These conditional probabilities are specified by exponential models based on arbitrary observation features
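A minimal sketch of such a per-state exponential model, with made-up weights and feature vectors: each local distribution P(y | y_prev, x) is a softmax over weighted features, and normalizes to 1 on its own (this per-state normalization is what causes the label bias problem discussed next):

```python
import numpy as np

# Hedged sketch of an MEMM's local conditional model. The weights and
# feature vectors below are illustrative assumptions, not from the slides.
def memm_local(w, feats):
    """P(y | y_prev, x) = exp(w . f(y, y_prev, x)) / Z, a softmax.

    feats[y] is the feature vector f(y, y_prev, x) for candidate label y.
    """
    scores = np.array([w @ f for f in feats])
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

w = np.array([1.0, -0.5])
feats = [np.array([1.0, 0.0]),      # features for candidate label 0
         np.array([0.0, 1.0])]      # features for candidate label 1
p = memm_local(w, feats)
print(p.sum())  # each local distribution sums to 1 on its own
```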
19. The Label Bias Problem
- The probability mass that arrives at a state must be distributed among the possible successor states
- Potential victims: discriminative models
20. The Label Bias Problem: Solutions
- Determinization of the finite-state machine
- Not always possible
- May lead to combinatorial explosion
- Start with a fully connected model and let the training procedure find a good structure
- Prior structural knowledge has proven to be valuable in information extraction tasks
21. Random Field Model: Definition
- Let G = (V, E) be a finite graph, and let A be a finite alphabet
- The configuration space Ω is the set of all labelings of the vertices in V by letters in A
- If C is a subset of V and ω ∈ Ω is a configuration, then ω_C denotes the configuration restricted to C
- A random field on G is a probability distribution on Ω
22. Random Field Model: The Problem
- Assume that a finite number of features can define a class
- The features f_i(ω) are given and fixed
- The goal: estimate the parameters to maximize the likelihood of the training examples
23. Conditional Random Field: Definition
- X: random variable over data sequences
- Y: random variable over label sequences
- Y_i is assumed to range over a finite label alphabet A
- Discriminative approach: we construct a conditional model p(y|x) and do not explicitly model the marginal p(x)
24. CRF: Definition
- Let G = (V, E) be a finite graph, and let A be a finite alphabet
- Y is indexed by the vertices of G
- Then (X, Y) is a conditional random field if the random variables Y_v, conditioned on X, obey the Markov property with respect to the graph:
- p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v),
- where w ~ v means that w and v are neighbors in G
25. CRF on a Simple Chain Graph
- We will handle the case when G is a simple chain: G = (V = {1, ..., m}, E = {(i, i+1)})
(figure: graphical structures of the HMM (generative), MEMM (discriminative), and CRF)
26. Fundamental Theorem of Random Fields (Hammersley-Clifford)
- Assumption: the structure of G is a tree, of which a simple chain is a special case
27. CRF: the Learning Problem
- Assumption: the features f_k and g_k are given and fixed
- For example, a boolean feature g_k is TRUE if the word X_i is upper case and the label Y_i is a noun
- The learning problem:
- We need to determine the parameters θ = (λ_1, λ_2, ...; μ_1, μ_2, ...) from training data D = {(x(i), y(i))} with empirical distribution p̃(x, y)
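The slide's upper-case/noun example can be sketched as boolean feature functions. The signatures (edge features f_k(y_prev, y_cur, x, i) and vertex features g_k(y_cur, x, i)) follow the chain-CRF convention, and the label names and second feature are illustrative assumptions:

```python
# Hedged sketch of the kind of fixed boolean features the slide describes.
def g_capitalized_noun(y_cur, x, i):
    """The slide's example: word x_i is capitalized and label y_i is NOUN."""
    return 1.0 if x[i][0].isupper() and y_cur == "NOUN" else 0.0

def f_det_then_noun(y_prev, y_cur, x, i):
    """An illustrative edge feature: a DET label followed by a NOUN label."""
    return 1.0 if y_prev == "DET" and y_cur == "NOUN" else 0.0

x = ["The", "Dog", "barks"]
print(g_capitalized_noun("NOUN", x, 1))      # fires: "Dog" is capitalized
print(f_det_then_noun("DET", "NOUN", x, 1))  # fires: DET followed by NOUN
```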
28. CRF: Estimation
- And we return to the log-likelihood maximization problem; this time we need to find the θ that maximizes the conditional log-likelihood:
- L(θ) = Σ_i log p_θ(y(i) | x(i))
29. CRF: Estimation
- From now on we assume that the dependencies of Y, conditioned on X, form a chain
- To simplify some expressions, we add special start and stop states Y_0 = start and Y_{n+1} = stop
30. CRF: Estimation
- Suppose that p(Y|X) is a CRF. For each position i in the observation sequence X, we define the |Y|×|Y| matrix random variable M_i(x) = [M_i(y', y | x)] by
- M_i(y', y | x) = exp( Σ_k λ_k f_k(e_i, y|e_i = (y', y), x) + Σ_k μ_k g_k(v_i, y|v_i = y, x) )
- where e_i is the edge with labels (Y_{i-1}, Y_i) and v_i is the vertex with label Y_i
31. CRF: Estimation
- The normalization function Z(x) is the (start, stop) entry of the matrix product: Z(x) = (M_1(x) M_2(x) · · · M_{n+1}(x))_{start, stop}
- The conditional probability of a label sequence y is written as
- p(y | x) = Π_{i=1}^{n+1} M_i(y_{i-1}, y_i | x) / Z(x)
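A hedged sketch of these two formulas: random positive matrices stand in for M_i(y', y | x) = exp(...), Z(x) is the (start, stop) entry of their product, and it equals the sum of unnormalized path scores over all label sequences (entries into start and out of stop are zeroed so that only genuine start → labels → stop paths are counted):

```python
import itertools
import numpy as np

# Illustrative setup, not from the slides: chain length n, binary labels,
# plus start and stop states appended to the label alphabet.
rng = np.random.default_rng(0)
n, n_labels = 3, 2
size = n_labels + 2
START, STOP = n_labels, n_labels + 1

# n + 1 matrices M_1 .. M_{n+1}; positive entries stand in for exp(features).
M = rng.uniform(0.5, 1.5, size=(n + 1, size, size))
M[:, :, START] = 0.0   # nothing may transition back into the start state
M[:, STOP, :] = 0.0    # nothing may leave the stop state

def Z(M):
    """Z(x) = (M_1 M_2 ... M_{n+1})[start, stop]."""
    prod = np.eye(size)
    for Mi in M:
        prod = prod @ Mi
    return float(prod[START, STOP])

def score(y, M):
    """Unnormalized score: product of M_i(y_{i-1}, y_i | x), with
    y_0 = start and y_{n+1} = stop, so p(y|x) = score(y, M) / Z(M)."""
    path = [START] + list(y) + [STOP]
    s = 1.0
    for i, Mi in enumerate(M):
        s *= Mi[path[i], path[i + 1]]
    return float(s)

# Summing the scores of all label sequences recovers Z(x).
total = sum(score(y, M) for y in itertools.product(range(n_labels), repeat=n))
print(total, Z(M))  # the two numbers should agree
```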
32. Parameter Estimation for CRFs
- The parameter vector θ that maximizes the log-likelihood is found using an iterative scaling algorithm
- We define standard HMM-like forward and backward vectors α and β, which allow polynomial-time calculations
- For example, the forward recursion: α_i(x) = α_{i-1}(x) M_i(x), and similarly β_i(x) = M_{i+1}(x) β_{i+1}(x) for the backward vectors
33. Experimental Results: Set 1
- Set 1: modeling label bias
- Data was generated from a simple HMM which encodes a noisy version of the finite-state network (rib / rob)
- Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32
- We train both an MEMM and a CRF
- The observation features are simply the identity of the observation symbols
- 2,000 training and 500 test samples were used
- Results:
- CRF error: 4.6%
- MEMM error: 42%
- Conclusion: the MEMM fails to discriminate between the two branches, and we get the label bias problem
34. Experimental Results: Set 2
- Set 2: modeling mixed-order sources
- Data was generated from a mixed-order HMM with state transition probabilities given by:
- p(y_i | y_{i-1}, y_{i-2}) = a · p2(y_i | y_{i-1}, y_{i-2}) + (1 - a) · p1(y_i | y_{i-1})
- Similarly, emission probabilities are given by:
- p(x_i | y_i, x_{i-1}) = a · p2(x_i | y_i, x_{i-1}) + (1 - a) · p1(x_i | y_i)
- Thus, for a = 0 we have a standard first-order HMM
- For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing
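The interpolation above can be sketched directly. The tables p1 and p2 below are random stochastic matrices standing in for the paper's randomly generated models, with an illustrative mixing weight a:

```python
import numpy as np

# Hedged sketch of the mixed-order transition model:
# p(y_i | y_{i-1}, y_{i-2}) = a * p2(y_i | y_{i-1}, y_{i-2})
#                           + (1 - a) * p1(y_i | y_{i-1}).
rng = np.random.default_rng(1)
n_states, a = 3, 0.5

p1 = rng.dirichlet(np.ones(n_states), size=n_states)              # p1[y_prev]
p2 = rng.dirichlet(np.ones(n_states), size=(n_states, n_states))  # p2[y_prev2][y_prev]

def mixed_transition(y_prev, y_prev2):
    """Convex combination of second-order and first-order transitions."""
    return a * p2[y_prev2, y_prev] + (1 - a) * p1[y_prev]

dist = mixed_transition(0, 1)
print(dist.sum())  # a convex combination of distributions is a distribution
```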
35. Experimental Results: Set 2
36. Experimental Results: Set 3
- Set 3: Part-of-Speech tagging experiments
37. Conclusions
- Conditional random fields offer a unique combination of properties:
- discriminatively trained models for sequence segmentation and labeling
- combination of arbitrary and overlapping observation features from both the past and future
- efficient training and decoding based on dynamic programming for a simple chain graph
- parameter estimation guaranteed to find the global optimum
- CRFs' main current limitation is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient