Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

Transcript and Presenter's Notes

1
Conditional Random Fields: Probabilistic Models
for Segmenting and Labeling Sequence Data
  • J. Lafferty, A. McCallum, F. Pereira
  • Presentation: Inna Weiner

2
Outline
  • The labeling sequence data problem
  • Classification with probabilistic models:
    generative and discriminative
  • Why HMMs and MEMMs are not good enough
  • The Conditional Random Field model
  • Experimental results

3
Labeling Sequence Data Problem
  • X is a random variable over data sequences
  • Y is a random variable over label sequences
  • Each Yi is assumed to range over a finite label
    alphabet A
  • The problem: learn how to assign labels from a
    closed set Y to a data sequence X

4
Labeling Sequence Data Problem
  • The lab setup: let a monkey do a behavioral
    task while recording movement and neural activity
  • Motor task: reach to a target
  • Goal: map neural activity to behavior
  • In our notation:
  • X = neural data
  • Y = hand movements

5
Generative Probabilistic Models
  • Learning problem:
  • Choose Θ to maximize the joint likelihood of the
    training examples:
    $L(\Theta) = \sum_i \log p_\Theta(y_i, x_i)$
  • Prediction: $y^* = \arg\max_y p(y \mid x)
    = \arg\max_y p(y, x) / p(x)$
  • Needs to enumerate all possible observation
    sequences

6
Markov Model
  • A Markov process or model assumes that we can
    predict the future based just on the present (or
    on a limited horizon into the past)
  • Let X1, ..., XT be a sequence of random variables
    taking values in {1, ..., N}; then the Markov
    properties are:
  • Limited horizon:
    $P(X_{t+1} \mid X_1, \ldots, X_t) = P(X_{t+1} \mid X_t)$
  • Time invariant (stationary):
    $P(X_{t+1} \mid X_t) = P(X_2 \mid X_1)$

7
Describing a Markov Chain
  • A Markov chain can be described by the transition
    matrix A and the initial probabilities q:
  • $A_{ij} = P(X_{t+1} = j \mid X_t = i)$
  • $q_i = P(X_1 = i)$

8
Hidden Markov Model
  • In a Hidden Markov Model (HMM) we do not observe
    the sequence that the model passed through (X)
    but only some probabilistic function of it (Y).
    Thus, it is a Markov model with the addition of
    emission probabilities
  • $B_{ik} = P(Y_t = k \mid X_t = i)$
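
As a concrete illustration (not from the slides), the HMM parameters A, B
and q can be stored as arrays; the following Python sketch uses made-up
numbers for a 2-state, 2-symbol model:

    import numpy as np

    # Transition matrix: A[i, j] = P(X_{t+1} = j | X_t = i)
    A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])

    # Emission matrix: B[i, k] = P(Y_t = k | X_t = i)
    B = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

    # Initial state probabilities: q[i] = P(X_1 = i)
    q = np.array([0.5, 0.5])

    # Sanity check: each row of A and B, and q itself, sums to 1
    assert np.allclose(A.sum(axis=1), 1.0)
    assert np.allclose(B.sum(axis=1), 1.0)
    assert np.isclose(q.sum(), 1.0)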

9
The Three Problems of HMM
  • Likelihood: given a series of observations y and
    a model λ = (A, B, q), compute the likelihood
    p(y | λ)
  • Inference: given a series of observations y and a
    model λ, compute the most likely series of hidden
    states x
  • Learning: given a series of observations, learn
    the best model λ

10
Likelihood in HMMs
  • Given a model λ = (A, B, q), we can compute the
    likelihood by summing over all hidden state
    sequences:
    $P(y) = p(y \mid \lambda) = \sum_x p(x)\, p(y \mid x)
      = \sum_x q(x_1) \prod_t A(x_{t+1} \mid x_t) \prod_t B(y_t \mid x_t)$
  • But the complexity of this computation is O(N^T),
    where N is the number of states, so it is
    impossible in practice (see the brute-force sketch
    below)
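
To make the blow-up explicit, here is a minimal brute-force sketch of that
sum in Python (a hypothetical helper, reusing the A, B, q arrays defined
above); it visits every one of the N^T hidden paths:

    from itertools import product
    import numpy as np

    def brute_force_likelihood(y, A, B, q):
        """P(y) by explicit enumeration of all N**T hidden paths."""
        N, T = A.shape[0], len(y)
        total = 0.0
        for x in product(range(N), repeat=T):   # every hidden state sequence
            p = q[x[0]] * B[x[0], y[0]]          # q(x1) * B(y1 | x1)
            for t in range(1, T):
                p *= A[x[t - 1], x[t]] * B[x[t], y[t]]
            total += p
        return total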

11
Forward-Backward algorithm
  • To compute the likelihood we need to enumerate
    all paths in the lattice (all possible
    instantiations of X1...XT). But some starting
    sub-path (blue) is common to many continuing
    paths (blue + red)

The idea: using dynamic programming, calculate a
path in terms of shorter sub-paths
12
Forward-Backward algorithm (cont'd)
  • We build a matrix of the probability of being at
    state i at time t:
    $\alpha_t(i) = P(x_t = i, y_1 y_2 \ldots y_t)$.
    Each column is a function of the previous column
    (the forward procedure)

13
Forward-Backward algorithm (cont'd)
  • We can similarly define a backward procedure for
    filling the matrix:
    $\beta_t(i) = P(y_{t+1} \ldots y_T \mid x_t = i)$

14
Forward-Backward algorithm (cont'd)
  • And we can easily combine:
    $P(y, x_t = i) = P(x_t = i, y_1 y_2 \ldots y_t)\,
      P(y_{t+1} \ldots y_T \mid x_t = i) = \alpha_t(i)\, \beta_t(i)$
  • And then we get:
    $P(y) = \sum_i P(y, x_t = i) = \sum_i \alpha_t(i)\, \beta_t(i)$
  • Summary: we presented a polynomial-time algorithm
    for computing the likelihood in HMMs (a sketch
    follows below)
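
A compact sketch of both recursions in Python (same hypothetical A, B, q
arrays as before; this is an illustration, not code from the presentation):

    import numpy as np

    def forward_backward(y, A, B, q):
        """Return alpha and beta matrices of shape (T, N)."""
        N, T = A.shape[0], len(y)
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))

        # Forward: alpha[t, i] = P(x_t = i, y_1..y_t)
        alpha[0] = q * B[:, y[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]

        # Backward: beta[t, i] = P(y_{t+1}..y_T | x_t = i)
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])

        return alpha, beta

    # Likelihood: identical for every t, e.g. t = 0
    # P_y = (alpha[0] * beta[0]).sum()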

15
HMM: why not?
  • Advantages:
  • Estimation is very easy
  • Closed-form solution
  • The parameters can be estimated with relatively
    high confidence from small samples
  • But:
  • The model represents all possible (x, y) sequences
    and defines a joint probability over all possible
    observation and label sequences, which is needless
    effort

16
Discriminative Probabilistic Models
Generative
Discriminative
  • Solve the problem you need to solve
  • The traditional approach inappropriately uses a
    generative joint model in order to solve a
    conditional problem in which the observations are
    given
  • To classify we need p(y|x); there is no need to
    implicitly approximate p(x)

17
Discriminative Models - Estimation
  • Choose Θy to maximize the conditional likelihood:
    $L(\Theta_y) = \sum_i \log p_{\Theta_y}(y_i \mid x_i)$
  • Estimation usually does not have a closed form
  • Example: the MinMI discriminative approach (2nd
    week's lecture)

18
Maximum Entropy Markov Model
  • MEMM:
  • A conditional model that represents the
    probability of reaching a state given an
    observation and the previous state
  • These conditional probabilities are specified by
    exponential models based on arbitrary observation
    features (a typical form is sketched below)
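
For concreteness, a per-state exponential model of the kind used by MEMMs
can be written as follows (notation reconstructed here, not copied from the
slide):

    % MEMM: conditional distribution over the current state,
    % given the previous state and the current observation
    P(y_i \mid y_{i-1}, x_i)
      = \frac{1}{Z(x_i, y_{i-1})}
        \exp\Big( \sum_k \lambda_k \, f_k(x_i, y_i) \Big),
    \qquad
    Z(x_i, y_{i-1}) = \sum_{y'} \exp\Big( \sum_k \lambda_k \, f_k(x_i, y') \Big)

Each source state thus has its own locally normalized distribution, which is
what makes the label bias problem (next slide) possible.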

19
The Label Bias Problem
  • The probability mass that arrives at a state must
    be distributed among the possible successor states
  • Potential victims: discriminative models

20
The Label Bias Problem: Solutions
  • Determinization of the finite-state machine:
  • Not always possible
  • May lead to combinatorial explosion
  • Start with a fully connected model and let the
    training procedure find a good structure:
  • But prior structural knowledge has proven to be
    valuable in information extraction tasks

21
Random Field Model: Definition
  • Let G = (V, E) be a finite graph, and let A be a
    finite alphabet
  • The configuration space Ω is the set of all
    labelings of the vertices in V by letters in A
  • If C is a subset of V and ω ∈ Ω is a
    configuration, then ωC denotes the configuration
    restricted to C
  • A random field on G is a probability distribution
    on Ω

22
Random Field Model: The Problem
  • Assume that a finite number of features can
    define a class
  • The features fi(ω) are given and fixed
  • The goal: estimate θ to maximize the likelihood of
    the training examples (the assumed exponential form
    is sketched below)
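
The slide does not spell the model out; the standard exponential (Gibbs)
form assumed in this setting, with fixed features fi and parameters θi, is:

    % Random field with exponential form and normalizer Z_theta
    p_\theta(\omega) = \frac{1}{Z_\theta}
      \exp\Big( \sum_i \theta_i \, f_i(\omega) \Big),
    \qquad
    Z_\theta = \sum_{\omega \in \Omega} \exp\Big( \sum_i \theta_i \, f_i(\omega) \Big)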

23
Conditional Random Field: Definition
  • X: random variable over data sequences
  • Y: random variable over label sequences
  • Each Yi is assumed to range over a finite label
    alphabet A
  • Discriminative approach: we construct a
    conditional model p(y|x) and do not explicitly
    model the marginal p(x)

24
CRF - Definition
  • Let G = (V, E) be a finite graph, and let A be a
    finite alphabet
  • Y is indexed by the vertices of G
  • Then (X, Y) is a conditional random field if the
    random variables Yv, conditioned on X, obey the
    Markov property with respect to the graph:
    $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$,
    where w ∼ v means that w and v are neighbors in G

25
CRF on Simple Chain Graph
  • We will handle the case where G is a simple chain:
    G = (V = {1, ..., m}, E = {(i, i+1)})

HMM (Generative)
MEMM (Discriminative)
CRF
26
Fundamental Theorem of Random Fields
(Hammersley-Clifford)
  • Assumption:
  • The structure of G is a tree, of which a simple
    chain is a special case (the resulting chain form
    is sketched below)
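
By the Hammersley-Clifford theorem the distribution then factorizes over
the edges and vertices of G; for the chain case the form used in the paper
is (notation reconstructed here):

    % Chain CRF: edge features f_k and vertex features g_k
    p_\theta(y \mid x) \propto
      \exp\Big(
          \sum_{e \in E,\, k} \lambda_k \, f_k(e, y|_e, x)
        + \sum_{v \in V,\, k} \mu_k \, g_k(v, y|_v, x)
      \Big)

where y|_e and y|_v denote the label sequence restricted to the endpoints of
edge e and to vertex v, respectively.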

27
CRF: the Learning Problem
  • Assumption: the features fk and gk are given and
    fixed
  • For example, a boolean vertex feature gk is TRUE
    if the word Xi is upper case and the label Yi is
    "noun" (a toy version is coded below)
  • The learning problem:
  • We need to determine the parameters
    Θ = (λ1, λ2, ...; μ1, μ2, ...) from training
    data D = {(x(i), y(i))} with empirical
    distribution p̃(x, y)
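
A toy version of such features written as Python functions (the actual
feature set used in the experiments is not given on this slide):

    def g_uppercase_noun(y_i, x, i):
        """Vertex feature g_k: 1 if word x[i] starts with an upper-case
        letter and its label y_i is 'NOUN', else 0."""
        return 1.0 if x[i][:1].isupper() and y_i == "NOUN" else 0.0

    def f_det_noun(y_prev, y_i, x, i):
        """Edge feature f_k: 1 if the previous label is 'DET' and the
        current label is 'NOUN', else 0."""
        return 1.0 if y_prev == "DET" and y_i == "NOUN" else 0.0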

28
CRF Estimation
  • And we return to the log-likelihood maximization
    problem; this time we need to find the Θ that
    maximizes the conditional log-likelihood
    $L(\Theta) = \sum_i \log p_\Theta(y^{(i)} \mid x^{(i)})$

29
CRF Estimation
  • From now on we assume that the dependencies of Y,
    conditioned on X, form a chain
  • To simplify some expressions, we add special start
    and stop states Y0 = start and Yn+1 = stop

30
CRF Estimation
  • Suppose that p(Y|X) is a CRF. For each position i
    in the observation sequence X, we define the
    |Y| × |Y| matrix random variable
    Mi(x) = [Mi(y', y | x)] by the expression
    sketched below

ei is the edge with labels (Yi-1, Yi) and vi is
the vertex with label Yi
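
The defining expression, reconstructed from the paper's notation (the slide
shows it as an image), is:

    % Per-position transition matrix of the chain CRF
    M_i(y', y \mid x) =
      \exp\Big(
          \sum_k \lambda_k \, f_k(e_i, Y|_{e_i} = (y', y), x)
        + \sum_k \mu_k \, g_k(v_i, Y|_{v_i} = y, x)
      \Big)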
31
CRF Estimation
  • The normalization function Z(x) and the
    conditional probability of a label sequence y are
    both written in terms of the matrices Mi(x), as
    sketched below
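
Reconstructed from the paper (shown on the slide as images):

    % Normalization: the (start, stop) entry of the product of the M_i
    Z(x) = \Big( M_1(x)\, M_2(x) \cdots M_{n+1}(x) \Big)_{\text{start},\,\text{stop}}

    % Conditional probability of a label sequence y
    p_\theta(y \mid x) =
      \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{Z(x)}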

32
Parameter Estimation for CRFs
  • The parameter vector Θ that maximizes the
    log-likelihood is found using an iterative scaling
    algorithm
  • We define standard HMM-like forward and backward
    vectors α and β, which allow polynomial-time
    calculations
  • For example, the recursions sketched below
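
The forward and backward recursions, reconstructed from the paper's
definitions (the slide shows them as an image):

    % Forward vectors over labels, started at the special start state
    \alpha_i(x) = \alpha_{i-1}(x)\, M_i(x),
    \qquad
    \alpha_0(y \mid x) = \begin{cases} 1 & y = \text{start} \\ 0 & \text{otherwise} \end{cases}

    % Backward vectors, anchored at the special stop state
    \beta_i(x)^{\top} = M_{i+1}(x)\, \beta_{i+1}(x)^{\top},
    \qquad
    \beta_{n+1}(y \mid x) = \begin{cases} 1 & y = \text{stop} \\ 0 & \text{otherwise} \end{cases}

    % In particular, Z(x) = \alpha_{n+1}(\text{stop} \mid x)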

33
Experimental Results: Set 1
  • Set 1: modeling label bias
  • Data was generated from a simple HMM which
    encodes a noisy version of the finite-state
    network (rib / rob)
  • Each state emits its designated symbol with
    probability 29/32 and any of the other symbols
    with probability 1/32
  • We train both an MEMM and a CRF
  • The observation features are simply the identity
    of the observation symbols
  • 2,000 training and 500 test samples were used
  • Results:
  • CRF error: 4.6%
  • MEMM error: 42%
  • Conclusion:
  • The MEMM fails to discriminate between the two
    branches and we get the label bias problem

34
Experimental Results: Set 2
  • Set 2: modeling mixed-order sources
  • Data was generated from a mixed-order HMM with
    state transition probabilities given by
    $p(y_i \mid y_{i-1}, y_{i-2})
      = \alpha\, p_2(y_i \mid y_{i-1}, y_{i-2})
      + (1 - \alpha)\, p_1(y_i \mid y_{i-1})$
  • Similarly, emission probabilities are given by
    $p(x_i \mid y_i, x_{i-1})
      = \alpha\, p_2(x_i \mid y_i, x_{i-1})
      + (1 - \alpha)\, p_1(x_i \mid y_i)$
  • Thus, for α = 0 we have a standard first-order
    HMM
  • For each randomly generated model, a sample of
    1,000 sequences of length 25 was generated for
    training and testing

35
Experimental Results: Set 2
36
Experimental Results: Set 3
  • Set 3: part-of-speech tagging experiments

37
Conclusions
  • Conditional random fields offer a unique
    combination of properties:
  • discriminatively trained models for sequence
    segmentation and labeling
  • combination of arbitrary and overlapping
    observation features from both the past and the
    future
  • efficient training and decoding based on dynamic
    programming for a simple chain graph
  • parameter estimation guaranteed to find the
    global optimum
  • CRFs' main current limitation is the slow
    convergence of the training algorithm relative to
    MEMMs, let alone to HMMs, for which training on
    fully observed data is very efficient

38
  • Thank you