1
Conditional Random Fields
  • William W. Cohen
  • Feb 13, 2007

2
One winter day in a certain unnamed small college
town, there was a snowstorm of such epic
proportions that many roads were closed down.
However, one stalwart and dedicated student
decided to make the trek to class anyway. Because
of the treacherous conditions, she arrived at the
lecture hall forty minutes late, only to find the
room empty except for the professor, busy
lecturing, and one other classmate. She took the
seat next to him. After a few minutes, she leaned
over and asked her fellow student, "What's the
prof talking about?" The other student replied,
"How should I know? I only got here ten minutes
before you."
- Lillian Lee, Cornell CS
3
Announcements
  • This week
  • Office hours: Fri 10:30-12:00
  • Lecture 1: Sha & Pereira; Lafferty et al. 2001;
    Klein and Manning
  • Lecture 2: Stacked Sequential Learning
  • Three student presentations

4
Review: motivation for CMMs
Ideally we would like to use many, arbitrary,
overlapping features of words:
  • identity of word
  • ends in "-ski"
  • is capitalized
  • is part of a noun phrase
  • is in a list of city names
  • is under node X in WordNet
  • is in bold font
  • is indented
  • is in hyperlink anchor

[Figure: chain of states S_{t-1}, S_t, S_{t+1} over
observations O_{t-1}, O_t, O_{t+1}; example features of
an observed word: "is Wisniewski", "part of noun phrase",
"ends in -ski"]
5
Motivation for CMMs
[Figure: chain of states S_{t-1}, S_t, S_{t+1} over
observations O_{t-1}, O_t, O_{t+1}, with the same
overlapping word features as on the previous slide]
Idea: replace the generative model in the HMM with a
maxent model, where the state depends on the observations
and the previous state.
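A minimal sketch of the idea above, assuming a B/I/O label set and a handful of made-up features and weights (none of these names come from the slides). It shows the one property that matters later: the distribution over the next state is normalized locally, for each (previous state, observation) pair.

```python
import math

STATES = ["B", "I", "O"]  # assumed label set, for illustration

def features(prev_state, word, state):
    """Hypothetical overlapping binary features of (previous state, word, candidate state)."""
    return [
        f"prev={prev_state}&state={state}",
        f"word={word.lower()}&state={state}",
        f"capitalized={word[:1].isupper()}&state={state}",
        f"ends_in_ski={word.lower().endswith('ski')}&state={state}",
    ]

def local_distribution(weights, prev_state, word):
    """Maxent model Pr(state | prev_state, word), normalized over STATES only."""
    scores = {
        s: sum(weights.get(f, 0.0) for f in features(prev_state, word, s))
        for s in STATES
    }
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {s: math.exp(v - m) for s, v in scores.items()}
    z = sum(exps.values())    # local normalizer for this (prev_state, word)
    return {s: e / z for s, e in exps.items()}

# Made-up weights, for illustration only.
weights = {"ends_in_ski=True&state=B": 2.0, "prev=B&state=I": 1.0}
dist = local_distribution(weights, "O", "Wisniewski")
```

Here `dist` is a proper probability distribution over the three states for this one position; a full sequence probability would be the product of such local distributions.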
6
Implications of the model
  • Does this do what we want?
  • Q: does Y_{i-1} depend on X_{i+1}?
  • A node is conditionally independent of its
    non-descendants given its parents

7
Inference for MXPOST
[Figure: lattice with one column of states B, I, O for
each word of "When will prof Cohen post the notes"]
(Approx. view) Find the best path; weights are now on
arcs from state to state.
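The best-path view can be sketched as a Viterbi search over the lattice. This is a generic sketch, not MXPOST's actual code: the function name, the log-score convention, and the toy numbers are all assumptions.

```python
def viterbi(states, start_scores, arc_scores):
    """Best path through a lattice.

    start_scores[s] is the log-score of starting in state s at position 0;
    arc_scores[t][(p, s)] is the log-score of the arc from state p at
    position t to state s at position t + 1.
    """
    best = dict(start_scores)
    back = []
    for arcs in arc_scores:
        new_best, pointers = {}, {}
        for s in states:
            # Best predecessor of s at this position.
            prev = max(states, key=lambda p: best[p] + arcs[(p, s)])
            new_best[s] = best[prev] + arcs[(prev, s)]
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Follow back-pointers from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

# Toy lattice with made-up scores: every arc costs -2.0 except B->I and I->O.
states = ["B", "I", "O"]
arcs = {(p, s): -2.0 for p in states for s in states}
arcs[("B", "I")] = 0.0
arcs[("I", "O")] = 0.0
path, score = viterbi(states, {"B": 0.0, "I": -5.0, "O": -1.0}, [arcs, arcs])
```

In an MEMM the arc scores would be the logs of the locally normalized probabilities, so the arcs leaving any one node carry a fixed total mass.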
8
Inference for MXPOST
[Figure: lattice with one column of states B, I, O for
each word of "When will prof Cohen post the notes"]
More accurately: find the total flow to each node;
weights are now on arcs from state to state.
Flow out of a node is always fixed.
9
Label Bias Problem
  • Consider this MEMM
  • P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro)
    = P(2 | 1 and o) P(1 | r)
  • P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri)
    = P(2 | 1 and i) P(1 | r)
  • Since P(2 | 1 and x) = 1 for all x,
    P(1 and 2 | ro) = P(1 and 2 | ri)
  • In the training data, label value 2 is the only
    label value observed after label value 1
  • Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for
    all x
  • However, we expect P(1 and 2 | ri) to be
    greater than P(1 and 2 | ro).
  • Per-state normalization does not allow the
    required expectation

10
Label Bias Problem
  • Consider this MEMM, and enough training data to
    perfectly model it

Pr(0123 | rib) = 1, Pr(0453 | rob) = 1
Pr(0123 | rob) = Pr(1 | 0, r)/Z1 · Pr(2 | 1, o)/Z2 ·
Pr(3 | 2, b)/Z3 = 0.5 · 1 · 1
Pr(0453 | rib) = Pr(4 | 0, r)/Z1 · Pr(5 | 4, i)/Z2 ·
Pr(3 | 5, b)/Z3 = 0.5 · 1 · 1
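The arithmetic above can be reproduced with a small sketch. Two assumptions are mine, not the slides': each state may only move to successors seen in training, and an observation never seen at a state gets uniform mass over that state's successors (the local model must still sum to 1).

```python
from collections import defaultdict

# Training pairs: observation string -> state path (state 0 is the start state).
TRAIN = [("rib", [0, 1, 2, 3]), ("rob", [0, 4, 5, 3])]

def fit_local_models(train):
    """Count (prev_state, obs, next_state) triples and record each state's successors."""
    counts = defaultdict(float)
    successors = defaultdict(set)
    for obs, path in train:
        for t, ch in enumerate(obs):
            counts[(path[t], ch, path[t + 1])] += 1.0
            successors[path[t]].add(path[t + 1])
    return counts, successors

def local_prob(counts, successors, prev, ch, nxt):
    """Per-state normalized P(next | prev, obs): mass is shared among prev's successors."""
    succ = successors[prev]
    if nxt not in succ:
        return 0.0
    z = sum(counts[(prev, ch, s)] for s in succ)
    if z == 0.0:
        # Unseen observation at this state: assumed uniform over successors.
        return 1.0 / len(succ)
    return counts[(prev, ch, nxt)] / z

def path_prob(counts, successors, obs, path):
    """Product of the local probabilities along one state path."""
    p = 1.0
    for t, ch in enumerate(obs):
        p *= local_prob(counts, successors, path[t], ch, path[t + 1])
    return p

counts, successors = fit_local_models(TRAIN)
p_right = path_prob(counts, successors, "rob", [0, 4, 5, 3])
p_wrong = path_prob(counts, successors, "rob", [0, 1, 2, 3])
```

Because state 1 has a single successor, the observation "o" cannot lower the probability of moving 1 to 2, so `p_right` and `p_wrong` come out equal.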
11
How important is label bias?
  • Could be avoided in this case by changing
    structure
  • Our models are always wrong: is this "wrongness"
    a problem?
  • See Klein & Manning's paper for more on this.

12
Another view of label bias (Sha & Pereira)
So what's the alternative?
13
Inference for MXPOST
[Figure: lattice with one column of states B, I, O for
each word of "When will prof Cohen post the notes"]
More accurately: find the total flow to each node;
weights are now on arcs from state to state.
Flow out of a node is always fixed.
14
Another max-flow scheme
[Figure: lattice with one column of states B, I, O for
each word of "When will prof Cohen post the notes"]
More accurately: find the total flow to each node;
weights are now on arcs from state to state.
Flow out of a node is always fixed.
15
Another max-flow scheme: MRFs
[Figure: lattice with one column of states B, I, O for
each word of "When will prof Cohen post the notes"]
Find the total flow to each node; weights are now on
edges from state to state. The goal is to learn how
to weight edges in the graph, given features from
the examples.
16
CRFs vs MEMMs
  • MEMMs
  • Sequence classification f: x → y is reduced to many
    cases of ordinary classification, f: x_i → y_i
  • combined with Viterbi or beam search
  • CRFs
  • Sequence classification f: x → y is done by
  • Converting x, Y to an MRF
  • Using flow computations on the MRF to compute
    some best y | x

[Figure: two chains, each over inputs x1 ... x6 and labels
y1 ... y6. MEMM: local classifiers such as Pr(Y | x2, y1),
Pr(Y | x4, y3), Pr(Y | x5, y5). CRF: an MRF with
potentials f(Y1,Y2), f(Y2,Y3), ...]
17
The math: review of maxent
18
Review of maxent/MEMM/CMMs
We know how to compute this.
19
Details on CMMs
20
From CMMs to CRFs
Recall why we're unhappy: we don't want local
normalization.
How to compute this?
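The slide's own equations did not survive in this transcript. As a hedged reconstruction in the standard linear-chain notation (the symbols \lambda_k, f_k and the two normalizers are assumptions in the style of Lafferty et al., not copied from the slide), the contrast between local and global normalization can be written as:

```latex
% MEMM/CMM: one normalizer per position (local normalization)
\Pr(y \mid x) = \prod_{i=1}^{n}
  \frac{\exp\big(\sum_k \lambda_k f_k(y_{i-1}, y_i, x, i)\big)}{Z(y_{i-1}, x, i)},
\qquad
Z(y_{i-1}, x, i) = \sum_{y'} \exp\Big(\sum_k \lambda_k f_k(y_{i-1}, y', x, i)\Big)

% CRF: a single normalizer per input sequence (global normalization)
\Pr(y \mid x) = \frac{1}{Z(x)}
  \exp\Big(\sum_{i=1}^{n} \sum_k \lambda_k f_k(y_{i-1}, y_i, x, i)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_{i=1}^{n} \sum_k \lambda_k f_k(y'_{i-1}, y'_i, x, i)\Big)
```

The local normalizer sums over a single next label; Z(x) sums over entire label sequences, which is why computing it needs a dynamic program rather than direct enumeration.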
21
What's the new model look like?
What's independent?
22
What's the new model look like?
What's independent now?
[Figure: graph over nodes y1, y2, y3 and x]
23
Hammersley-Clifford
  • For positive distributions P(x_1, ..., x_n):
  • Pr(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)
    = Pr(x_i | Neighbors(x_i))
  • Pr(A | B, S) = Pr(A | S), where A, B are sets of nodes
    and S is a set that separates A and B
  • P can be written as a normalized product of clique
    potentials

So this is very general: any Markov distribution
can be written in this form (modulo nits like
"positive distribution").
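Written out, the "normalized product of clique potentials" form is the following. The symbols are assumptions for illustration: \mathcal{C} is the set of cliques of the graph, x_C the variables in clique C, and each \psi_C a strictly positive potential function.

```latex
P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z = \sum_{x'_1, \dots, x'_n} \prod_{C \in \mathcal{C}} \psi_C(x'_C)
```

For a chain-structured graph the cliques are just the edges between neighboring nodes, so the product runs over adjacent pairs.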
24
Definition of CRFs
X is a random variable over data sequences to be
labeled.
Y is a random variable over corresponding
label sequences.
25
Example of CRFs
26
Graphical comparison among HMMs, MEMMs and CRFs
[Figure: graphical structures of an HMM, an MEMM, and a CRF]
27
Lafferty et al. notation
28
Conditional Distribution (cont'd)
  • CRFs use the observation-dependent
    normalization Z(x) for the conditional
    distributions

Z(x) is a normalization over the data sequence x
  • Learning
  • Lafferty et al.'s IIS-based method is rather
    inefficient.
  • Gradient-based methods are faster.
  • Trickiest bit is computing the normalization, which
    is a sum over exponentially many y vectors.

29
CRF learning (from Sha & Pereira)
30
CRF learning (from Sha & Pereira)
31
CRF learning (from Sha & Pereira)
Something like forward-backward
  • Idea
  • Define a matrix of y, y' affinities at stage i
  • M_i[y, y'] = unnormalized probability of
    transition from y to y' at stage i
  • M_i * M_{i+1} = unnormalized probability of any
    path through stages i and i+1
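The matrix idea can be checked numerically. In this sketch the scoring function is made up, there are no special start or stop states, and any per-node scores are assumed to be folded into the transition scores. Multiplying the stage matrices and summing the entries should give the same total as enumerating every label sequence.

```python
import itertools
import math

def transition_matrices(labels, length, score):
    """M[i][a][b] = exp(score(i, y, y2)): unnormalized affinity of moving from
    label y (index a) at stage i to label y2 (index b) at stage i + 1."""
    return [
        [[math.exp(score(i, y, y2)) for y2 in labels] for y in labels]
        for i in range(length - 1)
    ]

def mat_mul(a, b):
    """Plain square-matrix product on nested lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def partition_by_matrices(mats):
    """Sum of all path weights: multiply the stage matrices, then sum every entry."""
    prod = mats[0]
    for m in mats[1:]:
        prod = mat_mul(prod, m)
    return sum(sum(row) for row in prod)

def partition_by_enumeration(mats, n_labels):
    """Same quantity by brute force over all n_labels ** (stages + 1) sequences."""
    total = 0.0
    for path in itertools.product(range(n_labels), repeat=len(mats) + 1):
        w = 1.0
        for i, m in enumerate(mats):
            w *= m[path[i]][path[i + 1]]
        total += w
    return total

labels = ["B", "I", "O"]

def toy_score(i, y, y2):
    """Hypothetical scoring function, made up for illustration."""
    return 0.5 * i + (1.0 if (y, y2) in {("B", "I"), ("I", "I")} else 0.0)

mats = transition_matrices(labels, length=4, score=toy_score)
z_fast = partition_by_matrices(mats)
z_slow = partition_by_enumeration(mats, len(labels))
```

The matrix route costs a few small matrix products, while the enumeration grows exponentially with the sequence length; the two agree up to floating-point rounding.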

32
[Figure: two graphs, one over nodes y1, y2, y3 and x,
the other over nodes y1, y2, y3]
33
Forward-backward ideas
[Figure: three-stage lattice with states "name" and
"nonName" at each stage; arcs between consecutive
stages are labeled a through h]
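A minimal forward-backward sketch on a lattice of this shape (two states, three stages). The arc weights below are made up and do not correspond to the slide's labels a through h; the function name and return convention are also assumptions.

```python
def forward_backward(mats, n_states):
    """mats[i][a][b] is the unnormalized weight of the arc from state a at
    stage i to state b at stage i + 1. Returns (Z, marginals), where
    marginals[i][s] is the probability that a path passes through state s
    at stage i."""
    n_stages = len(mats) + 1
    # alpha[i][s]: total weight of partial paths that end at state s, stage i.
    alpha = [[1.0] * n_states]
    for m in mats:
        prev = alpha[-1]
        alpha.append([sum(prev[a] * m[a][b] for a in range(n_states))
                      for b in range(n_states)])
    # beta[i][s]: total weight of partial paths that start at state s, stage i.
    beta = [[1.0] * n_states]
    for m in reversed(mats):
        nxt = beta[0]
        beta.insert(0, [sum(m[a][b] * nxt[b] for b in range(n_states))
                        for a in range(n_states)])
    z = sum(alpha[-1])
    marginals = [[alpha[i][s] * beta[i][s] / z for s in range(n_states)]
                 for i in range(n_stages)]
    return z, marginals

# States: 0 = name, 1 = nonName. Arc weights are made up for illustration.
mats = [
    [[2.0, 1.0], [1.0, 3.0]],   # stage 1 -> stage 2
    [[1.0, 1.0], [2.0, 1.0]],   # stage 2 -> stage 3
]
z, marginals = forward_backward(mats, n_states=2)
```

Each row of `marginals` sums to 1, since every complete path passes through exactly one state at each stage.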
34
CRF learning (from Sha & Pereira)
35
Sha & Pereira results
CRF beats MEMM (McNemar's test); MEMM probably
beats voted perceptron
36
Sha & Pereira results
in minutes, 375k examples
37
Some recent results
ICML 2006
38
Some recent results
39
POS tagging experiments in Lafferty et al.
  • Compared HMMs, MEMMs, and CRFs on Penn treebank
    POS tagging
  • Each word in a given input sentence must be
    labeled with one of 45 syntactic tags
  • Add a small set of orthographic features: whether
    a spelling begins with a number or upper case
    letter, whether it contains a hyphen, and whether it
    contains one of the following suffixes: -ing,
    -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
  • oov = out-of-vocabulary (not observed in the
    training set)
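The orthographic features listed above could be extracted along these lines. The feature names are hypothetical, and "contains a suffix" is read here as "ends with it", which is an assumption rather than something the slides state.

```python
SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

def orthographic_features(word):
    """Binary spelling features of the kind listed above (names are hypothetical)."""
    feats = {
        "begins_with_number": word[:1].isdigit(),
        "begins_with_upper": word[:1].isupper(),
        "contains_hyphen": "-" in word,
    }
    lower = word.lower()
    for suffix in SUFFIXES:
        # Assumption: "contains a suffix" is read as "ends with it".
        feats["suffix" + suffix] = lower.endswith(suffix[1:])
    return feats

example = orthographic_features("Running")
```

These features overlap by design (a word ending in "-tion" also ends in "-ion"), which is exactly the kind of dependence a conditional model tolerates and a generative HMM does not.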

40
POS tagging vs MXPost