Title: Sequential Learning with Dependency Nets
1 Sequential Learning with Dependency Nets
2 Announcements
- Critiques for the week due Tuesday
  - Email to Vitor and me
  - (Critiques showing evidence of classroom/classwork multiplexing are considered really, really late)
- Confusion about number of student presentations for today (1? 2?)
- Please don't make changes after I copy over the presentations to the web page
  - Do we need a system?
- Office hours are no-appointment-necessary
  - Appointments are via sharonw_at_cs
- Preferences on later topics?
  - Relation extraction
  - Semantic role labeling
  - Semi-supervised IE, IE on the web and large corpora, bootstrapping
3 CRFs: the good, the bad, and the cumbersome
- Good points
  - Global optimization of the weight vector that guides decision making
  - Trade-off decisions made at different points in the sequence
- Worries
  - Cost (of training)
  - Complexity (do we need all this math?)
  - Amount of context
    - The matrix for the normalizer is |Y| x |Y|, so high-order models for many classes get expensive fast.
  - Strong commitment to maxent-style learning
    - Loglinear models are nice, but nothing is always best.
4 Dependency Nets
6 Dependency Nets

- Proposed solution
  - The parents of a node are its Markov blanket
    - like an undirected Markov net: captures all correlational associations
  - One conditional probability for each node X, namely P(X | parents of X)
    - like a directed Bayes net: no messy clique potentials
7 Example: bidirectional chains
[Figure: bidirectional chain of labels Y1, Y2, ..., Yi over the words "When will dr Cohen post the notes"]
8 DN chains
[Figure: dependency-net chain of labels Y1, Y2, ..., Yi over the words "When will dr Cohen post the notes"]
- How do we do inference? Iteratively:
  1. Pick values for Y1, Y2, ... at random
  2. Pick some j, and compute Pr(Yj | x and the current values of the other Y's)
  3. Set the new value of Yj according to this distribution
  4. Go back to (2)
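The update step above can be sketched in code. This is a toy illustration, not the lecture's model: `local_prob` is a made-up conditional that just prefers neighboring labels to agree, standing in for a learned classifier giving Pr(Yj | x, neighbors).

```python
import random

LABELS = ["B", "I", "O"]

def local_prob(j, y_j, y, x):
    """Toy stand-in for Pr(Y_j = y_j | x, other labels): favor agreement
    with the neighboring labels (x is unused in this toy model)."""
    score = 1.0
    for k in (j - 1, j + 1):
        if 0 <= k < len(y) and y[k] == y_j:
            score += 2.0
    return score

def gibbs_step(y, x):
    # step 2: pick some j, and compute Pr(Y_j | the current other values)
    j = random.randrange(len(y))
    weights = [local_prob(j, lab, y, x) for lab in LABELS]
    total = sum(weights)
    # step 3: set the new value of Y_j by sampling from this distribution
    r = random.random() * total
    for lab, w in zip(LABELS, weights):
        r -= w
        if r <= 0:
            y[j] = lab
            break
    return y

x = "when will dr cohen post the notes".split()
y = [random.choice(LABELS) for _ in x]   # step 1: random initial values
for _ in range(1000):                    # step 4: go back to (2)
    y = gibbs_step(y, x)
```
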
9 This is an MCMC process
Markov Chain Monte Carlo: a randomized process that changes y(t) to y(t+1) using a transition probability that depends only on y(t), not on the previous y's.

One particular run:
- How do we do inference? Iteratively:
  1. Pick values for Y1, Y2, ... at random: this is y(0)
  2. Pick some j, and compute Pr(Yj | x and the current values of the other Y's)
  3. Set the new value of Yj according to this distribution: this gives y(1)
  4. Go back to (2) and repeat to get y(1), y(2), ..., y(t), ...
11 This is an MCMC process
Claim: suppose Y(t) is drawn from some distribution D that is consistent with the local conditionals Pr(Yj | the others). Then Y(t+1) is also drawn from D (i.e., the random flip doesn't move us away from D).
12 This is an MCMC process

Burn-in

Claim: if you wait long enough, then for some t, Y(t) will be drawn from such a distribution D, under certain reasonable conditions (e.g., the graph of potential edges is connected, ...). So D is a sink.
14 This is an MCMC process

(early samples are burn-in and discarded; later samples are averaged for prediction)

- An algorithm:
  1. Run the MCMC chain for a long time t, and hope that Y(t) will be drawn from the target distribution D.
  2. Run the MCMC chain for a while longer and save the samples S = Y(t), Y(t+1), ..., Y(t+m).
  3. Use S to answer probabilistic queries like Pr(Yj | X).
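A minimal sketch of this burn-in-then-average algorithm on a toy two-variable chain; the agreement-preferring conditional here is an assumption for illustration, not the lecture's model:

```python
import random

def gibbs_update(y):
    """One Gibbs flip on a pair of binary variables, each preferring
    to agree with the other: Pr(Y_j = 1 | Y_other = 1) = 0.8."""
    j = random.randrange(2)
    other = y[1 - j]
    p_one = 0.8 if other == 1 else 0.2
    y[j] = 1 if random.random() < p_one else 0
    return y

y = [random.randrange(2) for _ in range(2)]
for _ in range(500):          # burn-in: these samples are discarded
    y = gibbs_update(y)

samples = []
for _ in range(5000):         # saved samples S = Y(t), ..., Y(t+m)
    y = gibbs_update(y)
    samples.append(tuple(y))

# answer a query like Pr(Y_0 = 1) by averaging over S
p = sum(s[0] for s in samples) / len(samples)
```
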
15 More on MCMC
- This particular process is Gibbs sampling
  - Transition probabilities are defined by sampling from the posterior of one variable Yj given the others.
- MCMC is a very general-purpose inference scheme (and sometimes very slow)
- On the plus side, learning is relatively cheap, since there's no inference involved (!)
- A dependency net is closely related to a Markov random field learned by maximizing pseudo-likelihood
  - Identical?
- The statistical relational learning community has some proponents of this approach
  - Pedro Domingos, David Jensen, ...
- A big advantage is the generality of the approach
  - Sparse learners (e.g., L1-regularized maxent, decision trees, ...) can be used to infer the Markov blanket (NIPS 2006)
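The pseudo-likelihood connection can be written out: training one probabilistic classifier per node maximizes a product of local conditionals in place of the joint likelihood. A sketch of the objective (notation assumed, not from the slides):

```latex
% Pseudo-likelihood for parameters \theta: a product of per-node
% conditionals given each node's Markov blanket MB(y_j), replacing
% the joint likelihood P_\theta(y_1, \ldots, y_n).
\mathrm{PL}(\theta) \;=\; \prod_{j} P_\theta\!\left(y_j \mid \mathrm{MB}(y_j)\right)
```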
16 Examples
[Figure: label chain Y1, Y2, ..., Yi over the words "When will dr Cohen post the notes"]
17 Examples
[Figure: two coupled chains over the words "When will dr Cohen post the notes": a POS chain Z1, Z2, ..., Zi and a BIO chain Y1, Y2, ..., Yi; annotated "Mahesh"]
18 Examples
[Figure: another chain variant with labels Y1, Y2, ..., Yi over the words "When will dr Cohen post the notes"]
19 Dependency nets
- The bad and the ugly
  - Inference is less efficient: MCMC sampling
  - Can't reconstruct the joint probability via the chain rule
  - Networks might be inconsistent
    - i.e., the local P(x | pa(x))'s don't define a pdf
  - Exactly equal, representationally, to normal undirected Markov nets
21 Dependency nets
- The good
  - Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X | pa(X)) for each node X.
  - (You might not learn a consistent model, but you'll probably learn a reasonably good one.)
  - Inference can be sped up substantially over naïve Gibbs sampling.
22 Dependency nets
- Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X | pa(X)) for each node X.

[Figure: chain y1 - y2 - y3 - y4 over input x, with local conditionals
Pr(y1 | x, y2), Pr(y2 | x, y1, y3), Pr(y3 | x, y2, y4), Pr(y4 | x, y3)]

Learning is local, but inference is not, and need not be unidirectional.
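The "learning is local" point can be sketched with a toy counting estimator: each conditional, such as P(y2 | y1, y3), is fit independently of the others. The data and node choice here are hypothetical, and a real system would use a probabilistic classifier (e.g., maxent) with features of x as well:

```python
from collections import Counter

# Hypothetical labeled chains (y1, y2, y3, y4) for illustration.
data = [
    (1, 1, 2, 2),
    (1, 1, 2, 2),
    (1, 2, 2, 1),
]

# Learn P(y2 | y1, y3) by counting: one table per node, learned locally,
# without running any joint inference.
counts = Counter()
context = Counter()
for y1, y2, y3, y4 in data:
    counts[(y1, y3, y2)] += 1
    context[(y1, y3)] += 1

def p_y2(y2, y1, y3):
    """Estimated Pr(Y2 = y2 | Y1 = y1, Y3 = y3)."""
    return counts[(y1, y3, y2)] / context[(y1, y3)]
```
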
23 Toutanova, Klein, Manning, Singer

- Dependency nets for POS tagging vs. CMMs.
- Maxent is used for the local conditional model.
- Goals:
  - An easy-to-train bidirectional model
  - A really good POS tagger
24 Toutanova et al.
- Don't use Gibbs sampling for inference; instead use a Viterbi variant (which is not guaranteed to produce the ML sequence).

Example: D = {11, 11, 11, 12, 21, 33}. The ML state is 11, but P(a=1 | b=1) · P(b=1 | a=1) < 1 = P(a=3 | b=3) · P(b=3 | a=3).
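One way to check the example in code, assuming the quantity being compared is the product of the two local conditionals estimated from D (an interpretation for illustration, not the paper's exact scoring):

```python
from collections import Counter

# D = {11, 11, 11, 12, 21, 33} as (a, b) pairs.
D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]

pair = Counter(D)
a_marg = Counter(a for a, b in D)
b_marg = Counter(b for a, b in D)

def score(a, b):
    """P(a | b) * P(b | a), estimated by counting."""
    return (pair[(a, b)] / b_marg[b]) * (pair[(a, b)] / a_marg[a])

# The ML (most frequent) state is 11, yet score(1, 1) = 3/4 * 3/4 = 9/16,
# while score(3, 3) = 1, so the local-conditional product prefers 33.
```
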
25 Results with model
26 Results with model
27 Results with model

Best model includes some special unknown-word features, including a crude company-name detector.
28 Results with model

Final test-set results:
- MXPost (Ratnaparkhi): 47.6, 96.4, 86.2
- CRF (Lafferty et al., ICML 2001): 95.7, 76.4
29 Other comments
- Smoothing (quadratic regularization, a.k.a. a Gaussian prior) is important: it avoids overfitting effects reported elsewhere.