Title: Sequential Learning with Dependency Nets
1 Sequential Learning with Dependency Nets
2 Announcements
- Critiques for the week due Tuesday
  - Email to Vitor and me
  - (Critiques showing evidence of classroom/classwork multiplexing are considered really, really late)
- Confusion about number of student presentations for today (1? 2?)
- Please don't make changes after I copy over the presentations to the web page
  - Do we need a system?
- Office hours are no-appointment-necessary
  - Appointments are via sharonw_at_cs
- Preferences on later topics?
  - Relation extraction
  - Semantic role labeling
  - Semi-supervised IE, IE on the web and large corpora, bootstrapping
3 CRFs: the good, the bad, and the cumbersome
- Good points
  - Global optimization of the weight vector that guides decision making
  - Trade-off decisions made at different points in the sequence
- Worries
  - Cost (of training)
  - Complexity (do we need all this math?)
  - Amount of context
    - The matrix for the normalizer is |Y| x |Y|, so high-order models for many classes get expensive fast.
  - Strong commitment to maxent-style learning
    - Loglinear models are nice, but nothing is always best.
4 Dependency Nets
6 Dependency Nets

- Proposed solution
  - The parents of a node are its Markov blanket
    - like an undirected Markov net: captures all correlational associations
  - One conditional probability for each node X, namely P(X | parents of X)
    - like a directed Bayes net: no messy clique potentials
7 Example: bidirectional chains
[Figure: bidirectional chain of labels Y1, Y2, ..., Yi over the words "When will dr Cohen post the notes"]
8 DN chains
[Figure: dependency-net chain of labels Y1, Y2, ..., Yi over the words "When will dr Cohen post the notes"]
- How do we do inference? Iteratively:
  1. Pick values for Y1, Y2, ... at random
  2. Pick some j, and compute Pr(Yj | x and the current values of the other Y's)
  3. Set the new value of Yj according to this distribution
  4. Go back to (2)
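The update step above can be sketched in code. This is a toy illustration, not the lecture's model: `local_prob` is a made-up conditional that just prefers neighboring labels to agree, standing in for a learned classifier giving Pr(Yj | x, neighbors).

```python
import random

LABELS = ["B", "I", "O"]

def local_prob(j, y_j, y, x):
    """Toy stand-in for Pr(Y_j = y_j | x, other labels): favor agreement
    with the neighboring labels (x is unused in this toy model)."""
    score = 1.0
    for k in (j - 1, j + 1):
        if 0 <= k < len(y) and y[k] == y_j:
            score += 2.0
    return score

def gibbs_step(y, x):
    # step 2: pick some j, and compute Pr(Y_j | the current other values)
    j = random.randrange(len(y))
    weights = [local_prob(j, lab, y, x) for lab in LABELS]
    total = sum(weights)
    # step 3: set the new value of Y_j by sampling from this distribution
    r = random.random() * total
    for lab, w in zip(LABELS, weights):
        r -= w
        if r <= 0:
            y[j] = lab
            break
    return y

x = "when will dr cohen post the notes".split()
y = [random.choice(LABELS) for _ in x]   # step 1: random initial values
for _ in range(1000):                    # step 4: go back to (2)
    y = gibbs_step(y, x)
```
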
9 This is an MCMC process
Markov Chain Monte Carlo: a randomized process that changes y(t) to y(t+1) using a transition probability that depends only on y(t), not on the previous y's.

One particular run:
- How do we do inference? Iteratively:
  1. Pick values for Y1, Y2, ... at random: this is y(0)
  2. Pick some j, and compute Pr(Yj | x and the current values of the other Y's)
  3. Set the new value of Yj according to this distribution: this gives y(1)
  4. Go back to (2) and repeat to get y(1), y(2), ..., y(t), ...
11 This is an MCMC process
Claim: suppose Y(t) is drawn from some distribution D that is consistent with the local conditionals Pr(Yj | the others). Then Y(t+1) is also drawn from D (i.e., the random flip doesn't move us away from D).
12 This is an MCMC process

Burn-in

Claim: if you wait long enough, then for some t, Y(t) will be drawn from such a distribution D, under certain reasonable conditions (e.g., the graph of potential edges is connected, ...). So D is a sink.
14 This is an MCMC process

(early samples are burn-in and discarded; later samples are averaged for prediction)

- An algorithm:
  1. Run the MCMC chain for a long time t, and hope that Y(t) will be drawn from the target distribution D.
  2. Run the MCMC chain for a while longer and save the samples S = Y(t), Y(t+1), ..., Y(t+m).
  3. Use S to answer probabilistic queries like Pr(Yj | X).
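A minimal sketch of this burn-in-then-average algorithm on a toy two-variable chain; the agreement-preferring conditional here is an assumption for illustration, not the lecture's model:

```python
import random

def gibbs_update(y):
    """One Gibbs flip on a pair of binary variables, each preferring
    to agree with the other: Pr(Y_j = 1 | Y_other = 1) = 0.8."""
    j = random.randrange(2)
    other = y[1 - j]
    p_one = 0.8 if other == 1 else 0.2
    y[j] = 1 if random.random() < p_one else 0
    return y

y = [random.randrange(2) for _ in range(2)]
for _ in range(500):          # burn-in: these samples are discarded
    y = gibbs_update(y)

samples = []
for _ in range(5000):         # saved samples S = Y(t), ..., Y(t+m)
    y = gibbs_update(y)
    samples.append(tuple(y))

# answer a query like Pr(Y_0 = 1) by averaging over S
p = sum(s[0] for s in samples) / len(samples)
```
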
15 More on MCMC
- This particular process is Gibbs sampling
  - Transition probabilities are defined by sampling from the posterior of one variable Yj given the others.
- MCMC is a very general-purpose inference scheme (and sometimes very slow)
- On the plus side, learning is relatively cheap, since there's no inference involved (!)
- A dependency net is closely related to a Markov random field learned by maximizing pseudo-likelihood
  - Identical?
- The statistical relational learning community has some proponents of this approach
  - Pedro Domingos, David Jensen, ...
- A big advantage is the generality of the approach
  - Sparse learners (e.g., L1-regularized maxent, decision trees, ...) can be used to infer the Markov blanket (NIPS 2006)
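The pseudo-likelihood connection can be written out: training one probabilistic classifier per node maximizes a product of local conditionals in place of the joint likelihood. A sketch of the objective (notation assumed, not from the slides):

```latex
% Pseudo-likelihood for parameters \theta: a product of per-node
% conditionals given each node's Markov blanket MB(y_j), replacing
% the joint likelihood P_\theta(y_1, \ldots, y_n).
\mathrm{PL}(\theta) \;=\; \prod_{j} P_\theta\!\left(y_j \mid \mathrm{MB}(y_j)\right)
```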
16 Examples
[Figure: label chain Y1, Y2, ..., Yi over the words "When will dr Cohen post the notes"]
17 Examples
[Figure: two coupled chains over the words "When will dr Cohen post the notes": a POS chain Z1, Z2, ..., Zi and a BIO chain Y1, Y2, ..., Yi; annotated "Mahesh"]
18 Examples
[Figure: another chain variant with labels Y1, Y2, ..., Yi over the words "When will dr Cohen post the notes"]
19 Dependency nets
- The bad and the ugly
  - Inference is less efficient: MCMC sampling
  - Can't reconstruct the joint probability via the chain rule
  - Networks might be inconsistent
    - i.e., the local P(x | pa(x))'s don't define a pdf
  - Exactly equal, representationally, to normal undirected Markov nets
21 Dependency nets
- The good
  - Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X | pa(X)) for each node X.
  - (You might not learn a consistent model, but you'll probably learn a reasonably good one.)
  - Inference can be sped up substantially over naïve Gibbs sampling.
22 Dependency nets
- Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X | pa(X)) for each node X.

[Figure: chain y1 - y2 - y3 - y4 over input x, with local conditionals
Pr(y1 | x, y2), Pr(y2 | x, y1, y3), Pr(y3 | x, y2, y4), Pr(y4 | x, y3)]

Learning is local, but inference is not, and need not be unidirectional.
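The "learning is local" point can be sketched with a toy counting estimator: each conditional, such as P(y2 | y1, y3), is fit independently of the others. The data and node choice here are hypothetical, and a real system would use a probabilistic classifier (e.g., maxent) with features of x as well:

```python
from collections import Counter

# Hypothetical labeled chains (y1, y2, y3, y4) for illustration.
data = [
    (1, 1, 2, 2),
    (1, 1, 2, 2),
    (1, 2, 2, 1),
]

# Learn P(y2 | y1, y3) by counting: one table per node, learned locally,
# without running any joint inference.
counts = Counter()
context = Counter()
for y1, y2, y3, y4 in data:
    counts[(y1, y3, y2)] += 1
    context[(y1, y3)] += 1

def p_y2(y2, y1, y3):
    """Estimated Pr(Y2 = y2 | Y1 = y1, Y3 = y3)."""
    return counts[(y1, y3, y2)] / context[(y1, y3)]
```
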
23 Toutanova, Klein, Manning, Singer

- Dependency nets for POS tagging vs. CMMs.
- Maxent is used for the local conditional model.
- Goals:
  - An easy-to-train bidirectional model
  - A really good POS tagger
24 Toutanova et al.
- Don't use Gibbs sampling for inference; instead use a Viterbi variant (which is not guaranteed to produce the ML sequence).

Example: D = {11, 11, 11, 12, 21, 33}. The ML state is 11, but P(a=1 | b=1) · P(b=1 | a=1) < 1 = P(a=3 | b=3) · P(b=3 | a=3).
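One way to check the example in code, assuming the quantity being compared is the product of the two local conditionals estimated from D (an interpretation for illustration, not the paper's exact scoring):

```python
from collections import Counter

# D = {11, 11, 11, 12, 21, 33} as (a, b) pairs.
D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]

pair = Counter(D)
a_marg = Counter(a for a, b in D)
b_marg = Counter(b for a, b in D)

def score(a, b):
    """P(a | b) * P(b | a), estimated by counting."""
    return (pair[(a, b)] / b_marg[b]) * (pair[(a, b)] / a_marg[a])

# The ML (most frequent) state is 11, yet score(1, 1) = 3/4 * 3/4 = 9/16,
# while score(3, 3) = 1, so the local-conditional product prefers 33.
```
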
25 Results with model
26 Results with model
27 Results with model

Best model includes some special unknown-word features, including a crude company-name detector.
28 Results with model

Final test-set results:
- MXPost (Ratnaparkhi): 47.6, 96.4, 86.2
- CRF (Lafferty et al., ICML 2001): 95.7, 76.4
29 Other comments
- Smoothing (quadratic regularization, a.k.a. a Gaussian prior) is important: it avoids overfitting effects reported elsewhere.