Title: Learning in Bayes Nets
1. Learning in Bayes Nets

- Task 1: Given the network structure and given data, where a data point is an observed setting for the variables, learn the CPTs for the Bayes Net. Might also start with priors for CPT probabilities.
- Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both net structure and CPTs).
2. Task 1: Maximum Likelihood by Example (Howard 1970)

- Suppose we have a thumbtack (with a round flat head and sharp point) that when flipped can land either with the point up (tails) or with the point touching the ground (heads).
- Suppose we flip the thumbtack 100 times, and 70 times it lands on heads. Then we estimate that the probability of heads the next time is 0.7. This is the maximum likelihood estimate.
3. The General Maximum Likelihood Setting

- We had a binomial distribution b(n, p) for n = 100, and we wanted a good guess at p.
- We chose the p that would maximize the probability of our observation of 70 heads.
- In general, we have a parameterized distribution and want to estimate one (or several) parameter(s): choose the value that maximizes the probability of the data. (The thumbtack case is worked out below.)
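Concretely, for the thumbtack data the maximum likelihood estimate falls out of a one-line calculus argument. This is the standard derivation, written out here for reference; it is not on the slides themselves:

```latex
% Likelihood of 70 heads in 100 flips, as a function of p:
L(p) = \binom{100}{70}\, p^{70} (1-p)^{30}
% Take logs, differentiate, and set to zero:
\frac{d}{dp} \ln L(p) = \frac{70}{p} - \frac{30}{1-p} = 0
\quad\Longrightarrow\quad p = \frac{70}{100} = 0.7
```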
4. Back to the Frequentist-Bayes Debate

- The preceding seems backwards: we want to maximize the probability of p, not necessarily of the data (we already have it).
- A frequentist will say this is the best we can do: we can't talk about the probability of p; it is fixed (though unknown).
- A Bayesian says the probability of p is the degree of belief we assign to it...
5. Fortunately the Two Agree (Almost)

- It turns out that for Bayesians, if our prior belief is that all values of p are equally likely, then after observing the data we'll assign the highest probability to the maximum likelihood estimate for p.
- But what if our prior belief is different? How do we merge the prior belief with the data to get the best new belief?
6. Encode Prior Beliefs as a Beta Distribution

  beta(x; a, b) = [Γ(a+b+2) / (Γ(a+1) Γ(b+1))] x^a (1-x)^b
7. Any Intuition for This?

- For any positive integer y, Γ(y) = (y-1)!.
- Suppose we use this, and we also replace
  - x with p
  - a with x
  - a+b with n
- Then we get

  [(n+1)! / (x! (n-x)!)] p^x (1-p)^(n-x)

- The beta(a, b) is just the binomial(n, p) where n = a+b, and p becomes the variable. With the change of variable, we need a different normalizing constant so the sum (integral) is 1. Hence (n+1)! replaces n!.
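A quick numeric check of this correspondence (a sketch; the function name beta_slide and the specific numbers are mine, not from the slides — note the slides' beta(a, b) with heads-count a and tails-count b is the standard Beta(a+1, b+1)):

```python
from math import comb, gamma

def beta_slide(x: float, a: int, b: int) -> float:
    """Beta density in the slides' counts convention: a heads, b tails."""
    return gamma(a + b + 2) / (gamma(a + 1) * gamma(b + 1)) * x**a * (1 - x)**b

# With n = a + b and x heads, the density at p is (n+1) times the
# binomial pmf, i.e. (n+1)! replaces n!.
a, b, p = 70, 30, 0.7
n = a + b
binom_pmf = comb(n, a) * p**a * (1 - p)**(n - a)
print(beta_slide(p, a, b))     # beta density at p = 0.7
print((n + 1) * binom_pmf)     # same number
```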
8. (No transcript)
9. Incorporating a Prior

- We assume a beta distribution as our prior distribution over the parameter p.
- Nice properties: unimodal; we can choose the mode to reflect the most probable value; we can choose the variance to reflect our confidence in this value.
- Best property: a beta distribution is parameterized by two positive numbers, a
10. Beta Distribution (Continued)

- (Continued) and b. Higher values of a relative to b move the mode of the distribution to the right (toward 1), and higher values of both a and b make the distribution more peaked (lower variance). We might, for example, take a to be the number of heads and b to be the number of tails. At any time, the mode of
11. Beta Distribution (Continued)

- (Continued) the beta distribution (the expectation for p) is a/(a+b), and as we get more data, the distribution becomes more peaked, reflecting higher confidence in our expectation. So we can specify our prior belief for p by choosing initial values for a and b such that a/(a+b) = p, and we can specify confidence in this belief with high
12. Beta Distribution (Continued)

- (Continued) initial values for a and b. Updating our prior belief based on data to obtain a posterior belief simply requires incrementing a for every heads outcome and incrementing b for every tails outcome.
- So after h heads out of n flips, our posterior distribution says P(heads) = (a+h)/(a+b+n).
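A minimal sketch of this update rule (the function name and numbers are mine; it uses the slides' convention that the point estimate is a/(a+b)):

```python
def update_beta(a: float, b: float, heads: int, tails: int) -> tuple[float, float]:
    """Posterior pseudo-counts after observing heads/tails outcomes."""
    return a + heads, b + tails

# Prior belief: P(heads) = 2/(2+8) = 0.2, with confidence worth 10 flips.
a, b = 2.0, 8.0
# Observe 70 heads in 100 flips.
a, b = update_beta(a, b, heads=70, tails=30)
print(a / (a + b))   # posterior estimate: 72/110 ≈ 0.655, pulled toward the data
```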
13. Dirichlet Distributions

- What if our variable is not Boolean but can take on more values? (Let's still assume our variables are discrete.)
- Dirichlet distributions are an extension of beta distributions for the multi-valued case (corresponding to the extension from binomial to multinomial distributions).
- A Dirichlet distribution over a variable with n values has n parameters rather than 2.
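The same counting update generalizes directly; a small sketch (my own illustration, with invented outcome names):

```python
from collections import Counter

def update_dirichlet(counts: dict[str, float], data: list[str]) -> dict[str, float]:
    """Add one pseudo-count per observed outcome (Dirichlet posterior)."""
    posterior = dict(counts)
    for value, k in Counter(data).items():
        posterior[value] += k
    return posterior

prior = {"red": 1.0, "green": 1.0, "blue": 1.0}        # uniform prior
post = update_dirichlet(prior, ["red", "red", "blue"])
total = sum(post.values())
print({v: c / total for v, c in post.items()})          # point estimates
```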
14. Back to the Frequentist-Bayes Debate

- Recall that under the frequentist view we estimate each parameter p by taking the ML estimate (maximum likelihood estimate: the value for p that maximizes the probability of the data).
- Under the Bayesian view, we now have a prior distribution over values of p. If this prior is a beta, or more generally a Dirichlet,
15. Frequentist-Bayes Debate (Continued)

- (Continued) then we can update it to a posterior distribution quite easily using the data, as illustrated in the thumbtack example. The result yields a new value for the parameter p we wish to estimate (e.g., probability of heads), called the MAP (maximum a posteriori) estimate.
- If our prior distribution was uniform over values for p, then ML and MAP agree.
16. Learning CPTs from Complete Settings

- Suppose we are given a set of data, where each data point is a complete setting for all the variables.
- One assumption we make is that the data set is a random sample from the distribution we're trying to model.
- For each node in our network, we consider each entry in its CPT (each setting of values
17. Learning CPTs (Continued)

- (Continued) for its parents). For each entry in the CPT, we have a prior (possibly uniform) Dirichlet distribution over its values. We simply update this distribution based on the relevant data points (those that agree on the settings for the parents that correspond with this CPT entry).
- A second, implicit assumption is that the
18. Learning CPTs (Continued)

- (Continued) distributions over different rows of the CPT are independent of one another.
- Finally, it is worth noting that instead of this last assumption, we might have a stronger bias over the form of the CPT. We might believe it is a noisy-OR, a linear function, or a tree, in which case we would instead use machine learning, linear regression, etc.
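A compact sketch of the counting procedure from the last three slides (a toy implementation under my own naming; it assumes binary variables and a uniform Dirichlet prior of one pseudo-count per value):

```python
from collections import defaultdict
from itertools import product

def learn_cpt(data, child, parents, prior=1.0):
    """Estimate P(child=1 | parent setting) with Dirichlet smoothing.
    data: list of dicts mapping variable name -> 0/1."""
    counts = defaultdict(lambda: [prior, prior])   # row -> [count of 0, count of 1]
    for point in data:
        row = tuple(point[p] for p in parents)
        counts[row][point[child]] += 1
    return {row: counts[row][1] / sum(counts[row])
            for row in product([0, 1], repeat=len(parents))}

data = [{"A": 0, "B": 1, "C": 1}, {"A": 0, "B": 1, "C": 0},
        {"A": 1, "B": 1, "C": 1}, {"A": 0, "B": 0, "C": 0}]
print(learn_cpt(data, child="C", parents=["A", "B"]))
```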
19. Simple Example

- Suppose we believe the variables PinLength and HeadWeight directly influence whether a thumbtack comes up heads or tails. For simplicity, suppose PinLength can be long or short and HeadWeight can be heavy or light.
- Suppose we adopt the following prior over the CPT entries for the variable Thumbtack.
20. Simple Example (Continued)

[Prior CPT table for Thumbtack; not transcribed.]
21. Simple Example (Continued)

- Notice that we have equal confidence in our prior (initial) probabilities for the first and last columns of the CPT, less confidence in those of the second column, and more in those of the third column.
- A new data point will affect only one of the columns. A new data point will have more effect on the second column than on the others.
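Since the prior table itself did not survive transcription, here is a hypothetical prior consistent with the description above (the specific counts are invented purely for illustration):

  PinLength, HeadWeight:   long,heavy   long,light   short,heavy   short,light
  P(heads) prior:          0.5 (4,4)    0.5 (2,2)    0.5 (8,8)     0.5 (4,4)

The first and last columns carry equal counts (equal confidence), the second carries fewer (less confidence, so one data point moves it more), and the third carries the most.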
22. More Difficult Case: What if Some Variables are Missing?

- Recall our earlier notion of hidden variables.
- Sometimes a variable is hidden because it cannot be explicitly measured. For example, we might hypothesize that a chromosomal abnormality is responsible for some patients with a particular cancer not responding well to treatment.
23. Missing Values (Continued)

- We might include a node for this chromosomal abnormality in our network because we strongly believe it exists, other variables can be used to predict it, and it is in turn predictive of still other variables.
- But in estimating CPTs from data, none of our data points has a value for this variable.
24. Missing Values (Continued)

- This missing value (hidden variable) problem arises frequently.
- Chicken-and-egg issue: if we had CPTs, we could fill in the data, or if we had data we could estimate CPTs.
- We do have partial data and partial (prior) CPTs. Can we somehow leverage these into full data and posterior CPTs?
25. Three Approaches

- Expectation-Maximization (EM) Algorithm.
- Gibbs Sampling (again).
- Gradient Ascent (Hill-climbing).
26-31. K-Means as EM

[Six figure-only slides under this title; no transcript.]
32. General EM Framework

- Given: data with missing values, a space of possible models, an initial model.
- Repeat until no change greater than threshold:
  - Expectation (E) Step: Compute expectation over missing values, given model.
  - Maximization (M) Step: Replace current model with model that maximizes probability of data.
33. (Soft) EM vs. Hard EM

- Standard (soft) EM: expectation is a probability distribution.
- Hard EM: expectation is all or nothing; the most likely/probable value.
- K-means is usually run as hard EM but doesn't have to be.
- Advantage of hard EM is computational efficiency when expectation is over a state consisting of values for multiple variables (the next example illustrates).
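As a concrete illustration of hard EM, here is a minimal K-means sketch (my own toy code, not from the slides): the E step assigns each point wholly to its nearest centroid, and the M step recomputes each centroid as the mean of its assigned points.

```python
import random

def kmeans(points: list[float], k: int, iters: int = 20) -> list[float]:
    """1-D K-means viewed as hard EM."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # E step (hard): each point is assigned entirely to one cluster.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # M step: each centroid maximizes fit to its assigned points (the mean).
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

random.seed(0)
print(sorted(kmeans([1.0, 1.2, 0.8, 5.0, 5.3, 4.9], k=2)))  # ≈ [1.0, 5.07]
```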
34. EM for Parameter Learning: E Step

- For each data point with missing values, compute the probability of each possible completion of that data point. Replace the original data point with all these completions, weighted by probabilities.
- Computing the probability of each completion (expectation) is just answering a query over the missing variables given the others.
35. EM for Parameter Learning: M Step

- Use the completed data set to update our Dirichlet distributions as we would use any complete data set, except that our counts (tallies) may be fractional now.
- Update CPTs based on the new Dirichlet distributions, as we would with any complete data set.
36. EM for Parameter Learning

- Iterate E and M steps until no changes occur. We will not necessarily get the global MAP (or ML, given uniform priors) setting of all the CPT entries, but under a natural set of conditions we are guaranteed convergence to a local MAP solution.
- The EM algorithm is used for a wide variety of tasks outside of BN learning as well.
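A minimal sketch of one soft-EM round for a single hidden binary variable (toy code under my own naming; the E step weights each completion by its posterior, and the M step adds those weights as fractional Dirichlet counts):

```python
def e_step(posteriors: list[float]) -> list[tuple[float, float]]:
    """Weight the two completions (X=1, X=0) of each data point."""
    return [(q, 1.0 - q) for q in posteriors]

def m_step(a: float, b: float,
           weights: list[tuple[float, float]]) -> tuple[float, float]:
    """Add the fractional completion counts to the beta/Dirichlet parameters."""
    return (a + sum(w1 for w1, _ in weights),
            b + sum(w0 for _, w0 in weights))

# Hypothetical posteriors P(X=1 | observed vars), from BN inference:
posteriors = [0.2, 0.98, 0.2]
a, b = 1.0, 4.0                            # prior counts; estimate 1/5 = 0.2
a, b = m_step(a, b, e_step(posteriors))
print(a / (a + b))                         # new estimate: 2.38 / 8.0 = 0.2975
```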
37. Subtlety for Parameter Learning

- Overcounting based on the number of iterations required to converge to settings for the missing values.
- After each repetition of the E step, reset all Dirichlet distributions before repeating the M step.
38. EM for Parameter Learning

Network: A and B are the parents of C; C is the parent of D and E. Current CPTs (point estimate, with pseudo-counts in parentheses):

  P(A) = 0.1 (1, 9)
  P(B) = 0.2 (1, 4)

  A B | P(C)
  T T | 0.9 (9, 1)
  T F | 0.6 (3, 2)
  F T | 0.3 (3, 7)
  F F | 0.2 (1, 4)

  C | P(D)
  T | 0.9 (9, 1)
  F | 0.2 (1, 4)

  C | P(E)
  T | 0.8 (4, 1)
  F | 0.1 (1, 9)

Data (C is never observed):

  A B C D E
  0 0 ? 0 0
  0 0 ? 1 0
  1 0 ? 1 1
  0 0 ? 0 1
  0 1 ? 1 0
  0 0 ? 0 1
  1 1 ? 1 1
  0 0 ? 0 0
  0 0 ? 1 0
  0 0 ? 0 1
39. EM for Parameter Learning

E step: using the CPTs above, compute the posterior over the missing value of C for each data point:

  A B C D E    P(C=0)  P(C=1)
  0 0 ? 0 0    0.99    0.01
  0 0 ? 1 0    0.80    0.20
  1 0 ? 1 1    0.02    0.98
  0 0 ? 0 1    0.80    0.20
  0 1 ? 1 0    0.70    0.30
  0 0 ? 0 1    0.80    0.20
  1 1 ? 1 1    0.003   0.997
  0 0 ? 0 0    0.99    0.01
  0 0 ? 1 0    0.80    0.20
  0 0 ? 0 1    0.80    0.20
40. Multiple Missing Values

Same network and CPTs as before; now consider a data point in which both A and C are missing:

  A B C D E
  ? 0 ? 0 1
41. Multiple Missing Values

E step: the four possible completions of (A, C), weighted by their posterior probability given B=0, D=0, E=1:

  A B C D E    weight
  0 0 0 0 1    0.72
  0 0 1 0 1    0.18
  1 0 0 0 1    0.04
  1 0 1 0 1    0.06
42. Multiple Missing Values

M step: adding these fractional counts updates the CPTs to:

  P(A) = 0.1 (1.1, 9.9)
  P(B) = 0.17 (1, 5)

  A B | P(C)
  T T | 0.9 (9, 1)
  T F | 0.6 (3.06, 2.04)
  F T | 0.3 (3, 7)
  F F | 0.2 (1.18, 4.72)

  C | P(D)
  T | 0.88 (9, 1.24)
  F | 0.17 (1, 4.76)

  C | P(E)
  T | 0.81 (4.24, 1)
  F | 0.16 (1.76, 9)
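The completion weights on slide 41 can be reproduced directly from the slide-38 CPTs; here is a short check (my own code; it enumerates the four completions and normalizes their joint probabilities):

```python
from itertools import product

# CPTs from slide 38, written as P(var = 1 | parent setting).
p_a, p_b = 0.1, 0.2
p_c = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.6, (1, 1): 0.9}   # keyed by (A, B)
p_d = {0: 0.2, 1: 0.9}                                       # keyed by C
p_e = {0: 0.1, 1: 0.8}                                       # keyed by C

def bern(p, v):
    """Probability of value v under a Bernoulli with P(1) = p."""
    return p if v == 1 else 1.0 - p

b, d, e = 0, 0, 1          # observed values; A and C are missing
joint = {(a, c): bern(p_a, a) * bern(p_b, b) * bern(p_c[(a, b)], c)
                 * bern(p_d[c], d) * bern(p_e[c], e)
         for a, c in product([0, 1], repeat=2)}
z = sum(joint.values())
for (a, c), w in joint.items():
    print(f"A={a}, C={c}: weight {w / z:.2f}")   # 0.72, 0.18, 0.04, 0.06
```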
43. Problems with EM

- Only finds a local optimum (not much way around that, though).
- Deterministic: if priors are uniform, it may be impossible to make any progress.
- The next figure illustrates the need for some randomization to move us off an uninformative prior.
44. What will EM do here?

Network: A → B → C. Data (B is never observed):

  A B C
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
45. EM Dependent on Initial Beliefs

Same network (A → B → C) and data as on the previous slide, but now the initial CPT for B is slightly non-uniform (0.6 vs. 0.4; see the next slide).
46. EM Dependent on Initial Beliefs

- B is more likely T than F when A is T. Filling this in makes C more likely T than F when B is T. This makes B still more likely T than F when A is T. Etc. A small change in the CPT for B (swapping 0.6 and 0.4) would have the opposite effect.
47. A Second Approach: Gibbs Sampling

- The idea is analogous to Gibbs sampling for Bayes Net inference, which we have seen in detail.
- First, initialize the values of hidden variables arbitrarily. Update CPTs based on the current (now complete) data.
- Second, choose one data point and one unobserved variable X for that data point.
48. Gibbs Sampling (Continued)

- (Continued) Reset the value of X within that data point based on the current CPTs and the current setting of the variables in the Markov blanket of X within that data point.
- Third, repeat this process for all the other unobserved variables throughout the data set, and then update the CPTs.
49. Gibbs Sampling (Continued)

- Fourth, iterate through the previous three steps some number of times (chain length).
- Gibbs is faster than (soft) EM if there are many missing values per data point.
- Gibbs is often slower than hard EM (we have more variables now than in the inference case), but results may be better.
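A sketch of the inner resampling step, using the A → B → C chain from slide 44 (toy code of my own construction): hidden B is redrawn from P(B | A, C) ∝ P(B | A) P(C | B), its Markov blanket distribution.

```python
import random

def resample_b(a: int, c: int, p_b_given_a: dict[int, float],
               p_c_given_b: dict[int, float]) -> int:
    """Redraw hidden B given its Markov blanket (parent A, child C)."""
    w1 = p_b_given_a[a] * (p_c_given_b[1] if c == 1 else 1 - p_c_given_b[1])
    w0 = (1 - p_b_given_a[a]) * (p_c_given_b[0] if c == 1 else 1 - p_c_given_b[0])
    return 1 if random.random() < w1 / (w0 + w1) else 0

# Hypothetical current CPT estimates: P(B=1 | A) and P(C=1 | B).
p_b_given_a = {0: 0.4, 1: 0.6}
p_c_given_b = {0: 0.3, 1: 0.7}
print(resample_b(a=1, c=1, p_b_given_a=p_b_given_a, p_c_given_b=p_c_given_b))
```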
50. Approach 3: Gradient Ascent

- We want to maximize posterior probability (or likelihood, if priors are uniform). Where w1, ..., wk are the probabilities in the CPTs (analogous to weights in a neural network) and D1, ..., Dn are the data points, we will use a greedy hill-climbing search (making small changes in the direction of the gradient) to maximize

  P(D | w1, ..., wk) = P(D1 | w1, ..., wk) · ... · P(Dn | w1, ..., wk).
51. Gradient Ascent (Continued)

- We must first define the gradient (slope) of the function we wish to maximize. It turns out to be easier to do this with the logarithm of the posterior probability or likelihood. Russell & Norvig focus on likelihood, so we will as well. Maximizing log likelihood also maximizes likelihood, but the function is easier to work with (additive).
52. Gradient Ascent (Continued)

- Because the log likelihood is additive, we simply need to compute the gradient relative to each individual probability wi, which we can do as follows:

  ∂/∂wi ln P(D | w) = Σj ∂/∂wi ln P(Dj | w)
53. Gradient Ascent (Continued)

- Based on the preceding derivation, we can calculate the gradient contribution of each data point and sum the contributions.
- Hence we want to find the gradient contribution for a single case (Dj) from the single CPT with which wi is associated.
- Assume wi is the probability that Xi = xi given that the parents Pa(Xi) = U are set to ui. So wi = P(xi | ui).
54. Gradient Ascent (Continued)

- We now work with the probability of these settings out of all possible settings:

  P(Dj | w) = Σ(x,u) P(Dj | x, u) P(x | u) P(u)

  where the sum ranges over all settings (x, u) of Xi and its parents U.
55. Gradient Ascent (Continued)

- In the preceding, wi appears in only one term of the summation: the one where x = xi and u = ui. For this term, P(x | u) = wi. So

  ∂/∂wi P(Dj | w) = P(Dj | xi, ui) P(ui)
56. Gradient Ascent (Continued)

  ∂/∂wi ln P(Dj | w) = [∂/∂wi P(Dj | w)] / P(Dj | w)
                     = P(Dj | xi, ui) P(ui) / P(Dj | w)
                     = P(xi, ui | Dj) / wi    (by Bayes' rule, since P(xi | ui) = wi)
57. Gradient Ascent (Continued)

- We can compute P(xi, ui | Dj) using one of our already-studied mechanisms for Bayes Net inference (answering queries from Bayes Nets).
- We then sum over all data points to get the gradient contribution from each probability. We assume the probabilities are independent of one another, making it easy to take a small step up the gradient.
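A minimal sketch of the resulting update (my own toy code; it assumes each posterior P(xi, ui | Dj) has already been computed by BN inference, and it renormalizes each CPT row after the step so the entries stay valid probabilities — one simple way to handle the constraint the slides gloss over):

```python
def gradient_step(row: list[float], posteriors_per_value: list[list[float]],
                  lr: float = 0.05) -> list[float]:
    """One ascent step on a single CPT row.
    row[v] = current P(X=v | u); posteriors_per_value[v] holds the values
    P(X=v, u | Dj) for each data point Dj, from BN inference."""
    # Slide 56: d/dw ln P(Dj) = P(xi, ui | Dj) / wi, summed over data points.
    grads = [sum(post) / row[v] for v, post in enumerate(posteriors_per_value)]
    stepped = [max(1e-9, row[v] + lr * g) for v, g in enumerate(grads)]
    z = sum(stepped)                      # project back onto the simplex
    return [s / z for s in stepped]

row = [0.3, 0.7]
posteriors = [[0.4, 0.1, 0.5], [0.2, 0.6, 0.1]]   # hypothetical inference results
print(gradient_step(row, posteriors))
```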