Learning in Bayes Nets (Transcript)
1
Learning in Bayes Nets
  • Task 1: Given the network structure and given
    data, where a data point is an observed setting
    for the variables, learn the CPTs for the Bayes
    net. We might also start with priors for the CPT
    probabilities.
  • Task 2: Given only the data (and possibly a prior
    over Bayes nets), learn the entire Bayes net
    (both the network structure and the CPTs).

2
Task 1: Maximum Likelihood by Example (Howard
1970)
  • Suppose we have a thumbtack (with a round flat
    head and sharp point) that when flipped can land
    either with the point up (tails) or with the
    point touching the ground (heads).
  • Suppose we flip the thumbtack 100 times, and 70
    times it lands on heads. Then we estimate that
    the probability of heads the next time is 0.7.
    This is the maximum likelihood estimate.

3
The General Maximum Likelihood Setting
  • We had a binomial distribution b(n, p) for n = 100,
    and we wanted a good guess at p.
  • We chose the p that would maximize the
    probability of our observation of 70 heads.
  • In general we have a parameterized distribution
    and want to estimate one or more parameters: we
    choose the value(s) that maximize the probability
    of the data (sketched below).
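As a quick check (a minimal Python sketch, not part of the original slides), a grid search over candidate values of p confirms that 0.7 maximizes the likelihood of the observation:

    from math import comb

    # Binomial likelihood of h heads in n flips, as a function of p.
    def likelihood(p, n=100, h=70):
        return comb(n, h) * p**h * (1 - p)**(n - h)

    # Search a grid of candidate values for p.
    best_p = max((i / 1000 for i in range(1, 1000)), key=likelihood)
    print(best_p)  # 0.7, the maximum likelihood estimate h/n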

4
Back to the Frequentist-Bayes Debate
  • The preceding seems backwards: we want to
    maximize the probability of p, not necessarily of
    the data (we already have it).
  • A Frequentist will say this is the best we can
    do: we can't talk about the probability of p; it is
    fixed (though unknown).
  • A Bayesian says the probability of p is the
    degree of belief we assign to it ...

5
Fortunately the Two Agree (Almost)
  • It turns out that for Bayesians, if our prior
    belief is that all values of p are equally
    likely, then after observing the data we'll
    assign the highest probability to the maximum
    likelihood estimate for p.
  • But what if our prior belief is different? How do
    we merge the prior belief with the data to get
    the best new belief?

6
Encode Prior Beliefs as a Beta Distribution
beta(p; a, b) = (Γ(a+b) / (Γ(a) Γ(b))) p^(a-1) (1-p)^(b-1), for 0 ≤ p ≤ 1
7
Any intuition for this?
  • For any positive integer y, Γ(y) = (y-1)!.
  • Suppose we use this, and we also replace
  • x with p
  • a-1 with x
  • a+b-2 with n
  • Then we get ((n+1)! / (x! (n-x)!)) p^x (1-p)^(n-x)
  • The beta(a,b) is just the binomial(n,p)
    probability of x heads in n flips, where n = a+b-2
    and p becomes the variable. With the change
    of variable, we need a different normalizing
    constant so the sum (integral) is 1. Hence
    (n+1)! replaces n!.

8
(No Transcript)
9
Incorporating a Prior
  • We assume a beta distribution as our prior
    distribution over the parameter p.
  • Nice properties: it is unimodal; we can choose the mode
    to reflect the most probable value; and we can choose
    the variance to reflect our confidence in this
    value.
  • Best property: a beta distribution is
    parameterized by two positive numbers, a

10
Beta Distribution (Continued)
  • (Continued) and b. Higher values of a relative
    to b cause the mode of the distribution to be
    more to the right, and higher values of both a and
    b cause the distribution to be more peaked (lower
    variance). We might, for example, take a to be the
    number of heads, and b to be the number of tails.
    At any time, the mean of

11
Beta Distribution (Continued)
  • (Continued) the beta distribution (the
    expected value of p) is a/(a+b), and as we get more
    data, the distribution becomes more peaked,
    reflecting higher confidence in our expectation.
    So we can specify our prior belief for p by
    choosing initial values for a and b such that
    a/(a+b) = p, and we can specify confidence in this
    belief with high

12
Beta Distribution (Continued)
  • (Continued) initial values for a and b.
    Updating our prior belief based on data to obtain
    a posterior belief simply requires incrementing a
    for every heads outcome and incrementing b for
    every tails outcome.
  • So after h heads (and n-h tails) out of n flips, our
    posterior distribution says P(heads) = (a+h)/(a+b+n).
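The update rule is simple enough to sketch directly (Python; the numbers are illustrative, not from the slides):

    # Prior pseudo-counts: a for heads, b for tails.
    a, b = 3, 3                    # prior P(heads) = a/(a+b) = 0.5, low confidence

    # Observe h heads and t tails; the posterior is beta(a+h, b+t).
    h, t = 70, 30
    a, b = a + h, b + t
    print(a / (a + b))             # posterior P(heads) = 73/106 ≈ 0.69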

13
Dirichlet Distributions
  • What if our variable is not Boolean but can take
    on more values? (Let's still assume our
    variables are discrete.)
  • Dirichlet distributions are an extension of beta
    distributions for the multi-valued case
    (corresponding to the extension from binomial to
    multinomial distributions).
  • A Dirichlet distribution over a variable with n
    values has n parameters rather than 2.
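The same counting trick extends directly to the multi-valued case; a sketch with hypothetical values (Python):

    # Dirichlet prior over a 3-valued variable, one pseudo-count per value.
    counts = {"red": 1, "green": 1, "blue": 1}     # uniform prior

    # Observing a value simply increments its count.
    for observation in ["red", "red", "blue", "red"]:
        counts[observation] += 1

    total = sum(counts.values())
    print({v: c / total for v, c in counts.items()})
    # {'red': 4/7, 'green': 1/7, 'blue': 2/7}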

14
Back to Frequentist-Bayes Debate
  • Recall that under the frequentist view we
    estimate each parameter p by taking the ML
    (maximum likelihood) estimate: the value
    for p that maximizes the probability of the
    data.
  • Under the Bayesian view, we now have a prior
    distribution over values of p. If this prior is
    a beta, or more generally a Dirichlet

15
Frequentist-Bayes Debate (Continued)
  • (Continued) then we can update it to a posterior
    distribution quite easily using the data, as
    illustrated in the thumbtack example. The result
    yields a new value for the parameter p we wish to
    estimate (e.g., probability of heads), called the
    MAP (maximum a posteriori) estimate.
  • If our prior distribution is uniform over values
    for p, then ML and MAP agree.

16
Learning CPTs from Complete Settings
  • Suppose we are given a set of data, where each
    data point is a complete setting for all the
    variables.
  • One assumption we make is that the data set is a
    random sample from the distribution we're trying
    to model.
  • For each node in our network, we consider each
    entry in its CPT (each setting of values

17
Learning CPTs (Continued)
  • (Continued) for its parents). For each entry in
    the CPT, we have a prior (possibly uniform)
    Dirichlet distribution over its values. We
    simply update this distribution based on the
    relevant data points (those that agree on the
    settings for the parents that correspond with
    this CPT entry).
  • A second, implicit assumption is that the

18
Learning CPTs (Continued)
  • (Continued) distributions over different rows of
    the CPT are independent of one another.
  • Finally, it is worth noting that instead of this
    last assumption, we might have a stronger bias
    over the form of the CPT. We might believe it is
    a noisy-OR, a linear function, or a tree, in
    which case we would instead use machine learning,
    linear regression, etc.
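A minimal sketch of the basic counting procedure for one node (Python; the data and names are hypothetical):

    from collections import defaultdict

    # Complete data points, reduced to (parent setting, child value) for one node.
    data = [(("T", "T"), 1), (("T", "T"), 1), (("T", "F"), 0), (("T", "T"), 0)]

    # One Dirichlet (here beta) per CPT row, initialized to a uniform (1, 1) prior.
    row_counts = defaultdict(lambda: [1, 1])       # [count of 0s, count of 1s]

    # Each data point updates only the row matching its parents' setting.
    for parents, value in data:
        row_counts[parents][value] += 1

    for parents, (n0, n1) in row_counts.items():
        print(parents, "P(child = 1) =", n1 / (n0 + n1))
    # ('T', 'T'): 3/5; ('T', 'F'): 1/3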

19
Simple Example
  • Suppose we believe the variables PinLength and
    HeadWeight directly influence whether a thumbtack
    comes up heads or tails. For simplicity, suppose
    PinLength can be long or short and HeadWeight can
    be heavy or light.
  • Suppose we adopt the following prior over the CPT
    entries for the variable Thumbtack.

20
Simple Example (Continued)
(Table of prior beta parameters for the CPT columns of Thumbtack; not transcribed)
21
Simple Example (Continued)
  • Notice that we have equal confidence in our prior
    (initial) probabilities for the first and last
    columns of the CPT, less confidence in those of
    the second column, and more in those of the third
    column.
  • A new data point will affect only one of the
    columns. A new data point will have more effect
    on the second column than the others.

22
More Difficult Case: What if Some Variables Are
Missing?
  • Recall our earlier notion of hidden variables.
  • Sometimes a variable is hidden because it cannot
    be explicitly measured. For example, we might
    hypothesize that a chromosomal abnormality is
    responsible for some patients with a particular
    cancer not responding well to treatment.

23
Missing Values (Continued)
  • We might include a node for this chromosomal
    abnormality in our network because we strongly
    believe it exists, other variables can be used to
    predict it, and it is in turn predictive of still
    other variables.
  • But in estimating CPTs from data, none of our
    data points has a value for this variable.

24
Missing Values (Continued)
  • This missing-value (hidden-variable) problem
    arises frequently.
  • Chicken-and-egg issue: if we had the CPTs, we could
    fill in the data; if we had the data, we could
    estimate the CPTs.
  • We do have partial data and partial (prior) CPTs.
    Can we somehow leverage these into full data and
    posterior CPTs?

25
Three Approaches
  • Expectation-Maximization (EM) Algorithm.
  • Gibbs Sampling (again).
  • Gradient Ascent (Hill-climbing).

26
K-Means as EM
27
K-Means as EM
28
K-Means as EM
29
K-Means as EM
30
K-Means as EM
31
K-Means as EM
32
General EM Framework
  • Given: data with missing values, a space of
    possible models, and an initial model.
  • Repeat until no change greater than threshold
    (a skeleton of the loop follows this list):
  • Expectation (E) Step: Compute the expectation over
    the missing values, given the model.
  • Maximization (M) Step: Replace the current model with
    the model that maximizes the probability of the data.
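The loop itself (Python; the helper functions are placeholders for the two steps described above, not a specific library API):

    def em(data, model, threshold=1e-4):
        """Generic EM: alternate expectation and maximization until convergence."""
        while True:
            completions = expected_completions(data, model)   # E step
            new_model = maximize_likelihood(completions)      # M step
            if change(model, new_model) <= threshold:
                return new_model
            model = new_model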

33
(Soft) EM vs. Hard EM
  • Standard (soft) EM: the expectation is a probability
    distribution.
  • Hard EM: the expectation is all or nothing, the most
    probable value.
  • K-means is usually run as hard EM but doesn't
    have to be.
  • The advantage of hard EM is computational efficiency
    when the expectation is over a state consisting of
    values for multiple variables (the next example
    illustrates this).

34
EM for Parameter Learning: E Step
  • For each data point with missing values, compute
    the probability of each possible completion of
    that data point. Replace the original data point
    with all these completions, weighted by their
    probabilities.
  • Computing the probability of each completion
    (the expectation) is just answering a query over the
    missing variables given the others.

35
EM for Parameter Learning: M Step
  • Use the completed data set to update our
    Dirichlet distributions as we would use any
    complete data set, except that our counts
    (tallies) may now be fractional.
  • Update the CPTs based on the new Dirichlet
    distributions, as we would with any complete data set.
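A sketch of the M step for one CPT row (Python; a weighted version of the complete-data update above, with illustrative numbers):

    # Relevant completions for this row, weighted by their E-step probabilities.
    weighted_data = [(0, 0.99), (1, 0.01), (0, 0.80), (1, 0.20)]

    a, b = 1, 4                    # prior Dirichlet (beta) counts for this row
    for value, weight in weighted_data:
        if value == 1:
            a += weight            # fractional count of "true" outcomes
        else:
            b += weight            # fractional count of "false" outcomes

    print(a / (a + b))             # updated CPT entry ≈ 0.17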

36
EM for Parameter Learning
  • Iterate the E and M steps until no changes occur. We
    will not necessarily get the global MAP (or ML,
    given uniform priors) setting of all the CPT
    entries, but under a natural set of conditions
    we are guaranteed convergence to a local MAP
    solution.
  • The EM algorithm is used for a wide variety of tasks
    outside of BN learning as well.

37
Subtlety for Parameter Learning
  • Overcounting based on number of interations
    required to converge to settings for the missing
    values.
  • After each repetition of E step, reset all
    Dirichlet distributions before repeating M step.

38
EM for Parameter Learning
Network: A and B are the parents of C; C is the parent of D and E.
Initial CPTs, with (true, false) Dirichlet pseudo-counts:
  P(A) = 0.1 (1,9)
  P(B) = 0.2 (1,4)
  A B P(C): T T 0.9 (9,1); T F 0.6 (3,2); F T 0.3 (3,7); F F 0.2 (1,4)
  C P(D): T 0.9 (9,1); F 0.2 (1,4)
  C P(E): T 0.8 (4,1); F 0.1 (1,9)
Data (C is unobserved; ? marks the missing value):
  A B C D E
  0 0 ? 0 0
  0 0 ? 1 0
  1 0 ? 1 1
  0 0 ? 0 1
  0 1 ? 1 0
  0 0 ? 0 1
  1 1 ? 1 1
  0 0 ? 0 0
  0 0 ? 1 0
  0 0 ? 0 1
39
EM for Parameter Learning
E step: each data point's missing value of C is replaced by a weighted pair of completions, (P(C=0), P(C=1)), computed from the current CPTs:
  A B C D E    (P(C=0), P(C=1))
  0 0 ? 0 0    (0.99, 0.01)
  0 0 ? 1 0    (0.80, 0.20)
  1 0 ? 1 1    (0.02, 0.98)
  0 0 ? 0 1    (0.80, 0.20)
  0 1 ? 1 0    (0.70, 0.30)
  0 0 ? 0 1    (0.80, 0.20)
  1 1 ? 1 1    (0.003, 0.997)
  0 0 ? 0 0    (0.99, 0.01)
  0 0 ? 1 0    (0.80, 0.20)
  0 0 ? 0 1    (0.80, 0.20)
(Network and CPTs as on the previous slide.)
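The first completion above can be reproduced in a few lines (Python sketch using the slide's CPTs):

    # P(C | A=0, B=0, D=0, E=0), inferring the single hidden node C.
    p_c = 0.2                          # P(C=1 | A=F, B=F)
    p_d = {1: 0.9, 0: 0.2}             # P(D=1 | C)
    p_e = {1: 0.8, 0: 0.1}             # P(E=1 | C)

    # Unnormalized weight of each completion: P(C=c | A,B) P(D=0 | c) P(E=0 | c)
    w1 = p_c * (1 - p_d[1]) * (1 - p_e[1])          # C = 1
    w0 = (1 - p_c) * (1 - p_d[0]) * (1 - p_e[0])    # C = 0
    z = w0 + w1
    print(w0 / z, w1 / z)              # ≈ 0.99, 0.01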
40
Multiple Missing Values
Same network and CPTs; now consider a data point with two missing values:
  A B C D E
  ? 0 ? 0 1
41
Multiple Missing Values
E step: the data point ? 0 ? 0 1 is replaced by its four joint completions of (A, C), weighted by their probabilities:
  A B C D E    weight
  0 0 0 0 1    0.72
  0 0 1 0 1    0.18
  1 0 0 0 1    0.04
  1 0 1 0 1    0.06
(Network and CPTs as before.)
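The four weights can be checked the same way (Python sketch using the slide's CPTs; the observed values are B=0, D=0, E=1):

    from itertools import product

    p_a = 0.1                                  # P(A=1)
    p_c = {1: 0.6, 0: 0.2}                     # P(C=1 | A, B=F), keyed by A
    p_d = {1: 0.9, 0: 0.2}                     # P(D=1 | C)
    p_e = {1: 0.8, 0: 0.1}                     # P(E=1 | C)

    weights = {}
    for a_val, c_val in product([0, 1], repeat=2):
        pa = p_a if a_val else 1 - p_a
        pc = p_c[a_val] if c_val else 1 - p_c[a_val]
        weights[(a_val, c_val)] = pa * pc * (1 - p_d[c_val]) * p_e[c_val]

    z = sum(weights.values())
    print({k: round(w / z, 2) for k, w in weights.items()})
    # {(0,0): 0.72, (0,1): 0.18, (1,0): 0.04, (1,1): 0.06}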
42
Multiple Missing Values
M step: updated CPTs (and Dirichlet counts) after incorporating the weighted completions:
  P(A) = 0.1 (1.1, 9.9)
  P(B) = 0.17 (1, 5)
  A B P(C): T T 0.9 (9,1); T F 0.6 (3.06, 2.04); F T 0.3 (3,7); F F 0.2 (1.18, 4.72)
  C P(D): T 0.88 (9, 1.24); F 0.17 (1, 4.76)
  C P(E): T 0.81 (4.24, 1); F 0.16 (1.76, 9)
43
Problems with EM
  • Only local optimum (not much way around that,
    though).
  • Deterministic if priors are uniform, may be
    impossible to make any progress
  • next figure illustrates the need for some
    randomization to move us off an uninformative
    prior

44
What will EM do here?
Network: A is the parent of B; B is the parent of C.
Data (B is unobserved):
  A B C
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
45
EM Dependent on Initial Beliefs
(Initial CPTs not transcribed.)
Data (B is unobserved):
  A B C
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
46
EM Dependent on Initial Beliefs
B is more likely T than F when A is T. Filling
this in makes C more likely T than F when B is T.
This in turn makes B still more likely T than F when A
is T, and so on. A small change in the CPT for B (swapping
0.6 and 0.4) would have the opposite effect.
Data (B is unobserved):
  A B C
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
47
A Second Approach Gibbs Sampling
  • The idea is analogous to Gibbs Sampling for Bayes
    Net inference, which we have seen in detail.
  • First, initialize the values of hidden variables
    arbitrarily. Update CPTs based on current (now
    complete) data.
  • Second, choose one data point and one unobserved
    variable X for that data point.

48
Gibbs Sampling (Continued)
  • (Continued) Reset the value of X within that
    data point based on the current CPTs and the
    current setting of the variables in the Markov
    Blanket of X within that data point.
  • Third, repeat this process for all the other
    unobserved variables throughout the data set and
    then update the CPTs.

49
Gibbs Sampling (Continued)
  • Fourth, iterate through the previous three steps
    some number of times (the chain length).
  • Gibbs is faster than (soft) EM if each data point has
    many missing values.
  • Gibbs is often slower than hard EM (we have more
    variables now than in the inference case), but the
    results may be better.
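One resampling step is easy to sketch (Python; the helper and the distribution are illustrative, not from the slides):

    import random

    def resample(dist):
        """Draw a value from P(X | Markov blanket of X), given as value -> prob."""
        r, cum = random.random(), 0.0
        for value, prob in dist.items():
            cum += prob
            if r <= cum:
                return value
        return value                   # guard against floating-point round-off

    # E.g., resampling a hidden C using a distribution like those computed earlier:
    new_c = resample({0: 0.99, 1: 0.01})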

50
Approach 3 Gradient Ascent
  • We want to maximize the posterior probability (or the
    likelihood, if priors are uniform). Where w1,…,wk are
    the probabilities in the CPTs (analogous to
    weights in a neural network) and D1,…,Dn are the
    data points, we will use a greedy hill-climbing
    search (making small changes in the direction of
    the gradient) to maximize P(D | w1,…,wk) =
    P(D1 | w1,…,wk) … P(Dn | w1,…,wk).

51
Gradient Ascent (Continued)
  • We must first define the gradient (slope) of the
    function we wish to maximize. It turns out to be
    easier to do this with the logarithm of the
    posterior probability or likelihood. Russell &
    Norvig focus on likelihood, so we will as well.
    Maximizing the log likelihood also maximizes the
    likelihood, but the function is easier to work
    with (it is additive).

52
Gradient Ascent (Continued)
  • Because the log likelihood is additive, we simply
    need to compute the gradient relative to each
    individual probability wi, which we can do as
    follows:
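In symbols, following Russell & Norvig's derivation:

    ∂ ln P(D) / ∂wi = Σj ∂ ln P(Dj) / ∂wi = Σj (1 / P(Dj)) ∂P(Dj) / ∂wi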

53
Gradient Ascent (Continued)
  • Based on the preceding derivation, we can
    calculate the gradient contribution of each data
    point and sum the contributions.
  • Hence we want to find the gradient contribution
    for a single case (Dj) from the single CPT with
    which wi is associated.
  • Assume wi is the probability that Xi = xi given
    that its parents U = Pa(Xi) are set to ui. So wi = P(xi | ui).

54
Gradient Ascent (Continued)
  • We now work with the probability of these
    settings out of all possible settings:

55
Gradient Ascent (Continued)
  • In the preceding, wi appears in only one term of
    the summation: the term where x = xi and u = ui. For this
    term, P(x | u) = wi. So:
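    ∂P(Dj) / ∂wi = P(Dj | xi, ui) P(ui),  so  ∂ ln P(Dj) / ∂wi = P(Dj | xi, ui) P(ui) / P(Dj)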

56
Gradient Ascent (Continued)
  • Applying Bayes' Theorem:
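    P(Dj | xi, ui) = P(xi, ui | Dj) P(Dj) / P(xi, ui),  and  P(xi, ui) = P(xi | ui) P(ui) = wi P(ui)

Substituting gives the per-data-point gradient:

    ∂ ln P(Dj) / ∂wi = P(xi, ui | Dj) / wi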

57
Gradient Ascent (Continued)
  • We can compute P(xi, ui | Dj) using one of our
    already-studied mechanisms for Bayes net
    inference (answering queries from Bayes nets).
  • We then sum over all data points to get the
    gradient contribution from each probability. We
    assume the probabilities are independent of one
    another, making it easy to take a small step up
    the gradient.