Sampling Bayesian Networks - PowerPoint PPT Presentation

1
Sampling Bayesian Networks
  • ICS 295
  • 2008

2
Answering BN Queries
  • Probability of Evidence P(e) is NP-hard
  • Conditional Prob. P(xi|e) is NP-hard
  • MPE x* = arg max P(x|e) is NP-hard
  • MAP y* = arg max P(y|e), y ⊂ x, is NP^PP-hard
  • Approximating P(e) or P(xi|e) within ε is NP-hard

3
Approximation Algorithms
  • Structural Approximations
  • Eliminate some dependencies
  • Remove edges
  • Mini-Bucket Approach
  • Search
  • Approach for optimization tasks MPE, MAP
  • Sampling
  • Generate random samples and compute values of
    interest from the samples, not from the original
    network

4
Algorithm Tree
5
Sampling Fundamentals
Given a set of variables X = {X1, X2, …, Xn}, a
joint probability distribution π(X) and some
function g(X), we can compute the expected value of
g(X):  E[g(X)] = Σx g(x) π(x)
6
Sampling From ?(X)
A sample S^t is an instantiation of all variables:
S^t = (x1^t, x2^t, …, xn^t). Given independent,
identically distributed (iid) samples S^1, S^2, …, S^T
from π(X), it follows from the Strong Law of Large
Numbers that (1/T) Σt g(S^t) → E[g(X)] as T → ∞.
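The SLLN estimator above can be sketched in a few lines of Python. The target distribution and the function g below are illustrative choices, not from the slides:

```python
import random

random.seed(0)

# Illustrative target pi over {0, 1, 2} and an arbitrary function g.
pi = {0: 0.2, 1: 0.5, 2: 0.3}
g = {0: 1.0, 1: 4.0, 2: 9.0}

# Exact expected value: E[g(X)] = sum_x g(x) * pi(x)
exact = sum(g[x] * pi[x] for x in pi)  # 0.2*1 + 0.5*4 + 0.3*9 = 4.9

# iid samples S^1..S^T from pi; the sample average of g converges to E[g(X)].
T = 100_000
states, weights = zip(*pi.items())
samples = random.choices(states, weights=weights, k=T)
estimate = sum(g[s] for s in samples) / T

print(round(exact, 3), round(estimate, 2))
```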
7
Sampling Challenge
  • It is hard to generate samples from π(X)
  • Trade-Offs
  • Generate samples from Q(X)
  • Forward Sampling, Likelihood Weighting, IS
  • Try to find Q(X) close to π(X)
  • Generate dependent samples forming a Markov Chain
    from P'(X) ≈ π(X)
  • Metropolis, Metropolis-Hastings, Gibbs
  • Try to reduce dependence between samples

8
Markov Chain
  • A sequence of random values x^0, x^1, …, defined on
    a finite state space D(X), is called a Markov
    Chain if it satisfies the Markov Property:
    P(x^{t+1} = y | x^t, …, x^0) = P(x^{t+1} = y | x^t)
  • If P(x^{t+1} = y | x^t) does not change with t
    (time-homogeneous), then it is often expressed as a
    transition function A(x,y)

Liu, Ch. 12, p. 245
9
Markov Chain Monte Carlo
  • First, define a transition probability
    P(x^{t+1} = y | x^t)
  • Pick initial state x^0; usually not important
    because it becomes forgotten
  • Generate samples x^1, x^2, …, sampling each next
    value from P(X | x^t)

x^0 → x^1 → … → x^t → x^{t+1} → …
  • If we choose a proper P(x^{t+1} | x^t), we can
    guarantee that the distribution represented by
    samples x^0, x^1, … converges to π(X)
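A minimal MCMC simulation, assuming an illustrative 2-state transition function (the numbers are not from the slides; the stationary distribution of this particular chain works out to π(0) = 5/6):

```python
import random

random.seed(1)

# Transition function A(x, y) for a 2-state chain (illustrative numbers).
A = {0: {0: 0.9, 1: 0.1},
     1: {0: 0.5, 1: 0.5}}

# Stationary distribution solves pi(y) = sum_x pi(x) A(x, y):
# here pi(0) = 5/6, pi(1) = 1/6.

x = 0  # initial state x^0; its choice is forgotten as the chain runs
counts = {0: 0, 1: 0}
T = 200_000
for _ in range(T):
    ys, ps = zip(*A[x].items())
    x = random.choices(ys, weights=ps)[0]  # x^{t+1} ~ P(. | x^t)
    counts[x] += 1

freq0 = counts[0] / T
print(round(freq0, 3))  # close to 5/6 ≈ 0.833
```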

10
Markov Chain Properties
  • Irreducibility
  • Periodicity
  • Recurrence
  • Reversibility
  • Ergodicity
  • Stationary Distribution

11
Irreducible
  • A state x is said to be irreducible if under
    the transition rule one has nonzero probability
    of moving from x to any other state and then
    coming back in a finite number of steps.
  • If one state is irreducible, then all the states
    must be irreducible.

Liu, Ch. 12, pp. 249, Def. 12.1.1
12
Aperiodic
  • A state x is aperiodic if the greatest common
    divisor of {n : A^n(x,x) > 0} is 1.
  • If state x is aperiodic and the chain is
    irreducible, then every state must be aperiodic.

Liu, Ch. 12, pp. 249-250, Def. 12.1.1
13
Recurrence
  • A state x is recurrent if the chain returns to x
    with probability 1
  • State x is recurrent if and only if Σn A^n(x,x) = ∞
  • Let M(x) be the expected number of steps to
    return to state x
  • State x is positive recurrent if M(x) is finite
  • The recurrent states in a finite state chain are
    positive recurrent.

14
Ergodicity
  • A state x is ergodic if it is aperiodic and
    positive recurrent.
  • If all states in a Markov chain are ergodic then
    the chain is ergodic.

15
Reversibility
  • Detailed balance condition: π(x)A(x,y) = π(y)A(y,x)
  • A Markov chain is reversible if there is a π such
    that the detailed balance condition holds
  • For a reversible Markov chain, π is always a
    stationary distribution.

16
Stationary Distribution
  • If the Markov chain is time-homogeneous, then the
    vector π(X) is a stationary distribution (aka
    invariant or equilibrium distribution, aka fixed
    point) if its entries sum up to 1 and satisfy
    π(y) = Σx π(x) A(x,y)
  • An irreducible chain has a stationary
    distribution if and only if all of its states are
    positive recurrent. The distribution is unique.

17
Stationary Distribution In Finite State Space
  • A stationary distribution always exists but may
    not be unique
  • If a finite-state Markov chain is irreducible and
    aperiodic, the stationary distribution is
    guaranteed to be unique, and A^(n) = P(x^n = y | x^0)
    converges to a rank-one matrix in which each row
    is the stationary distribution π.
  • Thus, the initial state x^0 is not important for
    convergence: it gets forgotten and we start
    sampling from the target distribution
  • However, it is important how long it takes to
    forget it!

18
Convergence Theorem
  • Given a finite-state Markov chain whose
    transition function is irreducible and aperiodic,
    A^n(x^0, y) converges to its invariant
    distribution π(y) geometrically in variation
    distance: there exist 0 < r < 1 and c > 0
    s.t. ||A^n(x^0, ·) − π|| ≤ c·r^n

19
Eigen-Value Condition
  • Convergence to the stationary distribution is
    driven by the eigen-values of the matrix A(x,y).
  • The chain will converge to its unique invariant
    distribution if and only if matrix A's second
    largest eigen-value in modulus is strictly less
    than 1.
  • Many proofs of convergence are centered around
    analyzing the second eigen-value.

Liu, Ch. 12, p. 249
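For a 2-state chain the second eigenvalue can be read off in closed form (trace = 1 + λ2), so the geometric rate is easy to check numerically. The transition matrix below is an illustrative choice, not from the slides:

```python
# For a 2-state transition matrix the eigenvalues are 1 and (a00 + a11 - 1).
a00, a01 = 0.9, 0.1
a10, a11 = 0.5, 0.5

lambda2 = a00 + a11 - 1  # second eigenvalue; governs convergence rate
print(round(lambda2, 3))

def matmul(P, Q):
    # 2x2 matrix product
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[a00, a01], [a10, a11]]
An = A
for _ in range(19):  # compute A^20 by repeated multiplication
    An = matmul(An, A)

pi0 = 5 / 6                # stationary probability of state 0 for this chain
err = abs(An[0][0] - pi0)  # distance of A^n's first row from pi
print(err)                 # on the order of lambda2**20, i.e. tiny
```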
20
Convergence In Finite State Space
  • Assume a finite-state Markov chain is irreducible
    and aperiodic
  • The initial state x^0 is not important for
    convergence: it gets forgotten and we start
    sampling from the target distribution
  • However, it is important how long it takes to
    forget it! This is known as the burn-in time
  • Since the first k states are not drawn exactly
    from π, they are often thrown away. Open
    question: how big should k be?

21
Sampling in BN
  • Same idea: generate a set of T samples
  • Estimate P(Xi|E) from the samples
  • Challenge: X is a vector and P(X) is a huge
    distribution represented by the BN
  • Need to know
  • How to generate a new sample?
  • How many samples T do we need?
  • How to estimate P(E=e) and P(Xi|e)?

22
Sampling
  • Input Bayesian network with set of nodes X
  • Sample a tuple with assigned values
  • s = (X1=x1, X2=x2, …, Xk=xk)
  • Tuple may include all variables (except evidence)
    or a subset
  • Sampling schemas dictate how to generate samples
    (tuples)
  • Ideally, samples are distributed according to
    P(X|E)

23
Sampling Algorithms
  • Forward Sampling
  • Gibbs Sampling (MCMC)
  • Blocking
  • Rao-Blackwellised
  • Likelihood Weighting
  • Importance Sampling
  • Sequential Monte-Carlo (Particle Filtering) in
    Dynamic Bayesian Networks

24
Gibbs Sampling
  • Markov Chain Monte Carlo method
  • (Gelfand and Smith, 1990, Smith and Roberts,
    1993, Tierney, 1994)
  • Transition probability equals the conditional
    distribution
  • Example: π(X,Y), A(x^{t+1} | y^t) = P(x|y),
    A(y^{t+1} | x^t) = P(y|x)

x^0 → y^0 → x^1 → y^1 → …
25
Gibbs Sampling for BN
  • Samples are dependent and form a Markov Chain
  • Sample from P'(X|e), which converges to P(X|e)
  • Guaranteed to converge when all P > 0
  • Methods to improve convergence
  • Blocking
  • Rao-Blackwellised
  • Error Bounds
  • Lag-t autocovariance
  • Multiple Chains, Chebyshev's Inequality

26
Gibbs Sampling (Pearl, 1988)
  • A sample t ∈ {1,2,…} is an instantiation of all
    variables in the network
  • Sampling process
  • Fix values of observed variables e
  • Instantiate node values in sample x^0 at random
  • Generate samples x^1, x^2, …, x^T from P(x|e)
  • Compute posteriors from samples

27
Ordered Gibbs Sampler
  • Generate sample x^{t+1} from x^t
  • In short, for i = 1 to N:
    x_i^{t+1} ← P(Xi | x_1^{t+1}, …, x_{i-1}^{t+1}, x_{i+1}^t, …, x_N^t, e)

Process All Variables In Some Order
28
Gibbs Sampling (contd)(Pearl, 1988)
Markov blanket
29
Ordered Gibbs Sampling Algorithm
  • Input: X, E = e
  • Output: T samples {x^t}
  • Fix evidence E = e
  • Generate samples from P(X | E = e)
  • For t = 1 to T (compute samples)
  • For i = 1 to N (loop through variables)
  • Xi ← sample x_i^t from P(Xi | markov blanket of Xi)
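The ordered Gibbs sampler above can be sketched for a toy chain network A → B → C with evidence C = 1. The CPT numbers are illustrative, not from the slides; for them the exact posterior is P(A=1 | C=1) = 0.222/0.404 ≈ 0.550:

```python
import random

random.seed(2)

# Toy chain BN A -> B -> C, binary nodes, illustrative CPTs.
pA1 = 0.3                # P(A=1)
pB1 = {0: 0.2, 1: 0.8}   # P(B=1 | A=a)
pC1 = {0: 0.1, 1: 0.9}   # P(C=1 | B=b); evidence: C = 1 (fixed throughout)

def bernoulli(p):
    return 1 if random.random() < p else 0

def sample_A(b):
    # P(A | B=b) ∝ P(A) * P(B=b | A); A's Markov blanket here is just B
    w1 = pA1 * (pB1[1] if b else 1 - pB1[1])
    w0 = (1 - pA1) * (pB1[0] if b else 1 - pB1[0])
    return bernoulli(w1 / (w0 + w1))

def sample_B(a):
    # P(B | A=a, C=1) ∝ P(B | A=a) * P(C=1 | B)
    w1 = pB1[a] * pC1[1]
    w0 = (1 - pB1[a]) * pC1[0]
    return bernoulli(w1 / (w0 + w1))

# Ordered Gibbs: initialize at random, then resample each variable in turn.
a, b = bernoulli(0.5), bernoulli(0.5)
T, burn_in, hits_a = 50_000, 1_000, 0
for t in range(T + burn_in):
    a = sample_A(b)
    b = sample_B(a)
    if t >= burn_in:
        hits_a += a

p_a1_est = hits_a / T
print(round(p_a1_est, 3))  # exact P(A=1 | C=1) = 0.222/0.404 ≈ 0.550
```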

30
Answering Queries
  • Query P(xi | e) = ?
  • Method 1: count the fraction of samples where Xi = xi
  • Method 2: average the conditional probability of xi
    given the rest of the sample (mixture estimator)
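The two estimators can be compared on a deliberately simple two-variable setup (illustrative numbers, not from the slides): sample Y, then either count indicator values of X (Method 1) or average the known conditional P(X=1 | y^t) (Method 2). Both target the same quantity; the mixture estimator has lower variance:

```python
import random

random.seed(3)

# Illustrative setup: we can sample Y and we know P(X=1 | Y) exactly.
pY1 = 0.4
pX1_given_Y = {0: 0.1, 1: 0.7}
true_pX1 = 0.6 * 0.1 + 0.4 * 0.7  # marginal P(X=1) = 0.34

T = 20_000
count_est = 0.0    # Method 1: fraction of samples with X = 1
mixture_est = 0.0  # Method 2: average of P(X=1 | y^t)
for _ in range(T):
    y = 1 if random.random() < pY1 else 0
    x = 1 if random.random() < pX1_given_Y[y] else 0
    count_est += x
    mixture_est += pX1_given_Y[y]
count_est /= T
mixture_est /= T

print(round(count_est, 3), round(mixture_est, 3))  # both near 0.34
```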

31
Gibbs Sampling Example - BN
  • X = {X1, X2, …, X9}
  • E = {X9}

[Figure: BN over X1, …, X9]
32
Gibbs Sampling Example - BN
  • X1 = x1^0   X6 = x6^0
  • X2 = x2^0   X7 = x7^0
  • X3 = x3^0   X8 = x8^0
  • X4 = x4^0
  • X5 = x5^0

[Figure: BN over X1, …, X9]
33
Gibbs Sampling Example - BN
  • X1 ← P(X1 | x2^0, …, x8^0, x9)
  • E = {X9}

[Figure: BN over X1, …, X9]
34
Gibbs Sampling Example - BN
  • X2 ← P(X2 | x1^1, x3^0, …, x8^0, x9)
  • E = {X9}

[Figure: BN over X1, …, X9]
35
Gibbs Sampling Illustration
36
Gibbs Sampling Example Init
  • Initialize nodes with random values
  • X1 = x1^0   X6 = x6^0
  • X2 = x2^0   X7 = x7^0
  • X3 = x3^0   X8 = x8^0
  • X4 = x4^0
  • X5 = x5^0
  • Initialize Running Sums
  • SUM1 = 0
  • SUM2 = 0
  • SUM3 = 0
  • SUM4 = 0
  • SUM5 = 0
  • SUM6 = 0
  • SUM7 = 0
  • SUM8 = 0

37
Gibbs Sampling Example Step 1
  • Generate Sample 1
  • compute SUM1 += P(x1 | x2^0, x3^0, x4^0, x5^0, x6^0,
    x7^0, x8^0, x9)
  • select and assign new value X1 = x1^1
  • compute SUM2 += P(x2 | x1^1, x3^0, x4^0, x5^0, x6^0,
    x7^0, x8^0, x9)
  • select and assign new value X2 = x2^1
  • compute SUM3 += P(x3 | x1^1, x2^1, x4^0, x5^0, x6^0,
    x7^0, x8^0, x9)
  • select and assign new value X3 = x3^1
  • …
  • At the end, we have a new sample
  • S1 = (x1^1, x2^1, x3^1, …, x8^1, x9)

38
Gibbs Sampling Example Step 2
  • Generate Sample 2
  • Compute P(x1 | x2^1, x3^1, x4^1, x5^1, x6^1, x7^1,
    x8^1, x9)
  • select and assign new value X1 = x1^2
  • update SUM1 += P(x1 | x2^1, x3^1, x4^1, x5^1, x6^1,
    x7^1, x8^1, x9)
  • Compute P(x2 | x1^2, x3^1, x4^1, x5^1, x6^1, x7^1,
    x8^1, x9)
  • select and assign new value X2 = x2^2
  • update SUM2 += P(x2 | x1^2, x3^1, x4^1, x5^1, x6^1,
    x7^1, x8^1, x9)
  • Compute P(x3 | x1^2, x2^2, x4^1, x5^1, x6^1, x7^1,
    x8^1, x9)
  • select and assign new value X3 = x3^2
  • update SUM3 += P(x3 | x1^2, x2^2, x4^1, x5^1, x6^1,
    x7^1, x8^1, x9)
  • …
  • New sample S2 = (x1^2, x2^2, …, x8^2, x9)

39
Gibbs Sampling Example Answering Queries
  • P(x1 | x9) ≈ SUM1 / 2
  • P(x2 | x9) ≈ SUM2 / 2
  • P(x3 | x9) ≈ SUM3 / 2
  • P(x4 | x9) ≈ SUM4 / 2
  • P(x5 | x9) ≈ SUM5 / 2
  • P(x6 | x9) ≈ SUM6 / 2
  • P(x7 | x9) ≈ SUM7 / 2
  • P(x8 | x9) ≈ SUM8 / 2

40
Gibbs Convergence
  • Stationary distribution = target sampling
    distribution
  • MCMC converges to the stationary distribution if
    the network is ergodic
  • The chain is ergodic if all probabilities are
    positive
  • If ∃ i,j such that pij = 0, then we may not be
    able to explore the full sampling space!

41
Gibbs Sampling Burn-In
  • We want to sample from P(X | E)
  • But the starting point is random
  • Solution: throw away the first K samples
  • Known as Burn-In
  • What is K? Hard to tell. Use intuition.
  • Alternative: initialize the first sample with
    values drawn from an approximation of P(x|e)
    (for example, run IBP first)

42
Gibbs Sampling Performance
  • + Advantage: guaranteed to converge to P(X|E)
  • − Disadvantage: convergence may be slow
  • Problems
  • Samples are dependent !
  • Statistical variance is too big in
    high-dimensional problems

43
Gibbs Speeding Convergence
  • Objectives
  • Reduce dependence between samples
    (autocorrelation)
  • Skip samples
  • Randomize Variable Sampling Order
  • Reduce variance
  • Blocking Gibbs Sampling
  • Rao-Blackwellisation

44
Skipping Samples
  • Pick only every k-th sample (Geyer, 1992)
  • Can reduce dependence between samples!
  • Increases variance! Wastes samples!

45
Randomized Variable Order
  • Random Scan Gibbs Sampler
  • Pick each next variable Xi for update at random
    with probability pi, Σi pi = 1.
  • (In the simplest case, the pi are distributed
    uniformly.)
  • In some instances, reduces variance (MacEachern,
    Peruggia, 1999, Subsampling the Gibbs Sampler:
    Variance Reduction)

46
Blocking
  • Sample several variables together, as a block
  • Example: Given three variables X, Y, Z, with
    domains of size 2, group Y and Z together to form
    a variable W = {Y, Z} with domain size 4. Then,
    given sample (x^t, y^t, z^t), compute next sample:
  • x^{t+1} ← P(x | y^t, z^t) = P(x | w^t)
  • (y^{t+1}, z^{t+1}) = w^{t+1} ← P(w | x^{t+1})
  • + Can improve convergence greatly when two
    variables are strongly correlated!
  • − Domain of the block variable grows
    exponentially with the number of variables in a
    block!
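The X, Y, Z blocking example can be sketched directly: treat W = (Y, Z) as one variable with a 4-value domain and alternate between X and W. The joint π below is an illustrative table with strong Y-Z coupling, not from the slides:

```python
import random
from itertools import product

random.seed(4)

# Illustrative joint pi(x, y, z) over binary variables; Y and Z are coupled.
pi = {}
for x, y, z in product((0, 1), repeat=3):
    pi[(x, y, z)] = (0.4 if y == z else 0.1) * (0.7 if x == y else 0.3)
total = sum(pi.values())
pi = {k: v / total for k, v in pi.items()}

def draw(weights):
    # sample a key with probability proportional to its weight
    keys, ws = zip(*weights.items())
    return random.choices(keys, weights=ws)[0]

# Blocked Gibbs: W = (Y, Z) is resampled jointly from P(y, z | x).
x, (y, z) = 0, (0, 0)
T, burn, hits = 30_000, 500, 0
for t in range(T + burn):
    x = draw({v: pi[(v, y, z)] for v in (0, 1)})  # X ~ P(x | y, z)
    yz_weights = {(u, w): pi[(x, u, w)] for u, w in product((0, 1), repeat=2)}
    y, z = draw(yz_weights)  # (Y, Z) ~ P(y, z | x), sampled as one block
    if t >= burn:
        hits += x

p_x1 = hits / T
exact = sum(v for (a, _, _), v in pi.items() if a == 1)  # exact P(X=1) = 0.5
print(round(p_x1, 3), round(exact, 3))
```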

47
Blocking Gibbs Sampling
  • Jensen, Kong, Kjaerulff, 1993
  • Blocking Gibbs Sampling Very Large Probabilistic
    Expert Systems
  • Select a set of subsets
  • E1, E2, E3, …, Ek s.t. Ei ⊆ X
  • ∪i Ei = X
  • Ai = X \ Ei
  • Sample from P(Ei | Ai)

48
Rao-Blackwellisation
  • Do not sample all variables!
  • Sample a subset!
  • Example: Given three variables X, Y, Z, sample
    only X and Y, sum out Z. Given sample (x^t, y^t),
    compute next sample:
  • x^{t+1} ← P(x | y^t)
  • y^{t+1} ← P(y | x^{t+1})
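A minimal Rao-Blackwellised sketch: once Z has been summed out analytically, Gibbs runs on (X, Y) alone using the marginal conditionals. The joint over (X, Y) below is an illustrative table (its marginal gives P(X=1) = 0.5), not from the slides:

```python
import random

random.seed(5)

# Joint over (X, Y) after Z has been summed out (illustrative numbers).
pi = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
pX1_given_y = {0: 0.1 / 0.4, 1: 0.4 / 0.6}  # P(X=1 | y) = pi(1,y) / pi(.,y)
pY1_given_x = {0: 0.2 / 0.5, 1: 0.4 / 0.5}  # P(Y=1 | x) = pi(x,1) / pi(x,.)

# Two-variable Gibbs on the collapsed model: x ~ P(x | y), then y ~ P(y | x).
x, y = 0, 0
T, burn, hits = 30_000, 500, 0
for t in range(T + burn):
    x = 1 if random.random() < pX1_given_y[y] else 0  # x^{t+1} ~ P(x | y^t)
    y = 1 if random.random() < pY1_given_x[x] else 0  # y^{t+1} ~ P(y | x^{t+1})
    if t >= burn:
        hits += x

p_x1 = hits / T
print(round(p_x1, 3))  # marginal P(X=1) of this joint is 0.5
```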

49
Rao-Blackwell Theorem
Rao-Blackwell: Var(E[g(X) | Y]) ≤ Var(g(X)).
Bottom line: reducing the number of variables in a
sample reduces variance!
50
Blocking vs. Rao-Blackwellisation
  • Standard Gibbs:
  • P(x|y,z), P(y|x,z), P(z|x,y)  (1)
  • Blocking:
  • P(x|y,z), P(y,z|x)  (2)
  • Rao-Blackwellised:
  • P(x|y), P(y|x)  (3)
  • Var3 < Var2 < Var1
  • (Liu, Wong, Kong, 1994,
    Covariance structure of the Gibbs sampler)

[Figure: X — Y — Z]
51
Rao-Blackwellised Gibbs Cutset Sampling
  • Select C ⊆ X (possibly a cycle-cutset), |C| = m
  • Fix evidence E
  • Initialize nodes with random values:
  • for i = 1 to m: Ci = ci^0
  • For t = 1 to T, generate samples:
  • For i = 1 to m:
  • Ci = ci^{t+1} ← P(ci | c1^{t+1}, …, c_{i-1}^{t+1},
    c_{i+1}^t, …, cm^t, e)

52
Cutset Sampling
  • Select a subset C = {C1, …, CK} ⊆ X
  • A sample t ∈ {1,2,…} is an instantiation of C
  • Sampling process:
  • Fix values of observed variables e
  • Generate sample c^0 at random
  • Generate samples c^1, c^2, …, c^T from P(c|e)
  • Compute posteriors from samples

53
Cutset SamplingGenerating Samples
  • Generate sample c^{t+1} from c^t
  • In short, for i = 1 to K:
    ci^{t+1} ← P(ci | c1^{t+1}, …, c_{i-1}^{t+1},
    c_{i+1}^t, …, cK^t, e)

54
Rao-Blackwellised Gibbs Cutset Sampling
  • How to compute P(ci | c^t \ ci, e)?
  • Compute the joint P(ci, c^t \ ci, e) for each
    ci ∈ D(Ci)
  • Then normalize:
  • P(ci | c^t \ ci, e) ∝ P(ci, c^t \ ci, e)
  • Computation efficiency depends
    on the choice of C

55
Rao-Blackwellised Gibbs Cutset Sampling
  • How to choose C?
  • Special case: C is a cycle-cutset, O(N)
  • General case: apply Bucket Tree Elimination
    (BTE), O(exp(w)) where w is the induced width of
    the network when nodes in C are observed.
  • Pick C wisely so as to minimize w → notion of
    w-cutset

56
w-cutset Sampling
  • C = w-cutset of the network: a set of nodes such
    that when C and E are instantiated, the adjusted
    induced width of the network is ≤ w
  • Complexity of exact inference is
    bounded by w!
  • A cycle-cutset is a special case

57
Cutset Sampling-Answering Queries
  • Query ∀ci ∈ C, P(ci | e)? Same as Gibbs:
    computed while generating sample t
  • Special case of w-cutset
  • Query P(xi | e)?
    computed after generating sample t
58
Cutset Sampling Example
[Figure: BN over X1, …, X9, evidence E = {x9}]
59
Cutset Sampling Example
Sample a new value for X2
[Figure: BN over X1, …, X9]
60
Cutset Sampling Example
Sample a new value for X5
[Figure: BN over X1, …, X9]
61
Cutset Sampling Example
Query P(x2 | e) for sampled node X2 (Samples 1-3)

[Figure: BN over X1, …, X9]
62
Cutset Sampling Example
Query P(x3 | e) for non-sampled node X3

[Figure: BN over X1, …, X9]
63
Gibbs Error Bounds
  • Objectives:
  • Estimate the needed number of samples T
  • Estimate the error
  • Methodology:
  • 1 chain → use lag-k autocovariance
  • Estimate T
  • M chains → standard sampling variance
  • Estimate error

64
Gibbs lag-k autocovariance
Lag-k autocovariance:
γ(k) = (1/T) Σ_{t=1}^{T-k} (g(x^t) − ḡ)(g(x^{t+k}) − ḡ)
65
Gibbs lag-k autocovariance
Estimate of the Monte Carlo variance:
Var ≈ (1/T) [γ(0) + 2 Σ_{k=1}^{δ} γ(k)],
where δ is the smallest positive integer at which
the autocovariance estimates become negligible.
Effective chain size: T_eff ≈ γ(0) · T / Var.
In the absence of autocovariance: T_eff = T.
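These diagnostics can be sketched on a synthetic dependent chain. The AR(1)-style chain, the 0.05 truncation threshold, and the T/(1 + 2Σρ) form of the effective size are illustrative assumptions, not from the slides:

```python
import random

random.seed(6)

def autocovariance(xs, k):
    # Lag-k autocovariance: average of (x_t - mean)(x_{t+k} - mean).
    n = len(xs)
    m = sum(xs) / n
    return sum((xs[t] - m) * (xs[t + k] - m) for t in range(n - k)) / n

# Dependent chain (AR(1)-style, illustrative): x_t = 0.5 * x_{t-1} + noise.
xs, x = [], 0.0
for _ in range(50_000):
    x = 0.5 * x + random.gauss(0.0, 1.0)
    xs.append(x)

g0 = autocovariance(xs, 0)          # lag-0 autocovariance = variance
rho1 = autocovariance(xs, 1) / g0   # lag-1 autocorrelation, ~0.5 here

# Effective chain size: T / (1 + 2 * sum of autocorrelations);
# with no autocovariance it equals T.
T = len(xs)
rho_sum = 0.0
for k in range(1, T):
    r = autocovariance(xs, k) / g0
    if r < 0.05:  # truncate once correlations die out (assumed threshold)
        break
    rho_sum += r
T_eff = T / (1 + 2 * rho_sum)
print(round(rho1, 2), int(T_eff))  # far fewer effective samples than T
```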
66
Gibbs Multiple Chains
  • Generate M chains, each of size K
  • Each chain produces an independent estimate Pm

Estimate P(xi|e) as the average of the Pm(xi|e):
P̂ = (1/M) Σm Pm.
Treat the Pm as independent random variables.
67
Gibbs Multiple Chains
  • The Pm are independent random variables
  • Therefore: Var(P̂) ≈ (1/(M(M−1))) Σm (Pm − P̂)²
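A sketch of the multiple-chains estimate, with each chain replaced by a stand-in iid sampler targeting an assumed value 0.3 (a real run would use M independent Gibbs chains):

```python
import random

random.seed(7)

def run_chain(K):
    # Stand-in for one chain of size K estimating P(xi | e) = 0.3.
    return sum(random.random() < 0.3 for _ in range(K)) / K

# M independent chains each give an estimate Pm; treat them as iid.
M, K = 20, 5_000
estimates = [run_chain(K) for _ in range(M)]
p_hat = sum(estimates) / M                            # average of the Pm
s2 = sum((p - p_hat) ** 2 for p in estimates) / (M - 1)  # sample variance
var_of_mean = s2 / M  # Var(p_hat), since the Pm are independent

print(round(p_hat, 3), var_of_mean)
```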

68
Geman & Geman, 1984
  • Geman, S. and Geman, D., 1984. Stochastic
    relaxation, Gibbs distributions, and the Bayesian
    restoration of images. IEEE Trans. Pattern Anal.
    Mach. Intell. 6, 721-741.
  • Introduce Gibbs sampling
  • Place the idea of Gibbs sampling in a general
    setting in which the collection of variables is
    structured in a graphical model and each variable
    has a neighborhood corresponding to a local
    region of the graphical structure. Geman and
    Geman use the Gibbs distribution to define the
    joint distribution on this structured set of
    variables.

69
TannerWong 1987
  • Tanner and Wong (1987)
  • Data-augmentation
  • Convergence Results

70
Pearl, 1988
  • Pearl,1988. Probabilistic Reasoning in
    Intelligent Systems, Morgan-Kaufmann.
  • In the case of Bayesian networks, the
    neighborhoods correspond to the Markov blanket of
    a variable and the joint distribution is defined
    by the factorization of the network.

71
Gelfand & Smith, 1990
  • Gelfand, A.E. and Smith, A.F.M., 1990.
    Sampling-based approaches to calculating marginal
    densities. J. Am.Statist. Assoc. 85, 398-409.
  • Show variance reduction in using mixture
    estimator for posterior marginals.

72
Neal, 1992
  • R. M. Neal, 1992. Connectionist learning of
    belief networks, Artificial Intelligence, v. 56,
    pp. 71-118.
  • Stochastic simulation in Noisy-Or networks.

73
CPCS54 Test Results
MSE vs. samples (left) and time (right)
Ergodic, |X| = 54, |D(Xi)| = 2, |C| = 15, |E| =
4. Exact Time = 30 sec using Cutset Conditioning
74
CPCS179 Test Results
MSE vs. samples (left) and time (right)
Non-Ergodic (1 deterministic CPT entry), |X| =
179, |C| = 8, 2 ≤ |D(Xi)| ≤ 4, |E| = 35. Exact Time =
122 sec using Loop-Cutset Conditioning
75
CPCS360b Test Results
MSE vs. samples (left) and time (right)
Ergodic, |X| = 360, |D(Xi)| = 2, |C| = 21, |E| =
36. Exact Time > 60 min using Cutset
Conditioning. Exact values obtained via Bucket
Elimination
76
Random Networks
MSE vs. samples (left) and time (right). |X| =
100, |D(Xi)| = 2, |C| = 13, |E| = 15-20. Exact Time =
30 sec using Cutset Conditioning
77
Coding Networks
[Figure: coding network with inputs u1-u4, code bits
x1-x4, parity bits p1-p4, outputs y1-y4]

MSE vs. time (right). Non-Ergodic, |X| = 100,
|D(Xi)| = 2, |C| = 13-16, |E| = 50. Sample Ergodic
Subspace U = {U1, U2, …, Uk}. Exact Time = 50 sec
using Cutset Conditioning
78
Non-Ergodic Hailfinder
MSE vs. samples (left) and time
(right). Non-Ergodic, |X| = 56, |C| = 5, 2 ≤ |D(Xi)|
≤ 11, |E| = 0. Exact Time = 2 sec using
Loop-Cutset Conditioning
79
Non-Ergodic CPCS360b - MSE
MSE vs. Time. Non-Ergodic, |X| = 360, |C| = 26,
|D(Xi)| = 2. Exact Time = 50 min using BTE
80
Non-Ergodic CPCS360b - MaxErr
81
Importance vs. Gibbs