Title: Approximation Techniques: bounded inference
1. Approximation Techniques: bounded inference
2. Mini-buckets: local inference
- The idea is similar to i-consistency: bound the size of recorded dependencies
- Computation in a bucket is time and space exponential in the number of variables involved
- Therefore, partition the functions in a bucket into mini-buckets over smaller numbers of variables
3. Mini-bucket approximation: MPE task
Split a bucket into mini-buckets -> bound the complexity
4. Approx-mpe(i)
- Input: i, the maximum number of variables allowed in a mini-bucket
- Output: a lower bound (the probability of a sub-optimal solution) and an upper bound
Example: approx-mpe(3) versus elim-mpe
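To make the partitioning step concrete, here is a minimal Python sketch of how a single bucket could be processed approx-mpe-style: functions are grouped greedily into mini-buckets whose combined scope has at most i variables, and the bucket variable is maximized out in each mini-bucket separately. The table representation, the greedy grouping rule and the made-up numbers are illustrative assumptions, not the algorithm as specified in these slides.

```python
from itertools import product
from math import prod

# A function is a dict with a scope (tuple of variable names) and a table
# mapping each assignment tuple over that scope (values 0/1) to a number.
def make_fn(scope, table):
    return {"scope": tuple(scope), "table": dict(table)}

def eval_fn(f, assignment):
    return f["table"][tuple(assignment[v] for v in f["scope"])]

def greedy_partition(funcs, i):
    """Greedily group a bucket's functions into mini-buckets whose combined
    scope has at most i variables (the greedy rule is an assumption)."""
    minibuckets = []  # list of (scope_set, function_list) pairs
    for f in funcs:
        for scope, group in minibuckets:
            if len(scope | set(f["scope"])) <= i:
                scope |= set(f["scope"])
                group.append(f)
                break
        else:
            minibuckets.append((set(f["scope"]), [f]))
    return [group for _, group in minibuckets]

def max_out(funcs, var):
    """Multiply the functions of one mini-bucket and maximize out `var`."""
    scope = sorted({v for f in funcs for v in f["scope"]} - {var})
    table = {}
    for vals in product((0, 1), repeat=len(scope)):
        assignment = dict(zip(scope, vals))
        table[vals] = max(prod(eval_fn(f, {**assignment, var: x}) for f in funcs)
                          for x in (0, 1))
    return make_fn(scope, table)

# Bucket of B with three functions (made-up tables standing in for CPTs).
pb_a  = make_fn(("a", "b"), {(a, b): 0.2 + 0.3 * a + 0.4 * b
                             for a in (0, 1) for b in (0, 1)})
pd_ab = make_fn(("a", "b", "d"), {(a, b, d): 0.1 + 0.2 * a + 0.3 * b + 0.1 * d
                                  for a in (0, 1) for b in (0, 1) for d in (0, 1)})
pe_bc = make_fn(("b", "c", "e"), {(b, c, e): 0.15 + 0.25 * b + 0.2 * c + 0.1 * e
                                  for b in (0, 1) for c in (0, 1) for e in (0, 1)})

for mb in greedy_partition([pe_bc, pb_a, pd_ab], i=3):
    msg = max_out(mb, "b")  # each mini-bucket sends its own smaller message
    print([f["scope"] for f in mb], "->", msg["scope"])
```

The product of the messages produced by the mini-buckets upper-bounds the exact bucket message (the max of the full product), which is where the upper bound reported by approx-mpe(i) comes from.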
5. Properties of approx-mpe(i)
- Complexity: O(exp(2i)) time and O(exp(i)) space
- Accuracy: determined by the upper/lower (U/L) bound
- As i increases, both accuracy and complexity increase
- Possible uses of mini-bucket approximations:
  - As anytime algorithms (Dechter and Rish, 1997)
  - As heuristics in best-first search (Kask and Dechter, 1999)
- Other tasks: similar mini-bucket approximations for belief updating, MAP and MEU (Dechter and Rish, 1997)
6. Anytime Approximation
7. Bounded elimination for belief updating
- Idea: the mini-bucket partitioning is the same
- We can apply a sum in each mini-bucket, or better, a sum in one mini-bucket and max (or min, for a lower bound) in the rest; see the numeric sketch below
- Approx-bel-max(i,m) generates upper and lower bounds on beliefs; it approximates elim-bel
- Approx-map(i,m): max buckets are maximized, sum buckets are processed by sum-max; it approximates elim-map
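Here is a small numeric sketch of that bound (numpy and the 2x2 tables are assumptions made for brevity): with two functions sharing the eliminated variable and i = 2, summing in one mini-bucket and maximizing (or minimizing) in the other yields an upper (or lower) bound on the exact bucket message.

```python
import numpy as np

# Two made-up tables that share the eliminated variable B:
# f1[b, x] and f2[b, y], all variables binary.
f1 = np.array([[0.6, 0.4], [0.2, 0.8]])
f2 = np.array([[0.7, 0.3], [0.5, 0.5]])

# Exact bucket message for belief updating: sum_b f1[b, x] * f2[b, y]
exact = np.einsum("bx,by->xy", f1, f2)

# Mini-bucket bound with i = 2: keep the functions in separate mini-buckets,
# apply sum in one of them and max (or min) in the other.
upper = np.einsum("x,y->xy", f1.sum(axis=0), f2.max(axis=0))
lower = np.einsum("x,y->xy", f1.sum(axis=0), f2.min(axis=0))

assert np.all(lower <= exact + 1e-12) and np.all(exact <= upper + 1e-12)
print("exact:\n", exact, "\nupper:\n", upper, "\nlower:\n", lower)
```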
8. Empirical Evaluation (Dechter and Rish, 1997; Rish thesis, 1999)
- Randomly generated networks
- Uniform random probabilities
- Random noisy-OR
- CPCS networks
- Probabilistic decoding
- Comparing approx-mpe and anytime-mpe versus elim-mpe
9. Random networks
- Uniform random: 60 nodes, 90 edges (200 instances)
- In 80% of cases, a 10-100 times speed-up while U/L < 2
- Noisy-OR: even better results
- Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
10. CPCS networks: medical diagnosis (noisy-OR model)
Test case: no evidence
11. The effect of evidence
More likely evidence -> higher MPE -> higher accuracy (why?)
Likely evidence versus random (unlikely) evidence
12. Probabilistic decoding
Error-correcting linear block code
State-of-the-art approximate algorithm: iterative belief propagation (IBP) (Pearl's poly-tree algorithm applied to loopy networks)
13. Iterative Belief Propagation
- Belief propagation is exact for poly-trees
- IBP - applying BP iteratively to cyclic networks
- No guarantees for convergence
- Works well for many coding networks
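For concreteness, a minimal iterative BP sketch follows. The slides describe Pearl's algorithm on Bayesian networks; the sketch uses the equivalent pairwise-model message updates, a simplification chosen only to keep the example short, and the 3-cycle example and its potentials are made up.

```python
import numpy as np

def loopy_bp(unaries, pairwise, edges, iters=50, tol=1e-6):
    """unaries: {node: vector}, pairwise: {(u, v): matrix over (x_u, x_v)},
    edges: list of (u, v). Returns approximate normalized beliefs."""
    msgs = {(u, v): np.ones_like(unaries[v])
            for (a, b) in edges for (u, v) in [(a, b), (b, a)]}
    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            # product of u's unary and all incoming messages except the one from v
            prod = unaries[u].copy()
            for (s, t) in msgs:
                if t == u and s != v:
                    prod = prod * msgs[(s, t)]
            phi = pairwise[(u, v)] if (u, v) in pairwise else pairwise[(v, u)].T
            m = phi.T @ prod            # sum over u's states
            new[(u, v)] = m / m.sum()   # normalize for numerical stability
        if max(np.abs(new[k] - msgs[k]).max() for k in msgs) < tol:
            msgs = new
            break                       # converged (not guaranteed in general)
        msgs = new
    beliefs = {}
    for u in unaries:
        b = unaries[u].copy()
        for (s, t) in msgs:
            if t == u:
                b = b * msgs[(s, t)]
        beliefs[u] = b / b.sum()
    return beliefs

# Example: a 3-cycle, the smallest loopy network (made-up potentials).
uni = {n: np.array([0.6, 0.4]) for n in "ABC"}
pw = {("A", "B"): np.array([[1.0, 0.5], [0.5, 1.0]]),
      ("B", "C"): np.array([[1.0, 0.5], [0.5, 1.0]]),
      ("A", "C"): np.array([[1.0, 0.5], [0.5, 1.0]])}
print(loopy_bp(uni, pw, list(pw)))
```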
14. approx-mpe vs. IBP
Bit error rate (BER) as a function of noise (sigma)
15. Mini-buckets: summary
- Mini-buckets: a local inference approximation
- Idea: bound the size of recorded functions
- Approx-mpe(i): a mini-bucket algorithm for MPE
- Better results for noisy-OR than for random problems
- Accuracy increases with decreasing noise in coding networks
- Accuracy increases for likely evidence
- Sparser graphs -> higher accuracy
- Coding networks: approx-mpe outperforms IBP on low induced-width codes
16. Heuristic search
- Mini-buckets record upper-bound heuristics
- The evaluation function over a partial assignment: f = g · H
- Best-first: expand a node with maximal evaluation function
- Branch and Bound: prune a node if its evaluation function f (an upper bound) does not exceed the best solution found so far (see the sketch below)
- Properties
  - an exact algorithm
  - Better heuristics lead to more pruning
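A minimal depth-first Branch and Bound sketch with an upper-bound evaluation function f = g · H is shown below. The tiny product cost model and the per-variable-maximum heuristic in the usage example are illustrative assumptions standing in for the mini-bucket heuristic.

```python
from math import prod

def branch_and_bound(variables, domains, g, H):
    """Depth-first Branch and Bound for maximization.
    g(assignment): exact value of the assigned part;
    H(assignment): upper bound on the best extension of the rest."""
    best = {"value": 0.0, "assignment": None}

    def expand(assignment, depth):
        if depth == len(variables):
            value = g(assignment)
            if value > best["value"]:
                best["value"], best["assignment"] = value, dict(assignment)
            return
        var = variables[depth]
        for val in domains[var]:
            assignment[var] = val
            f = g(assignment) * H(assignment)  # upper bound on any completion
            if f > best["value"]:              # prune when f <= best so far
                expand(assignment, depth + 1)
            del assignment[var]

    expand({}, 0)
    return best["value"], best["assignment"]

# Tiny usage: maximize a product of unary weights (made-up numbers); here H is
# simply the product of the per-variable maxima of the unassigned variables.
w = {"A": {0: 0.3, 1: 0.7}, "B": {0: 0.9, 1: 0.1}, "C": {0: 0.5, 1: 0.5}}
g = lambda a: prod(w[v][a[v]] for v in a)
H = lambda a: prod(max(w[v].values()) for v in w if v not in a)
print(branch_and_bound(list(w), {v: list(w[v]) for v in w}, g, H))
```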
17. Heuristic Function
Given a cost function
P(a,b,c,d,e) = P(a) P(b|a) P(c|a) P(e|b,c) P(d|a,b)
define an evaluation function over a partial assignment as the probability of its best extension:
[Figure: a search tree over partial assignments of A, E, D with 0/1 branches]
f(a,e,d) = max_{b,c} P(a,b,c,d,e) = P(a) · max_{b,c} P(b|a) P(c|a) P(e|b,c) P(d|a,b) = g(a,e,d) · H(a,e,d)
18. Heuristic Function
The exact best-extension value
max_{b,c} P(b|a) P(c|a) P(e|b,c) P(d|a,b) = max_c P(c|a) max_b P(e|b,c) P(b|a) P(d|a,b)
is upper-bounded by the mini-bucket heuristic
H(a,e,d) = max_c P(c|a) [max_b P(e|b,c)] [max_b P(b|a) P(d|a,b)],
so that g(a,e,d) · H(a,e,d) >= f(a,e,d).
The heuristic function H is compiled during the preprocessing stage of the Mini-Bucket algorithm.
19. Heuristic Function
The evaluation function f(x^p) can be computed using the functions recorded by the Mini-Bucket scheme, and can be used to estimate the probability of the best extension of the partial assignment x^p = (x1, ..., xp):
f(x^p) = g(x^p) · H(x^p)
For example, the mini-bucket schematic (bucket B is split into two mini-buckets) is:
bucket B: max_B { P(e|b,c) } { P(b|a), P(d|a,b) }
bucket C: max_C { P(c|a), h^B(e,c) }
bucket D: max_D { h^B(d,a) }
bucket E: max_E { h^C(e,a) }
bucket A: max_A { P(a), h^E(a), h^D(a) }
H(a,e,d) = h^B(d,a) · h^C(e,a)
g(a,e,d) = P(a)
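Continuing the example, here is a small numpy sketch of how the recorded functions h^B(d,a) and h^C(e,a) could be compiled once and then combined with g(a,e,d) = P(a) during search. The random CPT values and the array index conventions are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
def random_cpt(shape):
    """A made-up conditional table whose last axis sums to 1."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

P_a    = random_cpt((2,))       # P(A)
P_b_a  = random_cpt((2, 2))     # P(B|A), indexed [a, b]
P_c_a  = random_cpt((2, 2))     # P(C|A), indexed [a, c]
P_e_bc = random_cpt((2, 2, 2))  # P(E|B,C), indexed [b, c, e]
P_d_ab = random_cpt((2, 2, 2))  # P(D|A,B), indexed [a, b, d]

# Mini-bucket preprocessing: bucket B is split into {P(E|B,C)} and
# {P(B|A), P(D|A,B)}, and B is maximized out in each mini-bucket.
h_B_ec = P_e_bc.max(axis=0)                                   # h^B(e,c), indexed [c, e]
h_B_da = np.einsum("ab,abd->abd", P_b_a, P_d_ab).max(axis=1)  # h^B(d,a), indexed [a, d]
# Bucket C combines P(C|A) with h^B(e,c) and maximizes out C.
h_C_ea = np.einsum("ac,ce->ace", P_c_a, h_B_ec).max(axis=1)   # h^C(e,a), indexed [a, e]

def f(a, e, d):
    """Evaluation function for the partial assignment (a, e, d)."""
    g = P_a[a]                        # exact part
    H = h_B_da[a, d] * h_C_ea[a, e]   # heuristic: upper bound on the best extension
    return g * H

print(f(1, 0, 1))
```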
20. Properties
- The heuristic is monotone
- The heuristic is admissible
- The heuristic is computed in linear time
- IMPORTANT:
  - Mini-buckets generate heuristics of varying strength using the control parameter i (the bound)
  - Higher bound -> more preprocessing -> stronger heuristics -> less search
  - Allows a controlled trade-off between preprocessing and search
21. Empirical Evaluation of mini-bucket heuristics
22. Cluster Tree Elimination - properties
- Correctness and completeness: Algorithm CTE is correct, i.e. it computes the exact joint probability of a single variable and the evidence.
- Time complexity: O(deg · (n + N) · d^(w*+1))
- Space complexity: O(N · d^sep)
  - where deg = the maximum degree of a node
  - n = number of variables (= number of CPTs)
  - N = number of nodes in the tree decomposition
  - d = the maximum domain size of a variable
  - w* = the induced width
  - sep = the separator size
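A minimal sketch of one CTE message is given below (binary variables, made-up tables and numpy are assumptions): the cluster's functions and incoming messages are multiplied, and the eliminated variable is summed out, leaving a function over the separator.

```python
import numpy as np

# Cluster u holds functions over {A, B, C}; its separator with the
# neighbouring cluster v is {B, C}, so A is eliminated by summation.
rng = np.random.default_rng(2)
f_ab = rng.random((2, 2))   # a function over (A, B)
f_ac = rng.random((2, 2))   # a function over (A, C)
h_in = rng.random((2,))     # an incoming message over (A)

# m_{u->v}(b, c) = sum_a f_ab[a, b] * f_ac[a, c] * h_in[a]
m_uv = np.einsum("ab,ac,a->bc", f_ab, f_ac, h_in)
print(m_uv)
```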
23. Mini-Clustering for belief updating
- Motivation
  - The time and space complexity of Cluster Tree Elimination depends on the induced width w of the problem
  - When the induced width w is big, the CTE algorithm becomes infeasible
- The basic idea
  - Try to reduce the size of the cluster (the exponent): partition each cluster into mini-clusters with fewer variables
  - Accuracy parameter i = the maximum number of variables in a mini-cluster
  - The idea was explored for variable elimination (Mini-Bucket); a compact sketch follows below
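The following compact sketch contrasts an exact CTE message with a Mini-Clustering message for i = 2 (binary variables and made-up tables are assumptions): the cluster's functions are split into mini-clusters, sum is applied in one and max in the other, and the product of the resulting smaller messages upper-bounds the exact message.

```python
import numpy as np

# The cluster contains three functions over {B, C, D, F};
# the separator toward the neighbouring cluster is {B, F}.
rng = np.random.default_rng(1)
f_bc = rng.random((2, 2))   # f(b, c)
f_cd = rng.random((2, 2))   # f(c, d)
f_df = rng.random((2, 2))   # f(d, f)

# Exact CTE message: sum over the eliminated variables C, D of the product.
exact = np.einsum("bc,cd,df->bf", f_bc, f_cd, f_df)

# Mini-clusters with at most i = 2 variables each: {f_bc, f_cd} and {f_df}.
# Sum is applied in the first mini-cluster and max in the second, so the
# message is a set of smaller functions whose product upper-bounds `exact`.
m1 = np.einsum("bc,cd->b", f_bc, f_cd)   # sum over c and d -> function of b
m2 = f_df.max(axis=0)                    # max over d       -> function of f
upper = np.einsum("b,f->bf", m1, m2)
assert np.all(exact <= upper + 1e-12)
print(exact, upper, sep="\n")
```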
24. Idea of Mini-Clustering
25. Mini-Clustering - example
[Figure: a tree decomposition with clusters 1 = ABC, 2 = BCDF, 3 = BEF, 4 = EFG, connected by separators BC, BF, EF]
26. Cluster Tree Elimination vs. Mini-Clustering
[Figure: the same tree decomposition (clusters ABC, BCDF, BEF, EFG with separators BC, BF, EF) shown twice, once for CTE messages and once for Mini-Clustering messages]
27. Mini-Clustering
- Correctness and completeness: Algorithm MC(i) computes a bound (or an approximation) on the joint probability P(Xi, e) of each variable and each of its values.
- Time and space complexity: O(n · hw* · d^i)
  - where hw* = max_u | { f | f ∈ ψ(u) } |
28. Experimental results
- Algorithms
- Exact
- IBP
- Gibbs sampling (GS)
- MC with normalization (approximate)
- Networks (all variables are binary)
- Coding networks
- CPCS 54, 360, 422
- Grid networks (MxM)
- Random noisy-OR networks
- Random networks
- Measures
- Normalized Hamming Distance (NHD)
- BER (Bit Error Rate)
- Absolute error
- Relative error
- Time
29. Random networks - Absolute error
(panels: evidence = 0 and evidence = 10)
30. Noisy-OR networks - Absolute error
(panels: evidence = 10 and evidence = 20)
31. Grid 15x15 - 10 evidence
32. CPCS422 - Absolute error
(panels: evidence = 0 and evidence = 10)
33. Coding networks - Bit Error Rate
(panels: sigma = 0.22 and sigma = 0.51)
34. Mini-Clustering summary
- MC extends the partition-based approximation from mini-buckets to general tree decompositions for the problem of belief updating
- Empirical evaluation demonstrates its effectiveness and superiority (for certain types of problems, with respect to the measures considered) relative to other existing algorithms
35. What is IJGP?
- IJGP is an approximate algorithm for belief updating in Bayesian networks
- IJGP is a version of join-tree clustering which is both anytime and iterative
- IJGP applies message passing along a join-graph, rather than a join-tree
- Empirical evaluation shows that IJGP is almost always superior to other approximate schemes (IBP, MC)
36. Iterative Belief Propagation - IBP
One-step update of BEL(U1)
[Figure: a poly-tree fragment with nodes U1, U2, U3, X1, X2 illustrating the one-step update of BEL(U1)]
- Belief propagation is exact for poly-trees
- IBP - applying BP iteratively to cyclic networks
- No guarantees for convergence
- Works well for many coding networks
37. IJGP - Motivation
- IBP is applied to a loopy network iteratively
  - not an anytime algorithm
  - when it converges, it converges very fast
- MC applies bounded inference along a tree decomposition
  - MC is an anytime algorithm controlled by the i-bound
  - MC converges in two passes, up and down the tree
- IJGP combines
  - the iterative feature of IBP
  - the anytime feature of MC
38. IJGP - The basic idea
- Apply Cluster Tree Elimination to any join-graph
- We commit to graphs that are minimal I-maps
- Avoid cycles as long as I-mapness is not violated
- Result: use minimal arc-labeled join-graphs
39. IJGP - Example
[Figure: a) the belief network; b) the graph IBP works on]
40. Arc-minimal join-graph
[Figure: the join-graph from the previous slide and its arc-minimal version]
41. Minimal arc-labeled join-graph
[Figure: the arc-minimal join-graph and its minimal arc-labeled version]
42. Join-graph decompositions
[Figure: a) minimal arc-labeled join-graph; b) join-graph obtained by collapsing nodes of graph a); c) minimal arc-labeled join-graph]
43. Tree decomposition
[Figure: a) minimal arc-labeled join-graph; b) tree decomposition]
44. Join-graphs
[Figure: a spectrum of join-graph decompositions, trading more accuracy against less complexity]
45. Message propagation
[Figure: cluster 1 = ABCDE contains p(a), p(c), p(b|a,c), p(d|a,b,e), p(e|b,c) and the incoming message h_(3,1)(b,c); it sends the message h_(1,2) to cluster 2 = CDEF]
Minimal arc-labeled: sep(1,2) = {D,E}, elim(1,2) = {A,B,C}
Non-minimal arc-labeled: sep(1,2) = {C,D,E}, elim(1,2) = {A,B}
46. Bounded decompositions
- We want arc-labeled decompositions such that
  - the cluster size (internal width) is bounded by i (the accuracy parameter)
  - the width of the decomposition as a graph (external width) is as small as possible
- Possible approaches to build such decompositions
  - partition-based algorithms, inspired by the mini-bucket decomposition
  - grouping-based algorithms
47. Partition-based algorithms
[Figure: a) schematic mini-bucket(i), i = 3, over the functions P(G|F,E), P(E|B,F), P(F|C,D), P(D|B), P(C|A,B), P(B|A), P(A); b) the corresponding arc-labeled join-graph decomposition]
48. IJGP properties
- IJGP(i) applies BP to a minimal arc-labeled join-graph whose cluster size is bounded by i
- On join-trees, IJGP finds the exact beliefs
- IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman, Weiss 2001)
- Complexity of one iteration:
  - time: O(deg · (n + N) · d^(i+1))
  - space: O(N · d^θ), where θ is the maximum arc-label size
49. Empirical evaluation
- Measures
- Absolute error
- Relative error
- Kullback-Leibler (KL) distance
- Bit Error Rate
- Time
- Algorithms
- Exact
- IBP
- MC
- IJGP
- Networks (all variables are binary)
- Random networks
- Grid networks (MxM)
- CPCS 54, 360, 422
- Coding networks
50. Random networks - KL at convergence
(panels: evidence = 0 and evidence = 5)
51. Random networks - KL vs. iterations
(panels: evidence = 0 and evidence = 5)
52. Random networks - Time
53. Coding networks - BER
(panels: sigma = 0.22, 0.32, 0.51, 0.65)
54. Coding networks - Time
55. IJGP summary
- IJGP borrows the iterative feature from IBP and the anytime virtues of bounded inference from MC
- Empirical evaluation showed the potential of IJGP, which improves with iterations and, most of the time, with the i-bound, and scales up to large networks
- IJGP is almost always superior, often by a high margin, to IBP and MC
- Based on all our experiments, we think that IJGP provides a practical breakthrough for the task of belief updating
56. Random networks
N = 80, 100 instances, w* = 15
57. Random networks
N = 80, 100 instances, w* = 15
58. CPCS 54, CPCS 360
CPCS360: 5 instances, w* = 20; CPCS54: 100 instances, w* = 15
59. Graph coloring problems
[Figure: a graph-coloring network over variables X1, X2, X3, ..., Xn with auxiliary nodes H1, H2, H3, H4]
60. Graph coloring problems
61. Inference power of IBP - summary
- IBP's inference of zero beliefs converges in a finite number of iterations and is sound; the results extend to generalized belief propagation algorithms, in particular to IJGP
- We identified classes of networks for which IBP
  - can infer zeros, and therefore is likely to be good
  - cannot infer zeros, although there are many of them (graph coloring), and therefore is bad
- Based on the analysis, it is easy to synthesize belief networks that are hard for IBP
- The success of IBP for coding networks can be explained by
  - many extreme beliefs
  - an easy-for-arc-consistency flat network
62. Road map
- CSPs complete algorithms
- CSPs approximations
- Belief nets complete algorithms
- Belief nets approximations
- Local inference mini-buckets
- Stochastic simulations
- Variational techniques
- MDPs
63. Stochastic Simulation
- Forward sampling (logic sampling)
- Likelihood weighing
- Markov Chain Monte Carlo (MCMC): Gibbs sampling
64. Approximation via Sampling
65. Forward Sampling (logic sampling; Henrion, 1988)
66. Forward sampling (example)
Drawback: high rejection rate!
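A minimal sketch of forward (logic) sampling with rejection, on a hypothetical two-node network A -> B with made-up CPTs; the count of kept samples illustrates the rejection rate.

```python
import random

P_A = {1: 0.3, 0: 0.7}                                   # P(A)
P_B_given_A = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}} # P(B | A)

def sample_bernoulli(p_true):
    return 1 if random.random() < p_true else 0

def forward_sample():
    # Sample each variable in topological order from P(X | parents).
    a = sample_bernoulli(P_A[1])
    b = sample_bernoulli(P_B_given_A[a][1])
    return {"A": a, "B": b}

def estimate(query_var, evidence, n=100_000):
    """Estimate P(query_var = 1 | evidence) by rejecting inconsistent samples."""
    kept, hits = 0, 0
    for _ in range(n):
        s = forward_sample()
        if all(s[v] == val for v, val in evidence.items()):
            kept += 1
            hits += s[query_var]
    return (hits / kept if kept else float("nan")), kept

est, kept = estimate("A", {"B": 1})
print(f"P(A=1 | B=1) ~ {est:.3f}  (kept {kept} of 100000 samples)")
```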
67. Likelihood Weighing (Fung and Chang, 1990; Shachter and Peot, 1990)
Clamping evidence + forward sampling + weighing samples by the evidence likelihood
Works well for likely evidence!
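A minimal likelihood-weighing sketch on the same hypothetical A -> B network: the evidence variable is clamped and each sample is weighted by the likelihood of the evidence given its sampled parents.

```python
import random

P_A = {1: 0.3, 0: 0.7}                                   # P(A)
P_B_given_A = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}} # P(B | A)

def likelihood_weighting(evidence, n=100_000):
    num = 0.0  # accumulated weight of samples with A = 1
    den = 0.0  # accumulated weight of all samples
    for _ in range(n):
        a = 1 if random.random() < P_A[1] else 0  # A is not evidence: sample it
        w = P_B_given_A[a][evidence["B"]]         # B is evidence: clamp and weigh
        num += w * a
        den += w
    return num / den

print("P(A=1 | B=1) ~", round(likelihood_weighting({"B": 1}), 3))
```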
68. Gibbs Sampling (Geman and Geman, 1984)
Markov Chain Monte Carlo (MCMC): create a Markov chain of samples
Advantage: guaranteed to converge to P(X)
Disadvantage: convergence may be slow
69. Gibbs Sampling (cont'd) (Pearl, 1988)
Markov blanket
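A Gibbs-sampling sketch for a hypothetical chain A -> B -> C with evidence C = 1 (all CPT values are made up): each non-evidence variable is resampled from its full conditional, which depends only on its Markov blanket.

```python
import random

P_A = [0.7, 0.3]                  # P(A = a)
P_B_A = [[0.8, 0.2], [0.3, 0.7]]  # P(B = b | A = a), indexed [a][b]
P_C_B = [[0.9, 0.1], [0.4, 0.6]]  # P(C = c | B = b), indexed [b][c]

def gibbs(evidence_c=1, burn_in=1_000, n=50_000):
    a, b = 0, 0  # arbitrary initial state for the non-evidence variables
    count_a1 = 0
    for t in range(burn_in + n):
        # Resample A given its Markov blanket {B}: proportional to P(a) * P(b | a)
        w = [P_A[0] * P_B_A[0][b], P_A[1] * P_B_A[1][b]]
        a = 1 if random.random() < w[1] / (w[0] + w[1]) else 0
        # Resample B given its Markov blanket {A, C}: proportional to P(b | a) * P(c | b)
        w = [P_B_A[a][0] * P_C_B[0][evidence_c], P_B_A[a][1] * P_C_B[1][evidence_c]]
        b = 1 if random.random() < w[1] / (w[0] + w[1]) else 0
        if t >= burn_in:
            count_a1 += a
    return count_a1 / n

print("P(A=1 | C=1) ~", round(gibbs(), 3))
```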