Title: Approximation Techniques: Bounded Inference
1Approximation Techniques: Bounded Inference
2Approximate Inference
- Metrics of evaluation
- Absolute error: given ε > 0 and a query p = P(x|e), an estimate r has absolute error ε iff |p - r| < ε
- Relative error: the ratio r/p lies in [1-ε, 1+ε]
- Dagum and Luby (1993): approximation up to a relative error is NP-hard
- Approximation within absolute error less than 0.5 is also NP-hard
3Mini-buckets local inference
- Computation in a bucket is time and space exponential in the number of variables involved
- Therefore, partition the functions in a bucket into mini-buckets over smaller numbers of variables
- (The idea is similar to i-consistency: bound the size of recorded dependencies, Dechter 2003)
4Mini-bucket approximation MPE task
Splitting a bucket into mini-buckets bounds the complexity: maximizing each mini-bucket separately can only over-estimate the exact maximization of the full product, so it yields an upper bound at a much lower cost (a small sketch follows).
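To see why the split yields a bound, here is a small Python sketch (illustrative, not from the slides): the product of the separate mini-bucket maximizations is always at least the exact maximization over the whole bucket, and each maximization touches fewer variables at a time.

```python
# Minimal sketch (illustrative): maximizing each mini-bucket separately
# upper-bounds maximizing the full bucket product.
from itertools import product

# Four functions that would sit in bucket B, each over B and one other variable.
# Tables are indexed by binary tuples; the numbers are arbitrary placeholders.
F_ab = {(a, b): v for (a, b), v in zip(product((0, 1), repeat=2), [0.9, 0.1, 0.4, 0.6])}
F_bc = {(b, c): v for (b, c), v in zip(product((0, 1), repeat=2), [0.2, 0.8, 0.7, 0.3])}
F_bd = {(b, d): v for (b, d), v in zip(product((0, 1), repeat=2), [0.5, 0.5, 0.9, 0.1])}
F_be = {(b, e): v for (b, e), v in zip(product((0, 1), repeat=2), [0.6, 0.4, 0.2, 0.8])}

def exact_elim(a, c, d, e):
    # max_b of the full product: exponential in all variables mentioned in the bucket
    return max(F_ab[a, b] * F_bc[b, c] * F_bd[b, d] * F_be[b, e] for b in (0, 1))

def minibucket_elim(a, c, d, e):
    # split into mini-buckets {F(a,b), F(b,c)} and {F(b,d), F(b,e)},
    # maximize each separately: h_B(a,c) * h_B(d,e)
    h1 = max(F_ab[a, b] * F_bc[b, c] for b in (0, 1))
    h2 = max(F_bd[b, d] * F_be[b, e] for b in (0, 1))
    return h1 * h2

for a, c, d, e in product((0, 1), repeat=4):
    assert minibucket_elim(a, c, d, e) >= exact_elim(a, c, d, e)  # always an upper bound
```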
5Mini-Bucket Elimination
[Figure: Mini-Bucket Elimination trace along the ordering d = (A, E, D, C, B), buckets processed from B down to A.
bucket B: F(a,b), F(b,c) | F(b,d), F(b,e) is split into two mini-buckets, generating h^B(a,c) and h^B(d,e)
bucket C: F(c,e), F(a,c), h^B(a,c) generates h^C(e,a)
bucket D: F(a,d), h^B(d,e) generates h^D(e,a)
bucket E (evidence e = 0): h^C(e,a), h^D(e,a) generates h^E(a)
bucket A: h^E(a); maximizing here gives an upper bound, and evaluating the assignment assembled along the ordering gives L, a lower bound]
6Semantics of Mini-Bucket Splitting a Node
Variables in different mini-buckets are renamed and duplicated (Kask et al., 2001; Geffner et al., 2007; Choi, Chavira, and Darwiche, 2007).
[Figure: network N before splitting, with node U; network N' after splitting, with U duplicated into U and Û]
7Approx-mpe(i)
- Input: i, the maximum number of variables allowed in a mini-bucket
- Output: a lower bound (the probability of a suboptimal solution) and an upper bound
- Example: approx-mpe(3) versus elim-mpe
8MBE(i,m) and MBE(i) (also known as approx-mpe)
- Input: belief network (P_1, ..., P_n)
- Output: upper and lower bounds
- Initialize: put functions in buckets
- Process each bucket, from the last variable in the ordering down to the first:
  - Create (i,m)-mini-buckets
  - Process each mini-bucket
- (For MPE) assign values along the ordering d
- Return the MPE tuple and the upper and lower bounds (a sketch follows below)
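A compact sketch of this loop, under simplifying assumptions: binary domains, factors given as (scope, function) pairs where the function maps a {variable: value} dict to a number, and no evidence handling. The name mbe_mpe and the lazy factor representation are choices made for this sketch, not the original implementation.

```python
# Illustrative MBE(i) sketch for MPE: upper bound U from mini-bucket processing,
# lower bound L from the greedily assembled assignment. Generated functions are
# represented lazily (recomputed on demand), trading efficiency for brevity.
def eval_product(factors, assign):
    result = 1.0
    for scope, fn in factors:
        result *= fn(assign)
    return result

def mbe_mpe(factors, ordering, i):
    pos = {x: k for k, x in enumerate(ordering)}
    buckets = {x: [] for x in ordering}
    for scope, fn in factors:                        # initialize: place each function
        buckets[max(scope, key=pos.get)].append((scope, fn))

    constants = []
    for x in reversed(ordering):                     # process buckets, last to first
        minis = []                                   # greedy partition into (i)-mini-buckets
        for scope, fn in buckets[x]:
            for mb_scope, mb_factors in minis:
                if len(mb_scope | set(scope)) <= i:
                    mb_scope.update(scope)
                    mb_factors.append((scope, fn))
                    break
            else:
                minis.append((set(scope), [(scope, fn)]))
        for mb_scope, mb_factors in minis:           # eliminate x from each mini-bucket by max
            new_scope = tuple(v for v in mb_scope if v != x)
            def h(assign, fs=tuple(mb_factors), x=x):
                return max(eval_product(fs, {**assign, x: v}) for v in (0, 1))
            if new_scope:
                buckets[max(new_scope, key=pos.get)].append((new_scope, h))
            else:
                constants.append(((), h))

    U = eval_product(constants, {})                  # upper bound on the MPE value

    assign = {}                                      # forward step: assign along the ordering
    for x in ordering:
        assign[x] = max((0, 1), key=lambda v: eval_product(buckets[x], {**assign, x: v}))
    L = eval_product(factors, assign)                # lower bound: probability of this tuple
    return L, U, assign
```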
9Properties of approx-mpe(i)
- Complexity O(exp(2i)) time and O(exp(i))
space. - Accuracy determined by upper/lower (U/L) bound.
- As i increases, both accuracy and complexity
increase. - Possible use of mini-bucket approximations
- As anytime algorithms (Dechter and Rish, 1997)
- As heuristics in best-first search (Kask and
Dechter, 1999) - Other tasks similar mini-bucket approximations
for belief updating, MAP and MEU (Dechter and
Rish, 1997) -
10Anytime Approximation
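One way to realize the anytime scheme mentioned on the previous slide is a loop that raises the i-bound until the bound ratio is tight enough or resources run out. The sketch below assumes an mbe_mpe(factors, ordering, i) routine such as the one above; the stopping rule is an assumption made for illustration, not the exact algorithm of Dechter and Rish (1997).

```python
# Illustrative anytime loop over the mini-bucket bound parameter i.
def anytime_mpe(factors, ordering, epsilon=0.1, i_max=20):
    best = None
    for i in range(1, i_max + 1):
        L, U, assignment = mbe_mpe(factors, ordering, i)
        best = (L, U, assignment)
        if L > 0 and U / L <= 1 + epsilon:   # bounds are tight enough: stop early
            break
    return best
```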
11Bounded Inference for belief updating
- Idea: the mini-bucket principle is the same
- We can apply sum in each mini-bucket, or better, sum in one mini-bucket and max in the rest (for an upper bound), or min (for a lower bound); see the sketch below
- Approx-bel-max(i,m), generating upper and lower bounds on beliefs, approximates elim-bel
- Approx-map(i,m): max buckets are maximized, sum buckets are sum-max; approximates elim-map
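A tiny illustration of the bounding argument: with two mini-buckets represented by the functions f and g of the eliminated variable b, summing one and maximizing (or minimizing) the other brackets the exact sum of products. The numbers are arbitrary.

```python
# Sum in one mini-bucket, max (resp. min) in the other: an upper (resp. lower) bound.
f = {0: 0.3, 1: 0.7}      # product of the first mini-bucket, as a function of b
g = {0: 0.9, 1: 0.2}      # product of the second mini-bucket, as a function of b

exact = sum(f[b] * g[b] for b in (0, 1))
upper = sum(f.values()) * max(g.values())
lower = sum(f.values()) * min(g.values())
assert lower <= exact <= upper
print(lower, exact, upper)   # 0.2, 0.41, 0.9
```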
12Empirical Evaluation (Dechter and Rish, 1997; Rish thesis, 1999)
- Randomly generated networks
  - Uniform random probabilities
  - Random noisy-OR
- CPCS networks
- Probabilistic decoding
- Comparing approx-mpe and anytime-mpe versus elim-mpe
13Causal Independence
- Event X has two possible causes, A and B. It is hard to elicit P(X|A,B), but it is easy to determine P(X|A) and P(X|B).
- Example: several diseases cause a symptom.
- The effect of A on X is independent of the effect of B on X.
- Causal independence uses canonical models: noisy-OR, noisy-AND, noisy-max.
[Figure: causes A and B with common effect X]
14Binary OR
[Figure: causes A and B with common effect X]

A  B  P(X=0|A,B)  P(X=1|A,B)
0  0      1           0
0  1      0           1
1  0      0           1
1  1      0           1
15Noisy-OR
- Noise is associated with each edge, described by a noise parameter λ ∈ [0,1]
- Let q_b = 0.2, q_a = 0.1, where q_i = P(X=0 | A_i=1, all other causes inactive)
- P(x=0|a,b) = (1-λ_a)(1-λ_b)
- P(x=1|a,b) = 1 - (1-λ_a)(1-λ_b)
[Figure: causes A and B with common effect X; edge noise parameters λ_a, λ_b]

A  B  P(X=0|A,B)  P(X=1|A,B)
0  0      1           0
0  1      0.2         0.8
1  0      0.1         0.9
1  1      0.02        0.98
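A small illustrative sketch that rebuilds the table above from the two per-cause parameters q_a = 0.1 and q_b = 0.2:

```python
# Noisy-OR CPT from per-cause parameters q_i = P(X=0 | only cause i active).
from itertools import product

q = {'A': 0.1, 'B': 0.2}

def p_x0(assignment):
    # P(X=0 | causes) = product of q_i over the active causes
    result = 1.0
    for cause, value in assignment.items():
        if value == 1:
            result *= q[cause]
    return result

for a, b in product((0, 1), repeat=2):
    p0 = p_x0({'A': a, 'B': b})
    print(a, b, round(p0, 4), round(1 - p0, 4))
# (1,1) gives P(X=0) = 0.1 * 0.2 = 0.02, matching the table above.
```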
16Closed Form Bel(X) - 1
Given a noisy-OR CPT P(x|u) with noise parameters λ_i for the parents u_i ∈ U, define q_i = 1 - λ_i. Then

P(x = 0 | u) = Π_{i : u_i = 1} q_i,    P(x = 1 | u) = 1 - Π_{i : u_i = 1} q_i
17Closed Form Bel(X) - 2
Using iterative belief propagation, set π_i^x = π^x(u_i = 1), the causal support arriving from parent U_i. Then we can show that the belief factorizes over the parents:

Bel(x = 0) ∝ Π_i (1 - λ_i π_i^x)
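A quick numerical check of this factorization (illustrative; the π values are made up): the product over parents equals the brute-force sum over all parent configurations.

```python
# Closed-form noisy-OR belief vs. brute-force enumeration of the parents.
from itertools import product

q  = {'A': 0.1, 'B': 0.2}          # q_i = 1 - lambda_i (noise parameters as above)
pi = {'A': 0.7, 'B': 0.4}          # pi_i: causal support P(U_i = 1) reaching X

# Brute force: sum over all parent configurations of P(u) * P(x=0|u)
brute = sum(
    (pi['A'] if a else 1 - pi['A']) *
    (pi['B'] if b else 1 - pi['B']) *
    (q['A'] ** a) * (q['B'] ** b)
    for a, b in product((0, 1), repeat=2))

# Closed form: product over parents of (1 - lambda_i * pi_i)
closed = 1.0
for i in q:
    closed *= 1 - (1 - q[i]) * pi[i]

assert abs(brute - closed) < 1e-12
```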
18Methodology for Empirical Evaluation (for MPE)
- U/L accuracy
- Better: U/MPE or MPE/L
- Benchmarks: random networks
  - Given n, e, v, generate a random DAG
  - For each x_i and its parents, generate a table from uniform [0,1], or noisy-OR
  - Create k instances; for each, generate random evidence and likely evidence
- Measure averages
19Random networks
- Uniform random: 60 nodes, 90 edges (200 instances)
- In 80% of cases, a 10-100 times speed-up while U/L < 2
- Noisy-OR: even better results
- Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
20CPCS networks: medical diagnosis (noisy-OR model)
Test case: no evidence
21The effect of evidence
More likely evidence => higher MPE => higher accuracy (why?)
Likely evidence versus random (unlikely) evidence
22Probabilistic decoding
Error-correcting linear block code
State-of-the-art approximate algorithm: iterative belief propagation (IBP), Pearl's poly-tree algorithm applied to loopy networks
23Iterative Belief Propagation
- Belief propagation is exact for poly-trees
- IBP - applying BP iteratively to cyclic networks
- No guarantees for convergence
- Works well for many coding networks
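For concreteness, here is a generic sum-product sketch of belief propagation on a factor graph with binary variables and a flooding schedule; applied iteratively to a loopy network, this is IBP. It is a minimal illustration (messages left unnormalized), not the slides' or any library's implementation.

```python
# Minimal loopy belief propagation sketch. Factors are (scope, table) pairs,
# where table maps a tuple assignment of the scope to a non-negative number.
from itertools import product

def loopy_bp(variables, factors, iterations=10):
    edges = [(i, v) for i, (scope, _) in enumerate(factors) for v in scope]
    msg_fv = {e: {0: 1.0, 1: 1.0} for e in edges}   # factor -> variable messages
    msg_vf = {e: {0: 1.0, 1: 1.0} for e in edges}   # variable -> factor messages

    for _ in range(iterations):
        for i, v in edges:                           # variable -> factor update
            out = {0: 1.0, 1: 1.0}
            for j, w in edges:
                if w == v and j != i:
                    for x in (0, 1):
                        out[x] *= msg_fv[(j, w)][x]
            msg_vf[(i, v)] = out
        for i, v in edges:                           # factor -> variable update
            scope, table = factors[i]
            out = {0: 0.0, 1: 0.0}
            for assignment in product((0, 1), repeat=len(scope)):
                a = dict(zip(scope, assignment))
                term = table[assignment]
                for u in scope:
                    if u != v:
                        term *= msg_vf[(i, u)][a[u]]
                out[a[v]] += term
            msg_fv[(i, v)] = out

    beliefs = {}
    for v in variables:                              # combine messages into beliefs
        b = {x: 1.0 for x in (0, 1)}
        for i, w in edges:
            if w == v:
                for x in (0, 1):
                    b[x] *= msg_fv[(i, v)][x]
        z = b[0] + b[1]
        beliefs[v] = {x: b[x] / z for x in (0, 1)}
    return beliefs
```

On a poly-tree, enough iterations reproduce the exact marginals; on a loopy network the same loop is IBP and convergence is not guaranteed.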
24approx-mpe vs. IBP
Bit error rate (BER) as a function of noise
(sigma)
25Mini-buckets summary
- Mini-buckets: local inference approximation
- Idea: bound the size of recorded functions
- Approx-mpe(i): a mini-bucket algorithm for MPE
- Better results for noisy-OR than for random problems
- Accuracy increases with decreasing noise in coding
- Accuracy increases for likely evidence
- Sparser graphs -> higher accuracy
- Coding networks: approx-mpe outperforms IBP on low induced-width codes
26Cluster Tree Elimination - properties
- Correctness and completeness: Algorithm CTE is correct, i.e., it computes the exact joint probability of a single variable and the evidence
- Time complexity: O(deg · (n + N) · d^(w+1))
- Space complexity: O(N · d^sep)
  - where deg = the maximum degree of a node
  - n = number of variables (= number of CPTs)
  - N = number of nodes in the tree decomposition
  - d = the maximum domain size of a variable
  - w = the induced width
  - sep = the separator size
27Cluster Tree Elimination - the messages
[Figure: tree decomposition and CTE messages
cluster 1: {A,B,C} with p(a), p(b|a), p(c|a,b)
  -- separator BC --
cluster 2: {B,C,D,F} with p(d|b), p(f|c,d), h_(1,2)(b,c);  sep(2,3) = {B,F}, elim(2,3) = {C,D}
  -- separator BF --
cluster 3: {B,E,F} with p(e|b,f), h_(2,3)(b,f)
  -- separator EF --
cluster 4: {E,F,G} with p(g|e,f)]
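As an illustration of one such message, the sketch below computes h_(2,3)(b,f) by summing the eliminator {C, D} out of the product of cluster 2's functions; the probability tables are made-up placeholders.

```python
# Illustrative CTE message h_(2,3)(b,f) from cluster 2 of the figure above.
from itertools import product

p_d_given_b = {(d, b): v for (d, b), v in
               zip(product((0, 1), repeat=2), [0.6, 0.3, 0.4, 0.7])}                      # P(d|b)
p_f_given_cd = {(f, c, d): v for (f, c, d), v in
                zip(product((0, 1), repeat=3), [0.9, 0.5, 0.2, 0.4, 0.1, 0.5, 0.8, 0.6])}  # P(f|c,d)
h_12 = {(b, c): v for (b, c), v in
        zip(product((0, 1), repeat=2), [0.8, 0.2, 0.5, 0.5])}                             # message from cluster 1

h_23 = {}
for b, f in product((0, 1), repeat=2):
    h_23[b, f] = sum(p_d_given_b[d, b] * p_f_given_cd[f, c, d] * h_12[b, c]
                     for c, d in product((0, 1), repeat=2))
print(h_23)   # the message sent from cluster 2 to cluster 3, a function of (b, f)
```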
28Mini-Clustering for belief updating
- Motivation
  - The time and space complexity of Cluster Tree Elimination depend on the induced width w of the problem
  - When the induced width w is big, the CTE algorithm becomes infeasible
- The basic idea
  - Try to reduce the size of the cluster (the exponent): partition each cluster into mini-clusters with fewer variables
  - Accuracy parameter i: the maximum number of variables in a mini-cluster
  - The idea was explored for variable elimination (Mini-Buckets)
29Idea of Mini-Clustering
30Mini-Clustering - MC
[Figure: Mini-Clustering with i = 3 vs. Cluster Tree Elimination on the same tree decomposition
cluster 1: {A,B,C} with p(a), p(b|a), p(c|a,b)
  -- separator BC --
cluster 2: {B,C,D,F}; in CTE it holds p(d|b), h_(1,2)(b,c), p(f|c,d), while in MC(3) it is partitioned before computing its message;  sep(2,3) = {B,F}, elim(2,3) = {C,D}
  -- separator BF --
cluster 3: {B,E,F} with p(e|b,f)
  -- separator EF --
cluster 4: {E,F,G} with p(g|e,f)]
31Mini-Clustering - the messages, i3
[Figure: Mini-Clustering messages, i = 3
cluster 1: {A,B,C} with p(a), p(b|a), p(c|a,b)
  -- separator BC --
cluster 2: partitioned into mini-clusters {B,C,D}: p(d|b), h_(1,2)(b,c) and {C,D,F}: p(f|c,d);  sep(2,3) = {B,F}, elim(2,3) = {C,D}
  -- separator BF --
cluster 3: {B,E,F} with p(e|b,f), h1_(2,3)(b), h2_(2,3)(f)
  -- separator EF --
cluster 4: {E,F,G} with p(g|e,f)]
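The corresponding Mini-Clustering computation (i = 3) splits cluster 2 and sends two small messages instead of one; with max applied in the mini-cluster that is not summed, their product upper-bounds the exact CTE message. The tables are the same placeholders as in the CTE sketch above.

```python
# Illustrative Mini-Clustering version of the (2,3) message with i = 3.
from itertools import product

p_d_given_b = {(d, b): v for (d, b), v in
               zip(product((0, 1), repeat=2), [0.6, 0.3, 0.4, 0.7])}
p_f_given_cd = {(f, c, d): v for (f, c, d), v in
                zip(product((0, 1), repeat=3), [0.9, 0.5, 0.2, 0.4, 0.1, 0.5, 0.8, 0.6])}
h_12 = {(b, c): v for (b, c), v in
        zip(product((0, 1), repeat=2), [0.8, 0.2, 0.5, 0.5])}

# mini-cluster 1: sum C, D out of p(d|b) * h_(1,2)(b,c)  ->  h1(b)
h1 = {b: sum(p_d_given_b[d, b] * h_12[b, c] for c, d in product((0, 1), repeat=2))
      for b in (0, 1)}
# mini-cluster 2: max C, D out of p(f|c,d)               ->  h2(f)
h2 = {f: max(p_f_given_cd[f, c, d] for c, d in product((0, 1), repeat=2))
      for f in (0, 1)}

for b, f in product((0, 1), repeat=2):
    exact = sum(p_d_given_b[d, b] * p_f_given_cd[f, c, d] * h_12[b, c]
                for c, d in product((0, 1), repeat=2))
    assert h1[b] * h2[f] >= exact    # the pair (h1, h2) upper-bounds the CTE message
```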
32Mini-Clustering - example
[Figure: the tree decomposition 1:{A,B,C} - BC - 2:{B,C,D,F} - BF - 3:{B,E,F} - EF - 4:{E,F,G}, with the mini-clustering messages passed along each edge]
33Cluster Tree Elimination vs. Mini-Clustering
[Figure: the same tree decomposition processed by Cluster Tree Elimination (left) and by Mini-Clustering (right)]
34Mini-Clustering
- Correctness and completeness: Algorithm MC(i) computes a bound (or an approximation) on the joint probability P(X_i, e) of each variable and each of its values
- Time and space complexity: O(n · hw* · d^i)
  - where hw* = max_u |{ f : scope(f) ∩ χ(u) ≠ ∅ }|, the maximum number of functions intersecting any cluster
35Lower bounds and mean approximations
- We can replace the max operator by
  - min => a lower bound on the joint
  - mean => an approximation of the joint
36Normalization
- MC can compute an upper bound on the joint P(X_i, e)
- Deriving a bound on the conditional P(X_i|e) is not easy when the exact P(e) is not available
- If a lower bound on P(e) were available, we could use the ratio (upper bound on the joint) / (lower bound on P(e)) as an upper bound on the posterior
- In our experiments we normalized the results and regarded them as approximations of the posterior P(X_i|e); see the snippet below
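The normalization used in the experiments amounts to the following (illustrative numbers):

```python
# Treat the (upper-bound) values computed for P(x_i, e) as unnormalized scores
# and renormalize them over the domain of X_i.
upper_joint = {0: 0.012, 1: 0.030}                   # made-up bounds on P(X_i = x, e)
z = sum(upper_joint.values())
approx_posterior = {x: v / z for x, v in upper_joint.items()}
print(approx_posterior)                              # used as an approximation of P(X_i | e)
```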
37Experimental results
- Algorithms
- Exact
- IBP
- Gibbs sampling (GS)
- MC with normalization (approximate)
- Networks (all variables are binary)
- Coding networks
- CPCS 54, 360, 422
- Grid networks (MxM)
- Random noisy-OR networks
- Random networks
- Measures
- Normalized Hamming Distance (NHD)
- BER (Bit Error Rate)
- Absolute error
- Relative error
- Time
38Random networks - Absolute error
evidence = 0
evidence = 10
39Noisy-OR networks - Absolute error
evidence = 10
evidence = 20
40Grid 15x15 - 10 evidence
41CPCS422 - Absolute error
evidence = 0
evidence = 10
42Coding networks - Bit Error Rate
sigma = 0.22
sigma = 0.51
43Mini-Clustering summary
- MC extends the partition-based approximation from mini-buckets to general tree decompositions for the problem of belief updating
- Empirical evaluation demonstrates its effectiveness and superiority (for certain types of problems, with respect to the measures considered) relative to other existing algorithms
44What is IJGP?
- IJGP is an approximate algorithm for belief updating in Bayesian networks
- IJGP is a version of join-tree clustering which is both anytime and iterative
- IJGP applies message passing along a join-graph, rather than a join-tree
- Empirical evaluation shows that IJGP is almost always superior to other approximate schemes (IBP, MC)
45Iterative Belief Propagation - IBP
One-step update of BEL(U1)
[Figure: node X1 with parents U1, U2, U3 and child X2]
- Belief propagation is exact for poly-trees
- IBP - applying BP iteratively to cyclic networks
- No guarantees for convergence
- Works well for many coding networks
46IJGP - Motivation
- IBP is applied to a loopy network iteratively
  - it is not an anytime algorithm
  - when it converges, it converges very fast
- MC applies bounded inference along a tree decomposition
  - MC is an anytime algorithm controlled by the i-bound
  - MC converges in two passes, up and down the tree
- IJGP combines
  - the iterative feature of IBP
  - the anytime feature of MC
47IJGP - The basic idea
- Apply Cluster Tree Elimination to any join-graph
- We commit to graphs that are minimal I-maps
- Avoid cycles as long as I-mapness is not violated
- Result use minimal arc-labeled join-graphs
48IJGP - Example
[Figure: a) the belief network over variables A-J; b) the dual join-graph that IBP works on, with clusters ABC, ABDE, BCE, CDEF, FGH, FGI, GHIJ and arcs labeled by shared variables (e.g., AB, BC, BE, DE, CE, F, GH, GI, H)]
49Arc-minimal join-graph
[Figure: the join-graph before (left) and after (right) removing redundant arcs to obtain an arc-minimal join-graph; clusters ABC, ABDE, BCE, CDEF, FGH, FGI, GHIJ]
50Minimal arc-labeled join-graph
[Figure: the arc-minimal join-graph (left) and the minimal arc-labeled join-graph obtained by shrinking the arc labels (right); clusters ABC, ABDE, BCE, CDEF, FGH, FGI, GHIJ]
51Join-graph decompositions
[Figure: a) minimal arc-labeled join-graph; b) the join-graph obtained by collapsing nodes of graph a) (e.g., merging into ABCDE); c) the minimal arc-labeled version of b)]
52Tree decomposition
[Figure: a) a minimal arc-labeled join-graph; b) a tree decomposition with clusters ABCDE, CDEF, FGHI, GHIJ]
53Join-graphs
[Figure: a spectrum of join-graph decompositions, ranging from the dual graph used by IBP to a join-tree; moving toward larger clusters gives more accuracy, moving toward smaller clusters gives less complexity]
54Message propagation
[Figure: message propagation on the join-graph. Cluster 1 = {A,B,C,D,E} contains p(a), p(c), p(b|a,c), p(d|a,b,e), p(e|b,c) and the incoming message h_(3,1)(b,c); it sends h_(1,2) to cluster 2 = {C,D,E,F}.
Minimal arc-labeled: sep(1,2) = {D,E}, elim(1,2) = {A,B,C}
Non-minimal arc-labeled: sep(1,2) = {C,D,E}, elim(1,2) = {A,B}]
55Bounded decompositions
- We want arc-labeled decompositions such that
  - the cluster size (internal width) is bounded by i (the accuracy parameter)
  - the width of the decomposition as a graph (external width) is as small as possible
- Possible approaches to building decompositions
  - partition-based algorithms, inspired by the mini-bucket decomposition
  - grouping-based algorithms
56Partition-based algorithms
[Figure: a) schematic mini-bucket(i), i = 3; b) the corresponding arc-labeled join-graph decomposition with clusters GFE: P(G|F,E), EBF: P(E|B,F), FCD: P(F|C,D), CDB: P(D|B), CAB: P(C|A,B), BA: P(B|A), A: P(A), connected by arcs labeled with shared variables such as EF, BF, F, CD, CB, B, A]
57IJGP properties
- IJGP(i) applies BP to a minimal arc-labeled join-graph whose cluster size is bounded by i
- On join-trees, IJGP finds exact beliefs
- IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman, Weiss 2001)
- Complexity of one iteration:
  - time: O(deg · (n+N) · d^(i+1))
  - space: O(N · d^θ), where θ is the maximum arc-label size
58Empirical evaluation
- Measures
- Absolute error
- Relative error
- Kullback-Leibler (KL) distance
- Bit Error Rate
- Time
- Algorithms
- Exact
- IBP
- MC
- IJGP
- Networks (all variables are binary)
- Random networks
- Grid networks (MxM)
- CPCS 54, 360, 422
- Coding networks
59Random networks - KL at convergence
evidence = 0
evidence = 5
60Random networks - KL vs. iterations
evidence = 0
evidence = 5
61Random networks - Time
62Coding networks - BER
sigma = 0.22
sigma = 0.32
sigma = 0.51
sigma = 0.65
63Coding networks - Time
64IJGP summary
- IJGP borrows the iterative feature from IBP and the anytime virtues of bounded inference from MC
- Empirical evaluation showed the potential of IJGP, which improves with iterations and, most of the time, with the i-bound, and scales up to large networks
- IJGP is almost always superior, often by a large margin, to IBP and MC
- Based on all our experiments, we think that IJGP provides a practical breakthrough for the task of belief updating
65Heuristic search
- Mini-buckets record upper-bound heuristic functions
- The evaluation function over a partial assignment combines its exact cost so far with the recorded mini-bucket bound (f = g·H; see the following slides)
- Best-first: expand the node with the maximal evaluation function
- Branch and Bound: prune a node whose evaluation function f does not exceed the best solution found so far (see the sketch below)
- Properties
  - an exact algorithm
  - better heuristics lead to more pruning
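A compact, generic depth-first Branch-and-Bound sketch for MPE. Here value and f_bound are assumed callbacks: a full-assignment evaluator and an upper-bounding evaluation function, such as the mini-bucket f of the next slides; all names are choices made for this sketch.

```python
# Illustrative depth-first Branch-and-Bound over binary variables.
def branch_and_bound(variables, value, f_bound):
    best = {'val': 0.0, 'assign': None}

    def dfs(assign, depth):
        if depth == len(variables):
            v = value(assign)
            if v > best['val']:
                best['val'], best['assign'] = v, dict(assign)
            return
        x = variables[depth]
        for val in (0, 1):
            assign[x] = val
            # prune: an upper bound on any extension cannot beat the best so far
            if f_bound(assign) > best['val']:
                dfs(assign, depth + 1)
            del assign[x]

    dfs({}, 0)
    return best['val'], best['assign']
```

With a trivial bound (always 1.0) this degenerates to exhaustive search; a tighter mini-bucket bound prunes more of the tree.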
66Heuristic Function
Given a cost function

P(a,b,c,d,e) = P(a) · P(b|a) · P(c|a) · P(e|b,c) · P(d|a,b)

define an evaluation function over a partial assignment as the probability of its best extension:

[Figure: search tree assigning A, then E, then D, with 0/1 branches]

f(a,e,d) = max_{b,c} P(a,b,c,d,e)
         = P(a) · max_{b,c} P(b|a) P(c|a) P(e|b,c) P(d|a,b)
         = g(a,e,d) · H(a,e,d)
67Heuristic Function
H(a,e,d) can be relaxed by mini-buckets:

max_{b,c} P(b|a) P(c|a) P(e|b,c) P(d|a,b)
  = max_c P(c|a) · max_b P(e|b,c) P(b|a) P(d|a,b)
  ≤ max_c P(c|a) · max_b P(e|b,c) · max_b P(b|a) P(d|a,b)  =  H(a,e,d)

so that f(a,e,d) = g(a,e,d) · H(a,e,d) ≥ the probability of the best extension of (a,e,d).

The heuristic function H is compiled during the preprocessing stage of the Mini-Bucket algorithm.
68Heuristic Function
The evaluation function f(x^p) can be computed using the functions recorded by the Mini-Bucket scheme, and can be used to estimate the probability of the best extension of the partial assignment x^p = (x_1, ..., x_p):

f(x^p) = g(x^p) · H(x^p)

For example, the mini-bucket trace is:

bucket B:  max_b P(e|b,c)  |  max_b P(d|a,b) P(b|a)   ->  h^B(e,c), h^B(d,a)
bucket C:  max_c P(c|a) h^B(e,c)                      ->  h^C(e,a)
bucket D:  max_d h^B(d,a)                             ->  h^D(a)
bucket E:  max_e h^C(e,a)                             ->  h^E(a)
bucket A:  max_a P(a) h^E(a) h^D(a)

H(a,e,d) = h^B(d,a) · h^C(e,a)
g(a,e,d) = P(a)

A sketch of this computation follows.
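The sketch below uses made-up CPTs: the functions h^B and h^C are compiled once by maximization, and f = g · H is then checked to never underestimate the probability of the best extension.

```python
# Illustrative mini-bucket evaluation function f(a,e,d) = g(a,e,d) * H(a,e,d).
from itertools import product

P_a  = {0: 0.4, 1: 0.6}
P_ba = {(b, a): v for (b, a), v in zip(product((0, 1), repeat=2), [0.7, 0.2, 0.3, 0.8])}
P_ca = {(c, a): v for (c, a), v in zip(product((0, 1), repeat=2), [0.5, 0.9, 0.5, 0.1])}
P_ebc = {(e, b, c): v for (e, b, c), v in
         zip(product((0, 1), repeat=3), [0.6, 0.3, 0.2, 0.9, 0.4, 0.7, 0.8, 0.1])}
P_dab = {(d, a, b): v for (d, a, b), v in
         zip(product((0, 1), repeat=3), [0.5, 0.6, 0.3, 0.2, 0.5, 0.4, 0.7, 0.8])}

# functions compiled by the mini-bucket preprocessing (bucket B, then bucket C)
hB_ec = {(e, c): max(P_ebc[e, b, c] for b in (0, 1)) for e, c in product((0, 1), repeat=2)}
hB_da = {(d, a): max(P_dab[d, a, b] * P_ba[b, a] for b in (0, 1))
         for d, a in product((0, 1), repeat=2)}
hC_ea = {(e, a): max(P_ca[c, a] * hB_ec[e, c] for c in (0, 1))
         for e, a in product((0, 1), repeat=2)}

def f_mb(a, e, d):
    g = P_a[a]                               # fully instantiated original functions
    H = hB_da[d, a] * hC_ea[e, a]            # compiled mini-bucket functions
    return g * H

for a, e, d in product((0, 1), repeat=3):
    exact = P_a[a] * max(P_ba[b, a] * P_ca[c, a] * P_ebc[e, b, c] * P_dab[d, a, b]
                         for b, c in product((0, 1), repeat=2))
    assert f_mb(a, e, d) >= exact            # admissible: never underestimates
```

A function like f_mb is the kind of bound that the Branch-and-Bound sketch earlier would use for pruning.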
69Properties
- The heuristic is monotone
- The heuristic is admissible
- The heuristic is computed in linear time
- IMPORTANT:
  - Mini-buckets generate heuristics of varying strength using the control parameter (bound) i
  - Higher bound -> more preprocessing -> stronger heuristics -> less search
  - Allows a controlled trade-off between preprocessing and search
70Empirical Evaluation of mini-bucket heuristics