Title: Approximation Techniques: Bounded Inference
1Approximation Techniques: Bounded Inference
2Approximate Inference
- Metrics of evaluation
- Absolute error: given ε > 0 and a query p = P(x|e), an estimate r has absolute error ε iff |p - r| < ε
- Relative error: the ratio r/p lies in [1-ε, 1+ε]
- Dagum and Luby (1993): approximation up to a relative error is NP-hard
- Approximation within absolute error less than 0.5 is also NP-hard
3Mini-buckets local inference
- Computation in a bucket is time and space exponential in the number of variables involved
- Therefore, partition the functions in a bucket into mini-buckets over smaller numbers of variables
- (The idea is similar to i-consistency: bound the size of recorded dependencies, Dechter 2003)
4Mini-bucket approximation MPE task
Splitting a bucket into mini-buckets bounds the complexity: maximizing each mini-bucket separately can only over-estimate the exact maximization of the full product, so it yields an upper bound at a much lower cost (a small sketch follows).
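To see why the split yields a bound, here is a small Python sketch (illustrative, not from the slides): the product of the separate mini-bucket maximizations is always at least the exact maximization over the whole bucket, and each maximization touches fewer variables at a time.

```python
# Minimal sketch (illustrative): maximizing each mini-bucket separately
# upper-bounds maximizing the full bucket product.
from itertools import product

# Four functions that would sit in bucket B, each over B and one other variable.
# Tables are indexed by binary tuples; the numbers are arbitrary placeholders.
F_ab = {(a, b): v for (a, b), v in zip(product((0, 1), repeat=2), [0.9, 0.1, 0.4, 0.6])}
F_bc = {(b, c): v for (b, c), v in zip(product((0, 1), repeat=2), [0.2, 0.8, 0.7, 0.3])}
F_bd = {(b, d): v for (b, d), v in zip(product((0, 1), repeat=2), [0.5, 0.5, 0.9, 0.1])}
F_be = {(b, e): v for (b, e), v in zip(product((0, 1), repeat=2), [0.6, 0.4, 0.2, 0.8])}

def exact_elim(a, c, d, e):
    # max_b of the full product: exponential in all variables mentioned in the bucket
    return max(F_ab[a, b] * F_bc[b, c] * F_bd[b, d] * F_be[b, e] for b in (0, 1))

def minibucket_elim(a, c, d, e):
    # split into mini-buckets {F(a,b), F(b,c)} and {F(b,d), F(b,e)},
    # maximize each separately: h_B(a,c) * h_B(d,e)
    h1 = max(F_ab[a, b] * F_bc[b, c] for b in (0, 1))
    h2 = max(F_bd[b, d] * F_be[b, e] for b in (0, 1))
    return h1 * h2

for a, c, d, e in product((0, 1), repeat=4):
    assert minibucket_elim(a, c, d, e) >= exact_elim(a, c, d, e)  # always an upper bound
```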
5Mini-Bucket Elimination
[Figure: Mini-Bucket Elimination trace along the ordering d = (A, E, D, C, B), buckets processed from B down to A.
bucket B: F(a,b), F(b,c) | F(b,d), F(b,e) is split into two mini-buckets, generating h^B(a,c) and h^B(d,e)
bucket C: F(c,e), F(a,c), h^B(a,c) generates h^C(e,a)
bucket D: F(a,d), h^B(d,e) generates h^D(e,a)
bucket E (evidence e = 0): h^C(e,a), h^D(e,a) generates h^E(a)
bucket A: h^E(a); maximizing here gives an upper bound, and evaluating the assignment assembled along the ordering gives L, a lower bound]
6Semantics of Mini-Bucket Splitting a Node
Variables in different mini-buckets are renamed and duplicated (Kask et al., 2001; Geffner et al., 2007; Choi, Chavira, and Darwiche, 2007).
[Figure: network N before splitting, with node U; network N' after splitting, with U duplicated into U and Û]
7Approx-mpe(i)
- Input: i, the maximum number of variables allowed in a mini-bucket
- Output: a lower bound (the probability of a suboptimal solution) and an upper bound
- Example: approx-mpe(3) versus elim-mpe
8MBE(i,m) and MBE(i) (also known as approx-mpe)
- Input: belief network (P_1, ..., P_n)
- Output: upper and lower bounds
- Initialize: put functions in buckets
- Process each bucket, from the last variable in the ordering down to the first:
  - Create (i,m)-mini-buckets
  - Process each mini-bucket
- (For MPE) assign values along the ordering d
- Return the MPE tuple and the upper and lower bounds (a sketch follows below)
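A compact sketch of this loop, under simplifying assumptions: binary domains, factors given as (scope, function) pairs where the function maps a {variable: value} dict to a number, and no evidence handling. The name mbe_mpe and the lazy factor representation are choices made for this sketch, not the original implementation.

```python
# Illustrative MBE(i) sketch for MPE: upper bound U from mini-bucket processing,
# lower bound L from the greedily assembled assignment. Generated functions are
# represented lazily (recomputed on demand), trading efficiency for brevity.
def eval_product(factors, assign):
    result = 1.0
    for scope, fn in factors:
        result *= fn(assign)
    return result

def mbe_mpe(factors, ordering, i):
    pos = {x: k for k, x in enumerate(ordering)}
    buckets = {x: [] for x in ordering}
    for scope, fn in factors:                        # initialize: place each function
        buckets[max(scope, key=pos.get)].append((scope, fn))

    constants = []
    for x in reversed(ordering):                     # process buckets, last to first
        minis = []                                   # greedy partition into (i)-mini-buckets
        for scope, fn in buckets[x]:
            for mb_scope, mb_factors in minis:
                if len(mb_scope | set(scope)) <= i:
                    mb_scope.update(scope)
                    mb_factors.append((scope, fn))
                    break
            else:
                minis.append((set(scope), [(scope, fn)]))
        for mb_scope, mb_factors in minis:           # eliminate x from each mini-bucket by max
            new_scope = tuple(v for v in mb_scope if v != x)
            def h(assign, fs=tuple(mb_factors), x=x):
                return max(eval_product(fs, {**assign, x: v}) for v in (0, 1))
            if new_scope:
                buckets[max(new_scope, key=pos.get)].append((new_scope, h))
            else:
                constants.append(((), h))

    U = eval_product(constants, {})                  # upper bound on the MPE value

    assign = {}                                      # forward step: assign along the ordering
    for x in ordering:
        assign[x] = max((0, 1), key=lambda v: eval_product(buckets[x], {**assign, x: v}))
    L = eval_product(factors, assign)                # lower bound: probability of this tuple
    return L, U, assign
```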
9Properties of approx-mpe(i)
- Complexity O(exp(2i)) time and O(exp(i))
space. - Accuracy determined by upper/lower (U/L) bound.
- As i increases, both accuracy and complexity
increase. - Possible use of mini-bucket approximations
- As anytime algorithms (Dechter and Rish, 1997)
- As heuristics in best-first search (Kask and
Dechter, 1999) - Other tasks similar mini-bucket approximations
for belief updating, MAP and MEU (Dechter and
Rish, 1997) -
10Anytime Approximation
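One way to realize the anytime scheme mentioned on the previous slide is a loop that raises the i-bound until the bound ratio is tight enough or resources run out. The sketch below assumes an mbe_mpe(factors, ordering, i) routine such as the one above; the stopping rule is an assumption made for illustration, not the exact algorithm of Dechter and Rish (1997).

```python
# Illustrative anytime loop over the mini-bucket bound parameter i.
def anytime_mpe(factors, ordering, epsilon=0.1, i_max=20):
    best = None
    for i in range(1, i_max + 1):
        L, U, assignment = mbe_mpe(factors, ordering, i)
        best = (L, U, assignment)
        if L > 0 and U / L <= 1 + epsilon:   # bounds are tight enough: stop early
            break
    return best
```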
11Bounded Inference for belief updating
- Idea: the mini-bucket principle is the same
- We can apply sum in each mini-bucket, or better, sum in one mini-bucket and max in the rest (for an upper bound), or min (for a lower bound); see the sketch below
- Approx-bel-max(i,m), generating upper and lower bounds on beliefs, approximates elim-bel
- Approx-map(i,m): max buckets are maximized, sum buckets are sum-max; approximates elim-map
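A tiny illustration of the bounding argument: with two mini-buckets represented by the functions f and g of the eliminated variable b, summing one and maximizing (or minimizing) the other brackets the exact sum of products. The numbers are arbitrary.

```python
# Sum in one mini-bucket, max (resp. min) in the other: an upper (resp. lower) bound.
f = {0: 0.3, 1: 0.7}      # product of the first mini-bucket, as a function of b
g = {0: 0.9, 1: 0.2}      # product of the second mini-bucket, as a function of b

exact = sum(f[b] * g[b] for b in (0, 1))
upper = sum(f.values()) * max(g.values())
lower = sum(f.values()) * min(g.values())
assert lower <= exact <= upper
print(lower, exact, upper)   # 0.2, 0.41, 0.9
```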
12Empirical Evaluation (Dechter and Rish, 1997; Rish thesis, 1999)
- Randomly generated networks
  - Uniform random probabilities
  - Random noisy-OR
- CPCS networks
- Probabilistic decoding
- Comparing approx-mpe and anytime-mpe versus elim-mpe
13Causal Independence
- Event X has two possible causes, A and B. It is hard to elicit P(X|A,B), but it is easy to determine P(X|A) and P(X|B).
- Example: several diseases cause a symptom.
- The effect of A on X is independent of the effect of B on X.
- Causal independence uses canonical models: noisy-OR, noisy-AND, noisy-max.
[Figure: causes A and B with common effect X]
14Binary OR
[Figure: causes A and B with common effect X]

A  B  P(X=0|A,B)  P(X=1|A,B)
0  0      1           0
0  1      0           1
1  0      0           1
1  1      0           1
15Noisy-OR
- Noise is associated with each edge, described by a noise parameter λ ∈ [0,1]
- Let q_b = 0.2, q_a = 0.1, where q_i = P(X=0 | A_i=1, all other causes inactive)
- P(x=0|a,b) = (1-λ_a)(1-λ_b)
- P(x=1|a,b) = 1 - (1-λ_a)(1-λ_b)
[Figure: causes A and B with common effect X; edge noise parameters λ_a, λ_b]

A  B  P(X=0|A,B)  P(X=1|A,B)
0  0      1           0
0  1      0.2         0.8
1  0      0.1         0.9
1  1      0.02        0.98
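A small illustrative sketch that rebuilds the table above from the two per-cause parameters q_a = 0.1 and q_b = 0.2:

```python
# Noisy-OR CPT from per-cause parameters q_i = P(X=0 | only cause i active).
from itertools import product

q = {'A': 0.1, 'B': 0.2}

def p_x0(assignment):
    # P(X=0 | causes) = product of q_i over the active causes
    result = 1.0
    for cause, value in assignment.items():
        if value == 1:
            result *= q[cause]
    return result

for a, b in product((0, 1), repeat=2):
    p0 = p_x0({'A': a, 'B': b})
    print(a, b, round(p0, 4), round(1 - p0, 4))
# (1,1) gives P(X=0) = 0.1 * 0.2 = 0.02, matching the table above.
```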
16Closed Form Bel(X) - 1
Given a noisy-OR CPT P(x|u) with noise parameters λ_i for the parents u_i ∈ U, define q_i = 1 - λ_i. Then

P(x = 0 | u) = Π_{i : u_i = 1} q_i,    P(x = 1 | u) = 1 - Π_{i : u_i = 1} q_i
17Closed Form Bel(X) - 2
Using iterative belief propagation, set π_i^x = π^x(u_i = 1), the causal support arriving from parent U_i. Then we can show that the belief factorizes over the parents:

Bel(x = 0) ∝ Π_i (1 - λ_i π_i^x)
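A quick numerical check of this factorization (illustrative; the π values are made up): the product over parents equals the brute-force sum over all parent configurations.

```python
# Closed-form noisy-OR belief vs. brute-force enumeration of the parents.
from itertools import product

q  = {'A': 0.1, 'B': 0.2}          # q_i = 1 - lambda_i (noise parameters as above)
pi = {'A': 0.7, 'B': 0.4}          # pi_i: causal support P(U_i = 1) reaching X

# Brute force: sum over all parent configurations of P(u) * P(x=0|u)
brute = sum(
    (pi['A'] if a else 1 - pi['A']) *
    (pi['B'] if b else 1 - pi['B']) *
    (q['A'] ** a) * (q['B'] ** b)
    for a, b in product((0, 1), repeat=2))

# Closed form: product over parents of (1 - lambda_i * pi_i)
closed = 1.0
for i in q:
    closed *= 1 - (1 - q[i]) * pi[i]

assert abs(brute - closed) < 1e-12
```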
18Methodology for Empirical Evaluation (for MPE)
- U/L accuracy
- Better: U/MPE or MPE/L
- Benchmarks: random networks
  - Given n, e, v, generate a random DAG
  - For each x_i and its parents, generate a table from uniform [0,1], or noisy-OR
  - Create k instances; for each, generate random evidence and likely evidence
- Measure averages
19Random networks
- Uniform random: 60 nodes, 90 edges (200 instances)
- In 80% of cases, a 10-100 times speed-up while U/L < 2
- Noisy-OR: even better results
- Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
20CPCS networks: medical diagnosis (noisy-OR model)
Test case: no evidence
21The effect of evidence
More likely evidence => higher MPE => higher accuracy (why?)
Likely evidence versus random (unlikely) evidence
22Probabilistic decoding
Error-correcting linear block code
State-of-the-art approximate algorithm: iterative belief propagation (IBP), Pearl's poly-tree algorithm applied to loopy networks
23Iterative Belief Propagation
- Belief propagation is exact for poly-trees
- IBP - applying BP iteratively to cyclic networks
- No guarantees for convergence
- Works well for many coding networks
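For concreteness, here is a generic sum-product sketch of belief propagation on a factor graph with binary variables and a flooding schedule; applied iteratively to a loopy network, this is IBP. It is a minimal illustration (messages left unnormalized), not the slides' or any library's implementation.

```python
# Minimal loopy belief propagation sketch. Factors are (scope, table) pairs,
# where table maps a tuple assignment of the scope to a non-negative number.
from itertools import product

def loopy_bp(variables, factors, iterations=10):
    edges = [(i, v) for i, (scope, _) in enumerate(factors) for v in scope]
    msg_fv = {e: {0: 1.0, 1: 1.0} for e in edges}   # factor -> variable messages
    msg_vf = {e: {0: 1.0, 1: 1.0} for e in edges}   # variable -> factor messages

    for _ in range(iterations):
        for i, v in edges:                           # variable -> factor update
            out = {0: 1.0, 1: 1.0}
            for j, w in edges:
                if w == v and j != i:
                    for x in (0, 1):
                        out[x] *= msg_fv[(j, w)][x]
            msg_vf[(i, v)] = out
        for i, v in edges:                           # factor -> variable update
            scope, table = factors[i]
            out = {0: 0.0, 1: 0.0}
            for assignment in product((0, 1), repeat=len(scope)):
                a = dict(zip(scope, assignment))
                term = table[assignment]
                for u in scope:
                    if u != v:
                        term *= msg_vf[(i, u)][a[u]]
                out[a[v]] += term
            msg_fv[(i, v)] = out

    beliefs = {}
    for v in variables:                              # combine messages into beliefs
        b = {x: 1.0 for x in (0, 1)}
        for i, w in edges:
            if w == v:
                for x in (0, 1):
                    b[x] *= msg_fv[(i, v)][x]
        z = b[0] + b[1]
        beliefs[v] = {x: b[x] / z for x in (0, 1)}
    return beliefs
```

On a poly-tree, enough iterations reproduce the exact marginals; on a loopy network the same loop is IBP and convergence is not guaranteed.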
24approx-mpe vs. IBP
Bit error rate (BER) as a function of noise
(sigma)
25Mini-buckets summary
- Mini-buckets: local inference approximation
- Idea: bound the size of recorded functions
- Approx-mpe(i): a mini-bucket algorithm for MPE
- Better results for noisy-OR than for random problems
- Accuracy increases with decreasing noise in coding
- Accuracy increases for likely evidence
- Sparser graphs -> higher accuracy
- Coding networks: approx-mpe outperforms IBP on low induced-width codes
26Cluster Tree Elimination - properties
- Correctness and completeness: Algorithm CTE is correct, i.e., it computes the exact joint probability of a single variable and the evidence
- Time complexity: O(deg · (n + N) · d^(w+1))
- Space complexity: O(N · d^sep)
  - where deg = the maximum degree of a node
  - n = number of variables (= number of CPTs)
  - N = number of nodes in the tree decomposition
  - d = the maximum domain size of a variable
  - w = the induced width
  - sep = the separator size
27Cluster Tree Elimination - the messages
[Figure: tree decomposition and CTE messages
cluster 1: {A,B,C} with p(a), p(b|a), p(c|a,b)
  -- separator BC --
cluster 2: {B,C,D,F} with p(d|b), p(f|c,d), h_(1,2)(b,c);  sep(2,3) = {B,F}, elim(2,3) = {C,D}
  -- separator BF --
cluster 3: {B,E,F} with p(e|b,f), h_(2,3)(b,f)
  -- separator EF --
cluster 4: {E,F,G} with p(g|e,f)]
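As an illustration of one such message, the sketch below computes h_(2,3)(b,f) by summing the eliminator {C, D} out of the product of cluster 2's functions; the probability tables are made-up placeholders.

```python
# Illustrative CTE message h_(2,3)(b,f) from cluster 2 of the figure above.
from itertools import product

p_d_given_b = {(d, b): v for (d, b), v in
               zip(product((0, 1), repeat=2), [0.6, 0.3, 0.4, 0.7])}                      # P(d|b)
p_f_given_cd = {(f, c, d): v for (f, c, d), v in
                zip(product((0, 1), repeat=3), [0.9, 0.5, 0.2, 0.4, 0.1, 0.5, 0.8, 0.6])}  # P(f|c,d)
h_12 = {(b, c): v for (b, c), v in
        zip(product((0, 1), repeat=2), [0.8, 0.2, 0.5, 0.5])}                             # message from cluster 1

h_23 = {}
for b, f in product((0, 1), repeat=2):
    h_23[b, f] = sum(p_d_given_b[d, b] * p_f_given_cd[f, c, d] * h_12[b, c]
                     for c, d in product((0, 1), repeat=2))
print(h_23)   # the message sent from cluster 2 to cluster 3, a function of (b, f)
```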
28Mini-Clustering for belief updating
- Motivation
  - The time and space complexity of Cluster Tree Elimination depend on the induced width w of the problem
  - When the induced width w is big, the CTE algorithm becomes infeasible
- The basic idea
  - Try to reduce the size of the cluster (the exponent): partition each cluster into mini-clusters with fewer variables
  - Accuracy parameter i: the maximum number of variables in a mini-cluster
  - The idea was explored for variable elimination (Mini-Buckets)
29Idea of Mini-Clustering
30Mini-Clustering - MC
[Figure: Mini-Clustering with i = 3 vs. Cluster Tree Elimination on the same tree decomposition
cluster 1: {A,B,C} with p(a), p(b|a), p(c|a,b)
  -- separator BC --
cluster 2: {B,C,D,F}; in CTE it holds p(d|b), h_(1,2)(b,c), p(f|c,d), while in MC(3) it is partitioned before computing its message;  sep(2,3) = {B,F}, elim(2,3) = {C,D}
  -- separator BF --
cluster 3: {B,E,F} with p(e|b,f)
  -- separator EF --
cluster 4: {E,F,G} with p(g|e,f)]
31Mini-Clustering - the messages, i3
[Figure: Mini-Clustering messages, i = 3
cluster 1: {A,B,C} with p(a), p(b|a), p(c|a,b)
  -- separator BC --
cluster 2: partitioned into mini-clusters {B,C,D}: p(d|b), h_(1,2)(b,c) and {C,D,F}: p(f|c,d);  sep(2,3) = {B,F}, elim(2,3) = {C,D}
  -- separator BF --
cluster 3: {B,E,F} with p(e|b,f), h1_(2,3)(b), h2_(2,3)(f)
  -- separator EF --
cluster 4: {E,F,G} with p(g|e,f)]
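The corresponding Mini-Clustering computation (i = 3) splits cluster 2 and sends two small messages instead of one; with max applied in the mini-cluster that is not summed, their product upper-bounds the exact CTE message. The tables are the same placeholders as in the CTE sketch above.

```python
# Illustrative Mini-Clustering version of the (2,3) message with i = 3.
from itertools import product

p_d_given_b = {(d, b): v for (d, b), v in
               zip(product((0, 1), repeat=2), [0.6, 0.3, 0.4, 0.7])}
p_f_given_cd = {(f, c, d): v for (f, c, d), v in
                zip(product((0, 1), repeat=3), [0.9, 0.5, 0.2, 0.4, 0.1, 0.5, 0.8, 0.6])}
h_12 = {(b, c): v for (b, c), v in
        zip(product((0, 1), repeat=2), [0.8, 0.2, 0.5, 0.5])}

# mini-cluster 1: sum C, D out of p(d|b) * h_(1,2)(b,c)  ->  h1(b)
h1 = {b: sum(p_d_given_b[d, b] * h_12[b, c] for c, d in product((0, 1), repeat=2))
      for b in (0, 1)}
# mini-cluster 2: max C, D out of p(f|c,d)               ->  h2(f)
h2 = {f: max(p_f_given_cd[f, c, d] for c, d in product((0, 1), repeat=2))
      for f in (0, 1)}

for b, f in product((0, 1), repeat=2):
    exact = sum(p_d_given_b[d, b] * p_f_given_cd[f, c, d] * h_12[b, c]
                for c, d in product((0, 1), repeat=2))
    assert h1[b] * h2[f] >= exact    # the pair (h1, h2) upper-bounds the CTE message
```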
32Mini-Clustering - example
[Figure: the tree decomposition 1:{A,B,C} - BC - 2:{B,C,D,F} - BF - 3:{B,E,F} - EF - 4:{E,F,G}, with the mini-clustering messages passed along each edge]
33Cluster Tree Elimination vs. Mini-Clustering
[Figure: the same tree decomposition processed by Cluster Tree Elimination (left) and by Mini-Clustering (right)]
34Mini-Clustering
- Correctness and completeness: Algorithm MC(i) computes a bound (or an approximation) on the joint probability P(X_i, e) of each variable and each of its values
- Time and space complexity: O(n · hw* · d^i)
  - where hw* = max_u |{ f : scope(f) ∩ χ(u) ≠ ∅ }|, the maximum number of functions intersecting any cluster
35Lower bounds and mean approximations
- We can replace the max operator by
  - min => a lower bound on the joint
  - mean => an approximation of the joint
36Normalization
- MC can compute an upper bound on the joint P(X_i, e)
- Deriving a bound on the conditional P(X_i|e) is not easy when the exact P(e) is not available
- If a lower bound on P(e) were available, we could use the ratio (upper bound on the joint) / (lower bound on P(e)) as an upper bound on the posterior
- In our experiments we normalized the results and regarded them as approximations of the posterior P(X_i|e); see the snippet below
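The normalization used in the experiments amounts to the following (illustrative numbers):

```python
# Treat the (upper-bound) values computed for P(x_i, e) as unnormalized scores
# and renormalize them over the domain of X_i.
upper_joint = {0: 0.012, 1: 0.030}                   # made-up bounds on P(X_i = x, e)
z = sum(upper_joint.values())
approx_posterior = {x: v / z for x, v in upper_joint.items()}
print(approx_posterior)                              # used as an approximation of P(X_i | e)
```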
37Experimental results
- Algorithms
- Exact
- IBP
- Gibbs sampling (GS)
- MC with normalization (approximate)
- Networks (all variables are binary)
- Coding networks
- CPCS 54, 360, 422
- Grid networks (MxM)
- Random noisy-OR networks
- Random networks
- Measures
- Normalized Hamming Distance (NHD)
- BER (Bit Error Rate)
- Absolute error
- Relative error
- Time
38Random networks - Absolute error
evidence = 0
evidence = 10
39Noisy-OR networks - Absolute error
evidence = 10
evidence = 20
40Grid 15x15 - 10 evidence
41CPCS422 - Absolute error
evidence = 0
evidence = 10
42Coding networks - Bit Error Rate
sigma = 0.22
sigma = 0.51
43Mini-Clustering summary
- MC extends the partition-based approximation from mini-buckets to general tree decompositions for the problem of belief updating
- Empirical evaluation demonstrates its effectiveness and superiority (for certain types of problems, with respect to the measures considered) relative to other existing algorithms
44What is IJGP?
- IJGP is an approximate algorithm for belief updating in Bayesian networks
- IJGP is a version of join-tree clustering which is both anytime and iterative
- IJGP applies message passing along a join-graph, rather than a join-tree
- Empirical evaluation shows that IJGP is almost always superior to other approximate schemes (IBP, MC)
45Iterative Belief Propagation - IBP
One-step update of BEL(U1)
[Figure: node X1 with parents U1, U2, U3 and child X2]
- Belief propagation is exact for poly-trees
- IBP - applying BP iteratively to cyclic networks
- No guarantees for convergence
- Works well for many coding networks
46IJGP - Motivation
- IBP is applied to a loopy network iteratively
  - it is not an anytime algorithm
  - when it converges, it converges very fast
- MC applies bounded inference along a tree decomposition
  - MC is an anytime algorithm controlled by the i-bound
  - MC converges in two passes, up and down the tree
- IJGP combines
  - the iterative feature of IBP
  - the anytime feature of MC
47IJGP - The basic idea
- Apply Cluster Tree Elimination to any join-graph
- We commit to graphs that are minimal I-maps
- Avoid cycles as long as I-mapness is not violated
- Result use minimal arc-labeled join-graphs
48IJGP - Example
[Figure: a) the belief network over variables A-J; b) the dual join-graph that IBP works on, with clusters ABC, ABDE, BCE, CDEF, FGH, FGI, GHIJ and arcs labeled by shared variables (e.g., AB, BC, BE, DE, CE, F, GH, GI, H)]
49Arc-minimal join-graph
[Figure: the join-graph before (left) and after (right) removing redundant arcs to obtain an arc-minimal join-graph; clusters ABC, ABDE, BCE, CDEF, FGH, FGI, GHIJ]
50Minimal arc-labeled join-graph
[Figure: the arc-minimal join-graph (left) and the minimal arc-labeled join-graph obtained by shrinking the arc labels (right); clusters ABC, ABDE, BCE, CDEF, FGH, FGI, GHIJ]
51Join-graph decompositions
[Figure: a) minimal arc-labeled join-graph; b) the join-graph obtained by collapsing nodes of graph a) (e.g., merging into ABCDE); c) the minimal arc-labeled version of b)]
52Tree decomposition
[Figure: a) a minimal arc-labeled join-graph; b) a tree decomposition with clusters ABCDE, CDEF, FGHI, GHIJ]
53Join-graphs
[Figure: a spectrum of join-graph decompositions, ranging from the dual graph used by IBP to a join-tree; moving toward larger clusters gives more accuracy, moving toward smaller clusters gives less complexity]
54Message propagation
[Figure: message propagation on the join-graph. Cluster 1 = {A,B,C,D,E} contains p(a), p(c), p(b|a,c), p(d|a,b,e), p(e|b,c) and the incoming message h_(3,1)(b,c); it sends h_(1,2) to cluster 2 = {C,D,E,F}.
Minimal arc-labeled: sep(1,2) = {D,E}, elim(1,2) = {A,B,C}
Non-minimal arc-labeled: sep(1,2) = {C,D,E}, elim(1,2) = {A,B}]
55Bounded decompositions
- We want arc-labeled decompositions such that
  - the cluster size (internal width) is bounded by i (the accuracy parameter)
  - the width of the decomposition as a graph (external width) is as small as possible
- Possible approaches to building decompositions
  - partition-based algorithms, inspired by the mini-bucket decomposition
  - grouping-based algorithms
56Partition-based algorithms
[Figure: a) schematic mini-bucket(i), i = 3; b) the corresponding arc-labeled join-graph decomposition with clusters GFE: P(G|F,E), EBF: P(E|B,F), FCD: P(F|C,D), CDB: P(D|B), CAB: P(C|A,B), BA: P(B|A), A: P(A), connected by arcs labeled with shared variables such as EF, BF, F, CD, CB, B, A]
57IJGP properties
- IJGP(i) applies BP to a minimal arc-labeled join-graph whose cluster size is bounded by i
- On join-trees, IJGP finds exact beliefs
- IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman, Weiss 2001)
- Complexity of one iteration:
  - time: O(deg · (n+N) · d^(i+1))
  - space: O(N · d^θ), where θ is the maximum arc-label size
58Empirical evaluation
- Measures
- Absolute error
- Relative error
- Kullback-Leibler (KL) distance
- Bit Error Rate
- Time
- Algorithms
- Exact
- IBP
- MC
- IJGP
- Networks (all variables are binary)
- Random networks
- Grid networks (MxM)
- CPCS 54, 360, 422
- Coding networks
59Random networks - KL at convergence
evidence = 0
evidence = 5
60Random networks - KL vs. iterations
evidence = 0
evidence = 5
61Random networks - Time
62Coding networks - BER
sigma = 0.22
sigma = 0.32
sigma = 0.51
sigma = 0.65
63Coding networks - Time
64IJGP summary
- IJGP borrows the iterative feature from IBP and the anytime virtues of bounded inference from MC
- Empirical evaluation showed the potential of IJGP, which improves with iterations and, most of the time, with the i-bound, and scales up to large networks
- IJGP is almost always superior, often by a large margin, to IBP and MC
- Based on all our experiments, we think that IJGP provides a practical breakthrough for the task of belief updating
65Heuristic search
- Mini-buckets record upper-bound heuristic functions
- The evaluation function over a partial assignment combines its exact cost so far with the recorded mini-bucket bound (f = g·H; see the following slides)
- Best-first: expand the node with the maximal evaluation function
- Branch and Bound: prune a node whose evaluation function f does not exceed the best solution found so far (see the sketch below)
- Properties
  - an exact algorithm
  - better heuristics lead to more pruning
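A compact, generic depth-first Branch-and-Bound sketch for MPE. Here value and f_bound are assumed callbacks: a full-assignment evaluator and an upper-bounding evaluation function, such as the mini-bucket f of the next slides; all names are choices made for this sketch.

```python
# Illustrative depth-first Branch-and-Bound over binary variables.
def branch_and_bound(variables, value, f_bound):
    best = {'val': 0.0, 'assign': None}

    def dfs(assign, depth):
        if depth == len(variables):
            v = value(assign)
            if v > best['val']:
                best['val'], best['assign'] = v, dict(assign)
            return
        x = variables[depth]
        for val in (0, 1):
            assign[x] = val
            # prune: an upper bound on any extension cannot beat the best so far
            if f_bound(assign) > best['val']:
                dfs(assign, depth + 1)
            del assign[x]

    dfs({}, 0)
    return best['val'], best['assign']
```

With a trivial bound (always 1.0) this degenerates to exhaustive search; a tighter mini-bucket bound prunes more of the tree.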
66Heuristic Function
Given a cost function

P(a,b,c,d,e) = P(a) · P(b|a) · P(c|a) · P(e|b,c) · P(d|a,b)

define an evaluation function over a partial assignment as the probability of its best extension:

[Figure: search tree assigning A, then E, then D, with 0/1 branches]

f(a,e,d) = max_{b,c} P(a,b,c,d,e)
         = P(a) · max_{b,c} P(b|a) P(c|a) P(e|b,c) P(d|a,b)
         = g(a,e,d) · H(a,e,d)
67Heuristic Function
H(a,e,d) can be relaxed by mini-buckets:

max_{b,c} P(b|a) P(c|a) P(e|b,c) P(d|a,b)
  = max_c P(c|a) · max_b P(e|b,c) P(b|a) P(d|a,b)
  ≤ max_c P(c|a) · max_b P(e|b,c) · max_b P(b|a) P(d|a,b)  =  H(a,e,d)

so that f(a,e,d) = g(a,e,d) · H(a,e,d) ≥ the probability of the best extension of (a,e,d).

The heuristic function H is compiled during the preprocessing stage of the Mini-Bucket algorithm.
68Heuristic Function
The evaluation function f(x^p) can be computed using the functions recorded by the Mini-Bucket scheme, and can be used to estimate the probability of the best extension of the partial assignment x^p = (x_1, ..., x_p):

f(x^p) = g(x^p) · H(x^p)

For example, the mini-bucket trace is:

bucket B:  max_b P(e|b,c)  |  max_b P(d|a,b) P(b|a)   ->  h^B(e,c), h^B(d,a)
bucket C:  max_c P(c|a) h^B(e,c)                      ->  h^C(e,a)
bucket D:  max_d h^B(d,a)                             ->  h^D(a)
bucket E:  max_e h^C(e,a)                             ->  h^E(a)
bucket A:  max_a P(a) h^E(a) h^D(a)

H(a,e,d) = h^B(d,a) · h^C(e,a)
g(a,e,d) = P(a)

A sketch of this computation follows.
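The sketch below uses made-up CPTs: the functions h^B and h^C are compiled once by maximization, and f = g · H is then checked to never underestimate the probability of the best extension.

```python
# Illustrative mini-bucket evaluation function f(a,e,d) = g(a,e,d) * H(a,e,d).
from itertools import product

P_a  = {0: 0.4, 1: 0.6}
P_ba = {(b, a): v for (b, a), v in zip(product((0, 1), repeat=2), [0.7, 0.2, 0.3, 0.8])}
P_ca = {(c, a): v for (c, a), v in zip(product((0, 1), repeat=2), [0.5, 0.9, 0.5, 0.1])}
P_ebc = {(e, b, c): v for (e, b, c), v in
         zip(product((0, 1), repeat=3), [0.6, 0.3, 0.2, 0.9, 0.4, 0.7, 0.8, 0.1])}
P_dab = {(d, a, b): v for (d, a, b), v in
         zip(product((0, 1), repeat=3), [0.5, 0.6, 0.3, 0.2, 0.5, 0.4, 0.7, 0.8])}

# functions compiled by the mini-bucket preprocessing (bucket B, then bucket C)
hB_ec = {(e, c): max(P_ebc[e, b, c] for b in (0, 1)) for e, c in product((0, 1), repeat=2)}
hB_da = {(d, a): max(P_dab[d, a, b] * P_ba[b, a] for b in (0, 1))
         for d, a in product((0, 1), repeat=2)}
hC_ea = {(e, a): max(P_ca[c, a] * hB_ec[e, c] for c in (0, 1))
         for e, a in product((0, 1), repeat=2)}

def f_mb(a, e, d):
    g = P_a[a]                               # fully instantiated original functions
    H = hB_da[d, a] * hC_ea[e, a]            # compiled mini-bucket functions
    return g * H

for a, e, d in product((0, 1), repeat=3):
    exact = P_a[a] * max(P_ba[b, a] * P_ca[c, a] * P_ebc[e, b, c] * P_dab[d, a, b]
                         for b, c in product((0, 1), repeat=2))
    assert f_mb(a, e, d) >= exact            # admissible: never underestimates
```

A function like f_mb is the kind of bound that the Branch-and-Bound sketch earlier would use for pruning.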
69Properties
- The heuristic is monotone
- The heuristic is admissible
- The heuristic is computed in linear time
- IMPORTANT:
  - Mini-buckets generate heuristics of varying strength using the control parameter (bound) i
  - Higher bound -> more preprocessing -> stronger heuristics -> less search
  - Allows a controlled trade-off between preprocessing and search
70Empirical Evaluation of mini-bucket heuristics