Transcript and Presenter's Notes

Title: Bayesian Networks: Variable Elimination


1
Bayesian Networks: Variable Elimination
Based on Nir Friedman's course (Hebrew University)
2
  • In previous lessons we introduced compact
    representations of probability distributions
  • Bayesian Networks
  • A network describes a unique probability
    distribution P
  • How do we answer queries about P?
  • The process of computing answers to these queries
    is called probabilistic inference

3
Queries: Likelihood
  • There are many types of queries we might ask.
  • Most of these involve evidence
  • Evidence e is an assignment of values to a set E of variables in the domain
  • Without loss of generality, E = { Xk+1, ..., Xn }
  • Simplest query: compute the probability of the evidence, P(e)
  • This is often referred to as computing the likelihood of the evidence

4
Queries: A posteriori belief
  • Often we are interested in the conditional probability of a variable given the evidence, P(X | e)
  • This is the a posteriori belief in X, given evidence e
  • A related task is computing the term P(X, e)
  • i.e., the likelihood of e and X = x for each value x of X
  • We can recover the a posteriori belief by normalizing: P(X | e) = P(X, e) / P(e)

5
A posteriori belief
  • This query is useful in many cases:
  • Prediction: what is the probability of an outcome given the starting condition?
  • The target is a descendant of the evidence
  • Diagnosis: what is the probability of a disease/fault given the symptoms?
  • The target is an ancestor of the evidence
  • As we shall see, the direction of the edges between variables does not restrict the direction of the queries
  • Probabilistic inference can combine evidence from all parts of the network

6
Queries: A posteriori joint
  • In this query, we are interested in the conditional probability of several variables given the evidence, P(X, Y, ... | e)
  • Note that the size of the answer to this query is exponential in the number of variables in the joint

7
Queries: MAP
  • In this query we want to find the maximum a posteriori assignment for some variables of interest (say X1, ..., Xl)
  • That is, find x1, ..., xl that maximize the probability P(x1, ..., xl | e)
  • Note that this is equivalent to maximizing P(x1, ..., xl, e) (spelled out below)
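In symbols, since P(e) is a positive constant that does not depend on the choice of x1, ..., xl:

\arg\max_{x_1,\dots,x_l} P(x_1,\dots,x_l \mid e)
  \;=\; \arg\max_{x_1,\dots,x_l} \frac{P(x_1,\dots,x_l,\, e)}{P(e)}
  \;=\; \arg\max_{x_1,\dots,x_l} P(x_1,\dots,x_l,\, e)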

8
Queries: MAP
  • We can use MAP for:
  • Classification
  • find the most likely label, given the evidence
  • Explanation
  • What is the most likely scenario, given the evidence?

9
Queries: MAP
  • Cautionary note:
  • The MAP assignment depends on the set of variables we maximize over
  • Example:
  • MAP of X is 1,
  • MAP of (X, Y) is (0, 0)

10
Complexity of Inference
  • Theorem:
  • Computing P(X = x) in a Bayesian network is NP-hard
  • Not surprising, since we can simulate Boolean gates.

11
Proof
  • We reduce 3-SAT to Bayesian network computation
  • Assume we are given a 3-SAT problem:
  • let q1, ..., qn be propositions,
  • and Φ1, ..., Φk be clauses, such that Φi = li1 ∨ li2 ∨ li3, where each lij is a literal over q1, ..., qn
  • Φ = Φ1 ∧ ... ∧ Φk
  • We will construct a network s.t. P(X = t) > 0 iff Φ is satisfiable

12
[Figure: proposition nodes Q1, ..., Qn; each clause node Φ1, ..., Φk has the propositions of its literals as parents; AND-gate nodes A1, A2, ... combine the clause nodes and feed the output node X]
  • P(Qi = true) = 0.5
  • P(Φi = true | Qi, Qj, Ql) = 1 iff the values of Qi, Qj, Ql satisfy the clause Φi
  • A1, A2, ... are simple binary AND gates

13
  • It is easy to check:
  • Polynomial number of variables
  • Each CPD can be described by a small table (at most 8 parameters)
  • P(X = true) > 0 if and only if there exists a satisfying assignment to Q1, ..., Qn
  • Conclusion: this is a polynomial reduction from 3-SAT

14
  • Note: this construction also shows that computing P(X = t) is harder than NP
  • 2^n · P(X = t) is the number of satisfying assignments to Φ
  • Thus, it is #P-hard (in fact it is #P-complete)

15
Hardness - Notes
  • We used deterministic relations in our construction
  • The same construction works if we use (1-ε, ε) instead of (1, 0) in each gate, for any ε < 0.5
  • Hardness does not mean we cannot solve inference
  • It implies that we cannot find a general procedure that works efficiently for all networks
  • For particular families of networks, we can have provably efficient procedures

16
Inference in Simple Chains
[Figure: chain X1 → X2]
  • How do we compute P(X2)?
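Spelled out, this is just marginalization over X1:

P(x_2) \;=\; \sum_{x_1} P(x_1, x_2) \;=\; \sum_{x_1} P(x_1)\, P(x_2 \mid x_1)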

17
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2 → X3]
  • How do we compute P(X3)?
  • we already know how to compute P(X2)...

18
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2 → ... → Xn]
  • How do we compute P(Xn)?
  • Compute P(X1), P(X2), P(X3), ... in turn
  • We compute each term by using the previous one (a short code sketch follows below)
  • Complexity:
  • Each step costs O(|Val(Xi)| · |Val(Xi+1)|) operations
  • Compare to naive evaluation, which requires summing over the joint values of n-1 variables
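A minimal sketch of this forward pass, assuming each CPD is stored as a nested dictionary; the function and variable names here are illustrative, not from the slides:

from typing import Dict, List

def chain_forward(prior_x1: Dict, cpds: List[Dict]) -> Dict:
    """Compute the marginal P(Xn) for a chain X1 -> X2 -> ... -> Xn.

    prior_x1: {value: P(X1 = value)}
    cpds[i]:  {parent_value: {child_value: P(X_{i+2} = child | X_{i+1} = parent)}}
    """
    belief = dict(prior_x1)                       # current marginal P(Xi)
    for cpd in cpds:                              # one O(|Val|^2) step per edge
        child_vals = next(iter(cpd.values())).keys()
        belief = {c: sum(belief[p] * cpd[p][c] for p in belief)
                  for c in child_vals}            # P(X_{i+1}) from P(X_i)
    return belief

# Example: binary chain X1 -> X2 -> X3
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}
print(chain_forward(p_x1, [p_x2_given_x1, p_x3_given_x2]))   # marginal P(X3)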

19
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2]
  • Suppose that we observe the value X2 = x2
  • How do we compute P(X1 | x2)?
  • Recall that it suffices to compute P(X1, x2)

20
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2 → X3]
  • Suppose that we observe the value X3 = x3
  • How do we compute P(X1, x3)?
  • How do we compute P(x3 | x1)?

21
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2 → X3 → ... → Xn]
  • Suppose that we observe the value Xn = xn
  • How do we compute P(X1, xn)?

22
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2 → X3 → ... → Xn]
  • We compute P(xn | xn-1), P(xn | xn-2), ... iteratively (the backward pass)

23
Inference in Simple Chains (cont.)
[Figure: chain X1 → ... → Xk → ... → Xn]
  • Suppose that we observe the value Xn = xn
  • We want to find P(Xk | xn)
  • How do we compute P(Xk, xn)?
  • We compute P(Xk) by forward iterations
  • We compute P(xn | Xk) by backward iterations (the two are put together below)
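Putting the two passes together (using the chain rule and the fact that, in a chain, xn depends on X1, ..., Xk only through Xk):

P(X_k, x_n) \;=\; P(X_k)\, P(x_n \mid X_k), \qquad
P(X_k \mid x_n) \;=\; \frac{P(X_k, x_n)}{\sum_{x_k} P(x_k, x_n)}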

24
Elimination in Chains
  • We now revisit the simple chain example from first principles
  • Using the definition of marginal probability, we have P(e) = Σa Σb Σc Σd P(a, b, c, d, e)

25
Elimination in Chains
  • By the chain decomposition of the network, we get P(a, b, c, d, e) = P(a) P(b|a) P(c|b) P(d|c) P(e|d)
[Figure: chain A → B → C → D → E]
26
Elimination in Chains
  • Rearranging terms so that each sum applies only to the factors that mention its variable, we get the nested form written out below:
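A reconstruction of the missing equation, assuming the chain A → B → C → D → E shown in the figures:

P(e) \;=\; \sum_{a,b,c,d} P(a)\,P(b \mid a)\,P(c \mid b)\,P(d \mid c)\,P(e \mid d)
     \;=\; \sum_d P(e \mid d) \sum_c P(d \mid c) \sum_b P(c \mid b) \sum_a P(a)\,P(b \mid a)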

27
Elimination in Chains
[Figure: chain A → B → C → D → E with the innermost variable A summed out]
  • Now we can perform the innermost summation (over a)
  • This summation is exactly the first step in the forward iteration we described before

28
Elimination in Chains
  • Rearranging and then summing again, we get

[Figure: chain A → B → C → D → E with A and B summed out]
29
Elimination in Chains with Evidence
  • Similarly, we can understand the backward pass
  • We write the query in explicit form (reconstructed below):
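A reconstruction of the explicit form of the query P(A, e) on the same chain, consistent with the elimination steps on the following slides (d, then c, then b are summed out innermost-first):

P(A, e) \;=\; \sum_{b,c,d} P(A)\,P(b \mid A)\,P(c \mid b)\,P(d \mid c)\,P(e \mid d)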

30
Elimination in Chains with Evidence
  • Eliminating d, we get

[Figure: chain A → B → C → D → E with D summed out]
31
Elimination in Chains with Evidence
  • Eliminating c, we get

[Figure: chain A → B → C → D → E with C and D summed out]
32
Elimination in Chains with Evidence
  • Finally, we eliminate b

[Figure: chain A → B → C → D → E with B, C, and D summed out]
33
Variable Elimination
  • General idea:
  • Write the query as nested sums over a product of factors (the CPDs)
  • Iteratively:
  • Move all irrelevant terms outside of the innermost sum
  • Perform the innermost sum, getting a new term
  • Insert the new term into the product
  • (A code sketch of this loop follows below)
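A minimal sketch of this loop, with each factor stored as a pair (variables, table); the representation and the names used here are illustrative assumptions, not the slides' notation. It assumes the elimination order covers every non-query variable:

from itertools import product

def variable_elimination(factors, query_vars, elim_order, domains):
    """factors: list of (vars_tuple, {assignment_tuple: value}) pairs."""
    for z in elim_order:                            # eliminate one variable per step
        relevant = [f for f in factors if z in f[0]]
        others   = [f for f in factors if z not in f[0]]
        new_vars = tuple(dict.fromkeys(v for f in relevant
                                         for v in f[0] if v != z))
        table = {}
        for assign in product(*(domains[v] for v in new_vars)):
            ctx, total = dict(zip(new_vars, assign)), 0.0
            for zval in domains[z]:                 # innermost sum over z
                ctx[z] = zval
                term = 1.0
                for fvars, ftab in relevant:        # product of factors mentioning z
                    term *= ftab[tuple(ctx[v] for v in fvars)]
                total += term
            table[assign] = total
        factors = others + [(new_vars, table)]      # insert the new term
    result = {}                                     # multiply remaining factors
    for assign in product(*(domains[v] for v in query_vars)):
        ctx, val = dict(zip(query_vars, assign)), 1.0
        for fvars, ftab in factors:
            val *= ftab[tuple(ctx[v] for v in fvars)]
        result[assign] = val
    return result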

34
A More Complex Example
  • Asia network

35
[Figure: the Asia network — V → T, S → L, S → B, T and L → A, A → X, A and B → D]
  • We want to compute P(d)
  • Need to eliminate v, s, x, t, l, a, b
  • Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
36
  • We want to compute P(d)
  • Need to eliminate v,s,x,t,l,a,b
  • Initial factors

Eliminate v: fv(t) = Σv P(v) P(t|v)
Note: fv(t) = P(t). In general, the result of elimination is not necessarily a probability term.
37
  • We want to compute P(d)
  • Need to eliminate s,x,t,l,a,b
  • Initial factors

Eliminate s: fs(b,l) = Σs P(s) P(b|s) P(l|s)
Summing over s results in a factor with two arguments, fs(b,l). In general, the result of elimination may be a function of several variables.
38
  • We want to compute P(d)
  • Need to eliminate x,t,l,a,b
  • Initial factors

Eliminate x: fx(a) = Σx P(x|a)
Note: fx(a) = 1 for all values of a, since we are summing a CPD over all values of its child!
39
  • We want to compute P(d)
  • Need to eliminate t,l,a,b
  • Initial factors

Eliminate t: ft(a,l) = Σt fv(t) P(a|t,l)
40
  • We want to compute P(d)
  • Need to eliminate l,a,b
  • Initial factors

Eliminate l: fl(a,b) = Σl fs(b,l) ft(a,l)
41
  • We want to compute P(d)
  • Need to eliminate a, b
  • Initial factors
Eliminate a, b: compute the remaining sums, over a and then over b, to obtain P(d)
42
Variable Elimination
  • We now understand variable elimination as a sequence of rewriting operations
  • The actual computation is done in the elimination steps
  • The computation depends on the order of elimination
  • We will return to this issue in detail

43
Dealing with evidence
  • How do we deal with evidence?
  • Suppose we get evidence V = t, S = f, D = t
  • We want to compute P(L, V = t, S = f, D = t)

44
Dealing with Evidence
  • We start by writing the factors
  • Since we know that V = t, we don't need to eliminate V
  • Instead, we can replace the factors P(V) and P(T|V) with their restrictions to V = t
  • These select the appropriate parts of the original factors given the evidence (a code sketch follows below)
  • Note that the restriction of P(V) is a constant, and thus does not appear in the elimination of other variables
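A sketch of this restriction step, using the same illustrative (variables, table) factor representation as above; restrict_factor and its arguments are assumptions for illustration:

def restrict_factor(fvars, ftable, evidence):
    """Drop observed variables from a factor, keeping only consistent rows."""
    keep = tuple(v for v in fvars if v not in evidence)
    new_table = {}
    for assign, val in ftable.items():
        ctx = dict(zip(fvars, assign))
        if all(ctx[v] == e for v, e in evidence.items() if v in ctx):
            new_table[tuple(ctx[v] for v in keep)] = val
    return keep, new_table

# e.g. restricting P(T|V) to the evidence V = t yields a factor over T alone,
# and restricting P(V) yields a constant (a factor with no arguments).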

45
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting evidence

46
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting evidence
  • Eliminating x, we get

47
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting evidence
  • Eliminating x, we get
  • Eliminating t, we get

48
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting evidence
  • Eliminating x, we get
  • Eliminating t, we get
  • Eliminating a, we get

49
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting evidence
  • Eliminating x, we get
  • Eliminating t, we get
  • Eliminating a, we get
  • Eliminating b, we get

50
Complexity of variable elimination
  • Suppose in one elimination step we compute a new factor fX(y1, ..., yk) by summing x out of a product of m factors (reconstructed below)
  • This requires:
  • multiplications:
  • for each value of x, y1, ..., yk, we do m multiplications
  • additions:
  • for each value of y1, ..., yk, we do |Val(X)| additions
  • The complexity is exponential in the number of variables in the intermediate factor.
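A reconstruction of the step being counted, assuming the standard form of an elimination step (each fi is a factor currently mentioning X):

f_X(y_1,\dots,y_k) \;=\; \sum_{x} \prod_{i=1}^{m} f_i(x,\, \mathbf{Z}_i),
\qquad \mathbf{Z}_i \subseteq \{y_1,\dots,y_k\}

so the step takes about m \cdot |Val(X)| \cdot \prod_j |Val(Y_j)| multiplications and |Val(X)| \cdot \prod_j |Val(Y_j)| additions.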

51
Understanding Variable Elimination
  • We want to select "good" elimination orderings that reduce complexity
  • We start by attempting to understand variable elimination via the graph we are working with
  • This will reduce the problem of finding a good ordering to a well-understood graph-theoretic operation

52
Undirected graph representation
  • At each stage of the procedure, we have an algebraic term that we need to evaluate
  • In general this term is of the form Σy1 ··· Σyj Πi fi(Zi), where the Zi are sets of variables
  • We now build an undirected graph with an edge X--Y whenever X and Y are arguments of some common factor
  • that is, whenever X and Y appear together in some Zi (a code sketch follows below)
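A small sketch of building this undirected interaction graph from the factor scopes; names are illustrative assumptions:

from itertools import combinations

def interaction_graph(scopes):
    """scopes: iterable of variable collections Zi (one per factor).
    Returns adjacency dict: variable -> set of neighbours."""
    adj = {}
    for scope in scopes:
        for v in scope:
            adj.setdefault(v, set())
        for x, y in combinations(scope, 2):    # connect every pair in a factor
            adj[x].add(y)
            adj[y].add(x)
    return adj

# For the initial factors of a Bayesian network this is exactly the moral graph:
# each child is connected to its parents, and parents of a common child are "married".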

53
Undirected Graph Representation
  • Consider the Asia example
  • The initial factors are the CPDs P(v), P(s), P(t|v), P(l|s), P(b|s), P(a|t,l), P(x|a), P(d|a,b)
  • Thus, the undirected graph is the one shown below
  • In the first step this graph is just the moralized graph

[Figures: the Asia network and its undirected (moralized) graph over V, S, T, L, A, B, X, D]
54
Undirected Graph Representation
  • Now we eliminate t, getting
  • The corresponding change in the graph is

[Figures: the undirected graph before and after eliminating t; the neighbours of t become connected]
55
Example
  • Want to compute P(L, V = t, S = f, D = t)
  • Moralizing

[Figure: moralized Asia graph over V, S, T, L, A, B, X, D]
56
Example
  • Want to compute P(L, V = t, S = f, D = t)
  • Moralizing
  • Setting evidence

57
Example
  • Want to compute P(L, V = t, S = f, D = t)
  • Moralizing
  • Setting evidence
  • Eliminating x
  • New factor fx(A)

58
Example
  • Want to compute P(L, V = t, S = f, D = t)
  • Moralizing
  • Setting evidence
  • Eliminating x
  • Eliminating a
  • New factor fa(b,t,l)

59
Example
  • Want to compute P(L, V = t, S = f, D = t)
  • Moralizing
  • Setting evidence
  • Eliminating x
  • Eliminating a
  • Eliminating b
  • New factor fb(t,l)

60
Example
  • Want to compute P(L, V = t, S = f, D = t)
  • Moralizing
  • Setting evidence
  • Eliminating x
  • Eliminating a
  • Eliminating b
  • Eliminating t
  • New factor ft(l)

61
Elimination in Undirected Graphs
  • Generalizing, we see that we can eliminate a variable X by:
  • 1. For all Y, Z such that Y--X and Z--X,
  • add an edge Y--Z
  • 2. Remove X and all edges adjacent to it
  • This procedure creates a clique that contains all the neighbours of X
  • After step 1 we have a clique that corresponds to the intermediate factor (before marginalization)
  • The cost of the step is exponential in the size of this clique (a code sketch follows below)
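The same two steps in code, operating on the adjacency dict sketched earlier (an illustrative sketch, not the slides' implementation):

from itertools import combinations

def eliminate_node(adj, x):
    """Connect all neighbours of x pairwise (step 1), then remove x (step 2)."""
    neighbours = list(adj[x])
    for y, z in combinations(neighbours, 2):   # step 1: make the neighbours a clique
        adj[y].add(z)
        adj[z].add(y)
    for y in neighbours:                       # step 2: delete x and its incident edges
        adj[y].discard(x)
    del adj[x]
    return len(neighbours) + 1                 # size of the clique just created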

62
Undirected Graphs
  • The process of eliminating nodes from an
    undirected graph gives us a clue to the
    complexity of inference
  • To see this, we will examine the graph that
    contains all of the edges we added during the
    elimination. The resulting graph is always
    chordal.

63
Example
  • Want to compute P(D)
  • Moralizing

[Figure: moralized Asia graph over V, S, T, L, A, B, X, D]
64
Example
  • Want to compute P(D)
  • Moralizing
  • Eliminating v
  • Multiply to get fv(v,t)
  • Result fv(t)

65
Example
  • Want to compute P(D)
  • Moralizing
  • Eliminating v
  • Eliminating x
  • Multiply to get fx(a,x)
  • Result fx(a)

66
Example
  • Want to compute P(D)
  • Moralizing
  • Eliminating v
  • Eliminating x
  • Eliminating s
  • Multiply to get fs(l,b,s)
  • Result fs(l,b)

67
Example
  • Want to compute P(D)
  • Moralizing
  • Eliminating v
  • Eliminating x
  • Eliminating s
  • Eliminating t
  • Multiply to get ft(a,l,t)
  • Result ft(a,l)

68
Example
  • Want to compute P(D)
  • Moralizing
  • Eliminating v
  • Eliminating x
  • Eliminating s
  • Eliminating t
  • Eliminating l
  • Multiply to get fl(a,b,l)
  • Result fl(a,b)

69
Example
  • Want to compute P(D)
  • Moralizing
  • Eliminating v
  • Eliminating x
  • Eliminating s
  • Eliminating t
  • Eliminating l
  • Eliminating a, b
  • Multiply to get fa(a,b,d)
  • Result f(d)

70
Expanded Graphs
[Figure: the induced graph over V, S, T, L, A, B, X, D]
  • The resulting graph is the induced graph (for this particular ordering)
  • Main property:
  • Every maximal clique in the induced graph corresponds to an intermediate factor in the computation
  • Every factor stored during the process is a subset of some maximal clique in the graph
  • These facts are true for any variable elimination ordering on any network

71
Induced Width (Treewidth)
  • The size of the largest clique in the induced graph is thus an indicator of the complexity of variable elimination
  • This quantity (minus one) is called the induced width of the graph for the specified ordering; the minimum over all orderings is the treewidth
  • Finding a good ordering for a graph means finding an ordering whose induced width is as close as possible to the treewidth (a sketch for measuring induced width follows below)
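A sketch that measures the induced width of a given ordering by replaying the elimination on a copy of the graph; it reuses the eliminate_node helper sketched above:

import copy

def induced_width(adj, ordering):
    """Largest number of neighbours seen at elimination time (= largest clique - 1)."""
    g = copy.deepcopy(adj)
    width = 0
    for x in ordering:
        width = max(width, len(g[x]))          # neighbours of x when it is eliminated
        eliminate_node(g, x)
    return width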

72
Consequence: Elimination on Trees
  • Suppose we have a tree
  • A network where each variable has at most one
    parent
  • All the factors involve at most two variables
  • Thus, the moralized graph is also a tree

[Figures: a tree network and its moralized graph]
73
Elimination on Trees
  • We can maintain the tree structure by repeatedly eliminating leaf (extreme) variables of the tree

[Figures: the tree shrinking as leaf variables are eliminated]
74
Elimination on Trees
  • Formally, for any tree, there is an elimination
    ordering with treewidth 1
  • Theorem
  • Inference on trees is linear in the number of variables

75
PolyTrees
  • A polytree is a network where there is at most
    one path from one variable to another
  • Theorem
  • Inference in a polytree is linear in the
    representation size of the network
  • This assumes tabular CPT representation
  • Can you see how the argument would work?

[Figure: a polytree]
76
General Networks
  • What do we do when the network is not a polytree?
  • If the network has a cycle, the induced width of any ordering is greater than 1

77
Example
  • Eliminating A, B, C, D, E, ...
  • The resulting graph is chordal with treewidth 2

[Figures: the induced graphs produced by eliminating A, B, C, D, E, ... from the network over A-H]
78
Example
  • Eliminating H, G, E, C, F, D, B, A

[Figures: the induced graphs produced by this ordering]
79
General Networks
  • From graph theory:
  • Theorem:
  • Finding an ordering that minimizes the treewidth is NP-hard
  • However,
  • there are reasonable heuristics for finding relatively good orderings (one is sketched below)
  • there are provable approximations to the best treewidth
  • if the graph has a small treewidth, there are algorithms that find it in polynomial time
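One standard greedy heuristic (min-fill) as a sketch, reusing the adjacency-dict helpers above; this is an illustration of such a heuristic, not necessarily the one the slides have in mind:

import copy

def min_fill_ordering(adj):
    """Greedy ordering: repeatedly eliminate the node that adds the fewest fill edges."""
    g = copy.deepcopy(adj)
    order = []
    while g:
        def fill_cost(v):
            nbrs = list(g[v])
            return sum(1 for i, y in enumerate(nbrs)
                         for z in nbrs[i + 1:] if z not in g[y])
        best = min(g, key=fill_cost)           # node whose elimination adds least fill-in
        order.append(best)
        eliminate_node(g, best)
    return order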