Bayesian Networks: Variable Elimination
Based on Nir Friedman's course (Hebrew University)
- In previous lessons we introduced compact representations of probability distributions: Bayesian networks
- A network describes a unique probability distribution P
- How do we answer queries about P?
- The process of computing answers to these queries is called probabilistic inference
Queries: Likelihood
- There are many types of queries we might ask
- Most of these involve evidence
- Evidence e is an assignment of values to a set E of variables in the domain
- Without loss of generality, E = {X_{k+1}, ..., X_n}
- Simplest query: compute the probability of the evidence, P(e) = Σ_{x1,...,xk} P(x1, ..., xk, e)
- This is often referred to as computing the likelihood of the evidence
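The likelihood of evidence is just a sum of joint probabilities over the unobserved variables. A minimal brute-force sketch (the two-variable network and its numbers are illustrative, not from the slides):

```python
from itertools import product

def likelihood(joint_fn, domains, evidence):
    """P(e): sum the joint over all assignments consistent with the evidence."""
    total = 0.0
    names = list(domains)
    for vals in product(*(domains[n] for n in names)):
        a = dict(zip(names, vals))
        if all(a[v] == val for v, val in evidence.items()):
            total += joint_fn(a)
    return total

# Joint over a tiny network X1 -> X2: P(x1, x2) = P(x1) P(x2 | x1).
px1 = {0: 0.6, 1: 0.4}
px2_given = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}
joint = lambda a: px1[a["X1"]] * px2_given[(a["X1"], a["X2"])]

print(likelihood(joint, {"X1": (0, 1), "X2": (0, 1)}, {"X2": 1}))  # 0.38
```

This enumerates the full joint, so it is exponential in the number of unobserved variables; the rest of the lecture is about doing better.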
Queries: A Posteriori Belief
- Often we are interested in the conditional probability P(X | e) of a variable given the evidence
- This is the a posteriori belief in X, given evidence e
- A related task is computing the term P(X = x, e)
- i.e., the likelihood of e and X = x, for each value x of X
- We can recover the a posteriori belief by normalizing: P(X = x | e) = P(X = x, e) / Σ_{x'} P(X = x', e)
A Posteriori Belief
- This query is useful in many cases
- Prediction: what is the probability of an outcome given the starting condition?
- The target is a descendant of the evidence
- Diagnosis: what is the probability of a disease/fault given symptoms?
- The target is an ancestor of the evidence
- As we shall see, the direction of the edges between variables does not restrict the direction of the queries
- Probabilistic inference can combine evidence from all parts of the network
Queries: A Posteriori Joint
- In this query, we are interested in the conditional probability of several variables given the evidence: P(X, Y, ... | e)
- Note that the size of the answer to this query is exponential in the number of variables in the joint
Queries: MAP
- In this query we want to find the maximum a posteriori assignment for some set of variables of interest (say X1, ..., Xl)
- That is, find x1, ..., xl that maximize the probability P(x1, ..., xl | e)
- Note that this is equivalent to maximizing P(x1, ..., xl, e)
Queries: MAP
- We can use MAP for:
- Classification: find the most likely label, given the evidence
- Explanation: what is the most likely scenario, given the evidence?
Queries: MAP
- Cautionary note: the MAP depends on the set of variables queried
- Example: a distribution where
- the MAP of X is 1, yet
- the MAP of (X, Y) is (0, 0)
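The slide's distribution table did not survive extraction; the numbers below are an illustrative assumption chosen to reproduce the stated phenomenon, where the marginal MAP of X disagrees with the joint MAP of (X, Y):

```python
# Illustrative joint over binary X, Y (assumed values, not the slide's
# original table): P(X=1) = 0.6, so the marginal MAP of X is 1, yet the
# single most likely joint assignment is (X, Y) = (0, 0) with mass 0.4.
P = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}

# Marginal MAP of X: argmax_x sum_y P(x, y)
marg_x = {x: sum(P[(x, y)] for y in (0, 1)) for x in (0, 1)}
map_x = max(marg_x, key=marg_x.get)

# Joint MAP of (X, Y): argmax_{x,y} P(x, y)
map_xy = max(P, key=P.get)

print(map_x)   # 1
print(map_xy)  # (0, 0)
```

The point survives any such distribution: maximizing a marginal and maximizing the joint are different optimization problems.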
Complexity of Inference
- Theorem: computing P(X = x) in a Bayesian network is NP-hard
- Not surprising, since we can simulate Boolean gates
Proof
- We reduce 3-SAT to Bayesian network computation
- Assume we are given a 3-SAT instance:
- q1, ..., qn are propositions
- φ1, ..., φk are clauses, such that φi = li1 ∨ li2 ∨ li3, where each lij is a literal over q1, ..., qn
- φ = φ1 ∧ ... ∧ φk
- We will construct a network such that P(X = t) > 0 iff φ is satisfiable
[Network diagram: proposition nodes Q1, ..., Qn feed clause nodes φ1, ..., φk, which feed a chain of AND gates A1, A2, ..., A_{k/2-1}, ending in X]
- P(Qi = true) = 0.5
- P(φi = true | Qi, Qj, Ql) = 1 iff Qi, Qj, Ql satisfy the clause φi
- A1, A2, ... are simple binary AND gates
- It is easy to check:
- Polynomial number of variables
- Each CPD can be described by a small table (at most 8 parameters)
- P(X = true) > 0 if and only if there exists a satisfying assignment to Q1, ..., Qn
- Conclusion: a polynomial reduction of 3-SAT
- Note: this construction also shows that computing P(X = t) is harder than NP
- 2^n · P(X = t) is the number of satisfying assignments to φ
- Thus, it is #P-hard (in fact, it is #P-complete)
Hardness - Notes
- We used deterministic relations in our construction
- The same construction works if we use (1-ε, ε) instead of (1, 0) in each gate, for any ε < 0.5
- Hardness does not mean we cannot solve inference
- It implies that we cannot find a general procedure that works efficiently for all networks
- For particular families of networks, we can have provably efficient procedures
Inference in Simple Chains
[Chain: X1 → X2]
- How do we compute P(X2)? P(x2) = Σ_{x1} P(x1) P(x2 | x1)
Inference in Simple Chains (cont.)
[Chain: X1 → X2 → X3]
- How do we compute P(X3)?
- We already know how to compute P(X2), so P(x3) = Σ_{x2} P(x2) P(x3 | x2)
Inference in Simple Chains (cont.)
[Chain: X1 → X2 → ... → Xn]
- How do we compute P(Xn)?
- Compute P(X1), P(X2), P(X3), ... in order
- We compute each term by using the previous one: P(x_{i+1}) = Σ_{xi} P(xi) P(x_{i+1} | xi)
- Complexity:
- Each step costs O(|Val(Xi)| · |Val(X_{i+1})|) operations
- Compare to naïve evaluation, which requires summing over the joint values of n-1 variables
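The forward iteration above can be sketched in a few lines. This is a minimal sketch, not the lecture's notation: distributions are plain lists, and `cpds[i][x][y]` holds P(X_{i+2} = y | X_{i+1} = x).

```python
def chain_forward(prior, cpds):
    """Return P(Xn) given P(X1) and the chain's conditional tables."""
    p = prior
    for cpd in cpds:
        # P(x_next) = sum_x P(x) * P(x_next | x)  -- one forward step
        p = [sum(p[x] * cpd[x][y] for x in range(len(p)))
             for y in range(len(cpd[0]))]
    return p

# Usage: a 3-variable binary chain with illustrative numbers.
prior = [0.6, 0.4]                # P(X1)
cpd = [[0.9, 0.1], [0.2, 0.8]]    # P(X_{i+1} | X_i), reused at both steps
print(chain_forward(prior, [cpd, cpd]))  # [0.634, 0.366]
```

Each step touches only one |Val(Xi)| × |Val(X_{i+1})| table, matching the O(|Val(Xi)| · |Val(X_{i+1})|) cost per step noted above.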
Inference in Simple Chains (cont.)
[Chain: X1 → X2]
- Suppose that we observe the value X2 = x2
- How do we compute P(X1 | x2)?
- Recall that it suffices to compute P(X1, x2) = P(X1) P(x2 | X1), and then normalize
Inference in Simple Chains (cont.)
[Chain: X1 → X2 → X3]
- Suppose that we observe the value X3 = x3
- How do we compute P(X1, x3)?
- How do we compute P(x3 | x1)? P(x3 | x1) = Σ_{x2} P(x2 | x1) P(x3 | x2)
Inference in Simple Chains (cont.)
[Chain: X1 → X2 → X3 → ... → Xn]
- Suppose that we observe the value Xn = xn
- How do we compute P(X1, xn)?
Inference in Simple Chains (cont.)
[Chain: X1 → X2 → X3 → ... → Xn]
- We compute P(xn | x_{n-1}), P(xn | x_{n-2}), ... iteratively: P(xn | xi) = Σ_{x_{i+1}} P(x_{i+1} | xi) P(xn | x_{i+1})
Inference in Simple Chains (cont.)
[Chain: X1 → ... → Xk → ... → Xn]
- Suppose that we observe the value Xn = xn
- We want to find P(Xk | xn)
- How do we compute P(Xk, xn)? P(Xk, xn) = P(Xk) P(xn | Xk)
- We compute P(Xk) by forward iterations
- We compute P(xn | Xk) by backward iterations
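Combining the two passes can be sketched as follows (a sketch with illustrative tables, assuming 1 ≤ k < n; `cpds[i][x][y]` holds P(X_{i+2} = y | X_{i+1} = x)):

```python
def chain_posterior(prior, cpds, k, xn):
    """Return P(Xk | Xn = xn) for a chain, via forward and backward passes."""
    # Forward pass: fwd = P(Xk)
    fwd = prior
    for cpd in cpds[:k-1]:
        fwd = [sum(fwd[x] * cpd[x][y] for x in range(len(fwd)))
               for y in range(len(cpd[0]))]
    # Backward pass: back[x] = P(Xn = xn | Xk = x)
    back = None
    for cpd in reversed(cpds[k-1:]):
        if back is None:
            back = [cpd[x][xn] for x in range(len(cpd))]
        else:
            back = [sum(cpd[x][y] * back[y] for y in range(len(back)))
                    for x in range(len(cpd))]
    # P(Xk, xn) = P(Xk) * P(xn | Xk); normalize to get P(Xk | xn)
    joint = [f * b for f, b in zip(fwd, back)]
    z = sum(joint)
    return [j / z for j in joint]

# Usage: binary chain X1 -> X2 -> X3, observe X3 = 1, query X2.
prior = [0.6, 0.4]
cpd = [[0.9, 0.1], [0.2, 0.8]]
post = chain_posterior(prior, [cpd, cpd], k=2, xn=1)
print(post)
```

The normalizing constant z is exactly the likelihood P(xn) of the evidence, so both queries come out of one computation.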
Elimination in Chains
- We now try to understand the simple chain example from first principles
- Using the definition of probability, we have P(e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
Elimination in Chains
[Chain: A → B → C → D → E]
- By the chain rule decomposition, we get P(e) = Σ_d Σ_c Σ_b Σ_a P(a) P(b | a) P(c | b) P(d | c) P(e | d)
Elimination in Chains
- Rearranging so that each factor sits outside the sums it does not depend on: P(e) = Σ_d P(e | d) Σ_c P(d | c) Σ_b P(c | b) Σ_a P(a) P(b | a)
Elimination in Chains
[Chain: A → B → C → D → E; A is eliminated]
- Now we can perform the innermost summation: Σ_a P(a) P(b | a) = f(b) = P(b)
- This summation is exactly the first step in the forward iteration we described before
Elimination in Chains
- Rearranging and then summing again, we get Σ_b P(b) P(c | b) = f(c) = P(c)
[Chain: A → B → C → D → E; A and B eliminated]
Elimination in Chains with Evidence
- Similarly, we can understand the backward pass
- We write the query in explicit form: P(a, e) = P(a) Σ_b P(b | a) Σ_c P(c | b) Σ_d P(d | c) P(e | d)
Elimination in Chains with Evidence
[Chain: A → B → C → D → E; D eliminated]
- Performing the innermost summation (over d): Σ_d P(d | c) P(e | d) = f(c) = P(e | c)
Elimination in Chains with Evidence
- Summing again (over c): Σ_c P(c | b) P(e | c) = f(b) = P(e | b)
Elimination in Chains with Evidence
- Summing once more (over b): Σ_b P(b | a) P(e | b) = f(a) = P(e | a), so P(a, e) = P(a) P(e | a)
Variable Elimination
- General idea:
- Write the query as a sum over products of factors
- Iteratively:
- Move all irrelevant terms outside of the innermost sum
- Perform the innermost sum, getting a new term
- Insert the new term into the product
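The steps above can be sketched as a small factor-based implementation. This is a minimal sketch, not a production algorithm; a factor is a pair `(vars, table)` where `table` maps full assignments (tuples of values) to numbers, and the chain network at the bottom is illustrative.

```python
from itertools import product

def multiply(f1, f2, domains):
    """Pointwise product of two factors over the union of their variables."""
    vs = tuple(dict.fromkeys(f1[0] + f2[0]))  # order-preserving union
    table = {}
    for vals in product(*(domains[v] for v in vs)):
        a = dict(zip(vs, vals))
        table[vals] = (f1[1][tuple(a[v] for v in f1[0])]
                       * f2[1][tuple(a[v] for v in f2[0])])
    return vs, table

def sum_out(f, x):
    """Marginalize variable x out of factor f."""
    vs = tuple(v for v in f[0] if v != x)
    table = {}
    for vals, p in f[1].items():
        key = tuple(v for v, name in zip(vals, f[0]) if name != x)
        table[key] = table.get(key, 0.0) + p
    return vs, table

def eliminate(factors, order, domains):
    """Sum out the variables in `order`; return the product of what remains."""
    factors = list(factors)
    for x in order:
        related = [f for f in factors if x in f[0]]
        factors = [f for f in factors if x not in f[0]]
        prod = related[0]
        for f in related[1:]:
            prod = multiply(prod, f, domains)   # collect all factors mentioning x
        factors.append(sum_out(prod, x))        # new term enters the product
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, domains)
    return result

# Usage: chain A -> B -> C; compute P(C) by eliminating A then B.
domains = {"A": (0, 1), "B": (0, 1), "C": (0, 1)}
pa = (("A",), {(0,): 0.6, (1,): 0.4})
pb = (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
pc = (("B", "C"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.5})
print(eliminate([pa, pb, pc], ["A", "B"], domains))
```

Each loop iteration is one "perform the innermost sum, insert the new term" step from the slide.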
A More Complex Example
[The Asia network: V → T, S → L, S → B, T and L → A, A → X, A and B → D]
- We want to compute P(d)
- Need to eliminate v, s, x, t, l, a, b
- Initial factors: P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
- We want to compute P(d)
- Need to eliminate v, s, x, t, l, a, b
- Initial factors: P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
- Eliminate v, computing fv(t) = Σ_v P(v) P(t | v)
- Note: fv(t) = P(t). In general, the result of elimination is not necessarily a probability term
- We want to compute P(d)
- Need to eliminate s, x, t, l, a, b
- Current factors: fv(t) P(s) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
- Eliminate s, computing fs(b, l) = Σ_s P(s) P(b | s) P(l | s)
- Summing over s results in a factor with two arguments, fs(b, l). In general, the result of elimination may be a function of several variables
- We want to compute P(d)
- Need to eliminate x, t, l, a, b
- Current factors: fv(t) fs(b, l) P(a | t, l) P(x | a) P(d | a, b)
- Eliminate x, computing fx(a) = Σ_x P(x | a)
- Note: fx(a) = 1 for all values of a!
- We want to compute P(d)
- Need to eliminate t, l, a, b
- Current factors: fv(t) fs(b, l) P(a | t, l) fx(a) P(d | a, b)
- Eliminate t, computing ft(a, l) = Σ_t fv(t) P(a | t, l)
- We want to compute P(d)
- Need to eliminate l, a, b
- Current factors: fs(b, l) ft(a, l) fx(a) P(d | a, b)
- Eliminate l, computing fl(a, b) = Σ_l fs(b, l) ft(a, l)
- We want to compute P(d)
- Need to eliminate a, b
- Current factors: fl(a, b) fx(a) P(d | a, b)
- Eliminate a, computing fa(b, d) = Σ_a fl(a, b) fx(a) P(d | a, b)
- Finally, eliminate b, computing fb(d) = Σ_b fa(b, d); the result fb(d) is P(d)
Variable Elimination
- We now understand variable elimination as a sequence of rewriting operations
- The actual computation is done in the elimination steps
- The computation depends on the order of elimination
- We will return to this issue in detail
Dealing with Evidence
- How do we deal with evidence?
- Suppose we get evidence V = t, S = f, D = t
- We want to compute P(L, V = t, S = f, D = t)
Dealing with Evidence
- We start by writing the factors
- Since we know that V = t, we don't need to eliminate V
- Instead, we can replace the factors P(V) and P(T | V) with fP(V) = P(V = t) and fP(T|V)(T) = P(T | V = t)
- These select the appropriate parts of the original factors given the evidence
- Note that fP(V) is a constant, and thus does not appear in the elimination of other variables
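Restricting a factor to the evidence can be sketched as follows (same `(vars, table)` factor convention as before; the P(T | V) numbers are illustrative, not from the lecture):

```python
def restrict(factor, evidence):
    """Drop observed variables, keeping only rows consistent with the evidence."""
    vs, table = factor
    keep = tuple(v for v in vs if v not in evidence)
    new = {}
    for vals, p in table.items():
        a = dict(zip(vs, vals))
        if all(a[v] == val for v, val in evidence.items() if v in a):
            new[tuple(a[v] for v in keep)] = p
    return keep, new

# Usage: restrict an illustrative P(T | V) table to the evidence V = 1;
# the result is a factor over T alone.
p_t_given_v = (("V", "T"), {(0, 0): 0.99, (0, 1): 0.01,
                            (1, 0): 0.95, (1, 1): 0.05})
print(restrict(p_t_given_v, {"V": 1}))  # (('T',), {(0,): 0.95, (1,): 0.05})
```

After restricting every factor this way, elimination proceeds exactly as before, just over fewer variables.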
Dealing with Evidence
- Given evidence V = t, S = f, D = t
- Compute P(L, V = t, S = f, D = t)
- Initial factors, after setting evidence
- Eliminating x, we get fx(a)
- Eliminating t, we get ft(a, l)
- Eliminating a, we get fa(b, l)
- Eliminating b, we get fb(l)
Complexity of Variable Elimination
- Suppose in one elimination step we compute fx(y1, ..., yk) = Σ_x Π_{i=1..m} fi(x, y1, ..., yk)
- This requires:
- m · |Val(X)| · Π_i |Val(Yi)| multiplications: for each value of x, y1, ..., yk, we do m multiplications
- |Val(X)| · Π_i |Val(Yi)| additions: for each value of y1, ..., yk, we do |Val(X)| additions
- Complexity is exponential in the number of variables in the intermediate factor
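The counts above are easy to tabulate. A quick sketch with illustrative domain sizes (m factors over X, Y1, ..., Yk, summing out X):

```python
def elimination_cost(m, val_x, val_ys):
    """Return (multiplications, additions) for one elimination step."""
    cells = val_x
    for v in val_ys:
        cells *= v           # |Val(X)| * prod_i |Val(Yi)| product cells
    mults = m * cells        # m multiplications per cell of the product
    adds = cells             # each cell is accumulated into the result once
    return mults, adds

# Usage: m = 3 binary factors over X and two binary Yi.
print(elimination_cost(m=3, val_x=2, val_ys=[2, 2]))  # (24, 8)
```

Adding one more variable to the intermediate factor multiplies both counts by its domain size, which is the exponential blow-up the slide warns about.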
Understanding Variable Elimination
- We want to select "good" elimination orderings that reduce complexity
- We start by attempting to understand variable elimination via the graph we are working with
- This will reduce the problem of finding a good ordering to a well-understood graph-theoretic operation
Undirected Graph Representation
- At each stage of the procedure, we have an algebraic term that we need to evaluate
- In general this term is of the form Σ_{x1} ... Σ_{xk} Π_i fi(Zi), where the Zi are sets of variables
- We now plot a graph with an undirected edge X--Y if X and Y are arguments of some factor
- That is, if X and Y are in some Zi
Undirected Graph Representation
- Consider the Asia example
- The initial factors are P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
- Thus, the undirected graph connects every pair of variables that appear together in some factor
- In the first step, this graph is just the moralized graph
[Figures: the Asia network and its moral graph over V, S, T, L, A, B, X, D]
Undirected Graph Representation
- Now we eliminate t, getting ft(a, l)
- The corresponding change in the graph: remove t and connect all of t's neighbors to each other
[Figures: the moral graph before and after eliminating T]
Example
- Want to compute P(L, V = t, S = f, D = t)
- Moralizing
- Setting evidence
- Eliminating x: new factor fx(A)
- Eliminating a: new factor fa(b, t, l)
- Eliminating b: new factor fb(t, l)
- Eliminating t: new factor ft(l)
[Figures: the moral graph of the Asia network, updated after each step]
Elimination in Undirected Graphs
- Generalizing, we see that we can eliminate a variable X by:
- 1. For all Y, Z such that Y--X and Z--X, add an edge Y--Z
- 2. Remove X and all edges adjacent to it
- This procedure creates a clique that contains all the neighbors of X
- After step 1 we have a clique that corresponds to the intermediate factor (before marginalization)
- The cost of the step is exponential in the size of this clique
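The two-step elimination rule above is easy to simulate directly on an adjacency structure; tracking the largest clique touched gives the induced width discussed later. A minimal sketch:

```python
def induced_width(edges, order):
    """Simulate graph elimination in `order`; return max(#neighbors) seen."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    width = 0
    for x in order:
        nbrs = adj.pop(x, set())
        width = max(width, len(nbrs))   # clique is x plus its neighbors
        for y in nbrs:
            adj[y].discard(x)
            adj[y] |= nbrs - {y}        # step 1: connect neighbors pairwise
    return width

# Usage: a 4-cycle A-B-C-D-A; eliminating A adds the fill edge B-D.
cycle = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
print(induced_width(cycle, ["A", "B", "C", "D"]))  # 2
```

On a chain (a tree), eliminating from the leaves keeps every neighborhood at size 1, matching the treewidth-1 claim for trees later in the lecture.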
Undirected Graphs
- The process of eliminating nodes from an undirected graph gives us a clue to the complexity of inference
- To see this, we examine the graph that contains all of the edges we added during the elimination. The resulting graph is always chordal.
Example
- Want to compute P(D)
- Moralizing
- Eliminating v: multiply to get fv(v, t); result fv(t)
- Eliminating x: multiply to get fx(a, x); result fx(a)
- Eliminating s: multiply to get fs(l, b, s); result fs(l, b)
- Eliminating t: multiply to get ft(a, l, t); result ft(a, l)
- Eliminating l: multiply to get fl(a, b, l); result fl(a, b)
- Eliminating a, b: multiply to get fa(a, b, d); result f(d)
[Figures: the moral graph of the Asia network, updated after each elimination step]
Expanded Graphs
[Figure: the induced graph over V, S, T, L, A, B, X, D]
- The resulting graph is the induced graph (for this particular ordering)
- Main properties:
- Every maximal clique in the induced graph corresponds to an intermediate factor in the computation
- Every factor stored during the process is a subset of some maximal clique in the graph
- These facts are true for any variable elimination ordering on any network
Induced Width (Treewidth)
- The size of the largest clique in the induced graph is thus an indicator of the complexity of variable elimination
- This quantity (minus one) is called the induced width (or treewidth) of the graph according to the specified ordering
- Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph
Consequence: Elimination on Trees
- Suppose we have a tree
- A network where each variable has at most one parent
- All the factors involve at most two variables
- Thus, the moralized graph is also a tree
[Figures: a tree network and its moral graph over A, B, C, D, E, F, G]
Elimination on Trees
- We can maintain the tree structure by eliminating extreme (leaf) variables of the tree
[Figures: the tree shrinking as leaves are eliminated]
Elimination on Trees
- Formally, for any tree, there is an elimination ordering with treewidth 1
- Theorem: inference on trees is linear in the number of variables
Polytrees
- A polytree is a network where there is at most one path from one variable to another
- Theorem: inference in a polytree is linear in the representation size of the network
- This assumes a tabular CPT representation
- Can you see how the argument would work?
[Figure: a polytree over A, B, C, D, E, F, G, H]
General Networks
- What do we do when the network is not a polytree?
- If the network has a cycle, the treewidth for any ordering is greater than 1
Example
- Eliminating A, B, C, D, E, ...
- The resulting graph is chordal, with treewidth 2
[Figures: a network over A through H, shown after successive eliminations]
Example
- Eliminating H, G, E, C, F, D, E, A
[Figures: the same network under this elimination ordering]
General Networks
- From graph theory:
- Theorem: finding an ordering that minimizes the treewidth is NP-hard
- However:
- There are reasonable heuristics for finding relatively good orderings
- There are provable approximations to the best treewidth
- If the graph has a small treewidth, there are algorithms that find it in polynomial time
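One common ordering heuristic of the kind mentioned above is min-degree: repeatedly eliminate the node with the fewest neighbors in the current graph (min-fill, which instead counts the fill edges a node would add, is a popular alternative). A sketch:

```python
def min_degree_order(edges):
    """Greedy elimination ordering: always pick the lowest-degree node."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    order = []
    while adj:
        x = min(adj, key=lambda n: (len(adj[n]), n))  # tie-break by name
        nbrs = adj.pop(x)
        for y in nbrs:
            adj[y].discard(x)
            adj[y] |= nbrs - {y}      # connect neighbors, as in elimination
        order.append(x)
    return order

# Usage: on a star graph the leaves go first, keeping every clique small.
star = [("Hub", leaf) for leaf in ("A", "B", "C")]
print(min_degree_order(star))  # ['A', 'B', 'C', 'Hub']
```

The greedy choice gives no optimality guarantee (the problem is NP-hard), but in practice it often produces orderings with induced width close to the true treewidth.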