1
Bayesian Inference
  • Summer School on Causality, Uncertainty and
    Ignorance
  • Konstanz, 15-21 August, 2004
  • David Glass, University of Ulster, dh.glass@ulster.ac.uk

2
Lecture 1: Introduction to Bayesian Networks
  • Bayes' theorem
  • Large problems
  • Bayesian networks - an overview
  • Causality and Bayesian networks
  • the Causal Markov condition
  • Inference in BNs
  • constructing the junction tree
  • two-phase propagation

3
Bayes' Theorem

    P(h | e) = P(e | h) P(h) / P(e)

i.e. posterior = likelihood × prior / probability of evidence.

The probability of a hypothesis, h, can be updated when evidence, e, has been
obtained. Note that it is usually not necessary to calculate P(e) directly, as
it can be obtained by normalizing the posterior probabilities P(hi | e).
4
A Simple Example
Consider two related variables:
1. Drug (D) with values y or n
2. Test (T) with values +ve or -ve
And suppose we have the following probabilities:
P(D = y) = 0.001
P(T = +ve | D = y) = 0.8
P(T = +ve | D = n) = 0.01
These probabilities are sufficient to define a joint probability distribution.
Suppose an athlete tests positive. What is the probability that he has taken
the drug?
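A minimal sketch of this calculation in Python, using only the numbers given
above (the positive test result is taken to be the only evidence):

```python
# Bayes' theorem for the drug-test query.
p_d = 0.001                 # P(D = y)
p_pos_given_d = 0.8         # P(T = +ve | D = y)
p_pos_given_not_d = 0.01    # P(T = +ve | D = n)

p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)   # P(T = +ve) ~= 0.0108
print(p_pos_given_d * p_d / p_pos)                            # P(D = y | T = +ve) ~= 0.074
```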
5
A More Complex Case
Suppose now that there is a similar link between Lung Cancer (L) and a chest
X-ray (X), and that we also have the following relationships: history of
smoking (S) has a direct influence on bronchitis (B) and lung cancer (L), and
L and B have a direct influence on fatigue (F). What is the probability that
someone has bronchitis given that they smoke, have fatigue and have received
a positive X-ray result? That is, we want P(b1 | s1, f1, x1), where, for
example, the variable B takes on values b1 (has bronchitis) and b2 (does not
have bronchitis).
R.E. Neapolitan, Learning Bayesian Networks (2004)
6
Problems with Large Instances
  • The joint probability distribution, P(b,s,f,x,l)
  • For five binary variables there are 2^5 = 32 values in the joint
    distribution (for 100 variables there are over 10^30 values)
  • How are these values to be obtained?
  • Inference
  • To obtain posterior distributions once some evidence is available requires
    summation over an exponential number of terms, e.g. 2^2 terms in the
    calculation of P(b1 | s1, f1, x1),

which increases to 2^97 if there are 100 variables.
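A brute-force sketch of that enumeration; here joint_prob is a hypothetical
function returning P(b, s, f, x, l), and the point is only that the sums range
over every configuration of the unobserved variables:

```python
def query_by_enumeration(joint_prob, s, f, x):
    """Return P(B = b1 | s, f, x) by summing the joint over the hidden variable L."""
    scores = {}
    for b in ("b1", "b2"):
        # one term per configuration of the unobserved variables
        scores[b] = sum(joint_prob(b, s, f, x, l) for l in ("l1", "l2"))
    return scores["b1"] / (scores["b1"] + scores["b2"])
```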
7
Bayesian Networks
  • A Bayesian network consists of
  • A Graph
  • nodes represent the random variables
  • directed edges (arrows) between pairs of
    nodes
  • it must be a Directed Acyclic Graph (DAG), i.e. no directed cycles
  • the graph represents independence
    relationships between variables
  • Conditional probability specifications
  • the conditional probability of each variable
    given its parents in the DAG

8
An Example Bayesian Network
P(s1) = 0.2
P(l1 | s1) = 0.003    P(l1 | s2) = 0.00005
P(b1 | s1) = 0.25     P(b1 | s2) = 0.05
P(f1 | b1,l1) = 0.75  P(f1 | b1,l2) = 0.10  P(f1 | b2,l1) = 0.5  P(f1 | b2,l2) = 0.05
P(x1 | l1) = 0.6      P(x1 | l2) = 0.02
R.E. Neapolitan, Learning Bayesian Networks (2004)
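For later reference, these conditional probability tables can be written down
directly as Python dictionaries (the DAG, as described on slide 5, has edges
S → B, S → L, B → F, L → F and L → X; complementary entries are implied by
1 − p):

```python
# CPTs of the five-variable example network, keyed by (child value, parent values).
P_S = {"s1": 0.2}
P_L_given_S = {("l1", "s1"): 0.003, ("l1", "s2"): 0.00005}
P_B_given_S = {("b1", "s1"): 0.25, ("b1", "s2"): 0.05}
P_F_given_BL = {("f1", "b1", "l1"): 0.75, ("f1", "b1", "l2"): 0.10,
                ("f1", "b2", "l1"): 0.50, ("f1", "b2", "l2"): 0.05}
P_X_given_L = {("x1", "l1"): 0.6, ("x1", "l2"): 0.02}
```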
9
The Markov Condition
A Bayesian network (G,P) satisfies the Markov condition, according to which
each variable X in G is conditionally independent of its nondescendants given
its parents in G. This is denoted by X ⊥ nd(X) | pa(X), or I_P(X, nd(X) | pa(X)).
E.g., in this network: F ⊥ {S,X} | {B,L} and L ⊥ B | S.
The Markov condition is sometimes referred to as the local directed Markov
condition or the parental Markov condition. See Cowell et al. (1999) or
Whittaker (1990) for a detailed discussion of Markov properties.
10
The Joint Probability Distribution
Note that our joint distribution over the 5 variables can be represented, by
the chain rule, as
    P(b,s,f,x,l) = P(f | b,x,l,s) P(x | b,l,s) P(b | l,s) P(l | s) P(s)
But due to the Markov condition we have, for example,
    P(f | b,x,l,s) = P(f | b,l)
Consequently the joint probability distribution can now be expressed as
    P(b,s,f,x,l) = P(s) P(l | s) P(b | s) P(f | b,l) P(x | l)
For example, the probability that someone has a smoking history, lung cancer
but not bronchitis, suffers from fatigue and tests positive in an X-ray test is
    P(s1, l1, b2, f1, x1) = P(s1) P(l1 | s1) P(b2 | s1) P(f1 | b2,l1) P(x1 | l1)
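Evaluating that product with the CPT values from slide 8 gives, as a quick
check:

```python
# P(s1) * P(l1|s1) * P(b2|s1) * P(f1|b2,l1) * P(x1|l1)
print(0.2 * 0.003 * 0.75 * 0.5 * 0.6)   # ~= 0.000135
```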
11
Representing the Joint Distribution
In general, for a network with nodes X1, X2, ..., Xn,
    P(x1, x2, ..., xn) = ∏_i P(xi | pa(Xi))
An enormous saving can be made regarding the number of values required for
the joint distribution. To determine the joint distribution directly for n
binary variables, 2^n - 1 values are required. For a BN with n binary
variables in which each node has at most k parents, fewer than 2^k n values
are required.
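For instance, a small Python check with n = 100 and k = 4 (illustrative
values only):

```python
n, k = 100, 4
print(2**n - 1)   # full joint distribution: about 1.27e30 values
print(n * 2**k)   # Bayesian network upper bound: 1600 conditional probabilities
```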
12
Causality and Bayesian Networks
Clearly not every BN describes causal relationships between the variables.
Consider the dependence between Lung Cancer, L, and the X-ray test, X. By
focusing on just these variables we might be tempted to represent them by the
following BN (L → X):
    P(l1) = 0.001
    P(x1 | l1) = 0.6    P(x1 | l2) = 0.02
However, the following BN (X → L) represents the same distribution and
independencies (i.e. none):
    P(x1) = 0.02058
    P(l1 | x1) = 0.02915    P(l1 | x2) = 0.00041
Nevertheless, it is tempting to think that BNs can be constructed by creating
a DAG where the edges represent direct causal relationships between the
variables.
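The numbers in the reversed network follow from the original parameters via
Bayes' theorem; a quick check:

```python
p_l1 = 0.001
p_x1_given_l1, p_x1_given_l2 = 0.6, 0.02

p_x1 = p_x1_given_l1 * p_l1 + p_x1_given_l2 * (1 - p_l1)   # 0.02058
p_l1_given_x1 = p_x1_given_l1 * p_l1 / p_x1                # ~0.02915
p_l1_given_x2 = (1 - p_x1_given_l1) * p_l1 / (1 - p_x1)    # ~0.00041
print(p_x1, p_l1_given_x1, p_l1_given_x2)
```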
13
Common Causes
Consider the following DAG: Smoking → Bronchitis, Smoking → Lung Cancer.
Markov condition: I_P(B, L | S), i.e. P(b | l, s) = P(b | s). If we know the
causal relationships S → B and S → L, and we know that Joe is a smoker, then
finding out that he has bronchitis will not give us any more information
about the probability of him having lung cancer. So the Markov condition
would be satisfied.
14
Common Effects
Consider the following DAG: Burglary → Alarm ← Earthquake.
Markov condition: I_P(B, E), i.e. P(b | e) = P(b). We would expect Burglary
and Earthquake to be independent of each other, which is in agreement with
the Markov condition. We would, however, expect them to be conditionally
dependent given Alarm: if the alarm has gone off, news that there had been an
earthquake would explain away the idea that a burglary had taken place. Again
this is in agreement with the Markov condition.
15
The Causal Markov Condition
  • The basic idea is that the Markov condition holds
    for a causal DAG.
  • Certain other conditions must be met for the
    Causal Markov condition to hold
  • there must be no hidden common causes
  • there must not be selection bias
  • there must be no feedback loops
  • Even with these provisos there is a lot of
    controversy as to its validity.
  • It seems to be false in quantum mechanical systems which have been found
    to violate Bell's inequalities.

16
Hidden Common Causes
(Figure: a DAG over X, Y and Z, together with a hidden common cause, H, of X
and Y.)
If a DAG is created on the basis of causal relationships between the variables
under consideration, then X and Y would be marginally independent according to
the Markov condition. But since they have a hidden common cause, H, they will
normally be dependent.
17
Inference in Bayesian Networks
  • The main point of BNs is to enable probabilistic
    inference to be performed.
  • There are two main types of inference to be
    carried out
  • Belief updating: obtain the posterior probability of one or more variables
    given evidence concerning the values of other variables
  • Abductive inference (or belief revision): find the most probable
    configuration of a set of variables (hypothesis) given evidence
  • Consider the BN discussed earlier

What is the probability that someone has bronchitis (B) given that they smoke
(S), have fatigue (F) and have received a positive X-ray (X) result?
18
Inference: an overview
  • Trees and singly connected networks: only one path between any two nodes
  • message passing (Pearl, 1988)
  • Multiply connected networks
  • a range of algorithms including cut-set
    conditioning (Pearl, 1988), junction tree
    propagation (Lauritzen and Spiegelhalter, 1988),
    bucket elimination (Dechter, 1996) to
    mention a few.
  • a range of algorithms for approximate
    inference

Both exact and approximate inference are NP-hard
in the worst case. Here the focus will be on
junction tree propagation for discrete variables.
19
Junction Tree Propagation
  • (Lauritzen and Spiegelhalter, 1988)
  • The general idea is that the propagation of
    evidence through the network can be carried out
    more efficiently by representing the joint
    probability distribution on an undirected graph
    called the Junction tree (or Join tree).
  • The junction tree has the following
    characteristics
  • it is an undirected tree
  • its nodes are clusters of variables (i.e.
    from the original BN)
  • given two clusters, C1 and C2, every node on the path between them
    contains their intersection C1 ∩ C2
  • a Separator, S, is associated with each edge
    and contains the variables in the
    intersection between neighbouring nodes

20
Constructing the Junction Tree (1)
Step 1. Form the moral graph from the DAG. Consider the Asia network.
(Figure: the Asia DAG and its moral graph, formed by "marrying" the parents
of each node and removing the arrows.)
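A minimal moralization sketch in Python, assuming the DAG is given as a
mapping from each node to its parent set (the example uses the five-variable
network from earlier rather than the Asia network):

```python
from itertools import combinations

def moralize(parents):
    """Marry the parents of each node and drop edge directions."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                                # keep the original edges, undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(sorted(pa), 2):    # marry co-parents
            edges.add(frozenset((p, q)))
    return edges

# Five-variable network from earlier: S -> L, S -> B, {B, L} -> F, L -> X
print(moralize({"S": set(), "L": {"S"}, "B": {"S"}, "F": {"B", "L"}, "X": {"L"}}))
```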
21
Constructing the Junction Tree (2)
Step 2. Triangulate the moral graph. An undirected graph is triangulated if
every cycle of length greater than 3 possesses a chord.
22
Constructing the Junction Tree (3)
Step 3. Identify the cliques. A clique is a subset of nodes which is complete
(i.e. there is an edge between every pair of nodes) and maximal.
Cliques: {B,S,L}, {B,L,E}, {B,E,F}, {L,E,T}, {A,T}, {E,X}
23
Constructing the Junction Tree (4)
Step 4. Build the junction tree. The cliques should be ordered (C1, C2, ..., Ck)
so that they possess the running intersection property: for all 1 < j <= k,
there is an i < j such that Cj ∩ (C1 ∪ ... ∪ Cj-1) ⊆ Ci. To build the junction
tree, choose one such i for each j and add an edge between Cj and Ci.
Cliques: {B,S,L}, {B,L,E}, {B,E,F}, {L,E,T}, {A,T}, {E,X}
(Figure: the resulting junction tree for the Asia network.)
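Step 4 can be sketched in code, assuming the cliques are supplied in an order
that already satisfies the running intersection property; any valid choice of
i gives a junction tree, so the edges produced need not match the slide's
figure exactly:

```python
def build_junction_tree(ordered_cliques):
    """Connect each C_j (j > 0) to an earlier C_i containing C_j's separator."""
    edges = []
    for j in range(1, len(ordered_cliques)):
        seen = set().union(*ordered_cliques[:j])    # C_1 u ... u C_{j-1}
        separator = ordered_cliques[j] & seen
        for i in range(j):
            if separator <= ordered_cliques[i]:     # choose one such i
                edges.append((i, j, separator))
                break
    return edges

cliques = [{"B", "S", "L"}, {"B", "L", "E"}, {"B", "E", "F"},
           {"L", "E", "T"}, {"A", "T"}, {"E", "X"}]
print(build_junction_tree(cliques))
```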
24
Potential Representation
The joint probability distribution can now be represented in terms of
potential functions, φ, defined on each clique and each separator of the
junction tree. The joint distribution is given by
    P(x) = ∏_C φ_C(x_C) / ∏_S φ_S(x_S)
The idea is to transform one representation of the joint distribution to
another in which, for each clique C, the potential function gives the
marginal distribution for the variables in C, i.e.
    φ_C(x_C) = P(x_C)
This will also apply for the separators, S.
25
Initialization
To initialize the potential functions:
1. set all potentials to unity
2. for each variable, Xi, select one node in the junction tree (i.e. one
   clique) containing both that variable and its parents, pa(Xi), in the
   original DAG
3. multiply that clique's potential by P(xi | pa(xi))
Example: our original BN can be represented by a junction tree with cliques
{B,S,L}, {B,L,F} and {L,X}.
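As a sketch, here is the initialization of the potential on the clique
{B,S,L}, which receives the CPTs P(S), P(B | S) and P(L | S); the numbers are
those from slide 8 with the complementary entries filled in:

```python
from itertools import product

P_S = {"s1": 0.2, "s2": 0.8}
P_B = {("b1", "s1"): 0.25, ("b2", "s1"): 0.75, ("b1", "s2"): 0.05, ("b2", "s2"): 0.95}
P_L = {("l1", "s1"): 0.003, ("l2", "s1"): 0.997, ("l1", "s2"): 0.00005, ("l2", "s2"): 0.99995}

phi_BSL = {}
for b, s, l in product(("b1", "b2"), ("s1", "s2"), ("l1", "l2")):
    # start from unity and multiply in each assigned conditional probability
    phi_BSL[(b, s, l)] = 1.0 * P_S[s] * P_B[(b, s)] * P_L[(l, s)]

print(phi_BSL[("b1", "s1", "l1")])   # 0.2 * 0.25 * 0.003 = 0.00015
```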
26
Propagating Information
Passing information from one clique, C1, to another, C2, via the separator
between them, S0, requires two steps:
1. Obtain a new potential for S0 by marginalizing out the variables in C1
   that are not in S0:
       φ*_S0 = Σ_{C1 \ S0} φ_C1
2. Obtain a new potential for C2:
       φ*_C2 = φ_C2 λ_S0,  where  λ_S0 = φ*_S0 / φ_S0
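A minimal sketch of a single flow, with potentials stored as dictionaries
keyed by value configurations; the sep_from_* arguments are assumed projection
functions from a clique configuration onto the separator, and 0/0 is treated
as 0 by convention:

```python
def pass_message(phi_C1, phi_S0_old, phi_C2, sep_from_C1, sep_from_C2):
    """One flow C1 -> S0 -> C2; returns the new separator and C2 potentials."""
    # Step 1: marginalize C1 down to the separator S0
    phi_S0_new = {}
    for cfg, val in phi_C1.items():
        key = sep_from_C1(cfg)
        phi_S0_new[key] = phi_S0_new.get(key, 0.0) + val
    # Step 2: scale C2 by the ratio of new to old separator potentials
    phi_C2_new = {}
    for cfg, val in phi_C2.items():
        key = sep_from_C2(cfg)
        ratio = phi_S0_new[key] / phi_S0_old[key] if phi_S0_old[key] else 0.0
        phi_C2_new[cfg] = val * ratio
    return phi_S0_new, phi_C2_new
```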
27
An Example
Consider a flow from the clique {B,S,L} to {B,L,F}.

Initial representation:

φ_BSL = P(B | S) P(L | S) P(S):
            l1         l2
s1,b1   0.00015    0.04985
s1,b2   0.00045    0.14955
s2,b1   0.000002   0.039998
s2,b2   0.000038   0.759962

φ_BL = 1:
        l1   l2
b1       1    1
b2       1    1

φ_BLF = P(F | B,L):
          l1     l2
f1,b1   0.75   0.1
f1,b2   0.5    0.05
f2,b1   0.25   0.9
f2,b2   0.5    0.95

After flow:

φ_BSL (unchanged):
            l1         l2
s1,b1   0.00015    0.04985
s1,b2   0.00045    0.14955
s2,b1   0.000002   0.039998
s2,b2   0.000038   0.759962

φ_BL:
           l1          l2
b1      0.000152   0.089848
b2      0.000488   0.909512

φ_BLF:
            l1          l2
f1,b1   0.000114   0.0089848
f1,b2   0.000244   0.0454756
f2,b1   0.000038   0.0808632
f2,b2   0.000244   0.8640364
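This flow can be reproduced with the pass_message sketch, reusing phi_BSL from
the initialization sketch (this snippet therefore assumes both earlier sketches
are in scope):

```python
phi_BL = {(b, l): 1.0 for b in ("b1", "b2") for l in ("l1", "l2")}
phi_BLF = {("b1", "l1", "f1"): 0.75, ("b1", "l2", "f1"): 0.10,
           ("b2", "l1", "f1"): 0.50, ("b2", "l2", "f1"): 0.05,
           ("b1", "l1", "f2"): 0.25, ("b1", "l2", "f2"): 0.90,
           ("b2", "l1", "f2"): 0.50, ("b2", "l2", "f2"): 0.95}

phi_BL_new, phi_BLF_new = pass_message(
    phi_BSL, phi_BL, phi_BLF,
    sep_from_C1=lambda cfg: (cfg[0], cfg[2]),   # (b, s, l) -> (b, l)
    sep_from_C2=lambda cfg: (cfg[0], cfg[1]))   # (b, l, f) -> (b, l)

print(phi_BL_new[("b1", "l1")])          # 0.000152
print(phi_BLF_new[("b1", "l1", "f1")])   # 0.000114
```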
28
An Example with Evidence
Consider a flow from the clique {B,S,L} to {B,L,F}, but this time we include
the information that Joe is a smoker, S = s1.

Incorporation of evidence (entries inconsistent with s1 are set to zero):

φ_BSL:
            l1        l2
s1,b1   0.00015   0.04985
s1,b2   0.00045   0.14955
s2,b1   0         0
s2,b2   0         0

After flow:

φ_BL:
          l1        l2
b1      0.00015   0.04985
b2      0.00045   0.14955

φ_BLF:
             l1          l2
f1,b1   0.0001125   0.004985
f1,b2   0.000225    0.0074775
f2,b1   0.0000375   0.044865
f2,b2   0.000225    0.1420725
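In the dictionary representation used in the sketches above, entering the
evidence S = s1 is just a matter of zeroing the inconsistent entries before
propagating:

```python
# Keep only configurations consistent with S = s1; keys are ordered (b, s, l).
phi_BSL_ev = {cfg: (val if cfg[1] == "s1" else 0.0) for cfg, val in phi_BSL.items()}
```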
29
The Full Propagation (1)
Two-phase propagation (Jensen et al., 1990):
1. Select an arbitrary clique, C0
2. Collection phase: flows passed from the periphery to C0
3. Distribution phase: flows passed from C0 to the periphery
(Figure: collection and distribution flows on the junction tree.)
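A recursive sketch of the two phases, with the junction tree given as an
adjacency map {clique: [neighbours]}; send(a, b) stands for passing a flow
from clique a to clique b, e.g. with the pass_message sketch above:

```python
def collect(tree, send, node, parent=None):
    for nb in tree[node]:
        if nb != parent:
            collect(tree, send, nb, node)
            send(nb, node)                 # flows pass from the periphery towards C0

def distribute(tree, send, node, parent=None):
    for nb in tree[node]:
        if nb != parent:
            send(node, nb)                 # flows pass from C0 out to the periphery
            distribute(tree, send, nb, node)

def propagate(tree, send, root):
    collect(tree, send, root)              # collection phase
    distribute(tree, send, root)           # distribution phase
```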
30
The Full Propagation (2)
After the two propagation phases have been carried out the junction tree will
be in equilibrium, with each clique containing the joint probability
distribution for the variables it contains. Marginal probabilities for
individual variables can then be obtained from the cliques. Evidence, E, can
be included before propagation by selecting a clique for each variable for
which evidence is available. The potential for the clique is then set to 0
for any configuration which differs from the evidence. After propagation the
result will be
    φ_C(x_C) = P(x_C, E)
Normalizing gives
    P(x_C | E) = φ_C(x_C) / Σ_{x_C} φ_C(x_C)
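A small helper for that final normalization step (a sketch):

```python
def normalize(phi):
    """Turn a potential proportional to P(x_C, E) into P(x_C | E)."""
    total = sum(phi.values())          # equals P(E) after propagation with evidence
    return {cfg: val / total for cfg, val in phi.items()}
```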
31
A Final Example (1)
What is the probability that someone has bronchitis given that they smoke,
have fatigue and have received a positive X-ray result? Recall that the BN
can be represented by the junction tree with cliques {B,S,L}, {B,L,F} and
{L,X}. On entering evidence S = s1, F = f1 and X = x1, we obtain
32
A Final Example (2)
φ_BSL (after the collection phase):
              l1           l2
s1,b1   0.0000675    0.0000997
s1,b2   0.000135     0.00014955
s2,b1   0            0
s2,b2   0            0

After the collection phase φ_BSL is in its final state. To obtain P(b1, E),
marginalize out L: 0.0000675 + 0.0000997 = 0.0001672. Normalizing gives
P(b1 | E) = 0.37. If we also observe L = l1 then P(b1 | E, l1) = 0.33.

Intermediate potentials produced during collection:

φ_BL:
         l1      l2
b1      0.45   0.002
b2      0.3    0.001

φ_BLF:
          l1      l2
f1,b1   0.45   0.002
f1,b2   0.3    0.001
f2,b1   0      0
f2,b2   0      0

φ_L:
  l1     l2
 0.6    0.02

φ_LX:
        l1     l2
x1     0.6    0.02
x2     0      0
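The marginalization and normalization on this slide can be checked directly:

```python
p_b1_E = 0.0000675 + 0.0000997             # = 0.0001672
p_b2_E = 0.000135 + 0.00014955             # = 0.00028455
print(p_b1_E / (p_b1_E + p_b2_E))          # P(b1 | E) ~= 0.37
print(0.0000675 / (0.0000675 + 0.000135))  # P(b1 | E, l1) ~= 0.33
```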
33
Summary
  • Things to remember
  • the Markov condition - a property of BNs
  • what it means for a distribution P to
    satisfy the Markov condition with respect
    to a DAG
  • the causal Markov condition - assumed for causal BNs
  • how to construct a junction tree
  • propagation in the junction tree

34
Abductive Inference in BNs (1)
So far we have considered inference problems where the goal is to obtain
posterior probabilities for variables given evidence. In abductive inference
the goal is to find the configuration of a set of variables (the hypothesis)
which will best explain the evidence.
What would count as the best explanation of fatigue (F = f1) and a positive
X-ray (X = x1)? A configuration of all the other variables? A subset of them?
35
Abductive Inference in BNs (2)
  • There are two types of abductive inference in
    BNs
  • MPE (Most Probable Explanation) - the most
    probable configuration of all variables
    in the BN given evidence
  • MAP (Maximum A Posteriori) - the most
    probable configuration of a subset of
    variables in the BN given evidence
  • Note 1: In general the MPE cannot be found by taking the most probable
    configuration of nodes individually!
  • Note 2: And the MAP cannot be found by taking the projection of the MPE
    onto the explanation set!

36
The Most Probable Explanations
(Dawid, 1992; Nilsson, 1998) The MPE can be obtained by adapting the
propagation algorithm in the junction tree: when a message is passed from one
clique to another, the potential function on the separator is obtained by
    φ*_S0 = max_{C1 \ S0} φ_C1
rather than
    φ*_S0 = Σ_{C1 \ S0} φ_C1
Nilsson (1998) discusses how to find the K MPEs and also outlines how to
perform MAP inference: marginalize out the variables not in the explanation
set and use the MPE approach on the remaining variables.
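In code, only Step 1 of the flow changes; a max-marginalization counterpart
to the pass_message sketch above might look like this:

```python
def max_message_separator(phi_C1, sep_from_C1):
    """Max-marginalize a clique potential down to its separator."""
    phi_S0_new = {}
    for cfg, val in phi_C1.items():
        key = sep_from_C1(cfg)
        phi_S0_new[key] = max(phi_S0_new.get(key, 0.0), val)
    return phi_S0_new
```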
37
An Example
Consider a max-flow from the clique {B,S,L} to {B,L,F}.

Initial representation:

φ_BSL = P(B | S) P(L | S) P(S):
            l1         l2
s1,b1   0.00015    0.04985
s1,b2   0.00045    0.14955
s2,b1   0.000002   0.039998
s2,b2   0.000038   0.759962

φ_BL = 1:
        l1   l2
b1       1    1
b2       1    1

φ_BLF = P(F | B,L):
          l1     l2
f1,b1   0.75   0.1
f1,b2   0.5    0.05
f2,b1   0.25   0.9
f2,b2   0.5    0.95

After flow:

φ_BSL: unchanged.

φ_BL (maximizing over S):
          l1        l2
b1      0.00015   0.04985
b2      0.00045   0.759962

φ_BLF:
            l1         l2
f1,b1   0.000113   0.004985
f1,b2   0.000225   0.037998
f2,b1   0.000038   0.044865
f2,b2   0.000225   0.721964
38
What is the best explanation?
Two questions: What counts as an explanation? And which one is best?
E.g. what would count as an explanation of fatigue (F = f1) and a positive
X-ray (X = x1)? Or of lung cancer (L = l1)?
Causality is often taken to play a crucial role in explanation, so if BNs can
be interpreted causally they provide a good platform for obtaining
explanations. A suitable restriction is therefore that explanatory variables
causally precede the evidence.
39
What is the best explanation? (1)
Which explanation is best? How do we determine how good an explanation is?
One approach is to say that the best explanation is the one which makes the
evidence most probable: select explanation H1 over H2 if
    P(E | H1) > P(E | H2)
But even if P(E | H1) = 1, the posterior of H1 may be small due to a small
prior. The MAP approach takes account of this by selecting H1 over H2 if
    P(H1 | E) > P(H2 | E)
But perhaps this overcompensates, since it could be the case that
    P(E | H1) < P(E)
and so H1 actually lowers the probability of the evidence.
40
What is the best explanation? (2)
In many cases the two approaches agree as to which is the better of the two
explanations, i.e.
    P(E | H1) > P(E | H2)  and  P(H1 | E) > P(H2 | E)
A reasonable requirement for any approach that wants to do better is that it
should agree with these approaches when they agree with each other.
Why consider an alternative to MPE / MAP?
1. It might be closer to human reasoning.
2. It could be used to test Inference to the Best Explanation (IBE) - does
   IBE make probable inferences?
3. Perhaps high probability is not what we want - there is a trade-off with
   information content.