Title: Lecture 2: Inference and Learning in Bayesian Networks
1. Lecture 2: Inference and Learning in Bayesian Networks
- Abductive Inference in BNs
- Conditional Independence in BNs
- D-separation
- Faithfulness
- Markov Equivalence
- Learning BN Structure from Data
- The Bayesian Approach
- K2 Algorithm
2. Abductive Inference in BNs (1)
So far we have considered inference problems where the goal is to obtain posterior probabilities for variables given evidence. In abductive inference the goal is to find the configuration of a set of variables (the hypothesis) which best explains the evidence.
What would count as the best explanation of fatigue (F = f1) and a positive X-ray (X = x1)? A configuration of all the other variables? A subset of them?
3. Abductive Inference in BNs (2)
- There are two types of abductive inference in BNs:
  - MPE (Most Probable Explanation): the most probable configuration of all variables in the BN given the evidence
  - MAP (Maximum A Posteriori): the most probable configuration of a subset of variables in the BN given the evidence
- Note 1: In general the MPE cannot be found by taking the most probable configuration of nodes individually!
- Note 2: And the MAP cannot be found by taking the projection of the MPE onto the explanation set!
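Note 1 can be seen in a two-variable example (hypothetical numbers): the most probable joint configuration need not combine the individually most probable states.

```python
# A minimal sketch: the MPE differs from the pair of individually most
# probable states. The distribution below is illustrative only.
p = {
    ("x1", "y1"): 0.00,
    ("x1", "y2"): 0.35,
    ("x2", "y1"): 0.34,
    ("x2", "y2"): 0.31,
}

# MPE: most probable full configuration.
mpe = max(p, key=p.get)                      # ("x1", "y2")

# Individually most probable states, from the marginals.
px, py = {}, {}
for (x, y), v in p.items():
    px[x] = px.get(x, 0) + v                 # P(X): x1 -> 0.35, x2 -> 0.65
    py[y] = py.get(y, 0) + v                 # P(Y): y1 -> 0.34, y2 -> 0.66
best_individual = (max(px, key=px.get), max(py, key=py.get))  # ("x2", "y2")

print(mpe, best_individual)                  # the two configurations differ
```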
4. The Most Probable Explanations
(Dawid, 1992; Nilsson, 1998) The MPE can be obtained by adapting the propagation algorithm in the junction tree: when a message is passed from one clique C to another, the potential function on the separator S is obtained by

phi_S = max_{C \ S} phi_C

rather than

phi_S = sum_{C \ S} phi_C

i.e. the variables of C not in S are maximized out rather than summed out.
Nilsson (1998) discusses how to find the K most probable explanations, and also outlines how to perform MAP inference: marginalize out the variables not in the explanation set and use the MPE approach on the remaining variables.
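The single change that turns sum-propagation into max-propagation can be sketched as follows (a toy potential with illustrative names and values):

```python
# Separator potential formed by maximizing the clique potential over the
# variables not in the separator, instead of summing them out.
def marginalise(phi, keep_axes, op):
    """Collapse a potential (dict keyed by config tuples) with a binary op."""
    out = {}
    for config, value in phi.items():
        key = tuple(config[i] for i in keep_axes)
        out[key] = op(out.get(key, 0.0), value)
    return out

# Clique potential phi(A, B) over configurations.
phi_AB = {("a1", "b1"): 0.3, ("a1", "b2"): 0.1,
          ("a2", "b1"): 0.2, ("a2", "b2"): 0.4}

# Separator contains only B (axis 1).
sep_sum = marginalise(phi_AB, keep_axes=(1,), op=lambda acc, v: acc + v)
sep_max = marginalise(phi_AB, keep_axes=(1,), op=max)

print(sep_sum)  # {('b1',): 0.5, ('b2',): 0.5}
print(sep_max)  # {('b1',): 0.3, ('b2',): 0.4}
```

Note that sum-marginalization cannot distinguish b1 from b2 here, while max-marginalization can.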
5. An Example
Consider a max-flow from the clique {B, S, L} to {B, L, F}.

Initial representation:

phi_BSL = P(B | S) P(L | S) P(S):
              l1        l2
  s1, b1   0.00015   0.04985
  s1, b2   0.00045   0.14955
  s2, b1   0.000002  0.039998
  s2, b2   0.000038  0.759962

phi_BL = 1:
          l1   l2
  b1       1    1
  b2       1    1

phi_BLF = P(F | B, L):
             l1     l2
  f1, b1   0.75   0.1
  f1, b2   0.5    0.05
  f2, b1   0.25   0.9
  f2, b2   0.5    0.95

After the flow (phi_BSL is unchanged; the separator is obtained by maximizing phi_BSL over S, and phi_BLF absorbs the update):

phi_BL:
              l1        l2
  b1       0.00015   0.04985
  b2       0.00045   0.759962

phi_BLF:
              l1        l2
  f1, b1   0.000113  0.004985
  f1, b2   0.000225  0.037998
  f2, b1   0.000038  0.044865
  f2, b2   0.000225  0.721964
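The flow can be reproduced numerically: the new separator potential on (B, L) is the max of phi_BSL over S, and phi_BLF is multiplied by the update ratio new/old (the old separator potential is identically 1 here). A minimal sketch:

```python
# Tables stored as {row key: (value for l1, value for l2)}.
phi_BSL = {
    ("s1", "b1"): (0.00015, 0.04985),
    ("s1", "b2"): (0.00045, 0.14955),
    ("s2", "b1"): (0.000002, 0.039998),
    ("s2", "b2"): (0.000038, 0.759962),
}

# Max out S to get the separator potential on (B, L).
phi_BL = {}
for (s, b), (v1, v2) in phi_BSL.items():
    old = phi_BL.get(b, (0.0, 0.0))
    phi_BL[b] = (max(old[0], v1), max(old[1], v2))

# Absorb into phi_BLF = P(F | B, L): multiply each row by the separator
# (the old separator was 1, so the update ratio is just the new value).
p_F_given_BL = {("f1", "b1"): (0.75, 0.1), ("f1", "b2"): (0.5, 0.05),
                ("f2", "b1"): (0.25, 0.9), ("f2", "b2"): (0.5, 0.95)}
phi_BLF = {(f, b): (p1 * phi_BL[b][0], p2 * phi_BL[b][1])
           for (f, b), (p1, p2) in p_F_given_BL.items()}

print(phi_BL)
print(phi_BLF)
```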
6. What is the Best Explanation?
Two questions: What counts as an explanation? And which one is best?
E.g. what would count as an explanation of fatigue (F = f1) and a positive X-ray (X = x1)? Or of lung cancer (L = l1)?
Causality is often taken to play a crucial role in explanation, so if BNs can be interpreted causally they provide a good platform for obtaining explanations. A suitable restriction, then, is that explanatory variables causally precede the evidence.
7. What is the Best Explanation? (1)
Which explanation is best? How do we determine how good an explanation is? One approach is to say that it is the one which makes the evidence most probable. Select explanation H1 over H2 if

P(E | H1) > P(E | H2)

But even if P(E | H1) = 1, the posterior of H1 may be small due to a small prior. The MAP approach takes account of this by selecting H1 over H2 if

P(H1 | E) > P(H2 | E)

But perhaps this overcompensates, since it could be the case that

P(E | H1) < P(E)

and so H1 actually lowers the probability of the evidence.
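The tension between the two criteria can be seen with some illustrative numbers (the priors and likelihoods below are hypothetical): H1 makes the evidence certain yet its small prior leaves it with a low posterior, while H2's likelihood falls below P(E), so H2 lowers the probability of the evidence in the sense just described.

```python
# Hypothetical priors and likelihoods for two explanations of evidence E.
p_h1, p_h2 = 0.01, 0.99
p_e_h1, p_e_h2 = 1.0, 0.3

p_e = p_e_h1 * p_h1 + p_e_h2 * p_h2          # P(E) = 0.307
post_h1 = p_e_h1 * p_h1 / p_e                # P(H1 | E) ~ 0.033
post_h2 = p_e_h2 * p_h2 / p_e                # P(H2 | E) ~ 0.967

# The likelihood criterion prefers H1; the MAP criterion prefers H2.
print(p_e_h1 > p_e_h2, post_h1 < post_h2)    # True True
# And H2's likelihood is below P(E), even though H2 is the MAP choice.
print(p_e_h2 < p_e)                          # True
```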
8. What is the Best Explanation? (2)
In many cases the two approaches agree as to which is the better of the two explanations, i.e.

P(E | H1) > P(E | H2) and P(H1 | E) > P(H2 | E)

A reasonable requirement for any approach that wants to do better is that it should agree with these approaches when they agree with each other.
Why consider an alternative to MPE / MAP?
1. It might be closer to human reasoning.
2. It could be used to test Inference to the Best Explanation (IBE): does IBE make probable inferences?
3. Perhaps high probability is not what we want: there is a trade-off with information content.
9. BNs and Conditional Independence
We have already considered the Markov condition, which is satisfied by BNs: a variable is conditionally independent of its nondescendants given its parents. Which independencies are determined by the Markov condition?
We know that Ip(X, A | E), but is Ip(X, A | T)? We also know that Ip(F, T | B, E), but is Ip(F, T | E)? We also know that Ip(B, L | S), but is Ip(B, L | S, F)? Or Ip(B, L | S, E)?
10. Common Causes and Causal Chains
Consider the following DAG:
Ip(B, L | S): we say that S blocks the chain from B to L, and the arrows meet tail-to-tail (common cause).
Ip(B, L | S): we say that S blocks the chain from B to L, and the arrows meet head-to-tail (causal chain).
11. Common Effects
Consider the following DAG:
Although B and E are marginally independent, i.e. Ip(B, E), they are not independent given A. We say that A unblocks the chain from B to E, and that the arrows meet head-to-head.
12. D-Separation
A chain, p, between two nodes A and B is said to be d-separated by a set of nodes Z if one of the following holds:
1. p contains a node which is in Z where the edges in p meet head-to-tail
2. p contains a node which is in Z where the edges in p meet tail-to-tail
3. p contains a node where the edges in p meet head-to-head, and neither that node nor any of its descendants is in Z.
Z d-separates A and B if it d-separates every chain from A to B. Z d-separates two sets of nodes, X and Y, if it d-separates every chain from a node in X to a node in Y. We denote this by <X | Z | Y>_D.
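The three blocking conditions can be checked directly by enumerating chains, which is feasible for small DAGs. A sketch (node names and graphs below are illustrative):

```python
# d-separation by exhaustive chain enumeration; suits small DAGs only.
def descendants(dag, node):
    out, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def d_separated(dag, a, b, z):
    # Undirected adjacency for enumerating chains.
    nbrs = {}
    for u, children in dag.items():
        for v in children:
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    z = set(z)

    def chains(u, target, visited):
        if u == target:
            yield [u]
            return
        for v in nbrs.get(u, ()):
            if v not in visited:
                for rest in chains(v, target, visited | {v}):
                    yield [u] + rest

    def blocked(path):
        for i in range(1, len(path) - 1):
            prev, node, nxt = path[i - 1], path[i], path[i + 1]
            collider = node in dag.get(prev, ()) and node in dag.get(nxt, ())
            if collider:
                # head-to-head: blocks unless the node or a descendant is in Z
                if node not in z and not (descendants(dag, node) & z):
                    return True
            elif node in z:       # head-to-tail or tail-to-tail node in Z
                return True
        return False

    return all(blocked(p) for p in chains(a, b, {a}))

# Common-effect DAG B -> A <- E:
dag = {"B": ["A"], "E": ["A"], "A": []}
print(d_separated(dag, "B", "E", set()))    # True: the empty set d-separates
print(d_separated(dag, "B", "E", {"A"}))    # False: A unblocks the chain
```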
13. D-Separation and Conditional Independence
For a DAG, G, the Markov condition entails all and only those conditional independencies that are identified by d-separation. This can be expressed by the following statements:
<X | Z | Y>_D => Ip(X, Y | Z) for all P that are Markov with respect to G.
If Ip(X, Y | Z) for all P that are Markov with respect to G, then <X | Z | Y>_D.
Note: a particular distribution P could have conditional independencies not implied by d-separation.
14. D-Separation: Examples
Consider again the following DAG:
Ip(X, A | T) and Ip(B, L | S, E). Note, however, that the DAG does not entail Ip(B, L | S, F), since F unblocks the chain B, F, E, L. Nor Ip(L, T | X), since X unblocks the chain L, E, T. Nor Ip(F, T | E), since E unblocks the chain F, B, S, L, E, T.
15. What do the Arrows Mean?
If there is no arrow between two variables, the Markov condition entails that the variables are conditionally independent given some set of variables. And if there is an arrow? Suppose we add an arrow to the following DAG, for which the probability distribution P already satisfies the Markov condition:
The difficulty is that P also satisfies the Markov condition with respect to the new DAG. So B and L are conditionally independent given S even though there is an arrow from B to L. Conditional independencies can exist which are not implied by d-separation.
16. The Faithfulness Condition (1)
Although d-separation tells us what conditional independencies are entailed by the Markov condition, it does not guarantee that those are the only conditional independencies in P. Ideally we would want an arrow to rule out any independencies between the variables concerned. To do this we introduce:
Faithfulness: A DAG, G, and probability distribution, P, satisfy the faithfulness condition if the Markov condition entails all and only the conditional independencies in P.
This tightens up the relationship between d-separation and conditional independence. Earlier we had: if Ip(X, Y | Z) for all P that are Markov with respect to G, then <X | Z | Y>_D. But if the faithfulness condition holds, this can be replaced by Ip(X, Y | Z) <=> <X | Z | Y>_D.
17. The Faithfulness Condition (2)
Consider the following DAG with binary variables:
P(a1) = 0.5
P(b1 | a1) = 1, P(b1 | a2) = 0
P(c1 | b1) = 1, P(c1 | b2) = 0
A is conditionally independent of B given C even though this is not implied by the Markov condition; thus faithfulness is violated. Other examples which violate faithfulness can also be found, such as Simpson's Paradox. However, for certain cases, it has been shown that the set of points in the parameter space violating faithfulness is a set of measure zero (Spirtes et al., 1993; Meek, 1995).
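The violation can be checked numerically: with these deterministic CPTs the joint puts mass only on (a1, b1, c1) and (a2, b2, c2), so conditioning on C fixes both A and B, making them trivially independent given C.

```python
# Build the joint for the chain A -> B -> C with the deterministic CPTs
# above, then verify I(A, B | C = c1) by direct factorization.
from itertools import product

p_a = {"a1": 0.5, "a2": 0.5}
p_b1_a = {"a1": 1.0, "a2": 0.0}   # P(b1 | a)
p_c1_b = {"b1": 1.0, "b2": 0.0}   # P(c1 | b)

joint = {}
for a, b, c in product(["a1", "a2"], ["b1", "b2"], ["c1", "c2"]):
    pb = p_b1_a[a] if b == "b1" else 1 - p_b1_a[a]
    pc = p_c1_b[b] if c == "c1" else 1 - p_c1_b[b]
    joint[(a, b, c)] = p_a[a] * pb * pc

# Compare P(a, b | c1) with P(a | c1) P(b | c1) for every (a, b).
pc1 = sum(v for (a, b, c), v in joint.items() if c == "c1")
for a, b in product(["a1", "a2"], ["b1", "b2"]):
    p_ab = joint[(a, b, "c1")] / pc1
    p_a_c = sum(joint[(a, bb, "c1")] for bb in ["b1", "b2"]) / pc1
    p_b_c = sum(joint[(aa, b, "c1")] for aa in ["a1", "a2"]) / pc1
    assert abs(p_ab - p_a_c * p_b_c) < 1e-12   # independence holds given C
```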
18. The Faithfulness Condition (3)
Consider the following DAG over binary variables A, B and C:
P(a1) = 2/3
P(b1 | a1) = 2/5, P(b1 | a2) = 1/5
P(c1 | a1, b1) = 2/3, P(c1 | a1, b2) = 5/9, P(c1 | a2, b1) = 2/3, P(c1 | a2, b2) = 5/6
Simpson's Paradox (see Spirtes et al., 2001), e.g. A = Gender, B = Treatment, C = Recovery. B and C are independent, but are not independent given a1 or a2. Since the DAG implies no independencies, this violates faithfulness. The distribution must be carefully chosen.
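The claimed pattern can be verified exactly with rational arithmetic: B and C are marginally independent under this distribution, but dependent once we condition on A.

```python
# Exact check of the Simpson's Paradox distribution above.
from fractions import Fraction as F
from itertools import product

p_a = {"a1": F(2, 3), "a2": F(1, 3)}
p_b1_a = {"a1": F(2, 5), "a2": F(1, 5)}
p_c1_ab = {("a1", "b1"): F(2, 3), ("a1", "b2"): F(5, 9),
           ("a2", "b1"): F(2, 3), ("a2", "b2"): F(5, 6)}

joint = {}
for a, b, c in product(["a1", "a2"], ["b1", "b2"], ["c1", "c2"]):
    pb = p_b1_a[a] if b == "b1" else 1 - p_b1_a[a]
    pc = p_c1_ab[(a, b)] if c == "c1" else 1 - p_c1_ab[(a, b)]
    joint[(a, b, c)] = p_a[a] * pb * pc

marg = lambda pred: sum(v for k, v in joint.items() if pred(k))

p_b1 = marg(lambda k: k[1] == "b1")
p_c1 = marg(lambda k: k[2] == "c1")
p_b1c1 = marg(lambda k: k[1] == "b1" and k[2] == "c1")
print(p_b1c1 == p_b1 * p_c1)                     # True: B, C independent

p_b1_a1 = marg(lambda k: k[0] == "a1" and k[1] == "b1") / p_a["a1"]
p_c1_a1 = marg(lambda k: k[0] == "a1" and k[2] == "c1") / p_a["a1"]
p_b1c1_a1 = marg(lambda k: k[0] == "a1" and k[1] == "b1"
                 and k[2] == "c1") / p_a["a1"]
print(p_b1c1_a1 == p_b1_a1 * p_c1_a1)            # False: dependent given a1
```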
19. Markov Equivalence
- Two DAGs are said to be Markov equivalent if they entail the same conditional independence relationships.
- Equivalent DAGs can be identified in the following way:
  - They have the same skeleton, i.e. arrows between all the same pairs of nodes, although not necessarily in the same direction
  - They have the same set of v-structures, i.e. two converging arrows whose tails are not connected by an arrow (Verma and Pearl, 1990)
- Example: The following DAGs are Markov equivalent:
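The Verma and Pearl criterion can be sketched directly: compare skeletons and v-structure sets (graphs below are illustrative).

```python
# Markov equivalence test: same skeleton and same v-structures.
def skeleton(dag):
    return {frozenset((u, v)) for u, ch in dag.items() for v in ch}

def v_structures(dag):
    parents = {}
    for u, ch in dag.items():
        for v in ch:
            parents.setdefault(v, set()).add(u)
    skel = skeleton(dag)
    vs = set()
    for node, pars in parents.items():
        for p1 in pars:
            for p2 in pars:
                # converging arrows whose tails are not adjacent
                if p1 < p2 and frozenset((p1, p2)) not in skel:
                    vs.add((p1, node, p2))
    return vs

def markov_equivalent(g1, g2):
    return (skeleton(g1) == skeleton(g2)
            and v_structures(g1) == v_structures(g2))

# X -> Y and Y -> X are equivalent ...
g1 = {"X": ["Y"], "Y": []}
g2 = {"Y": ["X"], "X": []}
print(markov_equivalent(g1, g2))            # True
# ... but a chain X -> Z -> Y is not equivalent to the collider X -> Z <- Y,
# even though the two share a skeleton.
chain = {"X": ["Z"], "Z": ["Y"], "Y": []}
collider = {"X": ["Z"], "Y": ["Z"], "Z": []}
print(markov_equivalent(chain, collider))   # False
```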
20. Examples of Markov Equivalence (1)
The following four DAGs are Markov equivalent and are said to form a Markov equivalence class.
The following four DAGs are not Markov equivalent to those given above:
21. Examples of Markov Equivalence (2)
Note that within this DAG the arrows S->B, S->L and L->X are permitted to change direction (with certain restrictions) while retaining Markov equivalence. However, it is not just two arrows meeting in a v-structure which are constrained to stay the same. In the following DAG, F->G cannot be reversed.
22. Learning Bayesian Networks
- There are two basic approaches to learning BNs from data:
  - Constraint-based approaches
    - perform statistical tests to discover conditional independencies
    - find a DAG which entails these relationships
  - Scoring-based approaches
    - use scoring functions to compare DAGs, e.g. Bayesian, BIC, MDL, MML
    - select the DAG that best fits the data
- It is generally assumed that Markov equivalence places a restriction on our learning procedures, since DAGs in the same equivalence class are statistically indistinguishable.
23. The Bayesian Approach (1)
(Cooper and Herskovits, 1992) Uses data to make probabilistic inferences about conditional independencies; this allows for better modelling of uncertainty.
M: a discrete variable whose states, m, are the possible DAG structures. Uncertainty about M is represented by a distribution, P(m).
Theta_m: a continuous vector-valued variable for model m, whose values, theta_m, are the possible values of the parameters. Uncertainty about Theta_m is represented by a probability density function, P(theta_m | m).
24. The Bayesian Approach (2)
Given a data set D, the posterior probability of a DAG structure, m, given D is

P(m | D) = P(D | m) P(m) / P(D)

where

P(D | m) = Integral of P(D | theta_m, m) P(theta_m | m) d(theta_m)

is called the marginal likelihood. Note that if the priors P(m) are equal, then

P(m1 | D) > P(m2 | D)  <=>  P(D | m1) > P(D | m2)

Thus, by maximizing the marginal likelihood we maximize the posterior probability.
25. The Bayesian Approach (3)
Cooper and Herskovits (1992) showed that the marginal likelihood is

P(D | m) = prod_{i=1}^{n} prod_{j=1}^{q_i} [ Gamma(alpha_ij) / Gamma(alpha_ij + N_ij) ] prod_{k=1}^{r_i} [ Gamma(alpha_ijk + N_ijk) / Gamma(alpha_ijk) ]

n = number of nodes
q_i = number of configurations of the parents of node X_i
r_i = number of possible values of X_i
alpha_ijk = priors in the Dirichlet distribution (with alpha_ij = sum_k alpha_ijk)
N_ijk = counts obtained directly from the data (with N_ij = sum_k N_ijk)

P(D | m) is known as the Bayesian scoring criterion.
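With uniform Dirichlet priors (alpha_ijk = 1, so alpha_ij = r_i), the criterion simplifies to products of factorials, which are best computed in log space. A sketch (the data layout is an assumption of this example):

```python
# Cooper-Herskovits marginal likelihood with alpha_ijk = 1, in log space.
from math import exp, lgamma

def log_marginal_likelihood(counts):
    """counts[i][j] = list over k of the counts N_ijk for node i,
    parent configuration j."""
    log_p = 0.0
    for node_counts in counts:          # product over nodes i
        for n_ijk in node_counts:       # product over parent configurations j
            r_i = len(n_ijk)            # number of values of X_i
            n_ij = sum(n_ijk)
            # Gamma(r_i) / Gamma(r_i + N_ij), since alpha_ij = r_i here
            log_p += lgamma(r_i) - lgamma(r_i + n_ij)
            # prod_k Gamma(1 + N_ijk) / Gamma(1) = prod_k N_ijk!
            log_p += sum(lgamma(1 + n) for n in n_ijk)
    return log_p

# Single binary node with no parents, observed 3 times in state 1 and
# twice in state 2: P(D | m) = Gamma(2)/Gamma(7) * 3! * 2! = 12/720 = 1/60.
print(exp(log_marginal_likelihood([[[3, 2]]])))
```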
26. An Example (1)
Consider the following DAG, m1, and data, D:
X = 1: individual graduated college; Y = 1: individual was divorced by age 50
We want to obtain P(D | m1).
For Y (i = 1) we have q_i = 2 (since X = 1 or X = 2) and r_i = 2. The term for j = 1 is computed from the counts of Y = 1 and Y = 2 among the cases with X = 1.
[Data table and intermediate calculation not shown.]
After calculating the other terms we obtain P(D | m1) = 7.22 x 10^-6.
(R.E. Neapolitan, Learning Bayesian Networks, 2004)
27. An Example (2)
m1 can be taken as a representative of the Markov equivalence class of DAGs which specify no independence between X and Y.
If m2 is the DAG with no arrows, we obtain P(D | m2) = 6.75 x 10^-6.
If we further assume that P(m1) = P(m2) = 0.5, then we can conclude that m1 is more probable than m2 given the data. By Bayes' theorem:

P(m1 | D) = P(D | m1) P(m1) / [P(D | m1) P(m1) + P(D | m2) P(m2)] = 7.22 / (7.22 + 6.75) ~ 0.517
28. The Need for Search Algorithms
Ideally we would search the space of all DAGs exhaustively and find the DAG which maximizes the Bayesian scoring criterion. However, for a large (not that large!) number of nodes this becomes infeasible:

No. of nodes   No. of DAGs
     1         1
     2         3
     3         25
     4         543
     5         29,281
    10         4.2 x 10^18

Various heuristic search algorithms have been designed to tackle the problem.
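The table can be reproduced with Robinson's recurrence for the number of labelled DAGs on n nodes, a(n) = sum_{k=1}^{n} (-1)^(k+1) C(n, k) 2^(k(n-k)) a(n-k):

```python
# Robinson's recurrence for counting labelled DAGs.
from math import comb

def num_dags(n, _cache={0: 1}):
    if n not in _cache:
        _cache[n] = sum((-1) ** (k + 1) * comb(n, k)
                        * 2 ** (k * (n - k)) * num_dags(n - k)
                        for k in range(1, n + 1))
    return _cache[n]

for n in (1, 2, 3, 4, 5, 10):
    print(n, num_dags(n))   # 1, 3, 25, 543, 29281, ~4.2e18
```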
29. K2 Algorithm (1)
(Cooper and Herskovits, 1992) Assume an ordering of the n variables, X1, X2, ..., Xn, so that Xj cannot be a parent of Xi if j > i.
For X2:
- Calculate the Bayesian score for the case where X2 has no parents.
- Calculate the Bayesian score for the case where X2 has X1 as a parent. If this is greater than the first case, add an arrow from X1 to X2.
For Xi:
- Calculate the score for the case where Xi has no parents.
- Calculate the score for the cases where Xi has one parent. If any of these are greater than the case with no parents, select the node Xj which gives the maximum and add an arrow from Xj to Xi.
- Now try adding a second parent, and continue the process until no further node increases the score.
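The greedy loop above can be sketched against an abstract score(node, parents) function (in Cooper and Herskovits' formulation this would be the Bayesian score; here it is a parameter):

```python
# K2 greedy structure search over a fixed variable ordering.
def k2(order, score, max_parents=None):
    """order: variables listed so that order[j] may only parent order[i]
    when j < i. Returns {variable: set of chosen parents}."""
    dag = {}
    for i, x in enumerate(order):
        parents = set()
        best = score(x, parents)
        improved = True
        while improved and (max_parents is None or len(parents) < max_parents):
            improved = False
            candidates = [y for y in order[:i] if y not in parents]
            if not candidates:
                break
            # Try each admissible predecessor as one extra parent.
            y = max(candidates, key=lambda y: score(x, parents | {y}))
            new = score(x, parents | {y})
            if new > best:       # keep the addition only if the score rises
                parents.add(y)
                best = new
                improved = True
        dag[x] = parents
    return dag

# Toy score (illustrative) that rewards recovering Z's parents {X, Y}
# with a small penalty per parent; K2 then adds exactly those arrows.
true_parents = {"Z": {"X", "Y"}}
score = lambda x, ps: len(ps & true_parents.get(x, set())) - 0.1 * len(ps)
print(k2(["X", "Y", "Z"], score))   # Z gets parents {X, Y}
```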
30. K2 Algorithm (2)
Assume an ordering X, Y, Z.
[Figure: the greedy search tree over candidate DAGs, levels 1-4.]
31. Summary
- Abductive inference can be carried out in BNs by modifying the propagation algorithm in the junction tree
- It is generally assumed that the best explanation is the most probable one, although other accounts of "best" could be explored
- D-separation enables conditional independencies to be read off a DAG
- Faithfulness is an important assumption often made in learning causal structure
- It is generally assumed that causal learning is possible only up to Markov equivalence
- Overview of the Bayesian approach to learning structure
32Some Books
R.G. Cowell, A.P. Dawid, S.L. Lauritzen and D.J.
Spiegelhalter (1999) Probablistic Networks and
Expert Systems, Springer C. Glymour and G.F.
Cooper (eds.) (1999) Computation, Causation and
Discovery, MIT Press F.V. Jensen (2001) Bayesian
Networks and Decision Graphs, Springer M.I.
Jordan (ed.) (1999) Learning in Graphical Models,
MIT Press K.E. Korb and A.E. Nicholson (2003)
Bayesian Artificial Intelligence, Chapman and
Hall S.L. Lauritzen (1996) Graphical Models,
Oxford V.R. McKim and S. Turner (eds.)(1997)
Causality in Crisis, University of Notre Dame
Press R.E. Neapolitan (1990), Probabilistic
Reasoning in Expert Systems, Wiley R.E.
Neapolitan (2004), Learning Bayesian Networks,
Prentice Hall J. Pearl (1988) Probabilistic
Reasoning in Intelligent Systems, Morgan
Kaufmann J. Pearl (2000) Causality, Cambridge P.
Spirtes, C. Glymour and P. Scheines (2001)
Causation, Prediction and Search, MIT Press J.
Whittaker (1990) Graphical Models in Applied
Multivariate Statistics, Wiley
33Some Software / Websites
http//www.bayesware.com/ http//www.cs.ualberta.c
a/jcheng/bnpc.htm http//bnj.sourceforge.net/ htt
p//www.hugin.com/ http//www-2.cs.cmu.edu/javaba
yes/Home/ http//research.microsoft.com/adapt/MSBN
x/ http//www.norsys.com/ http//reasoning.cs.ucla
.edu/samiam/ http//www.phil.cmu.edu/projects/tetr
ad/ http//www.staff.city.ac.uk/rgc/webpages/xbpa
ge.html http//www.mrc-bsu.cam.ac.uk/bugs/
34. Papers Referred to in Slides
- G.F. Cooper and E. Herskovits (1992) Machine Learning, 9, 309-47
- A.P. Dawid (1992) Statistics and Computing, 2, 25-36
- R. Dechter (1996) Proceedings of the 12th Conference on Uncertainty in AI, 211-9
- F.V. Jensen, S.L. Lauritzen and K.G. Olesen (1990) Computational Statistics Quarterly, 4, 269-82
- S.L. Lauritzen and D.J. Spiegelhalter (1988) Journal of the Royal Statistical Society, Series B, 50, 157-224
- C. Meek (1995) Proceedings of the 11th Conference on Uncertainty in AI, 411-18
- D. Nilsson (1998) Statistics and Computing, 8, 159-73
- T. Verma and J. Pearl (1990) Proceedings of the 6th Conference on Uncertainty in AI, 220-7