Graphical Models: An Introduction

Provided by: get79

Learn more at: http://www.cs.umd.edu

1
Graphical Models An Introduction

Lise Getoor
Computer Science Dept
University of Maryland
http://www.cs.umd.edu/getoor

2
Reading List for Next Lecture

Learning Probabilistic Relational Models, L. Getoor, N. Friedman, D. Koller, A. Pfeffer.
Invited contribution to the book Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001.
http://www.cs.umd.edu/getoor/Publications/lprm-ch.ps
http://www.cs.umd.edu/class/spring2005/cmsc828g/Readings/lprm-ch.pdf
Probabilistic Models for Relational Data, David Heckerman, Christopher Meek and Daphne Koller.
http://www.cs.umd.edu/projects/srl2004/Papers/heckerman.pdf
ftp://ftp.research.microsoft.com/pub/tr/TR-2004-30.pdf

3
Graphical Models

e.g. Bayesian networks, Bayes nets, Belief nets,
Markov networks, HMMs, Dynamic Bayes nets, etc.
Themes
representation
reasoning
learning
Materials based on upcoming book by Nir Friedman
and Daphne Koller.
Slides based on material from Nir Friedman.

4
Probability Distributions

Let $X_1, \ldots, X_p$ be discrete random variables
Let $P$ be a joint distribution over $X_1, \ldots, X_p$
If the variables are binary, then we need $O(2^p)$
parameters to describe $P$
Can we do better?
Key idea: use properties of independence

5
Independent Random Variables

Two variables X and Y are independent if
$P(X = x \mid Y = y) = P(X = x)$ for all values x, y
That is, learning the value of Y does not change
the prediction of X
If X and Y are independent then
$P(X, Y) = P(X \mid Y) P(Y) = P(X) P(Y)$
In general, if $X_1, \ldots, X_p$ are independent, then
$P(X_1, \ldots, X_p) = P(X_1) \cdots P(X_p)$
Requires $O(p)$ parameters

6
Conditional Independence

Unfortunately, most random variables of
interest are not independent of each other
A more suitable notion is that of conditional
independence
Two variables X and Y are conditionally
independent given Z if
$P(X = x \mid Y = y, Z = z) = P(X = x \mid Z = z)$ for all values
x, y, z
That is, learning the value of Y does not change
the prediction of X once we know the value of Z
Notation: $I(X; Y \mid Z)$

7
Example: Naïve Bayesian Model

A common model in early diagnosis
Symptoms are conditionally independent given the
disease (or fault)
Thus, if
$X_1, \ldots, X_p$ denote the symptoms exhibited by
the patient (headache, high fever, etc.) and
H denotes the hypothesis about the patient's
health
then $P(X_1, \ldots, X_p, H) = P(H) P(X_1 \mid H) \cdots P(X_p \mid H)$
This naïve Bayesian model allows compact
representation
It does embody strong independence assumptions
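
As a minimal sketch of the naïve Bayesian factorization, the posterior over the hypothesis H can be computed by multiplying the prior by one term per symptom and normalizing. The function name, the two hypotheses, and every probability below are hypothetical values chosen only for illustration.

```python
def naive_bayes_posterior(prior, likelihoods, observed):
    """Return P(H | x_1..x_p) under P(H) * prod_i P(x_i | H).

    prior:       dict h -> P(H = h)
    likelihoods: list of dicts, likelihoods[i][h][x] = P(X_i = x | H = h)
    observed:    list of observed symptom values x_1..x_p
    """
    joint = {}
    for h, p_h in prior.items():
        p = p_h
        for cpd, x in zip(likelihoods, observed):
            p *= cpd[h][x]          # one factor per symptom
        joint[h] = p                # this is P(x_1..x_p, H = h)
    total = sum(joint.values())     # P(x_1..x_p)
    return {h: p / total for h, p in joint.items()}


# Hypothetical numbers: two hypotheses, two binary symptoms.
prior = {"flu": 0.1, "healthy": 0.9}
likelihoods = [
    {"flu": {True: 0.8, False: 0.2}, "healthy": {True: 0.1, False: 0.9}},    # headache
    {"flu": {True: 0.7, False: 0.3}, "healthy": {True: 0.05, False: 0.95}},  # high fever
]
posterior = naive_bayes_posterior(prior, likelihoods, [True, True])
```

With p binary symptoms the model needs 2p + 1 numbers for a binary H, compared with the $2^{p+1} - 1$ needed for an unrestricted joint table.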

8
Graphical Models

A graph is a language for representing independencies
Directed Acyclic Graph -> Bayesian Network
Undirected Graph -> Markov Network

9
DAGs: Markov Assumption

We now make this independence assumption more
precise for directed acyclic graphs (DAGs)
Each random variable X is independent of its
non-descendants, given its parents Pa(X)
Formally, $I(X; \mathrm{NonDesc}(X) \mid \mathrm{Pa}(X))$

(Figure labels: Ancestor, Parent, Non-descendant, Descendant)
10
Markov Assumption Example

In this example
$I(E; B)$
$I(B; \{E, R\})$
$I(R; \{A, B, C\} \mid E)$
$I(A; R \mid B, E)$
$I(C; \{B, E, R\} \mid A)$
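
The Markov independencies of a DAG can be listed mechanically. The sketch below assumes the example network has edges B -> A, E -> A, E -> R and A -> C, which is the structure implied by the factorization P(B) P(E) P(R | E) P(A | B, E) P(C | A) used later in the deck; the function name and the dictionary encoding are illustrative choices.

```python
def markov_independencies(parents):
    """For each node X return (non-descendants excluding parents, parents).

    parents: dict node -> list of parent nodes (must describe a DAG).
    Each entry (S, Pa) stands for the statement I(X; S | Pa).
    """
    children = {n: set() for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].add(n)

    def descendants(n):
        seen, stack = set(), list(children[n])
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(children[c])
        return seen

    result = {}
    for n, ps in parents.items():
        non_desc = set(parents) - descendants(n) - {n} - set(ps)
        result[n] = (non_desc, set(ps))
    return result


# Assumed structure of the earthquake example network.
earthquake = {"B": [], "E": [], "A": ["B", "E"], "R": ["E"], "C": ["A"]}
markov = markov_independencies(earthquake)
```

Here `markov["R"]` is `({"A", "B", "C"}, {"E"})`, that is, R is independent of A, B and C given E.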

11
I-Maps

A DAG G is an I-Map of a distribution P if
all the Markov assumptions implied by G are satisfied
by P
(Assuming G and P both use the same set of random
variables)
Examples

12
Factorization

Given that G is an I-Map of P, can we simplify
the representation of P?
Example
Since $I(X; Y)$, we have that $P(X \mid Y) = P(X)$
Applying the chain rule, $P(X, Y) = P(X \mid Y) P(Y) = P(X) P(Y)$
Thus, we have a simpler representation of $P(X, Y)$

13
Factorization Theorem
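Thm: if G is an I-Map of P, then
$P(X_1, \ldots, X_n) = \prod_i P(X_i \mid \mathrm{Pa}(X_i))$

Proof: by the chain rule, $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid X_1, \ldots, X_{i-1})$, where we assume $X_1, \ldots, X_n$ is an ordering consistent with G

From this assumption, $\mathrm{Pa}(X_i) \subseteq \{X_1, \ldots, X_{i-1}\}$ and $\{X_1, \ldots, X_{i-1}\} - \mathrm{Pa}(X_i) \subseteq \mathrm{NonDesc}(X_i)$

Since G is an I-Map, $I(X_i; \mathrm{NonDesc}(X_i) \mid \mathrm{Pa}(X_i))$

We conclude, $P(X_i \mid X_1, \ldots, X_{i-1}) = P(X_i \mid \mathrm{Pa}(X_i))$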

14
Factorization Example

$P(C, A, R, E, B) = P(B) P(E \mid B) P(R \mid E, B) P(A \mid R, B, E) P(C \mid A, R, B, E)$

versus $P(C, A, R, E, B) = P(B) P(E) P(R \mid E) P(A \mid B, E) P(C \mid A)$
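
To make the saving concrete, the sketch below builds the joint distribution from the five local factors and counts parameters for binary variables. All CPD numbers are hypothetical and serve only to show that the product of local conditionals is a proper joint distribution.

```python
from itertools import product

# Hypothetical CPD numbers, chosen only for illustration.
P_B = {True: 0.01, False: 0.99}                      # P(B)
P_E = {True: 0.02, False: 0.98}                      # P(E)
P_R_given_E = {True: 0.9, False: 0.001}              # P(R = True | E)
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}  # P(A = True | B, E)
P_C_given_A = {True: 0.7, False: 0.05}               # P(C = True | A)


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(c, a, r, e, b):
    """P(C, A, R, E, B) = P(B) P(E) P(R | E) P(A | B, E) P(C | A)."""
    return (P_B[b] * P_E[e]
            * bernoulli(P_R_given_E[e], r)
            * bernoulli(P_A_given_BE[(b, e)], a)
            * bernoulli(P_C_given_A[a], c))


total = sum(joint(*values) for values in product([True, False], repeat=5))

# Free parameters for binary variables:
full_chain_params = sum(2 ** k for k in range(5))  # 1 + 2 + 4 + 8 + 16
factored_params = 1 + 1 + 2 + 4 + 2                # B, E, R|E, A|B,E, C|A
```

The unrestricted chain-rule expansion needs 31 numbers, while the factored form needs 10.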
15
Consequences

We can write P in terms of local conditional
probabilities
If G is sparse,
that is, $|\mathrm{Pa}(X_i)| < k$,
=> each conditional probability can be specified
compactly
e.g. for binary variables, these require $O(2^k)$
params.
=> representation of P is compact
linear in number of variables

16
DAGs: Summary

The Markov independencies of a DAG G
$I(X_i; \mathrm{NonDesc}(X_i) \mid Pa_i)$
G is an I-Map of a distribution P
if P satisfies the Markov independencies implied
by G
If G is an I-Map of P, then
$P(X_1, \ldots, X_n) = \prod_i P(X_i \mid Pa_i)$

17
Conditional Independencies

Let Markov(G) be the set of Markov Independencies
implied by G
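The factorization theorem shows
G is an I-Map of P => $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid Pa_i)$
We can also show the opposite
Thm
$P(X_1, \ldots, X_n) = \prod_i P(X_i \mid Pa_i)$ => G is an I-Map of P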

18
Implied Independencies

Does a graph G imply additional independencies as
a consequence of Markov(G)?
We can define a logic of independence statements
Some axioms
$I(X; Y \mid Z) \Rightarrow I(Y; X \mid Z)$
$I(X; Y_1, Y_2 \mid Z) \Rightarrow I(X; Y_1 \mid Z)$

19
d-separation

A procedure d-sep(X; Y | Z, G) that, given a DAG
G and sets X, Y, and Z, returns either yes or no
Goal
d-sep(X; Y | Z, G) = yes iff $I(X; Y \mid Z)$ follows
from Markov(G)

20
Paths

Intuition: dependency must flow along paths in
the graph
A path is a sequence of neighboring variables
Examples
R <- E -> A <- B
C <- A <- E -> R

21
Paths

We want to know when a path is
active -- creates dependency between end nodes
blocked -- cannot create dependency between end nodes
We want to classify situations in which paths are
active.

22
Path Blockage

Three cases
Common cause

23
Path Blockage

Three cases
Common cause
Intermediate cause

24
Path Blockage

Three cases
Common cause
Intermediate cause
Common Effect

25
Path Blockage -- General Case

A path is active, given evidence Z, if
Whenever we have the configuration A -> B <- C, B or one
of its descendants is in Z
No other nodes in the path are in Z
A path is blocked, given evidence Z, if it is not
active.

26
Example

d-sep(R; B)?

(Figure: example network over E, B, A, R, C)
27
Example

d-sep(R; B) = yes
d-sep(R; B | A)?

(Figure: example network over E, B, A, R, C)
28
Example

d-sep(R; B) = yes
d-sep(R; B | A) = no
d-sep(R; B | E, A)?

(Figure: example network over E, B, A, R, C)
29
d-Separation

X is d-separated from Y, given Z, if all paths
from a node in X to a node in Y are blocked,
given Z.
Checking d-separation can be done efficiently
(linear time in number of edges)
Bottom-up phase: mark all nodes whose
descendants are in Z
X to Y phase: traverse (BFS) all edges on paths
from X to Y and check if they are blocked
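
The two phases can be sketched as a reachability search over (node, direction) pairs. This is one common way to implement the check, not necessarily the exact procedure the slides had in mind; the function name, the single-node arguments `x` and `y`, and the assumed example network (B -> A, E -> A, E -> R, A -> C) are illustrative.

```python
def d_separated(parents, x, y, z):
    """Return True if node x is d-separated from node y given the set z.

    parents: dict node -> list of parents (a DAG).
    """
    z = set(z)
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    # Bottom-up phase: mark Z and every node that has a descendant in Z.
    marked, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in marked:
            marked.add(n)
            stack.extend(parents[n])

    # X-to-Y phase: follow only active trails.
    # "up" means we arrived at the node from one of its children,
    # "down" means we arrived from one of its parents.
    frontier, visited, reachable = [(x, "up")], set(), set()
    while frontier:
        n, direction = frontier.pop()
        if (n, direction) in visited:
            continue
        visited.add((n, direction))
        if n not in z:
            reachable.add(n)
        if direction == "up" and n not in z:
            frontier.extend((p, "up") for p in parents[n])
            frontier.extend((c, "down") for c in children[n])
        elif direction == "down":
            if n not in z:                    # chain: keep going down
                frontier.extend((c, "down") for c in children[n])
            if n in marked:                   # common effect activated by evidence
                frontier.extend((p, "up") for p in parents[n])
    return y not in reachable


# Assumed structure of the example network.
net = {"B": [], "E": [], "A": ["B", "E"], "R": ["E"], "C": ["A"]}
r_b = d_separated(net, "R", "B", set())
r_b_given_a = d_separated(net, "R", "B", {"A"})
r_b_given_ea = d_separated(net, "R", "B", {"E", "A"})
```

On this network the three queries from the example slides come out as yes, no, and yes respectively: observing A activates the common effect at A, and observing E then blocks the only path again.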

30
Soundness

Thm
If
G is an I-Map of P
d-sep(X; Y | Z, G) = yes
then
P satisfies $I(X; Y \mid Z)$
Informally,
Any independence reported by d-separation is
satisfied by the underlying distribution

31
Completeness

Thm
If d-sep(X; Y | Z, G) = no
then there is a distribution P such that
G is an I-Map of P
P does not satisfy $I(X; Y \mid Z)$
Informally,
Any independence not reported by d-separation
might be violated by the underlying distribution
We cannot determine this by examining the graph
structure alone

32
I-Maps revisited

The fact that G is an I-Map of P might not be that
useful
For example, complete DAGs
A DAG G is complete if we cannot add an arc
without creating a cycle
These DAGs do not imply any independencies
Thus, they are I-Maps of any distribution

33
Minimal I-Maps

A DAG G is a minimal I-Map of P if
G is an I-Map of P
If $G' \subset G$, then $G'$ is not an I-Map of P
Removing any arc from G introduces
(conditional) independencies that do not hold in P

34
Minimal I-Map Example

If (figure: a DAG) is a
minimal I-Map
Then, these (figure: DAGs with an arc removed) are not I-Maps

35
Constructing minimal I-Maps

The factorization theorem suggests an algorithm
Fix an ordering $X_1, \ldots, X_n$
For each i,
select $Pa_i$ to be a minimal subset of $\{X_1, \ldots, X_{i-1}\}$
such that $I(X_i; \{X_1, \ldots, X_{i-1}\} - Pa_i \mid Pa_i)$
Clearly, the resulting graph is a minimal I-Map.

36
Non-uniqueness of minimal I-Map

Unfortunately, there may be several minimal
I-Maps for the same distribution
Applying I-Map construction procedure with
different orders can lead to different structures

(Figure: the original I-Map, and the I-Map constructed with order C, R, A, E, B)
37
Choosing Ordering & Causality

The choice of order can have a drastic impact on
the complexity of the minimal I-Map
Heuristic argument: construct the I-Map using a causal
ordering among variables
Justification?
It is often reasonable to assume that graphs of
causal influence should satisfy the Markov
properties.

38
P-Maps

A DAG G is a P-Map (perfect map) of a distribution
P if
$I(X; Y \mid Z)$ if and only if d-sep(X; Y | Z, G) =
yes
Notes
A P-Map captures all the independencies in the
distribution
P-Maps are unique, up to DAG equivalence

39
P-Maps

Unfortunately, some distributions do not have a
P-Map

40
Bayesian Networks

A Bayesian network specifies a probability
distribution via two components
A DAG G
A collection of conditional probability
distributions $P(X_i \mid Pa_i)$
The joint distribution P is defined by the
factorization $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid Pa_i)$
Additional requirement: G is a minimal I-Map of P

42
DAGs and BNs

DAGs as a representation of conditional
independencies
Markov independencies of a DAG
Tight correspondence between Markov(G) and the
factorization defined by G
d-separation, a sound and complete procedure for
computing the consequences of the independencies
Notion of minimal I-Map
P-Maps
This theory is the basis for defining Bayesian
networks

43
Undirected Graphs: Markov Networks

Alternative representation of conditional
independencies
Let U be an undirected graph
Let $N_i$ be the set of neighbors of $X_i$
Define Markov(U) to be the set of
independencies $I(X_i; \{X_1, \ldots, X_n\} - N_i - \{X_i\} \mid N_i)$
U is an I-Map of P if P satisfies Markov(U)

44
Example

This graph implies that
$I(A; C \mid B, D)$
$I(B; D \mid A, C)$
Note: this example does not have a directed P-Map

(Figure: undirected cycle A - B - C - D - A)
45
Markov Network Factorization

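Thm: if
P is strictly positive, that is $P(x_1, \ldots, x_n) > 0$
for all assignments
then
U is an I-Map of P
if and only if
there is a factorization $P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_i f_i(C_i)$
where $C_1, \ldots, C_k$ are the maximal cliques in U
Alternative form: $P(X_1, \ldots, X_n) = \frac{1}{Z} \exp\left(\sum_i g_i(C_i)\right)$ with $g_i = \log f_i$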

46
Relationship between Directed & Undirected Models
(Figure labels: Chain Graphs, Directed Graphs, Undirected Graphs)
47
CPDs

So far, we focused on how to represent
independencies using DAGs
The other component of a Bayesian network is
the specification of the conditional probability
distributions (CPDs)
Here, we'll just discuss the simplest
representation of CPDs

48
Tabular CPDs

When the variables of interest are all discrete,
the common representation is as a table
For example, $P(C \mid A, B)$ can be represented by a table with one row for each joint assignment to A and B

49
Tabular CPDs

Pros
Very flexible, can capture any CPD of discrete
variables
Can be easily stored and manipulated
Cons
Representation size grows exponentially with the
number of parents!
Unwieldy to assess probabilities for more than
a few parents

50
Continuous CPDs

When X is a continuous variable, we need to
represent the density of X, given any value of
its parents
Gaussian
Conditional Gaussian

51
CPDs: Summary

Many choices for representing CPDs
Any statistical model of conditional
distribution can be used
e.g., any regression model
Representing structure in CPDs can have
implications on independencies among variables

52
Inference in Bayesian Networks
53
Inference

We now have compact representations of
probability distributions
Bayesian Networks
Markov Networks
A network describes a unique probability
distribution P
How do we answer queries about P?
Inference is the name for the process of computing
answers to such queries

54
Queries: Likelihood

There are many types of queries we might ask.
Most of these involve evidence
An evidence e is an assignment of values to a set
E of variables in the domain
Without loss of generality $E = \{X_{k+1}, \ldots, X_n\}$
Simplest query: compute the probability of
evidence, $P(e) = \sum_{x_1} \cdots \sum_{x_k} P(x_1, \ldots, x_k, e)$
This is often referred to as computing the
likelihood of the evidence

55
Queries: A posteriori belief

Often we are interested in the conditional
probability of a variable given the evidence, $P(X \mid e)$
This is the a posteriori belief in X, given
evidence e
A related task is computing the term $P(X, e)$
i.e., the likelihood of e and X = x for values
of X
we can recover the a posteriori belief by $P(X = x \mid e) = \frac{P(X = x, e)}{\sum_{x'} P(X = x', e)}$

56
A posteriori belief

This query is useful in many cases
Prediction: what is the probability of an outcome
given the starting condition
Target is a descendant of the evidence
Diagnosis: what is the probability of
disease/fault given symptoms
Target is an ancestor of the evidence
Note: the direction between variables does not
restrict the directions of the queries
Probabilistic inference can combine evidence from
all parts of the network

57
Queries: MAP

In this query we want to find the maximum a
posteriori assignment for some variables of
interest (say $X_1, \ldots, X_l$)
That is, $x_1, \ldots, x_l$ maximize the probability
$P(x_1, \ldots, x_l \mid e)$
Note that this is equivalent to
maximizing $P(x_1, \ldots, x_l, e)$

58
Queries: MAP

We can use MAP for
Classification
find most likely label, given the evidence
Explanation
What is the most likely scenario, given the
evidence

59
Queries: MAP

Cautionary note
The MAP depends on the set of variables
Example
MAP of X
MAP of (X, Y)
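
A small numeric illustration of the cautionary note: the most likely value of X alone need not agree with the X component of the most likely joint assignment to (X, Y). The joint table below is hypothetical.

```python
# Hypothetical joint distribution P(X, Y).
joint = {
    ("x0", "y0"): 0.35,
    ("x0", "y1"): 0.05,
    ("x1", "y0"): 0.30,
    ("x1", "y1"): 0.30,
}

# Marginalize out Y to get P(X).
marginal_x = {}
for (x, y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p

map_x = max(marginal_x, key=marginal_x.get)   # MAP of X
map_xy = max(joint, key=joint.get)            # MAP of (X, Y)
```

Here P(x1) = 0.6 makes x1 the MAP of X, yet the single most likely joint assignment is (x0, y0) with probability 0.35.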

60
Complexity of Inference

Thm
Computing $P(X = x)$ in a Bayesian network is
NP-hard
Not surprising, since we can simulate Boolean
gates.

61
Hardness

Hardness does not mean we cannot solve inference
It implies that (unless P = NP) we cannot find a general
procedure that works efficiently for all networks
For particular families of networks, we can have
provably efficient procedures

62
Approaches to inference

Exact inference
Inference in Simple Chains
Variable elimination
Clustering / join tree algorithms
Approximate inference
Stochastic simulation / sampling methods
Markov chain Monte Carlo methods
Mean field theory

63
Inference in Simple Chains
(Figure: chain X1 -> X2)

How do we compute $P(X_2)$?
$P(x_2) = \sum_{x_1} P(x_1) P(x_2 \mid x_1)$

64
Inference in Simple Chains (cont.)
(Figure: chain X1 -> X2 -> X3)

How do we compute $P(X_3)$?
We already know how to compute $P(X_2)$...
$P(x_3) = \sum_{x_2} P(x_2) P(x_3 \mid x_2)$

65
Inference in Simple Chains (cont.)
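(Figure: chain X1 -> X2 -> X3 -> ... -> Xn)

How do we compute $P(X_n)$?
Compute $P(X_1), P(X_2), P(X_3), \ldots$
We compute each term by using the previous one
$P(x_{i+1}) = \sum_{x_i} P(x_i) P(x_{i+1} \mid x_i)$

Complexity
Each step costs $O(|\mathrm{Val}(X_i)| \cdot |\mathrm{Val}(X_{i+1})|)$
operations
Compare to naïve evaluation, which requires
summing over joint values of n-1 variables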
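
The forward iteration can be sketched in a few lines. The representation (lists indexed by state, one transition table per edge of the chain) and the numbers are assumptions made for the example.

```python
def forward_marginals(prior, transitions):
    """Return [P(X1), P(X2), ..., P(Xn)] for a chain X1 -> X2 -> ... -> Xn.

    prior[x]             = P(X1 = x)
    transitions[i][x][y] = P(X_{i+2} = y | X_{i+1} = x)
    """
    marginals = [list(prior)]
    for table in transitions:
        prev = marginals[-1]
        # P(y) = sum_x P(x) P(y | x): one pass costs |Val(X_i)| * |Val(X_{i+1})|.
        nxt = [sum(prev[x] * table[x][y] for x in range(len(prev)))
               for y in range(len(table[0]))]
        marginals.append(nxt)
    return marginals


# Hypothetical two-state chain X1 -> X2 -> X3 with the same table on each edge.
prior = [0.6, 0.4]
T = [[0.7, 0.3],
     [0.2, 0.8]]
marginals = forward_marginals(prior, [T, T])
```

For these numbers the marginals are P(X2) = [0.5, 0.5] and P(X3) = [0.45, 0.55]; the total work grows linearly with the length of the chain.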

66
Inference in Simple Chains (cont.)
(Figure: chain X1 -> X2)

Suppose that we observe the value of $X_2 = x_2$
How do we compute $P(X_1 \mid x_2)$?
Recall that it suffices to compute $P(X_1, x_2) = P(x_2 \mid X_1) P(X_1)$

67
Inference in Simple Chains (cont.)
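(Figure: chain X1 -> X2 -> X3)

Suppose that we observe the value of $X_3 = x_3$
How do we compute $P(X_1, x_3)$?
$P(x_1, x_3) = P(x_1) P(x_3 \mid x_1)$
How do we compute $P(x_3 \mid x_1)$?
$P(x_3 \mid x_1) = \sum_{x_2} P(x_2 \mid x_1) P(x_3 \mid x_2)$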

68
Inference in Simple Chains (cont.)
(Figure: chain X1 -> X2 -> X3 -> ... -> Xn)

Suppose that we observe the value of $X_n = x_n$
How do we compute $P(X_1, x_n)$?
We compute $P(x_n \mid x_{n-1}), P(x_n \mid x_{n-2}), \ldots$ iteratively

69
Inference in Simple Chains (cont.)
(Figure: chain X1 -> X2 -> ... -> Xk -> ... -> Xn)

Suppose that we observe the value of $X_n = x_n$
We want to find $P(X_k \mid x_n)$
How do we compute $P(X_k, x_n) = P(X_k) P(x_n \mid X_k)$?
We compute $P(X_k)$ by forward iterations
We compute $P(x_n \mid X_k)$ by backward iterations
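
A sketch of combining the two passes on a chain: the forward pass gives P(X_k), the backward pass gives P(x_n | X_k), and their product is P(X_k, x_n). The helper names, the list-based tables, and the numbers are illustrative assumptions.

```python
def forward(prior, transitions):
    """fwd[k][x] = P(X_{k+1} = x) for a chain with the given edge tables."""
    out = [list(prior)]
    for table in transitions:
        out.append([sum(out[-1][x] * table[x][y] for x in range(len(table)))
                    for y in range(len(table[0]))])
    return out


def backward(transitions, x_n):
    """bwd[k][x] = P(X_n = x_n | X_{k+1} = x); the last entry is an indicator."""
    size = len(transitions[-1][0])
    out = [[1.0 if y == x_n else 0.0 for y in range(size)]]
    for table in reversed(transitions):
        out.insert(0, [sum(table[x][y] * out[0][y] for y in range(len(table[0])))
                       for x in range(len(table))])
    return out


# Hypothetical chain X1 -> X2 -> X3, evidence X3 = 0, query X2.
prior = [0.6, 0.4]
T = [[0.7, 0.3],
     [0.2, 0.8]]
transitions = [T, T]
fwd = forward(prior, transitions)
bwd = backward(transitions, x_n=0)

k = 1                                                  # zero-based index of X2
joint_k = [fwd[k][x] * bwd[k][x] for x in range(2)]    # P(X2, x3)
posterior_k = [p / sum(joint_k) for p in joint_k]      # P(X2 | x3)
```

Summing `joint_k` recovers P(x3), which matches the forward marginal of X3 at the observed value, a useful consistency check.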

70
Elimination in Chains

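We now try to understand the simple chain example
using first principles
(Figure: chain A -> B -> C -> D -> E)
Using the definition of probability, we have
$P(e) = \sum_d \sum_c \sum_b \sum_a P(a, b, c, d, e)$

71
Elimination in Chains

By chain decomposition, we get
$P(e) = \sum_d \sum_c \sum_b \sum_a P(a) P(b \mid a) P(c \mid b) P(d \mid c) P(e \mid d)$

72
Elimination in Chains

Rearranging terms ...
$P(e) = \sum_d \sum_c \sum_b P(c \mid b) P(d \mid c) P(e \mid d) \sum_a P(a) P(b \mid a)$

73
Elimination in Chains

Now we can perform the innermost summation
$P(e) = \sum_d \sum_c \sum_b P(c \mid b) P(d \mid c) P(e \mid d) P(b)$
This summation is exactly the first step in the
forward iteration we described before

74
Elimination in Chains

Rearranging and then summing again, we get
$P(e) = \sum_d \sum_c P(d \mid c) P(e \mid d) \sum_b P(c \mid b) P(b) = \sum_d \sum_c P(d \mid c) P(e \mid d) P(c)$
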
75
Elimination in Chains with Evidence

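Similarly, we understand the backward pass
We write the query in explicit form
$P(a, e) = \sum_b \sum_c \sum_d P(a) P(b \mid a) P(c \mid b) P(d \mid c) P(e \mid d)$

76
Elimination in Chains with Evidence

Eliminating d, we get
$P(a, e) = \sum_b \sum_c P(a) P(b \mid a) P(c \mid b) \sum_d P(d \mid c) P(e \mid d) = \sum_b \sum_c P(a) P(b \mid a) P(c \mid b) P(e \mid c)$

77
Elimination in Chains with Evidence

Eliminating c, we get
$P(a, e) = \sum_b P(a) P(b \mid a) \sum_c P(c \mid b) P(e \mid c) = \sum_b P(a) P(b \mid a) P(e \mid b)$

78
Elimination in Chains with Evidence

Finally, we eliminate b
$P(a, e) = P(a) \sum_b P(b \mid a) P(e \mid b) = P(a) P(e \mid a)$
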
79
Variable Elimination

General idea
Write query in the form
$P(X_1, e) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_2} \prod_i P(x_i \mid pa_i)$
Iteratively
Move all irrelevant terms outside of innermost
sum
Perform innermost sum, getting a new term
Insert the new term into the product
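
The iterative procedure can be sketched with factors stored as (variables, table) pairs. This is a minimal illustration that assumes all variables are binary with values 0 and 1; the function names and the small chain used to exercise it are hypothetical.

```python
from itertools import product


def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = tuple(dict.fromkeys(f_vars + g_vars))   # ordered union
    out = {}
    for vals in product([0, 1], repeat=len(out_vars)):
        a = dict(zip(out_vars, vals))
        out[vals] = (f_tab[tuple(a[v] for v in f_vars)]
                     * g_tab[tuple(a[v] for v in g_vars)])
    return out_vars, out


def sum_out(f, var):
    """Sum a factor over all values of `var`."""
    f_vars, f_tab = f
    keep = tuple(v for v in f_vars if v != var)
    out = {}
    for vals, p in f_tab.items():
        key = tuple(x for v, x in zip(f_vars, vals) if v != var)
        out[key] = out.get(key, 0.0) + p
    return keep, out


def eliminate(factors, order):
    """Eliminate the variables in `order`; return the product of what is left."""
    for var in order:
        touching = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        if not touching:
            continue
        prod = touching[0]                 # only factors mentioning var
        for f in touching[1:]:             # take part in the innermost sum
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, var)]   # insert the new term
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result


# Hypothetical chain A -> B -> C with binary variables.
P_A = (("A",), {(0,): 0.6, (1,): 0.4})
P_B_A = (("A", "B"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
P_C_B = (("B", "C"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
result_vars, result_table = eliminate([P_A, P_B_A, P_C_B], order=["A", "B"])
```

Only the factors that mention the eliminated variable are multiplied together at each step, which is what "moving irrelevant terms outside the innermost sum" amounts to in code.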

80
A More Complex Example

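Asia network

We want to compute P(d)
Need to eliminate v, s, x, t, l, a, b
Initial factors
$P(v) P(s) P(t \mid v) P(l \mid s) P(b \mid s) P(a \mid t, l) P(x \mid a) P(d \mid a, b)$

Eliminate v
Compute $f_v(t) = \sum_v P(v) P(t \mid v)$
$\Rightarrow f_v(t) P(s) P(l \mid s) P(b \mid s) P(a \mid t, l) P(x \mid a) P(d \mid a, b)$
Note: $f_v(t) = P(t)$. In general, the result of
elimination is not necessarily a probability term

83

We want to compute P(d)
Need to eliminate s, x, t, l, a, b

Eliminate s
Compute $f_s(b, l) = \sum_s P(s) P(b \mid s) P(l \mid s)$
$\Rightarrow f_v(t) f_s(b, l) P(a \mid t, l) P(x \mid a) P(d \mid a, b)$
Summing on s results in a factor with two
arguments, $f_s(b, l)$. In general, the result of
elimination may be a function of several variables

84

We want to compute P(d)
Need to eliminate x, t, l, a, b

Eliminate x
Compute $f_x(a) = \sum_x P(x \mid a)$
$\Rightarrow f_v(t) f_s(b, l) f_x(a) P(a \mid t, l) P(d \mid a, b)$
Note: $f_x(a) = 1$ for all values of a

85

We want to compute P(d)
Need to eliminate t, l, a, b

Eliminate t
Compute $f_t(a, l) = \sum_t f_v(t) P(a \mid t, l)$
$\Rightarrow f_s(b, l) f_x(a) f_t(a, l) P(d \mid a, b)$

86

We want to compute P(d)
Need to eliminate l, a, b

Eliminate l
Compute $f_l(a, b) = \sum_l f_s(b, l) f_t(a, l)$
$\Rightarrow f_l(a, b) f_x(a) P(d \mid a, b)$

87

We want to compute P(d)
Need to eliminate a, b

Eliminate a, b
Compute $f_a(b, d) = \sum_a f_l(a, b) f_x(a) P(d \mid a, b)$ and $f_b(d) = \sum_b f_a(b, d)$
$\Rightarrow f_b(d)$
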
88
Variable Elimination

We now understand variable elimination as a
sequence of rewriting operations
Actual computation is done in elimination step
Exactly the same computation procedure applies to
Markov networks
Computation depends on order of elimination

89
Dealing with evidence

How do we deal with evidence?
Suppose we get evidence V = t, S = f, D = t
We want to compute P(L, V = t, S = f, D = t)

90
Dealing with Evidence

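We start by writing the factors
$P(V) P(S) P(T \mid V) P(L \mid S) P(B \mid S) P(A \mid T, L) P(X \mid A) P(D \mid A, B)$
Since we know that V = t, we don't need to
eliminate V
Instead, we can replace the factors $P(V)$ and
$P(T \mid V)$ with $f_{P(V)} = P(V = t)$ and $f_{P(T \mid V)}(T) = P(T \mid V = t)$
These select the appropriate parts of the
original factors given the evidence
Note that $f_{P(V)}$ is a constant, and thus does not
appear in elimination of other variables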

91
Dealing with Evidence

Given evidence V = t, S = f, D = t
Compute P(L, V = t, S = f, D = t)
Initial factors, after setting evidence
$f_{P(V)} \, f_{P(S)} \, f_{P(T \mid V)}(T) \, f_{P(L \mid S)}(L) \, f_{P(B \mid S)}(B) \, P(A \mid T, L) \, P(X \mid A) \, f_{P(D \mid A, B)}(A, B)$
Eliminating x, we get
$f_{P(V)} \, f_{P(S)} \, f_{P(T \mid V)}(T) \, f_{P(L \mid S)}(L) \, f_{P(B \mid S)}(B) \, P(A \mid T, L) \, f_x(A) \, f_{P(D \mid A, B)}(A, B)$
Eliminating t, we get
$f_{P(V)} \, f_{P(S)} \, f_{P(L \mid S)}(L) \, f_{P(B \mid S)}(B) \, f_t(A, L) \, f_x(A) \, f_{P(D \mid A, B)}(A, B)$
Eliminating a, we get
$f_{P(V)} \, f_{P(S)} \, f_{P(L \mid S)}(L) \, f_{P(B \mid S)}(B) \, f_a(B, L)$
Eliminating b, we get
$f_{P(V)} \, f_{P(S)} \, f_{P(L \mid S)}(L) \, f_b(L)$

96
Complexity of variable elimination

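Suppose in one elimination step we compute
$f_x(y_1, \ldots, y_k) = \sum_x f'_x(x, y_1, \ldots, y_k)$, where $f'_x(x, y_1, \ldots, y_k) = \prod_{i=1}^{m} f_i(x, \mathbf{y}_i)$ and each $\mathbf{y}_i \subseteq \{y_1, \ldots, y_k\}$
This requires
$m \cdot |\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(Y_i)|$ multiplications
For each value of $x, y_1, \ldots, y_k$, we do m
multiplications
$|\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(Y_i)|$ additions
For each value of $y_1, \ldots, y_k$, we do $|\mathrm{Val}(X)|$
additions
Complexity is exponential in the number of variables
in the intermediate factor!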

97
Understanding Variable Elimination

We want to select good elimination orderings
that reduce complexity
We start by attempting to understand variable
elimination via the graph we are working with
This will reduce the problem of finding a good
ordering to a graph-theoretic operation that is
well-understood

98
Undirected graph representation

At each stage of the procedure, we have an
algebraic term that we need to evaluate
In general this term is of the form
$P(x_1, \ldots, x_k) = \sum_{y_1} \cdots \sum_{y_n} \prod_i f_i(Z_i)$
where the $Z_i$ are sets of variables
We now plot a graph where there is an undirected
edge X--Y if X, Y are arguments of some factor
that is, if X, Y are in some $Z_i$
Note: this is the Markov network that describes
the probability on the variables we did not
eliminate yet

99
Chordal Graphs

elimination ordering => undirected chordal graph
Graph
Maximal cliques are factors in elimination
Factors in elimination are cliques in the graph
Complexity is exponential in size of the largest
clique in graph

100
Induced Width

The size of the largest clique in the induced
graph is thus an indicator for the complexity of
variable elimination
This quantity is called the induced width of a
graph according to the specified ordering
Finding a good ordering for a graph is equivalent
to finding the minimal induced width of the graph
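
As a sketch of how an ordering determines the cost, the function below simulates elimination on an undirected graph, connecting the neighbors of each eliminated node, and reports the size of the largest clique it creates. It follows the slide's wording and returns the clique size itself; note that a common convention in the literature subtracts one. The function name and the star-shaped example graph are illustrative.

```python
def induced_width(edges, order):
    """Size of the largest clique formed while eliminating nodes in `order`."""
    neighbors = {v: set() for v in order}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    largest = 0
    for v in order:
        nbrs = neighbors[v]
        largest = max(largest, len(nbrs) + 1)   # v together with its neighbors
        for a in nbrs:
            neighbors[a] |= nbrs - {a}          # fill-in edges among neighbors
            neighbors[a].discard(v)             # v leaves the graph
        del neighbors[v]
    return largest


# Hypothetical star graph: center "c" connected to three leaves.
edges = [("c", "l1"), ("c", "l2"), ("c", "l3")]
leaves_first = induced_width(edges, ["l1", "l2", "l3", "c"])
center_first = induced_width(edges, ["c", "l1", "l2", "l3"])
```

Eliminating the leaves first never creates a clique larger than 2, while eliminating the center first immediately creates a clique over all four nodes.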

101
General Networks

From graph theory
Thm
Finding an ordering that minimizes the induced
width is NP-Hard
However,
There are reasonable heuristics for finding a
relatively good ordering
There are provable approximations to the best
induced width
If the graph has a small induced width, there are
algorithms that find it in polynomial time

102
Elimination on Trees

Formally, for any tree, there is an elimination
ordering with induced width 1
Thm
Inference on trees is linear in number of
variables

103
PolyTrees

A polytree is a network where there is at most
one path from one variable to another
Thm
Inference in a polytree is linear in the
representation size of the network
This assumes tabular CPT representation

105
Learning Bayesian Networks
106
Learning Bayesian networks
Inducer
107
Known Structure -- Complete Data
E, B, A
<Y,N,N>
<Y,Y,Y>
<N,N,Y>
<N,Y,Y>
...
<N,Y,Y>
Inducer

Network structure is specified
Inducer needs to estimate parameters
Data does not contain missing values

108
Unknown Structure -- Complete Data
E, B, A
<Y,N,N>
<Y,Y,Y>
<N,N,Y>
<N,Y,Y>
...
<N,Y,Y>
Inducer

Network structure is not specified
Inducer needs to select arcs and estimate
parameters
Data does not contain missing values

109
Known Structure -- Incomplete Data
E, B, A
<Y,N,N>
<Y,?,Y>
<N,N,Y>
<N,Y,?>
...
<?,Y,Y>
Inducer

Network structure is specified
Data contains missing values
We consider assignments to missing values

110
Known Structure / Complete Data

Given a network structure G
And choice of parametric family for $P(X_i \mid Pa_i)$
Learn parameters for network
Goal
Construct a network that is closest to the
probability distribution that generated the data

111
Learning Parameters for a Bayesian Network

Training data has the form

112
Learning Parameters for a Bayesian Network

Since we assume i.i.d. samples, the likelihood
function is

113
Learning Parameters for a Bayesian Network

By definition of network, we get

114
Learning Parameters for a Bayesian Network

Rewriting terms, we get

115
General Bayesian Networks

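Generalizing for any Bayesian network
$L(\Theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \Theta)$ (i.i.d. samples)
$= \prod_i \prod_m P(x_i[m] \mid pa_i[m] : \Theta_i)$ (network factorization)
$= \prod_i L_i(\Theta_i : D)$
The likelihood decomposes according to the
structure of the network.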
116
General Bayesian Networks (Cont.)

Decomposition
=> Independent Estimation Problems
If the parameters for each family are not
related, then they can be estimated independently
of each other.

117
From Binomial to Multinomial

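For example, suppose X can have the values
1, 2, ..., K
We want to learn the parameters $\theta_1, \theta_2, \ldots, \theta_K$
Sufficient statistics
$N_1, N_2, \ldots, N_K$ - the number of times each outcome
is observed
Likelihood function: $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$
MLE: $\hat{\theta}_k = \frac{N_k}{\sum_l N_l}$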

118
Likelihood for Multinomial Networks

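When we assume that $P(X_i \mid Pa_i)$ is multinomial,
we get further decomposition
$L_i(\Theta_i : D) = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}$

119
Likelihood for Multinomial Networks

When we assume that $P(X_i \mid Pa_i)$ is multinomial,
we get further decomposition
$L_i(\Theta_i : D) = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}$
For each value $pa_i$ of the parents of $X_i$ we get an
independent multinomial problem
The MLE is $\hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}$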
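
A minimal sketch of maximum-likelihood estimation for one family from complete data: count how often each child value occurs with each parent configuration and divide by the count of that parent configuration. The function name and the row format are assumptions; the five rows reuse the E, B, A tuples shown on the complete-data slides.

```python
from collections import Counter


def mle_cpt(data, child, parents):
    """Estimate P(child | parents) as N(x_i, pa_i) / N(pa_i)."""
    joint_counts = Counter()
    parent_counts = Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint_counts[(row[child], pa)] += 1
        parent_counts[pa] += 1
    return {(x, pa): n / parent_counts[pa]
            for (x, pa), n in joint_counts.items()}


# Rows in the order E, B, A, taken from the example data on the slides.
data = [
    {"E": "Y", "B": "N", "A": "N"},
    {"E": "Y", "B": "Y", "A": "Y"},
    {"E": "N", "B": "N", "A": "Y"},
    {"E": "N", "B": "Y", "A": "Y"},
    {"E": "N", "B": "Y", "A": "Y"},
]
cpt = mle_cpt(data, child="A", parents=["E", "B"])
```

Parent configurations that never occur in the data get no estimate at all, and configurations seen once get probabilities of exactly 0 or 1, which is one motivation for the Bayesian treatment with priors.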

120
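Bayesian Approach: Dirichlet Priors

Recall that the likelihood function is $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$
A Dirichlet prior with hyperparameters $\alpha_1, \ldots, \alpha_K$
is defined as
$P(\Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$ for legal
$\theta_1, \ldots, \theta_K$
Then the posterior has the same form, with
hyperparameters $\alpha_1 + N_1, \ldots, \alpha_K + N_K$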

121
Dirichlet Priors (cont.)

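We can compute the prediction on a new event in
closed form
If $P(\Theta)$ is Dirichlet with hyperparameters $\alpha_1, \ldots, \alpha_K$
then $P(X[1] = k) = \int \theta_k P(\Theta) \, d\Theta = \frac{\alpha_k}{\sum_l \alpha_l}$
Since the posterior is also Dirichlet, we get $P(X[M+1] = k \mid D) = \frac{\alpha_k + N_k}{\sum_l (\alpha_l + N_l)}$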
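
The closed-form prediction is simple enough to show side by side with the maximum-likelihood estimate. The sketch below assumes the standard Dirichlet-multinomial result, in which the predictive probability of outcome k is its prior pseudo-count plus its observed count, divided by the total; the counts and hyperparameters are hypothetical.

```python
def mle(counts):
    """Maximum-likelihood estimate N_k / sum_l N_l."""
    total = sum(counts)
    return [n / total for n in counts]


def dirichlet_predictive(alpha, counts):
    """Predictive probability (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]


counts = [3, 0, 1]        # hypothetical observations of a 3-valued variable
alpha = [1, 1, 1]         # hypothetical uniform prior, equivalent sample size 3
theta_mle = mle(counts)
theta_bayes = dirichlet_predictive(alpha, counts)
```

The MLE assigns probability 0 to the unseen outcome, while the Bayesian prediction gives it 1/7; as the counts grow, the two estimates approach each other.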

122
Prior Knowledge

The hyperparameters $\alpha_1, \ldots, \alpha_K$ can be thought of as
imaginary counts from our prior experience
Equivalent sample size = $\alpha_1 + \cdots + \alpha_K$
The larger the equivalent sample size the more
confident we are in our prior

123
Conjugate Families

The property that the posterior distribution
follows the same parametric form as the prior
distribution is called conjugacy
Dirichlet prior is a conjugate family for the
multinomial likelihood
Conjugate families are useful since
For many distributions we can represent them with
hyperparameters
They allow for sequential update within the same
representation
In many cases we have closed-form solution for
prediction

124
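Bayesian Prediction (cont.)

Given these observations, we can compute the
posterior for each multinomial $\theta_{X_i \mid pa_i}$
independently
The posterior is Dirichlet with parameters
$\alpha(X_i = 1 \mid pa_i) + N(X_i = 1 \mid pa_i), \ldots, \alpha(X_i = k \mid pa_i) + N(X_i = k \mid pa_i)$
The predictive distribution is then represented
by the parameters $\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$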

125
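Learning Parameters: Summary

Estimation relies on sufficient statistics
For multinomial these are of the form $N(x_i, pa_i)$
Parameter estimation
MLE: $\hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}$
Bayesian (Dirichlet): $\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$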
Bayesian methods also require choice of priors
Both MLE and Bayesian are asymptotically
equivalent and consistent
Both can be implemented in an on-line manner by
accumulating sufficient statistics

126
Learning Structure from Complete Data
127
Benefits of Learning Structure

Efficient learning -- more accurate models with
less data
Compare P(A) and P(B) vs. joint P(A,B)
Discover structural properties of the domain
Ordering of events
Relevance
Identifying independencies => faster inference
Predict effect of actions
Involves learning causal relationship among
variables

128
Why Struggle for Accurate Structure?
Missing an arc
Cannot be compensated by accurate fitting of
parameters
Also misses causality and domain structure

Adding an arc
Increases the number of parameters to be fitted
Wrong assumptions about causality and domain
structure

129
Approaches to Learning Structure

Constraint based
Perform tests of conditional independence
Search for a network that is consistent with the
observed dependencies and independencies
Pros
Intuitive, follows closely the construction of
BNs
Separates structure learning from the form of the
independence tests
Cons
Sensitive to errors in individual tests

130
Approaches to Learning Structure

Score based
Define a score that evaluates how well the
(in)dependencies in a structure match the
observations
Search for a structure that maximizes the score
Pros
Statistically motivated
Can make compromises
Takes the structure of conditional probabilities
into account
Cons
Computationally hard

131
Likelihood Score for Structures

First cut approach
Use likelihood function
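Recall, the likelihood score for a network
structure and parameters is
$L(G, \Theta_G : D) = \prod_m P(x_1[m], \ldots, x_n[m] : G, \Theta_G)$
Since we know how to maximize over parameters, from now on
we assume $L(G : D) = \max_{\Theta_G} L(G, \Theta_G : D)$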

132
Likelihood Score for Structure (cont.)

Rearranging terms
$l(G : D) = \log L(G : D) = M \sum_i \left( I(X_i; Pa_i^G) - H(X_i) \right)$
where
H(X) is the entropy of X
I(X; Y) is the mutual information between X and Y
I(X; Y) measures how much information each
variable provides about the other
$I(X; Y) \geq 0$
$I(X; Y) = 0$ iff X and Y are independent
$I(X; Y) = H(X)$ iff X is totally predictable given
Y
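
The properties of mutual information listed above can be checked numerically with empirical estimates. The sketch below uses the standard definitions of entropy and mutual information with base-2 logarithms; the function names and the toy samples are illustrative.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical entropy H(X) in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(pairs):
    """Empirical mutual information I(X; Y) in bits from (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())


independent = [(0, 0), (0, 1), (1, 0), (1, 1)]   # X tells nothing about Y
determined = [(0, 0), (1, 1), (0, 0), (1, 1)]    # X is a function of Y

mi_independent = mutual_information(independent)
mi_determined = mutual_information(determined)
h_x = entropy([x for x, _ in determined])
```

For the independent sample the mutual information is 0, and for the fully determined sample it equals the entropy of X, which is 1 bit here.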

133
Likelihood Score for Structure (cont.)

Good news
Intuitive explanation of likelihood score
The larger the dependency of each variable on its
parents, the higher the score
Likelihood as a compromise among dependencies,
based on their strength

134
Likelihood Score for Structure (cont.)

Bad news
Adding arcs always helps
$I(X; Y) \leq I(X; Y, Z)$
Maximal score attained by fully connected
networks
Such networks can overfit the data ---
parameters capture the noise in the data

135
Avoiding Overfitting

Classic issue in learning.
Approaches
Restricting the hypotheses space
Limits the overfitting capability of the learner
Example: restrict # of parents or # of parameters
Minimum description length
Description length measures complexity
Prefer models that compactly describe the
training data
Bayesian methods
Average over all possible parameter values
Use prior knowledge

136
Bayesian Inference

Bayesian reasoning: compute expectation over
unknown G
$P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid G, D) P(G \mid D)$
Assumption: the Gs are mutually exclusive and
exhaustive
We know how to compute $P(x[M+1] \mid G, D)$
Same as prediction with fixed structure
How do we compute $P(G \mid D)$?

137
Posterior Score
Using Bayes rule
$P(G \mid D) = \frac{P(D \mid G) P(G)}{P(D)}$
$P(D \mid G)$: marginal likelihood
$P(G)$: prior over structures
$P(D)$: probability of data; it is the same for all
structures G and can be ignored when comparing
structures
138
Marginal Likelihood

By introduction of variables, we have that
$P(D \mid G) = \int P(D \mid G, \Theta) P(\Theta \mid G) \, d\Theta$
This integral measures sensitivity to choice of
parameters

139
Marginal Likelihood for General Network

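The marginal likelihood has the form
$P(D \mid G) = \prod_i \prod_{pa_i^G} \frac{\Gamma(\alpha(pa_i^G))}{\Gamma(\alpha(pa_i^G) + N(pa_i^G))} \prod_{x_i} \frac{\Gamma(\alpha(x_i, pa_i^G) + N(x_i, pa_i^G))}{\Gamma(\alpha(x_i, pa_i^G))}$
where
N(..) are the counts from the data
$\alpha(..)$ are the hyperparameters for each family
given G

Each inner term is a Dirichlet marginal likelihood for the sequence of
values of $X_i$ when $X_i$'s parents have a particular
value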
140
Priors

We need prior counts $\alpha(..)$ for each network
structure G
This can be a formidable task
There are exponentially many structures

141
BDe Score

Possible solution: the BDe prior
Represent the prior using two elements $M_0$, $B_0$
M0 - equivalent sample size
B0 - network representing the prior probability
of events

142
BDe Score

Intuition: $M_0$ prior examples distributed by $B_0$
Set $\alpha(x_i, pa_i^G) = M_0 \, P(x_i, pa_i^G \mid B_0)$
Note that $pa_i^G$ are not the same as the parents of
$X_i$ in $B_0$.
Compute $P(x_i, pa_i^G \mid B_0)$ using standard inference
procedures
Such priors have desirable theoretical properties
Equivalent networks are assigned the same score

143
Bayesian Score: Asymptotic Behavior

Theorem: If the prior $P(\Theta \mid G)$ is well-behaved,
then $\log P(D \mid G) = l(G : D) - \frac{\log M}{2} \dim(G) + O(1)$

144
Asymptotic Behavior: Consequences

Bayesian score is consistent
As $M \to \infty$ the true structure $G^*$ maximizes the
score (almost surely)
For sufficiently large M, the maximal scoring
structures are equivalent to $G^*$
Observed data eventually overrides prior
information
Assuming that the prior assigns positive
probability to all cases

145
Asymptotic Behavior

This score can also be justified by the Minimum
Description Length (MDL) principle
This equation explicitly shows the tradeoff
between
Fitness to data --- likelihood term
Penalty for complexity --- regularization term
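
The tradeoff can be sketched numerically. The code assumes the commonly used MDL/BIC form, log-likelihood minus (log M / 2) times the number of free parameters, and counts parameters for binary variables only; the function names, the example structures, and the log-likelihood value are hypothetical.

```python
from math import log


def dimension(parents):
    """Free parameters of a network over binary variables: 2^|Pa| per node."""
    return sum(2 ** len(ps) for ps in parents.values())


def mdl_score(log_likelihood, num_params, num_samples):
    """Likelihood term minus the complexity penalty (log M / 2) * dim(G)."""
    return log_likelihood - 0.5 * log(num_samples) * num_params


# Assumed sparse example structure versus a fully connected DAG on 5 nodes.
sparse = {"B": [], "E": [], "A": ["B", "E"], "R": ["E"], "C": ["A"]}
dense = {"B": [], "E": ["B"], "R": ["B", "E"],
         "A": ["B", "E", "R"], "C": ["B", "E", "R", "A"]}

M = 100
same_fit = -100.0   # hypothetical: suppose both structures fit equally well
sparse_score = mdl_score(same_fit, dimension(sparse), M)
dense_score = mdl_score(same_fit, dimension(dense), M)
```

With equal fit the sparse structure (10 parameters) scores higher than the complete one (31 parameters); a denser structure must gain enough likelihood to pay for its extra parameters.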

146
Scores -- Summary

Likelihood, MDL, (log) BDe have the form
$\mathrm{Score}(G : D) = \sum_i \mathrm{Score}(X_i \mid Pa_i^G : N(X_i, Pa_i))$
BDe requires assessing a prior network. It can
naturally incorporate prior knowledge and
previous experience
BDe is consistent and asymptotically equivalent
(up to a constant) to MDL
All are score-equivalent
G equivalent to G' => Score(G) = Score(G')

147
Optimization Problem

Input
Training data
Scoring function (including priors, if needed)
Set of possible structures
Including prior knowledge about structure
Output
A network (or networks) that maximize the score
Key Property
Decomposability: the score of a network is a sum
of terms.

148
Heuristic Search

We address the problem by using heuristic search
Define a search space
nodes are possible structures
edges denote adjacency of structures
Traverse this space looking for high-scoring
structures
Search techniques
Greedy hill-climbing
Best first search
Simulated Annealing
...

149
Heuristic Search (cont.)

Typical operations

Add C -> D
Reverse C -> E
Delete C -> E
150
Exploiting Decomposability in Local Search

Caching: to update the score after a local
change, we only need to re-score the families
that were changed in the last move

151
Greedy Hill-Climbing

Simplest heuristic local search
Start with a given network
empty network
best tree
a random network
At each iteration
Evaluate all possible changes
Apply change that leads to best improvement in
score
Reiterate
Stop when no modification improves score
Each step requires evaluating approximately n new
changes

152
Greedy Hill-Climbing: Possible Pitfalls

Greedy Hill-Climbing can get stuck in
Local Maxima
All one-edge changes reduce the score
Plateaus
Some one-edge changes leave the score unchanged
Happens because equivalent networks received the
same score and are neighbors in the search space
Both occur during structure search
Standard heuristics can escape both
Random restarts
TABU search

153
Search Summary

Discrete optimization problem
In general, NP-Hard
Need to resort to heuristic search
In practice, search is relatively fast (100 vars
in 10 min)
Decomposability
Sufficient statistics
In some cases, we can reduce the search problem
to an easy optimization problem
Example learning trees

154
Graphical Models Intro Summary

Representations
Graphs are a cool way to put constraints on
distributions, so that you can say lots of stuff
without even looking at the numbers!
Inference
GMs let you compute all kinds of different
probabilities efficiently
Learning
You can even learn them auto-magically!
