Title: Graphical Models: An Introduction
1Graphical Models: An Introduction
- Lise Getoor
- Computer Science Dept
- University of Maryland
- http://www.cs.umd.edu/getoor
2Reading List for Next Lecture
- Learning Probabilistic Relational Models, L. Getoor, N. Friedman, D. Koller, A. Pfeffer.
- Invited contribution to the book Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001.
- http://www.cs.umd.edu/getoor/Publications/lprm-ch.ps
- http://www.cs.umd.edu/class/spring2005/cmsc828g/Readings/lprm-ch.pdf
- Probabilistic Models for Relational Data, David Heckerman, Christopher Meek and Daphne Koller
- http://www.cs.umd.edu/projects/srl2004/Papers/heckerman.pdf
- ftp://ftp.research.microsoft.com/pub/tr/TR-2004-30.pdf
3Graphical Models
- e.g. Bayesian networks, Bayes nets, Belief nets, Markov networks, HMMs, Dynamic Bayes nets, etc.
- Themes
- representation
- reasoning
- learning
- Materials based on upcoming book by Nir Friedman and Daphne Koller.
- Slides based on material from Nir Friedman.
4Probability Distributions
- Let $X_1, \dots, X_p$ be discrete random variables
- Let P be a joint distribution over $X_1, \dots, X_p$
- If the variables are binary, then we need $O(2^p)$ parameters to describe P
- Can we do better?
- Key idea: use properties of independence
5Independent Random Variables
- Two variables X and Y are independent if
- $P(X = x \mid Y = y) = P(X = x)$ for all values x, y
- That is, learning the value of Y does not change the prediction of X
- If X and Y are independent then
- $P(X, Y) = P(X \mid Y) P(Y) = P(X) P(Y)$
- In general, if $X_1, \dots, X_p$ are independent, then
- $P(X_1, \dots, X_p) = P(X_1) \cdots P(X_p)$
- Requires O(p) parameters
6Conditional Independence
- Unfortunately, most random variables of interest are not independent of each other
- A more suitable notion is that of conditional independence
- Two variables X and Y are conditionally independent given Z if
- $P(X = x \mid Y = y, Z = z) = P(X = x \mid Z = z)$ for all values x, y, z
- That is, learning the value of Y does not change the prediction of X once we know the value of Z
- Notation: $I(X; Y \mid Z)$
7Example Naïve Bayesian Model
- A common model in early diagnosis
- Symptoms are conditionally independent given the disease (or fault)
- Thus, if
- $X_1, \dots, X_p$ denote the symptoms exhibited by the patient (headache, high fever, etc.) and
- H denotes the hypothesis about the patient's health
- then $P(X_1, \dots, X_p, H) = P(H) P(X_1 \mid H) \cdots P(X_p \mid H)$
- This naïve Bayesian model allows compact representation
- It does embody strong independence assumptions
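As a rough illustration of the naïve Bayesian factorization, the sketch below computes the joint as P(H) times a product of per-symptom terms and normalizes over H to get a posterior. The disease names, symptom names, and all probabilities are made-up assumptions, not values from the slides.

```python
# Hypothetical numbers for illustration only; they do not come from the slides.
prior = {"flu": 0.1, "healthy": 0.9}                 # P(H)
# P(X_i = True | H) for each binary symptom X_i
p_symptom = {
    "headache": {"flu": 0.8, "healthy": 0.1},
    "fever": {"flu": 0.7, "healthy": 0.05},
}

def joint(h, observed):
    """P(H = h, X_1 = x_1, ..., X_p = x_p) under the naive Bayes factorization."""
    p = prior[h]
    for symptom, present in observed.items():
        q = p_symptom[symptom][h]
        p *= q if present else 1.0 - q
    return p

def posterior(observed):
    """P(H | observed symptoms), obtained by normalizing the joint over H."""
    scores = {h: joint(h, observed) for h in prior}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

post = posterior({"headache": True, "fever": True})
print(post)
```

With p binary symptoms and a binary hypothesis, this representation needs 2p + 1 numbers, compared with 2^(p+1) - 1 for an unconstrained joint table.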
8Graphical Models
- A graph is a language for representing independencies
- Directed acyclic graph -> Bayesian network
- Undirected graph -> Markov network
9DAGS Markov Assumption
- We now make this independence assumption more precise for directed acyclic graphs (DAGs)
- Each random variable X is independent of its non-descendants, given its parents Pa(X)
- Formally, $I(X; \mathrm{NonDesc}(X) \mid \mathrm{Pa}(X))$
[Figure: a DAG around X with nodes labeled Ancestor, Parent, Non-descendant, Descendant]
10Markov Assumption Example
- In this example
- $I(E; B)$
- $I(B; \{E, R\})$
- $I(R; \{A, B, C\} \mid E)$
- $I(A; R \mid B, E)$
- $I(C; \{B, E, R\} \mid A)$
11I-Maps
- A DAG G is an I-Map of a distribution P if all the Markov assumptions implied by G are satisfied by P
- (Assuming G and P both use the same set of random variables)
- Examples
12Factorization
- Given that G is an I-Map of P, can we simplify the representation of P?
- Example
- Since $I(X; Y)$, we have that $P(X \mid Y) = P(X)$
- Applying the chain rule: $P(X, Y) = P(X \mid Y) P(Y) = P(X) P(Y)$
- Thus, we have a simpler representation of P(X,Y)
13Factorization Theorem
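Thm: if G is an I-Map of P, then $P(X_1, \dots, X_n) = \prod_i P(X_i \mid \mathrm{Pa}(X_i))$
- By the chain rule, $P(X_1, \dots, X_n) = \prod_i P(X_i \mid X_1, \dots, X_{i-1})$, with $X_1, \dots, X_n$ ordered so that parents precede their children
- Since G is an I-Map, $I(X_i; \mathrm{NonDesc}(X_i) \mid \mathrm{Pa}(X_i))$
- We conclude $P(X_i \mid X_1, \dots, X_{i-1}) = P(X_i \mid \mathrm{Pa}(X_i))$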
14Factorization Example
- $P(C, A, R, E, B) = P(B) P(E \mid B) P(R \mid E, B) P(A \mid R, B, E) P(C \mid A, R, B, E)$
versus
- $P(C, A, R, E, B) = P(B) P(E) P(R \mid E) P(A \mid B, E) P(C \mid A)$
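To make the saving concrete, the sketch below counts free parameters for the factored form P(B) P(E) P(R|E) P(A|B,E) P(C|A) against an unconstrained joint. It assumes all five variables are binary, which the slides do not state for this example; the function name is illustrative.

```python
# Parents of each variable, read off the factorization
# P(B) P(E) P(R|E) P(A|B,E) P(C|A). Binary variables are an assumption.
parents = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

def free_parameters(parent_map):
    """Each binary variable needs one free parameter per joint parent assignment."""
    return sum(2 ** len(pa) for pa in parent_map.values())

factored = free_parameters(parents)      # 1 + 1 + 2 + 4 + 2
full_joint = 2 ** len(parents) - 1       # unconstrained joint over 5 binary variables
print(factored, full_joint)
```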
15Consequences
- We can write P in terms of local conditional probabilities
- If G is sparse,
- that is, $|\mathrm{Pa}(X_i)| < k$,
- => each conditional probability can be specified compactly
- e.g. for binary variables, these require $O(2^k)$ params.
- => representation of P is compact
- linear in number of variables
16DAGS Summary
- The Markov independencies of a DAG G
- $I(X_i; \mathrm{NonDesc}(X_i) \mid \mathrm{Pa}_i)$
- G is an I-Map of a distribution P
- if P satisfies the Markov independencies implied by G
- If G is an I-Map of P, then $P(X_1, \dots, X_n) = \prod_i P(X_i \mid \mathrm{Pa}_i)$
17Conditional Independencies
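- Let Markov(G) be the set of Markov independencies implied by G
- The factorization theorem shows
- G is an I-Map of P => $P(X_1, \dots, X_n) = \prod_i P(X_i \mid \mathrm{Pa}_i)$
- We can also show the opposite
- Thm
- $P(X_1, \dots, X_n) = \prod_i P(X_i \mid \mathrm{Pa}_i)$ => G is an I-Map of P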
18Implied Independencies
- Does a graph G imply additional independencies as a consequence of Markov(G)?
- We can define a logic of independence statements
- Some axioms
- $I(X; Y \mid Z)$ => $I(Y; X \mid Z)$
- $I(X; Y_1, Y_2 \mid Z)$ => $I(X; Y_1 \mid Z)$
19d-separation
- A procedure d-sep(X; Y | Z, G) that, given a DAG G and sets X, Y, and Z, returns either yes or no
- Goal
- d-sep(X; Y | Z, G) = yes iff I(X; Y | Z) follows from Markov(G)
20Paths
- Intuition: dependency must flow along paths in the graph
- A path is a sequence of neighboring variables
- Examples
- R <- E -> A <- B
- C <- A <- E -> R
21Paths
- We want to know when a path is
- active -- creates dependency between end nodes
- blocked -- cannot create dependency between end nodes
- We want to classify situations in which paths are active.
24Path Blockage
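- Three cases
- Common cause: R <- E -> A; the path is blocked when E is observed
- Intermediate cause: E -> A -> C; the path is blocked when A is observed
- Common effect: E -> A <- B; the path is blocked unless A or one of its descendants is observed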
25Path Blockage -- General Case
- A path is active, given evidence Z, if
- Whenever we have the configuration A -> B <- C, B or one of its descendants is in Z
- No other nodes in the path are in Z
- A path is blocked, given evidence Z, if it is not active.
26Example
[Figure: example network with edges E -> R, E -> A, B -> A, A -> C]
27Example
- d-sep(R; B) = yes
- d-sep(R; B | A)?
28Example
- d-sep(R; B) = yes
- d-sep(R; B | A) = no
- d-sep(R; B | E, A)?
29d-Separation
- X is d-separated from Y, given Z, if all paths from a node in X to a node in Y are blocked, given Z.
- Checking d-separation can be done efficiently (linear time in number of edges)
- Bottom-up phase: mark all nodes whose descendants are in Z
- X to Y phase: traverse (BFS) all edges on paths from X to Y and check if they are blocked
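A small sketch of a d-separation test follows. Note that it does not implement the two-phase linear-time procedure named above; it swaps in the equivalent moralized-ancestral-graph criterion, which is shorter to write down. The function name and the parent-map encoding are assumptions made for illustration; the network is the five-variable example with edges E -> R, E -> A, B -> A, A -> C.

```python
from collections import deque

def d_separated(parents, xs, ys, zs):
    """Return True if xs and ys are d-separated given zs.

    `parents` maps each node to the list of its parents in the DAG.
    Criterion used: xs and ys are d-separated given zs iff zs separates them
    in the moral graph of the subgraph induced by xs, ys, zs and their ancestors.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Keep only the nodes in xs, ys, zs and their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents[node])

    # 2. Moralize: link each node to its parents, link co-parents, drop directions.
    adj = {n: set() for n in keep}
    for child in keep:
        pa = parents[child]
        for p in pa:
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(pa):
            for q in pa[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)

    # 3. Search from xs without entering zs; reaching ys means "not separated".
    seen, queue = set(xs - zs), deque(xs - zs)
    while queue:
        node = queue.popleft()
        if node in ys:
            return False
        for nb in adj[node] - zs - seen:
            seen.add(nb)
            queue.append(nb)
    return True

alarm = {"E": [], "B": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}
print(d_separated(alarm, {"R"}, {"B"}, set()))        # True
print(d_separated(alarm, {"R"}, {"B"}, {"A"}))        # False
print(d_separated(alarm, {"R"}, {"B"}, {"E", "A"}))   # True
```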
30Soundness
- Thm
- If
- G is an I-Map of P
- d-sep(X; Y | Z, G) = yes
- then
- P satisfies I(X; Y | Z)
- Informally,
- Any independence reported by d-separation is satisfied by the underlying distribution
31Completeness
- Thm
- If d-sep(X; Y | Z, G) = no
- then there is a distribution P such that
- G is an I-Map of P
- P does not satisfy I(X; Y | Z)
- Informally,
- Any independence not reported by d-separation might be violated by the underlying distribution
- We cannot determine this by examining the graph structure alone
32I-Maps revisited
- The fact that G is an I-Map of P might not be that useful
- For example, complete DAGs
- A DAG G is complete if we cannot add an arc without creating a cycle
- These DAGs do not imply any independencies
- Thus, they are I-Maps of any distribution
33Minimal I-Maps
- A DAG G is a minimal I-Map of P if
- G is an I-Map of P
- If $G' \subset G$, then G' is not an I-Map of P
- Removing any arc from G introduces (conditional) independencies that do not hold in P
34Minimal I-Map Example
- If [figure: DAG, not reproduced] is a minimal I-Map
- Then these [figures: DAGs, not reproduced] are not I-Maps
35Constructing minimal I-Maps
- The factorization theorem suggests an algorithm
- Fix an ordering $X_1, \dots, X_n$
- For each i,
- select $\mathrm{Pa}_i$ to be a minimal subset of $\{X_1, \dots, X_{i-1}\}$ such that $I(X_i; \{X_1, \dots, X_{i-1}\} - \mathrm{Pa}_i \mid \mathrm{Pa}_i)$
- Clearly, the resulting graph is a minimal I-Map.
36Non-uniqueness of minimal I-Map
- Unfortunately, there may be several minimal I-Maps for the same distribution
- Applying the I-Map construction procedure with different orders can lead to different structures
[Figure: the original I-Map next to the I-Map built with order C, R, A, E, B]
37Choosing Ordering Causality
- The choice of order can have drastic impact on the complexity of the minimal I-Map
- Heuristic argument: construct the I-Map using a causal ordering among variables
- Justification?
- It is often reasonable to assume that graphs of causal influence should satisfy the Markov properties.
38P-Maps
- A DAG G is a P-Map (perfect map) of a distribution P if
- I(X; Y | Z) if and only if d-sep(X; Y | Z, G) = yes
- Notes
- A P-Map captures all the independencies in the distribution
- P-Maps are unique, up to DAG equivalence
39P-Maps
- Unfortunately, some distributions do not have a
P-Map
40Bayesian Networks
- A Bayesian network specifies a probability distribution via two components
- A DAG G
- A collection of conditional probability distributions $P(X_i \mid \mathrm{Pa}_i)$
- The joint distribution P is defined by the factorization $P(X_1, \dots, X_n) = \prod_i P(X_i \mid \mathrm{Pa}_i)$
- Additional requirement: G is a minimal I-Map of P
42DAGs and BNs
- DAGs as a representation of conditional independencies
- Markov independencies of a DAG
- Tight correspondence between Markov(G) and the factorization defined by G
- d-separation, a sound and complete procedure for computing the consequences of the independencies
- Notion of minimal I-Map
- P-Maps
- This theory is the basis for defining Bayesian networks
43Undirected Graphs Markov Networks
- Alternative representation of conditional independencies
- Let U be an undirected graph
- Let $N_i$ be the set of neighbors of $X_i$
- Define Markov(U) to be the set of independencies $I(X_i; \{X_1, \dots, X_n\} - N_i - \{X_i\} \mid N_i)$
- U is an I-Map of P if P satisfies Markov(U)
44Example
- This graph implies that
- $I(A; C \mid B, D)$
- $I(B; D \mid A, C)$
- Note: this example does not have a directed P-Map
[Figure: undirected four-cycle A -- B -- C -- D -- A]
45Markov Network Factorization
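- Thm: if
- P is strictly positive, that is $P(x_1, \dots, x_n) > 0$ for all assignments
- then
- U is an I-Map of P
- if and only if
- there is a factorization $P(X_1, \dots, X_n) = \frac{1}{Z} \prod_{j=1}^{k} f_j(C_j)$
- where $C_1, \dots, C_k$ are the maximal cliques in U and Z is a normalizing constant
- Alternative form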
46Relationship between Directed and Undirected Models
[Figure: diagram relating Chain Graphs, Directed Graphs, and Undirected Graphs]
47CPDs
- So far, we focused on how to represent independencies using DAGs
- The other component of a Bayesian network is the specification of the conditional probability distributions (CPDs)
- Here, we'll just discuss the simplest representation of CPDs
48Tabular CPDs
- When the variables of interest are all discrete, the common representation is as a table
- For example, P(C | A, B) can be represented by a table with one row per joint assignment to A and B [table not reproduced]
49Tabular CPDs
- Pros
- Very flexible, can capture any CPD of discrete variables
- Can be easily stored and manipulated
- Cons
- Representation size grows exponentially with the number of parents!
- Unwieldy to assess probabilities for more than a few parents
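One possible way to hold a tabular CPD in code is a dictionary keyed by parent assignment, as sketched below. The helper name `make_cpt`, the example probabilities, and the OR-like shape of the table are illustrative assumptions; the final lines show how the row count grows with the number of binary parents.

```python
from itertools import product

def make_cpt(child_values, parent_values, prob_fn):
    """Build a table mapping each parent assignment to a distribution over the child.

    `parent_values` holds one list of values per parent;
    `prob_fn(assignment)` returns a dict {child_value: probability}.
    """
    cpt = {}
    for assignment in product(*parent_values):
        dist = prob_fn(assignment)
        assert set(dist) == set(child_values)
        assert abs(sum(dist.values()) - 1.0) < 1e-9   # each row must be a distribution
        cpt[assignment] = dist
    return cpt

# Hypothetical CPD P(C | A, B) for binary variables (numbers are made up).
def example_row(assignment):
    a, b = assignment
    p_true = 0.9 if (a or b) else 0.05
    return {True: p_true, False: 1.0 - p_true}

cpt = make_cpt([True, False], [[True, False], [True, False]], example_row)
print(len(cpt))                      # one row per parent assignment
print(cpt[(True, False)][True])

# Rows needed for k binary parents: 2**k.
rows = [2 ** k for k in (1, 2, 5, 10, 20)]
print(rows)
```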
50Continuous CPDs
- When X is a continuous variable, we need to represent the density of X, given any value of its parents
- Gaussian
- Conditional Gaussian
51CPDs Summary
- Many choices for representing CPDs
- Any statistical model of conditional distribution can be used
- e.g., any regression model
- Representing structure in CPDs can have implications on independencies among variables
52Inference in Bayesian Networks
53Inference
- We now have compact representations of probability distributions
- Bayesian Networks
- Markov Networks
- Network describes a unique probability distribution P
- How do we answer queries about P?
- Inference is the name for the process of computing answers to such queries
54Queries Likelihood
- There are many types of queries we might ask.
- Most of these involve evidence
- An evidence e is an assignment of values to a set E of variables in the domain
- Without loss of generality $E = \{X_{k+1}, \dots, X_n\}$
- Simplest query: compute the probability of the evidence, $P(e) = \sum_{x_1} \cdots \sum_{x_k} P(x_1, \dots, x_k, e)$
- This is often referred to as computing the likelihood of the evidence
55Queries A posteriori belief
- Often we are interested in the conditional probability of a variable given the evidence, $P(X \mid e) = P(X, e) / P(e)$
- This is the a posteriori belief in X, given evidence e
- A related task is computing the term P(X, e)
- i.e., the likelihood of e and X = x for values of X
- we can recover the a posteriori belief by $P(X = x \mid e) = P(X = x, e) / \sum_{x'} P(X = x', e)$
56A posteriori belief
- This query is useful in many cases
- Prediction: what is the probability of an outcome given the starting condition
- Target is a descendant of the evidence
- Diagnosis: what is the probability of disease/fault given symptoms
- Target is an ancestor of the evidence
- Note: the direction between variables does not restrict the directions of the queries
- Probabilistic inference can combine evidence from all parts of the network
57Queries MAP
- In this query we want to find the maximum a posteriori assignment for some variables of interest (say $X_1, \dots, X_l$)
- That is, $x_1, \dots, x_l$ maximize the probability $P(x_1, \dots, x_l \mid e)$
- Note that this is equivalent to maximizing $P(x_1, \dots, x_l, e)$
58Queries MAP
- We can use MAP for
- Classification
- find most likely label, given the evidence
- Explanation
- What is the most likely scenario, given the
evidence
59Queries MAP
- Cautionary note
- The MAP depends on the set of variables
- Example
- MAP of X
- MAP of (X, Y)
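The cautionary note can be checked on a tiny joint table. In the sketch below the numbers are invented for illustration: the most likely value of X alone is not the X component of the most likely (X, Y) pair.

```python
# Hypothetical joint distribution P(X, Y); the numbers are illustrative only.
joint = {
    ("x0", "y0"): 0.04, ("x0", "y1"): 0.36,
    ("x1", "y0"): 0.30, ("x1", "y1"): 0.30,
}

# MAP over X alone: marginalize out Y first, then maximize.
marginal_x = {}
for (x, y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p
map_x = max(marginal_x, key=marginal_x.get)

# MAP over (X, Y): maximize the joint directly.
map_xy = max(joint, key=joint.get)

print(map_x)    # x1
print(map_xy)   # ('x0', 'y1')
```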
60Complexity of Inference
- Thm
- Computing P(X = x) in a Bayesian network is NP-hard
- Not surprising, since we can simulate Boolean gates.
61Hardness
- Hardness does not mean we cannot solve inference
- It implies that we cannot find a general procedure that works efficiently for all networks
- For particular families of networks, we can have provably efficient procedures
62Approaches to inference
- Exact inference
- Inference in Simple Chains
- Variable elimination
- Clustering / join tree algorithms
- Approximate inference
- Stochastic simulation / sampling methods
- Markov chain Monte Carlo methods
- Mean field theory
63Inference in Simple Chains
[Figure: chain X1 -> X2]
64Inference in Simple Chains (cont.)
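[Figure: chain X1 -> X2 -> X3]
- How do we compute P(X3)?
- We already know how to compute P(X2): $P(x_2) = \sum_{x_1} P(x_1) P(x_2 \mid x_1)$
- Then $P(x_3) = \sum_{x_2} P(x_2) P(x_3 \mid x_2)$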
65Inference in Simple Chains (cont.)
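[Figure: chain X1 -> X2 -> X3 -> ... -> Xn]
- How do we compute P(Xn)?
- Compute P(X1), P(X2), P(X3), ...
- We compute each term by using the previous one: $P(x_{i+1}) = \sum_{x_i} P(x_i) P(x_{i+1} \mid x_i)$
- Complexity
- Each step costs $O(|\mathrm{Val}(X_i)| \cdot |\mathrm{Val}(X_{i+1})|)$ operations
- Compare to naïve evaluation, that requires summing over joint values of n-1 variables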
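The forward iteration can be sketched in a few lines. Here distributions are plain lists indexed by value, and the two-state chain with a shared CPD is a made-up example; each step does one pass over a CPD table, which is the per-step cost quoted above.

```python
def forward_marginals(prior, transitions):
    """Return [P(X1), P(X2), ..., P(Xn)] for a chain X1 -> X2 -> ... -> Xn.

    `prior[a]` is P(X1 = a); `transitions[i][a][b]` is P(X_{i+2} = b | X_{i+1} = a).
    """
    marginals = [list(prior)]
    for cpd in transitions:
        prev = marginals[-1]
        # P(next = b) = sum_a P(prev = a) * P(next = b | prev = a)
        nxt = [sum(prev[a] * cpd[a][b] for a in range(len(prev)))
               for b in range(len(cpd[0]))]
        marginals.append(nxt)
    return marginals

# Hypothetical binary chain of length 3 with the same CPD at every step.
prior = [0.6, 0.4]
cpd = [[0.7, 0.3],
       [0.2, 0.8]]
ms = forward_marginals(prior, [cpd, cpd])
print(ms[-1])   # P(X3)
```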
66Inference in Simple Chains (cont.)
[Figure: chain X1 -> X2]
- Suppose that we observe the value of X2 = x2
- How do we compute P(X1 | x2)?
- Recall that it suffices to compute P(X1, x2)
67Inference in Simple Chains (cont.)
[Figure: chain X1 -> X2 -> X3]
- Suppose that we observe the value of X3 = x3
- How do we compute P(X1, x3)?
- How do we compute P(x3 | x1)?
68Inference in Simple Chains (cont.)
[Figure: chain X1 -> X2 -> X3 -> ... -> Xn]
- Suppose that we observe the value of Xn = xn
- How do we compute P(X1, xn)?
- We compute $P(x_n \mid x_{n-1}), P(x_n \mid x_{n-2}), \dots$ iteratively
69Inference in Simple Chains (cont.)
[Figure: chain X1 -> X2 -> ... -> Xk -> ... -> Xn]
- Suppose that we observe the value of Xn = xn
- We want to find P(Xk | xn)
- How do we compute P(Xk, xn)?
- We compute P(Xk) by forward iterations
- We compute P(xn | Xk) by backward iterations
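Combining the two passes gives P(Xk, xn) = P(Xk) P(xn | Xk), which is then normalized. The sketch below is one way to write this for a chain with list-based tables; the function names and the numbers are assumptions chosen for illustration.

```python
def forward(prior, cpds, k):
    """P(X_k), with k counted from 1; cpds[i][a][b] is P(X_{i+2} = b | X_{i+1} = a)."""
    p = list(prior)
    for cpd in cpds[:k - 1]:
        p = [sum(p[a] * cpd[a][b] for a in range(len(p)))
             for b in range(len(cpd[0]))]
    return p

def backward(cpds, k, xn):
    """P(X_n = xn | X_k = a) for every value a of X_k."""
    # Start from the indicator of X_n = xn and fold in one earlier CPD per step.
    msg = [1.0 if b == xn else 0.0 for b in range(len(cpds[-1][0]))]
    for cpd in reversed(cpds[k - 1:]):
        msg = [sum(cpd[a][b] * msg[b] for b in range(len(msg)))
               for a in range(len(cpd))]
    return msg

def chain_posterior(prior, cpds, k, xn):
    """P(X_k | X_n = xn), obtained by normalizing P(X_k, xn)."""
    f = forward(prior, cpds, k)
    b = backward(cpds, k, xn)
    joint = [fa * ba for fa, ba in zip(f, b)]   # P(X_k, xn)
    z = sum(joint)                              # P(xn)
    return [j / z for j in joint]

# Hypothetical three-variable binary chain.
prior = [0.6, 0.4]
cpd = [[0.7, 0.3], [0.2, 0.8]]
post = chain_posterior(prior, [cpd, cpd], k=1, xn=0)
print(post)   # P(X1 | X3 = 0)
```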
70Elimination in Chains
- We now try to understand the simple chain example using first-order principles
- Using the definition of probability, we have
71Elimination in Chains
- By chain decomposition, we get
72Elimination in Chains
73Elimination in Chains
- Now we can perform innermost summation
- This summation is exactly the first step in the forward iteration we described before
74Elimination in Chains
- Rearranging and then summing again, we get
75Elimination in Chains with Evidence
- Similarly, we understand the backward pass
- We write the query in explicit form
79Variable Elimination
- General idea
- Write query in the form $P(X_1, e) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_2} \prod_i P(x_i \mid pa_i)$
- Iteratively
- Move all irrelevant terms outside of innermost sum
- Perform innermost sum, getting a new term
- Insert the new term into the product
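The loop described above can be sketched with factors stored as small tables. Everything in this block (the factor encoding as a `(variables, table)` pair, the function names, and the three-variable chain used as input) is an illustrative assumption rather than the slides' own code.

```python
from itertools import product

# A factor is a pair (variables, table): `variables` is a tuple of names and
# `table` maps each assignment tuple (one value per variable) to a number.

def multiply(f, g, domains):
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = f_vars + tuple(v for v in g_vars if v not in f_vars)
    out_tab = {}
    for values in product(*(domains[v] for v in out_vars)):
        asg = dict(zip(out_vars, values))
        out_tab[values] = (f_tab[tuple(asg[v] for v in f_vars)]
                           * g_tab[tuple(asg[v] for v in g_vars)])
    return out_vars, out_tab

def sum_out(f, var):
    f_vars, f_tab = f
    idx = f_vars.index(var)
    out_vars = f_vars[:idx] + f_vars[idx + 1:]
    out_tab = {}
    for values, p in f_tab.items():
        key = values[:idx] + values[idx + 1:]
        out_tab[key] = out_tab.get(key, 0.0) + p
    return out_vars, out_tab

def eliminate(factors, order, domains):
    """Sum out the variables in `order`, one at a time."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]   # terms moved outside the sum
        if not touching:
            continue
        prod_f = touching[0]
        for f in touching[1:]:
            prod_f = multiply(prod_f, f, domains)
        factors = rest + [sum_out(prod_f, var)]          # insert the new term
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, domains)
    return result

# Hypothetical chain A -> B -> C over binary values 0/1.
domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
p_a = (("A",), {(0,): 0.6, (1,): 0.4})
p_b_a = (("A", "B"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
p_c_b = (("B", "C"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
result = eliminate([p_a, p_b_a, p_c_b], ["A", "B"], domains)
print(result)   # a factor over C holding P(C)
```

The size of each intermediate table is the product of the domain sizes of its variables, so the elimination order determines how large these tables get.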
80A More Complex Example
81
- We want to compute P(d)
- Need to eliminate v, s, x, t, l, a, b
- Initial factors: $P(v) P(s) P(t \mid v) P(l \mid s) P(b \mid s) P(a \mid t, l) P(x \mid a) P(d \mid a, b)$
82
- We want to compute P(d)
- Need to eliminate v, s, x, t, l, a, b
- Initial factors
Eliminate v: $f_v(t) = \sum_v P(v) P(t \mid v)$
Note: $f_v(t) = P(t)$. In general, the result of elimination is not necessarily a probability term.
83
- We want to compute P(d)
- Need to eliminate s, x, t, l, a, b
- Initial factors
Eliminate s: $f_s(b, l) = \sum_s P(s) P(b \mid s) P(l \mid s)$
Summing on s results in a factor with two arguments, $f_s(b, l)$. In general, the result of elimination may be a function of several variables.
84
- We want to compute P(d)
- Need to eliminate x, t, l, a, b
- Initial factors
Eliminate x: $f_x(a) = \sum_x P(x \mid a)$
Note: $f_x(a) = 1$ for all values of a!
85
- We want to compute P(d)
- Need to eliminate t, l, a, b
- Initial factors
Eliminate t
86
- We want to compute P(d)
- Need to eliminate l, a, b
- Initial factors
Eliminate l
87
- We want to compute P(d)
- Need to eliminate a, b
- Initial factors
Eliminate a, b
88Variable Elimination
- We now understand variable elimination as a sequence of rewriting operations
- Actual computation is done in the elimination step
- Exactly the same computation procedure applies to Markov networks
- Computation depends on order of elimination
89Dealing with evidence
- How do we deal with evidence?
- Suppose we get evidence V = t, S = f, D = t
- We want to compute P(L, V = t, S = f, D = t)
90Dealing with Evidence
- We start by writing the factors
- Since we know that V = t, we don't need to eliminate V
- Instead, we can replace the factors P(V) and P(T | V) with $f_{P(V)} = P(V = t)$ and $f_{P(T \mid V)}(T) = P(T \mid V = t)$
- These select the appropriate parts of the original factors given the evidence
- Note that $f_{P(V)}$ is a constant, and thus does not appear in elimination of other variables
95Dealing with Evidence
- Given evidence V = t, S = f, D = t
- Compute P(L, V = t, S = f, D = t)
- Initial factors, after setting evidence
- Eliminating x, we get
- Eliminating t, we get
- Eliminating a, we get
- Eliminating b, we get
96Complexity of variable elimination
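- Suppose in one elimination step we compute $f_x(y_1, \dots, y_k) = \sum_x f'_x(x, y_1, \dots, y_k)$, where $f'_x(x, y_1, \dots, y_k) = \prod_{i=1}^{m} f_i(x, \mathbf{y}_i)$ and each $\mathbf{y}_i$ is a subset of $y_1, \dots, y_k$
- This requires
- $m \cdot |\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(Y_i)|$ multiplications
- For each value of $x, y_1, \dots, y_k$, we do m multiplications
- $|\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(Y_i)|$ additions
- For each value of $y_1, \dots, y_k$, we do $|\mathrm{Val}(X)|$ additions
- Complexity is exponential in number of variables in the intermediate factor!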
97Understanding Variable Elimination
- We want to select "good" elimination orderings that reduce complexity
- We start by attempting to understand variable elimination via the graph we are working with
- This will reduce the problem of finding a good ordering to a graph-theoretic operation that is well understood
98Undirected graph representation
- At each stage of the procedure, we have an algebraic term that we need to evaluate
- In general this term is of the form $\sum_{y_1} \cdots \sum_{y_n} \prod_i f_i(Z_i)$, where the $Z_i$ are sets of variables
- We now plot a graph where there is an undirected edge X--Y if X, Y are arguments of some factor
- that is, if X, Y are in some $Z_i$
- Note: this is the Markov network that describes the probability on the variables we did not eliminate yet
99Chordal Graphs
- Elimination ordering => undirected chordal graph
- Graph
- Maximal cliques are factors in elimination
- Factors in elimination are cliques in the graph
- Complexity is exponential in size of the largest
clique in graph
100Induced Width
- The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination
- This quantity is called the induced width of a graph according to the specified ordering
- Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph
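A sketch of how the largest induced clique could be computed for a given ordering: eliminating a node connects its remaining neighbors, and the node plus those neighbors form a clique of the induced graph. The function name and the star-shaped test graph are assumptions. Many texts report this size minus one as the width, which is the convention under which a tree has width 1.

```python
def largest_induced_clique(edges, order):
    """Size of the largest clique created when eliminating nodes in `order`.

    `edges` is an iterable of undirected pairs; `order` lists every node once.
    """
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    largest = 0
    for v in order:
        nbrs = adj.pop(v)                    # neighbors not yet eliminated
        largest = max(largest, len(nbrs) + 1)
        for u in nbrs:
            adj[u].discard(v)
            adj[u] |= nbrs - {u}             # fill-in edges among the neighbors
    return largest

# Hypothetical star graph: center "c" with three leaves.
star = [("c", "l1"), ("c", "l2"), ("c", "l3")]
print(largest_induced_clique(star, ["l1", "l2", "l3", "c"]))   # 2
print(largest_induced_clique(star, ["c", "l1", "l2", "l3"]))   # 4
```

Eliminating the leaves first never creates a fill-in edge, while eliminating the center first connects all leaves to each other.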
101General Networks
- From graph theory
- Thm
- Finding an ordering that minimizes the induced width is NP-Hard
- However,
- There are reasonable heuristics for finding a relatively good ordering
- There are provable approximations to the best induced width
- If the graph has a small induced width, there are algorithms that find it in polynomial time
102Elimination on Trees
- Formally, for any tree, there is an elimination ordering with induced width 1
- Thm
- Inference on trees is linear in number of variables
103PolyTrees
- A polytree is a network where there is at most one path from one variable to another
- Thm
- Inference in a polytree is linear in the representation size of the network
- This assumes tabular CPT representation
104Approaches to inference
- Exact inference
- Inference in Simple Chains
- Variable elimination
- Clustering / join tree algorithms
- Approximate inference
- Stochastic simulation / sampling methods
- Markov chain Monte Carlo methods
- Mean field theory
105Learning Bayesian Networks
106Learning Bayesian networks
Inducer
107Known Structure -- Complete Data
[Figure: data over E, B, A with records <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>, passed to the Inducer]
- Network structure is specified
- Inducer needs to estimate parameters
- Data does not contain missing values
108Unknown Structure -- Complete Data
[Figure: data over E, B, A with records <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>, passed to the Inducer]
- Network structure is not specified
- Inducer needs to select arcs and estimate parameters
- Data does not contain missing values
109Known Structure -- Incomplete Data
[Figure: data over E, B, A with records <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>, passed to the Inducer]
- Network structure is specified
- Data contains missing values
- We consider assignments to missing values
110Known Structure / Complete Data
- Given a network structure G
- And choice of parametric family for $P(X_i \mid \mathrm{Pa}_i)$
- Learn parameters for network
- Goal
- Construct a network that is closest to the probability distribution that generated the data
111Learning Parameters for a Bayesian Network
- Training data has the form
112Learning Parameters for a Bayesian Network
- Since we assume i.i.d. samples, the likelihood function is
113Learning Parameters for a Bayesian Network
- By definition of network, we get
114Learning Parameters for a Bayesian Network
115General Bayesian Networks
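- Generalizing for any Bayesian network
- $L(\Theta : D) = \prod_m P(x_1[m], \dots, x_n[m] : \Theta)$ (i.i.d. samples)
- $= \prod_i \prod_m P(x_i[m] \mid pa_i[m] : \Theta_i)$ (network factorization)
- $= \prod_i L_i(\Theta_i : D)$
- The likelihood decomposes according to the structure of the network.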
116General Bayesian Networks (Cont.)
- Decomposition
- => Independent estimation problems
- If the parameters for each family are not related, then they can be estimated independently of each other.
117From Binomial to Multinomial
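- For example, suppose X can have the values 1, 2, ..., K
- We want to learn the parameters $\theta_1, \theta_2, \dots, \theta_K$
- Sufficient statistics
- $N_1, N_2, \dots, N_K$ - the number of times each outcome is observed
- Likelihood function: $L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$
- MLE: $\hat{\theta}_k = N_k / \sum_l N_l$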
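The maximum likelihood estimate of a multinomial is the relative frequency of each outcome, so it can be computed directly from the counts. A minimal sketch with made-up observations follows; the function name is an assumption.

```python
from collections import Counter

def multinomial_mle(samples):
    """Return {value: N_k / N}, the maximum likelihood estimate from observed counts."""
    counts = Counter(samples)            # sufficient statistics N_1, ..., N_K
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Hypothetical observations of a three-valued variable.
data = [1, 3, 1, 2, 1, 3, 1, 2]
theta = multinomial_mle(data)
print(theta)
```

Values that never occur in the data get no entry, which corresponds to an estimated probability of zero; this is one motivation for the Bayesian treatment with priors.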
119Likelihood for Multinomial Networks
- When we assume that $P(X_i \mid \mathrm{Pa}_i)$ is multinomial, we get further decomposition
- For each value $pa_i$ of the parents of $X_i$ we get an independent multinomial problem
- The MLE is $\hat{\theta}_{x_i \mid pa_i} = N(x_i, pa_i) / N(pa_i)$
120Bayesian Approach Dirichlet Priors
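- Recall that the likelihood function is $L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$
- A Dirichlet prior with hyperparameters $\alpha_1, \dots, \alpha_K$ is defined as $P(\theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$ for legal $\theta_1, \dots, \theta_K$
- Then the posterior has the same form, with hyperparameters $\alpha_1 + N_1, \dots, \alpha_K + N_K$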
121Dirichlet Priors (cont.)
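- We can compute the prediction on a new event in closed form
- If $P(\theta)$ is Dirichlet with hyperparameters $\alpha_1, \dots, \alpha_K$ then $P(X[1] = k) = \int \theta_k P(\theta) \, d\theta = \alpha_k / \sum_l \alpha_l$
- Since the posterior is also Dirichlet, we get $P(X[M+1] = k \mid D) = (\alpha_k + N_k) / \sum_l (\alpha_l + N_l)$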
122Prior Knowledge
- The hyperparameters $\alpha_1, \dots, \alpha_K$ can be thought of as "imaginary" counts from our prior experience
- Equivalent sample size = $\alpha_1 + \dots + \alpha_K$
- The larger the equivalent sample size, the more confident we are in our prior
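The "imaginary counts" reading translates directly into code: the posterior hyperparameters are the prior ones plus the observed counts, and the prediction for the next outcome is the normalized hyperparameter. The prior and counts below are invented for illustration, as are the function names.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior hyperparameters: add the observed counts to the prior ones."""
    return {k: alpha[k] + counts.get(k, 0) for k in alpha}

def predictive(alpha):
    """P(next outcome = k) under a Dirichlet with hyperparameters `alpha`."""
    total = sum(alpha.values())
    return {k: a / total for k, a in alpha.items()}

# Hypothetical prior (equivalent sample size 3) and data counts.
prior = {"a": 1.0, "b": 1.0, "c": 1.0}
counts = {"a": 4, "b": 2}
post = dirichlet_posterior(prior, counts)
pred = predictive(post)
print(post, pred)
```

Unlike the maximum likelihood estimate, the unseen outcome "c" keeps a nonzero predicted probability, and a larger equivalent sample size would pull the prediction further toward the prior.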
123Conjugate Families
- The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
- Dirichlet prior is a conjugate family for the multinomial likelihood
- Conjugate families are useful since
- For many distributions we can represent them with hyperparameters
- They allow for sequential update within the same representation
- In many cases we have closed-form solution for prediction
124Bayesian Prediction(cont.)
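- Given these observations, we can compute the posterior for each multinomial $\theta_{X_i \mid pa_i}$ independently
- The posterior is Dirichlet with parameters
- $\alpha(X_i = 1 \mid pa_i) + N(X_i = 1 \mid pa_i), \dots, \alpha(X_i = k \mid pa_i) + N(X_i = k \mid pa_i)$
- The predictive distribution is then represented by the parameters $\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$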
125Learning Parameters Summary
- Estimation relies on sufficient statistics
- For multinomial these are of the form $N(x_i, pa_i)$
- Parameter estimation
- Bayesian methods also require choice of priors
- MLE and Bayesian estimates are asymptotically equivalent and consistent
- Both can be implemented in an on-line manner by accumulating sufficient statistics
126Learning Structure from Complete Data
127Benefits of Learning Structure
- Efficient learning -- more accurate models with less data
- Compare: P(A) and P(B) vs. joint P(A,B)
- Discover structural properties of the domain
- Ordering of events
- Relevance
- Identifying independencies => faster inference
- Predict effect of actions
- Involves learning causal relationship among variables
128Why Struggle for Accurate Structure?
Adding an arc
- Increases the number of parameters to be fitted
- Wrong assumptions about causality and domain structure
Missing an arc
- Cannot be compensated by accurate fitting of parameters
- Also misses causality and domain structure
129Approaches to Learning Structure
- Constraint based
- Perform tests of conditional independence
- Search for a network that is consistent with the observed dependencies and independencies
- Pros and cons
- Intuitive, follows closely the construction of BNs
- Separates structure learning from the form of the independence tests
- Sensitive to errors in individual tests
130Approaches to Learning Structure
- Score based
- Define a score that evaluates how well the (in)dependencies in a structure match the observations
- Search for a structure that maximizes the score
- Pros and cons
- Statistically motivated
- Can make compromises
- Takes the structure of conditional probabilities into account
- Computationally hard
131Likelihood Score for Structures
- First cut approach
- Use likelihood function
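- Recall, the likelihood score for a network structure and parameters is $L(G, \Theta_G : D) = \prod_m \prod_i P(x_i[m] \mid pa_i^G[m] : G, \Theta_G)$
- Since we know how to maximize parameters, from now on we assume $L(G : D) = \max_{\Theta_G} L(G, \Theta_G : D)$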
132Likelihood Score for Structure (cont.)
- Rearranging terms: $l(G : D) = \log L(G : D) = M \sum_i \left( I(X_i; \mathrm{Pa}_i^G) - H(X_i) \right)$
- where
- H(X) is the entropy of X
- I(X;Y) is the mutual information between X and Y
- I(X;Y) measures how much information each variable provides about the other
- I(X;Y) >= 0
- I(X;Y) = 0 iff X and Y are independent
- I(X;Y) = H(X) iff X is totally predictable given Y
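The listed properties of entropy and mutual information can be checked numerically. The sketch below uses base-2 logarithms (bits) and two tiny made-up joint distributions, one independent and one where Y determines X; the function names are assumptions.

```python
from math import log2

def entropy(p):
    """H(X) in bits for a distribution given as {value: probability}."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

def mutual_information(joint):
    """I(X;Y) in bits for a joint given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), q in joint.items():
        px[x] = px.get(x, 0.0) + q
        py[y] = py.get(y, 0.0) + q
    return sum(q * log2(q / (px[x] * py[y]))
               for (x, y), q in joint.items() if q > 0)

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
copy = {(0, 0): 0.5, (1, 1): 0.5}          # Y determines X exactly
print(mutual_information(independent))     # 0.0
print(mutual_information(copy))            # 1.0
print(entropy({0: 0.5, 1: 0.5}))           # 1.0
```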
133Likelihood Score for Structure (cont.)
- Good news
- Intuitive explanation of likelihood score
- The larger the dependency of each variable on its parents, the higher the score
- Likelihood as a compromise among dependencies, based on their strength
134Likelihood Score for Structure (cont.)
- Bad news
- Adding arcs always helps
- $I(X; Y) \le I(X; Y, Z)$
- Maximal score attained by fully connected networks
- Such networks can overfit the data --- parameters capture the noise in the data
135Avoiding Overfitting
- Classic issue in learning.
- Approaches
- Restricting the hypotheses space
- Limits the overfitting capability of the learner
- Example: restrict # of parents or # of parameters
- Minimum description length
- Description length measures complexity
- Prefer models that compactly describe the training data
- Bayesian methods
- Average over all possible parameter values
- Use prior knowledge
136Bayesian Inference
- Bayesian reasoning --- compute expectation over unknown G: $P(x[M+1] \mid D) = \sum_G P(G \mid D) P(x[M+1] \mid G, D)$
- Assumption: Gs are mutually exclusive and exhaustive
- We know how to compute $P(x[M+1] \mid G, D)$
- Same as prediction with fixed structure
- How do we compute $P(G \mid D)$?
137Posterior Score
- Using Bayes rule: $P(G \mid D) = \frac{P(D \mid G) P(G)}{P(D)}$
- $P(D \mid G)$: marginal likelihood
- $P(G)$: prior over structures
- $P(D)$: probability of data
- P(D) is the same for all structures G and can be ignored when comparing structures
138Marginal Likelihood
- By introduction of variables, we have that $P(D \mid G) = \int P(D \mid G, \Theta) P(\Theta \mid G) \, d\Theta$
- This integral measures sensitivity to choice of parameters
139Marginal Likelihood for General Network
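- The marginal likelihood has the form $P(D \mid G) = \prod_i \prod_{pa_i^G} \frac{\Gamma(\alpha(pa_i^G))}{\Gamma(\alpha(pa_i^G) + N(pa_i^G))} \prod_{x_i} \frac{\Gamma(\alpha(x_i, pa_i^G) + N(x_i, pa_i^G))}{\Gamma(\alpha(x_i, pa_i^G))}$
- where
- N(..) are the counts from the data
- $\alpha(..)$ are the hyperparameters for each family given G
- Each term inside the outer products is a Dirichlet marginal likelihood for the sequence of values of $X_i$ when $X_i$'s parents have a particular value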
140Priors
- We need prior counts $\alpha(..)$ for each network structure G
- This can be a formidable task
- There are exponentially many structures
141BDe Score
- Possible solution: the BDe prior
- Represent prior using two elements M0, B0
- M0 - equivalent sample size
- B0 - network representing the prior probability of events
142BDe Score
- Intuition: M0 prior examples distributed by B0
- Set $\alpha(x_i, pa_i^G) = M_0 \, P(x_i, pa_i^G \mid B_0)$
- Note that $pa_i^G$ are not the same as the parents of $X_i$ in B0.
- Compute $P(x_i, pa_i^G \mid B_0)$ using standard inference procedures
- Such priors have desirable theoretical properties
- Equivalent networks are assigned the same score
143Bayesian Score Asymptotic Behavior
- Theorem: if the prior $P(\Theta \mid G)$ is "well-behaved", then $\log P(D \mid G) = l(G : D) - \frac{\log M}{2} \dim(G) + O(1)$
144Asymptotic Behavior Consequences
- Bayesian score is consistent
- As $M \to \infty$ the "true" structure G maximizes the score (almost surely)
- For sufficiently large M, the maximal scoring structures are equivalent to G
- Observed data eventually overrides prior information
- Assuming that the prior assigns positive probability to all cases
145Asymptotic Behavior
- This score can also be justified by the Minimal Description Length (MDL) principle
- This equation explicitly shows the tradeoff between
- Fitness to data --- likelihood term
- Penalty for complexity --- regularization term
146Scores -- Summary
- Likelihood, MDL, (log) BDe have the form $\mathrm{Score}(G : D) = \sum_i \mathrm{Score}(X_i \mid \mathrm{Pa}_i^G : N(X_i, \mathrm{Pa}_i^G))$
- BDe requires assessing a prior network. It can naturally incorporate prior knowledge and previous experience
- BDe is consistent and asymptotically equivalent (up to a constant) to MDL
- All are score-equivalent
- G equivalent to G' => Score(G) = Score(G')
147Optimization Problem
- Input
- Training data
- Scoring function (including priors, if needed)
- Set of possible structures
- Including prior knowledge about structure
- Output
- A network (or networks) that maximize the score
- Key Property
- Decomposability: the score of a network is a sum of terms.
148Heuristic Search
- We address the problem by using heuristic search
- Define a search space
- nodes are possible structures
- edges denote adjacency of structures
- Traverse this space looking for high-scoring structures
- Search techniques
- Greedy hill-climbing
- Best first search
- Simulated Annealing
- ...
149Heuristic Search (cont.)
Add C -> D
Reverse C -> E
Delete C -> E
150Exploiting Decomposability in Local Search
- Caching: to update the score after a local change, we only need to re-score the families that were changed in the last move
151Greedy Hill-Climbing
- Simplest heuristic local search
- Start with a given network
- empty network
- best tree
- a random network
- At each iteration
- Evaluate all possible changes
- Apply change that leads to best improvement in score
- Reiterate
- Stop when no modification improves score
- Each step requires evaluating approximately n new changes
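The loop above can be sketched for DAG structures with add, delete, and reverse moves. This is only a toy under stated assumptions: it starts from the empty network, re-scores every family on each move (no caching), and uses a made-up `toy_score` that rewards a hand-picked target structure instead of a likelihood, MDL, or BDe score. All function names are illustrative.

```python
from itertools import permutations

def is_acyclic(parents):
    """Check a parent map {node: set_of_parents} for cycles by peeling off roots."""
    remaining = {n: set(p) for n, p in parents.items()}
    while remaining:
        roots = [n for n, p in remaining.items() if not p]
        if not roots:
            return False
        for r in roots:
            del remaining[r]
        for p in remaining.values():
            p.difference_update(roots)
    return True

def neighbors(parents):
    """Yield every DAG one edge addition, deletion, or reversal away."""
    for u, v in permutations(parents, 2):
        new = {n: set(p) for n, p in parents.items()}
        if u in parents[v]:
            new[v].discard(u)                     # delete u -> v
            yield new
            rev = {n: set(p) for n, p in new.items()}
            rev[u].add(v)                         # reverse to v -> u
            if is_acyclic(rev):
                yield rev
        elif v not in parents[u]:
            new[v].add(u)                         # add u -> v
            if is_acyclic(new):
                yield new

def hill_climb(nodes, family_score):
    """Greedy search from the empty network; the score is a sum of per-family terms."""
    def total(parents):
        return sum(family_score(n, frozenset(p)) for n, p in parents.items())
    current = {n: set() for n in nodes}
    best = total(current)
    while True:
        candidates = [(total(g), g) for g in neighbors(current)]
        if not candidates:
            return current, best
        score, graph = max(candidates, key=lambda c: c[0])
        if score <= best:                         # no modification improves the score
            return current, best
        current, best = graph, score

# Hypothetical decomposable score: penalize each family's distance from a target.
target = {"A": frozenset(), "B": frozenset({"A"}), "C": frozenset({"B"})}
def toy_score(node, pa):
    return -len(pa ^ target[node])

graph, score = hill_climb(["A", "B", "C"], toy_score)
print(graph, score)
```

Because the stopping rule rejects moves that leave the score unchanged, this version halts on plateaus as well as at local maxima.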
152Greedy Hill-Climbing Possible Pitfalls
- Greedy Hill-Climbing can get stuck in
- Local Maxima
- All one-edge changes reduce the score
- Plateaus
- Some one-edge changes leave the score unchanged
- Happens because equivalent networks receive the same score and are neighbors in the search space
- Both occur during structure search
- Standard heuristics can escape both
- Random restarts
- TABU search
153Search Summary
- Discrete optimization problem
- In general, NP-Hard
- Need to resort to heuristic search
- In practice, search is relatively fast (100 vars in 10 min)
- Decomposability
- Sufficient statistics
- In some cases, we can reduce the search problem to an easy optimization problem
- Example: learning trees
154Graphical Models Intro Summary
- Representations
- Graphs are a cool way to put constraints on distributions, so that you can say lots of stuff without even looking at the numbers!
- Inference
- GMs let you compute all kinds of different probabilities efficiently
- Learning
- You can even learn them auto-magically!