Graphical Models: An Introduction

1
Graphical Models An Introduction
  • Lise Getoor
  • Computer Science Dept
  • University of Maryland
  • http://www.cs.umd.edu/getoor

2
Reading List for Next Lecture
  • Learning Probabilistic Relational Models, L.
    Getoor, N. Friedman, D. Koller, A. Pfeffer.
    Invited contribution to the book Relational Data
    Mining, S. Dzeroski and N. Lavrac, Eds.,
    Springer-Verlag, 2001.
  • http://www.cs.umd.edu/getoor/Publications/lprm-ch.ps
  • http://www.cs.umd.edu/class/spring2005/cmsc828g/Readings/lprm-ch.pdf
  • Probabilistic Models for Relational Data, David
    Heckerman, Christopher Meek and Daphne Koller
  • http://www.cs.umd.edu/projects/srl2004/Papers/heckerman.pdf
  • ftp://ftp.research.microsoft.com/pub/tr/TR-2004-30.pdf

3
Graphical Models
  • e.g. Bayesian networks, Bayes nets, Belief nets,
    Markov networks, HMMs, Dynamic Bayes nets, etc.
  • Themes
  • representation
  • reasoning
  • learning
  • Materials based on upcoming book by Nir Friedman
    and Daphne Koller.
  • Slides based on material from Nir Friedman.

4
Probability Distributions
  • Let X1, ..., Xp be discrete random variables
  • Let P be a joint distribution over X1, ..., Xp
  • If the variables are binary, then we need O(2^p)
    parameters to describe P
  • Can we do better?
  • Key idea: use properties of independence

5
Independent Random Variables
  • Two variables X and Y are independent if
  • P(X = x | Y = y) = P(X = x) for all values x, y
  • That is, learning the value of Y does not change
    the prediction of X
  • If X and Y are independent then
  • P(X, Y) = P(X | Y) P(Y) = P(X) P(Y)
  • In general, if X1, ..., Xp are independent, then
  • P(X1, ..., Xp) = P(X1) ... P(Xp)
  • Requires O(p) parameters

6
Conditional Independence
  • Unfortunately, most random variables of
    interest are not independent of each other
  • A more suitable notion is that of conditional
    independence
  • Two variables X and Y are conditionally
    independent given Z if
  • P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all
    values x, y, z
  • That is, learning the value of Y does not change
    the prediction of X once we know the value of Z
  • Notation: I(X; Y | Z)

7
Example: Naïve Bayesian Model
  • A common model in early diagnosis
  • Symptoms are conditionally independent given the
    disease (or fault)
  • Thus, if
  • X1, ..., Xp denote the symptoms exhibited by
    the patient (headache, high fever, etc.) and
  • H denotes the hypothesis about the patient's
    health
  • then P(X1, ..., Xp, H) = P(H) P(X1 | H) ... P(Xp | H)
  • This naïve Bayesian model allows compact
    representation
  • It does embody strong independence assumptions
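
The naïve Bayesian factorization can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout, and all probabilities below are made up for the example and do not come from the slides.

```python
def naive_bayes_posterior(prior, likelihoods, observed):
    """Return P(H | x1, ..., xp) under the naive Bayes factorization.

    prior[h]             = P(H = h)
    likelihoods[i][h][x] = P(Xi = x | H = h)
    observed[i]          = observed value of Xi
    """
    joint = {}
    for h, p_h in prior.items():
        p = p_h
        # Multiply in one conditional per symptom: P(H) * prod_i P(xi | H)
        for cpd, x in zip(likelihoods, observed):
            p *= cpd[h][x]
        joint[h] = p
    z = sum(joint.values())  # P(x1, ..., xp)
    return {h: p / z for h, p in joint.items()}


# Hypothetical numbers: two symptoms (headache, fever), two hypotheses.
prior = {"flu": 0.1, "healthy": 0.9}
likelihoods = [
    {"flu": {True: 0.8, False: 0.2}, "healthy": {True: 0.1, False: 0.9}},
    {"flu": {True: 0.7, False: 0.3}, "healthy": {True: 0.05, False: 0.95}},
]
posterior = naive_bayes_posterior(prior, likelihoods, (True, True))
```

With p binary symptoms and a binary H, this model needs 2p + 1 numbers instead of the 2^(p+1) - 1 required by an unrestricted joint table.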

8
Graphical Models
  • A graph is a language for representing independencies
  • Directed Acyclic Graph -> Bayesian Network
  • Undirected Graph -> Markov Network

9
DAGs: Markov Assumption
  • We now make this independence assumption more
    precise for directed acyclic graphs (DAGs)
  • Each random variable X is independent of its
    non-descendants, given its parents Pa(X)
  • Formally, I(X; NonDesc(X) | Pa(X))

[Figure: DAG with nodes labeled Ancestor, Parent, Non-descendant, Descendant]

10
Markov Assumption Example
  • In this example:
  • I(E; B)
  • I(B; E, R)
  • I(R; A, B, C | E)
  • I(A; R | B, E)
  • I(C; B, E, R | A)

11
I-Maps
  • A DAG G is an I-Map of a distribution P if
    all the Markov assumptions implied by G are
    satisfied by P
  • (Assuming G and P both use the same set of random
    variables)
  • Examples

12
Factorization
  • Given that G is an I-Map of P, can we simplify
    the representation of P?
  • Example
  • Since I(X; Y), we have that P(X | Y) = P(X)
  • Applying the chain rule: P(X, Y) = P(X | Y) P(Y)
    = P(X) P(Y)
  • Thus, we have a simpler representation of P(X,Y)

13
Factorization Theorem
Thm: if G is an I-Map of P, then
P(X1, ..., Xn) = ∏i P(Xi | Pa(Xi))
  • From assumption
  • Since G is an I-Map, I(Xi; NonDesc(Xi) | Pa(Xi))
  • We conclude, P(Xi | X1, ..., Xi-1) = P(Xi | Pa(Xi))

14
Factorization Example
  • P(C,A,R,E,B) = P(B) P(E|B) P(R|E,B) P(A|R,B,E)
    P(C|A,R,B,E)
  • versus P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E)
    P(C|A)
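
To make the saving concrete, the sketch below counts free parameters for the two factorizations, assuming every variable is binary (an assumption made here for illustration; the helper name is hypothetical). Each conditional P(X | parents) then needs one number per parent configuration.

```python
def num_parameters(parents):
    """Count free parameters when all variables are binary.

    parents maps each variable to the list of variables it is
    conditioned on; a binary variable with k parents needs 2**k numbers.
    """
    return sum(2 ** len(ps) for ps in parents.values())


# Plain chain rule: each variable conditioned on all earlier ones.
chain_rule = {
    "B": [],
    "E": ["B"],
    "R": ["E", "B"],
    "A": ["R", "B", "E"],
    "C": ["A", "R", "B", "E"],
}
# Factorization that uses the independencies of the example network.
network = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

full_count = num_parameters(chain_rule)
sparse_count = num_parameters(network)
```

Under the binary assumption the chain-rule form needs 1 + 2 + 4 + 8 + 16 parameters, while the network form needs 1 + 1 + 2 + 4 + 2.
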
15
Consequences
  • We can write P in terms of local conditional
    probabilities
  • If G is sparse,
  • that is, |Pa(Xi)| < k,
  • ⇒ each conditional probability can be specified
    compactly
  • e.g. for binary variables, these require O(2^k)
    params.
  • ⇒ representation of P is compact
  • linear in number of variables

16
DAGs: Summary
  • The Markov independencies of a DAG G
  • I(Xi; NonDesc(Xi) | Pai)
  • G is an I-Map of a distribution P
  • if P satisfies the Markov independencies implied
    by G
  • If G is an I-Map of P, then
    P(X1, ..., Xn) = ∏i P(Xi | Pai)

17
Conditional Independencies
  • Let Markov(G) be the set of Markov Independencies
    implied by G
  • The factorization theorem shows
  • G is an I-Map of P ⇒ P(X1, ..., Xn) = ∏i P(Xi | Pai)
  • We can also show the opposite
  • Thm:
  • P(X1, ..., Xn) = ∏i P(Xi | Pai) ⇒ G is an I-Map of P

18
Implied Independencies
  • Does a graph G imply additional independencies as
    a consequence of Markov(G)?
  • We can define a logic of independence statements
  • Some axioms
  • I(X; Y | Z) ⇒ I(Y; X | Z)
  • I(X; Y1, Y2 | Z) ⇒ I(X; Y1 | Z)

19
d-separation
  • A procedure d-sep(X; Y | Z, G) that, given a DAG
    G and sets X, Y, and Z, returns either yes or no
  • Goal:
  • d-sep(X; Y | Z, G) = yes iff I(X; Y | Z) follows
    from Markov(G)

20
Paths
  • Intuition: dependency must flow along paths in
    the graph
  • A path is a sequence of neighboring variables
  • Examples:
  • R ← E → A ← B
  • C ← A ← E → R

21
Paths
  • We want to know when a path is
  • active -- creates dependency between end nodes
  • blocked -- cannot create dependency between end nodes
  • We want to classify situations in which paths are
    active.

22
Path Blockage
  • Three cases
  • Common cause

23
Path Blockage
  • Three cases
  • Common cause
  • Intermediate cause

24
Path Blockage
  • Three cases
  • Common cause
  • Intermediate cause
  • Common Effect

25
Path Blockage -- General Case
  • A path is active, given evidence Z, if
  • Whenever we have the configuration A → B ← C,
    B or one of its descendants is in Z
  • No other nodes in the path are in Z
  • A path is blocked, given evidence Z, if it is not
    active.

[Figure: the configuration A → B ← C]

26
Example
  • d-sep(R; B)?

[Figure: network over nodes E, B, A, R, C]

27
Example
  • d-sep(R; B) = yes
  • d-sep(R; B | A)?

[Figure: network over nodes E, B, A, R, C]

28
Example
  • d-sep(R; B) = yes
  • d-sep(R; B | A) = no
  • d-sep(R; B | E, A)?

[Figure: network over nodes E, B, A, R, C]

29
d-Separation
  • X is d-separated from Y, given Z, if all paths
    from a node in X to a node in Y are blocked,
    given Z.
  • Checking d-separation can be done efficiently
    (linear time in number of edges)
  • Bottom-up phase: mark all nodes whose
    descendants are in Z
  • X to Y phase: traverse (BFS) all edges on paths
    from X to Y and check if they are blocked
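
The two phases can be sketched as follows. This is one possible implementation for single query nodes x and y, not necessarily the procedure the slides had in mind; the function name and the parent-list encoding of the DAG are assumptions. The example graph is the five-node network B → A ← E → R, A → C implied by the factorization example.

```python
from collections import deque


def d_separated(parents, x, y, z):
    """Return True if node x is d-separated from node y given the set z.

    parents maps every node of the DAG to the list of its parents.
    """
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)
    z = set(z)

    # Bottom-up phase: collect Z and all ancestors of Z, i.e. the nodes
    # that are in Z or have a descendant in Z.
    in_or_above_z = set()
    stack = list(z)
    while stack:
        n = stack.pop()
        if n not in in_or_above_z:
            in_or_above_z.add(n)
            stack.extend(parents[n])

    # X to Y phase: BFS over (node, direction) pairs.
    # "up" means we reached the node from one of its children,
    # "down" means we reached it from one of its parents.
    queue = deque([(x, "up")])
    visited = set()
    while queue:
        n, direction = queue.popleft()
        if (n, direction) in visited:
            continue
        visited.add((n, direction))
        if n == y and n not in z:
            return False  # found an active path
        if direction == "up" and n not in z:
            queue.extend((p, "up") for p in parents[n])
            queue.extend((c, "down") for c in children[n])
        elif direction == "down":
            if n not in z:
                # chain: keep moving to children
                queue.extend((c, "down") for c in children[n])
            if n in in_or_above_z:
                # common effect that is activated by the evidence
                queue.extend((p, "up") for p in parents[n])
    return True


network = {"B": [], "E": [], "A": ["B", "E"], "R": ["E"], "C": ["A"]}
```

On this graph, `d_separated(network, "R", "B", [])` holds, conditioning on A breaks it, and conditioning on both E and A restores it.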

30
Soundness
  • Thm
  • If
  • G is an I-Map of P
  • d-sep(X; Y | Z, G) = yes
  • then
  • P satisfies I(X; Y | Z)
  • Informally,
  • Any independence reported by d-separation is
    satisfied by underlying distribution

31
Completeness
  • Thm
  • If d-sep(X; Y | Z, G) = no
  • then there is a distribution P such that
  • G is an I-Map of P
  • P does not satisfy I(X; Y | Z)
  • Informally,
  • Any independence not reported by d-separation
    might be violated by the underlying distribution
  • We cannot determine this by examining the graph
    structure alone

32
I-Maps revisited
  • The fact that G is an I-Map of P might not be that
    useful
  • For example, complete DAGs
  • A DAG G is complete if we cannot add an arc
    without creating a cycle
  • These DAGs do not imply any independencies
  • Thus, they are I-Maps of any distribution

33
Minimal I-Maps
  • A DAG G is a minimal I-Map of P if
  • G is an I-Map of P
  • If G' ⊂ G, then G' is not an I-Map of P
  • Removing any arc from G introduces
    (conditional) independencies that do not hold in P

34
Minimal I-Map Example
  • If [graph not shown] is a
    minimal I-Map
  • Then, these [graphs not shown] are not I-Maps

35
Constructing minimal I-Maps
  • The factorization theorem suggests an algorithm
  • Fix an ordering X1, ..., Xn
  • For each i,
  • select Pai to be a minimal subset of
    {X1, ..., Xi-1} such that
    I(Xi; {X1, ..., Xi-1} − Pai | Pai)
  • Clearly, the resulting graph is a minimal I-Map.

36
Non-uniqueness of minimal I-Map
  • Unfortunately, there may be several minimal
    I-Maps for the same distribution
  • Applying I-Map construction procedure with
    different orders can lead to different structures

[Figure: two graphs, labeled "Original I-Map" and "Order: C, R, A, E, B"]

37
Choosing Ordering & Causality
  • The choice of order can have drastic impact on
    the complexity of the minimal I-Map
  • Heuristic argument: construct the I-Map using a
    causal ordering among variables
  • Justification?
  • It is often reasonable to assume that graphs of
    causal influence should satisfy the Markov
    properties.

38
P-Maps
  • A DAG G is a P-Map (perfect map) of a distribution
    P if
  • I(X; Y | Z) if and only if d-sep(X; Y | Z, G)
    = yes
  • Notes
  • A P-Map captures all the independencies in the
    distribution
  • P-Maps are unique, up to DAG equivalence

39
P-Maps
  • Unfortunately, some distributions do not have a
    P-Map

40
Bayesian Networks
  • A Bayesian network specifies a probability
    distribution via two components
  • A DAG G
  • A collection of conditional probability
    distributions P(Xi | Pai)
  • The joint distribution P is defined by the
    factorization P(X1, ..., Xn) = ∏i P(Xi | Pai)
  • Additional requirement: G is a minimal I-Map of P

42
DAGs and BNs
  • DAGs as a representation of conditional
    independencies
  • Markov independencies of a DAG
  • Tight correspondence between Markov(G) and the
    factorization defined by G
  • d-separation, a sound & complete procedure for
    computing the consequences of the independencies
  • Notion of minimal I-Map
  • P-Maps
  • This theory is the basis for defining Bayesian
    networks

43
Undirected Graphs: Markov Networks
  • Alternative representation of conditional
    independencies
  • Let U be an undirected graph
  • Let Ni be the set of neighbors of Xi
  • Define Markov(U) to be the set of independencies
    I(Xi; {X1, ..., Xn} − Ni − {Xi} | Ni)
  • U is an I-Map of P if P satisfies Markov(U)

44
Example
  • This graph implies that
  • I(A; C | B, D)
  • I(B; D | A, C)
  • Note: this example does not have a directed P-Map

[Figure: undirected graph over nodes A, B, C, D]

45
Markov Network Factorization
  • Thm: if
  • P is strictly positive, that is, P(x1, ..., xn) > 0
    for all assignments
  • then
  • U is an I-Map of P
  • if and only if
  • there is a factorization
    P(x1, ..., xn) = (1/Z) ∏j fj(Cj)
  • where C1, ..., Ck are the maximal cliques in U
  • Alternative form

46
Relationship between Directed & Undirected Models
Chain Graphs
Directed Graphs
Undirected Graphs
47
CPDs
  • So far, we focused on how to represent
    independencies using DAGs
  • The other component of a Bayesian network is
    the specification of the conditional probability
    distributions (CPDs)
  • Here, we'll just discuss the simplest
    representation of CPDs

48
Tabular CPDs
  • When the variables of interest are all discrete,
    the common representation is as a table
  • For example, P(C | A, B) can be represented by

49
Tabular CPDs
  • Pros
  • Very flexible, can capture any CPD of discrete
    variables
  • Can be easily stored and manipulated
  • Cons
  • Representation size grows exponentially with the
    number of parents!
  • Unwieldy to assess probabilities for more than
    few parents

50
Continuous CPDs
  • When X is a continuous variable, we need to
    represent the density of X, given any value of
    its parents
  • Gaussian
  • Conditional Gaussian

51
CPDs Summary
  • Many choices for representing CPDs
  • Any statistical model of conditional
    distribution can be used
  • e.g., any regression model
  • Representing structure in CPDs can have
    implications on independencies among variables

52
Inference in Bayesian Networks
53
Inference
  • We now have compact representations of
    probability distributions
  • Bayesian Networks
  • Markov Networks
  • Network describes a unique probability
    distribution P
  • How do we answer queries about P?
  • Inference is the name for the process of computing
    answers to such queries

54
Queries: Likelihood
  • There are many types of queries we might ask.
  • Most of these involve evidence
  • An evidence e is an assignment of values to a set
    E of variables in the domain
  • Without loss of generality, E = {Xk+1, ..., Xn}
  • Simplest query: compute the probability of
    evidence, P(e) = Σx1 ... Σxk P(x1, ..., xk, e)
  • This is often referred to as computing the
    likelihood of the evidence

55
Queries: A posteriori belief
  • Often we are interested in the conditional
    probability of a variable given the evidence,
    P(X | e) = P(X, e) / P(e)
  • This is the a posteriori belief in X, given
    evidence e
  • A related task is computing the term P(X, e)
  • i.e., the likelihood of e and X = x for values
    x of X
  • we can recover the a posteriori belief by
    P(X = x | e) = P(X = x, e) / Σx' P(X = x', e)

56
A posteriori belief
  • This query is useful in many cases
  • Prediction: what is the probability of an outcome
    given the starting condition
  • Target is a descendant of the evidence
  • Diagnosis: what is the probability of
    disease/fault given symptoms
  • Target is an ancestor of the evidence
  • Note: the direction of the arcs between variables
    does not restrict the directions of the queries
  • Probabilistic inference can combine evidence from
    all parts of the network

57
Queries: MAP
  • In this query we want to find the maximum a
    posteriori assignment for some variables of
    interest (say X1, ..., Xl)
  • That is, x1, ..., xl maximize the probability
    P(x1, ..., xl | e)
  • Note that this is equivalent to
    maximizing P(x1, ..., xl, e)

58
Queries: MAP
  • We can use MAP for
  • Classification
  • find most likely label, given the evidence
  • Explanation
  • What is the most likely scenario, given the
    evidence

59
Queries: MAP
  • Cautionary note:
  • The MAP depends on the set of variables
  • Example:
  • MAP of X
  • MAP of (X, Y)
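
The slide's own example table did not survive, so here is a small stand-in. The joint distribution below is hypothetical; it is chosen only so that the most likely value of X alone differs from the X component of the most likely (X, Y) pair.

```python
# Hypothetical joint distribution P(X, Y); the numbers are illustrative only.
joint = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}


def map_assignment(table):
    """Return the key with the highest probability."""
    return max(table, key=table.get)


def marginal_x(table):
    """Sum out Y to obtain P(X)."""
    out = {}
    for (x, _y), p in table.items():
        out[x] = out.get(x, 0.0) + p
    return out


map_x = map_assignment(marginal_x(joint))   # MAP of X alone
map_xy = map_assignment(joint)              # MAP of the pair (X, Y)
```

Here P(X = 1) = 0.6 beats P(X = 0) = 0.4, yet the single most likely joint assignment has X = 0.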

60
Complexity of Inference
  • Thm
  • Computing P(X = x) in a Bayesian network is
    NP-hard
  • Not surprising, since we can simulate Boolean
    gates.

61
Hardness
  • Hardness does not mean we cannot solve inference
  • It implies that we cannot find a general
    procedure that works efficiently for all networks
  • For particular families of networks, we can have
    provably efficient procedures

62
Approaches to inference
  • Exact inference
  • Inference in Simple Chains
  • Variable elimination
  • Clustering / join tree algorithms
  • Approximate inference
  • Stochastic simulation / sampling methods
  • Markov chain Monte Carlo methods
  • Mean field theory

63
Inference in Simple Chains
[Figure: chain X1 → X2]
  • How do we compute P(X2)?

64
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2 → X3]
  • How do we compute P(X3)?
  • we already know how to compute P(X2)...

65
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2 → X3 → ... → Xn]
  • How do we compute P(Xn)?
  • Compute P(X1), P(X2), P(X3), ...
  • We compute each term by using the previous one
  • Complexity:
  • Each step costs O(|Val(Xi)| · |Val(Xi+1)|)
    operations
  • Compare to naïve evaluation, that requires
    summing over joint values of n-1 variables
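
The forward iteration can be sketched as below, using P(Xi+1) = Σxi P(xi) P(Xi+1 | xi). The list-based encoding, the function name, and the numbers are assumptions made for this example.

```python
def forward_marginals(prior, transitions):
    """Compute P(X1), P(X2), ..., P(Xn) for a chain X1 -> X2 -> ... -> Xn.

    prior[a]             = P(X1 = a)
    transitions[i][a][b] = P(X_{i+2} = b | X_{i+1} = a)
    """
    marginals = [list(prior)]
    for table in transitions:
        prev = marginals[-1]
        # One step: |Val(Xi)| * |Val(Xi+1)| multiply-adds.
        nxt = [
            sum(prev[a] * table[a][b] for a in range(len(prev)))
            for b in range(len(table[0]))
        ]
        marginals.append(nxt)
    return marginals


# Hypothetical binary chain X1 -> X2 -> X3 with the same table at each step.
step = [[0.9, 0.1], [0.2, 0.8]]
marginals = forward_marginals([0.6, 0.4], [step, step])
```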

66
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2]
  • Suppose that we observe the value of X2 = x2
  • How do we compute P(X1 | x2)?
  • Recall that it suffices to compute P(X1, x2)

67
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2 → X3]
  • Suppose that we observe the value of X3 = x3
  • How do we compute P(X1, x3)?
  • How do we compute P(x3 | x1)?

68
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2 → X3 → ... → Xn]
  • Suppose that we observe the value of Xn = xn
  • How do we compute P(X1, xn)?
  • We compute P(xn | xn-1), P(xn | xn-2), ... iteratively

69
Inference in Simple Chains (cont.)
[Figure: chain X1 → X2 → ... → Xk → ... → Xn]
  • Suppose that we observe the value of Xn = xn
  • We want to find P(Xk | xn)
  • How do we compute P(Xk, xn)?
  • We compute P(Xk) by forward iterations
  • We compute P(xn | Xk) by backward iterations
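
A minimal sketch of combining the two passes, relying on P(Xk, xn) = P(Xk) P(xn | Xk), which holds in a chain. The encoding (0-based index k, list-of-lists tables) and the function name are assumptions of this example.

```python
def posterior_in_chain(prior, transitions, k, x_n):
    """Return P(Xk | Xn = x_n) in a chain, with k a 0-based index.

    prior[a]             = P(X1 = a)
    transitions[i][a][b] = P(next = b | current = a) for the i-th arc
    """
    # Forward iterations: P(Xk)
    fwd = list(prior)
    for table in transitions[:k]:
        fwd = [
            sum(fwd[a] * table[a][b] for a in range(len(fwd)))
            for b in range(len(table[0]))
        ]

    # Backward iterations: P(x_n | Xk), starting from an indicator on x_n
    n_values = len(transitions[-1][0]) if transitions else len(prior)
    bwd = [1.0 if v == x_n else 0.0 for v in range(n_values)]
    for table in reversed(transitions[k:]):
        bwd = [
            sum(table[a][b] * bwd[b] for b in range(len(bwd)))
            for a in range(len(table))
        ]

    joint = [f * b for f, b in zip(fwd, bwd)]   # P(Xk, x_n)
    z = sum(joint)                              # P(x_n)
    return [p / z for p in joint]


# Hypothetical two-node chain X1 -> X2, observing X2 = 1.
posterior = posterior_in_chain([0.6, 0.4], [[[0.9, 0.1], [0.2, 0.8]]], 0, 1)
```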

70
Elimination in Chains
  • We now try to understand the simple chain example
    using first-order principles
  • Using definition of probability, we have

71
Elimination in Chains
  • By chain decomposition, we get

72
Elimination in Chains
  • Rearranging terms ...

73
Elimination in Chains
  • Now we can perform innermost summation
  • This summation is exactly the first step in the
    forward iteration we described before

74
Elimination in Chains
  • Rearranging and then summing again, we get

75
Elimination in Chains with Evidence
  • Similarly, we understand the backward pass
  • We write the query in explicit form

76
Elimination in Chains with Evidence
  • Eliminating d, we get

77
Elimination in Chains with Evidence
  • Eliminating c, we get

78
Elimination in Chains with Evidence
  • Finally, we eliminate b

79
Variable Elimination
  • General idea
  • Write query in the form
  • Iteratively
  • Move all irrelevant terms outside of innermost
    sum
  • Perform innermost sum, getting a new term
  • Insert the new term into the product
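
The loop above can be sketched with factors stored as (variables, table) pairs. This is a bare-bones illustration, not an optimized implementation; the representation and the function names `multiply`, `sum_out`, and `eliminate` are choices made for this example.

```python
from itertools import product


def multiply(f, g, domains):
    """Pointwise product of two factors."""
    f_vars, f_table = f
    g_vars, g_table = g
    all_vars = tuple(dict.fromkeys(f_vars + g_vars))  # ordered union
    table = {}
    for assignment in product(*(domains[v] for v in all_vars)):
        a = dict(zip(all_vars, assignment))
        table[assignment] = (
            f_table[tuple(a[v] for v in f_vars)]
            * g_table[tuple(a[v] for v in g_vars)]
        )
    return all_vars, table


def sum_out(f, var):
    """Perform the innermost sum over var, producing a new factor."""
    f_vars, f_table = f
    idx = f_vars.index(var)
    table = {}
    for assignment, value in f_table.items():
        key = assignment[:idx] + assignment[idx + 1:]
        table[key] = table.get(key, 0.0) + value
    return f_vars[:idx] + f_vars[idx + 1:], table


def eliminate(factors, order, domains):
    """Eliminate the variables in `order`, return the remaining factor."""
    factors = list(factors)
    for var in order:
        relevant = [f for f in factors if var in f[0]]
        if not relevant:
            continue
        # Terms that do not mention var stay outside the sum.
        factors = [f for f in factors if var not in f[0]]
        prod = relevant[0]
        for f in relevant[1:]:
            prod = multiply(prod, f, domains)
        factors.append(sum_out(prod, var))  # insert the new term
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, domains)
    return result


# Hypothetical two-node network A -> B.
domains = {"A": [0, 1], "B": [0, 1]}
factors = [
    (("A",), {(0,): 0.6, (1,): 0.4}),
    (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}),
]
result_vars, result_table = eliminate(factors, ["A"], domains)
```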

80
A More Complex Example
  • Asia network

81
  • We want to compute P(d)
  • Need to eliminate v,s,x,t,l,a,b
  • Initial factors

82
  • We want to compute P(d)
  • Need to eliminate v,s,x,t,l,a,b
  • Initial factors

Eliminate v
Note: fv(t) = P(t). In general, the result of
elimination is not necessarily a probability term
83
  • We want to compute P(d)
  • Need to eliminate s,x,t,l,a,b
  • Initial factors

Eliminate s
Summing on s results in a factor with two
arguments, fs(b,l). In general, the result of
elimination may be a function of several variables
84
  • We want to compute P(d)
  • Need to eliminate x,t,l,a,b
  • Initial factors

Eliminate x
Note: fx(a) = 1 for all values of a!
85
  • We want to compute P(d)
  • Need to eliminate t,l,a,b
  • Initial factors

Eliminate t
86
  • We want to compute P(d)
  • Need to eliminate l,a,b
  • Initial factors

Eliminate l
87
  • We want to compute P(d)
  • Need to eliminate a,b
  • Initial factors

Eliminate a,b
88
Variable Elimination
  • We now understand variable elimination as a
    sequence of rewriting operations
  • Actual computation is done in elimination step
  • Exactly the same computation procedure applies to
    Markov networks
  • Computation depends on order of elimination

89
Dealing with evidence
  • How do we deal with evidence?
  • Suppose we get evidence V = t, S = f, D = t
  • We want to compute P(L, V = t, S = f, D = t)

90
Dealing with Evidence
  • We start by writing the factors
  • Since we know that V = t, we don't need to
    eliminate V
  • Instead, we can replace the factors P(V) and
    P(T | V) with fP(V) = P(V = t) and
    fP(T|V)(T) = P(T | V = t)
  • These select the appropriate parts of the
    original factors given the evidence
  • Note that fP(V) is a constant, and thus does not
    appear in elimination of other variables
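
Selecting "the appropriate parts of the original factors" amounts to keeping only the table rows that agree with the evidence. A sketch, assuming factors are stored as (variables, table) pairs; the helper name `restrict` and the numbers are hypothetical.

```python
def restrict(factor, var, value):
    """Keep only the entries of a factor where var == value.

    The evidence variable is dropped from the factor's scope, so the
    result is a smaller factor over the remaining variables.
    """
    f_vars, f_table = factor
    if var not in f_vars:
        return factor
    idx = f_vars.index(var)
    table = {
        assignment[:idx] + assignment[idx + 1:]: p
        for assignment, p in f_table.items()
        if assignment[idx] == value
    }
    return f_vars[:idx] + f_vars[idx + 1:], table


# Hypothetical table for P(T | V), keyed by (v, t).
p_t_given_v = (
    ("V", "T"),
    {("t", "t"): 0.05, ("t", "f"): 0.95, ("f", "t"): 0.01, ("f", "f"): 0.99},
)
restricted = restrict(p_t_given_v, "V", "t")
```

After restriction the factor mentions only T, so V never needs to be summed out.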

91
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting evidence

92
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting evidence
  • Eliminating x, we get

93
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting evidence
  • Eliminating x, we get
  • Eliminating t, we get

94
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting evidence
  • Eliminating x, we get
  • Eliminating t, we get
  • Eliminating a, we get

95
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting evidence
  • Eliminating x, we get
  • Eliminating t, we get
  • Eliminating a, we get
  • Eliminating b, we get

96
Complexity of variable elimination
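  • Suppose in one elimination step we compute
    fx(y1, ..., yk) = Σx f'x(x, y1, ..., yk), where
    f'x(x, y1, ..., yk) = ∏j fj(x, ...) is the
    product of m factors
  • This requires
  • m · |Val(X)| · ∏i |Val(Yi)| multiplications
  • For each value of x, y1, ..., yk, we do m
    multiplications
  • |Val(X)| · ∏i |Val(Yi)| additions
  • For each value of y1, ..., yk, we do |Val(X)|
    additions
  • Complexity is exponential in number of variables
    in the intermediate factor!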

97
Understanding Variable Elimination
  • We want to select good elimination orderings
    that reduce complexity
  • We start by attempting to understand variable
    elimination via the graph we are working with
  • This will reduce the problem of finding good
    ordering to graph-theoretic operation that is
    well-understood

98
Undirected graph representation
  • At each stage of the procedure, we have an
    algebraic term that we need to evaluate
  • In general this term is of the form
    Σy1 ... Σyk ∏i fi(Zi), where Zi are sets of variables
  • We now plot a graph where there is undirected
    edge X--Y if X,Y are arguments of some factor
  • that is, if X,Y are in some Zi
  • Note this is the Markov network that describes
    the probability on the variables we did not
    eliminate yet

99
Chordal Graphs
  • elimination ordering ⇒ undirected chordal graph
  • Graph
  • Maximal cliques are factors in elimination
  • Factors in elimination are cliques in the graph
  • Complexity is exponential in size of the largest
    clique in graph

100
Induced Width
  • The size of the largest clique in the induced
    graph is thus an indicator for the complexity of
    variable elimination
  • This quantity is called the induced width of a
    graph according to the specified ordering
  • Finding a good ordering for a graph is equivalent
    to finding the minimal induced width of the graph
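
The effect of the ordering can be checked by simulating elimination on the undirected graph. The sketch below reports the largest number of neighbors a variable has at the moment it is eliminated; the largest clique in the induced graph is one more than that. The function name and the edge-list encoding are assumptions of this example.

```python
def induced_width(edges, order):
    """Simulate elimination and return the largest neighbor count seen.

    edges: iterable of undirected edges (u, v)
    order: elimination ordering containing every node exactly once
    """
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    width = 0
    for v in order:
        nbrs = adj.pop(v)
        width = max(width, len(nbrs))
        # Eliminating v connects all of its remaining neighbors.
        for u in nbrs:
            adj[u].discard(v)
            adj[u] |= nbrs - {u}
    return width


# A three-node chain A - B - C under two different orderings.
chain = [("A", "B"), ("B", "C")]
good = induced_width(chain, ["A", "B", "C"])
bad = induced_width(chain, ["B", "A", "C"])
```

Eliminating the middle node first creates a fill-in edge A - C, which is why the second ordering is worse.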

101
General Networks
  • From graph theory
  • Thm
  • Finding an ordering that minimizes the induced
    width is NP-Hard
  • However,
  • There are reasonable heuristics for finding
    relatively good orderings
  • There are provable approximations to the best
    induced width
  • If the graph has a small induced width, there are
    algorithms that find it in polynomial time

102
Elimination on Trees
  • Formally, for any tree, there is an elimination
    ordering with induced width 1
  • Thm
  • Inference on trees is linear in number of
    variables

103
PolyTrees
  • A polytree is a network where there is at most
    one path from one variable to another
  • Thm
  • Inference in a polytree is linear in the
    representation size of the network
  • This assumes tabular CPT representation

104
Approaches to inference
  • Exact inference
  • Inference in Simple Chains
  • Variable elimination
  • Clustering / join tree algorithms
  • Approximate inference
  • Stochastic simulation / sampling methods
  • Markov chain Monte Carlo methods
  • Mean field theory

105
Learning Bayesian Networks
106
Learning Bayesian networks
Inducer
107
Known Structure -- Complete Data
E, B, A: <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> ... <N,Y,Y>
Inducer
  • Network structure is specified
  • Inducer needs to estimate parameters
  • Data does not contain missing values

108
Unknown Structure -- Complete Data
E, B, A: <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> ... <N,Y,Y>
Inducer
  • Network structure is not specified
  • Inducer needs to select arcs & estimate
    parameters
  • Data does not contain missing values

109
Known Structure -- Incomplete Data
E, B, A: <Y,N,N> <Y,?,Y> <N,N,Y> <N,Y,?> ... <?,Y,Y>
Inducer
  • Network structure is specified
  • Data contains missing values
  • We consider assignments to missing values

110
Known Structure / Complete Data
  • Given a network structure G
  • And choice of parametric family for P(Xi | Pai)
  • Learn parameters for network
  • Goal
  • Construct a network that is closest to the
    probability distribution that generated the data

111
Learning Parameters for a Bayesian Network
  • Training data has the form

112
Learning Parameters for a Bayesian Network
  • Since we assume i.i.d. samples, the likelihood
    function is

113
Learning Parameters for a Bayesian Network
  • By definition of network, we get

114
Learning Parameters for a Bayesian Network
  • Rewriting terms, we get

115
General Bayesian Networks
  • Generalizing for any Bayesian network
  • The likelihood decomposes according to the
    structure of the network.

i.i.d. samples
Network factorization
116
General Bayesian Networks (Cont.)
  • Decomposition
  • ? Independent Estimation Problems
  • If the parameters for each family are not
    related, then they can be estimated independently
    of each other.

117
From Binomial to Multinomial
  • For example, suppose X can have the values
    1, 2, ..., K
  • We want to learn the parameters θ1, θ2, ..., θK
  • Sufficient statistics:
  • N1, N2, ..., NK - the number of times each outcome
    is observed
  • Likelihood function: L(θ : D) = ∏k θk^Nk
  • MLE: θk = Nk / Σl Nl

118
Likelihood for Multinomial Networks
  • When we assume that P(Xi | Pai) is multinomial,
    we get further decomposition

119
Likelihood for Multinomial Networks
  • When we assume that P(Xi | Pai) is multinomial,
    we get further decomposition
  • For each value pai of the parents of Xi we get an
    independent multinomial problem
  • The MLE is θ(xi | pai) = N(xi, pai) / N(pai)
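
Estimating one family from complete data reduces to counting. The sketch below computes N(xi, pai) / N(pai) for a single child; the function name and the dict-per-sample encoding are assumptions, and the five records are just the ones visible on the complete-data slide, used here as a toy data set.

```python
from collections import Counter


def mle_cpt(samples, child, parents):
    """Maximum-likelihood estimate of P(child | parents) from complete data.

    samples: list of dicts mapping variable name -> observed value
    Returns a dict keyed by (child_value, parent_values).
    """
    joint_counts = Counter()    # N(xi, pai)
    parent_counts = Counter()   # N(pai)
    for s in samples:
        pa = tuple(s[p] for p in parents)
        joint_counts[(s[child], pa)] += 1
        parent_counts[pa] += 1
    return {
        (x, pa): n / parent_counts[pa]
        for (x, pa), n in joint_counts.items()
    }


records = [("Y", "N", "N"), ("Y", "Y", "Y"), ("N", "N", "Y"),
           ("N", "Y", "Y"), ("N", "Y", "Y")]
samples = [dict(zip(("E", "B", "A"), r)) for r in records]
cpt = mle_cpt(samples, child="A", parents=["B"])
```

Parent configurations that never occur in the data get no entry at all, which is one practical motivation for the Bayesian treatment with priors.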

120
Bayesian Approach: Dirichlet Priors
  • Recall that the likelihood function is
    L(θ : D) = ∏k θk^Nk
  • A Dirichlet prior with hyperparameters α1, ..., αK
    is defined as
    P(θ) ∝ ∏k θk^(αk − 1)
  • for legal θ1, ..., θK
  • Then the posterior has the same form, with
    hyperparameters α1 + N1, ..., αK + NK

121
Dirichlet Priors (cont.)
  • We can compute the prediction on a new event in
    closed form
  • If P(θ) is Dirichlet with hyperparameters α1, ..., αK
    then P(X[1] = k) = αk / Σl αl
  • Since the posterior is also Dirichlet, we get
    P(X[M+1] = k | D) = (αk + Nk) / Σl (αl + Nl)
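
The closed-form prediction is just "prior counts plus observed counts, normalized". A small sketch, with a hypothetical function name and made-up counts:

```python
def dirichlet_predictive(alpha, counts):
    """Predictive distribution for the next outcome.

    alpha:  Dirichlet hyperparameters (prior pseudo-counts), one per outcome
    counts: observed counts N1, ..., NK
    The posterior is Dirichlet with hyperparameters alpha_k + N_k, and the
    predictive probability of outcome k is proportional to alpha_k + N_k.
    """
    posterior = [a + n for a, n in zip(alpha, counts)]
    total = sum(posterior)
    return [p / total for p in posterior]


# Uniform prior over three outcomes, then three observations.
prediction = dirichlet_predictive([1, 1, 1], [2, 0, 1])
```

Unlike the MLE, the unseen second outcome still receives non-zero probability.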

122
Prior Knowledge
  • The hyperparameters α1, ..., αK can be thought of as
    imaginary counts from our prior experience
  • Equivalent sample size = α1 + ... + αK
  • The larger the equivalent sample size the more
    confident we are in our prior

123
Conjugate Families
  • The property that the posterior distribution
    follows the same parametric form as the prior
    distribution is called conjugacy
  • Dirichlet prior is a conjugate family for the
    multinomial likelihood
  • Conjugate families are useful since
  • For many distributions we can represent them with
    hyperparameters
  • They allow for sequential update within the same
    representation
  • In many cases we have closed-form solution for
    prediction

124
Bayesian Prediction (cont.)
  • Given these observations, we can compute the
    posterior for each multinomial θ(Xi | pai)
    independently
  • The posterior is Dirichlet with parameters
  • α(Xi=1 | pai) + N(Xi=1 | pai), ...,
    α(Xi=k | pai) + N(Xi=k | pai)
  • The predictive distribution is then represented
    by the parameters

125
Learning Parameters Summary
  • Estimation relies on sufficient statistics
  • For multinomial these are of the form N(xi, pai)
  • Parameter estimation
  • Bayesian methods also require choice of priors
  • Both MLE and Bayesian are asymptotically
    equivalent and consistent
  • Both can be implemented in an on-line manner by
    accumulating sufficient statistics

126
Learning Structure from Complete Data
127
Benefits of Learning Structure
  • Efficient learning -- more accurate models with
    less data
  • Compare P(A) and P(B) vs. joint P(A,B)
  • Discover structural properties of the domain
  • Ordering of events
  • Relevance
  • Identifying independencies ⇒ faster inference
  • Predict effect of actions
  • Involves learning causal relationship among
    variables

128
Why Struggle for Accurate Structure?
Missing an arc
  • Cannot be compensated by accurate fitting of
    parameters
  • Also misses causality and domain structure
Adding an arc
  • Increases the number of parameters to be fitted
  • Wrong assumptions about causality and domain
    structure

129
Approaches to Learning Structure
  • Constraint based
  • Perform tests of conditional independence
  • Search for a network that is consistent with the
    observed dependencies and independencies
  • Pros & Cons
  • Intuitive, follows closely the construction of
    BNs
  • Separates structure learning from the form of the
    independence tests
  • Sensitive to errors in individual tests

130
Approaches to Learning Structure
  • Score based
  • Define a score that evaluates how well the
    (in)dependencies in a structure match the
    observations
  • Search for a structure that maximizes the score
  • Pros & Cons
  • Statistically motivated
  • Can make compromises
  • Takes the structure of conditional probabilities
    into account
  • Computationally hard

131
Likelihood Score for Structures
  • First cut approach
  • Use likelihood function
  • Recall, the likelihood score for a network
    structure and parameters is
  • Since we know how to maximize parameters, from now
    on we assume

132
Likelihood Score for Structure (cont.)
  • Rearranging terms
  • where
  • H(X) is the entropy of X
  • I(X; Y) is the mutual information between X and Y
  • I(X; Y) measures how much information each
    variable provides about the other
  • I(X; Y) ≥ 0
  • I(X; Y) = 0 iff X and Y are independent
  • I(X; Y) = H(X) iff X is totally predictable given
    Y
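
The two extreme cases listed above can be checked numerically. The sketch computes mutual information in bits from a joint table; the function name and the example tables are assumptions made for illustration.

```python
import math


def mutual_information(joint):
    """Mutual information (in bits) of a joint table {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Zero-probability cells contribute nothing to the sum.
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# Independent fair coins: no information shared.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# X always equals Y: X is totally predictable given Y.
copy = {(0, 0): 0.5, (1, 1): 0.5}

mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
```

For the second table H(X) is 1 bit, and the mutual information matches it.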

133
Likelihood Score for Structure (cont.)
  • Good news
  • Intuitive explanation of likelihood score
  • The larger the dependency of each variable on its
    parents, the higher the score
  • Likelihood as a compromise among dependencies,
    based on their strength

134
Likelihood Score for Structure (cont.)
  • Bad news
  • Adding arcs always helps
  • I(X; Y) ≤ I(X; Y, Z)
  • Maximal score attained by fully connected
    networks
  • Such networks can overfit the data ---
    parameters capture the noise in the data

135
Avoiding Overfitting
  • Classic issue in learning.
  • Approaches
  • Restricting the hypotheses space
  • Limits the overfitting capability of the learner
  • Example: restrict # of parents or # of parameters
  • Minimum description length
  • Description length measures complexity
  • Prefer models that compactly describe the
    training data
  • Bayesian methods
  • Average over all possible parameter values
  • Use prior knowledge

136
Bayesian Inference
  • Bayesian Reasoning---compute expectation over
    unknown G
  • Assumption: Gs are mutually exclusive and
    exhaustive
  • We know how to compute P(x[M+1] | G, D)
  • Same as prediction with fixed structure
  • How do we compute P(G | D)?

137
Posterior Score
Using Bayes rule: P(G | D) = P(D | G) P(G) / P(D)
  • P(D | G): marginal likelihood
  • P(G): prior over structures
  • P(D): probability of data
P(D) is the same for all structures G and can be
ignored when comparing structures

138
Marginal Likelihood
  • By introduction of variables, we have that
    P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ
  • This integral measures sensitivity to choice of
    parameters

139
Marginal Likelihood for General Network
  • The marginal likelihood has the form
  • where
  • N(..) are the counts from the data
  • α(..) are the hyperparameters for each family
    given G

[Figure annotation: Dirichlet marginal likelihood for the
sequence of values of Xi when Xi's parents have a
particular value]

140
Priors
  • We need prior counts α(..) for each network
    structure G
  • This can be a formidable task
  • There are exponentially many structures

141
BDe Score
  • Possible solution: the BDe prior
  • Represent prior using two elements M0, B0
  • M0 - equivalent sample size
  • B0 - network representing the prior probability
    of events

142
BDe Score
  • Intuition: M0 prior examples distributed by B0
  • Set α(xi, paiG) = M0 · P(xi, paiG | B0)
  • Note that paiG are not the same as the parents of
    Xi in B0.
  • Compute P(xi, paiG | B0) using standard inference
    procedures
  • Such priors have desirable theoretical properties
  • Equivalent networks are assigned the same score

143
Bayesian Score: Asymptotic Behavior
  • Theorem: If the prior P(θ | G) is well-behaved,
    then

144
Asymptotic Behavior Consequences
  • Bayesian score is consistent
  • As M → ∞, the true structure G* maximizes the
    score (almost surely)
  • For sufficiently large M, the maximal scoring
    structures are equivalent to G*
  • Observed data eventually overrides prior
    information
  • Assuming that the prior assigns positive
    probability to all cases

145
Asymptotic Behavior
  • This score can also be justified by the Minimal
    Description Length (MDL) principle
  • This equation explicitly shows the tradeoff
    between
  • Fitness to data --- likelihood term
  • Penalty for complexity --- regularization term

146
Scores -- Summary
  • Likelihood, MDL, (log) BDe have the form
  • BDe requires assessing a prior network. It can
    naturally incorporate prior knowledge and
    previous experience
  • BDe is consistent and asymptotically equivalent
    (up to a constant) to MDL
  • All are score-equivalent
  • G equivalent to G' ⇒ Score(G) = Score(G')

147
Optimization Problem
  • Input
  • Training data
  • Scoring function (including priors, if needed)
  • Set of possible structures
  • Including prior knowledge about structure
  • Output
  • A network (or networks) that maximize the score
  • Key Property
  • Decomposability: the score of a network is a sum
    of terms.

148
Heuristic Search
  • We address the problem by using heuristic search
  • Define a search space
  • nodes are possible structures
  • edges denote adjacency of structures
  • Traverse this space looking for high-scoring
    structures
  • Search techniques
  • Greedy hill-climbing
  • Best first search
  • Simulated Annealing
  • ...

149
Heuristic Search (cont.)
  • Typical operations

Add C → D
Reverse C → E
Delete C → E
150
Exploiting Decomposability in Local Search
  • Caching: to update the score after a local
    change, we only need to re-score the families
    that were changed in the last move

151
Greedy Hill-Climbing
  • Simplest heuristic local search
  • Start with a given network
  • empty network
  • best tree
  • a random network
  • At each iteration
  • Evaluate all possible changes
  • Apply change that leads to best improvement in
    score
  • Reiterate
  • Stop when no modification improves score
  • Each step requires evaluating approximately n new
    changes
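
The loop can be sketched with the three local moves (add, delete, reverse an arc) and an acyclicity check. The scoring function is passed in as a parameter; the toy score used at the end is purely hypothetical and stands in for a real score such as BDe or MDL. All function names are choices made for this example.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Check acyclicity by repeatedly peeling off nodes with no parents."""
    parents = {n: {u for u, v in edges if v == n} for n in nodes}
    remaining = set(nodes)
    while remaining:
        roots = {n for n in remaining if not (parents[n] & remaining)}
        if not roots:
            return False
        remaining -= roots
    return True


def neighbors(nodes, edges):
    """Yield every structure reachable by one add, delete, or reverse."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                 # delete u -> v
            yield (edges - {(u, v)}) | {(v, u)}    # reverse u -> v
        elif (v, u) not in edges:
            yield edges | {(u, v)}                 # add u -> v


def hill_climb(nodes, score, start=frozenset()):
    """Greedy hill-climbing: take the best move until nothing improves."""
    current = frozenset(start)
    best = score(current)
    while True:
        candidates = [
            frozenset(e) for e in neighbors(nodes, current)
            if is_acyclic(nodes, e)
        ]
        if not candidates:
            return current, best
        top = max(candidates, key=score)
        if score(top) <= best:
            return current, best   # local maximum or plateau
        current, best = top, score(top)


# Toy score (hypothetical): negative distance to a fixed target structure.
target = frozenset({("A", "B"), ("B", "C")})
found, found_score = hill_climb(
    ["A", "B", "C"], lambda edges: -len(edges ^ target)
)
```

A real implementation would cache per-family scores rather than re-score whole structures, as the decomposability slide suggests.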

152
Greedy Hill-Climbing: Possible Pitfalls
  • Greedy Hill-Climbing can get stuck in
  • Local Maxima
  • All one-edge changes reduce the score
  • Plateaus
  • Some one-edge changes leave the score unchanged
  • Happens because equivalent networks receive the
    same score and are neighbors in the search space
  • Both occur during structure search
  • Standard heuristics can escape both
  • Random restarts
  • TABU search

153
Search Summary
  • Discrete optimization problem
  • In general, NP-Hard
  • Need to resort to heuristic search
  • In practice, search is relatively fast (100 vars
    in 10 min)
  • Decomposability
  • Sufficient statistics
  • In some cases, we can reduce the search problem
    to an easy optimization problem
  • Example learning trees

154
Graphical Models Intro Summary
  • Representations
  • Graphs are a cool way to put constraints on
    distributions, so that you can say lots of stuff
    without even looking at the numbers!
  • Inference
  • GMs let you compute all kinds of different
    probabilities efficiently
  • Learning
  • You can even learn them auto-magically!