Title: Finding Optimal Bayesian Networks with Greedy Search
1. Finding Optimal Bayesian Networks with Greedy Search
2. Reasoning Under Uncertainty
Print Troubleshooter (Win95, Win2k, WinXP)
[Figure: print-troubleshooter Bayesian network with nodes Net/Local Printing, Network Up, Correct Local Port, Correct Printer Path, PC to Printer Transport OK, Net Path OK, Local Path OK, Local Cable Connected, Net Cable Connected, Paper Loaded, Printer On and Online, Printer Data OK, Printer Memory Adequate, and Print Output OK]
3. Troubleshooters (Win95 onward)
4. Answer Wizard (Office 95 onward)
5. Machine Learning and Applied Statistics Group
Data
- Applications
- Commerce Server
- SQL Server
- Spam Detection
- Machine Translation
- Analysis of Web Data
6. Outline
- Bayesian Networks
- Learning
- Greedy Equivalence Search (GES)
- Optimality of GES
- (Details of Meek's Conjecture)
7. Bayesian Networks
- Use B = (S, θ) to represent p(X1, ..., Xn)
- Structural component: S is a DAG
- Parameters: θ specifies the local probability distributions (see the sketch below)
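To make the B = (S, θ) representation concrete, here is a minimal Python sketch (ours, not from the talk): S is a parent map over binary variables and θ holds one conditional probability table per variable.

# Minimal sketch of B = (S, theta) for binary variables.
# S: each variable maps to the list of its parents (a DAG).
S = {
    "X": [],
    "Y": ["X"],
    "Z": ["Y"],
}

# theta: for each variable, P(var = 1 | parent values), one entry per parent configuration.
theta = {
    "X": {(): 0.3},
    "Y": {(0,): 0.2, (1,): 0.9},
    "Z": {(0,): 0.5, (1,): 0.1},
}

def joint_prob(assignment, S, theta):
    """p(x1, ..., xn) = product over i of p(xi | Par(xi))."""
    prob = 1.0
    for var, parents in S.items():
        parent_vals = tuple(assignment[p] for p in parents)
        p_one = theta[var][parent_vals]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob

print(joint_prob({"X": 1, "Y": 1, "Z": 0}, S, theta))  # 0.3 * 0.9 * 0.9 = 0.243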
8. Markov Conditions
From the factorization: I(X, ND | Par(X)), i.e., every variable X is independent of its non-descendants (ND) given its parents Par(X) (spelled out below).
[Figure: a node X with its parents (Par), non-descendants (ND), and descendants (Desc)]
Markov Conditions + Graphoid Axioms characterize all independencies.
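For reference, a compact statement of the factorization and the local Markov condition the slide appeals to (standard notation; the slide itself shows only a figure), written in LaTeX:

p(x_1, \ldots, x_n) \;=\; \prod_{i} p\bigl(x_i \mid \mathrm{Par}(x_i)\bigr)
\quad \Longrightarrow \quad
I\bigl(X_i,\ \mathrm{ND}(X_i) \mid \mathrm{Par}(X_i)\bigr) \ \text{for every } i

where ND(X_i) is the set of non-descendants of X_i in S.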
9. Structure/Distribution Inclusion
[Figure: the set of all distributions, with the distribution p lying inside the region of distributions defined by a structure S over X, Y, Z]
- p is included in S if there exists θ s.t. B(S, θ) defines p
10. Structure/Structure Inclusion (T ≤ S)
[Figure: the set of all distributions, with the region for a structure T over X, Y, Z contained in the region for a structure S over X, Y, Z]
- T is included in S if every p included in T is included in S
  (S is an I-map of T)
11. Structure/Structure Equivalence (T ≡ S)
[Figure: the set of all distributions, with the regions for structures S and T over X, Y, Z coinciding]
Equivalence is reflexive, symmetric, and transitive.
12. Equivalence
[Figure: two DAGs over A, B, C, D with the same skeleton (adjacencies) and the same v-structures]
Theorem (Verma and Pearl, 1990): S ≡ T iff S and T have the same v-structures and skeletons. (A small test implementing this is sketched below.)
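A small Python sketch (ours) of the Verma-Pearl test just stated: two DAGs are compared on their skeletons and their v-structures.

# Sketch of the Verma-Pearl equivalence test: same skeleton + same v-structures.
# A DAG is given as a dict mapping each node to the set of its parents.
from itertools import combinations

def skeleton(dag):
    """Undirected adjacencies as a set of frozensets."""
    return {frozenset((child, parent)) for child, parents in dag.items() for parent in parents}

def v_structures(dag):
    """Triples (a, b, c) with a -> b <- c and a, c not adjacent."""
    skel = skeleton(dag)
    vs = set()
    for b, parents in dag.items():
        for a, c in combinations(sorted(parents), 2):
            if frozenset((a, c)) not in skel:
                vs.add((a, b, c))
    return vs

def equivalent(dag1, dag2):
    return skeleton(dag1) == skeleton(dag2) and v_structures(dag1) == v_structures(dag2)

# A -> B -> C and A <- B <- C are equivalent; A -> B <- C is not.
chain1 = {"A": set(), "B": {"A"}, "C": {"B"}}
chain2 = {"A": {"B"}, "B": {"C"}, "C": set()}
collider = {"A": set(), "B": {"A", "C"}, "C": set()}
print(equivalent(chain1, chain2))   # True
print(equivalent(chain1, collider)) # False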
13. Learning Bayesian Networks
[Figure: a generative distribution p over X, Y, Z produces iid samples (observed data: rows of X, Y, Z values), from which a model is learned]
- Learn the structure
- Estimate the conditional distributions (see the counting sketch below)
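As a concrete illustration of the second step, a maximum-likelihood sketch (ours; the talk itself scores structures with a Bayesian criterion) that estimates each conditional table by counting:

# Maximum-likelihood CPT estimation by counting, for a fixed structure over binary variables.
from collections import Counter

def estimate_cpts(structure, samples):
    """structure: node -> list of parents; samples: list of dicts of 0/1 values."""
    cpts = {}
    for var, parents in structure.items():
        counts, totals = Counter(), Counter()
        for s in samples:
            pv = tuple(s[p] for p in parents)
            counts[(pv, s[var])] += 1
            totals[pv] += 1
        # P(var = 1 | parent values); only parent configurations seen in the data get an entry.
        cpts[var] = {pv: counts[(pv, 1)] / n for pv, n in totals.items()}
    return cpts

structure = {"X": [], "Y": ["X"], "Z": ["Y"]}
samples = [
    {"X": 0, "Y": 1, "Z": 1},
    {"X": 1, "Y": 0, "Z": 1},
    {"X": 0, "Y": 1, "Z": 0},
    {"X": 1, "Y": 0, "Z": 1},
]
print(estimate_cpts(structure, samples))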
14. Learning Structure
- Scoring criterion: F(D, S)
- Search procedure: identify one or more structures with high values for the scoring function
15. Bayesian Criterion
- S^h: the hypothesis that the generative distribution p has the same independence constraints as S
- F_Bayes(S, D) = log p(S^h | D) = k + log p(D | S^h) + log p(S^h)
  - log p(S^h): structure prior (e.g., prefer simple structures)
  - log p(D | S^h): marginal likelihood (closed form with suitable assumptions)
(The constant k is spelled out below.)
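Spelling out the decomposition via Bayes' rule (standard; notation follows the slide), which shows why the constant k is the same for every structure:

F_{\text{Bayes}}(S, D) = \log p(S^h \mid D)
  = \underbrace{-\log p(D)}_{k \text{ (constant in } S)}
  + \underbrace{\log p(D \mid S^h)}_{\text{marginal likelihood}}
  + \underbrace{\log p(S^h)}_{\text{structure prior}}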
16. Consistent Scoring Criterion
The criterion favors (in the limit) the simplest model that includes the generative distribution p:
- If S includes p and T does not include p, then F(S,D) > F(T,D)
- If both include p and S has fewer parameters, then F(S,D) > F(T,D)
[Figure: a distribution p over X, Y, Z and several candidate structures over X, Y, Z]
17. Bayesian Criterion is Consistent
- Assume the conditionals are
  - unconstrained multinomials, or
  - linear regressions
- Geiger, Heckerman, King and Meek (2001): such network structures are curved exponential models
- Haughton (1988): for curved exponential models, the Bayesian criterion is consistent
18. Locally Consistent Criterion
S and T differ by one edge:
[Figure: T is S with the edge Y -> X added]
- If I(X, Y | Par(X)) holds in p, then F(S,D) > F(T,D)
- Otherwise, F(S,D) < F(T,D)
19. Bayesian Criterion is Locally Consistent
- The Bayesian score approaches BIC plus a constant
- BIC is decomposable (see the sketch below)
- The difference in score is the same for any pair of DAGs that differ by a Y -> X edge, provided X has the same parents
[Figure: two DAGs over X, Y differing by the Y -> X edge; the complete network always includes p]
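A small Python sketch (ours) of a decomposable BIC score for binary variables with table conditionals: the total is a sum of per-node terms, so two DAGs that differ only in a Y -> X edge differ only in the term for X.

# Decomposable BIC: score(S, D) = sum over nodes of [max log-likelihood - (#params / 2) * log N].
import math
from collections import Counter

def node_bic(var, parents, samples):
    counts, totals = Counter(), Counter()
    for s in samples:
        pv = tuple(s[p] for p in parents)
        counts[(pv, s[var])] += 1
        totals[pv] += 1
    loglik = sum(c * math.log(c / totals[pv]) for (pv, _), c in counts.items())
    n_params = 2 ** len(parents)   # one free parameter per parent configuration (binary variables)
    return loglik - 0.5 * n_params * math.log(len(samples))

def bic(structure, samples):
    return sum(node_bic(var, parents, samples) for var, parents in structure.items())

samples = [{"X": 0, "Y": 0}, {"X": 1, "Y": 1}, {"X": 1, "Y": 1}, {"X": 0, "Y": 1}]
no_edge   = {"X": [], "Y": []}
with_edge = {"X": [], "Y": ["X"]}
# The term for X is identical in both structures; only Y's term changes.
print(bic(no_edge, samples), bic(with_edge, samples))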
20. Bayesian Criterion is Score Equivalent
S ≡ T implies F(S,D) = F(T,D)
[Figure: S is X -> Y and T is X <- Y; S^h and T^h each assert no independence constraints, so S^h = T^h and the scores coincide]
21. Search Procedure
- Set of states
- Representation for the states
- Operators to move between states
- Systematic Search Algorithm
22. Greedy Equivalence Search
- Set of states: equivalence classes of DAGs
- Representation for the states: essential graphs
- Operators to move between states: forward and backward operators
- Systematic search algorithm: two-phase greedy
23. Representation: Essential Graphs
[Figure: a DAG over A, B, C, D, E, F and its essential graph, in which compelled edges remain directed and reversible edges are undirected]
24. GES Operators
- Forward direction: single edge additions
- Backward direction: single edge deletions
25. Two-Phase Greedy Algorithm
- Phase 1: Forward Equivalence Search (FES)
  - Start with the all-independence model
  - Run greedy search using the forward operators
- Phase 2: Backward Equivalence Search (BES)
  - Start with the local maximum from FES
  - Run greedy search using the backward operators
(A minimal sketch of this control flow follows.)
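A minimal sketch (ours) of the two-phase control flow. For readability it hill-climbs over DAG edge sets with a pluggable score; GES proper searches over equivalence classes represented as essential graphs, but the FES-then-BES loop is the same.

# Two-phase greedy search over DAGs given as frozensets of directed edges (a, b) meaning a -> b.
from itertools import permutations

def creates_cycle(edges, a, b):
    """Adding a -> b creates a cycle iff a is already reachable from b."""
    frontier, seen = [b], set()
    while frontier:
        v = frontier.pop()
        if v == a:
            return True
        if v not in seen:
            seen.add(v)
            frontier.extend(dst for src, dst in edges if src == v)
    return False

def forward_neighbors(nodes, edges):
    """All single-edge additions that keep the graph acyclic."""
    for a, b in permutations(nodes, 2):
        if (a, b) not in edges and (b, a) not in edges and not creates_cycle(edges, a, b):
            yield edges | {(a, b)}

def backward_neighbors(nodes, edges):
    """All single-edge deletions."""
    for e in edges:
        yield edges - {e}

def greedy(nodes, edges, neighbors, score):
    while True:
        best = max(neighbors(nodes, edges), key=score, default=None)
        if best is None or score(best) <= score(edges):
            return edges
        edges = best

def two_phase_search(nodes, score):
    state = greedy(nodes, frozenset(), forward_neighbors, score)   # FES: start from all-independence
    return greedy(nodes, state, backward_neighbors, score)         # BES: start from the FES maximum

# Toy score standing in for a real criterion: prefer edge sets close to a fixed "true" graph.
true_edges = {("X", "Y"), ("Y", "Z")}
toy_score = lambda edges: -len(set(edges) ^ true_edges)
print(two_phase_search(["X", "Y", "Z"], toy_score))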
26. Forward Operators
- Consider all DAGs in the current state
- For each DAG, consider all single-edge additions that keep the graph acyclic
- Take the union of the resulting equivalence classes
27. Forward-Operators Example
[Figure: the current state, all DAGs it contains, all DAGs resulting from a single-edge addition, and the union of the corresponding essential graphs]
28. Forward-Operators Example (continued)
29. Backward Operators
- Consider all DAGs in the current state
- For each DAG, consider all single-edge deletions
- Take the union of the resulting equivalence classes
30. Backward-Operators Example
[Figure: the current state over A, B, C, all DAGs it contains, all DAGs resulting from a single-edge deletion, and the union of the corresponding essential graphs]
31. Backward-Operators Example (continued)
32. DAG Perfect
- DAG-perfect distribution p: there exists a DAG G such that I(X,Y|Z) holds in p iff I(X,Y|Z) holds in G
- Non-DAG-perfect distribution q:
[Figure: q over A, B, C, D satisfies both I(A,D|B,C) and I(B,C|A,D); the DAGs shown each encode only one of these two independencies]
33. Optimality of GES
If p is DAG-perfect with respect to some DAG G:
[Figure: p, a distribution over X, Y, Z that is DAG-perfect wrt G, generates n iid samples; GES is run on the resulting data and returns a state]
For large n, the state returned by GES is S, the equivalence class containing G.
34. Optimality of GES
[Figure: the search starts from the all-independence state, FES reaches a state that includes S, and BES reaches the state that equals S]
- Proof outline
  - After the first phase (FES), the current state includes S
  - After the second phase (BES), the current state equals S
35. FES Maximum Includes S
Assume the local maximum does NOT include S, and take any DAG G from the current state.
- The Markov conditions characterize the independencies, so in p there exists a variable X that is not independent of its non-descendants given its parents: not I(X, {A,B,C,D} | E) in p
[Figure: a DAG in which E is the parent of X and A, B, C, D are non-descendants of X]
- p is DAG-perfect, so the composition axiom holds: not I(X, C | E) in p for some single non-descendant C
- Local consistency: adding the C -> X edge improves the score, and the resulting equivalence class is a neighbor, contradicting local maximality
36. BES Identifies S
- The current state always includes S
  - by local consistency of the criterion
- The final local maximum is S
  - by Meek's conjecture
37. Meek's Conjecture
- For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of
  - (1) covered edge reversals in G
  - (2) single-edge additions to G
- such that
  - after each change, G ≤ H
  - after all changes, G = H
38. Meek's Conjecture (example)
[Figure: a DAG H over A, B, C, D whose independence constraints are I(A,B) and I(C,B|A,D), and a sequence of DAGs transforming G into H via covered edge reversals and single-edge additions]
39. Meek's Conjecture and BES
Assume the local maximum S' is NOT S. Take any DAG H from S' and any DAG G from S.
[Figure: by Meek's conjecture, H is reached from G by a sequence of covered edge reversals (Rev) and single-edge additions (Add); the equivalence class just before the last addition is a neighbor of S' in BES that still includes S, and by local consistency it scores higher, contradicting local maximality]
40. Discussion Points
- In practice, GES is as fast as DAG-based search
  - The neighborhood of an essential graph can be generated and scored very efficiently
- When the DAG-perfect assumption fails, we still get optimality guarantees
  - As long as composition holds in the generative distribution, the local maximum is inclusion-minimal
41. Thanks!
- My home page
  - http://research.microsoft.com/dmax
- Relevant papers
  - "Optimal Structure Identification with Greedy Search" (JMLR submission): contains detailed proofs of Meek's conjecture and the optimality of GES
  - "Finding Optimal Bayesian Networks" (UAI 2002, with Chris Meek): contains an extension of the optimality results for GES when the distribution is not DAG-perfect
42. Active Paths
- Z-active path between X and Y (non-standard):
  - Neither X nor Y is in Z
  - Every pair of colliding edges meets at a member of Z
  - No other pair of edges meets at a member of Z
[Figure: a path from X to Y with a collider at a node of Z]
G ≤ H implies: if there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H. (A small checker for this path condition is sketched below.)
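A small Python checker (ours) for the path condition as defined on this slide: given an explicit path in a DAG, the endpoints must lie outside Z, every collider on the path must be in Z, and every non-collider must not be.

# Check the slide's (non-standard) Z-active condition for an explicit path in a DAG.
# edges is a set of pairs (a, b) meaning a -> b; path is a sequence of adjacent nodes.

def is_z_active(path, edges, Z):
    if path[0] in Z or path[-1] in Z:
        return False                          # neither endpoint may be in Z
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider != (node in Z):           # colliders must be in Z, non-colliders must not
            return False
    return True

edges = {("X", "Z"), ("Y", "Z")}              # X -> Z <- Y
print(is_z_active(["X", "Z", "Y"], edges, {"Z"}))  # True: the collider Z is in the conditioning set
print(is_z_active(["X", "Z", "Y"], edges, set()))  # False: the collider is not in Z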
43. Active Paths (properties)
[Figure: a path X - A - Z - W - B - Y illustrating the properties below]
- X-Y: out-of X and in-to Y
- X-W: out-of both X and W
- Any sub-path between A, B not in Z is also active
- If there are active paths A-B and B-C and at least one is out-of B, then there is an active path between A and C
44. Simple Active Paths
If an active path between A and B contains the edge Y -> X, then there is an active path in which either
(1) the edge appears exactly once, or
(2) the edge appears exactly twice.
[Figure: the path configurations for cases (1) and (2) over A, B, X, Y]
To simplify the discussion, assume (1) only; the proofs for (2) are almost identical.
45. Typical Argument: Combining Active Paths
[Figure: graphs G and H over X, Y, Z, A, B, where Z is a sink node adjacent to X and Y]
Suppose there is an active path in G (with X not in the conditioning set CS) that has no corresponding active path in H. Then Z is not in CS.
46. Proof Sketch
- Two DAGs G, H with G < H
- Identify either
  - a covered edge X -> Y in G that has the opposite orientation in H, or
  - a new edge X -> Y to be added to G such that the result remains included in H
47. The Transformation
Choose any node Y that is a sink in H.
- Case 1a: Y is a sink in G and there is an X in Par_H(Y) that is not in Par_G(Y)
- Case 1b: Y is a sink in G with the same parents in both graphs
- Case 2a: there exists X such that Y -> X is covered
- Case 2b: there exists X such that Y -> X is not covered and some W is a parent of Y but not of X
- Case 2c: for every edge Y -> X, Par(Y) is a subset of Par(X)
[Figure: DAG fragments over X, Y, W illustrating each case]
48. Preliminaries (G ≤ H)
- The adjacencies in G are a subset of the adjacencies in H
- If X -> Y <- Z is a v-structure in G but not in H, then X and Z are adjacent in H
- Any new active path that results from adding X -> Y to G includes X -> Y
49. Proof Sketch: Case 1
Y is a sink in G.
Case 1a: there is an X in Par_H(Y) that is not in Par_G(Y); add X -> Y to G.
[Figure: the edge X -> Y, present in H, is added to G]
Suppose there is some new active path between A and B that is not in H.
- Y is a sink in G, so it must be in the conditioning set CS
- Neither X nor the next node Z is in CS
- In H there are active paths AP(A,Z) and AP(X,B), and Z -> Y <- X, which combine into an active path between A and B
Case 1b: the parents are identical; remove Y from both graphs, and the proof is similar.
50. Proof Sketch: Case 2
Y is not a sink in G.
Case 2a: there is a covered edge Y -> X; reverse the edge.
Case 2b: there is a non-covered edge Y -> X such that W is a parent of Y but not a parent of X; add W -> X to G.
[Figure: the fragment W -> Y -> X in G and H, with the edge W -> X added to G]
Suppose there is some new active path between A and B that is not in H.
- Y must be in CS, else replace W -> X by W -> Y -> X (not a new path)
- If X is not in CS, then in H there are active paths A-W and X-B, and W -> Y <- X, giving an active path in H
51. Case 2c: The Difficult Case
- All non-covered edges Y -> Z have Par(Y) a subset of Par(Z)
[Figure: graphs G and H over W1, W2, Y, Z1, Z2]
- Adding W1 -> Y: G is no longer ≤ H (a Z2-active path appears between W1 and W2 with no corresponding path in H)
- Adding W2 -> Y: G ≤ H still holds
52. Choosing Z
[Figure: graphs G and H showing Y, Z, D, and the descendants of Y in G]
- D is the descendant of Y in G that is maximal in H
- Z is any maximal child of Y such that D is a descendant of Z in G
53. Choosing Z (example)
[Figure: graphs G and H from the previous slide]
- Descendants of Y in G: Y, Z1, Z2
- Maximal descendant in H: D = Z2
- Maximal child of Y in G that has D = Z2 as a descendant: Z2
- Add W2 -> Y
54. Difficult Case: Proof Intuition
[Figure: graphs G and H over A, B, W, Y, Z, D, with paths leading to B or to the conditioning set CS]
1. W is not in CS
2. Y is not in CS, else the path is active in H
3. In G, the next edges must point away from Y until B or CS is reached
4. In G, neither Z nor its descendants are in CS, else the path was active before the addition
5. From (1), (2) and (4), there are active paths AP(A,D) and AP(B,D) in H
6. By the choice of D, there is a directed path from D to B or to CS in H
56. Optimality of GES
- Definition
  - p is DAG-perfect wrt G: the independence constraints in p are precisely those in G
- Assumption
  - The generative distribution p is perfect wrt some G defined over the observable variables
  - S: the equivalence class containing G
- Under the DAG-perfect assumption, GES results in S
57. Important Definitions
- Bayesian Networks
- Markov Conditions
- Distribution/Structure Inclusion
- Structure/Structure Inclusion