Title: Finding Optimal Bayesian Networks with Greedy Search
1. Finding Optimal Bayesian Networks with Greedy Search
2. Reasoning Under Uncertainty
Print Troubleshooter (Win95, Win2k, WinXP)
[Figure: print-troubleshooter Bayesian network with nodes Net/Local Printing, Network Up, Correct Local Port, Correct Printer Path, PC to Printer Transport OK, Net Path OK, Local Path OK, Local Cable Connected, Net Cable Connected, Paper Loaded, Printer On and Online, Printer Data OK, Printer Memory Adequate, and Print Output OK]
3. Troubleshooters (Win95 onward)
4. Answer Wizard (Office 95 onward)
5. Machine Learning and Applied Statistics Group
Data
- Applications
- Commerce Server
- SQL Server
- Spam Detection
- Machine Translation
- Analysis of Web Data
6. Outline
- Bayesian Networks
- Learning
- Greedy Equivalence Search (GES)
- Optimality of GES
- (Details of Meek's Conjecture)
7. Bayesian Networks
- Use B = (S, θ) to represent p(X1, ..., Xn)
- Structural component: S is a DAG
- Parameters: θ specifies the local probability distributions (see the sketch below)
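To make the B = (S, θ) representation concrete, here is a minimal Python sketch (ours, not from the talk): S is a parent map over binary variables and θ holds one conditional probability table per variable.

# Minimal sketch of B = (S, theta) for binary variables.
# S: each variable maps to the list of its parents (a DAG).
S = {
    "X": [],
    "Y": ["X"],
    "Z": ["Y"],
}

# theta: for each variable, P(var = 1 | parent values), one entry per parent configuration.
theta = {
    "X": {(): 0.3},
    "Y": {(0,): 0.2, (1,): 0.9},
    "Z": {(0,): 0.5, (1,): 0.1},
}

def joint_prob(assignment, S, theta):
    """p(x1, ..., xn) = product over i of p(xi | Par(xi))."""
    prob = 1.0
    for var, parents in S.items():
        parent_vals = tuple(assignment[p] for p in parents)
        p_one = theta[var][parent_vals]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob

print(joint_prob({"X": 1, "Y": 1, "Z": 0}, S, theta))  # 0.3 * 0.9 * 0.9 = 0.243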
8. Markov Conditions
From the factorization: I(X, ND | Par(X)), i.e., every variable X is independent of its non-descendants (ND) given its parents Par(X) (spelled out below).
[Figure: a node X with its parents (Par), non-descendants (ND), and descendants (Desc)]
Markov Conditions + Graphoid Axioms characterize all independencies.
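For reference, a compact statement of the factorization and the local Markov condition the slide appeals to (standard notation; the slide itself shows only a figure), written in LaTeX:

p(x_1, \ldots, x_n) \;=\; \prod_{i} p\bigl(x_i \mid \mathrm{Par}(x_i)\bigr)
\quad \Longrightarrow \quad
I\bigl(X_i,\ \mathrm{ND}(X_i) \mid \mathrm{Par}(X_i)\bigr) \ \text{for every } i

where ND(X_i) is the set of non-descendants of X_i in S.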
9. Structure/Distribution Inclusion
[Figure: the set of all distributions, with the distribution p lying inside the region of distributions defined by a structure S over X, Y, Z]
- p is included in S if there exists θ s.t. B(S, θ) defines p
10. Structure/Structure Inclusion (T ≤ S)
[Figure: the set of all distributions, with the region for a structure T over X, Y, Z contained in the region for a structure S over X, Y, Z]
- T is included in S if every p included in T is included in S
  (S is an I-map of T)
11. Structure/Structure Equivalence (T ≡ S)
[Figure: the set of all distributions, with the regions for structures S and T over X, Y, Z coinciding]
Equivalence is reflexive, symmetric, and transitive.
12. Equivalence
[Figure: two DAGs over A, B, C, D with the same skeleton (adjacencies) and the same v-structures]
Theorem (Verma and Pearl, 1990): S ≡ T iff S and T have the same v-structures and skeletons. (A small test implementing this is sketched below.)
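A small Python sketch (ours) of the Verma-Pearl test just stated: two DAGs are compared on their skeletons and their v-structures.

# Sketch of the Verma-Pearl equivalence test: same skeleton + same v-structures.
# A DAG is given as a dict mapping each node to the set of its parents.
from itertools import combinations

def skeleton(dag):
    """Undirected adjacencies as a set of frozensets."""
    return {frozenset((child, parent)) for child, parents in dag.items() for parent in parents}

def v_structures(dag):
    """Triples (a, b, c) with a -> b <- c and a, c not adjacent."""
    skel = skeleton(dag)
    vs = set()
    for b, parents in dag.items():
        for a, c in combinations(sorted(parents), 2):
            if frozenset((a, c)) not in skel:
                vs.add((a, b, c))
    return vs

def equivalent(dag1, dag2):
    return skeleton(dag1) == skeleton(dag2) and v_structures(dag1) == v_structures(dag2)

# A -> B -> C and A <- B <- C are equivalent; A -> B <- C is not.
chain1 = {"A": set(), "B": {"A"}, "C": {"B"}}
chain2 = {"A": {"B"}, "B": {"C"}, "C": set()}
collider = {"A": set(), "B": {"A", "C"}, "C": set()}
print(equivalent(chain1, chain2))   # True
print(equivalent(chain1, collider)) # False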
13. Learning Bayesian Networks
[Figure: a generative distribution p over X, Y, Z produces iid samples (observed data: rows of X, Y, Z values), from which a model is learned]
- Learn the structure
- Estimate the conditional distributions (see the counting sketch below)
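As a concrete illustration of the second step, a maximum-likelihood sketch (ours; the talk itself scores structures with a Bayesian criterion) that estimates each conditional table by counting:

# Maximum-likelihood CPT estimation by counting, for a fixed structure over binary variables.
from collections import Counter

def estimate_cpts(structure, samples):
    """structure: node -> list of parents; samples: list of dicts of 0/1 values."""
    cpts = {}
    for var, parents in structure.items():
        counts, totals = Counter(), Counter()
        for s in samples:
            pv = tuple(s[p] for p in parents)
            counts[(pv, s[var])] += 1
            totals[pv] += 1
        # P(var = 1 | parent values); only parent configurations seen in the data get an entry.
        cpts[var] = {pv: counts[(pv, 1)] / n for pv, n in totals.items()}
    return cpts

structure = {"X": [], "Y": ["X"], "Z": ["Y"]}
samples = [
    {"X": 0, "Y": 1, "Z": 1},
    {"X": 1, "Y": 0, "Z": 1},
    {"X": 0, "Y": 1, "Z": 0},
    {"X": 1, "Y": 0, "Z": 1},
]
print(estimate_cpts(structure, samples))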
14. Learning Structure
- Scoring criterion: F(D, S)
- Search procedure: identify one or more structures with high values for the scoring function
15. Bayesian Criterion
- S^h: the hypothesis that the generative distribution p has the same independence constraints as S
- F_Bayes(S, D) = log p(S^h | D) = k + log p(D | S^h) + log p(S^h)
  - log p(S^h): structure prior (e.g., prefer simple structures)
  - log p(D | S^h): marginal likelihood (closed form with suitable assumptions)
(The constant k is spelled out below.)
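Spelling out the decomposition via Bayes' rule (standard; notation follows the slide), which shows why the constant k is the same for every structure:

F_{\text{Bayes}}(S, D) = \log p(S^h \mid D)
  = \underbrace{-\log p(D)}_{k \text{ (constant in } S)}
  + \underbrace{\log p(D \mid S^h)}_{\text{marginal likelihood}}
  + \underbrace{\log p(S^h)}_{\text{structure prior}}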
16. Consistent Scoring Criterion
The criterion favors (in the limit) the simplest model that includes the generative distribution p:
- If S includes p and T does not include p, then F(S,D) > F(T,D)
- If both include p and S has fewer parameters, then F(S,D) > F(T,D)
[Figure: a distribution p over X, Y, Z and several candidate structures over X, Y, Z]
17. Bayesian Criterion is Consistent
- Assume the conditionals are
  - unconstrained multinomials, or
  - linear regressions
- Geiger, Heckerman, King and Meek (2001): such network structures are curved exponential models
- Haughton (1988): for curved exponential models, the Bayesian criterion is consistent
18. Locally Consistent Criterion
S and T differ by one edge:
[Figure: T is S with the edge Y -> X added]
- If I(X, Y | Par(X)) holds in p, then F(S,D) > F(T,D)
- Otherwise, F(S,D) < F(T,D)
19. Bayesian Criterion is Locally Consistent
- The Bayesian score approaches BIC plus a constant
- BIC is decomposable (see the sketch below)
- The difference in score is the same for any pair of DAGs that differ by a Y -> X edge, provided X has the same parents
[Figure: two DAGs over X, Y differing by the Y -> X edge; the complete network always includes p]
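A small Python sketch (ours) of a decomposable BIC score for binary variables with table conditionals: the total is a sum of per-node terms, so two DAGs that differ only in a Y -> X edge differ only in the term for X.

# Decomposable BIC: score(S, D) = sum over nodes of [max log-likelihood - (#params / 2) * log N].
import math
from collections import Counter

def node_bic(var, parents, samples):
    counts, totals = Counter(), Counter()
    for s in samples:
        pv = tuple(s[p] for p in parents)
        counts[(pv, s[var])] += 1
        totals[pv] += 1
    loglik = sum(c * math.log(c / totals[pv]) for (pv, _), c in counts.items())
    n_params = 2 ** len(parents)   # one free parameter per parent configuration (binary variables)
    return loglik - 0.5 * n_params * math.log(len(samples))

def bic(structure, samples):
    return sum(node_bic(var, parents, samples) for var, parents in structure.items())

samples = [{"X": 0, "Y": 0}, {"X": 1, "Y": 1}, {"X": 1, "Y": 1}, {"X": 0, "Y": 1}]
no_edge   = {"X": [], "Y": []}
with_edge = {"X": [], "Y": ["X"]}
# The term for X is identical in both structures; only Y's term changes.
print(bic(no_edge, samples), bic(with_edge, samples))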
20. Bayesian Criterion is Score Equivalent
S ≡ T implies F(S,D) = F(T,D)
[Figure: S is X -> Y and T is X <- Y; S^h and T^h each assert no independence constraints, so S^h = T^h and the scores coincide]
21. Search Procedure
- Set of states
- Representation for the states
- Operators to move between states
- Systematic Search Algorithm
22. Greedy Equivalence Search
- Set of states: equivalence classes of DAGs
- Representation for the states: essential graphs
- Operators to move between states: forward and backward operators
- Systematic search algorithm: two-phase greedy
23. Representation: Essential Graphs
[Figure: a DAG over A, B, C, D, E, F and its essential graph, in which compelled edges remain directed and reversible edges are undirected]
24. GES Operators
- Forward direction: single edge additions
- Backward direction: single edge deletions
25. Two-Phase Greedy Algorithm
- Phase 1: Forward Equivalence Search (FES)
  - Start with the all-independence model
  - Run greedy search using the forward operators
- Phase 2: Backward Equivalence Search (BES)
  - Start with the local maximum from FES
  - Run greedy search using the backward operators
(A minimal sketch of this control flow follows.)
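A minimal sketch (ours) of the two-phase control flow. For readability it hill-climbs over DAG edge sets with a pluggable score; GES proper searches over equivalence classes represented as essential graphs, but the FES-then-BES loop is the same.

# Two-phase greedy search over DAGs given as frozensets of directed edges (a, b) meaning a -> b.
from itertools import permutations

def creates_cycle(edges, a, b):
    """Adding a -> b creates a cycle iff a is already reachable from b."""
    frontier, seen = [b], set()
    while frontier:
        v = frontier.pop()
        if v == a:
            return True
        if v not in seen:
            seen.add(v)
            frontier.extend(dst for src, dst in edges if src == v)
    return False

def forward_neighbors(nodes, edges):
    """All single-edge additions that keep the graph acyclic."""
    for a, b in permutations(nodes, 2):
        if (a, b) not in edges and (b, a) not in edges and not creates_cycle(edges, a, b):
            yield edges | {(a, b)}

def backward_neighbors(nodes, edges):
    """All single-edge deletions."""
    for e in edges:
        yield edges - {e}

def greedy(nodes, edges, neighbors, score):
    while True:
        best = max(neighbors(nodes, edges), key=score, default=None)
        if best is None or score(best) <= score(edges):
            return edges
        edges = best

def two_phase_search(nodes, score):
    state = greedy(nodes, frozenset(), forward_neighbors, score)   # FES: start from all-independence
    return greedy(nodes, state, backward_neighbors, score)         # BES: start from the FES maximum

# Toy score standing in for a real criterion: prefer edge sets close to a fixed "true" graph.
true_edges = {("X", "Y"), ("Y", "Z")}
toy_score = lambda edges: -len(set(edges) ^ true_edges)
print(two_phase_search(["X", "Y", "Z"], toy_score))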
26. Forward Operators
- Consider all DAGs in the current state
- For each DAG, consider all single-edge additions that keep the graph acyclic
- Take the union of the resulting equivalence classes
27. Forward-Operators Example
[Figure: the current state, all DAGs it contains, all DAGs resulting from a single-edge addition, and the union of the corresponding essential graphs]
28. Forward-Operators Example (continued)
29. Backward Operators
- Consider all DAGs in the current state
- For each DAG, consider all single-edge deletions
- Take the union of the resulting equivalence classes
30. Backward-Operators Example
[Figure: the current state over A, B, C, all DAGs it contains, all DAGs resulting from a single-edge deletion, and the union of the corresponding essential graphs]
31. Backward-Operators Example (continued)
32. DAG Perfect
- DAG-perfect distribution p: there exists a DAG G such that I(X,Y|Z) holds in p iff I(X,Y|Z) holds in G
- Non-DAG-perfect distribution q:
[Figure: q over A, B, C, D satisfies both I(A,D|B,C) and I(B,C|A,D); the DAGs shown each encode only one of these two independencies]
33. Optimality of GES
If p is DAG-perfect with respect to some DAG G:
[Figure: p, a distribution over X, Y, Z that is DAG-perfect wrt G, generates n iid samples; GES is run on the resulting data and returns a state]
For large n, the state returned by GES is S, the equivalence class containing G.
34. Optimality of GES
[Figure: the search starts from the all-independence state, FES reaches a state that includes S, and BES reaches the state that equals S]
- Proof outline
  - After the first phase (FES), the current state includes S
  - After the second phase (BES), the current state equals S
35. FES Maximum Includes S
Assume the local maximum does NOT include S, and take any DAG G from the current state.
- The Markov conditions characterize the independencies, so in p there exists a variable X that is not independent of its non-descendants given its parents: not I(X, {A,B,C,D} | E) in p
[Figure: a DAG in which E is the parent of X and A, B, C, D are non-descendants of X]
- p is DAG-perfect, so the composition axiom holds: not I(X, C | E) in p for some single non-descendant C
- Local consistency: adding the C -> X edge improves the score, and the resulting equivalence class is a neighbor, contradicting local maximality
36. BES Identifies S
- The current state always includes S
  - by local consistency of the criterion
- The final local maximum is S
  - by Meek's conjecture
37. Meek's Conjecture
- For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of
  - (1) covered edge reversals in G
  - (2) single-edge additions to G
- such that
  - after each change, G ≤ H
  - after all changes, G = H
38. Meek's Conjecture (example)
[Figure: a DAG H over A, B, C, D whose independence constraints are I(A,B) and I(C,B|A,D), and a sequence of DAGs transforming G into H via covered edge reversals and single-edge additions]
39. Meek's Conjecture and BES
Assume the local maximum S' is NOT S. Take any DAG H from S' and any DAG G from S.
[Figure: by Meek's conjecture, H is reached from G by a sequence of covered edge reversals (Rev) and single-edge additions (Add); the equivalence class just before the last addition is a neighbor of S' in BES that still includes S, and by local consistency it scores higher, contradicting local maximality]
40. Discussion Points
- In practice, GES is as fast as DAG-based search
  - The neighborhood of an essential graph can be generated and scored very efficiently
- When the DAG-perfect assumption fails, we still get optimality guarantees
  - As long as composition holds in the generative distribution, the local maximum is inclusion-minimal
41. Thanks!
- My home page
  - http://research.microsoft.com/dmax
- Relevant papers
  - "Optimal Structure Identification with Greedy Search" (JMLR submission): contains detailed proofs of Meek's conjecture and the optimality of GES
  - "Finding Optimal Bayesian Networks" (UAI 2002, with Chris Meek): contains an extension of the optimality results for GES when the distribution is not DAG-perfect
42. Active Paths
- Z-active path between X and Y (non-standard):
  - Neither X nor Y is in Z
  - Every pair of colliding edges meets at a member of Z
  - No other pair of edges meets at a member of Z
[Figure: a path from X to Y with a collider at a node of Z]
G ≤ H implies: if there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H. (A small checker for this path condition is sketched below.)
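A small Python checker (ours) for the path condition as defined on this slide: given an explicit path in a DAG, the endpoints must lie outside Z, every collider on the path must be in Z, and every non-collider must not be.

# Check the slide's (non-standard) Z-active condition for an explicit path in a DAG.
# edges is a set of pairs (a, b) meaning a -> b; path is a sequence of adjacent nodes.

def is_z_active(path, edges, Z):
    if path[0] in Z or path[-1] in Z:
        return False                          # neither endpoint may be in Z
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider != (node in Z):           # colliders must be in Z, non-colliders must not
            return False
    return True

edges = {("X", "Z"), ("Y", "Z")}              # X -> Z <- Y
print(is_z_active(["X", "Z", "Y"], edges, {"Z"}))  # True: the collider Z is in the conditioning set
print(is_z_active(["X", "Z", "Y"], edges, set()))  # False: the collider is not in Z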
43. Active Paths (properties)
[Figure: a path X - A - Z - W - B - Y illustrating the properties below]
- X-Y: out-of X and in-to Y
- X-W: out-of both X and W
- Any sub-path between A, B not in Z is also active
- If there are active paths A-B and B-C and at least one is out-of B, then there is an active path between A and C
44. Simple Active Paths
If an active path between A and B contains the edge Y -> X, then there is an active path in which either
(1) the edge appears exactly once, or
(2) the edge appears exactly twice.
[Figure: the path configurations for cases (1) and (2) over A, B, X, Y]
To simplify the discussion, assume (1) only; the proofs for (2) are almost identical.
45. Typical Argument: Combining Active Paths
[Figure: graphs G and H over X, Y, Z, A, B, where Z is a sink node adjacent to X and Y]
Suppose there is an active path in G (with X not in the conditioning set CS) that has no corresponding active path in H. Then Z is not in CS.
46. Proof Sketch
- Two DAGs G, H with G < H
- Identify either
  - a covered edge X -> Y in G that has the opposite orientation in H, or
  - a new edge X -> Y to be added to G such that the result remains included in H
47. The Transformation
Choose any node Y that is a sink in H.
- Case 1a: Y is a sink in G and there is an X in Par_H(Y) that is not in Par_G(Y)
- Case 1b: Y is a sink in G with the same parents in both graphs
- Case 2a: there exists X such that Y -> X is covered
- Case 2b: there exists X such that Y -> X is not covered and some W is a parent of Y but not of X
- Case 2c: for every edge Y -> X, Par(Y) is a subset of Par(X)
[Figure: DAG fragments over X, Y, W illustrating each case]
48. Preliminaries (G ≤ H)
- The adjacencies in G are a subset of the adjacencies in H
- If X -> Y <- Z is a v-structure in G but not in H, then X and Z are adjacent in H
- Any new active path that results from adding X -> Y to G includes X -> Y
49. Proof Sketch: Case 1
Y is a sink in G.
Case 1a: there is an X in Par_H(Y) that is not in Par_G(Y); add X -> Y to G.
[Figure: the edge X -> Y, present in H, is added to G]
Suppose there is some new active path between A and B that is not in H.
- Y is a sink in G, so it must be in the conditioning set CS
- Neither X nor the next node Z is in CS
- In H there are active paths AP(A,Z) and AP(X,B), and Z -> Y <- X, which combine into an active path between A and B
Case 1b: the parents are identical; remove Y from both graphs, and the proof is similar.
50. Proof Sketch: Case 2
Y is not a sink in G.
Case 2a: there is a covered edge Y -> X; reverse the edge.
Case 2b: there is a non-covered edge Y -> X such that W is a parent of Y but not a parent of X; add W -> X to G.
[Figure: the fragment W -> Y -> X in G and H, with the edge W -> X added to G]
Suppose there is some new active path between A and B that is not in H.
- Y must be in CS, else replace W -> X by W -> Y -> X (not a new path)
- If X is not in CS, then in H there are active paths A-W and X-B, and W -> Y <- X, giving an active path in H
51. Case 2c: The Difficult Case
- All non-covered edges Y -> Z have Par(Y) a subset of Par(Z)
[Figure: graphs G and H over W1, W2, Y, Z1, Z2]
- Adding W1 -> Y: G is no longer ≤ H (a Z2-active path appears between W1 and W2 with no corresponding path in H)
- Adding W2 -> Y: G ≤ H still holds
52. Choosing Z
[Figure: graphs G and H showing Y, Z, D, and the descendants of Y in G]
- D is the descendant of Y in G that is maximal in H
- Z is any maximal child of Y such that D is a descendant of Z in G
53. Choosing Z (example)
[Figure: graphs G and H from the previous slide]
- Descendants of Y in G: Y, Z1, Z2
- Maximal descendant in H: D = Z2
- Maximal child of Y in G that has D = Z2 as a descendant: Z2
- Add W2 -> Y
54. Difficult Case: Proof Intuition
[Figure: graphs G and H over A, B, W, Y, Z, D, with paths leading to B or to the conditioning set CS]
1. W is not in CS
2. Y is not in CS, else the path is active in H
3. In G, the next edges must point away from Y until B or CS is reached
4. In G, neither Z nor its descendants are in CS, else the path was active before the addition
5. From (1), (2) and (4), there are active paths AP(A,D) and AP(B,D) in H
6. By the choice of D, there is a directed path from D to B or to CS in H
56. Optimality of GES
- Definition
  - p is DAG-perfect wrt G: the independence constraints in p are precisely those in G
- Assumption
  - The generative distribution p is perfect wrt some G defined over the observable variables
  - S: the equivalence class containing G
- Under the DAG-perfect assumption, GES results in S
57. Important Definitions
- Bayesian Networks
- Markov Conditions
- Distribution/Structure Inclusion
- Structure/Structure Inclusion