Efficient Principled Learning of Junction Trees
Anton Chechetka and Carlos Guestrin
Carnegie Mellon University
Motivation
Constructing a junction tree: using Alg. 1 for every S⊆V, obtain a list L of pairs (S,Q) s.t. I(Q, V\(S∪Q) | S) < |V|(ε+δ)
Theoretical guarantees
Intuition: if the intra-clique dependencies are strong enough, we are guaranteed to find a well-approximating JT in polynomial time.
Experimental results
- Finding almost independent subsets
- Question: if S is a separator of an ε-JT, which variables are on the same side of S?
- More than one correct answer is possible
- We will settle for finding one
- This drops the complexity from exponential to polynomial
- Junction trees
- Trees where each node is a set of variables (a clique)
- Running intersection property: every clique between Ci and Cj contains Ci ∩ Cj
- Ci and Cj are neighbors ⇒ Sij ≡ Ci ∩ Cj is called a separator
- Example:
- Notation: Vi→j is the set of all variables on the same side of edge i-j as clique Cj
- V3→4 = {G,F}, V3→1 = {A}, V4→3 = {A,D}
- Encoded independencies: (Vi→j ⊥ Vj→i | Sij)
- Constraint-based learning
- Naively:
  - for every candidate separator S of size k
    - for every X⊆V\S
      - if I(X, V\(S∪X) | S) < ε
        - add (S,X) to the list of useful components L
  - find a JT consistent with L
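The naive enumeration above can be sketched directly in Python. This is a structural sketch only: `cmi` is a hypothetical estimator of I(X, Y | S) (e.g. computed from data), not something given on the poster, and it is passed in as a parameter.

```python
from itertools import combinations

def naive_components(variables, k, eps, cmi):
    """Naive constraint-based enumeration (exponential in |V|).

    cmi(X, Y, S) is an assumed estimator of I(X; Y | S).
    Returns the list L of pairs (S, X) with I(X, V\\(S u X) | S) < eps.
    """
    V = set(variables)
    L = []
    for S in combinations(sorted(V), k):        # every candidate separator of size k
        rest = sorted(V - set(S))
        # every nonempty X subseteq V\S -- the source of the exponential blow-up
        for r in range(1, len(rest) + 1):
            for X in combinations(rest, r):
                Y = set(rest) - set(X)
                if cmi(set(X), Y, set(S)) < eps:
                    L.append((set(S), set(X)))
    return L
```

The doubly exponential loop over (S, X) is exactly what the rest of the poster replaces with a polynomial procedure.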
- Probabilistic graphical models are everywhere
- Medical diagnosis, datacenter performance monitoring, sensor nets, ...
Model quality (log-likelihood on test set)
- Compare this work with:
- ordering-based search (OBS) [Teyssier & Koller, UAI'05]
- Chow-Liu algorithm [Chow & Liu, IEEE'68]
- Karger-Srebro algorithm [Karger & Srebro, SODA'01]
- local search
- this work + local search combination (using our algorithm to initialize local search)
Complexity
Theorem: Suppose a maximal ε-JT of treewidth k exists for P(V) s.t. for every clique C and separator S of the tree it holds that min_{X⊂(C\S)} I(X, C\(S∪X) | S) > (k+3)(ε+δ); then our algorithm will find a k|V|(ε+δ)-JT for P(V) with probability at least (1-γ), using a polynomial number of samples and polynomial time.
- Main advantages
- Compact representation of probability distributions
- Exploit structure to speed up inference
[Figure: example junction trees with cliques such as ABCD and ABEF and their separators; the set Q = {B,C,D} with separator S = {A} splits as B | C,D or as B,C | D; the set V3→4 is marked]
- Problem: from L, reconstruct a junction tree.
- This is non-trivial. Complications:
- L may encode more independencies than a single JT encodes
- Several different JTs may be consistent with the independencies in L
- But there are also problems
- Compact representation ≠ tractable inference
- Exact inference is #P-complete in general
- Often still need exponential time even for compact models
- Example:
- Often we do not even have the structure, only data
- The best structure is NP-complete to find
- Most structure learning algorithms return complex models, where inference is intractable
- Very few structure learning algorithms have global quality guarantees
- We address both of these issues! We provide
- an efficient structure learning algorithm
- guaranteed to learn tractable models
- with global guarantees on the result quality
Intuition: consider the set of variables Q = {B,C,D}. Suppose an ε-JT (e.g. above) with separator S = {A} exists s.t. some of the variables in Q ({B}) are on the left of S and the remaining ones ({C,D}) are on the right. Then a partitioning of Q into X and Y exists s.t. I(X, Y | S) < ε
Key theoretical result: an efficient upper bound for I(·, · | ·)
Key insight [Arnborg et al., SIAM J. Algebraic Discrete Methods 1987; Narasimhan & Bilmes, UAI'05]: in a junction tree, components (S,Q) have a recursive decomposition
Data [Beinlich et al., ECAIM 1988]: 37 variables, treewidth 4, learned treewidth 3
Intuition: suppose a distribution P(V) can be well approximated by a junction tree with clique size k. Then for every set S⊆V of size k and A,B⊆V of arbitrary size, to check that I(A, B | S) is small, it is enough to check for all subsets X⊆A, Y⊆B of size at most k that I(X, Y | S) is small.
Corollary: maximal JTs of fixed treewidth s.t. for every clique C and separator S it holds that min_{X⊂(C\S)} I(X, C\(S∪X) | S) > ε for fixed ε > 0 are efficiently PAC learnable
4 neighbors per variable (a constant!), but
inference still hard
- Tractability guarantees
- Inference is exponential in the clique size k
- Small cliques ⇒ tractable inference
If no such splits exist, all variables of Q must be on the same side of S
Is I(A, B | S) ≤ ε?
Only need to compute I(X, Y | S) for small X and Y!
Related work
- Alg. 1 (input: candidate separator S, threshold δ)
  - each variable of V\S starts out as a separate partition
  - for every Q⊆V\S of size at most k+2
    - if min_{X⊂Q} I(X, Q\X | S) > δ
      - merge all partitions that have variables in Q
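The partition-merging loop of Alg. 1 can be sketched with a union-find structure. This is a sketch under assumptions: `cmi(X, Y, S)` is a hypothetical estimator of I(X; Y | S) supplied by the caller, and the function names are mine, not the poster's.

```python
from itertools import combinations

def find_partitions(V, S, k, delta, cmi):
    """Alg. 1 sketch: partition V\\S into groups Alg. 1 would keep apart.

    cmi(X, Y, S) is an assumed estimator of I(X; Y | S).
    """
    rest = sorted(set(V) - set(S))
    parent = {x: x for x in rest}          # union-find over variables

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # for every Q subseteq V\S of size at most k+2:
    for size in range(2, min(k + 2, len(rest)) + 1):
        for Q in combinations(rest, size):
            # min over nontrivial partitionings (X, Q\X) of Q
            weakest = min(
                cmi(set(X), set(Q) - set(X), set(S))
                for r in range(1, len(Q))
                for X in combinations(Q, r)
            )
            if weakest > delta:            # no weak split: merge everything in Q
                for x in Q[1:]:
                    union(Q[0], x)

    groups = {}
    for x in rest:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

Because each candidate Q has size at most k+2, the loop touches O(n^(k+2)) sets rather than all 2^n subsets.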
Data [Deshpande et al., VLDB 2004]: 54 variables, treewidth 2
Look for such recursive decompositions in L!
- JTs as approximations
- Often exact conditional independence is too strict a requirement
- Generalization: conditional mutual information I(A, B | C) ≡ H(A | C) - H(A | B,C)
- H(· | ·) is conditional entropy
- I(A, B | C) ≥ 0 always
- I(A, B | C) = 0 ⇔ (A ⊥ B | C)
- Intuitively: if C is already known, how much new information about A is contained in B?
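As a concrete illustration of the definition, a minimal sketch computing I(A, B | C) from an explicit joint distribution table. The table representation (a dict from assignment tuples to probabilities) and the variable indexing are my assumptions, not part of the poster.

```python
import math

def cond_entropy(joint, a_vars, c_vars):
    """H(A | C) = H(A, C) - H(C), for a joint given as {assignment: prob}.

    Variables are referenced by their position in the assignment tuple.
    """
    def marginal(vars_):
        m = {}
        for assign, p in joint.items():
            key = tuple(assign[v] for v in vars_)
            m[key] = m.get(key, 0.0) + p
        return m

    def entropy(m):
        return -sum(p * math.log2(p) for p in m.values() if p > 0)

    return entropy(marginal(a_vars + c_vars)) - entropy(marginal(c_vars))

def cond_mutual_info(joint, a_vars, b_vars, c_vars):
    """I(A, B | C) = H(A | C) - H(A | B, C)."""
    return (cond_entropy(joint, a_vars, c_vars)
            - cond_entropy(joint, a_vars, b_vars + c_vars))
```

For two perfectly correlated bits, I(A, B) is 1 bit; for two independent fair coins it is 0, matching the I = 0 ⇔ independence property above.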
Fixed size regardless of Q
- DP algorithm (input: list L of pairs (S,Q))
  - sort L in the order of increasing |Q|
  - mark (S,Q)∈L with |Q| = 1 as positive
  - for (S,Q)∈L with |Q| ≥ 2, in the sorted order
    - if ∃ x∈Q and (S1,Q1), ..., (Sm,Qm) ∈ L s.t.
      - Si ⊆ S∪{x} and (Si,Qi) is positive
      - Qi∩Qj = ∅ for i ≠ j
      - ∪_{i=1..m} Qi = Q\{x}
    - then mark (S,Q) positive
      - decomposition(S,Q) = {(S1,Q1), ..., (Sm,Qm)}
  - if ∃ S s.t. all (S,Qi)∈L are positive
    - return the corresponding junction tree
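The positive-marking pass of this DP can be sketched as follows. Representing S and Q as frozensets is my choice, and the inner greedy cover stands in for the exact decomposition search (which, as noted elsewhere on the poster, is NP-complete to decide).

```python
def mark_positives(L):
    """DP sketch over a list L of (S, Q) pairs, each given as frozensets.

    A pair is marked positive iff Q decomposes recursively into smaller
    positive components from L (singleton Qs are positive by definition).
    """
    L = sorted(L, key=lambda sq: len(sq[1]))   # increasing |Q|
    positive = set()
    for S, Q in L:
        if len(Q) == 1:
            positive.add((S, Q))
            continue
        for x in Q:
            # candidate sub-components: positive (Si, Qi) with
            # Si subseteq S u {x} and Qi subseteq Q \ {x}
            cands = [(Si, Qi) for (Si, Qi) in positive
                     if Si <= S | {x} and Qi <= Q - {x}]
            # greedy disjoint cover of Q \ {x} by the Qi (heuristic)
            cover, used = set(), []
            for Si, Qi in sorted(cands, key=lambda sq: -len(sq[1])):
                if not (Qi & cover):
                    used.append((Si, Qi))
                    cover |= Qi
            if cover == Q - {x}:
                positive.add((S, Q))
                break
    return positive
```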
Computation time is reduced from exponential in |V| to polynomial!
Example: threshold 0.25
Set S does not have to relate to the separators
of the true JT in any way!
NP-complete to decide ⇒ we use a greedy heuristic
Data [Krause & Guestrin, UAI'05]: 32 variables, treewidth 3
[Figure: Alg. 1 example: pairwise I(·, · | S) values; test an edge, merge variables; I too low, do not merge; end result of merging]
Approximation quality guarantee
Theorem 1: Suppose an ε-JT of treewidth k exists for P(V). Suppose the sets S⊆V of size k and A⊆V\S of arbitrary size are s.t. for every X⊆V\S of size k+1 it holds that I(X∩A, X∩(V\(S∪A)) | S) < δ; then I(A, V\(S∪A) | S) < |V|(ε+δ)
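The check that Theorem 1 licenses can be sketched as follows: instead of estimating I(A, B | S) for arbitrarily large A and B, only subsets X of size k+1 are examined. As before, `cmi` is a hypothetical estimator passed in by the caller, not something the poster specifies.

```python
from itertools import combinations

def small_subset_checks_pass(V, S, A, k, eps, cmi):
    """Sketch of the Theorem 1 condition, with B = V \\ (S u A).

    Verifies I(X n A, X n B | S) < eps for every X subseteq V\\S of
    size k+1. If all checks pass, Theorem 1 bounds I(A, B | S).
    cmi(X, Y, S) is an assumed estimator of I(X; Y | S).
    """
    V, S, A = set(V), set(S), set(A)
    B = V - S - A
    for X in combinations(sorted(V - S), k + 1):
        XA, XB = set(X) & A, set(X) & B
        if XA and XB and cmi(XA, XB, S) >= eps:
            return False        # found a small witness of dependence
    return True
```

Only O(n^(k+1)) subsets X are checked, and each cmi call involves at most k+1 variables plus the separator, which is what makes the bound efficiently computable.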
- This work: contributions
- The first polynomial-time algorithm with PAC guarantees for learning low-treewidth graphical models, with guaranteed tractable inference!
- Key theoretical insight: a polynomial-time upper bound on conditional mutual information for arbitrarily large sets of variables
- Empirical demonstration of viability
Theorem [Narasimhan & Bilmes, UAI'05]: If for every separator Sij in the junction tree it holds that the conditional mutual information I(Vi→j, Vj→i | Sij) < ε (call such a tree an ε-junction tree), then KL(P || Ptree) < |V|ε
[1] Bach & Jordan, NIPS 2002.
[2] Choi et al., UAI 2005.
[3] Chow & Liu, IEEE 1968.
[4] Meila & Jordan, JMLR 2001.
[5] Teyssier & Koller, UAI 2005.
[6] Singh & Moore, CMU-CALD 2005.
[7] Karger & Srebro, SODA 2001.
[8] Abbeel et al., JMLR 2006.
[9] Narasimhan & Bilmes, UAI 2004.
- Theorem (result quality): If after invoking Alg. 1(S, δ) a set U is a connected component, then
- for every Z s.t. I(Z, V\(S∪Z) | S) < δ it holds that U ⊆ Z
- I(U, V\(S∪U) | S) < nkδ
- Greedy heuristic for the decomposition search
- initialize the decomposition to empty
- iteratively add pairs (Si,Qi) that do not conflict with those already in the decomposition
- if all variables of Q are covered: success
- May fail even if a decomposition exists
- But we prove that for certain distributions it is guaranteed to work
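The greedy heuristic above can be sketched in a few lines (the function name and data layout are mine, not the poster's):

```python
def greedy_decomposition(Q, candidates):
    """Greedy heuristic sketch: scan candidate pairs (Si, Qi) and keep
    those whose Qi does not overlap anything already chosen; succeed if
    the chosen Qi cover Q exactly. May fail even when a valid
    decomposition exists, since no backtracking is done.
    """
    chosen, covered = [], set()
    for Si, Qi in candidates:
        if not (set(Qi) & covered) and set(Qi) <= set(Q):
            chosen.append((Si, Qi))
            covered |= set(Qi)
    return chosen if covered == set(Q) else None
```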
never mistakenly put variables together
- Future work
- Extend to non-maximal junction trees
- Heuristics to speed up performance
- Use information about edge likelihoods (e.g. from L1-regularized logistic regression) to cut down on computation
Incorrect splits are not too bad
Goal: find an ε-junction tree with fixed clique size k in time polynomial in |V|
Complexity O(n^(k+1)): polynomial in n, instead of O(exp(n)) for the straightforward computation.
Complexity O(n^(k+3)): polynomial in n.