Title: Learning factor graphs in polynomial time & sample complexity
Slide 1: Learning factor graphs in polynomial time & sample complexity
- Pieter Abbeel
- Daphne Koller
- Andrew Y. Ng
- Stanford University
Slide 2: Overview
Introduction
- First polynomial-time, polynomial-sample-complexity learning algorithm for factor graphs,
  - a superset of Bayesian nets and Markov nets.
- Applicable to any factor graph of bounded factor size and connectivity,
  - including intractable networks (e.g., grids).
- New technical ideas:
  - Parameter learning: a closed-form parameterization with low-dimensional frequencies only.
  - Structure learning: results about guaranteed-approximate Markov blankets from sample data.
Slide 3: Factor graph distributions
Introduction
(Figure: a Bayesian network maps to a factor graph with 1 factor per conditional probability table; a Markov random field maps to a factor graph with 1 factor per clique.)
Slide 4: Factor graph distributions
Introduction
A factor graph defines the distribution
\[
  P(x_{1:n}) \;=\; \frac{1}{Z} \prod_j f_j(x_{C_j}),
  \qquad Z = \sum_{x_{1:n}} \prod_j f_j(x_{C_j}),
\]
where each factor f_j is over variables C_j \subseteq \{X_1, \dots, X_n\}, x_{C_j} is the instantiation x_{1:n} restricted to C_j, and Z is the partition function.
(Figure: the factor graph as a bipartite graph of factor nodes and variable nodes.)
Slide 5: Related work
Introduction
Target distribution (task)          | True distr. | Samples | Time    | Graceful degradation | Ref.
ML tree (structure)                 | any         | poly    | poly    | yes                  | [1]
ML bounded tree-width (structure)   | any         | poly    | NP-hard | yes                  | [2]
Bounded tree-width (structure)      | same        | poly    | poly    | no                   | [3]
Factor graph (parameter)            | same        | poly    | poly    | yes                  | this work
Factor graph (structure)            | same        | poly    | poly    | yes                  | this work
- Our work: the first poly-time, poly-sample-complexity solutions for parameter estimation and structure learning of factor graphs.
- Current practice for parameter learning: max likelihood.
  - Expensive, and applies only to tractable networks.
- Current practice for structure learning: local-search heuristics or heuristic learning of a bounded tree-width model.
  - Slow to evaluate, and no performance guarantees.
(Current practice: refs. [4]-[8].)
[1] Chow & Liu, 1968
[2] Srebro, 2001
[3] Narasimhan & Bilmes, 2004
[4] Della Pietra et al., 1997
[5] McCallum, 2003
[6] Malvestuto, 1991
[7] Bach & Jordan, 2002
[8] Deshpande et al., 2001
Slide 6: Canonical parameterization
Parameter learning
- Consider the factor graph at hand.
- The Hammersley-Clifford theorem gives the canonical factors; a reconstruction follows.
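The slide's formulas did not survive extraction; the following is a standard reconstruction of the canonical factors, assuming (as on the next slide) the all-zeros default assignment:

```latex
% Canonical factor for a scope C, relative to a fixed default assignment
% (here all zeros). (x_U, 0_{-U}) denotes the full instantiation that
% agrees with x on U and is 0 everywhere else.
\[
  f^*_C(x_C) \;=\; \prod_{U \subseteq C}
      P\bigl(X_{1:n} = (x_U, 0_{-U})\bigr)^{(-1)^{|C \setminus U|}}
\]
```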
Slide 7: Canonical factors
Parameter learning
- The canonical factor contains no lower-order interactions, by inclusion-exclusion:
  - keep the complete interaction,
  - subtract the lower-order interactions,
  - compensate for double counting.
- It is built from frequencies only, with an equal number of + and - terms in the exponents.
- Closed-form parameter learning? NO. (Not yet.)
  - The frequencies P(X_{1:n} = (x_1, x_2, 0, ..., 0)) involve full instantiations and are thus expensive to estimate from samples.
Slide 8: Markov blanket canonical factors
Parameter learning
- Pair up a positive and a negative term of the canonical factor and transform each into a conditional probability: the terms over the variables outside the factor's scope cancel.
- By conditional independence, conditioning on all remaining variables reduces to conditioning on the Markov blanket.
- The result involves only low-dimensional distributions. (MB = Markov blanket.) The cancellation is written out below.
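A reconstruction of the cancellation argument, for any two full instantiations a and b that agree outside C:

```latex
\[
  \frac{P(a)}{P(b)}
  = \frac{P(a_C \mid a_{-C})\, P(a_{-C})}{P(b_C \mid b_{-C})\, P(b_{-C})}
  = \frac{P(a_C \mid a_{-C})}{P(b_C \mid b_{-C})}
  = \frac{P(a_C \mid a_{\mathrm{MB}(C)})}{P(b_C \mid b_{\mathrm{MB}(C)})}.
\]
% The P(a_{-C}) and P(b_{-C}) terms cancel because a and b agree outside C,
% and conditional independence replaces conditioning on X_{-C} by
% conditioning on MB(C). Every +/- pair of terms in the canonical factor
% therefore needs only the low-dimensional distribution over C and MB(C).
```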
Slide 9: Markov blanket canonical factors
Parameter learning
- Let C_j range over all subfactors of the given structure.
- Each Markov blanket canonical factor is computed from the distribution over C_j and MB(C_j) only.
- These are low-dimensional distributions, so they can be estimated efficiently from samples.
Slide 10: Parameter learning
Parameter learning
- Algorithm: estimate the Markov blanket canonical factors from data.
- Theorem. The parameter learning algorithm
  - runs in polynomial time,
  - uses a polynomial number of samples,
  - guarantees D(P || P̂) is small with high probability (P = true distribution, P̂ = learned distribution).
No dependence on the tree-width of the network! A sketch of the estimation step follows.
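A minimal sketch of the estimation step, assuming binary variables, the all-zeros default assignment, and hypothetical helper names (samples is an m-by-n 0/1 array; C and MB are tuples of variable indices). This is an illustration of the Markov blanket canonical factors, not the authors' reference implementation; the Laplace smoothing parameter alpha is likewise an assumption.

```python
import numpy as np
from itertools import chain, combinations, product

def all_subsets(idx):
    """All subsets of an index tuple, including the empty set and idx itself."""
    return chain.from_iterable(combinations(idx, r) for r in range(len(idx) + 1))

def cond_prob(samples, C, x_C, MB, alpha=1.0):
    """Smoothed empirical estimate of P(X_C = x_C | X_MB = 0) from 0/1 samples."""
    mask = np.all(samples[:, list(MB)] == 0, axis=1) if MB else np.ones(len(samples), bool)
    match = np.all(samples[:, list(C)] == x_C, axis=1) & mask
    return (match.sum() + alpha) / (mask.sum() + alpha * 2 ** len(C))

def mb_canonical_factor(samples, C, MB):
    """Estimate log f*_C(x_C) for every 0/1 assignment x_C to the scope C."""
    log_f = {}
    for x_C in product([0, 1], repeat=len(C)):
        total = 0.0
        for U in all_subsets(C):
            # Hybrid assignment: agrees with x_C on U, takes the default 0 on C \ U.
            hybrid = [x if i in U else 0 for i, x in zip(C, x_C)]
            sign = (-1) ** (len(C) - len(U))  # inclusion-exclusion sign (-1)^{|C \ U|}
            total += sign * np.log(cond_prob(samples, C, hybrid, MB))
        log_f[x_C] = total
    return log_f
```

Each estimate touches only the samples consistent with the default assignment on MB, which is why only low-dimensional frequencies are needed.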
Slide 11: Graceful degradation
Parameter learning
- Theorem. When
  - the true distribution factorizes according to factor graph G,
  - the structure used for parameter learning is a factor graph G' (possibly different from G),
- then the additional error consists of two terms:
  - Term 1: canonical factors capture the residual highest-order interactions only, so the error is small when the subfactors of G are in G'.
  - Term 2: MB'(C), the Markov blanket in the given factor graph G', may differ from the true MB(C); if MB' is a good approximation of MB, this error is small. (See structure learning.)
Slide 12: Structure learning
Structure learning
- Naive idea: take as the structure all factors of size ≤ k, then run parameter learning. Does this work? NO.
  - Estimating Markov blanket canonical factors requires knowledge of the Markov blankets.
  - But if we knew the Markov blankets, the structure learning problem would already be solved.
Slide 13: Recovering the Markov blankets
Structure learning
- A Markov blanket criterion applied to the true distribution recovers the true Markov blankets.
- Applied to sample data, the same criterion gives, at best, approximate Markov blankets.
- Key for parameter learning: the approximate Markov blanket must satisfy a desired property, namely that it can stand in for the true Markov blanket when estimating the canonical factors (made precise on slide 15).
Slide 14: Conditional entropy
Structure learning
- Criterion: conditional entropy. For any candidate Markov blanket Y of a factor scope C:
  - Conditioning reduces entropy: for any X, Y, Z, H(X | Y, Z) ≤ H(X | Y).
  - Conditional independence: given the true Markov blanket, X_C is independent of everything else.
- Thus, under the true distribution, the true Markov blankets minimize the conditional entropy; the chain of (in)equalities is written out below.
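Written out (a reconstruction from the two facts above), for any candidate set Y ⊆ X \ C:

```latex
\[
  H(X_C \mid X_Y)
  \;\ge\; H(X_C \mid X_Y, X_{\mathrm{MB}(C)})
  \;=\; H(X_C \mid X_{\mathrm{MB}(C)}).
\]
% The inequality is "conditioning reduces entropy"; the equality uses the
% conditional independence of X_C from all other variables given MB(C).
% Hence the true Markov blanket attains the minimal conditional entropy.
```

What about conditional entropy computed from sample data? That is the next slide's question.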
Slide 15: Conditional entropy
Structure learning
- Theorem. Empirical conditional entropy estimates are a good approximation to the true conditional entropy, even with a polynomial number of samples.
- Theorem. Conditional entropy satisfies the desired approximate-Markov-blanket property. For any ε > 0:
  - if a candidate blanket MB'(C) looks like a Markov blanket, i.e., its conditional entropy is within ε of the true Markov blanket's,
  - then MB'(C) can be used as the Markov blanket for learning, incurring only a small additional error (a function of ε).
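A minimal sketch of the plug-in estimator used here, via H(X_C | X_Y) = H(X_C, X_Y) - H(X_Y), assuming the same m-by-n sample array as before (hypothetical names; entropies in nats):

```python
import numpy as np
from collections import Counter

def entropy(samples, idx):
    """Plug-in entropy (in nats) of the empirical joint over the columns in idx."""
    counts = Counter(map(tuple, samples[:, list(idx)]))
    p = np.array(list(counts.values()), dtype=float) / len(samples)
    return float(-(p * np.log(p)).sum())

def cond_entropy(samples, C, Y):
    """Empirical H(X_C | X_Y) = H(X_C, X_Y) - H(X_Y)."""
    return entropy(samples, tuple(C) + tuple(Y)) - entropy(samples, tuple(Y))
```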
Slide 16: Structure learning algorithm
Structure learning
- Assume factor size ≤ k, Markov blanket size ≤ b.
- For all subsets of variables C_j of size ≤ k:
  - Find the Markov blanket from the empirical conditional entropy.
  - Estimate the Markov blanket canonical factors from data (parameter learning).
  - Discard factors that are close to the trivial all-ones factor (simplify structure).
- Return the remaining factors.
A sketch of the full loop, reusing the helpers from the earlier sketches, follows.
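A sketch under the stated assumptions, reusing the hypothetical cond_entropy and mb_canonical_factor helpers from the earlier sketches; the triviality threshold tau is an assumption, not a value from the paper.

```python
from itertools import combinations

def learn_structure(samples, k, b, tau):
    """For each candidate scope C of size <= k: pick the candidate blanket
    of size <= b with minimal empirical conditional entropy, estimate the
    MB canonical factor, and keep it only if it differs from the all-ones
    factor (whose log is identically zero) by more than tau."""
    n = samples.shape[1]
    factors = {}
    for size in range(1, k + 1):
        for C in combinations(range(n), size):
            rest = [i for i in range(n) if i not in C]
            # Approximate Markov blanket: minimize empirical H(X_C | X_Y).
            candidates = (Y for r in range(b + 1) for Y in combinations(rest, r))
            MB = min(candidates, key=lambda Y: cond_entropy(samples, C, Y))
            log_f = mb_canonical_factor(samples, C, MB)
            if max(abs(v) for v in log_f.values()) > tau:
                factors[C] = log_f
    return factors
```

The double enumeration over candidate scopes and candidate blankets is what makes the running time exponential in k and b, as noted on the discussion slide.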
Slide 17: Structure learning theorem
Structure learning
- Assume fixed factor size ≤ k and MB size ≤ b.
- Theorem. The structure learning algorithm
  - runs in polynomial time,
  - uses a polynomial number of samples,
  - guarantees D(P || P̂) is small with high probability.
- Note:
  - The computational and sample complexity depend exponentially on the factor size and MB size.
  - Bounded connectivity implies bounded factor and MB size.
No dependence on the tree-width of the network!
Slide 18: Graceful degradation
Structure learning
- Theorem. Let G be the factor graph of the true distribution. When, in the true distribution, the max factor size is > k or the max MB size is > b, the additional error consists of three terms:
  - Term 1: canonical factors capture the residual highest-order interactions only; the error is small when the true interactions of order > k are small.
  - Term 2: if the approximate MB is a good approximation of the true MB, this error is small.
  - Term 3: factors that are trivial in the true distribution but estimated as non-trivial because their MB size is larger than b.
Slide 19: Consequences for Bayesian networks
Structure learning
- A Bayesian network maps to a factor graph with 1 factor per conditional probability table.
- Bounded fan-in and fan-out in the Bayesian network imply bounded factor size and bounded Markov blanket size in the factor graph.
- Pipeline: samples from P_BN with unknown structure -> structure learning -> factor graph distribution P̂ with D(P_BN || P̂) ≤ ε.
- Learning a factor graph (not a Bayesian network) gives efficient learning of the distribution from finite data.
Slide 20: Related work
Structure learning
- Finding the highest-scoring, bounded in-degree Bayesian network is NP-hard (Chickering, Meek & Heckerman, 2003).
- Our algorithm recovers a factor graph representation only.
  - The (difficult) acyclicity constraint is avoided.
- Learning a factor graph (not a Bayesian network) gives efficient learning of the distribution from finite data.
- Note: Spirtes, Glymour & Scheines (2000) and Chickering & Meek (2002) do recover Bayesian network structure, but only with access to the true distribution (infinite sample size).
Slide 21: Discussion and conclusion
Conclusion
- First polynomial-time, polynomial-sample-complexity learning algorithm for factor graphs.
- Applicable to any factor graph of bounded factor size and connectivity,
  - including intractable networks (e.g., grids).
- Practical drawbacks of the proposed algorithm:
  - Estimates parameters from only a small fraction of the data.
  - The structure learning algorithm enumerates all possible Markov blankets; its complexity is exponential in the Markov blanket size.
Slide 22: Done ...
- Additional and outdated slides follow.
Slide 23: Parameter learning theorem
Detailed theorem statements
Slide 24: Structure learning theorem
Detailed theorem statements
Slide 25: Learning factor graphs in polynomial time & sample complexity
- Factor graphs: a superset of Markov and Bayesian networks.
  - A Markov network (MN) maps to a factor graph with 1 factor per clique.
  - A Bayesian network (BN) maps to a factor graph with 1 factor per conditional probability table.
- Current practice in Markov network learning:
  - Parameter learning: max likelihood, only applicable in tractable MNs.
  - Structure learning: local-search heuristics or heuristic learning of a bounded tree-width model. No performance guarantees.
  - Finding the highest-scoring BN is NP-hard (Chickering et al., 2003).
Pieter Abbeel, Daphne Koller and Andrew Y. Ng
Slide 26: Learning factor graphs in polynomial time & sample complexity
- First polynomial-time, polynomial-sample-complexity learning algorithm for factor graphs.
- Applicable to any factor graph of bounded factor size and connectivity,
  - including intractable networks (e.g., grids).
- New technical ideas:
  - Parameter learning in closed form, using a parameterization with low-dimensional frequencies only.
  - Structure learning: results about guaranteed-approximate Markov blankets from sample data.
Pieter Abbeel, Daphne Koller and Andrew Y. Ng
Slide 27: Relation to Narasimhan & Bilmes (2004)
Structure learning
Property                           | Narasimhan & Bilmes (2004) | This paper
Independent of treewidth           | NO                         | YES
Independent of Markov blanket size | YES                        | NO
Graceful degradation result        | NO                         | YES

Examples: an n x n grid has treewidth n+1 but Markov blanket size 6; an n-star graph has treewidth 2 but Markov blanket size n.
Slide 28: Canonical parameterization
Slide 29: Canonical parameterization (2)
Slide 30: Canonical parameterization (3)
Slide 31: Markov blanket canonical factors
Slide 32: Markov blanket canonical parameterization
Slide 33: Approximate Markov blankets
Slide 34: Structure learning algorithm
Slide 35: Structure learning algorithm