Title: Algorithms for Answering Queries with Graphical Models
Slide 1: Algorithms for Answering Queries with Graphical Models
Thesis Proposal
Anton Chechetka
Thesis committee: Carlos Guestrin, Eric Xing, Drew Bagnell, Pedro Domingos (UW)
21 May 2009
Slide 2: Motivation
Activity recognition
Sensor networks
Patient monitoring and diagnosis
Image credits: http://www.dremed.com; [Pentney et al., 2006]
Slide 3: Motivation
Common problem: compute P(Q | E = e)
- What is the true temperature in a room?  Sensor 3 reads 25C.
- Has the person finished cooking?  The person is next to the kitchen sink (RFID).
- Is the patient well?  The heart rate is 70 BPM.
Slide 4: Common solution
Common problem: compute P(Q | E = e)  (the query)
Common solution: probabilistic graphical models (PGMs)
This thesis: new algorithms for learning and inference in PGMs to make answering queries better.
[Pentney et al., 2006], [Deshpande et al., 2004], [Beinlich et al., 1988]
Slide 5: Graphical models
Represent factorized distributions: P(X) is a product of factors f_a over small subsets X_a of X, giving a compact representation with a corresponding graph structure.
Figure: example graph over the variables X1, ..., X5.
Fundamental problems:
- Inference: compute P(Q | E = e) given a PGM?  #P-complete / NP-complete
- Best parameters f_a given the structure?  exp(|X|) complexity
- Optimal structure (i.e., the sets X_a)?  NP-complete
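For concreteness, the factorized form referred to above is the standard one (written out here, not copied verbatim from the slide):

    P(X_1, ..., X_n) = (1/Z) * prod_a f_a(X_a),    where  Z = sum over all assignments x of prod_a f_a(x_a).

Each factor f_a depends only on the small subset X_a, which is what makes the representation compact.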
Slide 6: This thesis
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees (NIPS 2007)
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Slide 7: Learning tractable models
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
- Every step in the pipeline is computationally hard for general PGMs, and the errors compound.
- But there are exact inference and parameter learning algorithms with exp(graph treewidth) complexity.
- So if we learn low-treewidth models, all the rest is easy!
Slide 8: Treewidth
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
- Learn low-treewidth models => all the rest is easy!
- Treewidth: the size of the largest clique in a triangulated graph, minus one.
- Computing treewidth is NP-complete in general.
- But it is easy to construct graphs with a given treewidth.
- Convenient representation: junction tree.
Figure: example junction tree over X1, ..., X7 with cliques {X1,X4,X5}, {X4,X5,X6}, {X1,X2,X5}, {X1,X3,X5}, {X1,X2,X7} (labeled C1-C5), connected by separators {X4,X5}, {X1,X5}, {X1,X5}, {X1,X2}.
Slide 9: Junction trees
- Learn junction trees => all the rest is easy!
- Other classes of tractable models exist, e.g. [Lowd & Domingos, 2008].
- Running intersection property.
- Finding the most likely junction tree of fixed treewidth > 1 is NP-complete.
- We will look for good approximations.
Figure: the same example junction tree as on the previous slide.
Slide 10: Independencies in low-treewidth distributions
P(X) factorizes according to a junction tree
=> the corresponding conditional independencies hold
=> the corresponding conditional mutual information terms are zero.
This works in the other direction too!
Figure: in the example junction tree, the separator {X1,X5} implies conditional independencies such as {X4,X6} being independent of {X2,X3,X7} given {X1,X5}.
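As a reminder, the conditional mutual information used throughout is the standard quantity (this expansion is textbook material, not copied from the slide):

    I(A; B | S) = H(A | S) - H(A | B, S)
                = sum_{a,b,s} P(a, b, s) log [ P(a, b | s) / (P(a | s) P(b | s)) ],

and I(A; B | S) = 0 exactly when A and B are conditionally independent given S.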
Slide 11: Constraint-based structure learning
We will look for junction trees in which these (approximate) conditional independencies hold.
Constraint-based structure learning:
- Take all candidate separators S (in the figure, S1, ..., S4).
- For each candidate, test whether I(V, X - V - S | S) < eps.
- Construct a junction tree consistent with the passing separators (e.g. using dynamic programming).
Slide 12: Mutual information estimation
Test: I(V, X - V - S | S) < eps ?
Definition: I(A, B | S) = H(A | S) - H(A | B, S).
Naive estimation costs exp(|X|), which is too expensive: the sum ranges over all 2^|X| assignments to X.
Our work: an upper bound on I(V, X - V - S | S) using the values of I(Y, Z | S) for |Y u Z| <= treewidth + 1.
There are O(|X|^(treewidth + 1)) such subsets Y and Z
=> complexity polynomial in |X|.
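A minimal sketch of how the small-subset quantities I(Y, Z | S) could be estimated from samples (discrete variables, plug-in estimator; the function names and the data layout are illustrative assumptions, not the thesis code):

    import numpy as np

    def entropy_of(data, cols):
        """Plug-in (empirical) entropy H(X_cols) from rows of discrete samples."""
        if not cols:
            return 0.0
        _, counts = np.unique(data[:, cols], axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    def cond_mutual_info(data, Y, Z, S):
        """Empirical I(Y; Z | S) = H(Y,S) + H(Z,S) - H(Y,Z,S) - H(S)."""
        return (entropy_of(data, Y + S) + entropy_of(data, Z + S)
                - entropy_of(data, Y + Z + S) - entropy_of(data, S))

For example, cond_mutual_info(samples, [0], [3], [1, 4]) estimates I(X0; X3 | X1, X4) from a samples array of shape (n, |X|).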
Slide 13: Mutual information estimation
Hard: estimating I(V, X - V - S | S) directly.
Easy: estimating I(A, B | S) for small subsets A and B with |A u B| <= treewidth + 1.
Theorem: suppose that P(X), S, V are such that
- an eps-junction tree of treewidth k for P(X) exists, and
- for every A ⊆ V and B ⊆ X - V - S with |A u B| <= k + 1, we have I(A, B | S) <= delta.
Then I(V, X - V - S | S) <= |X| (eps + delta).
- Complexity O(|X|^(k+1)) => exponential speedup.
- No need to know the eps-junction tree, only that it exists.
- The bound is loose only when there is no hope to learn a good junction tree anyway.
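A sketch of how the theorem could be turned into a separator test, reusing cond_mutual_info from the sketch above (the function name and interface are illustrative assumptions):

    from itertools import combinations

    def separator_score(data, V, rest, S, k):
        """Max of I(A; B | S) over A within V, B within rest, |A| + |B| <= k + 1.
        By the theorem, I(V; rest | S) <= |X| * (eps + this score) whenever an
        eps-junction tree of treewidth k exists."""
        best = 0.0
        for a_size in range(1, k + 2):
            for b_size in range(1, k + 2 - a_size):
                for A in combinations(V, a_size):
                    for B in combinations(rest, b_size):
                        best = max(best, cond_mutual_info(data, list(A), list(B), S))
        return best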
Slide 14: Guarantees on learned model quality
Theorem: suppose that P(X) is such that a strongly connected eps-junction tree of treewidth k for P(X) exists. Then, with probability at least (1 - delta), our algorithm finds a junction tree (C, E) satisfying the stated quality guarantee, using a polynomial number of samples and polynomial time.
Corollary: strongly connected junction trees are PAC-learnable.
Slide 15: Related work
Reference                      Model        Guarantees     Time
[Bach & Jordan, 2002]          tractable    local          poly(n)
[Chow & Liu, 1968]             tree         global         O(n^2 log n)
[Meila & Jordan, 2001]         tree mix     local          O(n^2 log n)
[Teyssier & Koller, 2005]      compact      local          poly(n)
[Singh & Moore, 2005]          all          global         exp(n)
[Karger & Srebro, 2001]        tractable    const-factor   poly(n)
[Abbeel et al., 2006]          compact      PAC            poly(n)
[Narasimhan & Bilmes, 2004]    tractable    PAC            exp(n)
our work                       tractable    PAC            poly(n)
Slide 16: Results: typical convergence time
Figure: test log-likelihood as a function of learning time; good results early on in practice.
Slide 17: Results: log-likelihood
Figure: test log-likelihood comparison (higher is better) between our method and:
- OBS: local search over limited in-degree Bayes nets
- Chow-Liu: most likely junction trees of treewidth 1
- Karger-Srebro: constant-factor approximation junction trees
Slide 18: This thesis
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees (NIPS 2007)
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Slide 19: Approximate inference is still useful
- Often learning a tractable graphical model is not an option:
  - domain knowledge is needed
  - templatized models: Markov logic networks, probabilistic relational models, dynamic Bayesian networks
- In this part, the (intractable) PGM is a given.
- What can we do on the inference side?
- What if we know the query variables Q and the evidence E = e?
Slide 20: Query-specific simplification
- In this part, the (intractable) PGM is a given.
- Observation: often many variables are unknown, but also not important to the user.
- Suppose we know the variables Q of interest (the query).
- Observation: usually, variables far away from the query do not affect P(Q) much.
Slide 21: Query-specific simplification
Observation: variables far away from the query do not affect P(Q) much.
Idea: discard the parts of the model that have little effect on the query, and process the parts closest to the query first.
Observation: the values of the potentials are important, not just graph distance.
Our work:
- edge importance derived from the values of the potentials
- efficient algorithms for model simplification
- focused inference as soft model simplification
Figure: a large model with the query highlighted; regions far from the query have little effect on P(Q).
Slide 22: Belief propagation [Pearl, 1988]
- For every edge Xi - Xj and variable, a message m_{i->j}: a belief about the marginal over Xi.
- Algorithm: update the messages until convergence.
- A fixed point of the BP update operator is the solution.
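A minimal sketch of a single sum-product message update on a pairwise model (discrete variables; the data structures and names are illustrative assumptions, not the deck's notation):

    import numpy as np

    def update_message(pairwise, unary_i, incoming, j):
        """Compute m_{i->j}(x_j) on a pairwise MRF.

        pairwise: potential matrix psi_ij of shape (|X_i|, |X_j|)
        unary_i:  unary potential phi_i of shape (|X_i|,)
        incoming: dict {k: m_{k->i}} of messages into i from neighbors k
        """
        prod = unary_i.copy()
        for k, m_ki in incoming.items():
            if k != j:
                prod = prod * m_ki            # multiply messages from the other neighbors
        msg = pairwise.T @ prod               # sum over x_i of psi_ij(x_i, x_j) * prod(x_i)
        return msg / msg.sum()                # normalize for numerical stability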
Slide 23: Model simplification problem
Model simplification problem: which message updates can we skip so that
- the inference cost becomes small enough, and
- the BP fixed point for P(Q) does not change much?
Slide 24: Edge costs
- Inference cost IC(i->j): the complexity of one BP update for m_{i->j}.
- Approximation value AV(i->j): a measure of the influence of m_{i->j} on the query belief P(Q).
Model simplification problem: find the set of edges E' ⊆ E such that
- the sum of AV(i->j) over E' is maximized (maximize fit quality), and
- the sum of IC(i->j) over E' stays within the inference budget (keep inference affordable).
Lemma: the model simplification problem is NP-hard.
Greedy edge selection gives a constant-factor approximation.
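A sketch of the kind of greedy budgeted selection referred to above (a standard cost-benefit greedy rule; the exact rule and approximation factor used in the thesis are not spelled out in the extracted slide, so this is only illustrative):

    def greedy_select(edges, value, cost, budget):
        """Pick edges by value/cost ratio until the inference budget is exhausted.

        edges:  iterable of edge identifiers (i, j)
        value:  dict edge -> AV(i->j)
        cost:   dict edge -> IC(i->j)
        budget: total inference budget
        """
        chosen, spent = [], 0.0
        ranked = sorted(edges, key=lambda e: value[e] / cost[e], reverse=True)
        for e in ranked:
            if spent + cost[e] <= budget:
                chosen.append(e)
                spent += cost[e]
        return chosen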
Slide 25: Approximation values
- Approximation value AV(i->j): a measure of the influence of m_{i->j} on the query belief P(Q).
- How important is an edge (i->j) to a message (r->q) at the query?
- Along a path pi of BP updates, m_{r->q} is a function BP(m_{v->n}) of an upstream message; define the path strength of pi from this dependence.
- Max-sensitivity approximation value: AV(i->j) is the single strongest dependency (in terms of derivatives) that (i->j) participates in:
  AV(i->j) = max over paths pi containing (i->j) of path_strength(pi).
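One way to make this concrete (the product-of-derivatives form below is my reading of the slide, i.e., an assumption rather than the thesis' exact definition):

    path_strength(pi) = prod over consecutive update steps (a->b) -> (b->c) in pi of || d m_{b->c} / d m_{a->b} ||
    AV(i->j) = max over paths pi with (i->j) in pi of path_strength(pi)

By the chain rule, this product upper-bounds the sensitivity || d m_{r->q} / d m_{v->n} || contributed along the single path pi.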
Slide 26: Efficient model simplification
Max-sensitivity approximation value: AV(i->j) is the single strongest dependency (in terms of derivatives) that (i->j) participates in.
Lemma: with max-sensitivity edge values, the optimal submodel can be found
- as the first M edges expanded by best-first search,
- with constant-time computation per expanded edge (using [Mooij & Kappen, 2007]).
Simplification complexity is independent of the size of the full model: it only depends on the solution size.
Templated models: only instantiate the model parts that are in the solution.
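A sketch of best-first edge expansion of the sort described above (priority queue keyed by the strongest path from the query; the graph interface and the per-step strength function are illustrative assumptions):

    import heapq

    def best_first_simplify(query_edges, neighbors, link_strength, M):
        """Expand the M edges with the strongest dependency path to the query.

        query_edges:   edges incident to the query (the search roots)
        neighbors:     function edge -> edges feeding into it
        link_strength: function (edge, next_edge) -> strength of one dependency step,
                       assumed to lie in [0, 1] so the best-first order is valid
        """
        heap = [(-1.0, e) for e in query_edges]   # max-heap via negated priorities
        heapq.heapify(heap)
        best, selected = {}, []
        while heap and len(selected) < M:
            neg_s, e = heapq.heappop(heap)
            s = -neg_s
            if e in best:                         # already expanded via a stronger path
                continue
            best[e] = s
            selected.append(e)
            for e_next in neighbors(e):           # relax outgoing dependencies
                if e_next not in best:
                    heapq.heappush(heap, (-(s * link_strength(e, e_next)), e_next))
        return selected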
Slide 27: Future work: multi-path dependencies
We want to take multiple dependency paths from (i->j) to the query message (r->q) into account, not just the single strongest one.
- Summing over all paths is possible, but expensive: O(|E|^3).
- k strongest paths?  AV(i->j) = max over paths pi_1, ..., pi_k containing (i->j) of sum_m path_strength(pi_m).
- Best-first search with at most k visits of an edge?
Figure: two dependency paths from (i->j) to the query.
Slide 28: Perturbation approximation values
Setting: a simple path pi from a message (v->n) to a query message (r->q); fixing all messages not in pi makes m_{r->q} a function BP(m_{v->n}).
Max-sensitivity path strength is the largest derivative value along the path with respect to the endpoint message; by the mean value theorem it bounds the change in m_{r->q}.
Observation: this does not take the possible range of the endpoint message into account.
Instead, define the path strength as an upper bound on the change of m_{r->q}; a tighter bound follows from the properties of BP messages.
Slide 29: Efficient model simplification
Define the max-perturbation value AV(i->j) = max over paths pi containing (i->j) of path_strength(pi), with the perturbation-based path strength.
Lemma: with max-perturbation edge values, assuming that the message derivatives along the paths pi are known, the optimal submodel can be found
- as the first M edges expanded by best-first search,
- with constant-time computation per expanded edge.
Extra work: we need to know the derivatives along the paths pi.
Solution: use max-sensitivity best-first search as a subroutine.
Slide 30: Future work: efficient max-perturbation simplification
Extra work: we need to know the derivatives along the paths pi; but not always!
The exact derivative for an edge is only needed if the resulting AV(i->j) could fall within the range that matters, bounded below by the current lower bound on path strength from the best-first search.
Slide 31: Future work: computation trees
- Prune computation trees according to edge importance.
- A computation tree traversal gives a message update schedule.
Figure: a small model and its (partially unrolled) computation tree.
Slide 32: Focused inference
- BP proceeds until all beliefs converge, but we only care about the query beliefs.
- Residual importance weighting for convergence testing.
- For residual BP: weigh the residuals by edge importance, so that more attention goes to the more important regions; convergence near the query matters more than convergence far from it.
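A sketch of how importance-weighted residual scheduling could look (a generic residual-style loop; the data structures and the scheduling details are illustrative assumptions, not the thesis implementation):

    import numpy as np

    def focused_residual_bp(messages, importance, compute_update, max_updates, tol=1e-6):
        """Residual-style BP where each edge's residual is weighted by its
        importance with respect to the query, so the most query-relevant
        inconsistency is fixed first.

        messages:       dict edge -> current message (numpy array)
        importance:     dict edge -> edge importance (e.g. AV(i->j))
        compute_update: function edge -> recomputed message given current messages
        """
        for _ in range(max_updates):
            # score every edge by importance-weighted residual (O(|E|) per step;
            # a real implementation would maintain these in a priority queue)
            scored = {e: importance[e] * np.abs(compute_update(e) - m).max()
                      for e, m in messages.items()}
            e_best = max(scored, key=scored.get)
            if scored[e_best] < tol:              # weighted residuals all small enough
                break
            messages[e_best] = compute_update(e_best)
        return messages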
Slide 33: Related work
- Minimal submodel with exactly the same P(Q | E = e) regardless of the values of the potentials: knowledge-based model construction [Wellman et al., 1992; Richardson & Domingos, 2006].
- Graph distance as an edge importance measure [Pentney et al., 2006].
- Empirical mutual information as a variable importance measure [Pentney et al., 2007].
- Inference in a simplified model to quantify the effect of an extra edge exactly [Kjaerulff, 1993; Choi & Darwiche, 2008].
Slide 34: This thesis
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees (NIPS 2007)
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Slide 35: Local models: motivation
Common approach: learn/construct structure -> approximate parameters -> approximate inference -> P(Q | E = e).
This talk, part 1: learn tractable structure -> optimal parameters -> exact inference -> P(Q | E = e).
What if no single tractable structure fits well?
Slide 36: Local models: motivation
What if no single tractable structure fits well?
Regression analogy: no single line q = f(e) fits the data well, but locally the dependence is almost linear.
Solution: learn local tractable models.
Global pipeline: learn tractable structure -> optimal parameters -> exact inference -> P(Q | E = e).
Local pipeline: get the evidence assignment E = e -> learn a tractable structure for E = e -> learn parameters for E = e -> exact inference -> P(Q | E = e).
Figure: scatter plot of the query q against the evidence e, illustrating the regression analogy.
Slide 37: Local models: example
Local pipeline: get the evidence assignment E = e -> learn a tractable structure for E = e -> learn parameters for E = e -> exact inference -> P(Q | E = e).
Example: local conditional random fields (CRFs).
- Global CRF: one set of features and weights shared across all evidence assignments.
- Local CRF: a query-specific structure indicator I_a(E) in {0, 1} selects which features are active for a given evidence assignment.
Figure: a global CRF versus local CRFs for the evidence assignments E = e_1, E = e_2, ..., E = e_n.
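A plausible way to write the two models named above (this notation is an assumption; the extracted slide only names the ingredients: features, weights, and the structure indicator I_a(E) in {0, 1}):

    Global CRF:  P(Q | E = e) proportional to exp( sum_a w_a f_a(Q_a, e) )
    Local CRF:   P(Q | E = e) proportional to exp( sum_a I_a(e) w_a f_a(Q_a, e) ),   I_a(e) in {0, 1}

The indicator I_a(e) switches feature f_a on or off for the given evidence assignment, so each evidence assignment gets its own (tractable) structure while the weights w are shared.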
Slide 38: Learning local models
We need to learn the weights w and the query-specific structure I(E).
Iterate:
- With the weights w known, find good local structures for every training point (E = e_i, Q = q_i), e.g. by local search.
- With the structures known for every training point, find the optimal weights w (convex optimization).
Problem: picking structures this way needs the query values, so it cannot be used at test time.
Figure: the alternating optimization loop over the training points E = e_1, ..., E = e_n.
Slide 39: Learning local models
Fix: parametrize the structure selector, I(E) -> I(E, V), and learn both w and the query-specific structure parameters V.
Iterate:
- With the weights w known, find good local structures for every training point (e.g. by local search).
- With the structures known for every training point, find the optimal weights w (convex optimization).
- Optimize V so that I(E, V) mimics the good local structures well on the training data.
Figure: the alternating optimization loop, now with I(E = e_i, V) standing in for the hand-picked structures.
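A skeleton of the alternating procedure as read from the slide; every subroutine is passed in as a callable because the actual search, weight-fitting, and selector-fitting steps are not specified here (all names and interfaces are illustrative assumptions):

    def learn_local_models(train_data, search_structure, fit_weights, fit_selector,
                           init_w, init_V, n_iters=10):
        """Alternating optimization skeleton for local CRFs.

        train_data:       list of (evidence, query) pairs (e_i, q_i)
        search_structure: (evidence, query, w) -> local tractable structure
        fit_weights:      (train_data, structures) -> weights w (convex problem)
        fit_selector:     (evidences, structures) -> selector parameters V
        """
        w, V = init_w, init_V
        for _ in range(n_iters):
            # 1. with w fixed, pick a good tractable structure per training point
            structures = [search_structure(e, q, w) for e, q in train_data]
            # 2. with structures fixed, refit the feature weights
            w = fit_weights(train_data, structures)
            # 3. fit V so that the selector I(e, V) reproduces the chosen structures
            V = fit_selector([e for e, _ in train_data], structures)
        return w, V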
Slide 40: Future work: better exploration
We need to avoid shallow local minima:
- multiple structures per data point
- stochastic optimization: sample structures
Open question: will the sampled structures actually be different?
Figure: the same alternating optimization loop as on the previous slides.
Slide 41: Future work: multi-query optimization
A separate structure for every query may be too costly. Query clustering:
- directly using the evidence
- using the inferred model parameters (given w and V)
Slide 42: Future work: faster local search
- Need efficient structure learning: amortize the inference cost when scoring multiple search steps.
- Need support for nuisance variables in the structure scores.
Slide 43: Recap
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees (NIPS 2007)
2. Simplifying large-scale models / focusing inference on the query
3. Learning local tractable models by exploiting evidence assignments
Slide 44: Timeline
Summer 2009:
- Validation of query-specific (QS) model simplification: activity recognition data, MLN data
Fall 2009:
- QS simplification: multi-path extensions for edge importance measures; computation tree connections; max-perturbation computation speedups
Spring 2010:
- QS learning: better exploration (stochastic optimization / multiple structures per data point); multi-query optimization; validation
Summer 2010:
- QS learning: nuisance variable support; local search speedups; quality guarantees; validation
- Write thesis, defend
Slide 45: Thank you!
Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf
Slide 46: Speeding things up
There are O(|X|^k) candidate separators.
Constraint-based algorithm:
- set L := {}
- for every potential separator S ⊆ X with |S| <= k: estimate I(.), update L
- find a junction tree (C, E) consistent with L
Observation: only |X| - k separators appear in (C, E), so the I(.) estimations for the remaining O(|X|^k) separators are wasted.
Faster heuristic:
- until (C, E) passes the checks:
  - estimate I(.), update L
  - find a junction tree (C, E) consistent with L
Slide 47: Speeding things up
Recall that our upper bound on I(V, X - V - S | S) uses all subsets Y ⊆ X \ S with |Y| <= k, estimating I(Y ∩ V, Y ∩ (X - V - S) | S) for each.
Idea: get a rough estimate by only looking at smaller Y (e.g. |Y| <= 2).
Faster heuristic:
- estimate I(.) with |Y| <= 2, form L
- do:
  - find a junction tree (C, E) consistent with L
  - estimate I(. | S') with |Y| <= k for the separators S' of (C, E), update L
  - check whether (C, E) is still an eps-junction tree under the updated I(. | S')
- until (C, E) passes the checks
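A skeleton of this rough-then-refine loop; the separator scoring, tree construction, and checking steps are passed in as callables because they are only named, not specified, on the slide (all names and interfaces are illustrative assumptions):

    def lazy_structure_learning(candidates, rough_bound, full_bound, build_jt, is_eps_jt):
        """Cheap |Y| <= 2 estimates for all candidate separators first; full
        |Y| <= k estimates only for separators the current junction tree uses.

        candidates:  iterable of candidate separators S
        rough_bound: S -> cheap mutual-information bound (small subsets Y only)
        full_bound:  S -> full mutual-information bound (|Y| <= k)
        build_jt:    scores dict -> junction tree consistent with the scores
        is_eps_jt:   (jt, scores) -> True if jt is still an eps-JT under the scores
        """
        scores = {S: rough_bound(S) for S in candidates}   # cheap first pass
        while True:
            jt = build_jt(scores)
            for S in jt.separators():                      # refine only the separators in use
                scores[S] = full_bound(S)
            if is_eps_jt(jt, scores):
                return jt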