Title: Algorithms for Answering Queries with Graphical Models
Slide 1: Algorithms for Answering Queries with Graphical Models
Thesis Proposal
Anton Chechetka
Thesis committee: Carlos Guestrin, Eric Xing, Drew Bagnell, Pedro Domingos (UW)
21 May 2009
Slide 2: Motivation
Activity recognition
Sensor networks
Patient monitoring and diagnosis
Image credits: http://www.dremed.com; [Pentney et al., 2006]
Slide 3: Motivation
Common problem: compute P(Q | E = e)
- What is the true temperature in a room?  Sensor 3 reads 25C.
- Has the person finished cooking?  The person is next to the kitchen sink (RFID).
- Is the patient well?  The heart rate is 70 BPM.
Slide 4: Common solution
Common problem: compute P(Q | E = e)  (the query)
Common solution: probabilistic graphical models (PGMs)
This thesis: new algorithms for learning and inference in PGMs to make answering queries better.
[Pentney et al., 2006], [Deshpande et al., 2004], [Beinlich et al., 1988]
Slide 5: Graphical models
Represent factorized distributions: P(X) is a product of factors f_a over small subsets X_a of X, giving a compact representation with a corresponding graph structure.
Figure: example graph over the variables X1, ..., X5.
Fundamental problems:
- Inference: compute P(Q | E = e) given a PGM?  #P-complete / NP-complete
- Best parameters f_a given the structure?  exp(|X|) complexity
- Optimal structure (i.e., the sets X_a)?  NP-complete
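For concreteness, the factorized form referred to above is the standard one (written out here, not copied verbatim from the slide):

    P(X_1, ..., X_n) = (1/Z) * prod_a f_a(X_a),    where  Z = sum over all assignments x of prod_a f_a(x_a).

Each factor f_a depends only on the small subset X_a, which is what makes the representation compact.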
Slide 6: This thesis
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees (NIPS 2007)
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Slide 7: Learning tractable models
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
- Every step in the pipeline is computationally hard for general PGMs, and the errors compound.
- But there are exact inference and parameter learning algorithms with exp(graph treewidth) complexity.
- So if we learn low-treewidth models, all the rest is easy!
Slide 8: Treewidth
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
- Learn low-treewidth models => all the rest is easy!
- Treewidth: the size of the largest clique in a triangulated graph, minus one.
- Computing treewidth is NP-complete in general.
- But it is easy to construct graphs with a given treewidth.
- Convenient representation: junction tree.
Figure: example junction tree over X1, ..., X7 with cliques {X1,X4,X5}, {X4,X5,X6}, {X1,X2,X5}, {X1,X3,X5}, {X1,X2,X7} (labeled C1-C5), connected by separators {X4,X5}, {X1,X5}, {X1,X5}, {X1,X2}.
Slide 9: Junction trees
- Learn junction trees => all the rest is easy!
- Other classes of tractable models exist, e.g. [Lowd & Domingos, 2008].
- Running intersection property.
- Finding the most likely junction tree of fixed treewidth > 1 is NP-complete.
- We will look for good approximations.
Figure: the same example junction tree as on the previous slide.
Slide 10: Independencies in low-treewidth distributions
P(X) factorizes according to a junction tree
=> the corresponding conditional independencies hold
=> the corresponding conditional mutual information terms are zero.
This works in the other direction too!
Figure: in the example junction tree, the separator {X1,X5} implies conditional independencies such as {X4,X6} being independent of {X2,X3,X7} given {X1,X5}.
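As a reminder, the conditional mutual information used throughout is the standard quantity (this expansion is textbook material, not copied from the slide):

    I(A; B | S) = H(A | S) - H(A | B, S)
                = sum_{a,b,s} P(a, b, s) log [ P(a, b | s) / (P(a | s) P(b | s)) ],

and I(A; B | S) = 0 exactly when A and B are conditionally independent given S.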
Slide 11: Constraint-based structure learning
We will look for junction trees in which these (approximate) conditional independencies hold.
Constraint-based structure learning:
- Take all candidate separators S (in the figure, S1, ..., S4).
- For each candidate, test whether I(V, X - V - S | S) < eps.
- Construct a junction tree consistent with the passing separators (e.g. using dynamic programming).
Slide 12: Mutual information estimation
Test: I(V, X - V - S | S) < eps ?
Definition: I(A, B | S) = H(A | S) - H(A | B, S).
Naive estimation costs exp(|X|), which is too expensive: the sum ranges over all 2^|X| assignments to X.
Our work: an upper bound on I(V, X - V - S | S) using the values of I(Y, Z | S) for |Y u Z| <= treewidth + 1.
There are O(|X|^(treewidth + 1)) such subsets Y and Z
=> complexity polynomial in |X|.
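A minimal sketch of how the small-subset quantities I(Y, Z | S) could be estimated from samples (discrete variables, plug-in estimator; the function names and the data layout are illustrative assumptions, not the thesis code):

    import numpy as np

    def entropy_of(data, cols):
        """Plug-in (empirical) entropy H(X_cols) from rows of discrete samples."""
        if not cols:
            return 0.0
        _, counts = np.unique(data[:, cols], axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    def cond_mutual_info(data, Y, Z, S):
        """Empirical I(Y; Z | S) = H(Y,S) + H(Z,S) - H(Y,Z,S) - H(S)."""
        return (entropy_of(data, Y + S) + entropy_of(data, Z + S)
                - entropy_of(data, Y + Z + S) - entropy_of(data, S))

For example, cond_mutual_info(samples, [0], [3], [1, 4]) estimates I(X0; X3 | X1, X4) from a samples array of shape (n, |X|).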
Slide 13: Mutual information estimation
Hard: estimating I(V, X - V - S | S) directly.
Easy: estimating I(A, B | S) for small subsets A and B with |A u B| <= treewidth + 1.
Theorem: suppose that P(X), S, V are such that
- an eps-junction tree of treewidth k for P(X) exists, and
- for every A ⊆ V and B ⊆ X - V - S with |A u B| <= k + 1, we have I(A, B | S) <= delta.
Then I(V, X - V - S | S) <= |X| (eps + delta).
- Complexity O(|X|^(k+1)) => exponential speedup.
- No need to know the eps-junction tree, only that it exists.
- The bound is loose only when there is no hope to learn a good junction tree anyway.
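A sketch of how the theorem could be turned into a separator test, reusing cond_mutual_info from the sketch above (the function name and interface are illustrative assumptions):

    from itertools import combinations

    def separator_score(data, V, rest, S, k):
        """Max of I(A; B | S) over A within V, B within rest, |A| + |B| <= k + 1.
        By the theorem, I(V; rest | S) <= |X| * (eps + this score) whenever an
        eps-junction tree of treewidth k exists."""
        best = 0.0
        for a_size in range(1, k + 2):
            for b_size in range(1, k + 2 - a_size):
                for A in combinations(V, a_size):
                    for B in combinations(rest, b_size):
                        best = max(best, cond_mutual_info(data, list(A), list(B), S))
        return best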
Slide 14: Guarantees on learned model quality
Theorem: suppose that P(X) is such that a strongly connected eps-junction tree of treewidth k for P(X) exists. Then, with probability at least (1 - delta), our algorithm finds a junction tree (C, E) satisfying the stated quality guarantee, using a polynomial number of samples and polynomial time.
Corollary: strongly connected junction trees are PAC-learnable.
Slide 15: Related work
Reference                      Model        Guarantees     Time
[Bach & Jordan, 2002]          tractable    local          poly(n)
[Chow & Liu, 1968]             tree         global         O(n^2 log n)
[Meila & Jordan, 2001]         tree mix     local          O(n^2 log n)
[Teyssier & Koller, 2005]      compact      local          poly(n)
[Singh & Moore, 2005]          all          global         exp(n)
[Karger & Srebro, 2001]        tractable    const-factor   poly(n)
[Abbeel et al., 2006]          compact      PAC            poly(n)
[Narasimhan & Bilmes, 2004]    tractable    PAC            exp(n)
our work                       tractable    PAC            poly(n)
Slide 16: Results: typical convergence time
Figure: test log-likelihood as a function of learning time; good results early on in practice.
Slide 17: Results: log-likelihood
Figure: test log-likelihood comparison (higher is better) between our method and:
- OBS: local search over limited in-degree Bayes nets
- Chow-Liu: most likely junction trees of treewidth 1
- Karger-Srebro: constant-factor approximation junction trees
Slide 18: This thesis
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees (NIPS 2007)
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Slide 19: Approximate inference is still useful
- Often learning a tractable graphical model is not an option:
  - domain knowledge is needed
  - templatized models: Markov logic networks, probabilistic relational models, dynamic Bayesian networks
- In this part, the (intractable) PGM is a given.
- What can we do on the inference side?
- What if we know the query variables Q and the evidence E = e?
Slide 20: Query-specific simplification
- In this part, the (intractable) PGM is a given.
- Observation: often many variables are unknown, but also not important to the user.
- Suppose we know the variables Q of interest (the query).
- Observation: usually, variables far away from the query do not affect P(Q) much.
Slide 21: Query-specific simplification
Observation: variables far away from the query do not affect P(Q) much.
Idea: discard the parts of the model that have little effect on the query, and process the parts closest to the query first.
Observation: the values of the potentials are important, not just graph distance.
Our work:
- edge importance derived from the values of the potentials
- efficient algorithms for model simplification
- focused inference as soft model simplification
Figure: a large model with the query highlighted; regions far from the query have little effect on P(Q).
Slide 22: Belief propagation [Pearl, 1988]
- For every edge Xi - Xj and variable, a message m_{i->j}: a belief about the marginal over Xi.
- Algorithm: update the messages until convergence.
- A fixed point of the BP update operator is the solution.
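A minimal sketch of a single sum-product message update on a pairwise model (discrete variables; the data structures and names are illustrative assumptions, not the deck's notation):

    import numpy as np

    def update_message(pairwise, unary_i, incoming, j):
        """Compute m_{i->j}(x_j) on a pairwise MRF.

        pairwise: potential matrix psi_ij of shape (|X_i|, |X_j|)
        unary_i:  unary potential phi_i of shape (|X_i|,)
        incoming: dict {k: m_{k->i}} of messages into i from neighbors k
        """
        prod = unary_i.copy()
        for k, m_ki in incoming.items():
            if k != j:
                prod = prod * m_ki            # multiply messages from the other neighbors
        msg = pairwise.T @ prod               # sum over x_i of psi_ij(x_i, x_j) * prod(x_i)
        return msg / msg.sum()                # normalize for numerical stability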
Slide 23: Model simplification problem
Model simplification problem: which message updates can we skip so that
- the inference cost becomes small enough, and
- the BP fixed point for P(Q) does not change much?
Slide 24: Edge costs
- Inference cost IC(i->j): the complexity of one BP update for m_{i->j}.
- Approximation value AV(i->j): a measure of the influence of m_{i->j} on the query belief P(Q).
Model simplification problem: find the set of edges E' ⊆ E such that
- the sum of AV(i->j) over E' is maximized (maximize fit quality), and
- the sum of IC(i->j) over E' stays within the inference budget (keep inference affordable).
Lemma: the model simplification problem is NP-hard.
Greedy edge selection gives a constant-factor approximation.
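A sketch of the kind of greedy budgeted selection referred to above (a standard cost-benefit greedy rule; the exact rule and approximation factor used in the thesis are not spelled out in the extracted slide, so this is only illustrative):

    def greedy_select(edges, value, cost, budget):
        """Pick edges by value/cost ratio until the inference budget is exhausted.

        edges:  iterable of edge identifiers (i, j)
        value:  dict edge -> AV(i->j)
        cost:   dict edge -> IC(i->j)
        budget: total inference budget
        """
        chosen, spent = [], 0.0
        ranked = sorted(edges, key=lambda e: value[e] / cost[e], reverse=True)
        for e in ranked:
            if spent + cost[e] <= budget:
                chosen.append(e)
                spent += cost[e]
        return chosen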
Slide 25: Approximation values
- Approximation value AV(i->j): a measure of the influence of m_{i->j} on the query belief P(Q).
- How important is an edge (i->j) to a message (r->q) at the query?
- Along a path pi of BP updates, m_{r->q} is a function BP(m_{v->n}) of an upstream message; define the path strength of pi from this dependence.
- Max-sensitivity approximation value: AV(i->j) is the single strongest dependency (in terms of derivatives) that (i->j) participates in:
  AV(i->j) = max over paths pi containing (i->j) of path_strength(pi).
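One way to make this concrete (the product-of-derivatives form below is my reading of the slide, i.e., an assumption rather than the thesis' exact definition):

    path_strength(pi) = prod over consecutive update steps (a->b) -> (b->c) in pi of || d m_{b->c} / d m_{a->b} ||
    AV(i->j) = max over paths pi with (i->j) in pi of path_strength(pi)

By the chain rule, this product upper-bounds the sensitivity || d m_{r->q} / d m_{v->n} || contributed along the single path pi.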
Slide 26: Efficient model simplification
Max-sensitivity approximation value: AV(i->j) is the single strongest dependency (in terms of derivatives) that (i->j) participates in.
Lemma: with max-sensitivity edge values, the optimal submodel can be found
- as the first M edges expanded by best-first search,
- with constant-time computation per expanded edge (using [Mooij & Kappen, 2007]).
Simplification complexity is independent of the size of the full model: it only depends on the solution size.
Templated models: only instantiate the model parts that are in the solution.
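A sketch of best-first edge expansion of the sort described above (priority queue keyed by the strongest path from the query; the graph interface and the per-step strength function are illustrative assumptions):

    import heapq

    def best_first_simplify(query_edges, neighbors, link_strength, M):
        """Expand the M edges with the strongest dependency path to the query.

        query_edges:   edges incident to the query (the search roots)
        neighbors:     function edge -> edges feeding into it
        link_strength: function (edge, next_edge) -> strength of one dependency step,
                       assumed to lie in [0, 1] so the best-first order is valid
        """
        heap = [(-1.0, e) for e in query_edges]   # max-heap via negated priorities
        heapq.heapify(heap)
        best, selected = {}, []
        while heap and len(selected) < M:
            neg_s, e = heapq.heappop(heap)
            s = -neg_s
            if e in best:                         # already expanded via a stronger path
                continue
            best[e] = s
            selected.append(e)
            for e_next in neighbors(e):           # relax outgoing dependencies
                if e_next not in best:
                    heapq.heappush(heap, (-(s * link_strength(e, e_next)), e_next))
        return selected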
Slide 27: Future work: multi-path dependencies
We want to take multiple dependency paths from (i->j) to the query message (r->q) into account, not just the single strongest one.
- Summing over all paths is possible, but expensive: O(|E|^3).
- k strongest paths?  AV(i->j) = max over paths pi_1, ..., pi_k containing (i->j) of sum_m path_strength(pi_m).
- Best-first search with at most k visits of an edge?
Figure: two dependency paths from (i->j) to the query.
Slide 28: Perturbation approximation values
Setting: a simple path pi from a message (v->n) to a query message (r->q); fixing all messages not in pi makes m_{r->q} a function BP(m_{v->n}).
Max-sensitivity path strength is the largest derivative value along the path with respect to the endpoint message; by the mean value theorem it bounds the change in m_{r->q}.
Observation: this does not take the possible range of the endpoint message into account.
Instead, define the path strength as an upper bound on the change of m_{r->q}; a tighter bound follows from the properties of BP messages.
Slide 29: Efficient model simplification
Define the max-perturbation value AV(i->j) = max over paths pi containing (i->j) of path_strength(pi), with the perturbation-based path strength.
Lemma: with max-perturbation edge values, assuming that the message derivatives along the paths pi are known, the optimal submodel can be found
- as the first M edges expanded by best-first search,
- with constant-time computation per expanded edge.
Extra work: we need to know the derivatives along the paths pi.
Solution: use max-sensitivity best-first search as a subroutine.
Slide 30: Future work: efficient max-perturbation simplification
Extra work: we need to know the derivatives along the paths pi; but not always!
The exact derivative for an edge is only needed if the resulting AV(i->j) could fall within the range that matters, bounded below by the current lower bound on path strength from the best-first search.
Slide 31: Future work: computation trees
- Prune computation trees according to edge importance.
- A computation tree traversal gives a message update schedule.
Figure: a small model and its (partially unrolled) computation tree.
Slide 32: Focused inference
- BP proceeds until all beliefs converge, but we only care about the query beliefs.
- Residual importance weighting for convergence testing.
- For residual BP: weigh the residuals by edge importance, so that more attention goes to the more important regions; convergence near the query matters more than convergence far from it.
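A sketch of how importance-weighted residual scheduling could look (a generic residual-style loop; the data structures and the scheduling details are illustrative assumptions, not the thesis implementation):

    import numpy as np

    def focused_residual_bp(messages, importance, compute_update, max_updates, tol=1e-6):
        """Residual-style BP where each edge's residual is weighted by its
        importance with respect to the query, so the most query-relevant
        inconsistency is fixed first.

        messages:       dict edge -> current message (numpy array)
        importance:     dict edge -> edge importance (e.g. AV(i->j))
        compute_update: function edge -> recomputed message given current messages
        """
        for _ in range(max_updates):
            # score every edge by importance-weighted residual (O(|E|) per step;
            # a real implementation would maintain these in a priority queue)
            scored = {e: importance[e] * np.abs(compute_update(e) - m).max()
                      for e, m in messages.items()}
            e_best = max(scored, key=scored.get)
            if scored[e_best] < tol:              # weighted residuals all small enough
                break
            messages[e_best] = compute_update(e_best)
        return messages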
Slide 33: Related work
- Minimal submodel with exactly the same P(Q | E = e) regardless of the values of the potentials: knowledge-based model construction [Wellman et al., 1992; Richardson & Domingos, 2006].
- Graph distance as an edge importance measure [Pentney et al., 2006].
- Empirical mutual information as a variable importance measure [Pentney et al., 2007].
- Inference in a simplified model to quantify the effect of an extra edge exactly [Kjaerulff, 1993; Choi & Darwiche, 2008].
Slide 34: This thesis
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees (NIPS 2007)
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Slide 35: Local models: motivation
Common approach: learn/construct structure -> approximate parameters -> approximate inference -> P(Q | E = e).
This talk, part 1: learn tractable structure -> optimal parameters -> exact inference -> P(Q | E = e).
What if no single tractable structure fits well?
Slide 36: Local models: motivation
What if no single tractable structure fits well?
Regression analogy: no single line q = f(e) fits the data well, but locally the dependence is almost linear.
Solution: learn local tractable models.
Global pipeline: learn tractable structure -> optimal parameters -> exact inference -> P(Q | E = e).
Local pipeline: get the evidence assignment E = e -> learn a tractable structure for E = e -> learn parameters for E = e -> exact inference -> P(Q | E = e).
Figure: scatter plot of the query q against the evidence e, illustrating the regression analogy.
Slide 37: Local models: example
Local pipeline: get the evidence assignment E = e -> learn a tractable structure for E = e -> learn parameters for E = e -> exact inference -> P(Q | E = e).
Example: local conditional random fields (CRFs).
- Global CRF: one set of features and weights shared across all evidence assignments.
- Local CRF: a query-specific structure indicator I_a(E) in {0, 1} selects which features are active for a given evidence assignment.
Figure: a global CRF versus local CRFs for the evidence assignments E = e_1, E = e_2, ..., E = e_n.
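A plausible way to write the two models named above (this notation is an assumption; the extracted slide only names the ingredients: features, weights, and the structure indicator I_a(E) in {0, 1}):

    Global CRF:  P(Q | E = e) proportional to exp( sum_a w_a f_a(Q_a, e) )
    Local CRF:   P(Q | E = e) proportional to exp( sum_a I_a(e) w_a f_a(Q_a, e) ),   I_a(e) in {0, 1}

The indicator I_a(e) switches feature f_a on or off for the given evidence assignment, so each evidence assignment gets its own (tractable) structure while the weights w are shared.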
Slide 38: Learning local models
We need to learn the weights w and the query-specific structure I(E).
Iterate:
- With the weights w known, find good local structures for every training point (E = e_i, Q = q_i), e.g. by local search.
- With the structures known for every training point, find the optimal weights w (convex optimization).
Problem: picking structures this way needs the query values, so it cannot be used at test time.
Figure: the alternating optimization loop over the training points E = e_1, ..., E = e_n.
Slide 39: Learning local models
Fix: parametrize the structure selector, I(E) -> I(E, V), and learn both w and the query-specific structure parameters V.
Iterate:
- With the weights w known, find good local structures for every training point (e.g. by local search).
- With the structures known for every training point, find the optimal weights w (convex optimization).
- Optimize V so that I(E, V) mimics the good local structures well on the training data.
Figure: the alternating optimization loop, now with I(E = e_i, V) standing in for the hand-picked structures.
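A skeleton of the alternating procedure as read from the slide; every subroutine is passed in as a callable because the actual search, weight-fitting, and selector-fitting steps are not specified here (all names and interfaces are illustrative assumptions):

    def learn_local_models(train_data, search_structure, fit_weights, fit_selector,
                           init_w, init_V, n_iters=10):
        """Alternating optimization skeleton for local CRFs.

        train_data:       list of (evidence, query) pairs (e_i, q_i)
        search_structure: (evidence, query, w) -> local tractable structure
        fit_weights:      (train_data, structures) -> weights w (convex problem)
        fit_selector:     (evidences, structures) -> selector parameters V
        """
        w, V = init_w, init_V
        for _ in range(n_iters):
            # 1. with w fixed, pick a good tractable structure per training point
            structures = [search_structure(e, q, w) for e, q in train_data]
            # 2. with structures fixed, refit the feature weights
            w = fit_weights(train_data, structures)
            # 3. fit V so that the selector I(e, V) reproduces the chosen structures
            V = fit_selector([e for e, _ in train_data], structures)
        return w, V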
Slide 40: Future work: better exploration
We need to avoid shallow local minima:
- multiple structures per data point
- stochastic optimization: sample structures
Open question: will the sampled structures actually be different?
Figure: the same alternating optimization loop as on the previous slides.
Slide 41: Future work: multi-query optimization
A separate structure for every query may be too costly. Query clustering:
- directly using the evidence
- using the inferred model parameters (given w and V)
Slide 42: Future work: faster local search
- Need efficient structure learning: amortize the inference cost when scoring multiple search steps.
- Need support for nuisance variables in the structure scores.
Slide 43: Recap
Pipeline: learn/construct structure -> learn/define parameters -> inference -> P(Q | E = e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees (NIPS 2007)
2. Simplifying large-scale models / focusing inference on the query
3. Learning local tractable models by exploiting evidence assignments
Slide 44: Timeline
Summer 2009:
- Validation of query-specific (QS) model simplification: activity recognition data, MLN data
Fall 2009:
- QS simplification: multi-path extensions for edge importance measures; computation tree connections; max-perturbation computation speedups
Spring 2010:
- QS learning: better exploration (stochastic optimization / multiple structures per data point); multi-query optimization; validation
Summer 2010:
- QS learning: nuisance variable support; local search speedups; quality guarantees; validation
- Write thesis, defend
Slide 45: Thank you!
Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf
Slide 46: Speeding things up
There are O(|X|^k) candidate separators.
Constraint-based algorithm:
- set L := {}
- for every potential separator S ⊆ X with |S| <= k: estimate I(.), update L
- find a junction tree (C, E) consistent with L
Observation: only |X| - k separators appear in (C, E), so the I(.) estimations for the remaining O(|X|^k) separators are wasted.
Faster heuristic:
- until (C, E) passes the checks:
  - estimate I(.), update L
  - find a junction tree (C, E) consistent with L
Slide 47: Speeding things up
Recall that our upper bound on I(V, X - V - S | S) uses all subsets Y ⊆ X \ S with |Y| <= k, estimating I(Y ∩ V, Y ∩ (X - V - S) | S) for each.
Idea: get a rough estimate by only looking at smaller Y (e.g. |Y| <= 2).
Faster heuristic:
- estimate I(.) with |Y| <= 2, form L
- do:
  - find a junction tree (C, E) consistent with L
  - estimate I(. | S') with |Y| <= k for the separators S' of (C, E), update L
  - check whether (C, E) is still an eps-junction tree under the updated I(. | S')
- until (C, E) passes the checks
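A skeleton of this rough-then-refine loop; the separator scoring, tree construction, and checking steps are passed in as callables because they are only named, not specified, on the slide (all names and interfaces are illustrative assumptions):

    def lazy_structure_learning(candidates, rough_bound, full_bound, build_jt, is_eps_jt):
        """Cheap |Y| <= 2 estimates for all candidate separators first; full
        |Y| <= k estimates only for separators the current junction tree uses.

        candidates:  iterable of candidate separators S
        rough_bound: S -> cheap mutual-information bound (small subsets Y only)
        full_bound:  S -> full mutual-information bound (|Y| <= k)
        build_jt:    scores dict -> junction tree consistent with the scores
        is_eps_jt:   (jt, scores) -> True if jt is still an eps-JT under the scores
        """
        scores = {S: rough_bound(S) for S in candidates}   # cheap first pass
        while True:
            jt = build_jt(scores)
            for S in jt.separators():                      # refine only the separators in use
                scores[S] = full_bound(S)
            if is_eps_jt(jt, scores):
                return jt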