Title: Algorithmic Tools Applied to Some Machine Learning and Inference Problems
1. Algorithmic Tools Applied to Some Machine Learning and Inference Problems
2. Conclusion
- Algorithms research has produced a huge set of tools that can be applied to solving computational problems efficiently
- Any time one can precisely define a computational problem, it's worth looking in the algorithms toolbox to see if it can be solved efficiently
- So, define some problems and talk to your local algorithmicist!
3. Outline
- Demonstrate, by case studies, applying algorithmic tools to four distinct learning/inference problems
- Near neighbor searching
- MDPs with nonrecurring rewards
- Decoding turbo codes
- Learning low-treewidth graphical models
- Each is a well-defined computational problem
4. Approach
- 4 separate talks?
- Commonality not in tools (too many to enumerate)
- Rather, in reductionist / analogical approach
- Common heuristics in search for right tools
- By analogy to similar problems
- By properties of problem, e.g.
- Self-reducibility: recursion, sampling
- Use of paths, trees, cliques, etc.
- Natural integer programming formulations
- By judicious simplification
5. Tools we'll use today
- Random sampling
- Self-reducibility
- Amortization
- Dynamic programming
- Combinatorial graph theory
- Constant-factor approximation algorithms for NP-hard problems
- Max flow
- Approximation-preserving reductions
- Locally checkable structures
- Linear programming relaxations
- Randomized rounding of relaxed solutions
6. Lost in Translation
- Algorithms designed for clean problems
- Often require simplifying assumptions about real-world problems
- Measures of performance sometimes artificial
- Constant approximation factor nice in theory, but which constant matters in practice
- So, algorithm may or may not be useful
- But even if not, can provide insight into practical ways of attacking the problem
7. A Data Structure for Nearest Neighbors on Manifolds
- Random Sampling
- Greedy algorithms
- Local Search
- Memoization
8. Motivation
- Common theme in machine learning: find a high-fidelity embedding of high-dimensional data into a lower-dimensional representation
- Isomap [TdSL] and locally linear embedding [RS] postulate data on a low-dimensional manifold in high-dimensional space
- To reconstruct the manifold, both find each node's near neighbors by brute force
- Quadratic time in number of points
- Improve with fast near-neighbor queries?
9. The Original Motivation
- Peer-to-peer systems
- Nodes need to find nearby nodes for fast communication/data access
- Internet is (kind of) low dimensional
- This near neighbor data structure works well
10. Near Neighbor Search
- Many optimal data structures in low dimensions (computational geometry)
- But the high-dimensional problem is hard
- Subject of much recent attention in the algorithms community
- Kleinberg, "Two algorithms for nearest-neighbor search in high dimensions," STOC 1997
- [DISV et al.] workshop on near neighbor search, NIPS 2003
- But the focus is on approximate near neighbors
11. First Try
- Manifold is low dimensional
- Use low-dimensional NN solutions?
- E.g. Voronoi diagram
- Problem:
- Voronoi diagram relies on geometry of space
- But we don't know the geometry (coordinate system) until after we construct the manifold
- So to reconstruct, we are limited to using distances, not geometry
12. Some Generic Ideas
- Greedy approach
- Take query q and closest point p found so far
- Arrange to find something closer to q than p
- Repeat till done
- Local Search
- Want point closer to q
- Must be pretty close to p
- So examine points near p till find one close to q
- Repeat with new point
(Figure: query q at distance d from current point p, with a ball of radius d/2 around q.)
13. Some More Ideas
- Random Sampling
- For fast search, look only at a few candidates
- Good candidates depend on query point
- For any choice, adversary can pick bad q
- So choose random set---probably good for all
- Memoization
- Picking random points near p is a NN search!
- Pick in advance, as part of building the data structure
- E.g., for every possible radius (!), store (a few) random candidates in that ball around p
14. Random Sampling
- When will it work?
- Evenly distributed low-dimensional point sets
- E.g. a grid of points
- E.g., random sample from low-dim. manifold
- On a grid, if p and q are at distance d, then Ball(q, d/2) contains 1/16 of the points in Ball(p, 2d)
- So, a random point has a 1/16 chance of improving
- And trying 16 random points raises the odds to roughly 60-40 (quick check below)
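A quick check of that claim: if each sample independently lands in Ball(q, d/2) with probability 1/16, the chance that at least one of 16 samples improves is
\[ 1-\left(\tfrac{15}{16}\right)^{16} \approx 0.64, \]
i.e. roughly 60-40 odds of making progress in each step.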
15. A Big Fast Data Structure?
- At each point p, store 16 random points from balls of every possible radius around p
- Resolve query by iterating improvements
- Look in the right-size ball around the current point
- If lucky, find a point closer to q; can look in a smaller ball next time
- If unlucky, end up with a bigger ball
- But luck is more likely, so we drift in the right direction
- Expected ball size shrinks by a constant factor
- So, O(log n) steps get to a tiny ball, i.e. at the answer
16. Time to Get Real
- Storing samples for all radii expensive
- OK to be sloppy, and store only for powers of 2
- Means always have ball of roughly correct size
- But only log n balls needed per point
- So O(n log n) space overall
- Worry about dependence
- What if search cycles back to point already seen?
- random samples are no longer random
- And how find random samples in first place?
17. Metric Skip Lists
- Consider randomly ordering the list of points
- For each power-of-two d, point p records the next 16 points in the list within distance d
- These points are a random sample from that ball
- Search inspects points forward in the list (query sketch below)
- From the current p, check the 16 followers at the current d
- Search always moves forward
- So always considering new points
- So no cycles, and samples independent in every step (mostly)
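A minimal Python sketch of the query walk just described, assuming Euclidean points, a brute-force build step (the real construction is the randomized incremental one on the next slide), and 16 samples per power-of-two radius; it illustrates the forward greedy improvement, not the paper's exact data structure.

```python
import math
import random

C = 16  # candidates stored per radius, as in the talk

def build(points, max_radius):
    """Randomly order the points; for each point and each power-of-two radius,
    record up to C later points in the order within that radius (brute force,
    for illustration only)."""
    order = points[:]
    random.shuffle(order)
    levels = int(math.log2(max_radius)) + 1
    samples = []  # samples[i][j] = successors of order[i] within distance 2**j
    for i, p in enumerate(order):
        samples.append([
            [x for x in order[i + 1:] if math.dist(p, x) <= 2 ** j][:C]
            for j in range(levels + 1)
        ])
    return order, samples

def query(order, samples, q):
    """Greedy local search: look at stored samples in a ball of roughly the
    current distance to q, and move forward to any strictly closer point."""
    cur, best = 0, order[0]
    improved = True
    while improved:
        improved = False
        j = max(0, math.ceil(math.log2(max(math.dist(best, q), 1e-9))))
        j = min(j + 1, len(samples[cur]) - 1)   # ball of roughly the right size
        for cand in samples[cur][j]:
            if math.dist(cand, q) < math.dist(best, q):
                cur, best, improved = order.index(cand), cand, True
                break
    return best

pts = [(x, y) for x in range(10) for y in range(10)]   # evenly spread grid
order, samples = build(pts, max_radius=16)
print(query(order, samples, (3.2, 7.7)))
```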
18. Construction
- Randomized incremental construction
- Common idea from computational geometry
- Adding points to a data structure in random order is often easy and yields a good data structure
- Build the list by prepending points in random order
- To set new pointers, must find the next 16 points within (each) distance d of the new point
- But this is like a NN query (given the way we search)
- Find the answer by a slight variation on NN search
- The list constructed so far makes this easy
19. Summary
- In any low-dimensional space with evenly distributed points, metric skip lists solve nearest neighbor queries
- O(n log n) space
- O(n log n) time to build the data structure
- O(log n) time to query
- Constants depend on the dimensionality of the space
- Application: by building this structure, can find NNs for Isomap/LLE in O(n log n) time
20. Experiments
- Data structure is simple, with no hidden theoretical aspects
- Implementation for P2P search works well
- Experiments on some simple manifolds [JT] work well too
21. Improvements [KL04]
- There are better ways to pick improvement candidates for local search
- Assouad's doubling dimension: the maximum number of diameter-d/2 balls needed to cover any diameter-d ball
- Pick one candidate from each ball in the cover
- Get the same bounds as above, but
- Applies to unevenly distributed points
- Deterministic
22. Markov Decision Processes with Nonrecurring Rewards
- Approximation algorithms
- Reductions to related problems
- Dynamic programming
23. A Robot Navigation Problem
- Robot to deliver packages
- Goal to deliver as quickly as possible
- Sounds like traveling salesman problem?
- Mismatches
- Robot may not go where it plans to (sensor error, motor control error, battery failure, ...)
- Some packages matter more
24. Formulate as a Markov Decision Process
- Graph with rewards r_v on states (vertices) v, travel times (lengths) on edges
- From each node, a choice of actions (each a probability distribution on the next vertex)
- Choosing a sequence of actions produces a random path through the graph
- If we arrive at vertex v at time t, we receive discounted reward γ^t r_v, where γ < 1
- Motivates getting there quickly
- Goal: maximize total discounted reward
25. MDP Discounting
- Reward received each time a vertex is visited
- So the plain value of an infinite path can be infinite
- Discounting means total reward is bounded by a geometric series, so bounded
- Alternative: consider average reward per unit time
- Other reasons for discounting
- Inflation (money in the future is worth less than money now)
- Uncertainty (what if something happens before I collect the future prize?)
- Mathematical elegance
26. Solution
- Fixing an action at each state produces a Markov chain with transition probabilities p_vw
- Can compute the expected discounted reward R_v if we start at state v
- R_v = r_v + Σ_w p_vw γ^{t(v,w)} R_w (sketch below)
- Choosing actions to optimize this recurrence is polynomial-time solvable
- Linear programming
- Dynamic programming (like shortest paths)
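A minimal value-iteration sketch of this recurrence, assuming a small explicit MDP with travel times t(v,w) ≥ 1 and γ < 1. It is not the LP or shortest-path method named above, just the simplest way to see the recurrence converge; the data layout (dicts for actions, travel times, rewards) is assumed for illustration.

```python
# Value iteration for R_v = r_v + max over actions of sum_w p(v,w) * gamma**t(v,w) * R_w
def value_iteration(states, actions, t, r, gamma, iters=1000):
    R = {v: 0.0 for v in states}
    for _ in range(iters):
        R = {
            v: r[v] + max(
                sum(p * gamma ** t[(v, w)] * R[w] for w, p in dist.items())
                for dist in actions[v]          # each action is a distribution on next vertices
            )
            for v in states
        }
    return R

# Tiny example: two states, one deterministic action each, unit travel times.
states = ["a", "b"]
actions = {"a": [{"b": 1.0}], "b": [{"a": 1.0}]}
t = {("a", "b"): 1, ("b", "a"): 1}
r = {"a": 1.0, "b": 0.0}
print(value_iteration(states, actions, t, r, gamma=0.9))
```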
27. Solving the wrong problem
- A package can only be delivered once
- So, incorrect to get the reward each time we reach the target
- One solution: expand the state space
- A vertex represents where I am and where I have been before (what packages already delivered)
- Reward nonzero only on states where the current location is not in the list of previously visited locations
- Now apply the MDP algorithm
- Problem: the expanded problem size is exponential in the original input
28. This is one Instance of a General Problem [KL]
- Often, an MDP has a state space with a nice small implicit description but a huge explicit description
- How do we accomplish MDP optimization on such instances?
29. Tackle an easier problem
- Problem has two novel elements for TOC
- Discounting of reward based on arrival time
- Probability distribution on outcome of actions
- We will ignore second issue
- In practice, robot can control errors
- Even the first issue by itself is hard and interesting
- First step towards solving the whole problem
- Frantic Salesman Problem
- Given rewards, travel times, and discount factor,
find a path maximizing total discounted reward
30. Approximation Algorithms
- FSP is NP-complete (thus, so is the more general MDP-type problem)
- Reduction from minimum latency TSP
- So intractable to solve exactly
- Goal: an approximation algorithm that is guaranteed to achieve at least some constant (< 1) multiple of the best possible discounted reward
31. TOC Toolbox
- Goal seems to be to find a short path that visits lots of reward
- Relates to the previously studied k-TSP problem
- Given a root vertex v, find a path of minimum total length that starts at v and visits vertices with (undiscounted) prize at least k
- Constant-factor approximation algorithm known for undirected graphs (so we assume this too)
- I.e., can find a path of at most a constant (> 1) multiple of the minimum possible total edge length
32. Mismatch
- A constant-factor approximation doesn't exponentiate well
- Suppose the optimum solution reaches some reward r at time t, for discounted reward γ^t r
- A constant-factor approximation would reach it within time 2t, for reward γ^{2t} r
- Result: get only a γ^t fraction of the optimum discounted reward, not a constant fraction
33. Idea: Change Objective Function
- Modify k-TSP to approximate the prize collected instead of the length: the orienteering problem
- Assume a tour of length l collecting prize p
- Find a tour of length l collecting prize p/2
- Avoids changing length, so exponentiation doesn't hurt
- Drawback: no constant-factor approximation previously known
- Flipping objective/feasibility transforms the problem
- (Our techniques end up resolving this too)
34. Idea: Upper Bounds
- General tool for approximation algorithms
- Show we are close to something no solution can beat
- Let d_v denote the shortest-path distance to v
- Define the prize at v as p_v = γ^{d_v} r_v
- Max discounted reward possibly collectable at v
- So max conceivable reward is Σ_v p_v
- Potential greedy algorithm: take the shortest path to one max-prize vertex
- Gets at least 1/n of optimum
35. Compare to Upper Bound
- If a given path reaches v at time t_v, define the excess at v as e_v = t_v − d_v
- Difference between the shortest path and the chosen one
- Then the discounted reward at v is γ^{e_v} p_v
- Idea: a good solution need not bother visiting nodes at large excess
- If the excess is large and the node is still worth visiting, its prize must be huge
- So forget the current path; just go straight to the huge prize without discounting
36. Formalize
- Fact: excess only increases as we traverse the path
- Excess reflects lost time; we can't make it up
- Without loss of generality, assume γ = 1/4
- Just scale edge lengths
- Claim: at least 1/2 of the optimum path's discounted reward R is collected before the path's excess rises to 1/2
- Let w be the first vertex with e_w > 1/2
- Suppose more than R/2 reward follows w
- Show contradiction
37. Path Improvement
- e_w > 1/2 but more than R/2 reward follows w
- Shortcut directly to w, then traverse the optimum path
- Reduces all excesses after w by at least 1/2
- So improves discount factors by (1/γ)^{1/2} = 2
- So doubles the discounted reward collected
- But this was more than R/2: contradiction (calculation after the figure)
(Figure: the optimum path with excess rising from 0 to 1/2 at w and on to 1; more than R/2 of the reward lies beyond w.)
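The doubling step in numbers, with γ = 1/4: shortcutting saves at least e_w ≥ 1/2 units of time at every vertex after w, so each discount factor improves by at least
\[ \gamma^{-e_w} \;\ge\; \gamma^{-1/2} \;=\; \left(\tfrac14\right)^{-1/2} \;=\; 2, \]
so the reward collected after w more than doubles, exceeding 2 · (R/2) = R and contradicting the optimality of R.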
38. Discount Discounting
- We showed large excess can be ignored
- But if excess is small, discounting by excess can be ignored!
- (discounted) reward ≈ (undiscounted) prize
- So, just find a path with small excess maximizing the amount of (undiscounted) prize
- Gives a path with (discounted) reward ≈ prize
- Of course, min-excess = min-distance
- But we may be better off approximating excess
39. Improvement on k-TSP: Approximate Excess
- Recall the discounted reward at v is γ^{e_v} p_v
- Prefix of the optimum discounted-reward path
- Has discounted reward Σ γ^{e_v} p_v > R/2
- So has prize Σ p_v > R/2
- And has no vertex with large excess
- Find a path of approximately (3 times) minimum excess and prize R/2
- (We can guess R/2)
- Excesses at most 3/2, so γ^{e_v} p_v ≥ p_v / 8
- So the discounted reward on the found path is > R/8
40. The Downside
- Min-excess problem more useful
- But harder to solve
- Approximating min-distance does not approximate
min-excess
(Figure: optimum has length 1+ε and excess ε; a length-2 path has excess ≈ 1, so approximating distance does not approximate excess.)
41. Exactly Solvable Case: monotonic paths
- Suppose the optimum goes through vertices in strictly increasing order of distance from the root
- Then we can find the optimum by a dynamic program
- Just as we can solve longest path in an acyclic graph
- Build a table: is there a monotonic path from v with length l and prize p?
- To answer, look for a u after v with a path of length l − d_vu and prize p − p_v
- Works because a monotonic path won't go back through v
42. Dynamic Program
(Figure: worked example of the table on a small graph; each vertex is labeled with (length, prize) pairs for the monotonic paths reaching it. A code sketch of this table-filling follows.)
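A minimal sketch of the monotonic-path dynamic program, assuming a complete metric graph given by point coordinates, integer prizes, and the root at index 0; the table best[v][p] holds the length of the shortest monotonic path from the root ending at v with prize exactly p.

```python
import math

def monotonic_dp(points, prize):
    """Fill best[v][p]: shortest root-to-v path visiting vertices in strictly
    increasing distance from the root and collecting prize exactly p."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    order = sorted(range(n), key=lambda v: d[0][v])   # increasing distance from root
    P = sum(prize)
    INF = float("inf")
    best = [[INF] * (P + 1) for _ in range(n)]
    best[0][prize[0]] = 0.0
    for i, v in enumerate(order):
        for j in range(i + 1, n):
            u = order[j]                               # u is farther from the root than v
            for p in range(P - prize[u] + 1):
                if best[v][p] < INF:
                    cand = best[v][p] + d[v][u]
                    if cand < best[u][p + prize[u]]:
                        best[u][p + prize[u]] = cand
    return best

points = [(0, 0), (1, 0), (2, 1), (3, 0)]
prize = [0, 1, 2, 1]
best = monotonic_dp(points, prize)
# Shortest monotonic path collecting total prize 3:
print(min(best[v][3] for v in range(len(points))))
```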
43. Approximable case: wiggly paths
- Length of the path to v is t_v = d_v + e_v
- If e_v > d_v then t_v > e_v > t_v / 2
- I.e., the path takes more than twice as long as necessary to reach its end
- So if we approximate t_v to a constant factor, we also approximate e_v to twice that constant factor
- But finding an approximately optimum t_v is the k-TSP problem
- Constant-factor approximation known
44Decompose into easy cases
monotone
monotone
monotone
wiggly
wiggly
Divides into independent problems
gt 2/3 of each wiggly path is excess
45. Decomposition Analysis
- 2/3 of each wiggly path is excess
- That excess accumulates into the whole path
- So, the total excess of the wiggly paths is upper bounded by the excess of the whole path
- Conclude the total length of the wiggly paths is upper bounded by 3/2 of the path excess
- Use k-TSP to find approximately shortest wiggles collecting the right amount of prize
- Approximates length, so approximates excess
- Over all wiggly parts, approximates total excess
46. Dynamic program
- For each pair of vertices and each (discretized) prize value, find
- The shortest monotonic path collecting the desired prize
- An approximately shortest wiggly path collecting the desired prize
- Note: polynomially many subproblems
- Use dynamic programming to find the optimum pasting-together of subproblems
47. Summary
- Showed the maximum discounted prize can be approximated via minimum-excess paths
- Showed how to approximate the min-excess path
- Also solves the orienteering problem
- Also solves the dual of k-TSP where length is fixed
- Also solves tree versions of all these problems, e.g. prize-approximate k-MST
48. Open Questions
- Directed graphs?
- We used k-TSP, only solved for undirected graphs
- For directed graphs, even standard TSP has no known constant-factor approximation
- We only use k-TSP/undirectedness in the wiggly parts
- Stochastic actions?
- Stochastic seems to imply directed
- Special case: forget rewards
- Given a choice of actions, choose actions to minimize the cover time of the graph
49. Decoding Turbo Codes via Linear Programming
- Linear Programming Relaxations
50. Basic Coding Problem
- Goal: transmit a message across a noisy channel that randomly perturbs parts of the message
- Binary symmetric channel: each transmitted bit flipped independently with probability p < 1/2
- Approach also works for the AWGN channel
- Method: introduce redundancy in messages to cope with perturbations
- Compute an encoding function mapping each information word u ∈ {0,1}^k to a codeword x ∈ {0,1}^n
- Receive and decode a randomly perturbed word y
51. Definitions
(Block diagram: information word u of length k → Encoder → codeword x of length n → noisy channel with bit error rate p → corrupt codeword y → Decoder → decoded codeword x̂ and decoded information word û.)
- Word error rate: Pr[û ≠ u]
52. Decoding
- The received perturbed codeword usually is not a codeword
- Need a rule for picking the best decoding possibility
- Performance measure: probability of giving the wrong answer
- Known as the word error rate (WER)
- Won't be zero, since with nonzero probability the channel replaces a codeword with a different, valid codeword (at that point, it is only natural to answer with the wrong information word)
53. Maximum Likelihood Decoding
- Specific decoding rule
- Choose the information word whose codeword maximizes the probability of the received word, Pr(y | x)
- Assuming binary symmetric channel errors, this is just the codeword at minimum Hamming distance from the received word
- More generally, a linear cost function on codewords
- Under general assumptions, this is the best way to decode
- Drawback: generally NP-complete
54. Turbo Codes
- A particular encoding approach
- Introduced in 1993 [Berrou, Glavieux, Thitimajshima]
- Simple linear-time encoding (state machine)
- Numerous fast heuristics for decoding
- E.g. belief propagation
- But may not converge
- Fantastic in practice, but unclear why
- Distance of the code is bad (multiple codewords with few differing bits, easily confused with each other)
- Some asymptotic analysis of random turbo codes
55. Decoding by Linear Programming
- Describe a polynomial-time linear-programming approach to decoding (arbitrary) turbo codes
- Generalizes to LDPC codes
- Certifies when it finds the ML codeword
- Precise analysis for RA(2) codes
- Repeat-accumulate codes are a special type of turbo code
- 2 means rate 1/2: two code bits per info bit
- Based on the analysis, an explanation of how to build a good RA(2) code
- Gives a code with an inverse-polynomial error bound
56. ML Decoding as Integer Linear Programming
- Given the corrupted word y, want to find the codeword x of minimum Hamming distance
- Note ||y − x||₁ = Σ_i (1 − 2y_i) x_i + constant (a linear function of x; derivation below)
- Code C ⊆ {0,1}^n
- Polytope P is the convex hull of C
- ML decoding wants the minimum-cost vertex of P
- Errors in the channel change y, perturbing the objective
- Good code: small objective perturbations leave the optimum vertex unchanged
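Spelling out the second bullet: for binary bits, |y_i − x_i| = y_i + (1 − 2y_i) x_i (check both values of y_i), so
\[ \|y-x\|_1 \;=\; \sum_i |y_i - x_i| \;=\; \sum_i y_i \;+\; \sum_i (1-2y_i)\,x_i , \]
which is a constant (depending only on y) plus a linear function of the codeword x.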
57. Linear Programming Relaxation
- Optimizing over P is intractable
- Generally no tractable way to work with the constraints (facets) defining P
- Find a tractable polytope similar to P; optimize over that instead
- The new polytope will generally have additional, non-integral vertices
(Figure: the polytope P relaxed to a larger polytope.)
58. Properties of a Good Relaxation
- Relaxed LP should be tractable
- Preferably few constraints
- Combinatorial structure to aid solution
- Relaxed polytope Q should contain the original P
- Ensures that the true optimum is feasible in the relaxation
- So the Q optimum is a lower bound on the P optimum
- Relaxation should be tight
- Vertices of Q should be close to P
- Increases hope that the optimum in Q will be valid in P
- Makes it easier to round a Q solution into P
59. Repeat-Accumulate Codes
- A particular type of turbo code
- Accumulator: accumulates the sum of bits mod 2
- RA(m) code
- Make m copies of the info word (total km bits)
- Apply a given fixed permutation to the bits
- Feed the resulting sequence through the accumulator
- Output the sequence of accumulator values (km outputs)
- We focus on the RA(2) code (encoder sketch below)
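A minimal RA(2) encoder sketch following these bullets; the permutation used here is an arbitrary illustrative choice, not the recommended one from the analysis later in the talk.

```python
def ra2_encode(info_bits, perm):
    """Repeat the info bits twice, permute, then accumulate (running XOR)."""
    repeated = info_bits + info_bits          # two copies: 2k bits
    permuted = [repeated[i] for i in perm]    # fixed permutation of the 2k bits
    acc, codeword = 0, []
    for b in permuted:                        # accumulator: running sum mod 2
        acc ^= b
        codeword.append(acc)
    return codeword

u = [1, 0, 1]
perm = [0, 3, 1, 4, 2, 5]                     # interleave the two copies
print(ra2_encode(u, perm))                    # 6 code bits for 3 info bits (rate 1/2)
```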
60. Trellis
(Figure: trellis with accumulator states 0/1 at each of six positions x1..x6; each info bit u1, u2, u3 labels two transitions.)
- Circles represent state of accumulator
- Each info bit appears at two transitions
- Sequence of states is codeword
61. Encoding with Trellis
(Figure: the same trellis with one path highlighted; the accumulator states along the path spell out the codeword for the info bits u1, u2, u3.)
62. Decoding with Trellis
(Figure: the trellis again, with the received code bits attached to the transitions.)
- At each step, the codeword says which state to transition to
- Read off the u_i label on the transition arc
63. Handling Errors
- Not every path through the trellis is a codeword
- Need both occurrences of each info bit to agree (cause the same transition)
- Call such a path agreeable
- Want the path with the fewest transitions not matching the received codeword
- Give cost 0 to transitions matching the received word, cost 1 to transitions not matching
- The shortest agreeable path is the ML codeword
64. Relaxation
- Shortest agreeable path is NP-complete
- Relax to min-cost agreeable flow
- A flow is a convex combination of paths (exponential number of variables, but tractable)
- Alternatively, a poly-size LP based on balance of incoming and outgoing flow (LP sketch below)
- Agreeability is a constraint that certain groups of transition edges carry the same amount of flow
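A sketch of the poly-size LP, with assumed notation (not the talk's): f_e is the flow on trellis edge e, c_e its 0/1 cost, δ⁻(v) and δ⁺(v) the edges entering and leaving trellis state v, s the start state, and A_i, B_i the two groups of transitions at which info bit i takes the value 1 (one group per copy of the bit).
\[
\begin{aligned}
\min\ & \sum_e c_e f_e \\
\text{s.t.}\ & \sum_{e\in\delta^-(v)} f_e = \sum_{e\in\delta^+(v)} f_e && \text{for every internal state } v,\\
& \sum_{e\in\delta^+(s)} f_e = 1, \\
& \sum_{e\in A_i} f_e = \sum_{e\in B_i} f_e && \text{for every info bit } i \ \text{(agreeability)},\\
& 0 \le f_e \le 1 .
\end{aligned}
\]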
65. Properties of Relaxation
- Any agreeable path is an agreeable flow
- Conclude the correct answer is feasible in the relaxation
- Conclude: if the min-cost agreeable flow is a path, then it is the shortest agreeable path
- Certificate property: the min-cost agreeable flow decoder will know when it has found the ML codeword
- Relaxation is tractable to solve
- Can directly solve via LP
- But there actually exists a reduction to standard min-cost flow, so we can use specialized fast algorithms
66. Performance
- When is the correct codeword (path) the optimum for MCAF?
- Intuition from residual graphs in flow
- Build a new graph by subtracting from each edge's capacity the amount of flow currently on that edge
- A minimum-cost flow is optimal if and only if there is no negative-cost cycle in the residual graph (i.e., of positive residual capacity)
- Otherwise, can push flow around the cycle and reduce the cost of the flow
- Says a local optimum is a global optimum
67. Generalize to MCAF
- The true codeword is optimal if there is no negative-cost agreeable cycle in the residual graph
- What does an agreeable cycle look like?
- Must diverge from the correct codeword path at some point (traverses the other label)
- Then traverses the same labels as the correct path for a while
- Then may return to the codeword path (again by traversing the opposite label)
- Each time it traverses the opposite label at some bit, agreeability requires it also traverse the opposite label at the other copy of that bit
68. Trellis
(Figure: the trellis again, highlighting the two transitions that carry each info bit.)
- Suppose all 0s sent
- If cycle uses 1-edge at second layer
- must also use 1-edge at 4th layer
69. An Easier Representation: the Tanner Graph
- Draw a path along the codeword
- Edge cost −1 if the received bit was flipped
- +1 otherwise
- Add a matching edge between vertices corresponding to the two copies of the same info bit
- All matching edge costs are 0
- Cycles in the residual graph correspond to cycles in this graph G, and have the same cost
- Hamiltonian edge: use the same transition as the codeword
- Matching edge: use the opposite transition
70. Picture
(Figure: Tanner graph for the example codeword: a cycle through x1..x6 with edge costs ±1, plus zero-cost matching edges joining the two copies of each of u1, u2, u3.)
71. Connection
- Circle paths = simple cycles in the trellis
- Matching edges enforce agreement
(Figure: the Tanner graph shown side by side with the trellis, illustrating how a cycle in the Tanner graph corresponds to an agreeable cycle in the trellis residual graph.)
72. Main Theorem
- The true codeword is the min-cost agreeable flow
solution iff there is no negative cost cycle in
the Tanner graph G
73. Suggests Good Codes
- Recall edge cost is −1 if the bit was flipped, which occurs with probability p < 1/2
- Intuition: if a cycle is large, it is unlikely that more than half its edges are flipped
- So, a good idea to build a graph in which all cycles are large
- Achieve by building a graph with large girth (length of the shortest cycle)
- Erdős gives a (path + matching) graph with girth Ω(log n) (and this is best possible)
74. Analysis
- Theorem: using the RA(2) code built from the Erdős graph, if p is below a constant threshold (depending on ε), then WER = Pr[negative cycle] < n^{−ε}
- Proof:
- A negative cycle has length at least log n
- Break it into subpaths of length log n
- One must be negative
- What is Pr[a negative path of length log n]?
- Probability at most n^{−(2+ε)}
- Degree-3 graph, so only n² paths of length log n
- Add up (union bound below)
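Putting the last three bullets together, a union bound over the at most n² candidate subpaths gives
\[ \Pr[\text{negative cycle}] \;\le\; n^{2}\cdot n^{-(2+\varepsilon)} \;=\; n^{-\varepsilon}. \]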
75. Experiments
76. Summary
- Combinatorial algorithm for decoding RA(2) codes
- Analysis of its error probability
- Recommended code based on analysis
- That code has polynomially small WER
- Experiments show good performance in practice
- But slow, since solving LP
- This approach gives insight, not algorithms
77. Extensions
- Discovered that Wainwright's tree-reweighted max-product (TRMP) is solving the dual of our LP
- Thus, TRMP has the same performance
- Gives a belief-propagation flavored decoder with the same performance as the LP decoder
- Techniques extend to arbitrary LDPC codes
- Relaxation by intersecting the parity polytopes for each parity check
- LP decoder (but the flow decoder doesn't extend)
78. Breaking Result [FMSSW]
- Analysis of the LP decoder for LDPC codes
- Proof that when the code is based on an expander graph, the LP decoder handles a constant fraction of corrupted bits
- Gives a proof of exponentially small error rate for a polynomial-time decoder of LDPC codes
79. Learning Markov Random Fields
80. Overview
- Fundamentals of Markov networks
- Maximum likelihood Markov network structure as a maximum hypertree problem
- Treewidth and hypertrees
- An approximation algorithm for maximum hypertrees
- Reducing maximum hypertrees to/from maximum likelihood Markov networks
81. Density Estimation
- T observations x^1, ..., x^T
- Each x^t = (x^t_1, ..., x^t_n) is a vector of n values
- Estimate the joint probability distribution P(X_1, ..., X_n) from which the samples were taken
- (Assume observations are i.i.d.)
82. Maximum likelihood approach
- Postulate best fit is the maximum likelihood distribution
- The distribution that maximizes the likelihood of the data
- To avoid over-fitting, limit the choice to within some (parametric) class
- Equivalent to projecting onto our class to find the nearest neighbor of the empirical distribution
- Our approach applies to this general distribution projection problem
83. Markov Random Networks
- Representations of joint distributions with limited dependence
- Variables x_1, ..., x_n
- Graph with variables as vertices
- Edges represent dependencies
- Conditioned on (any specific values of) its neighbors, x_i is independent of all other variables
- More generally, the two sides of any separator are independent conditioned on the separator
84. Problems
- Markov net inference
- Given data and a specific Markov net, find the parameter settings that best fit the data
- Markov net learning
- Given data, find the Markov net structure that best fits the data (under best parameter settings)
- For us, "best fits" = maximum likelihood
85. Hammersley-Clifford Theorem
- Cliques in a Markov network are special: no restriction on their marginal distribution
- A distribution P is a Markov network over graph G iff P factorizes over the cliques of G (factorization written out below)
- Note: φ_h assigns a value to each possible setting x_h of the variables in clique h
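For concreteness, the factorization the theorem refers to can be written as (with Z a normalizing constant)
\[ P(x_1,\dots,x_n) \;=\; \frac{1}{Z}\prod_{\text{cliques } h \text{ of } G}\varphi_h(x_h). \]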
86. Value of HC
- HC gives a concise representation of any MRF probability distribution function
- Only need to specify the clique potentials
- If variables take s values, then the potential on a size-k clique is represented by s^k values
- If cliques are small (constant size) then the representation is small (linear size)
87. Limits of HC
- Cannot use the HC factorization to compute important quantities
- Normalization of φ (can't even tell if it is needed)
- Marginal probability distributions (even of 1 variable)
- Conditional probability distributions (ditto)
- Maximum likelihood parameter settings (finding φ to fit the data)
88. Triangulated Markov Networks
- No minimal cycles of more than three vertices
(Figure: example graphs on X1..X6, one triangulated and one not.)
89. Benefits of Triangulation
- Efficient (linear time) exact calculations
- Marginals
- Conditional distributions
- Just about anything else (via a canonical dynamic program)
- Explicit Hammersley-Clifford factorization
- Efficient inference: calculation of the maximum likelihood φ to fit observed data
90. Efficient Calculation
- A triangulated graph has an elimination ordering
- An order of deleting vertices such that when a vertex is deleted, its surviving neighbors form a clique
- Run backwards, this adds each vertex to a clique
- Canonical dynamic program
- Memoize values on each clique (e.g., a distribution)
- Gives the necessary information to add a new vertex
- Memo-table size is exponential in the clique sizes
- Happy if the max clique size is small (checking sketch below)
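A small sketch that checks an ordering is an elimination ordering (each deleted vertex's surviving neighbors form a clique) and reports the largest clique created, which is what bounds the memo-table size; the graph representation (dict of neighbor sets) is an assumption for illustration.

```python
def elimination_cliques(adj, order):
    """Return the largest clique size along the elimination order,
    or None if the order is not a valid elimination ordering."""
    pos = {v: i for i, v in enumerate(order)}
    max_clique = 0
    for v in order:
        later = [w for w in adj[v] if pos[w] > pos[v]]   # surviving neighbors
        for i in range(len(later)):                      # they must form a clique
            for j in range(i + 1, len(later)):
                if later[j] not in adj[later[i]]:
                    return None
        max_clique = max(max_clique, len(later) + 1)
    return max_clique

# 4-cycle plus a chord (1,3): triangulated, treewidth 2.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(elimination_cliques(adj, [2, 4, 1, 3]))            # prints 3 = treewidth + 1
```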
91. HC Factorization of Triangulated Graphs
(Equation on slide: an explicit formula for the potentials φ of a triangulated graph.)
92. ML Inference on Triangulated Graphs
- Fixed network given
- Choose parameters (potentials φ) to maximize the likelihood of the observations
- In a triangulated network, do so by making the marginals correct on the cliques
- I.e., want the derived P(x_h) equal to the empirical distribution P̂(x_h) for each clique h
- Achieve by plugging P̂ into the explicit formula for φ
93. Inference on Triangulated Graphs
(Equation on slide: the explicit φ with the empirical marginals plugged in.)
94. Triangulation
- A non-triangulated Markov network can be triangulated by adding edges
- Find large minimal cycles, add chords
- Adding edges removes independence constraints, so broadens the class of models
- So, only increases the fit to data (maximum likelihood)
- And makes computations tractable
95. Treewidth
- Could just add all edges (the complete graph is triangulated)
- Drawbacks
- Dynamic programs are exponential in clique size
- Number of model parameters exponential in clique size leads to over-fitting
- Treewidth of a graph: the minimum, over all triangulations, of the maximum clique size of the triangulation, minus one
96. Markov Net Learning
- Given data, wish to find the MRN of treewidth at most k that maximizes the likelihood of the observed data
- Equivalently, since triangulation can only increase likelihood, wish to find the maximum likelihood triangulated graph with clique size at most k+1
- We call such a graph a k-hyperforest
- If maximal, a k-hypertree
97. Computing (Log-)Likelihood
98. Additive Weights
- For a triangulated G, HC gives an explicit formulation for the maximum likelihood φ
- Key: each φ_h is independent of the graph choice!
- Set: a weight w_h for each candidate clique h (formula on slide)
- Then: the maximum likelihood value of G is the sum of w_h over the cliques h of G
99. New formulation
- The max-likelihood value of G is just the sum of the weights of the cliques it contains
- So, given weights on cliques of size up to k+1, we want to find a hypertree (triangulated graph of treewidth at most k) containing maximum weight of cliques
- This is the maximum hypertree problem
100. Chow-Liu (1968): k = 1
- Treewidth 1 is just a tree
- Edges are cliques of size 2
- The weight w_h on an edge turns out to be the mutual information between the variables at its endpoints
- The maximum likelihood tree is the maximum spanning tree under these weights
- Polynomial-time solvable (sketch below)
101. Larger Treewidths
- Theorem: for k > 1, the maximum hypertree problem is NP-complete
- Reduction from SAT
- So conclude ML MRN learning is NP-complete
- So, seek an approximation algorithm
- Given that the optimum has value w, find (in polynomial time) some solution with value at least w/α
- We give an algorithm with α = 8^k k! (k+1)!
- Constant for any fixed k
102. Idea: Locally Testable Structure
- Treewidth k is a global constraint, hard to aim for
- Define the windmill, an object with a local characterization
- Every hypertree contains a windmill farm with at least 1/(k+1)! of its weight
- Algorithm to find a windmill farm of approximately (factor 1/(8^k k!)) maximum weight
103. Star graphs
Covering 11 of the 15 edges of a tree with disjoint stars
104. Partitioning a tree into two sets of disjoint stars
Conclusion: some set of disjoint stars contains at least 1/2 the edge weight of any given tree
105. Windmills
- A k-windmill is defined by a depth-k rooted tree
- Its hyperedges are all the paths from the root
- It has treewidth k
- 1-windmill is star
106. Windmill Theorem
- Windmill farm: a collection of disjoint windmills
- Theorem: any weighted k-hypertree contains a k-windmill-farm with at least a 1/(k+1)! fraction of the weight
- k-color the hypertree so no edge has repeated colors
- All edges that get their colors in the same order form a windmill farm
- Only k! orders
107. Idea: Randomized Rounding
- Already saw the idea of ILP relaxations
- Define an integer linear program solving the problem (not convex)
- Ignore the integrality constraints; solve the LP (convex)
- Now, want an integral solution
- Round the fractional solution to an integral one by setting a variable of value 0 < x < 1 equal to 1 with probability x and 0 otherwise
- Gives an integer solution where all constraints are satisfied in expectation
108. Special Case: Layered 2-windmills
- Variable y_uv set to 1 if u is a root and v is a child of u
- Variable z_uvx set to 1 if x is a child of v, which is a child of u
- Objective function: maximize Σ w_uvx z_uvx
- Consistency constraint: z_uvx ≤ y_uv
- Single-parent constraint: Σ_u y_uv = 1
109. Rounding
- Consider the constraint Σ_u y_uv = 1
- Means the y_uv form a probability distribution on choices of parent for v
- Choose parents according to this distribution (rounding sketch below)
- Objective function was Σ w_uvx z_uvx
- We can keep w_uvx if y_uv was set to 1
- So the expected kept value is Σ w_uvx y_uv ≥ Σ w_uvx z_uvx
- I.e., the rounded solution matches the LP objective value!
- (White lie)
110. Open problems
- We have a constant-factor approximation for constant k, but it's a pretty bad constant!
- Two separate reasons for badness
- The windmill theorem may be very loose
- (But we have examples of gap exceeding k)
- Better windmill approximations?
- Multilevel facility location?
- Use the idea of a restricted, locally testable class of MRNs to find other practical and tractable subsets
- Hardness of approximation?
111. Conclusion
112. Near Neighbor Structure
- Concepts
- Random sampling
- Memoization
- Greedy improvement
- Local search
- Impact
- Simple
- Good constants
- Good experimental performance
- So likely useful in practice
113. One-shot MDPs
- Concepts
- Approximation Algorithms
- Reductions to related problems
- Dynamic Programming
- Impact
- Solved an open theory problem
- Unlikely to be a good practical performer
- But suggests heuristics
- And poses good next questions
114. Turbo Decoding
- Concepts
- LP relaxations
- Flows
- Graph theory
- Impact
- Probably don't want to use LP in DSP chips
- Strengthens theoretical grounds/explanations for the good performance of certain codes
- Provides insight for improving codes and decoding algorithms
115. Hypertrees
- Concepts
- Randomized rounding of LP relaxations
- Impact
- Unlikely to be useful in practice
- Possible application to branch-and-bound optimization methods
- Suggests some useful special classes of low-treewidth graphs (windmills) that may be more tractable to construct
116. Working with Algorithms
- Many NIPS problems would be trivial given infinite computational resources
- Good algorithms can simulate such resources
- Large collection of available tools
- Large collection of people who know them
- To apply algorithmic methods, want a well-defined algorithmic problem
- Avoid defining the problem via an algorithmic solution
- Dialogue can help define the true problem
- Even an oversimplified solution may give insight
117. Acknowledgements
- Near Neighbor Searching
- With Matthias Ruhl
- MDPs
- With Avrim Blum, Shuchi Chawla, Terran Lane, Adam Meyerson, Maria Minkoff
- Turbo Decoding
- With Jon Feldman, Martin Wainwright
- Learning Markov Models
- With Nati Srebro
118. Conclusion
- Lots of interesting NIPS problems!
- Techniques from theoretical computer science can be applied
- Toolbox of prior approximation algorithms
- Combinatorial structure of problems
- Wanted: more problems
- Value in both definition and solution
- http://theory.lcs.mit.edu/karger/Talks/NIPS.ppt