Title: Algorithmic Tools Applied to Some Machine Learning and Inference Problems
1. Algorithmic Tools Applied to Some Machine Learning and Inference Problems
2. Conclusion
- Algorithms research has produced a huge set of tools that can be applied to solving computational problems efficiently
- Any time one can precisely define a computational problem, it's worth looking in the algorithms toolbox to see if it can be solved efficiently
- So, define some problems and talk to your local algorithmicist!
3. Outline
- Demonstrate, by case studies, applying algorithmic tools to four distinct learning/inference problems
- Near neighbor searching
- MDPs with nonrecurring rewards
- Decoding turbo codes
- Learning low-treewidth graphical models
- Each is a well-defined computational problem
4. Approach
- 4 separate talks?
- Commonality not in tools (too many to enumerate)
- Rather, in reductionist / analogical approach
- Common heuristics in search for right tools
- By analogy to similar problems
- By properties of problem, e.g.
- Self-reducibility: recursion, sampling
- Use of paths, trees, cliques, etc.
- Natural integer programming formulations
- By judicious simplification
5. Tools we'll use today
- Random sampling
- Self-reducibility
- Amortization
- Dynamic programming
- Combinatorial graph theory
- Constant-factor approximation algorithms for NP-hard problems
- Max flow
- Approximation-preserving reductions
- Locally checkable structures
- Linear programming relaxations
- Randomized rounding of relaxed solutions
6. Lost in Translation
- Algorithms designed for clean problems
- Often require simplifying assumptions about real-world problems
- Measures of performance sometimes artificial
- Constant approximation factor nice in theory, but which constant matters in practice
- So, algorithm may or may not be useful
- But even if not, can provide insight into practical ways of attacking the problem
7. A Data Structure for Nearest Neighbors on Manifolds
- Random Sampling
- Greedy algorithms
- Local Search
- Memoization
8. Motivation
- Common theme in machine learning: find a high-fidelity embedding of high-dimensional data into a lower-dimensional representation
- Isomap [TdSL] and locally linear embedding [RS] postulate data on a low-dimensional manifold in high-dimensional space
- To reconstruct the manifold, both find each node's near neighbors by brute force
- Quadratic time in number of points
- Improve with fast near-neighbor queries?
9. The Original Motivation
- Peer-to-peer systems
- Nodes need to find nearby nodes for fast communication/data access
- Internet is (kind of) low dimensional
- This near neighbor data structure works well
10. Near Neighbor Search
- Many optimal data structures in low dimensions (computational geometry)
- But the high-dimensional problem is hard
- Subject of much recent attention in the algorithms community
- Kleinberg, "Two algorithms for nearest-neighbor search in high dimensions," STOC 1997
- [DISV et al.] workshop on near neighbor search, NIPS 2003
- But the focus is on approximate near neighbors
11. First Try
- Manifold is low dimensional
- Use low-dimensional NN solutions?
- E.g. Voronoi diagram
- Problem:
- Voronoi diagram relies on geometry of space
- But we don't know the geometry (coordinate system) until after we construct the manifold
- So to reconstruct, we are limited to using distances, not geometry
12. Some Generic Ideas
- Greedy approach
- Take query q and closest point p found so far
- Arrange to find something closer to q than p
- Repeat till done
- Local Search
- Want point closer to q
- Must be pretty close to p
- So examine points near p till find one close to q
- Repeat with new point
(Figure: query q at distance d from current point p, with a ball of radius d/2 around q.)
13. Some More Ideas
- Random Sampling
- For fast search, look only at a few candidates
- Good candidates depend on query point
- For any choice, adversary can pick bad q
- So choose random set---probably good for all
- Memoization
- Picking random points near p is a NN search!
- Pick in advance, as part of building the data structure
- E.g., for every possible radius (!), store (a few) random candidates in that ball around p
14. Random Sampling
- When will it work?
- Evenly distributed low-dimensional point sets
- E.g. a grid of points
- E.g., random sample from low-dim. manifold
- On a grid, if p and q are at distance d, then Ball(q, d/2) contains 1/16 of the points in Ball(p, 2d)
- So, a random point has a 1/16 chance of improving
- And trying 16 random points raises the odds to roughly 60-40 (quick check below)
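A quick check of that claim: if each sample independently lands in Ball(q, d/2) with probability 1/16, the chance that at least one of 16 samples improves is
\[ 1-\left(\tfrac{15}{16}\right)^{16} \approx 0.64, \]
i.e. roughly 60-40 odds of making progress in each step.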
15. A Big Fast Data Structure?
- At each point p, store 16 random points from balls of every possible radius around p
- Resolve query by iterating improvements
- Look in the right-size ball around the current point
- If lucky, find a point closer to q; can look in a smaller ball next time
- If unlucky, end up with a bigger ball
- But luck is more likely, so we drift in the right direction
- Expected ball size shrinks by a constant factor
- So, O(log n) steps get to a tiny ball, i.e. at the answer
16. Time to Get Real
- Storing samples for all radii expensive
- OK to be sloppy, and store only for powers of 2
- Means always have ball of roughly correct size
- But only log n balls needed per point
- So O(n log n) space overall
- Worry about dependence
- What if search cycles back to point already seen?
- random samples are no longer random
- And how find random samples in first place?
17. Metric Skip Lists
- Consider randomly ordering the list of points
- For each power-of-two d, point p records the next 16 points in the list within distance d
- These points are a random sample from that ball
- Search inspects points forward in the list (query sketch below)
- From the current p, check the 16 followers at the current d
- Search always moves forward
- So always considering new points
- So no cycles, and samples independent in every step (mostly)
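A minimal Python sketch of the query walk just described, assuming Euclidean points, a brute-force build step (the real construction is the randomized incremental one on the next slide), and 16 samples per power-of-two radius; it illustrates the forward greedy improvement, not the paper's exact data structure.

```python
import math
import random

C = 16  # candidates stored per radius, as in the talk

def build(points, max_radius):
    """Randomly order the points; for each point and each power-of-two radius,
    record up to C later points in the order within that radius (brute force,
    for illustration only)."""
    order = points[:]
    random.shuffle(order)
    levels = int(math.log2(max_radius)) + 1
    samples = []  # samples[i][j] = successors of order[i] within distance 2**j
    for i, p in enumerate(order):
        samples.append([
            [x for x in order[i + 1:] if math.dist(p, x) <= 2 ** j][:C]
            for j in range(levels + 1)
        ])
    return order, samples

def query(order, samples, q):
    """Greedy local search: look at stored samples in a ball of roughly the
    current distance to q, and move forward to any strictly closer point."""
    cur, best = 0, order[0]
    improved = True
    while improved:
        improved = False
        j = max(0, math.ceil(math.log2(max(math.dist(best, q), 1e-9))))
        j = min(j + 1, len(samples[cur]) - 1)   # ball of roughly the right size
        for cand in samples[cur][j]:
            if math.dist(cand, q) < math.dist(best, q):
                cur, best, improved = order.index(cand), cand, True
                break
    return best

pts = [(x, y) for x in range(10) for y in range(10)]   # evenly spread grid
order, samples = build(pts, max_radius=16)
print(query(order, samples, (3.2, 7.7)))
```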
18. Construction
- Randomized incremental construction
- Common idea from computational geometry
- Adding points to a data structure in random order is often easy and yields a good data structure
- Build the list by prepending points in random order
- To set new pointers, must find the next 16 points within (each) distance d of the new point
- But this is like a NN query (given the way we search)
- Find the answer by a slight variation on NN search
- The list constructed so far makes this easy
19. Summary
- In any low-dimensional space with evenly distributed points, metric skip lists solve nearest neighbor queries
- O(n log n) space
- O(n log n) time to build the data structure
- O(log n) time to query
- Constants depend on the dimensionality of the space
- Application: by building this structure, can find NNs for Isomap/LLE in O(n log n) time
20. Experiments
- Data structure is simple, with no hidden theoretical aspects
- Implementation for P2P search works well
- Experiments on some simple manifolds [JT] work well too
21. Improvements [KL04]
- There are better ways to pick improvement candidates for local search
- Assouad's doubling dimension: the maximum number of diameter-d/2 balls needed to cover any diameter-d ball
- Pick one candidate from each ball in the cover
- Get the same bounds as above, but
- Applies to unevenly distributed points
- Deterministic
22. Markov Decision Processes with Nonrecurring Rewards
- Approximation algorithms
- Reductions to related problems
- Dynamic programming
23. A Robot Navigation Problem
- Robot to deliver packages
- Goal to deliver as quickly as possible
- Sounds like traveling salesman problem?
- Mismatches
- Robot may not go where it plans to (sensor error, motor control error, battery failure, ...)
- Some packages matter more
24. Formulate as a Markov Decision Process
- Graph with rewards r_v on states (vertices) v, travel times (lengths) on edges
- From each node, a choice of actions (each a probability distribution on the next vertex)
- Choosing a sequence of actions produces a random path through the graph
- If we arrive at vertex v at time t, we receive discounted reward γ^t r_v, where γ < 1
- Motivates getting there quickly
- Goal: maximize total discounted reward
25. MDP Discounting
- Reward received each time a vertex is visited
- So the plain value of an infinite path can be infinite
- Discounting means total reward is bounded by a geometric series, so bounded
- Alternative: consider average reward per unit time
- Other reasons for discounting
- Inflation (money in the future is worth less than money now)
- Uncertainty (what if something happens before I collect the future prize?)
- Mathematical elegance
26. Solution
- Fixing an action at each state produces a Markov chain with transition probabilities p_vw
- Can compute the expected discounted reward R_v if we start at state v
- R_v = r_v + Σ_w p_vw γ^{t(v,w)} R_w (sketch below)
- Choosing actions to optimize this recurrence is polynomial-time solvable
- Linear programming
- Dynamic programming (like shortest paths)
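A minimal value-iteration sketch of this recurrence, assuming a small explicit MDP with travel times t(v,w) ≥ 1 and γ < 1. It is not the LP or shortest-path method named above, just the simplest way to see the recurrence converge; the data layout (dicts for actions, travel times, rewards) is assumed for illustration.

```python
# Value iteration for R_v = r_v + max over actions of sum_w p(v,w) * gamma**t(v,w) * R_w
def value_iteration(states, actions, t, r, gamma, iters=1000):
    R = {v: 0.0 for v in states}
    for _ in range(iters):
        R = {
            v: r[v] + max(
                sum(p * gamma ** t[(v, w)] * R[w] for w, p in dist.items())
                for dist in actions[v]          # each action is a distribution on next vertices
            )
            for v in states
        }
    return R

# Tiny example: two states, one deterministic action each, unit travel times.
states = ["a", "b"]
actions = {"a": [{"b": 1.0}], "b": [{"a": 1.0}]}
t = {("a", "b"): 1, ("b", "a"): 1}
r = {"a": 1.0, "b": 0.0}
print(value_iteration(states, actions, t, r, gamma=0.9))
```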
27. Solving the wrong problem
- A package can only be delivered once
- So, incorrect to get the reward each time we reach the target
- One solution: expand the state space
- A vertex represents where I am and where I have been before (what packages already delivered)
- Reward nonzero only on states where the current location is not in the list of previously visited locations
- Now apply the MDP algorithm
- Problem: the expanded problem size is exponential in the original input
28. This is one Instance of a General Problem [KL]
- Often, an MDP has a state space with a nice small implicit description but a huge explicit description
- How do we accomplish MDP optimization on such instances?
29. Tackle an easier problem
- Problem has two novel elements for TOC
- Discounting of reward based on arrival time
- Probability distribution on outcome of actions
- We will ignore second issue
- In practice, robot can control errors
- Even the first issue by itself is hard and interesting
- First step towards solving the whole problem
- Frantic Salesman Problem
- Given rewards, travel times, and discount factor,
find a path maximizing total discounted reward
30. Approximation Algorithms
- FSP is NP-complete (thus, so is the more general MDP-type problem)
- Reduction from minimum latency TSP
- So intractable to solve exactly
- Goal: an approximation algorithm that is guaranteed to achieve at least some constant (< 1) multiple of the best possible discounted reward
31. TOC Toolbox
- Goal seems to be to find a short path that visits lots of reward
- Relates to the previously studied k-TSP problem
- Given a root vertex v, find a path of minimum total length that starts at v and visits vertices with (undiscounted) prize at least k
- Constant-factor approximation algorithm known for undirected graphs (so we assume this too)
- I.e., can find a path of at most a constant (> 1) multiple of the minimum possible total edge length
32. Mismatch
- A constant-factor approximation doesn't exponentiate well
- Suppose the optimum solution reaches some reward r at time t, for discounted reward γ^t r
- A constant-factor approximation would reach it within time 2t, for reward γ^{2t} r
- Result: get only a γ^t fraction of the optimum discounted reward, not a constant fraction
33. Idea: Change Objective Function
- Modify k-TSP to approximate the prize collected instead of the length: the orienteering problem
- Assume a tour of length l collecting prize p
- Find a tour of length l collecting prize p/2
- Avoids changing length, so exponentiation doesn't hurt
- Drawback: no constant-factor approximation previously known
- Flipping objective/feasibility transforms the problem
- (Our techniques end up resolving this too)
34. Idea: Upper Bounds
- General tool for approximation algorithms
- Show we are close to something no solution can beat
- Let d_v denote the shortest-path distance to v
- Define the prize at v as p_v = γ^{d_v} r_v
- Max discounted reward possibly collectable at v
- So max conceivable reward is Σ_v p_v
- Potential greedy algorithm: take the shortest path to one max-prize vertex
- Gets at least 1/n of optimum
35. Compare to Upper Bound
- If a given path reaches v at time t_v, define the excess at v as e_v = t_v − d_v
- Difference between the shortest path and the chosen one
- Then the discounted reward at v is γ^{e_v} p_v
- Idea: a good solution need not bother visiting nodes at large excess
- If the excess is large and the node is still worth visiting, its prize must be huge
- So forget the current path; just go straight to the huge prize without discounting
36. Formalize
- Fact: excess only increases as we traverse the path
- Excess reflects lost time; we can't make it up
- Without loss of generality, assume γ = 1/4
- Just scale edge lengths
- Claim: at least 1/2 of the optimum path's discounted reward R is collected before the path's excess rises to 1/2
- Let w be the first vertex with e_w > 1/2
- Suppose more than R/2 reward follows w
- Show contradiction
37. Path Improvement
- e_w > 1/2 but more than R/2 reward follows w
- Shortcut directly to w, then traverse the optimum path
- Reduces all excesses after w by at least 1/2
- So improves discount factors by (1/γ)^{1/2} = 2
- So doubles the discounted reward collected
- But this was more than R/2: contradiction (calculation after the figure)
(Figure: the optimum path with excess rising from 0 to 1/2 at w and on to 1; more than R/2 of the reward lies beyond w.)
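The doubling step in numbers, with γ = 1/4: shortcutting saves at least e_w ≥ 1/2 units of time at every vertex after w, so each discount factor improves by at least
\[ \gamma^{-e_w} \;\ge\; \gamma^{-1/2} \;=\; \left(\tfrac14\right)^{-1/2} \;=\; 2, \]
so the reward collected after w more than doubles, exceeding 2 · (R/2) = R and contradicting the optimality of R.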
38. Discount Discounting
- We showed large excess can be ignored
- But if excess is small, discounting by excess can be ignored!
- (discounted) reward ≈ (undiscounted) prize
- So, just find a path with small excess maximizing the amount of (undiscounted) prize
- Gives a path with (discounted) reward ≈ prize
- Of course, min-excess = min-distance
- But we may be better off approximating excess
39. Improvement on k-TSP: Approximate Excess
- Recall the discounted reward at v is γ^{e_v} p_v
- Prefix of the optimum discounted-reward path
- Has discounted reward Σ γ^{e_v} p_v > R/2
- So has prize Σ p_v > R/2
- And has no vertex with large excess
- Find a path of approximately (3 times) minimum excess and prize R/2
- (We can guess R/2)
- Excesses at most 3/2, so γ^{e_v} p_v ≥ p_v / 8
- So the discounted reward on the found path is > R/8
40. The Downside
- Min-excess problem more useful
- But harder to solve
- Approximating min-distance does not approximate
min-excess
(Figure: optimum has length 1+ε and excess ε; a length-2 path has excess ≈ 1, so approximating distance does not approximate excess.)
41. Exactly Solvable Case: monotonic paths
- Suppose the optimum goes through vertices in strictly increasing order of distance from the root
- Then we can find the optimum by a dynamic program
- Just as we can solve longest path in an acyclic graph
- Build a table: is there a monotonic path from v with length l and prize p?
- To answer, look for a u after v with a path of length l − d_vu and prize p − p_v
- Works because a monotonic path won't go back through v
42. Dynamic Program
(Figure: worked example of the table on a small graph; each vertex is labeled with (length, prize) pairs for the monotonic paths reaching it. A code sketch of this table-filling follows.)
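A minimal sketch of the monotonic-path dynamic program, assuming a complete metric graph given by point coordinates, integer prizes, and the root at index 0; the table best[v][p] holds the length of the shortest monotonic path from the root ending at v with prize exactly p.

```python
import math

def monotonic_dp(points, prize):
    """Fill best[v][p]: shortest root-to-v path visiting vertices in strictly
    increasing distance from the root and collecting prize exactly p."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    order = sorted(range(n), key=lambda v: d[0][v])   # increasing distance from root
    P = sum(prize)
    INF = float("inf")
    best = [[INF] * (P + 1) for _ in range(n)]
    best[0][prize[0]] = 0.0
    for i, v in enumerate(order):
        for j in range(i + 1, n):
            u = order[j]                               # u is farther from the root than v
            for p in range(P - prize[u] + 1):
                if best[v][p] < INF:
                    cand = best[v][p] + d[v][u]
                    if cand < best[u][p + prize[u]]:
                        best[u][p + prize[u]] = cand
    return best

points = [(0, 0), (1, 0), (2, 1), (3, 0)]
prize = [0, 1, 2, 1]
best = monotonic_dp(points, prize)
# Shortest monotonic path collecting total prize 3:
print(min(best[v][3] for v in range(len(points))))
```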
43. Approximable case: wiggly paths
- Length of the path to v is t_v = d_v + e_v
- If e_v > d_v then t_v > e_v > t_v / 2
- I.e., the path takes more than twice as long as necessary to reach its end
- So if we approximate t_v to a constant factor, we also approximate e_v to twice that constant factor
- But finding an approximately optimum t_v is the k-TSP problem
- Constant-factor approximation known
44Decompose into easy cases
monotone
monotone
monotone
wiggly
wiggly
Divides into independent problems
gt 2/3 of each wiggly path is excess
45. Decomposition Analysis
- 2/3 of each wiggly path is excess
- That excess accumulates into the whole path
- So, the total excess of the wiggly paths is upper bounded by the excess of the whole path
- Conclude the total length of the wiggly paths is upper bounded by 3/2 of the path excess
- Use k-TSP to find approximately shortest wiggles collecting the right amount of prize
- Approximates length, so approximates excess
- Over all wiggly parts, approximates total excess
46. Dynamic program
- For each pair of vertices and each (discretized) prize value, find
- The shortest monotonic path collecting the desired prize
- An approximately shortest wiggly path collecting the desired prize
- Note: polynomially many subproblems
- Use dynamic programming to find the optimum pasting-together of subproblems
47. Summary
- Showed the maximum discounted prize can be approximated via minimum-excess paths
- Showed how to approximate the min-excess path
- Also solves the orienteering problem
- Also solves the dual of k-TSP where length is fixed
- Also solves tree versions of all these problems, e.g. prize-approximate k-MST
48. Open Questions
- Directed graphs?
- We used k-TSP, only solved for undirected graphs
- For directed graphs, even standard TSP has no known constant-factor approximation
- We only use k-TSP/undirectedness in the wiggly parts
- Stochastic actions?
- Stochastic seems to imply directed
- Special case: forget rewards
- Given a choice of actions, choose actions to minimize the cover time of the graph
49. Decoding Turbo Codes via Linear Programming
- Linear Programming Relaxations
50. Basic Coding Problem
- Goal: transmit a message across a noisy channel that randomly perturbs parts of the message
- Binary symmetric channel: each transmitted bit flipped independently with probability p < 1/2
- Approach also works for the AWGN channel
- Method: introduce redundancy in messages to cope with perturbations
- Compute an encoding function mapping each information word u ∈ {0,1}^k to a codeword x ∈ {0,1}^n
- Receive and decode a randomly perturbed word y
51. Definitions
(Block diagram: information word u of length k → Encoder → codeword x of length n → noisy channel with bit error rate p → corrupt codeword y → Decoder → decoded codeword x̂ and decoded information word û.)
- Word error rate: Pr[û ≠ u]
52. Decoding
- The received perturbed codeword usually is not a codeword
- Need a rule for picking the best decoding possibility
- Performance measure: probability of giving the wrong answer
- Known as the word error rate (WER)
- Won't be zero, since with nonzero probability the channel replaces a codeword with a different, valid codeword (at that point, it is only natural to answer with the wrong information word)
53. Maximum Likelihood Decoding
- Specific decoding rule
- Choose the information word whose codeword maximizes the probability of the received word, Pr(y | x)
- Assuming binary symmetric channel errors, this is just the codeword at minimum Hamming distance from the received word
- More generally, a linear cost function on codewords
- Under general assumptions, this is the best way to decode
- Drawback: generally NP-complete
54. Turbo Codes
- A particular encoding approach
- Introduced in 1993 [Berrou, Glavieux, Thitimajshima]
- Simple linear-time encoding (state machine)
- Numerous fast heuristics for decoding
- E.g. belief propagation
- But may not converge
- Fantastic in practice, but unclear why
- Distance of the code is bad (multiple codewords with few differing bits, easily confused with each other)
- Some asymptotic analysis of random turbo codes
55. Decoding by Linear Programming
- Describe a polynomial-time linear-programming approach to decoding (arbitrary) turbo codes
- Generalizes to LDPC codes
- Certifies when it finds the ML codeword
- Precise analysis for RA(2) codes
- Repeat-accumulate codes are a special type of turbo code
- 2 means rate 1/2: two code bits per info bit
- Based on the analysis, an explanation of how to build a good RA(2) code
- Gives a code with an inverse-polynomial error bound
56. ML Decoding as Integer Linear Programming
- Given the corrupted word y, want to find the codeword x of minimum Hamming distance
- Note ||y − x||₁ = Σ_i (1 − 2y_i) x_i + constant (a linear function of x; derivation below)
- Code C ⊆ {0,1}^n
- Polytope P is the convex hull of C
- ML decoding wants the minimum-cost vertex of P
- Errors in the channel change y, perturbing the objective
- Good code: small objective perturbations leave the optimum vertex unchanged
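Spelling out the second bullet: for binary bits, |y_i − x_i| = y_i + (1 − 2y_i) x_i (check both values of y_i), so
\[ \|y-x\|_1 \;=\; \sum_i |y_i - x_i| \;=\; \sum_i y_i \;+\; \sum_i (1-2y_i)\,x_i , \]
which is a constant (depending only on y) plus a linear function of the codeword x.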
57. Linear Programming Relaxation
- Optimizing over P is intractable
- Generally no tractable way to work with the constraints (facets) defining P
- Find a tractable polytope similar to P; optimize over that instead
- The new polytope will generally have additional, non-integral vertices
(Figure: the polytope P relaxed to a larger polytope.)
58. Properties of a Good Relaxation
- Relaxed LP should be tractable
- Preferably few constraints
- Combinatorial structure to aid solution
- Relaxed polytope Q should contain the original P
- Ensures that the true optimum is feasible in the relaxation
- So the Q optimum is a lower bound on the P optimum
- Relaxation should be tight
- Vertices of Q should be close to P
- Increases hope that the optimum in Q will be valid in P
- Makes it easier to round a Q solution into P
59. Repeat-Accumulate Codes
- A particular type of turbo code
- Accumulator: accumulates the sum of bits mod 2
- RA(m) code
- Make m copies of the info word (total km bits)
- Apply a given fixed permutation to the bits
- Feed the resulting sequence through the accumulator
- Output the sequence of accumulator values (km outputs)
- We focus on the RA(2) code (encoder sketch below)
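A minimal RA(2) encoder sketch following these bullets; the permutation used here is an arbitrary illustrative choice, not the recommended one from the analysis later in the talk.

```python
def ra2_encode(info_bits, perm):
    """Repeat the info bits twice, permute, then accumulate (running XOR)."""
    repeated = info_bits + info_bits          # two copies: 2k bits
    permuted = [repeated[i] for i in perm]    # fixed permutation of the 2k bits
    acc, codeword = 0, []
    for b in permuted:                        # accumulator: running sum mod 2
        acc ^= b
        codeword.append(acc)
    return codeword

u = [1, 0, 1]
perm = [0, 3, 1, 4, 2, 5]                     # interleave the two copies
print(ra2_encode(u, perm))                    # 6 code bits for 3 info bits (rate 1/2)
```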
60. Trellis
(Figure: trellis with accumulator states 0/1 at each of six positions x1..x6; each info bit u1, u2, u3 labels two transitions.)
- Circles represent state of accumulator
- Each info bit appears at two transitions
- Sequence of states is codeword
61. Encoding with Trellis
(Figure: the same trellis with one path highlighted; the accumulator states along the path spell out the codeword for the info bits u1, u2, u3.)
62. Decoding with Trellis
(Figure: the trellis again, with the received code bits attached to the transitions.)
- At each step, the codeword says which state to transition to
- Read off the u_i label on the transition arc
63. Handling Errors
- Not every path through the trellis is a codeword
- Need both occurrences of each info bit to agree (cause the same transition)
- Call such a path agreeable
- Want the path with the fewest transitions not matching the received codeword
- Give cost 0 to transitions matching the received word, cost 1 to transitions not matching
- The shortest agreeable path is the ML codeword
64. Relaxation
- Shortest agreeable path is NP-complete
- Relax to min-cost agreeable flow
- A flow is a convex combination of paths (exponential number of variables, but tractable)
- Alternatively, a poly-size LP based on balance of incoming and outgoing flow (LP sketch below)
- Agreeability is a constraint that certain groups of transition edges carry the same amount of flow
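A sketch of the poly-size LP, with assumed notation (not the talk's): f_e is the flow on trellis edge e, c_e its 0/1 cost, δ⁻(v) and δ⁺(v) the edges entering and leaving trellis state v, s the start state, and A_i, B_i the two groups of transitions at which info bit i takes the value 1 (one group per copy of the bit).
\[
\begin{aligned}
\min\ & \sum_e c_e f_e \\
\text{s.t.}\ & \sum_{e\in\delta^-(v)} f_e = \sum_{e\in\delta^+(v)} f_e && \text{for every internal state } v,\\
& \sum_{e\in\delta^+(s)} f_e = 1, \\
& \sum_{e\in A_i} f_e = \sum_{e\in B_i} f_e && \text{for every info bit } i \ \text{(agreeability)},\\
& 0 \le f_e \le 1 .
\end{aligned}
\]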
65. Properties of Relaxation
- Any agreeable path is an agreeable flow
- Conclude the correct answer is feasible in the relaxation
- Conclude: if the min-cost agreeable flow is a path, then it is the shortest agreeable path
- Certificate property: the min-cost agreeable flow decoder will know when it has found the ML codeword
- Relaxation is tractable to solve
- Can directly solve via LP
- But there actually exists a reduction to standard min-cost flow, so we can use specialized fast algorithms
66. Performance
- When is the correct codeword (path) the optimum for MCAF?
- Intuition from residual graphs in flow
- Build a new graph by subtracting from each edge's capacity the amount of flow currently on that edge
- A minimum-cost flow is optimal if and only if there is no negative-cost cycle in the residual graph (i.e., of positive residual capacity)
- Otherwise, can push flow around the cycle and reduce the cost of the flow
- Says a local optimum is a global optimum
67. Generalize to MCAF
- The true codeword is optimal if there is no negative-cost agreeable cycle in the residual graph
- What does an agreeable cycle look like?
- Must diverge from the correct codeword path at some point (traverses the other label)
- Then traverses the same labels as the correct path for a while
- Then may return to the codeword path (again by traversing the opposite label)
- Each time it traverses the opposite label at some bit, agreeability requires it also traverse the opposite label at the other copy of that bit
68. Trellis
(Figure: the trellis again, highlighting the two transitions that carry each info bit.)
- Suppose all 0s sent
- If cycle uses 1-edge at second layer
- must also use 1-edge at 4th layer
69. An Easier Representation: the Tanner Graph
- Draw a path along the codeword
- Edge cost −1 if the received bit was flipped
- +1 otherwise
- Add a matching edge between vertices corresponding to the two copies of the same info bit
- All matching edge costs are 0
- Cycles in the residual graph correspond to cycles in this graph G, and have the same cost
- Hamiltonian edge: use the same transition as the codeword
- Matching edge: use the opposite transition
70. Picture
(Figure: Tanner graph for the example codeword: a cycle through x1..x6 with edge costs ±1, plus zero-cost matching edges joining the two copies of each of u1, u2, u3.)
71. Connection
- Circle paths = simple cycles in the trellis
- Matching edges enforce agreement
(Figure: the Tanner graph shown side by side with the trellis, illustrating how a cycle in the Tanner graph corresponds to an agreeable cycle in the trellis residual graph.)
72. Main Theorem
- The true codeword is the min-cost agreeable flow
solution iff there is no negative cost cycle in
the Tanner graph G
73. Suggests Good Codes
- Recall edge cost is −1 if the bit was flipped, which occurs with probability p < 1/2
- Intuition: if a cycle is large, it is unlikely that more than half its edges are flipped
- So, a good idea to build a graph in which all cycles are large
- Achieve by building a graph with large girth (length of the shortest cycle)
- Erdős gives a (path + matching) graph with girth Ω(log n) (and this is best possible)
74. Analysis
- Theorem: using the RA(2) code built from the Erdős graph, if p is below a constant threshold (depending on ε), then WER = Pr[negative cycle] < n^{−ε}
- Proof:
- A negative cycle has length at least log n
- Break it into subpaths of length log n
- One must be negative
- What is Pr[a negative path of length log n]?
- Probability at most n^{−(2+ε)}
- Degree-3 graph, so only n² paths of length log n
- Add up (union bound below)
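Putting the last three bullets together, a union bound over the at most n² candidate subpaths gives
\[ \Pr[\text{negative cycle}] \;\le\; n^{2}\cdot n^{-(2+\varepsilon)} \;=\; n^{-\varepsilon}. \]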
75. Experiments
76. Summary
- Combinatorial algorithm for decoding RA(2) codes
- Analysis of its error probability
- Recommended code based on analysis
- That code has polynomially small WER
- Experiments show good performance in practice
- But slow, since solving LP
- This approach gives insight, not algorithms
77. Extensions
- Discovered that Wainwright's tree-reweighted max-product (TRMP) is solving the dual of our LP
- Thus, TRMP has the same performance
- Gives a belief-propagation flavored decoder with the same performance as the LP decoder
- Techniques extend to arbitrary LDPC codes
- Relaxation by intersecting the parity polytopes for each parity check
- LP decoder (but the flow decoder doesn't extend)
78. Breaking Result [FMSSW]
- Analysis of the LP decoder for LDPC codes
- Proof that when the code is based on an expander graph, the LP decoder handles a constant fraction of corrupted bits
- Gives a proof of exponentially small error rate for a polynomial-time decoder of LDPC codes
79. Learning Markov Random Fields
80. Overview
- Fundamentals of Markov networks
- Maximum likelihood Markov network structure as a maximum hypertree problem
- Treewidth and hypertrees
- An approximation algorithm for maximum hypertrees
- Reducing maximum hypertrees to/from maximum likelihood Markov networks
81. Density Estimation
- T observations x^1, ..., x^T
- Each x^t = (x^t_1, ..., x^t_n) is a vector of n values
- Estimate the joint probability distribution P(X_1, ..., X_n) from which the samples were taken
- (Assume observations are i.i.d.)
82. Maximum likelihood approach
- Postulate best fit is the maximum likelihood distribution
- The distribution that maximizes the likelihood of the data
- To avoid over-fitting, limit the choice to within some (parametric) class
- Equivalent to projecting onto our class to find the nearest neighbor of the empirical distribution
- Our approach applies to this general distribution projection problem
83. Markov Random Networks
- Representations of joint distributions with limited dependence
- Variables x_1, ..., x_n
- Graph with variables as vertices
- Edges represent dependencies
- Conditioned on (any specific values of) its neighbors, x_i is independent of all other variables
- More generally, the two sides of any separator are independent conditioned on the separator
84. Problems
- Markov net inference
- Given data and a specific Markov net, find the parameter settings that best fit the data
- Markov net learning
- Given data, find the Markov net structure that best fits the data (under best parameter settings)
- For us, "best fits" = maximum likelihood
85. Hammersley-Clifford Theorem
- Cliques in a Markov network are special: no restriction on their marginal distribution
- A distribution P is a Markov network over graph G iff P factorizes over the cliques of G (factorization written out below)
- Note: φ_h assigns a value to each possible setting x_h of the variables in clique h
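For concreteness, the factorization the theorem refers to can be written as (with Z a normalizing constant)
\[ P(x_1,\dots,x_n) \;=\; \frac{1}{Z}\prod_{\text{cliques } h \text{ of } G}\varphi_h(x_h). \]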
86. Value of HC
- HC gives a concise representation of any MRF probability distribution function
- Only need to specify the clique potentials
- If variables take s values, then the potential on a size-k clique is represented by s^k values
- If cliques are small (constant size) then the representation is small (linear size)
87. Limits of HC
- Cannot use the HC factorization to compute important quantities
- Normalization of φ (can't even tell if it is needed)
- Marginal probability distributions (even of 1 variable)
- Conditional probability distributions (ditto)
- Maximum likelihood parameter settings (finding φ to fit the data)
88. Triangulated Markov Networks
- No minimal cycles of more than three vertices
(Figure: example graphs on X1..X6, one triangulated and one not.)
89. Benefits of Triangulation
- Efficient (linear time) exact calculations
- Marginals
- Conditional distributions
- Just about anything else (via a canonical dynamic program)
- Explicit Hammersley-Clifford factorization
- Efficient inference: calculation of the maximum likelihood φ to fit observed data
90. Efficient Calculation
- A triangulated graph has an elimination ordering
- An order of deleting vertices such that when a vertex is deleted, its surviving neighbors form a clique
- Run backwards, this adds each vertex to a clique
- Canonical dynamic program
- Memoize values on each clique (e.g., a distribution)
- Gives the necessary information to add a new vertex
- Memo-table size is exponential in the clique sizes
- Happy if the max clique size is small (checking sketch below)
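A small sketch that checks an ordering is an elimination ordering (each deleted vertex's surviving neighbors form a clique) and reports the largest clique created, which is what bounds the memo-table size; the graph representation (dict of neighbor sets) is an assumption for illustration.

```python
def elimination_cliques(adj, order):
    """Return the largest clique size along the elimination order,
    or None if the order is not a valid elimination ordering."""
    pos = {v: i for i, v in enumerate(order)}
    max_clique = 0
    for v in order:
        later = [w for w in adj[v] if pos[w] > pos[v]]   # surviving neighbors
        for i in range(len(later)):                      # they must form a clique
            for j in range(i + 1, len(later)):
                if later[j] not in adj[later[i]]:
                    return None
        max_clique = max(max_clique, len(later) + 1)
    return max_clique

# 4-cycle plus a chord (1,3): triangulated, treewidth 2.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(elimination_cliques(adj, [2, 4, 1, 3]))            # prints 3 = treewidth + 1
```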
91. HC Factorization of Triangulated Graphs
(Equation on slide: an explicit formula for the potentials φ of a triangulated graph.)
92. ML Inference on Triangulated Graphs
- Fixed network given
- Choose parameters (potentials φ) to maximize the likelihood of the observations
- In a triangulated network, do so by making the marginals correct on the cliques
- I.e., want the derived P(x_h) equal to the empirical distribution P̂(x_h) for each clique h
- Achieve by plugging P̂ into the explicit formula for φ
93. Inference on Triangulated Graphs
(Equation on slide: the explicit φ with the empirical marginals plugged in.)
94. Triangulation
- A non-triangulated Markov network can be triangulated by adding edges
- Find large minimal cycles, add chords
- Adding edges removes independence constraints, so broadens the class of models
- So, only increases the fit to data (maximum likelihood)
- And makes computations tractable
95. Treewidth
- Could just add all edges (the complete graph is triangulated)
- Drawbacks
- Dynamic programs are exponential in clique size
- Number of model parameters exponential in clique size leads to over-fitting
- Treewidth of a graph: the minimum, over all triangulations, of the maximum clique size of the triangulation, minus one
96. Markov Net Learning
- Given data, wish to find the MRN of treewidth at most k that maximizes the likelihood of the observed data
- Equivalently, since triangulation can only increase likelihood, wish to find the maximum likelihood triangulated graph with clique size at most k+1
- We call such a graph a k-hyperforest
- If maximal, a k-hypertree
97. Computing (Log-)Likelihood
98. Additive Weights
- For a triangulated G, HC gives an explicit formulation for the maximum likelihood φ
- Key: each φ_h is independent of the graph choice!
- Set: a weight w_h for each candidate clique h (formula on slide)
- Then: the maximum likelihood value of G is the sum of w_h over the cliques h of G
99. New formulation
- The max-likelihood value of G is just the sum of the weights of the cliques it contains
- So, given weights on cliques of size up to k+1, we want to find a hypertree (triangulated graph of treewidth at most k) containing maximum weight of cliques
- This is the maximum hypertree problem
100. Chow-Liu (1968): k = 1
- Treewidth 1 is just a tree
- Edges are cliques of size 2
- The weight w_h on an edge turns out to be the mutual information between the variables at its endpoints
- The maximum likelihood tree is the maximum spanning tree under these weights
- Polynomial-time solvable (sketch below)
101. Larger Treewidths
- Theorem: for k > 1, the maximum hypertree problem is NP-complete
- Reduction from SAT
- So conclude ML MRN learning is NP-complete
- So, seek an approximation algorithm
- Given that the optimum has value w, find (in polynomial time) some solution with value at least w/α
- We give an algorithm with α = 8^k k! (k+1)!
- Constant for any fixed k
102. Idea: Locally Testable Structure
- Treewidth k is a global constraint, hard to aim for
- Define the windmill, an object with a local characterization
- Every hypertree contains a windmill farm with at least 1/(k+1)! of its weight
- Algorithm to find a windmill farm of approximately (factor 1/(8^k k!)) maximum weight
103. Star graphs
Covering 11 of the 15 edges of a tree with disjoint stars
104. Partitioning a tree into two sets of disjoint stars
Conclusion: some set of disjoint stars contains at least 1/2 the edge weight of any given tree
105. Windmills
- A k-windmill is defined by a depth-k rooted tree
- Its hyperedges are all the paths from the root
- It has treewidth k
- 1-windmill is star
106. Windmill Theorem
- Windmill farm: a collection of disjoint windmills
- Theorem: any weighted k-hypertree contains a k-windmill-farm with at least a 1/(k+1)! fraction of the weight
- k-color the hypertree so no edge has repeated colors
- All edges that get their colors in the same order form a windmill farm
- Only k! orders
107. Idea: Randomized Rounding
- Already saw the idea of ILP relaxations
- Define an integer linear program solving the problem (not convex)
- Ignore the integrality constraints; solve the LP (convex)
- Now, want an integral solution
- Round the fractional solution to an integral one by setting a variable of value 0 < x < 1 equal to 1 with probability x and 0 otherwise
- Gives an integer solution where all constraints are satisfied in expectation
108. Special Case: Layered 2-windmills
- Variable y_uv set to 1 if u is a root and v is a child of u
- Variable z_uvx set to 1 if x is a child of v, which is a child of u
- Objective function: maximize Σ w_uvx z_uvx
- Consistency constraint: z_uvx ≤ y_uv
- Single-parent constraint: Σ_u y_uv = 1
109. Rounding
- Consider the constraint Σ_u y_uv = 1
- Means the y_uv form a probability distribution on choices of parent for v
- Choose parents according to this distribution (rounding sketch below)
- Objective function was Σ w_uvx z_uvx
- We can keep w_uvx if y_uv was set to 1
- So the expected kept value is Σ w_uvx y_uv ≥ Σ w_uvx z_uvx
- I.e., the rounded solution matches the LP objective value!
- (White lie)
110. Open problems
- We have a constant-factor approximation for constant k, but it's a pretty bad constant!
- Two separate reasons for badness
- The windmill theorem may be very loose
- (But we have examples of gap exceeding k)
- Better windmill approximations?
- Multilevel facility location?
- Use the idea of a restricted, locally testable class of MRNs to find other practical and tractable subsets
- Hardness of approximation?
111. Conclusion
112. Near Neighbor Structure
- Concepts
- Random sampling
- Memoization
- Greedy improvement
- Local search
- Impact
- Simple
- Good constants
- Good experimental performance
- So likely useful in practice
113. One-shot MDPs
- Concepts
- Approximation Algorithms
- Reductions to related problems
- Dynamic Programming
- Impact
- Solved an open theory problem
- Unlikely to be a good practical performer
- But suggests heuristics
- And poses good next questions
114. Turbo Decoding
- Concepts
- LP relaxations
- Flows
- Graph theory
- Impact
- Probably don't want to use LP in DSP chips
- Strengthens theoretical grounds/explanations for the good performance of certain codes
- Provides insight for improving codes and decoding algorithms
115. Hypertrees
- Concepts
- Randomized rounding of LP relaxations
- Impact
- Unlikely to be useful in practice
- Possible application to branch-and-bound optimization methods
- Suggests some useful special classes of low-treewidth graphs (windmills) that may be more tractable to construct
116. Working with Algorithms
- Many NIPS problems would be trivial given infinite computational resources
- Good algorithms can simulate such resources
- Large collection of available tools
- Large collection of people who know them
- To apply algorithmic methods, want a well-defined algorithmic problem
- Avoid defining the problem via an algorithmic solution
- Dialogue can help define the true problem
- Even an oversimplified solution may give insight
117. Acknowledgements
- Near Neighbor Searching
- With Matthias Ruhl
- MDPs
- With Avrim Blum, Shuchi Chawla, Terran Lane, Adam Meyerson, Maria Minkoff
- Turbo Decoding
- With Jon Feldman, Martin Wainwright
- Learning Markov Models
- With Nati Srebro
118. Conclusion
- Lots of interesting NIPS problems!
- Techniques from theoretical computer science can be applied
- Toolbox of prior approximation algorithms
- Combinatorial structure of problems
- Wanted: more problems
- Value in both definition and solution
- http://theory.lcs.mit.edu/karger/Talks/NIPS.ppt