Algorithmic Tools Applied to Some Machine Learning and Inference Problems

1
Algorithmic Tools Applied to Some Machine
Learning and Inference Problems
  • David Karger

2
Conclusion
  • Algorithms research has produced a huge set of
    tools that can be applied to efficiently solving
    computational problems
  • Any time one can precisely define a computational
    problem, it's worth looking in the algorithms
    toolbox to see if it can be solved efficiently
  • So, define some problems and talk to your local
    algorithmicist!

3
Outline
  • Demonstrate by case studies applying algorithmic
    tools to four distinct learning/inference
    problems
  • Near neighbor searching
  • MDPs with nonrecurring rewards
  • Decoding turbo codes
  • Learning low-treewidth graphical models
  • Each is a well defined computational problem

4
Approach
  • 4 separate talks?
  • Commonality not in tools (too many to enumerate)
  • Rather, in reductionist / analogical approach
  • Common heuristics in search for right tools
  • By analogy to similar problems
  • By properties of problem, e.g.
  • Self-reducibility ---recursion, sampling
  • Use of paths, trees, cliques, etc.
  • Natural integer programming formulations
  • By judicious simplification

5
Tools we'll use today
  • Random sampling
  • Self-reducibility
  • Amortization
  • Dynamic programming
  • Combinatorial graph theory
  • Constant-factor approximation algorithms for
    NP-hard problems
  • Max flow
  • Approximation-preserving reductions
  • Locally checkable structures
  • Linear programming relaxations
  • Randomized rounding of relaxed solutions

6
Lost in Translation
  • Algorithms designed for clean problems
  • Often require simplifying assumptions
    from real-world problems
  • Measures of performance sometimes artificial
  • Constant approximation factor nice in theory,
    but which constant matters in practice
  • So, algorithm may or may not be useful
  • But even if not, can provide insight into
    practical ways of attacking the problem

7
A Data Structure for Nearest Neighbors on
Manifolds
  • Random Sampling
  • Greedy algorithms
  • Local Search
  • Memoization

8
Motivation
  • Common theme in machine learning: find
    high-fidelity embedding of high-dimensional data
    into a lower-dimensional representation
  • Isomap [TdSL] and locally linear embedding [RS]
    postulate data on low-dimensional manifold in
    high dimensional space
  • To reconstruct manifold, both find each node's
    near neighbors by brute force
  • Quadratic time in number of points
  • Improve with fast near-neighbor queries?

9
The Original Motivation
  • Peer to peer systems
  • Nodes need to find nearby nodes for fast
    communication/data access
  • Internet is (kind-of) low dimensional
  • This near neighbor data structure works well

10
Near Neighbor Search
  • Many optimal data structures in low dimensions
    (computational geometry)
  • But high dimensional problem hard
  • Subject of much recent attention in algorithms
    community
  • Kleinberg---Two algorithms for NNS in high
    dimension STOC 1997
  • DISV et al workshop on near neighbor search
    NIPS 2003
  • But focus on approximate near neighbors

11
First Try
  • Manifold is low dimensional
  • Use low dimensional NN solutions?
  • E.g. Voronoi diagram
  • Problem
  • Voronoi diagram relies on geometry of space
  • But we don't know geometry (coordinate system)
    until after we construct manifold
  • So to reconstruct, are limited to using distance,
    not geometry

12
Some Generic Ideas
  • Greedy approach
  • Take query q and closest point p found so far
  • Arrange to find something closer to q than p
  • Repeat till done
  • Local Search
  • Want point closer to q
  • Must be pretty close to p
  • So examine points near p till find one close to q
  • Repeat with new point

(figure: query q at distance d from current point p; want a point in the ball of radius d/2 around q)
13
Some More Ideas
  • Random Sampling
  • For fast search, look only at a few candidates
  • Good candidates depend on query point
  • For any choice, adversary can pick bad q
  • So choose random set---probably good for all
  • Memoization
  • Picking random points near p is a NN search!
  • Pick in advance, as part of building data
    structure
  • E.g., for every possible radius (!), store (a
    few) random candidates in that ball around p

14
Random Sampling
  • When will it work?
  • Evenly distributed low-dimensional point sets
  • E.g. a grid of points
  • E.g., random sample from low-dim. manifold
  • On grid, if p and q at distance d, then
    Ball(q,d/2) is 1/16 of points in Ball(p,2d)
  • So, random point has 1/16 chance of improving
  • And trying 16 random points raises odds to 60-40
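
A quick arithmetic check of the 60-40 figure, assuming each of the 16 samples independently lands in Ball(q, d/2) with probability 1/16 (a Python sketch):

    # Chance that at least one of 16 independent samples lands in Ball(q, d/2)
    # when each does so with probability 1/16.
    p_single = 1.0 / 16
    p_improve = 1 - (1 - p_single) ** 16
    print(round(p_improve, 2))  # ~0.64, i.e. roughly 60-40 in our favor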

15
A Big Fast Data Structure?
  • At each point p, store 16 random points from
    balls of every possible radius around p
  • Resolve query by iterating improvements
  • Look in right size ball around current point
  • If lucky, find point closer to q, can look in
    smaller ball next time
  • If unlucky, end up with bigger ball
  • But luck is more likely---drift in right
    direction
  • Expected ball size shrinks by const factor
  • So, O(log n) steps get to tiny ball---at answer

16
Time to Get Real
  • Storing samples for all radii expensive
  • OK to be sloppy, and store only for powers of 2
  • Means always have ball of roughly correct size
  • But only log n balls needed per point
  • So O(n log n) space overall
  • Worry about dependence
  • What if search cycles back to point already seen?
  • random samples are no longer random
  • And how to find random samples in the first place?

17
Metric Skip Lists
  • Consider randomly ordering list of points
  • For each power-of-two d, point p records next 16
    points in list within distance d
  • These points are a random sample from ball
  • Search inspects points forward in list
  • From current p, check 16 following in current d
  • Search always moves forward
  • So always considering new points
  • So no cycles, and samples independent in every
    step (mostly)
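
A minimal sketch of the query loop, assuming the structure already stores, for each point p and each power-of-two radius r, up to 16 points following p in the random order that lie within distance r. The names samples and dist are illustrative, not from the talk, and this simplified version just stops when no closer candidate is found rather than continuing forward in the list as the full algorithm does:

    import math

    def nn_query(q, start, samples, dist):
        # samples[(p, r)]: up to 16 points following p in the random order
        # with dist(p, x) <= r, for each power-of-two radius r.
        p = start
        while True:
            d = dist(p, q)
            if d == 0:
                return p                               # exact match
            r = 2 ** math.ceil(math.log2(2 * d))       # ball big enough to contain Ball(q, d/2)
            closer = [x for x in samples.get((p, r), ()) if dist(x, q) < d]
            if not closer:
                return p                               # no improvement among samples; report p
            p = min(closer, key=lambda x: dist(x, q))  # greedy step toward q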

18
Construction
  • Randomized Incremental Construction
  • Common idea from computational geometry
  • Adding points to data structure in random order
    is often easy, yields good data structure
  • Build list by prepending points in random order
  • To set new pointers, must find next 16 points
    within (each) distance d of new point
  • But this is like NN query (given the way we
    search)
  • Find answer by slight variation on NN search
  • List so far constructed makes this easy

19
Summary
  • In any low-dimensional space with evenly
    distributed points, metric skip lists solve
    nearest neighbor queries
  • O(n log n) space
  • O(n log n) time to build data structure
  • O(log n) time to query
  • Constant depends on dimensionality of space
  • Application: by building this structure, can find
    NNs for Isomap/LLE in O(n log n) time

20
Experiments
  • Data structure is simple---no hidden
    theoretical aspects
  • Implementation for P2P search works well
  • Experiments on some simple manifolds [JT] work
    well too

21
Improvements [KL04]
  • There are better ways to pick improvement
    candidates for local search
  • Assouad's doubling dimension: the maximum
    number of diameter-d/2 balls needed to cover any
    diameter-d ball
  • Pick one candidate from each ball in cover
  • Get same bounds as above, but
  • Applies to unevenly distributed points
  • Deterministic

22
Markov Decision Processes with Nonrecurring
Rewards
  • Approximation algorithms
  • Reductions to related problems
  • Dynamic programming

23
A Robot Navigation Problem
  • Robot to deliver packages
  • Goal to deliver as quickly as possible
  • Sounds like traveling salesman problem?
  • Mismatches
  • Robot may not go where it plans to (sensor error,
    motor control error, battery failure, ...)
  • Some packages matter more

24
Formulate as Markov Decision Process
  • Graph with rewards rv on states (vertices) v,
    travel times (lengths) on edges
  • From each node, choice of actions (each a
    probability distribution on next vertex)
  • Choosing sequence of actions produces a random
    path through graph
  • If arrive at vertex v at time t, receive
    discounted reward γ^t rv, where γ < 1
  • Motivates getting there quickly
  • Goal maximize total discounted reward

25
MDP Discounting
  • Reward received each time vertex is visited
  • So plain value of infinite path can be infinite
  • Discounting means total reward is bounded by a
    geometric series, so bounded
  • Alternative: consider average reward per unit
    time
  • Other reasons for discounting
  • Inflation (money in future less value than now)
  • Uncertainty (what if something happens before I
    collect future prize?)
  • Mathematical elegance

26
Solution
  • Fixing action at each state produces a Markov
    Chain. Transition probabilities pvw
  • Can compute expected discounted reward ρv if
    start at state v
  • ρv = rv + Σw pvw γ^t(v,w) ρw
  • Choosing actions to optimize this recurrence is
    polynomial time solvable
  • Linear programming
  • Dynamic programming (like shortest paths)
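
A value-iteration sketch of optimizing this recurrence over action choices (a simplification, not the talk's own algorithm; the representation of actions and the function names below are assumptions):

    def value_iteration(states, actions, reward, gamma, iters=1000):
        # Solve rho[v] = reward[v] + max over actions a at v of
        #                sum over (w, prob, t) in a of prob * gamma**t * rho[w].
        # actions[v] is a list of actions; each action is a list of
        # (next_state, probability, travel_time) triples.
        rho = {v: 0.0 for v in states}
        for _ in range(iters):
            rho = {v: reward[v] + max((sum(prob * gamma ** t * rho[w]
                                           for (w, prob, t) in a)
                                       for a in actions[v]), default=0.0)
                   for v in states}
        return rho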

27
Solving the wrong problem
  • Package can only be delivered once
  • So, incorrect to get reward each time reach
    target
  • One solution: expand state space
  • Vertex represents where I am and where I have
    been before (what packages already delivered)
  • Reward nonzero only on states where current
    location not included in list of previously
    visited
  • Now apply MDP algorithm
  • Problem: expanded problem size exponential in
    original input

28
This is one Instance of a General Problem [KL]
  • Often, MDP has state space with nice small
    implicit description but huge explicit
    description
  • How do we accomplish MDP optimization on such
    instances?

29
Tackle an easier problem
  • Problem has two novel elements for TOC
  • Discounting of reward based on arrival time
  • Probability distribution on outcome of actions
  • We will ignore second issue
  • In practice, robot can control errors
  • Even first issue by itself is hard and
    interesting
  • First step towards solving whole problem
  • Frantic Salesman Problem
  • Given rewards, travel times, and discount factor,
    find a path maximizing total discounted reward

30
Approximation Algorithms
  • FSP is NP-complete (thus, so is more general
    MDP-type problem)
  • Reduction from minimum latency TSP
  • So intractable to solve exactly
  • Goal: approximation algorithm that is guaranteed
    to achieve at least some constant (< 1) multiple
    of the best possible discounted reward

31
TOC Toolbox
  • Goal seems to be to find a short path that
    visits lots of reward
  • Relates to previously studied k-TSP problem
  • Given a root vertex v, find a path of minimum
    total length that starts at v and visits vertices
    with (undiscounted) prize at least k
  • Constant factor approximation algorithm known for
    undirected graphs (so we assume this too)
  • i.e., can find path of at most a constant (> 1)
    multiple of minimum possible total edge length

32
Mismatch
  • Constant factor approximation doesn't
    exponentiate well
  • Suppose optimum solution reaches some reward r at
    time t, for discounted reward γ^t r
  • Constant factor approximation would reach it within
    time 2t, for reward γ^2t r
  • Result: get only a γ^t fraction of optimum
    discounted reward, not a constant fraction.

33
Idea: Change Objective Function
  • Modify k-TSP to approximate prize collected
    instead of length --- the orienteering problem
  • Assume tour of length l collecting prize p
  • Find tour of length l collecting prize p/2
  • Avoids changing length, so exponentiation doesn't
    hurt
  • Drawback: no constant factor approximation
    previously known
  • Flipping objective/feasibility transforms problem
  • (Our techniques end up resolving this too)

34
Idea: Upper Bounds
  • General tool for approximation algorithms
  • Show close to something no solution can beat
  • Let dv denote shortest path distance to v
  • Define the prize at v as pv = γ^dv rv
  • Max discounted reward possibly collectable at v
  • So max conceivable reward is Σ pv
  • Potential greedy algorithm: take shortest path to
    one max-prize vertex
  • Gets at least 1/n of optimum

35
Compare to Upper Bound
  • If given path reaches v at time tv, define excess
    at v as ev = tv - dv
  • Difference between shortest path and chosen one
  • Then discounted reward at v is γ^ev pv
  • Idea: good solution need not bother visiting
    nodes at large excess
  • If excess large and node still worth visiting,
    prize must be huge
  • So forget current path, just go straight to huge
    prize without discounting

36
Formalize
  • Fact: excess only increases as we traverse the path
  • Excess reflects lost time --- can't make it up
  • Without loss of generality, assume γ = ¼
  • Just scale edge lengths
  • Claim: at least ½ of the optimum path's discounted
    reward R is collected before the path's excess rises
    to ½
  • Let w be first vertex with ew > ½
  • Suppose more than R/2 reward follows w
  • Show contradiction

37
Path Improvement
  • ew > ½ but more than R/2 reward follows w
  • Shortcut directly to w, then traverse optimum
  • reduces all excesses after w by ½
  • so improves discounts by (1/γ)^½ = 2
  • so doubles discounted reward collected
  • but this was more than R/2 --- contradiction

(figure: optimum path with excess rising from 0 to ½ at w and on to 1; reward more than R/2 is collected after w)
38
Discount Discounting
  • We showed large excess can be ignored
  • But if excess is small, discounting by excess can
    be ignored!
  • (discounted) reward ≈ (undiscounted) prize
  • So, just find path with small excess maximizing
    amount of (undiscounted) prize
  • Gives path with (discounted) reward ≈ prize
  • Of course, min-excess = min-distance
  • But, may be better off approximating excess

39
Improvement on k-TSP: Approximate Excess
  • Recall discounted reward at v is γ^ev pv
  • Prefix of optimum discounted reward path
  • Has discounted reward Σ γ^ev pv > R/2
  • So has prize Σ pv > R/2
  • And has no vertex with large excess
  • Find a path of approximately (3 times) minimum
    excess and prize R/2
  • (we can guess R/2)
  • Excesses at most 3/2, so γ^ev pv ≥ pv/8
  • So discounted reward on found path > R/8

40
The Downside
  • Min-excess problem more useful
  • But harder to solve
  • Approximating min-distance does not approximate
    min-excess

(example: optimum path has length 1+ε and excess ε; a factor-2 approximation to length gives length 2, hence excess 1)
41
Exactly Solvable Case: monotonic paths
  • Suppose optimum goes through vertices in strictly
    increasing distance order from root
  • Then can find optimum by dynamic program
  • Just as can solve longest path in an acyclic
    graph
  • Build table: is there a monotonic path from v
    with length l and prize p?
  • To answer, look for a u after v with a path of
    length l - dvu and prize p - pv
  • Works because monotonic path won't go back
    through v
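
A sketch of a dynamic program in this spirit, phrased from the root rather than from v as on the slide, and assuming integral edge lengths up to a bound L and a complete table d of pairwise lengths (variable names are illustrative):

    def monotone_dp(root, dist, d, prize, L):
        # best[v][l] = max prize of a path from root to v that visits vertices
        # in increasing order of dist (shortest-path distance from root) and
        # has total length exactly l; d[u][v] is the length of edge u-v.
        order = sorted(dist, key=dist.get)            # root comes first
        NEG = float("-inf")
        best = {v: [NEG] * (L + 1) for v in order}
        best[root][0] = prize[root]
        for i, v in enumerate(order):
            for l in range(L + 1):
                if best[v][l] == NEG:
                    continue
                for u in order[i + 1:]:               # monotone: only move outward
                    nl = l + d[v][u]
                    if nl <= L:
                        best[u][nl] = max(best[u][nl], best[v][l] + prize[u])
        return best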

42
Dynamic Program
(figure: example graph with edge lengths; each vertex is labeled with the (length, prize) pairs achievable by monotonic paths from the root)
43
Approximable case: wiggly paths
  • Length of path to v is tv = dv + ev
  • If ev > dv then tv > ev > tv / 2
  • i.e., take twice as long as necessary to reach
    end
  • So if approximate tv to constant factor, also
    approximate ev to twice that constant factor
  • But finding approximately optimum tv is k-TSP
    problem
  • Constant factor approximation known

44
Decompose into easy cases
(figure: optimum path decomposed into alternating monotone and wiggly segments)
Divides into independent problems
> 2/3 of each wiggly path is excess
45
Decomposition Analysis
  • 2/3 of each wiggly path is excess
  • That excess accumulates into whole path
  • So, total excess of wiggly paths upper bounded by
    excess of whole path
  • Conclude total length of wiggly paths upper
    bounded by 3/2 of path excess
  • Use k-TSP to find approximately shortest wiggles
    collecting right amount of prize
  • Approximates length, so approximates excess
  • Over all wiggly parts, approximates total excess

46
Dynamic program
  • For each pair of vertices and each (discretized)
    prize value, find
  • Shortest monotonic path collecting desired prize
  • Approximately shortest wiggly path collecting
    desired prize
  • Note polynomially many subproblems
  • Use dynamic programming to find optimum pasting
    together of subproblems

47
Summary
  • Showed maximum discount prize can be approximated
    by minimum excess path
  • Showed how to approximate min-excess path
  • Also solves orienteering problem
  • Also solves dual of k-TSP where length is fixed
  • Also solves tree versions of all these
    problems, e.g. prize-approximate k-MST

48
Open Questions
  • Directed graphs?
  • We used k-TSP, only solved for undirected
  • For directed, even standard TSP has no known
    constant factor approximation
  • We only use k-TSP/undirectedness in wiggly parts
  • Stochastic actions?
  • Stochastic seems to imply directed
  • Special case: forget rewards.
  • Given choice of actions, choose actions to
    minimize cover time of graph

49
Decoding Turbo Codes via Linear Programming
  • Linear Programming Relaxations

50
Basic Coding Problem
  • Goal: transmit message across noisy channel that
    randomly perturbs parts of message
  • Binary symmetric channel: each transmitted bit
    flipped independently with probability p < ½
  • Approach also works for AWGN channel
  • Method: introduce redundancy in messages to cope
    with perturbations
  • Compute encoding function mapping each
    information word u ∈ {0,1}^k to a codeword
    ȳ ∈ {0,1}^n
  • Receive and decode randomly perturbed y
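
A one-function sketch of the binary symmetric channel model used throughout this part:

    import random

    def bsc(bits, p):
        # Binary symmetric channel: flip each bit independently with probability p.
        return [b ^ (random.random() < p) for b in bits]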

51
Definitions
  • Information word u, length k
  • Encoder maps u to codeword ȳ, length n
  • Noisy channel (bit error rate p) delivers corrupt codeword y
  • Decoder outputs decoded codeword and decoded info word û
  • Word error rate is Pr[û ≠ u]
52
Decoding
  • Received perturbed codeword usually not a
    codeword.
  • Need rule for picking best decoding possibility
  • Performance measure: probability of giving wrong
    answer
  • Known as word error rate (WER)
  • Won't be zero, since with nonzero probability
    channel replaces codeword with a different, valid
    codeword (at that point, only natural to answer
    with wrong information word)

53
Maximum Likelihood Decoding
  • Specific decoding rule
  • Choose info word whose codeword maximizes
    probability of received word, Pr(y | ȳ)
  • Assuming binary symmetric channel errors, this is
    just the codeword at minimum Hamming distance from
    the received word
  • More generally, linear cost function on codewords
  • Under general assumptions, this is the best way
    to decode
  • Drawback: generally NP-complete
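
A brute-force sketch of the decoding rule just described (exponential in the code size, so purely illustrative):

    def ml_decode_bsc(received, codewords):
        # Over a binary symmetric channel with p < 1/2, the maximum-likelihood
        # codeword is the one at minimum Hamming distance from the received word.
        def hamming(a, b):
            return sum(x != y for x, y in zip(a, b))
        return min(codewords, key=lambda c: hamming(c, received))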

54
Turbo Codes
  • A particular encoding approach
  • Introduced in 1993 by Berrou, Glavieux, and
    Thitimajshima
  • Simple linear time encoding (state machine)
  • Numerous fast heuristics for decoding
  • E.g. belief propagation
  • But may not converge
  • Fantastic in practice, but unclear why
  • distance of code is bad (multiple codewords
    with few different bits easily confused with
    each other)
  • Some asymptotic analysis of random turbo codes

55
Decoding by Linear Programming
  • Describe polynomial time linear-programming
    approach to decoding (arbitrary) turbo codes
  • Generalizes to LDPC codes
  • Certifies when finds ML codeword
  • Precise analysis for RA(2) codes
  • Repeat-accumulate codes are a special type of turbo
    code
  • 2 means rate ½: two code bits per info bit
  • Based on analysis, explanation of how to build
    good RA(2) code
  • Gives code with inverse-polynomial error bound

56
ML Decoding as Integer Linear Programming
  • Given corrupted codeword y, want to find codeword ȳ
    of minimum Hamming distance
  • Note: Hamming distance ‖y - ȳ‖ = Σ (1 - 2yi) ȳi +
    constant, a linear function of ȳ
  • Code C ⊆ {0,1}^n
  • Polytope P as convex hull of C
  • ML decoding wants minimum-cost vertex of P
  • Errors in channel change y, perturb objective
  • Good code: small objective perturbations leave
    optimum vertex unchanged
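
A quick numeric check of the linear-objective identity above (y is the received word, c a candidate codeword; values chosen arbitrarily):

    y = [1, 0, 1, 1, 0]                                        # received word
    c = [1, 1, 0, 1, 0]                                        # candidate codeword
    lhs = sum(a != b for a, b in zip(y, c))                    # Hamming distance
    rhs = sum(y) + sum((1 - 2 * a) * b for a, b in zip(y, c))  # constant + term linear in c
    assert lhs == rhs == 2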

57
Linear Programming Relaxation
  • Optimizing over P intractable
  • Generally no tractable way to work with
    constraints (facets) defining P
  • Find a tractable polytope similar to P,
    optimize over that instead
  • New polytope will generally have additional,
    non-integral vertices

58
Properties of a Good Relaxation
  • Relaxed LP should be tractable
  • Preferably few constraints
  • Combinatorial structure to aid solution
  • Relaxed polytope Q should contain original P
  • Ensures that true optimum is feasible in
    relaxation
  • So Q optimum is lower bound on P optimum
  • Relaxation should be tight
  • Vertices of Q should be close to P
  • Increases hope that optimum in Q will be valid in
    P
  • Makes it easier to round Q solution into P

59
Repeat-Accumulate Codes
  • Particular type of turbo code
  • Accumulator accumulates sum of bits mod 2
  • RA(m) code
  • Make m copies of info word (total km bits)
  • Apply given fixed permutation to bits
  • Feed resulting sequence through accumulator
  • Output sequence of accumulator values (km
    outputs)
  • We focus on RA(2) code
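
A sketch of an RA(m) encoder; the permutation in the example below is an assumption chosen to match the info-bit order on the talk's trellis (u1, u3, u2, u3, u1, u2), and it reproduces the 011 -> 010110 example shown on a later slide:

    def ra_encode(info_bits, m, perm):
        # Repeat the k info bits m times, apply the fixed permutation,
        # then run the result through an accumulator (running parity).
        repeated = list(info_bits) * m
        permuted = [repeated[i] for i in perm]
        acc, out = 0, []
        for b in permuted:
            acc ^= b                      # accumulate mod 2
            out.append(acc)               # codeword bit = accumulator state
        return out

    # RA(2) on info word 0,1,1 with the ordering u1,u3,u2,u3,u1,u2:
    print(ra_encode([0, 1, 1], 2, [0, 2, 1, 5, 3, 4]))   # [0, 1, 0, 1, 1, 0]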

60
Trellis
(figure: trellis with accumulator states 0/1 at each of six stages; the transitions are labeled with info bits in the order u1, u3, u2, u3, u1, u2)
  • Circles represent state of accumulator
  • Each info bit appears at two transitions
  • Sequence of states is codeword

61
Encoding with Trellis
(figure: the same trellis with the transitions chosen when encoding 011 highlighted)
  • Encode 011
  • Output 010110

62
Decoding with Trellis
(figure: the trellis again, with the received codeword bit at each stage indicating which transition to take)
  • At each step, codeword says which state should
    transition to
  • Read off ui label on transition arc

63
Handling Errors
  • Not every path through trellis is a codeword
  • Need both occurrences of each info bit to agree
    (cause same transition)
  • Call such a path agreeable
  • Want path with fewest transitions not matching
    received codeword
  • Give cost 0 to transitions matching received
    word, cost 1 to transitions not matching
  • Shortest agreeable path is ML codeword

64
Relaxation
  • Shortest agreeable path is NP-complete
  • Relax to min-cost agreeable flow
  • Flow is convex combination of paths (exponential
    number of variables, but tractable)
  • Alternatively, poly-size LP based on balance of
    incoming and outgoing flow
  • Agreeability is a constraint that certain groups
    of transition edges carry same amount of flow

65
Properties of Relaxation
  • Any agreeable path is an agreeable flow
  • Conclude: correct answer is feasible in
    relaxation
  • Conclude: if min-cost agreeable flow is a path,
    then it is the shortest agreeable path
  • Certificate property: min-cost agreeable flow
    decoder will know it has found the ML codeword
  • Relaxation is tractable to solve
  • Can directly solve via LP
  • But actually exists reduction to standard
    min-cost flow, so can use specialized fast
    algorithms

66
Performance
  • When is correct codeword (path) the optimum for
    MCAF?
  • Intuition from residual graphs in flow
  • Build new graph by subtracting from each edge's
    capacity the amount of flow currently on the edge
  • A minimum cost flow is optimal if and only if
    there is no negative cost cycle in the residual
    graph (i.e., of positive residual capacity)
  • Can push flow around cycle, reduce cost of flow
  • Says local optimum is global optimum

67
Generalize to MCAF
  • True codeword is optimal if no negative cost
    agreeable cycle in residual graph
  • What does agreeable cycle look like?
  • Must diverge from correct codeword path at some
    point (traverses other label)
  • Then traverses same labels as correct for a while
  • Then may return to codeword path (again by
    traversing opposite label)
  • Each time traverses opposite label at some bit,
    agreeability requires it also traverse opposite
    label at other copy of that bit

68
Trellis
(figure: trellis for the all-zeros codeword, highlighting that the two transitions carrying the same info bit must agree)
  • Suppose all 0s sent
  • If cycle uses 1-edge at second layer
  • must also use 1-edge at 4th layer

69
An Easier Representation: the Tanner Graph
  • Draw path along codeword
  • Edge cost -1 if received bit flipped,
    +1 otherwise
  • Add matching edge between vertices
    corresponding to copies of same info bit
  • All edge costs 0
  • Cycles in residual graph correspond to cycles in
    this graph G, and have same cost
  • Hamiltonian edge: use same transition as
    codeword
  • Matching edge: use opposite transition

70
Picture
(figure: Tanner graph --- a cycle through the codeword positions with edge costs ±1, plus cost-0 matching edges joining the two copies of each info bit)
71
Connection
  • Circle paths = simple cycles in trellis
  • Matching edges for agreement

(figure: the Tanner graph drawn alongside the trellis, showing the correspondence between its cycles and trellis cycles)
72
Main Theorem
  • The true codeword is the min-cost agreeable flow
    solution iff there is no negative cost cycle in
    the Tanner graph G

73
Suggests Good Codes
  • Recall edge cost is -1 if bit flipped, which
    occurs with probability p < ½
  • Intuition: if cycle is large, unlikely that more
    than half its edges flip
  • So, good idea to build graph in which all cycles
    are large
  • Achieve by building graph with large
    girth (length of shortest cycle)
  • Erdős gives a (path + matching) graph with girth
    log n (and this is best possible)
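
A small calculation backing up the girth intuition, following the slide's simplification that a cycle of length L goes negative when more than half of its L edges are flipped (the zero-cost matching edges are ignored here):

    from math import comb

    def p_negative_cycle(L, p):
        # Probability that more than half of L independently flipped edges flip.
        return sum(comb(L, i) * p ** i * (1 - p) ** (L - i)
                   for i in range(L // 2 + 1, L + 1))

    # e.g. p_negative_cycle(4, 0.05) is about 5e-4, while
    #      p_negative_cycle(20, 0.05) is about 5e-10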

74
Analysis
  • Theorem: Using RA(2) code from Erdős graph, if
    p < 2^(-4(ε + log 24)/2) then WER = Pr[negative
    cycle] < n^-ε
  • Proof
  • Negative cycle has length at least log n
  • Break into subpaths of length log n
  • One must be negative
  • What is Pr[negative length-(log n) path]?
  • probability at most n^(-2-ε)
  • Degree-3 graph, so only n² paths of length log n
  • Add up

75
Experiments
76
Summary
  • Combinatorial algorithm for decoding RA(2) codes
  • Analysis of its error probability
  • Recommended code based on analysis
  • That code has polynomially small WER
  • Experiments show good performance in practice
  • But slow, since solving LP
  • This approach gives insight, not algorithms

77
Extensions
  • Discovered that Wainwright's Tree-Reweighted Max
    Product is solving the dual of our LP
  • Thus, TRMP has same performance
  • Gives a belief-propagation flavored decoder with
    same performance as LP decoder
  • Techniques extend to arbitrary LDPC codes
  • Relaxation by intersecting parity polytopes for
    each parity check
  • LP decoder (but flow decoder doesn't extend)

78
Breaking Result [FMSSW]
  • Analysis of LP decoder for LDPC codes
  • Proof that when code based on expander graph, LP
    decoder handles constant fraction of bits
    corrupted
  • Gives proof of exponentially small error rate for
    polynomial-time decoder of LDPC codes.

79
Learning Markov Random Fields
  • Randomized Rounding

80
Overview
  • Fundamentals of Markov networks
  • Maximum likelihood Markov network structure as a
    maximum hypertree problem
  • Tree-width and hypertrees
  • An approximation algorithm for maximum hypertrees
  • Reducing maximum hypertrees to/from maximum
    likelihood Markov networks

81
Density Estimation
  • T observations x1, ..., xT
  • Each xt = (xt1, ..., xtn), a vector of n values
  • Estimate joint probability distribution
    P(X1, ..., Xn) from which samples were taken.
  • (Assume observations are i.i.d.)

82
Maximum likelihood approach
  • Postulate best fit is maximum likelihood
    distribution
  • Distribution that maximizes likelihood of data
  • To avoid over-fitting, limit choice to within
    some (parametric) class.
  • Equivalent to projecting onto our class to find
    nearest neighbor of empirical distribution
  • Our approach applies to this general distribution
    projection problem

83
Markov Random Networks
  • Representations of joint distributions with
    limited dependence.
  • Variables x1, ..., xn
  • Graph with variables as vertices
  • Edges represent dependencies
  • Conditioned on (any specific values of) its
    neighbors, xi is independent of all other
    variables.
  • More generally, two sides of any separator are
    independent conditioned on separator

84
Problems
  • Markov net inference
  • Given data and specific Markov net, find
    parameter settings that best fit data
  • Markov net learning
  • Given data, find Markov net structure that best
    fits data (under best parameter settings)
  • For us, best fits = maximum likelihood

85
Hammersley Clifford Theorem
  • Cliques in Markov network are special: no
    restriction on their marginal distribution
  • A distribution P is a Markov network over graph G
    iff P factorizes over the cliques of G:
    P(x) ∝ ∏h φh(xh), a product over cliques h of G
  • Note: φh assigns a value to each possible setting
    xh of values of variables in clique h

86
Value of HC
  • HC gives a concise representation of any MRF
    probability distribution function
  • Only need to specify clique potentials
  • If variables take s values, then potential on
    size-k clique represented by s^k values
  • If cliques are small (constant size) then
    representation is small (linear size)

87
Limits of HC
  • Cannot use HC factorization to compute important
    quantities
  • Normalization of φ (can't even tell if needed)
  • Marginal probability distributions (even 1
    variable)
  • Conditional probability distributions (ditto)
  • Maximum likelihood parameter settings (finding φ
    to fit data)

88
Triangulated Markov Networks
  • No minimal cycles of more than three vertices.

(figure: a non-triangulated graph on X1..X6 and a triangulation of it obtained by adding chords)
89
Benefits of Triangulation
  • Efficient (linear time) exact calculations
  • Marginals
  • Conditional distributions
  • Just about anything else (via canonical dynamic
    program)
  • Explicit Hammersley-Clifford factorization
  • Efficient inference: calculation of maximum
    likelihood φ to fit observed data

90
Efficient Calculation
  • Triangulated graph has elimination ordering
  • Order of deleting vertices such that when vertex
    is deleted, its surviving neighbors form a clique
  • Run backwards, adds each vertex to a clique
  • Canonical dynamic program
  • Memoize values on each clique (e.g., distribution)
  • Gives necessary info to add new vertex
  • Memo-table size exponential in clique sizes
  • Happy if max clique size small
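
A sketch of checking the elimination-ordering property described above, with adjacency given as a dict of neighbor sets (names are illustrative):

    def is_perfect_elimination(order, adj):
        # When each vertex is deleted in turn, its surviving neighbors must
        # form a clique; triangulated graphs are exactly the graphs admitting
        # such an ordering.
        remaining = set(order)
        for v in order:
            remaining.discard(v)
            nbrs = [u for u in adj[v] if u in remaining]
            for i in range(len(nbrs)):
                for j in range(i + 1, len(nbrs)):
                    if nbrs[j] not in adj[nbrs[i]]:
                        return False       # surviving neighbors not a clique
        return True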

91
HC Factorization of Triangulated Graphs
  • Factorization over cliques, as before
  • (but now with an explicit formula for each φh)
92
ML Inference on Triangulated Graphs
  • Fixed network given
  • Choose parameters (potentials φ) to maximize
    likelihood of observations
  • In triangulated network, do so by making
    marginals correct on cliques
  • i.e., want derived P(xh) equal to the empirical
    distribution of xh for each clique h
  • Achieve by plugging the empirical marginals into
    the explicit formula for φ

93
Inference on Triangulated Graphs
  • Factorization as before
  • (with the empirical marginals plugged into the
    explicit formula for each φh)
94
Triangulation
  • Non-triangulated Markov network can be
    triangulated by adding edges
  • Find large minimal cycles, add chords
  • Adding edges removes independence constraints, so
    broadens class of models
  • So, only increases fit to data (maximum
    likelihood)
  • And, makes computations tractable

95
Treewidth
  • Could just add all edges (complete graph is
    triangulated)
  • Drawbacks
  • Dynamic programs exponential in clique size
  • Number of model parameters exponential in clique
    size leads to over-fitting
  • Treewidth of a graph: the minimum, over all
    triangulations, of the maximum clique size of the
    triangulation, minus one

96
Markov Net Learning
  • Given data, wish to find MRN of tree-width at
    most k that maximizes likelihood of observed data
  • Equivalently, since triangulation can only
    increase likelihood, wish to find maximum
    likelihood triangulated graph with clique size at
    most k+1
  • We call such a graph a k-hyperforest
  • If maximal, k-hypertree

97
Computing (Log-)Likelihood
98
Additive Weights
  • For triangulated G, HC gives explicit formulation
    for maximum likelihood φ
  • Key: each φh is independent of graph choice!
  • Set a weight wh for each candidate clique h,
    computed from the empirical marginal on h
  • Then the (log-)likelihood of G is the sum of wh
    over the cliques h contained in G
99
New formulation
  • Max-likelihood value of G is just sum of weights
    of cliques it contains
  • So, given weights on cliques of size up to k+1,
    want to find a hypertree (triangulated graph of
    treewidth at most k) containing maximum weight of
    cliques
  • This is the maximum hypertree problem

100
Chow-Liu (1968): k = 1
  • Treewidth 1 is just a tree
  • Edges are cliques of size 2
  • Weight wh on edge turns out to be mutual
    information between the variables on its
    endpoints.
  • Maximum likelihood tree is the maximum spanning
    tree with these mutual-information weights
  • Polynomial time solvable
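
A sketch of the Chow-Liu procedure, assuming discrete variables and samples given as dicts from variable name to value: empirical mutual information as edge weight, then a maximum spanning tree via Kruskal with union-find.

    from math import log

    def chow_liu_tree(samples, variables):
        T = len(samples)

        def mi(a, b):
            # Empirical mutual information between variables a and b.
            pa, pb, pab = {}, {}, {}
            for s in samples:
                pa[s[a]] = pa.get(s[a], 0) + 1
                pb[s[b]] = pb.get(s[b], 0) + 1
                pab[(s[a], s[b])] = pab.get((s[a], s[b]), 0) + 1
            return sum(c / T * log((c / T) / ((pa[x] / T) * (pb[y] / T)))
                       for (x, y), c in pab.items())

        # Maximum spanning tree over mutual-information edge weights (Kruskal).
        parent = {v: v for v in variables}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        edges = sorted(((mi(a, b), a, b) for i, a in enumerate(variables)
                        for b in variables[i + 1:]),
                       key=lambda e: e[0], reverse=True)
        tree = []
        for w, a, b in edges:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
                tree.append((a, b, w))
        return tree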

101
Larger Treewidths
  • Theorem: for k > 1, maximum hypertree problem is
    NP-complete
  • Reduction from SAT
  • So conclude ML MRN is NP-complete
  • So, seek approximation algorithm
  • Given optimum has value w, find (in polynomial
    time) some solution with value at least w/α
  • We give an algorithm with α = 8^k k! (k+1)!
  • Constant for any fixed k

102
Idea Locally Testable Structure
  • Treewidth k is a global constraint, hard to aim
    for
  • Define windmill, an object with local
    characterization
  • Every hypertree contains a windmill with at least
    1/(k+1)! of its weight
  • Algorithm to find windmill of approximately
    (factor 1/(8^k k!)) maximum weight

103
Star graphs
Covering 11 of the 15 edges of a tree with
disjoint stars
104
Partitioning a tree into two sets of disjoint
stars
Conclusion: some set of disjoint stars contains
at least 1/2 the edge weight of any given tree.
105
Windmills
  • A k-windmill is defined by a depth-k rooted tree
  • Its hyperedges are all the paths from the root
  • It has treewidth k
  • 1-windmill is star

106
Windmill Theorem
  • Windmill farm: collection of disjoint windmills
  • Theorem: any weighted k-hypertree contains a
    k-windmill-farm with at least a 1/(k+1)! fraction
    of the weight
  • (k+1)-color the hypertree so no hyperedge has
    repeated colors
  • All edges that get colors in same order form a
    windmill farm
  • Only (k+1)! orders

107
Idea Randomized Rounding
  • Already saw idea of ILP relaxations
  • Define integer linear program solving problem
    (not convex)
  • Ignore integrality constraints, solve LP (convex)
  • Now, want integral solution
  • Round fractional solution to integral by setting
    variable of value 0 < x < 1 equal to 1 with
    probability x and 0 otherwise
  • Gives integer solution where all constraints
    satisfied in expectation

108
Special Case: Layered 2-windmills
  • Variable yuv set to 1 if u is root and v is child
    of u
  • Variable zuvx set to 1 if x is a child of v which
    is a child of u
  • Objective function: Σ wuvx zuvx
  • Consistency constraint: zuvx ≤ yuv
  • Single-parent constraint: Σu yuv = 1

109
Rounding
  • Consider constraint Σu yuv = 1
  • Means yuv form a probability distribution on
    choices of parent for v
  • Choose parents according to this distribution
  • Objective function was Σ wuvx zuvx
  • We can keep wuvx if yuv was set to 1
  • So expected kept value is Σ wuvx yuv ≥ Σ wuvx
    zuvx
  • i.e., rounded solution matches LP objective
    value!
  • (white lie)
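
A sketch of this rounding step, assuming the LP solution y is given as a dict keyed by (u, v) pairs (names are illustrative):

    import random

    def round_parents(y, vertices):
        # For each v, the values y[(u, v)] sum to 1, so they form a probability
        # distribution over possible parents u; sample v's parent from it.
        parent = {}
        for v in vertices:
            choices = [u for u in vertices if (u, v) in y]
            weights = [y[(u, v)] for u in choices]
            parent[v] = random.choices(choices, weights=weights, k=1)[0]
        return parent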

110
Open problems
  • We have a constant factor approximation for
    constant k, but it's a pretty bad constant!
  • Two separate reasons for badness
  • Windmill theorem may be very loose
  • (but we have examples of gap exceeding k)
  • Better windmill approximations?
  • Multilevel facility location?
  • Use idea of restricted, locally testable class
    of MRNs to find other, practical and tractable
    subsets
  • Hardness of approximation?

111
Conclusion
112
Near Neighbor Structure
  • Concepts
  • Random sampling
  • Memoization
  • Greedy improvement
  • Local search
  • Impact
  • Simple
  • Good constants
  • Good experimental performance
  • So likely useful in practice

113
One-shot MDPs
  • Concepts
  • Approximation Algorithms
  • Reductions to related problems
  • Dynamic Programming
  • Impact
  • Solved an open theory problem
  • Unlikely to be a good practical performer
  • But suggests heuristics
  • And poses good next questions

114
Turbo Decoding
  • Concepts
  • LP relaxations
  • Flows
  • Graph theory
  • Impact
  • Probably don't want to use LP in DSP chips
  • Strengthens theoretical grounds/explanations for
    good performance of certain codes
  • Provides insight for improving codes and decoding
    algorithms

115
Hypertrees
  • Concepts
  • Randomized rounding of LP relaxations
  • Impact
  • Unlikely to be useful in practice
  • Possible application to branch and bound
    optimization methods
  • Suggests some useful special classes of
    low-treewidth graphs (windmills) that may be more
    tractable to construct

116
Working with Algorithms
  • Many NIPS problems would be trivial given
    infinite computational resources
  • Good algorithms can simulate such resources
  • Large collection of available tools
  • Large collection of people who know them
  • To apply algorithmic methods, want well-defined
    algorithmic problem
  • Avoid defining problem via algorithmic solution
  • Dialogue can help define true problem
  • Even oversimplified solution may give insight

117
Acknowledgements
  • Near Neighbor Searching
  • With Matthias Ruhl
  • MDPs
  • With Avrim Blum, Shuchi Chawla, Terran Lane, Adam
    Meyerson, Maria Minkoff
  • Turbo Decoding
  • With Jon Feldman, Martin Wainwright
  • Learning Markov Models
  • With Nati Srebro

118
Conclusion
  • Lots of interesting NIPS problems!
  • Techniques from theoretical computer science can
    be applied
  • Toolbox of prior approximation algorithms
  • Combinatorial structure of problems
  • Wanted: more problems
  • Value in both definition and solution
  • http://theory.lcs.mit.edu/~karger/Talks/NIPS.ppt