Transcript and Presenter's Notes

Title: Probabilistic Modeling for Combinatorial Optimization


1
Probabilistic Modeling for Combinatorial
Optimization
  • Scott Davies
  • School of Computer Science
  • Carnegie Mellon University
  • Joint work with Shumeet Baluja

2
Combinatorial Optimization
  • Maximize evaluation function f(x)
  • input fixed-length bitstring x
  • output real value
  • x might represent
  • job shop schedules
  • TSP tours
  • discretized numeric values
  • etc.
  • Our focus: black-box optimization
  • No domain-dependent heuristics

(Diagram: a bitstring x, e.g. 1001001..., is fed to the black-box evaluation f(x), which returns a real value such as 37.4)
3
Most Commonly Used Approaches
  • Hill-climbing, simulated annealing
    Generate candidate solutions neighboring a single
    current working solution (e.g., differing by one
    bit)
  • Typically make no attempt to model how particular
    bits affect solution quality
  • Genetic algorithms
  • Attempt to implicitly capture dependency of
    solution quality on bit values by maintaining a
    population of candidate solutions
  • Use crossover and mutation operators on
    population members to generate new candidate
    solutions

4
Using Explicit Probabilistic Models
  • Maintain an explicit probability distribution P
    from which we generate new candidate solutions
  • Initialize P to uniform distribution
  • Until termination criteria met
  • Stochastically generate K candidate solutions
    from P
  • Evaluate them
  • Update P to make it more likely to generate
    solutions similar to the good solutions
  • Several different choices for what sorts of P to
    use and how to update it after candidate solution
    evaluation (a skeleton of this loop is sketched below)
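
A minimal Python skeleton of this loop, with the model representation and the `sample_model` / `update_model` routines left to the caller (illustrative names, not from the talk):

```python
def optimize(f, model, sample_model, update_model, K=100, iterations=200):
    """Generic explicit-model loop: sample K candidates from P, evaluate them,
    then update P so it favors solutions similar to the good ones."""
    best_x, best_val = None, float("-inf")
    for _ in range(iterations):
        candidates = [sample_model(model) for _ in range(K)]   # generate from P
        candidates.sort(key=f, reverse=True)                   # evaluate and rank
        if f(candidates[0]) > best_val:
            best_x, best_val = candidates[0], f(candidates[0])
        model = update_model(model, candidates)                # bias P toward the best
    return best_x, best_val
```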

5
Probability Distributions Over Bitstrings
  • Let x = (x1, x2, ..., xn), where each xi takes one of
    the values {0, 1} and n is the length of the
    bitstring.
  • Can factorize any distribution P(x1, ..., xn) bit by
    bit
  • P(x1, ..., xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) ... P(xn | x1, ..., xn-1)
  • In general, the above formula is just another way
    of representing a big lookup table with one entry
    for each of the 2^n possible bitstrings.
  • Obviously too many parameters to estimate from
    limited data!

6
Representing Independencies with Bayesian Networks
  • Graphical representation of probability
    distributions
  • Each variable is a vertex
  • Each variable's probability distribution is
    conditioned only on its parents in the directed
    acyclic graph (DAG)

(Diagram: example network with variables F = Wean on Fire, D = Office Door Open,
I = Ice Cream Truck Nearby, A = Fire Alarm Activated, H = Hear Bells)
P(F, D, I, A, H) = P(F) P(D) P(I | F) P(A | F) P(H | D, I, A)
7
Bayesian Networks for Bitstring Optimization?
  • P(x1, ..., xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) ... P(xn | x1, ..., xn-1)

Yuck. Let's just assume all the bits are
independent instead. (For now.)
P(x1, ..., xn) = P(x1) P(x2) P(x3) ... P(xn)
Ahhh. Much better.
8
Population-Based Incremental Learning
  • Population-Based Incremental Learning (PBIL)
    [Baluja, 1995]
  • Maintains a vector of probabilities: one
    independent probability P(xi) for each bit xi.
  • Until termination criteria met
  • Generate a population of K bitstrings from P
  • Evaluate them
  • Use the best M of the K to update P as follows
    (a Python sketch appears at the end of this slide)

P(xi) ← (1 - α) P(xi) + α   if xi is set to 1, or
P(xi) ← (1 - α) P(xi)        if xi is set to 0
(where α is the learning rate)
  • Optionally, also update P similarly with the
    bitwise complement of the worst of the K
    bitstrings
  • Return best bitstring ever evaluated
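
A rough Python sketch of PBIL as described above; the learning rate α and the population sizes below are illustrative defaults, not the settings used in the reported experiments, and the optional negative update from the worst bitstring is omitted:

```python
import random

def pbil(f, n_bits, K=100, M=2, alpha=0.1, generations=1000):
    """Population-Based Incremental Learning: one independent P(x_i) per bit."""
    probs = [0.5] * n_bits                       # initialize P to the uniform distribution
    best_x, best_val = None, float("-inf")
    for _ in range(generations):
        # Generate a population of K bitstrings from P and evaluate them.
        pop = [[1 if random.random() < p else 0 for p in probs] for _ in range(K)]
        pop.sort(key=f, reverse=True)
        if f(pop[0]) > best_val:
            best_x, best_val = pop[0][:], f(pop[0])
        # Update P toward each of the best M bitstrings:
        #   P(x_i) <- (1 - alpha) P(x_i) + alpha   if x_i is set to 1
        #   P(x_i) <- (1 - alpha) P(x_i)           if x_i is set to 0
        for x in pop[:M]:
            probs = [(1 - alpha) * p + alpha * b for p, b in zip(probs, x)]
    return best_x, best_val
```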

9
PBIL vs. Discrete Learning Automata
  • Equivalent to a team of Discrete Learning
    Automata, one automaton per bit [Thathachar &
    Sastry, 1987]
  • Learning automata choose actions independently,
    but receive common reinforcement signal dependent
    on all their actions
  • PBIL update rule equivalent to the linear
    reward-inaction algorithm [Hilgard & Bower, 1975]
    with "success" defined as "best in the bunch"
  • However, Discrete Learning Automata typically
    used previously in problems with few variables
    but noisy evaluation functions

10
PBIL vs. Genetic Algorithms
  • PBIL originated as a tool for understanding GA
    behavior
  • Similar to Bit-Based Simulated Crossover (BSC)
    [Syswerda, 1993]
  • Regenerates P from scratch after every generation
  • All K used to update P, weighted according to
    probabilities that a GA would have selected them
    for reproduction
  • Why might normal GAs be better?
  • Implicitly capture inter-bit dependencies with
    population
  • However, because the model is only implicit,
    crossover must be randomized. Also, limited
    population size often leads to premature
    convergence based on noise in the samples.

11
Four Peaks Problem
  • Problem used in [Baluja & Caruana, 1995] to test
    how well GAs maintain multiple solutions before
    converging.
  • Given input vector X with N bits, and difficulty
    parameter T
  • FourPeaks(T,X) = MAX(head(1,X), tail(0,X)) + Bonus(T,X)
  • head(b,X) = number of contiguous leading bits in X set
    to b
  • tail(b,X) = number of contiguous trailing bits in X
    set to b
  • Bonus(T,X) = 100 if (head(1,X) > T) AND (tail(0,X) > T),
    or 0 otherwise
    (a Python transcription of these definitions follows)
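
A direct Python transcription of these definitions (bitstrings represented as lists of 0s and 1s):

```python
def head(b, x):
    """Number of contiguous leading bits in x set to b."""
    count = 0
    for bit in x:
        if bit != b:
            break
        count += 1
    return count

def tail(b, x):
    """Number of contiguous trailing bits in x set to b."""
    return head(b, list(reversed(x)))

def four_peaks(T, x):
    """MAX(head(1,x), tail(0,x)) plus a bonus of 100 when both exceed T."""
    bonus = 100 if head(1, x) > T and tail(0, x) > T else 0
    return max(head(1, x), tail(0, x)) + bonus
```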

12
Four Peaks Problem Results
  • 80 GA settings tried; best five shown here
  • Average over 50 runs

13
Large-Scale PBIL Empirical Comparison
  • See table in other slide

14
Modeling Inter-Bit Dependencies
  • How about automatically learning probability
    distributions in which at least some dependencies
    between variables are modeled?
  • Problem statement: given a dataset D and a set of
    allowable Bayesian networks {Bi}, find the Bi
    with the maximum posterior probability

15
Equations. (Mwuh hah hah hah!)
  • Maximize the posterior P(B | D) ∝ P(B) P(D | B), where
    the data likelihood factors as
    P(D | B) = ∏_j ∏_i P(d_i^j | π_i^j)
  • Where
  • d^j is the jth datapoint
  • d_i^j is the value assigned to xi by d^j
  • Π_i is the set of xi's parents in B
  • π_i^j is the set of values assigned to Π_i by d^j
  • P̂ is the empirical probability distribution
    exhibited by D

16
Mmmm...Entropy.
Pick B to maximize the log-likelihood log P(D | B)
  • Fact: given that B has a network structure S, the
    optimal probabilities to use in B are just the
    probabilities in D, i.e. P̂. So:
Pick S to maximize -Σ_i H(xi | Π_i), the negative sum of each
bit's entropy conditioned on its parents (entropies under P̂)
17
Entropy Calculation Example
x20's contribution to score, computed from the empirical
counts over 80 total datapoints,
where H(p,q) = -p log p - q log q.
18
Single-Parent Tree-Shaped Networks
  • Now let's allow each bit to be conditioned on at
    most one other bit.
  • Adding an arc from xj to xi increases the network
    score by
  • H(xi) - H(xi | xj)   (xj's information gain
    with respect to xi)
  • = H(xj) - H(xj | xi)   (not necessarily obvious,
    but true)
  • = I(xi, xj), the mutual information
    between xi and xj (a sketch of its computation follows)
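
A small Python helper estimating I(xi, xj) from the empirical joint counts in a dataset of bitstrings (a sketch; the function name is illustrative):

```python
import math
from collections import Counter

def mutual_information(data, i, j):
    """Empirical mutual information between bits i and j over a list of bitstrings."""
    n = len(data)
    joint = Counter((x[i], x[j]) for x in data)     # counts of (x_i, x_j) value pairs
    p_i = Counter(x[i] for x in data)
    p_j = Counter(x[j] for x in data)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi
```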

19
Optimal Single-Parent Tree-Shaped Networks
  • To find the optimal single-parent tree-shaped
    network, just find the maximum spanning tree using
    I(xi, xj) as the weight for the edge between xi
    and xj [Chow & Liu, 1968] (a sketch follows this slide)
  • Start with an arbitrary root node xr.
  • Until all n nodes have been added to the tree
  • Of all pairs of nodes xin and xout, where xin has
    already been added to the tree but xout has not,
    find the pair with the largest I(xin, xout).
  • Add xout to the tree with xin as its parent.
  • Can be done in O(n^2) time (assuming D has already
    been reduced to sufficient statistics)
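
A sketch of the Prim-style construction just described, reusing the `mutual_information` helper sketched on the previous slide; it returns a parent list encoding the tree (the root's parent is None). For simplicity it recomputes mutual information from the raw data rather than from cached sufficient statistics:

```python
def chow_liu_tree(data, n_bits, root=0):
    """Maximum spanning tree over mutual-information edge weights [Chow & Liu, 1968]."""
    parent = [None] * n_bits
    in_tree = {root}
    # Best known connection of each out-of-tree node to the current tree.
    best = {v: (mutual_information(data, root, v), root)
            for v in range(n_bits) if v != root}
    while len(in_tree) < n_bits:
        v = max(best, key=lambda u: best[u][0])     # largest I(x_in, x_out)
        parent[v] = best[v][1]                      # add x_out with x_in as its parent
        in_tree.add(v)
        del best[v]
        for u in best:                              # refresh candidate edges through v
            mi = mutual_information(data, v, u)
            if mi > best[u][0]:
                best[u] = (mi, v)
    return parent
```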

20
Optimal Dependency Trees for Combinatorial
Optimization
  • [Baluja & Davies, 1997]
  • Start with a dataset D initialized from the
    uniform distribution
  • Until termination criteria met
  • Build optimal dependency tree T with which to
    model D.
  • Generate K bitstrings from probability
    distribution represented by T. Evaluate them.
  • Add the best M bitstrings to D after decaying the
    weight of all datapoints already in D by a factor
    α between 0 and 1.
  • Return best bitstring ever evaluated.
  • Running time O(Kn + n^2) per iteration (sketches of the
    tree-fitting and sampling steps follow)
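
Two steps worth making concrete are fitting the tree's conditional probabilities from the dataset and generating new bitstrings from the fitted tree by ancestral sampling. A sketch under an assumed representation (the parent list from `chow_liu_tree` above, plus per-node conditionals); datapoint weights are ignored here for simplicity:

```python
import random
from collections import defaultdict, deque

def fit_tree_probs(D, parent):
    """cond_probs[i][v] = empirical P(x_i = 1 | parent of x_i has value v);
    for the root, both entries hold its marginal P(x_i = 1)."""
    cond_probs = [[0.5, 0.5] for _ in parent]
    for i, p in enumerate(parent):
        for v in (0, 1):
            rows = D if p is None else [x for x in D if x[p] == v]
            if rows:
                cond_probs[i][v] = sum(x[i] for x in rows) / len(rows)
    return cond_probs

def sample_from_tree(parent, cond_probs):
    """Ancestral sampling: sample the root from its marginal, then each child
    from its conditional given the already-sampled parent value."""
    children = defaultdict(list)
    root = None
    for i, p in enumerate(parent):
        if p is None:
            root = i
        else:
            children[p].append(i)
    x = [0] * len(parent)
    x[root] = 1 if random.random() < cond_probs[root][0] else 0
    queue = deque(children[root])
    while queue:
        i = queue.popleft()
        x[i] = 1 if random.random() < cond_probs[i][x[parent[i]]] else 0
        queue.extend(children[i])
    return x
```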

21
Tree-based Optimization vs. MIMIC
  • Tree-based optimization algorithm inspired by
    Mutual Information Maximization for Input
    Clustering (MIMIC) [De Bonet et al., 1997]
  • Learned chain-shaped networks rather than
    tree-shaped networks
  • Dataset = best N of all bitstrings ever evaluated
  • + dataset has simple, well-defined
    interpretation
  • - have to remember bitstrings
  • - seems to converge too quickly on some larger
    problems

(Diagram: example chain-shaped network over bits x4, x8, x17, x5)
22
MIMIC dataset vs. Exp. Decay dataset
  • Solving a system of linear equations the hard way
  • Error of best solution as a function of
    generation number, averaged over 10 problems
  • 81 bits

23
Graph-Coloring Example
  • Noisy Graph-Coloring example
  • For each edge connecting vertices of different
    colors, add 1 to the evaluation function with
    probability 0.5

24
Peaks problems results
  • Oops. Forgot this slide.

25
Checkerboard problem results
  • 16x16 grid of bits
  • For each bit in the middle 14x14 subgrid, add 1 to the
    evaluation for each of its four neighbors set to the
    opposite value (a sketch of this evaluation follows the
    results below)

Genetic Algorithm (avg 740)
PBIL (avg 742)
Chain (avg 760)
Tree (avg 776)
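
A possible Python rendering of this evaluation function, with the grid stored as a 16x16 list of 0/1 rows; a perfect checkerboard would score 14 x 14 x 4 = 784, consistent with the averages above:

```python
def checkerboard_score(grid):
    """+1 for each of the four neighbors holding the opposite value, counted
    only for the bits in the middle 14x14 subgrid of a 16x16 grid."""
    n = len(grid)
    score = 0
    for r in range(1, n - 1):
        for c in range(1, n - 1):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if grid[r + dr][c + dc] != grid[r][c]:
                    score += 1
    return score
```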
26
Linear Equations Results
  • 81 bits again; 9 problems, each with results
    averaged over 25 runs
  • Minimize error

27
Modeling Higher-Order Dependencies
  • The maximum spanning tree algorithm gives us the
    optimal Bayesian Network in which each node has
    at most one parent.
  • What about finding the best network in which each
    node has at most K parents, for K > 1?
  • NP-complete problem! [Chickering et al., 1995]
  • However, can use search heuristics, e.g.
    hillclimbing, to look for good network structures
    [Heckerman et al., 1995].

28
Scoring Function for Arbitrary Networks
  • Rather than restricting K directly, add penalty
    term to scoring function to limit total network
    size
  • Equivalent to priors favoring simpler network
    structures
  • Alternatively, lends itself nicely to MDL
    interpretation
  • Size of penalty controls exploration/exploitation
    tradeoff

Penalized score: log P(D | B) - λ |B|, where
λ = penalty factor
|B| = number of parameters in B
29
Bayesian Network-Based Combinatorial Optimization
  • Initialize D with C bitstrings from uniform
    distribution, and Bayesian network B to empty
    network containing no edges
  • Until termination criteria met
  • Perform steepest-ascent hillclimbing from B to
    find a locally optimal network B'. Repeat until no
    changes increase the score:
  • Evaluate how each possible edge addition, removal,
    or reversal would affect the penalized log-likelihood
    score
  • Perform the change that maximizes the increase in score.
  • Set B to B'.
  • Generate and evaluate K bitstrings from B.
  • Decay the weight of datapoints in D by α.
  • Add best M of the K recently generated datapoints
    to D.
  • Return best bit string ever evaluated.

30
Cutting Computational Costs
  • Can cache contingency tables for all possible
    one-arc changes to network structure
  • Only have to recompute scores associated with at
    most two nodes after arc added, removed, or
    reversed.
  • Prevents having to slog through the entire
    dataset recomputing score changes for every
    possible arc change when dataset changes.
  • However, kiss memory goodbye
  • Only a few network structure changes required
    after each iteration since the dataset hasn't changed
    much (one or two structural changes on average)

31
Evolution of Network Complexity
  • 100 bits; 1 added to the evaluation function for
    every bit set to the parity of the previous K bits

32
Summation Cancellation
  • Minimize magnitudes of cumulative sum of
    discretized numeric parameters (s1, ..., sn)
    represented with standard binary encoding

Average value over 50 runs of best solution found
in 2000 generations
33
Bayesian Networks Empirical Results Summary
  • Does better than Tree-based optimization
    algorithm on some toy problems
  • Significantly better on Summation Cancellation
    problem
  • 10% reduction in error on System of Linear
    Equations problems
  • Roughly the same as Tree-based algorithm on
    others, e.g. small Knapsack problems
  • Significantly more computation despite efficiency
    hacks, however.
  • Why not much better performance?
  • Too much emphasis on exploitation rather than
    exploration?
  • Steepest-ascent hillclimbing over network
    structures not good enough?

34
Using Probabilistic Models for Intelligent
Restarts
  • Tree-based algorithm's O(n^2) execution time per
    generation is very expensive for large problems
  • Even more so for more complicated Bayesian
    networks
  • One possible approach use probabilistic models
    to select good starting points for faster
    optimization algorithms, e.g. hillclimbing

35
COMIT
  • Combining Optimizers with Mutual Information
    Trees [Baluja & Davies, 1997b] (a rough sketch of the
    loop follows this slide)
  • Initialize dataset D with bitstrings drawn from
    uniform distribution
  • Until termination criteria met
  • Build optimal dependency tree T with which to
    model D.
  • Generate K bitstrings from the distribution
    represented by T. Evaluate them.
  • Execute a hillclimbing run starting from single
    best bitstring of these K.
  • Replace up to M bitstrings in D with the best
    bitstrings found during the hillclimbing run.
  • Return best bitstring ever evaluated
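
A rough sketch of this loop, reusing the `chow_liu_tree`, `fit_tree_probs`, and `sample_from_tree` sketches from earlier slides; `hillclimb(f, x)` is assumed to return the bitstrings visited during its run (an interface assumed here, not specified in the talk), and the choice to replace the worst members of D is likewise an assumption:

```python
import random

def comit(f, hillclimb, n_bits, K=1000, M=100, iterations=50, init_size=1000):
    """COMIT-style loop: model D with a dependency tree, use the best of K tree
    samples as the start point for a fast hillclimbing run, then fold the best
    bitstrings found back into D."""
    D = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(init_size)]
    best_x, best_val = None, float("-inf")
    for _ in range(iterations):
        parent = chow_liu_tree(D, n_bits)
        cond_probs = fit_tree_probs(D, parent)
        samples = [sample_from_tree(parent, cond_probs) for _ in range(K)]
        start = max(samples, key=f)                  # best of the K tree samples
        visited = hillclimb(f, start)                # fast search from that start
        top = max(visited, key=f)
        if f(top) > best_val:
            best_x, best_val = list(top), f(top)
        # Replace up to M of the worst bitstrings in D with the best visited ones
        # (worst-replacement is a design choice assumed here).
        k = min(M, len(visited))
        D = sorted(D, key=f, reverse=True)[:len(D) - k] + \
            sorted(visited, key=f, reverse=True)[:k]
    return best_x, best_val
```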

36
COMIT, cont'd
  • Empirical tests performed with a stochastic
    hillclimbing algorithm that allows at most
    PATIENCE moves to points of equal value before
    restarting (a sketch appears at the end of this slide)
  • Compare COMIT vs.
  • Hillclimbing with restarts from bitstrings chosen
    randomly from uniform distribution
  • Hillclimbing with restarts from best bitstring
    out of K chosen randomly from uniform
    distribution
  • Genetic algorithms?
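
One plausible reading of that hillclimber as a self-contained Python sketch; the extra `max_evals` bound is an added safety limit, not something described in the talk:

```python
import random

def stochastic_hillclimb(f, x, patience=1000, max_evals=100000):
    """Accept single-bit flips that improve f, reject downhill flips, and allow
    at most `patience` consecutive moves to points of equal value."""
    x = list(x)
    best_val = f(x)
    equal_moves, evals = 0, 0
    while equal_moves < patience and evals < max_evals:
        i = random.randrange(len(x))
        x[i] ^= 1                      # propose flipping one random bit
        val = f(x)
        evals += 1
        if val > best_val:
            best_val, equal_moves = val, 0
        elif val == best_val:
            equal_moves += 1           # sideways move: kept, but counted
        else:
            x[i] ^= 1                  # downhill: undo the flip
    return x, best_val
```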

37
COMIT Example of Behavior
  • TSP domain: 100 cities, 700 bits

(Plot: tour length (x10^3) vs. evaluation number (x10^3))
38
COMIT Empirical Comparisons
  • Each number is an average over at least 25 runs
  • Highlighted: better than each non-COMIT
    hillclimber with P > 95%
  • AHCxxx: pick best of xxx randomly generated
    starting points
  • COMITxxx: pick best of xxx starting points
    generated by the tree

39
Summary
  • PBIL uses a very simple probability distribution
    from which new solutions are generated, yet works
    surprisingly well
  • Algorithm using tree-based distributions seems to
    work even better, though at significantly more
    computational expense
  • More sophisticated networks past the point of
    diminishing marginal returns? Future research
  • COMIT makes tree-based algorithm applicable to
    much larger problems

40
Future Work
  • Making algorithm based on complex Bayesian
    Networks more practical
  • Combine w/ simpler search algorithms, à la COMIT?
  • Applying COMIT to more interesting problems
  • WALKSAT?
  • Using COMIT to combine results of multiple search
    algorithms
  • Optimization in real-valued state spaces. What
    sorts of PDF representations might be useful?
  • Gaussians?
  • Kernel-based representations?
  • Hierarchical representations?
  • Hands off :-)

41
Acknowledgements
  • Shumeet Baluja
  • Doug Baker, Justin Boyan, Lonnie Chrisman, Greg
    Cooper, Geoff Gordon, Andrew Moore, ?