Title: Probabilistic Modeling for Combinatorial Optimization
1. Probabilistic Modeling for Combinatorial Optimization
- Scott Davies
- School of Computer Science
- Carnegie Mellon University
- Joint work with Shumeet Baluja
2. Combinatorial Optimization
- Maximize evaluation function f(x)
- Input: a fixed-length bitstring x
- Output: a real value
- x might represent:
- job shop schedules
- TSP tours
- discretized numeric values
- etc.
- Our focus: black-box optimization
- No domain-dependent heuristics
[Diagram: a bitstring x = 1001001... is fed into the black box f(x), which returns a value such as 37.4]
3. Most Commonly Used Approaches
- Hill-climbing, simulated annealing:
- Generate candidate solutions neighboring a single current working solution (e.g., differing by one bit)
- Typically make no attempt to model how particular bits affect solution quality
- Genetic algorithms:
- Attempt to implicitly capture the dependency of solution quality on bit values by maintaining a population of candidate solutions
- Use crossover and mutation operators on population members to generate new candidate solutions
4. Using Explicit Probabilistic Models
- Maintain an explicit probability distribution P from which we generate new candidate solutions
- Initialize P to the uniform distribution
- Until termination criteria met:
- Stochastically generate K candidate solutions from P
- Evaluate them
- Update P to make it more likely to generate solutions similar to the good solutions
- Several different choices for what sort of P to use and how to update it after candidate solution evaluation (a sketch of the generic loop follows below)
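As a rough illustration, here is a minimal Python sketch of this generic loop; the `sample` and `update` callbacks standing in for the choice of P are hypothetical names, not anything from the original talk:

```python
def model_based_optimization(f, K, sample, update, iterations=100):
    """Generic loop: sample K candidates from an explicit
    distribution P, evaluate them, and shift P toward the winners.
    `sample()` draws one bitstring from P; `update(ranked)` adjusts P."""
    best_x, best_val = None, float("-inf")
    for _ in range(iterations):
        population = [sample() for _ in range(K)]    # K candidates from P
        ranked = sorted(population, key=f, reverse=True)
        if f(ranked[0]) > best_val:                  # track overall best
            best_x, best_val = ranked[0], f(ranked[0])
        update(ranked)                               # move P toward good x
    return best_x, best_val
```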
5. Probability Distributions Over Bitstrings
- Let x = (x1, x2, …, xn), where each xi can take one of the values {0, 1} and n is the length of the bitstring.
- We can factorize any distribution P(x1, …, xn) bit by bit:
- P(x1, …, xn) = P(x1) · P(x2 | x1) · P(x3 | x1, x2) ⋯ P(xn | x1, …, xn−1)
- In general, the above formula is just another way of representing a big lookup table with one entry for each of the 2^n possible bitstrings.
- Obviously too many parameters to estimate from limited data!
6. Representing Independencies with Bayesian Networks
- Graphical representation of probability distributions
- Each variable is a vertex
- Each variable's probability distribution is conditioned only on its parents in the directed acyclic graph (DAG)

[Example DAG: F = "Wean on Fire", D = "Office Door Open", I = "Ice Cream Truck Nearby", A = "Fire Alarm Activated", H = "Hear Bells"; F points to I and A, while D, I, and A point to H]

P(F, D, I, A, H) = P(F) · P(D) · P(I | F) · P(A | F) · P(H | D, I, A)
7. Bayesian Networks for Bitstring Optimization?

P(x1, …, xn) = P(x1) · P(x2 | x1) · P(x3 | x1, x2) ⋯ P(xn | x1, …, xn−1)

Yuck. Let's just assume all the bits are independent instead. (For now.)

P(x1, …, xn) = P(x1) · P(x2) · P(x3) ⋯ P(xn)

Ahhh. Much better.
8. Population-Based Incremental Learning
- Population-Based Incremental Learning (PBIL) [Baluja, 1995]
- Maintains a vector of probabilities: one independent probability P(xi) for each bit xi.
- Until termination criteria met:
- Generate a population of K bitstrings from P
- Evaluate them
- Use the best M of the K to update P as follows (see the sketch after this slide):
- P(xi) ← (1 − α) · P(xi) + α if xi is set to 1
- P(xi) ← (1 − α) · P(xi) if xi is set to 0
- or, equivalently, P(xi) ← (1 − α) · P(xi) + α · xi
- Optionally, also update P similarly with the bitwise complement of the worst of the K bitstrings
- Return best bitstring ever evaluated
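A minimal runnable sketch of PBIL as described above; the parameter names (alpha for the learning rate, etc.) are mine, and this is an illustration rather than the authors' exact implementation:

```python
import random

def pbil(f, n, K=100, M=2, alpha=0.1, generations=200):
    """PBIL: one independent probability per bit, nudged toward
    the bit values of the best M strings each generation."""
    p = [0.5] * n                                   # uniform initial P
    best_x, best_val = None, float("-inf")
    for _ in range(generations):
        pop = [[1 if random.random() < p[i] else 0 for i in range(n)]
               for _ in range(K)]
        pop.sort(key=f, reverse=True)
        if f(pop[0]) > best_val:
            best_x, best_val = pop[0][:], f(pop[0])
        for x in pop[:M]:                           # best M of the K
            for i in range(n):
                p[i] = (1 - alpha) * p[i] + alpha * x[i]
    return best_x, best_val
```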
9. PBIL vs. Discrete Learning Automata
- Equivalent to a team of discrete learning automata, one automaton per bit [Thathachar & Sastry, 1987]
- Learning automata choose actions independently, but receive a common reinforcement signal dependent on all their actions
- The PBIL update rule is equivalent to the linear reward-inaction algorithm [Hilgard & Bower, 1975], with "success" defined as being best in the bunch
- However, discrete learning automata were typically used previously on problems with few variables but noisy evaluation functions
10. PBIL vs. Genetic Algorithms
- PBIL originated as a tool for understanding GA behavior
- Similar to Bit-Based Simulated Crossover (BSC) [Syswerda, 1993]:
- Regenerates P from scratch after every generation
- All K are used to update P, weighted according to the probabilities that a GA would have selected them for reproduction
- Why might normal GAs be better?
- They implicitly capture inter-bit dependencies with the population
- However, because the model is only implicit, crossover must be randomized. Also, limited population size often leads to premature convergence based on noise in the samples.
11. Four Peaks Problem
- Problem used in [Baluja & Caruana, 1995] to test how well GAs maintain multiple solutions before converging.
- Given an input vector X with N bits, and a difficulty parameter T:
- FourPeaks(T, X) = MAX(head(1, X), tail(0, X)) + Bonus(T, X)
- head(b, X) = # of contiguous leading bits in X set to b
- tail(b, X) = # of contiguous trailing bits in X set to b
- Bonus(T, X) = 100 if (head(1, X) > T) AND (tail(0, X) > T), or 0 otherwise
- (A direct implementation follows below.)
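A direct Python transcription of the definitions above, assuming X is a list of 0/1 values:

```python
def head(b, X):
    """Number of contiguous leading bits of X equal to b."""
    n = 0
    for bit in X:
        if bit != b:
            break
        n += 1
    return n

def tail(b, X):
    """Number of contiguous trailing bits of X equal to b."""
    return head(b, X[::-1])

def four_peaks(T, X):
    """FourPeaks(T, X) = max(head(1,X), tail(0,X)) + bonus."""
    bonus = 100 if head(1, X) > T and tail(0, X) > T else 0
    return max(head(1, X), tail(0, X)) + bonus
```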
12. Four Peaks Problem Results
- 80 GA settings tried; the best five are shown here
- Averages over 50 runs
13. Large-Scale PBIL Empirical Comparison
14. Modeling Inter-Bit Dependencies
- How about automatically learning probability distributions in which at least some dependencies between variables are modeled?
- Problem statement: given a dataset D and a set of allowable Bayesian networks {Bi}, find the Bi with the maximum posterior probability
15. Equations. (Mwuh hah hah hah!)

The data log-likelihood of a network B is

log P(D | B) = Σj Σi log P(xi = dij | Πi = πij)

- Where:
- dj is the jth datapoint
- dij is the value assigned to xi by dj
- Πi is the set of xi's parents in B
- πij is the set of values assigned to Πi by dj
- P̂ is the empirical probability distribution exhibited by D
16. Mmmm... Entropy.

Pick B to maximize log P(D | B) = Σj log P(dj | B).

- Fact: given that B has a network structure S, the optimal probabilities to use in B are just the probabilities in D, i.e. P̂. So:

Pick S to maximize −|D| · Σi H(xi | Πi), where H(xi | Πi) is the conditional entropy, under P̂, of xi given its parents.
17. Entropy Calculation Example

[Worked example: x20's contribution to the score, computed from counts over 80 total datapoints, using H(p, q) = −p log p − q log q]
18. Single-Parent Tree-Shaped Networks
- Now let's allow each bit to be conditioned on at most one other bit.
- Adding an arc from xj to xi increases the network score by:
- H(xi) − H(xi | xj) (xj's information gain with xi)
- = H(xj) − H(xj | xi) (not necessarily obvious, but true)
- = I(xi, xj), the mutual information between xi and xj
19. Optimal Single-Parent Tree-Shaped Networks
- To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow & Liu, 1968]:
- Start with an arbitrary root node xr.
- Until all n nodes have been added to the tree:
- Of all pairs of nodes xin and xout, where xin has already been added to the tree but xout has not, find the pair with the largest I(xin, xout).
- Add xout to the tree with xin as its parent.
- Can be done in O(n^2) time (assuming D has already been reduced to sufficient statistics); a sketch follows below.
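A compact Python sketch of this Prim-style construction, assuming the pairwise mutual information has already been estimated from D into a matrix mi[i][j] (the estimation code is omitted). As written, the inner scan makes this O(n^3); the O(n^2) version additionally caches, for each node outside the tree, its best edge into the tree:

```python
def chow_liu_tree(mi, n, root=0):
    """Grow a maximum spanning tree over n variables, Prim-style,
    with mutual information mi[i][j] as edge weights. Returns a dict
    mapping each node to its parent (the root maps to None)."""
    parent = {root: None}
    while len(parent) < n:
        best = None
        for x_in in parent:                   # nodes already in the tree
            for x_out in range(n):            # candidates outside the tree
                if x_out in parent:
                    continue
                if best is None or mi[x_in][x_out] > best[0]:
                    best = (mi[x_in][x_out], x_in, x_out)
        _, x_in, x_out = best
        parent[x_out] = x_in                  # condition x_out on x_in
    return parent
```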
20. Optimal Dependency Trees for Combinatorial Optimization
- [Baluja & Davies, 1997]
- Start with a dataset D initialized from the uniform distribution
- Until termination criteria met:
- Build the optimal dependency tree T with which to model D.
- Generate K bitstrings from the probability distribution represented by T. Evaluate them.
- Add the best M bitstrings to D after decaying the weight of all datapoints already in D by a factor α between 0 and 1 (see the dataset sketch below).
- Return best bitstring ever evaluated.
- Running time: O(Kn + n^2) per iteration
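One plausible way to implement the decayed dataset is to keep only weighted pairwise sufficient statistics, which is all the tree-builder needs; the class below is a hypothetical sketch, not the authors' code:

```python
class DecayedDataset:
    """Weighted pairwise counts over bit pairs; old datapoints fade
    by a factor alpha each generation. These sufficient statistics
    are enough to estimate the mutual information I(xi, xj)."""
    def __init__(self, n, alpha=0.9):
        self.n, self.alpha = n, alpha
        # counts[(i, j)][(a, b)] = weighted count of strings with xi=a, xj=b
        self.counts = {(i, j): {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
                       for i in range(n) for j in range(n)}

    def decay(self):
        """Multiply every stored count by alpha (0 < alpha < 1)."""
        for table in self.counts.values():
            for key in table:
                table[key] *= self.alpha

    def add(self, x):
        """Fold one bitstring x into the weighted counts."""
        for i in range(self.n):
            for j in range(self.n):
                self.counts[(i, j)][(x[i], x[j])] += 1.0
```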
21. Tree-Based Optimization vs. MIMIC
- Tree-based optimization algorithm inspired by Mutual Information Maximization for Input Clustering (MIMIC) [De Bonet et al., 1997]
- MIMIC learned chain-shaped networks rather than tree-shaped networks
- MIMIC's dataset: the best N of all bitstrings ever evaluated
- + dataset has a simple, well-defined interpretation
- − have to remember bitstrings
- − seems to converge too quickly on some larger problems

[Diagram: a chain-shaped network over bits, e.g. x4 → x8 → x17 → x5]
22. MIMIC Dataset vs. Exp. Decay Dataset
- Solving a system of linear equations the hard way
- Error of best solution as a function of generation number, averaged over 10 problems
- 81 bits
23. Graph-Coloring Example
- Noisy graph-coloring example (a sketch of the evaluation function follows below)
- For each edge whose endpoints have different colors, add 1 to the evaluation function with probability 0.5
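A minimal sketch of that noisy evaluation function, with hypothetical argument conventions (edges as vertex pairs, coloring as a dict from vertex to color):

```python
import random

def noisy_coloring_score(edges, coloring):
    """For each properly colored edge (endpoints get different
    colors), add 1 with probability 0.5."""
    score = 0
    for u, v in edges:
        if coloring[u] != coloring[v] and random.random() < 0.5:
            score += 1
    return score
```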
24. Peaks Problems Results
25. Checkerboard Problem Results
- 16×16 grid of bits
- For each bit in the middle 14×14 subgrid, add 1 to the evaluation for each of its four neighbors set to the opposite value (see the sketch below)

Results:
- Genetic Algorithm (avg 740)
- PBIL (avg 742)
- Chain (avg 760)
- Tree (avg 776)
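A direct implementation of the checkerboard evaluation just described; note the maximum possible score is 14·14·4 = 784, consistent with the averages above:

```python
def checkerboard_score(grid):
    """grid: 16x16 list of lists of 0/1 bits. For each bit in the
    middle 14x14 subgrid, add 1 for each of its four neighbors
    holding the opposite value (maximum score 784)."""
    score = 0
    for r in range(1, 15):
        for c in range(1, 15):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if grid[r][c] != grid[r + dr][c + dc]:
                    score += 1
    return score
```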
26. Linear Equations Results
- 81 bits again; 9 problems, each with results averaged over 25 runs
- Minimize error
27. Modeling Higher-Order Dependencies
- The maximum spanning tree algorithm gives us the optimal Bayesian network in which each node has at most one parent.
- What about finding the best network in which each node has at most K parents, for K > 1?
- NP-complete problem! [Chickering et al., 1995]
- However, we can use search heuristics to look for good network structures (e.g., [Heckerman et al., 1995]), e.g. hillclimbing.
28. Scoring Function for Arbitrary Networks
- Rather than restricting K directly, add a penalty term to the scoring function to limit total network size:

Score(B) = log P(D | B) − λ · |B|

- λ = penalty factor
- |B| = number of parameters in B
- Equivalent to priors favoring simpler network structures
- Alternatively, lends itself nicely to an MDL interpretation
- The size of the penalty controls the exploration/exploitation tradeoff
29. Bayesian Network-Based Combinatorial Optimization
- Initialize D with C bitstrings from the uniform distribution, and Bayesian network B to the empty network containing no edges
- Until termination criteria met:
- Perform steepest-ascent hillclimbing from B to find a locally optimal network B′. Repeat until no changes increase the score:
- Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score (see the scoring sketch below)
- Perform the change that maximizes the increase in score.
- Set B to B′.
- Generate and evaluate K bitstrings from B.
- Decay the weight of datapoints in D by α.
- Add the best M of the K recently generated datapoints to D.
- Return best bitstring ever evaluated.
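A simplified sketch of the scoring step under stated assumptions: D is a plain list of equally weighted bitstrings (no decay), score_node computes node i's contribution to the penalized log-likelihood, and only the endpoints of a changed arc need rescoring. All names here are hypothetical:

```python
import math
from collections import Counter

def score_node(D, i, parents, lam=1.0):
    """Node i's contribution to the penalized log-likelihood:
    sum over datapoints of log P-hat(x_i | parent values),
    minus lam times the number of parameters at node i."""
    ps = sorted(parents)
    joint = Counter((tuple(x[p] for p in ps), x[i]) for x in D)
    marg = Counter(tuple(x[p] for p in ps) for x in D)
    loglik = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    return loglik - lam * 2 ** len(ps)   # one parameter per parent config

def gain_from_adding_arc(D, i, j, parents, lam=1.0):
    """Score change from adding arc x_j -> x_i: only node i's
    term moves, so rescore just that node."""
    return (score_node(D, i, parents[i] | {j}, lam)
            - score_node(D, i, parents[i], lam))
```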
30. Cutting Computational Costs
- Can cache contingency tables for all possible one-arc changes to the network structure
- Only have to recompute the scores associated with at most two nodes after an arc is added, removed, or reversed
- Prevents having to slog through the entire dataset recomputing score changes for every possible arc change whenever the dataset changes
- However, kiss memory goodbye
- Only a few network structure changes are required after each iteration, since the dataset hasn't changed much (one or two structural changes on average)
31. Evolution of Network Complexity
- 100 bits; 1 added to the evaluation function for every bit set to the parity of the previous K bits
32. Summation Cancellation
- Minimize the magnitudes of cumulative sums of discretized numeric parameters (s1, …, sn) represented with a standard binary encoding
- Average value over 50 runs of the best solution found in 2000 generations
33. Bayesian Networks: Empirical Results Summary
- Does better than the tree-based optimization algorithm on some toy problems
- Significantly better on the Summation Cancellation problem
- 10% reduction in error on the System of Linear Equations problems
- Roughly the same as the tree-based algorithm on others, e.g. small Knapsack problems
- Significantly more computation despite the efficiency hacks, however
- Why not much better performance?
- Too much emphasis on exploitation rather than exploration?
- Steepest-ascent hillclimbing over network structures not good enough?
34. Using Probabilistic Models for Intelligent Restarts
- The tree-based algorithm's O(n^2) execution time per generation is very expensive for large problems
- Even more so for more complicated Bayesian networks
- One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing
35. COMIT
- Combining Optimizers with Mutual Information Trees [Baluja & Davies, 1997b]
- Initialize dataset D with bitstrings drawn from the uniform distribution
- Until termination criteria met (see the sketch below):
- Build the optimal dependency tree T with which to model D.
- Generate K bitstrings from the distribution represented by T. Evaluate them.
- Execute a hillclimbing run starting from the single best bitstring of these K.
- Replace up to M bitstrings in D with the best bitstrings found during the hillclimbing run.
- Return best bitstring ever evaluated
36. COMIT, cont'd
- Empirical tests performed with a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before restarting
- Compare COMIT vs.:
- Hillclimbing with restarts from bitstrings chosen randomly from the uniform distribution
- Hillclimbing with restarts from the best bitstring out of K chosen randomly from the uniform distribution
- Genetic algorithms?
37. COMIT: Example of Behavior
- TSP domain: 100 cities, 700 bits

[Plot: tour length (×10^3) vs. evaluation number (×10^3)]
38. COMIT: Empirical Comparisons
- Each number is an average over at least 25 runs
- Highlighted: better than each non-COMIT hillclimber with P > 95%
- AHCxxx: pick the best of xxx randomly generated starting points
- COMITxxx: pick the best of xxx starting points generated by the tree
39. Summary
- PBIL uses a very simple probability distribution from which new solutions are generated, yet works surprisingly well
- The algorithm using tree-based distributions seems to work even better, though at significantly more computational expense
- Are more sophisticated networks past the point of diminishing marginal returns? Future research
- COMIT makes the tree-based algorithm applicable to much larger problems
40. Future Work
- Making the algorithm based on complex Bayesian networks more practical
- Combine with simpler search algorithms, à la COMIT?
- Applying COMIT to more interesting problems
- WALKSAT?
- Using COMIT to combine the results of multiple search algorithms
- Optimization in real-valued state spaces. What sorts of PDF representations might be useful?
- Gaussians?
- Kernel-based representations?
- Hierarchical representations?
- Hands off! :-)
41. Acknowledgements
- Shumeet Baluja
- Doug Baker, Justin Boyan, Lonnie Chrisman, Greg Cooper, Geoff Gordon, Andrew Moore, ?