Title: Probabilistic Modeling for Combinatorial Optimization
1. Probabilistic Modeling for Combinatorial Optimization
- Scott Davies
- School of Computer Science
- Carnegie Mellon University
- Joint work with Shumeet Baluja
2. Combinatorial Optimization
- Maximize evaluation function f(x)
- Input: a fixed-length bitstring x
- Output: a real value
- x might represent:
- job shop schedules
- TSP tours
- discretized numeric values
- etc.
- Our focus: black-box optimization
- No domain-dependent heuristics
[Diagram: a bitstring x = 1001001... is fed into the black box f(x), which returns a value such as 37.4]
3. Most Commonly Used Approaches
- Hill-climbing, simulated annealing:
- Generate candidate solutions neighboring a single current working solution (e.g., differing by one bit)
- Typically make no attempt to model how particular bits affect solution quality
- Genetic algorithms:
- Attempt to implicitly capture the dependency of solution quality on bit values by maintaining a population of candidate solutions
- Use crossover and mutation operators on population members to generate new candidate solutions
4. Using Explicit Probabilistic Models
- Maintain an explicit probability distribution P from which we generate new candidate solutions
- Initialize P to the uniform distribution
- Until termination criteria met:
- Stochastically generate K candidate solutions from P
- Evaluate them
- Update P to make it more likely to generate solutions similar to the good solutions
- Several different choices for what sort of P to use and how to update it after candidate solution evaluation (a sketch of the generic loop follows below)
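As a rough illustration, here is a minimal Python sketch of this generic loop; the `sample` and `update` callbacks standing in for the choice of P are hypothetical names, not anything from the original talk:

```python
def model_based_optimization(f, K, sample, update, iterations=100):
    """Generic loop: sample K candidates from an explicit
    distribution P, evaluate them, and shift P toward the winners.
    `sample()` draws one bitstring from P; `update(ranked)` adjusts P."""
    best_x, best_val = None, float("-inf")
    for _ in range(iterations):
        population = [sample() for _ in range(K)]    # K candidates from P
        ranked = sorted(population, key=f, reverse=True)
        if f(ranked[0]) > best_val:                  # track overall best
            best_x, best_val = ranked[0], f(ranked[0])
        update(ranked)                               # move P toward good x
    return best_x, best_val
```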
5. Probability Distributions Over Bitstrings
- Let x = (x1, x2, …, xn), where each xi can take one of the values {0, 1} and n is the length of the bitstring.
- We can factorize any distribution P(x1, …, xn) bit by bit:
- P(x1, …, xn) = P(x1) · P(x2 | x1) · P(x3 | x1, x2) ⋯ P(xn | x1, …, xn−1)
- In general, the above formula is just another way of representing a big lookup table with one entry for each of the 2^n possible bitstrings.
- Obviously too many parameters to estimate from limited data!
6. Representing Independencies with Bayesian Networks
- Graphical representation of probability distributions
- Each variable is a vertex
- Each variable's probability distribution is conditioned only on its parents in the directed acyclic graph (DAG)

[Example DAG: F = "Wean on Fire", D = "Office Door Open", I = "Ice Cream Truck Nearby", A = "Fire Alarm Activated", H = "Hear Bells"; F points to I and A, while D, I, and A point to H]

P(F, D, I, A, H) = P(F) · P(D) · P(I | F) · P(A | F) · P(H | D, I, A)
7. Bayesian Networks for Bitstring Optimization?

P(x1, …, xn) = P(x1) · P(x2 | x1) · P(x3 | x1, x2) ⋯ P(xn | x1, …, xn−1)

Yuck. Let's just assume all the bits are independent instead. (For now.)

P(x1, …, xn) = P(x1) · P(x2) · P(x3) ⋯ P(xn)

Ahhh. Much better.
8. Population-Based Incremental Learning
- Population-Based Incremental Learning (PBIL) [Baluja, 1995]
- Maintains a vector of probabilities: one independent probability P(xi) for each bit xi.
- Until termination criteria met:
- Generate a population of K bitstrings from P
- Evaluate them
- Use the best M of the K to update P as follows (see the sketch after this slide):
- P(xi) ← (1 − α) · P(xi) + α if xi is set to 1
- P(xi) ← (1 − α) · P(xi) if xi is set to 0
- or, equivalently, P(xi) ← (1 − α) · P(xi) + α · xi
- Optionally, also update P similarly with the bitwise complement of the worst of the K bitstrings
- Return best bitstring ever evaluated
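A minimal runnable sketch of PBIL as described above; the parameter names (alpha for the learning rate, etc.) are mine, and this is an illustration rather than the authors' exact implementation:

```python
import random

def pbil(f, n, K=100, M=2, alpha=0.1, generations=200):
    """PBIL: one independent probability per bit, nudged toward
    the bit values of the best M strings each generation."""
    p = [0.5] * n                                   # uniform initial P
    best_x, best_val = None, float("-inf")
    for _ in range(generations):
        pop = [[1 if random.random() < p[i] else 0 for i in range(n)]
               for _ in range(K)]
        pop.sort(key=f, reverse=True)
        if f(pop[0]) > best_val:
            best_x, best_val = pop[0][:], f(pop[0])
        for x in pop[:M]:                           # best M of the K
            for i in range(n):
                p[i] = (1 - alpha) * p[i] + alpha * x[i]
    return best_x, best_val
```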
9. PBIL vs. Discrete Learning Automata
- Equivalent to a team of discrete learning automata, one automaton per bit [Thathachar & Sastry, 1987]
- Learning automata choose actions independently, but receive a common reinforcement signal dependent on all their actions
- The PBIL update rule is equivalent to the linear reward-inaction algorithm [Hilgard & Bower, 1975], with "success" defined as being best in the bunch
- However, discrete learning automata were typically used previously on problems with few variables but noisy evaluation functions
10. PBIL vs. Genetic Algorithms
- PBIL originated as a tool for understanding GA behavior
- Similar to Bit-Based Simulated Crossover (BSC) [Syswerda, 1993]:
- Regenerates P from scratch after every generation
- All K are used to update P, weighted according to the probabilities that a GA would have selected them for reproduction
- Why might normal GAs be better?
- They implicitly capture inter-bit dependencies with the population
- However, because the model is only implicit, crossover must be randomized. Also, limited population size often leads to premature convergence based on noise in the samples.
11. Four Peaks Problem
- Problem used in [Baluja & Caruana, 1995] to test how well GAs maintain multiple solutions before converging.
- Given an input vector X with N bits, and a difficulty parameter T:
- FourPeaks(T, X) = MAX(head(1, X), tail(0, X)) + Bonus(T, X)
- head(b, X) = # of contiguous leading bits in X set to b
- tail(b, X) = # of contiguous trailing bits in X set to b
- Bonus(T, X) = 100 if (head(1, X) > T) AND (tail(0, X) > T), or 0 otherwise
- (A direct implementation follows below.)
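A direct Python transcription of the definitions above, assuming X is a list of 0/1 values:

```python
def head(b, X):
    """Number of contiguous leading bits of X equal to b."""
    n = 0
    for bit in X:
        if bit != b:
            break
        n += 1
    return n

def tail(b, X):
    """Number of contiguous trailing bits of X equal to b."""
    return head(b, X[::-1])

def four_peaks(T, X):
    """FourPeaks(T, X) = max(head(1,X), tail(0,X)) + bonus."""
    bonus = 100 if head(1, X) > T and tail(0, X) > T else 0
    return max(head(1, X), tail(0, X)) + bonus
```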
12. Four Peaks Problem Results
- 80 GA settings tried; the best five are shown here
- Averages over 50 runs
13. Large-Scale PBIL Empirical Comparison
14. Modeling Inter-Bit Dependencies
- How about automatically learning probability distributions in which at least some dependencies between variables are modeled?
- Problem statement: given a dataset D and a set of allowable Bayesian networks {Bi}, find the Bi with the maximum posterior probability
15. Equations. (Mwuh hah hah hah!)

The data log-likelihood of a network B is

log P(D | B) = Σj Σi log P(xi = dij | Πi = πij)

- Where:
- dj is the jth datapoint
- dij is the value assigned to xi by dj
- Πi is the set of xi's parents in B
- πij is the set of values assigned to Πi by dj
- P̂ is the empirical probability distribution exhibited by D
16. Mmmm... Entropy.

Pick B to maximize log P(D | B) = Σj log P(dj | B).

- Fact: given that B has a network structure S, the optimal probabilities to use in B are just the probabilities in D, i.e. P̂. So:

Pick S to maximize −|D| · Σi H(xi | Πi), where H(xi | Πi) is the conditional entropy, under P̂, of xi given its parents.
17. Entropy Calculation Example

[Worked example: x20's contribution to the score, computed from counts over 80 total datapoints, using H(p, q) = −p log p − q log q]
18. Single-Parent Tree-Shaped Networks
- Now let's allow each bit to be conditioned on at most one other bit.
- Adding an arc from xj to xi increases the network score by:
- H(xi) − H(xi | xj) (xj's information gain with xi)
- = H(xj) − H(xj | xi) (not necessarily obvious, but true)
- = I(xi, xj), the mutual information between xi and xj
19. Optimal Single-Parent Tree-Shaped Networks
- To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow & Liu, 1968]:
- Start with an arbitrary root node xr.
- Until all n nodes have been added to the tree:
- Of all pairs of nodes xin and xout, where xin has already been added to the tree but xout has not, find the pair with the largest I(xin, xout).
- Add xout to the tree with xin as its parent.
- Can be done in O(n^2) time (assuming D has already been reduced to sufficient statistics); a sketch follows below.
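A compact Python sketch of this Prim-style construction, assuming the pairwise mutual information has already been estimated from D into a matrix mi[i][j] (the estimation code is omitted). As written, the inner scan makes this O(n^3); the O(n^2) version additionally caches, for each node outside the tree, its best edge into the tree:

```python
def chow_liu_tree(mi, n, root=0):
    """Grow a maximum spanning tree over n variables, Prim-style,
    with mutual information mi[i][j] as edge weights. Returns a dict
    mapping each node to its parent (the root maps to None)."""
    parent = {root: None}
    while len(parent) < n:
        best = None
        for x_in in parent:                   # nodes already in the tree
            for x_out in range(n):            # candidates outside the tree
                if x_out in parent:
                    continue
                if best is None or mi[x_in][x_out] > best[0]:
                    best = (mi[x_in][x_out], x_in, x_out)
        _, x_in, x_out = best
        parent[x_out] = x_in                  # condition x_out on x_in
    return parent
```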
20. Optimal Dependency Trees for Combinatorial Optimization
- [Baluja & Davies, 1997]
- Start with a dataset D initialized from the uniform distribution
- Until termination criteria met:
- Build the optimal dependency tree T with which to model D.
- Generate K bitstrings from the probability distribution represented by T. Evaluate them.
- Add the best M bitstrings to D after decaying the weight of all datapoints already in D by a factor α between 0 and 1 (see the dataset sketch below).
- Return best bitstring ever evaluated.
- Running time: O(Kn + n^2) per iteration
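One plausible way to implement the decayed dataset is to keep only weighted pairwise sufficient statistics, which is all the tree-builder needs; the class below is a hypothetical sketch, not the authors' code:

```python
class DecayedDataset:
    """Weighted pairwise counts over bit pairs; old datapoints fade
    by a factor alpha each generation. These sufficient statistics
    are enough to estimate the mutual information I(xi, xj)."""
    def __init__(self, n, alpha=0.9):
        self.n, self.alpha = n, alpha
        # counts[(i, j)][(a, b)] = weighted count of strings with xi=a, xj=b
        self.counts = {(i, j): {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
                       for i in range(n) for j in range(n)}

    def decay(self):
        """Multiply every stored count by alpha (0 < alpha < 1)."""
        for table in self.counts.values():
            for key in table:
                table[key] *= self.alpha

    def add(self, x):
        """Fold one bitstring x into the weighted counts."""
        for i in range(self.n):
            for j in range(self.n):
                self.counts[(i, j)][(x[i], x[j])] += 1.0
```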
21. Tree-Based Optimization vs. MIMIC
- Tree-based optimization algorithm inspired by Mutual Information Maximization for Input Clustering (MIMIC) [De Bonet et al., 1997]
- MIMIC learned chain-shaped networks rather than tree-shaped networks
- MIMIC's dataset: the best N of all bitstrings ever evaluated
- + dataset has a simple, well-defined interpretation
- − have to remember bitstrings
- − seems to converge too quickly on some larger problems

[Diagram: a chain-shaped network over bits, e.g. x4 → x8 → x17 → x5]
22. MIMIC Dataset vs. Exp. Decay Dataset
- Solving a system of linear equations the hard way
- Error of best solution as a function of generation number, averaged over 10 problems
- 81 bits
23. Graph-Coloring Example
- Noisy graph-coloring example (a sketch of the evaluation function follows below)
- For each edge whose endpoints have different colors, add 1 to the evaluation function with probability 0.5
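A minimal sketch of that noisy evaluation function, with hypothetical argument conventions (edges as vertex pairs, coloring as a dict from vertex to color):

```python
import random

def noisy_coloring_score(edges, coloring):
    """For each properly colored edge (endpoints get different
    colors), add 1 with probability 0.5."""
    score = 0
    for u, v in edges:
        if coloring[u] != coloring[v] and random.random() < 0.5:
            score += 1
    return score
```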
24. Peaks Problems Results
25. Checkerboard Problem Results
- 16×16 grid of bits
- For each bit in the middle 14×14 subgrid, add 1 to the evaluation for each of its four neighbors set to the opposite value (see the sketch below)

Results:
- Genetic Algorithm (avg 740)
- PBIL (avg 742)
- Chain (avg 760)
- Tree (avg 776)
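A direct implementation of the checkerboard evaluation just described; note the maximum possible score is 14·14·4 = 784, consistent with the averages above:

```python
def checkerboard_score(grid):
    """grid: 16x16 list of lists of 0/1 bits. For each bit in the
    middle 14x14 subgrid, add 1 for each of its four neighbors
    holding the opposite value (maximum score 784)."""
    score = 0
    for r in range(1, 15):
        for c in range(1, 15):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if grid[r][c] != grid[r + dr][c + dc]:
                    score += 1
    return score
```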
26. Linear Equations Results
- 81 bits again; 9 problems, each with results averaged over 25 runs
- Minimize error
27. Modeling Higher-Order Dependencies
- The maximum spanning tree algorithm gives us the optimal Bayesian network in which each node has at most one parent.
- What about finding the best network in which each node has at most K parents, for K > 1?
- NP-complete problem! [Chickering et al., 1995]
- However, we can use search heuristics to look for good network structures (e.g., [Heckerman et al., 1995]), e.g. hillclimbing.
28. Scoring Function for Arbitrary Networks
- Rather than restricting K directly, add a penalty term to the scoring function to limit total network size:

Score(B) = log P(D | B) − λ · |B|

- λ = penalty factor
- |B| = number of parameters in B
- Equivalent to priors favoring simpler network structures
- Alternatively, lends itself nicely to an MDL interpretation
- The size of the penalty controls the exploration/exploitation tradeoff
29. Bayesian Network-Based Combinatorial Optimization
- Initialize D with C bitstrings from the uniform distribution, and Bayesian network B to the empty network containing no edges
- Until termination criteria met:
- Perform steepest-ascent hillclimbing from B to find a locally optimal network B′. Repeat until no changes increase the score:
- Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score (see the scoring sketch below)
- Perform the change that maximizes the increase in score.
- Set B to B′.
- Generate and evaluate K bitstrings from B.
- Decay the weight of datapoints in D by α.
- Add the best M of the K recently generated datapoints to D.
- Return best bitstring ever evaluated.
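A simplified sketch of the scoring step under stated assumptions: D is a plain list of equally weighted bitstrings (no decay), score_node computes node i's contribution to the penalized log-likelihood, and only the endpoints of a changed arc need rescoring. All names here are hypothetical:

```python
import math
from collections import Counter

def score_node(D, i, parents, lam=1.0):
    """Node i's contribution to the penalized log-likelihood:
    sum over datapoints of log P-hat(x_i | parent values),
    minus lam times the number of parameters at node i."""
    ps = sorted(parents)
    joint = Counter((tuple(x[p] for p in ps), x[i]) for x in D)
    marg = Counter(tuple(x[p] for p in ps) for x in D)
    loglik = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    return loglik - lam * 2 ** len(ps)   # one parameter per parent config

def gain_from_adding_arc(D, i, j, parents, lam=1.0):
    """Score change from adding arc x_j -> x_i: only node i's
    term moves, so rescore just that node."""
    return (score_node(D, i, parents[i] | {j}, lam)
            - score_node(D, i, parents[i], lam))
```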
30. Cutting Computational Costs
- Can cache contingency tables for all possible one-arc changes to the network structure
- Only have to recompute the scores associated with at most two nodes after an arc is added, removed, or reversed
- Prevents having to slog through the entire dataset recomputing score changes for every possible arc change whenever the dataset changes
- However, kiss memory goodbye
- Only a few network structure changes are required after each iteration, since the dataset hasn't changed much (one or two structural changes on average)
31. Evolution of Network Complexity
- 100 bits; 1 added to the evaluation function for every bit set to the parity of the previous K bits
32. Summation Cancellation
- Minimize the magnitudes of cumulative sums of discretized numeric parameters (s1, …, sn) represented with a standard binary encoding
- Average value over 50 runs of the best solution found in 2000 generations
33. Bayesian Networks: Empirical Results Summary
- Does better than the tree-based optimization algorithm on some toy problems
- Significantly better on the Summation Cancellation problem
- 10% reduction in error on the System of Linear Equations problems
- Roughly the same as the tree-based algorithm on others, e.g. small Knapsack problems
- Significantly more computation despite the efficiency hacks, however
- Why not much better performance?
- Too much emphasis on exploitation rather than exploration?
- Steepest-ascent hillclimbing over network structures not good enough?
34. Using Probabilistic Models for Intelligent Restarts
- The tree-based algorithm's O(n^2) execution time per generation is very expensive for large problems
- Even more so for more complicated Bayesian networks
- One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing
35. COMIT
- Combining Optimizers with Mutual Information Trees [Baluja & Davies, 1997b]
- Initialize dataset D with bitstrings drawn from the uniform distribution
- Until termination criteria met (see the sketch below):
- Build the optimal dependency tree T with which to model D.
- Generate K bitstrings from the distribution represented by T. Evaluate them.
- Execute a hillclimbing run starting from the single best bitstring of these K.
- Replace up to M bitstrings in D with the best bitstrings found during the hillclimbing run.
- Return best bitstring ever evaluated
36. COMIT, cont'd
- Empirical tests performed with a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before restarting
- Compare COMIT vs.:
- Hillclimbing with restarts from bitstrings chosen randomly from the uniform distribution
- Hillclimbing with restarts from the best bitstring out of K chosen randomly from the uniform distribution
- Genetic algorithms?
37. COMIT: Example of Behavior
- TSP domain: 100 cities, 700 bits

[Plot: tour length (×10^3) vs. evaluation number (×10^3)]
38. COMIT: Empirical Comparisons
- Each number is an average over at least 25 runs
- Highlighted: better than each non-COMIT hillclimber with P > 95%
- AHCxxx: pick the best of xxx randomly generated starting points
- COMITxxx: pick the best of xxx starting points generated by the tree
39. Summary
- PBIL uses a very simple probability distribution from which new solutions are generated, yet works surprisingly well
- The algorithm using tree-based distributions seems to work even better, though at significantly more computational expense
- Are more sophisticated networks past the point of diminishing marginal returns? Future research
- COMIT makes the tree-based algorithm applicable to much larger problems
40. Future Work
- Making the algorithm based on complex Bayesian networks more practical
- Combine with simpler search algorithms, à la COMIT?
- Applying COMIT to more interesting problems
- WALKSAT?
- Using COMIT to combine the results of multiple search algorithms
- Optimization in real-valued state spaces. What sorts of PDF representations might be useful?
- Gaussians?
- Kernel-based representations?
- Hierarchical representations?
- Hands off! :-)
41. Acknowledgements
- Shumeet Baluja
- Doug Baker, Justin Boyan, Lonnie Chrisman, Greg Cooper, Geoff Gordon, Andrew Moore, ?