Title: Probabilistic Modeling for Combinatorial Optimization
1. Probabilistic Modeling for Combinatorial Optimization
- Scott Davies
- School of Computer Science
- Carnegie Mellon University
- Joint work with Shumeet Baluja
2. Combinatorial Optimization
- Maximize evaluation function f(x)
  - input: fixed-length bitstring x
  - output: real value
- x might represent
  - job shop schedules
  - TSP tours
  - discretized numeric values
  - etc.
- Our focus: black-box optimization
  - No domain-dependent heuristics
- [Diagram: bitstring x = 1001001... fed into black box f(x), producing 37.4]
3. Most Commonly Used Approaches
- Hill-climbing, simulated annealing
  - Generate candidate solutions neighboring a single current working solution (e.g. differing by one bit)
  - Typically make no attempt to model how particular bits affect solution quality
- Genetic algorithms
  - Attempt to implicitly capture the dependency of solution quality on bit values by maintaining a population of candidate solutions
  - Use crossover and mutation operators on population members to generate new candidate solutions
4. Using Explicit Probabilistic Models
- Maintain an explicit probability distribution P from which we generate new candidate solutions
- Initialize P to the uniform distribution
- Until termination criteria met:
  - Stochastically generate K candidate solutions from P
  - Evaluate them
  - Update P to make it more likely to generate solutions similar to the good solutions
- Several different choices for what sorts of P to use and how to update it after candidate solution evaluation (the basic loop is sketched below)
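For concreteness, a minimal Python sketch of this generic loop. The `model` object with `sample`/`update` methods is a hypothetical interface, not something from the talk; any of the distributions discussed later plugs into it.

```python
def optimize(f, model, K=100, M=10, n_iters=100):
    """Generic model-based optimization loop (sketch).

    f     : evaluation function, bitstring -> real value
    model : hypothetical distribution object, initialized to uniform,
            with sample() -> bitstring and update(good_bitstrings)
    """
    best = None
    for _ in range(n_iters):
        # Stochastically generate K candidate solutions from P and evaluate them
        candidates = sorted((model.sample() for _ in range(K)), key=f, reverse=True)
        if best is None or f(candidates[0]) > f(best):
            best = candidates[0]
        # Update P to make solutions similar to the good ones more likely
        model.update(candidates[:M])
    return best
```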
5. Probability Distributions Over Bitstrings
- Let x = (x1, x2, ..., xn), where each xi can take one of the values {0, 1} and n is the length of the bitstring.
- Can factorize any distribution P(x1, ..., xn) bit by bit:
  - P(x1, ..., xn) = P(x1) · P(x2 | x1) · P(x3 | x1, x2) · ... · P(xn | x1, ..., xn-1)
- In general, the above formula is just another way of representing a big lookup table with one entry for each of the 2^n possible bitstrings.
- Obviously too many parameters to estimate from limited data!
6. Representing Independencies with Bayesian Networks
- Graphical representation of probability distributions
- Each variable is a vertex
- Each variable's probability distribution is conditioned only on its parents in the directed acyclic graph (DAG)
- Example: P(F, D, I, A, H) = P(F) · P(D) · P(I | F) · P(A | F) · P(H | D, I, A)
- [Example DAG: F = Wean on Fire, D = Office Door Open, I = Ice Cream Truck Nearby, A = Fire Alarm Activated, H = Hear Bells; arcs F → I, F → A, D → H, I → H, A → H]
7. Bayesian Networks for Bitstring Optimization?
- P(x1, ..., xn) = P(x1) · P(x2 | x1) · P(x3 | x1, x2) · ... · P(xn | x1, ..., xn-1)

Yuck. Let's just assume all the bits are independent instead. (For now.)

- P(x1, ..., xn) = P(x1) · P(x2) · P(x3) · ... · P(xn)

Ahhh. Much better.
8. Population-Based Incremental Learning
- Population-Based Incremental Learning (PBIL) [Baluja, 1995]
- Maintains a vector of probabilities: one independent probability P(xi) for each bit xi.
- Until termination criteria met:
  - Generate a population of K bitstrings from P
  - Evaluate them
  - Use the best M of the K to update P, with learning rate α:
    - P(xi) ← (1 − α) · P(xi) + α, if xi is set to 1, or
    - P(xi) ← (1 − α) · P(xi), if xi is set to 0
  - Optionally, also update P similarly with the bitwise complement of the worst of the K bitstrings
- Return best bitstring ever evaluated (the full loop is sketched below)
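A runnable sketch of PBIL under these assumptions; the constants (α, K, M) are illustrative, and the optional negative update from the worst bitstring is omitted.

```python
import random

def pbil(f, n_bits, K=100, M=2, alpha=0.1, n_iters=1000):
    """PBIL sketch: one independent probability per bit, nudged toward
    the best M of each generation of K samples."""
    P = [0.5] * n_bits                       # uniform initial distribution
    best, best_val = None, float("-inf")
    for _ in range(n_iters):
        pop = [[1 if random.random() < p else 0 for p in P] for _ in range(K)]
        pop.sort(key=f, reverse=True)
        val = f(pop[0])
        if val > best_val:
            best, best_val = pop[0], val
        for x in pop[:M]:                    # update rule from the slide
            for i in range(n_bits):
                # x[i] == 1 gives (1-alpha)*P[i] + alpha; x[i] == 0 gives (1-alpha)*P[i]
                P[i] = (1.0 - alpha) * P[i] + alpha * x[i]
    return best, best_val
```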
9. PBIL vs. Discrete Learning Automata
- Equivalent to a team of Discrete Learning Automata, one automaton per bit [Thathachar & Sastry, 1987]
- Learning automata choose actions independently, but receive a common reinforcement signal dependent on all their actions
- PBIL update rule equivalent to the linear reward-inaction algorithm [Hilgard & Bower, 1975] with "success" defined as "best in the bunch"
- However, Discrete Learning Automata were typically used previously on problems with few variables but noisy evaluation functions
10. PBIL vs. Genetic Algorithms
- PBIL originated as a tool for understanding GA behavior
- Similar to Bit-Based Simulated Crossover (BSC) [Syswerda, 1993]
  - Regenerates P from scratch after every generation
  - All K used to update P, weighted according to the probabilities that a GA would have selected them for reproduction
- Why might normal GAs be better?
  - Implicitly capture inter-bit dependencies with the population
  - However, because the model is only implicit, crossover must be randomized. Also, limited population size often leads to premature convergence based on noise in samples.
11. Four Peaks Problem
- Problem used in [Baluja & Caruana, 1995] to test how well GAs maintain multiple solutions before converging.
- Given input vector X with N bits and difficulty parameter T:
  - FourPeaks(T, X) = MAX(head(1, X), tail(0, X)) + Bonus(T, X)
  - head(b, X) = number of contiguous leading bits in X set to b
  - tail(b, X) = number of contiguous trailing bits in X set to b
  - Bonus(T, X) = 100 if (head(1, X) > T) AND (tail(0, X) > T), or 0 otherwise
- Should theoretically be easy for a GA to handle with single-point crossover (the definitions translate directly into the sketch below)
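The definitions above, coded directly; this sketch assumes X is a list of 0/1 ints.

```python
def head(b, x):
    """Number of contiguous leading bits in x set to b."""
    count = 0
    for bit in x:
        if bit != b:
            break
        count += 1
    return count

def tail(b, x):
    """Number of contiguous trailing bits in x set to b."""
    return head(b, x[::-1])

def four_peaks(t, x):
    """MAX(head(1, X), tail(0, X)) + Bonus(T, X), per the slide."""
    bonus = 100 if head(1, x) > t and tail(0, x) > t else 0
    return max(head(1, x), tail(0, x)) + bonus
```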
12. Four Peaks Problem Results
- Insert nasty photocopy here
13. Large-Scale PBIL Empirical Comparison
- See bug-ugly photocopied table
14. Modeling Inter-Bit Dependencies
- How about automatically learning probability distributions in which at least some dependencies between variables are modeled?
- Problem statement: given a dataset D and a set of allowable Bayesian networks {Bi}, find the Bi with the maximum posterior probability (written out below)
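Written out, this is the standard posterior over structures (a restatement, with P(Bi) the prior):

```latex
B^{*} \;=\; \arg\max_{B_i} P(B_i \mid D)
      \;=\; \arg\max_{B_i} \frac{P(D \mid B_i)\, P(B_i)}{P(D)}
```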
15. Equations. (Mwuh hah hah hah!)
- Maximize the log-likelihood: log P(D | B) = Σj Σi log P(xi = dij | Πi = πij)
- Where:
  - dj is the jth datapoint
  - dij is the value assigned to xi by dj
  - Πi is the set of xi's parents in B
  - πij is the set of values assigned to Πi by dj
  - P̂ is the empirical probability distribution exhibited by D
16. Mmmm... Entropy.
- Pick B to maximize log P(D | B)
- Fact: given that B has network structure S, the optimal probabilities to use in B are just the probabilities in D, i.e. P̂. So:
- Pick S to maximize −Σi H(xi | Πi), i.e. minimize the total conditional entropy of each bit given its parents
17. Entropy Calculation Example
- [Worked example figure: x20's contribution to the score, computed over 80 total datapoints]
- where H(p, q) = −p log p − q log q
18. Single-Parent Tree-Shaped Networks
- Now let's allow each bit to be conditioned on at most one other bit.
- Adding an arc from xj to xi increases the network score by:
  - H(xi) − H(xi | xj) (xj's information gain with xi)
  - = H(xj) − H(xj | xi) (not necessarily obvious, but true: both equal H(xi) + H(xj) − H(xi, xj))
  - = I(xi, xj), the mutual information between xi and xj
19. Optimal Single-Parent Tree-Shaped Networks
- To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow & Liu, 1968]:
  - Start with an arbitrary root node xr.
  - Until all n nodes have been added to the tree:
    - Of all pairs of nodes xin and xout, where xin has already been added to the tree but xout has not, find the pair with the largest I(xin, xout).
    - Add xout to the tree with xin as its parent.
- Can be done in O(n²) time, assuming D has already been reduced to sufficient statistics (see the sketch below)
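A sketch of this Prim-style construction over a dataset D of bit-lists. For the stated O(n²) bound one would precompute all pairwise I values from sufficient statistics; this version recomputes them inline for clarity.

```python
import math
from collections import Counter

def mutual_info(D, i, j):
    """Empirical mutual information I(x_i, x_j) over dataset D."""
    n = len(D)
    joint = Counter((x[i], x[j]) for x in D)
    pi = Counter(x[i] for x in D)
    pj = Counter(x[j] for x in D)
    # sum over observed (a, b): p(a,b) * log(p(a,b) / (p(a) p(b)))
    return sum((c / n) * math.log(c * n / (pi[a] * pj[b]))
               for (a, b), c in joint.items())

def chow_liu_tree(D, n_bits):
    """Return {node: parent} for the maximum spanning tree network
    built with I(x_i, x_j) edge weights [Chow & Liu, 1968]."""
    parent = {0: None}                       # arbitrary root x_0
    while len(parent) < n_bits:
        x_in, x_out = max(((i, o) for i in parent
                           for o in range(n_bits) if o not in parent),
                          key=lambda e: mutual_info(D, e[0], e[1]))
        parent[x_out] = x_in                 # add x_out with x_in as its parent
    return parent
```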
20. Optimal Dependency Trees for Combinatorial Optimization
- [Baluja & Davies, 1997]
- Start with a dataset D initialized from the uniform distribution
- Until termination criteria met:
  - Build the optimal dependency tree T with which to model D.
  - Generate K bitstrings from the probability distribution represented by T. Evaluate them.
  - Add the best M bitstrings to D after decaying the weight of all datapoints already in D by a factor α between 0 and 1.
- Return best bitstring ever evaluated.
- Running time: O(Kn + n²) per iteration (one iteration is sketched below)
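One iteration of the loop above as a sketch; `fit_tree` (a weighted Chow-Liu fit) and `sample_tree` (ancestral sampling from the tree) are assumed helpers, and D holds [weight, bitstring] pairs.

```python
def tree_optimization_step(f, D, n_bits, K=100, M=10, alpha=0.99):
    """One generation of the [Baluja & Davies, 1997] algorithm (sketch)."""
    T = fit_tree(D, n_bits)                       # assumed weighted Chow-Liu fit
    samples = sorted((sample_tree(T) for _ in range(K)), key=f, reverse=True)
    for entry in D:                               # decay existing datapoints
        entry[0] *= alpha
    D.extend([1.0, x] for x in samples[:M])       # add best M at full weight
    return samples[0]                             # best bitstring this generation
```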
21. Tree-Based Optimization vs. MIMIC
- Tree-based optimization algorithm inspired by Mutual Information Maximization for Input Clustering (MIMIC) [De Bonet et al., 1997], which:
  - Learned chain-shaped networks rather than tree-shaped networks
  - Dataset: best N of all bitstrings ever evaluated
    - (+) dataset has a simple, well-defined interpretation
    - (−) have to remember bitstrings
    - (−) seems to converge too quickly on some larger problems
- [Figure: example networks over bits x4, x8, x17, x5]
22. MIMIC Dataset vs. Exp. Decay Dataset
- Insert skanky photocopy here
23. Graph-Coloring Example
- Photocopy or PostScript inclusion
24. Peaks Problems Results
25. Tree-Max Problem Results
26. Checkerboard Problem Results
27. Linear Equations Results
28. Modeling Higher-Order Dependencies
- The maximum spanning tree algorithm gives us the optimal Bayesian network in which each node has at most one parent.
- What about finding the best network in which each node has at most K parents, for K > 1?
  - NP-complete problem! [Chickering et al., 1995]
- However, can use search heuristics to look for good network structures (e.g., [Heckerman et al., 1995]), e.g. hillclimbing.
29. Scoring Function for Arbitrary Networks
- Rather than restricting K directly, add a penalty term to the scoring function to limit total network size:
  - Score(B) = log P(D | B) − λ · |B|
  - λ = penalty factor; |B| = number of parameters in B
- Equivalent to priors favoring simpler network structures
- Alternatively, lends itself nicely to an MDL interpretation
- Size of the penalty controls the exploration/exploitation tradeoff
30. Bayesian Network-Based Combinatorial Optimization
- Initialize D with C bitstrings from the uniform distribution, and Bayesian network B to the empty network containing no edges
- Until termination criteria met:
  - Perform steepest-ascent hillclimbing from B to find a locally optimal network B′ (sketched below). Repeat until no changes increase the score:
    - Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score
    - Perform the change that maximizes the increase in score.
  - Set B to B′.
  - Generate and evaluate K bitstrings from B.
  - Decay the weight of datapoints in D by α.
  - Add the best M of the K recently generated datapoints to D.
- Return best bitstring ever evaluated.
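A sketch of the inner structure search. `score` is an assumed function returning the penalized log-likelihood of an edge set (and −∞ for graphs with cycles), so only legal DAG moves survive the max.

```python
def hillclimb_structure(score, n_bits, edges=None):
    """Steepest-ascent hillclimbing over network structures (sketch).
    Considers every single-arc addition, removal, or reversal."""
    edges = set() if edges is None else set(edges)
    while True:
        moves = []
        for i in range(n_bits):
            for j in range(n_bits):
                if i == j:
                    continue
                if (i, j) in edges:
                    moves.append(edges - {(i, j)})               # remove arc
                    moves.append((edges - {(i, j)}) | {(j, i)})  # reverse arc
                elif (j, i) not in edges:
                    moves.append(edges | {(i, j)})               # add arc
        best = max(moves, key=score)
        if score(best) <= score(edges):
            return edges                                         # local optimum
        edges = best
```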
31. Cutting Computational Costs
- Can cache contingency tables for all possible one-arc changes to the network structure
- Only have to recompute the scores associated with at most two nodes after an arc is added, removed, or reversed
- Prevents having to slog through the entire dataset recomputing score changes for every possible arc change when the dataset changes
- However, kiss memory goodbye
- Only a few network structure changes required after each iteration, since the dataset hasn't changed much (one or two structural changes on average)
32. Evolution of Network Complexity
- Placeholder for parity-based network complexity graph
33. Summation Cancellation
- Minimize the magnitudes of the cumulative sums of discretized numeric parameters (s1, ..., sn) represented with a standard binary encoding (a sketch of the objective follows below)
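A hedged sketch of the objective as stated on the slide; the exact scaling used in the reported experiments may differ.

```python
def summation_cancellation(s):
    """Negated sum of |cumulative sums| of the decoded parameters s_1..s_n,
    so larger is better for a maximizer (sketch; scaling is an assumption)."""
    total, running = 0.0, 0.0
    for v in s:
        running += v           # cumulative sum s_1 + ... + s_i
        total += abs(running)  # penalize its magnitude
    return -total
```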
- [Results figure: average value over 50 runs of the best solution found in 2000 generations]
34. Bayesian Networks: Empirical Results Summary
- Does better than the tree-based optimization algorithm on some toy problems:
  - Significantly better on the Summation Cancellation problem
  - 10% reduction in error on the System of Linear Equations problems
- Roughly the same as the tree-based algorithm on others, e.g. small Knapsack problems
- Significantly more computation despite the efficiency hacks, however
- Why not much better performance?
  - Too much emphasis on exploitation rather than exploration?
  - Steepest-ascent hillclimbing over network structures not good enough?
35. Using Probabilistic Models for Intelligent Restarts
- Tree-based algorithm's O(n²) execution time per generation is very expensive for large problems
- Even more so for more complicated Bayesian networks
- One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing
36. COMIT
- Combining Optimizers with Mutual Information Trees (COMIT) [Baluja & Davies, 1997b]
- Initialize dataset D with bitstrings drawn from the uniform distribution
- Until termination criteria met:
  - Build the optimal dependency tree T with which to model D.
  - Generate K bitstrings from the distribution represented by T. Evaluate them.
  - Execute a hillclimbing run starting from the single best bitstring of these K.
  - Replace up to M bitstrings in D with the best bitstrings found during the hillclimbing run.
- Return best bitstring ever evaluated (one iteration is sketched below)
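One iteration of COMIT as a sketch; `fit_tree` and `sample_tree` (as in the earlier tree sketch) and `hillclimb` (an assumed local optimizer returning the bitstrings visited on its run) are hypothetical helpers. Which M datapoints get replaced is a design choice here; the slide only says "up to M".

```python
def comit_step(f, D, n_bits, K=100, M=10):
    """One iteration of COMIT [Baluja & Davies, 1997b] (sketch)."""
    T = fit_tree(D, n_bits)                       # model the current dataset
    samples = [sample_tree(T) for _ in range(K)]  # K candidates from the tree
    start = max(samples, key=f)                   # single best of the K
    visited = hillclimb(f, start)                 # assumed hillclimbing run
    # Replace up to M of the worst datapoints in D with the best visited
    D.sort(key=f)
    D[:M] = sorted(visited, key=f, reverse=True)[:M]
    return max(visited, key=f)
```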
37. COMIT, cont'd
- Empirical tests performed with a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before restarting
- Compare COMIT vs.:
  - Hillclimbing with restarts from bitstrings chosen randomly from the uniform distribution
  - Hillclimbing with restarts from the best bitstring out of K chosen randomly from the uniform distribution
  - Genetic algorithms?
38. COMIT: Example of Behavior
- Dummy slide for graphs showing all evaluations in TSP domain
39. COMIT Empirical Comparisons
40. Summary
- PBIL uses a very simple probability distribution from which new solutions are generated, yet works surprisingly well
- Algorithm using tree-based distributions seems to work even better, though at significantly more computational expense
- More sophisticated networks past the point of diminishing marginal returns? Future research
- COMIT makes the tree-based algorithm applicable to much larger problems
41. Future Work
- Making the algorithm based on complex Bayesian networks more practical
  - Combine with simpler search algorithms, à la COMIT?
- Applying COMIT to more interesting problems
  - WALKSAT?
- Optimization in real-valued state spaces. What sorts of PDF representations might be useful?
  - Gaussians?
  - Kernel-based representations?
  - Hierarchical representations?
42. Acknowledgements
- Shumeet Baluja
- Doug Baker, Justin Boyan, Lonnie Chrisman, Greg Cooper, Geoff Gordon, Andrew Moore, ?