Title: High-Performance and Parallel Optimization
1. High-Performance and Parallel Optimization
- Mark P. Wachowiak, Ph.D.
- Department of Computer Science and Mathematics
- Nipissing University
- June 3, 2007
2. Objectives
- Brief overview of optimization and applications.
- Parallel computing and parallel optimization.
- Specific examples of parallel optimization.
- Future directions.
3. Optimization
- Find the best, or optimum, value of an objective (cost) function.
- Very large research area.
- Multitude of applications.
4. Journals on Optimization (Partial List)
- SIAM Journal on Optimization
- Journal of Global Optimization
- Optimization and Engineering
- Journal of Combinatorial Optimization
- Journal of Optimization Theory and Applications
- Optimization Methods and Software
- Many journals on operations research
5. Applications of Optimization
Engineering design
Business and industry
Biology and medicine
Radiotherapy planning
Economics
Design of materials
Manufacturing design
Bioinformatics
Proteomics
Finance
Systems biology
Management
Simulation and modeling
Image registration
Operations research
6. Optimization in Biology and Biomedicine
Engineering design
Fitting models to experimental data
Optimizing relevant functions
Radiotherapy planning
Design of biomaterials
Soft tissue biomechanics
Biomechanics
Bioinformatics
Proteomics
Systems biology
Biomedical Imaging
Biochemical pathways
7. Optimization in Tomotherapy and Intensity-Modulated Radiotherapy
- Particle swarm (Li et al., 2005).
- Cimmino algorithm (Xiao et al., 2004).
- Exhaustive search (Wang et al., 2004).
- Simulated annealing (Rosen et al., 2005).
- Gradient descent (Rosen et al., 2005).
- Costlets (Kessler et al., 2005).
http://www.lrcc.on.ca/research/tomo/index.xml
8. New Applications of Optimization
- Simulation-based optimization
- Objective function values are the results of experiments and/or simulations.
- Analytic forms for the objective function are unavailable.
- Black-box cost function computation.
- Evaluating f is extremely costly.
- PDE-constrained optimization
- Very large, complex, and high-dimensional.
- Requires massive computing resources.
- Inverse problems
- Costly computation of the objective function.
9. Applications of PDE-Constrained Optimization
- Electromagnetics.
- Aerodynamics.
- Cardiac electrophysiology.
- Image processing.
- Reservoir simulation.
- Compressible and incompressible flows.
- Earthquake modeling.
10. Example: PDE-Constrained Optimization
Ground-truth model of the distribution of wave speed in an elastic earth medium (portion of the Los Angeles Basin).
Inverted model from solving an optimization problem with 17 million inversion parameters. The solution required 24 hours using 2048 CPUs at the PSC.
O. Ghattas, 2003. http://www.siam.org/pdf/news/128.pdf
11. Parameter Estimation
- For phenomena modeled by systems of ODEs and PDEs.
- Optimization is used to find the system parameters that best fit experimental and/or observed data.
- Applications:
- Biochemical pathways.
- Environmental modeling.
- Economics.
12. Basic Terminology
- Objective/cost function (f)
- The function to be minimized or maximized. The function need not be differentiable, or even continuous. In general, f is of the form f: R^D → R^(D0), where, typically, D0 = 1.
- Search space (Ω)
- The part of R^D in which the optimum value (maximum or minimum) is to be found by optimization.
- Minimizer (x* ∈ Ω)
- The point in Ω at which f attains its minimum value.
13. Basic Terminology (2)
- Global optimization
- Determination of the globally best optimum of f, where f possibly contains many local optima.
- Local optimization
- Determination of an optimum value in a finite neighborhood within Ω, where the optimum does not lie on the boundary of that neighborhood.
14. Basic Terminology (3)
- Iterative optimization method
- A method in which values computed in Ω are used to update the search, with f evaluated at new points in Ω.
- Convergence criteria
- In an iterative method, the criterion or criteria used to determine that an optimum cannot be sufficiently improved.
- Stationary point
- A point x ∈ Ω such that ∇f(x) = 0. A stationary point may be a (local or global) minimum, maximum, or inflection point.
15. Basic Terminology (4)
- Deterministic method
- An optimization approach wherein no randomness is introduced. The search is guided strictly by function and/or derivative computations.
- Stochastic method
- A randomized optimization approach wherein the search is also guided by random number generation.
16. Basic Terminology (5)
- Multiobjective optimization
- Simultaneous optimization of multiple objective functions.
- Let C denote the feasible set (search space) of solutions satisfying the equality and inequality constraints.
- Given n objective functions fi, i = 1, ..., n, a vector x* ∈ C is Pareto optimal if all other vectors x ∈ C have a higher value for at least one of the fi.
- Formally: there is no x ∈ C such that fi(x) ≤ fi(x*) for i = 1, ..., n, with at least one i such that fi(x) < fi(x*) (see the sketch below).
17. Combinatorial Optimization
- A linear or nonlinear function is defined over a very large finite set of solutions.
- Categories:
- Network problems.
- Scheduling.
- Transportation.
18. Combinatorial Optimization (2)
- If the function is piecewise linear, the function can be optimized exactly with mixed integer programming methods using branch and bound.
- Approximate solutions can be obtained with heuristic methods:
- Simulated annealing.
- Tabu search.
- Genetic algorithms.
19. General Unconstrained Problems
- A nonlinear function is defined over the real numbers.
- The function is not subject to constraints, or has simple bound constraints.
- Partitioning strategies utilize a priori knowledge of how rapidly the function varies (e.g., the Lipschitz constant).
- Interval methods can be used if the objective function can be expressed analytically.
20. General Unconstrained Problems: Statistical Methods
- Stochastic in nature.
- Use partitioning to decompose the search space.
- Use a priori information or assumptions to model the objective function.
- Usually inexact.
21. General Unconstrained Problems: Statistical Methods (Examples)
- Simulated annealing.
- Genetic algorithms.
- Particle swarm optimization.
22. Unconstrained Nonlinear Optimization
- Unconstrained minimization: find x* that minimizes a function f(x).
23. Bound-Constrained Nonlinear Optimization
- Find x* that minimizes a function f(x),
- subject to bound constraints l ≤ x ≤ u (a sketch of both problem classes follows).
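As a concrete illustration of the two problem classes above, the following sketch minimizes a smooth test function with and without simple bound constraints using SciPy. The objective, starting point, and bounds are arbitrary choices and not taken from the slides.

```python
# Illustrative sketch: unconstrained vs. bound-constrained minimization.
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """Classic 2-D Rosenbrock function; global minimum f = 0 at (1, 1)."""
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

x0 = np.array([-1.2, 1.0])

# Unconstrained minimization (Nelder-Mead needs only function values).
res_unc = minimize(rosenbrock, x0, method="Nelder-Mead")

# Bound-constrained minimization: l <= x <= u handled by L-BFGS-B.
bounds = [(-2.0, 0.5), (-2.0, 2.0)]   # box that excludes the unconstrained minimizer
res_box = minimize(rosenbrock, x0, method="L-BFGS-B", bounds=bounds)

print("unconstrained:", res_unc.x, res_unc.fun)
print("bound-constrained:", res_box.x, res_box.fun)
```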
24. General Constrained Problems
- A nonlinear function is defined over the real numbers.
- The optimization is subject to constraints.
- These problems are usually solved by adapting techniques for unconstrained problems to address the constraints.
25. General Constrained Nonlinear Optimization
- Constrained minimization: minimize f(x)
- subject to equality constraints ci(x) = 0 and inequality constraints cj(x) ≥ 0.
- Either f(x) and/or the c(x) are nonlinear.
26. General Constrained Nonlinear Optimization (2)
- In general, the problem is to minimize f over a compact subset Ω of R^D.
27. Nonlinear Equations Problem
- Determine a root of a system of D nonlinear equations in D variables: F(x) = 0, where F: R^D → R^D.
28. Example (figure)
29. Global Optimization
- Relatively new vis-à-vis local methods.
- Many important objective functions are:
- Non-convex.
- Irregular.
- Such that derivatives are not available or cannot be easily computed.
- The high computational cost of many global methods precludes their use in time-critical applications.
30. Local and Global Optimization (figure: search paths from a start point to local and global end points)
31. Optimization Techniques (classification chart)
- Methods are classified as global vs. local, stochastic vs. deterministic, and derivative-based vs. derivative-free.
- Examples: simulated annealing, gradient descent, Nelder-Mead simplex, interval analysis, genetic algorithms, trust region methods, Powell's direction set, homotopy methods, evolutionary strategies, Newton-based methods, pattern search, response surface techniques, particle swarm, DIRECT, multidirectional search.
32. Derivative-Based Optimization
- If derivatives can be computed or accurately estimated (e.g., by finite differences), a derivative-based method is preferable.
- If second derivatives are available, Newton-based approaches should be considered.
- Generally, these are the most successful, efficient approaches.
- The best-known methods for local optimization.
33. Problems with Derivative-Based Approaches
- Derivatives may not always be available.
- Finite difference approximations are too expensive or inaccurate.
- There is no a priori information about f.
- High-dimensional problems preclude accurate estimates of the gradient (a finite-difference sketch follows).
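The finite-difference estimates mentioned above can be sketched as follows. This central-difference version is illustrative only, and it shows why the cost grows with dimension: two extra evaluations of f per variable.

```python
# Sketch of a central-difference gradient estimate: 2*D extra evaluations of f
# per gradient, which is prohibitive when f is a costly simulation.
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)   # 2 evaluations per dimension
    return g

# Example: gradient of a simple quadratic, compared with the exact value 2*x.
f = lambda x: np.sum(x**2)
x = np.array([1.0, -2.0, 0.5])
print(fd_gradient(f, x))   # approximately [2.0, -4.0, 1.0]
```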
34. Automatic Differentiation
- A technique for computing derivatives based on the computer code for f.
- A new program is generated wherein analytical derivatives are computed based on the original code.
- Utilizes the chain rule and differentials to generate the differentiation code.
- Can be useful for gradient-based optimization of difficult functions.
35. Automatic Differentiation: Disadvantages
- Derivatives may not be accurate due to truncation error in functions involving solutions to PDEs.
- Black-box function code may contain branching.
- Black-box code may not be available.
- Usually requires the original program to be written in a specified manner.
- Interfacing the generated code with the original program may be difficult.
36. Derivative-Free Optimization
- Direct optimization:
- Relies only on the ability to compute function values.
- Based on the relationships among values, not the actual numerical values themselves.
- Can be used for optimizing non-numeric functions.
- Initial interest in the early 1960s-1970s, followed by a sharp decline.
- Interest revived recently, with the advent of simulation-based optimization and other new applications (Wright, 1995; Kolda, 2003).
37. Derivative-Free Optimization: Motivation
- Computation of the objective function is very time-consuming (expensive):
- Results of experiments.
- Results of simulations.
- Gradient computation is very expensive.
- Gradients cannot be calculated:
- The objective function is extremely complex, making automatic differentiation difficult or impossible.
- The function may be discontinuous in places.
- The objective function is very noisy:
- Calculated values rely on discretization.
- Observed values are very noisy.
M. Wright, 1995
38. Limited Impact of Optimization
- Computational difficulties:
- High computational cost (especially for global optimization).
- Plethora of methods (which to use?).
- Disconnect between CAD and CAE.
- Realism of the underlying model.
http://www.cas.mcmaster.ca/mopta/mopta01/abstracts.html#jones
39. Three Levels for Introducing Parallelism into Optimization
- Parallelize the function, gradient, and constraint evaluations (sketched below).
- Parallelize the numerical computations (linear algebra) and numerical libraries.
- Parallelize the optimization algorithm at a high level, i.e., adopt new optimization approaches.
Schnabel, 1995
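A minimal sketch of the first level (parallelizing function evaluations), assuming a Python process pool and a stand-in objective called expensive_f; it is an illustration of the idea, not code from Schnabel's paper.

```python
# Sketch: a batch of candidate points is evaluated concurrently with a process pool.
import numpy as np
from multiprocessing import Pool

def expensive_f(x):
    # Placeholder for a costly simulation or experiment.
    return float(np.sum(np.asarray(x)**2))

def evaluate_batch(points, n_workers=4):
    """Evaluate the objective at several points in parallel."""
    with Pool(processes=n_workers) as pool:
        return pool.map(expensive_f, points)

if __name__ == "__main__":
    candidates = [np.random.uniform(-5, 5, size=3) for _ in range(16)]
    values = evaluate_batch(candidates)
    best = int(np.argmin(values))
    print("best candidate:", candidates[best], "f =", values[best])
```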
40. Examples of Parallel Optimization Approaches: Deterministic Methods
- DIRECT (DIviding RECTangles) (Jones et al., 1993; Wachowiak et al., 2006).
- Parallel trust region methods (Hough and Meza, 2002).
- Asynchronous pattern search (local method) (Hough et al., 2001).
- Interval analysis (Eriksson and Lindström, 1995).
- Parallel Newton-Krylov methods for PDE-constrained optimization (Biros and Ghattas, 1999).
- Various response surface techniques.
41. Examples of Parallel Optimization Approaches: Stochastic Methods
- Various parallel simulated annealing approaches.
- Parallel evolutionary strategies and genetic algorithms.
- Parallel particle swarm optimization (Schutte et al., 2004).
42. Popular Gradient-Based Methods
- Conjugate gradient.
- Newton methods.
- Trust-region methods.
- These can employ all three levels of parallelization (Schnabel, 1995), particularly:
- Parallelization of f and ∇f.
- Parallelization of linear algebra and numerical libraries.
43. Two Important Classes of Derivative-Free, Parallel, Global Optimization
- Response surface methods.
- DIRECT (DIviding RECTangles).
- There are many other promising techniques to solve these problems.
44. Response Surface Methods
- Model the unknown function f with an analytic approximation:
- Response surface.
- Surrogate model.
- The model is optimized over Ω.
- Assume a deterministic function of D variables sampled at n points.
Regis and Shoemaker, 2007
45. Response Surface
- Linear regression model: y(x) = Σ βj gj(x) + ε, where
- ε is a normally distributed error term, N(0, σ^2),
- the βj are unknown coefficients to be estimated,
- the gj are linear or nonlinear functions of x.
46. 2D Response Surface: Fitting Polynomials (figure: response f(x) versus x for polynomial fits of degree d = 4, 6, and 8)
47. Radial Basis Functions
- Radially symmetric basis functions for modeling scattered data.
- Used in many response surface approaches.
- RBFs have nice mathematical properties that facilitate analysis (Powell, 1992).
48. Scattered Point Interpolation with Radial Basis Functions
Interpolate scattered points (figure panels):
- Original point cloud from segmented contours in a CT volume.
- Enhanced point cloud.
- Radial basis interpolation.
- Surface normals.
- Final RBF model.
Courtesy: Derek Cool, Robarts Research Imaging Laboratories
49. Radial Basis Functions
The RBF interpolant consists of a weighted sum of radial basis functions φ(||x − xi||) plus a polynomial term drawn from the space of polynomials in D variables of degree ≤ m.
50. Examples of Radial Basis Functions
(Equations for common RBFs, e.g., the Gaussian, defined for r ≥ 0.)
51. Compact Support RBFs
(Piecewise definition: a polynomial in r within the support radius for r > 0, and 0 otherwise.)
Compact support → positive definite matrices.
Wendland, 1995
52. Example RBFs (figure: φ(r) versus r for a Gaussian with γ = 2 and for 5th-, 6th-, and 8th-degree CSRBFs)
53. Radial Basis Functions
(Interpolation matrix Φ of evaluated radial basis functions, with entries φ(||xi − xj||).)
54. Radial Basis Functions
- The RBF that interpolates (xi, f(xi)) for i = 1, ..., n is obtained by solving the linear system
  [Φ P; P^T 0] [λ; c] = [F; 0], where F = (f(x1), ..., f(xn))^T.
- The polynomial term is not required if CSRBFs are used.
- The coefficients λ and c can be found by solving (inverting) this system, as sketched below.
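A minimal sketch of this interpolation step, assuming a cubic RBF φ(r) = r^3 with a linear polynomial tail; the function names and the test data are illustrative, not the authors' implementation.

```python
# Fit and evaluate an RBF interpolant by solving [Phi P; P^T 0][lambda; c] = [F; 0].
import numpy as np

def rbf_fit(X, F):
    """X: (n, D) sample points, F: (n,) function values."""
    n, D = X.shape
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    Phi = r**3                                   # cubic RBF
    P = np.hstack([np.ones((n, 1)), X])          # linear polynomial basis
    A = np.block([[Phi, P], [P.T, np.zeros((D + 1, D + 1))]])
    rhs = np.concatenate([F, np.zeros(D + 1)])
    coef = np.linalg.solve(A, rhs)
    return coef[:n], coef[n:]                    # lambda, c

def rbf_eval(x, X, lam, c):
    r = np.linalg.norm(X - x, axis=1)
    return np.dot(lam, r**3) + c[0] + np.dot(c[1:], x)

# Example: interpolate f(x) = sin(x1) + x2^2 at 30 random points in [0, 1]^2.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 2))
F = np.sin(X[:, 0]) + X[:, 1]**2
lam, c = rbf_fit(X, F)
x_test = np.array([0.3, 0.7])
print(rbf_eval(x_test, X, lam, c), np.sin(0.3) + 0.49)   # interpolant vs. exact value
```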
55. Determining New Points
- Resample in areas having promising function values as predicted by the model (Sóbester et al., 2004).
- Select the next point to maximize the expected improvement in the objective function (Jones et al., 1998).
- Search in areas of high predicted approximation error, using stochastic techniques (Sóbester et al., 2004).
- Select a guess f* for the global minimum, and choose a new point x_new such that some function s_new interpolating all (xi, fi) and (x_new, f*) maximizes a quality measure on s_new, e.g., minimizing the "bumpiness" of s_new (Gutmann, 2001; Regis and Shoemaker, 2007).
56. Constrained Optimization Using Response Surfaces
- Choose points so that they are at a maximum distance from previously evaluated points.
- Define the maximin distance Δn as the maximum, over all points not yet evaluated, of the minimum distance to the evaluated points: Δn = max_x min_i ||x − xi||.
Regis and Shoemaker, 2005
57. Constrained Optimization Using Response Surfaces (2)
- Let βn ∈ [0, 1] denote a user-defined distance factor.
- The next point chosen is required to have a distance of at least βnΔn from all previously evaluated points (sketched below).
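A small sketch of this distance rule, in which Δn is approximated over a random candidate set; the sampling scheme, β value, and names are illustrative assumptions rather than the authors' implementation.

```python
# Keep only candidates at least beta*Delta_n away from all evaluated points.
import numpy as np

def min_distances(candidates, evaluated):
    """Minimum distance from each candidate to the set of evaluated points."""
    d = np.linalg.norm(candidates[:, None, :] - evaluated[None, :, :], axis=2)
    return d.min(axis=1)

rng = np.random.default_rng(1)
evaluated = rng.uniform(0, 1, size=(12, 2))        # previously evaluated points
candidates = rng.uniform(0, 1, size=(500, 2))      # random candidates in Omega

dmin = min_distances(candidates, evaluated)
delta_n = dmin.max()                               # (approximate) maximin distance
beta = 0.5                                         # distance factor in [0, 1]
eligible = candidates[dmin >= beta * delta_n]
print("eligible candidates:", len(eligible), "of", len(candidates))
```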
58. Auxiliary Constrained Minimization Subproblem
- The auxiliary problem of selecting the next point for evaluation is expressed as: minimize the response surface model over Ω, subject to the constraint that the new point be at least βnΔn from every evaluated point.
- k denotes the initial number of points evaluated.
- The auxiliary subproblem is generally nonconvex.
59. Solving the Auxiliary Subproblem
- The βi typically form a search pattern balancing global and local search:
- β1 = 1 → global exploratory search.
- β(Ns+1) = 0 → greedy local search.
- The RBF model can be differentiated, so the subproblem can be solved with:
- Gradient-based optimization software.
- A multistart nonlinear programming solver.
- Global optimization techniques, such as constrained DIRECT.
60. Parallelization
- Each search pattern contains Ns + 1 values, where Ns is the cycle length.
- If P ≤ Ns + 1, the search pattern is cycled through by the available processors.
- If P > Ns + 1, more values are added to the search pattern.
- E.g., with β = (0.9, 0.75, 0.25, 0.05, 0.03, 0) (Ns = 5) and P = 8:
  β = (0.9, 0.9, 0.75, 0.25, 0.05, 0.03, 0.03, 0).
61. Full Algorithm
1. Select initial points S1 (Latin hypercubes, etc.), evaluate f at the points in S1, and record the best value.
2. Fit or update the RBF model using the data points x ∈ Si.
3. Select the next candidate point x(k+i) by solving the constrained auxiliary subproblem; 0 ≤ βi ≤ 1 is set by the user.
4. Evaluate f at x(k+i) and update the best function value thus far.
5. Set S(i+1) ← Si ∪ {x(k+i)}, D(i+1) ← Di ∪ {(x(k+i), f(x(k+i)))}, and i ← i + 1.
6. If the convergence criteria are not met, return to step 2; otherwise, stop.
62. Example (figure: the initial RBF surface over (x, y) and the RBF surfaces after iterations 1, 2, and 3)
63. Branch and Bound
- Widely used for solving large NP-hard combinatorial problems:
- Vehicle routing.
- Scheduling.
- Production planning.
- Natural parallel implementation.
Clausen and Perregaard, 1999
64. Branching
- A structured way of subdividing the search space into smaller subregions.
- Subdivision is applied recursively to the subregions, forming a tree structure (branching), with nodes representing the subregions.
65. Bounding
- Within each subregion, an upper and a lower bound on f are computed.
- Subregions that have lower bounds higher than the upper bounds of other subregions are removed from further consideration (pruned).
- The process continues iteratively until convergence criteria are met.
- Subregions that have not been pruned have equal upper and lower bounds.
66. Bounding (2)
- Some a priori knowledge of f is needed to establish upper and lower bounds:
- A global Lipschitz constant.
- Guaranteed smoothness.
- Other analytic information.
- Unfortunately, there is no universal, guaranteed bounding method.
67. Example (figure: the initial region Ω0)
68. Example (2) (figure: Ω0 subdivided into Ω00, Ω01, Ω02, and Ω03)
69. Example (3) (figure: Ω01 and Ω02 further subdivided into Ω010-Ω013 and Ω020-Ω023)
Ω00 and Ω03 do not contain the minimum.
70. DIRECT
- Relatively new method for bound-constrained global optimization (Jones et al., 1993).
- A Lipschitzian approach, but the Lipschitz constant need not be explicitly specified.
- Balance of global and local search.
- A type of branch and bound method, but non-promising regions of the search space are not removed.
- Inherently parallel.
Wachowiak and Peters, 2006
71. Applications of DIRECT
- Engineering design problems:
- High-speed civil transport.
- Air-bearing surface designs (important in hard disks) (Zhu and Bogy, 2002).
- Optimal transmitter placement (He et al., 2004).
- Biomedical image registration (Wachowiak and Peters, 2006).
72. DIRECT Algorithm
1. Normalize the search space and evaluate the center point.
2. Group rectangles by their diameters and identify potentially optimal rectangles.
3. Sample at points around the centers of potentially optimal rectangles, along dimensions of maximum side length.
4. Evaluate function values at the sampled points.
5. Divide the rectangles according to function value.
6. If the convergence criteria are not met, return to step 2; otherwise, stop.
73. Potentially Optimal Hyperboxes
Potentially optimal HBs have centers that define the bottom of the convex hull of a scatter plot of rectangle diameters versus f(xi) for all HB centers xi (figure: hulls for ε = 0.001 and ε = 0.1). A simplified selection is sketched below.
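A simplified sketch of this selection step: it keeps the rectangles whose (diameter, center value) pairs are not dominated by any other rectangle, which yields a superset of Jones's potentially optimal set; the ε test and the exact convex-hull construction are omitted for brevity, and the toy data are illustrative.

```python
# Keep rectangles for which no other rectangle is both at least as large
# (diameter) and at least as good (center function value), with one strict.
import numpy as np

def potentially_optimal(diameters, f_centers):
    d = np.asarray(diameters, dtype=float)
    f = np.asarray(f_centers, dtype=float)
    keep = []
    for j in range(len(d)):
        dominated = np.any((d >= d[j]) & (f <= f[j]) & ((d > d[j]) | (f < f[j])))
        if not dominated:
            keep.append(j)
    return keep

# Toy data: rectangle diameters and center function values.
diam = [1.0, 0.5, 0.5, 0.25, 0.25]
fval = [3.0, 2.5, 1.8, 1.9, 2.6]
print(potentially_optimal(diam, fval))   # indices 0 and 2
```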
74. Example (figure)
75. Example
Normalize the search space and evaluate the center point of a D-dimensional rectangle.
76. Example
Evaluate points around the center (Iteration 1).
77. Example
Divide the rectangle according to the function values.
78. Example
Use Lipschitz conditions and the rectangle diameters to determine which rectangles should be divided next.
79. Division of the Search Space
Potentially optimal rectangles with these centres are divided in the next iteration (figure annotations: estimate of the Lipschitz constant; locality parameter ε).
80. Example
Evaluate points in the potentially optimal rectangles (Iteration 2).
81. Example
Iterate (evaluate, divide, find potentially optimal rectangles) until stopping criteria are met.
82. Example (figure)
83. Example (figure: Iteration 3)
84. Example (figure: Iterations 4, 5, 10, 20, 50, and 110)
85. Parallel DIRECT
- Coarse-grained parallelism:
- Costly function evaluations are performed in parallel.
- Master-worker implementation:
- A single master processor performs the calculations for rectangle division and distributes tasks (computing f) to worker processors.
- The master finds all potentially optimal rectangles and samples within these rectangles.
- Each worker processor sends its result back to the master.
- Disadvantage: communication bottleneck at the master.
Watson, 2001; Wachowiak, 2006
86. Parallel DIRECT: Static Load Balancing
- Each processor computes its own local set of potentially optimal rectangles.
- A root processor gathers these local sets from each processor and forms a global set of potentially optimal rectangles.
- The root generates a new set of tasks, which are then distributed to the other processors.
- Advantage: more even distribution of work.
- Disadvantage: some processors may be idle while others are finishing complex function evaluations.
Watson, 2001
87. Parallel DIRECT: Dynamic Load Balancing
- Allows task migration to other processors that have finished their own tasks.
- Once a processor has finished its function evaluation, it processes any messages received during its computation.
- The evaluate-communicate cycle continues until the processor has no more tasks.
- At this point, it sends work requests to randomly selected processors until work is found.
- If a processor receives a work request, it transfers half of its remaining tasks to the requesting processor.
- Dynamic load balancing is achieved via a random polling algorithm using token passing (Krasteva et al., 1999).
Watson, 2001
88. Particle Swarm Optimization
- Relatively new global optimization technique (1995) (Kennedy, Eberhart, and Shi, 2001).
- Iterative, population-based global optimization method.
- Based on a co-operative model of individual (particle) interactions.
- In contrast to the generally competitive model of genetic algorithms and evolutionary strategies.
89. Particle Swarm Optimization (2)
- Each individual, or particle, has memory of the best position it has found so far (its personal best) and of the best position found by any particle (the global best).
- The next move of the particle is determined by its velocity, v, which is a function of its personal best and global best locations.
90. Position Update in PSO
Velocity update for particle j, parameter i, at iteration t + 1:
  v_ji(t+1) = w v_ji(t) + C1 φ1 (p_ji − x_ji(t)) + C2 φ2 (g_i − x_ji(t)), where
- w v_ji(t) is the previous velocity (inertia) term,
- C1 φ1 (p_ji − x_ji(t)) is the effect of the best position found by particle j (personal best),
- C2 φ2 (g_i − x_ji(t)) is the effect of the best position found by any particle during the entire search (global best).
Position update: x_ji(t+1) = x_ji(t) + v_ji(t+1).
91. PSO Parameters
- Maximum velocity v_max
- Prevents particles from moving outside the region of feasible solutions.
- Inertial weight w
- Determines the degree to which the particle should keep moving in the same direction as in the last iteration.
- C1, C2
- Control constants.
- φ1, φ2
- Random numbers that add stochasticity to the update (a code sketch follows).
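A compact sketch of the update equations with these parameters, applied to a toy sphere function. The hyperparameter values, swarm size, and iteration count are illustrative assumptions, not values from the original slides.

```python
# Basic particle swarm optimization following the velocity/position update above.
import numpy as np

def pso(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5,
        lo=-5.0, hi=5.0, vmax=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, size=(n_particles, dim))        # positions
    v = np.zeros_like(x)                                    # velocities
    pbest, pbest_f = x.copy(), np.array([f(p) for p in x])  # personal bests
    g = pbest[np.argmin(pbest_f)].copy()                    # global best
    for _ in range(iters):
        phi1 = rng.random((n_particles, dim))
        phi2 = rng.random((n_particles, dim))
        v = w * v + c1 * phi1 * (pbest - x) + c2 * phi2 * (g - x)
        v = np.clip(v, -vmax, vmax)                         # enforce v_max
        x = np.clip(x + v, lo, hi)
        fx = np.array([f(p) for p in x])
        improved = fx < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], fx[improved]
        g = pbest[np.argmin(pbest_f)].copy()
    return g, pbest_f.min()

best_x, best_f = pso(lambda p: float(np.sum(p**2)), dim=3)
print(best_x, best_f)   # should approach the origin, f near 0
```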
92. PSO Example (figure: swarm positions at iterations 1, 5, 10, 20, 30, and the last iteration, 41)
93. Clustering Methods
- Used for unconstrained real-valued functions.
- Multistart procedure: local searches are performed from multiple points distributed over the search space.
- The same local minimum may be identified by multiple points. Clustering methods attempt to avoid this inefficiency by careful selection of the points at which the local search is initiated.
94. Clustering Methods: General Algorithm
- Sample points in the search space.
- Transform the sampled points to group them around the local minima.
- Apply a clustering technique to identify groups representing the convergence basins of local minima.
95. Clustering Methods: Disadvantages
- Because a potentially large number of points are randomly sampled to identify the clusters in neighborhoods of local minima, the objective function must be relatively inexpensive to compute.
- Clustering methods are not suited to high-dimensional problems (more than a few hundred variables).
- However, coarse-grained parallelism may be useful in improving efficiency.
96. Example: Image Registration
- Aligning and combining images from the same or different imaging modalities.
- Linear.
- Nonlinear/elastic.
- Useful in simulation, modeling, and planning surgical procedures.
- Employs concepts from probability theory, information theory, geometry, topology, optimization, parallel computing, and many other areas.
- Once the images are registered, they are fused together to provide complementary information.
97. Registration and Fusion (figure panels: MRI; MRI; ultrasound; fused MRI and ultrasound; PET; histology cryosection)
98. Similarity Metric (figure: metric values for correct registration; small x, y, and z translation errors; rotation errors around all three axes; and large translation and rotation errors)
Wachowiak et al., Proc. SPIE Medical Imaging, 2003
99. Entropy and Mutual Information
- Entropy: H(A) = −Σ p(a) log p(a).
- Mutual Information (MI): I(A, B) = H(A) + H(B) − H(A, B).
- Normalized Mutual Information (NMI): NMI(A, B) = (H(A) + H(B)) / H(A, B). (A sketch of these computations follows.)
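A sketch of these quantities computed from a joint intensity histogram of two images. The bin count and the toy images are arbitrary assumptions, natural logarithms are used, and empty bins are skipped.

```python
# Entropy, MI, and NMI from a joint histogram of two equally sized images.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mi_nmi(a, b, bins=32):
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()                         # joint density
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)      # marginal densities
    h_a, h_b, h_ab = entropy(p_a), entropy(p_b), entropy(p_ab.ravel())
    return h_a + h_b - h_ab, (h_a + h_b) / h_ab        # MI, NMI

# Toy example: a noisy, intensity-scaled copy scores higher than unrelated noise.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
print(mi_nmi(img, 2.0 * img + 0.05 * rng.standard_normal(img.shape)))
print(mi_nmi(img, rng.random((64, 64))))
```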
100. Parallel Optimization: Image Registration
- Coarse-grained: DIRECT.
- Fine-grained: Powell, MDS.
Similarity metric pipeline:
1. Compute the transformation matrix.
2. Apply the transformation to all voxels.
3. Determine valid coordinates.
4. Interpolate image intensities.
5. Compute the marginal and joint densities.
6. Calculate the similarity metric.
Wachowiak and Peters, 2006
101. Parameter Estimation
- Given a set of experimental data, calibrate the model so as to reproduce the experimental results in the best possible way (Moles et al., 2003).
- An inverse problem.
- Solutions to these problems are instrumental in the development of dynamic models that promote functional understanding at the systems level.
102. Parameter Estimation (2)
- Goal: minimize an objective function that measures the goodness of fit of the model with respect to a given experimental data set, subject to the dynamics of the system (constraints).
103. Example: Biochemical Pathways
- Nonlinear biochemical model with 36 parameters.
- The model consists of 8 ODEs describing the time-dependent (dynamic) variation of metabolite concentrations.
- The output of each experiment is a set of 8 state variables.
- The optimization problem is to fit two different classes of parameters to the experimental model.
Moles et al., 2003
104. Example: Biochemical Pathways (2)
- A nonlinear programming problem with differential-algebraic constraints.
- These problems are frequently nonconvex and multimodal; therefore, global methods are required.
- Multistart strategies do not work in this case: multiple starts converge to the same local minimum.
Moles et al., 2003
105. Biochemical Pathways (3)
The objective is a weighted sum, over the m experiments and the n data points of each experiment, of the squared differences between y_pred and y_exp (sketched below), where
- n is the number of data points for each experiment,
- m is the number of experiments,
- y_exp is the known experimental data,
- y_pred is the theoretical state of the 36-parameter model,
- the weights normalize the contribution of each term.
Moles et al., 2003
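A sketch of this weighted least-squares objective. The simulate function below is a toy stand-in for integrating the 8-ODE model and is not the actual model of Moles et al.; all names are illustrative.

```python
# Weighted sum of squared residuals between predicted and experimental trajectories.
import numpy as np

def objective(p, experiments, simulate):
    """experiments: list of (t, y_exp, w) with y_exp, w of shape (n, n_states)."""
    J = 0.0
    for t, y_exp, w in experiments:
        y_pred = simulate(p, t)                 # shape (n, n_states)
        J += np.sum(w * (y_pred - y_exp)**2)
    return J

# Toy stand-in model: exponential decays with rates given by p.
def simulate(p, t):
    return np.exp(-np.outer(t, p))

t = np.linspace(0, 5, 20)
p_true = np.array([0.5, 1.5])
y_exp = simulate(p_true, t)
w = 1.0 / np.maximum(y_exp, 1e-3)**2            # normalize each term's contribution
experiments = [(t, y_exp, w)]
print(objective(p_true, experiments, simulate))                # 0 at the true parameters
print(objective(np.array([0.7, 1.0]), experiments, simulate))  # > 0 elsewhere
```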
106. Biochemical Pathways (4)
- Several different global optimization approaches were applied.
- A stochastic method (evolutionary strategies) was found to be the only method to solve the problem.
- However, the computational cost was very high.
- Due to the inherently parallel nature of the problem, efficiency gains are likely with HPC and parallel approaches.
Moles et al., 2003
107. Grid Computing
- Grid computing is becoming increasingly popular for solving large-scale scientific and engineering problems using computational power distributed over a network (the Internet).
- Multiobjective optimization (Nebro et al., 2007).
- Harnessing the computational power of hundreds of workstations to search a huge solution space using a master-worker model (J. Nocedal, 2003).
- An important concern is the platforms and middleware needed to enable parallel grid computing.
108. Optimization Software
- OPT++ (http://hpcrd.lbl.gov/~meza/projects/opt++/OPT++doc/html/index.html)
- C++ library of nonlinear optimization algorithms.
- GNU public license.
- Many codes implementing various global optimization algorithms can be located at:
- http://www.mat.univie.ac.at/~neum/glopt/software_g.html
- Public-domain and commercial links.
- Stochastic methods (e.g., simulated annealing, genetic algorithms).
- Deterministic methods (e.g., branch and bound).
109. Conclusion
- Many new applications of optimization need methods that do not rely on derivatives.
- Computing f is often extremely costly.
- Parallel computation, particularly coarse-grained parallelism, provides the efficiency needed to solve large black-box problems.
110. Future Directions
- A new focus on deterministic global approaches.
- New global optimization algorithms should be inherently parallel to maximize scalability.
- Convergence proofs:
- Easier for derivative-based methods.
- Especially important for stochastic methods.
- Improvements and performance tuning of optimization software.
- Hybrid fine-grained and coarse-grained parallelization.
- Exploiting the three levels of parallelism for optimization.
111. Future Directions (2)
- Load balancing:
- Asynchronous versions of parallel techniques.
- Selecting the appropriate methods for specific problems:
- If derivatives are available, use gradient-based methods.
- Otherwise, parallel direct methods should be used.
- If all else fails, consider parallel stochastic methods.
112. Thank you.
113. Radial Basis Functions
- The RBF that interpolates (xi, f(xi)) for i = 1, ..., n is obtained by solving the linear system [Φ P; P^T 0] [λ; c] = [F; 0].
- The polynomial term is not required if CSRBFs are used.
- The coefficients λ and c can be found by solving (inverting) this system.
114. Radial Basis Functions (figure)
115. Response Surface Methods
- Model the unknown function f with an analytic approximation:
- Response surface.
- Surrogate model.
- The model is optimized over Ω.
- Assume a deterministic function of D variables sampled at n points.
- Denote sampled point i by xi = (xi1, ..., xiD), i = 1, ..., n.
- The associated function value is yi = y(xi).
Regis and Shoemaker, 2007
116. DIRECT Step 1
Normalize the search space and evaluate the center point of an n-dimensional rectangle.
117. DIRECT Step 2
Evaluate points around the center.
118. DIRECT Step 3
Divide the rectangle according to the function values.
119. Division of the Search Space
Potentially optimal rectangles with these centres are divided in the next iteration (figure annotations: estimate of the Lipschitz constant; locality parameter ε).
120. DIRECT Step 4
Use Lipschitz conditions and the rectangle diameters to determine which rectangles should be divided next.
121. DIRECT Iteration
Repeat steps 2-4 (evaluate, find potentially optimal rectangles, divide).
122. DIRECT Convergence
Convergence is based on rectangle clustering, the number of non-improving iterations, and a function value tolerance.
123. Global and Local Optimization (figure)
124. Global and Local Optimization (figure)
125. Local Optimization (figure: search start)
126. Local Optimization (figure: search end)
127. Global Optimization (figure: search start)
128. Global Optimization (figure: search end)
129. Proposed Adaptation of PSO to Biomedical Image Registration
- Incorporation of a constriction factor (Clerc and Kennedy, 2002).
- Registration functions using generalized similarity metrics and function "stretching" (Parsopoulos et al., 2001) in the global search.
- Addition of a new memory term, the initial orientation, since the initial position is usually carefully chosen.
130. Proposed Adaptation of PSO to Biomedical Image Registration (2)
Velocity update terms:
- Constriction factor.
- Previous velocity.
- Personal best term.
- Global best term.
- Initial orientation term (based on the initial orientation).
- φ1, φ2, φ3: control parameters; u1, u2, u3: uniformly distributed random numbers.
131. Proposed Adaptation of PSO to Biomedical Image Registration (3) (figure)
132. Parallel Optimization
- Coarse-grained: DIRECT.
- Fine-grained: Powell, MDS.
Similarity metric pipeline:
1. Compute the transformation matrix.
2. Apply the transformation to all voxels.
3. Determine valid coordinates.
4. Interpolate image intensities.
5. Compute the marginal and joint densities.
6. Calculate the similarity metric.
Wachowiak and Peters, IEEE TITB, 2006; Wachowiak and Peters, IEEE HPCS 2005, 50-56.
133. References
- http://www.cs.sandia.gov/opt/survey/
134. Example (figure)
135. Steepest Descent
- Derivative-based local optimization method.
- Simple to implement.
Update rule: x(k+1) = x(k) − α ∇f(x(k)), where ∇f(x(k)) is the gradient of f at the current point x(k), x(k+1) is the point at the next iteration, and α is the step size (sketched below).
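A sketch of this update with a fixed step size; in practice α is usually chosen by a line search, and the quadratic test problem below is an arbitrary example.

```python
# Steepest descent with a fixed step size: x_{k+1} = x_k - alpha * grad f(x_k).
import numpy as np

def steepest_descent(grad, x0, alpha=0.05, tol=1e-8, max_iter=10000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:       # convergence criterion
            break
        x = x - alpha * g                 # gradient step
    return x

# Minimize f(x) = x1^2 + 10*x2^2 (gradient: [2*x1, 20*x2]).
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(steepest_descent(grad, [3.0, -2.0]))   # converges toward [0, 0]
```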
136. Steepest Descent: Disadvantages
- Unreliable convergence properties.
- The new gradient at the minimum point of any line minimization is orthogonal to the direction just traversed.
- Therefore, potentially many small steps are taken, leading to slow convergence.
137. Conjugate Gradient Methods
- Assume that both f(x) and ∇f(x) can be computed.
- Assume that f can be approximated by a quadratic form: f(x) ≈ (1/2) x^T A x − b^T x + c.
- The optimality condition is then ∇f(x) = A x − b = 0.
138. Conjugates
- Given that A is a symmetric matrix, two vectors g and h are said to be conjugate with respect to A if g^T A h = 0.
- Orthogonal vectors are a special case of conjugate vectors: g^T h = 0.
- Therefore, the solution of the n × n quadratic problem A x = b can be built up from a set of n mutually conjugate directions.
139. Conjugate Gradient Method
- Select successive direction vectors as a conjugate version of the successive gradients obtained as the method progresses.
- Conjugate directions are generated as the algorithm proceeds.
http://www.srl.gatech.edu/education/ME6103/NLP-Unconstrained-Multivariable.ppt
140. Conjugate Gradient Method (2)
1. Choose x0; compute g0 = ∇f(x0); set h0 = −g0.
2. Using a line search, find αk that minimizes f(xk + αk hk).
3. Set x(k+1) = xk + αk hk.
4. Compute βk using the Fletcher-Reeves or Polak-Ribière formula.
5. Compute the new conjugate direction h(k+1) = −g(k+1) + βk hk.
6. If the stopping criteria are met, return x; otherwise, return to step 2.
A sketch with the Fletcher-Reeves update follows.
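A sketch of this flow using the Fletcher-Reeves β and a simple backtracking line search in place of the exact line minimization; the test function, safeguards, and tolerances are illustrative choices.

```python
# Nonlinear conjugate gradient with the Fletcher-Reeves beta.
import numpy as np

def backtracking(f, x, d, g, alpha=1.0, rho=0.5, c=1e-4, max_steps=50):
    """Crude Armijo backtracking line search along direction d."""
    for _ in range(max_steps):
        if f(x + alpha * d) <= f(x) + c * alpha * np.dot(g, d):
            break
        alpha *= rho
    return alpha

def conjugate_gradient(f, grad, x0, tol=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    h = -g                                           # h_0 = -g_0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        if np.dot(g, h) >= 0:                        # safeguard: restart if not a descent direction
            h = -g
        alpha = backtracking(f, x, h, g)
        x = x + alpha * h                            # x_{k+1} = x_k + alpha_k h_k
        g_new = grad(x)
        beta = np.dot(g_new, g_new) / np.dot(g, g)   # Fletcher-Reeves beta
        h = -g_new + beta * h                        # new conjugate direction
        g = g_new
    return x

f = lambda x: x[0]**2 + 10.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(conjugate_gradient(f, grad, [3.0, -2.0]))   # converges toward the minimizer [0, 0]
```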
141. Genetic Algorithms and Evolutionary Strategies
142. Genetic Algorithms
1. Initialize the population.
2. Evaluate the function values of the population.
3. Perform competitive selection.
4. Generate new solutions from genetic operators.
5. If the stopping criteria are not met, return to step 2; otherwise, stop.
143. Parallel DIRECT: Dynamic Load Balancing (Implementation)
- Dynamic load balancing is achieved via a random polling algorithm (Krasteva et al., 1999).
- Each processor tracks its own state with an idle flag.
- If a processor receives work, idle = FALSE.
- Each processor is identified by a number 1, ..., P.
- A token is passed around, in ring fashion, to all processors.
Watson, 2001
144. Parallel DIRECT: Dynamic Load Balancing (Implementation) (2)
(Flowchart fragment: if the token value is less than P and there are no pending work requests, processor i sets idle = TRUE and the token is passed on.)
145. Parallel and High-Performance Computing
- Parallel computing:
- The use of multiple computing resources (usually processors or execution threads) to improve the efficiency of computationally expensive algorithms.
- High-performance computing:
- Parallel computing, or
- Optimizing algorithm performance by explicitly taking advantage of computational abilities inherent in the system hardware and/or software.
146. HPC and Parallel Computing
www.sharcnet.ca
147. Limits to Improvement via Parallelization
- Not all useful algorithms are inherently parallel:
- They do not have a natural parallel implementation.
- Overhead required in parallelization negatively affects performance:
- Cost of creating and joining threads.
- Synchronization of threads and/or processes.
- Communication overhead and/or network issues.
- Inefficiencies resulting from introducing parallelism into an optimal serial algorithm.
148. Parallel Computing Terminology
- Coarse-grained:
- Each processor has a large amount of work. A relatively small number of large computations is performed.
- Fine-grained:
- Each processor works on a small subtask.
- A relatively large number of small computations is performed.
149. Parallel Computing Terminology (2)
- Distributed memory:
- Processors each have their own memory.
- Processors communicate through passing messages.
- Useful for coarse-grained algorithms.
- Relatively difficult to program (e.g., MPI, PVM).
- Shared memory:
- All processors share one memory.
- Inter-processor communication is less important.
- Useful for fine-grained algorithms, as well as for some coarse-grained implementations.
- Relatively simple to program (e.g., OpenMP, pthreads).
150. Speedup and Efficiency
- Speedup: S(P) = T(1) / T(P), where P is the number of processors and T(P) is the run time on P processors.
- Efficiency: E(P) = S(P) / P.
- Scalability: for a specified algorithm, the increase in speedup as a function of the number of processors P. In general, distributed memory systems are more scalable than shared memory systems.
151. Amdahl's Law
- Quantifies the efficiency benefits of parallelization for multiprocessing systems.
- Assumption: the problem size is fixed.
- Notation:
- P: number of processors.
- s: fraction of intrinsically serial code (0 to 1).
- 1 − s: fraction of code that can be parallelized.
- Speedup(P) = 1 / (s + (1 − s)/P).
152. Amdahl's Law (2) (figure: Speedup(P) versus P for serial fractions s = 0.01, 0.05, 0.2, and 0.5)
153. Gustafson's Law
- Quantifies massive parallelism.
- Underlying principle: the parallel or vector part of a program scales with the problem size.
- Serial components, including start-up, program loading, serial bottlenecks, and I/O, do not grow with problem size (they are constant).
- If a problem is run on P processors, the scaled speedup grows nearly linearly: Speedup(P) = P − s(P − 1).
154. Gustafson's Law (2)
- Each processor performs similar work, but on different parts of the data.
- In this way, the fixed, sequential portion of the code can be considered constant. (Both laws are sketched below.)
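A small sketch comparing the two laws, using the closed forms given above; the serial fraction and processor counts are arbitrary example values.

```python
# Amdahl speedup (fixed problem size) vs. Gustafson scaled speedup.
def amdahl_speedup(P, s):
    """Speedup on P processors with serial fraction s, fixed problem size."""
    return 1.0 / (s + (1.0 - s) / P)

def gustafson_speedup(P, s):
    """Scaled speedup when the parallel workload grows with P."""
    return P - s * (P - 1)

for P in (4, 16, 64, 256):
    print(P, round(amdahl_speedup(P, 0.05), 2), round(gustafson_speedup(P, 0.05), 2))
# Amdahl saturates near 1/s = 20, while Gustafson keeps growing almost linearly.
```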