Title: Search and Optimization Methods
1. Search and Optimization Methods
Based in part on Chapter 8 of Hand, Mannila, and Smyth
David Madigan
2. Introduction
- This chapter is about finding the models and parameters that minimize a general score function S.
- We often have to conduct a parameter search for each visited model.
- The number of possible model structures can be immense. For example, there are 3.6 × 10^13 undirected graphical models with 10 vertices.
3. Greedy Search
1. Initialize. Choose an initial state M_k.
2. Iterate. Evaluate the score function at all adjacent states and move to the best one.
3. Stopping Criterion. Repeat step 2 until no further improvement can be made.
4. Multiple Restarts. Repeat steps 1-3 from different starting points and choose the best solution found (a sketch of this procedure follows the list).
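A minimal sketch of this procedure in Python; `initial_states`, `neighbors`, and `score` are hypothetical placeholders standing in for a concrete model space and score function S.

def greedy_search(initial_states, neighbors, score):
    # Greedy local search with multiple restarts.
    # `neighbors(state)` returns the states adjacent to `state`;
    # `score(state)` is the score function S to be minimized.
    best_state, best_score = None, float("inf")
    for state in initial_states:                          # 4. multiple restarts
        current, current_score = state, score(state)      # 1. initialize
        while True:                                       # 2. iterate
            candidates = [(score(s), s) for s in neighbors(current)]
            if not candidates:
                break
            cand_score, cand = min(candidates, key=lambda c: c[0])
            if cand_score >= current_score:               # 3. no further improvement
                break
            current, current_score = cand, cand_score
        if current_score < best_score:
            best_state, best_score = current, current_score
    return best_state, best_score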
4. Systematic Search Heuristics
Breadth-first search, depth-first search, beam search, etc.
5. Parameter Optimization
Finding the parameters θ that minimize a score function S(θ) is usually equivalent to minimizing a complicated function in a high-dimensional space.
Define the gradient function of S as
g(θ) = ∇S(θ) = (∂S/∂θ_1, ..., ∂S/∂θ_d).
When closed-form solutions to ∇S(θ) = 0 exist, there is no need for numerical methods.
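As a standard illustration (not from the slides), the squared-error score of a linear model has such a closed form; here X denotes a design matrix and y a response vector:

S(\theta) = \lVert y - X\theta \rVert^{2}, \qquad
\nabla S(\theta) = -2\,X^{\top}(y - X\theta) = 0
\;\Longrightarrow\; \hat{\theta} = (X^{\top}X)^{-1}X^{\top}y .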
6. Gradient-Based Methods
1. Initialize. Choose an initial value θ = θ^0.
2. Iterate. Starting with i = 0, let θ^(i+1) = θ^i + λ^i v^i, where v^i is the direction of the next step and λ^i is the distance (step length). Generally choose v^i to be a direction that improves the score.
3. Convergence. Repeat step 2 until S appears to have reached a local minimum.
4. Multiple Restarts. Repeat steps 1-3 from different initial starting points and choose the best minimum found (a Python sketch follows this list).
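A minimal sketch of steps 1-3 in Python, using steepest descent (v^i = -∇S) with a numerically estimated gradient; the quadratic score at the end is a hypothetical example, not from the slides.

import numpy as np

def numeric_gradient(S, theta, h=1e-6):
    # Central-difference estimate of the gradient of S at theta.
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        grad[j] = (S(theta + e) - S(theta - e)) / (2 * h)
    return grad

def gradient_method(S, theta0, lam=0.1, max_iter=1000, tol=1e-8):
    # Iterate theta^(i+1) = theta^i + lam * v^i with v^i = -grad S (steepest descent).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        v = -numeric_gradient(S, theta)              # a direction that improves (decreases) S
        theta_new = theta + lam * v
        if abs(S(theta_new) - S(theta)) < tol:       # S appears to have reached a local minimum
            return theta_new
        theta = theta_new
    return theta

# Hypothetical score function: a quadratic bowl with minimum at (1, -2)
S = lambda t: (t[0] - 1.0) ** 2 + (t[1] + 2.0) ** 2
print(gradient_method(S, [0.0, 0.0]))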
8. Univariate Optimization
Let g(θ) = S'(θ). Newton-Raphson proceeds as follows. Suppose g(θ^s) = 0. Then, expanding g about the current point θ,
0 = g(θ^s) ≈ g(θ) + (θ^s - θ) g'(θ), so θ^s ≈ θ - g(θ)/g'(θ),
giving the iteration θ^(i+1) = θ^i - g(θ^i)/g'(θ^i).
9. 1-D Gradient Descent
θ^(i+1) = θ^i - λ g(θ^i)
- λ is usually chosen to be quite small
- This is the special case of Newton-Raphson in which 1/g'(θ^i) is replaced by a constant λ (a sketch of both updates follows)
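A minimal sketch of the two univariate updates, assuming the caller supplies g and g'; the example function at the end is hypothetical.

def newton_raphson(g, g_prime, theta, max_iter=100, tol=1e-10):
    # Solve g(theta) = 0 via theta^(i+1) = theta^i - g(theta^i) / g'(theta^i).
    for _ in range(max_iter):
        step = g(theta) / g_prime(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

def gradient_descent_1d(g, theta, lam=0.1, max_iter=10000, tol=1e-10):
    # The same iteration with 1/g'(theta^i) replaced by the constant lam.
    for _ in range(max_iter):
        step = lam * g(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Hypothetical example: S(theta) = (theta - 3)^2, so g(theta) = S'(theta) = 2*(theta - 3)
g = lambda t: 2.0 * (t - 3.0)
g_prime = lambda t: 2.0
print(newton_raphson(g, g_prime, 0.0))   # exact in one step for a quadratic S
print(gradient_descent_1d(g, 0.0))       # approaches 3 with a small constant step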
10. Multivariate Case
Curse of dimensionality again. For example, suppose S is defined on a d-dimensional unit hypercube, and suppose we know that all components of θ are less than 1/2 at the optimum. The remaining search region has volume (1/2)^d:
- if d = 1, we have eliminated half the parameter space
- if d = 2, we have eliminated all but 1/4 of the parameter space
- if d = 20, we have eliminated all but about 1/1,000,000 of the parameter space! (a quick check of these fractions follows)
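A quick numerical check of these fractions (my own arithmetic, not from the slides):

# Fraction of the unit hypercube left when every component of theta is known
# to lie below 1/2: (1/2)^d of the original volume.
for d in (1, 2, 20):
    remaining = 0.5 ** d
    print(f"d = {d:2d}: remaining fraction = {remaining:.2e} (about 1/{round(1 / remaining):,})")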
11. Multivariate Gradient Descent
θ^(i+1) = θ^i - λ g(θ^i), where g(θ) = ∇S(θ)
- -g(θ^i) points in the direction of steepest descent
- Guaranteed to converge if λ is small enough
- Essentially the same as the back-propagation method used in neural networks
- Can replace λ with second-derivative information (quasi-Newton methods use an approximation to it); a sketch follows
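A minimal sketch of replacing the constant λ by second-derivative information, i.e. a full Newton step using the Hessian; the quadratic score and its derivatives below are hypothetical examples (a quasi-Newton method would instead build up an approximation to the Hessian).

import numpy as np

def newton_step_method(grad, hess, theta0, max_iter=100, tol=1e-10):
    # theta^(i+1) = theta^i - H(theta^i)^{-1} g(theta^i): the constant step lambda
    # is replaced by (inverse) second-derivative information.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta

# Hypothetical quadratic score S(theta) = 0.5 * theta' A theta - b' theta
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda t: A @ t - b     # g(theta) = gradient of S
hess = lambda t: A             # Hessian of S (constant for a quadratic)
print(newton_step_method(grad, hess, [0.0, 0.0]))   # one Newton step solves A theta = b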
12. Simplex Search Method
- Evaluates d+1 points arranged in a hyper-tetrahedron (a simplex). For example, with d = 2, evaluates S at the vertices of an equilateral triangle.
- Reflect the triangle in the side opposite the vertex with the highest value.
- Repeat until oscillation occurs, then halve the sides of the triangle.
- No calculation of derivatives is required (a sketch follows).
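A minimal sketch of this reflect-or-shrink scheme in Python; the two-dimensional test function at the end is a hypothetical example.

import numpy as np

def simplex_search(S, vertices, max_iter=500, tol=1e-8):
    # Basic reflect-or-shrink simplex search; no derivatives are computed.
    # `vertices` holds d+1 starting points in d dimensions.
    v = np.asarray(vertices, dtype=float)
    for _ in range(max_iter):
        scores = np.array([S(x) for x in v])
        order = np.argsort(scores)
        v, scores = v[order], scores[order]      # best vertex first, worst last
        if np.max(np.abs(v[1:] - v[0])) < tol:   # simplex has shrunk to a point
            break
        centroid = v[:-1].mean(axis=0)           # centre of the face opposite the worst vertex
        reflected = 2.0 * centroid - v[-1]       # reflect the worst vertex through that face
        if S(reflected) < scores[-1]:
            v[-1] = reflected                    # the reflection improves on the worst vertex
        else:
            v = v[0] + 0.5 * (v - v[0])          # oscillation: halve the sides toward the best vertex
    return v[0]

# Hypothetical test function with d = 2 and minimum at (1, 2)
S = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2
triangle = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(simplex_search(S, triangle))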
15. EM for two-component Gaussian mixture
The observed-data log-likelihood, l(θ) = Σ_i log[ (1 - π) φ(x_i; μ_1, σ_1²) + π φ(x_i; μ_2, σ_2²) ], is tricky to maximize directly because of the sum inside the logarithm.
16. EM for two-component Gaussian mixture, cont.
Computing the responsibilities γ_i = P(observation i belongs to component 2 | x_i, current θ) is the E-step. It does a soft assignment of observations to mixture components.
17. EM for two-component Gaussian mixture: Algorithm
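The algorithm on this slide did not survive the transcript; below is a minimal Python sketch of the standard EM iteration for a two-component univariate Gaussian mixture (variable names and the initialization are my own), alternating the soft E-step assignment above with weighted M-step updates.

import numpy as np

def normal_pdf(x, mu, var):
    # Univariate normal density phi(x; mu, var).
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(x, n_iter=100):
    x = np.asarray(x, dtype=float)
    mu1, mu2 = x.min(), x.max()       # crude initialization
    var1 = var2 = x.var()
    pi = 0.5                          # mixing proportion of component 2
    for _ in range(n_iter):
        # E-step: responsibilities (soft assignment to component 2)
        p1 = (1 - pi) * normal_pdf(x, mu1, var1)
        p2 = pi * normal_pdf(x, mu2, var2)
        gamma = p2 / (p1 + p2)
        # M-step: weighted maximum-likelihood updates
        mu1 = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * x) / np.sum(gamma)
        var1 = np.sum((1 - gamma) * (x - mu1) ** 2) / np.sum(1 - gamma)
        var2 = np.sum(gamma * (x - mu2) ** 2) / np.sum(gamma)
        pi = gamma.mean()
    return mu1, var1, mu2, var2, pi

# Hypothetical data drawn from two Gaussians
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 100)])
print(em_two_gaussians(data))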
19. EM with Missing Data
Let Q(H) denote a probability distribution for the missing data H. By Jensen's inequality,
F(Q, θ) = Σ_H Q(H) log[ p(D, H | θ) / Q(H) ] ≤ log Σ_H p(D, H | θ) = l(θ).
This is a lower bound on l(θ).
20. EM (continued)
In the E-step, the maximum of F(Q, θ) over Q is achieved when Q(H) = p(H | D, θ).
In the M-step, we need to maximize Σ_H Q(H) log p(D, H | θ) with respect to θ (the remaining term, the entropy of Q, does not depend on θ).
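A one-line check, in the same notation, that the E-step choice makes the lower bound tight; this step is implicit in the slides and is what guarantees that alternating the two steps never decreases l(θ):

F(Q,\theta)\big|_{Q(H)=p(H\mid D,\theta)}
  = \sum_H p(H\mid D,\theta)\,\log\frac{p(D,H\mid\theta)}{p(H\mid D,\theta)}
  = \sum_H p(H\mid D,\theta)\,\log p(D\mid\theta)
  = l(\theta).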
21. EM Normal Mixture Example
22. EM Normal Mixture Example (cont.)