Title: Unconstrained Optimization
1. Unconstrained Optimization
2. Recap
- Gradient ascent/descent
  - Simple algorithm; only requires the first-order derivative
  - Problem: difficulty in determining the step size
    - Small step size → slow convergence
    - Large step size → oscillation or divergence (see the sketch below)
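As a toy illustration of the step-size issue (not from the slides), a minimal gradient-descent sketch; the quadratic f(x) = x^2 and the two step sizes are assumptions chosen to show convergence versus divergence:

import numpy as np

def gradient_descent(grad_f, x0, step_size, num_iters=100):
    # Plain gradient descent: x <- x - eta * grad f(x)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step_size * grad_f(x)
    return x

# Example: minimize f(x) = x^2, whose gradient is 2x; the minimizer is x = 0.
grad = lambda x: 2 * x
print(gradient_descent(grad, 5.0, step_size=0.1))   # small step: converges toward 0
print(gradient_descent(grad, 5.0, step_size=1.1))   # step too large: iterates blow up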
3. Recap: Newton Method
- Univariate Newton method
- Multivariate Newton method
  - Uses the Hessian matrix
- Guaranteed to converge when the objective function is convex/concave
  (the standard update rules are reconstructed below)
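The slide's equations did not survive extraction; the standard Newton updates it refers to are:

% Univariate Newton update
x_{t+1} = x_t - \frac{f'(x_t)}{f''(x_t)}

% Multivariate Newton update, with Hessian H(x_t) = \nabla^2 f(x_t)
\mathbf{x}_{t+1} = \mathbf{x}_t - H(\mathbf{x}_t)^{-1}\,\nabla f(\mathbf{x}_t)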
4. Recap
- Problems with the standard Newton method
  - Computing the inverse of the Hessian matrix H is expensive (O(n^3))
  - The Hessian matrix H itself can be very large (O(n^2) entries)
- Quasi-Newton method (BFGS)
  - Approximates the inverse of the Hessian matrix H with another matrix B
  - Avoids the difficulty of computing the inverse of H
  - However, still problematic when the size of B is large
- Limited-memory Quasi-Newton method (L-BFGS)
  - Stores a small set of vectors instead of the matrix B
  - Avoids the difficulty of computing the inverse of H
  - Avoids the difficulty of storing the large matrix B (see the two-loop sketch below)
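A minimal sketch of the two-loop recursion that lets L-BFGS apply an approximate inverse Hessian while storing only the m most recent vector pairs (standard textbook form with s_i = x_{i+1} - x_i and y_i = grad f(x_{i+1}) - grad f(x_i); this is not code from the slides):

import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    # Two-loop recursion: approximate H^{-1} @ grad using the stored
    # (s, y) pairs (oldest first), without ever forming a matrix.
    q = grad.copy()
    rhos = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * s.dot(q)
        q = q - alpha * y
        alphas.append(alpha)
    # Scale by an initial diagonal Hessian approximation gamma * I
    if s_list:
        gamma = s_list[-1].dot(y_list[-1]) / y_list[-1].dot(y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to newest
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * y.dot(r)
        r = r + (alpha - beta) * s
    return r  # approximates H^{-1} @ grad; memory cost is O(m * n)

Storing only the vector pairs is what reduces the memory footprint from the O(n^2) of BFGS's matrix B to O(n) per stored pair.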
5. Recap

  Method                                        Cost per iteration   Problem size   Convergence
  Standard Newton method                        O(n^3)               Small          Very fast
  Quasi-Newton method (BFGS)                    O(n^2)               Medium         Fast
  Limited-memory Quasi-Newton method (L-BFGS)   O(n)                 Large          Reasonably fast
6. Empirical Study: Learning Conditional Exponential Model

  Iterations and training time, gradient ascent vs. the limited-memory Quasi-Newton method (L-BFGS):

  Dataset   Method            Iterations   Time (s)
  Rule      Gradient ascent   350          4.8
  Rule      L-BFGS            81           1.13
  Lex       Gradient ascent   1545         114.21
  Lex       L-BFGS            176          20.02
  Summary   Gradient ascent   3321         190.22
  Summary   L-BFGS            69           8.52
  Shallow   Gradient ascent   14527        85962.53
  Shallow   L-BFGS            421          2420.30

  Dataset   Instances   Features
  Rule      29,602      246
  Lex       42,509      135,182
  Summary   24,044      198,467
  Shallow   8,625,782   264,142
7. Free Software
- http://www.ece.northwestern.edu/nocedal/software.html
  - L-BFGS
  - L-BFGS-B
8. Conjugate Gradient
- Another great numerical optimization method!
9. Linear Conjugate Gradient Method
- Consider minimizing a quadratic function
- Conjugate vectors
  - The set of vectors p1, p2, ..., pl is said to be conjugate with respect to a matrix A if pi^T A pj = 0 for all i ≠ j
- Important property
  - The quadratic function can be optimized by simply optimizing it along the individual directions in the conjugate set
- Optimal solution
  - αk is the minimizer along the kth conjugate direction
  (the standard formulas are reconstructed below)
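The slide's equations did not survive extraction; the standard linear-CG formulation they refer to, in the usual notation, is:

% Quadratic objective, with A symmetric positive definite
f(\mathbf{x}) = \tfrac{1}{2}\,\mathbf{x}^\top A\,\mathbf{x} - \mathbf{b}^\top\mathbf{x}

% Conjugacy of the directions p_1, ..., p_l with respect to A
\mathbf{p}_i^\top A\,\mathbf{p}_j = 0 \quad \text{for all } i \neq j

% Optimal solution as a sum of independent one-dimensional minimizations
\mathbf{x}^\ast = \sum_{k} \alpha_k\,\mathbf{p}_k, \qquad
\alpha_k = \frac{\mathbf{p}_k^\top \mathbf{b}}{\mathbf{p}_k^\top A\,\mathbf{p}_k}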
10. Example
- Minimize the following quadratic function
- Matrix A
- Conjugate directions
- Optimization
  - First direction: x1 = x2
  - Second direction: x1 = -x2
- Solution: x1 = x2 = 1
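The slide's matrix A is not recoverable from the extracted text. As an illustration, assuming A = [[2, 1], [1, 2]], the directions along x1 = x2 and x1 = -x2 are indeed A-conjugate, which a quick numerical check confirms:

import numpy as np

# Assumed example matrix (the slide's actual A did not survive extraction).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

p1 = np.array([1.0, 1.0])    # direction along x1 = x2
p2 = np.array([1.0, -1.0])   # direction along x1 = -x2

# Conjugacy with respect to A: p1^T A p2 should equal 0.
print(p1 @ A @ p2)   # 0.0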
11. How to Efficiently Find a Set of Conjugate Directions
- Iterative procedure
  - Given conjugate directions p1, p2, ..., pk-1
  - Set pk as follows (standard construction reconstructed below)
- Theorem: the direction generated in the above step is conjugate to all previous directions p1, p2, ..., pk-1
- Note: computing the kth direction pk only requires the previous direction pk-1
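The update rule itself did not survive extraction; the standard linear-CG construction it refers to is (reconstructed):

% Residual (gradient of the quadratic) at the current iterate
\mathbf{r}_k = A\,\mathbf{x}_k - \mathbf{b}

% New direction: negative residual plus a multiple of the previous direction
\mathbf{p}_k = -\mathbf{r}_k + \beta_k\,\mathbf{p}_{k-1}, \qquad
\beta_k = \frac{\mathbf{r}_k^\top \mathbf{r}_k}{\mathbf{r}_{k-1}^\top \mathbf{r}_{k-1}}

% One can show p_k^T A p_j = 0 for all j < k, so only p_{k-1} needs to be kept.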
12. Nonlinear Conjugate Gradient
- Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions
  - Guaranteed to converge if the objective is convex/concave
- Variants
  - Fletcher-Reeves conjugate gradient (FR-CG)
  - Polak-Ribiere conjugate gradient (PR-CG)
    - More robust than FR-CG
- Compared to the Newton method
  - A first-order method
  - Usually less efficient than the Newton method
  - However, it is simple to implement (see the sketch below)
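Since the slide stresses that conjugate gradient is simple to implement, here is a minimal Polak-Ribiere sketch in NumPy; the Armijo backtracking line search and the PR+ restart rule are my assumptions, not details from the slides:

import numpy as np

def backtracking_line_search(f, grad_f, x, d, alpha=1.0, rho=0.5, c=1e-4):
    # Simple Armijo backtracking line search along direction d.
    fx, slope = f(x), grad_f(x).dot(d)
    while f(x + alpha * d) > fx + c * alpha * slope:
        alpha *= rho
    return alpha

def polak_ribiere_cg(f, grad_f, x0, max_iter=200, tol=1e-6):
    # Nonlinear conjugate gradient with the Polak-Ribiere (PR+) beta.
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        if g.dot(d) >= 0:          # not a descent direction: restart
            d = -g
        alpha = backtracking_line_search(f, grad_f, x, d)
        x_new = x + alpha * d
        g_new = grad_f(x_new)
        beta = max(0.0, g_new.dot(g_new - g) / g.dot(g))  # PR+ formula
        d = -g_new + beta * d
        x, g = x_new, g_new
    return x

# Usage example on a simple quadratic bowl (minimum at the origin):
f = lambda x: 0.5 * x.dot(x)
grad_f = lambda x: x
print(polak_ribiere_cg(f, grad_f, np.array([3.0, -2.0])))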
13. Empirical Study: Learning Conditional Exponential Model

  Iterations and training time, conjugate gradient (Polak-Ribiere) vs. the limited-memory Quasi-Newton method (L-BFGS):

  Dataset   Method    Iterations   Time (s)
  Rule      CG (PR)   142          1.93
  Rule      L-BFGS    81           1.13
  Lex       CG (PR)   281          21.72
  Lex       L-BFGS    176          20.02
  Summary   CG (PR)   537          31.66
  Summary   L-BFGS    69           8.52
  Shallow   CG (PR)   2813         16251.12
  Shallow   L-BFGS    421          2420.30

  Dataset   Instances   Features
  Rule      29,602      246
  Lex       42,509      135,182
  Summary   24,044      198,467
  Shallow   8,625,782   264,142
14. Free Software
- http://www.ece.northwestern.edu/nocedal/software.html
  - CG
15. When Should We Use Which Optimization Technique?
- Use the Newton method if you can find a package
- Use conjugate gradient if you have to implement it yourself
- Use gradient ascent/descent if you are lazy
16. Logarithm Bound Algorithms
- To maximize a function f(x)
- Start with a guess x0
- For t = 1, 2, ..., T
  - Compute
  - Find a decoupling function
  - Find the optimal solution
17. Logarithm Bound Algorithm
18. Logarithm Bound Algorithm
- Start with an initial guess x0
- Come up with a lower-bound function Φ(x) ≤ f(x) with Φ(x0) = f(x0)
- Find the optimal solution x1 for Φ(x)
- Repeat the above procedure
19. Logarithm Bound Algorithm
- Start with an initial guess x0
- Come up with a lower-bound function Φ(x) ≤ f(x) with Φ(x0) = f(x0)
- Find the optimal solution x1 for Φ(x)
- Repeat the above procedure
- Converges to the optimal point (see the argument below)
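A one-line reason why each bounding step makes progress (the standard argument, not spelled out on the extracted slide): since Φ lies below f everywhere, x_{t+1} maximizes Φ, and Φ touches f at x_t,

f(x_{t+1}) \;\ge\; \Phi(x_{t+1}) \;\ge\; \Phi(x_t) \;=\; f(x_t)

so the objective value never decreases from one iteration to the next.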
20. Property of Concave Functions
21. Important Inequality
- log(x) and -exp(x) are concave functions
- Therefore the following inequality holds (reconstructed below)
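The inequality the slide most likely states is Jensen's inequality for a concave function, reconstructed here in its usual form:

% For a concave function f and weights \lambda_i \ge 0 with \sum_i \lambda_i = 1:
f\!\left(\sum_i \lambda_i x_i\right) \;\ge\; \sum_i \lambda_i\, f(x_i)

% In particular, for f = \log:
\log\!\left(\sum_i \lambda_i x_i\right) \;\ge\; \sum_i \lambda_i \log x_i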
22. Expectation-Maximization Algorithm
- Derive the EM algorithm for the Hierarchical Mixture Model
- Log-likelihood of the training data (a generic mixture form is sketched below)
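The slide's log-likelihood expression did not survive extraction; for a generic K-component mixture model (an assumed stand-in for the hierarchical mixture on the slide), the quantity and the Jensen lower bound that EM optimizes look like:

% Log-likelihood of training data x_1, ..., x_N
\ell(\theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\, p(x_i \mid \theta_k)

% Lower bound via Jensen's inequality, for any distributions q_i(k) over components:
\ell(\theta) \;\ge\; \sum_{i=1}^{N} \sum_{k=1}^{K} q_i(k)\,
    \log \frac{\pi_k\, p(x_i \mid \theta_k)}{q_i(k)}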