Title: Lecture 2: Parameter Estimation and Evaluation of Support
Slide 1: Lecture 2: Parameter Estimation and Evaluation of Support
Slide 2: Parameter Estimation
"The problem of estimation is of more central importance (than hypothesis testing)... for in almost all situations we know that the effect whose significance we are measuring is perfectly real, however small; what is at issue is its magnitude." (Edwards, 1992, pg. 2)
"An insignificant result, far from telling us that the effect is non-existent, merely warns us that the sample was not large enough to reveal it." (Edwards, 1992, pg. 2)
Slide 3: Parameter Estimation
- Finding Maximum Likelihood Estimates (MLEs)
  - Local optimization (optim)
    - Gradient methods
    - Simplex (Nelder-Mead)
  - Global optimization
    - Simulated Annealing (anneal)
    - Genetic Algorithms (rgenoud)
- Evaluating the strength of evidence (support) for different parameter estimates
  - Support Intervals
    - Asymptotic Support Intervals
    - Simultaneous Support Intervals
  - The shape of likelihood surfaces around MLEs
Slide 4: Parameter estimation: finding peaks on likelihood surfaces...
The variation in likelihood for any given set of parameter values defines a likelihood surface...
The goal of parameter estimation is to find the peak of the likelihood surface... (optimization)
Slide 5: Local vs. Global Optimization
- Fast local optimization methods
  - Large family of methods, widely used for nonlinear regression in commercial software packages
- Brute force global optimization methods
  - Grid search
  - Genetic algorithms
  - Simulated annealing
[Figure: a likelihood surface with both a global optimum and a local optimum]
Slide 6: Local Optimization: Gradient Methods
- Derivative-based (Newton-Raphson) methods
[Figure: a likelihood surface]
General approach: vary the parameter estimate systematically and search for zero slope in the first derivative of the likelihood function... (using numerical methods to estimate the derivative, and checking the second derivative to make sure it is a maximum, not a minimum). A sketch in R follows below.
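As an illustration that is not part of the original slides, here is a minimal sketch of a gradient-based local search in R using optim() with the "BFGS" quasi-Newton method; the simulated data, the normal model, and the starting values are all invented for the example.

## Minimal sketch: gradient-based (quasi-Newton) local optimization with
## optim(method = "BFGS"). Data, model, and starting values are invented.
set.seed(42)
y <- rnorm(100, mean = 5, sd = 2)                # fake data for illustration

## Negative log-likelihood for a normal model; sd is kept positive by
## estimating it on the log scale
nll <- function(p) -sum(dnorm(y, mean = p[1], sd = exp(p[2]), log = TRUE))

fit <- optim(par = c(mu = 0, log_sd = 0), fn = nll, method = "BFGS")
c(mu = fit$par[["mu"]], sd = exp(fit$par[["log_sd"]]))   # MLEs on the natural scale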
Slide 7: Local Optimization: No Gradient
- The Simplex (Nelder-Mead) method
  - Much simpler to program
  - Does not require calculation or estimation of a derivative
  - No general theoretical proof that it works (but lots of happy practitioners)
  - Implemented as method "Nelder-Mead" in the optim function in R (see the sketch below)
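A minimal sketch of a Nelder-Mead fit with optim() is below; the linear model, the simulated data, and the starting values are assumptions made purely for illustration.

## Minimal sketch: derivative-free local optimization with the Nelder-Mead
## simplex (the default method in optim). Data and model are invented.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

## Negative log-likelihood for y ~ Normal(a + b*x, sd), with sd on the log scale
nll <- function(p) {
  mu <- p["a"] + p["b"] * x
  -sum(dnorm(y, mean = mu, sd = exp(p["log_sd"]), log = TRUE))
}

fit <- optim(par = c(a = 0, b = 0, log_sd = 0), fn = nll, method = "Nelder-Mead")
fit$par          # MLEs of a, b, and log(sd)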
Slide 8: Global Optimization: Grid Searches
- Simplest form of optimization (and rarely used in practice)
- Systematically search parameter space at a grid of points
- Can be useful for visualization of the broad features of a likelihood surface (see the sketch below)
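A minimal grid-search sketch in R, again using an invented linear model and simulated data; the grid ranges, and the choice to hold sd fixed so the grid stays two-dimensional, are illustrative assumptions.

## Minimal sketch: brute-force grid search over two parameters (a and b),
## with sd held fixed at 1 for simplicity. Data are invented.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
loglik <- function(a, b) sum(dnorm(y, mean = a + b * x, sd = 1, log = TRUE))

grid <- expand.grid(a = seq(0, 4, by = 0.1), b = seq(0, 1, by = 0.02))
grid$loglik <- mapply(loglik, grid$a, grid$b)      # evaluate at every grid point

grid[which.max(grid$loglik), ]                     # best grid point
## The grid can also be passed to contour() or image() to visualize the surface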
Slide 9: Global Optimization: Genetic Algorithms
- Based on a fairly literal analogy with evolution
  - Start with a reasonably large population of parameter sets
  - Calculate the fitness (likelihood) of each individual set of parameters
  - Create the next generation of parameter sets based on the fitness of the parents, and various rules for recombination of subsets of parameters (genes)
  - Let the population evolve until fitness reaches a maximum asymptote
- Implemented as genoud in the rgenoud package in R: cool, but slow for large datasets with large numbers of parameters (see the sketch below)
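A minimal sketch of a genoud() call from the rgenoud package is below; the model, the data, the population size, and the parameter bounds (Domains) are invented for illustration and would need tuning for a real problem.

## Minimal sketch: genetic-algorithm optimization with rgenoud::genoud().
## Model, data, population size, and bounds are invented for illustration.
library(rgenoud)
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
loglik <- function(p) sum(dnorm(y, mean = p[1] + p[2] * x, sd = exp(p[3]), log = TRUE))

fit <- genoud(fn = loglik, nvars = 3,
              max = TRUE,                                    # maximize the likelihood
              pop.size = 500,                                # population of parameter sets
              Domains = cbind(c(-10, -10, -5),               # lower bounds
                              c( 10,  10,  5)))              # upper bounds
fit$par      # best parameter set found
fit$value    # log-likelihood of that set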
Slide 10: Global Optimization: Simulated Annealing
- Analogy with the physical process of annealing
  - Start the process at a high temperature
  - Gradually reduce the temperature according to an annealing schedule
  - Always accept uphill moves (i.e. an increase in likelihood)
  - Accept downhill moves according to the Metropolis algorithm:
    p = exp(-Δlh / t)
    where p = probability of accepting a downhill move, Δlh = magnitude of the change in likelihood, and t = temperature (see the sketch below)
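Here is a minimal hand-rolled sketch (not the full Goffe et al. algorithm, and not the anneal function used later in the course) showing the Metropolis acceptance rule inside a simple annealing loop for a one-parameter problem; the data, proposal step size, and cooling schedule are invented.

## Minimal sketch: the Metropolis rule inside a simulated-annealing loop.
## Data, step size, and the annealing schedule are invented for illustration.
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)
loglik <- function(mu) sum(dnorm(y, mean = mu, sd = 2, log = TRUE))

mu_cur <- 0                        # current parameter value
ll_cur <- loglik(mu_cur)
temp   <- 3.0                      # initial temperature
for (step in 1:5000) {
  mu_new <- mu_cur + runif(1, -0.5, 0.5)        # propose a move
  ll_new <- loglik(mu_new)
  d      <- ll_new - ll_cur                     # change in log-likelihood
  ## Uphill moves (d > 0) are always accepted; downhill moves are accepted
  ## with probability p = exp(-|d| / temp) (the Metropolis criterion)
  if (d > 0 || runif(1) < exp(d / temp)) {
    mu_cur <- mu_new
    ll_cur <- ll_new
  }
  if (step %% 100 == 0) temp <- 0.95 * temp     # cool every 100 steps
}
mu_cur     # should end up near the MLE (the sample mean of y)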
Slide 11: Effect of temperature (t)
Slide 12: Simulated Annealing in practice...
A version with automatic adjustment of range...
[Figure: search range (step size) shown between a lower bound and an upper bound, with the current parameter value]
REFERENCES:
Goffe, W. L., G. D. Ferrier, and J. Rogers. 1994. Global optimization of statistical functions with simulated annealing. Journal of Econometrics 60:65-99.
Corana et al. 1987. Minimizing multimodal functions of continuous variables with the simulated annealing algorithm. ACM Transactions on Mathematical Software 13:262-280.
Slide 13: Effect of C on Adjusting Range...
Slide 14: Constraints: setting limits for the search...
- Biological limits
  - Values that make no sense biologically (be careful...)
- Algebraic limits
  - Values for which the model is undefined (i.e. dividing by zero...)
Bottom line: global optimization methods let you cast your net widely, at the cost of computer time... (a sketch of a bounded search in R follows below)
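As one illustration of imposing such limits in R (an assumption, not a method prescribed by the slides), optim()'s "L-BFGS-B" method accepts explicit lower and upper bounds; the model, data, and bounds here are invented. The global methods discussed above also take explicit bounds (lb/ub for anneal on the next slide, Domains for genoud).

## Minimal sketch: box constraints on a search using optim(method = "L-BFGS-B").
## Data, model, and the bounds themselves are invented for illustration.
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)
nll <- function(p) -sum(dnorm(y, mean = p[1], sd = p[2], log = TRUE))

fit <- optim(par = c(mu = 0, sd = 1), fn = nll, method = "L-BFGS-B",
             lower = c(-100, 1e-6),     # algebraic limit: sd must stay positive
             upper = c( 100, 50))       # loose "biological" limits on the values
fit$par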
Slide 15: Simulated Annealing: Initialization
- Set:
  - Annealing schedule
    - Initial temperature (t) (3.0)
    - Rate of reduction in temperature (rt) (0.95)
    - Interval between drops in temperature (nt) (100)
    - Interval between changes in range (ns) (20)
  - Parameter values
    - Initial values (x)
    - Upper and lower bounds (lb, ub)
    - Initial range (vm)
Typical values are shown in parentheses.
Slide 16: How many iterations?...
Logistic regression of windthrow susceptibility (188 parameters): 5 million iterations is not enough!
Red maple leaf litterfall (6 parameters): 500,000 iterations is way more than necessary!
What would constitute convergence?...
Slide 17: Optimization: Summary
- There are no hard and fast rules for any optimization; be willing to explore alternate options.
- Be wary of the initial values used in local optimization when the model is at all complicated.
- How about a hybrid approach? Start with simulated annealing, then switch to a local optimization (see the sketch below).
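A minimal sketch of one such hybrid in base R: a short run of optim()'s built-in "SANN" simulated-annealing method (used here purely as a stand-in for the anneal function from the slides) supplies starting values for a local Nelder-Mead polish; the model and data are invented.

## Minimal sketch of a hybrid strategy: a coarse simulated-annealing run,
## then a local refinement starting where the annealing run ended.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
nll <- function(p) -sum(dnorm(y, mean = p[1] + p[2] * x, sd = exp(p[3]), log = TRUE))

## Stage 1: global-ish search with simulated annealing
stage1 <- optim(par = c(0, 0, 0), fn = nll, method = "SANN",
                control = list(maxit = 20000, temp = 3))

## Stage 2: local polish with Nelder-Mead from the annealing result
stage2 <- optim(par = stage1$par, fn = nll, method = "Nelder-Mead")
stage2$par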
Slide 18: Evaluating the strength of evidence for the MLE
Now that you have an MLE, how should you evaluate it? (Hint: think about the shape of the likelihood function, not just the MLE.)
Slide 19: Strength of evidence for particular parameter estimates: Support
Log-likelihood = Support (Edwards 1992)
- Likelihood provides an objective measure of the strength of evidence for different parameter estimates...
Slide 20: Profile Likelihood
- Evaluate support (information) for a range of values of a given parameter by treating all other parameters as "nuisance" parameters and holding them at their MLEs (see the sketch below)
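Below is a minimal sketch of the procedure as described on this slide (one parameter is varied over a range while the others are held at their joint MLEs); the linear model, the simulated data, and the grid of values are invented for illustration.

## Minimal sketch: support for a range of values of one parameter (b),
## with the other parameters held at their MLEs. Data and model are invented.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
loglik <- function(a, b, sd) sum(dnorm(y, mean = a + b * x, sd = sd, log = TRUE))
nll    <- function(p) -loglik(p[1], p[2], exp(p[3]))

mle <- optim(c(0, 0, 0), nll)$par                 # joint MLEs of a, b, log(sd)

b_grid  <- seq(0.3, 0.7, by = 0.01)               # range of values for parameter b
support <- sapply(b_grid, function(b)
  loglik(a = mle[1], b = b, sd = exp(mle[3])))    # other parameters fixed at MLEs

plot(b_grid, support, type = "l",
     xlab = "b", ylab = "log-likelihood (support)")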
Slide 21: Asymptotic vs. Simultaneous M-Unit Support Limits
- Asymptotic Support Limits (based on profile likelihood)
  - Hold all other parameters at their MLE values, and systematically vary the remaining parameter until the likelihood declines by a chosen amount (m)... (see the sketch below)
What should m be? (2 is a good number, and is roughly analogous to a 95% CI)
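A minimal sketch of finding 2-unit asymptotic support limits for one parameter by root-finding with uniroot(); the model, the data, and the search brackets are invented for illustration.

## Minimal sketch: m = 2 unit asymptotic support limits for parameter b,
## found where support falls 2 units below its maximum while the other
## parameters stay at their MLEs. Data, model, and brackets are invented.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
loglik <- function(a, b, sd) sum(dnorm(y, mean = a + b * x, sd = sd, log = TRUE))
nll    <- function(p) -loglik(p[1], p[2], exp(p[3]))

mle    <- optim(c(0, 0, 0), nll)$par
ll_max <- loglik(mle[1], mle[2], exp(mle[3]))
m      <- 2

## Drop in support relative to the maximum; zero at the m-unit limits
drop_in_support <- function(b) ll_max - loglik(mle[1], b, exp(mle[3])) - m

lower <- uniroot(drop_in_support, c(mle[2] - 1, mle[2]))$root
upper <- uniroot(drop_in_support, c(mle[2], mle[2] + 1))$root
c(lower = lower, MLE = mle[2], upper = upper)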
Slide 22: Asymptotic vs. Simultaneous M-Unit Support Limits
- Simultaneous Support Limits
  - Resampling method: draw a very large number of random sets of parameters and calculate the log-likelihood of each. The m-unit simultaneous support limits for parameter xi are the upper and lower limits of xi among the sets whose support does not differ from the maximum by more than m units... (see the sketch below)
In practice, it can require an enormous number of iterations to do this if there are more than a few parameters.
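A minimal sketch of that resampling idea; the model, the data, the number of draws, and the sampling ranges for each parameter are invented, and a real problem would typically need far more draws.

## Minimal sketch: simultaneous 2-unit support limits by random resampling
## of parameter sets. Data, model, draw count, and ranges are invented.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
loglik <- function(p) sum(dnorm(y, mean = p[1] + p[2] * x, sd = p[3], log = TRUE))

n_draws <- 1e5
draws <- cbind(a  = runif(n_draws, 0, 4),      # random parameter sets drawn
               b  = runif(n_draws, 0, 1),      # uniformly within broad ranges
               sd = runif(n_draws, 0.5, 3))
ll <- apply(draws, 1, loglik)

keep <- ll > max(ll) - 2                       # sets within 2 units of the best
apply(draws[keep, ], 2, range)                 # simultaneous 2-unit limits per parameter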
Slide 23: Asymptotic vs. Simultaneous Support Limits
A hypothetical likelihood surface for 2 parameters...
[Figure: contour plot of a hypothetical likelihood surface over Parameter 1 and Parameter 2, showing the 2-unit drop in support and comparing the asymptotic and simultaneous 2-unit support limits for P1]
Slide 24: Other measures of strength of evidence for different parameter estimates
- Edwards (1992, Chapter 5)
  - Various measures of the shape of the likelihood surface in the vicinity of the MLE...
How pointed is the peak?...
Slide 25: Evaluating Support for Parameter Estimates: A Frequentist Approach
- Traditional confidence intervals and standard errors of the parameter estimates can be generated from the Hessian matrix
  - Hessian matrix: the matrix of second partial derivatives of the likelihood function with respect to the parameters, evaluated at the maximum likelihood estimates
  - Also called the Information Matrix by Fisher
  - Provides a measure of the steepness of the likelihood surface in the region of the optimum
  - Can be generated in R using optim and fdHess (see the sketch after this list)
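A minimal sketch of the optim() route (the fdHess alternative, from the nlme package, is not shown): here the negative log-likelihood is minimized with hessian = TRUE, so the variance-covariance matrix is the inverse of the returned Hessian. The output on the next two slides maximizes the log-likelihood instead, which is why it inverts the negative Hessian. The model and data below are invented.

## Minimal sketch: standard errors from the Hessian returned by optim().
## The *negative* log-likelihood is minimized here, so use solve(hessian);
## when maximizing the log-likelihood, use solve(-hessian) as on the next slides.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
nll <- function(p) -sum(dnorm(y, mean = p[1] + p[2] * x, sd = p[3], log = TRUE))

fit <- optim(par = c(a = 1, b = 1, sd = 1), fn = nll, method = "L-BFGS-B",
             lower = c(-Inf, -Inf, 1e-6),    # keep sd strictly positive
             hessian = TRUE)

vcov_mat <- solve(fit$hessian)               # variance-covariance matrix
se       <- sqrt(diag(vcov_mat))             # standard errors of a, b, and sd
cbind(estimate = fit$par, se = se,
      lower95 = fit$par - 1.96 * se, upper95 = fit$par + 1.96 * se)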
Slide 26: Example from R
The Hessian matrix (when maximizing a log-likelihood) is a numerical approximation of Fisher's Information Matrix (i.e. the matrix of second partial derivatives of the likelihood function), evaluated at the point of the maximum likelihood estimates. Thus, it's a measure of the steepness of the drop in the likelihood surface as you move away from the MLE.

> res$hessian
            a          b       sd
a    -150.182  -2758.360   -0.201
b   -2758.360 -67984.416   -5.925
sd     -0.202     -5.926 -299.422

(sample output from an analysis that estimates two parameters and a variance term)
Slide 27: More from R
Now invert ("solve" in R parlance) the negative of the Hessian matrix to get the matrix of parameter variances and covariances:

> solve(-1 * res$hessian)
              a             b            sd
a  2.613229e-02 -1.060277e-03  3.370998e-06
b -1.060277e-03  5.772835e-05 -4.278866e-07
sd 3.370998e-06 -4.278866e-07  3.339775e-03

The square roots of the diagonals of the inverted negative Hessian are the standard errors:

> sqrt(diag(solve(-1 * res$hessian)))
       a        b       sd
  0.1616 0.007597  0.05779

(and ±1.96 S.E. is a 95% C.I.)