Maximum likelihood (ML) and likelihood ratio (LR) test

1
Maximum likelihood (ML) and likelihood ratio (LR)
test
  • Conditional distribution and likelihood
  • Maximum likelihood estimator
  • Information in the data and likelihood
  • Observed and Fisher's information
  • Likelihood ratio test
  • Exercise

2
Introduction
  • It is often the case that we are interested in
    finding the values of some parameters of a system.
    We design an experiment and obtain observations
    (x1,...,xn), which we then want to use to estimate
    the parameters of the system. Once we know how the
    parameters and the observations are related (this
    may itself be a challenging mathematical problem),
    we can use that relation to estimate the
    parameters.
  • Maximum likelihood is one of the techniques for
    estimating parameters from observations or
    experiments. There are other estimation techniques
    as well, including Bayesian, least-squares,
    method-of-moments and minimum chi-squared
    estimation.
  • The result of the estimation is a function of the
    observations, t(x1,...,xn). It is a random
    variable, and in many cases we also want to find
    its distribution. In general, finding the
    distribution of a statistic is a challenging
    problem, but there are numerical techniques (e.g.
    the bootstrap, sketched below) to deal with it.
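  • A minimal sketch of the nonparametric bootstrap
    mentioned above, assuming NumPy; the chosen
    statistic (the sample mean) and all names are
    illustrative, not taken from the slides:

    import numpy as np

    def bootstrap_distribution(x, statistic, n_resamples=2000, seed=0):
        """Resample x with replacement and recompute the statistic each time."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x)
        return np.array([statistic(rng.choice(x, size=x.size, replace=True))
                         for _ in range(n_resamples)])

    # Example: approximate the distribution of t(x1,...,xn) = sample mean.
    sample = np.random.default_rng(1).exponential(scale=2.0, size=50)
    boot = bootstrap_distribution(sample, np.mean)
    print("bootstrap estimate of the standard error:", boot.std(ddof=1))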

3
Desirable properties of estimation
  • Unbiasedness. The bias is defined as the
    difference between the expectation of the
    estimator (t) and the true parameter (θ), where
    the expectation is taken over the probability
    distribution of the observations.
  • Efficiency. An efficient estimator is one with
    minimum variance (var(t)).
  • Consistency. As the number of observations goes
    to infinity, the estimator converges to the true
    value.
  • Minimum mean square error (m.s.e.). The m.s.e. is
    defined as the expected value of the square of the
    difference (error) between the estimator and the
    true value.
  • Since the m.s.e. is the sum of the variance and
    the squared bias, a minimum-m.s.e. estimator must
    keep both small. It is very difficult to achieve
    all these properties at once. Under some
    regularity conditions the ML estimator has them
    asymptotically. Moreover, the ML estimator is
    asymptotically normal, which simplifies the
    interpretation of results. (These quantities are
    written out below.)
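  • The quantities above, written out (standard
    definitions; the slide's original formula images
    are not reproduced in this transcript):

    \[ \mathrm{bias}(t) = E(t) - \theta, \qquad \mathrm{var}(t) = E\{(t - E(t))^2\} \]
    \[ \mathrm{m.s.e.}(t) = E\{(t - \theta)^2\} = \mathrm{var}(t) + \mathrm{bias}(t)^2 \]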

4
Conditional probability distribution and
likelihood
  • Let us assume that we know that our sample points
    came from a population whose distribution has
    parameter(s) θ. We do not know θ. If we knew it,
    we could write the probability distribution of a
    single observation, f(x|θ). Here f(x|θ) is the
    conditional distribution of the observed random
    variable given the parameter. If we observe n
    independent sample points from the same
    population, then the joint conditional probability
    distribution of all observations can be written as
    a product (first formula below).
  • We can write the product of the individual
    probability distributions because the observations
    are independent (conditionally independent given
    the parameters). f(x|θ) is the probability of an
    observation in the discrete case and the density
    of the distribution in the continuous case.
  • We can interpret f(x1,x2,...,xn|θ) as the
    probability of observing the given sample points
    if we knew the parameter θ. If we vary the
    parameter(s), we get different values of f. Since
    f is a probability distribution, the parameters
    are fixed and the observations vary. For a given
    observation we define the likelihood to be
    proportional to this conditional probability
    distribution (second formula below).
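  • The two formulas referred to above (a standard
    reconstruction of the missing slide formulas):

    \[ f(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \]
    \[ L(\theta \mid x_1, \dots, x_n) \propto f(x_1, x_2, \dots, x_n \mid \theta) \]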

5
Conditional probability distribution and
likelihood (cont.)
  • When we talk about the conditional probability
    distribution of the observations given the
    parameter(s), we assume that the parameters are
    fixed and the observations vary. When we talk
    about the likelihood, the observations are fixed
    and the parameters vary. That is the major
    difference between the likelihood and the
    conditional probability distribution. To emphasise
    that the parameters vary and the observations are
    fixed, the likelihood is sometimes written as
    shown below.
  • In this and the following lectures we will use one
    notation for probability and likelihood. When we
    talk about probability we will assume that the
    observations vary, and when we talk about
    likelihood we will assume that the parameters
    vary.
  • The principle of maximum likelihood states that
    the best parameters are those that maximise the
    probability of observing the current values of the
    observations. Maximum likelihood chooses the
    parameters that satisfy the condition written
    below.
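  • The notation and the maximum likelihood condition
    referred to above (standard forms, reconstructed
    because the original formulas are not in the
    transcript):

    \[ L(\theta \mid x) \propto f(x \mid \theta) \]
    \[ \hat{\theta} = \arg\max_{\theta} L(\theta \mid x) \]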

6
Maximum likelihood
  • The purpose of maximum likelihood is to maximise
    the likelihood function and thereby estimate the
    parameters. If the derivatives of the likelihood
    function exist, this can be done by setting them
    to zero (first equation below).
  • Solutions of this equation are candidate maximum
    likelihood estimators. If the solution is unique,
    then it is the only estimator. In real
    applications there may be many solutions.
  • Usually, instead of the likelihood, its logarithm
    is maximised. Since log is a strictly
    monotonically increasing function, the derivative
    of the likelihood and the derivative of the
    log-likelihood have exactly the same roots. Using
    the fact that the observations are independent,
    the joint probability distribution of all
    observations is equal to the product of the
    individual probabilities, and the log-likelihood
    (denoted l) becomes a sum (second equation below).
  • Usually working with sums is easier than working
    with products.
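  • The two equations referred to above (standard
    forms):

    \[ \frac{\partial L(\theta \mid x)}{\partial \theta} = 0 \]
    \[ l(\theta) = \log L(\theta \mid x) = \sum_{i=1}^{n} \log f(x_i \mid \theta) \]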

7
Maximum likelihood example: success and failure
  • Let us consider two examples. The first example
    corresponds to a discrete probability
    distribution. Assume that we carry out trials
    whose possible outcomes are success or failure.
    The probability of success is θ and the
    probability of failure is 1-θ. We do not know the
    value of θ. Assume we have n trials, k of which
    are successes and n-k of which are failures. The
    random variable describing each trial takes the
    value 0 (failure) or 1 (success). Denote the
    observations by y = (y1,y2,...,yn). The
    probability of the observation yi at the i-th
    trial is given below.
  • Since the individual trials are independent, we
    can write the likelihood for n trials and take its
    logarithm.
  • Setting the derivative of the log-likelihood
    w.r.t. the unknown parameter to zero gives the
    estimator.
  • The ML estimator of the parameter is equal to the
    fraction of successes (see the derivation below).
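  • The formulas for this example, written out
    (standard reconstruction):

    \[ P(y_i \mid \theta) = \theta^{\,y_i}(1-\theta)^{1-y_i} \]
    \[ L(\theta \mid y) = \theta^{k}(1-\theta)^{n-k}, \qquad l(\theta) = k\log\theta + (n-k)\log(1-\theta) \]
    \[ \frac{dl}{d\theta} = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{k}{n} \]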

8
Maximum likelihood example: success and failure
  • In the example of successes and failures the
    result was not unexpected, and we could have
    guessed it intuitively. More interesting problems
    arise when the parameter θ itself becomes a
    function of other parameters and possibly of
    further observed variables, say θi = f(xi, β) for
    some parameter vector β.
  • It may happen that the xi are themselves random
    variables. If that is the case and the function
    corresponds to the normal (cumulative)
    distribution, then the analysis is called probit
    analysis, and the log-likelihood takes the form
    used in the sketch below.
  • Finding the maximum of this function is more
    complicated. The problem can be treated as a
    non-linear optimisation problem. Such problems are
    usually solved iteratively, i.e. a solution is
    guessed and then improved step by step.
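  • A minimal sketch of such an iterative fit,
    assuming a probit model θi = Φ(β0 + β1 xi) and
    SciPy's general-purpose optimiser; the simulated
    data, parameter names and optimiser choice are
    illustrative assumptions, not taken from the
    slides. The log-likelihood being maximised is
    l(β) = Σi [yi log Φ(β0 + β1 xi)
    + (1 - yi) log(1 - Φ(β0 + β1 xi))].

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def negative_log_likelihood(beta, x, y):
        # theta_i = Phi(beta0 + beta1 * x_i): success probability of trial i
        p = norm.cdf(beta[0] + beta[1] * x)
        p = np.clip(p, 1e-12, 1 - 1e-12)   # guard against log(0)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Simulated data with true parameters (0.5, 1.5), purely for illustration.
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = (rng.uniform(size=200) < norm.cdf(0.5 + 1.5 * x)).astype(float)

    # Guess a starting point and improve it iteratively (BFGS by default).
    result = minimize(negative_log_likelihood, x0=np.zeros(2), args=(x, y))
    print("ML estimates of (beta0, beta1):", result.x)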

9
Maximum likelihood example: normal distribution
  • Now let us assume that the sample points came from
    a population with a normal distribution with
    unknown mean and variance. Assume that we have n
    observations, y = (y1,y2,...,yn), and we want to
    estimate the population mean and variance. The
    log-likelihood function then has the form given
    below.
  • Taking the derivatives of this function w.r.t. the
    mean and the variance gives two equations (see
    below).
  • Fortunately, the first of these equations can be
    solved without knowledge of the second one. If we
    then use the result of the first solution in the
    second equation (substituting μ by its estimate),
    we can solve the second equation as well. The
    result is the sample variance.
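  • The log-likelihood, the two score equations and
    their solutions (standard reconstruction of the
    missing formulas):

    \[ l(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 \]
    \[ \frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \bar{y} \]
    \[ \frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - \mu)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 \]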

10
Maximum likelihood example: normal distribution
  • The maximum likelihood estimator in this case
    gives the sample mean and the biased sample
    variance. Many statistical techniques are based on
    maximum likelihood estimation of the parameters
    when the observations are normally distributed.
    The parameters of interest usually enter through
    the mean value; in other words, the mean is a
    function of several parameters.
  • The problem is then to estimate these parameters
    using the maximum likelihood estimator. Usually
    the x-s are either fixed values (fixed effects
    model) or random variables (random effects model),
    and the unknowns are the parameters of the mean
    function. If this function is linear in the
    parameters, we have linear regression.
  • If the variances are known, then the maximum
    likelihood estimator based on normally distributed
    observations becomes the least-squares estimator
    (see below).
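  • Why this reduces to least squares, sketched under
    the assumption that the mean is μi = f(xi, β) and
    σ² is known: the part of l that depends on β is

    \[ l(\beta) = \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - f(x_i, \beta)\bigr)^2, \]

    so maximising the likelihood is the same as
    minimising the sum of squared residuals.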

11
Information matrix: Observed and Fisher's
  • One important aspect of the likelihood function is
    its behaviour near the maximum. If the likelihood
    function is flat, then the observations have
    little to say about the parameters, because
    changes in the parameters do not cause large
    changes in the probability; that is, the same
    observations could have occurred with similar
    probabilities for various values of the
    parameters. On the other hand, if the likelihood
    has a pronounced peak near the maximum, then small
    changes in the parameters cause large changes in
    the probability. In this case we say that the
    observations carry more information about the
    parameters. This is usually expressed through the
    second derivative (curvature) of the minus
    log-likelihood function: the observed information
    is equal to the second derivative of the minus
    log-likelihood function (first formula below).
  • When there is more than one parameter, it is
    called the information matrix.
  • Usually it is calculated at the maximum of the
    likelihood. There are other definitions of
    information as well.
  • Example: in the case of successes and failures we
    can write the observed information explicitly
    (second formula below).
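  • The observed information and its form for the
    success/failure example (standard reconstruction):

    \[ I(\theta) = -\frac{\partial^2 l(\theta)}{\partial \theta\,\partial \theta^{T}} \]
    \[ I(\theta) = \frac{k}{\theta^2} + \frac{n-k}{(1-\theta)^2} \quad \text{(successes and failures)} \]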

12
Information matrix: Observed and Fisher's
  • The expected value of the observed information
    matrix is called the expected information matrix,
    or Fisher's information. The expectation is taken
    over the observations (first formula below).
  • It can be calculated at any value of the
    parameter. A remarkable fact about Fisher's
    information matrix is that it is also equal to the
    expected value of the product of the gradients
    (first derivatives) of the log-likelihood.
  • Note that the observed information matrix depends
    on the particular observations, whereas the
    expected information matrix depends only on the
    probability distribution of the observations (it
    is the result of an integration: when we integrate
    over some variables we lose the dependence on
    their particular values).
  • When the sample size becomes large, the maximum
    likelihood estimator becomes approximately
    normally distributed with variance close to the
    inverse of the information matrix (second formula
    below).
  • Fisher pointed out that inverting the observed
    information matrix gives a slightly better
    estimate of the variance than inverting the
    expected information matrix.
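  • Fisher's (expected) information and the asymptotic
    distribution referred to above (standard forms):

    \[ \mathcal{I}(\theta) = E\left[-\frac{\partial^2 l(\theta)}{\partial \theta\,\partial \theta^{T}}\right] = E\left[\frac{\partial l(\theta)}{\partial \theta}\,\frac{\partial l(\theta)}{\partial \theta^{T}}\right] \]
    \[ \hat{\theta} \;\approx\; N\!\left(\theta,\; \mathcal{I}(\theta)^{-1}\right) \quad \text{for large } n \]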

13
Information matrix: Observed and Fisher's
  • A more precise relation between the expected
    information and the variance is given by the
    Cramér-Rao inequality. According to this
    inequality, the variance of an unbiased estimator
    can never be less than the inverse of the
    information (first formula below).
  • Now let us return to the example of successes and
    failures. Taking the expectation of the second
    derivative of the minus log-likelihood function
    gives the expected information (second formula
    below).
  • Evaluating this at the point of maximum likelihood
    gives an approximation to the variance of the
    maximum likelihood estimator (third formula
    below).
  • This statement is valid for large sample sizes.
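  • The inequality and the success/failure
    calculation, written out (standard
    reconstruction):

    \[ \mathrm{var}(t) \;\ge\; \mathcal{I}(\theta)^{-1} \quad \text{(for an unbiased estimator } t\text{)} \]
    \[ E\left[\frac{k}{\theta^2} + \frac{n-k}{(1-\theta)^2}\right] = \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2} = \frac{n}{\theta(1-\theta)} \]
    \[ \mathrm{var}(\hat{\theta}) \;\approx\; \frac{\hat{\theta}(1-\hat{\theta})}{n} \]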

14
Likelihood ratio test
  • Assume that we have a sample of size n,
    x = (x1,...,xn), and we want to estimate a
    parameter vector θ = (θ1,θ2), where both θ1 and
    θ2 may themselves be vectors. We want to test the
    null hypothesis H0: θ1 = θ10 against the
    alternative H1: θ1 ≠ θ10, with θ2 unrestricted in
    both cases.
  • Assume that the likelihood function is L(x|θ). The
    likelihood ratio test then works as follows: 1)
    maximise the likelihood function under the null
    hypothesis (i.e. with the parameter(s) θ1 fixed at
    θ10) and record the value of the likelihood at
    this maximum; 2) maximise the likelihood under the
    alternative hypothesis (i.e. unconditional
    maximisation) and record the value of the
    likelihood at this maximum; then take the ratio of
    the two (first formula below).
  • w is the likelihood ratio statistic, and tests
    based on it are called likelihood ratio tests. In
    this case it is clear that 0 ≤ w ≤ 1.
  • If the value of w is small, the null hypothesis is
    rejected. If g(w) is the density of the
    distribution of w, then the critical region can be
    calculated from the tail integral below.
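  • The statistic and the critical region, written out
    (standard reconstruction); the asymptotic χ²
    result is a well-known addition (Wilks), not
    stated on the original slide:

    \[ w = \frac{\max_{\theta_2} L(x \mid \theta_{10}, \theta_2)}{\max_{\theta_1, \theta_2} L(x \mid \theta_1, \theta_2)}, \qquad 0 \le w \le 1 \]
    \[ \int_{0}^{w_\alpha} g(w)\,dw = \alpha \quad \text{(reject } H_0 \text{ if } w \le w_\alpha\text{)} \]
    \[ -2\log w \;\to\; \chi^2_{\dim(\theta_1)} \quad \text{in distribution under } H_0 \text{ as } n \to \infty \]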

15
References
  • Berthold, M. and Hand, D.J. (2003) Intelligent
    Data Analysis.
  • Stuart, A., Ord, J.K. and Arnold, S. (1991)
    Kendall's Advanced Theory of Statistics, Volume
    2A: Classical Inference and the Linear Model.
    Arnold, London, Sydney, Auckland.

16
Exercise 1
  • a) Assume that we have a sample of size n drawn
    independently from a population with the density
    of an exponential distribution.
  • What is the maximum likelihood estimator for θ?
    What are the observed and expected information?
  • b) Assume that we have a sample of size n of
    two-dimensional vectors, (x1,x2) = ((x11,x21),
    (x12,x22), ..., (x1n,x2n)), from a normal
    distribution.
  • Find the maximum of the likelihood under the
    following hypotheses.
  • Try to find the likelihood ratio statistic.
  • Note that the variance is also unknown.