Large-Sample Estimation - PowerPoint PPT Presentation

Provided by: Eds140

1
Large-Sample Estimation
  • Stat 700 Lecture 09
  • 10/18-10/23

2
Overview of Lecture
  • The Problem of Statistical Inference
  • Methods of Inference
  • Estimation (Point and Interval)
  • Hypotheses Testing
  • Point Estimation of the Mean, Standard Deviation,
    and Proportion
  • Interval Estimation of the Mean and Proportion
  • Sample Size Determination
  • Estimation of the Difference of Means
  • Estimation of the Difference of Proportions

3
The Problem of Inference
  • What we now know!
  • Population: the collection of interest to us.
  • Population Models: provided by probability models
    such as the Bernoulli distribution, normal
    distribution, exponential distribution, etc.
  • (Population) Parameters: characteristics of the
    population/distributions. Examples are the mean
    μ, the standard deviation σ, and the (population)
    proportion p. Others are the (population) median
    and the population quartiles.
  • Goal: to know these parameters to make decisions.

4
Inference Problem continued
  • We also know how to
  • take a sample from a population (by surveys or
    designed experiments), and
  • to compute sample statistics, which are
    characteristics of the sample. For example, we
    can compute the sample mean (X̄), sample standard
    deviation (S), and the sample proportion (p̂).
  • Goal: to use these sample statistics to infer
    about the population parameters.

5
Inference Problem continued
  • Furthermore, we also know how
  • sample statistics behave in a probabilistic way,
    when we consider the experiment of taking a
    sample from a population, by looking at the
    statistics' sampling distributions. In
    particular, we know the mean of a sample
    statistic as well as its variability as measured
    by its standard error.
  • A thing to realize is that the sample statistic
    will usually not coincide with the associated
    parameter, but will tend to cluster around the
    value of the parameter, especially when the
    sample size is large enough!

6
Inference Problem 1: Estimation
  • The basic questions when dealing with estimation
    problems are
  • Based on the sample data, what is the value of
    the parameter of interest? This is the problem
    of point estimation.
  • or
  • Based on the sample data, what is an interval of
    values that, with a pre-specified confidence,
    contains the value of the parameter? This is the
    problem of interval estimation or construction of
    a confidence interval.

7
Inference Problem 2: Hypotheses Testing
  • In hypotheses testing, on the other hand, our aim
    is to determine, based on the sample data, which
    of two complementary propositions, called
    statistical hypotheses, about the parameter of
    interest is true.
  • In hypotheses testing, we are not really
    interested in knowing the exact value of the
    parameter, but rather we are simply interested in
    deciding between competing claims about the
    parameter based on the sample data.

8
An Illustration
  • Situation: The population of interest is the
    collection of all American households and their
    annual out-of-pocket medical expenses. Suppose
    that we would like to determine the proportion,
    p, of American households which incur at least
    $1000 in out-of-pocket medical expenses during
    the year. This p is the parameter of interest.
  • Why is this parameter, p, relevant in public
    policy?
  • Except for the fact that p is between 0 and 1 we
    do not know its exact value.

9
Illustration continued
  • Study: We take an SRS of n = 2000 American
    households, and determine for each household
    their annual out-of-pocket medical expenses.
    Suppose that out of these 2000 households, 114
    incurred out-of-pocket medical expenses of at
    least $1000, so p̂ = 114/2000 = .057.
  • Problem of Estimation: Based on the sample data,
    what is the value of p? Or, what is an interval
    [L, U] such that we will be 95% confident that p
    is in this interval?
  • Problem of Hypotheses Testing: Based on the
    sample data, which of the following statements is
    true: p is less than 0.05, or p is at least 0.05?
10
Point Estimation
  • For our discussion, we shall let θ denote a
    generic population parameter, so it could be the
    mean μ, the variance σ², the standard deviation
    σ, or the proportion p.
  • A point estimator (denoted by θ̂) of a parameter
    θ is a procedure, a rule, or a formula for
    obtaining a value from the sample data which will
    serve as an estimate of θ. As such, a point
    estimator is a sample statistic.
  • When the data has been obtained, the realized
    value of a point estimator is called a point
    estimate.

11
Examples of Point Estimators
  • Example 1: For estimating the population mean μ,
    possible point estimators are
  • Estimator 1: Sample Mean
  • Estimator 2: Sample Median
  • Estimator 3: Sample Midrange, which is the
    average value of the smallest and largest
    observations
  • Estimator 4: ((Sum of Observations) + 1)/(n + 2)
  • Question: Which among these four possible point
    estimators should we use?
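The four candidate estimators can be written as short functions. The sketch below is in Python, which is not part of the lecture; the function names and the sample are illustrative assumptions. The sample is hypothetical, chosen only so that it has n = 10, smallest value 2, largest value 5, and sum 31.

```python
import statistics


def sample_mean(xs):
    """Estimator 1: arithmetic average of the observations."""
    return statistics.mean(xs)


def sample_median(xs):
    """Estimator 2: middle value of the sorted observations."""
    return statistics.median(xs)


def sample_midrange(xs):
    """Estimator 3: average of the smallest and largest observations."""
    return (min(xs) + max(xs)) / 2


def estimator4(xs):
    """Estimator 4: ((sum of observations) + 1) / (n + 2)."""
    return (sum(xs) + 1) / (len(xs) + 2)


# Hypothetical sample (not taken from the lecture's simulation output).
sample = [2, 2, 2, 2, 2, 4, 4, 4, 4, 5]

for name, estimator in [
    ("Sample Mean", sample_mean),
    ("Sample Median", sample_median),
    ("Sample Midrange", sample_midrange),
    ("Estimator 4", estimator4),
]:
    print(f"{name}: {estimator(sample):.4f}")
```

All four functions take the same data and return a single number, which is what makes each of them a sample statistic.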

12
Examples continued
  • Example 2: For estimating the population
    proportion p, a point estimator is the sample
    proportion, p̂, which is the proportion of
    successes in the sample.
  • Example 3: For estimating the population
    variance σ², a possible point estimator is the
    sample variance S². This is the variance formula
    with divisor of (n-1). However, another possible
    estimator of σ² is the same formula with divisor
    of n,
  • $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$.

13
Comparing Competing Estimators
  • Suppose there are several possible estimators of
    a parameter (for example, in estimating the
    population mean, there could be several candidate
    estimators). How do we decide which estimator to
    use?
  • What are the desirable or good properties that we
    want from our estimators?
  • How do we know which estimator will have the
    desirable properties?

14
Desirable Properties of Estimators
  • Ideally, an estimator should always give the
    exact value of the parameter, whatever that value
    is. But this will never be satisfied in reality!
  • Property of Unbiasedness: On the average, the
    estimator should equal the parameter being
    estimated. Formally, this means that the mean of
    the sampling distribution of the estimator
    (recall that an estimator is a sample statistic,
    so it has a sampling distribution) should equal
    the value of the parameter it is estimating,
    whatever the value of the parameter is.

15
Desirable Properties continued
  • For example, our study of the sampling
    distribution of the sample mean showed that the
    mean of the sample mean is equal to the
    population mean, so the sample mean is unbiased
    for the population mean.
  • The sample proportion is also unbiased for the
    population proportion.
  • The sample variance S² is also unbiased for the
    population variance σ². This is the reason for
    dividing by (n-1) in the formula.
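The effect of the divisor can be checked exactly on a small case. The sketch below is an illustration, not part of the lecture: it assumes a three-valued population (values 2, 4, 5 with probabilities .4, .5, .1), enumerates every possible ordered sample of size n = 3, and averages the variance formula over all of them, once with divisor n - 1 and once with divisor n. All names are hypothetical.

```python
from itertools import product
from math import prod

# Assumed population: value -> probability.
probs = {2: 0.4, 4: 0.5, 5: 0.1}
n = 3  # small sample size so that full enumeration is cheap


def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, over `divisor`."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor


pop_mean = sum(v * p for v, p in probs.items())
pop_var = sum((v - pop_mean) ** 2 * p for v, p in probs.items())

expected_s2 = 0.0   # mean of the sampling distribution, divisor n - 1
expected_alt = 0.0  # mean of the sampling distribution, divisor n
for sample in product(probs, repeat=n):
    weight = prod(probs[x] for x in sample)  # probability of this sample
    expected_s2 += weight * variance(sample, n - 1)
    expected_alt += weight * variance(sample, n)

print(f"population variance      : {pop_var:.4f}")
print(f"E[S^2], divisor n - 1    : {expected_s2:.4f}")
print(f"E[variance], divisor n   : {expected_alt:.4f}")
```

With divisor n - 1 the average matches the population variance; with divisor n it falls short by the factor (n - 1)/n.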

16
Desirable Properties continued
  • Property of Small Variation: This is the
    property of an estimator being precise in the
    sense that its variability is small. In
    practical terms, we want the values of the
    estimator to be closely clustered around what it
    is trying to estimate.
  • The variability of an estimator is measured by
    the standard deviation of its sampling
    distribution, which we now call the standard
    error. The smaller the standard error is, the
    more desirable the estimator, provided that it is
    unbiased.

17
Margin of Error (ME) of an Estimator
  • When reporting a point estimate, we report also
    its measure of variability, and this measure of
    variability is usually reported as the margin of
    error (ME) of the estimate, which is equal to
    1.96 times its standard error. That is,
  • ME = 1.96(Std. Error)

18
Interpretation of the Margin of Error
  • The reason for this definition of the margin of
    error is that the sampling distribution of the
    estimators will usually be approximately normal
    (by the central limit theorem) with mean equal to
    the value of the parameter being estimated, hence
    the interval from
  • (Parameter Value) - 1.96(Std. Error) to
  • (Parameter Value) + 1.96(Std. Error)
  • will contain approximately 95% of all the
    possible values of the estimator. Therefore,
    approximately 95% of the time, the point estimate
    will not differ by more than one ME from the true
    parameter value.
  • But why 95%? It is the convention handed to us!
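The link between the multiplier 1.96 and the figure of 95% can be checked numerically. This is a small sketch using the standard normal distribution from Python's standard library; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability that a standard normal value lies within 1.96 of its mean.
coverage = Z.cdf(1.96) - Z.cdf(-1.96)

# Going the other way: the multiplier that leaves 2.5% in each tail.
multiplier = Z.inv_cdf(0.975)

print(f"P(-1.96 < Z < 1.96) = {coverage:.4f}")
print(f"z for 95% coverage  = {multiplier:.4f}")
```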

19
Illustration of Comparison of Estimators
  • To see in a concrete way how estimators are
    compared, consider the estimation of the
    population mean in the population considered in
    the discussion of sampling distributions. This
    population has
  • p(2) = .4, p(4) = .5, p(5) = .1
  • Population Mean: μ = 3.3
  • Population Standard Deviation: σ = 1.1
  • We compare the four estimators of the mean
    mentioned earlier:
  • Sample Mean, Sample Median, Sample Midrange, and
    ((Sum of Xs) + 1)/(n + 2).

20
Comparison continued
  • Our comparison will be based on samples of size
    n = 10. A theoretical comparison is not easy, so
    we rely on a Monte Carlo simulation.
  • We generate 500 samples of size n = 10 from the
    population and for each sample compute the
    estimate based on each of the 4 estimators.
  • We then look at the simulated sampling
    distributions of the 4 estimators to see which
    estimators are unbiased and compare their
    variability.
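A simulation of this kind can be sketched in a few lines of Python. This is not the lecture's original program: the seed, the function names, and the use of Python's `random` module are assumptions, so the numbers it prints will not match the lecture's tables exactly. It assumes the population p(2) = .4, p(4) = .5, p(5) = .1 with mean 3.3.

```python
import random
import statistics

VALUES = [2, 4, 5]
WEIGHTS = [0.4, 0.5, 0.1]
TRUE_MEAN = 3.3


def midrange(xs):
    return (min(xs) + max(xs)) / 2


def estimator4(xs):
    return (sum(xs) + 1) / (len(xs) + 2)


ESTIMATORS = {
    "sample mean": statistics.mean,
    "sample median": statistics.median,
    "sample midrange": midrange,
    "estimator 4": estimator4,
}


def simulate(num_samples=500, n=10, seed=700):
    """Return {estimator name: (mean, std. dev.) of its simulated values}."""
    rng = random.Random(seed)  # fixed seed (arbitrary) for repeatability
    draws = {name: [] for name in ESTIMATORS}
    for _ in range(num_samples):
        sample = rng.choices(VALUES, weights=WEIGHTS, k=n)
        for name, estimator in ESTIMATORS.items():
            draws[name].append(estimator(sample))
    return {
        name: (statistics.mean(vals), statistics.stdev(vals))
        for name, vals in draws.items()
    }


summary = simulate()
for name, (avg, sd) in summary.items():
    print(f"{name:16s} mean={avg:.3f}  bias={avg - TRUE_MEAN:+.3f}  sd={sd:.3f}")
```

The mean column estimates the center of each sampling distribution (for judging bias), and the sd column estimates each standard error (for judging precision).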

21
First 10 Samples from the Simulation
  • For sample 1: Sample Midrange = (2 + 5)/2 = 3.5,
    while Estimate 4 = (31 + 1)/(10 + 2) = 32/12 =
    2.6667.
  • Sample Mean and Sample Median are computed the
    usual way.

22
Boxplots of the Simulated Sampling Distributions
Recall: Target is μ = 3.3

23
Histograms of the Simulated Sampling
Distributions Using Same Scales

24
Parameters of the Simulated Sampling
Distributions and Comparisons
  • Sample mean is closest to being unbiased. Next is
    the sample midrange, although it is still biased.
  • Sample Median and Estimator 4 are very biased.
  • Sample median is very variable or imprecise.
  • Sample mean is best, though midrange is also good.

25
Point Estimation of the Mean μ
  • When the population of interest is normal with
    (unknown) mean μ and standard deviation σ, then,
    based on theoretical analysis, the best estimator
    of μ is the sample mean X̄. The margin of error
    is
  • ME = (1.96)(σ/√n).
  • If σ is not known then the margin of error could
    be reported as
  • ME = (1.96)(S/√n)
  • where S is the sample standard deviation.

26
Point Estimation of Mean ...
  • When the population is not normal and the sample
    size is large, the sample mean need not be the
    best estimator anymore, but it is still unbiased
    for the population mean, and has decent
    variability.
  • For example, when the population is Uniform, the
    population mean is best estimated by the Sample
    Midrange instead of the Sample Mean.
  • However, for our purposes, we will simply use the
    Sample Mean as estimator of the population mean,
    and its margin of error will be (assuming σ is
    not known)
  • ME = (1.96)(S/√n).

27
Point Estimation of the Population Proportion, p
  • When the population is Bernoulli so the parameter
    of interest is p, the proportion of Successes
    in the population, then the best estimator of p
    is the sample proportion p̂.
  • When np > 5 and n(1-p) > 5, then its margin of
    error is estimated by
  • $\text{ME} = 1.96\sqrt{\hat{p}(1-\hat{p})/n}$.

28
An Example
  • Situation: Suppose we want to estimate the mean
    systolic blood pressure for the population of
    1910 people in the blood pressure data set.
  • Sample: We take a sample of size n = 30 from the
    population, and the sample data is
  • 100, 110, 118, 134, ., 92, 104, 100, 110, 130,
    110, 132, 102, 128, 88, 135, 140, 90, 108, 112,
    100, 130, 136, 124, 150, 138, 130, 104, 114, 110
  • The one dot indicates a missing value in the data
    so n = 29 in this case.

29
Example continued
  • Sample Statistics: X̄ = 116.52, S = 16.76
  • Therefore, the point estimate for μ is
  • X̄ = 116.52
  • with margin of error of
  • ME = (1.96)(16.76/√29) = 6.10.
  • Interpretation: We are 95% confident that the
    true mean systolic blood pressure for the
    population is therefore between
  • [116.52 - 6.10, 116.52 + 6.10] = [110.42, 122.62]
  • Indeed, the true value of μ is 114.59. (On
    target!!)
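The arithmetic in this example can be reproduced from the 29 listed observations. The sketch below uses Python's standard library; the variable names are illustrative, and the missing value is simply left out of the list.

```python
import math
import statistics

# The 29 recorded systolic blood pressures (missing value omitted).
bp = [
    100, 110, 118, 134, 92, 104, 100, 110, 130, 110,
    132, 102, 128, 88, 135, 140, 90, 108, 112, 100,
    130, 136, 124, 150, 138, 130, 104, 114, 110,
]

n = len(bp)
xbar = statistics.mean(bp)    # point estimate of the population mean
s = statistics.stdev(bp)      # sample standard deviation, divisor n - 1
me = 1.96 * s / math.sqrt(n)  # margin of error with S in place of sigma
lower, upper = xbar - me, xbar + me

print(f"n = {n}, mean = {xbar:.2f}, S = {s:.2f}")
print(f"ME = {me:.2f}, interval = [{lower:.2f}, {upper:.2f}]")
```

Rounded to two decimals, this gives the same mean (116.52), standard deviation (16.76), and margin of error (6.10) as reported above.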

30
Example: Freshly-Brewed vs Instant
  • Example: A matched pairs experiment was performed
    to compare the taste of instant versus
    fresh-brewed coffee. Each subject tastes two
    unmarked cups of coffee, one of each type, in
    random order and states which he/she prefers. Of
    the 50 subjects who participated, 19 prefer the
    instant coffee. Let p be the probability that a
    randomly chosen subject prefers freshly brewed
    coffee over instant coffee, that is, p is the
    proportion in the population who prefer
    freshly-brewed coffee.
  • Based on the given information, provide a point
    estimate for p.

31
Example continued
  • Based on the sample data, there are 31 out of the
    50 who preferred freshly-brewed coffee, so the
    sample proportion is p̂ = 31/50 = .62. This is
    our point estimate of p.
  • We report this by also providing an estimate of
    its margin of error, which is
  • ME = (1.96)√[(.62)(1 - .62)/50] = .13.
  • Based on this information, we are 95% confident
    that the true p is between .62 - .13 = .49 and
    .62 + .13 = .75. Because this interval still
    includes .5, it will not be possible to conclude
    that more than 50% prefer freshly-brewed coffee
    over instant coffee.
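The same calculation, written as a small reusable function. This is a sketch; the function name and its default multiplier of 1.96 are illustrative choices, not something defined in the lecture.

```python
import math


def proportion_margin_of_error(successes, n, z=1.96):
    """Return (sample proportion, estimated margin of error)."""
    p_hat = successes / n
    me = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, me


# 31 of the 50 subjects preferred freshly-brewed coffee.
p_hat, me = proportion_margin_of_error(31, 50)
lower, upper = p_hat - me, p_hat + me

print(f"p-hat = {p_hat:.2f}, ME = {me:.2f}")
print(f"interval = [{lower:.2f}, {upper:.2f}]")
print("interval includes .5:", lower < 0.5 < upper)
```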

32
Interval Estimation of the Mean, μ
  • Consider a population or distribution with
    unknown mean μ and standard deviation σ. We take
    a sample from this population of size n, where n
    is large (at least 30).
  • Let α be a number between 0 and 1. A 100(1 - α)%
    interval estimator of μ is a random interval [L,
    U], where L and U are computed from the sample
    data, such that the probability that the interval
    [L, U] covers the mean μ equals (1 - α). That is,
  • P(L < μ < U) = 1 - α.

33
Derivation of the Interval Estimator
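  • Let $z_\alpha$ be such that $P(Z > z_\alpha) = \alpha$,
    where Z is the standard normal variable.
  • Therefore, $P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha$.
  • By virtue of the Central Limit Theorem, X̄ is
    approximately normal with mean μ and standard
    deviation (standard error) σ/√n. Therefore,
  • $P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) \approx 1 - \alpha$.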

34
Continued ...
  • Based on this equation we therefore obtain the
    large-sample 100(1-α)% interval estimator of the
    population mean μ to be
  • $\bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$.

35
Some Comments
  • The interval estimator in the preceding slide
    assumes that the population standard deviation is
    known. In many situations, however, this will
    not be the case.
  • If σ is not known, then we replace it by S, the
    sample standard deviation, in the computation of
    the lower and upper bounds.
  • Terminology: After the sample data has been
    gathered, then we could calculate the lower and
    upper bound of the interval. This realized
    interval is called a 100(1-α)% confidence
    interval for μ.

36
Interpretation of a Confidence Interval
  • Based on our derivation of the interval
    estimator, 100(1-α)% of all the possible samples
    of size n will produce interval estimates that
    will contain the true mean μ, while the remaining
    100α% will produce intervals that will not
    include the true mean μ. Consequently, for the
    particular confidence interval that we obtained,
    we associate a 100(1-α)% confidence that it will
    include the true value of μ.
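This "fraction of all possible samples" reading of confidence can be illustrated by simulation. The sketch below is not from the lecture: it assumes a normal population with mean 100 and known standard deviation 15, samples of size 36, 2000 repetitions, and an arbitrary seed. It counts how often the interval X̄ ± z(σ/√n) captures the true mean.

```python
import math
import random
from statistics import NormalDist, mean


def z_interval(sample, sigma, confidence=0.95):
    """Large-sample interval for the mean with known sigma."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center = mean(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return center - half_width, center + half_width


def coverage_rate(mu=100.0, sigma=15.0, n=36, reps=2000, seed=9):
    """Fraction of simulated samples whose interval contains mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma)
        hits += lower < mu < upper
    return hits / reps


rate = coverage_rate()
print(f"fraction of intervals containing the true mean: {rate:.3f}")
```

The printed fraction should land near 0.95; any single interval either contains μ or it does not.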

37
Relationships
  • With α and σ remaining constant, if n is
    increased, then the length of the interval will
    decrease, which is desirable.
  • With σ and n remaining fixed, increasing the
    confidence coefficient (1 - α) will lead to an
    increase in length of the interval.
  • With α and n remaining fixed, we could decrease
    the length of the interval by decreasing σ. This
    could be done for instance by improving the
    measurement process.

38
Example
  • Situation: An experiment was conducted to
    estimate the effect of smoking on the blood
    pressure of a group of 34 college-age cigarette
    smokers. The difference for each participant was
    obtained by taking the difference in the blood
    pressure readings at the time of graduation and
    five years later. The sample mean increase in
    blood pressure was 9.7 millimeters of mercury
    with a sample standard deviation of 5.8.
  • Question: Obtain a 95% confidence interval for
    the mean μ, which is the mean increase in the
    blood pressure reading among all college-age
    cigarette smokers.

39
The Confidence Interval
  • Since n = 34 > 30, the standard error is
    (5.8)/√34 = .9947.
  • For a confidence coefficient of 95%,
    $z_{.025} = 1.96$.
  • Therefore the appropriate margin of error
    becomes (1.96)(.9947) = 1.95.
  • The 95% confidence interval is therefore
  • [9.7 - 1.95, 9.7 + 1.95] = [7.75, 11.65].
  • Interpretation: We are 95% confident that this
    interval contains the true value of μ.

40
Decreasing the Confidence Coefficient
  • If instead we decrease the confidence coefficient
    to 90% so α = 0.10, then $z_{.05} = 1.645$.
  • Therefore, the appropriate margin of error is
    (1.645)(.9947) = 1.64.
  • The 90% confidence interval therefore becomes
  • [9.7 - 1.64, 9.7 + 1.64] = [8.06, 11.34].
  • Notice that this interval is shorter than the 95%
    confidence interval, but then we are less
    confident that it contains the true mean μ.
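Both intervals for the smoking example can be computed with one function that takes the confidence coefficient as an argument. This is a sketch with an illustrative function name; it looks up the z value from the standard normal distribution instead of a table, so unrounded intermediate values differ slightly from the hand calculation.

```python
import math
from statistics import NormalDist


def large_sample_ci(xbar, s, n, confidence):
    """Large-sample confidence interval for a mean, using S for sigma."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    me = z * s / math.sqrt(n)
    return xbar - me, xbar + me


# Smoking example: n = 34, sample mean 9.7, sample standard deviation 5.8.
ci95 = large_sample_ci(9.7, 5.8, 34, 0.95)
ci90 = large_sample_ci(9.7, 5.8, 34, 0.90)

print(f"95% CI: [{ci95[0]:.2f}, {ci95[1]:.2f}]")
print(f"90% CI: [{ci90[0]:.2f}, {ci90[1]:.2f}]")
```

The 90% interval sits inside the 95% interval: lowering the confidence coefficient shortens the interval.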

41
Sample Size Determination
  • Suppose we want to determine the appropriate
    sample size such that the margin of error for the
    100(1-α)% confidence interval is at most B, where
    B is a pre-specified upper bound. Then we must
    have
  • $z_{\alpha/2}\,\sigma/\sqrt{n} \le B$, so solving for n, we
    obtain the formula for the minimum sample size
    needed to satisfy the desired condition to be
  • $n \ge \left(\frac{z_{\alpha/2}\,\sigma}{B}\right)^2$.

42
When the Population Standard Deviation is Not
Known
  • In this sample size formula, we need to know the
    standard deviation σ. If this is not the case
    then we could do one of the following:
  • Perform a small pilot study to obtain an estimate
    of σ, and use the resulting estimate in the
    formula.
  • Use a historical value of σ, if such is
    available.
  • Use an upper bound for the value of σ, that
    is, use the largest possible value that σ could
    have in the situation of interest. This will
    provide a conservative (safe) value for the
    sample size n.
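The sample size rule for the mean, n at least (z σ / B) squared and rounded up to a whole number, can be sketched as a function. The function name and the inputs used below (σ = 5.8 as a pilot-style estimate, bound B = 1.0) are illustrative assumptions, not values from the lecture.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, bound, confidence=0.95):
    """Smallest n whose margin of error z * sigma / sqrt(n) is at most `bound`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / bound) ** 2)


# Hypothetical inputs: sigma estimated as 5.8, desired bound B = 1.0.
n_needed = sample_size_for_mean(sigma=5.8, bound=1.0)
print(f"minimum sample size: {n_needed}")

# A larger (more conservative) sigma leads to a larger required sample.
print(f"with sigma = 10: {sample_size_for_mean(sigma=10, bound=1.0)}")
```

Rounding up matters: rounding down would leave the margin of error slightly above the bound.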

43
Confidence Interval for Proportion
  • The 100(1-α)% confidence interval for the
    population proportion, when n > 30, is derived
    similarly and is of form
  • $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$.

44
Determining the Sample Size when Constructing CI
for Proportion, p
  • If one wants the 100(1-α)% confidence interval
    for p to have margin of error of at most B, then
    the appropriate formula becomes
  • $n \ge \frac{z_{\alpha/2}^2\,p(1-p)}{B^2}$.

45
Continued ...
  • However, this formula requires the value of p,
    which is what we are trying to determine. Two
    routes to circumvent this problem are
  • Use a prior estimate of p, that is, some
    historical or previous value of p.
  • Use the value of p such that p(1-p) is largest.
    This occurs when p = 1/2, where p(1-p) = 1/4.
    Using this procedure, the sample size formula
    becomes

46
Conservative Formula for Determining the Sample
Size when Constructing CI for the Proportion, p
  • $n \ge \frac{z_{\alpha/2}^2}{4B^2}$.

47
Example
  • Suppose that interest is to obtain a 95%
    confidence interval for the proportion p which
    represents the proportion of Americans without
    health insurance. What would be the appropriate
    sample size in order that the margin of error of
    the interval is at most 0.03?
  • In this case, B = .03 and α = 0.05. Therefore,
    $z_{.025} = 1.96$. Furthermore, since we do not
    have any idea about what p might be, we use the
    conservative formula to obtain
  • n ≥ (1.96)²/[(4)(0.03)²] = 1067.1.
  • Thus, rounding up, at least 1068 people should be
    sampled.
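The conservative formula is easy to evaluate directly. The sketch below (function name illustrative) rounds the bound up to the next whole number, since the requirement is that n be at least the computed value, which here is about 1067.1.

```python
import math


def conservative_sample_size(bound, z=1.96):
    """Smallest n with z**2 / (4 * n) <= bound**2, i.e. the worst case p = 1/2."""
    return math.ceil(z ** 2 / (4 * bound ** 2))


n_needed = conservative_sample_size(0.03)
print(f"minimum sample size: {n_needed}")

# Check the margin of error at p = 1/2 for n_needed and for one fewer.
for n in (n_needed - 1, n_needed):
    me = 1.96 * math.sqrt(0.25 / n)
    print(f"n = {n}: margin of error = {me:.5f}")
```

With one observation fewer than the rounded-up value, the worst-case margin of error is still marginally above 0.03.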

48
Two-Sample Problems
  • Consider now the situation where we have two
    populations. Population 1 has mean μ₁ and
    standard deviation σ₁ and population 2 has mean
    μ₂ and standard deviation σ₂.
  • Our objective is to construct a confidence
    interval for the difference μ₁ - μ₂. This
    interval is to be constructed from a sample of
    size n₁ from population 1, and a sample of size
    n₂ from population 2, with the samples being
    independent of each other.
  • For each sample we obtain the sample means and
    standard deviations.

49
Available Data for Two-Sample Problems
  • The sample data could therefore be summarized
    into a table of form

                          Sample 1   Sample 2
    Sample Size           n₁         n₂
    Sample Mean           X̄₁         X̄₂
    Sample Std. Dev.      S₁         S₂

50
Confidence Interval for the Difference of Two
Means
  • For this two-sample problem, when the sample
    sizes are at least equal to 30, the 100(1-α)%
    confidence interval for μ₁ - μ₂ is given by
  • $(\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$.

51
Example: On Obesity
  • Situation: An experiment was conducted to compare
    two diets A and B designed for weight reduction.
    Two groups of 30 overweight dieters each are
    randomly selected. One group was placed on diet A
    and the other on diet B, and their weight losses
    were recorded over a 30-day period. The means and
    standard deviations of the weight-loss
    measurements for the two groups are given in the
    table below.

                          Diet A   Diet B
    Sample Size           30       30
    Sample Mean           21.3     13.4
    Sample Std. Dev.      2.6      1.9

52
99% Confidence Interval for the Difference of the
Means
  • For a 99% confidence interval, we have
    $z_{.005} = 2.575$.
  • The estimate of the standard error becomes
  • √[(2.6)²/30 + (1.9)²/30] = √.3457 = .5879.
  • The appropriate margin of error is therefore
  • (2.575)(.5879) = 1.5138.
  • The difference of sample means is 21.3 - 13.4 =
    7.9.
  • The 99% CI for the difference of the population
    means becomes [7.9 - 1.51, 7.9 + 1.51] = [6.39,
    9.41].
  • Since this interval lies entirely above 0, we
    conclude that diet A is more effective in
    reducing weight.
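The two-sample calculation can be sketched as a function of the six summary numbers. The function name is illustrative, and the z value is taken from the standard normal distribution (about 2.576) instead of the table value 2.575, so unrounded results differ slightly from the hand calculation.

```python
import math
from statistics import NormalDist


def two_sample_ci(mean1, sd1, n1, mean2, sd2, n2, confidence):
    """Large-sample CI for the difference of two population means."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # estimated standard error
    diff = mean1 - mean2
    return diff - z * se, diff + z * se


# Diet example: means 21.3 and 13.4, std. devs. 2.6 and 1.9, 30 per group.
lower, upper = two_sample_ci(21.3, 2.6, 30, 13.4, 1.9, 30, 0.99)
print(f"99% CI for the difference: [{lower:.2f}, {upper:.2f}]")
print("interval excludes 0:", not (lower <= 0 <= upper))
```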