Title: Large-Sample Estimation
1 Large-Sample Estimation
- Stat 700 Lecture 09
- 10/18-10/23
2 Overview of Lecture
- The Problem of Statistical Inference
- Methods of Inference
- Estimation (Point and Interval)
- Hypothesis Testing
- Point Estimation of the Mean, Standard Deviation, and Proportion
- Interval Estimation of the Mean and Proportion
- Sample Size Determination
- Estimation of the Difference of Means
- Estimation of the Difference of Proportions
3 The Problem of Inference
- What we now know!
- Population: the collection of interest to us.
- Population Models: provided by probability models such as the Bernoulli distribution, normal distribution, exponential distribution, etc.
- (Population) Parameters: characteristics of the population/distribution. Examples are the mean μ, the standard deviation σ, and the (population) proportion p. Others are the (population) median and the population quartiles.
- Goal: to know these parameters in order to make decisions.
4 Inference Problem continued
- We also know how to
- take a sample from a population (by surveys or designed experiments), and
- compute sample statistics, which are characteristics of the sample. For example, we can compute the sample mean (x̄), the sample standard deviation (S), and the sample proportion (p̂).
- Goal: to use these sample statistics to infer about the population parameters.
5 Inference Problem continued
- Furthermore, we also know how sample statistics behave in a probabilistic way, when we consider the experiment of taking a sample from a population, by looking at the statistics' sampling distributions. In particular, we know the mean of a sample statistic as well as its variability, as measured by its standard error.
- A key thing to realize is that a sample statistic will usually not coincide exactly with the associated parameter, but its values will tend to cluster around the value of the parameter, especially when the sample size is large enough!
6 Inference Problem 1: Estimation
- The basic questions when dealing with estimation problems are:
- Based on the sample data, what is the value of the parameter of interest? This is the problem of point estimation.
- or
- Based on the sample data, what is an interval of values in which we have a pre-specified confidence that the value of the parameter lies? This is the problem of interval estimation, or the construction of a confidence interval.
7 Inference Problem 2: Hypothesis Testing
- When dealing with hypothesis testing, on the other hand, our aim is to determine, based on the sample data, which of two complementary propositions about the parameter of interest, called statistical hypotheses, is true.
- In hypothesis testing we are not really interested in knowing the exact value of the parameter; rather, we are simply interested in deciding between competing claims about the parameter based on the sample data.
8 An Illustration
- Situation: The population of interest is the collection of all American households and their annual out-of-pocket medical expenses. Suppose that we would like to determine the proportion, p, of American households which incur at least $1000 in out-of-pocket medical expenses during the year. This p is the parameter of interest.
- Why is this parameter, p, relevant in public policy?
- Except for the fact that p is between 0 and 1, we do not know its exact value.
9 Illustration continued
- Study: We take an SRS of n = 2000 American households and determine each household's annual out-of-pocket medical expenses. Suppose that out of these 2000 households, 114 incurred out-of-pocket medical expenses of at least $1000, so p̂ = 114/2000 = .057.
- Problem of Estimation: Based on the sample data, what is the value of p? Or, what is an interval [L, U] such that we will be 95% confident that p is in this interval?
- Problem of Hypothesis Testing: Based on the sample data, which of the following statements is true: p is less than 0.05, or p is at least 0.05?
10 Point Estimation
- For our discussion, we shall let θ denote a generic population parameter, so it could be the mean μ, the variance σ², the standard deviation σ, or the proportion p.
- A point estimator (denoted by θ̂) of a parameter θ is a procedure, a rule, or a formula for obtaining a value from the sample data which will serve as an estimate of θ. As such, a point estimator is a sample statistic.
- When the data have been obtained, the realized value of a point estimator is called a point estimate.
11 Examples of Point Estimators
- Example 1: For estimating the population mean μ, possible point estimators are:
- Estimator 1: Sample Mean x̄
- Estimator 2: Sample Median
- Estimator 3: Sample Midrange, which is the average of the smallest and largest observations
- Estimator 4: (Sum of Observations + 1)/(n + 2)
- Question: Which among these four possible point estimators should we use?
12 Examples continued
- Example 2: For estimating the population proportion p, a point estimator is the sample proportion p̂, which is the proportion of successes in the sample.
- Example 3: For estimating the population variance σ², a possible point estimator is the sample variance S². This is the variance formula with divisor (n - 1). However, another possible estimator of σ² is the version of the formula with divisor n.
13 Comparing Competing Estimators
- Suppose there are several possible estimators of a parameter (for example, in estimating the population mean there could be several candidate estimators). How do we decide which estimator to use?
- What are the desirable or good properties that we want our estimators to have?
- How do we know which estimators have these desirable properties?
14 Desirable Properties of Estimators
- Ideally, an estimator would always give the exact value of the parameter, whatever that value is. But this will never be satisfied in reality!
- Property of Unbiasedness: On average, the estimator should equal the parameter being estimated. Formally, this means that the mean of the sampling distribution of the estimator (recall that an estimator is a sample statistic, so it has a sampling distribution) should equal the value of the parameter it is estimating, whatever that value is.
15 Desirable Properties continued
- For example, since from our study of the sampling distribution of the sample mean we found that the mean of the sample mean equals the population mean, the sample mean is unbiased for the population mean.
- The sample proportion is also unbiased for the population proportion.
- The sample variance S² is also unbiased for the population variance σ². This is the reason for dividing by (n - 1) in the formula.
16 Desirable Properties continued
- Property of Small Variation: this is the property of an estimator being precise, in the sense that its variability is small. In practical terms, we want the values of the estimator to cluster closely around what it is trying to estimate.
- The variability of an estimator is measured by the standard deviation of its sampling distribution, which we now call the standard error. The smaller the standard error, the more desirable the estimator, provided that it is unbiased.
17 Margin of Error (ME) of an Estimator
- When reporting a point estimate, we also report a measure of its variability, usually given as the margin of error (ME) of the estimate, which is equal to 1.96 times its standard error. That is,
- ME = 1.96 × (Standard Error).
18 Interpretation of the Margin of Error
- The reason for this definition of the margin of error is that the sampling distribution of the estimator will usually be approximately normal (by the central limit theorem), with mean equal to the value of the parameter being estimated. Hence the interval from
- (Parameter Value) - 1.96(Std. Error) to
- (Parameter Value) + 1.96(Std. Error)
- will contain approximately 95% of all the possible values of the estimator. Therefore, approximately 95% of the time, the point estimate will not differ from the true parameter value by more than one ME.
- But why 95%? It is the convention handed down to us!
19 Illustration of Comparison of Estimators
- To see in a concrete way how estimators are compared, consider estimating the population mean of the population considered in the discussion of sampling distributions. This population has
- p(2) = .4, p(4) = .5, p(5) = .1
- Population Mean μ = 3.3
- Population Standard Deviation σ = 1.1
- We compare the four estimators of the mean mentioned earlier:
- Sample Mean, Sample Median, Sample Midrange, and ((Sum of Xs) + 1)/(n + 2).
20 Comparison continued
- Our comparison will be based on samples of size n = 10. A theoretical comparison is not easy, so we rely on a Monte Carlo simulation.
- We generate 500 samples of size n = 10 from the population and, for each sample, compute the estimate based on each of the 4 estimators.
- We then look at the simulated sampling distributions of the 4 estimators to see which estimators are unbiased, and to compare their variability.
21 First 10 Samples from the Simulation
- For sample 1: Sample Midrange = (2 + 5)/2 = 3.5, while Estimate 4 = (31 + 1)/(10 + 2) = 32/12 ≈ 2.6667.
- Sample Mean and Sample Median are computed in the usual way.
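The Monte Carlo comparison described above can be sketched in Python. The population probabilities, sample size, number of replications, and the four estimator formulas are from the slides; the code itself (names, seed) is an illustrative reconstruction, not the original simulation.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

# Population from the slides: P(X=2)=.4, P(X=4)=.5, P(X=5)=.1, so mu = 3.3
values, probs = [2, 4, 5], [0.4, 0.5, 0.1]

n, reps = 10, 500
means, medians, midranges, est4 = [], [], [], []
for _ in range(reps):
    s = random.choices(values, weights=probs, k=n)  # one sample of size 10
    means.append(statistics.mean(s))
    medians.append(statistics.median(s))
    midranges.append((min(s) + max(s)) / 2)         # avg of smallest and largest
    est4.append((sum(s) + 1) / (n + 2))             # Estimator 4: (sum + 1)/(n + 2)

# Simulated sampling distributions: compare centers (bias) and spreads (SE)
for name, vals in [("Mean", means), ("Median", medians),
                   ("Midrange", midranges), ("Estimator 4", est4)]:
    print(f"{name:12s} avg = {statistics.mean(vals):.3f}  "
          f"se = {statistics.stdev(vals):.3f}")
```

Running this reproduces the qualitative conclusions on the later slides: the sample mean centers near 3.3, Estimator 4 centers near (10 × 3.3 + 1)/12 ≈ 2.83 (biased), and the median has the largest spread.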
22 BoxPlots of the Simulated Sampling Distributions
- Recall: Target is μ = 3.3
23 Histograms of the Simulated Sampling Distributions (Using the Same Scales)
24 Parameters of the Simulated Sampling Distributions and Comparisons
- The sample mean is closest to being unbiased. Next is the sample midrange, although it is still biased.
- The sample median and Estimator 4 are very biased.
- The sample median is very variable, or imprecise.
- The sample mean is best, though the midrange is also good.
25 Point Estimation of the Mean μ
- When the population of interest is normal with (unknown) mean μ and standard deviation σ, then, based on theoretical analysis, the best estimator of μ is the sample mean x̄. The margin of error is
- ME = (1.96)(σ/√n).
- If σ is not known, then the margin of error can be reported as
- ME = (1.96)(S/√n),
- where S is the sample standard deviation.
26 Point Estimation of the Mean ...
- When the population is not normal and the sample size is large, the sample mean need not be the best estimator anymore, but it is still unbiased for the population mean and has decent variability.
- For example, when the population is Uniform, the population mean is best estimated by the Sample Midrange rather than the Sample Mean.
- However, for our purposes we will simply use the Sample Mean as the estimator of the population mean, with margin of error (assuming σ is not known)
- ME = (1.96)(S/√n).
27 Point Estimation of the Population Proportion, p
- When the population is Bernoulli, so the parameter of interest is p, the proportion of "Successes" in the population, the best estimator of p is the sample proportion p̂.
- When np > 5 and n(1 - p) > 5, its margin of error is estimated by
- ME = 1.96 √(p̂(1 - p̂)/n).
28 An Example
- Situation: Suppose we want to estimate the mean systolic blood pressure for the population of 1910 people in the blood pressure data set.
- Sample: We take a sample of size n = 30 from the population, and the sample data are
- 100, 110, 118, 134, ., 92, 104, 100, 110, 130, 110, 132, 102, 128, 88, 135, 140, 90, 108, 112, 100, 130, 136, 124, 150, 138, 130, 104, 114, 110
- The one dot indicates a missing value in the data, so n = 29 in this case.
29 Example continued
- Sample Statistics: x̄ = 116.52, S = 16.76
- Therefore, the point estimate for μ is
- x̄ = 116.52
- with margin of error
- ME = (1.96)(16.76)/√29 = 6.10.
- Interpretation: We are 95% confident that the true mean systolic blood pressure for the population is between
- [116.52 - 6.10, 116.52 + 6.10] = [110.42, 122.62].
- Indeed, the true value of μ is 114.59. (On target!!)
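The computations on this slide can be checked directly. This is a minimal sketch using the 29 observed values listed on the previous slide (the missing value is simply dropped); only the standard-library `statistics` module is used.

```python
import statistics
from math import sqrt

# Blood pressure sample from the slide; the "." (missing value) is dropped, so n = 29
bp = [100, 110, 118, 134, 92, 104, 100, 110, 130, 110, 132, 102, 128, 88,
      135, 140, 90, 108, 112, 100, 130, 136, 124, 150, 138, 130, 104, 114, 110]

n = len(bp)
xbar = statistics.mean(bp)     # point estimate of mu
s = statistics.stdev(bp)       # sample standard deviation (divisor n - 1)
me = 1.96 * s / sqrt(n)        # margin of error

print(f"n = {n}, xbar = {xbar:.2f}, S = {s:.2f}, ME = {me:.2f}")
print(f"interval: [{xbar - me:.2f}, {xbar + me:.2f}]")
```

This reproduces the slide's values: x̄ = 116.52, S ≈ 16.76, ME ≈ 6.10.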
30 Example: Freshly-Brewed vs Instant
- Example: A matched pairs experiment was performed to compare the taste of instant versus fresh-brewed coffee. Each subject tastes two unmarked cups of coffee, one of each type, in random order, and states which he/she prefers. Of the 50 subjects who participated, 19 prefer the instant coffee. Let p be the probability that a randomly chosen subject prefers fresh-brewed coffee over instant coffee; that is, p is the proportion in the population who prefer fresh-brewed coffee.
- Based on the given information, provide a point estimate for p.
31 Example continued
- Based on the sample data, 31 out of the 50 subjects preferred fresh-brewed coffee, so the sample proportion is p̂ = 31/50 = .62. This is our point estimate of p.
- We report this by also providing an estimate of its margin of error, which is
- ME = 1.96 √((.62)(1 - .62)/50) = .13.
- Based on this information, we are 95% confident that the true p is between .62 - .13 = .49 and .62 + .13 = .75. Because this interval still includes .5, it is not possible to conclude that more than 50% prefer fresh-brewed coffee over instant coffee.
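The proportion estimate and its margin of error follow the same pattern; a short sketch of the arithmetic above:

```python
from math import sqrt

# Coffee-preference data from the slide: 31 of 50 subjects prefer fresh-brewed
n, successes = 50, 31
p_hat = successes / n                           # point estimate of p
me = 1.96 * sqrt(p_hat * (1 - p_hat) / n)       # ME = 1.96 * sqrt(p_hat(1-p_hat)/n)

print(f"p_hat = {p_hat:.2f}, ME = {me:.2f}")
print(f"95% interval: [{p_hat - me:.2f}, {p_hat + me:.2f}]")
```

The printed interval, [.49, .75], straddles .5, which is exactly why the slide declines to conclude that a majority prefer fresh-brewed coffee.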
32 Interval Estimation of the Mean, μ
- Consider a population or distribution with unknown mean μ and standard deviation σ. We take a sample of size n from this population, where n is large (at least 30).
- Let α be a number between 0 and 1. A 100(1 - α)% interval estimator of μ is a random interval [L, U], where L and U are computed from the sample data, such that the probability that the interval [L, U] covers the mean μ equals (1 - α). That is,
- P[L < μ < U] = 1 - α.
33 Derivation of the Interval Estimator
- Let z_α be such that P[Z > z_α] = α, where Z is the standard normal variable.
- Therefore, P[-z_{α/2} < Z < z_{α/2}] = 1 - α.
- By virtue of the Central Limit Theorem, x̄ is approximately normal with mean μ and standard deviation (standard error) σ/√n. Therefore,
- P[x̄ - z_{α/2}(σ/√n) < μ < x̄ + z_{α/2}(σ/√n)] ≈ 1 - α.
34 Continued ...
- Based on this equation, we therefore obtain the large-sample 100(1 - α)% interval estimator of the population mean μ to be
- [x̄ - z_{α/2}(σ/√n), x̄ + z_{α/2}(σ/√n)].
35 Some Comments
- The interval estimator in the preceding slide assumes that the population standard deviation σ is known. In many situations, however, this will not be the case.
- If σ is not known, then we replace it by S, the sample standard deviation, in the computation of the lower and upper bounds.
- Terminology: After the sample data have been gathered, we can calculate the lower and upper bounds of the interval. This realized interval is called a 100(1 - α)% confidence interval for μ.
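The interval estimator, with S substituted for σ, is easy to package as a small helper. This is a sketch, not part of the lecture; the function name `mean_ci` is mine, and `statistics.NormalDist().inv_cdf` supplies the normal quantile z_{α/2}.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(xbar, s, n, conf=0.95):
    """Large-sample CI for mu: xbar +/- z_{alpha/2} * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}, e.g. 1.96 for 95%
    me = z * s / sqrt(n)
    return xbar - me, xbar + me

# Check against the earlier blood-pressure example: xbar = 116.52, S = 16.76, n = 29
lo, hi = mean_ci(116.52, 16.76, 29)
print(f"[{lo:.2f}, {hi:.2f}]")
```

This reproduces the [110.42, 122.62] interval from the blood-pressure example.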
36 Interpretation of a Confidence Interval
- Based on our derivation of the interval estimator, 100(1 - α)% of all the possible samples of size n will produce interval estimates that contain the true mean μ, while the remaining 100α% will produce intervals that do not include μ. Consequently, for the particular confidence interval that we obtained, we associate 100(1 - α)% confidence that it includes the true value of μ.
37 Relationships
- With σ and α remaining constant, if n is increased, then the length of the interval will decrease, which is desirable.
- With σ and n remaining fixed, increasing the confidence coefficient (1 - α) will lead to an increase in the length of the interval.
- With α and n remaining fixed, we can decrease the length of the interval by decreasing σ. This can be done, for instance, by improving the measurement process.
38 Example
- Situation: An experiment was conducted to estimate the effect of smoking on the blood pressure of a group of 34 college-age cigarette smokers. The difference for each participant was obtained by taking the difference in blood pressure readings at the time of graduation and five years later. The sample mean increase in blood pressure was 9.7 millimeters of mercury, with a sample standard deviation of 5.8.
- Question: Obtain a 95% confidence interval for μ, the mean increase in blood pressure among all college-age cigarette smokers.
39 The Confidence Interval
- Since n = 34 > 30, the standard error is (5.8)/√34 = .9947.
- For a confidence coefficient of 95%, z_{.025} = 1.96.
- Therefore the appropriate margin of error becomes (1.96)(.9947) = 1.95.
- The 95% confidence interval is therefore
- [9.7 - 1.95, 9.7 + 1.95] = [7.75, 11.65].
- Interpretation: We are 95% confident that this interval contains the true value of μ.
40 Decreasing the Confidence Coefficient
- If instead we decrease the confidence coefficient to 90%, so α = 0.10, then z_{.05} = 1.645.
- Therefore, the appropriate margin of error is (1.645)(.9947) = 1.64.
- The 90% confidence interval therefore becomes
- [9.7 - 1.64, 9.7 + 1.64] = [8.06, 11.34].
- Notice that this interval is shorter than the 95% confidence interval, but we are then less confident that it contains the true mean μ.
41 Sample Size Determination
- Suppose we want to determine the sample size such that the margin of error of the 100(1 - α)% confidence interval is at most B, where B is a pre-specified upper bound. Then we must have
- z_{α/2} σ/√n ≤ B,
- so, solving for n, the minimum sample size needed to satisfy the desired condition is
- n ≥ (z_{α/2} σ / B)².
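A sketch of this formula in code, rounding up to the next whole observation. The function name is mine, and the example values (σ = 5.8, borrowed from the smoking example, with a hypothetical bound B = 1.0) are illustrative only.

```python
from math import ceil
from statistics import NormalDist

def sample_size_mean(sigma, B, conf=0.95):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= B."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    return ceil((z * sigma / B) ** 2)              # round up: n must be an integer

# Hypothetical illustration: sigma = 5.8, margin of error at most 1.0 mm Hg
n_needed = sample_size_mean(5.8, 1.0)
print(n_needed)
```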
42 When the Population Standard Deviation is Not Known
- This sample size formula requires knowing the standard deviation σ. If σ is not known, we can do one of the following:
- Perform a small pilot study to obtain an estimate of σ, and use the resulting estimate in the formula.
- Use a historical value of σ, if one is available.
- Use an upper bound for the value of σ, that is, the largest possible value that σ could have in the situation of interest. This provides a conservative (safe) value for the sample size n.
43 Confidence Interval for the Proportion
- The 100(1 - α)% confidence interval for the population proportion, when n > 30, is derived similarly and is of the form
- [p̂ - z_{α/2}√(p̂(1 - p̂)/n), p̂ + z_{α/2}√(p̂(1 - p̂)/n)].
44 Determining the Sample Size when Constructing a CI for the Proportion, p
- If one wants the 100(1 - α)% confidence interval for p to have a margin of error of at most B, then the appropriate formula becomes
- n ≥ (z_{α/2}/B)² p(1 - p).
45 Continued ...
- However, this formula requires the value of p, which is what we are trying to determine. Two ways to circumvent this problem are:
- Use a prior estimate of p, that is, some historical or previous value of p.
- Use the value of p for which p(1 - p) is largest. This occurs when p = 1/2, giving p(1 - p) = 1/4. Using this procedure, the sample size formula becomes the conservative formula on the next slide.
46 Conservative Formula for Determining the Sample Size when Constructing a CI for the Proportion, p
- n ≥ (z_{α/2})² / (4B²).
47 Example
- Suppose we want a 95% confidence interval for the proportion p of Americans without health insurance. What is the appropriate sample size so that the margin of error of the interval is at most 0.03?
- In this case, B = .03 and α = 0.05, so z_{.025} = 1.96. Furthermore, since we have no idea what p might be, we use the conservative formula to obtain
- n ≥ (1.96)²/(4(0.03)²) = 1067.1.
- Thus, rounding up, at least 1068 people should be sampled.
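The conservative calculation above can be sketched as follows (the function name is mine). Note that (1.96)²/(4(0.03)²) ≈ 1067.1, so rounding up gives 1068:

```python
from math import ceil
from statistics import NormalDist

def conservative_n_for_proportion(B, conf=0.95):
    """Smallest n with z_{alpha/2} * sqrt(0.25/n) <= B (worst case p = 1/2)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    return ceil(z ** 2 / (4 * B ** 2))             # n >= z^2 / (4 B^2), rounded up

print(conservative_n_for_proportion(0.03))
```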
48 Two-Sample Problems
- Consider now the situation where we have two populations. Population 1 has mean μ1 and standard deviation σ1, and population 2 has mean μ2 and standard deviation σ2.
- Our objective is to construct a confidence interval for the difference μ1 - μ2. This interval is to be constructed from a sample of size n1 from population 1 and a sample of size n2 from population 2, with the samples independent of each other.
- For each sample we obtain the sample mean and standard deviation.
49 Available Data for Two-Sample Problems
- The sample data can therefore be summarized in a table of the form:
- Sample 1: size n1, sample mean x̄1, sample standard deviation S1
- Sample 2: size n2, sample mean x̄2, sample standard deviation S2
50 Confidence Interval for the Difference of Two Means
- For this two-sample problem, when the sample sizes are both at least 30, the 100(1 - α)% confidence interval for μ1 - μ2 is given by
- (x̄1 - x̄2) ± z_{α/2} √(S1²/n1 + S2²/n2).
51 Example: On Obesity
- Situation: An experiment was conducted to compare two diets, A and B, designed for weight reduction. Two groups of 30 overweight dieters each were randomly selected. One group was placed on diet A and the other on diet B, and their weight losses were recorded over a 30-day period. The means and standard deviations of the weight-loss measurements for the two groups are given in the table below.
- Diet A: n1 = 30, sample mean = 21.3, sample std. dev. = 2.6
- Diet B: n2 = 30, sample mean = 13.4, sample std. dev. = 1.9
52 99% Confidence Interval for the Difference of the Means
- For a 99% confidence interval, we have z_{.005} = 2.575.
- The estimate of the standard error becomes
- [(2.6)²/30 + (1.9)²/30]^(1/2) = (.3457)^(1/2) = .5879.
- The appropriate margin of error is therefore
- (2.575)(.5879) = 1.5138.
- The difference of the sample means is 21.3 - 13.4 = 7.9.
- The 99% CI for the difference of the population means becomes [7.9 - 1.51, 7.9 + 1.51] = [6.39, 9.41].
- Since this interval does not contain 0, we conclude that diet A is more effective in reducing weight.
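The two-sample interval above can be sketched as a helper (the function name is mine; the normal quantile comes from `statistics.NormalDist`, which gives z_{.005} ≈ 2.576 rather than the slide's rounded 2.575):

```python
from math import sqrt
from statistics import NormalDist

def two_mean_ci(x1, s1, n1, x2, s2, n2, conf=0.99):
    """Large-sample CI for mu1 - mu2: (x1 - x2) +/- z_{alpha/2} sqrt(s1^2/n1 + s2^2/n2)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)         # estimated standard error
    diff = x1 - x2
    return diff - z * se, diff + z * se

# Diet example from the slide: A (21.3, 2.6, 30) vs B (13.4, 1.9, 30)
lo, hi = two_mean_ci(21.3, 2.6, 30, 13.4, 1.9, 30)
print(f"[{lo:.2f}, {hi:.2f}]")
```

Since the lower endpoint is above 0, the code confirms the slide's conclusion that diet A produces greater mean weight loss.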