Title: Introduction to Inference for Means
1. Introduction to Inference for Means
- Sample means are normally distributed, with
  - mean = population mean $\mu$
  - standard deviation = SE
- Estimation
  - what is the population mean?
  - confidence interval: a certain probability that the population mean is within the interval
  - a CI can also be constructed for the difference between the means of two populations, based on samples from each population
2. Introduction to Inference for Means
- Hypothesis tests
  - was the sample drawn randomly from the population, or is there a systematic difference between the sample and the population?
    - how likely is it that the difference between the sample mean and the known population mean is due to chance?
  - were two samples drawn randomly from the same population?
    - how likely is it that the difference between the sample means is due to chance?
3. Distribution of Sample Mean
- The sample mean $\bar{x}$ is normally distributed, with a mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$
- We can express the sample mean in terms of the standard normal distribution:
  $$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
4. Estimation: Confidence Intervals
- The sample mean is a point estimate of the population mean. How accurate an estimate?
- There is a 95% chance that any given sample mean will lie within 2 SE of the population mean
- Thus, there is also a 95% chance that the population mean is within 2 SE of a sample mean
- The range of values within which a parameter is likely to be found is a confidence interval
- The concept of a confidence interval is illustrated in Confidence Intervals.xls.
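The coverage idea can also be checked with a short simulation. This is only a sketch and not a reproduction of Confidence Intervals.xls: the population values (`mu`, `sigma`), the sample size, the number of trials, and the seed are all made-up assumptions, and the population standard deviation is treated as known.

```python
import random
import statistics


def coverage(mu=50.0, sigma=10.0, n=25, trials=4000, multiple=2.0, seed=1):
    """Fraction of simulated samples whose interval
    (sample mean) +/- multiple * SE contains the population mean."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    se = sigma / n ** 0.5              # SE of the mean, sigma assumed known
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        if x_bar - multiple * se <= mu <= x_bar + multiple * se:
            hits += 1
    return hits / trials


# With multiple = 2 the fraction should come out close to 0.95.
print(coverage())
```

Each simulated interval either contains the population mean or misses it; the confidence level describes how often the procedure succeeds over many samples.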
5. Confidence Intervals and Levels
- To obtain a confidence interval for a population mean, we first specify a confidence level, usually 95% (sometimes 90% or 99%).
- We then determine the multiple of the standard error we need on either side of the sample mean to achieve the given confidence level that the interval will contain the population mean
- So far, we assumed 2 SE for 95% confidence, but that is only approximately true
6. Confidence Intervals
- Confidence intervals have the form
  - (pop. mean) = (sample mean) ± (multiple)(SE)
- (multiple) is determined by the desired confidence level, which you choose
  - CL is usually 95%, sometimes 90% or 99%
  - $\alpha = 1 - \mathrm{CL}$ = probability outside the interval
  - If two-sided interval, area under each tail = $\alpha/2$
- If $\sigma$ is known, (multiple) = $Z_{\alpha/2}$
- (multiple) is also called the critical value of Z
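7. Finding Multiples of SE with Excel
- Let CL = confidence level (e.g., 0.9, 0.95, 0.99)
- $\alpha = 1 - \mathrm{CL}$ (e.g., 0.1, 0.05, 0.01)
$Z_{\alpha}$ = -NORMSINV($\alpha$) = NORMSINV($1 - \alpha$) (one-tailed)
$Z_{\alpha/2}$ = -NORMSINV($\alpha/2$) = NORMSINV($1 - \alpha/2$) (two-tailed)
(NORMSINV returns a negative value for probabilities below 0.5, hence the sign change.)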
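8. $\alpha/2$ for two-sided confidence interval
Critical values of Z by area in one tail:
- tail area 0.10: Z = 1.282
- tail area 0.05: Z = 1.645
- tail area 0.025: Z = 1.960
- tail area 0.01: Z = 2.326
- tail area 0.005: Z = 2.576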
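The same multiples can be reproduced outside Excel. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_multiple` is made up for illustration, and it assumes $\sigma$ is known so that Z applies.

```python
from statistics import NormalDist


def z_multiple(cl: float, two_sided: bool = True) -> float:
    """Multiple of SE for confidence level cl (Z critical value)."""
    alpha = 1 - cl                               # probability outside the interval
    tail = alpha / 2 if two_sided else alpha     # area in one tail
    # inv_cdf(1 - tail) is the positive counterpart of NORMSINV(tail)
    return NormalDist().inv_cdf(1 - tail)


for cl in (0.90, 0.95, 0.99):
    print(cl, round(z_multiple(cl), 3), round(z_multiple(cl, two_sided=False), 3))
```

For a 95% two-sided interval this gives about 1.960 rather than the rough value of 2 used earlier.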
9. Distribution of Sample Mean (2)
- The sample mean $\bar{x}$ is normally distributed, with a mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$
- We can express the sample mean in terms of the standard normal distribution:
  $$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
- We usually don't know $\sigma$. We can use the sample standard deviation, $s$, but this can be a poor approximation to $\sigma$, particularly if $n$ is small
10. Sample Standard Deviation
- Standard deviation is not a robust measure of spread. It is sensitive to outliers: in this case, an unlucky sample in which the spread of the values is either much smaller or much larger than the spread in the population from which it was drawn
- In income sampling.xls, $0.3\sigma < s < 2.4\sigma$ for $n = 10$. If SE is calculated using a low value of $s$, SE can be seriously underestimated.
  - overestimates are also possible, but not usually as worrisome
11. [Figure: two panels, each labeled n = 10]
12. Distribution of Sample Mean
- There is an exact solution for the distribution of the sample mean when we know $s$ but not $\sigma$
  - assumes population is normally distributed (but works well in most other cases)
- The standardized value
  $$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
  has a t distribution with $\nu = n - 1$ degrees of freedom
13. The t Distribution
- The t distribution is a close relative of the normal distribution: as $n \to \infty$, $t \to$ normal
- The degrees of freedom parameter, $(n - 1)$, defines the precise shape of the t distribution
- The t distribution is a little more spread out than the normal distribution; this increase in spread is greater for smaller $n$
- The t distribution is used when we want to make inferences about a population mean and the population standard deviation is unknown
14. The t Distribution
15. Excel Commands: t Distribution
- TDIST(t,deg_freedom,tails)
  - if tails = 1, gives the area or probability in the right-hand tail of the distribution, that is, the probability of finding a value greater than t
    - unlike NORMSDIST, gives prob. to the right
  - t must be positive; to calculate the area of the left-hand tail (the probability of less than -t), use the probability of greater than t.
  - if tails = 2, gives the probability of > t or < -t (both the left-hand and the right-hand tails)
16. Using TDIST: n = 6, tails = 1
For comparison: 1-NORMSDIST(2) = 0.023; NORMSDIST(-2) = 0.023
17. Using TDIST: n = 6, tails = 2
For comparison: 2*(1-NORMSDIST(2)) = 0.046; 2*NORMSDIST(-2) = 0.046
18. Excel Commands: t Distribution
- TINV(probability,deg_freedom)
  - gives the value of t, given the total probability in both tails; half of this goes in the right-hand tail and half goes in the left-hand tail.
- tdist.xls contains sample calculations that illustrate the TDIST and TINV functions
19. Using TINV: tails always = 2
For comparison: NORMSINV(0.05) = -1.645; NORMSINV(0.95) = 1.645
$Z_{0.05} = 1.645$; $t_{0.05,5} = 2.015$
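For readers without Excel, the behavior of TDIST and TINV described above can be approximated with the standard library alone. This is a rough numerical sketch, not Excel's algorithm: the names `t_pdf`, `t_tail`, and `t_inv` are made up for illustration, the tail area comes from Simpson's rule, and the inverse is found by bisection.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_tail(t: float, df: int, tails: int = 1) -> float:
    """Rough analogue of TDIST(t, df, tails) for t >= 0."""
    if t < 0:
        raise ValueError("t must be non-negative, as with TDIST")
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # Simpson's rule for the area between 0 and t (even number of steps).
    steps = 2 * max(100, math.ceil(50 * t))
    h = t / steps
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    right_tail = 0.5 - area            # the distribution is symmetric about 0
    return tails * right_tail


def t_inv(probability: float, df: int) -> float:
    """Rough analogue of TINV(probability, df): probability is the
    total area in both tails."""
    lo, hi = 0.0, 1.0
    while t_tail(hi, df, tails=2) > probability:   # widen until the root is bracketed
        hi *= 2
    for _ in range(60):                            # bisection
        mid = (lo + hi) / 2
        if t_tail(mid, df, tails=2) > probability:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(t_tail(2.0, 5, tails=1), 3))   # right-hand tail beyond t = 2 with 5 d.f.
print(round(t_inv(0.10, 5), 3))            # one-tail 0.05, i.e. t_{0.05,5}
```

Because `t_inv` follows the two-tailed convention, the one-tailed value $t_{0.05,5}$ corresponds to a total probability of 0.10.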
20. Confidence Intervals
- Confidence intervals have the form
  - (pop. mean) = (sample mean) ± (multiple)(SE)
- (multiple) is determined by the desired confidence level, which you choose
  - CL is usually 95%, sometimes 90% or 99%
  - $\alpha = 1 - \mathrm{CL}$ = probability outside the interval
  - If two-sided interval, area under each tail = $\alpha/2$
- If $\sigma$ is known, (multiple) = $Z_{\alpha/2}$
- Otherwise, (multiple) = $t_{\alpha/2,\nu}$
- (multiple) is also called the critical value of Z or t
21. Finding Multiples of SE with Excel
- Let CL = confidence level (e.g., 0.9, 0.95, 0.99)
- $\alpha = 1 - \mathrm{CL}$ (e.g., 0.1, 0.05, 0.01)
- $\nu$ = degrees of freedom (e.g., $n - 1$)
22. Multiples of the SE required for a given confidence level and number of degrees of freedom
Note: if you know $\sigma$, you can use Z instead of t
23. $\alpha$ for one-sided, $\alpha/2$ for two-sided confidence interval
24. If $\sigma$ unknown and $n = 31$ ($\nu = 30$)
25. Example
- Collect income data for 31 households
TINV(0.05,30) = 2.042
- 95% confidence interval for population mean (mean household income)
26. Assumptions
- Sample is a simple random sample
- Population distribution is normal
  - If $\sigma$ is known, this assumption is not needed; use Z rather than t
  - Confidence intervals based on the t distribution are robust to violations of normality, particularly for $n > 30$
  - Intervals could be too narrow for highly asymmetrical distributions with small $n$
27. Confidence Interval for a Total
So if we want a confidence interval for the total income of the city, just multiply the average household income, and its standard error, by the number of households, N.
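Written out, the scaling rule looks like this. The symbols $T$ (population total) and $\mathrm{SE}_T$ (standard error of the estimated total) are introduced here for illustration and do not appear in the slides; (multiple) is the same Z or t value used for the mean.

```latex
\hat{T} = N\,\bar{x}, \qquad
\mathrm{SE}_T = N \cdot \mathrm{SE}, \qquad
T = N\,\bar{x} \;\pm\; (\text{multiple}) \cdot N \cdot \mathrm{SE}
```

Both endpoints of the interval for the mean are simply multiplied by $N$, so the confidence level is unchanged.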
28. One-Sided Confidence Intervals
- Previous examples were two-sided confidence intervals: we were concerned with establishing both lower and upper limits for the mean
- In some cases we are only interested in the upper limit (e.g., EPA or OSHA regulations limiting exposure to a chemical)
- In other cases we are interested only in the lower limit (e.g., specifications for the minimum reliability of a nuclear reactor component)
29. One-sided Confidence Intervals
- In these cases, we use one-sided intervals that establish a given level of confidence that the value is below or above a certain level
  - (pop. mean) < (sample mean) + (multiple)(SE)
  - (pop. mean) > (sample mean) - (multiple)(SE)
- In this case $\alpha = 1 - \mathrm{CL}$ should be the area under one tail, and (multiple) = $Z_{\alpha}$ or $t_{\alpha,\nu}$
30. Example: Radon Concentrations
- Three measurements: 2.5, 3.0, 3.5 pCi/L
- Does mean concentration exceed EPA limit of 4.0 pCi/L? Construct a one-sided 95% CI
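One way to work the radon example is sketched below. The slide's own worked answer is not preserved in the text, so this is a reconstruction under the usual assumptions: $\sigma$ is unknown, so t with $\nu = n - 1 = 2$ degrees of freedom is used, and the critical value $t_{0.05,2} = 2.920$ is taken from a standard t table (in Excel, TINV(0.10, 2)).

```python
import math
import statistics

measurements = [2.5, 3.0, 3.5]   # pCi/L, from the slide
limit = 4.0                      # EPA limit, pCi/L

n = len(measurements)
x_bar = statistics.mean(measurements)
s = statistics.stdev(measurements)   # sample standard deviation (n - 1 divisor)
se = s / math.sqrt(n)                # standard error of the mean

t_crit = 2.920                       # t_{0.05, 2}: one-tail area 0.05, 2 d.f.

upper = x_bar + t_crit * se          # one-sided 95% upper confidence limit
print(f"mean = {x_bar:.2f}, s = {s:.2f}, SE = {se:.3f}")
print(f"one-sided 95% upper limit = {upper:.2f} pCi/L")
print("upper limit is below the EPA limit" if upper < limit
      else "cannot rule out exceeding the EPA limit")
```

With only three measurements the multiple is much larger than the 1.645 that Z would give, which is why the interval is so wide relative to the spread of the data.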
31. $\alpha$ for one-sided, $\alpha/2$ for two-sided confidence interval
32. Difference of Means
- If $x_1$ and $x_2$ are independent random variables, and if $y = x_1 - x_2$, then
  $$\mu_y = \mu_1 - \mu_2, \qquad \sigma_y = \sqrt{\sigma_1^2 + \sigma_2^2}$$
- Sample means are independent random variables (assuming samples are drawn randomly), so these rules apply to the difference of sample means
33. Why square root of sum of squares?
- Independence can be represented graphically by perpendicular lines or shapes (knowledge of one gives no information about the other)
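The square-root-of-sum-of-squares rule can be checked numerically. In this sketch the means and standard deviations (3 and 4, chosen so the expected result is the 3-4-5 right triangle) are made-up values, as are the sample size and seed.

```python
import random
import statistics

rng = random.Random(0)      # fixed seed so the run is repeatable
n = 50_000

# Two independent normal variables with standard deviations 3 and 4.
x1 = [rng.gauss(10.0, 3.0) for _ in range(n)]
x2 = [rng.gauss(4.0, 4.0) for _ in range(n)]
y = [a - b for a, b in zip(x1, x2)]

mean_y = statistics.fmean(y)
sd_y = statistics.pstdev(y)

# Expect a mean near 10 - 4 = 6 and a standard deviation near
# sqrt(3**2 + 4**2) = 5, not 3 + 4 = 7 and not 4 - 3 = 1.
print(round(mean_y, 2), round(sd_y, 2))
```

The standard deviations combine like the legs of a right triangle, which matches the perpendicular-lines picture of independence.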
34. CI for Difference Between Means
- Let $\bar{x}_1$ and $\bar{x}_2$ be the means of two samples of size $n_1$ and $n_2$ (e.g., average household income in August and September).
- The difference between sample means, $\bar{x}_1 - \bar{x}_2$, is a random variable with a mean of $(\mu_1 - \mu_2)$ and a standard deviation of
  $$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
35. CI for Difference Between Means
- What if we don't know $\sigma_1$ or $\sigma_2$? Two solutions.
- 1. Assume $\sigma_1 = \sigma_2$, estimated by $s_p$ (pooled standard dev.)
36. CI for Difference Between Means
In both cases, use t with $\nu = n_1 + n_2 - 2$. Excel (Data Analysis) can do either. Which to use?
If $s_1$ and $s_2$ are different, $n_1$ or $n_2$ is small (< 30), and there is no reason to believe $\sigma_1 \ne \sigma_2$, then use method 1.
Otherwise use whichever is most convenient.
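The two standard errors can be compared side by side. The text preserves only method 1 (pooled), so this sketch assumes that method 2 is the usual unpooled form, which uses $s_1$ and $s_2$ separately. The function names and the summary statistics are hypothetical, and the pooled-variance formula is the standard textbook one, not copied from the slides.

```python
import math


def se_pooled(s1: float, n1: int, s2: float, n2: int) -> float:
    """Method 1: SE of (x1_bar - x2_bar) using a pooled standard deviation."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)  # pooled variance
    return math.sqrt(sp2 * (1 / n1 + 1 / n2))


def se_unpooled(s1: float, n1: int, s2: float, n2: int) -> float:
    """Assumed method 2: SE of (x1_bar - x2_bar) using s1 and s2 separately."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Hypothetical summary statistics for two small samples.
s1, n1 = 8.0, 10
s2, n2 = 12.0, 15

print(round(se_pooled(s1, n1, s2, n2), 3))
print(round(se_unpooled(s1, n1, s2, n2), 3))
```

Either SE is then multiplied by the t value for the chosen confidence level to get the half-width of the interval for $\mu_1 - \mu_2$.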
37. Final Exam Scores in PUAF 610, 1994-2000, by Gender
- Why is this considered a sample, rather than a population?
38. Final Exam Scores: Method 2
39. Final Exam Scores: Method 1
40. Confidence Intervals for Paired Samples
- Sometimes we'd like to compare two samples, in which each member of one sample is naturally paired with a member of the other sample
  - employment status or income of individuals in the CPS in consecutive months
  - blood pressure or blood count of an individual before and after a treatment
  - IQ or health status of identical twins
  - test score before and after a Kaplan review course
41. Confidence Intervals for Paired Samples
- Let $x_1$ = value for a member in the first sample,
- $x_2$ = corresponding value in the second sample
- Define a new variable
  - $y = x_1 - x_2$
- Calculate the sample mean and sample standard deviation of $y$
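The paired procedure can be sketched in a few lines. The function name `paired_ci` and the four data pairs are hypothetical, and the multiple 3.182 is the standard table value $t_{0.025,3}$ for a two-sided 95% interval with $\nu = n - 1 = 3$.

```python
import math
import statistics


def paired_ci(x1, x2, multiple):
    """Confidence interval for the mean of y = x1 - x2 (paired samples)."""
    if len(x1) != len(x2):
        raise ValueError("paired samples must have the same length")
    y = [a - b for a, b in zip(x1, x2)]            # one difference per pair
    y_bar = statistics.mean(y)
    se = statistics.stdev(y) / math.sqrt(len(y))   # SE of the mean difference
    return y_bar - multiple * se, y_bar + multiple * se


# Hypothetical before/after values for four individuals.
before = [10, 12, 9, 11]
after = [8, 11, 9, 8]

low, high = paired_ci(before, after, multiple=3.182)   # t_{0.025, 3}
print(f"95% CI for mean difference: ({low:.3f}, {high:.3f})")
```

Working with the differences turns the two-sample problem into a one-sample problem, so the one-sample t machinery applies unchanged.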
42. Example: Gasoline Substitute
- Before a new fuel can be sold, the Clean Air Act requires that the producer demonstrate that it will not increase emissions of air pollutants.
- Petrocoal.xls contains data for NOx emissions for 16 different cars driven with gasoline and with Petrocoal (gasoline mixed with methanol derived from coal).
- Because the same car is used with both fuels, a matched pair analysis is possible. This is more sensitive, because much of the variability associated with car type is eliminated.