Title: Inferences Based on a Single Sample
1Chapter 7
- Inferences Based on a Single Sample
2Parameters and Statistics
- A parameter is a numeric characteristic of a population or distribution, usually symbolized by a Greek letter, such as µ, the population mean.
- Inferential statistics uses sample information to estimate parameters.
- A statistic is a number calculated from data.
- There are usually statistics that do the same job for samples that the parameters do for populations, such as x̄, the sample mean.
3Using Samples for Estimation
[Diagram: a sample statistic (known) is used to estimate the population parameter µ (unknown).]
4The Idea of Estimation
- We want to find a way to estimate the population parameters.
- We only have information from a sample, available in the form of statistics.
- The sample mean, x̄, is an estimator of the population mean, µ.
- This is called a point estimate because it is one point, or a single value.
5Interval Estimation
- There is variation in x̄, since it is a random variable calculated from data.
- A point estimate doesn't reveal anything about how much the estimate varies.
- An interval estimate gives a range of values that is likely to contain the parameter.
- Intervals are often reported in polls, such as "56% ± 4% favor candidate A." This suggests we are not sure it is exactly 56%, but we are quite sure that it is between 52% and 60%.
- 56% is the point estimate, whereas (52%, 60%) is the interval estimate.
6The Confidence Interval
- A confidence interval is an interval estimate that comes with a percentage, called the confidence level.
- The confidence level tells how often, if samples were repeatedly taken, the interval estimate would surround the true parameter.
- We can use the notation (L, U) or (LCL, UCL).
- L and U stand for the Lower and Upper endpoints. The longer versions, LCL and UCL, stand for Lower Confidence Limit and Upper Confidence Limit.
- This interval is built around the point estimate.
7Theory of Confidence Intervals
- Alpha (α) represents the probability that, when the sample is taken, the calculated CI will miss the parameter.
- The confidence level is given by (1 − α)100% and is used to name the interval, so, for example, we may have a 90% CI for µ.
- After sampling, we say that we are, for example, 90% confident that we have captured the true parameter. (There is no probability at this point. Either we did or we didn't, but we don't know.)
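The repeated-sampling meaning of the confidence level can be checked with a small simulation. The sketch below is only an illustration: the population values (µ = 50, σ = 10), the sample size, the number of trials, and the function name `coverage` are all assumptions, and it uses the known-σ interval x̄ ± z(α/2)·σ/√n that the chapter develops for µ.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(mu=50.0, sigma=10.0, n=25, conf=0.90, trials=5000, seed=1):
    """Return the fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)                  # fixed seed so the run is repeatable
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)    # table value for the confidence level
    bound = z * sigma / sqrt(n)                # margin of error, same for every sample
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and compute its mean.
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # Count the interval as a "hit" if it surrounds the true mean.
        if xbar - bound <= mu <= xbar + bound:
            hits += 1
    return hits / trials

rate = coverage()
print(f"Share of intervals that captured mu: {rate:.3f}")
```

The printed share should land near 0.90, while any single interval either contains µ or does not.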
8How to Calculate CIs
- Many CIs have the following basic structure:
- P ± TS
- where P is the parameter estimate,
- T is a table value equal to the number of standard deviations needed for the confidence level,
- and S is the standard deviation of the estimate.
- The quantity TS is also called the Error Bound (B) or Margin of Error.
- The CI should be written as (L, U), where L = P − TS and U = P + TS.
- Don't forget to convert your P ± TS expression to confidence interval form, including parentheses!
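The conversion from P ± TS to (L, U) form can be sketched in a few lines. The function name `to_interval` and the numbers passed to it are illustrative assumptions, chosen to echo the poll example of 56 ± 4.

```python
def to_interval(p, t, s):
    """Convert a 'P ± TS' expression into confidence-interval form (L, U)."""
    bound = t * s                  # error bound B, also called the margin of error
    return (p - bound, p + bound)  # L = P - TS, U = P + TS

# Illustrative values only: estimate 56, table value 2, standard deviation 2.
lower, upper = to_interval(56, 2, 2)
print(f"({lower}, {upper})")       # (52, 60)
```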
9A Confidence Interval for µ
- If σ is known, and
- the population is normally distributed, or n > 30 (so that we can say x̄ is approximately normally distributed), then
$$\bar{x} \pm z(\alpha/2)\,\frac{\sigma}{\sqrt{n}}$$
gives the endpoints for a (1 − α)100% CI for µ.
- Note how this corresponds to the P ± TS formula given earlier.
10Distribution Details
- What is z(α/2)?
- α is the significance level, P(CI will miss).
- The value in parentheses after z is the upper-tail probability, that is, P(Z > z(α/2)) = α/2.
- To find this value in the table, look up the z-value for a probability of 0.5 − α/2.
- Examples
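As an illustration of this lookup, here are two worked cases. The confidence levels are chosen for the sketch, and the table is assumed to give the area between 0 and z:

```latex
\begin{align*}
95\%\ \text{CI}: &\quad \alpha = 0.05,\quad \alpha/2 = 0.025,\quad 0.5 - 0.025 = 0.4750 \;\Rightarrow\; z(0.025) = 1.96 \\
90\%\ \text{CI}: &\quad \alpha = 0.10,\quad \alpha/2 = 0.05,\quad 0.5 - 0.05 = 0.4500 \;\Rightarrow\; z(0.05) \approx 1.645
\end{align*}
```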
11Example: Estimation of µ (σ Known)
- A random sample of 25 items resulted in a sample mean of 50. Construct a 95% confidence interval estimate for µ if σ = 10.
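One way to work this example is to apply the known-σ interval x̄ ± z(α/2)·σ/√n directly. The sketch below takes the z value from Python's standard library instead of a printed table; the function name `z_interval` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf=0.95):
    """Confidence interval for the mean when sigma is known."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z(alpha/2), upper-tail critical value
    bound = z * sigma / sqrt(n)               # margin of error
    return (xbar - bound, xbar + bound)

lower, upper = z_interval(xbar=50, sigma=10, n=25)
print(f"({lower:.2f}, {upper:.2f})")          # (46.08, 53.92)
```

With the table value 1.96, the same arithmetic is 50 ± 1.96(10/√25) = 50 ± 3.92.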
12Confidence Interval Estimates
[Diagram: confidence interval estimates are divided into those for a Mean (σ known or σ unknown), a Proportion, and a Variance.]
13Estimation of µ (σ unknown)
- We now turn to the situation where σ is unknown but the sample size is large or the sampled population is normal.
- Since σ is unknown, we use s in its place.
- However, without knowing σ, we are not able to make use of the z table in building a confidence interval.
- Instead, we will use a distribution called t (Student's t).
- The t distribution is symmetric and bell-shaped like the standard normal, and also has µ = 0, but σ > 1, so the shape is flatter in the middle and thicker in the tails.
14- Student's t-Distributions
- Degrees of Freedom, df
- A parameter that identifies each different distribution in the family of Student's t-distributions. For the methods presented in this chapter, the value of df will be the sample size minus 1, df = n − 1.
[Figure: density curves for the normal distribution, Student's t with df = 15, and Student's t with df = 5.]
15Using t
- As the previous graph shows, the t distribution has another parameter, called degrees of freedom (df). So this is actually a family of distributions, with different df values.
- The higher the df, the closer the t distribution comes to the standard normal.
- For our purposes, df = n − 1. It is actually related to the denominator in the formula for s².
- There is a t-table in the back of the book. It is different from the z-table, so we have to understand how it works.
16The t table
- Refer to the table. First you will notice the left-hand column is for df.
- When df > 100, the z-table can be used, because the values will be very close.
- This table gives tail probabilities, similar to z(α). However, only a selection of probabilities is given, across the top of the table.
- The interior of the table gives the t-values, so it is arranged almost opposite to the z-table.
- The notation used for t-values is t(df, α).
- Just like z(α), α refers to the upper-tail probability.
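17- t-Distribution Showing t(df, α)
18- Example: Find the value of t(12, 0.025).
[Table: portion of the t-table.] Reading across the row for df = 12 to the column for an upper-tail probability of 0.025 gives t(12, 0.025) ≈ 2.18.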
19Confidence Intervals
- When we build our confidence interval, α refers to the probability in both tails combined.
- This is not the same α used in looking up the distribution! What we have to look up is actually α/2, because that's the upper-tail probability.
- And so we come to the formula for a (1 − α)100% CI for µ when σ is unknown:
$$\bar{x} \pm t(n-1,\ \alpha/2)\,\frac{s}{\sqrt{n}}$$
20- Example: A study is conducted to learn how long it takes the typical taxpayer to complete his or her federal income tax return. A random sample of 17 income tax filers showed a mean time (in hours) of 7.8 and a standard deviation of 2.3. Find a 95% confidence interval for the true mean time required to complete a federal income tax return. Assume the time to complete the return is normally distributed.
- Solution
- 1. Parameter of Interest: the mean time required to complete a federal income tax return.
- 2. Confidence Interval Criteria
- a. Assumptions: sampled population assumed normal, σ unknown.
- b. Distribution: table value t will be used.
- c. Confidence level: 1 − α = 0.95
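21- 3. The Sample Evidence
- n = 17, x̄ = 7.8, s = 2.3
- 4. Calculations
- With df = 16 and α/2 = 0.025, the table value is t(16, 0.025) = 2.12, so the interval is $7.8 \pm 2.12 \cdot 2.3/\sqrt{17} = 7.8 \pm 1.18$.
- 5. (6.62, 8.98) is the 95% confidence interval for µ.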
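The arithmetic behind this interval can be reproduced as follows. This is a sketch rather than a general tool: Python's standard library has no t distribution, so the table value t(16, 0.025) = 2.12 is typed in by hand, and the function name `t_interval` is an assumption.

```python
from math import sqrt

def t_interval(xbar, s, n, t_value):
    """Confidence interval for the mean when sigma is unknown.

    t_value must be looked up separately as t(n - 1, alpha/2).
    """
    bound = t_value * s / sqrt(n)             # margin of error
    return (xbar - bound, xbar + bound)

T_16_025 = 2.12   # t(16, 0.025), read from a t-table (df = 17 - 1)
lower, upper = t_interval(xbar=7.8, s=2.3, n=17, t_value=T_16_025)
print(f"({lower:.2f}, {upper:.2f})")          # (6.62, 8.98)
```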
22Confidence Interval for a Proportion
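- Assumptions
- The population follows a binomial distribution.
- The normal approximation can be used if the interval $\hat{p} \pm 3\sqrt{\hat{p}\hat{q}/n}$ does not include 0 or 1,
- or (older guideline) if np and nq are both sufficiently large.
- Confidence interval estimate:
$$\hat{p} \pm z(\alpha/2)\sqrt{\frac{\hat{p}\,\hat{q}}{n}}, \qquad \hat{q} = 1 - \hat{p}$$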
23Example
- A random sample of 400 graduates showed 32 went to grad school. Set up a 95% confidence interval estimate for p.
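A possible worked solution uses the traditional large-sample interval p̂ ± z(α/2)·√(p̂(1 − p̂)/n), with p̂ = 32/400 = 0.08. The function name `proportion_interval` is an assumption made for this sketch.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, conf=0.95):
    """Traditional large-sample confidence interval for a proportion."""
    p_hat = x / n                                   # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)    # z(alpha/2)
    bound = z * sqrt(p_hat * (1 - p_hat) / n)       # margin of error
    return (p_hat - bound, p_hat + bound)

lower, upper = proportion_interval(x=32, n=400)
print(f"({lower:.3f}, {upper:.3f})")                # (0.053, 0.107)
```

So the data suggest that between roughly 5.3% and 10.7% of graduates went on to grad school.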
24New Method
- A new method (Agresti & Coull, 1998) can be used to avoid the problems with extreme values of p. There is no need to check the np or nq values with this method.
- Define
- Then a (1 − α)100% CI for p is given by
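One common statement of the Agresti-Coull adjustment, assumed here for illustration, adds two successes and two failures to the sample before applying the usual formula. (The general version adds z(α/2)²/2 of each instead of 2; the two nearly coincide at 95% confidence.) With x successes in n trials:

```latex
\tilde{p} = \frac{x + 2}{n + 4}, \qquad
\tilde{p} \pm z(\alpha/2)\sqrt{\frac{\tilde{p}\,(1 - \tilde{p})}{n + 4}}
```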
25Example
- In the 2004 presidential election, Ralph Nader had about 0.34% of the vote. Suppose an exit poll was taken to estimate Nader's share of the vote, with a sample size of 200, and 2 people indicated they voted for Nader.
- Note that with the traditional method, np̂ = 2, which is too small for the normal approximation, so the formula is not valid.
- Use the new method to construct a 95% CI for p.
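A sketch of the adjusted calculation for this poll follows. It assumes the "add two successes and two failures" form of the Agresti-Coull interval, that is, an adjusted proportion of (x + 2)/(n + 4). Both that form and the function name `adjusted_interval` are assumptions of this example.

```python
from math import sqrt
from statistics import NormalDist

def adjusted_interval(x, n, conf=0.95):
    """Interval built on the adjusted proportion (x + 2) / (n + 4).

    Assumed form: two successes and two failures are added to the sample.
    """
    n_adj = n + 4
    p_adj = (x + 2) / n_adj                         # adjusted proportion
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)    # z(alpha/2)
    bound = z * sqrt(p_adj * (1 - p_adj) / n_adj)   # margin of error
    return (p_adj - bound, p_adj + bound)

lower, upper = adjusted_interval(x=2, n=200)
print(f"({lower:.4f}, {upper:.4f})")                # (0.0006, 0.0386)
```

Under these assumptions the lower endpoint stays just above 0, so the interval remains inside the range of valid proportions.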
26Choosing CI Formulas
27Sample Size Calculation
- We may wish to decide upon a sample size so that we can get a confidence interval with a pre-determined width.
- This is common in polls, where the margin of error is usually decided in advance.
- All CIs we have seen so far have the form P ± B, where B is the margin of error.
- We want to fix B in advance.
28Sample Size for Estimating µ, σ Known
- Suppose X is a random variable with σ = 10 and we want a 90% CI to have a Bound, or Margin of Error, of 3.
- Use the formula $B = z(\alpha/2)\,\sigma/\sqrt{n}$.
- Fill in the numbers: $3 = 1.645 \cdot 10/\sqrt{n}$.
- Solve: $n = (1.645 \cdot 10/3)^2 \approx 30.07$.
- This is the minimum sample size, but we need a whole number, so round up to n = 31.
29Sample Size for Estimating µ, σ Unknown
- If σ is unknown, the confidence interval will be calculated using the t distribution, unless n is very large.
- But the degrees of freedom depend on n, which we don't know.
- The calculation also depends on s, which we don't know until after sampling.
- We must have an initial guess for s, and then use the normal distribution to approximate the t distribution, since it does not require knowing n.
30Example (σ unknown)
- A manufacturer needs to be able to estimate the width of a new part to within 2 mm with 95% confidence. There is not enough history to know what σ would be, so a pilot study is run by measuring 6 parts, and finding s = 3.4 mm.
- Using s in place of σ and z(0.025) = 1.96: $n = (1.96 \cdot 3.4/2)^2 \approx 11.1$.
- Rounding up to the next whole number gives n = 12.
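Both sample-size calculations for a mean come from solving B = z(α/2)·σ/√n for n and rounding up: the one with σ = 10 and B = 3 at 90%, and the pilot-study one with s = 3.4 mm and B = 2 mm at 95%. A minimal sketch, in which the function name `sample_size_for_mean` is an assumption:

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, bound, conf):
    """Smallest whole n whose margin of error z * sigma / sqrt(n) is at most bound.

    sigma may be a known value or an initial guess such as a pilot-study s.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z(alpha/2)
    return ceil((z * sigma / bound) ** 2)          # always round up

print(sample_size_for_mean(sigma=10, bound=3, conf=0.90))    # 31
print(sample_size_for_mean(sigma=3.4, bound=2, conf=0.95))   # 12
```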
31Sample Size for Estimating p, a Population Proportion
- With a population proportion, we also have a problem in getting the standard deviation part of the Margin of Error, since it depends on p, the thing we are trying to estimate.
- There are two possibilities:
- 1) We may have a preliminary guess about p that we can use, or
- 2) We can use p = 0.5 because that maximizes the standard deviation.
- The sample size will be calculated from the desired margin of error, or error bound.
32Example (proportion)
- A pollster wants to do a simple random sample to estimate the proportion of the population favoring an increase in property taxes for school funding. He wants a margin of error of 3%, with 90% confidence. The general belief is that it will be a close election, so an initial value of p = 0.5 is reasonable.
- Solving $0.03 = 1.645\sqrt{0.5 \cdot 0.5/n}$ for n gives $n = 0.25\,(1.645/0.03)^2 \approx 751.7$.
- Rounding up to the next whole number gives n = 752.
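The pollster's calculation can be sketched by solving B = z(α/2)·√(p(1 − p)/n) for n. The function name `sample_size_for_proportion` and its default guess of 0.5 are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(bound, conf, p_guess=0.5):
    """Smallest whole n giving a margin of error of at most bound.

    p_guess = 0.5 is the conservative choice because it maximizes p(1 - p).
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)          # z(alpha/2)
    return ceil(p_guess * (1 - p_guess) * (z / bound) ** 2)

print(sample_size_for_proportion(bound=0.03, conf=0.90))  # 752
```

A preliminary guess further from 0.5 would give a smaller required sample, which is why 0.5 is the safe default when nothing is known about p.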
33Misc. Notes
- The CI for µ formula using z is also called the Large Sample CI. It is valid when σ is known, for any sample size, but it also serves as an approximation of the t formula (using s) when n is large. How large? Many books say n > 30. I recommend making use of the t table up to n = 100, since that is how far it goes. Statistical computer programs will always calculate t values, regardless of how large n is, for the σ unknown case.
34Misc. Notes
- The CI for µ formula using t is also called the Small Sample CI, but only because the other one is called Large Sample. It is valid for any sample size when σ is unknown and the population is normal.
- We do not cover methods for small samples that do not come from a normal population in this course (non-parametric methods).
35Misc. Notes
- The t table is limited because it does not have a very good selection of probabilities. It also jumps in the df column. It is possible to use the closest value or interpolate when you can't find what you need, but a better option is to use the Excel functions TDIST and TINV.
- However, you have to be VERY careful about what Excel is giving you.
36Excel's TDIST Function
- TDIST takes a t value and returns the tail
probability. You can choose one or two tails.
37Excel's TINV Function
- The TINV Function takes a two-tailed probability
and returns a t-value (just what we need now).
38Excel Function Comparison
- The NORMSINV function, by contrast, takes a left-tailed probability and returns a z-value. This means you have to enter α/2 and take the negative, or else use 1 − α/2 as the argument.
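The same left-tail convention appears in other software. As an analogy only (this is Python, not Excel), `NormalDist().inv_cdf` from the standard library also takes a left-tailed probability, so the two routes described above can be compared side by side. The choice α = 0.05 is illustrative.

```python
from statistics import NormalDist

alpha = 0.05
inv = NormalDist().inv_cdf               # left-tailed probability -> z-value

z_from_negation = -inv(alpha / 2)        # enter alpha/2, then take the negative
z_from_complement = inv(1 - alpha / 2)   # or enter 1 - alpha/2 directly

print(round(z_from_negation, 2), round(z_from_complement, 2))   # 1.96 1.96
```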