Title: Statistics 221
1Statistics 221
- Chapter 7
- Sampling and Sampling Distributions
2Samples and Populations
- A population is the entire set of all elements of interest in a
study. Examples: all students in a university, all residents of a
country, all registered voters, etc.
- A sample is a subset of that population.
- Numerical summaries of a population are called parameters;
numerical summaries of a sample are called statistics.
3Statistical Inference
- In a typical study, a representative sample will be drawn from a
population and statistics will be calculated.
- The statistics are used to draw inferences about the population as
a whole.
4Random samples
- If the sample is drawn using recognized random sampling techniques,
it is likely to be representative of the population, and therefore
the sample statistics should provide good estimates of the
population parameters.
5Making inferences about population means and
proportions
- It is common to make inferences about population means and
proportions using sample statistics.
- For example, we may take a sample of employees, ask them what their
annual salary is, and then compute an average. That sample average
will be our best point estimate of the population's average salary.
- Or we may take a sample of employees and ask them whether they are
in favor of flex-hours. The percentage or proportion of that sample
who favor flex-hours is our best point estimate of the population
proportion who favor flex-hours.
6The Electronics Associates sampling problem
- The director of Personnel for Electronics Associates, Inc. (EAI)
has been assigned the task of developing a profile of the company's
2500 managers.
- The characteristics of interest are:
- Average salary
- The proportion who have completed the management training program.
7Sampling techniques
- The most common method for gathering a sample is to use simple
random sampling.
- The process of selecting a simple random sample depends on whether
the population is finite or infinite.
8Sampling from a finite population
- The population of managers at EAI is finite (2500).
- There are many possible samples of, say, size 30 (n = 30) that
could be drawn from a population of 2500 (N = 2500).
- The sampling process should ensure that each possible sample of
size n drawn from a population of size N has an equal chance of
being selected.
- One technique would be to assign each manager a number and then use
a random-number generator to generate n random numbers. If a
manager's number is generated, that manager is selected for the
survey.
9Sampling with and without replacement from a
finite population
- It is possible that a manager's number is selected more than once.
If we allow that to happen, we are sampling with replacement. If we
eliminate that manager's number so that it can't be selected again,
we are sampling without replacement.
- For a finite population we generally sample without replacement by
making sure that the same number is not selected more than once.
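The two procedures can be sketched in a few lines of Python. This is only an illustration, assuming the managers are numbered 1 to 2500 and using an arbitrary seed; the variable names are made up for the example.

```python
import random

N = 2500  # population size: EAI managers, numbered 1..N (assumed numbering)
n = 30    # sample size

rng = random.Random(221)  # arbitrary fixed seed so the run is repeatable
manager_ids = range(1, N + 1)

# Without replacement: each manager number can appear at most once.
without_replacement = rng.sample(manager_ids, n)

# With replacement: the same manager number may come up more than once.
with_replacement = rng.choices(manager_ids, k=n)

print(sorted(without_replacement))
print(sorted(with_replacement))
```

With n = 30 and N = 2500, repeats in the second list are possible but uncommon, which is why the two procedures behave almost identically when the sample is small relative to the population.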
10Sampling with and without replacement from an
infinite population
- When the population is infinite, the population size (N) is so
large and the sample size (n) is so (relatively) small that the
probability that the same element will come up more than once is
negligible. We assume that it won't happen and just proceed as if we
are sampling with replacement.
11What we're about to learn
- We're about to learn that if you took a large number of separate
samples of a population, and you plotted all the sample means, you
would have an (approximately) normal distribution even if the
population's distribution is NOT normal.
- The question we seek the answer to is still "What's the probability
of getting a particular x?" but now x is the mean of a sample, not a
single value.
12Example
- A new establishment is open for only three days before it goes out
of business. The sales volume for each of those three days was 1, 2,
and 5.
- We want to statistically analyze sales volume. The population of
interest consists of the values 1, 2, and 5.
13Calculate parameters
- The population is size 3.
- The mean $\mu$ of the population is (1 + 2 + 5)/3, or 8/3, or
about 2.7.
- The std deviation $\sigma$ of the population is about 1.7:

x    (x - mean)    (x - mean)^2
1      -1.7           2.89
2      -0.7           0.49
5       2.3           5.29
Sum                   8.67

$\sigma = \sqrt{8.67 / 3} \approx 1.7$
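As a quick check of the hand calculation, the same parameters can be computed with Python's standard `statistics` module. This is a minimal sketch; `pstdev` is used because the three values are the whole population, so the sum of squares is divided by N rather than N - 1.

```python
from statistics import fmean, pstdev

sales = [1, 2, 5]  # the whole population: three days of sales

mu = fmean(sales)      # population mean
sigma = pstdev(sales)  # population standard deviation (divides by N)

print(round(mu, 1), round(sigma, 1))  # prints: 2.7 1.7
```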
14Now let us take all possible samples of size 2
- We will list all possible samples of size 2 with replacement.
Therefore, samples can consist of the same value twice (1-1, 2-2,
5-5).
- Why are we using "with replacement"? Because in most cases (but not
in this case) the sample size is going to be less than 5% of the
population, and when that happens, we use the "with replacement"
formulas instead of the "without replacement" formulas because the
"with replacement" formulas are simpler.
15How many permutations of 2 are possible from
three values?
- 3 × 3, or 9
- Here are the 9 possible samples:
- 1-1, 1-2, 1-5
- 2-1, 2-2, 2-5
- 5-1, 5-2, 5-5
16Here are all possible samples of size 2 along
with their sample statistics
17Important Point
- Notice that when we take the mean of the means, we get 2.7, which
is also the population mean.
- When the group of samples includes all possible samples, the mean
of the means will always equal the population mean.
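The claim can be verified by brute force for the three-value population. The sketch below lists all 9 ordered samples of size 2 (with replacement), computes each sample mean, and compares the mean of those means with the population mean; the variable names are illustrative only.

```python
from itertools import product
from statistics import fmean

population = [1, 2, 5]

# All ordered samples of size 2 drawn with replacement: 3 x 3 = 9 of them.
samples = list(product(population, repeat=2))
sample_means = [fmean(s) for s in samples]

mean_of_means = fmean(sample_means)

for s, m in zip(samples, sample_means):
    print(s, m)
print("mean of means:", round(mean_of_means, 1))
print("population mean:", round(fmean(population), 1))
```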
18Important Point
Frequency Distribution of the Sample Means
Although this distribution doesn't look very normal, that is because
the sample size is only 2. As the sample size increases, the
probability distribution of the sample means will approach a normal
distribution.
19Example 2 (The unknown is the population
proportion / percentage)
- This time we consider a population that consists of all 100 US
Senators, 87 being male and 13 being female.
- Now let's say we don't know that 13% are female because the
population is too big to survey everyone.
- We are trying to determine the proportion of senators that are
female. (In other words, the unknown is the proportion / percentage
of the population that are female.)
201. We start out by obtaining several samples of
size 5
- We decide to take 100 samples of size 5 and record the percentage
of female senators in each sample.
- We create a frequency distribution that shows the % of females in
each sample of 5.
- If we take the mean of all those sample percentages, we should get
13%.
21Results from just 100 samples of size 5
Proportion of females (x)    Frequency (f) (# of samples)
0                            26
.1                           41
.2                           24
.3                           7
.4                           1
.5                           1

Mean = $\sum (x \cdot f) / \sum f$ = .119 (not quite .13)
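The mean of a frequency distribution is a weighted average: each x value is multiplied by its frequency, the products are summed, and the sum is divided by the total number of samples. A minimal sketch using the table's values (the dictionary layout is just one convenient way to hold them):

```python
# Proportion of females in a sample (x) -> number of samples with that value (f)
freq = {0.0: 26, 0.1: 41, 0.2: 24, 0.3: 7, 0.4: 1, 0.5: 1}

total_samples = sum(freq.values())
weighted_sum = sum(x * f for x, f in freq.items())
mean_proportion = weighted_sum / total_samples

print(total_samples, round(mean_proportion, 3))  # prints: 100 0.119
```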
22Why didn't we get 13% for the mean percentage of women?
- Because we only took 100 samples of size 5, when there are a
possible $100^5$ (10 billion) samples of size 5.
- So the mean of sample means didn't exactly mirror the population
mean.
23Here is the distribution of sample percentages
when the number of samples is 100
It resembles a normal curve, but it's not exactly normal.
24If our frequency distribution had included all
possible samples of size 5
- (1) the distribution of sample percentages would be (almost)
normal, and
- (2) the mean of sample means would have been the same as the
population mean (13%).
By the rule of thumb used here, we would treat it as (completely)
normal only if n ≥ 30.
25If we had taken all 10 billion possible samples
the distribution would be almost normal with a
mean of .13
26Another important point
- We can see that when using a sample statistic to estimate a
population parameter, some statistics are good in the sense that they
target the population parameter and are therefore likely to yield
good results. Such statistics are called unbiased estimators.
- Statistics that target population parameters: mean, variance,
proportion.
- Statistics that do not target population parameters: median, range,
standard deviation.
27Practice exercise (p. 256 - 6)
- Here are the numbers of sales per day that were made by Kim Ryan, a
courteous telemarketer who worked four days before being fired: 1,
11, 9, 3. Assume that samples of size 2 are randomly selected with
replacement from this population of four values.
- A. List the 16 different possible samples and find the mean of each
of them.
- B. Identify the probability of each sample, then describe the
sampling distribution of sample means (Hint: see Table 5-3).
- C. Find the mean of the sampling distribution.
- D. Is the mean of the sampling distribution (from part C) equal to
the mean of the population of the four listed values? Are those means
always equal?
28Practice exercise (p. 256 - 6)
- Open file dataSetsForProjectsCh5.xls
- Go to worksheet telemarketing
- Fill in the shaded cells with the appropriate
values and answer the questions.
29The Central Limit Theorem
30The Central Limit Theorem conditions
- 1. Let's say that a random variable x has a distribution (which may
or may not be normal) with mean $\mu$ and standard deviation
$\sigma$.
- 2. Several samples, all of the same size n, are randomly selected
from the population.
- 3. All the sample means are plotted on a probability distribution.
31The Central Limit Theorem assertions
- 1. That probability distribution of sample means will, as the
sample size increases, approach a normal distribution even if the
population does not have a normal distribution!
- 2. Further, the mean of the sample means ($\mu_{\bar{x}}$) will be
the same as the population mean $\mu$.
- 3. The standard deviation of the sample means ($\sigma_{\bar{x}}$)
will be $\sigma / \sqrt{n}$.
32One more point to add to the Central Limit theorem
- Recall #1 again: the probability distribution of sample means will,
as the sample size increases, approach a normal distribution even if
the population does not have a normal distribution!
- The key word is "approach". That is, if the sample size n ≥ 30,
we treat the distribution of sample means as normal.
- If the sample size n is less than 30, the distribution of the
sample means only approaches normal.
- BUT if the original population is already normally distributed,
then the distribution of sample means will be normal for any sample
size n (not just when n ≥ 30).
33An example demonstrating the Central Limit
Theorem concept
- If we take the last 4 digits of the social security numbers of
every US citizen, we have a population of values that form a uniform
distribution.
- Recall that a uniform distribution means that every value from 0000
to 9999 is equally likely to occur.
34A uniform distribution
35Let's say we select a sample of 50 people
- And we take the last 4 digits of each of their social security
numbers and we lump them together as one big sample of 200 (4 × 50)
digits.
- Then we calculate the mean of those 200 digits to be 4.5.
- Then we calculate the std deviation of those 200 values to be 2.8.
36Then we create a frequency distribution based on
that one sample of 200 digits
Distribution of 200 digits from Social Security
Numbers (Last 4 digits from 50 students)
It's not normal, nor does it approximate the uniform distribution of
the population very closely.
37But now treat the sample data as 50 samples of 4
instead of 1 sample of 200
And calculate a mean for each sample of size 4, then create a
frequency distribution of those sample means.
38 And we have an (almost) normal distribution
Distribution of 50 Sample Means
Even though the population does not have a normal distribution, the
distribution of the sample means is (almost) normal. And the std
deviation is $\sigma / \sqrt{n}$.
39Furthermore
- Had we used samples of size 30 or more (instead of 4) we could
remove the word "almost": we would treat the frequency distribution
of the sample means as fully normal.
- As the sample size increases, the frequency distribution of the
sample means approaches normal.
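The social-security-digit demonstration can be imitated with a small simulation. The sketch below is an assumption-laden stand-in for the classroom data: it draws random digits 0 to 9 (assumed equally likely) instead of real SSN digits, uses 20,000 samples of size 4 rather than 50, and fixes an arbitrary seed. It checks two of the Central Limit Theorem assertions: the mean of the sample means is close to the population mean, and their standard deviation is close to $\sigma / \sqrt{n}$.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable

digits = list(range(10))   # uniform population of digits 0..9
pop_mean = fmean(digits)   # 4.5
pop_sd = pstdev(digits)    # about 2.87

n = 4                 # sample size, as in the slides
num_samples = 20_000  # far more than the 50 used in class, to reduce noise

# Draw each sample with replacement and record its mean.
sample_means = [fmean(rng.choices(digits, k=n)) for _ in range(num_samples)]

mean_of_means = fmean(sample_means)
sd_of_means = pstdev(sample_means)
predicted_sd = pop_sd / n ** 0.5  # sigma / sqrt(n)

print("mean of sample means:", round(mean_of_means, 2))
print("sd of sample means:  ", round(sd_of_means, 2))
print("sigma / sqrt(n):     ", round(predicted_sd, 2))
```

Raising `n` in this sketch makes a histogram of `sample_means` look progressively more bell-shaped, even though the population of digits is flat.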
40Applying the Central Limit Theorem
- In practice, we don't take several samples of size n; we take one
sample of size n and we treat the mean of that sample like it's a
single x-value, one of many possible sample means.
- Then we calculate a z-value for that x so that we can derive a
p-value (the probability of getting that particular x, the sample
mean).
- Notice that we are still operating under the assumption that we
know the population mean $\mu$ and population std deviation $\sigma$.
- We are still asking the question, "What's the probability of
getting a particular x?" but now x is one of several potential sample
means.
41Calculating a z-value when x is a sample mean
not a single value
x as one possible single outcome:
$$z = \frac{x - \mu}{\sigma}$$
x as one possible sample mean:
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}}$$
- It's actually the same formula because $\mu_{\bar{x}} = \mu$ (the
mean of the population is the same as the mean of the sample means),
but the standard deviations of the two distributions are different.
The standard deviation for the sample means is lower:
$\sigma_{\bar{x}} = \sigma / \sqrt{n}$.
42Example
- Given that the population of men has normally distributed weights
with a mean of 172 lb and a standard deviation of 29 lb:
- a) if one man is randomly selected, find the probability that his
weight is greater than 167 lb.
- b) if 12 different men are randomly selected, find the probability
that their mean weight is greater than 167 lb.
43Calculating the z-values
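x as one possible single outcome:
$$z = \frac{x - \mu}{\sigma} = \frac{167 - 172}{29} = -0.17$$
x as one possible sample mean, using $\mu_{\bar{x}} = \mu$ and
$\sigma_{\bar{x}} = \sigma / \sqrt{n}$:
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{167 - 172}{29 / \sqrt{12}} = -0.60$$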
44Using z to lookup p, we find that if one man is
randomly selected, the probability that his
weight is greater than 167 lb. is 0.5675.
45Using z to lookup p, we find that if 12 different
men are randomly selected, the probability that
their mean weight is greater than 167 lb is
0.7257.
The frequency distribution of the sample means is
narrower (less variation) and taller than the
frequency distribution of the population but the
mean is the same.
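The two lookups can be reproduced with Python's `statistics.NormalDist` in place of a printed z table. This is a sketch, not the course's method: because it does not round z to two decimals before the lookup, the probabilities differ from the table-based .5675 and .7257 in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 172, 29  # population mean and std deviation of men's weights (lb)
x = 167              # weight of interest (lb)
n = 12               # sample size for part b

std_normal = NormalDist()  # standard normal: mean 0, std deviation 1

# Part a: one man, so use the population std deviation.
z_one = (x - mu) / sigma
p_one = 1 - std_normal.cdf(z_one)

# Part b: mean of 12 men, so use sigma / sqrt(n).
z_mean = (x - mu) / (sigma / sqrt(n))
p_mean = 1 - std_normal.cdf(z_mean)

print(f"one man:    z = {z_one:.2f}, P(x > 167)    = {p_one:.4f}")
print(f"mean of 12: z = {z_mean:.2f}, P(mean > 167) = {p_mean:.4f}")
```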
46The sample mean distribution has less dispersion
than the distribution of individual values
- Since the individual weights are more spread out than the sample
average weights, only about 57% of the area under the curve lies
above 167 lb for individual weights, while about 73% lies above
167 lb for average weights.
- For example, while an outlier can have a big effect on the
distribution's variation for individual weights, when only plotting
sample averages, an outlier will get averaged into other values and
will not be so outlying.
- Sample means cluster together more than individual values, so it is
more unusual for the average of a group of 12 to deviate from the
mean than it is for an individual value to deviate from the mean.
47Practical Interpretation
- There is a .5675 probability that an individual man will weigh more
than 167 lb, and there is a .7257 probability that 12 men will have a
mean weight of more than 167 lb.
- Given that the gondola's maximum capacity is 2004 lb (12 × 167 lb),
it is likely (about .73) to be overloaded if it is filled with 12
randomly selected men.
- However, there is some hope that the gondola carriage won't come
crashing down to earth because (1) skiers are generally leaner than
the general public, (2) some of the 12 passengers are likely to be
women, and (3) 2004 lb is a very conservative limit; in reality it
can hold a lot more weight than that.
48Applying the Central Limit Theorem when
hypothesis testing.
- So far, the only question we have learned how to answer is this:
"What is the probability of getting a particular x, given known
values of $\mu$ and $\sigma$?"
- How useful is that? In each case, we already knew the population
mean (and std. deviation), so why was it so helpful to know the
probability of getting a certain x when $\mu$ was known already?
- In a more realistic situation, we hypothesize about what $\mu$ is,
and we use the value of x (a sample mean) to determine whether to
accept or reject that hypothesis.
49How we use all this x, z, and p stuff to do
hypothesis testing
- A typical hypothesis: "I believe that the mean of this population
is x."
- Written like this: $H_0: \mu = x$
- And the way that hypothesis is tested is by taking a sample. Now if
the mean of that sample is a lot different from the mean you are
claiming for the population (x), you reject the hypothesis.
50"A lot different"?
- The Rare Event Rule: If, under a given assumption ($\mu = x$), the
probability of getting a particular sample mean is really small (that
is, the sample mean and $\mu$ are really far apart, so the z-value is
large in magnitude), we conclude that the assumption ($\mu = x$) is
probably false (and reject the hypothesis).
51Example
- Assume that the population of human body temperatures has a mean of
98.6°F, as is commonly believed. Also assume that the population std.
deviation is .62°F. If a sample of size n = 106 is randomly selected,
and its mean turns out to be 98.2°F, should we still hang on to the
belief that the population mean is 98.6?
52Assumptions we can make because of the Central
Limit Theorem
- We imagine that our sample of 106 is one of many possible samples
of size 106, and if all those samples were taken and means obtained,
we can assume that:
- the distribution of those sample means would be normal (since
n ≥ 30),
- their mean of means $\mu_{\bar{x}}$ would be the same as the
population mean ($\mu = 98.6$),
- and their std deviation ($\sigma_{\bar{x}}$) would be
$\sigma / \sqrt{n}$ (or $.62 / \sqrt{106}$, or .0602).
53Hypothesis $H_0: \mu = 98.6$
- Based on these assumptions, we can calculate a z value using
$\mu_{\bar{x}} = 98.6$ and $\sigma_{\bar{x}} = .0602$.
- From z we derive p, and if p is less than 5%, we reject the
hypothesis that the mean of the population is 98.6.
$$z = \frac{98.2 - 98.6}{.62 / \sqrt{106}} = -6.64$$
Since the p-value associated with a z of -6.64 is off the charts, the
probability is about .00000000002 (clearly less than 5%), so we
reject the hypothesis that $\mu = 98.6$.
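A z of -6.64 is beyond the range of a printed table, but the tail probability can still be computed directly. The sketch below is one way to do it, using `math.erfc` for the lower tail of the standard normal distribution; the 5% cutoff is the one used in these slides, and the variable names are illustrative.

```python
from math import erfc, sqrt

mu_0 = 98.6    # hypothesized population mean (deg F)
sigma = 0.62   # assumed population std deviation (deg F)
n = 106        # sample size
x_bar = 98.2   # observed sample mean (deg F)
alpha = 0.05   # cutoff for a "rare event"

std_error = sigma / sqrt(n)      # sigma / sqrt(n)
z = (x_bar - mu_0) / std_error   # z value of the sample mean

# Lower-tail probability P(Z <= z) for a standard normal variable.
p = 0.5 * erfc(-z / sqrt(2))

reject = p < alpha
print(f"std error = {std_error:.4f}, z = {z:.2f}, p = {p:.1e}, reject H0: {reject}")
```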
54Practice exercise (p. 268- 9)
- The Rock-n-Roller Coaster at Disney-MGM Studio in Orlando has two
seats in each row. When designing that roller coaster, the total
width of the two seats in each row had to be determined. In the worst
case scenario, both seats are occupied by men. Men have hip breadths
that are normally distributed with a mean of 14.4 in. and a standard
deviation of 1.0 in. Assume that two male riders are randomly
selected.
- A. Find the probability that their mean hip width is greater than
16.0 in.
- B. If each row of two seats is designed to fit two men only if they
have a breadth of 16.0 inches or less, would too many riders be
unable to fit? Does this design appear to be acceptable?
55This question is "what's the probability of getting a sample with a
mean hip width of x?" NOT "what's the probability of getting an
individual with hip width x?"
56Can we apply the central limit theorem
assumptions?
- The sample size is 2. Since the sample size (n) is less than 30, we
cannot conclude from sample size alone that the distribution of
sample means will be a normal distribution.
- If we can't conclude that, then the z-value calculation based on the
central limit theorem ($\mu_{\bar{x}} = \mu$ and
$\sigma_{\bar{x}} = \sigma / \sqrt{n}$ with a normal curve) can't be
used.
- But since they told us that the underlying population has a normal
distribution, we know that the distribution of sample means is also
normal, regardless of what n is.
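57Draw picture: What is $P(\bar{x} > 16)$ when $\mu_{\bar{x}} = 14.4$?
The red curve is the distribution of the sample means. The blue curve
is the distribution of the population.
[Figure: two normal curves centered at $\mu = \mu_{\bar{x}} = 14.4$,
with $\sigma = 1.0$ for the population and
$\sigma_{\bar{x}} = 1.0 / \sqrt{2}$ for the sample means; p is the
area to the right of $\bar{x} = 16$.]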
582. Calculate z
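Using $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma / \sqrt{n}$:
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{16 - 14.4}{1.0 / \sqrt{2}} = \frac{1.6}{.707} = 2.26$$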
59Lookup p
60$P(\bar{x} > 16) = 1 - .9881 = .0119$
p is .0119
[Figure: the same two curves centered at 14.4, with p = .0119 marked
as the area to the right of $\bar{x} = 16$.]
61Is this acceptable?
- If each row of two seats is designed to fit two men only if they
have a breadth of 16.0 inches or less, would too many riders be
unable to fit? Does this design appear to be acceptable?
- About 1% of randomly selected pairs of men won't fit into the
seats. Yes, that appears to be acceptable, since only about 1 pair in
100 would be affected.
62Practice exercise (p. 269 - 16)
- Scores of men on the verbal portion of the SAT-I test are normally
distributed with a mean of 509 and a standard deviation of 112.
Randomly selected men are given the Columbia Review Course before
taking the SAT test. Assume that the course did not help improve
their scores (the null hypothesis).
- A. If 1 of the men is randomly selected, find the probability that
his score is at least 590.
- B. If 16 men are randomly selected, find the probability that their
mean score is at least 590.
- C. In finding the probability for part (b), why CAN the central
limit theorem be used even though the sample size is below 30?
- D. If the random sample of 16 men does result in a mean score of
590, is there strong evidence to support the claim that the course is
actually effective? Why or why not?
63Find $P(x \ge 590)$
- A. If 1 of the men is randomly selected, find the probability that
his score is at least 590.
641. Draw picture: What is $P(x \ge 590)$?
[Figure: normal curve with $\mu = 509$ and $\sigma = 112$; p is the
area to the right of x = 590.]
652. Calculate z
$$z = \frac{x - \mu}{\sigma} = \frac{590 - 509}{112} = .723$$
663. Use z to lookup p
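674. Revisit picture and write in p: $P(x \ge 590) = 1 - .7642 = .2358$
p = .2358
[Figure: normal curve with $\mu = 509$ and $\sigma = 112$; p = .2358
is the area to the right of x = 590.]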
68Find $P(\bar{x} \ge 590)$
- B. If 16 men are randomly selected, find the probability that their
mean score is at least 590.
- This question is "what's the probability of getting this particular
sample average?" NOT "what's the probability of getting this
particular individual value?"
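691. Draw picture: What is $P(\bar{x} \ge 590)$?
[Figure: two normal curves centered at $\mu = \mu_{\bar{x}} = 509$,
with $\sigma = 112$ for the population and
$\sigma_{\bar{x}} = 112 / \sqrt{16}$ for the sample means; p is the
area to the right of $\bar{x} = 590$.]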
702. Calculate z value
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{590 - 509}{112 / \sqrt{16}} = \frac{81}{28} = 2.89$$
713. Use z to lookup p
For z = 2.89 (row 2.8 of the table), the table value is .9981.
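724. Revisit picture and write in p: $P(\bar{x} \ge 590) = 1 - .9981 = .0019$
p = .0019
[Figure: two normal curves centered at $\mu = \mu_{\bar{x}} = 509$,
with $\sigma = 112$ and $\sigma_{\bar{x}} = 112 / \sqrt{16}$;
p = .0019 is the area to the right of $\bar{x} = 590$.]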
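Parts A and B can also be checked without a z table by describing each distribution directly. In the sketch below (an illustration, not the textbook's procedure), `NormalDist(mu, sigma)` models individual scores and `NormalDist(mu, sigma / sqrt(n))` models the mean of n scores; since z is not rounded before the lookup, part A comes out slightly different from the table-based .2358.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 509, 112  # mean and std deviation of men's SAT-I verbal scores
cutoff = 590
n = 16

individual_scores = NormalDist(mu, sigma)        # part A: one man
mean_of_16 = NormalDist(mu, sigma / sqrt(n))     # part B: std deviation 112 / 4 = 28

p_a = 1 - individual_scores.cdf(cutoff)  # P(x >= 590)
p_b = 1 - mean_of_16.cdf(cutoff)         # P(sample mean >= 590)

print(f"A: P(x >= 590)    = {p_a:.4f}")
print(f"B: P(mean >= 590) = {p_b:.4f}")
```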
73C. In finding the probability for part (b), why
CAN the central limit theorem be used even though
the sample size is below 30?
- Because the underlying population (all SAT
scores) is normal.
74D. If the random sample of 16 men does result in
a mean score of 590, is there strong evidence to
support the claim that the course is actually
effective? Why or why not?
- Yes, because it is very unusual (p = .0019) for 16 men to obtain an
average score that high.
75Let's summarize
- The first part of the chapter was about "What's the probability of
getting an outcome (x) that is less than or greater than some value?"
and "Is it normal to get an outcome like this?"
- The second part of this chapter applied these questions to
situations where the outcome variable was not a single individual
outcome but an average of a sample of outcomes. In other words,
"What's the probability of getting this particular sample average?"
and "Is it normal to get such a sample average if the population mean
is really what we think it is?"
76Homework 12
- 4 on page 263: Random sampling from a finite population
- 10 on page 264: Finite or infinite populations?
- 16 on page 267: Develop a point estimate
- 26 on page 278: What's the probability that the sample statistic is
close to the population parameter?
- 38 on page 284: What's the probability that the sample statistic is
close to the population parameter?