Statistics 221 - PowerPoint PPT Presentation

Statistics 221

Provided by: margaret1

Transcript and Presenter's Notes


1
Statistics 221
  • Chapter 7
  • Sampling and Sampling Distributions

2
Samples and Populations
  • A population is the entire set of all elements of
    interest in a study. Examples: all students in a
    university, all residents of a country, all
    registered voters, etc.
  • A sample is a subset of that population.
  • Numerical summaries of a population are called
    parameters; numerical summaries of a sample are
    called statistics.

3
Statistical Inference
  • In a typical study, a representative sample will
    be drawn from a population and statistics will be
    calculated.
  • The statistics are used to draw inferences about
    the population as a whole.

4
Random samples
  • If the sample was drawn using recognized random
    sampling techniques, it is likely to be
    representative of the population, and therefore
    the sample statistics should provide good
    estimates of the population parameters.

5
Making inferences about population means and
proportions
  • It is common to make inferences about population
    means and proportions using sample statistics.
  • For example, we may take a sample of employees,
    ask them what their annual salary is, and then
    compute an average. That sample average will be
    our best point estimate of the population's
    average salary.
  • Or we may take a sample of employees and ask them
    whether they are in favor of flex-hours. The
    percentage or proportion of that sample who favor
    flex-hours is our best point estimate of the
    population proportion who favor flex-hours.

6
The Electronics Associates sampling problem
  • The director of Personnel for Electronics
    Associates, Inc. (EAI) has been assigned the task
    of developing a profile of the company's 2500
    managers.
  • The characteristics of interest are:
  • Average salary
  • The proportion who have completed the management
    training program.

7
Sampling techniques
  • The most common method for gathering a sample is
    to use simple random sampling.
  • The process of selecting a simple random sample
    depends on whether the population is finite or
    infinite.

8
Sampling from a finite population
  • The population of managers at EAI is finite
    (2500).
  • There are many possible samples of, say, size 30
    (n = 30) that could be drawn from a population of
    2500 (N = 2500).
  • The sampling process should assure that each
    possible sample of size n drawn from a population
    of size N has an equal chance of being selected.
  • One technique would be to assign each manager a
    number and then use a random-number generator to
    generate n random numbers. If a manager's number
    is generated, that manager is selected for the
    survey.

9
Sampling with and without replacement from a
finite population
  • It is possible that a manager's number is
    selected more than once. If we allow that to
    happen, we are sampling with replacement. If we
    eliminate that manager's number so that it can't
    be selected again, we are sampling without
    replacement.
  • For a finite population we generally follow a
    sampling-without-replacement procedure by making
    sure that the same number is not selected more
    than once.
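
The numbering technique described above can be sketched in Python. This is only an illustration under stated assumptions: the managers are assumed to be numbered 1 through 2500, and the function names and the seed are made up for the example.

```python
import random

def sample_without_replacement(population_size, n, rng):
    """Draw n distinct ID numbers from 1..population_size."""
    return rng.sample(range(1, population_size + 1), n)

def sample_with_replacement(population_size, n, rng):
    """Draw n ID numbers from 1..population_size; repeats are allowed."""
    return rng.choices(range(1, population_size + 1), k=n)

rng = random.Random(221)  # arbitrary fixed seed so the sketch is repeatable
without = sample_without_replacement(2500, 30, rng)
with_repl = sample_with_replacement(2500, 30, rng)
print(sorted(without))    # 30 different manager numbers
print(sorted(with_repl))  # 30 manager numbers, duplicates possible
```

With N = 2500 and n = 30, a duplicate in the with-replacement draw is possible but uncommon, which is why the two procedures behave almost identically for small samples from large populations.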

10
Sampling with and without replacement from an
infinite population
  • When the population is infinite, or the population
    size (N) is so large and the sample size (n) is
    so (relatively) small that it can be treated as
    infinite, the probability that a manager's number
    will come up more than once is so small that we
    assume it won't happen, and we just proceed as if
    we are sampling with replacement.

11
What we're about to learn
  • We're about to learn that if you took a large
    number of separate samples from a population and
    plotted all the sample means, you would have an
    approximately normal distribution (provided the
    sample size is large enough), even if the
    population's distribution is NOT normal.
  • The question we seek the answer to is still
    "What's the probability that x is x?" but now x
    is the mean of a sample, not a single value.

12
Example
  • A new establishment is open for only three days
    before it goes out of business. The sales volume
    for each of those three days was 1, 2, and 5.
  • We want to statistically analyze sales volume.
    The population of interest consists of the values
    1, 2, and 5.

13
Calculate parameters
  • The population is size 3.
  • The mean µ of the population is (1 + 2 + 5)/3, or
    8/3, or 2.7.
  • The std deviation σ of the population is 1.7.

    x     (x - µ)   (x - µ)²
    1      -1.7       2.89
    2      -0.7       0.49
    5       2.3       5.29
    Sum               8.67

    σ = sqrt(8.67 / 3) = 1.7
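
The parameter calculation above can be checked with a few lines of Python. This is a minimal sketch; the function names are chosen for the example. Note that the population standard deviation divides by N, not N - 1.

```python
import math

population = [1, 2, 5]

def population_mean(values):
    return sum(values) / len(values)

def population_std(values):
    """Population standard deviation: divide by N, not N - 1."""
    mu = population_mean(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

mu = population_mean(population)
sigma = population_std(population)
print(round(mu, 1), round(sigma, 1))  # prints: 2.7 1.7
```
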
14
Now let us take all possible samples of size 2
  • We will list all possible samples of size 2 with
    replacement. Therefore, samples can consist of
    the same value twice (1-1, 2-2, 5-5).
  • Why are we using "with replacement"? Because in
    most cases (but not in this case) the sample size
    is going to be less than 5% of the population,
    and when that happens, we use the "with
    replacement" formulas instead of the "without
    replacement" formulas because the "with
    replacement" formulas are simpler.

15
How many permutations of 2 are possible from
three values?
  • 3 × 3, or 9
  • Here are the 9 possible samples:
  • 1-1, 1-2, 1-5
  • 2-1, 2-2, 2-5
  • 5-1, 5-2, 5-5

16
Here are all possible samples of size 2 along
with their sample statistics
17
Important Point
  • Notice that when we take the mean of the means,
    we get 2.7, which is also the population mean.
  • When the group of samples includes all possible
    samples, then the mean of the means will always
    target the population mean.
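
The claim about the mean of the means can be verified by brute force. The sketch below enumerates every ordered sample of size 2 drawn with replacement from the population 1, 2, 5; the helper name `all_sample_means` is an assumption for illustration.

```python
from itertools import product

population = [1, 2, 5]

def all_sample_means(values, n):
    """Means of every ordered sample of size n drawn with replacement."""
    return [sum(sample) / n for sample in product(values, repeat=n)]

means = all_sample_means(population, 2)
mean_of_means = sum(means) / len(means)
print(len(means), round(mean_of_means, 1))  # prints: 9 2.7
```
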

18
Important Point
Frequency Distribution of the Sample Means
Although this distribution doesn't look very
normal, that is because the sample size is only
2. As the sample size increases, the probability
distribution of the sample means will approach
a normal distribution.
19
Example 2 (The unknown is the population
proportion / percentage)
  • This time we consider a population that consists
    of all 100 U.S. Senators, 87 being male and 13
    being female.
  • Now let's say we don't know that 13% are female
    because the population is too big to survey
    everyone.
  • We are trying to determine the proportion of
    senators that are female. (In other words, the
    unknown is the proportion / percentage of the
    population that is female.)

20
1. We start out by obtaining several samples of
size 5
  • We decide to take 100 samples of size 5 and
    record the percentage of female senators in each
    sample.
  • We create a frequency distribution that shows the
    % of females in each sample of 5.
  • If we take the mean of all those samples'
    percentages, we should get 13%.

21
Results from just 100 samples of size 5
    Proportion of females (x)   Frequency (f) (# of samples)
    0                           26
    .1                          41
    .2                          24
    .3                           7
    .4                           1
    .5                           1
  • Mean = Σ(x · f) / Σf = .119 (not quite 13%)

22
Why didn't we get 13% for the mean percentage of
women?
  • Because we only took 100 samples of size 5, when
    there are 100^5 (10 billion) possible samples of
    size 5.
  • So the mean of sample means didn't exactly mirror
    the population mean.
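
The effect of taking only a limited number of samples can be explored with a small simulation. This is a sketch under assumptions, not the procedure used for the slide: it samples with replacement (matching the 100^5 count of possible samples), and the seed and the larger sample count of 100,000 are arbitrary choices.

```python
import random

population = ["M"] * 87 + ["F"] * 13  # the 100 senators from the example

def sample_proportions(num_samples, n, rng):
    """Proportion of women in each of num_samples samples of size n (with replacement)."""
    props = []
    for _ in range(num_samples):
        sample = rng.choices(population, k=n)
        props.append(sample.count("F") / n)
    return props

rng = random.Random(7)  # arbitrary seed, used only for repeatability
for num_samples in (100, 100_000):
    props = sample_proportions(num_samples, 5, rng)
    print(num_samples, round(sum(props) / len(props), 3))
```

With only 100 samples the mean of the sample proportions typically lands near, but not exactly on, .13; with many more samples it settles much closer to the population proportion.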

23
Here is the distribution of sample percentages
when the number of samples is 100
It resembles a normal curve but it's not exactly
normal.
24
If our frequency distribution had included all
possible samples of size 5
  • (1) the distribution of sample percentages would
    be (almost) normal and
  • (2) the mean of sample means would have been the
    same as the population mean (13%).

By the usual rule of thumb, it would be treated as
(completely) normal only if n ≥ 30.
25
If we had taken all 10 billion possible samples
the distribution would be almost normal with a
mean of .13
26
Another important point
  • We can see that when using a sample statistic to
    estimate a population parameter, some statistics
    are "good" in the sense that they target the
    population parameter and are therefore likely to
    yield good results. Such statistics are called
    unbiased estimators.
  • Statistics that target population parameters:
    mean, variance, proportion.
  • Statistics that do not target population
    parameters: median, range, standard deviation.
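
The difference between statistics that target a parameter and those that do not can be seen on the earlier three-value population (1, 2, 5). The sketch below is an illustration under assumptions: it averages each statistic over all 9 samples of size 2 drawn with replacement, and uses the n - 1 divisor for the sample variance and sample standard deviation.

```python
from itertools import product
import statistics

population = [1, 2, 5]
samples = list(product(population, repeat=2))  # all 9 samples, with replacement

def average(values):
    return sum(values) / len(values)

# Sample variance and sample standard deviation use the n - 1 divisor.
mean_of_variances = average([statistics.variance(s) for s in samples])
mean_of_stdevs = average([statistics.stdev(s) for s in samples])
mean_of_ranges = average([max(s) - min(s) for s in samples])

print("variance:", round(mean_of_variances, 3), "vs", round(statistics.pvariance(population), 3))
print("std dev: ", round(mean_of_stdevs, 3), "vs", round(statistics.pstdev(population), 3))
print("range:   ", round(mean_of_ranges, 3), "vs", max(population) - min(population))
```

The averaged sample variance matches the population variance, while the averaged sample standard deviation and range fall short of the population values.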

27
Practice exercise (p. 256 - 6)
  • Here are the numbers of sales per day that were
    made by Kim Ryan, a courteous telemarketer who
    worked four days before being fired: 1, 11, 9, 3.
    Assume that samples of size 2 are randomly
    selected with replacement from this population of
    four values.
  • A. List the 16 different possible samples and
    find the mean of each of them.
  • B. Identify the probability of each sample, then
    describe the sampling distribution of sample
    means (Hint: see Table 5-3).
  • C. Find the mean of the sampling distribution.
  • D. Is the mean of the sampling distribution (from
    part c) equal to the mean of the population of
    the four listed values? Are those means always
    equal?

28
Practice exercise (p. 256 - 6)
  • Open file dataSetsForProjectsCh5.xls
  • Go to worksheet telemarketing
  • Fill in the shaded cells with the appropriate
    values and answer the questions.

29
The Central Limit Theorem
30
The Central Limit Theorem conditions
  • 1. Let's say that a random variable x has a
    distribution (which may or may not be normal)
    with mean µ and standard deviation σ.
  • 2. Several samples, all of the same size n, are
    randomly selected from the population.
  • 3. All the sample means are plotted as a
    probability distribution.

31
The Central Limit Theorem assertions
  • 1. That probability distribution of sample means
    will, as the sample size increases, approach a
    normal distribution even if the population does
    not have a normal distribution!
  • 2. Further, the mean of the sample means (µ_x̄)
    will be the same as the population mean µ.
  • 3. The standard deviation of the sample means
    (σ_x̄) will be σ / √n.

32
One more point to add to the Central Limit theorem
  • Recall #1 again: the probability distribution of
    sample means will, as the sample size increases,
    approach a normal distribution even if the
    population does not have a normal distribution!
  • The key word is "approach." That is, if the sample
    size n ≥ 30, then the distribution of sample
    means can be treated as normal.
  • If the sample size n is less than 30, the
    distribution of the sample means will only
    approach normal.
  • BUT if the original population is already
    normally distributed, then the distribution of
    sample means will be normal for any sample size n
    (not just when n ≥ 30).

33
An example demonstrating the Central Limit
Theorem concept
  • If we take the last 4 digits of the social
    security numbers of every US citizen, we have a
    population of values that form a uniform
    distribution.
  • Recall that a uniform distribution means that
    every value from 0000 to 9999 is equally likely
    to occur.

34
A uniform distribution
35
Let's say we select a sample of 50 people
  • And we take the last 4 digits of each of their
    social security numbers and we lump them
    together as one big sample of 200 (4 × 50)
    digits.
  • Then we calculate the mean of those 200 numbers
    to be 4.5.
  • Then we calculate the std deviation of those 200
    values to be 2.8.

36
Then we create a frequency distribution based on
that one sample of 200 digits
Distribution of 200 digits from Social Security
Numbers (Last 4 digits from 50 students)
It's not normal, nor does it approximate the
uniform distribution of the population very
closely.
37
But now treat the sample data as 50 samples of 4
instead of 1 sample of 200
And calculate a mean for each sample of size
4, then create a frequency distribution of those
sample means
38
And we have an (almost) normal distribution
Distribution of 50 Sample Means
Even though the population does not have a normal
distribution, the distribution of the sample
means is (almost) normal. And the std deviation
is σ / √n.
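
The digit experiment can be imitated with simulated data. This is a sketch under assumptions: real Social Security digits are replaced by digits 0 through 9 drawn with equal probability, and the seed and function name are arbitrary.

```python
import random
import statistics

DIGITS = list(range(10))  # each digit 0-9 assumed equally likely

def sample_means(num_samples, n, rng):
    """Mean of each of num_samples samples of n digits drawn with replacement."""
    return [statistics.fmean(rng.choices(DIGITS, k=n)) for _ in range(num_samples)]

sigma = statistics.pstdev(DIGITS)  # population std deviation of the digits
rng = random.Random(50)            # arbitrary seed for repeatability
means = sample_means(50, 4, rng)

print("population mean:", statistics.fmean(DIGITS), "std dev:", round(sigma, 2))
print("mean of sample means:", round(statistics.fmean(means), 2))
print("std dev of sample means:", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):", round(sigma / 4 ** 0.5, 2))
```

With only 50 sample means the agreement with σ / √n is rough; increasing the number of samples brings the spread of the sample means closer to that value.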
39
Furthermore
  • Had we used samples of size 30 or more (instead
    of 4) we could remove the word "almost": by the
    usual rule of thumb, the sample means would have
    a frequency distribution that is treated as
    normal.
  • As the sample size increases, the frequency
    distribution of the sample means approaches
    normal.

40
Applying the Central Limit Theorem
  • In practice, we don't take several samples of
    size n; we take one sample of size n and we treat
    the mean of that sample like it's a single
    x-value, one of many possible sample means.
  • Then we calculate a z-value for that x so that we
    can derive a p-value (the probability of getting
    that particular x, the sample mean).
  • Notice that we are still operating under the
    assumption that we know the population mean µ and
    population std deviation σ.
  • We are still asking the question, "What's the
    probability of x being x?" but now x is one of
    several potential sample means.

41
Calculating a z-value when x is a sample mean
not a single value
X as one possible single outcome:  z = (x - µ) / σ
X as one possible sample mean:     z = (x̄ - µ_x̄) / σ_x̄
  • It's actually the same formula because µ_x̄ = µ
    (the mean of the population is the same as the
    mean of the sample means) but the standard
    deviations of the two distributions are
    different. The standard deviation for the sample
    means is lower: σ_x̄ = σ / √n.

42
Example
  • Given that the population of men has normally
    distributed weights with a mean of 172 lb and a
    standard deviation of 29 lb:
    a) if one man is randomly selected, find the
    probability that his weight is greater than
    167 lb.
    b) if 12 different men are randomly selected,
    find the probability that their mean weight is
    greater than 167 lb.

43
Calculating the z-values
X as one possible single outcome:
z = (x - µ) / σ = (167 - 172) / 29 = -0.17
X as one possible sample mean:
z = (x̄ - µ_x̄) / σ_x̄, but µ_x̄ = µ and σ_x̄ = σ / √n, so
z = (167 - 172) / (29 / √12) = -0.60
44
Using z to lookup p, we find that if one man is
randomly selected, the probability that his
weight is greater than 167 lb. is 0.5675.
45
Using z to lookup p, we find that if 12 different
men are randomly selected, the probability that
their mean weight is greater than 167 lb is
0.7257.
The frequency distribution of the sample means is
narrower (less variation) and taller than the
frequency distribution of the population but the
mean is the same.
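
Both probabilities can be reproduced without a z-table. The sketch below is one possible way to do it, using Python's `statistics.NormalDist` in place of the table lookup; the function name `prob_greater` is made up for the example. Because it keeps the unrounded z-value, the results differ slightly in the third decimal place from the table-based 0.5675 and 0.7257.

```python
from statistics import NormalDist
import math

def prob_greater(x, mu, sigma, n=1):
    """P(mean of n values > x) for a normal population; n=1 is a single value."""
    standard_error = sigma / math.sqrt(n)
    z = (x - mu) / standard_error
    return z, 1 - NormalDist().cdf(z)

z_one, p_one = prob_greater(167, mu=172, sigma=29)
z_twelve, p_twelve = prob_greater(167, mu=172, sigma=29, n=12)
print(round(z_one, 2), round(p_one, 4))        # one man
print(round(z_twelve, 2), round(p_twelve, 4))  # mean of 12 men
```
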

46
The sample mean distribution has less dispersion
than the distribution of individual values
  • Since the individual weights are more spread out
    than the sample average weights, only 56% of the
    area under the curve lies above 167 lb for
    individual weights, while 72% of the area lies
    above 167 lb for average weights.
  • For example, while an outlier can have a big
    effect on the distribution's variation for
    individual weights, when only plotting sample
    averages, an outlier will get averaged in with
    other values and will not be so outlying.
  • Sample means cluster together more than
    individual values, so it is more unusual for a
    group of 12's average value to deviate from the
    mean than it is for an individual value to
    deviate from the mean.

47
Practical Interpretation
  • There is a .5675 probability that an individual
    man will weigh more than 167 lb, and there is a
    .7257 probability that 12 men will have a mean
    weight of more than 167 lb.
  • Given that the gondola's maximum capacity is 2004
    lb (12 × 167 lb), it is likely (.72) to be
    overloaded if it is filled with 12
    randomly-selected men.
  • However, there is some hope that the gondola
    carriage won't come crashing down to earth
    because (1) skiers are generally leaner than the
    general public, (2) some of the 12 passengers are
    likely to be women, and (3) 2004 lb is a very
    conservative limit; in reality it can hold a lot
    more weight than that.

48
Applying the Central Limit Theorem when
hypothesis testing.
  • So far, the only question we have learned how to
    answer is this: "What is the probability that x
    is x when µ is µ and σ is σ?"
  • How useful is that? In each case, we already knew
    the population mean (and std. deviation), so why
    was it so helpful to know the probability of
    getting a certain x when µ was known already?
  • In a more realistic situation, we hypothesize
    about what µ is, and we use the value of x (a
    sample mean) to determine whether to accept or
    reject that hypothesis.

49
How we use all this x, z, and p stuff to do
hypothesis testing
  • A typical hypothesis: "I believe that the mean of
    this population is x."
  • Written like this: H0: µ = x
  • And the way that hypothesis is tested is by
    taking a sample. Now if the mean of that sample
    is a lot different from the mean you are claiming
    for the population (x), you reject the
    hypothesis.

50
"A lot different"?
  • The Rare Event Rule: If, under a given assumption
    (µ = x), the probability of getting a particular
    sample mean x̄ is really small (that is, x̄ and µ
    are really far apart, so the z-value is large in
    magnitude), we conclude that the assumption
    (µ = x) is probably false (and reject the
    hypothesis).

51
Example
  • Assume that the population of human body
    temperatures has a mean of 98.6°F, as is commonly
    believed. Also assume that the population std.
    deviation is .62°F. If a sample of size n = 106 is
    randomly selected, and its mean turns out to be
    98.2°F, should we still hang on to the belief
    that the population mean is 98.6?

52
Assumptions we can make because of the Central
Limit Theorem
  • We imagine that our sample of 106 is one of many
    possible samples of size 106, and if all those
    samples were taken and means obtained, we can
    assume that
  • the distribution of those sample means would be
    normal (since n ≥ 30),
  • their mean of means µ_x̄ would be the same as the
    population mean (µ = 98.6),
  • and their std deviation (σ_x̄) would be σ / √n
    (.62 / √106, or .0602).

53
Hypothesis H0: µ = 98.6
  • Based on these assumptions, we can calculate a z
    value using µ_x̄ = 98.6, σ_x̄ = .0602.
  • From z we derive p, and if p is less than 5%, we
    reject the hypothesis that the mean of the
    population is 98.6.

z = (98.2 - 98.6) / (.62 / √106) = -6.64

Since the p-value associated with a z of -6.64 is
off the charts, the probability is like
.00000000002 (clearly less than 5%), so we reject
the hypothesis that µ = 98.6.
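
The body-temperature test can be worked numerically instead of with a table. This is a minimal sketch, assuming a one-tailed (lower-tail) probability and the 5% cutoff used above; the function name is invented for the example, and `math.erfc` is used because it stays accurate for values this far into the tail.

```python
import math

def z_and_lower_tail(sample_mean, mu0, sigma, n):
    """z-value of a sample mean under H0 (mu = mu0) and P(Z <= z)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p = 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF, accurate in the far tail
    return z, p

z, p = z_and_lower_tail(98.2, mu0=98.6, sigma=0.62, n=106)
print(round(z, 2), p)
print("reject H0" if p < 0.05 else "do not reject H0")
```

The computed probability is on the order of 10^-11, in line with the "off the charts" value quoted above.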
54
Practice exercise (p. 268- 9)
  • The Rock-n-Roller Coaster at Disney-MGM Studio in
    Orlando has two seats in each row. When designing
    that roller coaster, the total width of the two
    seats in each row had to be determined. In the
    worst case scenario, both seats are occupied by
    men. Men have hip breadths that are normally
    distributed with a mean of 14.4 in. and a standard
    deviation of 1.0 in. Assume that two male riders
    are randomly selected.
  • A. Find the probability that their mean hip width
    is greater than 16.0 in.
  • B. If each row of two seats is designed to fit
    two men only if they have a breadth of 16.0
    inches or less, would too many riders be unable
    to fit? Does this design appear to be acceptable?

55
This question is "what's the probability of
getting a sample with a mean hip width of x?"
NOT "what's the probability of getting an
individual with hip width x?"
56
Can we apply the central limit theorem
assumptions?
  • The sample size is 2. Since the sample size (n) is
    less than 30, we cannot conclude that the
    distribution of sample means will be a normal
    distribution.
  • If we can't conclude that, then the central limit
    theorem assumptions (µ_x̄ = µ and σ_x̄ = σ / √n)
    can't be made.
  • But since they told us that the underlying
    population has a normal distribution, we know
    that the sample means distribution is also
    normal, regardless of what n is.

57
Draw picture: What is P(x̄ > 16) when µ_x̄ = 14.4?
The red curve is the distribution of the
sample means. The blue curve is the distribution
of the population.
[Figure labels: σ_x̄ = 1.0/√2; σ = 1.0; x̄ = 16; µ = µ_x̄ = 14.4; What is p?]
58
2. Calculate z
z = (x̄ - µ_x̄) / σ_x̄, but µ_x̄ = µ and σ_x̄ = σ / √n, so
z = (16 - 14.4) / (1.0 / √2) = 1.6 / .707 = 2.26
59
Lookup p
60
P(x̄ > 16) = 1 - .9881 = .0119
[Figure labels: p = .0119; σ_x̄ = 1.0/√2; σ = 1.0; x̄ = 16; µ = µ_x̄ = 14.4]
61
Is this acceptable?
  • If each row of two seats is designed to fit two
    men only if they have a breadth of 16.0 inches or
    less, would too many riders be unable to fit?
    Does this design appear to be acceptable?
  • About 1% of samples of 2 men won't fit into
    the seats. Yes, that appears to be acceptable
    since really fat people would probably pass out
    waiting in line to get on the ride anyway.

62
Practice exercise (p. 269 - 16)
  • Scores of men on the verbal portion of the SAT-I
    test are normally distributed with a mean of 509
    and a standard deviation of 112.
    Randomly-selected men are given the Columbia
    Review Course before taking the SAT test. Assume
    that the course did not help improve their scores
    (the null hypothesis).
  • A. If 1 of the men is randomly selected, find
    the probability that his score is at least 590.
  • B. If 16 men are randomly selected, find the
    probability that their mean score is at least
    590.
  • C. In finding the probability for part (b), why
    CAN the central limit theorem be used even though
    the sample size is below 30?
  • D. If the random sample of 16 men does result in
    a mean score of 590, is there strong evidence to
    support the claim that the course is actually
    effective? Why or why not?

63
Find P(x ≥ 590)
  • A. If 1 of the men is randomly selected, find the
    probability that his score is at least 590.

64
1. Draw picture: What is P(x ≥ 590)?
[Figure labels: σ = 112; µ = 509; x = 590; What is p?]
65
2. Calculate z
z = (x - µ) / σ = (590 - 509) / 112 = .723
66
3. Use z to lookup p
67
4. Revisit picture and write in p: P(x ≥ 590) = 1 -
.7642 = .2358
[Figure labels: p = .2358; σ = 112; µ = 509; x = 590]
68
Find P(x̄ ≥ 590)
  • B. If 16 men are randomly selected, find the
    probability that their mean score is at least
    590.
  • This question is "what's the probability of
    getting this particular sample average?" NOT
    "what's the probability of getting this
    particular individual value?"

69
1. Draw picture: What is P(x̄ ≥ 590)?
[Figure labels: σ_x̄ = 112/√16; σ = 112; x̄ = 590; µ = µ_x̄ = 509; What is p?]
70
2. Calculate z value
z = (x̄ - µ_x̄) / σ_x̄ = (590 - 509) / (112 / √16) = 81 / 28 = 2.89
71
3. Use z to look up p
Table row 2.8, column .09: .9981
72
4. Revisit picture and write in p: P(x̄ ≥ 590) = 1
- .9981 = .0019
[Figure labels: p = .0019; σ_x̄ = 112/√16; σ = 112; x̄ = 590; µ = µ_x̄ = 509]
73
C. In finding the probability for part (b), why
CAN the central limit theorem be used even though
the sample size is below 30?
  • Because the underlying population (all SAT
    scores) is normal.

74
D. If the random sample of 16 men does result in
a mean score of 590, is there strong evidence to
support the claim that the course is actually
effective? Why or why not?
  • Yes, because it is very unusual (p = .0019) for 16 men to obtain an average score that high.

75
Let's summarize
  • The first part of the chapter was about "What's
    the probability of getting an outcome (x) that is
    less than or greater than some value?" and "Is it
    normal to get an outcome like this?"
  • The second part of this chapter applied these
    questions to situations where the outcome variable
    was not a single individual outcome but an
    average of a sample of outcomes. In other words,
    "What's the probability of getting this
    particular sample average?" and "Is it normal to
    get such a sample average if the population mean
    is really what we think it is?"

76
Homework 12
  • 4 on page 263
  • Random sampling from finite population
  • 10 on page 264
  • Finite or infinite populations?
  • 16 on page 267
  • Develop a point estimate
  • 26 on page 278
  • What's the probability that sample statistic is
    close to population parameter?
  • 38 on page 284
  • What's the probability that sample statistic is
    close to population parameter?