Title: Sampling Distributions
1. Sampling Distributions
2. Inching Towards Inference
- Recall that one of our main goals is to make inferences about the unknown parameters of the population or the distribution, such as the mean $\mu$, the standard deviation $\sigma$, or some other summary measure such as the median.
- We now have possible models for the population, which are provided by the probability distributions (Binomial, Poisson, Normal, Uniform, and others).
- We also know how to compute sample statistics such as the sample mean and the sample standard deviation. These sample statistics are used for making inferences about the parameters.
3. Sampling as a Random Experiment
- To understand the notion of a sampling distribution of a sample statistic, it is important to realize that the process of taking a sample from a population can be viewed as a random experiment.
- To illustrate this idea, consider a population taking the 3 values 2, 4, and 5 according to the following probability distribution.
- Probability function: $p(2) = .4$, $p(4) = .5$, $p(5) = .1$
- You may imagine that 40% of all the values in the population equal 2, 50% equal 4, and 10% equal 5.
4. The Population
[Figure: the population pictured as a collection of 2s, 4s, and 5s]
5. Characteristics of the Population
- For this population, we have the parameters
- $\mu = (2)(.4) + (4)(.5) + (5)(.1) = .8 + 2 + .5 = 3.3$
- $\sigma^2 = (2 - 3.3)^2(.4) + (4 - 3.3)^2(.5) + (5 - 3.3)^2(.1) = 1.21$
- $\sigma = (1.21)^{1/2} = 1.1$
- Its shape is given by the bar graph below
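The arithmetic above can be checked with a few lines of Python. This is only a sketch: the dictionary name `population` and the text bar graph are illustrative choices, not part of the original slides.

```python
# Population from the slides: value -> probability.
population = {2: 0.4, 4: 0.5, 5: 0.1}

# Mean: sum of value * probability.
mu = sum(x * p for x, p in population.items())
# Variance: probability-weighted squared deviations from the mean.
var = sum((x - mu) ** 2 * p for x, p in population.items())
sd = var ** 0.5

print(f"mean = {mu:.2f}, variance = {var:.2f}, sd = {sd:.2f}")

# Crude text stand-in for the bar graph: one '#' per 5% of probability.
for x, p in population.items():
    print(x, "#" * round(p * 20))
```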
6. Possible Outcomes of Sampling Process
- Now, consider the sampling process of taking $n = 2$ observations (with replacement) from this population or distribution. Below is a table of possibilities.
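The table of possibilities can be regenerated by listing every ordered sample of size 2. The sketch below assumes the population $p(2) = .4$, $p(4) = .5$, $p(5) = .1$ from the earlier slides; the variable names are illustrative, and the row order (2,2), (2,4), (2,5), ... is an assumption about how the original table was laid out.

```python
from itertools import product
from statistics import mean, variance

population = {2: 0.4, 4: 0.5, 5: 0.1}

rows = []
for sample in product(population, repeat=2):
    # Sampling with replacement: the draws are independent, so multiply.
    prob = population[sample[0]] * population[sample[1]]
    # statistics.variance is the sample variance (divides by n - 1).
    rows.append((sample, prob, mean(sample), variance(sample)))

print("sample   prob  mean  variance")
for sample, prob, xbar, s2 in rows:
    print(f"{sample}  {prob:.2f}  {xbar:4.1f}  {s2:8.1f}")
```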
7. Some Points about the Preceding Table
- Since we are sampling with replacement, to obtain the probability of each possible sample, we simply multiply the probabilities of each of the observations (think of a tree diagram!).
- The 9 possible samples represent the elementary events of the experiment of taking a sample of size 2 from the population or distribution.
- The sample mean ($\bar{X}$) is obtained the usual way.
- The sample variance is computed the usual way. For example, for the second sample, we have
- $S^2 = [(2-3)^2 + (4-3)^2]/(2-1) = (1 + 1)/1 = 2$
8. Sample Statistics as Random Variables
- Since the sample mean and the sample variance are numerical characteristics of each of the possible samples, they can be viewed as random variables in this sampling experiment.
- Therefore, we could obtain the probability distributions of the sample mean and sample variance.
- These probability distributions are called sampling distributions.
- Thus we will have the sampling distribution of the sample mean, as well as that of the sample variance.
9. Sampling Distribution of the Sample Mean
- From the earlier table, we could construct the probability distribution of the sample mean, now called the sampling distribution of the sample mean.
- This is given by the following table.
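The sampling distribution can be rebuilt by adding up the probabilities of all samples that share the same mean. The following is a minimal sketch under the same assumed population ($p(2) = .4$, $p(4) = .5$, $p(5) = .1$, $n = 2$); the name `dist` is illustrative.

```python
from collections import defaultdict
from itertools import product

population = {2: 0.4, 4: 0.5, 5: 0.1}

# Map each possible sample mean to its total probability.
dist = defaultdict(float)
for a, b in product(population, repeat=2):
    dist[(a + b) / 2] += population[a] * population[b]

for xbar in sorted(dist):
    print(f"P(mean = {xbar}) = {dist[xbar]:.2f}")
```

Different samples, such as (2, 4) and (4, 2), give the same mean, which is why the nine samples collapse to six distinct values.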
10. Graph of the Sampling Distribution of the Sample Mean
- Note that it has become more concentrated near the population mean of 3.3, compared to the original distribution.
11. Parameters of the Sampling Distribution
- Because the sampling distribution is just like any other probability distribution, we are also able to obtain its mean, variance, and standard deviation.
- Thus, for the sampling distribution of the sample mean, we find the mean to be 3.3, which coincides with the original population mean.
- The variance of the sampling distribution of the sample mean turns out to be .605, which equals $(1.21)/2$, the population variance divided by the sample size.
- The standard deviation of the sample mean, now called the standard error (SE), is $(.605)^{1/2} = .7778$.
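These three numbers can be reproduced directly from the enumeration of samples. The sketch below assumes the same population and $n = 2$; `outcomes`, `mean_xbar`, `var_xbar`, and `se` are illustrative names.

```python
from itertools import product

population = {2: 0.4, 4: 0.5, 5: 0.1}
n = 2

# (sample mean, probability) for every ordered sample of size n.
outcomes = [((a + b) / n, population[a] * population[b])
            for a, b in product(population, repeat=n)]

mean_xbar = sum(x * p for x, p in outcomes)
var_xbar = sum((x - mean_xbar) ** 2 * p for x, p in outcomes)
se = var_xbar ** 0.5

print(f"mean = {mean_xbar:.4f}, variance = {var_xbar:.4f}, SE = {se:.4f}")
```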
12. Recapitulation
- Sampling from a probability distribution or population could be viewed as a random experiment, and the elementary outcomes are the possible samples.
- Sample statistics, such as the sample mean, could be viewed as random variables, and as such have their associated probability distributions, which are called sampling distributions.
- The sampling distribution also has a mean.
- And it also has a variance.
- The standard deviation of the sampling distribution is called the standard error (SE).
13. Sampling Distribution of the Sample Mean
- The mean of the sampling distribution of the sample mean equals the population mean.
- The variance of the sampling distribution of the sample mean equals the population variance divided by the sample size.
- These two characteristics are always true for the sampling distribution of the sample mean when sampling with replacement.
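One way to see these two properties at work is to check them exactly for several small sample sizes. The sketch below is an illustration, not a proof: it uses the example population with exact fractions, and the helper `mean_and_variance_of_xbar` is a hypothetical name introduced here.

```python
from fractions import Fraction
from itertools import product
from math import prod

# Example population with exact probabilities 4/10, 5/10, 1/10.
population = {2: Fraction(4, 10), 4: Fraction(5, 10), 5: Fraction(1, 10)}
mu = sum(x * p for x, p in population.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in population.items())

def mean_and_variance_of_xbar(n):
    """Exact mean and variance of the sample mean over all samples of size n."""
    outcomes = [(Fraction(sum(s), n), prod(population[x] for x in s))
                for s in product(population, repeat=n)]
    m = sum(x * p for x, p in outcomes)
    v = sum((x - m) ** 2 * p for x, p in outcomes)
    return m, v

for n in range(1, 5):
    m, v = mean_and_variance_of_xbar(n)
    print(n, m == mu, v == sigma2 / n)
```

Exact fractions are used so that the equality checks are not affected by floating-point rounding.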
14. Obtaining Sampling Distributions
- In the example considered, we obtained the sampling distribution of the sample mean by enumerating all the possible samples that could arise.
- However, such a method is not feasible if the sample size is large. For instance, if $n = 10$, then there will be a total of $(3)(3)\cdots(3) = 3^{10} = 59049$ possible samples, and complete enumeration is no longer practical.
- How do we obtain sampling distributions?
15. Some Methods for Obtaining Sampling Distributions of Statistics
- Complete enumeration, if possible.
- Computer simulation, or the Monte Carlo method. In this method the computer generates many, many samples, and then constructs the probability histogram of the values of the statistic of interest. This will provide an empirical approximation.
- Using theoretical results such as, for instance, when sampling from a Bernoulli population the number of successes is binomially distributed.
- Using theoretical approximations such as the Central Limit Theorem or the de Moivre approximation.
16. Illustrating the Monte Carlo Method
- We illustrate the use of the simulation or Monte Carlo method by approximating the sampling distribution of the sample mean based on $n = 10$ observations from the population considered earlier, which has
- $p(2) = .4$, $p(4) = .5$, $p(5) = .1$
- We generate 500 samples of size $n = 10$ from this population, and for each sample we compute the sample mean.
- This simulation was done using Minitab.
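The same kind of simulation can be sketched in Python. This is a stand-in for the Minitab run, not a reproduction of it: the seed is arbitrary, so the generated samples and summary values will differ from those reported in these slides.

```python
import random
from statistics import mean, stdev

random.seed(0)  # arbitrary seed, used only so the run is repeatable

values, probs = [2, 4, 5], [0.4, 0.5, 0.1]
num_samples, n = 500, 10

# Draw 500 samples of size 10 with replacement and record each sample mean.
sample_means = [mean(random.choices(values, weights=probs, k=n))
                for _ in range(num_samples)]

print(f"mean of sample means = {mean(sample_means):.4f}")
print(f"sd of sample means   = {stdev(sample_means):.4f}")
print(f"sigma / sqrt(n)      = {1.1 / n ** 0.5:.4f}")
```

A relative frequency histogram of `sample_means` would then approximate the sampling distribution of the sample mean for $n = 10$.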
17. First 10 of the 500 Generated Samples
- The table below shows the first 10 samples of size $n = 10$ that were generated from the population.
- Also included are their corresponding sample means.
18. Relative Frequency Histogram of the 500 Sample Means
19. Points to Ponder
- This relative frequency histogram of the simulated sample means serves as an approximation to the sampling distribution of the sample mean when $n = 10$ and when sampling from the given population.
- Notice that the values of the sample means are now clustered around the population mean of 3.3, and furthermore, the shape of the histogram is almost bell-shaped.
- This histogram also shows that the chance of getting a sample of size $n = 10$ whose sample mean is less than 2.5 or greater than 4.5 is rather small.
20.
- When the mean of the 500 sample means is computed, it turns out to be 3.3094. Their median is exactly 3.30!
- Recall that the population mean is 3.30.
- The standard deviation of the 500 sample means turns out to be 0.3497.
- Recall that the population standard deviation is $(1.21)^{1/2} = 1.1$, so $\sigma/n^{1/2} = 1.1/10^{1/2} \approx 0.3479$.
21.
- We therefore note that the mean of the simulated sample means is very close to the population mean, and
- the standard deviation of the simulated sample means is also very close to the population standard deviation divided by the square root of the sample size.
- Indeed, we always have the theoretical results: the mean of $\bar{X}$ equals $\mu$, and the standard error of $\bar{X}$ equals $\sigma/n^{1/2}$.
22. An Important Result About the Sampling Distribution of the Sample Mean
- When the population being sampled is a normal population with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of the sample mean is also normal with mean $\mu$ and standard error of $\sigma/n^{1/2}$, for any sample size $n$.
- When the population is not normal, however, then the sampling distribution of the sample mean need not be normal. But we have the following result.
23. Central Limit Theorem
- If a random sample of size $n$ is taken from a population or distribution with mean $\mu$ and standard deviation $\sigma$, and if the sample size is large ($n > 30$), then the sampling distribution of the sample mean is approximately normal with mean $\mu$ and standard deviation (or standard error) of $\sigma/n^{1/2}$. That is, $Z = (\bar{X} - \mu)/(\sigma/n^{1/2})$ is approximately $N(0,1)$.
24. Uses of the Central Limit Theorem
- Because of this approximation, when computing probabilities associated with the sample mean, we can use the approximation given below, which uses the standard normal distribution.
- $P(\bar{X} \le x) \approx P(Z \le (x - \mu)/(\sigma/n^{1/2}))$
- Note: $Z \sim N(0,1)$, the standard normal variable.
25. Applications of the CLT
- Situation 1: Suppose we take a sample of size $n = 30$ from the population described by the probability function $p(2) = 0.4$, $p(4) = 0.5$, $p(5) = 0.1$. This is the population we were using earlier.
- Question 1: We seek the approximate probability that the sample mean is between 3.1 and 3.5.
- Question 2: Find the approximate probability that the sample mean is less than 2.6.
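Both questions can be worked with the normal approximation. The sketch below assumes the earlier population, whose parameters were computed as $\mu = 3.3$ and $\sigma = 1.1$, and uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, n = 3.3, 1.1, 30

# CLT approximation: the sample mean is roughly normal with SE = sigma / sqrt(n).
xbar = NormalDist(mu, sigma / n ** 0.5)

p_between = xbar.cdf(3.5) - xbar.cdf(3.1)  # Question 1
p_below = xbar.cdf(2.6)                    # Question 2

print(f"P(3.1 < mean < 3.5) is about {p_between:.4f}")
print(f"P(mean < 2.6) is about {p_below:.5f}")
```

Equivalently, each bound can be converted to a $Z$ value with $(x - \mu)/(\sigma/n^{1/2})$ and looked up in a standard normal table.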
26. Applications continued
- Situation 2: The systolic blood pressure population data set has mean $\mu = 114.58$ and standard deviation $\sigma = 14.06$. Its distribution is not normal as it is right-skewed. Suppose we take a random sample of $n = 50$ people, and obtain the sample mean of their systolic blood pressures.
- Question 1: What is the approximate probability that this sample mean will exceed 120?
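A sketch of the calculation for this question, again relying on the normal approximation from the Central Limit Theorem; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, n = 114.58, 14.06, 50

se = sigma / n ** 0.5                       # standard error of the sample mean
p_exceed = 1 - NormalDist(mu, se).cdf(120)  # upper-tail probability

print(f"SE = {se:.4f}, P(mean > 120) is about {p_exceed:.4f}")
```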
27. Continued ...
- Question 2: What would be the value of $A$ such that the probability that the sample mean of the systolic blood pressures of a sample of size 50 is greater than $A$ is 0.95?
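This question runs the approximation in reverse: a probability is given and a cutoff is wanted. A minimal sketch, assuming the same $\mu = 114.58$, $\sigma = 14.06$, and $n = 50$, and using `NormalDist.inv_cdf` from the standard library:

```python
from statistics import NormalDist

mu, sigma, n = 114.58, 14.06, 50
xbar = NormalDist(mu, sigma / n ** 0.5)

# P(mean > A) = 0.95 means P(mean <= A) = 0.05, so A is the 5th percentile.
A = xbar.inv_cdf(0.05)

print(f"A is about {A:.2f}")
```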
28. Sampling a Bernoulli Population
- A Bernoulli population is one where there are only two possible values or outcomes, called a Success, denoted by the value of $X = 1$, and a Failure, denoted by a value of $X = 0$. The probability of a Success is denoted by $p$.
- For such a population we have
- Mean: $\mu = p$
- Variance: $\sigma^2 = p(1-p)$.
- Consider now taking a sample of size $n$ from this population and letting $\hat{p}$ equal the proportion of successes in the sample. That is, $\hat{p} = (\text{number of successes in the sample})/n$.
29. Sample Proportion
- Because the Bernoulli observations are either 0 or 1 (with 1 representing success), the sample proportion could be defined via $\hat{p} = (X_1 + X_2 + \cdots + X_n)/n = \bar{X}$.
30. Sampling Distribution of the Sample Proportion
- Since the sample proportion is the sample mean of the observations from a Bernoulli population, by the Central Limit Theorem, it follows that the sampling distribution of the sample proportion, when the sample size is large (that is, $n > 30$), is approximately normal with mean of $p$ and SE of $[p(1-p)/n]^{1/2}$.
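A small simulation can illustrate this result. The values $p = 0.3$, $n = 50$, and 2000 repetitions below are hypothetical choices made only for this sketch, as is the arbitrary seed; none of them come from the slides.

```python
import random
from statistics import mean, stdev

random.seed(1)             # arbitrary seed, used only for repeatability
p, n, reps = 0.3, 50, 2000  # hypothetical values chosen for illustration

def sample_proportion():
    """Draw n Bernoulli(p) observations and return the proportion of 1s."""
    draws = [1 if random.random() < p else 0 for _ in range(n)]
    return sum(draws) / n

p_hats = [sample_proportion() for _ in range(reps)]
theoretical_se = (p * (1 - p) / n) ** 0.5

print(f"mean of p-hat = {mean(p_hats):.4f} (p = {p})")
print(f"sd of p-hat   = {stdev(p_hats):.4f} (theoretical SE = {theoretical_se:.4f})")
```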
31. An Application
- Situation: One of the ways most Americans relieve stress is to reward themselves with sweets. According to one study, 46% admit to overeating sweet foods when stressed. Suppose that the 46% figure is correct and we take a random sample of $n = 100$ Americans and ask them if they overeat sweets when they are stressed out.
- Question 1: What is the probability that the proportion who overeat sweets in this sample exceeds 0.50?
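A sketch of the calculation, using the normal approximation for the sample proportion with $p = 0.46$ and $n = 100$; the variable names are illustrative.

```python
from statistics import NormalDist

p, n = 0.46, 100

se = (p * (1 - p) / n) ** 0.5               # standard error of the sample proportion
p_exceed = 1 - NormalDist(p, se).cdf(0.50)  # upper-tail probability

print(f"SE = {se:.4f}, P(p-hat > 0.50) is about {p_exceed:.4f}")
```

The result is approximate, since the exact distribution of the number of successes is binomial.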