Sampling Experiment presentation

Title: Sampling Experiment

1
Sampling Experiment

Please randomly select 10 slips of paper and pass
the bucket to the next person
Each slip represents a household in a small city;
the household income is written on the slip
Calculate the mean income for your sample and
save your result
If you have a computer (or a calculator with this
function), also calculate the sample standard
deviation
Return the slips to the bucket
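
For anyone doing the calculation on a computer, the following is a minimal sketch of the exercise. The bucket contents are made-up values (an assumption, not the slips used in class), and `draw_and_summarize` is a hypothetical helper name.

```python
import random
import statistics

# Hypothetical bucket of household incomes (made-up values, not the class slips).
rng = random.Random(1)
bucket = [round(rng.lognormvariate(10.5, 0.4), -2) for _ in range(500)]

def draw_and_summarize(population, n, rng):
    """Draw n slips without replacement; return (sample mean, sample standard deviation)."""
    sample = rng.sample(population, n)
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    return statistics.mean(sample), statistics.stdev(sample)

mean_income, sd_income = draw_and_summarize(bucket, 10, rng)
print(f"sample mean = {mean_income:,.0f}, sample SD = {sd_income:,.0f}")
```

Each run with a different seed plays the role of a different person drawing from the bucket.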

2
Sampling Terminology

Population: the set of all members, events, or
measurements that we wish to characterize
Random sample: a sample chosen from the population
by means of a random mechanism
the best way to ensure an unbiased,
representative sample
the only way to allow accuracy of inferences to
be quantified
Judgment/convenience samples are hopelessly
biased, making statistical analysis worthless

3
Predicting the 1936 Election

In 1936, Literary Digest mailed questionnaires to
10 million people, asking who they would vote for
in the upcoming presidential election. The list
was compiled from magazine subscribers, car
owners and telephone directories. Based on the
2.3 million responses, they predicted a victory
for Republican Landon over Roosevelt by a 60 to
40 margin.
Roosevelt won with 61% of the vote, to 36% for
Landon.
George Gallup correctly predicted the
election (and the results of the Literary Digest
poll!) to within 1 percent, using random samples.

4
Sources of Bias

Selection bias: some members of population more
likely to be selected for sample than others
Non-response bias: members of sample who do not
respond to survey would have answered differently
from those who do respond
Evasive or untruthful response: respondents give
socially acceptable answer
Recall or reporting bias: some respondents are
more likely to report an event than others
Measurement error: poorly worded questions,
imprecise answers

5
Non-response: rate of HIV infection

It is important for policy and predictive
purposes to know the proportion of the population
that is infected with HIV.
In a survey conducted by the National Center for
Health Statistics, the screening response rate
was 95 percent; of those, 85 percent gave a blood
sample. Of those giving a sample, about 0.5
percent were infected with HIV.
What is the best estimate of the rate of HIV
infection in the population?

6
Non-response: rate of HIV infection

We don't know the rate of infection among those
refusing to give a blood sample.
What should we do? Ignore this group? Assume the
same rate of infection as those agreeing to give
a blood sample?
Do you expect the rate of infection in this group
to be higher or lower than for those agreeing to
give a sample?
How could we estimate the rate of infection in
the second group?
Would stratifying the population help?
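
One way to think about these questions is a sensitivity calculation. The sketch below uses only the three figures from the survey slide; the infection rates tried for the untested group are hypothetical multiples chosen for illustration, and the sketch lumps screening non-respondents together with those who declined the blood sample (an assumption).

```python
screening_rate = 0.95   # responded to screening (from the slide)
blood_rate = 0.85       # of those, gave a blood sample (from the slide)
rate_tested = 0.005     # infection rate among those tested (from the slide)

# Share of the original sample that was actually tested: 0.95 * 0.85 = 0.8075
tested_share = screening_rate * blood_rate

def overall_rate(rate_untested):
    """Estimated population rate if the untested group has the given (assumed) rate."""
    return tested_share * rate_tested + (1 - tested_share) * rate_untested

# Hypothetical multiples of the tested-group rate, not survey findings.
for multiple in (1, 2, 5, 10):
    estimate = overall_rate(multiple * rate_tested)
    print(f"untested rate = {multiple} x tested rate -> overall {100 * estimate:.2f} percent")
```

Because almost one fifth of the sample was never tested, the overall estimate moves noticeably as the assumed rate in that group changes.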

7
Random Sampling Techniques

Simple Random Sample
Systematic sampling
Stratified sampling
Cluster sampling
Capture-recapture sampling

8
Simple Random Sample (SRS)

Identify every member of the population, N
Use a random mechanism to select a sample of size
n, such that every member of the population has
the same chance of being selected
assign random number to each member of population
sort random numbers, select members with the n
smallest random numbers
The statistical techniques used in this course
apply only to an SRS
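
The random-number procedure above can be sketched in a few lines. This is an illustration only; `simple_random_sample` is a hypothetical helper name, and the population is just the integers 1 to 2000.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Assign a random number to every member, sort, keep the n smallest."""
    rng = random.Random(seed)
    keyed = [(rng.random(), member) for member in population]
    keyed.sort(key=lambda pair: pair[0])          # sort by the random number only
    return [member for _, member in keyed[:n]]

members = list(range(1, 2001))                    # N = 2000 labelled members
sample = simple_random_sample(members, 20, seed=42)
print(sorted(sample))
```

In practice `random.sample(members, 20)` gives an equivalent result in one call; the longer version is shown because it mirrors the steps listed on the slide.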

9
Simple Random Sample: n = 20, N = 2000
10
Systematic Sampling

Select every mth member of the population
Select sampling interval: divide population size
(N) by sample size (n), m = N/n
Use a random mechanism to choose a number k
between 1 and m
Choose members k, (m + k), (2m + k), ...
Example: N = 2,000, n = 20, m = 100; select
members 45, 145, 245, ..., 1945 for sample
Better (more representative) than SRS if no
natural trends or strata
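
A minimal sketch of the systematic procedure, assuming N is an exact multiple of n; `systematic_sample` is a hypothetical helper name.

```python
import random

def systematic_sample(N, n, k=None, seed=None):
    """Return member numbers k, m + k, 2m + k, ... with interval m = N / n."""
    if N % n != 0:
        raise ValueError("this sketch assumes N is an exact multiple of n")
    m = N // n
    if k is None:
        k = random.Random(seed).randint(1, m)     # random start between 1 and m
    if not 1 <= k <= m:
        raise ValueError("k must be between 1 and m")
    return [k + i * m for i in range(n)]

chosen = systematic_sample(2000, 20, k=45)
print(chosen[:3], "...", chosen[-1])              # [45, 145, 245] ... 1945
```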

11
Systematic sample: n = 20, N = 2000, k = 45
12
Stratified Sampling

Suppose we can identify various subpopulations or
strata within the population.
Select a simple random sample from each stratum
instead of from the entire population. This is
called stratified sampling.
Size of the sample from each stratum can be equal, or
proportional to the size of the stratum.
Better than SRS when there is considerable
variation between the various strata but
relatively little variation within a given
stratum.
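
A minimal sketch of stratified sampling with proportional allocation. The four strata and their sizes are made up for illustration, and `stratified_sample` is a hypothetical helper name.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample from each stratum, sized in proportion to the stratum.

    strata maps a stratum name to the list of its members.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounded share of n; with awkward sizes the rounded parts may not sum to n.
        n_h = round(n * len(members) / total)
        sample[name] = rng.sample(members, n_h)
    return sample

# Hypothetical population of 2000 split into 4 strata.
sizes = {"A": 800, "B": 600, "C": 400, "D": 200}
strata = {name: [f"{name}{i}" for i in range(size)] for name, size in sizes.items()}
picked = stratified_sample(strata, 20, seed=7)
print({name: len(members) for name, members in picked.items()})
```

With these sizes the allocation is 8, 6, 4, and 2, for a total of 20.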

13
Stratified sample of 20 from 4 strata
14
Cluster Sampling

To sample households, divide city into blocks,
choose a simple (or stratified) random sample of
blocks, then sample all the households in the
chosen blocks.
In this case the city blocks are called clusters
and the sampling is called cluster sampling.
The advantage of cluster sampling is convenience
and lower cost.
Real applications are often more complex and use
multistage sampling schemes.
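
A minimal sketch of single-stage cluster sampling, assuming a made-up city of 500 blocks with 4 households each; `cluster_sample` is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Choose whole clusters at random, then take every household in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # simple random sample of blocks
    households = [h for block in chosen for h in clusters[block]]
    return chosen, households

# Hypothetical city: 500 blocks, 4 households per block.
city = {block: [f"block{block}-house{h}" for h in range(4)] for block in range(500)}
chosen_blocks, households = cluster_sample(city, 5, seed=0)
print(len(chosen_blocks), "blocks,", len(households), "households")
```

Only 5 blocks need to be visited to reach 20 households, which is where the convenience and cost saving come from.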

15
Cluster Sample of 20 (cluster size 4)
16
Multi-Stage Sampling Schemes: CPS

Current Population Survey: 60,000
households/month
1. 3,141 U.S. cities and counties are grouped
into 2,007 Primary Sampling Units (PSUs: groups of
counties)
2. The PSUs are grouped into 754 strata (428 with
1 PSU)
3. One PSU is randomly selected from each
stratum; probability of selection is proportional
to population
4. PSUs are divided into Census Enumeration Units
(CEU ≈ 300 households); 5 CEUs are randomly
selected from each PSU
Each CEU is divided into Ultimate Survey Units
(USU ≈ 4 households); 6 USUs are randomly
selected from each CEU, interviewed during week
of 12th day
Each month, one quarter of the USUs are replaced

17
A Simple Random Sample in Excel

Data Analysis ToolPak
Tools/Data Analysis/Random Number Generation
Tools/Data Analysis/Sampling
RAND Function
Insert new column in data set
Enter =RAND() in each new cell
Copy random numbers, Paste Special/Values
Sort observations by random number, select first
n observations

18
Capture-Recapture Sampling

Collect and tag a sample; wait; collect a
second sample and determine the tagged fraction
Used to estimate the size of difficult-to-count
populations
trout in a lake, insects in a field
homeless, illegal immigrants, drug users

19
Number of Trout in a Lake

Suppose I want to estimate the number of trout in
a remote mountain lake, as part of a program to
monitor the effects of acid deposition
Catch 50 trout; tag and release each trout.
One month later, return to the same lake and
catch 60 trout. Of these, 6 are tagged.
How many trout are in the lake?
What assumptions did you make?

20
Number of Trout in a Lake

The second sample reveals that 10% of the trout
in the lake (6 of 60) are tagged
There are 50 tagged trout in the lake; if 10% are
tagged, there must be a total of 500 trout
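
The same reasoning as a short calculation. This ratio estimate is often called the Lincoln-Petersen estimator; `estimate_population` is a hypothetical helper name.

```python
def estimate_population(tagged_first, caught_second, tagged_in_second):
    """Capture-recapture estimate: total = tagged * (second catch / tagged in second catch)."""
    if tagged_in_second == 0:
        raise ValueError("no tagged animals recaptured; the estimate is undefined")
    return tagged_first * caught_second / tagged_in_second

trout = estimate_population(tagged_first=50, caught_second=60, tagged_in_second=6)
print(trout)  # 500.0
```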

21
Assumptions

Tagged and untagged trout have roughly equal
probabilities of being caught
Number of births between samples small compared
with population
Number of deaths between samples small compared
with population (or death rate roughly equal for
tagged and untagged trout)

22
Sources of Estimation Error

Two types of errors can occur when we sample
Sampling error: no sample is perfectly
representative of the population; some samples
will be particularly unlucky
sampling errors can be understood and quantified
using statistics
Nonsampling errors: various mechanisms (e.g.,
selection, nonresponse, recall bias) can
systematically bias estimates
difficult to quantify; not covered here

23
Sampling Error

Suppose we are estimating a population mean, $\mu$.
We draw a sample of size n and calculate the
sample mean, $\bar{x}$. This is a point estimate of $\mu$.
The sampling error is the difference between the
sample mean and the population mean
How big is the sampling error? In other words,
how accurate is the estimate?
We don't know, because we don't know $\mu$. But we
can answer this question probabilistically.

24
Distribution of the Sample Mean

Imagine that we collect many random samples, of
size n and compute many sample means
We make a frequency table of all the sample means
and construct a histogram
If the number of samples is very large, the
histogram becomes a continuous probability
distribution: the distribution of the sample mean
The sample mean is normally distributed with
a mean of $\mu$
a standard deviation of $\sigma / \sqrt{n}$

25
Central Limit Theorem

Regardless of how x is distributed, the sample
mean is normally distributed (as long as the
sample size, n, is reasonably large)
The standard deviation of the sample mean is
called the standard error

$SE = \sigma / \sqrt{n}$

if you don't know the population standard deviation,
$\sigma$, use the sample standard deviation, s

26
Example

An auditor selects a sample of 100 account
balances from a population of 10,000
The sample mean is 279; the sample standard
deviation is 420 (obvious positive skew)
The standard error is 420/√100 = 42, so two
standard errors is 84
The auditor can be 95% certain that the mean of
all 10,000 accounts is somewhere in the interval
279 ± 84, that is, between 195 and 363.
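
The auditor's interval can be checked with a few lines. This sketch uses the two-standard-error rule for 95 percent, as the slides do; `confidence_interval_95` is a hypothetical helper name.

```python
import math

def confidence_interval_95(sample_mean, sample_sd, n):
    """Approximate 95 percent interval: sample mean plus or minus 2 standard errors."""
    se = sample_sd / math.sqrt(n)     # standard error of the sample mean
    margin = 2 * se
    return sample_mean - margin, sample_mean + margin

low, high = confidence_interval_95(sample_mean=279, sample_sd=420, n=100)
print(low, high)  # 195.0 363.0
```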

27
Caveats

Sample is reasonably large
depends on distribution of x: if normal, any n;
if reasonably symmetrical, n > 30; if highly
asymmetrical, then n > 100
Sample is a small fraction of the population
otherwise use the finite population correction,
$fpc = \sqrt{(N - n)/(N - 1)}$
N and n are size of population and sample; if
N = 10,000, n = 100, $fpc = (9900/9999)^{1/2} \approx 0.995$.
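
A minimal sketch of the correction, assuming the usual form fpc = sqrt((N - n)/(N - 1)), which reproduces the 0.995 figure above. Applying it to a sample standard deviation of 420 is an illustration borrowed from the auditor example; `finite_population_correction` is a hypothetical helper name.

```python
import math

def finite_population_correction(N, n):
    """Factor that shrinks the standard error when the sample is a sizable share of the population."""
    return math.sqrt((N - n) / (N - 1))

fpc = finite_population_correction(N=10_000, n=100)
uncorrected_se = 420 / math.sqrt(100)     # 42.0, from the auditor example
corrected_se = fpc * uncorrected_se
print(round(fpc, 3), round(corrected_se, 2))
```

With a sample that is only 1 percent of the population the correction changes the standard error by about half a percent, which is why it is usually ignored in that case.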

28
Sampling Experiment

Population mean $\mu$ = 40,000
Population standard deviation $\sigma$ = 15,000
Sample size n = 10
Standard error $SE = \sigma / \sqrt{n} \approx 4{,}750$
68% chance that any particular sample mean will
be within one SE of the population mean: 40,000
± 4,750, or 35,250 < $\bar{x}$ < 44,750
95% of sample means will be within two SE: 40,000
± 9,500, or 30,500 < $\bar{x}$ < 49,500

29
Sampling Experiment

Our experiment is limited by the small sample
size (n = 10) and the small number of samples
Using Excel, we can explore larger samples and
larger numbers of samples
In income sampling.xls, we investigate the actual
distribution of sample means for 250 random
samples of size 10, 100, and 1000, and compare
the results to what we would expect from the
Central Limit Theorem
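
The same comparison can be sketched outside Excel. The population below is a made-up, positively skewed income distribution with mean and standard deviation of roughly 40,000 and 15,000; it is an assumption standing in for the workbook data, which is not reproduced here.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical skewed population (not the data in the workbook).
population = [rng.lognormvariate(10.53, 0.36) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

def sample_means(n, n_samples=250):
    """Means of n_samples simple random samples of size n."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)]

results = {}
for n in (10, 100, 1000):
    means = sample_means(n)
    observed = statistics.stdev(means)      # spread of the 250 sample means
    predicted = sigma / math.sqrt(n)        # standard error from the theory
    results[n] = (observed, predicted)
    print(f"n = {n:4d}: observed SD of means = {observed:8.0f}, sigma/sqrt(n) = {predicted:8.0f}")
```

The observed spread should track sigma/sqrt(n) closely, shrinking by about a factor of three for each tenfold increase in sample size.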

30
Population Distribution
31
Distribution of Sample Means
32
Distribution of Sample Means
33
Distribution of Sample Means
34
Distribution of Sample Means
35
Determining Sample Size

What sample size is necessary to estimate the
population mean with a given accuracy?
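Let B = acceptable sampling error = 2SE (i.e., a
95 percent chance that $\bar{x}$ will be in the interval
$\mu \pm B$)

$B = 2\sigma / \sqrt{n}$, so $n = (2\sigma / B)^2$

In the above example, if we want to estimate mean
household income with an accuracy of 1,000, we need
$n = (2 \times 15{,}000 / 1{,}000)^2 = 900$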
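
A minimal sketch of the sample-size calculation, assuming B is set equal to two standard errors so that n = (2 * sigma / B)^2; `sample_size_for_mean` is a hypothetical helper name.

```python
import math

def sample_size_for_mean(sigma, B):
    """Smallest n for which 2 * sigma / sqrt(n) is no larger than B."""
    return math.ceil((2 * sigma / B) ** 2)

n_needed = sample_size_for_mean(sigma=15_000, B=1_000)
print(n_needed)  # 900
```

Halving the acceptable error quadruples the required sample size, since n grows with 1/B^2.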

36
Standard Error for Proportions

Let p = population proportion
Example: percent voting for Bush
Draw a random sample of size n
Determine sample proportion, $\hat{p}$
The standard error of the sample proportion is
$SE = \sqrt{p(1 - p)/n}$
if np > 5, n(1 − p) > 5 (i.e., more than 5
voting for Bush and more than 5 voting for Gore)

37
Heads in 1, 10, 100, 1000 Tosses
38
Example

Exit poll: ask 1000 voters who they just voted
for
480 say Bush: $\hat{p}$ = 0.48
68% chance that the population proportion is 48%
± 1.6%, or between 46.4% and 49.6%
95% chance that the population proportion is 48%
± 3.2%, or between 44.8% and 51.2%
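
The exit-poll figures can be reproduced with a short calculation. The sketch assumes the standard error sqrt(p(1 - p)/n) with the sample proportion substituted for p; `proportion_interval` is a hypothetical helper name.

```python
import math

def proportion_interval(successes, n, z=2):
    """Sample proportion, its standard error, and the interval p_hat +/- z standard errors."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

p_hat, se, (low, high) = proportion_interval(480, 1000)
print(f"p_hat = {p_hat:.2f}, SE = {se:.3f}, 95% interval = {low:.3f} to {high:.3f}")
```

Passing `z=1` gives the 68 percent interval instead.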

39
Determining Sample Size: Proportions

Let B = acceptable sampling error = 2SE (i.e., a
95 percent chance that $\hat{p}$ will be in the interval
$p \pm B$)

$B = 2\sqrt{p(1 - p)/n}$, so $n = 4p(1 - p)/B^2$

The unemployment rate is about 5%; if we want to
measure the rate with an accuracy of 0.1%, we
need to survey almost 200,000 workers
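
A minimal sketch of the unemployment calculation, assuming B equals two standard errors so that n = 4p(1 - p)/B^2; `sample_size_for_proportion` is a hypothetical helper name.

```python
import math

def sample_size_for_proportion(p, B):
    """Sample size for which 2 * sqrt(p * (1 - p) / n) is about B."""
    return math.ceil(4 * p * (1 - p) / B ** 2)

# p = 5 percent unemployment, B = 0.1 percentage points.
workers = sample_size_for_proportion(p=0.05, B=0.001)
print(workers)  # about 190,000
```

The result is close to 190,000, consistent with the "almost 200,000" on the slide.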

40
Difference Between Two Means

Suppose X1 and X2 are independent random
variables
If Y = X1 − X2, then
$\mu_Y = \mu_1 - \mu_2$ and $\sigma_Y^2 = \sigma_1^2 + \sigma_2^2$

Sample means are independent random variables.
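
Because variances of independent quantities add, the standard error of a difference between two sample means is the square root of the sum of the two squared standard errors. The sketch below illustrates this with made-up inputs; `se_of_difference` is a hypothetical helper name.

```python
import math

def se_of_difference(s1, n1, s2, n2):
    """Standard error of (mean 1 - mean 2) for two independent samples."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Hypothetical inputs: two samples of 100, each with standard deviation 420.
se_diff = se_of_difference(420, 100, 420, 100)
print(round(se_diff, 1))
```

Note that the difference is less precise than either mean on its own: each mean here has a standard error of 42, but the difference has a standard error of about 59.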
41
Study Design

Two types of studies
experimental
identifies a cohort or group of subjects, imposes
one or more treatments in order to observe a
response
observational
gathers data without influencing response
sometimes called a natural experiment

42
Experimental Design

The ideal experiment: random assignment of
subjects into a control group and one or more
treatment groups
treatment is a combination of explanatory
variables; except for treatment, subjects in all
groups are handled the same
Only systematic reason for differences between
groups is the treatment
One must still account for random effects;
differences considered too large to be due to
random effects are statistically significant

43
Comparative Design

Comparative design is necessary to ensure that
the measured treatment effect is due solely to
the treatment
blind experiment: subject does not know whether
he is in control or treatment group; placebo used
for controls
double-blind: the person interacting with
subjects and measuring response does not know which
group the subject is in

44
Matching

Some studies match control and experimental group
subjects by age, gender, race, etc.
This can lead to smaller random effects, but
random selection is still necessary to control
for other variables (stratify population and then
apply random selection to each stratum)
Matched pairs is particularly powerful:
each subject subjected to various treatments,
difference in response measured

45
Policy Experiments

Welfare: recipients randomly assigned to control
or treatment group; treatment group required to
attend classes, look for work, or lose benefits
Reemployment bonus: applicants randomly assigned
to control or various treatment groups; treatment
group offered a bonus (3 or 6 x WBA) if they find
work in specified time (6 or 12 weeks)
Class size: students and teachers randomly
assigned to small (13-17) or large (22-26) class
Vouchers: students apply for voucher, half are
randomly selected

46
Experiments Not Always Possible

Experiments can be
expensive
controversial (subjects often do not want to be
in the control or treatment group)
unethical (split twins, stuttering, etc.)
impossible (effect of economy on election
results, greenhouse gas emissions on climate,
etc.)

47
Observational Studies

Natural experiments: differences in explanatory
variable occur between groups or over time
prospective: identify groups that differ in some
aspect (diet), track and measure outcomes
retrospective: examine data collected after the
response, correlate to explanatory variable
Observational studies suffer from
confounding variables: variables correlated to
explanatory and response
selection bias: systematic differences in group
characteristics

48
Natural Experiments

Teen smoking: price of cigarettes, laws vary from
state to state and over time
Vouchers: track performance of students who
receive vouchers, compare to other students
Cancer: search for patterns (geographical or
occupational clusters of disease) in public health
records
Discrimination: search for differences in salary,
promotions, mortgage lending, etc. by gender and
race

49
Case-Control

Some conditions are too rare to permit
prospective studies; for example,
brain cancer from cell phone use or exposure to
high-voltage transmission lines
leukemia or thyroid cancer from exposure to
fallout
Case group is composed of those with condition;
control group selected to match other
characteristics of case group
Explanatory variable measured for both groups

50
The studies that found high suicide rates did not
include women who had implants after
mastectomies. Some researchers say the high
suicide rate reflects the psychological makeup of
women who seek implants -- that as a group they
are more likely to have psychological problems
than the general population. But others say the
high suicide rate is a function of the
difficulties and pain that sometimes occur years
after the surgery. Although the FDA has
restricted the use of silicone gel implants for
cosmetic purposes, saline-filled implants have
gained popularity. According to the American
Society of Plastic Surgeons, more than 225,000
women had the operation last year, and some say
many more will opt for it should silicone gel
implants become more available. Many women say
silicone looks more natural and feels better.
The Finnish study, which included women who
received the implants as long as 30 years ago,
reported on 2,166 women. It was conducted by the
private International Epidemiology Institute of
Rockville and funded by Dow Corning Corp., a
former manufacturer of silicone gel breast
implants. Dow Corning also funded the larger
Swedish study, which examined 3,521 women with
implants and also found a suicide rate about
three times above normal. "The ironic thing is
that nobody was looking for this suicide
information," said Joseph K. McLaughlin, lead
investigator on the Finnish study, published in
the Annals of Plastic Surgery. "There have been
lots of studies of women with breast implants,
and the only consistent finding that's
problematic is the suicide excess." But
McLaughlin said that the data did not prove a
cause-and-effect connection between breast
implants and suicide, and that the high rate may
be related to the nature of women who choose to
have implants. "In fact," he said, "it could be
that because of characteristics of women who get
implants, it may be that women who get them may
reduce their risk of later suicide."
Breast Implants Linked to Suicide
By Marc Kaufman, Washington Post Staff Writer
Thursday, October 2, 2003, Page A13
A series of studies has found a surprisingly
high suicide rate among women who have had
cosmetic breast implants, renewing the
controversy about the procedure just as the Food
and Drug Administration weighs whether to allow
silicone gel implants back on the market. The
latest study, published yesterday, found that
Finnish women who had cosmetic implants were more
than three times more likely to commit suicide
than the general population -- in line with
findings from a similar study of Swedish women
and one of American women conducted by the
National Cancer Institute. The three studies
also found that the overall death rate for women
with implants was the same as or lower than for
the general population, suggesting that the
implants themselves were not causing illness, as
once feared. But all three found that the suicide
rate was significantly, and at this point
inexplicably, higher than expected. The
question of why women with implants are so much
more likely to commit suicide has become
controversial, especially with an FDA advisory
panel preparing to consider an application by
Inamed Corp. to allow silicone gel breast
implants back on the market for breast
enhancement. The FDA restricted their use to
mastectomy patients and women in clinical trials
in 1992 after concerns arose about their safety.
51
For Love and Money
By Richard Morin, Washington Post, September 28, 2003, Page B05
Want to be wealthy? If you're a woman, two distinct paths
seem to increase the odds that you'll strike it
rich: Marry young and don't have kids, or remain
single your entire life. But if you're a man and
dreaming of making a fortune, you can flip a coin
before deciding whether to go to the altar --
married or single men have about the same chance
of becoming wealthy, claim three sociologists who
have studied earnings over the course of a
person's lifetime. Forget kids if you're
seeking financial rather than emotional riches.
In statistical terms, children are a lousy
short-term financial investment, assert Thomas A.
Hirschl and Joyce Altobelli of Cornell
University. Past research has repeatedly shown
a link between marriage and affluence. Generally,
individuals who got married were more likely to
achieve wealth than people who didn't -- in fact,
married couples were more than twice as likely as
singles to have experienced at least a year of
affluence during the 25-year study period.
"Marriage enhances the odds of female affluence,
but not male affluence," they report in an
article scheduled to appear in a forthcoming
issue of the Journal of Marriage and Family.
"There is no statistically significant difference
between the life course of marital affluence and
the life course of nonmarried male affluence,
suggesting the decision to marry is not crucial
for men. This decision would, however, appear to
be quite crucial for most women." But who's
the richest of them all? It wasn't middle-aged or
older couples. Instead, it was younger (under 45)
marrieds with no children. Nearly two-thirds of
all childless couples between the ages of 25 and
45 were rich for at least a year during the study
period, compared with fewer than one in four
couples with children. So to get rich, if only
for a little while, Hirschl's advice is "marry
young and use contraceptives or have a
vasectomy." But, he was quick to add, kids are
cool, as well as critical for the perpetuation of
the species. "There are more important things in
life than merely financial success," he said.
