GATHERING - PowerPoint PPT Presentation

Slides: 37

Provided by: james308

Learn more at: https://people.carleton.edu

Transcript and Presenter's Notes

1
GATHERING AND PRODUCING DATA
2
How Data are Obtained

Census
Everyone is included
Observational Study
Observes individuals and measures variables but
does not attempt to influence responses
Includes surveys and polls
Experiment
Deliberately imposes some treatment on
individuals in order to observe their responses
In medicine, this is called a clinical trial

3
3 BIG ideas

Examine a part of the whole: take a sample from a
population
Randomization ensures that, on average, the sample
is representative
The size of the sample is what's important, not
the size of the population

4
Big Idea 1: Examine Part of the Whole

We are studying an entire population of
individuals (or subjects), but looking at
everyone is practically impossible.
How many support the U.S. role in Iraq?
What percent of the tomato shipment is bad?
How many children are obese?
What's the price of gas at the pump across
Minnesota?
Settle for looking at a smaller group, a
sample, selected from the population.
Sampling is natural! Think about cooking. You
taste (sample) a small part to get an idea about
the dish as a whole.

5
Populations and parameters, samples and
statistics (This stuff is important!)

A parameter is a numerical quantity that
describes a population.
A statistic is a numerical quantity that
describes the sample.
We study a population by looking at a sample. We
infer about a parameter by using statistics from
the sample.
Notation: use Greek letters for parameters and
Latin letters for statistics

6
Example: Polling

Minneapolis Star Tribune: A Gallup Poll,
conducted Aug. 16-18, 1999, asked, "Do you
consider pro-wrestling to be a sport, or not?" Of
the people polled, 19% said, "Yes." (Results were
based on telephone interviews with a randomly
selected national sample of 1,028 adults, 18
years and older.)
What's the population, parameter, sample,
statistic?
Population: Americans, 18 years and older
Sample: The 1,028 people who were polled
Parameter: The proportion of American adults who
believe pro-wrestling is a sport. (Called the
population proportion.)
p = ?
Statistic: The proportion of people in the sample
who said they believe pro-wrestling is a sport.
(Called the sample proportion.) p-hat = 0.19
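
The difference between a parameter and a statistic can be sketched with a small simulation. This is only an illustration: the population of 100,000 and its true proportion of 0.20 are invented here (the real value of p is unknown), and only the sample size of 1,028 comes from the poll.

```python
import random

def sample_proportion(population, n, rng):
    """Proportion of 1s in a simple random sample of size n."""
    sample = rng.sample(population, n)  # drawn without replacement
    return sum(sample) / n

# Hypothetical population: 1 = "yes, it is a sport", 0 = "no".
# The 20% figure is an assumption made up for this sketch.
population = [1] * 20_000 + [0] * 80_000

# The parameter describes the whole population (normally unknown).
p = sum(population) / len(population)

# The statistic describes one sample of 1,028 people.
rng = random.Random(1999)
p_hat = sample_proportion(population, 1028, rng)

print(f"parameter p = {p:.3f}")
print(f"statistic p-hat = {p_hat:.3f}")
```

Changing the seed gives a different p-hat each time, while p stays fixed.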

7
Example: Surveying a lot shipment

A carload of ball bearings has an average
diameter of 2.502 centimeters. This is within the
specifications for acceptance of the lot by the
purchaser. An inspector happens to inspect 100
bearings from the lot and finds the average
diameter of these to be 2.499 cm. This is within
the specified limits, so the entire lot is
accepted.
What's the population, parameter, sample,
statistic?
Population: The carload of ball bearings
Sample: The 100 ball bearings that were inspected
Parameter: The average diameter of the ball
bearings in the carload.
µ = 2.502 cm (The population mean.)
Statistic: The average diameter of the 100 ball
bearings in the sample.
x-bar = 2.499 cm (The sample mean.)

8
Big Idea 2: Randomization

Randomization makes sure that
on average the sample looks like
the rest of the population.
Randomization makes it possible to use
quantitative tools (probability) to draw
inferences about the population when we see only
a sample.
Randomization protects against bias.

9
Who will you vote for in 2008? Some examples of
biased samples

100 people at the Mall of America
100 people in front of the Metrodome after a
Twins game
100 friends, family and relatives
100 people who volunteered to answer a survey
question on your web site
100 people who answered their phone during supper
time
The first 100 people you see after you wake up in
the morning

10
Bias: the bane of sampling

Samples that systematically misrepresent
individuals in the population are said to be
biased.
Bias is the systematic failure of a sample to
represent its population
There is usually no way to fix a biased sample
and no way to salvage useful information from it.
The best way to avoid bias is to select
individuals for the sample at random. The value
of deliberately introducing randomness is one of
the great insights of Statistics.

11
Simple Random Sample (SRS)

Suppose we want to draw a sample of size n from
some population
For a simple random sample, every possible subset
of size n has an equal chance to be selected and
to become the sample.
Such samples guarantee that each individual has
an equal chance of being selected.
Each combination of people also has an equal
chance of being selected.
The sampling frame is a list of the population
from which the sample is drawn. From the sampling
frame, we can choose an SRS using random numbers.
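
A minimal sketch of drawing an SRS from a sampling frame using random numbers. The frame of 800 ID strings and the sample size of 40 are hypothetical values chosen for illustration.

```python
import random

# Hypothetical sampling frame: one ID per member of the population.
frame = [f"ID{i:04d}" for i in range(1, 801)]

# random.sample draws without replacement, so every subset of
# size 40 has the same chance of becoming the sample.
rng = random.Random(42)  # fixed seed only so the run is repeatable
srs = rng.sample(frame, 40)

print(sorted(srs)[:5])
```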

12
SRS and Sampling Variability

Samples drawn at random generally differ from one
another.
These differences lead to different values for
the variables we measure.
Sample-to-sample differences are called sampling
variability
This is different from bias!
Example: Everyone picks 10 Skittles at random from
The Bowl and counts how many are red.
The variability of the different sample counts is
sampling variability.
If half the class peeked and tried to get more
reds, the differences would reflect bias.
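
The Skittles exercise can be imitated in a few lines. This is a sketch under made-up assumptions: a bowl of 1,000 candies with five equally common colors, 30 students, and each student drawing from a full bowl.

```python
import random
from collections import Counter

rng = random.Random(7)

# Hypothetical bowl: 200 candies of each of five colors (20% red).
colors = ["red", "orange", "yellow", "green", "purple"]
bowl = [color for color in colors for _ in range(200)]

def count_reds(bowl, n, rng):
    """Number of reds in a random draw of n candies."""
    return rng.sample(bowl, n).count("red")

# Thirty students each pick 10 candies at random.
red_counts = [count_reds(bowl, 10, rng) for _ in range(30)]
mean_reds = sum(red_counts) / len(red_counts)

# The spread of these counts is sampling variability, not bias:
# nobody peeked, so the counts center near 10 x 0.20 = 2.
for reds, students in sorted(Counter(red_counts).items()):
    print(f"{reds} reds: {students} students")
print(f"average reds per student: {mean_reds:.2f}")
```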

13
Sources of sampling error

In the context of using a sample to
estimate a population parameter,
sampling variability is sometimes
called sampling error.
Taking an SRS of 3 students to estimate the
average height of all students will have a large
sampling error, but it is not biased.
Taking a sample of 300 basketball players to
estimate the average height of all students will
produce less variability, but the sample is biased.

14
More complex sampling designs

Simple random sampling is not the only way to
sample.
More complicated designs may save time or money
or help avoid sampling problems.
Stratified sampling
Cluster sampling
Systematic sampling
Multi-stage sampling
All statistical sampling designs have in common
the idea that chance, rather than human choice,
is used to select the sample.

15
Stratified sampling

Suppose we want a sample of 240 Carleton students
We also want to ensure that each discipline is
represented
The student body divides as
Arts and Literature 20%
Humanities 15%
Social Sciences 30%
Mathematics and Natural Sciences 35%
For the sample, select
240 x 0.20 = 48 Arts and Lit students
240 x 0.15 = 36 Humanities students
240 x 0.30 = 72 Social science students
240 x 0.35 = 84 Natural science students
Within each discipline, choose an SRS
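
The allocation arithmetic and the within-stratum draws can be sketched as follows. The percentages and the total of 240 come from the slide; the student body of 2,000 and the ID strings are assumptions made up for this example.

```python
import random

# Share of the student body in each discipline, in percent (from the slide).
shares_pct = {
    "Arts and Literature": 20,
    "Humanities": 15,
    "Social Sciences": 30,
    "Mathematics and Natural Sciences": 35,
}
total_sample = 240

# Proportional allocation: each stratum gets its share of the 240.
allocation = {name: total_sample * pct // 100 for name, pct in shares_pct.items()}

# Hypothetical student body of 2,000, split by the same shares.
student_body = 2000
strata = {
    name: [f"{name[:4]}-{i}" for i in range(student_body * pct // 100)]
    for name, pct in shares_pct.items()
}

# Simple random sample within each stratum.
rng = random.Random(0)
sample = {name: rng.sample(strata[name], allocation[name]) for name in strata}

for name, chosen in sample.items():
    print(f"{name}: {len(chosen)} students")
```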

16
Stratified Sampling

The population is divided into homogeneous
groups, called strata, before the sample is
selected.
Then simple random sampling is used within each
stratum before the results are combined.
Advantages
Sample will be representative for the strata
Reduces sampling variability
Disadvantages
May be logistically difficult, if even possible, to
implement
Must have information about the population
Note: a stratified sample is not an SRS

17
Cluster sampling

Sometimes stratifying isn't practical and simple
random sampling is difficult. Splitting the
population into clusters can make sampling more
practical.
Suppose you want to do a face-to-face survey of
attitudes in Minnesota based on a sample of size
600.
Choosing 600 people at random, finding their
addresses, and meeting them in person is costly
and time-consuming.
Another idea: Choose some cities at random. Then
some streets at random, and then some blocks at
random. Interview everyone on the selected
blocks.
The blocks are the clusters.
If you know there are about 20 people per block,
then choose a random sample of 30 blocks
(30 x 20 = 600).
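
A rough sketch of the block calculation and the cluster draw. It simplifies the procedure to a single stage (blocks chosen directly rather than cities, then streets, then blocks), and the frame of 5,000 blocks with exactly 20 residents each is a made-up assumption.

```python
import math
import random

target_sample_size = 600
people_per_block = 20  # approximate figure from the slide

# Number of clusters needed to reach about 600 interviews.
blocks_needed = math.ceil(target_sample_size / people_per_block)

# Hypothetical frame of blocks, identified by number.
all_blocks = list(range(5000))

rng = random.Random(3)
chosen_blocks = rng.sample(all_blocks, blocks_needed)

# Interview everyone on each chosen block.
interviewees = [
    (block, person)
    for block in chosen_blocks
    for person in range(people_per_block)
]

print(f"blocks chosen: {blocks_needed}, interviews: {len(interviewees)}")
```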

18
Cluster sampling in the news: The Lancet study on
Iraq casualties

In October 2006, The Lancet published "Mortality
after the 2003 invasion of Iraq: a
cross-sectional cluster sample survey."
The study was controversial because of its
finding that hundreds of thousands of Iraqis
(most likely about 650,000) had been killed since
the U.S. invasion.
Earlier reports, including those from the U.S. and
British governments, had put the number at about
30,000.
The study was based on cluster sampling, a common
methodology in public health and human rights
work.
The clusters were groups of 40 houses in close
proximity whose locations were chosen based on
population demographics.

19
Cluster Sampling

If each cluster fairly represents the population,
cluster sampling will give an unbiased sample.
Advantage
Easier to implement depending on context
Disadvantage
Greater sampling variability, so less statistical
accuracy

20
Multistage Sampling

Most surveys conducted by the government or
professional polling organizations use some
combination of stratified and cluster sampling as
well as simple random sampling.
The Current Population Survey is how the government
estimates the unemployment rate
Counties are divided into 2,007 Primary Sampling
Units
PSUs are divided into smaller census blocks. And
the blocks are grouped into strata. Households in
each block are grouped into clusters of about 4
households each
The final sample consists of these clusters and
interviewers go to all households in the chosen
clusters.

21
Systematic Samples

Sometimes we draw a sample by selecting
individuals systematically.
For example, you might survey every 10th person
on an alphabetical list of students.
To make it random, you must still start the
systematic selection from a randomly selected
individual.
When there is no reason to believe that the order
of the list could be associated in any way with
the responses sought, systematic sampling can
give a representative sample.
Systematic sampling can be much less expensive
than true random sampling.
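
A minimal sketch of systematic sampling with a random start. The alphabetical list of 250 students is hypothetical; the step of 10 matches the "every 10th person" example.

```python
import random

def systematic_sample(items, step, rng):
    """Take every step-th item, starting from a randomly chosen position."""
    start = rng.randrange(step)  # random start among the first `step` items
    return items[start::step]

# Hypothetical alphabetical list of students.
students = sorted(f"Student{i:03d}" for i in range(250))

rng = random.Random(5)
sample = systematic_sample(students, 10, rng)

print(len(sample), sample[:3])
```

Only the starting point is random here, so the sample is representative only if the list order is unrelated to the responses, as the slide cautions.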

22
Sampling Example

Hospital administrators are concerned about the
possibility of drug abuse among employees. They
plan to pick a sample of 40 from 800 employees,
and administer a drug test. What's the sampling
strategy in each case?
Randomly select 10 doctors, 10 nurses, 10 office
staff, and 10 support staff for the test.
Each employee has a 4-digit ID number. Randomly
choose 40 numbers.
At the start of each shift, choose every 20th
person who arrives for work.
There are 40 departments of 20 employees each.
Randomly choose two departments (say radiology
and ER) and test all the people who work in those
departments.

23
Big Idea 3: Sample size is key, not population
size

How large a sample size do we need for the sample
to be reasonably representative of the
population?
In general, it's the size of the sample, not the
size of the population, that makes the difference
in sampling.
The fraction of the population that you've
sampled doesn't matter. It's the sample size
itself that's important.
Back to cooking: If the soup is
mixed enough, a tablespoon will
suffice, whether you're sampling
from a saucepan or from a barrel.

24
How big a sample?

Most professional polls choose a sample size of
about 1,000 people.
These polls report a margin of error of about
3%. That means that with high confidence their
estimates are within 3 percentage points of the
true population parameter value.
The margin of error for a sample of 1,000 people
is the same for Minneapolis (pop. 400,000),
Minnesota (pop. 5 million), and the U.S. (pop.
290 million).
But the bad news is that if you want similar
accuracy at Carleton, you need to poll over half
the student body.
Coming Attractions: Margin of Error
and [...]. But you'll have to wait
until we get to Statistical Inference to learn
why.
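
As a preview, the claim about Minneapolis, Minnesota, and the U.S. can be checked numerically. The slides do not give a formula, so this sketch assumes the usual conservative 95% margin of error for a proportion, 1.96 x sqrt(0.25 / n), with the standard finite population correction sqrt((N - n) / (N - 1)) applied for a population of size N. The function name and its defaults are illustrative.

```python
import math

def margin_of_error(n, population_size=None, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes simple random sampling. If population_size is given,
    the finite population correction is applied.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population_size is not None:
        correction = math.sqrt((population_size - n) / (population_size - 1))
        standard_error *= correction
    return z * standard_error

populations = [
    ("Minneapolis", 400_000),
    ("Minnesota", 5_000_000),
    ("U.S.", 290_000_000),
]

for name, size in populations:
    moe = margin_of_error(1000, size)
    print(f"{name}: margin of error = {100 * moe:.2f} percentage points")
```

For all three populations the result is about 3 percentage points, because the correction factor is very close to 1 when the sample is a tiny fraction of the population.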

25
How to Sample Badly

Advice columnist Ann Landers once asked parents,
"If you had it to do over again, would you have
children?"
Do you think responses were representative of
public opinion?
Over 100,000 people responded, and 70% answered
"No!"
A later survey, more carefully designed, showed
90% of parents are happy with their decision to
have children.
In a voluntary response sample, a large group of
individuals is invited to respond, and all who do
respond are counted. But such samples are almost
always biased toward those with strong opinions
or those who are strongly motivated.
Since the sample is not representative, the
resulting voluntary response bias invalidates the
survey.
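
A short worked calculation shows how unequal response rates alone could turn a 10% minority into a 70% majority of responses. The 10% figure follows from the later survey (90% happy); the two response rates are hypothetical values chosen to illustrate the effect, not data from the Ann Landers survey.

```python
def share_no_among_respondents(p_no, response_rate_no, response_rate_yes):
    """Fraction of responses that are 'No' when the two groups respond at different rates."""
    no_responses = p_no * response_rate_no
    yes_responses = (1 - p_no) * response_rate_yes
    return no_responses / (no_responses + yes_responses)

# Assume 10% of parents would answer "No".
# Hypothetical response rates: unhappy parents write in 21 times as often.
share = share_no_among_respondents(
    p_no=0.10,
    response_rate_no=0.21,
    response_rate_yes=0.01,
)

print(f"'No' share among respondents: {100 * share:.0f}%")
```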

26
What Can Go Wrong? or, How to Sample Badly

In convenience sampling, we simply include the
individuals who are convenient. But they may not
be representative of the population.
A psychology professor performs an experiment
using his classroom.
A company samples opinions by using its own
customers.
Sampling mice from a large cage to study how a
drug affects physical activity: The lab assistant
reaches into the cage to select the mice one at a
time until 10 are chosen. But which mice will
likely be chosen?

27
Other problems

Under-coverage
In some survey designs a portion of the
population is not sampled or has a smaller
representation in the sample than it has in the
population.
Using telephone directories for a phone survey.
Half the households in large cities are unlisted.
About 5% of households don't have phones.
Random digit dialing only partially addresses
this problem.
Misses students in dorms, inmates in prison,
soldiers in the military, homeless people. And
it's too expensive to call Hawaii or Alaska.
Non-response
No survey succeeds in getting responses from
everyone.
The problem is that those who don't respond may
differ from those who do.
The Bureau of Labor Statistics gets a 6-7%
non-response rate.
But it's common for opinion polls and market
research studies to have a 75-80% non-response
rate.

28
What Else Can Go Wrong?

Response bias refers to anything in the survey
design that influences the responses
In particular, the wording of a question can have
a big impact on the responses

29
Some classic statistical mistakes: The Literary
Digest Poll

1936 presidential election: Franklin Delano
Roosevelt vs. Alf Landon
The Literary Digest had correctly called every
presidential election since 1916
Sample size: 2.4 million!
They predicted Roosevelt would lose, with only 43%
of the vote
In fact it was a landslide for Roosevelt at 62%

30
Literary Digest poll

Context
Midst of the Great Depression
9 million unemployed; real income down 1/3
Landon's program: Cut spending
Roosevelt's program: Balance people's budgets
before the government's budget
How the polling was done
Survey sent to 10 million people
And 2.4 million responded (that's huge!)

31
A huge sample, but The Literary Digest poll was
biased

The sampling frame was not representative of the
electorate: selection bias
Based on magazine subscription lists, drivers'
registrations, country club memberships, phone
numbers (when telephones were a luxury)
Biased toward better-off groups (who were more
Republican)
Voluntary response bias
Main issue was the economy
The anti-Roosevelt forces were angry, and had a
higher response rate!
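
The lesson that a huge biased sample can lose to a small random one can be sketched with a toy electorate. All counts below are invented for illustration; they are chosen only so that overall support for Roosevelt is 62%, with the better-off group leaning against him.

```python
import random

rng = random.Random(1936)

# 1 = votes for Roosevelt, 0 = votes for Landon (hypothetical counts).
better_off = [1] * 20_000 + [0] * 30_000       # 40% for Roosevelt
everyone_else = [1] * 104_000 + [0] * 46_000   # about 69% for Roosevelt
electorate = better_off + everyone_else        # 124,000 of 200,000 = 62%

true_share = sum(electorate) / len(electorate)

# Huge sample, but the frame covers only the better-off group.
big_biased_sample = rng.sample(better_off, 20_000)
biased_estimate = sum(big_biased_sample) / len(big_biased_sample)

# Small simple random sample from the whole electorate.
small_srs = rng.sample(electorate, 1_000)
srs_estimate = sum(small_srs) / len(small_srs)

print(f"true share:         {true_share:.3f}")
print(f"biased, n = 20,000: {biased_estimate:.3f}")
print(f"SRS, n = 1,000:     {srs_estimate:.3f}")
```

Increasing the size of the biased sample does not help: it just estimates the wrong quantity more precisely.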

32
Year  Sample size  Winner      Gallup prediction (%)  Election result (%)  Error (prediction - result)
1936  50,000       Roosevelt   55.7                   62.5                 -6.8
1940  50,000       Roosevelt   52.0                   55.0                 -3.0
1944  50,000       Roosevelt   51.5                   53.8                 -2.3
1948  50,000       Truman      44.5                   49.5                 -5.0
1952  5,385        Eisenhower  51.0                   55.4                 -4.4
1956  8,144        Eisenhower  59.5                   57.8                  1.7
1960  8,015        Kennedy     51.0                   50.1                  0.9
1964  6,625        Johnson     64.0                   61.3                  2.7
1968  4,414        Nixon       43.0                   43.5                 -0.5
1972  3,689        Nixon       62.0                   61.8                  0.2
1976  3,439        Carter      48.0                   50.1                 -2.1
1980  3,500        Reagan      47.0                   50.8                 -3.8
1984  3,456        Reagan      59.0                   59.2                 -0.2
1988  4,089        Bush        56.0                   53.9                  2.1
1992  2,019        Clinton     49.0                   43.3                  5.7
1996  2,417        Clinton     52.0                   50.1                  1.9
2000  3,129        Bush        48.0                   47.9                  0.1
2004  1,866        Bush        49.0                   51.0                 -2.0
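
One way to read this table is to compare the typical size of the error in the years with samples of 50,000 against the later years with far smaller samples. The sketch below uses the prediction and result columns as given and takes the error to be prediction minus result; splitting the years at a sample size of 50,000 is a choice made for this illustration.

```python
# (year, sample size, Gallup prediction %, election result %), from the table.
polls = [
    (1936, 50_000, 55.7, 62.5), (1940, 50_000, 52.0, 55.0),
    (1944, 50_000, 51.5, 53.8), (1948, 50_000, 44.5, 49.5),
    (1952, 5_385, 51.0, 55.4), (1956, 8_144, 59.5, 57.8),
    (1960, 8_015, 51.0, 50.1), (1964, 6_625, 64.0, 61.3),
    (1968, 4_414, 43.0, 43.5), (1972, 3_689, 62.0, 61.8),
    (1976, 3_439, 48.0, 50.1), (1980, 3_500, 47.0, 50.8),
    (1984, 3_456, 59.0, 59.2), (1988, 4_089, 56.0, 53.9),
    (1992, 2_019, 49.0, 43.3), (1996, 2_417, 52.0, 50.1),
    (2000, 3_129, 48.0, 47.9), (2004, 1_866, 49.0, 51.0),
]

# Error = prediction minus result, in percentage points.
errors = {year: prediction - result for year, _, prediction, result in polls}

def mean_absolute_error(years):
    """Average size of the error, ignoring its sign."""
    return sum(abs(errors[y]) for y in years) / len(years)

large_sample_years = [year for year, n, _, _ in polls if n == 50_000]
small_sample_years = [year for year, n, _, _ in polls if n < 50_000]

mae_large = mean_absolute_error(large_sample_years)
mae_small = mean_absolute_error(small_sample_years)

print(f"samples of 50,000:   mean absolute error {mae_large:.2f} points")
print(f"samples under 9,000: mean absolute error {mae_small:.2f} points")
```

In this table the smaller samples come with smaller average errors, which fits the point that the sampling method matters more than sheer sample size.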
33
The Year the Polls Elected Dewey

1948 Election: Harry Truman versus Thomas Dewey
Every major poll (including Gallup) predicted
Dewey would win by 5 percentage points

34
What went wrong?

Pollsters chose their samples using quota
sampling. Each interviewer was assigned a fixed
quota of subjects in certain categories (race,
sex, age).
For instance, an interviewer in St. Louis was
required to talk to 13 people:
6 living in the suburbs, 7 in the central city
7 men and 6 women
Of the 7 men (similar for women):
3 under 40 years old, 4 over 40
1 black, 6 white
Within each category, interviewers were free to
choose whom to interview.
But this left room for human choice and
inevitable bias.
Republicans were easier to reach. They had
telephones, permanent addresses, nicer
neighborhoods.
So interviewers ended up with too many
Republicans.
Quota sampling was abandoned for random sampling.

35
Do you believe the poll? What questions should
you ask?

Who carried out the survey?
What is the population?
How was the sample selected?
How large was the sample?
What was the response rate?
How were subjects contacted?
When was the survey conducted?
What are the exact questions asked?

36
To summarize . . .

We are often interested in a population and some
parameter that describes the population.
We select a sample from that population and use a
statistic from the sample to estimate the unknown
parameter
To obtain a good estimate, the sample must be as
representative of the population as possible. And
randomization, on average, ensures a
representative sample.
Possible sources of error are sampling
variability and bias.
To reduce sampling variability, take a bigger
sample.
To reduce bias, get a better sampling design.
It's the sample size, not the population size,
that matters.