Title: Data Collection and Sampling
Recall
- Statistics is a tool for converting data into information.
- But:
- Where then does data come from?
- How is it gathered?
- How do we ensure it is accurate? Is the data reliable?
- Is it representative of the population from which it was drawn?
- This chapter explores some of these issues.
5.1 Methods of Collecting Data
- The reliability and accuracy of the data affect the validity of the results of a statistical analysis.
- The reliability and accuracy of the data depend on the method of collection.
- Four of the most popular sources of statistical data are:
- Published data
- Observational studies
- Experimental studies
- Surveys
Published Data
- This is often a preferred source of data due to low cost and convenience.
- Published data is found as printed material, tapes, disks, and on the Internet.
- Data published by the organization that collected it is called PRIMARY DATA. For example, data published by the US Bureau of the Census.
- Data published by an organization other than the one that collected it is called SECONDARY DATA. For example:
- The Statistical Abstract of the United States compiles data from primary sources.
- Compustat sells a variety of financial data tapes compiled from primary sources.
Observational and experimental studies
- When published data is unavailable, one needs to conduct a study to generate the data.
- An observational study is one in which measurements representing a variable of interest are observed and recorded, without controlling any factor that might influence their values.
- An experimental study is one in which measurements representing a variable of interest are observed and recorded, while controlling factors that might influence their values.
Surveys
- Surveys solicit information from people, e.g. pre-election polls and marketing surveys.
- The Response Rate (i.e. the proportion of all people selected who complete the survey) is a key survey parameter.
- Surveys can be conducted by means of:
- personal interview
- telephone interview
- self-administered questionnaire
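The response rate defined above is a simple proportion. A minimal sketch follows; the function name `response_rate` and the counts used are hypothetical and chosen only for illustration.

```python
def response_rate(completed: int, selected: int) -> float:
    """Proportion of all people selected who completed the survey."""
    if selected <= 0:
        raise ValueError("selected must be positive")
    if not 0 <= completed <= selected:
        raise ValueError("completed must be between 0 and selected")
    return completed / selected


# Hypothetical counts, for illustration only.
rate = response_rate(completed=312, selected=1200)
print(f"Response rate: {rate:.1%}")
```

A low value here is a warning sign: the people who did respond may differ systematically from those who did not.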
Questionnaire Design
- Key design principles of a good questionnaire:
- Keep the questionnaire as short as possible.
- Ask short, simple, and clearly worded questions.
- Start with demographic questions to help respondents get started comfortably.
- Use dichotomous (yes/no) and multiple-choice questions.
- Use open-ended questions cautiously.
- Avoid using leading questions.
- Pretest a questionnaire on a small number of people.
- Think about the way you intend to use the collected data when preparing the questionnaire.
5.2 Sampling
- Recall that statistical inference permits us to draw conclusions about a population based on a sample.
- Motivation for conducting a sampling procedure:
- Costs (e.g. it is less expensive to sample 1,000 television viewers than 100 million TV viewers).
- Population size.
- The possible destructive nature of the sampling process (e.g. performing a crash test on every automobile produced is impractical).
- The sampled population and the target population should be similar to one another.
5.3 Sampling Plans
- A sampling plan is just a method or procedure for specifying how a sample will be taken from a population.
- We will focus our attention on these three methods:
- Simple random sampling
- Stratified random sampling
- Cluster sampling
Simple Random Sampling
- In simple random sampling, all samples of the same size are equally likely to be chosen.
- To conduct random sampling:
- assign a number to each element of the chosen population (or use already given numbers),
- randomly select the sample numbers (members), using a random numbers table or a software package.
Simple Random Sampling
- Example 5.1
- A government income-tax auditor is responsible for 1,000 tax returns.
- The auditor will randomly select 40 returns to audit.
- Use Excel's random number generator to select the returns.
- Solution
- We generate 50 numbers between 1 and 1,000 (we need only 40 numbers, but the extras might be used if duplicate numbers are generated).
Simple Random Sampling
- Example 5.1: A government income-tax auditor must choose a sample of 40 of 1,000 returns to audit.
- Extra numbers may be used if duplicate random numbers are generated.
Simple Random Sampling
[Spreadsheet output; column headings: X(100), Round-up]
383 101 597 900 885 959 15 408 864 139 2 46 ...
The auditor should select the 40 files numbered 383, 101, ...
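The selection in Example 5.1 can also be sketched in Python rather than Excel. This is only an illustration under stated assumptions: the function names, the `extra` parameter, and the seed are hypothetical, and the numbers drawn will not match the spreadsheet output above. The first function mirrors the approach in the example (draw 50 numbers with replacement, then discard duplicates); the second draws without replacement, so duplicates cannot occur.

```python
import random


def select_with_extras(population_size=1000, sample_size=40, extra=10, seed=None):
    """Draw sample_size + extra numbers with replacement, then drop duplicates."""
    rng = random.Random(seed)
    draws = [rng.randint(1, population_size) for _ in range(sample_size + extra)]
    # dict.fromkeys keeps the first occurrence of each number, in draw order.
    unique = list(dict.fromkeys(draws))
    # Rare case: the extras were not enough, so keep drawing until they are.
    while len(unique) < sample_size:
        candidate = rng.randint(1, population_size)
        if candidate not in unique:
            unique.append(candidate)
    return unique[:sample_size]


def select_without_replacement(population_size=1000, sample_size=40, seed=None):
    """Draw distinct return numbers directly; no extras are needed."""
    rng = random.Random(seed)
    return rng.sample(range(1, population_size + 1), sample_size)


returns_a = select_with_extras(seed=1)
returns_b = select_without_replacement(seed=1)
print(returns_a[:5], returns_b[:5])
```

Either list gives the 40 return numbers the auditor would pull; fixing the seed only makes the illustration repeatable.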
Stratified Random Sampling
- This sampling procedure separates the population into mutually exclusive sets (strata), and then draws simple random samples from each stratum.
Stratified Random Sampling
- With this procedure we can acquire information about:
- the whole population
- each stratum
- the relationships among strata.
Stratified Random Sampling
- After the population has been stratified, we can use simple random sampling within each stratum to generate the complete sample. For example, we can draw from each stratum in proportion to its share of the population.
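Keeping each stratum's share of the population in the sample can be sketched as follows. The strata names and sizes are hypothetical, and the function `proportional_stratified_sample` is an illustrative helper, not a standard library routine.

```python
import random


def proportional_stratified_sample(strata, sample_size, seed=None):
    """Draw a simple random sample from each stratum, sized by its population share.

    strata maps a stratum name to the list of its members. Because each
    allocation is rounded, the total drawn can differ slightly from sample_size.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical population of 1,000 tax returns split into three strata.
strata = {
    "low": list(range(1, 501)),       # 50% of the population
    "middle": list(range(501, 801)),  # 30%
    "high": list(range(801, 1001)),   # 20%
}
sample = proportional_stratified_sample(strata, sample_size=40, seed=1)
sizes = {name: len(chosen) for name, chosen in sample.items()}
print(sizes)
```

With these stratum sizes a sample of 40 is allocated as 20, 12, and 8, matching the 50/30/20 split of the population.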
Cluster Sampling
- Cluster sampling is a simple random sample of groups or clusters of elements.
- This procedure is useful when:
- it is difficult and costly to develop a complete list of the population members (making it difficult to develop a simple random sampling procedure),
- the population members are widely dispersed geographically.
- Cluster sampling may increase sampling error, because of probable similarities among cluster members.
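A minimal sketch of cluster sampling, assuming the population is already grouped into named clusters: whole clusters are selected by simple random sampling, and every member of a selected cluster enters the sample. The city-block data and the function name `cluster_sample` are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random and keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical population: 20 city blocks with 5 households each.
clusters = {
    f"block_{b}": [f"block_{b}_house_{h}" for h in range(1, 6)]
    for b in range(1, 21)
}
sample = cluster_sample(clusters, n_clusters=4, seed=1)
households = [h for members in sample.values() for h in members]
print(sorted(sample), len(households))
```

Only a list of blocks is needed to draw the sample, not a list of every household, which is what makes the method attractive when a complete population list is costly to build.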
Sample Size
- Numerical techniques for determining sample sizes
will be described later, but suffice it to say
that the larger the sample size is, the more
accurate we can expect the sample estimates to be.
5.4 Sampling and Non-Sampling Errors
- Two major types of error can arise when a sample of observations is taken from a population:
- Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.
- Nonsampling errors are more serious and are due to mistakes made in the acquisition of data or due to the sample observations being selected improperly.
Sampling Error
- Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.
- Another way to look at this: the differences in results for different samples (of the same size) are due to sampling error.
- E.g. take two samples of size 10 from 1,000 households. If we happened to get the highest income levels in our first sample and the lowest income levels in the second, the difference between the two sample results is due to sampling error.
- Increasing the sample size will reduce this type of error.
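The effect of sample size on sampling error can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the 1,000 household incomes are simulated from a lognormal distribution, and the sample sizes, repetition count, and seed are arbitrary. The code draws many simple random samples and measures how far, on average, the sample mean falls from the population mean.

```python
import random
from statistics import fmean


def mean_abs_sampling_error(population, sample_size, repetitions, rng):
    """Average absolute gap between the sample mean and the population mean."""
    mu = fmean(population)
    errors = []
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)
        errors.append(abs(fmean(sample) - mu))
    return fmean(errors)


rng = random.Random(0)
# Simulated incomes for 1,000 households (hypothetical data).
population = [rng.lognormvariate(11, 0.5) for _ in range(1000)]

err_small = mean_abs_sampling_error(population, 10, 500, rng)
err_large = mean_abs_sampling_error(population, 100, 500, rng)
print(f"n=10:  average error {err_small:,.0f}")
print(f"n=100: average error {err_large:,.0f}")
```

The average error for samples of 100 should come out clearly smaller than for samples of 10, although any single small sample can still land close to the population mean by chance.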
Sampling Errors
[Figure: population income distribution, showing μ (population mean) and the sampling error]
Nonsampling Error
- Nonsampling errors are more serious and are due to mistakes made in the acquisition of data or due to the sample observations being selected improperly.
- Three types of nonsampling errors:
- Errors in data acquisition
- Nonresponse errors
- Selection bias
- Note: increasing the sample size will not reduce this type of error.
Errors in data acquisition
- These errors arise from the recording of incorrect responses, due to:
- incorrect measurements being taken because of faulty equipment,
- mistakes made during transcription from primary sources,
- inaccurate recording of data due to misinterpretation of terms, or
- inaccurate responses to questions concerning sensitive issues.
Data Acquisition Error
[Figure: population and sample, showing sampling error and data acquisition error]
Nonresponse Error
- Nonresponse error refers to error (or bias) introduced when responses are not obtained from some members of the sample, i.e. the sample observations that are collected may not be representative of the target population.
- As mentioned earlier, the Response Rate (i.e. the proportion of all people selected who complete the survey) is a key survey parameter and helps in understanding the validity of the survey and the sources of nonresponse error.
Nonresponse Error
[Figure: population and sample. No response from part of the population may lead to biased results in the sample.]
Selection Bias
- Selection bias occurs when the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample.
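Unlike sampling error, selection bias is not reduced by taking a larger sample. The simulation below is a sketch under invented assumptions: 700 households can be reached by the sampling plan and 300 cannot, and the unreachable group has lower incomes. All numbers and names are hypothetical.

```python
import random
from statistics import fmean

rng = random.Random(0)
# Hypothetical target population: only the "reachable" households can be sampled.
reachable = [rng.gauss(60_000, 10_000) for _ in range(700)]
unreachable = [rng.gauss(30_000, 5_000) for _ in range(300)]
population_mean = fmean(reachable + unreachable)


def average_bias(sample_size, repetitions=200):
    """Average of (sample mean - population mean) when sampling reachable households only."""
    estimates = [
        fmean(rng.sample(reachable, sample_size)) for _ in range(repetitions)
    ]
    return fmean(estimates) - population_mean


bias_small = average_bias(10)
bias_large = average_bias(500)
print(f"n=10:  average bias {bias_small:,.0f}")
print(f"n=500: average bias {bias_large:,.0f}")
```

Both sample sizes overestimate the population mean by a similar amount, because the larger sample is still drawn only from the households the plan can reach.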
Selection Bias
[Figure: population and sample. When parts of the population cannot be selected, the sample cannot represent the whole population.]