Title: Data Collection and Sampling
5.2 Sources of Data
- The reliability and accuracy of the data affect
the validity of the results of a statistical
analysis.
- The reliability and accuracy of the data depend
on the method of collection.
- Three of the most popular sources of statistical
data are
- Published data
- Observational studies
- Experimental studies
- Published Data
- This is often a preferred source of data due to
low cost and convenience.
- Published data is found as printed material,
tapes, disks, and on the Internet.
- Data published by the organization that has
collected it is called PRIMARY DATA.
- For example, data published by the US Bureau of
the Census.
- Data published by an organization different from
the organization that has collected it is called
SECONDARY DATA.
- For example
- The Statistical Abstract of the United States
compiles data from primary sources.
- Compustat sells a variety of financial data
tapes compiled from primary sources.
- Observational and experimental studies
- When published data are unavailable, one needs to
conduct a study to generate the data.
- An observational study is one in which measurements
representing a variable of interest are observed
and recorded, without controlling any factor that
might influence their values.
- An experimental study is one in which measurements
representing a variable of interest are observed
and recorded, while controlling factors that
might influence their values.
- A good questionnaire must be well designed
- Keep the questionnaire as short as possible.
- Ask short, simple, and clearly worded questions.
- Start with demographic questions to help
respondents get started comfortably.
- Use dichotomous and multiple choice questions.
- Use open-ended questions cautiously.
- Avoid using leading questions.
- It is useful to pretest a questionnaire.
- Think about the way you intend to use the
collected data when preparing the questionnaire.
- Surveys solicit information from people.
- Surveys can be made by means of
- personal interview
- telephone interview
- self-administered questionnaire
5.3 Sampling
- Motivation for conducting a sampling procedure
- Costs.
- Population size.
- The possible destructive nature of the sampling
process.
- The sampled population and the target population
should be similar to one another.
5.4 Sampling Plans
- Simple random sampling
- In simple random sampling, all samples of
the same size are equally likely to be chosen.
- To conduct a random sampling
- assign a number to each element of the chosen
population (or use already given numbers),
- randomly select the sample numbers (members). Use
a random numbers table, or a software package.
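The two steps above can be sketched in Python, used here as the "software package". This is a minimal sketch and not part of the original slides: the function name `simple_random_sample`, the population size of 500, and the fixed seed are assumptions chosen for illustration. The standard-library call `random.sample` draws without replacement, so every set of the requested size is equally likely.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct element numbers from 1..population_size."""
    # The seed is only there to make the illustration reproducible (an assumption).
    rng = random.Random(seed)
    # Step 1: the population elements are numbered 1..population_size.
    numbers = range(1, population_size + 1)
    # Step 2: randomly select the sample numbers, without replacement.
    return rng.sample(numbers, sample_size)

# Hypothetical usage: pick 10 members out of a population of 500.
sample = simple_random_sample(500, 10, seed=1)
print(sorted(sample))
```

Because the draw is without replacement, no element number can appear twice in the sample.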
- Example 5.2
- A government income-tax auditor is responsible
for 1,000 tax returns.
- The auditor will randomly select 40 returns to
audit.
- Use Excel's random number generator to select
the returns.
- Solution
- We generate 50 numbers between 1 and 1000 (we
need only 40 numbers, but the extras might be used
if duplicate numbers are generated).
- 50 uniformly distributed random numbers between
0 and 1000 are rounded up to give 50 integer
random numbers between 1 and 1000; each integer
has a probability of 1/1000 of being selected.
383 101 597 900 885 959 15 408 864 139 2
46 . .
- The auditor should select 40 files numbered 383,
101, ...
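The same procedure can be sketched outside Excel. The following Python sketch is an illustration under stated assumptions, not the slides' own solution: the function name `draw_return_numbers`, its parameters, and the seed are hypothetical. It generates 50 uniform numbers, rounds each up to an integer between 1 and 1000, discards duplicates, and keeps the first 40 distinct values.

```python
import math
import random

def draw_return_numbers(n_generate=50, n_needed=40, n_returns=1000, seed=None):
    """Generate n_generate rounded-up random numbers and keep n_needed distinct ones."""
    rng = random.Random(seed)  # seed only for reproducibility (an assumption)
    selected = []
    seen = set()
    for _ in range(n_generate):
        # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1];
        # scaling and rounding up gives an integer in 1..n_returns.
        number = math.ceil((1.0 - rng.random()) * n_returns)
        if number not in seen:  # skip duplicate numbers
            seen.add(number)
            selected.append(number)
    if len(selected) < n_needed:
        raise ValueError("too many duplicates; generate more numbers")
    return selected[:n_needed]

returns_to_audit = draw_return_numbers(seed=5)
print(returns_to_audit)
```

Generating 10 extra numbers is the design choice taken from the example: with only 50 draws out of 1,000 possible values, a few duplicates are plausible, and the extras make up for them.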
- Stratified Random Sampling
- This sampling procedure separates the population
into mutually exclusive sets (strata), and then
draws simple random samples from each stratum.
- With this procedure we can acquire information
about
- the whole population
- each stratum
- the relationships among strata.
- There are several ways to build the stratified
sample. For example, keep the proportion of each
stratum in the population.
- A sample of size 1,000 is to be drawn.
- [Table: the population proportions of each
income category; total sample size 1,000]
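Proportional allocation, as described above, can be sketched as follows. The income categories, their sizes, and the function name `proportional_stratified_sample` are hypothetical values made up for illustration; they are not the figures from the original table.

```python
import random

def proportional_stratified_sample(strata, total_sample_size, seed=None):
    """strata maps a stratum name to the list of its members."""
    rng = random.Random(seed)  # seed only for reproducibility (an assumption)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Keep the stratum's population proportion in the sample.
        n = round(total_sample_size * len(members) / population_size)
        # Draw a simple random sample within the stratum.
        sample[name] = rng.sample(members, n)
    return sample

# Hypothetical population of 10,000 people in three made-up income categories.
strata = {
    "low": list(range(0, 2500)),        # 25% of the population
    "middle": list(range(2500, 8500)),  # 60% of the population
    "high": list(range(8500, 10000)),   # 15% of the population
}
sample = proportional_stratified_sample(strata, 1000, seed=0)
print({name: len(members) for name, members in sample.items()})
# prints {'low': 250, 'middle': 600, 'high': 150}
```

With proportions that do not divide evenly, the rounded stratum sizes may not add up exactly to the requested total, so a real design would need a rule for the leftover units.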
- Cluster sampling
- Cluster sampling is a simple random sample of
groups or clusters of elements.
- This procedure is useful when
- it is difficult and costly to develop a complete
list of the population members
- the population members are widely dispersed
geographically.
- Cluster sampling increases sampling error, because
there are probably similarities among cluster
members.
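A minimal sketch of cluster sampling, assuming that every member of each chosen cluster is included in the sample. The city blocks, the households, and the function name `cluster_sample` are hypothetical. Only a list of clusters is needed, not a list of all population members.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """clusters maps a cluster name to the list of its members."""
    rng = random.Random(seed)  # seed only for reproducibility (an assumption)
    # Simple random sample of clusters, not of individual members.
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Every member of each chosen cluster enters the sample.
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

# Hypothetical population: 20 city blocks with 5 households each.
clusters = {
    f"block_{i:02d}": [f"household_{i:02d}_{j}" for j in range(5)]
    for i in range(20)
}
chosen, households = cluster_sample(clusters, 4, seed=3)
print(chosen)
print(len(households))
```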
5.5 Errors Involved in Sampling
- Two major types of errors can arise when a
sampling procedure is performed.
- Sampling Error
- Sampling error refers to differences between the
sample and the population, because of the
specific observations that happen to be selected.
- Sampling error is expected to occur when making a
statement about the population based on the
sample taken.
[Figure: population income distribution, with μ (the
population mean income) and the sampling error marked]
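Sampling error can be illustrated with a small simulation. This is a sketch under stated assumptions: the population of 10,000 incomes is made up (drawn from an arbitrary lognormal distribution), and the name `sampling_error` is hypothetical. Each simple random sample of 100 incomes yields a different sample mean, and its difference from the population mean is the sampling error of that sample.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed, only for reproducibility (an assumption)

# Hypothetical population of 10,000 incomes (made-up lognormal distribution).
population = [rng.lognormvariate(10.5, 0.5) for _ in range(10_000)]
mu = statistics.fmean(population)  # population mean income

def sampling_error(sample_size):
    """Sample mean minus population mean for one simple random sample."""
    sample = rng.sample(population, sample_size)
    return statistics.fmean(sample) - mu

# Five different samples of the same size give five different errors.
errors = [sampling_error(100) for _ in range(5)]
for error in errors:
    print(round(error, 2))
```

The error exists only because of which observations happened to be selected; when the whole population is taken as the sample, it vanishes.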
- Non-sampling error
- Non-sampling errors occur due to mistakes made
along the process of data acquisition.
- Increasing sample size will not reduce this type
of error.
- There are three types of non-sampling errors
- Errors in data acquisition,
- Non-response errors,
- Selection bias.
- Data Acquisition Error
- If an observation from the population is wrongly
recorded in the sample, then the sample mean is
affected.
[Figure: population and sample, with the sampling
error and the data acquisition error marked]
- Non-Response Error
- No response from part of the population may lead
to biased results in the sample.
[Figure: population and sample]
- Selection Bias
- When parts of the population cannot be selected,
the sample cannot represent the whole population.
[Figure: population and sample]
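The claim that a larger sample does not reduce non-sampling error can be illustrated for selection bias. In this sketch every number is made up: a hypothetical population has 8,000 members who can be selected and 2,000 who cannot, and the two groups differ in the variable of interest. The name `biased_sample_mean` is also hypothetical.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed, only for reproducibility (an assumption)

# Hypothetical population: the unreachable part differs from the reachable part.
reachable = [rng.gauss(50, 10) for _ in range(8_000)]
unreachable = [rng.gauss(80, 10) for _ in range(2_000)]
mu = statistics.fmean(reachable + unreachable)  # mean of the whole population

def biased_sample_mean(n):
    """Mean of a simple random sample drawn only from the reachable part."""
    return statistics.fmean(rng.sample(reachable, n))

# The error stays roughly the same however large the sample becomes.
for n in (100, 1_000, 8_000):
    print(n, round(biased_sample_mean(n) - mu, 2))
```

Even when all 8,000 reachable members are sampled, the sample mean stays well below the population mean, because the excluded part of the population never enters the sample.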