Collection and analysis of data

Description:

Collection and analysis of data Vladimir Ryabov, PhD Principal Lecturer in Information Technology Kemi-Tornio University of Applied Sciences Contents Sampling methods ... – PowerPoint PPT presentation

1
Collection and analysis of data

Vladimir Ryabov, PhD
Principal Lecturer in Information Technology
Kemi-Tornio University of Applied Sciences

2
Contents

Sampling methods
Hypothesis testing
Correlation and regression analysis
Analyzing qualitative data

3
Collecting data during surveys

A population is the set of all items or
individuals of interest.
A sample is a subset of the population.
Targeting the correct population is very important.
Sampling is the process of selecting the right
individuals, objects, or events for study.
What are the reasons for using samples?

4
Inferential statistics

Inferential statistics aims to make
statements about a population by examining sample
results.
Sample statistics → (inference) → Population parameters

5
Sampling methods
Sampling
Probabilistic sampling: simple random, systematic, stratified, cluster
Non-probabilistic sampling: convenience, judgement
6
Non-probabilistic sampling

Convenience sampling
Collecting information from the members of the
population who are most conveniently available to
provide it.
Purposive sampling (judgement and quota sampling)
Collecting information from a specific group of
the population, either because they are the only
ones who have it (judgement sampling) or because
they conform to some criteria set by the
researcher (quota sampling).
Non-probabilistic sampling also means that the
findings of a study cannot be confidently
generalized to the population.
Nevertheless, generalizability is not always a
priority.

7
Snowball sampling

Also called referral sampling.
The initial respondents are typically selected
using probability methods.
Then these respondents are used to identify other
respondents in the target population.
The process continues until the desired sample
size is reached.
This sampling type is used in the cases of rare
populations or when a list of respondents does
not exist.

8
Simple random sampling

Every individual or item from the population has
an equal chance of being selected.
E.g., each face of a fair die is equally likely
to come up.
Selection may be with replacement or without
replacement.
Samples can be obtained from a table of random
numbers or computer random number generators.
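
As a minimal sketch of drawing a simple random sample with a computer random number generator, the following uses Python's standard `random` module. The frame of 64 numbered individuals, the sample size of 8 and the fixed seed are illustrative assumptions, not values from the slide.

```python
import random

# Hypothetical sampling frame: 64 individuals labelled 1..64 (an assumption).
population = list(range(1, 65))

# Fixed seed (an assumption) so that the draw can be repeated.
rng = random.Random(42)

# Without replacement: each individual can appear at most once.
sample_without = rng.sample(population, k=8)

# With replacement: the same individual may be drawn more than once.
sample_with = rng.choices(population, k=8)

print("without replacement:", sample_without)
print("with replacement:   ", sample_with)
```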

9
Stratified sampling

Population divided into subgroups (called strata)
according to some common characteristic.
Simple random sample selected from each subgroup.
Samples from subgroups are combined into one.
Stratified sampling can be proportionate or
disproportionate.
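
The proportionate case can be sketched as follows: each stratum contributes to the sample in proportion to its share of the population. The function name, the strata and their sizes are hypothetical and chosen only for illustration.

```python
import random

def proportionate_stratified_sample(strata, n, seed=0):
    """Draw a simple random sample from each stratum, sized by the stratum's share."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        # Proportionate allocation; rounding may shift the total by a unit or two.
        k = round(n * len(members) / total)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of 100 people in three strata (60 / 30 / 10).
strata = {
    "undergraduate": [f"U{i}" for i in range(60)],
    "postgraduate": [f"P{i}" for i in range(30)],
    "staff": [f"S{i}" for i in range(10)],
}

sample = proportionate_stratified_sample(strata, n=20)
print(len(sample), sample[:5])
```

A disproportionate design would replace the allocation line with stratum-specific sample sizes chosen by the researcher.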

10
Systematic sampling

Decide on sample size n.
Divide the frame of N individuals into groups of k
individuals: k = N/n.
Randomly select one individual from the 1st group.
Select every kth individual thereafter.

Example: N = 64, n = 8, k = 8 (the first group is the first 8 individuals).
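
The procedure above can be sketched in a few lines, using the slide's values N = 64 and n = 8. The function name and the seed are illustrative assumptions, and the sketch assumes that N is a multiple of n.

```python
import random

def systematic_sample(frame, n, seed=0):
    """Pick a random start in the first group, then every k-th individual."""
    k = len(frame) // n          # sampling interval k = N / n
    rng = random.Random(seed)
    start = rng.randrange(k)     # random position within the first group of k
    return frame[start::k][:n]

frame = list(range(1, 65))       # N = 64 individuals labelled 1..64
sample = systematic_sample(frame, n=8)   # k = 8
print(sample)
```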
11
Cluster sampling

Population is divided into several clusters,
each representative of the population.
A simple random sample of clusters is selected.
All items in the selected clusters can be used,
or items can be chosen from a cluster using
another probability sampling technique.

[Figure: population divided into 16 clusters; randomly selected clusters form the sample]
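
A small sketch of both variants described above: using all items in the selected clusters, or sub-sampling within them. The 16 clusters match the slide; the cluster size of 10, the number of selected clusters and the sub-sample size are assumptions.

```python
import random

rng = random.Random(1)  # fixed seed (an assumption) for repeatability

# Hypothetical population: 16 clusters of 10 items each.
clusters = {c: [f"cluster{c}-item{i}" for i in range(10)] for c in range(16)}

# Simple random sample of clusters.
chosen = rng.sample(sorted(clusters), k=4)

# Variant 1: use all items in the selected clusters.
one_stage = [item for c in chosen for item in clusters[c]]

# Variant 2: draw a simple random sample inside each selected cluster.
two_stage = [item for c in chosen for item in rng.sample(clusters[c], k=3)]

print(chosen, len(one_stage), len(two_stage))
```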
12
Sample size

Determining a sample size is complex because
many factors should be taken into account:
the variability of elements in the population
the type of sample required
time available
budget
required estimation precision
whether the findings should be generalized and,
if so, with what degree of confidence
Formulas based on statistical theory are used, as
well as other ad hoc methods.

13
Sampling from a large population

When statistical formulas are used, three
decisions should be made:
The specified level of precision. Precision
refers to how close our estimate is to the true
population characteristic. Usually, we would
estimate the population parameter to fall within
a range, based on the sample estimate.
The degree of confidence. It determines how
certain we are that our estimates will really
hold true for the population. It is often taken
as 95%.
The amount of variability. It reflects the
homogeneity of the population. The variability is
measured by the standard deviation.
The degree of confidence vs. the level of
precision: for a fixed sample size, more precision
(a narrower range) means less confidence, and more
confidence means less precision (a wider range).

14
Sample size using statistical formula

If we have information on these factors, then the
sample size n can be calculated as follows:
$n = \left( \frac{DC \times TV}{DP} \right)^2$
where
DC (degree of confidence) = the number of
standard errors for the degree of confidence
specified for the research results.
TV (true variability) = the standard deviation of
the population.
DP (desired precision) = the acceptable
difference between the sample estimate and the
population value.
This formula does not include the population
size, which typically does not have an impact on
the sample size for large populations.

15
Example

Consider the case where we wish to estimate the
average monthly expenditure on eating out.
Although the true standard deviation
(variability) is unknown, a pilot study of 30
customers provides an estimate of the unknown
standard deviation of 14. We want to be 95%
confident that our estimate of the mean monthly
expenditure on eating out is within ±2 of the
true population mean (in the same monetary
units). Assuming that expenditures follow a
normal distribution, DC = 1.96 for 95% confidence
and the sample size is determined as follows:
$n = \left( \frac{1.96 \times 14}{2} \right)^2 = 13.72^2 \approx 188.2$, rounded up to 189.
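
The calculation for this example can be sketched as below, assuming the sample size formula n = (DC × TV / DP)² with DC taken as the two-sided standard normal value for the chosen confidence level. The function and parameter names are illustrative.

```python
import math
from statistics import NormalDist

def sample_size(confidence, std_dev, precision):
    """Sample size n = (DC * TV / DP)^2, rounded up to a whole respondent."""
    # DC: number of standard errors for a two-sided confidence level.
    dc = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((dc * std_dev / precision) ** 2)

# Values from the eating-out example: 95% confidence, s = 14, precision = 2.
n = sample_size(confidence=0.95, std_dev=14, precision=2)
print(n)
```

Halving the desired precision to 1 roughly quadruples the required sample, which shows how costly extra precision is.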

16
Sampling from a small population

The use of the formula for large populations may
lead to an unnecessarily large sample size.
If the sample size is larger than 5% of the
population, then the calculated sample size should
be corrected.
Correction factor, where N is the population
size and n is the sample size calculated by the
original formula
When dealing with small populations, the sample
size can be 10% to 20% of the total number of
individuals. Also, a minimum of 30 is
recommended, and larger if possible. Other
factors, like characteristics of the sample
respondents, should be taken into account.

17
Example

Suppose a bank has 5000 ATMs installed in the UK
and wishes to establish its users' views
of this service. A researcher estimates that the
required sample size, given the agreed criteria,
is 750. This sample size is 15% of the population
(750/5000) and is larger than necessary for an
efficient sample. In this case the sample size
correction factor needs to be applied.
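
The slide's own correction formula did not survive in this transcript. As an assumption, the sketch below uses one widely used finite population correction factor, sqrt((N − n) / (N − 1)), applied to the calculated sample size; other correction formulas exist and give somewhat different results.

```python
import math

def corrected_sample_size(n, N):
    """Shrink n with the finite population correction sqrt((N - n) / (N - 1)).

    Assumption: this particular correction factor is used for illustration;
    it is applied only when n exceeds 5% of the population.
    """
    if n / N <= 0.05:
        return n  # the sample is at most 5% of the population: no correction
    return math.ceil(n * math.sqrt((N - n) / (N - 1)))

# Values from the ATM example: N = 5000, calculated n = 750.
adjusted = corrected_sample_size(n=750, N=5000)
print(adjusted)
```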

18
Rules of thumb for sampling (Roscoe 1975)

Sample sizes larger than 30 and less than 500 are
appropriate for most research.
Where samples are to be broken into sub-samples
(males/females, etc.), a minimum sample size of
30 for each category is necessary.
In multivariate research (including multiple
regression analyses), the sample size should be
several times (preferably 10 times or more) as
large as the number of variables in the study.
For simple experimental research with tight
experimental controls, successful research is
possible with samples as small as 10 to 20 in
size.

19
Sampling for qualitative studies

Only small samples of individuals, groups, or
events are typically chosen due to the in-depth
nature of research, huge costs and energy
expenditure.
This also means that the generalizability of the
findings is very restricted.
External validity will be low.
It is possible to use any of the sampling
techniques.
If the purpose of the study is merely to explore
and try to understand the phenomena,
convenience sampling is often used.

20
Managerial relevance

Awareness of sampling methods
helps managers to understand why a particular
method of sampling is used
facilitates understanding of the cost
implications of different designs and the
tradeoffs between precision and confidence vs.
the costs
enables managers to understand the risk they take
in implementing changes based on the results of
the research study
helps to assess the generalizability of the
findings and analyze the implications of trying
out recommendations made in their own system.

21
Further references

Chapter 1 from Groebner, D., et al., 2005,
Business Statistics: A Decision-Making Approach,
6th Ed., Prentice Hall.
Chapter 7 from Hair, J., et al., 2007, Research
Methods for Business, John Wiley & Sons.
Chapter 11 from Sekaran, U., 2003, Research
Methods for Business: A Skill Building Approach,
4th Ed., John Wiley & Sons.
Chapter 8 from Neuman, W. L., 2003, Social
Research Methods: Qualitative and Quantitative
Approaches, 5th Ed., Pearson Education.

23
What is a hypothesis?

A hypothesis is a logically conjectured
relationship between two or more variables
expressed in the form of a testable statement.
A hypothesis is a claim (assumption) about a
population parameter:
population mean
Example: The mean monthly cell phone bill of
this city is µ = 42.
population proportion
Example: The proportion of adults in this city
with cell phones is p = 0.68.
24
The null hypothesis H0

States the assumption (numerical) to be tested
Example: The average number of TV sets in U.S.
homes is at least three (H0: µ ≥ 3).
The null hypothesis is always about a population
parameter, not about a sample statistic.
We always begin with the assumption that the null
hypothesis is true. It may or may not be rejected.

25
The alternative hypothesis HA

It is the opposite of the null hypothesis.
Example: The average number of TV sets in U.S.
homes is less than 3 (HA: µ < 3).
May or may not be accepted.
The alternative hypothesis is generally the
hypothesis that is believed (or needs to be
supported) by the researcher.

26
Hypothesis testing process
Claim: the population mean age is 50
(null hypothesis H0: µ = 50).
Now select a random sample from the population.
Suppose the sample mean age is 20 (x̄ = 20).
Is x̄ = 20 likely if µ = 50?
If not likely, REJECT the null hypothesis.
27
Reason for rejecting H0
[Figure: sampling distribution of x̄, centered at µ = 50 if H0 is true, with the sample mean x̄ = 20 marked]
If it is unlikely that we would get a sample mean
of this value ...
... if in fact this were the population mean ...
... then we reject the null hypothesis that µ = 50.
28
Types of statistical errors: Type I

Type I error is rejecting the null hypothesis
when it is, in fact, true.
This error is considered a serious type of
error.
The probability of Type I error is α, which is
called the level of significance of the test and
is set by the researcher in advance.

29
Critical value

The objective of a hypothesis test is to use
sample information to decide whether to reject
the null hypothesis about a population parameter.
Example,
When should we reject the null hypothesis?
We need to select a cut-off point that is the
demarcation between rejecting and not rejecting
the null hypothesis.

30
Level of significance α

The level of significance defines unlikely values
of the sample statistic if the null hypothesis is true.
Defines the rejection region of the sampling
distribution.
The level of significance is denoted by α.
Typical values are 0.01, 0.05, or 0.10.
The value of the level of significance is
selected by the researcher at the beginning.
Provides the critical value(s) of the test.
So, α is the probability of making a Type I error.

31
Level of significance and the rejection region
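The level of significance α determines the critical value; the rejection region (shaded in the original figure) has area α.
Lower tail test: H0: µ ≥ 3, HA: µ < 3; rejection region of area α in the lower tail.
Upper tail test: H0: µ ≤ 3, HA: µ > 3; rejection region of area α in the upper tail.
Two tailed test: H0: µ = 3, HA: µ ≠ 3; rejection regions of area α/2 in each tail.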
32
Types of statistical errors: Type II

Type II error is failing to reject the null
hypothesis when it is, in fact, false.
The probability of Type II error is β.
The probabilities α and β are inversely related.
That is, if we reduce α, then β will increase.
Thus, when setting α we must consider both sides
of the issue.

33
Outcomes and probabilities
Possible hypothesis test outcomes
Key: Outcome (Probability)
Decision | State of nature: H0 True | State of nature: H0 False
Do not reject H0 | No error (1 − α) | Type II error (β)
Reject H0 | Type I error (α) | No error (1 − β)
34
Type I and II error relationship

Type I and Type II errors cannot happen at the
same time:
Type I error can only occur if H0 is true
Type II error can only occur if H0 is false
If Type I error probability (α) increases, then
Type II error probability (β) decreases.

35
Factors affecting Type II error

β increases when the difference between the
hypothesized parameter and its true value decreases
β increases when α decreases
β increases when σ increases
β increases when n decreases
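
The direction of each effect on β can be checked numerically. The sketch below computes β for a lower tail z test of H0: µ ≥ µ0 with known σ; the function name and the baseline values (µ0 = 3, true mean 2.84, σ = 0.8, n = 100, α = 0.05) are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu_true, sigma, n, alpha):
    """beta for a lower tail z test of H0: mu >= mu0 when the true mean is mu_true."""
    z = NormalDist()
    se = sigma / sqrt(n)                    # standard error of the sample mean
    cutoff = mu0 + z.inv_cdf(alpha) * se    # reject H0 when x-bar falls below this
    # beta = P(x-bar >= cutoff) when the true mean is mu_true
    return 1 - z.cdf((cutoff - mu_true) / se)

base = type_ii_error(mu0=3, mu_true=2.84, sigma=0.8, n=100, alpha=0.05)
print(f"baseline beta       : {base:.3f}")
print(f"smaller difference  : {type_ii_error(3, 2.90, 0.8, 100, 0.05):.3f}")
print(f"smaller alpha       : {type_ii_error(3, 2.84, 0.8, 100, 0.01):.3f}")
print(f"larger sigma        : {type_ii_error(3, 2.84, 1.2, 100, 0.05):.3f}")
print(f"larger n            : {type_ii_error(3, 2.84, 0.8, 400, 0.05):.3f}")
```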

36
Steps in hypothesis testing

1. Specify the population value of interest
2. Formulate the appropriate null and alternative
hypotheses
3. Specify the desired level of significance
4. Determine the rejection region
5. Obtain sample evidence and compute the test
statistic
6. Reach a decision and interpret the result

37
Example: hypothesis testing
Test the claim that the true mean number of TV
sets in US homes is at least 3.
(Assume σ = 0.8)
1. Specify the population value of interest
The mean number of TVs in US homes.
2. Formulate the appropriate null and alternative
hypotheses
H0: µ ≥ 3, HA: µ < 3 (This is a lower tail test)
3. Specify the desired level of significance
Suppose that α = 0.05 is chosen for this test.
38
Example: hypothesis testing
(continued)

4. Determine the rejection region

This is a one-tailed test with α = 0.05. Since
σ is known, the cutoff value is a z value:
−zα = −1.645.
Reject H0 if z < −zα = −1.645; otherwise do
not reject H0.
[Figure: standard normal curve with the rejection region of area α = 0.05 below −1.645]
39
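Example: hypothesis testing

5. Obtain sample evidence and compute the test
statistic
Suppose a sample is taken with the following
results: n = 100, x̄ = 2.84 (σ = 0.8 is
assumed known).
Then the test statistic is
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{2.84 - 3}{0.8 / \sqrt{100}} = \frac{-0.16}{0.08} = -2.0$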

40
Example: hypothesis testing
(continued)

6. Reach a decision and interpret the result

[Figure: standard normal curve; z = −2.0 falls in the rejection region below −1.645 (α = 0.05)]
Since z = −2.0 < −1.645, we reject the null
hypothesis that the mean number of TVs in US
homes is at least 3.
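
The six steps of this example can be reproduced with a short script. It is a sketch using only the values given in the example (µ0 = 3, σ = 0.8, α = 0.05, n = 100, x̄ = 2.84); the variable names are illustrative, and the p-value line is an addition not shown on the slides.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, alpha = 3, 0.8, 0.05   # hypothesized mean, known sigma, significance level
n, x_bar = 100, 2.84               # sample size and sample mean

# Step 5: test statistic for a mean with known sigma.
z = (x_bar - mu0) / (sigma / sqrt(n))

# Step 4: lower tail critical value, about -1.645 for alpha = 0.05.
z_crit = NormalDist().inv_cdf(alpha)

# Probability of a z value at least this low if H0 is true.
p_value = NormalDist().cdf(z)

# Step 6: decision.
reject_h0 = z < z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```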
41
Statistical techniques for hypothesis testing

The choice of a technique depends on (1) the
number of variables and (2) the scale of
measurement.
Scales: nominal data, ordinal data,
interval/ratio data.

Type of scale | Measure of central tendency | Measure of dispersion | Statistic
Nominal | Mode | None | Chi square
Ordinal | Median | Percentiles or quartiles | Chi square
Interval/Ratio | Mean | Standard deviation | T-test, ANOVA
42
Chi square (χ²) statistic

It is used to test whether the frequencies of two
nominally scaled variables are related.
When observed frequencies are compared with a
theoretical distribution, it is called the Chi
square goodness-of-fit test.
The Chi square statistic compares the observed
frequencies of the responses with the expected
frequencies, which are theoretical ones derived
from the null hypothesis of no relationship
between two variables.
Examples of questions appropriate for the test
Are technical support calls equal across all days
of the week? (i.e., do calls follow a uniform
distribution?)
Do measurements from a production process follow
a normal distribution?

43
Example: Chi-square goodness-of-fit test

Are technical support calls equal across all days
of the week? (i.e., do calls follow a uniform
distribution?)
Sample data for 10 days per day of week:
Day | Sum of calls for this day
Monday | 290
Tuesday | 250
Wednesday | 238
Thursday | 257
Friday | 265
Saturday | 230
Sunday | 192
Total (Σ) | 1722
44
Example: Chi-square goodness-of-fit test

If calls are uniformly distributed, the 1722
calls would be expected to be equally divided
across the 7 days:
$e_i = 1722 / 7 = 246$ expected calls per day.
Chi-square goodness-of-fit test: test to see if
the sample results are consistent with the
expected results.

45
Example: Observed vs. expected frequencies
Day | Observed oi | Expected ei
Monday | 290 | 246
Tuesday | 250 | 246
Wednesday | 238 | 246
Thursday | 257 | 246
Friday | 265 | 246
Saturday | 230 | 246
Sunday | 192 | 246
TOTAL | 1722 | 1722
46
Example: Chi-square test statistic
H0: The distribution of calls is uniform over days of the week
HA: The distribution of calls is not uniform

The test statistic is
$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
where
k = number of categories
oi = observed cell frequency for category i
ei = expected cell frequency for category i
47
Example: The rejection region
H0: The distribution of calls is uniform over days of the week
HA: The distribution of calls is not uniform

Reject H0 if $\chi^2 > \chi^2_{\alpha}$
(with k − 1 degrees of freedom)
[Figure: chi-square distribution with the rejection region of area α to the right of the critical value χ²α]
48
Example: Chi-square test statistic
H0: The distribution of calls is uniform over days of the week
HA: The distribution of calls is not uniform
k − 1 = 6 (7 days of the week), so use 6
degrees of freedom.
With α = 0.05, the critical value is χ²0.05 = 12.5916.
Conclusion: χ² = 23.05 > χ²α = 12.5916,
so reject H0 and conclude that the distribution
is not uniform.
[Figure: chi-square distribution with 6 degrees of freedom; the rejection region lies to the right of χ²0.05 = 12.5916]
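
The test statistic for the support call data can be recomputed directly from the observed counts. This is a sketch: the critical value 12.5916 is taken from the example rather than computed, and the variable names are illustrative.

```python
# Observed calls per day of the week, from the example.
observed = {
    "Monday": 290, "Tuesday": 250, "Wednesday": 238, "Thursday": 257,
    "Friday": 265, "Saturday": 230, "Sunday": 192,
}

total = sum(observed.values())
expected = total / len(observed)   # uniform distribution: same count every day

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - expected) ** 2 / expected for o in observed.values())

# Critical value for alpha = 0.05 and k - 1 = 6 degrees of freedom (from the example).
critical = 12.5916
reject_h0 = chi_sq > critical

print(f"expected per day = {expected:.0f}, chi-square = {chi_sq:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```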
49
Analysis of variance (ANOVA)

ANOVA is used to assess the statistical
differences between the means of two or more
groups.
Example: During a readership survey we found that
individuals 39 and younger read newspapers an
average of 2.5 times per week, individuals 40 to 49
read newspapers an average of 3.1 times a week,
and individuals 50 and older read newspapers an
average of 4.7 times a week. The manager wants to
know whether these observed differences are
statistically significant.
The null hypothesis is that the group means are equal.
There can be one-way ANOVA (only one independent
variable) and N-way ANOVA (many independent
variables).
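
The survey example reports only group means, so a test cannot be run from it directly. The sketch below shows how a one-way ANOVA F statistic is computed from raw observations, using small hypothetical samples (four readers per age group) invented for illustration. The critical value of about 4.26 is the standard F table value for α = 0.05 with 2 and 9 degrees of freedom.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical weekly newspaper reading counts (not from the survey).
age_39_and_younger = [2, 3, 2, 3]
age_40_to_49 = [3, 3, 4, 3]
age_50_and_older = [5, 4, 5, 5]

f_stat = one_way_anova_f([age_39_and_younger, age_40_to_49, age_50_and_older])
f_critical = 4.26   # F table value for alpha = 0.05, df = (2, 9)

print(f"F = {f_stat:.2f}")
print("Reject H0: means differ" if f_stat > f_critical else "Do not reject H0")
```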

50
Further references

Chapter 8 from Groebner, D., et al., 2005,
Business Statistics: A Decision-Making Approach,
6th Ed., Prentice Hall.
Chapter 13 from Hair, J., et al., 2007, Research
Methods for Business, John Wiley & Sons.
Chapter 5 from Sekaran, U., 2003, Research
Methods for Business: A Skill Building Approach,
4th Ed., John Wiley & Sons.

52
Scatter plots and correlation

A scatter plot (or scatter diagram) is used to
show the relationship between two variables.
Correlation analysis is used to measure strength
of the association (linear relationship) between
two variables.
Only concerned with strength of the relationship
No causal effect is implied

53
Scatter plot examples
[Figure: scatter plots of y against x showing linear relationships and curvilinear relationships]
54
Scatter plot examples
[Figure: scatter plots of y against x showing strong relationships and weak relationships]
55
Scatter plot examples
[Figure: scatter plots of y against x showing no relationship]
56
Correlation coefficient

The population correlation coefficient ρ (rho)
measures the strength of the association between
the variables.
The sample correlation coefficient r is an
estimate of ρ and is used to measure the
strength of the linear relationship in the sample
observations.
Features of ρ and r:
Unit free
Range between -1 and 1
The closer to -1, the stronger the negative
linear relationship
The closer to 1, the stronger the positive linear
relationship
The closer to 0, the weaker the linear
relationship

57
Examples of approximate r values
[Figure: five scatter plots of y against x with r = −1, r = −0.6, r = 0, r = +0.3 and r = +1]
58
Calculating the correlation coefficient
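Sample correlation coefficient
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$
or the algebraic equivalent
$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - (\sum x)^2\right]\left[n \sum y^2 - (\sum y)^2\right]}}$
where
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable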
59
Calculation example
Tree Height y | Trunk Diameter x | xy | y² | x²
35 | 8 | 280 | 1225 | 64
49 | 9 | 441 | 2401 | 81
27 | 7 | 189 | 729 | 49
33 | 6 | 198 | 1089 | 36
60 | 13 | 780 | 3600 | 169
21 | 7 | 147 | 441 | 49
45 | 11 | 495 | 2025 | 121
51 | 12 | 612 | 2601 | 144
Σ = 321 | Σ = 73 | Σ = 3142 | Σ = 14111 | Σ = 713
60
Calculation example
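$r = \frac{8(3142) - (73)(321)}{\sqrt{[8(713) - 73^2][8(14111) - 321^2]}} = 0.886$
r = 0.886 → relatively strong positive linear
association between x and y
[Figure: scatter plot of tree height (y) against trunk diameter (x)]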
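
The value r = 0.886 can be checked from the tree data with a few lines of code. The sketch uses the sums-based form of the correlation formula, r = (nΣxy − ΣxΣy) / sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²]); the function name is illustrative.

```python
from math import sqrt

height = [35, 49, 27, 33, 60, 21, 45, 51]    # y: tree height
diameter = [8, 9, 7, 6, 13, 7, 11, 12]       # x: trunk diameter

def pearson_r(x, y):
    """Sample correlation coefficient computed from the column sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(diameter, height)
print(f"r = {r:.3f}")
```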
61
Introduction to regression analysis

Regression analysis is used to:
Predict the value of a dependent variable based
on the value of at least one independent
variable.
Explain the impact of changes in an independent
variable on the dependent variable.
Dependent variable: the variable we wish to
explain.
Independent variable: the variable used to
explain the dependent variable.
Simple linear regression model:
Only one independent variable, x.
Relationship between x and y is described by
a linear function.
Changes in y are assumed to be caused by
changes in x.

62
Types of regression models
[Figure: four scatter plots showing a positive linear relationship, a negative linear relationship, a relationship that is NOT linear, and no relationship]
63
Population linear regression
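The population regression model:
$y = \beta_0 + \beta_1 x + \varepsilon$
where
y = dependent variable
x = independent variable
β0 = population y intercept
β1 = population slope coefficient
ε = random error term, or residual
β0 + β1x is the linear component and ε is the random error component.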
64
Linear regression assumptions

Error values (ε) are statistically independent.
Error values are normally distributed for any
given value of x.
The probability distribution of the errors is
normal.
The probability distribution of the errors has
constant variance.
The underlying relationship between the x
variable and the y variable is linear.

65
Population linear regression
[Figure: population regression line with intercept β0 and slope β1; for a given xi, the observed value of y differs from the predicted value of y by the random error εi]
66
Estimated regression model
The sample regression line provides an estimate
of the population regression line:
$\hat{y}_i = b_0 + b_1 x$
where
ŷi = estimated (or predicted) y value
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
x = independent variable
The individual random error terms ei have a
mean of zero.
67
The least squares equation

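The formulas for b1 and b0 are
$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
algebraic equivalent
$b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$
and
$b_0 = \bar{y} - b_1 \bar{x}$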
68
Interpretation of the slope and the intercept

b0 is the estimated average value of y when the
value of x is zero.
b1 is the estimated change in the average value
of y as a result of a one-unit change in x.
The coefficients b0 and b1 will usually be
found using computer software, such as SPSS,
Minitab, and Excel.
Other regression measures will also be computed
as part of computer-based regression analysis.

69
Simple linear regression example

A real estate agent wishes to examine the
relationship between the selling price of a home
and its size (measured in square feet).
A random sample of 10 houses is selected.
Dependent variable (y): house price in 1000s
Independent variable (x): square feet

70
Sample data for house price model
House Price in 1000s (y) Square Feet (x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
71
Graphical presentation

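House price model: scatter plot and regression
line

[Figure: scatter plot of house price (in 1000s) against square feet with the fitted regression line]
Slope = 0.10977
Intercept = 98.248
Estimated regression line: house price = 98.248 + 0.10977 (square feet)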
72
Interpretation of the intercept, b0

b0 is the estimated average value of y when the
value of x is zero (if x = 0 is in the range of
observed x values).
Here, no houses had 0 square feet, so b0 =
98.24833 just indicates that, for houses within
the range of sizes observed, 98,248.33 is the
portion of the house price not explained by
square feet.

73
Interpretation of the slope coefficient, b1

b1 measures the estimated change in the average
value of y as a result of a one-unit change in x.
Here, b1 = 0.10977 tells us that the
value of a house increases by 0.10977 × 1000 =
109.77, on average, for each additional
square foot of size.
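
The slope and intercept of the house price model can be reproduced from the sample data with the least squares formulas b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1x̄. This is a sketch; the function name and the 2000 square foot prediction are illustrative additions.

```python
from statistics import mean

price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]           # y, in 1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = least_squares(sqft, price)
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.5f}")

# Prediction for a house inside the observed size range (1100 to 2450 sq ft).
predicted = b0 + b1 * 2000
print(f"predicted price for 2000 sq ft: {predicted:.1f} (in 1000s)")
```

Predictions are only meaningful for sizes within the observed range; extrapolating to much smaller or larger houses is not supported by this sample.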

74
Further references

Chapter 13 from Groebner, D., et al., 2005,
Business Statistics: A Decision-Making Approach,
6th Ed., Prentice Hall.
Chapter 14 from Hair, J., et al., 2007, Research
Methods for Business, John Wiley & Sons.

76
Approaches to qualitative research

Phenomenology studies human experiences and
consciousness. It is the study of phenomena, or
how things appear in our experiences, the meaning
things have in our experience. Interviews and
observations are used.
Hermeneutics (a specialized field of
phenomenology) attempts to understand and explain
human behaviour based on an analysis of stories
people tell about themselves.
Ethnography is a description of human
socio-cultural phenomena, based on field
observations and interviews. Ethnography
typically focuses on a community of individuals.
Snowball sampling is often used here.
Case studies.
Grounded theory. The goal is to construct
theories in order to understand phenomena.

77
Managing qualitative data

Field generated data and found data.
Field data need to be transcribed into a textual
format, which might be expensive.
Found data may be numerous making their
management and analysis difficult.
A plan for managing the data must be in place
from the beginning.

78
Coding

Coding is the process of assigning meaningful
numerical values that facilitate understanding of
your data.
The purpose is to simplify and focus on
meaningful characteristics of the data.
Coding units: words, phrases, themes, items,
images, graphics, photographs, etc.
We might be interested in the presence of some
item in the data as well as its frequency of
occurrence.
Data reduction is often used to select,
transform, simplify the data to make it more
manageable and understandable.
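
A very simple form of coding, recording the presence and frequency of themes, can be sketched as follows. The codebook, keywords and responses are entirely hypothetical, and keyword matching is only a crude stand-in for the judgement a human coder applies.

```python
from collections import Counter

# Hypothetical codebook: numerical code -> theme.
codebook = {1: "price", 2: "service", 3: "location"}

# Hypothetical keywords that signal each theme.
keywords = {
    1: ["expensive", "cheap", "price"],
    2: ["staff", "service", "friendly"],
    3: ["close", "far", "location"],
}

# Hypothetical open-ended survey responses.
responses = [
    "The staff were friendly but it was expensive.",
    "Great location, close to my office.",
    "Service was slow and the price too high.",
]

def code_response(text):
    """Return the set of codes whose keywords appear in the response."""
    lowered = text.lower()
    return {code for code, words in keywords.items()
            if any(word in lowered for word in words)}

coded = [code_response(r) for r in responses]

# Frequency of occurrence: number of responses in which each code is present.
frequency = Counter(code for codes in coded for code in codes)

for code, count in sorted(frequency.items()):
    print(f"{code} ({codebook[code]}): {count}")
```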

79
Software for qualitative analysis

Software intended for analysis of qualitative
data is often known as CAQDAS (computer assisted
qualitative data analysis software).
Atlas.ti facilitates qualitative analysis of
large amounts of unstructured textual, graphical,
audio and video data. The tools include examining
unstructured data, managing, extracting,
comparing, exploring and reassembling meaningful
segments. www.atlasti.com
QSR NVivo is typically used to analyse text from
focus group or interview transcripts, literary
documents, and nontextual data such as photos, tape
recordings, films, multimedia, etc. Numerical
output can be exported to software programs such
as SPSS and Excel.

80
Reliability and validity

Reliability is the degree of consistency in
assignment of similar words, phrases or other
kinds of data to the same pattern or themes by
different researchers.
Validity is the extent to which qualitative
findings accurately represent the phenomena being
examined.
Data triangulation: collecting data from several
different sources at different times and
comparing it.
Method triangulation: conducting similar
research using several different methods and
comparing the findings.
Theory triangulation: using multiple theories
and perspectives to interpret and explain data.

81
References

Chapter 13 from Groebner, D., et al., 2005,
Business Statistics: A Decision-Making Approach,
6th Ed., Prentice Hall.
Chapters 11 and 14 from Hair, J., et al., 2007,
Research Methods for Business, John Wiley & Sons.
