Title: Collection and analysis of data
1. Collection and analysis of data
- Vladimir Ryabov, PhD
- Principal Lecturer in Information Technology
- Kemi-Tornio University of Applied Sciences
2. Contents
- Sampling methods
- Hypothesis testing
- Correlation and regression analysis
- Analyzing qualitative data
3. Collecting data during surveys
- A population is the set of all items or individuals of interest.
- A sample is a subset of the population.
- Targeting the correct population is very important.
- Sampling is the process of selecting the right individuals, objects, or events for study.
- What are the reasons for using samples?
4. Inferential statistics
- Inferential statistics aims at making statements about a population by examining sample results.
- An inference is drawn from sample statistics to population parameters.
5. Sampling methods
- Probabilistic sampling: simple random, systematic, cluster, stratified.
- Non-probabilistic sampling: judgement, convenience.
6. Non-probabilistic sampling
- Convenience sampling: collecting information from the members of the population who are most conveniently available to provide it.
- Purposive sampling (judgement and quota sampling): collecting information from a specific group of the population either because they are the only ones who have it (judgement sampling) or because they conform to some criteria set by the researcher (quota sampling).
- Non-probabilistic sampling also means that the findings of a study cannot be confidently generalized to the population.
- Nevertheless, generalizability is not always a priority.
7. Snowball sampling
- Also called referral sampling.
- The initial respondents are typically selected using probability methods.
- These respondents are then used to identify other respondents in the target population.
- The process continues until the desired sample size is reached.
- This sampling type is used for rare populations or when a list of respondents does not exist.
8. Simple random sampling
- Every individual or item from the population has an equal chance of being selected (e.g., rolling dice).
- Selection may be with replacement or without replacement.
- Samples can be obtained from a table of random numbers or from computer random number generators, as in the sketch below.
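A minimal Python sketch (not part of the original slides) illustrating the two selection modes with a hypothetical sampling frame of 100 units:

```python
import random

population = list(range(1, 101))        # hypothetical sampling frame of 100 units

# Without replacement: each unit can be selected at most once
sample_without = random.sample(population, k=10)

# With replacement: the same unit may be drawn more than once
sample_with = random.choices(population, k=10)

print(sample_without)
print(sample_with)
```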
9. Stratified sampling
- The population is divided into subgroups (called strata) according to some common characteristic.
- A simple random sample is selected from each subgroup.
- The samples from the subgroups are combined into one.
- Stratified sampling can be proportionate or disproportionate.
10. Systematic sampling
- Decide on the sample size n.
- Divide the frame of N individuals into groups of k individuals, where k = N/n.
- Randomly select one individual from the first group.
- Select every k-th individual thereafter.
- Example: N = 64, n = 8, k = 8; pick one individual at random from the first group of 8, then take every 8th individual after that (see the sketch below).
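A minimal Python sketch of the systematic selection procedure, using the slide's numbers N = 64 and n = 8:

```python
import random

N = 64                                  # frame size from the slide
n = 8                                   # desired sample size
k = N // n                              # sampling interval, k = N/n = 8

start = random.randint(1, k)            # random start within the first group
sample = list(range(start, N + 1, k))   # every k-th individual thereafter

print(k, start, sample)                 # always yields n = 8 individuals
```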
11. Cluster sampling
- The population is divided into several clusters, each representative of the population.
- A simple random sample of clusters is selected.
- All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique.
- Illustration: a population divided into 16 clusters, with a few clusters randomly selected for the sample.
12. Sample size
- Determination of a sample size is complex due to the many factors that should be taken into account:
  - the variability of elements in the population
  - the type of sample required
  - time available
  - budget
  - required estimation precision
  - whether the findings should be generalized, and with what degree of confidence.
- Formulas based on statistical theory are used, as well as other ad hoc methods.
13. Sampling from a large population
- When statistical formulas are used, three decisions should be made:
  - The specified level of precision. Precision refers to how close our estimate is to the true population characteristic. Usually, we estimate the population parameter to fall within a range, based on the sample estimate.
  - The degree of confidence. It determines how certain we are that our estimates will really hold true for the population. It is often taken as 95%.
  - The amount of variability. It reflects the homogeneity of the population. The variability is measured by the standard deviation.
- The degree of confidence trades off against the level of precision: more precision means less confidence, and more confidence means less precision.
14. Sample size using a statistical formula
- If we have information on these factors, then the sample size can be calculated as
  n = (DC × TV / DP)²
- where
  - DC (degree of confidence) is the number of standard errors for the degree of confidence specified for the research results
  - TV (true variability) is the standard deviation of the population
  - DP (desired precision) is the acceptable difference between the sample estimate and the population value.
- This formula does not include the population size, which typically does not have an impact on the sample size for large populations.
15. Example
- Consider the case where we wish to estimate the average monthly expenditure on eating out. Although the true standard deviation (variability) is unknown, a pilot study of 30 customers provides an estimate of 14 for the unknown standard deviation. We want to be 95% confident that our estimate of the mean monthly expenditure on eating out is within 2 of the true population mean. Assuming the distribution of expenditures follows a normal distribution, the sample size is determined as follows (a short sketch of this calculation appears below):
  n = (DC × TV / DP)² = (1.96 × 14 / 2)² ≈ 188.2, rounded up to 189 respondents.
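A short Python sketch of this calculation, using the formula from slide 14 with DC = 1.96 (the z value for 95% confidence), TV = 14, and DP = 2:

```python
import math

DC = 1.96          # degree of confidence: z value for 95% confidence
TV = 14            # true variability: standard deviation from the pilot study
DP = 2             # desired precision: acceptable difference from the true mean

n = (DC * TV / DP) ** 2        # about 188.2
n_required = math.ceil(n)      # round up to 189 respondents

print(round(n, 1), n_required)
```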
16. Sampling from a small population
- The use of the formula for large populations may lead to an unnecessarily large sample size.
- If the sample size is larger than 5% of the population, then the calculated sample size should be corrected.
- A commonly used correction is n' = n / (1 + n/N), where N is the population size and n is the sample size calculated with the original formula.
- When dealing with small populations, the sample size can be 10% to 20% of the total number of individuals. A minimum of 30 is also recommended, and larger if possible. Other factors, like the characteristics of the sample respondents, should be taken into account.
17. Example
- Suppose a bank has 5000 ATMs installed in the UK and wishes to establish its users' views of this service. A researcher estimates that the required sample size, given the agreed criteria, is 750. This sample size is 15% of the population and is larger than is necessary for an efficient sample; in this case the sample size correction factor needs to be applied (see the sketch below).
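A short Python sketch of the correction, assuming the form n' = n / (1 + n/N) given on slide 16:

```python
N = 5000     # population size: ATMs installed by the bank
n = 750      # sample size from the large-population formula

# n is 15% of N (more than 5%), so the correction is applied
n_adjusted = n / (1 + n / N)

print(round(n_adjusted))   # roughly 652
```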
18. Rules of thumb for sampling (Roscoe 1975)
- Sample sizes larger than 30 and less than 500 are appropriate for most research.
- Where samples are to be broken into sub-samples (males/females, etc.), a minimum sample size of 30 for each category is necessary.
- In multivariate research (including multiple regression analyses), the sample size should be several times (preferably 10 times or more) as large as the number of variables in the study.
- For simple experimental research with tight experimental controls, successful research is possible with samples as small as 10 to 20.
19. Sampling for qualitative studies
- Only small samples of individuals, groups, or events are typically chosen, due to the in-depth nature of the research and the large costs and effort involved.
- This also means that the generalizability of the findings is very restricted: external validity will be low.
- It is possible to use any of the sampling techniques.
- If the purpose of the study is merely to explore and try to understand the phenomena, convenience sampling is often used.
20. Managerial relevance
- Awareness of sampling methods
  - helps managers to understand why a particular method of sampling is used
  - facilitates understanding of the cost implications of different designs and the trade-offs between precision and confidence versus cost
  - enables managers to understand the risk they take in implementing changes based on the results of the research study
  - helps to assess the generalizability of the findings and analyze the implications of trying out the recommendations in their own system.
21. Further references
- Chapter 1 from Groebner, D., et al., 2005, Business Statistics: A Decision-Making Approach, 6th Ed., Prentice Hall.
- Chapter 7 from Hair, J., et al., 2007, Research Methods for Business, John Wiley & Sons.
- Chapter 11 from Sekaran, U., 2003, Research Methods for Business: A Skill-Building Approach, 4th Ed., John Wiley & Sons.
- Chapter 8 from Neuman, W. Lawrence, 2003, Social Research Methods: Qualitative and Quantitative Approaches, 5th Ed., Pearson Education.
22. Contents
- Sampling methods
- Hypothesis testing
- Correlation and regression analysis
- Analyzing qualitative data
23. What is a hypothesis?
- A hypothesis is a logically conjectured relationship between two or more variables, expressed in the form of a testable statement.
- A hypothesis is a claim (assumption) about a population parameter:
  - population mean. Example: the mean monthly cell phone bill in this city is μ = 42.
  - population proportion. Example: the proportion of adults in this city with cell phones is p = 0.68.
24. The null hypothesis H0
- States the assumption (numerical) to be tested.
- Example: the average number of TV sets in U.S. homes is at least three (H0: μ ≥ 3).
- The null hypothesis is always about a population parameter, not about a sample statistic.
- We always begin with the assumption that the null hypothesis is true. It may or may not be rejected.
25. The alternative hypothesis HA
- It is the opposite of the null hypothesis.
- Example: the average number of TV sets in U.S. homes is less than 3 (HA: μ < 3).
- It may or may not be accepted.
- The alternative hypothesis is generally the hypothesis that is believed (or needs to be supported) by the researcher.
26Hypothesis testing process
Claim the
population
mean age is 50.
(Null Hypothesis
Population
H0 ? 50 )
Now select a random sample
Is x 20 likely if ? 50?
Suppose the sample
If not likely,
REJECT
mean age is 20 x 20
Sample
Null Hypothesis
27. Reason for rejecting H0
- Consider the sampling distribution of x̄ under the assumption that H0 is true (μ = 50).
- If it is unlikely that we would get a sample mean of 20 when the population mean really is 50, then we reject the null hypothesis that μ = 50.
28. Types of statistical errors: Type I
- A Type I error is rejecting the null hypothesis when it is, in fact, true.
- This is considered a serious type of error.
- The probability of a Type I error is α, which is
  - called the level of significance of the test
  - set by the researcher in advance.
29. Critical value
- The objective of a hypothesis test is to use sample information to decide whether to reject the null hypothesis about a population parameter.
- When should we reject the null hypothesis?
- We need to select a cut-off point that is the demarcation between rejecting and not rejecting the null hypothesis.
30. Level of significance α
- The level of significance defines unlikely values of the sample statistic if the null hypothesis is true.
- It defines the rejection region of the sampling distribution.
- The level of significance is denoted by α; typical values are 0.01, 0.05, or 0.10.
- The value of the level of significance is selected by the researcher at the beginning.
- It provides the critical value(s) of the test.
- So, α is the probability of making a Type I error.
31. Level of significance and the rejection region
- α, the level of significance, determines the critical value(s) and the shaded rejection region of the sampling distribution.
- Lower-tail test: H0: μ ≥ 3, HA: μ < 3; a rejection region of size α lies in the lower tail.
- Upper-tail test: H0: μ ≤ 3, HA: μ > 3; a rejection region of size α lies in the upper tail.
- Two-tailed test: H0: μ = 3, HA: μ ≠ 3; rejection regions of size α/2 lie in both tails.
32. Types of statistical errors: Type II
- A Type II error is failing to reject the null hypothesis when it is, in fact, false.
- The probability of a Type II error is β.
- The probabilities α and β are inversely related: if we reduce α, then β will increase. Thus, when setting α we must consider both sides of the issue.
33Outcomes and probabilities
Possible hypothesis test outcomes
State of nature
Decision
H0 False
H0 True
Do Not
No error (1 - )
Type II Error ( ß )
Reject
Key Outcome (Probability)
a
H
0
Reject
Type I Error ( )
No Error ( 1 - ß )
H
a
0
34. Type I and Type II error relationship
- Type I and Type II errors cannot happen at the same time:
  - a Type I error can only occur if H0 is true
  - a Type II error can only occur if H0 is false.
- If the Type I error probability (α) increases, then the Type II error probability (β) decreases.
35. Factors affecting Type II error
- β increases when the difference between the hypothesized parameter and its true value decreases.
- β increases when α decreases.
- β increases when σ increases.
- β increases when n decreases.
36. Steps in hypothesis testing
- 1. Specify the population value of interest.
- 2. Formulate the appropriate null and alternative hypotheses.
- 3. Specify the desired level of significance.
- 4. Determine the rejection region.
- 5. Obtain sample evidence and compute the test statistic.
- 6. Reach a decision and interpret the result.
37Example hypothesis testing
Test the claim that the true mean number of TV
sets in US homes is at least 3.
(Assume s 0.8)
1. Specify the population value of interest The
mean number of TVs in US homes 2. Formulate the
appropriate null and alternative hypotheses H0 µ
? 3 HA µ lt 3 (This is a lower tail
test) 3. Specify the desired level of
significance Suppose that ? 0.05 is chosen for
this test
38Example hypothesis testing
(continued)
- 4. Determine the rejection region
? 0.05
Reject H0
Do not reject H0
-za -1.645
0
This is a one-tailed test with ? 0.05. Since
s is known, the cutoff value is a z value
Reject H0 if z lt z? -1.645 otherwise do
not reject H0
39. Example: hypothesis testing (continued)
- 5. Obtain sample evidence and compute the test statistic: suppose a sample is taken with the following results: n = 100, x̄ = 2.84 (σ = 0.8 is assumed known). Then the test statistic is
  z = (x̄ - μ) / (σ / √n) = (2.84 - 3) / (0.8 / √100) = -0.16 / 0.08 = -2.0
40. Example: hypothesis testing (continued)
- 6. Reach a decision and interpret the result: with α = 0.05, the critical value is -1.645. Since z = -2.0 < -1.645, we reject the null hypothesis that the mean number of TVs in US homes is at least 3 (see the sketch below).
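A minimal Python sketch (not from the slides) reproducing steps 5 and 6 of this example:

```python
import math

mu0 = 3            # hypothesized mean under H0: mu >= 3
sigma = 0.8        # population standard deviation, assumed known
n = 100            # sample size
x_bar = 2.84       # observed sample mean

z = (x_bar - mu0) / (sigma / math.sqrt(n))     # test statistic, about -2.0
z_critical = -1.645                            # lower-tail cutoff for alpha = 0.05

print(round(z, 2), z < z_critical)             # -2.0, True -> reject H0
```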
41. Statistical techniques for hypothesis testing
- The choice of a technique depends on (1) the number of variables and (2) the scale of measurement.
- Scales: nominal data, ordinal data, interval/ratio data.

  Type of scale   | Measure of central tendency | Measure of dispersion    | Statistic
  ----------------|-----------------------------|--------------------------|----------------
  Nominal         | Mode                        | -                        | Chi-square
  Ordinal         | Median                      | Percentiles or quartiles | Chi-square
  Interval/Ratio  | Mean                        | Standard deviation       | t-test, ANOVA
42. Chi-square (χ²) statistic
- It is used to test whether the frequencies of two nominally scaled variables are related.
- It is also called the chi-square goodness-of-fit test.
- The chi-square statistic compares the observed frequencies of the responses with the expected frequencies, which are the theoretical ones derived from the null hypothesis of no relationship between the two variables.
- Examples of questions appropriate for the test:
  - Are technical support calls equal across all days of the week? (i.e., do calls follow a uniform distribution?)
  - Do measurements from a production process follow a normal distribution?
43. Example: chi-square goodness-of-fit test
- Are technical support calls equal across all days of the week? (i.e., do calls follow a uniform distribution?)
- Sample data: the sum of calls over 10 days, per day of the week:
  - Monday: 290
  - Tuesday: 250
  - Wednesday: 238
  - Thursday: 257
  - Friday: 265
  - Saturday: 230
  - Sunday: 192
  - Total: Σ = 1722
44. Example: chi-square goodness-of-fit test
- If calls were uniformly distributed, the 1722 calls would be expected to be equally divided across the 7 days (1722 / 7 = 246 per day).
- The chi-square goodness-of-fit test checks whether the sample results are consistent with these expected results.
45. Example: observed vs. expected frequencies

  Day        | Observed (oi) | Expected (ei)
  -----------|---------------|--------------
  Monday     | 290           | 246
  Tuesday    | 250           | 246
  Wednesday  | 238           | 246
  Thursday   | 257           | 246
  Friday     | 265           | 246
  Saturday   | 230           | 246
  Sunday     | 192           | 246
  TOTAL      | 1722          | 1722
46. Example: chi-square test statistic
- H0: the distribution of calls is uniform over the days of the week.
- HA: the distribution of calls is not uniform.
- The test statistic is
  χ² = Σ (oi - ei)² / ei, summed over the k categories,
  where k is the number of categories, oi is the observed cell frequency for category i, and ei is the expected cell frequency for category i.
47. Example: the rejection region
- H0: the distribution of calls is uniform over the days of the week.
- HA: the distribution of calls is not uniform.
- The test has k - 1 degrees of freedom; reject H0 if the computed χ² falls in the upper-tail rejection region of size α, i.e. if χ² exceeds the critical value χ²α.
48. Example: chi-square test statistic
- H0: the distribution of calls is uniform over the days of the week.
- HA: the distribution of calls is not uniform.
- k - 1 = 6 (7 days of the week), so use 6 degrees of freedom.
- With α = 0.05, the critical value is χ²0.05 = 12.5916.
- Conclusion: χ² = 23.05 > χ²α = 12.5916, so reject H0 and conclude that the distribution is not uniform (see the sketch below).
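A minimal Python sketch (using SciPy, an assumption rather than a tool named in the slides) reproducing the chi-square statistic and critical value:

```python
from scipy import stats

observed = [290, 250, 238, 257, 265, 230, 192]    # calls, Monday to Sunday
expected = [sum(observed) / 7] * 7                # 1722 / 7 = 246 per day

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # about 23.05

df = len(observed) - 1                            # k - 1 = 6 degrees of freedom
critical = stats.chi2.ppf(0.95, df)               # about 12.59 for alpha = 0.05

print(round(chi_sq, 2), round(critical, 2), chi_sq > critical)  # True -> reject H0
```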
49. Analysis of variance (ANOVA)
- ANOVA is used to assess the statistical differences between the means of two or more groups.
- Example: during a readership survey we found that individuals 39 and younger read newspapers an average of 2.5 times per week, individuals 40 to 49 an average of 3.1 times a week, and individuals 50 and older an average of 4.7 times a week. The manager wants to know whether these observed differences are statistically significant (a sketch follows below).
- The null hypothesis is that the means are equal.
- There can be one-way ANOVA (only one independent variable) and N-way ANOVA (many independent variables).
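A minimal one-way ANOVA sketch in Python; the raw observations below are purely illustrative, since the slide reports only the group means:

```python
from scipy import stats

# Hypothetical weekly newspaper-reading counts for three age groups
under_40 = [2, 3, 2, 3, 2, 3]
age_40_49 = [3, 3, 4, 3, 2, 4]
over_50 = [5, 4, 5, 4, 5, 5]

# One-way ANOVA: H0 is that all group means are equal
f_stat, p_value = stats.f_oneway(under_40, age_40_49, over_50)

print(round(f_stat, 2), round(p_value, 4))
# A p-value below the chosen significance level (e.g. 0.05) suggests
# that the observed differences in means are statistically significant.
```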
50. Further references
- Chapter 8 from Groebner, D., et al., 2005, Business Statistics: A Decision-Making Approach, 6th Ed., Prentice Hall.
- Chapter 13 from Hair, J., et al., 2007, Research Methods for Business, John Wiley & Sons.
- Chapter 5 from Sekaran, U., 2003, Research Methods for Business: A Skill-Building Approach, 4th Ed., John Wiley & Sons.
51. Contents
- Sampling methods
- Hypothesis testing
- Correlation and regression analysis
- Analyzing qualitative data
52. Scatter plots and correlation
- A scatter plot (or scatter diagram) is used to show the relationship between two variables.
- Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
- It is only concerned with the strength of the relationship; no causal effect is implied.
53. Scatter plot examples
- [Figure: example scatter plots of y against x showing linear relationships and curvilinear relationships.]
54. Scatter plot examples
- [Figure: example scatter plots of y against x showing strong relationships and weak relationships.]
55. Scatter plot examples
- [Figure: example scatter plots of y against x showing no relationship.]
56. Correlation coefficient
- The population correlation coefficient ρ (rho) measures the strength of the association between the variables.
- The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations.
- Features of ρ and r:
  - unit free
  - range between -1 and 1
  - the closer to -1, the stronger the negative linear relationship
  - the closer to 1, the stronger the positive linear relationship
  - the closer to 0, the weaker the linear relationship.
57. Examples of approximate r values
- [Figure: five scatter plots of y against x illustrating approximate correlations of r = -1, r = -0.6, r = 0, r = +0.3, and r = +1.]
58Calculating the correlation coefficient
Sample correlation coefficient
or the algebraic equivalent
where r Sample correlation coefficient n
Sample size x Value of the independent
variable y Value of the dependent variable
59. Calculation example
Tree height (y) and trunk diameter (x):

  y      | x     | xy      | y²       | x²
  -------|-------|---------|----------|------
  35     | 8     | 280     | 1225     | 64
  49     | 9     | 441     | 2401     | 81
  27     | 7     | 189     | 729      | 49
  33     | 6     | 198     | 1089     | 36
  60     | 13    | 780     | 3600     | 169
  21     | 7     | 147     | 441      | 49
  45     | 11    | 495     | 2025     | 121
  51     | 12    | 612     | 2601     | 144
  Σ 321  | Σ 73  | Σ 3142  | Σ 14111  | Σ 713
60Calculation example
Tree Height, y
r 0.886 ? relatively strong positive linear
association between x and y
Trunk Diameter, x
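A short Python sketch reproducing r for the tree data, both with the algebraic formula from slide 58 and with NumPy:

```python
import numpy as np

height = np.array([35, 49, 27, 33, 60, 21, 45, 51])     # tree height, y
diameter = np.array([8, 9, 7, 6, 13, 7, 11, 12])        # trunk diameter, x
n = len(height)

# Algebraic form of the sample correlation coefficient
r_formula = (n * (diameter * height).sum() - diameter.sum() * height.sum()) / np.sqrt(
    (n * (diameter ** 2).sum() - diameter.sum() ** 2)
    * (n * (height ** 2).sum() - height.sum() ** 2)
)

r_numpy = np.corrcoef(diameter, height)[0, 1]            # same result via NumPy

print(round(r_formula, 3), round(r_numpy, 3))            # both about 0.886
```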
61. Introduction to regression analysis
- Regression analysis is used to:
  - predict the value of a dependent variable based on the value of at least one independent variable
  - explain the impact of changes in an independent variable on the dependent variable.
- Dependent variable: the variable we wish to explain.
- Independent variable: the variable used to explain the dependent variable.
- Simple linear regression model:
  - only one independent variable, x
  - the relationship between x and y is described by a linear function
  - changes in y are assumed to be caused by changes in x.
62. Types of regression models
- [Figure: example scatter plots illustrating a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship.]
63Population linear regression
The population regression model
Random error term, or residual
Population slopecoefficient
Population y intercept
Independent variable
Dependent variable
Linear component
Random error component
64. Linear regression assumptions
- Error values (ε) are statistically independent.
- The probability distribution of the errors is normal for any given value of x.
- The probability distribution of the errors has constant variance.
- The underlying relationship between the x variable and the y variable is linear.
65. Population linear regression
- [Figure: the regression line with intercept β0 and slope β1; for a given xi, the vertical distance εi between the observed value of y and the predicted value of y is the random error for that x value.]
66Estimated regression model
The sample regression line provides an estimate
of the population regression line
Estimate of the regression intercept
Estimated (or predicted) y value
Estimate of the regression slope
Independent variable
The individual random error terms ei have a
mean of zero
67. The least squares equation
- The formulas for b1 and b0 are:
  b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
- or the algebraic equivalent:
  b1 = [n Σxy - (Σx)(Σy)] / [n Σx² - (Σx)²]
- and
  b0 = ȳ - b1 x̄
68. Interpretation of the slope and the intercept
- b0 is the estimated average value of y when the value of x is zero.
- b1 is the estimated change in the average value of y as a result of a one-unit change in x.
- The coefficients b0 and b1 are usually found using computer software, such as SPSS, Minitab, or Excel.
- Other regression measures will also be computed as part of computer-based regression analysis.
69. Simple linear regression example
- A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet).
- A random sample of 10 houses is selected:
  - dependent variable (y): house price, in $1000s
  - independent variable (x): square feet.
70. Sample data for house price model

  House price in $1000s (y) | Square feet (x)
  --------------------------|----------------
  245                       | 1400
  312                       | 1600
  279                       | 1700
  308                       | 1875
  199                       | 1100
  219                       | 1550
  405                       | 2350
  324                       | 2450
  319                       | 1425
  255                       | 1700
71. Graphical presentation
- House price model: scatter plot and regression line.
- The fitted line has intercept b0 = 98.248 and slope b1 = 0.10977, i.e. estimated house price = 98.248 + 0.10977 (square feet); a sketch reproducing the fit appears below.
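A short Python sketch (not from the slides) that reproduces the least squares fit for the house price data:

```python
import numpy as np

square_feet = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])  # in $1000s

# Least squares fit of price = b0 + b1 * square_feet
b1, b0 = np.polyfit(square_feet, price, deg=1)

print(round(b0, 3), round(b1, 5))     # about 98.248 and 0.10977

# Predicted price (in $1000s) of a hypothetical 2000 square-foot house
print(round(b0 + b1 * 2000, 1))       # about 317.8
```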
72. Interpretation of the intercept b0
- b0 is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values).
- Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet.
73. Interpretation of the slope coefficient b1
- b1 measures the estimated change in the average value of Y as a result of a one-unit change in X.
- Here, b1 = 0.10977 tells us that the value of a house increases by 0.10977 × ($1000) = $109.77, on average, for each additional square foot of size.
74. Further references
- Chapter 13 from Groebner, D., et al., 2005, Business Statistics: A Decision-Making Approach, 6th Ed., Prentice Hall.
- Chapter 14 from Hair, J., et al., 2007, Research Methods for Business, John Wiley & Sons.
75. Contents
- Sampling methods
- Hypothesis testing
- Correlation and regression analysis
- Analyzing qualitative data
76. Approaches to qualitative research
- Phenomenology studies human experiences and consciousness. It is the study of phenomena: how things appear in our experiences and the meaning things have in our experience. Interviews and observations are used.
- Hermeneutics (a specialized field of phenomenology) attempts to understand and explain human behaviour based on an analysis of the stories people tell about themselves.
- Ethnography is a description of human socio-cultural phenomena, based on field observations and interviews. Ethnography typically focuses on a community of individuals. Snowball sampling is often used here.
- Case studies.
- Grounded theory: the goal is to construct theories in order to understand phenomena.
77. Managing qualitative data
- Qualitative data may be field-generated data or found data.
- Field data need to be transcribed into a textual format, which might be expensive.
- Found data may be numerous, making their management and analysis difficult.
- There must be a plan for how to manage the data from the beginning.
78. Coding
- Coding is the process of assigning meaningful numerical values (codes) that facilitate understanding of your data.
- The purpose is to simplify the data and focus on their meaningful characteristics.
- Coding units: words, phrases, themes, items, images, graphics, photographs, etc.
- We might be interested in the presence of some item in the data as well as its frequency of occurrence (see the sketch below).
- Data reduction is often used to select, transform, and simplify the data to make them more manageable and understandable.
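A minimal Python sketch of frequency counting for coded themes; the coding scheme and responses below are entirely hypothetical and only illustrate the idea:

```python
from collections import Counter

# Hypothetical coding scheme: recurring words mapped to theme codes
coding_scheme = {
    "waiting": "SERVICE_SPEED",
    "queue": "SERVICE_SPEED",
    "friendly": "STAFF_ATTITUDE",
    "rude": "STAFF_ATTITUDE",
    "price": "COST",
}

# Hypothetical interview responses
responses = [
    "The staff were friendly but the waiting time was long",
    "Long queue and the price was too high",
    "Very friendly service",
]

# Count how often each theme occurs across the responses
theme_counts = Counter(
    code
    for text in responses
    for word, code in coding_scheme.items()
    if word in text.lower()
)

print(theme_counts)   # e.g. SERVICE_SPEED: 2, STAFF_ATTITUDE: 2, COST: 1
```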
79. Software for qualitative analysis
- Software intended for the analysis of qualitative data is often known as CAQDAS (computer-assisted qualitative data analysis software).
- Atlas.ti facilitates qualitative analysis of large amounts of unstructured textual, graphical, audio, and video data. The tools include examining unstructured data and managing, extracting, comparing, exploring, and reassembling meaningful segments. www.atlasti.com
- QSR NVivo is typically used to analyse text from focus group or interview transcripts, literary documents, and nontextual data such as photos, tape recordings, films, multimedia, etc. Numerical output can be exported to software programs such as SPSS and Excel.
80. Reliability and validity
- Reliability is the degree of consistency with which similar words, phrases, or other kinds of data are assigned to the same patterns or themes by different researchers.
- Validity is the extent to which the qualitative findings accurately represent the phenomena being examined.
- Data triangulation: collecting data from several different sources at different times and comparing them.
- Method triangulation: conducting similar research using several different methods and comparing the findings.
- Theory triangulation: using multiple theories and perspectives to interpret and explain the data.
81. References
- Chapter 13 from Groebner, D., et al., 2005, Business Statistics: A Decision-Making Approach, 6th Ed., Prentice Hall.
- Chapters 11 and 14 from Hair, J., et al., 2007, Research Methods for Business, John Wiley & Sons.