Collection and analysis of data - PowerPoint PPT Presentation

1
Collection and analysis of data
  • Vladimir Ryabov, PhD
  • Principal Lecturer in Information Technology
  • Kemi-Tornio University of Applied Sciences

2
Contents
  • Sampling methods
  • Hypothesis testing
  • Correlation and regression analysis
  • Analyzing qualitative data

3
Collecting data during surveys
  • A population is the set of all items or
    individuals of interest.
  • A sample is a subset of the population
  • Targeting the correct population is very important.
  • Sampling is the process of selecting the right
    individuals, objects, or events for study.
  • What are the reasons for using samples?

4
Inferential statistics
  • Inferential statistics aims to make statements
    about a population by examining sample results.
  • Sample statistics → (inference) → Population
    parameters

5
Sampling methods
  • Sampling
    • Probabilistic sampling
      • Simple random
      • Systematic
      • Cluster
      • Stratified
    • Non-probabilistic sampling
      • Judgement
      • Convenience
6
Non-probabilistic sampling
  • Convenience sampling
    • Collecting information from the members of the
      population who are most conveniently available
      to provide it.
  • Purposive sampling (judgement and quota sampling)
    • Collecting information from a specific group of
      the population, either because they are the only
      ones who have it (judgement sampling) or because
      they conform to some criteria set by the
      researcher (quota sampling).
  • Non-probabilistic sampling also means that the
    findings of a study cannot be confidently
    generalized to the population.
  • Nevertheless, generalizability is not always a
    priority.

7
Snowball sampling
  • Also called referral sampling.
  • The initial respondents are typically selected
    using probability methods.
  • Then these respondents are used to identify other
    respondents in the target population.
  • The process continues until the desired sample
    size is reached.
  • This sampling type is used in the cases of rare
    populations or when a list of respondents does
    not exist.

8
Simple random sampling
  • Every individual or item from the population has
    an equal chance of being selected
  • E.g., rolling a fair die: each face is equally
    likely to come up.
  • Selection may be with replacement or without
    replacement.
  • Samples can be obtained from a table of random
    numbers or computer random number generators.
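
The two selection modes can be sketched with a computer random number generator. This is a minimal illustration, assuming a hypothetical frame of 100 numbered individuals and a fixed seed so the draw is repeatable:

```python
import random

# Hypothetical sampling frame: individuals numbered 1..100 (an assumption for illustration).
population = list(range(1, 101))
rng = random.Random(42)  # fixed seed so the example is repeatable

# Without replacement: each individual can appear at most once.
sample_without = rng.sample(population, k=10)

# With replacement: the same individual may be drawn more than once.
sample_with = rng.choices(population, k=10)

print(sample_without)
print(sample_with)
```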

9
Stratified sampling
  1. Population divided into subgroups (called strata)
    according to some common characteristic.
  2. Simple random sample selected from each subgroup.
  3. Samples from subgroups are combined into one.
  4. Stratified sampling can be proportionate or
    disproportionate.
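
The three steps above can be sketched as follows for the proportionate case. The population, the strata labels, and the function name `stratified_sample` are hypothetical and chosen only for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Proportionate stratified sample: draw the same fraction from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)       # step 1: divide into strata
    combined = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        combined.extend(rng.sample(members, size))  # step 2: simple random sample per stratum
    return combined                                 # step 3: one combined sample

# Hypothetical population: 60 undergraduates and 40 postgraduates.
population = [("undergraduate", i) for i in range(60)] + \
             [("postgraduate", i) for i in range(40)]
sample = stratified_sample(population, lambda u: u[0], 0.10, random.Random(1))
print(sample)
```

A disproportionate design would simply use a different fraction per stratum instead of one shared value.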

10
Systematic sampling
  1. Decide on sample size n
  2. Divide frame of N individuals into groups of k
    individuals: k = N/n
  3. Randomly select one individual from the 1st group
  4. Select every kth individual thereafter

Example: N = 64, n = 8, k = 8
[Figure: the frame divided into groups of k individuals, with the first group marked.]
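
The four steps can be sketched in a few lines. This assumes N is an exact multiple of n, as in the N = 64, n = 8 example; the function name `systematic_sample` is hypothetical:

```python
import random

def systematic_sample(frame, n, rng):
    """Pick a random start in the first group of k, then every k-th individual."""
    k = len(frame) // n          # group size k = N / n
    start = rng.randrange(k)     # random position within the first group
    return frame[start::k][:n]

frame = list(range(1, 65))       # N = 64 individuals, numbered 1..64
sample = systematic_sample(frame, n=8, rng=random.Random(7))
print(sample)
```
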
11
Cluster sampling
  • Population is divided into several clusters,
    each representative of the population.
  • A simple random sample of clusters is selected.
  • All items in the selected clusters can be used,
    or items can be chosen from a cluster using
    another probability sampling technique


[Figure: population divided into 16 clusters, with the randomly selected clusters forming the sample.]
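
Both variants described above (using all items in the chosen clusters, or sampling again inside them) can be sketched as follows. The 16 clusters of 10 individuals and the sample sizes are assumptions made for illustration:

```python
import random

rng = random.Random(3)

# Hypothetical population of 160 individuals split into 16 clusters of 10.
clusters = {c: [f"c{c}-p{i}" for i in range(10)] for c in range(16)}

# Stage 1: simple random sample of clusters.
chosen = rng.sample(sorted(clusters), k=4)

# Option A: use every item in the selected clusters.
one_stage = [p for c in chosen for p in clusters[c]]

# Option B: draw a simple random sample inside each selected cluster.
two_stage = [p for c in chosen for p in rng.sample(clusters[c], k=3)]

print(chosen, len(one_stage), len(two_stage))
```
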
12
Sample size
  • Determination of a sample size is complex due to
    many factors that should be taken into account:
    • the variability of elements in the population
    • the type of sample required
    • time available
    • budget
    • required estimation precision
    • whether the findings should be generalized, and
      with what degree of confidence
  • Formulas based on statistical theory are used as
    well as other ad hoc methods.

13
Sampling from a large population
  • When statistical formulas are used, three
    decisions should be made:
  • The specified level of precision. Precision
    refers to how close our estimate is to the true
    population characteristic. Usually, we would
    estimate the population parameter to fall within
    a range, based on the sample estimate.
  • The degree of confidence. It determines how
    certain we are that our estimates will really
    hold true for the population. It is often taken
    as 95%.
  • The amount of variability. It reflects the
    homogeneity of the population. The variability is
    measured by the standard deviation.
  • The degree of confidence vs. the level of
    precision. For a fixed sample size, more
    precision means less confidence, and more
    confidence means less precision.

14
Sample size using statistical formula
  • If we have information on these factors, then the
    sample size can be calculated as follows:

$$ n = \left( \frac{DC \times TV}{DP} \right)^2 $$

  • where
  • DC (degree of confidence) = the number of
    standard errors for the degree of confidence
    specified for the research results.
  • TV (true variability) = the standard deviation of
    the population.
  • DP (desired precision) = the acceptable
    difference between the sample estimate and the
    population value.
  • This formula does not include the population
    size, which typically does not have an impact on
    the sample size for large populations.

15
Example
  • Consider the case where we wish to estimate the
    average monthly expenditure on eating out.
    Although the true standard deviation
    (variability) is unknown, a pilot study of 30
    customers provides an estimate of the unknown
    standard deviation of 14. We want to be 95%
    confident that our estimate of the mean monthly
    expenditure on eating out is within 2 of the
    true population mean. Assuming the expenditures
    follow a normal distribution, DC = 1.96 for 95%
    confidence, and the sample size is
    n = (1.96 × 14 / 2)² ≈ 188.2, which rounds up to
    189 customers.
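
The calculation can be checked with a short sketch. It assumes the usual form n = (DC × TV / DP)² and takes DC as the two-sided standard normal value for the chosen confidence level; the function name `sample_size` is hypothetical:

```python
import math
from statistics import NormalDist

def sample_size(confidence, std_dev, precision):
    """n = (DC * TV / DP)^2, rounded up to a whole respondent."""
    # DC: two-sided z value, about 1.96 for 95% confidence
    dc = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((dc * std_dev / precision) ** 2)

n = sample_size(confidence=0.95, std_dev=14, precision=2)
print(n)  # 189
```

Halving the desired precision (DP = 1 instead of 2) roughly quadruples the required sample, which is the precision vs. cost trade-off in numbers.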

16
Sampling from a small population
  • The use of the formula for large populations may
    lead to an unnecessarily large sample size.
  • If the sample size is larger than 5% of the
    population, then the calculated sample size should
    be corrected.
  • A correction factor is applied, where N is the
    population size and n is the sample size
    calculated by the original formula.
  • When dealing with small populations, the sample
    size can be 10% to 20% of the total number of
    individuals. Also, a minimum number of 30 is
    recommended, and larger if possible. Other
    factors, like the characteristics of the sample
    respondents, should be taken into account.

17
Example
  • Suppose a bank has 5000 ATMs installed in the UK
    and wishes to establish its users' views of this
    service. A researcher estimates that the required
    sample size, given the agreed criteria, is 750.
    This sample size is 15% of the population
    (750 / 5000), which is larger than necessary for
    an efficient sample. In this case the sample size
    correction factor needs to be applied.
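
The correction formula itself did not survive in this transcript. The sketch below therefore assumes one common form of the finite population correction, n / (1 + (n - 1) / N); the original slide may have used a different variant, and the function name `corrected_sample_size` is hypothetical:

```python
import math

def corrected_sample_size(n, population_size):
    """Finite population correction in the common form n / (1 + (n - 1) / N)."""
    return math.ceil(n / (1 + (n - 1) / population_size))

n_adj = corrected_sample_size(n=750, population_size=5000)
print(n_adj)  # 653
```

Under this assumed form, the bank would need about 653 respondents rather than 750.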

18
Rules of thumb for sampling (Roscoe 1975)
  • Sample sizes larger than 30 and less than 500 are
    appropriate for most research.
  • Where samples are to be broken into sub-samples
    (males/females, etc.), a minimum sample size of
    30 for each category is necessary.
  • In multivariate research (including multiple
    regression analyses), the sample size should be
    several times (preferably 10 times or more) as
    large as the number of variables in the study.
  • For simple experimental research with tight
    experimental controls, successful research is
    possible with samples as small as 10 to 20 in
    size.

19
Sampling for qualitative studies
  • Only small samples of individuals, groups, or
    events are typically chosen due to the in-depth
    nature of research, huge costs and energy
    expenditure.
  • This also means that the generalizability of the
    findings is very restricted.
  • External validity will be low.
  • It is possible to use any of the sampling
    techniques.
  • If the purpose of the study is merely to explore
    and try to understand the phenomena, convenience
    sampling is often used.

20
Managerial relevance
  • Awareness of sampling methods
  • helps managers to understand why a particular
    method of sampling is used
  • facilitates understanding of the cost
    implications of different designs and the
    tradeoffs between precision and confidence vs.
    the costs
  • enables managers to understand the risk they take
    in implementing changes based on the results of
    the research study
  • helps to assess the generalizability of the
    findings and analyze the implications of trying
    out recommendations made in their own system.

21
Further references
  • Chapter 1 from Groebner, D., et al., 2005,
    Business Statistics: A Decision-Making Approach,
    6th Ed., Prentice Hall.
  • Chapter 7 from Hair, J., et al., 2007, Research
    Methods for Business, John Wiley & Sons.
  • Chapter 11 from Sekaran, U., 2003, Research
    Methods for Business: A Skill Building Approach,
    4th Ed., John Wiley & Sons.
  • Chapter 8 from Neuman, W. L., 2003, Social
    Research Methods: Qualitative and Quantitative
    Approaches, 5th Ed., Pearson Education.

22
Contents
  • Sampling methods
  • Hypothesis testing
  • Correlation and regression analysis
  • Analyzing qualitative data

23
What is a hypothesis?
  • A hypothesis is a logically conjectured
    relationship between two or more variables
    expressed in the form of a testable statement.
  • A hypothesis is a claim (assumption) about a
    population parameter:
    • population mean
    • population proportion

Example: The mean monthly cell phone bill of
this city is µ = 42.
Example: The proportion of adults in this city
with cell phones is p = 0.68.
24
The null hypothesis H0
  • States the assumption (numerical) to be tested
  • Example: The average number of TV sets in U.S.
    homes is at least three (H0: µ ≥ 3).
  • The null hypothesis is always about a population
    parameter, not about a sample statistic.
  • We always begin with the assumption that the null
    hypothesis is true. It may or may not be rejected.

25
The alternative hypothesis HA
  • It is the opposite of the null hypothesis.
  • Example: The average number of TV sets in U.S.
    homes is less than 3 (HA: µ < 3).
  • May or may not be accepted.
  • The alternative hypothesis is generally the
    hypothesis that is believed (or needs to be
    supported) by the researcher.

26
Hypothesis testing process
  • Claim: the population mean age is 50
    (null hypothesis H0: µ = 50).
  • Now select a random sample from the population.
  • Suppose the sample mean age is 20 (x̄ = 20).
  • Is x̄ = 20 likely if µ = 50?
  • If not likely, REJECT the null hypothesis.
27
Reason for rejecting H0
[Figure: sampling distribution of x̄, centered at µ = 50 if H0 is true, with the sample mean x̄ = 20 far out in the lower tail.]
  • If it is unlikely that we would get a sample mean
    of this value (x̄ = 20) if in fact µ = 50 were the
    population mean, then we reject the null
    hypothesis that µ = 50.
28
Types of statistical errors: Type I
  • Type I error is rejecting the null hypothesis
    when it is, in fact, true.
  • This error is considered a serious type of
    error.
  • The probability of Type I error is α, which is
    • called the level of significance of the test
    • set by the researcher in advance.

29
Critical value
  • The objective of a hypothesis test is to use
    sample information to decide whether to reject
    the null hypothesis about a population parameter.
  • Example,
  • When should we reject the null hypothesis?
  • We need to select a cut-off point that is the
    demarcation between rejecting and not rejecting
    the null hypothesis.

30
Level of significance α
  • The level of significance defines unlikely values
    of the sample statistic if the null hypothesis is
    true.
  • It defines the rejection region of the sampling
    distribution.
  • The level of significance is denoted by α.
  • Typical values are 0.01, 0.05, or 0.10.
  • The value of the level of significance is
    selected by the researcher at the beginning.
  • It provides the critical value(s) of the test.
  • So, α is the probability of making a Type I error.

31
Level of significance and the rejection region
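[Figure: three sampling distributions with the rejection region shaded; the boundary of each shaded region represents the critical value.]
  • Lower tail test: H0: µ ≥ 3, HA: µ < 3. The
    rejection region has area α in the lower tail.
  • Upper tail test: H0: µ ≤ 3, HA: µ > 3. The
    rejection region has area α in the upper tail.
  • Two tailed test: H0: µ = 3, HA: µ ≠ 3. The
    rejection region has area α/2 in each tail.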
32
Types of statistical errors: Type II
  • Type II error is failing to reject the null
    hypothesis when it is, in fact, false.
  • The probability of Type II error is β.
  • The probabilities α and β are inversely related.
    That is, if we reduce α, then β will increase.
    Thus, when setting α we must consider both sides
    of the issue.

33
Outcomes and probabilities
Possible hypothesis test outcomes
Key: Outcome (Probability)

Decision            State of nature: H0 True    State of nature: H0 False
Do not reject H0    No error (1 - α)            Type II error (β)
Reject H0           Type I error (α)            No error (1 - β)
34
Type I and Type II error relationship
  • Type I and Type II errors cannot happen at the
    same time.
  • Type I error can only occur if H0 is true.
  • Type II error can only occur if H0 is false.
  • If the Type I error probability (α) increases,
    then the Type II error probability (β) decreases.

35
Factors affecting Type II error
  • β increases when the difference between the
    hypothesized parameter and its true value
    decreases
  • β increases when α decreases
  • β increases when σ increases
  • β increases when n decreases

36
Steps in hypothesis testing
  • 1. Specify the population value of interest
  • 2. Formulate the appropriate null and alternative
    hypotheses
  • 3. Specify the desired level of significance
  • 4. Determine the rejection region
  • 5. Obtain sample evidence and compute the test
    statistic
  • 6. Reach a decision and interpret the result

37
Example: hypothesis testing
Test the claim that the true mean number of TV
sets in US homes is at least 3.
(Assume σ = 0.8)
  • 1. Specify the population value of interest:
    the mean number of TVs in US homes.
  • 2. Formulate the appropriate null and alternative
    hypotheses: H0: µ ≥ 3, HA: µ < 3 (this is a lower
    tail test).
  • 3. Specify the desired level of significance:
    suppose that α = 0.05 is chosen for this test.
38
Example: hypothesis testing
(continued)
  • 4. Determine the rejection region

[Figure: standard normal distribution with the lower tail of area α = 0.05 shaded; reject H0 to the left of -zα = -1.645, do not reject H0 to the right.]
This is a one-tailed test with α = 0.05. Since
σ is known, the cutoff value is a z value.
Reject H0 if z < -zα = -1.645; otherwise do
not reject H0.
39
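Example: hypothesis testing
  • 5. Obtain sample evidence and compute the test
    statistic
  • Suppose a sample is taken with the following
    results: n = 100, x̄ = 2.84 (σ = 0.8 is
    assumed known)
  • Then the test statistic is

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{2.84 - 3}{0.8 / \sqrt{100}} = \frac{-0.16}{0.08} = -2.0 $$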

40
Example: hypothesis testing
(continued)
  • 6. Reach a decision and interpret the result

[Figure: standard normal distribution with α = 0.05 in the lower tail; the test statistic z = -2.0 falls in the rejection region to the left of -1.645.]
Since z = -2.0 < -1.645, we reject the null
hypothesis that the mean number of TVs in US
homes is at least 3.
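
The six steps of the TV-set example can be reproduced with a short sketch using only the Python standard library. The variable names are illustrative, and the p-value line is an addition that the slides do not compute:

```python
import math
from statistics import NormalDist

mu0, sigma = 3.0, 0.8     # hypothesized mean and known population standard deviation
n, x_bar = 100, 2.84      # sample size and sample mean
alpha = 0.05              # level of significance

z = (x_bar - mu0) / (sigma / math.sqrt(n))  # test statistic
z_crit = NormalDist().inv_cdf(alpha)        # lower-tail critical value, about -1.645
p_value = NormalDist().cdf(z)               # chance of a z this low if H0 is true

reject_h0 = z < z_crit
print(round(z, 2), round(z_crit, 3), round(p_value, 4), reject_h0)
```
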
41
Statistical techniques for hypothesis testing
  • The choice of a technique depends on (1) the
    number of variables and (2) the scale of
    measurement.
  • Scales nominal data, ordinal data,
    interval/ratio data.

Type of scale     Measure of central tendency   Measure of dispersion      Statistic
Nominal           Mode                          -                          Chi square
Ordinal           Median                        Percentiles or quartiles   Chi square
Interval/Ratio    Mean                          Standard deviation         T-test, ANOVA
42
Chi square (χ²) statistic
  • It is used to test whether the frequencies of two
    nominally scaled variables are related.
  • A closely related form is the Chi square
    goodness-of-fit test, which checks whether observed
    frequencies match a hypothesized distribution.
  • The Chi square statistic compares the observed
    frequencies of the responses with the expected
    frequencies, which are theoretical ones derived
    from the null hypothesis (for example, no
    relationship between two variables).
  • Examples of questions appropriate for the test
  • Are technical support calls equal across all days
    of the week? (i.e., do calls follow a uniform
    distribution?)
  • Do measurements from a production process follow
    a normal distribution?

43
Example: Chi-square goodness-of-fit test
  • Are technical support calls equal across all days
    of the week? (i.e., do calls follow a uniform
    distribution?)
  • Sample data for 10 days per day of week
  • Sum of calls for each day:
    • Monday 290
    • Tuesday 250
    • Wednesday 238
    • Thursday 257
    • Friday 265
    • Saturday 230
    • Sunday 192
  • Total: Σ = 1722
44
Example: Chi-square goodness-of-fit test
  • If calls are uniformly distributed, the 1722
    calls would be expected to be equally divided
    across the 7 days: 1722 / 7 = 246 calls per day.
  • Chi-Square Goodness-of-Fit Test: test to see if
    the sample results are consistent with the
    expected results.

45
Example: Observed vs. expected frequencies
Day          Observed oi   Expected ei
Monday       290           246
Tuesday      250           246
Wednesday    238           246
Thursday     257           246
Friday       265           246
Saturday     230           246
Sunday       192           246
TOTAL        1722          1722
46
Example: Chi-square test statistic
H0: The distribution of calls is uniform
over days of the week.
HA: The distribution of calls is not uniform.
  • The test statistic is

$$ \chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} $$

where k = number of categories, o_i = observed
cell frequency for category i, e_i = expected cell
frequency for category i.
47
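Example: The rejection region
H0: The distribution of calls is uniform
over days of the week.
HA: The distribution of calls is not uniform.
  • Reject H0 if χ² > χ²α (with k - 1 degrees of
    freedom)

[Figure: chi-square distribution with the upper tail of area α shaded; reject H0 to the right of χ²α, do not reject H0 to the left.]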
48
Example: Chi-square test statistic
H0: The distribution of calls is uniform
over days of the week.
HA: The distribution of calls is not uniform.
k - 1 = 6 (7 days of the week), so use 6
degrees of freedom. With α = 0.05, the critical
value is χ²0.05 = 12.5916.
Conclusion: χ² = 23.05 > χ²α = 12.5916,
so reject H0 and conclude that the distribution
is not uniform.
[Figure: chi-square distribution with α = 0.05 shaded in the upper tail beyond χ²0.05 = 12.5916; the test statistic 23.05 falls in the rejection region.]
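
The goodness-of-fit calculation for the call data can be verified with a few lines of Python. The critical value 12.5916 is taken from the slides rather than computed, and the variable names are illustrative:

```python
observed = {
    "Monday": 290, "Tuesday": 250, "Wednesday": 238, "Thursday": 257,
    "Friday": 265, "Saturday": 230, "Sunday": 192,
}

total = sum(observed.values())       # 1722 calls in total
expected = total / len(observed)     # 246 calls per day under H0 (uniform)

# Chi-square statistic: sum of (observed - expected)^2 / expected over the 7 days
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

critical_value = 12.5916             # table value for alpha = 0.05, 6 d.f. (from the slides)
print(round(chi_square, 2), chi_square > critical_value)  # 23.05 True
```
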
49
Analysis of variance (ANOVA)
  • ANOVA is used to assess the statistical
    differences between the means of two or more
    groups.
  • Example: During a readership survey we found that
    individuals 39 and younger read newspapers an
    average of 2.5 times a week, individuals 40 to 49
    read newspapers an average of 3.1 times a week,
    and individuals 50 and older read newspapers an
    average of 4.7 times a week. The manager wants to
    know whether these observed differences are
    statistically significant.
  • The null hypothesis is that the group means are
    equal.
  • There can be one-way ANOVA (only one independent
    variable) and N-way ANOVA (many independent
    variables).

50
Further references
  • Chapter 8 from Groebner, D., et al., 2005,
    Business Statistics: A Decision-Making Approach,
    6th Ed., Prentice Hall.
  • Chapter 13 from Hair, J., et al., 2007, Research
    Methods for Business, John Wiley & Sons.
  • Chapter 5 from Sekaran, U., 2003, Research
    Methods for Business: A Skill Building Approach,
    4th Ed., John Wiley & Sons.

51
Contents
  • Sampling methods
  • Hypothesis testing
  • Correlation and regression analysis
  • Analyzing qualitative data

52
Scatter plots and correlation
  • A scatter plot (or scatter diagram) is used to
    show the relationship between two variables.
  • Correlation analysis is used to measure strength
    of the association (linear relationship) between
    two variables.
  • Only concerned with strength of the relationship
  • No causal effect is implied

53
Scatter plot examples
[Figure: four scatter plots of y against x, two showing linear relationships and two showing curvilinear relationships.]
54
Scatter plot examples
[Figure: four scatter plots of y against x, two showing strong relationships and two showing weak relationships.]
55
Scatter plot examples
[Figure: two scatter plots of y against x showing no relationship.]
56
Correlation coefficient
  • The population correlation coefficient ρ (rho)
    measures the strength of the association between
    the variables.
  • The sample correlation coefficient r is an
    estimate of ρ and is used to measure the
    strength of the linear relationship in the sample
    observations.
  • Features of ρ and r:
  • Unit free
  • Range between -1 and 1
  • The closer to -1, the stronger the negative
    linear relationship
  • The closer to 1, the stronger the positive linear
    relationship
  • The closer to 0, the weaker the linear
    relationship

57
Examples of approximate r values
[Figure: five scatter plots of y against x with approximate values r = -1, r = -0.6, r = 0, r = 0.3, and r = 1.]
58
Calculating the correlation coefficient
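Sample correlation coefficient:

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} $$

or the algebraic equivalent:

$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - (\sum x)^2\right]\left[n \sum y^2 - (\sum y)^2\right]}} $$

where r = sample correlation coefficient, n =
sample size, x = value of the independent
variable, y = value of the dependent variable.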
59
Calculation example
Tree Height (y)   Trunk Diameter (x)   xy         y²          x²
35                8                    280        1225        64
49                9                    441        2401        81
27                7                    189        729         49
33                6                    198        1089        36
60                13                   780        3600        169
21                7                    147        441         49
45                11                   495        2025        121
51                12                   612        2601        144
Σ = 321           Σ = 73               Σ = 3142   Σ = 14111   Σ = 713
60
Calculation example
[Figure: scatter plot of tree height, y, against trunk diameter, x.]
r = 0.886 → relatively strong positive linear
association between x and y
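
The value r = 0.886 can be reproduced from the tree data with the algebraic form of the sample correlation coefficient. This is a minimal sketch with illustrative variable names:

```python
import math

height = [35, 49, 27, 33, 60, 21, 45, 51]   # y, tree height
diameter = [8, 9, 7, 6, 13, 7, 11, 12]      # x, trunk diameter
n = len(height)

sum_x, sum_y = sum(diameter), sum(height)               # 73 and 321
sum_xy = sum(x * y for x, y in zip(diameter, height))   # 3142
sum_x2 = sum(x * x for x in diameter)                   # 713
sum_y2 = sum(y * y for y in height)                     # 14111

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))  # 0.886
```
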
61
Introduction to regression analysis
  • Regression analysis is used to
  • Predict the value of a dependent variable based
    on the value of at least one independent
    variable.
  • Explain the impact of changes in an independent
    variable on the dependent variable.
  • Dependent variable: the variable we wish to
    explain.
  • Independent variable: the variable used to
    explain the dependent variable.
  • Simple linear regression model
  • Only one independent variable, x.
  • Relationship between x and y is described by
    a linear function.
  • Changes in y are assumed to be caused by
    changes in x.

62
Types of regression models
[Figure: four scatter plots showing a positive linear relationship, a negative linear relationship, a relationship that is NOT linear, and no relationship.]
63
Population linear regression
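The population regression model:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

where y is the dependent variable, x is the
independent variable, β0 is the population y
intercept, β1 is the population slope coefficient,
and ε is the random error term, or residual.
β0 + β1x is the linear component and ε is the
random error component.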
64
Linear regression assumptions
  • Error values (ε) are statistically independent.
  • Error values are normally distributed for any
    given value of x.
  • The probability distribution of the errors is
    normal.
  • The probability distribution of the errors has
    constant variance.
  • The underlying relationship between the x
    variable and the y variable is linear.

65
Population linear regression
[Figure: population regression line with intercept β0 and slope β1. For a given xi, the random error εi is the vertical distance between the observed value of y and the predicted value of y on the line.]
66
Estimated regression model
The sample regression line provides an estimate
of the population regression line:

$$ \hat{y}_i = b_0 + b_1 x $$

where ŷi is the estimated (or predicted) y value,
b0 is the estimate of the regression intercept,
b1 is the estimate of the regression slope, and x
is the independent variable.
The individual random error terms ei have a
mean of zero.
67
The least squares equation
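  • The formulas for b1 and b0 are

$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$

algebraic equivalent:

$$ b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} $$

and

$$ b_0 = \bar{y} - b_1 \bar{x} $$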
68
Interpretation of the slope and the intercept
  • b0 is the estimated average value of y when the
    value of x is zero.
  • b1 is the estimated change in the average value
    of y as a result of a one-unit change in x.
  • The coefficients b0 and b1 will usually be
    found using computer software, such as SPSS,
    Minitab, and Excel.
  • Other regression measures will also be computed
    as part of computer-based regression analysis.

69
Simple linear regression example
  • A real estate agent wishes to examine the
    relationship between the selling price of a home
    and its size (measured in square feet).
  • A random sample of 10 houses is selected
  • Dependent variable (y): house price in $1000s
  • Independent variable (x): square feet

70
Sample data for house price model
House Price in $1000s (y)   Square Feet (x)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
71
Graphical presentation
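  • House price model: scatter plot and regression
    line

[Figure: scatter plot of house price against square feet with the fitted regression line.]
Slope = 0.10977
Intercept = 98.248
Estimated regression line: house price = 98.248 + 0.10977 (square feet)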
72
Interpretation of the intercept, b0
  • b0 is the estimated average value of y when the
    value of x is zero (if x = 0 is in the range of
    observed x values).
  • Here, no houses had 0 square feet, so b0 =
    98.24833 just indicates that, for houses within
    the range of sizes observed, $98,248.33 is the
    portion of the house price not explained by
    square feet.

73
Interpretation of the slope coefficient, b1
  • b1 measures the estimated change in the average
    value of y as a result of a one-unit change in x.
  • Here, b1 = 0.10977 tells us that the average
    value of a house increases by 0.10977 × $1000 =
    $109.77 for each additional square foot of size.
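
The slope and intercept quoted above can be reproduced from the ten sample houses with the least squares formulas. This is a minimal sketch; the 2000 square foot prediction is a hypothetical example that is not in the slides:

```python
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # y, in 1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]   # x

n = len(price)
x_bar, y_bar = sum(sqft) / n, sum(price) / n

# Least squares slope: sum of cross-deviations over sum of squared x deviations
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price)) / sum(
    (x - x_bar) ** 2 for x in sqft
)
b0 = y_bar - b1 * x_bar   # intercept

predicted_2000 = b0 + b1 * 2000   # hypothetical house inside the observed size range
print(round(b1, 5), round(b0, 3), round(predicted_2000, 2))
```

Predictions are only meaningful inside the observed range of sizes (1100 to 2450 square feet), which is why the intercept has no direct interpretation here.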

74
Further references
  • Chapter 13 from Groebner, D., et al., 2005,
    Business Statistics: A Decision-Making Approach,
    6th Ed., Prentice Hall.
  • Chapter 14 from Hair, J., et al., 2007, Research
    Methods for Business, John Wiley & Sons.

75
Contents
  • Sampling methods
  • Hypothesis testing
  • Correlation and regression analysis
  • Analyzing qualitative data

76
Approaches to qualitative research
  • Phenomenology studies human experiences and
    consciousness. It is the study of phenomena, or
    how things appear in our experiences, the meaning
    things have in our experience. Interviews and
    observations are used.
  • Hermeneutics (a specialized field of
    phenomenology) attempts to understand and explain
    human behaviour based on an analysis of stories
    people tell about themselves.
  • Ethnography is a description of human
    socio-cultural phenomena, based on field
    observations and interviews. Ethnography
    typically focuses on a community of individuals.
    Snowball sampling is often used here.
  • Case studies.
  • Grounded theory. The goal is to construct
    theories in order to understand phenomena.

77
Managing qualitative data
  • Field-generated data and found data.
  • Field data need to be transcribed into a textual
    format, which might be expensive.
  • Found data may be voluminous, making their
    management and analysis difficult.
  • A plan for managing the data must be in place
    from the beginning.

78
Coding
  • Coding is the process of assigning meaningful
    numerical values that facilitate understanding of
    your data.
  • The purpose is to simplify and focus on
    meaningful characteristics of the data.
  • Coding units words, phrases, themes, items,
    images, graphics, photographs, etc.
  • We might be interested in the presence of some
    item in the data as well as its frequency of
    occurrence.
  • Data reduction is often used to select,
    transform, simplify the data to make it more
    manageable and understandable.
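
Presence and frequency of codes can be tallied with a short sketch once segments have been coded. The respondents, the codes, and the segment list are hypothetical and serve only as an illustration:

```python
from collections import Counter

# Hypothetical coded interview segments: (respondent, code) pairs assigned by a researcher.
coded_segments = [
    ("R1", "price"), ("R1", "service"), ("R2", "price"),
    ("R2", "price"), ("R3", "location"), ("R3", "service"),
]

# Frequency of occurrence: how many segments received each code.
frequency = Counter(code for _, code in coded_segments)

# Presence: how many distinct respondents mentioned each code at least once.
presence = Counter(code for _, code in set(coded_segments))

print(frequency)
print(presence)
```
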

79
Software for qualitative analysis
  • Software intended for analysis of qualitative
    data is often known as CAQDAS (computer assisted
    qualitative data analysis software).
  • Atlas.ti facilitates qualitative analysis of
    large amounts of unstructured textual, graphical,
    audio and video data. The tools include examining
    unstructured data, managing, extracting,
    comparing, exploring and reassembling meaningful
    segments. www.atlasti.com
  • QSR NVivo is typically used to analyse text from
    focus group or interview transcripts, literary
    documents, and nontextual data such as photos,
    tape recordings, films, multimedia, etc. Numerical
    output can be exported to software programs such
    as SPSS and Excel.

80
Reliability and validity
  • Reliability is the degree of consistency in
    assignment of similar words, phrases or other
    kinds of data to the same pattern or themes by
    different researchers.
  • Validity is the extent to which qualitative
    findings accurately represent the phenomena being
    examined.
  • Data triangulation is collecting data from several
    different sources at different times and
    comparing them.
  • Method triangulation is conducting similar
    research using several different methods and
    comparing the findings.
  • Theory triangulation is using multiple theories
    and perspectives to interpret and explain data.

81
References
  • Chapter 13 from Groebner, D., et al., 2005,
    Business Statistics: A Decision-Making Approach,
    6th Ed., Prentice Hall.
  • Chapters 11 and 14 from Hair, J., et al., 2007,
    Research Methods for Business, John Wiley & Sons.