Title: Correlation and regression analysis
1Correlation and regression analysis
- Week 8
- Research Methods Data Analysis
2Lecture outline
- Correlation
- Regression Analysis
- The least squares estimation method
- SPSS and regression output
- Task overview
3Correlation
- Correlation measures to what extent two (or more) variables are related
- Correlation expresses a relationship that is not necessarily precise (e.g. height and weight)
- Positive correlation indicates that the two variables move in the same direction
- Negative correlation indicates that they move in opposite directions
4Covariance
- Covariance measures the joint variability of two variables
- If two variables are independent, then the covariance is zero (however, Cov = 0 does not mean that two variables are independent)
- Cov(X, Y) = E[(X - E(X)) (Y - E(Y))], where E() indicates the expected value (i.e. average value)
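The caveat that zero covariance does not imply independence can be illustrated with a small sketch. The data below are made up for illustration, and the helper name `covariance` is an assumption, not something from the lecture.

```python
def covariance(x, y):
    """Population covariance: the average of (x - mean_x) * (y - mean_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

# y is completely determined by x, so the two variables are clearly dependent
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]

# yet the covariance is zero, because the relationship is not linear
cov_xy = covariance(x, y)
print(cov_xy)
```

Covariance only captures the linear part of a relationship, which is why a symmetric, curved pattern like this one goes undetected.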
5Correlation coefficient
- The correlation coefficient r gives a measure (in the range [-1, 1]) of the relationship between two variables
- r = 0 means no correlation
- r = 1 means perfect positive correlation
- r = -1 means perfect negative correlation
- Perfect correlation indicates that a p% variation in x corresponds to a p% variation in y
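The three reference values of r can be checked with a short sketch. The function `pearson_r` and the data are illustrative assumptions; the computation is the usual sample formula (sum of cross-products divided by the square root of the product of the sums of squares).

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2 * v + 1 for v in x])          # points on a rising line
r_neg = pearson_r(x, [10 - 3 * v for v in x])         # points on a falling line
r_zero = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # symmetric, no linear trend

print(r_pos, r_neg, r_zero)
```

In this sketch, r reaches 1 or -1 exactly when all points lie on a straight line, whatever its slope.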
6Correlation coefficient and covariance
Pearson correlation coefficient
Correlation coefficient - POPULATION
SAMPLE
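As a reference for this slide, the standard textbook definitions can be written as follows. The notation is an assumption: sigma denotes population standard deviations, and bars denote sample means over n observations.

```latex
% Population
\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}

% Sample
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Dividing the covariance by the two standard deviations removes the units of measurement, which is what confines the coefficient to the range from -1 to 1.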
7Bivariate and multivariate correlation
- Bivariate correlation
  - 2 variables
  - Pearson correlation coefficient
- Partial correlation
  - The correlation between two variables after allowing for the effect of other control variables
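A minimal sketch of a partial correlation with a single control variable z, using the standard first-order formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The data and function names are made up for illustration; with several control variables a package such as SPSS would normally be used instead.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(x, y, z):
    """Correlation between x and y after allowing for one control variable z."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical data: x and y both rise with z, plus unrelated noise
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 4, 3, 6, 5, 8, 7]
y = [2, 3, 2, 3, 6, 7, 6, 7]

r_bivariate = pearson_r(x, y)
r_partial = partial_r(x, y, z)
print(round(r_bivariate, 3), round(r_partial, 3))
```

Here the bivariate correlation is high only because both variables follow z; once z is controlled for, the partial correlation is close to zero.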
8Significance level in correlation
- Level of correlation (value of the correlation coefficient): indicates to what extent the two variables move together
- Significance of correlation (p value): given that the correlation coefficient is computed on a sample, indicates whether the relationship appears to be statistically significant
- Examples
  - Correlation is 0.50, but not significant: the sampling error is so high that the actual correlation could even be 0
  - Correlation is 0.10 and highly significant: the level of correlation is very low, but we can be confident in the value of such correlation
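The two examples can be reproduced with the usual t test for a correlation coefficient, t = r sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sample sizes below are hypothetical (the slide does not give them), and the critical values are standard two-tailed 5% table values.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: the population correlation is 0 (n - 2 d.f.)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-tailed 5% critical values of Student's t
T_CRIT_DF_8 = 2.306      # 8 degrees of freedom
T_CRIT_LARGE_DF = 1.96   # large-sample (normal) approximation

t_small = correlation_t_statistic(0.50, 10)    # r = 0.50 on an assumed n = 10
t_large = correlation_t_statistic(0.10, 1000)  # r = 0.10 on an assumed n = 1000

print(abs(t_small) > T_CRIT_DF_8)      # False: 0.50 is not significant here
print(abs(t_large) > T_CRIT_LARGE_DF)  # True: 0.10 is significant here
```

The size of the coefficient and its significance answer different questions: the first describes the strength of the relationship, the second how much sampling error surrounds the estimate.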
9Correlation and covariance in SPSS
Choose between bivariate and partial
10Bivariate correlation
Select the variables you want to analyse
Request the significance level (two-tailed)
Ask for additional statistics (if necessary)
11Bivariate correlation output
12Partial correlations
List of variables to be analysed
Control variables
13Partial correlation output
- - - PARTIAL CORRELATION COEFFICIENTS - - -
Controlling for.. SIZE STYLE

            AMTSPENT     USECOUP      ORG
AMTSPENT     1.0000       .2677       -.0116
            (    0)      (  775)      (  775)
            P= .         P= .000      P= .746
USECOUP       .2677      1.0000        .0500
            (  775)      (    0)      (  775)
            P= .000      P= .         P= .164
ORG          -.0116       .0500       1.0000
            (  775)      (  775)      (    0)
            P= .746      P= .164      P= .

(Coefficient / (D.F.) / 2-tailed Significance)
" . " is printed if a coefficient cannot be computed
Partial correlations still measure the
correlation between two variables, but eliminate
the effect of other variables, i.e. the
correlations are computed on consumers shopping
in stores of identical size and with the same
shopping style
14Bivariate and partial correlations
- Correlation between Amount spent and Use of coupon
  - Bivariate correlation: 0.291 (p value 0.00)
  - Partial correlation: 0.268 (p value 0.00)
- The amount spent is positively correlated with the use of coupon (0 = no use, 1 = from newspaper, 2 = from mailing, 3 = both)
- The level of correlation does not change much after accounting for different shop size and shopping styles
15Linear regression analysis
y = a + b x + e
- y: dependent variable
- x: independent variable (explanatory variable, regressor)
- a: intercept
- b: regression coefficient
- e: error
16Regression analysis
[Figure: plot of y against x]
17Example
- We want to investigate whether there is a relationship between cholesterol and age in a sample of 18 people
- The dependent variable is the cholesterol level
- The explanatory variable is age
18What regression analysis does
- Determines whether a relationship exists between the dependent and explanatory variables
- Determines how much of the variation in the dependent variable is explained by the independent variable (goodness of fit)
- Allows us to predict the values of the dependent variable
19Regression and correlation
- Correlation: no causal relationship is assumed
- Regression: we assume that the explanatory variables cause the dependent variable
  - Bivariate: one explanatory variable
  - Multivariate: two or more explanatory variables
20How to estimate the regression coefficients
- The objective is to estimate the population parameters a and b from our data sample
- A good way to estimate them is by minimising the error e_i, which represents the difference between the actual observation and the estimated (predicted) one
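In symbols, and assuming the usual notation in which a hat marks the predicted value for observation i, the error can be written as:

```latex
\hat{y}_i = a + b\,x_i,
\qquad
e_i = y_i - \hat{y}_i = y_i - (a + b\,x_i)
```

Each observation has its own error, so estimating a and b means finding one line that keeps all of these errors small at once.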
21The objective is to identify the line (i.e. the a
and b coefficients) that minimises the distance
between the actual points and the fitted line
22The least squares method
- This is based on minimising the sum of the squared distances (errors) rather than the distances themselves
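A minimal sketch of the least squares estimates for a bivariate regression, using the standard closed-form solution b = S_xy / S_xx and a = mean(y) - b mean(x). The data are made up and the function names are assumptions for illustration; SPSS performs the same estimation internally.

```python
def least_squares_fit(x, y):
    """Return the intercept a and slope b that minimise the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sum_squared_errors(x, y, a, b):
    """Sum of squared differences between actual and predicted values."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical observations
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_fit(x, y)
best = sum_squared_errors(x, y, a, b)

# Moving either coefficient away from the estimates increases the error
worse_intercept = sum_squared_errors(x, y, a + 0.1, b)
worse_slope = sum_squared_errors(x, y, a, b + 0.05)
print(best < worse_intercept, best < worse_slope)
```

Squaring the errors prevents positive and negative deviations from cancelling out and penalises large misses more heavily than small ones.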
23Bivariate regression in SPSS
24Regression dialog box
Dependent variable
Explanatory variable
Leave this unchanged!
25Regression output
Statistical significance: is the coefficient
different from 0?
Value of the coefficients
26Model diagnostics goodness of fit
The R-square lies between 0 and 1 and represents
the proportion of total variation that is
explained by the regression model
27R-square
Total variation
Variation explained by regression
Residual variation
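The three quantities on this slide can be computed directly. The sketch below uses made-up data and hypothetical names, fits a bivariate least squares line, and checks that the total variation splits into an explained part and a residual part.

```python
def variation_decomposition(x, y):
    """Return (total, explained, residual) variation for a least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    fitted = [a + b * xi for xi in x]
    total = sum((yi - mean_y) ** 2 for yi in y)
    explained = sum((fi - mean_y) ** 2 for fi in fitted)
    residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return total, explained, residual

# Hypothetical observations
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

total, explained, residual = variation_decomposition(x, y)
r_square = explained / total
print(round(r_square, 4))
```

R-square is the explained share of the total, so it equals 1 when the residual variation is zero and 0 when the regression explains nothing.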
28Multivariate regression
- The principle is identical to bivariate regression, but there are more explanatory variables
- The goodness of fit can be measured through the adjusted R-square, which takes into account the number of explanatory variables
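A short sketch of the usual adjusted R-square formula, 1 - (1 - R^2)(n - 1) / (n - k - 1), where n is the number of observations and k the number of explanatory variables. The input values are hypothetical and only show how the adjustment behaves.

```python
def adjusted_r_square(r_square, n_obs, n_explanatory):
    """Adjusted R-square: penalises R-square for the number of regressors."""
    return 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_explanatory - 1)

# Same R-square and sample size, different numbers of explanatory variables
adj_one = adjusted_r_square(0.19, 100, 1)
adj_six = adjusted_r_square(0.19, 100, 6)
print(round(adj_one, 4), round(adj_six, 4))
```

For a given R-square, the adjusted value falls as more explanatory variables are added, so an extra variable improves it only when it adds enough explanatory power.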
29Multivariate regression in SPSS
- Analyze / Regression / Linear
Simply select more than one explanatory variable
30Output
31Coefficient interpretation
- The constant (296.5) represents the amount spent when all other variables are 0
- The coefficients for Health food stores, Size of store and being vegetarian are not significantly different from 0
- Gender: coeff = -69.6. On average, being a woman (G = 1) implies spending 69.6 less
- Shopping style: coeff = 22.8 * S
  - S = 1 (shops for himself): 22.8
  - S = 2 (shops for himself and spouse): 45.6
  - S = 3 (shops for himself and family): 68.4
- Coupon use: coeff = 30.4 * C
  - C = 1 (does not use coupons): 30.4
  - C = 2 (coupons from newspapers): 60.8
  - C = 3 (coupons from mailings): 91.2
  - C = 4 (coupons from both): 121.6
Categorization problems?
32Prediction
- On average, how much will someone with the following characteristics spend?
  - Male (G = 0)
  - Shopping for family (S = 3)
  - Not using coupons (C = 1)
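The prediction can be worked out by plugging these values into the estimated equation. The sketch below uses the coefficients reported on the coefficient interpretation slide. As an assumption, the non-significant variables (health food store, size of store, vegetarian) are left out, because their coefficients are not given there.

```python
# Coefficients taken from the coefficient interpretation slide
CONSTANT = 296.5
COEFF_GENDER = -69.6   # G: 0 = male, 1 = female
COEFF_STYLE = 22.8     # S: 1, 2 or 3
COEFF_COUPON = 30.4    # C: 1 to 4

def predict_amount_spent(gender, style, coupon):
    """Predicted amount spent; non-significant variables are omitted (assumption)."""
    return (CONSTANT
            + COEFF_GENDER * gender
            + COEFF_STYLE * style
            + COEFF_COUPON * coupon)

# Male (G = 0), shopping for family (S = 3), not using coupons (C = 1)
prediction = predict_amount_spent(gender=0, style=3, coupon=1)
print(round(prediction, 1))  # 395.3
```

That is, 296.5 + 0 + 68.4 + 30.4 = 395.3 under these assumptions.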
33How good is the model?
- The regression model explains less than 19% of the total variation in the amount spent
34Task A
- Examine the relationship between the amount spent and the following customer characteristics
  - Being male/female
  - Being vegetarian
  - Shopping for himself / for himself and others
  - Shopping style (weekly, bi-weekly, etc.)
- Potential methods
  - Battery of hypothesis testing / Analysis of variance
  - Regression Analysis
35Task B
- Examine the relationship between the amount spent and the following customer characteristics
  - Hypothesis: the average amount spent in health-oriented shops is higher than that in other shops. True or false?
  - Test the same hypothesis accounting for different shop sizes
- Potential methods
  - Battery of hypothesis testing / Analysis of variance
  - Regression Analysis
36Task C
- Find a relationship between the average amount spent per store and the following store characteristics
  - Size of store
  - Health-oriented store
  - Store organisation
- Potential methods
  - Transform the customer data set into a store data set
  - Battery of ANOVA
  - Regression Analysis
37Task D
- Hypothesis: is the amount spent by those who use coupons significantly higher?
- What is the most effective way of distributing coupons?
  - By mail
  - In newspapers
  - Both
- Potential methods
  - Recode the variable into 1 = not using coupons and 2 = using coupons
  - Hypothesis testing
  - Analysis of variance