Title: Action Research Correlation and Regression
1. Action Research: Correlation and Regression

2. Measures of Association
- Measures of association are used to determine how strong the relationship is between two variables or measures, and how we can predict such a relationship
- Only applies for interval or ratio scale variables
- Everything this week only applies to interval or ratio scale variables!
3. Measures of Association
- For example, I have GRE and GPA scores for a random sample of graduate students
- How strong is the relationship between GRE scores and GPA? Do these variables relate to each other in some way?
- If there is a strong relationship, how well can we predict the values of one variable when values of the other variable are known?
4. Strength of Prediction
- Two techniques are used to describe the strength of a relationship, and to predict values of one variable when another variable's value is known
- Correlation: describes the degree (strength) to which the two variables are related
- Regression: used to predict the values of one variable when values of the other are known
5. Strength of Prediction
- Correlation and regression are linked -- the ability to predict one variable when another variable is known depends on the degree and direction of the variables' relationship in the first place
- We find correlation before we calculate regression
- So generating a regression without checking for a correlation first is pointless (though we'll do both at once)
6. Correlation
- There are different types of statistical measures of correlation
- They give us a measure known as the correlation coefficient
- The most common procedure used is known as the Pearson Product Moment Correlation, or Pearson's r
7. Pearson's r
- Can only be calculated for interval or ratio scale data
- Its value is a real number from -1 to 1
- Strength: as the value of r approaches -1 or 1, the relationship is stronger. As the magnitude of r approaches zero, we see little or no relationship
8. Pearson's r
- For example, r might equal 0.89, -0.9, 0.613, or -0.3
- Which would be the strongest correlation?
- Direction: positive or negative correlation is indicated by the sign of r, not by its magnitude
- The direction of the correlation also appears in the regression equation, in the sign of the slope constant obtained for it
9. Example of Relationships
- Positive direction -- as the independent variable increases, the dependent variable tends to increase
- Student  GRE (X)  GPA1 (Y)
- 1        1500     4.0
- 2        1400     3.8
- 3        1250     3.5
- 4        1050     3.1
- 5        950      2.9
10. Example of Relationships
- Negative direction -- as the independent variable increases, the dependent variable tends to decrease
- Student  GRE (X)  GPA2 (Y)
- 1        1500     2.9
- 2        1400     3.1
- 3        1250     3.4
- 4        1050     3.7
- 5        950      4.0
11. Positive and Negative Correlation
(Scatter plot of the data from slide 9)
(Scatter plot of the data from slide 10)
Notice that the magnitude of r alone doesn't tell whether the correlation is positive or negative!
12. Important Note
- An association value provided by a correlation analysis, such as Pearson's r, tells us nothing about causation
- In this case, high GRE scores don't necessarily cause high or low GPA scores, and vice versa
13. Significance of r
- We can test for the significance of r (to see whether our relationship is statistically significant) by consulting a table of critical values for r (Action Research p. 41/42)
- Table: VALUES OF THE CORRELATION COEFFICIENT FOR DIFFERENT LEVELS OF SIGNIFICANCE
- Where df = (number of data pairs) - 2
14. Significance of r
- We test the null hypothesis that the correlation between the two variables is equal to zero (there is no relationship between them)
- Reject the null hypothesis (H0) if the absolute value of r is greater than the critical r value
- Reject H0 if |r| > r_crit
- This is similar to evaluating actual versus critical t values
15. Significance of r Example
- So if we had 20 pairs of data
- For two-tail 95% confidence (P = .05), the critical r value at df = 20 - 2 = 18 is 0.444
- So reject the null hypothesis (hence the correlation is statistically significant) if
- r > 0.444 or r < -0.444
16. Strength of r
- The absolute value of Pearson's r indicates the strength of a correlation
- 1.0 to 0.9: very strong correlation
- 0.9 to 0.7: strong
- 0.7 to 0.4: moderate to substantial
- 0.4 to 0.2: moderate to low
- 0.2 to 0.0: low to negligible correlation
- Notice that a correlation can be strong, but still not be statistically significant! (especially for small data sets)
17. Important Notes
- The stronger the r, the smaller the standard error of the estimate, and the better the prediction!
- A significant r does not necessarily mean that you have a strong correlation
- A significant r means that whatever correlation you do have is not due to random chance
18. Coefficient of Determination
- By squaring r, we can determine the amount of variance the two variables share (called explained variance)
- R Square is the coefficient of determination
- So, an R Square of 0.94 means that 94% of the variance in the Y variable is explained by the variance of the X variable
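As a quick check in plain Python (the r value here is illustrative), squaring drops the sign, so a strong negative correlation explains just as much variance as a strong positive one:

```python
r = -0.97                    # illustrative correlation coefficient
r_squared = r ** 2           # coefficient of determination
print(round(r_squared, 4))   # 0.9409 -> about 94% explained variance
```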
19. What is R Squared?
- The coefficient of determination, R2, is a measure of the goodness of fit
- R2 ranges from 0 to 1
- R2 = 1 is a perfect fit (all data points fall on the estimated line or curve)
- R2 = 0 means that the variable(s) have no explanatory power
20. What is R Squared?
- Having R2 closer to 1 helps choose which regression model is best suited to a problem
- Having R2 actually equal zero is very difficult
- A sample of ten random numbers from Excel still obtained an R2 of 0.006
21. Scatter Plots
- It's nice to use R2 to determine the strength of a relationship, but visual feedback helps verify whether the model fits the data well
- Also helps look for data fliers (outliers)
- A scatter plot (or scattergram) allows us to compare any two interval or ratio scale variables, and see how data points are related to each other
22. Scatter Plots
- Scatter plots are two-dimensional graphs with an axis for each variable (independent variable X and dependent variable Y)
- To construct: place a mark on the graph for each X and Y value from the data
- Seeing data this way can help choose the correct mathematical model for the data
23. Scatter Plots
24. Models
- Allow us to focus on select elements of the problem at hand, and ignore irrelevant ones
- May show how parts of the problem relate to each other
- May be expressed as equations, mappings, or diagrams
- May be chosen or derived before or after measurement (theory vs. empirical)
25. Modeling
- Often we look for a linear relationship - one described by fitting a straight line as well to the data as possible
- More generally, any equation could be used as the basis for regression modeling, or describing the relationship between two variables
- You could have Y = aX^2 + b*ln(X) + c*sin(dX - e)
26. Linear Model
27. Linear Model
- Pearson's r for linear regression is calculated per (Action Research p. 29/30)
- Define:
  N   = number of data pairs
  SX  = sum of all X values
  SX2 = sum of all (X values squared)
  SY  = sum of all Y values
  SY2 = sum of all (Y values squared)
  SXY = sum of all (X values times Y values)
- Pearson's r = [N(SXY) - (SX)(SY)] / sqrt{[N(SX2) - (SX)^2] * [N(SY2) - (SY)^2]}
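The raw-sums formula can be checked directly in plain Python, using the slide 9 data (GRE vs GPA1, which happens to be perfectly linear):

```python
import math

X = [1500, 1400, 1250, 1050, 950]   # GRE scores (slide 9)
Y = [4.0, 3.8, 3.5, 3.1, 2.9]       # GPA1 values (slide 9)

N = len(X)
SX, SY = sum(X), sum(Y)
SX2 = sum(x * x for x in X)         # sum of squared X values
SY2 = sum(y * y for y in Y)         # sum of squared Y values
SXY = sum(x * y for x, y in zip(X, Y))

r = (N * SXY - SX * SY) / math.sqrt((N * SX2 - SX ** 2) * (N * SY2 - SY ** 2))
print(round(r, 6))  # 1.0 -- this sample is perfectly linear
```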
28. Linear Model
- For the linear model, you could find the slope m and Y-intercept b from:
- m = (r) * (standard deviation of Y) / (standard deviation of X)
- b = (mean of Y) - (m)(mean of X)
- But it's a lot easier to use SPSS: slope = b1 and Y intercept = b0
29. Regression Analysis
- Allows us to predict the likely value of one variable from knowledge of another variable
- The two variables should be fairly highly correlated (close to a straight line)
- The regression equation is a mathematical expression of the relationship between two variables - for example, a straight line
30. Regression Equation
- Y = mX + b
- In this linear equation, you predict Y values (the dependent variable) from known values of X (the independent variable) - this is called the regression of Y on X
- The regression equation is fundamentally an equation for plotting a straight line, so the stronger our correlation, the closer our variables will fall to a straight line, and the better our prediction will be
31. Linear Regression
(Scatter plot with fitted line y = a + bx; each point's vertical distance from the line is its residual e = y - predicted y)
Choose the best line by minimizing the sum of the squares of the vertical distances between the data points and the regression line
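The least-squares idea can be verified numerically in plain Python using the slide 10 data: with the intercept re-optimized for each candidate slope, the closed-form slope yields a smaller sum of squared errors than any nearby slope.

```python
X = [1500, 1400, 1250, 1050, 950]   # GRE (slide 10)
Y = [2.9, 3.1, 3.4, 3.7, 4.0]       # GPA2 (slide 10)

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# Closed-form least-squares slope: covariance of X,Y over variance of X
m = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
    sum((x - x_bar) ** 2 for x in X)

def sse(slope):
    """Sum of squared vertical distances, with the best intercept for this slope."""
    intercept = y_bar - slope * x_bar
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(X, Y))

# The fitted slope beats slightly steeper and shallower lines
print(sse(m) < sse(m + 0.001) and sse(m) < sse(m - 0.001))  # True
```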
32. Standard Error of the Estimate
- Is the standard deviation of data around the regression line
- Tells how much the actual values of Y deviate from the predicted values of Y
33. Standard Error of the Estimate
- After you calculate the standard error of the estimate, you add and subtract the value from your predicted values of Y to get an area around the regression line within which you would expect repeated actual values to occur or cluster if you took many samples (sort of like a sampling distribution for the mean)
34. Standard Error of Estimate
- The standard error of estimate for Y predicted by X is:
  sy/x = sqrt[ sum of (Y - predicted Y)^2 / (N - 2) ]
  where Y is each actual Y value, predicted Y is the Y value predicted by the linear regression, and N is the number of data pairs
- For the example on (Action Research p. 33/34), sy/x = sqrt(2.641 / (10 - 2)) = 0.574
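That formula translates directly to plain Python. This is a minimal sketch; the toy actual/predicted lists at the end are made up for illustration:

```python
import math

def std_error_of_estimate(y_actual, y_predicted):
    """sy/x = sqrt( sum of (Y - predicted Y)^2 / (N - 2) )"""
    n = len(y_actual)
    sse = sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))
    return math.sqrt(sse / (n - 2))

# Reproduce the book's numbers: sum of squared residuals 2.641, N = 10
print(math.sqrt(2.641 / (10 - 2)))  # ~0.5746, the slide's 0.574

# Toy example (hypothetical data): one residual of 1, N = 3
print(std_error_of_estimate([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))  # 1.0
```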
35. Standard Error of the Estimate
- So, if the standard error of the estimate is equal to 0.574, and if you have a predicted Y value of 4.560, then 68% of your actual values, with repeated sampling, would fall between 3.986 and 5.134 (predicted Y +/- 1 std error)
- The smaller the standard error, the closer your actual values are to the regression line, and the more confident you can be in your prediction
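Computing that +/- 1 standard error band directly (values taken from the slide):

```python
std_error = 0.574      # standard error of the estimate (slide value)
y_predicted = 4.560    # predicted Y (slide value)

# About 68% of repeated actual values are expected inside this band
low, high = y_predicted - std_error, y_predicted + std_error
print(round(low, 3), round(high, 3))  # 3.986 5.134
```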
36. SPSS Regression Equations
- Instead of constants called m and b, b0 and b1 are used for most equations
- The meaning of b0 and b1 varies, depending on the type of equation which is being modeled
- Can suppress the use of b0 by unchecking "Include constant in equation"
37. SPSS Regression Models
- Linear model: Y = b0 + b1*X
- Logarithmic model: Y = b0 + b1*ln(X), where ln = natural log
- Inverse model: Y = b0 + b1/X - similar to the form XY = constant, which is a hyperbola
38. SPSS Regression Models
- Power model: Y = b0*(X^b1)
- Compound model: Y = b0*(b1^X)
- A variant of this is the Logistic model, which requires a constant input u which is larger than Y for any actual data point: Y = 1 / [1/u + b0*(b1^X)]
- Where ^ indicates "to the power of"
39. SPSS Regression Models
- exp means "e to the power of"; e = 2.7182818...
- Exponential model: Y = b0*exp(b1*X)
- Other exponential functions:
- S model: Y = exp(b0 + b1/X)
- Growth model (almost identical to the exponential model): Y = exp(b0 + b1*X)
40. SPSS Regression Models
- Polynomials beyond the Linear model (linear is a first order polynomial)
- Quadratic (second order): Y = b0 + b1*X + b2*X^2
- Cubic (third order): Y = b0 + b1*X + b2*X^2 + b3*X^3
- These are the only equations which use constants b2 and b3
- Higher order polynomials require the Regression module of SPSS, which can do regression using any equation you enter
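To keep the formulas on slides 37-40 straight, here they are as plain Python functions (a reference sketch, not SPSS output; the b0/b1/b2/b3/u arguments mirror the SPSS constant names, and ^ becomes Python's ** operator):

```python
import math

models = {
    "linear":      lambda x, b0, b1: b0 + b1 * x,
    "logarithmic": lambda x, b0, b1: b0 + b1 * math.log(x),
    "inverse":     lambda x, b0, b1: b0 + b1 / x,
    "power":       lambda x, b0, b1: b0 * x ** b1,
    "compound":    lambda x, b0, b1: b0 * b1 ** x,
    "logistic":    lambda x, b0, b1, u: 1 / (1 / u + b0 * b1 ** x),
    "exponential": lambda x, b0, b1: b0 * math.exp(b1 * x),
    "s":           lambda x, b0, b1: math.exp(b0 + b1 / x),
    "growth":      lambda x, b0, b1: math.exp(b0 + b1 * x),
    "quadratic":   lambda x, b0, b1, b2: b0 + b1 * x + b2 * x ** 2,
    "cubic":       lambda x, b0, b1, b2, b3: b0 + b1 * x + b2 * x ** 2 + b3 * x ** 3,
}

print(models["linear"](3, 1, 2))        # 7 = 1 + 2*3
print(models["quadratic"](2, 1, 0, 1))  # 5 = 1 + 0*2 + 1*2^2
```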
41. Y = whattheflock?
- To help picture these equations:
- Make an X variable over some typical range (0 to 10 in a small increment, maybe 0.01)
- Define a Y variable
- Calculate the Y variable using Transform > Compute and whatever equation you want to see
- Pick values for b0 and b1 that aren't 0, 1, or 2
- Have SPSS plot the results of a regression of Y vs X for that type of equation
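The same recipe can be sketched without SPSS in plain Python (Transform > Compute becomes a list comprehension; the choice of the compound model and of b0 = 0.5, b1 = 1.7 is arbitrary):

```python
# X over a typical range, 0.01 to 10.00 in steps of 0.01
# (starting just above 0 so log and inverse models would also be defined)
xs = [i * 0.01 for i in range(1, 1001)]

b0, b1 = 0.5, 1.7                  # arbitrary constants that aren't 0, 1, or 2
ys = [b0 * b1 ** x for x in xs]    # compound model: Y = b0 * (b1 ^ X)

# Feed xs/ys to any plotting tool to see the curve's shape
print(len(xs), ys[0] < ys[-1])  # 1000 True -- the compound curve grows
```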
42. How to Apply This?
- Given a set of data containing two variables of interest, generate a scatter plot to get some idea of what the data looks like
- Choose which types of models are most likely to be useful
- For only linear models, use Analyze / Regression / Linear...
43. How to Apply This?
- Select the Independent (X) and Dependent (Y) variables
- Rules may be applied to limit the scope of the analysis, e.g. gender = 1
- Dozens of other characteristics may also be obtained, which are beyond our scope here
44. How to Apply This?
- Then check for the R Square value in the Model Summary
- Check the Coefficients to make sure they are all significant (e.g. Sig. < 0.050)
- If so, use the b0 and b1 coefficients from under the B column (see the Statistics for Software Process Improvement handout), plus or minus the standard errors (SE B)
45. Regression Example
- For example, go back to the GSS91 political.sav data set
- Generate a linear regression (Analyze > Regression > Linear) for age as the Independent variable, and partyid as the Dependent variable
- Notice that R2 and the ANOVA summary are given, with F and its significance
46. Regression Example
(SPSS Model Summary and ANOVA output shown)
47. Regression Example
- The R Square of 0.006 means there is a very slight correlation (little strength)
- But the ANOVA Significance well under 0.050 confirms there is a statistically significant relationship here - it's just a really weak one
48. Regression Example
(SPSS Coefficients output shown)
49. Regression Example
- The heart of the regression analysis is in the Coefficients section
- We could look up t on a critical values table, but it's easier to:
- See if all values of Sig are < 0.050 - if they are, reject the null hypothesis, meaning there is a significant relationship
- If so, use the values under B for b0 and b1
- If any coefficient has Sig > 0.050, don't use that regression (the coefficient might be zero)
50. Regression Example
- The answer for "what is the effect of age on political view?" is that there is a very weak but statistically significant linear relationship, with a reduction of 0.009 political-view categories per year (b1 = -0.009)
- From the Variable View of the data, since low values are liberal and large values conservative, this means that people tend to get slightly more liberal as they get older
51. Curve Estimation Example
- For the other regression options, choose Analyze / Regression / Curve Estimation
- Define the Dependent(s) and the Independent variable - note that multiple Dependents may be selected
- Check which math models you want used
- Display the ANOVA table for reference
52. Curve Estimation Example
- SPSS Tip: up to three regression models can be plotted at once, so don't select more than that if you want a scatter plot to go with the data and the regressions
- For the same example just used, get a summary for the linear and quadratic models (Analyze > Regression > Curve Estimation)
- Find R Square for each model
- Generally pick the model with the largest R Square
- Already saw the Linear output, now see the Quadratic
53. Curve Estimation Example
- For the quadratic regression, R Square is slightly higher, and the ANOVA is still significant
54. Curve Estimation Example
- The Quadratic coefficients are all significant at the 0.050 level
- Interpret as: partyid = (4.191 +/- 0.412) + (-0.048 +/- 0.018)*age + (0.0003918 +/- 0.0001754)*age^2
- Edit the data table, then double click on the cells to get the values of b2 and its std error
55. Curve Estimation Example
- The data set will be plotted as the Observed points, with the regression models shown for comparison
- Look to see which model most closely matches the data
- Look for regions of data which do or don't match the model well (if any)
56. Curve Estimation Example
(Plot of the Observed data points with the fitted regression curves shown)

57. Curve Estimation Procedure
- See which models are significant (throw out the rest!)
- Compare the R Square values to see which provides the best fit
- Use the graph to verify visually that the correct model was chosen
- Use the model equations' B values and their standard errors to describe and predict the data's behavior