Title: Part III: Correlation and Regression
1. Part III: Correlation and Regression
2. Chapter 8: Correlation and Regression
Correlation and regression techniques allow us to study and understand the relationship between two variables.
Example: Pearson's study of the heredity of height.
Data: the heights of 1078 pairs of fathers and sons were measured.
Father's height (inches)   Son's height (inches)
72                         65
68                         73
64                         60
Question: is there any relationship between a father's height and his son's height?
- It is hard to answer this by looking at 1078 pairs of data points.
- A scattergram (overhead, Fig. 1) gives a visual representation, so we get a sense of the relationship.
- The oval, football-shaped cloud of points slopes upward, indicating a positive but weak association: there is a lot of spread around the line.
- 45-degree line: points on this line are where son's height = father's height. The large spread around it shows the weakness of the relationship.
- Vertical chimney: for fathers who are 72 inches tall, the sons' heights vary a lot.
3. Chapter 8: Terminology
x (father's height), the independent variable or IV, influences y (son's height), the dependent variable or DV.
Note: think about it like this: the value of the DV depends on the value of the IV.
If there is a strong association between two variables, knowing something about one helps a lot in predicting the other. A weak association means that information about one does not help much in predicting the other.
4. Chapter 8: Correlation Coefficient
Recall that the average and standard deviation allowed us to numerically summarize a univariate distribution of scores. Similarly, the correlation coefficient allows us to numerically summarize the relationship between two variables, that is, a bivariate distribution.
- Identify the point of averages: mark the point showing the average of the x values and the average of the y values.
[Figure: scattergram with the point of averages marked at (x average, y average)]
5. Chapter 8: Correlation Coefficient
X descriptive statistics measure the horizontal spread (variability) of the cloud, i.e., from side to side. Note: this is the standard deviation of x.
[Figure: horizontal spread of the cloud, marked from -2 to +2 SDs around the x average]
6. Chapter 8: Correlation Coefficient
Y descriptive statistics measure the vertical spread (variability) of the cloud, i.e., from top to bottom. Note: this is the standard deviation of y.
[Figure: vertical spread of the cloud, marked from -2 to +2 SDs around the y average]
7. Chapter 8: Correlation and Regression
Summary statistics obtained thus far:
- average of x, SD of x
- average of y, SD of y
They tell us where the center of the cloud is and how spread out it is, but they still give no information about the strength of the relationship between x and y.
Note: two distributions can have the same center and the same vertical and horizontal spread, yet differ in how tightly the points cluster:
- loose clustering: correlation near 0
- tight clustering: correlation near 1
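As a rough illustration of this note, the sketch below pairs the x values from the deck's worked example (1, 3, 4, 5, 7) with three orderings of the same five y values. Only `y_slide` comes from the slides; `y_loose` and `y_tight` are hypothetical rearrangements made up for this example. The helper name `correlation` is also an assumption. It uses the average-of-products recipe for r that the deck describes later, with the SD computed using divisor n (`statistics.pstdev`), which is the choice that reproduces the products shown on the slides.

```python
from statistics import fmean, pstdev


def correlation(xs, ys):
    """Average of the products of the standard-unit scores (SD uses divisor n)."""
    x_avg, y_avg = fmean(xs), fmean(ys)
    x_sd, y_sd = pstdev(xs), pstdev(ys)
    products = [((x - x_avg) / x_sd) * ((y - y_avg) / y_sd)
                for x, y in zip(xs, ys)]
    return fmean(products)


x = [1, 3, 4, 5, 7]
y_slide = [5, 9, 7, 1, 13]   # y values as paired on the slides
y_loose = [9, 1, 7, 13, 5]   # hypothetical re-pairing of the same values
y_tight = [1, 5, 7, 9, 13]   # hypothetical re-pairing, in increasing order

for label, y in (("slide", y_slide), ("loose", y_loose), ("tight", y_tight)):
    # The average and SD of y are identical in all three cases; only r changes.
    print(label, fmean(y), pstdev(y), round(correlation(x, y), 2))
```

All three versions share the same point of averages and the same horizontal and vertical SDs, so those four numbers alone cannot tell them apart; only r does.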
8. Chapter 8: Correlation and Regression
Before computing r, we first focus on its graphical interpretation (overhead, Figs. 6 and 7).
- r is a measure of linear association: a summary statistic denoting the strength of the relationship between x and y.
- r ranges from -1.0 to +1.0.
- An absolute value of 1.0 is a perfect correlation.
- r = 0 means no linear relationship.
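To make the two endpoints concrete, here is a minimal sketch in which y is an exact straight-line function of x. The x values are borrowed from the deck's worked example; the two lines (y = 2x + 3 and y = -2x + 3) and the helper name `correlation` are assumptions chosen only for illustration. r is computed as the average of the products of standard-unit scores, with the SD taken with divisor n.

```python
from statistics import fmean, pstdev


def correlation(xs, ys):
    """Average of the products of the standard-unit scores (SD uses divisor n)."""
    x_avg, y_avg = fmean(xs), fmean(ys)
    x_sd, y_sd = pstdev(xs), pstdev(ys)
    return fmean(((x - x_avg) / x_sd) * ((y - y_avg) / y_sd)
                 for x, y in zip(xs, ys))


x = [1, 3, 4, 5, 7]
y_up = [2 * v + 3 for v in x]     # hypothetical: every point on a rising line
y_down = [-2 * v + 3 for v in x]  # hypothetical: every point on a falling line

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
print(r_up, r_down)
```

When every point lies exactly on a rising line, r reaches its upper limit; on a falling line, it reaches its lower limit.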
9. Chapter 8: The SD Line for a (+) r
Points in a scattergram usually cluster around the SD line. The SD line runs through
(1) the point of averages, and
(2) all the points that are an equal number of SDs away from the average of x and the average of y.
[Figure: SD line sloping upward through the point of averages, rising one SD of y for each SD of x]
10. Chapter 8: The SD Line for a (+) r (cont.)
[Figure: SD line through the point of averages and through the point 2 SDs of x and 2 SDs of y away from it]
11. Chapter 8: The SD Line for a (-) r
[Figure: SD line sloping downward through the point of averages, falling one SD of y for each SD of x]
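12. Chapter 8: The SD Line
[Figure: SD line through the point of averages, with a run of one SD of x and a rise of one SD of y]
Note: the line climbs at the rate of one vertical SD for each horizontal SD.
Slope of the SD line $= +\dfrac{\text{SD of } y}{\text{SD of } x}$ (positive r) or $-\dfrac{\text{SD of } y}{\text{SD of } x}$ (negative r).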
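The slope rule can be checked numerically. The sketch below is only an illustration: it borrows the five (x, y) pairs from the deck's worked example, assumes the SD is computed with divisor n (`statistics.pstdev`), and uses the hypothetical helper name `sd_line`. Because r for these data is positive (0.40 on the slides), the slope takes the + sign.

```python
from statistics import fmean, pstdev

x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]

x_avg, y_avg = fmean(x), fmean(y)
x_sd, y_sd = pstdev(x), pstdev(y)  # SD with divisor n

# Positive r, so the slope is +(SD of y) / (SD of x).
slope = y_sd / x_sd


def sd_line(x_value):
    """Height of the SD line at x_value; the line passes through the point of averages."""
    return y_avg + slope * (x_value - x_avg)


# Points that are the same number of SDs (k) from both averages lie on the line.
for k in (-2, -1, 0, 1, 2):
    px, py = x_avg + k * x_sd, y_avg + k * y_sd
    print(k, (px, py), sd_line(px) == py)
```

For a negative r the only change would be the sign of `slope`.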
13. Computing the correlation coefficient
Step 1: Convert each variable to standard units.
Recall: $z = \dfrac{\text{score} - \text{average}}{\text{standard deviation}}$, where z is the score in standard units.
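Step 1 can be sketched in a few lines. The function name `to_standard_units` is an assumption for this example, the data are the five pairs from the deck's worked example, and the SD is taken with divisor n (`statistics.pstdev`), which is the choice consistent with the products shown on the slides.

```python
from statistics import fmean, pstdev


def to_standard_units(values):
    """Convert each score to z = (score - average) / SD, with SD using divisor n."""
    avg = fmean(values)
    sd = pstdev(values)
    return [(v - avg) / sd for v in values]


x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]

z_x = to_standard_units(x)  # [-1.5, -0.5, 0.0, 0.5, 1.5]
z_y = to_standard_units(y)  # [-0.5, 0.5, 0.0, -1.5, 1.5]
print(z_x)
print(z_y)
```

A score equal to the average always converts to 0, and a score one SD above the average converts to +1.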
14. Computing the correlation coefficient
Step 2: Calculate the average of the products of the z scores.
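Putting both steps together for the deck's five data points gives the products and the value of r shown on the slides. This is a minimal sketch rather than a definitive implementation: the names `to_standard_units`, `products`, and `r` are assumptions, and the SD is taken with divisor n.

```python
from statistics import fmean, pstdev


def to_standard_units(values):
    """Step 1: z = (score - average) / SD, with SD using divisor n."""
    avg = fmean(values)
    sd = pstdev(values)
    return [(v - avg) / sd for v in values]


x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]

# Step 2: multiply the paired z scores, then average the products.
products = [zx * zy for zx, zy in zip(to_standard_units(x), to_standard_units(y))]
r = fmean(products)

print(products)  # [0.75, -0.25, 0.0, -0.75, 2.25]
print(r)
```

The products sum to 2.0 over five points, so r = 0.40, matching the slides.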
15. Computing the correlation coefficient
Why and how does this computation work?
Graph the (x, y) points and mark each with its product value. Draw two lines through the point of averages; they split the graph into 4 quadrants.
x:         1      3      4      5      7
y:         5      9      7      1     13
Products:  0.75  -0.25   0.00  -0.75   2.25
[Figure: scattergram of the five points, each labeled with its product]
16. Computing the correlation coefficient
This produces 4 quadrants.
Quadrant I: both x and y are below average, so both have negative standard-unit scores, whose products are positive.
Quadrant III: both x and y are above average, so both have positive standard-unit scores, whose products are positive.
For the five points (x = 1, 3, 4, 5, 7; y = 5, 9, 7, 1, 13), the average of x is 4, the average of y is 7, and the average of the products is r = 0.40.
[Figure: scattergram with Quadrants I and III marked]
17. Computing the correlation coefficient
Quadrant II: x is below average and has negative standard-unit scores; y is above average and has positive standard-unit scores; the products are negative.
Quadrant IV: x is above average and has positive standard-unit scores; y is below average and has negative standard-unit scores; the products are negative.
[Figure: the four quadrants around the point of averages, with products positive (+) in Quadrants I and III and negative (-) in Quadrants II and IV]
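The quadrant argument can be traced point by point. The sketch below follows the quadrant labels used on these slides (I: both below average, II: x below and y above, III: both above, IV: x above and y below). The function names are assumptions, the SD uses divisor n, and a point that sits exactly on one of the two dividing lines is reported separately because its product is 0.

```python
from statistics import fmean, pstdev


def to_standard_units(values):
    """z = (score - average) / SD, with SD using divisor n."""
    avg = fmean(values)
    sd = pstdev(values)
    return [(v - avg) / sd for v in values]


def quadrant(z_x, z_y):
    """Quadrant label in the slides' numbering, from the signs of the z scores."""
    if z_x == 0 or z_y == 0:
        return "on a dividing line"
    if z_x < 0 and z_y < 0:
        return "I"    # both below average: product positive
    if z_x < 0 and z_y > 0:
        return "II"   # x below, y above: product negative
    if z_x > 0 and z_y > 0:
        return "III"  # both above average: product positive
    return "IV"       # x above, y below: product negative


x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]

labels = [quadrant(zx, zy)
          for zx, zy in zip(to_standard_units(x), to_standard_units(y))]
for point, label in zip(zip(x, y), labels):
    print(point, label)
```

Here the two points in Quadrants I and III contribute 0.75 and 2.25, which outweigh the -0.25 and -0.75 from Quadrants II and IV, so the average product is positive.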
18. Computing the correlation coefficient
Positive Correlation
If all the points fall in the two positive quadrants (Quadrants I and III), the correlation is positive.
[Figure: points falling in the positive quadrants, I and III]
19. Computing the correlation coefficient
Negative Correlation
If all the points fall in the two negative quadrants (Quadrants II and IV), the correlation is negative.
[Figure: points falling in the negative quadrants, II and IV]
20. Chapter 8: Review
- The relationship between two variables can be represented by a scatter diagram.
- Tight clustering around a straight line indicates a strong linear relationship.
- A scatter diagram can be summarized by
  - the average and SD of x
  - the average and SD of y
  - r
- Positive association means
  - that as x increases, y also increases
  - or as x decreases, y also decreases
  - it is indicated by a + sign, and the line slopes upward
- Negative association means
  - that as x increases, y decreases
  - or as x decreases, y increases
  - it is indicated by a - sign, and the line slopes downward
- r ranges from -1.0 (perfect negative relationship) to +1.0 (perfect positive relationship).
21. Chapter 8: Review
- The SD line
  - goes through the point of averages
  - with a + r, the slope of the line is $+\dfrac{\text{SD of } y}{\text{SD of } x}$
  - with a - r, the slope of the line is $-\dfrac{\text{SD of } y}{\text{SD of } x}$
- To calculate r
  - convert x and y to z scores
  - compute the product of the z scores for each pair
  - take the average of these products