Title: Correlation and Regression
Slide 1: Correlation and Regression
Chapter 9
Elementary Statistics, Larson/Farber
Slide 2: Section 9.1
Correlation
Slide 3: Correlation
A relationship between two variables:
x, the explanatory (independent) variable
y, the response (dependent) variable

Explanatory variable (x)      Response variable (y)
Hours of Training             Number of Accidents
Shoe Size                     Height
Cigarettes smoked per day     Lung Capacity
Score on SAT                  Grade Point Average
Height                        IQ

What type of relationship exists between the two
variables and is the correlation significant?
Slide 4: Scatter Plots and Types of Correlation
x = hours of training, y = number of accidents
[Scatter plot: Accidents (0 to 60) versus Hours of Training (0 to 20)]
Negative correlation: as x increases, y decreases.
Slide 5: Scatter Plots and Types of Correlation
x = SAT score, y = GPA
[Scatter plot: GPA (1.50 to 4.00) versus Math SAT (300 to 800)]
Positive correlation: as x increases, y increases.
Slide 6: Scatter Plots and Types of Correlation
x = height, y = IQ
[Scatter plot: IQ (80 to 160) versus Height (60 to 80)]
No linear correlation.
Slide 7: Correlation Coefficient
The correlation coefficient r is a measure of the strength and direction of a linear relationship between two variables.
The range of r is from -1 to 1.
If r is close to 1, there is a strong positive correlation.
If r is close to -1, there is a strong negative correlation.
If r is close to 0, there is no linear correlation.
Slide 8: Application
x = absences, y = final grade

Absences (x)    Final Grade (y)
 8              78
 2              92
 5              90
12              58
15              43
 9              74
 6              81

[Scatter plot: Final Grade (40 to 95) versus Absences (0 to 16)]
Slide 9: Computation of r

        x     y     xy    x^2    y^2
1       8    78    624     64   6084
2       2    92    184      4   8464
3       5    90    450     25   8100
4      12    58    696    144   3364
5      15    43    645    225   1849
6       9    74    666     81   5476
7       6    81    486     36   6561
Sum    57   516   3751    579  39898
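
The column sums above can be combined into r with the usual computational formula, $r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$. The following is a minimal Python sketch of that calculation for the absences and final-grade data; the variable names are illustrative and do not come from the slides.

```python
import math

# Absences (x) and final grades (y) from the Application slide.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
n = len(x)

# Column sums of the computation table.
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Computational formula for the sample correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 57 516 3751 579 39898
print(round(r, 3))                           # -0.975
```

The negative sign agrees with the downward trend in the scatter plot: more absences go with lower final grades.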
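Slide 10: Hypothesis Test for Significance
r is the correlation coefficient for the sample.
The correlation coefficient for the population is $\rho$ (rho).
For a two-tailed test for significance:
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
For left-tailed and right-tailed tests of negative or positive significance:
$H_0: \rho \geq 0$, $H_a: \rho < 0$ (left-tailed)
$H_0: \rho \leq 0$, $H_a: \rho > 0$ (right-tailed)
The sampling distribution for r is a t-distribution with n - 2 d.f.
Standardized test statistic:
$t = \frac{r}{\sigma_r} = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$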
Slide 11: Test of Significance
You found the correlation between the number of times absent and a final grade to be r = -0.975. There were seven pairs of data. Test the significance of this correlation. Use $\alpha = 0.01$.
1. Write the null and alternative hypotheses.
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
2. State the level of significance.
$\alpha = 0.01$
3. Identify the sampling distribution.
A t-distribution with 5 degrees of freedom
Slide 12: Rejection Regions
[Figure: t-distribution centered at 0 with rejection regions beyond the critical values ±t0]
4. Find the critical value.
5. Find the rejection region.
6. Find the test statistic.
Slide 13
[Figure: t-distribution with critical values -4.032 and 4.032; the rejection regions are t < -4.032 and t > 4.032]
7. Make your decision.
t = -9.811 falls in the rejection region. Reject the null hypothesis.
8. Interpret your decision.
There is a significant correlation between the number of times absent and final grades.
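
The test can be checked numerically. The sketch below assumes the standard test statistic for a correlation coefficient, $t = r / \sqrt{(1 - r^2)/(n - 2)}$, and takes the critical value 4.032 from the slides as a table lookup rather than computing it. The variable names are illustrative.

```python
import math

r = -0.975          # sample correlation coefficient from the slides
n = 7               # number of data pairs
critical_t = 4.032  # two-tailed critical value, alpha = 0.01, 5 d.f. (table value from the slides)

# Standardized test statistic: t = r / sqrt((1 - r^2) / (n - 2))
t = r / math.sqrt((1 - r ** 2) / (n - 2))

# Two-tailed test: reject H0 when t lies beyond either critical value.
reject_null = abs(t) > critical_t

print(round(t, 2), reject_null)  # -9.81 True
```

Because the test is two-tailed, the comparison uses the absolute value of t against the positive critical value.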
Slide 14: Section 9.2
Linear Regression
Slide 15: The Line of Regression
Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line.
The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.
The line of regression is
$\hat{y} = mx + b$
The slope m is
$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$
The y-intercept is
$b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\frac{\sum x}{n}$
Slide 16
[Scatter plot with regression line: revenue (180 to 260) versus Ad (1.5 to 3.0)]
$(x_i, y_i)$: a data point
$(x_i, \hat{y}_i)$: a point on the line with the same x-value
$d_i = y_i - \hat{y}_i$: a residual
Slide 17
Write the equation of the line of regression with x = number of absences and y = final grade.

        x     y     xy    x^2    y^2
1       8    78    624     64   6084
2       2    92    184      4   8464
3       5    90    450     25   8100
4      12    58    696    144   3364
5      15    43    645    225   1849
6       9    74    666     81   5476
7       6    81    486     36   6561
Sum    57   516   3751    579  39898

Calculate m and b.
The line of regression is
$\hat{y} = -3.924x + 105.667$
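
As a check on m and b, here is a minimal Python sketch that applies the standard least-squares formulas, $m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$ and $b = \bar{y} - m\bar{x}$, to the table above. The variable names are illustrative.

```python
# Absences (x) and final grades (y) from the slides.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

# Slope: m = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: b = y_bar - m * x_bar
b = sum_y / n - m * sum_x / n

print(round(m, 3), round(b, 3))  # -3.924 105.668
```

The unrounded intercept comes out near 105.668. The slides' 105.667 presumably results from rounding m and the means before computing b.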
Slide 18: The Line of Regression
m = -3.924 and b = 105.667
The line of regression is
$\hat{y} = -3.924x + 105.667$
[Scatter plot with regression line: Final Grade (40 to 95) versus Absences]
Note that the point $(\bar{x}, \bar{y}) = (8.143, 73.714)$ is on the line.
Slide 19: Predicting y Values
The regression line can be used to predict values of y for values of x falling within the range of the data.
The regression equation for number of times absent and final grade is
$\hat{y} = -3.924x + 105.667$
Use this equation to predict the expected grade for a student with (a) 3 absences (b) 12 absences.
(a) $\hat{y} = -3.924(3) + 105.667 = 93.895$
(b) $\hat{y} = -3.924(12) + 105.667 = 58.579$
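
The two predictions can be reproduced with a small helper. This is a sketch only: the function name `predict_grade` and the range check are illustrative choices, with the limits 2 and 15 taken as the smallest and largest number of absences in the data.

```python
M = -3.924             # slope from the slides
B = 105.667            # y-intercept from the slides
X_MIN, X_MAX = 2, 15   # smallest and largest number of absences in the data


def predict_grade(absences):
    """Predict a final grade; refuse x-values outside the observed range."""
    if not X_MIN <= absences <= X_MAX:
        raise ValueError("x is outside the range of the data")
    return M * absences + B


print(round(predict_grade(3), 3), round(predict_grade(12), 3))  # 93.895 58.579
```

Raising an error for values such as 20 absences reflects the restriction that the line should only be used within the range of the data.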
Slide 20: Section 9.3
Measures of Regression and Correlation
Slide 21: The Coefficient of Determination
The coefficient of determination, $r^2$, is the ratio of explained variation in y to the total variation in y.
The correlation coefficient of number of times absent and final grade is r = -0.975. The coefficient of determination is $r^2 = (-0.975)^2 = 0.9506$.
Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.
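
The ratio in the definition can also be computed directly from the data. The sketch below assumes the usual definitions, explained variation $\sum(\hat{y}_i - \bar{y})^2$ and total variation $\sum(y_i - \bar{y})^2$, and fits the least-squares line itself rather than using the rounded coefficients. Variable names are illustrative.

```python
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept (deviation form).
m = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum((a - x_bar) ** 2 for a in x)
b0 = y_bar - m * x_bar

# Explained variation uses the predicted values; total variation uses the observed ones.
y_hat = [m * a + b0 for a in x]
explained = sum((p - y_bar) ** 2 for p in y_hat)
total = sum((b - y_bar) ** 2 for b in y)

r_squared = explained / total
print(round(r_squared, 4))
```

The unrounded ratio is about 0.950, slightly below the 0.9506 obtained by squaring the rounded value r = -0.975. Both support the same "about 95%" interpretation.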
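Slide 22: The Standard Error of Estimate
The standard error of estimate, $s_e$, is the standard deviation of the observed y-values about the predicted value $\hat{y}$ for a given x-value.
$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$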
Slide 23: The Standard Error of Estimate
Calculate $(y - \hat{y})^2$ for each x.

        x     y        ŷ    (y - ŷ)^2
1       8    78   74.275    13.8756
2       2    92   97.819    33.8608
3       5    90   86.047    15.6262
4      12    58   58.579     0.3352
5      15    43   46.807    14.4932
6       9    74   70.351    13.3152
7       6    81   82.123     1.2611
Sum                         92.767

$s_e = \sqrt{\frac{92.767}{5}} = 4.307$
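
The table and the final value can be reproduced with a few lines of Python. This sketch assumes the standard formula $s_e = \sqrt{\sum (y_i - \hat{y}_i)^2 / (n - 2)}$ and uses the rounded regression coefficients from the slides so that the intermediate values match the table. Variable names are illustrative.

```python
import math

M, B = -3.924, 105.667   # rounded regression coefficients from the slides
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
n = len(x)

# Squared residuals (y - y_hat)^2, one per data point.
squared_residuals = [(yi - (M * xi + B)) ** 2 for xi, yi in zip(x, y)]
sse = sum(squared_residuals)

# Standard error of estimate with n - 2 degrees of freedom.
s_e = math.sqrt(sse / (n - 2))

print(round(sse, 3), round(s_e, 3))  # 92.767 4.307
```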
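Slide 24: Prediction Intervals
Given a specific linear regression equation and $x_0$, a specific value of x, a c-prediction interval for y is
$\hat{y} - E < y < \hat{y} + E$
where
$E = t_c s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - (\sum x)^2}}$
The point estimate is $\hat{y}$ and E is the maximum error of estimate.
Use a t-distribution with n - 2 degrees of freedom.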
Slide 25: Application
Construct a 90% prediction interval for a final grade when a student has been absent 6 times.
1. Find the point estimate.
$\hat{y} = -3.924(6) + 105.667 = 82.123$
The point (6, 82.123) is the point on the regression line with x-coordinate of 6.
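Slide 26: Application
Construct a 90% prediction interval for a final grade when a student has been absent 6 times.
2. Find E.
$E = t_c s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - (\sum x)^2}} = 2.015(4.307)\sqrt{1 + \frac{1}{7} + \frac{7(6 - 8.143)^2}{7(579) - (57)^2}} \approx 9.438$
At the 90% level of confidence, the maximum error of estimate is 9.438.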
Slide 27: Application
Construct a 90% prediction interval for a final grade when a student has been absent 6 times.
3. Find the endpoints.
$\hat{y} - E = 82.123 - 9.438 = 72.685$
$\hat{y} + E = 82.123 + 9.438 = 91.561$
$72.685 < y < 91.561$
When x = 6, the 90% prediction interval is from 72.685 to 91.561.
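
The three steps can be combined into one function. The sketch below assumes the standard margin for predicting a single y-value in simple linear regression, $E = t_c s_e \sqrt{1 + 1/n + n(x_0 - \bar{x})^2 / (n\sum x^2 - (\sum x)^2)}$, and takes $t_c = 2.015$ as a table value (90%, 5 d.f.) rather than computing it. The function name `prediction_interval` is illustrative.

```python
import math

# Values taken from the slides; T_C is a t-table lookup for 90%, 5 d.f.
M, B = -3.924, 105.667
S_E = 4.307
T_C = 2.015
N, SUM_X, SUM_X2 = 7, 57, 579


def prediction_interval(x0):
    """Return (lower, upper) endpoints of the prediction interval at x0."""
    y_hat = M * x0 + B
    x_bar = SUM_X / N
    margin = T_C * S_E * math.sqrt(
        1 + 1 / N + N * (x0 - x_bar) ** 2 / (N * SUM_X2 - SUM_X ** 2)
    )
    return y_hat - margin, y_hat + margin


low, high = prediction_interval(6)
print(round(low, 2), round(high, 2))  # 72.68 91.56
```

The interval widens as $x_0$ moves away from $\bar{x}$, because the $(x_0 - \bar{x})^2$ term under the square root grows.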
Slide 28: Minitab Output
Regression Analysis
The regression equation is
y = 106 - 3.92 x

Predictor      Coef     StDev        T        P
Constant    105.668     3.655    28.91    0.000
x           -3.9241    0.4019    -9.76    0.000

S = 4.307    R-Sq = 95.0%    R-Sq(adj) = 94.0%
Slide 29: Section 9.4
Multiple Regression
Slide 30: More Explanatory Variables

Absence     IQ    Grade
 8         115    78
 2         135    92
 5         126    90
12         110    58
15         105    43
 9         120    74
 6         125    81
Slide 31: Minitab Output
Regression Analysis
The regression equation is
Grade = 52.7 - 2.65 absence + 0.357 IQ

Predictor      Coef     StDev        T        P
Constant     52.720    86.110     0.61    0.573
Absence      -2.652     2.111    -1.26    0.277
IQ            0.357     0.580     0.62    0.571

S = 4.603    R-Sq = 95.4%    R-Sq(adj) = 93.2%
Slide 32: Interpretation
The regression equation is Grade = 52.7 - 2.65 absence + 0.357 IQ
When the number of absences and IQ are both 0, the predicted grade is 52.7.
If IQ is held constant, each additional absence decreases the predicted grade by 2.65 points.
If the number of absences is held constant and IQ is increased by one point, the predicted grade will increase by 0.357 points.
Slide 33: Predicting the Response Variable
The regression equation is Grade = 52.7 - 2.65 absence + 0.357 IQ
Use the regression equation to predict a grade when a student is absent 5 times and has an IQ of 125.
Grade = 52.7 - 2.65(5) + 0.357(125) = 84.075 (about 84)
Use the regression equation to predict a grade when a student is absent 9 times and has an IQ of 120.
Grade = 52.7 - 2.65(9) + 0.357(120) = 71.69 (about 72)
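
Evaluating a multiple regression equation is a matter of substituting each explanatory value and keeping the signs straight. A minimal Python sketch using the coefficients reported in the Minitab output (the function name `predict_grade` is illustrative):

```python
def predict_grade(absences, iq):
    """Evaluate Grade = 52.7 - 2.65*absence + 0.357*IQ from the slides."""
    return 52.7 - 2.65 * absences + 0.357 * iq


# The two students from the slide: (absences, IQ).
for absences, iq in [(5, 125), (9, 120)]:
    print(absences, iq, round(predict_grade(absences, iq), 3))
```

For the first student the terms are 52.7 - 13.25 + 44.625, and for the second 52.7 - 23.85 + 42.84.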