Title: Simple Linear Regression: An Introduction
1. Simple Linear Regression: An Introduction
- Dr. Tuan V. Nguyen 
- Garvan Institute of Medical Research 
- Sydney
2. "Give a man three weapons - correlation, regression and a pen - and he will use all three." (Anon, 1978)
3. An example

Age and cholesterol levels in 18 individuals:

ID  Age  Chol (mg/ml)
 1   46   3.5
 2   20   1.9
 3   52   4.0
 4   30   2.6
 5   57   4.5
 6   25   3.0
 7   28   2.9
 8   36   3.8
 9   22   2.1
10   43   3.8
11   57   4.1
12   33   3.0
13   22   2.5
14   63   4.6
15   40   3.2
16   48   4.2
17   28   2.3
18   49   4.0
4. Read data into R

id <- seq(1, 18)
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
         43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
plot(chol ~ age, pch = 16)
5. [Figure: scatter plot of cholesterol versus age produced by the plot() call above.]
6. Questions of interest
- Association between age and cholesterol levels 
- Strength of association 
- Prediction of cholesterol for a given age
Correlation and regression analysis address these questions.
7. Variance and covariance: algebra

- Let x and y be two random variables from a sample of n observations.
- Measure of variability of x and y: variance.
- Measure of covariation between x and y: covariance.
- Algebraically:
  var(x + y) = var(x) + var(y) + 2cov(x, y)
  var(x - y) = var(x) + var(y) - 2cov(x, y)
- where var(x) = Σ(xi - x̄)² / (n - 1) and cov(x, y) = Σ(xi - x̄)(yi - ȳ) / (n - 1).
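These identities can be checked numerically in R; a minimal sketch with two arbitrary vectors (var() and cov() are the built-in sample versions with the n - 1 denominator):

x <- c(1, 3, 5, 7, 9)
y <- c(2, 1, 4, 3, 6)
# var(x + y) equals var(x) + var(y) + 2*cov(x, y)
var(x + y)
var(x) + var(y) + 2 * cov(x, y)
# var(x - y) equals var(x) + var(y) - 2*cov(x, y)
var(x - y)
var(x) + var(y) - 2 * cov(x, y)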
8. Variance and covariance: geometry

- The independence or dependence between x and y can be represented geometrically, with x and y as the lengths of two vectors joined at an angle θ (law of cosines):

  h² = x² + y² - 2xy·cos(θ)   (θ ≠ 90°: x and y are dependent)
  h² = x² + y²                (θ = 90°: x and y are independent)

[Figure: two triangles with sides x and y, one with a right angle and one without.]
9. Meaning of variance and covariance

- Variance is always positive.
- If covariance = 0, x and y are uncorrelated (no linear association).
- Covariance is a sum of cross-products and can be positive or negative.
- Negative covariance: deviations in the two distributions are in opposite directions, e.g. genetic covariation.
- Positive covariance: deviations in the two distributions are in the same direction.
- Covariance is therefore a measure of the strength of association.
10. Covariance and correlation

- Covariance is unit-dependent.
- The coefficient of correlation (r) between x and y is a standardized covariance.
- r is defined by:
  r = cov(x, y) / (sx · sy)
  where sx and sy are the standard deviations of x and y.
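As a check, this definition can be computed directly in R and compared with the built-in cor(); the sketch assumes the age and chol vectors from slide 4 are in the workspace:

# Correlation as covariance standardized by the two standard deviations
r <- cov(age, chol) / (sd(age) * sd(chol))
r
cor(age, chol)   # built-in Pearson correlation gives the same value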
11. Positive and negative correlation

[Figure: two scatter plots, one showing r = 0.9 (positive correlation) and one showing r = -0.9 (negative correlation).]
12. Test of hypothesis of correlation

- Hypothesis: H0: r = 0 versus H1: r ≠ 0.
- The standard error of r is SE(r) = sqrt((1 - r²) / (n - 2)).
- The t-statistic is t = r / SE(r).
- This statistic has a t distribution with n - 2 degrees of freedom.
- Fisher's z-transformation: z = ½·ln[(1 + r) / (1 - r)].
- The standard error of z is SE(z) = 1 / sqrt(n - 3).
- Then a 95% CI for z can be constructed as z ± 1.96 / sqrt(n - 3).
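A sketch of these calculations in R, compared against the built-in cor.test(); the age and chol vectors from slide 4 are assumed:

n <- length(age)
r <- cor(age, chol)
se.r <- sqrt((1 - r^2) / (n - 2))       # standard error of r
t.stat <- r / se.r                      # t with n - 2 df
2 * pt(-abs(t.stat), df = n - 2)        # two-sided p-value
# Fisher's z-transformation and 95% CI on the z scale
z <- 0.5 * log((1 + r) / (1 - r))
se.z <- 1 / sqrt(n - 3)
z + c(-1.96, 1.96) * se.z
# Built-in equivalent
cor.test(age, chol)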
13. An illustration of correlation analysis

ID  Age (x)  Cholesterol (y, mg/100ml)
 1   46   3.5
 2   20   1.9
 3   52   4.0
 4   30   2.6
 5   57   4.5
 6   25   3.0
 7   28   2.9
 8   36   3.8
 9   22   2.1
10   43   3.8
11   57   4.1
12   33   3.0
13   22   2.5
14   63   4.6
15   40   3.2
16   48   4.2
17   28   2.3
18   49   4.0

Cov(x, y) = 10.68

t-statistic = 0.56 / 0.26 = 2.17. The critical t-value with 17 df and alpha = 5% is 2.11.
Conclusion: there is a significant association between age and cholesterol.
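The covariance above can be reproduced with the built-in cov(), or by the cross-product formula from slide 7 (age and chol as on slide 4):

cov(age, chol)   # sample covariance, about 10.68
# The same value by hand from the sum of cross-products
sum((age - mean(age)) * (chol - mean(chol))) / (length(age) - 1)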
14. Simple linear regression analysis

- Only two variables are of interest: one response variable and one predictor variable.
- No adjustment is needed for confounders or covariates.
- Assessment: quantify the relationship between the two variables.
- Prediction: make predictions and validate a test.
- Control: adjust for confounding effects (in the case of multiple variables).
15. Relationship between age and cholesterol

[Figure: scatter plot of cholesterol versus age.]
16. Linear regression model

- Y = random variable representing a response.
- X = random variable representing a predictor variable (predictor, risk factor).
- Both Y and X can be categorical variables (e.g., yes/no) or continuous variables (e.g., age).
- If Y is categorical, the model is a logistic regression model; if Y is continuous, a simple linear regression model.
- Model:
  Y = α + βX + ε
- α = intercept
- β = slope / gradient
- ε = random error (variation between subjects in Y even if X is constant, e.g., variation in cholesterol for patients of the same age).
17. Linear regression assumptions

- The relationship is linear in terms of the parameters.
- X is measured without error.
- The values of Y are independent of each other (e.g., Y1 is not correlated with Y2).
- The random error term (ε) is normally distributed with mean 0 and constant variance.
18. Expected value and variance

- If the assumptions are tenable, then:
- The expected value of Y is E(Y | x) = α + βx.
- The variance of Y is var(Y) = var(ε) = σ².
19. Estimation of model parameters

Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive an equation connecting the points.

Gradient: m = (y2 - y1) / (x2 - x1) = dy / dx
Equation: y = mx + a, where a is the intercept at x = 0.

What happens if we have more than 2 points?

[Figure: line through A(x1, y1) and B(x2, y2), showing the gradient dy/dx and the intercept a.]
20. Estimation of a and b

- For a series of pairs (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn):
- Let a and b be sample estimates of the parameters α and β.
- We have a sample equation: Ŷ = a + bx.
- Aim: find the values of a and b so that Σ(Y - Ŷ)² is minimal.
- Let SSE = Σ(yi - a - bxi)².
- The values of a and b that minimise SSE are called least squares estimates.
21. Criteria of estimation

[Figure: scatter plot of Chol versus Age with a fitted line; d is the vertical distance from each observed yi to the line.]

The goal of the least squares estimator (LSE) is to find a and b such that the sum of the d² is minimal.
22. Estimation of a and b

- After some calculus, the estimates can be shown to be (checked numerically in the sketch after this slide):

  b = Sxy / Sxx
  a = ȳ - b·x̄

  where Sxy = Σ(xi - x̄)(yi - ȳ) and Sxx = Σ(xi - x̄)².

- When the regression assumptions are valid, the estimators of α and β have the following properties:
- Unbiased
- Uniformly minimal variance (i.e., efficient)
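A sketch of these closed-form estimates in R, checked against lm() (age and chol as on slide 4):

Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))
b <- Sxy / Sxx                    # least squares slope
a <- mean(chol) - b * mean(age)   # least squares intercept
c(a = a, b = b)
coef(lm(chol ~ age))              # should agree with a and b above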
23. Goodness-of-fit

- Now we have the equation Y = a + bX + e.
- Question: how well does the regression equation describe the actual data?
- Answer: the coefficient of determination (R²), the proportion of the variation in Y explained by the variation in X.
24. Partitioning of variations: concept

- SST = sum of squared differences between yi and the mean of y.
- SSR = sum of squared differences between the predicted values of y and the mean of y.
- SSE = sum of squared differences between the observed and predicted values of y.
- SST = SSR + SSE
- Then the coefficient of determination is:
  R² = SSR / SST
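A sketch verifying this partition in R for the cholesterol data (age and chol as on slide 4):

reg <- lm(chol ~ age)
yhat <- fitted(reg)
SST <- sum((chol - mean(chol))^2)        # total variation
SSR <- sum((yhat - mean(chol))^2)        # variation explained by the model
SSE <- sum((chol - yhat)^2)              # residual variation
c(SST = SST, SSR.plus.SSE = SSR + SSE)   # the two should match
SSR / SST                                # coefficient of determination R^2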
25. Partitioning of variations: geometry

[Figure: scatter plot of Chol (Y) versus Age (X) with the fitted line and the mean of Y; the vertical distances illustrate SST (observed to mean), SSR (fitted to mean) and SSE (observed to fitted).]
26. Partitioning of variations: algebra

- Some statistics:
- Total variation: SST = Σ(yi - ȳ)²
- Attributed to the model: SSR = Σ(ŷi - ȳ)²
- Residual sum of squares: SSE = Σ(yi - ŷi)²
- SST = SSR + SSE
- SSR = SST - SSE
27. Analysis of variance

- SS increases in proportion to sample size (n).
- Mean squares (MS) normalise for degrees of freedom (df):
- MSR = SSR / p (where p = the degrees of freedom for the regression, i.e. the number of predictors)
- MSE = SSE / (n - p - 1)
- MST = SST / (n - 1)
- Analysis of variance (ANOVA) table:

Source      d.f.       Sum of squares (SS)  Mean squares (MS)  F-test
Regression  p          SSR                  MSR                MSR/MSE
Residual    n - p - 1  SSE                  MSE
Total       n - 1      SST
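A sketch reproducing the mean squares and F-test by hand for the cholesterol data (here p = 1 predictor); the result can be compared with anova(reg) on slide 36:

reg <- lm(chol ~ age)
n <- length(chol); p <- 1                   # one predictor
SSR <- sum((fitted(reg) - mean(chol))^2)
SSE <- sum(resid(reg)^2)
MSR <- SSR / p
MSE <- SSE / (n - p - 1)
c(MSR = MSR, MSE = MSE, F = MSR / MSE)
pf(MSR / MSE, p, n - p - 1, lower.tail = FALSE)  # p-value, as in anova(reg)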
28. Hypothesis tests in regression analysis

- Now we have:
  Sample data: Y = a + bX + e
  Population:  Y = α + βX + ε
- H0: β = 0. There is no linear association between the outcome and the predictor variable.
- In lay language: what is the chance, if the null hypothesis of no association were true, of observing sample data at least as inconsistent with that hypothesis as the data we actually observed?
29. Inference about slope (parameter β)

- Recall that ε is assumed to be normally distributed with mean 0 and variance σ².
- The estimate of σ² is MSE (or s²).
- It can be shown that:
- The expected value of b is β, i.e. E(b) = β.
- The standard error of b is SE(b) = sqrt(MSE / Sxx) = s / sqrt(Σ(xi - x̄)²).
- Then the test of whether β = 0 is t = b / SE(b), which follows a t-distribution with n - 2 degrees of freedom.
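A sketch computing this standard error and t-statistic by hand (age and chol as on slide 4); the values can be compared with the summary(reg) output on slide 37:

reg <- lm(chol ~ age)
s2 <- sum(resid(reg)^2) / (length(chol) - 2)   # MSE, the estimate of sigma^2
se.b <- sqrt(s2 / sum((age - mean(age))^2))    # standard error of the slope
b <- unname(coef(reg)["age"])
b / se.b                                       # t-statistic, compare summary(reg)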
30. Confidence interval around predicted values

- The observed value is yi.
- The predicted value is ŷi = a + bxi.
- The standard error of the predicted (mean) value is SE(ŷi) = s·sqrt(1/n + (xi - x̄)² / Sxx).
- An interval estimate for ŷi is ŷi ± t(n - 2; 0.975)·SE(ŷi).
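In R, predict() returns these intervals directly; interval = "confidence" gives the interval for the mean response at a given x, while interval = "prediction" gives the wider interval for a new individual. A sketch (the ages 30 and 50 are arbitrary illustrative values):

reg <- lm(chol ~ age)
new <- data.frame(age = c(30, 50))
predict(reg, new, interval = "confidence")   # CI for the mean cholesterol
predict(reg, new, interval = "prediction")   # interval for a new individual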
31. Checking assumptions

- Assumption of constant variance
- Assumption of normality
- Correctness of functional form
- Model stability
- All can be checked with graphical analysis. The residuals from the model, or functions of the residuals, play an important role in all of the model diagnostic procedures.
32. Checking assumptions

- Assumption of constant variance:
- Plot the studentized residuals versus the fitted values. Examine whether the variability of the residuals remains relatively constant across the range of fitted values.
- Assumption of normality:
- Plot the residuals versus their expected values under normality (normal probability plot). If the residuals are normally distributed, they should fall along a 45° line.
- Correct functional form?
- Plot the residuals versus the fitted values. Examine the residual plot for evidence of a non-linear trend across the range of fitted values.
- Model stability:
- Check whether one or more observations are influential. Use Cook's distance.
33. Checking assumptions (cont.)

- Cook's distance (D) is a measure of the magnitude by which the fitted values of the regression model change if the i-th observation is removed from the data set.
- Leverage is a measure of how extreme the value of xi is relative to the remaining values of x.
- The Studentized residual provides a measure of how extreme the value of yi is relative to the remaining values of y.
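All three diagnostics are available from a fitted lm object; a minimal sketch for the cholesterol model:

reg <- lm(chol ~ age)
cooks.distance(reg)   # Cook's D: influence of each observation on the fit
hatvalues(reg)        # leverage: extremeness of each x value
rstudent(reg)         # Studentized residuals: extremeness of each y value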
34. Remedial measures

- Non-constant variance:
- Transforming the response variable (y) to a new scale (e.g. logarithmic) is often helpful (see the sketch after this list).
- If no transformation can resolve the non-constant variance problem, use a more robust estimator such as iteratively reweighted least squares.
- Non-normality:
- Non-normality and non-constant variance go hand-in-hand.
- Outliers:
- Check for accuracy.
- Use a robust estimator.
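For example, a log transformation of the response is a one-line change (a sketch with the cholesterol data; whether it actually helps should be judged from the residual plots):

reg.log <- lm(log(chol) ~ age)   # refit the model on the log scale
par(mfrow = c(2, 2))
plot(reg.log)                    # re-check the diagnostic plots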
35. Regression analysis using R

id <- seq(1, 18)
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
         43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Fit linear regression model
reg <- lm(chol ~ age)
summary(reg)
36. ANOVA result

> anova(reg)
Analysis of Variance Table

Response: chol
          Df  Sum Sq Mean Sq F value    Pr(>F)
age        1 10.4944 10.4944  114.57 1.058e-08 ***
Residuals 16  1.4656  0.0916
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
37Results of R analysis
gt summary(reg) Call lm(formula  chol  
age) Residuals Min 1Q Median 
3Q Max -0.40729 -0.24133 -0.04522 0.17939 
0.63040 Coefficients Estimate Std. 
Error t value Pr(gtt) (Intercept) 1.089218 
0.221466 4.918 0.000154  age 
0.057788 0.005399 10.704 1.06e-08 
 --- Signif. codes 0 '' 0.001 '' 0.01 
'' 0.05 '.' 0.1 ' ' 1 Residual standard error 
0.3027 on 16 degrees of freedom Multiple 
R-Squared 0.8775, Adjusted R-squared 0.8698 
 F-statistic 114.6 on 1 and 16 DF, p-value 
1.058e-08 
38. Diagnostics: influential data

par(mfrow = c(2, 2))
plot(reg)
39. A non-linear illustration: BMI and sexual attractiveness

- Study on 44 university students
- Measure body mass index (BMI)
- Sexual attractiveness (SA) score

id <- seq(1, 44)
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00, 14.00,
         14.00, 14.80, 15.00, 15.00, 15.50, 16.00, 16.50, 17.00,
         17.00, 18.00, 18.00, 19.00, 19.00, 20.00, 20.00, 20.00,
         20.50, 22.00, 23.00, 23.00, 24.00, 24.50, 25.00, 25.00,
         26.00, 26.00, 26.50, 28.00, 29.00, 31.00, 32.00, 33.00,
         34.00, 35.50, 36.00, 36.00)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2,
        3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9,
        5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7,
        3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)
40. Linear regression analysis of BMI and SA

reg <- lm(sa ~ bmi)
summary(reg)

Residuals:
     Min       1Q   Median       3Q      Max
-2.54204 -0.97584  0.05082  1.16160  2.70856

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
bmi         -0.05967    0.02862  -2.084   0.0432 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedom
Multiple R-Squared: 0.09376,    Adjusted R-squared: 0.07218
F-statistic: 4.345 on 1 and 42 DF,  p-value: 0.04323
41. BMI and SA: analysis of residuals

plot(reg)
42. BMI and SA: a simple plot

par(mfrow = c(1, 1))
reg <- lm(sa ~ bmi)
plot(sa ~ bmi, pch = 16)
abline(reg)
43. Re-analysis of sexual attractiveness data

# Fit 3 regression models
linear <- lm(sa ~ bmi)
quad <- lm(sa ~ poly(bmi, 2))
cubic <- lm(sa ~ poly(bmi, 3))

# Make new BMI axis
bmi.new <- 10:40

# Get predicted values
quad.pred <- predict(quad, data.frame(bmi = bmi.new))
cubic.pred <- predict(cubic, data.frame(bmi = bmi.new))

# Plot predicted values
abline(reg)
lines(bmi.new, quad.pred, col = "blue", lwd = 3)
lines(bmi.new, cubic.pred, col = "red", lwd = 3)
44. [Figure: scatter plot of SA versus BMI with the linear fit and the quadratic (blue) and cubic (red) curves.]
45. Some comments: interpretation of correlation

- Correlation lies between -1 and 1. A very small correlation does not mean that there is no association between the two variables; the relationship may be non-linear.
- For curvilinearity, a rank correlation is better than the Pearson correlation (see the sketch after this list).
- A small correlation (e.g. 0.1) may be statistically significant, but clinically unimportant.
- R² is another measure of strength of association. An r = 0.7 may sound impressive, but R² is only 0.49!
- Correlation does not mean causation.
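A small illustration of this point with simulated data; the x and y here are hypothetical, chosen to be monotonic but strongly non-linear:

x <- 1:20
y <- exp(x / 4)                      # monotonic but non-linear in x
cor(x, y)                            # Pearson understates the association
cor(x, y, method = "spearman")       # rank correlation equals 1 here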
46. Some comments: interpretation of correlation

- Be careful with multiple correlations. For p variables, there are p(p - 1)/2 possible pairwise correlations, and false positives are a problem.
- Correlation is not transitive: the correlation between two variables cannot be inferred directly from their correlations with a third variable.
- r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not imply that r(age, fat) is near zero.
- In fact, r(age, fat) = 0.79.
47. Some comments: interpretation of regression

- The fitted (regression) line is only an estimate of the relation between these variables in the population.
- There is uncertainty associated with the estimated parameters.
- The regression line should not be used to make predictions for x values outside the range of the observed data.
- A statistical model is an approximation; the true relation may be nonlinear, but a linear model can be a reasonable approximation.
48. Some comments: reporting results

- Results should be reported in sufficient detail: the nature of the response and predictor variables, any transformations, checking of assumptions, etc.
- The regression coefficients (a, b), their associated standard errors, and R² are useful summaries.
49. Some final comments
- Equations are the cornerstone on which the 
 edifice of science rests.
- Equations are like poems, or even an onion. 
- So, be careful when building your equations!