Title: Simple Linear Regression: An Introduction
1. Simple Linear Regression: An Introduction
- Dr. Tuan V. Nguyen 
- Garvan Institute of Medical Research 
- Sydney
2. "Give a man three weapons - correlation, regression and a pen - and he will use all three." (Anon, 1978)
3. An example

Age and cholesterol levels in 18 individuals:

ID  Age  Chol (mg/ml)
 1   46   3.5
 2   20   1.9
 3   52   4.0
 4   30   2.6
 5   57   4.5
 6   25   3.0
 7   28   2.9
 8   36   3.8
 9   22   2.1
10   43   3.8
11   57   4.1
12   33   3.0
13   22   2.5
14   63   4.6
15   40   3.2
16   48   4.2
17   28   2.3
18   49   4.0
4. Read data into R

id <- seq(1, 18)
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
         43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
plot(chol ~ age, pch = 16)
5. [Figure: scatter plot of cholesterol versus age produced by the plot() call above.]
6. Questions of interest
- Association between age and cholesterol levels 
- Strength of association 
- Prediction of cholesterol for a given age
Correlation and regression analysis address these questions.
7. Variance and covariance: algebra

- Let x and y be two random variables from a sample of n observations.
- Measure of variability of x and y: variance.
- Measure of covariation between x and y: covariance.
- Algebraically:
  var(x + y) = var(x) + var(y) + 2cov(x, y)
  var(x - y) = var(x) + var(y) - 2cov(x, y)
- where var(x) = Σ(xi - x̄)² / (n - 1) and cov(x, y) = Σ(xi - x̄)(yi - ȳ) / (n - 1).
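These identities can be checked numerically in R; a minimal sketch with two arbitrary vectors (var() and cov() are the built-in sample versions with the n - 1 denominator):

x <- c(1, 3, 5, 7, 9)
y <- c(2, 1, 4, 3, 6)
# var(x + y) equals var(x) + var(y) + 2*cov(x, y)
var(x + y)
var(x) + var(y) + 2 * cov(x, y)
# var(x - y) equals var(x) + var(y) - 2*cov(x, y)
var(x - y)
var(x) + var(y) - 2 * cov(x, y)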
8. Variance and covariance: geometry

- The independence or dependence between x and y can be represented geometrically, with x and y as the lengths of two vectors joined at an angle θ (law of cosines):

  h² = x² + y² - 2xy·cos(θ)   (θ ≠ 90°: x and y are dependent)
  h² = x² + y²                (θ = 90°: x and y are independent)

[Figure: two triangles with sides x and y, one with a right angle and one without.]
9. Meaning of variance and covariance

- Variance is always positive.
- If covariance = 0, x and y are uncorrelated (no linear association).
- Covariance is a sum of cross-products and can be positive or negative.
- Negative covariance: deviations in the two distributions are in opposite directions, e.g. genetic covariation.
- Positive covariance: deviations in the two distributions are in the same direction.
- Covariance is therefore a measure of the strength of association.
10. Covariance and correlation

- Covariance is unit-dependent.
- The coefficient of correlation (r) between x and y is a standardized covariance.
- r is defined by:
  r = cov(x, y) / (sx · sy)
  where sx and sy are the standard deviations of x and y.
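As a check, this definition can be computed directly in R and compared with the built-in cor(); the sketch assumes the age and chol vectors from slide 4 are in the workspace:

# Correlation as covariance standardized by the two standard deviations
r <- cov(age, chol) / (sd(age) * sd(chol))
r
cor(age, chol)   # built-in Pearson correlation gives the same value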
11. Positive and negative correlation

[Figure: two scatter plots, one showing r = 0.9 (positive correlation) and one showing r = -0.9 (negative correlation).]
12. Test of hypothesis of correlation

- Hypothesis: H0: r = 0 versus H1: r ≠ 0.
- The standard error of r is SE(r) = sqrt((1 - r²) / (n - 2)).
- The t-statistic is t = r / SE(r).
- This statistic has a t distribution with n - 2 degrees of freedom.
- Fisher's z-transformation: z = ½·ln[(1 + r) / (1 - r)].
- The standard error of z is SE(z) = 1 / sqrt(n - 3).
- Then a 95% CI for z can be constructed as z ± 1.96 / sqrt(n - 3).
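A sketch of these calculations in R, compared against the built-in cor.test(); the age and chol vectors from slide 4 are assumed:

n <- length(age)
r <- cor(age, chol)
se.r <- sqrt((1 - r^2) / (n - 2))       # standard error of r
t.stat <- r / se.r                      # t with n - 2 df
2 * pt(-abs(t.stat), df = n - 2)        # two-sided p-value
# Fisher's z-transformation and 95% CI on the z scale
z <- 0.5 * log((1 + r) / (1 - r))
se.z <- 1 / sqrt(n - 3)
z + c(-1.96, 1.96) * se.z
# Built-in equivalent
cor.test(age, chol)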
13. An illustration of correlation analysis

ID  Age (x)  Cholesterol (y, mg/100ml)
 1   46   3.5
 2   20   1.9
 3   52   4.0
 4   30   2.6
 5   57   4.5
 6   25   3.0
 7   28   2.9
 8   36   3.8
 9   22   2.1
10   43   3.8
11   57   4.1
12   33   3.0
13   22   2.5
14   63   4.6
15   40   3.2
16   48   4.2
17   28   2.3
18   49   4.0

Cov(x, y) = 10.68

t-statistic = 0.56 / 0.26 = 2.17. The critical t-value with 17 df and alpha = 5% is 2.11.
Conclusion: there is a significant association between age and cholesterol.
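The covariance above can be reproduced with the built-in cov(), or by the cross-product formula from slide 7 (age and chol as on slide 4):

cov(age, chol)   # sample covariance, about 10.68
# The same value by hand from the sum of cross-products
sum((age - mean(age)) * (chol - mean(chol))) / (length(age) - 1)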
14. Simple linear regression analysis

- Only two variables are of interest: one response variable and one predictor variable.
- No adjustment is needed for confounders or covariates.
- Assessment: quantify the relationship between the two variables.
- Prediction: make predictions and validate a test.
- Control: adjust for confounding effects (in the case of multiple variables).
15. Relationship between age and cholesterol

[Figure: scatter plot of cholesterol versus age.]
16. Linear regression model

- Y = random variable representing a response.
- X = random variable representing a predictor variable (predictor, risk factor).
- Both Y and X can be categorical variables (e.g., yes/no) or continuous variables (e.g., age).
- If Y is categorical, the model is a logistic regression model; if Y is continuous, a simple linear regression model.
- Model:
  Y = α + βX + ε
- α = intercept
- β = slope / gradient
- ε = random error (variation between subjects in Y even if X is constant, e.g., variation in cholesterol for patients of the same age).
17. Linear regression assumptions

- The relationship is linear in terms of the parameters.
- X is measured without error.
- The values of Y are independent of each other (e.g., Y1 is not correlated with Y2).
- The random error term (ε) is normally distributed with mean 0 and constant variance.
18. Expected value and variance

- If the assumptions are tenable, then:
- The expected value of Y is E(Y | x) = α + βx.
- The variance of Y is var(Y) = var(ε) = σ².
19. Estimation of model parameters

Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive an equation connecting the points.

Gradient: m = (y2 - y1) / (x2 - x1) = dy / dx
Equation: y = mx + a, where a is the intercept at x = 0.

What happens if we have more than 2 points?

[Figure: line through A(x1, y1) and B(x2, y2), showing the gradient dy/dx and the intercept a.]
20. Estimation of a and b

- For a series of pairs (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn):
- Let a and b be sample estimates of the parameters α and β.
- We have a sample equation: Ŷ = a + bx.
- Aim: find the values of a and b so that Σ(Y - Ŷ)² is minimal.
- Let SSE = Σ(yi - a - bxi)².
- The values of a and b that minimise SSE are called least squares estimates.
21. Criteria of estimation

[Figure: scatter plot of Chol versus Age with a fitted line; d is the vertical distance from each observed yi to the line.]

The goal of the least squares estimator (LSE) is to find a and b such that the sum of the d² is minimal.
22. Estimation of a and b

- After some calculus, the estimates can be shown to be (checked numerically in the sketch after this slide):

  b = Sxy / Sxx
  a = ȳ - b·x̄

  where Sxy = Σ(xi - x̄)(yi - ȳ) and Sxx = Σ(xi - x̄)².

- When the regression assumptions are valid, the estimators of α and β have the following properties:
- Unbiased
- Uniformly minimal variance (i.e., efficient)
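A sketch of these closed-form estimates in R, checked against lm() (age and chol as on slide 4):

Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))
b <- Sxy / Sxx                    # least squares slope
a <- mean(chol) - b * mean(age)   # least squares intercept
c(a = a, b = b)
coef(lm(chol ~ age))              # should agree with a and b above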
23. Goodness-of-fit

- Now we have the equation Y = a + bX + e.
- Question: how well does the regression equation describe the actual data?
- Answer: the coefficient of determination (R²), the proportion of the variation in Y explained by the variation in X.
24. Partitioning of variations: concept

- SST = sum of squared differences between yi and the mean of y.
- SSR = sum of squared differences between the predicted values of y and the mean of y.
- SSE = sum of squared differences between the observed and predicted values of y.
- SST = SSR + SSE
- Then the coefficient of determination is:
  R² = SSR / SST
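A sketch verifying this partition in R for the cholesterol data (age and chol as on slide 4):

reg <- lm(chol ~ age)
yhat <- fitted(reg)
SST <- sum((chol - mean(chol))^2)        # total variation
SSR <- sum((yhat - mean(chol))^2)        # variation explained by the model
SSE <- sum((chol - yhat)^2)              # residual variation
c(SST = SST, SSR.plus.SSE = SSR + SSE)   # the two should match
SSR / SST                                # coefficient of determination R^2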
25. Partitioning of variations: geometry

[Figure: scatter plot of Chol (Y) versus Age (X) with the fitted line and the mean of Y; the vertical distances illustrate SST (observed to mean), SSR (fitted to mean) and SSE (observed to fitted).]
26. Partitioning of variations: algebra

- Some statistics:
- Total variation: SST = Σ(yi - ȳ)²
- Attributed to the model: SSR = Σ(ŷi - ȳ)²
- Residual sum of squares: SSE = Σ(yi - ŷi)²
- SST = SSR + SSE
- SSR = SST - SSE
27. Analysis of variance

- SS increases in proportion to sample size (n).
- Mean squares (MS) normalise for degrees of freedom (df):
- MSR = SSR / p (where p = the degrees of freedom for the regression, i.e. the number of predictors)
- MSE = SSE / (n - p - 1)
- MST = SST / (n - 1)
- Analysis of variance (ANOVA) table:

Source      d.f.       Sum of squares (SS)  Mean squares (MS)  F-test
Regression  p          SSR                  MSR                MSR/MSE
Residual    n - p - 1  SSE                  MSE
Total       n - 1      SST
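A sketch reproducing the mean squares and F-test by hand for the cholesterol data (here p = 1 predictor); the result can be compared with anova(reg) on slide 36:

reg <- lm(chol ~ age)
n <- length(chol); p <- 1                   # one predictor
SSR <- sum((fitted(reg) - mean(chol))^2)
SSE <- sum(resid(reg)^2)
MSR <- SSR / p
MSE <- SSE / (n - p - 1)
c(MSR = MSR, MSE = MSE, F = MSR / MSE)
pf(MSR / MSE, p, n - p - 1, lower.tail = FALSE)  # p-value, as in anova(reg)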
28. Hypothesis tests in regression analysis

- Now we have:
  Sample data: Y = a + bX + e
  Population:  Y = α + βX + ε
- H0: β = 0. There is no linear association between the outcome and the predictor variable.
- In lay language: what is the chance, if the null hypothesis of no association were true, of observing sample data at least as inconsistent with that hypothesis as the data we actually observed?
29. Inference about slope (parameter β)

- Recall that ε is assumed to be normally distributed with mean 0 and variance σ².
- The estimate of σ² is MSE (or s²).
- It can be shown that:
- The expected value of b is β, i.e. E(b) = β.
- The standard error of b is SE(b) = sqrt(MSE / Sxx) = s / sqrt(Σ(xi - x̄)²).
- Then the test of whether β = 0 is t = b / SE(b), which follows a t-distribution with n - 2 degrees of freedom.
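A sketch computing this standard error and t-statistic by hand (age and chol as on slide 4); the values can be compared with the summary(reg) output on slide 37:

reg <- lm(chol ~ age)
s2 <- sum(resid(reg)^2) / (length(chol) - 2)   # MSE, the estimate of sigma^2
se.b <- sqrt(s2 / sum((age - mean(age))^2))    # standard error of the slope
b <- unname(coef(reg)["age"])
b / se.b                                       # t-statistic, compare summary(reg)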
30. Confidence interval around predicted values

- The observed value is yi.
- The predicted value is ŷi = a + bxi.
- The standard error of the predicted (mean) value is SE(ŷi) = s·sqrt(1/n + (xi - x̄)² / Sxx).
- An interval estimate for ŷi is ŷi ± t(n - 2; 0.975)·SE(ŷi).
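In R, predict() returns these intervals directly; interval = "confidence" gives the interval for the mean response at a given x, while interval = "prediction" gives the wider interval for a new individual. A sketch (the ages 30 and 50 are arbitrary illustrative values):

reg <- lm(chol ~ age)
new <- data.frame(age = c(30, 50))
predict(reg, new, interval = "confidence")   # CI for the mean cholesterol
predict(reg, new, interval = "prediction")   # interval for a new individual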
31. Checking assumptions

- Assumption of constant variance
- Assumption of normality
- Correctness of functional form
- Model stability
- All can be checked with graphical analysis. The residuals from the model, or functions of the residuals, play an important role in all of the model diagnostic procedures.
32. Checking assumptions

- Assumption of constant variance:
- Plot the studentized residuals versus the fitted values. Examine whether the variability of the residuals remains relatively constant across the range of fitted values.
- Assumption of normality:
- Plot the residuals versus their expected values under normality (normal probability plot). If the residuals are normally distributed, they should fall along a 45° line.
- Correct functional form?
- Plot the residuals versus the fitted values. Examine the residual plot for evidence of a non-linear trend across the range of fitted values.
- Model stability:
- Check whether one or more observations are influential. Use Cook's distance.
33. Checking assumptions (cont.)

- Cook's distance (D) is a measure of the magnitude by which the fitted values of the regression model change if the i-th observation is removed from the data set.
- Leverage is a measure of how extreme the value of xi is relative to the remaining values of x.
- The Studentized residual provides a measure of how extreme the value of yi is relative to the remaining values of y.
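All three diagnostics are available from a fitted lm object; a minimal sketch for the cholesterol model:

reg <- lm(chol ~ age)
cooks.distance(reg)   # Cook's D: influence of each observation on the fit
hatvalues(reg)        # leverage: extremeness of each x value
rstudent(reg)         # Studentized residuals: extremeness of each y value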
34. Remedial measures

- Non-constant variance:
- Transforming the response variable (y) to a new scale (e.g. logarithmic) is often helpful (see the sketch after this list).
- If no transformation can resolve the non-constant variance problem, use a more robust estimator such as iteratively reweighted least squares.
- Non-normality:
- Non-normality and non-constant variance go hand-in-hand.
- Outliers:
- Check for accuracy.
- Use a robust estimator.
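For example, a log transformation of the response is a one-line change (a sketch with the cholesterol data; whether it actually helps should be judged from the residual plots):

reg.log <- lm(log(chol) ~ age)   # refit the model on the log scale
par(mfrow = c(2, 2))
plot(reg.log)                    # re-check the diagnostic plots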
35. Regression analysis using R

id <- seq(1, 18)
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
         43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Fit linear regression model
reg <- lm(chol ~ age)
summary(reg)
36. ANOVA result

> anova(reg)
Analysis of Variance Table

Response: chol
          Df  Sum Sq Mean Sq F value    Pr(>F)
age        1 10.4944 10.4944  114.57 1.058e-08 ***
Residuals 16  1.4656  0.0916
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
37Results of R analysis
gt summary(reg) Call lm(formula  chol  
age) Residuals Min 1Q Median 
3Q Max -0.40729 -0.24133 -0.04522 0.17939 
0.63040 Coefficients Estimate Std. 
Error t value Pr(gtt) (Intercept) 1.089218 
0.221466 4.918 0.000154  age 
0.057788 0.005399 10.704 1.06e-08 
 --- Signif. codes 0 '' 0.001 '' 0.01 
'' 0.05 '.' 0.1 ' ' 1 Residual standard error 
0.3027 on 16 degrees of freedom Multiple 
R-Squared 0.8775, Adjusted R-squared 0.8698 
 F-statistic 114.6 on 1 and 16 DF, p-value 
1.058e-08 
38. Diagnostics: influential data

par(mfrow = c(2, 2))
plot(reg)
39. A non-linear illustration: BMI and sexual attractiveness

- Study on 44 university students
- Measure body mass index (BMI)
- Sexual attractiveness (SA) score

id <- seq(1, 44)
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00, 14.00,
         14.00, 14.80, 15.00, 15.00, 15.50, 16.00, 16.50, 17.00,
         17.00, 18.00, 18.00, 19.00, 19.00, 20.00, 20.00, 20.00,
         20.50, 22.00, 23.00, 23.00, 24.00, 24.50, 25.00, 25.00,
         26.00, 26.00, 26.50, 28.00, 29.00, 31.00, 32.00, 33.00,
         34.00, 35.50, 36.00, 36.00)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2,
        3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9,
        5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7,
        3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)
40. Linear regression analysis of BMI and SA

reg <- lm(sa ~ bmi)
summary(reg)

Residuals:
     Min       1Q   Median       3Q      Max
-2.54204 -0.97584  0.05082  1.16160  2.70856

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
bmi         -0.05967    0.02862  -2.084   0.0432 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedom
Multiple R-Squared: 0.09376,    Adjusted R-squared: 0.07218
F-statistic: 4.345 on 1 and 42 DF,  p-value: 0.04323
41. BMI and SA: analysis of residuals

plot(reg)
42. BMI and SA: a simple plot

par(mfrow = c(1, 1))
reg <- lm(sa ~ bmi)
plot(sa ~ bmi, pch = 16)
abline(reg)
43. Re-analysis of sexual attractiveness data

# Fit 3 regression models
linear <- lm(sa ~ bmi)
quad <- lm(sa ~ poly(bmi, 2))
cubic <- lm(sa ~ poly(bmi, 3))

# Make new BMI axis
bmi.new <- 10:40

# Get predicted values
quad.pred <- predict(quad, data.frame(bmi = bmi.new))
cubic.pred <- predict(cubic, data.frame(bmi = bmi.new))

# Plot predicted values
abline(reg)
lines(bmi.new, quad.pred, col = "blue", lwd = 3)
lines(bmi.new, cubic.pred, col = "red", lwd = 3)
44. [Figure: scatter plot of SA versus BMI with the linear fit and the quadratic (blue) and cubic (red) curves.]
45. Some comments: interpretation of correlation

- Correlation lies between -1 and 1. A very small correlation does not mean that there is no association between the two variables; the relationship may be non-linear.
- For curvilinearity, a rank correlation is better than the Pearson correlation (see the sketch after this list).
- A small correlation (e.g. 0.1) may be statistically significant, but clinically unimportant.
- R² is another measure of strength of association. An r = 0.7 may sound impressive, but R² is only 0.49!
- Correlation does not mean causation.
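A small illustration of this point with simulated data; the x and y here are hypothetical, chosen to be monotonic but strongly non-linear:

x <- 1:20
y <- exp(x / 4)                      # monotonic but non-linear in x
cor(x, y)                            # Pearson understates the association
cor(x, y, method = "spearman")       # rank correlation equals 1 here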
46. Some comments: interpretation of correlation

- Be careful with multiple correlations. For p variables, there are p(p - 1)/2 possible pairwise correlations, and false positives are a problem.
- Correlation is not transitive: the correlation between two variables cannot be inferred directly from their correlations with a third variable.
- r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not imply that r(age, fat) is near zero.
- In fact, r(age, fat) = 0.79.
47. Some comments: interpretation of regression

- The fitted (regression) line is only an estimate of the relation between these variables in the population.
- There is uncertainty associated with the estimated parameters.
- The regression line should not be used to make predictions for x values outside the range of the observed data.
- A statistical model is an approximation; the true relation may be nonlinear, but a linear model can be a reasonable approximation.
48. Some comments: reporting results

- Results should be reported in sufficient detail: the nature of the response and predictor variables, any transformations, checking of assumptions, etc.
- The regression coefficients (a, b), their associated standard errors, and R² are useful summaries.
49. Some final comments
- Equations are the cornerstone on which the 
 edifice of science rests.
- Equations are like poems, or even an onion. 
- So, be careful when building your equations!