Title: Biostatistics and Computer Applications
1. Biostatistics and Computer Applications
Correlation analysis, nonlinear regression, multiple regression, and SAS programming
1/8/2003
2. Recap (Regression analysis)
- Regression equation and standard deviation from regression.
3. Recap (F and t tests for regression)
- The ANOVA table for regression analysis.
4. Recap (Confidence intervals)
- 1-α confidence interval for the population mean of Y.
- 1-α prediction interval for an individual observation Y.
5. Recap (Confidence intervals)
- 1-α confidence interval for the intercept.
- 1-α confidence interval for the slope.
6. Correlation analysis
- Example: A malacologist interested in the morphology of West Indian chitons, Chiton olivaceous, measured the length (X, cm) and width (Y, cm) of the eight overlapping plates composing the shell of 10 of these animals. Her data are presented below.
7. Determination coefficient
- For data that cannot be divided into dependent and independent variables, we cannot use a regression equation.
- We use the determination coefficient (r^2) to measure the degree of association.
- r is the correlation coefficient.
8. Calculation of Determination coefficient
- Equation 1: r^2 = \frac{[\sum(x-\bar{x})(y-\bar{y})]^2}{\sum(x-\bar{x})^2 \, \sum(y-\bar{y})^2}
- Equation 2: r = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2 \, \sum(y-\bar{y})^2}}
9. Determination Coefficient
- Proportion of variation in Y explained by the relationship between X and Y.
- 0 ≤ r^2 ≤ 1
10. Correlation Coefficient Values
[Figure: number line of r from -1.0 (perfect negative correlation) through 0 (no correlation) to +1.0 (perfect positive correlation); the degree of negative or positive correlation increases as r moves from 0 toward -1.0 or +1.0.]
11. Coefficient of Correlation Examples
[Figure: four scatter plots of Y versus X illustrating r = 1, r = -1, r = .89, and r = 0.]
12. Example of Coefficient of Determination
13. Test of correlation coefficient
- Tests whether there is a linear relationship between 2 numerical variables.
- Hypotheses:
  - H0: ρ = 0 (no correlation)
  - Ha: ρ ≠ 0 (correlation)
14. Model of linear correlation
- Both X and Y are random variables and normally distributed.
- Population correlation coefficient ρ.
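The population correlation coefficient is defined from the covariance and the two standard deviations,

\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y},

and is estimated by the sample correlation coefficient r.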
15. Example of test of r
- Chiton data.
- Amount of insects and PPT/T.
- This t is exactly the same as the t for H0: slope = 0.
- Check the r table for the significance test.
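For testing H0: ρ = 0, the test statistic is

t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}

with n - 2 degrees of freedom, which is algebraically identical to the t test of H0: slope = 0 in the corresponding simple linear regression.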
16. Confidence Interval for Correlation
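A standard construction uses Fisher's z transformation, which converts r to an approximately normal quantity:

z = \frac{1}{2}\ln\frac{1+r}{1-r}, \qquad \mathrm{SE}(z) = \frac{1}{\sqrt{n-3}}.

A 1-α interval is built for z and then back-transformed to the r scale through r = (e^{2z}-1)/(e^{2z}+1).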
17. Relationship of regression and correlation
- We can use r^2 and r in regression analysis, but we cannot use the regression equation in correlation analysis.
18. Cautions in regression and correlation analysis
- Widely used, and easy to misuse or misinterpret.
- \hat{Y} = a + bX and r are used to describe the linear relationship of Y on X. A non-significant r does not mean there is no relationship between Y and X; it only means there is no significant linear relationship.
- A significant r does not mean that the true relationship of Y and X is linear. A nonlinear model may fit better than the linear one.
- Even if the true relationship of Y and X is nonlinear, we may use linear regression to estimate or predict Y if r is significant. But be cautious when you extrapolate.
19. Cautions in regression and correlation analysis
- A significant linear regression does not guarantee that you can use the equation for practical prediction. You need a large r (r > 0.7, so that 49% of the variation can be explained by the predictors). If you are sure that there is a relationship between Y and X but you get a small r, this may be caused by (1) nonlinearity, or (2) other important factors not included in the model.
- Control other variables to be constant if possible, and use the equation under the same conditions.
- Sample size n should be larger than 5, and design a large range for variable X (this makes it easier to detect a nonlinear relationship and decreases estimation error).
20. SAS program
- PROC CORR and PROC REG.
- The CORR procedure is a statistical procedure for numeric random variables that computes Pearson correlation coefficients, three nonparametric measures of association, and the probabilities associated with these statistics.
21. SAS program

DATA corr;
  INPUT x y;
  DATALINES;
10.7 5.8
11 6
9.5 5
11.1 6
10.3 5.3
10.7 5.8
9.9 5.2
10.6 5.7
10 5.3
12 6.3
;
PROC REG;    /* regress y on x, then x on y */
  MODEL y=x;
  MODEL x=y;
RUN;
PROC CORR;   /* Pearson correlation between x and y */
RUN;
22. Types of regression analysis
[Diagram: regression models classified by the number of explanatory variables. 1 explanatory variable: simple regression (linear or nonlinear). 2 or more explanatory variables: multiple regression (linear or nonlinear).]
23. Nonlinear regression
- Under many conditions, the relationship of Y and X is not linear (curvilinear).
- A scatter plot helps to determine the equation.
- The choice of the function is based mainly on subject-matter knowledge; statistics help to estimate the function and test its significance.
- Nonlinear estimation depends on the initial values of the parameters in the equation. If the fit is not good, try different initial values.
24. Example of Nonlinear regression
- Growth of chickens may be described by a logistic equation. Weight of chicken (Y) was measured at different ages (X).
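The logistic growth model fitted by the PROC NLIN code on slide 26 is

Y = \frac{K}{1 + a e^{-bX}},

where K is the asymptotic (mature) weight, and a and b control the position and steepness of the growth curve.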
25. Example of Nonlinear regression
- PROC NLIN syntax:

PROC NLIN METHOD=DUD|GAUSS|etc.;
  MODEL dependent=expression;
  PARAMETERS parameter=values <, ..., parameter=values>;
26. Example of Nonlinear regression

DATA nonlinear;
  INPUT age weight;
  DATALINES;
2 0.3
4 0.86
6 1.73
8 2.2
10 2.47
12 2.67
14 2.8
;
PROC PRINT;
PROC NLIN METHOD=DUD;     /* derivative-free (DUD) estimation */
  MODEL weight=k/(1+a*exp(-b*age));
  PARMS k=3 a=20 b=0.4;   /* initial values for the parameters */
RUN;
27. Multiple Linear Regression
- Multiple linear regression extends simple linear regression to multiple predictor variables.
- Most of the time, more than one independent variable influences the response variable.
- For example, body mass index can be predicted by caloric intake and gender; CO2 flux between biosphere and atmosphere can be predicted by light, temperature, leaf area index, and vapor pressure deficit.
28. Multiple Regression Model
- Estimate the multiple linear regression equation.
- Test the overall significance of the model.
- Test the significance of each independent variable and select the best model.
- Test the relative importance of each independent variable.
- Use the model for prediction and estimation.
29. Multiple Linear Regression Model
- The relationship between 1 dependent and 2 or more independent variables is a linear function:

Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i

where Y is the dependent (response) variable, the X's are the independent (explanatory) variables, α is the population Y-intercept, the β's are the population slopes, and ε is the random error.
30. Population Multiple Regression Model
- Bivariate model: Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i
31. Sample Multiple Regression Model
- Bivariate model: \hat{Y}_i = a + b_1 X_{1i} + b_2 X_{2i}; the observed value is Y_i = a + b_1 X_{1i} + b_2 X_{2i} + e_i.
[Figure: the fitted response plane \hat{Y} = a + b_1 X_1 + b_2 X_2 in (X_1, X_2, Y) space; an observed Y_i sits above or below the plane at the point (X_{1i}, X_{2i}), and the vertical distance e_i is the residual.]
32. Least squares fit
- Regression parameters (a, b_1, ..., b_k) are determined using the method of least squares.
- This minimizes the squared differences between each observation and the fitted multivariate plane, i.e., minimizes the residuals.
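In symbols, least squares chooses a, b_1, ..., b_k to minimize

Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( Y_i - a - b_1 X_{1i} - \cdots - b_k X_{ki} \right)^2.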
33. Interpretation of regression model
- a is the intercept (the Y value when all predictors are zero).
- b_k is one of the partial regression coefficients, or slopes, of the regression; it represents the change in Y for a unit change in X_k with the other predictors held constant.
- i.e., b_k is the average slope across all subgroups created by the X_k levels.
- e is the error term for each individual and is the residual for that individual.
- The residual is the difference between the predicted and observed values.
34. The Residual
- A residual is the difference between an observed value of Y and the estimated mean \hat{Y} based on the associated X values.
- There is one residual for every subject (X, Y pair).
- It measures the distance of each observation from the fitted regression surface.
- Useful for:
  - Diagnostics, that is, techniques for checking the assumptions of the regression model.
  - Understanding the variation in Y that is unexplained by the linear function of X.
35. Example of Parameter Estimation
- In order to develop a multiple regression equation to predict the yield of the wheat variety Fengchan, spikes per head (X1), heads per plant (X2), weight per 100 grains (X3), height of plant (X4), and weight per plant (Y, g) were measured as follows:
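A minimal PROC REG call for this model (assuming the data are in a dataset named wheat with variables y and x1-x4; the dataset name is illustrative) would be:

PROC REG DATA=wheat;
  MODEL y = x1 x2 x3 x4;   /* full four-predictor model */
RUN;

This produces the parameter estimates and the ANOVA table shown in the output slides that follow.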
36. Parameter Estimation: Computer Output

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   -51.90207            13.35182         -3.89     0.0030
X1           1     2.02618             0.27204          7.45     <.0001
X2           1     0.65400             0.30270          2.16     0.0561
X3           1     7.79694             2.33281          3.34     0.0075
X4           1     0.04970             0.08300          0.60     0.5626
37. Testing Overall Significance
- 1. Shows if there is a linear relationship between all X variables together and Y.
- 2. Uses the F test statistic.
- 3. Hypotheses:
  - H0: β1 = β2 = ... = βk = 0 (no linear relationship)
  - Ha: at least one coefficient is not 0 (at least one X variable affects Y)
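With k predictors and n observations, the test statistic is

F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)}.

For the wheat example on the next slide: F = (221.47175/4) / (18.41758/10) = 55.36794/1.84176 = 30.06, with 4 and 10 degrees of freedom.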
38. Testing Overall Significance: Computer Output

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              4   221.47175        55.36794      30.06     <.0001
Error             10    18.41758         1.84176
Corrected Total   14   239.88933

Root MSE         1.35711    R-Square   0.9232
Dependent Mean  14.47333    Adj R-Sq   0.8925
Coeff Var        9.37665
39. Coefficient of determination: Multiple
- 1. Proportion of variation in Y explained by all X variables taken together:
  R^2 = explained variation / total variation = SSR / SS_{yy}
- 2. Never decreases when a new X variable is added to the model:
  - Only the Y values determine SS_{yy}.
  - This is a disadvantage when comparing models.
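From the ANOVA output above: R^2 = 221.47175/239.88933 = 0.9232. The adjusted R^2 compensates for added predictors: Adj R^2 = 1 - (1 - R^2)(n-1)/(n-k-1) = 1 - (0.0768)(14/10) = 0.8925, matching the SAS output.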
40. Parameter Estimation: Computer Output

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   -51.90207            13.35182         -3.89     0.0030
X1           1     2.02618             0.27204          7.45     <.0001
X2           1     0.65400             0.30270          2.16     0.0561
X3           1     7.79694             2.33281          3.34     0.0075
X4           1     0.04970             0.08300          0.60     0.5626
41. Variable Selection
- There are many different ways to select variables:
  - FORWARD
  - BACKWARD
  - STEPWISE
  - RSQUARE
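In SAS these are requested through the SELECTION= option of the MODEL statement in PROC REG (the dataset name wheat is illustrative):

PROC REG DATA=wheat;
  MODEL y = x1 x2 x3 x4 / SELECTION=FORWARD;   /* or BACKWARD, STEPWISE, RSQUARE */
RUN;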
42. Variable Selection
- Forward:

Summary of Forward Selection
Step   Variable Entered   Number Vars In   Partial R-Square   Model R-Square   C(p)      F Value   Pr > F
1      X1                 1                0.8052             0.8052           14.3764   53.73     <.0001
2      X3                 2                0.0767             0.8818            6.3911    7.79     0.0163
3      X2                 3                0.0386             0.9205            3.3585    5.34     0.0412

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   -46.96636            10.19262          36.82480    21.23     0.0008
X1            2.01314             0.26314         101.50782    58.53     <.0001
X2            0.67464             0.29183           9.26887     5.34     0.0412
X3            7.83023             2.26313          20.76193    11.97     0.0053
43. Variable Selection
- Backward:

Summary of Backward Elimination
Step   Variable Removed   Number Vars In   Partial R-Square   Model R-Square   C(p)     F Value   Pr > F
1      X4                 3                0.0028             0.9205           3.3585   0.36      0.5626

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   -46.96636            10.19262          36.82480    21.23     0.0008
X1            2.01314             0.26314         101.50782    58.53     <.0001
X2            0.67464             0.29183           9.26887     5.34     0.0412
X3            7.83023             2.26313          20.76193    11.97     0.0053
44. Variable Selection
- Stepwise:

Summary of Stepwise Selection
Step   Variable Entered   Variable Removed   Number Vars In   Partial R-Square   Model R-Square   C(p)      F Value   Pr > F
1      X1                                    1                0.8052             0.8052           14.3764   53.73     <.0001
2      X3                                    2                0.0767             0.8818            6.3911    7.79     0.0163
3      X2                                    3                0.0386             0.9205            3.3585    5.34     0.0412
45. Variable Selection
- RSQUARE: R-Square Selection Method

Number in Model   R-Square   Variables in Model
1                 0.8052     X1
1                 0.4747     X3
1                 0.0021     X2
1                 0.0000     X4
2                 0.8818     X1 X3
2                 0.8339     X1 X2
2                 0.8113     X1 X4
2                 0.4973     X2 X3
2                 0.4750     X3 X4
2                 0.0023     X2 X4
3                 0.9205     X1 X2 X3
3                 0.8874     X1 X3 X4
3                 0.8375     X1 X2 X4
3                 0.4973     X2 X3 X4
46. Variable Selection
- Which one is better? It depends.
- FORWARD keeps more variables in the model; BACKWARD deletes more from the model. STEPWISE is good in general.
- RSQUARE allows you to select the model with the highest R^2 at each model size.
47. Relative importance of variables
- Use the STB option to print standardized regression coefficients:
  MODEL y=variables / STB;

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Standardized Estimate   95% Confidence Limits
Intercept    1   -46.96636            10.19262         -4.61     0.0008     0                       -69.40016   -24.53256
X1           1     2.01314             0.26314          7.65     <.0001     0.75342                   1.43396     2.59231
X2           1     0.67464             0.29183          2.31     0.0412     0.19929                   0.03233     1.31696
X3           1     7.83023             2.26313          3.46     0.0053     0.34139                   2.84911    12.81134
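The standardized estimate rescales each slope to standard-deviation units so that predictors can be compared directly: b_k' = b_k (s_{X_k} / s_Y), where s_{X_k} and s_Y are the sample standard deviations of X_k and Y. Here X1 has the largest standardized estimate (0.75342) and is therefore the most important predictor.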
48. Confidence interval estimation

Obs   Dep Var Y   Predicted Value   Std Error Mean Predict   95% CL Mean           95% CL Predict        Residual
1     15.7000     16.8707           0.4999                   15.7705   17.9708     13.7703   19.9710     -1.1707
2     14.5000     12.8336           0.6867                   11.3222   14.3449      9.5646   16.1025      1.6664
3     17.5000     16.9790           0.4666                   15.9520   18.0061     13.9039   20.0542      0.5210
4     22.5000     22.3438           0.9090                   20.3430   24.3446     18.8217   25.8659      0.1562
5     15.5000     16.1960           0.3732                   15.3746   17.0174     13.1833   19.2087     -0.6960
6     16.9000     16.0876           0.5112                   14.9624   17.2129     12.9783   19.1970      0.8124
7      8.6000     10.4953           0.6314                    9.1056   11.8850      7.2808   13.7098     -1.8953
8     17.0000     15.9792           0.7946                   14.2305   17.7280     12.5940   19.3645      1.0208
9     13.7000     13.2807           0.7933                   11.5346   15.0267      9.8968   16.6645      0.4193
10    13.4000     13.9553           0.6118                   12.6087   15.3019     10.7592   17.1514     -0.5553
11    20.3000     19.2197           0.9110                   17.2146   21.2248     15.6952   22.7442      1.0803
12    10.2000     10.7121           0.5657                    9.4670   11.9571      7.5574   13.8667     -0.5121
13     7.4000      5.6860           0.9191                    3.6631    7.7089      2.1513    9.2207      1.7140
14    11.6000     12.2781           0.7637                   10.5972   13.9590      8.9274   15.6288     -0.6781
15    12.3000     14.1829           0.3997                   13.3032   15.0626     11.1537   17.2120     -1.8829
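Output of this form comes from the CLM (confidence limits for the mean) and CLI (prediction limits for an individual observation) options on the MODEL statement; a sketch, assuming the selected three-variable model and the illustrative dataset name wheat:

PROC REG DATA=wheat;
  MODEL y = x1 x2 x3 / P CLM CLI;   /* predicted values, 95% CL Mean, 95% CL Predict */
RUN;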