Biostatistics and Computer Applications

Transcript and Presenter's Notes
1
Biostatistics and Computer Applications
  • Correlation analysis
  • Nonlinear regression
  • Multiple regression
  • SAS programming
1/8/2003
2
Recap (Regression analysis)
Regression equation and standard deviation from
regression.
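The recap equations on this slide were images and did not survive extraction; in standard form for simple linear regression (a sketch of what the slide presumably shows):

\hat{y} = a + bx, \quad b = SP_{xy}/SS_x, \quad a = \bar{y} - b\bar{x}

s_{y \cdot x} = \sqrt{SS_E/(n-2)}, \quad SS_E = \sum (y_i - \hat{y}_i)^2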
3
Recap (F and t test for β)
  • The ANOVA table for regression analysis

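The table itself is not legible in the transcript; the standard ANOVA layout for simple linear regression, presumably what the slide shows, is:

Source       df      SS     MS
Regression   1       SSR    MSR = SSR/1
Residual     n - 2   SSE    MSE = SSE/(n - 2)
Total        n - 1   SSyy

F = MSR/MSE with (1, n - 2) df; equivalently t = b/s_b with n - 2 df, and t^2 = F.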
4
Recap (Confidence intervals)
1-α confidence interval for the population mean of Y at a given X
1-α prediction interval for an individual observation Y
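The interval formulas were images; the standard forms are:

CI for the mean of Y at x: \quad \hat{y} \pm t_{\alpha/2,\,n-2}\, s_{y \cdot x} \sqrt{1/n + (x - \bar{x})^2/SS_x}

Prediction interval for an individual Y at x: \quad \hat{y} \pm t_{\alpha/2,\,n-2}\, s_{y \cdot x} \sqrt{1 + 1/n + (x - \bar{x})^2/SS_x}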
5
Recap (Confidence Intervals)
1-α confidence interval for intercept
1-α confidence interval for slope
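Again in standard form (the slide's formulas were images):

Intercept: \quad a \pm t_{\alpha/2,\,n-2}\, s_a, \quad s_a = s_{y \cdot x} \sqrt{1/n + \bar{x}^2/SS_x}

Slope: \quad b \pm t_{\alpha/2,\,n-2}\, s_b, \quad s_b = s_{y \cdot x}/\sqrt{SS_x}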
6
Correlation analysis
  • Example: A malacologist interested in the
    morphology of West Indian chitons, Chiton
    olivaceous, measured the length (X, cm) and width
    (Y, cm) of the eight overlapping plates composing
    the shell of 10 of these animals. Her data are
    presented below.

[Data table shown on slide; the 10 (X, Y) pairs appear in the SAS data step on slide 21.]
7
Determination coefficient
  • For data that cannot be divided into dependent and
    independent variables, we cannot use a regression
    equation.
  • We use the determination coefficient (r²) to measure
    the degree of association.
  • r is the correlation coefficient.

8
Calculation of Determination coefficient
Equation 1 and Equation 2 (reconstructed below)
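The two formulas were images; in standard notation, with SP_{xy} = \sum(x - \bar{x})(y - \bar{y}), SS_x = \sum(x - \bar{x})^2, and SS_y = \sum(y - \bar{y})^2:

Equation 1: \quad r^2 = SP_{xy}^2/(SS_x\, SS_y)

Equation 2: \quad r = SP_{xy}/\sqrt{SS_x\, SS_y}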
9
Determination Coefficient
  • Proportion of variation explained by the
    relationship between X and Y.

0 \le r^2 \le 1
10
Correlation Coefficient Values
[Number-line figure: r runs from -1.0 (perfect negative correlation) through 0 (no correlation) to +1.0 (perfect positive correlation); moving from 0 toward -1.0 (e.g., -.5) shows an increasing degree of negative correlation, and from 0 toward +1.0 (e.g., .5) an increasing degree of positive correlation.]
11
Coefficient of Correlation Examples
[Four scatter plots of Y against X illustrating r = 1 (perfect positive), r = -1 (perfect negative), r = .89 (strong positive), and r = 0 (no linear relationship).]
12
Example of Coefficient of Determination
  • Chiton length and width

13
Test of correlation coefficient
  • Test if there is a linear relationship between 2
    numerical variables
  • Hypotheses
  • H0: ρ = 0 (No Correlation)
  • Ha: ρ ≠ 0 (Correlation)

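The test statistic was shown as an image; the standard form, which slide 15 notes is identical to the t test of H0: slope = 0, is:

t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \quad df = n - 2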
14
Model of linear correlation
  • Both X and Y are random variables and are normally
    distributed.
  • Population correlation coefficient (standard
    definition reconstructed below)

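In standard notation (the slide's formula was an image):

\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}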
15
Example of test of r
  • Chiton
  • Amount of insects and PPT/T

This t is exactly the same as the t for H0: slope = 0.
Check the r table for the significance test.
16
Confidence Interval for Correlation
  • Chiton

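The computation was shown as an image; a sketch of the standard Fisher z approach to a 1-α confidence interval for ρ, presumably what the slide uses:

z = \tfrac{1}{2}\ln\frac{1+r}{1-r}, \quad z \pm z_{\alpha/2}/\sqrt{n-3}

then back-transform each endpoint with r = (e^{2z} - 1)/(e^{2z} + 1).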
17
Relationship of regression and correlation
  • [Four relationships, shown as equations on the
    slide, are not legible in this transcript.]

We can use r² and r in regression analysis, but we
cannot use the regression equation in correlation
analysis.
18
Cautions in regression and correlation analysis
  • Widely used, and easy to misuse or misinterpret.
  • Y = a + bX and r are used to describe the linear
    relationship of Y on X. A non-significant r does
    not mean there is no relationship between Y and X;
    it only means there is no significant linear
    relationship.
  • A significant r does not mean that the true
    relationship of Y and X is linear. A nonlinear
    model may fit better than a linear one.
  • Even if the true relationship of Y and X is
    nonlinear, we may use linear regression to
    estimate or predict Y if r is significant.
    But be cautious when you extrapolate.

19
Cautions in regression and correlation analysis
  • A significant linear regression does not
    guarantee that you can use the equation for
    practical prediction. You need a large r
    (r > 0.7, so that at least 49% of the variation
    is explained by the predictors). If you are sure
    that there is a relationship between Y and X but
    you get a small r, this may be caused by 1) a
    nonlinear relationship, or 2) other important
    factors not included in the model.
  • Control other variables to be constant if
    possible, and use the equation under the same
    conditions.
  • Sample size n should be larger than 5, and design
    a large range for variable X (this makes it
    easier to detect a nonlinear relationship and
    decreases estimation error).

20
SAS program
  • PROC CORR and PROC REG.
  • The CORR procedure is a statistical procedure for
    numeric random variables that computes Pearson
    correlation coefficients, three nonparametric
    measures of association, and the probabilities
    associated with these statistics.

21
SAS program
  • DATA corr;
  • INPUT x y;
  • DATALINES;
  • 10.7 5.8
  • 11 6
  • 9.5 5
  • 11.1 6
  • 10.3 5.3
  • 10.7 5.8
  • 9.9 5.2
  • 10.6 5.7
  • 10 5.3
  • 12 6.3
  • ;
  • PROC REG;
  • MODEL y=x;
  • MODEL x=y;
  • RUN;
  • PROC CORR;
  • RUN;

22
Types of regression analysis
Regression models
  • 1 explanatory variable: simple regression (linear or non-linear)
  • 2 or more explanatory variables: multiple regression (linear or non-linear)
23
Nonlinear regression
  • Under many conditions, the relationship of Y and X
    is not linear (it is curvilinear)
  • A scatter plot helps to determine the equation
  • The choice of the function is based mainly on
    subject-matter knowledge; statistics helps to
    estimate the function and test its significance
  • Nonlinear estimates depend on the initial values
    of the parameters in the equation. If the fit is
    not good, you may try different initial values.

24
Example of Nonlinear regression
  • Growth of chickens may be described by the logistic
    equation (reconstructed below). The weight of a
    chicken (Y) was measured at different ages (X).

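The logistic equation, written to match the MODEL statement in the PROC NLIN program on slide 26:

Y = \frac{k}{1 + a e^{-bX}}

where k is the asymptotic weight, and a and b control the position and steepness of the growth curve.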
25
Example of Nonlinear regression
  • PROC NLIN syntax:
  • PROC NLIN METHOD=DUD|GAUSS|etc.;
  • MODEL dependent=expression;
  • PARAMETERS parameter=values <,..., parameter=values>;

26
Example of Nonlinear regression
  • DATA nonlinear;
  • INPUT age weight;
  • DATALINES;
  • 2 0.3
  • 4 0.86
  • 6 1.73
  • 8 2.2
  • 10 2.47
  • 12 2.67
  • 14 2.8
  • ;
  • PROC PRINT;
  • PROC NLIN METHOD=DUD;
  • MODEL weight=k/(1+a*exp(-b*age));
  • PARMS k=3 a=20 b=0.4;
  • RUN;

27
Multiple Linear Regression
  • Multiple linear regression extends simple linear
    regression to multiple predictor variables
  • Most of the time, more than one independent
    variable influences the response variable.
  • For example, body mass index can be predicted by
    caloric intake and gender; CO2 flux between the
    biosphere and the atmosphere can be predicted by
    light, temperature, leaf area index, and vapor
    pressure deficit.

28
Multiple Regression Model
  • Estimate the multiple linear regression equation
  • Test the overall significance of the model
  • Test the significance of each independent variable
    and select the best model
  • Test the relative importance of each independent
    variable
  • Use the model for prediction and estimation

29
Multiple Linear Regression Model
  • Relationship between 1 dependent and 2 or more
    independent variables is a linear function:

Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon

where Y is the dependent (response) variable, X_1, ..., X_k are the independent (explanatory) variables, \alpha is the population Y-intercept, \beta_1, ..., \beta_k are the population slopes, and \varepsilon is the random error.
30
Population Multiple Regression Model
Bivariate model: Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i
31
Sample Multiple Regression Model
Bivariate model: Y_i = a + b_1 X_{1i} + b_2 X_{2i} + e_i (observed Y)

[Figure: the fitted response plane \hat{Y} = a + b_1 X_1 + b_2 X_2 in (X_1, X_2, Y) space; the residual e_i is the vertical distance from the observed Y_i to the plane at (X_{1i}, X_{2i}).]
32
Least squares fit
  • Regression parameters (a, b_1, ..., b_k) are
    determined using the method of least squares
  • Minimizes the squared differences between each
    observation and the fitted response plane,
    i.e., minimizes the residuals (in symbols below)

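In symbols, least squares chooses a, b_1, ..., b_k to minimize

Q = \sum_i \left(Y_i - a - b_1 X_{1i} - \cdots - b_k X_{ki}\right)^2 = \sum_i e_i^2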
33
Interpretation of regression model
  • a is the intercept (the value of Y when all
    predictors are zero)
  • b_k is one of the partial regression coefficients,
    or slopes, of the regression
  • It represents the change in Y for a unit change in
    X_k with the other predictors held constant
  • i.e., b_k is the average slope across all
    subgroups created by the X_k levels
  • e is the error term for each individual; its
    estimate is the residual for that individual
  • The residual is the difference between the
    predicted and observed values

34
The Residual
  • A residual is the difference between an observed
    value of Y and the estimated mean based on the
    associated X values: e_i = Y_i - \hat{Y}_i
  • There is one residual for every subject (X, Y pair)
  • It measures the distance of each observation from
    the fitted value \hat{Y}
  • Useful for:
  • Diagnostics, that is, techniques for checking the
    assumptions of the regression model
  • Understanding the variation in Y that is
    unexplained by the linear function of X

35
Example of Parameter Estimation
  • In order to develop a multiple regression
    equation to predict the yield of the wheat variety
    Fengchan, spikes per head (X1), heads per plant
    (X2), weight per 100 grains (X3), plant height
    (X4), and weight per plant (Y, g) were measured
    (data table shown on the slide).

36
Parameter Estimation: Computer Output

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1          -51.90207          13.35182      -3.89     0.0030
X1           1            2.02618           0.27204       7.45     <.0001
X2           1            0.65400           0.30270       2.16     0.0561
X3           1            7.79694           2.33281       3.34     0.0075
X4           1            0.04970           0.08300       0.60     0.5626
37
Testing Overall Significance
  • 1. Shows if there is a linear relationship
    between all X variables together and Y
  • 2. Uses the F test statistic
  • 3. Hypotheses:
  • H0: \beta_1 = \beta_2 = ... = \beta_k = 0
  • (No linear relationship)
  • Ha: at least one coefficient is not 0
  • (At least one X variable affects Y)

38
Testing Overall Significance: Computer Output

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               4        221.47175      55.36794     30.06   <.0001
Error              10         18.41758       1.84176
Corrected Total    14        239.88933

Root MSE          1.35711    R-Square   0.9232
Dependent Mean   14.47333    Adj R-Sq   0.8925
Coeff Var         9.37665
39
Coefficient of determination (Multiple)
  • 1. Proportion of variation in Y explained by
    all X variables taken together:
  • R² = explained variation / total variation
    = SSR / SSyy
  • 2. Never decreases when a new X variable is added
    to the model
  • Only the Y values determine SSyy
  • This is a disadvantage when comparing models (see
    the adjusted R² below)

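Because R² never decreases as predictors are added, model comparison usually relies on the adjusted R² reported in the output (standard formula, not shown on the slide):

R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}

With R² = 0.9232, n = 15, and k = 4 this gives 0.8925, matching the Adj R-Sq on slide 38.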
40
Parameter Estimation: Computer Output

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1          -51.90207          13.35182      -3.89     0.0030
X1           1            2.02618           0.27204       7.45     <.0001
X2           1            0.65400           0.30270       2.16     0.0561
X3           1            7.79694           2.33281       3.34     0.0075
X4           1            0.04970           0.08300       0.60     0.5626
41
Variable Selection
  • There are many different ways to select variables
    (a sketch of the PROC REG syntax follows this list)
  • Forward
  • Backward
  • Stepwise
  • Rsquare

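A minimal PROC REG sketch requesting these methods through the SELECTION= option of the MODEL statement, using the variable names from the wheat example (the deck does not show this statement, so treat it as an illustration):

PROC REG;
  MODEL y = x1 x2 x3 x4 / SELECTION=STEPWISE;  /* or SELECTION=FORWARD, BACKWARD, RSQUARE */
RUN;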
42
Variable Selection
  • Forward

Summary of Forward Selection

Step   Variable Entered   Number Vars In   Partial R-Square   Model R-Square   C(p)      F Value   Pr > F
1      X1                 1                0.8052             0.8052           14.3764   53.73     <.0001
2      X3                 2                0.0767             0.8818            6.3911    7.79     0.0163
3      X2                 3                0.0386             0.9205            3.3585    5.34     0.0412

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept          -46.96636          10.19262     36.82480     21.23    0.0008
X1                   2.01314           0.26314    101.50782     58.53    <.0001
X2                   0.67464           0.29183      9.26887      5.34    0.0412
X3                   7.83023           2.26313     20.76193     11.97    0.0053
43
Variable Selection
  • Backward

Summary of Backward Elimination

Step   Variable Removed   Number Vars In   Partial R-Square   Model R-Square   C(p)     F Value   Pr > F
1      X4                 3                0.0028             0.9205           3.3585   0.36      0.5626

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept          -46.96636          10.19262     36.82480     21.23    0.0008
X1                   2.01314           0.26314    101.50782     58.53    <.0001
X2                   0.67464           0.29183      9.26887      5.34    0.0412
X3                   7.83023           2.26313     20.76193     11.97    0.0053

44
Variable Selection
  • Stepwise

Summary of Stepwise Selection

Step   Variable Entered   Variable Removed   Number Vars In   Partial R-Square   Model R-Square   C(p)      F Value   Pr > F
1      X1                                    1                0.8052             0.8052           14.3764   53.73     <.0001
2      X3                                    2                0.0767             0.8818            6.3911    7.79     0.0163
3      X2                                    3                0.0386             0.9205            3.3585    5.34     0.0412

45
Variable Selection
  • Rsquare: R-Square Selection Method

Number in Model   R-Square   Variables in Model
1                 0.8052     X1
1                 0.4747     X3
1                 0.0021     X2
1                 0.0000     X4
-----------------------------------------------
2                 0.8818     X1 X3
2                 0.8339     X1 X2
2                 0.8113     X1 X4
2                 0.4973     X2 X3
2                 0.4750     X3 X4
2                 0.0023     X2 X4
-----------------------------------------------
3                 0.9205     X1 X2 X3
3                 0.8874     X1 X3 X4
3                 0.8375     X1 X2 X4
3                 0.4973     X2 X3 X4

46
Variable Selection
  • Which one is better?
  • It depends.
  • FORWARD keeps more variables in the model;
    BACKWARD deletes more from the model.
    STEPWISE is good in general.
  • RSQUARE allows you to select the model with the
    highest R² at each model size.

47
Relative importance of variables
  • Use the STB option to print standardized
    regression coefficients.
  • MODEL y = variables / STB;

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Standardized Estimate   95% Confidence Limits
Intercept    1          -46.96636          10.19262      -4.61     0.0008                 0          -69.40016   -24.53256
X1           1            2.01314           0.26314       7.65     <.0001           0.75342            1.43396     2.59231
X2           1            0.67464           0.29183       2.31     0.0412           0.19929            0.03233     1.31696
X3           1            7.83023           2.26313       3.46     0.0053           0.34139            2.84911    12.81134
48
Confidence interval estimation
Obs   Dep Var Y   Predicted Value   Std Error Mean Predict      95% CL Mean          95% CL Predict      Residual
1       15.7000           16.8707                   0.4999   15.7705   17.9708   13.7703   19.9710        -1.1707
2       14.5000           12.8336                   0.6867   11.3222   14.3449    9.5646   16.1025         1.6664
3       17.5000           16.9790                   0.4666   15.9520   18.0061   13.9039   20.0542         0.5210
4       22.5000           22.3438                   0.9090   20.3430   24.3446   18.8217   25.8659         0.1562
5       15.5000           16.1960                   0.3732   15.3746   17.0174   13.1833   19.2087        -0.6960
6       16.9000           16.0876                   0.5112   14.9624   17.2129   12.9783   19.1970         0.8124
7        8.6000           10.4953                   0.6314    9.1056   11.8850    7.2808   13.7098        -1.8953
8       17.0000           15.9792                   0.7946   14.2305   17.7280   12.5940   19.3645         1.0208
9       13.7000           13.2807                   0.7933   11.5346   15.0267    9.8968   16.6645         0.4193
10      13.4000           13.9553                   0.6118   12.6087   15.3019   10.7592   17.1514        -0.5553
11      20.3000           19.2197                   0.9110   17.2146   21.2248   15.6952   22.7442         1.0803
12      10.2000           10.7121                   0.5657    9.4670   11.9571    7.5574   13.8667        -0.5121
13       7.4000            5.6860                   0.9191    3.6631    7.7089    2.1513    9.2207         1.7140
14      11.6000           12.2781                   0.7637   10.5972   13.9590    8.9274   15.6288        -0.6781
15      12.3000           14.1829                   0.3997   13.3032   15.0626   11.1537   17.2120        -1.8829
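Output of this form can be requested with the CLM and CLI options of the MODEL statement in PROC REG; a sketch, assuming the three-variable model selected earlier (the deck does not show the statement it actually used):

PROC REG;
  MODEL y = x1 x2 x3 / CLM CLI;  /* CLM: 95% CL for the mean; CLI: 95% CL (prediction) for an individual observation */
RUN;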