1
Biostatistics in Practice
Session 5: Methods for Assessing Associations
Peter D. Christenson, Biostatistician
http://gcrc.humc.edu/Biostat
2
Readings for Session 5, from StatisticalPractice.com
  • Simple Linear Regression
  • Introduction to Simple Linear Regression
  • Transformations in Linear Regression
  • Multiple Regression
  • Introduction to Multiple Regression
  • What Does Multiple Regression Look Like?
  • Which Predictors are More Important?
  • Also (no assigned reading): Correlation

3
Correlation
  • Visualize Y (vertical) by X (horizontal) in a
    scatterplot.
  • The Pearson correlation, r, is used to measure the
    association between two measures X and Y.
  • r ranges from -1 (perfect inverse association) to
    +1 (perfect direct association).
  • The value of r does not depend on
  • the scales (units) of X and Y
  • which role X and Y assume, as in an X-Y plot
  • The value of r does depend on
  • the ranges of X and Y
  • the values chosen for X, if X is fixed and Y is
    measured (see the sketch below)
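A minimal sketch of the properties listed above, using scipy on simulated data; the data and all numbers are hypothetical, not from the slides:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=100)             # hypothetical X values
y = 2.0 * x + rng.normal(0, 15, size=100)    # hypothetical Y, linearly related to X

r_xy, p = pearsonr(x, y)
r_scaled, _ = pearsonr(x / 2.54, y)          # changing the units of X ...
r_swapped, _ = pearsonr(y, x)                # ... or swapping the roles of X and Y
print(round(r_xy, 3), round(r_scaled, 3), round(r_swapped, 3))  # all three identical

# Restricting the range of X (as in graph B on a later slide) changes r:
mask = (x > 45) & (x < 55)
r_restricted, _ = pearsonr(x[mask], y[mask])
print(round(r_restricted, 3))                # typically smaller in magnitude than r_xy
```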

4
Graphs and Values of Correlation
5
Correlation Depends on Ranges of X and Y
(Figure: scatterplots A and B)
Graph B contains only the graph A points inside the
ellipse. The correlation is reduced in graph B. Thus
correlations for the same quantities X and Y may be
quite different in different study populations.
6
Regression
  • Again, plot Y (vertical) by X (horizontal) in a
    scatterplot, as with correlation. See the next
    slide.
  • X and Y now assume different roles
  • Y is an outcome, response, output, dependent
    variable
  • X is an input, predictor, independent
    variable
  • Regression analysis is used to (a minimal sketch
    follows this list)
  • Fit a straight line through the scatterplot.
  • Measure the X-Y association, as correlation does.
  • Predict Y from X, and assess the precision of
    the prediction.
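A minimal sketch of these three uses on simulated data; numpy's polyfit and corrcoef stand in for whatever statistical package is actually used, and all values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=40)               # hypothetical predictor values
y = 3 + 1.5 * x + rng.normal(0, 2, size=40)   # hypothetical outcome values

slope, intercept = np.polyfit(x, y, deg=1)    # (1) fit a straight line
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x")

r = np.corrcoef(x, y)[0, 1]                   # (2) the association that correlation also measures
print(f"r = {r:.3f}")

y_hat = intercept + slope * 7.0               # (3) predict y for a subject with x = 7
print(f"predicted y at x = 7: {y_hat:.2f}")
```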

7
Regression Example
8
X-Y Association
If the slope = 0, then X and Y are not associated. But
the slope measured from a sample will never be exactly
0. How different from 0 does a measured slope need to
be to claim that X and Y are associated? Test H0:
slope = 0 vs. HA: slope ≠ 0, with the rule: claim
association (HA) if |tc| = |slope/SE(slope)| > t ≈ 2,
the critical t value. There is then a 5% chance of
claiming an X-Y association that really does not
exist. Note the similarity to the t-test for means:
tc = mean/SE(mean). The formula for SE(slope) is in
statistics books.
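As a rough illustration (simulated data, not the example in these slides), scipy's linregress reports the slope, its standard error, and the p-value for this test of H0: slope = 0:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
x = rng.uniform(60, 140, size=50)                # hypothetical predictor values
y = 80 + 2.2 * x + rng.normal(0, 20, size=50)    # hypothetical response values

fit = linregress(x, y)
t_c = fit.slope / fit.stderr                     # tc = slope / SE(slope)
print(f"slope = {fit.slope:.3f}, SE = {fit.stderr:.3f}, tc = {t_c:.2f}, p = {fit.pvalue:.4g}")
# Claim an X-Y association when |tc| exceeds roughly 2 (the 5%-level critical t value).
```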
9
X-Y Association, Continued
Refer to the graph of the example, 2 slides back. We
are 95% sure that the true line for the X-Y association
is within the inner (···) band about the line estimated
from our limited sample data. If our test of H0: slope
= 0 vs. HA: slope ≠ 0 results in claiming HA, then the
inner (···) band does not include a horizontal line,
and vice-versa: X and Y are significantly associated.
We can also test H0: ρ = 0 vs. HA: ρ ≠ 0, where ρ is
the true correlation estimated by r. The result is
identical to that for the slope. Thus, correlation and
regression are equivalent methods for measuring whether
two variables are linearly associated.
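That equivalence can be checked directly: on the same (here simulated) data, scipy's test of H0: ρ = 0 and its test of H0: slope = 0 return the same p-value:

```python
import numpy as np
from scipy.stats import pearsonr, linregress

rng = np.random.default_rng(6)
x = rng.normal(size=60)                   # hypothetical X
y = 0.4 * x + rng.normal(size=60)         # hypothetical Y, moderately associated with X

r, p_corr = pearsonr(x, y)                # test of H0: rho = 0
fit = linregress(x, y)                    # test of H0: slope = 0
print(f"p from correlation test: {p_corr:.6f}")
print(f"p from slope test:       {fit.pvalue:.6f}")   # the two p-values are identical
```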
10
Prediction from Regression
  • Again, refer to the graph of the example, 3
    slides back.
  • The regression line (e.g., y = 81.6 + 2.16x) is
    used for
  • Predicting y for an individual with a known value
    of x. We are 95% sure that the individual's true
    y is between the outer (---) band endpoints
    vertically above x. This interval is analogous to
    mean ± 2SD.
  • Predicting the mean y for all subjects with a
    known value of x. We are 95% sure that this mean
    is between the inner (···) band endpoints
    vertically above x. This interval is analogous to
    mean ± 2SE.

11
Example Software Output
The regression equation is: Y = 81.6 + 2.16 X

Predictor   Coeff     StdErr    T       P
Constant    81.64     11.47      7.12   <0.0001
X            2.1557    0.1122   19.21   <0.0001

S = 21.72    R-Sq = 79.0%

Predicted values for X = 100:
Fit = 297.21    SE(Fit) = 2.17
95% CI: 292.89 - 301.52
95% PI: 253.89 - 340.52

19.21 = 2.16/0.112 would have to be between -2 and 2
if the slope were 0.
Predicted y = 81.6 + 2.16(100).
Ranges of Y given with 95% assurance for:
  • the mean of all subjects with x = 100 (the 95% CI);
  • an individual with x = 100 (the 95% PI).
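Output of this form can be reproduced with statsmodels. The sketch below uses simulated data, so its numbers will not match the slide; it shows where the coefficient table, the 95% CI for the mean, and the 95% PI for an individual come from:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(60, 140, size=50)                # hypothetical X data
y = 80 + 2.2 * x + rng.normal(0, 20, size=50)    # hypothetical Y data

X = sm.add_constant(x)                           # design matrix with an intercept column
res = sm.OLS(y, X).fit()
print(res.summary())                             # analogue of the Coeff/StdErr/T/P and R-Sq above

# Predicted values at x = 100: 95% CI for the mean and 95% PI for an individual
new = np.array([[1.0, 100.0]])
frame = res.get_prediction(new).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```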
12
Regression Issues
  • We are assuming that the relation is linear.
  • We can generalize to more complicated non-linear
    associations.
  • Transformations, e.g., logarithmic, can be made
    to achieve linearity.
  • The vertical distances between the actual y's and
    the predicted y's (on the line) are called
    residuals. Their magnitude should not depend on
    the value of x (e.g., they should not tend to be
    larger for larger x), and they should be
    symmetrically distributed about 0. If not,
    transformations can often achieve this. (A quick
    residual check is sketched below.)
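A quick residual check along these lines, on simulated data whose noise deliberately grows with x (everything here is hypothetical):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=80)
y = 5 + 3 * x + rng.normal(0, 1 + 0.5 * x, size=80)    # noise grows with x: a problem

fit = linregress(x, y)
resid = y - (fit.intercept + fit.slope * x)             # actual y minus predicted y

# Compare residual spread for low vs. high x; a clear imbalance suggests that a
# transformation (e.g., of y) may be needed before fitting a straight line.
lo, hi = resid[x < np.median(x)], resid[x >= np.median(x)]
print(f"SD of residuals, low x: {lo.std():.2f}   high x: {hi.std():.2f}")
```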

13
Multiple Regression Geometric View
"Multiple" refers to using more than one X (say X1
and X2) simultaneously to predict Y. Geometrically,
this is fitting a slanted plane to a cloud of points.
Graph from the readings: LHCY is the Y (homocysteine)
to be predicted from the two Xs, LCLC (folate) and
LB12 (B12).
LHCY = b0 + b1(LCLC) + b2(LB12)
14
Multiple Regression More General
  • More than 2 predictors can be used. The equation
    is then for a hyperplane: y = b0 + b1x1 + b2x2 +
    ... + bkxk.
  • A more realistic functional form, more complex
    than a plane, can be used. For example, to fit
    curvature for x2, use y = b0 + b1x1 + b2x2 +
    b3x2².
  • If the predictors themselves are highly correlated,
    then the fitted equation is imprecise. This is
    because the x1 and x2 data then lie along almost a
    line in the x1-x2 plane, so the fitted plane is
    like an unstable tabletop whose legs are not well
    spaced. (Both points are sketched below.)
  • How many and which variables should be included?
    Prediction strategies (e.g., stepwise) differ from
    assessing the significance of factors.
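A sketch of the curvature and collinearity points above, using statsmodels on simulated data (the variable names and numbers are invented): fit y on x1, x2 and x2², then use variance inflation factors to see how collinear x1 and x2 are.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(0, 1, n)
x2 = 0.9 * x1 + rng.normal(0, 0.3, n)                    # x2 is highly correlated with x1
y = 1 + 2 * x1 - x2 + 0.5 * x2**2 + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([x1, x2, x2**2]))    # y = b0 + b1*x1 + b2*x2 + b3*x2^2
res = sm.OLS(y, X).fit()
print(res.params)                                        # estimates of b0, b1, b2, b3

# Large VIFs for x1 and x2 signal the "unstable tabletop": correlated predictors
# make individual coefficients imprecise even when overall prediction is fine.
for i, name in enumerate(["x1", "x2", "x2^2"], start=1):
    print(name, round(variance_inflation_factor(X, i), 1))
```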

15
Reading Example HDL Cholesterol
Parameter    Estimate    Std Error      T    Pr > |t|   Standardized Estimate
Intercept     1.16448      0.28804    4.04    <.0001      0
AGE          -0.00092      0.00125   -0.74    0.4602     -0.05735
BMI          -0.01205      0.00295   -4.08    <.0001     -0.35719
BLC           0.05055      0.02215    2.28    0.0239      0.17063
PRSSY        -0.00041      0.00044   -0.95    0.3436     -0.09384
DIAST         0.00255      0.00103    2.47    0.0147      0.23779
GLUM         -0.00046      0.00018   -2.50    0.0135     -0.18691
SKINF         0.00147      0.00183    0.81    0.4221      0.07108
LCHOL         0.31109      0.10936    2.84    0.0051      0.20611

The predictors are age, body mass index, blood vitamin
C, systolic and diastolic blood pressures, skinfold
thickness, and the log of total cholesterol.

LHDL = 1.16 - 0.00092(AGE) + ... + 0.311(LCHOL)
16
Reading Example Coefficients
  • Interpretation of the coefficients (parameter
    estimates) from the output LHDL = 1.16 -
    0.00092(AGE) + ... + 0.311(LCHOL) on the previous
    slide:
  • Need to use the entire equation for making
    predictions.
  • Each coefficient measures the difference in
    expected LHDL between 2 subjects if the factor
    differs by 1 unit between the two subjects, and
    if all other factors are the same. E.g., expected
    LHDL is 0.00092 lower in a subject who is 1 year
    older than, but on other factors the same as,
    another subject. (This is illustrated numerically
    below.)
  • The situation in the previous point, where
    subjects differ on only one factor, may be
    unrealistic, or impossible.
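A small numeric illustration of the coefficient interpretation above, using the coefficients from the reading example; the two subjects' covariate values are invented and differ only by one year of age:

```python
# Coefficients from the reading example, in the order:
# intercept, AGE, BMI, BLC, PRSSY, DIAST, GLUM, SKINF, LCHOL
coef = [1.16448, -0.00092, -0.01205, 0.05055, -0.00041,
        0.00255, -0.00046, 0.00147, 0.31109]

def predict_lhdl(age, bmi, blc, prssy, diast, glum, skinf, lchol):
    """Predicted LHDL from the full fitted equation (all terms, not just two)."""
    x = [1, age, bmi, blc, prssy, diast, glum, skinf, lchol]
    return sum(b * v for b, v in zip(coef, x))

# Two hypothetical subjects, identical except that subject B is one year older
a = predict_lhdl(50, 27, 1.0, 120, 80, 95, 15, 5.3)
b = predict_lhdl(51, 27, 1.0, 120, 80, 95, 15, 5.3)
print(round(b - a, 5))   # -0.00092, exactly the AGE coefficient
```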

17
Reading Example Predictors
  • P-values measure the independent effect of a
    factor, i.e., whether it is associated with the
    outcome (LHDL here) after accounting for all of
    the other effects in the model.
  • Which factors should be included in the equation?
    Remove those that are not significant (p < 0.05)?
  • In general, it depends on the goal:
  • For prediction, more predictors → less bias, but
    less precision. Stepwise methods balance this.
  • For the importance of a particular factor, one
    needs to include that factor and the other factors
    that are either biologically or statistically
    related to the outcome.