Bivariate data
1
Lecture 12
  • Bivariate data
  • Correlation
  • Coefficient of Determination
  • Regression
  • One-way Analysis of Variance (ANOVA)

2
Correlation between hours worked and pay
3
Correlation between hours worked and pay
4
Bivariate Data
  • Bivariate data are just what they sound like:
    data with measurements on two variables; let's
    call them X and Y
  • Here, we will look at two continuous variables
  • We want to explore the relationship between the
    two variables
  • Example: fasting blood glucose and ventricular
    shortening velocity

5
Scatterplot
  • We can graphically summarize a bivariate data set
    with a scatterplot (also sometimes called a
    scatter diagram)
  • Plots values of one variable on the horizontal
    axis and values of the other on the vertical axis
  • Can be used to see how values of two variables
    tend to move with each other (i.e., how the
    variables are associated)
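
A scatterplot like those on the following slides takes only a few lines of code. Below is a minimal sketch using matplotlib, with made-up data standing in for the glucose/velocity example (the values are illustrative, not from the presentation):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: X = fasting blood glucose, Y = ventricular shortening velocity
rng = np.random.default_rng(0)
glucose = rng.uniform(4, 15, 25)
velocity = 1.8 - 0.03 * glucose + rng.normal(0, 0.15, 25)

plt.scatter(glucose, velocity)            # one point per (X, Y) pair
plt.xlabel("Fasting blood glucose")
plt.ylabel("Ventricular shortening velocity")
plt.title("Scatterplot of bivariate data")
plt.show()
```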

6
Scatterplot: positive correlation
7
Scatterplot: negative correlation
8
Scatterplot: real data example
9
Pearson's Correlation Coefficient, r
  • r indicates
  • the strength of the relationship (strong, weak, or none)
  • the direction of the relationship
  • positive (direct): variables move in the same
    direction
  • negative (inverse): variables move in opposite
    directions
  • r ranges in value from -1.0 to +1.0

[Number line: r runs from -1.0 (strong negative)
through 0.0 (no relationship) to +1.0 (strong
positive)]
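
Computing r is a one-liner with SciPy. A minimal sketch on made-up hours/pay data (echoing the earlier slides; the numbers are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up paired measurements for illustration
hours = np.array([10, 20, 30, 40, 50, 60])
pay = np.array([120, 210, 330, 410, 490, 620])

r, p = pearsonr(hours, pay)
print(f"r = {r:.3f}, p = {p:.4f}")  # r near +1.0: strong positive linear association
```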
10
Correlation (cont.)
Correlation quantifies the relationship between
two variables.
11
What r is...
  • r is a measure of LINEAR ASSOCIATION
  • The closer r is to -1 or +1, the more tightly the
    points on the scatterplot are clustered around a
    line
  • The sign of r (+ or -) is the same as the sign of
    the slope of the line
  • When r = 0, the points are not LINEARLY
    ASSOCIATED; this does NOT mean there is NO
    ASSOCIATION

12
...and what r is not
  • r is a measure of LINEAR ASSOCIATION
  • r does NOT tell us if Y is a function of X
  • r does NOT tell us if X causes Y
  • r does NOT tell us if Y causes X
  • r does NOT tell us what the scatterplot looks
    like

13
r ≈ 0: curved relation
14
r ≈ 0: outliers
15
r ≈ 0: parallel lines
16
r ≈ 0: different linear trends
17
r ≈ 0: random scatter
18
Correlation is NOT causation
  • You cannot infer from X and Y being highly
    correlated (r close to -1 or +1) that X is
    causing a change in Y
  • Y could be causing X
  • X and Y could both be varying along with a third,
    possibly unknown, factor (either causal or not)

19
(No Transcript)
20
Correlation matrix
21
(No Transcript)
22
Reading the Correlation Matrix
r = -.904
p = .013: the probability of getting a correlation
this size by sheer chance. Reject H0 if p < .05.
The matrix also reports the sample size.
Reported as: r(4) = -.904, p < .05
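
A matrix like this can be produced with pandas; the sketch below uses a made-up two-column data frame (the presentation's dataset is not reproduced here):

```python
import pandas as pd
from scipy.stats import pearsonr

# Made-up data frame with two variables
df = pd.DataFrame({
    "x": [2.1, 3.5, 4.0, 5.2, 6.1, 7.3],
    "y": [9.8, 8.1, 7.9, 6.0, 5.5, 3.9],
})

print(df.corr())                     # matrix of pairwise r values
r, p = pearsonr(df["x"], df["y"])    # r with its p-value for one pair
print(f"r = {r:.3f}, p = {p:.3f}")   # reject H0 of zero correlation if p < .05
```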
23
Interpretation of Correlation
  • Correlations
  • from 0 to 0.25 (0 to -0.25): little or no
    relationship
  • from 0.25 to 0.50 (-0.25 to -0.50): fair degree
    of relationship
  • from 0.50 to 0.75 (-0.50 to -0.75): moderate to
    good relationship
  • greater than 0.75 (or less than -0.75): very good
    to excellent relationship

24
Limitations of Correlation
  • linearity
  • can't describe non-linear relationships
  • e.g., the relation between anxiety and performance
  • truncation of range
  • underestimates the strength of a relationship if
    you can't see the full range of X values
  • no proof of causation
  • third-variable problem
  • a third variable could be causing the change in
    both variables
  • directionality: can't be sure which way causality
    flows

25
Coefficient of Determination, r²
  • The square of the correlation, r², is the
    proportion of variation in the values of y that
    is explained by the regression model with x
  • Amount of variance in y accounted for by x
  • Percentage increase in accuracy you gain by using
    the regression line to make predictions
  • 0 ≤ r² ≤ 1
  • The larger r², the stronger the linear
    relationship
  • The closer r² is to 1, the more confident we are
    in our prediction
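
Numerically, r² is just the square of Pearson's r; a minimal sketch continuing the made-up hours/pay example:

```python
import numpy as np
from scipy.stats import pearsonr

hours = np.array([10, 20, 30, 40, 50, 60])      # made-up predictor
pay = np.array([120, 210, 330, 410, 490, 620])  # made-up response

r, _ = pearsonr(hours, pay)
print(f"r^2 = {r**2:.3f}")  # proportion of variation in pay explained by hours
```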

26
Age vs. Height: r² = 0.9888
27
Age vs. Height: r² = 0.849
28
Linear Regression
  • Correlation measures the direction and strength
    of the linear relationship between two
    quantitative variables
  • A regression line
  • summarizes the relationship between two variables
    if the form of the relationship is linear.
  • describes how a response variable y changes as an
    explanatory variable x changes.
  • is often used as a mathematical model to predict
    the value of a response variable y based on a
    value of an explanatory variable x.

29
(Simple) Linear Regression
  • Refers to drawing a (particular, special) line
    through a scatterplot
  • Used for 2 broad purposes
  • Estimation
  • Prediction

30
Formula for Linear Regression
y = bx + a
b: the slope, or the change in y for every unit
change in x
a: the y-intercept, or the value of y when x = 0
y: the variable plotted on the vertical axis
x: the variable plotted on the horizontal axis
31
Interpretation of parameters
  • The regression slope is the average change in Y
    when X increases by 1 unit
  • The intercept is the predicted value for Y when
    X = 0
  • If the slope = 0, then X does not help in
    predicting Y (linearly)

32
Which line?
  • There are many possible lines that could be drawn
    through the cloud of points in the scatterplot

33
Least Squares
  • Q: Where does this equation come from?
  • A: It is the line that is "best" in the sense
    that it minimizes the sum of the squared errors
    in the vertical (Y) direction

[Scatterplot: the vertical distances ("errors")
from each point to the fitted line are what least
squares minimizes, measured along the Y axis]
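
A least-squares line can be fit with scipy.stats.linregress; a minimal sketch on made-up data:

```python
import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up X values
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # made-up Y values

fit = linregress(x, y)
print(f"slope b = {fit.slope:.3f}, intercept a = {fit.intercept:.3f}")
print(f"r = {fit.rvalue:.3f}, r^2 = {fit.rvalue**2:.3f}")

# The vertical errors whose squared sum the fit minimizes
y_hat = fit.intercept + fit.slope * x
print("sum of squared errors:", np.sum((y - y_hat) ** 2))
```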
34
Linear Regression
U.K. monthly return is the y variable
U.S. monthly return is the x variable
Question: What is the relationship between U.K.
and U.S. stock returns?
35
Correlation tells the strength of the relationship
between x and y. The relationship may not be
linear.

36
Linear Regression
A regression creates a model of the relationship
between x and y. It fits a line to the scatter
plot by minimizing the sum of squared vertical
distances between y and the line.
If the correlation is significant, then proceed
with a regression analysis.
37
Linear Regression
The slope is calculated as shown below; it tells
you the change in the dependent variable for every
unit change in the independent variable.
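
The slide's formula image is not reproduced in this transcript; assuming it showed the usual simple-regression slope, the standard form is:

```latex
b \;=\; \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \;=\; r \, \frac{s_y}{s_x}
```

where s_x and s_y are the sample standard deviations of x and y, and r is Pearson's correlation.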
38
The coefficient of determination, or R-square,
measures the variation explained by the best-fit
line as a percentage of the total variation.
39
Regression Graphic: Regression Line
40
Regression Equation
  • y = bx + a
  • y = predicted value of y
  • b = slope of the line
  • x = value of x that you plug in
  • a = y-intercept (where the line crosses the y
    axis)
  • In this case:
  • y = -4.263(x) + 125.401
  • So if the distance is 20 feet:
  • y = -4.263(20) + 125.401
  • y = -85.26 + 125.401
  • y = 40.141
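
A quick check of the slide's arithmetic (the coefficients are from the slide; the function name is ours):

```python
def predict(x, b=-4.263, a=125.401):
    """Predicted y for a given x, using the slide's fitted line y = bx + a."""
    return b * x + a

print(predict(20))  # -4.263 * 20 + 125.401 = 40.141
```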

41
SPSS Regression Set-up
  • Criterion
  • the y-axis variable
  • what you're trying to predict
  • Predictor
  • the x-axis variable
  • what you're basing the prediction on

42
Getting Regression Info from SPSS
y = b(x) + a
y = -4.263(20) + 125.401
43
Extrapolation
  • Interpolation: using a model to estimate Y for
    an X value within the range on which the model
    was based
  • Extrapolation: estimating based on an X value
    outside the range
  • Interpolation: good. Extrapolation: bad.

44
Nixon's Graph: Economic Growth
45
Nixon's Graph: Economic Growth
Start of Nixon Adm.
46
Nixon's Graph: Economic Growth
Start of Nixon Adm.
Now
47
Nixon's Graph: Economic Growth
Start of Nixon Adm.
Projection
Now
48
Conditions for regression
  • Straight enough condition (linearity)
  • Errors are mostly independent of X
  • Errors are mostly independent of anything else
    you can think of
  • Errors are more-or-less normally distributed

49
General ANOVA Setting: Comparisons of 2 or More
Means
  • The investigator controls one or more independent
    variables
  • called factors (or treatment variables)
  • each factor contains two or more levels (or
    groups or categories/classifications)
  • Observe the effects on the dependent variable
  • the response to the levels of the independent
    variable
  • Experimental design: the plan used to collect
    the data

50
Logic of ANOVA
  • Each observation differs from the Grand
    (total sample) Mean by some amount
  • There are two sources of variance from the mean:
  • 1) that due to the treatment or independent
    variable
  • 2) that which is unexplained by our treatment

51
One-Way Analysis of Variance
  • Evaluates the difference among the means of two
    or more groups
  • Example
  • Cholesterol levels in three groups
  • Assumptions
  • Populations are normally distributed
  • Populations have equal variances
  • Samples are randomly and independently drawn

52
Hypotheses of One-Way ANOVA
  • H0: all population means are equal
  • i.e., no treatment effect (no variation in means
    among groups)
  • H1: at least one population mean is different
  • i.e., there is a treatment effect
  • This does not mean that all population means are
    different (some pairs may be the same)

53
One-Factor ANOVA
All means are the same: the Null Hypothesis is
true (no Treatment Effect)
54
One-Factor ANOVA
(continued)
At least one mean is different: the Null
Hypothesis is NOT true (a Treatment Effect is
present)
55
Total Variation
56
Among-Group Variation
(continued)
57
Within-Group Variation
(continued)
58
Partitioning the Variation
  • Total variation can be split into two parts

SST = SSA + SSW
SST = Total Sum of Squares (total variation)
SSA = Sum of Squares Among Groups (among-group
variation)
SSW = Sum of Squares Within Groups (within-group
variation)
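
The formula images on the Total/Among/Within Variation slides are not reproduced in this transcript; assuming they showed the standard one-way ANOVA definitions, the three sums of squares are:

```latex
\mathrm{SST} = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2, \quad
\mathrm{SSA} = \sum_{j=1}^{c} n_j (\bar{x}_j - \bar{x})^2, \quad
\mathrm{SSW} = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2
```

where x̄ is the grand mean, x̄_j is the mean of group j, n_j is the size of group j, and c is the number of groups.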
59
One-Way ANOVA Table
Source of Variation | SS              | df    | MS (Variance)       | F ratio
Among Groups        | SSA             | c - 1 | MSA = SSA / (c - 1) | F = MSA / MSW
Within Groups       | SSW             | n - c | MSW = SSW / (n - c) |
Total               | SST = SSA + SSW | n - 1 |                     |

c = number of groups; n = sum of the sample sizes
from all groups; df = degrees of freedom
60
One-Way ANOVAF Test Statistic
H0: µ1 = µ2 = ... = µc
H1: At least two population means are different
  • Test statistic: F = MSA / MSW
  • MSA is the mean square among groups
  • MSW is the mean square within groups
  • Degrees of freedom
  • df1 = c - 1 (c = number of groups)
  • df2 = n - c (n = sum of sample sizes from
    all populations)

61
Interpreting the One-Way ANOVA F Statistic
  • The F statistic is the ratio of the among-groups
    estimate of variance to the within-groups
    estimate of variance
  • The ratio must always be positive
  • df1 = c - 1 will typically be small
  • df2 = n - c will typically be large
  • Decision rule:
  • Reject H0 if F > FU; otherwise do not reject H0
[F distribution with α = .05: do not reject H0 for
F values to the left of the critical value FU;
reject H0 in the upper tail to the right of FU]
62
One-Way ANOVA F Test Example
Gp 1   Gp 2   Gp 3
254    234    200
263    218    222
241    235    197
237    227    206
251    216    204

  • You want to see if cholesterol level differs
    across three groups.
  • You randomly select five patients per group and
    measure their cholesterol levels.
  • At the 0.05 significance level, is there a
    difference in mean cholesterol?
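
SciPy can run this F test directly on the three samples; a minimal sketch using the slide's data:

```python
from scipy.stats import f_oneway

gp1 = [254, 263, 241, 237, 251]
gp2 = [234, 218, 235, 227, 216]
gp3 = [200, 222, 197, 206, 204]

f_stat, p_value = f_oneway(gp1, gp2, gp3)      # one-way ANOVA across the groups
print(f"F = {f_stat:.3f}, p = {p_value:.6f}")  # F ≈ 25.275, p well below 0.05
```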

63
One-Way ANOVA Example: Scatter Diagram
[Scatter diagram: cholesterol (190 to 270) on the
vertical axis, plotted for Groups 1-3 with each
group's five values and its mean marked]
64
One-Way ANOVA Example Computations
Gp 1   Gp 2   Gp 3
254    234    200
263    218    222
241    235    197
237    227    206
251    216    204

X̄1 = 249.2   X̄2 = 226.0   X̄3 = 205.8   X̄ = 227.0
n1 = 5   n2 = 5   n3 = 5   n = 15   c = 3

SSA = 5(249.2 - 227)² + 5(226 - 227)² + 5(205.8 - 227)² = 4716.4
SSW = (254 - 249.2)² + (263 - 249.2)² + ... + (204 - 205.8)² = 1119.6
MSA = 4716.4 / (3 - 1) = 2358.2
MSW = 1119.6 / (15 - 3) = 93.3
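
A quick NumPy check of these computations (same data as above):

```python
import numpy as np

groups = [np.array([254, 263, 241, 237, 251]),
          np.array([234, 218, 235, 227, 216]),
          np.array([200, 222, 197, 206, 204])]

grand_mean = np.mean(np.concatenate(groups))                      # 227.0
ssa = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # 4716.4
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # 1119.6

c, n = len(groups), sum(len(g) for g in groups)
msa = ssa / (c - 1)  # 2358.2
msw = ssw / (n - c)  # 93.3
print(f"F = {msa / msw:.3f}")  # 25.275
```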
65
One-Way ANOVA Example Solution
  • H0: µ1 = µ2 = µ3
  • H1: the µj are not all equal
  • α = 0.05
  • df1 = 2, df2 = 12

Test statistic: F = MSA / MSW = 2358.2 / 93.3 = 25.275
Critical value: FU = 3.89
Decision: Reject H0 at α = 0.05, since F = 25.275 > FU = 3.89
Conclusion: There is evidence that at least one µj
differs from the rest
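
The critical value FU and the p-value can be read off SciPy's F distribution; a minimal sketch:

```python
from scipy.stats import f

alpha, df1, df2 = 0.05, 2, 12
f_crit = f.ppf(1 - alpha, df1, df2)  # upper critical value, FU ≈ 3.89
p_value = f.sf(25.275, df1, df2)     # upper-tail probability of the observed F
print(f"FU = {f_crit:.2f}, p = {p_value:.6f}")
```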
66
Significant and Non-significant Differences
Non-significant: Within > Between
Significant: Between > Within
67
ANOVA
WHEN YOU REJECT THE NULL: For a one-way ANOVA,
after you have rejected the null, you may want to
determine which treatment yielded the best
results. You must do a follow-on analysis to
determine whether the difference between each pair
of means is significant.
68
One-way ANOVA (example)
  • The study described here measures cortisol
    levels in 3 groups of subjects
  • Healthy (n = 16)
  • Depressed: non-melancholic (n = 22)
  • Depressed: melancholic (n = 18)

69
Results
  • Results were obtained as follows:

Source   DF     SS     MS      F       P
Grp.      2   164.7   82.3   6.61   0.003
Error    53   660.0   12.5
Total    55   824.7

Individual 95% CIs for the mean, based on the
pooled StDev:

Level    N     Mean    StDev
1       16    9.200    2.931
2       22   10.700    2.758
3       18   13.500    4.674

Pooled StDev = 3.529
[CI plot: each group's mean with its 95% interval,
on a scale from 7.5 to 15.0]

70
Multiple Comparison of the Means - 1
  • Several methods are available, depending on
    whether one wishes to compare means against a
    control mean (Dunnett) or to make overall
    comparisons (Tukey and Fisher)
  • Dunnett's comparisons with a control
  • Critical value = 2.27
  • Control = level (1) of Grp.
  • Intervals for treatment mean minus control mean:

Level    Lower    Center    Upper
2       -1.127    1.500     4.127
3        1.553    4.300     7.047

[Interval plot on a scale from -1.0 to 7.0]

71
Multiple Comparison of Means - 2
  • Tukey's pairwise comparisons
  • 95% CIs for differences (column-level mean minus
    row-level mean):

         1                   2
2   (-4.296,  1.296)
3   (-7.224, -1.376)   (-5.504, -0.096)

  • Fisher's pairwise comparisons
  • 95% CIs for differences:

         1                   2
2   (-3.826,  0.826)
3   (-6.732, -1.868)   (-5.050, -0.550)
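
These intervals can be reproduced from the summary statistics alone. Below is a minimal sketch of Fisher's (unadjusted) pairwise CIs, using the pooled StDev and error df from the ANOVA output above; the variable names are ours:

```python
from math import sqrt
from scipy.stats import t

n = {1: 16, 2: 22, 3: 18}
mean = {1: 9.200, 2: 10.700, 3: 13.500}
sp, df_error = 3.529, 53           # pooled StDev and error df from the output
t_crit = t.ppf(0.975, df_error)    # two-sided 95% t quantile

for i, j in [(1, 2), (1, 3), (2, 3)]:
    diff = mean[i] - mean[j]
    half = t_crit * sp * sqrt(1 / n[i] + 1 / n[j])
    print(f"mu{i} - mu{j}: ({diff - half:.3f}, {diff + half:.3f})")
# Reproduces Fisher's intervals, e.g. mu1 - mu3 ≈ (-6.732, -1.868)
```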

The End