Title: Bivariate data
1. Lecture 9
- Bivariate data
- Correlation
- Coefficient of Determination
- Regression
- One-way Analysis of Variance (ANOVA)
2. Bivariate Data
- Bivariate data are just what they sound like: data with measurements on two variables; let's call them X and Y
- Here, we will look at two continuous variables
- We want to explore the relationship between the two variables
- Example: fasting blood glucose and ventricular shortening velocity
3. Scatterplot
- We can graphically summarize a bivariate data set with a scatterplot (also sometimes called a scatter diagram)
- Plots values of one variable on the horizontal axis and values of the other on the vertical axis
- Can be used to see how the values of 2 variables tend to move with each other, i.e., how the variables are associated (a plotting sketch follows below)
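A minimal matplotlib sketch of such a plot; the glucose and velocity numbers are invented placeholders echoing the example above, not real measurements.

```python
# Scatterplot of two continuous variables, Y against X.
import matplotlib.pyplot as plt

glucose = [5.1, 6.2, 7.0, 7.8, 8.5, 9.3]    # X: invented values
velocity = [1.8, 1.7, 1.5, 1.4, 1.2, 1.1]   # Y: invented values

plt.scatter(glucose, velocity)
plt.xlabel("Fasting blood glucose")
plt.ylabel("Ventricular shortening velocity")
plt.title("Scatterplot")
plt.show()
```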
4. Scatterplot: positive correlation
5. Scatterplot: negative correlation
6. Scatterplot: real data example
7. Numerical Summary
- Typically, a bivariate data set is summarized numerically with 5 summary statistics
- These provide a fair summary for scatterplots with the same general shape as we just saw, like an oval or an ellipse
- We can summarize each variable separately: X mean, X SD; Y mean, Y SD
- But these numbers don't tell us how the values of X and Y vary together
8. Pearson's Correlation Coefficient r
- r indicates
- strength of relationship (strong, weak, or none)
- direction of relationship
- positive (direct): variables move in the same direction
- negative (inverse): variables move in opposite directions
- r ranges in value from -1.0 to +1.0 (a computational sketch follows below)
-1.0 = strong negative; 0.0 = no relationship; +1.0 = strong positive
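As a concrete illustration, a minimal Python sketch computing r directly from its definition, with invented values; np.corrcoef serves only as a cross-check.

```python
# Pearson's r: covariance of X and Y scaled by the two deviation norms.
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = [1.0, 2.0, 3.0, 4.0, 5.0]        # invented values
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(pearson_r(x, y))               # close to +1: strong positive
print(np.corrcoef(x, y)[0, 1])       # cross-check against NumPy
```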
9. Correlation (cont.)
Correlation measures the relationship between two variables.
10. What r is...
- r is a measure of LINEAR ASSOCIATION
- The closer r is to -1 or +1, the more tightly the points on the scatterplot are clustered around a line
- The sign of r (+ or -) is the same as the sign of the slope of the line
- When r = 0, the points are not LINEARLY ASSOCIATED; this does NOT mean there is NO ASSOCIATION
11. ...and what r is not
- r is a measure of LINEAR ASSOCIATION
- r does NOT tell us if Y is a function of X
- r does NOT tell us if X causes Y
- r does NOT tell us if Y causes X
- r does NOT tell us what the scatterplot looks like
12. r ≈ 0: curved relation
13. r ≈ 0: outliers
14. r ≈ 0: parallel lines
15. r ≈ 0: different linear trends
16. r ≈ 0: random scatter
17. Correlation is NOT causation
- You cannot infer that, because X and Y are highly correlated (r close to -1 or +1), X is causing a change in Y
- Y could be causing X
- X and Y could both be varying along with a third, possibly unknown factor (either causal or not)
19. Correlation matrix
21. Reading a Correlation Matrix
r = -.904
p = .013: the probability of getting a correlation this size by sheer chance. Reject H0 if p < .05.
Reported as r(4) = -.904, p < .05, where the number in parentheses reflects the sample size.
22. Interpretation of Correlation
- Correlations
- from 0 to 0.25 (or to -0.25): little or no relationship
- from 0.25 to 0.50 (or -0.25 to -0.50): fair degree of relationship
- from 0.50 to 0.75 (or -0.50 to -0.75): moderate to good relationship
- greater than 0.75 (or less than -0.75): very good to excellent relationship
23. Limitations of Correlation
- Linearity
- can't describe non-linear relationships
- e.g., the relation between anxiety and performance
- Truncation of range
- underestimates the strength of the relationship if you can't see the full range of X values
- No proof of causation
- third-variable problem: a third variable could be causing the change in both variables
- directionality: can't be sure which way causality flows
24. Coefficient of Determination r²
- The square of the correlation, r², is the proportion of variation in the values of y that is explained by the regression model with x
- Amount of variance in y accounted for by x
- Percentage increase in accuracy you gain by using the regression line to make predictions
- 0 ≤ r² ≤ 1
- The larger r² is, the stronger the linear relationship
- The closer r² is to 1, the more confident we are in our prediction (see the sketch below)
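A small sketch of the step from r to r², reusing invented values like those above: squaring the correlation gives the proportion of variation explained.

```python
# r squared: the proportion of variation in y explained by x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # invented values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
print(r ** 2)  # about 0.998: ~99.8% of the variation in y is explained
```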
25. Age vs. Height: r² = 0.9888
26. Age vs. Height: r² = 0.849
27. Linear Regression
- Correlation measures the direction and strength of the linear relationship between two quantitative variables
- A regression line
- summarizes the relationship between two variables if the form of the relationship is linear
- describes how a response variable y changes as an explanatory variable x changes
- is often used as a mathematical model to predict the value of a response variable y based on a value of an explanatory variable x
28. (Simple) Linear Regression
- Refers to drawing a (particular, special) line through a scatterplot
- Used for 2 broad purposes
- Estimation
- Prediction
29. Formula for Linear Regression
y = bx + a
- b: the slope, or the change in y for every unit change in x
- a: the y-intercept, or the value of y when x = 0
- y: the variable plotted on the vertical axis
- x: the variable plotted on the horizontal axis
30. Interpretation of Parameters
- The regression slope is the average change in Y when X increases by 1 unit
- The intercept is the predicted value for Y when X = 0
- If the slope = 0, then X does not help in predicting Y (linearly)
31. Which line?
- There are many possible lines that could be drawn through the cloud of points in the scatterplot
32. Least Squares
- Q: Where does this equation come from?
- A: It is the line that is "best" in the sense that it minimizes the sum of the squared errors in the vertical (Y) direction (see the sketch below)
[Figure: scatterplot with the fitted line and vertical error segments from each point to the line]
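A minimal sketch of that least-squares computation, with invented illustrative data; the closed-form slope and intercept minimize the sum of squared vertical errors, and np.polyfit serves only as a cross-check.

```python
# Least-squares line: minimizes the sum of squared vertical (Y) errors.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # invented values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

errors = y - (b * x + a)               # vertical errors at each point
print(b, a, (errors ** 2).sum())       # any other line has a larger sum
print(np.polyfit(x, y, 1))             # cross-check: [slope, intercept]
```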
33. Linear Regression
U.K. monthly return is the y variable.
U.S. monthly return is the x variable.
Question: What is the relationship between U.K. and U.S. stock returns?
34. Correlation tells the strength of the relationship between x and y; the relationship may not be linear.
35. Linear Regression
A regression creates a model of the relationship between x and y. It fits a line to the scatterplot by minimizing the squared vertical distance between the y values and the line.
If the correlation is significant, then create a regression analysis.
36. Linear Regression
The slope is calculated as b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)².
It tells you the change in the dependent variable for every unit change in the independent variable.
37. The coefficient of determination, or R-square, measures the variation explained by the best-fit line as a percent of the total variation: R² = explained variation / total variation.
38. Regression Graphic: Regression Line
39. Regression Equation
- y = bx + a
- y = predicted value of y
- b = slope of the line
- x = value of x that you plug in
- a = y-intercept (where the line crosses the y-axis)
- In this case:
- y = -4.263(x) + 125.401
- So if the distance is 20 feet:
- y = -4.263(20) + 125.401
- y = -85.26 + 125.401
- y = 40.141
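This arithmetic is easy to mirror in code. A tiny sketch wrapping the slide's fitted coefficients; the helper name predict is mine, for illustration only:

```python
# Plug an x value into the fitted line y = bx + a from the slide.
def predict(x, b=-4.263, a=125.401):
    """Predicted y for a given x, using the slide's coefficients."""
    return b * x + a

print(predict(20))  # -85.26 + 125.401 = 40.141
```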
40. SPSS Regression Set-up
- Criterion
- the y-axis variable
- what you're trying to predict
- Predictor
- the x-axis variable
- what you're basing the prediction on
41. Getting Regression Info from SPSS
y = b(x) + a
y = -4.263(20) + 125.401
42. Extrapolation
- Interpolation: using a model to estimate Y for an X value within the range on which the model was based
- Extrapolation: estimating based on an X value outside the range
- Interpolation good, extrapolation bad
43. Nixon's Graph: Economic Growth
44. Nixon's Graph: Economic Growth (marking the start of the Nixon Adm.)
45. Nixon's Graph: Economic Growth (start of the Nixon Adm.; "Now")
46. Nixon's Graph: Economic Growth (start of the Nixon Adm.; projection beyond "Now")
47. Conditions for Regression
- "Straight enough" condition (linearity)
- Errors are mostly independent of X
- Errors are mostly independent of anything else you can think of
- Errors are more-or-less normally distributed
These conditions can be checked informally with residual plots, as sketched below.
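A rough sketch of the two most common checks, with invented illustrative data: plotting residuals against X (looking for no pattern) and a histogram of residuals (looking for a roughly bell shape).

```python
# Informal residual checks for the regression conditions.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])      # invented values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

b, a = np.polyfit(x, y, 1)        # least-squares slope and intercept
residuals = y - (b * x + a)       # vertical errors around the line

plt.scatter(x, residuals)         # want: no visible pattern vs. X
plt.axhline(0)
plt.xlabel("x")
plt.ylabel("residual")
plt.show()

plt.hist(residuals)               # want: roughly bell-shaped
plt.show()
```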
48. General ANOVA Setting: Comparisons of 2 or More Means
- Investigator controls one or more independent variables
- called factors (or treatment variables)
- each factor contains two or more levels (or groups or categories/classifications)
- Observe effects on the dependent variable
- response to levels of the independent variable
- Experimental design: the plan used to collect the data
49. Logic of ANOVA
- Each observation is different from the Grand (total sample) Mean by some amount
- There are two sources of variance from the mean
- 1) that due to the treatment or independent variable
- 2) that which is unexplained by our treatment
50. One-Way Analysis of Variance
- Evaluate the difference among the means of two or more groups
- Examples: accident rates for 1st, 2nd, and 3rd shift; expected mileage for five brands of tires
- Assumptions
- populations are normally distributed
- populations have equal variances
- samples are randomly and independently drawn
51. Hypotheses of One-Way ANOVA
- H0: µ1 = µ2 = ... = µc
- all population means are equal
- i.e., no treatment effect (no variation in means among groups)
- H1: not all of the population means are equal
- at least one population mean is different
- i.e., there is a treatment effect
- does not mean that all population means are different (some pairs may be the same)
52. One-Factor ANOVA
All means are the same: the null hypothesis is true (no treatment effect).
53. One-Factor ANOVA (continued)
At least one mean is different: the null hypothesis is NOT true (a treatment effect is present). This can happen whether only some of the means differ or all of them do.
54. Partitioning the Variation
- Total variation can be split into two parts:
SST = SSA + SSW
SST = Total Sum of Squares (total variation)
SSA = Sum of Squares Among Groups (among-group variation)
SSW = Sum of Squares Within Groups (within-group variation)
55. Partitioning the Variation (continued)
SST = SSA + SSW
- Total variation: the aggregate dispersion of the individual data values across the various factor levels (SST)
- Among-group variation: dispersion between the factor sample means (SSA)
- Within-group variation: dispersion that exists among the data values within a particular factor level (SSW)
56. Partition of Total Variation
Total Variation (SST), d.f. = n - 1, splits into:
- Variation Due to Factor (SSA), d.f. = c - 1; commonly referred to as Sum of Squares Between, Sum of Squares Among, Sum of Squares Explained, or Among-Groups Variation
- Variation Due to Random Sampling (SSW), d.f. = n - c; commonly referred to as Sum of Squares Within, Sum of Squares Error, Sum of Squares Unexplained, or Within-Group Variation
57. Total Sum of Squares
Within SST = SSA + SSW, the total is SST = Σj Σi (Xij - X̄)²
- Where:
- SST = total sum of squares
- c = number of groups (levels or treatments)
- nj = number of observations in group j
- Xij = ith observation from group j
- X̄ = grand mean (mean of all data values)
58. Total Variation (continued)
59. Among-Group Variation
Within SST = SSA + SSW, the among-group piece is SSA = Σj nj (X̄j - X̄)²
- Where:
- SSA = sum of squares among groups
- c = number of groups
- nj = sample size from group j
- X̄j = sample mean from group j
- X̄ = grand mean (mean of all data values)
60. Among-Group Variation (continued)
Variation due to differences among groups.
Mean Square Among: MSA = SSA / (c - 1), dividing by its degrees of freedom
61. Among-Group Variation (continued)
62. Within-Group Variation
Within SST = SSA + SSW, the within-group piece is SSW = Σj Σi (Xij - X̄j)²
- Where:
- SSW = sum of squares within groups
- c = number of groups
- nj = sample size from group j
- X̄j = sample mean from group j
- Xij = ith observation in group j
63. Within-Group Variation (continued)
Summing the variation within each group and then adding over all groups:
Mean Square Within: MSW = SSW / (n - c), dividing by its degrees of freedom
64. Within-Group Variation (continued)
65. Obtaining the Mean Squares
The mean squares are the sums of squares divided by their degrees of freedom: MSA = SSA / (c - 1) and MSW = SSW / (n - c). A computational sketch follows below.
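As a concrete illustration, a minimal Python sketch of the whole partition; the function name anova_partition is my own invention, not a library routine, and groups is assumed to be a list of per-group samples.

```python
# Partition total variation into among-group and within-group pieces,
# then form the mean squares by dividing by the degrees of freedom.
import numpy as np

def anova_partition(groups):
    """Return (SSA, SSW, SST, MSA, MSW) for a list of per-group samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    n, c = len(all_values), len(groups)
    ssa = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    sst = ((all_values - grand_mean) ** 2).sum()  # equals ssa + ssw
    return ssa, ssw, sst, ssa / (c - 1), ssw / (n - c)
```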
66. One-Way ANOVA Table

Source of Variation    SS                df      MS (Variance)        F ratio
Among Groups           SSA               c - 1   MSA = SSA / (c - 1)  F = MSA / MSW
Within Groups          SSW               n - c   MSW = SSW / (n - c)
Total                  SST = SSA + SSW   n - 1

c = number of groups; n = sum of the sample sizes from all groups; df = degrees of freedom
67. One-Way ANOVA F Test Statistic
H0: µ1 = µ2 = ... = µc
H1: At least two population means are different
- Test statistic: F = MSA / MSW
- MSA is the mean square among groups
- MSW is the mean square within groups
- Degrees of freedom
- df1 = c - 1 (c = number of groups)
- df2 = n - c (n = sum of sample sizes from all populations)
68. Interpreting the One-Way ANOVA F Statistic
- The F statistic is the ratio of the among-groups estimate of variance to the within-groups estimate of variance
- The ratio must always be positive
- df1 = c - 1 will typically be small
- df2 = n - c will typically be large
- Decision rule: reject H0 if F > FU; otherwise do not reject H0
[Figure: F distribution with α = .05; the rejection region lies to the right of the critical value FU]
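Rather than reading FU from a table, it can be computed, assuming SciPy is available; the degrees of freedom here anticipate the example on the next slide (c = 3 groups, n = 15 observations):

```python
# Upper critical value FU of the F distribution at significance level alpha.
from scipy.stats import f

alpha = 0.05
df1, df2 = 3 - 1, 15 - 3           # c - 1 and n - c for the coming example
print(f.ppf(1 - alpha, df1, df2))  # about 3.89: reject H0 when F > FU
```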
69. One-Way ANOVA F Test Example

Gp 1   Gp 2   Gp 3
254    234    200
263    218    222
241    235    197
237    227    206
251    216    204

- You want to see if cholesterol level is different in three groups.
- You randomly select five patients per group and measure their cholesterol levels.
- At the 0.05 significance level, is there a difference in mean cholesterol?
70. One-Way ANOVA Example: Scatter Diagram
[Figure: scatter diagram of the cholesterol values from the table above, plotted by group; vertical axis runs from 190 to 270, horizontal axis shows groups 1 to 3]
71. One-Way ANOVA Example: Computations
Using the data above:
X̄1 = 249.2, X̄2 = 226.0, X̄3 = 205.8, grand mean X̄ = 227.0
n1 = 5, n2 = 5, n3 = 5, n = 15, c = 3

SSA = 5(249.2 - 227)² + 5(226 - 227)² + 5(205.8 - 227)² = 4716.4
SSW = (254 - 249.2)² + (263 - 249.2)² + ... + (204 - 205.8)² = 1119.6
MSA = 4716.4 / (3 - 1) = 2358.2
MSW = 1119.6 / (15 - 3) = 93.3
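These hand computations can be cross-checked in a couple of lines, assuming SciPy is installed:

```python
# Cross-check of the cholesterol example with SciPy's one-way ANOVA.
from scipy.stats import f_oneway

gp1 = [254, 263, 241, 237, 251]
gp2 = [234, 218, 235, 227, 216]
gp3 = [200, 222, 197, 206, 204]

F, p = f_oneway(gp1, gp2, gp3)
print(F, p)  # F is about 25.275 (= 2358.2 / 93.3); p is far below 0.05
```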
72. One-Way ANOVA Example: Solution
- H0: µ1 = µ2 = µ3
- H1: the µj are not all equal
- α = 0.05
- df1 = 2, df2 = 12
- Critical value: FU = 3.89
- Test statistic: F = MSA / MSW = 2358.2 / 93.3 = 25.275
- Decision: F = 25.275 falls in the rejection region (F > FU = 3.89), so reject H0 at α = 0.05
- Conclusion: there is evidence that at least one µj differs from the rest
73. Significant and Non-significant Differences
Non-significant: within-group variation > between-group variation
Significant: between-group variation > within-group variation
74. ANOVA (summary)
- The null hypothesis is that there is no difference between the means.
- The alternate hypothesis is that at least two means differ.
- Use the F statistic as your test statistic. It tests the between-sample variance (difference between the means) against the within-sample variance (variability within the sample). The larger this ratio is, the more likely the means are different.
- Degrees of freedom for the numerator is k - 1 (k is the number of treatments).
- Degrees of freedom for the denominator is n - k (n is the number of responses).
- If the test F is larger than the critical F, then reject the null.
- If the p-value is less than alpha, then reject the null.
75. ANOVA (summary)
- Assumptions
- All k population probability distributions are normal.
- The k population variances are equal.
- The samples from each population are random and independent.
76. ANOVA: When You Reject the Null
For a one-way ANOVA, after you have rejected the null you may want to determine which treatment yielded the best results. You must do follow-on analysis to determine whether the difference between each pair of means is significant (see the sketch below).
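One common follow-on is Tukey's HSD; here is a minimal sketch using scipy.stats.tukey_hsd (available in SciPy 1.8 and later), applied to the earlier cholesterol data:

```python
# Follow-on pairwise comparisons with Tukey's HSD on the cholesterol data.
from scipy.stats import tukey_hsd

gp1 = [254, 263, 241, 237, 251]
gp2 = [234, 218, 235, 227, 216]
gp3 = [200, 222, 197, 206, 204]

result = tukey_hsd(gp1, gp2, gp3)
print(result)  # pairwise mean differences with confidence intervals;
               # an interval that excludes 0 marks a significant pair
```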
77. One-Way ANOVA (example)
- The study described here measures cortisol levels in 3 groups of subjects
- Healthy (n = 16)
- Depressed, non-melancholic (n = 22)
- Depressed, melancholic (n = 18)
78. Results
Results were obtained as follows:

Source   DF      SS      MS      F       P
Grp.      2    164.7    82.3    6.61    0.003
Error    53    660.0    12.5
Total    55    824.7

Individual 95% CIs for the means, based on the pooled StDev (displayed from 7.5 to 15.0):

Level     N     Mean    StDev
1        16    9.200    2.931
2        22   10.700    2.758
3        18   13.500    4.674

Pooled StDev = 3.529
79. Multiple Comparison of the Means - 1
- Several methods are available, depending upon whether one wishes to compare means with a control mean (Dunnett) or to make an overall comparison (Tukey and Fisher)
- Dunnett's comparisons with a control:

Critical value = 2.27
Control = level (1) of Grp.
Intervals for treatment mean minus control mean:

Level    Lower    Center    Upper
2       -1.127    1.500     4.127
3        1.553    4.300     7.047
80. Multiple Comparison of Means - 2
- Tukey's pairwise comparisons; intervals for (column level mean) - (row level mean):

        1                   2
2   (-4.296, 1.296)
3   (-7.224, -1.376)    (-5.504, -0.096)

- Fisher's pairwise comparisons; intervals for (column level mean) - (row level mean):

        1                   2
2   (-3.826, 0.826)
3   (-6.732, -1.868)    (-5.050, -0.550)
The End