Title: Correlation 2
1. Correlation 2
- Computations and the best fitting line.
2. Computing r from a more realistic set of data
- A study was performed to investigate whether the quality of an image affects reading time.
- The experimental hypothesis was that reduced quality would slow down reading time.
- Quality was measured on a scale of 1 to 10. Reading time was in seconds.
3. Quality vs Reading Time data: compute the correlation

Quality (scale 1-10)     4.30  4.55  5.55  5.65  6.30  6.45  6.45
Reading time (seconds)   8.1   8.5   7.8   7.3   7.5   7.3   6.0

Is there a relationship? Check for linearity. Compute r.
4. Calculate t scores for X

X                      4.30  4.55  5.55  5.65  6.30  6.45  6.45

5. Calculate t scores for Y

Y                      8.1   8.5   7.8   7.3   7.5   7.3   6.0
Y - Y-bar              0.60  1.00  0.30 -0.20  0.00 -0.20 -1.50
(Y - Y-bar)^2          0.36  1.00  0.09  0.04  0.00  0.04  2.25
tY = (Y - Y-bar)/sY    0.76  1.26  0.38 -0.25  0.00 -0.25 -1.89

$\sum Y = 52.5$, $n = 7$, $\bar{Y} = 7.50$
$MS_W = 3.78/(7 - 1) = 0.63$
$s_Y = \sqrt{0.63} = 0.79$
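The steps on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `t_scores` is made up for this sketch, and the data are the reading times from the slide.

```python
import math

# Reading times (seconds) from the slide.
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]

def t_scores(scores):
    """Return (mean, estimated standard deviation, list of t scores)."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((v - mean) ** 2 for v in scores)  # sum of squared deviations
    s = math.sqrt(ss / (n - 1))                # square root of MSW
    return mean, s, [(v - mean) / s for v in scores]

mean_y, s_y, t_y = t_scores(reading_time)
print(round(mean_y, 2), round(s_y, 2))
print([round(t, 2) for t in t_y])
```

Passing the Quality scores to the same function gives the tX scores.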
6. Plot t scores

tY    0.76  1.28  0.39 -0.25  0.00 -0.25 -1.89
tX   -1.48 -1.19 -0.07  0.05  0.78  0.95  0.95

7. t score plot with best fitting line: linear? YES!
8. Calculate r

tY             0.76  1.28  0.39 -0.25  0.00 -0.25 -1.88
tX            -1.48 -1.19 -0.07  0.05  0.78  0.95  0.95
tX - tY       -2.24 -2.47 -0.46  0.30  0.78  1.20  2.83
(tX - tY)^2    5.02  6.10  0.21  0.09  0.61  1.44  8.01

$\sum (t_X - t_Y)^2 / (n_P - 1) = 21.48/(7 - 1) = 3.580$
$r = 1 - \tfrac{1}{2}(3.580) = 1 - 1.79 = -0.790$
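As a rough check, the same formula can be run at full precision. The sketch below assumes the Quality and Reading Time data from the earlier slide; `t_scores` and `correlation` are illustrative names, not anything defined in the course. Because it does not round the t scores to two decimals, it lands near -0.78, close to the hand-computed value on the slide.

```python
import math

quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]

def t_scores(scores):
    """Convert raw scores to t scores using the estimated standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in scores) / (n - 1))
    return [(v - mean) / s for v in scores]

def correlation(x, y):
    """r = 1 - (1/2) * sum((tX - tY)^2) / (nP - 1), where nP is the number of pairs."""
    tx, ty = t_scores(x), t_scores(y)
    n_pairs = len(tx)
    mean_sq_diff = sum((a - b) ** 2 for a, b in zip(tx, ty)) / (n_pairs - 1)
    return 1 - 0.5 * mean_sq_diff

r = correlation(quality, reading_time)
print(round(r, 3))
```

The negative sign matches the experimental hypothesis: lower image quality goes with longer reading times.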
9. Best fitting line
10. The definition of the best fitting line plotted on t axes
- A best fitting line minimizes the average squared vertical distance of Y scores in the sample (expressed as tY scores) from the line.
- The best fitting line is a least squares, unbiased estimate of values of Y in the sample.
- The generic formula for a line is $Y = mX + b$, where m is the slope and b is the Y intercept.
- Thus, any specific line, such as the best fitting line, can be defined by its slope and its intercept.
11. The intercept of the best fitting line plotted on t axes
- The origin is the point where both tX and tY = 0.000.
- So the origin represents the mean of both the X and Y variables.
- When plotted on t axes, all best fitting lines go through the origin.
- Thus, the tY intercept of the best fitting line = 0.000.
12. The slope of and formula for the best fitting line
- When plotted on t axes, the slope of the best fitting line = r, the correlation coefficient.
- To define a line we need its slope and Y intercept.
- r = the slope, and the tY intercept = 0.000.
- The formula for the best fitting line is therefore $t_Y = r\,t_X + 0.000$, or $t_Y = r\,t_X$.
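One way to see that the slope really is r is to fit an ordinary least squares line to the t scores and compare. The sketch below does this for the Quality and Reading Time data; the variable names are illustrative, and the least squares formulas used are the standard textbook ones rather than anything specific to these slides.

```python
import math

quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]

def t_scores(scores):
    """Convert raw scores to t scores using the estimated standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in scores) / (n - 1))
    return [(v - mean) / s for v in scores]

tx, ty = t_scores(quality), t_scores(reading_time)
n = len(tx)

# Least squares line through the (tX, tY) points.
mean_tx, mean_ty = sum(tx) / n, sum(ty) / n
slope = (sum((a - mean_tx) * (b - mean_ty) for a, b in zip(tx, ty))
         / sum((a - mean_tx) ** 2 for a in tx))
intercept = mean_ty - slope * mean_tx

# r from the formula used on the "Calculate r" slide.
r = 1 - 0.5 * sum((a - b) ** 2 for a, b in zip(tx, ty)) / (n - 1)

print(round(slope, 3), round(intercept, 3), round(r, 3))
```

The fitted slope and r agree, and the intercept is zero up to floating-point error.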
13. Here's how a visual representation of the best fitting line (slope = r, Y intercept = 0.000) and the dots representing tX and tY scores might be described. (Whether the correlation is positive or negative doesn't matter.)
- Perfect - scores fall exactly on a straight line.
- Strong - most scores fall near the line.
- Moderate - some are near the line, some not.
- Weak - the scores are only mildly linear.
- Independent - the scores are not linear at all.
14. Strength of a relationship
15. Strength of a relationship
16. Strength of a relationship: Moderate, r about .500
17. Strength of a relationship: r about 0.000
18. r = .800, the formula for the best fitting line is ???
19. r = -.800, the formula for the best fitting line is ???
20. r = 0.000, the formula for the best fitting line is
21. Notice what that formula for independent variables says
- $t_Y = r\,t_X = 0.000\,(t_X) = 0.000$
- When tY = 0.000, you are at the mean of Y.
- So, when variables are independent, the best fitting line says that the best estimate of Y scores in the sample is back to the mean of Y, regardless of your score on X.
- Thus, when variables are independent we go back to saying everyone will score right at the mean.
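Written out with the definition of a t score from the earlier slides, the step from tY = 0.000 to "the mean of Y" is a short piece of algebra. The notation below is just one way to lay it out:

```latex
t_Y = \frac{Y - \bar{Y}}{s_Y} = 0.000
\;\Longrightarrow\; Y - \bar{Y} = 0
\;\Longrightarrow\; Y = \bar{Y}
```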
22. A note of caution: watch out for the plot for which the best fitting line is a curve.
23. Confidence intervals around rhoT: relation to Chapter 6
- In Chapter 6 we learned to create confidence intervals around muT that allowed us to test a theory.
- To test our theory about mu we took a random sample, computed the sample mean and standard deviation, and determined whether the sample mean fell into that interval.
- If it did not, we had shown the theory that led us to predict muT was false.
- We then discarded the theory and muT and used the sample mean as our best estimate of the true population mean.
24. If we discard muT, what do we use as our best estimate of mu?
- Generally, our best estimate of a population parameter is the sample statistic that estimates it.
- Our best estimate of mu has been and is the sample mean, X-bar.
- Since we have discarded our theory, we go back to using X-bar as our best (least squares, unbiased, consistent) estimate of mu.
25. More generally, we can test a theory (hypothesis) about any population parameter using a similar confidence interval.
- We theorize about what the value of the population parameter is.
- We get an estimate of the variability of the parameter.
- We construct a confidence interval (usually a 95% confidence interval) in which our hypothesis says that the sample statistic should fall.
- We obtain a random sample and determine whether the sample statistic falls inside or outside our confidence interval.
26. The sample statistic will fall inside or outside of the CI.95
- If the sample statistic falls inside the confidence interval, our theory has received some support and we hold on to it.
- But the more interesting case is when the sample statistic falls outside the confidence interval.
- Then we must discard the theory and the theory-based estimate of the population parameter.
- In that case, our best estimate of the population parameter is the sample statistic.
- Remember, the sample statistic is a least squares, unbiased, consistent estimate of its population parameter.
27. We are going to do the same thing with a theory about rho
- rho is the correlation coefficient for the population.
- If we have a theory about rho, we can create a 95% confidence interval into which we expect r will fall.
- An r computed from a random sample will then fall inside or outside the confidence interval.
28. When r falls inside or outside of the CI.95 around rhoT
- If r falls inside the confidence interval, our theory about rho has received some support and we hold on to it.
- But the more interesting case is when r falls outside the confidence interval.
- Then we must discard the theory and the theory-based estimate of the population parameter.
- In that case, our best estimate of rho is the r we found in our random sample.
- Thus, when r falls outside the CI.95 we can go back to using it as a least squares, unbiased estimate of rho.
29. Chapter 7 slides end here
- Rest of slides are for other chapters and should not be reviewed here.
- RK 10/24
30. Why is it so important to determine whether r fits a theory?
- In Chapter 8 we go on to predict values of Y from values of X and r.
- The formula we use is called the regression equation. It is very much like the formula for the best fitting line.
- The only difference is that the best fitting line describes the relationship among the Y scores in the sample.
- But in Chapter 8 we move to predicting scores for people who are in the population from which the sample was drawn, but not in the sample.
31. That's dangerous.
- Let me give you an example.
32. Assume you are the personnel officer for a mid-size company.
- You need to hire a typist.
- There are 2 applicants for the job.
- You give the applicants a typing test.
- Which would you hire: someone who types 6 words a minute with 12 mistakes, or someone who types 100 words a minute with 1 mistake?
33. Who would you hire?
- Of course, you would predict that the second person will be a better typist and hire that person.
- Notice that we never gave the person with 6 words/minute a chance to be a typist in our firm.
- We prejudged her on the basis of the typing test.
- That is probably valid in this case: a typing test probably predicts fairly well how good a typist someone will be.
34. But say the situation is a little more complicated!
- You have several applicants for a leadership position in your firm.
- But it is not 2002, it is 1957, when we knew that only white males were capable of leadership in corporate America.
- That is, we all know that leadership ability is correlated with both gender and skin color: white and male are associated with high leadership ability, and darker skin color and female gender with lower leadership ability.
- We now know this is absurd, but lots of people were never
35. Confidence intervals around muT
36. Confidence intervals and hypothetical means
- We frequently have a theory about what the mean of a distribution should be.
- To be scientific, that theory about mu must be able to be proved wrong (falsified).
- One way to test a theory about a mean is to state a range where sample means should fall if the theory is correct.
- We usually state that range as a 95% confidence interval.
37.
- To test our theory, we take a random sample from the appropriate population and see if the sample mean falls where the theory says it should, inside the confidence interval.
- If the sample mean falls outside the 95% confidence interval established by the theory, the evidence suggests that our theoretical population mean and the theory that led to its prediction are wrong.
- When that happens our theory has been falsified. We must discard it and look for an alternative explanation of our data.
38. For example
- For example, let's say that we had a new antidepressant drug we wanted to peddle. Before we can do that we must show that the drug is safe.
- Drugs like ours can cause problems with body temperature. People can get chills or fever.
- We want to show that body temperature is not affected by our new drug.
39. Testing a theory
- Everyone knows that normal body temperature for healthy adults is 98.6°F.
- Therefore, it would be nice if we could show that after taking our drug, healthy adults still had an average body temperature of 98.6°F.
- So we might test a sample of 16 healthy adults, first giving them a standard dose of our drug and, when enough time had passed, taking their temperature to see whether it was 98.6°F on the average.
40. Testing a theory - 2
- Of course, even if we are right and our drug has no effect on body temperature, we wouldn't expect a sample mean to be precisely 98.600000.
- We would expect some sampling fluctuation around a population mean of 98.6°F.
- So, if our drug does not cause change in body temperature, the sample mean should be close to 98.6. It should, in fact, be within the 95% confidence interval around muT, 98.6.
- SO WE MUST CONSTRUCT A 95% CONFIDENCE INTERVAL AROUND 98.6° AND SEE WHETHER OUR SAMPLE MEAN FALLS INSIDE OR OUTSIDE THE CI.
41. To create a confidence interval around muT, we must estimate sigma from a sample.
- We randomly select a group of 16 healthy individuals from the population.
- We administer a standard clinical dose of our new drug for 3 days.
- We carefully measure body temperature.
- RESULTS: We find that the average body temperature in our sample is 99.5°F with an estimated standard deviation of 1.40° (s = 1.40).
- IS 99.5°F IN THE 95% CI AROUND muT???
42. Knowing s and n we can easily compute the estimated standard error of the mean.
- Let's say that s = 1.40° and n = 16.
- $s_{\bar{X}} = s/\sqrt{n} = 1.40/\sqrt{16} = 1.40/4.00 = 0.35$
- Using this estimated standard error we can construct a 95% confidence interval for the body temperature of a sample of 16 healthy adults.
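The standard error computation is a one-liner. The sketch below uses the slide's values; the function name `standard_error` is just an illustrative choice.

```python
import math

def standard_error(s, n):
    """Estimated standard error of the mean: s divided by the square root of n."""
    return s / math.sqrt(n)

s = 1.40  # estimated standard deviation in degrees F, from the slide
n = 16    # sample size, from the slide

se = standard_error(s, n)
print(round(se, 2))
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.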
43. We learned how to create confidence intervals with the Z distribution in Chapter 4. 95% of sample means will fall in a symmetrical interval around mu that goes from 1.960 standard errors below mu to 1.960 standard errors above mu.
- A way to write that fact in statistical language is
- $\mathrm{CI}_{.95}: \mu \pm Z_{CRIT}\,\sigma_{\bar{X}}$, or
- $\mathrm{CI}_{.95}: \mu - Z_{CRIT}\,\sigma_{\bar{X}} < \bar{X} < \mu + Z_{CRIT}\,\sigma_{\bar{X}}$
- For a 95% CI, $Z_{CRIT} = 1.960$.
44.
- But when we must estimate sigma with s, we must use the t distribution to define critical intervals around mu or muT.
- Here is how we would write the formulae, substituting t for Z and s for sigma:
- $\mathrm{CI}_{.95}: \mu_T \pm t_{CRIT}\,s_{\bar{X}}$, or
- $\mathrm{CI}_{.95}: \mu_T - t_{CRIT}\,s_{\bar{X}} < \bar{X} < \mu_T + t_{CRIT}\,s_{\bar{X}}$
- Notice that the critical value of t that includes 95% of the sample means changes with the number of degrees of freedom for s, our estimate of sigma, and must be taken from the t table.
- If n = 16 in a single sample, $df_W = n - k = 16 - 1 = 15$.
45. t table

df        .05      .01
1      12.706   63.657
2       4.303    9.925
3       3.182    5.841
4       2.776    4.604
5       2.571    4.032
6       2.447    3.707
7       2.365    3.499
8       2.306    3.355
9       2.262    3.250
10      2.228    3.169
11      2.201    3.106
12      2.179    3.055
13      2.160    3.012
14      2.145    2.977
15      2.131    2.947
16      2.120    2.921
17      2.110    2.898
18      2.101    2.878
19      2.093    2.861
20      2.086    2.845
21      2.080    2.831
22      2.074    2.819
23      2.069    2.807
24      2.064    2.797
25      2.060    2.787
26      2.056    2.779
27      2.052    2.771
28      2.048    2.763
29      2.045    2.756
30      2.042    2.750
40      2.021    2.704
60      2.000    2.660
100     1.984    2.626
200     1.972    2.601
500     1.965    2.586
1000    1.962    2.581
2000    1.961    2.578
10000   1.960    2.576
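For anyone who prefers to look values up programmatically, the table can be stored as a small dictionary. This is only a sketch: `T_TABLE` and `t_crit` are made-up names. Falling back to the nearest smaller tabled df when the exact df is not listed is a conservative convention assumed here, not something the slides state. The df = 14, .01 entry is entered as 2.977, the standard tabled value.

```python
# Critical values of t from the table: df -> (.05 column, .01 column).
T_TABLE = {
    1: (12.706, 63.657), 2: (4.303, 9.925), 3: (3.182, 5.841), 4: (2.776, 4.604),
    5: (2.571, 4.032), 6: (2.447, 3.707), 7: (2.365, 3.499), 8: (2.306, 3.355),
    9: (2.262, 3.250), 10: (2.228, 3.169), 11: (2.201, 3.106), 12: (2.179, 3.055),
    13: (2.160, 3.012), 14: (2.145, 2.977), 15: (2.131, 2.947), 16: (2.120, 2.921),
    17: (2.110, 2.898), 18: (2.101, 2.878), 19: (2.093, 2.861), 20: (2.086, 2.845),
    21: (2.080, 2.831), 22: (2.074, 2.819), 23: (2.069, 2.807), 24: (2.064, 2.797),
    25: (2.060, 2.787), 26: (2.056, 2.779), 27: (2.052, 2.771), 28: (2.048, 2.763),
    29: (2.045, 2.756), 30: (2.042, 2.750), 40: (2.021, 2.704), 60: (2.000, 2.660),
    100: (1.984, 2.626), 200: (1.972, 2.601), 500: (1.965, 2.586),
    1000: (1.962, 2.581), 2000: (1.961, 2.578), 10000: (1.960, 2.576),
}

def t_crit(df, alpha=0.05):
    """Look up tCRIT for the given degrees of freedom and alpha (.05 or .01).

    Assumption: if df is not tabled, use the nearest smaller tabled df,
    which gives a slightly larger (more conservative) critical value.
    """
    if alpha not in (0.05, 0.01):
        raise ValueError("only the .05 and .01 columns are tabled")
    if df < 1:
        raise ValueError("df must be at least 1")
    tabled_df = max(d for d in T_TABLE if d <= df)
    column = 0 if alpha == 0.05 else 1
    return T_TABLE[tabled_df][column]

print(t_crit(15))        # df for a single sample of n = 16
print(t_crit(15, 0.01))
```

As df grows, the .05 column approaches 1.960, the Z value used when sigma is known.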
46. So, muT = 98.6, tCRIT = 2.131, s = 1.40, n = 16. Here is the confidence interval:
- $\mathrm{CI}_{.95}: \mu_T \pm t_{CRIT}\,s_{\bar{X}}$
- $= 98.6 \pm (2.131)(1.40/\sqrt{16})$
- $= 98.6 \pm (2.131)(1.40/4)$
- $= 98.6 \pm (2.131)(0.35) = 98.60 \pm 0.75$
- $\mathrm{CI}_{.95}: 97.85 < \bar{X} < 99.35$
- Our sample mean fell outside the CI.95 and falsifies the theory that our drug has no effect on body temperature. Our drug may cause a slight fever.
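The whole test can be collected into one short sketch. The numbers are the ones from the slides (muT = 98.6, tCRIT = 2.131, s = 1.40, n = 16, sample mean 99.5); the function name `t_confidence_interval` is an illustrative assumption.

```python
import math

def t_confidence_interval(mu_t, t_crit, s, n):
    """Return (lower, upper) limits of muT +/- tCRIT * s / sqrt(n)."""
    se = s / math.sqrt(n)      # estimated standard error of the mean
    half_width = t_crit * se
    return mu_t - half_width, mu_t + half_width

lower, upper = t_confidence_interval(mu_t=98.6, t_crit=2.131, s=1.40, n=16)
sample_mean = 99.5

inside = lower < sample_mean < upper
print(round(lower, 2), round(upper, 2))
print("theory supported" if inside else "theory falsified")
```

With these inputs the sample mean lies above the upper limit, so the sketch reaches the same conclusion as the slide.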