Title: Matters arising
1Matters arising
- Summary of last week's lecture
- The exercises
- Your queries
2The Pearson correlation (r)
- The PEARSON CORRELATION is a measure of a
supposed linear association between two
variables.
3Linear, but imperfect association
- If the scatterplot is elliptical in shape, a linear association is indicated.
- In psychology, all measurement is subject to random error.
- No association between measured variables is ever perfect.
- That is why the points do not all lie on a straight line.
4The Pearson correlation
Sum of products
Sums of squares
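The formula itself is not reproduced in the text above; the standard definitional form that these two labels refer to is (a reconstruction, not copied from the slide):

$$ r = \frac{SP_{XY}}{\sqrt{SS_X \, SS_Y}}, \qquad SP_{XY} = \sum (X - \bar{X})(Y - \bar{Y}), \qquad SS_X = \sum (X - \bar{X})^2, \qquad SS_Y = \sum (Y - \bar{Y})^2 $$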
5Explanation
- The numerator of r is known as a SUM OF PRODUCTS (SP).
- It is the sum of products that captures the extent to which X and Y are associated, or CO-VARY.
- The sums of squares in the denominator merely constrain the range of variation of r.
6The sum of products captures covariation
- Points in the upper right quadrant have positive deviation products; points in the lower left also have positive deviation products (a minus times a minus is a plus).
- Points in the other two quadrants have negative products.
- Since the positive products predominate, we can expect the covariance to be very large.
- The negative products are small; those points are near the intersection of the mean lines.
[Scatterplot divided into quadrants by lines drawn at the mean Actual Violence score and the mean Preference score]
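As a small numerical illustration of the quadrant argument above (the numbers are made up, not the Violence data):

```python
import numpy as np

# Hypothetical scores, for illustration only (not the lecture's data).
x = np.array([2, 4, 5, 7, 9], dtype=float)   # e.g. Preference score
y = np.array([1, 3, 6, 6, 10], dtype=float)  # e.g. Actual Violence score

# Deviation products: positive for points in the upper-right and
# lower-left quadrants, negative for points in the other two.
products = (x - x.mean()) * (y - y.mean())

sp = products.sum()   # the sum of products (the numerator of r)
print(products)       # mostly positive here, so SP comes out large and positive
print(sp)
```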
7An elliptical scatterplot
- This is fine.
- The elliptical scatterplot indicates that there
is indeed a basically linear relationship between
variable Y1 and variable X1.
8No association
- There is NO association between Z and Y.
- The high value of r is driven solely by the
presence of a single OUTLIER.
9 Anscombe's rule
- When you examine a scatterplot (something you should ALWAYS do when interpreting a correlation), ask yourself the following question:
- Would the removal of one or two points at random affect the basically elliptical shape of the scatterplot? If the shape would remain essentially the same, the value of r accurately reflects the association between the variables.
10Summary
- The Pearson correlation r is a measure of the strength of a SUPPOSED linear relationship between 2 variables.
- It is one of the most widely used statistical measures, but it is also one of the most misused.
- You should always try to see the scatterplot when interpreting a value of r.
11Exercise
- From the Violence data, obtain a scatterplot and calculate the Pearson correlation.
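The lecture does this in SPSS; for anyone working outside SPSS, a rough equivalent in Python is sketched below. The file name and column names are assumptions, not the actual dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Assumed file and column names, for illustration only.
data = pd.read_csv("violence.csv")
x = data["Exposure"]   # exposure to screen violence
y = data["Violence"]   # actual violence score

r, p = pearsonr(x, y)  # Pearson correlation and its p-value
print(f"r = {r:.2f}, p = {p:.3f}")

plt.scatter(x, y)
plt.xlabel("Exposure")
plt.ylabel("Actual violence")
plt.show()
```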
12Direction of causation
- When we measure two variables and obtain the correlation between them, we nearly always do so because we believe that one variable X causes or influences the other, Y.
- We have measured Exposure X and Violence Y because we have the hypothesis that X causes Y.
13The scatterplot of Y against X
- If we believe that X causes Y, we want to PLOT Y AGAINST X.
- We want a scatterplot with Y on the vertical axis and X on the horizontal axis.
[Scatterplot with labelled data points: Richard, John, Jim]
14Ordering the plot
15The default graph
16The vertical scale
- Notice that the vertical axis begins at 3, rather than at zero.
- I like to see the whole scale on the vertical axis.
- Double-click on the graph to enter the Chart Editor.
- Double-click on the vertical axis to enter a dialog which will enable you to control the amount of the vertical scale that you can see.
17Ordering the full Y scale
Uncheck Auto and enter zero into the Custom slot.
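For readers producing the graph outside SPSS, the same adjustment (anchoring the vertical axis at zero so the whole scale is visible) looks roughly like this in Python with matplotlib; this is a sketch, not part of the SPSS workflow described above:

```python
import matplotlib.pyplot as plt

# Hypothetical values, for illustration only.
fig, ax = plt.subplots()
ax.scatter([1, 2, 3, 4], [3.1, 3.3, 3.2, 3.4])

# Like SPSS, matplotlib trims the axis to the data by default;
# forcing the lower limit to zero shows the entire vertical scale.
ax.set_ylim(bottom=0)
plt.show()
```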
18Final version
19Why do I like to see the entire scale on
the vertical axis?
20Beware!
- Modern computing packages such as SPSS afford a bewildering variety of attractive graphs and displays to help you bring out the most important features of your results. You should certainly use them.
- But there are pitfalls awaiting the unwary.
21Performance profiles
- We often want to see how mean performance varies (or not) over various treatment conditions.
- We might want to compare the performance of participants who have ingested different kinds (or dosages) of drugs with that of a comparison or control group.
- There is a set of methods known as Analysis of Variance (ANOVA) which enable us to do that.
22Ordering a means plot
23A picture of the results
24The picture is false!
- The table of means shows minuscule differences among the five group means!
- The graph suggested that there were vast differences among the means!
25A small scale view
- Only a microscopically small section of the scale is shown on the vertical axis.
- This greatly magnifies even small differences among the group means.
26Putting things right
- Double-click on the image to get into the Chart Editor.
- Double-click on the vertical axis to access the scale specifications.
27Putting things right
- Uncheck the minimum value box and enter zero as the desired minimum point.
- Click Apply.
28The true picture!
29The true picture
- The effect is dramatic.
- The profile now reflects the true situation.
- ALWAYS BE SUSPICIOUS OF GRAPHS THAT DO NOT SHOW
THE COMPLETE VERTICAL SCALE!
30Your queries
- Several of you have e-mailed me asking how you fit a line to a scatterplot.
- Last week, I said that an elliptical scatterplot indicated that the relationship between the variables was basically LINEAR.
- So we want the best-fitting straight line through the points.
- This is known as the REGRESSION LINE.
31Drawing the regression line through the points
Choose Fit Line at Total.
To leave the Chart Editor, choose Close from the
Edit menu or double-click on the Viewer outside
the rectangle around the figure.
32Finding the value of r
33Hypothesis testing
- In HYPOTHESIS TESTING, a proposition known as the NULL HYPOTHESIS (H0) is set up.
- H0 is the NEGATION of your scientific hypothesis.
- So if our scientific hypothesis is that there is an association, H0 says there's NO association.
34The p-value
- To test H0, we gather our data and calculate the value of a TEST STATISTIC.
- If the null hypothesis is true, how probable would a value of our test statistic as extreme as ours have been?
- The answer is given by a probability known as the p-value.
- SPSS calls the p-value the Sig., i.e., the SIGNIFICANCE PROBABILITY.
35A significant result
- A SIGNIFICANCE LEVEL is a small probability accepted by convention as a criterion for a decision about a statistical test.
- Most commonly, the 0.05 significance level is accepted by psychologists.
- If the p-value of your test statistic is LESS than the 0.05 significance level, your result is said to be significant beyond the 0.05 level.
36The result
[SPSS correlation output, with the p-value (Sig.) highlighted]
- Never report a p-value like this! Report the p-value to 2 places of decimals; if it's less than .01, use the inequality sign <.
- Report this result as follows:
- r(27) = 0.89, p < .01
- (27 = the number of pairs; 0.89 = the value of r; .01 = the reported p-value)
37 Lecture 9: MORE ON ASSOCIATION
38 We have shown that there is a strong association between a child's violence and the amount of violent screen material watched ...
39 ... but have we really gathered evidence for the hypothesis that exposure to screen violence promotes actual violence?
40Remember
- CORRELATION
- does not necessarily mean
- CAUSATION
41One causal model
- The hypothesis implies this CAUSAL MODEL.
- The results are CONSISTENT with the hypothesis.
- The correlation may indeed arise because exposure
to violence causes actual violence.
42Another causal model
- The child's tendencies towards, and appetite for, violence lead to his (or her) watching violent programmes as often as possible.
- This model is also consistent with the data.
43A third causal model
- NEITHER variable causes the other.
- Both are determined by the behaviour of the child's parents.
44The choice
- Does exposure cause violence (top model)?
- Does Violence lead to more exposure (middle model)?
- Are both exposure and violence caused by a third, background, variable (bottom model)?
45A background variable
- Perhaps neither Exposure nor Actual violence causes the other.
- Perhaps they are both caused by a background parental behaviour variable.
- We have data on such a variable.
- The background variable correlates highly with both Exposure and Actual violence.
46Partial correlation
- A PARTIAL CORRELATION is what remains of a
Pearson correlation between two variables when
the influence of a third variable has been
removed, or PARTIALLED OUT.
47Three variables
- Let X1, X2 and X3 be three variables.
- Let r12 be the Pearson correlation between X1 and X2.
- Let r12.3 be the partial correlation between X1 and X2 when the covariation of each with X3 has been removed.
48Partial correlation
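The formula on this slide is not reproduced in the text; the standard formula for the partial correlation defined above is:

$$ r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} $$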
49Explanation
The numerator removes the influence of the third variable.
The denominator rescales with the new variances, so that the partial correlation still lies in the range from -1 to +1.
50Obtaining a partial correlation
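The SPSS dialog shown on this slide is not reproduced here. As a rough stand-in, the partial correlation can be computed directly from the three pairwise correlations; the values below are hypothetical, not the lecture's data:

```python
from math import sqrt

def partial_correlation(r12: float, r13: float, r23: float) -> float:
    """Partial correlation of variables 1 and 2 with variable 3 partialled out."""
    return (r12 - r13 * r23) / sqrt((1 - r13**2) * (1 - r23**2))

# Hypothetical pairwise correlations (illustrative only).
r12 = 0.89   # Exposure with Actual violence
r13 = 0.92   # Exposure with the background (parental) variable
r23 = 0.93   # Actual violence with the background variable

# Much smaller than r12 once the background variable is partialled out.
print(round(partial_correlation(r12, r13, r23), 2))
```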
51The partial correlation
- The partial correlation fails to reach significance.
- Now that we have taken the background variable into consideration, we see that there is no significant correlation between Exposure and Actual violence.
- It appears that, of the three possible causal models, the third-party model gives the most convincing account of the data.
52Levels of measurement
- There are three levels:
- 1. The SCALE level. The data are measures on an independent scale with units. Heights, weights, performance scores and IQs are scale data. Each score has stand-alone meaning.
- 2. The ORDINAL level. Data in the form of RANKS (1st, 3rd, 53rd). A rank has meaning only in relation to the other individuals in the sample. A rank does not express, in units, the extent to which a property is possessed.
- 3. The NOMINAL level. Assignments to categories (so-many males, so-many females).
53 3. Nominal data
- NOMINAL data relate to qualitative variables or attributes, such as gender or blood group, and are merely records of CATEGORY MEMBERSHIP.
- Nominal data are merely LABELS: they may take the form of numbers, but such numbers are arbitrary code numbers representing, say, the different blood groups or different nationalities. ANY numbers will do, as long as they are all different.
54A set of nominal data
- A medical researcher wishes to test the hypothesis that people with a certain type of body tissue (Critical) are more likely to show the presence of a potentially harmful antibody.
- Data are obtained on 79 people, who are classified with respect to 2 attributes:
- 1. Tissue Type
- 2. Whether the antibody is present or absent.
55The research question
- Do more of the people in the Critical group have the antibody?
- We are asking whether there is an ASSOCIATION between the variables of category membership (tissue type) and presence/absence of the antibody.
- This is the SCIENTIFIC hypothesis.
56The null hypothesis
- The NULL HYPOTHESIS is the negation of the scientific hypothesis.
- The null hypothesis states that there is NO association between tissue type and presence of the antibody.
57Contingency tables (cross-tabulations)
- When we wish to investigate whether an association exists between qualitative or categorical variables, the starting point is usually a display known as a CONTINGENCY TABLE, whose rows and columns represent the categories of the qualitative variables we are studying.
- Contingency tables are also known as CROSS-TABULATIONS, or CROSSTABS.
58The contingency table
- Is there an association between Tissue Type and Presence of the antibody?
- It looks as if the antibody is indeed more in evidence in the Critical tissue group.
59The null hypothesis
- The null hypothesis is the negation of our scientific hypothesis, namely, the statement that the two variables are INDEPENDENT.
- In other words, any differences in the relative incidence of the antibody in the different tissue groups have resulted from SAMPLING ERROR.
60Expected cell frequencies
- The pattern of the OBSERVED FREQUENCIES (O) would suggest that there is a greater incidence of the antibody in the Critical tissue group.
- But the marginal totals showing the frequencies of the various groups in the sample also vary.
- What cell frequencies would we expect under the independence hypothesis?
61Expected cell frequencies (E)
- According to the null hypothesis, the occurrence of the antibody and membership of a particular tissue type are independent events.
- The probability of the joint occurrence of independent events is the product of their separate probabilities.
- We find the expected frequencies (E) by multiplying together the marginal totals that intersect at the cells concerned and dividing by the total number of observations.
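In symbols, for the cell in row i and column j (standard notation, not taken verbatim from the slide):

$$ E_{ij} = \frac{R_i \times C_j}{N} $$

where R_i and C_j are the intersecting row and column totals and N is the total number of observations.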
62The expected frequencies
- To obtain, say, the value of E for the top left cell, multiply the intersecting marginal totals (36 and 22) and divide by 79 (the total frequency), obtaining
- (36 × 22)/79 = 10.03.
- In the Critical group, there seem to be large differences between O and E: fewer Noes than expected and more Yeses.
63 The chi-square (χ²) statistic
- We need a statistic which compares the differences between the O and E values, so that a large value will cast doubt upon the null hypothesis of independence.
- Such a statistic is CHI-SQUARE (χ²).
64Formula for chi-square
- Each element of chi-square expresses the square of the difference between O and E as a proportion of E.
- Add up these squared differences for all the cells in the contingency table.
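The formula itself does not survive in the text; the standard form it describes is:

$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

with the sum taken over every cell of the contingency table.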
65The value of chi-square
- There are 8 terms in the summation, but only the
first two and the last are shown in the
calculation below.
66Degrees of freedom
- To decide whether a given value of chi-square is significant, we must specify the DEGREES OF FREEDOM, df.
- If a contingency table has R rows and C columns, the degrees of freedom are given by
- df = (R − 1)(C − 1)
- In our example, R = 4, C = 2 and so
- df = (4 − 1)(2 − 1) = 3.
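For readers working outside SPSS, scipy can produce the chi-square value, its degrees of freedom, the p-value and the expected frequencies in one call. The 4 x 2 table below is hypothetical, not the lecture's tissue-type data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 4 x 2 contingency table: tissue type (rows) by antibody absent/present.
observed = np.array([
    [18,  4],
    [15,  6],
    [14,  7],
    [ 6, 16],   # the "Critical" row: more antibody-present cases than the others
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
print(expected)   # the E values under the independence hypothesis
```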
67Significance
- SPSS will tell us that the p-value of a chi-square with a value of 10.655 in the chi-square distribution with three degrees of freedom is .014.
- We should write this result as:
- χ²(3) = 10.66, p = .01.
- Since the result is significant beyond the .05 level, we have evidence against the null hypothesis of independence and evidence for the scientific hypothesis.
68Summary
- This week I extended my discussion of statistical association to the topic of partial correlation.
- A partial correlation can help the researcher to choose from different causal models.
- I also considered the analysis of nominal data in the form of contingency tables.
- The chi-square statistic can be used to test for the presence of an association between qualitative or categorical variables.
69Multiple-choice example
70Multiple-choice example
71Another example