Title: Canonical Correlation
1Canonical Correlation
- Equation is big brother to little r and multiple regression
- little r: y = x
- regression: y = x1 + x2 + x3
- canonical correlation (Rc): y1 + y2 + y3 = x1 + x2 + x3
- Analyzing the relation between 2 sets of variables
- generally no IVs or DVs
- e.g., facets of Neuroticism (N) and health
- set 1: N = anxiety, vulnerability, worry
- set 2: Health = depression, well-being, physical
- variables on both sides are combined in an
optimal way to maximize the relationships
between the 2 sides
2The Model
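[Figure: path diagram. Anxiety, vulnerable, and worry feed the N variate; depression, well-being, and physical feed the Health variate; the two variates are linked by Rc.]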
3Canonical Correlation
- linear combination of variables on each side creates a new variable
- referred to as a canonical variate (CV) or a synthetic variable
- each CV represents a dimension
- e.g., N on one side and health on the other
- examine the correlation between CVs
- CVs come in pairs, of which you can have multiple
- however, individual variables may differentially
contribute to an individual CV
4Things we can do with this technique
- Number of CV pairs
- can have zero if variables are unrelated
- typically have at least one, but you can have multiple
- first CV pair is the most reliable
- i.e., maximizes the correlation between a CV pair
- e.g.,
- 1 CV pair → N (for set 1) and Health (for set 2)
- 2 CV pairs →
- pair 1: Emotional N and Psychological Health
- pair 2: Cognitive N and Physical Health
5Things we can do with this technique
- Interpretation of CVs
- what is their meaning?
- Importance of CVs
- how strongly are variables on one side related to each other
- how strongly are variables related to variables on the other side
- Canonical variate scores
6The Process
- First evaluate R11, R22, R12 (R21)
                         1     2     3     4     5     6
1. anxiety (1)          ---   .70   .50   .30  -.35  -.20
2. vulnerability (1)          ---   .40   .20  -.25  -.30
3. worry (1)                        ---   .40  -.35  -.35
4. depression (2)                         ---  -.75  -.50
5. well-being (2)                               ---   .45
6. physical (2)                                       ---
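The partitioning step can be sketched in Python with NumPy (an assumption; the slides themselves refer to SPSS). The matrix is the one above with its upper triangle mirrored, R11 and R22 are the within-set blocks, and R12 is the between-set block. The function name `canonical_correlations` is illustrative. The canonical correlations are taken here as the singular values of the whitened between-set block, which is one standard route and not necessarily what any given package does internally.

```python
import numpy as np

# Correlation matrix from the slide (upper triangle mirrored).
# Order: anxiety, vulnerability, worry (set 1);
#        depression, well-being, physical (set 2).
R = np.array([
    [1.00, 0.70, 0.50, 0.30, -0.35, -0.20],
    [0.70, 1.00, 0.40, 0.20, -0.25, -0.30],
    [0.50, 0.40, 1.00, 0.40, -0.35, -0.35],
    [0.30, 0.20, 0.40, 1.00, -0.75, -0.50],
    [-0.35, -0.25, -0.35, -0.75, 1.00, 0.45],
    [-0.20, -0.30, -0.35, -0.50, 0.45, 1.00],
])

def canonical_correlations(R, p):
    """Canonical correlations between the first p variables and the rest."""
    R11, R22, R12 = R[:p, :p], R[p:, p:], R[:p, p:]
    L1 = np.linalg.cholesky(R11)      # R11 = L1 @ L1.T
    L2 = np.linalg.cholesky(R22)      # R22 = L2 @ L2.T
    A = np.linalg.solve(L1, R12)      # L1^-1 R12
    K = np.linalg.solve(L2, A.T).T    # L1^-1 R12 L2^-T
    # Singular values of K are the canonical correlations, largest first.
    return np.linalg.svd(K, compute_uv=False)

rc = canonical_correlations(R, p=3)
print(np.round(rc, 3))       # Rc for each CV pair
print(np.round(rc ** 2, 3))  # Rc squared: overlapping variance per pair
```

With 3 variables in each set, three values come back, one per possible CV pair. Singular values are non-negative, so the sign of a reported Rc (which depends on how the variates are oriented) is not recovered by this sketch.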
7The Process
- From these matrices, linear combinations of variables are formed
- reducing the variable sets into CVs
- these CVs maximize Rc, using canonical weights (we will talk about these shortly)
- squaring Rc gives a familiar index (yes?)
- SPSS calls this the Squared Correlation
- indication of overlapping variance in a CV pair
- do you smell an index of effect size here?!
8The Process
- Wilks' Lambda (λ) or Bartlett's Test is used to determine if Rc is statistically significant
- both test statistics are approximately distributed as a χ²
- df
- first CV pair = (# vars. in set 1)(# vars. in set 2)
- second CV pair = (# vars. in set 1 - 1)(# vars. in set 2 - 1)
- Wilks' λ = error variance / total variance
- for now
- low λ values are good... why???
- η² = 1 - λ → is another measure of effect size
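A small sketch of these quantities in plain Python. It assumes the standard relation that Wilks' lambda is the product of (1 - Rc²) over the CV pairs being tested, so for a single pair lambda = 1 - Rc² and eta squared equals Rc². The function names are illustrative; Rc = -.60 is the value used later in these slides.

```python
import math

def wilks_lambda(rcs):
    """Wilks' lambda for a set of canonical correlations: product of (1 - Rc^2)."""
    return math.prod(1.0 - rc ** 2 for rc in rcs)

def df_for_pair(k, p, q):
    """df for the test that starts at the k-th CV pair (k = 1 is the first)."""
    return (p - k + 1) * (q - k + 1)

# Single CV pair with Rc = -.60
lam = wilks_lambda([-0.60])
eta_sq = 1.0 - lam
print(round(lam, 2), round(eta_sq, 2))             # 0.64 0.36
print(df_for_pair(1, 3, 3), df_for_pair(2, 3, 3))  # 9 4
```

A low lambda means little error variance is left over, which is why low values are good.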
9Overall Significance Tests
- Statistically determine how many CVs
- # possible CV pairs = # variables in smallest variable set
- first, the strongest CV is tested (highest Rc)
- if significant, at least first CV pair is significant
- if not significant, your linear combos are bad
- second test
- removes first CV pair, conducted on residual correlation matrices
- determines if a second CV pair is significant
- orthogonal to first CV pair
- if significant, the second CV pair adds something unique
- if not significant, only interpret the first CV
10Interpreting Overall Tests and Relations between CVs and individual variables
- Technically what the statistical tests are doing
- the first test is actually testing if "all" CV pairs explain significant variance
- e.g., CV pairs 1, 2, and 3
- the second test is actually testing if "all but the first" CV pairs explain significant variance
- and similarly for the third test
- if the first test is significant but the second test is not, we infer that only the 1st CV pair is important
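The "all", then "all but the first" logic can be sketched as a loop that peels off one CV pair at a time. This is a minimal sketch under stated assumptions: it uses the commonly cited Bartlett approximation, chi-square = -[N - 1 - (p + q + 1)/2] ln(lambda), with lambda computed over the remaining pairs; the Rc values and N = 200 are hypothetical; no p-values are computed because that would need a chi-square distribution function.

```python
import math

def sequential_tests(rcs, n, p, q):
    """Bartlett chi-square for 'all pairs', 'all but the first', and so on."""
    rcs = sorted(rcs, key=abs, reverse=True)   # strongest pair first
    scale = n - 1 - (p + q + 1) / 2
    results = []
    for k in range(len(rcs)):
        # Wilks' lambda over pairs k+1 .. last (pairs 1 .. k removed)
        lam = math.prod(1.0 - rc ** 2 for rc in rcs[k:])
        chi2 = -scale * math.log(lam)
        df = (p - k) * (q - k)
        results.append((k + 1, lam, chi2, df))
    return results

# Hypothetical values: three CV pairs, N = 200 cases, 3 variables per set
results = sequential_tests([0.60, 0.30, 0.10], n=200, p=3, q=3)
for first_pair, lam, chi2, df in results:
    print(f"pairs {first_pair}+: lambda={lam:.3f}, chi2={chi2:.2f}, df={df}")
```

Each chi-square would then be compared with the critical value for its df; the first non-significant row marks where interpretation stops.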
11Interpreting Overall Tests and Relations between CVs and individual variables
- Look at Rc, Rc², and η²
- Rc ≥ .30: practical significance is met
- Rc² and η² then account for at least 9% of the variance
- Now we can interpret the relations between
- individual variables and the CV pair
- variance accounted for by each CV and its own set
- variance accounted for by each CV and the other
set
12Relations between CVs and individual variables
- Canonical coefficients (or weights)
- unique contribution of each variable to its CV
- can be either raw or standardized
- e.g., canonical weight matrix for first CV pair
Variables        N     Health
Anxiety         .50
Vulnerable      .30
Worry           .05
Depression             -.32
Well-Being              .31
Physical                .05
13Relations between CVs and individual variables
- Correlations between variables and CV pairs are called loadings
- e.g., loading or structure matrix for 2 hypothetical CV pairs
Variables        1      2
For N
Anxiety         .85    .05
Vulnerable      .70    .25
Worry           .25    .80
For Health
Depression     -.90    .02
Well-Being      .80    .23
Physical        .20   -.50
14Relations between CVs and variables for 1st CV
pair
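[Figure: path diagram for the 1st CV pair. Loadings on N: anxiety .85, vulnerable .70, worry .25. Loadings on Health: depression -.90, well-being .80, physical .20. Rc between N and Health: -.60.]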
15Relations between CVs and individual variables
- Canonical Adequacy Coefficient (CAC)
- proportion of variance extracted by CV in intradomain variables
- i.e., for own set of variables (same-set)
- CAC = Σ (loadings² / # variables in set)
- CV (N) = (.85² + .70² + .25²) / 3 = .42
- CV (Health) = ((-.90)² + .80² + .20²) / 3 = .50
- Are we happy with this?
- FYI: Thompson hates this index
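The adequacy arithmetic can be checked with a few lines of Python. The function name `adequacy` is illustrative; the loadings are those of the first CV pair in the structure matrix.

```python
def adequacy(loadings):
    """Mean squared loading of a set's variables on their own CV."""
    return sum(x ** 2 for x in loadings) / len(loadings)

n_loadings = [0.85, 0.70, 0.25]        # anxiety, vulnerable, worry
health_loadings = [-0.90, 0.80, 0.20]  # depression, well-being, physical

print(f"{adequacy(n_loadings):.3f}")       # 0.425
print(f"{adequacy(health_loadings):.3f}")  # 0.497
```

To two decimals these are the .42 and .50 reported on the slide (.425 sits on the rounding boundary).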
16Relations between CVs and individual variables
- Redundancy (Red)
- proportion of variance extracted by CV for other-set of variables
- Red = Σ (loadings² / # variables in own set)(Rc²)
- remember Rc² = (-.60)² = .36
- CV (Health) = ((.85² + .70² + .25²) / 3)(.36) = .15
- CV (N) = (((-.90)² + .80² + .20²) / 3)(.36) = .18
- Are we happy with this?
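The redundancy figures follow the same pattern: the mean squared loading of a set, scaled by Rc². A short sketch (the function name `redundancy` is illustrative):

```python
def redundancy(loadings, rc):
    """Mean squared loading of a set, multiplied by the squared canonical correlation."""
    mean_sq_loading = sum(x ** 2 for x in loadings) / len(loadings)
    return mean_sq_loading * rc ** 2

rc = -0.60  # canonical correlation for the first CV pair

# N variables (anxiety, vulnerable, worry) explained by the Health CV
print(f"{redundancy([0.85, 0.70, 0.25], rc):.3f}")   # 0.153
# Health variables (depression, well-being, physical) explained by the N CV
print(f"{redundancy([-0.90, 0.80, 0.20], rc):.3f}")  # 0.179
```

Rounded to two decimals these match the .15 and .18 on the slide.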
17Summary of 1-CV Pair Solution
- We found that one canonical variate pair explained the data
- first CV represents N, the second CV represents Health
- The canonical correlation (-.60) and overall variance accounted for (.36) were fairly strong
- At the variable level
- anxiety and vulnerability loaded on N
- depression and well-being loaded on health
- Variables accounted for an appreciable amount of
variance in their own CV and the other CV
18Practical Issues
- Normality, linearity, multicollinearity/singularity are key
- because correlational data is what we have!
- Be aware of sample size issues
- want 15-20 cases per variable
- can have fewer if your variables are highly reliable
- e.g., reliability of .80
- if lacking
- bootstrap canonical correlational analysis
- e.g., CANSTRAP programs
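A rough idea of what bootstrapping a canonical correlation involves, sketched in Python with NumPy. This is not the CANSTRAP procedure, only a generic percentile bootstrap of the first Rc; the data are simulated, and the function names, sample size (120), and number of resamples (500) are all assumptions made for illustration.

```python
import numpy as np

def first_rc(X, Y):
    """Largest canonical correlation between the columns of X and Y."""
    # Orthonormal bases for the centered column spaces; the singular values
    # of Qx.T @ Qy are the canonical correlations.
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(s[0], 0.0, 1.0))

def bootstrap_rc(X, Y, n_boot=500, seed=0):
    """Percentile bootstrap interval for the first canonical correlation."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample cases with replacement
        stats[b] = first_rc(X[idx], Y[idx])
    return np.percentile(stats, [2.5, 97.5])

# Simulated data (not from the slides): one shared factor drives both sets
rng = np.random.default_rng(1)
f = rng.normal(size=(120, 1))
X = 0.7 * f + rng.normal(size=(120, 3))
Y = -0.7 * f + rng.normal(size=(120, 3))

lo, hi = bootstrap_rc(X, Y)
print(f"Rc1 = {first_rc(X, Y):.2f}, 95% bootstrap interval = [{lo:.2f}, {hi:.2f}]")
```

A wide interval is a sign that the solution may not replicate at the available sample size.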
19Practical Issues continued
- Calculate canonical variate scores
- represent scores on the synthetic variable
- 1. multiply an individual's standardized score for a variable by its canonical weight
- 2. then sum across all of these products
- using the canonical weight matrix (slide 12)
Variables        N     z-score
Anxiety         .50    2.50
Vulnerability   .30    2.00
Worry           .05    1.50
CV score for N = (.50)(2.50) + (.30)(2) + (.05)(1.5) = 1.93
This variable, then, can be used in other analyses!
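The two steps amount to a weighted sum, sketched here in Python with the weights and z-scores from the example above (the function name `cv_score` is illustrative):

```python
weights = {"anxiety": 0.50, "vulnerability": 0.30, "worry": 0.05}
z_scores = {"anxiety": 2.50, "vulnerability": 2.00, "worry": 1.50}

def cv_score(weights, z_scores):
    """Sum of (canonical weight x standardized score) across a set's variables."""
    return sum(weights[name] * z_scores[name] for name in weights)

print(f"{cv_score(weights, z_scores):.3f}")  # 1.925
```

The slide reports this value rounded to 1.93. Repeating the calculation for every case gives a new column of CV scores.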
20Practical Issues Continued
- Types of variables
- generally continuous
- but you can use binary
- Other follow-up analyses
- regression
- regress each individual variable from one set on all variables from the other set simultaneously
- e.g., anxiety ON depression, well-being, physical health
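For the anxiety example, the follow-up regression can be sketched directly from correlations, assuming NumPy and using the values in the correlation matrix from slide 6. With standardized variables the regression weights solve R22 · beta = r, and R² is r · beta. The variable names are illustrative.

```python
import numpy as np

# Correlations among the health variables (depression, well-being, physical)
R22 = np.array([
    [1.00, -0.75, -0.50],
    [-0.75, 1.00, 0.45],
    [-0.50, 0.45, 1.00],
])
# Correlations of anxiety with depression, well-being, physical
r_anx = np.array([0.30, -0.35, -0.20])

beta = np.linalg.solve(R22, r_anx)   # standardized regression weights
r_squared = float(r_anx @ beta)      # proportion of anxiety variance explained

print(np.round(beta, 3), round(r_squared, 3))
```

The same calculation would be repeated for each remaining variable, swapping in its own row of between-set correlations.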