Title: to χ² and beyond
1. to χ² and beyond
- Distribution
- Lots of tests produce statistics that are compared with the χ² distribution
- Goodness of fit, to compare single variables with a distribution
- χ² goodness of fit
- Kolmogorov-Smirnov
- Shapiro-Wilk
- Tests of association between two variables
2. Ratbert correct on 58 of the 100. Is this different from chance?
3. Pearson χ² statistic: a measure of deviation
4.
- Is this high enough to reject the model of no ESP?
- Need to know the residual degrees of freedom.
- Number of non-redundant pieces of information minus the number of pieces of information used in the model.
- Two pieces of information. The model uses one (n, used to calculate expected values), leaving one for the residual.
- The χ² table uses both the deviation value and the degrees of freedom.
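As a rough check of the Ratbert example, the Pearson statistic can be worked out by hand. This sketch assumes that chance means 50 correct out of 100 (the slides do not state the chance rate) and uses the usual form, the sum of (O - E)² / E over the cells. With 1 df the p-value can be taken from the normal distribution, which is what `math.erfc` does here.

```python
import math

# Observed counts from the slide: 58 correct, 42 incorrect.
observed = [58, 42]
# ASSUMPTION: chance performance is 50%, so 50 are expected in each cell.
expected = [50.0, 50.0]

# Pearson chi-square: sum of (O - E)^2 / E over the cells.
pearson_chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With 1 df, chi-square is a squared standard normal, so
# P(chi2 > x) = erfc(sqrt(x / 2)). This shortcut only holds for df = 1.
p_value = math.erfc(math.sqrt(pearson_chi2 / 2))

print(f"chi2(1) = {pearson_chi2:.2f}, p = {p_value:.3f}")
```

Under that assumption the value falls short of the usual .05 cut-off, so the model of no ESP is not rejected.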
5. χ² distribution
6. A Second Equation
- The likelihood ratio χ² statistic
- Plugging in the numbers for this example yields
- Also non-significant
- Lχ² can be partitioned like the sums of squares (SS) in ANOVA and regression models.
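The equation itself did not survive in the transcript. The sketch below uses the standard likelihood ratio form, Lχ² = 2 Σ O ln(O / E), on the 58-of-100 example, again assuming a 50% chance rate (an assumption, not stated in the slides).

```python
import math

# Observed counts: 58 correct, 42 incorrect.
observed = [58, 42]
# ASSUMPTION: 50% chance rate, so 50 expected in each cell.
expected = [50.0, 50.0]

# Likelihood ratio chi-square: 2 * sum of O * ln(O / E).
lr_chi2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

# p-value for 1 df via the normal distribution (valid for df = 1 only).
p_value = math.erfc(math.sqrt(lr_chi2 / 2))

print(f"L-chi2(1) = {lr_chi2:.2f}, p = {p_value:.3f}")
```

The result is close to the Pearson value for the same data, which is typical when the expected counts are large.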
8. http://glass.ed.asu.edu/stats/analysis/
9. The Confidence Interval (Wald)
10.
- The Wilson (1927) approach is thought of as best.
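To compare the two intervals, here is a minimal sketch of the textbook Wald and Wilson formulas for a single proportion. Applying them to the 58-of-100 example is an assumption (the transcript does not show which data the slides used), and the function names are illustrative.

```python
import math

def wald_ci(successes, n, z=1.96):
    """Wald interval: p +/- z * sqrt(p(1-p)/n)."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(successes, n, z=1.96):
    """Wilson (1927) score interval for a proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# ASSUMPTION: reuse the 58-of-100 example from earlier in the deck.
wald = wald_ci(58, 100)
wilson = wilson_ci(58, 100)
print("Wald:  ", [round(x, 3) for x in wald])
print("Wilson:", [round(x, 3) for x in wilson])
```

The Wilson interval is pulled slightly towards 0.5, which is part of why it behaves better than Wald for small n or extreme proportions.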
11. Men Winning More Awards
- A few good men
- Johnson, Carothers & Deary (Nov, PPS) "Role of the X Chromosome"
- psychologicalscience.org/journals/pps/4_6_inpress/Johnson_final.pdf
- Wendy Northcutt's (2000) approach to evolution: survival of the fittest only means that the best genes remain in the pool.
- The pool may still contain bad genes which just don't get passed on to the next gene pool (of course, one genetically damaged plant, if propagated, could destroy a field, or one prolific warmonger).
- 131 men won Darwin Awards, but only 24 women.
- The odds are 131/24 = 5.46!
12. Repeat: Odds
- 131 men won a Darwin Award
- 24 women won one
- Proportion of men winning is 131/155 = 0.85
- The odds are 131/24 = 5.46.
- Odds are important for some statistics.
- In a second, as the odds ratio.
- Next term, as part of regression when the response variable is binary.
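The difference between a proportion and odds is easy to see in a few lines. This sketch only restates the slide's arithmetic; the variable names are illustrative.

```python
# Darwin Award winners from the slide.
men, women = 131, 24

# Proportion: men out of all winners.
proportion_men = men / (men + women)

# Odds: men relative to women.
odds_men = men / women

# The two are linked by odds = p / (1 - p).
odds_from_proportion = proportion_men / (1 - proportion_men)

print(f"proportion = {proportion_men:.2f}, odds = {odds_men:.2f}")
```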
13. Two Binary Variables: Own Race Bias
- A Black confederate approached Black and White participants in South Africa. A few minutes later an RA approached the participant and asked them to identify the confederate from a lineup.
- 17 of the 25 Black participants (68%) correct, odds = 2.125
- 8 of the 25 White participants (32%) correct, odds = 0.471
- The odds ratio (OR) is 2.125/0.471 = 4.52.
- The odds ratio is a common measure of association, like the correlation, but for 2x2 tables.
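The odds and the odds ratio for this 2x2 table can be reproduced directly from the four counts. A small sketch, with illustrative variable names:

```python
# Counts from the slide: correct / incorrect identifications.
black_correct, black_wrong = 17, 8
white_correct, white_wrong = 8, 17

# Odds of a correct identification in each group.
odds_black = black_correct / black_wrong
odds_white = white_correct / white_wrong

# Odds ratio: ratio of the two odds (same as the cross-product ratio).
odds_ratio = odds_black / odds_white
cross_product = (black_correct * white_wrong) / (black_wrong * white_correct)

print(f"odds: {odds_black:.3f} vs {odds_white:.3f}, OR = {odds_ratio:.3f}")
```

Working from the unrounded odds gives 4.516, which rounds to the 4.52 quoted on the slide.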
15.
install.packages("sdtalt")  # one-time install
library(sdtalt)
sdt(17, 8, 8, 17)           # the four cell counts of the 2x2 table
17. Calculating Asymptotic CI (called the Wald CIs)
- 1. Take the ln of the observed OR. Here, ln(4.516) = 1.507.
- 2. Calculate the standard error on the log odds ratio
- 3. Calculate the 95% confidence interval of ln OR
- lb = ln OR - 1.96 se(ln OR) = 1.507 - 1.96(0.606) = 0.319
- ub = ln OR + 1.96 se(ln OR) = 1.507 + 1.96(0.606) = 2.695
- 4. Back-transform these into odds ratios
- exp(0.319) = e^0.319 = 1.376 and exp(2.695) = e^2.695 = 14.80
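The four steps can be sketched in a few lines. The standard error formula is missing from the transcript; this sketch assumes the usual one, sqrt(1/a + 1/b + 1/c + 1/d), which gives the 0.606 used on the slide.

```python
import math

# Cell counts of the 2x2 table: a, b = Black correct/incorrect,
# c, d = White correct/incorrect.
a, b, c, d = 17, 8, 8, 17

# Step 1: natural log of the observed odds ratio.
log_or = math.log((a * d) / (b * c))

# Step 2: standard error of ln OR.
# ASSUMPTION: the usual asymptotic formula (not shown in the transcript).
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Step 3: 95% interval on the log scale.
z = 1.96
lower_log = log_or - z * se_log_or
upper_log = log_or + z * se_log_or

# Step 4: back-transform to the odds ratio scale.
lower_or, upper_or = math.exp(lower_log), math.exp(upper_log)

print(f"ln OR = {log_or:.3f}, se = {se_log_or:.3f}")
print(f"95% CI for OR: {lower_or:.2f} to {upper_or:.2f}")
```

Because nothing is rounded along the way, the upper limit comes out slightly different from the hand calculation.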
18. Or write a function or find one: https://home.comcast.net/lthompson221/RCode.txt
19. Differences due to rounding (these estimates are more accurate)
20. Calculating χ²
- E11 = RT1 × CT1 / n
- 12.5 = (25)(25) / 50
- Eij = RTi × CTj / n
- ln(Eij) = ln(RTi) + ln(CTj) - ln(n)
21. χ²(1) = 6.48, p = .011
Pearson originally got the df wrong: rc - 1 rather than (r-1)(c-1). Fisher corrected him.
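The expected counts and the resulting statistic for the own-race-bias table can be reproduced as follows. This is a sketch with illustrative names; the p-value shortcut via `math.erfc` is only valid because df = 1 here.

```python
import math

# Rows: Black, White participants. Columns: correct, incorrect.
table = [[17, 8], [8, 17]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Independence model: Eij = RTi * CTj / n.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Pearson chi-square over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(table, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (r - 1)(c - 1).
df = (len(table) - 1) * (len(table[0]) - 1)

# p-value for df = 1 only.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"E11 = {expected[0][0]}, chi2({df}) = {chi2:.2f}, p = {p_value:.3f}")
```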
22. Graphing Chi-Square residuals
- When the table is larger than 2x2, the Chi-Square value does not tell you where the effect is.
- Two approaches
- Residual statistics
- Correspondence analysis
23. Two variables, multiple values (based on a UK national survey)
24. Degrees of Freedom
- 16 non-redundant pieces of information
- Equal numbers in all cells (one number)
- Accounting for column totals (3 additional)
- Accounting for column and row totals (3 additional)
- 7 df used in the model
- Leaving 9 df for the residuals. The χ² test is of the residuals, so use this df when looking up the critical value.
25.
- Lχ² = 30.56
- Degrees of freedom: (r-1)(c-1) = 9, or calculate from 16 - 1 - 3 - 3 = 9
- Critical χ² values for df = 9 are 16.92 and 21.67 for α equal to .05 and .01 respectively.
- So, an association has been detected.
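The bookkeeping for the degrees of freedom can be written out as a short sketch. It assumes a 4 x 4 table (inferred from the 16 cells and 9 df on the slides) and takes the critical values straight from the slide rather than computing them.

```python
def residual_df(n_rows, n_cols):
    """Cells minus the pieces of information the independence model uses."""
    cells = n_rows * n_cols
    # One for the overall n, then the extra row and column totals.
    used = 1 + (n_rows - 1) + (n_cols - 1)
    return cells - used

# ASSUMPTION: a 4 x 4 table, inferred from 16 cells and 9 residual df.
df = residual_df(4, 4)

lr_chi2 = 30.56                        # value from the slide
critical = {0.05: 16.92, 0.01: 21.67}  # critical values from the slide, df = 9

for alpha, cutoff in critical.items():
    print(f"alpha = {alpha}: {lr_chi2} > {cutoff} is {lr_chi2 > cutoff}")
```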
28. Pearson residuals: (O-E)/sqrt(E)
29. Association
- Find odds ratio for each 2x2 comparison.
- Useful if both variables are ordinal.
1.13, 1.61, 0.91, 2.53, 0.30, 3.07, 0.39, 1.92, 0.51
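One way to get an odds ratio for each 2x2 comparison is to take every block of adjacent rows and adjacent columns (the local odds ratios), which yields nine values for a 4 x 4 table. The survey counts are not in the transcript, so the table below is hypothetical and the function name is illustrative.

```python
def local_odds_ratios(table):
    """Odds ratio for each 2x2 block of adjacent rows and columns."""
    ratios = []
    for i in range(len(table) - 1):
        row = []
        for j in range(len(table[0]) - 1):
            numerator = table[i][j] * table[i + 1][j + 1]
            denominator = table[i][j + 1] * table[i + 1][j]
            row.append(numerator / denominator)
        ratios.append(row)
    return ratios

# HYPOTHETICAL counts, only to show the mechanics.
demo_table = [[10, 20, 30],
              [20, 20, 20]]
demo_ratios = local_odds_ratios(demo_table)
print(demo_ratios)
```

Values above 1 mean the higher row goes with the higher column in that block; values below 1 mean the opposite.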
30. How big is the association?
- Can look at odds ratios of each 2x2 comparison.
- Cramér's V (and V²)
- SPSS gives lots of effect sizes. Cohen uses
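Cramér's V rescales χ² by the sample size and the smaller table dimension. The sketch below uses the standard formula, V = sqrt(χ² / (n × min(r-1, c-1))), and applies it to the 2x2 own-race-bias result (χ² = 6.48, n = 50) since the survey table's n is not in the transcript.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a chi-square value, sample size and table shape."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Own-race-bias example from earlier in the deck.
v = cramers_v(6.48, 50, 2, 2)
print(f"V = {v:.2f}, V^2 = {v ** 2:.4f}")
```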
34. Where is the association? O-E Residuals
35. Biggest residuals for married, but it is also the largest row total.
36. Standardized or Pearson's Residuals
- Square root of each cell's contribution to the overall χ²
37. Now widowed has the largest, but it has a small row total.
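The two kinds of residual differ only in the scaling by sqrt(E). The marital-status table is not in the transcript, so this sketch uses the 2x2 own-race-bias counts from earlier in the deck; it also checks that the squared Pearson residuals add up to the overall χ².

```python
import math

# Rows: Black, White participants. Columns: correct, incorrect.
table = [[17, 8], [8, 17]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Raw residuals: O - E (these grow with the size of the row and column).
raw = [[o - e for o, e in zip(obs_row, exp_row)]
       for obs_row, exp_row in zip(table, expected)]

# Pearson residuals: (O - E) / sqrt(E).
pearson = [[(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
           for obs_row, exp_row in zip(table, expected)]

# Each squared Pearson residual is that cell's contribution to chi-square.
chi2_from_residuals = sum(r ** 2 for row in pearson for r in row)

print(raw)
print([[round(r, 3) for r in row] for row in pearson])
print(round(chi2_from_residuals, 2))
```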
38. Correspondence Analysis (briefly and without math)
- Partitions the residual χ² left over after the no association model
- If there are r rows and c columns, the program can use up to either (r-1) or (c-1) dimensions, whichever is smaller.
- Important: think about the size of association
- Analyze/Data Reduction/Correspondence Analysis in SPSS
- with corresp in MASS package for R (lots of others, too)
42. χ² = 30 (is social class ordinal?)
45. Lots of output; minimum here
15 for 1st dimension, 9 for 2nd, and 1.5 for 3rd
46. Does for class also.
47. Schuman & Scott (1989): Events by Age
48. Three Variables
- There are techniques designed for multiple variables.
- Can use CA.
- Compute new variable
- Compute newvar = age + 6 * gender
- age: 1 2 3 4 5 6; gender: 1 2
- newvar: 7 8 9 10 11 12 13 14 15 16 17 18
- CA comes to mind with newvar
- Biplot and drawing lines
49. Let me repeat that
- Compute new variable
- Compute newvar = age + 6 * gender
- age: 1 2 3 4 5 6; gender: 1 2
- newvar: 7 8 9 10 11 12 13 14 15 16 17 18
Make sure it works
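The compute line appears to have lost its operators in the transcript. Reading it as newvar = age + 6 * gender, with age coded 1 to 6 and gender coded 1 to 2 (both assumptions), each age-by-gender combination gets its own code from 7 to 18. A quick sketch to "make sure it works":

```python
def combine(age, gender, n_age_groups=6):
    """ASSUMED recoding: newvar = age + 6 * gender."""
    return age + n_age_groups * gender

# Every combination of age group (1-6) and gender (1-2).
codes = {(age, gender): combine(age, gender)
         for gender in (1, 2)
         for age in range(1, 7)}

for (age, gender), newvar in sorted(codes.items(), key=lambda kv: kv[1]):
    print(f"age={age} gender={gender} -> newvar={newvar}")

# Each combination must map to a different code.
all_unique = len(set(codes.values())) == len(codes)
print("all codes unique:", all_unique)
```

The combined variable can then be crossed with the third variable, so that an ordinary two-way correspondence analysis covers all three.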
52. Heath et al. on social class and voting
53. Summary
- Finding a significant Chi-square does not tell you where the effect lies.
- Graph different residuals and look at correspondence analysis
54. Square Tables (if time allows)
- The rows and columns have the same values.
- Inter-rater reliability
- Two similar measures (two tests, two trials)
- Before-After studies
- Matched participants (e.g., fathers/sons political party preference or social class)
- Can have more than 2 variables, but complexity increases.
55. Special Models of Interest
- Equi-Probable (every cell equally likely)
- Independence (like what we've been doing)
- Quasi-independence
- Symmetry
- Quasi-symmetry
56. Square Tables: Inter-rater reliability
- Suppose a researcher was interested in the reliability of exam markers.
- Suppose there are first and second exam markers of a large number of exercises.
- Questions: Are they reliable? Is one harsher than the other? Are non-agreements random?
57. Contingency Table
Marker 1
58.
.067, 6.25, 2.40, 1.00, 0.40, 2.50, 16.67, 1.00, 0.12
59. Equi-Probable w/ Diagonal: Eij = n/16
Marker 1
16 cells. 1 used, 15 left. X²(15) = 1187
60. Equi-Probable w/o Diagonal: Eij = n/12
Marker 1
12 cells. 1 used, 11 left. X²(11) = 660
Can also fit the diagonal, using those 4 df.
61. Independence w/o Diagonal: Eij = RTi × CTj / n
Marker 1
12 cells. 7 used, 5 left. X²(5) = 58, p < .001
62. Symmetry w/o Diagonal: Eij = Eji
Marker 1
12 cells. 6 used, 6 left. X²(6) = 149
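Under the symmetry model each off-diagonal cell and its mirror image share one expected value, their average, so a k x k table uses k(k-1)/2 parameters and leaves the same number of df (6 for a 4 x 4 table). The marker data are not in the transcript, so the table below is hypothetical and the function name is illustrative.

```python
def symmetry_fit(table):
    """Pearson X2 and df for the symmetry model, ignoring the diagonal.

    Assumes every off-diagonal pair has at least one non-zero count.
    """
    k = len(table)
    x2 = 0.0
    for i in range(k):
        for j in range(k):
            if i == j:
                continue  # diagonal cells are left out of the fit
            # Expected value: average of the cell and its mirror image.
            expected = (table[i][j] + table[j][i]) / 2
            x2 += (table[i][j] - expected) ** 2 / expected
    df = k * (k - 1) // 2
    return x2, df

# HYPOTHETICAL 3 x 3 table of counts, only to show the mechanics.
demo_table = [[20, 10, 5],
              [6, 30, 8],
              [1, 4, 25]]
x2, df = symmetry_fit(demo_table)
print(f"X2({df}) = {x2:.2f}")
```

A large X² here means the disagreements are lopsided, for example one marker giving the higher grade more often than the other.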
63. Quasi-Symmetry w/o Diagonal: Eij = Eji, but taking into account marginals
Marker 1
12 cells. 9 used, 3 left. X²(3) = 9.68 (p = .02)
64. Summary of Models
- Equi-probable with all data: X²(15) = 1187
- Equi-probable w/o diagonal: X²(11) = 660
- Add marginals (independence): X²(5) = 58
- Symmetry: X²(6) = 149
- Quasi-symmetry: X²(3) = 10
- Still significant, but small enough to accept
- Shows marginals are different. Marker 1 is tougher.
65. Quasi-Symmetry fits pretty well: residuals and standardized residuals
Marker 1
12 cells. 9 used, 3 left. X²(3) = 9.68 (p = .02)
Marker 2 gives As to many that Marker 1 thinks are poor.
66. Taking into account the ordinality
- Assume independence as the baseline.
- Include additional parameters to account for the association.
- Linear by linear
- RC model (Correspondence Analysis)
67. Independence: Eij = RTi × CTj / n
Marker 1
16 cells. 7 used, 9 left. X²(9) = 605
68. Linear by Linear Term
- Eij = RTi × CTj / n
- ln Eij = ln RTi + ln CTj - ln n
- It is significant, but with 9 df we have not located where the association is.
- If scores are interval, then including a term marker1*marker2 tests for linear association.
- In SPSS, put the interaction in as a covariate and in the model (with the two main effects).
69. Linear by Linear Model
Marker 1
16 cells. 8 used, 8 left. X²(8) = 127
70. Association Models
- Sometimes called the uniform association model, as all the local odds ratios are the same.
- Assume interval row and column values
- R models relax this for rows
- C models relax this for columns
- R+C models for both
- RC(M) models and correspondence analysis
71. Journal
- Be prepared to share your thoughts about your presentation at the next lecture.
- First Steps 10.2, 10.3