Title: Two-sample tests
1 Two-sample tests
2 Binary or categorical outcomes (proportions)
Outcome variable: binary or categorical (e.g., fracture, yes/no)
Are the observations correlated?
- Independent:
  - Chi-square test: compares proportions between two or more groups
  - Relative risks: odds ratios or risk ratios
  - Logistic regression: multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios
- Correlated:
  - McNemar's chi-square test: compares a binary outcome between correlated groups (e.g., before and after)
  - Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
  - GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)
- Alternatives to the chi-square test if cells are sparse:
  - Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells < 5)
  - McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells < 5)
3 Recall: The odds ratio (two samples: cases and controls)
- Interpretation: there is a 2.25-fold higher odds of stroke in smokers vs. non-smokers.
4 Inferences about the odds ratio
- Does the sampling distribution follow a normal distribution?
- What is the standard error?
- 1. In SAS, assume an infinite population of cases and controls with an equal proportion of smokers (exposure), p = .23 (UNDER THE NULL!)
- 2. Use the random binomial function to randomly select n = 50 cases and n = 50 controls, each with a p = .23 chance of being a smoker.
- 3. Calculate the observed odds ratio for the resulting 2x2 table.
- 4. Repeat this 1000 times (or some large number of times).
- 5. Observe the distribution of odds ratios under the null hypothesis.
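The five steps above can be sketched in Python instead of SAS. This is only an illustration: the function name, the seed, and the choice to skip tables with an empty cell (where the odds ratio is undefined) are assumptions, not part of the original slides.

```python
import math
import random
import statistics


def simulate_null_odds_ratios(n_cases=50, n_controls=50, p_exposed=0.23,
                              n_reps=1000, seed=1):
    """Draw 2x2 tables under the null and return the sample odds ratios."""
    rng = random.Random(seed)
    odds_ratios = []
    while len(odds_ratios) < n_reps:
        # Binomial draws: number of exposed (smokers) among cases and controls.
        a = sum(rng.random() < p_exposed for _ in range(n_cases))     # exposed cases
        c = sum(rng.random() < p_exposed for _ in range(n_controls))  # exposed controls
        b, d = n_cases - a, n_controls - c                            # unexposed
        if min(a, b, c, d) == 0:
            continue  # assumption: skip tables where the OR is undefined
        odds_ratios.append((a * d) / (b * c))
    return odds_ratios


ors = simulate_null_odds_ratios()
log_ors = [math.log(x) for x in ors]
mean_or = statistics.mean(ors)
median_or = statistics.median(ors)
sd_log_or = statistics.stdev(log_ors)  # empirical standard error of lnOR

print(f"mean OR = {mean_or:.2f}, median OR = {median_or:.2f}")
print(f"empirical SE of lnOR = {sd_log_or:.2f}")
```

The mean of the simulated odds ratios sits above 1 even though the null is true, which is the right skew in question; the log odds ratios are much closer to symmetric.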
6 Properties of the OR (simulation)
(50 cases/50 controls/23% exposed)
Under the null, this is the expected variability of the sample OR; note the right skew.
7 Properties of the lnOR
8 Properties of the lnOR
From the simulation, we can get the empirical standard error (0.5) and p-value (.10).
9 Properties of the lnOR
10 Inferences about the ln(OR)
11 Confidence interval
Final answer: 2.25 (0.85, 5.92)
12 Practice problem
Suppose the following data were collected in a case-control study of brain tumor and cell phone usage. Is there sufficient evidence for an association between cell phones and brain tumor?
- 1. What is your null hypothesis? Null hypothesis: OR = 1.0; lnOR = 0. Alternative hypothesis: OR ≠ 1.0; lnOR ≠ 0.
- 2. What is your null distribution? lnOR ~ N(0, SD(lnOR)), with SD(lnOR) = .44
- 3. Empirical evidence: OR = (20 × 40)/(60 × 10) = 800/600 = 1.33, so lnOR = .288
- 4. Z = (.288 - 0)/.44 = .65; p-value = P(Z > .65 or Z < -.65) = .26 × 2 = .52
- 5. Not enough evidence to reject the null hypothesis of no association.
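The arithmetic in this practice problem can be checked with a few lines of Python. The cell counts are read off the cross-products 20 × 40 and 60 × 10; which count sits in which cell of the 2x2 table is an assumption, as is the use of the usual large-sample standard error sqrt(1/a + 1/b + 1/c + 1/d), which reproduces the .44 quoted above.

```python
import math

# Assumed layout of the 2x2 table (the slide's table did not survive):
# a, b, c, d are chosen so that (a*d)/(b*c) = (20*40)/(60*10).
a, b, c, d = 20, 60, 10, 40

odds_ratio = (a * d) / (b * c)
log_or = math.log(odds_ratio)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # large-sample SE of lnOR

z = (log_or - 0) / se_log_or
# Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

# 95% confidence limits: exponentiate lnOR +/- 1.96 * SE.
ci_low = math.exp(log_or - 1.96 * se_log_or)
ci_high = math.exp(log_or + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, lnOR = {log_or:.3f}, SE = {se_log_or:.2f}")
print(f"Z = {z:.2f}, two-sided p = {p_two_sided:.2f}")
print(f"95% CI for OR: ({ci_low:.2f}, {ci_high:.2f})")
```

The confidence interval includes 1.0, which agrees with the decision not to reject the null.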
14 Key measures of relative risk: 95% CIs for OR and RR
For an odds ratio, 95% confidence limits:
For a risk ratio, 95% confidence limits:
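The formulas on this slide were lost in transcription. The standard large-sample limits are given here as a stand-in; the notation is an assumption (a and b are the exposed subjects with and without the outcome, c and d the unexposed subjects with and without the outcome) and may differ from the original slide.

```latex
\[
\text{OR: } \exp\!\left( \ln\widehat{OR} \;\pm\; 1.96
  \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \right),
\qquad \widehat{OR} = \frac{ad}{bc}
\]
\[
\text{RR: } \exp\!\left( \ln\widehat{RR} \;\pm\; 1.96
  \sqrt{\frac{1 - a/(a+b)}{a} + \frac{1 - c/(c+d)}{c}} \right),
\qquad \widehat{RR} = \frac{a/(a+b)}{c/(c+d)}
\]
```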
15 Continuous outcome (means)
Outcome variable: continuous (e.g., pain scale, cognitive function)
Are the observations independent or correlated?
- Independent:
  - T-test: compares means between two independent groups
  - ANOVA: compares means between more than two independent groups
  - Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
  - Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes
- Correlated:
  - Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
  - Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
  - Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time
- Alternatives if the normality assumption is violated (and small sample size), i.e., non-parametric statistics:
  - Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
  - Wilcoxon rank-sum test (Mann-Whitney U test): non-parametric alternative to the t-test
  - Kruskal-Wallis test: non-parametric alternative to ANOVA
  - Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
16 The two-sample t-test
17 The two-sample t-test
- Is the difference in means that we observe between two groups more than we'd expect to see based on chance alone?
18 The standard error of the difference of two means
- First add the variances and then take the square root of the sum to get the standard error.
Recall, Var(A - B) = Var(A) + Var(B) if A and B are independent!
19 Shown by simulation
[Figure panels: one sample of 30 (with SD = 5); one sample of 30 (with SD = 5); difference of the two samples.]
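The simulation behind this slide can be imitated with a short Python sketch. It assumes normally distributed observations with mean 0 (the slides give only n = 30 and SD = 5), and the function name, seed, and number of repetitions are illustrative choices.

```python
import math
import random
import statistics


def simulate_mean_differences(n=30, m=30, sd=5.0, n_reps=2000, seed=1):
    """Return simulated differences between two independent sample means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_reps):
        x_bar = statistics.mean([rng.gauss(0.0, sd) for _ in range(n)])
        y_bar = statistics.mean([rng.gauss(0.0, sd) for _ in range(m)])
        diffs.append(x_bar - y_bar)
    return diffs


diffs = simulate_mean_differences()
empirical_se = statistics.stdev(diffs)

# Theory: add the two variances of the means, then take the square root.
theoretical_se = math.sqrt(5.0 ** 2 / 30 + 5.0 ** 2 / 30)

print(f"empirical SE of the difference   = {empirical_se:.2f}")
print(f"theoretical SE of the difference = {theoretical_se:.2f}")
```

The two numbers should agree closely, which is the point of Var(A - B) = Var(A) + Var(B) for independent samples.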
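20 Distribution of differences
- If X and Y are the averages of n and m subjects, their difference X - Y is approximately normal, with standard error equal to the square root of (variance of X + variance of Y).
- As before, you usually have to use the sample SD, since you won't know the true SD ahead of time.
- So, again, it becomes a T-distribution...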
22 Estimated standard error of the difference
23 Case 1: un-pooled variance
Question: What are your degrees of freedom here? Answer: Not obvious!
24 Case 1: t-test, unpooled variances
It is complicated to figure out the degrees of freedom here! A good approximation is given as df ≈ harmonic mean (or SAS will tell you!)
25 Case 2: pooled variance
If you assume that the standard deviation of the characteristic (e.g., IQ) is the same in both groups, you can pool all the data to estimate a common standard deviation. This maximizes your degrees of freedom (and thus your power).
26 Estimated standard error (using pooled variance)
27 Case 2: t-test, pooled variances
28 Alternate calculation formula: t-test, pooled variance
29 Pooled vs. unpooled variance
- Rule of thumb: Use pooled unless you have a reason not to.
- Pooled gives you more degrees of freedom.
- Pooled has an extra assumption: variances are equal between the two groups.
- SAS automatically tests this assumption for you (Equality of Variances test). If p < .05, this suggests unequal variances, and it is better to use the unpooled t-test.
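To make the pooled/unpooled contrast concrete, here is a small Python sketch. The summary statistics are hypothetical (chosen so the two groups have clearly different spreads), and the unpooled degrees of freedom use the Welch-Satterthwaite formula, which is what most software reports; it is swapped in here for the harmonic-mean approximation the slide mentions.

```python
import math


def pooled_se(n1, s1, n2, s2):
    """Standard error and df assuming a common variance in both groups."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(sp2 / n1 + sp2 / n2), n1 + n2 - 2


def unpooled_se(n1, s1, n2, s2):
    """Standard error and Welch-Satterthwaite df without the equal-variance assumption."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return math.sqrt(v1 + v2), df


# Hypothetical summary data: small tight group vs. larger, more variable group.
se_p, df_p = pooled_se(n1=12, s1=4.0, n2=30, s2=10.0)
se_u, df_u = unpooled_se(n1=12, s1=4.0, n2=30, s2=10.0)

print(f"pooled:   SE = {se_p:.3f}, df = {df_p}")
print(f"unpooled: SE = {se_u:.3f}, df = {df_u:.2f}")
```

When the sample sizes and standard deviations are similar the two standard errors nearly coincide; they diverge, as here, when the variances are unequal and the groups are unbalanced.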
30 Example: two-sample t-test
- In 1980, some researchers reported that "men have more mathematical ability than women" as evidenced by the 1979 SATs, where a sample of 30 random male adolescents had a mean score ± 1 standard deviation of 436 ± 77 and 30 random female adolescents scored lower: 416 ± 81 (genders were similar in educational backgrounds, socio-economic status, and age). Do you agree with the authors' conclusions?
31 Data Summary
                  n     Sample Mean    Sample Standard Deviation
Group 1: women    30    416            81
Group 2: men      30    436            77
32 Two-sample t-test
- 1. Define your hypotheses (null, alternative)
- H0: mean math SAT (men) - mean math SAT (women) = 0
- Ha: mean math SAT (men) - mean math SAT (women) ≠ 0 (two-sided)
33 Two-sample t-test
- 2. Specify your null distribution
- F and M have similar standard deviations/variances, so make a pooled estimate of variance.
34 Two-sample t-test
- 3. Observed difference in our experiment = 20 points
35 Two-sample t-test
- 4. Calculate the p-value of what you observed
data _null_;
  pval = (1 - probt(.98, 58)) * 2;  /* two-sided p-value, t = .98 with 58 df */
  put pval;
run;
- 5. Do not reject null! No evidence that men are better in math ;)
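The same test can be reproduced from the summary table in plain Python. This is a sketch, not the SAS procedure: the t-distribution p-value is obtained by numerically integrating the t density with Simpson's rule (an implementation choice made here to avoid external libraries), and all function names are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 2 * (0.5 - area under the density from 0 to |t|)."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * (0.5 - total * h / 3)


# Summary data from the slide: women 416 (SD 81), men 436 (SD 77), n = 30 each.
n_w, mean_w, sd_w = 30, 416.0, 81.0
n_m, mean_m, sd_m = 30, 436.0, 77.0

df = n_w + n_m - 2
pooled_var = ((n_w - 1) * sd_w ** 2 + (n_m - 1) * sd_m ** 2) / df
se_diff = math.sqrt(pooled_var / n_w + pooled_var / n_m)
t_stat = (mean_m - mean_w) / se_diff
p_value = t_two_sided_p(t_stat, df)

print(f"pooled variance = {pooled_var:.0f}, SE = {se_diff:.1f}")
print(f"t = {t_stat:.2f} on {df} df, two-sided p = {p_value:.2f}")
```

A 20-point difference is about one standard error, so the p-value is far above .05.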
36 Example 2: Difference in means
- Example: Rosenthal, R. and Jacobson, L. (1966) Teachers' expectancies: Determinants of pupils' I.Q. gains. Psychological Reports, 19, 115-118.
37 The Experiment (note: exact numbers have been
- Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n = 90).
- Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n = 18).
- BUT: the children on the teachers' lists had actually been randomly assigned to the list.
- At the end of the year, the same I.Q. test was
38 Example 2
- Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group?
- What will we actually compare?
- One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.
                                  Academic bloomers (n = 18)    Controls (n = 72)
Change in IQ score, mean (SD)     12.2 (2.0)                    8.2 (2.0)
12.2 points - 8.2 points: Difference = 4 points
40 What does a 4-point difference mean?
- Before we perform any formal statistical analysis on these data, we already have a lot of information.
- Look at the basic numbers first; THEN consider statistical significance as a secondary guide.
41 Is the association statistically significant?
- This 4-point difference could reflect a true effect or it could be a fluke.
- The question: is a 4-point difference bigger or smaller than the expected sampling variability?
42 Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between "academic bloomers" and normal students (= the difference is 0)
43 Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true
- These predictions can be made by mathematical theory or by computer simulation.
44 Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true: math theory
45 Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true: computer simulation
- In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
- I used computer simulation to take 1000 samples of 18 treated and 72 controls.
46 Computer Simulation Results
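A simulation of this kind might look like the following Python sketch. It assumes both groups are drawn from one normal population with mean change 8.2 and SD 2.0 (the null); the distributional form, seed, and names are assumptions, since the slides do not show the original program.

```python
import math
import random
import statistics


def simulate_null_differences(n_treated=18, n_control=72, mean=8.2, sd=2.0,
                              n_reps=1000, seed=1):
    """Differences in mean IQ change when both groups share one population."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_reps):
        treated = [rng.gauss(mean, sd) for _ in range(n_treated)]
        control = [rng.gauss(mean, sd) for _ in range(n_control)]
        diffs.append(statistics.mean(treated) - statistics.mean(control))
    return diffs


null_diffs = simulate_null_differences()
sd_null = statistics.stdev(null_diffs)

# Math theory for comparison: sqrt(2^2/18 + 2^2/72), roughly .52 to .53.
se_theory = math.sqrt(2.0 ** 2 / 18 + 2.0 ** 2 / 72)

# How many simulated differences are at least as extreme as the observed 4.0?
n_extreme = sum(abs(d) >= 4.0 for d in null_diffs)

print(f"simulated SD of the difference = {sd_null:.2f} (theory: {se_theory:.2f})")
print(f"simulated differences as extreme as 4.0: {n_extreme} of {len(null_diffs)}")
```

Under the null, differences rarely stray beyond about one point in either direction, so an observed difference of 4 points lies far outside the simulated range.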
47 3. Empirical data
- Observed difference in our experiment = 12.2 - 8.2 = 4.0
48 4. P-value
- The t-curve with 88 df's has slightly wider cut-offs for 95% area (t = 1.99) than a normal curve (Z = 1.96)
p-value < .0001
50 5. Reject null!
- Conclusion: I.Q. scores can bias expectancies in the teachers' minds and cause them to unintentionally treat "bright" students differently from those seen as "less bright."
51 Confidence interval (more information!!)
- 95% CI for the difference: 4.0 ± 1.99(.52) = (3.0, 5.0)
52 What if our standard deviation had been higher?
- The standard deviations for change scores in treatment and control were each 2.0. What if change scores had been much more variable, say a standard deviation of 10.0 (for both)?
54 With a std. dev. of 10.0: LESS STATISTICAL POWER!
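The loss of power can be seen by recomputing the standard error and t statistic for both scenarios. This sketch reuses the group sizes (18 and 72), the 4-point difference, and the t cut-off of 1.99 from the earlier slides; the helper name is illustrative.

```python
import math


def se_of_difference(sd, n1=18, n2=72):
    """Standard error of a difference in means when both groups share one SD."""
    return sd * math.sqrt(1 / n1 + 1 / n2)


observed_diff = 4.0
t_critical = 1.99  # 95% cut-off for 88 df, as quoted on the slide

results = {}
for sd in (2.0, 10.0):
    se = se_of_difference(sd)
    t_stat = observed_diff / se
    results[sd] = (se, t_stat)
    verdict = "reject null" if t_stat > t_critical else "do not reject null"
    print(f"SD = {sd:4.1f}: SE = {se:.2f}, t = {t_stat:.2f} -> {verdict}")
```

The same 4-point difference is overwhelming evidence when the SD is 2.0 but falls short of the cut-off when the SD is 10.0.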
55 Don't forget: The paired T-test
- Did the control group in the previous experiment improve at all during the year?
- Do not apply a two-sample t-test to answer this question!
- After - Before yields a single sample of differences
- "within-group" rather than "between-group"
56 Continuous outcome (means): overview table repeated from slide 15
57 Data Summary
                   n     Sample Mean    Sample Standard Deviation
Group 1: Change    72    8.2            2.0
58 Did the control group in the previous experiment improve at all during the year?
p-value < .0001
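A paired t-test is a one-sample t-test on the within-subject differences, so the statistic follows directly from the summary row above. The sketch below only computes the t statistic and degrees of freedom; variable names are illustrative.

```python
import math

# Summary of the control group's one-year change scores (after minus before).
n, mean_change, sd_change = 72, 8.2, 2.0

se_mean = sd_change / math.sqrt(n)      # standard error of the mean change
t_stat = (mean_change - 0.0) / se_mean  # null hypothesis: mean change = 0
df = n - 1

print(f"SE = {se_mean:.3f}, t = {t_stat:.1f} on {df} df")
```

A t statistic this large on 71 df corresponds to a p-value far below .0001, in line with the slide.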
59 Normality assumption of t-test
- If the distribution of the trait is normal, it is fine to use a t-test.
- But if the underlying distribution is not normal and the sample size is small, the Central Limit Theorem takes some time to kick in and you cannot use the t-test (rule of thumb: you need n > 30 per group if the distribution is not too skewed; n > 100 if the distribution is really skewed).
- Note: the t-test is very robust against the normality assumption.
60 Alternative tests when normality is violated: Non-parametric tests
61 Continuous outcome (means): overview table repeated from slide 15
62 Non-parametric tests
- t-tests require your outcome variable to be normally distributed (or close enough), for small samples.
- Non-parametric tests are based on RANKS instead of means and standard deviations (population
63 Example: non-parametric tests
10 dieters following the Atkins diet vs. 10 dieters following Jenny Craig (hypothetical).
RESULTS: The Atkins group loses an average of 34.5 lbs. The J. Craig group loses an average of 18.5 lbs. Conclusion: Atkins is better?
64 Example: non-parametric tests
BUT, take a closer look at the individual data:
Atkins, change in weight (lbs): 4, 3, 0, -3, -4, -5, -11, -14, -15, -300
J. Craig, change in weight (lbs): -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
65 Jenny Craig
[Figures: distribution of weight change in each group; axis label: Weight Change]
67 t-test inappropriate
- Comparing the mean weight loss of the two groups is not appropriate here.
- The distributions do not appear to be normally distributed.
- Moreover, there is an extreme outlier (this outlier influences the mean a great deal).
68 Wilcoxon rank-sum test
- RANK the values, 1 being the least weight loss and 20 being the most weight loss.
- Atkins
- 4, 3, 0, -3, -4, -5, -11, -14, -15, -300
- 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
- J. Craig
- -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
- 7, 8, 10, 13, 14, 15, 16, 17, 18, 19
69 Wilcoxon rank-sum test
- Sum of Atkins' ranks:
- 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
- Sum of Jenny Craig's ranks:
- 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
- Jenny Craig clearly ranked higher!
- P-value (from computer) = .018
For details of the statistical test, see the appendix of these slides.
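The ranking and rank sums can be reproduced in Python from the individual data. The p-value step is a sketch that uses the large-sample normal approximation with a continuity correction; that choice is an assumption about how the "computer" p-value was obtained, so it should land near, not exactly on, the .018 quoted above.

```python
import math

atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

# Rank 1 = least weight loss (largest change value), rank 20 = most weight loss.
pooled = sorted(atkins + jenny_craig, reverse=True)
rank_of = {value: i + 1 for i, value in enumerate(pooled)}  # no ties in these data

sum_atkins = sum(rank_of[v] for v in atkins)
sum_jc = sum(rank_of[v] for v in jenny_craig)

# Normal approximation to the rank-sum distribution under the null.
n1, n2 = len(atkins), len(jenny_craig)
expected = n1 * (n1 + n2 + 1) / 2
sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (abs(sum_atkins - expected) - 0.5) / sd  # 0.5 = continuity correction
p_two_sided = math.erfc(z / math.sqrt(2))

print(f"rank sums: Atkins = {sum_atkins}, Jenny Craig = {sum_jc}")
print(f"z = {z:.2f}, approximate two-sided p = {p_two_sided:.3f}")
```

Because only the ranks enter the calculation, the -300 outlier counts as rank 20 and nothing more, which is why the conclusion reverses relative to the comparison of means.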
70 Binary or categorical outcomes (proportions): overview table repeated from slide 2
71 Difference in proportions (special case of chi-square test)
72 Null distribution of a difference in proportions
73 Null distribution of a difference in proportions
74 Difference in proportions test
Null hypothesis: The difference in proportions is 0.
75 Recall: case-control example
76 Absolute risk: Difference in proportions exposed
77 Difference in proportions exposed
78 Example 2: Difference in proportions
- Research question: Are antidepressants a risk factor for suicide attempts in children and
- Example modified from: "Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults"; Olfson et al. Arch Gen
79 Example 2: Difference in proportions
- Design: Case-control study
- Methods: Researchers used Medicaid records to compare prescription histories between 263 children and teenagers (6-18 years) who had attempted suicide and 1241 controls who had never attempted suicide (all subjects suffered from depression).
- Statistical question: Is a history of use of antidepressants more common among cases than controls?
80 Example 2
- Statistical question: Is a history of use of antidepressants more common among cases than controls?
- What will we actually compare?
- Proportion of cases who used antidepressants in the past vs. proportion of controls who did
                                No. (%) of cases (n = 263)    No. (%) of controls (n = 1241)
Any antidepressant drug ever    120 (46%)                     448 (36%)
82 Is the association statistically significant?
- This 10% difference could reflect a true association or it could be a fluke in this particular sample.
- The question: is 10% bigger or smaller than the expected sampling variability?
83 Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no association between antidepressant use and suicide attempts in the target population (= the difference is 0)
84 Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true
85 Also: Computer Simulation Results
86 Hypothesis Testing
Step 3: Do an experiment
We observed a difference of 10% between cases and controls.
87 Hypothesis Testing
Step 4: Calculate a p-value
88 P-value from our simulation
From our simulation, we estimate the p-value to be 4/1000 or .004
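A simulation along these lines can be sketched in Python. It assumes that, under the null, cases and controls share the overall exposure proportion 568/1504; the seed and names are illustrative, and the simulated p-value will vary a little from run to run, so it should only roughly match the 4/1000 above.

```python
import random
import statistics


def simulate_null_differences(n_cases=263, n_controls=1241, p_common=568 / 1504,
                              n_reps=1000, seed=1):
    """Differences in exposure proportions when cases and controls share one p."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_reps):
        exposed_cases = sum(rng.random() < p_common for _ in range(n_cases))
        exposed_controls = sum(rng.random() < p_common for _ in range(n_controls))
        diffs.append(exposed_cases / n_cases - exposed_controls / n_controls)
    return diffs


observed_diff = 120 / 263 - 448 / 1241  # about 9.5 percentage points
null_diffs = simulate_null_differences()
sd_null = statistics.stdev(null_diffs)

# Two-sided empirical p-value: share of simulated differences at least as extreme.
p_sim = sum(abs(d) >= observed_diff for d in null_diffs) / len(null_diffs)

print(f"observed difference = {observed_diff:.3f}")
print(f"SD of simulated null differences = {sd_null:.3f}")
print(f"simulated two-sided p-value = {p_sim:.3f}")
```

The observed difference is close to three standard deviations of the null distribution, so only a handful of the 1000 simulated studies are as extreme.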
90 Hypothesis Testing
Step 5: Reject or do not reject the null
Here we reject the null. Alternative hypothesis: There is an association between antidepressant use and suicide in the target population.
91 What would a lack of statistical significance
- If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation.
93 With only 50 cases and 50 controls
94 Two-tailed p-value
95 Practice problem
- An August 2003 research article in Developmental and Behavioral Pediatrics reported the following about a sample of UK kids: when given a choice of a non-branded chocolate cereal vs. CoCo Pops, 97% (36) of 37 girls and 71% (27) of 38 boys preferred the CoCo Pops. Is this evidence that girls are more likely to choose brand-named
- 1. Hypotheses
- H0: p(girls) - p(boys) = 0
- Ha: p(girls) - p(boys) ≠ 0 (two-sided)
- 2. Null distribution of difference of two proportions
- 3. Observed difference in our experiment = .97 - .71 = .26
- 4. Calculate the p-value of what you observed
data _null_;
  pval = (1 - probnorm(3.06)) * 2;  /* statement reconstructed; Z = 3.06 is inferred from the printed p-value */
  put pval;
run;
0.0022133699
- 5. The p-value is sufficiently low for us to reject the null; there does appear to be a difference in gender preferences here.
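The practice problem can also be worked from the raw counts with the usual pooled two-proportion Z-test. This is a sketch under that assumption; because it uses the unrounded proportions 36/37 and 27/38 rather than .97 and .71, the p-value comes out slightly smaller than the .0022 printed above.

```python
import math

girls_yes, girls_n = 36, 37
boys_yes, boys_n = 27, 38

p_girls = girls_yes / girls_n
p_boys = boys_yes / boys_n
diff = p_girls - p_boys

# Under the null both groups share one proportion: pool the data to estimate it.
p_pooled = (girls_yes + boys_yes) / (girls_n + boys_n)
se_null = math.sqrt(p_pooled * (1 - p_pooled) * (1 / girls_n + 1 / boys_n))

z = diff / se_null
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

print(f"difference = {diff:.2f}, pooled p = {p_pooled:.2f}, SE = {se_null:.3f}")
print(f"Z = {z:.2f}, two-sided p = {p_two_sided:.4f}")
```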
97 Key two-sample hypothesis tests
- Test for Ho: μx - μy = 0 (σ² unknown, but roughly equal)
- Test for Ho: p1 - p2 = 0
98 Corresponding confidence intervals
- For a difference in means, 2 independent samples (σ²'s unknown but roughly equal)
- For a difference in proportions, 2 independent samples
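The formulas for these two slides did not survive transcription. The standard forms, consistent with the pooled-variance t-test and the pooled difference-in-proportions test used in the examples, are sketched here; the notation (n and m for the two sample sizes in the means case, n1 and n2 in the proportions case) is an assumption.

```latex
\[
T = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_p^2}{n} + \dfrac{s_p^2}{m}}} \sim t_{n+m-2},
\qquad
s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}
\]
\[
Z = \frac{\hat{p}_1 - \hat{p}_2}
         {\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n_1} + \dfrac{\bar{p}(1-\bar{p})}{n_2}}},
\qquad
\bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}
\]
\[
\text{CI for a difference in means: } (\bar{x} - \bar{y}) \pm t_{n+m-2,\,\alpha/2}
  \sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}
\]
\[
\text{CI for a difference in proportions: } (\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2}
  \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
\]
```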
99 Appendix: details of rank-sum test
100 Wilcoxon rank-sum test
- For example, if team 1 and team 2 (two gymnastic teams) are competing, and the judges rank all the individuals in the competition, how can you tell if team 1 has done significantly better than team 2 or vice versa?
- Intuition: under the null hypothesis of no difference between the two groups:
- If n1 = n2, the rank sums T1 and T2 should be equal.
- But if n1 ≠ n2, then T2 (n2 = bigger group) should automatically be bigger. But how much bigger under the null?
- For example, if team 1 has 3 people and team 2 has 10, we could rank all 13 participants from 1 to 13 on individual performance. If team 1 (X) and team 2 don't differ in talent, the ranks ought to be spread evenly among the two groups, e.g.:
- 1 2 X 4 5 6 X 8 9 10 X 12 13 (exactly even distribution if team 1 ranks 3rd, 7th, and 11th)
104 It turns out that, if the null hypothesis is true, the difference between the larger-group sum of ranks and the smaller-group sum of ranks is exactly equal to the difference between T1 and T2
105 From slide 23
From slide 24
Here, under the null: U2 = 55 + 30 - 70 = 15; U1 = 6 + 30 - 21 = 15; U2 + U1 = 30
106 - ⇒ Under the null hypothesis, U1 should equal U2.
The U's should be equal to each other and will equal n1n2/2:
U1 + U2 = n1n2
Under the null hypothesis, U1 = U2 = U0
⇒ E(U1 + U2) = 2E(U0) = n1n2
E(U1 = U2 = U0) = n1n2/2
So, the test statistic here is not quite the difference in the sum-of-ranks of the 2 groups. It's the smaller observed U value: U0.
For small n's, take U0, and get the p-value directly from a U table.
107 For large enough n's (> 10 per group)
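The definitions behind these slides were lost with the formulas. The sketch below states the usual Mann-Whitney definitions, which reproduce the worked numbers in this appendix (for example 6 + 30 - 21 = 15 with n1 = 3, n2 = 10), together with the standard large-sample normal approximation; the exact form on the original slide is an assumption.

```latex
\[
U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1,
\qquad
U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2,
\qquad
U_0 = \min(U_1, U_2)
\]
\[
Z = \frac{U_0 - \dfrac{n_1 n_2}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}
\]
```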
108 Add observed data to the example
- Example: If the girls on the two gymnastics teams were ranked as follows:
- Team 1: 1, 5, 7. Observed T1 = 13
- Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13. Observed T2 = 78
- Are the teams significantly different?
- Total sum of ranks = 13 × 14/2 = 91; n1n2 = 3 × 10 = 30
- Under the null hypothesis: expect U1 - U2 = 0 and U1 + U2 = 30 (each should equal about 15 under the null) and U0 = 15
- U1 = 30 + 6 - 13 = 23
- U2 = 30 + 55 - 78 = 7
- ⇒ U0 = 7
- Not quite statistically significant in the U table: p = .1084 (see attached); × 2 for two-tailed
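For groups this small, the U-table value can be reproduced by brute force: list every way the 13 ranks could be split 3 versus 10 and count how often U is as small as the observed U0. The Python sketch below does that; it assumes no tied ranks, and the function names are illustrative.

```python
from itertools import combinations


def mann_whitney_u(ranks_1, ranks_2):
    """Return (U1, U2) from the two groups' ranks."""
    n1, n2 = len(ranks_1), len(ranks_2)
    u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum(ranks_1)
    u2 = n1 * n2 + n2 * (n2 + 1) // 2 - sum(ranks_2)
    return u1, u2


def exact_one_tailed_p(ranks_1, ranks_2):
    """P(U <= U0) under the null, enumerating all assignments of ranks to group 1."""
    n1, n2 = len(ranks_1), len(ranks_2)
    u0 = min(mann_whitney_u(ranks_1, ranks_2))
    count = total = 0
    for combo in combinations(range(1, n1 + n2 + 1), n1):
        u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum(combo)
        count += u1 <= u0
        total += 1
    return count / total


team_1 = [1, 5, 7]
team_2 = [2, 3, 4, 6, 8, 9, 10, 11, 12, 13]

u1, u2 = mann_whitney_u(team_1, team_2)
p_one_tailed = exact_one_tailed_p(team_1, team_2)

print(f"U1 = {u1}, U2 = {u2}, U0 = {min(u1, u2)}")
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {2 * p_one_tailed:.4f}")
```

Of the 286 possible splits, 31 give a U of 7 or less, which is the .1084 read from the table.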
109 Example problem 2
A study was done to compare the Atkins Diet (low-carb) vs. Jenny Craig (low-cal, low-fat). The following weight changes were obtained; note they are very skewed because someone lost 100 pounds; the mean loss for Atkins is going to look higher because of the bozo, but does that mean the diet is better overall? Conduct a Mann-Whitney U test to compare ranks.
Corresponding ranks (lower is more weight loss!)
Sum of ranks for JC = 25 (n = 5)
Sum of ranks for Atkins = 41 (n = 6)
n1n2 = 5 × 6 = 30
Under the null hypothesis: expect U1 - U2 = 0 and U1 + U2 = 30 and U0 = 15
U1 = 30 + 15 - 25 = 20
U2 = 30 + 21 - 41 = 10
U0 = 10; n1 = 5, n2 = 6
Go to the Mann-Whitney chart: p = .2143 × 2 ≈ .43