1
Two-sample tests
2
Binary or categorical outcomes (proportions)
Outcome: binary or categorical (e.g., fracture yes/no)
  Independent observations: Chi-square test (compares proportions between two or more groups); Relative risks (odds ratios or risk ratios); Logistic regression (multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios)
  Correlated observations: McNemar's chi-square test (compares a binary outcome between correlated groups, e.g., before and after); Conditional logistic regression (multivariate regression technique for a binary outcome when groups are correlated, e.g., matched data); GEE modeling (multivariate regression technique for a binary outcome when groups are correlated, e.g., repeated measures)
  Alternatives to the chi-square test if cells are sparse (some cells <5): Fisher's exact test (independent groups); McNemar's exact test (correlated groups)
3
Recall: The odds ratio (two samples: cases and
controls)
  • Interpretation: there is a 2.25-fold higher odds
    of stroke in smokers vs. non-smokers.

4
Inferences about the odds ratio
  • Does the sampling distribution follow a normal
    distribution?
  • What is the standard error?

5
Simulation (see the SAS sketch below)
  • 1. In SAS, assume an infinite population of cases
    and controls with an equal proportion of smokers
    (exposure), p = .23 (UNDER THE NULL!)
  • 2. Use the random binomial function to randomly
    select n = 50 cases and n = 50 controls, each with
    a p = .23 chance of being a smoker.
  • 3. Calculate the observed odds ratio for the
    resulting 2x2 table.
  • 4. Repeat this 1000 times (or some large number
    of times).
  • 5. Observe the distribution of odds ratios under
    the null hypothesis.
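A minimal SAS sketch of the five steps above (the seed, dataset name, and the choice to skip tables with an empty cell are illustrative assumptions, not the original program):

data or_sim;
  call streaminit(20231);                 * arbitrary seed (assumption);
  do sim = 1 to 1000;                     * step 4: repeat 1000 times;
    a = rand("BINOMIAL", 0.23, 50);       * step 2: smokers among 50 cases;
    c = rand("BINOMIAL", 0.23, 50);       * step 2: smokers among 50 controls;
    b = 50 - a;                           * non-smoking cases;
    d = 50 - c;                           * non-smoking controls;
    if a > 0 and b > 0 and c > 0 and d > 0 then do;  * skip undefined ORs;
      oddsratio = (a*d) / (b*c);          * step 3: observed odds ratio;
      ln_or = log(oddsratio);
      output;
    end;
  end;
run;

proc univariate data=or_sim;              * step 5: look at the null distributions;
  var oddsratio ln_or;
  histogram oddsratio ln_or;
run;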

6
Properties of the OR (simulation)
(50 cases / 50 controls / 23% exposed)
Under the null, this is the expected variability
of the sample OR (note the right skew).
7
Properties of the lnOR
Normal!
8
Properties of the lnOR
From the simulation, can get the empirical
standard error (0.5) and p-value (.10)
9
Properties of the lnOR
10
Inferences about the ln(OR)
p = .10
11
Confidence interval
Final answer: 2.25 (0.85, 5.92)
12
Practice problem
Suppose the following data were collected in a
case-control study of brain tumor and cell phone
usage  
    Is there sufficient evidence for an
association between cell phones and brain tumor?
13
Answer
1. What is your null hypothesis?
   Null hypothesis: OR = 1.0 (lnOR = 0)
   Alternative hypothesis: OR ≠ 1.0 (lnOR ≠ 0)
2. What is your null distribution?
   lnOR ~ N(0, SE), with SE(lnOR) = sqrt(1/a + 1/b + 1/c + 1/d) = .44
3. Empirical evidence: OR = (20 x 40)/(60 x 10) = 800/600
   = 1.33, so lnOR = .288
4. Z = (.288 - 0)/.44 = .65; p-value = P(Z > .65 or Z < -.65)
   = .26 x 2 = .52
5. Not enough evidence to reject the null hypothesis
   of no association (see the SAS sketch below).
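A SAS sketch of the calculation above (the four cell counts, 20, 60, 10, and 40, are read off the slide's arithmetic (20 x 40)/(60 x 10) and are an assumption about the original 2x2 table):

data _null_;
  a = 20; b = 60; c = 10; d = 40;       * reconstructed cell counts (assumption);
  oddsratio = (a*d) / (b*c);            * 1.33;
  ln_or = log(oddsratio);               * 0.288;
  se = sqrt(1/a + 1/b + 1/c + 1/d);     * 0.44;
  z = ln_or / se;                       * 0.65;
  pval = 2*(1 - probnorm(abs(z)));      * two-sided p, about .5;
  put oddsratio= ln_or= se= z= pval=;
run;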
14
Key measures of relative risk: 95% CIs for the OR and RR
For an odds ratio, 95% confidence limits:
For a risk ratio, 95% confidence limits:
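The slide's formulas are images and did not survive the transcript; the standard large-sample limits are:

\text{OR: } \exp\!\left(\ln(\widehat{OR}) \pm 1.96\sqrt{\tfrac{1}{a}+\tfrac{1}{b}+\tfrac{1}{c}+\tfrac{1}{d}}\right)
\qquad
\text{RR: } \exp\!\left(\ln(\widehat{RR}) \pm 1.96\sqrt{\tfrac{1-\hat p_1}{n_1\hat p_1}+\tfrac{1-\hat p_2}{n_2\hat p_2}}\right)

where a, b, c, d are the 2x2 cell counts and p1, p2 are the two group risks.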
15
Continuous outcome (means)
Outcome: continuous (e.g., pain scale, cognitive function)
  Independent observations: T-test (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (shows linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes)
  Correlated observations: Paired t-test (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time)
  Non-parametric alternatives if the normality assumption is violated (and sample size is small): Wilcoxon signed-rank test (alternative to the paired t-test); Wilcoxon rank-sum test / Mann-Whitney U test (alternative to the two-sample t-test); Kruskal-Wallis test (alternative to ANOVA); Spearman rank correlation coefficient (alternative to Pearson's correlation coefficient)
16
The two-sample t-test
17
The two-sample t-test
  • Is the difference in means that we observe
    between two groups more than we'd expect to see
    based on chance alone?

18
The standard error of the difference of two means
  • First add the variances, then take the
    square root of the sum to get the standard error.

Recall: Var(A - B) = Var(A) + Var(B) if A and B
are independent!
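In symbols (a standard result; the slide's own equation is an image):

\operatorname{Var}(\bar X - \bar Y) = \frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m},
\qquad
SE(\bar X - \bar Y) = \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}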
19
Shown by simulation
One sample of 30 (with SD = 5).
One sample of 30 (with SD = 5).
Difference of the two samples.
20
Distribution of differences
  • If X and Y are the averages of n and m subjects,
    respectively

21
But
  • As before, you usually have to use the sample SD,
    since you won't know the true SD ahead of time.
  • So, again, this becomes a t-distribution...

22
Estimated standard error of the difference.
23
Case 1: unpooled variance
Question: What are your degrees of freedom
here? Answer: Not obvious!
24
Case 1: t-test, unpooled variances
It is complicated to figure out the degrees of
freedom here! A rough approximation is df ≈ the
harmonic mean of the sample sizes (or SAS will tell you!).
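For reference, the correction SAS actually reports for unequal variances is the Satterthwaite approximation to the degrees of freedom (standard formula, not shown in the transcript):

df \approx \frac{\left(s_x^2/n_x + s_y^2/n_y\right)^2}{\dfrac{(s_x^2/n_x)^2}{n_x-1} + \dfrac{(s_y^2/n_y)^2}{n_y-1}}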
25
Case 2: pooled variance
If you assume that the standard deviation of the
characteristic (e.g., IQ) is the same in both
groups, you can pool all the data to estimate a
common standard deviation. This maximizes your
degrees of freedom (and thus your power).
26
Estimated standard error (using pooled variance
estimate)
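In symbols (standard pooled formulas; the slide's own equations are images):

s_p^2 = \frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n_x+n_y-2},
\qquad
SE(\bar X - \bar Y) = s_p\sqrt{\frac{1}{n_x}+\frac{1}{n_y}}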
27
Case 2: t-test, pooled variances
28
Alternate calculation formula: t-test, pooled
variance
29
Pooled vs. unpooled variance
  • Rule of thumb: Use pooled unless you have a
    reason not to.
  • Pooled gives you more degrees of freedom.
  • Pooled has an extra assumption: variances are equal
    between the two groups.
  • SAS automatically tests this assumption for you
    (Equality of Variances test). If p < .05, this
    suggests unequal variances, and it is better to use
    the unpooled t-test.

30
Example: two-sample t-test
  • In 1980, some researchers reported that men have
    more mathematical ability than women, as
    evidenced by the 1979 SATs, where a sample of 30
    random male adolescents had a mean score ± 1
    standard deviation of 436 ± 77, and 30 random female
    adolescents scored lower, 416 ± 81 (genders were
    similar in educational background,
    socio-economic status, and age). Do you agree
    with the authors' conclusions?

31
Data Summary
Group              n    Sample Mean   Sample SD
Group 1: women     30   416           81
Group 2: men       30   436           77
32
Two-sample t-test
  • 1. Define your hypotheses (null, alternative)
  • H0 ?-? math SAT 0
  • Ha ?-? math SAT ? 0 two-sided

33
Two-sample t-test
  • 2. Specify your null distribution.
  • F and M have similar standard deviations/variances,
    so make a pooled estimate of variance.

34
Two-sample t-test
  • 3. Observed difference in our experiment: 20
    points (see the worked calculation below).
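A worked sketch of where the t = .98 used on the next slide comes from, using the data summary above:

s_p^2 = \frac{29(81^2)+29(77^2)}{58} = 6245, \quad s_p \approx 79.0,
\qquad
SE = 79.0\sqrt{\tfrac{1}{30}+\tfrac{1}{30}} \approx 20.4,
\qquad
t = \frac{436-416}{20.4} \approx 0.98 \text{ on } 58 \text{ df}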

35
Two-sample t-test
  • 4. Calculate the p-value of what you observed

data _null_;
  pval = (1 - probt(.98, 58))*2;
  put pval;
run;

Output: 0.3311563454

5. Do not
reject the null! No evidence that men are better at
math :)
36
Example 2: Difference in means
  • Example: Rosenthal, R. and Jacobson, L. (1966).
    Teachers' expectancies: Determinants of pupils'
    IQ gains. Psychological Reports, 19, 115-118.

37
The Experiment (note exact numbers have been
altered)
  • Grade 3 at Oak School were given an IQ test at
    the beginning of the academic year (n = 90).
  • Classroom teachers were given a list of names of
    students in their classes who had supposedly
    scored in the top 20 percent; these students were
    identified as "academic bloomers" (n = 18).
  • BUT the children on the teachers' lists had
    actually been randomly assigned to the list.
  • At the end of the year, the same IQ test was
    re-administered.

38
Example 2
  • Statistical question Do students in the
    treatment group have more improvement in IQ than
    students in the control group?
  • What will we actually compare?
  • One-year change in IQ score in the treatment
    group vs. one-year change in IQ score in the
    control group.

39
Results
Change in IQ score, mean (SD):
  Academic bloomers (n = 18): 12.2 (2.0)
  Controls (n = 72): 8.2 (2.0)
  Difference: 12.2 − 8.2 = 4 points
40
What does a 4-point difference mean?
  • Before we perform any formal statistical analysis
    on these data, we already have a lot of
    information.
  • Look at the basic numbers first; THEN consider
    statistical significance as a secondary guide.

41
Is the association statistically significant?
  • This 4-point difference could reflect a true
    effect or it could be a fluke.
  • The question: is a 4-point difference bigger or
    smaller than the expected sampling variability?

42
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between
academic bloomers and normal students (i.e., the
difference is 0).
43
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true.
  • These predictions can be made by mathematical
    theory or by computer simulation.

44
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true (math theory).
45
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true (computer simulation).
  • In computer simulation, you simulate taking
    repeated samples of the same size from the same
    population and observe the sampling variability.
  • I used computer simulation to take 1000 samples
    of 18 treated and 72 controls

46
Computer Simulation Results
47
3. Empirical data
  • Observed difference in our experiment: 12.2 − 8.2
    = 4.0

48
4. P-value
  • A t-curve with 88 df has slightly wider
    cut-offs for 95% area (t = 1.99) than a normal
    curve (Z = 1.96).

p-value < .0001
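A worked sketch of the t statistic behind this p-value, using the pooled SD of 2.0 from the results summary:

SE = 2.0\sqrt{\tfrac{1}{18}+\tfrac{1}{72}} \approx 0.52,
\qquad
t = \frac{4.0}{0.52} \approx 7.6 \text{ on } 88 \text{ df} \;\Rightarrow\; p < .0001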
49
Visually
50
5. Reject null!
  • Conclusion: IQ scores can bias expectancies in
    the teachers' minds and cause them to
    unintentionally treat "bright" students
    differently from those seen as less bright.

51
Confidence interval (more information!!)
  • 95% CI for the difference: 4.0 ± 1.99(.52) =
    (3.0, 5.0)

52
What if our standard deviation had been higher?
  • The standard deviations for change scores in
    treatment and control were each 2.0. What if
    change scores had been much more variable, say a
    standard deviation of 10.0 (for both)?

53
(No Transcript)
54
With a std. dev. of 10.0: LESS STATISTICAL POWER!
55
Don't forget: The paired t-test
  • Did the control group in the previous experiment
    improve at all during the year?
  • Do not apply a two-sample t-test to answer this
    question!
  • After − Before yields a single sample of
    differences.
  • This is a within-group rather than a between-group
    comparison.

56
Continuous outcome (means)
Outcome: continuous (e.g., pain scale, cognitive function)
  Independent observations: T-test (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (shows linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes)
  Correlated observations: Paired t-test (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time)
  Non-parametric alternatives if the normality assumption is violated (and sample size is small): Wilcoxon signed-rank test (alternative to the paired t-test); Wilcoxon rank-sum test / Mann-Whitney U test (alternative to the two-sample t-test); Kruskal-Wallis test (alternative to ANOVA); Spearman rank correlation coefficient (alternative to Pearson's correlation coefficient)
57
Data Summary
Group              n    Sample Mean   Sample SD
Group 1: Change    72   8.2           2.0

58
Did the control group in the previous experiment
improve at all during the year?
p-value < .0001
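A worked sketch of the paired (one-sample) t test behind this p-value, using the summary above:

t = \frac{8.2 - 0}{2.0/\sqrt{72}} \approx \frac{8.2}{0.24} \approx 35 \text{ on } 71 \text{ df} \;\Rightarrow\; p < .0001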
59
Normality assumption of the t-test
  • If the distribution of the trait is normal, it is
    fine to use a t-test.
  • But if the underlying distribution is not normal
    and the sample size is small (rule of thumb: n > 30
    per group if not too skewed; n > 100 if the
    distribution is really skewed), the Central Limit
    Theorem takes some time to kick in. Cannot use the
    t-test.
  • Note: the t-test is very robust against violations
    of the normality assumption!

60
Alternative tests when normality is violated
Non-parametric tests
61
Continuous outcome (means)
Outcome: continuous (e.g., pain scale, cognitive function)
  Independent observations: T-test (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (shows linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes)
  Correlated observations: Paired t-test (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time)
  Non-parametric alternatives if the normality assumption is violated (and sample size is small): Wilcoxon signed-rank test (alternative to the paired t-test); Wilcoxon rank-sum test / Mann-Whitney U test (alternative to the two-sample t-test); Kruskal-Wallis test (alternative to ANOVA); Spearman rank correlation coefficient (alternative to Pearson's correlation coefficient)
62
Non-parametric tests
  • For small samples, t-tests require your outcome
    variable to be normally distributed (or close enough).
  • Non-parametric tests are based on RANKS instead
    of means and standard deviations (population
    parameters).

63
Example: non-parametric tests
10 dieters following the Atkins diet vs. 10 dieters
following Jenny Craig. Hypothetical
RESULTS: The Atkins group loses an average of 34.5
lbs.; the J. Craig group loses an average of 18.5
lbs. Conclusion: Atkins is better?
64
Example: non-parametric tests
BUT, take a closer look at the individual
data. Atkins, change in weight (lbs): 4, 3,
0, -3, -4, -5, -11, -14, -15, -300. J. Craig,
change in weight (lbs): -8, -10, -12, -16, -18,
-20, -21, -24, -26, -30.
65
Jenny Craig
[Histogram: Jenny Craig group, weight change (lbs) on the x-axis (-30 to +20), percent on the y-axis (0-30%)]
66
Atkins
[Histogram: Atkins group, weight change (lbs) on the x-axis (-300 to +20), percent on the y-axis (0-30%)]
67
t-test inappropriate
  • Comparing the mean weight loss of the two groups
    is not appropriate here.
  • The distributions do not appear to be normally
    distributed.
  • Moreover, there is an extreme outlier (this
    outlier influences the mean a great deal).

68
Wilcoxon rank-sum test
  • RANK the values, 1 being the least weight loss
    and 20 being the most weight loss.
  • Atkins
  • 4, 3, 0, -3, -4, -5, -11, -14, -15, -300
  •  1, 2, 3, 4, 5, 6, 9, 11, 12, 20
  • J. Craig
  • -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
  • 7, 8, 10, 13, 14, 15, 16, 17, 18,
    19

69
Wilcoxon rank-sum test
  • Sum of Atkins ranks:
    1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
  • Sum of Jenny Craig's ranks:
    7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
  • Jenny Craig clearly ranked higher!
  • P-value (from computer): .018

For details of the statistical test, see
appendix of these slides
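A minimal SAS sketch of how this rank-sum p-value could be obtained (the dataset and variable names are made up for illustration; the slide only says the p-value came from a computer):

data diets;
  input group $ change @@;               * diet label and weight change (lbs);
  datalines;
A 4  A 3  A 0  A -3  A -4  A -5  A -11  A -14  A -15  A -300
J -8 J -10 J -12 J -16 J -18 J -20 J -21 J -24 J -26 J -30
;
run;

proc npar1way data=diets wilcoxon;       * Wilcoxon rank-sum / Mann-Whitney U;
  class group;                           * the two diet groups;
  var change;                            * outcome being ranked;
  exact wilcoxon;                        * exact p-value for these small samples;
run;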
70
Binary or categorical outcomes (proportions)
Outcome: binary or categorical (e.g., fracture yes/no)
  Independent observations: Chi-square test (compares proportions between two or more groups); Relative risks (odds ratios or risk ratios); Logistic regression (multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios)
  Correlated observations: McNemar's chi-square test (compares a binary outcome between two correlated groups, e.g., before and after); Conditional logistic regression (multivariate regression technique for a binary outcome when groups are correlated, e.g., matched data); GEE modeling (multivariate regression technique for a binary outcome when groups are correlated, e.g., repeated measures)
  Alternatives to the chi-square test if cells are sparse (some cells <5): Fisher's exact test (independent groups); McNemar's exact test (correlated groups)
71
Difference in proportions (special case of
chi-square test)
72
Null distribution of a difference in proportions
73
Null distribution of a difference in proportions
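In symbols (standard form under the null; the slide's own equation is an image):

\hat p_1 - \hat p_2 \;\approx\; N\!\left(0,\; \sqrt{\hat p(1-\hat p)\left(\tfrac{1}{n_1}+\tfrac{1}{n_2}\right)}\right),
\qquad
\hat p = \frac{x_1+x_2}{n_1+n_2}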
74
Difference in proportions test
Null hypothesis: The difference in proportions is
0.
75
Recall case-control example
76
Absolute risk: Difference in proportions exposed
77
Difference in proportions exposed
78
Example 2: Difference in proportions
  • Research question: Are antidepressants a risk
    factor for suicide attempts in children and
    adolescents?
  • Example modified from: "Antidepressant Drug
    Therapy and Suicide in Severely Depressed
    Children and Adults," Olfson et al. Arch Gen
    Psychiatry. 2006;63:865-872.

79
Example 2: Difference in proportions
  • Design: Case-control study
  • Methods: Researchers used Medicaid records to
    compare prescription histories between 263
    children and teenagers (6-18 years) who had
    attempted suicide and 1241 controls who had never
    attempted suicide (all subjects suffered from
    depression).
  • Statistical question: Is a history of
    antidepressant use more common among cases than
    controls?

80
Example 2
  • Statistical question: Is a history of
    antidepressant use more common among the
    suicide-attempt cases than among controls?
  • What will we actually compare?
  • Proportion of cases who used antidepressants in
    the past vs. proportion of controls who did.
81
Results
Any antidepressant drug ever:
  No. (%) of cases (n = 263): 120 (46%)
  No. (%) of controls (n = 1241): 448 (36%)
  Difference: 46% − 36% = 10%
82
Is the association statistically significant?
  • This 10% difference could reflect a true
    association or it could be a fluke in this
    particular sample.
  • The question: is 10% bigger or smaller than the
    expected sampling variability?

83
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no association between
antidepressant use and suicide attempts in the
target population (i.e., the difference is 0).
84
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true.
85
Also Computer Simulation Results
86
Hypothesis Testing
Step 3: Do an experiment.
We observed a difference of 10% between cases and
controls.
87
Hypothesis Testing
Step 4: Calculate a p-value.
88
P-value from our simulation
89
P-value
From our simulation, we estimate the p-value to
be 4/1000 or .004
90
Hypothesis Testing
Step 5: Reject or do not reject the null
hypothesis.
Here we reject the null. Alternative hypothesis:
There is an association between antidepressant
use and suicide attempts in the target population.
91
What would a lack of statistical significance
mean?
  • If this study had sampled only 50 cases and 50
    controls, the sampling variability would have
    been much higher, as shown in this computer
    simulation:

92
(No Transcript)
93
With only 50 cases and 50 controls
94
Two-tailed p-value
95
Practice problem
  • An August 2003 research article in Developmental
    and Behavioral Pediatrics reported the following
    about a sample of UK kids: when given a choice of
    a non-branded chocolate cereal vs. CoCo Pops, 97%
    (36) of 37 girls and 71% (27) of 38 boys
    preferred the CoCo Pops. Is this evidence that
    girls are more likely to choose brand-named
    products?

96
Answer
  • 1. Hypotheses:
  • H0: pgirls − pboys = 0
  • Ha: pgirls − pboys ≠ 0 (two-sided)
  • 2. Null distribution of the difference of two
    proportions
  • 3. Observed difference in our experiment:
    .97 − .71 = .26
  • 4. Calculate the p-value of what you observed:

data _null_;
  pval = (1 - probnorm(3.06))*2;
  put pval;
run;

Output: 0.0022133699

5. The
p-value is sufficiently low for us to reject the
null; there does appear to be a difference in
gender preferences here.
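A worked sketch of where the Z = 3.06 in the code above comes from (pooled proportion under the null):

\hat p = \frac{36+27}{37+38} = .84,
\qquad
SE_0 = \sqrt{.84(.16)\left(\tfrac{1}{37}+\tfrac{1}{38}\right)} \approx .085,
\qquad
Z = \frac{.26}{.085} \approx 3.06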
97
Key two-sample hypothesis tests
  • Test for H0: µx − µy = 0 (σ² unknown, but
    roughly equal); standard forms shown below.
  • Test for H0: p1 − p2 = 0
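In symbols, the standard forms of these two tests (the slide's own formulas are images not captured in the transcript):

t = \frac{\bar X - \bar Y}{s_p\sqrt{\tfrac{1}{n_x}+\tfrac{1}{n_y}}} \text{ on } n_x+n_y-2 \text{ df},
\qquad
Z = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p(1-\hat p)\left(\tfrac{1}{n_1}+\tfrac{1}{n_2}\right)}},
\quad \hat p = \frac{x_1+x_2}{n_1+n_2}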

98
Corresponding confidence intervals
  • For a difference in means, 2 independent samples
    (σ²'s unknown but roughly equal); see below.
  • For a difference in proportions, 2 independent
    samples
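In symbols (standard forms; the slide's formulas are images):

(\bar X - \bar Y) \pm t_{df,\,.975}\; s_p\sqrt{\tfrac{1}{n_x}+\tfrac{1}{n_y}},
\qquad
(\hat p_1 - \hat p_2) \pm 1.96\sqrt{\tfrac{\hat p_1(1-\hat p_1)}{n_1}+\tfrac{\hat p_2(1-\hat p_2)}{n_2}}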

99
Appendix: details of the rank-sum test
100
Wilcoxon Rank-sum test
101
Example
  • For example, if team 1 and team 2 (two gymnastic
    teams) are competing, and the judges rank all the
    individuals in the competition, how can you tell
    if team 1 has done significantly better than team
    2 or vice versa?

102
Answer
  • Intuition: under the null hypothesis of no
    difference between the two groups,
  • If n1 = n2, the sums of ranks T1 and T2 should be
    equal.
  • But if n1 ≠ n2, then T2 (of the bigger group, n2)
    should automatically be bigger. But how much bigger
    under the null?
  • For example, if team 1 has 3 people and team 2
    has 10, we could rank all 13 participants from 1
    to 13 on individual performance. If team 1 (X)
    and team 2 don't differ in talent, the ranks ought
    to be spread evenly among the two groups, e.g.:
  • 1 2 X 4 5 6 X 8 9 10 X 12 13 (exactly even
    distribution if team 1 ranks 3rd, 7th, and 11th)

103
(No Transcript)
104
It turns out that, if the null hypothesis is
true, the difference between the larger-group sum
of ranks and the smaller-group sum of ranks is
exactly equal to the difference between T1 and T2
105
From slide 23
From slide 24
Here, under the null: U2 = 55 + 30 − 70 = 15; U1 = 6 + 30 − 21 = 15; U2 + U1 = 30
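The U formulas used here (shown as images on the preceding appendix slides) are the standard Mann-Whitney ones:

U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1,
\qquad
U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2

With n1 = 3 and n2 = 10 these give the 6 + 30 − 21 and 55 + 30 − 70 shown above.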
106
  • → Under the null hypothesis, U1 should equal U2.

The U's should be equal to each other and will
each equal n1n2/2:
  U1 + U2 = n1n2
  Under the null hypothesis, U1 = U2 = U0
  → E(U1 + U2) = 2E(U0) = n1n2, so E(U1) = E(U2) = E(U0) = n1n2/2
So the test statistic here is not quite the
difference in the sums of ranks of the two
groups; it's the smaller observed U value, U0. For
small n's, take U0 and get the p-value directly from
a U table.
107
For large enough n's (> 10 per group):
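The standard large-sample result (not captured in the transcript) is that, under the null, U is approximately normal:

U \;\approx\; N\!\left(\frac{n_1 n_2}{2},\; \sqrt{\frac{n_1 n_2 (n_1+n_2+1)}{12}}\right)

(mean and standard deviation shown).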
108
Add observed data to the example
  • Example: If the girls on the two gymnastics teams
    were ranked as follows:
  • Team 1: 1, 5, 7                    Observed T1 = 13
  • Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13
                                       Observed T2 = 78
  • Are the teams significantly different?
  • Total sum of ranks = 13(14)/2 = 91;
    n1n2 = 3 × 10 = 30
  • Under the null hypothesis, expect U1 − U2 = 0 and
    U1 + U2 = 30 (each should equal about 15 under
    the null), so U0 ≈ 15.
  • U1 = 30 + 6 − 13 = 23
  • U2 = 30 + 55 − 78 = 7
  • → U0 = 7
  • Not quite statistically significant in the U
    table: p = .1084 (see attached), ×2 for a two-tailed
    test

109
Example problem 2
A study was done to compare the Atkins diet
(low-carb) vs. Jenny Craig (low-cal, low-fat).
The following weight changes were obtained; note
they are very skewed because someone lost 100
pounds. The mean loss for Atkins is going to look
higher because of the bozo, but does that mean
the diet is better overall? Conduct a
Mann-Whitney U test to compare ranks.
 
110
Answer
Corresponding ranks (lower is more weight
loss!):
Sum of ranks for JC = 25 (n = 5); sum of ranks for
Atkins = 41 (n = 6).
n1n2 = 5 × 6 = 30. Under the null hypothesis,
expect U1 − U2 = 0, U1 + U2 = 30, and U0 ≈ 15.
U1 = 30 + 15 − 25 = 20; U2 = 30 + 21 − 41 = 10.
U0 = 10; n1 = 5, n2 = 6. Go to the
Mann-Whitney chart: p = .2143, ×2 ≈ .43 for a two-tailed test.