Title: Sample Design for Group-Randomized Trials
1. Sample Design for Group-Randomized Trials
- Howard S. Bloom
- Chief Social Scientist
- MDRC
- Prepared for the IES/NCER Summer Research Training Institute held at Northwestern University on July 27, 2010.
2. Today we will examine
- Sample size determinants
- Precision requirements
- Sample allocation
- Covariate adjustments
- Matching and blocking
- Subgroup analyses
- Generalizing findings for sites and blocks
- Using two-level data for three-level situations
3. Part I
4. Statistical properties of group-randomized impact estimators
- Unbiased estimates
- $Y_{ij} = \alpha + B_0 T_j + e_j + \varepsilon_{ij}$
- $E(b_0) = B_0$
- Less precise estimates
- $\mathrm{VAR}(\varepsilon_{ij}) = \sigma^2$
- $\mathrm{VAR}(e_j) = \tau^2$
- $\rho = \tau^2/(\tau^2 + \sigma^2)$
5. Design Effect (for a given total number of individuals)

  Intraclass         Individuals per Group (n)
  Correlation (ρ)     10       50      500
  --------------------------------------------
  0.01               1.04     1.22     2.48
  0.05               1.20     1.86     5.09
  0.10               1.38     2.43     7.13
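These entries are consistent with the square root of the familiar design effect for clustering (a reconstruction; the formula itself is an assumption here):

$$\text{Design effect} = \sqrt{1 + (n-1)\rho}$$

For example, $\rho = 0.05$ and $n = 50$ gives $\sqrt{1 + 49(0.05)} = \sqrt{3.45} \approx 1.86$.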
6. Sample design parameters
- Number of randomized groups (J)
- Number of individuals per randomized group (n)
- Proportion of groups randomized to program status (P)
7. Reporting precision
- A minimum detectable effect (MDE) is the smallest true effect that has a good chance of being found to be statistically significant.
- We typically define an MDE as the smallest true effect that has 80 percent power for a two-tailed test of statistical significance at the 0.05 level.
- An MDE is reported in natural units, whereas a minimum detectable effect size (MDES) is reported in units of standard deviations.
8. Minimum Detectable Effect Sizes for a Group-Randomized Design with ρ = 0.05 and no Covariates

  Randomized       Individuals per Group (n)
  Groups (J)       10       50      500
  ---------------------------------------
   10             0.77     0.53     0.46
   20             0.50     0.35     0.30
   40             0.35     0.24     0.21
  120             0.20     0.14     0.12
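The entries above can be reproduced with a short calculation, shown below as a sketch in Python. It assumes the standard MDES formula for a group-randomized design with no covariates (e.g., Bloom 2005); the function name and defaults are illustrative:

```python
from scipy.stats import t

def mdes(J, n, rho, P=0.5, alpha=0.05, power=0.80, two_tailed=True):
    """Minimum detectable effect size for a group-randomized design:
    J groups of n individuals, intraclass correlation rho, and a
    proportion P of groups assigned to treatment (no covariates)."""
    df = J - 2  # degrees of freedom for the group-level impact estimate
    # Multiplier M_{J-2}: critical t-value plus the t-quantile for power
    m = t.ppf(1 - alpha / (2 if two_tailed else 1), df) + t.ppf(power, df)
    return m * ((rho + (1 - rho) / n) / (P * (1 - P) * J)) ** 0.5

print(round(mdes(J=10, n=10, rho=0.05), 2))    # 0.77, as in the table
print(round(mdes(J=120, n=500, rho=0.05), 2))  # 0.12
```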
9. Implications for sample design
- It is extremely important to randomize an adequate number of groups.
- It is often far less important how many individuals per group you have.
10. Part II
- Determining required precision
11. When assessing how much precision is needed
- Always ask: relative to what?
- Program benefits
- Program costs
- Existing outcome differences
- Past program performance
12. Effect Size Gospel According to Cohen and Lipsey

       Cohen                  Lipsey
    (speculative)           (empirical)
  ---------------------------------------
  Small    0.2σ           Small    0.15σ
  Medium   0.5σ           Medium   0.45σ
  Large    0.8σ           Large    0.90σ
13. Five-year impacts of the Tennessee class-size experiment
- Treatment
- 13-17 versus 22-26 students per class
- Effect sizes
- 0.11σ to 0.22σ for reading and math
- Findings are summarized from Nye, Barbara, Larry V. Hedges and Spyros Konstantopoulos (1999) "The Long-Term Effects of Small Classes: A Five-Year Follow-up of the Tennessee Class Size Experiment," Educational Evaluation and Policy Analysis, 21(2): 127-142.
14. Annual reading and math growth

  Grade           Reading Growth    Math Growth
  Transition       Effect Size      Effect Size
  -----------------------------------------------
  K - 1               1.52             1.14
  1 - 2               0.97             1.03
  2 - 3               0.60             0.89
  3 - 4               0.36             0.52
  4 - 5               0.40             0.56
  5 - 6               0.32             0.41
  6 - 7               0.23             0.30
  7 - 8               0.26             0.32
  8 - 9               0.24             0.22
  9 - 10              0.19             0.25
  10 - 11             0.19             0.14
  11 - 12             0.06             0.01
  -----------------------------------------------
  Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates-MacGinitie (for reading only), MAT8, Terra Nova CAT, and SAT10. 95% confidence intervals range in reading from +/- .03 to .15 and in math from +/- .03 to .22.
15. Performance gap between average (50th percentile) and weak (10th percentile) schools

  Subject and grade   District I   District II   District III   District IV
  --------------------------------------------------------------------------
  Reading
    Grade 3              0.31         0.18          0.16           0.43
    Grade 5              0.41         0.18          0.35           0.31
    Grade 7              0.25         0.11          0.30           NA
    Grade 10             0.07         0.11          NA             NA
  Math
    Grade 3              0.29         0.25          0.19           0.41
    Grade 5              0.27         0.23          0.36           0.26
    Grade 7              0.20         0.15          0.23           NA
    Grade 10             0.14         0.17          NA             NA

  Source: District I outcomes are based on ITBS scaled scores, District II on SAT 9 scaled scores, District III on MAT NCE scores, and District IV on SAT 8 NCE scores.
16. Demographic performance gap in reading and math: Main NAEP scores

  Subject and grade   Black-White   Hispanic-White   Male-Female   Eligible-Ineligible for free/reduced-price lunch
  ------------------------------------------------------------------------------------------------------------------
  Reading
    Grade 4              -0.83          -0.77           -0.18           -0.74
    Grade 8              -0.80          -0.76           -0.28           -0.66
    Grade 12             -0.67          -0.53           -0.44           -0.45
  Math
    Grade 4              -0.99          -0.85            0.08           -0.85
    Grade 8              -1.04          -0.82            0.04           -0.80
    Grade 12             -0.94          -0.68            0.09           -0.72

  Source: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2002 Reading Assessment and 2000 Mathematics Assessment.
17. ES Results from Randomized Studies

  Achievement Measure                n     Mean
  -----------------------------------------------
  Elementary School                 389    0.33
    Standardized test (Broad)        21    0.07
    Standardized test (Narrow)      181    0.23
    Specialized Topic/Test          180    0.44
  Middle Schools                     36    0.51
  High Schools                       43    0.27
18. Part III
- The ABCs of Sample Allocation
19. Sample allocation alternatives
- Balanced allocation
- maximizes precision for a given sample size
- maximizes robustness to distributional assumptions
- Unbalanced allocation
- precision erodes slowly with imbalance for a given sample size
- imbalance can facilitate a larger sample
- imbalance can facilitate randomization
20. Variance relationships for the program and control groups
- Equal variances when the program does not affect the outcome variance.
- Unequal variances when the program does affect the outcome variance.
21. MDES for equal variances without covariates
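A reconstruction of the formula for this case, assuming the standard formulation (e.g., Bloom 2005):

$$MDES = M_{J-2}\sqrt{\frac{\rho}{P(1-P)J} + \frac{1-\rho}{P(1-P)Jn}}$$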
22. How allocation affects MDES
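Holding $J$ and $n$ fixed, the formula above implies that the MDES scales with $1/\sqrt{P(1-P)}$, so

$$\frac{MDES(P)}{MDES(0.5)} = \sqrt{\frac{0.25}{P(1-P)}}$$

For example, $P = 0.9$ gives $\sqrt{0.25/0.09} \approx 1.67$, the last ratio in the table on the next slide.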
23. Minimum Detectable Effect Size for Sample Allocations Given Equal Variances

  Allocation    Example MDES    Ratio to Balanced Allocation
  ------------------------------------------------------------
  0.5/0.5          0.54σ                 1.00
  0.6/0.4          0.55σ                 1.02
  0.7/0.3          0.59σ                 1.09
  0.8/0.2          0.68σ                 1.25
  0.9/0.1          0.91σ                 1.67

  Example is for n = 20, J = 10, ρ = 0.05, a one-tail hypothesis test, and no covariates.
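These allocation entries can be checked with the mdes sketch shown after the Slide 8 table (same assumed formula; note the one-tail test):

```python
# One-tail test, n = 20, J = 10, rho = 0.05, with the allocation P varying
for P in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(P, round(mdes(J=10, n=20, rho=0.05, P=P, two_tailed=False), 2))
# Prints 0.54, 0.55, 0.59, 0.68, 0.90 -- matching the table to within rounding
```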
24. Implications of unbalanced allocations with unequal variances
25. Implications Continued
- The estimated standard error is unbiased
- When the allocation is balanced
- When the variances are equal
- The estimated standard error is biased upward
- When the larger sample has the larger variance
- The estimated standard error is biased downward
- When the larger sample has the smaller variance
26. Interim Conclusions
- Don't use the equal variance assumption for an unbalanced allocation with many degrees of freedom.
- Use a balanced allocation when there are few degrees of freedom.
27. References
- Gail, Mitchell H., Steven D. Mark, Raymond J. Carroll, Sylvan B. Green and David Pee (1996) "On Design Considerations and Randomization-Based Inferences for Community Intervention Trials," Statistics in Medicine, 15: 1069-1092.
- Bryk, Anthony S. and Stephen W. Raudenbush (1988) "Heterogeneity of Variance in Experimental Studies: A Challenge to Conventional Interpretations," Psychological Bulletin, 104(3): 396-404.
28. Part IV
- Using Covariates to Reduce Sample Size
29. Basic ideas
- Goal: Reduce the number of clusters randomized
- Approach: Reduce the standard error of the impact estimator by controlling for baseline covariates
- Alternative covariates:
- Individual-level
- Cluster-level
- Pretests
- Other characteristics
30. Impact Estimation with a Covariate
- y_ij = the outcome for student i from school j
- T_j = 1 for treatment schools and 0 for control schools
- X_j = a covariate for school j
- x_ij = a covariate for student i from school j
- e_j = a random error term for school j
- ε_ij = a random error term for student i from school j
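A reconstruction of the estimation model implied by these definitions (an assumption; the exact specification on the original slide may differ):

$$y_{ij} = \alpha + B_0 T_j + B_1 X_j + B_2 x_{ij} + e_j + \varepsilon_{ij}$$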
31. Minimum Detectable Effect Size with a Covariate
- MDES = minimum detectable effect size
- M_{J-K} = a degrees-of-freedom multiplier¹
- J = the total number of schools randomized
- n = the number of students in a grade per school
- P = the proportion of schools randomized to treatment
- ρ = the unconditional intraclass correlation (without a covariate)
- R1² = the proportion of variance across individuals within schools (at level 1) predicted by the covariate
- R2² = the proportion of variance across schools (at level 2) predicted by the covariate
- ¹ For 20 or more degrees of freedom, M_{J-K} equals 2.8 for a two-tail test and 2.5 for a one-tail test with statistical power of 0.80 and statistical significance of 0.05
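A reconstruction of the formula these terms define, following the expression in Bloom, Richburg-Hayes and Black (2007):

$$MDES = M_{J-K}\sqrt{\frac{\rho(1-R_2^2)}{P(1-P)J} + \frac{(1-\rho)(1-R_1^2)}{P(1-P)Jn}}$$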
32. Questions Addressed Empirically about the Predictive Power of Covariates
- School-level vs. student-level pretests
- Earlier vs. later follow-up years
- Reading vs. math
- Elementary vs. middle vs. high school
- All schools vs. low-income schools vs.
low-performing schools
33. Empirical Analysis
- Estimate ρ, R2² and R1² from data on thousands of students from hundreds of schools, during multiple years, at five urban school districts
- Summarize these estimates for reading and math in grades 3, 5, 8 and 10
- Compute implications for minimum detectable effect sizes
34. Estimated Parameters for Reading with a School-level Pretest Lagged One Year

                      School District
             A       B       C       D       E
  -----------------------------------------------
  Grade 3
    ρ       0.20    0.15    0.19    0.22    0.16
    R2²     0.31    0.77    0.74    0.51    0.75
  Grade 5
    ρ       0.25    0.15    0.20    NA      0.12
    R2²     0.33    0.50    0.81    NA      0.70
  Grade 8
    ρ       0.18    NA      0.23    NA      NA
    R2²     0.77    NA      0.91    NA      NA
  Grade 10
    ρ       0.15    NA      0.29    NA      NA
    R2²     0.93    NA      0.95    NA      NA
35. Minimum Detectable Effect Sizes for Reading with a School-Level Pretest (Y-1) or a Student-Level Pretest (y-1) Lagged One Year

                         Grade 3   Grade 5   Grade 8   Grade 10
  ---------------------------------------------------------------
  20 schools randomized
    No covariate           0.57      0.56      0.61      0.62
    Y-1                    0.37      0.38      0.24      0.16
    y-1                    0.38      0.40      0.28      0.15
  40 schools randomized
    No covariate           0.39      0.38      0.42      0.42
    Y-1                    0.26      0.26      0.17      0.11
    y-1                    0.26      0.27      0.19      0.10
  60 schools randomized
    No covariate           0.32      0.31      0.34      0.34
    Y-1                    0.21      0.21      0.13      0.09
    y-1                    0.21      0.22      0.15      0.08
36. Key Findings
- Using a pretest improves precision dramatically.
- This improvement increases appreciably from elementary school to middle school to high school because R2² increases.
- School-level pretests produce as much precision as do student-level pretests.
- The effect of a pretest declines somewhat as the time between it and the post-test increases.
- Adding a second pretest increases precision slightly.
- Using a pretest for a different subject increases precision substantially.
- Narrowing the sample to schools that are similar to each other does not improve precision beyond that achieved by a pretest.
37. Source
- Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2007) "Using Covariates to Improve Precision for Studies that Randomize Schools to Evaluate Educational Interventions," Educational Evaluation and Policy Analysis, 29(1): 30-59.
38. Part V: The Putative Power of Pairing
- A Tail of Two Tradeoffs
- (It was the best of techniques. It was the worst of techniques. Who the dickens said that?)
39. Pairing
- Why match pairs?
- for face validity
- for precision
- How to match pairs?
- rank order clusters by covariate
- pair clusters in rank-ordered list
- randomize clusters in each pair
40. When to pair?
- When the gain in predictive power outweighs the loss of degrees of freedom
- Degrees of freedom:
- J - 2 without pairing
- J/2 - 1 with pairing
41. Deriving the Minimum Required Predictive Power of Pairing
- Without pairing
- With pairing
- Breakeven R²
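A sketch of the derivation, assuming the MDES framework used earlier: pairing multiplies the standard error by $\sqrt{1-R^2}$ but cuts the degrees of freedom from $J-2$ to $J/2-1$, so pairing helps only when

$$M_{J/2-1}\sqrt{1-R^2} < M_{J-2}\,, \qquad\text{i.e.,}\qquad R^2 > R^2_{min} = 1 - \left(\frac{M_{J-2}}{M_{J/2-1}}\right)^2$$

For example, with $J = 6$ and a two-tail test, $M_4 \approx 3.72$ and $M_2 \approx 5.36$, giving $R^2_{min} \approx 0.52$, the first entry in the table on the next slide.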
42. The Minimum Required Predictive Power of Pairing

  Randomized      Required Predictive
  Clusters (J)    Power (R²_min)
  -------------------------------------
   6                  0.52
   8                  0.35
  10                  0.26
  20                  0.11
  30                  0.07

  For a two-tail test.
43. A few key points about blocking
- Blocking for face validity vs. blocking for precision
- Treating blocks as fixed effects vs. random effects
- Defining blocks using baseline information
44. Part VI
- Subgroup Analyses 1: When to Emphasize Them
45. Confirmatory vs. Exploratory Findings
- Confirmatory: Draw conclusions about the program's effectiveness if results are:
- Consistent with theory and contextual factors
- Statistically significant and large
- And the subgroup was pre-specified
- Exploratory: Develop hypotheses for further study
46. Pre-specification
- Before the analysis, state that conclusions about the program will be based in part on findings for this set of subgroups
- Pre-specification can be based on:
- Theory
- Prior evidence
- Policy relevance
47. Statistical significance
- When should we discuss subgroup findings?
- Depends on:
- Whether there are significant differences in impacts across subgroups
- Might depend on whether impacts for the full sample are statistically significant
48. Part VII
- Subgroup Analyses 2: Creating Subgroups
49. Defining Features
- Creating subgroups in terms of:
- Program characteristics
- Randomized group characteristics
- Individual characteristics
50. Defining Subgroups by Program Characteristics
- Based only on program features that were randomized
- Thus one cannot use implementation quality
51. Defining Subgroups by Characteristics of Randomized Groups
- Types of impacts:
- Net impacts
- Differential impacts
- Internal validity:
- Only use pre-existing characteristics
- Precision:
- Net impact estimates are limited by the reduced number of randomized groups
- Differential impact estimates are triply limited (and often need four times as many randomized groups)
52. Defining Subgroups by Characteristics of Individuals
- Types of impacts:
- Net impacts
- Differential impacts
- Internal validity:
- Only use pre-existing characteristics
- Only use subgroups with sample members from all randomized groups
- Precision:
- For net impacts, precision can be almost as good as for the full sample
- For differential impacts, precision can be even better than for the full sample
54. Part VIII
- Generalizing Results from Multiple Sites and Blocks
55. Fixed vs. Random Effects Inference: A Vexing Issue
- Known vs. unknown populations
- Broader vs. narrower inferences
- Weaker vs. stronger precision
- Few vs. many sites or blocks
56. Weighting Sites and Blocks
- Implicitly through a pooled regression
- Explicitly based on:
- Number of schools
- Number of students
- Explicitly based on precision:
- Fixed effects
- Random effects
- Bottom line: the question addressed is what counts
57. Part IX
- Using Two-Level Data for Three-Level Situations
58. The Issue
- General question: What happens when you design a study with randomized groups that comprise three levels based on data which do not account explicitly for the middle level?
- Specific example: What happens when you design a study that randomizes schools (with students clustered in classrooms in schools) based on data for students clustered in schools?
59. 3-level vs. 2-level Variance Components
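A sketch of the comparison, assuming the standard variance decomposition: a three-level model splits the total outcome variance across schools, classrooms, and students, while two-level student-in-school data pool the classroom component into the within-school term:

$$\text{3-level: } \tau^2_{school} + \tau^2_{class} + \sigma^2 \qquad \text{2-level: } \tau^2_{school} + (\tau^2_{class} + \sigma^2)$$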
60. 3-level vs. 2-level MDES for Original Sample
61. Further References
- Bloom, Howard S. (2005) "Randomizing Groups to Evaluate Place-Based Programs," in Howard S. Bloom, editor, Learning More From Social Experiments: Evolving Analytic Approaches (New York: Russell Sage Foundation).
- Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2005) Using Covariates to Improve Precision: Empirical Guidance for Studies that Randomize Schools to Measure the Impacts of Educational Interventions (New York: MDRC).
- Donner, Allan and Neil Klar (2000) Cluster Randomization Trials in Health Research (London: Arnold).
- Hedges, Larry V. and Eric C. Hedberg (2006) Intraclass Correlation Values for Planning Group Randomized Trials in Education (Chicago: Northwestern University).
- Murray, David M. (1998) Design and Analysis of Group-Randomized Trials (New York: Oxford University Press).
- Raudenbush, Stephen W., Andres Martinez and Jessaca Spybrook (2005) Strategies for Improving Precision in Group-Randomized Experiments (University of Chicago).
- Raudenbush, Stephen W. (1997) "Statistical Analysis and Optimal Design for Cluster Randomized Trials," Psychological Methods, 2(2): 173-185.
- Schochet, Peter Z. (2005) Statistical Power for Random Assignment Evaluations of Education Programs (Princeton, NJ: Mathematica Policy Research).