Title: Using Experiments as a Causal Gold Standard to Show that Experiments are not Unique as a Causal Gold Standard
1. Using Experiments as a Causal Gold Standard to Show that Experiments are not Unique as a Causal Gold Standard
- Thomas D. Cook, William R. Shadish, Jr., Vivian C. Wong
2. General Purposes
- Presentation on design choice in evaluation policy
- Prompted by growing advocacy in the USA of randomized experiments as the preferred method for causal inference, as the gold standard
- Implications of the talk are broader than evaluation, since cause is omnipresent in the theoretical and applied social sciences in all disciplines
- We seek to examine empirically whether certain kinds of non-experiment produce causal estimates similar to those from experiments
- To do this we turn to within-study comparisons
3. What is a Within-Study Comparison?
- The causal estimate from an experiment is compared to the causal estimate from a non-experiment that shares the same intervention group
- The difference lies in how the comparison group is formed: randomly, or systematically through self-selection, administrator decision, or both
- Past Question 1: Does the non-random comparison lead to the same effect size after adjustments for selection (OLS, Heckman-type selection models, propensity scores)?
- Past Question 2: Under what conditions does one get a closer correspondence between the two kinds of estimate? (A minimal simulated sketch of the logic follows.)
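To make the logic concrete, here is a minimal simulated sketch of a within-study comparison in Python. Nothing in it comes from the talk's data: the sample sizes, the single "ability" confounder, and the true effect of 1.5 are illustrative assumptions, and the OLS adjustment stands in for the fuller set of selection adjustments named above.

```python
# Within-study comparison on simulated data: the experiment and the
# non-experiment share the same treatment group; only the comparison
# group differs (randomly formed vs. self-selected).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
ability = rng.normal(size=n)                  # confounder driving selection
y0 = 10 + 2 * ability + rng.normal(size=n)    # potential outcome, untreated
y1 = y0 + 1.5                                 # true treatment effect = 1.5

treated = np.zeros(n, dtype=bool)
treated[:n // 2] = True                       # shared treatment group
y_treat = y1[treated]

# Experimental benchmark: randomly formed control group.
exp_estimate = y_treat.mean() - y0[~treated].mean()

# Non-experimental comparison: self-selected controls (low ability opt out).
selected = (~treated) & (ability + rng.normal(size=n) < 0)
naive_estimate = y_treat.mean() - y0[selected].mean()

# OLS adjustment for the observed confounder.
d = np.concatenate([np.ones(treated.sum()), np.zeros(selected.sum())])
x = np.concatenate([ability[treated], ability[selected]])
y = np.concatenate([y_treat, y0[selected]])
X = np.column_stack([np.ones_like(d), d, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"experimental: {exp_estimate:.2f}, naive non-exp: {naive_estimate:.2f}, "
      f"OLS-adjusted non-exp: {beta[1]:.2f}")
```

The adjusted non-experimental estimate recovers the benchmark here only because the confounder is fully observed; that is exactly the condition the within-study literature interrogates.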
4. Brief History of the Literature on Within-Study Comparisons
- LaLonde (1986); Fraker & Maynard (1987)
- 12 subsequent studies in job training
- Reviews, meta-analytic and interpretative, of this job training literature
- Extension to examples in education in the USA and social welfare in Mexico, never yet reviewed
5. Conclusions from Reviews in Job Training
- 1. All studies claim an unacceptable level of correspondence between the experimental and non-experimental effect sizes; Dehejia and Wahba is the sole exception, and it drew lively criticism from Smith and Todd
- 2. Some procedures give better approximations than others: e.g., local matches; outcomes measured in the same way in the experiment and non-experiment; OLS or propensity scores versus instrumental variable approaches, including Heckman-type models
6. Policy Consequences
- Department of Labor, as early as 1985
- Health and Human Services, job training and beyond
- National Academy of Sciences
- Institute of Education Sciences
- Do within-study comparisons deserve all this?
7. Criteria for a Good Within-Study Comparison Design
- 1. Variation in mode of assignment: random or not
- 2. No third variables correlated with both assignment and outcome (e.g., measurement)
- 3. Randomized experiment properly executed
- 4. Quasi-experiment a good instance of its type
- 5. Both design types estimate the same causal entity (e.g., LATE in regression-discontinuity)
- 6. Acceptable criteria of correspondence between design types: effect sizes seem similar, do not formally differ, statistical significance patterns do not differ, etc. (a sketch of one such formal test follows)
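As one concrete reading of criterion 6, the sketch below tests whether two effect sizes formally differ, using a z-test on their difference. The effect sizes .59 and .57 echo the Aiken et al. results cited later; the standard errors are invented for illustration.

```python
# One formal correspondence criterion: test whether two effect sizes
# differ, given their standard errors. Assumes independent estimates;
# the SE values in the example call are illustrative, not from any study.
import math

def es_difference_test(es_exp, se_exp, es_nonexp, se_nonexp):
    """z-test for H0: experimental ES == non-experimental ES."""
    diff = es_exp - es_nonexp
    se_diff = math.sqrt(se_exp**2 + se_nonexp**2)
    z = diff / se_diff
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return diff, z, p

diff, z, p = es_difference_test(0.59, 0.10, 0.57, 0.12)
print(f"ES difference {diff:.2f}, z = {z:.2f}, p = {p:.3f}")
```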
8. Outline of Talk
- To deconstruct the non-experiment we will compare experimental estimates to:
- 1. Regression-discontinuity estimates
- 2. Interrupted time-series estimates with a control series
- 3. Estimates from the difference-of-differences (fixed effects) design (a minimal sketch of that estimator follows this list)
- Ask: Is the general conclusion about the inadequacy of non-experiments true across at least these three different kinds of non-experiment?
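Since the difference-of-differences design is named only in passing here, a minimal sketch of its estimator may help. The two-group, two-period data are simulated, and all numbers are illustrative assumptions.

```python
# Minimal difference-in-differences (fixed-effects) sketch on simulated
# two-group, two-period panel data.
import numpy as np

rng = np.random.default_rng(1)
n = 500
true_effect = 2.0

pre_t = 5.0 + rng.normal(size=n)                        # treated group, pre
post_t = pre_t + 1.0 + true_effect + rng.normal(size=n) # trend + effect
pre_c = 3.0 + rng.normal(size=n)                        # comparison group, pre
post_c = pre_c + 1.0 + rng.normal(size=n)               # same trend, no effect

# Differencing removes both the group gap and the common time trend.
did = (post_t.mean() - pre_t.mean()) - (post_c.mean() - pre_c.mean())
print(f"difference-in-differences estimate: {did:.2f} (truth {true_effect})")
```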
9. Experiments vs. Regression-Discontinuity Design Studies
10. Three Known Within-Study Comparisons of Experiments and R-D
- Aiken, West et al. (1998): R-D study and experiment, LATE analysis, results
- Buddelmeyer & Skoufias (2003): R-D study and experiment, LATE analysis, results
- Black, Galdo & Smith (2005): R-D study and experiment, LATE analysis, results
(A simulated sketch of the R-D estimate follows.)
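For readers unfamiliar with the estimator these three comparisons rest on, here is a simulated sketch of a regression-discontinuity analysis: assignment is determined by a cutoff on a pretest score, and the effect (a LATE at the cutoff) is estimated from separate linear fits on each side. The cutoff, bandwidth, and effect size are illustrative assumptions, not values from the three studies.

```python
# Regression-discontinuity sketch: treatment is assigned by a cutoff on
# an assignment variable; the effect is the gap between the two fitted
# regression lines at the cutoff.
import numpy as np

rng = np.random.default_rng(2)
n = 4000
score = rng.uniform(-1, 1, size=n)            # assignment variable
cutoff = 0.0
treated = score >= cutoff
y = 1.0 + 0.8 * score + 1.5 * treated + rng.normal(scale=0.5, size=n)

h = 0.3                                       # bandwidth around the cutoff

def side_fit(mask):
    """Linear fit on one side; the intercept is the value at score = 0."""
    X = np.column_stack([np.ones(mask.sum()), score[mask]])
    beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return beta

left = side_fit((~treated) & (score > cutoff - h))
right = side_fit(treated & (score < cutoff + h))
print(f"R-D estimate at cutoff: {right[0] - left[0]:.2f} (truth 1.5)")
```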
11. Comments on R-D vs. Experiments
- Cumulative correspondence over three cases
- Is this theoretically trivial, though?
- Is it pragmatically significant, given variation in implementation in both the experiment and R-D?
- As an existence proof, it belies the over-generalized argument that non-experiments don't work
- Emboldens us to deconstruct the non-experiment further
12. Experiment vs. Interrupted Time Series
- Only one case of an experiment deliberately compared to an ITS with a no-treatment control series
- Bloom et al. (2002, 2005); job training is the topic
- Experiment: 11 sites, 8 pre-treatment earnings waves, 20 post
- Non-experiment: 5 within-state comparisons, 4 of them within-city; all comparison Ss enrolled in welfare
- We present only the control/comparison contrast because the treatment time series is a constant
13. The Issues Are
- Is there an overall difference between control groups randomly or non-randomly formed?
- If yes, can statistical controls (OLS, IV including Heckman models, propensity scores, random growth models) eliminate this difference? 10 modes were tested, but only one longitudinal
- Is there a special difference around the intervention point, given that stable pretest group differences are not a problem for ITS? (A simulated ITS sketch follows.)
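The sketch below illustrates the ITS logic on simulated data: with a no-treatment control series that shares the secular trend, the estimate is the shift in the treated-minus-control difference at the intervention point. The 8 pre and 20 post waves mirror the Bloom et al. layout, but the series themselves are invented.

```python
# Abbreviated interrupted time series with a no-treatment control
# series: differencing the two series removes the shared trend, and the
# effect is the pre-to-post shift in that difference.
import numpy as np

rng = np.random.default_rng(3)
pre, post = 8, 20
trend = np.arange(pre + post) * 0.1                  # shared secular trend
effect = np.r_[np.zeros(pre), np.full(post, 1.2)]    # intervention at wave 8

treated_series = 4.0 + trend + effect + rng.normal(scale=0.2, size=pre + post)
control_series = 3.5 + trend + rng.normal(scale=0.2, size=pre + post)

diff = treated_series - control_series               # trend drops out
its_estimate = diff[pre:].mean() - diff[:pre].mean()
print(f"ITS estimate of the intervention effect: {its_estimate:.2f} (truth 1.2)")
```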
14. Bloom et al. Results
15. Bloom et al. Results (continued)
16. Implications of Bloom et al.
- Averaging across the 4 within-city sites showed no difference; also true if the 5th, between-city site is added
- Selecting local within-study comparisons obviated the need for statistical adjustments for non-equivalence: design alone did it
- Bloom et al. tested differential effects of statistical adjustments in between-state comparisons where there were large differences
- None eliminated the difference, and none did better than OLS
17. Non-Equivalent Control Group Design with Pretest
- By far the most frequent non-experimental design across many fields of study
- Also modal in within-study comparisons in job training, and so it provides the major basis for the past opinion that non-experiments are routinely biased
- We walk through some exemplars varying in how well they meet the six criteria for a good within-study project, from better to worse...
18. The Questions Are
- 1. Is the past pessimistic conclusion related to the quality of the within-study comparison?
- 2. Can we identify ex post facto the conditions under which this design gets the same or a different answer from a randomized experiment?
- 1st clue, from Bloom et al.: when the comparison group is very local, the comparison groups may not even differ on major observables
- 2nd clue, from theory: a complete model of selection or of the outcome will work
19. Figure 1: Design of Shadish et al. (2006)
N = 445 undergraduate psychology students: pretests, then random assignment to either
- Randomized experiment (n = 235), randomly assigned to Mathematics Training (n = 119) or Vocabulary Training (n = 116)
- Nonrandomized experiment (n = 210), self-selected into Mathematics Training (n = 79) or Vocabulary Training (n = 131)
All participants measured on both mathematics and vocabulary outcomes. (A simulated sketch of this two-stage design follows.)
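A minimal simulation of this two-stage ("doubly randomized") design may clarify why it is informative: the first-stage random split holds population and measures constant, so any gap between the arms' unadjusted estimates isolates selection bias. The data-generating model, with a single "math liking" variable driving both self-selection and outcomes, is an illustrative assumption; only the arm sizes follow Figure 1.

```python
# Two-stage design from Figure 1 on simulated data: a first random
# assignment splits participants into a randomized arm and a
# self-selection arm; the arms share population and measures.
import numpy as np

rng = np.random.default_rng(4)
n = 445
math_liking = rng.normal(size=n)       # drives self-selection AND outcome
true_effect = 1.0

arm_random = rng.random(n) < 235 / 445 # first-stage random assignment

# Treatment receipt: random in one arm, self-selected in the other.
t_rand = rng.random(n) < 0.5
t_self = math_liking + rng.normal(size=n) > 0
treated = np.where(arm_random, t_rand, t_self)

math_outcome = 5 + 0.8 * math_liking + true_effect * treated + rng.normal(size=n)

for name, arm in [("randomized arm", arm_random),
                  ("self-selected arm", ~arm_random)]:
    est = (math_outcome[arm & treated].mean()
           - math_outcome[arm & ~treated].mean())
    print(f"{name}: unadjusted estimate {est:.2f} (truth {true_effect})")
```

Under this model the randomized arm recovers the truth while the self-selected arm is biased upward, which is the raw material the propensity-score adjustments discussed next must remove.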
20. What's Special in Shadish et al.
- Variation in mode of assignment
- Most other factors held constant through the first random assignment: population, measures, activity patterns
- Good experiment? Pretests; short term, so little attrition; no chance for contamination
- Good quasi-experiment? Selection process, quality of measurement, analysis, and the role of Rosenbaum
21. Results: Shadish et al.
22. Implications of Shadish et al.
- Here the sampling design produced non-equivalent groups on observables, unlike Bloom
- Here the statistical adjustments worked when computed as propensity scores (a sketch of the stratification step follows)
- However, there is big overlap in experimental and non-experimental scores due to the first-stage random assignment, making propensity scores more valid
- Extensive, unusually valid measurement of a relatively simple, though not homogeneous, selection process
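As a sketch of the adjustment step that worked here, the code below estimates propensity scores by logistic regression and stratifies on score quintiles, averaging within-stratum treated-control differences. The single observed selection covariate and the true effect of 1.5 are illustrative assumptions, not the study's measures.

```python
# Propensity-score adjustment sketch: model selection into treatment
# from an observed covariate, stratify on the estimated score, and pool
# within-stratum treated-control differences.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)                          # observed selection variable
treated = (x + rng.normal(size=n)) > 0          # self-selection on x
y = 2.0 + 1.0 * x + 1.5 * treated + rng.normal(size=n)  # true effect 1.5

ps = LogisticRegression().fit(x.reshape(-1, 1), treated).predict_proba(
    x.reshape(-1, 1))[:, 1]

# Stratify on propensity-score quintiles; weight strata by size.
edges = np.quantile(ps, [0.2, 0.4, 0.6, 0.8])
stratum = np.digitize(ps, edges)
effects, weights = [], []
for s in range(5):
    m = stratum == s
    if treated[m].any() and (~treated)[m].any():
        effects.append(y[m & treated].mean() - y[m & ~treated].mean())
        weights.append(m.sum())

print(f"naive: {y[treated].mean() - y[~treated].mean():.2f}, "
      f"PS-stratified: {np.average(effects, weights=weights):.2f} (truth 1.5)")
```

The adjustment succeeds here because selection runs entirely through the measured covariate, mirroring the slide's point that unusually valid measurement of the selection process is what made the propensity scores work.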
23. Limitations of Shadish et al.
- What about more complex settings?
- What about more complex selection processes?
- What about OLS analyses?
- Now let us examine a study without these limitations, one that does not set out to be an analog experiment but takes place in the real world
24. Aiken et al. (1998) Revisited
- The experiment: remember that the sample was selected on a narrow range of test-score values
- Quasi-experiment: sample selection limited to students who registered late or could not be found in summer, but who scored in the same range as the experiment
- No differences between experiment and non-experiment on test scores or pretest writing measures
- Measurement identical in experiment and non-experiment
25. Results for Aiken et al.
- Standardized writing test: .59 and .57, both significant
- Rated essay: .06 and .16, both non-significant
- High degree of comparability in statistical test results and effect size estimates
26. Implications of Aiken et al.
- Like Bloom et al., careful selection of the sample achieves close correspondence on important observables
- Little need for statistical adjustment; non-equivalence is limited to unobservables
- Statistical adjustment is minor compared to the use of sampling design to construct initial correspondence
27. Examining Poorer Within-Study Comparisons
- The bulk of the job training comparisons
- Two examples from education
28. Earliest Job Training Studies, Adding to the Critique of Smith & Todd
- Mode of assignment clearly varied, and we can assume the randomized experiments were implemented OK
- But third-variable irrelevancies were not controlled, especially location and measurement, given the dependence on matching from extant data sets
- Non-experiments differ more from the experiment prior to individual matching, creating a poor counterfactual and dependence on statistical adjustment rather than design
29. Recent Educational Examples
30. Agodini & Dynarski (2004)
- Drop-out prevention experiment, 16 middle/high schools
- Individual students, likely dropouts, assigned within schools (16 replicates)
- Quasi-experiment: students matched from 2 quite different sources, middle-school controls in another study and national NELS data
- Matching basically on individual and school demographic factors
- 4 outcomes examined, and hence 128 propensity scores (16 x 4 x 2), computed basically from demographic background variables
31. Results
- Balanced matches were obtained in only 29 of 128 cases
- Why was quality matching so rare? In the non-experiment the groups hardly overlap, since the treatment group spans high and middle schools but the comparisons are middle school only, or a very non-local national data set
- Mixed pattern of outcome correspondences in the 29 cases with computable propensity scores. Not good
- OLS did as well as propensity scores (a sketch of the balance check follows)
32. Critique
- Who would design a quasi-experiment this way? Is a mediocre non-experiment being compared to a good experiment?
- Alternative designs: 1. Regression-discontinuity. 2. Local comparison schools, with the same selection mechanism used to select similar comparison students. 3. Use of multi-year prior achievement data
33. Wilde & Hollister (2005)
- The experiment: reducing class size in 11 sites; no pretest used at the individual level
- Quasi-experimental design: individuals in reduced classes matched to individual cases from the other 10 sites
- Propensity scores mostly demographic
- Analysis treats each site as a separate experiment
- And so 11 replicates comparing an experimental and a non-experimental effect size
34. Results
- Low level of correspondence between experimental and non-experimental effect sizes across the 11 sites
- So for each site it makes a causal difference whether one runs the experiment or the quasi-experiment
- When aggregated across sites, results are closer: experiment .68, non-experiment 1.07
- But they still reliably differ
35. Critique
- Who would design a quasi-experiment on this topic without a pretest on the same scale as the outcome?
- Who would design it with these controls?
- Instead one would select controls from one or more schools matched on prior achievement history
- We have here a good experiment being compared to a bad quasi-experiment
- Who would treat this as 11 separate experiments rather than a more stable pooled experiment? Even in the authors' own analysis, the pooled results are much more congruent (a sketch of such pooling follows)
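The pooling at issue can be sketched as inverse-variance weighting of the site-level effect sizes, under which noisy per-site disagreements shrink toward a stable aggregate. The 11 site estimates below are simulated; they are not the Wilde & Hollister numbers.

```python
# Fixed-effect (inverse-variance) pooling of site-level effect sizes.
import numpy as np

rng = np.random.default_rng(7)
true_es = 0.8
site_es = true_es + rng.normal(scale=0.4, size=11)  # noisy site estimates
site_se = np.full(11, 0.4)                          # per-site standard errors

w = 1 / site_se**2                                  # inverse-variance weights
pooled = np.sum(w * site_es) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
print(f"site estimates range {site_es.min():.2f} to {site_es.max():.2f}; "
      f"pooled {pooled:.2f} (SE {pooled_se:.2f})")
```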
36. The Hypothesis Is That...
- The job training and educational examples that produce conclusions different from the experiment are examples of poor quasi-experimental design
- To compare a good experiment to a poor quasi-experiment is to confound a design type with the quality of its implementation, a logical fallacy
- But I reach this conclusion ex post facto and knowing the randomized experimental results in advance
37. Big Conclusions
- R-D has given results not much different from the experiment in three of three cases
- Abbreviated ITS with local (within-city) controls has given the same result, though only in one case
- Simpler quasi-experiments tend to give the same results as the experiment if (a) there is population matching in the sampling design (Bloom and Aiken studies), or (b) there is careful conceptualization and measurement of the selection model, as in Shadish et al.
38. Even Bigger Conclusions
- We now have an existence proof that non-experiments can give the same answer as experiments
- Loose rhetoric about the failure of non-experiments is not warranted
- Government agencies can implement some kinds of non-experiment with reasonable reassurance of valid results
- But this is not the case with the most common design, involving pre-post, non-equivalent groups and propensity scores
39. What I Am Not Concluding
- That a well-designed quasi-experiment is as good as an experiment. The two differ in:
- Number and transparency of assumptions
- Statistical power
- Knowledge of implementation
- Social and political acceptance
- If you have the option, do an experiment
- Never forget: you can rarely put right by statistics what you have messed up by design
42. Shadish, Luellen & Clark (2006) (figure)
43. Shadish, Luellen & Clark (2006) (figure, continued)
44. Results: Aiken et al.
- Pretest values on SAT/CAT and 2 writing measures
- Measurement framework the same
- Pretest ACTs and writing: ns, experiment vs. non-experiment
- OLS tests
- Results for the writing test: .59 and .57, significant
- Results for the essay: .06 and .16, ns
45. Bloom et al. Revisited
- Analysis at the individual level
- Within city, within welfare-to-work center, same measurement design
- Absolute bias: yes
- Average bias: none across the 5 within-state sites, even without statistical tests (a sketch of the distinction follows)
- Average bias limited to the small site and the non-within-city site, Detroit vs. Grand Rapids
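The absolute-versus-average distinction is simple arithmetic, sketched below: per-site bias (non-experimental minus experimental estimate) can be sizable in absolute value yet average out to about zero across sites. The five site values are invented for illustration.

```python
# Absolute vs. average bias across sites. Bias at each site is the
# non-experimental estimate minus the experimental benchmark; the five
# values here are illustrative, not the Bloom et al. numbers.
import numpy as np

site_bias = np.array([0.30, -0.25, 0.20, -0.30, 0.05])
print(f"mean absolute bias: {np.abs(site_bias).mean():.2f}")  # clearly > 0
print(f"average bias:       {site_bias.mean():.2f}")          # near zero
```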
46. Correspondence Criteria
- Random error, so no exact agreement expected
- Shared statistical significance pattern relative to zero: 68
- The two effect sizes not statistically different
- Estimates of comparable magnitude
- One as a percentage of the other
- Indulgence, common sense, and a mix of criteria
47. Our Research Issues
- Deconstructing the non-experiment: do experimental and non-experimental effect sizes correspond differently for R-D, for ITS, and for simple non-equivalent designs?
- How far can we generalize results about the invalidity of non-experiments beyond job training?
- Do these within-study comparison studies bear the weight ascribed to them in evaluation policy at DoL and IES?