1
Using Experiments as a Causal Gold Standard to
show that Experiments are not Unique as a Causal
Gold Standard
  • Thomas D. Cook, William R. Shadish, Jr., and
    Vivian C. Wong

2
General Purposes
  • Presentation on design choice in evaluation
    policy
  • Prompted by growing advocacy in USA of randomized
    experiments as the preferred method for causal
    inference, as the gold standard
  • Implications of talk are broader than evaluation
    since cause is omnipresent in theoretical and
    applied social sciences in all disciplines
  • We seek to empirically examine whether certain
    kinds of non-experiment produce similar causal
    estimates to experiments.
  • To do this we turn to within-study comparison
    studies

3
What is a Within-Study Comparison?
  • The causal estimate from an experiment is
    compared to the causal estimate from a
    non-experiment that shares the same intervention
    group
  • The contrast is between a randomly formed control
    group and a systematically formed comparison
    group, whether formed by self-selection,
    administrator decision, or both (see the sketch
    below)
  • Past Question 1: Does the non-random comparison
    lead to the same effect size after adjustments
    for selection--OLS, Heckman-type selection
    models, propensity scores?
  • Past Question 2: Under what conditions does one
    get a closer correspondence between the two kinds
    of estimate?
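The core logic can be shown in a few lines of Python. This is a minimal sketch with simulated data; the effect size, group sizes, and selection model are invented for illustration and are not taken from any study reviewed here.

```python
# Minimal sketch of a within-study comparison: one shared treatment group,
# two control groups (one randomized, one self-selected), simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# True model: outcome depends on ability; treatment adds 0.5.
ability = rng.normal(0, 1, n)
treated_y = 0.5 + ability + rng.normal(0, 1, n)        # shared treatment group

# Randomized control group: drawn from the same population.
rand_control_y = rng.normal(0, 1, n) + rng.normal(0, 1, n)

# Nonrandom comparison group: lower-ability people self-select out.
nonrand_control_y = rng.normal(-0.5, 1, n) + rng.normal(0, 1, n)

exp_estimate = treated_y.mean() - rand_control_y.mean()       # ~0.5, unbiased
nonexp_estimate = treated_y.mean() - nonrand_control_y.mean() # selection-biased

print(f"experimental estimate:     {exp_estimate:.2f}")
print(f"non-experimental estimate: {nonexp_estimate:.2f}")
print(f"bias of non-experiment:    {nonexp_estimate - exp_estimate:.2f}")
```

Because the treatment group is shared, any divergence between the two estimates must come from how the control group was formed.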

4
Brief History of Literature on Within-Study
Comparisons
  • LaLonde (1986); Fraker and Maynard (1987)
  • 12 subsequent studies in job training
  • Reviews, meta-analytic and interpretative, of
    this job training literature
  • Extension to examples in education in the USA and
    social welfare in Mexico, not yet reviewed

5
Conclusions from Reviews in Job Training
  • 1. All studies claim an unacceptable level of
    correspondence between the experimental and
    non-experimental effect sizes--Dehejia and Wahba
    is the sole exception, but it drew lively
    criticism from Smith and Todd
  • 2. Some procedures give better approximations
    than others
  • e.g., local matches; outcomes measured the same
    way in the experiment and non-experiment; OLS or
    propensity scores rather than instrumental
    variable approaches, including Heckman-type
    models

6
Policy Consequences
  • Department of Labor, as early as 1985
  • Health and Human Services, job training and
    beyond
  • National Academy of Sciences
  • Institute of Education Sciences
  • Do within-study comparisons deserve all this?

7
Criteria of Good Within-Study Comparison Design
  • 1. Variation in mode of assignment--random or not
  • 2. No third variables correlated with both
    assignment and outcome--e.g., measurement
  • 3. Randomized experiment properly executed
  • 4. Quasi-experiment is a good instance of its
    type
  • 5. Both design types estimate the same causal
    quantity--e.g., LATE in regression-discontinuity
  • 6. Acceptable criteria of correspondence between
    design types--effect sizes seem similar; do not
    formally differ; statistical significance
    patterns do not differ; etc.

8
Outline of Talk
  • To deconstruct the non-experiment, we will
    compare experimental estimates to
  • 1. Regression-discontinuity estimates
  • 2. Interrupted time-series estimates with control
    series
  • 3. Estimates from difference-in-differences
    (fixed effects) designs
  • We ask: Is the general conclusion about the
    inadequacy of non-experiments true across at
    least these three different kinds of
    non-experiment?

9
Experiments vs. Regression-Discontinuity Design
Studies
10
Three Known Within-Study Comparisons of Exp. and
R-D
  • Aiken, West et al. (1998): R-D study vs.
    experiment; LATE analysis; results
  • Buddelmeyer and Skoufias (2003): R-D study vs.
    experiment; LATE analysis; results
  • Black, Galdo and Smith (2005): R-D study vs.
    experiment; LATE analysis; results (a sketch of
    the R-D estimator follows below)
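A minimal sketch of the sharp regression-discontinuity estimator these comparisons rely on, in Python with simulated data; the cutoff, bandwidth, and effect size below are invented for illustration.

```python
# Sketch of a sharp regression-discontinuity estimate: fit a line on each
# side of the cutoff and take the difference in intercepts at the cutoff.
import numpy as np

rng = np.random.default_rng(1)
n, cutoff, effect = 2000, 0.0, 0.5

score = rng.uniform(-1, 1, n)              # assignment variable
treated = (score >= cutoff).astype(float)  # sharp rule: treat at/above cutoff
y = 1.0 + 0.8 * score + effect * treated + rng.normal(0, 0.5, n)

def fit_line(x, y_):
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y_, rcond=None)
    return beta  # [intercept, slope]

h = 0.5  # bandwidth: use only observations near the cutoff
left = (score < cutoff) & (score > cutoff - h)
right = (score >= cutoff) & (score < cutoff + h)
b_left = fit_line(score[left] - cutoff, y[left])
b_right = fit_line(score[right] - cutoff, y[right])

rd_estimate = b_right[0] - b_left[0]  # jump at the cutoff
print(f"RD estimate of the effect at the cutoff: {rd_estimate:.2f}")
```

Because the fit is local to the cutoff, the estimate is a LATE at the cutoff, which is why criterion 5 above insists that both designs estimate the same causal quantity.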

11
Comments on R-D vs. Exp.
  • Cumulative correspondence over three cases
  • Is this theoretically trivial, though?
  • Is it pragmatically significant, given variation
    in implementation in both the experiment and R-D?
  • As an existence proof, it belies the
    over-generalized argument that non-experiments
    don't work
  • Emboldens us to deconstruct the non-experiment
    further

12
Experiment vs. Interrupted Time Series
  • Only one case of an experiment deliberately
    compared to an ITS with a no-treatment control
    series
  • Bloom et al. (2002, 2005)--job training is the
    topic
  • Experiment: 11 sites, 8 pre-intervention earnings
    waves and 20 post
  • Non-experiment: 5 within-state comparisons, 4 of
    them within-city; all comparison subjects were
    enrolled in welfare
  • We present only the control/comparison contrast
    because the treatment time series is common to
    both designs (see the sketch below)
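A minimal sketch of the comparative interrupted time-series logic in Python; the series lengths mirror Bloom et al.'s 8 pre and 20 post waves, but all values are simulated for illustration.

```python
# Sketch of an interrupted time series with a control series: the effect is
# the post-intervention level shift in the treated series minus the shift
# in the control series, so a shared trend differences out.
import numpy as np

rng = np.random.default_rng(2)
pre, post, effect = 8, 20, 2.0        # 8 pre waves, 20 post, as in Bloom et al.
t = np.arange(pre + post)
after = t >= pre                      # boolean: post-intervention waves

common_trend = 0.1 * t                # shared drift affecting both series
treated = 5 + common_trend + effect * after + rng.normal(0, 0.3, pre + post)
control = 4 + common_trend + rng.normal(0, 0.3, pre + post)

shift_treated = treated[after].mean() - treated[~after].mean()
shift_control = control[after].mean() - control[~after].mean()
print(f"ITS estimate: {shift_treated - shift_control:.2f}")  # ~2.0
```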

13
The Issues Are
  • Is there an overall difference between control
    groups randomly or non-randomly formed?
  • If yes, can statistical controls--OLS, IV
    (including Heckman models), propensity scores,
    random growth models--eliminate this difference?
    Ten modes were tested, but only one was
    longitudinal
  • Is there a special difference around the
    intervention point, given that stable pretest
    group differences are not a problem for ITS?

14
Bloom et al. Results
15
Bloom et al. Results (continued)
16
Implications of Bloom et al.
  • Averaging across the 4 within-city sites showed
    no difference--also true if the 5th, between-city
    site is added
  • Selecting local (within-city) comparisons
    obviated the need for statistical adjustments for
    non-equivalence--design alone did it
  • Bloom et al. tested differential effects of
    statistical adjustments in between-state
    comparisons, where there were large differences
  • None worked, and none did better than OLS

17
Non-Equivalent Control Group Design with Pretest
  • The most frequent non-experimental design by far,
    across many fields of study
  • Also modal in within-study comparisons in job
    training, and so it provides the major basis for
    the past opinion that non-experiments are
    routinely biased (see the
    difference-in-differences sketch below)
  • We walk through some exemplars varying in how
    well they meet the six criteria for a good
    within-study comparison, from better to worse...
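A minimal sketch of this design analyzed as a difference-in-differences, in Python with simulated data; it shows why a stable pretest difference between non-equivalent groups does not by itself bias the estimate--only differential change over time would. All numbers are invented.

```python
# Sketch of the pretest-posttest non-equivalent control group design
# analyzed as a difference-in-differences: the stable pretest gap between
# groups drops out of the estimate.
import numpy as np

rng = np.random.default_rng(3)
n, effect = 500, 1.0

# Non-equivalent groups: the treated group starts higher on the outcome.
treat_pre = rng.normal(10.0, 2.0, n)
comp_pre = rng.normal(8.0, 2.0, n)

# Both groups share a common change over time (+0.5); treatment adds `effect`.
treat_post = treat_pre + 0.5 + effect + rng.normal(0, 1, n)
comp_post = comp_pre + 0.5 + rng.normal(0, 1, n)

did = (treat_post.mean() - treat_pre.mean()) \
      - (comp_post.mean() - comp_pre.mean())
print(f"difference-in-differences estimate: {did:.2f}")  # ~1.0
```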

18
Questions Are
  • 1. Is the past pessimistic conclusion related to
    the quality of the within-study comparison?
  • 2. Can we identify ex post facto the conditions
    under which this design gets the same or a
    different answer from a randomized experiment?
  • 1st clue, from Bloom et al.: when the comparison
    group is very local, the treatment and comparison
    groups may not even differ on major observables
  • 2nd clue, from theory: a complete model of
    selection or of the outcome will work

19
Figure 1: Design of Shadish et al. (2006)
  • N = 445 undergraduate psychology students take
    pretests and are then randomly assigned to one of
    two arms
  • Randomized experiment (n = 235): randomly
    assigned to mathematics training (n = 119) or
    vocabulary training (n = 116)
  • Nonrandomized experiment (n = 210): self-selected
    into mathematics training (n = 79) or vocabulary
    training (n = 131)
  • All participants measured on both mathematics and
    vocabulary outcomes
20
What's Special in Shadish et al.
  • Variation in mode of assignment
  • Holds constant most other factors through the
    first random assignment--population, measures,
    activity patterns
  • Good experiment? Pretests; short duration and
    little attrition; no chance for contamination
  • Good quasi-experiment? Selection process; quality
    of measurement; analysis and the role of
    Rosenbaum

21
Results: Shadish et al.
22
Implications of Shadish et al.
  • Here the sampling design produced non-equivalent
    groups on observables, unlike Bloom
  • Here the statistical adjustments worked when
    computed as propensity scores (see the sketch
    below)
  • However, there is big overlap in experimental and
    non-experimental scores due to the first-stage
    random assignment, making propensity scores more
    valid
  • Extensive, unusually valid measurement of a
    relatively simple, though not homogeneous,
    selection process
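A minimal sketch of a propensity-score adjustment by inverse-probability weighting, in Python with simulated data. It assumes scikit-learn is available, and the adjustment succeeds here only because selection is driven entirely by a measured covariate--echoing the slide's point about measuring the selection process.

```python
# Sketch of a propensity-score adjustment: model selection into treatment
# from an observed covariate, then reweight (inverse-probability weighting)
# to remove the covariate imbalance.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n, effect = 2000, 0.5

# Self-selection: people with higher measured ability choose treatment more.
ability = rng.normal(0, 1, n)
z = rng.binomial(1, 1 / (1 + np.exp(-1.5 * ability)))  # 1 = chose treatment
y = effect * z + ability + rng.normal(0, 1, n)

naive = y[z == 1].mean() - y[z == 0].mean()  # biased upward by selection

# Propensity scores estimated from the observed covariate.
X = ability.reshape(-1, 1)
ps = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]
w = np.where(z == 1, 1 / ps, 1 / (1 - ps))   # IPW weights
ipw = (np.average(y[z == 1], weights=w[z == 1])
       - np.average(y[z == 0], weights=w[z == 0]))

print(f"naive estimate: {naive:.2f}   IPW estimate: {ipw:.2f}")  # IPW ~0.5
```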

23
Limitations to Shadish et al.
  • What about more complex settings?
  • What about more complex selection processes?
  • What about OLS analyses?
  • Now let us examine a study without these
    limitations, one that does not set out to be an
    analog of an experiment but operates in the real
    world

24
Aiken et al. (1998) Revisited
  • The experiment: remember that the sample was
    selected from a narrow range of test-score values
  • Quasi-experiment: sample selection limited to
    students who registered late or could not be
    found in summer, but who scored in the same range
    as the experiment
  • No differences between experiment and
    non-experiment on test scores or pretest writing
    measures
  • Measurement identical in experiment and
    non-experiment

25
Results for Aiken et al.
  • Writing standardized test: .59 and .57, both
    significant
  • Rated essay: .06 and .16, both non-significant
  • High degree of comparability in statistical test
    results and effect size estimates

26
Implications of Aiken et al.
  • Like Bloom et al., careful selection of the
    sample achieves close correspondence on important
    observables
  • Little need for statistical adjustment;
    non-equivalence is limited to unobservables
  • Statistical adjustment is minor compared to the
    use of sampling design to construct initial
    correspondence

27
Examine Poorer Within-Study Comparison Studies
  • The Bulk of the Job Training Comparisons
  • Two Examples from Education

28
Earliest Job Training Studies, Adding to the
Critique by Smith and Todd
  • Mode of assignment clearly varied, and we can
    assume the randomized experiments were
    implemented adequately
  • But third-variable irrelevancies were not
    controlled, especially location and measurement,
    given the dependence on matching from extant data
    sets
  • Non-experiments had larger differences from the
    experiment prior to individual matching, creating
    a poor counterfactual and dependence on
    statistical adjustment rather than design

29
Recent Educational Examples

30
Agodini and Dynarski (2004)
  • Drop-out prevention experiment, 16 middle/high
    schools
  • Individual students, likely dropouts, assigned
    within schools: 16 replicates
  • Quasi-experiment: students matched from 2 quite
    different sources--middle school controls in
    another study, and national NELS data
  • Matching basically on individual and school
    demographic factors
  • 4 outcomes examined, and hence
  • 128 propensity-score analyses (16 x 4 x 2),
    computed basically from demographic background
    variables

31
Results
  • In only 29 of 128 cases were balanced matches
    obtained
  • Why was quality matching so rare? In the
    non-experiment, the groups hardly overlap, since
    the treatment group spans high and middle schools
    but the comparisons are middle school only or a
    very non-local national data set
  • Mixed pattern of outcome correspondences in the
    29 cases with computable propensity scores. Not
    good
  • OLS did as well as propensity scores

32
Critique
  • Who would design a quasi-experiment this way? Is
    a mediocre non-experiment being compared to a
    good experiment?
  • Alternative designs: 1. Regression-discontinuity.
    2. Local comparison schools, with the same
    selection mechanism used to select similar
    comparison students. 3. Use of multi-year prior
    achievement data

33
Wilde and Hollister (2005)
  • The experiment: reducing class size in 11 sites;
    no pretest used at the individual level
  • Quasi-experimental design: individuals in reduced
    classes matched to individual cases from the
    other 10 sites
  • Propensity scores mostly demographic
  • Analysis: treat each site as a separate
    experiment
  • And so 11 replicates comparing an experimental
    and a non-experimental effect size

34
Results
  • Low level of correspondence between experimental
    and non-experimental effect sizes across the 11
    sites
  • So for each site it makes a causal difference
    whether one runs the experiment or the
    quasi-experiment
  • When aggregated across sites, results are closer:
    exp. .68, non-exp. 1.07
  • But they still reliably differ

35
Critique
  • Who would design a quasi-experiment on this topic
    without a pretest on the same scale as the
    outcome?
  • Who would design it with these controls?
  • Instead one would select controls from one or
    more schools matched on prior achievement history
  • We have here a good experiment being compared to
    a bad quasi-experiment
  • Who would treat this as 11 separate experiments
    rather than as a more stable pooled experiment?
    Even for the authors, the pooled results are much
    more congruent (see the pooling sketch below)
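A minimal sketch of inverse-variance pooling in Python, with simulated per-site effect sizes and standard errors; it illustrates why one pooled estimate is more stable than 11 site-by-site comparisons. All values are invented.

```python
# Sketch of fixed-effect (inverse-variance) pooling of per-site effect
# sizes: noisy site-level estimates are combined into one pooled estimate
# whose standard error is smaller than any single site's.
import numpy as np

rng = np.random.default_rng(5)
true_effect, n_sites = 0.8, 11

se = rng.uniform(0.2, 0.5, n_sites)   # per-site standard errors
es = true_effect + rng.normal(0, se)  # noisy per-site effect sizes

w = 1 / se**2                         # inverse-variance weights
pooled = np.sum(w * es) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))

print("per-site effect sizes:", np.round(es, 2))  # these scatter widely
print(f"pooled estimate: {pooled:.2f} (SE {pooled_se:.2f})")
```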

36
Hypothesis Is That...
  • The job training and educational examples that
    produce different conclusions from the experiment
    are examples of poor quasi-experimental design
  • To compare a good experiment to a poor
    quasi-experiment is to confound design type with
    the quality of its implementation--a logical
    fallacy
  • But I reach this conclusion ex post facto and
    knowing the randomized experimental results in
    advance

37
Big Conclusions
  • R-D has given results not much different from
    the experiment in three of three cases
  • Abbreviated ITS with local (within-city)
    controls has given the same result, though only
    in one case
  • Simpler quasi-experiments tend to give the same
    results as the experiment if (a) populations are
    matched in the sampling design--the Bloom and
    Aiken studies--or if (b) the selection model is
    carefully conceptualized and measured, as in
    Shadish et al.

38
Even Bigger Conclusions
  • We now have an existence proof that
    non-experiments can give the same answer as
    experiments
  • Loose rhetoric about the failure of
    non-experiments is not warranted
  • Government agencies can implement some kinds of
    non-experiment with reasonable assurance of
    valid results
  • But this is not the case with the most common
    design, involving pre-post, non-equivalent groups
    and propensity scores

39
What I Am Not Concluding
  • That a well-designed quasi-experiment is as good
    as an experiment. They differ in
  • Number and transparency of assumptions
  • Statistical power
  • Knowledge of implementation
  • Social and political acceptance
  • If you have the option, do an experiment
  • Never forget: you can rarely put right by
    statistics what you have messed up by design

42
Shadish, Luellen and Clark (2006)
43
Shadish, Luellen and Clark (2006)
44
Results--Aiken et al.
  • Pretest values on SAT/CAT and 2 writing measures
  • Measurement framework the same
  • Pretest ACTs and writing: ns, experiment vs.
    non-experiment
  • OLS tests
  • Results for writing test: .59 and .57,
    significant
  • Results for essay: .06 and .16, ns

45
Bloom et al. Revisited
  • Analysis at the individual level
  • Within city, within welfare-to-work center, same
    measurement design
  • Absolute bias: yes
  • Average bias: none across the 5 within-state
    sites, even without statistical tests
  • Average bias limited to the small site and the
    non-within-city comparison--Detroit vs. Grand
    Rapids

46
Correspondence Criteria
  • Random error means no exact agreement
  • Shared statistical significance pattern relative
    to zero - 68
  • Two effect sizes not statistically different (see
    the sketch below)
  • Comparable magnitude of estimates
  • One as a percent of the other
  • Indulgence, common sense, and a mix
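A minimal sketch of the third criterion in Python. The effect sizes echo the Aiken et al. writing-test results, but the standard errors are hypothetical, and treating the two estimates as independent is a simplification that ignores their shared treatment group.

```python
# Sketch of one correspondence criterion: do two effect sizes differ
# statistically? Two-sided z-test that treats the estimates as independent.
import math

def estimates_differ(es_a, se_a, es_b, se_b, z_crit=1.96):
    """Return (differ?, z) for two effect sizes with standard errors."""
    z = (es_a - es_b) / math.sqrt(se_a**2 + se_b**2)
    return abs(z) > z_crit, z  # 1.96 = two-sided 5% critical value

# Effect sizes from the Aiken et al. writing test; the SEs are hypothetical.
differ, z = estimates_differ(0.59, 0.10, 0.57, 0.12)
print(f"z = {z:.2f}; estimates statistically differ: {differ}")
```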

47
Our Research Issues
  • Deconstructing the non-experiment--do
    experimental and non-experimental effect sizes
    correspond differently for R-D, for ITS, and for
    simple non-equivalent designs?
  • How far can we generalize results about the
    invalidity of non-experiments beyond job
    training?
  • Do these within-study comparisons bear the
    weight ascribed to them in evaluation policy at
    DoL and IES?
