Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments

Learn more at: https://ies.ed.gov

1
Using State Tests to Measure Student Achievement
in Large-Scale Randomized Experiments
An Empirical Assessment Based on Four Recent Evaluations
Marie-Andrée Somers (Presenter), Pei Zhu, Edmond Wong
MDRC
  • IES Research Conference
  • June 28th, 2010

2
Two key concerns with using state tests in an evaluation
  • They may not be suitable for the evaluation
    • Validity concerns: They may not be aligned with
      the outcomes of interest (and so do not support
      valid inferences about program impacts)
    • Reliability concerns: They may be too difficult
      for low-performing students (unreliable)
  • Variation in the scale/content of state tests also
    complicates the task of combining impact findings
    across states and grades

3
About This Study
  • Funded by the Institute of Education Sciences (IES)
  • Purpose is to bring data to bear on several
    topics covered in the May et al. discussion paper
  • Are state tests suitable for evaluation purposes?
    • As a measure of the outcome(s) of interest?
    • As a measure of student achievement at baseline?
  • How should impacts on state tests be pooled?
    • Are impact findings sensitive to methods of
      rescaling and aggregating test scores across
      states and/or grades?

4
Overview of Analytical Approach
  • We identified 4 large-scale randomized
    experiments where achievement was measured using
    both (i) state tests AND (ii) a study test
  • The study test provides a benchmark for gauging
    the suitability of state tests
  • Two types of analyses
    • Impact analyses: We compared estimated impacts on
      state tests and on the "benchmark" study test
    • Descriptive analyses: We also examined published
      information on the characteristics/content of
      tests

5
Data and Samples
                     Study A                      Study B                   Study C                   Study D
Targeted Outcome     General Reading Achievement  General Math Achievement  Specific Reading Outcome  Specific Math Outcome
Level                Elementary                   Elementary                High School               Middle School
Sample for Analysis  1,032 (9 states)             944 (7 states)            1,065 (4 states)          4,387 (9 states)
  • Studies represent diversity with respect to grade
    levels and outcomes
  • Analysis sample includes students with a state
    test score and a study test score

6
Approach for Estimating Impacts
  • Impact on state tests
    • Rescaling: Scores are z-scored by state and
      grade using the sample mean and standard
      deviation
    • Pooling approach: Impacts by state and grade are
      aggregated using precision weighting
  • Impact on the study test
    • Rescaled/pooled using the same approach for
      comparability
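
The two steps above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the record keys (`state`, `grade`, `score`) and function names are hypothetical, the n-1 denominator for the standard deviation is an assumption, and "precision weighting" is taken to mean standard inverse-variance weighting of the block-level impact estimates.

```python
from collections import defaultdict
from statistics import mean, stdev


def z_score_by_block(records):
    """Standardize scores within each (state, grade) block.

    `records` is a list of dicts with the hypothetical keys
    'state', 'grade', and 'score'. Returns new dicts with a 'z' key.
    Each block needs at least two distinct scores.
    """
    blocks = defaultdict(list)
    for rec in records:
        blocks[(rec["state"], rec["grade"])].append(rec["score"])

    # Sample mean and sample (n-1) standard deviation for each block.
    stats = {key: (mean(vals), stdev(vals)) for key, vals in blocks.items()}

    out = []
    for rec in records:
        m, s = stats[(rec["state"], rec["grade"])]
        out.append({**rec, "z": (rec["score"] - m) / s})
    return out


def precision_weighted_impact(block_estimates):
    """Pool block-level impacts using inverse-variance weights.

    `block_estimates` is a list of (impact, standard_error) pairs,
    one per state-by-grade block. Returns (pooled_impact, pooled_se).
    """
    weights = [1.0 / se ** 2 for _, se in block_estimates]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, block_estimates)) / total
    pooled_se = (1.0 / total) ** 0.5
    return pooled, pooled_se
```

Blocks with smaller standard errors (typically larger samples) receive more weight, so a noisy estimate from a small state-by-grade cell cannot dominate the pooled result.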

7
Criteria for Assessing Suitability
  • Two dimensions of suitability
    • Validity: Whether the content of state tests is
      aligned with the outcomes of interest in the
      evaluation
    • Reliability: Whether state tests provide a
      reliable measure of achievement for the target
      population (in this case, low-performing students)
  • A key concern: State tests may have low reliability
    and may not yield valid inferences about program
    effectiveness

8
Criteria for Assessing Suitability
  • Implications for the impact findings
    • Poor Validity
      • Could fail to detect impacts on the outcome of
        interest (invalid inference about program
        effectiveness)
      • Affects the magnitude of the estimated impact
        on state tests
    • Low Reliability
      • Student achievement is estimated with greater
        error
      • Affects the standard error of the estimated
        impact on state tests
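
The link between reliability and the standard error can be illustrated with a small calculation. This sketch is not from the presentation: it assumes classical test theory (observed variance equals true-score variance divided by reliability) and an impact estimated as a simple difference in mean raw scores, and the function name is hypothetical.

```python
import math


def se_inflation_factor(reliability):
    """Factor by which the standard error of a raw-score impact estimate
    grows when the outcome has the given reliability, relative to an
    error-free measure of the same construct.

    Assumes classical test theory: observed variance = true variance /
    reliability, and the standard error of a difference in means is
    proportional to the outcome's standard deviation.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return 1.0 / math.sqrt(reliability)
```

Under these assumptions, a test with reliability 0.81 would give a standard error about 11 percent larger than a perfectly reliable test, because `se_inflation_factor(0.81)` is roughly 1.11.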

9
Criteria for Assessing Suitability
  • Reliability: Compare the standard error of the
    estimated impact on state tests vs. the study
    test
    • Smaller standard error is better (more
      precision)
  • Validity: Compare the magnitude of the impact
    estimates, in light of estimation error
    • Compare the statistical significance of the
      impact findings (i.e., conclusions about program
      effectiveness based on p-value)
    • If both estimates are statistically significant,
      then also compare their magnitudes
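
The comparison criteria above can be sketched in code. The function names, the normal approximation for the p-value, and the 0.05 significance threshold are all assumptions made for illustration; the slides do not state which test or threshold was used.

```python
import math


def two_sided_p(estimate, se):
    """Two-sided p-value for H0: impact = 0, using a normal approximation."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2.0))


def compare_tests(state, study, alpha=0.05):
    """Compare (impact, standard_error) pairs for the state and study tests.

    alpha is an assumed significance threshold, not one taken from the slides.
    """
    (b_state, se_state), (b_study, se_study) = state, study
    sig_state = two_sided_p(b_state, se_state) < alpha
    sig_study = two_sided_p(b_study, se_study) < alpha
    return {
        # Reliability criterion: a ratio above 1 means the state test is less precise.
        "se_ratio": se_state / se_study,
        # Validity criterion, step 1: do both tests lead to the same conclusion?
        "same_conclusion": sig_state == sig_study,
        # Validity criterion, step 2: magnitudes are compared only if both are significant.
        "compare_magnitudes": sig_state and sig_study,
    }
```

For example, `compare_tests((0.12, 0.055), (0.10, 0.05))` uses made-up estimates in which both impacts are significant, so the magnitudes would then be compared, and the state test's standard error is about 10 percent larger than the study test's.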

10
Criteria for Assessing Validity
  • The extent to which the magnitudes of the impact
    estimates are expected to differ depends on the
    outcome that state tests are intended to measure
  • Two types of intervention
    • Targeted outcome is general achievement (Studies
      A and B)
      • The outcome of interest is general achievement
        in math or reading
      • Both state tests and the study test measure the
        targeted outcome (general achievement)
      • If state tests are valid, then the impacts on
        the study test and state tests should be similar

11
Criteria for Assessing Validity
  • Two types of intervention (ctd.)
    • Targeted outcome is a specific skill (Studies C
      and D)
      • There are two outcomes of interest: the
        targeted skill (short-term) and general
        achievement (longer term)
      • Study test is used to measure the short-term
        outcome (specific skill), while state tests are
        used to measure the longer-term outcome (general
        achievement)
      • If state tests are valid, then the impact on
        state tests should be smaller than the impact on
        the study test
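
The expected patterns for the two types of intervention can be written as a simple check. This is only a sketch: the function name and the `tolerance` used to call two impacts "similar" are hypothetical, and the check looks at point estimates alone, whereas the presentation compares magnitudes in light of estimation error.

```python
def consistent_with_validity(targeted_outcome, impact_state, impact_study, tolerance=0.05):
    """Check the magnitude pattern expected if state tests are valid.

    targeted_outcome: "general" (as in Studies A and B) or
        "specific" (as in Studies C and D).
    impact_state, impact_study: effect sizes in standard deviation units.
    tolerance: hypothetical cutoff for calling two impacts "similar".
    """
    if targeted_outcome == "general":
        # Both tests measure the targeted outcome, so impacts should be similar.
        return abs(impact_state - impact_study) <= tolerance
    if targeted_outcome == "specific":
        # State tests measure the longer-term, more general outcome,
        # so their impact should be smaller than the study test's.
        return impact_state < impact_study
    raise ValueError("targeted_outcome must be 'general' or 'specific'")
```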

12
Benchmark Impact on the Study Test

13
P-Value and Magnitude (Validity)
Targeted Outcome is General Achievement
p = 0.119
p = 0.055

14
P-Value and Magnitude (Validity)
Targeted Outcome is General Achievement
p = 0.119
p = 0.189
p = 0.055
p = 0.229

15
P-Value and Magnitude (Validity)
Targeted Outcome is a Specific Skill
p = 0.002
p = 0.578

16
P-Value and Magnitude (Validity)
Targeted Outcome is a Specific Skill
p = 0.002
p = 0.007
p = 0.578

17
P-Value and Magnitude (Validity)
Targeted Outcome is a Specific Skill
p = 0.002
p = 0.007
p = 0.578
p = 0.219

18
Standard Errors (Reliability)

19
Standard Errors (Reliability)
State-Study Ratio: 1.20, 1.07, 1.04, 1.03

Conclusion
  • Findings suggest that state tests can be used as
    a complement to a study-administered test
    • State tests are suitable (valid and reliable) in
      3 of 4 studies
  • Whether state tests can be used as a substitute
    for a study test is an open question
    • Limited availability in some grades and subjects
      • Available for all states/grades in only 1 of 4
        studies
    • May not be able to use them to measure a specific
      targeted skill
    • Possibly less reliable
  • Findings from descriptive analysis lead to the
    same conclusions as the impact analysis

21
Questions?
  • Marie-Andrée Somers
    • marie-andree.somers@mdrc.org
  • Pei Zhu
    • pei.zhu@mdrc.org
  • Edmond Wong
    • edmond.wong@mdrc.org