Empirical Bayes DIF Assessment, Rebecca Zwick, UC Santa Barbara


1
Empirical Bayes DIF Assessment
Rebecca Zwick, UC Santa Barbara

Presented at Measured Progress
August 2007

2
Overview

Definition and causes of DIF
Assessing DIF via Mantel-Haenszel
EB enhancement to MH DIF (1994-2002, with D. Thayer & C. Lewis)
Model and Applications
Simulation findings
Discussion

3
What's differential item functioning?

DIF occurs when equally skilled members of 2
groups have different probabilities of answering
an item correctly.
(Only dichotomous items considered today)

4
IRT Definition of (absence of) DIF

Lord, 1980: $P(Y_i = 1 \mid \theta, R) = P(Y_i = 1 \mid \theta, F)$ means DIF is absent.
$P(Y_i = 1 \mid \theta, G)$ is the probability of correct
response to item i, given $\theta$, in group G,
G = F (Focal) or R (Reference).
$\theta$ is a latent ability variable, imperfectly
measured by test score S. (More later...)

5
Reasons for DIF

Construct-irrelevant difficulty (e.g., sports
content in a math item)
Differential interests or educational background
NAEP History items with DIF favoring Black
test-takers were about M. L. King, Harriet
Tubman, Underground Railroad (Zwick & Ercikan,
1989)
Often mystifying (e.g., X + 5 = 10 has DIF; Y +
8 = 11 doesn't)

6
Mini-history of DIF analysis

DIF research dates back to 1960s
In late 1980s (Golden Rule), testing companies
started including DIF analysis as a QC procedure.
Mantel-Haenszel (Holland & Thayer, 1988): method
of choice for operational DIF analyses
Few assumptions
No complex estimation procedures
Easy to explain

7
Mantel-Haenszel

Compare item performance for members of 2 groups,
after matching on total test score, S.
Suppose we have K levels of the score used for
matching test-takers: s_1, s_2, ..., s_K
In each of the K levels, data can be represented
as a 2 x 2 table (Right/Wrong by
Reference/Focal).

8
Mantel-Haenszel

For each table, compute conditional odds ratio:
(Odds of correct response | S = s_k, G = R) /
(Odds of correct response | S = s_k, G = F)
Weighted combination of these K values is MH odds
ratio, $\hat{\alpha}_{MH}$.
MH DIF statistic is $-2.35 \ln(\hat{\alpha}_{MH})$.
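The computation on this slide can be sketched in a few lines. This is a minimal illustration, assuming the usual Mantel-Haenszel weighting (each table's cross-products divided by its total count); the slide only says "weighted combination", so the exact weights are an assumption here. The function name `mh_d_dif` and the example counts are hypothetical.

```python
import math

def mh_d_dif(tables):
    """Return (MH odds ratio, MH DIF statistic) from K 2x2 tables.

    Each table is (a, b, c, d) for one score level:
    a = Reference right, b = Reference wrong,
    c = Focal right,     d = Focal wrong.
    Assumes the standard MH estimator: sum(a*d/T) / sum(b*c/T).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        total = a + b + c + d
        if total == 0:
            continue  # an empty score level carries no information
        num += a * d / total
        den += b * c / total
    alpha_mh = num / den
    return alpha_mh, -2.35 * math.log(alpha_mh)

# Hypothetical counts at three score levels (not data from the talk).
tables = [(20, 20, 10, 30), (30, 10, 20, 20), (35, 5, 30, 10)]
alpha_mh, d_dif = mh_d_dif(tables)
print(f"MH odds ratio = {alpha_mh:.3f}, MH DIF = {d_dif:.3f}")
```

An odds ratio above 1 means the Reference group has higher odds of a correct response at matched score levels, so the MH DIF statistic is negative.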

9
Mantel-Haenszel

The MH chi-square tests the hypothesis
$H_0: \alpha_k = \alpha = 1,\ k = 1, 2, \ldots, K$ versus
$H_1: \alpha_k = \alpha \neq 1,\ k = 1, 2, \ldots, K$,
where $\alpha_k$ is the population odds ratio at score
level k.
(Above H0 is similar, but not, in general,
identical to the IRT H0; see Zwick, 1990, Journal
of Educational Statistics.)
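A rough sketch of the MH chi-square test follows, assuming the commonly used continuity-corrected form with one degree of freedom. The slide does not spell out the formula, so the expression, the function names, and the example counts are assumptions for illustration.

```python
import math

def mh_chi_square(tables):
    """Continuity-corrected MH chi-square from K 2x2 tables.

    Each table is (a, b, c, d): Reference right/wrong, Focal right/wrong.
    Compares the observed number of Reference-right responses with its
    expectation under H0 (odds ratio of 1 at every score level).
    """
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        n_ref, n_foc = a + b, c + d
        right, wrong = a + c, b + d
        total = n_ref + n_foc
        if total < 2:
            continue  # variance is undefined for such a table
        sum_a += a
        sum_e += n_ref * right / total
        sum_v += n_ref * n_foc * right * wrong / (total ** 2 * (total - 1))
    # Flooring at zero is a choice made here so that tiny differences
    # do not produce a spurious positive statistic.
    corrected = max(abs(sum_a - sum_e) - 0.5, 0.0)
    return corrected ** 2 / sum_v

def chi2_1df_p_value(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical counts at three score levels (not data from the talk).
tables = [(20, 20, 10, 30), (30, 10, 20, 20), (35, 5, 30, 10)]
chi2 = mh_chi_square(tables)
p_value = chi2_1df_p_value(chi2)
print(f"MH chi-square = {chi2:.2f}, p = {p_value:.4f}")
```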

10
Mantel-Haenszel

ETS: Size of DIF estimate, plus chi-square
results, are used to categorize item
A: negligible DIF
B: slight to moderate DIF
C: substantial DIF
For B and C, + or - is used to indicate DIF
direction; - means DIF against focal group.
Designation determines item's fate.

11
Drawbacks to usual MH approach

May give impression that DIF status is
deterministic or is a fixed property of the item
Reviewers of DIF items often ignore SE
Is unstable in small samples, which may arise in
CAT settings

12
EB enhancement to MH

Provides more stable results
May allow variability of DIF findings to be
represented in a more intuitive way
Can be used in three ways
Substitute more stable point estimates for MH
Provide probabilistic perspective on true DIF
status (A, B, C) and future observed status
Loss-function-based DIF detection

13
Main Empirical Bayes DIF Work (supported by ETS
and LSAC)

An EB approach to MH DIF analysis (with Thayer &
Lewis). JEM, 1999. General approach,
probabilistic DIF
Using loss functions for DIF detection: An EB
approach (with Thayer & Lewis). JEBS, 2000.
Loss functions
The assessment of DIF in CATs. In van der Linden
& Glas (Eds.), CAT: Theory and Practice, 2000.
Review
Application of an EB enhancement of MH DIF
analysis to a CAT (with Thayer). APM, 2002.
Simulated CAT-LSAT

14
What's an Empirical Bayes Model? (See Casella
(1985), Am. Statistician)

In Bayesian statistics, we assume that parameters
have prior distributions that describe parameter
behavior.
Statistical theory or past research may inform
us about the nature of those distributions.
Combining observed data with the prior
distribution yields a posterior (after the
data) distribution that can be used to obtain
improved parameter estimates.
EB means the prior's parameters are estimated from
the data (unlike fully Bayes models).

15
EB DIF Model
16
EB DIF Model
17
EB DIF Model
18
EB DIF Model
19
EB DIF Model
20
(No Transcript)
21
Recall: EB DIF estimate is a weighted combination
of MH_i and prior mean.
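The model slides themselves have no transcript, so the following is only a sketch of how such a weighted combination can arise. It assumes a normal-normal model: each observed MH_i is normal around the item's true DIF with known variance SE_i^2, and true DIF values follow a normal prior whose mean and variance are estimated from the full set of items by simple moment estimates. The function name, the moment estimates, and the example values are assumptions, not a transcription of the talk.

```python
from statistics import mean, variance

def eb_dif(mh, se):
    """Empirical Bayes shrinkage of MH DIF statistics (normal-normal sketch).

    mh: observed MH DIF statistics, one per item
    se: their standard errors
    Returns (prior_mean, prior_var, weights, eb_estimates, posterior_vars).
    """
    s2 = [s ** 2 for s in se]
    prior_mean = mean(mh)
    # Assumed moment estimate: variance of the observed statistics minus
    # the average sampling variance, floored at zero.
    prior_var = max(variance(mh) - mean(s2), 0.0)
    weights = [prior_var / (prior_var + v) for v in s2]
    eb = [w * m + (1.0 - w) * prior_mean for w, m in zip(weights, mh)]
    post_var = [w * v for w, v in zip(weights, s2)]
    return prior_mean, prior_var, weights, eb, post_var

# Hypothetical values for six items (not data from the talk).
mh = [4.7, -0.5, 0.3, 1.2, -1.1, 0.0]
se = [2.2, 0.4, 0.5, 0.6, 0.5, 0.4]
prior_mean, prior_var, weights, eb, post_var = eb_dif(mh, se)
for m, s, w, e in zip(mh, se, weights, eb):
    print(f"MH = {m:5.2f}  SE = {s:.1f}  weight on MH = {w:.2f}  EB = {e:5.2f}")
```

Items with large standard errors get a small weight on their own MH value and are pulled more strongly toward the prior mean, which is the source of both the added stability and the bias toward the prior mean.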
22
Next

Performance of EB DIF estimator
Probabilistic DIF idea

23
How does EB DIF estimator EB_i compare to MH_i?

Applied to real data, including GRE
Applied to simulated data, including simulated
CAT-LSAT (Zwick & Thayer, 2002)
Testlet CAT data simulated, including items with
varying amounts of DIF
EB and MH both used to estimate (known) True DIF
Performance compared using RMSR, variance, and
bias measures

24
Design of Simulated CAT

Pool: 30 five-item testlets (150 items total)
10 testlets at each of 3 difficulty levels
Item data generated via 3PL model
CAT algorithm was based on testlet scores
Examinees received 5 testlets (25 items)
Test score (used as DIF matching variable) was
expected true score on pool (Zwick, Thayer, &
Wingersky, 1994, APM)

25
Simulation Conditions Differed on Several Factors

Ability distribution
Always N(0,1) in Reference group
Focal group either N(0,1) or N(-1,1)
Initial sample size per group: 1000 or 3000
DIF: Absent or Present (in amounts that vary
across items)
600 replications for results shown today

26
Definition of True DIF for Simulation
Range of True DIF: -2.3 to 2.9; SD: 1.
27
Definition of Root Mean Square Residual
28
MSR = Variance + Squared Bias

MSR = RMSR^2
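The decomposition can be checked numerically. The sketch below assumes the usual definitions for one item across replications: the residual is the estimate minus the known true DIF, and the variance is taken around the mean estimate with divisor equal to the number of replications. The definition slide has no transcript, so these formulas, the function name, and the example values are assumptions.

```python
import math

def msr_decomposition(estimates, true_value):
    """Split the mean squared residual into variance and squared bias.

    estimates: DIF estimates for one item, one per replication
    true_value: the known true DIF of that item
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    msr = sum((e - true_value) ** 2 for e in estimates) / n
    var = sum((e - mean_est) ** 2 for e in estimates) / n
    squared_bias = (mean_est - true_value) ** 2
    return {
        "rmsr": math.sqrt(msr),
        "msr": msr,
        "variance": var,
        "squared_bias": squared_bias,
    }

# Hypothetical estimates from four replications, true DIF = 1.0.
result = msr_decomposition([0.4, 0.9, 0.1, 0.6], true_value=1.0)
print(result)
```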

29
RMSRs for No-DIF condition, Initial N = 1000;
Item Ns = 80 to 300
30
RMSRs - 50 hard items, DIF condition, Focal
N(-1,1); Focal Ns = 16 to 67, Reference Ns =
80 to 151
31
RMSRs for DIF condition, Focal N(-1,1); Initial
N = 1000; Item Ns = 16 to 307
32
Variance and Squared Bias for Same
Condition; Initial N = 1000; Item Ns = 16 to 307
33
Summary: Performance of EB DIF Estimator

RMSRs (and variances) are smaller for EB than for
MH, especially in (1) no-DIF case and
(2) very small-sample case.
EB estimates are more biased than MH; bias is
toward 0.
Above findings are consistent with theory.
Implications to be discussed.

34
External Applications/Elaborations of EB DIF
Point Estimation

Defense Dept: CAT-ASVAB (Krass & Segal, 1998)
ACT: Simulated multidimensional CAT data (Miller
& Fan, NCME, 1998)
ETS: Fully Bayes DIF model of Sinharay et al.
(NCME, 2007). Like EB, but parameters of
prior are determined using past data (see ZTL).
Also tried loss function approach.

35
Probabilistic DIF

In our model, posterior distribution is normal,
so is fully determined by mean and variance.
Can use posterior distribution to infer the
probability that DIF falls into each of the ETS
categories (C-, B-, A, B+, C+), each of which
corresponds to a particular DIF magnitude.
(Statistical significance plays no role
here.)
Can display graphically.
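As a sketch of this idea, the category probabilities are areas under the normal posterior between magnitude cutoffs. The cutoffs used below (|DIF| < 1 for A, 1 to 1.5 for B, 1.5 or more for C) are the magnitude thresholds commonly associated with the ETS categories; the slides do not state them, so they are an assumption, as are the function names. The posterior mean of .7 and SD of .8 are the values reported for the LSAT simulation item in this talk.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def category_probabilities(post_mean, post_sd, b_cut=1.0, c_cut=1.5):
    """Posterior probability that true DIF lies in each category.

    b_cut and c_cut are assumed magnitude thresholds, not taken from the slides.
    """
    cuts = [-c_cut, -b_cut, b_cut, c_cut]
    cdf = [0.0] + [normal_cdf(c, post_mean, post_sd) for c in cuts] + [1.0]
    labels = ["C-", "B-", "A", "B+", "C+"]
    return {label: cdf[i + 1] - cdf[i] for i, label in enumerate(labels)}

probs = category_probabilities(post_mean=0.7, post_sd=0.8)
for label, p in probs.items():
    print(f"{label}: {p:.3f}")
```

These five probabilities sum to 1, so they can be drawn directly as the slices of a pie chart.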

36
Probabilistic DIF status for an A item in LSAT
sim.
MH = 4.7, SE = 2.2, Identified Status = C+
Posterior Mean = EB_i = .7, Posterior SD = .8
N_R = 101, N_F = 23
37
Probabilistic DIF, continued

EB approach can be used to accumulate DIF
evidence across administrations.
Prior can be modified each time an item is given:
use former posterior distribution as new prior
(Zwick, Thayer, & Lewis, 1999).
Pie chart could then be modified to reflect new
evidence about an item's status.
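The accumulation step can be sketched as repeated normal-normal updating, where the posterior from one administration becomes the prior for the next. This assumes a normal prior and a normally distributed MH statistic with known standard error; the function name, the starting prior, and the MH values are hypothetical.

```python
def update_normal(prior_mean, prior_var, mh, se):
    """One normal-normal update: returns (posterior mean, posterior variance)."""
    s2 = se ** 2
    weight = prior_var / (prior_var + s2)  # weight placed on the new MH value
    post_mean = weight * mh + (1.0 - weight) * prior_mean
    post_var = weight * s2
    return post_mean, post_var

# Hypothetical starting prior and three administrations of one item.
post_mean, post_var = 0.0, 1.0
history = []
for mh, se in [(1.8, 1.0), (1.2, 0.8), (1.6, 0.9)]:
    post_mean, post_var = update_normal(post_mean, post_var, mh, se)
    history.append((post_mean, post_var))
    print(f"MH = {mh:.1f}, SE = {se:.1f} -> mean = {post_mean:.3f}, var = {post_var:.3f}")
```

The posterior variance shrinks with every administration, so the category probabilities become more concentrated as evidence accumulates.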

38
Predicting an Item's Future Status: The Posterior
Predictive Distribution

A variation on the above can be used to predict
future observed DIF status
Mean of posterior predictive distribution is same
as posterior mean, but variance is larger.
For details and an application to GRE items, see
Zwick, Thayer, & Lewis, 1999, JEM.
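A minimal sketch of why the predictive variance is larger, assuming a normal posterior and a future MH statistic that is normal around the true DIF with a known standard error: the predictive variance is then the posterior variance plus the future sampling variance. The future standard error of 2.2 is a hypothetical choice (the same as the observed SE of the LSAT simulation item), the 1.5 cutoff for category C is an assumed magnitude threshold, and the function names are illustrative.

```python
import math

def posterior_predictive(post_mean, post_sd, future_se):
    """Mean and SD of a future observed MH statistic (normal sketch)."""
    pred_var = post_sd ** 2 + future_se ** 2
    return post_mean, math.sqrt(pred_var)

def prob_abs_at_least(cut, mean, sd):
    """P(|X| >= cut) for X normal with the given mean and SD."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))
    return cdf(-cut) + (1.0 - cdf(cut))

pred_mean, pred_sd = posterior_predictive(0.7, 0.8, future_se=2.2)
p_true_c = prob_abs_at_least(1.5, 0.7, 0.8)            # true DIF of C magnitude
p_future_c = prob_abs_at_least(1.5, pred_mean, pred_sd)  # future observed value
print(f"predictive SD = {pred_sd:.3f}")
print(f"P(true |DIF| >= 1.5) = {p_true_c:.3f}")
print(f"P(future observed |MH| >= 1.5) = {p_future_c:.3f}")
```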

39
Discussion

EB point estimates have advantages over MH
counterparts
EB approach can be applied to non-MH DIF methods
Advisability of shrinkage estimation for DIF
needs to be considered
Reducing Type I error may yield more
interpretable results
Degree of shrinkage can be fine-tuned
Probabilistic DIF displays may have value in
conveying uncertainty of DIF results.