Title: Estimation of Item Response Models
1. Estimation of Item Response Models
- Mister Ibik
- Division of Psychology in Education
- Arizona State University
- EDP 691 Advanced Topics in Item Response Theory
2. Motivation and Objectives
- Why estimate?
- The distinguishing feature of IRT modeling, as compared to classical techniques, is the presence of parameters
- These parameters characterize and guide inference regarding entities of interest (i.e., examinees, items)
- We will think through
- Different estimation situations
- Alternative estimation techniques
- The logic and mathematics underpinning these techniques
- Various strengths and weaknesses
- What you will have
- A detailed introduction to principles and mathematics
- A resource to be revisited, and revisited, and revisited
3. Outline
- Some Necessary Mathematical Background
- Maximum Likelihood and Bayesian Theory
- Estimation of Person Parameters When Item Parameters Are Known
- ML
- MAP
- EAP
- Estimation of Item Parameters When Person Parameters Are Known
- ML
- Simultaneous Estimation of Item and Person Parameters
- JML
- CML
- MML
- Other Approaches
4. Background: Finding the Root of an Equation
- Newton-Raphson algorithm
- Finds the root of an equation
- Example: the function f(x) = x²
- Has a root (where f(x) = 0) at x = 0
5. Newton-Raphson
- Newton-Raphson takes a given point, x0, and systematically progresses to find the root of the equation
- Utilizes the slope of the function to find where the root may be
- The slope of the function is given by the derivative, denoted f'(x)
- f'(x) gives the slope of the straight line that is tangent to f(x) at x
- The tangent is the best linear prediction of how the function is changing
- Starting from x0, the best guess for the root is the point where the tangent line equals 0
- This occurs at x1 = x0 - f(x0) / f'(x0)
- So x1 is the next candidate point for the root
6. Newton-Raphson Updating (1)
x0 = 1.5, f(x0) = 2.25, f'(x0) = 3, so x1 = 0.75
7. Newton-Raphson Updating (2)
x1 = 0.75, f(x1) = 0.5625, f'(x1) = 1.5, so x2 = 0.375
8. Newton-Raphson Updating (3)
x2 = 0.375, f(x2) = 0.1406, f'(x2) = 0.75, so x3 = 0.1875
9. Newton-Raphson Updating (4)
x3 = 0.1875, f(x3) = 0.0352, f'(x3) = 0.375, so x4 = 0.0938
10. Newton-Raphson Example
Iteration   x        f(x)     f'(x)    f(x)/f'(x)   Next x
0           1.5000   2.2500   3.0000   0.7500       0.7500
1           0.7500   0.5625   1.5000   0.3750       0.3750
2           0.3750   0.1406   0.7500   0.1875       0.1875
3           0.1875   0.0352   0.3750   0.0938       0.0938
4           0.0938   0.0088   0.1875   0.0469       0.0469
5           0.0469   0.0022   0.0938   0.0234       0.0234
6           0.0234   0.0005   0.0469   0.0117       0.0117
7           0.0117   0.0001   0.0234   0.0059       0.0059
8           0.0059   0.0000   0.0117   0.0029       0.0029
9           0.0029   0.0000   0.0059   0.0015       0.0015
10          0.0015   0.0000   0.0029   0.0007       0.0007
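The iteration history above can be reproduced with a short Python sketch (not part of the original slides); the function, its derivative, and the start value are taken from the example:

```python
def newton_raphson(f, f_prime, x0, tol=1e-4, max_iter=50):
    """Find a root of f via the update x(n+1) = x(n) - f(x(n)) / f'(x(n))."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / f_prime(x)      # Newton-Raphson update
        if abs(x_new - x) < tol:           # stop when the change is arbitrarily small
            return x_new
        x = x_new
    return x

# The slides' example: f(x) = x^2 (root at x = 0), starting from x0 = 1.5
print(newton_raphson(lambda x: x**2, lambda x: 2 * x, x0=1.5))
```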
11. Newton-Raphson Summary
- An iterative algorithm for finding the root of an equation
- Takes a starting point and systematically progresses to find the root of the function
- Requires the derivative of the function
- Each successive point is given by x(n+1) = x(n) - f(x(n)) / f'(x(n))
- The process continues until we get arbitrarily close, as usually measured by the change in some function
12. Difficulties With Newton-Raphson
- Some functions have multiple roots
- Which root is found often depends on the start value
13. Difficulties With Newton-Raphson
- Numerical complications can arise
- When the derivative is relatively small in magnitude, the algorithm shoots off into outer space
14. Logic of Maximum Likelihood
- A general approach to parameter estimation
- The use of a model implies that the data may be sufficiently characterized by the features of the model, including the unknown parameters
- Parameters govern the data in the sense that the data depend on the parameters
- Given values of the parameters, we can calculate the (conditional) probability of the data
- P(Xij = 1 | θi, bj) = exp(θi - bj) / (1 + exp(θi - bj)) (see the sketch below)
- Maximum likelihood (ML) estimation asks: What are the values of the parameters that make the data most probable?
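As a quick illustration (my own sketch, not from the slides), the conditional probability above can be computed directly; the θ and b values are hypothetical:

```python
import math

def p_correct(theta, b):
    """P(X = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# hypothetical examinee (theta = 1.0) and item (b = 0.5)
print(p_correct(1.0, 0.5))   # about 0.62
```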
15. Example: Series of Bernoulli Variables With Unknown Probability
- Bernoulli variable: P(X = 1) = p
- The probability of the data is given by p^X (1 - p)^(1 - X)
- Suppose we have two random variables, X1 and X2
- When taken as a function of the parameters, the probability of the data is called the likelihood
- Suppose X1 = 1, X2 = 0
- P(X1 = 1, X2 = 0 | p) = L(p | X1 = 1, X2 = 0) = p(1 - p)
- Choose p to maximize the conditional probability of the data (a grid search is sketched below)
- For p = 0.1, L = 0.1 × (1 - 0.1) = 0.09
- For p = 0.2, L = 0.2 × (1 - 0.2) = 0.16
- For p = 0.3, L = 0.3 × (1 - 0.3) = 0.21
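A quick Python check of the same likelihood over a grid of candidate values for p (an illustration added here, not part of the slides):

```python
# L(p | X1 = 1, X2 = 0) = p * (1 - p), evaluated over a grid of candidate p values
grid = [i / 100 for i in range(1, 100)]
likelihood = [p * (1 - p) for p in grid]
best_L, best_p = max(zip(likelihood, grid))
print(best_L, best_p)   # 0.25 at p = 0.5: the value that makes the data most probable
```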
16. Example: Likelihood Function
17. The Likelihood Function in IRT
- The likelihood may be thought of as the conditional probability of the data, where the data are known and the parameters vary
- Let Pij = P(Xij = 1 | θi, ξj), where ξj denotes the parameters of item j
- The likelihood is then L = Πi Πj Pij^Xij (1 - Pij)^(1 - Xij)
- The goal is to maximize this function: what values of the parameters yield the highest value?
18. Log-Likelihood Functions
- It is numerically easier to maximize the natural logarithm of the likelihood, lnL
- The log-likelihood has the same maximum as the likelihood
19. Maximizing the Log-Likelihood
- Note that at the maximum of the function, the slope of the tangent line equals 0
- The slope of the tangent is given by the first derivative
- If we can find the point at which the first derivative equals 0, we will have also found the point at which the function is maximized (see the sketch below)
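To connect this idea back to Newton-Raphson, here is a small sketch (my own illustration) that finds the root of the first derivative of lnL for the two-observation Bernoulli example; the start value p = 0.2 is arbitrary:

```python
# lnL(p) = ln(p) + ln(1 - p) for the example X1 = 1, X2 = 0
def dlnL(p):
    return 1 / p - 1 / (1 - p)             # first derivative of lnL

def d2lnL(p):
    return -1 / p**2 - 1 / (1 - p)**2      # second derivative of lnL

p = 0.2                                    # arbitrary start value
for _ in range(25):
    p_new = p - dlnL(p) / d2lnL(p)         # Newton-Raphson update applied to lnL'
    if abs(p_new - p) < 1e-6:
        break
    p = p_new
print(p)   # converges to 0.5, where the (log-)likelihood is maximized
```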
20. Overview of Numerical Techniques
- One can maximize the lnL function by finding a point where its derivative is 0
- A variety of methods are available for maximizing L, or lnL
- Newton-Raphson
- Fisher scoring
- Expectation-Maximization (EM)
- The generality of ML estimation and these numerical techniques results in the same concepts and estimation routines being employed across modeling situations
- Logistic regression, log-linear modeling, FA, SEM, LCA
21. ML Estimation of Person Parameters When Item Parameters Are Known
- Assume the item parameters bj, aj, and cj are known
- Assume unidimensionality, local independence, and respondent independence
- The conditional probability now depends on the person parameter only
- The likelihood is then a function of the person parameters only
22. ML Estimation of Person Parameters When Item Parameters Are Known
- Choose each θi such that L or lnL is maximized
- Let's suppose we have one examinee
- Maximize this function using any of several methods
- We'll use Newton-Raphson
23. Newton-Raphson Estimation Recap
- Recall that NR seeks to find the root of a function (where the function equals 0)
- NR updates follow the general structure x(n+1) = x(n) - f(x(n)) / f'(x(n)): the current value, minus the function of interest divided by the derivative of the function of interest
- What is our function of interest? What is the derivative of this function?
24. Newton-Raphson Estimation of Person Parameters
- Newton-Raphson uses the derivative of the function of interest
- Here, our function of interest is itself a derivative: the first derivative of lnL with respect to θi
- We'll need the second derivative as well as the first derivative
- Updates are given by θi(new) = θi(current) - (∂lnL/∂θi) / (∂²lnL/∂θi²)
25. ML Estimation of Person Parameters When Item Parameters Are Known: The Log-Likelihood
- The log-likelihood to be maximized: lnL = Σj [Xij ln Pij + (1 - Xij) ln(1 - Pij)]
- Select a start value and iterate towards a solution using Newton-Raphson
- A hill-climbing sequence
26. ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson
27. ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson
28. ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson
- Move to θi = -0.0001
- When the change in θi is arbitrarily small (e.g., less than 0.001), stop estimation (the full procedure is sketched below)
- There is no meaningful change in the next step
- The key is that the tangent is 0
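A minimal Python sketch of this procedure for one examinee, assuming the Rasch model; the response pattern, item difficulties, and start value are hypothetical, not from the slides:

```python
import math

def p_rasch(theta, b):
    """P(X = 1 | theta, b) under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def ml_theta(responses, b, theta=0.0, tol=1e-3, max_iter=25):
    """Newton-Raphson ML estimation of theta with item difficulties b treated as known."""
    for _ in range(max_iter):
        p = [p_rasch(theta, bj) for bj in b]
        d1 = sum(x - pj for x, pj in zip(responses, p))    # dlnL/dtheta (Rasch)
        d2 = -sum(pj * (1 - pj) for pj in p)               # d2lnL/dtheta^2 (Rasch)
        theta_new = theta - d1 / d2                        # Newton-Raphson update
        if abs(theta_new - theta) < tol:                   # stop when the change is small
            return theta_new
        theta = theta_new
    return theta

# hypothetical 5-item response pattern and known difficulties
print(ml_theta([1, 1, 0, 1, 0], b=[-1.0, -0.5, 0.0, 0.5, 1.0]))
```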
29. Newton-Raphson Estimation of Multiple Person Parameters
- But we have N examinees, each with a θi to be estimated
- We need a multivariate version of the Newton-Raphson algorithm
30. First Order Derivatives
- First order derivatives of the log-likelihood: ∂lnL/∂θi, for i = 1, ..., N
- ∂lnL/∂θi only involves terms corresponding to subject i
Why???
31. Second Order Derivatives
- Hessian: the matrix of second order partial derivatives of the log-likelihood
- This matrix needs to be inverted
- In the current context, this matrix is diagonal
Why???
32. Second Order Derivatives
- The inverse of the Hessian is diagonal, with elements that are the reciprocals of the diagonal of the Hessian
- Updates for each θi do not depend on any other subject's θ
33. Second Order Derivatives
- The updates for each θi are independent of one another
- The procedure can be performed one examinee at a time
34. ML Estimation of Person Parameters When Item Parameters Are Known: Standard Errors
- The approximate, asymptotic standard error of the ML estimate of θi is SE(θi) = 1 / √I(θi) (illustrated below)
- where I(θi) is the information function
- Standard errors are
- asymptotic with respect to the number of items
- approximate because only an estimate of θi is employed
- asymptotically approximately unbiased
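A sketch of the corresponding computation, again assuming the Rasch model, for which the information function is I(θ) = Σj Pj(1 - Pj); the values are hypothetical:

```python
import math

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def se_theta(theta_hat, b):
    """Approximate SE of the ML estimate: 1 / sqrt(I(theta)), with Rasch information."""
    info = sum(p_rasch(theta_hat, bj) * (1 - p_rasch(theta_hat, bj)) for bj in b)
    return 1.0 / math.sqrt(info)

print(se_theta(0.3, b=[-1.0, -0.5, 0.0, 0.5, 1.0]))
```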
35. ML Estimation of Person Parameters When Item Parameters Are Known: Strengths
- ML estimates have some desirable qualities
- They are consistent
- If a sufficient statistic exists, then the MLE is a function of that statistic (Rasch models)
- Asymptotically normally distributed
- Asymptotically the most efficient (least variable) estimator among the class of normally distributed unbiased estimators
- Asymptotic with respect to what?
36. ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses
- ML estimates have some undesirable qualities
- Estimates may fly off into outer space
- They do not exist for so-called perfect scores (all 1s or all 0s)
- They can be difficult to compute or verify when the likelihood function is not single-peaked (which may occur with the 3-PLM or more complex IRT models)
37. ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses
- Strategies to handle wayward solutions
- Bound the amount of change at any one iteration
- Atheoretical
- No longer common
- Use an alternative estimation framework (Fisher, Bayesian)
- Strategies to handle perfect scores
- Do not estimate θi
- Use an alternative estimation framework (Bayesian)
- Strategies to handle local maxima
- Re-estimate the parameters using different starting points and look for agreement
38. ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses
- An alternative to the Newton-Raphson technique is Fisher's method of scoring
- Instead of the Hessian, it uses the information matrix (the negative of the expected Hessian)
- This usually leads to quicker convergence
- It is often more stable than Newton-Raphson
- But what about those perfect scores?
39. Bayes' Theorem
- We can avoid some of the problems that occur in ML estimation by employing a Bayesian approach
- All entities are treated as random variables
- Bayes' theorem for random variables A and B: P(A | B) = P(B | A) P(A) / P(B)
- P(A | B) is the posterior distribution of A given B: the probability of A, given B
- P(B | A) is the conditional probability of B, given A
- P(B) is the marginal probability of B
40. Bayes' Theorem
- If A is discrete, P(B) = Σa P(B | A = a) P(A = a)
- If A is continuous, P(B) = ∫ P(B | A = a) P(A = a) da
- Note that P(B | A) = L(A | B)
41. Bayesian Estimation of Person Parameters: The Posterior
- Select a prior distribution for θi, denoted P(θi)
- Recall that the likelihood function takes the form P(Xi | θi)
- The posterior density of θi given Xi is P(θi | Xi) = P(Xi | θi) P(θi) / P(Xi)
- Since P(Xi) is a constant, P(θi | Xi) ∝ P(Xi | θi) P(θi)
42. Bayesian Estimation of Person Parameters: The Posterior
43. Maximum A Posteriori Estimation of Person Parameters
- The Maximum A Posteriori (MAP) estimate is the maximum of the posterior density of θi
- Computed by maximizing the posterior density, or its log
- Find θi such that ∂lnP(θi | Xi)/∂θi = 0
- Use Newton-Raphson or Fisher scoring
- The max of lnP(θi | Xi) occurs at the max of lnP(Xi | θi) + lnP(θi)
- This can be thought of as augmenting the likelihood with prior information (see the sketch below)
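A sketch of MAP estimation along these lines, assuming the Rasch model and a N(0, 1) prior (all values hypothetical); note that, unlike ML, it converges even for a perfect score:

```python
import math

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def map_theta(responses, b, prior_mean=0.0, prior_var=1.0, theta=0.0, tol=1e-3):
    """Maximize lnP(X | theta) + lnP(theta) via Newton-Raphson (normal prior, Rasch model)."""
    for _ in range(50):
        p = [p_rasch(theta, bj) for bj in b]
        d1 = sum(x - pj for x, pj in zip(responses, p)) - (theta - prior_mean) / prior_var
        d2 = -sum(pj * (1 - pj) for pj in p) - 1.0 / prior_var
        theta_new = theta - d1 / d2
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# a perfect score: the MAP exists even though the ML estimate does not
print(map_theta([1, 1, 1, 1, 1], b=[-1.0, -0.5, 0.0, 0.5, 1.0]))
```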
44. Choice of Prior Distribution
- Choosing P(θi) = U(-∞, ∞) yields a posterior proportional to the likelihood
- In this case, the MAP is very similar to the ML estimate
- The prior distribution P(θi) is often assumed to be N(0, 1)
- The normal distribution is commonly justified by appeal to the CLT
- The choice of mean and variance identifies the scale of the latent continuum
45. MAP Estimation of Person Parameters: Features
- The approximate, asymptotic standard error of the MAP is SE(θi) = 1 / √I(θi)
- where I(θi) is the information from the posterior density
- Advantages of the MAP estimator
- Exists for every response pattern (why?)
- Generally leads to a reduced tendency for local extrema
- Disadvantages of the MAP estimator
- Must specify a prior
- Exhibits shrinkage, in that it is biased towards the mean; may need lots of items to swamp the prior if it is misspecified
- Calculations are iterative and may take a long time
- May result in local extrema
46. Expected A Posteriori (EAP) Estimation of Person Parameters
- The Expected A Posteriori (EAP) estimator is the mean of the posterior distribution: E(θi | Xi) = ∫ θ P(θ | Xi) dθ
- Exact computations are often intractable
- We approximate the integral using numerical techniques
- Essentially, we take a weighted average of the values, where the weights are determined by the posterior distribution
- Recall that the posterior distribution is itself determined by the prior and the likelihood
47. Numerical Integration Via Quadrature
- The posterior distribution is evaluated over a set of quadrature points
- Evaluate the heights of the distribution at each point
- Use the relative heights as the weights
- (Figure: the heights across the quadrature points sum to .165, so a point with height .021 receives weight .021/.165 ≈ .127)
48. EAP Estimation of θ via Quadrature
- The Expected A Posteriori (EAP) estimate is a weighted average: θi(EAP) ≈ Σr Qr H(Qr) (see the sketch below)
- where H(Qr) is the weight of quadrature point Qr in the posterior (compare Embretson & Reise, 2000, p. 177)
- The standard error is the standard deviation of the posterior, and may also be approximated via quadrature
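A sketch of EAP estimation along these lines, assuming the Rasch model, a N(0, 1) prior, and an evenly spaced grid of quadrature points from -4 to 4 (all values hypothetical):

```python
import math

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap_theta(responses, b, n_points=41):
    """EAP estimate: posterior mean over a grid of quadrature points (N(0, 1) prior)."""
    points = [-4 + 8 * r / (n_points - 1) for r in range(n_points)]
    heights = []
    for q in points:
        prior = math.exp(-0.5 * q * q)                    # N(0, 1) density up to a constant
        like = 1.0
        for x, bj in zip(responses, b):
            pj = p_rasch(q, bj)
            like *= pj if x == 1 else (1 - pj)
        heights.append(prior * like)                      # posterior height at this point
    total = sum(heights)
    weights = [h / total for h in heights]                # relative heights as weights
    eap = sum(q * w for q, w in zip(points, weights))     # weighted average
    sd = math.sqrt(sum((q - eap) ** 2 * w for q, w in zip(points, weights)))
    return eap, sd                                        # estimate and its SE (posterior SD)

print(eap_theta([1, 1, 0, 1, 0], b=[-1.0, -0.5, 0.0, 0.5, 1.0]))
```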
49. EAP Estimation of θ via Quadrature
- Advantages
- Exists for all possible response patterns
- Non-iterative solution strategy
- Not a maximum, therefore no local extrema
- Has the smallest MSE in the population
- Disadvantages
- Must specify a prior
- Exhibits shrinkage to the prior mean; if the prior is misspecified, may need lots of items to swamp the prior
50. ML Estimation of Item Parameters When Person Parameters Are Known: Assumptions
- Assume
- the person parameters θi are known
- respondent and local independence
- Choose values for the item parameters that maximize lnL
51. Newton-Raphson Estimation
- What is the structure of this matrix?
52. ML Estimation of Item Parameters When Person Parameters Are Known
- Just as we could estimate subjects one at a time thanks to respondent independence, we can estimate items one at a time thanks to local independence (see the sketch below)
- Multivariate Newton-Raphson
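A minimal sketch of the item-side analogue: Newton-Raphson ML estimation of a single Rasch difficulty, treating the person parameters as known (data and values hypothetical):

```python
import math

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def ml_difficulty(item_responses, thetas, b=0.0, tol=1e-3):
    """Newton-Raphson ML estimate of one Rasch difficulty b_j, with thetas known."""
    for _ in range(50):
        p = [p_rasch(t, b) for t in thetas]
        d1 = -sum(x - pi for x, pi in zip(item_responses, p))   # dlnL/db (Rasch)
        d2 = -sum(pi * (1 - pi) for pi in p)                     # d2lnL/db^2 (Rasch)
        b_new = b - d1 / d2
        if abs(b_new - b) < tol:
            return b_new
        b = b_new
    return b

# hypothetical responses of six examinees (with known thetas) to one item
print(ml_difficulty([1, 1, 1, 0, 1, 0], thetas=[-1.5, -0.5, 0.0, 0.5, 1.0, 2.0]))
```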
53. ML Estimation of Item Parameters When Person Parameters Are Known: Standard Errors
- To obtain the approximate, asymptotic standard errors
- Invert the associated information matrix, which yields the variance-covariance matrix
- Take the square root of the elements of the diagonal
- Asymptotic w.r.t. sample size, and approximate because we only have estimates of the parameters
- This is conceptually similar to the standard errors for the estimation of θ
- But why do we need a matrix approach?
54. ML Estimation of Item Parameters When Person Parameters Are Known: Standard Errors
- ML estimates of item parameters have the same properties as those for person parameters: consistent, efficient, asymptotic (w.r.t. subjects)
- aj parameters can be difficult to estimate and tend to get inflated with small sample sizes
- cj parameters are often difficult to estimate well
- Usually because there's not a lot of information in the data about the lower asymptote
- Especially true when items are easy
- Generally need larger and more heterogeneous samples to estimate the 2-PL and 3-PL
- Can employ Bayesian estimation (more on this later)