Title: Measurement 102
1 Measurement 102
- Steven Viger
- Lead Psychometrician, Office of General Assessment and Accountability, Michigan Dept. of Education
- Joseph Martineau, Ph.D.
- Interim Director, Office of General Assessment and Accountability, Michigan Dept. of Education
2 Student Performance Measurement
- The difference between validity and reliability
- Validity
- The degree to which the assessment measures the intended construct(s)
- Reliability
- The consistency with which the assessment produces scores
3 Student Performance Measurement
4 Student Performance Measurement
- Validity
- Documenting validity is a process of gathering evidence that the assessment measures what is intended
5 Student Performance Measurement
- Individual item validity evidence includes
- Focus is on elimination of construct-irrelevant variance
- Item development/review procedures
- Alignment of individual items
- Bias
- Simple item analyses
6 Student Performance Measurement
- Scale score validity evidence includes
- Input from item-level validity evidence (the validity of the score scale depends upon the validity of the items that contribute to that score scale)
- Convergent and divergent relationships with appropriate external criteria, for example
- Strong relationships with other measures of achievement
- Teacher-assigned grades
- Other subject area assessments
- Success in college
- Alignment of overall assessment to content standards
- Comparability across forms and administrations
- Accommodations
- Year-to-year equating
7 Student Performance Measurement
- Item Response Theory (IRT) is used to create the score scale
- Treats all sub-components as a single construct
- Assumes that there is a high correlation between sub-components
- Statistically speaking, this indicates that there is a strong first principal component of all items that contribute to the construct in question
- It would probably be better to measure the sub-components separately, but that would require significantly more assessment items
- Assumes that a more able person has a higher probability of responding correctly to an item than a less able person
- Specifically, when a person's ability is greater than the item difficulty, they have a better than 50% chance of getting the item correct (illustrated below)
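Under the one-parameter (Rasch) model introduced on the next slides, this follows directly: when ability exactly equals item difficulty the probability of a correct response is exactly one half, and it rises above one half as ability exceeds difficulty.

```latex
P(X = 1 \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}},
\qquad
P(X = 1 \mid \theta = b) = \frac{e^{0}}{1 + e^{0}} = \frac{1}{2}
```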
8 The Rasch Model (1-parameter logistic model)
- The psychometric/statistical model used for the MEAP
9 The 3-Parameter Logistic Model
- The psychometric/statistical model used with the MME
10 MEAP example (10 items scaled using Rasch)
11 MME example (10 items scaled using the 3-PL model)
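As an illustration of what such curves look like, here is a minimal Python sketch that draws item characteristic curves for 10 hypothetical Rasch items and 10 hypothetical 3-PL items. The parameter values are invented for display only; they are not actual MEAP or MME item parameters.

```python
# Illustrative item characteristic curves (ICCs) for 10 hypothetical items.
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-4, 4, 200)            # ability scale

def p_3pl(theta, a, b, c):
    """3-parameter logistic model; setting a=1, c=0 reduces it to the Rasch model."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

difficulties = np.linspace(-2, 2, 10)       # 10 difficulties spread over the scale

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Left panel: Rasch (1-PL) curves -- common slope, no guessing floor.
for b in difficulties:
    ax1.plot(theta, p_3pl(theta, a=1.0, b=b, c=0.0))
ax1.set(title="10 items, Rasch model", xlabel="theta (ability)", ylabel="P(correct)")

# Right panel: 3-PL curves -- varying discrimination and a pseudo-guessing floor.
rng = np.random.default_rng(0)
for b in difficulties:
    a = rng.uniform(0.8, 2.0)               # discrimination (invented)
    c = rng.uniform(0.05, 0.25)             # pseudo-guessing (invented)
    ax2.plot(theta, p_3pl(theta, a, b, c))
ax2.set(title="10 items, 3-PL model", xlabel="theta (ability)")

plt.tight_layout()
plt.show()
```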
12 How do we get there?
- Although the graphics on the previous screens may make conceptual sense, many wonder what we use to produce this curve.
- We are psychometricians, not psychomagicians, so the numbers come from somewhere.
- We need a person-by-item matrix to begin the process.
13 IRT Estimation
- The person-by-item matrix is fed into an IRT program to produce estimates of item parameters and person parameters.
- Item parameters are the guessability, discrimination, and difficulty parameters.
- Person parameters are the ability estimates we use to create a student's scale score (a small illustrative matrix follows).
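A minimal sketch of what a person-by-item matrix looks like, with invented 0/1 entries: each row is a student, each column an item, and simple summaries (raw scores, item p-values) fall out immediately.

```python
import numpy as np

# Hypothetical person-by-item matrix: rows = 5 students, columns = 4 items,
# 1 = correct, 0 = incorrect. Operational calibration data sets are far larger.
responses = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
])

raw_scores = responses.sum(axis=1)    # one raw score per student
p_values   = responses.mean(axis=0)   # proportion correct per item ("item p-values")

print("raw scores:", raw_scores)      # [2 2 3 1 4]
print("item p-values:", p_values)     # easier items have higher p-values

# A matrix of this form is what gets fed into an IRT program
# to estimate item parameters and person (ability) parameters.
```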
14 Parameter Estimation
- For single-parameter (item difficulty) models, WINSTEPS is the industry standard.
- More complex models, like the 3-parameter model used in the MME, require more specialized software such as PARSCALE.
- Once we know the parameters, we feed them into the appropriate model to give us an estimated probability of correct response.
15 The Rasch Model (MEAP and ELPA)
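In its standard form, the Rasch model gives the probability of a correct response on item i as a function of person ability theta and item difficulty b_i:

```latex
P(X_i = 1 \mid \theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}}
```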
16 The 3-Parameter Logistic Model (MME)
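In its standard form, the 3-parameter logistic model adds a discrimination parameter a_i and a pseudo-guessing parameter c_i to the item difficulty b_i:

```latex
P(X_i = 1 \mid \theta) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}
```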
17 From Theta to Scale Scores
- Once item parameters are known, we can use the item responses for the individuals to estimate their ability (theta).
- In general, when persons share the same response string (pattern of correct and incorrect responses), they will have the same estimate of theta.
- The estimation program will then produce a table that gives us the relationship between raw scores and theta (a sketch of how such a table can be built follows).
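A minimal Python sketch of how such a raw-score-to-theta table can be built under the Rasch model, using invented item difficulties rather than operational values: for each interior raw score r, find the theta whose expected raw score equals r.

```python
import numpy as np

def rasch_p(theta, b):
    """Rasch probability of a correct response for items with difficulties b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def theta_for_raw_score(r, b, lo=-6.0, hi=6.0, tol=1e-6):
    """Find theta whose expected raw score sum_i P_i(theta) equals r (bisection)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if rasch_p(mid, b).sum() < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical difficulties for a 10-item test (not operational item parameters).
b = np.linspace(-2.0, 2.0, 10)

# Perfect and zero raw scores have no finite estimate, so tabulate interior scores only.
for r in range(1, len(b)):
    print(f"raw score {r:2d}  ->  theta {theta_for_raw_score(r, b):+.3f}")
```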
19 DIF
- Differential item functioning (DIF) is a phenomenon which occurs in the context of testing multiple groups.
- When we estimate item and person parameters, we do so with a complete data set; the persons are from multiple demographic groups.
- We do our best to have a representative sample.
- DIF occurs when we estimate the item calibrations separately based on groups of interest (e.g., males and females, ethnicity, type of instruction, geographic regions, etc.) and we find differences in item parameters (a rough sketch follows this list).
- Depending on the DIF methodology used, there are different levels of DIF that may or may not be problematic.
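As a rough illustration with invented data, the sketch below approximates each item's difficulty separately in two groups, using the centered negative logit of the group's proportion correct as a crude stand-in for a full separate Rasch calibration (it is not an operational DIF method), and flags items whose group-specific difficulties diverge.

```python
import numpy as np

rng = np.random.default_rng(1)

def approx_difficulties(responses):
    """Crude difficulty estimates: negative logit of each item's proportion correct,
    centered so the group's mean difficulty is zero. A real DIF analysis would use
    separate IRT calibrations or a dedicated DIF statistic instead."""
    p = responses.mean(axis=0).clip(0.01, 0.99)
    d = -np.log(p / (1 - p))
    return d - d.mean()

# Invented response matrices for two groups answering the same 5 items.
group_a = rng.integers(0, 2, size=(200, 5))
group_b = rng.integers(0, 2, size=(200, 5))
# Make item 2 noticeably harder for group B to mimic DIF.
group_b[:, 2] &= rng.integers(0, 2, size=200)

gap = approx_difficulties(group_b) - approx_difficulties(group_a)
for i, g in enumerate(gap):
    flag = "  <-- review for possible DIF" if abs(g) > 0.5 else ""
    print(f"item {i}: difficulty difference {g:+.2f}{flag}")
```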
20 DIF
- Once DIF is detected, the item(s) is/are brought to the attention of the content specialists and at times the content and sensitivity review committees.
- Items are either edited, deleted, or kept in the assessment depending on the findings of the reviewers.
- The bottom line is that the finding of DIF from a statistical standpoint does not necessarily mean the item will be deleted.
- Subjective decisions always follow the technical information.
21 Item Types
- MDE assessments contain a variety of items with different levels of maturity.
- Even though an assessment is new for a test cycle, there are parts of it which are not new.
- Generally speaking, we have core items and field test items.
22 Equating
- Core items are more established and have been used before. In fact, the core items are the only ones which contribute to the score.
- Field test items are embedded within assessments to maintain the health of our item banks.
- We can treat the item parameters from core items as known and use those known parameters to drive the estimation of field test item parameters.
- Equating also utilizes what we know about the common items to link assessments from year to year because common items are used in concurrent test years.
23 Equating
- When we have a test designed to measure the same construct from year to year but that differs (somewhat) in specific content, we need to be able to put the scores on the same scale.
- What is being sought in test equating is a conversion from the units of one form of a test to the units of another form of the same test.
24 Equating
- Three restrictions are important for equating:
- The two tests must measure the same construct
- The resulting conversion should be independent of the individuals from whom the data were drawn to develop the conversion
- The conversions should be applicable in future situations
25 Equipercentile Equating
- Two scores, one on Form X and the other on Form Y (where X and Y measure the same thing with the same degree of reliability), may be considered equivalent if their corresponding percentile ranks in any given group are equal.
- Plot the percentile rank to raw score curves for each form.
- Paired values for the forms are then interpolated at common points on the percentile rank distribution (a sketch follows).
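A minimal Python sketch of the idea with invented score distributions: build each form's percentile-rank curve, then read off, for each Form X score, the Form Y score with the same percentile rank.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented raw-score samples from two forms of the same 40-item test.
form_x = rng.binomial(n=40, p=0.55, size=5000)   # Form X raw scores
form_y = rng.binomial(n=40, p=0.60, size=5000)   # Form Y (slightly easier)

def percentile_rank(scores, points):
    """Percentile rank of each point within the given score distribution."""
    scores = np.sort(scores)
    return np.searchsorted(scores, points, side="right") / len(scores) * 100

x_points = np.arange(0, 41)
y_points = np.arange(0, 41)
x_pr = percentile_rank(form_x, x_points)
y_pr = percentile_rank(form_y, y_points)

# Interpolate: for each Form X score, the Form Y score at the same percentile rank.
x_to_y = np.interp(x_pr, y_pr, y_points)

for x, y in zip(x_points[20:26], x_to_y[20:26]):
    print(f"Form X score {x:2d}  ~  Form Y score {y:5.2f}")
```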
26 Linear Equating
- Based on the assumption that the two forms of the test, designed to be parallel (equivalent), will have essentially the same raw score distributions.
- When the assumption is met, it should be possible to convert scores on one form of the measure into the same metric as the other form by employing a linear function.
27 Linear Equating
- Analogous to multiple regression
- Generally expressed as Y = a(X - c) + d
- a refers to the ratio of the standard deviation of Form Y over the standard deviation of Form X
- c refers to the mean of Form X
- d refers to the mean of Form Y
- To perform this type of equating, one of three basic data collection designs should be used (a sketch of the conversion follows).
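A minimal Python sketch of this conversion using invented Design 1-style data (one random group per form, as described on the next slide):

```python
import numpy as np

rng = np.random.default_rng(3)

# Design 1: one random group takes Form X, another takes Form Y (invented scores).
group_x_scores = rng.normal(loc=25.0, scale=6.0, size=2000)
group_y_scores = rng.normal(loc=27.0, scale=5.5, size=2000)

# Y = a(X - c) + d with a = sd(Y)/sd(X), c = mean(X), d = mean(Y).
a = group_y_scores.std(ddof=1) / group_x_scores.std(ddof=1)
c = group_x_scores.mean()
d = group_y_scores.mean()

def x_to_y(x):
    """Convert a Form X score to the Form Y metric."""
    return a * (x - c) + d

print(f"a = {a:.3f}, c = {c:.2f}, d = {d:.2f}")
print(f"Form X score 25 maps to Form Y score {x_to_y(25):.2f}")
```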
28 Basic Data Collection Designs
- Design 1: a large group of examinees is selected who are sufficiently heterogeneous to adequately sample all levels of the scores on both Form X and Form Y.
- Divide this group randomly into two groups; each group gets a different form.
- Collect the means and standard deviations of both groups and insert them into the aforementioned linear function.
29 Design 2
- Preferable when the test administrator has the luxury of more time to administer tests.
- Both Form X and Form Y are administered to all subjects.
- To control for order effects, half of the subjects receive Form X first and half receive Form Y first.
- Calculations are slightly more involved because averages must be used and must be applied properly.
30 Design 3
- Two randomly assigned groups each take a different test along with a common equating test.
- The common test is known as the anchor test (Form Z).
- Again, calculations are complicated by the addition of the anchor test.
- Benefits: it is the industry standard, intact or non-random groups may be used, and the anchor test is designed to adjust for any between-group differences that may be present.
31 IRT Equating
- When using IRT, if the data fit the model reasonably well, the item and ability parameters are invariant.
- For a set of calibrated items, an examinee will be expected to obtain the same ability estimate from any subset of items.
- For any sub-sample of examinees, item parameters will be the same.
32 IRT Equating, Cont'd
- Based on the principles in the previous slide and the common anchor item methodology, MDE utilizes various forms of this equating methodology.
- Generally, when we embed core items into examinations from year to year, we already know the difficulty estimates of those items.
- When we perform our IRT estimation during the current year or with the form under investigation, we can fix the parameters of those items when we feed our data into an estimation program (a sketch follows).
- The new ability estimates (and field test item parameter estimates) are anchored to the previous form or previous year's administration.
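A minimal Python sketch of fixed-anchor calibration under the Rasch model, with simulated data and invented difficulties: six anchor (core) item difficulties are held at their bank values while abilities and four field-test item difficulties are estimated by alternating maximum-likelihood steps. This is a simplified joint estimation for illustration, not the operational WINSTEPS or PARSCALE procedure.

```python
import numpy as np

rng = np.random.default_rng(4)

def p_rasch(theta, b):
    """Rasch probabilities for every person-item pair (rows = persons, cols = items)."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(theta, float)[:, None] - np.asarray(b, float)[None, :])))

def solve_increasing(f, target, lo=-6.0, hi=6.0, tol=1e-5):
    """Bisection root-finder for a function that increases in its argument."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if f(mid) < target else (lo, mid)
    return (lo + hi) / 2.0

# Simulated data: 1000 examinees, 6 anchor items with "known" bank difficulties,
# plus 4 field-test items whose difficulties we want to estimate (all invented).
true_theta = rng.normal(0.0, 1.0, size=1000)
anchor_b = np.array([-1.5, -0.8, -0.2, 0.3, 0.9, 1.6])
field_b = np.array([-0.5, 0.0, 0.7, 1.2])
all_b = np.concatenate([anchor_b, field_b])
responses = (rng.random((1000, all_b.size)) < p_rasch(true_theta, all_b)).astype(int)

# Perfect and zero raw scores have no finite ML ability estimate; drop them.
raw = responses.sum(axis=1)
responses = responses[(raw > 0) & (raw < all_b.size)]

# Fixed-anchor calibration: anchor difficulties stay at their bank values while
# abilities and field-test difficulties are updated in alternating ML steps.
b_est = np.concatenate([anchor_b, np.zeros(field_b.size)])
for _ in range(20):
    # Ability step: theta for raw score r solves sum_i P_i(theta) = r.
    raw = responses.sum(axis=1)
    theta_by_raw = {r: solve_increasing(lambda t: p_rasch([t], b_est).sum(), r)
                    for r in np.unique(raw)}
    theta_est = np.array([theta_by_raw[r] for r in raw])
    # Item step (field-test items only): b_j solves sum_p P(theta_p, b_j) = item total.
    for j in range(anchor_b.size, all_b.size):
        item_total = responses[:, j].sum()
        # P decreases in b, so search over -b to keep the function increasing.
        b_est[j] = -solve_increasing(lambda nb: p_rasch(theta_est, [-nb]).sum(), item_total)

print("generating field-test difficulties:", np.round(field_b, 2))
print("anchored estimates:                ", np.round(b_est[anchor_b.size:], 2))
```

Because the anchor items keep their bank values, the new ability and field-test difficulty estimates come out on the previous year's scale, which is the point of the anchoring step described above.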
33 Contact Information
- Steven Viger
- Michigan Department of Education
- 608 W. Allegan St., Lansing, MI 48909
- Office: (517) 241-2334
- Fax: (517) 335-1186
- VigerS_at_Michigan.gov
34 Contact Information
- Joseph Martineau
- Michigan Department of Education
- 608 W. Allegan St., Lansing, MI 48909
- Office: (517) 241-4710
- Fax: (517) 335-1186
- MartineauJ_at_Michigan.gov