Title: Standard Setting and Linking Examinations to the CEFR
3. CEFR
4. Bad Practice
Good Practice
5. Terminology
Alignment
Anchoring
Calibration
Projection
Scaling
Comparability
Linking
Concordance
Benchmarking
Equating
Prediction
Moderation
6. Milestones in Comparability
"The proof and measurement of association between two things" (1904)
Spearman
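Spearman's 1904 paper gave measurement its first formal index of association, and the rank correlation coefficient named after him is still the standard tool. A minimal sketch of computing Spearman's rho from scratch; the score lists are invented for illustration:

```python
# Spearman's rho: Pearson correlation of the rank vectors.
def ranks(values):
    """Average 1-based ranks, with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            rank[order[k]] = avg
        i = j + 1
    return rank

def spearman_rho(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

test_a = [12, 15, 19, 22, 25, 30]   # invented scores on one test
test_b = [40, 44, 43, 55, 60, 62]   # invented scores on another
print(round(spearman_rho(test_a, test_b), 3))  # about 0.943
```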
7. Milestones in Comparability
"Scores on two or more tests may be said to be comparable for a certain population if they show identical distributions for that population."
Flanagan
Spearman
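Flanagan's definition of comparability as identical distributions is the idea behind equipercentile linking: map each form X score to the form Y score with the same percentile rank, so that the linked scores share one distribution. A minimal sketch under that reading, with invented score distributions:

```python
# Equipercentile linking in the spirit of Flanagan's definition.
def percentile_rank(scores, x):
    """Fraction of the group below x, plus half of those exactly at x."""
    below = sum(s < x for s in scores)
    at = sum(s == x for s in scores)
    return (below + at / 2) / len(scores)

def equipercentile(x, scores_x, scores_y):
    """Form Y score whose percentile rank is closest to that of x on form X."""
    p = percentile_rank(scores_x, x)
    return min(sorted(set(scores_y)),
               key=lambda y: abs(percentile_rank(scores_y, y) - p))

form_x = [10, 12, 12, 15, 18, 20, 21, 24]   # invented form X scores
form_y = [30, 33, 35, 35, 40, 44, 47, 50]   # invented form Y scores
print(equipercentile(15, form_x, form_y))    # form Y equivalent of X = 15 -> 35
```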
8. Milestones in Comparability
- Scales, norms, and equivalent scores
- Equating
- Calibration
- Comparability
Angoff
Flanagan
Spearman
9. Milestones in Comparability
Linking
Mislevy, Linn
Angoff
Flanagan
Spearman
10. Milestones in Comparability
Alignment
Webb, Porter
Mislevy, Linn
Angoff
Flanagan
Spearman
11. Alignment
- Alignment refers to the degree of match between test content and the standards
- Dimensions of alignment:
- Content
- Depth
- Emphasis
- Performance
- Accessibility
12. Alignment
- Alignment is related to content validity
- Specification (Manual Ch. 4)
- "Specification can be seen as a qualitative method. There are also quantitative methods for content validation, but this manual does not require their use." (p. 2)
- 24 pages of forms
- Outcome: "A chart profiling coverage graphically in terms of levels and categories of CEF." (p. 7)
- Crocker, L. et al. (1989). Quantitative Methods for Assessing the Fit Between Test and Curriculum. Applied Measurement in Education, 2(2), 179-194.
Why?
How?
13. Alignment (Porter, 2004)
www.ncrel.org
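Porter's procedure codes test items and standards into the same content-by-cognitive-demand matrix and compares the two cell-proportion profiles; the commonly cited index is 1 minus half the summed absolute differences, so identical profiles score 1.0 and disjoint profiles 0.0. A sketch with invented cell counts (the function name is mine):

```python
# Porter-style alignment index over matched content cells.
def porter_alignment(test_counts, standards_counts):
    tx = sum(test_counts)
    ty = sum(standards_counts)
    x = [c / tx for c in test_counts]        # test cell proportions
    y = [c / ty for c in standards_counts]   # standards cell proportions
    return 1 - sum(abs(a - b) for a, b in zip(x, y)) / 2

# One count per cell (e.g. topic x depth-of-knowledge), same order in both.
test = [8, 4, 2, 0, 6]
standards = [5, 5, 3, 3, 4]
print(round(porter_alignment(test, standards), 3))  # 0.75
```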
14. Milestones in Comparability
Linking
Webb, Porter
Mislevy, Linn
Angoff
Flanagan
Spearman
15. Mislevy & Linn: Linking Assessments
Equating ≠ Linking
16. The Good, the Bad
17. Model-Data Fit
18. Model-Data Fit
19. Model-Data Fit
Reality
Models
20. Sample-Free Estimation
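"Sample-free" refers to the Rasch model's invariance claim: because the log-odds of success is theta minus b, the difference between two persons' log-odds is the same on every item, so person comparisons do not depend on which items (or which examinee sample) happened to be used. A small numerical illustration with invented parameter values:

```python
import math

def p_correct(theta, b):
    """Rasch model probability of a correct response."""
    return 1 / (1 + math.exp(-(theta - b)))

def log_odds(theta, b):
    p = p_correct(theta, b)
    return math.log(p / (1 - p))

theta_1, theta_2 = 1.5, 0.5
for b in (-2.0, 0.0, 2.0):   # easy, medium, hard item
    diff = log_odds(theta_1, b) - log_odds(theta_2, b)
    print(f"b={b:+.1f}  difference in log-odds = {diff:.3f}")  # always 1.000
```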
21. The ruler (θ scale)
22. The ruler (θ scale)
23. The ruler (θ scale)
24. The ruler (θ scale)
boiling water
absolute zero
25. The ruler (θ scale)
F = 1.8C + 32;  C = (F - 32) / 1.8
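The temperature formulas above are the stock analogy for linear linking: two scales for the same construct that differ only by slope and intercept, with an exactly invertible transformation. A sketch using the anchor points from the previous slide:

```python
# Linear linking, temperature-style: slope 1.8, intercept 32.
def c_to_f(c):
    return 1.8 * c + 32

def f_to_c(f):
    return (f - 32) / 1.8

assert c_to_f(100) == 212                      # boiling water
assert round(f_to_c(-459.67), 2) == -273.15    # absolute zero
print("round trip:", f_to_c(c_to_f(37.0)))     # recovers 37.0
```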
26. Mislevy & Linn: Linking Assessments
27. Standard Setting
28. The Ugly
29. Fact 1
- Human judgment is the epicenter of every standard-setting method (Berk, 1995)
30. When Ugliness turns to Beauty
31. When Ugliness turns to Beauty
32. Fact 2
- The cut-off points on the latent continuum do not possess any objective reality outside of and independently of our minds. They are mental constructs, which can differ from person to person.
33. Consequently
- Whether the levels themselves are set at the proper points is a most contentious issue and depends on the defensibility of the procedures used for determining them. (Messick, 1994)
34. Defensibility
Evidence
Claims
35. Defensibility: Claims vs. Evidence
- National Standards
- Understands manuals for devices used in their everyday life
- CEF A2
- Can understand simple instructions on equipment encountered in everyday life, such as a public telephone (p. 70)
36. Defensibility: Claims vs. Evidence
- Cambridge ESOL
- DIALANG
- Finnish Matriculation
- CIEP (TCF)
- CELI, Università per Stranieri di Perugia
- Goethe-Institut
- TestDaF Institut
- WBT (Zertifikat Deutsch)
75% of the institutions provide only claims about the items' CEF level
37. Defensibility: Claims vs. Evidence
- Common Practice (Buckendahl et al., 2000)
- External evaluation of the alignment of 12 tests by 2 publishers
- Publisher reports
- No description of the exact procedure followed
- Reports include only the match between items and standards
- Evaluation study
- At least 10 judges per test
- Comparison results
- % of agreement: 26-55% (see the sketch below)
- Overestimation of the match by test publishers
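One way to read the 26-55% figure: for each item, compare the publisher's claimed standard with the panel's majority coding and report the share of matches. A minimal sketch with invented labels (ties in the majority vote are broken arbitrarily):

```python
# Percent agreement between publisher claims and panel majority codings.
def percent_agreement(publisher, panel_votes):
    hits = 0
    for claim, votes in zip(publisher, panel_votes):
        majority = max(set(votes), key=votes.count)  # panel's modal coding
        hits += (majority == claim)
    return 100 * hits / len(publisher)

publisher = ["S1", "S2", "S1", "S3"]                 # claimed standard per item
panel = [["S1", "S1", "S2"], ["S2", "S2", "S2"],
         ["S3", "S3", "S1"], ["S3", "S1", "S3"]]     # judges' codings per item
print(percent_agreement(publisher, panel))           # 75.0
```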
38. Standards for Educational and Psychological Testing, 1999
- Standard 1.7
- When a validation rests in part on the opinions or decisions of expert judges, observers or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The description of procedures should include any training and instruction provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.
39. Evaluation Criteria
- Hambleton, R. (2001). Setting Performance Standards on Educational Assessments and Criteria for Evaluating the Process. In G. Cizek (Ed.), Setting Performance Standards: Concepts, Methods and Perspectives (pp. 89-116). Lawrence Erlbaum Associates.
- A list of 20 questions as evaluation criteria
- Planning / Documentation: 4 (20%)
- Judgments: 11 (55%)
- Standard-Setting Method: 5 (25%)
40. Judges
- Because standard-setting inevitably involves human judgment, a central issue is who is to make these judgments, that is, whose values are to be embodied in the standards. (Messick, 1994)
41. Selection of Judges
- The judges should have the right qualifications, but other criteria such as
- occupation,
- working experience,
- age,
- sex
- may also be taken into account: "although ensuring expertise is critical, sampling from relevant different constituencies may be an important consideration if the testing procedures and passing scores are to be politically acceptable" (Maurer & Alexander, 1992).
42. Number of Judges
- Livingston & Zieky (1982) suggest that the number of judges be no fewer than 5.
- Based on court cases in the USA, Biddle (1993) recommends using 7 to 10 subject-matter experts in the judgment session.
- As a general rule, Hurtz & Hertz (1999) recommend sampling 10 to 15 raters.
- 10 judges is the minimum number, according to the Manual (p. 94).
43. Training Session
- The weakest point
- How much? "Until it hurts" (Berk, 1995)
- Main focus: intra-judge consistency
- Evaluation forms (Hambleton, 2001)
- Feedback (see the sketch below)
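A common intra-judge consistency check during training is to correlate each judge's Angoff ratings (predicted proportions correct) with the items' empirical p-values; a low correlation flags a judge who needs more feedback. A sketch with invented ratings and p-values:

```python
# Intra-judge consistency: Pearson correlation of Angoff ratings with p-values.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

angoff_ratings = [0.80, 0.65, 0.55, 0.40, 0.30]   # judge's estimated difficulties
p_values = [0.75, 0.70, 0.50, 0.45, 0.25]          # observed proportions correct
print(round(pearson(angoff_ratings, p_values), 3))
```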
44. Training Session: Feedback Form
45. Training Session: Feedback Form
46. Standard-Setting Method
- Good Practice
- The most appropriate
- Due diligence
- Field tested
- Reality check
- Validity evidence
- More than one
47. Standard-Setting Method
- Probably the only point of agreement among standard-setting gurus is that there is hardly any agreement between the results of any two standard-setting methods, even when applied to the same test under seemingly identical conditions. (Berk, 1995)
48. He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)
Examinee-centered methods
B1/B2
Test-centered methods
49. He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)
Test-centered methods
B1/B2
Examinee-centered methods
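As an illustration of the examinee-centered family, a contrasting-groups sketch: teachers first classify examinees as B1 or B2, and the B1/B2 cut score is placed where misclassification against those judgments is smallest. Scores and classifications below are invented:

```python
# Contrasting-groups method: choose the cut minimizing misclassification.
def contrasting_groups_cut(b1_scores, b2_scores):
    candidates = sorted(set(b1_scores + b2_scores))
    def misclassified(cut):
        # B1 examinees at/above the cut plus B2 examinees below it
        return sum(s >= cut for s in b1_scores) + sum(s < cut for s in b2_scores)
    return min(candidates, key=misclassified)

b1 = [22, 25, 28, 30, 31, 33]   # scores of examinees judged B1
b2 = [29, 34, 36, 38, 41, 44]   # scores of examinees judged B2
print(contrasting_groups_cut(b1, b2))  # 34
```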
50. Instead of a Conclusion
- In sum, it may seem that providing valid grounds for valid inferences in standards-based educational assessment is a costly and complicated enterprise. But when the consequences of the assessment affect accountability decisions and educational policy, this needs to be weighed against the costs of uninformed or invalid inferences. (Messick, 1994)
Butterfly Effect
Change one thing, change everything!
51. Instead of a Conclusion
- The chief determiner of performance standards is not truth; it is consequences. (Popham, 1997)
Butterfly Effect
Change one thing, change everything!
52. Instead of a Conclusion
- Perhaps by the year 2000, the collaborative efforts of measurement researchers and practitioners will have raised the standard on standard-setting practices for this emerging testing technology. (Berk, 1996)
Butterfly Effect
Change one thing, change everything!
53. Rise up, Magyar!
A coward and a lowly bastard
Is he, who dares not raise the standard!
54. Thanks!
Rise up, Magyar!
A coward and a lowly bastard
Is he, who dares not raise the standard!