Title: Quality control in language tests
1 - Quality control in language tests: pretesting and item banking
ALTE Meeting, 9-11 November 2005
Dr Anthony Green, Cambridge ESOL Research and Validation
2 - Which car would you buy?
3 - Quality control: inspection and testing
Car safety ratings: Euro NCAP
Rigorous testing demonstrates and improves the quality of a product.
4 - Quality Control for Test Material
Pre-editing and editing: Inspection
Pretesting and trialling: Testing
Test construction: Inspection
5 - Standards for language testers
ILTA Code of Practice (2005):
- 3. All tests, regardless of their purpose or use, must be reliable. Reliability refers to the consistency of the test results: to what extent they are generalizable and therefore comparable across time and across settings.
- 4. The work of the task and item writers needs to be edited before pretesting. If pretesting is not possible, tasks and items should be analysed after the test has been administered but before results are reported. Malfunctioning or misfitting tasks and items should not be included in the calculation of individual test takers' reported scores.
6 - Range of ability and difficulty
Individuals have different levels of language ability.
Language tasks have different levels of difficulty.
7 - Using Tests to Obtain Information
8 - The Pretesting Process (Live)
9 - The Pretesting Process
10 - The Data File
Identifiers and item responses:

0090040112 000260001MToma      CCDDCABCBBCAD100101101111
0090040112 000260002FSukaria   CCDDCADDBBAAA010111101101
0090040112 000260003KTakahashi CCDDCADBBAAAD111101101111
0090040112 000260004SSunaga    CCDDCBBDBBDAD111011100110
0090040112 000260005FPaolicchi CCDDCADDBBAAD111111101100
0090040112 000260006JStassart  CCDDCBABABBBD110101101100
0090040112 000260007RMagome    CCDDCACDBBCAD111101001101
0090040112 000260008YMatsuzaki CCDDDABDBBABD111111111111
0090020112 001160001AAlhinai   BCDDCACCABAAB100001011101
0090040112 001160002SAlzefeiti CCADCABCBBCAC101101111010
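A minimal sketch of reading one of these records in Python. The field positions are assumptions read off the sample (session identifier, candidate number with what appears to be a first initial and surname, then the keyed letter choices followed by a 0/1 scored string), not a documented file layout.

    import re

    def parse_record(line):
        session, candidate, responses = line.split()
        # Separate keyed letter choices from the 0/1 scored string.
        letters, scored = re.match(r"([A-Z]+)([01]+)$", responses).groups()
        return {
            "session": session,                  # e.g. 0090040112
            "candidate_no": candidate[:9],       # e.g. 000260001
            "initial": candidate[9],             # assumed: first initial
            "surname": candidate[10:],
            "choices": letters,                  # letters the candidate chose
            "scores": [int(c) for c in scored],  # 1 = correct, 0 = wrong
        }

    print(parse_record("0090040112 000260001MToma CCDDCABCBBCAD100101101111"))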
11 - The Candidates: Who took the test?
- Are candidate numbers sufficient to provide useful data?
- Does the variety of backgrounds and language ability reflect the target population?
- Is a range of language ability covered within each group?
12 - Classical Test Analysis (ITEMAN)
- Is the difficulty of the test appropriate to the level of the candidates?
- Does the test discriminate between lower- and higher-ability candidates?
- Are the items consistent with each other? Do they all tell the same story?
13 - Analysing the Scored Data File
14 - Finding patterns in test data
16 - Suitability of the Test: Central Tendency
- Are scores mostly high, low, or in the middle?
- Mean and median (computed in the sketch after this list):
- The mean (average) is the sum of all scores divided by the number of people.
- The median is the score in the middle of a distribution of scores; that is, half the scores are below the median and half are above.
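A quick illustration of both definitions using Python's statistics module; the scores are invented, not pretest data.

    from statistics import mean, median

    scores = [4, 6, 6, 7, 9]    # illustrative scores only
    print(mean(scores))         # sum / count = 6.4
    print(median(scores))       # middle of the sorted scores = 6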
17 - Suitability of the Test: Dispersion
- Are scores bunched together, or widely spread out?
- Standard deviation: the average amount by which scores vary from the mean (strictly, the square root of the mean squared deviation). A sketch follows below.
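The same invented scores, with both the population and the sample standard deviation; which to report is a judgement call for small pretest samples.

    from statistics import pstdev, stdev

    scores = [4, 6, 6, 7, 9]
    print(pstdev(scores))   # population SD, ~1.62
    print(stdev(scores))    # sample SD (n - 1 correction), ~1.82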
18 - Normal Distribution
A 200-item test has (for example) a mean of 100 and an SD of 15. About 68% of candidates score within 1 SD of the mean (between 85 and 115); about 95% score within 2 SDs (between 70 and 130).
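These percentages can be checked with Python's built-in NormalDist, using the example mean and SD above.

    from statistics import NormalDist

    d = NormalDist(mu=100, sigma=15)
    print(d.cdf(115) - d.cdf(85))   # ~0.683: within 1 SD of the mean
    print(d.cdf(130) - d.cdf(70))   # ~0.954: within 2 SDs of the mean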
19 - Score Distribution

Number
Correct   Frequency   Cum Freq   PR   PCT
-------   ---------   --------   --   ---
   0          1           1       5     5
   1          0           1       5     0
   2          0           1       5     0
   3          1           2      10     5
   4          5           7      35    25
   5          0           7      35     0
   6          4          11      55    20
   7          4          15      75    20
   8          2          17      85    10
   9          1          18      90     5
  10          2          20      99    10

[Histogram: bars showing the percentage of examinees (5 to 25) at each score]
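The table can be rebuilt from raw scores. The score list below is invented but consistent with the frequencies shown (it also reproduces the mean of 6.000 and variance of 5.900 reported on the next slide).

    from collections import Counter

    scores = [0, 3, 4, 4, 4, 4, 4, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 9, 10, 10]
    freq = Counter(scores)
    n = len(scores)
    cum = 0
    print("Score  Freq  CumFreq   PR  PCT")
    for s in range(11):
        cum += freq[s]
        # ITEMAN caps the top percentile rank (PR) at 99
        pr = min(99, 100 * cum // n)
        print(f"{s:5d} {freq[s]:5d} {cum:8d} {pr:4d} {100 * freq[s] // n:4d}")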
20 - Descriptive Statistics

Scale Statistics
----------------
Scale:                1
N of Items           10
N of Examinees       20
Mean              6.000
Variance          5.900
Std. Dev.         2.429
Skew             -0.377
Kurtosis         -0.004
Minimum           0.000
Maximum          10.000
Median            6.000
Alpha             0.667
SEM               2.180
Mean P            0.618
Mean Item-Tot.    0.329
Mean Biserial     0.437
Max Score (Low)      13
N (Low Group)        39
Min Score (High)     18
N (High Group)       33
21 - Does the Test Work? Reliability
- Stability: does the test yield consistent results on two or more occasions?
- Internal consistency: do all parts of the test provide consistent information?
22 - Ordering the data
Person ability
24 - Ordering the data
Person ability
Task difficulty
26 - Reliability
- Alpha:
- Measured from 0 to 1.
- The higher the alpha, the more reliable the test.
- Affected by homogeneity (of candidates and of items) and by test length.
- SEM:
- A way to estimate the reliability of individual scores.
- Combines alpha and standard deviation, as in the sketch after this list.
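A minimal sketch of both statistics from a 0/1 scored item matrix (rows are candidates, columns are items). The data are invented; SEM = SD * sqrt(1 - alpha) is the standard formula combining the two quantities.

    from statistics import pvariance

    def cronbach_alpha(matrix):
        k = len(matrix[0])                       # number of items
        item_vars = [pvariance([row[i] for row in matrix]) for i in range(k)]
        total_var = pvariance([sum(row) for row in matrix])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    data = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
    alpha = cronbach_alpha(data)
    sd = pvariance([sum(row) for row in data]) ** 0.5
    sem = sd * (1 - alpha) ** 0.5                # SEM combines SD and alpha
    print(alpha, sem)                            # 0.8 and ~0.63 for this data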
27 - Reliability Figures

Scale Statistics
----------------
Scale:                1
N of Items           10
N of Examinees       20
Mean              6.000
Variance          6.211
Std. Dev.         2.492
Skew             -0.408
Kurtosis          0.368
Minimum           0.000
Maximum          10.000
Median            6.000
Alpha             0.724
SEM               2.180
Mean P            0.600
Mean Item-Tot.    0.545
Mean Biserial     0.735
Max Score (Low)       4
N (Low Group)         7
Min Score (High)      7
N (High Group)        9
28 - ITEMAN task
- How many items are there on the test?
- How many people took the test?
- Is the test easy or difficult for these test-takers?
- How reliable was the test overall?
- If a candidate actually scored 45 on this test, what was her true score?
29 - Where are the problems? Item Analysis
- How many candidates got the item right? Facility (proportion correct).
- Did the item sort the sheep from the goats? Discrimination (high scorers vs low scorers).
30 - Facility

[ITEMAN output columns: Seq. No., Scale-Item, Prop. Correct, Disc. Index, Point Biser.; Alternative Statistics: Alt., Prop. Endorsing (Total, Low, High), Point Biser., Key]

- Also known as p (proportion correct); see the sketch after this list.
- The acceptable range depends on the exam; .35 to .85 is OK for many.
- Very high or very low facility items provide little information; very low facility may reduce responses to guessing.
- Low facility for a distractor indicates poor pulling power.
- There may be good reasons why out-of-range items should be included.
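Facility is just a proportion; a minimal sketch with invented 0/1 scores for a single item.

    def facility(item_scores):
        # proportion of candidates answering the item correctly
        return sum(item_scores) / len(item_scores)

    item = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # invented 0/1 scores
    p = facility(item)
    print(p)                      # 0.7
    print(0.35 <= p <= 0.85)      # inside the rough acceptable band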
31 - Item Discrimination

[ITEMAN output columns as above]

Measures of item discrimination show how successfully an item distinguishes between higher- and lower-ability candidates by:
- explicitly dividing the candidates into high-scoring and low-scoring groups, or
- correlating scores on an individual item with total scores on the test.
32 - The Discrimination Index
- ITEMAN reports discrimination as P(high) - P(low).
- The highest-scoring and lowest-scoring 27% of candidates are compared: the proportion of the lowest-scoring group answering the item correctly is subtracted from the proportion of the highest-scoring group. A sketch follows below.
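A minimal sketch of that subtraction, pairing each candidate's total score with an invented 0/1 score on one item.

    def discrimination_index(pairs):            # pairs of (total, item 0/1)
        pairs = sorted(pairs, key=lambda t: t[0])
        k = max(1, round(0.27 * len(pairs)))    # size of each 27% group
        low, high = pairs[:k], pairs[-k:]
        prop = lambda group: sum(item for _, item in group) / len(group)
        return prop(high) - prop(low)

    data = [(3, 0), (4, 0), (5, 1), (6, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
    print(discrimination_index(data))           # 1.0 for this invented data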
33 - Point-Biserial Correlation
- The point-biserial correlation shows the relationship between candidates' performance on a single item and their performance on all items in the test (or part of the test).
- i.e. do those people who answer the item correctly also score highly on the rest of the test? (A sketch follows below.)
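The point-biserial is simply a Pearson correlation between a dichotomous (0/1) item score and the total score; a minimal sketch with invented data.

    from statistics import mean, pstdev

    def point_biserial(item, totals):
        mi, mt = mean(item), mean(totals)
        cov = mean((i - mi) * (t - mt) for i, t in zip(item, totals))
        return cov / (pstdev(item) * pstdev(totals))

    item   = [1, 0, 1, 1, 0, 1, 0, 1]    # invented 0/1 item scores
    totals = [8, 3, 7, 9, 4, 6, 5, 8]    # invented total scores
    print(point_biserial(item, totals))  # positive: item agrees with the test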
35 - Point-Biserial Correlation
- An item's potential to discriminate peaks at a facility of .5, i.e. when half the candidates are responding correctly.
36 - Item Analysis

                 Item Statistics            Alternative Statistics
          -----------------------    -----------------------------------
Seq.  Scale   Prop.   Disc.  Point           Prop. Endorsing       Point
No.   -Item  Correct  Index  Biser.  Alt.    Total  Low   High     Biser.  Key
----  -----  -------  -----  ------  -----   -----  ----  ----     ------  ---
  8   1-8      .40     .23    .24    A        .53   .62   .46      -.13
                                     B        .06   .10   .04      -.12
                                     C        .40   .28   .50       .24    *
                                     Other    .00   .00   .00      -.38

  9   1-9      .48     .39    .42    A        .06   .10   .02      -.22
                                     B        .48   .27   .66       .42    *
                                     C        .24   .30   .19      -.12
                                     D        .22   .32   .13      -.20
                                     Other    .00   .00   .00      -.38

(* marks the keyed answer)
37 - Item 9

Seq.  Scale   Prop.   Disc.  Point           Prop. Endorsing       Point
No.   -Item  Correct  Index  Biser.  Alt.    Total  Low   High     Biser.  Key
----  -----  -------  -----  ------  -----   -----  ----  ----     ------  ---
  9   1-9      .48     .39    .42    A        .06   .10   .02      -.22
                                     B        .48   .27   .66       .42    *
                                     C        .24   .30   .19      -.12
                                     D        .22   .32   .13      -.20
                                     Other    .00   .00   .00      -.38
38 - Item 9
39 - Item 8

Seq.  Scale   Prop.   Disc.  Point           Prop. Endorsing       Point
No.   -Item  Correct  Index  Biser.  Alt.    Total  Low   High     Biser.  Key
----  -----  -------  -----  ------  -----   -----  ----  ----     ------  ---
  8   1-8      .40     .23    .24    A        .53   .62   .46      -.13
                                     B        .06   .10   .04      -.12
                                     C        .40   .28   .50       .24    *
                                     Other    .00   .00   .00      -.38
40 - Item 8
41 - Item 8
42 - Item 8
- With a computer, we can see and speak to people in other countries for just a ___ pence per minute.
- A) little   B) small   C) few
43 - Item analysis
- Which were the easiest and most difficult items?
- Which item(s), if any, might you consider replacing? Why?
- If both item 2 and item 3 are testing the same ability and you want to choose one for a language proficiency test, which would you choose? Why?
44 - The Limitations of Classical Test Analysis
It is difficult to compare performances in different contexts. We need a common measure to compare them.

[Figure: candidates A, B and C placed on tests with different raw-score scales (25, 50, 100)]
45 - Test Equating
- How does this test compare with other tests in our bank?
- Is it of similar difficulty to other tests at the same level?
- Is it harder than lower-level and easier than higher-level tests?
- How much harder or easier is it?
- Can we select items that are at the right level of difficulty for the tests we want to build?
46 - Item Response Theory
- Relates items (and candidates) across tests.
- Estimates the probability that a candidate of known ability will succeed on an item of known difficulty.
- Useful for: test construction, grading, test scrutiny.
- Tests must be linked by common candidates and/or common items.
47 - Item Response Theory
[Diagram: person ability against item difficulty]
Low-ability candidate, easy item: 50% chance of success
48 - Item Response Theory
[Diagram: person ability against item difficulty]
Low-ability candidate, moderately difficult item: 10% chance of success
High-ability candidate, moderately difficult item: 90% chance of success
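Under the Rasch model, the simplest IRT model, the probability of success depends only on the gap between person ability and item difficulty in logits; a gap of about 2.2 logits reproduces the 10% and 90% figures above. The logit values below are invented for illustration.

    import math

    def p_correct(ability, difficulty):
        # Rasch model: probability rises with the ability-difficulty gap
        return 1 / (1 + math.exp(-(ability - difficulty)))

    print(p_correct(0.0, 0.0))    # ability = difficulty       -> 0.50
    print(p_correct(-1.1, 1.1))   # ~2.2 logits below the item -> ~0.10
    print(p_correct(3.3, 1.1))    # ~2.2 logits above the item -> ~0.90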
49 - Item Response Theory
[Figure: item characteristic curve. Probability of success (0 to 100%) plotted against item difficulty / person ability on a -3 to +3 logit scale.]
50 - Item link: the same items appear on two or more tests
51 - Person link: the same people take two or more tests of the same skills
52 - Linking Pretests at Cambridge ESOL
Pretests are taken by small numbers of candidates. All candidates take an anchor test plus a batch of pretests. The anchor links the tests together, and overlapping anchors are linked to the ESOL Common Scale. (A sketch of one simple linking method follows below.)
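One simple way an anchor can place new items on a common scale is mean-mean linking. This is a generic sketch, not Cambridge ESOL's actual procedure, and the logit values are invented.

    # Anchor item difficulties as stored in the bank vs as re-estimated
    # in the pretest calibration (invented logit values).
    bank_anchor    = {"a1": -0.50, "a2": 0.20, "a3": 1.10}
    pretest_anchor = {"a1": -0.90, "a2": -0.20, "a3": 0.70}

    # Mean difference on the anchor items = the shift onto the bank scale.
    shift = sum(bank_anchor[i] - pretest_anchor[i]
                for i in bank_anchor) / len(bank_anchor)

    new_items = {"q1": -1.20, "q2": 0.35}          # pretest-only items
    on_bank_scale = {q: d + shift for q, d in new_items.items()}
    print(shift, on_bank_scale)                    # shift = 0.40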
53 - The Common Scale
54 - Statistics for Pretesting
The picture is often confusing or ambiguous; it requires professional interpretation and judgement.
55 - Statistics for Pretesting
- To identify tasks at the appropriate level of difficulty for a test
- To flag any problems with items before they are used in Live tests
- To ensure that Live tests reflect the full range of abilities we wish to test