Title: Quality control in language tests
1 - Quality control in language tests: pretesting and item banking
ALTE Meeting, 9-11 November 2005
Dr Anthony Green, Cambridge ESOL Research and Validation
2 - Which car would you buy?
3 - Quality control: inspection and testing
Car safety ratings: Euro NCAP
Rigorous testing demonstrates and improves the quality of a product.
4 - Quality Control for Test Material
Pre-editing and editing: Inspection
Pretesting and trialling: Testing
Test construction: Inspection
5 - Standards for language testers
ILTA Code of Practice (2005):
- 3. All tests, regardless of their purpose or use, must be reliable. Reliability refers to the consistency of the test results: to what extent they are generalizable and therefore comparable across time and across settings.
- 4. The work of the task and item writers needs to be edited before pretesting. If pretesting is not possible, tasks and items should be analysed after the test has been administered but before results are reported. Malfunctioning or misfitting tasks and items should not be included in the calculation of individual test takers' reported scores.
6 - Range of ability and difficulty
Individuals have different levels of language ability.
Language tasks have different levels of difficulty.
7 - Using Tests to Obtain Information
8 - The Pretesting Process (Live)
9 - The Pretesting Process
10 - The Data File
Identifiers and item responses:

0090040112 000260001MToma      CCDDCABCBBCAD100101101111
0090040112 000260002FSukaria   CCDDCADDBBAAA010111101101
0090040112 000260003KTakahashi CCDDCADBBAAAD111101101111
0090040112 000260004SSunaga    CCDDCBBDBBDAD111011100110
0090040112 000260005FPaolicchi CCDDCADDBBAAD111111101100
0090040112 000260006JStassart  CCDDCBABABBBD110101101100
0090040112 000260007RMagome    CCDDCACDBBCAD111101001101
0090040112 000260008YMatsuzaki CCDDDABDBBABD111111111111
0090020112 001160001AAlhinai   BCDDCACCABAAB100001011101
0090040112 001160002SAlzefeiti CCADCABCBBCAC101101111010
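A minimal sketch of reading one of these records in Python. The field positions are assumptions read off the sample (session identifier, candidate number with what appears to be a first initial and surname, then the keyed letter choices followed by a 0/1 scored string), not a documented file layout.

    import re

    def parse_record(line):
        session, candidate, responses = line.split()
        # Separate keyed letter choices from the 0/1 scored string.
        letters, scored = re.match(r"([A-Z]+)([01]+)$", responses).groups()
        return {
            "session": session,                  # e.g. 0090040112
            "candidate_no": candidate[:9],       # e.g. 000260001
            "initial": candidate[9],             # assumed: first initial
            "surname": candidate[10:],
            "choices": letters,                  # letters the candidate chose
            "scores": [int(c) for c in scored],  # 1 = correct, 0 = wrong
        }

    print(parse_record("0090040112 000260001MToma CCDDCABCBBCAD100101101111"))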
11 - The Candidates: Who took the test?
- Are candidate numbers sufficient to provide useful data?
- Does the variety of backgrounds and language ability reflect the target population?
- Is a range of language ability covered within each group?
12 - Classical Test Analysis (ITEMAN)
- Is the difficulty of the test appropriate to the level of the candidates?
- Does the test discriminate between lower- and higher-ability candidates?
- Are the items consistent with each other? Do they all tell the same story?
13 - Analysing the Scored Data File
14 - Finding patterns in test data
16 - Suitability of the Test: Central Tendency
- Are scores mostly high, low, or in the middle?
- Mean and median (computed in the sketch after this list):
- The mean (average) is the sum of all scores divided by the number of people.
- The median is the score in the middle of a distribution of scores; that is, half the scores are below the median and half are above.
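A quick illustration of both definitions using Python's statistics module; the scores are invented, not pretest data.

    from statistics import mean, median

    scores = [4, 6, 6, 7, 9]    # illustrative scores only
    print(mean(scores))         # sum / count = 6.4
    print(median(scores))       # middle of the sorted scores = 6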
17 - Suitability of the Test: Dispersion
- Are scores bunched together, or widely spread out?
- Standard deviation: the average amount by which scores vary from the mean (strictly, the square root of the mean squared deviation). A sketch follows below.
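The same invented scores, with both the population and the sample standard deviation; which to report is a judgement call for small pretest samples.

    from statistics import pstdev, stdev

    scores = [4, 6, 6, 7, 9]
    print(pstdev(scores))   # population SD, ~1.62
    print(stdev(scores))    # sample SD (n - 1 correction), ~1.82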
18 - Normal Distribution
A 200-item test has (for example) a mean of 100 and an SD of 15. About 68% of candidates score within 1 SD of the mean (between 85 and 115); about 95% score within 2 SDs (between 70 and 130).
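These percentages can be checked with Python's built-in NormalDist, using the example mean and SD above.

    from statistics import NormalDist

    d = NormalDist(mu=100, sigma=15)
    print(d.cdf(115) - d.cdf(85))   # ~0.683: within 1 SD of the mean
    print(d.cdf(130) - d.cdf(70))   # ~0.954: within 2 SDs of the mean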
19 - Score Distribution

Number
Correct   Frequency   Cum Freq   PR   PCT
-------   ---------   --------   --   ---
   0          1           1       5     5
   1          0           1       5     0
   2          0           1       5     0
   3          1           2      10     5
   4          5           7      35    25
   5          0           7      35     0
   6          4          11      55    20
   7          4          15      75    20
   8          2          17      85    10
   9          1          18      90     5
  10          2          20      99    10

[Histogram: bars showing the percentage of examinees (5 to 25) at each score]
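The table can be rebuilt from raw scores. The score list below is invented but consistent with the frequencies shown (it also reproduces the mean of 6.000 and variance of 5.900 reported on the next slide).

    from collections import Counter

    scores = [0, 3, 4, 4, 4, 4, 4, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 9, 10, 10]
    freq = Counter(scores)
    n = len(scores)
    cum = 0
    print("Score  Freq  CumFreq   PR  PCT")
    for s in range(11):
        cum += freq[s]
        # ITEMAN caps the top percentile rank (PR) at 99
        pr = min(99, 100 * cum // n)
        print(f"{s:5d} {freq[s]:5d} {cum:8d} {pr:4d} {100 * freq[s] // n:4d}")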
20 - Descriptive Statistics

Scale Statistics
----------------
Scale:                1
N of Items           10
N of Examinees       20
Mean              6.000
Variance          5.900
Std. Dev.         2.429
Skew             -0.377
Kurtosis         -0.004
Minimum           0.000
Maximum          10.000
Median            6.000
Alpha             0.667
SEM               2.180
Mean P            0.618
Mean Item-Tot.    0.329
Mean Biserial     0.437
Max Score (Low)      13
N (Low Group)        39
Min Score (High)     18
N (High Group)       33
21 - Does the Test Work? Reliability
- Stability: does the test yield consistent results on two or more occasions?
- Internal consistency: do all parts of the test provide consistent information?
22 - Ordering the data
Person ability
24 - Ordering the data
Person ability
Task difficulty
26 - Reliability
- Alpha:
- Measured from 0 to 1.
- The higher the alpha, the more reliable the test.
- Affected by homogeneity (of candidates and of items) and by test length.
- SEM:
- A way to estimate the reliability of individual scores.
- Combines alpha and standard deviation, as in the sketch after this list.
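A minimal sketch of both statistics from a 0/1 scored item matrix (rows are candidates, columns are items). The data are invented; SEM = SD * sqrt(1 - alpha) is the standard formula combining the two quantities.

    from statistics import pvariance

    def cronbach_alpha(matrix):
        k = len(matrix[0])                       # number of items
        item_vars = [pvariance([row[i] for row in matrix]) for i in range(k)]
        total_var = pvariance([sum(row) for row in matrix])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    data = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
    alpha = cronbach_alpha(data)
    sd = pvariance([sum(row) for row in data]) ** 0.5
    sem = sd * (1 - alpha) ** 0.5                # SEM combines SD and alpha
    print(alpha, sem)                            # 0.8 and ~0.63 for this data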
27 - Reliability Figures

Scale Statistics
----------------
Scale:                1
N of Items           10
N of Examinees       20
Mean              6.000
Variance          6.211
Std. Dev.         2.492
Skew             -0.408
Kurtosis          0.368
Minimum           0.000
Maximum          10.000
Median            6.000
Alpha             0.724
SEM               2.180
Mean P            0.600
Mean Item-Tot.    0.545
Mean Biserial     0.735
Max Score (Low)       4
N (Low Group)         7
Min Score (High)      7
N (High Group)        9
28 - ITEMAN task
- How many items are there on the test?
- How many people took the test?
- Is the test easy or difficult for these test-takers?
- How reliable was the test overall?
- If a candidate actually scored 45 on this test, what was her true score?
29 - Where are the problems? Item Analysis
- How many candidates got the item right? Facility (proportion correct).
- Did the item sort the sheep from the goats? Discrimination (high scorers vs low scorers).
30 - Facility

[ITEMAN output columns: Seq. No., Scale-Item, Prop. Correct, Disc. Index, Point Biser.; Alternative Statistics: Alt., Prop. Endorsing (Total, Low, High), Point Biser., Key]

- Also known as p (proportion correct); see the sketch after this list.
- The acceptable range depends on the exam; .35 to .85 is OK for many.
- Very high or very low facility items provide little information; very low facility may reduce responses to guessing.
- Low facility for a distractor indicates poor pulling power.
- There may be good reasons why out-of-range items should be included.
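Facility is just a proportion; a minimal sketch with invented 0/1 scores for a single item.

    def facility(item_scores):
        # proportion of candidates answering the item correctly
        return sum(item_scores) / len(item_scores)

    item = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # invented 0/1 scores
    p = facility(item)
    print(p)                      # 0.7
    print(0.35 <= p <= 0.85)      # inside the rough acceptable band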
31 - Item Discrimination

[ITEMAN output columns as above]

Measures of item discrimination show how successfully an item distinguishes between higher- and lower-ability candidates by:
- explicitly dividing the candidates into high-scoring and low-scoring groups, or
- correlating scores on an individual item with total scores on the test.
32 - The Discrimination Index
- ITEMAN reports discrimination as P(high) - P(low).
- The highest-scoring and lowest-scoring 27% of candidates are compared: the proportion of the lowest-scoring group answering the item correctly is subtracted from the proportion of the highest-scoring group. A sketch follows below.
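A minimal sketch of that subtraction, pairing each candidate's total score with an invented 0/1 score on one item.

    def discrimination_index(pairs):            # pairs of (total, item 0/1)
        pairs = sorted(pairs, key=lambda t: t[0])
        k = max(1, round(0.27 * len(pairs)))    # size of each 27% group
        low, high = pairs[:k], pairs[-k:]
        prop = lambda group: sum(item for _, item in group) / len(group)
        return prop(high) - prop(low)

    data = [(3, 0), (4, 0), (5, 1), (6, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
    print(discrimination_index(data))           # 1.0 for this invented data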
33 - Point-Biserial Correlation
- The point-biserial correlation shows the relationship between candidates' performance on a single item and their performance on all items in the test (or part of the test).
- i.e. do those people who answer the item correctly also score highly on the rest of the test? (A sketch follows below.)
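The point-biserial is simply a Pearson correlation between a dichotomous (0/1) item score and the total score; a minimal sketch with invented data.

    from statistics import mean, pstdev

    def point_biserial(item, totals):
        mi, mt = mean(item), mean(totals)
        cov = mean((i - mi) * (t - mt) for i, t in zip(item, totals))
        return cov / (pstdev(item) * pstdev(totals))

    item   = [1, 0, 1, 1, 0, 1, 0, 1]    # invented 0/1 item scores
    totals = [8, 3, 7, 9, 4, 6, 5, 8]    # invented total scores
    print(point_biserial(item, totals))  # positive: item agrees with the test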
35 - Point-Biserial Correlation
- An item's potential to discriminate peaks at a facility of .5, i.e. when half the candidates are responding correctly.
36 - Item Analysis

                 Item Statistics            Alternative Statistics
          -----------------------    -----------------------------------
Seq.  Scale   Prop.   Disc.  Point           Prop. Endorsing       Point
No.   -Item  Correct  Index  Biser.  Alt.    Total  Low   High     Biser.  Key
----  -----  -------  -----  ------  -----   -----  ----  ----     ------  ---
  8   1-8      .40     .23    .24    A        .53   .62   .46      -.13
                                     B        .06   .10   .04      -.12
                                     C        .40   .28   .50       .24    *
                                     Other    .00   .00   .00      -.38

  9   1-9      .48     .39    .42    A        .06   .10   .02      -.22
                                     B        .48   .27   .66       .42    *
                                     C        .24   .30   .19      -.12
                                     D        .22   .32   .13      -.20
                                     Other    .00   .00   .00      -.38

(* marks the keyed answer)
37 - Item 9

Seq.  Scale   Prop.   Disc.  Point           Prop. Endorsing       Point
No.   -Item  Correct  Index  Biser.  Alt.    Total  Low   High     Biser.  Key
----  -----  -------  -----  ------  -----   -----  ----  ----     ------  ---
  9   1-9      .48     .39    .42    A        .06   .10   .02      -.22
                                     B        .48   .27   .66       .42    *
                                     C        .24   .30   .19      -.12
                                     D        .22   .32   .13      -.20
                                     Other    .00   .00   .00      -.38
38 - Item 9
39 - Item 8

Seq.  Scale   Prop.   Disc.  Point           Prop. Endorsing       Point
No.   -Item  Correct  Index  Biser.  Alt.    Total  Low   High     Biser.  Key
----  -----  -------  -----  ------  -----   -----  ----  ----     ------  ---
  8   1-8      .40     .23    .24    A        .53   .62   .46      -.13
                                     B        .06   .10   .04      -.12
                                     C        .40   .28   .50       .24    *
                                     Other    .00   .00   .00      -.38
40 - Item 8
41 - Item 8
42 - Item 8
- With a computer, we can see and speak to people in other countries for just a ___ pence per minute.
- A) little   B) small   C) few
43 - Item analysis
- Which were the easiest and most difficult items?
- Which item(s), if any, might you consider replacing? Why?
- If both item 2 and item 3 are testing the same ability and you want to choose one for a language proficiency test, which would you choose? Why?
44 - The Limitations of Classical Test Analysis
It is difficult to compare performances in different contexts. We need a common measure to compare them.

[Figure: candidates A, B and C placed on tests with different raw-score scales (25, 50, 100)]
45 - Test Equating
- How does this test compare with other tests in our bank?
- Is it of similar difficulty to other tests at the same level?
- Is it harder than lower-level and easier than higher-level tests?
- How much harder or easier is it?
- Can we select items that are at the right level of difficulty for the tests we want to build?
46 - Item Response Theory
- Relates items (and candidates) across tests.
- Estimates the probability that a candidate of known ability will succeed on an item of known difficulty.
- Useful for: test construction, grading, test scrutiny.
- Tests must be linked by common candidates and/or common items.
47 - Item Response Theory
[Diagram: person ability against item difficulty]
Low-ability candidate, easy item: 50% chance of success
48 - Item Response Theory
[Diagram: person ability against item difficulty]
Low-ability candidate, moderately difficult item: 10% chance of success
High-ability candidate, moderately difficult item: 90% chance of success
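Under the Rasch model, the simplest IRT model, the probability of success depends only on the gap between person ability and item difficulty in logits; a gap of about 2.2 logits reproduces the 10% and 90% figures above. The logit values below are invented for illustration.

    import math

    def p_correct(ability, difficulty):
        # Rasch model: probability rises with the ability-difficulty gap
        return 1 / (1 + math.exp(-(ability - difficulty)))

    print(p_correct(0.0, 0.0))    # ability = difficulty       -> 0.50
    print(p_correct(-1.1, 1.1))   # ~2.2 logits below the item -> ~0.10
    print(p_correct(3.3, 1.1))    # ~2.2 logits above the item -> ~0.90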
49 - Item Response Theory
[Figure: item characteristic curve. Probability of success (0 to 100%) plotted against item difficulty / person ability on a -3 to +3 logit scale.]
50 - Item link: the same items appear on two or more tests
51 - Person link: the same people take two or more tests of the same skills
52 - Linking Pretests at Cambridge ESOL
Pretests are taken by small numbers of candidates. All candidates take an anchor test plus a batch of pretests. The anchor links the tests together, and overlapping anchors are linked to the ESOL Common Scale. (A sketch of one simple linking method follows below.)
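One simple way an anchor can place new items on a common scale is mean-mean linking. This is a generic sketch, not Cambridge ESOL's actual procedure, and the logit values are invented.

    # Anchor item difficulties as stored in the bank vs as re-estimated
    # in the pretest calibration (invented logit values).
    bank_anchor    = {"a1": -0.50, "a2": 0.20, "a3": 1.10}
    pretest_anchor = {"a1": -0.90, "a2": -0.20, "a3": 0.70}

    # Mean difference on the anchor items = the shift onto the bank scale.
    shift = sum(bank_anchor[i] - pretest_anchor[i]
                for i in bank_anchor) / len(bank_anchor)

    new_items = {"q1": -1.20, "q2": 0.35}          # pretest-only items
    on_bank_scale = {q: d + shift for q, d in new_items.items()}
    print(shift, on_bank_scale)                    # shift = 0.40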
53 - The Common Scale
54 - Statistics for Pretesting
The picture is often confusing or ambiguous; it requires professional interpretation and judgement.
55 - Statistics for Pretesting
- To identify tasks at the appropriate level of difficulty for a test
- To flag any problems with items before they are used in Live tests
- To ensure that Live tests reflect the full range of abilities we wish to test