Transcript and Presenter's Notes

Title: Quality control in language tests: pretesting and item banking


1
  • Quality control in language tests
  • pretesting and item banking

ALTE Meeting, 9-11 November 2005
Dr Anthony Green, Cambridge ESOL Research and Validation
2
Which car would you buy?
3
Quality control: inspection and testing
Car safety ratings: Euro NCAP
Rigorous testing demonstrates and improves the
quality of a product
4
Quality Control for Test Material
Pre-editing and editing - Inspection
Pretesting and trialling - Testing
Test construction - Inspection
5
Standards for language testers
  • ILTA code of practice (2005)
  • 3. All tests, regardless of their purpose or use,
    must be reliable. Reliability refers to the
    consistency of the test results, to what extent
    they are generalizable and therefore comparable
    across time and across settings.
  • 4. The work of the task and item writers needs to
    be edited before pretesting. If pretesting is not
    possible, tasks and items should be analysed
    after the test has been administered but before
    results are reported. Malfunctioning or
    misfitting tasks and items should not be included
    in the calculation of individual test takers'
    reported scores.

6
Range of ability and difficulty
Individuals have different levels of language
ability
Language tasks have different levels of difficulty
7
Using Tests to Obtain Information
8
The Pretesting Process
(Live)
9
The Pretesting Process
10
The Data File
Identifiers
Item responses
0090040112 000260001MToma       CCDDCABCBBCAD100101101111
0090040112 000260002FSukaria    CCDDCADDBBAAA0101111011O1
0090040112 000260003KTakahashi  CCDDCADBBAAAD111101101111
0090040112 000260004SSunaga     CCDDCBBDBBDAD111011100110
0090040112 000260005FPaolicchi  CCDDCADDBBAAD111111101100
0090040112 000260006JStassart   CCDDCBABABBBD110101101100
0090040112 000260007RMagome     CCDDCACDBBCAD111101001101
0090040112 000260008YMatsuzaki  CCDDDABDBBABD111111111111
0090020112 001160001AAlhinai    BCDDCACCABAAB100001011101
0090040112 001160002SAlzefeiti  CCADCABCBBCAC101101111010
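As an illustration of what happens to records like these, here is a minimal Python sketch of scoring raw letter responses against an answer key to produce the 0/1 item scores used in the analyses below. The key and responses are invented, and this is not Cambridge ESOL's actual processing.

KEY = "CCDDCADDBBAAD"   # hypothetical answer key, one letter per item

def score_responses(responses, key=KEY):
    # 1 where the candidate's letter matches the key, 0 otherwise
    return [int(r == k) for r, k in zip(responses, key)]

print(score_responses("CCDDCABCBBCAD"))  # [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1]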
11
The Candidates: Who took the test?
  • Candidate numbers sufficient to provide useful
    data?
  • Variety of backgrounds and language ability
    reflecting target population?
  • Range of language ability covered within each
    group?

12
Classical Test Analysis (ITEMAN)
  • Is the difficulty of the test appropriate to the
    level of the candidates?
  • Does the test discriminate between lower and
    higher-ability candidates?
  • Are the items consistent with each other? Do they
    all tell the same story?

13
Analysing the Scored Data File
14
Finding patterns in test data
15
(No Transcript)
16
Suitability of Test: Central Tendency
  • Are scores mostly high, low, or in the middle?
  • Mean and Median
  • The mean (average) is the sum of all scores
    divided by the number of people
  • The median is the score in the middle of a
    distribution of scores; that is, half the scores
    are below the median, half are above
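In Python terms, a minimal sketch with invented scores (not data from the slides):

from statistics import mean, median

scores = [3, 4, 4, 5, 6, 6, 7, 8, 9, 10]   # invented scores for 10 candidates
print(mean(scores))    # 6.2 = sum of all scores / number of candidates
print(median(scores))  # 6.0 = the middle of the distribution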

17
Suitability of Test: Dispersion
  • Are scores bunched together, or widely spread
    out?
  • Standard Deviation
  • The average amount by which scores vary from the
    mean.
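Continuing the same invented example, a sketch of the (population) standard deviation:

from statistics import pstdev

scores = [3, 4, 4, 5, 6, 6, 7, 8, 9, 10]
print(pstdev(scores))  # about 2.18: scores typically lie about 2 points from the mean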

18
Normal Distribution
A 200-item test has (for example) a mean of 100
and an SD of 15. About 68% of candidates score
within 1 SD of the mean (between 85 and 115);
about 95% of candidates score within 2 SDs
(between 70 and 130).
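A quick check of these figures, as a sketch using Python's statistics.NormalDist for the example test above:

from statistics import NormalDist

d = NormalDist(mu=100, sigma=15)
print(d.cdf(115) - d.cdf(85))   # about 0.683 of candidates within 1 SD (85-115)
print(d.cdf(130) - d.cdf(70))   # about 0.954 of candidates within 2 SDs (70-130)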
19
Score Distribution
Number    Freq-   Cum
Correct   uency   Freq    PR   PCT
-------   -----   ----   ---   ---
   0        1       1      5     5
   1        0       1      5     0
   2        0       1      5     0
   3        1       2     10     5
   4        5       7     35    25
   5        0       7     35     0
   6        4      11     55    20
   7        4      15     75    20
   8        2      17     85    10
   9        1      18     90     5
  10        2      20     99    10

(Bar chart omitted; horizontal axis: Percentage of Examinees, 5 to 25)
20
Descriptive Statistics
Scale Statistics
----------------
Scale                      1
N of Items                10
N of Examinees            20
Mean                   6.000
Variance               5.900
Std. Dev.              2.429
Skew                  -0.377
Kurtosis              -0.004
Minimum                0.000
Maximum               10.000
Median                 6.000
Alpha                  0.667
SEM                    2.180
Mean P                 0.618
Mean Item-Tot.         0.329
Mean Biserial          0.437
Max Score (Low)           13
N (Low Group)             39
Min Score (High)          18
N (High Group)            33
21
Does the Test Work? Reliability
  • Stability: does the test yield consistent results
    on two or more occasions?
  • Internal consistency: do all parts of the test
    provide consistent information?
22
Ordering the data
Person ability
23
(No Transcript)
24
Ordering the data
Person ability
Task difficulty
25
(No Transcript)
26
Reliability
  • Alpha
  • Measured from 0 to 1
  • The higher the alpha the more reliable the test
  • Affected by homogeneity (of candidates and of
    items) and by test length
  • SEM
  • A way to estimate the reliability of individual
    scores
  • Combines alpha and standard deviation
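A simplified sketch of how alpha and SEM can be computed from a 0/1 item-score matrix. The data are invented and this is the standard Cronbach's alpha formula, not Cambridge ESOL's own code:

from statistics import pvariance

def cronbach_alpha(matrix):
    # matrix: rows = candidates, columns = 0/1 item scores
    k = len(matrix[0])
    item_vars = [pvariance([row[i] for row in matrix]) for i in range(k)]
    total_var = pvariance([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def sem(matrix):
    sd = pvariance([sum(row) for row in matrix]) ** 0.5
    return sd * (1 - cronbach_alpha(matrix)) ** 0.5   # SEM = SD * sqrt(1 - alpha)

data = [[1, 1, 1, 0, 1],
        [1, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 0, 0, 0],
        [1, 1, 1, 1, 0]]
print(round(cronbach_alpha(data), 3), round(sem(data), 3))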

27
Reliability Figures
Scale Statistics
----------------
Scale                      1
N of Items                10
N of Examinees            20
Mean                   6.000
Variance               6.211
Std. Dev.              2.492
Skew                  -0.408
Kurtosis               0.368
Minimum                0.000
Maximum               10.000
Median                 6.000
Alpha                  0.724
SEM                    2.180
Mean P                 0.600
Mean Item-Tot.         0.545
Mean Biserial          0.735
Max Score (Low)            4
N (Low Group)              7
Min Score (High)           7
N (High Group)             9
28
ITEMAN task
  • How many items are there on the test?
  • How many people took the test?
  • Is the test easy or difficult for these
    test-takers?
  • How reliable was the test overall?
  • If a candidate actually scored 45 on this test,
    what was her true score?

29
Where are the problems? Item Analysis
  • How many candidates got the item right?
  • Facility (proportion correct)
  • Did the item sort the sheep from the goats?
  • Discrimination (High scorers vs low scorers)

30
Facility
               Item Statistics                 Alternative Statistics
         -----------------------    -----------------------------------
Seq.  Scale  Prop.    Disc.   Point           Prop. Endorsing      Point
No.   -Item  Correct  Index   Biser.   Alt.   Total  Low   High    Biser.  Key
----  -----  -------  ------  ------   -----  -----  ----  ----    ------  ---
  • Also known as p (proportion correct)
  • The acceptable range depends on the exam; .35 to
    .85 is OK for many
  • Very high or very low facility items provide
    little information; very low facility may reduce
    responses to guessing
  • Low facility for a distractor indicates poor
    pulling power
  • There may be good reasons why out-of-range items
    should be included
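A one-line sketch of the calculation, using invented 0/1 scores for a single item:

def facility(item_scores):
    # proportion of candidates answering the item correctly
    return sum(item_scores) / len(item_scores)

print(facility([1, 0, 1, 1, 0, 1, 1, 0, 1, 1]))  # 0.7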

31
Item Discrimination
               Item Statistics                 Alternative Statistics
         -----------------------    -----------------------------------
Seq.  Scale  Prop.    Disc.   Point           Prop. Endorsing      Point
No.   -Item  Correct  Index   Biser.   Alt.   Total  Low   High    Biser.  Key
----  -----  -------  ------  ------   -----  -----  ----  ----    ------  ---
  • Measures of item discrimination show how
    successfully an item distinguishes between higher
    and lower ability candidates by
  • explicitly dividing the candidates into
    high-scoring and low-scoring groups, or
  • correlating scores on an individual item with
    total scores on the test.

32
The Discrimination Index
  • ITEMAN reports discrimination as
  • P high - P low
  • The highest-scoring (27%) and lowest-scoring (27%)
    groups of candidates are compared. The proportion
    of the lowest-scoring candidates answering
    correctly is subtracted from the proportion of the
    highest-scoring group.
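A sketch of that calculation under the 27% convention described above, with invented scores:

def discrimination_index(item_scores, total_scores, fraction=0.27):
    # compare the top and bottom `fraction` of candidates by total score
    n = max(1, round(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = order[:n], order[-n:]
    p_low = sum(item_scores[i] for i in low) / n
    p_high = sum(item_scores[i] for i in high) / n
    return p_high - p_low

item   = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]   # 0/1 scores on one item
totals = [2, 3, 4, 4, 5, 6, 7, 8, 9, 10]  # total scores, same candidate order
print(discrimination_index(item, totals))  # 1.0: top 27% all right, bottom 27% all wrong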

33
Point-Biserial Correlation
  • The point-biserial correlation shows the
    relationship between candidates' performance on
    a single item and their performance on all items
    in the test (or part of the test)
  • i.e. Do those people who answer the item
    correctly also score highly on the rest of the
    test?
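A sketch of this correlation (equivalent to a Pearson r between a 0/1 item variable and the total score; the data are invented):

from statistics import mean, pstdev

def point_biserial(item_scores, total_scores):
    mi, mt = mean(item_scores), mean(total_scores)
    cov = mean((i - mi) * (t - mt) for i, t in zip(item_scores, total_scores))
    return cov / (pstdev(item_scores) * pstdev(total_scores))

item   = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
totals = [2, 3, 4, 4, 5, 6, 7, 8, 9, 10]
print(round(point_biserial(item, totals), 2))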

34
(No Transcript)
35
Point-Biserial Correlation
  • An item's potential to discriminate peaks at
    a facility of .5, i.e. when half the candidates
    respond correctly.

36
Item Analysis
               Item Statistics                 Alternative Statistics
         -----------------------    -----------------------------------
Seq.  Scale  Prop.    Disc.   Point           Prop. Endorsing      Point
No.   -Item  Correct  Index   Biser.   Alt.   Total  Low   High    Biser.  Key
----  -----  -------  ------  ------   -----  -----  ----  ----    ------  ---
  8    1-8     .40     .23     .24      A      .53    .62   .46     -.13
                                        B      .06    .10   .04     -.12
                                        C      .40    .28   .50      .24
                                        Other  .00    .00   .00     -.38

  9    1-9     .48     .39     .42      A      .06    .10   .02     -.22
                                        B      .48    .27   .66      .42
                                        C      .24    .30   .19     -.12
                                        D      .22    .32   .13     -.20
                                        Other  .00    .00   .00     -.38
37
Item 9
               Item Statistics                 Alternative Statistics
         -----------------------    -----------------------------------
Seq.  Scale  Prop.    Disc.   Point           Prop. Endorsing      Point
No.   -Item  Correct  Index   Biser.   Alt.   Total  Low   High    Biser.  Key
----  -----  -------  ------  ------   -----  -----  ----  ----    ------  ---
  9    1-9     .48     .39     .42      A      .06    .10   .02     -.22
                                        B      .48    .27   .66      .42
                                        C      .24    .30   .19     -.12
                                        D      .22    .32   .13     -.20
                                        Other  .00    .00   .00     -.38
38
Item 9
39
Item 8
               Item Statistics                 Alternative Statistics
         -----------------------    -----------------------------------
Seq.  Scale  Prop.    Disc.   Point           Prop. Endorsing      Point
No.   -Item  Correct  Index   Biser.   Alt.   Total  Low   High    Biser.  Key
----  -----  -------  ------  ------   -----  -----  ----  ----    ------  ---
  8    1-8     .40     .23     .24      A      .53    .62   .46     -.13
                                        B      .06    .10   .04     -.12
                                        C      .40    .28   .50      .24
                                        Other  .00    .00   .00     -.38
40
Item 8
41
Item 8
42
Item 8
  • With a computer, we can see and speak to people
    in other countries for just a ___ pence per
    minute.
  • A) little B) small C) few

43
Item analysis
  • Which were the easiest and most difficult items?
  • Which item(s), if any, might you consider
    replacing? Why?
  • If both item 2 and item 3 are testing the same
    ability and you want to choose one for a language
    proficiency test, which would you choose? Why?

44
The Limitations of Classical Test Analysis
It is difficult to compare performances in
different contexts. We need a common measure to
compare them.

(Diagram omitted; labels: C - 100, B - 50, A - 25)
45
Test Equating
  • How does this test compare with other tests in
    our bank?
  • Is it of similar difficulty to other tests at the
    same level?
  • Is it harder than lower level and easier than
    higher level tests?
  • How much harder or easier is it?
  • Can we select items that are at the right level
    of difficulty for the tests we want to build?

46
Item Response Theory
  • Relates items (and candidates) across tests
  • Estimates probability that a candidate of known
    ability will succeed on an item of known
    difficulty
  • Useful for
  • Test Construction
  • Grading
  • Test scrutiny
  • Tests must be linked by common candidates and/or
    common items

47
Item Response Theory
Person Ability
Item Difficulty
Low-ability candidate, easy item: 50% chance
48
Item Response Theory
Person Ability
Item Difficulty
Low-ability candidate, moderately difficult item:
10% chance
High-ability candidate, moderately difficult
item: 90% chance
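One way to read these figures is through the Rasch (one-parameter IRT) model, where the chance of success depends only on the gap between person ability and item difficulty. A sketch, with ability and difficulty values invented to mirror the three cases above:

from math import exp

def p_correct(ability, difficulty):
    # Rasch model: P = exp(b - d) / (1 + exp(b - d)) on the logit scale
    return exp(ability - difficulty) / (1 + exp(ability - difficulty))

print(p_correct(-1.0, -1.0))  # ability equals difficulty: 50% chance
print(p_correct(-1.0,  1.2))  # low ability, harder item: about 10%
print(p_correct( 3.4,  1.2))  # high ability, same item: about 90%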
49
Item Response Theory
(Item characteristic curve omitted: probability of success, 0-100%,
plotted against item difficulty / person ability on a scale from -3 to +3)
50
Item link: the same items appear on two or more
tests
51
Person link: the same people take two or more
tests of the same skills
52
Linking Pretests at Cambridge ESOL
Pretests: small numbers of candidates
All candidates take an anchor test plus a batch of
pretests
The anchor links the tests together
Overlapping anchors are linked to the ESOL Common
Scale
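A much-simplified sketch of the idea of common-item (anchor) linking: shift a pretest's item difficulties so that the anchor items have the same mean difficulty they already have in the bank. The function and values are illustrative only, not Cambridge ESOL's actual procedure:

def link_to_bank(pretest_difficulties, anchor_ids, bank_anchor_difficulties):
    # mean-shift equating: put pretest difficulties onto the bank's scale
    shift = (sum(bank_anchor_difficulties[i] for i in anchor_ids)
             - sum(pretest_difficulties[i] for i in anchor_ids)) / len(anchor_ids)
    return {item: d + shift for item, d in pretest_difficulties.items()}

pretest = {"A1": -0.4, "A2": 0.1, "N1": 0.8, "N2": -1.2}   # A1, A2 = anchor items
bank    = {"A1": 0.2, "A2": 0.9}                           # anchor difficulties in the bank
print(link_to_bank(pretest, ["A1", "A2"], bank))           # all items now on the bank scale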
53
The Common Scale
54
The picture is often confusing or ambiguous.
It requires professional interpretation and
judgement.
Statistics for Pretesting
55
Statistics for Pretesting
  • To identify tasks at the appropriate level of
    difficulty for a test
  • To flag any problems with items before they are
    used in Live tests
  • To ensure that Live tests reflect the full range
    of abilities we wish to test