Title: How to Assess and Measure Competency
1. How to Assess and Measure Competency
- Robert C. Shaw, Jr., PhD
- Program Director
 
2. Presentation Outline
- Describe a program's responsibilities
- Assess appropriate content
- Measure abilities as precisely as possible
- Reference each cut score to a criterion
 
3. The validity claim
- Our program is confident we can make valid inferences from an assessment because
  - we carefully selected and structured the content
  - and
  - observed scores are reasonably precise
- Weakness in either claim diminishes the validity argument
4. Define appropriate content
5. Information sources for content
- Certification boards' expectations
6. What should we assess?
- A program should seek multiple opinions about program content
  - May mean more than one faculty person in the program
  - Could extend to survey results from several stakeholders
    - Those who hire your graduates
    - Those who graduated
7. Describe potential content
- Define potential content by describing job behaviors or tasks
  - Interpret ABG results
  - Determine the appropriate time to refer a patient for consultation from another service
  - Adjust mechanical ventilation settings to optimize oxygenation for a patient while minimizing the risk of pulmonary injury
8. Define terminal behaviors
- Focus terminal assessments on the end-product behavior you expect students to master
  - Insert a pulmonary artery catheter in a patient within a critical care setting using standard technique while minimizing risks of infection and lung involvement
  - Integrate pulmonary function testing results with patient history and other laboratory results to produce a diagnosis
9. Measure task criticality
- Typically expressed by the interaction of an
  - importance/significance/risk measure
  - and a
  - frequency/extent measure
 
10. Potential survey measurements
- How important is the task to success?
- OR
- How significant is the task to safe and effective practice?
  - 4 = Extremely
  - 3 = Very
  - 2 = Moderately
  - 1 = Minimally
 
11. Potential survey measurements
- If this task is incorrectly performed, how strong is the risk?
  - 3 = Potentially fatal
  - 2 = Likely to increase morbidity
  - 1 = Unlikely to have an adverse effect
 
12. Potential survey measurements
- How frequently do you perform the task?
  - 3 = Every week
  - 2 = A few times each year
  - 1 = Less than once a year
- OR
  - 3 = Very often
  - 2 = Occasionally
  - 1 = Infrequently
 
13. Potential survey measurements
- Have you performed the task in the last year?
 
14. What can we do with task measurements?
- Norm-referenced approach
  - Rank order tasks from most to least critical
  - Start at the top and work down using available time
- Criterion-referenced approach
  - Identify tasks that are sufficiently critical to ensure program coverage and competency assessment (see the sketch below)
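As a purely illustrative sketch of how the survey ratings above might be combined and used: the Python below computes a simple criticality index (mean importance times mean frequency) and then applies both a norm-referenced ranking and a criterion-referenced cut. The task names, ratings, index formula, and cut value are hypothetical, not taken from the presentation.

```python
# Illustrative only: combine survey ratings into a criticality index and rank tasks.
# Task names, ratings, the index definition, and the cut value are hypothetical.

tasks = {
    "Interpret ABG results":        {"importance": [4, 3, 4], "frequency": [3, 3, 2]},
    "Adjust ventilator settings":   {"importance": [4, 4, 4], "frequency": [3, 2, 3]},
    "Insert pulmonary artery line": {"importance": [3, 4, 3], "frequency": [1, 1, 2]},
}

def criticality(ratings):
    """Mean importance multiplied by mean frequency, one common way to express the interaction."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(ratings["importance"]) * mean(ratings["frequency"])

# Norm-referenced use: rank tasks from most to least critical and work down the list.
ranked = sorted(tasks, key=lambda t: criticality(tasks[t]), reverse=True)
for task in ranked:
    print(f"{criticality(tasks[task]):5.2f}  {task}")

# Criterion-referenced use: keep every task at or above a cut value chosen by the faculty.
CUT = 6.0  # hypothetical threshold
covered = [t for t in tasks if criticality(tasks[t]) >= CUT]
print("Covered tasks:", covered)
```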
15. Select item type(s) for each assessment
- Constructed response (e.g., short answer, essay, performance)
  - Short development time
  - Long scoring time
  - Scores have strong subjective characteristics
- Selected response (e.g., true/false, matching, multiple-choice)
  - Long development time
  - Short scoring time
  - Scores have strong objective characteristics
 
16. High-stakes terminal assessments should be standardized
- Specify how the assessment should look before writing/selecting items
- Test specifications ensure each assessment is similar, fair, and covers critical content
17. Test specifications are typically two-dimensional
18. Entire test blueprint/matrix
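As a purely hypothetical illustration (the tasks, cognitive levels, and percentages below are invented, not from the presentation), a two-dimensional blueprint crosses tasks with cognitive process levels and assigns each cell a share of the items:

    Task                         Recall   Application   Analysis   Total
    Interpret ABG results           4%            6%         5%      15%
    Adjust ventilator settings      5%           12%         8%      25%
    ...                            ...           ...        ...      ...
    Total                          30%           40%        30%     100%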
19. Test specifications and items
- Each item should be linked to a task and a cognitive process level
- It helps to store items in a database
- A sophisticated database will permit additional layers of classification (see the sketch below)
  - Acute/chronic care
  - Age groups
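A minimal sketch of what such an item record might look like, assuming a simple in-house database; all field names, classification values, and the sample item are hypothetical.

```python
# Illustrative only: one way an item bank record might link each item to a task,
# a cognitive process level, and extra classification layers. Fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ItemRecord:
    item_id: str
    stem: str
    options: list[str]
    key: int                 # index of the correct option
    task: str                # e.g., "Interpret ABG results"
    cognitive_level: str     # e.g., "Recall", "Application", "Analysis"
    care_setting: str        # e.g., "Acute" or "Chronic"
    age_group: str           # e.g., "Adult", "Pediatric", "Neonatal"

item = ItemRecord(
    item_id="ABG-017",
    stem="A patient with these ABG values most likely has which acid-base disturbance?",
    options=["Metabolic acidosis", "Respiratory acidosis", "Metabolic alkalosis"],
    key=1,
    task="Interpret ABG results",
    cognitive_level="Application",
    care_setting="Acute",
    age_group="Adult",
)
```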
 
20. Item banking software
- FastTest
  - www.assess.com/frmSoftCat.htm
- ExamView
  - www.pearsonncs.com/examview/examview.htm
- LXRTest
  - www.lxrtest.com/
 
21. Measure abilities precisely
- Are we confident an assessment has yielded a sufficiently precise ability estimate?
22. Reliability
- Theoretical premise
  - Observed scores are assumed to express true ability plus some measurement error (written out below)
  - High reliability implies low measurement error
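In standard classical test theory notation (not part of the slide itself), this premise is usually written as:

```latex
X = T + E, \qquad \text{reliability} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

where X is the observed score, T the true score, and E the measurement error; high reliability means the error variance is a small share of the observed score variance, which is exactly the "percentage of observed score variance attributed to true score variance" described on the next slide.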
 
23. Reliability
- Reliability indices are R² values, which express the percentage of observed score variance that can be attributed to true score variance
- How high is high enough?
  - A test score reliability value of at least .85 is characteristic of large-scale, standardized assessments; many exceed .90
  - Sufficiently reliable test scores from a test built by a program should show values of at least .60
24. Reliability
- Reliability is an attribute of a set of test scores; it is not an attribute of a test
- Therefore, a program should assess reliability for each group
- KR-20 is appropriate for dichotomously scored (0, 1) items
- Coefficient alpha works for polytomously scored (0, 1, …, n) items (see the sketch below)
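A minimal sketch of the calculation, assuming a small students-by-items score matrix (the scores below are hypothetical). Coefficient alpha is computed directly; with dichotomous (0/1) items the same formula reduces to KR-20.

```python
# Illustrative only: coefficient alpha (and KR-20 for 0/1 items) from a students x items matrix.
# The small score matrix below is hypothetical.

def coefficient_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores).
    With dichotomously scored (0/1) items this equals KR-20."""
    k = len(scores[0])                       # number of items

    def variance(xs):                        # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical results: 5 students x 6 dichotomously scored items
scores = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
]
print(f"KR-20 / alpha = {coefficient_alpha(scores):.2f}")
```

Because reliability belongs to a set of scores rather than to the test, a program would rerun this calculation for each student group's scores.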
25. Why are selected response items used for so many assessments?
- Assuming the time to assess is constant, more responses can be elicited from students using selected response items
  - more items → broader content coverage → increased information → enhanced measurement precision → stronger validity
- Scores are more strongly objective
 
26. Add items or options?
- A program cannot go wrong by adding more items to an assessment
- A program may only consume space and time by adding more options to multiple-choice items
- There is growing evidence that items with 3 options are optimal, particularly when doing so permits inclusion of more items on an assessment
  - Dr. Thomas Haladyna, Arizona State University
 
27. Up to a point, measurement precision and item quantity are directly related
[Figure: reliability plotted against item count, with separate curves for higher quality items and lower quality items]
28. What encourages high item quality?
- Write well
  - Clear, concise, accurate
  - Remove unnecessary information from the stimulus
  - Present nuanced choices that require a sophisticated mastery of material to correctly respond
- Item review is another opportunity to seek multiple opinions
29. What encourages high item quality?
- Avoid formats known to be flawed
  - D. All of the above
  - D. None of the above
  - Negative wording
    - All of the following are true EXCEPT
    - Which of the following is not true?
 
30. What encourages high item quality?
- Apply quality improvement principles
  - Analyze item performance
  - Retain items that contribute to test score reliability
  - Change or discard items that fail to contribute or that negatively affect reliability
31. Item analysis properties
- Difficulty
  - p = proportion of students who correctly responded
- Discrimination
  - r_pb = point-biserial correlation between item success and students' test scores (see the sketch below)
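A minimal sketch of a classical item analysis, reusing a hypothetical students-by-items 0/1 score matrix: p is the proportion correct for each item, and r_pb is computed as the correlation between item success and each student's total score (requires Python 3.10+ for statistics.correlation).

```python
# Illustrative only: item difficulty (p) and discrimination (r_pb) from a 0/1 score matrix.
# The score matrix is hypothetical. Requires Python 3.10+ for statistics.correlation.
import statistics

def item_analysis(scores):
    totals = [sum(row) for row in scores]            # each student's total test score
    results = []
    for i in range(len(scores[0])):
        item = [row[i] for row in scores]
        p = sum(item) / len(item)                    # proportion who responded correctly
        if len(set(item)) > 1:
            # Pearson correlation of a 0/1 item with the total score is the point-biserial.
            r_pb = statistics.correlation(item, totals)
        else:
            r_pb = float("nan")                      # no variance, discrimination undefined
        results.append((i + 1, p, r_pb))
    return results

scores = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
]
for item_no, p, r_pb in item_analysis(scores):
    print(f"Item {item_no}: p = {p:.2f}, r_pb = {r_pb:.2f}")
```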
32. Item difficulty
[Figure: contribution to test score reliability plotted against item difficulty (p, from 0.0 to 1.0), peaking at moderate p values around 0.4-0.6]
33. Item discrimination
- Because r_pb values are correlations, values reflect one of three possibilities relative to reliability
  - Positive contribution
  - No contribution
  - Negative contribution
 
34. Using item parameters diagnostically
- Relative to reliability contribution, item
  - p values provide magnitude information
  - r_pb values provide magnitude and direction (+ or -) information
35. Using item parameters diagnostically
- Difficulty and discrimination properties equally contribute to reliability
  - The best items show .30 < p < .70 AND r_pb > .20
  - The worst items exist at the difficulty extremes and show zero or negative discrimination
36. After diagnosing an item that shows a weak or negative reliability contribution
- What should we do?
  - Observe option response frequencies and mean scores (see the sketch below)
  - Identify incorrect responses that attracted students with test scores equal to or greater than the average
  - Replace the offending option with a less attractive response
  - Rewrite the stem to clarify ambiguities
  - OR
  - Discard the whole item and use a better one the next time
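A minimal sketch of that option-level diagnostic for a single multiple-choice item; the chosen options, test scores, and keyed answer below are hypothetical. It reports how often each option was chosen and the mean test score of the students who chose it, flagging distractors that attracted above-average students.

```python
# Illustrative only: option response frequencies and mean scores for one item.
# Responses, scores, and the key are hypothetical; option "B" is the keyed answer here.

responses = ["A", "B", "B", "C", "B", "A", "C", "B", "A", "B"]   # option chosen by each student
totals    = [ 88,  82,  78,  60,  90,  85,  52,  76,  80,  74]   # each student's test score
key = "B"

test_mean = sum(totals) / len(totals)
for option in sorted(set(responses)):
    chosen = [t for r, t in zip(responses, totals) if r == option]
    mean_score = sum(chosen) / len(chosen)
    flag = " <- distractor drawing above-average students" if option != key and mean_score >= test_mean else ""
    print(f"Option {option}: chosen {len(chosen):2d} times, mean test score {mean_score:5.1f}{flag}")
print(f"Test mean = {test_mean:.1f}")
```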
37. Item analysis software
- Iteman
  - www.assess.com/Software/iteman.htm
- examSystem II
  - www.pearsonncs.com/examsystem/index.htm
- LXRTest
  - www.lxrtest.com/
- True Score II
  - www.nine-patch.com/TSCDL.htm
- Excel Templates (free)
  - www.eflclub.com/elvin/publications/2003/itemanalysis.html
38. Internal resources may be available
- There is a good probability that a large university with education, psychology, and/or statistics departments will have a system available for scoring items and providing analyses of test scores and items
39. Reference each cut score to a criterion
- Should we define and assess minimal competence for our program?
40. Cut points
- Highly reliable test scores reveal differences between students' abilities and can help accurately rank order students, which may be important to employers
- However, the program is likely interested in assessing whether each student is sufficiently competent to practice safely and effectively
- Such assessment concerns typically surface as students are about to graduate
41. Measuring minimal competence
- A program should decide whether it wants to create one large assessment with a single compensatory cut point
- OR
- Should each content domain have its own cut (a conjunctive model)? (see the sketch below)
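A minimal sketch of the difference between the two decision rules, using hypothetical domain names, scores, and cut values: under a compensatory rule strong domains can offset a weak one, while a conjunctive rule requires every domain to clear its own cut.

```python
# Illustrative only: the same domain scores judged under a compensatory rule
# (one overall cut) and a conjunctive rule (a cut per domain).
# Domains, scores, and cut values are hypothetical.

domain_scores = {"Ventilation": 78, "Pharmacology": 62, "Diagnostics": 85}
domain_cuts   = {"Ventilation": 70, "Pharmacology": 70, "Diagnostics": 70}
overall_cut   = 70

compensatory_pass = sum(domain_scores.values()) / len(domain_scores) >= overall_cut
conjunctive_pass  = all(domain_scores[d] >= domain_cuts[d] for d in domain_scores)

print(f"Compensatory: {'pass' if compensatory_pass else 'fail'}")  # strong domains offset the weak one
print(f"Conjunctive:  {'pass' if conjunctive_pass else 'fail'}")   # every domain must clear its own cut
```

This also shows why the conjunctive model is the more demanding choice: each component test must stand on its own scores, which is why each module needs enough items to produce adequately reliable scores, as the next slide explains.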
42. Why are there so many compensatory-cut competency assessments?
- If a program selects the more rigorous conjunctive model, then each component test will produce its own set of scores, each with its own reliability
- Each component must have a sufficient number of items or data points to be confident each student group's test scores will show adequate reliability
- Modules of fewer than 80-100 program-made items are unlikely to produce adequate reliability
43. Seek multiple opinions . . . again
- Program faculty should define the skills that competent practitioners possess
- This is a group activity
- Each cut point should be linked to a definition of minimally competent practitioners
44. Performance assessments
- Pick your spots
- Ensure a sufficient quantity of information is collected
- Standardize administration
- Measure agreement between/among evaluators (see the sketch below)
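A minimal sketch of measuring agreement between two evaluators who each rated the same performances pass (1) or fail (0); the ratings are hypothetical. It reports simple percent agreement and Cohen's kappa, which corrects for agreement expected by chance.

```python
# Illustrative only: agreement between two evaluators on hypothetical pass/fail ratings.
from collections import Counter

rater_a = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # percent agreement

# Cohen's kappa: correct observed agreement for the agreement expected by chance.
pa, pb = Counter(rater_a), Counter(rater_b)
expected = sum((pa[c] / n) * (pb[c] / n) for c in set(rater_a) | set(rater_b))
kappa = (observed - expected) / (1 - expected)

print(f"Percent agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```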
 
45. Summary
- Collective opinions are closer to the truth than any one opinion about
  - appropriate assessment content,
  - item quality, and
  - justifiable cut scores
- Unreliable scales have no utility
 
46. Thank you for the opportunity to share some details about measurement