Title: Measurement Issues in Selection
1. Measurement Issues in Selection
- Basic Measurement Issues, Reliability, and Validity
2. Why Measurement Issues are Critical
- Need sound data upon which to make high-quality selection decisions
- The judicial system gives great weight to professional and scientific standards
- Job relatedness of selection devices is decided on technical issues (i.e., reliability and validity of methods)
3. Basic Measurement Issues
- Measurement is simply the use of rules to assign numbers to objects to represent quantities of attributes (e.g., mechanical ability, interpersonal effectiveness, knowledge of U.S. history, etc.)
- People vary in the quantity of different attributes they possess
- We want to make reliable distinctions between people on important job attributes
4. Basic Measurement Issues
- Predictor: any technique used to predict some aspect of job performance
  - tests, interviews, work samples, biodata, etc.
- Criterion: a measure or definition of successful job performance
  - productivity, absenteeism, tardiness, dollar sales, speed of performance, commitment, etc.
5. Typical Predictors Used in Selection Processes
- Physical characteristics
  - eyesight, hearing, manual dexterity, strength, reflexes, reaction time
- Psychological characteristics
  - personality traits, attitudes, honesty, tolerance for stress, tolerance for ambiguity
- Behavioral characteristics
  - work history, skills, past job performance
- Cognitive abilities
  - verbal, quantitative, mechanical, spatial, reasoning
6. Basic Measurement Process
- Choose what attributes you want to measure
- Develop an operational definition of each attribute
- Construct a measure of each attribute (if none already exists) based on the operational definitions
- Use the measures to assess the attributes in each candidate
7. Basic Measurement Issues
- Scales of measurement
  - Nominal: we can classify individuals into different categories
  - Ordinal: we can rank individuals relative to each other
  - Interval: we can make ratings of meaningful differences among individuals
  - Ratio: we can derive ratios comparing one individual to another
- The scale of measurement determines the precision of the data and the statistical analyses that can be performed
8. Basic Measurement Issues
- Central Tendency
  - Mean: ΣX / N
  - Median: middle score
  - Mode: most frequent score
- Variability
  - Range: spread of scores
  - Standard deviation: average amount of deviation of scores around the average score (see the sketch below)
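As a minimal sketch (not from the original slides), these summary statistics could be computed for a set of test scores with Python's standard statistics module; the scores below are invented for illustration:

```python
import statistics

# Hypothetical test scores for ten applicants (invented data)
scores = [72, 75, 75, 78, 80, 81, 83, 85, 88, 90]

mean = statistics.mean(scores)            # sum of scores / N
median = statistics.median(scores)        # middle score
mode = statistics.mode(scores)            # most frequent score
score_range = max(scores) - min(scores)   # spread of scores
sd = statistics.pstdev(scores)            # standard deviation around the mean

print(mean, median, mode, score_range, sd)
```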
9. Basic Measurement Issues
- Standardization: measuring attributes of, administering procedures to, and scoring test results of individuals in the same manner to control outside influences
- Norms: scores of relevant others for use in score interpretation
  - Results in relative standings on an attribute
  - Normative group should be relevant to the purpose, best if local, and re-checked periodically for change
10. Basic Measurement Issues
- Percentile scores: percentage of people in a norm group who fall below a given score on a measure
  - Problem: based on an ordinal scale of measurement but often interpreted on a ratio scale
- Standard scores: scores that have common measurement units such that equal intervals exist between scores, such as z scores (see the sketch below)
  z = (X - M) / SD
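A hedged illustration of the z-score formula above; the norm-group mean and standard deviation are made-up numbers:

```python
def z_score(x, mean, sd):
    """Standard score: how many SDs a raw score falls above or below the mean."""
    return (x - mean) / sd

# Hypothetical norm group with mean = 75 and SD = 10
print(z_score(70, 75, 10))  # -0.5 -> half an SD below the norm-group mean
print(z_score(90, 75, 10))  #  1.5 -> one and a half SDs above the mean
```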
11. Basic Measurement Issues
- Correlation
  - Degree of relationship between two sets of scores (e.g., knowledge test and job performance rating)
  - Plot a scatter diagram
  - Calculate the correlation coefficient, r (ranges from -1.0 through 0 to +1.0)
  - r²: the amount of shared variation between two measures (if r = .60, then the common variation is .36)
  - Evaluate the practical and statistical significance of the correlation between the two measures (see the sketch below)
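A small sketch of computing r and r² for hypothetical predictor and criterion scores (the data are invented; in practice they would come from a validation sample, and Python 3.10+ is assumed for statistics.correlation):

```python
import statistics

# Invented data: knowledge-test scores and job performance ratings
test_scores = [55, 60, 65, 70, 75, 80, 85, 90]
performance = [2.1, 2.8, 2.5, 3.2, 3.0, 3.9, 3.7, 4.4]

r = statistics.correlation(test_scores, performance)  # Pearson r, -1.0 to +1.0
r_squared = r ** 2                                     # shared variation between the measures

print(round(r, 2), round(r_squared, 2))
```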
12. Reliability of Selection Devices
- What is reliability?
  - Dependability, consistency, or stability of scores on a measure (predictor, criterion, etc.)
- Measurement error
  - Difference between a theoretical true score on an attribute and that which is obtained through measurement
  - Sources of error: the test itself, test conditions, the person, scoring errors, etc.
Obtained Score = True Score + Measurement Error
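A minimal simulation (not from the slides) of the equation above: obtained scores are true scores plus random error, and the more error that is added, the lower the correlation between two administrations of the same measure (one way of estimating reliability). All numbers are invented; Python 3.10+ is assumed for statistics.correlation:

```python
import random
import statistics

random.seed(1)

# Hypothetical true scores for 200 people on some attribute
true_scores = [random.gauss(50, 10) for _ in range(200)]

def administer(true_scores, error_sd):
    """Obtained score = true score + random measurement error."""
    return [t + random.gauss(0, error_sd) for t in true_scores]

time1 = administer(true_scores, error_sd=5)
time2 = administer(true_scores, error_sd=5)

# Correlation between the two error-laden administrations; it rises toward 1.0
# as error_sd shrinks and falls as error_sd grows
print(round(statistics.correlation(time1, time2), 2))
```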
13. Methods of Estimating Reliability
- Within One Time Period
  - Parallel or Equivalent Forms
  - Internal Consistency
  - Inter-rater Agreement
- Across Two Time Periods
  - Parallel or Equivalent Forms
  - Test-Retest
  - Intra-rater Agreement
14. Parallel or Equivalent Forms
- Develop two tests equal in the material covered, the form, and the difficulty and number of items
- Administer both forms to the same respondents, separated by either a brief or a long time period
- Correlate the scores on the two tests (coefficient of equivalence)
15. Parallel or Equivalent Forms
- Limitations
  - Difficult and time-consuming to construct equivalent tests
  - Must develop a domain of items to represent the universe of possible items, pre-test the items, and do a detailed item analysis
- When this estimate should be used
  - Cognitive ability tests, such as verbal, quantitative, and mathematical skills
  - Preferable to the test-retest estimate
16. Internal Consistency
- Estimates the extent to which all parts of a test are similar in what they measure
- Items that are consistent are homogeneous in measuring the same trait, skill, ability, etc.
- Three types (see the sketch below)
  - 1. Split-half estimates
  - 2. Kuder-Richardson estimates
  - 3. Cronbach's alpha estimates
  - The latter two are average coefficients computed from all possible split-halves
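A sketch of Cronbach's alpha using its standard formula, alpha = (k / (k - 1)) × (1 - Σ item variances / variance of total scores); the 5-item, 6-respondent data are invented for illustration:

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list of respondents' scores per item."""
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]               # each person's total score
    item_variances = sum(statistics.pvariance(item) for item in item_scores)
    total_variance = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Invented responses: 5 items, each rated by the same 6 people
items = [
    [4, 5, 3, 4, 2, 5],
    [4, 4, 3, 5, 2, 4],
    [3, 5, 2, 4, 1, 5],
    [4, 4, 3, 4, 2, 5],
    [5, 5, 3, 4, 2, 4],
]
print(round(cronbach_alpha(items), 2))
```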
17. Internal Consistency
- Limitations
  - Needs many items (but large numbers of items can increase the estimate without increasing reliability)
  - Low estimates may be due to the measurement of more than one construct or simply to unreliability
  - Cannot be used with timed tests
- When to use
  - Popular to use (one test at one time)
  - Can be used with psychological constructs
18. Inter-rater Reliability (Agreement)
- Determination of the reliability of two or more raters
- Rater biases (errors) must be factored out of ratings of the focal person (e.g., performance appraisal ratings)
- Rater errors include
  - different interpretations of standards in making ratings
  - inconsistency in using the standards across time or ratees
19. Inter-rater Reliability (Agreement)
- Three types of inter-rater reliability estimates (see the sketch below)
  - 1. Inter-rater agreement for nominal/category data (percent of rater agreement)
  - 2. Inter-class correlation (average amount of agreement between two raters judging a series of objects or people)
  - 3. Intra-class correlation (average amount of agreement among three or more raters judging a series of objects or people)
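A minimal sketch of the first two estimates listed above: percent agreement for category judgments and a correlation between two raters' numeric ratings (all ratings invented; Python 3.10+ assumed for statistics.correlation):

```python
import statistics

# 1. Percent agreement on nominal/category judgments (e.g., hire vs. no-hire)
rater_a = ["hire", "hire", "no", "hire", "no", "no", "hire", "no"]
rater_b = ["hire", "no", "no", "hire", "no", "hire", "hire", "no"]
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# 2. Inter-class correlation: two raters scoring the same eight candidates
scores_a = [3, 4, 2, 5, 4, 3, 5, 2]
scores_b = [3, 5, 2, 4, 4, 2, 5, 3]
r = statistics.correlation(scores_a, scores_b)

print(round(agreement, 2), round(r, 2))
```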
20. Inter-rater Reliability (Agreement)
- Limitations
  - Complex calculations required
  - Assumes all raters are interchangeable and know the ratees equally well
- When to use
  - When ratings from multiple sources are available on candidates (e.g., assessment center or performance appraisal)
  - To test the effectiveness of rater training
21. Test-Retest Reliability
- Use the same selection measure to collect data from the same people at two different times
- Correlate the scores from the two administrations to obtain the coefficient of stability (0 to 1.0)
- Limitations
  - What is the proper time interval between tests? Learning and practice (memory) effects can underestimate or overestimate the reliability, respectively
  - Not good for measuring attributes that are expected to change over time
22. Intra-rater Agreement
- Scores assigned to the same people by the same rater in two different time periods are compared
- Calculations
  - Percent agreement
  - Correlation of scores across Time 1 and Time 2
- Limitations
  - Changes in the attribute over time
23. Interpreting Reliability Coefficients
- What does it mean to say that we have a reliability estimate of .60?
- Is this good?
- Can we say the measure is dependable?
- How much error of measurement is there?
- Are any one individual's responses reliable?
24. Interpreting Reliability Coefficients
- Reliability is the extent to which individual differences in scores are due to true differences in the attribute measured rather than chance errors
- It is an estimate of the percentage or proportion of total differences in scores due to true differences (i.e., 60%) rather than error (i.e., 40%)
- Reliability coefficients for selection devices should be no lower than .85 and preferably around .90
25. Interpreting Reliability Coefficients
- Reliability estimates are specific to the group upon which they were calculated
- We cannot say that the test scores of any one individual are reliable, only that on average the group's scores are reliable
- The more important the decision and the more the decision relies on test results, the higher the reliability required
26. What Affects the Reliability of a Selection Device?
- Method of estimating the reliability
- Individual differences among the test takers on the attribute measured (e.g., mechanical ability): more variation, more reliability
- Length of the test: longer is usually better
- Test question difficulty: moderate is best
- Homogeneous items: more homogeneity, more reliability
27. What Affects the Reliability of a Selection Device?
- Response format: the more response choices (e.g., 7 vs. 3), the more reliable
- Errors of measurement: the fewer the errors of measurement (due to variations in the administration of the test or temporary states of the test-taker), the more reliable the measure
28. Standard Error of Measurement (SEM)
- Used to estimate the error associated with a particular individual's score on a test
- Helps us determine a range or confidence interval around a person's obtained score within which his/her true score resides
- Helps us determine whether two individuals' test scores are significantly different from each other
29. Standard Error of Measurement Example
SEM = SD × √(1 - reliability)
a. Assume reliability is .5 (SD = 10, Mean = 75): SEM = 10 × √(1 - .5) = 7.07
b. Assume reliability is .9 (SD = 10, Mean = 75): SEM = 10 × √(1 - .9) = 3.16
- For an applicant with a score of 70, there is a 95% chance that his/her true score is
  - 70 ± 2 (7.07), or 70 ± 14.14, i.e., between 55.86 and 84.14
  - 70 ± 2 (3.16), or 70 ± 6.32, i.e., between 63.68 and 76.32
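The same calculation in a short sketch; the SD, reliability values, and applicant score come directly from the example above:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def true_score_interval(obtained, sem_value):
    """Approximate 95% confidence band: obtained score +/- 2 SEM."""
    return obtained - 2 * sem_value, obtained + 2 * sem_value

sd, obtained = 10, 70
for reliability in (0.5, 0.9):
    s = sem(sd, reliability)
    low, high = true_score_interval(obtained, s)
    # .5 -> SEM 7.07, band 55.86 to 84.14; .9 -> SEM 3.16, band 63.68 to 76.32
    print(reliability, round(s, 2), round(low, 2), round(high, 2))
```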
30. Validity of Selection Measures
- Definition: the degree to which the selection device measures what it is intended to measure
- Example: Do scores on the Bennett Mechanical Comprehension Test actually measure mechanical skills?
- Is there evidence to support making inferences from scores on selection devices to job performance?
31. Accuracy in Prediction
(2 × 2 figure: predicted job success, Low vs. High, crossed with actual job performance, Low vs. High)
- Predicted low, actual performance high: prediction error (under-prediction of job success)
- Predicted high, actual performance high: correct prediction
- Predicted low, actual performance low: correct prediction
- Predicted high, actual performance low: prediction error (over-prediction of job success)
32. Validity and Reliability
- Valid inferences cannot be made with unreliable measures
- However, just because a measure is reliable does not mean it is a valid means of making inferences about job performance
- Thus, high reliability is a necessary but not sufficient condition for validity
33. Different Types of Validity
- Criterion-related
  - Concurrent
  - Predictive
- Content validity
- Validity generalization
Validation studies are done to collect evidence about the appropriateness of making inferences from our selection measures
34. Concurrent Validity
- Information is obtained on both a predictor (selection device) and a criterion measure (i.e., an aspect of job performance) for a current group of employees
- The two sets of data are statistically correlated
- A significant correlation coefficient provides evidence of validity
35. Concurrent Validity
- Strengths
  - Can obtain validation evidence in a short period of time
  - Accepted by courts as evidence of validity
- Weaknesses
  - Job tenure or experience may affect scores on either the predictor or the criterion measure
  - Are current employees representative of applicants?
  - Questionable motivation of participants
  - Some employees may be missing from the study
36. Predictive Validity
- Information is obtained on a predictor (selection device) from applicants
- Results of this measure are filed away
- After the passage of time, collect information on a criterion measure (i.e., an aspect of job performance) from those applicants who were hired
- The two sets of data are statistically correlated
- A significant correlation coefficient provides evidence of validity
37. Predictive Validity
- Strengths
  - Applicants rather than current employees are measured on the predictor
  - Motivation and job tenure are not problematic
- Weaknesses
  - What time interval to use?
  - Need a large enough group of hires to perform statistical analysis, so it may take a long time before validation studies can be completed
38. Other Issues in Criterion-Related Validation Studies
- Validation studies should be conducted on large sample sizes to avoid sampling error
- Violation of statistical assumptions (e.g., linearity) will underestimate the true relationship between predictor and criterion
- Criterion contamination (i.e., criterion scores are affected by variables other than the predictor, such as gender or age) will affect the validity coefficient
39. Other Issues in Criterion-Related Validation Studies
- Unreliability in criterion or predictor measures will lower the validity coefficient
- Restriction in the range of scores on a predictor will lower the validity coefficient (see the sketch below)
- Coefficient of determination (i.e., the squared validity coefficient, rxy²): the percentage of variation in the criterion that can be explained by variation in the predictor
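A small simulation of the range-restriction point above: when only high scorers on the predictor are retained (as happens after selection), the observed validity coefficient drops below its value in the full applicant pool. All data are simulated, and the 0.6 weighting is an arbitrary choice for illustration:

```python
import random
import statistics

random.seed(2)

# Simulated applicant pool: criterion = 0.6 * predictor + random noise
predictor = [random.gauss(50, 10) for _ in range(500)]
criterion = [0.6 * p + random.gauss(0, 8) for p in predictor]

r_full = statistics.correlation(predictor, criterion)

# Restrict range: keep only applicants above the predictor mean (e.g., those hired)
kept = [(p, c) for p, c in zip(predictor, criterion) if p > 50]
r_restricted = statistics.correlation([p for p, _ in kept],
                                      [c for _, c in kept])

# r in the restricted group is smaller; r squared is the coefficient of determination
print(round(r_full, 2), round(r_restricted, 2), round(r_full ** 2, 2))
```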
40. Content Validity
- A selection device has content validity when it can be shown that its items are representative samples of the content of the job (i.e., the job content domain)
- Characteristics
  - Prime emphasis is on the construction of the measure
  - Relies on expert judgment to determine validity (descriptive rather than predictive validity)
  - Examples: faculty teach a class; a truck driver takes a driving test; a manager makes a decision
41. Content Validation Procedures
1. Conduct a thorough job analysis (i.e., identify critical tasks and the KSAs required to perform those tasks).
2. Select experts (job incumbents and supervisors) to participate in the validation study.
3. Specify the content of the selection measure using domain sampling, ensuring both physical and psychological fidelity.
4. Assess the relevance of the selection measure to the job's content by asking incumbents to rate the extent to which the important KSAs needed to do the job are needed to answer the test questions.
42. When to Use Content Validation
- When there are too few people on which to conduct a criterion-related study
- When criterion measures are not available
- When criterion measures have questionable quality
- It is usually preferable to perform criterion-related validation studies over content validation studies if at all possible
43. When Not to Use Content Validity
- Jobs or KSAOs are very abstract
- Inferential links made in performing the job analysis and constructing the selection measure are not sound
- Assessing mental, psychological, or personality variables (more inference needed)
- Content, setting, and administration of the procedure do not resemble the work setting
- KSAOs assessed are expected to be learned fairly easily or quickly
44. Validity Generalization
- The validity of a selection device may be generalizable across similar job situations
- Evidence used to support the validity of a measure in one setting can be used to support the validity of using it in a different setting with the same or similar jobs
- Validity generalization methods essentially correct for errors in validation studies
- Implication: it is not necessary to conduct validity studies for each job in each firm
45. Validation Studies in Practice
- Only 24% of companies report doing validity studies
- Small organizations are unlikely to be able to do validation studies (cost, sample sizes)
- Changing nature of jobs (team-based work, fluidity of the job, changing criteria for success)
- Legal requirements may limit some validation studies (validity generalization)
- Conducting studies requires hiring testing specialists or consultants