Measurement Issues in Selection
  • Basic Measurement Issues, Reliability, and Validity

Why Measurement Issues are Critical
  • Need sound data upon which to make high quality
    selection decisions
  • Judicial system gives great weight to
    professional and scientific standards
  • Job relatedness of selection devices is decided
    on technical issues (i.e., reliability and
    validity of methods)

Basic Measurement Issues
  • Measurement is just the use of rules to assign
    numbers to objects to represent quantities of
    attributes (e.g., mechanical ability,
    interpersonal effectiveness, knowledge of U.S.
    history, etc.)
  • People vary on the quantity of different
    attributes they possess
  • We want to make reliable distinctions between
    people on important job attributes

Basic Measurement Issues
  • Predictor: any technique that is used to predict
    some aspect of job performance
  • Examples: tests, interviews, work samples, biodata, etc.
  • Criterion: a measure or definition of what is
    successful job performance
  • Examples: productivity, absenteeism, tardiness, dollar
    sales, speed of performance, commitment, etc.

Typical Predictors Used in Selection Processes
  • Physical characteristics
  • eyesight, hearing, manual dexterity, strength,
    reflexes, reaction time
  • Psychological characteristics
  • personality traits, attitudes, honesty, tolerance
    for stress, tolerance for ambiguity
  • Behavioral characteristics
  • work history, skills, past job performance
  • Cognitive abilities
  • verbal, quantitative, mechanical, spatial,

Basic Measurement Process
  • Choose what attributes you want to measure
  • Develop an operational definition of each
    attribute
  • Construct a measure of each attribute (if none
    already exists) based on the operational
    definition
  • Use the measures to assess the attributes in each

Basic Measurement Issues
  • Scales of measurement
  • Nominal: we can classify individuals into
    different categories
  • Ordinal: we can rank individuals relative to
    each other
  • Interval: we can make ratings of meaningful
    differences among individuals
  • Ratio: we can derive ratios comparing one
    individual to another
  • Scale of measurement determines precision of data
    and statistical analyses performed

Basic Measurement Issues
  • Central Tendency
  • Mean = ΣX / N
  • Median: middle score
  • Mode: most frequent score
  • Variability
  • Range of scores or spread
  • Standard deviation: average amount of deviation
    of scores around the average score
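
The measures of central tendency and variability listed above can be computed with Python's standard library. This is a minimal sketch; the scores are hypothetical and exist only for illustration.

```python
import statistics

# Hypothetical test scores for eight applicants (illustrative data only).
scores = [70, 75, 75, 80, 85, 90, 65, 60]

mean = statistics.mean(scores)           # sum of X divided by N
median = statistics.median(scores)       # middle score of the sorted list
mode = statistics.mode(scores)           # most frequent score
score_range = max(scores) - min(scores)  # spread from lowest to highest
sd = statistics.pstdev(scores)           # population standard deviation

print(mean, median, mode, score_range, round(sd, 2))
```

The sketch uses the population standard deviation (`pstdev`); `statistics.stdev` would give the sample version, which divides by N - 1 instead of N.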

Basic Measurement Issues
  • Standardization: measuring attributes of,
    administering procedures to, and scoring test
    results of individuals in the same manner to
    control outside influences
  • Norms: scores of relevant others for use in
    score interpretation.
  • Results in relative standings on an attribute
  • Normative group should be relevant to purpose,
    best if local, and re-checked periodically for

Basic Measurement Issues
  • Percentile scores: percentage of people in a
    norm group who fall below a given score on a
    measure
  • Problem: based on ordinal scale of measurement
    but often interpreted on a ratio scale
  • Standard scores: scores that have common
    measurement units such that equal intervals exist
    between scores, such as z scores

z = (X - M) / SD
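
As a rough illustration of the two score types above, the sketch below computes a z score and a percentile rank. The function names and the norm group data are hypothetical, chosen only for this example.

```python
def z_score(x, mean, sd):
    """Standard score: distance of x from the mean in SD units."""
    return (x - mean) / sd


def percentile_rank(x, norm_scores):
    """Percentage of the norm group scoring below x."""
    below = sum(1 for s in norm_scores if s < x)
    return 100 * below / len(norm_scores)


# Hypothetical norm group (illustrative data only).
norm_group = [60, 65, 70, 75, 75, 80, 85, 90]

z = z_score(85, mean=75, sd=10)       # 1.0
pr = percentile_rank(85, norm_group)  # 75.0
print(z, pr)
```

A z score of 1.0 means the score sits one standard deviation above the mean, while the percentile rank only says how many people scored lower.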
Basic Measurement Issues
  • Correlation
  • Degree of relationship between two sets of scores
    (e.g., knowledge test and job performance rating)
  • Plot scatter diagram
  • Calculate correlation coefficient, r (-1.0 to 0
    to 1.0)
  • r²: the amount of shared variation between two
    measures (if r = .60, then common variation is .36)
  • Evaluate the practical and statistical
    significance of the correlation between two
    measures
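
The correlation coefficient and the shared variation can be sketched as follows. The Pearson formula is the standard one; the two score lists are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical knowledge test scores and job performance ratings.
test_scores = [1, 2, 3, 4, 5]
performance = [2, 1, 4, 3, 5]

r = pearson_r(test_scores, performance)
shared_variation = r ** 2
print(round(r, 2), round(shared_variation, 2))  # 0.8 0.64
```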

Reliability of Selection Devices
  • What is reliability?
  • Dependability, consistency, or stability of
    scores on a measure (predictor, criterion, etc.)
  • Measurement error
  • Difference between a theoretical true score on
    an attribute and that which is obtained through
    measurement
  • Sources of error: test itself, test conditions,
    person, scoring errors, etc.

Obtained Score = True Score + Measurement Error
Methods of Estimating Reliability
  • Within One Time Period
  • Parallel or Equivalent Forms
  • Internal Consistency
  • Inter-rater Agreement
  • Across Two Time Periods
  • Parallel or Equivalent Forms
  • Test-Retest
  • Intra-rater Agreement

Parallel or Equivalent Forms
  • Develop two tests equal in material covered, the
    form, and the difficulty and number of items
  • Administer to the same respondents with either a
    brief or long time period
  • Correlate the scores on the two tests
    (coefficient of equivalence)

Parallel or Equivalent Forms
  • Limitations
  • Difficult and time consuming to construct
    equivalent tests.
  • Must develop a domain of items to represent the
    universe of possible items, pre-test the items,
    and do a detailed item analysis
  • When this estimate should be used
  • Cognitive ability tests, such as verbal,
    quantitative, mathematical skills
  • Preferable to test-retest estimate

Internal Consistency
  • Estimates the extent to which all parts of a test
    are similar in what they measure
  • Items are consistent or homogeneous in measuring
    the same trait, skill, ability, etc.
  • Three types
  • 1. Split-half estimates
  • 2. Kuder-Richardson estimates
  • 3. Cronbach's alpha estimates

Kuder-Richardson and Cronbach's alpha estimates are
both average coefficients computed from all
possible split-halves
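
One way to make the internal consistency idea concrete is the usual textbook formula for Cronbach's alpha: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores). The slides do not give this formula, so treat the sketch below as an assumption; the item responses are hypothetical.

```python
import statistics


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                # number of items
    items = list(zip(*rows))        # transpose: one tuple of scores per item
    item_var = sum(statistics.pvariance(col) for col in items)
    total_var = statistics.pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_var / total_var)


# Hypothetical responses: four respondents, three items each.
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))  # 0.944
```

The three items rise and fall together across respondents, which is why the estimate is high for such a short scale.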
Internal Consistency
  • Limitations
  • Need lots of items (but large numbers of items
    increase the estimate without increasing
  • Low estimates may be due to the measurement of
    more than one construct or just unreliability
  • Cannot use with timed test
  • When to use
  • Popular to use (one test at one time)
  • Can use with psychological constructs

Inter-rater Reliability Agreement
  • Determination of the reliability of two or more
    raters
  • Rater biases (errors) must be factored out of
    ratings of the focal person (e.g., performance
    appraisal ratings)
  • Rater errors include
  • different interpretation of standards in making
    ratings
  • inconsistency in using the standards across time
    or ratees

Inter-rater Reliability Agreement
  • Three types of interrater reliability estimates
  • 1. Interrater agreement for nominal/category
    data (percent of rater agreement)
  • 2. Inter-class correlation (average amount of
    agreement among two raters judging a series of
    objects or people)
  • 3. Intra-class correlation (average amount of
    agreement among three or more raters judging a
    series of objects or people)
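
The first estimate, percent of rater agreement on category data, is simple enough to sketch directly. The function name and the ratings are hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of cases on which two raters assign the same category."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100 * matches / len(rater_a)


# Hypothetical hire/reject judgments by two raters on five candidates.
rater_a = ["hire", "reject", "hire", "hire", "reject"]
rater_b = ["hire", "reject", "reject", "hire", "reject"]

agreement = percent_agreement(rater_a, rater_b)
print(agreement)  # 80.0
```

Percent agreement does not adjust for agreement that would occur by chance, so it is best read as a rough first check.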

Inter-rater Reliability Agreement
  • Limitations
  • Complex calculations required
  • Assumes all raters are interchangeable and know
    the ratees equally well
  • When to use
  • When ratings from multiple sources are available
    on candidates (e.g., assessment center or
    performance appraisal)
  • To test effectiveness of rater training

Test-Retest Reliability
  • Use same selection measure to collect data from
    same people at two different times
  • Correlate scores from the two tests to obtain the
    coefficient of stability (0 - 1.0)
  • Limitations
  • What is proper time interval between tests?
    Learning and practice (memory) effects
    underestimate or overestimate the reliability,
  • Not good for measuring attributes that are
    expected to change over time

Intra-rater Agreement
  • Scores assigned to same people by the same rater
    in two different time periods are compared
  • Calculations
  • Percent agreement
  • Correlation of scores across Time 1 and Time 2
  • Limitations
  • Changes in attribute over time

Interpreting Reliability Coefficients
  • What does it mean to say that we have a
    reliability estimate of .60?
  • Is this good?
  • Can we say the measure is dependable?
  • How much error of measurement is there?
  • Are any one individual's responses reliable?

Interpreting Reliability Coefficients
  • Reliability is the extent to which individual
    differences in scores are due to true
    differences in the attribute measured rather than
    chance errors
  • It is an estimate of the percentage or proportion
    of total differences in scores due to true
    differences (i.e., 60%) rather than error (i.e.,
    40%)
  • Reliability coefficients for selection devices
    should be no lower than .85 and preferably around

Interpreting Reliability Coefficients
  • Reliability estimates are specific to the group
    upon which they were calculated
  • Cannot say that the test scores of any one
    individual are reliable, but that on average the
    group's scores are reliable
  • The more important the decision, and the more the
    decision relies on test results, the higher
    the reliability required.

What affects the reliability of a selection
  • Method of estimating the reliability
  • Individual differences among the test takers on
    the attribute measured (e.g., mechanical
    ability): more variation, more reliability
  • Length of the test: longer is usually better
  • Test question difficulty: moderate is best
  • Homogeneous items: more homogeneity, more
    reliability

What affects the reliability of a selection
  • Response format: the more response choices (e.g., 7
    vs. 3), the more reliable
  • The fewer the errors of measurement (due to
    variations in the administration of the test or
    temporary states of the test-taker), the
    more reliable the measure
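
The effect of test length on reliability is commonly estimated with the Spearman-Brown prophecy formula. That formula is not named in the slides, so the sketch below is an assumption brought in for illustration; it also assumes the added items are comparable to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Doubling a test whose current reliability is .60.
doubled = spearman_brown(0.60, 2)
print(round(doubled, 2))  # 0.75
```

Under these assumptions, doubling the test raises the estimate from .60 to .75, which matches the point that longer tests are usually more reliable.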

Standard Error of Measurement (SEM)
  • Used to estimate the error associated with a
    particular individual's score on a test
  • Helps us determine a range or confidence interval
    around a person's obtained score within which
    his/her true score resides
  • Helps us determine if two individuals' test
    scores are significantly different from each
    other
Standard Error of Measurement Example

a. Assume reliability is .5
SD = 10, Mean = 75, SEM = 7.07
b. Assume reliability is .9
SD = 10, Mean = 75, SEM = 3.16
  • For applicant with a score of 70
  • There is a 95% chance that his/her true score is between
  • a. 70 ± 2 (7.07) or 70 ± 14.14 or 55.86 and 84.14
  • b. 70 ± 2 (3.16) or 70 ± 6.32 or 63.68 and 76.32
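
The figures in the example above are consistent with the standard formula SEM = SD × sqrt(1 - reliability), with a band of 2 SEM on either side of the obtained score. The sketch below reproduces them; the function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def true_score_interval(score, sd, reliability, multiplier=2):
    """Approximate 95% band: obtained score plus or minus 2 SEM."""
    margin = multiplier * standard_error_of_measurement(sd, reliability)
    return score - margin, score + margin


sem_a = standard_error_of_measurement(10, 0.5)  # about 7.07
sem_b = standard_error_of_measurement(10, 0.9)  # about 3.16
low_a, high_a = true_score_interval(70, 10, 0.5)
low_b, high_b = true_score_interval(70, 10, 0.9)

print(round(low_a, 2), round(high_a, 2))  # 55.86 84.14
print(round(low_b, 2), round(high_b, 2))  # 63.68 76.32
```

Raising reliability from .5 to .9 cuts the width of the band by more than half, which is why higher reliability is needed for decisions about individuals.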

Validity of Selection Measures
  • Definition: Degree to which the selection device
    measures what it is intended to measure
  • Example: Do scores on the Bennett Mechanical
    Comprehension Test actually measure mechanical
    ability?
  • Is there evidence to support making inferences
    from scores on selection devices to job
    performance?
Accuracy in Prediction
[Figure: quadrant diagram. Axis label: Actual Job Performance. Regions: Correct Prediction (two regions), Prediction Error (under-prediction), Prediction Error (over-prediction)]

Validity and Reliability
  • Valid inferences cannot be made with unreliable
    measures
  • However, just because a measure is reliable, does
    not mean it is a valid means of making inferences
    about job performance
  • Thus, high reliability is a necessary but not
    sufficient condition for validity

Different Types of Validity
  • Criterion related
  • Concurrent
  • Predictive
  • Content validity
  • Validity generalization

Validation studies are done to collect evidence
about the appropriateness of making
inferences from our selection measures
Concurrent Validity
  • Information is obtained on both a predictor
    (selection device) and a criterion measure (i.e.,
    aspect of job performance) for a current group of
    employees
  • Two sets of data are statistically correlated
  • A significant correlation coefficient provides
    evidence of validity
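
Whether a validity coefficient is statistically significant depends heavily on sample size. A common check, assumed here rather than taken from the slides, is the t statistic t = r × sqrt(n - 2) / sqrt(1 - r²), compared against a critical value from a t table. The sample sizes below are hypothetical.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# The same validity coefficient (.30) in a small and a larger sample.
t_small = t_for_correlation(0.30, 30)
t_large = t_for_correlation(0.30, 100)

print(round(t_small, 2))  # 1.66
print(round(t_large, 2))  # 3.11
```

With two-tailed critical values of roughly 2.05 (28 degrees of freedom) and 1.98 (98 degrees of freedom) at the .05 level, the coefficient is not significant with 30 people but is significant with 100.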

Concurrent Validity
  • Strengths
  • Can obtain immediate validation evidence in short
    period of time
  • Accepted by courts as evidence of validity
  • Weaknesses
  • Job tenure or experience may affect scores on
    either predictor or criterion measure
  • Are current employees representative of
    applicants?
  • Questionable motivation of participants
  • Some employees may be missing from study

Predictive Validity
  • Information is obtained on a predictor (selection
    device) from applicants
  • Results of this measure are filed away
  • After passage of time, collect information on a
    criterion measure (i.e., aspect of job
    performance) from those applicants hired
  • Two sets of data are statistically correlated
  • A significant correlation coefficient provides
    evidence of validity

Predictive Validity
  • Strengths
  • Applicants rather than current employees are
    measured on the predictor
  • Motivation and job tenure are not problematic
  • Weaknesses
  • What time interval to use?
  • Need a large enough group of hires to perform
    statistical analysis, so it may take a long time
    before validation studies can be completed

Other Issues in Criterion Related Validation
  • Validation studies should be conducted on large
    sample sizes to avoid sampling error
  • Violation of statistical assumptions (e.g.,
    linearity) will underestimate true relationship
    between predictor and criterion
  • Criterion contamination (i.e., criteria scores
    are affected by variables other than predictor)
    will affect validity coefficient (e.g., gender,

Other Issues in Criterion Related Validation
  • Unreliability in criterion or predictor measures
    will lower validity coefficient
  • Restriction in range of scores on a predictor
    will lower validity coefficient
  • Coefficient of determination (i.e., squaring the
    validity coefficient): percentage of variation
    in criterion that can be explained by variation
    in the predictor (rxy²)
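
The first and last points above can be illustrated together. The classical attenuation formula, assumed here and not stated in the slides, gives the observed validity as the true validity times the square root of the product of the two reliabilities. All values below are hypothetical.

```python
import math


def attenuated_validity(true_r, predictor_reliability, criterion_reliability):
    """Observed validity after unreliability in predictor and criterion."""
    return true_r * math.sqrt(predictor_reliability * criterion_reliability)


# Hypothetical true validity of .50 with imperfect measures.
observed_r = attenuated_validity(0.50, 0.80, 0.60)
determination = observed_r ** 2  # coefficient of determination

print(round(observed_r, 2))     # 0.35
print(round(determination, 2))  # 0.12
```

In this sketch an underlying relationship of .50 shows up as about .35, so the predictor appears to explain about 12% of the variation in the criterion rather than 25%.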

Content Validity
  • A selection device has content validity when it
    can be shown that its items are representative
    samples of content of job (i.e., job content
    domain)
  • Characteristics
  • Prime emphasis is on the construction of measure
  • Relies on expert judgment to determine validity
    (descriptive rather than predictive validity)
  • Examples: faculty teach a class; truck driver
    takes a driving test; manager makes a decision

Content Validation Procedures
1. Conduct a thorough job analysis (i.e.,
   identify critical tasks and KSAs required to
   perform tasks).
2. Select experts (job incumbents and supervisors)
   to participate in validation study.
3. Specify content of selection measure using
   domain sampling, ensuring both physical and
   psychological fidelity.
4. Assess relevance of selection measure to a
   job's content by asking incumbents to rate
   extent to which important KSAs to do job are
   needed to answer test questions.

When to Use Content Validation
  • When there are too few people on which to conduct
    a criterion-related study
  • When criterion measures are not available
  • When criterion measures have questionable quality
  • It is usually preferable to perform
    criterion-related validation studies over content
    validation studies if at all possible

When Not to use Content Validity
  • Jobs or KSAOs are very abstract
  • Inferential links made in performing job analysis
    and constructing selection measure are not sound
  • Assessing mental, psychological, or personality
    variables (more inference needed)
  • Content, setting, and administration of procedure
    does not resemble work setting
  • KSAOs assessed are expected to be learned fairly
    easily or quickly

Validity Generalization
  • Validity of selection device may be generalizable
    across similar job situations
  • Evidence used to support validity of measure in
    one setting can be used to support validity of
    using it in different setting with same or
    similar jobs
  • Validity generalization methods essentially
    correct for errors in validation studies
  • Implication not necessary to conduct validity
    studies for each job in each firm

Validation Studies in Practice
  • Only 24% of companies report doing validity
    studies
  • Small organizations are unlikely to be able to do
    validation studies (cost, sample sizes)
  • Changing nature of jobs (team based, fluidity of
    job, changing criteria for success)
  • Legal requirements may limit some validation
    studies (validity generalization)
  • Requires hiring testing specialists or
    consultants to conduct studies
