
1
Validity/Reliability and a recap of Statistics
  • RCS 6740
  • 6/27/05

2
RELIABILITY AND VALIDITY
  • Reliability
  •  
  • From the perspective of classical test theory,
    an examinee's obtained test score (X) is composed
    of two components, a true score component (T) and
    an error component (E)
  •  X = T + E
  •   The true score component reflects the
    examinee's status with regard to the attribute
    that is measured by the test, while the error
    component represents measurement error.
    Measurement error is random error. It is due to
    factors that are irrelevant to what is being
    measured by the test and that have an
    unpredictable (unsystematic) effect on an
    examinee's test score.

3
RELIABILITY AND VALIDITY
  • The score you obtain on a test is likely to be
    due both to the knowledge you have about the
    topics addressed by exam items (T) and the
    effects of random factors (E) such as the way
    test items are written, any alterations in
    anxiety, attention, or motivation you experience
    while taking the test, and the accuracy of your
    "educated guesses."
  •  
  • Whenever we administer a test to examinees, we
    would like to know how much of their scores
    reflects "truth" and how much reflects error. It
    is a measure of reliability that provides us with
    an estimate of the proportion of variability in
    examinees' obtained scores that is due to true
    differences among examinees on the attribute(s)
    measured by the test.

4
RELIABILITY AND VALIDITY
  • Reliability
  • When a test is reliable, it provides dependable,
    consistent results and, for this reason, the term
    consistency is often given as a synonym for
    reliability (e.g., Anastasi, 1988).

5
RELIABILITY AND VALIDITY
  • The Reliability Coefficient
  •   Ideally, a test's reliability would be
    calculated by dividing true score variance by the
    obtained (total) variance to derive a reliability
    index. This index would indicate the proportion
    of observed variability in test scores that
    reflects true score variability. A test's true
    score variance is not known, however, and
    reliability must be estimated rather than
    calculated directly. There are several ways to
    estimate a test's reliability. Each involves
    assessing the consistency of an examinee's scores
    over time, across different content samples, or
    across different scorers and is based on the
    assumption that variability that is consistent is
    true score variability, while variability that is
    inconsistent reflects random error.

6
RELIABILITY AND VALIDITY
  • Most methods for estimating reliability produce
    a reliability coefficient, which is a correlation
    coefficient that ranges in value from 0.0 to
    1.0. When a test's reliability coefficient is
    0.0, this means that all variability in obtained
    test scores is due to measurement error.
    Conversely, when a test's reliability coefficient
    is 1.0, this indicates that all variability in
    scores reflects true score variability. The
    reliability coefficient is symbolized with the
    letter "r" and a subscript that contains two of
    the same letters or numbers (e.g., ''rxx''). The
    subscript indicates that the correlation
    coefficient was calculated by correlating a test
    with itself rather than with some other measure.

7
RELIABILITY AND VALIDITY
  • Regardless of the method used to calculate a
    reliability coefficient, the coefficient is
    interpreted directly as the proportion of
    variability in obtained test scores that reflects
    true score variability. For example, as depicted
    in Figure 1, a reliability coefficient of .84
    indicates that 84% of the variability in scores
    is due to true score differences among examinees,
    while the remaining 16% (1.00 - .84) is due to
    measurement error.
  •  
  •  
  • Figure 1. Proportion of variability in test
    scores

True Score Variability (84%)
Error (16%)
8
RELIABILITY AND VALIDITY
  • Note that a reliability coefficient does not
    provide any information about what is actually
    being measured by a test. A reliability
    coefficient only indicates whether the attribute
    measured by the test, whatever it is, is being
    assessed in a consistent, precise way. Whether
    the test is actually assessing what it was
    designed to measure is addressed by an analysis
    of the test's validity.

9
RELIABILITY AND VALIDITY
Study Tip: Remember that, in contrast to other
correlation coefficients, the reliability
coefficient is never squared to interpret it but
is interpreted directly as a measure of true
score variability. A reliability coefficient of
.89 means that 89% of the variability in obtained
scores is true score variability.
10
RELIABILITY AND VALIDITY
  • Methods for Estimating Reliability
  •  
  • The selection of a method for estimating
    reliability depends on the nature of the test. As
    noted below, each method not only entails
    different procedures but is also affected by
    different sources of error. For many tests, more
    than one method should be used.

11
RELIABILITY AND VALIDITY
  • 1. Test-Retest Reliability: The test-retest
    method for estimating reliability involves
    administering the same test to the same group of
    examinees on two different occasions and then
    correlating the two sets of scores. When using
    this method, the reliability coefficient
    indicates the degree of stability (consistency)
    of examinees' scores over time and is also known
    as the coefficient of stability.
  • The primary sources of measurement error for
    test-retest reliability are any random factors
    related to the time that passes between the two
    administrations of the test. These time sampling
    factors include random fluctuations in examinees
    over time (e.g., changes in anxiety or
    motivation) and random variations in the testing
    situation. Memory and practice also contribute to
    error when they have random carryover effects,
    i.e., when they affect many or all examinees but
    not in the same way.

12
RELIABILITY AND VALIDITY
  • Test-retest reliability is appropriate for
    determining the reliability of tests designed to
    measure attributes that are relatively stable
    over time and that are not affected by repeated
    measurement. It would be appropriate for a test
    of aptitude, which is a stable characteristic,
    but not for a test of mood, since mood fluctuates
    over time, or a test of creativity, which might
    be affected by previous exposure to test items.

13
RELIABILITY AND VALIDITY
  • 2. Alternate (Equivalent, Parallel) Forms
    Reliability: To assess a test's alternate forms
    reliability, two equivalent forms of the test are
    administered to the same group of examinees and
    the two sets of scores are correlated. Alternate
    forms reliability indicates the consistency of
    responding to different item samples (the two
    test forms) and, when the forms are administered
    at different times, the consistency of responding
    over time. The alternate forms reliability
    coefficient is also called the coefficient of
    equivalence when the two forms are administered
    at about the same time and the coefficient of
    equivalence and stability when a relatively long
    period of time separates administration of the
    two forms.

14
RELIABILITY AND VALIDITY
  • The primary source of measurement error for
    alternate forms reliability is content sampling,
    or error introduced by an interaction between
    different examinees' knowledge and the different
    content assessed by the items included in the two
    forms: The items in Form A might be a better
    match for one examinee's knowledge than items in
    Form B, while the opposite is true for another
    examinee. In this situation, the two scores
    obtained by each examinee will differ, which will
    lower the alternate forms reliability
    coefficient. When administration of the two forms
    is separated by a period of time, time sampling
    factors also contribute to error.

15
RELIABILITY AND VALIDITY
  • Like test-retest reliability, alternate forms
    reliability is not appropriate when the attribute
    measured by the test is likely to fluctuate over
    time (and the forms will be administered at
    different times) or when scores are likely to be
    affected by repeated measurement. If the same
    strategies required to solve problems on Form A
    are used to solve problems on Form B, even if the
    problems on the two forms are not identical,
    there are likely to be practice effects. When
    these effects differ for different examinees
    (i.e., are random), practice will serve as a
    source of measurement error. Although alternate
    forms reliability is considered by some experts
    to be the most rigorous (and best) method for
    estimating reliability, it is not often assessed
    due to the difficulty in developing forms that
    are truly equivalent.

16
RELIABILITY AND VALIDITY
  • 3. Internal Consistency Reliability:
    Reliability can also be estimated by measuring
    the internal consistency of a test. Split-half
    reliability and coefficient alpha are two methods
    for evaluating internal consistency. Both involve
    administering the test once to a single group of
    examinees, and both yield a reliability
    coefficient that is also known as the coefficient
    of internal consistency.
  •  
  • To determine a test's split-half reliability,
    the test is split into equal halves so that each
    examinee has two scores (one for each half of the
    test). Scores on the two halves are then
    correlated. Tests can be split in several ways,
    but probably the most common way is to divide the
    test on the basis of odd- versus even-numbered
    items.

17
RELIABILITY AND VALIDITY
  • A problem with the split-half method is that it
    produces a reliability coefficient that is based
    on test scores that were derived from one-half of
    the entire length of the test. If a test contains
    30 items, each score is based on 15 items.
    Because reliability tends to decrease as the
    length of a test decreases, the split-half
    reliability coefficient usually underestimates a
    test's true reliability. For this reason, the
    split-half reliability coefficient is ordinarily
    corrected using the Spearman-Brown prophecy
    formula, which provides an estimate of what the
    reliability coefficient would have been had it
    been based on the full length of the test.
  •  
  • Cronbach's coefficient alpha also involves
    administering the test once to a single group of
    examinees. However, rather than splitting the
    test in half, a special formula is used to
    determine the average degree of inter-item
    consistency. One way to interpret coefficient
    alpha is as the average reliability that would be
    obtained from all possible splits of the test.
    Coefficient alpha tends to be conservative and
    can be considered the lower boundary of a test's
    reliability (Novick & Lewis, 1967). When test
    items are scored dichotomously (right or wrong),
    a variation of coefficient alpha known as the
    Kuder-Richardson Formula 20 (KR-20) can be used.
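Below is a minimal Python sketch of these internal-consistency estimates, using a small hypothetical item-response matrix (the data and names are illustrative, not from the presentation). It computes the split-half coefficient, its Spearman-Brown correction, and coefficient alpha; with dichotomous 0/1 items the alpha computation matches KR-20:

    import numpy as np

    # Hypothetical responses: 6 examinees (rows) x 6 dichotomous items (columns).
    scores = np.array([
        [1, 1, 1, 0, 1, 1],
        [1, 0, 1, 1, 0, 1],
        [0, 1, 0, 1, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0, 1],
        [1, 1, 0, 1, 1, 1],
    ])

    # Split-half: correlate odd-item totals with even-item totals.
    odd, even = scores[:, 0::2].sum(axis=1), scores[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]

    # Spearman-Brown correction for a test n times as long
    # (n = 2 projects the half-test correlation to the full length).
    def spearman_brown(r, n=2.0):
        return n * r / (1 + (n - 1) * r)

    # Coefficient alpha: k/(k - 1) * (1 - sum of item variances / total variance).
    k = scores.shape[1]
    alpha = k / (k - 1) * (1 - scores.var(axis=0, ddof=1).sum()
                           / scores.sum(axis=1).var(ddof=1))

    print(round(r_half, 2), round(spearman_brown(r_half), 2), round(alpha, 2))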

18
RELIABILITY AND VALIDITY
  • Content sampling is a source of error for both
    split-half reliability and coefficient alpha. For
    split-half reliability, content sampling refers
    to the error resulting from differences between
    the content of the two halves of the test (i.e.,
    the items included in one half may better fit the
    knowledge of some examinees than items in the
    other half); for coefficient alpha, content
    (item) sampling refers to differences between
    individual test items rather than between test
    halves. Coefficient alpha also has, as a source of
    error, the heterogeneity of the content domain. A
    test is heterogeneous with regard to content
    domain when its items measure several different
    domains of knowledge or behavior. The greater the
    heterogeneity of the content domain, the lower
    the inter-item correlations and the lower the
    magnitude of coefficient alpha. Coefficient alpha
    could be expected to be smaller for a 200-item
    test that contains items assessing knowledge of
    test construction, statistics, ethics,
    industrial-organizational psychology, clinical
    psychology, etc. than for a 200-item test that
    contains questions on test construction only.

19
RELIABILITY AND VALIDITY
  • The methods for assessing internal consistency
    reliability are useful when a test is designed to
    measure a single characteristic, when the
    characteristic measured by the test fluctuates
    over time, or when scores are likely to be
    affected by repeated exposure to the test. They
    are not appropriate for assessing the reliability
    of speed tests because, for these tests, they
    tend to produce spuriously high coefficients.
    (For speed tests, alternate forms reliability is
    usually the best choice.)

20
RELIABILITY AND VALIDITY
  • 4. Inter-Rater (Inter-Scorer, Inter-Observer)
    Reliability: Inter-rater reliability is of
    concern whenever test scores depend on a rater's
    judgment. A test constructor would want to make
    sure that an essay test, a behavioral observation
    scale, or a projective personality test has
    adequate inter-rater reliability. This type of
    reliability is assessed either by calculating a
    correlation coefficient (e.g., a kappa
    coefficient or coefficient of concordance) or by
    determining the percent agreement between two or
    more raters. Although the latter technique is
    frequently used, it can lead to erroneous
    conclusions since it does not take into account
    the level of agreement that would have occurred
    by chance alone. This is a particular problem for
    behavioral observation scales that require raters
    to record the frequency of a specific behavior.
    In this situation, the degree of chance agreement
    is high whenever the behavior has a high rate of
    occurrence, and percent agreement will provide an
    inflated estimate of the measure's reliability.
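A short Python sketch of this point, with hypothetical interval recordings of a high-rate behavior, shows percent agreement looking respectable while Cohen's kappa (one common chance-corrected coefficient) reveals near-chance agreement:

    from collections import Counter

    # Hypothetical recordings from two raters of a frequent behavior.
    rater_a = ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "yes"]
    rater_b = ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "yes", "yes"]
    n = len(rater_a)

    # Raw percent agreement ignores chance.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance from each rater's marginal frequencies.
    fa, fb = Counter(rater_a), Counter(rater_b)
    p_chance = sum(fa[c] * fb[c] for c in fa) / n ** 2

    kappa = (p_obs - p_chance) / (1 - p_chance)
    print(f"agreement = {p_obs:.0%}, kappa = {kappa:.2f}")  # 80% agreement, kappa ~ -0.11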

21
RELIABILITY AND VALIDITY
  • Sources of error for inter-rater reliability
    include factors related to the raters such as
    lack of motivation and rater biases and
    characteristics of the measuring device. An
    inter-rater reliability coefficient is likely to
    be low, for instance, when rating categories are
    not exhaustive (i.e., don't include all possible
    responses or behaviors) and/or are not mutually
    exclusive.
  • The inter-rater reliability of a behavioral
    rating scale can also be affected by consensual
    observer drift, which occurs when two (or more)
    observers working together influence each other's
    ratings so that they both assign ratings in a
    similarly idiosyncratic way. (Observer drift can
    also affect a single observer's ratings when he
    or she assigns ratings in a consistently deviant
    way.) Unlike other sources of error, consensual
    observer drift tends to artificially inflate
    inter-rater reliability.

22
RELIABILITY AND VALIDITY
  • The reliability (and validity) of ratings can be
    improved in several ways. Consensual observer
    drift can be eliminated by having raters work
    independently or by alternating raters. Rating
    accuracy is also improved when raters are told
    that their ratings will be checked. Overall, the
    best way to improve both inter- and intra-rater
    accuracy is to provide raters with training that
    emphasizes the distinction between observation
    and interpretation (Aiken, 1985).

23
RELIABILITY AND VALIDITY

Study Tip: Remember that the Spearman-Brown formula is
related to split-half reliability and that KR-20 is
related to coefficient alpha. Also know that
alternate forms reliability is the most thorough
method for estimating reliability and that
internal consistency reliability is not
appropriate for speed tests.
24
RELIABILITY AND VALIDITY
  • Factors That Affect The Reliability Coefficient
  •  
  • The magnitude of the reliability coefficient is
    affected not only by the sources of error
    discussed above but also by the length of the
    test, the range of the test scores, and the
    probability that the correct response to items
    can be selected by guessing.
  • Test Length
  • Range of Test Scores
  • Guessing

25
RELIABILITY AND VALIDITY
  • 1. Test Length: The larger the sample of the
    attribute being measured by a test, the less the
    relative effects of measurement error and the
    more likely the sample will provide dependable,
    consistent information. Consequently, a general
    rule is that the longer the test, the larger the
    test's reliability coefficient.
  •  
  • The Spearman-Brown prophecy formula is most
    associated with split-half reliability but can
    actually be used whenever a test developer wants
    to estimate the effects of lengthening or
    shortening a test on its reliability coefficient.
    For instance, if a 100-item test has a
    reliability coefficient of .84, the
    Spearman-Brown formula could be used to estimate
    the effects of increasing the number of items to
    150 or reducing the number to 50. A problem with
    the Spearman-Brown formula is that it does not
    always yield an accurate estimate of reliability:
    In general, it tends to overestimate a test's
    true reliability (Gay, 1992).

26
RELIABILITY AND VALIDITY
  • This is most likely to be the case when the
    added items do not measure the same content
    domain as the original items and/or are more
    susceptible to the effects of measurement error.
    Note that, when used to correct the split-half
    reliability coefficient, the situation is more
    complex, and this generalization does not always
    apply: When the two halves are not equivalent in
    terms of their means and standard deviations, the
    Spearman-Brown formula may either over- or
    underestimate the test's actual reliability.
  • 2. Range of Test Scores: Since the reliability
    coefficient is a correlation coefficient, it is
    maximized when the range of scores is
    unrestricted. The range is directly affected by
    the degree of similarity of examinees with regard
    to the attribute measured by the test: When
    examinees are heterogeneous, the range of scores
    is maximized. The range is also affected by the
    difficulty level of the test items. When all
    items are either very difficult or very easy, all
    examinees will obtain either low or high scores,
    resulting in a restricted range. Therefore, the
    best strategy is to choose items so that the
    average difficulty level is in the mid-range (p = .50).

27
RELIABILITY AND VALIDITY
  • 3. Guessing: A test's reliability coefficient is
    also affected by the probability that examinees
    can guess the correct answers to test items. As
    the probability of correctly guessing answers
    increases, the reliability coefficient decreases.
    All other things being equal, a true/false test
    will have a lower reliability coefficient than a
    four-alternative multiple-choice test which, in
    turn, will have a lower reliability coefficient
    than a free recall test.

28
RELIABILITY AND VALIDITY
  • The Interpretation of Reliability
  •  
  • The interpretation of a test's reliability
    entails considering its effects on the scores
    achieved by a group of examinees as well as the
    score obtained by a single examinee.
  •  
  • 1. The Reliability Coefficient: As discussed
    above, a reliability coefficient is interpreted
    directly as the proportion of variability in a
    set of test scores that is attributable to true
    score variability. A reliability coefficient of
    .84 indicates that 84% of the variability in test
    scores is due to true score differences among
    examinees, while the remaining 16% is due to
    measurement error. While different types of tests
    can be expected to have different levels of
    reliability, for most tests, reliability
    coefficients of .80 or larger are considered
    acceptable.

29
RELIABILITY AND VALIDITY
  • When interpreting a reliability coefficient, it
    is important to keep in mind that there is no
    single index of reliability for a given test.
    Instead, a test's reliability coefficient can
    vary from situation to situation and sample to
    sample. Ability tests, for example, typically
    have different reliability coefficients for
    groups of individuals of different ages or
    ability levels.

30
RELIABILITY AND VALIDITY
  • 2. The Standard Error of Measurement: While the
    reliability coefficient is useful for estimating
    the proportion of true score variability in a set
    of test scores, it is not particularly helpful
    for interpreting an individual examinee's
    obtained test score. When an examinee receives a
    score of 80 on a 100-item test that has a
    reliability coefficient of .84, for instance, we
    can only conclude that, since the test is not
    perfectly reliable, the examinee's obtained score
    might or might not be his or her true score.
  •  
  • A common practice when interpreting an
    examinee's obtained score is to construct a
    confidence interval around that score. The
    confidence interval helps a test user estimate
    the range within which an examinee's true score
    is likely to fall given his or her obtained
    score. This range is calculated using the
    standard error of measurement, which is an index
    of the amount of error that can be expected in
    obtained scores due to the unreliability of the
    test. (When raw scores have been converted to
    percentile ranks, the confidence interval is
    referred to as a percentile band.)

31
RELIABILITY AND VALIDITY
  • The following formula is used to estimate the
    standard error of measurement:
  •  
  • Formula 1: Standard Error of Measurement
  • SEmeas = SDx (1 - rxx)^1/2
  • where
  • SEmeas = standard error of measurement
  • SDx = standard deviation of test scores
  • rxx = reliability coefficient
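Formula 1 translates directly into code. A minimal Python sketch (the function name is ours, not the presentation's):

    import math

    def sem(sd_x, r_xx):
        """Standard error of measurement: SDx * sqrt(1 - rxx)."""
        return sd_x * math.sqrt(1 - r_xx)

    # With SD = 10 and reliability = .84: 10 * sqrt(.16) = 4.0
    print(sem(10, 0.84))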

32
RELIABILITY AND VALIDITY
  • As shown by the formula, the magnitude of the
    standard error is affected by two factors: the
    standard deviation of the test scores and the
    test's reliability coefficient. The lower the
    test's standard deviation and the higher its
    reliability coefficient, the smaller the standard
    error of measurement (and vice versa).
  •  
  • Because the standard error is a type of standard
    deviation, it can be interpreted in terms of the
    areas under the normal curve. With regard to
    confidence intervals, this means that a 68%
    confidence interval is constructed by adding and
    subtracting one standard error to an examinee's
    obtained score; a 95% confidence interval is
    constructed by adding and subtracting two
    standard errors; and a 99% confidence interval is
    constructed by adding and subtracting three
    standard errors.

33
RELIABILITY AND VALIDITY
  • Example: The psychologist in Study 3
    administers the interpersonal assertiveness test
    to a sales applicant who receives a score of 80.
    Since the test's reliability is less than 1.0,
    the psychologist knows that this score might be
    an imprecise estimate of the applicant's true
    score and decides to use the standard error of
    measurement to construct a 95% confidence
    interval. Assuming that the test's reliability
    coefficient is .84 and its standard deviation is
    10, the standard error of measurement is equal to
    4.0:
  •  
  • SEmeas = SDx (1 - rxx)^1/2 = 10(1 - .84)^1/2 = 10(.4) = 4.0
  •  
  •   The psychologist constructs the 95% confidence
    interval by adding and subtracting two standard
    errors from the applicant's obtained score: 80 ±
    2(4.0) = 72 to 88. This means that there is a 95%
    chance that the applicant's true score falls
    between 72 and 88.
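The same calculation in a few lines of Python, using the slide's values (score 80, SD 10, rxx = .84) and the two-standard-error approximation for a 95% interval:

    import math

    score, sd, r_xx = 80, 10, 0.84
    se = sd * math.sqrt(1 - r_xx)                 # 4.0
    lower, upper = score - 2 * se, score + 2 * se
    print(f"95% CI: {lower:.0f} to {upper:.0f}")  # 72 to 88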

34
RELIABILITY AND VALIDITY
  • One problem with the standard error is that
    measurement error is not usually equally
    distributed throughout the range of test scores.
    Use of the same standard error to construct
    confidence intervals for all scores in a
    distribution can, therefore, be somewhat
    misleading. To overcome this problem, some test
    manuals report different standard errors for
    different score intervals.

35
RELIABILITY AND VALIDITY
  • 3. Estimating True Scores from Obtained Scores:
    As discussed above, because of the effects of
    measurement error, obtained test scores tend to
    be biased (inaccurate) estimates of true scores.
    More specifically, scores above the mean of a
    distribution tend to overestimate true scores,
    while scores below the mean tend to underestimate
    true scores. Moreover, the farther from the mean
    an obtained score is, the greater this bias.
    Rather than constructing a confidence interval,
    an alternative (but less used) method for
    interpreting an examinee's obtained test score is
    to estimate his/her true score using a formula
    that takes into account this bias by adjusting
    the obtained score using the mean of the
    distribution and the test's reliability
    coefficient.

36
RELIABILITY AND VALIDITY
  • For example, if an examinee obtains a score of
    80 on a test that has a mean of 70 and a
    reliability coefficient of .84, the formula
    predicts that the examinee's true score is 78.4:
  • T' = a + bX, where a = (1 - rxx)(mean) and b = rxx
  • T' = (1 - .84)(70) + (.84)(80)
  •    = (.16)(70) + (.84)(80)
  •    = 11.2 + 67.2 = 78.4
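A one-function Python sketch of this estimate; note how the obtained score is pulled toward the mean in proportion to the test's unreliability:

    def estimated_true_score(x, mean, r_xx):
        """Regress the obtained score toward the group mean: (1 - rxx)*M + rxx*X."""
        return (1 - r_xx) * mean + r_xx * x

    print(estimated_true_score(80, 70, 0.84))  # 78.4 (above-mean score adjusted downward)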

37
RELIABILITY AND VALIDITY
  • 4. The Reliability of Difference Scores: A test
    user is sometimes interested in comparing the
    performance of an examinee on two different tests
    or subtests and, therefore, computes a difference
    score. An educational psychologist, for instance,
    might calculate the difference between a child's
    WISC-R Verbal and Performance IQ scores. When
    doing so, it is important to keep in mind that
    the reliability coefficient for the difference
    scores can be no larger than the average of the
    reliabilities of the two tests or subtests: If
    Test A has a reliability coefficient of .95 and
    Test B has a reliability coefficient of .85, this
    means that difference scores calculated from the
    two tests will have a reliability coefficient of
    .90 or less. The exact size of the reliability
    coefficient for difference scores depends on the
    degree of correlation between the two tests: The
    more highly correlated the tests, the smaller the
    reliability coefficient (and the larger the
    standard error of measurement).
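The slide states the bound but not the formula. A common classical-test-theory expression, assuming the two tests have equal score variances (an assumption added here, not stated on the slide), is sketched below; it reproduces the bound and shows how intercorrelation shrinks the coefficient:

    def difference_reliability(r11, r22, r12):
        """Reliability of difference scores for equal-variance tests."""
        return ((r11 + r22) / 2 - r12) / (1 - r12)

    print(difference_reliability(0.95, 0.85, 0.0))  # 0.90, the average of the two
    print(difference_reliability(0.95, 0.85, 0.6))  # 0.75, lower as the tests correlate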

38
RELIABILITY AND VALIDITY
  • Validity
  •  
  • Validity refers to a test's accuracy. A test is
    valid when it measures what it is intended to
    measure. The intended uses for most tests fall
    into one of three categories, and each category
    is associated with a different method for
    establishing validity:
  •  
  • The test is used to obtain information about an
    examinee's familiarity with a particular content
    or behavior domain: content validity.
  •  
  • The test is administered to determine the extent
    to which an examinee possesses a particular
    hypothetical trait: construct validity.
  •  
  • The test is used to estimate or predict an
    examinee's standing or performance on an external
    criterion: criterion-related validity.

39
RELIABILITY AND VALIDITY
  • For some tests, it is necessary to demonstrate
    only one type of validity; for others, it is
    desirable to establish more than one type. For
    example, if an arithmetic achievement test will
    be used to assess the classroom learning of 8th
    grade students, establishing the test's content
    validity would be sufficient. If the same test
    will be used to predict the performance of 8th
    grade students in an advanced high school math
    class, the test's content and criterion-related
    validity will both be of concern.
  •  
  • Note that, even when a test is found valid for a
    particular purpose, it might not be valid for
    that purpose for all people. It is quite possible
    for a test to be a valid measure of intelligence
    or a valid predictor of job performance for one
    group of people but not for another group.

40
RELIABILITY AND VALIDITY
  • Content Validity
  •  
  • A test has content validity to the extent that
    it adequately samples the content or behavior
    domain that it is designed to measure. If test
    items are not a good sample, results of testing
    will be misleading. Although content validation
    is sometimes used to establish the validity of
    personality, aptitude, and attitude tests, it is
    most associated with achievement-type tests that
    measure knowledge of one or more content domains
    and with tests designed to assess a well-defined
    behavior domain. Adequate content validity would
    be important for a statistics test and for a work
    (job) sample test.
  •  
  • Content validity is usually "built into" a test
    as it is constructed through a systematic,
    logical, and qualitative process that involves
    clearly identifying the content or behavior
    domain to be sampled and then writing or
    selecting items that represent that domain. Once
    a test has been developed, the establishment of
    content validity relies primarily on the judgment
    of subject matter experts. If experts agree that
    test items are an adequate and representative
    sample of the target domain, then the test is
    said to have content validity.

41
RELIABILITY AND VALIDITY
  • Although content validation depends mainly on
    the judgment of experts, supplemental
    quantitative evidence can be obtained. If a test
    has adequate content validity, a coefficient of
    internal consistency will be large; the test will
    correlate highly with other tests that purport to
    measure the same domain; and pre-/post-test
    evaluations of a program designed to increase
    familiarity with the domain will indicate
    appropriate changes.
  •  
  • Content validity must not be confused with face
    validity. Content validity refers to the
    systematic evaluation of a test by experts who
    determine whether or not test items adequately
    sample the relevant domain, while face validity
    refers simply to whether or not a test "looks
    like" it measures what it is intended to measure.
    Although face validity is not an actual type of
    validity, it is a desirable feature for many
    tests. If a test lacks face validity, examinees
    may not be motivated to respond to items in an
    honest or accurate manner. A high degree of face
    validity does not, however, indicate that a test
    has content validity.

42
RELIABILITY AND VALIDITY
  • Construct Validity
  •  
  • When a test has been found to measure the
    hypothetical trait (construct) it is intended to
    measure, the test is said to have construct
    validity. A construct is an abstract
    characteristic that cannot be observed directly
    but must be inferred by observing its effects.
    Intelligence, mechanical aptitude, self-esteem,
    and neuroticism are all constructs.
  •  
  • There is no single way to establish a test's
    construct validity. Instead, construct validation
    entails a systematic accumulation of evidence
    showing that the test actually measures the
    construct it was designed to measure. The various
    methods used to establish this type of validity
    each answer a slightly different question about
    the construct and include the following:

43
RELIABILITY AND VALIDITY
  • Assessing the test's internal consistency: Do
    scores on individual test items correlate highly
    with the total test score, i.e., are all of the
    test items measuring the same construct?
  •  
  • Studying group differences: Do scores on the test
    accurately distinguish between people who are
    known to have different levels of the construct?
  •  
  • Conducting research to test hypotheses about the
    construct: Do test scores change, following an
    experimental manipulation, in the direction
    predicted by the theory underlying the construct?
  •  

44
RELIABILITY AND VALIDITY
  • Assessing the test's convergent and discriminant
    validity: Does the test have high correlations
    with measures of the same trait (convergent
    validity) and low correlations with measures of
    unrelated traits (discriminant validity)?
  •  
  • Assessing the test's factorial validity: Does the
    test have the factorial composition it would be
    expected to have, i.e., does it have factorial
    validity?

45
RELIABILITY AND VALIDITY
  • Construct validity is said to be the most
    theory-laden of the methods of test validation.
    The developer of a test designed to measure a
    construct begins with a theory about the nature
    of the construct, which then guides the test
    developer in selecting test items and in choosing
    the methods for establishing the test's validity.
    For example, if the developer of a creativity
    test believes that creativity is unrelated to
    general intelligence, that creativity is an
    innate characteristic that cannot be learned, and
    that creative people can be expected to generate
    more alternative solutions to certain types of
    problems than non-creative people, she would want
    to determine the correlation between scores on
    the creativity test and a measure of
    intelligence, see if a course in creativity
    affects test scores, and find out if test scores
    distinguish between people who differ in the
    number of solutions they generate to relevant
    problems.

46
RELIABILITY AND VALIDITY
  • Note that some experts consider construct
    validity to be the most basic form of validity
    because the techniques involved in establishing
    construct validity overlap those used to
    determine if a test has content or
    criterion-related validity. Indeed, Cronbach
    argues that "all validation is one, and in a
    sense all is construct validation."

47
RELIABILITY AND VALIDITY
  • Construct Validity
  • Convergent and Discriminant Validity: As noted
    above, one way to assess a test's construct
    validity is to correlate test scores with scores
    on measures that do and do not purport to assess
    the same trait. High correlations with measures
    of the same trait provide evidence of the test's
    convergent validity, while low correlations with
    measures of unrelated characteristics provide
    evidence of the test's discriminant (divergent)
    validity.

48
RELIABILITY AND VALIDITY
  • The multitrait-multimethod matrix (Campbell &
    Fiske, 1959) is used to systematically organize
    the data collected when assessing a test's
    convergent and discriminant validity. The
    multitrait-multimethod matrix is a table of
    correlation coefficients, and, as its name
    suggests, it provides information about the
    degree of association between two or more traits
    that have each been assessed using two or more
    methods. When the correlations between different
    methods measuring the same trait are larger than
    the correlations between the same and different
    methods measuring different traits, the matrix
    provides evidence of the test's convergent and
    discriminant validity.

49
RELIABILITY AND VALIDITY
  • Example: To assess the construct validity of the
    interpersonal assertiveness test, the
    psychologist in Study 3 administers four
    measures to a group of salespeople: (1) the
    test of interpersonal assertiveness; (2) a
    supervisor's rating of interpersonal
    assertiveness; (3) a test of aggressiveness; and
    (4) a supervisor's rating of aggressiveness. The
    psychologist has the minimum data needed to
    construct a multitrait-multimethod matrix: She
    has measured two traits that she believes are
    unrelated (assertiveness and aggressiveness), and
    each trait has been measured by two different
    methods (a test and a supervisor's rating). The
    psychologist calculates correlation coefficients
    for all possible pairs of scores on the four
    measures and constructs the following
    multitrait-multimethod matrix (the upper half of
    the table has not been filled in because it would
    simply duplicate the correlations in the lower
    half).
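A sketch of how such a matrix could be assembled in Python. The simulated score columns below are hypothetical stand-ins for the psychologist's four measures, built so that same-trait measures correlate highly and different-trait measures do not:

    import numpy as np

    rng = np.random.default_rng(0)
    assertive = rng.normal(size=50)    # latent assertiveness
    aggressive = rng.normal(size=50)   # latent aggressiveness, unrelated

    measures = [
        assertive + 0.3 * rng.normal(size=50),     # assertiveness test
        assertive + 0.6 * rng.normal(size=50),     # assertiveness rating
        aggressive + 0.3 * rng.normal(size=50),    # aggressiveness test
        aggressive + 0.6 * rng.normal(size=50),    # aggressiveness rating
    ]

    # Each row/column is one measure; the off-diagonal entries fill the matrix.
    print(np.corrcoef(measures).round(2))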

50
RELIABILITY AND VALIDITY
51
RELIABILITY AND VALIDITY
  • All multitrait-multimethod matrices contain four
    types of correlation coefficients
  •  
  • Monotrait-monomethod coefficients ("same
    trait-same method")
  • Monotrait-heteromethod coefficients ("same
    trait-different methods")
  • Heterotrait-monomethod coefficients ("different
    traits-same method")
  • Heterotrait-heteromethod coefficients ("different
    traits-different methods")

52
RELIABILITY AND VALIDITY
  • 1. Monotrait-monomethod coefficients ("same
    trait-same method") The monotrait-monomethod
    coefficients (coefficients in parentheses in the
    above matrix) are reliability coefficients They
    indicate the correlation between a measure and
    itself. Although these coeffcients are not
    directly relevant to a test's convergent and
    discriminant validity, they should be large in
    order for the matrix to provide useful
    information.

53
RELIABILITY AND VALIDITY
  • 2. Monotrait-heteromethod coefficients ("same
    trait-different methods") These coefficients
    (coefficients in rectangles) indicate the
    correlation between different measures of the
    same trait. When these coefficients are large,
    they provide evidence of convergent validity.

54
RELIABILITY AND VALIDITY
  • 3. Heterotrait-monomethod coefficients
    ("different traits-same method") These
    coefficients (coefficients in ellipses) show the
    correlation between different traits that have
    been measured by the same method. When the
    heterotrait-monomethod coefficients are small,
    this indicates that a test has discriminant
    validity.

55
RELIABILITY AND VALIDITY
  • 4. Heterotrait-heteromethod coefficients
    ("different traits-different methods") The
    heterotrait-heteromethod coefficients (underlined
    coefficients) indicate the correlation between
    different traits that have been measured by
    different methods. These coefficients also
    provide evidence of discriminant validity when
    they are small.

56
RELIABILITY AND VALIDITY
  • Note that, in a multitrait-multimethod matrix,
    only those correlation coefficients that include
    the test that is being validated are actually of
    interest. For the above example, the correlation
    between the rating of interpersonal assertiveness
    and the rating of aggressiveness (r = .16) is a
    heterotrait-monomethod coefficient, but it isn't
    of interest because it doesn't provide
    information about the interpersonal assertiveness
    test. Also, the number of correlation
    coefficients that can provide evidence of
    convergent and discriminant validity depends on
    the number of measures included in the matrix. In
    the example, only four measures were included
    (the minimum number), but there could certainly
    have been more.

57
RELIABILITY AND VALIDITY
  • Example: Three of the correlations in the above
    multitrait-multimethod matrix are relevant to the
    construct validity of the interpersonal
    assertiveness test. The correlation between the
    assertiveness test and the assertiveness rating
    (monotrait-heteromethod coefficient) is .71.
    Since this is a relatively high correlation, it
    suggests that the test has convergent validity.
    The correlation between the assertiveness test
    and the aggressiveness test (heterotrait-monomethod
    coefficient) is .13, and the correlation between
    the assertiveness test and the aggressiveness
    rating (heterotrait-heteromethod coefficient) is
    .04. Because these two correlations are low, they
    confirm that the assertiveness test has
    discriminant validity. This pattern of
    correlation coefficients confirms that the
    assertiveness test has construct validity. Note
    that the monotrait-monomethod coefficient for the
    assertiveness test is .93, which indicates that
    the test also has adequate reliability. (The
    other correlations in the matrix are not relevant
    to the psychologist's validation study because
    they do not include the assertiveness test.)

58
RELIABILITY AND VALIDITY
59
RELIABILITY AND VALIDITY
  • Construct Validity
  • 2. Factor Analysis: Factor analysis is used for
    several reasons, including identifying the minimum
    number of common factors required to account for
    the intercorrelations among a set of tests or
    test items, evaluating a test's internal
    consistency, and assessing a test's construct
    validity. When factor analysis is used for the
    latter purpose, a test is considered to have
    construct (factorial) validity when it correlates
    highly only with the factor(s) with which it would
    be expected to correlate.

60
DESCRIPTIVE STATISTICS
  • Descriptive Statistics
  •  
  • Descriptive statistics are used to describe or
    summarize a distribution (set) of data.
    Descriptive techniques include:
  • tables,
  • graphs,
  • measures of central tendency, and
  • measures of variability.

61
DESCRIPTIVE STATISTICS
  • A set of data can be organized in a table known
    as a frequency distribution. Frequency
    distributions are constructed by summarizing the
    data in terms of the number (frequency) of
    observations in each category, score, or score
    interval. In Study 1, the academic achievement
    test scores of 25 children with ADHD could be
    summarized as shown in Table 1. The column
    labeled "Frequency (f)" indicates the number of
    observations in each score interval: Three of the
    25 children received a score between 80 and 100,
    while five received a score between 60 and 79.

62
DESCRIPTIVE STATISTICS
  • Table 1 also includes a "Cumulative Frequency
    (cf)" column. The cumulative frequencies indicate
    the total number of observations that fall at or
    below each category or score. The information in
    Table 1 indicates that 2 of the 25 children
    received scores of 19 or below, 5 received scores
    of 39 or below, and so on.
  • Table 1

      Score      Frequency (f)   Cumulative Frequency (cf)
      80-100           3                   25
      60-79            5                   22
      40-59           12                   17
      20-39            3                    5
      0-19             2                    2
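A Python sketch of how Table 1 could be computed. The 25 scores are hypothetical, chosen only so that the interval counts match the table:

    scores = [85, 92, 81, 78, 65, 61, 70, 74, 55, 44, 58, 41, 47, 52, 50,
              43, 59, 45, 48, 57, 35, 22, 30, 15, 8]   # hypothetical, n = 25
    intervals = [(0, 19), (20, 39), (40, 59), (60, 79), (80, 100)]

    cf = 0
    rows = []
    for low, high in intervals:                  # accumulate from the bottom up
        f = sum(low <= s <= high for s in scores)
        cf += f
        rows.append((f"{low}-{high}", f, cf))

    for interval, f, cum in reversed(rows):      # print the highest interval first
        print(f"{interval:>7}   f = {f:>2}   cf = {cum:>2}")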

63
DESCRIPTIVE STATISTICS
  • The information presented in a table can also be
    presented in a graph. Bar graphs, histograms, and
    frequency polygons are three types of graphs. The
    choice of a graph depends on the scale of
    measurement: Bar graphs are used when the data
    represent a nominal or ordinal scale, while
    histograms and frequency polygons are used with
    interval or ratio data.

64
DESCRIPTIVE STATISTICS
  • Shapes of distributions
  • Normal curve (mean, mode, and median fall on the
    same point)
  • Leptokurtic distribution (more peaked than the
    normal curve)
  • Platykurtic distribution (flatter than the normal
    curve)
  • Positively skewed distribution (the tail extends
    toward the positive side of the distribution,
    i.e., most of the scores fall at the lower
    end): mode < median < mean.
  • Negatively skewed distribution (the opposite
    characteristics of the positively skewed
    distribution): mean < median < mode.

65
DESCRIPTIVE STATISTICS
  • Measures of central tendency
  • Mean: the arithmetic average.
  • Mode: the most frequently occurring score(s).
  • Median: the middle score.
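All three measures are one-liners in Python's standard library (hypothetical data):

    from statistics import mean, median, mode

    data = [2, 3, 3, 4, 5, 7, 9]
    print(mean(data), median(data), mode(data))  # 4.71..., 4, 3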

66
DESCRIPTIVE STATISTICS
  • Measures of variability
  • Range = maximum score - minimum score.
  • Variance (mean square): s² = SS/(N - 1) =
    Σ(X - M)²/(N - 1) (see the sketch below).
  • The denominator is N - 1 for the sample variance
    because the sample variance tends to
    underestimate the population variance; once the
    sample mean is fixed, one score cannot vary
    freely.
  • Standard deviation: computed by taking the square
    root of the variance.
  • Normal distribution: M ± 1 SD contains 68.26% of
    scores, M ± 2 SD contains 95.44%, and M ± 3 SD
    contains 99.72%.
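A small Python sketch of the variance and standard deviation formulas above, written out rather than calling a library so the N - 1 denominator is visible (hypothetical data):

    data = [2, 3, 3, 4, 5, 7, 9]
    n = len(data)
    m = sum(data) / n
    ss = sum((x - m) ** 2 for x in data)   # sum of squared deviations from the mean
    variance = ss / (n - 1)                # N - 1 for a sample
    sd = variance ** 0.5
    print(round(variance, 2), round(sd, 2))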

67
DESCRIPTIVE STATISTICS
  • Effects of mathematical operations on measures of
    central tendency and variability: Adding or
    subtracting a constant to every score changes the
    measure of central tendency but not the
    variability; multiplying or dividing by a
    constant changes both (see the sketch below).
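A quick Python demonstration of the rule, using hypothetical scores:

    import statistics as st

    data = [2, 4, 6, 8]
    shifted = [x + 10 for x in data]   # add a constant
    scaled = [x * 3 for x in data]     # multiply by a constant

    print(st.mean(data), round(st.stdev(data), 2))        # 5, 2.58
    print(st.mean(shifted), round(st.stdev(shifted), 2))  # 15, 2.58 (SD unchanged)
    print(st.mean(scaled), round(st.stdev(scaled), 2))    # 15, 7.75 (both tripled)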

68
INFERENTIAL STATISTICS
  • Inferential Statistics
  •  
  • While descriptive statistics are used to
    summarize data, inferential statistics are used
    to make inferences about a population based on
    data collected from a sample drawn from that
    population and to do so with a pre-defined degree
    of confidence. In this section, the concept of
    statistical inference is explained. In Section
    IV, specific inferential statistical tests are
    described.

69
INFERENTIAL STATISTICS
  • The Logic of Statistical Inference
  •  
  • The techniques of statistical inference allow an
    investigator to make inferences about the
    relationships between variables in a population
    based on relationships observed in a sample.

70
INFERENTIAL STATISTICS
  • For example, the psychologist in Study 1 will
    want to determine if there is a relationship
    between training in the self-control procedure
    and scores on an academic achievement test for
    all children who have received a diagnosis of
    ADHD. Since the psychologist won't have access to
    the entire population of children with this
    disorder, he will evaluate the effects of the
    self-control procedure on a sample of children
    drawn from the target population. The
    psychologist will then use an inferential
    statistical test to analyze the data he collects
    from the sample, and results of the test will
    enable him to make an inference about the effects
    of the procedure on the achievement test scores
    for the population of children with ADHD.
    Inferential statistical tests accomplish this
    task through the use of a sampling distribution.

71
INFERENTIAL STATISTICS
  • Sampling Distributions
  • 1. Population Parameters and Sample Statistics:
    To understand inferential statistics, it is
    necessary to first distinguish between sample
    values and population values. As noted above,
    when conducting a research study, an investigator
    does not have access to the entire population of
    interest but, instead, estimates population
    values based on obtained sample values. In other
    words, an investigator uses a sample statistic to
    estimate a population parameter. Sample
    statistics and population parameters are
    designated with different symbols.

72
INFERENTIAL STATISTICS
73
INFERENTIAL STATISTICS
  • 2. Characteristics of Sampling Distributions: Due
    to the effects of random (chance) factors, it is
    unlikely that any sample will perfectly represent
    the population from which it was drawn. As a
    result, an estimate of a population parameter
    from a sample statistic is always subject to some
    inaccuracy. Because of the effects of sampling
    error, sample statistics deviate from population
    parameters and from statistics obtained from
    other samples drawn from the same population.

74
INFERENTIAL STATISTICS
  • The relationship between sample statistics and a
    population parameter can be described in terms of
    a sampling distribution, which is a frequency
    distribution of the means or other sample values
    of a very large number of equal-sized samples
    that have been randomly selected from the
    population. Keep in mind that a sampling
    distribution is not a distribution of individual
    scores but a distribution of sample statistics. A
    sampling distribution is important in inferential
    statistics because it allows a researcher to
    determine the probability that a sample having a
    particular mean or other value could have been
    drawn from a population with a known parameter.

75
INFERENTIAL STATISTICS
  • To better understand what a sampling
    distribution is, assume that the psychologist in
    Study 1 defines his population as "all children
    in the 6th grade who have received a diagnosis of
    ADHD," and, for that population, an academic
    achievement test has a mean of 50 and a standard
    deviation of 10. The psychologist repeatedly
    selects random samples of 25 children from this
    population and, for each sample, he administers
    the achievement test and calculates the mean
    score. The psychologist has collected a set of
    sample means and finds that, while some of the
    sample means are equal to the population mean
    (50), because of the effects of sampling error,
    some means are larger than the population mean
    and some are smaller. In fact, the psychologist
    finds that his distribution of sample means, or
    sampling distribution of the mean, resembles the
    distribution depicted in Figure 7. As shown in
    that figure, the sampling distribution of the
    mean is normally shaped and its mean is equal to
    the population mean of 50.

76
INFERENTIAL STATISTICS
  • Researchers do not actually construct a sampling
    distribution of the mean by obtaining a large
    number of samples and calculating each sample's
    mean. Instead, they depend on probability theory
    to tell them what a sampling distribution would
    look like. The sampling distribution defined by
    probability theory is called a theoretical
    sampling distribution, and it is based on the
    assumption that an infinite number of equal-sized
    samples have been randomly drawn from the same
    population.

77
INFERENTIAL STATISTICS
  • The characteristics of a sampling distribution
    of the mean are specified by the Central Limit
    Theorem, which makes the following predictions:
    (a) Regardless of the shape of the distribution
    of individual scores in the population, as the
    sample size increases, the sampling distribution
    of the mean approaches a normal distribution; (b)
    The mean of the sampling distribution of the mean
    is equal to the population mean; (c) The standard
    deviation of the sampling distribution of the
    mean is equal to the population standard
    deviation divided by the square root of the
    sample size:
  •  
  • SEM = σ/√N
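These predictions are easy to check by simulation. The Python sketch below draws 100,000 samples of size 25 from a normal population with the Study 1 values (mean 50, SD 10) and confirms that the sample means cluster around 50 with a standard deviation near 10/√25 = 2:

    import numpy as np

    rng = np.random.default_rng(42)
    sample_means = rng.normal(50, 10, size=(100_000, 25)).mean(axis=1)

    print(round(sample_means.mean(), 2))  # ~50.0, the population mean
    print(round(sample_means.std(), 2))   # ~2.0, i.e., 10 / sqrt(25)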

78
INFERENTIAL STATISTICS
  • The standard deviation of a sampling
    distribution of the mean is known as the standard
    error of the mean. It provides an estimate of the
    extent to which the mean of any one sample
    randomly drawn from a population can be expected
    to vary from the population mean as the result of
    sampling error. In other words, like other
    standard deviations, it is a measure of
    variability, but it is a measure of variability
    that is due to the effects of random error. The
    formula for SEM indicates that the size of the
    standard error of the mean is affected by the
    population standard deviation and the sample size
    (N): The larger the population standard deviation
    and the smaller the sample size, the larger the
    standard error and vice versa.

79
INFERENTIAL STATISTICS
  • For the above example, the population standard
    deviation for the achievement test is 10 and the
    sample size is 25. Using Formula 4, we can
    determine that the standard error of the mean in
    this situation is equal to 2:
  •  
  • SEM = 10/√25 = 10/5 = 2
  •  
  • For Study 1, the Central Limit Theorem
    predicts that the sampling distribution of the
    mean is normally shaped, that its mean is equal
    to 50, and that its standard deviation is equal
    to 2.

80
INFERENTIAL STATISTICS
  • Note that, if the sample size had been 9 instead
    of 25, the standard error would increase to 3.33
    (10 divided by the square root of 9 = 10/3 =
    3.33). In other words, the smaller the sample
    size, the larger the standard error of the mean.
    One implication of this is that the smaller the
    size of the sample, the greater the probability
    for error when using a sample statistic to
    estimate a population parameter. Another
    implication is that, for any given population,
    there is a ''family'' of sampling distributions,
    with a different distribution for each sample
    size.

81
INFERENTIAL STATISTICS
  • Although this discussion of sampling
    distributions has focused on the sampling
    distribution of the mean, a sampling distribution
    can actually be derived for any sample statistic.
    A sampling distribution can be obtained for
    standard deviations, proportions, correlation
    coefficients, the difference between means, and
    so on. In each case, the basic characteristics of
    the sampling distribution are similar to those of
    the sampling distribution of the mean.
  •  
  • The sampling distribution is the foundation of
    inferential statistics. It is the sampling
    distribution that enables a researcher to make
    inferences about the relationships between
    variables in the population based on obtained
    sample data. How this is done is described in the
    next section.

82
INFERENTIAL STATISTICS
  • Analyzing the Data and Making a Decision
  • After stating the null and alternative hypotheses
    and collecting the sample data, an investigator
    analyzes the data using an inferential
    statistical test such as the t-test or analysis
    of variance. The choice of a statistical test is
    based on several factors including the scale of
    measurement of the data to be analyzed.
  • The inferential statistical test yields a t, an
    F, or other value that indicates where the
    obtained sample statistic falls in the
    appropriate sampling distribution. That is, the
    test indicates whether the statistic is in the
    rejection region or the retention region of the
    sampling distribution.

83
INFERENTIAL STATISTICS
  • The rejection region, or "region of unlikely
    values," lies in one or both tails of the
    sampling distribution and contains the sample
    values that are most unlikely to occur simply as
    the result of sampling error. (The rejection
    region is also known as the critical region.)
  •  
  • The retention region, or "region of likely
    values," lies in the central portion of the
    sampling distribution and consists of the values
    that are likely to occur as a consequence of
    sampling error only.

84
INFERENTIAL STATISTICS
  • When the results of the statistical test
    indicate that the obtained sample statistic is in
    the rejection region of the sampling
    distribution, the null hypothesis is rejected and
    the alternative hypothesis is retained. The
    investigator concludes that the sample statistic
    is not likely to have been obtained by chance
    alone and that the independent variable has had
    an effect on the dependent variable. Conversely,
    if the statistical test indicates that the sample
    statistic lies in the retention region of the
    sampling distribution, the null hypothesis is
    retained and the alternative hypothesis is
    rejected. In this case, the investigator
    concludes that the independent variable has not
    had an effect and that any observed effect is due
    to error.

85
INFERENTIAL STATISTICS
  • Example
  •  
  • In Study 1, if the children who receive
    training in the self-control procedure obtain a
    mean of 60 on the achievement test following
    training, the psychologist would use an
    inferential statistical test to determine whether
    the mean of 60 is due to error or to the
    procedure. If the results of the test indicate
    that a mean of 60 is in the retention region of
    the appropriate sampling distribution, the
    psychologist will conclude that the procedure
    does not have an effect and that the observed
    effect simply reflects error. Conversely, if the
    statistical test indicates that a mean of 60 is
    in the rejection region, the psychologist will
    conclude that the self-control procedure does, in
    fact, have a beneficial effect on achievement
    test scores.
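As a hedged illustration (the slides say only that an inferential statistical test is used, not which one), the decision could be framed as a one-sample z test, reusing the population values from the earlier slides (mean 50, standard error 2):

    from statistics import NormalDist

    sample_mean, pop_mean, se = 60, 50, 2.0
    z = (sample_mean - pop_mean) / se          # z = 5.0
    p = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed p-value

    alpha = 0.05
    print(f"z = {z:.1f}, p = {p:.2g}:",
          "reject H0" if p < alpha else "retain H0")  # far in the rejection region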

86
INFERENTIAL STATISTICS
  • Alpha: The size of the rejection region is
    defined by alpha (α), or the level of
    significance. If alpha is .05, then 5% of the
    sampling distribution represents the rejection
    region and the remaining 95% represents the
    retention region. The rejection region is always
    placed in one or both tails of the sampling
    distribution, that is, in the portion of the
    sampling distribution that contains the values
    that are least likely to occur as the result of
    sampling error only. The value of alpha is set by
    an experimenter prior to collecting and/or
    analyzing the data. In other words, it is the
    experimenter who decides what proportion of the
    sampling distribution will represent the region
    of unlikely values. In psychological research,
    alpha is commonly set at .05 or .01.

87
INFERENTIAL STATISTICS
  • When the results of an inferential statistical
    test indicate that the obtained sample statistic
    lies in the rejection region of the sampling
    distribution, the study's results are said to be
    statistically significant. For example, when
    alpha has been set at .05 and the statistical
    test indicates that the sample value is in the
    rejection region, the results of the study are
    "significant at the .05 level."

88
INFERENTIAL STATISTICS
  • One- versus Two-Tailed Tests: Some inferential
    statistical tests can be conducted as either a
    one- or two-tailed test. When a two-tailed test
    is used, the rejection region is equally divided
    between the two tails of the sampling
    distribution. If alpha is set at .05, 2.5% of the
    distribution forms the rejection region in the
    positive tail and 2.5% in the negative tail.
    With a one-tailed test, the entire rejection
    region is placed in only one of the tails. The
    division of the sampling distribution for one-
    and two-tailed tests when alpha has been set at
    .05 is illustrated in the following figure.

89
INFERENTIAL STATISTICS
  • It is the alternative hypothesis that determines
    whether a one- or a two-tailed test should be
    conducted. A two-tailed test is used when the
    alternative hypothesis is nondirectional, while a
    one-tailed test is used when the alternative
    hypothesis is directional. If a directional
    alternative hypothesis predicts that the sample
    statistic will be greater than the value
    specified in the null hypothesis, the entire
    rejection region lies in the positive tail of the
    sampling distribution. If a directional
    alternative hypothesis predicts that the sample
    statistic will be less than the value specified
    in the null hypothesis, the rejection region is
    located in the negative tail.

90
INFERENTIAL STATISTICS
  • Decide, on the basis of the results of the
    statistical test, whether to retain or reject the
    statistical hypotheses.

91
INFERENTIAL STATISTICS
  • Decision Outcomes: Regardless of whether an
    experimenter decides to retain or reject the null
    hypothesis, there are two possible outcomes of
    his or her decision: The decision can be either
    correct or in error, and an experimenter can
    never be entirely certain which type of decision
    has been made.

92
INFERENTIAL STATISTICS
  • Decision Errors: There are two decision errors,
    a Type I error and a Type II error. A Type I
    error occurs when an investigator rejects a true
    null hypothesis. For example, if the psychologist
    in Study 1 concludes that the self-control
    procedure improves achievement test scores when,
    in fact, it does not, he has made a Type I error.