Title: Validity/Reliability and a recap of Statistics
RELIABILITY AND VALIDITY
- Reliability
- From the perspective of classical test theory,
an examinee's obtained test score (X) is composed
of two components, a true score component (T) and
an error component (E): X = T + E.
- The true score component reflects the
examinee's status with regard to the attribute
that is measured by the test, while the error
component represents measurement error.
Measurement error is random error. It is due to
factors that are irrelevant to what is being
measured by the test and that have an
unpredictable (unsystematic) effect on an
examinee's test score.
- The score you obtain on a test is likely to be
due both to the knowledge you have about the
topics addressed by exam items (T) and to the
effects of random factors (E), such as the way
test items are written, any fluctuations in
anxiety, attention, or motivation you experience
while taking the test, and the accuracy of your
"educated guesses."
- Whenever we administer a test to examinees, we
would like to know how much of their scores
reflects "truth" and how much reflects error. It
is a measure of reliability that provides us with
an estimate of the proportion of variability in
examinees' obtained scores that is due to true
differences among examinees on the attribute(s)
measured by the test.
- When a test is reliable, it provides dependable,
consistent results and, for this reason, the term
consistency is often given as a synonym for
reliability (e.g., Anastasi, 1988).
- The Reliability Coefficient
- Ideally, a test's reliability would be
calculated by dividing true score variance by the
obtained (total) variance to derive a reliability
index. This index would indicate the proportion
of observed variability in test scores that
reflects true score variability. A test's true
score variance is not known, however, and
reliability must be estimated rather than
calculated directly. There are several ways to
estimate a test's reliability. Each involves
assessing the consistency of an examinee's scores
over time, across different content samples, or
across different scorers and is based on the
assumption that variability that is consistent is
true score variability, while variability that is
inconsistent reflects random error.
- Most methods for estimating reliability produce
a reliability coefficient, which is a correlation
coefficient that ranges in value from 0.0 to
1.0. When a test's reliability coefficient is
0.0, this means that all variability in obtained
test scores is due to measurement error.
Conversely, when a test's reliability coefficient
is 1.0, this indicates that all variability in
scores reflects true score variability. The
reliability coefficient is symbolized with the
letter "r" and a subscript that contains two of
the same letters or numbers (e.g., "rxx"). The
subscript indicates that the correlation
coefficient was calculated by correlating a test
with itself rather than with some other measure.
- Regardless of the method used to calculate a
reliability coefficient, the coefficient is
interpreted directly as the proportion of
variability in obtained test scores that reflects
true score variability. For example, as depicted
in Figure 1, a reliability coefficient of .84
indicates that 84% of the variability in scores
is due to true score differences among examinees,
while the remaining 16% (1.00 − .84) is due to
measurement error.
- Figure 1. Proportion of variability in test
scores: true score variability (84%) and error (16%).
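- To make the variance decomposition concrete, the following Python sketch (illustrative only; the simulated values are assumptions, not from the text) generates scores under the classical test theory model X = T + E and recovers the reliability as the ratio of true score variance to obtained variance:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
true_scores = rng.normal(50, 9, n)   # T: stable examinee attribute
errors = rng.normal(0, 4, n)         # E: random measurement error
obtained = true_scores + errors      # X = T + E

# Reliability = true score variance / obtained (total) variance
reliability = np.var(true_scores, ddof=1) / np.var(obtained, ddof=1)
print(f"reliability = {reliability:.2f}")  # ~ 81/(81 + 16) = .84
```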
- Note that a reliability coefficient does not
provide any information about what is actually
being measured by a test. A reliability
coefficient only indicates whether the attribute
measured by the test, whatever it is, is being
assessed in a consistent, precise way. Whether
the test is actually assessing what it was
designed to measure is addressed by an analysis
of the test's validity.
Study Tip: Remember that, in contrast to other
correlation coefficients, the reliability
coefficient is never squared to interpret it but
is interpreted directly as a measure of true
score variability. A reliability coefficient of
.89 means that 89% of the variability in obtained
scores is true score variability.
- Methods for Estimating Reliability
- The selection of a method for estimating
reliability depends on the nature of the test. As
noted below, each method not only entails
different procedures but is also affected by
different sources of error. For many tests, more
than one method should be used.
- 1. Test-Retest Reliability: The test-retest
method for estimating reliability involves
administering the same test to the same group of
examinees on two different occasions and then
correlating the two sets of scores. When using
this method, the reliability coefficient
indicates the degree of stability (consistency)
of examinees' scores over time and is also known
as the coefficient of stability.
- The primary sources of measurement error for
test-retest reliability are any random factors
related to the time that passes between the two
administrations of the test. These time sampling
factors include random fluctuations in examinees
over time (e.g., changes in anxiety or
motivation) and random variations in the testing
situation. Memory and practice also contribute to
error when they have random carryover effects,
i.e., when they affect many or all examinees but
not in the same way.
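- As a minimal sketch (hypothetical scores and variable names), the coefficient of stability is simply the Pearson correlation between the two administrations:

```python
import numpy as np

# Scores for the same five examinees on two occasions (hypothetical data).
time1 = np.array([72, 85, 90, 64, 78])
time2 = np.array([70, 88, 91, 60, 80])

# Test-retest reliability: correlate the two sets of scores.
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"coefficient of stability = {r_tt:.2f}")
```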
- Test-retest reliability is appropriate for
determining the reliability of tests designed to
measure attributes that are relatively stable
over time and that are not affected by repeated
measurement. It would be appropriate for a test
of aptitude, which is a stable characteristic,
but not for a test of mood, since mood fluctuates
over time, or a test of creativity, which might
be affected by previous exposure to test items.
- 2. Alternate (Equivalent, Parallel) Forms
Reliability: To assess a test's alternate forms
reliability, two equivalent forms of the test are
administered to the same group of examinees and
the two sets of scores are correlated. Alternate
forms reliability indicates the consistency of
responding to different item samples (the two
test forms) and, when the forms are administered
at different times, the consistency of responding
over time. The alternate forms reliability
coefficient is also called the coefficient of
equivalence when the two forms are administered
at about the same time and the coefficient of
equivalence and stability when a relatively long
period of time separates administration of the
two forms.
- The primary source of measurement error for
alternate forms reliability is content sampling,
or error introduced by an interaction between
different examinees' knowledge and the different
content assessed by the items included in the two
forms: the items in Form A might be a better
match of one examinee's knowledge than items in
Form B, while the opposite is true for another
examinee. In this situation, the two scores
obtained by each examinee will differ, which will
lower the alternate forms reliability
coefficient. When administration of the two forms
is separated by a period of time, time sampling
factors also contribute to error.
- Like test-retest reliability, alternate forms
reliability is not appropriate when the attribute
measured by the test is likely to fluctuate over
time (and the forms will be administered at
different times) or when scores are likely to be
affected by repeated measurement. If the same
strategies required to solve problems on Form A
are used to solve problems on Form B, even if the
problems on the two forms are not identical,
there are likely to be practice effects. When
these effects differ for different examinees
(i.e., are random), practice will serve as a
source of measurement error. Although alternate
forms reliability is considered by some experts
to be the most rigorous (and best) method for
estimating reliability, it is not often assessed
due to the difficulty in developing forms that
are truly equivalent.
- 3. Internal Consistency Reliability:
Reliability can also be estimated by measuring
the internal consistency of a test. Split-half
reliability and coefficient alpha are two methods
for evaluating internal consistency. Both involve
administering the test once to a single group of
examinees, and both yield a reliability
coefficient that is also known as the coefficient
of internal consistency.
- To determine a test's split-half reliability,
the test is split into equal halves so that each
examinee has two scores (one for each half of the
test). Scores on the two halves are then
correlated. Tests can be split in several ways,
but probably the most common way is to divide the
test on the basis of odd- versus even-numbered
items.
- A problem with the split-half method is that it
produces a reliability coefficient that is based
on scores derived from only one-half of the test.
If a test contains 30 items, each score is based
on only 15 items.
Because reliability tends to decrease as the
length of a test decreases, the split-half
reliability coefficient usually underestimates a
test's true reliability. For this reason, the
split-half reliability coefficient is ordinarily
corrected using the Spearman-Brown prophecy
formula, which provides an estimate of what the
reliability coefficient would have been had it
been based on the full length of the test.
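- The following sketch (hypothetical item data) splits a test into odd- and even-numbered items, correlates the half scores, and applies the Spearman-Brown correction to estimate full-length reliability:

```python
import numpy as np

# 1 = correct, 0 = incorrect: 6 examinees by 10 items (hypothetical data).
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
])

odd = items[:, 0::2].sum(axis=1)    # score on odd-numbered items
even = items[:, 1::2].sum(axis=1)   # score on even-numbered items

r_half = np.corrcoef(odd, even)[0, 1]   # half-test correlation
r_full = 2 * r_half / (1 + r_half)      # Spearman-Brown correction
print(f"split-half r = {r_half:.2f}, corrected r = {r_full:.2f}")
```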
- Cronbach's coefficient alpha also involves
administering the test once to a single group of
examinees. However, rather than splitting the
test in half, a special formula is used to
determine the average degree of inter-item
consistency. One way to interpret coefficient
alpha is as the average reliability that would be
obtained from all possible splits of the test.
Coefficient alpha tends to be conservative and
can be considered the lower boundary of a test's
reliability (Novick and Lewis, 1967). When test
items are scored dichotomously (right or wrong),
a variation of coefficient alpha known as the
Kuder-Richardson Formula 20 (KR-20) can be used.
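- A compact implementation of coefficient alpha (a sketch with made-up scores; because the items below are dichotomous 0/1, the same computation yields KR-20):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an examinees-by-items score matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

scores = np.array([  # dichotomous items, so this is also KR-20
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```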
- Content sampling is a source of error for both
split-half reliability and coefficient alpha. For
split-half reliability, content sampling refers
to the error resulting from differences between
the content of the two halves of the test (i.e.,
the items included in one half may better fit the
knowledge of some examinees than items in the
other half); for coefficient alpha, content
(item) sampling refers to differences between
individual test items rather than between test
halves. Coefficient alpha also has an additional
source of error: the heterogeneity of the content domain. A
test is heterogeneous with regard to content
domain when its items measure several different
domains of knowledge or behavior. The greater the
heterogeneity of the content domain, the lower
the inter-item correlations and the lower the
magnitude of coefficient alpha. Coefficient alpha
could be expected to be smaller for a 200-item
test that contains items assessing knowledge of
test construction, statistics, ethics,
industrial-organizational psychology, clinical
psychology, etc. than for a 200-item test that
contains questions on test construction only.
- The methods for assessing internal consistency
reliability are useful when a test is designed to
measure a single characteristic, when the
characteristic measured by the test fluctuates
over time, or when scores are likely to be
affected by repeated exposure to the test. They
are not appropriate for assessing the reliability
of speed tests because, for these tests, they
tend to produce spuriously high coefficients.
(For speed tests, alternate forms reliability is
usually the best choice.)
- 4. Inter-Rater (Inter-Scorer, Inter-Observer)
Reliability: Inter-rater reliability is of
concern whenever test scores depend on a rater's
judgment. A test constructor would want to make
sure that an essay test, a behavioral observation
scale, or a projective personality test has
adequate inter-rater reliability. This type of
reliability is assessed either by calculating a
correlation coefficient (e.g., a kappa
coefficient or coefficient of concordance) or by
determining the percent agreement between two or
more raters. Although the latter technique is
frequently used, it can lead to erroneous
conclusions since it does not take into account
the level of agreement that would have occurred
by chance alone. This is a particular problem for
behavioral observation scales that require raters
to record the frequency of a specific behavior.
In this situation, the degree of chance agreement
is high whenever the behavior has a high rate of
occurrence, and percent agreement will provide an
inflated estimate of the measure's reliability.
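- The chance-agreement problem can be seen in a short sketch (hypothetical ratings): two raters who agree 80% of the time on a high-base-rate behavior can still produce a near-zero (here negative) kappa:

```python
import numpy as np

# Two raters' codes for 10 observation intervals (1 = behavior occurred).
rater_a = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1])
rater_b = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])

p_obs = np.mean(rater_a == rater_b)   # percent agreement (ignores chance)

# Expected chance agreement, estimated from the raters' marginal rates.
p_chance = (rater_a.mean() * rater_b.mean()
            + (1 - rater_a.mean()) * (1 - rater_b.mean()))

kappa = (p_obs - p_chance) / (1 - p_chance)
print(f"agreement = {p_obs:.2f}, kappa = {kappa:.2f}")  # 0.80 vs. -0.11
```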
- Sources of error for inter-rater reliability
include factors related to the raters such as
lack of motivation and rater biases and
characteristics of the measuring device. An
inter-rater reliability coefficient is likely to
be low, for instance, when rating categories are
not exhaustive (i.e., don't include all possible
responses or behaviors) and/or are not mutually
exclusive.
- The inter-rater reliability of a behavioral
rating scale can also be affected by consensual
observer drift, which occurs when two (or more)
observers working together influence each other's
ratings so that they both assign ratings in a
similarly idiosyncratic way. (Observer drift can
also affect a single observer's ratings when he
or she assigns ratings in a consistently deviant
way.) Unlike other sources of error, consensual
observer drift tends to artificially inflate
inter-rater reliability.
- The reliability (and validity) of ratings can be
improved in several ways. Consensual observer
drift can be eliminated by having raters work
independently or by alternating raters. Rating
accuracy is also improved when raters are told
that their ratings will be checked. Overall, the
best way to improve both inter- and intra-rater
accuracy is to provide raters with training that
emphasizes the distinction between observation
and interpretation (Aiken, 1985).
Study Tip: Remember that the Spearman-Brown formula is
related to split-half reliability and KR-20 is
related to the coefficient alpha. Also know that
alternate forms reliability is the most thorough
method for estimating reliability and that
internal consistency reliability is not
appropriate for speed tests.
- Factors That Affect the Reliability Coefficient
- The magnitude of the reliability coefficient is
affected not only by the sources of error
discussed above but also by the length of the
test, the range of the test scores, and the
probability that the correct response to items
can be selected by guessing.
- Test Length
- Range of Test Scores
- Guessing
- 1. Test Length: The larger the sample of the
attribute being measured by a test, the less the
relative effects of measurement error and the
more likely the sample will provide dependable,
consistent information. Consequently, a general
rule is that the longer the test, the larger the
test's reliability coefficient.
- The Spearman-Brown prophecy formula is most
associated with split-half reliability but can
actually be used whenever a test developer wants
to estimate the effects of lengthening or
shortening a test on its reliability coefficient.
For instance, if a 100-item test has a
reliability coefficient of .84, the
Spearman-Brown formula could be used to estimate
the effects of increasing the number of items to
150 or reducing the number to 50. A problem with
the Spearman-Brown formula is that it does not
always yield an accurate estimate of reliability:
in general, it tends to overestimate a test's
true reliability (Gay, 1992).
- This is most likely to be the case when the
added items do not measure the same content
domain as the original items and/or are more
susceptible to the effects of measurement error.
Note that, when the formula is used to correct
the split-half reliability coefficient, the
situation is more complex, and this
generalization does not always apply: when the
two halves are not equivalent in terms of their
means and standard deviations, the Spearman-Brown
formula may either over- or underestimate the
test's actual reliability.
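- A sketch of the general prophecy formula, applied to the 100-item example above (the function name is ours, not from the text):

```python
def spearman_brown(r_old: float, old_len: int, new_len: int) -> float:
    """Predicted reliability when a test is lengthened or shortened."""
    n = new_len / old_len                     # factor by which length changes
    return n * r_old / (1 + (n - 1) * r_old)

# A 100-item test with r = .84, lengthened to 150 or shortened to 50 items.
print(round(spearman_brown(.84, 100, 150), 2))  # 0.89
print(round(spearman_brown(.84, 100, 50), 2))   # 0.72
```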
- 2. Range of Test Scores: Since the reliability
coefficient is a correlation coefficient, it is
maximized when the range of scores is
unrestricted. The range is directly affected by
the degree of similarity of examinees with regard
to the attribute measured by the test: when
examinees are heterogeneous, the range of scores
is maximized. The range is also affected by the
difficulty level of the test items. When all
items are either very difficult or very easy, all
examinees will obtain either low or high scores,
resulting in a restricted range. Therefore, the
best strategy is to choose items so that the
average difficulty level is in the mid-range
(p = .50).
- 3. Guessing: A test's reliability coefficient is
also affected by the probability that examinees
can guess the correct answers to test items. As
the probability of correctly guessing answers
increases, the reliability coefficient decreases.
All other things being equal, a true/false test
will have a lower reliability coefficient than a
four-alternative multiple-choice test which, in
turn, will have a lower reliability coefficient
than a free recall test.
- The Interpretation of Reliability
- The interpretation of a test's reliability
entails considering its effects on the scores
achieved by a group of examinees as well as the
score obtained by a single examinee.
- 1. The Reliability Coefficient: As discussed
above, a reliability coefficient is interpreted
directly as the proportion of variability in a
set of test scores that is attributable to true
score variability. A reliability coefficient of
.84 indicates that 84% of the variability in test
scores is due to true score differences among
examinees, while the remaining 16% is due to
measurement error. While different types of tests
can be expected to have different levels of
reliability, for most tests, reliability
coefficients of .80 or larger are considered
acceptable.
- When interpreting a reliability coefficient, it
is important to keep in mind that there is no
single index of reliability for a given test.
Instead, a test's reliability coefficient can
vary from situation to situation and sample to
sample. Ability tests, for example, typically
have different reliability coefficients for
groups of individuals of different ages or
ability levels.
- 2. The Standard Error of Measurement: While the
reliability coefficient is useful for estimating
the proportion of true score variability in a set
of test scores, it is not particularly helpful
for interpreting an individual examinee's
obtained test score. When an examinee receives a
score of 80 on a 100-item test that has a
reliability coefficient of .84, for instance, we
can only conclude that, since the test is not
perfectly reliable, the examinee's obtained score
might or might not be his or her true score.
- A common practice when interpreting an
examinee's obtained score is to construct a
confidence interval around that score. The
confidence interval helps a test user estimate
the range within which an examinee's true score
is likely to fall given his or her obtained
score. This range is calculated using the
standard error of measurement, which is an index
of the amount of error that can be expected in
obtained scores due to the unreliability of the
test. (When raw scores have been converted to
percentile ranks, the confidence interval is
referred to as a percentile band.)
- The following formula is used to estimate the
standard error of measurement:
- Formula 1: Standard Error of Measurement
- SEmeas = SDx √(1 − rxx)
- where:
- SEmeas = standard error of measurement
- SDx = standard deviation of test scores
- rxx = reliability coefficient
- As shown by the formula, the magnitude of the
standard error is affected by two factors: the
standard deviation of the test scores and the
test's reliability coefficient. The lower the
test's standard deviation and the higher its
reliability coefficient, the smaller the standard
error of measurement (and vice versa).
- Because the standard error is a type of standard
deviation, it can be interpreted in terms of the
areas under the normal curve. With regard to
confidence intervals, this means that a 68%
confidence interval is constructed by adding and
subtracting one standard error to and from an
examinee's obtained score; a 95% confidence
interval is constructed by adding and subtracting
two standard errors; and a 99% confidence
interval is constructed by adding and subtracting
three standard errors.
- Example: The psychologist in Study 3
administers the interpersonal assertiveness test
to a sales applicant who receives a score of 80.
Since the test's reliability is less than 1.0,
the psychologist knows that this score might be
an imprecise estimate of the applicant's true
score and decides to use the standard error of
measurement to construct a 95% confidence
interval. Assuming that the test's reliability
coefficient is .84 and its standard deviation is
10, the standard error of measurement is equal to
4.0:
- SEmeas = SDx √(1 − rxx) = 10(1 − .84)^1/2 = 10(.4) = 4.0
- The psychologist constructs the 95% confidence
interval by adding and subtracting two standard
errors from the applicant's obtained score: 80 ±
2(4.0) = 72 to 88. This means that there is a 95%
chance that the applicant's true score falls
between 72 and 88.
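- The same computation as a short Python sketch (numbers taken from the example above):

```python
import math

def sem(sd: float, rxx: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - rxx)."""
    return sd * math.sqrt(1 - rxx)

error = sem(sd=10, rxx=.84)            # 10 * sqrt(.16) = 4.0
obtained = 80
low, high = obtained - 2 * error, obtained + 2 * error
print(f"SEmeas = {error:.1f}, 95% CI: {low:.0f} to {high:.0f}")  # 72 to 88
```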
- One problem with the standard error is that
measurement error is not usually equally
distributed throughout the range of test scores.
Use of the same standard error to construct
confidence intervals for all scores in a
distribution can, therefore, be somewhat
misleading. To overcome this problem, some test
manuals report different standard errors for
different score intervals.
- 3. Estimating True Scores from Obtained Scores:
As discussed above, because of the effects of
measurement error, obtained test scores tend to
be biased (inaccurate) estimates of true scores.
More specifically, scores above the mean of a
distribution tend to overestimate true scores,
while scores below the mean tend to underestimate
true scores. Moreover, the farther from the mean
an obtained score is, the greater this bias.
Rather than constructing a confidence interval,
an alternative (but less used) method for
interpreting an examinee's obtained test score is
to estimate his/her true score using a formula
that takes into account this bias by adjusting
the obtained score using the mean of the
distribution and the test's reliability
coefficient.
- For example, if an examinee obtains a score of
80 on a test that has a mean of 70 and a
reliability coefficient of .84, the formula
predicts that the examinee's true score is 78.4:
- T′ = (1 − rxx)(X̄) + (rxx)(X)
- T′ = (1 − .84)(70) + (.84)(80)
- T′ = (.16)(70) + (.84)(80) = 11.2 + 67.2 = 78.4
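- The same estimate as a one-function sketch:

```python
def estimated_true_score(x: float, mean: float, rxx: float) -> float:
    """Regress an obtained score toward the mean: T' = (1 - rxx)*mean + rxx*x."""
    return (1 - rxx) * mean + rxx * x

print(estimated_true_score(x=80, mean=70, rxx=.84))  # 78.4
```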
- 4. The Reliability of Difference Scores: A test
user is sometimes interested in comparing the
performance of an examinee on two different tests
or subtests and, therefore, computes a difference
score. An educational psychologist, for instance,
might calculate the difference between a child's
WISC-R Verbal and Performance IQ scores. When
doing so, it is important to keep in mind that
the reliability coefficient for the difference
scores can be no larger than the average of the
reliabilities of the two tests or subtests: if
Test A has a reliability coefficient of .95 and
Test B has a reliability coefficient of .85, this
means that difference scores calculated from the
two tests will have a reliability coefficient of
.90 or less. The exact size of the reliability
coefficient for difference scores depends on the
degree of correlation between the two tests: the
more highly correlated the tests, the smaller the
reliability coefficient (and the larger the
standard error of measurement).
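- This dependence on the intertest correlation follows from a standard psychometric formula that the text does not spell out; a sketch under that assumption:

```python
def difference_score_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of X - Y differences: ((r_xx + r_yy)/2 - r_xy) / (1 - r_xy)."""
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Tests A (.95) and B (.85): the ceiling, .90, is reached only when the
# two tests are uncorrelated; correlation drives the reliability down.
print(difference_score_reliability(.95, .85, 0.0))  # 0.90
print(difference_score_reliability(.95, .85, 0.6))  # 0.75
```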
- Validity
- Validity refers to a test's accuracy. A test is
valid when it measures what it is intended to
measure. The intended uses for most tests fall
into one of three categories, and each category
is associated with a different method for
establishing validity:
- The test is used to obtain information about an
examinee's familiarity with a particular content
or behavior domain: content validity.
- The test is administered to determine the extent
to which an examinee possesses a particular
hypothetical trait: construct validity.
- The test is used to estimate or predict an
examinee's standing or performance on an external
criterion: criterion-related validity.
- For some tests, it is necessary to demonstrate
only one type of validity; for others, it is
desirable to establish more than one type. For
example, if an arithmetic achievement test will
be used to assess the classroom learning of 8th
grade students, establishing the test's content
validity would be sufficient. If the same test
will be used to predict the performance of 8th
grade students in an advanced high school math
class, the test's content and criterion-related
validity will both be of concern.
- Note that, even when a test is found valid for a
particular purpose, it might not be valid for
that purpose for all people. It is quite possible
for a test to be a valid measure of intelligence
or a valid predictor of job performance for one
group of people but not for another group.
- Content Validity
- A test has content validity to the extent that
it adequately samples the content or behavior
domain that it is designed to measure. If test
items are not a good sample, results of testing
will be misleading. Although content validation
is sometimes used to establish the validity of
personality, aptitude, and attitude tests, it is
most associated with achievement-type tests that
measure knowledge of one or more content domains
and with tests designed to assess a well-defined
behavior domain. Adequate content validity would
be important for a statistics test and for a work
(job) sample test.
- Content validity is usually "built into" a test
as it is constructed through a systematic,
logical, and qualitative process that involves
clearly identifying the content or behavior
domain to be sampled and then writing or
selecting items that represent that domain. Once
a test has been developed, the establishment of
content validity relies primarily on the judgment
of subject matter experts. If experts agree that
test items are an adequate and representative
sample of the target domain, then the test is
said to have content validity.
- Although content validation depends mainly on
the judgment of experts, supplemental
quantitative evidence can be obtained. If a test
has adequate content validity, a coefficient of
internal consistency will be large; the test will
correlate highly with other tests that purport to
measure the same domain; and pre-/post-test
evaluations of a program designed to increase
familiarity with the domain will indicate
appropriate changes.
- Content validity must not be confused with face
validity. Content validity refers to the
systematic evaluation of a test by experts who
determine whether or not test items adequately
sample the relevant domain, while face validity
refers simply to whether or not a test "looks
like" it measures what it is intended to measure.
Although face validity is not an actual type of
validity, it is a desirable feature for many
tests. If a test lacks face validity, examinees
may not be motivated to respond to items in an
honest or accurate manner. A high degree of face
validity does not, however, indicate that a test
has content validity.
- Construct Validity
- When a test has been found to measure the
hypothetical trait (construct) it is intended to
measure, the test is said to have construct
validity. A construct is an abstract
characteristic that cannot be observed directly
but must be inferred by observing its effects.
Intelligence, mechanical aptitude, self-esteem,
and neuroticism are all constructs.
- There is no single way to establish a test's
construct validity. Instead, construct validation
entails a systematic accumulation of evidence
showing that the test actually measures the
construct it was designed to measure. The various
methods used to establish this type of validity
each answer a slightly different question about
the construct and include the following:
- Assessing the test's internal consistency: Do
scores on individual test items correlate highly
with the total test score (i.e., are all of the
test items measuring the same construct)?
- Studying group differences: Do scores on the test
accurately distinguish between people who are
known to have different levels of the construct?
- Conducting research to test hypotheses about the
construct: Do test scores change, following an
experimental manipulation, in the direction
predicted by the theory underlying the construct?
- Assessing the test's convergent and discriminant
validity: Does the test have high correlations
with measures of the same trait (convergent
validity) and low correlations with measures of
unrelated traits (discriminant validity)?
- Assessing the test's factorial validity: Does the
test have the factorial composition it would be
expected to have (i.e., does it have factorial
validity)?
- Construct validity is said to be the most
theory-laden of the methods of test validation.
The developer of a test designed to measure a
construct begins with a theory about the nature
of the construct, which then guides the test
developer in selecting test items and in choosing
the methods for establishing the test's validity.
For example, if the developer of a creativity
test believes that creativity is unrelated to
general intelligence, that creativity is an
innate characteristic that cannot be learned, and
that creative people can be expected to generate
more alternative solutions to certain types of
problems than non-creative people, she would want
to determine the correlation between scores on
the creativity test and a measure of
intelligence, see if a course in creativity
affects test scores, and find out if test scores
distinguish between people who differ in the
number of solutions they generate to relevant
problems.
- Note that some experts consider construct
validity to be the most basic form of validity
because the techniques involved in establishing
construct validity overlap those used to
determine if a test has content or
criterion-related validity. Indeed, Cronbach
argues that "all validation is one, and in a
sense all is construct validation."
- Construct Validity
- 1. Convergent and Discriminant Validity: As noted
above, one way to assess a test's construct
above, one way to assess a test's construct
validity is to correlate test scores with scores
on measures that do and do not purport to assess
the same trait. High correlations with measures
of the same trait provide evidence of the test's
convergent validity, while low correlations with
measures of unrelated characteristics provide
evidence of the test's discriminant (divergent)
validity.
- The multitrait-multimethod matrix (Campbell &
Fiske, 1959) is used to systematically organize
the data collected when assessing a test's
convergent and discriminant validity. The
multitrait-multimethod matrix is a table of
correlation coefficients, and, as its name
suggests, it provides information about the
degree of association between two or more traits
that have each been assessed using two or more
methods. When the correlations between different
methods measuring the same trait are larger than
the correlations between the same and different
methods measuring different traits, the matrix
provides evidence of the test's convergent and
discriminant validity.
- Example: To assess the construct validity of the
interpersonal assertiveness test, the
psychologist in Study 3 administers four
measures to a group of salespeople: (1) the
test of interpersonal assertiveness, (2) a
supervisor's rating of interpersonal
assertiveness, (3) a test of aggressiveness, and
(4) a supervisor's rating of aggressiveness. The
psychologist has the minimum data needed to
construct a multitrait-multimethod matrix: she
has measured two traits that she believes are
unrelated (assertiveness and aggressiveness), and
each trait has been measured by two different
methods (a test and a supervisor's rating). The
psychologist calculates correlation coefficients
for all possible pairs of scores on the four
measures and constructs the following
multitrait-multimethod matrix (the upper half of
the table has not been filled in because it would
simply duplicate the correlations in the lower
half).
- [Multitrait-multimethod matrix table omitted; the coefficients it contains are discussed below.]
- All multitrait-multimethod matrices contain four
types of correlation coefficients:
- Monotrait-monomethod coefficients ("same
trait-same method")
- Monotrait-heteromethod coefficients ("same
trait-different methods")
- Heterotrait-monomethod coefficients ("different
traits-same method")
- Heterotrait-heteromethod coefficients ("different
traits-different methods")
- 1. Monotrait-monomethod coefficients ("same
trait-same method"): The monotrait-monomethod
coefficients (coefficients in parentheses in the
above matrix) are reliability coefficients: they
indicate the correlation between a measure and
itself. Although these coefficients are not
directly relevant to a test's convergent and
discriminant validity, they should be large in
order for the matrix to provide useful
information.
- 2. Monotrait-heteromethod coefficients ("same
trait-different methods"): These coefficients
(coefficients in rectangles) indicate the
correlation between different measures of the
same trait. When these coefficients are large,
they provide evidence of convergent validity.
- 3. Heterotrait-monomethod coefficients
("different traits-same method"): These
coefficients (coefficients in ellipses) show the
correlation between different traits that have
been measured by the same method. When the
heterotrait-monomethod coefficients are small,
this indicates that a test has discriminant
validity.
- 4. Heterotrait-heteromethod coefficients
("different traits-different methods"): The
heterotrait-heteromethod coefficients (underlined
coefficients) indicate the correlation between
different traits that have been measured by
different methods. These coefficients also
provide evidence of discriminant validity when
they are small.
- Note that, in a multitrait-multimethod matrix,
only those correlation coefficients that include
the test that is being validated are actually of
interest. For the above example, the correlation
between the rating of interpersonal assertiveness
and the rating of aggressiveness (r = .16) is a
heterotrait-monomethod coefficient, but it isn't
of interest because it doesn't provide
information about the interpersonal assertiveness
test. Also, the number of correlation
coefficients that can provide evidence of
convergent and discriminant validity depends on
the number of measures included in the matrix. In
the example, only four measures were included
(the minimum number), but there could certainly
have been more.
- Example: Three of the correlations in the above
multitrait-multimethod matrix are relevant to the
construct validity of the interpersonal
assertiveness test. The correlation between the
assertiveness test and the assertiveness rating
(monotrait-heteromethod coefficient) is .71.
Since this is a relatively high correlation, it
suggests that the test has convergent validity.
The correlation between the assertiveness test
and the aggressiveness test
(heterotrait-monomethod coefficient) is .13, and
the correlation between the assertiveness test
and the aggressiveness rating
(heterotrait-heteromethod coefficient) is .04.
Because these two correlations are low, they
confirm that the assertiveness test has
discriminant validity. This pattern of
correlation coefficients confirms that the
assertiveness test has construct validity. Note
that the monotrait-monomethod coefficient for the
assertiveness test is .93, which indicates that
the test also has adequate reliability. (The
other correlations in the matrix are not relevant
to the psychologist's validation study because
they do not include the assertiveness test.)
- Construct Validity
- 2. Factor Analysis: Factor analysis is used for
several reasons, including identifying the
minimum number of common factors required to
account for the intercorrelations among a set of
tests or test items, evaluating a test's internal
consistency, and assessing a test's construct
validity. When factor analysis is used for the
latter purpose, a test is considered to have
construct (factorial) validity when it correlates
highly only with the factor(s) that it would be
expected to correlate with.
DESCRIPTIVE STATISTICS
- Descriptive Statistics
- Descriptive statistics are used to describe or
summarize a distribution (set) of data.
Descriptive techniques include:
- tables,
- graphs,
- measures of central tendency, and
- measures of variability.
- A set of data can be organized in a table known
as a frequency distribution. Frequency
distributions are constructed by summarizing the
data in terms of the number (frequency) of
observations in each category, score, or score
interval. In Study 1, the academic achievement
tests scores of 25 children with ADHD could be
summarized as shown in Table 1. The column
labeled "Frequency (f) indicates the number of
observations in each score interval Three of the
25 children received a score between 80 and 100,
while five received a score between 60 and 79.
- Table 1 also includes a "Cumulative Frequency
(cf)" column. The cumulative frequencies indicate
the total number of observations that fall at or
below each category or score. The information in
Table 1 indicates that 2 of the 25 children
received scores of 19 or below, 5 received scores
of 39 or below, and so on.
- Table 1
  Score     Frequency (f)   Cumulative Frequency (cf)
  80-100    3               25
  60-79     5               22
  40-59     12              17
  20-39     3               5
  0-19      2               2
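- Building the two columns of Table 1 from raw scores, as a sketch (the 25 scores below are hypothetical but consistent with the table):

```python
intervals = [(0, 19), (20, 39), (40, 59), (60, 79), (80, 100)]
scores = [5, 12, 25, 30, 38, 41, 44, 45, 47, 50, 52, 53, 55, 56, 57,
          58, 59, 61, 63, 70, 75, 78, 82, 90, 95]

cf = 0
for low, high in intervals:                    # lowest interval first
    f = sum(low <= s <= high for s in scores)  # frequency in the interval
    cf += f                                    # observations at or below it
    print(f"{low}-{high}: f = {f}, cf = {cf}")
```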
- The information presented in a table can also be
presented in a graph. Bar graphs, histograms, and
frequency polygons are three types of graphs. The
choice of a graph depends on the scale of
measurement: bar graphs are used when the data
represent a nominal or ordinal scale, while
histograms and frequency polygons are used with
interval or ratio data.
- Shapes of Distributions
- Normal curve (mean, mode, and median fall on the
same point)
- Leptokurtic distribution (more peaked than the
normal curve)
- Platykurtic distribution (flatter than the normal
curve)
- Positively skewed distribution (the tail is
extended to the positive side of the
distribution, i.e., most of the scores are on the
negative side): mode < median < mean
- Negatively skewed distribution (the opposite
characteristics of the positively skewed
distribution): mean < median < mode
- Measures of Central Tendency
- Mean: the arithmetic average.
- Mode: the most frequently occurring score(s).
- Median: the middle score.
- Measures of Variability
- Range: maximum score minus minimum score.
- Variance (mean square): s² = SS/(N − 1) = Σ(X − M)²/(N − 1).
The denominator is N − 1 for the sample variance
because the sample variance tends to
underestimate the population variance; once the
mean is known, one score is not free to vary.
- Standard deviation: computed by taking the
square root of the variance.
- Normal distribution: M ± 1 SD contains 68.26% of
scores, M ± 2 SD contains 95.44%, and M ± 3 SD
contains 99.72%.
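- A sketch of the variance and standard deviation computation (arbitrary scores):

```python
import numpy as np

scores = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mean = scores.mean()

ss = ((scores - mean) ** 2).sum()    # sum of squared deviations (SS)
variance = ss / (len(scores) - 1)    # sample variance: divide by N - 1
sd = variance ** 0.5                 # standard deviation

# Equivalent shortcuts: np.var(scores, ddof=1), np.std(scores, ddof=1)
print(mean, variance, sd)
```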
- Effects of Mathematical Operations on Measures
of Central Tendency and Variability
- Adding or subtracting a constant to/from every
score changes the measures of central tendency
but not the variability.
- Multiplying or dividing every score by a
constant changes both the measures of central
tendency and the variability.
INFERENTIAL STATISTICS
- Inferential Statistics
- While descriptive statistics are used to
summarize data, inferential statistics are used
to make inferences about a population based on
data collected from a sample drawn from that
population and to do so with a pre-defined degree
of confidence. In this section, the concept of
statistical inference is explained. In Section
IV, specific inferential statistical tests are
described.
- The Logic of Statistical Inference
- The techniques of statistical inference allow an
investigator to make inferences about the
relationships between variables in a population
based on relationships observed in a sample.
- For example, the psychologist in Study 1 will
want to determine if there is a relationship
between training in the self-control procedure
and scores on an academic achievement test for
all children who have received a diagnosis of
ADHD. Since the psychologist won't have access to
the entire population of children with this
disorder, he will evaluate the effects of the
self-control procedure on a sample of children
drawn from the target population. The
psychologist will then use an inferential
statistical test to analyze the data he collects
from the sample, and results of the test will
enable him to make an inference about the effects
of the procedure on the achievement test scores
for the population of children with ADHD.
Inferential statistical tests accomplish this
task through the use of a sampling distribution.
- Sampling Distributions
- 1. Population Parameters and Sample Statistics:
To understand inferential statistics, it is
necessary to first distinguish between sample
values and population values. As noted above,
when conducting a research study, an investigator
does not have access to the entire population of
interest but, instead, estimates population
values based on obtained sample values. In other
words, an investigator uses a sample statistic to
estimate a population parameter. Sample
statistics and population parameters are
designated with different symbols: for example,
the sample mean is symbolized M (or X̄), while the
population mean is μ (mu); the sample standard
deviation is SD (or s), while the population
standard deviation is σ (sigma); and a sample
correlation coefficient is r, while the
population correlation is ρ (rho).
- 2. Characteristics of Sampling Distributions: Due
to the effects of random (chance) factors, it is
unlikely that any sample will perfectly represent
the population from which it was drawn. As a
result, an estimate of a population parameter
from a sample statistic is always subject to some
inaccuracy. Because of the effects of sampling
error, sample statistics deviate from population
parameters and from statistics obtained from
other samples drawn from the same population.
- The relationship between sample statistics and a
population parameter can be described in terms of
a sampling distribution, which is a frequency
distribution of the means or other sample values
of a very large number of equal-sized samples
that have been randomly selected from the
population. Keep in mind that a sampling
distribution is not a distribution of individual
scores but a distribution of sample statistics. A
sampling distribution is important in inferential
statistics because it allows a researcher to
determine the probability that a sample having a
particular mean or other value could have been
drawn from a population with a known parameter.
- To better understand what a sampling
distribution is, assume that the psychologist in
Study 1 defines his population as "all children
in the 6th grade who have received a diagnosis of
ADHD," and, for that population, an academic
achievement test has a mean of 50 and a standard
deviation of 10. The psychologist repeatedly
selects random samples of 25 children from this
population and, for each sample, administers
the achievement test and calculates the mean
score. The psychologist has collected a set of
sample means and finds that, while some of the
sample means are equal to the population mean
(50), because of the effects of sampling error,
some means are larger than the population mean
and some are smaller. In fact, the psychologist
finds that his distribution of sample means, or
sampling distribution of the mean, resembles the
distribution depicted in Figure 7. As shown in
that figure, the sampling distribution of the
mean is normally shaped and its mean is equal to
the population mean of 50.
- Researchers do not actually construct a sampling
distribution of the mean by obtaining a large
number of samples and calculating each sample's
mean. Instead, they depend on probability theory
to tell them what a sampling distribution would
look like. The sampling distribution defined by
probability theory is called a theoretical
sampling distribution, and it is based on the
assumption that an infinite number of equal-sized
samples have been randomly drawn from the same
population.
- The characteristics of a sampling distribution
of the mean are specified by the Central Limit
Theorem, which makes the following predictions:
(a) regardless of the shape of the distribution
of individual scores in the population, as the
sample size increases, the sampling distribution
of the mean approaches a normal distribution;
(b) the mean of the sampling distribution of the
mean is equal to the population mean; and (c) the
standard deviation of the sampling distribution
of the mean is equal to the population standard
deviation divided by the square root of the
sample size:
- SEM = σ/√N
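- A quick simulation (a sketch using the Study 1 numbers) confirms these predictions: means of samples of N = 25 scores drawn from a population with mean 50 and SD 10 pile up around 50 with a standard deviation near 10/√25 = 2:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100,000 random samples of 25 scores each from the population.
sample_means = rng.normal(50, 10, size=(100_000, 25)).mean(axis=1)

print(sample_means.mean())       # ~50: mean of the sampling distribution
print(sample_means.std(ddof=1))  # ~2:  the standard error of the mean
```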
- The standard deviation of a sampling
distribution of the mean is known as the standard
error of the mean. It provides an estimate of the
extent to which the mean of any one sample
randomly drawn from a population can be expected
to vary from the population mean as the result of
sampling error. In other words, like other
standard deviations, it is a measure of
variability, but it is a measure of variability
that is due to the effects of random error. The
formula for the SEM indicates that the size of the
standard error of the mean is affected by the
population standard deviation and the sample size
(N): the larger the population standard deviation
and the smaller the sample size, the larger the
standard error, and vice versa.
- For the above example, the population standard
deviation for the achievement test is 10 and the
sample size is 25. Using the formula, we can
determine that the standard error of the mean in
this situation is equal to 2 (SEM = 10/√25 = 2).
- For Study 1, the Central Limit Theorem
predicts that the sampling distribution of the
mean is normally shaped, that its mean is equal
to 50, and that its standard deviation is equal
to 2.
- Note that, if the sample size had been 9 instead
of 25, the standard error would increase to 3.33
(10 divided by the square root of 9 = 10/3 =
3.33). In other words, the smaller the sample
size, the larger the standard error of the mean.
One implication of this is that the smaller the
size of the sample, the greater the probability
for error when using a sample statistic to
estimate a population parameter. Another
implication is that, for any given population,
there is a ''family'' of sampling distributions,
with a different distribution for each sample
size.
- Although this discussion of sampling
distributions has focused on the sampling
distribution of the mean, a sampling distribution
can actually be derived for any sample statistic.
A sampling distribution can be obtained for
standard deviations, proportions, correlation
coefficients, the difference between means, and
so on. In each case, the basic characteristics of
the sampling distribution are similar to those of
the sampling distribution of the mean.
- The sampling distribution is the foundation of
inferential statistics. It is the sampling
distribution that enables a researcher to make
inferences about the relationships between
variables in the population based on obtained
sample data. How this is done is described in the
next section.
- Analyzing the Data and Making a Decision
- After stating the null and alternative hypotheses
and collecting the sample data, an investigator
analyzes the data using an inferential
statistical test such as the t-test or analysis
of variance. The choice of a statistical test is
based on several factors including the scale of
measurement of the data to be analyzed.
- The inferential statistical test yields a t, an
F, or other value that indicates where the
obtained sample statistic falls in the
appropriate sampling distribution. That is, the
test indicates whether the statistic is in the
rejection region or the retention region of the
sampling distribution.
- The rejection region, or "region of unlikely
values," lies in one or both tails of the
sampling distribution and contains the sample
values that are most unlikely to occur simply as
the result of sampling error. (The rejection
region is also known as the critical region.)
- The retention region, or "region of likely
values," lies in the central portion of the
sampling distribution and consists of the values
that are likely to occur as a consequence of
sampling error only.
- When the results of the statistical test
indicate that the obtained sample statistic is in
the rejection region of the sampling
distribution, the null hypothesis is rejected and
the alternative hypothesis is retained. The
investigator concludes that the sample statistic
is not likely to have been obtained by chance
alone and that the independent variable has had
an effect on the dependent variable. Conversely,
if the statistical test indicates that the sample
statistic lies in the retention region of the
sampling distribution, the null hypothesis is
retained and the alternative hypothesis is
rejected. In this case, the investigator
concludes that the independent variable has not
had an effect and that any observed effect is due
to error.
- Example: In Study 1, if the children who receive
training in the self-control procedure obtain a
mean of 60 on the achievement test following
training, the psychologist would use an
inferential statistical test to determine whether
the mean of 60 is due to error or to the
procedure. If the results of the test indicate
that a mean of 60 is in the retention region of
the appropriate sampling distribution, the
psychologist will conclude that the procedure
does not have an effect and that the observed
effect simply reflects error. Conversely, if the
statistical test indicates that a mean of 60 is
in the rejection region, the psychologist will
conclude that the self-control procedure does, in
fact, have a beneficial effect on achievement
test scores.
- Alpha: The size of the rejection region is
defined by alpha (α), or the level of
significance. If alpha is .05, then 5% of the
sampling distribution represents the rejection
region and the remaining 95% represents the
retention region. The rejection region is always
placed in one or both tails of the sampling
distribution, that is, in the portion of the
sampling distribution that contains the values
that are least likely to occur as the result of
sampling error only. The value of alpha is set by
the experimenter prior to collecting and/or
analyzing the data. In other words, it is the
experimenter who decides what proportion of the
sampling distribution will represent the region
of unlikely values. In psychological research,
alpha is commonly set at .05 or .01.
- When the results of an inferential statistical
test indicate that the obtained sample statistic
lies in the rejection region of the sampling
distribution, the study's results are said to be
statistically significant. For example, when
alpha has been set at .05 and the statistical
test indicates that the sample value is in the
rejection region, the results of the study are
"significant at the .05 level."
- One- versus Two-Tailed Tests: Some inferential
statistical tests can be conducted as either a
one- or a two-tailed test. When a two-tailed test
is used, the rejection region is equally divided
between the two tails of the sampling
distribution. If alpha is set at .05, half of the
rejection region (2.5% of the distribution) lies
in the positive tail and the other half (2.5%)
lies in the negative tail. With a one-tailed
test, the entire rejection region is placed in
only one of the tails. The division of the
sampling distribution for one- and two-tailed
tests when alpha has been set at .05 is
illustrated in the following figure.
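- For a normally distributed test statistic, the cutoffs that bound these rejection regions can be computed directly (a sketch using scipy):

```python
from scipy.stats import norm

alpha = .05

# Two-tailed: alpha is split between the tails (.025 in each).
two_tailed_cutoff = norm.ppf(1 - alpha / 2)   # ~1.96

# One-tailed: the entire rejection region sits in one tail.
one_tailed_cutoff = norm.ppf(1 - alpha)       # ~1.645

print(two_tailed_cutoff, one_tailed_cutoff)
```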
- It is the alternative hypothesis that determines
whether a one- or a two-tailed test should be
conducted. A two-tailed test is used when the
alternative hypothesis is nondirectional, while a
one-tailed test is used when the alternative
hypothesis is directional. If a directional
alternative hypothesis predicts that the sample
statistic will be greater than the value
specified in the null hypothesis, the entire
rejection region lies in the positive tail of the
sampling distribution. If a directional
alternative hypothesis predicts that the sample
statistic will be less than the value specified
in the null hypothesis, the rejection region is
located in the negative tail.
- Decide, on the basis of the results of the
statistical test, whether to retain or reject the
statistical hypotheses.
- Decision Outcomes: Regardless of whether an
experimenter decides to retain or reject the null
hypothesis, there are two possible outcomes of
his or her decision: the decision can be either
correct or in error, and the experimenter can
never be entirely certain which type of decision
has been made.
- Decision Errors: There are two decision errors,
a Type I error and a Type II error. A Type I
error occurs when an investigator rejects a true
null hypothesis. For example, if the psychologist
in Study 1 conclud