Title: Generalizability Theory
1Generalizability Theory
2- The Big Questions . . .
- What is the fundamental difference between generalizability theory and other classical models?
- What are facets and how are they related to this fundamental difference?
- What is the difference between a G-Study and a D-Study?
- What is the difference between a relative decision and an absolute decision?
3The fundamental difference between generalizability theory and other classical models lies in how they handle measurement error.
4Generalizability theory is a statistical model
about the dependability of measurements. It
focuses on the accuracy of generalizing from an
observed score to the average score that a person
would have received under all acceptable testing
conditions. As in classical measurement theory,
the measured attribute is assumed to be a steady
state over measurement occasions so that
variability for an individual is due to error.
5Generalizability theory departs from classical
measurement theory in assuming that the error
component can be partitioned into multiple
sources so that the most serious sources of
inconsistency in responses over measurement
occasions can be identified. If the error
component can be successfully partitioned,
generalizability theory can forecast the
dependability of measurement in future
applications under a wide variety of conditions.
6In generalizability theory, any measurement is
assumed to be a sample from a universe of
admissible observations. These are observations
that are assumed to be interchangeable or
exchangeable. But, the interchangeability of
measures is a matter of degree and an empirical
question. In generalizability theory, potential
threats to interchangeability are identified and
tested to determine if they are important sources
of systematic error. These sources are known as
facets.
7Example: Three judges rate the creativity of essays written by college applicants. Can the ratings provided by any one judge be exchanged for those of any other judge and thus provide a good estimate of the true (universe) score? Judges are a facet of the measurement universe, and the importance of this potential source of error is tested.
Example: Are two different mazes exchangeable as measures of learning?
Example: Are three different cognitive tasks interchangeable as measures of a common ability?
8The issue of generalizability is important
because potential decision makers are probably
indifferent to the particular set of observations
(items, judges, times, etc.). Any random sample
should do (or so we hope). Any one might be
used. The basic issue is how well a particular
sample of observations from a universe of
admissible observations allows an accurate
generalization about a universe score.
9In traditional classical measurement theory error
is a nebulous catch-all category and conditions
of measurement are only vaguely defined. In
generalizability theory, careful definition of
the universe of admissible observations
identifies the potentially relevant sources of
error that can threaten the ability to generalize
from one measurement occasion to another or from
any to the universe score.
10The simplest one-facet completely crossed design has four potential sources of variability:
(a) Systematic variability corresponding to the characteristic of interest (the object of measurement)
(b) Systematic variability due to the facet
11The simplest one-facet completely crossed design has four potential sources of variability:
(c) The interaction between the object of measurement and the facet
(d) Random error and unaccounted-for systematic variability (i.e., unmeasured facets)
The last two sources of variability (c and d) cannot be separated.
12Overall rater differences: the tendency for some raters to give generally higher or lower ratings.
The object of measurement: the desirable source of variability.
Person x rater interaction: the tendency for some raters to rank order the objects differently than other raters.
13The magnitude of the three sources of variability
can be estimated and compared to make decisions
about adequacy of current measurement or the best
way to redesign a measure. Measures are
generalizable to the extent that variance due to
the object of measurement is large relative to
variance from the several sources of error.
14More than one source of systematic error can be
examined. Multiple-facet studies have more
variance components to be estimated. This
complicates matters a bit but that complexity is
more than offset by the potential gains that
generalizability theory provides in isolating
problems of measurement and guiding measurement
modification.
15The simple one-facet completely crossed design can be modified to include another crossed facet: occasions. Now there are seven sources of variability:
- People
- Raters
- Occasions
- People x Raters
- People x Occasions
- Raters x Occasions
- People x Raters x Occasions, error
16This design allows us to ask if ratings are generalizable across different raters and different occasions. If we added one more facet (time of day), how many sources of variability would there be in the People x Judges x Occasions x Time design? 4 main effects, 6 two-way interactions, 4 three-way interactions, and 1 four-way interaction (and error).
17If we have C items that can be combined, the number of ways that R of them can be combined is C! / (R!(C - R)!).
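As a quick check of the counting on the previous slide (an added illustration, not part of the original deck), the tallies of main effects and interactions are simply binomial coefficients:

```python
# Counting the effects in a fully crossed design: with 4 crossed factors
# (People x Judges x Occasions x Time) there are C(4, r) terms of order r,
# matching the 4 + 6 + 4 + 1 tally on the previous slide.
from math import comb

factors = ["People", "Judges", "Occasions", "Time"]
for r in range(1, len(factors) + 1):
    print(f"{r}-way terms: {comb(len(factors), r)}")
# 1-way terms: 4, 2-way terms: 6, 3-way terms: 4, 4-way terms: 1
```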
18- Identifying and correcting the sources of error (lack of generalizability) requires attention to four important distinctions:
- Crossed versus nested facets
- Random versus fixed effects
- Generalizability versus decision studies
- Relative versus absolute decisions
19The facets in generalizability theory need not be
crossed. Instead, facets can be nested. When they
are nested, then some sources of variability
cannot be determined independently--some sources
of variability are confounded.
20In this design, raters are nested within people. Each person was rated by a unique pair of raters. This might occur, for example, if parents were asked to provide ratings of their children's behavior.
21In this nested design, the variability for People can be estimated separately, but the variability due to Raters and the People x Raters interaction are confounded. They cannot be estimated separately.
(Design notation: r : p, raters nested within people.)
22- When there is more than one facet, they can be partially nested. In this design, the following sources of variability can be estimated:
- People
- Occasions
- People x Occasions
- Raters, Raters x Occasions (confounded)
- People x Raters, People x Raters x Occasions (confounded)
(Design notation: (r : o) x p, raters nested within occasions and crossed with people.)
23- In a completely nested design, the fewest sources of variability can be estimated:
- People
- Occasions, People x Occasions (confounded)
- Raters, Raters x Occasions, Raters x People, Raters x People x Occasions (confounded)
(Design notation: r : o : p, raters nested within occasions nested within people.)
24It is important to be able to accurately define
the design underlying a generalizability theory
study. The design defines the potential sources
of error. If the design is not described
accurately, the reliability inferences could be
incorrect.
25A second important distinction in
generalizability theory is between random and
fixed effects. Random effects occur when the
sample is small relative to the universe and
randomly selected or at least conceived of as
exchangeable with any other sample of the same
size. Fixed effects occur when the sample
exhausts the universe. Exchangeability is not an
issue for fixed effects.
26A third important distinction is between
generalizability (G) studies and decision (D)
studies. G studies map the universe of
admissible observations by identifying and
estimating the important facets that threaten
exchangeability. D studies explore how the
facets identified in G studies affect specific
applications. D studies take the information
generated in G studies and forecast the
implications for reliability in different
applications or contexts of measurement.
27A fourth distinction refers to the use of data in
a D study. Some decisions in measurement require
only that the relative positions of individuals
in the sample be known. In other decisions, the
absolute score and its position in relation to a
criterion or cut-off is important. Relative and
absolute decisions have different sources of
error.
28Consider first a relative decision. Here our goal
is to identify, for example, the top three
scorers in a sample of 5 people. Two raters make
independent judgments. What sources of
variability are present in these data? Which ones
make it difficult to find the three top scoring
people?
29What additional source of variability is present
in these data? For relative decisions, any source
of variance that interacts with the object of
measurement can potentially obscure the ability
to reliably rank order the objects of measurement.
30Now imagine that the purpose of measurement is to
decide if a person achieved a score that exceeded
some absolute criterion. What sources of
variability are present in these data? What
sources make it difficult to know if a person was
above or below the criterion?
31What additional source of variability is present
in these data? For an absolute decision, all
sources of variability except the object of
measurement can potentially obscure the ability
to reliably detect if a person was above or below
an absolute criterion.
32The type of design dictates what sources of error can be separately estimated. The type of decision determines what sources of error challenge the goals of measurement. Once the type of design and type of decision are specified, generalizability theory can produce reliability coefficients, called generalizability coefficients, that refer to the accuracy in generalizing from an observed score to a universe score or, equivalently, the proportion of observed score variance that is due to universe score variance.
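For the simple Persons x Items design, these coefficients take the following standard forms (standard G-theory expressions, added here for illustration; they are not shown on the original slide). For relative decisions,

$$E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e}/n_i}$$

and for absolute decisions,

$$\Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_i/n_i + \sigma^2_{pi,e}/n_i}$$

where n_i is the number of items used in the decision study.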
33The statistical model for generalizability theory
follows the logic of analysis of variance. The
variance in observations is partitioned into
separate effects and error. The major difference
in generalizability theory is that variability
due to people is explicitly of interest.
34Persons x Items data layout: Persons 1 through 5 in rows, Items 1 through k in columns; each cell contains a score Xpi (e.g., X11 for Person 1 on Item 1).
In this design, any given score, Xpi, can be thought of as a simple linear combination of multiple sources of variability.
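For the Persons x Items layout above, the usual G-theory decomposition of a single score (a standard result, supplied here because the slide presented the equation only as an image) is

$$X_{pi} = \mu + (\mu_p - \mu) + (\mu_i - \mu) + (X_{pi} - \mu_p - \mu_i + \mu)$$

that is, the grand mean, the person (universe score) effect, the item effect, and a residual in which the person x item interaction is confounded with error.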
35Each element in the linear combination (except the grand mean, μ) has a population distribution. This means that the variance of the scores can be represented as the linear combination of several variances. If these sources of variance can be estimated, their magnitude can be compared to determine the major contributors to observed score variance. For a reliable measure, universe score variability (σ²p) should be large and the other sources (potential sources of error) should be small.
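In the same notation, the corresponding variance decomposition (standard form, not reproduced from the slide) is

$$\sigma^2(X_{pi}) = \sigma^2_p + \sigma^2_i + \sigma^2_{pi,e}$$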
36Estimating the separate variance components requires analysis of variance, which begins with the partitioning of the total sum of squares into separate sums of squares due to each source of variability:

SST = SSp + SSi + SSpi
37Each sum of squares is the numerator of a
variance estimate. Dividing by degrees of freedom
provides the corresponding mean squares that are
used in ANOVA for forming tests of significance
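For the one-facet Persons x Items design, those divisions are (standard ANOVA results, added here for completeness)

$$MS_p = \frac{SS_p}{n_p - 1}, \qquad MS_i = \frac{SS_i}{n_i - 1}, \qquad MS_{pi} = \frac{SS_{pi}}{(n_p - 1)(n_i - 1)}$$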
38The mean squares are variance estimates, but they are not directly estimates of the variance components that are needed in generalizability theory. Mean squares are linear combinations of variance components.
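For the random-effects Persons x Items design with one observation per cell, the expected mean squares take the standard form (an added illustration, not from the slide)

$$E(MS_p) = \sigma^2_{pi,e} + n_i\,\sigma^2_p, \qquad E(MS_i) = \sigma^2_{pi,e} + n_p\,\sigma^2_i, \qquad E(MS_{pi}) = \sigma^2_{pi,e}$$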
39The expected mean squares show how the mean
squares can be manipulated to isolate the
variance components needed in generalizability
theory.
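Solving those expectations for the variance components gives the usual estimators (a standard rearrangement, shown here as an illustration)

$$\hat\sigma^2_{pi,e} = MS_{pi}, \qquad \hat\sigma^2_p = \frac{MS_p - MS_{pi}}{n_i}, \qquad \hat\sigma^2_i = \frac{MS_i - MS_{pi}}{n_p}$$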
40The expected mean squares are used in ANOVA to determine an appropriate F ratio for testing hypotheses. For example, the test of whether there are mean differences across items is formed by F = MSi / MSpi. Under the null hypothesis of no differences in means for items, this ratio will approach 1.00.
41We are not usually interested in calculating F
ratios in generalizability theory. Instead, the
mean squares are used to estimate the variance
components, which then become the basic units in
G studies and D studies.
42More complex designs have more sources of
variance. For example, if multiple judges rated
the aggressiveness of children on the playground
on two occasions, there would be two crossed
facets. The design would be a People x Judges x
Occasions design, with 7 different sources of
variance to estimate. An ANOVA would be used to
estimate the mean squares.
43The expected values of those mean squares would
then be used to generate the variance components.
44The mean squares are then used to get the
variance components.
45 Ratings by three judges on two occasions (AM, PM):

Person      Judge 1     Judge 2     Judge 3
            AM    PM    AM    PM    AM    PM
Person 1     2     3     1     3     3     5
Person 2     1     2     2     4     4     6
Person 3     2     3     2     4     5     4
Person 4     3     4     3     3     4     6
Person 5     4     5     3     5     5     7
Person 6     4     6     3     3     5     4
Person 7     3     7     4     6     6     7
Person 8     4     7     4     6     5     6
Person 9     3     5     4     7     3     7
Person 10    4     4     4     5     4     4
Person 11    3     5     3     4     5     5
Person 12    3     4     3     2     3     5
Person 13    3     3     2     4     1     2
Person 14    1     2     2     3     2     4
Person 15    2     3     1     2     3     3
Mean      2.80  4.20  2.73  4.07  3.87  5.00
Var.      1.03  2.60  1.07  2.21  1.84  2.29
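The original deck does not transcribe the computation itself, so the following is a minimal numpy sketch (my own illustration) of how the seven variance components for this People x Judges x Occasions design could be estimated from the table above. The reshaping assumes the column order shown (Judge 1 AM/PM, Judge 2 AM/PM, Judge 3 AM/PM); in practice, small negative estimates are conventionally set to zero.

```python
# A minimal numpy sketch (added illustration, not from the original slides) of
# estimating the seven variance components for the fully crossed
# People x Judges x Occasions design, using the ratings tabled above.
import numpy as np

# Rows = persons; columns = (J1-AM, J1-PM, J2-AM, J2-PM, J3-AM, J3-PM).
raw = np.array([
    [2, 3, 1, 3, 3, 5], [1, 2, 2, 4, 4, 6], [2, 3, 2, 4, 5, 4],
    [3, 4, 3, 3, 4, 6], [4, 5, 3, 5, 5, 7], [4, 6, 3, 3, 5, 4],
    [3, 7, 4, 6, 6, 7], [4, 7, 4, 6, 5, 6], [3, 5, 4, 7, 3, 7],
    [4, 4, 4, 5, 4, 4], [3, 5, 3, 4, 5, 5], [3, 4, 3, 2, 3, 5],
    [3, 3, 2, 4, 1, 2], [1, 2, 2, 3, 2, 4], [2, 3, 1, 2, 3, 3],
], dtype=float)

X = raw.reshape(15, 3, 2)                  # persons x judges x occasions
n_p, n_j, n_o = X.shape
mu = X.mean()

# Marginal and two-way means.
m_p, m_j, m_o = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=(0, 1))
m_pj, m_po, m_jo = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

# Sums of squares for each effect (one observation per cell).
ss = {
    "p":  n_j * n_o * ((m_p - mu) ** 2).sum(),
    "j":  n_p * n_o * ((m_j - mu) ** 2).sum(),
    "o":  n_p * n_j * ((m_o - mu) ** 2).sum(),
    "pj": n_o * ((m_pj - m_p[:, None] - m_j[None, :] + mu) ** 2).sum(),
    "po": n_j * ((m_po - m_p[:, None] - m_o[None, :] + mu) ** 2).sum(),
    "jo": n_p * ((m_jo - m_j[:, None] - m_o[None, :] + mu) ** 2).sum(),
}
ss["pjo"] = ((X - mu) ** 2).sum() - sum(ss.values())   # pjo interaction + error

# Mean squares = SS / df.
df = {"p": n_p - 1, "j": n_j - 1, "o": n_o - 1,
      "pj": (n_p - 1) * (n_j - 1), "po": (n_p - 1) * (n_o - 1),
      "jo": (n_j - 1) * (n_o - 1), "pjo": (n_p - 1) * (n_j - 1) * (n_o - 1)}
ms = {k: ss[k] / df[k] for k in ss}

# Variance components from the random-model expected mean squares.
var = {
    "pjo,e": ms["pjo"],
    "pj": (ms["pj"] - ms["pjo"]) / n_o,
    "po": (ms["po"] - ms["pjo"]) / n_j,
    "jo": (ms["jo"] - ms["pjo"]) / n_p,
    "p":  (ms["p"] - ms["pj"] - ms["po"] + ms["pjo"]) / (n_j * n_o),
    "j":  (ms["j"] - ms["pj"] - ms["jo"] + ms["pjo"]) / (n_p * n_o),
    "o":  (ms["o"] - ms["po"] - ms["jo"] + ms["pjo"]) / (n_p * n_j),
}

for name, value in var.items():
    print(f"sigma^2_{name}: {value:.3f}")
```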
47If the design had been partially nested, fewer variance components would be estimated. For example, assume that the judges who make ratings in the morning are not the same judges who make ratings in the afternoon. The design is then (Judges : Occasions) x People, with judges nested within occasions.
48Fewer sources of variance can be estimated
because some of the sources in the completely
crossed design are now confounded.
49As before, the variance components are calculated
by manipulating the mean squares from the ANOVA.
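Under the random model, the estimable components for the (Judges : Occasions) x People design follow from the mean squares in the usual way (a standard ANOVA derivation, added here; j:o and pj:o denote the confounded sets described above)

$$\hat\sigma^2_{pj:o,e} = MS_{pj:o}, \qquad \hat\sigma^2_{po} = \frac{MS_{po} - MS_{pj:o}}{n_j}, \qquad \hat\sigma^2_{j:o} = \frac{MS_{j:o} - MS_{pj:o}}{n_p}$$

$$\hat\sigma^2_{p} = \frac{MS_p - MS_{po}}{n_j n_o}, \qquad \hat\sigma^2_{o} = \frac{MS_o - MS_{j:o} - MS_{po} + MS_{pj:o}}{n_p n_j}$$

where n_j is the number of judges per occasion.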
51- If one of the facets is fixed, then it makes no sense to speak of generalizing from a sample of facet levels to the universe of admissible facet levels: all facet levels are already present.
- Two approaches can be taken to handle fixed effects:
- An averaging approach
- Separate estimation of variance components within levels of the fixed facet
52If occasion is fixed, then the averaging approach calculates the variance components by averaging over the levels of the fixed facet, as sketched below.
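A standard presentation of the averaging approach (my reconstruction, assuming the People x Raters x Occasions design with occasions fixed at n_o levels) folds the interactions with the fixed facet into the remaining components:

$$\sigma^{2*}_{p} = \sigma^2_p + \frac{\sigma^2_{po}}{n_o}, \qquad \sigma^{2*}_{r} = \sigma^2_r + \frac{\sigma^2_{ro}}{n_o}, \qquad \sigma^{2*}_{pr,e} = \sigma^2_{pr} + \frac{\sigma^2_{pro,e}}{n_o}$$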
54What kind of design is the cartoon rating task?
55- The Big Questions . . .
- What is the fundamental difference between generalizability theory and other classical models?
- What are facets and how are they related to this fundamental difference?
- What is the difference between a G-Study and a D-Study?
- What is the difference between a relative decision and an absolute decision?
56Next up . . . Once the variance components are
estimated, D studies can be conducted to explore
the implications for using the measure in
different designs and for different kinds of
decisions. This expanded parallel to the
Spearman-Brown formula provides considerable
power to tailor measurement to fit specific
applications.