Generalizability Theory - PowerPoint PPT Presentation

Title: Generalizability Theory
Slides: 57
Provided by: michael1175

Transcript and Presenter's Notes
1
Generalizability Theory
2
  • The Big Questions . . .
  • What is the fundamental difference between
    generalizability theory and other classical
    models?
  • What are facets and how are they related to this
    fundamental difference?
  • What is the difference between a G-Study and a
    D-Study?
  • What is the difference between a relative
    decision and an absolute decision?

3
The fundamental difference between
generalizability theory and other classical
models is in how they handle measurement error.
4
Generalizability theory is a statistical model
about the dependability of measurements. It
focuses on the accuracy of generalizing from an
observed score to the average score that a person
would have received under all acceptable testing
conditions. As in classical measurement theory,
the measured attribute is assumed to be a steady
state over measurement occasions so that
variability for an individual is due to error.
5
Generalizability theory departs from classical
measurement theory in assuming that the error
component can be partitioned into multiple
sources so that the most serious sources of
inconsistency in responses over measurement
occasions can be identified. If the error
component can be successfully partitioned,
generalizability theory can forecast the
dependability of measurement in future
applications under a wide variety of conditions.
6
In generalizability theory, any measurement is
assumed to be a sample from a universe of
admissible observations. These are observations
that are assumed to be interchangeable or
exchangeable. But, the interchangeability of
measures is a matter of degree and an empirical
question. In generalizability theory, potential
threats to interchangeability are identified and
tested to determine if they are important sources
of systematic error. These sources are known as
facets.
7
Example: Three judges rate the creativity of
essays written by college applicants. Can the
ratings provided by any one judge be exchanged
for those of any other judge and thus provide a
good estimate of the true (universe) score?
Judges are a facet of the measurement universe,
and the importance of this potential source of
error is tested.
Example: Are two different mazes exchangeable as
measures of learning?
Example: Are three different cognitive tasks
interchangeable as measures of a common ability?
8
The issue of generalizability is important
because potential decision makers are probably
indifferent to the particular set of observations
(items, judges, times, etc.). Any random sample
should do (or so we hope). Any one might be
used. The basic issue is how well a particular
sample of observations from a universe of
admissible observations allows an accurate
generalization about a universe score.
9
In traditional classical measurement theory error
is a nebulous catch-all category and conditions
of measurement are only vaguely defined. In
generalizability theory, careful definition of
the universe of admissible observations
identifies the potentially relevant sources of
error that can threaten the ability to generalize
from one measurement occasion to another or from
any to the universe score.
10
The simplest one-facet completely crossed design
has four potential sources of variability
  • (a) Systematic variability corresponding to the
    characteristic of interest (the object of
    measurement).
  • (b) Systematic variability due to the facet.

11
The simplest one-facet completely crossed design
has four potential sources of variability
  • (c) The interaction between the object of
    measurement and the facet.
  • (d) Random error and unaccounted-for systematic
    variability (i.e., unmeasured facets).

The last two sources of variability (c and d)
cannot be separated.
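The four sources above can be made concrete by simulating a one-facet crossed Person x Rater design. This is a minimal sketch, not part of the presentation; the distributions and magnitudes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_raters = 100, 5

mu = 50.0                                    # grand mean
person = rng.normal(0, 4.0, n_persons)       # (a) object of measurement
rater = rng.normal(0, 1.0, n_raters)         # (b) systematic facet effect
resid = rng.normal(0, 2.0, (n_persons, n_raters))  # (c) + (d): interaction
                                             # confounded with random error

# Each observed score is a simple linear combination of the sources.
scores = mu + person[:, None] + rater[None, :] + resid

print(scores.shape)  # (100, 5)
```

Because (c) and (d) are generated in a single term, the simulation also shows why they cannot be separated: with one observation per person-rater cell there is no way to tell them apart.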
12
Overall rater differences: the tendency for some
raters to give generally higher or lower ratings.
The object of measurement: the desirable source
of variability.
Person x Rater interaction: the tendency for some
raters to rank order the objects differently than
other raters.
13
The magnitude of the three sources of variability
can be estimated and compared to make decisions
about adequacy of current measurement or the best
way to redesign a measure. Measures are
generalizable to the extent that variance due to
the object of measurement is large relative to
variance from the several sources of error.
14
More than one source of systematic error can be
examined. Multiple-facet studies have more
variance components to be estimated. This
complicates matters a bit but that complexity is
more than offset by the potential gains that
generalizability theory provides in isolating
problems of measurement and guiding measurement
modification.
15
The simple one-facet completely crossed design
can be modified to include another crossed
facet: occasions.
Now there are seven sources of variability:
People, Raters, Occasions, People x Raters,
People x Occasions, Raters x Occasions, and
People x Raters x Occasions (confounded with
error).
16
This design allows us to ask if ratings are
generalizable across different raters and
different occasions. If we added one more
facet (time of day), how many sources of
variability would there be in the People x Judges
x Occasions x Time design? 4 main effects, 6
two-way interactions, 4 three-way interactions,
and 1 four-way interaction (and error): 15
sources in all.
17
If we have C items that can be combined, the
number of ways that R of them can be combined is
C! / (R! (C - R)!).
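With C = 4 crossed factors (People x Judges x Occasions x Time), this formula reproduces the counts on the previous slide. A quick check using only the Python standard library:

```python
from math import comb

# Number of ways to choose R of C = 4 crossed sources of variability.
C = 4
effects = {R: comb(C, R) for R in range(1, C + 1)}
print(effects)  # {1: 4, 2: 6, 3: 4, 4: 1} -> 15 sources, plus error
```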
18
  • Identifying and correcting the sources of error
    (lack of generalizability) requires attention to
    four important distinctions
  • Crossed versus nested facets
  • Random versus fixed effects
  • Generalizability versus decision studies
  • Relative versus absolute decisions

19
The facets in generalizability theory need not be
crossed. Instead, facets can be nested. When they
are nested, some sources of variability cannot be
determined independently; some sources of
variability are confounded.
20
In this design, raters are nested within people.
Each person was rated by a unique pair of raters.
This might occur, for example, if parents were
asked to provide ratings of their children's
behavior.
21
In this nested design, the variability for People
can be estimated separately, but the variability
due to Raters and the People x Raters interaction
are confounded. They cannot be estimated
separately.
r:p
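The confounding in the r:p design can be shown numerically. This sketch uses an invented data set of 4 people, each rated by their own pair of raters; only two mean squares exist, so only two quantities can be estimated.

```python
import numpy as np

# Raters nested within people (r:p): each row is one person, rated by
# a unique pair of raters. All numbers are invented for illustration.
ratings = np.array([[4.0, 5.0],   # person 1, raters 1 and 2
                    [2.0, 2.0],   # person 2, raters 3 and 4
                    [5.0, 3.0],   # person 3, raters 5 and 6
                    [3.0, 4.0]])  # person 4, raters 7 and 8
n_p, n_r = ratings.shape

grand = ratings.mean()
person_m = ratings.mean(axis=1)

# Only two sums of squares: between persons, and raters-within-persons.
# The latter lumps the rater effect with the Person x Rater
# interaction -- they are confounded in this design.
ss_p = n_r * ((person_m - grand) ** 2).sum()
ss_r_within_p = ((ratings - person_m[:, None]) ** 2).sum()

ms_p = ss_p / (n_p - 1)
ms_r_within_p = ss_r_within_p / (n_p * (n_r - 1))

var_r_pr = ms_r_within_p                  # confounded rater + interaction
var_p = (ms_p - ms_r_within_p) / n_r      # person (universe score) variance
print(var_p, var_r_pr)
```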
22
  • When there is more than one facet, they can be
    partially nested. In this design, the following
    sources of variability can be estimated:
  • People
  • Occasions
  • People x Occasions
  • Raters, Raters x Occasions (confounded)
  • People x Raters, People x Raters x Occasions
    (confounded)

(r:o) x p
23
  • In a completely nested design, the fewest sources
    of variability can be estimated:
  • People
  • Occasions, People x Occasions (confounded)
  • Raters, Raters x Occasions, Raters x People,
    Raters x People x Occasions (confounded)

r:o:p
24
It is important to be able to accurately define
the design underlying a generalizability theory
study. The design defines the potential sources
of error. If the design is not described
accurately, the reliability inferences could be
incorrect.
25
A second important distinction in
generalizability theory is between random and
fixed effects. Random effects occur when the
sample is small relative to the universe and
randomly selected or at least conceived of as
exchangeable with any other sample of the same
size. Fixed effects occur when the sample
exhausts the universe. Exchangeability is not an
issue for fixed effects.
26
A third important distinction is between
generalizability (G) studies and decision (D)
studies. G studies map the universe of
admissible observations by identifying and
estimating the important facets that threaten
exchangeability. D studies explore how the
facets identified in G studies affect specific
applications. D studies take the information
generated in G studies and forecast the
implications for reliability in different
applications or contexts of measurement.
27
A fourth distinction refers to the use of data in
a D study. Some decisions in measurement require
only that the relative positions of individuals
in the sample be known. In other decisions, the
absolute score and its position in relation to a
criterion or cut-off is important. Relative and
absolute decisions have different sources of
error.
28
Consider first a relative decision. Here our goal
is to identify, for example, the top three
scorers in a sample of 5 people. Two raters make
independent judgments. What sources of
variability are present in these data? Which ones
make it difficult to find the three top scoring
people?
29
What additional source of variability is present
in these data? For relative decisions, any source
of variance that interacts with the object of
measurement can potentially obscure the ability
to reliably rank order the objects of measurement.
30
Now imagine that the purpose of measurement is to
decide if a person achieved a score that exceeded
some absolute criterion. What sources of
variability are present in these data? What
sources make it difficult to know if a person was
above or below the criterion?
31
What additional source of variability is present
in these data? For an absolute decision, all
sources of variability except the object of
measurement can potentially obscure the ability
to reliably detect if a person was above or below
an absolute criterion.
32
The type of design dictates what sources of error
can be separately estimated. The type of decision
determines what sources of error challenge the
goals of measurement. Once the type of design and
type of decision are specified, generalizability
theory can produce reliability coefficients,
called generalizability coefficients, that refer
to the accuracy in generalizing from an observed
score to a universe score, or, the proportion of
observed score variance that is due to universe
score variance.
33
The statistical model for generalizability theory
follows the logic of analysis of variance. The
variance in observations is partitioned into
separate effects and error. The major difference
in generalizability theory is that variability
due to people is explicitly of interest.
34
[Data layout: a Persons x Items table, with rows
Person 1 to Person 5 and columns Item 1 to Item
k; each cell holds a score Xpi (e.g., X11).]
In this design, any given score, Xpi, can be
thought of as a simple linear combination of
multiple sources of variability:
Xpi = m + (mp - m) + (mi - m) + (Xpi - mp - mi + m)
where m is the grand mean, mp is the person mean,
and mi is the item mean.
35
Each element in the linear combination (except m)
has a population distribution. This means that
the variance of the scores can be represented as
the linear combination of several variances:
s2(Xpi) = s2p + s2i + s2pi,e
If these sources of variance can be estimated,
their magnitude can be compared to determine the
major contributors to observed score variance.
For a reliable measure, universe score
variability (s2p) should be large and the other
sources (potential sources of error) should be
small.
36
Estimating the separate variance components
requires analysis of variance, which begins with
the partitioning of the total sum of squares into
separate sums of squares due to each source of
variability:
SST = SSp + SSi + SSpi
37
Each sum of squares is the numerator of a
variance estimate. Dividing by degrees of freedom
provides the corresponding mean squares that are
used in ANOVA for forming tests of significance:
MSp = SSp / (np - 1)
MSi = SSi / (ni - 1)
MSpi = SSpi / [(np - 1)(ni - 1)]
38
The mean squares are variance estimates, but they
are not directly estimates of the variance
components that are needed in generalizability
theory. Mean squares are linear combinations of
variance components:
E(MSp) = s2pi,e + ni s2p
E(MSi) = s2pi,e + np s2i
E(MSpi) = s2pi,e
39
The expected mean squares show how the mean
squares can be manipulated to isolate the
variance components needed in generalizability
theory:
s2pi,e = MSpi
s2p = (MSp - MSpi) / ni
s2i = (MSi - MSpi) / np
40
The expected mean squares are used in ANOVA to
determine an appropriate F ratio for testing
hypotheses. For example, the test of whether
there are mean differences across items is formed
by F = MSi / MSpi.
Under the null hypothesis of no differences in
means for items, this ratio will approach 1.00.
41
We are not usually interested in calculating F
ratios in generalizability theory. Instead, the
mean squares are used to estimate the variance
components, which then become the basic units in
G studies and D studies.
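The estimation chain on the preceding slides (sums of squares, then mean squares, then variance components) can be sketched for a one-facet Person x Item design. The data matrix here is invented for illustration; the formulas are the standard random-effects ones just described.

```python
import numpy as np

# Invented Person x Item data: rows = persons, columns = items,
# one observation per cell.
X = np.array([[7, 6, 5],
              [9, 8, 8],
              [4, 3, 2],
              [6, 6, 4]], dtype=float)
n_p, n_i = X.shape

grand = X.mean()
row_m = X.mean(axis=1)   # person means
col_m = X.mean(axis=0)   # item means

# Partition the total sum of squares.
ss_p = n_i * ((row_m - grand) ** 2).sum()
ss_i = n_p * ((col_m - grand) ** 2).sum()
ss_pi = ((X - row_m[:, None] - col_m[None, :] + grand) ** 2).sum()

# Divide by degrees of freedom to get the mean squares.
ms_p = ss_p / (n_p - 1)
ms_i = ss_i / (n_i - 1)
ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

# Solve the expected mean square equations for the components.
var_pi = ms_pi                    # interaction confounded with error
var_p = (ms_p - ms_pi) / n_i      # universe score variance
var_i = (ms_i - ms_pi) / n_p      # item (facet) variance
print(var_p, var_i, var_pi)
```

For a reliable measure we would want var_p to dominate the two error components, exactly as the earlier slide on s2p describes.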
42
More complex designs have more sources of
variance. For example, if multiple judges rated
the aggressiveness of children on the playground
on two occasions, there would be two crossed
facets. The design would be a People x Judges x
Occasions design, with 7 different sources of
variance to estimate. An ANOVA would be used to
estimate the mean squares.
43
The expected values of those mean squares would
then be used to generate the variance components.
44
The mean squares are then used to get the
variance components.
45
Person      Judge 1     Judge 2     Judge 3
            AM    PM    AM    PM    AM    PM
Person 1     2     3     1     3     3     5
Person 2     1     2     2     4     4     6
Person 3     2     3     2     4     5     4
Person 4     3     4     3     3     4     6
Person 5     4     5     3     5     5     7
Person 6     4     6     3     3     5     4
Person 7     3     7     4     6     6     7
Person 8     4     7     4     6     5     6
Person 9     3     5     4     7     3     7
Person 10    4     4     4     5     4     4
Person 11    3     5     3     4     5     5
Person 12    3     4     3     2     3     5
Person 13    3     3     2     4     1     2
Person 14    1     2     2     3     2     4
Person 15    2     3     1     2     3     3
Mean      2.80  4.20  2.73  4.07  3.87  5.00
Var.      1.03  2.60  1.07  2.21  1.84  2.29
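Entered as a matrix, the slide's data can be checked directly. This sketch reproduces the Mean and Var. rows above with NumPy (sample variances, ddof=1).

```python
import numpy as np

# The Person x (Judge, Occasion) ratings from the slide; columns are
# (Judge 1 AM, Judge 1 PM, Judge 2 AM, Judge 2 PM, Judge 3 AM, Judge 3 PM).
ratings = np.array([
    [2, 3, 1, 3, 3, 5],
    [1, 2, 2, 4, 4, 6],
    [2, 3, 2, 4, 5, 4],
    [3, 4, 3, 3, 4, 6],
    [4, 5, 3, 5, 5, 7],
    [4, 6, 3, 3, 5, 4],
    [3, 7, 4, 6, 6, 7],
    [4, 7, 4, 6, 5, 6],
    [3, 5, 4, 7, 3, 7],
    [4, 4, 4, 5, 4, 4],
    [3, 5, 3, 4, 5, 5],
    [3, 4, 3, 2, 3, 5],
    [3, 3, 2, 4, 1, 2],
    [1, 2, 2, 3, 2, 4],
    [2, 3, 1, 2, 3, 3],
], dtype=float)

means = ratings.mean(axis=0)              # matches the Mean row
variances = ratings.var(axis=0, ddof=1)   # matches the Var. row
print(means.round(2))
print(variances.round(2))
```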
47
If the design had been partially nested, fewer
variance components would be estimated. For
example, assume that the judges who make ratings
in the morning are not the same judges who make
ratings in the afternoon. The design is then
(Judges:Occasions) x People.
48
Fewer sources of variance can be estimated
because some of the sources in the completely
crossed design are now confounded.
49
As before, the variance components are calculated
by manipulating the mean squares from the ANOVA.
51
  • If one of the facets is fixed, then it makes no
    sense to speak of generalizing from a sample of
    facet levels to the universe of admissible facet
    levels (all facet levels are already present).
  • Two approaches can be taken to handle fixed
    effects:
  • An averaging approach
  • Separate estimation of variance components within
    levels of the fixed facet
52
If occasion is fixed, then the averaging approach
calculates the variance components by averaging
over the occasion levels; for example, the person
component becomes s2p + s2po / no, where no is
the number of occasions.
54
What kind of design is the cartoon rating task?
55
  • The Big Questions . . .
  • What is the fundamental difference between
    generalizability theory and other classical
    models?
  • What are facets and how are they related to this
    fundamental difference?
  • What is the difference between a G-Study and a
    D-Study?
  • What is the difference between a relative
    decision and an absolute decision?

56
Next up . . . Once the variance components are
estimated, D studies can be conducted to explore
the implications for using the measure in
different designs and for different kinds of
decisions. This expanded parallel to the
Spearman-Brown formula provides considerable
power to tailor measurement to fit specific
applications.
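The D-study forecast described here can be sketched for a one-facet Person x Rater design. The variance components below are invented illustrative values, not results from the presentation; the point is how the coefficient grows with the number of raters, paralleling the Spearman-Brown formula, and how relative and absolute decisions use different error terms.

```python
# Invented G-study variance components: person (universe score),
# rater, and interaction-confounded-with-error.
var_p, var_r, var_pr = 3.0, 0.5, 1.5

def g_coefficient(n_raters, absolute=False):
    """Forecast the generalizability coefficient when averaging over
    n_raters. Relative decisions count only sources that interact
    with persons; absolute decisions also count the rater main effect."""
    error = var_pr / n_raters
    if absolute:
        error += var_r / n_raters
    return var_p / (var_p + error)

for n in (1, 2, 5, 10):
    print(n, round(g_coefficient(n), 3),
          round(g_coefficient(n, absolute=True), 3))
```

As expected, the absolute coefficient is never larger than the relative one, and both rise toward 1.0 as raters are added.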