Title: Discrete Bayes Nets
1. Discrete Bayes Nets
- Robert J. Mislevy
- University of Maryland
- September 17, 2007
2. Independence
- If two discrete random variables are independent, the probability of the joint occurrence of values of the two variables is equal to the product of the individual probabilities: P(X=x, Y=y) = P(X=x) P(Y=y).
- Example: Flip a dime, flip a penny (see the sketch below).
- Also, P(X=x | Y=y) = P(X=x). Learning the value of Y does not influence your belief about X.
[Figure: two unconnected nodes, Dime and Penny]
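To make the product rule concrete, here is a minimal Python sketch for the dime/penny example (the fair-coin probabilities are an illustrative assumption):

```python
# Joint distribution of two independent fair coin flips (dime, penny).
p_dime = {"H": 0.5, "T": 0.5}
p_penny = {"H": 0.5, "T": 0.5}
joint = {(d, p): p_dime[d] * p_penny[p] for d in p_dime for p in p_penny}

# P(Dime=H | Penny=H) = P(Dime=H, Penny=H) / P(Penny=H)
p_penny_h = sum(pr for (d, p), pr in joint.items() if p == "H")
print(joint[("H", "H")] / p_penny_h, p_dime["H"])  # both 0.5: the penny tells us nothing
```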
3. Conditional independence
- If two variables are conditionally independent, the conditional probability of the joint occurrence given the value of another variable is equal to the product of the conditional probabilities: P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z).
- Also, P(X=x | Y=y, Z=z) = P(X=x | Z=z). It may be that learning Z will influence what you believe about X and about Y, but if you know the value of Z, learning the value of Y does not influence your belief about X.
4. Example of conditional independence: Two flips of the same biased coin
- Two coins, with the probability of heads being .2 for Coin 1 and .7 for Coin 2.
- One of these coins is selected--say with 50-50 probability--and flipped twice.
- The flips are independent IF you know which coin was flipped, but dependent if you don't (see the sketch below).
[Figure: node Coin (which coin) with arrows to Flip1 and Flip2]
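A short Python sketch of this example, using the probabilities from the slide:

```python
# Two flips of a randomly chosen biased coin: P(heads) = .2 for Coin 1,
# .7 for Coin 2, each coin chosen with probability .5.
p_coin = {1: 0.5, 2: 0.5}
p_heads = {1: 0.2, 2: 0.7}

p_h = sum(p_coin[c] * p_heads[c] for c in p_coin)        # P(one flip is heads) = 0.45
p_hh = sum(p_coin[c] * p_heads[c] ** 2 for c in p_coin)  # flips independent GIVEN the coin

print(p_hh, p_h * p_h)   # 0.265 vs 0.2025: marginally, the flips are dependent
print(p_hh / p_h)        # P(Flip2=H | Flip1=H) ≈ 0.589, not 0.45
```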
5. Building up complex networks
- Interrelationships among many variables are modeled in terms of important relationships among smaller subsets of variables
- (sometimes unobservable ones).
6. Building up complex networks
- Recursive representation of probability distributions: p(x1, ..., xn) = p(xn | xn-1, ..., x1) p(xn-1 | xn-2, ..., x1) ... p(x1)
- All orderings are equally correct, but some are more beneficial because they capitalize on causal, dependence, time-order, or theoretical relationships that we posit
- Terms simplify when there is conditional independence.
7Building up complex networks
- Acyclic directed graphs (DAGs)
- Nodes variables, edges arrows, cant have
loops. - The relationship between recursive
representations and acyclic directed graphs - Edges (arrows) represent explicit dependence
relationships - No edges means no explicit dependence, although
there can be dependence through relationships
with other variables. - There can be conditional independence
relationships that are not revealed in a DAG.
8. Asia example
- 1. A recent trip to Asia increases the chance of contracting tuberculosis.
- 2. Smoking is a risk factor for both lung cancer and bronchitis.
- 3. The presence of either tuberculosis or lung cancer is detectable by an X-ray, but the X-ray cannot distinguish between them.
- 4. Dyspnoea (shortness of breath) may be caused by either tuberculosis or lung cancer, or also by bronchitis.
9. Computation in Bayes nets
- Concepts and basics of the computing strategy:
- Chapter 5 of Almond, Mislevy, Steinberg, Williamson, & Yan (in progress). Bayesian networks in educational assessment.
- For more detail, see:
- Jensen, F.V. (1996). An introduction to Bayesian networks. New York: Springer-Verlag.
- Lauritzen, S.L., & Spiegelhalter, D.J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50, 157-224.
10. Why does Bayes Theorem work?
- The setup, with two random variables, X and Y.
- Joint probabilities:

           Y=y1        Y=y2        Total
  X=x1     p(x1,y1)    p(x1,y2)    p(x1)
  X=x2     p(x2,y1)    p(x2,y2)    p(x2)
  X=x3     p(x3,y1)    p(x3,y2)    p(x3)
  Total    p(y1)       p(y2)       1
11. Why does Bayes Theorem work?
- The setup, with two random variables, X and Y.
- Joint probabilities:

           Y=y1        Y=y2        Total
  X=x1     p(x1,y1)    p(x1,y2)    p(x1)
  X=x2     p(x2,y1)    p(x2,y2)    p(x2)
  X=x3     p(x3,y1)    p(x3,y2)    p(x3)
  Total    p(y1)       p(y2)       1

The Y=y2 column contains the cells in which Y=y2: p(xj, y2) = p(y2 | xj) p(xj). Divide each by the total of the column, p(y2). The result is the proportion each cell represents within that column, p(xj | y2).
12. Why does Bayes Theorem work?
- The setup, with two random variables, X and Y.
- Joint probabilities:

           Y=y1        Y=y2        Total
  X=x1     p(x1,y1)    p(x1,y2)    p(x1)
  X=x2     p(x2,y1)    p(x2,y2)    p(x2)
  X=x3     p(x3,y1)    p(x3,y2)    p(x3)
  Total    p(y1)       p(y2)       1

The X=x1 row contains the cells in which X=x1: p(x1, yk) = p(x1 | yk) p(yk). Divide each by the total of the row, p(x1). The result is the proportion each cell represents within that row, p(yk | x1).
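This table view of conditioning is easy to code. A minimal Python sketch (the numeric joint probabilities are illustrative assumptions, not from the slides):

```python
import numpy as np

# Illustrative joint distribution p(x_j, y_k): rows x1..x3, columns y1, y2.
joint = np.array([[0.10, 0.15],
                  [0.20, 0.05],
                  [0.30, 0.20]])

p_y = joint.sum(axis=0)   # column totals p(y1), p(y2)
p_x = joint.sum(axis=1)   # row totals p(x1), p(x2), p(x3)

# Condition on Y=y2: take that column and divide by its total p(y2).
print(joint[:, 1] / p_y[1])   # p(x_j | y2), sums to 1

# Condition on X=x1: take that row and divide by its total p(x1).
print(joint[0, :] / p_x[0])   # p(y_k | x1)
```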
13. Bayes Theorem with 2 Variables
- The setup, with two random variables, X and Y.
- You know the conditional probabilities, p(x | yk), which tell you what to believe about X if you knew the value of Y.
- You learn X=x; what should you believe about Y?
- You combine two things:
- Relative conditional probabilities (the likelihood), i.e., p(x | yk) as a function of yk with X fixed at x.
- Previous probabilities for the Y values, p(yk).
- posterior ∝ likelihood × prior: p(yk | x) ∝ p(x | yk) p(yk).
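A minimal Python sketch of this update (the prior and likelihood numbers are illustrative assumptions):

```python
# posterior ∝ likelihood × prior
prior = {"y1": 0.6, "y2": 0.4}           # p(y_k)
likelihood = {"y1": 0.25, "y2": 0.50}    # p(x | y_k) for the observed x

unnormalized = {y: likelihood[y] * prior[y] for y in prior}
total = sum(unnormalized.values())        # p(x), the normalizing constant
posterior = {y: v / total for y, v in unnormalized.items()}
print(posterior)   # {'y1': 0.428..., 'y2': 0.571...}
```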
14. Inference in a chain
Recursive representation:
p(u,v,x,y,z) = p(z|y,x,v,u) p(y|x,v,u) p(x|v,u) p(v|u) p(u)
             = p(z|y) p(y|x) p(x|v) p(v|u) p(u).
[Figure: chain U -> V -> X -> Y -> Z, with edges labeled p(v|u), p(x|v), p(y|x), p(z|y)]
15. Inference in a chain
Suppose we learn the value of X. Start here, by revising belief about X.
[Figure: the chain U -> V -> X -> Y -> Z; evidence enters at X]
16. Inference in a chain
Propagate information down the chain using conditional probabilities: from the updated belief about X, use the conditional probability to revise belief about Y.
[Figure: the chain U -> V -> X -> Y -> Z; the update flows from X to Y via p(y|x)]
17. Inference in a chain
Propagate information down the chain using conditional probabilities: from the updated belief about Y, use the conditional probability to revise belief about Z.
[Figure: the chain U -> V -> X -> Y -> Z; the update flows from Y to Z via p(z|y)]
18. Inference in a chain
Propagate information up the chain using Bayes Theorem: from the updated belief about X, use Bayes Theorem to revise belief about V.
[Figure: the chain U -> V -> X -> Y -> Z; the update flows from X back to V against p(x|v)]
19. Inference in a chain
Propagate information up the chain using Bayes Theorem: from the updated belief about V, use Bayes Theorem to revise belief about U. A sketch of both directions follows.
[Figure: the chain U -> V -> X -> Y -> Z; the update flows from V back to U against p(v|u)]
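A minimal Python sketch of both directions on a fragment of the chain (all numbers are illustrative assumptions): downward by conditional probability, upward by Bayes Theorem.

```python
import numpy as np

# Fragment V -> X -> Y of the chain, with binary variables; rows index the parent's value.
p_v = np.array([0.5, 0.5])                 # prior p(v)
p_x_given_v = np.array([[0.8, 0.2],        # p(x | v=0)
                        [0.3, 0.7]])       # p(x | v=1)
p_y_given_x = np.array([[0.9, 0.1],
                        [0.4, 0.6]])

# Evidence: we learn X = 0.
x_obs = 0

# Downward: revise belief about Y from the updated (now certain) belief about X.
p_y = p_y_given_x[x_obs]                   # p(y | x=0)

# Upward: revise belief about V with Bayes Theorem: p(v | x=0) ∝ p(x=0 | v) p(v).
unnorm = p_x_given_v[:, x_obs] * p_v
p_v_post = unnorm / unnorm.sum()
print(p_y, p_v_post)                       # [0.9 0.1] and [0.727... 0.272...]
```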
20. Inference in singly-connected nets
Singly connected: there is never more than one path from one variable to another variable. Chains and trees are singly connected. One can use repeated applications of Bayes Theorem and conditional probability to propagate evidence (Pearl, early 1980s).
[Figure: a singly connected network over U, V, X, Y, Z]
21Inference in multiply-connected nets
In a multiply- connected graph, in at least one
instance there is more than one path from one
variable to another variable. Repeated
applications of Bayes theorem and conditional
probability at the level of individual variables
doesnt work.
V
W
U
X
Y
Z
22. Inference in multiply-connected nets
- Key idea: Group variables into subsets (cliques) such that the subsets form a tree.
[Figure: the network over U, V, W, X, Y, Z; the clique U,V,W is highlighted]
23. Inference in multiply-connected nets
- Key idea: Group variables into subsets (cliques) such that the subsets form a tree.
[Figure: the network over U, V, W, X, Y, Z; the cliques U,V,W and U,V,X are highlighted]
24. Inference in multiply-connected nets
- Key idea: Group variables into subsets (cliques) such that the subsets form a tree.
[Figure: the network over U, V, W, X, Y, Z; the cliques U,V,W, U,V,X, and U,X,Y are highlighted]
25. Inference in multiply-connected nets
- Key idea: Group variables into subsets (cliques) such that the subsets form a tree.
- One can then update cliques with a generalized version of the procedure for updating individual variables.
[Figure: the network over U, V, W, X, Y, Z; the cliques U,V,W, U,V,X, U,X,Y, and X,Z form a tree]
26. The Lauritzen-Spiegelhalter algorithm
- 1. Recursive representation of the joint distribution of variables.
- 2. Directed graph representation of (1).
- 3. Moralized, undirected, triangulated graph.
- 4. Determination of cliques and clique intersections.
- 5. Join tree representation.
- 6. Potential tables.
- 7. Updating scheme.
27. Example from Andreassen, Jensen, & Olesen
- Two possible diseases: flu and throat infection (FLU and THRINF)
- Two possible symptoms: fever and sore throat (FEV and SORTHR)
- The diseases are modeled as independent,
- the symptoms as conditionally independent given disease states.
28. Example from Andreassen, Jensen, & Olesen
- Aside: Medical diagnosis with observable symptoms of latent disease states has many parallels to measurement modeling in assessment:
- State is a construct, inferred from theory and experience, proposed to organize our knowledge
- Conditional independence of observations given (possibly complex) state
- Persistent interest in the underlying state
- Observations mainly of transitory interest
- States and relationships are meant to aid thinking about unique cases, but are surely oversimplified
- State is the level at which treatment and prognosis are discussed,
- although there is often therapeutic/educational value in addressing specifics from the observational setting
29. 1) Recursive representation of joint distribution
P(FEV, SORTHR, FLU, THRINF) = P(FEV | SORTHR, FLU, THRINF) P(SORTHR | FLU, THRINF) P(FLU | THRINF) P(THRINF)
                            = P(FEV | FLU, THRINF) P(SORTHR | FLU, THRINF) P(FLU) P(THRINF).
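A Python sketch of this factorization. The .11 priors appear later in the potential-table slides; the conditional probability tables here are illustrative assumptions:

```python
# Factorized joint: P(FEV, SORTHR, FLU, THRINF)
#   = P(FEV | FLU, THRINF) P(SORTHR | FLU, THRINF) P(FLU) P(THRINF)
p_flu = {True: 0.11, False: 0.89}
p_thrinf = {True: 0.11, False: 0.89}

# Assumed conditional probability tables: P(symptom=yes | FLU, THRINF).
p_fev = {(True, True): 0.95, (True, False): 0.90, (False, True): 0.80, (False, False): 0.05}
p_sorthr = {(True, True): 0.95, (True, False): 0.40, (False, True): 0.90, (False, False): 0.10}

def joint(fev, sorthr, flu, thrinf):
    pf = p_fev[(flu, thrinf)] if fev else 1 - p_fev[(flu, thrinf)]
    ps = p_sorthr[(flu, thrinf)] if sorthr else 1 - p_sorthr[(flu, thrinf)]
    return pf * ps * p_flu[flu] * p_thrinf[thrinf]

# Sanity check: the joint sums to 1 over all 16 configurations.
total = sum(joint(a, b, c, d) for a in (True, False) for b in (True, False)
            for c in (True, False) for d in (True, False))
print(total)   # 1.0
```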
30. 2) DAG representation
31. 2) DAG representation
Aside: A look ahead toward cognitive diagnosis.
[Figure annotations on the DAG:]
- Good differential diagnosis value for "neither" vs. "at least one of the two"
- Good differential diagnosis value for throat infection vs. no throat infection
32. 2) DAG representation
Aside: A look ahead toward cognitive diagnosis.
[Figure annotations on the DAG:]
- No differential diagnosis value for "which of the two?"
- Good differential diagnosis value for "which of the two?"
33. 3a) Moralized graph
"Marry parents": Look at the set of parents of each variable. If they are not already connected, connect them. Direction doesn't matter, since we'll drop it in the next step. (A sketch of the operation follows.)
Rationale: If variables are all parents of the same variable, then even if they were independent otherwise, learning the value of their common child generally introduces dependence among them (think of Holmes and Watson on the icy road, or the biased-coin flipping example). We will need to include this possibility in our computational machinery.
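The operation is mechanical enough to show in a few lines of Python; the dict-of-parents representation is just a convenience, and the graph is the flu example from these slides:

```python
# Moralize a DAG given as {child: [parents]}.
from itertools import combinations

parents = {"FEVER": ["FLU", "THRINF"], "SORTHR": ["FLU", "THRINF"],
           "FLU": [], "THRINF": []}

edges = set()
for child, pars in parents.items():
    # Keep every original edge, with direction dropped.
    for p in pars:
        edges.add(frozenset((p, child)))
    # "Marry" the parents: connect every pair of parents of the same child.
    for p1, p2 in combinations(pars, 2):
        edges.add(frozenset((p1, p2)))

print(sorted(tuple(sorted(e)) for e in edges))
# Includes ('FLU', 'THRINF'): the married-parents edge.
```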
34. 3b) Undirected graph
[Figure: the graph over FLU, THRINF, FEVER, SORTHR, before and after the edge directions are dropped]
Drop the directionality of the edges. Although
the conditional probability directions were
important for constructing the graph and will be
important for building potential tables, we want
a structure for computing that can go in any
direction.
35. 3c) Triangulated graph
[Figure: a cycle over A1, ..., A5, untriangulated vs. triangulated]
Triangulation means looking at the undirected graph for cycles from a variable to itself going through a sequence of other variables. There should be no chordless cycle with length greater than three; whenever there is one, add undirected edges so that no such cycles remain. The flu/throat-infection moral graph is already triangulated, so it is not an issue here. A different example is shown above. Why do we do this? It is essential for producing a set of cliques that forms a tree. There can be many ways to do this; finding the best one is NP-hard, so people have developed heuristic approaches. One such heuristic is sketched below.
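One standard heuristic is triangulation by vertex elimination: eliminate vertices in some order, connecting the remaining neighbors of each eliminated vertex. A minimal Python sketch (the 4-cycle and the elimination order are illustrative assumptions):

```python
from itertools import combinations

def triangulate(adj, order):
    """adj: {node: set(neighbors)} (undirected); order: elimination order.
    Adding the returned fill-in edges to the original graph makes it chordal."""
    adj = {v: set(ns) for v, ns in adj.items()}
    fill = set()
    for v in order:
        nbrs = list(adj[v])
        for a, b in combinations(nbrs, 2):
            if b not in adj[a]:                # fill-in edge
                adj[a].add(b); adj[b].add(a)
                fill.add(frozenset((a, b)))
        for n in nbrs:                          # remove v from the graph
            adj[n].discard(v)
        del adj[v]
    return fill

# A 4-cycle A1-A2-A3-A4 needs one chord.
cycle = {"A1": {"A2", "A4"}, "A2": {"A1", "A3"},
         "A3": {"A2", "A4"}, "A4": {"A3", "A1"}}
print(triangulate(cycle, ["A1", "A2", "A3", "A4"]))  # one fill-in edge: {A2, A4}
```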
36. 4) Determine cliques and clique intersections
[Figure: the triangulated graph over FLU, THRINF, FEVER, SORTHR, with the two overlapping cliques highlighted]
From the triangulated graph, one determines cliques: subsets of variables that are all linked pairwise to one another. Cliques overlap, with the sets of overlapping variables called clique intersections. The two cliques here are {FEVER, FLU, THRINF} and {FLU, THRINF, SORTHR}. The clique intersection is {FLU, THRINF}.
37. 4) Determine cliques and clique intersections
- Cliques and intersections are the structure for local updating.
- There can be multiple ways to define cliques from a triangulated graph. Finding the best is NP-hard; heuristics have been developed.
- The amount of computation grows roughly geometrically with clique size, as measured by the number of possible configurations of all values of all variables in a clique.
- A clique representation with many small cliques is therefore preferred to a representation with a few larger cliques.
- Strategies for increased efficiency include defining "collector" variables, adding variables to break loops, and dropping associations when the consequences are benign.
38. 5) Join tree representation
[Figure: join tree {FEVER, FLU, THRINF} -- {FLU, THRINF} -- {FLU, THRINF, SORTHR}]
A join-tree representation depicts the singly-connected structure of cliques and clique intersections. A join tree has the running intersection property: if a variable appears in two cliques, it appears in all cliques and clique intersections on the single path connecting them.
39. 6) Potential tables
- Local calculation is carried out with tables that convey the joint distributions of variables within cliques, called potential tables.
- Similar tables for clique intersections are used to pass updating information from one clique to another.
40. 6) Potential tables
For each clique, determine the joint probabilities for all the possible combinations of values of all variables. For convenience, we have written them as matrices. These potential tables indicate the initial status of the network in our example--before specific knowledge of a particular individual's symptoms or disease states is known.
41. 6) Potential tables
The potential table for the clique intersection is the marginal distribution of flu and throat infection.
[Table: marginal probabilities for FLU and THRINF, with columns FLU, THRINF, PROB]
42. 6) Potential tables
The potential table for Clique 1 is calculated using the prior probabilities of .11 for both flu and throat infection, the assumption that they are independent, and the conditional probabilities of fever for each flu/throat-infection combination:

  FLU  THRINF  PROB
  yes  yes     .012
  yes  no      .098
  no   yes     .098
  no   no      .792

These marginal probabilities for FLU and THRINF are multiplied (x) by the conditional probabilities for FEVER given FLU and THRINF.
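In Python, the Clique 1 potential follows directly. The .11 priors and the marginal column are from the slide; the FEVER conditionals are illustrative assumptions:

```python
# Potential table for Clique 1 = {FEVER, FLU, THRINF}:
# marginal p(FLU, THRINF) times assumed conditionals p(FEVER | FLU, THRINF).
p_flu, p_thrinf = 0.11, 0.11
marginal = {(f, t): (p_flu if f else 1 - p_flu) * (p_thrinf if t else 1 - p_thrinf)
            for f in (True, False) for t in (True, False)}
# marginal: .012, .098, .098, .792 (rounded), matching the slide.

p_fever_yes = {(True, True): 0.95, (True, False): 0.90,
               (False, True): 0.80, (False, False): 0.05}   # assumed CPT

potential = {}
for (f, t), m in marginal.items():
    potential[("yes", f, t)] = p_fever_yes[(f, t)] * m
    potential[("no", f, t)] = (1 - p_fever_yes[(f, t)]) * m
print(sum(potential.values()))   # 1.0: a joint distribution over the clique
```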
43. 6) Potential tables
A similar calculation applies for the other clique: marginal probabilities of FLU and THRINF, times the conditionals for SORTHR. Note that the implied distributions for FLU and THRINF are consistent across both clique potential tables and the clique intersection table. From these, we can reconstruct a coherent joint distribution for the entire set of variables.
44. 7) Updating scheme
- Absorbing new evidence about a single variable is effected by re-adjusting the appropriate margin in a potential table that contains that variable, then propagating the resulting change from that clique to other cliques via the clique intersections.
- This process continues outward from the clique where it began, until all cliques have been updated.
- The single-connectedness and running intersection properties of the join tree assure that coherent probabilities result.
45. 7) Updating scheme
Suppose we learn FEVER=yes. Go to any clique where FEVER appears (actually there's just one in this example). Zero out the entries for FEVER=no. The remaining values express our new beliefs about the proportional chances that the other variables in that clique take their respective joint values.
46. 7) Updating scheme
Propagate the new beliefs about {FLU, THRINF} to the clique intersection. You could normalize these if you wanted to, but the proportional information is what matters.
[Figure: the clique-intersection table with OLD and NEW weight columns]
47. 7) Updating scheme
Propagate the new beliefs about {FLU, THRINF} to the next clique. Divide each row by the old weight for that combination of clique-intersection variables and multiply it by the new one; i.e., the adjustment factor for each row is New Weight / Old Weight.
[Figure: the next clique's table with OLD and NEW weights]
48. 7) Updating scheme
Clique 2: apply the adjustment factor to each row, then renormalize with respect to all values.
49. 7) Updating scheme
Clique 2: apply the adjustment factor to each row, then renormalize with respect to all values. This yields the predictive distribution for SORTHR. (A sketch of the full update follows.)
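A Python sketch of the whole updating scheme on the two-clique join tree. The intersection marginals (.012, .098, .098, .792) are from the slides; the within-clique conditionals are illustrative assumptions consistent with the earlier sketches:

```python
import numpy as np

# Two clique potentials sharing the intersection {FLU, THRINF} (4 joint states).
# Rows: the 4 FLU/THRINF combinations (yy, yn, ny, nn); columns: symptom yes/no.
clique1 = np.array([[0.0114, 0.0006],   # FEVER = yes / no
                    [0.0882, 0.0098],
                    [0.0784, 0.0196],
                    [0.0396, 0.7524]])
clique2 = np.array([[0.0114, 0.0006],   # SORTHR = yes / no (assumed)
                    [0.0392, 0.0588],
                    [0.0882, 0.0098],
                    [0.0792, 0.7128]])

old = clique1.sum(axis=1)               # old intersection weights p(FLU, THRINF)

# Evidence: FEVER = yes. Zero out the FEVER = no column in Clique 1.
clique1[:, 1] = 0.0
new = clique1.sum(axis=1)               # new (unnormalized) intersection weights

# Pass the message: scale each row of Clique 2 by New/Old, then renormalize.
clique2 *= (new / old)[:, None]
clique2 /= clique2.sum()
print(clique2.sum(axis=0))              # predictive distribution for SORTHR
```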
50. Comments
- Clique determination is NP-hard--need heuristics
- HUGIN has multiple options and tells you the cliques
- Computation depends on the largest clique size (large potential tables)
- More conditional independence is generally better
51. Some Favorable and Unfavorable Structures
- Multiple children are good (think IRT)
- Multiple parents are not good. Why?
52. Some Favorable and Unfavorable Structures
- Multiple children are good (think IRT)
- Multiple simple cliques
- Multiple parents are not good. Why?
- Moralization forces a clique containing all the parents plus the child.
53. Key points for measurement models
- Student model (SM) variables are of transcending interest
- They characterize student knowledge, skill, and strategies
- They cannot be directly observed
- Observable variables are the means of getting evidence about SM variables
- They characterize salient aspects of performance
- Observable variables from performances are modeled as conditionally independent across (but not necessarily within) tasks, given SM variables.