Title: Discrete Bayes Nets
1. Discrete Bayes Nets
- Robert J. Mislevy
- University of Maryland
- September 13, 2004
2. Independence
- If two discrete random variables are independent, the probability of the joint occurrence of values of the two variables is equal to the product of the probabilities individually: P(X=x, Y=y) = P(X=x) P(Y=y).
- Example: Flip a dime, flip a penny.
- Also, P(X=x | Y=y) = P(X=x). Learning the value of Y does not influence your belief about X.
[Figure: two unconnected nodes, Dime and Penny]
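A minimal numeric check of the definition, as a sketch. The .5 values below are illustrative assumptions (fair coins); any independent pair works the same way:

```python
# Independence for the dime/penny example: joint = product of marginals.
p_dime_h = 0.5    # assumed P(dime = heads)
p_penny_h = 0.5   # assumed P(penny = heads)

p_joint = p_dime_h * p_penny_h          # P(X=x, Y=y) = P(X=x) P(Y=y)

# Learning the penny's value does not change belief about the dime:
p_dime_given_penny = p_joint / p_penny_h
assert p_dime_given_penny == p_dime_h   # P(X=x | Y=y) = P(X=x)
print(p_joint)                          # 0.25
```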
3. Conditional independence
- If two variables are conditionally independent, the conditional probability of their joint occurrence given the value of another variable is equal to the product of the conditional probabilities: P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z).
- Also, P(X=x | Y=y, Z=z) = P(X=x | Z=z). It may be that learning Z will influence what you believe about X and about Y, but if you know the value of Z, learning the value of Y does not influence your belief about X.
4. Example of conditional independence: Two flips of the same biased coin
- Two coins, with the probability of heads being .2 for Coin 1 and .7 for Coin 2.
- One of these coins is selected--say with 50-50 probability--and flipped twice.
- Flips are independent IF you know which coin was flipped, but dependent if you don't...
[Figure: a Coin node ("which coin") with arrows to Flip1 and Flip2]
5. Two flips of the same biased coin
1. Initial status.
2. Observe Flip1 = h: the probability for Flip2 = h increases. Why?
3. Status if you know the coin is Coin 1.
4. Observe Flip1 = h when you know the coin is Coin 1: no change in the probability of Flip2 = h. Why not?
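These four steps can be computed directly from the numbers on slide 4 (P(h | Coin 1) = .2, P(h | Coin 2) = .7, coin chosen 50-50); a short sketch:

```python
# Two flips of the same biased coin: marginally dependent,
# conditionally independent given the coin.
p_coin = {1: 0.5, 2: 0.5}   # which coin was selected
p_h = {1: 0.2, 2: 0.7}      # P(flip = heads | coin)

# 1. Initial status: P(Flip2 = h) before observing anything.
p_flip2_h = sum(p_coin[c] * p_h[c] for c in p_coin)
print(p_flip2_h)                      # 0.45

# 2. Observe Flip1 = h. Flips are conditionally independent given the coin,
#    so P(F1=h, F2=h) = sum_c P(c) P(h|c)^2, and P(F1=h) = 0.45 as above.
p_both_h = sum(p_coin[c] * p_h[c] ** 2 for c in p_coin)
print(p_both_h / p_flip2_h)           # ~0.589: belief in Flip2 = h increases

# 3./4. If you know the coin (say Coin 1), Flip1 adds nothing:
#    P(F2=h | F1=h, Coin=1) = P(F2=h | Coin=1) = 0.2
```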
6. The heart of measurement models: Pearl on conditional independence
- "Conditional independence is not a grace of nature for which we must wait passively, but rather a psychological necessity which we satisfy actively by organizing our knowledge in a specific way.
- An important tool in such organization is the identification of intermediate variables that induce conditional independence among observables; if such variables are not in our vocabulary, we create them.
- In medical diagnosis, for instance, when some symptoms directly influence one another, the medical profession invents a name for that interaction (e.g., 'syndrome,' 'complication,' 'pathological state') and treats it as a new auxiliary variable that induces conditional independence; dependency between any two interacting systems is fully attributed to the dependencies of each on the auxiliary variable." (Pearl, 1988, p. 44)
7. Building up complex networks
- Interrelationships among many variables are modeled in terms of important relationships among smaller subsets of variables (sometimes unobservable ones).
8. Building up complex networks
- Recursive representation of probability distributions: p(x1, ..., xn) = p(xn | xn-1, ..., x1) ... p(x2 | x1) p(x1).
- All orderings are equally correct, but some are more beneficial because they capitalize on causal, dependence, time-order, or theoretical relationships that we posit.
- Terms simplify when there is conditional independence.
9. Jensen's Wet Lawn Example
- "Mr Holmes now lives in Los Angeles. One morning when Holmes leaves his house, he realizes that his lawn is wet. Is it due to rain (R), or has he forgotten to turn off his sprinkler (S)? His belief in both events increases.
- Next he notices that the grass of his neighbor, Dr. Watson, is also wet. Elementary: Holmes is almost certain that it has been raining." (p. 8)
Jensen, F.V. (1996). An introduction to Bayesian networks. New York: Springer-Verlag.
10. Jensen's Wet Lawn Example
- p(holmes, watson, rain, sprinkler)
  = p(holm | wat, rn, sprnk) x p(wat | rn, sprnk) x p(rn | sprnk) x p(sprnk)
  = p(holm | rn, sprnk) x p(wat | rn) x p(rn) x p(sprnk)
- whereas...
11. Jensen's Wet Lawn Example
- p(rain, sprinkler, watson, holmes)
  = p(rn | sprnk, wat, holm) x p(sprnk | wat, holm) x p(wat | holm) x p(holm)
- This doesn't simplify. You get the same answers, but less efficiently.
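A sketch verifying why the terms in the first ordering simplify. Jensen gives the structure only; the CPT numbers below are illustrative assumptions:

```python
from itertools import product

# Illustrative CPTs for the Wet Lawn network (assumed values).
def p_rain(r):    return 0.2 if r else 0.8
def p_sprnk(s):   return 0.1 if s else 0.9
def p_wat(w, r):                      # Watson's lawn depends only on rain
    p = 0.95 if r else 0.10
    return p if w else 1 - p
def p_holm(h, r, s):                  # Holmes's lawn: rain or sprinkler
    p = 0.99 if (r or s) else 0.05
    return p if h else 1 - p

# Efficient ordering: p(holm|rn,sprnk) p(wat|rn) p(rn) p(sprnk).
def joint(h, w, r, s):
    return p_holm(h, r, s) * p_wat(w, r) * p_rain(r) * p_sprnk(s)

# The full conditional p(holm | wat, rn, sprnk), computed from the joint,
# collapses to p(holm | rn, sprnk) -- the conditional independence that
# lets the first term of the recursive representation simplify:
for h, w, r, s in product([True, False], repeat=4):
    full = joint(h, w, r, s) / sum(joint(h2, w, r, s) for h2 in [True, False])
    assert abs(full - p_holm(h, r, s)) < 1e-12
```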
12. Building up complex networks
- Acyclic directed graphs (DAGs)
- Nodes = variables, edges = arrows; can't have loops.
- The relationship between recursive representations and acyclic directed graphs:
- Edges (arrows) represent explicit dependence relationships.
- No edges means no explicit dependence, although there can be dependence through relationships with other variables.
- There can be conditional independence relationships that are not revealed in a DAG (e.g., the inefficient WetGrass representation).
13. Computation in Bayes nets
- Concepts: basics of computing strategy.
- Chapter 5 of Almond, Mislevy, Steinberg, Williamson, & Yan (in progress). Bayesian networks in educational assessment.
- For more detail, see:
- Jensen, F.V. (1996). An introduction to Bayesian networks. New York: Springer-Verlag.
- Lauritzen, S.L., & Spiegelhalter, D.J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50, 157-224.
14. Why does Bayes' Theorem work?
- The setup, with two random variables, X and Y.
- Joint probabilities:

          Y=y1       Y=y2       Total
X=x1      p(x1,y1)   p(x1,y2)   p(x1)
X=x2      p(x2,y1)   p(x2,y2)   p(x2)
X=x3      p(x3,y1)   p(x3,y2)   p(x3)
Total     p(y1)      p(y2)      1
15. Why does Bayes' Theorem work?
- The setup, with two random variables, X and Y.
- Joint probabilities:

          Y=y1       Y=y2       Total
X=x1      p(x1,y1)   p(x1,y2)   p(x1)
X=x2      p(x2,y1)   p(x2,y2)   p(x2)
X=x3      p(x3,y1)   p(x3,y2)   p(x3)
Total     p(y1)      p(y2)      1

These are the cells in which Y=y2: p(xj, y2) = p(y2 | xj) p(xj). Divide each by the total of the column, p(y2). The result is the proportion each cell represents in that column, p(xj | y2).
16. Why does Bayes' Theorem work?
- The setup, with two random variables, X and Y.
- Joint probabilities:

          Y=y1       Y=y2       Total
X=x1      p(x1,y1)   p(x1,y2)   p(x1)
X=x2      p(x2,y1)   p(x2,y2)   p(x2)
X=x3      p(x3,y1)   p(x3,y2)   p(x3)
Total     p(y1)      p(y2)      1

These are the cells in which X=x1: p(x1, yk) = p(yk | x1) p(x1). Divide each by the total of the row, p(x1). The result is the proportion each cell represents in that row, p(yk | x1).
17. Bayes' Theorem with 2 Variables
- The setup, with two random variables, X and Y:
- You know the conditional probabilities, p(xj | yk), which tell you what to believe about X if you knew the value of Y.
- You learn X = x: what should you believe about Y?
- You combine two things:
- Relative conditional probabilities (the likelihood), i.e., p(x | yk) as a function of yk with X fixed at x.
- Previous (prior) probabilities for the Y values, p(yk).
- Posterior ∝ likelihood x prior.
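A minimal numeric sketch of this combination, matching the 3-by-2 table of slides 14-16. The prior and conditional values are assumptions for illustration:

```python
import numpy as np

p_y = np.array([0.4, 0.6])              # assumed prior p(y_k)
p_x_given_y = np.array([[0.5, 0.1],     # assumed p(x_j | y_k); each column sums to 1
                        [0.3, 0.3],
                        [0.2, 0.6]])

# Learn X = x_1. The likelihood is p(x_1 | y_k) as a function of y_k:
likelihood = p_x_given_y[0]

posterior = likelihood * p_y            # posterior ∝ likelihood x prior
posterior /= posterior.sum()            # the normalizer is p(x_1)
print(posterior)                        # updated belief about Y, p(y_k | x_1)
```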
18. Inference in a chain
Recursive representation:
p(u,v,x,y,z) = p(z|y,x,v,u) p(y|x,v,u) p(x|v,u) p(v|u) p(u)
             = p(z|y) p(y|x) p(x|v) p(v|u) p(u).
[Figure: chain U -> V -> X -> Y -> Z, edges labeled p(v|u), p(x|v), p(y|x), p(z|y)]
19. Inference in a chain
Suppose we learn the value of X. Start here, by revising belief about X.
[Figure: the same chain, with X highlighted]
20. Inference in a chain
Propagate information down the chain using conditional probabilities. From updated belief about X, use conditional probability to revise belief about Y.
[Figure: the same chain, highlighting the step X -> Y via p(y|x)]
21. Inference in a chain
Propagate information down the chain using conditional probabilities. From updated belief about Y, use conditional probability to revise belief about Z.
[Figure: the same chain, highlighting the step Y -> Z via p(z|y)]
22. Inference in a chain
Propagate information up the chain using Bayes' Theorem. From updated belief about X, use Bayes' Theorem to revise belief about V.
[Figure: the same chain, highlighting the step X -> V against the arrow p(x|v)]
23. Inference in a chain
Propagate information up the chain using Bayes' Theorem. From updated belief about V, use Bayes' Theorem to revise belief about U.
[Figure: the same chain, highlighting the step V -> U against the arrow p(v|u)]
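The whole pass (slides 19-23) fits in a few lines. This sketch assumes binary variables and made-up conditional probability tables, with rows indexed by the parent's value:

```python
import numpy as np

# Assumed CPTs for the chain U -> V -> X -> Y -> Z.
p_u   = np.array([0.5, 0.5])
p_v_u = np.array([[0.9, 0.1], [0.2, 0.8]])   # p(v | u)
p_x_v = np.array([[0.8, 0.2], [0.3, 0.7]])   # p(x | v)
p_y_x = np.array([[0.7, 0.3], [0.4, 0.6]])   # p(y | x)
p_z_y = np.array([[0.6, 0.4], [0.1, 0.9]])   # p(z | y)

p_v = p_u @ p_v_u                            # prior marginal of V

# Learn X = 0 (its first value). Downward: conditional probabilities.
x_belief = np.array([1.0, 0.0])
y_belief = x_belief @ p_y_x                  # revise Y from X
z_belief = y_belief @ p_z_y                  # revise Z from Y

# Upward: Bayes' Theorem. p(v | x=0) ∝ p(x=0 | v) p(v).
v_belief = p_x_v[:, 0] * p_v
v_belief /= v_belief.sum()

# Then p(u | x=0) ∝ p(x=0 | u) p(u), with p(x=0 | u) = sum_v p(v|u) p(x=0|v).
u_belief = (p_v_u @ p_x_v[:, 0]) * p_u
u_belief /= u_belief.sum()
print(y_belief, z_belief, v_belief, u_belief)
```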
24. Inference in singly-connected nets
Singly connected: there is never more than one path from one variable to another variable. Chains and trees are singly connected. Can use repeated applications of Bayes' Theorem and conditional probability to propagate evidence. (Pearl, early 1980s)
[Figure: a singly-connected net over U, V, X, Y, Z]
25. Inference in multiply-connected nets
In a multiply-connected graph, in at least one instance there is more than one path from one variable to another variable. Repeated applications of Bayes' Theorem and conditional probability at the level of individual variables don't work.
[Figure: a multiply-connected net over U, V, W, X, Y, Z]
26. Inference in multiply-connected nets
- Key idea: Group variables into subsets ("cliques") such that the subsets form a tree.
[Figure: the net, with clique {U,V,W} identified]
27. Inference in multiply-connected nets
- Key idea: Group variables into subsets ("cliques") such that the subsets form a tree.
[Figure: the net, with cliques {U,V,W} and {U,V,X} identified]
28. Inference in multiply-connected nets
- Key idea: Group variables into subsets ("cliques") such that the subsets form a tree.
[Figure: the net, with cliques {U,V,W}, {U,V,X}, and {U,X,Y} identified]
29. Inference in multiply-connected nets
- Key idea: Group variables into subsets ("cliques") such that the subsets form a tree.
- Cliques can then be updated with a generalized version of the procedure for updating individual variables.
[Figure: the net grouped into cliques {U,V,W}, {U,V,X}, {U,X,Y}, and {X,Z}, which form a tree]
30. The Lauritzen-Spiegelhalter algorithm
- 1. Recursive representation of the joint distribution of variables.
- 2. Directed graph representation of (1).
- 3. Moralized, undirected, triangulated graph.
- 4. Determination of cliques and clique intersections.
- 5. Join tree representation.
- 6. Potential tables.
- 7. Updating scheme.
31. Example from Andreassen, Jensen, & Olesen
- Two possible diseases: flu and throat infection (FLU and THRINF).
- Two possible symptoms: fever and sore throat (FEVER and SORTHR).
- The diseases are modeled as independent,
- the symptoms as conditionally independent given disease states.
32. Example from Andreassen, Jensen, & Olesen
- Aside: Medical diagnosis with observable symptoms of latent disease states has many parallels to measurement modeling in assessment:
- State is a construct, inferred from theory and experience, proposed to organize our knowledge.
- Conditional independence of observations given (possibly complex) state.
- Persistent interest in the underlying state.
- Observations mainly of transitory interest.
- State relationships are meant to aid thinking about unique cases, but are surely oversimplified.
- State is the level at which treatment and prognosis are discussed,
- although there is often therapeutic/educational value in addressing specifics from the observational setting.
33. 1) Recursive representation of joint distribution
P(FEV, SORTHR, FLU, THRINF)
= P(FEV | SORTHR, FLU, THRINF) P(SORTHR | FLU, THRINF) P(FLU | THRINF) P(THRINF)
= P(FEV | FLU, THRINF) P(SORTHR | FLU, THRINF) P(FLU) P(THRINF).
34. 2) DAG representation
35. 2) DAG representation
Aside: A look ahead toward cognitive diagnosis.
- Good differential-diagnosis value for "neither" vs. "at least one" of the two.
- Good differential-diagnosis value for throat infection vs. no throat infection.
36. 2) DAG representation
Aside: A look ahead toward cognitive diagnosis.
- No differential-diagnosis value for "which of the two?"
- Good differential-diagnosis value for "which of the two?"
37. 3a) Moralized graph
"Marry parents": Look at the set of parents of each variable. If they are not already connected, connect them. Direction doesn't matter, since we'll drop it in the next step.
Rationale: If variables are all parents of the same variable, then even if they were independent otherwise, learning the value of their common child generally introduces dependence among them (think Holmes & Watson on the icy road, or the dime/penny coin-flipping example). We will need to include this possibility in our computational machinery.
38. 3b) Undirected graph
[Figure: the FLU/THRINF/FEVER/SORTHR graph before and after dropping edge directions]
Drop the directionality of the edges. Although the conditional probability directions were important for constructing the graph, and will be important for building potential tables, we want a structure for computing that can go in any direction.
39. 3c) Triangulated graph
[Figure: a five-node cycle A1-A2-A3-A4-A5, with two alternative ways ("OR") of adding chords]
Triangulation means looking at the undirected graph for cycles from a variable to itself going through a sequence of other variables. There should be no cycle of length greater than three. Whenever there is, add undirected edges so there are no such cycles. The Flu/Throat-infection moral graph is already triangulated, so it is not an issue here. A different example is shown above. Why do we do this? It is essential to producing cliques of variables that form a tree. There can be many ways to do this; finding the best one is NP-hard. People have developed heuristic approaches.
40. 4) Determine cliques and clique intersections
[Figure: the moralized FLU/THRINF/FEVER/SORTHR graph, with the two overlapping cliques highlighted]
From the triangulated graph, one determines cliques: subsets of variables that are all linked pairwise to one another. Cliques overlap, with the sets of overlapping variables called clique intersections. The two cliques here are {FEVER, FLU, THRINF} and {FLU, THRINF, SORTHR}. The clique intersection is {FLU, THRINF}.
41. 4) Determine cliques and clique intersections
- Cliques and intersections are the structure for local updating.
- There can be multiple ways to define cliques from a triangulated graph. Finding the best is NP-hard. Heuristics have been developed.
- The amount of computation grows roughly geometrically with clique size, as measured by the number of possible configurations of the values of all variables in a clique.
- A clique representation with many small cliques is therefore preferred to a representation with a few larger cliques.
- Strategies for increased efficiency include defining "collector" variables, adding variables to break loops, and dropping associations when the consequences are benign.
42. 5) Join tree representation
[Figure: join tree {FEVER, FLU, THRINF} -- {FLU, THRINF} -- {FLU, THRINF, SORTHR}]
A join-tree representation depicts the singly-connected structure of cliques and clique intersections. A join tree has the running intersection property: if a variable appears in two cliques, it appears in all cliques and clique intersections on the single path connecting them.
43. 6) Potential tables
- Local calculation is carried out with tables that convey the joint distributions of variables within cliques, or "potential tables."
- Similar tables for clique intersections are used to pass updating information from one clique to another.
44. 6) Potential tables
For each clique, determine the joint probabilities for all the possible combinations of values of all variables. For convenience, we have written them as matrices. These potential tables indicate the initial status of the network in our example--before specific knowledge of a particular individual's symptoms or disease states is known.
45. 6) Potential tables
The potential table for the clique intersection is the marginal distribution of flu and throat infection.
[Table: marginal probabilities for FLU & THRINF, with columns FLU, THRINF, PROB]
46. 6) Potential tables
The potential table for Clique 1 is calculated using the prior probabilities of .11 for both flu and throat infection, the assumption that they are independent, and the conditional probabilities of fever for each flu/throat-infection combination.

Marginal probabilities for FLU & THRINF:

FLU   THRINF   PROB
yes   yes      .012
yes   no       .098
no    yes      .098
no    no       .792

multiplied by the table of conditional probabilities for FEVER given FLU & THRINF.
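A sketch of this calculation. The .11 priors and the independence assumption are from the slide; the conditional probabilities of fever below are illustrative stand-ins, since the slide's conditional table is not reproduced here:

```python
from itertools import product

B = [True, False]
p_flu = p_thrinf = 0.11

# Marginal probabilities for (FLU, THRINF), assuming independence.
marg = {(f, t): (p_flu if f else 1 - p_flu) * (p_thrinf if t else 1 - p_thrinf)
        for f, t in product(B, repeat=2)}
# Matches the slide (rounded): .012, .098, .098, .792

# Assumed P(FEVER = yes | FLU, THRINF):
p_fev = {(True, True): 0.9, (True, False): 0.8,
         (False, True): 0.7, (False, False): 0.05}

# Potential table for Clique 1 = {FEVER, FLU, THRINF}: marginal x conditional.
clique1 = {(fe, f, t): marg[f, t] * (p_fev[f, t] if fe else 1 - p_fev[f, t])
           for fe, f, t in product(B, repeat=3)}
assert abs(sum(clique1.values()) - 1.0) < 1e-12   # a proper joint distribution
```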
47. 6) Potential tables
Similar calculation for the other clique: marginal probabilities of FLU & THRINF, times the conditionals for SORTHR. Note that the implied distributions for FLU & THRINF are consistent across both clique potential tables and the clique intersection table. From these, we can reconstruct a coherent joint distribution for the entire set of variables.
48. 7) Updating scheme
- Absorbing new evidence about a single variable is effected by re-adjusting the appropriate margin in a potential table that contains that variable, then propagating the resulting change from that clique to other cliques via the clique intersections.
- This process continues outward from the clique where the process began, until all cliques have been updated.
- The single-connectedness and running-intersection properties of the join tree assure that coherent probabilities result.
49. 7) Updating scheme
Suppose we learn FEVER = yes. Go to any clique where FEVER appears (actually there's just one in this example). Zero out the entries for FEVER = no. The remaining values express our new beliefs about the proportional chances that the other variables in that clique take their respective joint values.
50. 7) Updating scheme
Propagate the new beliefs about {FLU, THRINF} to the clique intersection. You could normalize these if you wanted to, but the proportional information is what matters.
[Tables: NEW and OLD weights for the clique intersection]
51. 7) Updating scheme
Propagate the new beliefs about {FLU, THRINF} to the next clique. Divide each row by the old weight for that combination of clique-intersection variables and multiply it by the new one. I.e., the adjustment factor for each row is New Weight / Old Weight.
[Tables: NEW and OLD weights for the clique intersection]
52. 7) Updating scheme
Clique 2: Apply the adjustment factor for each row, then renormalize with respect to all values.
53. 7) Updating scheme
Clique 2: Apply the adjustment factor for each row, then renormalize with respect to all values.
[Table: predictive distribution for SORTHR]
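The whole updating pass of slides 49-53, end to end, as a sketch. It reuses the assumed conditionals from the earlier potential-table sketch (the sore-throat conditionals below are likewise made up):

```python
from itertools import product

B = [True, False]
p_flu = p_thrinf = 0.11
marg = {(f, t): (p_flu if f else 1 - p_flu) * (p_thrinf if t else 1 - p_thrinf)
        for f, t in product(B, repeat=2)}
p_fev = {(True, True): 0.9, (True, False): 0.8,    # assumed P(FEVER=yes | FLU, THRINF)
         (False, True): 0.7, (False, False): 0.05}
p_sor = {(True, True): 0.6, (True, False): 0.3,    # assumed P(SORTHR=yes | FLU, THRINF)
         (False, True): 0.9, (False, False): 0.05}

clique1 = {(fe, f, t): marg[f, t] * (p_fev[f, t] if fe else 1 - p_fev[f, t])
           for fe, f, t in product(B, repeat=3)}
clique2 = {(f, t, so): marg[f, t] * (p_sor[f, t] if so else 1 - p_sor[f, t])
           for f, t, so in product(B, repeat=3)}

# Learn FEVER = yes: zero out the FEVER = no entries in Clique 1.
clique1 = {k: (v if k[0] else 0.0) for k, v in clique1.items()}

# New margin on the clique intersection {FLU, THRINF}; adjustment = New / Old.
new_marg = {ft: sum(clique1[(fe,) + ft] for fe in B) for ft in marg}
adjust = {ft: new_marg[ft] / marg[ft] for ft in marg}

# Apply the row adjustment in Clique 2, then renormalize over all cells.
clique2 = {(f, t, so): v * adjust[f, t] for (f, t, so), v in clique2.items()}
total = sum(clique2.values())
clique2 = {k: v / total for k, v in clique2.items()}

# Predictive distribution for SORTHR given FEVER = yes:
print(sum(v for (f, t, so), v in clique2.items() if so))
```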
54. Comments
- Triangulation and clique determination are NP-hard--need heuristics.
- HUGIN has multiple options and tells you the cliques.
- Computation depends on the largest clique size (large potential tables).
- More conditional independence is generally better.
55. Some Favorable & Unfavorable Structures
- Multiple children are good (think IRT).
- Multiple parents are not good. Why?
56. Some Favorable & Unfavorable Structures
- Multiple children are good (think IRT):
- Multiple simple cliques.
- Multiple parents are not good. Why?
- Moralization forces a clique containing all the parents and the child.
57. Key points for measurement models
- Student model (SM) variables are of transcending interest:
- They characterize student knowledge, skill, strategies.
- They cannot be directly observed.
- Observable variables are means of getting evidence about SM variables:
- They characterize salient aspects of performance.
- Observable variables from performances are modeled as conditionally independent across (but not necessarily within) tasks, given SM variables.