Title: Introduction to Probabilistic Graphical Models
1. Introduction to Probabilistic Graphical Models
- Eran Segal
- Weizmann Institute
2. Probabilistic Graphical Models
- Tool for representing complex systems and performing sophisticated reasoning tasks
- Fundamental notion: modularity
  - Complex systems are built by combining simpler parts
- Why have a model?
  - Compact and modular representation of complex systems
  - Ability to execute complex reasoning patterns
  - Make predictions
  - Generalize from particular problems
3. Probabilistic Graphical Models
- Increasingly important in machine learning
- Many classical probabilistic problems in statistics, information theory, pattern recognition, and statistical mechanics are special cases of the formalism
- Graphical models provide a common framework
- Advantage: specialized techniques developed in one field can be transferred between research communities
4. Representation: Graphs
- Intuitive data structure for modeling highly-interacting sets of variables
- Explicit model of modularity
- Data structure that allows for the design of efficient general-purpose algorithms
5. Reasoning: Probability Theory
- Well-understood framework for modeling uncertainty
  - Partial knowledge of the state of the world
  - Noisy observations
  - Phenomena not covered by our model
  - Inherent stochasticity
- Clear semantics
- Can be learned from data
6. A Simple Example
- We want to model whether our neighbor will inform us of the alarm being set off
- The alarm can be set off if
  - There is a burglary
  - There is an earthquake
- Whether our neighbor calls depends on whether the alarm is set off
7. A Simple Example
- Variables: Earthquake (E), Burglary (B), Alarm (A), NeighborCalls (N)
- Full joint distribution:
E B A N Prob.
F F F F 0.01
F F F T 0.04
F F T F 0.05
F F T T 0.01
F T F F 0.02
F T F T 0.07
F T T F 0.2
F T T T 0.1
T F F F 0.01
T F F T 0.07
T F T F 0.13
T F T T 0.04
T T F F 0.06
T T F T 0.05
T T T F 0.1
T T T T 0.05
2^4 - 1 = 15 independent parameters
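Such a table can answer any query directly by summing consistent rows. The sketch below is not part of the original slides; it is a plain-Python illustration (names chosen for the example) that stores the table above as a dict and computes P(Burglary = T | NeighborCalls = T):

```python
# Full joint distribution from the table above: keys are (E, B, A, N) values.
joint = {
    (False, False, False, False): 0.01, (False, False, False, True): 0.04,
    (False, False, True,  False): 0.05, (False, False, True,  True): 0.01,
    (False, True,  False, False): 0.02, (False, True,  False, True): 0.07,
    (False, True,  True,  False): 0.20, (False, True,  True,  True): 0.10,
    (True,  False, False, False): 0.01, (True,  False, False, True): 0.07,
    (True,  False, True,  False): 0.13, (True,  False, True,  True): 0.04,
    (True,  True,  False, False): 0.06, (True,  True,  False, True): 0.05,
    (True,  True,  True,  False): 0.10, (True,  True,  True,  True): 0.05,
}

def prob(event):
    """Sum the joint over all assignments consistent with `event`,
    e.g. event = {'B': True, 'N': True}."""
    idx = {'E': 0, 'B': 1, 'A': 2, 'N': 3}
    return sum(p for a, p in joint.items()
               if all(a[idx[v]] == val for v, val in event.items()))

# P(Burglary = T | NeighborCalls = T) by conditioning: P(B, N) / P(N)
print(prob({'B': True, 'N': True}) / prob({'N': True}))
```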
8. A Simple Example
- The same example, factored as a Bayesian network: Earthquake → Alarm ← Burglary, Alarm → NeighborCalls

  P(E):   E = F: 0.9   E = T: 0.1
  P(B):   B = F: 0.7   B = T: 0.3

  P(A | E, B):
  E  B    A = F   A = T
  F  F    0.99    0.01
  F  T    0.1     0.9
  T  F    0.3     0.7
  T  T    0.01    0.99

  P(N | A):
  A       N = F   N = T
  F       0.9     0.1
  T       0.2     0.8

1 + 1 + 4 + 2 = 8 independent parameters
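The factored form supports the same kind of query while storing only the CPDs. A minimal sketch (not from the original slides), assuming the structure Earthquake → Alarm ← Burglary, Alarm → NeighborCalls described on slide 6:

```python
from itertools import product

# Local CPDs from the slide (True = event occurs).
p_E = {True: 0.1, False: 0.9}
p_B = {True: 0.3, False: 0.7}
p_A = {  # P(A = True | E, B); P(A = False | E, B) is the complement
    (False, False): 0.01, (False, True): 0.9,
    (True,  False): 0.7,  (True,  True): 0.99,
}
p_N = {True: 0.8, False: 0.1}  # P(N = True | A)

def joint(e, b, a, n):
    """P(E=e, B=b, A=a, N=n) = P(e) P(b) P(a|e,b) P(n|a)."""
    pa = p_A[(e, b)] if a else 1 - p_A[(e, b)]
    pn = p_N[a] if n else 1 - p_N[a]
    return p_E[e] * p_B[b] * pa * pn

# Sanity check: the factorized joint sums to 1 while storing only 8 CPD parameters.
print(sum(joint(*x) for x in product([False, True], repeat=4)))

# P(Burglary = T | NeighborCalls = T)
num = sum(joint(e, True, a, True) for e in (False, True) for a in (False, True))
den = sum(joint(e, b, a, True)
          for e in (False, True) for b in (False, True) for a in (False, True))
print(num / den)
```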
9. Example Bayesian Network
- The Alarm network for monitoring intensive care patients
- 37 variables
- 509 parameters (full joint: 2^37)
10. Application: Clustering Users
- Input: TV shows that each user watches
- Output: TV show clusters
- Assumption: shows watched by the same users are similar
- Class 1
- Power rangers
- Animaniacs
- X-men
- Tazmania
- Spider man
- Class 2
- Young and restless
- Bold and the beautiful
- As the world turns
- Price is right
- CBS eve news
- Class 3
- Tonight show
- Conan O'Brien
- NBC nightly news
- Later with Kinnear
- Seinfeld
- Class 4
- 60 minutes
- NBC nightly news
- CBS eve news
- Murder she wrote
- Matlock
- Class 5
- Seinfeld
- Friends
- Mad about you
- ER
- Frasier
11. Application: Recommendation Systems
- Given user preferences, suggest recommendations
- Example: Amazon.com
- Input: movie preferences of many users
- Solution: model correlations between movie features
  - Users that like comedy often like drama
  - Users that like action often do not like cartoons
  - Users that like Robert De Niro films often like Al Pacino films
- Given user preferences, can predict the probability that new movies match preferences
12. Probability Theory
- A probability distribution P over (Ω, S) is a mapping from events in S to real values such that
  - P(α) ≥ 0 for all α ∈ S
  - P(Ω) = 1
  - If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)
- Conditional probability: P(α | β) = P(α ∩ β) / P(β)
- Chain rule: P(α ∩ β) = P(β) · P(α | β)
- Bayes rule: P(α | β) = P(β | α) · P(α) / P(β)
- Conditional independence
13. Random Variables: Notation
- Random variable: a function from Ω to a value
  - Categorical / ordinal / continuous
- Val(X): set of possible values of RV X
- Upper case letters denote RVs (e.g., X, Y, Z)
- Upper case bold letters denote sets of RVs (e.g., X, Y)
- Lower case letters denote RV values (e.g., x, y, z)
- Lower case bold letters denote RV set values (e.g., x)
- Values for a categorical RV with |Val(X)| = k: x1, x2, ..., xk
- Marginal distribution over X: P(X)
- Conditional independence: X is independent of Y given Z if P(X | Y, Z) = P(X | Z)
14. Expectation
- Discrete RVs: E_P[X] = Σ_x x · P(x)
- Continuous RVs: E_P[X] = ∫ x · p(x) dx
- Linearity of expectation: E[X + Y] = E[X] + E[Y]
- Expectation of products (when X ⊥ Y in P): E[X · Y] = E[X] · E[Y]
15. Variance
- Variance of an RV: Var[X] = E[(X - E[X])^2]
- If X and Y are independent: Var[X + Y] = Var[X] + Var[Y]
- Var[aX + b] = a^2 · Var[X]
16. Information Theory
- Entropy: H_P(X) = -Σ_x P(x) · log P(x)
  - We use log base 2 to interpret entropy as bits of information
  - The entropy of X is a lower bound on the average number of bits needed to encode values of X
  - 0 ≤ H_P(X) ≤ log |Val(X)| for any distribution P(X)
- Conditional entropy: H_P(X | Y) = -Σ_{x,y} P(x, y) · log P(x | y)
  - Information only helps: H_P(X | Y) ≤ H_P(X)
- Mutual information: I_P(X ; Y) = H_P(X) - H_P(X | Y)
  - 0 ≤ I_P(X ; Y) ≤ H_P(X)
  - Symmetry: I_P(X ; Y) = I_P(Y ; X)
  - I_P(X ; Y) = 0 iff X and Y are independent
- Chain rule of entropies: H_P(X1, ..., Xn) = H_P(X1) + H_P(X2 | X1) + ... + H_P(Xn | X1, ..., Xn-1)
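To make these quantities concrete, here is a small sketch (not from the slides; plain Python with base-2 logs and a made-up joint table) computing H(X), H(X | Y), and I(X ; Y):

```python
from math import log2

# A made-up joint distribution P(X, Y) over binary X and Y.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

H_X = -sum(p * log2(p) for p in p_x.values())
# Conditional entropy H(X | Y) = -sum_{x,y} P(x, y) log P(x | y)
H_X_given_Y = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items())
# Mutual information I(X ; Y) = H(X) - H(X | Y)
I_XY = H_X - H_X_given_Y

print(H_X, H_X_given_Y, I_XY)  # I(X ; Y) = 0 iff X and Y are independent
```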
17. Representing Joint Distributions
- Random variables X1, ..., Xn
- P is a joint distribution over X1, ..., Xn
- Can we represent P more compactly?
- Key: exploit independence properties
18. Independent Random Variables
- Two variables X and Y are independent if
  - P(X = x | Y = y) = P(X = x) for all values x, y
  - Equivalently, knowing Y does not change predictions of X
- If X and Y are independent then
  - P(X, Y) = P(X | Y) · P(Y) = P(X) · P(Y)
- If X1, ..., Xn are independent then
  - P(X1, ..., Xn) = P(X1) · ... · P(Xn)
  - O(n) parameters
  - All 2^n probabilities are implicitly defined
  - Cannot represent many types of distributions
19. Conditional Independence
- X and Y are conditionally independent given Z if
  - P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all values x, y, z
  - Equivalently, if we know Z, then knowing Y does not change predictions of X
- Notation: Ind(X ; Y | Z) or (X ⊥ Y | Z)
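On a small discrete joint, the definition can be checked directly by comparing P(x | y, z) with P(x | z) for every assignment. The sketch below is illustrative (not from the slides); the joint is constructed so that (X ⊥ Y | Z) holds by design:

```python
from itertools import product

# Illustrative joint in which X and Y are conditionally independent given Z:
# P(x, y, z) = P(z) P(x | z) P(y | z).
p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: 0.9, 1: 0.3}   # P(X = 1 | z)
p_y_given_z = {0: 0.2, 1: 0.7}   # P(Y = 1 | z)

def p(x, y, z):
    px = p_x_given_z[z] if x else 1 - p_x_given_z[z]
    py = p_y_given_z[z] if y else 1 - p_y_given_z[z]
    return p_z[z] * px * py

def cond_indep(tol=1e-9):
    """Check P(x | y, z) == P(x | z) for every assignment."""
    for x, y, z in product((0, 1), repeat=3):
        p_xz = sum(p(x, yy, z) for yy in (0, 1))
        p_zz = sum(p(xx, yy, z) for xx in (0, 1) for yy in (0, 1))
        p_xyz = p(x, y, z)
        p_yz = sum(p(xx, y, z) for xx in (0, 1))
        if abs(p_xyz / p_yz - p_xz / p_zz) > tol:
            return False
    return True

print(cond_indep())  # True: (X ⊥ Y | Z) holds by construction
```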
20. Conditional Parameterization
- S: score on test, Val(S) = {s0, s1}
- I: intelligence, Val(I) = {i0, i1}
- Joint parameterization P(I, S): 3 parameters

  I   S   P(I,S)
  i0  s0  0.665
  i0  s1  0.035
  i1  s0  0.06
  i1  s1  0.24

- Conditional parameterization P(I) and P(S | I): 3 parameters

  P(I):   i0: 0.7   i1: 0.3

  P(S | I):     s0     s1
  I = i0       0.95   0.05
  I = i1       0.2    0.8

- Alternative parameterization: P(S) and P(I | S)
21. Conditional Parameterization
- S: score on test, Val(S) = {s0, s1}
- I: intelligence, Val(I) = {i0, i1}
- G: grade, Val(G) = {g0, g1, g2}
- Assume that G and S are independent given I
22. Naïve Bayes Model
- Class variable C, Val(C) = {c1, ..., ck}
- Evidence variables X1, ..., Xn
- Naïve Bayes assumption: evidence variables are conditionally independent given C
  - P(C, X1, ..., Xn) = P(C) · P(X1 | C) · ... · P(Xn | C)
- Applications in medical diagnosis, text classification
- Used as a classifier
- Problem: double counting of correlated evidence
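A minimal naïve Bayes classifier sketch (plain Python; the parameters are illustrative, not from the slides): each class is scored by P(C) times the product of P(xi | C), then the scores are normalized.

```python
# Illustrative naïve Bayes parameters: binary evidence variables X1, X2, X3.
p_class = {'c1': 0.5, 'c2': 0.5}
# P(Xi = 1 | C) for each class; P(Xi = 0 | C) is the complement.
p_x_given_c = {
    'c1': [0.8, 0.2, 0.6],
    'c2': [0.3, 0.7, 0.5],
}

def posterior(x):
    """P(C | x1, ..., xn) is proportional to P(C) * prod_i P(xi | C)."""
    scores = {}
    for c, prior in p_class.items():
        s = prior
        for xi, theta in zip(x, p_x_given_c[c]):
            s *= theta if xi else 1 - theta
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior([1, 0, 1]))  # posterior over classes given the evidence
```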
23. Bayesian Network (Informal)
- Directed acyclic graph G
  - Nodes represent random variables
  - Edges represent direct influences between random variables
- Local probability models
24. Bayesian Network (Informal)
- Represents a joint distribution
  - Specifies the probability P(X = x)
  - Specifies the conditional probability P(X = x | E = e)
- Allows for reasoning patterns
  - Prediction (e.g., intelligent → high scores)
  - Explanation (e.g., low score → not intelligent)
  - Explaining away (different causes for the same effect interact)
- (Figure: example network over I, S, G)
25. Bayesian Network Structure
- Directed acyclic graph G
  - Nodes X1, ..., Xn represent random variables
- G encodes local Markov assumptions
  - Xi is independent of its non-descendants given its parents
  - Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))
26. Independency Mappings (I-Maps)
- Let P be a distribution over X
- Let I(P) be the set of independencies (X ⊥ Y | Z) that hold in P
- A Bayesian network structure G is an I-map (independency mapping) of P if I(G) ⊆ I(P)
- Example: two distributions over binary I and S

  P1(I,S): i0 s0: 0.25, i0 s1: 0.25, i1 s0: 0.25, i1 s1: 0.25   →   I(P1) = {I ⊥ S}
  P2(I,S): i0 s0: 0.4,  i0 s1: 0.3,  i1 s0: 0.2,  i1 s1: 0.1    →   I(P2) = ∅

  The graph with no edge between I and S has I(G) = {I ⊥ S}: an I-map of P1 but not of P2.
  The graph I → S has I(G) = ∅: an I-map of both P1 and P2.
27. Factorization Theorem
- If G is an I-map of P, then P(X1, ..., Xn) = P(X1 | Pa(X1)) · ... · P(Xn | Pa(Xn))
- Proof
  - WLOG, X1, ..., Xn is an ordering consistent with G
  - By the chain rule: P(X1, ..., Xn) = ∏_i P(Xi | X1, ..., Xi-1)
  - From the ordering: Pa(Xi) ⊆ {X1, ..., Xi-1} ⊆ NonDesc(Xi)
  - Since G is an I-map, (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), so P(Xi | X1, ..., Xi-1) = P(Xi | Pa(Xi))
28. Factorization Implies I-Map
- If P(X1, ..., Xn) = P(X1 | Pa(X1)) · ... · P(Xn | Pa(Xn)), then G is an I-map of P
- Proof
  - Need to show (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), or that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
  - WLOG, X1, ..., Xn is an ordering consistent with G in which all non-descendants of Xi precede Xi
  - Summing the factorization over the descendants of Xi (each of their CPDs sums to 1) gives P(Xi, NonDesc(Xi)) = ∏_{j ≤ i} P(Xj | Pa(Xj)); summing out Xi as well gives P(NonDesc(Xi)) = ∏_{j < i} P(Xj | Pa(Xj))
  - Dividing the two, P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
29. Bayesian Network Definition
- A Bayesian network is a pair (G, P)
  - P factorizes over G
  - P is specified as a set of CPDs associated with G's nodes
- Parameters
  - Joint distribution: 2^n
  - Bayesian network (bounded in-degree k): n · 2^k
30. Bayesian Network Design
- Variable considerations
  - Clarity test: can an omniscient being determine its value?
  - Hidden variables?
  - Irrelevant variables
- Structure considerations
  - Causal order of variables
  - Which independencies (approximately) hold?
- Probability considerations
  - Zero probabilities
  - Orders of magnitude
  - Relative values
31. CPDs
- Thus far we ignored the representation of CPDs
- Now we will cover the range of CPD representations
  - Discrete
  - Continuous
  - Sparse
  - Deterministic
  - Linear
32. Table CPDs
- Entry for each joint assignment of X and Pa(X)
- For each pa_X: Σ_x P(x | pa_X) = 1
- Most general representation
  - Represents every discrete CPD
- Limitations
  - Cannot model continuous RVs
  - Number of parameters exponential in |Pa(X)|
  - Cannot model large in-degree dependencies
  - Ignores structure within the CPD
- Example (I → S):

  P(I):   i0: 0.7   i1: 0.3

  P(S | I):     s0     s1
  I = i0       0.95   0.05
  I = i1       0.2    0.8
33. Structured CPDs
- Key idea: reduce parameters by modeling P(X | Pa(X)) without explicitly modeling all entries of the joint
- Lose expressive power (cannot represent every CPD)
34. Deterministic CPDs
- There is a function f: Val(Pa(X)) → Val(X) such that P(x | pa_X) = 1 if x = f(pa_X), and 0 otherwise
- Examples
  - OR, AND, NAND functions
  - Z = Y + X (continuous variables)
35. Deterministic CPDs
- Replace spurious dependencies with deterministic CPDs
- Need to make sure that the deterministic CPD is compactly stored
- Example: instead of S depending directly on T1 and T2, introduce a deterministic node T (here an OR of T1 and T2) so that S depends only on T

  P(T | T1, T2) (deterministic):
  T1  T2    t0   t1
  t0  t0     1    0
  t0  t1     0    1
  t1  t0     0    1
  t1  t1     0    1

  Original CPD P(S | T1, T2):
  T1  T2    s0    s1
  t0  t0   0.95  0.05
  t0  t1   0.2   0.8
  t1  t0   0.2   0.8
  t1  t1   0.2   0.8

  Reduced CPD P(S | T):
  T     s0    s1
  t0   0.95  0.05
  t1   0.2   0.8
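The saving can be checked directly: with the deterministic node T standing in for OR(T1, T2), the two-row CPD P(S | T) reproduces the four-row CPD P(S | T1, T2). A small sketch (not from the slides):

```python
from itertools import product

# Original four-row CPD P(S = s1 | T1, T2) from the slide.
p_s_given_t1t2 = {(0, 0): 0.05, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.8}

# Deterministic intermediate node T = OR(T1, T2), plus the two-row CPD P(S = s1 | T).
p_s_given_t = {0: 0.05, 1: 0.8}

for t1, t2 in product((0, 1), repeat=2):
    t = int(t1 or t2)                    # deterministic CPD: no free parameters
    assert p_s_given_t[t] == p_s_given_t1t2[(t1, t2)]
print("P(S | T1, T2) is recovered from the deterministic T and P(S | T)")
```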
36. Deterministic CPDs
- Induce additional conditional independencies
- Example: T is any deterministic function of T1, T2
- (Figure: network over T1, T2, T, S1, S2)
37. Deterministic CPDs
- Induce additional conditional independencies
- Example: C is an XOR deterministic function of A, B
- (Figure: network over A, B, C, D, E)
38. Deterministic CPDs
- Induce additional conditional independencies
- Example: T is an OR deterministic function of T1, T2
- (Figure: network over T1, T2, T, S1, S2)
- Context-specific independencies
39. Tree CPDs
- Example: CPD P(D | A, B, C), with A, B, C the parents of D
- Full table representation, 8 parameters:

  A   B   C    d0   d1
  a0  b0  c0  0.2  0.8
  a0  b0  c1  0.2  0.8
  a0  b1  c0  0.2  0.8
  a0  b1  c1  0.2  0.8
  a1  b0  c0  0.9  0.1
  a1  b0  c1  0.7  0.3
  a1  b1  c0  0.4  0.6
  a1  b1  c1  0.4  0.6
40. Context-Specific Independencies
- The same CPD P(D | A, B, C) as a tree: branch on A first; in the context A = a0 a single leaf (0.2, 0.8) is reached, so D does not depend on B or C there
- (Figure: tree with internal nodes over A, B, C and leaf distributions (0.2, 0.8), (0.4, 0.6), (0.7, 0.3), (0.9, 0.1))
- Reasoning by cases implies that Ind(B ; C | A, D)
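A tree CPD can be stored as a nested branching structure instead of a full table. The sketch below is illustrative (not from the slides): one tree consistent with the 8-row table on slide 39, in which the a0 branch collapses to a single leaf.

```python
# One tree-CPD encoding consistent with the table on slide 39:
# branch on A; under a0 the distribution over D ignores B and C entirely
# (context-specific independence); under a1 branch on B and then C.
tree_cpd = ('A', {
    'a0': (0.2, 0.8),                                   # leaf: (P(d0), P(d1))
    'a1': ('B', {
        'b1': (0.4, 0.6),
        'b0': ('C', {'c0': (0.9, 0.1), 'c1': (0.7, 0.3)}),
    }),
})

def lookup(node, assignment):
    """Walk the tree until a leaf distribution over D is reached."""
    while isinstance(node[1], dict):
        var, children = node
        node = children[assignment[var]]
    return node

print(lookup(tree_cpd, {'A': 'a1', 'B': 'b0', 'C': 'c1'}))  # (0.7, 0.3)
```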
41. Continuous Variables
- One solution: discretize
  - Often requires too many value states
  - Loses domain structure
- Other solution: use a continuous function for P(X | Pa(X))
- Can combine continuous and discrete variables, resulting in hybrid networks
- Inference and learning may become more difficult
42. Gaussian Density Functions
- Among the most common continuous representations
- Univariate case: p(x) = (1 / (sqrt(2π) · σ)) · exp(-(x - μ)^2 / (2σ^2))
43. Gaussian Density Functions
- A multivariate Gaussian distribution over X1, ..., Xn has
  - A mean vector μ
  - An n×n positive definite covariance matrix Σ
- Joint density function: p(x) = (2π)^(-n/2) · |Σ|^(-1/2) · exp(-(x - μ)^T Σ^(-1) (x - μ) / 2)
  - μi = E[Xi]
  - Σii = Var[Xi]
  - Σij = Cov[Xi, Xj] = E[Xi·Xj] - E[Xi]·E[Xj] (i ≠ j)
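A small numpy sketch evaluating this density with an illustrative mean vector and positive definite covariance matrix (not from the slides):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian density (2π)^(-n/2) |Σ|^(-1/2) exp(-(x-μ)ᵀ Σ⁻¹ (x-μ) / 2)."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-n / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

# Illustrative parameters: mean vector and a positive definite covariance matrix.
mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])   # Σ_ii = Var[X_i], Σ_ij = Cov[X_i, X_j]

print(gaussian_density(np.array([0.5, 0.5]), mu, sigma))
```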
44. Hybrid Models
- Models of continuous and discrete variables
  - Continuous variables with discrete parents
  - Discrete variables with continuous parents
- Conditional Linear Gaussians (CLG)
  - Y: continuous variable
  - X = {X1, ..., Xn}: continuous parents
  - U = {U1, ..., Um}: discrete parents
  - P(Y | x, u) = N(a_{u,0} + a_{u,1}·x1 + ... + a_{u,n}·xn ; σ_u^2)
- A conditional linear Gaussian Bayesian network is one where
  - Discrete variables have only discrete parents
  - Continuous variables have only CLG CPDs
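A CLG CPD can be sketched as follows: for each assignment u of the discrete parents, Y is Gaussian with a mean linear in the continuous parents and a variance depending only on u. The parameters below are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative CLG CPD P(Y | X1, X2, U) with one binary discrete parent U:
# given U = u, Y ~ N(a0[u] + a[u] . x, sigma2[u]).
clg = {
    0: {'a0': 1.0,  'a': np.array([0.5, -0.2]), 'sigma2': 0.5},
    1: {'a0': -1.0, 'a': np.array([2.0, 0.1]),  'sigma2': 2.0},
}

def sample_y(x, u):
    """Draw Y given continuous parent values x and discrete parent value u."""
    p = clg[u]
    mean = p['a0'] + p['a'] @ x
    return rng.normal(mean, np.sqrt(p['sigma2']))

print(sample_y(np.array([0.3, 1.2]), u=1))
```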
45. Hybrid Models
- Continuous parents for discrete children
  - Threshold models
  - Linear sigmoid
46. Undirected Graphical Models
- Useful when edge directionality cannot be assigned
- Simpler interpretation of structure
- Simpler inference
- Simpler independency structure
- Harder to learn
- We will also see models with combined directed and undirected edges
- Some computations require restriction to discrete variables
47. Undirected Model (Informal)
- Nodes correspond to random variables
- Local factor models are attached to sets of nodes
  - Factor elements are positive
  - Do not have to sum to 1
  - Represent affinities
- Example: variables A, B, C, D with pairwise factors

  A   C   φ1(A,C)
  a0  c0  4
  a0  c1  12
  a1  c0  2
  a1  c1  9

  A   B   φ2(A,B)
  a0  b0  30
  a0  b1  5
  a1  b0  1
  a1  b1  10

  C   D   φ3(C,D)
  c0  d0  30
  c0  d1  5
  c1  d0  1
  c1  d1  10

  B   D   φ4(B,D)
  b0  d0  100
  b0  d1  1
  b1  d0  1
  b1  d1  1000
48. Undirected Model (Informal)
- Represents a joint distribution over A, B, C, D
  - Unnormalized measure: φ1(a,c) · φ2(a,b) · φ3(c,d) · φ4(b,d)
  - Partition function: Z = Σ_{a,b,c,d} φ1(a,c) · φ2(a,b) · φ3(c,d) · φ4(b,d)
  - Probability: P(a,b,c,d) = φ1(a,c) · φ2(a,b) · φ3(c,d) · φ4(b,d) / Z
- As Markov networks represent joint distributions, they can be used for answering queries
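Using the factors from slide 47, the sketch below (not from the original slides) computes the unnormalized measure, the partition function Z, and a marginal query by brute-force enumeration, which is fine for four binary variables:

```python
from itertools import product

# Pairwise factors from slide 47.
phi_AC = {(0, 0): 4,   (0, 1): 12, (1, 0): 2, (1, 1): 9}
phi_AB = {(0, 0): 30,  (0, 1): 5,  (1, 0): 1, (1, 1): 10}
phi_CD = {(0, 0): 30,  (0, 1): 5,  (1, 0): 1, (1, 1): 10}
phi_BD = {(0, 0): 100, (0, 1): 1,  (1, 0): 1, (1, 1): 1000}

def unnormalized(a, b, c, d):
    return phi_AC[(a, c)] * phi_AB[(a, b)] * phi_CD[(c, d)] * phi_BD[(b, d)]

# Partition function: sum of the unnormalized measure over all assignments.
Z = sum(unnormalized(*x) for x in product((0, 1), repeat=4))

def p(a, b, c, d):
    return unnormalized(a, b, c, d) / Z

# Example query: P(A = 1) by marginalization.
print(sum(p(1, b, c, d) for b, c, d in product((0, 1), repeat=3)))
```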
49. Markov Network Structure
- Undirected graph H
  - Nodes X1, ..., Xn represent random variables
- H encodes independence assumptions
  - A path X1, ..., Xk is active if none of the Xi variables along the path are observed
  - X and Y are separated in H given Z if there is no active path between any node x ∈ X and any node y ∈ Y given Z
    - Denoted sep_H(X ; Y | Z)
- Global Markov assumptions: I(H) = {(X ⊥ Y | Z) : sep_H(X ; Y | Z)}
50. Relationship with Bayesian Networks
- Can all independencies encoded by Markov networks be encoded by Bayesian networks?
  - No; example: Ind(A ; B | C, D) and Ind(C ; D | A, B)
- Can all independencies encoded by Bayesian networks be encoded by Markov networks?
  - No; example: immoral v-structures (explaining away)
- Markov networks encode monotonic independencies
  - If sep_H(X ; Y | Z) and Z ⊆ Z', then sep_H(X ; Y | Z')
51. Markov Network Factors
- A factor is a function from value assignments of a set of random variables D to positive real numbers
- The set of variables D is the scope of the factor
- Factors generalize the notion of CPDs
  - Every CPD is a factor (with additional constraints)
52. Markov Network Factors
- Can we represent any joint distribution using only factors defined on edges?
  - No!
- Example: n binary variables
  - Joint distribution has 2^n - 1 independent parameters
  - Markov network with edge factors has only O(n^2) parameters
53. Markov Network Distribution
- A distribution P factorizes over H if it has
  - A set of subsets D1, ..., Dm, where each Di is a complete subgraph (clique) in H
  - Factors φ1(D1), ..., φm(Dm) such that P(X1, ..., Xn) = (1/Z) · φ1(D1) · ... · φm(Dm)
- Z is called the partition function
- P is also called a Gibbs distribution over H
54. Relationship with Bayesian Networks
- Bayesian networks
  - Semantics defined via local Markov assumptions
  - Global independencies induced by d-separation
  - Local and global independencies are equivalent since one implies the other
- Markov networks
  - Semantics defined via a global separation property
  - Can we define the induced local independencies?
  - We show two definitions
  - All three definitions (global and two local) are equivalent only for positive distributions P
55. Local Structure
- Factor graphs still encode complete tables
- Goal: as in Bayesian networks, represent context-specificity
- A feature φ(D) on variables D is an indicator function that equals 1 for one particular assignment y to D (and 0 otherwise)
- A distribution P is a log-linear model over H if it has
  - Features φ1(D1), ..., φk(Dk), where each Di is a subclique in H
  - A set of weights w1, ..., wk such that P(X1, ..., Xn) = (1/Z) · exp(Σ_i wi · φi(Di))
56. Domain Application: Vision
- The image segmentation problem
- Task: partition an image into distinct parts of the scene
- Example: separate water, sky, background
57. Markov Network for Segmentation
- Grid-structured Markov network
  - Random variable Xi corresponds to pixel i
  - Domain is {1, ..., K}; the value represents the region assignment of pixel i
  - Neighboring pixels are connected in the network
- Appearance distribution
  - wik: extent to which pixel i fits region k (e.g., difference from the typical pixel for region k)
  - Introduce the node potential exp(-wik · 1{Xi = k})
- Edge potentials
  - Encode a contiguity preference by the edge potential exp(λ · 1{Xi = Xj}) for λ > 0
58. Markov Network for Segmentation
- Solution: inference
  - Find the most likely assignment to the Xi variables
- (Figure: grid of pixel variables X11, ..., X34; node potentials encode the appearance distribution, edge potentials the contiguity preference)
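To make the construction concrete, the sketch below (not from the slides; plain Python, toy 2×3 grid, illustrative appearance costs) builds node potentials exp(-wik) and agreement edge potentials exp(λ · 1{Xi = Xj}), then finds the most likely assignment by brute force. Real segmentation networks are far too large for enumeration and require approximate inference instead.

```python
from itertools import product

K = 2                      # number of regions
ROWS, COLS = 2, 3          # toy grid of pixels
pixels = [(i, j) for i in range(ROWS) for j in range(COLS)]

# Illustrative appearance costs w[pixel][k]: how poorly pixel fits region k.
w = {(0, 0): [0.1, 2.0], (0, 1): [0.2, 1.5], (0, 2): [1.8, 0.3],
     (1, 0): [0.3, 1.7], (1, 1): [1.0, 1.1], (1, 2): [2.0, 0.2]}
lam = 0.5                  # contiguity strength (λ > 0)

# Right and down neighbors within the grid.
edges = [(p, q) for p in pixels for q in pixels
         if q == (p[0] + 1, p[1]) or q == (p[0], p[1] + 1)]

def score(assign):
    """Unnormalized log-measure: node potentials plus agreement edge potentials."""
    s = sum(-w[p][assign[p]] for p in pixels)                   # exp(-w_ik) terms
    s += sum(lam for p, q in edges if assign[p] == assign[q])   # exp(λ·1{Xi = Xj}) terms
    return s

best = max(product(range(K), repeat=len(pixels)),
           key=lambda labels: score(dict(zip(pixels, labels))))
print(dict(zip(pixels, best)))   # most likely region per pixel
```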