Transcript and Presenter's Notes

Title: Introduction to Probabilistic Graphical Models


1
Introduction to Probabilistic Graphical Models
  • Eran Segal
  • Weizmann Institute

2
Probabilistic Graphical Models
  • Tool for representing complex systems and
    performing sophisticated reasoning tasks
  • Fundamental notion: Modularity
  • Complex systems are built by combining simpler
    parts
  • Why have a model?
  • Compact and modular representation of complex
    systems
  • Ability to execute complex reasoning patterns
  • Make predictions
  • Generalize from particular problems

3
Probabilistic Graphical Models
  • Increasingly important in Machine Learning
  • Many classical probabilistic problems in
    statistics, information theory, pattern
    recognition, and statistical mechanics are
    special cases of the formalism
  • Graphical models provide a common framework
  • Advantage: specialized techniques developed in
    one field can be transferred between research
    communities

4
Representation: Graphs
  • Intuitive data structure for modeling
    highly-interacting sets of variables
  • Explicit model for modularity
  • Data structure that allows for design of
    efficient general-purpose algorithms

5
Reasoning: Probability Theory
  • Well understood framework for modeling
    uncertainty
  • Partial knowledge of the state of the world
  • Noisy observations
  • Phenomenon not covered by our model
  • Inherent stochasticity
  • Clear semantics
  • Can be learned from data

6
A Simple Example
  • We want to model whether our neighbor will inform
    us of the alarm being set off
  • The alarm can go off if
  • There is a burglary
  • There is an earthquake
  • Whether our neighbor calls depends on whether the
    alarm is set off

7
A Simple Example
  • Variables
  • Earthquake (E), Burglary (B), Alarm (A),
    NeighborCalls (N)

E B A N Prob.
F F F F 0.01
F F F T 0.04
F F T F 0.05
F F T T 0.01
F T F F 0.02
F T F T 0.07
F T T F 0.2
F T T T 0.1
T F F F 0.01
T F F T 0.07
T F T F 0.13
T F T T 0.04
T T F F 0.06
T T F T 0.05
T T T F 0.1
T T T T 0.05
2⁴ − 1 = 15 independent parameters
8
A Simple Example
Network structure: Earthquake → Alarm ← Burglary, Alarm → NeighborCalls

P(E)
E=F E=T
0.9 0.1

P(B)
B=F B=T
0.7 0.3

P(A | E, B)
E B A=F A=T
F F 0.99 0.01
F T 0.1 0.9
T F 0.3 0.7
T T 0.01 0.99

P(N | A)
A N=F N=T
F 0.9 0.1
T 0.2 0.8

8 independent parameters (1 + 1 + 4 + 2)
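To make the saving concrete, here is a minimal Python sketch (not from the original slides; names are illustrative) that encodes these four CPDs and evaluates the factored joint P(E, B, A, N) = P(E) P(B) P(A | E, B) P(N | A):

```python
from itertools import product

def bern(p_true, value):
    """Probability of a binary value, given P(value = True)."""
    return p_true if value else 1.0 - p_true

P_E_true = 0.1                                        # P(Earthquake = T)
P_B_true = 0.3                                        # P(Burglary = T)
P_A_true = {(False, False): 0.01, (False, True): 0.9,
            (True, False): 0.7,   (True, True): 0.99}  # P(Alarm = T | E, B)
P_N_true = {False: 0.1, True: 0.8}                     # P(NeighborCalls = T | A)

def joint(e, b, a, n):
    """P(E=e, B=b, A=a, N=n) = P(e) P(b) P(a | e, b) P(n | a)."""
    return (bern(P_E_true, e) * bern(P_B_true, b) *
            bern(P_A_true[(e, b)], a) * bern(P_N_true[a], n))

# The 8 CPD parameters implicitly define all 16 joint entries, which sum to 1.
assert abs(sum(joint(*v) for v in product([False, True], repeat=4)) - 1.0) < 1e-9
print(joint(e=False, b=True, a=True, n=True))
```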
9
Example Bayesian Network
  • The Alarm network for monitoring intensive care
    patients
  • 509 parameters (full joint: 2³⁷)
  • 37 variables

10
Application: Clustering Users
  • Input: TV shows that each user watches
  • Output: TV show clusters
  • Assumption: shows watched by the same users are
    similar
  • Class 1
  • Power rangers
  • Animaniacs
  • X-men
  • Tazmania
  • Spider man
  • Class 2
  • Young and restless
  • Bold and the beautiful
  • As the world turns
  • Price is right
  • CBS eve news
  • Class 3
  • Tonight show
  • Conan O'Brien
  • NBC nightly news
  • Later with Kinnear
  • Seinfeld
  • Class 4
  • 60 minutes
  • NBC nightly news
  • CBS eve news
  • Murder she wrote
  • Matlock
  • Class 5
  • Seinfeld
  • Friends
  • Mad about you
  • ER
  • Frasier

11
Application: Recommendation Systems
  • Given user preferences, suggest recommendations
  • Example: Amazon.com
  • Input: movie preferences of many users
  • Solution: model correlations between movie
    features
  • Users that like comedy often also like drama
  • Users that like action often do not like
    cartoons
  • Users that like Robert De Niro films often like Al
    Pacino films
  • Given user preferences, can predict probability
    that new movies match preferences

12
Probability Theory
  • A probability distribution P over (Ω, S) is a
    mapping from events in S such that
  • P(α) ≥ 0 for all α ∈ S
  • P(Ω) = 1
  • If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)
  • Conditional Probability
  • Chain Rule
  • Bayes Rule
  • Conditional Independence
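The formulas for these last four notions appeared only as images in the original slides; for reference, the standard statements are:

```latex
\begin{aligned}
\text{Conditional probability:}\quad & P(\alpha \mid \beta) = \frac{P(\alpha \cap \beta)}{P(\beta)} \\
\text{Chain rule:}\quad & P(\alpha_1 \cap \cdots \cap \alpha_k)
  = P(\alpha_1)\, P(\alpha_2 \mid \alpha_1) \cdots P(\alpha_k \mid \alpha_1,\ldots,\alpha_{k-1}) \\
\text{Bayes rule:}\quad & P(\alpha \mid \beta) = \frac{P(\beta \mid \alpha)\, P(\alpha)}{P(\beta)} \\
\text{Conditional independence:}\quad & (\alpha \perp \beta \mid \gamma)
  \iff P(\alpha \mid \beta, \gamma) = P(\alpha \mid \gamma)
\end{aligned}
```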

13
Random Variables and Notation
  • Random variable: a function from Ω to a value
  • Categorical / Ordinal / Continuous
  • Val(X): set of possible values of RV X
  • Upper case letters denote RVs (e.g., X, Y, Z)
  • Upper case bold letters denote sets of RVs (e.g.,
    X, Y)
  • Lower case letters denote RV values (e.g., x, y,
    z)
  • Lower case bold letters denote RV set values
    (e.g., x)
  • Values for categorical RVs with |Val(X)| = k:
    x1, x2, ..., xk
  • Marginal distribution over X: P(X)
  • Conditional independence: X is independent of Y
    given Z if P(X | Y, Z) = P(X | Z)

14
Expectation
  • Discrete RVs
  • Continuous RVs
  • Linearity of expectation
  • Expectation of products (when X ⊥ Y in P)
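The defining formulas (images in the original slides) are, in their standard form:

```latex
\begin{aligned}
\mathbf{E}_P[X] &= \sum_{x} x\, P(x) \ \text{(discrete)}, \qquad
\mathbf{E}_p[X] = \int x\, p(x)\, dx \ \text{(continuous)} \\
\mathbf{E}[aX + bY] &= a\,\mathbf{E}[X] + b\,\mathbf{E}[Y], \qquad
\mathbf{E}[X\,Y] = \mathbf{E}[X]\,\mathbf{E}[Y] \ \text{when } (X \perp Y) \text{ in } P
\end{aligned}
```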

15
Variance
  • Variance of RV: Var[X] = E[(X − E[X])²] = E[X²] − (E[X])²
  • If X and Y are independent: Var[X + Y] = Var[X] + Var[Y]
  • Var[aX + b] = a²·Var[X]

16
Information Theory
  • Entropy: H_P(X) = −Σ_x P(x) log P(x)
  • We use log base 2 to interpret entropy as bits of
    information
  • Entropy of X is a lower bound on the average number
    of bits needed to encode values of X
  • 0 ≤ H_P(X) ≤ log |Val(X)| for any distribution P(X)
  • Conditional entropy: H_P(X | Y) = −Σ_{x,y} P(x, y) log P(x | y)
  • Information only helps: H_P(X | Y) ≤ H_P(X)
  • Mutual information: I_P(X; Y) = H_P(X) − H_P(X | Y)
  • 0 ≤ I_P(X; Y) ≤ H_P(X)
  • Symmetry: I_P(X; Y) = I_P(Y; X)
  • I_P(X; Y) = 0 iff X and Y are independent
  • Chain rule of entropies: H_P(X, Y) = H_P(X) + H_P(Y | X)
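As a concrete illustration (not part of the original slides), this short Python sketch computes entropy and mutual information in bits from a joint distribution given as a table; the example joint is the P(I, S) table used later in the conditional-parameterization slide:

```python
import math

def entropy(dist):
    """H(X) in bits for a dict mapping values to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return entropy(px) + entropy(py) - entropy(joint)

joint_IS = {("i0", "s0"): 0.665, ("i0", "s1"): 0.035,
            ("i1", "s0"): 0.06,  ("i1", "s1"): 0.24}
print(mutual_information(joint_IS))   # > 0, so I and S are dependent
```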

17
Representing Joint Distributions
  • Random variables X1,...,Xn
  • P is a joint distribution over X1,...,Xn
  • Can we represent P more compactly?
  • Key: exploit independence properties

18
Independent Random Variables
  • Two variables X and Y are independent if
  • P(X=x | Y=y) = P(X=x) for all values x, y
  • Equivalently, knowing Y does not change
    predictions of X
  • If X and Y are independent then
  • P(X, Y) = P(X | Y) P(Y) = P(X) P(Y)
  • If X1,...,Xn are independent then
  • P(X1,...,Xn) = P(X1) ··· P(Xn)
  • O(n) parameters
  • All 2ⁿ probabilities are implicitly defined
  • Cannot represent many types of distributions

19
Conditional Independence
  • X and Y are conditionally independent given Z if
  • P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all values
    x, y, z
  • Equivalently, if we know Z, then knowing Y does
    not change predictions of X
  • Notation: Ind(X; Y | Z) or (X ⊥ Y | Z)

20
Conditional Parameterization
  • S: score on test, Val(S) = {s0, s1}
  • I: intelligence, Val(I) = {i0, i1}

Joint parameterization P(I, S):
I S P(I,S)
i0 s0 0.665
i0 s1 0.035
i1 s0 0.06
i1 s1 0.24
3 independent parameters

Conditional parameterization P(I) and P(S | I):
P(I)
i0 i1
0.7 0.3

P(S | I)
I s0 s1
i0 0.95 0.05
i1 0.2 0.8
3 independent parameters (1 + 2)

Alternative parameterization: P(S) and P(I | S)
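A quick Python check (illustrative, not from the slides) confirming that the conditional parameterization reproduces the joint table above:

```python
P_I = {"i0": 0.7, "i1": 0.3}
P_S_given_I = {("i0", "s0"): 0.95, ("i0", "s1"): 0.05,
               ("i1", "s0"): 0.2,  ("i1", "s1"): 0.8}

# Rebuild the joint P(I, S) = P(I) * P(S | I) and compare with the table.
joint_table = {("i0", "s0"): 0.665, ("i0", "s1"): 0.035,
               ("i1", "s0"): 0.06,  ("i1", "s1"): 0.24}
for (i, s), p in joint_table.items():
    assert abs(P_I[i] * P_S_given_I[(i, s)] - p) < 1e-9
print("P(I) * P(S | I) matches P(I, S)")
```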
21
Conditional Parameterization
  • S: score on test, Val(S) = {s0, s1}
  • I: intelligence, Val(I) = {i0, i1}
  • G: grade, Val(G) = {g0, g1, g2}
  • Assume that G and S are independent given I

22
Naïve Bayes Model
  • Class variable C, Val(C) = {c1,...,ck}
  • Evidence variables X1,...,Xn
  • Naïve Bayes assumption: evidence variables are
    conditionally independent given C
  • Applications in medical diagnosis, text
    classification
  • Used as a classifier (see the sketch below)
  • Problem: double-counting of correlated evidence
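A minimal sketch of how the model is used as a classifier, assuming binary evidence variables; the probability tables and the two-class "flu/cold" example are hypothetical, not from the slides. Each class is scored by P(c) Πi P(xi | c) and the scores are normalized:

```python
def naive_bayes_posterior(priors, likelihoods, evidence):
    """P(C | x1..xn) under the naive Bayes assumption.

    priors:       dict class -> P(c)
    likelihoods:  dict (class, variable) -> P(X = True | c)
    evidence:     dict variable -> observed boolean value
    """
    scores = {}
    for c, p_c in priors.items():
        score = p_c
        for var, value in evidence.items():
            p_true = likelihoods[(c, var)]
            score *= p_true if value else 1.0 - p_true
        scores[c] = score
    z = sum(scores.values())                  # normalize over classes
    return {c: s / z for c, s in scores.items()}

# Hypothetical toy example: two classes, two symptoms.
priors = {"flu": 0.3, "cold": 0.7}
likelihoods = {("flu", "fever"): 0.9, ("flu", "cough"): 0.8,
               ("cold", "fever"): 0.2, ("cold", "cough"): 0.6}
print(naive_bayes_posterior(priors, likelihoods, {"fever": True, "cough": True}))
```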

23
Bayesian Network (Informal)
  • Directed acyclic graph G
  • Nodes represent random variables
  • Edges represent direct influences between random
    variables
  • Local probability models

24
Bayesian Network (Informal)
  • Represents a joint distribution
  • Specifies the probability P(X = x)
  • Specifies the conditional probability P(X = x | E = e)
  • Allows for reasoning patterns
  • Prediction (e.g., intelligent → high scores)
  • Explanation (e.g., low score → not intelligent)
  • Explaining away (different causes for the same
    effect interact)

[Figure: example network over I (Intelligence), S (Score), G (Grade)]
25
Bayesian Network Structure
  • Directed acyclic graph G
  • Nodes X1,...,Xn represent random variables
  • G encodes the local Markov assumptions
  • Xi is independent of its non-descendants given
    its parents
  • Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))

26
Independency Mappings (I-Maps)
  • Let P be a distribution over X
  • Let I(P) be the set of independencies (X ⊥ Y | Z) that
    hold in P
  • A Bayesian network structure G is an I-map
    (independency mapping) of P if I(G) ⊆ I(P)

Example 1: I and S independent
I S P(I,S)
i0 s0 0.25
i0 s1 0.25
i1 s0 0.25
i1 s1 0.25
I(P) = {I ⊥ S}

Example 2: I and S dependent
I S P(I,S)
i0 s0 0.4
i0 s1 0.3
i1 s0 0.2
i1 s1 0.1
I(P) = ∅

Candidate structures: the empty graph over I and S has
I(G) = {I ⊥ S}, so it is an I-map of Example 1 only; the
graph I → S has I(G) = ∅, so it is an I-map of both.
27
Factorization Theorem
  • If G is an I-Map of P, then P factorizes as
    P(X1,...,Xn) = ∏i P(Xi | Pa(Xi))
  • Proof
  • wlog. X1,...,Xn is an ordering consistent with G
  • By the chain rule
  • From the assumption on the ordering
  • Since G is an I-Map, (Xi ⊥ NonDesc(Xi) |
    Pa(Xi)) ∈ I(P)
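Spelled out (the original proof formulas were images), the standard argument behind these bullets is:

```latex
\begin{aligned}
P(X_1,\ldots,X_n) &= \prod_{i=1}^{n} P(X_i \mid X_1,\ldots,X_{i-1})
  && \text{(chain rule)} \\
\{X_1,\ldots,X_{i-1}\} &\subseteq \mathrm{NonDesc}(X_i)
  && \text{(ordering consistent with } G\text{)} \\
P(X_i \mid X_1,\ldots,X_{i-1}) &= P(X_i \mid \mathrm{Pa}(X_i))
  && \text{(since } (X_i \perp \mathrm{NonDesc}(X_i) \mid \mathrm{Pa}(X_i)) \in I(P)\text{)} \\
\Rightarrow\quad P(X_1,\ldots,X_n) &= \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))
\end{aligned}
```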

28
Factorization Implies I-Map
  • If P factorizes according to G, i.e.
    P(X1,...,Xn) = ∏i P(Xi | Pa(Xi)), then G is an
    I-Map of P
  • Proof
  • Need to show (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), or
    that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
  • wlog. X1,...,Xn is an ordering consistent with G

29
Bayesian Network Definition
  • A Bayesian network is a pair (G, P) where
  • P factorizes over G
  • P is specified as a set of CPDs associated with G's
    nodes
  • Parameters
  • Joint distribution: 2ⁿ
  • Bayesian network (bounded in-degree k): n·2ᵏ

30
Bayesian Network Design
  • Variable considerations
  • Clarity test: can an omniscient being determine
    its value?
  • Hidden variables?
  • Irrelevant variables
  • Structure considerations
  • Causal order of variables
  • Which independencies (approximately) hold?
  • Probability considerations
  • Zero probabilities
  • Orders of magnitude
  • Relative values

31
CPDs
  • Thus far we have ignored the representation of CPDs
  • Now we will cover the range of CPD
    representations
  • Discrete
  • Continuous
  • Sparse
  • Deterministic
  • Linear

32
Table CPDs
  • Entry for each joint assignment of X and Pa(X)
  • For each assignment pa of Pa(X): Σx P(x | pa) = 1
  • Most general representation
  • Represents every discrete CPD
  • Limitations
  • Cannot model continuous RVs
  • Number of parameters exponential in |Pa(X)|
  • Cannot model large in-degree dependencies
  • Ignores structure within the CPD

[Figure: network I → S with CPDs P(I) and P(S | I)]

P(I)
i0 i1
0.7 0.3

P(S | I)
I s0 s1
i0 0.95 0.05
i1 0.2 0.8
33
Structured CPDs
  • Key idea: reduce the number of parameters by modeling
    P(X | Pa(X)) without explicitly representing every
    entry of the CPD table
  • Lose expressive power (cannot represent every
    CPD)

34
Deterministic CPDs
  • There is a function f: Val(Pa(X)) → Val(X) such
    that P(x | pa) = 1 if x = f(pa), and 0 otherwise
  • Examples
  • OR, AND, NAND functions
  • Z is a deterministic function of Y and X
    (continuous variables)

35
Deterministic CPDs
  • Replace spurious dependencies with deterministic
    CPDs
  • Need to make sure that deterministic CPD is
    compactly stored

Before: S depends directly on T1 and T2
P(S | T1, T2)
T1 T2 s0 s1
t0 t0 0.95 0.05
t0 t1 0.2 0.8
t1 t0 0.2 0.8
t1 t1 0.2 0.8

After: introduce T as a deterministic OR of T1 and T2,
and let S depend only on T
P(T | T1, T2)
T1 T2 t0 t1
t0 t0 1 0
t0 t1 0 1
t1 t0 0 1
t1 t1 0 1

P(S | T)
T s0 s1
t0 0.95 0.05
t1 0.2 0.8
36
Deterministic CPDs
  • Induce additional conditional independencies
  • Example: T is any deterministic function of T1, T2
[Figure: network over T1, T2, T, S1, S2; T is a deterministic function of T1, T2]
37
Deterministic CPDs
  • Induce additional conditional independencies
  • Example: C is an XOR deterministic function of A, B

[Figure: network over A, B, C, D, E; C = XOR(A, B)]
38
Deterministic CPDs
  • Induce additional conditional independencies
  • Example: T is an OR deterministic function of T1, T2

[Figure: network over T1, T2, T, S1, S2; T = OR(T1, T2)]
Context-specific independencies
39
Tree CPDs
[Figure: network with parents A, B, C of child D]

P(D | A, B, C) as a full table:
A B C d0 d1
a0 b0 c0 0.2 0.8
a0 b0 c1 0.2 0.8
a0 b1 c0 0.2 0.8
a0 b1 c1 0.2 0.8
a1 b0 c0 0.9 0.1
a1 b0 c1 0.7 0.3
a1 b1 c0 0.4 0.6
a1 b1 c1 0.4 0.6
8 parameters
40
Context Specific Independencies
[Figure: network with parents A, B, C of child D, and the
 tree CPD for D: split on A; a0 → (0.2, 0.8); a1 → split
 on B; b1 → (0.4, 0.6); b0 → split on C; c0 → (0.9, 0.1),
 c1 → (0.7, 0.3)]
Reasoning by cases implies that Ind(B; C | A, D)
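To make the parameter saving concrete, here is an illustrative Python sketch (not from the slides) of the tree CPD above, which stores only 4 leaf distributions instead of the 8 rows of the full table:

```python
# Tree CPD for P(D | A, B, C): internal nodes test one parent,
# leaves hold a distribution (P(d0), P(d1)).
def tree_cpd_D(a, b, c):
    if a == "a0":
        return (0.2, 0.8)            # D ignores B and C when A = a0
    if b == "b1":
        return (0.4, 0.6)            # under a1, b1: D ignores C
    return (0.9, 0.1) if c == "c0" else (0.7, 0.3)

# The tree reproduces every row of the full 8-row table.
full_table = {
    ("a0", "b0", "c0"): (0.2, 0.8), ("a0", "b0", "c1"): (0.2, 0.8),
    ("a0", "b1", "c0"): (0.2, 0.8), ("a0", "b1", "c1"): (0.2, 0.8),
    ("a1", "b0", "c0"): (0.9, 0.1), ("a1", "b0", "c1"): (0.7, 0.3),
    ("a1", "b1", "c0"): (0.4, 0.6), ("a1", "b1", "c1"): (0.4, 0.6),
}
assert all(tree_cpd_D(*k) == v for k, v in full_table.items())
```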
41
Continuous Variables
  • One solution: discretize
  • Often requires too many value states
  • Loses domain structure
  • Other solution: use a continuous function for
    P(X | Pa(X))
  • Can combine continuous and discrete variables,
    resulting in hybrid networks
  • Inference and learning may become more difficult

42
Gaussian Density Functions
  • Among the most common continuous representations
  • Univariate case

43
Gaussian Density Functions
  • A multivariate Gaussian distribution over
    X1,...,Xn has
  • Mean vector μ
  • n×n positive definite covariance matrix Σ
  • Joint density function
  • μi = E[Xi]
  • Σii = Var[Xi]
  • Σij = Cov[Xi, Xj] = E[XiXj] − E[Xi]E[Xj]  (i ≠ j)
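The density formulas themselves were images in the original deck; the standard forms are:

```latex
\begin{aligned}
p(x) &= \frac{1}{\sqrt{2\pi}\,\sigma}\,
        \exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)
        && \text{(univariate, } X \sim \mathcal{N}(\mu, \sigma^2)\text{)} \\
p(\mathbf{x}) &= \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\,
        \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}
        \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big)
        && \text{(multivariate, } \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)\text{)}
\end{aligned}
```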

44
Hybrid Models
  • Models with both continuous and discrete variables
  • Continuous variables with discrete parents
  • Discrete variables with continuous parents
  • Conditional Linear Gaussians (CLG)
  • Y: continuous variable
  • X = {X1,...,Xn}: continuous parents
  • U = {U1,...,Um}: discrete parents
  • A conditional linear Gaussian (CLG) Bayesian network
    is one where
  • Discrete variables have only discrete parents
  • Continuous variables have only CLG CPDs
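The CLG CPD formula was also an image; its standard form assigns, to each assignment u of the discrete parents, a Gaussian whose mean is linear in the continuous parents:

```latex
P(Y \mid \mathbf{U}=\mathbf{u}, \mathbf{X}=\mathbf{x})
  \;=\; \mathcal{N}\!\Big(a_{\mathbf{u},0} + \sum_{i=1}^{n} a_{\mathbf{u},i}\, x_i \,;\; \sigma_{\mathbf{u}}^{2}\Big)
```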

45
Hybrid Models
  • Continuous parents for discrete children
  • Threshold models
  • Linear sigmoid
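As an example of the second option, the standard linear sigmoid CPD for a binary child Y with continuous parents X1,...,Xk (the slide's formula was an image) is:

```latex
P(y^1 \mid x_1,\ldots,x_k) \;=\; \operatorname{sigmoid}\Big(w_0 + \sum_{i=1}^{k} w_i\, x_i\Big),
\qquad \operatorname{sigmoid}(z) = \frac{e^{z}}{1+e^{z}}
```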

46
Undirected Graphical Models
  • Useful when edge directionality cannot be
    assigned
  • Simpler interpretation of structure
  • Simpler inference
  • Simpler independency structure
  • Harder to learn
  • We will also see models with combined directed
    and undirected edges
  • Some computations require restriction to discrete
    variables

47
Undirected Model (Informal)
  • Nodes correspond to random variables
  • Local factor models are attached to sets of nodes
  • Factor elements are positive
  • Do not have to sum to 1
  • Represent affinities

[Figure: Markov network over A, B, C, D with edges (A,B), (A,C), (B,D), (C,D)]

π1[A, C]
a0 c0 4
a0 c1 12
a1 c0 2
a1 c1 9

π2[A, B]
a0 b0 30
a0 b1 5
a1 b0 1
a1 b1 10

π3[C, D]
c0 d0 30
c0 d1 5
c1 d0 1
c1 d1 10

π4[B, D]
b0 d0 100
b0 d1 1
b1 d0 1
b1 d1 1000
48
Undirected Model (Informal)
  • Represents joint distribution
  • Unnormalized factor
  • Partition function
  • Probability
  • As Markov networks represent joint distributions,
    they can be used for answering queries

[Figure: the same Markov network over A, B, C, D]
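Here is an illustrative Python sketch (not from the slides) that builds the unnormalized measure, the partition function Z, and probabilities from the four factors on the previous slide:

```python
from itertools import product

# Pairwise factors of the four-variable network (numbers from the slide).
phi_AC = {("a0", "c0"): 4,   ("a0", "c1"): 12, ("a1", "c0"): 2, ("a1", "c1"): 9}
phi_AB = {("a0", "b0"): 30,  ("a0", "b1"): 5,  ("a1", "b0"): 1, ("a1", "b1"): 10}
phi_CD = {("c0", "d0"): 30,  ("c0", "d1"): 5,  ("c1", "d0"): 1, ("c1", "d1"): 10}
phi_BD = {("b0", "d0"): 100, ("b0", "d1"): 1,  ("b1", "d0"): 1, ("b1", "d1"): 1000}

def unnormalized(a, b, c, d):
    """Unnormalized measure: product of the four factor entries."""
    return phi_AC[(a, c)] * phi_AB[(a, b)] * phi_CD[(c, d)] * phi_BD[(b, d)]

assignments = list(product(("a0", "a1"), ("b0", "b1"), ("c0", "c1"), ("d0", "d1")))
Z = sum(unnormalized(*x) for x in assignments)   # partition function

def prob(a, b, c, d):
    """P(a, b, c, d) of the Gibbs distribution defined by the factors."""
    return unnormalized(a, b, c, d) / Z

# Example query: marginal P(B = b1) by summing the joint.
p_b1 = sum(prob(*x) for x in assignments if x[1] == "b1")
print(Z, p_b1)
```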
49
Markov Network Structure
  • Undirected graph H
  • Nodes X1,...,Xn represent random variables
  • H encodes independence assumptions
  • A path X1, ..., Xk is active if none of the Xi
    variables along the path are observed
  • X and Y are separated in H given Z if there is no
    active path between any node x ∈ X and any node y ∈ Y
    given Z
  • Denoted sepH(X; Y | Z)

Global Markov assumptions: I(H) = {(X ⊥ Y | Z) :
sepH(X; Y | Z)}
50
Relationship with Bayesian Networks
  • Can all independencies encoded by Markov networks
    be encoded by Bayesian networks?
  • No; e.g., Ind(A; B | C, D) and Ind(C; D | A, B)
    (the four-variable cycle)
  • Can all independencies encoded by Bayesian
    networks be encoded by Markov networks?
  • No; v-structures (explaining away) have no
    undirected counterpart
  • Markov networks encode monotonic independencies
  • If sepH(X; Y | Z) and Z ⊆ Z', then sepH(X; Y | Z')

51
Markov Network Factors
  • A factor is a function from value assignments of
    a set of random variables D to positive real
    numbers
  • The set of variables D is the scope of the factor
  • Factors generalize the notion of CPDs
  • Every CPD is a factor (with additional
    constraints)

52
Markov Network Factors
  • Can we represent any joint distribution by using
    only factors that are defined on edges?
  • No!
  • Example: n binary variables
  • Joint distribution has 2ⁿ − 1 independent
    parameters
  • Markov network with edge factors has only O(n²)
    parameters

53
Markov Network Distribution
  • A distribution P factorizes over H if it has
  • A set of subsets D1,...,Dm where each Di is a
    complete subgraph in H
  • Factors π1[D1],...,πm[Dm] such that
    P(X1,...,Xn) = (1/Z) π1[D1] · ... · πm[Dm]
  • where Z = Σ over all assignments of π1[D1] · ... · πm[Dm]
  • Z is called the partition function
  • P is also called a Gibbs distribution over H
54
Relationship with Bayesian Network
  • Bayesian Networks
  • Semantics defined via local Markov assumptions
  • Global independencies induced by d-separation
  • Local and global independencies equivalent since
    one implies the other
  • Markov Networks
  • Semantics defined via global separation property
  • Can we define the induced local independencies?
  • We show two definitions
  • All three definitions (global and two local) are
    equivalent only for positive distributions P

55
Local Structure
  • Factor graphs still encode complete tables
  • Goal: as in Bayesian networks, represent
    context-specificity
  • A feature f[D] on variables D is an indicator
    function: for some y ∈ Val(D), f[D](d) = 1 if d = y
    and 0 otherwise
  • A distribution P is a log-linear model over H if
    it has
  • Features f1[D1],...,fk[Dk] where each Di is a
    subclique in H
  • A set of weights w1,...,wk such that
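The defining equation was an image in the original; its standard form (the sign convention on the weights varies between texts) is:

```latex
P(X_1,\ldots,X_n) \;=\; \frac{1}{Z}\,
  \exp\!\Big(-\sum_{i=1}^{k} w_i\, f_i[\mathbf{D}_i]\Big)
```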

56
Domain Application: Vision
  • The image segmentation problem
  • Task: partition an image into distinct parts of
    the scene
  • Example: separate water, sky, and background

57
Markov Network for Segmentation
  • Grid-structured Markov network
  • Random variable Xi corresponds to pixel i
  • Domain is {1,...,K}
  • Value represents the region assigned to pixel i
  • Neighboring pixels are connected in the network
  • Appearance distribution
  • wik: extent to which pixel i fits region k
    (e.g., difference from a typical pixel of region k)
  • Introduce node potential exp(−wik · 1{Xi = k})
  • Edge potentials
  • Encode a contiguity preference via the edge
    potential exp(λ · 1{Xi = Xj}) for λ > 0
    (see the sketch below)
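A small illustrative sketch of how these node and edge potentials would be computed; the variable names and the two-pixel example are hypothetical and not taken from the slides:

```python
import math

def node_potential(w_ik):
    """exp(-w_ik): small w_ik (good fit) gives high affinity for region k."""
    return math.exp(-w_ik)

def edge_potential(x_i, x_j, lam=1.0):
    """exp(lambda * 1{X_i = X_j}) with lambda > 0 rewards neighbouring pixels
    assigned to the same region (contiguity preference)."""
    return math.exp(lam) if x_i == x_j else 1.0

# Hypothetical example: two neighbouring pixels, K = 2 regions,
# w[i][k] = how badly pixel i fits region k.
w = [[0.2, 1.5],    # pixel 0 fits region 0 well
     [1.4, 0.3]]    # pixel 1 fits region 1 well

def score(x0, x1):
    """Unnormalized score of a joint assignment (x0, x1)."""
    return node_potential(w[0][x0]) * node_potential(w[1][x1]) * edge_potential(x0, x1)

best = max(((x0, x1) for x0 in range(2) for x1 in range(2)), key=lambda x: score(*x))
print(best, score(*best))
```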

58
Markov Network for Segmentation
  • Solution: inference
  • Find the most likely assignment to the Xi variables

[Figure: 3×4 grid Markov network over pixel variables
 X11,...,X34; node potentials encode the appearance
 distribution, edge potentials the contiguity preference]