Title: Graphical Models Learning
1Graphical Models - Learning -
Advanced I WS 06/07
Parameter Estimation
Wolfram Burgard, Luc De Raedt, Kristian Kersting, Bernhard Nebel
Albert-Ludwigs University Freiburg, Germany
Based on:
- J. A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, TR-97-021, U.C. Berkeley, April 1998
- G. J. McLachlan, T. Krishnan, The EM Algorithm and Extensions, John Wiley & Sons, Inc., 1997
- D. Koller, course CS-228 handouts, Stanford University, 2001
- N. Friedman and D. Koller, NIPS 99
2Outline
- Introduction
- Reminder: Probability theory
- Basics of Bayesian Networks
- Modeling Bayesian networks
- Inference (VE, Junction tree)
- Excursus: Markov Networks
- Learning Bayesian networks
- Relational Models
3What is Learning?
- Agents are said to learn if they improve their
performance over time based on experience.
The problem of understanding intelligence is said to be the greatest problem in science today and the problem for this century, as deciphering the genetic code was for the second half of the last one. The problem of learning represents a gateway to understanding intelligence in man and machines. -- Tomaso Poggio and Steven Smale, 2003.
- Learning
4Why bother with learning?
- Bottleneck of knowledge acquisition
- Expensive, difficult
- Normally, no expert is around
- Data is cheap!
- Huge amounts of data are available, e.g.
- Clinical tests
- Web mining, e.g. log files
- ...
5Why Learning Bayesian Networks?
- Conditional independencies and the graphical language capture the structure of many real-world distributions
- Graph structure provides much insight into the domain
- Allows knowledge discovery
- Learned model can be used for many tasks
- Supports all the features of probabilistic learning
- Model selection criteria
- Dealing with missing data and hidden variables
6Learning With Bayesian Networks
Data + Prior Info
8What does the data look like?
complete data set
[Figure: table of a complete data set, with the attributes/variables along one axis and the data cases along the other; labels X1, X2, ..., XM]
9What does the data look like?
incomplete data set
- Real-world data: the states of some random variables are missing
- E.g. medical diagnosis: not all patients are subjected to all tests
- Parameter reduction, e.g. clustering, ...
[Figure: data table of an incomplete data set; figure labels: missing value, hidden/latent]
12Hidden variable Examples
[Figure: two network structures over X1, X2, X3 and Y1, Y2, Y3; the node H is labeled hidden]
13Hidden variable Examples
- Hidden variables also appear in clustering
- Autoclass model
- Hidden variables assign class labels
- Observed attributes are independent given the class
[Figure: hidden node Cluster (assignment of cluster) together with the attributes X1, X2, ..., Xn]
14Slides due to Eamonn Keogh
[Figures, slides 14-18: a clustering run shown at iterations 1, 2, 5 and 25. Iteration 1: the cluster means are randomly assigned.]
19Slides due to Eamonn Keogh
What is a natural grouping among these objects?
Clustering is subjective
[Figure: the same objects grouped in different ways; group labels: Simpson's Family, School Employees, Males, Females]
21Learning With Bayesian Networks
[Figure: networks over A and B, one of them with a hidden node H; question marks mark the parts to be learned]
22Parameter Estimation
- Let $D = \{D_1, \ldots, D_N\}$ be a set of data over m RVs
- $D_i \in D$ is called a data case
- iid assumption: all data cases are independently sampled from identical distributions
Find parameters $\theta$ of the CPDs which match the data best
23Maximum Likelihood - Parameter Estimation
- What does best matching mean?
- Find the parameters which have most likely produced the data
- MAP parameters: $\theta^* = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta \frac{P(D \mid \theta) P(\theta)}{P(D)}$
- The data is equally likely for all parameters: $P(D)$ does not depend on $\theta$
- All parameters are a priori equally likely: $P(\theta)$ is constant
- Find ML parameters: $\theta^* = \arg\max_\theta P(D \mid \theta)$
- Likelihood of the parameters given the data: $L(\theta \mid D) = P(D \mid \theta)$
- Log-likelihood of the parameters given the data: $LL(\theta \mid D) = \log P(D \mid \theta)$
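To make the ML criterion concrete, here is a minimal sketch for a single binary variable. The data set `tosses`, the function name `log_likelihood` and the grid search are assumptions made for illustration; none of them come from the slides.

```python
import math

def log_likelihood(theta, data):
    """Log-likelihood of the parameter P(X = 1) = theta for iid binary data."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

# Hypothetical iid data cases: 7 ones and 3 zeros.
tosses = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# Evaluate the log-likelihood on a grid of candidate parameters
# and keep the candidate that most likely produced the data.
grid = [i / 100 for i in range(1, 100)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, tosses))
print(theta_ml)
```

On this grid the maximum lands at the empirical frequency of ones, 7/10, which is what the counting solution for complete data predicts.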
31Maximum Likelihood
- This is one of the most commonly used estimators in statistics
- Intuitively appealing
- Consistent: the estimate converges to the best possible value as the number of examples grows
- Asymptotically efficient: the estimate is as close to the true value as possible given a particular training set
32Learning With Bayesian Networks
[Figure: networks over A and B, one of them with a hidden node H; question marks mark the parts to be learned]
33Known Structure, Complete Data
- Network structure is specified
- Learner needs to estimate parameters
- Data does not contain missing values
E, B, A
<Y,N,N>
<Y,N,Y>
<N,N,Y>
<N,Y,Y>
...
<N,Y,Y>
34ML Parameter Estimation
- (iid) The likelihood of the data is a product over the data cases
- (BN semantics) Each data case factorizes into one factor per variable given its parents
- Only local parameters of the family of Aj are involved
- Each factor can be maximized individually!
- Decomposability of the likelihood
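The derivation outlined on these slides can be written out as follows. The notation is an assumption, since the slide formulas are not preserved in this text: $D = \{D_1, \ldots, D_N\}$ are the data cases, $A_1, \ldots, A_n$ the variables, $a_j[i]$ and $pa_j[i]$ the values of $A_j$ and of its parents in data case $D_i$, and $\theta_j$ the parameters of the CPD of $A_j$.

```latex
\begin{align*}
L(\theta \mid D) &= P(D \mid \theta) \\
 &= \prod_{i=1}^{N} P(D_i \mid \theta) && \text{(iid)} \\
 &= \prod_{i=1}^{N} \prod_{j=1}^{n} P(a_j[i] \mid pa_j[i], \theta_j) && \text{(BN semantics)} \\
 &= \prod_{j=1}^{n} \Big[ \prod_{i=1}^{N} P(a_j[i] \mid pa_j[i], \theta_j) \Big]
\end{align*}
```

Each bracket depends only on the local parameters $\theta_j$ of one family, so each factor can be maximized on its own.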
43Decomposability of Likelihood
- If the data set is complete (no question marks)
- we can maximize each local likelihood function independently, and
- then combine the solutions to get an MLE solution.
- Decomposition of the global problem into independent, local sub-problems. This allows efficient solutions to the MLE problem.
44Likelihood for Multinomials
- Random variable V with values 1,...,K
- $L(\theta \mid D) = \prod_{k=1}^{K} \theta_k^{N_k}$ with $\sum_{k=1}^{K} \theta_k = 1$
- where $N_k$ is the count of state k in the data
This constraint implies that the choice of $\theta_i$ influences the choice of $\theta_j$ ($i \neq j$)
45Likelihood for Binomials (2 states only)
- $\theta_1 + \theta_2 = 1$, so $L(\theta \mid D) = \theta_1^{N_1} (1 - \theta_1)^{N_2}$
- Compute the partial derivative: $\frac{\partial}{\partial \theta_1} \log L(\theta \mid D) = \frac{N_1}{\theta_1} - \frac{N_2}{1 - \theta_1}$
- Set the partial derivative to zero
=> MLE is $\theta_1 = \frac{N_1}{N_1 + N_2}$
In general, for multinomials (> 2 states), the MLE is $\theta_k = \frac{N_k}{\sum_{l=1}^{K} N_l}$
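The closed-form result amounts to normalizing counts. A minimal sketch, assuming a hypothetical list of observations and a helper name `multinomial_mle` that is not part of the slides:

```python
from collections import Counter

def multinomial_mle(samples):
    """Return the ML estimate theta_k = N_k / N for every observed state k."""
    counts = Counter(samples)          # N_k for each state k
    total = sum(counts.values())       # N, the number of data cases
    return {state: n / total for state, n in counts.items()}

# Hypothetical observations of a three-valued variable V.
observations = ["a", "b", "a", "c", "a", "b", "a", "a"]
theta = multinomial_mle(observations)
print(theta)
```

States that never occur in the data get no entry here, which corresponds to an ML estimate of zero for them.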
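47Likelihood for Conditional Multinomials
- One multinomial for each joint state pa of the parents of V
- $L(\theta \mid D) = \prod_{pa} \prod_{k=1}^{K} \theta_{k \mid pa}^{N_{k, pa}}$
- MLE: $\theta_{k \mid pa} = \frac{N_{k, pa}}{\sum_{l=1}^{K} N_{l, pa}}$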
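For complete data, the conditional case is again counting, now once per joint state of the parents. The sketch below assumes a network in which A has the parents E and B, and uses a few hypothetical data cases over E, B, A; the structure, the data and the helper name `conditional_mle` are illustrative assumptions.

```python
from collections import Counter

def conditional_mle(cases, child, parents):
    """ML estimate of P(child | parents) from complete data cases (dicts)."""
    joint = Counter()   # N(k, pa): child state k together with parent state pa
    marg = Counter()    # N(pa): parent state pa
    for case in cases:
        pa = tuple(case[p] for p in parents)
        joint[(case[child], pa)] += 1
        marg[pa] += 1
    return {(k, pa): n / marg[pa] for (k, pa), n in joint.items()}

# Hypothetical complete data cases over E, B, A.
rows = [("Y", "N", "N"), ("Y", "N", "Y"), ("N", "N", "Y"), ("N", "Y", "Y"), ("N", "Y", "Y")]
cases = [dict(zip("EBA", row)) for row in rows]

cpt_a = conditional_mle(cases, "A", ["E", "B"])
print(cpt_a)
```

Each parent configuration is estimated from its own data cases only, which is the decomposability of the likelihood at work.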
48Learning With Bayesian Networks
[Figure: networks over A and B, one of them with a hidden node H; question marks mark the parts to be learned]
49Known Structure, Incomplete Data
- Network structure is specified
- Data contains missing values
- Need to consider assignments to missing values
E, B, A
<Y,?,N>
<Y,N,?>
<N,N,Y>
<N,Y,Y>
...
<?,Y,Y>
50EM Idea
- In the case of complete data, ML parameter estimation is easy: simply counting (1 iteration)
- Incomplete data?
- Complete the data (imputation): most probable value? average value? ...
- Count
- Iterate
51EM Idea: complete the data
[Figure, slides 51-59: network over A and B; the incomplete data is completed under the current model, which gives expected counts; maximizing on the completed data gives new parameters; iterate]
Expected counts N in the three passes shown:

N (pass 1)   N (pass 2)   N (pass 3)   B       A
1.5          1.5          1.5          true    true
1.5          1.5          1.5          false   true
1.5          1.75         1.875        true    false
0.5          0.25         0.125        false   false
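The complete-count-maximize loop can be sketched for a two-node network A -> B in which some values of B are missing. The data set below is hypothetical: it was chosen so that the expected counts match the numbers shown on these slides (1.5, 1.5, 1.5, 0.5, then 1.75 and 0.25, then 1.875 and 0.125), but the slides' own data table is not preserved in this text. The uniform starting point and all function names are assumptions as well.

```python
from collections import defaultdict

# Hypothetical data cases (a, b) for a network A -> B; None marks a missing B.
data = [(True, True), (True, False), (True, None), (False, True), (False, None)]

def e_step(data, p_b_given_a):
    """Expected counts N(a, b) under the current P(B = true | A = a)."""
    counts = defaultdict(float)
    for a, b in data:
        if b is None:
            # Split the case fractionally over both completions of B.
            counts[(a, True)] += p_b_given_a[a]
            counts[(a, False)] += 1.0 - p_b_given_a[a]
        else:
            counts[(a, b)] += 1.0
    return counts

def m_step(counts):
    """Re-estimate P(B = true | A = a) as a ratio of expected counts."""
    return {a: counts[(a, True)] / (counts[(a, True)] + counts[(a, False)])
            for a in (True, False)}

def run_em(data, iterations):
    p = {True: 0.5, False: 0.5}   # uniform starting point (an assumption)
    history = []
    for _ in range(iterations):
        counts = e_step(data, p)
        history.append(dict(counts))
        p = m_step(counts)
    return p, history

params, history = run_em(data, 3)
for i, counts in enumerate(history, start=1):
    print(i, counts[(False, True)], counts[(False, False)])
```

A is fully observed in this sketch, so only the CPD of B is re-estimated; P(A) would follow from plain counting.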
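60Complete-data likelihood
- Incomplete-data likelihood: $L(\theta \mid Y) = P(Y \mid \theta)$, where Y is the observed data
- Assume complete data $X = (Y, Z)$ exists, with Z the missing values
- Complete-data likelihood: $L(\theta \mid X) = P(Y, Z \mid \theta)$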
61EM Algorithm - Abstract
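The formulas of this slide are not preserved in this text. As a hedged reconstruction in a commonly used notation (an assumption, not necessarily the slide's own): Y denotes the observed data, Z the missing values, and $\theta^{(t)}$ the current parameters.

```latex
\begin{align*}
\text{incomplete-data likelihood:} \quad & P(Y \mid \theta) = \sum_{Z} P(Y, Z \mid \theta) \\
\text{E-step:} \quad & Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z \mid Y, \theta^{(t)}} \big[ \log P(Y, Z \mid \theta) \big] \\
\text{M-step:} \quad & \theta^{(t+1)} = \arg\max_{\theta} \, Q(\theta, \theta^{(t)})
\end{align*}
```

The E-step only needs the posterior over the missing values under the current parameters; the M-step is a complete-data ML problem.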
62EM Algorithm - Principle
[Figure, slides 62-65: plot of the incomplete-data likelihood $P(Y \mid \theta)$ against $\theta$]
Expectation Maximization (EM): construct a new function based on the current point (one which behaves well). Property: the maximum of the new function has a better score than the current point.
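66EM for Multinomials
- Random variable V with values 1,...,K
- $\prod_{k=1}^{K} \theta_k^{EN_k}$
- where $EN_k$ are the expected counts of state k in the data, i.e. $EN_k = \sum_{i} P(V = k \mid D_i, \theta)$ for the data cases $D_i$
- MLE: $\theta_k = \frac{EN_k}{\sum_{l=1}^{K} EN_l}$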
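67EM for Conditional Multinomials
- One multinomial for each joint state pa of the parents of V
- Expected counts: $EN_{k, pa} = \sum_{i} P(V = k, Pa = pa \mid D_i, \theta)$ for the data cases $D_i$
- MLE: $\theta_{k \mid pa} = \frac{EN_{k, pa}}{\sum_{l=1}^{K} EN_{l, pa}}$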
68Learning Parameters: incomplete data
Non-decomposable likelihood (missing value, hidden nodes)
- Initial parameters give the current model
- Expectation: compute expected counts under the current model
- Maximization: update parameters (ML, MAP)
- EM algorithm: iterate until convergence
72Learning Parameters: incomplete data
1. Initialize parameters
2. Compute pseudo counts for each variable
3. Set parameters to the (completed) ML estimates
4. If not converged, iterate to 2
73Monotonicity
- (Dempster, Laird, Rubin 77): the incomplete-data likelihood function is not decreased by an EM iteration
- (Discrete) Bayesian networks: for any initial, non-uniform value the EM algorithm converges to a (local or global) maximum.
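The monotonicity property can be checked numerically. The sketch below runs EM for a small naive Bayes model with a hidden class H and three binary attributes, in the spirit of the clustering models mentioned earlier, and records the incomplete-data log-likelihood after every iteration. The data, the starting parameters and all function names are assumptions for illustration.

```python
import math

def joint(h, x, prior, cond):
    """P(H = h, X = x) for a naive Bayes model with binary attributes."""
    p = prior[h]
    for j, xj in enumerate(x):
        p *= cond[h][j] if xj else 1.0 - cond[h][j]
    return p

def log_likelihood(data, prior, cond):
    """Incomplete-data log-likelihood: the hidden H is summed out."""
    classes = range(len(prior))
    return sum(math.log(sum(joint(h, x, prior, cond) for h in classes)) for x in data)

def em_step(data, prior, cond):
    """One EM iteration: expected counts, then ML estimates from them."""
    k, m = len(prior), len(data[0])
    n_h = [0.0] * k                        # expected counts N(H = h)
    n_hx = [[0.0] * m for _ in range(k)]   # expected counts N(H = h, X_j = 1)
    for x in data:
        weights = [joint(h, x, prior, cond) for h in range(k)]
        total = sum(weights)
        for h in range(k):
            r = weights[h] / total         # posterior P(H = h | x)
            n_h[h] += r
            for j in range(m):
                n_hx[h][j] += r * x[j]
    new_prior = [n / len(data) for n in n_h]
    new_cond = [[n_hx[h][j] / n_h[h] for j in range(m)] for h in range(k)]
    return new_prior, new_cond

# Hypothetical data cases over three binary attributes; H is never observed.
data = [(1, 1, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0),
        (0, 0, 1), (0, 0, 0), (1, 0, 1), (0, 1, 0)]
prior = [0.5, 0.5]
cond = [[0.6, 0.7, 0.6], [0.4, 0.3, 0.4]]  # non-uniform start: P(X_j = 1 | H = h)

lls = [log_likelihood(data, prior, cond)]
for _ in range(20):
    prior, cond = em_step(data, prior, cond)
    lls.append(log_likelihood(data, prior, cond))
print(lls[0], lls[-1])
```

The starting point is deliberately non-uniform: if both classes started with identical parameters, every posterior would stay at the prior and EM would not be able to tell the classes apart.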
74LL on training set (Alarm)
Experiment by Bauer, Koller and Singer, UAI 97
75Parameter value (Alarm)
Experiment by Bauer, Koller and Singer, UAI 97
76EM in Practice
- Initial parameters
- Random parameter setting
- Best guess from other source
- Stopping criteria
- Small change in likelihood of data
- Small change in parameter values
- Avoiding bad local maxima
- Multiple restarts
- Early pruning of unpromising ones
- Speed up
- Various methods to speed up convergence
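Two of these practices, a likelihood-based stopping criterion and multiple random restarts, can be sketched on a small hidden-variable problem: several trials of coin flips, each produced by one of two coins whose identity is hidden. The model (equal mixture weights), the data, the tolerance and the function names are all assumptions for illustration.

```python
import math
import random

def em_two_coins(heads, flips, theta, tol=1e-8, max_iter=500):
    """EM for an equal-weight mixture of two coins; stops on a small LL change."""
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: expected heads/tails per coin, plus the log-likelihood
        # (up to the binomial coefficients, which do not depend on theta).
        ll, stats = 0.0, [[0.0, 0.0], [0.0, 0.0]]
        for h in heads:
            lik = [0.5 * t ** h * (1 - t) ** (flips - h) for t in theta]
            total = lik[0] + lik[1]
            ll += math.log(total)
            for c in (0, 1):
                r = lik[c] / total          # posterior that coin c produced the trial
                stats[c][0] += r * h
                stats[c][1] += r * (flips - h)
        # Stopping criterion: small change in the likelihood of the data.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: ML estimates from the expected counts.
        theta = [s[0] / (s[0] + s[1]) for s in stats]
    return ll, theta

def best_of_restarts(heads, flips, restarts=5, seed=0):
    """Run EM from several random starting points and keep the best result."""
    rng = random.Random(seed)
    runs = [em_two_coins(heads, flips, [rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)])
            for _ in range(restarts)]
    return max(runs, key=lambda run: run[0])

heads = [5, 9, 8, 4, 7]   # hypothetical number of heads in each trial of 10 flips
ll, theta = best_of_restarts(heads, flips=10)
print(ll, theta)
```

Comparing restarts by their final log-likelihood is one simple selection rule; early pruning of unpromising runs would compare them after a few iterations instead.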
77Gradient Ascent
- Main result
- Requires same BN inference computations as EM
- Pros
- Flexible
- Closely related to methods in neural network training
- Cons
- Need to project gradient onto space of legal parameters
- To get reasonable convergence we need to combine with smart optimization techniques
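The formula behind "Main result" is not preserved in this text. A common form of the log-likelihood gradient for table CPDs, given here as an assumption about what the slide showed, uses $\theta_{k \mid pa} = P(V = k \mid Pa = pa)$ and data cases $D_1, \ldots, D_N$:

```latex
\frac{\partial \log P(D \mid \theta)}{\partial \theta_{k \mid pa}}
  = \sum_{i=1}^{N} \frac{P(V = k, Pa = pa \mid D_i, \theta)}{\theta_{k \mid pa}}
  = \frac{EN_{k, pa}}{\theta_{k \mid pa}}
```

The posteriors $P(V = k, Pa = pa \mid D_i, \theta)$ are the same quantities that make up the expected counts in EM, which is why both methods need the same inference computations.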
78Parameter Estimation Summary
- Parameter estimation is a basic task for learning with Bayesian networks
- Due to missing values: non-linear optimization
- EM, Gradient, ...
- EM for multinomial random variables
- Fully observed data: counting
- Partially observed data: pseudo counts
- Junction tree to do multiple inference