Principles and Applications of Probabilistic Learning - PowerPoint PPT Presentation

About This Presentation
Title:

Principles and Applications of Probabilistic Learning

Description:

Principles and Applications of Probabilistic Learning. Padhraic Smyth, Department of Computer Science, University of California, Irvine. www.ics.uci.edu/~smyth – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Principles and Applications of Probabilistic Learning


1
Principles and Applications of Probabilistic
Learning
  • Padhraic Smyth
  • Department of Computer Science
  • University of California, Irvine
  • www.ics.uci.edu/~smyth

2
New Slides
NEW
  • Original slides created in mid-July for ACM
  • Some new slides have been added
  • new logo in upper left

3
New Slides
UPDATED
  • Original slides created in mid-July for ACM
  • Some new slides have been added
  • new logo in upper left
  • A few slides have been updated
  • updated logo in upper left
  • Current slides (including new and updated) at
  • www.ics.uci.edu/~smyth/talks

4
NEW
  • From the tutorial Web page
  • The intent of this tutorial is to provide a
    starting point for students and researchers

5
Probabilistic Modeling vs. Function Approximation
  • Two major themes in machine learning
  • 1. Function approximation/black box methods
  • e.g., for classification and regression
  • Learn a flexible function y = f(x)
  • e.g., SVMs, decision trees, boosting, etc
  • 2. Probabilistic learning
  • e.g., for regression, model p(y|x) or p(y,x)
  • e.g., graphical models, mixture models, hidden
    Markov models, etc
  • Both approaches are useful in general
  • In this tutorial we will focus only on the 2nd
    approach, probabilistic modeling

6
Motivations for Probabilistic Modeling
  • leverage prior knowledge
  • generalize beyond data analysis in vector-spaces
  • handle missing data
  • combine multiple types of information into an
    analysis
  • generate calibrated probability outputs
  • quantify uncertainty about parameters, models,
    and predictions in a statistical manner

7
Learning object models in vision Weber,
Welling, Perona, 2000
NEW
8
Learning object models in vision Weber,
Welling, Perona, 2000
NEW
9
Learning to Extract Information from Documents
NEW
e.g., Seymore, McCallum, Rosenfeld, 1999
10
NEW
11
NEW
Segal, Friedman, Koller, et al, Nature Genetics,
2005
12
P(Data | Parameters)
Probabilistic Model
Real World Data
13
P(Data | Parameters)
Probabilistic Model
Real World Data
P(Parameters | Data)
14
(Generative Model)
P(Data | Parameters)
Probabilistic Model
Real World Data
P(Parameters | Data)
(Inference)
15
Outline
  • Review of probability
  • Graphical models
  • Connecting probability models to data
  • Models with hidden variables
  • Case studies
  • (i) Simulating and forecasting rainfall data
    (ii) Curve clustering with cyclone trajectories
    (iii) Topic modeling from text documents

16
Part 1: Review of Probability
17
Notation and Definitions
  • X is a random variable
  • Lower-case x is some possible value for X
  • X = x is a logical proposition that X takes
    value x
  • There is uncertainty about the value of X
  • e.g., X is the Dow Jones index at 5pm tomorrow
  • p(X = x) is the probability that proposition X = x
    is true
  • often shortened to p(x)
  • If the set of possible x's is finite, we have a
    probability distribution and Σ p(x) = 1
  • If the set of possible x's is infinite, p(x) is a
    density function, and p(x) integrates to 1 over
    the range of X

18
Example
  • Let X be the Dow Jones Index (DJI) at 5pm Monday
    August 22nd (tomorrow)
  • X can take real values from 0 to some large
    number
  • p(x) is a density representing our uncertainty
    about X
  • This density could be constructed from historical
    data, e.g.,
  • After 5pm p(x) becomes infinitely narrow around
    the true known x (no uncertainty)

19
Probability as Degree of Belief
  • Different agents can have different p(x)s
  • Your p(x) and the p(x) of a Wall Street expert
    might be quite different
  • OR if we were on vacation we might not have
    access to stock market information
  • we would still be uncertain about p(x) after 5pm
  • So we should really think of p(x) as p(x | B_I)
  • where B_I is background information available to
    agent I
  • (will drop explicit conditioning on B_I in
    notation)
  • Thus, p(x) represents the degree of belief that
    agent I has in proposition x, conditioned on
    available background information

20
Comments on Degree of Belief
  • Different agents can have different probability
    models
  • There is no necessarily correct p(x)
  • Why? Because p(x) is a model built on whatever
    assumptions or background information we use
  • Naturally leads to the notion of updating
  • p(x | B_I) -> p(x | B_I, C_I)
  • This is the subjective Bayesian interpretation of
    probability
  • Generalizes other interpretations (such as
    frequentist)
  • Can be used in cases where frequentist reasoning
    is not applicable
  • We will use degree of belief as our
    interpretation of p(x) in this tutorial
  • Note!
  • Degree of belief is just our semantic
    interpretation of p(x)
  • The mathematics of probability (e.g., Bayes rule)
    remain the same regardless of our semantic
    interpretation

21
Multiple Variables
  • p(x, y, z)
  • Probability that X=x AND Y=y AND Z=z
  • Possible values: cross-product of X, Y, Z
  • e.g., X, Y, Z each take 10 possible values
  • x,y,z can take 10^3 possible values
  • p(x,y,z) is a 3-dimensional array/table
  • Defines 10^3 probabilities
  • Note the exponential increase as we add more
    variables
  • e.g., X, Y, Z are all real-valued
  • x,y,z live in a 3-dimensional vector space
  • p(x,y,z) is a positive function defined over this
    space, integrates to 1

22
Conditional Probability
  • p(x | y, z)
  • Probability of x given that Y=y and Z=z
  • Could be
  • hypothetical, e.g., if Y=y and if Z=z
  • observational, e.g., we observed values y and z
  • can also have p(x, y | z), etc
  • all probabilities are conditional probabilities
  • Computing conditional probabilities is the basis
    of many prediction and learning problems, e.g.,
  • p(DJI tomorrow | DJI index last week)
  • expected value of DJI tomorrow | DJI index last
    week
  • most likely value of parameter a given observed
    data

23
Computing Conditional Probabilities
  • Variables A, B, C, D
  • All distributions of interest related to A,B,C,D
    can be computed from the full joint distribution
    p(a,b,c,d)
  • Examples, using the Law of Total Probability
  • p(a) = Σ_{b,c,d} p(a, b, c, d)
  • p(c,d) = Σ_{a,b} p(a, b, c, d)
  • p(a,c | d) = Σ_b p(a, b, c | d)
  • where p(a, b, c | d) = p(a,b,c,d)/p(d)
  • These are standard probability manipulations;
    however, we will see how to use these to make
    inferences about parameters and unobserved
    variables, given data (a small numpy sketch of
    these manipulations follows below)
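As an illustration of these manipulations (not part of the original slides), here is a small numpy sketch that builds an arbitrary joint table p(a,b,c,d) and computes the marginals and conditionals listed above; the cardinality K and the random table are assumptions made purely for the example.

```python
# Marginals and conditionals from a full joint table p(a,b,c,d).
import numpy as np

K = 3  # assume each of A, B, C, D takes K values
rng = np.random.default_rng(0)
joint = rng.random((K, K, K, K))
joint /= joint.sum()                       # p(a,b,c,d), sums to 1

p_a  = joint.sum(axis=(1, 2, 3))           # p(a)   = sum_{b,c,d} p(a,b,c,d)
p_cd = joint.sum(axis=(0, 1))              # p(c,d) = sum_{a,b}   p(a,b,c,d)
p_d  = joint.sum(axis=(0, 1, 2))           # p(d)
p_abc_given_d = joint / p_d                # p(a,b,c|d) = p(a,b,c,d)/p(d), broadcasts over d
p_ac_given_d  = p_abc_given_d.sum(axis=1)  # p(a,c|d)   = sum_b p(a,b,c|d)

print(p_a.sum(), p_ac_given_d[:, :, 0].sum())  # both ≈ 1
```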

24
Conditional Independence
  • A is conditionally independent of B given C iff
  • p(a | b, c) = p(a | c)
  • (also implies that B is conditionally independent
    of A given C)
  • In words, B provides no information about A, if
    value of C is known
  • Example
  • a = patient has upset stomach
  • b = patient has headache
  • c = patient has flu
  • Note that conditional independence does not imply
    marginal independence

25
Two Practical Problems
  • (Assume for simplicity each variable takes K
    values)
  • Problem 1 Computational Complexity
  • Conditional probability computations scale as
    O(K^N)
  • where N is the number of variables being summed
    over
  • Problem 2 Model Specification
  • To specify a joint distribution we need a table
    of O(K^N) numbers
  • Where do these numbers come from?

26
Two Key Ideas
  • Problem 1 Computational Complexity
  • Idea Graphical models
  • Structured probability models lead to tractable
    inference
  • Problem 2 Model Specification
  • Idea Probabilistic learning
  • General principles for learning from data

27
Part 2: Graphical Models
28
probability theory is more fundamentally
concerned with the structure of reasoning and
causation than with numbers.
Glenn Shafer and Judea Pearl Introduction to
Readings in Uncertain Reasoning, Morgan Kaufmann,
1990
29
Graphical Models
  • Represent dependency structure with a directed
    graph
  • Node <-> random variable
  • Edges encode dependencies
  • Absence of edge -> conditional independence
  • Directed and undirected versions
  • Why is this useful?
  • A language for communication
  • A language for computation
  • Origins
  • Wright 1920s
  • Independently developed by Spiegelhalter and
    Lauritzen in statistics and Pearl in computer
    science in the late 1980s

30
Examples of 3-way Graphical Models
Marginal Independence: p(A,B,C) = p(A) p(B) p(C)
31
Examples of 3-way Graphical Models
Conditionally independent effects: p(A,B,C) =
p(B|A) p(C|A) p(A). B and C are conditionally
independent given A, e.g., A is a disease, and we
model B and C as conditionally
independent symptoms given A
32
Examples of 3-way Graphical Models
Independent Causes: p(A,B,C) = p(C|A,B) p(A) p(B)
33
Examples of 3-way Graphical Models
Markov dependence: p(A,B,C) = p(C|B) p(B|A) p(A)
34
Real-World Example
  • Monitoring Intensive-Care Patients
  • 37 variables
  • 509 parameters
  • instead of 2^37
  • (figure courtesy of Kevin
  • Murphy/Nir Friedman)

35
Directed Graphical Models
p(A,B,C) = p(C|A,B) p(A) p(B)

36
Directed Graphical Models
p(A,B,C) = p(C|A,B) p(A) p(B)
In general, p(X1, X2, ..., XN) = Π p(Xi |
parents(Xi))

37
Directed Graphical Models
p(A,B,C) = p(C|A,B) p(A) p(B)
In general, p(X1, X2, ..., XN) = Π p(Xi |
parents(Xi))
  • Probability model has simple factored form
  • Directed edges => direct dependence
  • Absence of an edge => conditional independence
  • Also known as belief networks, Bayesian
    networks, causal networks


38
Example
D
B
E
C
A
F
G
39
Example
D
B
E
C
A
F
G
p(A, B, C, D, E, F, G) = Π p( variable |
parents ) = p(A|B) p(C|B) p(B|D) p(F|E) p(G|E)
p(E|D) p(D)
40
Example
D
B
E
c
A
g
F
Say we want to compute p(a | c, g)
41
Example
D
B
E
c
A
g
F
Direct calculation: p(a|c,g) = Σ_{b,d,e,f} p(a,b,d,e,f
| c,g). Complexity of the sum is O(K^4)
42
Example
D
B
E
c
A
g
F
Reordering (using factorization): Σ_b p(a|b) Σ_d
p(b|d,c) Σ_e p(d|e) Σ_f p(e,f | g)
43
Example
D
B
E
c
A
g
F
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) Σ_f
p(e,f | g), where the innermost sum Σ_f p(e,f | g) gives
p(e|g)
44
Example
D
B
E
c
A
g
F
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e)
p(e|g), where Σ_e p(d|e) p(e|g) gives
p(d|g)
45
Example
D
B
E
c
A
g
F
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) p(d|g), where
Σ_d p(b|d,c) p(d|g) gives p(b|c,g)
46
Example
D
B
E
c
A
g
F
Reordering: Σ_b p(a|b) p(b|c,g) gives
p(a|c,g)
Complexity is O(K), compared to O(K^4) (see the einsum sketch below)
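To make the reordering idea concrete, here is a hedged numpy sketch (not from the talk) for this seven-variable example; the random CPTs and K = 4 values per variable are assumptions, and np.einsum is simply allowed to sum variables out one at a time, reproducing the brute-force answer at far lower cost.

```python
# Brute-force vs. reordered summation for p(a | c, g) under the factorization
# p(A|B) p(C|B) p(B|D) p(F|E) p(G|E) p(E|D) p(D).
import numpy as np

K = 4
rng = np.random.default_rng(1)
def cpt(*shape):                       # random conditional table, normalized over axis 0
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

pA_B, pC_B, pB_D = cpt(K, K), cpt(K, K), cpt(K, K)
pF_E, pG_E, pE_D, pD = cpt(K, K), cpt(K, K), cpt(K, K), cpt(K)

# Full joint p(a,b,c,d,e,f,g) by brute force: O(K^7) entries.
joint = np.einsum('ab,cb,bd,fe,ge,ed,d->abcdefg',
                  pA_B, pC_B, pB_D, pF_E, pG_E, pE_D, pD)
c_obs, g_obs = 0, 2                    # observed values of C and G
slice_cg = joint[:, :, c_obs, :, :, :, g_obs]
p_a_brute = slice_cg.sum(axis=(1, 2, 3, 4))
p_a_brute /= p_a_brute.sum()           # p(a | c, g)

# Same answer, but einsum can eliminate b, d, e, f one at a time (never builds the joint).
p_a_fast = np.einsum('ab,b,bd,fe,e,ed,d->a',
                     pA_B, pC_B[c_obs], pB_D, pF_E, pG_E[g_obs], pE_D, pD)
p_a_fast /= p_a_fast.sum()
print(np.allclose(p_a_brute, p_a_fast))   # True
```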
47
A More General Algorithm
  • Message Passing (MP) Algorithm
  • Pearl, 1988 Lauritzen and Spiegelhalter, 1988
  • Declare 1 node (any node) to be a root
  • Schedule two phases of message-passing
  • nodes pass messages up to the root
  • messages are distributed back to the leaves
  • In time O(N), we can compute P(.)

48
Sketch of the MP algorithm in action
49
Sketch of the MP algorithm in action
1
50
Sketch of the MP algorithm in action
2
1
51
Sketch of the MP algorithm in action
2
1
3
52
Sketch of the MP algorithm in action
2
1
3
4
53
Complexity of the MP Algorithm
  • Efficient
  • Complexity scales as O(N K^m)
  • N = number of variables
  • K = arity of variables
  • m = maximum number of parents for any node
  • Compare to O(K^N) for brute-force method

54
Graphs with loops
D
B
E
C
A
F
G
Message passing algorithm does not work
when there are multiple paths between 2 nodes
55
Graphs with loops
D
B
E
C
A
F
G
General approach cluster variables together to
convert graph to a tree
56
Reduce to a Tree
D
B, E
C
A
F
G
57
Reduce to a Tree
D
B, E
C
A
F
G
Good news: can perform MP algorithm on this
tree. Bad news: complexity is now O(K^2)
58
Probability Calculations on Graphs
  • Structure of the graph reveals
  • Computational strategy
  • Dependency relations
  • Complexity is typically O(K^(max number of
    parents))
  • If single parents (e.g., tree), -> O(K)
  • The sparser the graph the lower the complexity
  • Technique can be automated
  • i.e., a fully general algorithm for arbitrary
    graphs
  • For continuous variables
  • replace sum with integral
  • For identification of most likely values
  • Replace sum with max operator

59
Hidden Markov Model (HMM)
Observed
Y3
Yn
Y1
Y2
Hidden
S3
Sn
S1
S2
Two key assumptions: 1. hidden state sequence is
Markov; 2. observation Yt is CI of all
other variables given St. Widely used in speech
recognition, protein sequence models. Motivation:
switching dynamics, low-dimensional representation of Y's,
etc
60
HMMs as graphical models
  • Computations of interest
  • p( Y ) = Σ_s p(Y, S = s)    -> forward-backward
    algorithm (see the sketch below)
  • arg max_s p(S = s | Y)    -> Viterbi
    algorithm
  • Both algorithms...
  • computation time linear in T
  • special cases of MP algorithm
  • Many generalizations and extensions...
  • Make state S continuous -> Kalman filters
  • Add inputs -> convolutional decoding
  • Add additional dependencies in the model
  • Generalized HMMs
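A minimal sketch of the forward recursion behind the forward-backward computation of p(Y) = Σ_s p(Y, S = s); the parameter names pi (initial state probabilities), A (transition matrix), and B (discrete emission matrix) are assumptions for the example, and the rescaling trick is one common way to avoid numerical underflow.

```python
import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    """obs: observation ids; pi: (K,), A: (K,K) with A[i,j]=p(S_t=j|S_{t-1}=i), B: (K,V)."""
    alpha = pi * B[:, obs[0]]              # alpha_1(s) = p(y_1, S_1 = s)
    c = alpha.sum(); alpha /= c            # rescale to avoid underflow
    log_p = np.log(c)
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]      # propagate one step, then weight by the emission
        c = alpha.sum(); alpha /= c
        log_p += np.log(c)
    return log_p                           # log p(Y) = log sum_s p(Y, S = s)

# Tiny usage example with assumed (random, row-normalized) parameters
rng = np.random.default_rng(0)
K, V, T = 3, 5, 50
pi = np.ones(K) / K
A = rng.random((K, K)); A /= A.sum(axis=1, keepdims=True)
B = rng.random((K, V)); B /= B.sum(axis=1, keepdims=True)
obs = rng.integers(V, size=T)
print(hmm_log_likelihood(obs, pi, A, B))   # cost is linear in T
```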


61
Part 3: Connecting Probability Models to Data
  • Recommended References for this Section
  • All of Statistics, L. Wasserman, Chapman and
    Hall, 2004 (Chapters 6,9,11)
  • Pattern Classification and Scene Analysis, 1st
    ed, R. Duda and P. Hart, Wiley, 1973, Chapter
    3.

62
(Generative Model)
P(Data | Parameters)
Probabilistic Model
Real World Data
P(Parameters | Data)
(Inference)
63
Conditionally Independent Observations
NEW
q
Model parameters
Data
yn-1
yn
y1
y2
64
Plate Notation
q
Model parameters
Data = y1, ..., yn
yi
i = 1..n
Plate = rectangle in graphical model;
variables within a plate are replicated
in a conditionally independent manner
65
Example Gaussian Model
σ
μ
yi
i = 1..n
Generative model: p(y1, ..., yn | μ, σ) = Π p(yi | μ, σ)
= p(data | parameters)
= p(D | θ), where θ = {μ, σ}
66
The Likelihood Function
  • Likelihood = p(data | parameters)
  • = p( D | θ )
  • = L(θ)
  • Likelihood tells us how likely the observed data
    are conditioned on a particular setting of the
    parameters
  • Details
  • Constants that do not involve θ can be dropped in
    defining L(θ)
  • Often easier to work with log L(θ)

67
Comments on the Likelihood Function
  • Constructing a likelihood function L(θ) is the
    first step in probabilistic modeling
  • The likelihood function implicitly assumes an
    underlying probabilistic model M with parameters
    θ
  • L(θ) connects the model to the observed data
  • Graphical models provide a useful language for
    constructing likelihoods

68
Binomial Likelihood
NEW
  • Binomial model
  • N memoryless trials
  • probability θ of success at each trial
  • Observed data
  • r successes in n trials
  • Defines a likelihood
  • L(θ) = p(D | θ)
  • = p(successes) x
    p(non-successes)
  • = θ^r (1-θ)^(n-r)  (see the sketch below)

[Plate diagram: θ → yi, i = 1..n]
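A small numeric sketch (the counts n = 20, r = 7 are made up) of this binomial likelihood: evaluating log L(θ) = r log θ + (n - r) log(1 - θ) on a grid shows the maximum at the intuitive value r/n.

```python
import numpy as np

n, r = 20, 7                                  # assumed data: 7 successes in 20 trials
theta = np.linspace(0.001, 0.999, 999)
log_L = r * np.log(theta) + (n - r) * np.log(1 - theta)
print(theta[np.argmax(log_L)], r / n)         # both ≈ 0.35
```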
69
Binomial Likelihood Examples
NEW
70
Gaussian Model and Likelihood
Model assumptions 1. ys are
conditionally independent given model 2.
each y comes from a Gaussian (Normal) density
71
(No Transcript)
72
Conditional Independence (CI)
  • CI in a likelihood model means that we are
    assuming data points provide no information about
    each other, if the model parameters are assumed
    known.
  • p( D | θ ) = p(y1, ..., yN | θ ) = Π
    p(yi | θ )
  • Works well for (e.g.)
  • Patients randomly arriving at a clinic
  • Web surfers randomly arriving at a Web site
  • Does not work well for
  • Time-dependent data (e.g., stock market)
  • Spatial data (e.g., pixel correlations)

CI assumption
73
Example Markov Likelihood
  • Motivation wish to model data in a sequence
    where there is sequential dependence,
  • e.g., a first-order Markov chain for a DNA
    sequence
  • Markov modeling assumption
  • p(y_t | y_{t-1}, y_{t-2}, ..., y_1) = p(y_t | y_{t-1})
  • θ = K x K matrix of transition
    probabilities
  • L( θ ) = p( D | θ ) = p(y_1, ..., y_N | θ ) = Π
    p(y_t | y_{t-1}, θ )  (see the counting sketch below)
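A hedged sketch of ML estimation for this Markov likelihood on an assumed toy DNA string: maximizing L(θ) reduces to counting observed transitions and normalizing each row of the count matrix.

```python
import numpy as np

seq = "ACGTACGGTCAACGTTA"                     # assumed toy sequence
symbols = sorted(set(seq)); idx = {s: i for i, s in enumerate(symbols)}
K = len(symbols)
counts = np.zeros((K, K))
for prev, curr in zip(seq[:-1], seq[1:]):
    counts[idx[prev], idx[curr]] += 1         # count transitions y_{t-1} -> y_t
theta_ML = counts / counts.sum(axis=1, keepdims=True)   # row k = p(y_t | y_{t-1} = k)
print(np.round(theta_ML, 2))
```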

74
Maximum Likelihood (ML) Principle
(R. Fisher 1922)
Model parameters
θ
Data = y1, ..., yn
yi
i = 1..n
L(θ) = p(Data | θ ) = Π p(yi | θ ). Maximum
Likelihood: θ_ML = arg max
Likelihood(θ). Select the parameters that make
the observed data most likely
75
Example ML for Gaussian Model
Maximum Likelihood Estimate θ_ML
76
Maximizing the Likelihood
  • More generally, we analytically solve for the θ
    value that maximizes the function L(θ)
  • With p parameters, L(θ) is a scalar function
    defined over a p-dimensional space
  • 2 situations
  • We can analytically solve for the maxima of L(θ)
  • This is rare
  • We have to resort to iterative techniques to find
    θ_ML
  • More common
  • General approach
  • Define a generative probabilistic model
  • Define an associated likelihood (connect model to
    data)
  • Solve an optimization problem to find θ_ML

77
Analytical Solution for Gaussian Likelihood
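As a sketch of what the analytical solution looks like in practice (synthetic data assumed): setting the derivatives of log L(μ, σ) to zero gives the sample mean and the (1/n, i.e. biased) sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=1000)   # assumed synthetic data
mu_ML = y.mean()                                # mu_ML = (1/n) sum y_i
sigma2_ML = ((y - mu_ML) ** 2).mean()           # sigma^2_ML = (1/n) sum (y_i - mu_ML)^2
print(mu_ML, np.sqrt(sigma2_ML))                # ≈ 3.0 and ≈ 2.0
```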
78
Graphical Model for Regression
xi
s
q
yi
i1n
79
Example
y
x
80
Example
y
f(x ; θ) (this is unknown)
x
81
Example ML for Linear Regression
  • Generative model
  • y = ax + b + Gaussian noise
  • p(y | x) = N(ax + b, σ)
  • Conditional Likelihood
  • L(θ) = p(y1, ..., yN | x1, ..., xN, θ )
  • = Π p(yi | xi , θ ),
    θ = {a, b}
  • Can show (homework problem!) that
  • log L(θ) ∝ - Σ (yi - (a xi + b))^2
  • i.e., finding a,b to maximize log-
    likelihood is the same as finding a,b that
    minimizes least squares (see the sketch below)
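A brief sketch (synthetic x, y with assumed true a = 2.5, b = -1) of the equivalence claimed above: the least-squares solution also maximizes the Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.5 * x - 1.0 + rng.normal(0.0, 1.0, size=200)    # assumed true a = 2.5, b = -1

# Negative log-likelihood under Gaussian noise, up to additive/multiplicative constants
neg_log_L = lambda a, b: np.sum((y - (a * x + b)) ** 2)

# Least-squares fit via the normal equations
X = np.column_stack([x, np.ones_like(x)])
a_ls, b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(a_ls, b_ls)                                     # close to 2.5 and -1.0

# Moving away from the least-squares solution can only increase the negative log-likelihood
print(neg_log_L(a_ls, b_ls) <= neg_log_L(a_ls + 0.05, b_ls - 0.05))   # True
```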

82
ML and Regression
  • Multivariate case
  • multiple xs, multiple regression coefficients
  • with Gaussian noise, the ML solution is again
    equivalent to least-squares (solutions to a set
    of linear equations)
  • Non-linear multivariate model
  • With Gaussian noise we get
  • log L(q) - S yi - f (xi q) 2
  • Conditions for the q that maximizes L(q) leads
    to a set of p non-linear equations in p variables
  • e.g., f (xi q) a multilayer neural network
    with 1000 weights
  • Optimization finding the maximum of a
    non-convex function in 1000 dimensional space!
  • Typically use iterative local search based on
    gradient (many possible variations)

83
Probabilistic Learning and Classification
NEW
  • 2 main approaches
  • p(c | x) = p(x|c) p(c) / p(x) ∝ p(x|c) p(c)
  • -> learn a model for p(x|c) for each class,
    use Bayes rule to classify
  • - example: naïve Bayes
  • - advantage: theoretically optimal if
    p(x|c) is correct
  • - disadvantage: not directly
    optimizing predictive accuracy
  • Learn p(c|x) directly, e.g.,
  • logistic regression (see tutorial notes from D.
    Lewis)
  • other regression methods such as neural networks,
    etc.
  • Often quite effective in practice very useful
    for ranking, scoring, etc
  • Contrast with purely discriminative methods such
    as SVMs, trees

84
The Bayesian Approach to Learning
α
Prior(θ) = p( θ | α )
θ
yi
i = 1..n
Maximum A Posteriori: θ_MAP = arg max
Likelihood(θ) x Prior(θ). Fully Bayesian:
p( θ | Data) = p(Data | θ ) p(θ) / p(Data)
85
The Bayesian Approach
Fully Bayesian: p( θ | Data) = p(Data | θ ) p(θ)
/ p(Data) = Likelihood x Prior
/ Normalization term. Estimating p( θ | Data) can
be viewed as inference in a graphical model. ML
is a special case: MAP with a flat prior
α
θ
yi
i = 1..n
86
More Comments on Bayesian Learning
  • fully Bayesian: report full posterior density
    p(θ | D)
  • For simple models, we can calculate p(θ | D)
    analytically
  • Otherwise we empirically estimate p(θ | D)
  • Monte Carlo sampling methods are very useful
  • Bayesian prediction (e.g., for regression)
  • p(y | x, D ) = integral of p(y, θ | x, D) dθ
  • = integral of p(y | θ, x) p(θ |
    D) dθ
  • -> prediction at each θ is
    weighted by p(θ | D)
  • theoretically preferable to picking a single θ
    (as in ML)

87
More Comments on Bayesian Learning
  • In practice
  • Fully Bayesian is theoretically optimal but not
    always the most practical approach
  • E.g., computational limitations with large
    numbers of parameters
  • assessing priors can be tricky
  • Bayesian approach particularly useful for small
    data sets
  • For large data sets, Bayesian, MAP, ML tend to
    agree
  • ML/MAP are much simpler => often used in practice

88
Example of Bayesian Estimation
  • Definition of Beta prior
  • Definition of Binomial likelihood
  • Form of Beta posterior
  • Examples of plots with prior + likelihood ->
    posterior

89
Beta Density as a Prior
NEW
  • Let θ be a proportion,
  • e.g., fraction of customers that respond to an
    email ad
  • p(θ) is a prior for θ
  • e.g. p(θ | a, b) = Beta density with parameters a
    and b
  • p(θ | a, b) ∝ θ^(a-1) (1-θ)^(b-1)
  • a /(a + b) influences the location
  • a + b controls the width

[Graphical model: hyperparameters a, b → θ]
90
Examples of Beta Density Priors
NEW
91
Binomial Likelihood
NEW
  • Binomial model
  • N memoryless trials
  • probability θ of success at each trial
  • Observed data
  • r successes in n trials
  • Defines a likelihood
  • p(D | θ) = p(successes) x
    p(non-successes)
  • = θ^r (1-θ)^(n-r)

92
Beta x Binomial -> Beta
NEW
  • p(θ | D) = Posterior ∝ Likelihood x Prior
  • = Binomial x
    Beta
  • ∝ θ^r (1-θ)^(n-r)
    x θ^(a-1) (1-θ)^(b-1)
  • = Beta(a + r, b +
    n - r)
  • Prior is updated using data
  • Parameters: a -> a + r, b -> b + n - r
  • Sample size: a + b -> a + b + n
  • Mean: a /(a + b) -> (a + r)/(a + b + n)
  • (a worked numerical sketch follows below)
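A worked numeric sketch of this conjugate update (the prior a = 2, b = 6 and the data r = 7, n = 20 are assumed values): the posterior is Beta(a + r, b + n - r), and its mean shifts from the prior mean toward r/n.

```python
from scipy.stats import beta

a, b = 2, 6          # assumed Beta prior
n, r = 20, 7         # assumed data: r successes in n trials
posterior = beta(a + r, b + n - r)
prior_mean = a / (a + b)                      # 0.25
posterior_mean = (a + r) / (a + b + n)        # ≈ 0.321
print(prior_mean, posterior_mean, posterior.mean())
```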

93
NEW
94
NEW
95
NEW
96
Extensions
NEW
  • K categories with K probabilities that sum to 1
  • Dirichlet prior x Multinomial likelihood ->
    Dirichlet posterior
  • Used in text modeling, protein alignment
    algorithms, etc
  • E.g. Biological Sequence Analysis, R. Durbin et
    al., Cambridge University Press, 1998.
  • Hierarchical modeling
  • Multiple trials for different individuals
  • Each individual has their own θ
  • The θ's ~ common population distribution
  • For applications in marketing see
  • Market Segmentation Conceptual and
    Methodological Foundations, M. Wedel and W. A.
    Kamakura, Kluwer, 1998

97
Example Bayesian Gaussian Model
a
b
s
m
yi
i1n
Note priors and parameters are assumed
independent here
98
Example Bayesian Regression
β
α
xi
σ
θ
yi
i = 1..n
Model: yi = f(xi ; θ) + e, e ~ N(0, σ^2)
p(yi | xi) = N( f(xi ; θ), σ^2 )
99
Other Examples
UPDATED
  • Bayesian examples
  • Bayesian neural networks
  • Richer probabilistic models
  • Random effects models
  • E.g., Learning to align curves
  • Learning model structure
  • Chow-Liu trees
  • General graphical model structures
  • e.g. gene regulation networks
  • Comprehensive reference
  • Bayesian Data Analysis, A. Gelman, J. B. Carlin.
    H. S. Stern, and D. B. Rubin, Chapman and Hall,
    2nd edition, 2003.

100
Learning Shapes and Shifts
Data: smoothed growth acceleration data from
teenagers. EM used to learn a spline model +
time-shift for each curve
Data after Learning
Original data
101
Learning to Track People Sidenbladh, Black ,
Fleet, 2000
102
Model Uncertainty
  • How do we know what model M to select for our
    likelihood function?
  • In general, we don't!
  • However, we can use the data to help us infer
    which model from a set of possible models is best

103
Method 1 Bayesian Approach
  • Can evaluate the evidence for each model,
  • p(M | D) = p(D|M) p(M)/ p(D)
  • Can get p(D|M) by integrating p(D, θ | M) over
    parameter space (this is the marginal
    likelihood)
  • in theory p(M | D) is how much evidence exists in
    the data for model M
  • More complex models are automatically penalized
    because of the integration over
    higher-dimensional parameter spaces
  • in practice p(M|D) can rarely be computed
    directly
  • Monte Carlo schemes are popular
  • Also approximations such as BIC, Laplace, etc

104
Comments on Bayesian Approach
  • Bayesian Model Averaging (BMA)
  • Instead of selecting the single best model, for
    prediction average over all available models
    (theoretically the correct thing to do)
  • Weights used for averaging are p(M|D)
  • Empirical alternatives
  • e.g., Stacking, Bagging
  • Idea is to learn a set of unconstrained combining
    weights from the data, weights that optimize
    predictive accuracy
  • emulate BMA approach
  • may be more effective in practice

105
Method 2 Predictive Validation
  • Instead of the Bayesian approach, we could use
    the probability of new unseen test data as our
    metric for selecting models
  • E.g., 2 models
  • If p(D | M1) > p(D | M2) then M1 is assigning
    higher probability to new data than M2
  • This will (with enough data) select the model
    that predicts the best, in a probabilistic sense
  • Useful for problems where we have very large
    amounts of data and it is easy to create a large
    validation data set D

106
The Prediction Game
NEW
[Sketch: observed data points x on the interval 0 to 10.
What is a good guess at p(x)?
Two candidate densities are shown: Model A for p(x) and Model B for p(x)]
107
Which of Model A or B is better?
NEW
Test data generated from the true underlying q(x)
Model A
Model B
We can score each model in terms of p(new data |
model). Asymptotically, this is a fair, unbiased
score (irrespective of the complexities of
the models). Note: the empirical average of log
p(data) scores -> negative entropy (see the scoring sketch below)
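A small sketch of this scoring procedure; the "true" generating density and the two candidate models below are assumptions chosen only to illustrate comparing models by the average log-probability they assign to held-out data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
test = rng.normal(5.0, 1.0, size=5000)        # test data from the (assumed) true q(x)

model_A = norm(loc=5.0, scale=1.2)            # assumed candidate densities for p(x)
model_B = norm(loc=3.0, scale=3.0)

score_A = np.mean(model_A.logpdf(test))       # empirical average of log p(x | model)
score_B = np.mean(model_B.logpdf(test))
print(score_A, score_B)                       # the model with the higher score predicts better
```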
108
NEW
Model-based clustering and visualization of
navigation patterns on a Web site Cadez et al,
Journal of Data Mining and Knowledge Discovery,
2003
109
Simple Model Class
110
Data-generating process (truth)
Simple Model Class
111
Data-generating process (truth)
Closest model in terms of KL distance
Simple Model Class
Best model is relatively far from Truth => High
Bias
112
Data-generating process (truth)
Simple Model Class
Complex Model Class
113
Data-generating process (truth)
Simple Model Class
Best model is closer to Truth => Low Bias
Complex Model Class
114
However, this could be the model that best fits
the observed data => High Variance
Data-generating process (truth)
Simple Model Class
Complex Model Class
115
Part 4: Models with Hidden Variables
116
Hidden or Latent Variables
  • In many applications there are 2 sets of
    variables
  • Variables whose values we can directly measure
  • Variables that are hidden, cannot be measured
  • Examples
  • Speech recognition
  • Observed acoustic voice signal
  • Hidden label of the word spoken
  • Face tracking in images
  • Observed pixel intensities
  • Hidden position of the face in the image
  • Text modeling
  • Observed counts of words in a document
  • Hidden topics that the document is about

117
Mixture Models
Pearson, 1894, Phil. Trans. Roy. Soc. A.
p(Y) = Σ_k p(Y | S = k) p(S = k)
S
Hidden discrete variable
Y
Observed variable(s)
Motivation 1. models a true process (e.g., fish
example) 2. approximation for a
complex process
118
(No Transcript)
119
(No Transcript)
120
(No Transcript)
121
A Graphical Model for Clustering
Hidden discrete (cluster) variable
S
Yj
Y1
Yd
Observed variable(s) (assumed conditionally
independent given S)
Clusters = p(Y1, ..., Yd | S = s). Probabilistic
clustering = learning these probability
distributions from data
122
Hidden Markov Model (HMM)
Observed
Y3
Yn
Y1
Y2
Hidden
S3
Sn
S1
S2
Two key assumptions: 1. hidden state sequence is
Markov; 2. observation Yt is CI of all
other variables given St. Widely used in speech
recognition, protein sequence models. Motivation?
- S can provide non-linear switching
- S can encode low-dim time-dependence for
high-dim Y
123
Generalizing HMMs
Y3
Yn
Y1
Y2
S3
Sn
S1
S2
T3
Tn
T1
T2
Two independent state variables, e.g., two
processes evolving at different time-scales
124
Generalizing HMMs
Y3
Yn
Y1
Y2
S3
Sn
S1
S2
I3
In
I1
I2
  • Inputs I provide context to influence switching,
  • e.g., external forcing variables
  • Model is still a tree -> inference is still linear

125
Generalizing HMMs
Y3
Yn
Y1
Y2
S3
Sn
S1
S2
I3
In
I1
I2
  • Add direct dependence between Ys to better model
    persistence
  • Can merge each St and Yt to construct a
    tree-structured model

126
Mixture Model
θ
Si
yi
i = 1..n
Likelihood(θ) = p(Data | θ ) = Π_i
p(yi | θ ) = Π_i Σ_k p(yi | si =
k , θ ) p(si = k)
127
Learning with Missing Data
  • Guess at some initial parameters θ_0
  • E-step (Inference)
  • For each case, and each unknown variable compute
  • p(S | known data, θ_0 )
  • M-step (Optimization)
  • Maximize L(θ) using p(S | ...)
  • This yields new parameter estimates θ_1
  • This is the EM algorithm
  • Guaranteed to converge to a (local) maximum of
    L(θ)
  • Dempster, Laird, Rubin, 1977
  • (a small EM sketch for a Gaussian mixture follows below)
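A compact sketch of EM for a two-component 1-D Gaussian mixture (the synthetic data and K = 2 are assumptions): the E-step computes membership probabilities p(S_i = k | y_i, θ), and the M-step re-estimates weights, means, and variances from them.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 200)])   # assumed data

K = 2
w = np.full(K, 1.0 / K)                     # mixture weights
mu = rng.choice(y, K)                       # initial guesses theta_0
sd = np.full(K, y.std())

for _ in range(100):
    # E-step: responsibilities r[i, k] = p(S_i = k | y_i, theta)
    dens = np.stack([w[k] * norm.pdf(y, mu[k], sd[k]) for k in range(K)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    Nk = r.sum(axis=0)
    w = Nk / len(y)
    mu = (r * y[:, None]).sum(axis=0) / Nk
    sd = np.sqrt((r * (y[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(np.round(w, 2), np.round(mu, 2), np.round(sd, 2))
```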

128
E-Step
q
Si
yi
i1n
129
M-Step
q
Si
yi
i1n
130
E-Step
q
Si
yi
i1n
131
The E (Expectation) Step
Current K components and parameters
n objects
E-step: compute p(object i is in group k)
132
The M (Maximization) Step
New parameters for the K components
n objects
M-step: compute θ, given n objects and memberships
133
Complexity of EM for mixtures
K models
n objects
Complexity per iteration scales as O( n K f(d) )
134
Data from Prof. Christine McLaren, Dept of
Epidemiology, UC Irvine
135
(No Transcript)
136
(No Transcript)
137
(No Transcript)
138
(No Transcript)
139
(No Transcript)
140
(No Transcript)
141
Control Group
Anemia Group
142
(No Transcript)
143
Example of a Log-Likelihood Surface

Mean 2
Log Scale for Sigma 2
144
(No Transcript)
145
HMMs
Y3
YN
Y1
Y2
S3
SN
S1
S2
146
q1
Y3
YN
Y1
Y2
S3
SN
S1
S2
147
q1
Y3
YN
Y1
Y2
S3
SN
S1
S2
q2
148
E-Step (linear inference)
q1
Y3
YN
Y1
Y2
S3
SN
S1
S2
q2
149
M-Step (closed form)
q1
Y3
YN
Y1
Y2
S3
SN
S1
S2
q2
150
Alternatives to EM
  • Method of Moments
  • EM is more efficient
  • Direct optimization
  • e.g., gradient descent, Newton methods
  • EM is usually simpler to implement
  • Sampling (e.g., MCMC)
  • Minimum distance, e.g.,

151
Mixtures as Data Simulators
For i = 1 to N
    class_k ~ p(class_1, class_2, ..., class_K)
    x_i ~ p(x | class_k)
end
(see the sampling sketch below)
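A short sketch of this simulator for a Gaussian mixture; the weights, means, and standard deviations are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [0.5, 0.3, 0.2]                 # assumed p(class_1), ..., p(class_K)
means, sds = [0.0, 4.0, 9.0], [1.0, 0.5, 2.0]

N = 1000
classes = rng.choice(len(weights), size=N, p=weights)              # class_k ~ p(class_1..K)
x = rng.normal(np.array(means)[classes], np.array(sds)[classes])   # x_i ~ p(x | class_k)
print(np.bincount(classes) / N)           # ≈ weights
```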
152
Mixtures with Markov Dependence
For i = 1 to N
    class_k ~ p(class_1, class_2, ..., class_K | class_{i-1})
    x_i ~ p(x | class_k)
end
Current class depends on previous class (Markov
dependence). This is a hidden Markov model
153
Mixtures of Sequences
For i = 1 to N
    class_k ~ p(class_1, class_2, ..., class_K)
    while non-end state:
        x_ij ~ p(x_j | x_{j-1}, class_k)   (Markov sequence model)
end
Produces a variable-length sequence
154
Mixtures of Curves
For i = 1 to N
    class_k ~ p(class_1, class_2, ..., class_K)
    L_i ~ p(L_i | class_k)                       (length of curve)
    for j = 1 to L_i
        y_ij = f(x_j ; class_k) + e_k            (independent variable x; class-dependent curve model)
    end
end
155
Mixtures of Image Models
For i = 1 to N
    class_k ~ p(class_1, class_2, ..., class_K)
    size_i ~ p(size | class_k)                   (global scale)
    for j = 1 to V_i - 1                         (number of vertices)
        intensity_j ~ p(intensity | class_k)     (pixel generation model)
    end
end
156
More generally..
Generative Model:
- select a component c_k for individual i
- generate data according to p(D_i | c_k)
- p(D_i | c_k) can be very general, e.g.,
  sets of sequences, spatial patterns, etc
Note: given p(D_i | c_k), we can define an EM algorithm
157
References
NEW
  • The EM Algorithm and Mixture Models
  • The EM Algorithm and ExtensionsG. McLachlan and
    T. Krishnan. John Wiley and Sons, New York, 1997.
  • Mixture models
  • Statistical Analysis of Finite Mixture
    Distributions,
  • D. M. Titterington, A. F. M. Smith, and U. E.
    Makov, Wiley & Sons, Inc., New York, 1985.
  • Finite Mixture Models
  • G.J. McLachlan and D. Peel, New York Wiley
    (2000)
  • Model-based clustering, discriminant analysis,
    and density estimation, C. Fraley and A. E.
    Raftery, Journal of the American Statistical
    Association, 97:611-631 (2002).

158
References
NEW
  • Hidden Markov Models
  • A tutorial on hidden Markov models and selected
    applications in speech recognition, L. R.
    Rabiner, Proceedings of the IEEE, vol. 77, no.2,
    257-287, 1989.
  • Probabilistic independence networks for hidden
    Markov modelsP. Smyth, D. Heckerman, and M.
    Jordan, Neural Computation , vol.9, no. 2,
    227-269, 1997.
  • Hidden Markov models, A. Moore, online tutorial
    slides,
    http://www.autonlab.org/tutorials/hmm12.pdf

159
Part 5: Case Studies
(i) Simulating and forecasting rainfall data
(ii) Curve clustering with cyclone trajectories
(iii) Topic modeling from text documents
and if time permits...
(iv) Sequence clustering for Web data
(v) Analysis of time-course gene expression data
160
Case Study 1: Simulating and Predicting Rainfall
Patterns
Joint work with Andy Robertson,
International Research Institute for Climate
Prediction, and Sergey Kirshner, Department of
Computer Science, UC Irvine
161
Spatio-Temporal Rainfall Data
Northeast Brazil 1975-2002 90-day time
series 24 years 10 stations
162
(No Transcript)
163
Modeling Goals
  • Downscaling
  • Modeling interannual variability
  • coupling rainfall to large-scale effects like El
    Nino
  • Prediction
  • e.g., hindcasting of missing data
  • Seasonal Forecasts
  • E.g. on Dec 1 produce simulations of likely
    90-day winters

164
HMMs for Rainfall Modeling
Y3
YN
Y1
Y2
S3
SN
S1
S2
I3
IN
I1
I2
S = unobserved weather state; Y = spatial
rainfall pattern (outputs); I = atmospheric
variables (inputs)
165
Learned Weather States
  • States provide an interpretable view of
    spatio-temporal relationships in the data

166
(No Transcript)
167
WeatherStatesfor Kenya
168
(No Transcript)
169
(No Transcript)
170
  • Spatial Chow-Liu Trees
  • Spatial distribution given a state is a tree
    structure
  • (a graphical model)
  • Useful intermediate between full pair-wise model
    and conditional independence
  • Optimal topology learned from data using minimum
    spanning tree algorithm
  • Can use priors based on distance, topography
  • Tree-structure over time also

171
Missing Data
172
Error rate v. fraction of missing data
173
References
NEW
  • Trees and Hidden Markov Models
  • Conditional Chow-Liu tree structures for modeling
    discrete-valued vector time series, S. Kirshner,
    P. Smyth, and A. Robertson, in Proceedings of the
    20th International Conference on Uncertainty in
    AI, 2004.
  • Applications to rainfall modeling
  • Hidden Markov models for modeling daily rainfall
    occurrence over Brazil, A. Robertson, S. Kirshner,
    and P. Smyth, Journal of Climate, November 2005.

174
Summary
  • Simple empirical probabilistic models can be
    very helpful in interpreting large scientific
    data sets
  • e.g., HMM states provide scientists with a basic
    but useful classification of historical spatial
    rainfall patterns
  • Graphical models provide glue to link together
    different information
  • Spatial
  • Temporal
  • Hidden states, etc
  • Generative aspect of probabilistic models can
    be quite useful, e.g., for simulation
  • Missing data is handled naturally in a
    probabilistic framework

175
Case Study 2: Clustering Cyclone Trajectories
Joint work with Suzana Camargo and
Andy Robertson, International Research Institute
for Climate Prediction, and Scott Gaffney, Department
of Computer Science, UC Irvine
176
Storm Trajectories
177
Microarray Gene Expression Data
178
Clustering non-vector data
  • Challenges with the data.
  • May be of different lengths, sizes, etc
  • Not easily representable in vector spaces
  • Distance is not naturally defined a priori
  • Possible approaches
  • convert into a fixed-dimensional vector space
  • Apply standard vector clustering but loses
    information
  • use hierarchical clustering
  • But O(N2) and requires a distance measure
  • probabilistic clustering with mixtures
  • Define a generative mixture model for the data
  • Learn distance and clustering simultaneously

179
Graphical Models for Curves
Data: (y1, t1), ..., (yT, tT)
t
θ
y
T
y = f(t ; θ ), e.g., y =
at^2 + bt + c, θ = {a, b, c}
180
Graphical Models for Curves
t
q
s
y
T
y ~ Gaussian density with mean f(t ; θ ),
variance σ^2
181
Example
y
t
182
Example
f(t ; θ ) <- this is hidden
y
t
183
Graphical Models for Sets of Curves
q
t
s
y
T
N curves
Each curve: P(yi | ti, θ ) = product of
Gaussians
184
Curve-Specific Transformations
Note we can learn function parameters and
shifts simultaneously with EM
q
t
a
s
y
T
N curves
e.g., yi = a t^2 + b t + c + α_i,  θ = {a, b, c,
α_1, ..., α_N}
185
Learning Shapes and Shifts
Data: smoothed growth acceleration data from
teenagers. EM used to learn a spline model +
time-shift for each curve
Data after Learning
Original data
186
Clustering Mixtures of Curves
c
q
t
a
s
y
T
N curves
Each set of trajectory points comes from 1 of K
models. Model for group k is a Gaussian curve
model. Marginal probability for a trajectory =
mixture model
187
The Learning Problem
  • K cluster models
  • Each cluster is a shape model E[Y] = f(X ; θ) with
    its own parameters
  • N observed curves: for each curve we learn
  • P(cluster k | curve data)
  • distribution on alignments, shifts, scaling, etc,
    given data
  • Requires simultaneous learning of
  • Cluster models
  • Curve transformation parameters
  • Results in an EM algorithm where E and M step are
    tractable
188
(No Transcript)
189
(No Transcript)
190
Results on Simulated Data
[Results table for Standard EM, averaged over 50 train/test sets]
191
Clusters of Trajectories
192
Cluster Shapes for Pacific Cyclones
193
TROPICAL CYCLONES Western North Pacific 1983-2002
194
(No Transcript)
195
References on Curve Clustering
NEW
  • Functional Data Analysis
  • J. O. Ramsay and B. W. Silverman, Springer,
    1997.
  • Probabilistic curve-aligned clustering and
    prediction with regression mixture models, S. J.
    Gaffney, PhD Thesis, Department of Computer
    Science, University of California, Irvine, March
    2004.
  • Joint probabilistic curve clustering and
    alignment, S. Gaffney and P. Smyth, Advances in
    Neural Information Processing 17, in press,
    2005.
  • Probabilistic clustering of extratropical
    cyclones using regression mixture models, S.
    Gaffney, A. Robertson, P. Smyth, S. Camargo, M.
    Ghil, preprint, online at www.datalab.uci.edu.

196
Summary
  • Graphical models provide a flexible
    representational language for modeling complex
    scientific data
  • can build complex models from simpler building
    blocks
  • Systematic variability in the data can be handled
    in a principled way
  • Variable length time-series
  • Misalignments in trajectories
  • Generative probabilistic models are interpretable
    and understandable by scientists

197
Case Study 3: Topic Modeling from Text Documents
Joint work with Mark Steyvers, Dave
Newman, Chaitanya Chemudugunta, UC
Irvine; Michal Rosen-Zvi, Hebrew University,
Jerusalem; Tom Griffiths, Brown University
198
Enron email data
250,000 emails 5000 authors 1999-2002
199
Questions of Interest
  • What topics do these documents span?
  • Which documents are about a particular topic?
  • How have topics changed over time?
  • What does author X write about?
  • Who is likely to write about topic Y?
  • Who wrote this specific document?
  • and so on..

200
Graphical Model for Clustering
Cluster for document
z
f
Cluster-Word distributions
w
Word
n
D
201
Graphical Model for Topics
Document-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
202
Topic = probability distribution over words
203
Key Features of Topic Models
NEW
  • Generative model for documents in form of bags of
    words
  • Allows a document to be composed of multiple
    topics
  • Much more powerful than 1 doc -> 1 cluster
  • Completely unsupervised
  • Topics learned directly from data
  • Leverages strong dependencies at word level AND
    large data sets
  • Learning algorithm
  • Gibbs sampling is the method of choice (see the sketch below)
  • Scalable
  • Linear in number of word tokens
  • Can be run on millions of documents
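A minimal sketch of collapsed Gibbs sampling for this kind of topic model, written for illustration rather than as the implementation used in the talk; the corpus format, the symmetric priors alpha and beta, and the number of topics T are all assumptions. Each sweep touches every word token once, which is where the linear scaling in the number of tokens comes from.

```python
import numpy as np

def lda_gibbs(docs, V, T, iters=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of lists of word ids in 0..V-1; returns topic-word and doc-topic counts."""
    rng = np.random.default_rng(seed)
    ndt = np.zeros((len(docs), T))        # doc-topic counts
    ntw = np.zeros((T, V))                # topic-word counts
    nt  = np.zeros(T)                     # tokens per topic
    z = [rng.integers(T, size=len(d)) for d in docs]   # random initial assignments
    for d, doc in enumerate(docs):        # initialize the count tables
        for i, w in enumerate(doc):
            k = z[d][i]; ndt[d, k] += 1; ntw[k, w] += 1; nt[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]               # remove the current assignment
                ndt[d, k] -= 1; ntw[k, w] -= 1; nt[k] -= 1
                # p(z = k | all other assignments) ∝ (n_dk + alpha)(n_kw + beta)/(n_k + V beta)
                p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
                k = rng.choice(T, p=p / p.sum())
                z[d][i] = k               # add back with the newly sampled topic
                ndt[d, k] += 1; ntw[k, w] += 1; nt[k] += 1
    return ntw, ndt
```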

204
Topics vs. Other Approaches
  • Clustering documents
  • Computationally simpler
  • But a less accurate and less flexible model
  • LSI/LSA
  • Projects words into a K-dimensional hidden space
  • Less interpretable
  • Not generalizable
  • E.g., authors or other side-information
  • Not as accurate
  • E.g., precision-recall Hoffman, Blei et al,
    Buntine, etc
  • Topic Models (aka LDA model)
  • next-generation text modeling, after LSI
  • More flexible and more accurate (in prediction)
  • Linear time complexity in fitting the model

205
Examples of Topics learned from Proceedings of
the National Academy of Sciences
Griffiths and Steyvers, 2004
NEW
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
206
What can Topic Models be used for?
  • Queries
  • Who writes on this topic?
  • e.g., finding experts or reviewers in a
    particular area
  • What topics does this person do research on?
  • Comparing groups of authors or documents
  • Discovering trends over time
  • Detecting unusual papers and authors
  • Interactive browsing of a digital library via
    topics
  • Parsing documents (and parts of documents) by
    topic
  • and more..

207
What is this paper about?
NEW
  • Empirical Bayes screening for multi-item
    associations
  • Bill DuMouchel and Daryl Pregibon, ACM SIGKDD
    2001
  • Most likely topics according to the model are
  • data, mining, discovery, association, attribute..
  • set, subset, maximal, minimal, complete,
  • measurements, correlation, statistical,
    variation,
  • Bayesian, model, prior, data, mixture,..

208
(No Transcript)
209
(No Transcript)
210
Pennsylvania Gazette
NEW
(courtesy of David Newman Sharon Block, UC
Irvine)
1728-1800 80,000 articles
211
Historical Trends in Pennsylvania Gazette
NEW
(courtesy of David Newman Sharon Block, UC
Irvine)
STATE GOVERNMENT CONSTITUTION LAW UNITED POWER CITIZEN PEOPLE PUBLIC CONGRESS
SILK COTTON DITTO WHITE BLACK LINEN CLOTH WOMEN BLUE WORSTED
212
Enron email data
250,000 emails 5000 authors 1999-2002
213
Enron email topics
214
Non-work Topics
215
Topical Topics
216
Using Topic Models for Information Retrieval
UPDATED
217
Author-Topic Models
  • The author-topic model
  • a probabilistic model linking authors and topics
  • authors -> topics -> words
  • Topic = distribution over words
  • Author = distribution over topics
  • Document generated from a mixture of author
    distributions
  • Learns about entities based on associated text
  • Can be generalized
  • Replace author with any categorical doc
    information
  • e.g., publication type, source, year, country of
    origin, etc

218
Author-Topic Graphical Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
219
Learning Author-Topic Models from Text
  • Full probabilistic model
  • Power of statistical learning can be leveraged
  • Learning algorithm is linear in number of word
    occurrences
  • Scalable to very large data sets
  • Completely automated (no tweaking required)
  • completely unsupervised, no labels
  • Query answering
  • A wide variety of queries can be answered
  • Which authors write on topic X?
  • What are the spatial patterns in usage of topic
    Y?
  • How have authors A, B and C changed over time?
  • Queries answered using probabilistic inference
  • Query time is real-time (learning is offline)

220
Author-Topic Models for CiteSeer
221
Author-Profiles
  • Author Andrew McCallum, U Mass
  • Topic 1 classification, training,
    generalization, decision, data,
  • Topic 2 learning, machine, examples,
    reinforcement, inductive,..
  • Topic 3 retrieval, text, document, information,
    content,
  • Author Hector Garcia-Molina, Stanford
  • - Topic 1 query, index, data, join, processing,
    aggregate.
  • - Topic 2 transaction, concurrency, copy,
    permission, distributed.
  • - Topic 3 source, separation, paper,
    heterogeneous, merging..
  • Author Jerry Friedman, Stanford
  • Topic 1 regression, estimate, variance, data,
    series,
  • Topic 2 classification, training, accuracy,
    decision, data,
  • Topic 3 distance, metric, similarity, measure,
    nearest,

222
(No Transcript)
223
PubMed-Query Topics
224
PubMed-Query Topics
225
PubMed Topics by Country
226
PubMed-Query Topics by Country
227
Extended Models
  • Conditioning on non-authors
  • side-information other than authors
  • e.g., date, publication venue, country, etc
  • can use citations as authors
  • Fictitious authors and common author
  • Allow 1 unique fictitious author per document
  • Captures document specific effects
  • Assign 1 common fictitious author to each
    document
  • Captures broad topics that are used in many
    documents
  • Semantics and syntax model
  • Semantic topics topics that are specific to
    certain documents
  • Syntactic topics broad, across many documents
  • Probabilistic model that learns each type
    automatically

228
Scientific syntax and semantics (Griffiths et
al., NIPS 2004; slides courtesy of Mark Steyvers
and Tom Griffiths, PNAS Symposium presentation,
2003)
Factorization of language based on statistical
dependency patterns: long-range, document-
specific dependencies vs. short-range
dependencies constant across all documents
semantics = probabilistic topics
q
z
z
z
w
w
w
x
x
x
syntax = probabilistic regular grammar
229
x 2
OF 0.6 FOR 0.3 BETWEEN 0.1
x 1
0.8
z 1 0.4
z 2 0.6
HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY
0.2
SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK
0.2 RESEARCH 0.2 MATHEMATICS 0.2
0.7
0.1
0.3
0.2
x 3
THE 0.6 A 0.3 MANY 0.1
0.9
230
x 2
OF 0.6 FOR 0.3 BETWEEN 0.1
x 1
0.8
z 1 0.4
z 2 0.6
HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY
0.2
SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK
0.2 RESEARCH 0.2 MATHEMATICS 0.2
0.7
0.1
0.3
0.2
x 3
THE 0.6 A 0.3 MANY 0.1
0.9
THE
231
x 2
OF 0.6 FOR 0.3 BETWEEN 0.1
x 1
0.8
z 1 0.4
z 2 0.6
HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY
0.2
SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK
0.2 RESEARCH 0.2 MATHEMATICS 0.2
0.7
0.1
0.3
0.2
x 3
THE 0.6 A 0.3 MANY 0.1
0.9
THE LOVE