David Newman, UC Irvine. Lecture 10: Mixture Models 1

1
CS 277 Data Mining, Lecture 10: Mixture Models
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Homework 2 due Tuesday Nov 6 in class
  • Any questions?
  • Do you need some hints?
  • Are you learning anything?

3
Clustering
4
Different Types of Clustering Algorithms
  • Partition-based clustering
    - K-means
  • Probabilistic model-based clustering
    - mixture models
  • (the above work with measurement data, e.g., feature vectors)
  • Hierarchical clustering
    - hierarchical agglomerative clustering
  • Graph-based clustering
    - min-cut algorithms
  • (the above work with distance data, e.g., a distance matrix)

5
K-Means: After 1 iteration
6
K-Means: Converged solution
7
Finite Mixture Models
  • f(x) = Σ_{k=1..K} w_k f_k(x | θ_k): each component k contributes a Weight (w_k), a Component Model (f_k), and Parameters (θ_k)
15
Interpretation of Mixtures
  • 1. C has a direct (physical) interpretation
  • e.g., C = age of fish, C ∈ {male, female}, C ∈ {Australian, American}

16
Interpretation of Mixtures
  • 1. C has a direct (physical) interpretation
  • e.g., C = age of fish, C ∈ {male, female}
  • 2. C is a convenient hidden variable (i.e., the cluster variable)
    - focuses attention on subsets of the data, e.g., for visualization, clustering, etc.
    - C might have a physical/real interpretation, but not necessarily so

17
Probabilistic Clustering: Mixture Models
  • assume a probabilistic model for each component cluster
  • mixture model: f(x) = Σ_{k=1..K} w_k f_k(x | θ_k)
  • the w_k are the K mixing weights: 0 ≤ w_k ≤ 1 and Σ_{k=1..K} w_k = 1
  • the K component densities f_k(x | θ_k) can be
    - Gaussian
    - Poisson
    - exponential
    - ...
  • Note:
    - Assumes a model for the data (with both advantages and disadvantages)
    - Results in probabilistic memberships p(cluster k | x), also called responsibilities
      (a small numerical sketch follows this slide)
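
The mixture density above can be evaluated directly once the weights and component parameters are fixed. Below is a minimal sketch in Python (not from the lecture; the two-component 1-D Gaussian parameters are made-up illustrative values).

```python
import numpy as np
from scipy.stats import norm

# Illustrative (made-up) 1-D mixture: f(x) = sum_k w_k * f_k(x | theta_k)
weights = np.array([0.3, 0.7])    # w_k, nonnegative and summing to 1
means = np.array([-2.0, 1.5])     # component means
stds = np.array([0.5, 1.0])       # component standard deviations

def mixture_density(x):
    """Evaluate f(x) = sum_k w_k N(x | mu_k, sigma_k^2) at the points x."""
    x = np.atleast_1d(x)
    comps = norm.pdf(x[:, None], loc=means, scale=stds)  # shape (n, K)
    return comps @ weights

print(mixture_density([-2.0, 0.0, 1.5]))
```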

18
Gaussian Mixture Models (GMM)
  • model for the k-th component is normal: N(μ_k, Σ_k)
  • often assume a diagonal covariance: Σ_jj = σ_j², Σ_ij = 0 for i ≠ j
  • or sometimes even simpler: Σ_jj = σ², Σ_ij = 0 for i ≠ j
  • f(x) = Σ_{k=1..K} w_k f_k(x | θ_k) with θ_k = <μ_k, σ_k> or <μ_k, Σ_k>
  • generative model (see the sampling sketch below):
    - randomly choose a component, selected with probability w_k
    - generate x ~ N(μ_k, Σ_k)
  • note: μ_k and σ_k are both d-dimensional vectors
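
The generative model on this slide translates almost line-for-line into code. This is a small illustrative sketch (not lecture code); the 2-D, two-component, diagonal-covariance parameters are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D GMM parameters (K = 2 components, diagonal covariance)
weights = np.array([0.4, 0.6])                  # w_k
means = np.array([[0.0, 0.0], [3.0, 3.0]])      # mu_k, each d-dimensional
stds = np.array([[1.0, 0.5], [0.7, 1.2]])       # sigma_k (diagonal), each d-dimensional

def sample_gmm(n):
    """Generative model: pick component k with prob w_k, then x ~ N(mu_k, diag(sigma_k^2))."""
    ks = rng.choice(len(weights), size=n, p=weights)
    return means[ks] + stds[ks] * rng.standard_normal((n, means.shape[1]))

X = sample_gmm(500)
print(X.shape)  # (500, 2)
```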

19
Learning Mixture Models from Data
  • Score function: the log-likelihood L(θ)
  • L(θ) = log p(X | θ) = log Σ_H p(X, H | θ)
  • H = hidden variables (the cluster membership of each x)
  • L(θ) cannot be optimized directly
  • EM procedure:
    - General technique for maximizing the log-likelihood with missing data
    - For mixtures:
      E-step: compute memberships p(k | x) = w_k f_k(x | θ_k) / f(x)
      M-step: pick a new θ to maximize the expected data log-likelihood
    - Iterate: guaranteed to climb to a (local) maximum of L(θ)
      (a complete worked sketch follows this slide)
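
As a concrete illustration of the E-step/M-step iteration (my own sketch, not the lecture's code), here is EM for a 1-D Gaussian mixture; it monitors the log-likelihood L(θ) to decide when to stop.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a 1-D Gaussian mixture: returns weights, means, stds, log-likelihood."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(K, 1.0 / K)                       # mixing weights w_k
    mu = rng.choice(x, size=K, replace=False)     # initialize means from the data
    sigma = np.full(K, x.std())                   # common initial spread
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities p(k | x_i) = w_k f_k(x_i) / f(x_i)
        dens = w * norm.pdf(x[:, None], loc=mu, scale=sigma)   # (n, K)
        f = dens.sum(axis=1, keepdims=True)
        resp = dens / f
        # M-step: re-estimate parameters from the weighted data
        nk = resp.sum(axis=0)
        w = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        # Evaluate the log-likelihood and check convergence
        ll = np.log(f).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return w, mu, sigma, ll

# Usage on synthetic data
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1.5, 1.0, 700)])
print(em_gmm_1d(x, K=2))
```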

20
The E (Expectation) Step: Responsibilities
Given the current K clusters and their parameters, and n data points, the E-step computes p(data point i is in group k) for every point and cluster.
21
The M (Maximization) Step: Re-estimate Parameters
Given the n data points and their memberships, the M-step computes new parameters θ for each of the K clusters.
22
Complexity of EM for Mixtures
With K component models and n data points, the complexity per iteration scales as O(n K f(p)), where f(p) depends on the component model and p is the dimensionality.
23
Comments on Mixtures and EM Learning
  • Complexity of each EM iteration
    - Depends on the probabilistic model being used
    - e.g., for Gaussians, the E-step is O(nK), the M-step is O(nKp²)
    - Sometimes the E- or M-step is not available in closed form
      => can require numerical optimization or sampling within each iteration
  • Generalized EM (GEM): instead of maximizing the likelihood, just increase it
  • EM can be thought of as hill-climbing, with direction and step size provided automatically
  • K-means as a special case of EM (see the sketch after this list)
    - Gaussian mixtures with isotropic (diagonal, equal-variance) Σ_k's
    - Approximate the E-step by choosing the most likely cluster (instead of using membership probabilities)
  • Generalizations
    - Mixtures of multinomials for text data
    - Mixtures of Markov chains for Web sequences
    - and more
    - Will be discussed later in the lectures on text and Web data
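
The "K-means as a special case of EM" bullet can be made concrete with a short sketch (my illustration, not lecture code): the E-step is approximated by a hard assignment of each point to its most likely, i.e. nearest, cluster, and the M-step re-estimates the cluster means.

```python
import numpy as np

def kmeans_hard_em(X, K, n_iter=50, seed=0):
    """K-means viewed as EM with hard assignments and isotropic, equal-variance components."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Hard E-step: assign each point to the nearest center
        # (equivalently, the most likely isotropic Gaussian component)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n, K)
        assign = d2.argmin(axis=1)
        # M-step: re-estimate each center as the mean of its assigned points
        new_centers = np.array(
            [X[assign == k].mean(axis=0) if np.any(assign == k) else centers[k] for k in range(K)]
        )
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

# Usage on synthetic 2-D data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
centers, assign = kmeans_hard_em(X, K=2)
print(centers)
```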

24
EM for Gaussian Mixtures
  • Gaussian mixture density (equation on slide)
  • Log-likelihood (equation on slide)
25
EM for Gaussian Mixtures
  • E-Step: compute responsibilities
  • M-Step: re-estimate parameters
  • Evaluate the log-likelihood
    (the standard equations are reconstructed below)
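
The equations on these two slides did not survive the transcript. The standard forms for a Gaussian mixture (a reconstruction consistent with the notation above, not copied from the slides) are:

```latex
% Gaussian mixture density and log-likelihood
f(\mathbf{x}) = \sum_{k=1}^{K} w_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad
L(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

% E-step: responsibilities
\gamma_{ik} = \frac{w_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
                   {\sum_{j=1}^{K} w_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

% M-step: re-estimate parameters, with n_k = \sum_i \gamma_{ik}
w_k = \frac{n_k}{n}, \qquad
\boldsymbol{\mu}_k = \frac{1}{n_k} \sum_{i=1}^{n} \gamma_{ik}\, \mathbf{x}_i, \qquad
\boldsymbol{\Sigma}_k = \frac{1}{n_k} \sum_{i=1}^{n} \gamma_{ik}\,
    (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top
```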

35
Selecting K in mixture models
  • cannot just choose the K that maximizes the likelihood: L(θ) is always larger for larger K
  • Model selection alternatives:
  • 1) Penalize complexity
    - e.g., BIC = L(θ) - (d/2) log n, with d = number of parameters (Bayesian Information Criterion)
    - Asymptotically correct under certain assumptions
    - Often used in practice for mixture models even though the assumptions behind the theory are not met
    - (a small BIC sketch follows this slide)
  • 2) Bayesian: compute the posteriors p(K | data)
    - p(K | data) requires computing p(data | K), the marginal likelihood
    - Can be tricky to compute for mixture models
    - Recent work on Dirichlet process priors has made this more practical
  • 3) (Cross-)validation
    - Score different models by log p(X_test | θ)
    - Split the data into training and validation sets
    - Works well on large data sets
    - Can be noisy on small data sets (the log-likelihood is sensitive to outliers)
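
A minimal sketch of the BIC-style score from alternative 1) (my illustration; it assumes a full-covariance Gaussian mixture when counting free parameters, and the log-likelihood values below are made-up):

```python
import numpy as np

def gmm_num_params(K, d):
    """Free parameters of a full-covariance GMM:
    (K - 1) mixing weights + K*d means + K*d*(d+1)/2 covariance entries."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

def bic_score(log_lik, K, d, n):
    """Penalized score from the slide: L(theta) - (d_params / 2) * log(n).
    Larger is better under this sign convention."""
    return log_lik - 0.5 * gmm_num_params(K, d) * np.log(n)

# Illustrative (made-up) log-likelihoods of fitted models with K = 1..4
n, d = 1000, 2
for K, ll in zip(range(1, 5), [-4200.0, -3900.0, -3890.0, -3885.0]):
    print(K, round(bic_score(ll, K, d, n), 1))
```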

36
Example of BIC Score for Red-Blood Cell Data
37
Example of BIC Score for Red-Blood Cell Data
True number of classes (2) selected by BIC
38
Relationship between K-Means and Mix-of-Gaussians
  • Welling (clustering notes):
    - introduce a parameter a and use p(x, i) → p(x, i)^a
    - a → 0: get Mixture of Gaussians
    - a → ∞: get K-Means
  • Bishop:
    - Σ = εI (diagonal/isotropic covariance matrix)
    - ε → 0: get K-Means

39
Next
  • Use EM to learn a model for documents