Title: David Newman, UC Irvine, Lecture 10: Mixture Models 1
1CS 277 Data Mining, Lecture 10: Mixture Models
- David Newman
- Department of Computer Science
- University of California, Irvine
2Notices
- Homework 2 due Tuesday Nov 6 in class
- Any questions?
- Do you need some hints?
- Are you learning anything?
3Clustering
4Different Types of Clustering Algorithms
- Partition-based clustering
- K-means
- Probabilistic model-based clustering
- mixture models
- above work with measurement data, e.g., feature vectors
- Hierarchical clustering
- hierarchical agglomerative clustering
- Graph-based clustering
- min-cut algorithms
- above work with distance data, e.g., a distance matrix
5K-Means: After 1 iteration
6K-Means: Converged solution
7Finite Mixture Models
8Finite Mixture Models
9Finite Mixture Models
10Finite Mixture Models
11Finite Mixture Models
(Figure: the mixture equation with its parts labeled Weight wk, Component Model fk, and Parameters θk)
12-14 (No Transcript)
15Interpretation of Mixtures
- 1. C has a direct (physical) interpretation
- e.g., C = age of fish, C ∈ {male, female}, C ∈ {Australian, American}
16Interpretation of Mixtures
- 1. C has a direct (physical) interpretation
- e.g., C = age of fish, C ∈ {male, female}
- 2. C is a convenient hidden variable (i.e., the cluster variable)
- focuses attention on subsets of the data
- e.g., for visualization, clustering, etc.
- C might have a physical/real interpretation
- but not necessarily so
17Probabilistic Clustering: Mixture Models
- assume a probabilistic model for each component cluster
- mixture model: f(x) = Σk=1..K wk fk(x; θk)  (see the sketch below)
- the wk are the K mixing weights
- 0 ≤ wk ≤ 1 and Σk=1..K wk = 1
- where the K component densities fk(x; θk) can be
- Gaussian
- Poisson
- exponential
- ...
- Note
- Assumes a model for the data (advantages and disadvantages)
- Results in probabilistic memberships p(cluster k | x), also called responsibilities
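To make the mixture density concrete, here is a minimal numpy sketch of f(x) = Σk wk fk(x; θk) for a one-dimensional Gaussian mixture. The function names and example parameter values are illustrative choices, not part of the lecture.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_density(x, weights, mus, sigmas):
    """f(x) = sum_k w_k f_k(x; theta_k) for a 1-D Gaussian mixture."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# K = 2 components; the mixing weights are nonnegative and sum to 1
weights = [0.3, 0.7]
mus     = [-1.0, 2.0]
sigmas  = [0.5, 1.0]
print(mixture_density(0.0, weights, mus, sigmas))
```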
18Gaussian Mixture Models (GMM)
- model for the k-th component is normal: N(μk, Σk)
- often assume diagonal covariance: Σjj = σj², Σij = 0 for i ≠ j
- or sometimes even simpler: Σjj = σ², Σij = 0 for i ≠ j
- f(x) = Σk=1..K wk fk(x; θk) with θk = <μk, σk> or <μk, Σk>
- generative model (sampling sketch below):
- randomly choose a component
- selected with probability wk
- generate x ~ N(μk, Σk)
- note: μk and σk are both d-dimensional vectors
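The generative view translates directly into sampling code. A small sketch, assuming diagonal-covariance Gaussian components in d = 2 dimensions; the parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n, weights, mus, covs):
    """Generative model: pick component k with probability w_k, then draw x ~ N(mu_k, Sigma_k)."""
    K = len(weights)
    ks = rng.choice(K, size=n, p=weights)   # component chosen with probability w_k
    X = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in ks])
    return X, ks

weights = [0.4, 0.6]
mus  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.diag([1.0, 1.0]), np.diag([0.5, 2.0])]
X, labels = sample_gmm(500, weights, mus, covs)
print(X.shape, np.bincount(labels))
```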
19Learning Mixture Models from Data
- Score function: the log-likelihood L(θ)
- L(θ) = log p(X | θ) = log ΣH p(X, H | θ)
- H = hidden variables (the cluster membership of each x)
- L(θ) cannot be optimized directly
- EM Procedure (skeleton sketched below)
- General technique for maximizing the log-likelihood with missing data
- For mixtures:
- E-step: compute memberships p(k | x) = wk fk(x; θk) / f(x)
- M-step: pick a new θ to maximize the expected data log-likelihood
- Iterate: guaranteed to climb to a (local) maximum of L(θ)
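The EM alternation on this slide can be summarized as a short driver loop. This is only a skeleton: e_step, m_step, and log_likelihood are model-specific callables (placeholders here, not the lecture's code); concrete versions for Gaussian mixtures accompany the next slides.

```python
def em(X, params, e_step, m_step, log_likelihood, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate E and M steps until the log-likelihood stops improving.
    The three callables are model-specific placeholders in this sketch."""
    prev_ll = -float("inf")
    for _ in range(max_iter):
        resp = e_step(X, params)       # E-step: memberships p(k | x_i)
        params = m_step(X, resp)       # M-step: maximize the expected data log-likelihood
        ll = log_likelihood(X, params)
        if ll - prev_ll < tol:         # L(theta) never decreases; stop at a (local) maximum
            break
        prev_ll = ll
    return params, ll
```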
20The E (Expectation) Step: Responsibilities
Given the current K clusters and parameters, and n data points
E-step: Compute p(data point i is in group k)
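A sketch of this E-step for a Gaussian mixture with diagonal covariances, computed in log space for numerical stability; the function name and diagonal-covariance assumption are illustrative choices, not the lecture's notation.

```python
import numpy as np

def e_step(X, weights, mus, sigmas):
    """Responsibilities r[i, k] = w_k f_k(x_i) / sum_j w_j f_j(x_i),
    for Gaussian components with diagonal covariance (sigmas[k] holds per-dimension std devs)."""
    n, d = X.shape
    K = len(weights)
    log_r = np.zeros((n, K))
    for k in range(K):
        z = (X - mus[k]) / sigmas[k]                 # standardized residuals, shape (n, d)
        log_r[:, k] = (np.log(weights[k])
                       - 0.5 * np.sum(z ** 2, axis=1)
                       - np.sum(np.log(sigmas[k]))
                       - 0.5 * d * np.log(2 * np.pi))
    log_r -= log_r.max(axis=1, keepdims=True)        # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)          # rows sum to 1: p(k | x_i)
```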
21The M (Maximization) Step: Re-estimate parameters
New parameters for the K clusters, from the n data points
M-step: Compute θ, given the n data points and memberships
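A matching M-step sketch for the same diagonal-covariance Gaussian mixture, computing weighted counts, means, and per-dimension standard deviations from the responsibility matrix; again an illustrative implementation, not the lecture's code.

```python
import numpy as np

def m_step(X, r):
    """Re-estimate (weights, means, per-dimension std devs) from responsibilities r, shape (n, K)."""
    n, d = X.shape
    Nk = r.sum(axis=0)                        # effective number of points in each cluster
    weights = Nk / n                          # w_k = N_k / n
    mus = (r.T @ X) / Nk[:, None]             # mu_k = (1 / N_k) * sum_i r_ik x_i
    sigmas = np.sqrt(np.stack([
        (r[:, k:k + 1] * (X - mus[k]) ** 2).sum(axis=0) / Nk[k]   # responsibility-weighted variance
        for k in range(r.shape[1])
    ]))
    return weights, mus, sigmas
```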
22Complexity of EM for mixtures
K models
n data points
Complexity per iteration scales as O(n K f(p)), where f(p) is the per-point, per-component cost as a function of the dimensionality p
23Comments on Mixtures and EM Learning
- Complexity of each EM iteration
- Depends on the probabilistic model being used
- e.g., for Gaussians, the E-step is O(nK), the M-step is O(nKp²)
- Sometimes the E- or M-step is not closed form
- → can require numerical optimization or sampling within each iteration
- Generalized EM (GEM): instead of maximizing the likelihood, just increase it
- EM can be thought of as hill-climbing, with direction and step-size provided automatically
- K-means as a special case of EM (see the sketch below)
- Gaussian mixtures with isotropic (diagonal, equi-variance) Σk's
- Approximate the E-step by choosing the most likely cluster (instead of using membership probabilities)
- Generalizations
- Mixtures of multinomials for text data
- Mixtures of Markov chains for Web sequences
- more
- Will be discussed later in lectures on text and Web data
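The "most likely cluster" approximation above amounts to replacing the soft responsibilities with a 0/1 winner-take-all assignment. A small sketch, assuming equal weights and equal isotropic covariances so that the most likely cluster is simply the nearest mean:

```python
import numpy as np

def hard_e_step(X, mus):
    """K-means-style E-step: each point gets responsibility 1 for its nearest mean, 0 elsewhere."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)    # squared distances, shape (n, K)
    r = np.zeros_like(d2)
    r[np.arange(len(X)), d2.argmin(axis=1)] = 1.0                # winner-take-all membership
    return r
```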
24EM for Gaussian Mixtures
- Gaussian Mixture
- Log Likelihood
25EM for Gaussian Mixtures
- E-Step: Responsibilities
- M-Step: Re-estimate parameters
- Evaluate log likelihood
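Putting the pieces together, here is a compact EM driver for a diagonal-covariance Gaussian mixture that alternates the E- and M-steps and evaluates the log-likelihood each iteration. It reuses the e_step and m_step sketches given with slides 20-21; the initialization scheme and stopping rule are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def log_likelihood(X, weights, mus, sigmas):
    """L(theta) = sum_i log sum_k w_k N(x_i; mu_k, diag(sigma_k^2)), via log-sum-exp."""
    n, d = X.shape
    log_p = np.zeros((n, len(weights)))
    for k in range(len(weights)):
        z = (X - mus[k]) / sigmas[k]
        log_p[:, k] = (np.log(weights[k]) - 0.5 * np.sum(z ** 2, axis=1)
                       - np.sum(np.log(sigmas[k])) - 0.5 * d * np.log(2 * np.pi))
    m = log_p.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))).sum())

def fit_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a diagonal-covariance GMM (uses e_step / m_step from the earlier sketches)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, size=K, replace=False)]     # initialize means at random data points
    sigmas = np.ones((K, d)) * X.std(axis=0)          # start with the overall per-dimension spread
    prev = -np.inf
    for _ in range(n_iter):
        r = e_step(X, weights, mus, sigmas)           # E-step: responsibilities
        weights, mus, sigmas = m_step(X, r)           # M-step: re-estimate parameters
        ll = log_likelihood(X, weights, mus, sigmas)  # evaluate log-likelihood
        if ll - prev < tol:                           # non-decreasing under EM
            break
        prev = ll
    return weights, mus, sigmas, ll
```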
26-34 (No Transcript)
35Selecting K in mixture models
- cannot just choose the K that maximizes the likelihood
- Likelihood L(θ) is always larger for larger K
- Model selection alternatives (sketch of 1) and 3) below):
- 1) penalize complexity
- e.g., BIC = L(θ) - (d/2) log n, with d = number of parameters (Bayesian Information Criterion)
- Asymptotically correct under certain assumptions
- Often used in practice for mixture models even though the assumptions behind the theory are not met
- 2) Bayesian: compute posteriors p(K | data)
- p(K | data) requires computation of p(data | K), the marginal likelihood
- Can be tricky to compute for mixture models
- Recent work on Dirichlet process priors has made this more practical
- 3) (cross-)validation
- Score different models by log p(Xtest | θ)
- split the data into train and validation sets
- Works well on large data sets
- Can be noisy on small data sets (log L is sensitive to outliers)
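A sketch of options 1) and 3) using scikit-learn's GaussianMixture (an assumption about tooling; the lecture does not prescribe a library). Note that sklearn's .bic() follows the convention -2 log L + d log n, so it is minimized, whereas the penalized log-likelihood on the slide is maximized; the two orderings agree.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

def select_k(X, k_values=range(1, 11), seed=0):
    """Score candidate K by BIC on the training split and by held-out log-likelihood."""
    X_train, X_val = train_test_split(X, test_size=0.3, random_state=seed)
    results = []
    for k in k_values:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X_train)
        bic = gmm.bic(X_train)       # sklearn convention: lower BIC is better
        val_ll = gmm.score(X_val)    # mean held-out log-likelihood per point: higher is better
        results.append((k, bic, val_ll))
    k_by_bic = min(results, key=lambda t: t[1])[0]
    k_by_val = max(results, key=lambda t: t[2])[0]
    return k_by_bic, k_by_val, results
```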
36Example of BIC Score for Red-Blood Cell Data
37Example of BIC Score for Red-Blood Cell Data
True number of classes (2) selected by BIC
38Relationship between K-Means and Mix-of-Gaussians
- Welling (Clustering notes)
- parameter a
- p(x,i) → p(x,i)^a
- a = 1: get Mixture of Gaussians
- a → ∞: get K-Means
- Bishop
- Σ = εI (diagonal/isotropic covariance matrix)
- ε → 0: get K-Means (see the sketch below)
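A small numerical sketch of Bishop's limit: with equal weights and a shared isotropic covariance εI, the responsibilities sharpen toward hard nearest-mean assignments as ε shrinks. The data and means below are made up for illustration.

```python
import numpy as np

def isotropic_responsibilities(X, mus, eps):
    """Responsibilities for equal-weight Gaussian components with covariance eps * I."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)    # squared distances, shape (n, K)
    log_r = -0.5 * d2 / eps
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
mus = np.array([[0.0, 0.0], [2.0, 2.0]])
for eps in [1.0, 0.1, 0.01]:
    # as eps -> 0 the rows approach 0/1 indicators: K-means' hard assignments
    print(eps, np.round(isotropic_responsibilities(X, mus, eps), 3))
```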
39Next
- Use EM to learn a model for documents