Title: Part 2: Unsupervised Learning
1. Machine Learning Techniques for Computer Vision
- Part 2: Unsupervised Learning
Christopher M. Bishop
Microsoft Research Cambridge
ECCV 2004, Prague
2. Overview of Part 2
- Mixture models
- EM
- Variational Inference
- Bayesian model complexity
- Continuous latent variables
3. The Gaussian Distribution
- Multivariate Gaussian
- Maximum likelihood estimate of the mean: $\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$
4. Gaussian Mixtures
- Linear super-position of Gaussians: $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$
- Normalization and positivity require $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K}\pi_k = 1$
5. Example: Mixture of 3 Gaussians
6. Maximum Likelihood for the GMM
- Log likelihood function (written out below)
- Sum over components appears inside the log
- no closed-form ML solution
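For reference, the log likelihood the slide refers to, in its standard form (reconstructed here because the slide equations are not in the transcript):

    \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
      = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\,
        \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}

Because the sum over components sits inside the logarithm, setting the derivatives with respect to the parameters to zero gives coupled equations, hence no closed-form maximum likelihood solution.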
7-9. EM Algorithm: Informal Derivation (image-only slide sequence)
10. EM Algorithm: Informal Derivation
- Can interpret the mixing coefficients as prior probabilities
- Corresponding posterior probabilities (responsibilities; see the equations below)
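Written out, the E-step responsibilities and the M-step re-estimation equations (standard EM for a Gaussian mixture, included here because the slide images are missing):

    \gamma(z_{nk}) = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
                          {\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},
    \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})

    \boldsymbol{\mu}_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n, \qquad
    \boldsymbol{\Sigma}_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})
        (\mathbf{x}_n - \boldsymbol{\mu}_k^{\mathrm{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\mathrm{new}})^{\mathrm{T}}, \qquad
    \pi_k^{\mathrm{new}} = \frac{N_k}{N}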
11. Old Faithful Data Set
(Scatter plot: duration of eruption in minutes vs. time between eruptions in minutes)
12-17. (Image-only slides)
18. Latent Variable View of EM
- To sample from a Gaussian mixture
- first pick one of the components k with probability $\pi_k$
- then draw a sample from that component $\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$
- repeat these two steps for each new data point (see the sketch below)
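A minimal NumPy sketch of this ancestral sampling procedure; the helper name sample_gmm and the component parameters are illustrative, not from the tutorial:

    import numpy as np

    rng = np.random.default_rng(0)

    # Example 2-D mixture with K = 3 components (made-up parameters).
    pi = np.array([0.5, 0.3, 0.2])                        # mixing coefficients, sum to 1
    mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
    Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])])

    def sample_gmm(n_samples):
        # Step 1: pick a component for each point with probability pi_k.
        z = rng.choice(len(pi), size=n_samples, p=pi)
        # Step 2: draw each point from the chosen Gaussian component.
        X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
        return X, z

    X, labels = sample_gmm(500)
    print(X.shape, np.bincount(labels))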
19. Latent Variable View of EM
- Goal: given a data set, find the mixture parameters $\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$
- Suppose we knew the colours (i.e. which component generated each point)
- maximum likelihood would involve fitting each component to the corresponding cluster
- Problem: the colours are latent (hidden) variables
20. Incomplete and Complete Data
(Figures: the incomplete data, points without component labels, and the complete data, points labelled by component)
21. Latent Variable Viewpoint
22. Latent Variable Viewpoint
- Binary latent variables $z_{nk} \in \{0, 1\}$ describing which component generated each data point
- Conditional distribution of the observed variable: $p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
- Prior distribution of the latent variables: $p(z_k = 1) = \pi_k$
- Marginalizing over the latent variables we obtain $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
23. Graphical Representation of GMM
24. Latent Variable View of EM
- Suppose we knew the values for the latent variables
- maximize the complete-data log likelihood
- trivial closed-form solution: fit each component to the corresponding set of data points
- We don't know the values of the latent variables
- however, for given parameter values we can compute the expected values of the latent variables (see the code sketch below)
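A compact NumPy/SciPy sketch of the resulting EM iteration for a GMM; this is a minimal illustration under my own initialisation choices, not the original tutorial code:

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iters=100, seed=0):
        """Minimal EM for a Gaussian mixture (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        # Initialisation: random data points as means, shared covariance, uniform mixing coefficients.
        mu = X[rng.choice(N, size=K, replace=False)]
        Sigma = np.stack([np.cov(X.T) for _ in range(K)])
        pi = np.full(K, 1.0 / K)
        for _ in range(n_iters):
            # E-step: responsibilities gamma[n, k] = p(z_k = 1 | x_n).
            dens = np.column_stack(
                [multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
            gamma = pi * dens
            gamma /= gamma.sum(axis=1, keepdims=True)
            # M-step: re-estimate parameters using the responsibilities.
            Nk = gamma.sum(axis=0)
            mu = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            pi = Nk / N
        return pi, mu, Sigma

    # Usage, e.g. with data from the earlier sampling sketch: em_gmm(X, K=3)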
25. Posterior Probabilities (colour coded)
26. Over-fitting in Gaussian Mixture Models
- Infinities in the likelihood function when a component collapses onto a single data point, with its variance shrinking to zero
- Also, maximum likelihood cannot determine the number K of components
27. Cross-Validation
- Can select model complexity using an independent validation data set
- If data is scarce, use cross-validation
- partition data into S subsets
- train on S-1 subsets
- test on the remainder
- repeat and average (see the sketch below)
- Disadvantages
- computationally expensive
- can only determine one or two complexity parameters
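The procedure in the bullets above, as a short scikit-learn sketch for choosing the number of components K; the library choice and the helper name cv_score_for_K are my own, not prescribed by the tutorial. The held-out score here is the mean per-point log likelihood:

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import KFold

    def cv_score_for_K(X, K, S=5, seed=0):
        """Average held-out log likelihood per point for a K-component GMM."""
        scores = []
        for train_idx, test_idx in KFold(n_splits=S, shuffle=True,
                                         random_state=seed).split(X):
            gmm = GaussianMixture(n_components=K, random_state=seed).fit(X[train_idx])
            scores.append(gmm.score(X[test_idx]))   # mean log likelihood on held-out fold
        return np.mean(scores)

    # Pick the K with the highest cross-validated score, e.g.:
    # best_K = max(range(1, 10), key=lambda K: cv_score_for_K(X, K))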
28. Bayesian Mixture of Gaussians
- Parameters and latent variables appear on an equal footing
- Conjugate priors: Dirichlet over the mixing coefficients, Gaussian-Wishart over the means and precisions
29. Data Set Size
- Problem 1: learn a function from 100 (slightly) noisy examples
- data set is computationally small but statistically large
- Problem 2: learn to recognize 1,000 everyday objects from 5,000,000 natural images
- data set is computationally large but statistically small
- Bayesian inference
- computationally more demanding than ML or MAP (but see the discussion of Gaussian mixtures later)
- significant benefit for statistically small data sets
30. Variational Inference
- Exact Bayesian inference is intractable
- Markov chain Monte Carlo
- computationally expensive
- issues of convergence
- Variational inference
- broadly applicable deterministic approximation
- let $\mathbf{Z}$ denote all latent variables and parameters
- approximate the true posterior $p(\mathbf{Z}\mid\mathbf{X})$ using a simpler distribution $q(\mathbf{Z})$
- minimize the Kullback-Leibler divergence $\mathrm{KL}(q\,\|\,p)$
31. General View of Variational Inference
- For an arbitrary $q(\mathbf{Z})$ we can decompose the log marginal likelihood as $\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p)$ (written out below)
- Maximizing $\mathcal{L}(q)$ over all possible $q$ would give the true posterior
- this is intractable by definition
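Written out, the decomposition referred to above, in its standard form:

    \ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p)

    \mathcal{L}(q) = \int q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})}\,\mathrm{d}\mathbf{Z},
    \qquad
    \mathrm{KL}(q\,\|\,p) = -\int q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X})}{q(\mathbf{Z})}\,\mathrm{d}\mathbf{Z}

Since the KL divergence is non-negative, $\mathcal{L}(q)$ is a lower bound on $\ln p(\mathbf{X})$, and maximizing $\mathcal{L}(q)$ is equivalent to minimizing $\mathrm{KL}(q\,\|\,p)$.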
32. Variational Lower Bound
33. Factorized Approximation
- Goal: choose a family of q distributions which are
- sufficiently flexible to give a good approximation
- sufficiently simple to remain tractable
- Here we consider factorized distributions $q(\mathbf{Z}) = \prod_i q_i(\mathbf{Z}_i)$
- No further assumptions are required!
- Optimal solution for one factor, keeping the remainder fixed (written out below)
- coupled solutions, so initialize and then cyclically update
- message-passing view (Winn and Bishop, 2004)
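The optimal-factor equation referred to above, in its standard form; the expectation is taken with respect to all factors other than $q_j$:

    \ln q_j^{*}(\mathbf{Z}_j) = \mathbb{E}_{i \neq j}\!\left[ \ln p(\mathbf{X}, \mathbf{Z}) \right] + \mathrm{const}

Each factor's update depends on the current estimates of the other factors, which is why the factors are initialized and then cyclically updated until the bound converges.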
34. (Image-only slide)
35. Lower Bound
- The bound $\mathcal{L}(q)$ can also be evaluated explicitly
- Useful for verifying the maths and the code
- Also useful for model comparison
36. Illustration: Univariate Gaussian
- Likelihood function: $p(\mathcal{D}\mid\mu,\tau) = \prod_{n=1}^{N}\mathcal{N}(x_n\mid\mu,\tau^{-1})$
- Conjugate prior: Gaussian-Gamma
- Factorized variational distribution: $q(\mu,\tau) = q_\mu(\mu)\,q_\tau(\tau)$
37. Initial Configuration
38. After Updating
39. After Updating
40. Converged Solution
41. Variational Mixture of Gaussians
- Assume a factorized posterior distribution $q(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}) = q(\mathbf{Z})\,q(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})$
- No other approximations needed!
42. Variational Equations for GMM
43. Lower Bound for GMM
44. VIBES
- Bishop, Spiegelhalter and Winn (2002)
45. ML Limit
- If instead we choose $q(\boldsymbol{\theta}) = \delta(\boldsymbol{\theta} - \boldsymbol{\theta}_0)$ (a point estimate of the parameters), we recover the maximum likelihood EM algorithm
46. Bound vs. K for Old Faithful Data
47. Bayesian Model Complexity
48. Sparse Bayes for Gaussian Mixture
- Corduneanu and Bishop (2001)
- Start with large value of K
- treat mixing coefficients as parameters
- maximize marginal likelihood
- prunes out excess components
49-50. (Image-only slides)
51. Summary: Variational Gaussian Mixtures
- Simple modification of maximum likelihood EM code
- Small computational overhead compared to EM
- No singularities
- Automatic model order selection
52. Continuous Latent Variables
- Conventional PCA
- data covariance matrix $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^{\mathrm{T}}$
- eigenvector decomposition $\mathbf{S}\mathbf{u}_i = \lambda_i \mathbf{u}_i$
- Minimizes the sum-of-squares projection error (see the sketch below)
- not a probabilistic model
- how should we choose L?
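A minimal NumPy sketch of conventional PCA as described above; the helper name pca is illustrative, and L is the chosen number of principal components:

    import numpy as np

    def pca(X, L):
        """Project data X (N x D) onto its top-L principal components."""
        x_bar = X.mean(axis=0)
        Xc = X - x_bar
        S = Xc.T @ Xc / len(X)                  # data covariance matrix
        eigvals, eigvecs = np.linalg.eigh(S)    # eigh, since S is symmetric
        order = np.argsort(eigvals)[::-1]       # sort eigenvalues in decreasing order
        U = eigvecs[:, order[:L]]               # top-L eigenvectors (principal directions)
        return Xc @ U, U, eigvals[order]

    # Usage: Z, U, lam = pca(X, L=2)   # Z holds the L-dimensional projections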
53. Probabilistic PCA
- Tipping and Bishop (1998)
- L-dimensional continuous latent space $\mathbf{z}$
- D-dimensional data space: $\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}$
- isotropic noise covariance $\sigma^2\mathbf{I}$ gives (probabilistic) PCA; a general diagonal noise covariance gives factor analysis
54. Probabilistic PCA
- Marginal distribution: $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\,\mathbf{W}\mathbf{W}^{\mathrm{T}} + \sigma^2\mathbf{I})$
- Advantages
- exact ML solution (see the sketch below)
- computationally efficient EM algorithm
- captures dominant correlations with few parameters
- mixtures of PPCA
- Bayesian PCA
- building block for more complex models
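A sketch of the exact ML solution, assuming the standard Tipping and Bishop result: $\mathbf{W}_{\mathrm{ML}} = \mathbf{U}_L(\boldsymbol{\Lambda}_L - \sigma^2\mathbf{I})^{1/2}$ up to an arbitrary rotation, where $\mathbf{U}_L$ holds the top-L eigenvectors of the data covariance, $\boldsymbol{\Lambda}_L$ the corresponding eigenvalues, and $\sigma^2_{\mathrm{ML}}$ is the average of the discarded eigenvalues. The helper name ppca_ml is my own:

    import numpy as np

    def ppca_ml(X, L):
        """Closed-form ML fit of probabilistic PCA (sketch); requires L < D."""
        N, D = X.shape
        mu = X.mean(axis=0)
        S = (X - mu).T @ (X - mu) / N
        eigvals, eigvecs = np.linalg.eigh(S)
        order = np.argsort(eigvals)[::-1]
        lam, U = eigvals[order], eigvecs[:, order]
        sigma2 = lam[L:].mean()                             # average of discarded eigenvalues
        W = U[:, :L] @ np.diag(np.sqrt(lam[:L] - sigma2))   # W_ML up to an arbitrary rotation
        return mu, W, sigma2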
55-61. EM for PCA (image-only slide sequence; the standard update equations are given below for reference)
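For reference, the standard EM updates for probabilistic PCA (Tipping and Bishop), written from the standard formulation rather than copied from the slide images; here $\bar{\mathbf{x}}$ is the sample mean and $\mathbf{M} = \mathbf{W}^{\mathrm{T}}\mathbf{W} + \sigma^2\mathbf{I}$.

E-step:

    \mathbb{E}[\mathbf{z}_n] = \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{x}_n - \bar{\mathbf{x}}), \qquad
    \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathrm{T}}] = \sigma^2 \mathbf{M}^{-1}
        + \mathbb{E}[\mathbf{z}_n]\,\mathbb{E}[\mathbf{z}_n]^{\mathrm{T}}

M-step:

    \mathbf{W}^{\mathrm{new}} = \left[ \sum_{n}(\mathbf{x}_n - \bar{\mathbf{x}})\,\mathbb{E}[\mathbf{z}_n]^{\mathrm{T}} \right]
                                \left[ \sum_{n}\mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathrm{T}}] \right]^{-1}

    (\sigma^2)^{\mathrm{new}} = \frac{1}{ND} \sum_{n} \left\{ \|\mathbf{x}_n - \bar{\mathbf{x}}\|^2
        - 2\,\mathbb{E}[\mathbf{z}_n]^{\mathrm{T}} (\mathbf{W}^{\mathrm{new}})^{\mathrm{T}} (\mathbf{x}_n - \bar{\mathbf{x}})
        + \mathrm{Tr}\!\left( \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathrm{T}}]
          (\mathbf{W}^{\mathrm{new}})^{\mathrm{T}} \mathbf{W}^{\mathrm{new}} \right) \right\}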
62. Bayesian PCA
- Bishop (1998)
- Gaussian prior over the columns of $\mathbf{W}$, governed by hyperparameters $\alpha_i$
- Automatic relevance determination (ARD) prunes away unneeded columns (see below)
(Figure comparing the columns of W for ML PCA and Bayesian PCA)
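The ARD prior referred to above, in its standard form; each column $\mathbf{w}_i$ of $\mathbf{W}$ has its own precision hyperparameter $\alpha_i$, and a large $\alpha_i$ drives that column towards zero, switching off the corresponding latent dimension and so determining the effective dimensionality automatically:

    p(\mathbf{W} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{L} \left( \frac{\alpha_i}{2\pi} \right)^{D/2}
        \exp\!\left( -\tfrac{1}{2}\,\alpha_i\,\mathbf{w}_i^{\mathrm{T}}\mathbf{w}_i \right)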
63. Non-linear Manifolds
- Example: images of a rigid object
64. Bayesian Mixture of BPCA Models
65. (Image-only slide)
66. Flexible Sprites
- Jojic and Frey (2001)
- Automatic decomposition of video sequence into
- background model
- ordered set of masks (one per object per frame)
- foreground model (one per object per frame)
67. (Image-only slide)
68. Transformed Component Analysis
- Generative model
- Now include transformations (translations)
- Extend to L layers
- Inference is intractable, so use the variational framework
69. (Image-only slide)
70. Bayesian Constellation Model
- Li, Fergus and Perona (2003)
- Object recognition from small training sets
- Variational treatment of fully Bayesian model
71. Bayesian Constellation Model
72. Summary of Part 2
- Discrete and continuous latent variables
- EM algorithm
- Build complex models from simple components
- represented graphically
- incorporates prior knowledge
- Variational inference
- Bayesian model comparison