Title: Generative Models of Affinity Matrices
1. Generative Models of Affinity Matrices
- Rómer Rosales and Brendan Frey
- romer@psi.toronto.edu, frey@psi.toronto.edu
- Probabilistic and Statistical Inference Group
- University of Toronto
2. Overview
- Background
- Generative Models of Affinity Matrices
- Spectral Clustering
- A Graphical Model of Spectral Clustering
- A More General View of Spectral Clustering
- Limitations of Spectral Clustering
- New Models of Affinity Matrices
- Experimental Results
- Conclusions
3. Background and Notation
- Data set
  - We are given a set of N elements, x_1, ..., x_N
  - With a given measure of similarity between pairs
  - Alternatively, an affinity L_ij for each pair of elements in the data set
- Class labels
  - A finite set of size M, e.g., {1, ..., M}
- Clusters
  - We want to infer a distribution over, or the most likely, class assignment for each data set element
4. Affinity Matrices
- Different forms, e.g., with the standard L2 measure (the Gaussian kernel sketched below): L_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
- Intuitively separates close points from far points
- What scale sigma? What form for L?
- Idea: use a simpler form for L, and instead incorporate knowledge of the specific properties of the desired clustering into a well-defined probabilistic model
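To make the Gaussian form above concrete, here is a minimal sketch of how such an affinity matrix could be computed; the function name and the toy data are illustrative, not from the talk.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Pairwise affinities L_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    X: (N, D) array of data points; sigma is the scale parameter
    whose choice the slide flags as problematic.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Example: two well-separated blobs give a near block-diagonal L.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(5, 0.3, (5, 2))])
L = gaussian_affinity(X, sigma=1.0)
```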
5. Latent Representation Idea
- There is a (usually low-dimensional) vector z_i associated with each input data point
- We don't observe the z_i, but we observe some function of them (e.g., the pair-wise affinities, or some high-dimensional version of them)
- We want to find a probability distribution over the z_i to explain the observations
6. A Generative Model for L
- Joint distribution (a plausible reconstruction follows)
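The joint-distribution equation did not survive extraction. A plausible reconstruction, assuming only the ingredients named elsewhere in the deck (affinities L_ij, latent vectors z_i, class labels c_i, parameters theta), is:

```latex
p(L, Z, c, \theta) \;=\;
\underbrace{\prod_{i<j} p(L_{ij} \mid z_i, z_j)}_{\text{affinity likelihood}}
\;\underbrace{\prod_{i} p(z_i \mid c_i, \theta)}_{\text{latent vectors}}
\; p(c)\, p(\theta)
```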
7. Spectral Clustering Instance
- For spectral clustering, choose a Gaussian conditional: p(L_ij | z_i, z_j) = N(L_ij; z_i^T z_j, sigma^2)
- Uses the same variance sigma^2 for all pairs
- In SC, the key structural form is a product of hidden-variable pairs, z_i^T z_j
- Under this likelihood, the MAP for Z minimizes the Frobenius norm ||L - Z Z^T||_F (see slide 8)
8. Spectral Clustering (cont.)
- The standard algorithm for spectral clustering is greedy inference in this model
- Step 1
  - Choose the best assignment for the (d-dim) hidden variables (MAP) to minimize the Frobenius norm: SVD (best rank-d approximation)
- Step 2
  - Choose good means (and covariances) given all cluster eigenvector rows (e.g., using k-means, mixture of Gaussians, etc.); a sketch of both steps follows
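A minimal sketch of the two-step procedure, under the Frobenius/rank-d reading above; the function and argument names are illustrative, and common variants (e.g., working with a normalized graph Laplacian instead of L) are omitted.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster(L, d, M, seed=0):
    """Greedy two-step inference described on the slide.

    Step 1: MAP for the latent matrix Z under ||L - Z Z^T||_F,
            i.e., the best rank-d approximation (top-d eigenvectors).
    Step 2: cluster the rows of Z into M classes with k-means.
    """
    # L is symmetric, so SVD and eigendecomposition coincide up to signs.
    eigvals, eigvecs = np.linalg.eigh(L)
    top = np.argsort(eigvals)[::-1][:d]              # d largest eigenvalues
    Z = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
    _, labels = kmeans2(Z, M, minit='++', seed=seed)  # "good means" heuristic
    return Z, labels
```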
9. Generalizing SC
- New algorithms for optimization
  - Rooted in the probabilistic view, e.g., different forms of approximate inference in graphical models
- New models based on SC components
  - Same basic components, but using a new structural form (e.g., different hidden-variable interactions)
- New models of affinity matrices beyond SC
10. New Algorithms for Optimization
- A simple example algorithm: inference using the EM algorithm
  - E-step: find posteriors over class labels
  - M-step: find MAP estimates for the means and variances
  - Jointly optimize the latent variables given the rest, instead of the greedy two-step optimization (a sketch of one EM iteration follows)
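A minimal sketch of one EM iteration under the reading above: posteriors over class labels in the E-step, then means and variances in the M-step. A mixture of spherical Gaussians over the latent rows Z is assumed; all names are illustrative.

```python
import numpy as np

def em_step(Z, means, variances, priors):
    """One EM iteration over latent rows Z (N, d) for M spherical Gaussians."""
    N, d = Z.shape
    M = len(priors)
    # E-step: responsibilities r[i, m] = p(c_i = m | z_i).
    log_r = np.empty((N, M))
    for m in range(M):
        sq = np.sum((Z - means[m]) ** 2, axis=1)
        log_r[:, m] = (np.log(priors[m])
                       - 0.5 * d * np.log(2 * np.pi * variances[m])
                       - sq / (2 * variances[m]))
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize the softmax
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximum-likelihood updates of means, variances, priors.
    Nm = r.sum(axis=0)
    means = (r.T @ Z) / Nm[:, None]
    variances = np.array([
        np.sum(r[:, m] * np.sum((Z - means[m]) ** 2, axis=1)) / (Nm[m] * d)
        for m in range(M)])
    priors = Nm / N
    return means, variances, priors, r
```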
11. New Algorithms
12. Example (cont.)
- Usually converges to the same solution as SC
- Other algorithms: variational inference on the graphical model
13. New Models
- A slight change in the conditional distribution (a more intuitive example)
- Perhaps this can be used to explain the two ends of the spectrum (at one limit, clustering; at the other, dimensionality reduction)
14. Overview
- Background
- Generative Models of Affinity Matrices
- Spectral Clustering
- A Graphical Model of Spectral Clustering
- Generalizing Spectral Clustering
- Limitations in Spectral Clustering
- New models of Affinity Matrices
- Experimental Results
- Conclusions
15. Usual SC Examples
16. Less Usual SC Examples
- Which clustering is better?
17. Less Usual SC Examples
- Spectral Clustering results
18. Less Usual SC Examples
20. Remarks
- There is no such thing as a correct clustering (in general)
- Some clusterings may simply agree or disagree with our perception
- The optimal choice of scale is an issue
- We can usually explain those clusterings that are produced
21. New Models of Affinity Matrices
- A basic Bayesian-net view of affinity matrices
- Also an MRF representation; Ising models
22. Bayes Net of the Bayes Net
23. Inference in the Basic Affinity Matrix BN (or MRF)
- For example (for clustering)
- Difficult to perform inference
  - The MAP estimate of the class labels is equivalent to the MAX-CUT problem (NP-complete); a sketch of the equivalence follows
- Can always test approximate inference; the state space is just large
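The equation backing the MAX-CUT claim is missing from the transcript; a hedged sketch, assuming a likelihood that factorizes over pairs given a labeling c, is:

```latex
\hat{c} \;=\; \arg\max_{c} \sum_{i<j} \log p(L_{ij} \mid c_i, c_j)
\;=\; \arg\max_{c} \sum_{\substack{i<j \\ c_i \neq c_j}} w_{ij} \;+\; \mathrm{const},
\qquad
w_{ij} \;=\; \log \frac{p(L_{ij} \mid c_i \neq c_j)}{p(L_{ij} \mid c_i = c_j)}
```

Up to the constant (the all-same-class term), the MAP labeling maximizes the total weight of pairs split across classes, i.e., a weighted MAX-CUT; the weights w_ij are illustrative notation, not from the slides.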
24. Scaled Affinity Matrix BN
- Each point is only connected to a subset of same-class points, represented by a random variable
- The internal class scale is also a random variable
25. Scaled Affinity Matrix
- Avoids setting an explicit affinity scale
- Allows different class-dependent scales and variances
- Representation for the scaled model
26. Scaled Affinity Matrix
- Admissible graphs
  - Constrain same-class nodes to be connected
  - Indicator function (a sketch follows after this list)
- The remaining conditional has a simple form
- Intuition: introduces a bias in the random walk
- We do not want to model interclass relationships
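A minimal sketch of such an indicator function, assuming "admissible" means every class's induced subgraph is connected, as the bullets above suggest; the edge and label representations are illustrative.

```python
from collections import defaultdict, deque

def is_admissible(edges, c):
    """Indicator: every class's induced subgraph is connected.

    edges: iterable of (i, j) pairs; c: list mapping node index -> class.
    """
    # Adjacency restricted to same-class edges.
    adj = defaultdict(set)
    for i, j in edges:
        if c[i] == c[j]:
            adj[i].add(j)
            adj[j].add(i)
    nodes_by_class = defaultdict(set)
    for i, ci in enumerate(c):
        nodes_by_class[ci].add(i)
    # BFS within each class; fail if any same-class node is unreachable.
    for nodes in nodes_by_class.values():
        start = next(iter(nodes))
        seen, frontier = {start}, deque([start])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
        if seen != nodes:
            return False
    return True
```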
27. Example
- C = (1, 1, 1, 1, 2, 2, 2, 2)
- A sparse admissible graph, with entries carrying the class scale:
  1 0 0 1 0 0 0 0
  1 1 0 0 0 0 0 0
  1 0 1 1 0 0 0 0
  1 0 1 1 0 0 0 0
  0 0 0 0 2 2 0 0
  0 0 0 0 0 2 2 0
  0 0 0 0 0 0 2 2
  0 0 0 0 0 0 0 2
- The fully connected block form:
  1 1 1 1 0 0 0 0
  1 1 1 1 0 0 0 0
  1 1 1 1 0 0 0 0
  1 1 1 1 0 0 0 0
  0 0 0 0 2 2 2 2
  0 0 0 0 2 2 2 2
  0 0 0 0 2 2 2 2
  0 0 0 0 2 2 2 2
28. Scaled Affinity Matrix
- Another form
- This definition is based on k-nearest neighbors (a sketch follows)
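A minimal sketch of a k-nearest-neighbor graph of the kind this definition could be built on (cf. the K = 4 graph prior in the results slides); the dense-distance implementation is illustrative, not from the talk.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-nearest-neighbor graph on the rows of X (N, D).

    Returns the set of undirected edges (i, j) with i < j.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)            # exclude self-neighbors
    nn = np.argsort(sq, axis=1)[:, :k]      # k nearest per point
    edges = set()
    for i in range(len(X)):
        for j in nn[i]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges
```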
29. Approximate Inference in the Scaled Model
- Assume we can compute the posterior over class labels
- Update the remaining variables based on expectations under this distribution (EM)
- The MAP graph given the class labels is given by the M minimum spanning trees (simple proof based on Zahn 1971, for the case of a single sigma for both classes)
  - Polynomial time
- However, using the MAP estimate, global optimization tends to fall into local minima
  - Solution: we use ICM, updating a single class label (picked at random) at a time
  - Compute the MAP graph (one MST per class!) and iterate
  - We can do this because the classes are now given (a sketch follows)
30. Results (Scaled Model)
31. Results (Scaled Model)
32. Results (Scaled Model)
- Connected graph prior (as before)
- K-neighbor graph prior, K = 4
33. Results (Scaled Model)
34. Results (Scaled Model)
35. Results (Scaled Model)
36. Conclusions
- Affinity matrices in terms of Bayes nets
  - Provides a probabilistic view of spectral clustering and some generalizations
  - Allows incorporating desired clustering properties explicitly into the model
- Scaled Affinity Matrix BN
  - Avoids setting an explicit affinity scale
  - Allows modeling different scales within classes
  - The data probability distribution is no longer constrained to be uniform
- A view of the clustering / dimensionality-reduction continuum
38.
- There is no right answer for clustering
- One way to solve this, up to a point: we can learn beta in a supervised fashion, where the user gives examples of close-by points in each class (e.g., in a different setting, Wagstaff et al., Ping et al.)
- A way to generate LLE from our SC generative model?
- UCI real dataset
- LLE and IB
39.
- The local minima (...)
- Dynamic L vs. static L, with distances from different features
- We have seen how they try to use different features, in line with Gestalt principles
41. Generating Affinity Matrices
42. Inference (cont.)
- Why is the MAP estimate equivalent to finding the M minimum spanning trees?
- I will explain here
43. A Generative Model of L