Title: Research Directions in Adaptive Mixtures and Model-Based Clustering
1. Research Directions in Adaptive Mixtures and Model-Based Clustering
- Wendy L. Martinez
- Office of Naval Research
- April 1, 2005
2. Acknowledgements
- This work has been conducted jointly with Jeff Solka, NSWCDD.
- This work has been supported by the ONR ROPO program.
- Jeff Solka's work has been supported by the ONR ILIR program.
3. Disclaimer
- Work in progress
- Describe research ideas
- Obtain feedback and suggestions
4. Outline
- Model-based Clustering (MBC).
- Mixture models and the EM algorithm
- The agglomerative step
- Adaptive Mixtures Density Estimation
- Kernel density estimation
- Their Synthesis
- Initialization for MB agglomerative clustering
- MB Adaptive Mixtures Density Estimation
- Preliminary Results
- Research Directions
5. Model-Based Clustering
[Flow diagram] Data -> Agglomerative Model-Based Clustering -> dendrogram -> Initialization for EM (1. initial number of components, 2. initial values for parameters) -> EM Algorithm -> BIC -> Highest BIC -> Chosen Model.
Final result - estimated model: 1. number of components, 2. best model M1-M4, 3. parameter estimates.
Standard MBC performs hierarchical clustering starting with the full dataset.
6. MODEL-BASED CLUSTERING
- This technique takes a density function approach.
- Uses finite mixture densities as models for cluster analysis.
- Each component density characterizes a cluster.
7. FINITE MIXTURES - REVIEW
- Model the density as a sum of C weighted densities (see the mixture form after this list).
- The expectation-maximization method is used to estimate the parameters.
- Must assume a distribution for the components - usually the normal distribution.
- Each component characterizes a cluster.
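For reference, the finite mixture model in the notation used on later slides (mixing proportions pk, component means and covariances), with phi the assumed component density (normal here):

```latex
f(\mathbf{x}) \;=\; \sum_{k=1}^{C} p_k\, \phi(\mathbf{x};\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k),
\qquad p_k \ge 0, \quad \sum_{k=1}^{C} p_k = 1 .
```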
8. EXPECTATION-MAXIMIZATION (EM) METHOD
- Method for building or estimating the model.
- Solution of the likelihood equations requires an iterative procedure.
- E Step - Expectation
- Find the probability that each observation belongs to the k-th component density - the posteriors (the tik's).
- M Step - Maximization
- Update all parameters based on the posteriors (pk, mk, Sk). (The two steps are written out for normal components after this list.)
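Written out for normal components (the standard EM iteration for finite mixtures; i indexes the n observations, k the C components):

```latex
\text{E step:}\quad
\hat{\tau}_{ik} \;=\;
\frac{\hat{p}_k\,\phi(\mathbf{x}_i;\hat{\boldsymbol{\mu}}_k,\hat{\boldsymbol{\Sigma}}_k)}
     {\sum_{j=1}^{C}\hat{p}_j\,\phi(\mathbf{x}_i;\hat{\boldsymbol{\mu}}_j,\hat{\boldsymbol{\Sigma}}_j)}
\qquad
\text{M step:}\quad
\hat{p}_k = \frac{1}{n}\sum_{i=1}^{n}\hat{\tau}_{ik},\quad
\hat{\boldsymbol{\mu}}_k = \frac{\sum_i \hat{\tau}_{ik}\,\mathbf{x}_i}{\sum_i \hat{\tau}_{ik}},\quad
\hat{\boldsymbol{\Sigma}}_k = \frac{\sum_i \hat{\tau}_{ik}\,
   (\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k)^{T}}
   {\sum_i \hat{\tau}_{ik}} .
```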
9. EXPECTATION-MAXIMIZATION (EM) METHOD
- Issues
- Can converge to a local optimum.
- Can diverge.
- Requires an initial guess at the parameters of the component densities.
- Need an estimate of the number of components.
- Requires an assumed distribution for the component densities.
10. EXPECTATION-MAXIMIZATION (EM) METHOD
- Model-based clustering addresses these issues:
- Form of the densities constrains the covariance matrices
- Initialization of EM via model-based agglomerative clustering
- Estimate of the number of components via BIC
- Adaptive mixtures:
- Covariance model is the unconstrained version
- Initialization of EM
- Over-determined estimate of the number of components
11. AGGLOMERATIVE MBC
- Regular agglomerative clustering
- Each point starts out in its own cluster.
- The two closest clusters are merged at each step.
- Closeness is determined by the distance and the linkage.
- Model-based agglomerative clustering
- At each step, the two clusters are merged such that the likelihood for the given model is maximized (a minimal sketch of this merge step follows this list).
- We propose using Adaptive Mixtures to initialize MB agglomerative clustering.
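A minimal sketch of the merge step, shown only for the equal spherical model (M1), where maximizing the classification likelihood at each merge is commonly taken to reduce to Ward's sum-of-squares criterion; the function name and the brute-force pair search are illustrative, not the authors' implementation, and the other models use analogous criteria.

```python
import numpy as np

def mb_agglomerate(X, labels, n_merges):
    """Greedy model-based agglomeration under the equal spherical model (M1):
    at each step merge the pair of clusters whose union least increases the
    within-cluster sum of squares.  `labels` can be any starting partition,
    e.g. one produced by AMDE instead of all singletons."""
    labels = np.asarray(labels).copy()
    for _ in range(n_merges):
        ids = np.unique(labels)
        if len(ids) < 2:
            break
        best, best_cost = None, np.inf
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                a, b = ids[i], ids[j]
                Xa, Xb = X[labels == a], X[labels == b]
                na, nb = len(Xa), len(Xb)
                # Increase in within-cluster SS if clusters a and b are merged.
                delta = (na * nb) / (na + nb) * np.sum(
                    (Xa.mean(axis=0) - Xb.mean(axis=0)) ** 2)
                if delta < best_cost:
                    best_cost, best = delta, (a, b)
        labels[labels == best[1]] = best[0]
    return labels
```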
12. MODEL-BASED CLUSTERING
- The best model is chosen using the Bayesian Information Criterion (mM is the number of parameters, LM is the loglikelihood); the criterion is written out after this list.
- The four models are (more models are possible):
- Spherical/equal (M1)
- Spherical/unequal (M2)
- Ellipsoidal/equal (M3)
- Ellipsoidal/unequal (unconstrained) (M4)
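With LM the maximized loglikelihood of model M, mM its number of independent parameters, and n the sample size, the criterion (stated so that larger is better, matching the "highest BIC" step in the flow diagram) is

```latex
\mathrm{BIC}_M \;=\; 2\,L_M(\mathbf{X};\hat{\boldsymbol{\theta}}) \;-\; m_M \log n .
```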
13. Kernel Density Estimation
- Center a kernel at each data point.
- Evaluate the weighted kernel - usually a normal kernel.
- Add the values of the n curves (see the estimator after this list).
- Computationally intensive; must store all of the data; requires a choice of kernel and smoothing parameter.
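The standard kernel estimator in d dimensions with a single smoothing parameter h (kernel K, usually the normal density) is

```latex
\hat{f}(\mathbf{x}) \;=\; \frac{1}{n\,h^{d}} \sum_{i=1}^{n}
      K\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right).
```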
14. ADAPTIVE MIXTURES DENSITY ESTIMATION (AMDE)
- Priebe and Marchette, 1990s.
- Priebe, JASA, 1994
- Hybrid of Kernel Estimator and Mixture Model.
- Number of Terms Driven by the Data.
15. AMDE ALGORITHM
- 1 - Given a New Observation.
- 2 - Update Existing Model Using the Recursive EM.
- or
- 3 - Add a New Term to Explain This Data Point.
16. Recursive EM Update Equations - All Have Hats
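The slide's equations are not reproduced here; a common statement of the recursive updates for a new observation x_(n+1) (the stochastic-approximation form of EM, roughly as in Priebe, 1994; every quantity on the right-hand side is the current estimate, hence "all have hats") is

```latex
\hat{\tau}_{k} =
\frac{\hat{p}^{(n)}_k\,\phi(\mathbf{x}_{n+1};\hat{\boldsymbol{\mu}}^{(n)}_k,\hat{\boldsymbol{\Sigma}}^{(n)}_k)}
     {\sum_j \hat{p}^{(n)}_j\,\phi(\mathbf{x}_{n+1};\hat{\boldsymbol{\mu}}^{(n)}_j,\hat{\boldsymbol{\Sigma}}^{(n)}_j)},
\qquad
\hat{p}^{(n+1)}_k = \hat{p}^{(n)}_k + \frac{\hat{\tau}_k - \hat{p}^{(n)}_k}{n+1},
```
```latex
\hat{\boldsymbol{\mu}}^{(n+1)}_k = \hat{\boldsymbol{\mu}}^{(n)}_k +
   \frac{\hat{\tau}_k}{(n+1)\,\hat{p}^{(n+1)}_k}\,(\mathbf{x}_{n+1}-\hat{\boldsymbol{\mu}}^{(n)}_k),
\qquad
\hat{\boldsymbol{\Sigma}}^{(n+1)}_k = \hat{\boldsymbol{\Sigma}}^{(n)}_k +
   \frac{\hat{\tau}_k}{(n+1)\,\hat{p}^{(n+1)}_k}
   \left[(\mathbf{x}_{n+1}-\hat{\boldsymbol{\mu}}^{(n)}_k)(\mathbf{x}_{n+1}-\hat{\boldsymbol{\mu}}^{(n)}_k)^{T}
         - \hat{\boldsymbol{\Sigma}}^{(n)}_k\right].
```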
17. CREATE RULE - AMDE
- Test the Mahalanobis distance from the current data point to each mixture term in the existing model.
- Add in a new term when this distance exceeds a certain create threshold:
- Location given by the current data point.
- Covariance given by a weighted average of the existing covariances.
- Mixing coefficient set to 1/n. (A sketch of one update-or-create step follows this list.)
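A minimal sketch of one AMDE step (the update-or-create loop of slides 15-17), assuming d-dimensional data with the means stored as a (C, d) array and the covariances as a (C, d, d) array; the function name and create_threshold are illustrative, and the new term's weight is taken as 1/(n+1) with the existing weights rescaled, one reading of the "1/n" on the slide.

```python
import numpy as np
from scipy.stats import multivariate_normal

def amde_step(x, n, weights, means, covs, create_threshold):
    """One AMDE step for a new observation x, given n points seen so far."""
    # Squared Mahalanobis distance from x to every existing term.
    d2 = np.array([(x - m) @ np.linalg.inv(S) @ (x - m)
                   for m, S in zip(means, covs)])
    if np.min(d2) > create_threshold ** 2:
        # Create rule: new term located at x, covariance a weighted average
        # of the existing covariances, small mixing coefficient.
        new_cov = sum(w * S for w, S in zip(weights, covs))
        means = np.vstack([means, x])
        covs = np.concatenate([covs, new_cov[None]])
        weights = np.append(weights * (1 - 1.0 / (n + 1)), 1.0 / (n + 1))
    else:
        # Recursive EM update of the existing terms (the hatted equations).
        f = np.array([w * multivariate_normal.pdf(x, m, S)
                      for w, m, S in zip(weights, means, covs)])
        tau = f / f.sum()
        weights = weights + (tau - weights) / (n + 1)
        for k in range(len(weights)):
            step = tau[k] / ((n + 1) * weights[k])
            diff = x - means[k]                       # uses the old mean
            covs[k] = covs[k] + step * (np.outer(diff, diff) - covs[k])
            means[k] = means[k] + step * diff
    return weights, means, covs
```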
18. Adaptive Mixtures
- Creates an over-determined model (too many terms)
- Depends on the order of the data
- Uses a sieve bound parameter to reset singular covariance matrices
- Covariance matrices are not constrained (model M4)
- Limited applicability in high-dimensional spaces
- The EM algorithm is used to refine the estimate.
19. Visualizing the Process - Adaptive Mixtures
20. Synthesis of AMDE and MBC
- First, to use AMDE as a way to initialize the model-based agglomerative clustering
- Second, to devise a model-based version of AMDE
- Third, to combine these two ideas
21. MBC with an AMDE Start
[Flow diagram] Data -> Adaptive mixtures model -> Agglomerative Model-Based Clustering -> dendrogram -> Initialization for EM (1. initial number of components, 2. initial values for parameters) -> EM Algorithm -> BIC -> Highest BIC -> Chosen Model.
Final result - estimated model: 1. number of components, 2. best model M1-M4, 3. parameter estimates.
22. MBC With AMDE Smart Start
- Form an adaptive mixtures model of the dataset.
- Set the create threshold so as to guarantee an over-determined model.
- Partition the data based on the AMDE model using the tik.
- Note that some of the original AMDE mixture terms die due to insufficient support.
- Utilize this partition as the start of the usual MBC procedure.
- Instead of starting with as many terms as points, we start with approximately log(n) terms (a sketch of this smart start follows this list).
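A sketch of the smart start, reusing the amde_step sketch above; the function name, the default threshold, and the choice to seed the model with a single term at the first observation are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal
# assumes amde_step from the earlier sketch is in scope

def amde_smart_start(X, create_threshold=2.0):
    """Fit an over-determined AMDE model, assign each point to the term with
    the largest posterior tik, and drop terms that receive no points.  The
    resulting labels seed model-based agglomerative clustering in place of
    the usual all-singleton start."""
    d = X.shape[1]
    # Seed with one term at the first observation and a broad covariance.
    weights = np.array([1.0])
    means = X[:1].copy()
    covs = np.cov(X, rowvar=False)[None] + 1e-6 * np.eye(d)
    for n, x in enumerate(X[1:], start=1):
        weights, means, covs = amde_step(x, n, weights, means, covs,
                                         create_threshold)
    # Posterior probabilities tik for every point and term.
    post = np.column_stack([
        w * multivariate_normal.pdf(X, m, S)
        for w, m, S in zip(weights, means, covs)])
    labels = post.argmax(axis=1)
    # Terms with no supporting points simply die.
    return labels, means[np.unique(labels)]
```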
23. Other Possibilities
- Other types of initialization
- Posse (JCGS) used initial partitions based on a minimal spanning tree.
- K-means
- Benefits of AMDE initialization
- Do not have to specify the number of clusters as in k-means.
- Methods like k-means impose a certain structure.
- In most cases, initial clusters are not singletons.
24. Why Do This?
- Computational tradeoff of the AMDE procedure vs. the agglomerative procedure on the full dataset.
- Advantages as the size of the dataset grows.
- Possibly non-singleton clusters
- Save on storage
- AMDE is data-order dependent.
- Multiple mixture models/clusterings can be obtained by merely reordering the dataset.
- Could get a distribution of models (numbers of clusters/BICs)
25. 4-Term Test Case
26. 4-Term BIC Curves
27. Experiment - Real Data
- Model-based clustering was applied to the Lansing Woods maples.
- Ran 20 trials with AMDE initialization.
- Re-ordered the data each time.
- The maximum-BIC model is a 6-component non-uniform spherical mixture.
- This is model 2:
- Covariance matrices are diagonal.
- Covariance matrices are not equal across terms.
28. The Raw Data
29. Original Configuration (JASA, 2002)
30. BICs for Best Trial
31. Number of Clusters - 20 Trials
32. Configuration with AMDE
33. Now for Model-Based AMDE
- The Adaptive Mixtures method uses the unconstrained model.
- It often tends to provide models that are overly complex (too many terms).
- Using a model-based version might provide better density estimates.
- A model-based version might provide better starting partitions for the MB agglomerative clustering for other models.
- Extend the applicability of AMDE to higher-dimensional spaces (fewer parameters).
34. Recursive Update Equation - Covariance
- The different models correspond to constraints on the covariance matrices.
- Update equations can be found in Celeux and Govaert, Pattern Recognition, 1995.
- All depend on the scatter matrix for each term.
- Propose updating based on the recursive update for the covariance.
- Then multiply by n to get the scatter matrix.
35. Recursive Update Equation - Covariance
[Equation slide: the scatter matrix for each term; one reading of the relation follows.]
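One plausible reading of the proposed conversion, with Wk the scatter matrix of the k-th term and n*pk its effective sample size:

```latex
W_k \;=\; \sum_{i=1}^{n} \hat{\tau}_{ik}\,
   (\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k)^{T}
\;\approx\; n\,\hat{p}_k\,\hat{\boldsymbol{\Sigma}}_k ,
```

so scaling the recursively updated covariance by the (effective) sample size recovers the scatter matrix needed by the constrained M-step formulas of Celeux and Govaert (1995).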
36. Procedure - MB-AMDE
- Using the same rule, either update the current configuration for the new point:
- Weights, means, covariances.
- Convert the covariance to a scatter matrix.
- Update the covariance according to the model.
- Or, create a new term:
- Mean and weights created as in AMDE.
- Covariance is the common one for the model family, OR we could use the weighted average as before.
- And, allow terms to die if their covariances become singular.
- Assign the term's weight proportionally among the remaining terms.
(A sketch of the covariance step for the spherical models follows this list.)
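A hypothetical sketch of the MB-AMDE covariance step under this reading: after the usual recursive update, convert each covariance to a scatter matrix, constrain it according to the chosen model, and convert back. Only the spherical models are shown; the function name and the (C, d, d) array layout are assumptions, and the ellipsoidal models would constrain the scatter matrices analogously (Celeux and Govaert, 1995).

```python
import numpy as np

def constrain_covariances(weights, covs, n, model="M2"):
    """Project the recursively updated covariances onto a constrained model.
    weights: mixing proportions (sum to 1); covs: (C, d, d) array; n: number
    of observations seen so far."""
    d = covs.shape[1]
    # Scatter matrix for each term (covariance scaled by its effective n).
    W = np.array([n * w * S for w, S in zip(weights, covs)])
    if model == "M1":      # spherical, equal: one sigma^2 shared by all terms
        sigma2 = np.trace(W.sum(axis=0)) / (n * d)
        return np.array([sigma2 * np.eye(d) for _ in covs])
    if model == "M2":      # spherical, unequal: sigma_k^2 per term
        return np.array([(np.trace(Wk) / (n * w * d)) * np.eye(d)
                         for w, Wk in zip(weights, W)])
    return covs            # M4: unconstrained, leave as-is
```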
37. Idea I
- Use MB-AMDE as a stand-alone method.
- Recall that EM is used to refine AMDE.
- Not really focused on getting the number of groups right.
- Use MB-AMDE as an initialization for MB clustering.
38. Idea II
- One of the issues with AMDE is that the mixtures are overly complex (too many terms).
- Use model-based agglomerative clustering to prune the terms of the AMDE.
- Then use the BIC to choose the model.
- This is just MB clustering (maybe without the EM step).
39. Idea III
- Do more with the model-based agglomerative clustering as a stand-alone procedure
- Cophenetic coefficient
- Can be used to compare the interpoint distances and the clustering
- Can be used to compare two dendrograms
- Inconsistency coefficient
- The inconsistency coefficient characterizes a link by comparing its length with the average length of other links at the same level of the hierarchy.
- The higher the value, the less similar the objects connected by the link.
- Do we have to convert the link merge values to a distance? (A sketch using both coefficients on a standard linkage follows this list.)
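For illustration, both diagnostics computed on an ordinary (non-model-based) hierarchy with SciPy; the same two summaries could be applied to the MB agglomerative output once its merge values are expressed on a distance-like scale, which is the open question above. The synthetic data and linkage method here are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, inconsistent
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (25, 2)), rng.normal(5, 1, (25, 2))])

D = pdist(X)                      # interpoint distances
Z = linkage(D, method="average")  # hierarchical clustering (dendrogram)

# Cophenetic correlation: agreement between the interpoint distances and the
# distances implied by the dendrogram.
c, _ = cophenet(Z, D)
print("cophenetic correlation:", round(c, 3))

# Inconsistency coefficient: each link's height compared with the mean and
# standard deviation of link heights at the same depth of the hierarchy.
print(inconsistent(Z, d=2)[-5:])
```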
40. Idea III
41. Other Questions
- Is there an equivalent BIC that does not require the data?
- Can the likelihood (classification or otherwise) be recursively updated?
42. Conclusion
- Discussed an initialization procedure for model-based agglomerative clustering.
- Showed applications to synthetic and real data.
- Possible advantages of AMDE initialization:
- Savings in storage?
- Possibly find other solutions?
- Formulation of Model-Based AMDE.