Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Jeremy Tantrum
Department of Statistics, University of Washington

Joint work with Alejandro Murua (Insightful Corporation) and Werner Stuetzle (University of Washington)

This work has been supported by NSA grant 62-1942.
Slide 2: Motivating Example
- Consider clustering documents
- Topic Detection and Tracking corpus
- 15,863 news stories for one year from Reuters and CNN
- 25,000 unique words
- Possibly many topics
- Large numbers of observations
- High dimensions
- Many groups
Slide 3: Goal of Clustering
Detect that there are 5 or 6 groups; assign observations to groups.
Slide 4: Nonparametric Clustering
- Premise
- Observations are sampled from a density p(x)
- Groups correspond to modes of p(x)
Slide 5: Nonparametric Clustering
Fitting: estimate p(x) nonparametrically and find significant modes of the estimate.
Slide 6: Model-Based Clustering
- Premise
- Observations are sampled from a mixture density p(x) = Σ_g π_g p_g(x)
- Groups correspond to mixture components
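To make the premise concrete, here is a minimal sketch of sampling from such a mixture, assuming one-dimensional Gaussian components (the function and parameter names are illustrative, not from the talk):

```python
import numpy as np

def sample_mixture(n, pis, means, sds, seed=0):
    """Draw n points from p(x) = sum_g pi_g p_g(x) with Gaussian components.

    First pick a component g with probability pi_g, then sample from p_g.
    """
    rng = np.random.default_rng(seed)
    g = rng.choice(len(pis), size=n, p=pis)  # which group each point comes from
    x = rng.normal(np.asarray(means)[g], np.asarray(sds)[g])
    return x, g
```

The group labels g are what clustering tries to recover from x alone.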
Slide 7: Model-Based Clustering
Fitting: estimate the mixing proportions π_g and the parameters of the component densities p_g(x).
Slide 8: Model-Based Clustering
- Fitting a mixture of Gaussians
- Use the EM algorithm to maximize the log likelihood
- E-step: estimates the probabilities of each observation belonging to each group
- M-step: maximizes the likelihood given these probabilities
- Requires a good starting point
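The two steps above can be sketched as follows. This is a minimal illustration assuming full-covariance Gaussian components (the function name and initialization scheme are mine, not the talk's); it deliberately takes the initial means as input, since, as the slide notes, EM needs a good starting point:

```python
import numpy as np

def em_gmm(X, mu0, n_iter=100):
    """Minimal EM for a mixture of Gaussians with full covariances.

    mu0 holds the initial component means, e.g. from hierarchical clustering.
    """
    n, d = X.shape
    K = len(mu0)
    mu = np.asarray(mu0, dtype=float).copy()
    pi = np.full(K, 1.0 / K)
    # Start every component at the overall covariance (plus a tiny ridge).
    cov = np.tile(np.cov(X.T).reshape(d, d) + 1e-6 * np.eye(d), (K, 1, 1))
    for _ in range(n_iter):
        # E-step: probability of each observation belonging to each group.
        log_r = np.empty((n, K))
        for g in range(K):
            diff = X - mu[g]
            _, logdet = np.linalg.slogdet(cov[g])
            maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov[g]), diff)
            log_r[:, g] = np.log(pi[g]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exp
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the likelihood given these probabilities.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        for g in range(K):
            Xc = X - mu[g]
            cov[g] = (r[:, g, None] * Xc).T @ Xc / nk[g] + 1e-6 * np.eye(d)
    return pi, mu, cov, r
```

With well-separated groups and sensible starting means, the estimated means converge to the group centers.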
Slide 9: Model-Based Clustering
- Hierarchical clustering
- Provides a good starting point for the EM algorithm
- Start with every point being its own cluster
- Merge the two closest clusters, where closeness is measured by the decrease in likelihood when the two clusters are merged
- Uses the classification likelihood, not the mixture likelihood
- The algorithm is quadratic in the number of observations
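The likelihood-based merge distance can be sketched like this, under the assumption (mine, for illustration) that each cluster gets its own full-covariance Gaussian fitted at the classification-likelihood MLE:

```python
import numpy as np

def class_loglik(X):
    # Classification log likelihood of one Gaussian cluster at its MLE:
    # sum_i log N(x_i; m, S) = -(n/2) * (d*log(2*pi) + log|S| + d),
    # where m and S are the cluster's sample mean and (biased) covariance.
    n, d = X.shape
    S = np.cov(X.T, bias=True).reshape(d, d) + 1e-9 * np.eye(d)
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def merge_cost(Xa, Xb):
    # Decrease in classification log likelihood if clusters A and B are
    # merged; agglomeration repeatedly merges the pair with the smallest
    # decrease. The cost is always >= 0.
    return class_loglik(Xa) + class_loglik(Xb) - class_loglik(np.vstack([Xa, Xb]))
```

Overlapping clusters cost little to merge; well-separated clusters cost a lot, so they are merged last.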
Slide 10: Likelihood Distance
[Figure: two component densities p1(x) and p2(x), and the merged density p(x)]
Slide 11: Bayesian Information Criterion
- Choose the number of clusters by maximizing the Bayesian Information Criterion, BIC = 2 log L - r log n
- r is the number of parameters
- n is the number of observations
- The log likelihood is penalized for model complexity
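A sketch of the computation, using the usual model-based-clustering convention of maximizing 2 log L - r log n; the parameter count shown assumes K full-covariance Gaussian components in d dimensions (function names are mine):

```python
import numpy as np

def n_params(K, d):
    # Parameter count r for K full-covariance Gaussians in d dimensions:
    # (K-1) mixing proportions + K*d means + K*d*(d+1)/2 covariance entries.
    return (K - 1) + K * d + K * d * (d + 1) // 2

def bic(loglik, K, d, n):
    # Log likelihood penalized for complexity; choose K maximizing this.
    return 2.0 * loglik - n_params(K, d) * np.log(n)
```

Adding components always raises log L, so a larger model wins only when the gain outweighs the r log n penalty.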
Slide 12: Fractionation
Invented by Cutting, Karger, Pedersen and Tukey for nonparametric clustering of large datasets.
M is the largest number of observations for which a hierarchical O(M²) algorithm is computationally feasible.
Slide 13: Fractionation
- Each fraction of M observations is reduced to αM meta-observations, 0 < α < 1
- αn meta-observations after the first round
- α²n meta-observations after the second round
- αⁱn meta-observations after the ith round
- The ith pass has α^(i-1) n/M fractions, each taking O(M²) operations
- Total number of operations: Σ_i α^(i-1) (n/M) M² = nM / (1 - α)
- Total running time is linear in n!
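The passes above can be skeletonized as follows; `cluster_fraction` is a placeholder name of my own for the O(M²) hierarchical step that turns one fraction into about αM meta-observations:

```python
def fractionate(items, M, alpha, cluster_fraction):
    """Repeated fractionation passes until at most one fraction remains.

    Each pass splits the current items into fractions of size at most M and
    replaces each fraction with ~alpha*M meta-observations, so the total
    work is sum_i alpha^(i-1) * (n/M) * O(M^2) = O(n*M), linear in n.
    """
    while len(items) > M:
        next_items = []
        for start in range(0, len(items), M):
            fraction = items[start:start + M]
            k = max(1, int(alpha * len(fraction)))
            next_items.extend(cluster_fraction(fraction, k))
        items = next_items
    return items  # small enough for one final O(M^2) clustering
```

Any fraction-level clustering routine with the right signature can be plugged in; the skeleton only manages the passes.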
Slide 14: Model-Based Fractionation
- Use model-based clustering
- Meta-observations contain all the sufficient statistics (n_i, m_i, S_i)
- n_i is the number of observations (size)
- m_i is the mean (location)
- S_i is the covariance matrix (shape and volume)
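Given these statistics, two meta-observations can be merged exactly without revisiting the raw data; a sketch using the standard pooling formulas for MLE (biased) covariances (function name is mine):

```python
import numpy as np

def merge_meta(n1, m1, S1, n2, m2, S2):
    """Sufficient statistics (n, m, S) of the union of two meta-observations.

    S1 and S2 are the biased (divide-by-n) covariance matrices; the cross
    term accounts for the separation between the two means.
    """
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    d = m1 - m2
    S = (n1 * S1 + n2 * S2) / n + (n1 * n2) / (n * n) * np.outer(d, d)
    return n, m, S
```

Pooling recovers exactly the statistics that would be computed from the combined raw observations, which is what makes clustering meta-observations instead of raw points lossless for Gaussian components.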
Slide 15: Model-Based Fractionation
Slide 16: Example 2
Slide 17: Refractionation
- Problem
- If the number of meta-observations generated from a fraction is less than the number of groups in that fraction, then two or more groups will be merged.
- Once observations from two groups are merged, they can never be split again.
- Solution
- Apply fractionation repeatedly.
- Use the meta-observations from the previous pass of fractionation to create better fractions.
Slide 18: Example 2, Continued
Slide 19: Example 2, Pass 2
Slide 20: Example 2, Pass 3
Slide 21: Realistic Example
- 1100 documents from the TDT corpus, partitioned by people into 19 topics
- Transformed into a 50-dimensional space using Latent Semantic Indexing
[Figure: projection of the data onto a plane; colors represent topics]
Slide 22: Realistic Example
We want to create a dataset with more observations and more groups. Idea: replace each group with a scaled and transformed version of the entire dataset.
Slide 23: Realistic Example
Slide 24: Realistic Example
- To measure the similarity of clusters to groups: the Fowlkes-Mallows index
- The geometric average of
- the probability of 2 randomly chosen observations from the same cluster being in the same group
- the probability of 2 randomly chosen observations from the same group being in the same cluster
- A Fowlkes-Mallows index near 1 means the clusters are good estimates of the groups
- Clustering the 1100 documents gives a Fowlkes-Mallows index of 0.76, our gold standard
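The index described above can be computed directly from the group-by-cluster contingency table; a minimal sketch (function name is mine):

```python
import numpy as np

def fowlkes_mallows(groups, clusters):
    """Geometric mean of the two pair probabilities from the slide.

    Built from the contingency table n_ij of group i versus cluster j.
    """
    groups = np.asarray(groups)
    clusters = np.asarray(clusters)
    n = len(groups)
    g_ids, g = np.unique(groups, return_inverse=True)
    c_ids, c = np.unique(clusters, return_inverse=True)
    table = np.zeros((len(g_ids), len(c_ids)))
    np.add.at(table, (g, c), 1)
    T = (table ** 2).sum() - n                # pairs together in both partitions
    P = (table.sum(axis=1) ** 2).sum() - n    # pairs in the same group
    Q = (table.sum(axis=0) ** 2).sum() - n    # pairs in the same cluster
    return T / np.sqrt(P * Q)
```

A perfect clustering scores 1 regardless of how the cluster labels are named; unrelated partitions score near 0.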
Slide 25: Realistic Example
- 19 × 19 = 361 clusters; 19 × 1100 = 20,900 observations in 50 dimensions
- Fraction size ≈ 1000, with 100 meta-observations per fraction
- 4 passes of fractionation, choosing 361 clusters

Distribution of the number of groups per fraction:

Pass  Min  Median  Max  Number of fractions
1     270  289     296  20
2     18   88      150  18
3     18   19      60   17
4     19   19      58   16
Slide 26: Realistic Example
- 19 × 19 = 361 clusters; 19 × 1100 = 20,900 observations in 50 dimensions
- Fraction size ≈ 1000, with 100 meta-observations per fraction
- 4 passes of fractionation, choosing 361 clusters
- Purity: the sum over clusters of the number of groups represented in each cluster; 361 is perfect

Pass  Fowlkes-Mallows  Purity of the clusters
1     0.325            1729
2     0.554            908
3     0.616            671
4     0.613            651
Slide 27: Realistic Example
- 19 × 19 = 361 clusters; 19 × 1100 = 20,900 observations in 50 dimensions
- Fraction size ≈ 1000, with 100 meta-observations per fraction
- 4 passes of fractionation, choosing 361 clusters
- Refractionation
- Purifies the fractions
- Successfully deals with the case where the number of groups is greater than αM, the number of meta-observations
Slide 28: Contributions
- Model-based fractionation
- Extends the fractionation idea to the parametric setting
- Incorporates information about the size, shape and volume of clusters
- Chooses the number of clusters
- Still linear in n
- Model-based refractionation
- Extends fractionation to handle a larger number of groups
Slide 29: Extensions
- Extend to 100,000s of observations and 1000s of groups
- Currently the number of groups must be less than M
- Extend to a more flexible class of models
- With small groups in high dimensions, we need a more constrained model (fewer parameters) than the full-covariance model
- Mixture of factor analyzers
Slide 31: Fowlkes-Mallows Index

Contingency table of the true groups (rows) against the clusters (columns):

Groups  1     2     ...  I     Total
1       n11   n12   ...  n1I   n1.
2       n21   n22   ...  n2I   n2.
...
J       nJ1   nJ2   ...  nJI   nJ.
Total   n.1   n.2   ...  n.I   n

Pr(2 documents in the same group | they are in the same cluster)
Pr(2 documents in the same cluster | they are in the same group)