Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Jeremy Tantrum
Department of Statistics, University of Washington

Joint work with Alejandro Murua (Insightful Corporation) and Werner Stuetzle (University of Washington)

This work has been supported by NSA grant 62-1942.
Slide 2: Motivating Example
- Consider clustering documents
- Topic Detection and Tracking corpus
- 15,863 news stories for one year from Reuters and CNN
- 25,000 unique words
- Possibly many topics
- Large numbers of observations
- High dimensions
- Many groups
Slide 3: Goal of Clustering
Detect that there are 5 or 6 groups; assign observations to groups.
Slide 4: Nonparametric Clustering
- Premise
- Observations are sampled from a density p(x)
- Groups correspond to modes of p(x)
Slide 5: Nonparametric Clustering
Fitting: estimate p(x) nonparametrically and find significant modes of the estimate.
Slide 6: Model-Based Clustering
- Premise
- Observations are sampled from a mixture density p(x) = Σ_g π_g p_g(x)
- Groups correspond to mixture components
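To make the premise concrete, here is a minimal sketch of sampling from such a mixture, assuming one-dimensional Gaussian components (the function and parameter names are illustrative, not from the talk):

```python
import numpy as np

def sample_mixture(n, pis, means, sds, seed=0):
    """Draw n points from p(x) = sum_g pi_g p_g(x) with Gaussian components.

    First pick a component g with probability pi_g, then sample from p_g.
    """
    rng = np.random.default_rng(seed)
    g = rng.choice(len(pis), size=n, p=pis)  # which group each point comes from
    x = rng.normal(np.asarray(means)[g], np.asarray(sds)[g])
    return x, g
```

The group labels g are what clustering tries to recover from x alone.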
Slide 7: Model-Based Clustering
Fitting: estimate the mixing proportions π_g and the parameters of the component densities p_g(x).
Slide 8: Model-Based Clustering
- Fitting a mixture of Gaussians
- Use the EM algorithm to maximize the log likelihood
- E-step: estimates the probabilities of each observation belonging to each group
- M-step: maximizes the likelihood given these probabilities
- Requires a good starting point
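The two steps above can be sketched as follows. This is a minimal illustration assuming full-covariance Gaussian components (the function name and initialization scheme are mine, not the talk's); it deliberately takes the initial means as input, since, as the slide notes, EM needs a good starting point:

```python
import numpy as np

def em_gmm(X, mu0, n_iter=100):
    """Minimal EM for a mixture of Gaussians with full covariances.

    mu0 holds the initial component means, e.g. from hierarchical clustering.
    """
    n, d = X.shape
    K = len(mu0)
    mu = np.asarray(mu0, dtype=float).copy()
    pi = np.full(K, 1.0 / K)
    # Start every component at the overall covariance (plus a tiny ridge).
    cov = np.tile(np.cov(X.T).reshape(d, d) + 1e-6 * np.eye(d), (K, 1, 1))
    for _ in range(n_iter):
        # E-step: probability of each observation belonging to each group.
        log_r = np.empty((n, K))
        for g in range(K):
            diff = X - mu[g]
            _, logdet = np.linalg.slogdet(cov[g])
            maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov[g]), diff)
            log_r[:, g] = np.log(pi[g]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exp
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the likelihood given these probabilities.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        for g in range(K):
            Xc = X - mu[g]
            cov[g] = (r[:, g, None] * Xc).T @ Xc / nk[g] + 1e-6 * np.eye(d)
    return pi, mu, cov, r
```

With well-separated groups and sensible starting means, the estimated means converge to the group centers.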
Slide 9: Model-Based Clustering
- Hierarchical clustering
- Provides a good starting point for the EM algorithm
- Start with every point being its own cluster
- Merge the two closest clusters, where closeness is measured by the decrease in likelihood when the two clusters are merged
- Uses the classification likelihood, not the mixture likelihood
- The algorithm is quadratic in the number of observations
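The likelihood-based merge distance can be sketched like this, under the assumption (mine, for illustration) that each cluster gets its own full-covariance Gaussian fitted at the classification-likelihood MLE:

```python
import numpy as np

def class_loglik(X):
    # Classification log likelihood of one Gaussian cluster at its MLE:
    # sum_i log N(x_i; m, S) = -(n/2) * (d*log(2*pi) + log|S| + d),
    # where m and S are the cluster's sample mean and (biased) covariance.
    n, d = X.shape
    S = np.cov(X.T, bias=True).reshape(d, d) + 1e-9 * np.eye(d)
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def merge_cost(Xa, Xb):
    # Decrease in classification log likelihood if clusters A and B are
    # merged; agglomeration repeatedly merges the pair with the smallest
    # decrease. The cost is always >= 0.
    return class_loglik(Xa) + class_loglik(Xb) - class_loglik(np.vstack([Xa, Xb]))
```

Overlapping clusters cost little to merge; well-separated clusters cost a lot, so they are merged last.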
Slide 10: Likelihood Distance
[Figure: two component densities p1(x) and p2(x), and the merged density p(x)]
Slide 11: Bayesian Information Criterion
- Choose the number of clusters by maximizing the Bayesian Information Criterion, BIC = 2 log L - r log n
- r is the number of parameters
- n is the number of observations
- The log likelihood is penalized for model complexity
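A sketch of the computation, using the usual model-based-clustering convention of maximizing 2 log L - r log n; the parameter count shown assumes K full-covariance Gaussian components in d dimensions (function names are mine):

```python
import numpy as np

def n_params(K, d):
    # Parameter count r for K full-covariance Gaussians in d dimensions:
    # (K-1) mixing proportions + K*d means + K*d*(d+1)/2 covariance entries.
    return (K - 1) + K * d + K * d * (d + 1) // 2

def bic(loglik, K, d, n):
    # Log likelihood penalized for complexity; choose K maximizing this.
    return 2.0 * loglik - n_params(K, d) * np.log(n)
```

Adding components always raises log L, so a larger model wins only when the gain outweighs the r log n penalty.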
Slide 12: Fractionation
Invented by Cutting, Karger, Pedersen and Tukey for nonparametric clustering of large datasets.
M is the largest number of observations for which a hierarchical O(M²) algorithm is computationally feasible.
Slide 13: Fractionation
- Each fraction of M observations is reduced to αM meta-observations, 0 < α < 1
- αn meta-observations after the first round
- α²n meta-observations after the second round
- αⁱn meta-observations after the ith round
- The ith pass has α^(i-1) n/M fractions, each taking O(M²) operations
- Total number of operations: Σ_i α^(i-1) (n/M) M² = nM / (1 - α)
- Total running time is linear in n!
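The passes above can be skeletonized as follows; `cluster_fraction` is a placeholder name of my own for the O(M²) hierarchical step that turns one fraction into about αM meta-observations:

```python
def fractionate(items, M, alpha, cluster_fraction):
    """Repeated fractionation passes until at most one fraction remains.

    Each pass splits the current items into fractions of size at most M and
    replaces each fraction with ~alpha*M meta-observations, so the total
    work is sum_i alpha^(i-1) * (n/M) * O(M^2) = O(n*M), linear in n.
    """
    while len(items) > M:
        next_items = []
        for start in range(0, len(items), M):
            fraction = items[start:start + M]
            k = max(1, int(alpha * len(fraction)))
            next_items.extend(cluster_fraction(fraction, k))
        items = next_items
    return items  # small enough for one final O(M^2) clustering
```

Any fraction-level clustering routine with the right signature can be plugged in; the skeleton only manages the passes.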
Slide 14: Model-Based Fractionation
- Use model-based clustering
- Meta-observations contain all the sufficient statistics (n_i, m_i, S_i)
- n_i is the number of observations (size)
- m_i is the mean (location)
- S_i is the covariance matrix (shape and volume)
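Given these statistics, two meta-observations can be merged exactly without revisiting the raw data; a sketch using the standard pooling formulas for MLE (biased) covariances (function name is mine):

```python
import numpy as np

def merge_meta(n1, m1, S1, n2, m2, S2):
    """Sufficient statistics (n, m, S) of the union of two meta-observations.

    S1 and S2 are the biased (divide-by-n) covariance matrices; the cross
    term accounts for the separation between the two means.
    """
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    d = m1 - m2
    S = (n1 * S1 + n2 * S2) / n + (n1 * n2) / (n * n) * np.outer(d, d)
    return n, m, S
```

Pooling recovers exactly the statistics that would be computed from the combined raw observations, which is what makes clustering meta-observations instead of raw points lossless for Gaussian components.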
Slide 15: Model-Based Fractionation
Slide 16: Example 2
Slide 17: Refractionation
- Problem
- If the number of meta-observations generated from a fraction is less than the number of groups in that fraction, then two or more groups will be merged.
- Once observations from two groups are merged, they can never be split again.
- Solution
- Apply fractionation repeatedly.
- Use the meta-observations from the previous pass of fractionation to create better fractions.
Slide 18: Example 2, Continued
Slide 19: Example 2, Pass 2
Slide 20: Example 2, Pass 3
Slide 21: Realistic Example
- 1100 documents from the TDT corpus, partitioned by people into 19 topics
- Transformed into a 50-dimensional space using Latent Semantic Indexing
[Figure: projection of the data onto a plane; colors represent topics]
Slide 22: Realistic Example
We want to create a dataset with more observations and more groups. Idea: replace each group with a scaled and transformed version of the entire dataset.
Slide 23: Realistic Example
Slide 24: Realistic Example
- To measure the similarity of clusters to groups: the Fowlkes-Mallows index
- The geometric average of
- the probability of 2 randomly chosen observations from the same cluster being in the same group
- the probability of 2 randomly chosen observations from the same group being in the same cluster
- A Fowlkes-Mallows index near 1 means the clusters are good estimates of the groups
- Clustering the 1100 documents gives a Fowlkes-Mallows index of 0.76, our gold standard
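The index described above can be computed directly from the group-by-cluster contingency table; a minimal sketch (function name is mine):

```python
import numpy as np

def fowlkes_mallows(groups, clusters):
    """Geometric mean of the two pair probabilities from the slide.

    Built from the contingency table n_ij of group i versus cluster j.
    """
    groups = np.asarray(groups)
    clusters = np.asarray(clusters)
    n = len(groups)
    g_ids, g = np.unique(groups, return_inverse=True)
    c_ids, c = np.unique(clusters, return_inverse=True)
    table = np.zeros((len(g_ids), len(c_ids)))
    np.add.at(table, (g, c), 1)
    T = (table ** 2).sum() - n                # pairs together in both partitions
    P = (table.sum(axis=1) ** 2).sum() - n    # pairs in the same group
    Q = (table.sum(axis=0) ** 2).sum() - n    # pairs in the same cluster
    return T / np.sqrt(P * Q)
```

A perfect clustering scores 1 regardless of how the cluster labels are named; unrelated partitions score near 0.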
Slide 25: Realistic Example
- 19 × 19 = 361 clusters; 19 × 1100 = 20,900 observations in 50 dimensions
- Fraction size ≈ 1000, with 100 meta-observations per fraction
- 4 passes of fractionation, choosing 361 clusters

Distribution of the number of groups per fraction:

Pass  Min  Median  Max  Number of fractions
1     270  289     296  20
2     18   88      150  18
3     18   19      60   17
4     19   19      58   16
Slide 26: Realistic Example
- 19 × 19 = 361 clusters; 19 × 1100 = 20,900 observations in 50 dimensions
- Fraction size ≈ 1000, with 100 meta-observations per fraction
- 4 passes of fractionation, choosing 361 clusters
- Purity: the sum over clusters of the number of groups represented in each cluster; 361 is perfect

Pass  Fowlkes-Mallows  Purity of the clusters
1     0.325            1729
2     0.554            908
3     0.616            671
4     0.613            651
Slide 27: Realistic Example
- 19 × 19 = 361 clusters; 19 × 1100 = 20,900 observations in 50 dimensions
- Fraction size ≈ 1000, with 100 meta-observations per fraction
- 4 passes of fractionation, choosing 361 clusters
- Refractionation
- Purifies the fractions
- Successfully deals with the case where the number of groups is greater than αM, the number of meta-observations
Slide 28: Contributions
- Model-based fractionation
- Extends the fractionation idea to the parametric setting
- Incorporates information about the size, shape and volume of clusters
- Chooses the number of clusters
- Still linear in n
- Model-based refractionation
- Extends fractionation to handle a larger number of groups
Slide 29: Extensions
- Extend to 100,000s of observations and 1000s of groups
- Currently the number of groups must be less than M
- Extend to a more flexible class of models
- With small groups in high dimensions, we need a more constrained model (fewer parameters) than the full-covariance model
- Mixture of factor analyzers
Slide 31: Fowlkes-Mallows Index

Contingency table of the true groups (rows) against the clusters (columns):

Groups  1     2     ...  I     Total
1       n11   n12   ...  n1I   n1.
2       n21   n22   ...  n2I   n2.
...
J       nJ1   nJ2   ...  nJI   nJ.
Total   n.1   n.2   ...  n.I   n

Pr(2 documents in the same group | they are in the same cluster)
Pr(2 documents in the same cluster | they are in the same group)