Efficient Clustering of High-Dimensional Data Sets

About This Presentation

Slides: 44

Provided by: AndrewM110

Learn more at: https://www.cs.cornell.edu

Transcript and Presenter's Notes

1
Efficient Clustering of High-Dimensional Data Sets

Andrew McCallum, WhizBang! Labs and CMU
Kamal Nigam, WhizBang! Labs
Lyle Ungar, UPenn
2
Large Clustering Problems

Many examples
Many clusters
Many dimensions

Example Domains

Text
Images
Protein Structure

3
The Citation Clustering Data

Over 1,000,000 citations
About 100,000 unique papers
About 100,000 unique vocabulary words
Over 1 trillion distance calculations

4
Reduce number of distance calculations

[Bradley, Fayyad, Reina KDD-98]
Sample to find initial starting points for k-means or EM
[Moore 98]
Use multi-resolution kd-trees to group similar data points
[Omohundro 89]
Balltrees

5
The Canopies Approach

Two distance metrics: cheap and expensive
First Pass
very inexpensive distance metric
create overlapping canopies
Second Pass
expensive, accurate distance metric
canopies determine which distances calculated
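
As an illustration of the two kinds of metric, here is a minimal Python sketch for short text records. Both functions are assumptions chosen for this example and are not taken from the slides: `token_overlap_distance` (a hypothetical name) is a cheap set-overlap measure, and `edit_distance` is a standard Levenshtein distance standing in for the expensive, accurate metric.

```python
def token_overlap_distance(a, b):
    """Cheap, approximate: 1 - Jaccard overlap of lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def edit_distance(a, b):
    """Expensive, accurate: Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

The cheap function only touches word sets, while the edit distance costs time proportional to the product of the two string lengths, which is why restricting it to pairs inside a shared canopy can pay off.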

6
Illustrating Canopies
7
Overlapping Canopies
8
Creating canopies with two thresholds

Put all points in D
Loop until D is empty
Pick a point X from D
Put points within K_loose of X in canopy
Remove points within K_tight of X from D

(Figure labels: tight threshold, loose threshold)
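
The loop on this slide can be sketched in Python roughly as follows. The function name `make_canopies`, its parameters, and the random choice of X are assumptions for illustration. The sketch also assumes that canopy members are drawn from all points rather than only those still in D, and that the cheap distance from a point to itself is zero.

```python
import random


def make_canopies(points, cheap_dist, k_loose, k_tight, seed=0):
    """Return a list of canopies, each a list of point indices."""
    if k_tight > k_loose:
        raise ValueError("k_tight must not exceed k_loose")
    rng = random.Random(seed)
    remaining = set(range(len(points)))      # the set D from the slide
    canopies = []
    while remaining:
        x = rng.choice(sorted(remaining))    # pick a point X from D
        dists = {i: cheap_dist(points[x], points[i])
                 for i in range(len(points))}
        # Points within the loose threshold join this canopy.
        canopies.append([i for i, d in dists.items() if d <= k_loose])
        # Points within the tight threshold can no longer start a canopy.
        remaining -= {i for i, d in dists.items() if d <= k_tight}
    return canopies
```

Because X is always within the tight threshold of itself, D shrinks on every iteration and the loop terminates. Points between the two thresholds stay in D, which is how canopies come to overlap.
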
9
Canopies

Two distance metrics
cheap and approximate
expensive and accurate
Two-pass clustering
create overlapping canopies
full clustering with limited distances
Canopy property
points in same cluster will be in same canopy

10
Using canopies with GAC

Calculate expensive distances between points in the same canopy
All other distances default to infinity
Sort finite distances and iteratively merge closest
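
A rough Python sketch of these steps follows. It makes assumptions the slide does not state: merging is single-link (two clusters merge when any cross pair is close enough), merging stops at a hypothetical `max_dist` cutoff, and clusters are tracked with a small union-find. Pairs that share no canopy are never scored, which plays the role of the infinite default distance.

```python
from itertools import combinations


def canopy_pairs(canopies):
    """All index pairs that share at least one canopy."""
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs


def cluster_with_canopies(points, canopies, expensive_dist, max_dist):
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Expensive distances are computed only within canopies, then sorted.
    scored = sorted((expensive_dist(points[i], points[j]), i, j)
                    for i, j in canopy_pairs(canopies))
    for d, i, j in scored:
        if d > max_dist:
            break                      # remaining pairs are farther apart
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri            # merge the two closest clusters

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```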

11
Computational Savings

inexpensive metric << expensive metric
number of canopies: c (large)
canopies overlap: each point in f canopies
roughly fn/c points per canopy
O(f^2 n^2 / c) expensive distance calculations
complexity reduction: O(f^2 / c)
n = 10^6, k = 10^4, c = 1000, f small
computation reduced by factor of 1000
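
The arithmetic behind this estimate can be checked with a few lines of Python. The function name is hypothetical, and f = 1 is an assumed value standing in for "f small".

```python
def expensive_distance_counts(n, c, f):
    """Rough counts of expensive distance calculations, ignoring constants."""
    full = n * n                            # all pairs, no canopies
    per_canopy = f * n / c                  # points in a typical canopy
    with_canopies = c * per_canopy ** 2     # c canopies, each quadratic
    return full, with_canopies, with_canopies / full


full, reduced, ratio = expensive_distance_counts(n=10**6, c=1000, f=1)
print(full, reduced, ratio)
```

With these values the ratio equals f^2 / c = 1/1000, matching the factor on the slide. A larger f erodes the savings quadratically.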

12
Experimental Results
Method          Minutes   F1
Canopies GAC    7.65      0.838
Complete GAC    134.09    0.835
13
Preserving Good Clustering

Small, disjoint canopies: big time savings
Large, overlapping canopies: original accurate clustering
Goal: fast and accurate
requires good, cheap distance metric

14
Reduced Dimension Representations
15

Clustering finds groups of similar objects
Understanding clusters can be difficult
Important to understand/interpret results
Patterns waiting to be discovered

16
A picture is worth 1000 clusters
17
Feature Subset Selection

Find n features that work best for prediction
Find n features such that distance on them best
correlates with distance on all features
Minimize
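
The objective on this slide did not survive in the transcript, so the sketch below does not reproduce it. As a stand-in, and purely as an assumption, it scores each candidate subset by the Pearson correlation between pairwise Euclidean distances on the subset and on all features, then keeps the best. The exhaustive search and the names `pearson` and `best_feature_subset` are illustrative only and suit small feature counts.

```python
from itertools import combinations
from math import dist, sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)


def best_feature_subset(X, n_features):
    """Return (score, subset) for the best-correlating feature subset."""
    pairs = list(combinations(range(len(X)), 2))
    full = [dist(X[i], X[j]) for i, j in pairs]   # distances on all features
    best = None
    for subset in combinations(range(len(X[0])), n_features):
        sub = [dist([X[i][f] for f in subset], [X[j][f] for f in subset])
               for i, j in pairs]
        score = pearson(full, sub)
        if best is None or score > best[0]:
            best = (score, subset)
    return best
```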

18
Feature Subset Selection

Suppose all features relevant
Does that mean dimensionality can't be reduced?
No!
Manifold in feature space is what counts, not
relevance of individual features
Manifold can be lower dimension than feats

19
PCA: Principal Component Analysis

Given data in d dimensions
Compute
d-dim mean vector M
d x d covariance matrix C
eigenvectors and eigenvalues
Sort by eigenvalues
Select top k < d eigenvalues
Project data onto k eigenvectors
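
The steps above can be sketched with NumPy as follows. This is a minimal illustration rather than the presenters' code. The function name `pca` is hypothetical. Using `np.cov` (which divides by n - 1) and `np.linalg.eigh` (suited to symmetric matrices) are choices made for this example.

```python
import numpy as np


def pca(X, k):
    """Project n x d data onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    M = X.mean(axis=0)                        # d-dim mean vector
    C = np.cov(X - M, rowvar=False)           # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    A = eigvecs[:, order]                     # d x k, columns = components
    Y = (X - M) @ A                           # n x k projected data
    return Y, A, eigvals[order], M
```

For data lying exactly on a line, a single component captures all the variance, and the remaining eigenvalues are numerically zero.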

20
PCA

Mean vector M

21
PCA

Covariance C

22
PCA

Eigenvectors
Unit vectors in directions of maximum variance
Eigenvalues
Magnitude of the variance in the direction of
each eigenvector

23
PCA

Find largest eigenvalues and
corresponding eigenvectors
Project points onto k principal components
where A is a d x k matrix whose columns are the k
principal components of each point

24
PCA via Autoencoder ANN
25
Non-Linear PCA by Autoencoder
26
PCA

need vector representation
0-d: sample mean
1-d: y = mx + b
2-d: y1 = mx + b, y2 = mx + b

27
MDS: Multidimensional Scaling

PCA requires vector representation
Given pairwise distances between n points?
Find coordinates for points in d dimensional
space s.t. distances are preserved best

28
(No Transcript)
29
(No Transcript)
30
MDS

Assign points to coords xi in d-dim space
random coordinate values
principal components
dimensions with greatest variance
Do gradient descent on the coordinates xi of each point i until distortion is minimized
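
A minimal NumPy sketch of this procedure is given below. The slide does not define the distortion measure, so the sketch assumes a plain squared-error stress, the sum over pairs of (embedded distance - given distance)^2. It also assumes random initial coordinates, a fixed learning rate, and a fixed step count, none of which come from the slides. The names `stress` and `mds` are hypothetical.

```python
import numpy as np


def stress(X, D):
    """Sum over pairs i < j of (||x_i - x_j|| - D_ij)^2."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    return float(((dist - np.asarray(D, dtype=float)) ** 2).sum() / 2)


def mds(D, dim=2, steps=1000, lr=0.01, seed=0):
    """Find coordinates whose pairwise distances approximate D."""
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(D.shape[0], dim))       # random starting coordinates
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]     # x_i - x_j for all pairs
        dist = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(dist, 1.0)              # avoid dividing by zero
        coef = (dist - D) / dist
        np.fill_diagonal(coef, 0.0)              # a point exerts no pull on itself
        grad = 2 * (coef[:, :, None] * diff).sum(axis=1)
        X -= lr * grad                           # gradient descent step
    return X
```

Starting from the principal components instead of random values, as the slide also suggests, would only change the line that initializes `X`.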