Title: Efficient Clustering of High-Dimensional Data Sets
1Efficient Clustering of High-Dimensional Data Sets
- Andrew McCallum (WhizBang! Labs, CMU)
- Kamal Nigam (WhizBang! Labs)
- Lyle Ungar (UPenn)
2Large Clustering Problems
- Many examples
- Many clusters
- Many dimensions
Example Domains
- Text
- Images
- Protein Structure
3The Citation Clustering Data
- Over 1,000,000 citations
- About 100,000 unique papers
- About 100,000 unique vocabulary words
- Over 1 trillion distance calculations
4Reduce number of distance calculations
- Bradley, Fayyad, Reina KDD-98
- Sample to find initial starting points for k-means or EM
- Moore 98
- Use multi-resolution kd-trees to group similar data points
- Omohundro 89
- Balltrees
5The Canopies Approach
- Two distance metrics: cheap and expensive
- First Pass
- very inexpensive distance metric
- create overlapping canopies
- Second Pass
- expensive, accurate distance metric
- canopies determine which distances calculated
6Illustrating Canopies
7Overlapping Canopies
8Creating canopies with two thresholds
- Put all points in D
- Loop until D is empty
- Pick a point X from D
- Put points within K_loose of X in canopy
- Remove points within K_tight of X from D
(Figure labels: tight, loose)
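The two-threshold loop above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the names `make_canopies`, `cheap_dist`, `k_loose`, and `k_tight` are hypothetical, the random choice of X is one possible reading of "pick a point", and the sketch assumes canopy members are drawn from all points rather than only from those still in D.

```python
import random

def make_canopies(points, cheap_dist, k_loose, k_tight, seed=0):
    """Build overlapping canopies with a cheap metric and two thresholds.

    Returns a list of canopies; each canopy is a list of point indices.
    Assumption: canopy members are taken from all points, not only from D.
    """
    if k_tight > k_loose:
        raise ValueError("k_tight must not exceed k_loose")
    rng = random.Random(seed)
    remaining = set(range(len(points)))  # the set D of candidate centers
    canopies = []
    while remaining:
        # Pick a point X from D (here: uniformly at random).
        center = rng.choice(sorted(remaining))
        dists = [cheap_dist(points[center], p) for p in points]
        # Points within the loose threshold of X form the canopy.
        canopies.append([i for i, d in enumerate(dists) if d <= k_loose])
        # Points within the tight threshold of X leave D.
        remaining -= {i for i in remaining if dists[i] <= k_tight}
        remaining.discard(center)  # guarantees progress on every pass
    return canopies

# Usage with a toy 1-D data set and absolute difference as the cheap metric.
toy_points = [0.0, 0.5, 1.0, 5.0, 5.2]
toy_canopies = make_canopies(toy_points, lambda a, b: abs(a - b),
                             k_loose=1.5, k_tight=0.6)
```

Because the tight threshold is no larger than the loose one, every point removed from D has already been placed in at least one canopy, which is why the canopies cover the whole data set.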
9Canopies
- Two distance metrics
- cheap and approximate
- expensive and accurate
- Two-pass clustering
- create overlapping canopies
- full clustering with limited distances
- Canopy property
- points in same cluster will be in same canopy
10Using canopies with GAC
- Calculate expensive distances between points in the same canopy
- All other distances default to infinity
- Sort finite distances and iteratively merge closest
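A rough sketch of the idea, under stated assumptions: the expensive metric is evaluated only for pairs that share a canopy, and all other pairs are simply never considered (the "infinite" distance). For brevity the merge step below uses a union-find over the sorted pairwise distances, which behaves like single-link merging with a stopping threshold; it is a simplification, not the GAC variant used in the talk. The names `canopy_pairs`, `cluster_within_canopies`, and `max_dist` are hypothetical.

```python
from itertools import combinations

def canopy_pairs(canopies):
    """Return the set of index pairs (i, j), i < j, that share a canopy."""
    pairs = set()
    for canopy in canopies:
        for i, j in combinations(sorted(canopy), 2):
            pairs.add((i, j))
    return pairs

def cluster_within_canopies(points, canopies, expensive_dist, max_dist):
    """Merge closest pairs first, using only within-canopy distances."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Expensive distances are computed only for pairs sharing a canopy.
    scored = sorted((expensive_dist(points[i], points[j]), i, j)
                    for i, j in canopy_pairs(canopies))
    for dist, i, j in scored:
        if dist > max_dist:
            break  # remaining finite distances are all larger
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

With five points split into canopies of three and two, only four expensive distances are computed instead of the ten needed for all pairs.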
11Computational Savings
- inexpensive metric << expensive metric
- number of canopies: c (large)
- canopies overlap: each point in f canopies
- roughly fn/c points per canopy
- O(f^2 n^2 / c) expensive distance calculations
- complexity reduction: O(f^2 / c)
- n = 10^6, k = 10^4, c = 1000, f small
- computation reduced by factor of 1000
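The arithmetic behind these bullets can be written out as below. The slide says only that f is small; taking f close to 1 is an assumption made here to reproduce the stated factor of 1000.

```latex
\begin{align*}
\text{points per canopy} &\approx \frac{f\,n}{c} \\
\text{expensive distance calculations} &\approx c \left(\frac{f\,n}{c}\right)^{2}
  = \frac{f^{2} n^{2}}{c} \\
\frac{f^{2} n^{2} / c}{n^{2}} &= \frac{f^{2}}{c}
  \approx \frac{1}{1000} \qquad (c = 1000,\ f \approx 1)
\end{align*}
```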
12Experimental Results
                Minutes    F1
Canopies GAC    7.65       0.838
Complete GAC    134.09     0.835
13Preserving Good Clustering
- Small, disjoint canopies: big time savings
- Large, overlapping canopies: original accurate clustering
- Goal: fast and accurate
- requires good, cheap distance metric
14Reduced Dimension Representations
15
- Clustering finds groups of similar objects
- Understanding clusters can be difficult
- Important to understand/interpret results
- Patterns waiting to be discovered
16A picture is worth 1000 clusters
17Feature Subset Selection
- Find n features that work best for prediction
- Find n features such that distance on them best correlates with distance on all features
- Minimize
18Feature Subset Selection
- Suppose all features relevant
- Does that mean dimensionality can't be reduced?
- No!
- Manifold in feature space is what counts, not relevance of individual features
- Manifold can be lower dimension than the number of features
19PCA: Principal Component Analysis
- Given data in d dimensions
- Compute
- d-dim mean vector M
- d x d covariance matrix C
- eigenvectors and eigenvalues
- Sort by eigenvalues
- Select top k < d eigenvalues
- Project data onto k eigenvectors
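The steps on this slide can be sketched with NumPy as follows. This is a minimal illustration: the function name `pca_project` is hypothetical, and using the sample covariance (dividing by n - 1, NumPy's default) is an assumption, since the slide does not say which normalization is meant.

```python
import numpy as np

def pca_project(X, k):
    """Project n x d data onto its top-k principal components.

    Returns (Y, A, top_eigvals): Y is n x k, A is d x k with the chosen
    eigenvectors as columns, top_eigvals are the matching eigenvalues.
    """
    X = np.asarray(X, dtype=float)
    M = X.mean(axis=0)                    # d-dim mean vector
    C = np.cov(X - M, rowvar=False)       # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric; ascending order
    order = np.argsort(eigvals)[::-1][:k] # indices of the k largest eigenvalues
    A = eigvecs[:, order]
    Y = (X - M) @ A                       # project centered data onto k eigenvectors
    return Y, A, eigvals[order]

# Usage: points lying on a line in 2-D are captured by one component.
X_line = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
Y_line, A_line, top_vals = pca_project(X_line, k=1)
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, so the eigenvalues come back real and sorted.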
20PCA
21PCA
22PCA
- Eigenvectors
- Unit vectors in directions of maximum variance
- Eigenvalues
- Magnitude of the variance in the direction of
each eigenvector
23PCA
- Find largest eigenvalues and corresponding eigenvectors
- Project each point onto the k principal components
- where A is a d x k matrix whose columns are the k principal components
24PCA via Autoencoder ANN
25Non-Linear PCA by Autoencoder
26PCA
- need vector representation
- 0-d: sample mean
- 1-d: y = mx + b
- 2-d: y1 = mx + b, y2 = mx + b
27MDS: Multidimensional Scaling
- PCA requires vector representation
- Given pairwise distances between n points?
- Find coordinates for points in d-dimensional space such that distances are preserved as well as possible
28(No Transcript)
29(No Transcript)
30MDS
- Assign points to coords x_i in d-dim space
- random coordinate values
- principal components
- dimensions with greatest variance
- Do gradient descent on coordinates x_i of each point j until distortion is minimized
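A minimal NumPy sketch of this procedure follows. The transcript does not preserve the distortion formula, so the sketch assumes the common squared-error form, the sum over pairs of (embedded distance minus given distance) squared. It uses the random-coordinates initialization, and the names `mds_gradient_descent`, `distortion`, `lr`, and `steps` are hypothetical.

```python
import numpy as np

def distortion(X, D):
    """Assumed distortion: sum over pairs i < j of (||x_i - x_j|| - D_ij)^2."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    return float(((dist - D) ** 2).sum() / 2.0)  # each pair appears twice

def mds_gradient_descent(D, dim=2, steps=2000, lr=0.01, seed=0):
    """Find coordinates whose pairwise distances approximate the matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, dim))              # random coordinate values
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]   # n x n x dim pairwise differences
        dist = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(dist, 1.0)            # avoid division by zero on the diagonal
        coef = (dist - D) / dist
        np.fill_diagonal(coef, 0.0)            # a point exerts no pull on itself
        grad = 2.0 * (coef[:, :, None] * diff).sum(axis=1)
        X -= lr * grad                         # gradient step on every coordinate
    return X

# Usage: embed the four corners of a unit square from their distances alone.
s = np.sqrt(2.0)
D_square = np.array([[0, 1, s, 1],
                     [1, 0, 1, s],
                     [s, 1, 0, 1],
                     [1, s, 1, 0]], dtype=float)
X_square = mds_gradient_descent(D_square)
```

A fixed step size and step count keep the sketch short; gradient descent on this objective can stop in a local minimum, so in practice several random starts or the principal-component initialization mentioned above may give a better fit.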
31Distortion
32Distortion
33Distortion
34Gradient Descent on Coordinates
35Subjective Distances
- Brazil
- USA
- Egypt
- Congo
- Russia
- France
- Cuba
- Yugoslavia
- Israel
- China
36(No Transcript)
37(No Transcript)
38How Many Dimensions?
- D too large
- perfect fit, no distortion
- not easy to understand/visualize
- D too small
- poor fit, much distortion
- easy to visualize, but pattern may be misleading
- D just right?
39(No Transcript)
40(No Transcript)
41(No Transcript)
42Agglomerative Clustering of Proteins
43(No Transcript)