Title: Scalable Data Mining
1. Scalable Data Mining
Brigham Anderson, Andrew Moore, Dan Pelleg, Alex Gray, Bob Nichols, Andy Connolly
The Auton Lab, Carnegie Mellon University, www.autonlab.org
2. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
3. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
4. Nearest Neighbor: Naïve Approach
- Given a query point X.
- Scan through each point Y, keeping track of the closest one.
- Takes O(N) time for each query!
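To make the O(N) cost concrete, here is a minimal sketch of the naïve scan (plain Python; the function name and the example query are illustrative, not from the slides):

```python
import math

def naive_nearest_neighbor(query, points):
    """Scan through every point Y, keeping the closest one: O(N) per query."""
    best_point, best_dist = None, float("inf")
    for p in points:
        d = math.dist(query, p)              # Euclidean distance to the query
        if d < best_dist:
            best_point, best_dist = p, d
    return best_point, best_dist

# Example using the three points from the construction slides below
points = [(0.00, 0.00), (1.00, 4.31), (0.13, 2.85)]
print(naive_nearest_neighbor((0.2, 3.0), points))    # -> ((0.13, 2.85), ...)
```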
5. Speeding Up Nearest Neighbor
- We can speed up the search for the nearest neighbor:
  - Examine nearby points first.
  - Ignore any points that are farther than the nearest point found so far.
- Do this using a KD-tree:
  - Tree-based data structure.
  - Recursively partitions points into axis-aligned boxes.
6. KD-Tree Construction

  Pt    X     Y
   1   0.00  0.00
   2   1.00  4.31
   3   0.13  2.85

We start with a list of n-dimensional points.
7. KD-Tree Construction

Split: X > 0.5?

  NO (X <= 0.5):          YES (X > 0.5):
  Pt    X     Y            Pt    X     Y
   1   0.00  0.00           2   1.00  4.31
   3   0.13  2.85

We can split the points into two groups by choosing a dimension X and a value V,
and separating the points into X > V and X <= V.
8. KD-Tree Construction

Split: X > 0.5?

  NO (X <= 0.5):          YES (X > 0.5):
  Pt    X     Y            Pt    X     Y
   1   0.00  0.00           2   1.00  4.31
   3   0.13  2.85

We can then consider each group separately and possibly split again (along the
same or a different dimension).
9. KD-Tree Construction

Root split: X > 0.5?
  YES: Pt 2 (1.00, 4.31)
  NO:  split again on Y > 0.1?
         YES: Pt 3 (0.13, 2.85)
         NO:  Pt 1 (0.00, 0.00)

We can then consider each group separately and possibly split again (along the
same or a different dimension).
10. KD-Tree Construction
We can keep splitting the points in each set to
create a tree structure. Each node with no
children (leaf node) contains a list of points.
11. KD-Tree Construction
We will keep one additional piece of information at each node: the (tight)
bounds of the points at or below this node.
12. KD-Tree Construction
- Use heuristics to make splitting decisions:
  - Which dimension do we split along? The widest one.
  - Which value do we split at? The median value of the split dimension over the points.
  - When do we stop? When there are fewer than m points left, OR the box has hit some minimum width.
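A minimal sketch of this construction under the heuristics above (split on the widest dimension, at the median, stop below m points), assuming NumPy; the Node layout and names are illustrative, not the Auton Lab implementation:

```python
import numpy as np

class Node:
    def __init__(self, points, bounds, split_dim=None, split_val=None,
                 left=None, right=None):
        self.points = points                  # stored only at leaf nodes
        self.bounds = bounds                  # tight bounds of points at/below this node
        self.split_dim, self.split_val = split_dim, split_val
        self.left, self.right = left, right

def build_kdtree(points, min_points=10):
    points = np.asarray(points, dtype=float)
    bounds = (points.min(axis=0), points.max(axis=0))     # tight bounding box
    widths = bounds[1] - bounds[0]
    if len(points) <= min_points or widths.max() == 0.0:
        return Node(points, bounds)                        # leaf: keep the point list
    dim = int(np.argmax(widths))               # split along the widest dimension
    val = float(np.median(points[:, dim]))     # split at the median value
    mask = points[:, dim] <= val
    if mask.all() or not mask.any():           # degenerate split: fall back to a leaf
        return Node(points, bounds)
    return Node(None, bounds, dim, val,
                left=build_kdtree(points[mask], min_points),
                right=build_kdtree(points[~mask], min_points))
```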
13. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
14. Nearest Neighbor with KD-Trees
We traverse the tree looking for the nearest
neighbor of the query point.
15-16. Nearest Neighbor with KD-Trees
Examine nearby points first: explore the branch of the tree that is closest to
the query point first.
17-18. Nearest Neighbor with KD-Trees
When we reach a leaf node, compute the distance to each point in the node.
19. Nearest Neighbor with KD-Trees
Then we can backtrack and try the other branch at
each node visited.
20. Nearest Neighbor with KD-Trees
Each time a new closest point is found, we can update the distance bounds.
21-23. Nearest Neighbor with KD-Trees
Using the distance bounds and the bounds of the data below each node, we can
prune parts of the tree that could NOT include the nearest neighbor.
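A sketch of the search just described, reusing the Node class from the construction sketch above: descend into the branch nearest the query first, scan points at the leaves, backtrack, and prune any node whose bounding box cannot possibly beat the best distance found so far (names are illustrative):

```python
import numpy as np

def min_dist_to_box(query, bounds):
    """Smallest possible distance from the query to any point inside the box."""
    lo, hi = bounds
    return float(np.linalg.norm(query - np.clip(query, lo, hi)))

def kdtree_nearest(node, query, best=(None, float("inf"))):
    query = np.asarray(query, dtype=float)
    # Prune: nothing at or below this node can beat the best distance so far.
    if min_dist_to_box(query, node.bounds) >= best[1]:
        return best
    if node.points is not None:                      # leaf: scan its point list
        for p in node.points:
            d = float(np.linalg.norm(query - p))
            if d < best[1]:
                best = (p, d)
        return best
    # Explore the child closest to the query first, then backtrack to the other.
    near, far = ((node.left, node.right)
                 if query[node.split_dim] <= node.split_val
                 else (node.right, node.left))
    best = kdtree_nearest(near, query, best)
    best = kdtree_nearest(far, query, best)
    return best
```

On low-dimensional data the pruning test usually eliminates most of the tree, which is where the speedup over the naïve scan comes from.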
24. Metric Trees
- Kd-trees are rendered worse-than-useless in higher dimensions.
- Metric trees only require a metric space (a well-behaved distance function).
25. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
26. What does k-means do?
27. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
28. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
- Randomly guess k cluster Center locations.
29. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
- Randomly guess k cluster Center locations.
- Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints.)
30. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
- Randomly guess k cluster Center locations.
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns.
31. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
- Randomly guess k cluster Center locations.
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns...
- ...and jumps there.
- Repeat until terminated!
- For a basic tutorial on k-means, see any machine learning textbook, or www.cs.cmu.edu/~awm/tutorials/kmeans.html
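A minimal sketch of this plain (un-accelerated) k-means loop in NumPy; the initialization and termination details are simplified and the names are mine, not the tutorial's:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial Centers
    for _ in range(n_iters):
        # Each datapoint finds out which Center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        owner = dists.argmin(axis=1)
        # Each Center finds the centroid of the points it owns... and jumps there.
        new_centers = np.array([X[owner == j].mean(axis=0) if np.any(owner == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):      # terminated: Centers stopped moving
            break
        centers = new_centers
    return centers, owner
```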
32. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
- Randomly guess k cluster Center locations.
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns.
33. K-means search guts
Each Center says: "I must compute Σxi of all points I own."
34. K-means search guts
Each Center says: "I will compute Σxi of all points I own in the rectangle."
35. K-means search guts
Each Center says: "I will compute Σxi of all points I own in the left
subrectangle, then in the right subrectangle, then add 'em."
36. In recursive call...
Each Center says: "I will compute Σxi of all points I own in the rectangle."
37. In recursive call...
- Find the Center nearest to the rectangle.
Each Center says: "I will compute Σxi of all points I own in the rectangle."
38-39. In recursive call...
- Find the Center nearest to the rectangle.
- For each other Center: can it own any points in the rectangle?
Each Center says: "I will compute Σxi of all points I own in the rectangle."
40. Pruning
- Find the Center nearest to the rectangle.
- For each other Center: can it own any points in the rectangle?
- If not, PRUNE.
The nearest Center says: "I will just grab Σxi from the cached value in the node."
The other Centers say: "I will compute Σxi of all points I own in the rectangle."
41. Pruning not possible at previous level
- Find the Center nearest to the rectangle.
- For each other Center: can it own any points in the rectangle?
- If yes, recurse...
Each Center says: "I will compute Σxi of all points I own in the rectangle."
But... maybe there's some optimization possible anyway?
42. Blacklisting
- Find the Center nearest to the rectangle.
- For each other Center: can it own any points in the rectangle?
- If yes, recurse...
Each Center says: "I will compute Σxi of all points I own in the rectangle."
But... maybe there's some optimization possible anyway? A hopeless Center never
needs to be considered in any recursion.
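A sketch of the geometric test behind this pruning and blacklisting, in the spirit of Pelleg and Moore (1999) but not their code: a Center c1 dominates c2 over a kd-tree box if even the box corner pushed as far as possible toward c2 is still closer to c1. Dominated (hopeless) Centers are blacklisted for the whole subtree; if only one Center survives, the node's cached Σxi can be grabbed directly. The function names and the choice of candidate Center are illustrative assumptions:

```python
import numpy as np

def dominates(c1, c2, lo, hi):
    """True if every point in the box [lo, hi] is closer to c1 than to c2.
    It suffices to check the corner of the box pushed furthest toward c2."""
    c1, c2 = np.asarray(c1, dtype=float), np.asarray(c2, dtype=float)
    corner = np.where(c2 > c1, hi, lo)         # worst-case corner for c1
    return np.linalg.norm(corner - c1) < np.linalg.norm(corner - c2)

def surviving_centers(centers, lo, hi):
    """Blacklisting step: drop every Center that cannot own any point in the
    box, so it is never considered again anywhere below this node."""
    centers = np.asarray(centers, dtype=float)
    # Candidate owner: here, the Center closest to the box midpoint (a heuristic).
    mid = (np.asarray(lo, dtype=float) + np.asarray(hi, dtype=float)) / 2.0
    nearest = centers[np.argmin(np.linalg.norm(centers - mid, axis=1))]
    return np.array([c for c in centers
                     if np.array_equal(c, nearest)
                     or not dominates(nearest, c, lo, hi)])
```

If surviving_centers returns a single Center, the recursion can stop there: that Center owns everything in the box, and the cached per-node sum and count give its contribution in O(1).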
43. Example
Example generated by Dan Pelleg and Andrew Moore, "Accelerating Exact k-means
Algorithms with Geometric Reasoning," Proc. Conference on Knowledge Discovery
in Databases, 1999 (KDD99). (Available on www.autonlab.org.)
44-51. K-means continues
52. K-means terminates
53. Comparison to a linear algorithm
- Astrophysics data (2 dimensions)
54. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
55. What if we want to do density estimation with multimodal or clumpy data?
56. The GMM assumption
- There are k components. The i-th component is called ω_i.
- Component ω_i has an associated mean vector μ_i.
57. The GMM assumption
- There are k components. The i-th component is called ω_i.
- Component ω_i has an associated mean vector μ_i.
- Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
58. The GMM assumption
- There are k components. The i-th component is called ω_i.
- Component ω_i has an associated mean vector μ_i.
- Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
  - Pick a component at random. Choose component i with probability P(ω_i).
59. The GMM assumption
- There are k components. The i-th component is called ω_i.
- Component ω_i has an associated mean vector μ_i.
- Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
  - Pick a component at random. Choose component i with probability P(ω_i).
  - Datapoint ~ N(μ_i, σ²I).
60. The General GMM assumption
- There are k components. The i-th component is called ω_i.
- Component ω_i has an associated mean vector μ_i.
- Each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i.
- Assume that each datapoint is generated according to the following recipe:
  - Pick a component at random. Choose component i with probability P(ω_i).
  - Datapoint ~ N(μ_i, Σ_i).
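A small sketch of this generative recipe in NumPy (written for the general case with per-component covariance Σ_i; the spherical case just uses σ²I). The weights, means, and covariances in the example are made up purely for illustration:

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Generate n datapoints by the recipe above: pick component i with
    probability P(w_i), then draw the datapoint from N(mu_i, Sigma_i)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    comps = rng.choice(len(weights), size=n, p=weights / weights.sum())
    return np.array([rng.multivariate_normal(means[i], covs[i]) for i in comps])

# Example: k = 3 spherical components (covariance sigma^2 * I), invented values
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([0.0, 4.0])]
covs = [0.5 * np.eye(2)] * 3
X = sample_gmm(500, weights=[0.5, 0.3, 0.2], means=means, covs=covs)
```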
61. Gaussian Mixture Example: Start
Advance apologies: in black and white this example will be incomprehensible.
62. After first iteration
63. After 2nd iteration
64. After 3rd iteration
65. After 4th iteration
66. After 5th iteration
67. After 6th iteration
68. After 20th iteration
69. Some Bio Assay data
70. GMM clustering of the assay data
71. Resulting Density Estimator
- For a basic tutorial on Gaussian Mixture Models, see the Hastie et al. or
Duda, Hart & Stork books, or www.cs.cmu.edu/~awm/tutorials/gmm.html
75-76. Uses for kd-trees and cousins
- K-means clustering [Pelleg and Moore, 1999; Moore, 2000]
- Kernel density estimation [Deng and Moore, 1995; Gray and Moore, 2001]
- Kernel-density-based clustering [Wong and Moore, 2002]
- Gaussian mixture models [Moore, 1999]
- Kernel regression [Deng and Moore, 1995]
- Locally weighted regression [Moore, Schneider and Deng, 1997]
- Kernel-based Bayes classifiers [Moore and Schneider, 1997]
- N-point correlation functions [Gray and Moore, 2001]
- Also work by Priebe, Ramakrishnan, Schaal, D'Souza, Elkan, ...
- Papers (and software): www.autonlab.org
77. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
78. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
80. Image space (4096 dim)
81. Image space (4096 dim): galaxies
84. Demo
86. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
87. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
88. The Model maps parameter space (12 dim) into image space (4096 dim),
producing synthetic galaxies.
89-91. The target image lives in image space (4096 dim); the Model maps
parameter space (12 dim) into that same space.
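A hedged sketch of what a memory-based fit of this kind could look like, under stated assumptions: a library of synthetic galaxies is rendered from sampled 12-dim parameter vectors, and the target image is matched to the nearest library image in the 4096-dim image space. The render function, the parameter table, and the linear scan are hypothetical placeholders (in practice a kd-tree or metric tree over the library would do the search), not the actual GMorph code:

```python
import numpy as np

def fit_galaxy_memory_based(target_image, param_table, render):
    """Memory-based fit: return the parameters of the synthetic galaxy whose
    rendered image is nearest to the target in image space."""
    target = np.asarray(target_image, dtype=float).ravel()      # e.g. 64x64 -> 4096
    library = np.array([np.asarray(render(p), dtype=float).ravel()
                        for p in param_table])                  # one row per synthetic galaxy
    dists = np.linalg.norm(library - target, axis=1)
    best = int(np.argmin(dists))
    return param_table[best], float(dists[best])
```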
92. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
96-98. Target image; the Model maps parameter space (12 dim) into image space
(4096 dim), which PCA compresses into an eigenspace (16 dim).
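A hedged sketch of the PCA variant under the same assumptions: build a 16-dimensional eigenspace from the synthetic library, project both library and target into it, and do the match there instead of in the raw 4096-dim image space. Again, the names and structure are placeholders, not the actual GMorph implementation:

```python
import numpy as np

def build_eigenspace(library_images, n_components=16):
    """PCA via SVD on the library of synthetic images: 4096 dim -> 16 dim."""
    X = np.array([np.asarray(im, dtype=float).ravel() for im in library_images])
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]        # mean image and top principal directions

def project(image, mean, basis):
    return basis @ (np.asarray(image, dtype=float).ravel() - mean)

def fit_galaxy_eigenspace(target_image, param_table, library_images):
    """Match the target to the library in the low-dimensional eigenspace."""
    mean, basis = build_eigenspace(library_images)
    coords = np.array([project(im, mean, basis) for im in library_images])
    target = project(target_image, mean, basis)
    best = int(np.argmin(np.linalg.norm(coords - target, axis=1)))
    return param_table[best]
```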
99. Snapshot
100. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
- Results
104. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
- Results
See the Auton Lab website (www.autonlab.org) for papers.