Title: Scalable Data Mining
1. Scalable Data Mining
Brigham Anderson, Andrew Moore, Dan Pelleg, Alex Gray, Bob Nichols, Andy Connolly
The Auton Lab, Carnegie Mellon University, www.autonlab.org
2. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
3. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
4. Nearest Neighbor: Naïve Approach
- Given a query point X.
- Scan through each point Y, keeping track of the closest one.
- Takes O(N) time for each query!
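To make the O(N) cost concrete, here is a minimal sketch of the naïve scan (plain Python; the function name and the example query are illustrative, not from the slides):

```python
import math

def naive_nearest_neighbor(query, points):
    """Scan through every point Y, keeping the closest one: O(N) per query."""
    best_point, best_dist = None, float("inf")
    for p in points:
        d = math.dist(query, p)              # Euclidean distance to the query
        if d < best_dist:
            best_point, best_dist = p, d
    return best_point, best_dist

# Example using the three points from the construction slides below
points = [(0.00, 0.00), (1.00, 4.31), (0.13, 2.85)]
print(naive_nearest_neighbor((0.2, 3.0), points))    # -> ((0.13, 2.85), ...)
```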
5. Speeding Up Nearest Neighbor
- We can speed up the search for the nearest neighbor:
  - Examine nearby points first.
  - Ignore any points that are farther than the nearest point found so far.
- Do this using a KD-tree:
  - Tree-based data structure.
  - Recursively partitions points into axis-aligned boxes.
6. KD-Tree Construction

  Pt    X     Y
   1   0.00  0.00
   2   1.00  4.31
   3   0.13  2.85

We start with a list of n-dimensional points.
7. KD-Tree Construction

Split: X > 0.5?

  NO (X <= 0.5):          YES (X > 0.5):
  Pt    X     Y            Pt    X     Y
   1   0.00  0.00           2   1.00  4.31
   3   0.13  2.85

We can split the points into two groups by choosing a dimension X and a value V,
and separating the points into X > V and X <= V.
8. KD-Tree Construction

Split: X > 0.5?

  NO (X <= 0.5):          YES (X > 0.5):
  Pt    X     Y            Pt    X     Y
   1   0.00  0.00           2   1.00  4.31
   3   0.13  2.85

We can then consider each group separately and possibly split again (along the
same or a different dimension).
9. KD-Tree Construction

Root split: X > 0.5?
  YES: Pt 2 (1.00, 4.31)
  NO:  split again on Y > 0.1?
         YES: Pt 3 (0.13, 2.85)
         NO:  Pt 1 (0.00, 0.00)

We can then consider each group separately and possibly split again (along the
same or a different dimension).
10. KD-Tree Construction
We can keep splitting the points in each set to
create a tree structure. Each node with no
children (leaf node) contains a list of points.
11. KD-Tree Construction
We will keep one additional piece of information at each node: the (tight)
bounds of the points at or below this node.
12. KD-Tree Construction
- Use heuristics to make splitting decisions:
  - Which dimension do we split along? The widest one.
  - Which value do we split at? The median value of the split dimension over the points.
  - When do we stop? When there are fewer than m points left, OR the box has hit some minimum width.
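A minimal sketch of this construction under the heuristics above (split on the widest dimension, at the median, stop below m points), assuming NumPy; the Node layout and names are illustrative, not the Auton Lab implementation:

```python
import numpy as np

class Node:
    def __init__(self, points, bounds, split_dim=None, split_val=None,
                 left=None, right=None):
        self.points = points                  # stored only at leaf nodes
        self.bounds = bounds                  # tight bounds of points at/below this node
        self.split_dim, self.split_val = split_dim, split_val
        self.left, self.right = left, right

def build_kdtree(points, min_points=10):
    points = np.asarray(points, dtype=float)
    bounds = (points.min(axis=0), points.max(axis=0))     # tight bounding box
    widths = bounds[1] - bounds[0]
    if len(points) <= min_points or widths.max() == 0.0:
        return Node(points, bounds)                        # leaf: keep the point list
    dim = int(np.argmax(widths))               # split along the widest dimension
    val = float(np.median(points[:, dim]))     # split at the median value
    mask = points[:, dim] <= val
    if mask.all() or not mask.any():           # degenerate split: fall back to a leaf
        return Node(points, bounds)
    return Node(None, bounds, dim, val,
                left=build_kdtree(points[mask], min_points),
                right=build_kdtree(points[~mask], min_points))
```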
13. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
14. Nearest Neighbor with KD-Trees
We traverse the tree looking for the nearest
neighbor of the query point.
15-16. Nearest Neighbor with KD-Trees
Examine nearby points first: explore the branch of the tree that is closest to
the query point first.
17-18. Nearest Neighbor with KD-Trees
When we reach a leaf node, compute the distance to each point in the node.
19. Nearest Neighbor with KD-Trees
Then we can backtrack and try the other branch at
each node visited.
20. Nearest Neighbor with KD-Trees
Each time a new closest point is found, we can update the distance bounds.
21-23. Nearest Neighbor with KD-Trees
Using the distance bounds and the bounds of the data below each node, we can
prune parts of the tree that could NOT include the nearest neighbor.
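A sketch of the search just described, reusing the Node class from the construction sketch above: descend into the branch nearest the query first, scan points at the leaves, backtrack, and prune any node whose bounding box cannot possibly beat the best distance found so far (names are illustrative):

```python
import numpy as np

def min_dist_to_box(query, bounds):
    """Smallest possible distance from the query to any point inside the box."""
    lo, hi = bounds
    return float(np.linalg.norm(query - np.clip(query, lo, hi)))

def kdtree_nearest(node, query, best=(None, float("inf"))):
    query = np.asarray(query, dtype=float)
    # Prune: nothing at or below this node can beat the best distance so far.
    if min_dist_to_box(query, node.bounds) >= best[1]:
        return best
    if node.points is not None:                      # leaf: scan its point list
        for p in node.points:
            d = float(np.linalg.norm(query - p))
            if d < best[1]:
                best = (p, d)
        return best
    # Explore the child closest to the query first, then backtrack to the other.
    near, far = ((node.left, node.right)
                 if query[node.split_dim] <= node.split_val
                 else (node.right, node.left))
    best = kdtree_nearest(near, query, best)
    best = kdtree_nearest(far, query, best)
    return best
```

On low-dimensional data the pruning test usually eliminates most of the tree, which is where the speedup over the naïve scan comes from.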
24. Metric Trees
- Kd-trees are rendered worse-than-useless in higher dimensions.
- Metric trees only require a metric space (a well-behaved distance function).
25. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
26. What does k-means do?
27. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
28. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
- Randomly guess k cluster Center locations.
29. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
- Randomly guess k cluster Center locations.
- Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints.)
30. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
- Randomly guess k cluster Center locations.
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns.
31. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
- Randomly guess k cluster Center locations.
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns...
- ...and jumps there.
- Repeat until terminated!
- For a basic tutorial on k-means, see any machine learning textbook, or www.cs.cmu.edu/~awm/tutorials/kmeans.html
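A minimal sketch of this plain (un-accelerated) k-means loop in NumPy; the initialization and termination details are simplified and the names are mine, not the tutorial's:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial Centers
    for _ in range(n_iters):
        # Each datapoint finds out which Center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        owner = dists.argmin(axis=1)
        # Each Center finds the centroid of the points it owns... and jumps there.
        new_centers = np.array([X[owner == j].mean(axis=0) if np.any(owner == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):      # terminated: Centers stopped moving
            break
        centers = new_centers
    return centers, owner
```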
32. K-means
- Ask the user how many clusters they'd like (e.g. k = 5).
- Randomly guess k cluster Center locations.
- Each datapoint finds out which Center it's closest to.
- Each Center finds the centroid of the points it owns.
33. K-means search guts
Each Center says: "I must compute Σxi of all points I own."
34. K-means search guts
Each Center says: "I will compute Σxi of all points I own in the rectangle."
35. K-means search guts
Each Center says: "I will compute Σxi of all points I own in the left
subrectangle, then in the right subrectangle, then add 'em."
36. In recursive call...
Each Center says: "I will compute Σxi of all points I own in the rectangle."
37. In recursive call...
- Find the Center nearest to the rectangle.
Each Center says: "I will compute Σxi of all points I own in the rectangle."
38-39. In recursive call...
- Find the Center nearest to the rectangle.
- For each other Center: can it own any points in the rectangle?
Each Center says: "I will compute Σxi of all points I own in the rectangle."
40. Pruning
- Find the Center nearest to the rectangle.
- For each other Center: can it own any points in the rectangle?
- If not, PRUNE.
The nearest Center says: "I will just grab Σxi from the cached value in the node."
The other Centers say: "I will compute Σxi of all points I own in the rectangle."
41. Pruning not possible at previous level
- Find the Center nearest to the rectangle.
- For each other Center: can it own any points in the rectangle?
- If yes, recurse...
Each Center says: "I will compute Σxi of all points I own in the rectangle."
But... maybe there's some optimization possible anyway?
42. Blacklisting
- Find the Center nearest to the rectangle.
- For each other Center: can it own any points in the rectangle?
- If yes, recurse...
Each Center says: "I will compute Σxi of all points I own in the rectangle."
But... maybe there's some optimization possible anyway? A hopeless Center never
needs to be considered in any recursion.
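A sketch of the geometric test behind this pruning and blacklisting, in the spirit of Pelleg and Moore (1999) but not their code: a Center c1 dominates c2 over a kd-tree box if even the box corner pushed as far as possible toward c2 is still closer to c1. Dominated (hopeless) Centers are blacklisted for the whole subtree; if only one Center survives, the node's cached Σxi can be grabbed directly. The function names and the choice of candidate Center are illustrative assumptions:

```python
import numpy as np

def dominates(c1, c2, lo, hi):
    """True if every point in the box [lo, hi] is closer to c1 than to c2.
    It suffices to check the corner of the box pushed furthest toward c2."""
    c1, c2 = np.asarray(c1, dtype=float), np.asarray(c2, dtype=float)
    corner = np.where(c2 > c1, hi, lo)         # worst-case corner for c1
    return np.linalg.norm(corner - c1) < np.linalg.norm(corner - c2)

def surviving_centers(centers, lo, hi):
    """Blacklisting step: drop every Center that cannot own any point in the
    box, so it is never considered again anywhere below this node."""
    centers = np.asarray(centers, dtype=float)
    # Candidate owner: here, the Center closest to the box midpoint (a heuristic).
    mid = (np.asarray(lo, dtype=float) + np.asarray(hi, dtype=float)) / 2.0
    nearest = centers[np.argmin(np.linalg.norm(centers - mid, axis=1))]
    return np.array([c for c in centers
                     if np.array_equal(c, nearest)
                     or not dominates(nearest, c, lo, hi)])
```

If surviving_centers returns a single Center, the recursion can stop there: that Center owns everything in the box, and the cached per-node sum and count give its contribution in O(1).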
43. Example
Example generated by Dan Pelleg and Andrew Moore, "Accelerating Exact k-means
Algorithms with Geometric Reasoning," Proc. Conference on Knowledge Discovery
in Databases, 1999 (KDD99). (Available on www.autonlab.org.)
44-51. K-means continues
52. K-means terminates
53. Comparison to a linear algorithm
- Astrophysics data (2 dimensions)
54. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
55. What if we want to do density estimation with multimodal or clumpy data?
56. The GMM assumption
- There are k components. The i-th component is called ω_i.
- Component ω_i has an associated mean vector μ_i.
57. The GMM assumption
- There are k components. The i-th component is called ω_i.
- Component ω_i has an associated mean vector μ_i.
- Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
58. The GMM assumption
- There are k components. The i-th component is called ω_i.
- Component ω_i has an associated mean vector μ_i.
- Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
  - Pick a component at random. Choose component i with probability P(ω_i).
59. The GMM assumption
- There are k components. The i-th component is called ω_i.
- Component ω_i has an associated mean vector μ_i.
- Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
  - Pick a component at random. Choose component i with probability P(ω_i).
  - Datapoint ~ N(μ_i, σ²I).
60. The General GMM assumption
- There are k components. The i-th component is called ω_i.
- Component ω_i has an associated mean vector μ_i.
- Each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i.
- Assume that each datapoint is generated according to the following recipe:
  - Pick a component at random. Choose component i with probability P(ω_i).
  - Datapoint ~ N(μ_i, Σ_i).
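A small sketch of this generative recipe in NumPy (written for the general case with per-component covariance Σ_i; the spherical case just uses σ²I). The weights, means, and covariances in the example are made up purely for illustration:

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Generate n datapoints by the recipe above: pick component i with
    probability P(w_i), then draw the datapoint from N(mu_i, Sigma_i)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    comps = rng.choice(len(weights), size=n, p=weights / weights.sum())
    return np.array([rng.multivariate_normal(means[i], covs[i]) for i in comps])

# Example: k = 3 spherical components (covariance sigma^2 * I), invented values
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([0.0, 4.0])]
covs = [0.5 * np.eye(2)] * 3
X = sample_gmm(500, weights=[0.5, 0.3, 0.2], means=means, covs=covs)
```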
61. Gaussian Mixture Example: Start
Advance apologies: in black and white this example will be incomprehensible.
62. After first iteration
63. After 2nd iteration
64. After 3rd iteration
65. After 4th iteration
66. After 5th iteration
67. After 6th iteration
68. After 20th iteration
69. Some Bio Assay data
70. GMM clustering of the assay data
71. Resulting Density Estimator
- For a basic tutorial on Gaussian Mixture Models, see the Hastie et al. or
Duda, Hart & Stork books, or www.cs.cmu.edu/~awm/tutorials/gmm.html
75-76. Uses for kd-trees and cousins
- K-means clustering [Pelleg and Moore, 1999; Moore, 2000]
- Kernel density estimation [Deng and Moore, 1995; Gray and Moore, 2001]
- Kernel-density-based clustering [Wong and Moore, 2002]
- Gaussian mixture models [Moore, 1999]
- Kernel regression [Deng and Moore, 1995]
- Locally weighted regression [Moore, Schneider and Deng, 1997]
- Kernel-based Bayes classifiers [Moore and Schneider, 1997]
- N-point correlation functions [Gray and Moore, 2001]
- Also work by Priebe, Ramakrishnan, Schaal, D'Souza, Elkan, ...
- Papers (and software): www.autonlab.org
77. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
78. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
80. Image space (4096 dim)
81. Image space (4096 dim): galaxies
84. Demo
86. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
87. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
88. The Model maps parameter space (12 dim) into image space (4096 dim),
producing synthetic galaxies.
89-91. The target image lives in image space (4096 dim); the Model maps
parameter space (12 dim) into that same space.
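A hedged sketch of what a memory-based fit of this kind could look like, under stated assumptions: a library of synthetic galaxies is rendered from sampled 12-dim parameter vectors, and the target image is matched to the nearest library image in the 4096-dim image space. The render function, the parameter table, and the linear scan are hypothetical placeholders (in practice a kd-tree or metric tree over the library would do the search), not the actual GMorph code:

```python
import numpy as np

def fit_galaxy_memory_based(target_image, param_table, render):
    """Memory-based fit: return the parameters of the synthetic galaxy whose
    rendered image is nearest to the target in image space."""
    target = np.asarray(target_image, dtype=float).ravel()      # e.g. 64x64 -> 4096
    library = np.array([np.asarray(render(p), dtype=float).ravel()
                        for p in param_table])                  # one row per synthetic galaxy
    dists = np.linalg.norm(library - target, axis=1)
    best = int(np.argmin(dists))
    return param_table[best], float(dists[best])
```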
92. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
96-98. Target image; the Model maps parameter space (12 dim) into image space
(4096 dim), which PCA compresses into an eigenspace (16 dim).
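A hedged sketch of the PCA variant under the same assumptions: build a 16-dimensional eigenspace from the synthetic library, project both library and target into it, and do the match there instead of in the raw 4096-dim image space. Again, the names and structure are placeholders, not the actual GMorph implementation:

```python
import numpy as np

def build_eigenspace(library_images, n_components=16):
    """PCA via SVD on the library of synthetic images: 4096 dim -> 16 dim."""
    X = np.array([np.asarray(im, dtype=float).ravel() for im in library_images])
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]        # mean image and top principal directions

def project(image, mean, basis):
    return basis @ (np.asarray(image, dtype=float).ravel() - mean)

def fit_galaxy_eigenspace(target_image, param_table, library_images):
    """Match the target to the library in the low-dimensional eigenspace."""
    mean, basis = build_eigenspace(library_images)
    coords = np.array([project(im, mean, basis) for im in library_images])
    target = project(target_image, mean, basis)
    best = int(np.argmin(np.linalg.norm(coords - target, axis=1)))
    return param_table[best]
```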
99. Snapshot
100. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
- Results
104. Outline
- Kd-trees
- Fast nearest-neighbor finding
- Fast K-means clustering
- Fast kernel density estimation
- Large-scale galactic morphology
- GMorph
- Memory-based
- PCA
- Results
See the Auton Lab website (www.autonlab.org) for papers.