Transcript and Presenter's Notes

Title: Scalable Data Mining


1
Scalable Data Mining
autonlab.org
Brigham Anderson, Andrew Moore, Dan Pelleg, Alex
Gray, Bob Nichols, Andy Connolly
The Auton Lab, Carnegie Mellon University, www.autonlab.org
2
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology

3
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology

4
Nearest Neighbor - Naïve Approach
  • Given a query point X.
  • Scan through every point Y, computing its distance
    to X, and keep the closest one found.
  • Takes O(N) time for each query! (A minimal sketch
    follows below.)
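
A minimal sketch of the naive scan, in Python (the point representation and
the Euclidean dist helper are our assumptions, not code from the slides):

    import math

    def dist(a, b):
        # Euclidean distance between two equal-length tuples
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def naive_nearest_neighbor(points, query):
        # Scan every point Y, keeping the closest seen so far: O(N) per query.
        best, best_d = None, float("inf")
        for y in points:
            d = dist(query, y)
            if d < best_d:
                best, best_d = y, d
        return best, best_d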

5
Speeding Up Nearest Neighbor
  • We can speed up the search for the nearest
    neighbor
  • Examine nearby points first.
  • Ignore any points that are farther than the
    nearest point found so far.
  • Do this using a KD-tree
  • A tree-based data structure
  • Recursively partitions points into axis-aligned
    boxes.

6
KD-Tree Construction
Pt X Y
1 0.00 0.00
2 1.00 4.31
3 0.13 2.85

We start with a list of n-dimensional points.
7
KD-Tree Construction
Split: X > 0.5?

NO:
Pt X    Y
1  0.00 0.00
3  0.13 2.85

YES:
Pt X    Y
2  1.00 4.31

We can split the points into 2 groups by choosing
a dimension X and value V and separating the
points into X > V and X < V.
8
KD-Tree Construction
Split: X > 0.5?

NO:
Pt X    Y
1  0.00 0.00
3  0.13 2.85

YES:
Pt X    Y
2  1.00 4.31

We can then consider each group separately and
possibly split again (along the same or a different
dimension).
9
KD-Tree Construction
Split: X > 0.5?

YES:
Pt X    Y
2  1.00 4.31

NO: split again on Y > 0.1?

  YES:
  Pt X    Y
  3  0.13 2.85

  NO:
  Pt X    Y
  1  0.00 0.00

We can then consider each group separately and
possibly split again (along the same or a different
dimension).
10
KD-Tree Construction
We can keep splitting the points in each set to
create a tree structure. Each node with no
children (leaf node) contains a list of points.
11
KD-Tree Construction
We will keep around one additional piece of
information at each node. The (tight) bounds of
the points at or below this node.
12
KD-Tree Construction
  • Use heuristics to make splitting decisions
  • Which dimension do we split along? The widest one.
  • Which value do we split at? The median value of
    that split dimension among the points.
  • When do we stop? When there are fewer than m
    points left OR the box has hit some minimum width.
    (A build sketch follows below.)
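
A rough illustration of how those heuristics could be coded, in Python (the
node layout, the parameter m, and the minimum-width cutoff are our
assumptions, not the presenters' implementation):

    def build_kdtree(points, m=10, min_width=1e-6):
        # points: a list of equal-length tuples
        dims = range(len(points[0]))
        lo = [min(p[d] for p in points) for d in dims]
        hi = [max(p[d] for p in points) for d in dims]
        node = {"bounds": (lo, hi), "points": None,
                "left": None, "right": None}
        widths = [h - l for l, h in zip(lo, hi)]
        d = widths.index(max(widths))          # split along the widest dimension
        if len(points) < m or widths[d] < min_width:
            node["points"] = points            # leaf: stop splitting
            return node
        values = sorted(p[d] for p in points)
        v = values[len(values) // 2]           # split at the median value
        left = [p for p in points if p[d] < v]
        right = [p for p in points if p[d] >= v]
        if not left or not right:              # degenerate split: keep as a leaf
            node["points"] = points
            return node
        node["split"] = (d, v)
        node["left"] = build_kdtree(left, m, min_width)
        node["right"] = build_kdtree(right, m, min_width)
        return node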

13
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology

14
Nearest Neighbor with KD Trees
We traverse the tree looking for the nearest
neighbor of the query point.
15
Nearest Neighbor with KD Trees
Examine nearby points first: explore the branch
of the tree that is closest to the query point
first.
16
Nearest Neighbor with KD Trees
Examine nearby points first: explore the branch
of the tree that is closest to the query point
first.
17
Nearest Neighbor with KD Trees
When we reach a leaf node, we compute the distance
to each point in the node.
18
Nearest Neighbor with KD Trees
When we reach a leaf node, we compute the distance
to each point in the node.
19
Nearest Neighbor with KD Trees
Then we can backtrack and try the other branch at
each node visited.
20
Nearest Neighbor with KD Trees
Each time a new closest point is found, we can
update the distance bound.
21
Nearest Neighbor with KD Trees
Using the distance bounds and the bounds of the
data below each node, we can prune parts of the
tree that could NOT include the nearest neighbor.
22
Nearest Neighbor with KD Trees
Using the distance bounds and the bounds of the
data below each node, we can prune parts of the
tree that could NOT include the nearest neighbor.
23
Nearest Neighbor with KD Trees
Using the distance bounds and the bounds of the
data below each node, we can prune parts of the
tree that could NOT include the nearest neighbor.
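
Putting the search together, a Python sketch against the node layout from
the construction sketch above (it reuses dist and the math import from the
earlier sketch; this is our illustration, not the presenters' code):

    def min_dist_to_box(query, bounds):
        # Smallest possible distance from the query to any point in the box.
        lo, hi = bounds
        return math.sqrt(sum(max(l - q, 0.0, q - h) ** 2
                             for q, l, h in zip(query, lo, hi)))

    def kdtree_nearest(node, query, best=None, best_d=float("inf")):
        # Prune: nothing in this box can beat the best distance so far.
        if min_dist_to_box(query, node["bounds"]) >= best_d:
            return best, best_d
        if node["points"] is not None:                 # leaf: scan its points
            for y in node["points"]:
                d = dist(query, y)
                if d < best_d:
                    best, best_d = y, d
            return best, best_d
        d_split, v = node["split"]
        near, far = ((node["left"], node["right"])
                     if query[d_split] < v else
                     (node["right"], node["left"]))
        best, best_d = kdtree_nearest(near, query, best, best_d)  # nearby branch first
        best, best_d = kdtree_nearest(far, query, best, best_d)   # then backtrack
        return best, best_d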
24
Metric Trees
  • Kd-trees are rendered worse than useless in higher
    dimensions
  • Metric trees only require a metric space (a
    well-behaved distance function)

25
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology

26
What does k-means do?
27
K-means
  1. Ask user how many clusters they'd like. (e.g.
    k = 5)

28
K-means
  1. Ask user how many clusters they'd like. (e.g.
    k = 5)
  2. Randomly guess k cluster Center locations

29
K-means
  1. Ask user how many clusters they'd like. (e.g.
    k = 5)
  2. Randomly guess k cluster Center locations
  3. Each datapoint finds out which Center it's
    closest to. (Thus each Center owns a set of
    datapoints)

30
K-means
  1. Ask user how many clusters they'd like. (e.g.
    k = 5)
  2. Randomly guess k cluster Center locations
  3. Each datapoint finds out which Center it's
    closest to.
  4. Each Center finds the centroid of the points it
    owns

31
K-means
  1. Ask user how many clusters they'd like. (e.g.
    k = 5)
  2. Randomly guess k cluster Center locations
  3. Each datapoint finds out which Center it's
    closest to.
  4. Each Center finds the centroid of the points it
    owns
  5. …and jumps there
  6. Repeat until terminated!
  • For a basic tutorial on k-means, see any machine
    learning textbook, or
    www.cs.cmu.edu/~awm/tutorials/kmeans.html
    (a plain sketch of this loop follows below)
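
For reference, the plain (non-tree) version of this loop might look like the
following Python sketch (dist as defined in the earlier sketch; the names and
the convergence test are our assumptions):

    import random

    def kmeans(points, k, iters=100):
        centers = random.sample(points, k)      # 2. random initial Center locations
        for _ in range(iters):
            owned = [[] for _ in range(k)]
            for x in points:                    # 3. each point finds its closest Center
                j = min(range(k), key=lambda c: dist(x, centers[c]))
                owned[j].append(x)
            new_centers = [                     # 4-5. each Center jumps to the centroid
                tuple(sum(col) / len(pts) for col in zip(*pts)) if pts else centers[j]
                for j, pts in enumerate(owned)]
            if new_centers == centers:          # 6. repeat until nothing moves
                break
            centers = new_centers
        return centers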

32
K-means
  1. Ask user how many clusters they'd like. (e.g.
    k = 5)
  2. Randomly guess k cluster Center locations
  3. Each datapoint finds out which Center it's
    closest to.
  4. Each Center finds the centroid of the points it
    owns

33
K-means search guts
(Each of the k centers:) "I must compute Σxi of all points I own."
34
K-means search guts
(Each center:) "I will compute Σxi of all points I own in this rectangle."
35
K-means search guts
(Each center:) "I will compute Σxi of all points I own in the left
subrectangle, then the right subrectangle, then add 'em."
36
In recursive call
(Each center:) "I will compute Σxi of all points I own in this rectangle."
37
In recursive call
  1. Find center nearest to rectangle

(Each center:) "I will compute Σxi of all points I own in this rectangle."
38
In recursive call
  1. Find center nearest to rectangle
  2. For each other center: can it own any points in
    the rectangle?

(Each center:) "I will compute Σxi of all points I own in this rectangle."
39
In recursive call
  1. Find center nearest to rectangle
  2. For each other center: can it own any points in
    the rectangle?

(Each center:) "I will compute Σxi of all points I own in this rectangle."
40
Pruning
  1. Find center nearest to rectangle
  2. For each other center: can it own any points in
    the rectangle?
  3. If not, PRUNE

(The one remaining center:) "I will just grab Σxi from the cached value in
the node."
(Each other center elsewhere:) "I will compute Σxi of all points I own in
this rectangle."
41
Pruning not possible at previous level
  1. Find center nearest to rectangle
  2. For each other center: can it own any points in
    the rectangle?
  3. If yes, recurse...

(Each center:) "I will compute Σxi of all points I own in this rectangle."
But… maybe there's some optimization possible anyway?
42
Blacklisting
  1. Find center nearest to rectangle
  2. For each other center: can it own any points in
    the rectangle?
  3. If yes, recurse...

(Each center:) "I will compute Σxi of all points I own in this rectangle."
But… maybe there's some optimization possible anyway?
A hopeless center never needs to be considered in
any recursion.
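
A compressed sketch of the blacklisting idea in Python (not Pelleg and
Moore's actual code). It assumes each kd-tree node also caches node["count"]
and node["sum"], the number and vector sum of its points, and it reuses dist
and min_dist_to_box from the earlier sketches. The dominance test here,
comparing each center's minimum distance to the box against the nearest
center's maximum distance, is a deliberately conservative stand-in for the
stronger geometric test in their paper:

    def max_dist_to_box(center, bounds):
        # Largest possible distance from the center to any point in the box.
        lo, hi = bounds
        return math.sqrt(sum(max(abs(c - l), abs(c - h)) ** 2
                             for c, l, h in zip(center, lo, hi)))

    def kmeans_step_kdtree(node, centers, candidates, sums, counts):
        # candidates: indices of centers that might still own points in this box
        near = min(candidates,
                   key=lambda j: min_dist_to_box(centers[j], node["bounds"]))
        worst = max_dist_to_box(centers[near], node["bounds"])
        # Blacklist hopeless centers: never considered again below this node.
        candidates = [j for j in candidates
                      if min_dist_to_box(centers[j], node["bounds"]) <= worst]
        if len(candidates) == 1:
            j = candidates[0]                  # PRUNE: grab the cached sum and count
            sums[j] = tuple(s + t for s, t in zip(sums[j], node["sum"]))
            counts[j] += node["count"]
            return
        if node["points"] is not None:         # leaf: fall back to per-point work
            for x in node["points"]:
                j = min(candidates, key=lambda c: dist(x, centers[c]))
                sums[j] = tuple(s + xi for s, xi in zip(sums[j], x))
                counts[j] += 1
            return
        kmeans_step_kdtree(node["left"], centers, candidates, sums, counts)
        kmeans_step_kdtree(node["right"], centers, candidates, sums, counts)

After one pass from the root with all centers as candidates, each new center
location is sums[j] / counts[j].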
43
Example
Example generated by Dan Pelleg and Andrew Moore,
"Accelerating Exact k-means Algorithms with
Geometric Reasoning," Proc. Conference on
Knowledge Discovery in Databases 1999 (KDD99)
(available on www.autonlab.org)
44
K-means continues
45
K-means continues
46
K-means continues
47
K-means continues
48
K-means continues
49
K-means continues
50
K-means continues
51
K-means continues
52
K-means terminates
53
Comparison to a linear algorithm
  • Astrophysics data (2-dimensions)

54
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology

55
What if we want to do density estimation with
multimodal or clumpy data?
56
The GMM assumption
  • There are k components. The ith component is
    called ωi
  • Component ωi has an associated mean vector μi

(figure: component means μ2, μ3)
57
The GMM assumption
  • There are k components. The ith component is
    called ωi
  • Component ωi has an associated mean vector μi
  • Each component generates data from a Gaussian
    with mean μi and covariance matrix σ²I
  • Assume that each datapoint is generated according
    to the following recipe

(figure: component means μ2, μ3)
58
The GMM assumption
  • There are k components. The ith component is
    called ωi
  • Component ωi has an associated mean vector μi
  • Each component generates data from a Gaussian
    with mean μi and covariance matrix σ²I
  • Assume that each datapoint is generated according
    to the following recipe
  • Pick a component at random. Choose component i
    with probability P(ωi).

(figure: component mean μ2)
59
The GMM assumption
  • There are k components. The ith component is
    called ωi
  • Component ωi has an associated mean vector μi
  • Each component generates data from a Gaussian
    with mean μi and covariance matrix σ²I
  • Assume that each datapoint is generated according
    to the following recipe
  • Pick a component at random. Choose component i
    with probability P(ωi).
  • Datapoint ~ N(μi, σ²I)

(figure: component mean μ2 and a sampled datapoint x)
60
The General GMM assumption
  • There are k components. The ith component is
    called ωi
  • Component ωi has an associated mean vector μi
  • Each component generates data from a Gaussian
    with mean μi and covariance matrix Σi
  • Assume that each datapoint is generated according
    to the following recipe
  • Pick a component at random. Choose component i
    with probability P(ωi).
  • Datapoint ~ N(μi, Σi)

(figure: component means μ2, μ3)
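
A small sketch of that generative recipe in Python with numpy (the function
and variable names are ours, not from the slides):

    import numpy as np

    def sample_gmm(n, weights, means, covs, rng=np.random.default_rng()):
        # weights: P(ωi) for each of the k components; means: μi; covs: Σi
        k = len(weights)
        comps = rng.choice(k, size=n, p=weights)   # pick component i with prob P(ωi)
        return np.array([rng.multivariate_normal(means[i], covs[i])  # x ~ N(μi, Σi)
                         for i in comps])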
61
Gaussian Mixture Example Start
Advance apologies: in black and white this
example will be incomprehensible.
62
After first iteration
63
After 2nd iteration
64
After 3rd iteration
65
After 4th iteration
66
After 5th iteration
67
After 6th iteration
68
After 20th iteration
69
Some Bio Assay data
70
GMM clustering of the assay data
71
Resulting Density Estimator
  • For a basic tutorial on Gaussian Mixture Models,
    see the Hastie et al. or Duda, Hart & Stork books, or
    www.cs.cmu.edu/~awm/tutorials/gmm.html

72
(No Transcript)
73
(No Transcript)
74
(No Transcript)
75
Uses for kd-trees and cousins
  • K-means clustering [Pelleg and Moore '99,
    Moore 2000]
  • Kernel Density Estimation [Deng & Moore 1995,
    Gray & Moore 2001]
  • Kernel-density-based clustering [Wong and Moore
    2002]
  • Gaussian Mixture Models [Moore 1999]
  • Kernel Regression [Deng and Moore 1995]
  • Locally weighted regression [Moore, Schneider,
    Deng '97]
  • Kernel-based Bayes Classifiers [Moore and
    Schneider '97]
  • N-point correlation functions [Gray and Moore
    2001]
  • Also work by Priebe, Ramikrishnan, Schaal,
    D'Souza, Elkan, …
  • Papers (and software): www.autonlab.org

76
Uses for kd-trees and cousins
  • K-means clustering [Pelleg and Moore '99,
    Moore 2000]
  • Kernel Density Estimation [Deng & Moore 1995,
    Gray & Moore 2001]
  • Kernel-density-based clustering [Wong and Moore
    2002]
  • Gaussian Mixture Models [Moore 1999]
  • Kernel Regression [Deng and Moore 1995]
  • Locally weighted regression [Moore, Schneider,
    Deng '97]
  • Kernel-based Bayes Classifiers [Moore and
    Schneider '97]
  • N-point correlation functions [Gray and Moore
    2001]
  • Also work by Priebe, Ramikrishnan, Schaal,
    D'Souza, Elkan, …
  • Papers (and software): www.autonlab.org

77
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology

78
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology
  • GMorph
  • Memory-based
  • PCA

79
(No Transcript)
80
Image space (4096 dim)
81
Image space (4096 dim)
galaxies
82
(No Transcript)
83
(No Transcript)
84
demo
85
(No Transcript)
86
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology
  • GMorph
  • Memory-based
  • PCA

87
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology
  • GMorph
  • Memory-based
  • PCA

88
Parameter space (12 dim)
Image space (4096 dim)
Model
synthetic galaxies
89
Target Image
Parameter space (12 dim)
Image space (4096 dim)
Model
90
Target Image
Parameter space (12 dim)
Image space (4096 dim)
Model
91
Target Image
Parameter space (12 dim)
Image space (4096 dim)
Model
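
As we read slides 88-91, the memory-based GMorph scheme synthesizes galaxy
images from many 12-dimensional parameter vectors and then matches the
4096-pixel target image against those synthetic images, returning the
parameters of the best match. A toy sketch under those assumptions (the
render model is hypothetical, standing in for whatever forward model maps
parameters to images):

    import numpy as np

    def gmorph_memory_based(target_image, param_samples, render):
        # target_image: flattened galaxy image (4096 dims)
        # param_samples: candidate 12-dim parameter vectors
        # render(params) -> synthetic 4096-dim image (hypothetical model)
        synthetic = np.array([render(p) for p in param_samples])
        errors = np.sum((synthetic - np.asarray(target_image)) ** 2, axis=1)
        return param_samples[np.argmin(errors)]       # best match in image space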
92
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology
  • GMorph
  • Memory-based
  • PCA

93
(No Transcript)
94
(No Transcript)
95
(No Transcript)
96
Target Image
Parameter space (12 dim)
Image space (4096 dim)
Model
Eigenspace (16 dim)
97
Target Image
Parameter space (12 dim)
Image space (4096 dim)
Model
Eigenspace (16 dim)
98
Target Image
Parameter space (12 dim)
Image space (4096 dim)
Model
Eigenspace (16 dim)
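
The eigenspace variant, as we read it, projects both the synthetic images and
the target onto the top 16 principal components (PCA) and matches in that
16-dimensional eigenspace instead of the full 4096-dimensional image space.
A sketch under the same assumptions as the memory-based sketch above:

    import numpy as np

    def gmorph_eigenspace(target_image, param_samples, render, n_components=16):
        synthetic = np.array([render(p) for p in param_samples])   # 4096-dim images
        mean = synthetic.mean(axis=0)
        # PCA via SVD of the centered synthetic images
        _, _, vt = np.linalg.svd(synthetic - mean, full_matrices=False)
        basis = vt[:n_components]                       # top 16 eigen-images
        coeffs = (synthetic - mean) @ basis.T           # project onto eigenspace
        target_c = (np.asarray(target_image) - mean) @ basis.T
        errors = np.sum((coeffs - target_c) ** 2, axis=1)
        return param_samples[np.argmin(errors)]         # best match in 16 dims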
99
snapshot
100
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology
  • GMorph
  • Memory-based
  • PCA
  • Results

101
(No Transcript)
102
(No Transcript)
103
(No Transcript)
104
Outline
autonlab.org
  • Kd-trees
  • Fast nearest-neighbor finding
  • Fast K-means clustering
  • Fast kernel density estimation
  • Large-scale galactic morphology
  • GMorph
  • Memory-based
  • PCA
  • Results

See AUTON website for papers