Title: Clustering-Based Similarity Search in Metric Spaces with Sparse Spatial Centers


1
Clustering-Based Similarity Search in Metric
Spaces with Sparse Spatial Centers
SOFSEM 2008 Theory and Practice of Computer
Science
Nový Smokovec, Slovakia, January 2008
  • Nieves Brisaboa, Oscar Pedreira, Diego Seco,
  • Roberto Solar, Roberto Uribe
  • Databases Lab, University of A Coruña, Spain
  • Dept. of Computer Engineering, Universidad de
    Magallanes, Chile

2
Outline
  • Introduction
  • Basic concepts
  • Cluster center selection
  • SSSTree
  • Experimental results
  • Conclusions and future work

3
Introduction
  • Traditional databases
  • Data has a well defined structure (e.g.,
    VARCHAR(15))
  • Exact searching, using equality/inequality
    comparisons
  • SELECT name FROM Student WHERE city = 'Poprad'
  • Non-structured databases
  • Exact searching is not possible!
  • Similarity search is a common operation in these
    domains
  • Some examples

4
Introduction
  • Similarity search: retrieve from the database the
    objects similar to one given as a query

[Figure: query object q and search radius r]
5
Outline
  • Introduction
  • Basic concepts
  • Cluster center selection
  • SSSTree
  • Experimental results
  • Conclusions and future work

6
Basic concepts
  • Distance functions (metrics)
  • The evaluation of d has a high computational
    cost.
  • The comparison of the query with all the objects
    in the database makes similarity search a costly
    operation
  • Indexes are built on the database to avoid the
    comparison of the query with all the objects in
    the database.

[Figure: d(latest, greatest) = 3 for the words "latest" and "greatest"; d(·, ·) = 1.5435 for a second pair of objects]
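The word example on this slide can be reproduced with a short sketch. It assumes the standard Levenshtein edit distance with unit costs, and the function names and the small word list are illustrative only. The linear scan shows the baseline that an index is meant to avoid: one evaluation of d per database object.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def linear_range_search(database, q, r, d):
    """Baseline without an index: evaluates d once for every object."""
    return [x for x in database if d(q, x) <= r]


# Hypothetical toy collection, not one of the datasets used in the talk.
words = ["latest", "greatest", "later", "grater", "metric"]
print(edit_distance("latest", "greatest"))                      # 3
print(linear_range_search(words, "latest", 2, edit_distance))
```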
7
Basic concepts
  • Metric space = universe of objects + distance
    function
  • E.g., collection of words + edit distance
  • The triangle inequality, base of indexing
    algorithms
  • $\forall x, y, z \in U:\ d(x, y) \le d(x, z) + d(z, y)$

[Figure: objects p, q, and xi]
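As a worked step, the triangle inequality can be turned into the lower bound that indexing algorithms typically rely on. The derivation below is a standard one, not taken from the slides. It writes x for a database object, p for a reference object (pivot or center), q for the query, and also uses the symmetry of d.

```latex
\begin{align*}
d(q, p) &\le d(q, x) + d(x, p) &&\Rightarrow\quad d(q, p) - d(x, p) \le d(q, x),\\
d(x, p) &\le d(x, q) + d(q, p) &&\Rightarrow\quad d(x, p) - d(q, p) \le d(q, x),\\
&&&\Rightarrow\quad \lvert d(q, p) - d(x, p) \rvert \le d(q, x).
\end{align*}
```

So if $\lvert d(q, p) - d(x, p) \rvert > r$ for a search radius $r$, the object x cannot be part of the result, and d(q, x) does not need to be evaluated.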
8
Basic concepts
  • Pivot-based algorithms



[Figure: table of distances between the objects x1 … xn and the pivots p1 … pk, and its use to answer a query q]
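A minimal sketch of how a pivot table of this kind is commonly used follows. It assumes a plain table of precomputed distances and the exclusion rule |d(q, p) - d(x, p)| > r. The function names and the toy points are illustrative and do not come from the talk.

```python
import math


def build_pivot_table(database, pivots, d):
    """Indexing: store d(x, p) for every object x and every pivot p."""
    return [[d(x, p) for p in pivots] for x in database]


def pivot_range_search(database, pivots, table, q, r, d):
    """Range search that skips objects excluded by the triangle inequality."""
    dq = [d(q, p) for p in pivots]       # one evaluation per pivot
    evaluated = len(pivots)
    result = []
    for x, row in zip(database, table):
        # |d(q, p) - d(x, p)| is a lower bound on d(q, x)
        if any(abs(dqp - dxp) > r for dqp, dxp in zip(dq, row)):
            continue                     # x cannot be within distance r of q
        evaluated += 1
        if d(q, x) <= r:
            result.append(x)
    return result, evaluated


# Hypothetical toy data: points in the plane with the Euclidean distance.
database = [(0, 0), (1, 1), (5, 5), (9, 9), (10, 0), (0, 10)]
pivots = [(0, 0), (10, 10)]
table = build_pivot_table(database, pivots, math.dist)
result, evaluated = pivot_range_search(database, pivots, table, (1, 0), 1.5, math.dist)
print(result, evaluated)
```

In this toy run only two of the six objects need a direct comparison with the query; the other four are discarded using the stored distances.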
9
Basic concepts
  • Clustering-based algorithms



[Figure panels: Indexing; Query; Index processing for a query q]
10
Basic concepts
  • Clustering-based algorithms
    • Bisector Tree
    • Generalized-hyperplane Tree
    • Geometric Near-neighbor Access Tree
    • Voronoi Tree
    • M-Tree
    • Spatial Approximation Tree
    • SSSTree
  • Pivot-based algorithms
    • Burkhard-Keller Tree
    • Fixed-Queries Tree
    • Fixed-Queries Array
    • Vantage Point Tree
    • Multi-Vantage Point Tree
    • AESA
    • Linear AESA

11
Basic concepts
Example of a clustering-based algorithm: the Bisector Tree (BST)
[Figure: BST partition of a metric space]
12
Basic concepts
  • GNAT is built and used in a similar way
  • In this case each node has m children
  • And each node stores the distances among cluster
    centers, which are used during the search

13
Motivation
  • What are the problems of these algorithms?
  • The number of clusters in each node has to be
    stated beforehand
  • Cluster centers are selected at random
  • The resulting partition of the space is not well
    adapted to the topology of the dataset

14
Motivation
  • SOLUTION?
  • In each node, create a partition adapted to the
    topology of that subspace, both in the number of
    clusters and in the position of the cluster
    centers
  • I.e., do NOT partition the space at random
  • Select the appropriate number of cluster centers
  • And place those centers in strategic positions

15
Outline
  • Introduction
  • Basic concepts
  • Cluster center selection
  • SSSTree
  • Experimental results
  • Conclusions and future work

16
Cluster center selection
  • We want each cluster to be created according to
    the topology and complexity of the space we are
    partitioning
  • We need a selection strategy which
  • Determines by itself the proper number of
    clusters
  • Selects good cluster centers
  • We use Sparse Spatial Selection (SSS) [Pedreira and
    Brisaboa, 07]
  • Initially proposed for pivot-based algorithms
  • It is dynamic and adapts the selected points to the
    characteristics of the space

17
Sparse Spatial Selection
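  • An object is selected as a center if its
    distance to every center already selected is
    greater than $M\alpha$
  • Thus, we select only the centers necessary to
    cover the space
  • $M$: maximum distance between any two objects
  • $0 < \alpha < 1$

[Figure: example with $\alpha = 0.5$ and maximum distance $M$]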
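The selection rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: objects are processed in their given order, the first object is always taken as a center, and the name `sss_select` and the toy points are hypothetical.

```python
import math


def sss_select(objects, d, max_dist, alpha=0.5):
    """Select centers that are more than max_dist * alpha away from each other."""
    centers = []
    for x in objects:
        # all() over an empty list is True, so the first object is always selected.
        if all(d(x, c) > max_dist * alpha for c in centers):
            centers.append(x)
    return centers


# Hypothetical toy data: points in the plane with the Euclidean distance.
points = [(0, 0), (1, 0), (6, 0), (10, 0), (5, 5)]
M = 10.0  # maximum distance between any two of the points above
centers = sss_select(points, math.dist, M, alpha=0.5)
print(centers)  # [(0, 0), (6, 0), (5, 5)]
```

The number of centers is not a parameter: it follows from the threshold and from how the objects are spread, which is the adaptive behaviour the slide describes.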
18
Outline
  • Introduction
  • Basic concepts
  • Cluster center selection
  • SSSTree
  • Experimental results
  • Conclusions and future work

19
SSSTree
  • Clustering-based indexing algorithm
  • Main differences with previous approaches
  • The number of branches of each node is not fixed
  • Each cluster associated to a node is partitioned
    into the necessary number of clusters, not into a
    fixed number
  • Cluster centers are strategically selected

20
SSSTree - Indexing
  • SSS is applied in each node to select the cluster
    centers
  • The index (a tree) is built recursively until
    the clusters are small enough to be stored as
    leaves

21
SSSTree - Indexing
  • PROBLEM
  • SSS needs the maximum distance of the space (M)
    to select the cluster centers, but this distance
    is different in each cluster
  • Instead of computing it by brute force, we can use
    a good estimate: two times the covering radius
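A possible shape of the recursive construction is sketched below. It is not the authors' implementation. The `Node` layout, the `leaf_size` parameter, the nearest-center assignment, and the guard that stops when only one center is selected are all assumptions made to keep the sketch short and terminating.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    center: object                       # cluster center (None for the root)
    radius: float                        # covering radius of the cluster
    children: list = field(default_factory=list)
    bucket: list = field(default_factory=list)   # objects stored in a leaf


def build(objects, d, max_dist, alpha=0.5, leaf_size=2, center=None, radius=0.0):
    node = Node(center, radius)
    if len(objects) <= leaf_size or max_dist == 0:
        node.bucket = list(objects)      # small enough: store as a leaf
        return node
    # SSS: keep objects that are more than max_dist * alpha from all centers so far
    centers = []
    for x in objects:
        if all(d(x, c) > max_dist * alpha for c in centers):
            centers.append(x)
    if len(centers) == 1:                # assumption: no useful split, stop here
        node.bucket = list(objects)
        return node
    # Assign every object to its nearest center
    clusters = [[] for _ in centers]
    for x in objects:
        nearest = min(range(len(centers)), key=lambda j: d(x, centers[j]))
        clusters[nearest].append(x)
    for c, members in zip(centers, clusters):
        rc = max(d(c, x) for x in members)           # covering radius
        # Estimate the maximum distance inside the cluster as 2 * covering radius
        node.children.append(build(members, d, 2 * rc, alpha, leaf_size, c, rc))
    return node


def leaf_objects(node):
    """Collect the objects stored in the leaves below node."""
    if not node.children:
        return list(node.bucket)
    return [x for child in node.children for x in leaf_objects(child)]


# Hypothetical toy data; M is computed by brute force only for the root.
points = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1), (5, 8)]
M = max(math.dist(a, b) for a in points for b in points)
root = build(points, math.dist, M, alpha=0.4)
print(len(root.children), [child.radius for child in root.children])
```

Since any two objects of a cluster are at most two covering radii apart, the estimate never underestimates the true maximum distance of the cluster.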

22
SSSTree - Search
  • As with previous algorithms, we traverse the tree
    from root to leaves, discarding clusters that
    cannot contain any object of the result
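The traversal can be illustrated with a small range search. The sketch assumes the usual covering-radius criterion, where a cluster is discarded when d(q, center) - radius > r. The slides do not spell the criterion out, so this rule, the `Node` layout, and the hand-built tree over numbers are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    center: float
    radius: float                        # covering radius: max d(center, x) in the cluster
    children: list = field(default_factory=list)
    bucket: list = field(default_factory=list)   # objects stored in a leaf


def range_search(node, q, r, d):
    """Return every stored object x below node with d(q, x) <= r."""
    if not node.children:                # leaf: compare q with the stored objects
        return [x for x in node.bucket if d(q, x) <= r]
    result = []
    for child in node.children:
        # Every x in the cluster satisfies d(q, x) >= d(q, center) - radius
        if d(q, child.center) - child.radius > r:
            continue                     # the query ball cannot reach this cluster
        result.extend(range_search(child, q, r, d))
    return result


def dist(a, b):
    return abs(a - b)


# Hypothetical hand-built tree over numbers on a line.
tree = Node(center=0.0, radius=0.0, children=[
    Node(center=2, radius=2, bucket=[0, 2, 3, 4]),
    Node(center=10, radius=1, bucket=[9, 10, 11]),
    Node(center=20, radius=3, children=[
        Node(center=18, radius=1, bucket=[17, 18, 19]),
        Node(center=22, radius=1, bucket=[21, 22, 23]),
    ]),
])
print(range_search(tree, 16.5, 1, dist))  # [17]
```

For the query 16.5 with radius 1, the first two clusters and the last leaf are discarded after a single distance evaluation each.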

23
Outline
  • Introduction
  • Basic concepts
  • Cluster center selection
  • SSSTree
  • Experimental results
  • Conclusions and future work

24
Experimental results
  • Test datasets
  • Synthetic collections of vectors
  • Collections of 1,000,000 vectors of dimension
    8, 10, 12, and 14, using the Euclidean distance.
    Gaussian distribution.
  • Collections of words (edit distance).
  • 69,069 words from the English dictionary.
  • 51,589 words from the Spanish dictionary.
  • Collection of images
  • 40,050 images from the NASA video and image
    archive

25
Experimental results
  • Search efficiency in collections of vectors

26
Experimental results
  • Search efficiency in collections of words

27
Experimental results
  • Search efficiency in collections of images

28
Outline
  • Introduction
  • Basic concepts
  • Cluster center selection
  • SSSTree
  • Experimental results
  • Conclusions and future work

29
Conclusions
  • We present a new clustering-based indexing
    algorithm
  • Adaptive
  • Previous approaches need to state beforehand a
    fixed number of clusters in each node of the tree
  • And the cluster centers were selected at random
  • SSSTree determines by itself the optimal number of
    clusters and places them in strategic positions
  • Efficient
  • Experimental results show that SSSTree is more
    efficient than previous proposals in most of the
    situations

30
And
  • Thanks for your attention
