Pivoting M-tree: A Metric Access Method for Efficient Similarity Search


1
Pivoting M-tree: A Metric Access Method for
Efficient Similarity Search
  • Tomáš Skopal, tomas.skopal_at_vsb.cz, Department of
    Computer Science, VŠB-Technical University of
    Ostrava

2
Presentation Outline
  • Similarity search in Metric Spaces
  • M-tree
  • PM-tree
      • structure
      • range queries
      • hyper-ring storage
  • Experimental Results

3
Similarity search in Metric Spaces
  • Similarity search methods for content-based
    retrieval in multimedia databases (and in
    information retrieval)
  • Similarity is modelled by a metric d
  • Restricting similarity to a metric conflicts
    with several similarity theories; nevertheless,
    the triangle inequality is the basic tool for
    constructing metric regions, which leads to an
    efficient similarity search
  • Metric queries
      • range query (specified by pivot object Q and
        covering radius rQ)
      • k-NN query (specified by pivot object Q and
        number of nearest neighbours k)
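
The two query types can be illustrated with a plain sequential scan, which is the baseline that any metric access method tries to beat. This is only a sketch: the helper names (l2, range_query, knn_query) and the sample points are illustrative assumptions and do not come from the slides.

```python
import heapq
import math

def l2(a, b):
    """Euclidean (L2) distance between two equally long tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(dataset, d, q, r_q):
    """Return all objects whose distance to q is at most r_q."""
    return [o for o in dataset if d(q, o) <= r_q]

def knn_query(dataset, d, q, k):
    """Return the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, dataset, key=lambda o: d(q, o))

# Hypothetical sample data, chosen only for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
in_range = range_query(points, l2, (0.0, 0.0), 2.0)
nearest = knn_query(points, l2, (0.0, 0.0), 2)
```

Both functions compute one distance per stored object; a metric index aims to answer the same queries with far fewer distance computations.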

4
Metric Access Methods
  • Designed to search metric datasets while keeping
    the search costs (the number of distance
    computations) minimal.
  • When searching large multimedia databases, the
    I/O search costs also have to be minimized.
  • Many MAMs have been developed so far: M-tree,
    GH-tree, GNAT, LAESA, D-index, VP-tree, MVP-tree,
    SAT, ...
  • The majority of MAMs are not suitable for
    similarity search in large datasets (they are
    either static methods or have high I/O search
    costs)
  • only M-tree and (recently) D-index are suitable
    candidates

5
M-tree
  • dynamic, balanced, and paged metric tree (like
    the B-tree or R-tree)
  • the leaves are clusters of objects
  • routing entries in the inner nodes represent
    metric regions, recursively bounding the object
    clusters in the leaves
  • during query evaluation, the triangle
    inequality allows irrelevant M-tree branches
    (i.e. metric regions) to be discarded
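
The discarding step can be sketched as a single overlap test between the query ball and the covering ball of a routing entry. This is a minimal sketch under assumptions: the function and parameter names (ball_may_intersect, center, r_cov) are illustrative, and a real M-tree also uses stored parent distances to avoid some distance computations, which is omitted here.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equally long tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ball_may_intersect(d, q, r_q, center, r_cov):
    """Return False when the subtree under (center, r_cov) can be discarded.

    For any object O inside the region, the triangle inequality gives
    d(q, O) >= d(q, center) - r_cov, so if d(q, center) > r_q + r_cov
    no object of the subtree can lie within r_q of q.
    """
    return d(q, center) <= r_q + r_cov

# Hypothetical region centred at (10, 0) with covering radius 2.
far_query = ball_may_intersect(l2, (0.0, 0.0), 3.0, (10.0, 0.0), 2.0)
near_query = ball_may_intersect(l2, (0.0, 0.0), 8.0, (10.0, 0.0), 2.0)
```

Here far_query is False (the branch is pruned) and near_query is True (the branch must be visited).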

6
PM-tree, motivation
  • metric regions in M-tree are unnecessarily large
  • → indexing of large portions of empty space (the
    dead space)
  • → higher probability of intersection with the
    query region
  • → less efficient search
  • reducing the volume of metric regions should
    lead to more effective discarding of irrelevant
    subtrees
  • the remedy is to specify a metric region that
    bounds all the objects more tightly

7
PM-tree, structure
  • Pivoting M-tree (PM-tree) is a combination of
    M-tree with the pivot-based methods (LAESA-like)
  • given a fixed set of p pivots Pi (selected from
    the dataset), a PM-tree region is additionally
    defined by p hyper-ring regions (Pi, HRi)
  • each routing entry contains an array HR of p
    intervals <HRi.min, HRi.max>
  • each interval HRi bounds the distances of
    objects to the respective pivot Pi
  • intersection of the hyper-sphere and the
    hyper-rings forms a smaller region bounding all
    the objects
  • the more pivots, the more tightly bounded the
    region
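
As an illustration of the HR array, the sketch below computes one interval per pivot for a given set of objects. It is a simplified, assumed construction: the names (hyper_rings, objects, pivots) are illustrative, and it recomputes the intervals from scratch, whereas a dynamic index would maintain them incrementally during insertions.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equally long tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hyper_rings(objects, pivots, d):
    """Return the HR array: one (min, max) distance interval per pivot."""
    rings = []
    for p in pivots:
        dists = [d(o, p) for o in objects]
        rings.append((min(dists), max(dists)))
    return rings

# Hypothetical objects of one subtree and two hypothetical pivots.
objects = [(1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
pivots = [(0.0, 0.0), (4.0, 0.0)]
rings = hyper_rings(objects, pivots, l2)
```

Every object of the subtree lies inside each ring, so the region the objects can occupy is the intersection of all rings with the covering hyper-sphere.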

8
PM-tree, query processing
  • prior to processing a query (Q, rQ), the
    distances d(Q, Pi) for all i ≤ p must be computed
  • a metric region is relevant to a range query
    only if all the hyper-rings and the hyper-sphere
    intersect the range query region
  • → the more hyper-rings, the lower the
    probability of intersection with the query
  • → no additional distance computations are
    needed for the intersection test
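
The relevance test can be sketched as follows. It assumes the distances d(Q, Pi) were computed once before the search and that the distance from Q to the routing object is already known, as in the plain M-tree test; the function and parameter names are illustrative and not taken from the slides.

```python
def pm_region_may_intersect(d_q_center, r_cov, q_pivot_dists, rings, r_q):
    """Return False when a PM-tree region can be discarded.

    d_q_center    -- distance from the query object to the routing object
    r_cov         -- covering radius of the hyper-sphere
    q_pivot_dists -- precomputed distances d(Q, Pi), one per pivot
    rings         -- HR array of (min, max) intervals, one per pivot
    r_q           -- query radius
    """
    # Hyper-sphere test, the same as in the plain M-tree.
    if d_q_center > r_q + r_cov:
        return False
    # Hyper-ring tests: the query interval around d(Q, Pi) must overlap
    # the stored interval. Only comparisons are needed here.
    for d_qp, (hr_min, hr_max) in zip(q_pivot_dists, rings):
        if d_qp - r_q > hr_max or d_qp + r_q < hr_min:
            return False
    return True

# Hypothetical values: the sphere test passes, but one ring filters the region.
pruned = pm_region_may_intersect(4.0, 3.0, [6.0], [(1.0, 3.0)], 2.0)
kept = pm_region_may_intersect(4.0, 3.0, [4.5], [(1.0, 3.0)], 2.0)
```

In the first call the hyper-sphere alone would not discard the region, but the ring does, which is the extra filtering the pivots provide.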

[Figure: M-tree region compared with PM-tree region]
9
PM-tree, hyper-ring storage
  • The routing entries of PM-tree nodes are enlarged
    by the additional pivot-based information stored
    in HR arrays
  • To keep the space overhead minimal, a compact
    storage of HRi intervals is necessary
  • A distance histogram for each pivot Pi is
    created, and an interval <di_min, di_max> is
    chosen such that e.g. 90% of the distances in the
    distance histogram fall into that interval
  • Each value HRi.min, HRi.max is scaled to the
    <di_min, di_max> interval using a single byte,
    i.e. each hyper-ring HRi takes 2 bytes
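
A possible single-byte encoding is sketched below. Several details are assumptions, because the slides do not specify them: the function names, the 256-level linear scale, rounding the lower bound down and the upper bound up so that a decoded ring never becomes narrower than the original, and clamping values that fall outside <di_min, di_max>.

```python
import math

def encode_ring(hr_min, hr_max, d_min, d_max):
    """Pack one hyper-ring interval into 2 bytes on a 0..255 scale."""
    scale = 255.0 / (d_max - d_min)
    # Assumed rounding: floor the lower bound, ceil the upper bound,
    # so the decoded interval still contains the original one.
    lo = math.floor((hr_min - d_min) * scale)
    hi = math.ceil((hr_max - d_min) * scale)
    # Assumed handling of outliers: clamp to the byte range. For values
    # outside [d_min, d_max] the decoded bound is no longer conservative.
    lo = max(0, min(255, lo))
    hi = max(0, min(255, hi))
    return bytes([lo, hi])

def decode_ring(code, d_min, d_max):
    """Recover an approximate (min, max) interval from the 2-byte code."""
    step = (d_max - d_min) / 255.0
    return d_min + code[0] * step, d_min + code[1] * step

# Hypothetical interval and histogram bounds.
code = encode_ring(10.4, 20.6, 0.0, 255.0)
decoded = decode_ring(code, 0.0, 255.0)
```

With these hypothetical bounds the interval (10.4, 20.6) is stored as the two byte values 10 and 21 and decodes to (10.0, 21.0), a slightly wider ring that still bounds the original.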

10
Experimental results (synthetic)
  • synthetic dataset of 100,000 30-dimensional
    tuples distributed within 1000 clusters, L2
    distance, query selectivity of 50 objects

11
Experimental results (images)
  • collection of 10,000 images represented by
    256-dimensional vectors (gray histograms), L2
    distance, query selectivity of 50 objects

12
Recent results (not included in proceedings)
  • Cost models for range queries in PM-tree
    (→ ADBIS04)
  • Experiments on image dataset (→ ADBIS04)
  • Optimal k-NN query algorithm for PM-tree cost
    models (to be published...)

13
Reference
  • [1] Skopal T., Pokorný J., Snášel V.: PM-tree:
    Pivoting Metric Tree for Similarity Search in
    Multimedia Databases, submitted to ADBIS 2004,
    Budapest, Hungary
  • [2] Skopal T.: Pivoting M-tree: A Metric Access
    Method for Efficient Similarity Search, DATESO
    2004, Desná
  • [3] Skopal T., Pokorný J., Krátký M., Snášel V.:
    Revisiting M-tree Building Principles, ADBIS
    2003, LNCS 2798, Springer-Verlag, Dresden,
    Germany