1
Christian Böhm, Hans-Peter Kriegel, Ludwig-Maximilians-Universität München
A Cost Model and Index Architecture for the Similarity Join
2
Feature Based Similarity
3
Simple Similarity Queries
  • Specify a query object and
  • Find similar objects: range query
  • Find the k most similar objects: nearest
    neighbor query
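
The two query types above can be sketched as linear scans over a list of feature vectors. This is a minimal illustration, assuming Euclidean distance and an in-memory list; the names `range_query` and `knn_query` are hypothetical and do not come from the slides.

```python
import heapq
import math

def range_query(db, query, eps):
    """Return all feature vectors within distance eps of the query object."""
    return [p for p in db if math.dist(p, query) <= eps]

def knn_query(db, query, k):
    """Return the k feature vectors closest to the query object."""
    return heapq.nsmallest(k, db, key=lambda p: math.dist(p, query))

# Illustrative data only
db = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
print(range_query(db, (0.0, 0.0), 1.0))
print(knn_query(db, (0.0, 0.0), 3))
```

An index structure would replace the linear scan, but the result sets stay the same.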

4
Join Applications: Catalogue Matching
  • Catalogue matching
  • E.g. astronomical catalogues

5
Join Applications: Clustering
  • Clustering (e.g. DBSCAN)
  • Similarity self-join
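
A similarity self-join reports every pair of points of one set that lie within a distance ε of each other, which is the neighborhood information a density-based method such as DBSCAN needs. The following is a minimal nested-loop sketch, assuming Euclidean distance; the function name `similarity_self_join` is hypothetical.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Return index pairs (i, j) with i < j whose points are at most eps apart."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    ]

# Illustrative data only: two tight groups far from each other
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 5.5)]
print(similarity_self_join(points, 1.0))
```

This baseline compares all pairs; the index-based join discussed in the following slides aims to avoid most of these comparisons.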

6
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r); CacheLoad(s);
          r_tree_sim_join (r,s,ε);
  else (* assume R,S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if ||p - q|| ≤ ε then report (p,q);
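
The recursion on this slide can be sketched in Python as follows. This is an in-memory illustration under several assumptions: both trees have the same height, Euclidean distance is used, and the `CacheLoad` calls are omitted because no disk is involved. The `Node` class and the `leaf` helper are hypothetical and only serve the example.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                      # lower corner of the bounding box
    hi: tuple                      # upper corner of the bounding box
    children: list = field(default_factory=list)  # non-empty for directory pages
    points: list = field(default_factory=list)    # non-empty for data pages

    @property
    def is_dir(self):
        return bool(self.children)

def mindist(r, s):
    """Smallest Euclidean distance between the bounding boxes of r and s."""
    gaps = (max(rl - sh, sl - rh, 0.0)
            for rl, rh, sl, sh in zip(r.lo, r.hi, s.lo, s.hi))
    return math.sqrt(sum(g * g for g in gaps))

def r_tree_sim_join(R, S, eps, report):
    if R.is_dir and S.is_dir:
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:   # only mated page pairs are followed
                    r_tree_sim_join(r, s, eps, report)
    else:  # assume R and S are both data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    report(p, q)

def leaf(points):
    """Build a data page whose box tightly encloses the given points."""
    dims = range(len(points[0]))
    lo = tuple(min(p[i] for p in points) for i in dims)
    hi = tuple(max(p[i] for p in points) for i in dims)
    return Node(lo, hi, points=list(points))

# Illustrative trees with one directory level each
R = Node((0.0, 0.0), (10.0, 10.0),
         children=[leaf([(0.0, 0.0), (1.0, 1.0)]), leaf([(9.0, 9.0), (10.0, 10.0)])])
S = Node((0.0, 0.0), (10.0, 10.0),
         children=[leaf([(1.0, 2.0)]), leaf([(5.0, 5.0)])])
result = []
r_tree_sim_join(R, S, 1.0, lambda p, q: result.append((p, q)))
print(result)
```

Only one of the four leaf pairs passes the `mindist` test here, so three of the four page pairs are never compared point by point.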
7
Cost Modeling
  • Single similarity queries: the access probability of
    pages is modeled using the concept of the Minkowski sum
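
As an illustration of the Minkowski sum idea: a page is accessed by a range query of radius r exactly when the query point falls into the page region enlarged by r. Assuming uniformly distributed query points in a unit data space, Euclidean distance, and no boundary effects, the access probability is the volume of that enlarged region. The sketch below uses the exact Steiner formula for a box and a ball, which is not necessarily the approximation used on the slide; the function names are hypothetical.

```python
import math
from itertools import combinations

def ball_volume(k, r):
    """Volume of a k-dimensional ball of radius r (equals 1 for k = 0)."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * r ** k

def minkowski_volume(sides, r):
    """Volume of the Minkowski sum of a box with the given side lengths
    and a ball of radius r (Steiner formula for a box)."""
    d = len(sides)
    total = 0.0
    for k in range(d + 1):
        # Sum over all products of k side lengths (1 for k = 0)
        e_k = sum(math.prod(c) for c in combinations(sides, k))
        total += e_k * ball_volume(d - k, r)
    return total

def access_probability(sides, r):
    """Assumed model: uniform queries in the unit cube, boundary effects ignored."""
    return min(1.0, minkowski_volume(sides, r))

# 2-D check: area = a*b + 2*r*(a + b) + pi*r^2
print(access_probability((0.2, 0.3), 0.1))
```

In two dimensions the formula reduces to the rectangle area, plus four side strips of width r, plus four quarter circles at the corners.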

8
Cost Modeling
  • Binomial formula

9
Cost Modeling
  • Mating probability of index pages
  • Probability that the distance between two pages is ≤ ε
  • Two-fold application of Minkowski sum

10
Page Capacity Optimization
  • Cost model can determine index selectivity which
    depends on various parameters
  • Page capacity (number of stored points) is an
    important parameter
  • Known from similarity search: page capacity
    optimization yields considerable improvement

11
Analysis of the Index Overhead
  • Assuming 100% selectivity (index doesn't work): how
    much more expensive is index usage?
  • CPU
  • Distance between boxes is more expensive to compute
    than distance between points (factor α: 5)
  • Smaller capacity → more box distance computations
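
The CPU argument can be made concrete with a small back-of-the-envelope calculation. This is a simplified model of my own, not the formula from the slides: it assumes two sets of n points each, counts only box distances at the data-page level (directory levels are ignored), and takes the factor α = 5 from the bullet above. At 100% selectivity every pair of data pages is mated, so all point pairs are compared anyway and the box distance computations are pure overhead.

```python
import math

def nested_loop_cpu_cost(n):
    """Point distance computations of a plain nested loop join."""
    return n * n

def index_cpu_cost(n, capacity, alpha=5.0):
    """Cost in units of point distance computations at 100% selectivity:
    all point pairs plus one box distance per pair of data pages."""
    pages = math.ceil(n / capacity)
    return n * n + alpha * pages * pages

n = 1000
for capacity in (1, 10, 100):
    ratio = index_cpu_cost(n, capacity) / nested_loop_cpu_cost(n)
    print(capacity, ratio)
```

Under these assumptions the relative overhead is α / capacity², so it grows quickly as the page capacity shrinks.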

12
Analysis of the Index Overhead
  • Disk I/O
  • High constant cost per page access (moving the disk
    head)
  • A page access is more expensive than continuous
    reading of a point by a factor β: 10000 / d
  • Smaller capacity → more disk head movement

13
Analysis of the Index Overhead
  • What selectivity is needed for the index to pay off?

14
Optimization
  • I/O cost function is optimized by
  • CPU cost function is optimized by

15
Optimization
  • I/O cost
  • Large capacity optimum (several 10,000 points,
    typically)
  • CPU cost
  • Small capacity optimum (< 100 points, typically)
  • No compromise achievable

16
Multipage Index (MuX)
Separate optimization
  • CPU performance like a CPU-optimized index
  • I/O performance like an I/O-optimized index

17
Experimental Evaluation
Uniform 4D
Uniform 8D
18
Experimental Evaluation
CAD Data 16D
Color Images 64D
19
Conclusions
  • Summary
  • High potential for performance gains of the
    similarity join by page capacity optimization
  • Necessary to separately optimize I/O and CPU
  • Future research potential
  • Similarity join for metric index structures
  • Approximate similarity join
  • Parallel similarity join algorithms

20
Consequences
  • Assume a selectivity of 100% for I/O optimization
  • Page accesses in a nested-block-loop-like style

fill cache with pages of R (1 page free)
foreach S-page s do
  if s joins some of the cached R-pg then
    load (s)
    foreach joining R-page r in cache do
      if mindist(r,s) < ε then join (r,s)
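
The access pattern above can be sketched in Python as follows. This is a minimal in-memory illustration, assuming pages are plain lists of points, the bounding box of an S-page is known without loading it, and Euclidean distance is used. It applies ≤ in both tests so that pairs at exactly distance ε are kept. The names `bbox`, `mindist` and `block_loop_join` are hypothetical.

```python
import math

def bbox(page):
    """Tight bounding box (lower corner, upper corner) of a page."""
    dims = range(len(page[0]))
    return (tuple(min(p[i] for p in page) for i in dims),
            tuple(max(p[i] for p in page) for i in dims))

def mindist(a, b):
    """Smallest Euclidean distance between two bounding boxes."""
    (alo, ahi), (blo, bhi) = a, b
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(alo, ahi, blo, bhi))
    return math.sqrt(sum(g * g for g in gaps))

def block_loop_join(r_pages, s_pages, eps, cache_pages):
    """Return (result pairs, number of S-page loads)."""
    if cache_pages < 2:
        raise ValueError("need at least one R-page slot and one free slot")
    result, s_loads = [], 0
    block = cache_pages - 1  # one cache page stays free for the current S-page
    for start in range(0, len(r_pages), block):
        cached = r_pages[start:start + block]      # fill cache with pages of R
        cached_boxes = [bbox(r) for r in cached]
        for s in s_pages:
            s_box = bbox(s)  # assumed available from the directory
            joining = [r for r, rb in zip(cached, cached_boxes)
                       if mindist(rb, s_box) <= eps]
            if not joining:
                continue     # s joins none of the cached R-pages: skip the load
            s_loads += 1     # load(s)
            for r in joining:
                result.extend((p, q) for p in r for q in s
                              if math.dist(p, q) <= eps)
    return result, s_loads

# Illustrative pages only
r_pages = [[(0.0, 0.0), (1.0, 1.0)], [(9.0, 9.0)], [(20.0, 20.0)]]
s_pages = [[(1.0, 2.0)], [(9.0, 9.5)], [(50.0, 50.0)]]
print(block_loop_join(r_pages, s_pages, 1.0, cache_pages=3))
```

With a cache of three pages, the R-pages are processed in blocks of two, and an S-page is loaded only when it joins at least one cached R-page.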