Title: Christian B
1A Cost Model and Index Architecture for the Similarity Join
Christian Böhm, Hans-Peter Kriegel
Ludwig-Maximilians-Universität München
2Feature Based Similarity
3Simple Similarity Queries
- Specify a query object, then either
- find all similar objects (range query), or
- find the k most similar objects (nearest neighbor query)
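The two query types above can be illustrated with a minimal sketch. It assumes objects are already mapped to feature vectors, uses the Euclidean distance, and scans all points linearly without any index; the function names `range_query` and `knn_query` are illustrative only.

```python
import heapq
import math

def dist(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(points, query, eps):
    """Return all points within distance eps of the query object."""
    return [p for p in points if dist(p, query) <= eps]

def knn_query(points, query, k):
    """Return the k points closest to the query object, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: dist(p, query))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
query = (0.0, 0.0)
in_range = range_query(points, query, 1.5)   # points within radius 1.5
nearest = knn_query(points, query, 3)        # the three closest points
```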
4Join Applications Catalogue Matching
- Catalogue matching
- E.g. astronomical catalogues
5Join Applications Clustering
- Clustering (e.g. DBSCAN)
- Similarity self-join
6R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r)
          CacheLoad(s)
          r_tree_sim_join (r,s,ε)
  else ( assume R,S both DataPg )
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if ||p - q|| ≤ ε then report (p,q)
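A runnable sketch of the recursive join on this slide might look as follows. It assumes both trees have the same height (as the pseudocode does), leaves out the cache handling (`CacheLoad`), and uses a hypothetical `Node` class plus the helpers `data_page` and `directory` to build a tiny two-level tree by hand; none of these names come from the slides.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                      # lower corner of the bounding box
    hi: tuple                                      # upper corner of the bounding box
    children: list = field(default_factory=list)   # non-empty for directory pages
    points: list = field(default_factory=list)     # non-empty for data pages

def mindist(a, b):
    """Smallest Euclidean distance between two axis-parallel boxes."""
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g * g for g in gaps))

def r_tree_sim_join(R, S, eps, out):
    """Append all point pairs (p, q) with distance <= eps to out."""
    if R.children and S.children:                  # both are directory pages
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:           # prune pairs that cannot mate
                    r_tree_sim_join(r, s, eps, out)
    else:                                          # assume both are data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))

def data_page(points):
    """Build a data page whose box is the bounding box of its points."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return Node(lo, hi, points=points)

def directory(pages):
    """Build a directory page whose box covers all child boxes."""
    lo = tuple(min(c) for c in zip(*(p.lo for p in pages)))
    hi = tuple(max(c) for c in zip(*(p.hi for p in pages)))
    return Node(lo, hi, children=pages)

R = directory([data_page([(0.0, 0.0), (0.1, 0.1)]), data_page([(5.0, 5.0)])])
S = directory([data_page([(0.2, 0.1)]), data_page([(9.0, 9.0)])])
pairs = []
r_tree_sim_join(R, S, 0.15, pairs)
```

Only one of the four page pairs passes the `mindist` test here, so the point-level comparison runs for a single pair of data pages.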
7Cost Modeling
- Single similarity queries: the access probability of pages is
modeled using the concept of the Minkowski sum
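As a rough illustration of the Minkowski sum idea, the sketch below computes the volume of a cube-shaped page region enlarged by a query sphere of radius eps. It assumes uniformly distributed data in a unit data space, a Euclidean range query, and no boundary effects; under these assumptions the volume, capped at 1, serves as the access probability. The function names are illustrative, not taken from the slides.

```python
import math

def sphere_volume(k, radius):
    """Volume of a k-dimensional ball with the given radius."""
    return math.pi ** (k / 2) * radius ** k / math.gamma(k / 2 + 1)

def access_probability(side, eps, d):
    """Volume of the Minkowski sum of a d-dim. cube and an eps-ball, capped at 1."""
    volume = sum(math.comb(d, k) * side ** (d - k) * sphere_volume(k, eps)
                 for k in range(d + 1))
    return min(1.0, volume)

# 2D check: square of side 0.1 enlarged by radius 0.05
# area = side^2 + 4 * side * eps + pi * eps^2
p2 = access_probability(side=0.1, eps=0.05, d=2)
```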
8Cost Modeling
9Cost Modeling
- Mating probability of index pages
- Probability that the distance between two pages is ≤ ε
- Two-fold application of the Minkowski sum
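One way to read the two-fold application is sketched here, under simplifying assumptions that are not from the slides: both page regions are cubes, data is uniform in a unit data space, the metric is Euclidean, and boundary effects are ignored. The first Minkowski sum enlarges one cube by the extent of the other (side lengths add up); the second enlarges the result by a sphere of radius eps.

```python
import math

def sphere_volume(k, radius):
    """Volume of a k-dimensional ball with the given radius."""
    return math.pi ** (k / 2) * radius ** k / math.gamma(k / 2 + 1)

def mating_probability(side_r, side_s, eps, d):
    """Estimated probability that two cube-shaped pages have mindist <= eps."""
    side = side_r + side_s                 # first Minkowski sum: cube plus cube
    volume = sum(math.comb(d, k) * side ** (d - k) * sphere_volume(k, eps)
                 for k in range(d + 1))    # second Minkowski sum: plus eps-ball
    return min(1.0, volume)

# 1D check: intervals of length 0.1 and 0.2 mate with probability 0.1 + 0.2 + 2 * eps
p1 = mating_probability(0.1, 0.2, eps=0.05, d=1)
```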
10Page Capacity Optimization
- The cost model can determine the index selectivity, which
depends on various parameters
- Page capacity (number of stored points) is an
important parameter
- Known from similarity search: page capacity
optimization yields considerable improvement
11Analysis of the Index Overhead
- Assuming 100 % selectivity (the index doesn't work): how
much more expensive is index usage?
- CPU
- Distance between boxes is more expensive to compute
than distance between points (factor α ≈ 5)
- Smaller capacity ⇒ more box distance computations
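A back-of-the-envelope sketch of the CPU overhead, not the authors' cost model: it assumes two data sets of n points each, considers only the data-page level, and takes the box-distance factor of about 5 from the slide as the default for the hypothetical parameter `alpha`. At 100 % selectivity every page pair is tested, so the box-distance work relative to the point-distance work is alpha / capacity², and n cancels out.

```python
def cpu_overhead_ratio(capacity, alpha=5.0):
    """Box-distance cost relative to point-distance cost at 100 % selectivity.

    With n points per data set there are (n / capacity)**2 page pairs, each
    costing alpha point-distance units, versus n**2 point pairs; n cancels.
    """
    return alpha / capacity ** 2

# Overhead shrinks quadratically as the page capacity grows
overheads = {c: cpu_overhead_ratio(c) for c in (1, 10, 100)}
```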
12Analysis of the Index Overhead
- Disk I/O
- High constant cost per page access (moving the disk
head)
- A page access is more expensive than continuous
reading of a point by a factor β ≈ 10000 / d
- Smaller capacity ⇒ more disk head movement
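The I/O side can be sketched in the same rough way. The simplification here (an assumption, not the authors' model) is that every page is read once with one head movement, and that a head movement costs as much as reading 10000 / d points sequentially, following the factor stated above. The overhead relative to a purely sequential scan is then that factor divided by the page capacity.

```python
def io_overhead_ratio(capacity, d, seek_numerator=10000.0):
    """Extra I/O cost of page-wise access relative to a sequential scan."""
    beta = seek_numerator / d     # one head movement, in point-read units
    return beta / capacity        # one head movement per page of `capacity` points

# 16-dimensional data: small pages are dominated by head movement
small_pages = io_overhead_ratio(capacity=100, d=16)
large_pages = io_overhead_ratio(capacity=10000, d=16)
```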
13Analysis of the Index Overhead
- What selectivity is needed for the index to pay off?
14Optimization
- I/O cost function is optimized by
- CPU cost function is optimized by
15Optimization
- I/O cost
- Large capacity optimum (typically several 10,000 points)
- CPU cost
- Small capacity optimum (typically < 100 points)
- No compromise achievable
16Multipage Index (MuX)
separate optimization
- CPU performance like a CPU-optimized index
- I/O performance like an I/O-optimized index
17Experimental Evaluation
Uniform 4D
Uniform 8D
18Experimental Evaluation
CAD Data 16D
Color Images 64D
19Conclusions
- Summary
- High potential for performance gains of the
similarity join by page capacity optimization
- Necessary to optimize I/O and CPU separately
- Future research potential
- Similarity join for metric index structures
- Approximate similarity join
- Parallel similarity join algorithms
20Consequences
- For I/O optimization, assume 100 % selectivity
- Page accesses in a nested-block-loop-like style:
fill cache with pages of R (1 page free)
foreach S-page s do
  if s joins some of the cached R-pg then
    load (s)
    foreach joining R-page r in cache do
      if mindist(r,s) < ε then join (r,s)
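The page schedule above can be sketched as follows. The sketch assumes that page regions are known as (lower corner, upper corner) boxes without loading the pages, repeats the cache fill for every block of R-pages, and only counts page loads and collects the page pairs to join; the function name `block_loop_schedule` and the sample boxes are illustrative.

```python
import math

def mindist(a, b):
    """Smallest Euclidean distance between two boxes given as (lo, hi) tuples."""
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))
    return math.sqrt(sum(g * g for g in gaps))

def block_loop_schedule(r_pages, s_pages, eps, cache_size):
    """Return (page pairs to join, number of page loads)."""
    block = cache_size - 1                 # one cache page stays free for S
    pairs, loads = [], 0
    for start in range(0, len(r_pages), block):
        cached = r_pages[start:start + block]   # fill cache with pages of R
        loads += len(cached)
        for s in s_pages:
            joining = [r for r in cached if mindist(r, s) < eps]
            if joining:                    # s joins some cached R-page
                loads += 1                 # load (s)
                pairs.extend((r, s) for r in joining)
    return pairs, loads

r_pages = [((0.0, 0.0), (1.0, 1.0)), ((5.0, 5.0), (6.0, 6.0)),
           ((10.0, 10.0), (11.0, 11.0))]
s_pages = [((1.2, 0.0), (2.0, 1.0)), ((20.0, 20.0), (21.0, 21.0))]
pairs, loads = block_loop_schedule(r_pages, s_pages, eps=0.5, cache_size=3)
```

In this example the three R-pages are loaded in two blocks, and only one S-page is ever loaded because the other joins no cached R-page.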