SIMILARITY JOIN - PowerPoint PPT Presentation

Provided by: Kien5

1
SIMILARITY JOIN
  • COP6731
  • Advanced Database Systems

2
Basic Similarity Queries
  • Range Query
    • Find all items within distance ε of a query item
  • k-Nearest-Neighbor (kNN) Query
    • Find the k items most similar to a query item
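
The two basic queries can be sketched in a few lines of Python. This is a brute-force illustration, assuming points are coordinate tuples and d is the Euclidean distance; the function names are hypothetical and not from the slides.

```python
import heapq
import math

def range_query(points, q, eps):
    """Return every point within distance eps of the query point q."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k):
    """Return the k points closest to q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
in_range = range_query(data, (0.0, 0.0), 1.5)
nearest = knn_query(data, (0.0, 0.0), 2)
```

The range query returns a result set whose size depends on ε, while the kNN query always returns k items (if at least k exist).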

3
Similarity Join
  • Given two sets, R and S, of data points
  • Find all pairs (r,s) ∈ R×S such that d(r,s) ≤ ε
  • Applications - duplicate detection, similarity
    comparison, etc.

4
Similarity Join - SQL-like Notation
  • SELECT *
  • FROM R, S
  • WHERE d(R.r, S.s) ≤ ε
  • ε too small: no results
  • ε too large: very large result set
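
A brute-force sketch of the join makes the effect of ε visible. It assumes Euclidean distance and tuple-valued points; `similarity_join` is a hypothetical name used only for this illustration.

```python
import math

def similarity_join(R, S, eps):
    """All pairs (r, s) from R x S with d(r, s) <= eps."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.5, 0.0), (5.0, 5.5), (9.0, 9.0)]

# Result size for a small, a moderate, and a very large eps.
sizes = {eps: len(similarity_join(R, S, eps)) for eps in (0.1, 1.0, 100.0)}
```

With these sample points the smallest ε yields no pairs and the largest yields the full cross product, which is the trade-off the slide points out.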

5
k-Closest Pair Query
  • Given two sets, R and S, of data points
  • Find the k pairs (r,s) with the smallest
    distance d(r,s)
  • r and s are nearest neighbors (NN) of each other
  • Also called a distance join

6
k-Closest Pair Query - SQL-like Notation
  • SELECT *
  • FROM R, S
  • ORDER BY d(R.r, S.s)
  • UNTIL k
  • Applications
    • Find all pairs of people who have the most
      similar interests
    • Find music scores which are most similar to each
      other
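
The ORDER BY ... UNTIL k semantics can be sketched by ranking all pairs by distance and keeping the first k. This is a naive illustration that enumerates the whole cross product, assuming Euclidean distance; `k_closest_pairs` is a hypothetical name.

```python
import heapq
import math
from itertools import product

def k_closest_pairs(R, S, k):
    """The k pairs (r, s) of R x S with the smallest distance, closest first."""
    return heapq.nsmallest(k, product(R, S), key=lambda pair: math.dist(*pair))

R = [(0.0, 0.0), (4.0, 0.0)]
S = [(1.0, 0.0), (4.0, 0.5), (10.0, 10.0)]
closest = k_closest_pairs(R, S, 2)
```

Unlike the ε-join, the result size is fixed at k, so no distance threshold has to be guessed.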

7
k-Nearest Neighbor Join
  • Combine each point with its k nearest neighbors
    from the other data set
  • SQL-like Notation
    • SELECT *
    • FROM R, S
    • GROUP BY R.r
    • GROUP SIZE k
    • ORDER BY d(R.r, S.s)
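
A small sketch of the k-NN join, assuming Euclidean distance: every point of R is grouped with its k nearest points of S. The name `knn_join` and the dictionary result format are choices made for this illustration.

```python
import heapq
import math

def knn_join(R, S, k):
    """Map each r in R to its k nearest neighbors in S, nearest first."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}

R = [(0.0, 0.0), (10.0, 0.0)]
S = [(1.0, 0.0), (2.0, 0.0), (9.0, 0.0), (12.0, 0.0)]
groups = knn_join(R, S, 2)
```

The result always holds k partners per R-point, in contrast to the k-closest pair query, which returns k pairs in total.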

8
k-Nearest Neighbor Join

9
k-NN Join Applications
  • k-means clustering
    1. Select k initial centers at random
    2. Assign each database point to its nearest center
    3. Recompute the center of each cluster
    4. Repeat Steps 2 and 3 until convergence
  • k-NN classification
    • Classify new objects according to the majority of
      their k nearest neighbors
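
The k-means steps can be sketched as follows. Step 2 is a nearest-neighbor join of the points against the centers with k = 1. This is a minimal sketch assuming Euclidean distance and tuple-valued points; the `seed` and `max_iter` parameters are added here for reproducibility and are not part of the slides.

```python
import math
import random

def nearest_center(p, centers):
    """Index of the center closest to p (a 1-NN lookup)."""
    return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # Step 1: random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                            # Step 2: assign to nearest center
            clusters[nearest_center(p, centers)].append(p)
        new_centers = [                             # Step 3: recompute each center
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                  # Step 4: stop at convergence
            break
        centers = new_centers
    return centers, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = k_means(pts, 2)
```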

10
Nested Loop Join
  • Simple nested loop
    • For each R-point, iterate over all S-points
    • Scans S |R| times, very expensive
  • Nested block loop
    • For each page of R-points, iterate over all S-points
    • Scans S only |R| / (points per page) times, more
      cost effective
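
The saving of the block variant can be sketched by counting scans of S. The code below is an in-memory illustration, assuming Euclidean distance; `page_size` stands in for the number of R-points that fit on a page, and the function names are hypothetical.

```python
import math

def pages(points, page_size):
    """Split a point list into consecutive pages of at most page_size points."""
    for i in range(0, len(points), page_size):
        yield points[i:i + page_size]

def block_nested_loop_join(R, S, eps, page_size):
    """Return the join result and the number of scans over S."""
    result, scans_of_S = [], 0
    for r_page in pages(R, page_size):
        scans_of_S += 1                 # one pass over S per page of R
        for s in S:
            for r in r_page:
                if math.dist(r, s) <= eps:
                    result.append((r, s))
    return result, scans_of_S

R = [(float(i), 0.0) for i in range(5)]
S = [(0.0, 0.5), (4.0, 0.5)]
result, scans = block_nested_loop_join(R, S, 1.0, page_size=2)
```

With five R-points and two points per page, S is scanned three times instead of five; setting `page_size=1` reproduces the simple nested loop.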

11
Indexed Nested Loop Join
  • For each R-point, determine matches in S using
    the index
  • For a large number of dimensions and/or high
    selectivity (due to large ε), not as competitive
    as nested loop join

12
Spatial Join vs Similarity Join
  • Represent each data point as a hypercube of
    edge-length 0.71ε
  • Map the similarity join wrt ε to a spatial join on
    the hypercubes
  • If two hypercubes overlap, the corresponding
    points are within distance ε of each other
  • That is, they are neighbors wrt ε
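
A minimal Python sketch of the overlap test follows. It assumes that each hypercube is centered on its point and reads 0.71 as 1/√2; both are assumptions, not stated on the slide. Under that reading, overlapping cubes imply d(r,s) ≤ ε in two dimensions, since each coordinate differs by at most ε/√2. `cubes_overlap` is a hypothetical helper name.

```python
import math

EDGE_FACTOR = 1 / math.sqrt(2)   # assumption: the slide's 0.71, unrounded

def cubes_overlap(r, s, eps):
    """True if the cubes of edge EDGE_FACTOR * eps centered on r and s overlap."""
    edge = EDGE_FACTOR * eps
    # Two equal axis-aligned cubes overlap iff their centers differ by at
    # most one edge length in every dimension.
    return all(abs(a - b) <= edge for a, b in zip(r, s))

r, s = (0.0, 0.0), (0.6, 0.6)
overlapping = cubes_overlap(r, s, 1.0)
within_eps = math.dist(r, s) <= 1.0
```

Under these assumptions the overlap test is a sufficient condition only: the points (0, 0) and (0.9, 0) are within ε = 1 of each other, yet their cubes do not overlap.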

13
R-tree Spatial Join (RSJ)
  • Assumption: index preconstructed on R and S with
    equal tree height
  • Procedure RSJ(R, S: page)
    • for each r ∈ R.children do
      • for each s ∈ S.children do
        • if (r ∩ s ≠ ∅) then RSJ(r, s)
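
The procedure can be sketched in Python with a simplified in-memory node type. `Node`, `intersects`, and the leaf handling are assumptions made for this sketch; the slide's pseudocode does not show what happens at the leaf level, so intersecting leaf entries are simply reported here.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                      # lower corner of the bounding box
    hi: tuple                                      # upper corner of the bounding box
    children: list = field(default_factory=list)   # empty for leaf entries

def intersects(a, b):
    """Axis-aligned boxes intersect iff their intervals overlap in every dimension."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))

def rsj(R, S, out):
    """Synchronized descent of two trees of equal height."""
    for r in R.children:
        for s in S.children:
            if intersects(r, s):
                if r.children and s.children:
                    rsj(r, s, out)        # descend both trees together
                else:
                    out.append((r, s))    # assumption: report leaf-level pairs
```

A call such as `rsj(root_r, root_s, out)` with an empty list `out` collects the intersecting leaf pairs; subtrees whose bounding boxes do not intersect are never visited.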

14
Adapt RSJ for Similarity Join
  • Distance predicate rather than intersection
  • mindist(R,S) computes the least possible distance
    between a point in the region of R and a point in
    the region of S
  • Procedure RsimJ(R, S, ε)
    • if IsDirPg(R) ∧ IsDirPg(S) then
      • for each r ∈ R.children do
        • for each s ∈ S.children do
          • if mindist(r,s) ≤ ε then
            • RsimJ(r, s, ε) /* recursive */
    • else /* R and S are data pages */
      • for each p ∈ R.points do
        • for each q ∈ S.points do
          • if d(p, q) ≤ ε then output(p, q)
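
A Python sketch of RsimJ, assuming Euclidean distance and pages described by axis-aligned bounding boxes. The `Page` type is a simplification invented for this sketch: a directory page has `children`, a data page has `points`.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Page:
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)   # filled for directory pages
    points: list = field(default_factory=list)     # filled for data pages

def mindist(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g * g for g in gaps))

def rsimj(R, S, eps, out):
    if R.children and S.children:                  # both are directory pages
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:           # page pair may hold matches
                    rsimj(r, s, eps, out)
    else:                                          # both are data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))
```

Because mindist never exceeds the true distance of any point pair in the two pages, pruning on `mindist(r, s) > eps` cannot lose a result.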

15
Performance Issues in R-tree Join
  • Cost dominated by point-distance computations -
    CPU-bound
  • Random page accesses can be worse than nested
    block loop join

16
Parallel Similarity Join
  • A task corresponds to a pair of tree nodes (data
    page or directory page)
  • Various task assignment strategies
    • Round robin
    • Static range assignment
    • Dynamic task assignment to achieve load balancing
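
The three strategies can be compared with a small simulation. Everything here is an assumption made for illustration: tasks are opaque labels with a known cost, static range assignment is modeled as contiguous chunks of the task list, and dynamic assignment hands the next task to whichever worker becomes idle first.

```python
import heapq

def round_robin(tasks, n_workers):
    """Task i goes to worker i mod n_workers."""
    return [tasks[w::n_workers] for w in range(n_workers)]

def static_range(tasks, n_workers):
    """Each worker gets one contiguous range of the task list."""
    chunk = -(-len(tasks) // n_workers)            # ceiling division
    return [tasks[i * chunk:(i + 1) * chunk] for i in range(n_workers)]

def dynamic(tasks, n_workers, cost):
    """The next task goes to the worker that becomes idle first."""
    idle_at = [(0, w) for w in range(n_workers)]   # (time worker is free, worker)
    heapq.heapify(idle_at)
    plan = [[] for _ in range(n_workers)]
    for t in tasks:
        time, w = heapq.heappop(idle_at)
        plan[w].append(t)
        heapq.heappush(idle_at, (time + cost(t), w))
    return plan

costs = {"t0": 5, "t1": 1, "t2": 1, "t3": 1, "t4": 1, "t5": 1}
tasks = list(costs)

def loads(plan):
    """Total cost assigned to each worker."""
    return [sum(costs[t] for t in worker) for worker in plan]

rr, sr, dyn = (loads(round_robin(tasks, 2)),
               loads(static_range(tasks, 2)),
               loads(dynamic(tasks, 2, costs.get)))
```

With one expensive task among cheap ones, both static schemes leave one worker with most of the work, while the dynamic scheme evens out the load.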

17
Breadth-First R-tree Join
  • Shortcoming of RsimJ
    • Depth-first traversal is sequential in nature
    • No strategy for improving locality in the inner
      loop, resulting in an inefficient page access
      pattern
  • Solution
    • Proceed level by level (i.e., breadth-first
      traversal)
    • Determine all relevant pairs for the next level
    • Access these relevant pairs in the order of their
      physical locations in storage
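
A level-by-level variant can be sketched as follows. The `Page` type and its `page_id` field are assumptions for this sketch; `page_id` stands in for the physical location of a page, and sorting by it models accessing pairs in storage order.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Page:
    page_id: int                                   # hypothetical physical address
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)   # directory page
    points: list = field(default_factory=list)     # data page

def mindist(a, b):
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g * g for g in gaps))

def breadth_first_join(root_r, root_s, eps):
    """Return the join result and the page-pair access schedule per level."""
    level, result, schedule = [(root_r, root_s)], [], []
    while level:
        # Access the pairs of this level in order of physical location.
        level.sort(key=lambda pair: (pair[0].page_id, pair[1].page_id))
        schedule.append([(r.page_id, s.page_id) for r, s in level])
        next_level = []
        for R, S in level:
            if R.children and S.children:
                # Determine all relevant pairs for the next level.
                next_level += [(r, s) for r in R.children for s in S.children
                               if mindist(r, s) <= eps]
            else:
                result += [(p, q) for p in R.points for q in S.points
                           if math.dist(p, q) <= eps]
        level = next_level
    return result, schedule
```

Since the full list of pairs for a level is known before any of them is read, the access order can be chosen freely, which a depth-first recursion does not allow.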

18
Reducing Random Access in Breadth-First Traversal
  • Space is regularly tiled, and a space-filling
    curve (e.g., Hilbert curve) is defined over the tiles
  • Store the index tree level by level
  • For each level, store tree nodes according to
    their space-filling-curve order

19
Without Preconstructed Index (1)
  • Tree construction time is often much less than
    join time, so it can be amortized during the join
  • Indexes can be constructed temporarily for the join
  • Techniques include the Hilbert R-tree and the ε-kdB tree
  • Hilbert R-tree: sort points by space-filling
    curve (SFC) order, and pack adjacent points into
    pages
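
The sort-and-pack step can be sketched for two dimensions. The sketch assumes that points already have integer grid coordinates in [0, n) with n a power of two; `hilbert_index` follows the widely used bitwise conversion from cell coordinates to position on the Hilbert curve, and `hilbert_pack` is a hypothetical name. Only the leaf level is built here, not the directory levels of the tree.

```python
def hilbert_index(n, x, y):
    """Position of grid cell (x, y) on the Hilbert curve over an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_pack(points, page_size, n):
    """Sort points by Hilbert order and cut the sequence into pages."""
    ordered = sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]

points = [(3, 0), (0, 0), (1, 1), (0, 3), (1, 0), (2, 2)]
leaf_pages = hilbert_pack(points, page_size=2, n=4)
```

Points that are consecutive on the curve are close in space, so each page covers a compact region.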

20
Without Preconstructed Index (2)
  • ε-kdB tree
    • Space is partitioned into grid cells with grid
      line distance ε
    • Tree structure is specific to the given ε, and
      must be constructed for each join
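
The benefit of a grid with line distance ε can be sketched with a flat hash grid. This is not the ε-kdB tree itself, only an illustration of the underlying idea under the assumption of Euclidean distance: two points within ε of each other must lie in the same or in adjacent cells, so only neighboring cells need to be compared. The function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def cell_of(p, eps):
    """Grid cell of point p when grid lines are eps apart."""
    return tuple(math.floor(c / eps) for c in p)

def grid_join(R, S, eps):
    grid = defaultdict(list)
    for s in S:                                    # built for this eps only
        grid[cell_of(s, eps)].append(s)
    result = []
    for r in R:
        cell = cell_of(r, eps)
        # Only the cell of r and its direct neighbors can hold partners.
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for s in grid.get(neighbor, ()):
                if math.dist(r, s) <= eps:
                    result.append((r, s))
    return result

R = [(0.0, 0.0), (0.95, 0.95), (3.0, 3.0)]
S = [(1.05, 1.05), (0.5, 0.0), (5.0, 5.0)]
pairs = grid_join(R, S, 1.0)
```

Because the cell width equals ε, the structure has to be rebuilt whenever ε changes, which matches the remark that the tree is constructed for each join.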

[Figure: ε-kdB tree; labels: root, leaves, ε]