High-dimensional Similarity Join

1
High-dimensional Similarity Join
  • Presented by
  • Yang Xia
  • Wongsodihardjo, Hariyanto
  • Wang Hao

2
Agenda
  • Introduction
  • Motivation
  • R-tree based join
  • ε-kdb tree join
  • Epsilon grid order join
  • Summary

3
Introduction
  • Extracting knowledge from large multi-dimensional
    databases.
  • Many data mining algorithms need to process all
    pairs of points whose distance does not exceed a
    user-given parameter ε.
  • The operation of generating all such pairs is in
    essence a similarity join.
  • Data mining algorithms can be directly performed
    on top of a similarity join.
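As a baseline for the definition above, a similarity self-join can be sketched as a brute-force check of every pair. This is only an illustration, assuming Euclidean distance and points given as tuples; the function name and sample data are made up for the example.

```python
from itertools import combinations
from math import dist

def similarity_self_join(points, eps):
    """Return all index pairs (i, j), i < j, whose Euclidean distance is <= eps."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if dist(p, q) <= eps
    ]

# Hypothetical sample data: two tight clusters of two points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.9, 0.9), (0.95, 0.9)]
pairs = similarity_self_join(points, eps=0.15)
```

The quadratic number of distance computations is what the join algorithms in the rest of the talk try to avoid.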

4
Motivation
  • Conventional join algorithms, such as nested-loop
    join, sort-merge join, and hash-based join,
    cannot be directly applied to the high-D
    similarity join.
  • Make use of the index built on the high-D data.

5
Efficient Processing of Spatial Joins Using R-trees
  • by T. Brinkhoff, H. P. Kriegel, and B. Seeger,
    SIGMOD 1993
  • Presented by Hariyanto Wongsodihardjo,
    6 September 2001
6
Efficient Processing of Spatial Joins Using R-trees
  • Presenting a study of spatial join processing
    using R-trees, particularly R*-trees, which are
    among the most efficient members of the R-tree
    family
  • Presenting several techniques for improving
    spatial join execution time with respect to CPU
    and I/O time

7
R-tree Basic Algorithms
  • Let S be the query rectangle of a window query.
    The query starts at the root and computes all
    entries whose rectangles intersect S.
  • For these entries, the corresponding child nodes
    are read into main memory and the query is
    performed there in the same way as in the root
    node.
  • The efficiency of queries depends on how well the
    R-tree assigns rectangles to nodes.
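The window query described above can be sketched as a recursive descent. This is a minimal in-memory illustration, not the paper's implementation: the `Node` class, the `(xmin, ymin, xmax, ymax)` rectangle layout, and the sample tree are assumptions made for the example.

```python
from dataclasses import dataclass, field

def intersects(a, b):
    """True if axis-aligned rectangles a and b (xmin, ymin, xmax, ymax) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    is_leaf: bool
    # Each entry is (rectangle, child Node) in a directory node
    # or (rectangle, object id) in a leaf.
    entries: list = field(default_factory=list)

def window_query(node, s):
    """Collect the ids of all data rectangles that intersect query rectangle s."""
    hits = []
    for rect, ref in node.entries:
        if not intersects(rect, s):
            continue                      # prune this entry and its subtree
        if node.is_leaf:
            hits.append(ref)
        else:
            hits.extend(window_query(ref, s))
    return hits

# Hypothetical two-level tree.
leaf_a = Node(True, [((0, 0, 1, 1), "a1"), ((2, 2, 3, 3), "a2")])
leaf_b = Node(True, [((5, 5, 6, 6), "b1")])
root = Node(False, [((0, 0, 3, 3), leaf_a), ((5, 5, 6, 6), leaf_b)])
result = window_query(root, (0.5, 0.5, 2.5, 2.5))
```

Only `leaf_a` is visited here, because the directory rectangle of `leaf_b` does not intersect the query window.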

8
A First Approach of a Spatial Join for R-trees
9
CPU-Time and I/O-Time Tuning
  • CPU-Time Tuning
  • Restricting the search space
  • Spatial Sorting and plane sweep
  • I/O-Time Tuning
  • Local plane-sweep order with pinning
  • Local z-order

10
Restricting the search space
11
Restricting the search space
12
Restricting the search space
13
Spatial sorting and plane sweep
14
Spatial sorting and plane sweep
15
Spatial sorting and plane sweep
16
Spatial sorting and plane sweep
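A plane sweep over two spatially sorted rectangle sequences can be sketched as follows. The code is an illustration under assumptions: rectangles are `(xl, yl, xu, yu)` tuples, both sequences are sorted on the lower x-coordinate, and the function name is invented for the example.

```python
def sweep_pairs(rs, ss):
    """Report all intersecting pairs (r, s) of rectangles (xl, yl, xu, yu)."""
    rs, ss = sorted(rs), sorted(ss)      # spatial sort on the lower x-coordinate
    out, i, j = [], 0, 0
    while i < len(rs) and j < len(ss):
        if rs[i][0] <= ss[j][0]:
            t, k = rs[i], j              # sweep line stops at rectangle t of R
            while k < len(ss) and ss[k][0] <= t[2]:      # x-ranges overlap
                if t[1] <= ss[k][3] and ss[k][1] <= t[3]:  # y-ranges overlap
                    out.append((t, ss[k]))
                k += 1
            i += 1
        else:
            t, k = ss[j], i              # sweep line stops at rectangle t of S
            while k < len(rs) and rs[k][0] <= t[2]:
                if t[1] <= rs[k][3] and rs[k][1] <= t[3]:
                    out.append((rs[k], t))
                k += 1
            j += 1
    return out

# Hypothetical sample rectangles.
rs = [(0, 0, 2, 2), (5, 5, 6, 6)]
ss = [(1, 1, 3, 3), (7, 7, 8, 8), (5.5, 0, 5.8, 5.2)]
found = sweep_pairs(rs, ss)
```

Each rectangle is compared only with rectangles of the other sequence whose x-range can still overlap it, instead of with every rectangle.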
17
Local plane-sweep order
18
Local plane-sweep order
19
Local plane-sweep order with pinning (SJ4)
  • The sequence for local plane-sweep order on
    example 2 is II, I, IV, III and the read schedule
    is <r1, s2, s1, r2, s2, r4, r3>.
  • The pinning algorithm is based on the degree of
    the rectangles of both entries. The degree of a
    rectangle E is the number of intersections
    between E and the rectangles that belong to
    entries of the other tree and have not been
    processed yet. Thus for example 2 the read
    schedule is <r1, s2, r4, r3, s1, r2>.
  • The page whose rectangle has the maximum degree
    is pinned and the join is performed for the
    pinned page.
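The pinning rule can be sketched as a reordering of the page pairs produced by the local plane-sweep order. This is a simplified illustration: the function name is invented, and the four page pairs are an assumption chosen to be consistent with the read schedules quoted on this slide, since the figure for example 2 is not reproduced here.

```python
def pinned_order(pairs):
    """Reorder page pairs so that the page with the larger degree stays pinned."""
    pending = list(pairs)
    order = []
    while pending:
        r, s = pending.pop(0)
        order.append((r, s))
        # Degree of a page: number of still-pending pairs it takes part in.
        degree = {page: sum(page in pair for pair in pending) for page in (r, s)}
        pinned = max((r, s), key=degree.get)
        if degree[pinned] > 0:
            # Join the pinned page with all its remaining partners first.
            order.extend(pair for pair in pending if pinned in pair)
            pending = [pair for pair in pending if pinned not in pair]
    return order

# Assumed pairs for sequence II, I, IV, III (hypothetical reconstruction).
sweep_order = [("r1", "s2"), ("r2", "s1"), ("r4", "s2"), ("r3", "s2")]
order = pinned_order(sweep_order)
```

With these assumed pairs, page s2 is pinned after the first join and is joined with r4 and r3 before s1 and r2 are read, which avoids reading s2 a second time.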

20
Local z-order (SJ5)
21
Local z-order (SJ5)
  • Compute intersection between each rectangle of R
    with all rectangles of S
  • Sort resulting rectangles on the spatial location
    of their centers
  • Use z-ordering to sort resulting rectangles
  • Then pin pages as before.
  • The sequence for Figure 7 is I, II, III, V, IV
    and the read schedule is <s1, r2, r1, s2, r4, r3,
    s3>.
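Sorting rectangles by the z-order of their centers can be sketched with a bit-interleaving (Morton code) key. This is an illustration under assumptions: coordinates lie in [0, 1), centers are mapped to a 16-bit integer grid, and the function names are invented for the example.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of integer grid coordinates x and y (Morton code)."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)        # x bits go to even positions
        z |= ((y >> b) & 1) << (2 * b + 1)    # y bits go to odd positions
    return z

def z_sorted(rects, bits=16):
    """Sort rectangles (xl, yl, xu, yu) by the z-value of their centers."""
    scale = 1 << bits

    def key(r):
        cx = min(int((r[0] + r[2]) / 2 * scale), scale - 1)
        cy = min(int((r[1] + r[3]) / 2 * scale), scale - 1)
        return z_value(cx, cy, bits)

    return sorted(rects, key=key)

# Hypothetical rectangles, one per quadrant of the unit square.
quadrants = [(0.6, 0.6, 0.8, 0.8), (0.1, 0.6, 0.3, 0.8),
             (0.6, 0.1, 0.8, 0.3), (0.1, 0.1, 0.3, 0.3)]
ordered = z_sorted(quadrants)
```

Rectangles whose centers are close in space tend to be close in this order, which is why the order is useful for scheduling page reads.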

22
I/O Performance Comparison
23
I/O Performance Comparison
24
Conclusion
  • The R-tree join algorithm is straightforward.
  • The R-tree join algorithm improves CPU time by
    applying spatial sorting and restricting the
    search space.
  • The R-tree join algorithm improves I/O time by
    applying local plane-sweep order with pinning or
    local z-order.

25
High-dimensional similarity joins (ε-kdb tree)
  • Presented by
  • Yang Xia
  • Reference: K. Shim, R. Srikant, and R. Agrawal,
    High-dimensional similarity joins, Proc. 13th
    IEEE Internat. Conf. on Data Engineering, 1997,
    pp. 301--311.

26
Introduction
  • The ε-kdb tree is a main-memory data structure
    optimized for performing similarity joins. It
    uses the similarity distance limit ε as a
    parameter in building the tree.
  • Problem Definition
  • -Self-join
  • -Non-self-join
  • -Distance metric

27
Problems with Current Indices
  • Number of Neighboring Leaf Nodes
  • Storage Utilization
  • Traversal Cost
  • Build Time
  • Skewed Data

28
ε-kdb tree Definition
  • The co-ordinates of the points in each dimension
    lie between 0 and 1.
  • Start with a single leaf node.
  • Whenever the number of points in a leaf node
    exceeds a threshold, the leaf node is split.
  • If the leaf node was at level i, the i-th
    dimension is used for splitting. The node is
    split into ⌊1/ε⌋ parts, each about ε wide in
    that dimension.
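The construction rules above can be sketched as a small insertion routine. This is an illustration, not the authors' code: the `Node` class, `cell_index`, the zero-based levels, and the threshold of 2 are assumptions, and the number of parts per split is taken to be ⌊1/ε⌋ with the last part absorbing any remainder.

```python
from math import floor

class Node:
    """A leaf while children is None; an internal node once it has been split."""
    def __init__(self, level):
        self.level = level        # also the dimension used when this node splits
        self.points = []
        self.children = None      # maps cell index -> child Node

def cell_index(x, eps):
    # Coordinates are assumed to lie in [0, 1]; x == 1.0 falls into the last cell.
    return min(int(x / eps), floor(1 / eps) - 1)

def split(node, eps):
    """Distribute the points of a leaf into eps-sized chunks of its dimension."""
    node.children = {}
    pts, node.points = node.points, []
    for q in pts:
        idx = cell_index(q[node.level], eps)
        node.children.setdefault(idx, Node(node.level + 1)).points.append(q)

def insert(root, p, eps, threshold):
    node = root
    while node.children is not None:          # descend to the leaf that covers p
        idx = cell_index(p[node.level], eps)
        node = node.children.setdefault(idx, Node(node.level + 1))
    node.points.append(p)
    # Split an overfull leaf unless every dimension has already been used.
    if len(node.points) > threshold and node.level < len(p):
        split(node, eps)

# Hypothetical 2-D sample with eps = 0.25 and a leaf threshold of 2 points.
root = Node(0)
for p in [(0.1, 0.1), (0.3, 0.6), (0.6, 0.2), (0.9, 0.9)]:
    insert(root, p, eps=0.25, threshold=2)
```

In this sketch a child that is already overfull right after a split is only split again on the next insertion into it, which keeps the code short.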

29
Example of ε-kdb tree
30
Similarity Join using the ε-kdb tree
31
Memory Management
  • Main memory can hold all points within a 2ε
    distance on the first dimension.

32
Memory Management
  • Main memory cannot hold all points within a 2ε
    distance on the first dimension.

33
Design Rationale
  • Biased splitting: the dimension used in the
    previous split is selected again for splitting as
    long as the length of that dimension in the
    bounding rectangle of each resulting leaf node is
    at least ε.
  • ε-sized splitting: when we split a node, we split
    it into ε-sized chunks.

34
Design Rationale
  • Number of Neighboring Leaf Nodes.
  • Space Requirements.
  • Traversal Cost.
  • Build time.
  • Skewed data.

35
An example
36
Experiments
  • Synthetic Data Parameters

37
Experiments(1)
38
Experiments(2)
39
Experiments(3)
40
Conclusions
  • The ε-kdb tree reduces the number of neighboring
    leaf nodes that are considered for the join test.
  • The ε-kdb tree reduces the traversal cost of
    finding appropriate branches in the internal
    nodes.
  • The storage cost for internal nodes is
    independent of the number of dimensions.

41
Epsilon Grid Order: An Algorithm for the
Similarity Join on Massive High-Dimensional Data
  • by Christian Böhm, Bernhard Braunmüller, Florian
    Krebs, and Hans-Peter Kriegel, SIGMOD 2001
  • Presented by Wang Hao
  • 6 September 2001

42
Motivation
  • Index-based join
  • R-tree family, MuX (Multipage Index) tree, etc.
  • Optimization conflict between CPU and IO [BK01].
  • Optimizing CPU: fine-grained partitioning with
    page capacities of a few points.
  • Optimizing IO: a large block size requires fewer
    IO operations.
  • Join without index
  • Seeded tree, spatial hash join, ε-kdb tree,
    etc.
  • Not scalable to large data sets.
  • ε-kdb tree: the cache size can be from 36% to 60%
    of the database size.

43
Design Objectives
  • Join without Index.
  • Optimize both CPU and IO.
  • Scalable to large data sets of size well beyond
    1 GB.

44
Basic Ideas
  • Define a sort order on the data: the epsilon grid
    order.
  • Lay an equidistant grid with cell length ε over
    the data space and compare the grid cells
    lexicographically.
  • Use external sort to sort the data.
  • Schedule the IO carefully during the join phase.

45
Epsilon Grid Order
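  • For two vectors p, q, the predicate
    $p <_{ego} q$ is true iff there exists a
    dimension $d_i$ such that
    $\lfloor p_i/\varepsilon \rfloor < \lfloor q_i/\varepsilon \rfloor$
    and $\lfloor p_j/\varepsilon \rfloor = \lfloor q_j/\varepsilon \rfloor$
    for all $j < i$.
  • Epsilon grid order is a strict order:
  • irreflexive, asymmetric, and transitive.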
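The order can be sketched by mapping each point to the index tuple of its grid cell and comparing those tuples lexicographically, which is what Python's built-in tuple comparison does. The function names and sample points are assumptions made for the example.

```python
from math import floor

def grid_cell(p, eps):
    """Index tuple of the grid cell (cell length eps) that contains point p."""
    return tuple(floor(x / eps) for x in p)

def ego_less(p, q, eps):
    """True iff p precedes q in epsilon grid order."""
    return grid_cell(p, eps) < grid_cell(q, eps)   # lexicographic comparison

def ego_sorted(points, eps):
    """Sort points into epsilon grid order (in memory, for illustration only)."""
    return sorted(points, key=lambda p: grid_cell(p, eps))

# Hypothetical 2-D points with eps = 0.5.
a, b, c = (0.2, 0.9), (0.7, 0.1), (0.4, 0.6)
in_order = ego_sorted([b, a, c], eps=0.5)
```

Two points in the same grid cell, such as `a` and `c` here, are not ordered with respect to each other, which is consistent with the order being strict. For data that does not fit in memory, the slides use an external sort instead of `sorted`.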

46
Epsilon Grid Order (Cont.)
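  • A point q with
    $q <_{ego} p - [\varepsilon, \ldots, \varepsilon]^T$ cannot
    be a join mate of p, or of any point p' which is
    not $p' <_{ego} p$.
  • A point q with
    $p + [\varepsilon, \ldots, \varepsilon]^T <_{ego} q$ cannot
    be a join mate of p, or of any point p' which is
    not $p <_{ego} p'$.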
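The pruning property on this slide can be sketched as an in-memory self-join over points sorted in epsilon grid order: for each point p, only points whose grid cell lies between the cells of p shifted by -ε and +ε in every dimension are tested. This is a simplified illustration assuming Euclidean distance; it is not the paper's `join_sequence` procedure, and the names and data are invented.

```python
from bisect import bisect_left, bisect_right
from math import dist, floor

def ego_join(points, eps):
    """Self-join: all pairs with distance <= eps, pruned via epsilon grid order."""
    key = lambda p: tuple(floor(x / eps) for x in p)
    pts = sorted(points, key=key)
    keys = [key(p) for p in pts]
    result = []
    for i, p in enumerate(pts):
        # Shifting p by -eps / +eps moves its cell index by -1 / +1 per dimension.
        lo_key = tuple(k - 1 for k in keys[i])
        hi_key = tuple(k + 1 for k in keys[i])
        lo = bisect_left(keys, lo_key)     # first point not before p - eps
        hi = bisect_right(keys, hi_key)    # first point after p + eps
        for j in range(max(lo, i + 1), hi):   # j > i reports each pair once
            if dist(p, pts[j]) <= eps:
                result.append((p, pts[j]))
    return result

# Hypothetical sample data with eps = 0.25.
sample = [(0.1, 0.1), (0.2, 0.2), (0.6, 0.6), (0.7, 0.55), (0.9, 0.1)]
joined = ego_join(sample, eps=0.25)
```

The interval is only a necessary condition, so each candidate inside it still needs the exact distance test.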

47
I/O Scheduling Using the ε Grid Order
  • Unbuffered IO operations.
  • Example: IO units in a 2-D data space

48
I/O Scheduling (Cont.)
  • Illustration: pairs of IO units that must be
    considered for the join.

In the picture, each entry in the matrix stands
for one pair of IO units.
  • IO thrashing effects
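Which entries of the matrix need to be considered can be sketched from the sort order alone: a later IO unit can be skipped once its first point lies after the last point of the current unit shifted by +ε. This is an illustration under assumptions: IO units are modeled as in-memory lists taken from a file already sorted in epsilon grid order, and the function names are invented.

```python
from math import floor

def cell(p, eps):
    return tuple(floor(x / eps) for x in p)

def unit_pairs(units, eps):
    """Pairs (a, b), a <= b, of IO units that may contain join mates."""
    pairs = []
    for a in range(len(units)):
        # Upper bound: grid cell of the last point of unit a, shifted by +eps.
        hi = tuple(k + 1 for k in cell(units[a][-1], eps))
        for b in range(a, len(units)):
            if cell(units[b][0], eps) > hi:
                break          # this unit and all later ones are too far away
            pairs.append((a, b))
    return pairs

# Hypothetical file of five sorted 2-D points in IO units of two points each.
units = [[(0.1, 0.1), (0.3, 0.6)], [(0.4, 0.9), (0.6, 0.2)], [(0.9, 0.9)]]
needed = unit_pairs(units, eps=0.25)
```

The result is a band along the diagonal of the upper triangle of the matrix; the order in which these pairs are visited is what the scheduling slides address.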

49
Scheduling Mode
50
Scheduling Algorithm
51
Joining Two IO Units
  • Active dimensions
  • minlen: the minimum length of sequences for the
    join.

52
Optimization Potentials
  • Use larger sequences to optimize IO.
  • Optimize minlen for minimal CPU processing time.
  • Compared with the ε-kdb tree and the MuX tree, no
    directory is constructed. The only space overhead
    is the recursion stack, O(log n).
  • Other possible optimizations
  • Modification of sort order.
  • Optimization in the recursion in join_sequence.

53
Experiments
  • Settings
  • Buffer memory: 10% of database size.
  • Use Euclidean distance.
  • Distance parameter ε determined using the
    algorithm in [SEKX98] such that it is suitable
    for clustering.
  • Compare with Nested-loop join, Z-ordering R-tree
    based join, and MuX tree based join.

54
Experiments on Uniformly Distributed 8-D Data.
55
Experiments on Real 16-D Data from CAD Database.
56
Conclusions and Future work
  • Defined a strict order: the epsilon grid order.
  • A sophisticated scheduling algorithm.
  • Several optimization techniques.
  • Experiments show it outperforms competing
    algorithms for data sets with size up to 1.2 GB.
  • Future work
  • Parallel version of the join algorithm.
  • Extend the cost model to query optimizer.

57
Overall Summary
  • We have covered three join algorithms: the
    R-tree-based join, the ε-kdb tree join, and the
    epsilon grid order join.
  • Specific algorithms have been proposed to perform
    the similarity join for each of the following
    cases:
  • Both data sets have an index,
  • Only one data set has an index,
  • Neither data set has an index.
  • High-D similarity joins can be applied in data
    mining algorithms such as clustering.

58
Resource Links
  • Readings on High-dimensional Similarity Join
  • http://www.comp.nus.edu.sg/wanghao/cs6203/join.htm