Multi-query Optimization for Distributed Similarity Query Processing

1
Multi-query Optimization for Distributed
Similarity Query Processing
  • Yi Zhuang
    Zhejiang University
  • Qing Li
    City University of Hong Kong
  • Lei Chen
    HKUST

2
Introduction(I)
  • High-dimensional data access
  • Query-intensive similarity query
  • Multi-query optimization to answer a bunch of
    queries in a batch manner via combining the
    correlated queries together

3
Introduction(II)
  • Given three queries Q1, Q2 and Q3
  • Q1 ∩ Q2 is not null

The MDSQ Algorithm (Multi-query optimization for
Distributed Similarity Query Processing)
4
Overview of the framework
  • Query node level
    • Query submission
    • Dynamic query scheduling
  • Data node level
    • Index-based vector set reduction
    • Return the query results to the query node

5
Enabling Techniques
  • Dynamic query scheduling
  • SD-based load balancing scheme
  • Index-based vector set reduction

6
Dynamic query scheduling(I)
  • Preliminaries
  • Definition 1 (Minimal Bounding Sphere, MBS).
    Given two query spheres A and B, their
    corresponding MBS, denoted as MBS(A,B), is a
    sphere that encloses both A and B such that
    the volume of the MBS is minimal.
  • Definition 2 (Maximal Inner Tangent Sphere,
    MITS). Given two spheres A and B, their
    corresponding MITS, denoted as MITS(A,B), is
    the largest sphere contained in the
    intersection of these two spheres.

(Figures: MBS(A,B) and MITS(A,B))
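
The two definitions can be made concrete with a small geometric sketch. It assumes Euclidean distance and represents each sphere as a center tuple plus a radius; the function names `mbs` and `mits` are illustrative and do not come from the slides.

```python
import math

def distance(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mbs(center_a, r_a, center_b, r_b):
    """Smallest sphere enclosing spheres A and B, as (center, radius)."""
    d = distance(center_a, center_b)
    if d + r_b <= r_a:                 # B lies inside A
        return tuple(center_a), r_a
    if d + r_a <= r_b:                 # A lies inside B
        return tuple(center_b), r_b
    radius = (d + r_a + r_b) / 2.0
    t = (radius - r_a) / d             # fraction of the way from A's center to B's
    center = tuple(a + t * (b - a) for a, b in zip(center_a, center_b))
    return center, radius

def mits(center_a, r_a, center_b, r_b):
    """Largest sphere inside the intersection of A and B, or None if they do not overlap."""
    d = distance(center_a, center_b)
    if d >= r_a + r_b:                 # disjoint or merely touching
        return None
    if d + r_b <= r_a:                 # B inside A: the intersection is B itself
        return tuple(center_b), r_b
    if d + r_a <= r_b:                 # A inside B: the intersection is A itself
        return tuple(center_a), r_a
    radius = (r_a + r_b - d) / 2.0     # half the thickness of the lens along the center line
    t = (r_a - radius) / d
    center = tuple(a + t * (b - a) for a, b in zip(center_a, center_b))
    return center, radius
```

For two unit spheres centered at (0, 0) and (4, 0), `mbs` gives a sphere of radius 3 centered at (2, 0), while `mits` returns `None` because the spheres do not overlap.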
7
Dynamic query scheduling(II)
  • Motivation
  • Given two query spheres A and B, the volume
    of the union of A and B should be larger
    than half of the volume of their MBS, which is
    formally denoted as
    $\mathrm{Vol}(A \cup B) > \frac{1}{2}\,\mathrm{Vol}(\mathrm{MBS}(A,B))$,
    where Vol(·) denotes the volume of a region.
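
As an illustration of this criterion, the sketch below evaluates it in three dimensions only, where the overlap of two spheres has a simple closed form. The slides target high-dimensional data, so the 3-D restriction and the function names are assumptions made purely for the example.

```python
import math

def ball_volume(r):
    """Volume of a 3-D ball of radius r."""
    return 4.0 / 3.0 * math.pi * r ** 3

def union_volume_3d(d, r_a, r_b):
    """Volume of the union of two 3-D balls whose centers are d apart."""
    if d >= r_a + r_b:                       # disjoint: nothing to subtract
        overlap = 0.0
    elif d <= abs(r_a - r_b):                # one ball inside the other
        overlap = ball_volume(min(r_a, r_b))
    else:                                    # lens-shaped intersection
        overlap = (math.pi * (r_a + r_b - d) ** 2
                   * (d * d + 2 * d * (r_a + r_b) - 3 * (r_a - r_b) ** 2)
                   / (12 * d))
    return ball_volume(r_a) + ball_volume(r_b) - overlap

def mbs_radius(d, r_a, r_b):
    """Radius of the minimal bounding sphere of the two balls."""
    if d <= abs(r_a - r_b):
        return max(r_a, r_b)
    return (d + r_a + r_b) / 2.0

def worth_merging_3d(d, r_a, r_b):
    """True if the union fills more than half of the MBS volume."""
    return union_volume_3d(d, r_a, r_b) > 0.5 * ball_volume(mbs_radius(d, r_a, r_b))
```

Two unit balls whose centers are 0.5 apart pass the test, while two unit balls 10 apart do not, since their MBS would be mostly empty space.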
8
Dynamic query scheduling(III)
  • Theorem 1. Given two query spheres A and B,
    let their MBS be denoted as MBS(A,B). In order to get
    $\mathrm{Vol}(A \cup B) > \frac{1}{2}\,\mathrm{Vol}(\mathrm{MBS}(A,B))$,
    the condition of Eq. (1) should be satisfied.
  • Three cases
    (a) Contained
    (b) Intersected
    (c) Otherwise
9
Dynamic query scheduling(IV)
Algorithm 1. The query scheduling algorithm
  • Input: m query spheres
  • Output: m' clustered query spheres
  • 1. while (TRUE)
  • 2.   for any two spheres do
  • 3.     if Eq. (1) is satisfied then
  • 4.       merge the two query spheres
  • 5.       update the query sphere list
  • 6.       m ← m - 1
  • 7.     end if
  • 8.   end for
  • 9. end while /* the value of m has been reduced to m' and m' < m */
  • 10. return m' (updated m) clustered query spheres
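
The loop in Algorithm 1 can be sketched as follows. Eq. (1) is not recoverable from this transcript, so the merge condition is passed in as a hypothetical `should_merge` predicate. The sketch also assumes that merging two spheres means replacing them with their minimal bounding sphere, and that the loop stops once no pair satisfies the predicate.

```python
import math

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def bounding_sphere(s1, s2):
    """Minimal bounding sphere of two (center, radius) spheres."""
    (c1, r1), (c2, r2) = s1, s2
    d = distance(c1, c2)
    if d + r2 <= r1:
        return s1
    if d + r1 <= r2:
        return s2
    radius = (d + r1 + r2) / 2.0
    t = (radius - r1) / d
    return tuple(a + t * (b - a) for a, b in zip(c1, c2)), radius

def schedule_queries(spheres, should_merge):
    """Repeatedly merge any pair accepted by should_merge until none is left."""
    spheres = list(spheres)
    merged = True
    while merged:
        merged = False
        for i in range(len(spheres)):
            for j in range(i + 1, len(spheres)):
                if should_merge(spheres[i], spheres[j]):
                    new_sphere = bounding_sphere(spheres[i], spheres[j])
                    # Update the list: drop the pair, keep the merged sphere.
                    spheres = [s for k, s in enumerate(spheres) if k not in (i, j)]
                    spheres.append(new_sphere)
                    merged = True
                    break
            if merged:
                break
    return spheres

def overlaps(s1, s2):
    """Stand-in predicate (not Eq. (1)): merge when the spheres overlap."""
    return distance(s1[0], s2[0]) < s1[1] + s2[1]
```

With `overlaps` as the predicate, three unit spheres centered at (0, 0), (1, 0) and (10, 0) are reduced to two: the first two are merged and the third is left alone.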
10
SD-based load balancing scheme
  • Objective
  • To maximize the query parallelism
  • Definition 3 (Start distance)
  • Given a point Vi, its start-distance (SD) is
    defined as the distance between Vi and Vo, where
    Vo = (0, 0, ..., 0).

(Figure: a point Vi and its start-distance SD(Vi))
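
A minimal sketch of the start-distance computation, assuming the distance in Definition 3 is Euclidean (the slide only says "distance"):

```python
import math

def start_distance(v):
    """Distance from point v to the origin Vo = (0, 0, ..., 0)."""
    return math.sqrt(sum(x * x for x in v))
```

For example, `start_distance((3.0, 4.0))` evaluates to 5.0.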
11
Motivation
  • If a cluster sphere does not intersect with the
    affected slices in which a query sphere is
    contained, it will not intersect with the query
    sphere either.

(Figure: a cluster sphere and the three affected slices)
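
The pruning rule follows from the triangle inequality: every point inside a sphere with center c and radius r has a start-distance within [SD(c) - r, SD(c) + r]. The sketch below assumes equal-width slices over [0, max_sd], Euclidean distance, and data that all lies within max_sd; the function and parameter names are illustrative.

```python
import math

def start_distance(v):
    return math.sqrt(sum(x * x for x in v))

def affected_slices(center, radius, max_sd, num_slices):
    """Indices of the start-distance slices that a sphere can reach."""
    width = max_sd / num_slices
    sd = start_distance(center)
    lo = max(0.0, sd - radius)
    hi = min(max_sd, sd + radius)
    if lo > hi:                        # sphere lies entirely beyond the data range
        return []
    first = min(int(lo // width), num_slices - 1)
    last = min(int(hi // width), num_slices - 1)
    return list(range(first, last + 1))

def can_prune(cluster, query, max_sd, num_slices):
    """True if the cluster sphere shares no slice with the query sphere."""
    cluster_slices = set(affected_slices(*cluster, max_sd, num_slices))
    query_slices = set(affected_slices(*query, max_sd, num_slices))
    return cluster_slices.isdisjoint(query_slices)
```

The check is conservative: sharing a slice does not prove that two spheres intersect, but sharing none guarantees that they do not.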
12
The algorithm
  • Algorithm 2. Vector allocation in data nodes
  • Input: O, the vector set; a, the number of data nodes
  • Output: O(1 to a), the placed vectors in the data
    nodes
  • 1. The high-dimensional space is equally divided
    into a slices in terms of start-distance
  • 2. for each data node j do
  • 3. n/a vectors (O(j)) are randomly selected
    from each sub-range of start distance
    respectively
  • 4. O(j) is deployed in the j-th data node
  • 5. end for
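
One possible reading of Algorithm 2 is sketched below: the vectors are bucketed into slices by start-distance, and each slice is then shuffled and dealt out across the data nodes, so that every node holds a random share of every slice. This interpretation, the equal-width slices, the non-empty input, and the `allocate` name are all assumptions. The slide does not spell out the selection procedure.

```python
import math
import random

def start_distance(v):
    return math.sqrt(sum(x * x for x in v))

def allocate(vectors, num_nodes, seed=0):
    """Spread each start-distance slice evenly and randomly over the nodes."""
    rng = random.Random(seed)
    sds = [start_distance(v) for v in vectors]
    width = (max(sds) / num_nodes) or 1.0      # guard against an all-zero data set
    slices = [[] for _ in range(num_nodes)]
    for v, sd in zip(vectors, sds):
        idx = min(int(sd / width), num_nodes - 1)
        slices[idx].append(v)
    nodes = [[] for _ in range(num_nodes)]
    next_node = 0                               # running counter keeps node sizes balanced
    for bucket in slices:
        rng.shuffle(bucket)
        for v in bucket:
            nodes[next_node % num_nodes].append(v)
            next_node += 1
    return nodes
```

Because every node receives part of every slice, a query that affects only a few slices can still be served by all nodes at once, which matches the stated objective of maximizing query parallelism.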

13
Index-based vector set reduction
  • The indexes (i.e., iDistance) are deployed at the
    data node level in order to reduce the data
    transmission and CPU costs
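
To show how an iDistance-style index narrows the candidate set, here is a simplified sketch. It assumes the usual iDistance mapping (each vector is keyed by `i * c + dist(v, ref_i)`, where `ref_i` is its nearest reference point) and uses a sorted list with `bisect` in place of a B+-tree. The class name, the choice of reference points, and the constant `c` are illustrative, and `c` must exceed every partition radius.

```python
import bisect
import math

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class TinyIDistance:
    def __init__(self, vectors, refs, c):
        self.refs, self.c = refs, c
        self.max_r = [0.0] * len(refs)          # farthest member of each partition
        entries = []
        for v in vectors:
            i = min(range(len(refs)), key=lambda k: distance(v, refs[k]))
            dv = distance(v, refs[i])
            self.max_r[i] = max(self.max_r[i], dv)
            entries.append((i * c + dv, v))     # one-dimensional iDistance key
        entries.sort(key=lambda e: e[0])
        self.keys = [k for k, _ in entries]
        self.vals = [v for _, v in entries]

    def range_query(self, q, r):
        """Vectors within distance r of q, scanning only the relevant key ranges."""
        out = []
        for i, ref in enumerate(self.refs):
            dq = distance(q, ref)
            if dq - r > self.max_r[i]:          # query sphere misses this partition
                continue
            lo = i * self.c + max(0.0, dq - r)
            hi = i * self.c + min(self.max_r[i], dq + r)
            a = bisect.bisect_left(self.keys, lo)
            b = bisect.bisect_right(self.keys, hi)
            out.extend(v for v in self.vals[a:b] if distance(v, q) <= r)
        return out
```

Only the vectors whose keys fall in the scanned ranges are compared against the query, which is where the savings in CPU and transmission cost would come from.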

14
The MDSQ Algorithm
  • Input: m query requests
  • Output: m query results
  • 1. The dynamic query scheduling (DQS) of the m
    queries is conducted to get m' new queries
  • 2. The global vector set reduction is performed
    at the data node level in parallel
  • 3. The refinement process is conducted to get
    the m query results
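
Steps 2 and 3 can be sketched end to end as follows. The sketch assumes step 1 has already produced the clusters, each given as a merged sphere plus the original queries it covers, and it scans the data nodes sequentially with a linear filter instead of running in parallel over an index. The data layout and the `answer_clusters` name are hypothetical.

```python
import math

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def answer_clusters(clusters, data_nodes):
    """clusters: list of (merged_center, merged_radius, members), where members
    is a list of (query_id, center, radius). data_nodes: list of vector lists.
    Returns a dict mapping each query_id to its result vectors."""
    results = {}
    for merged_center, merged_radius, members in clusters:
        # Step 2: every data node reduces its vector set with the merged query.
        candidates = []
        for node in data_nodes:
            candidates.extend(v for v in node
                              if distance(v, merged_center) <= merged_radius)
        # Step 3: refine the shared candidates for each original query.
        for query_id, center, radius in members:
            results[query_id] = [v for v in candidates
                                 if distance(v, center) <= radius]
    return results
```

The refinement is only correct if each merged sphere encloses all of its member queries, which holds when the merged sphere is their MBS.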

15
Performance Evaluation
  • Experimental setup
  • Run in a fast local network
  • 100 user requests are randomly generated
  • Experiment data
  • Real-life dataset: 68,040 32-D data points from the UCI KDD
    Archive
  • Synthetic dataset: 5,000,000 100-D vector data

16
Effect of VSR
17
Effect of DQS
18
Effect of Dimensionality
19
Effect of m
20
Effect of a
21
Conclusion and Future Work
  • Conclusion
  • The MDSQ algorithm
  • - Cost-based dynamic query scheduling
  • - SD-based load balancing scheme
  • - Index-based vector set reduction
  • Future work
  • - The multi-query optimization of sub-space
    similarity queries over P2P networks

22
Thank you! Q&A