Title: Multiquery Optimization for Distributed Similarity Query Processing
1Multi-query Optimization for Distributed
Similarity Query Processing
- Yi Zhuang, Zhejiang University
- Qing Li, City University of Hong Kong
- Lei Chen, HKUST
2Introduction(I)
- High-dimensional data access
- Query-intensive similarity query
- Multi-query optimization: answer a batch of queries together by
combining the correlated queries
3Introduction(II)
- Given three queries Q1, Q2 and Q3
- Q1 ∩ Q2 is not null
The MDSQ Algorithm (Multi-query optimization for
Distributed Similarity Query Processing)
4Overview of the framework
- Query node level
- Queries submission
- Dynamic query scheduling
- Data node level
- Index-based vector set reduction
- Return the query results to the query node
5Enabling Techniques
- Dynamic query scheduling
- SD-based load balancing scheme
- Index-based vector set reduction
6Dynamic query scheduling(I)
- Preliminaries
- Definition 1 (Minimal Bounding Sphere, MBS).
Given two query spheres A and B, their
corresponding MBS, denoted MBS(A, B), is the sphere
that contains both A and B such that its volume is minimal.
- Definition 2 (Maximal Inner Tangent Sphere,
MITS). Given two spheres A and B, their
corresponding MITS, denoted MITS(A, B), is the largest
sphere contained in the intersection of the two spheres.
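The two definitions can be illustrated with a minimal Python sketch. It assumes Euclidean distance and represents a sphere by a center tuple and a radius; the function names `mbs` and `mits` are illustrative and do not come from the slides.

```python
import math

def mbs(center_a, radius_a, center_b, radius_b):
    """Return (center, radius) of the smallest sphere enclosing A and B."""
    d = math.dist(center_a, center_b)
    # If one sphere already contains the other, it is the MBS.
    if d + radius_b <= radius_a:
        return tuple(center_a), radius_a
    if d + radius_a <= radius_b:
        return tuple(center_b), radius_b
    # Otherwise the MBS spans the two farthest points on the line through both centers.
    radius = (d + radius_a + radius_b) / 2.0
    t = (radius - radius_a) / d  # fraction of the way from A's center to B's center
    center = tuple(a + t * (b - a) for a, b in zip(center_a, center_b))
    return center, radius

def mits(center_a, radius_a, center_b, radius_b):
    """Return (center, radius) of the largest sphere inside A ∩ B, or None if A and B do not overlap."""
    d = math.dist(center_a, center_b)
    # If one sphere contains the other, the intersection is the smaller sphere.
    if d + radius_b <= radius_a:
        return tuple(center_b), radius_b
    if d + radius_a <= radius_b:
        return tuple(center_a), radius_a
    if d >= radius_a + radius_b:
        return None  # disjoint or merely touching: no sphere of positive radius fits
    radius = (radius_a + radius_b - d) / 2.0
    t = (radius_a - radius) / d
    center = tuple(a + t * (b - a) for a, b in zip(center_a, center_b))
    return center, radius
```

For example, two unit spheres centered at (0, 0) and (4, 0) have an MBS centered at (2, 0) with radius 3, and no MITS because they do not overlap.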
7Dynamic query scheduling(II)
- Motivation
- Given two query spheres A and B, the volume
of the union of A and B should be larger
than half of the volume of their MBS, which is
formally denoted as
Vol(A ∪ B) > 1/2 · Vol(MBS(A, B))
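A small worked example may help to see how strict this requirement becomes as dimensionality grows. The sketch below uses the standard volume formula for an n-dimensional ball and looks at one easy case only: two equal spheres that touch externally, so the union is exactly two balls and the MBS has twice the radius. The function names are illustrative, not from the slides.

```python
import math

def ball_volume(radius, dim):
    """Volume of a dim-dimensional ball: pi^(dim/2) / Gamma(dim/2 + 1) * radius^dim."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim

def tangent_union_to_mbs_ratio(radius, dim):
    """Vol(A ∪ B) / Vol(MBS(A, B)) for two equal, externally tangent spheres."""
    union = 2 * ball_volume(radius, dim)   # the spheres share no volume
    bounding = ball_volume(2 * radius, dim)  # the MBS has radius 2 * radius
    return union / bounding

for dim in (2, 8, 32):
    print(dim, tangent_union_to_mbs_ratio(1.0, dim))
```

The ratio equals 2^(1 - dim), so it is exactly one half in 2-D and shrinks quickly in higher dimensions: touching spheres never pass the one-half test for dim ≥ 2, and only sufficiently overlapping or nested spheres can.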
8Dynamic query scheduling(III)
- Theorem 1. Given two query spheres A and B,
let their MBS be denoted as MBS(A, B). In order to get
Vol(A ∪ B) > 1/2 · Vol(MBS(A, B)),
the condition in Eq. (1) should be satisfied.
- Three cases: (a) contained, (b) intersected, (c) otherwise
9Dynamic query scheduling(IV)
Algorithm 1. The query scheduling algorithm
Input: m query spheres
Output: m' clustered query spheres
1. while (TRUE)
2.   for any two spheres do
3.     if Eq. (1) is satisfied then
4.       merge the two query spheres
5.       update the query sphere list
6.       m ← m - 1
7.     end if
8.   end for
9. end while /* the loop stops once no remaining pair satisfies Eq. (1); the value of m has been reduced to m' and m' < m */
10. return m' (updated m) clustered query spheres
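The merging loop of Algorithm 1 can be sketched in Python as follows. Two assumptions are made here and are not taken from the slides: a merged pair is replaced by its minimal bounding sphere, and the merge test is passed in as a function, because Eq. (1) is not reproduced in this text. The `overlaps` predicate is only a placeholder for demonstration.

```python
import math
from itertools import combinations

def mbs(a, b):
    """Smallest sphere enclosing spheres a and b, each given as (center, radius)."""
    (ca, ra), (cb, rb) = a, b
    d = math.dist(ca, cb)
    if d + rb <= ra:
        return a
    if d + ra <= rb:
        return b
    radius = (d + ra + rb) / 2.0
    t = (radius - ra) / d
    return tuple(x + t * (y - x) for x, y in zip(ca, cb)), radius

def schedule_queries(spheres, should_merge):
    """Repeatedly merge any pair accepted by should_merge until no pair qualifies."""
    spheres = list(spheres)
    merged_any = True
    while merged_any:
        merged_any = False
        for i, j in combinations(range(len(spheres)), 2):
            if should_merge(spheres[i], spheres[j]):
                merged = mbs(spheres[i], spheres[j])  # assumption: merged query = MBS
                spheres = [s for k, s in enumerate(spheres) if k not in (i, j)]
                spheres.append(merged)
                merged_any = True
                break  # the list changed, so restart the pair scan
    return spheres

def overlaps(a, b):
    """Placeholder merge test (NOT Eq. (1)): true when the two spheres intersect."""
    return math.dist(a[0], b[0]) < a[1] + b[1]

queries = [((0.0, 0.0), 1.0), ((1.0, 0.0), 1.0), ((10.0, 0.0), 1.0)]
clustered = schedule_queries(queries, overlaps)
```

Each merge removes one sphere from the list, so the loop always terminates. In the example, the first two queries are merged and the distant third query is left alone.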
10SD-based load balancing scheme
- Objective
- To maximize the query parallelism
- Definition 3(Start distance)
- Given a point Vi, its start-distance (SD) is
defined as the distance between Vi and Vo, where
Vo = (0, 0, ..., 0), i.e. SD(Vi) = d(Vi, Vo).
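As a minimal illustration, assuming the distance in Definition 3 is Euclidean (the slides say only "distance"), the start distance is just the norm of the vector. The function name is illustrative.

```python
import math

def start_distance(vector):
    """Distance from the vector to the origin Vo = (0, 0, ..., 0), assuming Euclidean distance."""
    return math.sqrt(sum(x * x for x in vector))

print(start_distance((3.0, 4.0)))
```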
11Motivation
- If a cluster sphere does not intersect the
affected slices in which a query sphere is
contained, it will not intersect the query
sphere.
[Figure: a cluster sphere and the three affected slices]
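The pruning rule follows from the triangle inequality: every point of a sphere with center c and radius r has a start distance within [SD(c) - r, SD(c) + r]. The sketch below turns that into a slice test. It assumes Euclidean distance and equal-width slices of a known width; all names and parameters are illustrative.

```python
import math

def start_distance(vector):
    return math.sqrt(sum(x * x for x in vector))

def affected_slices(center, radius, slice_width, num_slices):
    """Indexes of the start-distance slices that a sphere can touch."""
    sd = start_distance(center)
    low = max(0.0, sd - radius)
    high = sd + radius
    first = min(int(low // slice_width), num_slices - 1)
    last = min(int(high // slice_width), num_slices - 1)
    return range(first, last + 1)

def may_intersect(cluster, query, slice_width, num_slices):
    """False means the cluster sphere can be skipped; True means it still needs checking."""
    cluster_slices = set(affected_slices(*cluster, slice_width, num_slices))
    query_slices = set(affected_slices(*query, slice_width, num_slices))
    return bool(cluster_slices & query_slices)
```

The test is a filter, not an exact check: two spheres at the same distance from the origin but in different directions share slices without intersecting, so a True result still needs a real distance computation.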
12The algorithm
- Algorithm 2. Vector allocation in data nodes
- Input: O, the vector set; the a data nodes
- Output: O(1 to a), the vectors placed in the data nodes
- 1. The high-dimensional space is equally divided into a slices in terms of start-distance
- 2. for each data node j
- 3. the n/a vectors (O(j)) are randomly selected from each sub-range of start distance, respectively
- 4. O(j) is deployed in the j-th data node
- 5. end for
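One way to read Algorithm 2 is sketched below. The assumptions are mine, not the slides': Euclidean start distance, slices of equal width over [0, max SD], and random selection implemented as a shuffle followed by round-robin dealing, so every node receives a near-equal share of every slice. The function and parameter names are illustrative.

```python
import math
import random

def start_distance(vector):
    return math.sqrt(sum(x * x for x in vector))

def allocate_vectors(vectors, num_nodes, seed=0):
    """Return num_nodes lists; each holds a random share of every start-distance slice."""
    rng = random.Random(seed)
    distances = [start_distance(v) for v in vectors]
    # Equal-width slices over [0, max SD]; fall back to width 1.0 if all vectors are at the origin.
    width = max(distances) / num_nodes or 1.0
    slices = [[] for _ in range(num_nodes)]
    for v, sd in zip(vectors, distances):
        slices[min(int(sd // width), num_nodes - 1)].append(v)
    nodes = [[] for _ in range(num_nodes)]
    for slice_index, members in enumerate(slices):
        rng.shuffle(members)
        for k, v in enumerate(members):
            # Rotate the starting node per slice so leftovers do not pile up on node 0.
            nodes[(k + slice_index) % num_nodes].append(v)
    return nodes

vectors = [(float(i), 0.0) for i in range(1, 13)]
placed = allocate_vectors(vectors, num_nodes=3)
```

Because each node holds vectors from every slice, a query that touches only a few slices still finds work on all nodes, which is the query parallelism the scheme aims for.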
13Index-based vector set reduction
- Indexes (i.e., iDistance) are deployed at the
data node level to reduce data
transmission and CPU costs
14The MDSQ Algorithm
- Input: m query requests
- Output: m query results
- 1. The dynamic query scheduling (DQS) of the m queries is conducted to get m' new queries
- 2. The global vector set reduction is performed at the data node level in parallel
- 3. The refinement process is conducted to get the m query results
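The three steps can be tied together in a rough Python sketch. It is deliberately simplified and rests on several assumptions that are not in the slides: queries are range queries given as (center, radius), the scheduling step is passed in as a hypothetical `group_queries` function, the per-node filter is a linear scan rather than an iDistance lookup, and the nodes are visited sequentially rather than in parallel.

```python
import math

def one_group_per_query(queries):
    """Trivial stand-in for DQS: no merging, each query forms its own group."""
    return [(q, [i]) for i, q in enumerate(queries)]

def mdsq(queries, nodes, group_queries=one_group_per_query):
    """queries: list of (center, radius); nodes: list of vector lists, one per data node."""
    results = [[] for _ in queries]
    # Step 1: scheduling yields merged spheres plus the indexes of the queries they cover.
    for (center, radius), members in group_queries(queries):
        # Step 2: vector set reduction, keeping only vectors inside the merged sphere.
        candidates = [v for node in nodes for v in node
                      if math.dist(v, center) <= radius]
        # Step 3: refinement, checking each candidate against the original queries.
        for qi in members:
            q_center, q_radius = queries[qi]
            results[qi].extend(v for v in candidates
                               if math.dist(v, q_center) <= q_radius)
    return results

queries = [((0.0, 0.0), 1.0), ((5.0, 5.0), 1.0)]
nodes = [[(0.0, 0.5), (5.0, 5.0)], [(9.0, 9.0), (0.2, 0.0)]]
answers = mdsq(queries, nodes)
```

With a real scheduling function, several overlapping queries would share one candidate set, so each data node scans its vectors once per merged sphere instead of once per query.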
15Performance Evaluation
- Experimental setup
- Run in a fast local network
- 100 user requests are randomly generated
- Experiment data
- Real-life dataset: 68,040 32-D vectors from the UCI KDD Archive
- Synthetic dataset: 5,000,000 100-D vectors
16Effect of VSR
17Effect of DQS
18Effect of Dimensionality
19Effect of m
20Effect of a
21Conclusion and Future Work
- Conclusion
- The MDSQ algorithm
- Cost-based dynamic query scheduling
- SD-based load balancing scheme
- Index-based vector set reduction
- Future work
- Multi-query optimization of sub-space similarity queries over P2P networks
22Thank you! Q&A