Multiquery Optimization for Distributed Similarity Query Processing

Slides: 23

Provided by: yzhu2

Transcript and Presenter's Notes

1
Multi-query Optimization for Distributed
Similarity Query Processing

Yi Zhuang, Zhejiang University
Qing Li, City University of Hong Kong
Lei Chen, HKUST

2
Introduction(I)

High-dimensional data access
Query-intensive similarity search
Multi-query optimization: answer a set of queries
as one batch by combining correlated queries

3
Introduction(II)

Given three queries Q1, Q2 and Q3
Q1 ∩ Q2 ≠ ∅

The MDSQ Algorithm (Multi-query optimization for
Distributed Similarity Query Processing)

4
Overview of the framework

Query node level
  Query submission
  Dynamic query scheduling
Data node level
  Index-based vector set reduction
  Return of the query results to the query node

5
Enabling Techniques

Dynamic query scheduling
SD-based load balancing scheme
Index-based vector set reduction

6
Dynamic query scheduling(I)

Preliminaries
Definition 1 (Minimal Bounding Sphere, MBS).
Given two query spheres A and B, their MBS,
denoted MBS(A, B), is the sphere that encloses
both A and B and whose volume is minimal.
Definition 2 (Maximal Inner Tangent Sphere,
MITS). Given two spheres A and B, their MITS,
denoted MITS(A, B), is the largest sphere
contained in the intersection of A and B.
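As a minimal sketch of Definitions 1 and 2, the following Python computes both spheres for Euclidean balls given as (center, radius) pairs. The function names `mbs` and `mits` and the tuple representation are assumptions made for illustration, not taken from the slides; the closed forms follow from elementary geometry.

```python
import math


def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mbs(a, b):
    """Minimal bounding sphere of spheres a and b, each (center, radius)."""
    (ca, ra), (cb, rb) = a, b
    d = dist(ca, cb)
    # If one sphere already contains the other, the larger one is the MBS.
    if d + rb <= ra:
        return (list(ca), ra)
    if d + ra <= rb:
        return (list(cb), rb)
    r = (d + ra + rb) / 2.0
    # The center lies on the segment ca -> cb, at distance r - ra from ca.
    t = (r - ra) / d
    return ([x + t * (y - x) for x, y in zip(ca, cb)], r)


def mits(a, b):
    """Largest sphere inside the intersection of a and b, or None if disjoint."""
    (ca, ra), (cb, rb) = a, b
    d = dist(ca, cb)
    if d >= ra + rb:
        return None  # no overlap, or a single touching point
    # If one sphere contains the other, the smaller one is the MITS.
    if d + rb <= ra:
        return (list(cb), rb)
    if d + ra <= rb:
        return (list(ca), ra)
    r = (ra + rb - d) / 2.0
    # The center lies on the segment ca -> cb, at distance ra - r from ca.
    t = (ra - r) / d
    return ([x + t * (y - x) for x, y in zip(ca, cb)], r)
```

For two unit circles centered at (0, 0) and (1, 0), this gives an MBS of radius 1.5 and an MITS of radius 0.5, both centered at (0.5, 0).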
7
Dynamic query scheduling(II)

Motivation
Given two query spheres A and B, the volume
of the union of A and B should be larger
than half of the volume of their MBS, which is
formally denoted as

$\mathrm{Vol}(A \cup B) > \tfrac{1}{2}\,\mathrm{Vol}(\mathrm{MBS}(A,B))$

where $\mathrm{Vol}(\cdot)$ denotes volume.
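The volume condition above can be checked numerically. The sketch below estimates the union-to-MBS volume ratio by Monte Carlo sampling inside the MBS. This estimator is a stand-in chosen for illustration (the slides do not say how the volumes are evaluated), and `union_to_mbs_ratio`, `worth_merging`, the sample count, and the seed are all hypothetical.

```python
import math
import random


def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mbs(a, b):
    """Minimal bounding sphere of spheres a and b, each (center, radius)."""
    (ca, ra), (cb, rb) = a, b
    d = dist(ca, cb)
    if d + rb <= ra:
        return (list(ca), ra)
    if d + ra <= rb:
        return (list(cb), rb)
    r = (d + ra + rb) / 2.0
    t = (r - ra) / d
    return ([x + t * (y - x) for x, y in zip(ca, cb)], r)


def union_to_mbs_ratio(a, b, samples=20000, seed=0):
    """Estimate Vol(A u B) / Vol(MBS(A, B)) by uniform sampling in the MBS."""
    rng = random.Random(seed)
    center, radius = mbs(a, b)
    n = len(center)
    hits = 0
    for _ in range(samples):
        # Uniform point in an n-ball: Gaussian direction, radius ~ U^(1/n).
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in g))
        scale = radius * rng.random() ** (1.0 / n) / norm
        p = [c + x * scale for c, x in zip(center, g)]
        if dist(p, a[0]) <= a[1] or dist(p, b[0]) <= b[1]:
            hits += 1
    return hits / samples


def worth_merging(a, b, samples=20000, seed=0):
    """True if the estimated union volume exceeds half the MBS volume."""
    return union_to_mbs_ratio(a, b, samples, seed) > 0.5
```

Two heavily overlapping spheres pass the test, while two small spheres far apart fail it, because their MBS is mostly empty space.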
8
Dynamic query scheduling(III)

Theorem 1. Given two query spheres A and B,
let their MBS be denoted as MBS(A, B). For the
volume of A ∪ B to exceed half the volume of
MBS(A, B), the condition in Eq. (1) should be
satisfied.
Eq. (1) distinguishes three cases:
(a) contained
(b) intersected
(c) otherwise

9
Dynamic query scheduling(IV)
Algorithm 1. The query scheduling algorithm
Input: m query spheres
Output: m' clustered query spheres
1. while (TRUE)
2.   for any two spheres do
3.     if Eq. (1) is satisfied then
4.       merge the two query spheres
5.       update the query sphere list
6.       m ← m − 1
7.     end if
8.   end for
9. end while /* the value of m has been reduced to m' and m' < m */
10. return m' (updated m) clustered query spheres
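Algorithm 1 can be sketched in Python as follows. Two points are assumptions, not taken from the slides: merged queries are replaced by their minimal bounding sphere, and because the body of Eq. (1) is not reproduced in this transcript, the merge test is passed in as a predicate (`should_merge`), with a plain intersection test as a placeholder. The loop also stops once a full pass produces no merge, which the pseudocode leaves implicit.

```python
import math


def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mbs(a, b):
    """Minimal bounding sphere of spheres a and b, each (center, radius)."""
    (ca, ra), (cb, rb) = a, b
    d = dist(ca, cb)
    if d + rb <= ra:
        return (list(ca), ra)
    if d + ra <= rb:
        return (list(cb), rb)
    r = (d + ra + rb) / 2.0
    t = (r - ra) / d
    return ([x + t * (y - x) for x, y in zip(ca, cb)], r)


def spheres_intersect(a, b):
    """Placeholder merge test (NOT Eq. (1)): do the two spheres overlap?"""
    return dist(a[0], b[0]) <= a[1] + b[1]


def schedule_queries(spheres, should_merge=spheres_intersect):
    """Greedily merge query spheres until no pair satisfies should_merge."""
    clusters = list(spheres)
    merged = True
    while merged:  # stop after a full pass without any merge
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if should_merge(clusters[i], clusters[j]):
                    combined = mbs(clusters[i], clusters[j])
                    # Replace the pair by one sphere, i.e. m <- m - 1.
                    clusters = [c for k, c in enumerate(clusters)
                                if k not in (i, j)]
                    clusters.append(combined)
                    merged = True
                    break
            if merged:
                break  # restart the scan on the updated list
    return clusters
```

With three unit circles centered at (0, 0), (1, 0) and (10, 0), the first two are merged into one sphere of radius 1.5 and the third is left alone, so three queries become two.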
10
SD-based load balancing scheme

Objective
To maximize the query parallelism
Definition 3 (Start distance).
Given a point Vi, its start distance (SD) is
defined as the distance between Vi and Vo, where
Vo = (0, 0, ..., 0); that is, SD(Vi) = d(Vi, Vo).

11
Motivation

If a cluster sphere does not intersect any of the
affected slices that contain a query sphere, it
cannot intersect the query sphere either.

[Figure: a cluster sphere and the three affected slices]
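The pruning rule can be sketched as follows, assuming the slices are equal-width bands of start distance over [0, sd_max]. By the triangle inequality, every point of a sphere with center c and radius r has a start distance in [SD(c) - r, SD(c) + r], so two spheres whose bands share no slice cannot intersect. The names `affected_slices`, `may_intersect`, and `sd_max` are hypothetical.

```python
import math


def start_distance(v):
    """Distance from v to the origin (0, 0, ..., 0)."""
    return math.sqrt(sum(x * x for x in v))


def affected_slices(center, radius, sd_max, num_slices):
    """Indices of the start-distance slices that a sphere can touch."""
    width = sd_max / num_slices
    sd = start_distance(center)
    lo = max(0.0, sd - radius)
    hi = min(sd_max, sd + radius)
    if lo > hi:
        return range(0)  # sphere lies entirely beyond sd_max
    first = min(int(lo // width), num_slices - 1)
    last = min(int(hi // width), num_slices - 1)
    return range(first, last + 1)


def may_intersect(query, cluster, sd_max, num_slices):
    """Necessary (not sufficient) test: do the spheres share a slice?"""
    q = set(affected_slices(query[0], query[1], sd_max, num_slices))
    c = affected_slices(cluster[0], cluster[1], sd_max, num_slices)
    return any(s in q for s in c)
```

A `False` result safely rules the cluster out. A `True` result only means the cluster still has to be examined, since two spheres can share a slice while lying in different directions from the origin.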
12
The algorithm

Algorithm 2. Vector allocation in data nodes
Input: Ω, the vector set; the α data nodes
Output: Ω(1 to α), the placed vectors in data nodes
1. The high-dimensional space is equally divided
   into α slices in terms of start distance
2. for each data node j do
3.   n/α vectors (Ω(j)) are randomly selected
     from each sub-range of start distance
4.   Ω(j) is deployed in the j-th data node
5. end for
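One possible reading of Algorithm 2, sketched in Python: vectors are bucketed into equal-width start-distance slices, each bucket is shuffled, and its vectors are dealt round-robin to the data nodes, so every node holds a share of every slice. The round-robin dealing, the function name `allocate_vectors`, and the `seed` parameter are assumptions; the slides only state that the selection is random.

```python
import math
import random


def start_distance(v):
    """Distance from v to the origin (0, 0, ..., 0)."""
    return math.sqrt(sum(x * x for x in v))


def allocate_vectors(vectors, num_nodes, seed=0):
    """Spread the vectors of every start-distance slice across all nodes."""
    rng = random.Random(seed)
    sd_max = max(start_distance(v) for v in vectors)
    width = sd_max / num_nodes
    if width == 0.0:
        width = 1.0  # all vectors sit at the origin; use a single slice

    # Step 1: bucket the vectors into equal-width start-distance slices.
    slices = [[] for _ in range(num_nodes)]
    for v in vectors:
        idx = min(int(start_distance(v) // width), num_nodes - 1)
        slices[idx].append(v)

    # Steps 2-5: deal each shuffled slice round-robin to the data nodes.
    nodes = [[] for _ in range(num_nodes)]
    offset = 0
    for bucket in slices:
        rng.shuffle(bucket)
        for k, v in enumerate(bucket):
            nodes[(offset + k) % num_nodes].append(v)
        offset += len(bucket)  # keeps overall node sizes balanced
    return nodes
```

Because each node receives part of every slice, a query that touches only a few slices still has work for all data nodes, which matches the stated objective of maximizing query parallelism.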

13
Index-based vector set reduction

Indexes (i.e., iDistance) are deployed at the
data node level to reduce data transmission
and CPU costs
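For context, a minimal sketch of the iDistance idea: each point is assigned to its nearest reference point O_i and indexed by the one-dimensional key i * c + dist(p, O_i), so a range query scans only one key interval per partition before refining with exact distances. A sorted list with `bisect` stands in here for the B+-tree, and the class name, the choice of reference points, the stride `c`, and the tolerance `eps` are assumptions for illustration, not details from the slides.

```python
import bisect
import math


def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


class IDistanceIndex:
    """Toy iDistance index over a fixed set of points."""

    def __init__(self, points, refs):
        self.refs = list(refs)
        self.max_r = [0.0] * len(self.refs)  # radius of each partition
        assigned = []
        for p in points:
            # Assign p to the partition of its nearest reference point.
            i = min(range(len(self.refs)), key=lambda k: dist(p, self.refs[k]))
            d = dist(p, self.refs[i])
            self.max_r[i] = max(self.max_r[i], d)
            assigned.append((i, d, p))
        # Stride larger than any partition radius keeps key ranges disjoint.
        self.c = max(self.max_r) + 1.0
        self.entries = sorted((i * self.c + d, p) for i, d, p in assigned)
        self.keys = [k for k, _ in self.entries]

    def range_query(self, q, r, eps=1e-9):
        """Return all indexed points within distance r of q."""
        out = []
        for i, ref in enumerate(self.refs):
            dq = dist(q, ref)
            if dq - r > self.max_r[i]:
                continue  # query sphere misses this partition entirely
            lo = i * self.c + max(0.0, dq - r) - eps
            hi = i * self.c + min(self.max_r[i], dq + r) + eps
            start = bisect.bisect_left(self.keys, lo)
            stop = bisect.bisect_right(self.keys, hi)
            # Refinement: keep only candidates that truly lie in the sphere.
            for _, p in self.entries[start:stop]:
                if dist(p, q) <= r:
                    out.append(p)
        return out
```

Only the candidates inside the scanned key intervals are compared against the query, which is where the savings in CPU and transmission cost come from.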

14
The MDSQ Algorithm

Input: m query requests
Output: m query results
1. Dynamic query scheduling (DQS) is applied to
   the m queries to obtain m' new queries
2. Global vector set reduction is performed
   at the data node level in parallel
3. Refinement is conducted to get
   the m query results

15
Performance Evaluation