Progressive Computation of Constrained Subspace Skyline Queries

1
Progressive Computation of Constrained Subspace
Skyline Queries

Evangelos Dellis (1), Akrivi Vlachou (1),
Ilya Vladimirskiy (1), Bernhard Seeger (1),
Yannis Theodoridis (2)
(1) Department of Computer Science, University of
Marburg, Germany
(2) Department of Computer Science, University of
Piraeus, Greece

2
Overview

Introduction
Motivation - Related Work
Basic STA
Improved Pruning
Indexing using Low-dimensional R-trees
Experimental Evaluation
Conclusions and Future Work

4
Finding A Hotel Close to the Beach

Which one is better?
i or h? (i, because its price and distance
dominate those of h)
i or k?

5
Skyline Queries

Retrieve points not dominated by any other point
A point p dominates another point q if p is as
good as or better than q in all dimensions and
strictly better in at least one dimension.
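
The dominance definition above can be sketched in a few lines. This is a minimal, quadratic-time illustration, assuming smaller values are better in every dimension; the hotel coordinates and the function names are hypothetical and only echo the labels h, i and k from the hotel example.

```python
def dominates(p, q):
    """True if p is at least as good as q everywhere and strictly better once.

    Assumption: smaller values are better in every dimension.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach in km).
hotels = {"h": (90, 4.0), "i": (70, 2.0), "k": (120, 0.5)}
result = skyline(list(hotels.values()))
```

With these made-up values, i dominates h, while i and k are incomparable (i is cheaper, k is closer), so both i and k remain in the skyline.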

6
Skyline of Manhattan

Which buildings can we see?
Higher or nearer
(a building dominates another building if it
is higher, closer to the river, and has the same
x position)

7
SQL Extension

SQL syntax

Examples

a) Find a hotel that is cheap and close to the
beach.
b) Find salespersons who were very successful in
1999 and have a low salary.

9
Motivation

Constrained Skyline (car database)
A user may only be interested in records within
the price range from 3 thousand to 7 thousand
euros and with mileage reading between 20K and
100K.

The traditional skyline (dashed line) fails to
return interesting points.
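
A small sketch of why the traditional skyline can miss the interesting points. The car values and helper names below are hypothetical; the constraint ranges are the ones from the slide (price 3 to 7 thousand euros, mileage 20K to 100K), and smaller is assumed to be better in both attributes.

```python
def dominates(p, q):
    """Assumption: smaller values are better in every dimension."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]


def in_region(p, low, high):
    return all(l <= x <= h for x, l, h in zip(p, low, high))


def constrained_skyline(points, low, high):
    """Skyline computed only among the points inside the constraint region."""
    return skyline([p for p in points if in_region(p, low, high)])


# Hypothetical cars as (price in thousand euros, mileage in thousand km).
cars = [(2.0, 10), (4.0, 60), (5.0, 30), (6.5, 90), (8.0, 15)]
low, high = (3.0, 20), (7.0, 100)

traditional = skyline(cars)
filtered_traditional = [p for p in traditional if in_region(p, low, high)]
constrained = constrained_skyline(cars, low, high)
```

Here the only traditional skyline point lies outside the requested ranges, so filtering it afterwards returns nothing, whereas the constrained skyline returns two cars.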

10
Motivation (continued)

Subspace Skyline
A car database could contain many other
attributes of the cars: horsepower, age, fuel
consumption, etc.
A customer who is sensitive to price and mileage
(a 2-dimensional subspace) would like to pose a
skyline query on those attributes rather than on
the whole data space.
While the dimensionality of the data space might
be rather high, skyline queries generally refer
to a low-dimensional subspace.
Constrained subspace skyline queries generalize
all meaningful skyline queries over a given
dataset.

11
Related Work

SKYCUBE [VLDB 2005, SIGMOD 2006]
The Skyline Cube (SKYCUBE) consists of the
skylines in all possible ($2^d - 1$) non-empty
subspaces.
Drawback: it is not possible to pre-calculate the
points of the full-space skyline and their
duplicates, since the result depends on the given
constraints (the SKYCUBE is static).
SUBSKY [ICDE 2006]
Transforms the multi-dimensional data into
one-dimensional values, which permits indexing
the dataset with a B-tree.
Drawbacks:
It is unable to answer constrained subspace
skyline queries, as all points have to be
transformed in a pre-processing step.
It does not deliver the skyline points
progressively.
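
To make the size of the SKYCUBE concrete, the following naive sketch enumerates every non-empty subspace and computes its skyline by brute force. It is only an illustration of the $2^d - 1$ count and is not the algorithm of the cited SKYCUBE papers; the sample points and function names are hypothetical.

```python
from itertools import combinations


def dominates_on(p, q, dims):
    """Dominance restricted to the dimensions in dims (smaller is better)."""
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


def subspace_skyline(points, dims):
    return [p for p in points if not any(dominates_on(q, p, dims) for q in points)]


def skycube(points):
    """Map every non-empty subspace (tuple of dimension indices) to its skyline."""
    d = len(points[0])
    return {
        dims: subspace_skyline(points, dims)
        for k in range(1, d + 1)
        for dims in combinations(range(d), k)
    }


points = [(1, 4, 3), (2, 2, 2), (3, 1, 4)]
cube = skycube(points)
```

For d = 3 the cube holds 7 skylines. Because each of them is computed over the whole dataset, none of them can be reused directly once a query restricts the data to a constraint region.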

12
Related Work (Continued)

BBS [SIGMOD 2003, TODS 2005]
All points are indexed in an R-tree.
mindist(MBR): the L1 distance between the MBR's
lower-left corner and the origin (as in
nearest-neighbor search).
Keep a heap of index entries and objects, ordered
by mindist.
BBS is still the most efficient method for
(constrained subspace) skyline retrieval!
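
The role of the mindist ordering can be illustrated without an R-tree. The sketch below is a simplification, not BBS itself: the heap holds only data points (real BBS also keeps R-tree entries in the heap and prunes dominated MBRs). It assumes non-negative coordinates with smaller values being better; the function names are hypothetical.

```python
import heapq


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def progressive_skyline(points):
    """Yield skyline points in ascending L1 distance from the origin.

    A dominating point always has a strictly smaller coordinate sum, so a
    popped point only needs to be checked against the skyline found so far.
    """
    heap = [(sum(p), p) for p in points]  # mindist of a point = its L1 norm
    heapq.heapify(heap)
    found = []
    while heap:
        _, p = heapq.heappop(heap)
        if not any(dominates(s, p) for s in found):
            found.append(p)
            yield p  # reported immediately, i.e. progressively


result = list(progressive_skyline([(3, 3), (1, 5), (2, 2), (5, 1), (4, 4)]))
```

Each yielded point is final at the moment it is produced, which is what makes this style of processing progressive.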

13
Related Work (Continued)

Shortcomings of BBS
Maintaining a high-dimensional index to support
constrained skyline queries of arbitrary
dimensionality is not practical.
It has been shown that the performance of such
high-dimensional indexes deteriorates as the
number of dimensions increases (Curse of
Dimensionality).
The performance of low-dimensional constrained
skyline queries degrades when the dimensionality
of the indexed space is high while that of the
query space is low (Random Grouping Effect).
Only low-dimensional indexes, e.g. R-trees, seem
to perform well in practice, and for that reason
they have found their place in commercial database
management systems (DBMS).

14
Our Approach

We vertically partition the data space into
several low-dimensional subspaces and index each
of these subspaces with an R-tree.
A constrained skyline query is then split into
several sub-queries, each of which is processed
on the corresponding index using incremental NN
search.

TA-INDEX [DaWaK 2005]: an algorithm for
vertically partitioned nearest neighbor queries.

15
Contributions

We present a threshold-based skyline algorithm
(called STA), which exploits multiple indexes.
We propose different pruning strategies to
identify dominated regions and to discard
irrelevant sub-trees of the indexes.
A workload-adaptive strategy for determining the
number of indexes and the assignment of
dimensions to the indexes is presented.

17
Problem Definition

Constrained Subspace Skyline Queries
A point $p \in D$ is said to dominate another
point $q \in D$ on subspace $S'$ if
on every dimension $d_i \in S'$, $p_i \le q_i$, and
on at least one dimension $d_j \in S'$, $p_j < q_j$.

For a point $p \in D_c$ in the dimension set $S'$:
the dominance region contains the points that are
dominated by p.
the anti-dominance region is the set of points
that dominate p.
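
The definitions above translate almost literally into code. The sketch below is a brute-force reading of them, assuming that $D_c$ is the subset of D satisfying per-dimension range constraints and that smaller values are better; the helper names, the `ranges` dictionary format and the sample cars are all hypothetical.

```python
def dominates_on(p, q, dims):
    """p dominates q on the subspace given by the dimension indices in dims."""
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


def constrained(points, ranges):
    """D_c: the points satisfying every constraint {dimension: (low, high)}."""
    return [p for p in points
            if all(lo <= p[d] <= hi for d, (lo, hi) in ranges.items())]


def dominance_region(p, data, dims):
    """Points of data that p dominates on dims."""
    return [q for q in data if dominates_on(p, q, dims)]


def anti_dominance_region(p, data, dims):
    """Points of data that dominate p on dims."""
    return [q for q in data if dominates_on(q, p, dims)]


def constrained_subspace_skyline(points, dims, ranges):
    """Points of D_c whose anti-dominance region on dims is empty."""
    dc = constrained(points, ranges)
    return [p for p in dc if not anti_dominance_region(p, dc, dims)]


# Hypothetical cars as (price in thousand euros, mileage in thousand km, age).
cars = [(4.0, 60, 9), (5.0, 30, 2), (6.5, 90, 1), (2.0, 10, 5), (6.0, 40, 3)]
dims = (0, 1)                             # query subspace: price and mileage
ranges = {0: (3.0, 7.0), 1: (20, 100)}    # constraint region
result = constrained_subspace_skyline(cars, dims, ranges)
```

The third attribute (age) is ignored by the query, and the car priced at 2.0 is excluded by the constraints before any dominance test takes place.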

18
One-point Pruning

Observation: a point p is a skyline point in $S'$
if and only if there exists no point q that
belongs to the anti-dominance area of p for all
dimension sets $S'_i$ ($1 \le i \le n$).
Pruning with the nearest neighbor: we need to
prune objects that are not part of the skyline.
The nearest neighbor is a good pruning point:
because it is a member of the skyline, there is
no point dominating it.
among all the skyline points it is one whose
dominance region has a large volume, and hence it
is expected to prune a large percentage of the
data points.
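
A minimal sketch of one-point pruning in a single space, assuming non-negative coordinates, Manhattan distance to the origin and smaller-is-better semantics. The sample points and function names are hypothetical.

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def nearest_neighbor(points):
    """Point with the smallest Manhattan (L1) distance to the origin."""
    return min(points, key=sum)


def prune_with_nn(points):
    """Discard every point lying in the dominance region of the nearest neighbor."""
    nn = nearest_neighbor(points)
    survivors = [p for p in points if not dominates(nn, p)]
    return nn, survivors


points = [(2, 2), (1, 6), (6, 1), (3, 5), (5, 3), (7, 7)]
nn, survivors = prune_with_nn(points)
```

The nearest neighbor is guaranteed to be a skyline point because any point dominating it would have a strictly smaller coordinate sum. In this example it alone prunes half of the data.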

19
STA: A Threshold-based Skyline Algorithm

Our algorithm works in two steps:
Filter step
All retrieved points are organized in a priority
queue (heap) based on their Manhattan distance
according to the dimension set $S'$.
We use the Manhattan distance of the last
reported point of $S'_i$ as a threshold to speed up
the filtering phase.
Refinement step (domination test)
The refinement step begins when the first
constrained nearest neighbor based on $S'$ is
returned by the filter step. This point is
guaranteed to be a skyline point.
In each subsequent iteration, when another
candidate is found, the refinement step needs to
determine whether this candidate is a skyline
point or not.
The dominance test is performed in a way similar
to traditional window queries, using a main-memory
R-tree whose dimensionality is equal to the query
dimensionality.
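
The two steps can be sketched as follows. This is an illustrative reconstruction from the bullets above, not the authors' implementation, and it makes several assumptions: coordinates are non-negative and smaller is better; each R-tree with incremental NN search is replaced by a pre-sorted list of partial distances; the threshold is taken as the sum of the last partial distance reported by each index; indexes are visited round robin; and the main-memory R-tree of the refinement step is replaced by a linear scan. All names are hypothetical.

```python
import heapq


def dominates_on(p, q, dims):
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


def sta_skyline(points, subspaces):
    """Yield the skyline on the union of `subspaces` in ascending L1 order."""
    query_dims = [d for dims in subspaces for d in dims]
    # Stand-in for incremental NN search on one R-tree per subspace:
    # (partial L1 distance, point id) pairs in ascending order.
    streams = [sorted((sum(p[d] for d in dims), pid)
                      for pid, p in enumerate(points))
               for dims in subspaces]
    cursor = [0] * len(streams)
    last = [0] * len(streams)  # partial distance last reported by each index
    seen, heap, skyline = set(), [], []
    turn = 0
    while True:
        all_seen = len(seen) == len(points)
        # An unseen point is at least as far as the sum of the last partials.
        threshold = float("inf") if all_seen else sum(last)
        # Refinement step: candidates at or below the threshold are final.
        while heap and heap[0][0] <= threshold:
            _, pid = heapq.heappop(heap)
            cand = points[pid]
            if not any(dominates_on(s, cand, query_dims) for s in skyline):
                skyline.append(cand)
                yield cand
        if all_seen:
            return
        # Filter step: fetch the next local nearest neighbor (round robin).
        i = turn % len(streams)
        turn += 1
        dist, pid = streams[i][cursor[i]]
        cursor[i] += 1
        last[i] = dist
        if pid not in seen:
            seen.add(pid)
            full = sum(points[pid][d] for d in query_dims)
            heapq.heappush(heap, (full, pid))


pts = [(1, 9, 2, 8), (2, 2, 2, 2), (9, 1, 8, 1),
       (3, 3, 3, 3), (5, 5, 1, 1), (1, 1, 9, 9)]
result = list(sta_skyline(pts, [(0, 1), (2, 3)]))
```

Because candidates leave the heap in ascending Manhattan distance, a candidate can only be dominated by a point that has already been reported, so each yielded point is a confirmed skyline point.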

20
Index Scheduling

Round Robin strategy
Inefficient
We are interested in more advanced strategies
that result in a fast increase of the threshold.
We choose the index that will increase the
partial distance the most, as this is the most
beneficial for our threshold.
Strategies for index scheduling for nearest
neighbor search on a vertically partitioned data
set have been studied in [DaWaK 2005].
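
A toy comparison of the two scheduling ideas. The sketch assumes that the next partial distance of each index can be peeked at before the access is made (in a real R-tree only a lower bound would be available) and that the threshold is the sum of the last reported partial distances. The policies, names and numbers are hypothetical and are not the strategies of the cited DaWaK 2005 paper.

```python
def round_robin(streams, cursor, last, step):
    """Visit the indexes in turn, skipping exhausted ones."""
    for k in range(len(streams)):
        i = (step + k) % len(streams)
        if cursor[i] < len(streams[i]):
            return i
    return None


def largest_gain(streams, cursor, last, step):
    """Pick the index whose next entry raises its partial distance the most."""
    best, best_gain = None, -1
    for i, stream in enumerate(streams):
        if cursor[i] < len(stream):
            gain = stream[cursor[i]] - last[i]
            if gain > best_gain:
                best, best_gain = i, gain
    return best


def accesses_until(streams, target, choose):
    """Count index accesses until the threshold sum(last) reaches target."""
    cursor, last, step = [0] * len(streams), [0] * len(streams), 0
    while sum(last) < target:
        i = choose(streams, cursor, last, step)
        if i is None:
            break
        last[i] = streams[i][cursor[i]]
        cursor[i] += 1
        step += 1
    return step


# Ascending partial distances reported by two hypothetical indexes.
streams = [[1, 2, 3, 4], [5, 9, 12, 13]]
rr = accesses_until(streams, 13, round_robin)
greedy = accesses_until(streams, 13, largest_gain)
```

In this example the greedy choice reaches the target threshold with fewer accesses than round robin, because it keeps advancing the index whose distances grow faster.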

22
Improved Pruning

Motivating example
Non-uniform distributions → points form clusters
Need: pruning using multiple points

Simultaneous pruning
We are not able to prune simultaneously in both
subspaces using the same point.

23
Multiple-point Pruning

Observation: when points lying in the dominance
region of a point are not discarded in at least
one subspace, we are able, under certain
conditions, to discard points in all remaining
subspaces while guaranteeing no false
dismissals.
We use the points that are retrieved as local
constrained nearest neighbors from one index for
pruning in all other indexes.

Example: a 4-dimensional data space is divided
into two 2-dimensional subspaces. When the point
p1 is retrieved from subspace S1, the dominance
area of p1 in subspace S2 is used for pruning.

24
Avoiding False Hits

Unfortunately, by following this strategy some
skyline points are falsely discarded.
Case 1: Let the point q in the projection S2
collapse on the point q1. The point p is not a
skyline point in S, since it is dominated by q in
all dimension sets of S.
Case 2: On the other hand, if the point q in the
projection S2 collapses on the point q2, then
point p may be discarded falsely, since it is a
potential skyline point.

Solution: To discard points from the dominance
area of p in S2, the point p and a point qi must
be dominated by the projection of the same point
in S2 and S1 respectively. This condition must
hold for each point qi that belongs to the
discarded area of S1.
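
The two cases can be checked on concrete numbers. The 4-dimensional coordinates below are hypothetical and only illustrate why dominance in one projection is not enough; smaller values are assumed to be better.

```python
def dominates_on(p, q, dims):
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


S1, S2 = (0, 1), (2, 3)   # two 2-dimensional subspaces of a 4-d space
S = S1 + S2               # the full query space

p = (5, 5, 5, 5)
q1 = (3, 3, 4, 4)         # Case 1: q also dominates p in the projection S2
q2 = (3, 3, 8, 8)         # Case 2: q does not dominate p in the projection S2

case1 = (dominates_on(q1, p, S1), dominates_on(q1, p, S2), dominates_on(q1, p, S))
case2 = (dominates_on(q2, p, S1), dominates_on(q2, p, S2), dominates_on(q2, p, S))
```

In Case 1 the point p is dominated in both projections and therefore in S. In Case 2 it lies in the dominance area of q only in S1, so discarding it on that basis alone would be a false dismissal.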

26
Random Grouping Effect

Random Grouping Effect: since not all dimensions
are used for splitting the axes during the
creation of a leaf node, when a query that
requires projection is posed to the index, the
index performs like a random low-dimensional
index, i.e. an index that groups the points into
leaf nodes in a mostly random manner.
Example: consider a 10-d data space and assume
that we are interested in retrieving the skyline
of any 2-d subspace.
If only two dimensions are used for splitting,
then the probability that the queried dimensions
are the ones used for splitting is very small.
Thus, the query performance is similar to the
performance of a 2-d index, where the data points
were grouped together randomly.

27
Number of Indexes

If every leaf node is split at least once in
each dimension, we need a total of at least
$2^d$ leaf nodes.
Well-performing index: every leaf node is
split by each dimension once ($L \ge 2^d$).
(This defines a maximum dimensionality for a
low-dimensional index.)
Example: 32-d Color dataset, 68,040 records.
Our formula suggests → 2 indexes
In this way we index high-dimensional datasets
more effectively, by avoiding performance
degradation due to the random grouping effect.
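
The bound on the leaf count gives a simple rule of thumb, sketched below. Reading the bound as "the largest d with $2^d \le L$" and dividing the full dimensionality by it is an interpretation of the bullets above; how the authors obtain L is not stated in this transcript, so the inputs used here are assumptions.

```python
def max_index_dimensionality(leaf_nodes):
    """Largest d such that 2**d <= leaf_nodes (leaf_nodes must be >= 1)."""
    return leaf_nodes.bit_length() - 1


def number_of_indexes(total_dims, leaf_nodes):
    """Indexes needed so that no index exceeds the maximum dimensionality."""
    d_max = max(1, max_index_dimensionality(leaf_nodes))
    return -(-total_dims // d_max)  # ceiling division


# Assumption: the record count is used as a crude upper bound on the number
# of leaf nodes of the 32-d Color dataset.
d_max_color = max_index_dimensionality(68040)
indexes_color = number_of_indexes(32, 68040)

# Hypothetical smaller tree with 1,000 leaf nodes.
indexes_small = number_of_indexes(32, 1000)
```

Under the stated assumption the rule yields a maximum of 16 dimensions per index and hence 2 indexes for the 32-d dataset, which matches the number on the slide; with fewer leaf nodes more indexes are needed.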

28
Dimension Assignment Algorithm

Number of Distinct Values: a quality measure of a
subspace $S_i$.
It accounts for points whose projections coincide
in one low-dimensional point, so that a point is
dominated by some duplicate point in the
query-dimensional space.
DAA: a greedy algorithm to distribute the
attributes over the n indexes. Its goals:
restrict the random grouping effect
maximize the number of distinct values
29
Workload-adaptive Extension

User preferences are correlated
use multiple indexes, which are built on the most
preferred subspaces
Simple, but very powerful extension
associate some probability with each subspace
(the frequency with which it is queried)
weight the cost estimation of each dimension set
by its probability.
This extension allows us to examine the
performance of our algorithm under a workload,
which is closer to real applications, instead of
picking random subspaces.
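
Weighting a cost estimate by query probability amounts to an expected value. The sketch below assumes a workload given as query frequencies per subspace and uses a deliberately crude stand-in cost (the number of indexes a query has to touch); the actual cost estimation of the presentation is not described in this transcript, and all names and numbers are hypothetical.

```python
def expected_cost(workload, cost):
    """Probability-weighted cost over a workload {subspace: frequency}."""
    total = sum(workload.values())
    return sum(freq / total * cost(sub) for sub, freq in workload.items())


def indexes_touched(assignment):
    """Stand-in cost model: how many indexes a query subspace overlaps."""
    def cost(subspace):
        return sum(1 for group in assignment if set(group) & set(subspace))
    return cost


# Hypothetical workload: subspace (0, 1) is queried far more often.
workload = {(0, 1): 80, (2, 3): 15, (0, 3): 5}

aligned = expected_cost(workload, indexes_touched([[0, 1], [2, 3]]))
misaligned = expected_cost(workload, indexes_touched([[0, 3], [1, 2]]))
```

An assignment that keeps the frequently queried dimensions in the same index gets the lower expected cost, which is the intuition behind building indexes on the most preferred subspaces.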

31
Experimental Evaluation

Datasets
Three datasets from real-world applications:
The NBA dataset contains 17,000 13-dimensional
points, where each point corresponds to the
statistics of a player in 13 categories.
The Color moments dataset contains 9-dimensional
features of 68,040 photo images extracted from
the Corel Draw database.
The Color histogram dataset consists of
32-dimensional features, each representing the
histogram of an image.
Additionally, we generated 10-dimensional uniform
datasets with cardinalities of 10,000, 50,000 and
100,000 data points.
Implementation Details
We compare our algorithm against the current
state-of-the-art method, BBS.
We set the page size for each R-tree to 4K, and
each dimension was represented by a real number.
Measurement: the number of disk I/Os (page
accesses)

32
Examination of Constrained Subspace Skylines

Effect of Constrained Region
Varying the constrained region from 50% to 100% of
each axis.
We examine subspaces with dimensionality
$d_{sub} = 3$.
Uniform dataset: full-space dimensionality of
10-d and a cardinality of 50,000 points.

Observation: the performance of our algorithm is
not affected significantly by the size of the
constrained region.

33
Examination of Constrained Subspace Skylines

Effect of Subspace Dimensionality
We vary the query subspace dimensionality from 2
to 4.
We keep the constrained region constant
(60% of the values of each requested axis).

a) 10-d Uniform Dataset, 50k
b) 9-d Color Dataset, 68k

These results demonstrate that the STA algorithm
leads to substantially fewer page accesses than
BBS.

34
Scalability with the Dataset Cardinality

We use uniform datasets (dimensionality of 10-d).
We vary the cardinality between 10,000 and
100,000 points.
We set the constrained region to cover 60% of
each axis.
In addition, we request the skyline of
3-dimensional subspaces.

The proposed method scales better with
cardinality than BBS.

35
Scalability with Full-space Dimensionality

Varying the Full-space Dimensionality
We set the constrained region to cover 60% of
each axis. In addition, we request the skyline of
3-dimensional subspaces.
Uniform datasets with dimensionality of 10, 20
and 30-d.
Real datasets with dimensionality of 9, 13 and
32-d.

a) Uniform Datasets
b) Real Datasets

In both cases our algorithm consistently
outperforms BBS in this experiment.

36
Adaptation to the Query Workload

Query workload using the 80-20 law:
20% of the attributes contribute to 80% of the
queries
32-dimensional Color histogram dataset, which
consists of 68,040 records

a) I/O cost
b) CPU cost

Scalability using the 80-20 law
Subspace skyline with $d_{sub} = 3$
Constrained region: 60% of each axis

38
Conclusions and Future Work

We addressed the problem of Constrained Subspace
Skyline Queries and presented a threshold-based
skyline algorithm that exploits multiple indexes.
We proposed different pruning strategies to
identify dominated regions and to discard
irrelevant sub-trees of the indexes.
We presented a workload-adaptive strategy for
determining the number of indexes and the
assignment of dimensions to the indexes.
An extensive performance evaluation shows the
superiority of our proposed technique over
related work.

Future work may include:
Examination of STA using external queues
Development of a cost model for Constrained
Subspace Skyline Queries

39
References

SKYCUBE [VLDB 2005, SIGMOD 2006]
Yuan, Y., Lin, X., Liu, Q., Wang, W., Yu, J.,
Zhang, Q. Efficient Computation of the Skyline
Cube. Very Large Data Bases Conference (VLDB),
Trondheim, Norway, August 30 - September 2, 2005.
Pei, J., Jin, W., Ester, M., Tao, Y. Catching the
Best Views of Skyline: A Semantic Approach Based
on Decisive Subspaces. Very Large Data Bases
Conference (VLDB), Trondheim, Norway, August 30 -
September 2, 2005.
Xia, T., Zhang, D. Refreshing the Sky: The
Compressed Skycube with Efficient Support for
Frequent Updates. To appear in Proceedings of the
2006 ACM SIGMOD International Conference on
Management of Data (SIGMOD), Chicago, IL, USA,
2006.
SUBSKY [ICDE 2006]
Tao, Y., Xiao, X., Pei, J. SUBSKY: Efficient
Computation of Skylines in Subspaces. IEEE
International Conference on Data Engineering
(ICDE), Atlanta, Georgia, USA, April 3-7, 2006.
BBS [SIGMOD 2003, TODS 2005]
Papadias, D., Tao, Y., Fu, G., Seeger, B. An
Optimal and Progressive Algorithm for Skyline
Queries. ACM Conference on the Management of Data
(SIGMOD), San Diego, CA, June 9-12, 2003.
Papadias, D., Tao, Y., Fu, G., Seeger, B.
Progressive Skyline Computation in Database
Systems. ACM Transactions on Database Systems,
30(1): 41-82, 2005.
TA-INDEX [DaWaK 2005]
Dellis, E., Seeger, B., Vlachou, A. Nearest
Neighbor Search on Vertically Partitioned
High-Dimensional Data. In Proceedings of the 7th
International Conference on Data Warehousing and
Knowledge Discovery (DaWaK), Copenhagen, Denmark,
2005.