Title: Skyline Queries
1Skyline Queries
- HKU CSIS DB Seminar
- 9 April 2003
- Speaker Eric Lo
2Skyline
- A new operator in database systems
- Filters out a set of interesting points from a
potential large set of data points - A data point is interesting if it is not
dominated by any other point
3Example
- Find some good places for us to hold the next DB
Seminar - Dataset (Table Homes)
Home Distance from HKU Area (m2)
Kevin 1 km 10
Ben 9 km 100
Felix 5 km 2
K.K Loo 8 km 250
- Good ? Close to HKU (Min.)
- Good ? Max. Area (Max.)
- Return those homes that are not worse than any
others in ALL DIMENSIONS
4Outline
- Introduction to Skyline Queries
- Skyline Operator in SQL
- Implementation Algorithms
- Progressive Algorithms
- Variations of Skyline Queries
- Experimental Result
- Conclusion
5The Skyline OperatorICDE 2001S.Borzonyi,
D.Kossmann, K.Stocker
- Define the skyline operator in databases
- Extension of SQL for skyline
- Block-nested-loop Algorithm
- Divide-and-conquer Algorithm
6Problem Definition
- Related to
- maximum vector problem
- contour problem
- convex hull of a data set
- Assume the whole dataset fits in the memory
7SQL Extensions
- SELECT FROM WHERE GROUP BY HAVING
SKYLINE OF DISTINCT d1 MIN MAX,
dm MIN MAXORDER BY - d1 dm denote the dimensions participate the
Skyline - SELECT FROM HOMESWHERE CITYHKSKYLINE OF
DIST MIN, AREA MAX
8Naïve Approach for Skyline
- 1D skyline is equivalent to computing min, max in
SQL - Naïve 2D skyline
- Sort the data according to the 2 dimensions
- Compare every tuple with its predecessor
- Sorting may need in 2 or more passes if the data
are not fit into memory ? use current external
sorting techniques
9Naïve 2D
Home Distance from HKU Area
Kevin 1 km 10
Felix 5 km 2
KK 8 km 250
Ben 9 km 100
- Sort by Distance
- Compare Felix with Kevin ? eliminate Felix
- Compare KK with Kevin ? incompatible ? part
of skyline - Compare Ben with KK ? eliminate Ben
10Naïve 2D not works for gt 2Ds
- If skyline involves more than 2D, sorting does
not work
Home Distance from HKU Area Rent
Kevin 1 km 10 9
Felix 5 km 2 5
KK 8 km 250 10
Ben 9 km 100 9
2D 3D
Cmp Felix, Kevin ? eliminated Cmp KK, Kevin ? part of skyline Cmp Ben, KK ? eliminated Cmp Felix, Kevin ? part of skyline Cmp KK, Felix ? part of skyline Cmp Ben, KK ? part of skyline ? No! Ben dominated by Kevin ? predecessor not work!
11Block-nested-loops Algorithm
- A straightforward approach
- Compare each point p with every other point
- If p is not dominated ? part of skyline
- Scan the data file and keeping a list of
candidate skyline points in main memory
12BNL cont.
- Insert the 1st data point into the list
- For each subsequent point p
- If p is dominated by any point in the list, it is
discarded - If p dominates any point in the list, insert it
into the list and remove all points dominated by
p - If p neither dominated, nor dominates any point
in the list, inserted it as part of the skyline
in the list
13BNL cont.
- The candidate list is self-organizing
- Points that have been dominated other points are
moved to the top of list - Reduces the number of comparisons
- E.g. the self-organizing list holdings the
partial skyline like
Home Distance from HKU Area Rent
Kevin 1 km 249 1
K.K 8 km 250 100
... Other skylines which is not as strong as Kevin except a few dimensions ... Other skylines which is not as strong as Kevin except a few dimensions ... Other skylines which is not as strong as Kevin except a few dimensions ... Other skylines which is not as strong as Kevin except a few dimensions
14More on BNL
- Point 3 in BNL If p neither dominated, nor
dominates any point in the list, inserted it as
part of the skyline in the list. If the are no
more space in the list, write p on a temporary
file on disk. Tuples in tmp file will be further
processed in next iteration of algorithm
15More on BNL (cont.)
A
E
F
H
A
B
C
D
E
F
G
H
I
J
Dom. by A
After 1st Iteration, A,E,G,H areoutput as
skylines, then clear upthe list and treat I,J
as newdata set and perform BNL again
Dom. by A
G
Dom. by A
Incompatible with A
Incompatible with A,E
I
J
Dom On F, replace F
Incompatible with A,E,G
Incompatible with A,E,G,H, but full! Incompatible
with A,E,G,H, but full!J has not compare with I
16Short summary on BNL
- Easy to implement
- Any dimension without using index or sorting
- Relies on main memory ? may have many iterations
- Not adequate for on-line processing ? it has to
read the entire data file before it returns the
first skyline point (not progressively)
17Divide-and-Conquer Algorithm
Price Dist
0.2 4
0.8 4
0.4 3
0.3 2
0.1 2
0.6 1
0.8 3
0.2 3
0.3 3
0.5 2
- Find the median of some dimension, sayprice,
Price(med)0.3 - Split the input into 2 partitions according to
Price(med) - Compute Skyline S1 in P1(lt0.3) and S2 in
P2(gt0.3) respectively by recursive
partitioning.Note S1 is better than S2 on
price - Recursive partitioning until a partition contains
very few (or 1) tuples - If only a few tuples, find out skyline is very
easy - Merging the skylines of each partitionsby
eliminating those S2 which are dominated by
S1Note None of the tuples in S1 can be
dominatedby S2 as all tuples in S1 are better
than S2 on pricei.e. tuples in UPPER never be
eliminated
18Divide-and-Conquer Algorithm
0.3 2
0.3 3
0.2 4
0.2 3
0.1 2
Price Dist
0.2 4
0.8 4
0.4 3
0.3 2
0.1 2
0.6 1
0.8 3
0.2 3
0.3 3
0.5 2
- Find the median of some dimension, sayprice,
Price(med)0.3 - Split the input into 2 partitions according to
Price(med)
0.8 4
0.6 1
0.5 2
0.4 3
0.8 3
19Divide-and-Conquer Algorithm
0.2 4
0.2 3
0.1 2
0.3 2
0.3 3
0.3 2
0.3 3
0.2 4
0.2 3
0.1 2
S1
Price Dist
0.2 4
0.8 4
0.4 3
0.3 2
0.1 2
0.6 1
0.8 3
0.2 3
0.3 3
0.5 2
S2
S3
0.4 3
0.5 2
0.6 1
0.8 3
0.8 4
0.8 4
0.6 1
0.5 2
0.4 3
0.8 3
S4
S5
S6
S7
20Divide-and-Conquer Algorithm
0.2 4
0.2 3
0.1 2
0.3 2
0.3 3
0.3 2
0.3 3
0.2 4
0.2 3
0.1 2
S1
Price Dist
0.2 4
0.8 4
0.4 3
0.3 2
0.1 2
0.6 1
0.8 3
0.2 3
0.3 3
0.5 2
S2
S1, S2
S3
S1,S2, S7
i.e. tuples in UPPER never be eliminated
0.4 3
0.5 2
0.8 4
0.8 3
0.6 1
0.8 4
0.6 1
0.5 2
0.4 3
0.8 3
S4
S5
S4, S5, S7
S6
S7
21Efficient Progressive Skyline ComputationVLDB
2001K.L. Tan, P.K. Eng, B.C. Ooi
- Previous approach require at least one
- pass over the dataset to return the first
- interesting point, We propose
- Bitmap-based Algorithm
- B-tree-based Algorithm
- They can return the first interesting point once
- they identified.
22Progressive?
- Both bitmap and tree-base returns skyline very
quickly - Maybe useful if you are not willing to wait so
long for the first few interesting homes out of
the large dataset - Also outperform BNL and D--C in overall response
time
23Skyline by Bitmap
- Main idea
- Given a point p, if something can tell you
- p is not dominated by any other points in DB ?
skyline! - p is dominated by some points in DB ? throw away
- Non-blocking!
- Can return the skyline points immediately
24Bitmap
- All information requires to decide whether a
point is in skyline are encoded in bitmaps - A data point p (p1, p2, , pd) where d is no.
of dimensions, is mapped to a m-bit vector, m is
number of distinct values over all dimensions
25Bitmap
- Distinct values on price and distance is 7 and 4
- m 11
Price Dist
0.2 4
0.8 4
0.4 3
0.3 2
0.1 2
0.6 1
0.8 3
0.2 3
0.3 3
0.5 2
26Bitmap representation
- Distinct value on x 10
- Distinct value on y 10
- m20
- 20-bit vector
- E.g (4,8)
- 4 is 4-th smallest on dimension x, set 4-th to
the leftmost be 1 (starting from right) - 8 is 8-th smallest on y, set 8-th to the leftmost
be 1)
Point Bitmap Representation
(1,9) (1111111111, 1100000000)
(2,10) (1111111110, 1000000000)
(4,8 ) (1111111000, 1110000000)
(6,7 ) (1111100000, 1111000000)
(9,10 ) (1100000000, 1000000000)
(7,5 ) (1111000000, 1111110000)
(5,6 ) (1111110000, 1111100000)
(4,3 ) (1111111000, 1111111100)
(3,2 ) (1111111100, 1111111110)
(9,1 ) (1100000000, 1111111111)
(10,4 ) (1000000000, 1111111000)
(6,2 ) (1111100000, 1111111110)
(8,3) (1110000000, 1111111100)
27Bitmap representation
- Do (4,8) is a skyline point? (min x, y)
- Create bit-strings Cx and Cy (Not CY Ng!)
- Cx 1110000110000
- Cy 0011011111111
- Cx Cy 0010000110000
- If CxCy has more than one 1, dominated by some
points
Point Bitmap Representation
(1,9) (1111111111, 1100000000)
(2,10) (1111111110, 1000000000)
(4,8 ) (1111111000, 1110000000)
(6,7 ) (1111100000, 1111000000)
(9,10 ) (1100000000, 1000000000)
(7,5 ) (1111000000, 1111110000)
(5,6 ) (1111110000, 1111100000)
(4,3 ) (1111111000, 1111111100)
(3,2 ) (1111111100, 1111111110)
(9,1 ) (1100000000, 1111111111)
(10,4 ) (1000000000, 1111111000)
(6,2 ) (1111100000, 1111111110)
(8,3) (1110000000, 1111111100)
28Bitmap representation
- Do (3,2) is a skyline point? (min x, y)
- Create bit-strings Cx and Cy
- Cx 1100000010000
- Cy 0000000011010
- Cx Cy 0000000010000
- If CxCy has only 1, it is a skyline
Point Bitmap Representation
(1,9) (1111111111, 1100000000)
(2,10) (1111111110, 1000000000)
(4,8 ) (1111111000, 1110000000)
(6,7 ) (1111100000, 1111000000)
(9,10 ) (1100000000, 1000000000)
(7,5 ) (1111000000, 1111110000)
(5,6 ) (1111110000, 1111100000)
(4,3 ) (1111111000, 1111111100)
(3,2 ) (1111111100, 1111111110)
(9,1 ) (1100000000, 1111111111)
(10,4 ) (1000000000, 1111111000)
(6,2 ) (1111100000, 1111111110)
(8,3) (1110000000, 1111111100)
29Short summary on Bitmap
- Need to pre-compute bitmap representation of
every point - Each point retrieve all bitmaps in order to get
the juxtapositions (Cx and Cy) - Large storage if the domain of each attributes
are large
30Some other progressive algorithms
- B-tree index (also proposed by BOC)
- Organizes the points into d lists (d is no. of
dimensions in data) - Build Btree on the lists for retrieving skylines
- Suffer similar problem as bitmap approach
31Some other progressive algorithms (cont.)
- NN algorithm (by Donald Kossmann again)
- VLDB 02
32NN skyline
33Outline
- Introduction to Skyline Queries
- Skyline Operator in SQL
- Implementation Algorithms
- Progressive Algorithms
- Variations of Skyline Queries
- Experimental Result
- Conclusion
34An Optimal and Progressive Algorithm for Skyline
QueriesSIGMOD 2003D.Papadias, Y. Tao, G. Fu, B.
Seeger
- We propose
- A NN algorithm which is more efficient andI/O
Optimal - Ranked skyline queries
- Constrained skyline queries
- Dynamic skyline queries
- K-dominating queries
35Ranked Skyline
- A ranked skyline returns K skyline points that
have minimum/max score according to a function f - In our example, f 3Dist 7Area
- Return the top K homes
- Though skylines are returning interesting points,
we may want the most interesting points according
to our own preferences, especially the data set
is large(say hotels) and skyline is also large!
36Constrained Skyline
- Returning the most interesting points in a
specific data space
37Dynamic Skyline
- Returning update skyline dynamically
- E.g. Ask for hotels with minimum distance and
price (again?) - Minimum distance is now depends on my current
location
38Enumerating Skyline
- Enumerate queries return, for each skyline point
p, the number of points dominated by p - Sometime useful if you want to know this skyline
hotel C has dominated 1000 hotels, and another
hotel Y dominated only 1 hotel - maybe C is better than Y in many properties (e.g.
price, dist, etc), but Y has only 1 properties
better than C, e.g. with PS2
39Experimental Evaluation
- Running time comparison on progressive algorithms
without NN approaches
BNL
DC
Bitmap
Index
40Conclusion
- Introduction the skyline queries
- How to implement (support) the skyline operator
in DBMS? - Variation of skyline queries
- If the information are placed in different
places, how to answer skyline queries on a mobile
device?
41References
- S.Borzonyi, D.Kossmann, K.Stocker. The Skyline
Operator. ICDE 2001. - K.L. Tan, P.K. Eng, B.C. Ooi . Efficient
Progressive Skyline Computation. VLDB 2001. - D.Kossmann, F.Ramsak, S. Rost. Shooting Stars in
the Sky An Online Algorithm for Skyline Queries.
VLDB 2002. - D.Papadias, Y. Tao, G. Fu, B. Seeger. An Optimal
and Progressive Algorithm for Skyline Queries.
SIGMOD 2003.