Title: Finding skyline on the fly
1. Finding skyline on the fly
- HKU CS DB Seminar
- 21 July 2004
- Speaker: Eric Lo
2. Skyline
- A new operator (like ORDER BY) in database systems
- A set of data points that are not dominated by any other data point
3. Example
- Find some good places for us to hold the next DB Seminar
- Good → closer to HKU (Min)
- Good → larger area (Max)
- Return those homes for which no other home is at least as good in ALL DIMENSIONS and strictly better in at least one
- Dataset (Table Homes)
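The dominance test behind this example can be sketched in Python. This is a minimal illustration, not the speaker's code. The rows of `homes` are hypothetical stand-ins for the Homes table (only Nikos's 8 km and 250 m² and K.K. Loo's 10 m² appear on later slides), and the names `dominates` and `skyline` are chosen here for illustration.

```python
from typing import Dict, List, Tuple

Home = Tuple[float, float]  # (distance from HKU in km, area in m^2)

# Mostly hypothetical rows; "Home X" is invented to show a dominated home.
homes: Dict[str, Home] = {
    "K.K. Loo": (1.0, 10.0),
    "Ivy": (5.0, 120.0),
    "Nikos": (8.0, 250.0),
    "Home X": (9.0, 100.0),
}


def dominates(a: Home, b: Home) -> bool:
    """a dominates b: a is no farther and no smaller, and strictly better in one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def skyline(table: Dict[str, Home]) -> List[str]:
    """Naive skyline: keep every home that no other home dominates."""
    return sorted(
        name
        for name, row in table.items()
        if not any(dominates(other, row) for other in table.values())
    )


print(skyline(homes))  # prints ['Ivy', 'K.K. Loo', 'Nikos']
```

"Home X" is dropped because Nikos's home is both closer and larger. The other three homes each win on at least one dimension against every rival.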
4. Outline
- Introduction to skyline queries
- Non-progressive skylining on the Web
- Basic Distributed Skyline Algorithm (BDS)
- Progressive skylining on the Web
- Experimental results
- Conclusion and future directions
5. Skylining on the Web
- Each distributed site holds one attribute
- Attribute Distance from HKU is stored at HKU
- Attribute Area (m²) is stored at Purdue
(Diagram: the HKU and Purdue sites, connected over the Internet)
6. Accessing interfaces
(Diagram: the HKU and Purdue sites, connected over the Internet)
- Interfaces of Web-accessible sites
- Sorted Access (SA)
- HKU.getNext() returns the 1st-ranked data tuple, K.K. Loo
- HKU.getNext() → 2nd, Ivy; HKU.getNext() → 3rd, Nikos; ...
- Random Access (RA)
- Purdue.getScore(K.K. Loo) → 10 m²
- HKU.getScore(Nikos) → 8 km
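The two access modes can be mimicked locally with a small class. This is only a sketch under stated assumptions: `WebSource`, `get_next`, and `get_score` are hypothetical Python counterparts of the getNext() and getScore() calls on the slide, and the scores for Ivy and the distance for K.K. Loo are invented so that the ranking matches the slide.

```python
from typing import Dict, List, Optional, Tuple


class WebSource:
    """Hypothetical stand-in for one Web-accessible site holding one attribute."""

    def __init__(self, scores: Dict[str, float], smaller_is_better: bool = True):
        self._scores = dict(scores)
        # Best-first order of object names, used by sorted access.
        self._order: List[str] = sorted(
            self._scores,
            key=lambda name: self._scores[name],
            reverse=not smaller_is_better,
        )
        self._cursor = 0

    def get_next(self) -> Optional[Tuple[str, float]]:
        """Sorted access (SA): the next-best object and its score, or None."""
        if self._cursor >= len(self._order):
            return None
        name = self._order[self._cursor]
        self._cursor += 1
        return name, self._scores[name]

    def get_score(self, name: str) -> float:
        """Random access (RA): the score of a named object."""
        return self._scores[name]


# Distance (km, smaller is better) at HKU; area (m^2, larger is better) at Purdue.
hku = WebSource({"K.K. Loo": 1.0, "Ivy": 5.0, "Nikos": 8.0})
purdue = WebSource(
    {"K.K. Loo": 10.0, "Ivy": 120.0, "Nikos": 250.0}, smaller_is_better=False
)
```

Calling `hku.get_next()` three times walks through K.K. Loo, Ivy, and Nikos in that order, while `purdue.get_score("K.K. Loo")` looks up a single value without advancing any cursor.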
7. Basic distributed skyline algorithm (EDBT 04)
- Phase 1: find all possible skyline objects
- Perform sorted access on each source, one by one
- S1.getNext(), S2.getNext(), S3.getNext()
- S1.getNext(), S2.getNext(), ...
- ...
- Stop when there is an object whose attribute values are all known
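Phase 1 as described above can be sketched as a round-robin loop over sorted lists. The sketch below is an assumption-laden illustration rather than the published algorithm: each source is modelled as a Python list already sorted best-first, the function name `phase1` is invented here, and the three lists are hypothetical prefixes chosen so that f becomes the terminating object after 15 sorted accesses (the figure data itself is not part of this transcript).

```python
from typing import Dict, List, Optional, Tuple

SortedList = List[Tuple[str, float]]  # (object id, attribute value), best first


def phase1(sources: List[SortedList]) -> Tuple[Optional[str], Dict[str, Dict[int, float]], int]:
    """Round-robin sorted access until one object has been seen in every source.

    Returns (terminating object, values seen so far per object, access count).
    """
    n = len(sources)
    cursors = [0] * n
    seen: Dict[str, Dict[int, float]] = {}
    accesses = 0
    while True:
        progressed = False
        for i, source in enumerate(sources):
            if cursors[i] >= len(source):
                continue  # this source is exhausted
            obj, value = source[cursors[i]]
            cursors[i] += 1
            accesses += 1
            progressed = True
            seen.setdefault(obj, {})[i] = value
            if len(seen[obj]) == n:
                return obj, seen, accesses  # all attribute values are known
        if not progressed:
            return None, seen, accesses  # every source exhausted, no terminator


# Hypothetical list prefixes; the values are just ranks used as scores.
s1 = [("a", 1), ("b", 2), ("f", 3), ("g", 4), ("h", 5)]
s2 = [("c", 1), ("d", 2), ("i", 3), ("f", 4), ("j", 5)]
s3 = [("e", 1), ("c", 2), ("j", 3), ("k", 4), ("f", 5)]

terminating, seen, accesses = phase1([s1, s2, s3])
print(terminating, accesses)  # prints: f 15
```

Objects c and j are each seen in two sources before f completes, but only f has all three values known, so the scan stops there.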
8. Phase 1
- f is the terminating object
9. Phase 1 (15 sorted accesses)
10. Implication
- f is the terminating object → objects that have not appeared yet must be dominated by f
11. Phase 2
- Find the skyline from the candidates found in phase 1
- During the sequential scan of the sources, data structures K1, K2, K3, ..., Kn are created
- n is the number of dimensions
- If getNext() on source i returns a data object d:
- create an entry for d in Ki
- update the lower_bound of source i
12. Phase 2: find skyline from candidates Ki
- A lemma shows that an object can only be dominated by objects in the same set Ki
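One way to read the lemma is that a candidate only has to be compared with the objects retrieved before it in a set Ki that contains it. The sketch below follows that reading and is not taken from the EDBT paper. It assumes smaller values are better in every dimension, that values within a source are distinct, and that the full attribute vectors in `values` have already been filled in (for example by random access). The function `phase2`, the two sets `K`, and all data are hypothetical, and the lower_bound bookkeeping is left out.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def dominates(p: Vector, q: Vector) -> bool:
    """p dominates q when p is no worse everywhere and differs somewhere."""
    return all(a <= b for a, b in zip(p, q)) and p != q


def phase2(K: List[List[str]], values: Dict[str, Vector]) -> List[str]:
    """Test each candidate only against its predecessors in one set Ki."""
    result: List[str] = []
    decided = set()
    for Ki in K:  # Ki lists objects in the order source i returned them
        for pos, obj in enumerate(Ki):
            if obj in decided:
                continue  # already settled through an earlier Ki
            decided.add(obj)
            earlier = Ki[:pos]
            if not any(dominates(values[o], values[obj]) for o in earlier):
                result.append(obj)
    return result


# Hypothetical outcome of phase 1 on two sources; t was the terminating object.
K = [["a", "x", "t"], ["d", "g", "t"]]
values = {
    "a": (1, 5), "x": (2, 8), "t": (3, 3),
    "d": (9, 1), "g": (8, 2),
}
print(phase2(K, values))  # prints ['a', 't', 'd', 'g']
```

Here x is rejected after a single comparison with a, its only predecessor in K1, and it is never compared with the objects that appear only in K2.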
13. Motivations
- BDS returns skyline results in a batch
- In practice, it would be useful to return skyline results progressively so that users can adjust their decisions right away
- Consider the next-DB-seminar skyline example
- Minimize Distance from HKU, maximize Area
- <Nikos: 8 km, 250 m²> is returned first
- But getting from HKU to Nikos's home takes a $50 bus ride!
- The user can then add a travel-expense attribute to the skyline query
14. Progressive Distributed Skylining (PDS)
- Goal
- Evaluate skyline queries progressively with minimal overhead
- Overhead
- Network/data source accesses
- Computational time
15. Enabling progressiveness
- To decide whether a data point belongs to the final skyline, we rely on the following lemma (assuming the data values are distinct)
- If a data source Di returns data objects in a strictly monotonic order, an object O retrieved from Di can only be dominated by objects that are retrieved from Di before O
16.
- If an object O is retrieved from a data source by sorted access, we only need to test whether O is dominated by any object that appears before O in the same source
- Two uses
- We don't need to consider objects that appear in other data sources
- After the test, we can output O as a skyline object immediately → if O passes the test it must be in the skyline, and we do not need to worry that objects appearing later would dominate O
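The reasoning behind the lemma can be written out in one line. The notation is an assumption made here, not taken from the slides: smaller values are taken as better in every source, P_k denotes the value of object P in source D_k, \prec denotes dominance, and rank_i is the position at which sorted access on D_i returns an object.

```latex
% Dominance (smaller is better in every source D_k):
P \prec O \iff \bigl(\forall k:\ P_k \le O_k\bigr) \wedge \bigl(\exists k:\ P_k < O_k\bigr)

% Fix the source D_i from which O was retrieved. Values are distinct, so:
P \prec O
  \;\Rightarrow\; P_i \le O_i
  \;\Rightarrow\; P_i < O_i
  \;\Rightarrow\; \operatorname{rank}_i(P) < \operatorname{rank}_i(O)
```

Read contrapositively, any object returned by D_i after O is strictly worse than O in dimension i and therefore cannot dominate it, which is why O can be output as soon as it survives the test against its predecessors.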
17. An R-tree approach
- Build an R-tree Ri for each attribute/data source i involved in the skyline query
- For each object O retrieved from source i, we check whether any object in Ri dominates O
- If no such object exists, O is a skyline object (output it immediately)
- If some object in Ri dominates O, O is not a skyline object (O is discarded immediately)
18. D3.getNext(), the 1st time
(Figure: axes D1 and D2, showing point e(7,4))
- SA on D3 returns e<1>
- e is a skyline object (no object is better than e on D3); e(7,4) is projected into R-tree R3
19. D3.getNext(), the 2nd time
(Figure: points c(2,5) and e(7,4))
- SA on D3 returns c<2>
- Construct a query Q(origin, c) on R3
- Q returns no answer → c is a skyline object → insert c into R3
20. D3.getNext(), the 3rd time
(Figure: points j(6,10), c(2,5) and e(7,4))
- SA on D3 returns j<3>
- Construct a query Q(origin, j) on R3
- Q returns c as an answer → j is dominated by c → discard j
21. D3.getNext(), the 4th time
(Figure: points c(2,5) and e(7,4))
- SA on D3 returns f<4>; construct a query Q(origin, f) on R3
- Q returns no answer → f is a skyline object
- Delete e after the insertion of f to make the R-tree more compact and efficient
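The four getNext() steps above can be replayed with a plain dictionary standing in for the R-tree R3. This is a sketch under several assumptions: no real R-tree library is used, so the window query Q(origin, O) is a linear scan; smaller is taken as better on D1 and D2; `ProjectedIndex` and `process_sorted_stream` are hypothetical names; and the coordinates of f are invented, since the slides do not give them. The deletion rule (drop stored points that the new point dominates in the projection) is one interpretation of why e is deleted after f is inserted.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # projection onto (D1, D2)


class ProjectedIndex:
    """Stand-in for R-tree R3: stores (D1, D2) projections of skyline objects."""

    def __init__(self) -> None:
        self.points: Dict[str, Point] = {}

    def window_query(self, corner: Point) -> List[str]:
        """Q(origin, corner): stored points inside the box from origin to corner."""
        return [
            name
            for name, (x, y) in self.points.items()
            if x <= corner[0] and y <= corner[1]
        ]

    def insert(self, name: str, p: Point) -> None:
        # Drop stored points that p dominates in the projection (assumed rule).
        self.points = {
            n: q
            for n, q in self.points.items()
            if not (p[0] <= q[0] and p[1] <= q[1])
        }
        self.points[name] = p


def process_sorted_stream(stream: List[Tuple[str, Point]]):
    """Decide each object as soon as sorted access on D3 returns it."""
    index = ProjectedIndex()
    log = []
    for name, projection in stream:
        hits = index.window_query(projection)
        if hits:
            log.append((name, "discarded", hits))
        else:
            index.insert(name, projection)
            log.append((name, "skyline", []))
    return index, log


# Order returned by sorted access on D3; f's coordinates are hypothetical.
d3_stream = [("e", (7, 4)), ("c", (2, 5)), ("j", (6, 10)), ("f", (6, 3))]
index, log = process_sorted_stream(d3_stream)
for entry in log:
    print(entry)
print(sorted(index.points))  # prints ['c', 'f']
```

With these inputs e, c, and f are reported as skyline objects the moment they arrive, j is discarded because the query returns c, and e is removed from the index once f is inserted.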
22. The R-tree approach
- The R-tree is very small, since it stores only the skyline objects with the highest pruning power
- The containment query operation is very efficient
23. A linear regression based heuristic
- The R-tree approach enables progressiveness with better efficiency
- We use a linear regression based heuristic to minimize the number of source accesses during the evaluation process
24. A rank based approach
- We use linear regression to estimate the ranks of objects during the process
- Assume the object with the lowest rank is the real terminating object, and probe the sources accordingly (rather than round-robin)
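The slides do not spell out how the regression is set up, so the following is only a guess at its general shape. It assumes that the (rank, value) pairs already returned by sorted access on a source are fitted with an ordinary least-squares line, and that the line is then used to estimate the rank of objects whose value in that source is known (for instance through random access) but which sorted access has not reached yet. The function names and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_rank(seen: List[Tuple[int, float]], value: float) -> float:
    """Estimate the rank of `value` from the (rank, value) pairs seen so far."""
    ranks = [float(rank) for rank, _ in seen]
    values = [v for _, v in seen]
    slope, intercept = fit_line(values, ranks)
    return slope * value + intercept


# (rank, value) pairs returned so far by sorted access on one source.
seen_in_source = [(1, 10.0), (2, 20.0), (3, 30.0)]
# Values of two partially seen objects in that source (hypothetical).
pending: Dict[str, float] = {"p": 75.0, "q": 42.0}

estimates = {name: estimate_rank(seen_in_source, v) for name, v in pending.items()}
likely_terminating = min(estimates, key=lambda name: estimates[name])
print(likely_terminating)  # prints q
```

In this toy case q is estimated to sit near rank 4 and p near rank 7 or 8, so q would be treated as the likely terminating object and the sources would be probed with it in mind instead of strictly taking turns.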
25. Extensions
- Evaluation of top-K skyline queries
- Progress indicator (based on the estimated ranks)
26. Experimental results: number of source accesses
27. Experimental results: number of source accesses
(Figure panels: Random Distribution; Denormalized Domain)
28. Experimental results: progressive behavior
29. Experimental results: progress indicator
30. Conclusion and future directions
- Skyline queries on the Web
- Return skyline points on the fly
- Future work
- Improve the usability of PDS by allowing users to trade off progressiveness against efficiency
- Compute skylines from real-time stream data
- Handle the case where only 1 data source supports sorted access and the rest support random access only
31. References
- S. Börzsönyi, D. Kossmann, K. Stocker, The Skyline Operator, in ICDE 2001.
- D. Kossmann, F. Ramsak, S. Rost, Shooting Stars in the Sky: An Online Algorithm for Skyline Queries, in VLDB 2002.
- W.-T. Balke, U. Güntzer, J. X. Zheng, Efficient Distributed Skylining for Web Information Systems, in EDBT 2004.