Top-k Dominating Queries

1
Top-k Dominating Queries

DB seminar
Speaker: Ken Yiu
Date: 25/05/2006

2
Outline

Motivations and applications
Background
Skyline-based algorithm
Best-first algorithms
Experimental results
Conclusions

3
Top-k query, skyline query

D: dataset of points in the multi-dimensional space $\mathbb{R}^d$
Top-k query
k points with the lowest F values
Top-2: p4, p6
Drawback: requires a ranking function F
Drawback: result is affected by the scales of the dimensions
Skyline query
$p \succ p' \iff (\exists i,\ p_i < p'_i) \land (\forall i,\ p_i \le p'_i)$
Points not dominated by any other point
Skyline: p1, p4, p6, p7
Drawback: result size cannot be controlled
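
The two queries on this slide can be sketched in a few lines of Python. This is a minimal illustration that assumes smaller values are better in every dimension. The function names, the use of `sum` as the ranking function F, and the sample points are assumptions made for the example; they are not the points from the slide's figure.

```python
def dominates(p, q):
    """True if p is no worse than q in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Points that no other point dominates (quadratic scan, fine for small inputs)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def top_k(points, k, rank):
    """The k points with the lowest value of the ranking function."""
    return sorted(points, key=rank)[:k]


# Hypothetical 2-D data (not the slide's figure).
data = [(1, 9), (3, 3), (2, 8), (8, 1), (5, 5), (9, 9)]

print(skyline(data))              # size depends on the data, not on a parameter
print(top_k(data, 2, rank=sum))   # size is k, but a ranking function is needed
```

Note that rescaling one dimension can change the output of `top_k` but never the output of `skyline`, since dominance only compares values within a dimension.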

4
Top-k dominating query

Intuitive score function
$\mathrm{score}(p) = |\{\, p' \in D : p \succ p' \,\}|$
Property: $\forall p, p' \in D,\ p \succ p' \Rightarrow \mathrm{score}(p) > \mathrm{score}(p')$
Top-k dominating query
k points with the highest score values
Also known as k-dominating query [PTFS05]
Top-2 dominating points: p4 (3), p5 (2)
Applications: decision support systems, finding the most popular objects
Advantages
Control of result size
No need to specify a ranking function
Result independent of the scales of the dimensions
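
As a baseline, the score function and the query can be written as a brute-force scan. This is only a sketch with quadratic cost, assuming smaller values are better. The sample points are hypothetical and the function names are chosen for this example.

```python
import heapq


def dominates(p, q):
    """True if p is no worse than q in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p, data):
    """Number of points in data that p dominates."""
    return sum(dominates(p, q) for q in data)


def top_k_dominating(data, k):
    """The k (score, point) pairs with the highest scores; O(n^2) comparisons."""
    return heapq.nlargest(k, ((score(p, data), p) for p in data))


# Hypothetical 2-D data.
data = [(1, 6), (2, 2), (6, 1), (3, 4), (4, 6), (5, 5), (7, 7)]
print(top_k_dominating(data, 2))
```

The property stated on the slide follows from transitivity: a point dominates everything its dominated points dominate, plus those points themselves.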

5
Related work

Spatial aggregation processing
Aggregate measures (e.g., number of cars in car-parks) over a region (e.g., a district)
aR-trees [PKZT01]
Each entry is augmented with the aggregate measure of all points in its subtree
Example: COUNT R-tree
Query: find the number of points that intersect the window W
Prune entries that do not intersect W
Entry fully covered by W: increment the result by its count
Entry partially covered by W: recursive call
Cost: 10 for the aR-tree, but 17 for a typical R-tree
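
The three cases of the COUNT query can be sketched as follows. This is a toy in-memory stand-in for an aR-tree, not a disk-based index: the `Node` class, the helper names, and the sample points are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple                 # lower corner of the bounding rectangle
    hi: tuple                 # upper corner of the bounding rectangle
    count: int                # aggregate: number of points in the subtree
    children: list = field(default_factory=list)  # child nodes (non-leaf)
    points: list = field(default_factory=list)    # data points (leaf)


def make_leaf(points):
    lo = tuple(map(min, zip(*points)))
    hi = tuple(map(max, zip(*points)))
    return Node(lo, hi, len(points), points=list(points))


def make_parent(children):
    lo = tuple(map(min, zip(*(c.lo for c in children))))
    hi = tuple(map(max, zip(*(c.hi for c in children))))
    return Node(lo, hi, sum(c.count for c in children), children=list(children))


def count_in_window(node, w_lo, w_hi):
    """Number of points under node that fall inside the window [w_lo, w_hi]."""
    dims = range(len(w_lo))
    if any(node.hi[i] < w_lo[i] or node.lo[i] > w_hi[i] for i in dims):
        return 0                       # no intersection: prune
    if all(w_lo[i] <= node.lo[i] and node.hi[i] <= w_hi[i] for i in dims):
        return node.count              # fully covered: use the stored count
    if node.points:                    # partially covered leaf: test each point
        return sum(all(w_lo[i] <= p[i] <= w_hi[i] for i in dims)
                   for p in node.points)
    return sum(count_in_window(c, w_lo, w_hi) for c in node.children)


# Hypothetical tree with three leaves.
root = make_parent([make_leaf([(1, 1), (2, 2)]),
                    make_leaf([(5, 5), (6, 7)]),
                    make_leaf([(8, 1), (9, 3)])])
print(count_in_window(root, (0, 0), (6, 6)))  # 3
```

In this example the first leaf is answered from its stored count, the third is pruned, and only the second has to be opened, which is the saving the slide attributes to the aR-tree.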

6
Related work skyline computation

Non-indexed data
DC (divide-and-conquer), BNL (block-nested loop), SFS (sort-filter-skyline), LESS (linear elimination sort for skyline)
Indexed data
NN, BBS [PTFS05]
Skyline variants based on the dominance relationship
Top-k frequent skyline points [CJT06a]
Frequency(p): number of subspaces in which p is a skyline point
k-dominant skyline points [CJT06b]
Relax the dominance relationship by k
k = d gives the original skyline; as k decreases, the skyline size decreases
Data cube for analyzing the dominance relationship of points [LOTW06]

7
Top-k dominating query

How to answer a top-k dominating query?
Block nested loop join: compute the score of every point
Quadratic cost
Skyline-based solution: retrieve the skyline points and compute their scores; find the top-1 from the skyline
Expensive for datasets with large skylines
Goal: develop efficient algorithms for the query on indexed multi-dimensional data

8
Problem characteristics

Pre-computation possible?
Materialize the score of every point
Updates change the scores of influenced points
Update cost is expensive for dynamic datasets
Find K (K >> k) points with the largest dominating area, then compute their scores to get the best k results
Approximate solution; hard to specify K
Dominating area cannot provide bounds for the score
$\mathrm{DomArea}(p_1) = (1 - 0.25)(1 - 0.50) = 0.375$
$\mathrm{DomArea}(p_4) = (1 - 0.45)(1 - 0.40) = 0.330$
but $\mathrm{score}(p_1) = 1 < \mathrm{score}(p_4) = 2$
Unlike the dominating area, computing a score (or even its upper bound) requires accessing data
9
Skyline-based solution

BBS-based top-k dominating algorithm [PTFS05]
Example: top-2 dominating query
Iteration 1
Find the skyline points
The score of a point is smaller than that of any point dominating it, so the top-1 result must be a skyline point
Compute their scores (by accessing the tree)
Report p2 (score 4) as the first result
Iteration 2
Find the constrained skyline (gray region). Why?
Region dominated by p2 but not by the other skyline points (p1, p3)
Compute their scores and compare them with the points retrieved in all previous iterations
Report p4 (score 2) as the second result
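
The iterations above can be sketched in memory as follows. This is an illustration of the idea only: it scans a Python list instead of running BBS on a tree, it assumes distinct points, and the function names and sample data are made up for the example.

```python
def dominates(p, q):
    """True if p is no worse than q in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]


def score(p, data):
    return sum(dominates(p, q) for q in data)


def skyline_based_topk(data, k):
    """Report the top-k dominating points one per iteration."""
    # Iteration 1: the candidates are the skyline points with their scores.
    candidates = {p: score(p, data) for p in skyline(data)}
    results = []
    while candidates and len(results) < k:
        best = max(candidates, key=candidates.get)
        results.append((best, candidates.pop(best)))
        # Constrained region: dominated by the reported point, but not by
        # any candidate that is still waiting to be reported.
        region = [q for q in data
                  if dominates(best, q)
                  and not any(dominates(c, q) for c in candidates)]
        # Only the skyline of that region can hold the next result.
        for q in skyline(region):
            candidates[q] = score(q, data)
    return results


# Hypothetical 2-D data.
data = [(1, 6), (2, 2), (6, 1), (3, 4), (4, 6), (5, 5), (7, 7)]
print(skyline_based_topk(data, 2))
```

A point outside the constrained region is either not dominated by the reported point (so it was already a candidate or is dominated by one) or is dominated by an unreported candidate, whose score is necessarily higher.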

10
Our optimizations

Hilbert ordering of retrieved points before counting
Exhibits locality of node accesses
Batch counting
Pack B (page capacity) points into one page and count their scores simultaneously
$e^-$ and $e^+$ denote the lower and upper corners (virtual points) of an entry e, respectively

Properties
$p_1 \succ e^- \Rightarrow$ p1 dominates all points in e
$p_2 \succ e^+$ and $p_2 \nsucc e^- \Rightarrow$ p2 may dominate some points in e
$p_3 \nsucc e^+ \Rightarrow$ p3 dominates no points in e
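
A possible reading of batch counting with these three properties is sketched below, assuming the lower corner is the best corner of an entry (smaller is better). The toy `Node` class stands in for an aR-tree entry with a COUNT aggregate; the class, helper names, and data are assumptions made for this example, and the Hilbert ordering step is omitted.

```python
from dataclasses import dataclass, field


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


@dataclass
class Node:
    lo: tuple                 # lower corner e-
    hi: tuple                 # upper corner e+
    count: int                # number of points in the subtree
    children: list = field(default_factory=list)
    points: list = field(default_factory=list)


def make_leaf(points):
    lo = tuple(map(min, zip(*points)))
    hi = tuple(map(max, zip(*points)))
    return Node(lo, hi, len(points), points=list(points))


def make_parent(children):
    lo = tuple(map(min, zip(*(c.lo for c in children))))
    hi = tuple(map(max, zip(*(c.hi for c in children))))
    return Node(lo, hi, sum(c.count for c in children), children=list(children))


def batch_count(batch, node, scores=None):
    """Count, for every point in batch, the points under node it dominates."""
    if scores is None:
        scores = {p: 0 for p in batch}
    partial = []
    for p in batch:
        if dominates(p, node.lo):
            scores[p] += node.count      # dominates all points in the entry
        elif dominates(p, node.hi):
            partial.append(p)            # may dominate some: look inside
        # otherwise p dominates no point in the entry
    if partial and node.points:
        for p in partial:
            scores[p] += sum(dominates(p, q) for q in node.points)
    elif partial:
        for child in node.children:      # one descent serves the whole batch
            batch_count(partial, child, scores)
    return scores


# Hypothetical data, grouped into three leaves.
data = [(1, 6), (2, 2), (3, 4), (4, 6), (5, 5), (7, 7), (6, 1)]
root = make_parent([make_leaf(data[0:2]), make_leaf(data[2:4]), make_leaf(data[4:7])])
print(batch_count(data, root))
```

Each node is visited at most once per batch, regardless of how many points in the batch need it, which is the point of counting the scores simultaneously.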

11
The best-first approach

The optimized BBS is inefficient when the skyline is large
Not necessary to compute the whole skyline
Best-first approach: visit the nodes in descending order of their upper bound scores
Use a max-heap H to organize the entries to be visited in descending order of their upper bound scores
Keep an array W of the best k data points found so far
Terminate when the top entry in H cannot improve the result
Compute the upper bound scores of entries in the same non-leaf node
Upper bound score of an entry e is $\mathrm{score}(e^-)$. Why?
For each entry e in the node, put the point $e^-$ in the set T
Perform batch counting for the points in T
12
Optimizations for best-first search

Pruning technique
Let $\gamma$ be the best-k score found so far (lowest score in the result array)
Suppose that a point p satisfies $\mathrm{score}(p) \le \gamma$. Then any point p' dominated by p has $\mathrm{score}(p') < \gamma$.
Keep a pruner set F of visited data points whose scores are $\le \gamma$
Among the points in F, only their skyline needs to be maintained
Apply F to eliminate unqualified entries
Lazy counting (for computing scores of leaf entries)
If only some data points (in the same leaf node) remain before counting, it is not cost-effective to perform batch counting for them alone
Use a FIFO queue L to store discovered points
Once L is full (i.e., |L| = page capacity B), perform batch counting for the points in L, update the result, and clear L
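
The pruner set can be sketched as a small skyline that is updated as low-scoring points are visited. This is a minimal illustration with invented function names. It assumes an entry is eliminated when some pruner dominates its lower corner, since that pruner then dominates every point in the entry.

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def add_pruner(pruners, p):
    """Add a visited point whose score cannot beat the best-k score.

    Only the skyline of the pruner set is kept: a pruner dominated by
    another pruner can never eliminate anything the other one cannot.
    """
    if any(f == p or dominates(f, p) for f in pruners):
        return pruners
    return [f for f in pruners if not dominates(p, f)] + [p]


def can_prune(pruners, lower_corner):
    """True if some pruner dominates the entry's lower corner."""
    return any(dominates(f, lower_corner) for f in pruners)


# Hypothetical sequence of visited low-scoring points.
F = []
for p in [(3, 3), (5, 5), (1, 4)]:
    F = add_pruner(F, p)
print(F)
print(can_prune(F, (4, 3)), can_prune(F, (2, 1)))
```

Keeping only the skyline bounds the size of F without weakening it, because dominance is transitive.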

13
Lightweight best-first search

Expensive to compute upper bound scores for non-leaf entries
Root node contains e1, e2, e3
Compute score(e1), score(e2), score(e3) in batch
e2 may dominate some points in e1 and e3
Cost: 3 node accesses; score(e1) = 3, score(e2) = 7, score(e3) = 3
Use a lightweight function to compute upper bound scores for non-leaf entries
Goal: do not allow leaf nodes to be accessed
Compute score_ub(e1), score_ub(e2), score_ub(e3) in batch
Cost: 1 node access, since leaf nodes are not accessed
e2 dominates all points in e2 and some in e1 and e3
score_ub(e1) = 3, score_ub(e2) = 9, score_ub(e3) = 3

Correct bound! Approximately preserves the original ordering of entries!
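
One way to obtain such a bound without touching leaf nodes is sketched below. It assumes that a partially dominated leaf entry is counted in full, which can only overestimate the score. The slide does not spell out the exact rule, so treat this as an illustration; the `Node` class, helper names, and data are invented.

```python
from dataclasses import dataclass, field


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


@dataclass
class Node:
    lo: tuple
    hi: tuple
    count: int
    children: list = field(default_factory=list)
    points: list = field(default_factory=list)


def make_leaf(points):
    lo = tuple(map(min, zip(*points)))
    hi = tuple(map(max, zip(*points)))
    return Node(lo, hi, len(points), points=list(points))


def make_parent(children):
    lo = tuple(map(min, zip(*(c.lo for c in children))))
    hi = tuple(map(max, zip(*(c.hi for c in children))))
    return Node(lo, hi, sum(c.count for c in children), children=list(children))


def exact_score(p, node):
    """Exact count; opens partially dominated leaves."""
    if dominates(p, node.lo):
        return node.count
    if not dominates(p, node.hi):
        return 0
    if node.points:
        return sum(dominates(p, q) for q in node.points)
    return sum(exact_score(p, c) for c in node.children)


def lightweight_ub(p, node):
    """Upper bound on the count; never reads the points of a leaf."""
    if dominates(p, node.lo):
        return node.count
    if not dominates(p, node.hi):
        return 0
    if node.points:
        return node.count     # leaf not opened: assume all its points are dominated
    return sum(lightweight_ub(p, c) for c in node.children)


# Hypothetical data, grouped into three leaves.
root = make_parent([make_leaf([(1, 6), (2, 2)]),
                    make_leaf([(3, 4), (4, 6)]),
                    make_leaf([(5, 5), (7, 7), (6, 1)])])
for e in root.children:
    print(e.lo, exact_score(e.lo, root), lightweight_ub(e.lo, root))
```

The bound is looser than the exact count but is computed from entry corners and counts alone, so its cost is limited to non-leaf node accesses.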

14
Incremental best-first search

No objects need to be pruned
Data points are inserted into the heap H after their scores have been computed
When a data point p is deheaped, check whether its score is greater than those of the points in the lazy counting queue L
If yes, report p as the next result
If not:
Consider the points in L whose (upper bound) score is greater than that of p
Compute their actual scores and insert them into H
Insert p back into H
The next result is now at the top of H (to be found in the next iteration)

15
Query variant

Bichromatic top-k dominating query
Given a provider dataset DP and a consumer dataset DA, for a point p in DP:
$\mathrm{score}_A(p) = |\{\, a \in D_A : p \succ a \,\}|$
$\mathrm{score}_A(p_1) = 2$, $\mathrm{score}_A(p_2) = 3$, $\mathrm{score}_A(p_3) = 1$
Bichromatic top-1 point: p2
Application: find the most popular hotel, where DP contains hotels and DA specifies the requirements of different customers
Query processing
The proposed algorithms are still applicable
Search for the results in DP
Perform counting on DA
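
The bichromatic variant only changes which dataset is searched and which is counted, as the brute-force sketch below shows. The coordinates are hypothetical, chosen so that the three providers get the scores 2, 3, and 1 quoted on the slide; the function names are also assumptions.

```python
import heapq


def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def bichromatic_score(p, consumers):
    """Number of consumer points that the provider point p dominates."""
    return sum(dominates(p, a) for a in consumers)


def bichromatic_top_k(providers, consumers, k):
    """Search for results in the providers, count on the consumers."""
    return heapq.nlargest(
        k, ((bichromatic_score(p, consumers), p) for p in providers))


# Hypothetical hotels (providers) and customer requirements (consumers).
providers = [(2, 6), (3, 3), (7, 2)]     # p1, p2, p3
consumers = [(4, 7), (5, 4), (8, 8)]

print([bichromatic_score(p, consumers) for p in providers])
print(bichromatic_top_k(providers, consumers, 1))
```

A provider dominates a consumer here when it meets or beats every requirement, so the score counts the customers a hotel would satisfy.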

16
Setup of efficiency experiments

Algorithms
BBS (skyline-based method)
Best-first search: BF1 (basic), BF2 (lightweight)
Incremental best-first: IBF1, IBF2
Synthetic datasets
UI (independent), CO (correlated), AC (anti-correlated)
Parameters and other settings
aR-tree node page size: 4K bytes
LRU buffer size (%): 0, 1, 2, 5, 10, 15, 20
Data size N (million): 0.25, 0.5, 1, 2, 4
Data dimensionality d: 2, 3, 4, 5
Result size k: 1, 4, 16, 64, 256

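17
I/O cost vs buffer size

[Figure: I/O cost vs buffer size; panels UI, AC, CO]

18
I/O cost vs k

[Figure: I/O cost vs k; panels UI, AC, CO]

19
I/O cost vs d

[Figure: I/O cost vs d; panels UI, AC, CO]

20
I/O cost vs N

[Figure: I/O cost vs N; panels UI, AC, CO]

21
Bichromatic queries, I/O cost vs dataset combination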

Column UI/CO means the provider dataset DP is UI and the consumer dataset DA is CO
BF1 is more efficient than BBS in 7 cases
BF2 outperforms its competitors in all cases

22
Meaningfulness of results

Explore the meaningfulness of the results returned by top-k dominating queries
Real datasets
Statistics of NBA players
http://basketballreference.com/stats_download.htm
19112 players (identified by both name and year)
Attributes for query: GP (games played), PTS (points), REB (rebounds), and AST (assists)
Statistics of BASEBALL pitchers
http://baseball1.com/statistics/
36898 players (identified by both name and year)
Attributes for query: W (wins), H (hits), ERA (earned run average), and R (runs allowed)

23
Top-k dominating points meaningful?

Results match the public's view of super-star players in NBA and BASEBALL
Enables users to discover top objects without any specific domain knowledge

[Table: top-5 dominating points]

24
Skyline vs top-k dominating points

Perform a skyline query, then compute the top-k dominating points with k set to the skyline size (69 for NBA and 700 for BASEBALL)
Plot their dominating scores in descending order
Observations
Top-k dominating points have much higher scores than skyline points
Top-k dominating points are more informative to users

[Figure: dominating scores in descending order; panels BASEBALL, NBA]

25
Conclusions

Study top-k dominating queries on indexed multi-dimensional data
Present algorithms for the problem
The lightweight best-first algorithm BF2 performs the best
Top-k dominating queries produce more meaningful results than skylines

26
References

[FLN01] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.
[BKS01] S. Borzsonyi, D. Kossmann, and K. Stocker. The Skyline Operator. In ICDE, 2001.
[PKZT01] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP Operations in Spatial Data Warehouses. In SSTD, 2001.
[PTFS05] D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive Skyline Computation in Database Systems. TODS, 30(1):41-82, 2005.
[CJT06a] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang. On High Dimensional Skylines. In EDBT, 2006.
[CJT06b] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang. Finding k-Dominant Skylines in High Dimensional Space. In SIGMOD, 2006.
[LOTW06] C. Li, B. C. Ooi, A. Tung, and S. Wang. DADA: A Data Cube for Dominant Relationship Analysis. In SIGMOD, 2006.