1
Top-k Dominating Queries
  • DB seminar
  • Speaker: Ken Yiu
  • Date: 25/05/2006

2
Outline
  • Motivations and applications
  • Background
  • Skyline-based algorithm
  • Best-first algorithms
  • Experimental results
  • Conclusions

3
Top-k query, skyline query
  • D: dataset of points in the multi-dimensional space
    $\mathbb{R}^d$
  • Top-k query
  • k points with the lowest F values
  • Top-2: p4, p6
  • Drawback: requires a ranking function F
  • Drawback: result is affected by the scales of the dimensions
  • Skyline query
  • $p \succ p' \iff (\exists i,\ p_i < p'_i) \land (\forall i,\ p_i \le p'_i)$
  • Points not dominated by any other point
  • Skyline: p1, p4, p6, p7
  • Drawback: result size cannot be controlled
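
The two query types above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the slides' implementation: the sample points and the choice of `sum` as the ranking function F are assumptions made for the example, and smaller values are taken to be better in every dimension.

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def top_k(data, k, f):
    """Top-k query: the k points with the lowest f values."""
    return sorted(data, key=f)[:k]

def skyline(data):
    """Skyline query: points not dominated by any other point."""
    return [p for p in data if not any(dominates(q, p) for q in data)]

# Hypothetical 2-D points (not the ones in the slide's figure).
points = [(2, 2), (3, 3), (4, 4), (5, 1), (1, 5), (6, 6)]

print(top_k(points, 2, sum))   # [(2, 2), (3, 3)]
print(skyline(points))         # [(2, 2), (5, 1), (1, 5)]
```

Changing F (for example, weighting one dimension more heavily) changes the top-k answer, while the skyline stays the same but its size depends entirely on the data.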

4
Top-k dominating query
  • Intuitive score function
  • $\mu(p) = |\{\, p' \in D : p \succ p' \,\}|$
  • Property: $\forall p, p' \in D,\ p \succ p' \Rightarrow \mu(p) > \mu(p')$
  • Top-k dominating query
  • k points with the highest $\mu$ values
  • Also known as k-dominating query [PTFS05]
  • Top-2 dominating points: p4 (score 3), p5 (score 2)
  • Applications: decision support systems, finding the
    most popular objects
  • Advantages
  • Control of result size
  • No need to specify a ranking function
  • Result independent of the scales of the dimensions

5
Related work
  • Spatial aggregation processing
  • Aggregate measures (e.g., number of cars in
    car-parks) in a region (e.g., district)
  • aR-trees [PKZT01]
  • Each entry is augmented with the aggregate
    measure of all points in its subtree
  • Example: COUNT R-tree
  • Query: find the number of points that intersect the window W
  • Prune entries that do not intersect W
  • If an entry is fully covered by W, increment the result by its count
  • If an entry is partially covered by W, make a recursive call
  • Cost in the example: 10 for the aR-tree, but 17 for a typical
    R-tree
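
The COUNT query on an aR-tree can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: the `Node` class, its field names, and the tiny two-leaf tree are hypothetical and not taken from [PKZT01].

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple        # lower corner of the bounding rectangle
    hi: tuple        # upper corner of the bounding rectangle
    count: int       # aggregate: number of points in the subtree
    children: list = field(default_factory=list)  # child nodes (empty for a leaf)
    points: list = field(default_factory=list)    # data points (leaf only)

def intersects(lo, hi, wlo, whi):
    return all(l <= wh and wl <= h for l, h, wl, wh in zip(lo, hi, wlo, whi))

def covered(lo, hi, wlo, whi):
    return all(wl <= l and h <= wh for l, h, wl, wh in zip(lo, hi, wlo, whi))

def count_in_window(node, wlo, whi):
    """Number of points inside the window [wlo, whi]."""
    if not intersects(node.lo, node.hi, wlo, whi):
        return 0                       # prune: no overlap with W
    if covered(node.lo, node.hi, wlo, whi):
        return node.count              # fully covered: use the stored count
    if node.children:                  # partially covered: recurse
        return sum(count_in_window(c, wlo, whi) for c in node.children)
    return sum(all(wl <= x <= wh for x, wl, wh in zip(p, wlo, whi))
               for p in node.points)

leaf1 = Node((1, 1), (2, 2), 2, points=[(1, 1), (2, 2)])
leaf2 = Node((5, 5), (8, 8), 2, points=[(5, 5), (8, 8)])
root = Node((1, 1), (8, 8), 4, children=[leaf1, leaf2])

print(count_in_window(root, (0, 0), (6, 6)))   # 3
```

In this example `leaf1` is fully covered, so its points are never inspected; that shortcut is what saves node accesses compared with a plain R-tree.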

6
Related work: skyline computation
  • Non-indexed data
  • DC (divide-and-conquer), BNL (block-nested loop),
    SFS (sort-filter-skyline), LESS (linear
    elimination sort for skyline)
  • Indexed data
  • NN, BBS [PTFS05]
  • Skyline variants based on the dominance relationship
  • Top-k frequent skyline points [CJT06a]
  • Frequency(p): number of subspaces in which p is a
    skyline point
  • k-dominant skyline points [CJT06b]
  • Relax the dominance relationship by k
  • k = d: original skyline; as k decreases, the skyline size
    decreases
  • Data cube for analyzing the dominance relationship of
    points [LOTW06]

7
Top-k dominating query
  • How to answer a top-k dominating query?
  • Block nested loop join: compute the score of
    every point
  • Quadratic cost
  • Skyline-based solution: retrieve the skyline
    points and compute their scores, then find the top-1
    from the skyline
  • Expensive for datasets with large skylines
  • Goal: develop an efficient algorithm for the query
    on indexed multi-dimensional data
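
As a baseline, the quadratic approach can be sketched as a plain nested loop in memory. This is only an illustration of the idea (it ignores blocks and disk pages), and the sample points are hypothetical.

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def score(p, data):
    """Number of points in data that p dominates."""
    return sum(dominates(p, q) for q in data)

def top_k_dominating(data, k):
    """Score every point against every other point: O(n^2) dominance tests."""
    scored = [(score(p, data), p) for p in data]
    scored.sort(key=lambda sp: -sp[0])   # highest score first
    return scored[:k]

points = [(2, 2), (3, 3), (4, 4), (5, 1), (1, 5), (6, 6)]
print(top_k_dominating(points, 2))   # [(3, (2, 2)), (2, (3, 3))]
```

Note that (3, 3) is the second result even though it is not a skyline point of the dataset, which is why a single skyline computation is not enough for k > 1.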

8
Problem characteristics
  • Pre-computation possible?
  • Materialize the score of every point
  • Updates change the scores of the influenced points
  • Update cost is high for dynamic datasets
  • Find $K \gg k$ points with the largest dominating
    area, then compute their scores to get the best-k results?
  • Approximate solution; hard to specify K
  • Dominating area cannot provide bounds for $\mu$
  • $\mathrm{DomArea}(p_1) = (1 - 0.25)(1 - 0.50) = 0.375$
  • $\mathrm{DomArea}(p_4) = (1 - 0.45)(1 - 0.40) = 0.330$
  • $\mu(p_1) = 1 < \mu(p_4) = 2$, although DomArea(p1) > DomArea(p4)
  • Unlike the dominating area, computing the $\mu$ value (or
    even its upper bound) requires accessing data
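
The counterexample above can be checked numerically. In this sketch the coordinates of p1 and p4 are read off the DomArea formulas, the data space is assumed to be the unit square, and the two extra points `a` and `b` are hypothetical, chosen only so that the scores come out as on the slide.

```python
import math

def dominates(p, q):
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))

def score(p, data):
    return sum(dominates(p, q) for q in data)

def dom_area(p):
    """Area of the region dominated by p inside the unit space [0, 1]^d."""
    return math.prod(1 - x for x in p)

p1, p4 = (0.25, 0.50), (0.45, 0.40)
a, b = (0.50, 0.45), (0.60, 0.70)      # hypothetical remaining points
data = [p1, p4, a, b]

print(round(dom_area(p1), 3), round(dom_area(p4), 3))   # 0.375 0.33
print(score(p1, data), score(p4, data))                 # 1 2
```

The point with the larger dominating area has the smaller score, so the area alone cannot bound the score.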

9
Skyline-based solution
  • BBS-based top-k dominating algorithm [PTFS05]
  • Example: top-2 dominating query
  • Iteration 1
  • Find the skyline points
  • The score of a point is smaller than that of any
    point dominating it, so the top-1 result is a skyline point
  • Compute their scores (by accessing the tree)
  • Report p2 (score 4) as the first result
  • Iteration 2
  • Find the constrained skyline (gray region). Why?
  • It is the region dominated by p2 but not by the others (p1, p3)
  • Compute the scores of its points and compare them with
    the points retrieved in all previous iterations
  • Report p4 (score 2) as the second result
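
The iterations above can be imitated in memory as follows. This is a simplified sketch, not BBS itself: instead of a constrained skyline over an R-tree, it recomputes the skyline of the points not yet reported, which yields the same candidates (previous skyline points plus points dominated only by reported results). The sample points and the score cache are assumptions of the example.

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(data):
    return [p for p in data if not any(dominates(q, p) for q in data)]

def skyline_topk(data, k):
    """Report one result per iteration; candidates are skyline points of the rest."""
    cache = {}                                   # scores computed in earlier iterations
    def score(p):
        if p not in cache:
            cache[p] = sum(dominates(p, q) for q in data)
        return cache[p]
    remaining, reported = list(data), []
    while remaining and len(reported) < k:
        best = max(skyline(remaining), key=score)
        reported.append((best, score(best)))
        remaining.remove(best)
    return reported

points = [(2, 2), (3, 3), (4, 4), (5, 1), (1, 5), (6, 6)]
print(skyline_topk(points, 2))   # [((2, 2), 3), ((3, 3), 2)]
```

In the second iteration (3, 3) becomes a candidate only because its sole dominator, (2, 2), has already been reported.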

10
Our optimizations
  • Hilbert ordering of retrieved points before
    counting
  • Exhibits locality of node accesses
  • Batch counting
  • Pack B (page capacity) points into one page
    and count their scores simultaneously
  • $e^-$ and $e^+$ denote the lower and upper corners
    (virtual points) of an entry e, respectively
  • Properties
  • $p_1 \succ e^- \Rightarrow$ p1 dominates all points in e
  • $p_2 \succ e^+$ and $p_2 \nsucc e^- \Rightarrow$ p2 may dominate some
    points in e
  • $p_3 \nsucc e^+ \Rightarrow$ p3 dominates no points in e
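
The three properties translate directly into a test of a point against the two corners of an entry. A minimal sketch follows; the function name `classify` and the sample rectangle are hypothetical.

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def classify(p, lo, hi):
    """Relation of point p to an entry with lower corner lo and upper corner hi."""
    if dominates(p, lo):
        return "all"     # p dominates every point in the entry: add its count
    if dominates(p, hi):
        return "some"    # p may dominate some points: the entry must be opened
    return "none"        # p dominates no point in the entry: skip it

lo, hi = (4, 4), (6, 6)
print(classify((2, 2), lo, hi))   # all
print(classify((5, 3), lo, hi))   # some
print(classify((7, 1), lo, hi))   # none
```

During batch counting, only the "some" case costs a node access; the "all" case is answered from the aggregate count stored in the aR-tree entry.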

11
The best-first approach
  • The optimized BBS is inefficient when the skyline
    is large
  • Not necessary to compute the whole skyline
  • Best-first approach: visit the nodes in
    descending order of their upper bound scores
  • Use a max-heap H for organizing the entries to be
    visited in descending order of their upper bound
    scores
  • Keep an array W of the best-k data points found
    so far
  • Terminate when the top entry in H cannot improve
    the result
  • Compute upper bound scores of entries in the same
    non-leaf node
  • The upper bound score of an entry e is $\mu(e^-)$. Why?
    Every point dominated by a point in e is also dominated by $e^-$
  • For each entry e in the node, put the point $e^-$ in
    the set T
  • Perform batch counting for the points in T
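
The control flow of the best-first approach can be sketched with Python's `heapq`. This is a simplified illustration under several assumptions: the "tree" is just a list of leaf nodes, scores are counted by brute force rather than by batch counting on an aR-tree, and the grouping of the sample points into leaves is hypothetical.

```python
import heapq

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def score(p, data):
    return sum(dominates(p, q) for q in data)

def best_first_topk(leaves, k):
    data = [p for leaf in leaves for p in leaf]
    heap = []                                   # max-heap H via negated upper bounds
    for i, leaf in enumerate(leaves):
        lower = tuple(min(c) for c in zip(*leaf))   # lower corner of the leaf
        heapq.heappush(heap, (-score(lower, data), i))
    best = []                                   # W: min-heap of the best-k (score, point)
    while heap:
        neg_ub, i = heapq.heappop(heap)
        if len(best) == k and -neg_ub <= best[0][0]:
            break                               # top entry cannot improve the result
        for p in leaves[i]:
            s = score(p, data)
            if len(best) < k:
                heapq.heappush(best, (s, p))
            elif s > best[0][0]:
                heapq.heapreplace(best, (s, p))
    return sorted(best, reverse=True)

leaves = [[(2, 2), (1, 5)], [(3, 3), (5, 1)], [(4, 4), (6, 6)]]
print(best_first_topk(leaves, 2))   # [(3, (2, 2)), (2, (3, 3))]
```

In this run the third leaf has upper bound 1, which is below the current second-best score of 2, so the search stops without opening it.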

12
Optimizations for best-first search
  • Pruning technique
  • Let $\gamma$ be the best-k score found so far (lowest
    score in the result array)
  • Suppose that a point p satisfies $\mu(p) \le \gamma$. Then any
    point p' dominated by p has $\mu(p') < \gamma$.
  • Keep a pruner set F of visited data points whose
    scores are $\le \gamma$
  • Among the points in F, only need to maintain
    their skyline
  • Apply F to eliminate unqualified entries
  • Lazy counting (for computing scores of leaf
    entries)
  • If only some data points (in the same leaf node) remain
    before counting, it is not cost-effective to perform
    batch counting for them alone
  • Use a FIFO queue L to store discovered points
  • Once L is full (i.e., |L| = page capacity B),
    perform batch counting for the points in L,
    update the result and clear L
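
A pruner set that keeps only its own skyline can be sketched as below. The class name `PrunerSet` and its methods are hypothetical; the caller is assumed to add only points whose score is at most the current threshold. Applying `prunes` to the lower corner of an entry is one plausible way to eliminate whole entries, stated here as an assumption rather than as the slides' exact rule.

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

class PrunerSet:
    """Skyline of visited points whose scores cannot beat the current top-k."""

    def __init__(self):
        self.points = []

    def add(self, p):
        # A dominated (or duplicate) pruner adds no pruning power.
        if any(f == p or dominates(f, p) for f in self.points):
            return
        # Drop existing pruners that the new point dominates.
        self.points = [f for f in self.points if not dominates(p, f)]
        self.points.append(p)

    def prunes(self, q):
        """True if q (a point, or the lower corner of an entry) is dominated by a pruner."""
        return any(dominates(f, q) for f in self.points)

F = PrunerSet()
for p in [(4, 4), (3, 3), (5, 5), (5, 1)]:
    F.add(p)

print(F.points)           # [(3, 3), (5, 1)]
print(F.prunes((6, 6)))   # True
print(F.prunes((2, 2)))   # False
```

Keeping only the skyline is safe because dominance is transitive: anything dominated by a discarded pruner is also dominated by the pruner that replaced it.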

13
Lightweight best-first search
  • Expensive to compute upper bound scores
    for non-leaf entries
  • Root node contains e1, e2, e3
  • Compute $\mu(e_1^-)$, $\mu(e_2^-)$, $\mu(e_3^-)$ in batch
  • $e_2^-$ may dominate some points in e1 and e3
  • Cost: 3 node accesses; $\mu(e_1^-) = 3$, $\mu(e_2^-) = 7$,
    $\mu(e_3^-) = 3$
  • Use a lightweight function to compute upper bound
    scores for non-leaf entries
  • Goal: do not allow leaf nodes to be accessed
  • Compute $\mu^{ub}(e_1)$, $\mu^{ub}(e_2)$, $\mu^{ub}(e_3)$ in batch
  • Cost: 1 node access, since leaf nodes are not
    accessed
  • $e_2^-$ is assumed to dominate all points in e2 and in the
    partially dominated entries e1 and e3
  • $\mu^{ub}(e_1) = 3$, $\mu^{ub}(e_2) = 9$, $\mu^{ub}(e_3) = 3$

Correct bound! Approximately preserves the original ordering of entries!
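
One way to realize such a lightweight bound is to add the full count of every entry that the lower corner could possibly reach, without opening any of them. The sketch below is an assumption about how this might look, not the slides' exact function; the `Entry` class and the rectangle coordinates are hypothetical, chosen so that the bounds match the numbers in the example (3 points per entry).

```python
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    lo: tuple     # lower corner
    hi: tuple     # upper corner
    count: int    # number of points below this entry (aR-tree aggregate)

def may_dominate(p, q):
    """p is no worse than q in every dimension (weak dominance)."""
    return all(a <= b for a, b in zip(p, q))

def lightweight_ub(e, entries):
    """Count every entry whose upper corner is reachable from e's lower corner."""
    return sum(x.count for x in entries if may_dominate(e.lo, x.hi))

e1 = Entry("e1", (1, 6), (3, 9), 3)
e2 = Entry("e2", (2, 2), (5, 5), 3)
e3 = Entry("e3", (6, 1), (9, 3), 3)
root = [e1, e2, e3]

print([lightweight_ub(e, root) for e in root])   # [3, 9, 3]
```

The bound is never below the exact score, because an entry contributes nothing only when no point inside it can be dominated; it is looser than the exact value (9 instead of 7 for e2) but needs only the entries of the current node.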
14
Incremental best-first search
  • No objects need to be pruned
  • Data points are inserted into the heap H after
    their scores have been computed
  • When a data point p is deheaped, check whether
    its score is greater than the (upper bound) scores of
    the points in the lazy counting queue L
  • If yes, report p as the next result
  • If not,
  • consider the points in L whose (upper bound)
    score is greater than that of p
  • Compute their actual scores and insert them into H
  • Insert p back into H
  • The next result is now at the top of H (to be
    found in next iteration)
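
The incremental behaviour (results produced one at a time, with no k fixed in advance) maps naturally onto a Python generator. This sketch deliberately omits the lazy counting queue L: every point's exact score is computed as soon as its leaf is opened. The flat list of leaves, brute-force counting, and the sample data are assumptions of the example.

```python
import heapq
import itertools

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def score(p, data):
    return sum(dominates(p, q) for q in data)

def incremental_topk(leaves):
    """Yield (point, score) pairs in descending score order, one per request."""
    data = [p for leaf in leaves for p in leaf]
    tie = itertools.count()                 # tie-breaker so heap items never compare payloads
    heap = []
    for leaf in leaves:
        lower = tuple(min(c) for c in zip(*leaf))
        heapq.heappush(heap, (-score(lower, data), next(tie), "node", leaf))
    while heap:
        neg, _, kind, item = heapq.heappop(heap)
        if kind == "point":
            yield item, -neg                # exact score beats every remaining bound
        else:
            for p in item:                  # open the node, insert points with exact scores
                heapq.heappush(heap, (-score(p, data), next(tie), "point", p))

leaves = [[(2, 2), (1, 5)], [(3, 3), (5, 1)], [(4, 4), (6, 6)]]
first_two = list(itertools.islice(incremental_topk(leaves), 2))
print(first_two)   # [((2, 2), 3), ((3, 3), 2)]
```

The caller can keep pulling results from the generator until satisfied, which is the point of the incremental variant.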

15
Query variant
  • Bichromatic top-k dominating query
  • Given a provider dataset DP and a consumer
    dataset DA, the score of a point p in DP is
  • $\mu_A(p) = |\{\, a \in D_A : p \succ a \,\}|$
  • $\mu_A(p_1) = 2$, $\mu_A(p_2) = 3$, $\mu_A(p_3) = 1$
  • Bichromatic top-1 point: p2
  • Application: find the most popular hotel, where
    DP contains hotels and DA specifies the requirements
    of different customers
  • Query processing
  • The proposed algorithms are still applicable
  • Search for the results in DP
  • Perform counting on DA
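
The bichromatic score only changes which dataset is counted. A brute-force sketch follows; the hotel and requirement coordinates (price, distance) are hypothetical, chosen so that the scores match the 2, 3, 1 of the example.

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def bichromatic_topk(providers, consumers, k):
    """Search among the providers, count dominated points among the consumers."""
    scored = [(sum(dominates(p, a) for a in consumers), p) for p in providers]
    scored.sort(key=lambda sp: -sp[0])
    return scored[:k]

hotels = [(1, 4), (2, 2), (4, 1)]            # p1, p2, p3 as (price, distance)
requirements = [(3, 5), (2, 6), (5, 3)]      # customers' acceptable limits

print(bichromatic_topk(hotels, requirements, 1))   # [(3, (2, 2))]
```

A hotel dominates a requirement when it is within the customer's limits in every attribute and strictly better in at least one, so the top-1 hotel is the one that satisfies the most customers.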

16
Setup of efficiency experiments
  • Algorithms
  • BBS (skyline-based method)
  • Best-first search: BF1 (basic), BF2 (lightweight)
  • Incremental best-first: IBF1, IBF2
  • Synthetic datasets
  • UI (independent), CO (correlated), AC
    (anti-correlated)
  • Parameters and other settings
  • aR-tree node page size: 4K bytes
  • LRU buffer size (): 0, 1, 2, 5, 10, 15, 20
  • Data size N (million): 0.25, 0.5, 1, 2, 4
  • Data dimensionality d: 2, 3, 4, 5
  • Result size k: 1, 4, 16, 64, 256

17
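I/O cost vs buffer size
  • [Figure: three panels, one per dataset: UI, AC, CO]

18
I/O cost vs k
  • [Figure: three panels, one per dataset: UI, AC, CO]

19
I/O cost vs d
  • [Figure: three panels, one per dataset: UI, AC, CO]

20
I/O cost vs N
  • [Figure: three panels, one per dataset: UI, AC, CO]
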
21
Bichromatic queries: I/O cost vs dataset combination
  • Column UI/CO means that the provider dataset DP is UI
    and the consumer dataset DA is CO
  • BF1 is more efficient than BBS in 7 cases
  • BF2 outperforms its competitors in all cases

22
Meaningfulness of results
  • Explore the meaningfulness of the results
    returned by top-k dominating queries
  • Real datasets
  • Statistics of NBA players
  • http://basketballreference.com/stats_download.htm
  • 19112 players (identified by both name and year)
  • Attributes for query: GP (games played), PTS
    (points), REB (rebounds), and AST (assists)
  • Statistics of BASEBALL pitchers
  • http://baseball1.com/statistics/
  • 36898 players (identified by both name and year)
  • Attributes for query: W (Wins), H (Hits), ERA
    (Earned Run Average), and R (Runs Allowed)

23
Are top-k dominating points meaningful?
  • Results match the public's view of super-star
    players in NBA and BASEBALL
  • Enables users to discover top objects without
    any specific domain knowledge
  • [Table: top-5 dominating points]
24
Skyline vs top-k dominating points
  • Perform a skyline query, then compute the top-k dominating
    points by setting k to the skyline size (69 for
    NBA and 700 for BASEBALL)
  • Plot their dominating scores in descending order
  • Observations
  • Top-k dominating points have much higher scores
    than skyline points
  • Top-k dominating points are more informative to
    users
  • [Figure: dominating scores in descending order; panels BASEBALL, NBA]

25
Conclusions
  • Study top-k dominating queries on indexed
    multi-dimensional data
  • Present algorithms for the problem
  • The lightweight best-first algorithm BF2
    performs the best
  • Top-k dominating queries produce more meaningful
    results than skylines

26
References
  • [FLN01] R. Fagin, A. Lotem, and M. Naor. Optimal
    Aggregation Algorithms for Middleware. In PODS,
    2001.
  • [BKS01] S. Borzsonyi, D. Kossmann, and K.
    Stocker. The Skyline Operator. In ICDE, 2001.
  • [PKZT01] D. Papadias, P. Kalnis, J. Zhang, and Y.
    Tao. Efficient OLAP Operations in Spatial Data
    Warehouses. In SSTD, 2001.
  • [PTFS05] D. Papadias, Y. Tao, G. Fu, and B.
    Seeger. Progressive Skyline Computation in
    Database Systems. TODS, 30(1):41-82, 2005.
  • [CJT06a] C.-Y. Chan, H. Jagadish, K.-L. Tan, A.
    Tung, and Z. Zhang. On High Dimensional Skylines.
    In EDBT, 2006.
  • [CJT06b] C.-Y. Chan, H. Jagadish, K.-L. Tan, A.
    Tung, and Z. Zhang. Finding k-Dominant Skylines
    in High Dimensional Space. In SIGMOD, 2006.
  • [LOTW06] C. Li, B. C. Ooi, A. Tung, and S. Wang.
    DADA: A Data Cube for Dominant Relationship
    Analysis. In SIGMOD, 2006.