RankSQL: Query Algebra and Optimization for Relational Top-k Queries PowerPoint PPT Presentation

presentation player overlay
1 / 35
About This Presentation
Transcript and Presenter's Notes

Title: RankSQL: Query Algebra and Optimization for Relational Top-k Queries


1
RankSQL Query Algebra and Optimization for
Relational Top-k Queries
  • Chengkai Li (UIUC)
  • Kevin Chen-Chuan Chang (UIUC)
  • Ihab F. Ilyas (U. of Waterloo)
  • Sumin Song (UIUC)

2
Ranking (Top-k) Queries
  • Comparing and Ranking (ordering) objects is a
    pervasive functionality needed in all kinds of
    real-world database applications
  • Data Mining
  • E-Commerce
  • Information Retrieval
  • Multimedia
  • OLAP Decision Support
  • Search Engine
  • Web Sources

3
Motivating Application 1 Text Data Retrieval
  • Find the records/documents/pages most
    similar/related to top-k queries.
  • Relational Databases
  • XML
  • Web Search (Google)

4
Motivating Application 2 Decision Support
Systems
  • TPC-H Query 10
  • Top 20 customers who have returned parts, ranked
    by their effect on lost revenue, for a given
    quarter.
  • sum(l_extendedprice (1-l_discount))
  • 15 out 22 TPC-H queries are such ranking queries.

5
Motivating Application 3 Multimedia Databases
Search image by an aggregation of criteria on
color, shape, texture, description, size,
etc. http//iidr.unn.ac.uk/mermaid
6
Motivating Application 4 E-Commerce
  • Trip planning (running example)
  • suggest a hotel to stay and a museum to visit,
    after Midwest DB Symposium, by the following
    ranking criteria
  • Hotel is cheap.
  • Museum has some interesting exhibitions.
  • Hotel is close to museum.

7
Ranking in RDBMS
  • Important everywhere
  • But
  • No essential and efficient support, only language
    facility.

8
Ranking Query
HID MID cheap related score
H1 M2 0.9 0.8 1.7
H2 M1 0.7 0.9 1.6
H1 M3 0.8 0.7 1.5
  • Select
  • From
  • Hotel h, Museum m
  • Where
  • h.star3 AND
  • h.aream.area
  • Order By
  • cheap(h.price)
  • related(m.collection,dinosaur)
  • Limit 5

boolean predicates
ranking predicates and monotonic scoring function
9
Processing Ranking Queries by Traditional RDBMS
Select From Hotel h, Museum m Where
h.star3 AND h.aream.area Order By
cheap(h.price) related(m.collection,dinos
aur) Limit 5
10
Disadvantages of traditional approach
  • Naïve Materialize-then-Sort scheme
  • Overkill
  • whole join results are scored and ordered
  • only 5 top results is requested.
  • Very inefficient
  • Scan of large base tables
  • Join large intermediate results
  • Evaluate expensive ranking predicates
  • cheap(h.price) online source
  • related(m.collection,dinosaur) IR
  • close(h.addr,m.addr) querying geographical data

5 results
Sort
ranking predicates
11
Boolean vs. Ranking
  • Select
  • From
  • Hotel h, Museum m
  • Where
  • h.star3 AND
  • h.aream.area

12
Boolean vs. Ranking
  • Select
  • From
  • Hotel h, Museum m
  • Where
  • h.star3 AND
  • h.aream.area
  • Boolean predicates could have been processed by
    materialize-then-filter scheme, why its not?

13
Boolean vs. Ranking
  • The boolean predicates are split into join and
    selection operators, which interleave with each
    other.
  • Result reduction of intermediate results, thus
    processing cost.
  • Can we do the same thing for ranking predicates?

Boolean Predicates
Hotel X Museum
14
RankSQL system
  • Two goals
  • Support Ranking as a first-class query type in
    RDBMS
  • notion of ranking in the data model and algebra,
    by splitting ranking predicates.
  • Integrate Ranking with the traditional boolean
    query constructs
  • facilities for manipulate ranking, by
    interleaving ranking predicates with other
    operations.

15
Orders of magnitude performance improvements
Prototype Implementation in PostgreSQL
16
Rank-Relational Algebra
  • Foundation
  • splitting µ (rank) operator
  • interleaving rank-aware selection, join
    operators, etc.

data model operations
relational algebria relation s, p, , n, U, -, scan
rank-relational algebra rank-relation µ, sr, pr, r, nr, Ur, -r, scanr
17
Rank-Relation
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 0.75 1.7
s4 0.4 0.7 0.95 2.05
s5 0.3 0.9 0.6 1.8
s6 0.25 0.45 0.9 1.6
Select From R Order By p1p2p3 Limit 1
µp3(µp2(Scanp1(R)))
18
Rank-Relation
TID upper-bound


TID p1 p2 p3 score
s1
s2 0.9
s3
s4
s5
s6
µp3
TID upper-bound



µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2


Scanp1(R)
19
Rank-Relation
TID upper-bound


TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 1.0 1.0 2.9
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound



µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9


Scanp1(R)
20
Rank-Relation
TID upper-bound


TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 1.0 1.0 2.9
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound



s2 2.9
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9


Scanp1(R)
21
Rank-Relation
TID upper-bound


TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 0.85 1.0 2.75
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound
s2 2.75


µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9


Scanp1(R)
22
Rank-Relation
TID upper-bound


TID p1 p2 p3 score
s1 0.7 1.0 1.0 2.7
s2 0.9 0.85 1.0 2.75
s3 0.7 1.0 1.0 2.7
s4 0.7 1.0 1.0 2.7
s5 0.7 1.0 1.0 2.7
s6 0.7 1.0 1.0 2.7
µp3
TID upper-bound
s2 2.75


µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7

Scanp1(R)
23
Rank-Relation
TID upper-bound


TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 1.0 2.75
s3 0.7 1.0 1.0 2.7
s4 0.7 1.0 1.0 2.7
s5 0.7 1.0 1.0 2.7
s6 0.7 1.0 1.0 2.7
µp3
TID upper-bound
s2 2.75
s1 2.5

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7

Scanp1(R)
24
Rank-Relation
TID upper-bound
s2 2.55

TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 0.8 2.55
s3 0.7 1.0 1.0 2.7
s4 0.7 1.0 1.0 2.7
s5 0.7 1.0 1.0 2.7
s6 0.7 1.0 1.0 2.7
µp3
TID upper-bound
s2 2.75
s1 2.5

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7

Scanp1(R)
25
Rank-Relation
TID upper-bound
s2 2.55

TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 0.8 2.55
s3 0.5 1.0 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5

Scanp1(R)
26
Rank-Relation
TID upper-bound
s2 2.55

TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5
s3 1.95
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5

Scanp1(R)
27
Rank-Relation
TID upper-bound
s2 2.55
s1 2.4
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5
s3 1.95
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5

Scanp1(R)
28
Rank-Relation
TID upper-bound
s2 2.55
s1 2.4
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5
s3 1.95
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5

Scanp1(R)
29
Meterialize-then-sort
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
Sortp1p2p3
Select From R Order By p1p2p3 Limit 1
Scan(R)
30
Rank-Relation
TID upper-bound


TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 1.0 1.0 2.9
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound



µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9


Scanp1(R)
31
Ranking principle
  • Tuples should be processed in the order of their
    upper-bound scores with respect to the evaluated
    predicates.
  • Enable efficient incremental query evaluation
    plans.
  • Inspired by the work of CH02.

32
Impact of Rank-Relational Algebra
Ranking Query
33
Optimization
  • Two-dimensional enumeration ranking (ranking
    predicate scheduling) and filtering (join order
    selection)
  • Sampling-based cardinality estimation

34
Related Work
35
Conclusion
  • RankSQL system
  • Goal Support Ranking as a first-class query
    type Integrate Ranking with boolean query
    constructs
  • Intuition splitting and interleaving ranking
    predicates
  • Principle ranking principle
  • Foundation rank-relational algebra
  • Impact Executor, Optimizer, Rewriter, etc.
  • Implementation prototype in PostgreSQL, demo in
    progress
Write a Comment
User Comments (0)
About PowerShow.com