RankSQL: Query Algebra and Optimization for Relational Top-k Queries presentation

About This Presentation

Transcript and Presenter's Notes

Title: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

1
RankSQL Query Algebra and Optimization for
Relational Top-k Queries

Chengkai Li (UIUC)
Kevin Chen-Chuan Chang (UIUC)
Ihab F. Ilyas (U. of Waterloo)
Sumin Song (UIUC)

2
Ranking (Top-k) Queries

Comparing and Ranking (ordering) objects is a
pervasive functionality needed in all kinds of
real-world database applications
Data Mining
E-Commerce
Information Retrieval
Multimedia
OLAP Decision Support
Search Engine
Web Sources

3
Motivating Application 1 Text Data Retrieval

Find the records/documents/pages most
similar/related to top-k queries.
Relational Databases
XML
Web Search (Google)

4
Motivating Application 2 Decision Support
Systems

TPC-H Query 10
Top 20 customers who have returned parts, ranked
by their effect on lost revenue, for a given
quarter.
sum(l_extendedprice (1-l_discount))
15 out 22 TPC-H queries are such ranking queries.

5
Motivating Application 3 Multimedia Databases
Search image by an aggregation of criteria on
color, shape, texture, description, size,
etc. http//iidr.unn.ac.uk/mermaid
6
Motivating Application 4 E-Commerce

Trip planning (running example)
suggest a hotel to stay and a museum to visit,
after Midwest DB Symposium, by the following
ranking criteria
Hotel is cheap.
Museum has some interesting exhibitions.
Hotel is close to museum.

7
Ranking in RDBMS

Important everywhere
But
No essential and efficient support, only language
facility.

8
Ranking Query
HID MID cheap related score
H1 M2 0.9 0.8 1.7
H2 M1 0.7 0.9 1.6
H1 M3 0.8 0.7 1.5

Select
From
Hotel h, Museum m
Where
h.star3 AND
h.aream.area
Order By
cheap(h.price)
related(m.collection,dinosaur)
Limit 5

boolean predicates
ranking predicates and monotonic scoring function
9
Processing Ranking Queries by Traditional RDBMS
Select From Hotel h, Museum m Where
h.star3 AND h.aream.area Order By
cheap(h.price) related(m.collection,dinos
aur) Limit 5
10
Disadvantages of traditional approach

Naïve Materialize-then-Sort scheme
Overkill
whole join results are scored and ordered
only 5 top results is requested.
Very inefficient
Scan of large base tables
Join large intermediate results
Evaluate expensive ranking predicates
cheap(h.price) online source
related(m.collection,dinosaur) IR
close(h.addr,m.addr) querying geographical data

5 results
Sort
ranking predicates
11
Boolean vs. Ranking

Select
From
Hotel h, Museum m
Where
h.star3 AND
h.aream.area

12
Boolean vs. Ranking

Select
From
Hotel h, Museum m
Where
h.star3 AND
h.aream.area

Boolean predicates could have been processed by
materialize-then-filter scheme, why its not?

13
Boolean vs. Ranking

The boolean predicates are split into join and
selection operators, which interleave with each
other.
Result reduction of intermediate results, thus
processing cost.
Can we do the same thing for ranking predicates?

Boolean Predicates
Hotel X Museum
14
RankSQL system

Two goals
Support Ranking as a first-class query type in
RDBMS
notion of ranking in the data model and algebra,
by splitting ranking predicates.
Integrate Ranking with the traditional boolean
query constructs
facilities for manipulate ranking, by
interleaving ranking predicates with other
operations.

15
Orders of magnitude performance improvements
Prototype Implementation in PostgreSQL
16
Rank-Relational Algebra

Foundation
splitting µ (rank) operator
interleaving rank-aware selection, join
operators, etc.

data model operations
relational algebria relation s, p, , n, U, -, scan
rank-relational algebra rank-relation µ, sr, pr, r, nr, Ur, -r, scanr
17
Rank-Relation
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 0.75 1.7
s4 0.4 0.7 0.95 2.05
s5 0.3 0.9 0.6 1.8
s6 0.25 0.45 0.9 1.6
Select From R Order By p1p2p3 Limit 1
µp3(µp2(Scanp1(R)))
18
Rank-Relation
TID upper-bound

TID p1 p2 p3 score
s1
s2 0.9
s3
s4
s5
s6
µp3
TID upper-bound

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2

Scanp1(R)
19
Rank-Relation
TID upper-bound

TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 1.0 1.0 2.9
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9

Scanp1(R)
20
Rank-Relation
TID upper-bound

TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 1.0 1.0 2.9
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound

s2 2.9
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9

Scanp1(R)
21
Rank-Relation
TID upper-bound

TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 0.85 1.0 2.75
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound
s2 2.75

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9

Scanp1(R)
22
Rank-Relation
TID upper-bound

TID p1 p2 p3 score
s1 0.7 1.0 1.0 2.7
s2 0.9 0.85 1.0 2.75
s3 0.7 1.0 1.0 2.7
s4 0.7 1.0 1.0 2.7
s5 0.7 1.0 1.0 2.7
s6 0.7 1.0 1.0 2.7
µp3
TID upper-bound
s2 2.75

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7

Scanp1(R)
23
Rank-Relation
TID upper-bound

TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 1.0 2.75
s3 0.7 1.0 1.0 2.7
s4 0.7 1.0 1.0 2.7
s5 0.7 1.0 1.0 2.7
s6 0.7 1.0 1.0 2.7
µp3
TID upper-bound
s2 2.75
s1 2.5

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7

Scanp1(R)
24
Rank-Relation
TID upper-bound
s2 2.55

TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 0.8 2.55
s3 0.7 1.0 1.0 2.7
s4 0.7 1.0 1.0 2.7
s5 0.7 1.0 1.0 2.7
s6 0.7 1.0 1.0 2.7
µp3
TID upper-bound
s2 2.75
s1 2.5

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7

Scanp1(R)
25
Rank-Relation
TID upper-bound
s2 2.55

TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 0.8 2.55
s3 0.5 1.0 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5

Scanp1(R)
26
Rank-Relation
TID upper-bound
s2 2.55

TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5
s3 1.95
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5

Scanp1(R)
27
Rank-Relation
TID upper-bound
s2 2.55
s1 2.4
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5
s3 1.95
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5

Scanp1(R)
28
Rank-Relation
TID upper-bound
s2 2.55
s1 2.4
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5
s3 1.95
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5

Scanp1(R)
29
Meterialize-then-sort
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
Sortp1p2p3
Select From R Order By p1p2p3 Limit 1
Scan(R)
30
Rank-Relation
TID upper-bound

TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 1.0 1.0 2.9
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound

µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9

Scanp1(R)
31
Ranking principle

Tuples should be processed in the order of their
upper-bound scores with respect to the evaluated
predicates.
Enable efficient incremental query evaluation
plans.
Inspired by the work of CH02.

32
Impact of Rank-Relational Algebra
Ranking Query
33
Optimization

Two-dimensional enumeration ranking (ranking
predicate scheduling) and filtering (join order
selection)
Sampling-based cardinality estimation

34
Related Work
35
Conclusion

RankSQL system
Goal Support Ranking as a first-class query
type Integrate Ranking with boolean query
constructs
Intuition splitting and interleaving ranking
predicates
Principle ranking principle
Foundation rank-relational algebra
Impact Executor, Optimizer, Rewriter, etc.
Implementation prototype in PostgreSQL, demo in
progress

Write a Comment

User Comments (0)

About PowerShow.com

RankSQL: Query Algebra and Optimization for Relational Top-k Queries PowerPoint PPT Presentation