Title: RankSQL: Query Algebra and Optimization for Relational Top-k Queries
1RankSQL Query Algebra and Optimization for
Relational Top-k Queries
- Chengkai Li (UIUC)
- Kevin Chen-Chuan Chang (UIUC)
- Ihab F. Ilyas (U. of Waterloo)
- Sumin Song (UIUC)
2Ranking (Top-k) Queries
- Comparing and Ranking (ordering) objects is a
pervasive functionality needed in all kinds of
real-world database applications - Data Mining
- E-Commerce
- Information Retrieval
- Multimedia
- OLAP Decision Support
- Search Engine
- Web Sources
3Motivating Application 1 Text Data Retrieval
- Find the records/documents/pages most
similar/related to top-k queries. - Relational Databases
- XML
- Web Search (Google)
4Motivating Application 2 Decision Support
Systems
- TPC-H Query 10
- Top 20 customers who have returned parts, ranked
by their effect on lost revenue, for a given
quarter. - sum(l_extendedprice (1-l_discount))
- 15 out 22 TPC-H queries are such ranking queries.
5Motivating Application 3 Multimedia Databases
Search image by an aggregation of criteria on
color, shape, texture, description, size,
etc. http//iidr.unn.ac.uk/mermaid
6Motivating Application 4 E-Commerce
- Trip planning (running example)
- suggest a hotel to stay and a museum to visit,
after Midwest DB Symposium, by the following
ranking criteria - Hotel is cheap.
- Museum has some interesting exhibitions.
- Hotel is close to museum.
7Ranking in RDBMS
- Important everywhere
- But
- No essential and efficient support, only language
facility.
8Ranking Query
HID MID cheap related score
H1 M2 0.9 0.8 1.7
H2 M1 0.7 0.9 1.6
H1 M3 0.8 0.7 1.5
- Select
- From
- Hotel h, Museum m
- Where
- h.star3 AND
- h.aream.area
- Order By
- cheap(h.price)
- related(m.collection,dinosaur)
- Limit 5
boolean predicates
ranking predicates and monotonic scoring function
9Processing Ranking Queries by Traditional RDBMS
Select From Hotel h, Museum m Where
h.star3 AND h.aream.area Order By
cheap(h.price) related(m.collection,dinos
aur) Limit 5
10Disadvantages of traditional approach
- Naïve Materialize-then-Sort scheme
- Overkill
- whole join results are scored and ordered
- only 5 top results is requested.
- Very inefficient
- Scan of large base tables
- Join large intermediate results
- Evaluate expensive ranking predicates
- cheap(h.price) online source
- related(m.collection,dinosaur) IR
- close(h.addr,m.addr) querying geographical data
5 results
Sort
ranking predicates
11Boolean vs. Ranking
- Select
- From
- Hotel h, Museum m
- Where
- h.star3 AND
- h.aream.area
12Boolean vs. Ranking
- Select
- From
- Hotel h, Museum m
- Where
- h.star3 AND
- h.aream.area
- Boolean predicates could have been processed by
materialize-then-filter scheme, why its not?
13Boolean vs. Ranking
- The boolean predicates are split into join and
selection operators, which interleave with each
other. - Result reduction of intermediate results, thus
processing cost. - Can we do the same thing for ranking predicates?
Boolean Predicates
Hotel X Museum
14RankSQL system
- Two goals
- Support Ranking as a first-class query type in
RDBMS - notion of ranking in the data model and algebra,
by splitting ranking predicates. - Integrate Ranking with the traditional boolean
query constructs - facilities for manipulate ranking, by
interleaving ranking predicates with other
operations.
15Orders of magnitude performance improvements
Prototype Implementation in PostgreSQL
16Rank-Relational Algebra
- Foundation
- splitting µ (rank) operator
- interleaving rank-aware selection, join
operators, etc.
data model operations
relational algebria relation s, p, , n, U, -, scan
rank-relational algebra rank-relation µ, sr, pr, r, nr, Ur, -r, scanr
17Rank-Relation
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 0.75 1.7
s4 0.4 0.7 0.95 2.05
s5 0.3 0.9 0.6 1.8
s6 0.25 0.45 0.9 1.6
Select From R Order By p1p2p3 Limit 1
µp3(µp2(Scanp1(R)))
18Rank-Relation
TID upper-bound
TID p1 p2 p3 score
s1
s2 0.9
s3
s4
s5
s6
µp3
TID upper-bound
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2
Scanp1(R)
19Rank-Relation
TID upper-bound
TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 1.0 1.0 2.9
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
Scanp1(R)
20Rank-Relation
TID upper-bound
TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 1.0 1.0 2.9
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound
s2 2.9
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
Scanp1(R)
21Rank-Relation
TID upper-bound
TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 0.85 1.0 2.75
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound
s2 2.75
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
Scanp1(R)
22Rank-Relation
TID upper-bound
TID p1 p2 p3 score
s1 0.7 1.0 1.0 2.7
s2 0.9 0.85 1.0 2.75
s3 0.7 1.0 1.0 2.7
s4 0.7 1.0 1.0 2.7
s5 0.7 1.0 1.0 2.7
s6 0.7 1.0 1.0 2.7
µp3
TID upper-bound
s2 2.75
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
Scanp1(R)
23Rank-Relation
TID upper-bound
TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 1.0 2.75
s3 0.7 1.0 1.0 2.7
s4 0.7 1.0 1.0 2.7
s5 0.7 1.0 1.0 2.7
s6 0.7 1.0 1.0 2.7
µp3
TID upper-bound
s2 2.75
s1 2.5
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
Scanp1(R)
24Rank-Relation
TID upper-bound
s2 2.55
TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 0.8 2.55
s3 0.7 1.0 1.0 2.7
s4 0.7 1.0 1.0 2.7
s5 0.7 1.0 1.0 2.7
s6 0.7 1.0 1.0 2.7
µp3
TID upper-bound
s2 2.75
s1 2.5
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
Scanp1(R)
25Rank-Relation
TID upper-bound
s2 2.55
TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 0.8 2.55
s3 0.5 1.0 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5
Scanp1(R)
26Rank-Relation
TID upper-bound
s2 2.55
TID p1 p2 p3 score
s1 0.7 0.8 1.0 2.5
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5
s3 1.95
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5
Scanp1(R)
27Rank-Relation
TID upper-bound
s2 2.55
s1 2.4
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5
s3 1.95
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5
Scanp1(R)
28Rank-Relation
TID upper-bound
s2 2.55
s1 2.4
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
µp3
TID upper-bound
s2 2.75
s1 2.5
s3 1.95
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
s1 2.7
s3 2.5
Scanp1(R)
29Meterialize-then-sort
TID p1 p2 p3 score
s1 0.7 0.8 0.9 2.4
s2 0.9 0.85 0.8 2.55
s3 0.5 0.45 1.0 2.5
s4 0.5 1.0 1.0 2.5
s5 0.5 1.0 1.0 2.5
s6 0.5 1.0 1.0 2.5
Sortp1p2p3
Select From R Order By p1p2p3 Limit 1
Scan(R)
30Rank-Relation
TID upper-bound
TID p1 p2 p3 score
s1 0.9 1.0 1.0 2.9
s2 0.9 1.0 1.0 2.9
s3 0.9 1.0 1.0 2.9
s4 0.9 1.0 1.0 2.9
s5 0.9 1.0 1.0 2.9
s6 0.9 1.0 1.0 2.9
µp3
TID upper-bound
µp2
Select From R Order By p1p2p3 Limit 1
TID upper-bound
s2 2.9
Scanp1(R)
31Ranking principle
- Tuples should be processed in the order of their
upper-bound scores with respect to the evaluated
predicates. - Enable efficient incremental query evaluation
plans. - Inspired by the work of CH02.
32Impact of Rank-Relational Algebra
Ranking Query
33Optimization
- Two-dimensional enumeration ranking (ranking
predicate scheduling) and filtering (join order
selection) - Sampling-based cardinality estimation
34Related Work
35Conclusion
- RankSQL system
- Goal Support Ranking as a first-class query
type Integrate Ranking with boolean query
constructs - Intuition splitting and interleaving ranking
predicates - Principle ranking principle
- Foundation rank-relational algebra
- Impact Executor, Optimizer, Rewriter, etc.
- Implementation prototype in PostgreSQL, demo in
progress