Title: Best-Effort Top-k Query Processing Under Budgetary Constraints
1 Best-Effort Top-k Query Processing Under Budgetary Constraints
Michal Shmueli-Scheuer (IBM Haifa Research Lab and UCI)
Yosi Mass, Haggai Roitman
Chen Li
Ralf Schenkel, Gerhard Weikum
2 Motivating Example
- Mediation Systems
- Achieve high query throughput.
- Online Analytics (e.g. logs)
- Achieve high query throughput.
[Figure labels: top-k queries, Engine, top-k results]
Michal Shmueli-Scheuer
3 Traditional top-k query
- Pre-computed lists over multiple attributes.
- Combine scores by some monotonic aggregation function.
- Two access modes
- sorted access (Cs)
- random access (Cr)
- Objective: compute the k objects with the highest scores.
[Figure: m lists R1 .. Rm, each with n entries sorted by score]
R1: a 0.9, b 0.6, c 0.5, .., d 0.4
R2: d 0.87, a 0.85, f 0.5, .., c 0.2
Rm: c 0.9, b 0.6, g 0.5, .., a 0.4
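4 NRA algorithm (Fagin et al.)
Top-2, f = SUM
R1: a 0.9, b 0.6, c 0.5, .., d 0.4
R2: d 0.87, a 0.85, f 0.5, .., c 0.2
Entries are shown as object [worst score, best score]; highi marks the current scan position in list i.
After one sorted access per list:
top-k: a [0.9, 1.77], d [0.87, 1.77]
candidates: (none)
Stop when mink > best score of all candidates.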
5 NRA algorithm (Fagin et al.)
Top-2
R1: a 0.9, b 0.6, c 0.5, .., d 0.4
R2: d 0.87, a 0.85, f 0.25, .., c 0.2
Entries are shown as object [worst score, best score]; highi marks the current scan position in list i.
After two sorted accesses per list:
top-k: a [1.75, 1.75], d [0.87, 1.47]
candidates: b [0.6, 1.45]
Stop when mink > best score of all candidates.
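6 NRA algorithm (Fagin et al.)
Top-2
R1: a 0.9, b 0.6, c 0.5, .., d 0.4
R2: d 0.87, a 0.85, f 0.25, .., c 0.2
Entries are shown as object [worst score, best score]; highi marks the current scan position in list i.
After three sorted accesses per list:
top-k: a [1.75, 1.75], d [0.87, 1.37]
candidates: b [0.6, 0.85], c [0.5, 0.75], f [0.25, 0.75]
Stop when mink > best score of all candidates.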
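The three NRA steps above can be replayed with a minimal Python sketch. It assumes round-robin sorted access, SUM aggregation, and only the visible list prefixes (the elided tails are omitted); the name `nra_top_k` and the extra check against unseen objects are illustrative choices, not taken from the slides.

```python
from typing import Dict, List, Tuple

ScoredList = List[Tuple[str, float]]


def nra_top_k(lists: List[ScoredList], k: int) -> Tuple[List[str], int]:
    """Round-robin NRA with SUM; returns (top-k objects, depth reached)."""
    seen: Dict[str, Dict[int, float]] = {}
    top: List[str] = []
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(1, max_depth + 1):
        # One sorted access on every list that still has entries.
        for i, lst in enumerate(lists):
            if depth <= len(lst):
                obj, score = lst[depth - 1]
                seen.setdefault(obj, {})[i] = score
        # high[i]: last score read from list i (0 once the list is exhausted).
        high = [lst[depth - 1][1] if depth <= len(lst) else 0.0 for lst in lists]
        worst = {o: sum(s.values()) for o, s in seen.items()}
        best = {
            o: worst[o] + sum(h for i, h in enumerate(high) if i not in s)
            for o, s in seen.items()
        }
        ranked = sorted(seen, key=lambda o: worst[o], reverse=True)
        top, candidates = ranked[:k], ranked[k:]
        if len(top) == k:
            min_k = worst[top[-1]]
            # Stop when no candidate and no unseen object can beat min_k.
            if min_k > sum(high) and all(min_k > best[o] for o in candidates):
                return top, depth
    return top, max_depth


# Visible prefixes of the lists on the slides (f = 0.25 as in steps 2 and 3).
R1 = [("a", 0.9), ("b", 0.6), ("c", 0.5)]
R2 = [("d", 0.87), ("a", 0.85), ("f", 0.25)]
result, depth = nra_top_k([R1, R2], k=2)
print(result, depth)
```

With these inputs the sketch stops after three accesses per list, when mink (the worst score of d) exceeds the best score of b, c and f.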
7 Top-k with Budget Constraints
Top-2, f = SUM, Cs = 1, Cr = 3
R1: s 0.95, u 0.93, t 0.92, d 0.9, x 0.5, y 0.4, z 0.2
R2: a 1.0, b 0.9, c 0.85, d 0.8, e 0.7, t 0.6, f 0.4, ..
Top-2 result: d 1.7, t 1.52
NRA: 12 Cs = 12, precision 0.5
TA: 7 Cs + 7 Cr = 28, precision 0
Given budget B, maximize result quality
Budget = 10?
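The arithmetic on this slide can be checked with a small sketch. It assumes round-robin sorted accesses that simply stop when the budget runs out, ranks objects by their worst score (sum of the scores seen so far), and takes the exact top-2 {d, t} from the slide; the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

ScoredList = List[Tuple[str, float]]

R1 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9),
      ("x", 0.5), ("y", 0.4), ("z", 0.2)]
R2 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8),
      ("e", 0.7), ("t", 0.6), ("f", 0.4)]
EXACT_TOP2 = {"d", "t"}  # given on the slide


def round_robin_under_budget(lists: List[ScoredList], k: int,
                             budget: float, cs: float = 1.0) -> List[str]:
    """Spend the budget on round-robin sorted accesses, then rank by worst score."""
    worst: Dict[str, float] = {}
    spent, depth = 0.0, 0
    while True:
        progressed = False
        for lst in lists:
            if depth < len(lst) and spent + cs <= budget:
                obj, score = lst[depth]
                worst[obj] = worst.get(obj, 0.0) + score
                spent += cs
                progressed = True
        if not progressed:
            break
        depth += 1
    return sorted(worst, key=lambda o: worst[o], reverse=True)[:k]


def precision(result: List[str], exact: Set[str]) -> float:
    return len(set(result) & exact) / len(exact)


CS, CR = 1, 3
print(7 * CS + 7 * CR)  # cost of 7 sorted plus 7 random accesses
print(precision(round_robin_under_budget([R1, R2], 2, budget=10), EXACT_TOP2))
print(precision(round_robin_under_budget([R1, R2], 2, budget=12), EXACT_TOP2))
```

Under these assumptions a budget of 10 returns {d, a}, which matches the 0.5 shown for NRA if that figure refers to the budget of 10, while 12 sorted accesses recover both d and t.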
8 Contributions
- Sorted Accesses
- Efficient Plan
- Solution with Adaptive α
- Sorted and Random Accesses
- Efficient Plan
- Solution with Adaptive α
- Experiments
9 Results Under Limited Budget
[Figure labels: results for limited budget; k results for unlimited budget]
10 Efficient Plan - Sorted Accesses
- Assume that we know the k results for unlimited budget (REXACT).
11 Efficient Plan - Sorted Accesses
- Goal: find a plan t within budget B that has optimal precision
Plans for B = 5
Denoted as ROPT
12 Sorted Accesses
[Figure labels: lists L1, L2, L3; entries (O1, SL1), (O1, SL2), (O2, SL1), (O2, SL2), (O2, SL3)]
Prefer high scores
13 Observations contd.
title: war, description: weapon
Prefer large score reductions
14 Score Utilities
Score gain
Score reduction
y 3
15 Optimization Problem
- Bi-objective optimization problem
- util(Li, x) = α · gain + (1 - α) · reduction
- Heuristics
- Fair Heuristic
- Rank Heuristic
Where m is the number of lists
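A minimal sketch of how a utility of this form could drive a greedy choice among lists. The slides do not define gain and reduction precisely, so the definitions below are assumptions made only for illustration: gain is the mean of the next x scores in a list, and reduction is how far the list's high value drops after reading them. The names `util` and `greedy_plan` are hypothetical.

```python
from typing import List


def util(scores: List[float], pos: int, x: int, alpha: float) -> float:
    """alpha * gain + (1 - alpha) * reduction for the next x entries of one list."""
    window = scores[pos:pos + x]
    if not window:
        return float("-inf")  # list exhausted
    high = scores[pos - 1] if pos > 0 else scores[0]
    gain = sum(window) / len(window)   # assumed: mean of upcoming scores
    reduction = high - window[-1]      # assumed: drop of the list's high value
    return alpha * gain + (1 - alpha) * reduction


def greedy_plan(lists: List[List[float]], budget: int, x: int, alpha: float) -> List[int]:
    """Repeatedly read up to x entries from the list with the highest utility."""
    if not lists:
        return []
    pos = [0] * len(lists)
    while budget > 0:
        step = min(x, budget)
        utils = [util(scores, p, step, alpha) for scores, p in zip(lists, pos)]
        i = max(range(len(lists)), key=lambda j: utils[j])
        if utils[i] == float("-inf"):
            break  # every list is exhausted
        taken = min(step, len(lists[i]) - pos[i])
        pos[i] += taken
        budget -= taken
    return pos  # number of sorted accesses spent on each list


A = [0.9, 0.8, 0.7, 0.6]
B = [0.5, 0.4, 0.3, 0.2]
print(greedy_plan([A, B], budget=4, x=2, alpha=1.0))
```

With α = 1 only gain counts, so the whole budget goes to the list with the higher upcoming scores; lowering α shifts weight toward lists whose scores fall quickly.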
16 Adaptive α
[Figure labels: gain weighted by α, reduction weighted by (1 - α), time]
17 Adaptive α
top-k: o1 [ws, bs], o2 [ws, bs], o3 [0.8, bs]
candidates: o4 [0.6, bs], o6 [ws, bs]
d(o4) = 0.8 - 0.6 = 0.2
[Figure labels: high at t1, high at t2]
Theobald et al. VLDB04
18 Adaptive α
19 Efficient Plan - Random Accesses
- Observations
- Random accesses always occur after sorted accesses have finished.
schedule 1: SA, RA, SA.
schedule 2: SA, SA, RA.
precision(schedule 1) ≤ precision(schedule 2)
20 Observations - contd.
- Random accesses are only useful for objects in REXACT.
[Figure labels: top-k and candidates queues holding o1, o2, o4, o5 as [ws, bs]; list L2 with entries (o1, SL2), (o2, SL2), (o5, SL2); o5 is not in REXACT; annotations "Precision reduced" and "Precision remains the same"]
21 Random Accesses
- When to switch from SA to RA?
[Figure labels: α, (1 - α), time]
22 Random Accesses
- Switch from Sorted to Random
- R = (1 - α) S
- S: total cost of sorted accesses.
- R: total cost of random accesses.
23 Experimental Data
- TREC Terabyte
- 25M webpages
- 50 queries with average length of 3 words.
- IMDB
- 375,000 movies
- 20 queries, each with 4 attributes: Title, Genre, Actors, Description
- Synthetic data
- Zipf; lists: 2, 6; objects: 10,000, 1,000,000
- Aggregate function: SUM
24 Evaluation Methods
- Percentage of optimal precision: the precision of Ralg relative to the precision of Ropt, both measured against Rexact
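One way to read the metric on this slide, as a short sketch. It assumes precision is the fraction of Rexact that a result set contains and that the reported percentage is the ratio of the two precisions; both readings are assumptions, and the function names are hypothetical.

```python
from typing import Iterable


def precision(result: Iterable[str], exact: Iterable[str]) -> float:
    """Fraction of the exact top-k that appears in the result."""
    exact_set = set(exact)
    return len(set(result) & exact_set) / len(exact_set)


def percent_of_optimal(r_alg: Iterable[str], r_opt: Iterable[str],
                       r_exact: Iterable[str]) -> float:
    """Precision of the algorithm's result as a percentage of the optimal plan's."""
    r_exact = list(r_exact)
    p_opt = precision(r_opt, r_exact)
    if p_opt == 0:
        return 0.0
    return 100.0 * precision(r_alg, r_exact) / p_opt


# Hypothetical result sets for a top-2 query.
print(percent_of_optimal(["d", "a"], ["d", "t"], ["d", "t"]))
```

Normalizing by the optimal plan rather than by Rexact separates what the budget makes impossible from what the heuristic misses.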
25 Results - Sorted Accesses
TREC, k = 100
- Less budget, more improvement.
26 Varied k
IMDB, B = 400
- Lower k, more improvement.
27 Number of Lists
Zipf, k = 100, B = 4000
- More lists, more improvement.
28 Results - Random Accesses
TREC, k = 100, Cr = 10
TREC, k = 100, Cr = 100
29 Related Works
- Minimize budget for optimal results
- The algorithm computes the exact results with minimum cost (Bast et al. VLDB06, Bruno et al. ICDE02, Chang et al. SIGMOD02).
- Dual problem.
- Anytime top-k
- The algorithm collects statistics during processing, which can be used to provide probabilistic guarantees at any time during processing (Arai et al. VLDB07).
- Does not perform any optimization.
- Approximate top-k
- Approximate results with probabilistic guarantees (Theobald et al. VLDB04, Fagin et al. 2001).
30 Conclusions
- First attempt to deal with budget constraints.
- For SA only, average precision around 70%.
- Tradeoff between RAs and SAs: when the cost of an RA is relatively low, schedules that include RAs improve.
31 Thank You!
33 Top-k query
- Given a set of n objects and m scoring lists sorted in decreasing order, find the top-k objects according to a scoring function f
- top-k: a set T of k objects such that f(rj1, .., rjm) ≤ f(ri1, .., rim) for every object Xi in T and every object Xj not in T
- Assumption: the scoring function f is monotone
- f(r1, .., rm) ≤ f(r1', .., rm') if ri ≤ ri' for all i
- Two access modes
- sorted access: Cs
- random access: Cr
- Objective: compute the top-k with the minimum cost
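The definition above can be stated directly as a brute-force computation that reads every list in full, which is the baseline that threshold-style algorithms try to avoid. This is a sketch only: it assumes each list is given as a mapping from object to score, treats a missing entry as 0, and uses SUM as the monotone function; `exact_top_k` is a hypothetical name.

```python
from typing import Callable, Dict, Iterable, List


def exact_top_k(lists: List[Dict[str, float]], k: int,
                f: Callable[[Iterable[float]], float] = sum) -> List[str]:
    """Aggregate every object's scores with f and return the k best objects."""
    objects = set().union(*lists)
    total = {o: f(lst.get(o, 0.0) for lst in lists) for o in objects}
    # Sort by descending aggregate score; break ties by object name.
    return sorted(total, key=lambda o: (-total[o], o))[:k]


# Visible entries of two example lists; objects absent from a list count as 0 here.
R1 = {"a": 0.9, "b": 0.6, "c": 0.5, "d": 0.4}
R2 = {"d": 0.87, "a": 0.85, "f": 0.25, "c": 0.2}
print(exact_top_k([R1, R2], k=2))
```

Because f is monotone, an object that is at least as good in every list can never rank below another, which is what lets sorted access prune without reading everything.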
34 Sorted Accesses
- Observations
- An object with high scores has a higher potential to be part of the top-k.
- An object with mediocre scores does not help.
Prefer high scores
35 Example
useless
36 Applications
- Mobile Applications
- Highly impatient users, need fast results.
- Mediation Systems
- Achieve high query throughput.
- Online analytics (e.g. logs)
- Achieve high query throughput.
37 Motivating Example
Query throughput
Allocate time for each query
Given queries per time unit
38 Terminology
- Sorted Access
- Random Access
- highi
- Top-k queue
- Candidates queue
- mink
- worstScore(d)
- bestScore(d)
39 Efficient Offline Solution - Sorted
- Goal: find a trace t within budget B that has optimal precision
B = 5
      L1  L2
t1    0   5
t2    1   4
t3    2   3
t4    3   2
t5    4   1
t6    5   0
Denoted as ROPT
40 Efficient Offline Solution - Sorted
- Goal: find a trace t within budget B that has optimal precision
B = 5
      L1  L2
t1    0   5
t2    1   4
t3    2   3
t4    3   2
t5    4   1
t6    5   0
- Feasible for k up to 100, and m up to 10.
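The trace table above lists every way to split B = 5 sorted accesses over two lists. A brute-force sketch of that offline search follows; it is an illustration, not the efficient method the slide refers to. It assumes that REXACT is known, that results are ranked by the sum of scores seen so far, and that only the visible entries of the two lists from the budget example exist; all function names are hypothetical.

```python
from typing import Dict, Iterator, List, Set, Tuple

ScoredList = List[Tuple[str, float]]


def traces(m: int, budget: int) -> Iterator[Tuple[int, ...]]:
    """All depth vectors (d1, .., dm) whose depths sum to the budget."""
    if m == 1:
        yield (budget,)
        return
    for d in range(budget + 1):
        for rest in traces(m - 1, budget - d):
            yield (d,) + rest


def top_k_for_trace(lists: List[ScoredList], trace: Tuple[int, ...], k: int) -> List[str]:
    worst: Dict[str, float] = {}
    for lst, depth in zip(lists, trace):
        for obj, score in lst[:depth]:
            worst[obj] = worst.get(obj, 0.0) + score
    return sorted(worst, key=lambda o: worst[o], reverse=True)[:k]


def best_trace(lists: List[ScoredList], budget: int, k: int,
               exact: Set[str]) -> Tuple[int, ...]:
    """Trace within the budget whose top-k overlaps REXACT the most."""
    def prec(trace: Tuple[int, ...]) -> float:
        return len(set(top_k_for_trace(lists, trace, k)) & exact) / k
    return max(traces(len(lists), budget), key=prec)


R1 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9),
      ("x", 0.5), ("y", 0.4), ("z", 0.2)]
R2 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8),
      ("e", 0.7), ("t", 0.6), ("f", 0.4)]
print(list(traces(2, 5)))
print(best_trace([R1, R2], budget=10, k=2, exact={"d", "t"}))
```

On these lists an uneven split of the same budget of 10 reaches both exact results, whereas an even split does not, which is the motivation for searching over traces.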
41 Efficient Offline Solution - Sorted
- Proof (by contradiction)
- Assume that t does not exist, and choose a trace s that is within the budget and has optimal precision.
- Assume s' with s'i being the largest position of Pi less than or equal to si.
- By construction, the score of any object in S' is the same as in S.
42 Fair Heuristic
Runs in batches
43 Efficient Offline Solution - Random
Top-k
o1, S
o2, S
o3, S
o4, S
o10, S
o14, S
.
44 Motivation
- Many applications work in budget-constrained environments. Still, they wish to perform top-k queries.
[Figure labels: User query, Mediator, Engine, Budget-aware query processing, Servers]
45 Future work
- Different access costs for different lists
- Time-aware top-k
- Top-k with budget constraints for P2P