Title: Top-k Query Processing
1 Top-k Query Processing
- Optimal aggregation algorithms for middleware
- Ronald Fagin, Amnon Lotem, and Moni Naor
Based on the presentation of Wesley Sebrechts and Joost Voordouw. Modified by Vagelis Hristidis
2 Why top-k query processing
- Multimedia brings fuzzy data
- Attribute values are graded, typically in [0,1]
- No clear boundary between answer / no answer
- A query in a multimedia database means combining graded attributes
- Combine attributes by an aggregation function
- The aggregation function gives the overall grade of an object
- Return the k objects with the highest overall grade
Example
3 Top-k query processing
Top-k query processing = finding the k objects that have the highest overall grades
- How? Which algorithms?
- Fagin's Algorithm (FA)
- Threshold Algorithm (TA)
- Which is the best algorithm?
- Keep in mind: the database system serves as middleware
- Multimedia (objects) may be kept in different subsystems
- e.g. photoDB, videoDB, search engine
- Take into account the limitations of these subsystems
4 Example
- Simple database model
- Simple query
- Explaining Fagin's Algorithm (FA)
- Finding top-k with FA
- Explaining Threshold Algorithm (TA)
- Finding top-k with TA
5 Example: Simple database model
Sorted L1
Sorted L2
6 Example: Simple query
Find the top 2 (k = 2) objects for the following query executed on the middleware: A1 AND A2 (e.g. color = red AND shape = round)
A1 AND A2 as a query to the middleware results in the middleware combining the grades of A1 and A2 by min(A1, A2)
- Aggregation function
- function that gives objects an overall grade based on attribute grades
- examples: min, max functions
- Monotonicity!
7 Example: Fagin's Algorithm
- STEP 1
- Read attributes from every sorted list
- Stop when k objects have been seen in common from all lists
Grades seen so far (? = not yet seen):
ID   A1    A2
a    0.9   0.85
d    ?     0.9
b    0.8   0.7
c    0.72  ?
8 Example: Fagin's Algorithm
- STEP 2
- Random access to find missing grades
ID   A1    A2
a    0.9   0.85
d    0.6   0.9
b    0.8   0.7
c    0.72  0.2
9 Example: Fagin's Algorithm
- STEP 3
- Compute the grades of the seen objects.
- Return the k highest graded objects.
L1: (a, 0.9), (b, 0.8), (c, 0.72), . . . ., (d, 0.6)
L2: (d, 0.9), (a, 0.85), (b, 0.7), . . . .
ID   A1    A2    min(A1,A2)
a    0.9   0.85  0.85
d    0.6   0.9   0.6
b    0.8   0.7   0.7
c    0.72  0.2   0.2
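The three FA steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes a database of only the four objects a, b, c, d (the slides elide part of each list), uses a dictionary per list as a stand-in for random access, and the function name `fagin_top_k` is hypothetical.

```python
import heapq

# Assumed four-object version of the example; each list is sorted by grade, descending.
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]


def fagin_top_k(lists, k, t=min):
    """Return the k best (object, overall grade) pairs, following FA's three steps."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stand-in for random access by object ID
    seen = {}  # object -> indexes of the lists where sorted access has shown it
    depth = 0
    # Step 1: parallel sorted access until k objects have been seen in every list.
    while sum(len(s) == m for s in seen.values()) < k and depth < len(lists[0]):
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen.setdefault(obj, set()).add(i)
        depth += 1
    # Step 2: random access fills in the grades sorted access did not deliver.
    # Step 3: aggregate every seen object and keep the k highest overall grades.
    graded = [(obj, t([table[obj] for table in lookup])) for obj in seen]
    return heapq.nlargest(k, graded, key=lambda pair: pair[1])


print(fagin_top_k([L1, L2], k=2))  # [('a', 0.85), ('b', 0.7)]
```

On this data the loop stops at depth 3, after a and b have appeared in both lists, which matches the walkthrough: d and c still need one random access each.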
10 New idea: Threshold Algorithm (TA)
- Read all grades of an object as soon as it is seen under sorted access
- No need to wait until the lists give k common objects
- Do sorted access (and the corresponding random accesses) until you have seen the top k answers
- How do we know that the grades of seen objects are higher than the grades of unseen objects?
- Predict the maximum possible grade of unseen objects
[Figure: sorted lists L1 and L2. Seen in L1: a 0.9, b 0.8, c 0.72. Threshold value: T = min(0.72, 0.7) = 0.7. Possibly unseen entries below the threshold: f 0.6, f 0.65, d 0.6.]
11 Example: Threshold Algorithm
Step 1
- Parallel sorted access to each list
- For each object seen:
- get all grades by random access
- determine min(A1,A2)
- if amongst the 2 highest seen, keep in buffer
ID   A1    A2    min(A1,A2)
a    0.9   0.85  0.85
d    0.6   0.9   0.6
12 Example: Threshold Algorithm
Step 2
- Determine the threshold value based on the grades last seen under sorted access in each list: T = min(L1, L2)
- 2 objects with overall grade ≥ threshold value? Then stop; else go to the next entry position in the sorted lists and repeat step 1
ID   A1    A2    min(A1,A2)
a    0.9   0.85  0.85
d    0.6   0.9   0.6
T = min(0.9, 0.9) = 0.9
13 Example: Threshold Algorithm
Step 1 (again)
- Parallel sorted access to each list
- For each object seen:
- get all grades by random access
- determine min(A1,A2)
- if amongst the 2 highest seen, keep in buffer
ID   A1    A2    min(A1,A2)
a    0.9   0.85  0.85
d    0.6   0.9   0.6
b    0.8   0.7   0.7
14 Example: Threshold Algorithm
Step 2 (again)
- Determine the threshold value based on the grades last seen under sorted access in each list: T = min(L1, L2)
- 2 objects with overall grade ≥ threshold value? Then stop; else go to the next entry position in the sorted lists and repeat step 1
ID   A1    A2    min(A1,A2)
a    0.9   0.85  0.85
b    0.8   0.7   0.7
T = min(0.8, 0.85) = 0.8
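15 Example: Threshold Algorithm
Situation at stopping condition
ID   A1    A2    min(A1,A2)
a    0.9   0.85  0.85
b    0.8   0.7   0.7
T = min(0.72, 0.7) = 0.7
Both buffered objects have an overall grade ≥ T, so TA stops and returns a and b.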
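The TA walkthrough can be reproduced with a short Python sketch. It is an illustration under assumptions, not the authors' code: the database is taken to contain only the objects a, b, c, d, a dictionary per list stands in for random access, and `threshold_top_k` is a hypothetical name. It also returns the threshold computed in each round so the values can be compared with the slides.

```python
# Assumed four-object version of the example; each list is sorted by grade, descending.
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]


def threshold_top_k(lists, k, t=min):
    """Return (top-k pairs, thresholds per round) using the Threshold Algorithm."""
    lookup = [dict(lst) for lst in lists]  # stand-in for random access by object ID
    buffer = {}      # object -> overall grade; never holds more than k objects
    thresholds = []  # threshold value T after each round of sorted access
    for depth in range(len(lists[0])):
        last_grades = []
        for lst in lists:
            # Step 1: sorted access, then random access for the object's other grades.
            obj, grade = lst[depth]
            last_grades.append(grade)
            if obj not in buffer:
                buffer[obj] = t([table[obj] for table in lookup])
                if len(buffer) > k:
                    # Keep only the k highest overall grades seen so far.
                    del buffer[min(buffer, key=buffer.get)]
        # Step 2: the threshold aggregates the grades last seen in each list.
        thresholds.append(t(last_grades))
        if len(buffer) == k and min(buffer.values()) >= thresholds[-1]:
            break
    top = sorted(buffer.items(), key=lambda pair: pair[1], reverse=True)
    return top, thresholds


top, thresholds = threshold_top_k([L1, L2], k=2)
print(top)         # [('a', 0.85), ('b', 0.7)]
print(thresholds)  # [0.9, 0.8, 0.7]
```

The buffer never grows beyond k entries, which is the bounded-buffer property of TA; the price is that an evicted object would need a fresh random access if it showed up again.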
16 Comparison of Fagin's and Threshold Algorithm
- TA never sees more objects than FA
- TA stops at least as early as FA
- When we have seen k objects in common in FA, their grades are higher than or equal to the threshold in TA
- TA may perform more random accesses than FA
- In TA, (m-1) random accesses for each object
- In FA, random accesses are done at the end, only for missing grades
- TA requires only bounded buffer space (k)
- At the expense of more random seeks
- FA makes use of unbounded buffers
17 The best algorithm
- Which algorithm is the best: TA or FA?
- Define "best"
- middleware cost
- concept of instance optimality
- Consider
- wild guesses
- aggregation function characteristics
- monotone, strictly monotone, strict
- database restrictions
- distinctness property
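18 The best algorithm: concept of optimality
$\mathbf{A}$ = class of algorithms; $A \in \mathbf{A}$ represents an algorithm
$\mathbf{D}$ = legal inputs to algorithms (databases); $D \in \mathbf{D}$ represents a database
middleware cost = cost for processing data subsystems = $s \cdot c_S + r \cdot c_R$ (s sorted accesses at cost $c_S$ each, r random accesses at cost $c_R$ each)
Cost(A, D) = middleware cost when running algorithm A over database D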
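As a small worked instance of the middleware cost formula, the sketch below plugs in access counts tallied by hand from the FA and TA examples earlier in the deck. The unit costs are hypothetical, the helper name `middleware_cost` is made up for illustration, and the TA count assumes one random access per distinct object seen.

```python
def middleware_cost(sorted_accesses, random_accesses, c_s, c_r):
    """Cost(A, D) = s * cS + r * cR for one run of an algorithm over a database."""
    return sorted_accesses * c_s + random_accesses * c_r


C_S, C_R = 1, 5  # hypothetical unit costs: a random access is 5x a sorted access

# FA on the example: depth 3 in 2 lists = 6 sorted accesses, 2 missing grades fetched.
fa_cost = middleware_cost(6, 2, C_S, C_R)
# TA on the example: 6 sorted accesses, one random access for each of a, d, b, c (assumed).
ta_cost = middleware_cost(6, 4, C_S, C_R)

print(fa_cost, ta_cost)  # 16 26
```

With these assumed costs TA comes out more expensive on this tiny database, which illustrates why the relative size of cR and cS matters when comparing the two algorithms.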
19 The best algorithm: instance optimality and wild guesses
- Intuitively: B instance optimal = always the best algorithm in $\mathbf{A}$ = always optimal
- In reality, "always" does not quite mean always: we will exclude algorithms that make wild guesses
- Wild guess = random access on an object not previously encountered by sorted access
- In practice not possible
- The database needs to know the ID to do random access
- If wild guesses are allowed in $\mathbf{A}$, then in general no algorithm can be instance optimal
- Wild guesses can find the top-k objects with k·m random accesses (k objects, m lists)
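For reference, the intuition "always the best algorithm" is usually made precise along the following lines. This is a sketch of the standard formulation from the instance-optimality literature, written with the deck's symbols; the constants $c$ and $c'$ are notation introduced here.

```latex
% B is instance optimal over the class of algorithms \mathbf{A} and the class
% of databases \mathbf{D} if B \in \mathbf{A} and there are constants c, c' with
\[
  \mathrm{Cost}(B, D) \;\le\; c \cdot \mathrm{Cost}(A, D) + c'
  \qquad \text{for every } A \in \mathbf{A} \text{ and every } D \in \mathbf{D}.
\]
% The constant c is called the optimality ratio.
```

So "best" means best up to a constant factor on every single database, not only in the worst or average case.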
20 The best algorithm: aggregation functions
- Aggregation function t combines object grades into the object's overall grade
- $x_1, \dots, x_m \mapsto t(x_1, \dots, x_m)$
- Monotone
- $t(x_1, \dots, x_m) \le t(x'_1, \dots, x'_m)$ if $x_i \le x'_i$ for every i
- Strictly monotone
- $t(x_1, \dots, x_m) < t(x'_1, \dots, x'_m)$ if $x_i < x'_i$ for every i
- Strict
- $t(x_1, \dots, x_m) = 1$ precisely when $x_i = 1$ for every i
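The three properties can be spot-checked numerically. The sketch below tests each definition on a small grid of grades for m = 2; passing such a check is only evidence, not a proof, and the helper names and the choice of grid are assumptions made for illustration.

```python
from itertools import product

GRID = [0.0, 0.5, 1.0]                 # sample grades in [0, 1]
POINTS = list(product(GRID, repeat=2))  # all grade vectors for m = 2


def is_monotone(t):
    """t(x) <= t(y) whenever x_i <= y_i for every i (checked on the grid only)."""
    return all(t(x) <= t(y) for x in POINTS for y in POINTS
               if all(a <= b for a, b in zip(x, y)))


def is_strictly_monotone(t):
    """t(x) < t(y) whenever x_i < y_i for every i (checked on the grid only)."""
    return all(t(x) < t(y) for x in POINTS for y in POINTS
               if all(a < b for a, b in zip(x, y)))


def is_strict(t):
    """t(x) == 1 precisely when every x_i == 1 (checked on the grid only)."""
    return all((t(x) == 1.0) == all(a == 1.0 for a in x) for x in POINTS)


def average(x):
    return sum(x) / len(x)


def constant(x):
    return 0.5


for name, t in [("min", min), ("max", max), ("avg", average), ("const", constant)]:
    print(name, is_monotone(t), is_strictly_monotone(t), is_strict(t))
# min True True True
# max True True False
# avg True True True
# const True False False
```

max fails strictness because a single grade of 1 already yields an overall grade of 1, and a constant function is monotone without being strictly monotone.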
21 The best algorithm: database restrictions
Distinctness property: a database has no (sorted) attribute list in which two objects have the same grade
22 The best algorithm: Fagin's Algorithm
- Database with N objects, each with m attributes
- Orderings of lists are independent
- FA finds top-k with middleware cost $O(N^{(m-1)/m} k^{1/m})$
- FA is optimal with high probability in the worst case for strict monotone aggregation functions
23 Function L, function S
- Function L: each attribute list → a permutation of the N objects
- Function S: each attribute list → a graded set of N grades
- Each output of L can have multiple outputs of S
- Worst case: only consider the maximum middleware cost
- $\mathbf{A}$ = all possible algorithms
- cost(A, L) = max middleware cost (given L and all possible S)
24 The best algorithm: Threshold Algorithm
- TA is instance optimal (always optimal) for every monotone aggregation function, over every database (excluding wild guesses)
- optimal in a much stronger sense than Fagin's Algorithm
- If the aggregation function is strict monotone:
- Optimality ratio = $m + m(m-1) c_R / c_S$ = best possible (m attributes)
- If random access is not possible ($c_R = 0$) → optimality ratio = m
- If sorted access is not possible ($c_S = 0$) → optimality ratio = infinite
- → TA not instance optimal
- TA is instance optimal (always optimal) for every strictly monotone aggregation function, over every database (including wild guesses) that satisfies the distinctness property
- Optimality ratio = $c \cdot m^2$ with $c = \max\{c_R/c_S,\ c_S/c_R\}$
25 Extending TA
- What if sorted access is restricted? e.g. use distance database
- TAz
- What if random access is not possible? e.g. web search engine
- No Random Access Algorithm (NRA)
- What if we want only the approximate top k objects?
- TAθ
- What if we consider the relative costs of random and sorted access?
- Combined Algorithm (CA) (between TA and NRA)
26 NRA
- What if we also want the scores?
27 Combined Algorithm (CA)
CA is instance optimal
28 Approximation
- A θ-approximation (θ > 1) to the top k answers for the aggregation function t is a collection of k objects (each along with its grade) such that for each y among these k objects and each z not among these k objects, $\theta \cdot t(y) \ge t(z)$
- TAθ: as soon as at least k objects have been seen whose grade is at least equal to threshold/θ, then halt
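The relaxed halting rule can be sketched by changing one comparison in a TA loop. The code below is an illustrative sketch on an assumed four-object version of the earlier example (objects a, b, c, d only); `ta_theta` is a hypothetical name, and a dictionary per list stands in for random access.

```python
# Assumed four-object version of the example; each list is sorted by grade, descending.
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]


def ta_theta(lists, k, theta, t=min):
    """Return (k answers, rounds of sorted access) with the threshold/theta halting rule."""
    lookup = [dict(lst) for lst in lists]  # stand-in for random access by object ID
    buffer = {}  # object -> overall grade; holds at most k objects
    rounds = 0
    for depth in range(len(lists[0])):
        rounds += 1
        last_grades = []
        for lst in lists:
            obj, grade = lst[depth]
            last_grades.append(grade)
            if obj not in buffer:
                buffer[obj] = t([table[obj] for table in lookup])
                if len(buffer) > k:
                    del buffer[min(buffer, key=buffer.get)]
        threshold = t(last_grades)
        # Relaxed rule: halt once k seen objects reach threshold / theta.
        if len(buffer) == k and min(buffer.values()) >= threshold / theta:
            break
    top = sorted(buffer.items(), key=lambda pair: pair[1], reverse=True)
    return top, rounds


print(ta_theta([L1, L2], k=2, theta=1.2))  # ([('a', 0.85), ('b', 0.7)], 2)
print(ta_theta([L1, L2], k=2, theta=1.0))  # ([('a', 0.85), ('b', 0.7)], 3)
```

With θ = 1.2 the loop halts one round earlier than with θ = 1, because 0.7 already clears 0.8 / 1.2; on this data the answer happens to be the same, but in general only the θ-approximation guarantee holds.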