Topk Query Processing - PowerPoint PPT Presentation

About This Presentation

Title:

Topk Query Processing

Description:

Ronald Fagin, Amnon Lotem, and Moni Naor ... Example: Fagin's Algorithm. Read all grades of an object once seen from a sorted access ... – PowerPoint PPT presentation

Slides: 30

Provided by: W8699

Transcript and Presenter's Notes

Title: Topk Query Processing

1
Top-k Query Processing

Optimal aggregation algorithms for middleware
Ronald Fagin, Amnon Lotem, and Moni Naor

Based on the presentation of Wesley Sebrechts,
Joost Voordouw. Modified by Vagelis Hristidis
2
Why top-k query processing

Multimedia brings fuzzy data
Attribute values are graded, typically in [0, 1]
No clear boundary between answer / no answer
A query in a multimedia database means combining
graded attributes
Combine attributes by aggregation function
Aggregation function gives overall grade of
object
Return k objects with highest overall grade

Example
3
Top-k query processing
Top-k query processing: finding the k objects
that have the highest overall grades

How? Which algorithms?
Fagin's Algorithm (FA)
Threshold Algorithm (TA)
Which is the best algorithm?

Keep in mind: the database system serves as
middleware
Multimedia (objects) may be kept in different
subsystems
e.g. photoDB, videoDB, search engine
Take into account the limitations of these
subsystems

4
Example

Simple database model
Simple query
Explaining Fagin's Algorithm (FA)
Finding top-k with FA
Explaining Threshold Algorithm (TA)
Finding top-k with TA

5
Example: Simple Database Model
Sorted L1
Sorted L2
6
Example: Simple Query
Find the top 2 (k = 2) objects for the following
query executed on the middleware: A1 AND A2
(e.g. color = red AND shape = round)
A1 AND A2 as a query to the middleware results in
the middleware combining the grades of A1 and A2
by min(A1, A2)

Aggregation function
function that gives objects an overall grade
based on attribute grades
Examples: min and max functions
Monotonicity!
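
As a baseline for the example query, here is a minimal Python sketch that grades every object with min and scans the whole database for the top k. The four-object data set is an assumption reconstructed from the grades that appear in the example figures (entries elided in the slides are left out), and the names `GRADES` and `naive_top_k` are illustrative only.

```python
import heapq

# Grades (A1, A2) per object; toy data reconstructed from the example
# figures, which is an assumption (elided entries are left out).
GRADES = {
    "a": (0.9, 0.85),
    "b": (0.8, 0.7),
    "c": (0.72, 0.2),
    "d": (0.6, 0.9),
}

def naive_top_k(grades, k, t=min):
    """Grade every object with aggregation function t and keep the k best."""
    overall = {obj: t(xs) for obj, xs in grades.items()}
    return heapq.nlargest(k, overall.items(), key=lambda item: item[1])

print(naive_top_k(GRADES, 2))
```

This scan touches every grade of every object, which is exactly the cost that FA and TA try to avoid.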

7
Example: Fagin's Algorithm

STEP 1
Read attributes from every sorted list
Stop when k objects have been seen in common
from all lists

ID   A1 (L1)   A2 (L2)
a    0.9       0.85
d    -         0.9
b    0.8       0.7
c    0.72      -
8
Example: Fagin's Algorithm

STEP 2
Random access to find missing grades

ID   A1 (L1)   A2 (L2)
a    0.9       0.85
d    0.6       0.9
b    0.8       0.7
c    0.72      0.2
9
Example: Fagin's Algorithm

STEP 3
Compute the grades of the seen objects.
Return the k highest graded objects.

L1 (sorted): (a, 0.9), (b, 0.8), (c, 0.72), . . . . , (d, 0.6)

ID   A1 (L1)   A2 (L2)   min(A1, A2)
a    0.9       0.85      0.85
d    0.6       0.9       0.6
b    0.8       0.7       0.7
c    0.72      0.2       0.2
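
The three steps of FA can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the lists `L1` and `L2` are an assumption reconstructed from the grades in the figures (elided entries are left out), random access is simulated with dictionary lookups, and `fagin_top_k` is a hypothetical name.

```python
# Sorted lists of (object, grade), best grade first. Toy data reconstructed
# from the example figures (an assumption; elided entries are left out).
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

def fagin_top_k(lists, k, t=min):
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    seen = {}                               # object -> {list index: grade}
    depth = 0
    # Step 1: parallel sorted access until k objects were seen in every list.
    while sum(len(g) == m for g in seen.values()) < k and depth < len(lists[0]):
        for i, lst in enumerate(lists):
            obj, grade = lst[depth]
            seen.setdefault(obj, {})[i] = grade
        depth += 1
    # Step 2: random access for the grades that are still missing.
    random_accesses = 0
    for obj, grades in seen.items():
        for i in range(m):
            if i not in grades:
                grades[i] = lookup[i][obj]
                random_accesses += 1
    # Step 3: compute the overall grades and return the k best.
    overall = {obj: t(g.values()) for obj, g in seen.items()}
    top = sorted(overall.items(), key=lambda item: item[1], reverse=True)[:k]
    return top, depth, random_accesses

print(fagin_top_k([L1, L2], 2))
```

On this toy data the sketch stops sorted access at depth 3, does two random accesses (d in L1, c in L2), and returns a and b.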
10
New Idea !!! Threshold Algorithm (TA)

Read all grades of an object as soon as it is
seen under sorted access
No need to wait until the lists give k common
objects
Do sorted access (and the corresponding random
accesses) until you have seen the top k answers.
How do we know that the grades of seen objects are
higher than the grades of unseen objects?
Predict the maximum possible grade of unseen objects

[Figure: sorted lists L1 and L2, each split into a seen part and a possibly unseen part. Seen in L1: a 0.9, b 0.8, c 0.72. Possibly unseen: f 0.6, f 0.65, d 0.6. Threshold value: T = min(0.72, 0.7) = 0.7]
11
Example: Threshold Algorithm
Step 1
- Parallel sorted access to each list
For each object seen:
- get all grades by random access
- determine min(A1, A2)
- amongst the 2 highest seen? → keep in buffer

ID   A1    A2     min(A1, A2)
a    0.9   0.85   0.85
d    0.6   0.9    0.6
12
Example: Threshold Algorithm
Step 2
- Determine the threshold value based on the
objects currently seen under sorted access:
T = min(L1, L2)
- 2 objects with overall grade ≥ threshold value?
→ stop; else go to the next entry position in the
sorted lists and repeat step 1

Buffer: a 0.85, d 0.6
T = min(0.9, 0.9) = 0.9
13
Example: Threshold Algorithm
Step 1 (again)
- Parallel sorted access to each list
For each object seen:
- get all grades by random access
- determine min(A1, A2)
- amongst the 2 highest seen? → keep in buffer

ID   A1    A2     min(A1, A2)
a    0.9   0.85   0.85
d    0.6   0.9    0.6
b    0.8   0.7    0.7
14
Example: Threshold Algorithm
Step 2 (again)
- Determine the threshold value based on the
objects currently seen: T = min(L1, L2)
- 2 objects with overall grade ≥ threshold value?
→ stop; else go to the next entry position in the
sorted lists and repeat step 1

Buffer: a 0.85, b 0.7
T = min(0.8, 0.85) = 0.8
15
Example: Threshold Algorithm
Situation at stopping condition

Buffer: a 0.85, b 0.7
T = min(0.72, 0.7) = 0.7
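
The two TA steps walked through above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the lists `L1` and `L2` are reconstructed from the grades in the figures (elided entries are left out), random access is simulated with dictionary lookups, and `threshold_top_k` is a hypothetical name.

```python
import heapq

# Sorted lists of (object, grade), best grade first. Toy data reconstructed
# from the example figures (an assumption; elided entries are left out).
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

def threshold_top_k(lists, k, t=min):
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    buffer = {}                             # at most k best objects seen so far
    threshold = None
    for depth in range(len(lists[0])):
        # Step 1: sorted access to each list; fetch all grades of the object
        # by random access and compute its overall grade.
        for lst in lists:
            obj, _ = lst[depth]
            buffer[obj] = t(table[obj] for table in lookup)
        # Keep only the k highest graded objects (bounded buffer).
        buffer = dict(heapq.nlargest(k, buffer.items(), key=lambda item: item[1]))
        # Step 2: threshold from the grades last seen under sorted access.
        threshold = t(lst[depth][1] for lst in lists)
        if len(buffer) == k and min(buffer.values()) >= threshold:
            break
    top = sorted(buffer.items(), key=lambda item: item[1], reverse=True)
    return top, threshold, depth + 1

print(threshold_top_k([L1, L2], 2))
```

On this toy data the thresholds are 0.9, 0.8 and 0.7, and the sketch halts at depth 3 with a and b in the buffer, matching the stopping situation on the slide.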
16

Comparison of Fagin's Algorithm and the Threshold Algorithm
TA sees no more objects than FA
TA stops at least as early as FA
When we have seen k objects in common in FA,
their grades are higher than or equal to the
threshold in TA.
TA may perform more random accesses than FA
In TA, (m-1) random accesses for each object
In FA, random accesses are done at the end, only
for missing grades
TA requires only bounded buffer space (k)
At the expense of more random seeks
FA makes use of unbounded buffers
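
To make the random-access comparison concrete, the following Python sketch counts sorted and random accesses for both algorithms. It rests on several assumptions: the toy lists are reconstructed from the example figures, TA is assumed not to cache grades (so every sorted access triggers m-1 random accesses), k is at most the number of objects, and both function names are hypothetical.

```python
# Toy data reconstructed from the example figures (an assumption).
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

def fa_access_counts(lists, k):
    """Return (sorted accesses, random accesses) made by FA."""
    m, seen, depth = len(lists), {}, 0
    while sum(len(s) == m for s in seen.values()) < k:
        for i, lst in enumerate(lists):
            seen.setdefault(lst[depth][0], set()).add(i)
        depth += 1
    # Random access only for grades still missing at the end.
    return depth * m, sum(m - len(s) for s in seen.values())

def ta_access_counts(lists, k, t=min):
    """Return (sorted accesses, random accesses) made by TA without caching."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]
    best, depth = [], 0
    while True:
        for lst in lists:
            obj = lst[depth][0]
            best.append((t(table[obj] for table in lookup), obj))
        best = sorted(set(best), reverse=True)[:k]   # bounded buffer of size k
        threshold = t(lst[depth][1] for lst in lists)
        depth += 1
        if len(best) == k and best[-1][0] >= threshold:
            break
    # Each sorted access is followed by m-1 random accesses.
    return depth * m, depth * m * (m - 1)

print(fa_access_counts([L1, L2], 2), ta_access_counts([L1, L2], 2))
```

On this toy data both algorithms make 6 sorted accesses, but FA needs 2 random accesses while TA (without caching) makes 6.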

17
The best algorithm

Which algorithm is the best: TA or FA?
Define "best"
middleware cost
concept of instance optimality
Consider
wild guesses
characteristics of aggregation functions
monotone, strictly monotone, strict
database restrictions
distinctness property

18
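The best algorithm: concept of optimality
$\mathbf{A}$: class of algorithms; $A \in \mathbf{A}$ represents an algorithm
$\mathbf{D}$: legal inputs to algorithms (databases); $D \in \mathbf{D}$ represents a database
Middleware cost: cost for processing data subsystems, $s \cdot c_S + r \cdot c_R$
($s$ sorted accesses at cost $c_S$ each, $r$ random accesses at cost $c_R$ each)
$\mathrm{Cost}(A, D)$: middleware cost when running algorithm $A$ over database $D$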
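
A small Python sketch of the middleware cost, reading it as s sorted accesses at cost cS each plus r random accesses at cost cR each. The function name, the access counts and the unit costs in the usage line are illustrative assumptions, not values from the slides.

```python
def middleware_cost(s, r, c_s=1.0, c_r=1.0):
    """Middleware cost s*cS + r*cR for s sorted and r random accesses."""
    return s * c_s + r * c_r

# Hypothetical run: 6 sorted and 2 random accesses, with random access
# assumed to be five times as expensive as sorted access.
print(middleware_cost(6, 2, c_s=1.0, c_r=5.0))
```

With these assumed numbers the cost is 6·1 + 2·5 = 16; the same sorted depth with 6 random accesses would cost 36, which is why the ratio cR/cS matters when comparing algorithms.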
19
The best algorithm: instance optimality and wild
guesses

Intuitively: B instance optimal = always the
best algorithm in the class A
= always optimal
In reality, is "always" really always? → we will
exclude algorithms that make wild guesses
Wild guess: random access on an object not
previously encountered
by sorted access
In practice not possible
The database needs to know the ID to do random access
If wild guesses are allowed in A, then no algorithm
can be instance optimal
Wild guesses can find the top-k objects with k·m
random accesses
(k objects, m lists)

20
The best algorithm: aggregation functions

Aggregation function t combines object grades
into the object's overall grade
$x_1, \dots, x_m \mapsto t(x_1, \dots, x_m)$
Monotone
$t(x_1, \dots, x_m) \le t(x'_1, \dots, x'_m)$ if $x_i \le x'_i$ for every $i$
Strictly monotone
$t(x_1, \dots, x_m) < t(x'_1, \dots, x'_m)$ if $x_i < x'_i$ for every $i$
Strict
$t(x_1, \dots, x_m) = 1$ precisely when $x_i = 1$ for every $i$
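
The monotone and strict properties can be spot-checked numerically. The Python sketch below tests them on a coarse grid of grades in [0, 1]; a grid check is only evidence, not a proof, and the helper names `is_monotone` and `is_strict` are hypothetical.

```python
from itertools import product

def _grid_points(m, steps):
    # Evenly spaced grades in [0, 1]; with steps=5 these are exact floats.
    grid = [i / (steps - 1) for i in range(steps)]
    return list(product(grid, repeat=m))

def is_monotone(t, m=2, steps=5):
    """Check t(x) <= t(y) whenever x_i <= y_i for every i (on the grid)."""
    points = _grid_points(m, steps)
    return all(
        t(x) <= t(y)
        for x in points
        for y in points
        if all(xi <= yi for xi, yi in zip(x, y))
    )

def is_strict(t, m=2, steps=5):
    """Check t(x) == 1 precisely when x_i == 1 for every i (on the grid)."""
    return all(
        (t(x) == 1) == all(xi == 1 for xi in x)
        for x in _grid_points(m, steps)
    )

print(is_monotone(min), is_strict(min))   # min passes both checks
print(is_monotone(max), is_strict(max))   # max is monotone but not strict
```

max fails strictness because max(1, 0) = 1 although not every grade is 1.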

21
The best algorithm: database restrictions
Distinctness property: a database has no (sorted)
attribute list in which two objects have the same
grade
22
The best algorithm: Fagin's Algorithm

- Database with N objects, each with m
attributes.
- Orderings of lists are independent
FA finds top-k with middleware cost
$O(N^{(m-1)/m} k^{1/m})$
FA is optimal with high probability in the worst
case for strict monotone aggregation
functions
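
To get a feel for the FA cost bound, the Python sketch below evaluates its growth term N^((m-1)/m) · k^(1/m). The constant hidden by the O-notation is dropped, the function name is hypothetical, and the sample values of N, m and k are illustrative assumptions.

```python
def fa_cost_bound(n, m, k):
    """Growth term N^((m-1)/m) * k^(1/m) of FA's cost bound (constants dropped)."""
    return n ** ((m - 1) / m) * k ** (1 / m)

# Hypothetical sizes: one million objects, two attributes, top 10.
print(round(fa_cost_bound(1_000_000, 2, 10)))
```

For these assumed sizes the term is about 3162, far below the 2,000,000 grades a full scan of both lists would read.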

Function L: each attribute list → permutation of the N objects
Function S: each attribute list → graded set of N grades
Each output of L can have multiple outputs of S
Worst case: only consider the maximum middleware
cost
A: all possible algorithms
cost(A, L) = max middleware cost (given L and
all possible S)
24
The best algorithm: Threshold Algorithm

TA is instance optimal (always optimal) for every
monotone aggregation function, over every database
(excluding wild guesses)
= optimal in a much stronger sense than Fagin's
Algorithm
If strict monotone aggregation function:
Optimality ratio $= m + m(m-1)\,c_R/c_S$ = best
possible (m attributes)
If random access costs nothing ($c_R = 0$) →
optimality ratio = m
If sorted access costs nothing ($c_S = 0$) →
optimality ratio infinite
→ TA not instance optimal
TA is instance optimal (always optimal) for every
strictly monotone aggregation function, over
every database (including wild guesses) that
satisfies the distinctness property
Optimality ratio $c\,m^2$ with $c = \max\{c_R/c_S,\ c_S/c_R\}$
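
The two optimality ratios are easy to evaluate for concrete costs. The Python sketch below assumes the readings m + m(m-1)·cR/cS and c·m² with c = max(cR/cS, cS/cR); the function names and the sample costs are illustrative assumptions.

```python
def ta_optimality_ratio(m, c_s, c_r):
    """Ratio m + m(m-1)*cR/cS (no wild guesses); requires c_s > 0."""
    return m + m * (m - 1) * c_r / c_s

def ta_ratio_with_distinctness(m, c_s, c_r):
    """Ratio c*m^2 with c = max(cR/cS, cS/cR); requires both costs > 0."""
    c = max(c_r / c_s, c_s / c_r)
    return c * m ** 2

print(ta_optimality_ratio(2, c_s=1, c_r=1))          # two lists, equal costs
print(ta_optimality_ratio(3, c_s=1, c_r=0))          # free random access
print(ta_ratio_with_distinctness(2, c_s=1, c_r=2))   # random access twice as costly
```

With equal unit costs and m = 2 the first ratio is 4; setting cR = 0 reduces it to m, and letting cS approach 0 makes it grow without bound.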

25
Extending TA

What if sorted access is restricted? e.g. use
distance database
$\mathrm{TA}_Z$
What if random access is not possible? e.g. web
search engine
No Random Access Algorithm (NRA)
What if we want only the approximate top k
objects?
$\mathrm{TA}_\theta$
What if we consider the relative costs of random and
sorted access?
Combined Algorithm (between TA and NRA)

26
NRA

What if we also want the scores?

27
Combined Algorithm (CA)
CA is instance optimal
28
Approximation

A $\theta$-approximation (for $\theta > 1$) to the top k
answers for the aggregation function t is a
collection of k objects (each along with its grade)
such that for each y among these k objects and each
z not among these k objects, $\theta \cdot t(y) \ge t(z)$
$\mathrm{TA}_\theta$: as soon as at least k objects have been
seen whose grade is at least equal to
threshold/$\theta$, then halt.
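
The relaxed halting rule can be sketched in Python by dividing the threshold by θ before the stopping test. As before this is only an illustration: the toy lists are an assumption reconstructed from the example figures, random access is simulated with dictionary lookups, and `ta_theta` is a hypothetical name.

```python
import heapq

# Toy data reconstructed from the example figures (an assumption).
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

def ta_theta(lists, k, theta, t=min):
    """TA with the relaxed rule: halt once k seen grades are >= threshold/theta."""
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    buffer = {}                             # at most k best objects seen so far
    for depth in range(len(lists[0])):
        for lst in lists:
            obj, _ = lst[depth]
            buffer[obj] = t(table[obj] for table in lookup)
        buffer = dict(heapq.nlargest(k, buffer.items(), key=lambda item: item[1]))
        threshold = t(lst[depth][1] for lst in lists)
        if len(buffer) == k and min(buffer.values()) >= threshold / theta:
            break
    top = sorted(buffer.items(), key=lambda item: item[1], reverse=True)
    return top, depth + 1

print(ta_theta([L1, L2], 2, theta=1.2))   # halts one round earlier than exact TA
print(ta_theta([L1, L2], 2, theta=1.0))   # theta = 1 behaves like plain TA
```

On this toy data θ = 1.2 stops at depth 2, because b's grade 0.7 already exceeds 0.8/1.2, while θ = 1 needs depth 3.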