Evaluating Top-k Queries over Web-Accessible Databases


Provided by: Amelie4

Learn more at: http://www1.cs.columbia.edu

Transcript and Presenter's Notes

1
Evaluating Top-k Queries over Web-Accessible
Databases

Nicolas Bruno
Luis Gravano
Amélie Marian
Columbia University

2
Top-k Queries Natural in Many Scenarios

Example: NYC Restaurant Recommendation Service.
Goal: Find best restaurants for a user:
Close to address: 2290 Broadway
Price around 25
Good rating

Query: Specification of Flexible Preferences
Answer: Best k Objects for Distance Function
3
Attributes often Handled by External Sources

MapQuest returns the distance between two
addresses.
NYTimes Review gives the price range of a
restaurant.
Zagat gives a food rating to the restaurant.

4
Top-k Query Processing Challenges

Attributes handled by external sources (e.g.,
MapQuest distance).
External sources exhibit a variety of interfaces
(e.g., NYTimes Review, Zagat).
Existing algorithms do not handle all types of
interfaces.

5
Processing Top-k Queries over Web-Accessible Data
Sources

Data and query model
Algorithms for sources with different interfaces
Our new algorithm Upper
Experimental results

6
Data Model

Top-k query: assignment of weights and target values to attributes.

Target values: close to address, preferred price, preferred rating.
Weights: combined in scoring function (in this example, price is the most important attribute).
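As a rough illustration of how weights and per-attribute scores might be combined, here is a minimal sketch that assumes a normalized weighted sum, the same shape as the "(3.scoreZagat 2.scoreMQ 1.scoreNYT)/6" formula in the example executions. The function name `combined_score` and the sample numbers are hypothetical, not taken from the slides.

```python
from typing import Sequence


def combined_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-attribute scores, each assumed to lie in [0, 1]."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


# Hypothetical object: attribute scores 0.5, 0.75, 1.0 with weights 3, 2, 1.
example = combined_score([0.5, 0.75, 1.0], [3, 2, 1])
```

Because the weights are normalized, the combined score stays in [0, 1] whenever every attribute score does.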
7
Sorted Access Source S

Return objects sorted by scores for a given
query.
Example: Zagat

GetNextS interface
S-Source access time: tS(S)
8
Random Access Source R

Return the score of a given object for a given
query.
Example: MapQuest

GetScoreR interface
R-Source access time: tR(R)
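The two access styles can be pictured as two small in-memory wrappers. This is only a sketch under stated assumptions: the class names `SortedSource` and `RandomSource`, the method names `get_next` and `get_score`, and the dictionary-backed storage are all hypothetical stand-ins for the GetNextS and GetScoreR interfaces; real sources would be remote Web services.

```python
from typing import Dict, Optional, Tuple


class SortedSource:
    """S-Source stand-in: hands out (object, score) pairs in descending score order."""

    def __init__(self, scores: Dict[str, float], access_time: float = 1.0):
        self._ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        self._pos = 0
        self.access_time = access_time  # plays the role of tS(S)

    def get_next(self) -> Optional[Tuple[str, float]]:
        """Return the next best object, or None once the source is exhausted."""
        if self._pos >= len(self._ranked):
            return None
        item = self._ranked[self._pos]
        self._pos += 1
        return item


class RandomSource:
    """R-Source stand-in: returns the score of one named object per call."""

    def __init__(self, scores: Dict[str, float], access_time: float = 1.0):
        self._scores = dict(scores)
        self.access_time = access_time  # plays the role of tR(R)

    def get_score(self, obj: str) -> float:
        return self._scores[obj]
```

The key asymmetry is that an R-Source can only be asked about an object that is already known, which is why at least one sorted-access source is needed to discover objects.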
9
Query Model

Attribute scores between 0 and 1.
Sequential access to sources.
Score ties broken arbitrarily.
No wild guesses.
One S-Source (or SR-Source) and multiple
R-sources. (More on this later.)

10
Query Processing Goals

Processing top-k queries over R-Sources.
Returning exact answer to top-k query q.
Minimizing query response time.
Naïve solution too expensive (access all sources
for all objects).

11
Example: NYC Restaurants

S-Source:
Zagat: restaurants sorted by food rating.
R-Sources:
MapQuest: distance between two input addresses.
User address: 2290 Broadway
NYTimes Review: price range of the input restaurant.
Target value: 25

12
TA Algorithm for SR-Sources
Fagin, Lotem, and Naor (PODS 2001)

Perform sorted access sequentially to all
SR-Sources
Completely probe every object found for all
attributes using random access.
Keep best k objects.
Stop when scores of best k objects are no less
than maximum possible score of unseen objects
(threshold).

Does NOT handle R-Sources
13
Our Adaptation of TA Algorithm for R-Sources
TA-Adapt

Perform sorted access to S-Source S.
Probe every R-Source Ri for newly found object.
Keep best k objects.
Stop when scores of best k objects are no less
than maximum possible score of unseen objects
(threshold).
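The four steps above can be sketched as follows. This is a minimal illustration, not the presenters' implementation: it assumes a normalized weighted-sum scoring function, in-memory dictionaries in place of real sources, and counts accesses rather than measuring time. The function name `ta_adapt` and its parameters are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def ta_adapt(sorted_scores: Dict[str, float],
             random_scores: Sequence[Dict[str, float]],
             weights: Sequence[float],
             k: int) -> Tuple[List[Tuple[str, float]], int]:
    """Return (top-k objects with scores, number of source accesses).

    weights[0] belongs to the S-Source, weights[1:] to the R-Sources.
    """
    total_w = sum(weights)
    ranked = sorted(sorted_scores.items(), key=lambda kv: kv[1], reverse=True)
    best: List[Tuple[float, str]] = []  # min-heap of (score, object)
    accesses = 0
    for obj, s_score in ranked:
        accesses += 1  # one sorted access
        total = weights[0] * s_score
        for w, source in zip(weights[1:], random_scores):
            total += w * source[obj]  # fully probe the new object
            accesses += 1
        score = total / total_w
        if len(best) < k:
            heapq.heappush(best, (score, obj))
        else:
            heapq.heappushpop(best, (score, obj))  # keep only the best k
        # Unseen objects score at most s_score on S and at most 1 elsewhere.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_w
        if len(best) == k and best[0][0] >= threshold:
            break
    top = sorted(((o, s) for s, o in best), key=lambda p: p[1], reverse=True)
    return top, accesses
```

Note that every discovered object is probed on every R-Source, even when its partial score already rules it out.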

14
An Example Execution of TA-Adapt
Threshold: 1
Total Execution Time: 9
tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
Final Score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT)/6
15
Improvements over TA-Adapt

Add a shortcut test after each random-access
probe (TA-Opt).
Exploit techniques for processing selections with
expensive predicates (TA-EP).
Reorder accesses to R-Sources: best weight/time ratio.
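One plausible reading of "best weight/time ratio" is to probe first the R-Sources that contribute the most weight per unit of access time. The helper below is a sketch under that assumption; the name `order_sources_by_ratio` and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def order_sources_by_ratio(weights: Sequence[float],
                           times: Sequence[float]) -> List[int]:
    """Indices of R-Sources sorted by decreasing weight/time ratio."""
    if len(weights) != len(times):
        raise ValueError("weights and times must have the same length")
    return sorted(range(len(weights)),
                  key=lambda i: weights[i] / times[i],
                  reverse=True)


# Hypothetical sources with ratios 2/4, 1/1 and 3/10.
probe_order = order_sources_by_ratio([2, 1, 3], [4, 1, 10])
```

A heavily weighted but slow source can therefore end up last in the probing order.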

16
The Upper Algorithm

Selects a pair (object, source) to probe next.
Based on the property:

The object with the highest upper bound will be probed before the top-k solution is reached.
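An upper bound of this kind can be computed by assuming the best possible score (1) for every attribute that has not been probed yet. The sketch below assumes a normalized weighted-sum scoring function; the name `upper_bound` and the index-keyed `known` mapping are hypothetical.

```python
from typing import Mapping, Sequence


def upper_bound(known: Mapping[int, float], weights: Sequence[float]) -> float:
    """Best score an object could still reach.

    `known` maps attribute index -> score already obtained; any attribute
    missing from it is optimistically assumed to score 1.
    """
    total = sum(w * known.get(i, 1.0) for i, w in enumerate(weights))
    return total / sum(weights)


# Only attribute 0 is known (score 0.5); attributes 1 and 2 are unprobed.
partial = upper_bound({0: 0.5}, [3, 2, 1])
# All attributes known: the bound equals the final score.
final = upper_bound({0: 0.5, 1: 0.25, 2: 0.5}, [3, 2, 1])
```

Each probe can only lower or confirm an object's bound, so the bound tightens monotonically toward the final score.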
17
An Example Execution of Upper

Threshold: 1
Total Execution Time: 6
tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
Final Score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT)/6
18
The Upper Algorithm

Choose object with highest upper bound.
If some unseen object can have a higher upper bound:
    Access S-Source S
Else:
    Access best R-Source Ri for chosen object
Keep best k objects
If top-k objects have final values higher than the maximum possible value of any other object:
    Return top-k objects

Interleaves accesses on objects
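The loop above can be sketched in a few dozen lines. This is a simplified illustration, not the presenters' implementation, and it rests on several assumptions: a normalized weighted-sum scoring function, dictionaries in place of real sources, choosing among objects that are not yet fully probed, and picking the "best" R-Source purely by weight/time ratio instead of the expected-value reasoning the deck describes. The function name `upper` and its parameters are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def upper(sorted_scores: Dict[str, float],
          random_scores: Sequence[Dict[str, float]],
          weights: Sequence[float],
          times: Sequence[float],
          k: int) -> Tuple[List[Tuple[str, float]], float]:
    """Return (top-k objects with final scores, total access time).

    Index 0 of weights/times is the S-Source; index i >= 1 refers to
    random_scores[i - 1].
    """
    n_attrs, total_w = len(weights), sum(weights)
    ranked = sorted(sorted_scores.items(), key=lambda kv: kv[1], reverse=True)
    pos, last_s, elapsed = 0, 1.0, 0.0
    known: Dict[str, Dict[int, float]] = {}  # object -> probed attribute scores

    def bound(obj: str) -> float:
        # Unprobed attributes are optimistically assumed to score 1.
        vals = known[obj]
        return sum(w * vals.get(i, 1.0) for i, w in enumerate(weights)) / total_w

    while True:
        exhausted = pos >= len(ranked)
        # Best score any object not yet returned by S could reach.
        unseen_bound = -1.0 if exhausted else (
            (weights[0] * last_s + sum(weights[1:])) / total_w)
        ordered = sorted(known, key=bound, reverse=True)
        top, rest = ordered[:k], ordered[k:]
        rest_bound = max([bound(o) for o in rest] + [unseen_bound])
        top_final = all(len(known[o]) == n_attrs for o in top)
        if top_final and (len(top) == k or exhausted) and (
                not top or bound(top[-1]) >= rest_bound):
            return [(o, bound(o)) for o in top], elapsed
        incomplete = [o for o in ordered if len(known[o]) < n_attrs]
        if not exhausted and (not incomplete
                              or unseen_bound > bound(incomplete[0])):
            # An unseen object could beat the chosen one: sorted access.
            obj, last_s = ranked[pos]
            known[obj] = {0: last_s}
            pos += 1
            elapsed += times[0]
        else:
            # Probe the chosen object on its best remaining R-Source.
            obj = incomplete[0]
            src = max((i for i in range(1, n_attrs) if i not in known[obj]),
                      key=lambda i: weights[i] / times[i])
            known[obj][src] = random_scores[src - 1][obj]
            elapsed += times[src]
```

Unlike a TA-style loop, this sketch never probes an object whose upper bound has already fallen below the scores of the current top-k, which is where the saved accesses come from.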
19
Selecting the Best Source

Upper relies on expected values to make its choices.
Upper computes the best subset of sources that is expected to:
    Compute the final score for the k top objects.
    Discard other objects as fast as possible.
Upper chooses the best source in the best subset:
    Best weight/time ratio.

20
Experimental Setting: Synthetic Data

Attribute scores randomly generated (three data sets: uniform, Gaussian, and correlated).
tR(Ri): integer between 1 and 10.
tS(S) ∈ {0.1, 0.2, ..., 1.0}.
Query execution time: ttotal.
Default: k = 50, 10000 objects, uniform data.
Results: average ttotal of 100 queries.
Optimal: assumes complete knowledge (unrealistic, but useful performance bound).
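A setting of this shape could be generated along the following lines. This sketch covers only the uniform data set (the parameters of the Gaussian and correlated sets are not given here), and the function name `make_uniform_dataset`, the object naming scheme, and the seeding are all hypothetical.

```python
import random
from typing import Dict, List, Tuple


def make_uniform_dataset(n_objects: int, n_r_sources: int, seed: int = 0
                         ) -> Tuple[List[Dict[str, float]], float, List[int]]:
    """Return (per-source score tables, tS for the S-Source, tR per R-Source).

    Table 0 belongs to the S-Source; tables 1.. belong to the R-Sources.
    """
    rng = random.Random(seed)
    names = [f"obj{i}" for i in range(n_objects)]
    tables = [{name: rng.random() for name in names}
              for _ in range(n_r_sources + 1)]
    # tS(S) drawn from {0.1, 0.2, ..., 1.0}; tR(Ri) an integer in [1, 10].
    t_s = rng.choice([j / 10 for j in range(1, 11)])
    t_r = [rng.randint(1, 10) for _ in range(n_r_sources)]
    return tables, t_s, t_r


tables, t_s, t_r = make_uniform_dataset(n_objects=100, n_r_sources=2)
```

Fixing the seed keeps repeated runs comparable when averaging over many queries.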

21
Experiments: Varying Number of Objects Requested k
22
Experiments: Varying Number of Database Objects N
23
Experimental Setting: Real Web Data

S-Source: Verizon Yellow Pages (sorted by distance)
R-Sources

24
Experiments: Real-Web Data
Number of Random Accesses
25
Evaluation Conclusions

TA-EP and TA-Opt much faster than TA-Adapt.
Upper significantly better than all versions of
TA.
Upper close to optimal.
Real data experiments: Upper faster than TA adaptations.

26
Conclusion

Introduced first algorithm for top-k processing
over R-Sources.
Adapted TA to this scenario.
Presented new algorithms: Upper and Pick (see paper).
Evaluated our new algorithms with both real and
synthetic data.
Upper close to optimal

27
Current and Future Work

Relaxation of the Source Model
    Current source model limited
    Any number of R-Sources and SR-Sources
    Upper has good results even with only SR-Sources
Parallelism
    Define a query model for parallel access to sources
    Adapt our algorithms to this model
Approximate Queries

28
References

Top-k Queries:
    Evaluating Top-k Selection Queries, S. Chaudhuri and L. Gravano. VLDB 1999
TA algorithm:
    Optimal Aggregation Algorithms for Middleware, R. Fagin, A. Lotem, and M. Naor. PODS 2001
Variations of TA:
    Query Processing Issues on Image (Multimedia) Databases, S. Nepal and V. Ramakrishna. ICDE 1999
    Optimizing Multi-Feature Queries for Image Databases, U. Güntzer, W.-T. Balke, and W. Kießling. VLDB 2000
Expensive Predicates:
    Predicate Migration: Optimizing Queries with Expensive Predicates, J.M. Hellerstein and M. Stonebraker. SIGMOD 1993