Title: Evaluating Top-k Queries over Web-Accessible Databases
1. Evaluating Top-k Queries over Web-Accessible Databases
- Nicolas Bruno
- Luis Gravano
- Amélie Marian
- Columbia University
2. Top-k Queries Natural in Many Scenarios
- Example: NYC restaurant recommendation service.
- Goal: find the best restaurants for a user:
- Close to address 2290 Broadway
- Price around 25
- Good rating
Query: specification of flexible preferences
Answer: best k objects according to the distance function
3. Attributes often Handled by External Sources
- MapQuest returns the distance between two addresses.
- NYTimes Review gives the price range of a restaurant.
- Zagat gives a food rating to the restaurant.
4. Top-k Query Processing Challenges
- Attributes handled by external sources (e.g., MapQuest distance).
- External sources exhibit a variety of interfaces (e.g., NYTimes Review, Zagat).
- Existing algorithms do not handle all types of interfaces.
5. Processing Top-k Queries over Web-Accessible Data Sources
- Data and query model
- Algorithms for sources with different interfaces
- Our new algorithm: Upper
- Experimental results
6. Data Model
- Top-k query: assignment of weights and target values to attributes.
- Target values: <25, 2290 Broadway, very good> (preferred price, close to address, preferred rating)
- Weights: <4, 1, 2> (price is the most important attribute)
- Weights and attribute scores are combined in the scoring function.
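A minimal sketch of how weights can combine per-attribute scores, assuming the weighted-average form that the example executions in this deck use (weighted sum divided by the sum of the weights). The function name `combined_score` and the sample attribute scores are illustrative assumptions, not values from the slides.

```python
def combined_score(scores, weights):
    """Weighted average of per-attribute scores, each assumed to lie in [0, 1]."""
    if len(scores) != len(weights):
        raise ValueError("need one weight per attribute score")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Hypothetical restaurant: scores for (price, distance, rating), weights <4, 1, 2>.
example = combined_score([0.9, 0.5, 0.75], [4, 1, 2])
print(round(example, 3))
```

Because price carries weight 4 out of 7, a good price match dominates the result even when the distance score is mediocre.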
7. Sorted Access Source S
- Return objects sorted by scores for a given query.
- Example: Zagat
GetNextS interface
S-Source access time: tS(S)
8. Random Access Source R
- Return the score of a given object for a given query.
- Example: MapQuest
GetScoreR interface
R-Source access time: tR(R)
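The two access interfaces can be sketched as small in-memory classes. This is only an illustration: the class names, the `get_next`/`get_score` method names, and the toy scores are assumptions, and a real source would be a remote web service rather than a dictionary.

```python
class SortedSource:
    """S-Source sketch: hands out (object, score) pairs in descending score order."""

    def __init__(self, scores, access_time=1.0):
        self._ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        self._pos = 0
        self.access_time = access_time  # plays the role of tS(S)

    def get_next(self):
        """Return the next best (object, score) pair, or None when exhausted."""
        if self._pos >= len(self._ranked):
            return None
        item = self._ranked[self._pos]
        self._pos += 1
        return item


class RandomSource:
    """R-Source sketch: returns the score of one named object per probe."""

    def __init__(self, scores, access_time=1.0):
        self._scores = scores
        self.access_time = access_time  # plays the role of tR(R)

    def get_score(self, obj):
        return self._scores[obj]


# Toy data (invented): a Zagat-like S-Source and a MapQuest-like R-Source.
zagat = SortedSource({"a": 0.9, "b": 0.8, "c": 0.3})
mapquest = RandomSource({"a": 0.2, "b": 0.9, "c": 0.6})
first = zagat.get_next()
```

The key asymmetry is that an S-Source decides which object comes next, while an R-Source can only answer questions about an object the caller already knows.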
9. Query Model
- Attribute scores between 0 and 1.
- Sequential access to sources.
- Score ties broken arbitrarily.
- No wild guesses (an object is probed by random access only after sorted access has returned it).
- One S-Source (or SR-Source) and multiple R-Sources. (More on this later.)
10. Query Processing Goals
- Processing top-k queries over R-Sources.
- Returning exact answer to top-k query q.
- Minimizing query response time.
- Naïve solution too expensive (access all sources
for all objects).
11. Example: NYC Restaurants
- S-Source:
- Zagat: restaurants sorted by food rating.
- R-Sources:
- MapQuest: distance between two input addresses.
- User address: 2290 Broadway
- NYTimes Review: price range of the input restaurant.
- Target value: 25
12. TA Algorithm for SR-Sources
Fagin, Lotem, and Naor (PODS 2001)
- Perform sorted access sequentially to all SR-Sources.
- Completely probe every object found for all attributes using random access.
- Keep best k objects.
- Stop when scores of best k objects are no less than maximum possible score of unseen objects (threshold).
Does NOT handle R-Sources
13. Our Adaptation of TA Algorithm for R-Sources: TA-Adapt
- Perform sorted access to S-Source S.
- Probe every R-Source Ri for newly found object.
- Keep best k objects.
- Stop when scores of best k objects are no less
than maximum possible score of unseen objects
(threshold).
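The four steps above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes a weighted-average scoring function, represents the S-Source as a pre-sorted list and each R-Source as a dictionary, and uses invented sample data. The function name `ta_adapt` and its parameters are hypothetical.

```python
import heapq


def ta_adapt(sorted_items, r_sources, weights, k):
    """sorted_items: (obj, s_score) pairs in descending s_score order.
    r_sources: one dict obj -> score per R-Source.
    weights: [w_S, w_R1, ..., w_Rn].
    Returns (top_k as (score, obj) pairs, sorted accesses, random accesses)."""
    total_w = sum(weights)
    top = []  # min-heap holding the best k (final_score, obj) pairs so far
    n_sorted = n_random = 0
    for obj, s_score in sorted_items:
        n_sorted += 1
        score = weights[0] * s_score
        # Probe every R-Source for the newly found object.
        for w, src in zip(weights[1:], r_sources):
            score += w * src[obj]
            n_random += 1
        score /= total_w
        heapq.heappush(top, (score, obj))
        if len(top) > k:
            heapq.heappop(top)
        # Unseen objects score at most s_score at S and at most 1 at each R-Source.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_w
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), n_sorted, n_random


# Invented sample data: one S-Source, two R-Sources, w = <3, 2, 1>, k = 1.
s_list = [("o1", 0.9), ("o2", 0.8), ("o3", 0.5), ("o4", 0.2)]
r1 = {"o1": 0.3, "o2": 1.0, "o3": 0.9, "o4": 0.5}
r2 = {"o1": 0.6, "o2": 0.9, "o3": 0.4, "o4": 0.7}
result, n_sorted, n_random = ta_adapt(s_list, [r1, r2], [3, 2, 1], 1)
print(result, n_sorted, n_random)
```

On this sample data the loop stops after the third sorted access, having paid two random probes for each object it saw, including objects that could have been ruled out earlier.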
14. An Example Execution of TA-Adapt
Table columns: Object | S(Zagat) | R1(MQ) | R2(NYT) | Final Score
Threshold = 1
Total execution time = 9
tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
Final score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT)/6
15. Improvements over TA-Adapt
- Add a shortcut test after each random-access probe (TA-Opt).
- Exploit techniques for processing selections with expensive predicates (TA-EP).
- Reorder accesses to R-Sources.
- Best weight/time ratio.
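One plausible reading of the "best weight/time ratio" bullet is that R-Sources are probed in decreasing order of weight divided by access time, so that each unit of time spent removes as much score uncertainty as possible. The sketch below assumes exactly that; the function name and the sample weights and access times are invented for illustration.

```python
def order_by_weight_time_ratio(sources):
    """sources: (name, weight, access_time) triples. Highest weight/time first."""
    return sorted(sources, key=lambda s: s[1] / s[2], reverse=True)


# Hypothetical weights and access times.
ordered = order_by_weight_time_ratio([
    ("MapQuest", 2, 4.0),   # ratio 0.5
    ("NYTimes", 1, 1.0),    # ratio 1.0
    ("Subway", 3, 2.0),     # ratio 1.5
])
print([name for name, _, _ in ordered])
```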
16. The Upper Algorithm
- Selects a pair (object, source) to probe next.
- Based on the property: the object with the highest upper bound will be probed before the top-k solution is reached.
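An object's upper bound can be sketched by assuming the maximum score of 1 for every attribute that has not been probed yet. The weighted-average form and the name `upper_bound` are assumptions made for this illustration.

```python
def upper_bound(known, weights):
    """known: per-attribute scores, with None where the source is still unprobed.
    Unprobed attributes are optimistically assumed to score 1."""
    total = sum(w * (1.0 if s is None else s) for w, s in zip(weights, known))
    return total / sum(weights)


# Hypothetical object seen at the S-Source with score 0.8, w = <3, 2, 1>.
before = upper_bound([0.8, None, None], [3, 2, 1])   # nothing probed yet
after = upper_bound([0.8, 0.5, None], [3, 2, 1])     # after probing R1
print(round(before, 3), round(after, 3))
```

Each probe can only lower (or keep) the bound, and once every attribute is known the bound equals the final score.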
17. An Example Execution of Upper
Table columns: Object | Upper Bound | S(Zagat) | R1(MQ) | R2(NYT) | Final Score
Threshold = 1
Total execution time = 6
tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
Final score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT)/6
18. The Upper Algorithm
- Choose object with highest upper bound.
- If some unseen object can have higher upper bound:
- Access S-Source S
- Else:
- Access best R-Source Ri for chosen object
- Keep best k objects
- If top-k objects have final values higher than maximum possible value of any other object, return top-k objects.
Interleaves accesses on objects
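The loop above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: scoring is a weighted average, all accesses cost one time unit (so "best R-Source" reduces to the unprobed source with the largest weight), sources are in-memory structures, and the sample data is invented. The name `upper` and its parameters are hypothetical.

```python
def upper(sorted_items, r_sources, weights, k):
    """sorted_items: (obj, s_score) pairs in descending s_score order.
    r_sources: one dict obj -> score per R-Source.
    Returns (top objects as (score, obj), sorted accesses, random accesses)."""
    total_w = sum(weights)
    n = len(r_sources)

    def ub(vals):
        # Unknown attributes are optimistically assumed to score 1.
        return sum(w * (1.0 if v is None else v)
                   for w, v in zip(weights, vals)) / total_w

    seen = {}        # obj -> [s_score, r1, ..., rn], None where unprobed
    results = []     # objects already confirmed to be in the top-k
    pos = n_sorted = n_random = 0
    last_s = 1.0
    while len(results) < k:
        exhausted = pos >= len(sorted_items)
        unseen_ub = -1.0 if exhausted else ub([last_s] + [None] * n)
        best = max(seen, key=lambda o: ub(seen[o]), default=None)
        if best is None or ub(seen[best]) < unseen_ub:
            if exhausted:
                break
            # Some unseen object could beat every seen one: go to the S-Source.
            obj, last_s = sorted_items[pos]
            pos += 1
            n_sorted += 1
            seen[obj] = [last_s] + [None] * n
            continue
        vals = seen[best]
        if None not in vals:
            # Fully probed and no other object can exceed it: part of the answer.
            results.append((ub(vals), best))
            del seen[best]
            continue
        # Probe the unprobed R-Source with the largest weight (unit-time assumption).
        i = max((j for j in range(1, n + 1) if vals[j] is None),
                key=lambda j: weights[j])
        vals[i] = r_sources[i - 1][best]
        n_random += 1
    return results, n_sorted, n_random


# Invented sample data: one S-Source, two R-Sources, w = <3, 2, 1>, k = 1.
s_list = [("o1", 0.9), ("o2", 0.8), ("o3", 0.5), ("o4", 0.2)]
r1 = {"o1": 0.3, "o2": 1.0, "o3": 0.9, "o4": 0.5}
r2 = {"o1": 0.6, "o2": 0.9, "o3": 0.4, "o4": 0.7}
top, n_sorted, n_random = upper(s_list, [r1, r2], [3, 2, 1], 1)
print(top, n_sorted, n_random)
```

On this sample data the sketch abandons o1 after a single probe, because its upper bound already falls below that of unseen objects, and never probes o3 at all. That is the sense in which accesses are interleaved across objects rather than completed one object at a time.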
19. Selecting the Best Source
- Upper relies on expected values to make its choices.
- Upper computes best subset of sources that is expected to:
- Compute the final score for k top objects.
- Discard other objects as fast as possible.
- Upper chooses best source in best subset.
- Best weight/time ratio.
20. Experimental Setting: Synthetic Data
- Attribute scores randomly generated (three data sets: uniform, Gaussian, and correlated).
- tR(Ri): integer between 1 and 10.
- tS(S) ∈ {0.1, 0.2, ..., 1.0}.
- Query execution time: ttotal
- Default: k = 50, 10,000 objects, uniform data.
- Results: average ttotal of 100 queries.
- Optimal: assumes complete knowledge (unrealistic, but useful performance bound)
21. Experiments: Varying Number of Objects Requested k
22. Experiments: Varying Number of Database Objects N
23. Experimental Setting: Real Web Data
- S-Source: Verizon Yellow Pages (sorted by distance)
- R-Sources:
- Subway Navigator: subway time
- Altavista: popularity
- MapQuest: driving time
- NYTimes Review: food and price ratings
- Zagat: food, service, décor, and price ratings
24. Experiments: Real-Web Data
(Plot: number of random accesses)
25. Evaluation Conclusions
- TA-EP and TA-Opt much faster than TA-Adapt.
- Upper significantly better than all versions of TA.
- Upper close to optimal.
- Real data experiments: Upper faster than TA adaptations.
26. Conclusion
- Introduced first algorithm for top-k processing over R-Sources.
- Adapted TA to this scenario.
- Presented new algorithms: Upper and Pick (see paper).
- Evaluated our new algorithms with both real and synthetic data.
- Upper close to optimal.
27. Current and Future Work
- Relaxation of the Source Model
- Current source model limited
- Any number of R-Sources and SR-Sources
- Upper has good results even with only SR-Sources
- Parallelism
- Define a query model for parallel access to sources
- Adapt our algorithms to this model
- Approximate Queries
28. References
- Top-k Queries:
- Evaluating Top-k Selection Queries, S. Chaudhuri and L. Gravano. VLDB 1999
- TA algorithm:
- Optimal Aggregation Algorithms for Middleware, R. Fagin, A. Lotem, and M. Naor. PODS 2001
- Variations of TA:
- Query Processing Issues on Image (Multimedia) Databases, S. Nepal and V. Ramakrishna. ICDE 1999
- Optimizing Multi-Feature Queries for Image Databases, U. Güntzer, W.-T. Balke, and W. Kießling. VLDB 2000
- Expensive Predicates:
- Predicate Migration: Optimizing Queries with Expensive Predicates, J.M. Hellerstein and M. Stonebraker. SIGMOD 1993
29. Real-web Experiments
30. Real-web Experiments with Adaptive Time
31. Relaxing the Source Model
TA-EP
Upper
32. Upcoming Journal Paper
- Variations of Upper
- Select best source
- Data Structures
- Complexity Analysis
- Relaxing Source Model
- Adaptation of our Algorithms
- New Algorithms
- Variations of Data and Query Model to handle real
web data
33. Optimality
- TA: instance optimal over:
- Algorithms that do not make wild guesses.
- Databases that satisfy the distinctness property.
- TAZ: instance optimal over:
- Algorithms that do not make wild guesses.
- No complexity analysis of our algorithms, but experimental evaluation instead.