Title: Evaluating Top-k Queries over Web-Accessible Databases
1 Evaluating Top-k Queries over Web-Accessible Databases
- Nicolas Bruno
- Luis Gravano
- Amélie Marian
- Columbia University
2 Top-k Queries Natural in Many Scenarios
- Example: NYC restaurant recommendation service.
- Goal: find the best restaurants for a user
- Close to address 2290 Broadway
- Price around 25
- Good rating
Query: specification of flexible preferences
Answer: best k objects for distance function
3 Attributes often Handled by External Sources
- MapQuest returns the distance between two addresses.
- NYTimes Review gives the price range of a restaurant.
- Zagat gives a food rating to the restaurant.
4 Top-k Query Processing Challenges
- Attributes handled by external sources (e.g., MapQuest distance).
- External sources exhibit a variety of interfaces (e.g., NYTimes Review, Zagat).
- Existing algorithms do not handle all types of
5 Processing Top-k Queries over Web-Accessible Data
- Data and query model
- Algorithms for sources with different interfaces
- Our new algorithm: Upper
- Experimental results
6 Data Model
- Top-k query: assignment of weights and target values to attributes
close to address
preferred price
preferred rating
Combined in scoring function
price most important attribute
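As an illustration of how per-attribute scores might be combined, here is a minimal sketch that assumes the scoring function is a weighted average normalized by the sum of the weights, which is the form of the final score used in the example executions later in the deck. The function name and the sample values are hypothetical.

```python
def combined_score(attribute_scores, weights):
    """Weighted average of per-attribute scores, each assumed to lie in [0, 1].

    Assumption: the scoring function is a normalized weighted sum.
    """
    if len(attribute_scores) != len(weights):
        raise ValueError("need exactly one weight per attribute score")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * s for w, s in zip(weights, attribute_scores))
    return weighted / total_weight


# Hypothetical restaurant: rating 0.9, distance score 0.5, price score 0.6,
# with weights 3, 2, 1 (made-up values for illustration).
example = combined_score([0.9, 0.5, 0.6], [3, 2, 1])
```

Because the weights are normalized, the combined score stays between 0 and 1 whenever every attribute score does.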
7 Sorted Access Source S
- Return objects sorted by scores for a given query.
- Example: Zagat
GetNextS interface
S-Source access time: tS(S)
8 Random Access Source R
- Return the score of a given object for a given query.
- Example: MapQuest
GetScoreR interface
R-Source access time: tR(R)
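The two access interfaces can be mimicked with small in-memory stand-ins. This is only a sketch: the class names, method names (`get_next`, `get_score`), and the `access_time` attribute are hypothetical Python counterparts of the GetNextS and GetScoreR interfaces and of tS(S) and tR(R); a real source would be a remote web service.

```python
class SortedSource:
    """In-memory stand-in for an S-Source: hands out objects by descending score."""

    def __init__(self, scores, access_time=1.0):
        # scores: dict mapping object id -> score in [0, 1]
        self._ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        self._pos = 0
        self.access_time = access_time  # plays the role of tS(S)

    def get_next(self):
        """Return the next (object, score) pair, or None when exhausted."""
        if self._pos >= len(self._ranked):
            return None
        item = self._ranked[self._pos]
        self._pos += 1
        return item


class RandomSource:
    """In-memory stand-in for an R-Source: returns the score of a given object."""

    def __init__(self, scores, access_time=1.0):
        self._scores = dict(scores)
        self.access_time = access_time  # plays the role of tR(R)

    def get_score(self, obj):
        """Return the score of obj; raises KeyError for unknown objects."""
        return self._scores[obj]
```

Note that an R-Source cannot enumerate objects: an object has to be discovered through sorted access before it can be probed.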
9 Query Model
- Attribute scores between 0 and 1.
- Sequential access to sources.
- Score ties broken arbitrarily.
- No wild guesses (an object must be found through sorted access before it can be probed).
- One S-Source (or SR-Source) and multiple R-Sources. (More on this later.)
10 Query Processing Goals
- Processing top-k queries over R-Sources.
- Returning exact answer to top-k query q.
- Minimizing query response time.
- Naïve solution too expensive (access all sources for all objects).
11 Example: NYC Restaurants
- S-Source
- Zagat: restaurants sorted by food rating.
- R-Sources
- MapQuest: distance between two input addresses.
- User address: 2290 Broadway
- NYTimes Review: price range of the input restaurant.
- Target value: 25
12 TA Algorithm for SR-Sources
Fagin, Lotem, and Naor (PODS 2001)
- Perform sorted access sequentially to all SR-Sources.
- Completely probe every object found for all attributes using random access.
- Keep best k objects.
- Stop when scores of best k objects are no less than maximum possible score of unseen objects.
Does NOT handle R-Sources
13 Our Adaptation of TA Algorithm for R-Sources
- Perform sorted access to S-Source S.
- Probe every R-Source Ri for newly found object.
- Keep best k objects.
- Stop when scores of best k objects are no less than maximum possible score of unseen objects.
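The steps above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the S-Source is a pre-sorted list, each R-Source is a dictionary, the scoring function is assumed to be a normalized weighted sum, and the threshold assumes that an unseen object scores at most the last sorted-access score on the S-Source attribute and at most 1 on every R-Source attribute.

```python
import heapq


def ta_adapt(sorted_items, r_sources, weights, k, t_s=1.0, t_r=None):
    """Sketch of TA adapted to one S-Source and several R-Sources.

    sorted_items: list of (obj, score) in descending score order (S-Source).
    r_sources:    list of dicts obj -> score, one per R-Source.
    weights:      [w_S, w_R1, w_R2, ...].
    Returns (top-k list of (score, obj), total access time).
    """
    t_r = t_r or [1.0] * len(r_sources)
    total_w = sum(weights)
    top = []  # min-heap holding the best k (score, obj) pairs seen so far
    elapsed = 0.0
    for obj, s_score in sorted_items:
        elapsed += t_s  # one sorted access
        score = weights[0] * s_score
        for w, source, t in zip(weights[1:], r_sources, t_r):
            score += w * source[obj]  # one random-access probe
            elapsed += t
        score /= total_w
        heapq.heappush(top, (score, obj))
        if len(top) > k:
            heapq.heappop(top)  # drop the current worst
        # Best score any unseen object could still reach.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_w
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), elapsed


# Made-up data: three objects, two R-Sources, weights 3, 2, 1, k = 1.
S = [("a", 0.9), ("b", 0.8), ("c", 0.5)]
R1 = {"a": 0.1, "b": 0.9, "c": 0.5}
R2 = {"a": 0.2, "b": 0.9, "c": 0.5}
result, cost = ta_adapt(S, [R1, R2], [3, 2, 1], k=1)
```

On this made-up data every retrieved object is fully probed, so the cost grows by tS(S) plus all tR(Ri) per object regardless of how promising the object is.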
14 An Example Execution of TA-Adapt
Threshold = 1
Total execution time = 9
tS(S) = tR(R1) = tR(R2) = 1, w = (3, 2, 1), k = 1
Final score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT)/6
15 Improvements over TA-Adapt
- Add a shortcut test after each random-access probe (TA-Opt).
- Exploit techniques for processing selections with expensive predicates (TA-EP).
- Reorder accesses to R-Sources.
- Best weight/time ratio.
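The slides do not spell out the shortcut test, so the following is only one plausible reading: after each probe, compute the object's upper bound by assuming a score of 1 for every attribute not yet probed, and skip the remaining probes if that bound is already below the current k-th best score. The function names and the normalized weighted-sum scoring are assumptions.

```python
def upper_bound(known_scores, weights):
    """Best final score an object can still reach.

    known_scores: one entry per attribute; None marks an attribute not yet
    probed, which is optimistically assumed to score 1.
    """
    total_w = sum(weights)
    optimistic = sum(
        w * (1.0 if s is None else s) for w, s in zip(weights, known_scores)
    )
    return optimistic / total_w


def can_skip_remaining_probes(known_scores, weights, kth_best_score):
    """Assumed shortcut test: the object can no longer enter the top k."""
    return upper_bound(known_scores, weights) < kth_best_score
```

For example, with weights 3, 2, 1, an object whose first two scores are 0.2 and 0.1 has an upper bound of (0.6 + 0.2 + 1)/6 = 0.3, so its last probe can be skipped once the k-th best score exceeds 0.3.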
16 The Upper Algorithm
- Selects a pair (object, source) to probe next.
- Based on the property:
The object with the highest upper bound will be probed before the top-k solution is reached.
17 An Example Execution of Upper
Threshold = 1
Total execution time = 6
tS(S) = tR(R1) = tR(R2) = 1, w = (3, 2, 1), k = 1
Final score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT)/6
18 The Upper Algorithm
- Choose object with highest upper bound.
- If some unseen object can have higher upper bound
- Access S-Source S
- Else
- Access best R-Source Ri for chosen object
- Keep best k objects
- If top-k objects have final values higher than maximum possible value of any other object, return top-k objects.
Interleaves accesses on objects
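The loop above can be sketched as follows. This is a simplified, hypothetical illustration rather than the authors' implementation: sources are in-memory lists and dictionaries, unknown attribute scores are bounded by 1, scoring is a normalized weighted sum, and the "best" R-Source is taken to be the unprobed one with the highest weight/time ratio.

```python
def upper(sorted_items, r_sources, weights, k, t_s=1.0, t_r=None):
    """Sketch of Upper: always work on the object with the highest upper bound.

    sorted_items: list of (obj, score) in descending score order (S-Source).
    r_sources:    list of dicts obj -> score, one per R-Source.
    weights:      [w_S, w_R1, w_R2, ...].
    Returns (list of (score, obj) in answer order, total access time).
    """
    n_r = len(r_sources)
    t_r = t_r or [1.0] * n_r
    total_w = sum(weights)
    known = {}    # obj -> [s_score, r1, r2, ...], None = not probed yet
    answer = []
    elapsed = 0.0
    pos = 0       # next position in the S-Source
    last_s = 1.0  # last score returned by sorted access

    def bound(scores):
        return sum(
            w * (1.0 if s is None else s) for w, s in zip(weights, scores)
        ) / total_w

    while len(answer) < k:
        # Upper bound of any object not yet returned by the S-Source.
        if pos < len(sorted_items):
            unseen = bound([last_s] + [None] * n_r)
        else:
            unseen = float("-inf")
        best = max(known, key=lambda o: bound(known[o]), default=None)
        if best is None or unseen > bound(known[best]):
            if pos >= len(sorted_items):
                break  # fewer than k objects exist
            obj, last_s = sorted_items[pos]
            pos += 1
            known[obj] = [last_s] + [None] * n_r
            elapsed += t_s
        elif None not in known[best]:
            # Fully probed and no other object can beat it: part of the answer.
            answer.append((bound(known[best]), best))
            del known[best]
        else:
            # Probe the unprobed R-Source with the best weight/time ratio.
            i = max(
                (j for j, s in enumerate(known[best]) if s is None),
                key=lambda j: weights[j] / t_r[j - 1],
            )
            known[best][i] = r_sources[i - 1][best]
            elapsed += t_r[i - 1]
    return answer, elapsed


# Made-up data: three objects, two R-Sources, weights 3, 2, 1, k = 1.
S = [("a", 0.9), ("b", 0.8), ("c", 0.5)]
R1 = {"a": 0.1, "b": 0.9, "c": 0.5}
R2 = {"a": 0.2, "b": 0.9, "c": 0.5}
result, cost = upper(S, [R1, R2], [3, 2, 1], k=1)
```

On this made-up data the sketch abandons object a after a single probe and never probes c, so it spends 6 time units, whereas probing every R-Source for each of the three retrieved objects would cost 9.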
19 Selecting the Best Source
- Upper relies on expected values to make its choices.
- Upper computes best subset of sources that is expected to:
- Compute the final score for k top objects.
- Discard other objects as fast as possible.
- Upper chooses best source in best subset.
- Best weight/time ratio.
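The weight/time criterion alone can be illustrated in a few lines. This sketch covers only the final ranking step and leaves out the expected-value computation of the best subset, which the slides do not detail; the function name and sample numbers are hypothetical.

```python
def rank_sources_by_ratio(weights, access_times):
    """Return R-Source indices ordered by decreasing weight/time ratio.

    A high ratio means the probe can change the object's score a lot
    for little access time.
    """
    if len(weights) != len(access_times):
        raise ValueError("need one access time per source")
    return sorted(
        range(len(weights)),
        key=lambda i: weights[i] / access_times[i],
        reverse=True,
    )


# Made-up example: the heaviest source (index 2) is also the slowest,
# so its ratio 3/6 = 0.5 ranks it last.
order = rank_sources_by_ratio([2, 1, 3], [1, 1, 6])
```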
20 Experimental Setting: Synthetic Data
- Attribute scores randomly generated (three data sets: uniform, Gaussian, and correlated).
- tR(Ri): integer between 1 and 10.
- tS(S) ∈ {0.1, 0.2, ..., 1.0}.
- Query execution time: ttotal
- Default: k = 50, 10000 objects, uniform data.
- Results: average ttotal of 100 queries.
- Optimal: assumes complete knowledge
- (unrealistic, but useful performance bound)
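A setting like the uniform one above could be generated along these lines. This is a sketch under stated assumptions: the slides do not give the number of R-Sources or the random generator, so `num_r_sources`, the seed, and the function name are hypothetical, and only the uniform data set is covered.

```python
import random


def make_synthetic(num_objects=10000, num_r_sources=2, seed=0):
    """Generate uniform attribute scores and random access times.

    Returns (scores, t_r, t_s) where scores[i] holds the S-Source score
    followed by one score per R-Source for object i.
    """
    rng = random.Random(seed)  # seeded for repeatable experiments
    scores = [
        [rng.random() for _ in range(num_r_sources + 1)]
        for _ in range(num_objects)
    ]
    # tR(Ri): integer between 1 and 10 (inclusive).
    t_r = [rng.randint(1, 10) for _ in range(num_r_sources)]
    # tS(S): one of 0.1, 0.2, ..., 1.0.
    t_s = rng.choice([i / 10 for i in range(1, 11)])
    return scores, t_r, t_s
```

Averaging the total access time over many such queries, as the slides do with 100 queries, smooths out the effect of any single random draw.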
21 Experiments: Varying Number of Objects Requested k
22 Experiments: Varying Number of Database Objects N
23 Experimental Setting: Real Web Data
- S-Source: Verizon Yellow Pages
- (sorted by distance)
- R-Sources
24 Experiments: Real-Web Data
(Figure: number of random accesses)
25 Evaluation Conclusions
- TA-EP and TA-Opt much faster than TA-Adapt.
- Upper significantly better than all versions of TA.
- Upper close to optimal.
- Real data experiments: Upper faster than TA
- Introduced first algorithm for top-k processing over R-Sources.
- Adapted TA to this scenario.
- Presented new algorithms: Upper and Pick (see paper).
- Evaluated our new algorithms with both real and synthetic data.
- Upper close to optimal
27 Current and Future Work
- Relaxation of the Source Model
- Current source model limited
- Any number of R-Sources and SR-Sources
- Upper has good results even with only SR-Sources
- Parallelism
- Define a query model for parallel access to sources
- Adapt our algorithms to this model
- Approximate Queries
- Top-k Queries
- Evaluating Top-k Selection Queries, S. Chaudhuri and L. Gravano. VLDB 1999
- TA algorithm
- Optimal Aggregation Algorithms for Middleware, R. Fagin, A. Lotem, and M. Naor. PODS 2001
- Variations of TA
- Query Processing Issues on Image (Multimedia) Databases, S. Nepal and V. Ramakrishna. ICDE 1999
- Optimizing Multi-Feature Queries for Image Databases, U. Güntzer, W.-T. Balke, and W. Kießling. VLDB 2000
- Expensive Predicates
- Predicate Migration: Optimizing Queries with Expensive Predicates, J.M. Hellerstein and M. Stonebraker. SIGMOD 1993
29 Real-web Experiments
30 Real-web Experiments with Adaptive Time
31 Relaxing the Source Model
32 Upcoming Journal Paper
- Variations of Upper
- Select best source
- Data Structures
- Complexity Analysis
- Relaxing Source Model
- Adaptation of our Algorithms
- New Algorithms
- Variations of Data and Query Model to handle real web data
- TA: instance optimal over
- Algorithms that do not make wild guesses.
- Databases that satisfy the distinctness property.
- TAZ: instance optimal over
- Algorithms that do not make wild guesses.
- No complexity analysis of our algorithms, but experimental evaluation instead