Evaluating Top-k Queries over Web-Accessible Databases


1
Evaluating Top-k Queries over Web-Accessible
Databases
  • Nicolas Bruno
  • Luis Gravano
  • Amélie Marian
  • Columbia University

2
Top-k Queries Natural in Many Scenarios
  • Example: NYC restaurant recommendation service.
  • Goal: find the best restaurants for a user.
  • Close to address 2290 Broadway
  • Price around 25
  • Good rating

Query: Specification of Flexible Preferences
Answer: Best k Objects for Distance Function
3
Attributes often Handled by External Sources
  • MapQuest returns the distance between two
    addresses.
  • NYTimes Review gives the price range of a
    restaurant.
  • Zagat gives a food rating to the restaurant.

4
Top-k Query Processing Challenges
  • Attributes handled by external sources (e.g.,
    MapQuest distance).
  • External sources exhibit a variety of interfaces
    (e.g., NYTimes Review, Zagat).
  • Existing algorithms do not handle all types of
    interfaces.

5
Processing Top-k Queries over Web-Accessible Data
Sources
  • Data and query model
  • Algorithms for sources with different interfaces
  • Our new algorithm Upper
  • Experimental results

6
Data Model
  • Top-k Query: an assignment of weights and target
    values to attributes.


  • Target values: close to address, preferred price,
    preferred rating.
  • Weights: price is the most important attribute.
  • Target values and weights are combined in a scoring
    function.
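
As a minimal sketch of how weights and per-attribute scores might be combined, the function below assumes a weighted average normalized by the sum of the weights, which is the form of the final-score formula used in the example executions later in the deck. The name `combined_score` and the sample scores are hypothetical.

```python
def combined_score(scores, weights):
    """Weighted average of per-attribute scores (each assumed to be in [0, 1]).

    Assumption: the scoring function is a weighted sum divided by the
    total weight, so the combined score also stays in [0, 1].
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


# Hypothetical attribute scores for one restaurant, with weights 3, 2, 1.
example = combined_score([0.9, 0.5, 0.8], [3, 2, 1])
print(round(example, 4))
```

Normalizing by the total weight keeps scores comparable across queries that use different weight scales.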
7
Sorted Access Source S
  • Return objects sorted by scores for a given
    query.
  • Example: Zagat

GetNextS interface
S-Source access time: tS(S)
8
Random Access Source R
  • Return the score of a given object for a given
    query.
  • Example: MapQuest

GetScoreR interface
R-Source access time: tR(R)
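
The two access types on slides 7 and 8 can be pictured as two small interfaces. The sketch below is only an in-memory stand-in: the class names, the method names `get_next` and `get_score` (modeled loosely on GetNextS and GetScoreR), and the sample data are all hypothetical, and a real source would be a remote web service.

```python
class SortedSource:
    """S-Source stand-in: returns (object, score) pairs in descending score order."""

    def __init__(self, scores, access_time=1.0):
        self._ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        self._pos = 0
        self.access_time = access_time  # plays the role of tS(S)

    def get_next(self):
        """Return the next best (object, score), or None when exhausted."""
        if self._pos >= len(self._ranked):
            return None
        item = self._ranked[self._pos]
        self._pos += 1
        return item


class RandomSource:
    """R-Source stand-in: returns the score of one given object."""

    def __init__(self, scores, access_time=1.0):
        self._scores = dict(scores)
        self.access_time = access_time  # plays the role of tR(R)

    def get_score(self, obj):
        return self._scores[obj]


# Made-up scores for three restaurants.
zagat = SortedSource({"a": 0.9, "b": 0.8, "c": 0.7}, access_time=0.5)
mapquest = RandomSource({"a": 0.2, "b": 0.9, "c": 0.9}, access_time=2.0)
```

The key asymmetry is that a sorted source chooses which object comes next, while a random source must be told which object to score.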
9
Query Model
  • Attribute scores between 0 and 1.
  • Sequential access to sources.
  • Score ties broken arbitrarily.
  • No wild guesses.
  • One S-Source (or SR-Source) and multiple
    R-sources. (More on this later.)

10
Query Processing Goals
  • Processing top-k queries over R-Sources.
  • Returning exact answer to top-k query q.
  • Minimizing query response time.
  • Naïve solution too expensive (access all sources
    for all objects).

11
Example: NYC Restaurants
  • S-Source
  • Zagat: restaurants sorted by food rating.
  • R-Sources
  • MapQuest: distance between two input addresses.
  • User address: 2290 Broadway
  • NYTimes Review: price range of the input
    restaurant.
  • Target value: 25

12
TA Algorithm for SR-Sources
Fagin, Lotem, and Naor (PODS 2001)
  • Perform sorted access sequentially to all
    SR-Sources
  • Completely probe every object found for all
    attributes using random access.
  • Keep best k objects.
  • Stop when scores of best k objects are no less
    than maximum possible score of unseen objects
    (threshold).

Does NOT handle R-Sources
13
Our Adaptation of TA Algorithm for R-Sources
TA-Adapt
  • Perform sorted access to S-Source S.
  • Probe every R-Source Ri for newly found object.
  • Keep best k objects.
  • Stop when scores of best k objects are no less
    than maximum possible score of unseen objects
    (threshold).
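
The four steps above can be sketched in a few lines. This is an illustrative reading of the slide, not the authors' implementation: it assumes a weighted-average scoring function, models the S-Source as a list already sorted by score and each R-Source as a dictionary, and uses made-up data and the hypothetical name `ta_adapt`.

```python
import heapq


def ta_adapt(sorted_items, r_sources, weights, k, t_s=1.0, t_r=None):
    """Sketch of TA-Adapt with one S-Source and several R-Sources.

    sorted_items: list of (object, score), descending by score (S-Source).
    r_sources:    list of dicts object -> score (R-Sources).
    weights:      [w_S, w_R1, w_R2, ...].
    Returns (top-k as [(score, object)], total access time).
    """
    t_r = t_r or [1.0] * len(r_sources)
    total_w = sum(weights)
    best = []  # min-heap holding the k best (score, object) pairs so far
    elapsed = 0.0
    for obj, s_score in sorted_items:
        elapsed += t_s  # one sorted access
        score = weights[0] * s_score
        for w, source, t in zip(weights[1:], r_sources, t_r):
            score += w * source[obj]  # probe every R-Source for this object
            elapsed += t
        heapq.heappush(best, (score / total_w, obj))
        if len(best) > k:
            heapq.heappop(best)  # keep only the best k objects
        # Best score any unseen object could still reach: its S-score is at
        # most the last one seen, and its R-scores are at most 1.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_w
        if len(best) == k and best[0][0] >= threshold:
            break
    return sorted(best, reverse=True), elapsed


# Made-up data: five objects, two R-Sources, weights 3, 2, 1, k = 1.
s_list = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.5), ("e", 0.4)]
r1 = {"a": 0.2, "b": 0.9, "c": 0.9, "d": 0.1, "e": 0.1}
r2 = {"a": 0.3, "b": 0.8, "c": 0.9, "d": 0.1, "e": 0.1}
ta_result, ta_time = ta_adapt(s_list, [r1, r2], [3, 2, 1], k=1)
print(ta_result, ta_time)
```

Because every newly found object is probed on every R-Source, the cost grows by the full probing time per sorted access, which is the inefficiency the later slides target.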

14
An Example Execution of TA-Adapt
Threshold: 1
Total Execution Time: 9
tS(S) = tR(R1) = tR(R2) = 1, w = (3, 2, 1), k = 1
Final Score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT) / 6
15
Improvements over TA-Adapt
  • Add a shortcut test after each random-access
    probe (TA-Opt).
  • Exploit techniques for processing selections with
    expensive predicates (TA-EP).
  • Reorder accesses to R-Sources.
  • Best weight/time ratio.
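
One simple reading of the weight/time heuristic is to probe the R-Sources in decreasing order of weight divided by access time. The sketch below assumes exactly that; the function name, the tuple layout, and the access times are hypothetical.

```python
def order_by_weight_time_ratio(sources):
    """Order R-Sources so the best weight/time ratio is probed first.

    sources: list of (name, weight, access_time) tuples, access_time > 0.
    """
    return sorted(sources, key=lambda s: s[1] / s[2], reverse=True)


# Hypothetical weights and access times.
probe_order = order_by_weight_time_ratio(
    [("MapQuest", 2, 4.0), ("NYTimes", 1, 1.0)]
)
print([name for name, _, _ in probe_order])
```

A heavily weighted but slow source can rank below a lightly weighted fast one, since the ratio measures how much score information each unit of access time buys.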

16
The Upper Algorithm
  • Selects a pair (object,source) to probe next.
  • Based on the property:

The object with the highest upper bound will be
probed before the top-k solution is reached.
17
An Example Execution of Upper

Threshold: 1
Total Execution Time: 6
tS(S) = tR(R1) = tR(R2) = 1, w = (3, 2, 1), k = 1
Final Score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT) / 6
18
The Upper Algorithm
  • Choose the object with the highest upper bound.
  • If some unseen object can have a higher upper
    bound:
  • Access S-Source S.
  • Else:
  • Access the best R-Source Ri for the chosen object.
  • Keep the best k objects.
  • If the top-k objects have final values higher than
    the maximum possible value of any other object,
    return the top-k objects.

Interleaves accesses on objects
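
The loop on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the scoring function is taken to be a weighted average, unknown attribute scores are bounded by 1, the "best" R-Source is approximated by the best weight/time ratio among sources not yet probed for the object, and an object is emitted once it is fully probed and no other seen or unseen object can beat it. The name `upper` and the data are hypothetical.

```python
def upper(sorted_items, r_sources, weights, k, t_s=1.0, t_r=None):
    """Sketch of an Upper-style strategy; returns ([(score, object)], time)."""
    n_r = len(r_sources)
    t_r = t_r or [1.0] * n_r
    w_s, w_r = weights[0], weights[1:]
    total_w = sum(weights)
    # Assumed heuristic: probe sources in decreasing weight/time order.
    probe_order = sorted(range(n_r), key=lambda i: w_r[i] / t_r[i], reverse=True)

    def bound(s_score, r_known):
        """Upper bound on a score: unknown R-scores are assumed to be 1."""
        total = w_s * s_score
        for i in range(n_r):
            total += w_r[i] * r_known.get(i, 1.0)
        return total / total_w

    seen = {}  # object -> (S-score, {R-Source index: score})
    answers = []
    pos, last_s, elapsed = 0, 1.0, 0.0
    while len(answers) < k:
        exhausted = pos >= len(sorted_items)
        unseen_ub = -1.0 if exhausted else bound(last_s, {})
        top = max(seen, key=lambda o: bound(*seen[o]), default=None)
        if top is None and exhausted:
            break  # fewer than k objects exist
        if top is None or unseen_ub > bound(*seen[top]):
            obj, s_score = sorted_items[pos]  # sorted access
            pos += 1
            last_s = s_score
            seen[obj] = (s_score, {})
            elapsed += t_s
        elif len(seen[top][1]) == n_r:
            # Fully probed and nothing can exceed it: part of the answer.
            answers.append((bound(*seen[top]), top))
            del seen[top]
        else:
            i = next(j for j in probe_order if j not in seen[top][1])
            seen[top][1][i] = r_sources[i][top]  # random access
            elapsed += t_r[i]
    return answers, elapsed


# Made-up data: five objects, two R-Sources, weights 3, 2, 1, k = 1.
s_list = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.5), ("e", 0.4)]
r1 = {"a": 0.2, "b": 0.9, "c": 0.9, "d": 0.1, "e": 0.1}
r2 = {"a": 0.3, "b": 0.8, "c": 0.9, "d": 0.1, "e": 0.1}
upper_result, upper_time = upper(s_list, [r1, r2], [3, 2, 1], k=1)
print(upper_result, upper_time)
```

Objects whose upper bound drops below the current leader are left partially probed, which is where the savings over probing every object completely come from.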
19
Selecting the Best Source
  • Upper relies on expected values to make its
    choices.
  • Upper computes best subset of sources that is
    expected to
  • Compute the final score for k top objects.
  • Discard other objects as fast as possible.
  • Upper chooses best source in best subset.
  • Best weight/time ratio.

20
Experimental Setting: Synthetic Data
  • Attribute scores randomly generated (three data
    sets: uniform, Gaussian, and correlated).
  • tR(Ri): integer between 1 and 10.
  • tS(S) ∈ {0.1, 0.2, ..., 1.0}.
  • Query execution time: ttotal.
  • Default: k = 50, 10,000 objects, uniform data.
  • Results: average ttotal of 100 queries.
  • Optimal: assumes complete knowledge
  • (unrealistic, but useful performance bound).
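
A setting like the uniform one above could be generated as in the sketch below. It covers only the uniform data set, since the slide does not give parameters for the Gaussian and correlated ones. The function name, the number of R-Sources, and the seed are assumptions made for illustration.

```python
import random


def make_synthetic_setting(n_objects=10000, n_r_sources=2, seed=0):
    """Generate uniform attribute scores and random access times.

    Returns (scores, t_r, t_s):
      scores: one row per object, one score in [0, 1) per source
              (column 0 for the S-Source, the rest for R-Sources).
      t_r:    one integer access time in 1..10 per R-Source.
      t_s:    S-Source access time drawn from 0.1, 0.2, ..., 1.0.
    """
    rng = random.Random(seed)
    scores = [
        [rng.random() for _ in range(n_r_sources + 1)]
        for _ in range(n_objects)
    ]
    t_r = [rng.randint(1, 10) for _ in range(n_r_sources)]
    t_s = rng.choice([i / 10 for i in range(1, 11)])
    return scores, t_r, t_s


scores, t_r, t_s = make_synthetic_setting(n_objects=100, n_r_sources=3, seed=1)
print(len(scores), t_r, t_s)
```

Fixing the seed makes a run repeatable, which matters when results are averaged over many queries.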

21
Experiments: Varying Number of Objects Requested k
22
Experiments: Varying Number of Database Objects N
23
Experimental Setting: Real Web Data
  • S-Source: Verizon Yellow Pages
  • (sorted by distance)
  • R-Sources

24
Experiments: Real-Web Data
Figure: number of random accesses
25
Evaluation Conclusions
  • TA-EP and TA-Opt much faster than TA-Adapt.
  • Upper significantly better than all versions of
    TA.
  • Upper close to optimal.
  • Real-data experiments: Upper faster than TA
    adaptations.

26
Conclusion
  • Introduced first algorithm for top-k processing
    over R-Sources.
  • Adapted TA to this scenario.
  • Presented new algorithms: Upper and Pick (see
    paper).
  • Evaluated our new algorithms with both real and
    synthetic data.
  • Upper close to optimal

27
Current and Future Work
  • Relaxation of the Source Model
  • Current source model limited
  • Any number of R-Sources and SR-Sources
  • Upper has good results even with only SR-Sources
  • Parallelism
  • Define a query model for parallel access to
    sources
  • Adapt our algorithms to this model
  • Approximate Queries

28
References
  • Top-k Queries
  • Evaluating Top-k Selection Queries, S. Chaudhuri
    and L. Gravano. VLDB 1999.
  • TA algorithm
  • Optimal Aggregation Algorithms for Middleware,
    R. Fagin, A. Lotem, and M. Naor. PODS 2001.
  • Variations of TA
  • Query Processing Issues on Image (Multimedia)
    Databases, S. Nepal and V. Ramakrishna. ICDE 1999.
  • Optimizing Multi-Feature Queries for Image
    Databases, U. Güntzer, W.-T. Balke, and
    W. Kießling. VLDB 2000.
  • Expensive Predicates
  • Predicate Migration: Optimizing Queries with
    Expensive Predicates, J.M. Hellerstein and M.
    Stonebraker. SIGMOD 1993.

29
Real-web Experiments
30
Real-web Experiments with Adaptive Time
31
Relaxing the Source Model
TA-EP
Upper
32
Upcoming Journal Paper
  • Variations of Upper
  • Select best source
  • Data Structures
  • Complexity Analysis
  • Relaxing Source Model
  • Adaptation of our Algorithms
  • New Algorithms
  • Variations of Data and Query Model to handle real
    web data

33
Optimality
  • TA is instance optimal over:
  • Algorithms that do not make wild guesses.
  • Databases that satisfy the distinctness property.
  • TAZ is instance optimal over:
  • Algorithms that do not make wild guesses.
  • No complexity analysis of our algorithms;
    experimental evaluation instead.