Relax and Adapt: Computing Top-k Matches to XPath Queries

Slides: 22
Provided by: amliem
1
Relax and Adapt: Computing Top-k Matches to XPath Queries
  • Amélie Marian (Columbia University)
  • Joint work with
  • Sihem Amer-Yahia (AT&T Research)
  • Nick Koudas (University of Toronto)
  • Divesh Srivastava (AT&T Research)

2
Example
[Figure: two book trees with different structures. Node labels shown: book, info, title (Great Expectations), author (Dickens), edition (paperback).]
  • Heterogeneous XML Data about books
  • Query
  • book[./info/title = "Great Expectations" and
    ./info/author = "Dickens" and
    ./edition = "paperback"]
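
To make the heterogeneity problem concrete, here is a minimal Python sketch that evaluates the query exactly against two book elements. The XML layout, the `library` wrapper, and the `id` attributes are assumptions made for illustration; the slide's figure does not fully specify the two trees.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: two books describing the same work with different structure.
DOC = """
<library>
  <book id="a">
    <info><title>Great Expectations</title><author>Dickens</author></info>
    <edition>paperback</edition>
  </book>
  <book id="b">
    <title>Great Expectations</title>
    <info><author>Dickens</author></info>
  </book>
</library>
"""

def exact_match(book):
    """True only if every predicate of the query holds with its exact structure."""
    return (book.findtext("./info/title") == "Great Expectations"
            and book.findtext("./info/author") == "Dickens"
            and book.findtext("./edition") == "paperback")

root = ET.fromstring(DOC)
exact_ids = [b.get("id") for b in root.iter("book") if exact_match(b)]
print(exact_ids)
```

Under exact semantics only book `a` qualifies. Book `b` is about the same work but is rejected because its title sits directly under `book` and it has no edition, which is the kind of near miss that relaxation is meant to recover.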

[Figure legend: query root node; distinguished node]
3
XML Query Relaxation
Query
Amer-Yahia et al., EDBT '02
  • Tree pattern relaxations
  • Leaf node deletion
  • Edge generalization
  • Subtree promotion
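
The three relaxations can be sketched as rewrites on a small tree-pattern structure. This is an illustrative Python sketch only: `PNode` and the function names are hypothetical, value predicates are omitted, and the formal definitions in the cited paper may impose preconditions that are not enforced here.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PNode:
    label: str
    edge: str = "/"        # edge from the parent: "/" (child) or "//" (descendant)
    children: tuple = ()

def show(n):
    """Render a pattern in an XPath-like form."""
    preds = " and ".join("." + c.edge + show(c) for c in n.children)
    return n.label + (f"[{preds}]" if preds else "")

def delete_leaf(parent, i):
    """Leaf node deletion: drop the i-th child, which must be a leaf."""
    assert not parent.children[i].children, "only leaves can be deleted"
    return replace(parent, children=parent.children[:i] + parent.children[i + 1:])

def generalize_edge(parent, i):
    """Edge generalization: turn the child edge to the i-th child into a descendant edge."""
    kids = list(parent.children)
    kids[i] = replace(kids[i], edge="//")
    return replace(parent, children=tuple(kids))

def promote_subtree(grand, i, j):
    """Subtree promotion: move the j-th child of grand's i-th child up to grand,
    attached by a descendant edge."""
    mid = grand.children[i]
    moved = replace(mid.children[j], edge="//")
    new_mid = replace(mid, children=mid.children[:j] + mid.children[j + 1:])
    kids = grand.children[:i] + (new_mid,) + grand.children[i + 1:] + (moved,)
    return replace(grand, children=kids)

query = PNode("book", children=(
    PNode("info", children=(PNode("title"), PNode("author"))),
    PNode("edition"),
))
print(show(query))
print(show(delete_leaf(query, 1)))
print(show(generalize_edge(query, 0)))
print(show(promote_subtree(query, 0, 0)))
```

Each relaxed pattern matches a superset of what the original pattern matches, so relaxations can be composed to reach progressively looser approximations of the query.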

[Figure: query and data trees. Labels shown: book, info, title (Great Expectations), author (Dickens), edition (paperback), edition?]
4
Top-k Queries over XML Data: Motivations and Challenges
  • Structure heterogeneity
  • Efficient identification of approximate matches
  • Top-k
  • Ranking of approximate matches based on
    similarity to query
  • Early pruning
  • Query processing cost
  • Cost increases with number of matches evaluated
  • Data explosion
  • Many approximate matches
  • XML path queries akin to joins
  • Prioritization to increase pruning

5
Contributions
  • Whirlpool: adaptive architecture and top-k query
    processing strategy for XPath queries
  • Goal: early pruning of non-top-k partial matches
  • Approach: partial matches may follow different
    plans, and may be at different stages of query
    execution
  • Real prototype implementation of Whirlpool
  • Instantiation of Whirlpool for various routing
    strategies and prioritization alternatives

6
Closely Related Work
  • Adaptive query processing
  • Eddies
  • Dynamic query join plans to adapt to processing
    environment
  • No pruning
  • Adaptive top-k query processing
  • Upper
  • Prioritization of partial matches based on
    maximum possible scores
  • Adaptive routing based on scores
  • No joins

Avnur and Hellerstein, SIGMOD '00
Bruno et al., ICDE '01
7
Outline
  • Whirlpool Architecture
  • Query Processing
  • Strategy
  • Alternatives
  • Evaluation Settings
  • Evaluation Results

8
Whirlpool Architecture
book
edition (paperback)
info
Router
author (Dickens)
title (Great Expectations)
book server
edition server
title server
info server
author server
Top-k Set
9
Whirlpool Architecture: Components
  • Top-k Set
  • Only one match with a given root node
  • Used for pruning
  • Complete matches are not processed further,
    incomplete matches are sent to the router
  • Router
  • Router queue is ordered by partial matches' maximum
    possible final scores
  • Dynamically chooses which server to send each partial
    match to, based on the routing strategy
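
The interplay between the top-k set and the router queue can be sketched as follows. This is a simplified Python illustration, not the prototype's code: the class names, the pruning test, and the example scores are all assumptions.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class PartialMatch:
    neg_max_final: float                  # negated so heapq pops the highest bound first
    root_id: str = field(compare=False)
    score: float = field(compare=False)

class TopK:
    """Keeps the best score per root node, for at most k roots."""
    def __init__(self, k):
        self.k = k
        self.best = {}                    # root_id -> best score seen so far

    def threshold(self):
        """Score of the k-th best entry, or -inf while fewer than k roots are known."""
        if len(self.best) < self.k:
            return float("-inf")
        return min(self.best.values())

    def offer(self, root_id, score):
        if score > self.best.get(root_id, float("-inf")):
            self.best[root_id] = score
        if len(self.best) > self.k:
            del self.best[min(self.best, key=self.best.get)]

    def can_prune(self, max_final):
        """A match whose best possible final score cannot beat the threshold is dropped."""
        return max_final <= self.threshold()

topk = TopK(k=2)
for root_id, score in [("b1", 0.9), ("b2", 0.5), ("b1", 0.7), ("b3", 0.6)]:
    topk.offer(root_id, score)

queue = []
for root_id, score, max_final in [("b4", 0.2, 0.55), ("b5", 0.1, 0.95), ("b6", 0.3, 0.8)]:
    if not topk.can_prune(max_final):
        heapq.heappush(queue, PartialMatch(-max_final, root_id, score))

order = [heapq.heappop(queue).root_id for _ in range(len(queue))]
print(topk.best, order)
```

Here `b4` is pruned because its upper bound (0.55) cannot exceed the current second-best score (0.6), and the remaining matches leave the queue in decreasing order of their maximum possible final score.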

10
Whirlpool Architecture: Components
  • Root server
  • Generates candidate matches
  • Node servers
  • Maintain priority queue of partial matches
  • For each partial match that is processed
  • Compute a set of extended partial (or complete)
    matches
  • Compute scores of new matches
  • Check partial matches against current top-k set

11
Query Processing Alternatives
  • Prioritization Strategies (at each server)
  • FIFO
  • Current Score
  • Maximum Possible Next Score
  • Maximum Possible Final Score
  • Routing Decisions (at the router)
  • Static
  • Score-based
  • Likely to increase score the most
  • Likely to increase score the least
  • Size-based
  • Likely to produce the fewest matches
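
The alternatives listed above can be read as different sort keys and different server-selection rules. The Python sketch below is one possible reading, under stated assumptions: each server is assumed to expose a maximum score contribution (`max_gain`) and an estimated number of extensions per partial match (`est_fanout`), and "likely to increase the score the most/least" is approximated by `max_gain`. None of these names come from the slides.

```python
from dataclasses import dataclass

@dataclass
class ServerStats:
    name: str
    max_gain: float      # assumed: largest score contribution this server can add
    est_fanout: float    # assumed: estimated extensions produced per partial match

@dataclass
class Match:
    arrival: int         # arrival order at the server
    score: float         # current score
    remaining: list      # ServerStats of servers not yet visited, in static plan order

def priority(match, strategy, at_server=None):
    """Prioritization at a server; a larger value means processed earlier."""
    if strategy == "fifo":
        return -match.arrival
    if strategy == "current":
        return match.score
    if strategy == "max_next":
        return match.score + at_server.max_gain
    if strategy == "max_final":
        return match.score + sum(s.max_gain for s in match.remaining)
    raise ValueError(strategy)

def route(match, strategy):
    """Routing decision at the router: pick the next server for this match."""
    if strategy == "static":
        return match.remaining[0]
    if strategy == "max_score":
        return max(match.remaining, key=lambda s: s.max_gain)
    if strategy == "min_score":
        return min(match.remaining, key=lambda s: s.max_gain)
    if strategy == "min_size":
        return min(match.remaining, key=lambda s: s.est_fanout)
    raise ValueError(strategy)

title = ServerStats("title", max_gain=0.5, est_fanout=1.2)
author = ServerStats("author", max_gain=0.3, est_fanout=3.0)
edition = ServerStats("edition", max_gain=0.2, est_fanout=0.4)
m = Match(arrival=0, score=0.1, remaining=[author, title, edition])

for strategy in ("static", "max_score", "min_score", "min_size"):
    print(strategy, "->", route(m, strategy).name)
```

With these made-up statistics, the static plan sends the match to `author`, score-based routing picks `title` (most gain) or `edition` (least gain), and size-based routing picks `edition` because it is expected to produce the fewest extensions.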

12
Evaluation Strategies
  • Lockstep (Static)
  • Partial matches follow same execution plan
  • Partial matches have gone through exactly the
    same number of operations
  • Whirlpool Single-threaded (Adaptive)
  • Partial matches adaptively routed
  • Process the partial match with the highest
    maximum final score (Query processing similar to
    Upper)
  • Only one partial match processed at a time
  • Whirlpool Multi-threaded (Adaptive)
  • Prioritization strategy at server decides which
    partial match to process next at server
  • System determines which server to process next

13
Evaluation Metrics
  • Parameters
  • Query size
  • Document size
  • k
  • Parallelism
  • Scoring function (tf.idf proposed in paper)
  • Measures
  • Query execution time
  • Number of server operations
  • Number of partial matches created

14
Evaluation Setting
  • C implementation, with POSIX threads
  • Default machine
  • Red Hat 7.1 Linux
  • 1.4 GHz dual processor
  • 2 GB RAM
  • XML documents generated using the XMark
    generator
  • XPath Queries chosen from XMark to illustrate
    different relaxations
  • XML nodes stored using Dewey encoding
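
For readers unfamiliar with Dewey encoding: each node's identifier is the sequence of child positions on the path from the root, so structural relationships reduce to prefix tests. The Python sketch below illustrates the general idea; the dotted-string format and the function names are assumptions, not the prototype's storage layout.

```python
def parse(dewey):
    """Convert a dotted Dewey identifier such as '1.2.3' into a tuple of ints."""
    return tuple(int(part) for part in dewey.split("."))

def is_ancestor(a, d):
    """True if node a is a proper ancestor of node d (a's ID is a proper prefix of d's)."""
    a, d = parse(a), parse(d)
    return len(a) < len(d) and d[:len(a)] == a

def is_parent(a, d):
    """True if node a is the parent of node d (prefix that is exactly one step shorter)."""
    a, d = parse(a), parse(d)
    return len(d) == len(a) + 1 and d[:-1] == a

print(is_parent("1.2", "1.2.3"))       # child edge: ./x
print(is_ancestor("1.2", "1.2.3.1"))   # descendant edge: .//x
print(is_ancestor("1.2", "1.20.1"))    # component-wise comparison, not string prefix
```

The parent test corresponds to a child (`/`) edge in the query and the ancestor test to a descendant (`//`) edge, which is why an encoding of this kind is convenient when edges may be generalized during relaxation.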

15
Comparison of Adaptive Routing Strategies
Whirlpool-S and Whirlpool-M perform approximately
the same number of server operations
16
Static Routing Strategies vs. Best Adaptive
17
Effect of Parallelism
18
Varying Query Size and k (log scale)
[Chart labels: 60, 48, 20]
For large queries and high values of k,
Whirlpool-M performs fewer server operations than
Whirlpool-S (and is faster even on a
one-processor machine)! (27 fewer server
operations for q3, k = 75)
19
Varying Query Size and Document Size
Almost twice as fast
20
Scalability
Percentage of partial matches created by
Whirlpool-M, relative to the maximum possible
number of partial matches

Document size    1M       10M      50M
Q1               100      93.12    85.66
Q2               100      49.56    67.66
Q3               100      39.59    31.20
21
Conclusions
  • Efficient adaptive top-k query processing
    strategy
  • Minimize number of partial matches evaluated
  • Benefit from parallelism with little threading
    overhead
  • Adapt to different environments
  • Score distribution
  • Selectivity distribution
  • Extensive experimental evaluation
  • Good scalability