1
MiniCon Reformulation & Adaptive Re-Optimization
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 650 Database & Information Systems
  • February 23, 2005

2
Administrivia
  • Next reading assignment:
  • Urhan & Franklin: Query Scrambling
  • Ives et al.: Adaptive Data Partitioning
  • Compare the different approaches
  • One-page proposal of your project scope, goals,
    and means of assessing success/failure due next
    Monday, Feb. 28th

3
Today's Trivia Question
4
Buckets, Rev. 2: The MiniCon Algorithm
  • A much smarter bucket algorithm
  • In many cases, we don't need to perform the
    cross-product of all items in all buckets
  • Eliminates the need for the containment check
  • This and the Chase & Backchase strategy of
    Tannen et al. are the two methods most used in
    virtual data integration today

5
MiniCon Descriptions (MCDs)
  • Basically, a modification to the bucket approach
  • A head homomorphism defines which variables must
    be equated
  • A variable-substituted version of the subgoals
  • A mapping of variable names
  • Info about what's covered (see the sketch below)
  • Property 1:
  • If a variable occurs in the head of the query,
    there must be a corresponding variable in the
    head of the MCD's view
  • If a variable participates in a join predicate in
    the query, it must be in the head of the view
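
A hypothetical sketch (Python) of the record an MCD keeps; the field
names are assumptions, not the paper's notation:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class MCD:
        """One MiniCon Description: how one view covers query subgoals."""
        view: str                     # name of the view
        head_homomorphism: dict      # which view head variables get equated
        var_mapping: dict            # query variable -> view variable
        covered_subgoals: frozenset  # indices of the query subgoals covered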

6
MCD Construction
  • For each subgoal of the query:
  • For each subgoal of each view:
  • Choose the least restrictive head homomorphism to
    match the subgoal of the query
  • If we can find a way of mapping the variables,
    add an MCD for each possible maximal extension of
    the mapping that satisfies Property 1 (see the
    sketch below)
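
A minimal sketch of that double loop, assuming the MCD record above and
two hypothetical helpers (least_restrictive_head_homomorphism and
maximal_extensions) that encapsulate the matching details:

    def construct_mcds(query_subgoals, views):
        """Match every query subgoal against every view subgoal; keep one
        MCD per maximal, Property-1-respecting extension of the mapping."""
        mcds = []
        for g_idx, g in enumerate(query_subgoals):
            for view in views:
                for v in view.subgoals:
                    h = least_restrictive_head_homomorphism(g, v, view)
                    if h is None:
                        continue  # these two subgoals cannot be matched
                    for mapping, covered in maximal_extensions(
                            g_idx, h, query_subgoals, view):
                        mcds.append(MCD(view.name, h, mapping,
                                        frozenset(covered)))
        return mcds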

7
MCDs for Our Example
  • 5star(i) :- show(i, t, y, g), rating(i, 5, s)
  • TVguide(t, y, g, r) :- show(i, t, y, g),
    rating(i, r, TVGuide)
  • movieInfo(i, t, y, g) :- show(i, t, y, g)
  • critics(i, r, s) :- rating(i, r, s)
  • goodMovies(t, y) :- show(i, t, 1997, drama),
    rating(i, 5, s)
  • good98(t, y) :- show(i, t, 1998, drama),
    rating(i, 5, s)

q(t) :- show(i, t, y, g), rating(i, r, s), r ≥ 5
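
For instance, movieInfo can cover the show subgoal (its head exports both
the distinguished variable t and the join variable i), and critics can
cover the rating subgoal; combining these two disjoint MCDs yields the
rewriting q'(t) :- movieInfo(i, t, y, g), critics(i, r, s), r ≥ 5. By
contrast, 5star yields no MCD at all: its head exports only i, so
Property 1 fails for the distinguished variable t.
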
8
Combining MCDs
  • Now look for ways of combining pairwise disjoint
    subsets of the goals (sketched below)
  • Greatly reduces the number of candidates!
  • Also proven to be correct without the use of a
    containment check
  • Variations need to be made for:
  • Constants in general (I sneaked those in)
  • Semi-interval predicates (e.g., x ≤ c)
  • Note that full-blown inequality predicates are
    co-NP-hard in the size of the data, so they don't
    work
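
A brute-force sketch of the combination step (MiniCon prunes this search,
but the test is the same): a rewriting is any set of MCDs whose covered
subgoal sets are pairwise disjoint and together cover every query subgoal.

    from itertools import combinations

    def combine_mcds(mcds, n_subgoals):
        """Enumerate candidate rewritings: sets of MCDs that are pairwise
        disjoint and complete -- exactly the condition that makes the
        containment check unnecessary."""
        all_goals = frozenset(range(n_subgoals))
        rewritings = []
        for k in range(1, n_subgoals + 1):
            for combo in combinations(mcds, k):
                covered = [m.covered_subgoals for m in combo]
                # matching sizes + full union => disjoint and complete
                if (sum(len(c) for c in covered) == n_subgoals and
                        frozenset().union(*covered) == all_goals):
                    rewritings.append(combo)
        return rewritings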

9
MiniCon Performance, Many Rewritings
10
Larger Query, Fewer Rewritings
11
MiniCon and LAV Summary
  • The state of the art for AQUV (answering queries
    using views) in the relational world of data
    integration
  • It's been extended to support conjunctive
    XQuery as well
  • Scales to large numbers of views, which we need
    in LAV data integration
  • A similar approach: Chase & Backchase by Tannen
    et al.
  • Slightly more general in some ways, but:
  • Produces equivalent rewritings, not maximally
    contained ones
  • Not always polynomial in the size of the data

12
Motivations for Adaptive Query Processing
  • Many domains where cost-based query optimization
    fails
  • Complex queries in traditional databases:
    estimation error grows exponentially with the
    number of joins [IC91]
  • Querying over the Internet: unpredictable access
    rates, delays
  • Querying external data sources: limited
    information available about the properties of the
    source
  • Monitor real-world conditions, adapt processing
    strategy in response

13
Can We Get RDBMS-Level Optimization for Data
Integration, without Statistics?
  • Multiple remote sources
  • Described and mapped loosely
  • Data changes frequently
  • Generally, would like to support same kinds of
    queries as in a local setting

[Architecture diagram: a query is posed against the data integration
system's mediated schema; using the source catalog and schema mappings,
the system evaluates it over remote, autonomous data sources and returns
results]
14
What Are the Sources of Inefficiency?
  • Delays: we stall waiting for I/O
  • We'll talk about this on Monday
  • Bad estimation of intermediate result sizes
  • The focus of the Kabra and DeWitt paper
  • No info about source cardinalities
  • The focus of the eddies paper and Monday's
    paper
  • The latter two are closely related
  • Major challenges:
  • Trading off information acquisition (exploration)
    vs. use (exploitation)
  • Extrapolating performance based on what you've
    seen so far

15
Kabra and DeWitt
  • Goal: a minimal update to a traditional
    optimizer in order to compensate for bad
    decisions
  • General approach:
  • Break the query plan into stages
  • Instrument it
  • Allow for re-invocation of the optimizer if it's
    going awry

16
Elements of Mid-Query Re-Optimization
  • Annotated Query Execution Plans
  • Annotate plan with estimates of size
  • Runtime Collection of Statistics
  • Statistics collectors embedded in execution tree
  • Keep overhead down
  • Dynamic Resource Re-allocation
  • Reallocate memory to individual operations
  • Query Plan Modification
  • May wish to re-optimize the remainder of the query

17
Annotated Query Plans
  • We save at each point in the tree the expected:
  • Sizes and cardinalities
  • Selectivities of predicates
  • Estimates of number of groups to be aggregated

18
Statistics Collectors
  • Added into the tree
  • Must be collectable in a single pass
  • Will only help with portions of the query beyond
    the current pipeline

19
Resource Re-Allocation
  • Based on improved estimates, we can modify the
    memory allocated to each operation
  • Result: less I/O, better performance
  • Only for operations that have not yet begun
    executing, i.e., not in the current pipeline
    (see the sketch below)
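
A minimal sketch of one plausible policy, splitting memory in proportion
to the updated size estimates (the policy and attribute names are
assumptions):

    def reallocate_memory(operators, total_memory):
        """Redistribute memory among operators that have not yet started,
        in proportion to their newly estimated input sizes, so hash
        tables and sort buffers spill to disk less often."""
        pending = [op for op in operators if not op.started]
        total_need = sum(op.expected_bytes for op in pending) or 1
        for op in pending:
            op.memory_grant = total_memory * op.expected_bytes / total_need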

20
Plan Modification
  • Suspend the query, save intermediate results in a
    temp file
  • Create a new plan for the remainder, treating the
    temp file as an input
  • Only re-optimize the part that has not begun

21
Re-Optimization
  • When to re-optimize?
  • Calculate the time the current plan should take
    (using gathered stats)
  • Only consider re-optimization if:
  • Our original estimate was off by at least some
    factor (e.g., 2), and
  • The estimated time to re-optimize is small
    relative to the remaining execution time (e.g.,
    under 5%); the cost of optimization depends on
    the number of operators, especially joins
  • Only modify the plan if the new estimate,
    including the cost of writing the temp file, is
    better (trigger logic sketched below)
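
A sketch of that trigger, with both thresholds written as parameters
since the slide's constants are only examples:

    def should_reoptimize(est_cost, observed_cost, t_opt, t_remaining,
                          error_factor=2.0, opt_budget=0.05):
        """Re-plan only when the original estimate was badly off AND
        re-optimizing is cheap relative to what is left to run."""
        estimate_was_bad = (observed_cost >= error_factor * est_cost or
                            observed_cost <= est_cost / error_factor)
        optimizer_is_cheap = t_opt <= opt_budget * t_remaining
        return estimate_was_bad and optimizer_is_cheap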

22
Low-Overhead Statistics
  • Want to find the most effective statistics
  • Don't want to gather statistics for simple
    queries
  • Want to limit the effect of the algorithm to a
    specified maximum overhead ratio
  • Factors:
  • Probability of inaccuracy
  • Fraction of the query affected
  • How do we know this without having stats?

23
Inaccuracy Potentials
  • The following heuristics are used:
  • Inaccuracy potential: low, medium, or high
  • Lower if we have more information on the table's
    value distribution
  • One more than the max of the inputs for a
    multiple-input selection
  • Always high for user-defined methods
  • Always high for non-equijoins
  • For most other operators, same as the worst of
    the inputs (see the sketch below)
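
A sketch of these heuristics as a function; the operator-kind tags are
hypothetical, and levels are integers so that "one more than the max"
is directly expressible:

    LOW, MEDIUM, HIGH = 0, 1, 2   # inaccuracy potential levels

    def inaccuracy_potential(kind, inputs, have_distribution_stats=False):
        """Estimate how wrong an operator's output-size estimate may be."""
        if kind in ("user_defined_method", "non_equijoin"):
            return HIGH                        # always high
        if kind == "table_scan":
            # more information about the value distribution -> lower
            return LOW if have_distribution_stats else MEDIUM
        if kind == "multiple_input_selection":
            return min(HIGH, 1 + max(inputs))  # one above the worst input
        return max(inputs, default=LOW)        # same as worst of the inputs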

24
More Heuristics
  • Check fraction of query affected
  • Check how many other operators use the same
    statistic
  • The winner:
  • Higher inaccuracy potentials first
  • Then, if a tie, the one affecting the larger
    portion of the plan

25
Implementation
  • On top of Paradise (parallel database that
    supports ADTs, built on OO framework)
  • Using System-R optimizer
  • New SCIA (Stat Collector Insertion Algorithm) and
    Dynamic Re-Optimization modules

26
It Works!
  • Results are about 5% worse for simple queries,
    much better for complex queries
  • Of course, we would not really collect statistics
    on simple queries
  • Data skew made a slight difference: both normal
    and re-optimized queries performed slightly better

27
Pros and Cons
  • Provides significant potential for improvement
    without adding much overhead
  • Biased towards exploitation, with very limited
    information-gathering
  • A great way to retrofit an existing system
  • In SIGMOD '04, IBM had a paper that did this in
    DB2
  • But fairly limited to the traditional DB context
  • Relies on us knowing the (rough) cardinalities of
    the sources
  • Query plans aren't pipelined, meaning:
  • If the pipeline is broken too infrequently,
    mid-query re-optimization may not help
  • If the pipeline is broken too frequently,
    time-to-first-answer is slow

28
The Opposite Extreme: Eddies
  • The basic idea:
  • Query processing consists of sending tuples
    through a series of operators
  • Why not treat it like a routing problem?
  • Rely on back-pressure (i.e., queue overflow) to
    tell us where to send tuples
  • Part of the ongoing Telegraph project at Berkeley
  • Large-scale federated, shared-nothing data stream
    engine
  • Variations in data transfer rates
  • Little knowledge of data sources

29
Telegraph Architecture
  • Simple pre-optimizer to generate the initial plan
  • Creates operators, e.g.,
  • Select: σ_predicate(sourceA)
  • Join: ⋈_predicate(sourceA, sourceB)
  • (No support for aggregation, union, etc.)
  • Chooses implementations
  • Select using an index, join using a hash join
  • Goal: dataflow-driven scheduling of operations
  • A tuple comes into the system
  • It is adaptively routed through operators in eddies
  • It may be combined, discarded, etc.

30
Can't Always Re-order Arbitrarily
  • Need a moment of symmetry
  • Some operators have a scheduling dependency,
    e.g., nested loops join:
  • for each tuple in the left table
  • for each tuple in the right table
  • if the tuples meet the predicate, output a result
  • Index joins and pipelined hash joins are always
    symmetric (see the sketch below)
  • Sometimes there are ordering restrictions,
    e.g., request a tuple from one source, ship it to
    another
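
A minimal sketch of a pipelined (symmetric) hash join: every arriving
tuple probes the opposite hash table and is then inserted into its own,
so a moment of symmetry exists after each tuple and the eddy may swap
the input roles at any point. The class and method names are assumptions:

    from collections import defaultdict

    class SymmetricHashJoin:
        """Pipelined hash join with a moment of symmetry at every tuple."""
        def __init__(self, key_left, key_right):
            self.tables = {"L": defaultdict(list), "R": defaultdict(list)}
            self.keys = {"L": key_left, "R": key_right}

        def insert(self, side, tup):
            """Feed one tuple from either input; yield joins it completes."""
            other = "R" if side == "L" else "L"
            k = self.keys[side](tup)
            self.tables[side][k].append(tup)
            for match in self.tables[other][k]:   # probe the opposite side
                yield (tup, match) if side == "L" else (match, tup)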

31
The Eddy
  • An n-ary module consisting of query operators
  • The basic unit of adaptivity
  • A subplan with select, project, and join operators
  • Represents a set of possible orderings of a subplan
  • Each tuple may flow through a different
    ordering (which may be constrained)

32
Example Join Subplan Alternatives
[Diagram: alternative join trees over R1, R2, and R3, with join
predicates R1.x = R2.x, R1.x = R3.x, and R2.x = R3.x; different tuples
may be routed through different orderings]
33
Naïve Eddy
  • Given a tuple, route it to an operator that's
    ready
  • Analogous to fluid dynamics
  • Adjusts to operator costs
  • Ignores operator selectivity (reduction of input);
    see the sketch below
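
A sketch of the naive policy (the tuple's done-set and the queue
attribute are assumptions):

    def naive_route(tup, operators):
        """Send the tuple to the eligible operator with the shortest
        input queue: back-pressure adapts to operator cost, but says
        nothing about selectivity."""
        eligible = [op for op in operators if op.id not in tup.done]
        return min(eligible, key=lambda op: len(op.input_queue))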

34
Adapting to Variable-Cost Selection
35
But Selectivity is Ignored
36
Lottery-Based Eddy
  • Need to favor more selective operators
  • A ticket is given per input tuple and returned
    per output tuple
  • Lottery scheduling based on the number of tickets
  • Now handles both selectivity and cost (sketched
    below)
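
A sketch of lottery-based routing (class and method names are
assumptions): an operator earns a ticket for each tuple it consumes and
pays one back for each tuple it emits, so operators that absorb many
tuples quickly, i.e., selective and cheap ones, hold more tickets and
win the routing lottery more often.

    import random

    class LotteryEddy:
        def __init__(self, operators):
            self.operators = {op.id: op for op in operators}
            self.tickets = {op.id: 1 for op in operators}  # start uniform

        def route(self, tup):
            """Pick among operators the tuple hasn't visited yet,
            weighted by their current ticket counts."""
            eligible = [oid for oid in self.operators
                        if oid not in tup.done]
            weights = [self.tickets[oid] for oid in eligible]
            return self.operators[random.choices(eligible, weights)[0]]

        def on_input(self, op_id):
            self.tickets[op_id] += 1              # ticket per input tuple

        def on_output(self, op_id):
            self.tickets[op_id] = max(1, self.tickets[op_id] - 1)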

37
Enhancing Adaptivity: a Sliding Window
  • Tickets were for the entire query
  • Weighted sliding-window approach:
  • Escrow tickets during the current window
  • Bank tickets from the previous window (sketched
    below)
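
A sketch of one plausible windowed scheme (the window length and the
exact escrow/bank hand-off are assumptions): routing uses tickets banked
in the previous window while tickets earned now accumulate in escrow, so
stale behavior is forgotten at each window boundary.

    class WindowedTickets:
        """Per-operator ticket state with a sliding window."""
        def __init__(self, window=1000):
            self.banked, self.escrow = 1, 0
            self.window, self.seen = window, 0

        def earn(self, n=1):
            self.escrow += n
            self.seen += 1
            if self.seen >= self.window:   # window boundary: roll over
                self.banked = max(1, self.escrow)
                self.escrow, self.seen = 0, 0

        def weight(self):
            return self.banked             # weight used in the lottery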

38
What about Delays?
  • Problems here:
  • Don't know when the join buffers vs. discards
    tuples

[Diagram: a join tree over sources R, S, and T, where source R is slow]
39
Eddy Pros and Cons
  • A mechanism for adaptively re-routing tuples
    within a query
  • Makes the optimizer's task simpler
  • Can do nearly as well as a well-optimized plan in
    some cases
  • Handles variable costs, variable selectivities
  • But doesn't really handle joins very well;
    follow-up work attempts to address this:
  • SteMs break a join into separate data
    structures; requires re-computation at each step
  • STAIRs create intermediate state and shuffle it
    back and forth

40
Other Areas Where Things Can Be Improved
  • The work of pre-optimization or re-optimization:
  • Choose a good initial/next query plan
  • Pick operator implementations, access methods
  • SteMs fix this
  • Handle arbitrary operators:
  • Aggregation, outer join, sorting, …
  • Next time we'll see techniques to address this
  • Distribute work:
  • Distributed eddies (by DeWitt and students)