Title: MiniCon Reformulation
1 MiniCon Reformulation & Adaptive Re-Optimization
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Database Information Systems
- February 23, 2005
2 Administrivia
- Next reading assignment
- Urhan & Franklin, Query Scrambling
- Ives et al., Adaptive Data Partitioning
- Compare the different approaches
- One-page proposal of your project scope, goals,
and means of assessing success/failure due next
Monday, Feb. 28th
3 Today's Trivia Question
4 Buckets, Rev. 2: The MiniCon Algorithm
- A much smarter bucket algorithm
- In many cases, we don't need to perform the
cross-product of all items in all buckets
- Eliminates the need for the containment check
- This and the Chase & Backchase strategy of
Tannen et al. are the two methods most used in
virtual data integration today
5 MiniCon Descriptions (MCDs)
- Basically, a modification to the bucket approach
- A head homomorphism defines what variables must
be equated
- Variable-substituted version of the subgoals
- Mapping of variable names
- Info about what's covered
- Property 1:
- If a variable occurs in the head of the query, then
there must be a corresponding variable in the
head of the MCD's view
- If a variable participates in a join predicate in
the query, then it must be in the head of the view
6 MCD Construction
- For each subgoal of the query
- For each subgoal of each view
- Choose the least restrictive head homomorphism to
match the subgoal of the query
- If we can find a way of mapping the variables,
then add an MCD for each possible maximal
extension of the mapping that satisfies Property 1
(a toy sketch follows below)
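To make the loop concrete, here is a toy Python sketch under heavy simplifying assumptions: each MCD covers a single query subgoal, constants and non-trivial head homomorphisms are ignored, and only the query-head part of Property 1 is checked. The helper names (match, construct_mcds) are illustrative, not from the paper.

def match(q_goal, v_goal):
    # Try to map the arguments of a query subgoal onto a view subgoal.
    (q_pred, q_args), (v_pred, v_args) = q_goal, v_goal
    if q_pred != v_pred or len(q_args) != len(v_args):
        return None
    mapping = {}
    for qa, va in zip(q_args, v_args):
        if mapping.setdefault(qa, va) != va:  # qa already mapped to something else
            return None
    return mapping

def construct_mcds(query_head_vars, query_body, views):
    # views: list of (name, head_vars, body).  Returns (view name, index of
    # the covered query subgoal, variable mapping) triples whose mapping sends
    # query head variables to distinguished view variables (simplified Property 1).
    mcds = []
    for i, q_goal in enumerate(query_body):
        for name, head_vars, body in views:
            for v_goal in body:
                m = match(q_goal, v_goal)
                if m is not None and all(m[v] in head_vars
                                         for v in query_head_vars if v in m):
                    mcds.append((name, i, m))
    return mcds

# The running example: q(t) :- show(i, t, y, g), rating(i, r, s)
query_body = [("show", ("i", "t", "y", "g")), ("rating", ("i", "r", "s"))]
views = [("movieInfo", ("i", "t", "y", "g"), [("show", ("i", "t", "y", "g"))]),
         ("critics", ("i", "r", "s"), [("rating", ("i", "r", "s"))])]
print(construct_mcds(("t",), query_body, views))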
7 MCDs for Our Example
- 5star(i) :- show(i, t, y, g), rating(i, 5, s)
- TVguide(t, y, g, r) :- show(i, t, y, g), rating(i, r, TVGuide)
- movieInfo(i, t, y, g) :- show(i, t, y, g)
- critics(i, r, s) :- rating(i, r, s)
- goodMovies(t, y) :- show(i, t, 1997, drama), rating(i, 5, s)
- good98(t, y) :- show(i, t, 1998, drama), rating(i, 5, s)

q(t) :- show(i, t, y, g), rating(i, r, s), r ≥ 5
8 Combining MCDs
- Now look for ways of combining pairwise disjoint
subsets of the goals (a small enumeration sketch follows below)
- Greatly reduces the number of candidates!
- Also proven to be correct without the use of a
containment check
- Variations need to be made for:
- Constants in general (I sneaked those in)
- Semi-interval predicates (e.g., x ≤ c)
- Note that full-blown inequality predicates are
co-NP-hard in the size of the data, so they don't
work
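An illustrative (not the paper's) Python enumeration of the combination step, where each MCD is reduced to just the set of query subgoals it covers; it emits every pairwise-disjoint combination that covers the whole query.

def combine_mcds(mcds, n_subgoals):
    # mcds: list of (label, frozenset of covered query-subgoal indices).
    # Yield every list of MCDs whose coverage sets are pairwise disjoint
    # and together cover all n_subgoals of the query.
    def extend(chosen, covered, remaining):
        if len(covered) == n_subgoals:
            yield list(chosen)
            return
        for i, (label, goals) in enumerate(remaining):
            if goals.isdisjoint(covered):
                yield from extend(chosen + [label], covered | goals, remaining[i + 1:])
    yield from extend([], frozenset(), mcds)

# Coverage of the example MCDs: movieInfo covers subgoal 0 (show),
# critics covers subgoal 1 (rating), goodMovies covers both.
mcds = [("movieInfo", frozenset({0})), ("critics", frozenset({1})),
        ("goodMovies", frozenset({0, 1}))]
for combo in combine_mcds(mcds, 2):
    print(combo)   # ['movieInfo', 'critics'] and ['goodMovies']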
9 MiniCon Performance, Many Rewritings
10 Larger Query, Fewer Rewritings
11 MiniCon and LAV Summary
- The state of the art for AQUV (answering queries using views)
in the relational world of data integration
- It's been extended to support conjunctive
XQuery as well
- Scales to large numbers of views, which we need
in LAV data integration
- A similar approach: Chase & Backchase by Tannen et al.
- Slightly more general in some ways, but:
- Produces equivalent rewritings, not maximally
contained ones
- Not always polynomial in the size of the data
12 Motivations for Adaptive Query Processing
- Many domains where cost-based query optimization
fails:
- Complex queries in traditional databases:
estimation error grows exponentially with joins [IC91]
- Querying over the Internet: unpredictable access
rates, delays
- Querying external data sources: limited
information available about properties of the source
- Monitor real-world conditions, adapt processing
strategy in response
13 Can We Get RDBMS-Level Optimization for Data
Integration, without Statistics?
- Multiple remote sources
- Described and mapped loosely
- Data changes frequently
- Generally, would like to support same kinds of
queries as in a local setting
[Architecture diagram: a query is posed over the mediated schema; the data
integration system uses the source catalog and schema mappings to answer it
over remote, autonomous data sources and return results]
14 What Are the Sources of Inefficiency?
- Delays: we stall while waiting for I/O
- We'll talk about this on Monday
- Bad estimation of intermediate result sizes
- The focus of the Kabra and DeWitt paper
- No info about source cardinalities
- The focus of the eddies paper and Monday's paper
- The latter two are closely related
- Major challenges:
- Trading off information acquisition (exploration)
vs. use (exploitation)
- Extrapolating performance based on what you've
seen so far
15 Kabra and DeWitt
- Goal: a minimal update to a traditional
optimizer in order to compensate for bad
decisions
- General approach:
- Break the query plan into stages
- Instrument it
- Allow for re-invocation of the optimizer if the plan is
going awry
16 Elements of Mid-Query Re-Optimization
- Annotated Query Execution Plans
- Annotate plan with estimates of size
- Runtime Collection of Statistics
- Statistics collectors embedded in execution tree
- Keep overhead down
- Dynamic Resource Re-allocation
- Reallocate memory to individual operations
- Query Plan Modification
- May wish to re-optimize the remainder of query
17 Annotated Query Plans
- We save at each point in the tree the expected
- Sizes and cardinalities
- Selectivities of predicates
- Estimates of number of groups to be aggregated
18 Statistics Collectors
- Added into the tree
- Must be collectable in a single pass
- Will only help with portions of the query beyond
the current pipeline
19 Resource Re-Allocation
- Based on improved estimates, we can modify the
memory allocated to each operation
- Result: less I/O, better performance
- Only for operations that have not yet begun
executing, i.e., not in the current pipeline
(a small sketch follows below)
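A minimal sketch of the idea, assuming a simple proportional policy (not the paper's exact formula): memory is re-split among the operators that have not yet started, based on their refreshed size estimates.

def reallocate_memory(total_memory, operators):
    # operators: list of (name, has_started, refreshed_size_estimate).
    # Only operators that have not begun executing are eligible; split the
    # budget among them in proportion to the new estimates.
    pending = [(name, est) for name, started, est in operators if not started]
    total_est = sum(est for _, est in pending) or 1
    return {name: total_memory * est / total_est for name, est in pending}

print(reallocate_memory(100, [("hash-join-1", True, 40),
                              ("hash-join-2", False, 30),
                              ("sort", False, 10)]))
# {'hash-join-2': 75.0, 'sort': 25.0}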
20 Plan Modification
- Create a new plan for the remainder, treating the temp result as
an input
- Only re-optimize the part that has not begun
- Suspend the query, save intermediate results in a temp file
21 Re-Optimization
- When to re-optimize?
- Calculate the time the current plan should take (using the
gathered stats)
- Only consider re-optimization if:
- Our original estimate was off by at least some
factor (≥ 2), and
- The estimated remaining time T_opt,estimated is large enough to
justify the cost of optimization, which depends on the number of
operators, esp. joins
- Only modify the plan if the new estimate,
including the cost of writing the temp file, is
better (a sketch of this decision follows below)
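A hedged Python sketch of the decision rule; the exact thresholds on the slide were garbled, so the factor of 2 and the cost comparison below are illustrative placeholders rather than the paper's precise conditions.

def should_reoptimize(estimated_size, observed_size,
                      remaining_cost, reoptimization_cost, error_factor=2.0):
    # Consider re-optimizing only if the original size estimate was off by at
    # least error_factor, and the remaining work is large enough to pay for
    # another optimizer call (plus writing the temp file, folded into
    # reoptimization_cost here).
    ratio = max(estimated_size, observed_size) / max(min(estimated_size, observed_size), 1)
    return ratio >= error_factor and remaining_cost > reoptimization_cost

print(should_reoptimize(estimated_size=1_000, observed_size=50_000,
                        remaining_cost=120.0, reoptimization_cost=2.5))   # True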
22 Low-Overhead Statistics
- Want to find the most effective statistics
- Don't want to gather statistics for simple
queries
- Want to limit the effect of the algorithm to a maximum
overhead ratio
- Factors:
- Probability of inaccuracy
- Fraction of the query affected
- How do we know this without having stats?
23 Inaccuracy Potentials
- The following heuristics are used:
- Inaccuracy potential is low, medium, or high
- Lower if we have more information on the table's value
distribution
- 1 + max of the inputs for a multiple-input selection
- Always high for user-defined methods
- Always high for non-equijoins
- For most other operators, same as the worst of the inputs
24 More Heuristics
- Check the fraction of the query affected
- Check how many other operators use the same
statistic
- The winner:
- Higher inaccuracy potentials first
- Then, if a tie, the one affecting the larger
portion of the plan (a small sketch of this choice follows below)
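Putting the last two slides together, a small illustrative Python sketch (assumed structure and names, not the paper's code) of computing an operator's inaccuracy potential and then picking which statistic to collect.

LEVELS = {"low": 0, "medium": 1, "high": 2}

def inaccuracy_potential(kind, input_potentials):
    # Rough version of the heuristics: user-defined methods and non-equijoins
    # are always high; most other operators inherit the worst of their inputs.
    # (The distribution-based lowering and the selection rule are omitted.)
    if kind in ("user-defined", "non-equijoin"):
        return "high"
    return max(input_potentials, key=LEVELS.__getitem__, default="low")

def pick_collector(candidates):
    # candidates: (statistic name, inaccuracy potential, fraction of plan affected).
    # Prefer higher inaccuracy potential; break ties by the larger fraction.
    return max(candidates, key=lambda c: (LEVELS[c[1]], c[2]))

print(inaccuracy_potential("equijoin", ["low", "medium"]))          # 'medium'
print(pick_collector([("card(scan-A)", "medium", 0.8),
                      ("card(join-1)", "high", 0.4),
                      ("card(join-2)", "high", 0.6)]))              # join-2 wins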
25 Implementation
- On top of Paradise (a parallel database that
supports ADTs, built on an OO framework)
- Using a System-R optimizer
- New SCIA (Stat Collector Insertion Algorithm) and
Dynamic Re-Optimization modules
26 It Works!
- Results are about 5% worse for simple queries, much
better for complex queries
- Of course, we would not really collect statistics
on simple queries
- Data skew made a slight difference: both normal
and re-optimized queries performed slightly better
27 Pros and Cons
- Provides significant potential for improvement
without adding much overhead
- Biased towards exploitation, with very limited
information-gathering
- A great way to retrofit an existing system
- In SIGMOD '04, IBM had a paper that did this in DB2
- But fairly limited to the traditional DB context
- Relies on us knowing the (rough) cardinalities of
the sources
- Query plans aren't fully pipelined, meaning:
- If the pipeline is broken too infrequently, MQRO
may not help
- If the pipeline is broken too frequently,
time-to-first-answer is slow
28 The Opposite Extreme: Eddies
- The basic idea:
- Query processing consists of sending tuples
through a series of operators
- Why not treat it like a routing problem?
- Rely on back-pressure (i.e., queue overflow) to
tell us where to send tuples
- Part of the ongoing Telegraph project at Berkeley
- Large-scale federated, shared-nothing data stream
engine
- Variations in data transfer rates
- Little knowledge of data sources
29 Telegraph Architecture
- Simple pre-optimizer to generate an initial plan
- Creates operators, e.g.
- Select: σ_predicate(sourceA)
- Join: ⋈_predicate(sourceA, sourceB)
- (No support for aggregation, union, etc.)
- Chooses implementations
- Select using an index, join using a hash join
- Goal: dataflow-driven scheduling of operations
- A tuple comes into the system
- Adaptively routed through operators in eddies
- May be combined, discarded, etc.
30 Can't Always Re-order Arbitrarily
- Need a moment of symmetry
- Some operators have a scheduling dependency
- e.g., nested loops join:
- for each tuple in the left table
- for each tuple in the right table
- if the tuples meet the predicate, output a result
- Index joins, pipelined hash joins are always
symmetric
- Sometimes have order restrictions
- e.g., request a tuple from one source, ship it to
another
31 The Eddy
- N-ary module consisting of query operators
- Basic unit of adaptivity
- Subplan with select, project, join operators
- Represents set of possible orderings of a subplan
- Each tuple may flow through a different
ordering (which may be constrained)
32 Example Join Subplan Alternatives
[Figure: alternative join trees over R1, R2, and R3, joining on attribute x
(predicates such as R1.x = R3.x, R2.x = R3.x, R2.x = R1.x); each tree is one
possible ordering a tuple might be routed through]
33 Naïve Eddy
- Given a tuple, route it to an operator that's ready
(see the sketch below)
- Analogous to fluid dynamics
- Adjusts to operator costs
- Ignores operator selectivity (reduction of input)
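A toy Python sketch of the naive policy (the data structures are assumed, not Telegraph's): route each tuple to an eligible operator whose input queue is shortest, a simple stand-in for the back-pressure mechanism.

from collections import deque

def route(tup, operators):
    # Among operators this tuple has not visited yet, pick the one whose input
    # queue is shortest (i.e., the one that is "ready"); back-pressure then
    # naturally throttles expensive operators.
    pending = [op for op in operators if op["name"] not in tup["done"]]
    target = min(pending, key=lambda op: len(op["queue"]))
    target["queue"].append(tup)
    return target["name"]

ops = [{"name": "sel_R", "queue": deque()},
       {"name": "join_RS", "queue": deque([0, 1, 2])}]   # join is backed up
print(route({"payload": ("r1",), "done": set()}, ops))   # -> 'sel_R'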
34 Adapting to Variable-Cost Selection
35 But Selectivity is Ignored
36 Lottery-Based Eddy
- Need to favor more selective operators
- A ticket is given per input tuple and returned per output tuple
- Lottery scheduling based on the number of tickets
(a small sketch follows below)
- Now handles both selectivity and cost
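An illustrative Python sketch of the ticket accounting (class and method names are assumed): the eddy gives an operator a ticket for every tuple routed to it and takes one back for every tuple the operator returns, so selective operators accumulate tickets and win the routing lottery more often.

import random

class LotteryEddy:
    def __init__(self, operator_names):
        self.tickets = {name: 1 for name in operator_names}  # keep weights positive

    def choose(self, eligible):
        # Lottery: pick an operator with probability proportional to its tickets.
        weights = [self.tickets[name] for name in eligible]
        return random.choices(eligible, weights=weights, k=1)[0]

    def on_input(self, name):    # a tuple was routed to the operator
        self.tickets[name] += 1

    def on_output(self, name):   # the operator returned a tuple to the eddy
        self.tickets[name] = max(1, self.tickets[name] - 1)

eddy = LotteryEddy(["sel_cheap", "sel_picky"])
for _ in range(10_000):
    op = eddy.choose(["sel_cheap", "sel_picky"])
    eddy.on_input(op)
    # Pretend sel_cheap passes 90% of its tuples and sel_picky only 10%.
    if random.random() < (0.9 if op == "sel_cheap" else 0.1):
        eddy.on_output(op)
print(eddy.tickets)   # sel_picky ends up with far more tickets, so it is favored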
37 Enhancing Adaptivity: Sliding Window
- Tickets were for the entire query
- Weighted sliding-window approach (sketched below):
- Escrow tickets during a window
- Banked tickets from the previous window
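A small Python sketch of the windowed refinement (names are illustrative; the exact weighting between windows is not spelled out on the slide): tickets earned in the current window are held in escrow, and at a window boundary they become the banked count from the previous window.

class WindowedTickets:
    def __init__(self):
        self.escrow = 0   # tickets earned during the current window
        self.banked = 0   # tickets carried over from the previous window

    def earn(self, n=1):
        self.escrow += n

    def end_window(self):
        # The current window's tickets become the banked amount; start fresh.
        self.banked, self.escrow = self.escrow, 0

    def weight(self):
        # Routing weight considers both recent and current behaviour.
        return self.banked + self.escrow

w = WindowedTickets()
w.earn(10)
w.end_window()
w.earn(2)
print(w.weight())   # 12: old behaviour still counts, but is dropped after the next window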
38 What about Delays?
- Problems here:
- Don't know when a join buffers vs. discards tuples
[Figure: a three-way join over sources R, S, and T, where R is slow]
39 Eddy Pros and Cons
- A mechanism for adaptively re-routing queries
- Makes the optimizer's task simpler
- Can do nearly as well as a well-optimized plan in
some cases
- Handles variable costs, variable selectivities
- But doesn't really handle joins very well;
attempts to address this in follow-up work:
- STeMs break a join into separate data
structures, which requires re-computation at each step
- STAIRs create intermediate state and shuffle it
back and forth
40 Other Areas Where Things Can Be Improved
- The work of pre-optimization or re-optimization
- Choose a good initial/next query plan
- Pick operator implementations, access methods
- STeMs fix this
- Handle arbitrary operators
- Aggregation, outer join, sorting, ...
- Next time we'll see techniques to address this
- Distribute work
- Distributed eddies (by DeWitt and students)