Title: MiniCon Reformulation
1 MiniCon Reformulation & Adaptive Re-Optimization
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Database Information Systems
- February 23, 2005
2 Administrivia
- Next reading assignment
- Urhan & Franklin, Query Scrambling
- Ives et al., Adaptive Data Partitioning
- Compare the different approaches
- One-page proposal of your project scope, goals,
and means of assessing success/failure due next
Monday, Feb. 28th
3 Today's Trivia Question
4 Buckets, Rev. 2: The MiniCon Algorithm
- A much smarter bucket algorithm
- In many cases, we don't need to perform the
cross-product of all items in all buckets
- Eliminates the need for the containment check
- This and the Chase & Backchase strategy of
Tannen et al. are the two methods most used in
virtual data integration today
5 MiniCon Descriptions (MCDs)
- Basically, a modification to the bucket approach
- A head homomorphism defines what variables must
be equated
- Variable-substituted version of the subgoals
- Mapping of variable names
- Info about what's covered
- Property 1:
- If a variable occurs in the head of the query, then
there must be a corresponding variable in the
head of the MCD's view
- If a variable participates in a join predicate in
the query, then it must be in the head of the view
6 MCD Construction
- For each subgoal of the query
- For each subgoal of each view
- Choose the least restrictive head homomorphism to
match the subgoal of the query
- If we can find a way of mapping the variables,
then add an MCD for each possible maximal
extension of the mapping that satisfies Property 1
(a toy sketch follows below)
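To make the loop concrete, here is a toy Python sketch under heavy simplifying assumptions: each MCD covers a single query subgoal, constants and non-trivial head homomorphisms are ignored, and only the query-head part of Property 1 is checked. The helper names (match, construct_mcds) are illustrative, not from the paper.

def match(q_goal, v_goal):
    # Try to map the arguments of a query subgoal onto a view subgoal.
    (q_pred, q_args), (v_pred, v_args) = q_goal, v_goal
    if q_pred != v_pred or len(q_args) != len(v_args):
        return None
    mapping = {}
    for qa, va in zip(q_args, v_args):
        if mapping.setdefault(qa, va) != va:  # qa already mapped to something else
            return None
    return mapping

def construct_mcds(query_head_vars, query_body, views):
    # views: list of (name, head_vars, body).  Returns (view name, index of
    # the covered query subgoal, variable mapping) triples whose mapping sends
    # query head variables to distinguished view variables (simplified Property 1).
    mcds = []
    for i, q_goal in enumerate(query_body):
        for name, head_vars, body in views:
            for v_goal in body:
                m = match(q_goal, v_goal)
                if m is not None and all(m[v] in head_vars
                                         for v in query_head_vars if v in m):
                    mcds.append((name, i, m))
    return mcds

# The running example: q(t) :- show(i, t, y, g), rating(i, r, s)
query_body = [("show", ("i", "t", "y", "g")), ("rating", ("i", "r", "s"))]
views = [("movieInfo", ("i", "t", "y", "g"), [("show", ("i", "t", "y", "g"))]),
         ("critics", ("i", "r", "s"), [("rating", ("i", "r", "s"))])]
print(construct_mcds(("t",), query_body, views))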
7 MCDs for Our Example
- 5star(i) :- show(i, t, y, g), rating(i, 5, s)
- TVguide(t, y, g, r) :- show(i, t, y, g), rating(i, r, TVGuide)
- movieInfo(i, t, y, g) :- show(i, t, y, g)
- critics(i, r, s) :- rating(i, r, s)
- goodMovies(t, y) :- show(i, t, 1997, drama), rating(i, 5, s)
- good98(t, y) :- show(i, t, 1998, drama), rating(i, 5, s)

q(t) :- show(i, t, y, g), rating(i, r, s), r ≥ 5
8 Combining MCDs
- Now look for ways of combining pairwise disjoint
subsets of the goals (a small enumeration sketch follows below)
- Greatly reduces the number of candidates!
- Also proven to be correct without the use of a
containment check
- Variations need to be made for:
- Constants in general (I sneaked those in)
- Semi-interval predicates (e.g., x ≤ c)
- Note that full-blown inequality predicates are
co-NP-hard in the size of the data, so they don't
work
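An illustrative (not the paper's) Python enumeration of the combination step, where each MCD is reduced to just the set of query subgoals it covers; it emits every pairwise-disjoint combination that covers the whole query.

def combine_mcds(mcds, n_subgoals):
    # mcds: list of (label, frozenset of covered query-subgoal indices).
    # Yield every list of MCDs whose coverage sets are pairwise disjoint
    # and together cover all n_subgoals of the query.
    def extend(chosen, covered, remaining):
        if len(covered) == n_subgoals:
            yield list(chosen)
            return
        for i, (label, goals) in enumerate(remaining):
            if goals.isdisjoint(covered):
                yield from extend(chosen + [label], covered | goals, remaining[i + 1:])
    yield from extend([], frozenset(), mcds)

# Coverage of the example MCDs: movieInfo covers subgoal 0 (show),
# critics covers subgoal 1 (rating), goodMovies covers both.
mcds = [("movieInfo", frozenset({0})), ("critics", frozenset({1})),
        ("goodMovies", frozenset({0, 1}))]
for combo in combine_mcds(mcds, 2):
    print(combo)   # ['movieInfo', 'critics'] and ['goodMovies']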
9 MiniCon Performance, Many Rewritings
10 Larger Query, Fewer Rewritings
11 MiniCon and LAV Summary
- The state of the art for AQUV (answering queries using views)
in the relational world of data integration
- It's been extended to support conjunctive
XQuery as well
- Scales to large numbers of views, which we need
in LAV data integration
- A similar approach: Chase & Backchase by Tannen et al.
- Slightly more general in some ways, but:
- Produces equivalent rewritings, not maximally
contained ones
- Not always polynomial in the size of the data
12 Motivations for Adaptive Query Processing
- Many domains where cost-based query optimization
fails:
- Complex queries in traditional databases:
estimation error grows exponentially with joins [IC91]
- Querying over the Internet: unpredictable access
rates, delays
- Querying external data sources: limited
information available about properties of the source
- Monitor real-world conditions, adapt processing
strategy in response
13 Can We Get RDBMS-Level Optimization for Data
Integration, without Statistics?
- Multiple remote sources
- Described and mapped loosely
- Data changes frequently
- Generally, would like to support same kinds of
queries as in a local setting
[Architecture diagram: a query is posed over the mediated schema; the data
integration system uses the source catalog and schema mappings to answer it
over remote, autonomous data sources and return results]
14 What Are the Sources of Inefficiency?
- Delays: we stall while waiting for I/O
- We'll talk about this on Monday
- Bad estimation of intermediate result sizes
- The focus of the Kabra and DeWitt paper
- No info about source cardinalities
- The focus of the eddies paper and Monday's paper
- The latter two are closely related
- Major challenges:
- Trading off information acquisition (exploration)
vs. use (exploitation)
- Extrapolating performance based on what you've
seen so far
15 Kabra and DeWitt
- Goal: a minimal update to a traditional
optimizer in order to compensate for bad
decisions
- General approach:
- Break the query plan into stages
- Instrument it
- Allow for re-invocation of the optimizer if the plan is
going awry
16 Elements of Mid-Query Re-Optimization
- Annotated Query Execution Plans
- Annotate plan with estimates of size
- Runtime Collection of Statistics
- Statistics collectors embedded in execution tree
- Keep overhead down
- Dynamic Resource Re-allocation
- Reallocate memory to individual operations
- Query Plan Modification
- May wish to re-optimize the remainder of query
17 Annotated Query Plans
- We save at each point in the tree the expected
- Sizes and cardinalities
- Selectivities of predicates
- Estimates of number of groups to be aggregated
18 Statistics Collectors
- Added into the tree
- Must be collectable in a single pass
- Will only help with portions of the query beyond
the current pipeline
19 Resource Re-Allocation
- Based on improved estimates, we can modify the
memory allocated to each operation
- Result: less I/O, better performance
- Only for operations that have not yet begun
executing, i.e., not in the current pipeline
(a small sketch follows below)
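A minimal sketch of the idea, assuming a simple proportional policy (not the paper's exact formula): memory is re-split among the operators that have not yet started, based on their refreshed size estimates.

def reallocate_memory(total_memory, operators):
    # operators: list of (name, has_started, refreshed_size_estimate).
    # Only operators that have not begun executing are eligible; split the
    # budget among them in proportion to the new estimates.
    pending = [(name, est) for name, started, est in operators if not started]
    total_est = sum(est for _, est in pending) or 1
    return {name: total_memory * est / total_est for name, est in pending}

print(reallocate_memory(100, [("hash-join-1", True, 40),
                              ("hash-join-2", False, 30),
                              ("sort", False, 10)]))
# {'hash-join-2': 75.0, 'sort': 25.0}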
20 Plan Modification
- Create a new plan for the remainder, treating the temp result as
an input
- Only re-optimize the part that has not begun
- Suspend the query, save intermediate results in a temp file
21 Re-Optimization
- When to re-optimize?
- Calculate the time the current plan should take (using the
gathered stats)
- Only consider re-optimization if:
- Our original estimate was off by at least some
factor (≥ 2), and
- The estimated remaining time T_opt,estimated is large enough to
justify the cost of optimization, which depends on the number of
operators, esp. joins
- Only modify the plan if the new estimate,
including the cost of writing the temp file, is
better (a sketch of this decision follows below)
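A hedged Python sketch of the decision rule; the exact thresholds on the slide were garbled, so the factor of 2 and the cost comparison below are illustrative placeholders rather than the paper's precise conditions.

def should_reoptimize(estimated_size, observed_size,
                      remaining_cost, reoptimization_cost, error_factor=2.0):
    # Consider re-optimizing only if the original size estimate was off by at
    # least error_factor, and the remaining work is large enough to pay for
    # another optimizer call (plus writing the temp file, folded into
    # reoptimization_cost here).
    ratio = max(estimated_size, observed_size) / max(min(estimated_size, observed_size), 1)
    return ratio >= error_factor and remaining_cost > reoptimization_cost

print(should_reoptimize(estimated_size=1_000, observed_size=50_000,
                        remaining_cost=120.0, reoptimization_cost=2.5))   # True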
22 Low-Overhead Statistics
- Want to find the most effective statistics
- Don't want to gather statistics for simple
queries
- Want to limit the effect of the algorithm to a maximum
overhead ratio
- Factors:
- Probability of inaccuracy
- Fraction of the query affected
- How do we know this without having stats?
23 Inaccuracy Potentials
- The following heuristics are used:
- Inaccuracy potential is low, medium, or high
- Lower if we have more information on the table's value
distribution
- 1 + max of the inputs for a multiple-input selection
- Always high for user-defined methods
- Always high for non-equijoins
- For most other operators, same as the worst of the inputs
24 More Heuristics
- Check the fraction of the query affected
- Check how many other operators use the same
statistic
- The winner:
- Higher inaccuracy potentials first
- Then, if a tie, the one affecting the larger
portion of the plan (a small sketch of this choice follows below)
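Putting the last two slides together, a small illustrative Python sketch (assumed structure and names, not the paper's code) of computing an operator's inaccuracy potential and then picking which statistic to collect.

LEVELS = {"low": 0, "medium": 1, "high": 2}

def inaccuracy_potential(kind, input_potentials):
    # Rough version of the heuristics: user-defined methods and non-equijoins
    # are always high; most other operators inherit the worst of their inputs.
    # (The distribution-based lowering and the selection rule are omitted.)
    if kind in ("user-defined", "non-equijoin"):
        return "high"
    return max(input_potentials, key=LEVELS.__getitem__, default="low")

def pick_collector(candidates):
    # candidates: (statistic name, inaccuracy potential, fraction of plan affected).
    # Prefer higher inaccuracy potential; break ties by the larger fraction.
    return max(candidates, key=lambda c: (LEVELS[c[1]], c[2]))

print(inaccuracy_potential("equijoin", ["low", "medium"]))          # 'medium'
print(pick_collector([("card(scan-A)", "medium", 0.8),
                      ("card(join-1)", "high", 0.4),
                      ("card(join-2)", "high", 0.6)]))              # join-2 wins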
25 Implementation
- On top of Paradise (a parallel database that
supports ADTs, built on an OO framework)
- Using a System-R optimizer
- New SCIA (Stat Collector Insertion Algorithm) and
Dynamic Re-Optimization modules
26 It Works!
- Results are about 5% worse for simple queries, much
better for complex queries
- Of course, we would not really collect statistics
on simple queries
- Data skew made a slight difference: both normal
and re-optimized queries performed slightly better
27 Pros and Cons
- Provides significant potential for improvement
without adding much overhead
- Biased towards exploitation, with very limited
information-gathering
- A great way to retrofit an existing system
- In SIGMOD '04, IBM had a paper that did this in DB2
- But fairly limited to the traditional DB context
- Relies on us knowing the (rough) cardinalities of
the sources
- Query plans aren't fully pipelined, meaning:
- If the pipeline is broken too infrequently, MQRO
may not help
- If the pipeline is broken too frequently,
time-to-first-answer is slow
28 The Opposite Extreme: Eddies
- The basic idea:
- Query processing consists of sending tuples
through a series of operators
- Why not treat it like a routing problem?
- Rely on back-pressure (i.e., queue overflow) to
tell us where to send tuples
- Part of the ongoing Telegraph project at Berkeley
- Large-scale federated, shared-nothing data stream
engine
- Variations in data transfer rates
- Little knowledge of data sources
29 Telegraph Architecture
- Simple pre-optimizer to generate an initial plan
- Creates operators, e.g.
- Select: σ_predicate(sourceA)
- Join: ⋈_predicate(sourceA, sourceB)
- (No support for aggregation, union, etc.)
- Chooses implementations
- Select using an index, join using a hash join
- Goal: dataflow-driven scheduling of operations
- A tuple comes into the system
- Adaptively routed through operators in eddies
- May be combined, discarded, etc.
30 Can't Always Re-order Arbitrarily
- Need a moment of symmetry
- Some operators have a scheduling dependency
- e.g., nested loops join:
- for each tuple in the left table
- for each tuple in the right table
- if the tuples meet the predicate, output a result
- Index joins, pipelined hash joins are always
symmetric
- Sometimes have order restrictions
- e.g., request a tuple from one source, ship it to
another
31 The Eddy
- N-ary module consisting of query operators
- Basic unit of adaptivity
- Subplan with select, project, join operators
- Represents set of possible orderings of a subplan
- Each tuple may flow through a different
ordering (which may be constrained)
32 Example Join Subplan Alternatives
[Figure: alternative join trees over R1, R2, and R3, joining on attribute x
(predicates such as R1.x = R3.x, R2.x = R3.x, R2.x = R1.x); each tree is one
possible ordering a tuple might be routed through]
33 Naïve Eddy
- Given a tuple, route it to an operator that's ready
(see the sketch below)
- Analogous to fluid dynamics
- Adjusts to operator costs
- Ignores operator selectivity (reduction of input)
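A toy Python sketch of the naive policy (the data structures are assumed, not Telegraph's): route each tuple to an eligible operator whose input queue is shortest, a simple stand-in for the back-pressure mechanism.

from collections import deque

def route(tup, operators):
    # Among operators this tuple has not visited yet, pick the one whose input
    # queue is shortest (i.e., the one that is "ready"); back-pressure then
    # naturally throttles expensive operators.
    pending = [op for op in operators if op["name"] not in tup["done"]]
    target = min(pending, key=lambda op: len(op["queue"]))
    target["queue"].append(tup)
    return target["name"]

ops = [{"name": "sel_R", "queue": deque()},
       {"name": "join_RS", "queue": deque([0, 1, 2])}]   # join is backed up
print(route({"payload": ("r1",), "done": set()}, ops))   # -> 'sel_R'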
34 Adapting to Variable-Cost Selection
35 But Selectivity is Ignored
36 Lottery-Based Eddy
- Need to favor more selective operators
- A ticket is given per input tuple and returned per output tuple
- Lottery scheduling based on the number of tickets
(a small sketch follows below)
- Now handles both selectivity and cost
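An illustrative Python sketch of the ticket accounting (class and method names are assumed): the eddy gives an operator a ticket for every tuple routed to it and takes one back for every tuple the operator returns, so selective operators accumulate tickets and win the routing lottery more often.

import random

class LotteryEddy:
    def __init__(self, operator_names):
        self.tickets = {name: 1 for name in operator_names}  # keep weights positive

    def choose(self, eligible):
        # Lottery: pick an operator with probability proportional to its tickets.
        weights = [self.tickets[name] for name in eligible]
        return random.choices(eligible, weights=weights, k=1)[0]

    def on_input(self, name):    # a tuple was routed to the operator
        self.tickets[name] += 1

    def on_output(self, name):   # the operator returned a tuple to the eddy
        self.tickets[name] = max(1, self.tickets[name] - 1)

eddy = LotteryEddy(["sel_cheap", "sel_picky"])
for _ in range(10_000):
    op = eddy.choose(["sel_cheap", "sel_picky"])
    eddy.on_input(op)
    # Pretend sel_cheap passes 90% of its tuples and sel_picky only 10%.
    if random.random() < (0.9 if op == "sel_cheap" else 0.1):
        eddy.on_output(op)
print(eddy.tickets)   # sel_picky ends up with far more tickets, so it is favored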
37 Enhancing Adaptivity: Sliding Window
- Tickets were for the entire query
- Weighted sliding-window approach (sketched below):
- Escrow tickets during a window
- Banked tickets from the previous window
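A small Python sketch of the windowed refinement (names are illustrative; the exact weighting between windows is not spelled out on the slide): tickets earned in the current window are held in escrow, and at a window boundary they become the banked count from the previous window.

class WindowedTickets:
    def __init__(self):
        self.escrow = 0   # tickets earned during the current window
        self.banked = 0   # tickets carried over from the previous window

    def earn(self, n=1):
        self.escrow += n

    def end_window(self):
        # The current window's tickets become the banked amount; start fresh.
        self.banked, self.escrow = self.escrow, 0

    def weight(self):
        # Routing weight considers both recent and current behaviour.
        return self.banked + self.escrow

w = WindowedTickets()
w.earn(10)
w.end_window()
w.earn(2)
print(w.weight())   # 12: old behaviour still counts, but is dropped after the next window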
38 What about Delays?
- Problems here:
- Don't know when a join buffers vs. discards tuples
[Figure: a three-way join over sources R, S, and T, where R is slow]
39 Eddy Pros and Cons
- A mechanism for adaptively re-routing queries
- Makes the optimizer's task simpler
- Can do nearly as well as a well-optimized plan in
some cases
- Handles variable costs, variable selectivities
- But doesn't really handle joins very well;
attempts to address this in follow-up work:
- STeMs break a join into separate data
structures, which requires re-computation at each step
- STAIRs create intermediate state and shuffle it
back and forth
40 Other Areas Where Things Can Be Improved
- The work of pre-optimization or re-optimization
- Choose a good initial/next query plan
- Pick operator implementations, access methods
- STeMs fix this
- Handle arbitrary operators
- Aggregation, outer join, sorting, ...
- Next time we'll see techniques to address this
- Distribute work
- Distributed eddies (by DeWitt and students)