Title: Adaptively Processing Remote Data and Learning Source Mappings
1. Adaptively Processing Remote Data and Learning Source Mappings
- Zachary G. Ives
- University of Pennsylvania
- CIS 650: Database and Information Systems
- March 14, 2005
- LSD slides courtesy of AnHai Doan
2. Administrivia
- Midterm due 3/16
  - 5-10 pages (single-spaced, 10-12 pt)
- If you haven't told me which topic, please do so now!
3. Today's Trivia Question
4. Many Motivations for Adaptive Query Processing
- Many domains where cost-based query optimization fails:
  - Complex queries in traditional databases: estimation error grows exponentially with the number of joins [IC91]; the focus of [KD98, M04]
  - Querying over the Internet: unpredictable access rates and delays
  - Querying external data sources: limited information available about the properties of the source
- Monitor real-world conditions, adapt the processing strategy in response
5. Generalizing Adaptive Query Processing
- We've seen a range of different adaptive techniques
  - How do they fit together?
  - Can we choose points between eddies and mid-query re-optimization?
  - Can we exploit other kinds of query optimization tricks?
6. Popular Types of Adaptive Query Processing
- Adaptive scheduling (query scrambling [UF98], dynamic rescheduling [UF01], pipelined hash join [UF00, RS86, I99, HH99])
  - Changes CPU scheduling to improve feedback or reduce delays
  - Can't reduce total work
- Redundant computation (competitive execution [AZ96])
  - Compare two ways of executing the query
  - Need to identify a few promising plans
- Plan partitioning (INGRES [S76], mid-query re-optimization [KD98, I99, M04])
  - Break the plan into stages; re-optimize future stages as necessary
  - Coarse granularity; breaks pipelining
- Are these the only options?
7. Two More Forms of Adaptivity
- Adaptive data partitioning [AH00, R03, DH04, I04]
  - Break the data into subsets; use a different plan for each subset
  - Generalizes intra-plan reordering in an SPJGU query
  - The only way to reduce overall computation with fine granularity
  - The only previous implementation has been eddies [AH00, R03, DH04]
- Adaptive information passing
  - Extends sideways information passing (magic sets) to an adaptive context, both intra- and inter-plan
  - Reduces the computation and space devoted to non-productive tuples
8. Eddies Combine Adaptive Scheduling and Data Partitioning Decisions
- Intuitively, each tuple gets its own query plan
  - Route each tuple to the next operator based on the speed and selectivity of each operator (see the sketch below)
  - Elegant and simple to implement
- But performing a join creates subresults at the next level!
  - Local greedy choices may result in state that needs to join with all future data!
- Consider the long-term effects of decisions before making them: separate CPU scheduling from plan selection
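To make the per-tuple routing idea concrete, here is a minimal Python sketch (not from any of the cited papers; all names are illustrative) of a greedy eddy-style policy that sends each tuple to the cheap operator most likely to eliminate it:

```python
class Operator:
    """A selection/join stub that tracks its own observed statistics."""
    def __init__(self, name, predicate, cost):
        self.name, self.predicate, self.cost = name, predicate, cost
        self.seen, self.passed = 0, 0

    def apply(self, t):
        self.seen += 1
        ok = self.predicate(t)
        self.passed += ok
        return ok

    def selectivity(self):
        return self.passed / self.seen if self.seen else 0.5  # prior guess

def route(t, pending):
    # Greedy policy: try the operator with the lowest selectivity * cost,
    # i.e., the one most likely to drop the tuple cheaply
    return min(pending, key=lambda op: op.selectivity() * op.cost)

def eddy(tuples, ops):
    for t in tuples:
        pending = list(ops)
        while pending:
            op = route(t, pending)
            if not op.apply(t):
                break              # tuple eliminated early
            pending.remove(op)
        else:
            yield t                # tuple satisfied every operator
```

As the observed selectivities drift, later tuples may visit the operators in a different order than earlier ones, which is exactly the per-tuple-plan behavior described above.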
9. Focusing Purely on Adaptive Data Partitioning
- Use adaptively scheduled operators to fill CPU cycles
- Now a query optimizer problem:
  - Choose a plan that minimizes long-term cost (in CPU cycles)
- To allow multiple plans, distribute union through join (and select, project, etc.)
- If R1 = R11 ∪ R12 and R2 = R21 ∪ R22, then
  R1 ⋈ R2 = (R11 ∪ R12) ⋈ (R21 ∪ R22)
          = (R11 ⋈ R21) ∪ (R12 ⋈ R22) ∪ (R11 ⋈ R22) ∪ (R12 ⋈ R21)
  [Figure: the four partition joins over R11, R12, R21, R22 covering R1 ⋈ R2]
- This generalizes to n-way joins and other SPJGU operators (a small sanity check follows)
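The identity can be sanity-checked directly; a small Python sketch over made-up keyed relations compares the full join against the union of the four partition joins:

```python
def join(r, s):
    # Equijoin on the first attribute; tuples are (key, payload)
    return {(k1, a, b) for (k1, a) in r for (k2, b) in s if k1 == k2}

R1 = {(1, "r1"), (2, "r2"), (3, "r3")}
R2 = {(1, "s1"), (3, "s3")}

# Partition arbitrarily: R1 = R11 ∪ R12, R2 = R21 ∪ R22
R11, R12 = {(1, "r1")}, {(2, "r2"), (3, "r3")}
R21, R22 = {(1, "s1")}, {(3, "s3")}

full = join(R1, R2)
pieces = join(R11, R21) | join(R12, R22) | join(R11, R22) | join(R12, R21)
assert full == pieces  # the four partition joins cover the whole join
```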
10. Adaptive Data Partitioning: Routing Data across Different Plans
[Figure: tuples from R, S, and T are routed either to the old plan over R ⋈ S ⋈ T or to a new plan over the remaining partitions R0, S0, T0]
- Options for combining across phases:
  - New results always injected into the old plan
  - Old results injected into the new plan
  - Wait until the end; run a "stitch-up" plan based on the best stats
11. Special Architectural Features for ADP
- A monitoring and re-optimization thread runs alongside execution
  - System-R-like optimizer with aggregation support; uses the most current selectivity estimates
  - Periodic monitoring and re-optimization revises selectivity estimates and recomputes expected costs
- Query execution with smart router operators
- Special support for efficient stitch-up plans
  - Uses intermediate results from previous plans (a special case of answering queries using views [H01])
  - A join-over-union ("stitch-up join") operator that excludes certain results
12. ADP Application 1: Correcting Cost Mis-estimates
- Goal: react to plans that are obviously bad
  - Don't spend cycles searching for a slightly better plan
  - Try to avoid paths that are unlikely to be promising
- Monitor/reoptimizer thread watches the cardinalities of subresults
  - Re-estimates plan cost and compares it to the projected costs of alternatives, using several techniques and heuristics (see paper)
  - Our experiments re-estimate every 1 sec.
- The smart router operator does the following (a sketch appears below):
  - Waits for the monitor/reoptimizer to suggest a replacement plan
  - Re-routes source data into the new plan
  - The new plan's output is unioned with the output of the previous plan; this is fed into any final aggregation operations
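A minimal sketch of this router behavior, assuming invented Plan/SmartRouter interfaces (the real system's operators are far richer):

```python
import threading

class Plan:
    """Stand-in for an executable plan; buffers its 'results'."""
    def __init__(self, name):
        self.name, self.buf = name, []
    def consume(self, t):
        self.buf.append((self.name, t))   # placeholder for real join work
    def output(self):
        return iter(self.buf)

class SmartRouter:
    def __init__(self, initial_plan):
        self.plan, self.retired = initial_plan, []
        self.lock = threading.Lock()

    def suggest_replacement(self, new_plan):
        # Called by the monitor/reoptimizer thread when re-estimated
        # costs favor another plan
        with self.lock:
            self.retired.append(self.plan)
            self.plan = new_plan

    def push(self, t):
        with self.lock:
            self.plan.consume(t)          # re-route source data

    def results(self):
        # Union of all plans' outputs, fed to any final aggregation
        for p in self.retired + [self.plan]:
            yield from p.output()
```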
13. Correcting for Unexpected Selectivities
[Experimental results; Pentium IV 3.06 GHz, Windows XP]
14. ADP Application 2: Optimizing for Order
- The most general ADP approach:
  - Pre-generate plans for the general case and for each interesting order
  - The smart router sends each tuple to the plan whose ordering constraint that tuple follows
  - But with multiple joins, MANY plans
- Instead, do ADP at the operator level: the complementary join pair
  - Does its own stitch-up internally
  - Easier to optimize for!
  - Can also do partial sorting at the router (priority queue); see the sketch after the figure
[Figure: routers over R and S feed in-order tuples to a merge join and out-of-order tuples to a hash join (hash tables h(R), h(S)), with priority queues Q buffering partially sorted runs]
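As a rough illustration of the router side only (the complementary join pair's internal stitch-up is not shown), here is a hedged Python sketch in which a bounded priority queue performs partial sorting, in-order tuples go to the merge-join side, and stragglers go to the hash-join side; all names are invented:

```python
import heapq

def route_by_order(tuples, key, window=1024):
    """Partially sort within a bounded window; emit tuples that still
    respect the sort order to the merge join, the rest to the hash join."""
    pq, last, to_merge, to_hash = [], None, [], []

    def emit(t):
        nonlocal last
        if last is None or key(t) >= last:
            to_merge.append(t)       # follows the ordering constraint
            last = key(t)
        else:
            to_hash.append(t)        # out of order: handled by hash join

    for i, t in enumerate(tuples):
        heapq.heappush(pq, (key(t), i, t))
        if len(pq) > window:         # the queue bounds the reordering
            emit(heapq.heappop(pq)[2])
    while pq:
        emit(heapq.heappop(pq)[2])
    return to_merge, to_hash
```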
15. Exploiting Partial Order in the Data
[Experimental results; Pentium IV 3.06 GHz, Windows XP; 1024-tuple queue]
16. ADP over Windows: Optimizing for Aggregation
- Group-by optimization [CS94]:
  - May be able to pre-aggregate some tuples before joining
  - Works because aggregates can be applied over a union of partial results
  - But once we insert pre-aggregation, we're stuck (and it's not pipelined)
- Our solution: adjustable-window pre-aggregation (sketched below)
  - Change the window size depending on how effectively we can aggregate
  - Also allows data to propagate through the plan: better info for adaptivity, early answers
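A minimal sketch of adjustable-window pre-aggregation (SUM only; the window-resizing policy here is an invented stand-in, not the paper's):

```python
from collections import defaultdict

def windowed_preagg(tuples, group_key, value, min_win=64, max_win=4096):
    """Pre-aggregate within a window, flushing partial sums downstream.
    Partial SUMs are safe because the aggregate distributes over union."""
    win, buf, n = min_win, defaultdict(float), 0
    for t in tuples:
        buf[group_key(t)] += value(t)
        n += 1
        if n >= win:
            # Grow the window when it collapses many tuples per group;
            # shrink it when pre-aggregation barely helps, so data keeps
            # flowing (better adaptivity info, earlier answers)
            win = min(max_win, win * 2) if len(buf) < n / 2 else max(min_win, win // 2)
            yield from buf.items()
            buf.clear()
            n = 0
    yield from buf.items()
```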
17. Pre-Aggregation Comparison
18. The State of the Union: Join and Agg
- Useless intermediate state is perhaps the biggest concern in ADP-based (or even plan-partitioning) approaches
  - It is very easy to create large intermediate state before switching away from a plan
  - This results in significant additional computation: "the burden of history" [DH04]
- It is also the major bottleneck in computing queries with correlated subqueries
  - We only want to compute the parts of a subquery that will contribute to final answers
  - The local-DB solution: magic sets rewritings [M90, CR91, MP94, S96]
19. Intuition behind Magic Sets Rewritings
- Observations:
  - Computing a subquery once for every iteration of the outer query is repetitive and inefficient
  - Computing the subquery in its entirety is also frequently inefficient
- So pass in information about specifically which tuples from the inner query might join with the outer query
  - A filter set: generally a projection of a portion of the outer query's results
  - Anything that joins with the parent block must join with the filter set
  - False positives are OK
20. Query with Magic Set [S96]

CREATE VIEW TotalSales(SellerID, Sales, ItemsSold) AS
  SELECT SL.SellerID, SUM(salePrice) AS Sales, COUNT(*) AS ItemsSold
  FROM SellerList SL, SaleItem S
  WHERE SL.SellerID = S.SellerID
  GROUP BY SL.SellerID

SELECT SellerID, Sales, ItemsSold
FROM TotalSales TS, Recommended REC, Ratings RAT
WHERE REC.SellerID = TS.SellerID
  AND RAT.SellerID = TS.SellerID
  AND RAT.Rating > 4 AND ItemsSold > 50
22. Magic in Data Integration
- In data integration:
  - It is difficult to determine when to do sideways information passing/magic in a cost-based way
  - Magic optimization destroys some potential parallelism: the outer block must be computed first
- Opportunities:
  - Pipelined hash joins give us complete state for every intermediate result
  - We use bushy trees
- Our idea: do information passing out-of-band
  - Consider a plan as if it's a relational calculus expression: every tuple must satisfy the constraints
  - The plan dataflow enforces this
  - But we can also pass information across the plan, outside the normal dataflow
[Figure: a bushy join tree over A, B, and C, with information passed between subplans outside the dataflow]
23. Adaptive Information Passing
- Cost-based strategy:
  - Execute all blocks in parallel (up to the maximum pipelineable size)
  - Whenever a subresult is completely computed, feed it elsewhere in the query plan as a filter set
    - Anywhere with a shared predicate is an eligible target
  - Use our ability to estimate the remaining cost of query execution to decide whether the semijoin will speed performance
  - Can always inject a more precise filter set (one that checks more predicates), or remove a filter set
    - A filter set is a performance/space optimization, not necessary for correctness
  - We use Bloom filters rather than hash tables (our VLDB05 submission has a detailed performance comparison); a sketch follows
- Also compared against a naive strategy that generates filter sets at every operator; when complete, they are used as filters by downstream ops
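Since false positives are acceptable (slide 19), a Bloom filter is a natural compact representation for the filter set. A self-contained Python sketch (the sizing parameters are arbitrary, not the paper's):

```python
import hashlib

class BloomFilter:
    def __init__(self, nbits=1 << 16, nhashes=4):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def semijoin_filter(tuples, key, bloom):
    # Apply a completed subresult's filter set at any operator that
    # shares the join predicate; removing it never affects correctness
    return (t for t in tuples if key(t) in bloom)
```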
24. Tuples Created: TPC-H, 1 GB (67% savings in Q2; also savings in Q5, not shown)
25. Adaptive QP in Summary
- A variety of different techniques, focusing on:
  - Scheduling
  - Comparison & competition
  - Data & plan partitioning
  - Information passing
- A field that is still fairly open; missing:
  - Effective exploration methods
  - A true theory!
    - What's possible? What kinds of queries make sense to adapt?
  - Guarantees of optimality and convergence (perhaps under certain assumptions)
26. Switching from Low-Level to High-Level
- We've talked about:
  - Query reformulation (composing queries with mappings)
  - Query optimization & execution
- But how did we ever get the mappings in the first place?
  - This is one of the most tedious tasks
- Answer: LSD (and not the kind that makes you high!)
- Slides courtesy of AnHai Doan, UIUC
27. Semantic Mappings between Schemas
- Mediated & source schemas = XML DTDs
[Figure: a mediated schema (house: address, num-baths, contact-info: agent-name, agent-phone) mapped to a source schema (house: location, full-baths, half-baths, contact: name, phone); address ↔ location is a 1-1 mapping, while num-baths ↔ full-baths + half-baths is a non-1-1 mapping]
28. The LSD (Learning Source Descriptions) Approach
- Suppose a user wants to integrate 100 data sources
- 1. The user:
  - manually creates mappings for a few sources, say 3
  - shows LSD these mappings
- 2. LSD learns from the mappings
  - Multi-strategy learning incorporates many types of info in a general way
  - Knowledge of constraints further helps
- 3. LSD proposes mappings for the remaining 97 sources
29. Example
- Mediated schema: address, price, agent-phone, description
- Schema of realestate.com: location, listed-price, phone, comments
- Learned hypotheses:
  - If "phone" occurs in the element name => agent-phone
  - If "fantastic" & "great" occur frequently in data values => description
- realestate.com data:
  - location: Miami, FL | Boston, MA | ...
  - listed-price: 250,000 | 110,000 | ...
  - phone: (305) 729 0831 | (617) 253 1429 | ...
  - comments: Fantastic house | Great location | ...
- homes.com data:
  - price: 550,000 | 320,000 | ...
  - contact-phone: (278) 345 7215 | (617) 335 2315 | ...
  - extra-info: Beautiful yard | Great beach | ...
30. LSD's Multi-Strategy Learning
- Use a set of base learners
  - Each exploits certain types of information well
- To match the schema elements of a new source:
  - Apply the base learners
  - Combine their predictions using a meta-learner
- Meta-learner
  - Uses the training sources to measure each base learner's accuracy
  - Weighs each learner based on its accuracy
31. Base Learners
- Input:
  - Schema information: name, proximity, structure, ...
  - Data information: value, format, ...
- Output:
  - Predictions weighted by confidence scores
- Examples:
  - Name learner:
    - agent-name => (name, 0.7), (phone, 0.3)
  - Naive Bayes learner:
    - "Kent, WA" => (address, 0.8), (name, 0.2)
    - "Great location" => (description, 0.9), (address, 0.1)
32. Training the Learners
- Mediated schema: address, price, agent-phone, description
- Schema of realestate.com: location, listed-price, phone, comments
- realestate.com data:
  - <location> Miami, FL </location> <listed-price> 250,000 </listed-price> <phone> (305) 729 0831 </phone> <comments> Fantastic house </comments>
  - <location> Boston, MA </location> <listed-price> 110,000 </listed-price> <phone> (617) 253 1429 </phone> <comments> Great location </comments>
- Name Learner training pairs:
  - (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
- Naive Bayes Learner training pairs:
  - ("Miami, FL", address), ("250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), ...
33. Applying the Learners
- Mediated schema: address, price, agent-phone, description
- Schema of homes.com: area, day-phone, extra-info
- area: <area>Seattle, WA</area>, <area>Kent, WA</area>, <area>Austin, TX</area>
  - Name Learner & Naive Bayes predictions per value: (address, 0.8), (description, 0.2); (address, 0.6), (description, 0.4); (address, 0.7), (description, 0.3)
  - Meta-Learner: (address, 0.7), (description, 0.3)
- day-phone: <day-phone>(278) 345 7215</day-phone>, <day-phone>(617) 335 2315</day-phone>, <day-phone>(512) 427 1115</day-phone>
  - Meta-Learner: (agent-phone, 0.9), (description, 0.1)
- extra-info: <extra-info>Beautiful yard</extra-info>, <extra-info>Great beach</extra-info>, <extra-info>Close to Seattle</extra-info>
  - Meta-Learner: (address, 0.6), (description, 0.4)
34. Domain Constraints
- Impose semantic regularities on sources
  - Verified using the schema or the data
- Examples:
  - a = address ∧ b = address ⇒ a ≠ b
  - a = house-id ⇒ a is a key
  - a = agent-info ∧ b = agent-name ⇒ b is nested in a
- Can be specified up front
  - When creating the mediated schema
  - Independent of any actual source schema
35. The Constraint Handler
- Domain constraint: a = address ∧ b = address ⇒ a ≠ b
- Predictions from the Meta-Learner:
  - area: (address, 0.7), (description, 0.3)
  - contact-phone: (agent-phone, 0.9), (description, 0.1)
  - extra-info: (address, 0.6), (description, 0.4)
- Candidate assignments are scored by multiplying confidences; those violating a constraint are pruned:
  - area = address, contact-phone = agent-phone, extra-info = address: 0.7 × 0.9 × 0.6 = 0.378 (two addresses: violates the constraint)
  - area = address, contact-phone = agent-phone, extra-info = description: 0.7 × 0.9 × 0.4 = 0.252 (best valid assignment)
  - area = description, contact-phone = description, extra-info = description: 0.3 × 0.1 × 0.4 = 0.012
- Can specify arbitrary constraints
  - User feedback as a domain constraint: ad-id = house-id
- Extended to handle domain heuristics
  - a = agent-phone ∧ b = agent-name ⇒ a & b are usually close to each other
- A minimal sketch of this search appears below
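A minimal Python sketch of such a constraint handler, reproducing the arithmetic above (the brute-force enumeration is purely illustrative; LSD's actual search is more sophisticated):

```python
from itertools import product

def best_mapping(predictions, constraints):
    """Score each full assignment by multiplying confidences; prune any
    assignment that violates a domain constraint."""
    elems = list(predictions)
    best, best_score = None, 0.0
    for choice in product(*(predictions[e] for e in elems)):
        assign = {e: label for e, (label, _) in zip(elems, choice)}
        if not all(ok(assign) for ok in constraints):
            continue                      # e.g., two elements -> address
        score = 1.0
        for _, conf in choice:
            score *= conf
        if score > best_score:
            best, best_score = assign, score
    return best, best_score

preds = {"area":          [("address", 0.7), ("description", 0.3)],
         "contact-phone": [("agent-phone", 0.9), ("description", 0.1)],
         "extra-info":    [("address", 0.6), ("description", 0.4)]}
# a = address and b = address => a != b: at most one element maps to address
one_address = lambda a: sum(v == "address" for v in a.values()) <= 1
print(best_mapping(preds, [one_address]))
# -> area = address, contact-phone = agent-phone,
#    extra-info = description, with score ≈ 0.252
```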
36. Putting It All Together: The LSD System
[Figure: in the training phase, the mediated schema, source schemas, data listings, and training data feed base learners L1, L2, ..., Lk; in the matching phase, their predictions go through Mapping Combination and the Constraint Handler, which uses the domain constraints and user feedback]
- Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
- Meta-learner
  - Uses stacking [Ting & Witten 99, Wolpert 92]
  - Returns a linear weighted combination of the base learners' predictions (sketched below)
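A toy Python sketch of the stacking combination; the weights here are made up, whereas LSD learns them from the manually mapped training sources:

```python
def meta_predict(base_predictions, weights):
    """Linear weighted combination of base learners' confidence scores."""
    combined = {}
    for learner, preds in base_predictions.items():
        for label, conf in preds:
            combined[label] = combined.get(label, 0.0) + weights[learner] * conf
    total = sum(combined.values()) or 1.0   # renormalize to confidences
    return sorted(((lbl, c / total) for lbl, c in combined.items()),
                  key=lambda lc: -lc[1])

base = {"name":        [("address", 0.5), ("description", 0.5)],
        "naive_bayes": [("address", 0.8), ("description", 0.2)]}
print(meta_predict(base, {"name": 0.3, "naive_bayes": 0.7}))
# -> [('address', ≈0.71), ('description', ≈0.29)]
```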
37. Empirical Evaluation
- Four domains:
  - Real Estate I & II, Course Offerings, Faculty Listings
- For each domain:
  - Create a mediated DTD & domain constraints
  - Choose five sources
  - Extract & convert data listings into XML
  - Mediated DTDs: 14-66 elements; source DTDs: 13-48
- Ten runs for each experiment; in each run:
  - Manually provide 1-1 mappings for 3 sources
  - Ask LSD to propose mappings for the remaining 2 sources
  - Accuracy = % of 1-1 mappings correctly identified
38. LSD Matching Accuracy
[Chart: average matching accuracy (%)]
- LSD's accuracy: 71-92%
- Best single base learner: 42-72%
- Adding the meta-learner: +5-22%
- Adding the constraint handler: +7-13%
- Adding the XML learner: +0.8-6%
39. LSD Summary
- Applies machine learning to schema matching
  - Uses multi-strategy learning
  - Domain & user-specified constraints
- Probably the most flexible means of doing semi-automated schema matching today
- Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings
40. Jumping Up a Level
- We've now seen how distributed data makes a huge difference:
  - In heterogeneity and the need for relating different kinds of attributes
    - Mapping languages
    - Mapping tools
    - Query reformulation
  - ... and in query processing
    - Adaptive query processing