Title: Adaptively Processing Remote Data and Learning Source Mappings
1. Adaptively Processing Remote Data and Learning Source Mappings
- Zachary G. Ives
- University of Pennsylvania
- CIS 650: Database and Information Systems
- March 14, 2005
- LSD slides courtesy of AnHai Doan
2. Administrivia
- Midterm due 3/16
  - 5-10 pages (single-spaced, 10-12 pt)
- If you haven't told me which topic, please do so now!
3. Today's Trivia Question
4. Many Motivations for Adaptive Query Processing
- Many domains where cost-based query optimization fails:
  - Complex queries in traditional databases: estimation error grows exponentially with the number of joins [IC91]; the focus of [KD98, M04]
  - Querying over the Internet: unpredictable access rates and delays
  - Querying external data sources: limited information available about the properties of the source
- Monitor real-world conditions, adapt the processing strategy in response
5. Generalizing Adaptive Query Processing
- We've seen a range of different adaptive techniques
  - How do they fit together?
  - Can we choose points between eddies and mid-query re-optimization?
  - Can we exploit other kinds of query optimization tricks?
6. Popular Types of Adaptive Query Processing
- Adaptive scheduling (query scrambling [UF98], dynamic rescheduling [UF01], pipelined hash join [UF00, RS86, I99, HH99])
  - Changes CPU scheduling to improve feedback or reduce delays
  - Can't reduce total work
- Redundant computation (competitive execution [AZ96])
  - Compare two ways of executing the query
  - Need to identify a few promising plans
- Plan partitioning (INGRES [S76], mid-query re-optimization [KD98, I99, M04])
  - Break the plan into stages; re-optimize future stages as necessary
  - Coarse granularity; breaks pipelining
- Are these the only options?
7. Two More Forms of Adaptivity
- Adaptive data partitioning [AH00, R03, DH04, I04]
  - Break the data into subsets; use a different plan for each subset
  - Generalizes intra-plan reordering in an SPJGU query
  - The only way to reduce overall computation with fine granularity
  - The only previous implementation has been eddies [AH00, R03, DH04]
- Adaptive information passing
  - Extends sideways information passing (magic sets) to an adaptive context, both intra- and inter-plan
  - Reduces the computation and space devoted to non-productive tuples
8. Eddies Combine Adaptive Scheduling and Data Partitioning Decisions
- Intuitively, each tuple gets its own query plan
  - Route each tuple to the next operator based on the speed and selectivity of each operator (see the sketch below)
  - Elegant and simple to implement
- But performing a join creates subresults at the next level!
  - Local greedy choices may result in state that needs to join with all future data!
- Consider the long-term effects of decisions before making them: separate CPU scheduling from plan selection
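To make the per-tuple routing idea concrete, here is a minimal Python sketch (not from any of the cited papers; all names are illustrative) of a greedy eddy-style policy that sends each tuple to the cheap operator most likely to eliminate it:

```python
class Operator:
    """A selection/join stub that tracks its own observed statistics."""
    def __init__(self, name, predicate, cost):
        self.name, self.predicate, self.cost = name, predicate, cost
        self.seen, self.passed = 0, 0

    def apply(self, t):
        self.seen += 1
        ok = self.predicate(t)
        self.passed += ok
        return ok

    def selectivity(self):
        return self.passed / self.seen if self.seen else 0.5  # prior guess

def route(t, pending):
    # Greedy policy: try the operator with the lowest selectivity * cost,
    # i.e., the one most likely to drop the tuple cheaply
    return min(pending, key=lambda op: op.selectivity() * op.cost)

def eddy(tuples, ops):
    for t in tuples:
        pending = list(ops)
        while pending:
            op = route(t, pending)
            if not op.apply(t):
                break              # tuple eliminated early
            pending.remove(op)
        else:
            yield t                # tuple satisfied every operator
```

As the observed selectivities drift, later tuples may visit the operators in a different order than earlier ones, which is exactly the per-tuple-plan behavior described above.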
9. Focusing Purely on Adaptive Data Partitioning
- Use adaptively scheduled operators to fill CPU cycles
- Now a query optimizer problem:
  - Choose a plan that minimizes long-term cost (in CPU cycles)
- To allow multiple plans, distribute union through join (and select, project, etc.)
- If R1 = R11 ∪ R12 and R2 = R21 ∪ R22, then
  R1 ⋈ R2 = (R11 ∪ R12) ⋈ (R21 ∪ R22)
          = (R11 ⋈ R21) ∪ (R12 ⋈ R22) ∪ (R11 ⋈ R22) ∪ (R12 ⋈ R21)
  [Figure: the four partition joins over R11, R12, R21, R22 covering R1 ⋈ R2]
- This generalizes to n-way joins and other SPJGU operators (a small sanity check follows)
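The identity can be sanity-checked directly; a small Python sketch over made-up keyed relations compares the full join against the union of the four partition joins:

```python
def join(r, s):
    # Equijoin on the first attribute; tuples are (key, payload)
    return {(k1, a, b) for (k1, a) in r for (k2, b) in s if k1 == k2}

R1 = {(1, "r1"), (2, "r2"), (3, "r3")}
R2 = {(1, "s1"), (3, "s3")}

# Partition arbitrarily: R1 = R11 ∪ R12, R2 = R21 ∪ R22
R11, R12 = {(1, "r1")}, {(2, "r2"), (3, "r3")}
R21, R22 = {(1, "s1")}, {(3, "s3")}

full = join(R1, R2)
pieces = join(R11, R21) | join(R12, R22) | join(R11, R22) | join(R12, R21)
assert full == pieces  # the four partition joins cover the whole join
```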
10. Adaptive Data Partitioning: Routing Data across Different Plans
[Figure: tuples from R, S, and T are routed either to the old plan over R ⋈ S ⋈ T or to a new plan over the remaining partitions R0, S0, T0]
- Options for combining across phases:
  - New results always injected into the old plan
  - Old results injected into the new plan
  - Wait until the end; run a "stitch-up" plan based on the best stats
11. Special Architectural Features for ADP
- A monitoring and re-optimization thread runs alongside execution
  - System-R-like optimizer with aggregation support; uses the most current selectivity estimates
  - Periodic monitoring and re-optimization revises selectivity estimates and recomputes expected costs
- Query execution with smart router operators
- Special support for efficient stitch-up plans
  - Uses intermediate results from previous plans (a special case of answering queries using views [H01])
  - A join-over-union ("stitch-up join") operator that excludes certain results
12. ADP Application 1: Correcting Cost Mis-estimates
- Goal: react to plans that are obviously bad
  - Don't spend cycles searching for a slightly better plan
  - Try to avoid paths that are unlikely to be promising
- Monitor/reoptimizer thread watches the cardinalities of subresults
  - Re-estimates plan cost and compares it to the projected costs of alternatives, using several techniques and heuristics (see paper)
  - Our experiments re-estimate every 1 sec.
- The smart router operator does the following (a sketch appears below):
  - Waits for the monitor/reoptimizer to suggest a replacement plan
  - Re-routes source data into the new plan
  - The new plan's output is unioned with the output of the previous plan; this is fed into any final aggregation operations
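A minimal sketch of this router behavior, assuming invented Plan/SmartRouter interfaces (the real system's operators are far richer):

```python
import threading

class Plan:
    """Stand-in for an executable plan; buffers its 'results'."""
    def __init__(self, name):
        self.name, self.buf = name, []
    def consume(self, t):
        self.buf.append((self.name, t))   # placeholder for real join work
    def output(self):
        return iter(self.buf)

class SmartRouter:
    def __init__(self, initial_plan):
        self.plan, self.retired = initial_plan, []
        self.lock = threading.Lock()

    def suggest_replacement(self, new_plan):
        # Called by the monitor/reoptimizer thread when re-estimated
        # costs favor another plan
        with self.lock:
            self.retired.append(self.plan)
            self.plan = new_plan

    def push(self, t):
        with self.lock:
            self.plan.consume(t)          # re-route source data

    def results(self):
        # Union of all plans' outputs, fed to any final aggregation
        for p in self.retired + [self.plan]:
            yield from p.output()
```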
13. Correcting for Unexpected Selectivities
[Experimental results; Pentium IV 3.06 GHz, Windows XP]
14. ADP Application 2: Optimizing for Order
- The most general ADP approach:
  - Pre-generate plans for the general case and for each interesting order
  - The smart router sends each tuple to the plan whose ordering constraint that tuple follows
  - But with multiple joins, MANY plans
- Instead, do ADP at the operator level: the complementary join pair
  - Does its own stitch-up internally
  - Easier to optimize for!
  - Can also do partial sorting at the router (priority queue); see the sketch after the figure
[Figure: routers over R and S feed in-order tuples to a merge join and out-of-order tuples to a hash join (hash tables h(R), h(S)), with priority queues Q buffering partially sorted runs]
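As a rough illustration of the router side only (the complementary join pair's internal stitch-up is not shown), here is a hedged Python sketch in which a bounded priority queue performs partial sorting, in-order tuples go to the merge-join side, and stragglers go to the hash-join side; all names are invented:

```python
import heapq

def route_by_order(tuples, key, window=1024):
    """Partially sort within a bounded window; emit tuples that still
    respect the sort order to the merge join, the rest to the hash join."""
    pq, last, to_merge, to_hash = [], None, [], []

    def emit(t):
        nonlocal last
        if last is None or key(t) >= last:
            to_merge.append(t)       # follows the ordering constraint
            last = key(t)
        else:
            to_hash.append(t)        # out of order: handled by hash join

    for i, t in enumerate(tuples):
        heapq.heappush(pq, (key(t), i, t))
        if len(pq) > window:         # the queue bounds the reordering
            emit(heapq.heappop(pq)[2])
    while pq:
        emit(heapq.heappop(pq)[2])
    return to_merge, to_hash
```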
15. Exploiting Partial Order in the Data
[Experimental results; Pentium IV 3.06 GHz, Windows XP; 1024-tuple queue]
16. ADP over Windows: Optimizing for Aggregation
- Group-by optimization [CS94]:
  - May be able to pre-aggregate some tuples before joining
  - Works because aggregates can be applied over a union of partial results
  - But once we insert pre-aggregation, we're stuck (and it's not pipelined)
- Our solution: adjustable-window pre-aggregation (sketched below)
  - Change the window size depending on how effectively we can aggregate
  - Also allows data to propagate through the plan: better info for adaptivity, early answers
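A minimal sketch of adjustable-window pre-aggregation (SUM only; the window-resizing policy here is an invented stand-in, not the paper's):

```python
from collections import defaultdict

def windowed_preagg(tuples, group_key, value, min_win=64, max_win=4096):
    """Pre-aggregate within a window, flushing partial sums downstream.
    Partial SUMs are safe because the aggregate distributes over union."""
    win, buf, n = min_win, defaultdict(float), 0
    for t in tuples:
        buf[group_key(t)] += value(t)
        n += 1
        if n >= win:
            # Grow the window when it collapses many tuples per group;
            # shrink it when pre-aggregation barely helps, so data keeps
            # flowing (better adaptivity info, earlier answers)
            win = min(max_win, win * 2) if len(buf) < n / 2 else max(min_win, win // 2)
            yield from buf.items()
            buf.clear()
            n = 0
    yield from buf.items()
```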
17. Pre-Aggregation Comparison
18. The State of the Union: Join and Agg
- Useless intermediate state is perhaps the biggest concern in ADP-based (or even plan-partitioning) approaches
  - It is very easy to create large intermediate state before switching away from a plan
  - This results in significant additional computation: "the burden of history" [DH04]
- It is also the major bottleneck in computing queries with correlated subqueries
  - We only want to compute the parts of a subquery that will contribute to final answers
  - The local-DB solution: magic sets rewritings [M90, CR91, MP94, S96]
19. Intuition behind Magic Sets Rewritings
- Observations:
  - Computing a subquery once for every iteration of the outer query is repetitive and inefficient
  - Computing the subquery in its entirety is also frequently inefficient
- So pass in information about specifically which tuples from the inner query might join with the outer query
  - A filter set: generally a projection of a portion of the outer query's results
  - Anything that joins with the parent block must join with the filter set
  - False positives are OK
20. Query with Magic Set [S96]

CREATE VIEW TotalSales(SellerID, Sales, ItemsSold) AS
  SELECT SL.SellerID, SUM(salePrice) AS Sales, COUNT(*) AS ItemsSold
  FROM SellerList SL, SaleItem S
  WHERE SL.SellerID = S.SellerID
  GROUP BY SL.SellerID

SELECT SellerID, Sales, ItemsSold
FROM TotalSales TS, Recommended REC, Ratings RAT
WHERE REC.SellerID = TS.SellerID
  AND RAT.SellerID = TS.SellerID
  AND RAT.Rating > 4 AND ItemsSold > 50
22. Magic in Data Integration
- In data integration:
  - It is difficult to determine when to do sideways information passing/magic in a cost-based way
  - Magic optimization destroys some potential parallelism: the outer block must be computed first
- Opportunities:
  - Pipelined hash joins give us complete state for every intermediate result
  - We use bushy trees
- Our idea: do information passing out-of-band
  - Consider a plan as if it's a relational calculus expression: every tuple must satisfy the constraints
  - The plan dataflow enforces this
  - But we can also pass information across the plan, outside the normal dataflow
[Figure: a bushy join tree over A, B, and C, with information passed between subplans outside the dataflow]
23. Adaptive Information Passing
- Cost-based strategy:
  - Execute all blocks in parallel (up to the maximum pipelineable size)
  - Whenever a subresult is completely computed, feed it elsewhere in the query plan as a filter set
    - Anywhere with a shared predicate is an eligible target
  - Use our ability to estimate the remaining cost of query execution to decide whether the semijoin will speed performance
  - Can always inject a more precise filter set (one that checks more predicates), or remove a filter set
    - A filter set is a performance/space optimization, not necessary for correctness
  - We use Bloom filters rather than hash tables (our VLDB05 submission has a detailed performance comparison); a sketch follows
- Also compared against a naive strategy that generates filter sets at every operator; when complete, they are used as filters by downstream ops
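Since false positives are acceptable (slide 19), a Bloom filter is a natural compact representation for the filter set. A self-contained Python sketch (the sizing parameters are arbitrary, not the paper's):

```python
import hashlib

class BloomFilter:
    def __init__(self, nbits=1 << 16, nhashes=4):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def semijoin_filter(tuples, key, bloom):
    # Apply a completed subresult's filter set at any operator that
    # shares the join predicate; removing it never affects correctness
    return (t for t in tuples if key(t) in bloom)
```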
24. Tuples Created: TPC-H, 1 GB (67% savings in Q2; also savings in Q5, not shown)
25. Adaptive QP in Summary
- A variety of different techniques, focusing on:
  - Scheduling
  - Comparison & competition
  - Data & plan partitioning
  - Information passing
- A field that is still fairly open; missing:
  - Effective exploration methods
  - A true theory!
    - What's possible? What kinds of queries make sense to adapt?
  - Guarantees of optimality and convergence (perhaps under certain assumptions)
26. Switching from Low-Level to High-Level
- We've talked about:
  - Query reformulation (composing queries with mappings)
  - Query optimization & execution
- But how did we ever get the mappings in the first place?
  - This is one of the most tedious tasks
- Answer: LSD (and not the kind that makes you high!)
- Slides courtesy of AnHai Doan, UIUC
27. Semantic Mappings between Schemas
- Mediated & source schemas = XML DTDs
[Figure: a mediated schema (house: address, num-baths, contact-info: agent-name, agent-phone) mapped to a source schema (house: location, full-baths, half-baths, contact: name, phone); address ↔ location is a 1-1 mapping, while num-baths ↔ full-baths + half-baths is a non-1-1 mapping]
28. The LSD (Learning Source Descriptions) Approach
- Suppose a user wants to integrate 100 data sources
- 1. The user:
  - manually creates mappings for a few sources, say 3
  - shows LSD these mappings
- 2. LSD learns from the mappings
  - Multi-strategy learning incorporates many types of info in a general way
  - Knowledge of constraints further helps
- 3. LSD proposes mappings for the remaining 97 sources
29. Example
- Mediated schema: address, price, agent-phone, description
- Schema of realestate.com: location, listed-price, phone, comments
- Learned hypotheses:
  - If "phone" occurs in the element name => agent-phone
  - If "fantastic" & "great" occur frequently in data values => description
- realestate.com data:
  - location: Miami, FL | Boston, MA | ...
  - listed-price: 250,000 | 110,000 | ...
  - phone: (305) 729 0831 | (617) 253 1429 | ...
  - comments: Fantastic house | Great location | ...
- homes.com data:
  - price: 550,000 | 320,000 | ...
  - contact-phone: (278) 345 7215 | (617) 335 2315 | ...
  - extra-info: Beautiful yard | Great beach | ...
30. LSD's Multi-Strategy Learning
- Use a set of base learners
  - Each exploits certain types of information well
- To match the schema elements of a new source:
  - Apply the base learners
  - Combine their predictions using a meta-learner
- Meta-learner
  - Uses the training sources to measure each base learner's accuracy
  - Weighs each learner based on its accuracy
31. Base Learners
- Input:
  - Schema information: name, proximity, structure, ...
  - Data information: value, format, ...
- Output:
  - Predictions weighted by confidence scores
- Examples:
  - Name learner:
    - agent-name => (name, 0.7), (phone, 0.3)
  - Naive Bayes learner:
    - "Kent, WA" => (address, 0.8), (name, 0.2)
    - "Great location" => (description, 0.9), (address, 0.1)
32. Training the Learners
- Mediated schema: address, price, agent-phone, description
- Schema of realestate.com: location, listed-price, phone, comments
- realestate.com data:
  - <location> Miami, FL </location> <listed-price> 250,000 </listed-price> <phone> (305) 729 0831 </phone> <comments> Fantastic house </comments>
  - <location> Boston, MA </location> <listed-price> 110,000 </listed-price> <phone> (617) 253 1429 </phone> <comments> Great location </comments>
- Name Learner training pairs:
  - (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
- Naive Bayes Learner training pairs:
  - ("Miami, FL", address), ("250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), ...
33. Applying the Learners
- Mediated schema: address, price, agent-phone, description
- Schema of homes.com: area, day-phone, extra-info
- area: <area>Seattle, WA</area>, <area>Kent, WA</area>, <area>Austin, TX</area>
  - Name Learner & Naive Bayes predictions per value: (address, 0.8), (description, 0.2); (address, 0.6), (description, 0.4); (address, 0.7), (description, 0.3)
  - Meta-Learner: (address, 0.7), (description, 0.3)
- day-phone: <day-phone>(278) 345 7215</day-phone>, <day-phone>(617) 335 2315</day-phone>, <day-phone>(512) 427 1115</day-phone>
  - Meta-Learner: (agent-phone, 0.9), (description, 0.1)
- extra-info: <extra-info>Beautiful yard</extra-info>, <extra-info>Great beach</extra-info>, <extra-info>Close to Seattle</extra-info>
  - Meta-Learner: (address, 0.6), (description, 0.4)
34. Domain Constraints
- Impose semantic regularities on sources
  - Verified using the schema or the data
- Examples:
  - a = address ∧ b = address ⇒ a ≠ b
  - a = house-id ⇒ a is a key
  - a = agent-info ∧ b = agent-name ⇒ b is nested in a
- Can be specified up front
  - When creating the mediated schema
  - Independent of any actual source schema
35. The Constraint Handler
- Domain constraint: a = address ∧ b = address ⇒ a ≠ b
- Predictions from the Meta-Learner:
  - area: (address, 0.7), (description, 0.3)
  - contact-phone: (agent-phone, 0.9), (description, 0.1)
  - extra-info: (address, 0.6), (description, 0.4)
- Candidate assignments are scored by multiplying confidences; those violating a constraint are pruned:
  - area = address, contact-phone = agent-phone, extra-info = address: 0.7 × 0.9 × 0.6 = 0.378 (two addresses: violates the constraint)
  - area = address, contact-phone = agent-phone, extra-info = description: 0.7 × 0.9 × 0.4 = 0.252 (best valid assignment)
  - area = description, contact-phone = description, extra-info = description: 0.3 × 0.1 × 0.4 = 0.012
- Can specify arbitrary constraints
  - User feedback as a domain constraint: ad-id = house-id
- Extended to handle domain heuristics
  - a = agent-phone ∧ b = agent-name ⇒ a & b are usually close to each other
- A minimal sketch of this search appears below
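A minimal Python sketch of such a constraint handler, reproducing the arithmetic above (the brute-force enumeration is purely illustrative; LSD's actual search is more sophisticated):

```python
from itertools import product

def best_mapping(predictions, constraints):
    """Score each full assignment by multiplying confidences; prune any
    assignment that violates a domain constraint."""
    elems = list(predictions)
    best, best_score = None, 0.0
    for choice in product(*(predictions[e] for e in elems)):
        assign = {e: label for e, (label, _) in zip(elems, choice)}
        if not all(ok(assign) for ok in constraints):
            continue                      # e.g., two elements -> address
        score = 1.0
        for _, conf in choice:
            score *= conf
        if score > best_score:
            best, best_score = assign, score
    return best, best_score

preds = {"area":          [("address", 0.7), ("description", 0.3)],
         "contact-phone": [("agent-phone", 0.9), ("description", 0.1)],
         "extra-info":    [("address", 0.6), ("description", 0.4)]}
# a = address and b = address => a != b: at most one element maps to address
one_address = lambda a: sum(v == "address" for v in a.values()) <= 1
print(best_mapping(preds, [one_address]))
# -> area = address, contact-phone = agent-phone,
#    extra-info = description, with score ≈ 0.252
```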
36. Putting It All Together: The LSD System
[Figure: in the training phase, the mediated schema, source schemas, data listings, and training data feed base learners L1, L2, ..., Lk; in the matching phase, their predictions go through Mapping Combination and the Constraint Handler, which uses the domain constraints and user feedback]
- Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
- Meta-learner
  - Uses stacking [Ting & Witten 99, Wolpert 92]
  - Returns a linear weighted combination of the base learners' predictions (sketched below)
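A toy Python sketch of the stacking combination; the weights here are made up, whereas LSD learns them from the manually mapped training sources:

```python
def meta_predict(base_predictions, weights):
    """Linear weighted combination of base learners' confidence scores."""
    combined = {}
    for learner, preds in base_predictions.items():
        for label, conf in preds:
            combined[label] = combined.get(label, 0.0) + weights[learner] * conf
    total = sum(combined.values()) or 1.0   # renormalize to confidences
    return sorted(((lbl, c / total) for lbl, c in combined.items()),
                  key=lambda lc: -lc[1])

base = {"name":        [("address", 0.5), ("description", 0.5)],
        "naive_bayes": [("address", 0.8), ("description", 0.2)]}
print(meta_predict(base, {"name": 0.3, "naive_bayes": 0.7}))
# -> [('address', ≈0.71), ('description', ≈0.29)]
```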
37. Empirical Evaluation
- Four domains:
  - Real Estate I & II, Course Offerings, Faculty Listings
- For each domain:
  - Create a mediated DTD & domain constraints
  - Choose five sources
  - Extract & convert data listings into XML
  - Mediated DTDs: 14-66 elements; source DTDs: 13-48
- Ten runs for each experiment; in each run:
  - Manually provide 1-1 mappings for 3 sources
  - Ask LSD to propose mappings for the remaining 2 sources
  - Accuracy = % of 1-1 mappings correctly identified
38. LSD Matching Accuracy
[Chart: average matching accuracy (%)]
- LSD's accuracy: 71-92%
- Best single base learner: 42-72%
- Adding the meta-learner: +5-22%
- Adding the constraint handler: +7-13%
- Adding the XML learner: +0.8-6%
39. LSD Summary
- Applies machine learning to schema matching
  - Uses multi-strategy learning
  - Domain & user-specified constraints
- Probably the most flexible means of doing semi-automated schema matching today
- Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings
40. Jumping Up a Level
- We've now seen how distributed data makes a huge difference:
  - In heterogeneity and the need for relating different kinds of attributes
    - Mapping languages
    - Mapping tools
    - Query reformulation
  - ... and in query processing
    - Adaptive query processing