Adaptively Processing Remote Data and Learning Source Mappings
Transcript and Presenter's Notes

Title: Adaptively Processing Remote Data and Learning Source Mappings


1
Adaptively Processing Remote Data and Learning Source Mappings
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 650 Database Information Systems
  • March 14, 2005

LSD Slides courtesy AnHai Doan
2
Administrivia
  • Midterm due 3/16
  • 5-10 pages (single-spaced, 10-12 pt)
  • If you haven't told me which topic, please do so now!

3
Today's Trivia Question
4
Many Motivations for Adaptive Query Processing
  • Many domains where cost-based query optimization fails:
  • Complex queries in traditional databases: estimation error grows exponentially with joins [IC91]; the focus of [KD98, M04]
  • Querying over the Internet: unpredictable access rates, delays
  • Querying external data sources: limited information available about the properties of the source
  • Solution: monitor real-world conditions and adapt the processing strategy in response

5
Generalizing Adaptive Query Processing
  • We've seen a range of different adaptive techniques
  • How do they fit together?
  • Can we choose points between eddies and mid-query
    re-optimization? Can we exploit other kinds of
    query optimization tricks?

6
Popular Types of Adaptive Query Processing
  • Adaptive scheduling (query scrambling [UF98]; dynamic rescheduling [UF01]; pipelined hash join [UF00, RS86, I99, HH99])
  • Changes CPU scheduling to improve feedback or reduce delays
  • Can't reduce total work
  • Redundant computation (competitive execution [AZ96])
  • Compare two ways of executing the query
  • Need to identify a few promising plans
  • Plan partitioning (INGRES [S76]; mid-query re-optimization [KD98, I99, M04])
  • Break the plan into stages; re-optimize future stages as necessary
  • Coarse granularity, breaks pipelining
  • Are these the only options?

7
Two More Forms of Adaptivity
  • Adaptive data partitioning [AH00, R03, DH04, I04]
  • Break the data into subsets; use a different plan for each subset
  • Generalizes intra-plan reordering in an SPJGU query
  • The only way to reduce overall computation with fine granularity
  • The only previous implementation has been eddies [AH00, R03, DH04]
  • Adaptive information passing
  • Extends sideways information passing (magic sets) to an adaptive context, both intra- and inter-plan
  • Reduces computation and space devoted to non-productive tuples

8
Eddies Combine Adaptive Scheduling and Data Partitioning Decisions
  • Intuitively, each tuple gets its own query plan
  • Route to the next operator based on the speed and selectivity of each operator
  • Elegant and simple to implement
  • But performing a join creates subresults at the next level!
  • Local greedy choices may result in state that needs to join with all future data!
  • Consider the long-term effects of decisions before making them; separate CPU scheduling from plan selection
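As a toy illustration of the per-tuple routing idea (ignoring operator cost and the lottery-scheduling details of real eddies; all names here are made up), a router can order selection operators by their observed pass rate so the most selective predicates run first:

```python
# Toy sketch of eddy-style per-tuple routing over selection operators.
# Each operator tracks its observed pass rate; the router applies the
# operators in ascending order of that rate, so tuples that will be
# eliminated tend to die early. Real eddies also weigh operator cost.

class Op:
    def __init__(self, name, pred):
        self.name, self.pred = name, pred
        self.seen, self.passed = 0, 0      # running statistics

    def selectivity(self):
        return self.passed / self.seen if self.seen else 0.5  # optimistic prior

    def __call__(self, tup):
        self.seen += 1
        ok = self.pred(tup)
        self.passed += ok
        return ok

def route(tup, ops):
    """Apply operators in ascending order of estimated selectivity."""
    for op in sorted(ops, key=Op.selectivity):
        if not op(tup):
            return False                   # tuple eliminated; stop work on it
    return True

ops = [Op("a<5", lambda t: t["a"] < 5), Op("b=1", lambda t: t["b"] == 1)]
survivors = [t for t in [{"a": 1, "b": 1}, {"a": 9, "b": 1}] if route(t, ops)]
```

Because the statistics update continuously, the per-tuple ordering adapts as the data distribution shifts, which is the essence of the eddy idea.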

9
Focusing Purely on Adaptive Data Partitioning
  • Use adaptively scheduled operators to fill CPU cycles
  • Now a query optimizer problem: choose a plan that minimizes long-term cost (in CPU cycles)
  • To allow multiple plans, distribute union through join (and select, project, etc.)
  • If R1 = R11 ∪ R12 and R2 = R21 ∪ R22, then
    R1 ⋈ R2 = (R11 ∪ R12) ⋈ (R21 ∪ R22)
            = (R11 ⋈ R21) ∪ (R12 ⋈ R22) ∪ (R11 ⋈ R22) ∪ (R12 ⋈ R21)

This generalizes to n-way joins and other SPJGU operators.
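The distribution identity above can be sanity-checked with a small sketch (hypothetical relations represented as sets of (key, value) tuples, joined on the key):

```python
# Check that distributing union through join preserves the result:
# joining partitioned inputs pairwise and unioning equals joining
# the full relations.

def join(r, s):
    """Natural join on the first attribute."""
    return {(k, x, y) for (k, x) in r for (k2, y) in s if k == k2}

R11 = {(1, "a")}; R12 = {(2, "b")}
R21 = {(1, "A")}; R22 = {(2, "B"), (1, "C")}
R1, R2 = R11 | R12, R21 | R22

direct = join(R1, R2)
partitioned = join(R11, R21) | join(R12, R22) | join(R11, R22) | join(R12, R21)
assert direct == partitioned
```

The assertion holds for any partitioning of R1 and R2, which is what lets different plans process different subsets of the data.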
10
Adaptive Data Partitioning: Routing Data across Different Plans
[Diagram: tuples from R, S, and T are routed between an old and a new plan for R ⋈ S ⋈ T, leaving intermediate state such as R0 ⋈ S0 behind in the old plan]
  • Options for combining across phases:
  • New results always injected into the old plan
  • Old results injected into the new plan
  • Wait until the end; a "stitch-up" plan based on the best statistics
11
Special Architectural Features for ADP
  • A monitoring and re-optimization thread runs alongside execution
  • System-R-like optimizer with aggregation support; uses the most current selectivity estimates
  • Periodic monitoring and re-optimization revises selectivity estimates and recomputes expected costs
  • Query execution with smart router operators
  • Special support for efficient stitch-up plans
  • Uses intermediate results from previous plans (a special case of answering queries using views [H01])
  • Join-over-union ("stitch-up-join") operator that excludes certain results

12
ADP Application 1: Correcting Cost Mis-estimates
  • Goal: react to plans that are obviously bad
  • Don't spend cycles searching for a slightly better plan
  • Try to avoid paths that are unlikely to be promising
  • The monitor/reoptimizer thread watches cardinalities of subresults
  • Re-estimates plan cost and compares it to the projected costs of alternatives, using several techniques and heuristics (see paper)
  • Our experiments re-estimate every 1 sec.
  • The smart router operator does the following:
  • Waits for the monitor/reoptimizer to suggest a replacement plan
  • Re-routes source data into the new plan
  • The new plan's output is unioned with the output of the previous plan; this is fed into any final aggregation operations
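The router's switch-and-union behavior can be sketched minimally (the class name, the plan representation as a plain function, and the data are all invented for illustration):

```python
# Minimal sketch of a smart router: source tuples flow into whichever
# plan is currently active; when the monitor/reoptimizer suggests a
# replacement, later tuples are re-routed, and the outputs of the old
# and new plans are unioned into a single output stream.

class SmartRouter:
    def __init__(self, plan):
        self.plan = plan          # current plan: a function tuple -> result
        self.outputs = []         # union of all plans' outputs

    def swap(self, new_plan):
        """Called by the monitor/reoptimizer when a better plan is found."""
        self.plan = new_plan

    def push(self, tup):
        self.outputs.append(self.plan(tup))

router = SmartRouter(lambda t: ("old", t))
router.push(1)
router.swap(lambda t: ("new", t))   # mid-stream plan switch
router.push(2)
```

In the real system the swap is triggered asynchronously by the monitoring thread, and a stitch-up plan combines the partial results; here the union is just the shared output list.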

13
Correcting for Unexpected Selectivities
[Chart: Pentium IV 3.06 GHz, Windows XP]
14
ADP Application 2: Optimizing for Order
  • Most general ADP approach:
  • Pre-generate plans for the general case and for each interesting order
  • A smart router sends each tuple to the plan whose ordering constraint the tuple follows
  • But with multiple joins, there are MANY plans
  • Instead, do ADP at the operator level
  • Complementary join pair:
  • Does its own stitch-up internally
  • Easier to optimize for!
  • Can also do partial sorting at the router (priority queue)

[Diagram: routers partition R and S between a merge join and a hash join (h(R), h(S)), with priority queues (Q) feeding each operator]
15
Exploiting Partial Order in the Data
[Chart: Pentium IV 3.06 GHz, Windows XP; 1024 tuples]
16
ADP over Windows: Optimizing for Aggregation
  • Group-by optimization [CS94]:
  • May be able to pre-aggregate some tuples before joining
  • Why? Aggregates can be applied over union
  • But once we insert pre-aggregation, we're stuck (and it's not pipelined)
  • Our solution:
  • Adjustable-window pre-aggregation
  • Change the window size depending on how effectively we can aggregate
  • Also allows data to propagate through the plan: better info for adaptivity, early answers
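A rough sketch of adjustable-window pre-aggregation for a SUM (the window sizes, growth factors, and adjustment rule below are assumptions for illustration, not the paper's exact policy):

```python
# Sketch of adjustable-window pre-aggregation: buffer up to `window`
# tuples, partially aggregate by key (partial SUM), emit the partials,
# then grow or shrink the window depending on how well the buffer
# compressed. Small windows keep the pipeline flowing; large windows
# aggregate more when many duplicates share a key.

def preaggregate(stream, window=4, grow=2.0, shrink=0.5):
    buf, out, count = {}, [], 0
    for key, val in stream:
        buf[key] = buf.get(key, 0) + val      # partial SUM per group
        count += 1
        if count >= window:
            ratio = len(buf) / count          # groups per tuple buffered
            out.extend(buf.items())           # emit partial aggregates
            # good compression (few groups) -> grow window; poor -> shrink
            window = max(1, int(window * (grow if ratio < 0.5 else shrink)))
            buf, count = {}, 0
    out.extend(buf.items())                   # flush the final window
    return out

result = preaggregate([("a", 1), ("a", 2), ("b", 3), ("a", 4)], window=4)
```

Downstream, a final aggregation over the union of these partials yields the correct totals, which is exactly the aggregate-over-union property the slide relies on.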

17
Pre-Aggregation Comparison
18
The State of the Union: Join and Agg
  • Useless intermediate state is perhaps the biggest concern in ADP-based (or even plan-partitioning) approaches
  • It is very easy to create large intermediate state before switching from a plan
  • This results in significant additional computation: the "burden of history" [DH04]
  • Also the major bottleneck in computing queries with correlated subqueries
  • We only want to compute the parts of a subquery that will contribute to final answers
  • Local DB solution: magic-sets rewritings [M90, CR91, MP94, S96]

19
Intuition behind Magic Sets Rewritings
  • Observations:
  • Computing a subquery once for every iteration of the outer query is repetitive and inefficient
  • Computing the subquery in its entirety is also frequently inefficient
  • So pass in information about which tuples from the inner query might join with the outer query
  • A "filter set": generally a projection of a portion of the outer query results
  • Anything that joins with the parent block must join with the filter set
  • False positives are OK
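The filter-set intuition fits in a few lines (the seller data below is made up for illustration):

```python
# Filter-set idea: restrict the inner (subquery) relation to values that
# can possibly join with the outer block. False positives are harmless
# because the final join still checks the real predicate; the filter
# only prunes work.

outer = [("s1", 5), ("s2", 5)]                   # e.g. qualifying sellers
inner = [("s1", 100), ("s3", 40), ("s2", 70)]    # e.g. all sale items

filter_set = {sid for (sid, _) in outer}         # projection of the outer block
restricted = [row for row in inner if row[0] in filter_set]
```

This is a semijoin: the inner block is evaluated only over tuples that share a join key with the outer block, which is exactly what a magic-sets rewriting arranges inside the query.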

20
Query with Magic Set
CREATE VIEW TotalSales(SellerID, Sales, ItemsSold) AS
  SELECT SellerID, sum(salePrice) AS Sales, count(*) AS ItemsSold
  FROM SellerList SL, SaleItem S
  WHERE SL.SellerID = S.SellerID
  GROUP BY SL.SellerID

SELECT SellerID, Sales, ItemsSold
FROM TotalSales TS, Recommended REC, Ratings RAT
WHERE REC.SellerID = TS.SellerID
  AND RAT.SellerID = TS.SellerID
  AND RAT.Rating > 4 AND ItemsSold > 50
21
Query with Magic Set [S96]
CREATE VIEW TotalSales(SellerID, Sales, ItemsSold) AS
  SELECT SellerID, sum(salePrice) AS Sales, count(*) AS ItemsSold
  FROM SellerList SL, SaleItem S
  WHERE SL.SellerID = S.SellerID
  GROUP BY SL.SellerID

SELECT SellerID, Sales, ItemsSold
FROM TotalSales TS, Recommended REC, Ratings RAT
WHERE REC.SellerID = TS.SellerID
  AND RAT.SellerID = TS.SellerID
  AND RAT.Rating > 4 AND ItemsSold > 50
22
Magic in Data Integration
  • In data integration:
  • It is difficult to determine when to do sideways information passing/magic in a cost-based way
  • Magic optimization destroys some potential parallelism: the outer block must be computed first
  • Opportunities:
  • Pipelined hash joins give us complete state for every intermediate result
  • We use bushy trees
  • Our idea: do information passing out-of-band
  • Consider a plan as if it's a relational calculus expression: every tuple must satisfy constraints
  • The plan dataflow enforces this
  • But we can also pass information across the plan outside the normal dataflow

[Diagram: a bushy join tree over A, B, and C, with information passed across the plan outside the join dataflow]
23
Adaptive Information Passing
  • Cost-based strategy:
  • Execute all blocks in parallel (up to the maximum pipelineable size)
  • Whenever a subresult is completely computed, feed it elsewhere in the query plan as a filter set
  • Anywhere with a shared predicate is an eligible target
  • Use our ability to estimate the remaining cost of query execution to see if the semijoin will speed performance
  • Can always inject a more precise filter set (one that checks more predicates), or remove a filter set
  • A filter set is a performance/space optimization, not necessary for correctness
  • We use Bloom filters rather than hash tables (our VLDB05 submission has a detailed performance comparison)
  • Also compared against a naïve strategy that generates filter sets at every operator; when complete, they are used as filters by downstream operators
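A minimal Bloom filter usable as a filter set (the bit-array size, hash count, and hashing scheme below are arbitrary choices for illustration, not the paper's):

```python
# Minimal Bloom filter: membership tests may yield false positives but
# never false negatives. That asymmetry is exactly what a filter set
# needs -- a false positive only means some useless work survives, while
# every truly joining tuple is guaranteed to pass.

import hashlib

class Bloom:
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, 0   # m-bit array stored as an int

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = Bloom()
for seller_id in ["s1", "s2"]:
    bf.add(seller_id)
```

Compared to a hash table of the full filter set, the Bloom filter costs a fixed, small amount of space regardless of cardinality, at the price of occasionally letting a non-joining tuple through.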

24
Tuples Created: TPC-H, 1 GB (67% savings in Q2; also savings in Q5, not shown)
25
Adaptive QP in Summary
  • A variety of different techniques, focusing on:
  • Scheduling
  • Comparison / competition
  • Data & plan partitioning
  • Information passing
  • A field that is still fairly open; missing:
  • Effective exploration methods
  • A true theory!
  • What's possible? What kinds of queries make sense to adapt?
  • Guarantees of optimality and convergence (perhaps under certain assumptions)

26
Switching from Low-Level to High-Level
  • We've talked about:
  • Query reformulation (composing queries with mappings)
  • Query optimization & execution
  • But how did we ever get the mappings in the first place?
  • This is one of the most tedious tasks
  • Answer: LSD (and not the kind that makes you high!)
  • Slides courtesy of AnHai Doan, UIUC

27
Semantic Mappings between Schemas
  • Mediated & source schemas: XML DTDs

[Diagram: mediated schema house(address, num-baths, contact-info(agent-name, agent-phone)); source schema house(location, full-baths, half-baths, contact(name, phone)). address ↔ location is a 1-1 mapping; num-baths ↔ {full-baths, half-baths} is a non-1-1 mapping]
28
The LSD (Learning Source Descriptions) Approach
  • Suppose a user wants to integrate 100 data sources
  • 1. The user:
  • manually creates mappings for a few sources, say 3
  • shows LSD these mappings
  • 2. LSD learns from the mappings
  • Multi-strategy learning incorporates many types of info in a general way
  • Knowledge of constraints further helps
  • 3. LSD proposes mappings for the remaining 97 sources

29
Example
Mediated schema: address, price, agent-phone, description

Learned hypotheses:
  • If "phone" occurs in the element name ⇒ agent-phone
  • If "fantastic" and "great" occur frequently in data values ⇒ description

Schema of realestate.com: location, listed-price, phone, comments
  • listed-price: 250,000; 110,000; ...
  • location: Miami, FL; Boston, MA; ...
  • phone: (305) 729 0831; (617) 253 1429; ...
  • comments: Fantastic house; Great location; ...

Schema of homes.com:
  • price: 550,000; 320,000; ...
  • contact-phone: (278) 345 7215; (617) 335 2315; ...
  • extra-info: Beautiful yard; Great beach; ...
30
LSD's Multi-Strategy Learning
  • Use a set of base learners:
  • each exploits certain types of information well
  • To match schema elements of a new source:
  • apply the base learners
  • combine their predictions using a meta-learner
  • Meta-learner:
  • uses training sources to measure each base learner's accuracy
  • weighs each learner based on its accuracy
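The accuracy-weighted combination can be sketched as follows (the learner names, weights, and predictions are invented; LSD's actual stacking learns the weights per mediated-schema element):

```python
# Sketch of a meta-learner as a linear weighted combination: each base
# learner emits (label -> confidence) predictions, and each learner's
# vote is weighted by its measured accuracy on the training sources.

def combine(predictions, weights):
    """Return the label with the highest accuracy-weighted score."""
    scores, total = {}, sum(weights.values())
    for learner, pred in predictions.items():
        for label, conf in pred.items():
            scores[label] = scores.get(label, 0) + weights[learner] * conf / total
    return max(scores, key=scores.get)

preds = {"name":  {"address": 0.8, "description": 0.2},
         "bayes": {"address": 0.6, "description": 0.4}}
accuracies = {"name": 0.9, "bayes": 0.6}   # measured on training sources
best = combine(preds, accuracies)
```

A more accurate base learner thus pulls the combined prediction toward its own choice, which is the point of measuring accuracy on the manually mapped training sources first.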

31
Base Learners
  • Input:
  • schema information: name, proximity, structure, ...
  • data information: value, format, ...
  • Output:
  • predictions weighted by confidence scores
  • Examples:
  • Name learner: agent-name ⇒ (name, 0.7), (phone, 0.3)
  • Naive Bayes learner: "Kent, WA" ⇒ (address, 0.8), (name, 0.2); "Great location" ⇒ (description, 0.9), (address, 0.1)
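A toy Naive Bayes value learner in the spirit of the examples above (the training data and the Laplace-smoothed, per-label vocabulary are illustrative simplifications, not LSD's exact learner):

```python
# Toy Naive Bayes value learner: classify a data value by the word
# frequencies seen in training values for each mediated-schema label.

import math
from collections import Counter

def train(examples):
    """examples: list of (text, label) pairs."""
    counts, docs = {}, Counter()
    for text, label in examples:
        docs[label] += 1
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts, docs

def classify(text, counts, docs):
    total_docs = sum(docs.values())
    best, best_lp = None, -math.inf
    for label, words in counts.items():
        n = sum(words.values())
        lp = math.log(docs[label] / total_docs)          # label prior
        for w in text.lower().split():
            lp += math.log((words[w] + 1) / (n + len(words)))  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

counts, docs = train([("Miami, FL", "address"), ("Boston, MA", "address"),
                      ("Fantastic house", "description"),
                      ("Great location", "description")])
```

For instance, a value containing "Great" leans toward description while one shaped like "City, ST" leans toward address, mirroring the learned hypotheses on the earlier slide.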

32
Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

Name Learner training pairs:
  (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...

Data from realestate.com:
  <location> Miami, FL </> <listed-price> 250,000 </> <phone> (305) 729 0831 </> <comments> Fantastic house </>
  <location> Boston, MA </> <listed-price> 110,000 </> <phone> (617) 253 1429 </> <comments> Great location </>

Naive Bayes Learner training pairs:
  ("Miami, FL", address), ("250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), ...
33
Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info

<area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>
  Name Learner & Naive Bayes: (address, 0.8), (description, 0.2); (address, 0.6), (description, 0.4); (address, 0.7), (description, 0.3)
  Meta-Learner: (address, 0.7), (description, 0.3)

<day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
  Meta-Learner: (agent-phone, 0.9), (description, 0.1)

<extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
  Meta-Learner: (address, 0.6), (description, 0.4)
34
Domain Constraints
  • Impose semantic regularities on sources
  • verified using schema or data
  • Examples:
  • a = address ∧ b = address ⇒ a ≠ b
  • a = house-id ⇒ a is a key
  • a = agent-info ∧ b = agent-name ⇒ b is nested in a
  • Can be specified up front:
  • when creating the mediated schema
  • independent of any actual source schema
35
The Constraint Handler
Domain constraint: a = address ∧ b = address ⇒ a ≠ b

Predictions from the Meta-Learner:
  area: (address, 0.7), (description, 0.3)
  contact-phone: (agent-phone, 0.9), (description, 0.1)
  extra-info: (address, 0.6), (description, 0.4)

Candidate assignments and scores:
  area = address, contact-phone = agent-phone, extra-info = address: 0.7 × 0.9 × 0.6 = 0.378 (violates the constraint)
  area = address, contact-phone = agent-phone, extra-info = description: 0.7 × 0.9 × 0.4 = 0.252
  area = description, contact-phone = description, extra-info = description: 0.3 × 0.1 × 0.4 = 0.012

  • Can specify arbitrary constraints
  • User feedback = domain constraint:
  • ad-id = house-id
  • Extended to handle domain heuristics:
  • a = agent-phone ∧ b = agent-name ⇒ a and b are usually close to each other
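The handler's score-and-prune step, mirroring the numbers above, can be sketched as (the exhaustive enumeration over joint assignments is a simplification; LSD searches this space more cleverly):

```python
# Sketch of the constraint handler: enumerate joint label assignments,
# score each by the product of per-element confidences, drop assignments
# that violate a domain constraint, and keep the best survivor.

from itertools import product

preds = {"area":          {"address": 0.7, "description": 0.3},
         "contact-phone": {"agent-phone": 0.9, "description": 0.1},
         "extra-info":    {"address": 0.6, "description": 0.4}}

def at_most_one_address(assign):
    """Domain constraint: two elements cannot both map to address."""
    return sum(lbl == "address" for lbl in assign.values()) <= 1

elems = list(preds)
best, best_score = None, 0.0
for labels in product(*(preds[e] for e in elems)):
    assign = dict(zip(elems, labels))
    score = 1.0
    for e in elems:
        score *= preds[e][assign[e]]
    if at_most_one_address(assign) and score > best_score:
        best, best_score = assign, score
```

The raw best-scoring assignment (0.378) maps both area and extra-info to address and is pruned, so the handler returns the 0.252 assignment, exactly as on the slide.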

36
Putting It All Together: The LSD System
[Diagram: Training Phase — the mediated schema, source schemas, and data listings yield training data for base learners L1, L2, ..., Lk; Matching Phase — base learner predictions are combined by the meta-learner, then the constraint handler applies domain constraints and user feedback to produce the mapping combination]
  • Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
  • Meta-learner:
  • uses stacking [Ting & Witten 99; Wolpert 92]
  • returns a linear weighted combination of the base learners' predictions

37
Empirical Evaluation
  • Four domains:
  • Real Estate I & II, Course Offerings, Faculty Listings
  • For each domain:
  • create a mediated DTD and domain constraints
  • choose five sources
  • extract and convert data listings into XML
  • mediated DTDs: 14-66 elements; source DTDs: 13-48
  • Ten runs for each experiment; in each run:
  • manually provide 1-1 mappings for 3 sources
  • ask LSD to propose mappings for the remaining 2 sources
  • measure accuracy: percentage of 1-1 mappings correctly identified

38
LSD Matching Accuracy
Average matching accuracy (%):
  • LSD's accuracy: 71-92%
  • Best single base learner: 42-72%
  • Meta-learner contributes: 5-22%
  • Constraint handler contributes: 7-13%
  • XML learner contributes: 0.8-6%
39
LSD Summary
  • Applies machine learning to schema matching
  • use of multi-strategy learning
  • domain- and user-specified constraints
  • Probably the most flexible means of doing schema matching today in a semi-automated way
  • Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings

40
Jumping Up a Level
  • We've now seen how distributed data makes a huge difference:
  • In heterogeneity and the need for relating different kinds of attributes:
  • Mapping languages
  • Mapping tools
  • Query reformulation
  • ... and in query processing:
  • Adaptive query processing