Title: Adapting to Source Properties in Processing Data Integration Queries
1Adapting to Source Properties in Processing Data
Integration Queries
Computer Science Faculty, System Architecture,
Database Technology Group
Oberseminar 09.05.2007 Matthias Böhm
2Gliederung
- Current Work
- Project "DIPBench"
- (Data Intensive Integration Process Benchmark)
- Project "GCIP"
- (Generation of Complex Integration Processes)
- Model-driven developement of integration
processes - Inter-System optimization of integration
processes - Intra-System optimization of integration
processes - Papersession
- "Adapting to Source Properties in Processing Data
Integration Queries " - Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld
- SIGMOD 2004
3Project "DIPBench"
- DWH scenario
- Schema definitions
- Integration process types
- Execution schedule
- scale factors
- datasize
- time
- Execution tools
- Initializer, Client, Monitor
- Reference implementationexperiments
- VDBMS
- EAI-Server
4Project "GignoMDA-GCIP"
- Model-driven developement of integration
processes - MDA for EII and EAI
- Inter-System Process Optimization
- Selection of optimal integration system for a set
of integration processes - Intra-System Process Optimization
- adaptive optimization of A-PSM (MTM)
representations - rule-based process rewriting
- workload-based process rewriting
- using generic representations for workload,
statistics and hints
5Papersession 7 Related Work
- Pipelined Hash Join (aka Double-Pipelined Join)
- symmetric
- non-blocking
- memory-consuming
- foreach arrivingtupel (P1 P2)
- 1) Probing
- 2) Building
Joinfidflight
P1
P2
6Papersession 7 Related Work
- Query Scrambling 27
- addresses minimization of delays, while accessing
remote data - response-time optimization
Normal Execution
Sources delayed?
- Query Scrambling
- Phase 1 Rescheduling
- Phase 2 Operator Synthesis
- Pair
- Include Delayed (IN)
- Estimated Delay (ED)
Sources still delayed?
Sources responded?
7Papersession 7 Related Work
- Bushy-Tree Enumeration
- addresses the parallel computation of joins
- exploit the network delay, ...
- enumeration algorithm determines the join-order
Sequenced-Tree Enumeration(Left- /
Right-Deep-Tree)
Bushy-Tree Enumeration
Thread 3
D
Thread 1
Thread 2
C
B
A
B
C
D
A
8Papersession 7 Related Work
- Dynamic Rescheduling 25
- Rate-based Pipeline Scheduling
- Importance-based Tupel Regulation (SIT Selective
Input Processing) - Plan Partitioning Techniques
- Choose Nodes 6
- Dynamic plans
- Materialization points / Choose-Plan operator
- Mid-Query Optimization 19
- query plans annotated with statistics
- Collection of statistics / query plan
modification during runtime - Inter-Query-Adaptive Techniques
- LEO DB2 Learning Optimizer 23
- Adaptive Selectivity Estimation 5
- SIT Statistics on query expressions 3
9Papersession 2 Adaptive Data Partitioning
- Algebraic key property
- distribute relational UNION through PSJ
operations - If R1 R11 ? R12, (vertical partitioning)
- R2 R21 ? R22
-
- then R1 ? R2 (R11 ? R12) ? (R21 ? R22)
- (R11 ? R21)
- ? (R12 ? R22)
- ? (R11 ? R22)
- ? (R12 ? R21)
-
Union over all join combinations of partitions
R11
R21
R1
R2
R12
R22
10Papersession 5 Exploiting Order
Stichup-phase
Joinfidflight, Sorted-Merge Join
Joinfidflight, Pipelined Hash Join
no
no
yes
yes
Is join-attribute in sequence?
11Papersession 6 Pre-Aggregation
- Adjustable-window pre-aggregation operator
- pipelined pre-aggregation
- potential paralism
- adapting join selectivities
- if current window is effective, increase next
window size - if current window is not effective, reduce next
window size (until 1) - used before
- Join
- Final-Aggregation
Joinflight fk_flight
PRE-AGGfk_flight, windowsize
...