Distributed Query Processing - PowerPoint PPT Presentation

1 / 24

About This Presentation

Title:

Distributed Query Processing

Description:

response time, throughput, first answers, little IO, ... connections, cursors, data, ... Wrappers map queries into local dialect ... – PowerPoint PPT presentation

Number of Views:67

Avg rating:3.0/5.0

Slides: 25

Provided by: donal211

Category:

more less

Transcript and Presenter's Notes

Title: Distributed Query Processing

1
Distributed Query Processing

Donald Kossmann
University of Heidelberg
kossmann_at_informatik.uni-heidelberg.de

2
Agenda

Query Processing 101
centralized query processing
distributed query processing
Middleware
SQL and XML data integration
The Role of Web Services

3
Problem Statement

Input Query
How many times has the moon circled around the
earth in the last twenty years?
Output Answer
240!
Objectives
response time, throughput, first answers, little
IO, ...
Centralized vs. Distributed Query Processing
same problem
but, different parameters and objectives

4
Query Processing 101

Input Declarative Query
SQL, OQL, XQuery, ...
Step 1 Translate Query into Algebra
Tree of operators
Step 2 Optimize Query (physical and logical)
Tree of operators
(Compilation)
Step 3 Interpretation
Query result

5
Algebra
A.d
SELECT A.d FROM A, B WHERE A.a B.b
AND A.c 35
A.a B.b, A.c 35
X
A
B

relational algebra for SQL very well understood
algebra for OQL fairly well understood
algebra for XQuery (work in progress)

6
Query Optimization
A.d
A.d
A.a B.b, A.c 35
hashjoin
X
B.b
A
B
index A.c
B

no brainers (e.g., push down cheap predicates)
enumerate alternative plans, apply cost model
use search heuristics to find cheapest plan

7
Query Execution
John
A.d
(John, 35, CS)
hashjoin
(CS) (AS)
(John, 35, CS) (Mary, 35, EE)
B.b
(Edinburgh, CS,5.0) (Edinburgh, AS, 6.0)
index A.c
B

library of operators (hash join, merge join, ...)
pipelining (iterator model)
lazy evaluation
exploit indexes and clustering in database

8
Summary Centralized Queries

Basic SQL (SPJG, nesting) well understood
Very good extensibility
nearest neighbor search, spatial joins, time
series, UDF, roll-up, cube, ...
Current problems
statistics, cost model for optimization
physical database design expensive
Trends
interactiveness during execution
approximate answers
more and more functionality, powerful models (XML)

9
Distributed Query Processing 101

Idea
This is just an extension of centralized query
processing. (System R et al. in the early 80s)
What is different?
extend physical algebra sendreceive operators
resource vectors, network interconnect matrix
caching and replication
optimize for response time
less predictability in cost model (adaptive
algos)
heterogeneity in data formats and data models

10
Distributed Query Plan
A.d
hashjoin
receive
receive
send
send
B.b
index A.c
B
11
Cost
Total Cost Sum of Cost of Ops Cost 40
1
8
1
6
1
6
2
5
10
12
Response Time
Total Cost 40 first tuple 25 last tuple 33
25, 33
independent, pipelined parallelism
24, 32
0, 7
0, 24
0, 6
0, 18
0, 12
first tuple 0 last tuple 10
0, 5
0, 10
13
Adaptive Algorithms

Deal with unpredictable events at run time
delays in arrival of data, burstiness of network
autonomity of nodes, change in policies
Example double pipelined hash joins
build hash table for both input streams
read inputs in separate threads
good for bursty arrival of data
re-optimization at run time
monitor execution of query
adjust estimates of cost model
re-optimize if delta is too large

14
Heterogeneity

Use Wrappers to hide heterogeneity
Wrappers take care of data format, packaging
Wrappers map from local to global schema
Wrappers carry out caching
connections, cursors, data, ...
Wrappers map queries into local dialect
Wrappers participate in query planning!!!
define the subset of queries that can be handled
give cost information, statistics
capability-based rewrite (HKWY, VLDB 1997)

15
Data Cleaning

Are two objects the same?
Is D. A. Kossman the same as Kossmann?
Is the object that was at Position x 10 min. ago
the same as the object at Position y now?
Approaches (combination of)
statistical
domain knowledge
human interspection
Very Expensive

16
Summary

Theory very well understood
extend traditional (centralized) query processing
add some bells and whistles
heterogeinity needs manual work and wrappers
Problems in Practice
cost model, statistics
architectures are not fit for adaptivity,
heterogeneity
optimizers do not scale for 10,000s of sites
autonomy of sites, systems not built for
asynchronous communication
data cleaning

17
Middleware

Two kinds of middleware
data warehouses
virtual integration
Data Warehouses
good query response times
good materializes results of data cleaning
bad high resource requirements in middleware
bad staleness of data
Virtual Integration
the opposite
caching possible to improve response times

18
Virtual Integration
Query
Middleware (query decomposition, result
composition)
wrapper
wrapper
sub query
sub query
DB1
DB2
19
IBM Data Joiner
SQL Query
Data Joiner
wrapper
wrapper
sub query
sub query
SQL DB1
SQL DB2
20
Adding XML
Query
XML Publishing
Middleware (SQL)
wrapper
wrapper
sub query
sub query
DB1
DB2
21
XML Data Integration
XML Query
Middleware (XML)
XML query
XML query
wrapper
wrapper
DB1
DB2
22
XML Data Integration

Example BEA Liquid Data
Advantage
Availability of XML wrappers for all major
databases
Problems
XML - SQL mapping is very difficult
XML is not always the right language (e.g.,
decision support style queries)

23
Summary

Middleware looks like a homogenous, centralized
database
location transparency
data model transparency
Middleware provides global schema
data sources map local schemas to global schema
Various kinds of middleware (SQL, OQL, XML)
Stacks of middleware possible
Data Cleaning requires special attention

24
A Note on Web Services

Idea Encapsulate Data Source
provide WSDL interface to access data
works very well if query pattern is known
Problem Exploit Capability of Source
WSDL limits capabilities of data sourcegood
optimization requires white box
example access by id, access by name, full
scanshould all combinations be listed in WSDL?
Solution WSDL for Query Planning
Details ???