Distributed Query Processing - PowerPoint PPT Presentation

1 / 24
About This Presentation
Title:

Distributed Query Processing

Description:

response time, throughput, first answers, little IO, ... connections, cursors, data, ... Wrappers map queries into local dialect ... – PowerPoint PPT presentation

Number of Views:67
Avg rating:3.0/5.0
Slides: 25
Provided by: donal211
Category:

less

Transcript and Presenter's Notes

Title: Distributed Query Processing


1
Distributed Query Processing
  • Donald Kossmann
  • University of Heidelberg
  • kossmann_at_informatik.uni-heidelberg.de

2
Agenda
  • Query Processing 101
  • centralized query processing
  • distributed query processing
  • Middleware
  • SQL and XML data integration
  • The Role of Web Services

3
Problem Statement
  • Input Query
  • How many times has the moon circled around the
    earth in the last twenty years?
  • Output Answer
  • 240!
  • Objectives
  • response time, throughput, first answers, little
    IO, ...
  • Centralized vs. Distributed Query Processing
  • same problem
  • but, different parameters and objectives

4
Query Processing 101
  • Input Declarative Query
  • SQL, OQL, XQuery, ...
  • Step 1 Translate Query into Algebra
  • Tree of operators
  • Step 2 Optimize Query (physical and logical)
  • Tree of operators
  • (Compilation)
  • Step 3 Interpretation
  • Query result

5
Algebra
A.d
SELECT A.d FROM A, B WHERE A.a B.b
AND A.c 35
A.a B.b, A.c 35
X
A
B
  • relational algebra for SQL very well understood
  • algebra for OQL fairly well understood
  • algebra for XQuery (work in progress)

6
Query Optimization
A.d
A.d
A.a B.b, A.c 35
hashjoin
X
B.b
A
B
index A.c
B
  • no brainers (e.g., push down cheap predicates)
  • enumerate alternative plans, apply cost model
  • use search heuristics to find cheapest plan

7
Query Execution
John
A.d
(John, 35, CS)
hashjoin
(CS) (AS)
(John, 35, CS) (Mary, 35, EE)
B.b
(Edinburgh, CS,5.0) (Edinburgh, AS, 6.0)
index A.c
B
  • library of operators (hash join, merge join, ...)
  • pipelining (iterator model)
  • lazy evaluation
  • exploit indexes and clustering in database

8
Summary Centralized Queries
  • Basic SQL (SPJG, nesting) well understood
  • Very good extensibility
  • nearest neighbor search, spatial joins, time
    series, UDF, roll-up, cube, ...
  • Current problems
  • statistics, cost model for optimization
  • physical database design expensive
  • Trends
  • interactiveness during execution
  • approximate answers
  • more and more functionality, powerful models (XML)

9
Distributed Query Processing 101
  • Idea
  • This is just an extension of centralized query
    processing. (System R et al. in the early 80s)
  • What is different?
  • extend physical algebra sendreceive operators
  • resource vectors, network interconnect matrix
  • caching and replication
  • optimize for response time
  • less predictability in cost model (adaptive
    algos)
  • heterogeneity in data formats and data models

10
Distributed Query Plan
A.d
hashjoin
receive
receive
send
send
B.b
index A.c
B
11
Cost
Total Cost Sum of Cost of Ops Cost 40
1
8
1
6
1
6
2
5
10
12
Response Time
Total Cost 40 first tuple 25 last tuple 33
25, 33
independent, pipelined parallelism
24, 32
0, 7
0, 24
0, 6
0, 18
0, 12
first tuple 0 last tuple 10
0, 5
0, 10
13
Adaptive Algorithms
  • Deal with unpredictable events at run time
  • delays in arrival of data, burstiness of network
  • autonomity of nodes, change in policies
  • Example double pipelined hash joins
  • build hash table for both input streams
  • read inputs in separate threads
  • good for bursty arrival of data
  • re-optimization at run time
  • monitor execution of query
  • adjust estimates of cost model
  • re-optimize if delta is too large

14
Heterogeneity
  • Use Wrappers to hide heterogeneity
  • Wrappers take care of data format, packaging
  • Wrappers map from local to global schema
  • Wrappers carry out caching
  • connections, cursors, data, ...
  • Wrappers map queries into local dialect
  • Wrappers participate in query planning!!!
  • define the subset of queries that can be handled
  • give cost information, statistics
  • capability-based rewrite (HKWY, VLDB 1997)

15
Data Cleaning
  • Are two objects the same?
  • Is D. A. Kossman the same as Kossmann?
  • Is the object that was at Position x 10 min. ago
    the same as the object at Position y now?
  • Approaches (combination of)
  • statistical
  • domain knowledge
  • human interspection
  • Very Expensive

16
Summary
  • Theory very well understood
  • extend traditional (centralized) query processing
  • add some bells and whistles
  • heterogeinity needs manual work and wrappers
  • Problems in Practice
  • cost model, statistics
  • architectures are not fit for adaptivity,
    heterogeneity
  • optimizers do not scale for 10,000s of sites
  • autonomy of sites, systems not built for
    asynchronous communication
  • data cleaning

17
Middleware
  • Two kinds of middleware
  • data warehouses
  • virtual integration
  • Data Warehouses
  • good query response times
  • good materializes results of data cleaning
  • bad high resource requirements in middleware
  • bad staleness of data
  • Virtual Integration
  • the opposite
  • caching possible to improve response times

18
Virtual Integration
Query
Middleware (query decomposition, result
composition)
wrapper
wrapper
sub query
sub query
DB1
DB2
19
IBM Data Joiner
SQL Query
Data Joiner
wrapper
wrapper
sub query
sub query
SQL DB1
SQL DB2
20
Adding XML
Query
XML Publishing
Middleware (SQL)
wrapper
wrapper
sub query
sub query
DB1
DB2
21
XML Data Integration
XML Query
Middleware (XML)
XML query
XML query
wrapper
wrapper
DB1
DB2
22
XML Data Integration
  • Example BEA Liquid Data
  • Advantage
  • Availability of XML wrappers for all major
    databases
  • Problems
  • XML - SQL mapping is very difficult
  • XML is not always the right language (e.g.,
    decision support style queries)

23
Summary
  • Middleware looks like a homogenous, centralized
    database
  • location transparency
  • data model transparency
  • Middleware provides global schema
  • data sources map local schemas to global schema
  • Various kinds of middleware (SQL, OQL, XML)
  • Stacks of middleware possible
  • Data Cleaning requires special attention

24
A Note on Web Services
  • Idea Encapsulate Data Source
  • provide WSDL interface to access data
  • works very well if query pattern is known
  • Problem Exploit Capability of Source
  • WSDL limits capabilities of data sourcegood
    optimization requires white box
  • example access by id, access by name, full
    scanshould all combinations be listed in WSDL?
  • Solution WSDL for Query Planning
  • Details ???
Write a Comment
User Comments (0)
About PowerShow.com