Title: Parallel Database Systems: A SNAP Application
1. Parallel Database Systems: A SNAP Application
Jim Gray, 310 Filbert, SF CA 94133, Gray@Microsoft.com
Gordon Bell, 450 Old Oak Court, Los Altos, CA 94022, GBell@Microsoft.com
2. Outline
- Cyberspace Pep Talk
- Databases are the dirt of Cyberspace
- Billions of clients mean millions of servers
- Parallel Imperative
- Hardware trend: many little devices
- Consequence: servers are arrays of commodity components; PCs are the bricks of Cyberspace
- Must automate parallel design / operation / use
- Software parallelism via dataflow & data partitioning
- Parallel database techniques
- Parallel execution of many little jobs (OLTP)
- Data Partitioning
- Pipeline Execution
- Automation techniques
- Summary
3. Kinds Of Information Processing

                 Point-to-Point         Broadcast
  Immediate      conversation, money    lecture, concert     (Network)
  Time-Shifted   mail                   book, newspaper      (Database)

It's ALL going electronic. Immediate is being stored for analysis (so it's ALL database). Analysis and automatic processing are being added.
4. Why Put Everything in Cyberspace?
Low rent: minimum $/byte. Shrinks time: now or later. Shrinks space: here or there. Automate processing: knowbots.
[Diagram: Point-to-Point OR Broadcast traffic flows through the Network (Immediate OR Time Delayed) into the Database, which can Locate, Process, Analyze, and Summarize.]
5. Databases: Information At Your Fingertips
Information Network, Knowledge Navigator
- All information will be in an online database (somewhere)
- You might record everything you
- read: 10 MB/day, 400 GB/lifetime (two tapes)
- hear: 400 MB/day, 16 TB/lifetime (a tape per decade)
- see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (maybe someday)
- Data storage, organization, and analysis is a challenge.
- That is what databases are about
- DBs do a good job on records
- Now working on text, spatial, image, and sound.
- This talk is about automatic parallel search (the outer loop)
- Techniques work for ALL kinds of data
6. Databases Store ALL Data Types
- The Old World
- Millions of objects
- 100-byte objects
- The New World
- Billions of objects
- Big objects (1MB)
- Objects have behavior (methods)
[Figure: the Old World People table has just Name and Address columns (David | NY, Mike | Berk, Won | Austin); the New World People table adds Papers, Picture, and Voice columns.]
Paperless office, Library of Congress online, all information online: entertainment, publishing, business. Information Network, Knowledge Navigator, Information at your fingertips.
7. Magnetic Storage Cheaper than Paper
- File Cabinet: cabinet (4 drawer) $250, paper (24,000 sheets) $250, space (2x3 @ $10/ft2) $180; total $700, about 3¢/sheet
- Disk: disk (8 GB) $4,000; as ASCII: 4 million pages, 0.1¢/sheet (30x cheaper)
- Image: 200 k pages, 2¢/sheet (similar to paper)
- Store everything on disk
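A quick back-of-the-envelope check of the per-sheet numbers above (a sketch in Python; the ~2 KB per ASCII page and ~40 KB per image page are assumptions inferred from the page counts):

    # Paper: the $700 cabinet holds 24,000 sheets
    paper_cents_per_sheet = 700 * 100 / 24_000        # about 2.9 cents/sheet
    # ASCII on disk: $4,000 for 8 GB; at ~2 KB/page that is ~4 million pages
    ascii_cents_per_page = 4_000 * 100 / 4_000_000    # 0.1 cents/page (about 30x cheaper)
    # Images on disk: at ~40 KB/page the same disk holds ~200,000 pages
    image_cents_per_page = 4_000 * 100 / 200_000      # 2 cents/page (similar to paper)
    print(paper_cents_per_sheet, ascii_cents_per_page, image_cents_per_page)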
8. Cyberspace Demographics
1950 National Computer, 1960 Corporate Computer, 1970 Site Computer, 1980 Departmental Computer, 1990 Personal Computer, 2000 ?
- Computer history: most computers are small; NEXT: 1 billion X, for some X (phone?)
- Most of the money is in clients and wiring: 1990 50% desktop, 1995 75% desktop
9. Billions of Clients
- Every device will be intelligent
- Doors, rooms, cars, ...
- Computing will be ubiquitous
10. Billions of Clients Need Millions of Servers
All clients are networked to servers; clients may be nomadic or on-demand. Fast clients want faster servers. Servers provide data, control, coordination, and communication.
[Figure: mobile and fixed clients connect to servers and super servers; super servers host large databases, high traffic, and shared data.]
11. If Hardware is Free, Where Will The Money Go?
- All clients and servers will be based on PC technology: economies of scale give the lowest price.
- Traditional budget: 40% vendor, 60% staff
- If hardware_price + software_price → 0, then what?
- Money will go to
- CONTENT (databases)
- NEW APPLICATIONS
- AUTOMATION (analogy to 1920 telephone operators)
- Systems programmer per MIPS
- DBA per 10 GB
12. The New Computer Industry
- Horizontal integration is the new structure
- Each layer picks the best from the lower layer.
- Desktop market
- 1991: 50%
- 1995: 75%
- Compaq is the biggest computer company
Function          Example
Operation         AT&T
Integration       EDS
Applications      SAP
Middleware        Oracle
Baseware          Microsoft
Systems           Compaq
Silicon & Oxide   Intel, Seagate
13. Constant Dollars vs Constant Work
- Constant Work
- One SuperServer can do all the world's computations.
- Constant Dollars
- The world spends 10% on information processing
- Computers are moving from 5% penetration to 50%
- $300B to $3T
- We have the patent on the byte and algorithm
14. The Seven Price Tiers
- $10: wrist watch computers
- $100: pocket/palm computers
- $1,000: portable computers
- $10,000: personal computers (desktop)
- $100,000: departmental computers (closet)
- $1,000,000: site computers (glass house)
- $10,000,000: regional computers (glass castle)
SuperServer: costs more than $100,000. Mainframe: costs more than $1M. Must be an array of processors, disks, tapes & comm ports.
15. Software Economics: Bill's Law
- Bill Joy's law (Sun): Don't write software for less than 100,000 platforms. @ $10M engineering expense, $1,000 price
- Bill Gates' law: Don't write software for less than 1,000,000 platforms. @ $10M engineering expense, $100 price
- Examples
- UNIX vs NT: $3,500 vs $500
- Oracle vs SQL-Server: $100,000 vs $6,000
- No spreadsheet or presentation pack on Unix/VMS/...
- Commoditization of base software & hardware
16. What comes next?
- MANY new clients
- Applications to enable clients & servers
- super-servers
17. Outline
- Cyberspace Pep Talk
- Databases are the dirt of Cyberspace
- Billions of clients mean millions of servers
- Parallel Imperative
- Hardware trend: many little devices
- Consequence: server arrays of commodity parts; PCs are the bricks of Cyberspace
- Must automate parallel design / operation / use
- Software parallelism via dataflow & data partitioning
- Parallel database techniques
- Parallel execution of many little jobs (OLTP)
- Data Partitioning
- Pipeline Execution
- Automation techniques
- Summary
18. Moore's Law Restated: Many Little Won over Few Big
Hardware trends: a few generic parts: CPU & RAM, disk & tape arrays, ATM for LAN/WAN, ?? for CAN, ?? for OS. These parts will be inexpensive (commodity components). Systems will be arrays of these parts. Software challenge: how to program arrays.
[Figure: price tiers ($1M, $100K, $10K) for mainframe, mini, micro, and nano computers; disk form factors shrinking from 9" to 5.25", 3.5", 2.5", and 1.8".]
19. Future SuperServer
An array of processors, disks, tapes & comm lines. Challenge: how to program it. Must use parallelism: pipeline to hide latency, partition for bandwidth and scaleup.
20. Great Debate: Shared What?
Shared Memory (SMP): easy to program, difficult to build, difficult to scaleup (Sequent, SGI, Sun)
Shared Disk: (VMScluster, Sysplex)
Shared Nothing (network): hard to program, easy to build, easy to scaleup (Tandem, Teradata, SP2)
The winner will be a synthesis of these ideas. Distributed shared memory (DASH, Encore) blurs the distinction between network and bus (locality is still important), but gives shared memory at message cost.
21. The Hardware is in Place... and Then a Miracle Occurs
SNAP: Scaleable Networks And Platforms. A commodity distributed OS built on commodity platforms and a commodity network interconnect.
22. Why Parallel Access To Data?
At 10 MB/s it takes 1.2 days to scan; with 1,000-way parallelism it is roughly a 1.3 minute scan. BANDWIDTH.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
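The arithmetic behind those numbers, as a sketch (the 1 TB table size is an assumption inferred from the 1.2-day figure):

    bytes_to_scan = 1e12                  # assume roughly a 1 TB table
    rate = 10e6                           # 10 MB/s per scan stream
    serial_days = bytes_to_scan / rate / 86_400               # ~1.2 days
    parallel_minutes = bytes_to_scan / (1_000 * rate) / 60    # ~1.7 minutes with 1,000 streams
    print(serial_days, parallel_minutes)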
23. DataFlow Programming: Prefetch & Postwrite Hide Latency
- Can't wait for the data to arrive
- Need a memory that gets the data in advance (100 MB/s)
- Solution
- Pipeline from the source (tape, disc, ram, ...) to the cpu cache
- Pipeline results to the destination
LATENCY
24. The New Law of Computing
Grosch's Law (performance grows as the square of the price) gives way to the Parallel Law: linear speedup and linear scaleup are needed. Not always possible.
25. Parallelism: Performance is the Goal
The goal is to get 'good' performance.
Law 1: a parallel system should be faster than a serial system.
Law 2: a parallel system should give near-linear scaleup or near-linear speedup or both.
Parallelism is faster, not cheaper: it trades money for time.
26. Parallelism: Speedup & Scaleup
Speedup: same job (100 GB), more hardware, less time.
Scaleup: bigger job (100 GB grows to 1 TB), more hardware, same time.
Transaction scaleup: more clients and servers (1 k clients on a 100 GB server grows to 10 k clients on a 1 TB server), same response time.
27. The Perils of Parallelism
[Figure: a bad speedup curve shows no parallelism benefit compared with linearity as processors & discs are added.]
Startup: creating processes, opening files, optimization.
Interference: device (cpu, disc, bus) and logical (lock, hotspot, server, log, ...).
Skew: if tasks get very small, variance > service time.
28. Outline
- Cyberspace Pep Talk
- Databases are the dirt of Cyberspace
- Billions of clients mean millions of servers
- Parallel Imperative
- Hardware trend: many little devices
- Consequence: server arrays of commodity parts; PCs are the bricks of Cyberspace
- Must automate parallel design / operation / use
- Software parallelism via dataflow & data partitioning
- Parallel database techniques
- Parallel execution of many little jobs (OLTP)
- Data Partitioning
- Pipeline Execution
- Automation techniques
- Summary
29. Kinds of Parallel Execution
Pipeline: one sequential program streams its output into the next (any sequential program feeding any sequential program).
Partition: outputs split N ways and inputs merge M ways, so many copies of a sequential program each work on their own partition.
30. Data Rivers: Split + Merge Streams
[Figure: N producers feed N x M data streams (the river) consumed by M consumers.]
Producers add records to the river; consumers take records from the river. Purely sequential programming. The river does flow control and buffering, and does the partition and merge of the data records. The river is the Exchange operator in Volcano.
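A minimal sketch of the river idea, assuming hash partitioning over in-memory queues (the names and queue sizes are illustrative, not Volcano's API):

    from queue import Queue

    M = 3                                               # number of consumers
    river = [Queue(maxsize=1024) for _ in range(M)]     # one buffered stream per consumer

    def produce(records):
        # A producer is ordinary sequential code; the river decides where each record goes.
        for key, value in records:
            river[hash(key) % M].put((key, value))      # split: partition by key

    def consume(i, operator):
        # A consumer is also sequential; it sees the merge of every producer's output for partition i.
        while True:
            record = river[i].get()
            operator(record)                            # feed the next operator in the pipeline

A real river also needs end-of-stream markers and flow control across processes and machines.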
31. Partitioned Data and Execution
Spreads computation and IO among processors. Partitioned data gives NATURAL execution parallelism.
32. Partitioned & Merge Pipeline Execution
Pure dataflow programming gives linear speedup & scaleup. But the top node may be a bottleneck. So...
33. N x M way Parallelism
N inputs, M outputs, no bottlenecks.
34. Why are Relational Operators Successful for Parallelism?
The relational data model: uniform operators on uniform data streams, closed under composition. Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data; sequential data in and out: pure dataflow. Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation. AUTOMATIC PARALLELISM.
35. SQL: a NonProcedural Programming Language
- SQL is a functional programming language: it describes the answer set.
- The optimizer picks the best execution plan
- Picks the data flow web (pipeline),
- the degree of parallelism (partitioning),
- other execution parameters (process placement, memory, ...)
[Figure: optimizer diagram with GUI, schema, plan, execution planning, monitor, and rivers.]
36. Database Systems Hide Parallelism
- Automate system management via tools
- data placement
- data organization (indexing)
- periodic tasks (dump / recover / reorganize)
- Automatic fault tolerance
- duplex failover
- transactions
- Automatic parallelism
- among transactions (locking)
- within a transaction (parallel execution)
37. Success Stories
- Online Transaction Processing
- many little jobs
- SQL systems support 3700 tps-A (24 cpu, 240 disk)
- SQL systems support 21,000 tpm-C (110 cpu, 800 disk)
- Batch (decision support and utility)
- few big jobs, parallelism inside
- Scan data at 100 MB/s
- Linear scaleup to 50 processors
[Charts: transactions/sec vs. hardware; records/sec vs. hardware.]
38. Kinds of Partitioned Data
Split a SQL table across a subset of the nodes & disks. Partition within the set by:
- Range: good for equijoins, range queries, and group-by
- Hash: good for equijoins
- Round Robin: good to spread load
Shared disk and shared memory are less sensitive to partitioning; shared nothing benefits from "good" partitioning.
39. Index Partitioning
Hash indices partition by hash. B-tree indices partition as a forest of trees, one tree per range. The primary index clusters the data.
40. Secondary Index Partitioning
In shared nothing, secondary indices are problematic.
Partition by base table key ranges: inserts are completely local (but what about uniqueness?); lookups examine ALL the trees (see figure); a unique index requires a lookup on every insert.
Partition by secondary key ranges: inserts touch two nodes (base and index); lookups touch two nodes (index → base); uniqueness is easy.
[Figure: one A..Z secondary index tree per base table partition, so a secondary-key lookup must probe every tree.]
Teradata solution: partition non-unique indices by base table, unique indices by secondary key.
41. Picking Data Ranges
Disk partitioning: for range partitioning, sample the load on the disks and cool hot disks by making their ranges smaller; for hash partitioning, cool hot disks by mapping some buckets to other disks.
River partitioning: use hashing and assume uniformity; if range partitioning, sample the data and use a histogram to level the bulk.
Teradata, Tandem, and Oracle use these tricks.
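A sketch of the histogram trick: sample the keys and take evenly spaced quantiles as the range break points (an illustration, not any vendor's actual algorithm):

    import random

    def pick_range_breaks(keys, n_partitions, sample_size=10_000):
        # Sort a sample and take its quantiles, so each range gets roughly the same bulk.
        sample = sorted(random.sample(keys, min(sample_size, len(keys))))
        step = len(sample) // n_partitions
        return [sample[i * step] for i in range(1, n_partitions)]

    # pick_range_breaks(all_keys, 4) -> three break points splitting the keys into four ~equal ranges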
42. Parallel Data Scan
select image
from landsat
where date between 1970 and 1990
  and overlaps(location, Rockies)
  and snow_cover(image) > .7
Assign one process per processor/disk: find the images with the right date & location, analyze the image, and if it is more than 70% snow, return it.
[Figure: the landsat table (date, loc, image) is spread across disks with temporal, spatial, and image access paths; each partition applies the date, location, and image tests and contributes rows to the answer.]
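A sketch of how such a scan runs in parallel: every partition's worker runs the same sequential filter, and the survivors merge into the answer. The overlaps and snow_cover functions and the ROCKIES region are stand-ins for the query's user-defined predicates:

    from multiprocessing import Pool

    ROCKIES = ...                          # placeholder for the spatial search region

    def overlaps(loc, region):             # stand-in for the spatial predicate
        ...

    def snow_cover(image):                 # stand-in for the image-analysis predicate
        ...

    def scan_partition(rows):
        # rows: the (year, loc, image) records stored on this worker's disks
        return [image for (year, loc, image) in rows
                if 1970 <= year <= 1990
                and overlaps(loc, ROCKIES)
                and snow_cover(image) > 0.7]

    # One process per processor/disk; the per-partition answers merge into one result:
    # with Pool(len(partitions)) as pool:
    #     answer = [img for part in pool.map(scan_partition, partitions) for img in part]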
43. Simple Aggregates (sort or hash?)
Simple aggregates (count, min, max, ...) can use indices: they are more compact and sometimes already hold the aggregate info.
GROUP BY aggregates: scan in category order if possible (use indices). Else, if the categories fit in RAM, use a RAM category hash table. Else make a temp file of <category, item> pairs, sort by category, and do the math in the merge step.
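A sketch of the RAM hash-table case, assuming the aggregate is a per-category sum:

    def group_by_sum(records):
        # records: (category, value) pairs; the distinct categories are assumed to fit in memory
        totals = {}
        for category, value in records:
            totals[category] = totals.get(category, 0) + value
        return totals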
44. Parallel Aggregates
For each aggregate function we need a decomposition strategy: count(S) = Σ count(s(i)), ditto for sum(); avg(S) = (Σ sum(s(i))) / (Σ count(s(i))); and so on.
For groups, sub-aggregate the groups close to the source and drop the sub-aggregates into a hash river.
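A sketch of the avg() decomposition: each partition computes a local (sum, count) near the data, and the combining step divides at the end (illustrative only):

    def local_partial(values):
        # Phase 1: runs at the source, once per partition
        return (sum(values), len(values))

    def global_avg(partials):
        # Phase 2: combine the sub-aggregates that arrive on the river
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        return total / count

    # global_avg([local_partial(p) for p in partitions]) equals the average over all partitions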
45. Sort
Used for loading and reorganization (sort makes them sequential), building B-trees, reports, and non-equi-joins. Rarely used for aggregates or equi-joins (if hashing is available). Should run at 10 MB/s or better: faster than a disk, so it needs striped scratch files. An in-memory sort is about 250 K records/s.
[Figure: input data is sorted into runs, which are merged into the sorted output.]
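A sketch of the run-then-merge pattern in the figure, with in-memory lists standing in for the striped scratch files:

    import heapq

    def external_sort(records, run_size=100_000):
        # Phase 1: produce sorted runs that each fit in memory
        runs = [sorted(records[i:i + run_size])
                for i in range(0, len(records), run_size)]
        # Phase 2: merge the runs into one sorted output stream
        return heapq.merge(*runs)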
46. Parallel Sort
M input, N output sort design. Disk and merge are not needed if the sort fits in memory. Scales linearly because ...
Sort is the benchmark from hell for shared nothing machines: net traffic equals the disk bandwidth, and there is no data filtering at the source.
47. Blocking Operators = Short Pipelines
An operator is blocking if it does not produce any output until it has consumed all of its input. Examples: sort, aggregates, hash-join (reads all of one operand). Blocking operators kill pipeline parallelism and make partition parallelism all the more important. The database load template has three blocked phases.
48. Nested Loops Join
If the inner table is indexed on the join columns (B-tree or hash), then sequentially scan the outer (from the start key); for each outer record, probe the inner table for matching records. Works best if the inner is in RAM (so a small inner). Works great if the inner is a B-tree or hash in RAM. Partitions well: replicate the inner at each outer partition (if the outer is partitioned on the join column, don't replicate the inner, partition it). Works for all joins (outer, non-equijoins, cartesian, exclusion, ...).
[Figure: the outer table is scanned; the inner table is probed for each outer record.]
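A sketch of the index nested-loops idea, with a Python dict standing in for the in-RAM hash index on the inner table:

    def nested_loops_join(outer, inner, key):
        # Build (or reuse) a hash index on the inner table's join column
        index = {}
        for row in inner:
            index.setdefault(row[key], []).append(row)
        # Scan the outer table; probe the inner index for each outer record
        for o in outer:
            for i in index.get(o[key], []):
                yield {**o, **i}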
49. Merge Join (and sort-merge join)
If the tables are sorted on the join columns (e.g. by a B-tree), then sequentially scan each (from the start key):
left < right: advance left
left = right: match
left > right: advance right
A nice sequential scan of the data (disk speed); the NxM duplicate case is a cartesian product and may cause a backwards rescan. A sort-merge join sorts before doing the merge.
Partitions well: partition the smaller table to match the larger's partitioning. Works for all joins (outer, non-equijoins, cartesian, exclusion, ...).
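A sketch of the merge step, assuming both inputs are already sorted on the join key and the keys are unique (the NxM duplicate case needs the backwards rescan noted above):

    def merge_join(left, right, key):
        # left and right: lists of dicts, each sorted on `key`
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i][key] < right[j][key]:
                i += 1                                  # left < right: advance left
            elif left[i][key] > right[j][key]:
                j += 1                                  # left > right: advance right
            else:
                yield {**left[i], **right[j]}           # left = right: match
                i += 1
                j += 1                                  # unique keys, so advance both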
50. Hash Join
Hash the smaller table into N buckets (hope N=1). If N=1, read the larger table and hash-probe into the smaller. Else, hash the outer to disk and then do a bucket-by-bucket hash join. Purely sequential data behavior. Always beats sort-merge and nested loops unless the data is already clustered. Good for equi, outer, and exclusion joins. Lots of papers; products are just appearing (what went wrong?). Hashing reduces skew.
[Figure: one table is hashed into buckets that the other table probes.]
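A sketch of the bucket-by-bucket variant: hash both inputs into buckets, then build on the smaller bucket and probe with the larger (in memory here; real systems spill the buckets to disk):

    def hash_join(small, large, key, n_buckets=8):
        def split(rows):
            parts = [[] for _ in range(n_buckets)]
            for row in rows:
                parts[hash(row[key]) % n_buckets].append(row)
            return parts

        small_parts, large_parts = split(small), split(large)
        # Join one bucket pair at a time: build a hash table on the small side, probe with the large side
        for s_rows, l_rows in zip(small_parts, large_parts):
            table = {}
            for row in s_rows:
                table.setdefault(row[key], []).append(row)
            for row in l_rows:
                for match in table.get(row[key], []):
                    yield {**match, **row}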
51. Observation: Execution is easy, Automation is hard
It is easy to build a fast parallel execution environment (no one has done it, but it is just programming). It is hard to write a robust and world-class query optimizer. There are many tricks, and one quickly hits the complexity barrier.
Common approach: pick the best sequential plan; pick the degree of parallelism based on bottleneck analysis; bind operators to processes; place processes at nodes; place scratch files near the processes; use memory as a constraint.
52. Systems That Work This Way
Shared Nothing: Teradata (400 nodes), Tandem (110 nodes), IBM SP2 / DB2 (48 nodes), AT&T & Sybase (112 nodes), Informix/SP2 (48 nodes)
Shared Disk: Oracle (170 nodes), Rdb (24 nodes)
Shared Memory: Informix (9 nodes), RedBrick (? nodes)
53. Research Problems
Automatic data placement (partitioning: random or organized). Automatic parallel programming (process placement). Parallel concepts, algorithms & tools. Parallel query optimization. Execution techniques: load balancing, checkpoint/restart, pacing, ...
54. Summary
- Cyberspace is growing
- Databases are the dirt of cyberspace; PCs are the bricks; networks are the mortar
- Many little devices; performance via arrays of cpu, disk & tape
- Then a miracle occurs: a scaleable distributed OS and net
- SNAP: Scaleable Networks And Platforms
- Then parallel database systems give software parallelism
- OLTP: lots of little jobs run in parallel
- Batch TP: data flow & data partitioning
- Automate processor & storage array administration
- Automate processor & storage array programming
- 2000 platforms as easy as 1 platform.