1 Parallel Database Systems: Analyzing LOTS of Data
Jim Gray, Microsoft Research
2 Outline
- Why Parallelism
- technology push
- application pull
- Parallel Database Techniques
- partitioned data
- partitioned and pipelined execution
- parallel relational operators
3 Main Message
- Technology trends give
- many processors and storage units
- inexpensively
- To analyze large quantities of data
- sequential (regular) access patterns are 100x faster
- parallelism is 1000x faster (trades time for money)
- Relational systems show many parallel algorithms.
4 Databases Store ALL Data Types
- The New World
- Billions of objects
- Big objects (1MB)
- Objects have behavior (methods)
- The Old World
- Millions of objects
- 100-byte objects
[Figure: two People tables.
Old world: People(Name, Address) with rows David/NY, Mike/Berk, Won/Austin.
New world: People(Name, Voice, Address, Papers, Picture) -- the same rows plus multimedia columns.
Captions: paperless office, Library of Congress online, all information online (entertainment, publishing, business), Information Network, Knowledge Navigator, Information at your fingertips.]
5 Moore's Law
- XXX doubles every 18 months = 60% increase per year
- Microprocessor speeds
- Chip density
- Magnetic disk density
- Communications bandwidth (WAN bandwidth approaching LANs)
6 Magnetic Storage Cheaper than Paper
- File cabinet: cabinet (4 drawer) $250, paper (24,000 sheets) $250, space (2' x 3' @ $10/ft2) $180; total $700 = 3 cents/sheet
- Disk: disk (4 GB) $500; ASCII: 2 M pages (100x cheaper) = 0.025 cents/sheet
- Image: 200 K pages (10x cheaper) = 0.25 cents/sheet
- So: store everything on disk
7 Disk & Tape Storage
- $/byte got 10^4 better
- $/access got 10^3 better
- Capacity grew 10^3
- Latency down 10x, bandwidth up 10x
[Figure: log-scale chart, bytes per dollar (1e3 to 1e8) vs access rate, for RAM (accesses/sec), disk (accesses/min), and tape (accesses/hr).]
8 What's a Terabyte? ($250 K of disk @ $0.25/MB)
1 Terabyte is:
- 1,000,000,000 business letters (150 miles of bookshelf)
- 100,000,000 book pages (15 miles of bookshelf)
- 50,000,000 FAX images (7 miles of bookshelf)
- 10,000,000 TV pictures, MPEG (10 days of video)
- 4,000 LandSat images
The Library of Congress (in ASCII) is 25 TB.
In 1980: $200 M of disk (10,000 disks), or a $5 M tape silo (10,000 tapes).
In 1995: $250 K of magnetic disk (70 disks), $500 K of optical disk robot (250 platters), or a $50 K tape silo (50 tapes).
Terror Byte!!
9 Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Figure: two log-log charts over access times from 10^-9 to 10^3 seconds.
Price vs speed: $/MB falls as access time grows, from cache and main memory through secondary storage (disk) and online tape to nearline and offline tape.
Size vs speed: capacity in bytes grows in the same order, cache smallest, offline tape largest.]
10 Storage Latency: How Far Away is the Data?
11 Thesis: Many Little will Win over Few Big
[Figure: price ($1 M, $100 K, $10 K) vs form factor -- mainframe, mini, micro, nano -- and disk diameters 14", 9", 5.25", 3.5", 2.5", 1.8".]
12 Implications of Hardware Trends
- Large disk farms will be inexpensive ($10 K/TB)
- Large RAM databases will be inexpensive ($1 K/GB)
- Processors will be inexpensive
- So the building block will be a processor with large RAM, lots of disk, and lots of network bandwidth: a CyberBrick
13 Implication of Hardware Trends: Clusters
[Figure: a building-block node with one CPU, 5 GB RAM, 50 GB disk.]
Future servers are CLUSTERS of processors and disks.
Distributed database techniques make clusters work.
14 Scaleables: Uneconomic So Far
- A slice = a processor, memory, and a few disks.
- The slice price of scaleables carries a 5x to 10x markup:
  - Teradata: $70 K for an Intel P5, 32 MB, 4 disks
  - Tandem: $100 K for a MipsCo R4000, 64 MB, 4 disks
  - Intel: $75 K for an i860, 32 MB, 2 disks
  - IBM/SP2: $100 K for an RS/6000, 64 MB, 8 disks
- A Compaq slice price is less than $10 K.
- What is the problem?
  - Proprietary interconnect
  - Proprietary packaging
  - Proprietary software (vendorIX)
15 Summary
- Tech trends => pipeline & partition parallelism
  - Lots of bytes & bandwidth per dollar
  - Lots of latency
  - Lots of MIPS per dollar
  - Lots of processors
- Putting it together: scaleable networks and platforms
  - Build clusters of commodity processors & storage
  - Commodity interconnect is key (the S of PMS)
    - Traditional interconnects give $100 K/slice
  - Commodity cluster operating system is key
  - Fault isolation and tolerance is key
  - Automatic parallel programming is key
16 The Hardware is in Place... and Then a Miracle Occurs
=> Enables parallel applications
17 The Software Challenge
- Automatic data placement (partitioning: random or organized)
- Automatic parallel programming (process placement)
- Parallel concepts, algorithms & tools
- Parallel query optimization
- Execution techniques: load balance, checkpoint/restart, pacing, multi-programming
18 Parallelism: Performance is the Goal
The goal is to get 'good' performance.
- Law 1: a parallel system should be faster than a serial system.
- Law 2: a parallel system should give near-linear scaleup or near-linear speedup or both.
19 Kinds of Parallel Execution
- Pipeline: any sequential program can be a stage; each stage's output stream feeds the next.
- Partition: outputs split N ways, inputs merge M ways; many copies of a sequential program each work on one partition.
[Figure: a pipeline of sequential programs, and a partitioned set of program copies with split/merge streams.]
20 The Perils of Parallelism
- Startup: creating processes, opening files, optimization
- Interference: device (CPU, disk, bus) and logical (lock, hotspot, server, log, ...)
- Skew: if tasks get very small, variance > service time
21 Why Parallel Access To Data?
Bandwidth: at 10 MB/s it takes 1.2 days to scan a terabyte; a 1,000x parallel scan takes about 1.5 minutes.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
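The slide's numbers follow from simple bandwidth arithmetic; a quick check of that arithmetic (using decimal TB = 10^12 bytes, as the slide's round numbers suggest):

```python
# Scan-time arithmetic behind the slide: 1 TB at 10 MB/s,
# serially vs with 1,000-way parallelism.
TB = 10**12          # bytes (decimal terabyte)
MBps = 10**6         # bytes per second

serial_seconds = TB / (10 * MBps)          # one 10 MB/s stream
parallel_seconds = serial_seconds / 1000   # 1,000 streams in parallel

serial_days = serial_seconds / 86400       # about 1.2 days
parallel_minutes = parallel_seconds / 60   # about 1.7 minutes
```

The parallel figure comes out at ~1.7 minutes; the slide rounds to 1.5.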
22 Data Flow Programming: Prefetch & Postwrite Hide Latency
- Can't wait for the data to arrive (2,000 years!)
- Need memory that gets the data in advance (~100 MB/s)
- Solution:
  - Pipeline from storage (tape, disk, ...) to CPU cache
  - Pipeline results to destination
23 Parallelism: Speedup & Scaleup
- Speedup: same job (100 GB), more hardware => less time.
- Scaleup: bigger job (100 GB -> 1 TB), more hardware => same time.
- Transaction scaleup: more clients and servers (1 K -> 10 K clients, 100 GB -> 1 TB) => same response time.
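As a minimal sketch of the two metrics (definitions only; the function names are illustrative, not from the talk):

```python
def speedup(serial_time, parallel_time):
    """Same job, more hardware: ratio of old time to new time.
    Linear speedup means N-times the hardware gives speedup N."""
    return serial_time / parallel_time

def scaleup(small_system_time, big_system_time):
    """N-times bigger job on N-times bigger hardware.
    Linear scaleup means the ratio stays at 1.0 (same elapsed time)."""
    return small_system_time / big_system_time

# Ideal linear behavior: 10x hardware cuts a 1000 s job to 100 s,
# and a 10x job on 10x hardware still takes 100 s.
ideal_speedup = speedup(1000, 100)
ideal_scaleup = scaleup(100, 100)
```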
24 Benchmark Buyer's Guide
Things to ask:
- When does it stop scaling?
- Throughput numbers, not ratios.
- Standard benchmarks allow comparison to others and comparison to sequential.
Ratios and non-standard benchmarks are red flags.
25 Why are Relational Operators So Successful for Parallelism?
- The relational data model gives uniform operators on uniform data streams, closed under composition.
- Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data.
- Sequential data in and out: pure dataflow.
- Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation.
=> AUTOMATIC PARALLELISM
26 Database Systems Hide Parallelism
- Automate system management via tools
- data placement
- data organization (indexing)
- periodic tasks (dump / recover / reorganize)
- Automatic fault tolerance
- duplex failover
- transactions
- Automatic parallelism
- among transactions (locking)
- within a transaction (parallel execution)
27 Automatic Parallel OR DB
Select image
from landsat
where date between 1970 and 1990
  and overlaps(location, Rockies)
  and snow_cover(image) > .7
Assign one process per processor/disk: find images with the right date and location; analyze each image and, if it is more than 70% snow, return it.
[Figure: operator tree -- the Landsat table (date, location, image) feeds temporal, spatial, and image predicates in turn, producing the answer; sample rows span dates 1/2/72 to 4/8/95 and locations 33N 120W to 34N 120W.]
28 Automatic Data Partitioning
Split a SQL table across a subset of the nodes & disks. Partition within the set by:
- Range: good for equi-joins, range queries, group-by
- Hash: good for equi-joins
- Round robin: good to spread load
Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning.
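The three placement rules can be sketched as routing functions that map a record to a node (a minimal illustration, not any product's actual scheme):

```python
def range_partition(key, split_points):
    """Range: node i holds keys below split_points[i].
    Good for equi-joins, range queries, and group-by."""
    for node, upper in enumerate(split_points):
        if key < upper:
            return node
    return len(split_points)          # last node takes the tail

def hash_partition(key, n_nodes):
    """Hash: good for equi-joins; spreads keys without ordering."""
    return hash(key) % n_nodes

def round_robin_partition(record_number, n_nodes):
    """Round robin: good to spread load; no help locating a key."""
    return record_number % n_nodes
```

A range query touches only the nodes whose ranges overlap it, while hash and round robin must ask every node.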
29 Index Partitioning
- Hash indices partition by hash.
- B-tree indices partition as a forest of trees, one tree per range.
- The primary index clusters the data.
30 Secondary Index Partitioning
In shared-nothing, secondary indices are problematic.
- Partition by base-table key ranges: inserts are completely local (but what about uniqueness?); a lookup examines ALL trees (see figure); a unique index forces a lookup on every insert.
- Partition by secondary key ranges: an insert touches two nodes (base and index); a lookup touches two nodes (index -> base); uniqueness is easy. This is the Teradata solution.
[Figure: a base table with secondary-index trees spread across the nodes.]
31 Data Rivers: Split & Merge Streams
Producers add records to the river; consumers take records from the river. Both ends do purely sequential programming. The river does flow control and buffering, and does the partitioning and merging of data records.
The river is the Split/Merge of Gamma and the Exchange operator of Volcano.
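A toy single-process sketch of the river idea (the `River` class and its split function are illustrative inventions, not Gamma's or Volcano's API): producers call `put`, the river routes each record to one consumer's queue, and each consumer reads its own queue sequentially.

```python
from queue import Queue

class River:
    """Minimal data-river sketch: a split function routes each record
    to one of N consumer queues; the queues provide the buffering
    (bounded queues would also give flow control via maxsize)."""
    def __init__(self, n_consumers, split):
        self.queues = [Queue() for _ in range(n_consumers)]
        self.split = split
    def put(self, record):
        # Producer side: purely sequential -- just emit records.
        self.queues[self.split(record) % len(self.queues)].put(record)
    def get(self, consumer):
        # Consumer side: purely sequential -- just read your queue.
        return self.queues[consumer].get()

river = River(n_consumers=2, split=lambda rec: rec[0])  # split on integer key
river.put((0, "a"))
river.put((1, "b"))
river.put((2, "c"))
```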
32 Partitioned Execution
Spreads computation and IO among the processors. Partitioned data gives NATURAL parallelism.
33 N x M Way Parallelism
N inputs, M outputs, no bottlenecks. Partitioned data; partitioned and pipelined data flows.
34 Picking Data Ranges
Disk partitioning:
- For range partitioning, sample the load on the disks; cool hot disks by making their ranges smaller.
- For hash partitioning, cool hot disks by mapping some buckets to others.
River partitioning:
- Use hashing and assume uniform.
- If range partitioning, sample the data and use a histogram to level the bulk.
Teradata, Tandem, and Oracle use these tricks.
35 Blocking Operators = Short Pipelines
An operator is blocking if it produces no output until it has consumed all its input. Examples: sort, aggregates, hash-join (reads all of one operand).
Blocking operators kill pipeline parallelism and make partition parallelism all the more important.
[Figure: a database load template with three blocked phases -- scan the tape/file into a SQL table; sort runs then merge runs; insert into the table and into each index (Index 1, Index 2, Index 3).]
36 Simple Aggregates (sort or hash?)
- Simple aggregates (count, min, max, ...) can use indices: more compact, and the index sometimes already holds the aggregate info.
- GROUP BY aggregates: scan in category order if possible (use indices). Else, if the categories fit in RAM, use a RAM category hash table. Else, make a temp of <category, item> pairs, sort by category, and do the math in the merge step.
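The RAM-hash-table case can be sketched in a few lines (a generic one-pass GROUP BY, assuming the distinct categories fit in memory; names are illustrative):

```python
def group_by_hash(rows, key, agg_init, agg_step):
    """One-pass GROUP BY using an in-RAM hash table keyed by category.
    Works only when the set of distinct categories fits in memory."""
    groups = {}
    for row in rows:
        k = key(row)
        groups[k] = agg_step(groups.get(k, agg_init), row)
    return groups

# SUM(amount) GROUP BY region over a small sales table.
sales = [("east", 10), ("west", 5), ("east", 7)]
totals = group_by_hash(sales, key=lambda r: r[0],
                       agg_init=0, agg_step=lambda acc, r: acc + r[1])
```

When the table is too big, this degrades to the sort-based plan the slide describes: write <category, item> to a temp, sort, and aggregate in the merge.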
37 Parallel Aggregates
For each aggregate function we need a decomposition strategy:
- count(S) = Σ count(s(i)); ditto for sum()
- avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
- and so on...
For groups, sub-aggregate the groups close to the source and drop the sub-aggregates into a hash river.
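The decomposition can be checked on a toy example. Note that avg must be carried as a (sum, count) pair, because averaging per-partition averages is wrong when partitions differ in size:

```python
# Each inner list stands for one partition's rows s(i).
partitions = [[1, 2, 3], [4, 5], [6]]

# Phase 1: sub-aggregate each partition near the source.
sub = [(sum(p), len(p)) for p in partitions]   # (sum(s(i)), count(s(i)))

# Phase 2: combine the sub-aggregates.
total_sum = sum(s for s, _ in sub)             # Σ sum(s(i))
total_count = sum(c for _, c in sub)           # Σ count(s(i))
avg = total_sum / total_count                  # avg(S)
```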
38 Parallel Sort
M inputs, N outputs. Disk and merge are not needed if the sort fits in memory. Scales linearly.
Sort is the benchmark from hell for shared-nothing machines: net traffic = disk bandwidth, with no data filtering at the source.
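One common M-input, N-output scheme is a sample sort, sketched here in a single process (an illustrative simplification: real systems run the partitions on separate nodes and stream through a river):

```python
def parallel_sort(data, n_nodes):
    """Sample-sort sketch: pick range split points from a sample,
    range-partition the records, sort each partition locally, and
    concatenate -- no merge step is needed because the partitions
    cover disjoint key ranges."""
    step = max(1, len(data) // (10 * n_nodes))
    sample = sorted(data[::step])                       # sample the data
    splits = [sample[i * len(sample) // n_nodes]        # n_nodes - 1 cut points
              for i in range(1, n_nodes)]
    parts = [[] for _ in range(n_nodes)]
    for x in data:
        parts[sum(x >= s for s in splits)].append(x)    # route to its range
    return [y for part in parts for y in sorted(part)]  # local sorts, concat
```

The sampling step is exactly the histogram trick from the "Picking Data Ranges" slide: it levels the bulk so no node gets a hot range.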
39 Hash Join: Combining Two Tables
- Hash the smaller table into N buckets (hope N=1).
- If N=1: read the larger table and hash-probe against the smaller.
- Else: hash the outer to disk, then do a bucket-by-bucket hash join.
Purely sequential data behavior. Always beats sort-merge and nested loops unless the data is clustered. Good for equi-, outer, and exclusion joins. Lots of papers; products just appearing (what went wrong?). Hashing reduces skew.
[Figure: left and right tables hashed into buckets.]
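The N=1 case (build side fits in memory) can be sketched as a classic build/probe loop (a generic illustration; table and function names are invented):

```python
def hash_join(probe_side, build_side, probe_key, build_key):
    """In-memory equi-join: build a hash table on the smaller
    (build) table, then stream the larger (probe) table past it.
    Each input is read exactly once, sequentially."""
    buckets = {}
    for row in build_side:                        # build phase
        buckets.setdefault(build_key(row), []).append(row)
    out = []
    for row in probe_side:                        # probe phase
        for match in buckets.get(probe_key(row), []):
            out.append((row, match))
    return out

orders = [(1, "ok"), (2, "late")]
customers = [(1, "Ann"), (3, "Bob")]
joined = hash_join(orders, customers, lambda o: o[0], lambda c: c[0])
```

The one-pass sequential behavior on both inputs is what the slide means by "purely sequential data behavior", and it is why the operator partitions so well (next slide).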
40 Parallel Hash Join
- ICL implemented hash join with bitmaps in the CAFS machine (1976)!
- Kitsuregawa pointed out the parallelism benefits of hash join in the early 1980s (it partitions beautifully). We ignored them! (Why?)
- But now everybody's doing it (or promises to do it).
- Hashing minimizes skew and requires little thinking for redistribution.
- Hashing uses massive main memory.
41 Exotic Joins (Exclusion, Cartesian, Outer, ...)
- Exclusion is used for NOT IN and DIFFERENCE queries.
- Outer is a lossless join: a left or right row appears with a null sibling if no matching sibling is found.
- Cartesian is often a mistake (a missing WHERE clause), but it is also important for small-table large-table optimization (also called a STAR schema).
Each small table is used as a pick list: it represents a mapping (name -> code) or set membership (e.g. holidays). Best plan: restrict the small tables, form the Cartesian product of all the small tables, and send it to each partition of the large table for a hash join.
42 What Systems Work This Way
Shared nothing:
- Teradata: 400 nodes (80x12 nodes)
- Tandem: 110 nodes
- IBM SP2 / DB2: 128 nodes
- ATT & Sybase: 8x14 nodes
- Informix/SP2: 100 nodes
Shared disk:
- Oracle: 170 nodes
- Rdb: 24 nodes
Shared memory:
- Informix: 9 nodes
- RedBrick: ? nodes
43 Outline
- Why Parallelism
- technology push
- application pull
- Parallel Database Techniques
- partitioned data
- partitioned and pipelined execution
- parallel relational operators
44 Main Message
- Technology trends give
- many processors and storage units
- inexpensively
- To analyze large quantities of data
- sequential (regular) access patterns are 100x faster
- parallelism is 1000x faster (trades time for money)
- Relational systems show many parallel algorithms.