1 Parallel Database Systems: Analyzing LOTS of Data
Jim Gray, Microsoft Research
2 Outline
- Why Parallelism
- technology push
- application pull
- Parallel Database Techniques
- partitioned data
- partitioned and pipelined execution
- parallel relational operators
3 Main Message
- Technology trends give
- many processors and storage units
- inexpensively
- To analyze large quantities of data
- sequential (regular) access patterns are 100x faster
- parallelism is 1000x faster (trades time for money)
- Relational systems show many parallel algorithms.
4 Databases Store ALL Data Types
- The New World
- Billions of objects
- Big objects (1MB)
- Objects have behavior (methods)
- The Old World
- Millions of objects
- 100-byte objects
[Figure: two People tables.
Old world: People(Name, Address) with rows David/NY, Mike/Berk, Won/Austin.
New world: People(Name, Voice, Address, Papers, Picture) -- the same rows plus multimedia columns.
Captions: paperless office, Library of Congress online, all information online (entertainment, publishing, business), Information Network, Knowledge Navigator, Information at your fingertips.]
5 Moore's Law
- XXX doubles every 18 months = 60% increase per year
- Microprocessor speeds
- Chip density
- Magnetic disk density
- Communications bandwidth (WAN bandwidth approaching LANs)
6 Magnetic Storage Cheaper than Paper
- File cabinet: cabinet (4 drawer) $250, paper (24,000 sheets) $250, space (2' x 3' @ $10/ft2) $180; total $700 = 3 cents/sheet
- Disk: disk (4 GB) $500; ASCII: 2 M pages (100x cheaper) = 0.025 cents/sheet
- Image: 200 K pages (10x cheaper) = 0.25 cents/sheet
- So: store everything on disk
7 Disk & Tape Storage
- $/byte got 10^4 better
- $/access got 10^3 better
- Capacity grew 10^3
- Latency down 10x, bandwidth up 10x
[Figure: log-scale chart, bytes per dollar (1e3 to 1e8) vs access rate, for RAM (accesses/sec), disk (accesses/min), and tape (accesses/hr).]
8 What's a Terabyte? ($250 K of disk @ $0.25/MB)
1 Terabyte is:
- 1,000,000,000 business letters (150 miles of bookshelf)
- 100,000,000 book pages (15 miles of bookshelf)
- 50,000,000 FAX images (7 miles of bookshelf)
- 10,000,000 TV pictures, MPEG (10 days of video)
- 4,000 LandSat images
The Library of Congress (in ASCII) is 25 TB.
In 1980: $200 M of disk (10,000 disks), or a $5 M tape silo (10,000 tapes).
In 1995: $250 K of magnetic disk (70 disks), $500 K of optical disk robot (250 platters), or a $50 K tape silo (50 tapes).
Terror Byte!!
9 Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Figure: two log-log charts over access times from 10^-9 to 10^3 seconds.
Price vs speed: $/MB falls as access time grows, from cache and main memory through secondary storage (disk) and online tape to nearline and offline tape.
Size vs speed: capacity in bytes grows in the same order, cache smallest, offline tape largest.]
10 Storage Latency: How Far Away is the Data?
11 Thesis: Many Little will Win over Few Big
[Figure: price ($1 M, $100 K, $10 K) vs form factor -- mainframe, mini, micro, nano -- and disk diameters 14", 9", 5.25", 3.5", 2.5", 1.8".]
12 Implications of Hardware Trends
- Large disk farms will be inexpensive ($10 K/TB)
- Large RAM databases will be inexpensive ($1 K/GB)
- Processors will be inexpensive
- So the building block will be a processor with large RAM, lots of disk, and lots of network bandwidth: a CyberBrick
13 Implication of Hardware Trends: Clusters
[Figure: a building-block node with one CPU, 5 GB RAM, 50 GB disk.]
Future servers are CLUSTERS of processors and disks.
Distributed database techniques make clusters work.
14 Scaleables: Uneconomic So Far
- A slice = a processor, memory, and a few disks.
- The slice price of scaleables carries a 5x to 10x markup:
  - Teradata: $70 K for an Intel P5, 32 MB, 4 disks
  - Tandem: $100 K for a MipsCo R4000, 64 MB, 4 disks
  - Intel: $75 K for an i860, 32 MB, 2 disks
  - IBM/SP2: $100 K for an RS/6000, 64 MB, 8 disks
- A Compaq slice price is less than $10 K.
- What is the problem?
  - Proprietary interconnect
  - Proprietary packaging
  - Proprietary software (vendorIX)
15 Summary
- Tech trends => pipeline & partition parallelism
  - Lots of bytes & bandwidth per dollar
  - Lots of latency
  - Lots of MIPS per dollar
  - Lots of processors
- Putting it together: scaleable networks and platforms
  - Build clusters of commodity processors & storage
  - Commodity interconnect is key (the S of PMS)
    - Traditional interconnects give $100 K/slice
  - Commodity cluster operating system is key
  - Fault isolation and tolerance is key
  - Automatic parallel programming is key
16 The Hardware is in Place... and Then a Miracle Occurs
=> Enables parallel applications
17 The Software Challenge
- Automatic data placement (partitioning: random or organized)
- Automatic parallel programming (process placement)
- Parallel concepts, algorithms & tools
- Parallel query optimization
- Execution techniques: load balance, checkpoint/restart, pacing, multi-programming
18 Parallelism: Performance is the Goal
The goal is to get 'good' performance.
- Law 1: a parallel system should be faster than a serial system.
- Law 2: a parallel system should give near-linear scaleup or near-linear speedup or both.
19 Kinds of Parallel Execution
- Pipeline: any sequential program can be a stage; each stage's output stream feeds the next.
- Partition: outputs split N ways, inputs merge M ways; many copies of a sequential program each work on one partition.
[Figure: a pipeline of sequential programs, and a partitioned set of program copies with split/merge streams.]
20 The Perils of Parallelism
- Startup: creating processes, opening files, optimization
- Interference: device (CPU, disk, bus) and logical (lock, hotspot, server, log, ...)
- Skew: if tasks get very small, variance > service time
21 Why Parallel Access To Data?
Bandwidth: at 10 MB/s it takes 1.2 days to scan a terabyte; a 1,000x parallel scan takes about 1.5 minutes.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
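The slide's numbers follow from simple bandwidth arithmetic; a quick check of that arithmetic (using decimal TB = 10^12 bytes, as the slide's round numbers suggest):

```python
# Scan-time arithmetic behind the slide: 1 TB at 10 MB/s,
# serially vs with 1,000-way parallelism.
TB = 10**12          # bytes (decimal terabyte)
MBps = 10**6         # bytes per second

serial_seconds = TB / (10 * MBps)          # one 10 MB/s stream
parallel_seconds = serial_seconds / 1000   # 1,000 streams in parallel

serial_days = serial_seconds / 86400       # about 1.2 days
parallel_minutes = parallel_seconds / 60   # about 1.7 minutes
```

The parallel figure comes out at ~1.7 minutes; the slide rounds to 1.5.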
22 Data Flow Programming: Prefetch & Postwrite Hide Latency
- Can't wait for the data to arrive (2,000 years!)
- Need memory that gets the data in advance (~100 MB/s)
- Solution:
  - Pipeline from storage (tape, disk, ...) to CPU cache
  - Pipeline results to destination
23 Parallelism: Speedup & Scaleup
- Speedup: same job (100 GB), more hardware => less time.
- Scaleup: bigger job (100 GB -> 1 TB), more hardware => same time.
- Transaction scaleup: more clients and servers (1 K -> 10 K clients, 100 GB -> 1 TB) => same response time.
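As a minimal sketch of the two metrics (definitions only; the function names are illustrative, not from the talk):

```python
def speedup(serial_time, parallel_time):
    """Same job, more hardware: ratio of old time to new time.
    Linear speedup means N-times the hardware gives speedup N."""
    return serial_time / parallel_time

def scaleup(small_system_time, big_system_time):
    """N-times bigger job on N-times bigger hardware.
    Linear scaleup means the ratio stays at 1.0 (same elapsed time)."""
    return small_system_time / big_system_time

# Ideal linear behavior: 10x hardware cuts a 1000 s job to 100 s,
# and a 10x job on 10x hardware still takes 100 s.
ideal_speedup = speedup(1000, 100)
ideal_scaleup = scaleup(100, 100)
```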
24 Benchmark Buyer's Guide
Things to ask:
- When does it stop scaling?
- Throughput numbers, not ratios.
- Standard benchmarks allow comparison to others and comparison to sequential.
Ratios and non-standard benchmarks are red flags.
25 Why are Relational Operators So Successful for Parallelism?
- The relational data model gives uniform operators on uniform data streams, closed under composition.
- Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data.
- Sequential data in and out: pure dataflow.
- Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation.
=> AUTOMATIC PARALLELISM
26 Database Systems Hide Parallelism
- Automate system management via tools
- data placement
- data organization (indexing)
- periodic tasks (dump / recover / reorganize)
- Automatic fault tolerance
- duplex failover
- transactions
- Automatic parallelism
- among transactions (locking)
- within a transaction (parallel execution)
27 Automatic Parallel OR DB
Select image
from landsat
where date between 1970 and 1990
  and overlaps(location, Rockies)
  and snow_cover(image) > .7
Assign one process per processor/disk: find images with the right date and location; analyze each image and, if it is more than 70% snow, return it.
[Figure: operator tree -- the Landsat table (date, location, image) feeds temporal, spatial, and image predicates in turn, producing the answer; sample rows span dates 1/2/72 to 4/8/95 and locations 33N 120W to 34N 120W.]
28 Automatic Data Partitioning
Split a SQL table across a subset of the nodes & disks. Partition within the set by:
- Range: good for equi-joins, range queries, group-by
- Hash: good for equi-joins
- Round robin: good to spread load
Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning.
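The three placement rules can be sketched as routing functions that map a record to a node (a minimal illustration, not any product's actual scheme):

```python
def range_partition(key, split_points):
    """Range: node i holds keys below split_points[i].
    Good for equi-joins, range queries, and group-by."""
    for node, upper in enumerate(split_points):
        if key < upper:
            return node
    return len(split_points)          # last node takes the tail

def hash_partition(key, n_nodes):
    """Hash: good for equi-joins; spreads keys without ordering."""
    return hash(key) % n_nodes

def round_robin_partition(record_number, n_nodes):
    """Round robin: good to spread load; no help locating a key."""
    return record_number % n_nodes
```

A range query touches only the nodes whose ranges overlap it, while hash and round robin must ask every node.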
29 Index Partitioning
- Hash indices partition by hash.
- B-tree indices partition as a forest of trees, one tree per range.
- The primary index clusters the data.
30 Secondary Index Partitioning
In shared-nothing, secondary indices are problematic.
- Partition by base-table key ranges: inserts are completely local (but what about uniqueness?); a lookup examines ALL trees (see figure); a unique index forces a lookup on every insert.
- Partition by secondary key ranges: an insert touches two nodes (base and index); a lookup touches two nodes (index -> base); uniqueness is easy. This is the Teradata solution.
[Figure: a base table with secondary-index trees spread across the nodes.]
31 Data Rivers: Split & Merge Streams
Producers add records to the river; consumers take records from the river. Both ends do purely sequential programming. The river does flow control and buffering, and does the partitioning and merging of data records.
The river is the Split/Merge of Gamma and the Exchange operator of Volcano.
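A toy single-process sketch of the river idea (the `River` class and its split function are illustrative inventions, not Gamma's or Volcano's API): producers call `put`, the river routes each record to one consumer's queue, and each consumer reads its own queue sequentially.

```python
from queue import Queue

class River:
    """Minimal data-river sketch: a split function routes each record
    to one of N consumer queues; the queues provide the buffering
    (bounded queues would also give flow control via maxsize)."""
    def __init__(self, n_consumers, split):
        self.queues = [Queue() for _ in range(n_consumers)]
        self.split = split
    def put(self, record):
        # Producer side: purely sequential -- just emit records.
        self.queues[self.split(record) % len(self.queues)].put(record)
    def get(self, consumer):
        # Consumer side: purely sequential -- just read your queue.
        return self.queues[consumer].get()

river = River(n_consumers=2, split=lambda rec: rec[0])  # split on integer key
river.put((0, "a"))
river.put((1, "b"))
river.put((2, "c"))
```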
32 Partitioned Execution
Spreads computation and IO among the processors. Partitioned data gives NATURAL parallelism.
33 N x M Way Parallelism
N inputs, M outputs, no bottlenecks. Partitioned data; partitioned and pipelined data flows.
34 Picking Data Ranges
Disk partitioning:
- For range partitioning, sample the load on the disks; cool hot disks by making their ranges smaller.
- For hash partitioning, cool hot disks by mapping some buckets to others.
River partitioning:
- Use hashing and assume uniform.
- If range partitioning, sample the data and use a histogram to level the bulk.
Teradata, Tandem, and Oracle use these tricks.
35 Blocking Operators = Short Pipelines
An operator is blocking if it produces no output until it has consumed all its input. Examples: sort, aggregates, hash-join (reads all of one operand).
Blocking operators kill pipeline parallelism and make partition parallelism all the more important.
[Figure: a database load template with three blocked phases -- scan the tape/file into a SQL table; sort runs then merge runs; insert into the table and into each index (Index 1, Index 2, Index 3).]
36 Simple Aggregates (sort or hash?)
- Simple aggregates (count, min, max, ...) can use indices: more compact, and the index sometimes already holds the aggregate info.
- GROUP BY aggregates: scan in category order if possible (use indices). Else, if the categories fit in RAM, use a RAM category hash table. Else, make a temp of <category, item> pairs, sort by category, and do the math in the merge step.
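The RAM-hash-table case can be sketched in a few lines (a generic one-pass GROUP BY, assuming the distinct categories fit in memory; names are illustrative):

```python
def group_by_hash(rows, key, agg_init, agg_step):
    """One-pass GROUP BY using an in-RAM hash table keyed by category.
    Works only when the set of distinct categories fits in memory."""
    groups = {}
    for row in rows:
        k = key(row)
        groups[k] = agg_step(groups.get(k, agg_init), row)
    return groups

# SUM(amount) GROUP BY region over a small sales table.
sales = [("east", 10), ("west", 5), ("east", 7)]
totals = group_by_hash(sales, key=lambda r: r[0],
                       agg_init=0, agg_step=lambda acc, r: acc + r[1])
```

When the table is too big, this degrades to the sort-based plan the slide describes: write <category, item> to a temp, sort, and aggregate in the merge.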
37 Parallel Aggregates
For each aggregate function we need a decomposition strategy:
- count(S) = Σ count(s(i)); ditto for sum()
- avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
- and so on...
For groups, sub-aggregate the groups close to the source and drop the sub-aggregates into a hash river.
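The decomposition can be checked on a toy example. Note that avg must be carried as a (sum, count) pair, because averaging per-partition averages is wrong when partitions differ in size:

```python
# Each inner list stands for one partition's rows s(i).
partitions = [[1, 2, 3], [4, 5], [6]]

# Phase 1: sub-aggregate each partition near the source.
sub = [(sum(p), len(p)) for p in partitions]   # (sum(s(i)), count(s(i)))

# Phase 2: combine the sub-aggregates.
total_sum = sum(s for s, _ in sub)             # Σ sum(s(i))
total_count = sum(c for _, c in sub)           # Σ count(s(i))
avg = total_sum / total_count                  # avg(S)
```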
38 Parallel Sort
M inputs, N outputs. Disk and merge are not needed if the sort fits in memory. Scales linearly.
Sort is the benchmark from hell for shared-nothing machines: net traffic = disk bandwidth, with no data filtering at the source.
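One common M-input, N-output scheme is a sample sort, sketched here in a single process (an illustrative simplification: real systems run the partitions on separate nodes and stream through a river):

```python
def parallel_sort(data, n_nodes):
    """Sample-sort sketch: pick range split points from a sample,
    range-partition the records, sort each partition locally, and
    concatenate -- no merge step is needed because the partitions
    cover disjoint key ranges."""
    step = max(1, len(data) // (10 * n_nodes))
    sample = sorted(data[::step])                       # sample the data
    splits = [sample[i * len(sample) // n_nodes]        # n_nodes - 1 cut points
              for i in range(1, n_nodes)]
    parts = [[] for _ in range(n_nodes)]
    for x in data:
        parts[sum(x >= s for s in splits)].append(x)    # route to its range
    return [y for part in parts for y in sorted(part)]  # local sorts, concat
```

The sampling step is exactly the histogram trick from the "Picking Data Ranges" slide: it levels the bulk so no node gets a hot range.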
39 Hash Join: Combining Two Tables
- Hash the smaller table into N buckets (hope N=1).
- If N=1: read the larger table and hash-probe against the smaller.
- Else: hash the outer to disk, then do a bucket-by-bucket hash join.
Purely sequential data behavior. Always beats sort-merge and nested loops unless the data is clustered. Good for equi-, outer, and exclusion joins. Lots of papers; products just appearing (what went wrong?). Hashing reduces skew.
[Figure: left and right tables hashed into buckets.]
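The N=1 case (build side fits in memory) can be sketched as a classic build/probe loop (a generic illustration; table and function names are invented):

```python
def hash_join(probe_side, build_side, probe_key, build_key):
    """In-memory equi-join: build a hash table on the smaller
    (build) table, then stream the larger (probe) table past it.
    Each input is read exactly once, sequentially."""
    buckets = {}
    for row in build_side:                        # build phase
        buckets.setdefault(build_key(row), []).append(row)
    out = []
    for row in probe_side:                        # probe phase
        for match in buckets.get(probe_key(row), []):
            out.append((row, match))
    return out

orders = [(1, "ok"), (2, "late")]
customers = [(1, "Ann"), (3, "Bob")]
joined = hash_join(orders, customers, lambda o: o[0], lambda c: c[0])
```

The one-pass sequential behavior on both inputs is what the slide means by "purely sequential data behavior", and it is why the operator partitions so well (next slide).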
40 Parallel Hash Join
- ICL implemented hash join with bitmaps in the CAFS machine (1976)!
- Kitsuregawa pointed out the parallelism benefits of hash join in the early 1980s (it partitions beautifully). We ignored them! (Why?)
- But now everybody's doing it (or promises to do it).
- Hashing minimizes skew and requires little thinking for redistribution.
- Hashing uses massive main memory.
41 Exotic Joins (Exclusion, Cartesian, Outer, ...)
- Exclusion is used for NOT IN and DIFFERENCE queries.
- Outer is a lossless join: a left or right row appears with a null sibling if no matching sibling is found.
- Cartesian is often a mistake (a missing WHERE clause), but it is also important for small-table large-table optimization (also called a STAR schema).
Each small table is used as a pick list: it represents a mapping (name -> code) or set membership (e.g. holidays). Best plan: restrict the small tables, form the Cartesian product of all the small tables, and send it to each partition of the large table for a hash join.
42 What Systems Work This Way
Shared nothing:
- Teradata: 400 nodes (80x12 nodes)
- Tandem: 110 nodes
- IBM SP2 / DB2: 128 nodes
- ATT & Sybase: 8x14 nodes
- Informix/SP2: 100 nodes
Shared disk:
- Oracle: 170 nodes
- Rdb: 24 nodes
Shared memory:
- Informix: 9 nodes
- RedBrick: ? nodes
43 Outline
- Why Parallelism
- technology push
- application pull
- Parallel Database Techniques
- partitioned data
- partitioned and pipelined execution
- parallel relational operators
44 Main Message
- Technology trends give
- many processors and storage units
- inexpensively
- To analyze large quantities of data
- sequential (regular) access patterns are 100x faster
- parallelism is 1000x faster (trades time for money)
- Relational systems show many parallel algorithms.