Title: Parallel Database Systems: A SNAP Application
1. Parallel Database Systems: A SNAP Application
Jim Gray, 310 Filbert, SF CA 94133, Gray@Microsoft.com
Gordon Bell, 450 Old Oak Court, Los Altos, CA 94022, GBell@Microsoft.com
2. Outline
- Cyberspace Pep Talk
- Databases are the dirt of Cyberspace
- Billions of clients mean millions of servers
- Parallel Imperative
- Hardware trend: many little devices
- Consequence: servers are arrays of commodity components; PCs are the bricks of Cyberspace
- Must automate parallel design / operation / use
- Software parallelism via dataflow & data partitioning
- Parallel database techniques
- Parallel execution of many little jobs (OLTP)
- Data Partitioning
- Pipeline Execution
- Automation techniques
- Summary
3. Kinds Of Information Processing

                 Point-to-Point         Broadcast
  Immediate      conversation, money    lecture, concert     (Network)
  Time-Shifted   mail                   book, newspaper      (Database)

It's ALL going electronic. Immediate is being stored for analysis (so it's ALL database). Analysis and automatic processing are being added.
4. Why Put Everything in Cyberspace?
Low rent: minimum $/byte. Shrinks time: now or later. Shrinks space: here or there. Automate processing: knowbots.
[Diagram: Point-to-Point OR Broadcast traffic flows through the Network (Immediate OR Time Delayed) into the Database, which can Locate, Process, Analyze, and Summarize.]
5. Databases: Information At Your Fingertips
Information Network, Knowledge Navigator
- All information will be in an online database (somewhere)
- You might record everything you
- read: 10 MB/day, 400 GB/lifetime (two tapes)
- hear: 400 MB/day, 16 TB/lifetime (a tape per decade)
- see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (maybe someday)
- Data storage, organization, and analysis is a challenge.
- That is what databases are about
- DBs do a good job on records
- Now working on text, spatial, image, and sound.
- This talk is about automatic parallel search (the outer loop)
- Techniques work for ALL kinds of data
6. Databases Store ALL Data Types
- The Old World
- Millions of objects
- 100-byte objects
- The New World
- Billions of objects
- Big objects (1MB)
- Objects have behavior (methods)
[Figure: the Old World People table has just Name and Address columns (David | NY, Mike | Berk, Won | Austin); the New World People table adds Papers, Picture, and Voice columns.]
Paperless office, Library of Congress online, all information online: entertainment, publishing, business. Information Network, Knowledge Navigator, Information at your fingertips.
7. Magnetic Storage Cheaper than Paper
- File Cabinet: cabinet (4 drawer) $250, paper (24,000 sheets) $250, space (2x3 @ $10/ft2) $180; total $700, about 3¢/sheet
- Disk: disk (8 GB) $4,000; as ASCII: 4 million pages, 0.1¢/sheet (30x cheaper)
- Image: 200 k pages, 2¢/sheet (similar to paper)
- Store everything on disk
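A quick back-of-the-envelope check of the per-sheet numbers above (a sketch in Python; the ~2 KB per ASCII page and ~40 KB per image page are assumptions inferred from the page counts):

    # Paper: the $700 cabinet holds 24,000 sheets
    paper_cents_per_sheet = 700 * 100 / 24_000        # about 2.9 cents/sheet
    # ASCII on disk: $4,000 for 8 GB; at ~2 KB/page that is ~4 million pages
    ascii_cents_per_page = 4_000 * 100 / 4_000_000    # 0.1 cents/page (about 30x cheaper)
    # Images on disk: at ~40 KB/page the same disk holds ~200,000 pages
    image_cents_per_page = 4_000 * 100 / 200_000      # 2 cents/page (similar to paper)
    print(paper_cents_per_sheet, ascii_cents_per_page, image_cents_per_page)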
8. Cyberspace Demographics
1950 National Computer, 1960 Corporate Computer, 1970 Site Computer, 1980 Departmental Computer, 1990 Personal Computer, 2000 ?
- Computer history: most computers are small; NEXT: 1 billion X, for some X (phone?)
- Most of the money is in clients and wiring: 1990 50% desktop, 1995 75% desktop
9. Billions of Clients
- Every device will be intelligent
- Doors, rooms, cars, ...
- Computing will be ubiquitous
10. Billions of Clients Need Millions of Servers
All clients are networked to servers; clients may be nomadic or on-demand. Fast clients want faster servers. Servers provide data, control, coordination, and communication.
[Figure: mobile and fixed clients connect to servers and super servers; super servers host large databases, high traffic, and shared data.]
11. If Hardware is Free, Where Will The Money Go?
- All clients and servers will be based on PC technology: economies of scale give the lowest price.
- Traditional budget: 40% vendor, 60% staff
- If hardware_price + software_price → 0, then what?
- Money will go to
- CONTENT (databases)
- NEW APPLICATIONS
- AUTOMATION (analogy to 1920 telephone operators)
- Systems programmer per MIPS
- DBA per 10 GB
12. The New Computer Industry
- Horizontal integration is the new structure
- Each layer picks the best from the lower layer.
- Desktop market
- 1991: 50%
- 1995: 75%
- Compaq is the biggest computer company
Function          Example
Operation         AT&T
Integration       EDS
Applications      SAP
Middleware        Oracle
Baseware          Microsoft
Systems           Compaq
Silicon & Oxide   Intel, Seagate
13. Constant Dollars vs Constant Work
- Constant Work
- One SuperServer can do all the world's computations.
- Constant Dollars
- The world spends 10% on information processing
- Computers are moving from 5% penetration to 50%
- $300B to $3T
- We have the patent on the byte and algorithm
14. The Seven Price Tiers
- $10: wrist watch computers
- $100: pocket/palm computers
- $1,000: portable computers
- $10,000: personal computers (desktop)
- $100,000: departmental computers (closet)
- $1,000,000: site computers (glass house)
- $10,000,000: regional computers (glass castle)
SuperServer: costs more than $100,000. Mainframe: costs more than $1M. Must be an array of processors, disks, tapes & comm ports.
15. Software Economics: Bill's Law
- Bill Joy's law (Sun): Don't write software for less than 100,000 platforms. @ $10M engineering expense, $1,000 price
- Bill Gates' law: Don't write software for less than 1,000,000 platforms. @ $10M engineering expense, $100 price
- Examples
- UNIX vs NT: $3,500 vs $500
- Oracle vs SQL-Server: $100,000 vs $6,000
- No spreadsheet or presentation pack on Unix/VMS/...
- Commoditization of base software & hardware
16. What comes next?
- MANY new clients
- Applications to enable clients & servers
- super-servers
17. Outline
- Cyberspace Pep Talk
- Databases are the dirt of Cyberspace
- Billions of clients mean millions of servers
- Parallel Imperative
- Hardware trend: many little devices
- Consequence: server arrays of commodity parts; PCs are the bricks of Cyberspace
- Must automate parallel design / operation / use
- Software parallelism via dataflow & data partitioning
- Parallel database techniques
- Parallel execution of many little jobs (OLTP)
- Data Partitioning
- Pipeline Execution
- Automation techniques
- Summary
18. Moore's Law Restated: Many Little Won over Few Big
Hardware trends: a few generic parts: CPU & RAM, disk & tape arrays, ATM for LAN/WAN, ?? for CAN, ?? for OS. These parts will be inexpensive (commodity components). Systems will be arrays of these parts. Software challenge: how to program arrays.
[Figure: price tiers ($1M, $100K, $10K) for mainframe, mini, micro, and nano computers; disk form factors shrinking from 9" to 5.25", 3.5", 2.5", and 1.8".]
19. Future SuperServer
An array of processors, disks, tapes & comm lines. Challenge: how to program it. Must use parallelism: pipeline to hide latency, partition for bandwidth and scaleup.
20. Great Debate: Shared What?
Shared Memory (SMP): easy to program, difficult to build, difficult to scaleup (Sequent, SGI, Sun)
Shared Disk: (VMScluster, Sysplex)
Shared Nothing (network): hard to program, easy to build, easy to scaleup (Tandem, Teradata, SP2)
The winner will be a synthesis of these ideas. Distributed shared memory (DASH, Encore) blurs the distinction between network and bus (locality is still important), but gives shared memory at message cost.
21. The Hardware is in Place... and Then a Miracle Occurs
SNAP: Scaleable Networks And Platforms. A commodity distributed OS built on commodity platforms and a commodity network interconnect.
22. Why Parallel Access To Data?
At 10 MB/s it takes 1.2 days to scan; with 1,000-way parallelism it is roughly a 1.3 minute scan. BANDWIDTH.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
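The arithmetic behind those numbers, as a sketch (the 1 TB table size is an assumption inferred from the 1.2-day figure):

    bytes_to_scan = 1e12                  # assume roughly a 1 TB table
    rate = 10e6                           # 10 MB/s per scan stream
    serial_days = bytes_to_scan / rate / 86_400               # ~1.2 days
    parallel_minutes = bytes_to_scan / (1_000 * rate) / 60    # ~1.7 minutes with 1,000 streams
    print(serial_days, parallel_minutes)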
23. DataFlow Programming: Prefetch & Postwrite Hide Latency
- Can't wait for the data to arrive
- Need a memory that gets the data in advance (100 MB/s)
- Solution
- Pipeline from the source (tape, disc, ram, ...) to the cpu cache
- Pipeline results to the destination
LATENCY
24. The New Law of Computing
Grosch's Law (performance grows as the square of the price) gives way to the Parallel Law: linear speedup and linear scaleup are needed. Not always possible.
25. Parallelism: Performance is the Goal
The goal is to get 'good' performance.
Law 1: a parallel system should be faster than a serial system.
Law 2: a parallel system should give near-linear scaleup or near-linear speedup or both.
Parallelism is faster, not cheaper: it trades money for time.
26. Parallelism: Speedup & Scaleup
Speedup: same job (100 GB), more hardware, less time.
Scaleup: bigger job (100 GB grows to 1 TB), more hardware, same time.
Transaction scaleup: more clients and servers (1 k clients on a 100 GB server grows to 10 k clients on a 1 TB server), same response time.
27. The Perils of Parallelism
[Figure: a bad speedup curve shows no parallelism benefit compared with linearity as processors & discs are added.]
Startup: creating processes, opening files, optimization.
Interference: device (cpu, disc, bus) and logical (lock, hotspot, server, log, ...).
Skew: if tasks get very small, variance > service time.
28. Outline
- Cyberspace Pep Talk
- Databases are the dirt of Cyberspace
- Billions of clients mean millions of servers
- Parallel Imperative
- Hardware trend: many little devices
- Consequence: server arrays of commodity parts; PCs are the bricks of Cyberspace
- Must automate parallel design / operation / use
- Software parallelism via dataflow & data partitioning
- Parallel database techniques
- Parallel execution of many little jobs (OLTP)
- Data Partitioning
- Pipeline Execution
- Automation techniques
- Summary
29. Kinds of Parallel Execution
Pipeline: one sequential program streams its output into the next (any sequential program feeding any sequential program).
Partition: outputs split N ways and inputs merge M ways, so many copies of a sequential program each work on their own partition.
30. Data Rivers: Split + Merge Streams
[Figure: N producers feed N x M data streams (the river) consumed by M consumers.]
Producers add records to the river; consumers take records from the river. Purely sequential programming. The river does flow control and buffering, and does the partition and merge of the data records. The river is the Exchange operator in Volcano.
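A minimal sketch of the river idea, assuming hash partitioning over in-memory queues (the names and queue sizes are illustrative, not Volcano's API):

    from queue import Queue

    M = 3                                               # number of consumers
    river = [Queue(maxsize=1024) for _ in range(M)]     # one buffered stream per consumer

    def produce(records):
        # A producer is ordinary sequential code; the river decides where each record goes.
        for key, value in records:
            river[hash(key) % M].put((key, value))      # split: partition by key

    def consume(i, operator):
        # A consumer is also sequential; it sees the merge of every producer's output for partition i.
        while True:
            record = river[i].get()
            operator(record)                            # feed the next operator in the pipeline

A real river also needs end-of-stream markers and flow control across processes and machines.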
31. Partitioned Data and Execution
Spreads computation and IO among processors. Partitioned data gives NATURAL execution parallelism.
32. Partitioned & Merge Pipeline Execution
Pure dataflow programming gives linear speedup & scaleup. But the top node may be a bottleneck. So...
33. N x M way Parallelism
N inputs, M outputs, no bottlenecks.
34. Why are Relational Operators Successful for Parallelism?
The relational data model: uniform operators on uniform data streams, closed under composition. Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data; sequential data in and out: pure dataflow. Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation. AUTOMATIC PARALLELISM.
35. SQL: a NonProcedural Programming Language
- SQL is a functional programming language: it describes the answer set.
- The optimizer picks the best execution plan
- Picks the data flow web (pipeline),
- the degree of parallelism (partitioning),
- other execution parameters (process placement, memory, ...)
[Figure: optimizer diagram with GUI, schema, plan, execution planning, monitor, and rivers.]
36. Database Systems Hide Parallelism
- Automate system management via tools
- data placement
- data organization (indexing)
- periodic tasks (dump / recover / reorganize)
- Automatic fault tolerance
- duplex failover
- transactions
- Automatic parallelism
- among transactions (locking)
- within a transaction (parallel execution)
37. Success Stories
- Online Transaction Processing
- many little jobs
- SQL systems support 3700 tps-A (24 cpu, 240 disk)
- SQL systems support 21,000 tpm-C (110 cpu, 800 disk)
- Batch (decision support and utility)
- few big jobs, parallelism inside
- Scan data at 100 MB/s
- Linear scaleup to 50 processors
[Charts: transactions/sec vs. hardware; records/sec vs. hardware.]
38. Kinds of Partitioned Data
Split a SQL table across a subset of the nodes & disks. Partition within the set by:
- Range: good for equijoins, range queries, and group-by
- Hash: good for equijoins
- Round Robin: good to spread load
Shared disk and shared memory are less sensitive to partitioning; shared nothing benefits from "good" partitioning.
39. Index Partitioning
Hash indices partition by hash. B-tree indices partition as a forest of trees, one tree per range. The primary index clusters the data.
40. Secondary Index Partitioning
In shared nothing, secondary indices are problematic.
Partition by base table key ranges: inserts are completely local (but what about uniqueness?); lookups examine ALL the trees (see figure); a unique index requires a lookup on every insert.
Partition by secondary key ranges: inserts touch two nodes (base and index); lookups touch two nodes (index → base); uniqueness is easy.
[Figure: one A..Z secondary index tree per base table partition, so a secondary-key lookup must probe every tree.]
Teradata solution: partition non-unique indices by base table, unique indices by secondary key.
41. Picking Data Ranges
Disk partitioning: for range partitioning, sample the load on the disks and cool hot disks by making their ranges smaller; for hash partitioning, cool hot disks by mapping some buckets to other disks.
River partitioning: use hashing and assume uniformity; if range partitioning, sample the data and use a histogram to level the bulk.
Teradata, Tandem, and Oracle use these tricks.
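A sketch of the histogram trick: sample the keys and take evenly spaced quantiles as the range break points (an illustration, not any vendor's actual algorithm):

    import random

    def pick_range_breaks(keys, n_partitions, sample_size=10_000):
        # Sort a sample and take its quantiles, so each range gets roughly the same bulk.
        sample = sorted(random.sample(keys, min(sample_size, len(keys))))
        step = len(sample) // n_partitions
        return [sample[i * step] for i in range(1, n_partitions)]

    # pick_range_breaks(all_keys, 4) -> three break points splitting the keys into four ~equal ranges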
42. Parallel Data Scan
select image
from landsat
where date between 1970 and 1990
  and overlaps(location, Rockies)
  and snow_cover(image) > .7
Assign one process per processor/disk: find the images with the right date & location, analyze the image, and if it is more than 70% snow, return it.
[Figure: the landsat table (date, loc, image) is spread across disks with temporal, spatial, and image access paths; each partition applies the date, location, and image tests and contributes rows to the answer.]
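A sketch of how such a scan runs in parallel: every partition's worker runs the same sequential filter, and the survivors merge into the answer. The overlaps and snow_cover functions and the ROCKIES region are stand-ins for the query's user-defined predicates:

    from multiprocessing import Pool

    ROCKIES = ...                          # placeholder for the spatial search region

    def overlaps(loc, region):             # stand-in for the spatial predicate
        ...

    def snow_cover(image):                 # stand-in for the image-analysis predicate
        ...

    def scan_partition(rows):
        # rows: the (year, loc, image) records stored on this worker's disks
        return [image for (year, loc, image) in rows
                if 1970 <= year <= 1990
                and overlaps(loc, ROCKIES)
                and snow_cover(image) > 0.7]

    # One process per processor/disk; the per-partition answers merge into one result:
    # with Pool(len(partitions)) as pool:
    #     answer = [img for part in pool.map(scan_partition, partitions) for img in part]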
43. Simple Aggregates (sort or hash?)
Simple aggregates (count, min, max, ...) can use indices: they are more compact and sometimes already hold the aggregate info.
GROUP BY aggregates: scan in category order if possible (use indices). Else, if the categories fit in RAM, use a RAM category hash table. Else make a temp file of <category, item> pairs, sort by category, and do the math in the merge step.
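A sketch of the RAM hash-table case, assuming the aggregate is a per-category sum:

    def group_by_sum(records):
        # records: (category, value) pairs; the distinct categories are assumed to fit in memory
        totals = {}
        for category, value in records:
            totals[category] = totals.get(category, 0) + value
        return totals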
44. Parallel Aggregates
For each aggregate function we need a decomposition strategy: count(S) = Σ count(s(i)), ditto for sum(); avg(S) = (Σ sum(s(i))) / (Σ count(s(i))); and so on.
For groups, sub-aggregate the groups close to the source and drop the sub-aggregates into a hash river.
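A sketch of the avg() decomposition: each partition computes a local (sum, count) near the data, and the combining step divides at the end (illustrative only):

    def local_partial(values):
        # Phase 1: runs at the source, once per partition
        return (sum(values), len(values))

    def global_avg(partials):
        # Phase 2: combine the sub-aggregates that arrive on the river
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        return total / count

    # global_avg([local_partial(p) for p in partitions]) equals the average over all partitions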
45. Sort
Used for loading and reorganization (sort makes them sequential), building B-trees, reports, and non-equi-joins. Rarely used for aggregates or equi-joins (if hashing is available). Should run at 10 MB/s or better: faster than a disk, so it needs striped scratch files. An in-memory sort is about 250 K records/s.
[Figure: input data is sorted into runs, which are merged into the sorted output.]
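A sketch of the run-then-merge pattern in the figure, with in-memory lists standing in for the striped scratch files:

    import heapq

    def external_sort(records, run_size=100_000):
        # Phase 1: produce sorted runs that each fit in memory
        runs = [sorted(records[i:i + run_size])
                for i in range(0, len(records), run_size)]
        # Phase 2: merge the runs into one sorted output stream
        return heapq.merge(*runs)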
46. Parallel Sort
M input, N output sort design. Disk and merge are not needed if the sort fits in memory. Scales linearly because ...
Sort is the benchmark from hell for shared nothing machines: net traffic equals the disk bandwidth, and there is no data filtering at the source.
47. Blocking Operators = Short Pipelines
An operator is blocking if it does not produce any output until it has consumed all of its input. Examples: sort, aggregates, hash-join (reads all of one operand). Blocking operators kill pipeline parallelism and make partition parallelism all the more important. The database load template has three blocked phases.
48. Nested Loops Join
If the inner table is indexed on the join columns (B-tree or hash), then sequentially scan the outer (from the start key); for each outer record, probe the inner table for matching records. Works best if the inner is in RAM (so a small inner). Works great if the inner is a B-tree or hash in RAM. Partitions well: replicate the inner at each outer partition (if the outer is partitioned on the join column, don't replicate the inner, partition it). Works for all joins (outer, non-equijoins, cartesian, exclusion, ...).
[Figure: the outer table is scanned; the inner table is probed for each outer record.]
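A sketch of the index nested-loops idea, with a Python dict standing in for the in-RAM hash index on the inner table:

    def nested_loops_join(outer, inner, key):
        # Build (or reuse) a hash index on the inner table's join column
        index = {}
        for row in inner:
            index.setdefault(row[key], []).append(row)
        # Scan the outer table; probe the inner index for each outer record
        for o in outer:
            for i in index.get(o[key], []):
                yield {**o, **i}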
49. Merge Join (and sort-merge join)
If the tables are sorted on the join columns (e.g. by a B-tree), then sequentially scan each (from the start key):
left < right: advance left
left = right: match
left > right: advance right
A nice sequential scan of the data (disk speed); the NxM duplicate case is a cartesian product and may cause a backwards rescan. A sort-merge join sorts before doing the merge.
Partitions well: partition the smaller table to match the larger's partitioning. Works for all joins (outer, non-equijoins, cartesian, exclusion, ...).
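A sketch of the merge step, assuming both inputs are already sorted on the join key and the keys are unique (the NxM duplicate case needs the backwards rescan noted above):

    def merge_join(left, right, key):
        # left and right: lists of dicts, each sorted on `key`
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i][key] < right[j][key]:
                i += 1                                  # left < right: advance left
            elif left[i][key] > right[j][key]:
                j += 1                                  # left > right: advance right
            else:
                yield {**left[i], **right[j]}           # left = right: match
                i += 1
                j += 1                                  # unique keys, so advance both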
50. Hash Join
Hash the smaller table into N buckets (hope N=1). If N=1, read the larger table and hash-probe into the smaller. Else, hash the outer to disk and then do a bucket-by-bucket hash join. Purely sequential data behavior. Always beats sort-merge and nested loops unless the data is already clustered. Good for equi, outer, and exclusion joins. Lots of papers; products are just appearing (what went wrong?). Hashing reduces skew.
[Figure: one table is hashed into buckets that the other table probes.]
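A sketch of the bucket-by-bucket variant: hash both inputs into buckets, then build on the smaller bucket and probe with the larger (in memory here; real systems spill the buckets to disk):

    def hash_join(small, large, key, n_buckets=8):
        def split(rows):
            parts = [[] for _ in range(n_buckets)]
            for row in rows:
                parts[hash(row[key]) % n_buckets].append(row)
            return parts

        small_parts, large_parts = split(small), split(large)
        # Join one bucket pair at a time: build a hash table on the small side, probe with the large side
        for s_rows, l_rows in zip(small_parts, large_parts):
            table = {}
            for row in s_rows:
                table.setdefault(row[key], []).append(row)
            for row in l_rows:
                for match in table.get(row[key], []):
                    yield {**match, **row}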
51. Observation: Execution is easy, Automation is hard
It is easy to build a fast parallel execution environment (no one has done it, but it is just programming). It is hard to write a robust and world-class query optimizer. There are many tricks, and one quickly hits the complexity barrier.
Common approach: pick the best sequential plan; pick the degree of parallelism based on bottleneck analysis; bind operators to processes; place processes at nodes; place scratch files near the processes; use memory as a constraint.
52. Systems That Work This Way
Shared Nothing: Teradata (400 nodes), Tandem (110 nodes), IBM SP2 / DB2 (48 nodes), AT&T & Sybase (112 nodes), Informix/SP2 (48 nodes)
Shared Disk: Oracle (170 nodes), Rdb (24 nodes)
Shared Memory: Informix (9 nodes), RedBrick (? nodes)
53. Research Problems
Automatic data placement (partitioning: random or organized). Automatic parallel programming (process placement). Parallel concepts, algorithms & tools. Parallel query optimization. Execution techniques: load balancing, checkpoint/restart, pacing, ...
54. Summary
- Cyberspace is growing
- Databases are the dirt of cyberspace; PCs are the bricks; networks are the mortar
- Many little devices; performance via arrays of cpu, disk & tape
- Then a miracle occurs: a scaleable distributed OS and net
- SNAP: Scaleable Networks And Platforms
- Then parallel database systems give software parallelism
- OLTP: lots of little jobs run in parallel
- Batch TP: data flow & data partitioning
- Automate processor & storage array administration
- Automate processor & storage array programming
- 2000 platforms as easy as 1 platform.