Parallel Database Systems 101
Jim Gray & Gordon Bell, Microsoft Corporation
Presented at VLDB 95, Zurich, Switzerland, Sept 1995
- Detailed notes available from Gray_at_Microsoft.com
- This presentation is 120 of the 174 slides (time limit)
- Notes in PowerPoint 7 and Word 7
Jim Gray Gordon Bell VLDB 95 Parallel
Database Systems Survey
Slide 2: Outline

- Why Parallelism
  - technology push
  - application pull
- Benchmark Buyer's Guide
  - metrics
  - simple tests
- Parallel Database Techniques
  - partitioned data
  - partitioned and pipelined execution
  - parallel relational operators
- Parallel Database Systems
  - Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick
Slide 3: Kinds Of Information Processing

[Figure: a 2x2 matrix of point-to-point vs. broadcast against immediate vs. time-shifted. Immediate point-to-point: conversation, money. Immediate broadcast: lecture, concert. Time-shifted point-to-point: mail. Time-shifted broadcast: book, newspaper. Immediate means network; time-shifted means database.]

It's ALL going electronic. Immediate traffic is being stored for analysis (so ALL of it ends up in a database). Analysis and automatic processing are being added.
Slide 4: Why Put Everything in Cyberspace?

- Low rent: minimum $/byte
- Shrinks time: now or later
- Shrinks space: here or there
- Automate processing: knowbots

(Network: point-to-point OR broadcast. Database: immediate OR time-delayed. Locate, process, analyze, summarize.)
Slide 5: Databases: Information At Your Fingertips, Information Network, Knowledge Navigator

- All information will be in an online database (somewhere)
- You might record everything you
  - read: 10 MB/day, 400 GB/lifetime (two tapes)
  - hear: 400 MB/day, 16 TB/lifetime (a tape per decade)
  - see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (maybe someday)
- Data storage, organization, and analysis is a challenge.
- That is what databases are about
  - DBs do a good job on records
  - Now working on text, spatial, image, and sound.
Slide 6: Databases Store ALL Data Types

- The Old World
  - Millions of objects
  - 100-byte objects
  - e.g., People(Name, Address): David, NY; Mike, Berk; Won, Austin
- The New World
  - Billions of objects
  - Big objects (1 MB)
  - Objects have behavior (methods)
  - e.g., People(Name, Voice, Address, Papers, Picture)

Paperless office, Library of Congress online, all information online: entertainment, publishing, business. Information Network, Knowledge Navigator, Information At Your Fingertips.
Slide 7: Magnetic Storage Cheaper than Paper

- File cabinet: cabinet (4 drawer) $250; paper (24,000 sheets) $250; space (2x3 ft @ $10/ft2) $180; total ~$700, or 3 cents/sheet
- Disk: disk (8 GB) $2,000 holds 4M ASCII pages: 0.05 cents/sheet (60x cheaper)
- Image: 200K pages: 1 cent/sheet (3x cheaper than paper)
- Store everything on disk
Slide 8: Billions of Clients
- Every device will be intelligent
- Doors, rooms, cars, ...
- Computing will be ubiquitous
Slide 9: Billions of Clients Need Millions of Servers

All clients (mobile and fixed) are networked to servers; clients may be nomadic or on-demand. Fast clients want faster servers. Servers provide data, control, coordination, and communication. Super servers: large databases, high traffic, shared data.
Slide 10: Moore's Law

- XXX doubles every 18 months: 60% increase per year
  - Microprocessor speeds
  - Chip density
  - Magnetic disk density
  - Communications bandwidth (WAN bandwidth approaching LANs)
- Exponential growth:
  - The past does not matter
  - 10x here, 10x there, soon you're talking REAL change.
- PC costs decline faster than any other platform
  - Volume & learning curves
  - PCs will be the building bricks of all future systems

[Chart, 1970-2000: one-chip memory size by DRAM generation, 1 Kbit through 256 Mbit; system memory from 8 KB to 1 GB (2 MB to 32 MB in this decade).]
Slide 11: Moore's Law for Memory

[Chart, 1970-2000: memory capacity and number of chips per system, by DRAM generation (1 Kbit through 256 Mbit), with memory price at $50/chip. Capacities run from 1/8th of a chip (8 KB) up to 8 GB with 64 Mb DRAMs; the 640K DOS limit is marked.]
Slide 12: MicroProcessor Speeds Went Up

- Clock rates went from 10 kHz to 300 MHz
- Processors are now 4x issue
- SPECInt92 fits in cache, so it tracks CPU speed
- Peak Advertised Performance (PAP) is 1.2 BIPS
- Real Application Performance (RAP) is 60 MIPS
- Similar curves for
- DEC VAX Alpha
- HP/PA
- IBM R6000/ PowerPC
- MIPS SGI
- SUN
Slide 13: System SPECint vs. Price

- PCs
  - Good performance
  - Best price/performance
  - $30/SPECint!
- Proprietary UNIX
  - Poor price/performance
  - HP 9000, IBM SP/2 are above $1K/SPECint!
- Use PCs for CyberBricks

[Chart: SPECint vs. price (K$) for 486@66 PCs, Pentium, Compaq, NCR 3525/3555/3600 AP, SUN 1000/2000 (to 16 CPU), Tricord ES 5K, SGI L/XL, HP 9000.]
Slide 14: In The Limit: The Pico Processor

1M SPECmarks, 1 TFLOP. 10^6 clocks to bulk RAM. Event-horizon on chip. VM reincarnated. Multi-program cache, on-chip SMP. Terror Bytes!
Slide 15: Disc & Tape Storage

- $/byte got 10^4 better
- $/access got 10^3 better
- Capacity grew 10^3
- Latency down 10x
- Bandwidth up 10x

[Chart, 1e0 to 1e8: bytes/$ and accesses for RAM (accesses/s), disk (accesses/min), and tape (accesses/hr).]
Slide 16: Disc Trends

Discs are getting smaller (1 GB/unit). Discs are getting standard (SCSI). Discs are getting faster: 1 MB/s -> 10 MB/s, 25 IO/s -> 75 IO/s.
Slide 17: Disc Trends

The 100 GB disc card: an array of discs (14" form factor). Can be used as 100 discs, 1 striped disc, 10 fault-tolerant discs, ...etc. LOTS of accesses/second and bandwidth.
Slide 18: Tape Trends

1950-1980  Reel   160 MB   1.0 MB/s   $30K/drive
1980-1993  3480   300 MB   3.0 MB/s   $30K/drive
1985-1993  8MM    4 GB     0.4 MB/s   $3K/drive
1993-1994  4MM    1 GB     0.2 MB/s   $300/drive
1993-      DLT    20 GB    2.5 MB/s   $3K/drive

Mainframe silos: $250K and up, for thousands of tapes. 8MM silos: $5K and up, for tens of tapes.
Slide 19: Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs

[Two charts over access times of 1e-9 to 1e3 seconds: price ($/MB) vs. speed, and size (bytes) vs. speed, for cache, main memory, secondary (disc), online tape, nearline tape, and offline tape.]
Slide 20: Tape & Optical: Beware of the Media Myth

Optical is cheap: $200/platter, 2 GB/platter, so $100/GB (5x cheaper than disc). Tape is cheap: $30/tape, 20 GB/tape, so $1.50/GB (700x cheaper than disc).
Slide 21: Tape & Optical Reality: Media is 10% of System Cost

Tape needs a robot ($10K ... $3M) for 10 ... 1000 tapes (at 20 GB each): $20/GB ... $200/GB (5x...50x cheaper than disc). Optical needs a robot ($100K) for 100 platters = 200 GB (TODAY): $550/GB (same price as disc). Robots have poor access times. Not good for Library of Congress (25 TB). Data motel: data checks in but it never checks out!
Slide 22: The Access Time Myth

- Myth: seek or pick time dominates
- Reality: (1) queueing dominates; (2) transfer dominates BLOBs; (3) disk seeks are often short
- Implication: many cheap servers beat one fast expensive server
  - shorter queues
  - parallel transfer
  - lower cost/access and cost/byte
- This is now obvious for disk arrays
- This will be obvious for tape arrays

[Figure: access time broken into wait, transfer, rotate, and seek components.]
Slide 23: What's a Terabyte? ($250K of Disk @ $0.25/MB)

1 Terabyte:
- 1,000,000,000 business letters: 150 miles of bookshelf
- 100,000,000 book pages: 15 miles of bookshelf
- 50,000,000 FAX images: 7 miles of bookshelf
- 10,000,000 TV pictures (mpeg): 10 days of video
- 4,000 LandSat images
Library of Congress (in ASCII) is 25 TB.

1980: $200M of disc (10,000 discs); $5M of tape silo (10,000 tapes)
1995: $250K of magnetic disc (70 discs); $500K of optical disc robot (250 platters); $50K of tape silo (50 tapes)

Terror Byte!!
Slide 24: Standard Storage Metrics

- Capacity
  - RAM: MB and $/MB; today at 10 MB and $100/MB
  - Disk: GB and $/GB; today at 5 GB and $500/GB
  - Tape: TB and $/TB; today at 0.1 TB and $50K/TB (nearline)
- Access time (latency)
  - RAM: 100 ns
  - Disk: 10 ms
  - Tape: 30 second pick, 30 second position
- Transfer rate
  - RAM: 1 GB/s
  - Disk: 5 MB/s (arrays can go to 1 GB/s)
  - Tape: 5 MB/s (arrays can go to 100 MB/s)
Slide 25: New Storage Metrics: KOXs, MOXs, GOXs, SCANs?
- KOX How many kilobyte objects served per second
- the file server, transaction processing metric
- MOX How many megabyte objects served per second
- the Mosaic metric
- GOX How many gigabyte objects served per hour
- the video EOSDIS metric
- SCANS How many scans of all the data per day
- the data mining and utility metric
Slide 26: Tertiary Storage: Tape Farms

A $10K tape robot holds 10 tapes (200 GB) at 6 MB/s: $50/GB, 30 MOX, 15 GOX, scan in 10 hours (about 2.5 scans/day). 100 such robots: $1M, 20 TB, $50/GB, 3K MOX, 1.5K GOX, 2.5 scans/day. Many independent tape robots (like a disc farm).
Slide 27: NearLine Storage: Disc Array and Tape Farms Win

[Chart comparing disc farm, optical robot, and tape farm on TB/M$, MOX, GOX, and SCANS/day.]
Slide 28: Nearline Storage Metrics: Disk and Tape Farms Win

Data Motel: data checks in, but it never checks out.

[Chart, log scale 0.01 to 1,000,000: GB/K$, KOX, MOX, GOX, and SCANS/day for a 1000x disc farm, a 100x DLT tape farm, and an STC tape robot (6,000 tapes, 8 readers).]
Slide 29: Access/$ (3-year life)

[Chart: KOX/$, MOX/$, GOX/$, and SCANS/k$ over a 3-year life for a 1000x disc farm, an STC tape robot (6,000 tapes, 16 readers), and a 100x DLT tape farm; the disc farm wins by orders of magnitude on KOX and MOX.]
Slide 30: Summary (of storage)

- Capacity and cost are improving fast (100x per decade)
- Accesses are getting larger (MOX, GOX, SCANS)
- BUT latencies and bandwidth are not improving much (3x per decade)
- How to deal with this?
  - Bandwidth: use partitioned parallel access (disk & tape farms)
  - Latency: pipeline data up the storage hierarchy (next section)
Slide 31: Interesting Storage Ratios

- Disk is back to 100x cheaper than RAM
- Nearline tape is only 10x cheaper than disk, and the gap is closing!

[Chart, 1960-2000: RAM $/MB vs. disk $/MB falling from 100:1 toward 30:1 and 10:1 (disk & DRAM look good); disk $/MB vs. nearline tape approaching 1:1 (why bother with tape?).]
Slide 32: Performance = Storage Accesses, not Instructions Executed

- In the old days we counted instructions and I/Os
- Now we count memory references
- Processors wait most of the time

[Chart: where the time goes; clock ticks used by AlphaSort components. A 70 MIPS processor: real apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not.]
Slide 33: Storage Latency: How Far Away is the Data?
Slide 34: Network Speeds

Comm speedups:
- Network speeds grow 60%/year
- WAN speeds limited by politics: if voice is $X/minute, how much is video?
- Switched 100 Mb Ethernet: 1,000x more bandwidth
- ATM is a scaleable net: 1 Gb/s to desktop & wall plug; commodity, same for LAN and WAN
- 1 Tb/s fibers in the laboratory

[Chart, 1960-2000: processor speeds (i/s) vs. LAN/WAN speeds (b/s).]
Slide 35: Network Trends & Challenge

- Bandwidth UP 10^4, price DOWN
- Speed-of-light unchanged
- Software got worse
- Standard fast nets: ATM, PCI, Myrinet, Tnet
- HOPE
  - Commodity net
  - Good software
  - Then clusters become a SNAP! (commodity, $10K/slice)
Slide 36: The Seven Price Tiers

- $10: wrist-watch computers
- $100: pocket/palm computers
- $1,000: portable computers
- $10,000: personal computers (desktop)
- $100,000: departmental computers (closet)
- $1,000,000: site computers (glass house)
- $10,000,000: regional computers (glass castle)

SuperServer: costs more than $100,000. Mainframe: costs more than $1M. Must be an array of processors, disks, tapes, comm ports.
Slide 37: If Hardware is Free, Where Will The Money Go?

- All clients and servers will be based on PC technology; economies of scale give the lowest price.
- Traditional budget: 40% vendor, 60% staff
- If hardware_price + software_price -> 0, then what?
- Money will go to
  - CONTENT (databases)
  - NEW APPLICATIONS
  - AUTOMATION (analogy to 1920 telephone operators)
    - Systems programmer per MIPS
    - DBA per 10 GB
Slide 38: The New Computer Industry

- Horizontal integration is the new structure
- Each layer picks best from the lower layer.
- Desktop (C/S) market share: 1991: 50%; 1995: 75%
Slide 39: Constant Dollars vs Constant Work

- Constant work: one SuperServer can do all the world's computations.
- Constant dollars:
  - The world spends 10% on information processing
  - Computers are moving from 5% penetration to 50%
  - $300B to $3T
  - We have the patent on the byte and the algorithm
Slide 40: Software Economics: Bill's Law

- Bill Joy's law (Sun): Don't write software for less than 100,000 platforms. @ $10M engineering expense, $1,000 price.
- Bill Gates' law: Don't write software for less than 1,000,000 platforms. @ $10M engineering expense, $100 price.
- Examples
  - UNIX vs NT: $3,500 vs $500
  - UNIX-Oracle vs SQL-Server: $100,000 vs $1,000
  - No spreadsheet or presentation pack on UNIX/VMS/...
- Commoditization of base software & hardware
Slide 41: What comes next

- MANY new clients
- Applications to enable clients & servers
- Super-servers
Slide 42: Opportunities

- Bad news: big players will dominate
  - CPU / storage / network (Intel, Seagate, ?)
  - OS (Microsoft, Novell)
  - DB-TP-CS... (Oracle, Sybase, Informix, IBM, Novell, Microsoft)
- Good news: applications are up for grabs! Value added is in
  - Applications & content
  - Service, support
  - Advice
  - Creating new sub-spaces in Cyberspace
  - Creating new clients (e.g., cellular phones, ...)
- Examples: SAP, Adabase, Lotus Notes, Peoplesoft, Netscape, Doom, ...
Slide 43: Outline

- Cyberspace pep talk: PCs are bricks, nets are mortar, DBs are land (content)
- 4B machines: commodity processor-disk-tape-net
- Plus commodity base software (POSIX or NT or ...)
- Build 4T SuperServers from arrays of 4B machines
- Challenge
  - Software to automate operations & programming
  - Parallel DBMSs do this
- Opportunities
  - New kinds of clients (e.g., intelligent universe)
  - New applications (new subspaces in Cyberspace)
Slide 44: Thesis: Many Little will Win over Few Big

[Chart: price ($1M, $100K, $10K) vs. form factor (14", 9", 5.25", 3.5", 2.5", 1.8") for mainframe, mini, micro, nano.]
Slide 45: Year 2000: the 4B Machine

- The Year 2000 commodity PC ($3K)
  - Billion instructions/sec (1 Bips processor)
  - Billion bytes RAM
  - Billion bits/s net (LAN/WAN)
  - 10 billion bytes disk
  - Billion-pixel display (3000 x 3000 x 24-bit pixels)
Slide 46: 4 B PCs: The Bricks of Cyberspace

- Cost $3,000
- Come with
  - OS (NT, POSIX, ...)
  - DBMS
  - High speed net
  - System management
  - GUI / OOUI
  - Tools
- Compatible with everyone else
- CyberBricks
Slide 47: Implications of Hardware Trends

Large disc farms will be inexpensive (~$100/GB). Large RAM databases will be inexpensive (~$1,000/GB). Processors will be inexpensive. So the building block will be a processor with large RAM and lots of disc: a 1K SPECint CPU, 5 GB RAM, 50 GB disc.
Slide 48: Implication of Hardware Trends: Clusters

Future servers are CLUSTERS of processors and discs (each node: CPU, 5 GB RAM, 50 GB disc). Distributed database techniques make clusters work.
Slide 49: Future SuperServer: the 4T Machine

Array of 1,000 4B machines: 1,000 processors (1 Tips), 1,000 discs (10 Terrorbytes), 100 tape transports (1,000 tapes, 1 PetaByte), comm lines, on a high speed network (~10 Gb/s) of 100 nodes; a few megabucks. Challenge: manageability, programmability, security, availability, scaleability, affordability: as easy as a single system.
Slide 50: Great Debate: Shared What?

- Shared Memory (SMP): easy to program, difficult to build, difficult to scale up. (Sequent, SGI, Sun)
- Shared Disk: (VMScluster, Sysplex)
- Shared Nothing (network): hard to program, easy to build, easy to scale up. (Tandem, Teradata, SP2)

The winner will be a synthesis of these ideas. Distributed shared memory (DASH, Encore) blurs the distinction between network and bus (locality still important), but gives shared memory at message cost.
Slide 51: Scaleables: Uneconomic So Far

- A slice is a processor, memory, and a few disks.
- Slice price of scaleables so far is a 5x to 10x markup
  - Teradata: $70K for an Intel 486, 32 MB, 4 disks
  - Tandem: $100K for a MipsCo R4000, 64 MB, 4 disks
  - Intel: $75K for an i860, 32 MB, 2 disks
  - TMC: $75K for a SPARC 3, 32 MB, 2 disks
  - IBM/SP2: $100K for an R6000, 64 MB, 8 disks
- Compaq slice price is less than $10K
- What is the problem?
  - Proprietary interconnect
  - Proprietary packaging
  - Proprietary software (vendorIX)
Slide 52: Summary

- Storage trends force pipeline & partition parallelism
  - Lots of bytes & bandwidth per dollar
  - Lots of latency
- Processor trends force pipeline & partition
  - Lots of MIPS per dollar
  - Lots of processors
- Putting it together: scaleable networks and platforms
  - Build clusters of commodity processors & storage
  - Commodity interconnect is key (S of PMS)
    - Traditional interconnects give $100K/slice.
  - Commodity cluster operating system is key
  - Fault isolation and tolerance is key
  - Automatic parallel programming is key
Slide 53: The Hardware is in Place... and Then a Miracle Occurs

...enabling parallel applications.
Slide 54: The Software Challenge

- Automatic data placement (partition: random or organized)
- Automatic parallel programming (process placement)
- Parallel concepts, algorithms & tools
- Parallel query optimization
- Execution techniques: load balance, checkpoint/restart, pacing, multi-programming
Slide 55: Kinds of Parallel Execution

- Pipeline: any sequential program feeds its output to the next sequential program.
- Partition: outputs split N ways, inputs merge M ways; many copies of any sequential program run on partitions of the data.
Slide 56: Why Parallel Access To Data?

At 10 MB/s it takes 1.2 days to scan a terabyte; with 1,000-way parallelism it is a 1.5 minute SCAN. Bandwidth & parallelism: divide a big problem into many smaller ones to be solved in parallel.
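The scan arithmetic above is easy to verify; a small sketch using the slide's round numbers (decimal TB and MB assumed):

```python
# Scan-time arithmetic from the slide: 1 TB at 10 MB/s,
# serial vs. 1,000-way parallel.
TB = 10**12          # bytes (decimal terabyte)
MB = 10**6           # bytes

def scan_seconds(data_bytes, rate_bytes_per_s, ways=1):
    """Elapsed time to scan data_bytes at the given per-stream rate,
    split evenly across `ways` parallel streams."""
    return data_bytes / (rate_bytes_per_s * ways)

serial = scan_seconds(TB, 10 * MB)          # one disk stream
parallel = scan_seconds(TB, 10 * MB, 1000)  # 1,000 disks

print(serial / 86400)   # about 1.2 days
print(parallel / 60)    # about 1.7 minutes
```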
Slide 57: DataFlow Programming: Prefetch & Postwrite Hide Latency

- Can't wait for the data to arrive (2,000 years!)
- Need a memory that gets the data in advance (~100 MB/s)
- Solution
  - Pipeline from source (tape, disc, RAM, ...) to CPU cache
  - Pipeline results to destination
Slide 58: Why are Relational Operators So Successful for Parallelism?

The relational data model gives uniform operators on uniform data streams, closed under composition. Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data; sequential data in and out: pure dataflow. Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation. AUTOMATIC PARALLELISM.
Slide 59: Database Systems Hide Parallelism
- Automate system management via tools
- data placement
- data organization (indexing)
- periodic tasks (dump / recover / reorganize)
- Automatic fault tolerance
- duplex failover
- transactions
- Automatic parallelism
- among transactions (locking)
- within a transaction (parallel execution)
Slide 60: Automatic Parallel OR DB

  SELECT image
  FROM landsat
  WHERE date BETWEEN 1970 AND 1990
    AND overlaps(location, Rockies)
    AND snow_cover(image) > 0.7

The Landsat table holds (date, location, image) rows, e.g. dates 1/2/72 to 4/8/95 and locations 33N 120W to 34N 120W, with temporal, spatial, and image indices. Assign one process per processor/disk: find images with the right date and location; analyze each image and return it if it is more than 70% snow.
Slide 61: Outline

- Why Parallelism
  - technology push
  - application pull
- Benchmark Buyer's Guide
  - metrics
  - simple tests
- Parallel Database Techniques
  - partitioned data
  - partitioned and pipelined execution
  - parallel relational operators
- Parallel Database Systems
  - Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick
Slide 62: Parallelism: Speedup & Scaleup

- Speedup: same job, more hardware, less time
- Scaleup: bigger job, more hardware, same time
- Transaction scaleup: more clients/servers, same response time
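The two metrics reduce to simple ratios; a minimal sketch (function names are illustrative, not from the deck):

```python
# Speedup and scaleup as defined on the slide:
#   speedup = serial time / parallel time   (same job, more hardware)
#   scaleup = small-job time on a small system
#             / proportionally-bigger-job time on a bigger system
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def scaleup(t_small, t_big):
    """Near 1.0 means the bigger job on bigger hardware took the same time."""
    return t_small / t_big

# Ideal linear behavior: 4x hardware gives 4x speedup,
# or runs a 4x job in the same time (scaleup = 1).
assert speedup(100.0, 25.0) == 4.0
assert scaleup(100.0, 100.0) == 1.0
```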
Slide 63: The New Law of Computing

Grosch's Law gives way to the parallel law: you need linear speedup and linear scaleup. Not always possible.
Slide 64: Parallelism: Performance is the Goal

The goal is to get 'good' performance.
- Law 1: a parallel system should be faster than a serial system.
- Law 2: a parallel system should give near-linear scaleup or near-linear speedup or both.
Slide 65: The New Performance Metrics

- Transaction Processing Performance Council
  - TPC-A: simple transaction
  - TPC-B: server only, about 3x lighter than TPC-A
  - Both obsoleted by TPC-C (no new results after 6/7/95)
- TPC-C (revision 3): transactions per minute (tpm-C)
  - Mix of 5 transactions: query, update, minibatch
  - Terminal price eliminated
  - About 5x heavier than TPC-A (so 3.5 ktpcA ~ 20 ktpmC)
- TPC-D: approved in March 1995; transactions per hour
  - Scaleable database (30 GB, 100 GB, 300 GB, ...)
  - 17 complex SQL queries (no rewrites, no hints without permission)
  - 2 load/purge queries
  - No official results yet, many customer results.
Slide 66: TPC-C Results 12/94

Courtesy of Charles Levine of Tandem (of course).
Slide 67: Success Stories

- Online transaction processing: many little jobs
  - SQL systems support 3,700 tps-A (24 CPUs, 240 disks)
  - SQL systems support 21,000 tpm-C (112 CPUs, 670 disks)
- Batch (decision support and utility): few big jobs, parallelism inside
  - Scan data at 100 MB/s
  - Linear scaleup to 500 processors

[Charts: transactions/sec vs. hardware; records/sec vs. hardware.]
Slide 68: The Perils of Parallelism

- Startup: creating processes, opening files, optimization
- Interference: device (CPU, disc, bus) and logical (lock, hotspot, server, log, ...)
- Skew: if tasks get very small, variance > service time
Slide 69: Benchmark Buyer's Guide

Things to ask: When does it stop scaling? Throughput numbers, not ratios. Standard benchmarks allow comparison to others and comparison to sequential. Ratios and non-standard benchmarks are red flags.
Slide 70: Performance 101: Scan Rate

Disk is 3 MB/s to 10 MB/s. A record is 100B to 200B (TPC-D: 110...160, Wisconsin: 204), so you should be able to read 10K records/s to 100K records/s. Simple test: time this on a 1M record table (table on one disk, parallelism turned off):

  SELECT count(*) FROM T WHERE x < infinity

Typical problems: disk or controller is an antique; no read-ahead in the operating system or DB; small page reads (2 KB); data not clustered on disk; big CPU overhead in record movement. Parallelism is not the cure for these problems.

[Figure: scan feeding a count aggregate.]
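The slide's sanity test can be sketched end to end; here SQLite stands in for the DBMS under evaluation (table `T` and column `x` are the slide's names, the payload column is an illustrative filler to get ~100-byte records):

```python
# Time a full scan of a 1M-row table and compute records/second,
# as the slide's "simple test" suggests.
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE T (x INTEGER, payload TEXT)")
con.executemany("INSERT INTO T VALUES (?, ?)",
                ((i, "a" * 92) for i in range(1_000_000)))  # ~100-byte records
con.commit()

start = time.perf_counter()
(count,) = con.execute("SELECT count(*) FROM T WHERE x < 2000000").fetchone()
elapsed = time.perf_counter() - start
print(count, "records scanned,", int(count / elapsed), "records/s")
```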
Slide 71: Parallel Scan Rate

Simplest parallel test: scale up the previous test. 4 disks, 4 controllers, 4 processors, 4 times as many records partitioned 4 ways, same query. Should have the same elapsed time. Some systems do.

[Figure: four scan-count pairs feeding a summing aggregate.]
Slide 72: Parallel Update Rate

Test: UPDATE T SET x = x + 1. Test for a million-row T on 1 disk; test for a four-million-row T on 4 disks. Look for bottlenecks. After each call, execute ROLLBACK WORK: see if UNDO runs at the DO speed, and see if UNDO is parallel (scales up).
Slide 73: Parallel Insert Rate

INSERT INTO T2 SELECT * FROM T1. First 1 scan -> insert, then 4 scan -> insert. See if the log becomes a bottleneck. After each run, call ROLLBACK WORK: see if rollback runs as fast as forward processing, and runs faster in parallel. If not, think about the implications of 100x parallelism. Few systems pass these basic sanity tests.
Slide 74: The records/s/$ Metric

- Parallel database systems scan data
- An interesting metric (100-byte records): record scan rate / system cost
- Typical scan rates: 1K records/s to 30K records/s
- Each scaleable system has a slice price (guess)
  - Gateway: $15K (P5 + ATM + 2 disks + NT + SQLServer or Informix or Oracle)
  - Teradata: $75K
  - Sequent: $75K (P5 + 2 disks + Dynix + Informix)
  - Tandem: $100K
  - IBM SP2: $130K (RS6000 + 2 disks + AIX + DB2)
- You can compute slice price for systems later in the presentation
- BAD: 0.1 records/s/$ (there is one of these)
- GOOD: 0.33 records/s/$ (there is one of these)
- Super! 1.00 records/s/$ (there is one of these)
- We should aim at 10 records/s/$ with the P6.
Slide 75: Embarrassing Questions to Ask Your PDB Vendor

- How are constraints checked? Ask about unique secondary indices, deferred constraints, and referential integrity.
- How does parallelism interact with triggers? Stored procedures? OO extensions?
- How can I change my 10 TB database design in an hour? Add an index, add a constraint, reorganize/repartition.

These are hard problems.
Slide 76: Outline

- Why Parallelism
  - technology push
  - application pull
- Benchmark Buyer's Guide
  - metrics
  - simple tests
- Parallel Database Techniques
  - partitioned data
  - partitioned and pipelined execution
  - parallel relational operators
- Parallel Database Systems
  - Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick
Slide 77: Automatic Data Partitioning

Split a SQL table across a subset of the nodes & disks. Partition within the set:
- Range: good for equijoins, range queries, group-by
- Hash: good for equijoins
- Round robin: good to spread load

Shared disk and memory are less sensitive to partitioning; shared nothing benefits from "good" partitioning.
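The three strategies above can be sketched as placement functions that map a record's key to one of n nodes (an illustrative toy; real systems partition pages and ranges, not individual records):

```python
# Range, hash, and round-robin partitioning as key -> node functions.
import itertools

def range_partition(key, split_points):
    """split_points is sorted; node i holds keys <= split_points[i],
    and the last node takes the tail."""
    for node, bound in enumerate(split_points):
        if key <= bound:
            return node
    return len(split_points)

def hash_partition(key, n):
    return hash(key) % n

def round_robin():
    """Returns a placement function that ignores the key entirely."""
    counter = itertools.count()
    def place(_key, n):
        return next(counter) % n
    return place

# Range keeps related keys together (good for range queries & group-by);
# hash spreads equal keys to the same node (good for equijoins);
# round robin just levels the load.
assert range_partition(15, [10, 20, 30]) == 1
rr = round_robin()
assert [rr(None, 3) for _ in range(4)] == [0, 1, 2, 0]
```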
Slide 78: Index Partitioning

Hash indices partition by hash. B-tree indices partition as a forest of trees, one tree per range. The primary index clusters the data.
Slide 79: Secondary Index Partitioning

In shared nothing, secondary indices are problematic.
- Partition by base table key ranges:
  - Insert: completely local (but what about unique?)
  - Lookup: examines ALL trees (see figure)
  - A unique index involves a lookup on insert.
- Partition by secondary key ranges (the Teradata solution):
  - Insert: two nodes (base and index)
  - Lookup: two nodes (index -> base)
  - Uniqueness is easy

[Figure: an A..Z index tree on every node over the base table, vs. one A..Z range per node.]
Slide 80: Kinds of Parallel Execution

- Pipeline: any sequential program feeds its output to the next sequential program.
- Partition: outputs split N ways, inputs merge M ways; many copies of any sequential program run on partitions of the data.
Slide 81: Data Rivers: Split + Merge Streams

A river connects N producers to M consumers (N x M data streams). Producers add records to the river; consumers consume records from the river: purely sequential programming. The river does flow control and buffering, and does partition and merge of data records. The river is Split/Merge in Gamma and the Exchange operator in Volcano.
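The river idea can be sketched with thread-safe queues: the producer and consumers below are written as purely sequential loops, and the "river" (the queues plus the hash split) does all the partitioning. A toy illustration, not any system's actual exchange operator:

```python
# A toy data river: one producer splits records by hash into M consumer
# queues; each endpoint is sequential code.
import queue
import threading

M = 2                                    # consumers
rivers = [queue.Queue() for _ in range(M)]
DONE = object()                          # end-of-stream marker

def producer(records):                   # sequential program
    for r in records:
        rivers[hash(r) % M].put(r)       # the river splits M ways by hash

def consumer(q, out):                    # sequential program
    while True:
        r = q.get()
        if r is DONE:
            return
        out.append(r)

outs = [[] for _ in range(M)]
threads = [threading.Thread(target=consumer, args=(rivers[i], outs[i]))
           for i in range(M)]
for t in threads:
    t.start()
producer(range(10))                      # N = 1 producer here
for q in rivers:
    q.put(DONE)
for t in threads:
    t.join()
print(sorted(outs[0] + outs[1]))         # all 10 records arrived, partitioned
```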
Slide 82: Partitioned Execution

Spreads computation and IO among processors. Partitioned data gives NATURAL parallelism.
Slide 83: N x M way Parallelism

N inputs, M outputs, no bottlenecks: partitioned data, partitioned and pipelined data flows.
Slide 84: Picking Data Ranges

- Disk partitioning: for range partitioning, sample the load on disks and cool hot disks by making their range smaller; for hash partitioning, cool hot disks by mapping some buckets to others.
- River partitioning: use hashing and assume uniform; if range partitioning, sample the data and use a histogram to level the bulk.

Teradata, Tandem, and Oracle use these tricks.
Slide 85: Blocking Operators = Short Pipelines

An operator is blocking if it does not produce any output until it has consumed all its input. Examples: sort, aggregates, hash-join (reads all of one operand). Blocking operators kill pipeline parallelism and make partition parallelism all the more important.

[Figure: the database load template has three blocked phases: scan the input tape/file into a SQL table; sort runs, then merge runs; then table insert plus index inserts (index 1, 2, 3).]
Slide 86: Simple Aggregates (sort or hash?)

Simple aggregates (count, min, max, ...) can use indices: more compact, and the index sometimes already has the aggregate info. GROUP BY aggregates: scan in category order if possible (use indices); else, if the categories fit in RAM, use a RAM category hash table; else make a temp of <category, item>, sort by category, and do the math in the merge step.
Slide 87: Parallel Aggregates

For each aggregate function, need a decomposition strategy:
- count(S) = sum over i of count(s(i)), ditto for sum()
- avg(S) = (sum over i of sum(s(i))) / (sum over i of count(s(i)))
- and so on...

For groups, sub-aggregate groups close to the source, and drop the sub-aggregates into a hash river.
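The avg() decomposition above is worth seeing concretely: each partition computes a small (sum, count) sub-aggregate near its data, and only those cross the river to the combining step. A minimal sketch with made-up partitions:

```python
# Parallel avg() via the slide's decomposition: combine per-partition
# (sum, count) pairs instead of shipping raw rows.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

subs = [(sum(p), len(p)) for p in partitions]     # local sub-aggregates
total_sum = sum(s for s, _ in subs)
total_cnt = sum(c for _, c in subs)
avg = total_sum / total_cnt

assert avg == sum(range(1, 10)) / 9               # same as the serial answer
print(avg)
```

Note that avg() itself is not directly combinable (the average of partition averages is wrong when partitions differ in size), which is exactly why the decomposition carries (sum, count) pairs.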
Slide 88: Sort

Used for loading and reorganization (sort makes them sequential), building B-trees, reports, and non-equijoins. Rarely used for aggregates or equi-joins (if hash is available).

[Figure: input data -> sort runs -> merge -> sorted data.]
Slide 89: Parallel Sort

An M-input, N-output sort design. Disk and merge are not needed if the sort fits in memory. Scales linearly. Sort is the benchmark from hell for shared-nothing machines: net traffic = disk bandwidth, with no data filtering at the source.
90SIGMOD Sort Award
- Datamation Sort 1M records (100 B recs)
- 1000 seconds 1986
- 60 seconds 1990
- 7 seconds 1994
- 3.5 seconds 1995 (SGI challenge)
- micros finally beat the mainframe!
- finally! a UNIX system that does IO
- SIGMOD MinuteSort
- 1.1GB, Nyberg, 1994
- Alpha 3cpu
- 1.6GB, Nyberg, 1995
- SGI Challenge (12 cpu)
- no SIGMOD PennySort record
91Nested Loops Join
If the inner table is indexed on the join cols
(B-tree or hash), then sequentially scan the
outer (from the start key); for each outer
record, probe the inner table for matching recs.
Works best if the inner fits in RAM (=> small
inner).
[Diagram: Outer Table scan probing the Inner Table]
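An index-nested-loops sketch (illustrative; the "index" is just an in-RAM dict from join column to rows):

```python
def index_nested_loops_join(outer, inner_index):
    """For each outer row, probe an index (dict) on the inner's join column."""
    for orow in outer:                                # sequential scan of outer
        for irow in inner_index.get(orow["k"], []):   # index probe into inner
            yield {**orow, **irow}

# Build the inner "index": join col -> matching rows (best if it fits in RAM).
inner = [{"k": 1, "b": "x"}, {"k": 2, "b": "y"}]
idx = {}
for r in inner:
    idx.setdefault(r["k"], []).append(r)

rows = list(index_nested_loops_join([{"k": 2, "a": 9}], idx))
assert rows == [{"k": 2, "a": 9, "b": "y"}]
```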
92Merge Join (and sort-merge join)
If the tables are sorted on the join cols (B-tree
or hash), then sequentially scan each (from the
start key): if left < right, advance left; if
left = right, match; if left > right, advance
right. Nice sequential scan of data (disk speed).
The NxM case (duplicates on both sides) is a
Cartesian product and may cause a backwards
rescan. Sort-merge join sorts before doing the
merge.
[Diagram: Left Table merged with Right Table]
Partitions well: partition the smaller to the
larger's partitioning. Works for all joins (outer,
non-equijoins, Cartesian, exclusion, ...)
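The advance/match logic can be sketched as follows (illustrative Python over sorted key lists, including the backwards rescan for duplicate runs):

```python
def merge_join(left, right):
    """Equi-join of two lists sorted on key; rescans right for NxM duplicates."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1                      # left < right: advance left
        elif left[i] > right[j]:
            j += 1                      # left > right: advance right
        else:                           # match: emit the cross product of the run
            j0 = j
            while i < len(left) and left[i] == right[j0]:
                j = j0                  # backwards rescan of the right run
                while j < len(right) and right[j] == left[i]:
                    out.append((left[i], right[j]))
                    j += 1
                i += 1
    return out

assert merge_join([1, 2, 2, 5], [2, 2, 3]) == [(2, 2)] * 4
```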
93Hash Join
Hash the smaller table into N buckets (hope N =
1). If N = 1, read the larger table and hash-probe
against the smaller; else hash the outer to disk,
then do a bucket-by-bucket hash join. Purely
sequential data behavior. Always beats sort-merge
and nested loops unless the data is clustered.
Good for equi, outer, and exclusion joins. Lots
of papers; products just appearing (what went
wrong?). Hashing reduces skew.
[Diagram: Right Table hashed into Hash Buckets,
probed by the Left Table]
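The N = 1 (in-memory) case can be sketched like this (illustrative; the build side is assumed to fit in RAM):

```python
def hash_join(small, large, key):
    """Build a hash table on the smaller table, probe it with the larger."""
    buckets = {}
    for r in small:                      # build phase: hash the smaller table
        buckets.setdefault(r[key], []).append(r)
    for r in large:                      # probe phase: one sequential pass
        for match in buckets.get(r[key], []):
            yield {**match, **r}

small = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
large = [{"id": 2, "amt": 10}, {"id": 3, "amt": 20}, {"id": 2, "amt": 30}]
out = list(hash_join(small, large, "id"))
assert out == [{"id": 2, "name": "bob", "amt": 10},
               {"id": 2, "name": "bob", "amt": 30}]
```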
94Hash Join Variants
Hybrid: keep one bucket in memory. Bitmap: if
disk-based, keep a bitmap of the first table's
join vals and filter the second table with it
(useless for a 1-N join, which is typical).
Bucket tuning: use many small buckets and
aggregate them based on memory pressure.
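The bitmap variant is essentially a one-hash Bloom filter; a toy sketch (the bit count and single hash function are illustrative choices):

```python
def bitmap(keys, nbits=1024):
    """Bitmap of the first table's join values (a one-hash Bloom filter)."""
    bits = 0
    for k in keys:
        bits |= 1 << (hash(k) % nbits)
    return bits

def maybe_match(bits, key, nbits=1024):
    """False -> definitely absent; True -> probably present (false positives possible)."""
    return bool(bits >> (hash(key) % nbits) & 1)

first = [10, 20, 30]
bits = bitmap(first)
second = [20, 99, 30, 7]
# 20 and 30 always survive; 99 and 7 are dropped unless a hash collision occurs.
survivors = [k for k in second if maybe_match(bits, k)]
```

The filter pays off only when many probe rows have no match, which is why it is useless for the typical 1-N join where every probe matches.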
95Parallel Hash Join
ICL implemented hash join with bitmaps in the
CAFS machine (1976)! Kitsuregawa pointed out the
parallelism benefits of hash join in the early
1980s (it partitions beautifully). We ignored
them! (why?) But now everybody's doing it (or
promises to do it). Hashing minimizes skew and
requires little thinking for redistribution.
Hashing uses massive main memory.
96Index-only Scan and Join
The index has the secondary key and the primary
key (or RID). An index is like a table (but
compact and clustered); think of it as a replica.
Can scan it rather than the base table (called a
semi-join when used for joins). Or scan it,
select primary keys, sort the primary keys, and
scan the base table (called a hybrid-join when
used for joins).
[Diagram: Index = Sec Key(s) + Pri Keys]
97Exotic Joins (Exclusion, Cartesian, Outer, ...)
Exclusion is used for NOT IN and DIFFERENCE
queries. Outer is a lossless join: a left or
right row appears with a null sibling if no
matching sibling is found. Cartesian is often a
mistake (a missing WHERE clause) but is also
important for the small-table large-table
optimization.
Each small table is used as a pick list,
representing a mapping (name -> code) or set
membership (e.g. holidays). Best plan: restrict
the small tables, form the Cartesian product of
all the small ones, and send it to each partition
of the large table for a hash join.
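The small-table large-table plan can be sketched as follows (illustrative; the table and predicate names are invented):

```python
from itertools import product

def star_plan(small_tables, small_preds, large_part, join_keys):
    """Restrict each small table, form the Cartesian product of the survivors,
    then hash-join that product against one partition of the large table."""
    restricted = [[r for r in t if p(r)]
                  for t, p in zip(small_tables, small_preds)]
    pick = {}                                 # composite key -> small-row combo
    for combo in product(*restricted):        # tiny: product of pick lists
        pick[tuple(c[k] for c, k in zip(combo, join_keys))] = combo
    return [r for r in large_part
            if tuple(r[k] for k in join_keys) in pick]

stores = [{"store": 1, "state": "CA"}, {"store": 2, "state": "NY"}]
days = [{"day": "mon", "holiday": False}, {"day": "sun", "holiday": True}]
sales = [{"store": 1, "day": "sun", "amt": 5}, {"store": 2, "day": "sun", "amt": 7}]
out = star_plan([stores, days],
                [lambda r: r["state"] == "CA", lambda r: r["holiday"]],
                sales, ["store", "day"])
assert out == [{"store": 1, "day": "sun", "amt": 5}]
```

In the parallel plan, the (tiny) Cartesian product is what gets broadcast to each partition of the large table.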
98Observations
It is easy to build a fast parallel execution
environment (no one has done it, but it is just
programming). It is hard to write a robust and
world-class query optimizer: there are many
tricks, and one quickly hits the complexity
barrier. Common approach: pick the best
sequential plan; pick the degree of parallelism
based on bottleneck analysis; bind operators to
processes.
99What's Wrong With That?
Why isn't the best serial plan the best parallel
plan? Counter-example: a table partitioned with a
local secondary index at two nodes; a range query
selects all of node 1 and 1% of node 2. Node 1
should do a scan of its partition; node 2 should
use the secondary index.
SELECT * FROM telephone_book
WHERE name < 'NoGood'
Sybase Navigator and DB2 PE should get this
right. We need theorems here (practitioners do
not have them).
[Diagram: partition A..M does a Table Scan;
partition N..Z does an Index Scan]
100Great Debate Shared What?
- Shared Memory (SMP): easy to program; difficult
to build; difficult to scaleup. (Sequent, SGI, Sun)
- Shared Disk: (VMScluster, Sysplex)
- Shared Nothing (network): hard to program; easy
to build; easy to scaleup. (Tandem, Teradata, SP2)
The winner will be a synthesis of these ideas.
Distributed shared memory (DASH, Encore) blurs
the distinction between network and bus (locality
is still important), but gives shared memory at
message cost.
101What Systems Work This Way
Shared Nothing: Teradata 400 nodes; Tandem 110
nodes; IBM SP2/DB2 128 nodes; Informix/SP2 48
nodes; ATT Sybase 8x14 nodes.
Shared Disk: Oracle 170 nodes; Rdb 24 nodes.
Shared Memory: Informix 9 nodes; RedBrick ? nodes.
102Outline
- Why Parallelism
- technology push
- application pull
- Benchmark Buyers Guide
- metrics
- simple tests
- Parallel Database Techniques
- partitioned data
- partitioned and pipelined execution
- parallel relational operators
- Parallel Database Systems
- Teradata - Oracle -DB2
- Tandem - Informix -RedBrick - Sybase
103System Survey Ground Rules
Premise: the world does not need yet another PDB
survey, but it would be nice to have a survey of
real systems. Visited each parallel DB vendor I
could (time limited). Asked not to be given
confidential info. Asked for public manuals and
benchmarks. Asked that my notes be reviewed. I
say only nice things (I am a PDB booster).
104Acknowledgments
Teradata: Todd Walter and Carrie Ballinger.
Tandem: Susanne Englert, Don Slutz, Hans-Jorge
Zeller, Mike Pong. Oracle: Gary Hallmark, Bill
Widdington. Informix: Gary Kelley, Hannes
Spintzik, Frank Symonds, Dave Clay. Navigator:
Rick Stellwagen, Brian Hart, Ilya Listvinsky,
Bill Huffman, Bob McDonald, Jan Graveson, Ron
Chung Hu, Stuart Thompto. DB2: Chaitan Baru,
Gilles Fecteau, James Hamilton, Hamid Pirahesh.
Redbrick: Phil Fernandez, Donovan Schneider.
105Teradata
Shipped 1984; now an ATT GIS brand name. Parallel
DB server for decision support: SQL in, tables
out. Supports heterogeneous data (converts to
client format). Data is hash-partitioned among
AMPs, with a fallback (mirror) hash. Applications
run on clients. Biggest installation: 476 nodes,
2.4 TB. Ported to a UNIX base.
106Parsing Engines
Interface to IBM or Ethernet or... Accept SQL;
return records and status. Support SQL 89, moving
to SQL 92. Parse, plan, authorize SQL (cost-based
optimizer). Issue requests to AMPs. Merge AMP
results to the requester. Some global load
control based on client priority (adaptive and
GREAT!).
Access Modules: almost all work is done in AMPs,
a shared-nothing SQL engine: scans, inserts,
joins, log, lock, .... Each manages up to 4 disks
(as one logical volume). Easy to design, manage,
and grow (just add disk).
107Data Layout Hash Partitioning
- All data declustered to all nodes
- Each table has a hash key (may be compound)
- Key maps to one of 4,000 buckets
- Buckets map to one of the AMPs
- Non-unique secondary index partitioned by table criterion
- Fallback bucket maps to a second AMP in the cluster
- Typical cluster is 6 nodes (2 is mirroring)
- Cluster limits failure scope
- 2 failures only cause a data outage if both are in the same cluster
- Within a node, each hash maps to a cylinder, then a hash to a page
- Page is a heap with a sorted directory
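A toy sketch of the bucket scheme (NAMPS and CLUSTER are made-up parameters, and Teradata's actual bucket-to-AMP map and fallback placement are more sophisticated):

```python
NBUCKETS = 4000   # fixed bucket count, from the slide
NAMPS = 20        # example AMP count (hypothetical)
CLUSTER = 4       # AMPs per fallback cluster (hypothetical; slide says typically 6)

def primary_amp(key):
    """key -> one of 4,000 hash buckets -> owning AMP."""
    bucket = hash(key) % NBUCKETS
    return bucket % NAMPS

def fallback_amp(key):
    """Fallback copy goes to a *different* AMP in the same cluster."""
    amp = primary_amp(key)
    cluster_base = amp - amp % CLUSTER
    return cluster_base + (amp + 1 - cluster_base) % CLUSTER

for k in ["smith", 42, ("a", 7)]:
    p, f = primary_amp(k), fallback_amp(k)
    assert p != f                         # never mirror onto the same AMP
    assert p // CLUSTER == f // CLUSTER   # fallback stays inside the cluster
```

Because the fallback copy stays inside the cluster, two failures lose data only if both fall in the same cluster, which is the failure-scope property the slide describes.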
108Teradata Optimization Execution
Sophisticated query optimizer (many tricks);
great emphasis on joins and aggregates. Nested,
merge, product, and bitmap joins (no hash join).
Automatic load balancing from hashing; load
control. Excellent utilities for data loading and
reorganization. Moved a >1 TB database from old
to new system in 6 days, in the background while
the old system was running. Old hardware, 3.8B-row
table (1 TB), >300 AMPs: typical scan, sort, join
averages 30 minutes.
109Query Execution
Protocol: PE requests work; each AMP responds OK
(or pushes back); the AMPs work (if all OK); each
AMP declares finished; when all are finished, the
PE does 2PC and starts the pull. Simple scan: the
PE broadcasts the scan to each AMP; each AMP
scans and produces an answer spool file; the PE
pulls the spool files from the AMPs via the Ynet.
If the scan were ordered, a sort catcher would be
forked at each AMP, pipelined to the scans; the
Ynet and PE would do a merge of merges from the
AMPs.
110 Aggregates, Updates
Aggregate of scan: the scans produce local
sub-aggregates; sub-aggregates are hashed to the
Ynet; each AMP catches its sub-aggregate hash
buckets and consolidates them; the PE pulls the
aggregates from the AMPs via the Ynet. Note: a
fully scaleable design. Insert/Update/Delete at
an AMP node generates insert/update/delete
messages to the unique secondary indices and the
fallback bucket of the base table; messages are
saved in spool if a node is down.
111Query Execution Joins
Great emphasis on joins, including the
small-table large-table optimization (cheapest
triple, then cheapest in triple). If
equi-partitioned, join locally. If not
equi-partitioned: may replicate the small table
to the large table's partitions (the Ynet
shines); may repartition one if the other is
already partitioned on the join key; may
repartition both (in parallel). The join
algorithm within a node is product, nested, or
sort-merge; hash bitmaps of secondary indices are
intersected.
112Utilities
Bulk Data Load, Fast Data Load, Multi-load: blast
32KB of data to an AMP; multiple sessions by
multiple clients can drive 200x parallel; double
buffered. The AMP unpacks and puts "upserts" onto
the Ynet. One record can generate multiple
upserts (transaction -> inventory, store-sales,
...). A catcher on the Ynet grabs the relevant
upserts to a temp file, sorts, and then batches
the inserts (survives restarts). Online and
restartable: customers cite this as a Teradata
strength. Fast Export (similar to bulk data load).
113Utilities II
Backup/Restore: rarely needed because of
fallback; the cluster is the unit of recovery;
backup is online, restore is offline. Reorganize:
rarely needed; adding a disk is just a restart.
Add node: rehash all buckets that go to that node
(the Ynet has old and new bucket maps); fully
parallel and fault tolerant; takes minutes.
114Port To UNIX
- New design (3700 series) described in VLDB 93
- Ported to UNIX platforms (3600 AP, PE, AMP)
- Moved Teradata to a software Ynet on SMPs
- Based on bullet-proof UNIX with a TOS layer atop
  - message system
  - communications stacks
  - raw disk, virtual processors
  - virtual partitions (buckets go to virtual partitions)
  - removes many TOS limits
- Result is 10x to 60x faster than an AMP
- Compiled expression evaluation (gives 50x speedup on scans)
- Large main memory helps
[Stack diagram: Applications / SQL / Parsing
engine (parallelism) / Teradata SQL (AMP logic) /
UNIX PDE TOS adapter / UNIX 5.4 (SMP, RAS,
virtual Ynet) / HARDWARE]
115Customer Benchmarks
Standard benchmarks: only old Boral/DeWitt
Wisconsin numbers; nothing public. Moved a >1 TB
database from old to new in 6 days, in the
background while the old system was running, so
the unload-load rate was >2 MB/s sustained (a
background task, speed limited by host
speed/space). Old hardware, 3.8B-row table, >300
AMPs: typical scan, sort, join averages 30
minutes.
Rates (rec size not cited):
  op                krec/s/AMP   total
  scan                  9        2.7 mr/s !!!!!!
  clustered join        2        600 kr/s
  insert-select         .39      120 kr/s
  hash index build      3.3      100 kr/s
116UNIX/SMP Port of Teradata
Times to process a Teradata test DB on an
8-Pentium 3650. These numbers are 10x to 150x
better than a single AMP (compiled expression
handling, more memory).
  op                      rows        seconds  k r/s  MB/s
  scan                    50,000,000    737     67.8  11.0
  copy                     5,000,000   1136      4.4   0.7
  aggregate               50,000,000    788     63.5  10.3
  join 50x2M (clustered)  52,000,000    768     67.7  11.0
  join 5x5 (unclustered)  10,000,000    237     42.2   6.8
  join 50Mx.1K            50,000,100   1916     26.1   4.2
117Teradata Good Things
Scaleable to large (multi-terabyte) databases.
Available TODAY! It is VERY real, in production
at many large sites. Robust and complete set of
utilities; automatic management. Integrates with
the IBM mainframe OLTP world. Heterogeneous data
support makes it a good data warehouse.
118Tandem
Message-based OS (Guardian): (1) location
transparency, (2) fault isolation (failover to
other nodes). Expand software: 255 systems in a
WAN, 224 processors. Classic shared-nothing
system (like Teradata, except applications run
inside the DB machine).
[Diagram: 4-node system; 1-16 MIPS R4400 CPUs per
node; dual-port controllers; dual 30 MB/s LAN;
8 x 1 MB/s]
1974-1985: Encompass, fault-tolerant distributed
OLTP. 1986: NonStop SQL, first distributed and
high-performance SQL (200 tps). 1989: Parallel
NonStop SQL, parallel query optimizer/executor.
1994: Parallel and online SQL (utilities, DDL,
recovery, ...). 1995: Moving to ServerNet,
shared-disk model.
119Tandem Data Layout
Each table or index is range-partitioned to a set
of disks (anywhere in the network). The index is
a B-tree per partition; the clustering index is a
B-tree. Table fragments are files (extent based).
Descriptors for all local files live in the local
catalog (node autonomy). Tables can be
distributed in the network (LAN or WAN). Duplexed
disks and disk processes for failover.
[Diagram: Partition -> file parts -> blocks;
extents may be added]
120Tandem Software (Process Structure)
[Diagram: a C/COBOL/... application containing
the SQL engine (joins, sorts, global aggs,
triggers, index maintenance, views, security)
talks SQL to a disk server pair (selects, update,
delete, record/set insert, aggregates,
assertions, locking, logging, buffer pool)
managing a disk pair or array that holds the data
partition. Helper processes: query compiler,
transactions, GUI, utilities.]
Hardware + OS move data at 4 MB/s with >1
instruction/byte.
121OLTP Features
Insert/Update/Delete maintains indices in
parallel with the base table: with 5 indices, 5x
faster response time. Record and key-value range
locking; SQL 92 isolation levels. An undo scanner
per log double-buffers undo to each server. 21k
tpc-C (WOW!!) with a 110-node server (800 GB db).
Can mix OLTP and batch: priority serving to avoid
the priority inversion problem