Title: Building PetaByte Servers
1Building PetaByte Servers
- Jim Gray
- Microsoft Research
- Gray@Microsoft.com
- http://www.Research.Microsoft.com/Gray/talks
Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12 (today, we are here), Peta 10^15, Exa 10^18
2Outline
- The challenge: Building GIANT data stores
- for example, the EOS/DIS 15 PB system
- Conclusion 1
- Think about MOX and SCANS
- Conclusion 2
- Think about Clusters
- SMP report
- Cluster report
3The Challenge -- EOS/DIS
- Antarctica is melting -- 77% of fresh water liberated
- sea level rises 70 meters
- Chico & Memphis are beach-front property
- New York, Washington, SF, LA, London, Paris
- Let's study it! Mission to Planet Earth
- EOS: Earth Observing System ($17B => $10B)
- 50 instruments on 10 satellites 1997-2001
- Landsat (added later)
- EOS DIS: Data Information System
- 3-5 MB/s raw, 30-50 MB/s processed.
- 4 TB/day
- 15 PB by year 2007
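The rates and totals on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a constant ingest rate; the function names are illustrative and not from the talk.

```python
SECONDS_PER_DAY = 86_400
MB, TB, PB = 10**6, 10**12, 10**15  # decimal units (an assumption)

def tb_per_day(mb_per_second: float) -> float:
    """Convert a sustained rate in MB/s to TB/day."""
    return mb_per_second * MB * SECONDS_PER_DAY / TB

def years_to_accumulate(total_pb: float, daily_tb: float) -> float:
    """Years needed to store total_pb at a constant daily_tb."""
    return total_pb * PB / (daily_tb * TB) / 365

print(f"50 MB/s processed = {tb_per_day(50):.2f} TB/day")
print(f"15 PB at 4 TB/day = {years_to_accumulate(15, 4):.1f} years")
```

Under these assumptions the upper processed rate of 50 MB/s gives a little over 4 TB/day, and 4 TB/day sustained for about ten years reaches roughly 15 PB, which matches the 1997 to 2007 span.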
4The Process Flow
- Data arrives and is pre-processed.
- instrument data is calibrated, gridded, averaged
- Geophysical data is derived
- Users ask for stored data OR to analyze and combine data.
- Can make the pull-push split dynamically
Diagram labels: Pull Processing, Push Processing, Other Data
5Designing EOS/DIS
- Expect that millions will use the system (online)
Three user categories:
- NASA 500 -- funded by NASA to do science
- Global Change 10 k -- other dirt bags
- Internet 20 m -- everyone else
- Grain speculators
- Environmental Impact Reports
- New applications => discovery & access must be automatic
- Allow anyone to set up a peer-node (DAAC & SCF)
- Design for Ad Hoc queries, Not Standard Data Products
- If push is 90%, then 10% of data is read (on average).
- => A failure: no one uses the data. In DSS, push is 1% or less.
- => computation demand is enormous (pull:push is 100:1)
6Obvious Points: EOS/DIS will be a cluster of SMPs
- It needs 16 PB storage
- 1 M disks in current technology
- 500K tapes in current technology
- It needs 100 TeraOps of processing
- 100K processors (current technology)
- and 100 Terabytes of DRAM
- 1997 requirements are 1000x smaller
- smaller data rate
- almost no re-processing work
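Dividing the totals above by the device counts shows what "current technology" means per device. This is a rough sketch assuming decimal units; the variable names are illustrative only.

```python
PB, TB, GB = 10**15, 10**12, 10**9

def per_unit(total: float, units: int) -> float:
    """Spread a total evenly over a number of identical devices."""
    return total / units

disk_gb = per_unit(16 * PB, 1_000_000) / GB            # GB per disk
tape_gb = per_unit(16 * PB, 500_000) / GB              # GB per tape
gigaops_per_cpu = per_unit(100 * 10**12, 100_000) / 10**9  # GigaOps per processor
dram_gb_per_cpu = per_unit(100 * TB, 100_000) / GB     # GB of DRAM per processor

print(disk_gb, tape_gb, gigaops_per_cpu, dram_gb_per_cpu)
```

The implied building blocks are a 16 GB disk, a 32 GB tape, and a 1 GigaOps processor with 1 GB of DRAM.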
7The architecture
- 2+N data center design
- Scaleable OR-DBMS
- Emphasize Pull vs Push processing
- Storage hierarchy
- Data Pump
- Just in time acquisition
8 2+N data center design
- duplex the archive (for fault tolerance)
- let anyone build an extract (the +N)
- Partition data by time and by space (store 2 or 4 ways).
- Each partition is a free-standing OR-DBMS (similar to Tandem, Teradata designs).
- Clients and Partitions interact via standard protocols
- OLE-DB, DCOM/CORBA, HTTP, ...
9Hardware Architecture
- 2 Huge Data Centers
- Each has 50 to 1,000 nodes in a cluster
- Each node has about 25 to 250 TB of storage
- SMP: .5 Bips to 50 Bips, $20K
- DRAM: 50 GB to 1 TB, $50K
- 100 disks: 2.3 TB to 230 TB, $200K
- 10 tape robots: 25 TB to 250 TB, $200K
- 2 Interconnects: 1 GBps to 100 GBps, $20K
- Node costs $500K
- Data Center costs $25M (capital cost)
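The node and data center costs follow from the component list. The sketch below reads the trailing figure on each component line as a cost in thousands of dollars, which is an assumption about the slide's lost currency symbols.

```python
# Component costs per node, in thousands of dollars (assumed reading of the slide).
node_parts_k = {
    "SMP": 20,
    "DRAM": 50,
    "100 disks": 200,
    "10 tape robots": 200,
    "2 interconnects": 20,
}

node_cost_k = sum(node_parts_k.values())   # close to the slide's rounded 500K
nodes_for_25m = 25_000 / 500               # nodes bought by 25M at 500K each

print(f"node cost: {node_cost_k}K, nodes per 25M data center: {nodes_for_25m:.0f}")
```

The parts add up to just under the quoted node cost, and the data center cost corresponds to the low end of the 50 to 1,000 node range.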
10Scaleable OR-DBMS
- Adopt cluster approach (Tandem, Teradata, VMScluster, DB2/PE, Informix, ...)
- System must scale to many processors, disks, links
- OR DBMS based on standard object model
- CORBA or DCOM (not vendor specific)
- Grow by adding components
- System must be self-managing
11Storage Hierarchy
- Cache hot 10% (1.5 PB) on disk.
- Keep cold 90% on near-line tape.
- Remember recent results on speculation
12Data Pump
- Some queries require reading ALL the data (for reprocessing)
- Each Data Center scans the data every 2 weeks.
- Data rate 10 PB/day = 10 TB/node/day = 120 MB/s
- Compute on demand: small jobs
- less than 1,000 tape mounts
- less than 100 M disk accesses
- less than 100 TeraOps.
- (less than 30 minute response time)
- For BIG JOBS scan entire 15PB database
- Queries (and extracts) snoop this data pump.
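The per-node data pump rate can be checked by unit conversion. This sketch assumes decimal units and takes the slide's 10 PB/day and 10 TB/node/day at face value; the names are illustrative.

```python
SECONDS_PER_DAY = 86_400

def mb_per_second(tb_per_day: float) -> float:
    """Convert TB/day to MB/s using decimal units."""
    return tb_per_day * 1e12 / SECONDS_PER_DAY / 1e6

pb_per_day = 10
tb_per_node_per_day = 10

per_node_mb_s = mb_per_second(tb_per_node_per_day)
nodes_needed = pb_per_day * 1000 / tb_per_node_per_day

print(f"{per_node_mb_s:.0f} MB/s per node across {nodes_needed:.0f} nodes")
```

Each node has to sustain a little under 120 MB/s, and the aggregate rate corresponds to a 1,000-node cluster.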
13Just-in-time acquisition 30
- Hardware prices decline 20-40%/year
- So buy at last moment
- Buy best product that day: commodity
- Depreciate over 3 years so that facility is fresh.
- (after 3 years, cost is 23% of original).
60% decline peaks at $10M
Chart: EOS DIS Disk Storage Size and Cost (assume 40% price decline/year). Series: Data Need (TB) and Storage Cost ($M), plotted for 1994 to 2008.
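The case for buying late rests on compound price decline. A minimal sketch, assuming a constant annual decline rate (the function name is illustrative):

```python
def remaining_fraction(annual_decline: float, years: int) -> float:
    """Fraction of the original price left after compounding a yearly decline."""
    return (1 - annual_decline) ** years

for decline in (0.20, 0.30, 0.40):
    left = remaining_fraction(decline, 3)
    print(f"{decline:.0%}/year -> {left:.0%} of original after 3 years")
```

At the 40%/year end of the range, three years of decline leave a bit over a fifth of the original price, close to the figure quoted on the slide.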
14Problems
- HSM
- Design and Meta-data
- Ingest
- Data discovery, search, and analysis
- reorg-reprocess
- disaster recovery
- cost
15Trends: New Applications
- The Old World
- Millions of objects
- 100-byte objects
- The New World
- Billions of objects
- Big objects (1MB)
Multimedia: Text, voice, image, video, ...
The paperless office
Library of Congress online (on your campus)
All information comes electronically: entertainment, publishing, business
Information Network, Knowledge Navigator, Information at Your Fingertips
16What's a Terabyte
1 Terabyte holds:
- 1,000,000,000 business letters (150 miles of bookshelf)
- 100,000,000 book pages (15 miles of bookshelf)
- 50,000,000 FAX images (7 miles of bookshelf)
- 10,000,000 TV pictures (mpeg) (10 days of video)
- 4,000 LandSat images
Library of Congress (in ASCII) is 25 TB
Cost of 1 Terabyte:
- 1980: $200 M of disc (10,000 discs); $5 M of tape silo (10,000 tapes)
- 1994: $1 M of magnetic disc (120 discs); $500 K of optical disc robot (250 platters); $50 K of tape silo (50 tapes)
Terror Byte!! .1% of a PetaByte!!
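Dividing one terabyte by each object count gives the object size the slide implies. A small sketch, assuming 1 TB = 10^12 bytes:

```python
TB = 10**12

# Object counts per terabyte, as listed on the slide.
objects_per_tb = {
    "business letter": 1_000_000_000,
    "book page": 100_000_000,
    "FAX image": 50_000_000,
    "TV picture (mpeg)": 10_000_000,
    "LandSat image": 4_000,
}

bytes_per_object = {name: TB // count for name, count in objects_per_tb.items()}

for name, size in bytes_per_object.items():
    print(f"{name}: {size:,} bytes")
```

The implied sizes run from about 1 KB for a letter to 250 MB for a LandSat image.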
17The Cost of Storage Access
- File Cabinet: cabinet (4 drawer) $250, paper (24,000 sheets) $250, space (2x3 @ $10/ft2) $180, total $700, 3.0 cents/sheet
- Disk: disk (9 GB) $2,000; ASCII: 5 m pages, 0.04 cents/sheet (100x cheaper)
- Image: 200 k pages, 1 cent/sheet (similar to paper)
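The per-sheet costs can be reproduced from the totals. A short sketch, reading the totals as dollars and the per-sheet figures as cents (the units were lost in the slide text, so this reading is an assumption that the arithmetic supports):

```python
def cents_per_sheet(total_dollars: float, sheets: int) -> float:
    """Cost of storing one sheet, in cents."""
    return 100 * total_dollars / sheets

paper = cents_per_sheet(700, 24_000)           # file cabinet
disk_ascii = cents_per_sheet(2_000, 5_000_000)  # 9 GB disk holding ASCII pages
disk_image = cents_per_sheet(2_000, 200_000)    # same disk holding page images

print(f"paper {paper:.1f}, ASCII on disk {disk_ascii:.2f}, image on disk {disk_image:.1f}")
```

Paper comes to about 3 cents a sheet, ASCII on disk to roughly a hundredth of that, and scanned images land near the cost of paper.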
18Standard Storage Metrics
- Capacity
- RAM: MB and $/MB; today at 100 MB and $10/MB
- Disk: GB and $/GB; today at 10 GB and $200/GB
- Tape: TB and $/TB; today at .1 TB and $100 k/TB (nearline)
- Access time (latency)
- RAM: 100 ns
- Disk: 10 ms
- Tape: 30 second pick, 30 second position
- Transfer rate
- RAM: 1 GB/s
- Disk: 5 MB/s (arrays can go to 1 GB/s)
- Tape: 3 MB/s (not clear that striping works)
19New Storage Metrics: KOXs, MOXs, GOXs, SCANs?
- KOX: How many kilobyte objects served per second
- the file server, transaction processing metric
- MOX: How many megabyte objects served per second
- the Mosaic metric
- GOX: How many gigabyte objects served per hour
- the video & EOSDIS metric
- SCANS: How many scans of all the data per day
- the data mining and utility metric
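These metrics can be estimated from a device's latency, bandwidth, and capacity. The sketch below uses a deliberately simple model that is an assumption, not something stated in the talk: one device serves requests one at a time, and each request costs one access latency plus the transfer time. The disk figures (10 ms, 5 MB/s, 10 GB) are the ones quoted on the Standard Storage Metrics slide.

```python
def objects_per_second(object_bytes: float, latency_s: float,
                       bandwidth_bytes_per_s: float) -> float:
    """Serial service rate: one seek plus one transfer per object."""
    return 1.0 / (latency_s + object_bytes / bandwidth_bytes_per_s)

disk = {"latency_s": 0.010, "bandwidth_bytes_per_s": 5e6}
disk_capacity_bytes = 10e9

kox = objects_per_second(1e3, **disk)                 # kilobyte objects per second
mox = objects_per_second(1e6, **disk)                 # megabyte objects per second
gox_per_hour = 3600 * objects_per_second(1e9, **disk)  # gigabyte objects per hour
scans_per_day = 86_400 / (disk_capacity_bytes / disk["bandwidth_bytes_per_s"])

print(f"KOX {kox:.0f}/s, MOX {mox:.1f}/s, GOX {gox_per_hour:.0f}/h, SCANS {scans_per_day:.1f}/day")
```

In this model small objects are limited by latency and large objects by bandwidth, which is why the slide proposes separate metrics for each size class.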
20Summary (of new ideas)
- Storage accesses are the bottleneck
- Accesses are getting larger (MOX, GOX, SCANS)
- Capacity and cost are improving
- BUT
- Latencies and bandwidth are not improving much
- SO
- Use parallel access (disk and tape farms)
21How To Get Lots of MOX, GOX, SCANS
- parallelism: use many little devices in parallel
- Beware of the media myth
- Beware of the access time myth
At 10 MB/s: 1.2 days to scan 1 Terabyte
1,000 x parallel: 1.5 minute SCAN of 1 Terabyte
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
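The scan times follow from dividing the data size by the aggregate bandwidth. A minimal sketch, assuming decimal units and perfect linear speedup across devices (an idealization):

```python
def scan_seconds(data_bytes: float, bytes_per_s_per_device: float,
                 devices: int = 1) -> float:
    """Time to read all the data when it is spread evenly over the devices."""
    return data_bytes / (bytes_per_s_per_device * devices)

serial = scan_seconds(1e12, 10e6)            # one device at 10 MB/s
parallel = scan_seconds(1e12, 10e6, 1_000)   # 1,000 devices in parallel

print(f"serial: {serial / 86_400:.2f} days, parallel: {parallel / 60:.1f} minutes")
```

One device needs a bit over a day for a terabyte, while a thousand devices finish in under two minutes, in line with the rounded figures on the slide.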
22Meta-Message: Technology Ratios Are Important
- If everything gets faster & cheaper at the same rate then nothing really changes.
- Some things getting MUCH BETTER
- communication speed & cost 1,000x
- processor speed & cost 100x
- storage size & cost 100x
- Some things staying about the same
- speed of light (more or less constant)
- people (10x worse)
- storage speed (only 10x better)
23Outline
- The challenge: Building GIANT data stores
- for example, the EOS/DIS 15 PB system
- Conclusion 1
- Think about MOX and SCANS
- Conclusion 2
- Think about Clusters
- SMP report
- Cluster report
24Scaleable Computers: BOTH SMP and Cluster
Grow Up with SMP: 4xP6 is now standard
Grow Out with Cluster: Cluster has inexpensive parts
Diagram labels: Personal System, Departmental Server, SMP Super Server, Cluster of PCs
25TPC-C Current Results
- Best Performance is 30,390 tpmC @ $305/tpmC (Oracle/DEC)
- Best Price/Perf. is 7,693 tpmC @ $43.5/tpmC (MS SQL/Dell)
- Graphs show
- UNIX high price
- UNIX scaleup diseconomy
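Since TPC-C price/performance is the priced system cost divided by tpmC, multiplying the two quoted numbers recovers the approximate system price. The sketch below reads the per-tpmC figures as dollars, which is an assumption about the lost currency symbol.

```python
# (tpmC, dollars per tpmC) as quoted on the slide.
results = {
    "Oracle/DEC": (30_390, 305.0),
    "MS SQL/Dell": (7_693, 43.5),
}

system_price = {name: tpmc * price for name, (tpmc, price) in results.items()}
price_ratio = results["Oracle/DEC"][1] / results["MS SQL/Dell"][1]
perf_ratio = results["Oracle/DEC"][0] / results["MS SQL/Dell"][0]

for name, dollars in system_price.items():
    print(f"{name}: about ${dollars:,.0f}")
print(f"{perf_ratio:.1f}x the throughput at {price_ratio:.1f}x the price per tpmC")
```

The high-end system delivers about four times the throughput at about seven times the unit price, which is the scaleup diseconomy the bullet refers to.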
26Compare SMP Performance
27TPC C improved fast
40% hardware, 100% software, 100% PC Technology
28Where the money goes
29What does this mean?
- PC Technology is 3x cheaper than high-end SMPs
- PC node performance is 1/2 that of high-end SMPs
- 4xP6 vs 20xUltraSparc
- Peak performance is a cluster
- Tandem 100 node cluster
- DEC Alpha 4x8 cluster
- Commodity solutions WILL come to this market
30Cluster: Shared What?
- Shared Memory Multiprocessor
- Multiple processors, one memory
- all devices are local
- DEC, SG, Sun, Sequent: 16..64 nodes
- easy to program, not commodity
- Shared Disk Cluster
- an array of nodes
- all share common disks
- VAXcluster + Oracle
- Shared Nothing Cluster
- each device local to a node
- ownership may change
- Tandem, SP2, Wolfpack
31Clusters being built
- Teradata: 1500 nodes + 24 TB disk ($50k/slice)
- Tandem, VMScluster: 150 nodes ($100k/slice)
- Intel: 9,000 nodes @ $55M ($6k/slice)
- Teradata, Tandem, DEC moving to NT + low slice price
- IBM: 512 nodes @ $100M ($200k/slice)
- PC clusters (bare handed) at dozens of nodes: web servers (msn, PointCast, ...), DB servers
- KEY TECHNOLOGY HERE IS THE APPS.
- Apps distribute data
- Apps distribute execution
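The last two bullets can be illustrated with a toy shared-nothing layout: rows are hash-partitioned across nodes, each node works on its own partition, and the partial results are combined. This is only a sketch; threads stand in for cluster nodes, and the row keys and helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

def node_for(key: str, n_nodes: int) -> int:
    """Stable hash partitioning: the same key always maps to the same node."""
    return crc32(key.encode()) % n_nodes

def distribute(rows, n_nodes: int):
    """Apps distribute data: place each (key, value) row on exactly one node."""
    nodes = [[] for _ in range(n_nodes)]
    for key, value in rows:
        nodes[node_for(key, n_nodes)].append((key, value))
    return nodes

def parallel_sum(nodes) -> int:
    """Apps distribute execution: each node sums its rows, then results merge."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(lambda part: sum(v for _, v in part), nodes)
        return sum(partials)

rows = [(f"granule-{i}", i) for i in range(1000)]
nodes = distribute(rows, n_nodes=4)
total = parallel_sum(nodes)
print(total, [len(part) for part in nodes])
```

Adding a node changes only the partition count, which is the modular growth property that makes the per-slice price the number to watch.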
32Cluster Advantages
- Clients and Servers made from the same stuff.
- Inexpensive: Built with commodity components
- Fault tolerance
- Spare modules mask failures
- Modular growth
- grow by adding small modules
- Parallel data search
- use multiple processors and disks
33Clusters are winning the high end
- You saw that a 4x8 cluster has best TPC-C performance
- This year, a 95xUltraSparc cluster won the MinuteSort Speed Trophy (see NOWsort at www.now.cs.berkeley.edu)
- Ordinal 16x on SGI Origin is close (but the loser!).
34Clusters (Plumbing)
- Single system image
- naming
- protection/security
- management/load balance
- Fault Tolerance
- Wolfpack Demo
- Hot Pluggable hardware & Software
35So, What's New?
- When slices cost $50k, you buy 10 or 20.
- When slices cost $5k you buy 100 or 200.
- Manageability, programmability, usability become key issues (total cost of ownership).
- PCs are MUCH easier to use and program
MPP Vicious Cycle: No Customers!
CP/Commodity Virtuous Cycle: Standards allow progress and investment protection
Cycle diagram labels: Apps, Standard OS & Hardware, Customers
36Windows NT Server Clustering: High Availability On Standard Hardware
- Standard API for clusters on many platforms
- No special hardware required.
- Resource Group is unit of failover
- Typical resources
- shared disk, printer, ...
- IP address, NetName
- Service (Web, SQL, File, Print, Mail, MTS)
- API to define
- resource groups,
- dependencies,
- resources,
- GUI administrative interface
- A consortium of 60 HW & SW vendors (everybody who is anybody)
2-Node Cluster in beta test now. Available 97H1.
>2 node is next.
SQL Server and Oracle Demo on it today.
Key concepts:
- System: a node
- Cluster: systems working together
- Resource: hard/soft-ware module
- Resource dependency: resource needs another
- Resource group: fails over as a unit
- Dependencies do not cross group boundaries
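The key concepts above can be modeled in a few lines: a resource group is a dependency graph, it comes online in dependency order, and it moves between nodes as a unit. The sketch below is not the Wolfpack API; the resource names, node names, and functions are hypothetical, and the ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical resource group: each resource maps to the resources it needs.
group = {
    "disk": set(),
    "ip_address": set(),
    "net_name": {"ip_address"},
    "sql_service": {"disk", "net_name"},
}

def online_order(resources: dict) -> list:
    """Order that brings every dependency online before its dependents."""
    needed = {dep for deps in resources.values() for dep in deps}
    outside = needed - set(resources)
    if outside:
        # Dependencies must not cross group boundaries.
        raise ValueError(f"dependencies outside the group: {sorted(outside)}")
    return list(TopologicalSorter(resources).static_order())

def failover(resources: dict, nodes: list, failed: str):
    """Move the whole group to a surviving node; return (node, start order)."""
    survivors = [node for node in nodes if node != failed]
    if not survivors:
        raise RuntimeError("no surviving node to host the group")
    return survivors[0], online_order(resources)

order = online_order(group)
target, restart_order = failover(group, ["node_a", "node_b"], failed="node_a")
print(target, restart_order)
```

Keeping dependencies inside one group is what lets the group fail over as a unit: nothing it needs is left behind on the failed node.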
37Wolfpack NT Clusters 1.0
- Two node file and print failover
Diagram: two nodes, Alice and Betty, each with Private Disks, joined by Shared SCSI Disk Strings and serving Clients
38What is Wolfpack?
Architecture diagram labels:
- Cluster Management Tools
- Cluster API DLL
- RPC
- Cluster Service: Global Update Manager, Database Manager, Node Manager, Event Processor, Failover Mgr, Communication Manager, Resource Mgr
- Other Nodes
- Resource Monitors
- Resource Management Interface: Open, Online, IsAlive, LooksAlive, Offline, Close
- Physical Resource DLL, Logical Resource DLL, App Resource DLL
- Cluster Aware App
39Where We Are Today
- Clusters moving fast
- OLTP
- Sort
- WolfPack
- Technology ahead of schedule
- cpus, disks, tapes, wires, ...
- OR Databases are evolving
- Parallel DBMSs are evolving
- HSM still immature
40Outline
- The challenge: Building GIANT data stores
- for example, the EOS/DIS 15 PB system
- Conclusion 1
- Think about MOX and SCANS
- Conclusion 2
- Think about Clusters
- SMP report
- Cluster report
41Building PetaByte Servers
- Jim Gray
- Microsoft Research
- Gray@Microsoft.com
- http://www.Research.Microsoft.com/Gray/talks
Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12 (today, we are here), Peta 10^15, Exa 10^18