Title: Recent Progress on Scaleable Servers - Jim Gray, Microsoft Research
1. Recent Progress on Scaleable Servers
Jim Gray, Microsoft Research
- Substantial progress has been made towards the
goal of building supercomputers by composing
arrays of commodity processors, disks, and
networks into a cluster that provides a single
system image. True, vector-supers still are 10x
faster than commodity processors on certain
floating point computations, but they cost
disproportionately more. Indeed, the
highest-performance computations are now
performed by processor arrays. In the broader
context of business and internet computing,
processor arrays long ago surpassed mainframe
performance, and for a tiny fraction of the cost.
This talk first reviews this history and
describes the current landscape of scaleable
servers in the commercial, internet, and
scientific segments. The talk then discusses
the Achilles heels of scaleable systems:
programming tools and system management. There
has been relatively little progress in either
area, which suggests some important directions
for computer systems research.
2. Outline
- Scaleability: MAPS
- Scaleup has limits; scaleout for really big jobs
- Two generic kinds of computing: many little and few big
- Many little has a credible programming model: tp, web, mail, fileserver, all based on RPC
- Few big has had marginal success (best is DSS)
- Rivers and objects
3. Scaleability: Scale Up and Scale Out
- Grow Up with SMP: 4xP6 is now standard
- Grow Out with Cluster: a cluster has inexpensive parts
- Cluster of PCs
4. Key Technologies
- Hardware
- commodity processors
- nUMA
- Smart Storage
- SAN/VIA
- Software
- Directory Services
- Security Domains
- Process/Data migration
- Load balancing
- Fault tolerance
- RPC/Objects
- Streams/Rivers
5. MAPS - The Problems
- Manageability: N machines are N times harder to manage
- Availability: N machines fail N times more often
- Programmability: N machines are 2N times harder to program
- Scaleability: N machines cost N times more but do little more work
6. Manageability
- Goal: systems are self-managing
- N systems as easy to manage as one system
- Some progress:
- Distributed name servers (give transparent naming)
- Distributed security
- Auto cooling of disks
- Auto scheduling and load balancing
- Global event log (reporting)
- Automate most routine tasks
- Still very hard and app-specific
7. Availability
- Redundancy allows failover/migration (processes, disks, links)
- Good progress on technology (theory and practice)
- Migration is also good for load balancing
- The transaction concept helps exception handling
8. Programmability and Scaleability
- That's what the rest of this talk is about
- Success on embarrassingly parallel jobs: file server, mail, transactions, web, crypto
- Limited success on batch: relational DBMSs, PVM, ...
9. Outline
- Scaleability: MAPS
- Scaleup has limits; scaleout for really big jobs
- Two generic kinds of computing: many little and few big
- Many little has a credible programming model: tp, web, mail, fileserver, all based on RPC
- Few big has had marginal success (best is DSS)
- Rivers and objects
10. Scaleup Has Limits (chart courtesy of Catharine Van Ingen)
- Vector supers: 10x supers - 3 GFlops, bus/memory 20 GBps, IO 1 GBps
- Supers: 10x PCs - 300 MFlops, bus/memory 2 GBps, IO 1 GBps
- PCs are slow - 30 MFlops, bus/memory 200 MBps, IO 100 MBps
11. Loki: Pentium Clusters for Science (http://loki-www.lanl.gov/)
- 16 Pentium Pro processors x 5 Fast Ethernet interfaces
- 2 GB RAM
- 50 GB disk
- 2 Fast Ethernet switches
- Linux...
- 1.2 real GFlops for $63,000 (but that is the 1996 price)
- The Beowulf project is similar: http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html
- Scientists want cheap mips.
12. Your Tax Dollars At Work: ASCI for Stockpile Stewardship
- Intel/Sandia: 9000 x 1-node PPro
- LLNL/IBM: 512 x 8 PowerPC (SP2)
- LANL/Cray: ?
- Maui Supercomputer Center: 512 x 1 SP2
13. TOP500 Systems by Vendor (courtesy of Larry Smarr, NCSA)
[Chart: number of TOP500 systems by vendor (CRI, SGI, IBM, Convex, HP, Sun, TMC, DEC, Intel, Japanese vector machines, other) for each list from Jun-93 through Jun-98]
- TOP500 Reports: http://www.netlib.org/benchmark/top500.html
14NCSA Super Cluster
http//access.ncsa.uiuc.edu/CoverStories/SuperClus
ter/super.html
- National Center for Supercomputing
ApplicationsUniversity of Illinois _at_ Urbana - 512 Pentium II cpus, 2,096 disks, SAN
- Compaq HP Myricom WindowsNT
- A Super Computer for 3M
- Classic Fortran/MPI programming
- DCOM programming model
15. A Variety of Discipline Codes - Single Processor Performance: Origin vs. T3E, nUMA vs. UMA (courtesy of Larry Smarr, NCSA)
16. Basket of Applications: Average Performance as Percentage of Linpack Performance (courtesy of Larry Smarr, NCSA)
[Chart: application codes in CFD, Biomolecular, Chemistry, Materials, and QCD achieve roughly 14-33% of Linpack performance]
17. Observations
- Uniprocessor RAP << PAP (real application performance << peak advertised performance)
- Growth has slowed (Bell Prize):
- 1987: 0.5 GFLOPS
- 1988: 1.0 GFLOPS (1 year)
- 1990: 14 GFLOPS (2 years)
- 1994: 140 GFLOPS (4 years)
- 1998: 604 GFLOPS
- xxx: 1 TFLOPS (5 years?)
- Time gap = 2^(N-1) or 2^N - 1 years, where N = log(performance) - 9
18. Commercial Clusters
- A 16-node cluster: 64 cpus, 2 TB of disk, decision support
- A 45-node cluster: 140 cpus, 14 GB DRAM, 4 TB RAID disk, OLTP (Debit Credit), 1 B tpd (14 k tps)
19. Oracle/NT
- 27,383 tpmC
- $71.50/tpmC
- 4 x 6 cpus
- 384 disks, 2.7 TB
20. 24 cpus, 384 disks (2.7 TB)
21. Microsoft.com: 150x4 nodes (3)
22. The Microsoft TerraServer Hardware
- Compaq AlphaServer 8400
- 8 x 400 MHz Alpha cpus
- 10 GB DRAM
- 324 x 9.2 GB StorageWorks disks
- 3 TB raw, 2.4 TB of RAID5
- STK 9710 tape robot (4 TB)
- WindowsNT 4 EE, SQL Server 7.0
23. TerraServer Example: Lots of Web Hits
- 1 TB, the largest SQL DB on the Web
- 99.95% uptime since 1 July 1998
- No downtime in August
- No NT failures (ever)
- Most downtime is for SQL software upgrades
24. HotMail: 400 Computers
25. Outline
- Scaleability: MAPS
- Scaleup has limits; scaleout for really big jobs
- Two generic kinds of computing: many little and few big
- Many little has a credible programming model: tp, web, mail, fileserver, all based on RPC
- Few big has had marginal success (best is DSS)
- Rivers and objects
26. Two Generic Kinds of Computing
- Many little: embarrassingly parallel; fits the RPC model; fits the partitioned data and computation model; random works OK; OLTP, file server, email, web, ...
- Few big: sometimes not obviously parallel; does not fit the RPC model (BIG RPCs); scientific, simulation, data mining, ...
27. Many Little Programming Model
- Many small requests
- Route requests to data
- Encapsulate data with procedures (objects)
- Three-tier computing
- RPC is a convenient/appropriate model
- Transactions are a big help in error handling
- Auto-partition (e.g. hash data and computation; see the sketch after this list)
- Works fine
- Software CyberBricks
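A minimal sketch, not from the talk, of the route-requests-to-data idea: hash the request key to pick the partition that owns the data, then issue the small RPC there. The Node type, the node list, and the key format are hypothetical.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical cluster node; in practice this would hold an RPC endpoint.
    struct Node { std::string address; };

    // Route a request to the node that owns the key: hash(key) mod N.
    // Data and computation are partitioned the same way, so the request
    // lands on the node that already holds its data.
    const Node& route(const std::vector<Node>& nodes, const std::string& key) {
        std::size_t h = std::hash<std::string>{}(key);
        return nodes[h % nodes.size()];
    }

    // Example: route(nodes, "customer:1234") picks one node; the small RPC
    // (lookup, debit/credit, mail delivery, ...) is then sent to it.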
28. Object Oriented Programming: Parallelism From Many Little Jobs
- Gives location transparency
- ORB / web server / TP monitor multiplexes clients to servers
- Enables distribution
- Exploits embarrassingly parallel apps (transactions)
- HTTP and RPC (DCOM, CORBA, RMI, IIOP, ...) are the basis
29. Few Big Programming Model
- Finding parallelism is hard
- Pipelines are short (3x to 6x speedup)
- Spreading objects/data is easy, but getting locality is HARD
- Mapping a big job onto a cluster is hard
- Scheduling is hard: coarse-grained (job) and fine-grained (co-schedule)
- Fault tolerance is hard
30. Kinds of Parallel Execution
- Pipeline: any sequential program feeds its output to the next sequential program
- Partition: inputs split N ways, outputs merge M ways; each partition runs any sequential program (see the sketch below)
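A minimal sketch, using only the standard library, of partition parallelism as described above: split the input N ways, run the same sequential program on each partition, then merge the outputs. The sum is just a stand-in for any sequential program.

    #include <algorithm>
    #include <future>
    #include <numeric>
    #include <vector>

    // Split the input N ways, run the same sequential program on each
    // partition in parallel, then merge the partial results.
    long parallel_sum(const std::vector<int>& data, int n_partitions) {
        std::vector<std::future<long>> parts;
        std::size_t chunk = (data.size() + n_partitions - 1) / n_partitions;
        for (int p = 0; p < n_partitions; ++p) {
            std::size_t lo = std::min(data.size(), p * chunk);
            std::size_t hi = std::min(data.size(), lo + chunk);
            parts.push_back(std::async(std::launch::async, [&data, lo, hi] {
                return std::accumulate(data.begin() + lo, data.begin() + hi, 0L);
            }));
        }
        long total = 0;                        // merge step
        for (auto& f : parts) total += f.get();
        return total;
    }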
31. Why Parallel Access To Data?
- At 10 MB/s it takes about 1.2 days to scan 1 TB
- 1,000-way parallel: a 100-second SCAN
- BANDWIDTH is the payoff
- Parallelism: divide a big problem into many smaller ones to be solved in parallel (arithmetic check below)
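A quick arithmetic check of those numbers (the 1 TB table size is an assumption taken from the surrounding slides):

    #include <cstdio>

    int main() {
        const double bytes = 1e12;                 // assume a 1 TB table
        const double rate  = 10e6;                 // 10 MB/s per disk/stream
        double serial_s   = bytes / rate;          // 100,000 seconds
        double parallel_s = serial_s / 1000.0;     // 1,000-way parallel scan
        std::printf("serial: %.1f days   parallel: %.0f seconds\n",
                    serial_s / 86400.0, parallel_s);   // ~1.2 days vs 100 s
        return 0;
    }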
32. Why Are Relational Operators Successful for Parallelism?
- The relational data model: uniform operators on uniform data streams, closed under composition
- Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data
- Sequential data in and out: pure dataflow
- Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation
- The payoff: AUTOMATIC PARALLELISM (see the operator sketch below)
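A minimal sketch, not a real DBMS interface, of why uniform stream operators compose: every operator consumes a stream and produces a stream, so a filter over a scan is itself just another operator and can be pipelined or partitioned without either side knowing.

    #include <functional>
    #include <memory>
    #include <optional>
    #include <string>
    #include <utility>

    struct Record { int key; std::string payload; };

    // Every operator yields a uniform stream of records: next() returns the
    // next record, or nothing when the stream is exhausted.
    struct Operator {
        virtual std::optional<Record> next() = 0;
        virtual ~Operator() = default;
    };

    // A trivial source operator for illustration: produces keys [0, n).
    struct RangeScan : Operator {
        int i = 0, n;
        explicit RangeScan(int count) : n(count) {}
        std::optional<Record> next() override {
            if (i >= n) return std::nullopt;
            return Record{i++, "row"};
        }
    };

    // Filter consumes one input stream and produces another. Because the
    // interface is closed under composition, Filter(Filter(RangeScan)) is
    // still just an Operator.
    struct Filter : Operator {
        std::unique_ptr<Operator> input;
        std::function<bool(const Record&)> pred;
        Filter(std::unique_ptr<Operator> in, std::function<bool(const Record&)> p)
            : input(std::move(in)), pred(std::move(p)) {}
        std::optional<Record> next() override {
            while (auto r = input->next())
                if (pred(*r)) return r;
            return std::nullopt;
        }
    };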
33. Database Systems Hide Parallelism
- Automate system management via tools
- data placement
- data organization (indexing)
- periodic tasks (dump / recover / reorganize)
- Automatic fault tolerance
- duplex failover
- transactions
- Automatic parallelism
- among transactions (locking)
- within a transaction (parallel execution)
34. SQL: a Non-Procedural Programming Language
- SQL is a functional programming language: it describes the answer set
- The optimizer picks the best execution plan: the data flow web (pipeline), the degree of parallelism (partitioning), and other execution parameters (process placement, memory, ...)
[Diagram: GUI, Schema, Optimizer, Plan, Executors, Rivers - execution planning and monitoring]
35. Partitioned Execution
- Spreads computation and IO among processors
- Partitioned data gives NATURAL parallelism
36. N x M Way Parallelism
- N inputs, M outputs, no bottlenecks
- Partitioned data; partitioned and pipelined data flows
37Automatic Parallel Object Relational DB
Select image from landsat where date between 1970
and 1990 and overlaps(location, Rockies) and
snow_cover(image) gt.7
Temporal
Spatial
Image
Assign one process per processor/disk find
images with right data location analyze image,
if 70 snow, return it
Landsat
Answer
date
loc
image
image
33N 120W . . . . . . . 34N 120W
1/2/72 . . . . . .. . . 4/8/95
date, location, image tests
38. Data Rivers: Split and Merge Streams
- Producers add records to the river; consumers take records from the river
- Purely sequential programming: the river does flow control and buffering, and does the partition and merge of data records
- The river is the Split/Merge in Gamma and the Exchange operator in Volcano / SQL Server (a minimal river sketch follows)
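A minimal sketch, assuming a single process and standard threads rather than a real SAN, of the river mechanism: producers push records into per-consumer queues chosen by a hash split, consumers read their queue purely sequentially, and the bounded queue supplies the buffering and flow control.

    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <memory>
    #include <mutex>
    #include <optional>
    #include <vector>

    struct Record { int key; double value; };

    // One branch of the river: a bounded queue that buffers records and
    // blocks the producer when the consumer falls behind (flow control).
    class RiverQueue {
        std::deque<Record> buf;
        std::mutex m;
        std::condition_variable not_full, not_empty;
        bool closed = false;
        static constexpr std::size_t capacity = 1024;
    public:
        void put(const Record& r) {
            std::unique_lock<std::mutex> lk(m);
            not_full.wait(lk, [&] { return buf.size() < capacity; });
            buf.push_back(r);
            not_empty.notify_one();
        }
        void close() {                       // producer is done
            std::lock_guard<std::mutex> lk(m);
            closed = true;
            not_empty.notify_all();
        }
        std::optional<Record> get() {        // returns nothing at end of stream
            std::unique_lock<std::mutex> lk(m);
            not_empty.wait(lk, [&] { return !buf.empty() || closed; });
            if (buf.empty()) return std::nullopt;
            Record r = buf.front();
            buf.pop_front();
            not_full.notify_one();
            return r;
        }
    };

    // Split: the producer stays purely sequential; the river decides which
    // consumer gets each record (hash partitioning on the key).
    void split(const Record& r,
               std::vector<std::unique_ptr<RiverQueue>>& consumers) {
        consumers[static_cast<std::size_t>(r.key) % consumers.size()]->put(r);
    }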
39. Generalization: Object-Oriented Rivers
- Rivers transport a sub-class of record-set (a stream of objects); the record type and partitioning are part of the subclass
- Node transformers are data pumps: an object with river inputs and outputs, with late binding to the record type
- Programming becomes data flow programming: specify the pipelines
- The compiler/scheduler does data partitioning and transformer placement (see the transformer sketch below)
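A minimal sketch of the transformer-as-data-pump idea, reusing the hypothetical RiverQueue above: a transformer is an object with river inputs and outputs containing only sequential code; partitioning the rivers and placing the transformers is left to the compiler/scheduler.

    // A data pump: an object with river inputs and outputs. The body is
    // plain sequential code; where it runs, and how its rivers are
    // partitioned, is decided by the scheduler, not by the transformer.
    template <typename In, typename Out>
    struct Transformer {
        virtual Out transform(const In& record) = 0;
        virtual ~Transformer() = default;

        // Pump records from the input river to the output river until the
        // input stream ends, then close the output so downstream pumps stop.
        template <typename InRiver, typename OutRiver>
        void pump(InRiver& in, OutRiver& out) {
            while (auto r = in.get())
                out.put(transform(*r));
            out.close();
        }
    };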
40. NT Cluster Sort as a Prototype
- Using data generation and sort as a prototypical app
- The 'Hello World' of distributed processing
- Goal: easy install and execute
41. PennySort
- Hardware: 266 MHz Intel PPro, 64 MB SDRAM (10 ns), dual Fujitsu DMA 3.2 GB EIDE
- Software: NT Workstation 4.3, NT 5 sort
- Performance: sorts 15 M 100-byte records (1.5 GB) disk to disk; elapsed time 820 sec, cpu time 404 sec
42. Remote Install
- Add a Registry entry to each remote node: RegConnectRegistry(), RegCreateKeyEx() (see the sketch below)
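A hedged sketch of that remote-install step using those two Win32 calls; the node name and key path are hypothetical, and error handling is minimal. (Link against advapi32.)

    #include <windows.h>

    // Connect to the registry on a remote node (e.g. L"\\\\node17") and
    // create the application's key so the service can find its configuration.
    bool AddRemoteKey(const wchar_t* machine)
    {
        HKEY hRemote = nullptr, hKey = nullptr;
        // Attach to HKEY_LOCAL_MACHINE on the remote machine.
        if (RegConnectRegistryW(machine, HKEY_LOCAL_MACHINE, &hRemote) != ERROR_SUCCESS)
            return false;
        // Create (or open) a key under SOFTWARE; the path is hypothetical.
        DWORD disposition = 0;
        LONG rc = RegCreateKeyExW(hRemote, L"SOFTWARE\\ClusterSort", 0, nullptr,
                                  REG_OPTION_NON_VOLATILE, KEY_WRITE, nullptr,
                                  &hKey, &disposition);
        if (hKey) RegCloseKey(hKey);
        RegCloseKey(hRemote);
        return rc == ERROR_SUCCESS;
    }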
43. Cluster Startup and Execution
- Setup: fill in a MULTI_QI struct and a COSERVERINFO struct
- Retrieve the remote object handle from the MULTI_QI struct (see the sketch below)
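A hedged sketch of that DCOM activation step: fill in COSERVERINFO with the target node, ask for the interface via a MULTI_QI, and call CoCreateInstanceEx. The CLSID and node name are hypothetical, and CoInitializeEx is assumed to have been called already.

    #include <objbase.h>

    // Activate a worker object on a remote cluster node and return the
    // remote interface pointer retrieved from the MULTI_QI struct.
    IUnknown* StartWorker(const CLSID& clsid, const wchar_t* node)
    {
        COSERVERINFO server = {};                     // which machine to use
        server.pwszName = const_cast<LPWSTR>(node);   // e.g. L"node17"

        MULTI_QI qi = {};                             // which interface we want
        qi.pIID = &IID_IUnknown;

        HRESULT hr = CoCreateInstanceEx(clsid, nullptr, CLSCTX_REMOTE_SERVER,
                                        &server, 1, &qi);
        if (FAILED(hr) || FAILED(qi.hr))
            return nullptr;
        return qi.pItf;                               // remote object handle
    }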
44. Cluster Sort Conceptual Model
- Multiple data sources, multiple data destinations, multiple nodes
- Disks -> sockets -> disk -> disk
[Diagram: input streams holding A, B, and C records are split and merged so that each destination ends up with a single key range]
45. Summary
- Clusters of hardware CyberBricks
- all nodes are very intelligent
- processing migrates to where the power is
- disk, network, and display controllers have a full-blown OS
- send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them
- the computer is a federated distributed system
- Software CyberBricks
- a standard way to interconnect intelligent nodes
- needs an execution model: partition and pipeline, RPC and Rivers
- needs parallelism
46. Recent Progress on Scaleable Servers - Jim Gray, Microsoft Research
47. End
48. What I'm Doing
- TerraServer: a photo of the planet on the web; a database (not a file system); 1 TB now, 15 PB in 10 years; http://www.TerraServer.microsoft.com/
- Sloan Digital Sky Survey: a picture of the universe; just getting started, CyberBricks for astronomers; http://www.sdss.org/
- Sorting: one-node PennySort (http://research.microsoft.com/barc/SortBenchmark/) and multinode NT Cluster sort (shows off SAN and DCOM)
49. What I'm Doing
- NT Clusters: failover (fault tolerance within a cluster); NT Cluster Sort (a balanced IO, cpu, and network benchmark); AlwaysUp (geographical fault tolerance)
- RAGS: random testing of SQL systems (a bug finder)
- Telepresence: working with Gordon Bell on the killer app; FileCast and PowerCast; Cyberversity (an international, on-demand, free university)
50. Outline
- Scaleability: MAPS
- Scaleup has limits; scaleout for really big jobs
- Two generic kinds of computing: many little and few big
- Many little has a credible programming model: tp, web, fileserver, mail, all based on RPC
- Few big has had marginal success (best is DSS)
- Rivers and objects
51. 4 B PCs (1 Bips, .1 GB DRAM, 10 GB disk, 1 Gbps net, B=G): The Bricks of Cyberspace
- Cost: $1,000
- Come with: NT, DBMS, high-speed net, system management, GUI / OOUI, tools
- Compatible with everyone else
- CyberBricks
52. Super Server: 4T Machine
- An array of 1,000 4B machines: 1 Bips processors, 1 BB DRAM, 10 BB disks, 1 Bbps comm lines, 1 TB tape robot; a few megabucks
- The challenge: manageability, programmability, security, availability, scaleability, affordability - as easy as a single system
- A CyberBrick is a 4B machine
- Future servers are CLUSTERS of processors and discs
- Distributed database techniques make clusters work
53. Cluster Vision: Buying Computers by the Slice
- Rack and stack: mail-order components, plug them into the cluster
- Modular growth without limits: grow by adding small modules
- Fault tolerance: spare modules mask failures
- Parallel execution and data search: use multiple processors and disks
- Clients and servers are made from the same stuff
- Inexpensive: built with commodity CyberBricks
54. Nostalgia: Behemoth in the Basement
- Today's PC is yesterday's supercomputer
- Can use LOTS of them
- Main apps have changed: scientific -> commercial -> web
- Web and transaction servers
- Data mining, web farming
55. Technology Drivers: Disks
- Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
- Disks are on track: 100x in 10 years means a 2 TB 3.5-inch drive
- Shrunk to 1 inch, that is 200 GB
- Disk replaces tape?
- The disk is a supercomputer!