Title: Recent Progress on Scaleable Servers - Jim Gray, Microsoft Research
1. Recent Progress on Scaleable Servers
Jim Gray, Microsoft Research
- Substantial progress has been made towards the
goal of building supercomputers by composing
arrays of commodity processors, disks, and
networks into a cluster that provides a single
system image. True, vector-supers still are 10x
faster than commodity processors on certain
floating point computations, but they cost
disproportionately more. Indeed, the
highest-performance computations are now
performed by processor arrays. In the broader
context of business and internet computing,
processor arrays long ago surpassed mainframe
performance, and for a tiny fraction of the cost.
This talk first reviews this history and
describes the current landscape of scaleable
servers in the commercial, internet, and
scientific segments. The talk then discusses
the Achilles heels of scaleable systems:
programming tools and system management. There
has been relatively little progress in either
area, which suggests some important directions
for computer systems research.
2. Outline
- Scaleability: MAPS
- Scaleup has limits; scaleout for really big jobs
- Two generic kinds of computing: many little and few big
- Many little has a credible programming model: tp, web, mail, fileserver, all based on RPC
- Few big has had marginal success (best is DSS)
- Rivers and objects
3. Scaleability: Scale Up and Scale Out
- Grow Up with SMP: 4xP6 is now standard
- Grow Out with Cluster: a cluster has inexpensive parts
- Cluster of PCs
4. Key Technologies
- Hardware
- commodity processors
- nUMA
- Smart Storage
- SAN/VIA
- Software
- Directory Services
- Security Domains
- Process/Data migration
- Load balancing
- Fault tolerance
- RPC/Objects
- Streams/Rivers
5. MAPS - The Problems
- Manageability: N machines are N times harder to manage
- Availability: N machines fail N times more often
- Programmability: N machines are 2N times harder to program
- Scaleability: N machines cost N times more but do little more work
6. Manageability
- Goal: systems are self-managing
- N systems as easy to manage as one system
- Some progress:
- Distributed name servers (give transparent naming)
- Distributed security
- Auto cooling of disks
- Auto scheduling and load balancing
- Global event log (reporting)
- Automate most routine tasks
- Still very hard and app-specific
7. Availability
- Redundancy allows failover/migration (processes, disks, links)
- Good progress on technology (theory and practice)
- Migration is also good for load balancing
- The transaction concept helps exception handling
8. Programmability and Scaleability
- That's what the rest of this talk is about
- Success on embarrassingly parallel jobs: file server, mail, transactions, web, crypto
- Limited success on batch: relational DBMSs, PVM, ...
9. Outline
- Scaleability: MAPS
- Scaleup has limits; scaleout for really big jobs
- Two generic kinds of computing: many little and few big
- Many little has a credible programming model: tp, web, mail, fileserver, all based on RPC
- Few big has had marginal success (best is DSS)
- Rivers and objects
10. Scaleup Has Limits (chart courtesy of Catharine Van Ingen)
- Vector supers: 10x supers - 3 GFlops, bus/memory 20 GBps, IO 1 GBps
- Supers: 10x PCs - 300 MFlops, bus/memory 2 GBps, IO 1 GBps
- PCs are slow - 30 MFlops, bus/memory 200 MBps, IO 100 MBps
11. Loki: Pentium Clusters for Science (http://loki-www.lanl.gov/)
- 16 Pentium Pro processors x 5 Fast Ethernet interfaces
- 2 GB RAM
- 50 GB disk
- 2 Fast Ethernet switches
- Linux...
- 1.2 real GFlops for $63,000 (but that is the 1996 price)
- The Beowulf project is similar: http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html
- Scientists want cheap mips.
12. Your Tax Dollars At Work: ASCI for Stockpile Stewardship
- Intel/Sandia: 9000 x 1-node PPro
- LLNL/IBM: 512 x 8 PowerPC (SP2)
- LANL/Cray: ?
- Maui Supercomputer Center: 512 x 1 SP2
13. TOP500 Systems by Vendor (courtesy of Larry Smarr, NCSA)
[Chart: number of TOP500 systems by vendor (CRI, SGI, IBM, Convex, HP, Sun, TMC, DEC, Intel, Japanese vector machines, other) for each list from Jun-93 through Jun-98]
- TOP500 Reports: http://www.netlib.org/benchmark/top500.html
14NCSA Super Cluster
http//access.ncsa.uiuc.edu/CoverStories/SuperClus
ter/super.html
- National Center for Supercomputing
ApplicationsUniversity of Illinois _at_ Urbana - 512 Pentium II cpus, 2,096 disks, SAN
- Compaq HP Myricom WindowsNT
- A Super Computer for 3M
- Classic Fortran/MPI programming
- DCOM programming model
15. A Variety of Discipline Codes - Single Processor Performance: Origin vs. T3E, nUMA vs. UMA (courtesy of Larry Smarr, NCSA)
16. Basket of Applications: Average Performance as Percentage of Linpack Performance (courtesy of Larry Smarr, NCSA)
[Chart: application codes in CFD, Biomolecular, Chemistry, Materials, and QCD achieve roughly 14-33% of Linpack performance]
17. Observations
- Uniprocessor RAP << PAP (real application performance << peak advertised performance)
- Growth has slowed (Bell Prize):
- 1987: 0.5 GFLOPS
- 1988: 1.0 GFLOPS (1 year)
- 1990: 14 GFLOPS (2 years)
- 1994: 140 GFLOPS (4 years)
- 1998: 604 GFLOPS
- xxx: 1 TFLOPS (5 years?)
- Time gap = 2^(N-1) or 2^N - 1 years, where N = log(performance) - 9
18. Commercial Clusters
- A 16-node cluster: 64 cpus, 2 TB of disk, decision support
- A 45-node cluster: 140 cpus, 14 GB DRAM, 4 TB RAID disk, OLTP (Debit Credit), 1 B tpd (14 k tps)
19. Oracle/NT
- 27,383 tpmC
- $71.50/tpmC
- 4 x 6 cpus
- 384 disks, 2.7 TB
20. 24 cpus, 384 disks (2.7 TB)
21. Microsoft.com: 150x4 nodes (3)
22. The Microsoft TerraServer Hardware
- Compaq AlphaServer 8400
- 8 x 400 MHz Alpha cpus
- 10 GB DRAM
- 324 x 9.2 GB StorageWorks disks
- 3 TB raw, 2.4 TB of RAID5
- STK 9710 tape robot (4 TB)
- WindowsNT 4 EE, SQL Server 7.0
23. TerraServer Example: Lots of Web Hits
- 1 TB, the largest SQL DB on the Web
- 99.95% uptime since 1 July 1998
- No downtime in August
- No NT failures (ever)
- Most downtime is for SQL software upgrades
24. HotMail: 400 Computers
25. Outline
- Scaleability: MAPS
- Scaleup has limits; scaleout for really big jobs
- Two generic kinds of computing: many little and few big
- Many little has a credible programming model: tp, web, mail, fileserver, all based on RPC
- Few big has had marginal success (best is DSS)
- Rivers and objects
26. Two Generic Kinds of Computing
- Many little: embarrassingly parallel; fits the RPC model; fits the partitioned data and computation model; random works OK; OLTP, file server, email, web, ...
- Few big: sometimes not obviously parallel; does not fit the RPC model (BIG RPCs); scientific, simulation, data mining, ...
27. Many Little Programming Model
- Many small requests
- Route requests to data
- Encapsulate data with procedures (objects)
- Three-tier computing
- RPC is a convenient/appropriate model
- Transactions are a big help in error handling
- Auto-partition (e.g. hash data and computation; see the sketch after this list)
- Works fine
- Software CyberBricks
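A minimal sketch, not from the talk, of the route-requests-to-data idea: hash the request key to pick the partition that owns the data, then issue the small RPC there. The Node type, the node list, and the key format are hypothetical.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical cluster node; in practice this would hold an RPC endpoint.
    struct Node { std::string address; };

    // Route a request to the node that owns the key: hash(key) mod N.
    // Data and computation are partitioned the same way, so the request
    // lands on the node that already holds its data.
    const Node& route(const std::vector<Node>& nodes, const std::string& key) {
        std::size_t h = std::hash<std::string>{}(key);
        return nodes[h % nodes.size()];
    }

    // Example: route(nodes, "customer:1234") picks one node; the small RPC
    // (lookup, debit/credit, mail delivery, ...) is then sent to it.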
28. Object Oriented Programming: Parallelism From Many Little Jobs
- Gives location transparency
- ORB / web server / TP monitor multiplexes clients to servers
- Enables distribution
- Exploits embarrassingly parallel apps (transactions)
- HTTP and RPC (DCOM, CORBA, RMI, IIOP, ...) are the basis
29. Few Big Programming Model
- Finding parallelism is hard
- Pipelines are short (3x to 6x speedup)
- Spreading objects/data is easy, but getting locality is HARD
- Mapping a big job onto a cluster is hard
- Scheduling is hard: coarse-grained (job) and fine-grained (co-schedule)
- Fault tolerance is hard
30. Kinds of Parallel Execution
- Pipeline: any sequential program feeds its output to the next sequential program
- Partition: inputs split N ways, outputs merge M ways; each partition runs any sequential program (see the sketch below)
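A minimal sketch, using only the standard library, of partition parallelism as described above: split the input N ways, run the same sequential program on each partition, then merge the outputs. The sum is just a stand-in for any sequential program.

    #include <algorithm>
    #include <future>
    #include <numeric>
    #include <vector>

    // Split the input N ways, run the same sequential program on each
    // partition in parallel, then merge the partial results.
    long parallel_sum(const std::vector<int>& data, int n_partitions) {
        std::vector<std::future<long>> parts;
        std::size_t chunk = (data.size() + n_partitions - 1) / n_partitions;
        for (int p = 0; p < n_partitions; ++p) {
            std::size_t lo = std::min(data.size(), p * chunk);
            std::size_t hi = std::min(data.size(), lo + chunk);
            parts.push_back(std::async(std::launch::async, [&data, lo, hi] {
                return std::accumulate(data.begin() + lo, data.begin() + hi, 0L);
            }));
        }
        long total = 0;                        // merge step
        for (auto& f : parts) total += f.get();
        return total;
    }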
31. Why Parallel Access To Data?
- At 10 MB/s it takes about 1.2 days to scan 1 TB
- 1,000-way parallel: a 100-second SCAN
- BANDWIDTH is the payoff
- Parallelism: divide a big problem into many smaller ones to be solved in parallel (arithmetic check below)
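A quick arithmetic check of those numbers (the 1 TB table size is an assumption taken from the surrounding slides):

    #include <cstdio>

    int main() {
        const double bytes = 1e12;                 // assume a 1 TB table
        const double rate  = 10e6;                 // 10 MB/s per disk/stream
        double serial_s   = bytes / rate;          // 100,000 seconds
        double parallel_s = serial_s / 1000.0;     // 1,000-way parallel scan
        std::printf("serial: %.1f days   parallel: %.0f seconds\n",
                    serial_s / 86400.0, parallel_s);   // ~1.2 days vs 100 s
        return 0;
    }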
32. Why Are Relational Operators Successful for Parallelism?
- The relational data model: uniform operators on uniform data streams, closed under composition
- Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data
- Sequential data in and out: pure dataflow
- Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation
- The payoff: AUTOMATIC PARALLELISM (see the operator sketch below)
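A minimal sketch, not a real DBMS interface, of why uniform stream operators compose: every operator consumes a stream and produces a stream, so a filter over a scan is itself just another operator and can be pipelined or partitioned without either side knowing.

    #include <functional>
    #include <memory>
    #include <optional>
    #include <string>
    #include <utility>

    struct Record { int key; std::string payload; };

    // Every operator yields a uniform stream of records: next() returns the
    // next record, or nothing when the stream is exhausted.
    struct Operator {
        virtual std::optional<Record> next() = 0;
        virtual ~Operator() = default;
    };

    // A trivial source operator for illustration: produces keys [0, n).
    struct RangeScan : Operator {
        int i = 0, n;
        explicit RangeScan(int count) : n(count) {}
        std::optional<Record> next() override {
            if (i >= n) return std::nullopt;
            return Record{i++, "row"};
        }
    };

    // Filter consumes one input stream and produces another. Because the
    // interface is closed under composition, Filter(Filter(RangeScan)) is
    // still just an Operator.
    struct Filter : Operator {
        std::unique_ptr<Operator> input;
        std::function<bool(const Record&)> pred;
        Filter(std::unique_ptr<Operator> in, std::function<bool(const Record&)> p)
            : input(std::move(in)), pred(std::move(p)) {}
        std::optional<Record> next() override {
            while (auto r = input->next())
                if (pred(*r)) return r;
            return std::nullopt;
        }
    };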
33. Database Systems Hide Parallelism
- Automate system management via tools
- data placement
- data organization (indexing)
- periodic tasks (dump / recover / reorganize)
- Automatic fault tolerance
- duplex failover
- transactions
- Automatic parallelism
- among transactions (locking)
- within a transaction (parallel execution)
34. SQL: a Non-Procedural Programming Language
- SQL is a functional programming language: it describes the answer set
- The optimizer picks the best execution plan: the data flow web (pipeline), the degree of parallelism (partitioning), and other execution parameters (process placement, memory, ...)
[Diagram: GUI, Schema, Optimizer, Plan, Executors, Rivers - execution planning and monitoring]
35. Partitioned Execution
- Spreads computation and IO among processors
- Partitioned data gives NATURAL parallelism
36. N x M Way Parallelism
- N inputs, M outputs, no bottlenecks
- Partitioned data; partitioned and pipelined data flows
37Automatic Parallel Object Relational DB
Select image from landsat where date between 1970
and 1990 and overlaps(location, Rockies) and
snow_cover(image) gt.7
Temporal
Spatial
Image
Assign one process per processor/disk find
images with right data location analyze image,
if 70 snow, return it
Landsat
Answer
date
loc
image
image
33N 120W . . . . . . . 34N 120W
1/2/72 . . . . . .. . . 4/8/95
date, location, image tests
38. Data Rivers: Split and Merge Streams
- Producers add records to the river; consumers take records from the river
- Purely sequential programming: the river does flow control and buffering, and does the partition and merge of data records
- The river is the Split/Merge in Gamma and the Exchange operator in Volcano / SQL Server (a minimal river sketch follows)
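A minimal sketch, assuming a single process and standard threads rather than a real SAN, of the river mechanism: producers push records into per-consumer queues chosen by a hash split, consumers read their queue purely sequentially, and the bounded queue supplies the buffering and flow control.

    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <memory>
    #include <mutex>
    #include <optional>
    #include <vector>

    struct Record { int key; double value; };

    // One branch of the river: a bounded queue that buffers records and
    // blocks the producer when the consumer falls behind (flow control).
    class RiverQueue {
        std::deque<Record> buf;
        std::mutex m;
        std::condition_variable not_full, not_empty;
        bool closed = false;
        static constexpr std::size_t capacity = 1024;
    public:
        void put(const Record& r) {
            std::unique_lock<std::mutex> lk(m);
            not_full.wait(lk, [&] { return buf.size() < capacity; });
            buf.push_back(r);
            not_empty.notify_one();
        }
        void close() {                       // producer is done
            std::lock_guard<std::mutex> lk(m);
            closed = true;
            not_empty.notify_all();
        }
        std::optional<Record> get() {        // returns nothing at end of stream
            std::unique_lock<std::mutex> lk(m);
            not_empty.wait(lk, [&] { return !buf.empty() || closed; });
            if (buf.empty()) return std::nullopt;
            Record r = buf.front();
            buf.pop_front();
            not_full.notify_one();
            return r;
        }
    };

    // Split: the producer stays purely sequential; the river decides which
    // consumer gets each record (hash partitioning on the key).
    void split(const Record& r,
               std::vector<std::unique_ptr<RiverQueue>>& consumers) {
        consumers[static_cast<std::size_t>(r.key) % consumers.size()]->put(r);
    }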
39. Generalization: Object-Oriented Rivers
- Rivers transport a sub-class of record-set (a stream of objects); the record type and partitioning are part of the subclass
- Node transformers are data pumps: an object with river inputs and outputs, with late binding to the record type
- Programming becomes data flow programming: specify the pipelines
- The compiler/scheduler does data partitioning and transformer placement (see the transformer sketch below)
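A minimal sketch of the transformer-as-data-pump idea, reusing the hypothetical RiverQueue above: a transformer is an object with river inputs and outputs containing only sequential code; partitioning the rivers and placing the transformers is left to the compiler/scheduler.

    // A data pump: an object with river inputs and outputs. The body is
    // plain sequential code; where it runs, and how its rivers are
    // partitioned, is decided by the scheduler, not by the transformer.
    template <typename In, typename Out>
    struct Transformer {
        virtual Out transform(const In& record) = 0;
        virtual ~Transformer() = default;

        // Pump records from the input river to the output river until the
        // input stream ends, then close the output so downstream pumps stop.
        template <typename InRiver, typename OutRiver>
        void pump(InRiver& in, OutRiver& out) {
            while (auto r = in.get())
                out.put(transform(*r));
            out.close();
        }
    };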
40. NT Cluster Sort as a Prototype
- Using data generation and sort as a prototypical app
- The 'Hello World' of distributed processing
- Goal: easy install and execute
41. PennySort
- Hardware: 266 MHz Intel PPro, 64 MB SDRAM (10 ns), dual Fujitsu DMA 3.2 GB EIDE
- Software: NT Workstation 4.3, NT 5 sort
- Performance: sorts 15 M 100-byte records (1.5 GB) disk to disk; elapsed time 820 sec, cpu time 404 sec
42. Remote Install
- Add a Registry entry to each remote node: RegConnectRegistry(), RegCreateKeyEx() (see the sketch below)
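A hedged sketch of that remote-install step using those two Win32 calls; the node name and key path are hypothetical, and error handling is minimal. (Link against advapi32.)

    #include <windows.h>

    // Connect to the registry on a remote node (e.g. L"\\\\node17") and
    // create the application's key so the service can find its configuration.
    bool AddRemoteKey(const wchar_t* machine)
    {
        HKEY hRemote = nullptr, hKey = nullptr;
        // Attach to HKEY_LOCAL_MACHINE on the remote machine.
        if (RegConnectRegistryW(machine, HKEY_LOCAL_MACHINE, &hRemote) != ERROR_SUCCESS)
            return false;
        // Create (or open) a key under SOFTWARE; the path is hypothetical.
        DWORD disposition = 0;
        LONG rc = RegCreateKeyExW(hRemote, L"SOFTWARE\\ClusterSort", 0, nullptr,
                                  REG_OPTION_NON_VOLATILE, KEY_WRITE, nullptr,
                                  &hKey, &disposition);
        if (hKey) RegCloseKey(hKey);
        RegCloseKey(hRemote);
        return rc == ERROR_SUCCESS;
    }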
43. Cluster Startup and Execution
- Setup: fill in a MULTI_QI struct and a COSERVERINFO struct
- Retrieve the remote object handle from the MULTI_QI struct (see the sketch below)
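A hedged sketch of that DCOM activation step: fill in COSERVERINFO with the target node, ask for the interface via a MULTI_QI, and call CoCreateInstanceEx. The CLSID and node name are hypothetical, and CoInitializeEx is assumed to have been called already.

    #include <objbase.h>

    // Activate a worker object on a remote cluster node and return the
    // remote interface pointer retrieved from the MULTI_QI struct.
    IUnknown* StartWorker(const CLSID& clsid, const wchar_t* node)
    {
        COSERVERINFO server = {};                     // which machine to use
        server.pwszName = const_cast<LPWSTR>(node);   // e.g. L"node17"

        MULTI_QI qi = {};                             // which interface we want
        qi.pIID = &IID_IUnknown;

        HRESULT hr = CoCreateInstanceEx(clsid, nullptr, CLSCTX_REMOTE_SERVER,
                                        &server, 1, &qi);
        if (FAILED(hr) || FAILED(qi.hr))
            return nullptr;
        return qi.pItf;                               // remote object handle
    }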
44. Cluster Sort Conceptual Model
- Multiple data sources, multiple data destinations, multiple nodes
- Disks -> sockets -> disk -> disk
[Diagram: input streams holding A, B, and C records are split and merged so that each destination ends up with a single key range]
45. Summary
- Clusters of hardware CyberBricks
- all nodes are very intelligent
- processing migrates to where the power is
- disk, network, and display controllers have a full-blown OS
- send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them
- the computer is a federated distributed system
- Software CyberBricks
- a standard way to interconnect intelligent nodes
- needs an execution model: partition and pipeline, RPC and Rivers
- needs parallelism
46. Recent Progress on Scaleable Servers - Jim Gray, Microsoft Research
47. End
48. What I'm Doing
- TerraServer: a photo of the planet on the web; a database (not a file system); 1 TB now, 15 PB in 10 years; http://www.TerraServer.microsoft.com/
- Sloan Digital Sky Survey: a picture of the universe; just getting started, CyberBricks for astronomers; http://www.sdss.org/
- Sorting: one-node PennySort (http://research.microsoft.com/barc/SortBenchmark/) and multinode NT Cluster sort (shows off SAN and DCOM)
49. What I'm Doing
- NT Clusters: failover (fault tolerance within a cluster); NT Cluster Sort (a balanced IO, cpu, and network benchmark); AlwaysUp (geographical fault tolerance)
- RAGS: random testing of SQL systems (a bug finder)
- Telepresence: working with Gordon Bell on the killer app; FileCast and PowerCast; Cyberversity (an international, on-demand, free university)
50. Outline
- Scaleability: MAPS
- Scaleup has limits; scaleout for really big jobs
- Two generic kinds of computing: many little and few big
- Many little has a credible programming model: tp, web, fileserver, mail, all based on RPC
- Few big has had marginal success (best is DSS)
- Rivers and objects
51. 4 B PCs (1 Bips, .1 GB DRAM, 10 GB disk, 1 Gbps net, B=G): The Bricks of Cyberspace
- Cost: $1,000
- Come with: NT, DBMS, high-speed net, system management, GUI / OOUI, tools
- Compatible with everyone else
- CyberBricks
52. Super Server: 4T Machine
- An array of 1,000 4B machines: 1 Bips processors, 1 BB DRAM, 10 BB disks, 1 Bbps comm lines, 1 TB tape robot; a few megabucks
- The challenge: manageability, programmability, security, availability, scaleability, affordability - as easy as a single system
- A CyberBrick is a 4B machine
- Future servers are CLUSTERS of processors and discs
- Distributed database techniques make clusters work
53. Cluster Vision: Buying Computers by the Slice
- Rack and stack: mail-order components, plug them into the cluster
- Modular growth without limits: grow by adding small modules
- Fault tolerance: spare modules mask failures
- Parallel execution and data search: use multiple processors and disks
- Clients and servers are made from the same stuff
- Inexpensive: built with commodity CyberBricks
54. Nostalgia: Behemoth in the Basement
- Today's PC is yesterday's supercomputer
- Can use LOTS of them
- Main apps have changed: scientific -> commercial -> web
- Web and transaction servers
- Data mining, web farming
55. Technology Drivers: Disks
- Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
- Disks are on track: 100x in 10 years means a 2 TB 3.5-inch drive
- Shrunk to 1 inch, that is 200 GB
- Disk replaces tape?
- The disk is a supercomputer!