Recent Progress on Scaleable Servers
Jim Gray, Microsoft Research
1
Recent Progress on Scaleable Servers
Jim Gray, Microsoft Research
  • Substantial progress has been made towards the
    goal of building supercomputers by composing
    arrays of commodity processors, disks, and
    networks into a cluster that provides a single
    system image. True, vector-supers still are 10x
    faster than commodity processors on certain
    floating point computations, but they cost
    disproportionately more. Indeed, the
    highest-performance computations are now
    performed by processor arrays. In the broader
    context of business and internet computing,
    processor arrays long ago surpassed mainframe
    performance, and for a tiny fraction of the cost.
    This talk first reviews this history and
    describes the current landscape of scaleable
    servers in the commercial, internet, and
    scientific segments. The talk then discusses
    the Achilles heels of scaleable systems:
    programming tools and system management. There
    has been relatively little progress in either
    area, which suggests some important directions
    for computer systems research.

2
Outline
  • Scaleability: MAPS
  • Scaleup has limits; scaleout for really big jobs
  • Two generic kinds of computing
  • many little & few big
  • Many little has credible programming model
  • tp, web, mail, fileserver, all based on RPC
  • Few big has marginal success (best is DSS)
  • Rivers and objects

3
Scaleability: Scale Up and Scale Out
Grow Up with SMP: 4xP6 is now standard
Grow Out with Cluster: the cluster has inexpensive parts
Cluster of PCs
4
Key Technologies
  • Hardware
  • commodity processors
  • nUMA
  • Smart Storage
  • SAN/VIA
  • Software
  • Directory Services
  • Security Domains
  • Process/Data migration
  • Load balancing
  • Fault tolerance
  • RPC/Objects
  • Streams/Rivers

5
MAPS - The Problems
  • Manageability: N machines are N times harder to
    manage
  • Availability: N machines fail N times more
    often
  • Programmability: N machines are 2N times harder
    to program
  • Scaleability: N machines cost N times more
    but do little more work.

6
Manageability
  • Goal: systems self-managing
  • N systems as easy to manage as one system
  • Some progress
  • Distributed name servers (gives transparent
    naming)
  • Distributed security
  • Auto cooling of disks
  • Auto scheduling and load balancing
  • Global event log (reporting)
  • Automate most routine tasks
  • Still very hard and app-specific

7
Availability
  • Redundancy allows failover/migration (processes,
    disks, links)
  • Good progress on technology (theory and practice)
  • Migration also good for load balancing
  • Transaction concept helps exception handling

8
Programmability & Scaleability
  • That's what the rest of this talk is about
  • Success on embarrassingly parallel jobs
  • file server, mail, transactions, web, crypto
  • Limited success on batch
  • relational DBMSs, PVM, ...

9
Outline
  • Scaleability: MAPS
  • Scaleup has limits; scaleout for really big jobs
  • Two generic kinds of computing
  • many little & few big
  • Many little has credible programming model
  • tp, web, mail, fileserver, all based on RPC
  • Few big has marginal success (best is DSS)
  • Rivers and objects

10
Scaleup Has Limits (chart courtesy of Catharine
Van Ingen)
  • Vector Supers: 10x supers
  • 3 GFlops
  • bus/memory 20 GBps
  • IO 1 GBps
  • Supers: 10x PCs
  • 300 MFlops
  • bus/memory 2 GBps
  • IO 1 GBps
  • PCs are slow
  • 30 MFlops
  • and bus/memory 200 MBps
  • and IO 100 MBps

11
Loki: Pentium Clusters for Science
(http://loki-www.lanl.gov/)
  • 16 Pentium Pro processors
  • x 5 Fast Ethernet interfaces
  • 2 GBytes RAM
  • 50 GBytes disk
  • 2 Fast Ethernet switches
  • Linux...
  • 1.2 real GFlops for $63,000
  • (but that is the 1996 price)
  • Beowulf project is similar
  • http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html
  • Scientists want cheap mips.

12
Your Tax Dollars At Work: ASCI for Stockpile
Stewardship
  • Intel/Sandia: 9000 x 1-node PPro
  • LLNL/IBM: 512 x 8 PowerPC (SP2)
  • LANL/Cray: ?
  • Maui Supercomputer Center
  • 512 x 1 SP2

13
TOP500 Systems by Vendor (courtesy of Larry Smarr,
NCSA)
[Chart: number of TOP500 systems by vendor, June 1993
through June 1998. Vendors shown: CRI, SGI, IBM, Convex,
HP, Sun, TMC, Intel, DEC, Japanese vector machines, and
Other.]
TOP500 Reports: http://www.netlib.org/benchmark/top500.html
14
NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
  • National Center for Supercomputing Applications,
    University of Illinois @ Urbana
  • 512 Pentium II cpus, 2,096 disks, SAN
  • Compaq + HP + Myricom + WindowsNT
  • A Super Computer for $3M
  • Classic Fortran/MPI programming
  • DCOM programming model

15
A Variety of Discipline Codes - Single Processor
Performance: Origin vs. T3E, nUMA vs UMA
(courtesy of Larry Smarr, NCSA)
16
Basket of Applications: Average Performance as
Percentage of Linpack Performance (courtesy of
Larry Smarr, NCSA)
[Chart: application codes (CFD, Biomolecular, Chemistry,
Materials, QCD); the values shown range from 14 to 33
percent of Linpack performance.]
17
Observations
  • Uniprocessor RAP << PAP
  • real app performance << peak advertised
    performance
  • Growth has slowed (Bell Prize)
  • 1987: 0.5 GFLOPS
  • 1988: 1.0 GFLOPS (1 year)
  • 1990: 14 GFLOPS (2 years)
  • 1994: 140 GFLOPS (4 years)
  • 1998: 604 GFLOPS
  • xxx: 1 TFLOPS (5 years?)
  • Time gap 2^(N-1) or 2^N - 1, where N =
    (log(performance) - 9)

18
Commercial Clusters
  • 16-node Cluster
  • 64 cpus
  • 2 TB of disk
  • Decision support
  • 45-node Cluster
  • 140 cpus
  • 14 GB DRAM
  • 4 TB RAID disk
  • OLTP (Debit Credit)
  • 1 B tpd (14 k tps)

19
Oracle/NT
  • 27,383 tpmC
  • $71.50 / tpmC
  • 4 x 6 cpus
  • 384 disks, 2.7 TB

20
24 cpu, 384 disks (2.7TB)
21
Microsoft.com: 150 x 4-cpu nodes
22
The Microsoft TerraServer Hardware
  • Compaq AlphaServer 8400
  • 8 x 400 MHz Alpha cpus
  • 10 GB DRAM
  • 324 x 9.2 GB StorageWorks disks
  • 3 TB raw, 2.4 TB of RAID5
  • STK 9710 tape robot (4 TB)
  • WindowsNT 4 EE, SQL Server 7.0

23
TerraServer Example: Lots of Web Hits
  • 1 TB, largest SQL DB on the Web
  • 99.95% uptime since 1 July 1998
  • No downtime in August
  • No NT failures (ever)
  • most downtime is for SQL software upgrades

24
HotMail: 400 Computers
25
Outline
  • Scaleability: MAPS
  • Scaleup has limits; scaleout for really big jobs
  • Two generic kinds of computing
  • many little & few big
  • Many little has credible programming model
  • tp, web, mail, fileserver, all based on RPC
  • Few big has marginal success (best is DSS)
  • Rivers and objects

26
Two Generic Kinds of Computing
  • Many little
  • embarrassingly parallel
  • Fit RPC model
  • Fit partitioned data and computation model
  • Random works OK
  • OLTP, File Server, Email, Web,..
  • Few big
  • sometimes not obviously parallel
  • Do not fit RPC model (BIG rpcs)
  • Scientific, simulation, data mining, ...

27
Many Little Programming Model
  • many small requests
  • route requests to data
  • encapsulate data with procedures (objects)
  • three-tier computing
  • RPC is a convenient/appropriate model
  • Transactions are a big help in error handling
  • Auto partition (e.g. hash data and computation;
    see the sketch below)
  • Works fine.
  • Software CyberBricks

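A minimal C++ sketch (an illustration, not the talk's code) of the "route requests to data" idea above: hash each request's key to pick the partition that owns the data, then send the RPC there. The Node type, the node names, and printing instead of issuing a real RPC are assumptions for illustration.

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Node { std::string address; };            // one CyberBrick in the cluster

// Route a request to the node that owns its key: hash, then mod by node count.
const Node& Route(const std::vector<Node>& nodes, const std::string& key) {
    std::size_t h = std::hash<std::string>{}(key);
    return nodes[h % nodes.size()];
}

int main() {
    std::vector<Node> cluster = {{"node0"}, {"node1"}, {"node2"}, {"node3"}};
    // Each small request goes (e.g. via RPC/DCOM) to the node holding its data.
    for (std::string account : {"alice", "bob", "carol"}) {
        std::cout << "debit(" << account << ") -> "
                  << Route(cluster, account).address << "\n";
    }
}

Because both the data and the computation are hash-partitioned, each small request touches a single node, which is why the many-little workloads scale so well.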
28
Object Oriented Programming: Parallelism From Many
Little Jobs
  • Gives location transparency
  • ORB/web/tpmon multiplexes clients to servers
  • Enables distribution
  • Exploits embarrassingly parallel apps
    (transactions)
  • HTTP and RPC (DCOM, CORBA, RMI, IIOP, ...) are
    the basis

TP mon / ORB / web server
29
Few Big Programming Model
  • Finding parallelism is hard
  • Pipelines are short (3x to 6x speedup)
  • Spreading objects/data is easy, but getting
    locality is HARD
  • Mapping big job onto cluster is hard
  • Scheduling is hard
  • coarse grained (job) and fine grain (co-schedule)
  • Fault tolerance is hard

30
Kinds of Parallel Execution
[Diagram: two kinds of parallel execution.
Pipeline: any sequential program feeds the next
sequential program. Partition: outputs split N ways,
inputs merge M ways, across copies of a sequential
program.]
31
Why Parallel Access To Data?
At 10 MB/s it takes 1.2 days to scan a terabyte;
1,000-way parallel, it is a 100 second SCAN.
BANDWIDTH
Parallelism: divide a big problem into many
smaller ones to be solved in parallel.
32
Why are Relational Operators Successful for
Parallelism?
Relational data model: uniform operators on
uniform data streams, closed under composition.
Each operator consumes 1 or 2 input streams.
Each stream is a uniform collection of data.
Sequential data in and out: pure dataflow.
Partitioning some operators (e.g. aggregates,
non-equi-join, sort, ...) requires innovation.
AUTOMATIC PARALLELISM
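To make "closed under composition" concrete, here is a minimal C++ sketch (an illustration, not the talk's code) of operators as stream transformers: a scan feeds a filter feeds an aggregate, and any such chain is a pure dataflow that a system could split across processes. The Record layout and the operator set are assumptions.

#include <cstddef>
#include <functional>
#include <iostream>
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct Record { std::string key; double value; };
// A stream is anything that yields the next record until exhausted.
using Stream = std::function<std::optional<Record>()>;

// Leaf operator: scan an in-memory "table".
Stream Scan(std::vector<Record> rows) {
    auto data = std::make_shared<std::vector<Record>>(std::move(rows));
    auto next = std::make_shared<std::size_t>(0);
    return [=]() -> std::optional<Record> {
        if (*next >= data->size()) return std::nullopt;
        return (*data)[(*next)++];
    };
}

// Composable operator: consumes one stream, produces another.
Stream Filter(Stream in, std::function<bool(const Record&)> pred) {
    return [=]() -> std::optional<Record> {
        while (auto r = in())
            if (pred(*r)) return r;                  // pass matching records on
        return std::nullopt;
    };
}

// Sink operator: consumes a stream, produces a value.
double SumValues(Stream in) {
    double total = 0;
    while (auto r = in()) total += r->value;
    return total;
}

int main() {
    // scan -> filter -> aggregate: a tiny dataflow plan.
    Stream plan = Filter(Scan({{"a", 1}, {"b", 2}, {"c", 3}}),
                         [](const Record& r) { return r.value > 1; });
    std::cout << SumValues(plan) << "\n";            // prints 5
}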
33
Database Systems Hide Parallelism
  • Automate system management via tools
  • data placement
  • data organization (indexing)
  • periodic tasks (dump / recover / reorganize)
  • Automatic fault tolerance
  • duplex & failover
  • transactions
  • Automatic parallelism
  • among transactions (locking)
  • within a transaction (parallel execution)

34
SQL: a Non-Procedural Programming Language
  • SQL is a functional programming language: it
    describes the answer set.
  • Optimizer picks best execution plan
  • Picks data flow web (pipeline),
  • degree of parallelism (partitioning)
  • other execution parameters (process placement,
    memory,...)

[Diagram: planning and execution components: GUI,
Schema, Optimizer, Plan, Executors, Rivers, and an
Execution Monitor.]
35
Partitioned Execution
Spreads computation and IO among processors.
Partitioned data gives NATURAL parallelism.
36
N x M Way Parallelism
N inputs, M outputs, no bottlenecks.
Partitioned data; partitioned and pipelined data flows.
37
Automatic Parallel Object Relational DB
Select image
from landsat
where date between 1970 and 1990
  and overlaps(location, Rockies)
  and snow_cover(image) > .7
Temporal, spatial, and image predicates.
Assign one process per processor/disk: find the
images with the right date and location, analyze
the image, and if it is more than 70% snow, return it.
[Diagram: the landsat table (date, loc, image) is
partitioned across disks; each partition applies the
date, location, and image tests and feeds the answer.]
38
Data Rivers: Split + Merge Streams
Producers add records to the river; consumers
consume records from the river. Purely sequential
programming. The river does flow control and
buffering, and does the partition and merge of data
records. River = Split/Merge in Gamma =
Exchange operator in Volcano / SQL Server.
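Below is a minimal single-process C++ sketch (an assumption, not Gamma's or SQL Server's actual code) of the river idea: the producer and consumer run purely sequential loops while the river does the buffering and flow control. Partitioning and merging across nodes are left out.

#include <condition_variable>
#include <cstddef>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

template <typename T>
class River {
public:
    explicit River(std::size_t capacity) : capacity_(capacity) {}

    void Put(T record) {                       // producer side: blocks when full
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < capacity_ || closed_; });
        q_.push(std::move(record));
        not_empty_.notify_one();
    }
    std::optional<T> Get() {                   // consumer side: blocks when empty
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;   // closed and fully drained
        T r = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return r;
    }
    void Close() {                             // no more records will arrive
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        not_empty_.notify_all();
        not_full_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<T> q_;
    std::size_t capacity_;
    bool closed_ = false;
};

int main() {
    River<int> river(8);
    std::thread producer([&] {                 // purely sequential producer
        for (int i = 0; i < 100; ++i) river.Put(i);
        river.Close();
    });
    long long sum = 0;
    while (auto r = river.Get()) sum += *r;    // purely sequential consumer
    producer.join();
    std::cout << "sum = " << sum << "\n";      // 4950
}

In a real river the Put side would also partition records across N consumers and the Get side would merge streams from M producers, which is the split/merge the slide describes.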
39
Generalization: Object-oriented Rivers
  • Rivers transport a sub-class of record-set (a
    stream of objects)
  • record type and partitioning are part of subclass
  • Node transformers are data pumps
  • an object with river inputs and outputs
  • do late-binding to record-type
  • Programming becomes data flow programming
  • specify the pipelines
  • Compiler/Scheduler does data partitioning and
    transformer placement

40
NT Cluster Sort as a Prototype
  • Using
  • data generation and
  • sort as a prototypical app
  • the "hello world" of distributed processing
  • goal: easy install & execute

41
PennySort
  • Hardware
  • 266 MHz Intel PPro
  • 64 MB SDRAM (10 ns)
  • Dual Fujitsu DMA 3.2 GB EIDE
  • Software
  • NT workstation 4.3
  • NT 5 sort
  • Performance
  • sort 15 M 100-byte records (1.5 GB; record
    format sketched below)
  • Disk to disk
  • elapsed time 820 sec
  • cpu time 404 sec

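For concreteness, a small C++ sketch of the record format and the in-memory part of the job. The 100-byte record with a 10-byte key is the standard sort-benchmark format the slide refers to; the record count, random seed, and in-memory (rather than disk-to-disk) sort are illustrative assumptions.

#include <algorithm>
#include <array>
#include <iostream>
#include <random>
#include <vector>

struct Record {
    std::array<unsigned char, 10> key;       // sort key
    std::array<unsigned char, 90> payload;   // rest of the 100-byte record
};

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> byte(0, 255);

    std::vector<Record> records(100000);     // small stand-in for 15 M records
    for (Record& r : records) {
        for (auto& b : r.key) b = static_cast<unsigned char>(byte(rng));
        r.payload.fill(0);
    }
    // Sort by the 10-byte key (lexicographic byte comparison).
    std::sort(records.begin(), records.end(),
              [](const Record& a, const Record& b) { return a.key < b.key; });
    std::cout << "sorted " << records.size() << " records\n";
}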
42
Remote Install
  • Add Registry entry to each remote node.

RegConnectRegistry() RegCreateKeyEx()
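A hedged C++ sketch of that remote-install step using the two Win32 calls named above; the key path, value, and machine name are hypothetical placeholders, not the talk's actual settings.

#include <windows.h>
#include <iostream>

bool AddRemoteRegistryEntry(const wchar_t* machine /* e.g. L"\\\\node7" */) {
    HKEY remoteRoot = nullptr;
    // Connect to HKEY_LOCAL_MACHINE on the remote node.
    if (RegConnectRegistryW(machine, HKEY_LOCAL_MACHINE, &remoteRoot)
            != ERROR_SUCCESS)
        return false;

    HKEY appKey = nullptr;
    DWORD disposition = 0;
    // Create (or open) the application's key on that node.
    LONG rc = RegCreateKeyExW(remoteRoot,
                              L"SOFTWARE\\ClusterSort",   // hypothetical subkey
                              0, nullptr, REG_OPTION_NON_VOLATILE,
                              KEY_WRITE, nullptr, &appKey, &disposition);
    if (rc == ERROR_SUCCESS) RegCloseKey(appKey);
    RegCloseKey(remoteRoot);
    return rc == ERROR_SUCCESS;
}

int main() {
    std::wcout << (AddRemoteRegistryEntry(L"\\\\node7") ? L"ok" : L"failed")
               << L"\n";
}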
43
Cluster Startup & Execution
  • Setup
  • MULTI_QI struct
  • COSERVERINFO struct
  • CoCreateInstanceEx()
  • Retrieve remote object handle
  • from MULTI_QI struct
  • Invoke methods as usual (see the sketch below)

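A hedged C++ sketch of the startup sequence listed above: COSERVERINFO names the node, MULTI_QI carries the requested interface, and CoCreateInstanceEx activates the remote object. The flow follows the slide; the CLSID/IID names in the usage comment are hypothetical.

#include <windows.h>
#include <objbase.h>

HRESULT StartWorker(const wchar_t* node, REFCLSID clsid, REFIID iid,
                    IUnknown** worker) {
    COSERVERINFO server = {};
    server.pwszName = const_cast<wchar_t*>(node);   // e.g. L"node7"

    MULTI_QI qi = {};
    qi.pIID = &iid;                                  // interface we want back

    HRESULT hr = CoCreateInstanceEx(clsid, nullptr, CLSCTX_REMOTE_SERVER,
                                    &server, 1, &qi);
    if (SUCCEEDED(hr) && SUCCEEDED(qi.hr)) {
        *worker = qi.pItf;                           // remote object handle
        return S_OK;
    }
    return FAILED(hr) ? hr : qi.hr;
}

// Usage (inside a process that has called CoInitializeEx; the
// CLSID_SortWorker / IID_ISortWorker names are hypothetical):
//   IUnknown* w = nullptr;
//   if (SUCCEEDED(StartWorker(L"node7", CLSID_SortWorker, IID_ISortWorker, &w))) {
//       // ... invoke methods on the proxy as usual, then w->Release();
//   }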
44
Cluster Sort Conceptual Model
  • Multiple Data Sources
  • Multiple Data Destinations
  • Multiple nodes
  • Disks -> Sockets -> Disk -> Disk (partitioning
    sketched below)

[Diagram: records with keys A, B, and C start mixed
across the source nodes; each record is routed over a
socket to the node that owns its key range, so every
destination node holds only its own keys.]
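A minimal C++ sketch (an assumption about the shuffle step, not the actual NT cluster sort code) of how each source node could range-partition its local records so every destination node receives only its own key range; the per-node buffers stand in for the sockets.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Map a record's key to one of `nodes` destinations by its first byte.
std::size_t DestinationNode(const std::string& record, std::size_t nodes) {
    unsigned char firstByte = record.empty() ? 0 : record[0];
    return (firstByte * nodes) / 256;        // even split of the byte range
}

int main() {
    const std::size_t kNodes = 3;
    // Stand-in for the per-destination sockets: one buffer per node.
    std::vector<std::vector<std::string>> outbound(kNodes);

    // Stand-in for the local input file of mixed records.
    // (Real benchmark keys are random bytes, so the split is even;
    // ASCII examples skew toward the low partitions.)
    std::vector<std::string> localRecords = {"Banana", "Apple", "yam", "cherry"};
    for (const std::string& rec : localRecords)
        outbound[DestinationNode(rec, kNodes)].push_back(rec);

    for (std::size_t n = 0; n < kNodes; ++n) {
        std::cout << "to node " << n << ":";
        for (const std::string& rec : outbound[n]) std::cout << " " << rec;
        std::cout << "\n";
    }
}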
45
Summary
  • Clusters of Hardware CyberBricks
  • all nodes are very intelligent
  • Processing migrates to where the power is
  • Disk, network, display controllers have
    full-blown OS
  • Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA)
  • Computer is a federated distributed system.
  • Software CyberBricks
  • standard way to interconnect intelligent nodes
  • needs an execution model
  • partition and pipeline (RPC and Rivers)
  • needs parallelism

46
Recent Progress on Scaleable Servers
Jim Gray, Microsoft Research
47
end

48
What I'm Doing
  • TerraServer: photo of the planet on the web
  • a database (not a file system)
  • 1 TB now, 15 PB in 10 years
  • http://www.TerraServer.microsoft.com/
  • Sloan Digital Sky Survey: picture of the universe
  • just getting started, cyberbricks for astronomers
  • http://www.sdss.org/
  • Sorting
  • one-node PennySort
    (http://research.microsoft.com/barc/SortBenchmark/)
  • multinode NT Cluster sort (shows off SAN and
    DCOM)

49
What I'm Doing
  • NT Clusters
  • failover: fault tolerance within a cluster
  • NT Cluster Sort: balanced IO, cpu, network
    benchmark
  • AlwaysUp: geographical fault tolerance
  • RAGS: random testing of SQL systems
  • a bug finder
  • Telepresence
  • Working with Gordon Bell on the killer app
  • FileCast and PowerCast
  • Cyberversity (an international, on-demand, free
    university)

50
Outline
  • Scaleability: MAPS
  • Scaleup has limits; scaleout for really big jobs
  • Two generic kinds of computing
  • many little & few big
  • Many little has credible programming model
  • tp, web, fileserver, mail, all based on RPC
  • Few big has marginal success (best is DSS)
  • Rivers and objects

51
4 B PCs (1 Bips, .1 GB DRAM, 10 GB disk, 1 Gbps
Net, BG): The Bricks of Cyberspace
  • Cost: $1,000
  • Come with
  • NT
  • DBMS
  • High speed Net
  • System management
  • GUI / OOUI
  • Tools
  • Compatible with everyone else
  • CyberBricks

52
Super Server: 4T Machine
  • Array of 1,000 4B machines
  • 1 Bips processors
  • 1 BB DRAM
  • 10 BB disks
  • 1 Bbps comm lines
  • 1 TB tape robot
  • A few megabucks
  • Challenge
  • Manageability
  • Programmability
  • Security
  • Availability
  • Scaleability
  • Affordability
  • As easy as a single system

Cyber Brick = a 4B machine
Future servers are CLUSTERS of processors and
discs. Distributed database techniques make
clusters work.
53
Cluster Vision: Buying Computers by the Slice
  • Rack & Stack
  • Mail-order components
  • Plug them into the cluster
  • Modular growth without limits
  • Grow by adding small modules
  • Fault tolerance
  • Spare modules mask failures
  • Parallel execution & data search
  • Use multiple processors and disks
  • Clients and servers made from the same stuff
  • Inexpensive: built with commodity CyberBricks

54
Nostalgia: Behemoth in the Basement
  • today's PC is yesterday's supercomputer
  • Can use LOTS of them
  • Main Apps changed
  • scientific -> commercial -> web
  • Web & Transaction servers
  • Data Mining, Web Farming

55
Technology Drivers: Disks
Kilo Mega Giga Tera Peta Exa Zetta Yotta
  • Disks on track
  • 100x in 10 years: 2 TB 3.5-inch drive
  • Shrink to 1 inch is 200 GB
  • Disk replaces tape?
  • Disk is a super computer!