Berkeley NOW Project - PowerPoint PPT Presentation

About This Presentation

Title:

Berkeley NOW Project

Description:

100 node Ultra/Myrinet NOW. NOW 18. Massive Cheap Storage ... Ultra 2, 300 GB raid, 800 GB tape stacker, ATM. scalable backup/restore. Dedicated Info Servers ... – PowerPoint PPT presentation

Slides: 48

Provided by: DavidE1

Learn more at: https://people.eecs.berkeley.edu

Transcript and Presenter's Notes

Title: Berkeley NOW Project

1
Berkeley NOW Project

David E. Culler
culler@cs.berkeley.edu
http://now.cs.berkeley.edu/
Sun Visit
May 1, 1998

2
Project Goals

Make a fundamental change in how we design and
construct large-scale systems
market reality
50%/year performance growth => cannot allow 1-2
year engineering lag
technological opportunity
single-chip Killer Switch => fast, scalable
communication
Highly integrated building-wide system
Explore novel system design concepts in this new
cluster paradigm
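
The cost of engineering lag under the growth rate quoted above can be worked out directly. A minimal sketch, assuming performance compounds at 50% per year; the function name is illustrative and not from the talk.

```python
def factor_behind(annual_growth: float, lag_years: float) -> float:
    """How far a design trails current technology after a given lag,
    assuming performance compounds at a fixed annual rate."""
    return (1.0 + annual_growth) ** lag_years

# The 1-2 year lag named on the slide, at 50%/year growth.
for lag in (1, 2):
    print(f"{lag}-year lag: {factor_behind(0.5, lag):.2f}x behind")
```

Under that assumption, a two-year lag leaves a custom-engineered node more than a factor of two behind a commodity node shipping the same day.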

3
Remember the Killer Micro
Linpack Peak Performance

Technology change in all markets
At many levels: Arch, Compiler, OS, Application

4
Another Technological Revolution

The Killer Switch
single chip building block for scalable networks
high bandwidth
low latency
very reliable
if it's not unplugged
=> System Area Networks

5
One Example: Myrinet

8 bidirectional ports of 160 MB/s each way
< 500 ns routing delay
Simple - just moves the bits
Detects connectivity and deadlock

Tomorrow: gigabit Ethernet?
6
Potential: Snap together large systems

incremental scalability
time / cost to market
independent failure => availability

Node Performance in Large System
Engineering Lag Time
7
Opportunity: Rethink O.S. Design

Remote memory and processor are closer to you
than your own disks!
Networking Stacks ?
Virtual Memory ?
File system design ?

8
Example: Traditional File System
Server
Fast Channel (HIPPI)
Clients

RAID Disk Storage

Global Shared File Cache

Local Private File Cache
Bottleneck

Expensive
Complex
Non-Scalable
Single point of failure

Server resources at a premium
Client resources poorly utilized

9
Truly Distributed File System
Scalable Low-Latency Communication Network
Cluster Caching
Local Cache
Network RAID striping
G = Node Comm BW / Disk BW

VM page to remote memory

10
Fast Communication Challenge
Killer Platform

ns
ms
µs
Killer Switch

Fast processors and fast networks
The time is spent in crossing between them

11
Opening Intelligent Network Interfaces

Dedicated Processing power and storage embedded
in the Network Interface
An I/O card today
Tomorrow on chip?

[Diagram: Sun Ultra 170 with a Myricom NIC on the I/O bus (S-Bus, 50 MB/s), attached to the Myricom network (160 MB/s)]

12
Our Attack: Active Messages
Request
handler
Reply
handler

Request / Reply: small active messages (RPC)
Bulk-Transfer (store & get)
Highly optimized communication layer on a range
of HW
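
The request/reply pattern above can be sketched in a few lines. This is a single-process illustration, not the NOW Active Messages API: the `Endpoint` class, the handler names, and the in-memory queue standing in for the network are all assumptions made for the example.

```python
from collections import deque

class Endpoint:
    """Hypothetical endpoint: a handler table plus an inbound queue."""
    def __init__(self, name):
        self.name = name
        self.handlers = {}
        self.inbox = deque()

    def register(self, handler_id, fn):
        self.handlers[handler_id] = fn

    def send(self, dest, handler_id, *args):
        # Each message names the handler to run when it arrives.
        dest.inbox.append((self, handler_id, args))

    def poll(self):
        # Run the named handler for every pending message.
        while self.inbox:
            src, handler_id, args = self.inbox.popleft()
            self.handlers[handler_id](self, src, *args)

results = []

def add_request(ep, src, a, b):
    # Request handler: compute and send the reply back to the requester.
    ep.send(src, "add_reply", a + b)

def add_reply(ep, src, total):
    # Reply handler: deliver the result into the requester's state.
    results.append(total)

client, server = Endpoint("client"), Endpoint("server")
server.register("add_request", add_request)
client.register("add_reply", add_reply)

client.send(server, "add_request", 2, 3)
server.poll()   # runs the request handler
client.poll()   # runs the reply handler
print(results)
```

The point of the design is that a message carries the identity of its handler, so the receiver does no buffering or matching before the data reaches the computation.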

13
NOW System Architecture
Parallel Apps, Large Seq. Apps
Sockets, Split-C, MPI, HPF, vSM
Global Layer UNIX: Process Migration, Distributed Files, Network RAM, Resource Management
Each node: UNIX Workstation, Comm. SW, Net Inter. HW
Fast Commercial Switch (Myrinet)
14
Outline

Introduction to the NOW project
Quick tour of the NOW lab
Important new system design concepts
Conclusions
Future Directions

15
First HP/FDDI Prototype

FDDI on the HP/735 graphics bus.
First fast msg layer on non-reliable network

16
SparcStation ATM NOW

ATM was going to take over the world.

The original INKTOMI
Today: www.hotbot.com
17
100 node Ultra/Myrinet NOW
18
Massive Cheap Storage

Basic unit
2 PCs double-ending four SCSI chains

Currently serving Fine Art at http://www.thinker.org/imagebase/
19
Cluster of SMPs (CLUMPS)

Four Sun E5000s
8 processors
3 Myricom NICs
Multiprocessor, Multi-NIC, Multi-Protocol

20
Information Servers

Basic Storage Unit
Ultra 2, 300 GB raid, 800 GB tape stacker, ATM
scalable backup/restore
Dedicated Info Servers
web,
security,
mail,
VLANs project into dept.

21
What's Different about Clusters?

Commodity parts?
Communications Packaging?
Incremental Scalability?
Independent Failure?
Intelligent Network Interfaces?
Complete System on every node
virtual memory
scheduler
files
...

22
Three important system design aspects

Virtual Networks
Implicit co-scheduling
Scalable File Transfer

23
Communication Performance => Direct Network
Access
Latency
1/BW

LogP: Latency, Overhead, and Bandwidth
Active Messages: lean layer supporting
programming models
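
A small worked example of how the LogP parameters combine into message time. This sketch uses the standard LogP terms: L for network latency, o for per-message processor overhead at each end, and g for the gap between messages (the reciprocal of per-node bandwidth). The numeric values are made up for illustration and are not measurements from the NOW.

```python
def one_way_time(L: float, o: float) -> float:
    """Send overhead, then network latency, then receive overhead."""
    return o + L + o

def pipelined_time(n: int, L: float, o: float, g: float) -> float:
    """Time until the last of n back-to-back small messages is received.
    The sender can issue one message every max(g, o)."""
    return (n - 1) * max(g, o) + one_way_time(L, o)

# Illustrative values in microseconds (assumed).
L_us, o_us, g_us = 5.0, 3.0, 4.0
print(one_way_time(L_us, o_us))
print(pipelined_time(10, L_us, o_us, g_us))
```

With these numbers overhead accounts for more than half of the one-way time, which is why a lean layer with direct network access matters more than raw link speed.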

24
Example: NAS Parallel Benchmarks

Better node performance than the Cray T3D
Better scalability than the IBM SP-2

25
General purpose requirements

Many timeshared processes
each with direct, protected access
User and system
Client/Server, Parallel clients, parallel servers
they grow, shrink, handle node failures
Multiple packages in a process
each may have own internal communication layer
Use communication as easily as memory

26
Virtual Networks

Endpoint abstracts the notion of "attached to the
network"
Virtual network is a collection of endpoints that
can name each other.
Many processes on a node can each have many
endpoints, each with own protection domain.

27
How are they managed?

How do you get direct hardware access for
performance with a large space of logical
resources?
Just like virtual memory
active portion of large logical space is bound to
physical resources
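
The virtual-memory analogy can be sketched as a small cache of endpoints resident in NIC memory. This is an illustration only: the `EndpointCache` class is hypothetical, and the least-recently-used replacement policy is an assumption, since the slide does not say how the active set is chosen.

```python
from collections import OrderedDict

class EndpointCache:
    """Hypothetical model: a few NIC frames backing many logical endpoints."""
    def __init__(self, nic_frames: int):
        self.nic_frames = nic_frames
        self.resident = OrderedDict()  # endpoint id -> True, oldest first
        self.faults = 0

    def touch(self, endpoint_id: str) -> str:
        if endpoint_id in self.resident:
            # Already bound to NIC memory: direct hardware access.
            self.resident.move_to_end(endpoint_id)
            return "hit"
        # Not resident: bind it, evicting the least recently used (assumed policy).
        self.faults += 1
        if len(self.resident) >= self.nic_frames:
            self.resident.popitem(last=False)
        self.resident[endpoint_id] = True
        return "fault"

cache = EndpointCache(nic_frames=2)
trace = [cache.touch(e) for e in ["a", "b", "a", "c", "b"]]
print(trace, cache.faults)
```

As with paging, the common case (a hit) costs nothing extra, and only a reference outside the resident set pays for rebinding.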

[Diagram: Processor and Host Memory holding Process 1, 2, 3, ... n; Network Interface with NIC Mem and its own processor (P)]
28
Solaris System Abstractions

Segment Driver
manages portions of an address space

Device Driver
manages I/O device

Virtual Network Driver
29
Virtualization is not expensive
30
Bursty Communication among many virtual networks
31
Sustain high BW with many VN
32
Perspective on Virtual Networks

Networking abstractions are vertical stacks
new function => new layer
poke through for performance
Virtual Networks provide a horizontal abstraction
basis for building new, fast services

33
Beyond the Personal Supercomputer

Able to timeshare parallel programs
with fast, protected communication
Mix with sequential and interactive jobs
Use fast communication in OS subsystems
parallel file system, network virtual memory,
Nodes have powerful, local OS scheduler
Problem: local schedulers do not know to run
parallel jobs in parallel

34
Local Scheduling

Local Schedulers act independently
no global control
Program waits while trying to communicate with peers
that are not running
10 - 100x slowdowns for fine-grain programs!
=> need coordinated scheduling

35
Traditional Solution: Gang Scheduling

Global context switch according to precomputed
schedule
Inflexible, inefficient, fault prone

36
Novel Solution: Implicit Coscheduling

Coordinate schedulers using only the
communication in the program
very easy to build
potentially very robust to component failures
inherently service on-demand
scalable
Local service component can evolve.

37
Why it works

Infer non-local state from local observations
React to maintain coordination
observation        implication              action
fast response      partner scheduled        spin
delayed response   partner not scheduled    block
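
The observation/implication/action rule above amounts to a spin-then-block wait. A minimal sketch, assuming a fixed spin limit; the function name and the threshold value are hypothetical, and the talk does not give the limit actually used.

```python
def wait_for_reply(response_time_us: float, spin_limit_us: float):
    """Decide locally how to wait, and what that implies about the partner.

    Returns (action, inference). No global scheduler is consulted: the
    only input is how long the reply took to arrive.
    """
    if response_time_us <= spin_limit_us:
        # Fast response: the partner must be running, so keep the CPU.
        return ("spin", "partner scheduled")
    # Delayed response: the partner is probably descheduled, so yield.
    return ("block", "partner not scheduled")

SPIN_LIMIT_US = 50.0  # assumed threshold, for illustration only
print(wait_for_reply(12.0, SPIN_LIMIT_US))
print(wait_for_reply(4000.0, SPIN_LIMIT_US))
```

Blocking frees the processor for other work, and the reply's arrival wakes the waiting process at about the time its partner is running, which is how the local schedulers drift into coordination.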

38
Example: Synthetic Programs

Range of granularity and load imbalance
spin wait 10x slowdown

39
Implicit Coordination

Surprisingly effective
real programs
range of workloads
simple and robust
Opens many new research questions
fairness
How broadly can implicit coordination be applied
in the design of cluster subsystems?

40
A look at Serious File I/O

Traditional I/O system
NOW I/O system
Benchmark Problem: sort large number of 100-byte
records with 10-byte keys
start on disk, end on disk
accessible as files (use the file system)
Datamation: sort 1 million records
Minute sort: quantity in a minute
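
The record format of the benchmark can be shown with a small in-memory sketch. It assumes the 10-byte key is the first 10 bytes of each record and uses random bytes as input; it does not attempt the disk-to-disk or parallel parts of the problem.

```python
import os

RECORD_SIZE = 100  # bytes per record, from the benchmark definition
KEY_SIZE = 10      # bytes per key; assumed to lead the record

def sort_records(data: bytes) -> bytes:
    """Sort fixed-size records by their key prefix."""
    if len(data) % RECORD_SIZE != 0:
        raise ValueError("input is not a whole number of records")
    records = [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]
    records.sort(key=lambda r: r[:KEY_SIZE])
    return b"".join(records)

data = os.urandom(RECORD_SIZE * 1000)  # 1,000 random records
out = sort_records(data)
keys = [out[i:i + KEY_SIZE] for i in range(0, len(out), RECORD_SIZE)]
print(len(out) // RECORD_SIZE, "records sorted")
```

Because only 10 of every 100 bytes take part in comparisons, the work is dominated by moving data, which is why the benchmark stresses disk and network bandwidth rather than the processor.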

[Diagram: processor-memory (P-M) nodes]
41
World-Record Disk-to-Disk Sort

Sustain 500 MB/s disk bandwidth and 1,000 MB/s
network bandwidth

42
Key Implementation Techniques

Performance Isolation: highly tuned local
disk-to-disk sort
manage local memory
manage disk striping
memory mapped I/O with madvise, buffering
manage overlap with threads
Efficient Communication
completely hidden under disk I/O
competes for I/O bus bandwidth
Self-tuning Software
probe available memory, disk bandwidth, trade-offs

43
Towards a Cluster File System

Remote disk system built on a virtual network

Client
RD server
RDlib
Active msgs
44
Conclusions

Fast, simple Cluster Area Networks are a
technological breakthrough
Complete system on every node makes clusters a
very powerful architecture.
Extend the system globally
virtual memory systems,
schedulers,
file systems, ...
Efficient communication enables new solutions to
classic systems challenges.

45
Millennium Computational Community
[Diagram: campus units linked by Gigabit Ethernet: Business, SIMS, BMRC, Chemistry, C.S., E.E., Biology, Astro, NERSC, M.E., Physics, N.E., Math, IEOR, Transport, Economy, C.E., MSME]
46
Millennium PC Clumps