Title: What's So Different about Cluster Architectures?
1. What's So Different about Cluster Architectures?
- David E. Culler
- Computer Science Division
- U.C. Berkeley
- http://now.cs.berkeley.edu
2. High Performance Clusters happen
- Many groups have built them.
- Many more are using them.
- Industry is running with it
  - Virtual Interface Architecture
  - System Area Networks
- A powerful, flexible new design technique
3. Outline
- Quick guided tour of Clusters at Berkeley
- Three Important Advances
  - Virtual Networks (Alan Mainwaring)
  - Implicit Co-scheduling (Andrea Arpaci-Dusseau)
  - Scalable I/O (Remzi Arpaci-Dusseau)
- What it means
4. Stop 1: HP/FDDI Prototype
- FDDI on the HP/735 graphics bus.
- First fast message layer on an unreliable network
5. Stop 2: SPARCstation NOW
- ATM was going to take over the world.
- The original INKTOMI
6. Stop 3: Large Ultra/Myrinet NOW
7. Stop 4: Massive Cheap Storage
- Basic unit: 2 PCs double-ending four SCSI chains
- Currently serving Fine Art at http://www.thinker.org/imagebase/
8. Stop 5: Cluster of SMPs (CLUMPS)
- Four Sun E5000s, each with
  - 8 processors
  - 3 Myricom NICs
- Multiprocessor, multi-NIC, multi-protocol
- See S. Lumetta, IPPS '98
9. Stop 6: Information Servers
- Basic storage unit:
  - Ultra 2, 300 GB RAID, 800 GB tape stacker, ATM
  - scalable backup/restore
- Dedicated info servers: web, security, mail
- VLANs project into departments
10. Stop 7: Millennium PC Clumps
- Inexpensive, easy-to-manage cluster
- Replicated in many departments
- Prototype for very large PC cluster
11. So What's So Different?
- Commodity parts?
- Communications Packaging?
- Incremental Scalability?
- Independent Failure?
- Intelligent Network Interfaces?
- Complete System on every node
- virtual memory
- scheduler
- files
- ...
12. Three important system design aspects
- Virtual Networks
- Implicit co-scheduling
- Scalable File Transfer
13. Communication Performance -> Direct Network Access
[Figure: LogP breakdown of message time into latency and 1/BW]
- LogP: Latency, Overhead, and Bandwidth
- Active Messages: a lean layer supporting programming models
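To make "lean layer" concrete, here is a toy, single-process sketch of the Active Messages model in C: each message names a handler that runs on arrival, and a request handler may only reply. The names (endpoint_t, am_deliver) are illustrative assumptions; the real layer performs this dispatch at the network interface rather than by direct call.

```c
#include <stdio.h>
#include <stdint.h>

#define MAX_HANDLERS 8

typedef struct endpoint endpoint_t;
typedef void (*am_handler_t)(endpoint_t *reply_to, uint32_t arg);

struct endpoint {
    const char *name;
    am_handler_t handlers[MAX_HANDLERS];
};

/* "Deliver" a message: dispatch directly to the named handler.  On real
 * hardware the NIC and message layer perform this step. */
static void am_deliver(endpoint_t *dest, endpoint_t *reply_to,
                       int handler_idx, uint32_t arg) {
    dest->handlers[handler_idx](reply_to, arg);
}

static endpoint_t client, server;

static void reply_handler(endpoint_t *reply_to, uint32_t arg) {
    (void)reply_to;
    printf("client got reply: %u\n", arg);
}

/* Request handlers may only issue a reply -- the discipline that keeps
 * the layer lean and deadlock-free. */
static void request_handler(endpoint_t *reply_to, uint32_t arg) {
    am_deliver(reply_to, &server, 0, arg + 1);
}

int main(void) {
    client = (endpoint_t){ .name = "client", .handlers = { reply_handler } };
    server = (endpoint_t){ .name = "server", .handlers = { request_handler } };
    am_deliver(&server, &client, 0, 41);   /* prints "client got reply: 42" */
    return 0;
}
```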
14. General purpose requirements
- Many timeshared processes
  - each with direct, protected access
- User and system
- Client/server, parallel clients, parallel servers
  - they grow, shrink, handle node failures
- Multiple packages in a process
  - each may have its own internal communication layer
- Use communication as easily as memory
15. Virtual Networks
- An endpoint abstracts the notion of being "attached to the network".
- A virtual network is a collection of endpoints that can name each other.
- Many processes on a node can each have many endpoints, each with its own protection domain.
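A data-structure sketch of these definitions, with assumed field names and an assumed per-network size bound:

```c
#include <assert.h>
#include <stdint.h>

#define VN_SIZE 64   /* endpoints per virtual network: an assumed bound */

/* An endpoint: a named attachment point with send/receive queues
 * (omitted) and the protection tag of the virtual network it belongs to. */
typedef struct {
    uint32_t vn_id;            /* which virtual network           */
    uint32_t index;            /* name within that network        */
    uint64_t protection_key;   /* shared by all endpoints in a VN */
} vn_endpoint_t;

/* A virtual network: the set of endpoints that can name each other. */
typedef struct {
    uint32_t id;
    uint64_t protection_key;
    vn_endpoint_t *members[VN_SIZE];
} virtual_network_t;

/* Delivery is legal only inside one protection domain. */
static int vn_may_deliver(const vn_endpoint_t *src, const vn_endpoint_t *dst) {
    return src->vn_id == dst->vn_id &&
           src->protection_key == dst->protection_key;
}

int main(void) {
    vn_endpoint_t a = { 7, 0, 0xBEEF }, b = { 7, 1, 0xBEEF }, c = { 9, 0, 0xCAFE };
    assert(vn_may_deliver(&a, &b));    /* same virtual network: OK  */
    assert(!vn_may_deliver(&a, &c));   /* different network: denied */
    return 0;
}
```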
16. How are they managed?
- How do you get direct hardware access for performance with a large space of logical resources?
- Just like virtual memory:
  - the active portion of the large logical space is bound to physical resources
[Figure: endpoints of processes 1..n in host memory; the active endpoints are bound to NIC memory behind the network interface]
17. Endpoint Transition Diagram
- HOT: R/W, in NIC memory
- WARM: R/O, paged host memory
- COLD: paged host memory
- A write or message arrival makes an endpoint HOT; eviction demotes HOT to WARM; a swap demotes WARM to COLD; reads can be serviced from WARM.
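The diagram reads naturally as a small state machine; a sketch under those assumptions, with illustrative names:

```c
/* Endpoint states and the transitions read off the diagram.  HOT
 * endpoints occupy scarce NIC frames; WARM and COLD ones live in
 * pageable host memory, just as with virtual memory. */
typedef enum { EP_COLD, EP_WARM, EP_HOT } ep_state_t;
typedef enum { EV_WRITE, EV_MSG_ARRIVAL, EV_EVICT, EV_SWAP } ep_event_t;

ep_state_t ep_transition(ep_state_t s, ep_event_t e) {
    switch (e) {
    case EV_WRITE:
    case EV_MSG_ARRIVAL: return EP_HOT;                    /* bind to a NIC frame */
    case EV_EVICT:       return s == EP_HOT  ? EP_WARM : s; /* NIC frame reclaimed */
    case EV_SWAP:        return s == EP_WARM ? EP_COLD : s; /* paged out           */
    }
    return s;
}
```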
18. Network Interface Support
- NIC has endpoint frames
- Services active endpoints
- Signals misses to the driver
  - using a system endpoint
[Figure: NIC memory holds endpoint frames 0..7, each with transmit and receive queues; an endpoint miss is signaled to the driver]
19. Solaris System Abstractions
- Segment driver: manages portions of an address space
- Device driver: manages an I/O device
- The virtual network driver plays both roles
20. LogP Performance
- Competitive latency
- Increased NIC processing
- Difference mostly in
  - ack processing
  - protection check
  - data structures
  - code quality
- Virtualization is cheap
21. Bursty Communication among many
22. Multiple VNs, Single-thread Server
23. Multiple VNs, Multithreaded Server
24. Perspective on Virtual Networks
- Networking abstractions are vertical stacks
  - new function => new layer
  - poke through for performance
- Virtual networks provide a horizontal abstraction
  - a basis for building new, fast services
25. Beyond the Personal Supercomputer
- Able to timeshare parallel programs
  - with fast, protected communication
- Mix with sequential and interactive jobs
- Use fast communication in OS subsystems
  - parallel file system, network virtual memory, ...
- Nodes have a powerful, local OS scheduler
- Problem: local schedulers do not know to run parallel jobs in parallel
26. Local Scheduling
- Schedulers act independently, without global control
- A program waits while trying to communicate with peers that are not running
- 10-100x slowdowns for fine-grained programs!
- => need coordinated scheduling
27. Explicit Coscheduling
- Global context switch according to a precomputed schedule
- How do you build it? Does it work?
28. Typical Cluster Subsystem Structures
[Figure: two structures layered over applications and communication: master-slave, with a master coordinating local services, and peer-to-peer, with a global service spanning the nodes]
29. Ideal Cluster Subsystem Structure
- Obtain coordination without explicit subsystem interaction, using only the events in the program
  - very easy to build
  - potentially very robust to component failures
  - inherently service-on-demand
  - scalable
- The local service component can evolve.
30. Three approaches examined in NOW
- GLUNIX: explicit master-slave (user level)
  - matrix algorithm to pick the PP
  - uses stop signals to try to force the desired PP to run
- Explicit peer-to-peer scheduling assisted by VNs
  - co-scheduling daemons decide on the PP and kick the Solaris scheduler
- Implicit
  - modify the parallel run-time library so it gets itself co-scheduled with the standard scheduler
31. Problems with explicit coscheduling
- Implementation complexity
- Need to identify parallel programs in advance
- Interacts poorly with interactive use and load imbalance
- Introduces new potential faults
- Scalability
32. Why implicit coscheduling might work
- Active message request-reply model
- Infer non-local state from local observations; react to maintain coordination

  observation      | implication           | action
  fast response    | partner scheduled     | spin
  delayed response | partner not scheduled | block
33. Obvious Questions
- Does it work?
- How long do you spin?
- What are the requirements on the local scheduler?
34. How Long to Spin?
- Answer: round-trip time + 5 x wake-up time
  - round-trip to stay scheduled together
  - plus one wake-up to get scheduled together
  - plus one wake-up to be competitive with the cost of blocking
  - plus 3 x wake-up to meet the pairwise cost
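The inference rules of slide 32 and the threshold above combine into a simple two-phase waiting loop. A minimal sketch, assuming hypothetical primitives (now_us, poll_for_reply, block_until_reply) standing in for the runtime's real ones:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed primitives, not the actual runtime interface. */
extern uint64_t now_us(void);         /* monotonic clock, microseconds  */
extern bool poll_for_reply(void);     /* true once the reply arrives    */
extern void block_until_reply(void);  /* yield the CPU to the scheduler */

/* Two-phase waiting: spin for round-trip + 5 x wake-up, then block. */
void wait_for_reply(uint64_t round_trip_us, uint64_t wakeup_us) {
    uint64_t spin_limit = round_trip_us + 5 * wakeup_us;
    uint64_t start = now_us();

    /* A fast response implies the partner is scheduled: keep spinning so
     * the two jobs stay coscheduled.  No response within the limit
     * implies it is not: block and let the local scheduler run others. */
    while (now_us() - start < spin_limit) {
        if (poll_for_reply())
            return;
    }
    block_until_reply();
}
```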
35. Does it work?
36. Synthetic Bulk-synchronous Apps
- Range of granularity and load imbalance
- spin-wait: 10x slowdown
37. With a mixture of reads
- block-immediate: 4x slowdown
38. Timesharing Split-C Programs
39. Many Questions
- What about
  - a mix of jobs?
  - sequential jobs?
  - unbalanced placement?
  - fairness?
  - scalability?
- How broadly can implicit coordination be applied in the design of cluster subsystems?
40. A Look at Serious File I/O
- Traditional I/O system
- NOW I/O system
- Benchmark problem: sort a large number of 100-byte records with 10-byte keys
  - start on disk, end on disk
  - accessible as files (use the file system)
  - Datamation sort: 1 million records
  - Minute sort: as much as possible in one minute
[Figure: cluster of processor-memory (P-M) nodes]
41. NOW-Sort Algorithm: 1 pass
- Read: N/P records from disk -> memory
- Distribute: send keys to the processors holding the result buckets
- Sort: partial radix sort on each bucket
- Write: gather and write records to disk
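To make the per-node work concrete, here is a toy, single-node analogue in C: it "distributes" records into buckets by the first key byte, then sorts each bucket. The record layout (100-byte records, 10-byte keys) follows the benchmark definition above; qsort stands in for the tuned partial radix sort, and all disk and network I/O is omitted.

```c
#include <stdlib.h>
#include <string.h>

#define KEY_BYTES 10
#define REC_BYTES 100   /* Datamation record: 10-byte key + 90-byte payload */

typedef struct { unsigned char bytes[REC_BYTES]; } record_t;

static int cmp_key(const void *a, const void *b) {
    return memcmp(a, b, KEY_BYTES);   /* key is the first 10 bytes */
}

/* Sort n records: a counting pass on the first key byte picks bucket
 * boundaries ("distribute"), then each bucket is sorted independently. */
void node_sort(record_t *recs, size_t n) {
    size_t count[256] = {0}, start[256] = {0}, next[256];
    record_t *out = malloc(n * sizeof *out);
    if (!out) return;

    for (size_t i = 0; i < n; i++)            /* histogram of bucket sizes */
        count[recs[i].bytes[0]]++;
    for (int b = 1; b < 256; b++)             /* prefix sums = bucket offsets */
        start[b] = start[b - 1] + count[b - 1];
    memcpy(next, start, sizeof next);

    for (size_t i = 0; i < n; i++)            /* scatter records into buckets */
        out[next[recs[i].bytes[0]]++] = recs[i];

    for (int b = 0; b < 256; b++)             /* sort each bucket locally */
        qsort(out + start[b], count[b], sizeof *out, cmp_key);

    memcpy(recs, out, n * sizeof *recs);
    free(out);
}
```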
42. Key Implementation Techniques
- Performance isolation: a highly tuned local disk-to-disk sort
  - manage local memory
  - manage disk striping
  - memory-mapped I/O with madvise and buffering (sketched below)
  - manage overlap with threads
- Efficient communication
  - completely hidden under disk I/O
  - competes for I/O bus bandwidth
- Self-tuning software
  - probe available memory, disk bandwidth, trade-offs
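The memory-mapped I/O bullet corresponds to the standard mmap + madvise pattern: map the input file and tell the VM system the access pattern so it reads ahead and recycles pages behind the scan. A minimal sketch using the real POSIX calls; the file name and the checksum pass are illustrative.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("input.dat", O_RDONLY);   /* illustrative file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    unsigned char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Sequential hint: prefetch aggressively, drop pages already read. */
    madvise(buf, st.st_size, MADV_SEQUENTIAL);

    unsigned long sum = 0;                  /* stand-in for the read pass */
    for (off_t i = 0; i < st.st_size; i++)
        sum += buf[i];
    printf("checksum: %lu\n", sum);

    munmap(buf, st.st_size);
    close(fd);
    return 0;
}
```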
43. World-Record Disk-to-Disk Sort
- Sustained 500 MB/s of disk bandwidth and 1,000 MB/s of network bandwidth
44. Towards a Cluster File System
- Remote disk system built on a virtual network (client sketch below)
[Figure: client application linked with RDlib communicating via active messages with an RD server]
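A sketch of how the client side might look: RDlib turns a block read into one active-message request-reply on the virtual network, and the server's handler reads the block and replies. All names here (rd_read, am_request, am_wait_reply, BLOCK_SIZE) are illustrative assumptions, not the actual RDlib interface.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 8192   /* assumed block size */

/* Assumed primitives from the virtual-network AM layer. */
typedef struct endpoint endpoint_t;
extern int am_request(endpoint_t *server, int handler, uint64_t block_no);
extern int am_wait_reply(endpoint_t *server, void *buf, size_t len);

/* RDlib client side: a block read is one request-reply round trip; the
 * reply carries the block contents into buf. */
int rd_read(endpoint_t *server, uint64_t block_no, void *buf) {
    if (am_request(server, /* RD_READ handler index */ 1, block_no) != 0)
        return -1;
    return am_wait_reply(server, buf, BLOCK_SIZE);
}
```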
45. Streaming Transfer Experiment
46. Results
- Data distribution affects resource utilization, not delivered bandwidth
47. I/O Bus crossings
48. Conclusions
- A complete system on every node makes clusters a very powerful architecture.
- Extend the system globally:
  - virtual memory systems,
  - schedulers,
  - file systems, ...
- Efficient communication enables new solutions to classic systems challenges.
- Opens a rich set of issues for parallel processing beyond the personal supercomputer.