1
What's So Different about Cluster Architectures?
  • David E. Culler
  • Computer Science Division
  • U.C. Berkeley
  • http://now.cs.berkeley.edu

2
High Performance Clusters happen
  • Many groups have built them.
  • Many more are using them.
  • Industry is running with it
  • Virtual Interface Architecture
  • System Area Networks
  • A powerful, flexible new design technique

3
Outline
  • Quick guided tour of Clusters at Berkeley
  • Three Important Advances
  • → Virtual Networks (Alan Mainwaring)
  • → Implicit Co-scheduling (Andrea Arpaci-Dusseau)
  • → Scalable I/O (Remzi Arpaci-Dusseau)
  • What it means

4
Stop 1: HP/fddi Prototype
  • FDDI on the HP/735 graphics bus.
  • First fast msg layer on non-reliable network

5
Stop 2: SparcStation NOW
  • ATM was going to take over the world.

The original INKTOMI
6
Stop 3: Large Ultra/Myrinet NOW
7
Stop 4: Massive Cheap Storage
  • Basic unit
  • 2 PCs double-ending four SCSI chains

Currently serving Fine Art at http://www.thinker.org/imagebase/
8
Stop 5: Cluster of SMPs (CLUMPS)
  • Four Sun E5000s
  • 8 processors
  • 3 Myricom NICs
  • Multiprocessor, Multi-NIC, Multi-Protocol
  • see S. Lumetta, IPPS '98

9
Stop 6: Information Servers
  • Basic Storage Unit
  • Ultra 2, 300 GB RAID, 800 GB tape stacker, ATM
  • scalable backup/restore
  • Dedicated Info Servers
  • web,
  • security,
  • mail,
  • VLANs project into dept.

10
Stop 7: Millennium PC Clumps
  • Inexpensive, easy-to-manage cluster
  • Replicated in many departments
  • Prototype for very large PC cluster

11
So What's So Different?
  • Commodity parts?
  • Communications Packaging?
  • Incremental Scalability?
  • Independent Failure?
  • Intelligent Network Interfaces?
  • Complete System on every node
  • virtual memory
  • scheduler
  • files
  • ...

12
Three important system design aspects
  • Virtual Networks
  • Implicit co-scheduling
  • Scalable File Transfer

13
Communication Performance → Direct Network Access
(Figure: microbenchmark of latency and 1/BW)
  • LogP: Latency, Overhead, and Bandwidth (a round-trip
    cost check follows below)
  • Active Messages: a lean layer supporting
    programming models
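As a rough check on the LogP cost of the request-reply operations used throughout (my own arithmetic, not stated on the slides): for small messages, ignoring the per-message gap g, a round trip pays sender overhead, wire latency, and receiver overhead in each direction, so

    T_rtt ≈ 2 · (o + L + o) = 2L + 4o

which is why both low latency L and low overhead o matter in the microbenchmark above.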

14
General purpose requirements
  • Many timeshared processes
  • each with direct, protected access
  • User and system
  • Client/Server, Parallel clients, parallel servers
  • they grow, shrink, handle node failures
  • Multiple packages in a process
  • each may have own internal communication layer
  • Use communication as easily as memory

15
Virtual Networks
  • Endpoint abstracts the notion of being "attached
    to the network"
  • Virtual network is a collection of endpoints that
    can name each other
  • Many processes on a node can each have many
    endpoints, each with its own protection domain
    (a data-structure sketch follows below)
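The following is a minimal C sketch of how an endpoint and a virtual network name space might be represented. The layout and names (vn_endpoint, vn_send) are illustrative assumptions, not the actual AM-II interface.

    #include <stddef.h>

    /* Hypothetical message descriptor. */
    struct vn_msg { void *buf; size_t len; };

    /* Hypothetical endpoint: send/receive queues plus a tag naming the
     * virtual network (protection domain) it belongs to. */
    typedef struct vn_endpoint {
        unsigned vnet_id;           /* which virtual network */
        unsigned ep_index;          /* this endpoint's name within that network */
        struct vn_msg tx_queue[8];  /* outgoing descriptors */
        struct vn_msg rx_queue[8];  /* incoming descriptors */
    } vn_endpoint;

    /* Endpoints in the same virtual network address each other by index,
     * so a send names only (destination endpoint, data). */
    int vn_send(vn_endpoint *ep, unsigned dest_ep, const void *buf, size_t len);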

16
How are they managed?
  • How do you get direct hardware access for
    performance with a large space of logical
    resources?
  • Just like virtual memory
  • active portion of large logical space is bound to
    physical resources

(Diagram: processes 1..n with endpoints in host memory; the active endpoints are bound to NIC memory on the network interface)
17
Endpoint Transition Diagram
  • HOT: R/W, resident in NIC memory
  • WARM: R/O, paged host memory
  • COLD: paged host memory
  • Transitions: Write or Msg Arrival promotes WARM → HOT, Evict demotes
    HOT → WARM; Read promotes COLD → WARM, Swap demotes WARM → COLD
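The transition diagram above can be read as a small state machine. Below is a hedged C sketch: the state names come from the slide, but the transition function is my own illustration of the promote-on-access, demote-under-pressure policy.

    #include <stdio.h>

    /* Endpoint residency states from the slide. */
    enum ep_state { EP_COLD, EP_WARM, EP_HOT };
    enum ep_event { EV_READ, EV_WRITE_OR_MSG, EV_EVICT, EV_SWAP };

    static enum ep_state ep_next(enum ep_state s, enum ep_event e)
    {
        switch (e) {
        case EV_READ:         return s == EP_COLD ? EP_WARM : s; /* fault in */
        case EV_WRITE_OR_MSG: return s == EP_WARM ? EP_HOT  : s; /* bind to NIC frame */
        case EV_EVICT:        return s == EP_HOT  ? EP_WARM : s; /* lose NIC frame */
        case EV_SWAP:         return s == EP_WARM ? EP_COLD : s; /* page out */
        }
        return s;
    }

    int main(void)
    {
        enum ep_state s = EP_COLD;
        s = ep_next(s, EV_READ);          /* COLD -> WARM */
        s = ep_next(s, EV_WRITE_OR_MSG);  /* WARM -> HOT  */
        s = ep_next(s, EV_EVICT);         /* HOT  -> WARM */
        printf("final state: %d\n", s);   /* 1 == EP_WARM */
        return 0;
    }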
18
Network Interface Support
  • NIC has endpoint frames
  • Services active endpoints
  • Signals misses to driver
  • using a system endpoint

(Diagram: endpoint frames 0..7 resident on the NIC, each with transmit and receive queues; a message for a non-resident endpoint raises an endpoint miss)
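A hedged sketch of the NIC-side servicing described above. The frame count matches the diagram; the helper names (nic_transmit, nic_deliver, signal_miss_to_driver) are assumptions standing in for firmware details the slide does not give.

    #include <stdbool.h>

    #define EP_FRAMES 8   /* endpoint frames resident in NIC memory (0..7) */

    struct ep_frame { int ep_id; bool has_tx; bool has_rx; };

    void nic_transmit(struct ep_frame *f);   /* drain transmit queue */
    void nic_deliver(struct ep_frame *f);    /* fill receive queue */
    void signal_miss_to_driver(int ep_id);   /* via the system endpoint */

    void nic_service(struct ep_frame frames[EP_FRAMES], int incoming_ep_id)
    {
        bool resident = false;

        /* Serve the endpoints that are currently resident (hot). */
        for (int i = 0; i < EP_FRAMES; i++) {
            if (frames[i].has_tx) nic_transmit(&frames[i]);
            if (frames[i].has_rx) nic_deliver(&frames[i]);
            if (frames[i].ep_id == incoming_ep_id) resident = true;
        }

        /* A message for a non-resident endpoint is a miss: the host driver
         * must fault that endpoint into a frame before it can be served. */
        if (!resident)
            signal_miss_to_driver(incoming_ep_id);
    }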
19
Solaris System Abstractions
  • Segment Driver
  • manages portions of an address space
  • Device Driver
  • manages I/O device

Virtual Network Driver
20
LogP Performance
  • Competitive latency
  • Increased NIC processing
  • Difference mostly
  • ack processing
  • protection check
  • data structures
  • code quality
  • Virtualization cheap

21
Bursty Communication among many
22
Multiple VNs, Single-thread Server
23
Multiple VNs, Multithreaded Server
24
Perspective on Virtual Networks
  • Networking abstractions are vertical stacks
  • new function → new layer
  • poke through for performance
  • Virtual Networks provide a horizontal abstraction
  • basis for building new, fast services

25
Beyond the Personal Supercomputer
  • Able to timeshare parallel programs
  • with fast, protected communication
  • Mix with sequential and interactive jobs
  • Use fast communication in OS subsystems
  • parallel file system, network virtual memory,
  • Nodes have powerful, local OS scheduler
  • Problem: local schedulers do not know to run
    parallel jobs in parallel

26
Local Scheduling
  • Schedulers act independently w/o global control
  • A program waits while trying to communicate with
    peers that are not running
  • 10 - 100x slowdowns for fine-grain programs!
  • → need coordinated scheduling

27
Explicit Coscheduling
  • Global context switch according to precomputed
    schedule
  • How do you build it? Does it work?

28
Typical Cluster Subsystem Structures
(Diagrams: a master-slave structure, with local services and applications above the communication layer, and a peer-to-peer structure realizing a global service over the communication layer)
29
Ideal Cluster Subsystem Structure
  • Obtain coordination without explicit subsystem
    interaction, using only the events in the program
  • very easy to build
  • potentially very robust to component failures
  • inherently service on-demand
  • scalable
  • Local service component can evolve.

30
Three approaches examined in NOW
  • GLUNIX: explicit master-slave (user level)
  • matrix algorithm to pick PP
  • uses stop signals to try to force the desired
    PP to run
  • Explicit peer-to-peer scheduling assist with VNs
  • co-scheduling daemons decide on PP and kick the
    Solaris scheduler
  • Implicit
  • modify the parallel run-time library to allow it
    to get itself co-scheduled with the standard scheduler

31
Problems with explicit coscheduling
  • Implementation complexity
  • Need to identify parallel programs in advance
  • Interacts poorly with interactive use and load
    imbalance
  • Introduces new potential faults
  • Scalability

32
Why implicit coscheduling might work
  • Active message request-reply model
  • Infer non-local state from local observations;
    react to maintain coordination
  • observation → implication → action
  • fast response → partner scheduled → spin
  • delayed response → partner not scheduled → block
    (a spin-then-block sketch follows below)
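A minimal C sketch of the waiting policy the table above describes: spin while a fast reply is still plausible, then block. The run-time hooks (vn_reply_arrived, vn_block_until_reply, cycles) and the threshold are illustrative assumptions, not the actual run-time interface.

    #include <stdbool.h>
    #include <stdint.h>

    bool vn_reply_arrived(void);      /* poll: has the reply come in? */
    void vn_block_until_reply(void);  /* sleep in the kernel until it does */
    uint64_t cycles(void);            /* cheap timestamp */

    /* Spin for roughly a round trip plus a few context-switch wake-ups
     * (see the next slide), then block and release the processor. */
    void wait_for_reply(uint64_t spin_threshold)
    {
        uint64_t start = cycles();
        while (!vn_reply_arrived()) {
            if (cycles() - start > spin_threshold) {
                vn_block_until_reply();
                return;
            }
        }
    }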

33
Obvious Questions
  • Does it work?
  • How long do you spin?
  • What are the requirements on the local scheduler?

34
How Long to Spin?
  • Answer: round-trip time + 5 x wake-up time
  • round-trip to stay scheduled together
  • plus wake-up to get scheduled together
  • plus wake-up to be competitive with blocking cost
  • plus 3 x wake-up to meet pairwise cost
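Checking the arithmetic in the breakdown above: the wake-up terms are 1 + 1 + 3 = 5, so with round-trip time R and wake-up (context-switch) time W the spin threshold is

    T_spin ≈ R + 5W

which matches the answer stated on the slide.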

35
Does it work?
36
Synthetic Bulk-synchronous Apps
  • Range of granularity and load imbalance
  • spin-wait: 10x slowdown

37
With mixture of reads
  • Block-immediate: 4x slowdown

38
Timesharing Split-C Programs
39
Many Questions
  • What about
  • mix of jobs?
  • sequential jobs?
  • unbalanced placement?
  • Fairness?
  • Scalability?
  • How broadly can implicit coordination be applied
    in the design of cluster subsystems?

40
A look at Serious File I/O
  • Traditional I/O system
  • NOW I/O system
  • Benchmark problem: sort a large number of 100-byte
    records with 10-byte keys
  • start on disk, end on disk
  • accessible as files (use the file system)
  • Datamation sort: 1 million records
  • Minute sort: quantity sorted in a minute

(Diagram: processor-memory (P-M) nodes connected by the network)
41
NOW-Sort Algorithm: 1 pass
  • Read
  • N/P records from disk → memory
  • Distribute
  • send keys to processors holding result buckets
    (bucket selection sketched below)
  • Sort
  • partial radix sort on each bucket
  • Write
  • gather and write records to disk
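A hedged sketch of the Distribute step's bucket selection, under the assumption that result buckets are assigned by the high-order bits of the key (a common choice for uniformly distributed keys; the slide does not spell out the actual NOW-Sort policy).

    #include <stdint.h>
    #include <stdio.h>

    /* Map a key's top bits to the processor owning its result bucket.
     * With P = 2^log2_p processors and key_bits-bit key prefixes,
     * processor i receives the keys whose top log2_p bits equal i. */
    static unsigned dest_processor(uint32_t key, unsigned log2_p, unsigned key_bits)
    {
        return key >> (key_bits - log2_p);
    }

    int main(void)
    {
        /* Example: 16 processors, 32-bit key prefixes. */
        printf("%u\n", dest_processor(0xC0000000u, 4, 32)); /* prints 12 */
        return 0;
    }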

42
Key Implementation Techniques
  • Performance Isolation: highly tuned local
    disk-to-disk sort
  • manage local memory
  • manage disk striping
  • memory-mapped I/O with madvise, buffering (see the
    sketch after this list)
  • manage overlap with threads
  • Efficient Communication
  • completely hidden under disk I/O
  • competes for I/O bus bandwidth
  • Self-tuning Software
  • probe available memory, disk bandwidth, trade-offs
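A self-contained example of the memory-mapped I/O technique mentioned above: map the input file and give the VM system a sequential-access hint with madvise. The file name is a placeholder and error handling is abbreviated.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "records.dat";              /* placeholder input file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the file and tell the kernel we will stream through it,
         * so it can read ahead and drop pages behind us. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(p, st.st_size, MADV_SEQUENTIAL);

        unsigned long sum = 0;                         /* touch every byte */
        for (off_t i = 0; i < st.st_size; i++) sum += (unsigned char)p[i];
        printf("checksum: %lu\n", sum);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }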

43
World-Record Disk-to-Disk Sort
  • Sustain 500 MB/s disk bandwidth and 1,000 MB/s
    network bandwidth

44
Towards a Cluster File System
  • Remote disk system built on a virtual network

(Diagram: the client application calls RDlib, which talks to the RD server over active messages on a virtual network; a request-message sketch follows below)
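A hedged sketch of what a remote-disk read request might look like when carried over active messages. The structure and handler name are illustrative assumptions, not the actual RDlib protocol.

    #include <stdint.h>

    /* Hypothetical read request: the client names the block to read and
     * where the reply data should land. */
    struct rd_read_req {
        uint32_t disk;        /* which server-side disk */
        uint64_t block;       /* block number to read */
        uint32_t nbytes;      /* how much to read */
        uint64_t client_buf;  /* destination registered with the client endpoint */
    };

    /* Server-side handler, invoked when the request message arrives:
     * issue the local disk read, then reply to the client's endpoint. */
    void rd_handle_read(const struct rd_read_req *req, unsigned reply_endpoint);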
45
Streaming Transfer Experiment
46
Results
  • Data distribution affects resource utilization,
    not delivered bandwidth

47
I/O Bus crossings
48
Conclusions
  • Complete system on every node makes clusters a
    very powerful architecture.
  • Extend the system globally
  • virtual memory systems,
  • schedulers,
  • file systems, ...
  • Efficient communication enables new solutions to
    classic systems challenges.
  • Opens a rich set of issues for parallel
    processing beyond the personal supercomputer.