Title: Clusters: Efficient Aggregation
1. Clusters: Efficient Aggregation
- Andrew Chien
- CSE225
- January 12, 1999
2. Announcements/Review
- Homework 1 is due 1/16 (yes, on Saturday)
- Leave it at my office, or submit electronically (email)
- Email Andrew w/ groups today (if you haven't already)
- Slides up on the web (after class); let me know via email asap if there are problems
- Grid Applications
- Distributed Supercomputing, Real-time Distributed Instrumentation, Data Intensive Computing, Tele-immersion
- Typical architectures and motivations
- Application Requirements
3. Grid Service Requirements
- All require basic support for access, initiation, and running jobs (security, accounting, etc.)
- All require coscheduling
- Some benefit from a wealth of network resources (faster network, better application)
- Some require predictable network performance (QoS)
- Some require integration of large data archives/repositories
- Some require heterogeneous computer types
- Some require integration of weird digital devices (instruments, sensors, actuators)
4. Today's Outline
- Requirements for Efficient Aggregation in Clusters (Uniform Grids)
- Communication, Scheduling, Resource Access
- Fast Communication
- Active Messages and Fast Messages
5. Original Clusters: High Throughput
- Cycle stealing (Condor, Utopia -> LSF)
- Shared, heterogeneous resources, slow networks
- Enhancements: identifying resources, uniform access, migration, and scheduling
- Uniprocessor jobs, multiprocessor, loosely coupled
- No change to communication performance, coordination, or quality of service
- Key idea: reap shared resources that are idle.
6. HPC Clusters: High Aggregate Performance
- Dedicated cluster (SP2, CS-2, NOW): high aggregate performance
- Dedicated, homogeneous, fast network
- Enhancements: low-overhead, high-bandwidth communication, coarse-grain coordination
- No support for coordinated multitasking, heterogeneity, quality of service, wide-area operation
- Key idea: workstations/PCs are supercomputer building blocks
7. Grids: Enabling High Performance Distributed Computing
Switched Multigigabit Networks
- Clusters: homogeneous (or nearly) collections of machines
- connected by fast networks (usually switched, gigabit)
- like processors (and like speed)
- like configurations
- like operating systems
- identical binaries
- => this is as easy as it gets for grids!
8. What do we need to meld this into an efficient aggregate?
- Q: What examples of efficient aggregates do we have?
- A: Tightly coupled parallel computers
- Custom high-speed interconnects
- Deeply integrated network interfaces / special communication mechanisms (combining, barrier networks, service)
- Lightweight customized node monitors/operating systems
- Batch scheduled
- Early 90s examples ...
- Intel Paragon (i860-based computer with 100 MB/s mesh)
- Thinking Machines CM-5 (Sparc-based computer with the network interface on the memory bus and special vector units)
- Stanford DASH (directory-based cache-coherent linked 4-way SMPs)
- Cray T3D (DEC Alpha with 300 MB/s network in the memory controller and special network barrier and synchronization operations)
9. High Performance Communication
- Exploiting parallelism requires efficient communication
- high-bandwidth data movement
- low-latency communication
- low-latency synchronization
- Can efficiently distribute work
- Can efficiently load balance
- Can efficiently collect results
- Can efficiently synchronize on state
- Communication allows the pooling of resources and their joint application to a parallel task
10. Coordinated Processor Scheduling
- Uncoordinated scheduling
- degrades communication, coordination, and external interaction as well
- Goal: coordinated processor management
- 10 ms scale, dedicated efficiency for parallel jobs, interactive performance preserved for sequential/parallel jobs
- Must coexist with commodity software and timesharing schedulers (no gang scheduling)
- Why do cluster/grid applications need this property?
- We'll revisit this later in the quarter
11. Uniform Resource Access
- View all resources as a uniform namespace
- pooling of processors
- pooling of memory
- shared filesystem
- shared process namespace
- shared network resources and external access (naming)
- => Single system image
12. Fast Messages Project Goals (1996)
- Explore techniques for lowest-latency, high-bandwidth communication
- 1. Dedicated clusters (single parallel job): MPPs
- 2. Multitasked clusters (multiple parallel jobs), clustered SMPs
- 3. Scalable servers (multiple jobs, external networking)
- Develop high-speed messaging prototypes to build a user community and explore the design space
- Experimentation with in vivo usage
- MPI, TCP, scalable servers, Grand Challenge applications
- Exploration of more complex system issues
13. Why a New Communication Interface?
- Existing parallel machine interfaces (non-standard)
- Existing portable software interfaces (MPI, PVM) are not very high performance
- Cost of communication services
- Guarantees supplied critically affect performance
- Performance implications of processor-network coupling
- Popular interfaces do not provide decoupling
14. Active Messages
[Diagram: a message carrying a reference to its handler]
- Lightweight messaging
- Thinking Machines CM-5, nCUBE/2
- Demonstrated 10-20x reduction in communication overhead in messaging libraries
- Remote-procedure-like mechanism: handlers (see the sketch below)
- process the data out of the network
- go on with the computation
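To make the handler mechanism concrete, here is a minimal sketch in C of the idea: each message carries a reference to a handler that runs on arrival and folds the data directly into the computation. The packet layout and all names (am_packet_t, am_deliver, sum_handler) are illustrative assumptions, not the actual CM-5 library API.

/* Minimal sketch of the active-message idea: the message names its handler,
 * the handler consumes the data straight out of the network, and the
 * computation continues.  All names here are illustrative, not the CM-5 API. */
#include <stdio.h>

#define AM_PAYLOAD_WORDS 4

typedef struct {
    void (*handler)(int src, const int *payload);   /* runs at the receiver */
    int payload[AM_PAYLOAD_WORDS];
} am_packet_t;

static long global_sum = 0;

/* Example handler: fold a remote partial sum into local state. */
static void sum_handler(int src, const int *payload)
{
    global_sum += payload[0];
    printf("handler: got %d from node %d\n", payload[0], src);
}

/* In a real system the packet is injected into the network; here we just
 * "deliver" it locally to show the control flow at the receiver. */
static void am_deliver(int src, const am_packet_t *pkt)
{
    pkt->handler(src, pkt->payload);    /* process data out of the network */
}

int main(void)
{
    am_packet_t pkt = { sum_handler, { 42, 0, 0, 0 } };
    am_deliver(3, &pkt);                /* receiver-side dispatch */
    printf("sum = %ld\n", global_sum);  /* ... and go on with the computation */
    return 0;
}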
15. Cost of Communication (Active Messages on the CM-5)
- Efficient implementation of a lean messaging layer
- Metric: dynamic instruction counts (overhead)
- Dependent on interfacing and network services
- Full study in Karamcheti & Chien, ASPLOS '94
[Diagram: processor (P) with status registers coupled to the data network, with SEND and RECEIVE paths]
16. Costs for 16-word Messages
[Chart: dynamic instruction counts at source, destination, and in total for finite and indefinite sequences, broken down into base cost, buffer management, in-order delivery, and fault tolerance]
- Messaging cost is significantly higher than the base cost (50-70%); 47 insts/send-recv
- Cost is in application-level services
17. Costs for 1024-word Messages
[Chart: dynamic instruction counts at source, destination, and in total for finite and indefinite sequences, broken down into base cost, buffer management, in-order delivery, and fault tolerance]
- Finite sequence overhead reduced to 10%
- Indefinite sequence overhead scales with message size (still 70%)
18. Requirements for High Performance Communication
- Reliable delivery
- flow control, buffer management
- In-order delivery
- tradeoff with known penalties
- Decoupled interaction with the processor
- processor performance
- communication performance
19. Fast Messages Interface (v1.0)
- FM_send(dest, handler, buffer, size)
- generic send; handler called at the destination
- buffer released after send returns
- Memory-to-memory protocol
- FM_extract()
- blind poll; invokes handlers for all pending messages
- buffers released upon handler completion
- extract calls determine when messages are processed (see the usage sketch below)
- => Essentially an Active Messages 1.0 API
20. FM Messaging Costs (T3D)
- Two implementations, low overheads (0.3x ovhd)
- Push and pull messaging (available on the WWW)
21. Myrinet Available, Fall 1994
- 1.2 Gb/s, 80 MB/s duplex
- < 1 µs switch latency
- Identical wormhole routing technology to parallel computers
- => should be able to build high-performance clusters from this, right?
22. Fast Messages on a Cluster
- Sun SPARCstation 20 workstations
- SBus (latency 0.5 µs, BW 45 MB/s burst, 20-30 MB/s from the processor), MBus (read-write BWs of 40-80 MB/s)
- Myrinet LAN (LANai 2.3 numbers)
- Byte-wide, twisted-pair ribbons, 30 m, 640 Mb/s full duplex
- ~$1.5K per machine connection (switch port, processor interface, and cables)
23. FM Design Philosophy & Strategy
- Goal: low latency and high bandwidth
- Design strategy
- Minimal layer first; pay for what you use!
- Drive small-message performance
- FM on Myrinet is a minimal low-level layer
- Perspective
- Networks are really fast.
- Network interface processors are not very fast (relatively)
24. Design Features
- Operating system bypass to achieve high performance (see the sketch below)
- memory mapping provides protection; multiplexing must be in the device
- Thread safety for convenient use
- Tuned for low latency and high bandwidth on short messages, for accessible performance
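As a rough illustration of OS bypass, the sketch below maps a NIC window into user space once and then sends with plain stores. The device path, window size, and doorbell layout are hypothetical assumptions; only the open/mmap calls are standard POSIX.

/* Sketch of OS bypass: the driver maps the NIC's send region into the user
 * address space once (with the MMU providing protection), after which sends
 * are plain stores -- no system call per message.  The device path, region
 * size, and register layout here are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NIC_REGION_SIZE 4096            /* hypothetical mapped window */

int main(void)
{
    int fd = open("/dev/nic0", O_RDWR);        /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    /* One kernel crossing at setup time; the page tables then enforce that
     * this process touches only its own window (protection stays with the
     * OS, multiplexing of concurrent users must live in the device). */
    volatile uint8_t *win = mmap(NULL, NIC_REGION_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) { perror("mmap"); return 1; }

    /* Data path: user-level stores into the mapped send region. */
    const char msg[] = "hello";
    memcpy((void *)win, msg, sizeof msg);      /* enqueue payload */
    win[NIC_REGION_SIZE - 1] = 1;              /* hypothetical doorbell byte */

    munmap((void *)win, NIC_REGION_SIZE);
    close(fd);
    return 0;
}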
25. Fast Messages Design Tactics
- Extremely simple design
- Simple host-LANai synchronization
- Simple buffer management
- Host is faster than the LANai (NI processor)
- Shift work to the host
- Minimize the critical path
- Minimize copying; flow control back through the layers
- Move noncritical work out of line
- Optimize for bursts (BW and low latency)
26. LANai-to-LANai Performance
[Chart: LANai-to-LANai streaming performance (microseconds)]
- Only a small fraction of the network performance is achievable
- Careful tuning is necessary to preserve as much as possible
- Streaming is the basis for later graphs
27. Host-LANai Coupling
- Queues decouple; pointers form the basis of synchronization (see the sketch below)
- Performance critical
- Quick decisions, memory-hierarchy traffic
- Data movement (processor I/O vs. DMA)
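A minimal sketch of the pointer-synchronized coupling, assuming a single producer and single consumer over a fixed-frame ring. This is a shared-memory analogue of the host/LANai queues; a real implementation across the SBus would also need memory barriers and device-specific access.

/* Pointer-synchronized queue: host and LANai each own one index of a
 * fixed-frame ring, and progress is detected just by comparing the two
 * pointers -- no locks, no interrupts.  (A real implementation would add
 * memory barriers; this sketch shows only the control structure.) */
#include <stdint.h>
#include <string.h>

#define FRAME_BYTES 128                 /* fixed-size frames (slide 30) */
#define NFRAMES     16                  /* power of two for cheap wrap */

typedef struct {
    uint8_t frames[NFRAMES][FRAME_BYTES];
    volatile uint32_t head;             /* written only by the producer */
    volatile uint32_t tail;             /* written only by the consumer */
} ring_t;

/* Producer (e.g., host enqueuing a send): returns 0 if the ring is full. */
int ring_put(ring_t *r, const void *msg, size_t len)
{
    uint32_t h = r->head;
    if (h - r->tail == NFRAMES)         /* full: pointers are the only check */
        return 0;
    memcpy(r->frames[h % NFRAMES], msg, len < FRAME_BYTES ? len : FRAME_BYTES);
    r->head = h + 1;                    /* publish by advancing the pointer */
    return 1;
}

/* Consumer (e.g., LANai polling loop): returns 0 if nothing is pending. */
int ring_get(ring_t *r, void *out)
{
    uint32_t t = r->tail;
    if (t == r->head)                   /* empty: quick decision, short poll */
        return 0;
    memcpy(out, r->frames[t % NFRAMES], FRAME_BYTES);
    r->tail = t + 1;                    /* frees the frame for reuse */
    return 1;
}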
28. Data Movement
[Chart: bandwidth (MB/s) vs. frame size for processor-mediated and DMA-based transfers]
- Processor-mediated vs. DMA-based
- 64 bytes: 2x; 128 bytes: 30%; crossover at 200 bytes
- Better bridges should improve the processor-mediated case
- Processor-mediated avoids a memory copy
29. Data Movement (Latencies)
[Chart: latency (microseconds) vs. frame size for processor-mediated and DMA-based transfers]
- DMA incurs a startup penalty, but BW dominates
- => copy to a DMA-able region is not included
- => host-initiated DMA is desirable
30. Queuing Issues
- Issues
- What and how many queues to have? Where to place the queue pointers (ownership, consistency)? Where to place the queues (access, bandwidth)? How much work to do on each side?
- Constraints
- Limited host memory access (DMA only)
- Limited LANai memory
- 128 KB of memory (less than 2 milliseconds' worth of traffic -- host queuing essential for decoupling)
- LANai processing inefficiency
- Fixed-size frames
- Low latency -> minimal processing overhead -> short polling loops
31. Queuing in FM
- Send queue in the LANai (0-copy)
- Receive queue in the LANai and in host memory (decouple, 1-copy)
32. Protocol Decomposition
- Majority of the functionality on the host
- Lowest latency, highest bandwidth for short messages
- Host > LANai
- Extant speed mismatches: Meiko CS-2, IBM SP2, Myrinet, ATM switches, etc.
- Cost balance and upgrade issues
- Division of labor affects
- processor utilization, pipeline balance
- Actual balance depends on message size
33. Data Integrity and Flow Control
- Requirements
- Reliable delivery; prevent buffer under/overflow and deadlock
- for multitasking
- flow control should work under uncoordinated scheduling
- Flow control can be performance critical
- critical path, additional messages
- FM's memory-to-memory protocol
- enables stackable data integrity protocols
- decouples processor and network performance
34. Flow Control Approaches
- Return to sender (optimistic) -- sketched below
- Scalable protocol, group window
- Packet is returned to the sender if there is no space at the receiver
- Sender guarantees space for returned packets
- Returned packets are eventually retransmitted (progress)
- Windowing
- Traditional approach; a window for each destination
- Non-scalable, large buffer requirements
- A count kept for each of the nodes
- Piggy-backed acks + an auxiliary low-water-mark mechanism
- Simplifies checking/queuing/etc. (all on the host)
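The sketch below shows the skeleton of the return-to-sender scheme in C, under the assumption that each sender never has more than a fixed number of packets outstanding (which is what lets it reserve space for bounced packets). The names and callback structure are illustrative, not FM's internal code.

#define PKT_BYTES   128
#define OUTSTANDING 32          /* cap on packets in flight per sender */

typedef struct { int dest; unsigned char data[PKT_BYTES]; } pkt_t;

/* Sender-side pool: because at most OUTSTANDING packets are ever in flight,
 * there is guaranteed space to hold every packet that comes back. */
static pkt_t return_pool[OUTSTANDING];
static int   nreturned = 0;

/* Receiver: deliver optimistically if a buffer is free, otherwise bounce the
 * packet back to its sender instead of dropping or blocking. */
void recv_packet(const pkt_t *p, int (*have_space)(void),
                 void (*deliver)(const pkt_t *), void (*bounce)(const pkt_t *))
{
    if (have_space())
        deliver(p);             /* common case: optimistic delivery */
    else
        bounce(p);              /* no receive buffer: return to sender */
}

/* Sender: a bounced packet lands in the reserved pool ... */
void on_returned(const pkt_t *p)
{
    return_pool[nreturned++] = *p;   /* cannot overflow given the cap above */
}

/* ... and is retransmitted later, which guarantees eventual progress. */
int retransmit_one(void (*send)(const pkt_t *))
{
    if (nreturned == 0)
        return 0;
    send(&return_pool[--nreturned]);
    return 1;
}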
35. Flow Control Performance
[Chart: achieved bandwidth (MB/s) under the two flow control schemes]
- Flow control is cheap: both schemes have comparable performance (host reject)
- Need LANai reject or gang scheduling for the memory benefits
- Open issues: wiring buffers, window size, memory utilization, delivery order
36. If the Coprocessor Were Faster ...
- More functionality (offload the host)
- Tag checking, graph demultiplexing
- Return-to-sender flow control in the coprocessor?
- More performance (BW, latency)
- No headroom in coprocessor performance...
37. Tag Checking Overhead
- Dramatic impact for shorter messages
- Importance likely to be accentuated in higher-speed networks.
38. Analyzing FM 1.1's Performance
Description      r∞ (MB/s)   n1/2 (bytes)   t0 (µs)
Link             76          315            4.2
Link + LANai     21.2        44             3.5
FM               21.4        54             4.1
FM-dma           33.0        162            7.5
Myricom API      23.9        > 4,409        105
- Vector startup performance model (written out below)
- Two orders of magnitude improvement in n1/2
- Released FM 1.1 delivers 17.5 MB/s for arbitrary-length messages, with a 128-byte latency of 20 µs
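For reference, the vector startup model behind the r∞ / n1/2 / t0 parameters can be written out as follows (the standard linear formulation; that every fitted value in the table follows it exactly is an assumption):

% Time to move an n-byte message: fixed startup plus streaming at the
% asymptotic rate; bandwidth follows by dividing n by the time.
\[
  T(n) \;=\; t_0 + \frac{n}{r_\infty},
  \qquad
  \mathrm{BW}(n) \;=\; \frac{n}{T(n)} \;=\; \frac{r_\infty\, n}{\,n + r_\infty t_0\,}.
\]

Under this model, half of r∞ is reached at n1/2 = r∞ · t0; for the raw link row that gives about 76 MB/s × 4.2 µs ≈ 320 bytes, consistent with the 315 bytes in the table.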
39. Comparative Latency
[Chart: latency (microseconds) for the Fast Messages, LAM, SSAM (ATM), MPI-FM, MPL (SP2), and Myricom API messaging layers]
- Competitive latency, better bandwidth at small packets
- LAM (2 mechanisms), MPI comparison
- Functionality varies significantly
40. Perspective
- Networks are really fast
- Lots of careful low-level management is needed to deliver even a fraction of the performance to the application
- FM 1.x
- protocol decomposition
- streamlined buffer management
- NIC and channel management
- and the right guarantees...
41. Reading Assignments
- Clusters and Networks
- Grid Book, Chs. 17 and 20
- Pakin, Lauria & Chien, "High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet," in Supercomputing '95.
- von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Proceedings of the 15th ACM Symposium on Operating Systems Principles, December 1995, pp. 40-53.
- Optional
- Pfister, In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Prentice Hall, 1995, 1998.
- GNN10000 Interface Documentation, Myrinet Documentation
- Next time
- Multiprocess protection
- The Virtual Interface Architecture