Clusters: Efficient Aggregation

Transcript and Presenter's Notes

1
Clusters Efficient Aggregation
  • Andrew Chien
  • CSE225
  • January 12, 1999

2
Announcements/Review
  • Homework 1 is due 1/16 (yes, on Saturday)
  • Leave at my office, or submit electronically (email)
  • Email Andrew w/ groups today (if you haven't
    already)
  • Slides up on the web (after class), let me know
    via email asap if there are problems
  • Grid Applications
  • Distributed Supercomputing, Real-time Distributed
    Instrumentation, Data Intensive Computing,
    Tele-immersion
  • Typical architectures and motivations
  • Application Requirements

3
Grid Service Requirements
  • All require basic support for access, initiation,
    and run (security, accounting, etc.)
  • All require coscheduling
  • Some benefit from wealth of network resources
    (faster network, better application)
  • Some require predictable network performance(QoS)
  • Some require integration of large data
    archives/repositories
  • Some require heterogeneous computer types
  • Some require integration of weird digital devices
    (instruments, sensors, actuators)

4
Todays Outline
  • Requirements for Efficient Aggregation in
    Clusters (Uniform Grids)
  • Communication, Scheduling, Resource Access
  • Fast Communication
  • Active Messages and Fast Messages

5
Original Clusters: High Throughput
  • Cycle stealing (Condor, Utopia -> LSF)
  • Shared, heterogeneous resources, slow networks
  • Enhancements: identifying resources, uniform
    access, migration, and scheduling
  • Uniprocessor jobs, multiprocessor, loosely
    coupled
  • No change: communication performance,
    coordination, quality of service
  • Key idea: reap shared resources that are idle.

6
HPC Clusters: High Aggregate Performance
  • Dedicated Cluster (SP2,CS-2,NOW) high aggregate
    performance
  • Dedicated, homogeneous, fast network
  • Enhancements: low overhead, high-BW
    communication, coarse-grain coordination
  • No support: coordinated multitasking,
    heterogeneity, quality of service, wide-area
  • Key idea: workstations/PCs are supercomputer
    building blocks

7
Grids: Enabling High Performance Distributed Computing
(Switched Multigigabit Networks)
  • Clusters: homogeneous (or nearly) collections of
    machines
  • connected by fast networks (usually switched,
    gigabit)
  • like processors (and like speed)
  • like configurations
  • like operating systems
  • identical binaries
  • => this is as easy as it gets for grids!

8
What do we need to meld this into an efficient
Aggregate?
  • Q: What examples of efficient aggregates do we
    have?
  • A: Tightly coupled parallel computers
  • Custom high speed interconnects
  • Deeply integrated network interfaces / special
    communication mechanisms (combining, barrier
    networks, service)
  • Lightweight customized node monitors/operating
    systems
  • Batch scheduled
  • Early '90s examples ...
  • Intel Paragon (i860-based computer with 100MB/s
    mesh)
  • Thinking Machines CM5 (Sparc-based computer with
    network interface on memory bus and special
    vector units)
  • Stanford DASH (directory based cache coherent
    linked 4-way SMPs)
  • Cray T3D (DEC Alpha with 300MB/s network in the
    memory controller and special network barrier and
    synchronization operations)

9
High Performance Communication
  • Exploiting parallelism requires efficient
    communication
  • high bandwidth data movement
  • low-latency communication
  • low-latency synchronization
  • Can efficiently distribute work
  • Can efficiently load balance
  • Can efficiently collect results
  • Can efficiently synchronize on state
  • Communication allows the pooling of resources and
    their joint application to a parallel task

10
Coordinated Processor Scheduling
  • Uncoordinated scheduling
  • degrades communication, coordination, external as
    well
  • Goal: coordinated processor management
  • 10 ms scale, dedicated efficiency for parallel
    jobs, interactive performance preserved for
    seq/par jobs
  • Must coexist with commodity software and
    timesharing schedulers (no gang scheduling)
  • Why do cluster/grid applications need this
    property?
  • We'll revisit this later in the quarter

11
Uniform Resource Access
  • View all resources as a uniform namespace
  • pooling of processors
  • pooling of memory
  • shared filesystem
  • shared process namespace
  • shared network resources and external access
    (naming)
  • => Single system image

12
Fast Messages Project Goals (1996)
  • Explore techniques for lowest latency, high
    bandwidth communication
  • 1. Dedicated clusters (single parallel job),
    MPPs
  • 2. Multi-tasked clusters (multiple parallel job),
    clustered SMPs
  • 3. Scalable servers (multiple jobs, external
    networking)
  • Develop high speed messaging prototypes to build
    a user community and explore the design space
  • Experimentation with in vivo usage
  • MPI, TCP, Scalable servers, Grand Challenge
    Applications
  • Exploration of more complex system issues

13
Why a new Communication Interface?
  • Existing parallel machine interfaces
    (non-standard)
  • Existing portable software interfaces (MPI, PVM),
    not very high performance
  • Cost of Communication services
  • Guarantees supplied critically affect performance
  • Performance Implications of Processor-network
    coupling
  • Popular interfaces do not provide decoupling

14
Active Messages
  • Lightweight Messaging
  • Thinking Machines CM5, Ncube/2
  • Demonstrated 10 - 20x reduction in communication
    overhead in messaging libraries
  • remote-procedure-like mechanism: handlers
  • process the data out of the network
  • go on with the computation (see the sketch below)
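A minimal sketch, in C, of the handler idea above. The packet layout and the names (am_packet_t, am_poll, net_recv) are illustrative assumptions, not the CM-5 or nCUBE/2 interface; the point is only that each message carries a handler that the receiver invokes directly on the payload, instead of buffering it for a later receive call.

    /* Active-message sketch: a packet carries the address of a handler
     * that the receiver runs directly on the payload. */
    #include <stddef.h>
    #include <stdint.h>

    typedef void (*am_handler_t)(void *payload, size_t len);

    typedef struct {
        am_handler_t handler;      /* function to run at the destination */
        size_t       len;          /* payload length in bytes */
        uint8_t      payload[64];  /* small, fixed-size payload */
    } am_packet_t;

    /* Stub for illustration; a real version would read the network device. */
    static int net_recv(am_packet_t *pkt) { (void)pkt; return 0; }

    /* Receive side: drain the network by running handlers, then resume
     * computation -- no intermediate message buffering or matching. */
    void am_poll(void)
    {
        am_packet_t pkt;
        while (net_recv(&pkt))
            pkt.handler(pkt.payload, pkt.len);
    }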

15
Cost of Communication (Active Messages on the CM-5)
  • Efficient implementation of a lean messaging
    layer
  • Metric: dynamic instruction counts (overhead)
  • Dependent on interfacing and network services
  • Full study in Karamcheti & Chien, ASPLOS '94

[Diagram: processor (P) with status registers and SEND/RECEIVE paths to the data network]
16
Costs for 16-word Messages
[Bar chart: instruction counts (0-500) at Src, Dest, and Total for finite and indefinite sequences, broken into Base Cost, Buffer Mgmt., In-order Del., and Fault-toler.]
  • Messaging cost is significantly higher than the
    base cost (50-70%), 47 insts/send-recv
  • Cost is in application-level services

17
Costs for 1024-word Messages
[Bar chart: instruction counts (0-30000) at Src, Dest, and Total for finite and indefinite sequences, broken into Base Cost, Buffer Mgmt., In-order Del., and Fault-toler.]
  • Finite sequence overhead reduced to 10%
  • Indefinite sequence overhead scales with message
    size (still 70%)

18
Requirements for High Performance Communication
  • Reliable delivery
  • flow control, buffer management
  • In-order delivery
  • tradeoff with known penalties
  • Decoupled interaction with processor
  • processor performance
  • communication performance

19
Fast Messages Interface (V 1.0)
  • FM_send(dest, handler, buffer, size)
  • generic send, handler called at destination
  • buffer released after send returns
  • Memory-to-memory protocol
  • FM_extract()
  • Blind poll, Invoke handlers for all pending
    messages
  • Buffers released upon handler completion
  • Extract calls determine when messages are
    processed
  • => Essentially an Active Messages 1.0 API (usage sketch below)
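A usage sketch for the two calls above. The slide gives only the call names and arguments, so the extern prototypes, the handler signature, and the names on_result and send_partial_sum below are assumptions for illustration, not the released FM header.

    /* Assumed prototypes modeled on the slide's FM_send(dest, handler,
     * buffer, size) and FM_extract(); the exact types are an assumption. */
    #include <string.h>

    typedef void (*fm_handler_t)(void *buf, int size);
    extern void FM_send(int dest, fm_handler_t handler, void *buffer, int size);
    extern void FM_extract(void);

    /* Handler run at the destination from inside FM_extract(). */
    static void on_result(void *buf, int size)
    {
        (void)size;
        double partial;
        memcpy(&partial, buf, sizeof partial);   /* consume the message body */
    }

    void send_partial_sum(int dest, double partial)
    {
        char buf[sizeof partial];
        memcpy(buf, &partial, sizeof partial);

        /* Sender: the buffer may be reused as soon as FM_send returns. */
        FM_send(dest, on_result, buf, (int)sizeof buf);

        /* Blind poll: runs handlers for all pending messages, so the
         * placement of FM_extract() calls decides when messages get
         * processed. */
        FM_extract();
    }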

20
FM Messaging Costs (T3D)
  • Two implementations, low overheads (0.3 x ovhd)
  • Push and Pull messaging (available on WWW)

21
Myrinet available, Fall 1994
  • 1.2 Gbps, 80MB/s duplex
  • < 1 µs switch latency
  • Identical wormhole routing technology to parallel
    computers
  • => should be able to build high performance
    clusters from this, right?

22
Fast Messages on a Cluster
  • Sun Sparc 20 Workstations
  • SBus (latency 0.5 µs, BW 45 MB/s burst,
    20-30 MB/s processor), MBus (read-write BWs of
    40-80 MB/s)
  • Myrinet LAN (LANai 2.3 numbers)
  • Byte-wide, twisted-pair ribbons, 30 m, 640 Mbps
    full duplex
  • $1.5K per machine connection (switch port,
    processor interface, and cables)

23
FM Design Philosophy and Strategy
  • Goal: low latency and high bandwidth
  • Design strategy
  • Minimal layer first, Pay for what you use!
  • Drive small message performance
  • FM on the Myrinet is a minimal low-level layer
  • Perspective
  • Networks are really fast.
  • Network Interface Processors are not very fast
    (relatively)

24
Design Features
  • Operating system bypass to achieve high
    performance
  • memory-mapping (sketched below): protection and
    multiplexing must be in the device
  • Thread-safety for convenient use
  • Tuned for low latency and high bandwidth for
    short messages for accessible performance
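A sketch of what the operating-system bypass via memory mapping can look like in C: the NIC's memory is mapped into the user address space once, so per-message sends and receives are plain loads and stores with no system call on the critical path. The device path, mapping size, and map_nic name are placeholders, not the actual Myrinet driver interface.

    /* OS-bypass sketch: map device memory into the process once; afterwards
     * the messaging layer touches it directly.  Protection and multiplexing
     * then have to be enforced by the device, as noted above. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NIC_MAP_SIZE (128 * 1024)   /* e.g. the LANai's on-board SRAM */

    volatile unsigned char *map_nic(const char *devpath)
    {
        int fd = open(devpath, O_RDWR);
        if (fd < 0) { perror("open"); return NULL; }

        void *p = mmap(NULL, NIC_MAP_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping outlives the fd */
        if (p == MAP_FAILED) { perror("mmap"); return NULL; }
        return (volatile unsigned char *)p;
    }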

25
Fast Messages Design Tactics
  • Extremely simple design
  • Simple host-LANai synchronization
  • Simple buffer management
  • Host faster than LANai (NI processor)
  • Shift work to the host
  • Minimize critical path
  • Minimize copying, flow control back through
    layers
  • Move noncritical stuff out of line
  • Optimize for bursts (BW and low latency)

26
LANai-to-LANai Performance
[Chart: LANai-to-LANai performance in microseconds, with and without streaming]
  • Only a small fraction of the network performance
    is achievable
  • Careful tuning necessary to preserve as much as
    possible
  • Streaming is the basis for later graphs

27
Host - LANai Coupling
  • Queues decouple; pointers form the basis of
    synchronization
  • Performance critical
  • Quick decisions, memory hierarchy traffic
  • Data Movement (Processor I/O vs. DMA)

28
Data Movement
[Chart: bandwidth (MB/s) vs. frame size for processor writes vs. DMA]
  • Processor mediated vs. DMA-based
  • 64 bytes: 2x, 128 bytes: 30%, crossover at 200 bytes
  • Better bridges should improve processor-mediated
    transfers
  • Processor-mediated avoids a memory copy

29
Data Movement (Latencies)
[Chart: latency in microseconds vs. frame size for processor writes vs. DMA]
  • DMA incurs a startup penalty, but BW dominates
    (see the toy model below)
  • => copy to a DMA-able region is not included
  • => host-initiated DMA is desirable
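A toy model of the processor-I/O vs. DMA tradeoff charted above. The bandwidth and startup numbers are assumptions chosen to lie in the ranges the slides mention, not the measured curves; with these values the crossover lands near 200 bytes, which matches the shape of the measurements.

    /* PIO has no startup cost but lower bandwidth; DMA pays a fixed setup
     * cost but streams faster, so it wins for large frames. */
    #include <stdio.h>

    int main(void)
    {
        const double pio_bw = 25e6;  /* assumed: ~20-30 MB/s processor writes */
        const double dma_bw = 45e6;  /* assumed: SBus burst-rate DMA */
        const double dma_t0 = 4e-6;  /* assumed: DMA setup time, seconds */

        for (int n = 32; n <= 1024; n *= 2) {
            double t_pio = n / pio_bw;
            double t_dma = dma_t0 + n / dma_bw;
            printf("%5d bytes: PIO %5.2f us, DMA %5.2f us -> %s\n",
                   n, t_pio * 1e6, t_dma * 1e6,
                   t_pio < t_dma ? "PIO wins" : "DMA wins");
        }
        return 0;
    }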

30
Queuing Issues
  • Issues
  • What and how many queues to have, Where to place
    the queue pointers? (ownership, consistency),
    Where to place the queues? (access, bandwidth),
    How much work to do on each side?
  • Constraints
  • Limited host memory access (DMA only)
  • Limited LANai memory
  • 128 KB memory (< 2 milliseconds' worth -- host
    queuing essential for decoupling)
  • LANai processing inefficiency
  • Fixed size frames
  • Low latency -> minimal processing overhead ->
    short polling loops

31
Queuing in FM
  • Send queue in LANai (0-copy)
  • Receive queue in LANai and host memory (decouple,
    1-copy) -- see the ring sketch below
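A simplified sketch of the pointer-synchronized, fixed-frame queues described above: a single-producer/single-consumer ring in which the head and tail indices are the only shared state, so publishing an index update is the synchronization. The frame size, depth, and volatile-only synchronization are illustrative simplifications, not the FM implementation.

    #include <string.h>

    #define FRAME_SIZE 128              /* fixed-size frames, as in FM */
    #define NFRAMES    64               /* illustrative queue depth */

    struct ring {
        volatile unsigned head;         /* advanced only by the consumer */
        volatile unsigned tail;         /* advanced only by the producer */
        unsigned char frames[NFRAMES][FRAME_SIZE];
    };

    /* Producer side (e.g. host filling the send queue); 0 means "full",
     * which is where flow control has to step in. */
    int ring_put(struct ring *r, const void *frame)
    {
        unsigned next = (r->tail + 1) % NFRAMES;
        if (next == r->head) return 0;
        memcpy(r->frames[r->tail], frame, FRAME_SIZE);
        r->tail = next;                 /* index update publishes the frame */
        return 1;
    }

    /* Consumer side (e.g. LANai draining the send queue); 0 means "empty",
     * so the polling loop stays short. */
    int ring_get(struct ring *r, void *frame)
    {
        if (r->head == r->tail) return 0;
        memcpy(frame, r->frames[r->head], FRAME_SIZE);
        r->head = (r->head + 1) % NFRAMES;
        return 1;
    }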

32
Protocol Decomposition
  • Majority of functionality on Host
  • Lowest latency, highest bandwidth for short
    messages
  • Host > LANai
  • Extant speed mismatches: Meiko CS-2, IBM SP2,
    Myrinet, ATM switches, etc.
  • Cost balance and upgrade issues
  • Division of labor affects:
  • Processor utilization, Pipeline Balance
  • Actual balance depends on message size

33
Data Integrity and Flow Control
  • Requirements
  • Reliable delivery, Prevent buffer under/overflow
    and deadlock
  • for multitasking
  • flow control should work for uncoordinated
    scheduling
  • Flow control can be performance critical
  • critical path, additional messages
  • FM's memory-to-memory protocol
  • enables stackable data integrity protocols
  • decouples processor and network performance

34
Flow Control Approaches
  • Return to Sender (optimistic)
  • Scalable protocol, group window
  • Packet returned to the sender if no space at
    receiver
  • Sender guarantees space for returned packets
  • Returned pkts eventually retransmitted (progress)
  • Windowing
  • Traditional approach, window for each destination
  • Non-scalable, large buffer requirements
  • Count kept for each of the nodes
  • Piggy-backed acks + an auxiliary low-water-mark
    mechanism (credit sketch below)
  • Simplifies checking/queuing/etc. (all on host)
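A sketch of sender-side credit accounting for the windowing scheme above: a count kept for each destination node, spent on each send and refilled by (possibly piggy-backed) acks. The window size, array bound, and function names are assumptions for illustration, not FM's protocol code.

    #define NNODES 64                    /* illustrative cluster size */
    #define WINDOW 32                    /* assumed per-destination window */

    static int credits[NNODES];          /* outstanding-send budget per node */

    void flow_init(void)
    {
        for (int i = 0; i < NNODES; i++)
            credits[i] = WINDOW;         /* receiver pre-reserves WINDOW frames */
    }

    /* Returns 1 if a packet to 'dest' may be injected now, 0 if the window
     * is closed and the sender must wait (back-pressure). */
    int flow_may_send(int dest)
    {
        if (credits[dest] == 0) return 0;
        credits[dest]--;
        return 1;
    }

    /* Called when an ack -- possibly piggy-backed on a data packet --
     * reports that the receiver has drained 'freed' frames. */
    void flow_ack(int dest, int freed)
    {
        credits[dest] += freed;
    }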

35
Flow Control Performance
[Chart: delivered bandwidth (MB/s) for the flow control schemes]
  • Flow control is cheap. Both schemes have
    comparable performance (Host reject)
  • Need LANai reject or Gang scheduling for memory
    benefits
  • Open issues: wiring buffers, window size, memory
    utilization, delivery order

36
If coprocessor were faster ...
  • More functionality (offload host)
  • Tag checking, demultiplexing
  • Return-to-sender flow control in coprocessor?
  • More performance (BW, latency)
  • No headroom in coprocessor performance...

37
Tag Checking Overhead
  • Dramatic impact for shorter messages
  • Importance likely to be accentuated in higher
    speed networks.

38
Analyzing FM 1.1's Performance


    Description     r∞ (MB/s)   n1/2 (bytes)   t0 (µs)
    Link            76          315            4.2
    Link + LANai    21.2        44             3.5
    FM              21.4        54             4.1
    FM-dma          33.0        162            7.5
    Myricom API     23.9        > 4,409        105
  • Vector startup performance model (see the sketch
    below)
  • Two orders of magnitude improvement in n1/2
  • Released FM 1.1 delivers 17.5 MB/s for
    arbitrary-length messages, 128-byte latency of
    20 µsecs
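The "vector startup" model behind the table is the usual two-parameter fit: transfer time T(n) = t0 + n/r∞, delivered bandwidth n/T(n), and n1/2 the message size that reaches half of r∞ (equal to t0·r∞ under the linear model). A small sketch using the Link row as the example; the other rows in the table are fitted measurements, not values derived from this formula.

    #include <stdio.h>

    int main(void)
    {
        const double t0   = 4.2e-6;          /* startup time, seconds (Link row) */
        const double rinf = 76e6;            /* asymptotic bandwidth, bytes/s */
        const double n_half = t0 * rinf;     /* ~319 bytes vs. 315 in the table */

        printf("n1/2 = %.0f bytes\n", n_half);
        for (int n = 64; n <= 4096; n *= 4) {
            double bw = n / (t0 + n / rinf); /* delivered bandwidth at size n */
            printf("%5d bytes -> %.1f MB/s\n", n, bw / 1e6);
        }
        return 0;
    }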

39
Comparative Latency
[Chart: latency (microsecs) across messaging layers: Fast Messages, LAM, SSAM (ATM), MPI-FM, MPL (SP2), Myricom API]
  • Competitive Latency, better bandwidth at small
    packets
  • LAM (2 mechanisms), MPI comparison
  • Functionality varies significantly

40
Perspective
  • Networks are really fast
  • Lots of careful low-level management needed to
    deliver even a fraction of the performance to the
    application
  • FM 1.x
  • Protocol decomposition
  • Streamlined buffer management
  • NIC and channel management
  • and the right guarantees...

41
Reading Assignments
  • Clusters and Networks
  • Grid Book, Chs. 17 and 20
  • High Performance Messaging on Workstations:
    Illinois Fast Messages (FM) for Myrinet. In
    Supercomputing '95 (Pakin, Lauria & Chien)
  • von Eicken, et al., U-Net: A User-Level Network
    Interface for Parallel and Distributed Computing,
    Proceedings of the 15th ACM Symposium on
    Operating Systems Principles, December 1995,
    pp. 40-53.
  • Optional
  • Pfister, In Search of Clusters: The Coming Battle
    in Lowly Parallel Computing, Prentice Hall, 1995,
    1998.
  • GNN10000 Interface Documentation, Myrinet
    Documentation
  • Next time
  • Multiprocess protection
  • The Virtual Interface Architecture