Clusters: Efficient Aggregation

Transcript and Presenter's Notes

1
Clusters Efficient Aggregation
  • Andrew Chien
  • CSE225
  • January 12, 1999

2
Announcements/Review
  • Homework 1 is due 1/16 (yes, on Saturday)
  • Leave at my office, or submit electronically (email)
  • Email Andrew w/ groups today (if you haven't
    already)
  • Slides up on the web (after class), let me know
    via email asap if there are problems
  • Grid Applications
  • Distributed Supercomputing, Real-time Distributed
    Instrumentation, Data Intensive Computing,
    Tele-immersion
  • Typical architectures and motivations
  • Application Requirements

3
Grid Service Requirements
  • All require basic support for access, initiation,
    and run (security, accounting, etc.)
  • All require coscheduling
  • Some benefit from wealth of network resources
    (faster network, better application)
  • Some require predictable network performance(QoS)
  • Some require integration of large data
    archives/repositories
  • Some require heterogeneous computer types
  • Some require integration of weird digital devices
    (instruments, sensors, actuators)

4
Todays Outline
  • Requirements for Efficient Aggregation in
    Clusters (Uniform Grids)
  • Communication, Scheduling, Resource Access
  • Fast Communication
  • Active Messages and Fast Messages

5
Original Clusters: High Throughput
  • Cycle stealing (Condor, Utopia -> LSF)
  • Shared, heterogeneous resources, slow networks
  • Enhancements: identifying resources, uniform
    access, migration, and scheduling
  • Uniprocessor jobs, multiprocessor, loosely
    coupled
  • No change: communication performance,
    coordination, quality of service
  • Key idea: reap shared resources that are idle.

6
HPC Clusters: High Aggregate Performance
  • Dedicated Cluster (SP2,CS-2,NOW) high aggregate
    performance
  • Dedicated, homogeneous, fast network
  • Enhancements: low overhead, high-BW
    communication, coarse-grain coordination
  • No support: coordinated multitasking,
    heterogeneity, quality of service, wide-area
  • Key idea: workstations/PCs are supercomputer
    building blocks

7
Grids: Enabling High Performance Distributed Computing
(Switched Multigigabit Networks)
  • Clusters: homogeneous (or nearly) collections of
    machines
  • connected by fast networks (usually switched,
    gigabit)
  • like processors (and like speed)
  • like configurations
  • like operating systems
  • identical binaries
  • => this is as easy as it gets for grids!

8
What do we need to meld this into an efficient
Aggregate?
  • Q: What examples of efficient aggregates do we
    have?
  • A: Tightly coupled parallel computers
  • Custom high speed interconnects
  • Deeply integrated network interfaces / special
    communication mechanisms (combining, barrier
    networks, service)
  • Lightweight customized node monitors/operating
    systems
  • Batch scheduled
  • Early '90s examples ...
  • Intel Paragon (i860-based computer with 100MB/s
    mesh)
  • Thinking Machines CM5 (Sparc-based computer with
    network interface on memory bus and special
    vector units)
  • Stanford DASH (directory based cache coherent
    linked 4-way SMPs)
  • Cray T3D (DEC Alpha with 300MB/s network in the
    memory controller and special network barrier and
    synchronization operations)

9
High Performance Communication
  • Exploiting parallelism requires efficient
    communication
  • high bandwidth data movement
  • low-latency communication
  • low-latency synchronization
  • Can efficiently distribute work
  • Can efficiently load balance
  • Can efficiently collect results
  • Can efficiently synchronize on state
  • Communication allows the pooling of resources and
    their joint application to a parallel task

10
Coordinated Processor Scheduling
  • Uncoordinated scheduling
  • degrades communication, coordination, external as
    well
  • Goal: coordinated processor management
  • 10 ms scale, dedicated efficiency for parallel
    jobs, interactive performance preserved for
    seq/par jobs
  • Must coexist with commodity software and
    timesharing schedulers (no gang scheduling)
  • Why do cluster/grid applications need this
    property?
  • We'll revisit this later in the quarter

11
Uniform Resource Access
  • View all resources as a uniform namespace
  • pooling of processors
  • pooling of memory
  • shared filesystem
  • shared process namespace
  • shared network resources and external access
    (naming)
  • => Single system image

12
Fast Messages Project Goals (1996)
  • Explore techniques for lowest latency, high
    bandwidth communication
  • 1. Dedicated clusters (single parallel job),
    MPPs
  • 2. Multi-tasked clusters (multiple parallel job),
    clustered SMPs
  • 3. Scalable servers (multiple jobs, external
    networking)
  • Develop high speed messaging prototypes to build
    a user community and explore the design space
  • Experimentation with in vivo usage
  • MPI, TCP, Scalable servers, Grand Challenge
    Applications
  • Exploration of more complex system issues

13
Why a new Communication Interface?
  • Existing parallel machine interfaces
    (non-standard)
  • Existing portable software interfaces (MPI, PVM),
    not very high performance
  • Cost of Communication services
  • Guarantees supplied critically affect performance
  • Performance Implications of Processor-network
    coupling
  • Popular interfaces do not provide decoupling

14
Active Messages
  • Lightweight Messaging
  • Thinking Machines CM5, Ncube/2
  • Demonstrated 10 - 20x reduction in communication
    overhead in messaging libraries
  • remote-procedure-like mechanism: handlers
  • process the data out of the network
  • go on with the computation (see the sketch below)
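A minimal sketch, in C, of the handler idea above. The packet layout and the names (am_packet_t, am_poll, net_recv) are illustrative assumptions, not the CM-5 or nCUBE/2 interface; the point is only that each message carries a handler that the receiver invokes directly on the payload, instead of buffering it for a later receive call.

    /* Active-message sketch: a packet carries the address of a handler
     * that the receiver runs directly on the payload. */
    #include <stddef.h>
    #include <stdint.h>

    typedef void (*am_handler_t)(void *payload, size_t len);

    typedef struct {
        am_handler_t handler;      /* function to run at the destination */
        size_t       len;          /* payload length in bytes */
        uint8_t      payload[64];  /* small, fixed-size payload */
    } am_packet_t;

    /* Stub for illustration; a real version would read the network device. */
    static int net_recv(am_packet_t *pkt) { (void)pkt; return 0; }

    /* Receive side: drain the network by running handlers, then resume
     * computation -- no intermediate message buffering or matching. */
    void am_poll(void)
    {
        am_packet_t pkt;
        while (net_recv(&pkt))
            pkt.handler(pkt.payload, pkt.len);
    }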

15
Cost of Communication (Active Messages on the CM-5)
  • Efficient implementation of a lean messaging
    layer
  • Metric: dynamic instruction counts (overhead)
  • Dependent on interfacing and network services
  • Full study in Karamcheti & Chien, ASPLOS '94

[Diagram: processor (P) with status registers and SEND/RECEIVE paths to the data network]
16
Costs for 16-word Messages
[Bar chart: instruction counts (0-500) at Src, Dest, and Total for finite and indefinite sequences, broken into Base Cost, Buffer Mgmt., In-order Del., and Fault-toler.]
  • Messaging cost is significantly higher than the
    base cost (50-70%), 47 insts/send-recv
  • Cost is in application-level services

17
Costs for 1024-word Messages
[Bar chart: instruction counts (0-30000) at Src, Dest, and Total for finite and indefinite sequences, broken into Base Cost, Buffer Mgmt., In-order Del., and Fault-toler.]
  • Finite sequence overhead reduced to 10%
  • Indefinite sequence overhead scales with message
    size (still 70%)

18
Requirements for High Performance Communication
  • Reliable delivery
  • flow control, buffer management
  • In-order delivery
  • tradeoff with known penalties
  • Decoupled interaction with processor
  • processor performance
  • communication performance

19
Fast Messages Interface (V 1.0)
  • FM_send(dest, handler, buffer, size)
  • generic send, handler called at destination
  • buffer released after send returns
  • Memory-to-memory protocol
  • FM_extract()
  • Blind poll, Invoke handlers for all pending
    messages
  • Buffers released upon handler completion
  • Extract calls determine when messages are
    processed
  • => Essentially an Active Messages 1.0 API (usage sketch below)
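A usage sketch for the two calls above. The slide gives only the call names and arguments, so the extern prototypes, the handler signature, and the names on_result and send_partial_sum below are assumptions for illustration, not the released FM header.

    /* Assumed prototypes modeled on the slide's FM_send(dest, handler,
     * buffer, size) and FM_extract(); the exact types are an assumption. */
    #include <string.h>

    typedef void (*fm_handler_t)(void *buf, int size);
    extern void FM_send(int dest, fm_handler_t handler, void *buffer, int size);
    extern void FM_extract(void);

    /* Handler run at the destination from inside FM_extract(). */
    static void on_result(void *buf, int size)
    {
        (void)size;
        double partial;
        memcpy(&partial, buf, sizeof partial);   /* consume the message body */
    }

    void send_partial_sum(int dest, double partial)
    {
        char buf[sizeof partial];
        memcpy(buf, &partial, sizeof partial);

        /* Sender: the buffer may be reused as soon as FM_send returns. */
        FM_send(dest, on_result, buf, (int)sizeof buf);

        /* Blind poll: runs handlers for all pending messages, so the
         * placement of FM_extract() calls decides when messages get
         * processed. */
        FM_extract();
    }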

20
FM Messaging Costs (T3D)
  • Two implementations, low overheads (0.3 x ovhd)
  • Push and Pull messaging (available on WWW)

21
Myrinet available, Fall 1994
  • 1.2 Gbps, 80MB/s duplex
  • < 1 µs switch latency
  • Identical wormhole routing technology to parallel
    computers
  • => should be able to build high performance
    clusters from this, right?

22
Fast Messages on a Cluster
  • Sun Sparc 20 Workstations
  • SBus (latency 0.5 µs, BW 45 MB/s burst,
    20-30 MB/s processor), MBus (read-write BWs of
    40-80 MB/s)
  • Myrinet LAN (LANai 2.3 numbers)
  • Byte-wide, twisted-pair ribbons, 30 m, 640 Mbps
    full duplex
  • $1.5K per machine connection (switch port,
    processor interface, and cables)

23
FM Design Philosophy and Strategy
  • Goal: low latency and high bandwidth
  • Design strategy
  • Minimal layer first, Pay for what you use!
  • Drive small message performance
  • FM on the Myrinet is a minimal low-level layer
  • Perspective
  • Networks are really fast.
  • Network Interface Processors are not very fast
    (relatively)

24
Design Features
  • Operating system bypass to achieve high
    performance
  • memory-mapping (sketched below): protection and
    multiplexing must be in the device
  • Thread-safety for convenient use
  • Tuned for low latency and high bandwidth for
    short messages for accessible performance
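A sketch of what the operating-system bypass via memory mapping can look like in C: the NIC's memory is mapped into the user address space once, so per-message sends and receives are plain loads and stores with no system call on the critical path. The device path, mapping size, and map_nic name are placeholders, not the actual Myrinet driver interface.

    /* OS-bypass sketch: map device memory into the process once; afterwards
     * the messaging layer touches it directly.  Protection and multiplexing
     * then have to be enforced by the device, as noted above. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NIC_MAP_SIZE (128 * 1024)   /* e.g. the LANai's on-board SRAM */

    volatile unsigned char *map_nic(const char *devpath)
    {
        int fd = open(devpath, O_RDWR);
        if (fd < 0) { perror("open"); return NULL; }

        void *p = mmap(NULL, NIC_MAP_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping outlives the fd */
        if (p == MAP_FAILED) { perror("mmap"); return NULL; }
        return (volatile unsigned char *)p;
    }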

25
Fast Messages Design Tactics
  • Extremely simple design
  • Simple host-LANai synchronization
  • Simple buffer management
  • Host faster than LANai (NI processor)
  • Shift work to the host
  • Minimize critical path
  • Minimize copying, flow control back through
    layers
  • Move noncritical stuff out of line
  • Optimize for bursts (BW and low latency)

26
LANai-to-LANai Performance
[Chart: LANai-to-LANai performance in microseconds, with and without streaming]
  • Only a small fraction of the network performance
    is achievable
  • Careful tuning necessary to preserve as much as
    possible
  • Streaming is the basis for later graphs

27
Host - LANai Coupling
  • Queues decouple; pointers form the basis of
    synchronization
  • Performance critical
  • Quick decisions, memory hierarchy traffic
  • Data Movement (Processor I/O vs. DMA)

28
Data Movement
[Chart: bandwidth (MB/s) vs. frame size for processor writes vs. DMA]
  • Processor mediated vs. DMA-based
  • 64 bytes: 2x, 128 bytes: 30%, crossover at 200 bytes
  • Better bridges should improve processor-mediated
    transfers
  • Processor-mediated avoids a memory copy

29
Data Movement (Latencies)
[Chart: latency in microseconds vs. frame size for processor writes vs. DMA]
  • DMA incurs a startup penalty, but BW dominates
    (see the toy model below)
  • => copy to a DMA-able region is not included
  • => host-initiated DMA is desirable
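A toy model of the processor-I/O vs. DMA tradeoff charted above. The bandwidth and startup numbers are assumptions chosen to lie in the ranges the slides mention, not the measured curves; with these values the crossover lands near 200 bytes, which matches the shape of the measurements.

    /* PIO has no startup cost but lower bandwidth; DMA pays a fixed setup
     * cost but streams faster, so it wins for large frames. */
    #include <stdio.h>

    int main(void)
    {
        const double pio_bw = 25e6;  /* assumed: ~20-30 MB/s processor writes */
        const double dma_bw = 45e6;  /* assumed: SBus burst-rate DMA */
        const double dma_t0 = 4e-6;  /* assumed: DMA setup time, seconds */

        for (int n = 32; n <= 1024; n *= 2) {
            double t_pio = n / pio_bw;
            double t_dma = dma_t0 + n / dma_bw;
            printf("%5d bytes: PIO %5.2f us, DMA %5.2f us -> %s\n",
                   n, t_pio * 1e6, t_dma * 1e6,
                   t_pio < t_dma ? "PIO wins" : "DMA wins");
        }
        return 0;
    }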

30
Queuing Issues
  • Issues
  • What and how many queues to have, Where to place
    the queue pointers? (ownership, consistency),
    Where to place the queues? (access, bandwidth),
    How much work to do on each side?
  • Constraints
  • Limited host memory access (DMA only)
  • Limited LANai memory
  • 128 KB memory (< 2 milliseconds' worth -- host
    queuing essential for decoupling)
  • LANai processing inefficiency
  • Fixed size frames
  • Low latency -> minimal processing overhead ->
    short polling loops

31
Queuing in FM
  • Send queue in LANai (0-copy)
  • Receive queue in LANai and host memory (decouple,
    1-copy) -- see the ring sketch below
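A simplified sketch of the pointer-synchronized, fixed-frame queues described above: a single-producer/single-consumer ring in which the head and tail indices are the only shared state, so publishing an index update is the synchronization. The frame size, depth, and volatile-only synchronization are illustrative simplifications, not the FM implementation.

    #include <string.h>

    #define FRAME_SIZE 128              /* fixed-size frames, as in FM */
    #define NFRAMES    64               /* illustrative queue depth */

    struct ring {
        volatile unsigned head;         /* advanced only by the consumer */
        volatile unsigned tail;         /* advanced only by the producer */
        unsigned char frames[NFRAMES][FRAME_SIZE];
    };

    /* Producer side (e.g. host filling the send queue); 0 means "full",
     * which is where flow control has to step in. */
    int ring_put(struct ring *r, const void *frame)
    {
        unsigned next = (r->tail + 1) % NFRAMES;
        if (next == r->head) return 0;
        memcpy(r->frames[r->tail], frame, FRAME_SIZE);
        r->tail = next;                 /* index update publishes the frame */
        return 1;
    }

    /* Consumer side (e.g. LANai draining the send queue); 0 means "empty",
     * so the polling loop stays short. */
    int ring_get(struct ring *r, void *frame)
    {
        if (r->head == r->tail) return 0;
        memcpy(frame, r->frames[r->head], FRAME_SIZE);
        r->head = (r->head + 1) % NFRAMES;
        return 1;
    }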

32
Protocol Decomposition
  • Majority of functionality on Host
  • Lowest latency, highest bandwidth for short
    messages
  • Host > LANai
  • Extant speed mismatches: Meiko CS-2, IBM SP2,
    Myrinet, ATM switches, etc.
  • Cost balance and upgrade issues
  • Division of labor affects:
  • Processor utilization, Pipeline Balance
  • Actual balance depends on message size

33
Data Integrity and Flow Control
  • Requirements
  • Reliable delivery, Prevent buffer under/overflow
    and deadlock
  • for multitasking
  • flow control should work for uncoordinated
    scheduling
  • Flow control can be performance critical
  • critical path, additional messages
  • FM's memory-to-memory protocol
  • enables stackable data integrity protocols
  • decouples processor and network performance

34
Flow Control Approaches
  • Return to Sender (optimistic)
  • Scalable protocol, group window
  • Packet returned to the sender if no space at
    receiver
  • Sender guarantees space for returned packets
  • Returned pkts eventually retransmitted (progress)
  • Windowing
  • Traditional approach, window for each destination
  • Non-scalable, large buffer requirements
  • Count kept for each of the nodes
  • Piggy-backed acks + an auxiliary low-water-mark
    mechanism (credit sketch below)
  • Simplifies checking/queuing/etc. (all on host)
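A sketch of sender-side credit accounting for the windowing scheme above: a count kept for each destination node, spent on each send and refilled by (possibly piggy-backed) acks. The window size, array bound, and function names are assumptions for illustration, not FM's protocol code.

    #define NNODES 64                    /* illustrative cluster size */
    #define WINDOW 32                    /* assumed per-destination window */

    static int credits[NNODES];          /* outstanding-send budget per node */

    void flow_init(void)
    {
        for (int i = 0; i < NNODES; i++)
            credits[i] = WINDOW;         /* receiver pre-reserves WINDOW frames */
    }

    /* Returns 1 if a packet to 'dest' may be injected now, 0 if the window
     * is closed and the sender must wait (back-pressure). */
    int flow_may_send(int dest)
    {
        if (credits[dest] == 0) return 0;
        credits[dest]--;
        return 1;
    }

    /* Called when an ack -- possibly piggy-backed on a data packet --
     * reports that the receiver has drained 'freed' frames. */
    void flow_ack(int dest, int freed)
    {
        credits[dest] += freed;
    }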

35
Flow Control Performance
[Chart: delivered bandwidth (MB/s) for the flow control schemes]
  • Flow control is cheap. Both schemes have
    comparable performance (Host reject)
  • Need LANai reject or Gang scheduling for memory
    benefits
  • Open issues: wiring buffers, window size, memory
    utilization, delivery order

36
If coprocessor were faster ...
  • More functionality (offload host)
  • Tag checking, demultiplexing
  • Return-to-sender flow control in coprocessor?
  • More performance (BW, latency)
  • No headroom in coprocessor performance...

37
Tag Checking Overhead
  • Dramatic impact for shorter messages
  • Importance likely to be accentuated in higher
    speed networks.

38
Analyzing FM 1.1's Performance


    Description     r∞ (MB/s)   n1/2 (bytes)   t0 (µs)
    Link            76          315            4.2
    Link + LANai    21.2        44             3.5
    FM              21.4        54             4.1
    FM-dma          33.0        162            7.5
    Myricom API     23.9        > 4,409        105
  • Vector startup performance model (see the sketch
    below)
  • Two orders of magnitude improvement in n1/2
  • Released FM 1.1 delivers 17.5 MB/s for
    arbitrary-length messages, 128-byte latency of
    20 µsecs
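The "vector startup" model behind the table is the usual two-parameter fit: transfer time T(n) = t0 + n/r∞, delivered bandwidth n/T(n), and n1/2 the message size that reaches half of r∞ (equal to t0·r∞ under the linear model). A small sketch using the Link row as the example; the other rows in the table are fitted measurements, not values derived from this formula.

    #include <stdio.h>

    int main(void)
    {
        const double t0   = 4.2e-6;          /* startup time, seconds (Link row) */
        const double rinf = 76e6;            /* asymptotic bandwidth, bytes/s */
        const double n_half = t0 * rinf;     /* ~319 bytes vs. 315 in the table */

        printf("n1/2 = %.0f bytes\n", n_half);
        for (int n = 64; n <= 4096; n *= 4) {
            double bw = n / (t0 + n / rinf); /* delivered bandwidth at size n */
            printf("%5d bytes -> %.1f MB/s\n", n, bw / 1e6);
        }
        return 0;
    }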

39
Comparative Latency
[Chart: latency (microsecs) across messaging layers: Fast Messages, LAM, SSAM (ATM), MPI-FM, MPL (SP2), Myricom API]
  • Competitive Latency, better bandwidth at small
    packets
  • LAM (2 mechanisms), MPI comparison
  • Functionality varies significantly

40
Perspective
  • Networks are really fast
  • Lots of careful low-level management needed to
    deliver even a fraction of the performance to the
    application
  • FM 1.x
  • Protocol decomposition
  • Streamlined buffer management
  • NIC and channel management
  • and the right guarantees...

41
Reading Assignments
  • Clusters and Networks
  • Grid Book, Chs. 17 and 20
  • High Performance Messaging on Workstations:
    Illinois Fast Messages (FM) for Myrinet. In
    Supercomputing '95 (Pakin, Lauria & Chien)
  • von Eicken, et al., U-Net: A User-Level Network
    Interface for Parallel and Distributed Computing,
    Proceedings of the 15th ACM Symposium on
    Operating Systems Principles, December 1995,
    pp. 40-53.
  • Optional
  • Pfister, In Search of Clusters: The Coming Battle
    in Lowly Parallel Computing, Prentice Hall, 1995,
    1998.
  • GNN10000 Interface Documentation, Myrinet
    Documentation
  • Next time
  • Multiprocess protection
  • The Virtual Interface Architecture