Slides: 46
Provided by: DaveT2
Title: The Future of Parallel Computing


1
The Future of Parallel Computing
  • Dave Turner
  • In collaboration with
  • Xuehua Chen, Adam Oline, and Troy Benjegerdes
  • Scalable Computing Laboratory of Ames Laboratory
  • This work was funded by the MICS office of the US
    Department of Energy

2
Outline
  • Overview of parallel computing
  • Measuring the performance of the communication
    system
  • Improving the message-passing performance
  • Taking advantage of the network topology
  • Science enabled by parallel computing

3
The typical small cluster
  • 2.8 GHz dual-Xeon node ($1800 each)
  • 2 GB RAM
  • SuperMicro or Tyan motherboards
  • Built-in single or dual Gigabit Ethernet
  • 1U or 2U rackmount, as needed for expansion
  • 24-port Gigabit Ethernet switch
  • Asante or Netgear ($1400)
  • 47U Rack ($3000)
  • KVM, UPS, Monitor, cables
  • Software
  • Intel compilers ($560 academic)
  • MPI, PVM, PBS scheduler

$25,000 will get you a master plus 10 compute
nodes (22 processors). Cluster vendors such as
Atipa and Microway will sell fully integrated
clusters.
4
The IBM Blue Gene/L
  • 65,536 dual-processor nodes
  • 700 MHz PowerPC 440 cores with dual floating
    point units
  • 256-512 MB RAM
  • Each node runs a very stripped down version of
    Linux
  • Should fit in a room the size of a tennis court
  • Peak computational rate of 360 Teraflops
  • 5 separate networks
  • 3D torus network for MPI communications (MPICH2)
  • 1.4 Gbps peak bandwidth in each direction
  • A tree network connects every 64 node boards to an
    I/O node
  • Will also handle some MPI collectives
  • A Fast Ethernet control network
  • Global interrupt network
  • I/O nodes
  • Gigabit Ethernet connection
  • Run a full Linux kernel

5
The dual-Athlon cluster with SCI interconnects
  • Initially we will connect 64 dual-Athlon PCs with
    an 8x8 SCI grid.
  • The Mezzanine card will allow for a 3D torus in
    the future.

6
Blade-based clusters
  • dual-Xeon blades ($5000 each)
  • 2-8 GB RAM
  • 1-2 slow mini-disks (40-80 GB each)
  • Built-in single or dual Gigabit Ethernet
  • 1 PCI-X expansion slot
  • 6U Chassis
  • Holds 10 blades plus a network switch module
  • Lots of cooling
  • 48U rack
  • Could hold 80 blades (160 processors)

Prices will come down.
7
Inefficiencies in the communication system
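[Diagram: the layers of the communication system, from applications through MPI, the native layer, internal buses, driver, and NIC to the switch fabric. Labels: 75% bandwidth, 2-3x latency; PCI, memory; topological bottlenecks; poor MPI usage, no mapping; hardware limits, driver tuning; OS bypass, TCP tuning.]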
8
Waveguide simulations using the parallel Finite
Difference Time Domain method
Kai-Ming Ho, Rana Biswas, Mihail Sigalas, Ihab
El-Kady, Mehmet Su, Dave Turner, Bogdan Vasiliu
9
Waveguide bends in three dimensional layer by
layer photonic band gap materials, M.M. Sigalas,
R. Biswas, K.M. Ho, C.M. Soukoulis, D.E. Turner,
B. Vasiliu, S.C. Kothari, and Shawn Lin,
Microwave and Optical Technology Letters, Vol.
23, 56-59 (Oct. 5, 1999).
10

[Figure text: "with or without fence calls." "Measure performance or do an integrity test."]
http://www.scl.ameslab.gov/Projects/NetPIPE/
11
The NetPIPE utility
  • NetPIPE does a series of ping-pong tests
    between two nodes.
  • Message sizes are chosen at regular intervals,
    and with slight perturbations, to fully test the
    communication system for idiosyncrasies.
  • Latencies reported represent half the ping-pong
    time for messages smaller than 64 Bytes.

Some typical uses
  • Measuring the overhead of message-passing
    protocols.
  • Help in tuning the optimization parameters of
    message-passing libraries.
  • Optimizing driver and OS parameters (socket
    buffer sizes, etc.).
  • Identifying dropouts in networking hardware and
    drivers.

What is not measured
  • NetPIPE cannot measure the load on the CPU yet.
  • The effects from the different methods for
    maintaining message progress.
  • Scalability with system size.
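
The ping-pong measurement described above can be sketched in a few lines. This is only an illustration and not NetPIPE's own code: it bounces messages over a local socket pair inside one process, and the function names, the power-of-two size schedule, and the perturbation of 3 bytes are all assumptions made for the example.

```python
import socket
import threading
import time


def message_sizes(max_bytes, perturbation=3):
    """Powers of two up to max_bytes, plus neighbours at +/- perturbation."""
    sizes = set()
    n = 1
    while n <= max_bytes:
        for size in (n - perturbation, n, n + perturbation):
            if 1 <= size <= max_bytes:
                sizes.add(size)
        n *= 2
    return sorted(sizes)


def recv_exact(sock, nbytes):
    """Receive exactly nbytes from a stream socket."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_server(sock, sizes, repeats):
    """The 'pong' side: send every message straight back."""
    for size in sizes:
        for _ in range(repeats):
            sock.sendall(recv_exact(sock, size))


def ping_pong(sizes, repeats=20):
    """Return {size: (one_way_seconds, megabits_per_second)}."""
    ping, pong = socket.socketpair()
    server = threading.Thread(target=echo_server, args=(pong, sizes, repeats))
    server.start()
    results = {}
    try:
        for size in sizes:
            payload = bytes(size)
            start = time.perf_counter()
            for _ in range(repeats):
                ping.sendall(payload)
                recv_exact(ping, size)
            elapsed = time.perf_counter() - start
            one_way = elapsed / (2 * repeats)  # half the round-trip time
            results[size] = (one_way, size * 8 / one_way / 1e6)
    finally:
        ping.close()
        server.join()
        pong.close()
    return results


if __name__ == "__main__":
    for size, (latency, mbps) in ping_pong(message_sizes(4096)).items():
        print(f"{size:6d} bytes  {latency * 1e6:8.1f} us  {mbps:10.1f} Mbps")
```

Running both ends in one process only exercises the local socket path, so the numbers say nothing about a real network; the point is the shape of the test loop.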

12
Recent additions to NetPIPE
  • Can do an integrity test instead of measuring
    performance.
  • Streaming mode measures performance in 1
    direction only.
  • Must reset sockets to avoid effects from a
    collapsing window size.
  • A bi-directional ping-pong mode has been added
    (-2).
  • One-sided Get and Put calls can be measured
    (MPI or SHMEM).
  • Can choose whether to use an intervening
    MPI_Win_fence call to synchronize.
  • Messages can be bounced between the same
    buffers (default mode), or they can be started
    from a different area of memory each time.
  • There are lots of cache effects in SMP
    message-passing.
  • InfiniBand can show similar effects since
    memory must be registered with the card.

[Figure: Process 0 and Process 1, with positions labeled 0 to 3.]
13
Current projects
  • Overlapping pair-wise ping-pong tests.
  • Must consider synchronization if not using
    bi-directional communications.

[Figure: nodes n0 to n3 connected through an Ethernet switch, shown for two cases: line speed vs end-point limited.]
  • Investigate other methods for testing the
    global network.
  • Evaluate the full range from simultaneous
    nearest neighbor communications to all-to-all.

14
LAM/MPI
  • LAM 6.5.6-4 release from the RedHat 7.2
    distribution.
  • Must lamboot the daemons.
  • -lamd directs messages through the daemons.
  • -O avoids data conversion for homogeneous
    systems.
  • No socket buffer size tuning.
  • No threshold adjustments.

Currently developed at Indiana University.
http://www.lam-mpi.org/
15
PVM
  • PVM 3.4.3 release from the RedHat 7.2
    distribution.
  • Uses XDR encoding and the pvmd daemons by
    default.
  • pvm_setopt(PvmRoute, PvmRouteDirect) bypasses
    the pvmd daemons.
  • pvm_initsend(PvmDataInPlace) avoids XDR
    encoding for homogeneous systems.

Developed at Oak Ridge National Laboratory.
http://www.csm.ornl.gov/pvm/
16
A NetPIPE example: Performance on a Cray T3E
  • Raw SHMEM delivers 2600 Mbps with a 2-3 us latency.
  • Cray MPI originally delivered 1300 Mbps with a 20 us latency.
  • MP_Lite delivers 2600 Mbps with a 9-10 us latency.
  • The new Cray MPI delivers 2400 Mbps with a 20 us latency.

The tops of the spikes are where the message size
is divisible by 8 Bytes.
17
Channel-bonding Gigabit Ethernet for
better communications between nodes
Channel-bonding uses 2 or more Gigabit Ethernet
cards per PC to increase the communication rate
between nodes in a cluster.
GigE cards cost $40 each. 24-port switches cost $1400. → $100 /
computer.
This is much more cost effective for PC
clusters than using more expensive networking
hardware, and may deliver similar performance.
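
The per-computer figure above can be checked with a short calculation. This is a sketch under two assumptions: the prices are in US dollars, and each extra channel needs one more card per PC plus one port on an additional 24-port switch. The function name is made up for the example.

```python
def added_cost_per_node(card_cost, switch_cost, switch_ports):
    """Cost of one extra bonded channel for one PC: a card plus one switch port."""
    return card_cost + switch_cost / switch_ports


cost = added_cost_per_node(card_cost=40, switch_cost=1400, switch_ports=24)
print(f"Extra cost per computer: {cost:.2f}")
```

With the slide's prices this comes to a little under 100 per computer, which matches the rounded figure quoted.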
18
Performance for channel-bonded Gigabit Ethernet
GigE can deliver 900 Mbps with latencies of 25-62
us for PCs with 64-bit / 66 MHz PCI
slots. Channel-bonding 2 GigE cards / PC using
MP_Lite doubles the performance for large
messages. Adding a 3rd card does not help
much. Channel-bonding 2 GigE cards / PC using
Linux kernel level bonding actually results in
poorer performance. The same tricks that make
channel-bonding successful in MP_Lite should make
Linux kernel bonding work even better. Any
message-passing system could then make use of
channel-bonding on Linux systems.
Channel-bonding multiple GigE cards using MP_Lite
and Linux kernel bonding
19
Performance on Mellanox InfiniBand cards
A new NetPIPE module allows us to measure the raw
performance across InfiniBand hardware (RDMA and
Send/Recv). Burst mode preposts all receives to
duplicate the Mellanox test. The no-cache
performance is much lower when the memory has to
be registered with the card. An MP_Lite
InfiniBand module will be incorporated into
LAM/MPI.
MVAPICH 0.9.1
20
10 Gigabit Ethernet
  • Intel 10 Gigabit Ethernet cards
  • 133 MHz PCI-X bus
  • Single mode fiber
  • Intel ixgb driver
  • Can only achieve 2 Gbps now.
  • Latency is 75 us.
  • Streaming mode delivers up to 3 Gbps.
  • Much more development work is needed.
21
Comparison of high-speed interconnects
InfiniBand can deliver 4500 - 6500 Mbps at a 7.5
us latency. Atoll delivers 1890 Mbps with a 4.7
us latency. SCI delivers 1840 Mbps with only a
4.2 us latency. Myrinet performance reaches 1820
Mbps with an 8 us latency. Channel-bonded GigE
offers 1800 Mbps for very large messages. Gigabit
Ethernet delivers 900 Mbps with a 25-62
us latency. 10 GigE only delivers 2 Gbps with a
75 us latency.
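
One way to read these numbers together is a simple linear cost model, where sending n bytes takes the latency plus n divided by the peak rate. That model is an assumption introduced here (measured NetPIPE curves are not this smooth), and where the slide gives a range the lower bound is used. The sketch computes the effective rate for a given message size and the message size at which each interconnect reaches half of its peak rate.

```python
# Peak rate in Mbps and latency in microseconds, taken from the slide.
# Where the slide quotes a range, the lower bound is used (an assumption).
INTERCONNECTS = {
    "InfiniBand": (4500, 7.5),
    "Atoll": (1890, 4.7),
    "SCI": (1840, 4.2),
    "Myrinet": (1820, 8.0),
    "Gigabit Ethernet": (900, 25.0),
    "10 GigE": (2000, 75.0),
}


def transfer_time_us(nbytes, peak_mbps, latency_us):
    """Linear model: latency plus serialization time (bits / Mbps = us)."""
    return latency_us + 8 * nbytes / peak_mbps


def effective_mbps(nbytes, peak_mbps, latency_us):
    """Throughput actually seen for one message of nbytes."""
    return 8 * nbytes / transfer_time_us(nbytes, peak_mbps, latency_us)


def half_performance_bytes(peak_mbps, latency_us):
    """Message size at which the effective rate is half of the peak rate."""
    return peak_mbps * latency_us / 8


if __name__ == "__main__":
    for name, (peak, latency) in INTERCONNECTS.items():
        n_half = half_performance_bytes(peak, latency)
        print(f"{name:18s} half of peak at about {n_half:8.0f} bytes")
```

Under this model SCI reaches half of its peak at about 966 bytes, while 10 GigE needs messages of about 18750 bytes, which illustrates why latency matters as much as the headline rate for small messages.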
22
The MP_Lite message-passing library
  • A light-weight MPI implementation
  • Highly efficient for the architectures supported
  • Designed to be very user-friendly
  • Ideal for performing message-passing research
  • http://www.scl.ameslab.gov/Projects/MP_Lite/

23
2-copy SMP message-passing
[Diagram: two processors, each with its own cache, running Process 0 and Process 1; a message is copied into a shared-memory segment in main memory and then copied out again.]
24
Shared-memory message-passing using a
typical semaphore-based approach
  • One large segment shared by all processors
  • Minimize lockouts to when the linked list changes only
  • Minimize search time with a second linked list for each destination
  • Semaphores are slow
  • Still not scalable

[Diagram: Processes 0 to 3 attached to one shared-memory segment holding queued messages 0→1, 2→0, 1→3, 1→0, 3→2, and 1→2.]
25
MP_Lite locking FIFO approach
  • The message headers are sent through shared-memory FIFO pipes.
  • The main segment is only locked during allocation/de-allocation.
  • A process spins on an atomic operation with an occasional schedule yield.
  • An optimized memory copy routine is used.

[Diagram: Processes 0 to 3 attached to a shared-memory segment holding messages 0→1, 2→0, 1→3, 1→0, 3→2, and 1→2, with FIFO pipes between process pairs (FIFO 0→1, 0→2, 0→3, 1→0, 1→2, 1→3, and 3→2 are shown).]
26
Optimizing the memory copy routine
The 686 version of GLIBC has a memcpy routine
that does byte-by-byte copies for messages not
divisible by 4 bytes. The Intel memcpy is good,
but does not make use of the non-temporal copy in
the Pentium 4 instruction set. An optimal memcpy
is being developed to try to provide the best
performance everywhere.
With the data starting in cache.
2.4 GHz Xeon running RedHat 7.3
27
MP_Lite lock free approach
  • Each process has its own section for outgoing messages.
  • Other processes only flip a cleanup flag.
  • No lockouts provide excellent scaling.
  • Doubly linked lists for very efficient searches, or combine with the
    shared-memory FIFO method.

[Diagram: the shared-memory segment is divided into sections 0 to 3, one per process; section 0 holds message 0→1, section 1 holds 1→3, 1→0, and 1→2, section 2 holds 2→0, and section 3 holds 3→2.]
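
The ownership rule behind the lock-free approach can be illustrated with a toy model. This is not MP_Lite's code: it runs in a single Python process, uses a plain list instead of a doubly linked list in shared memory, and the class and method names are invented for the example. A real shared-memory version would also need the appropriate memory ordering around the flag.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    dest: int
    payload: bytes
    consumed: bool = False  # the cleanup flag, set only by the receiver


@dataclass
class Outbox:
    """The section of the segment owned by one sending process."""
    owner: int
    slots: list = field(default_factory=list)

    def post(self, dest, payload):
        """Owner only: reclaim consumed slots, then append a new message."""
        self.slots = [s for s in self.slots if not s.consumed]
        self.slots.append(Slot(dest, payload))

    def take(self, reader):
        """Any reader: copy out its oldest message and flip the cleanup flag.

        The reader never links or unlinks slots, so it never needs a lock.
        """
        for slot in self.slots:
            if slot.dest == reader and not slot.consumed:
                slot.consumed = True
                return slot.payload
        return None


if __name__ == "__main__":
    box = Outbox(owner=1)
    box.post(3, b"to 3")
    box.post(0, b"to 0")
    print(box.take(0), box.take(0))
```

Because only the owner changes the structure of its section and readers only set a flag, no two processes ever modify the same list, which is the property the slide credits for the good scaling.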
28
SMP message-passing performance
with cache effects
With the data starting in cache.
LAM/MPI has the lowest latency at 1 us, with
MPICH2 and MP_Lite at 2 us. MP_Lite dominates in
the cache region due to better lock and header
handling. The non-temporal memory copy boosts
performance by 50% for large messages.
1.7 GHz dual-Xeon running RedHat 8.0
29
Bi-directional SMP message-passing performance
With the data starting in cache.
MP_Lite and LAM/MPI have latencies around 3 us,
with MPICH and at 16 us. MP_Lite and MPICH peak
at 7000 Mbps. The non-temporal memory copy of
MP_Lite boosts the large message rate by 20%.
Bi-directional results are pretty similar to the
uni-directional results.
1.7 GHz dual-Xeon running RedHat 8.0
30
Performance using the Ames Lab Classical
Molecular Dynamics code
Communication times in microseconds per iterative
cycle
2 MPI processes on a 1.7 GHz dual-Xeon computer
running RedHat 7.3
31
1-copy SMP message-passing for Linux
[Diagram: two processors, each with its own cache, running Process 0 and Process 1; a kernel put or kernel get moves the message between them with a single kernel copy through main memory.]
This should double the throughput. It is unclear
what the latency will be. It should be
relatively easy to write an MPI implementation.
32
Writing the Linux module
  • The kernel has 2 functions for transferring data
    to and from user space.
  • copy_from_user() checks for read access then gets
    data.
  • copy_to_user() checks for write access then puts
    data.
  • Write a copy_from_user_to_user() function.
  • Create an initialization function
    join_copy_group().
  • Check for read/write access to all the processes
    once.
  • Expose these to the message-passing layer using a
    module kernel_copy.c.
  • Write an MPI implementation using 1-sided Gets
    and Puts.
  • MPI_Init() will call join_copy_group().
  • MPI_Send() will put data if a receive is
    preposted, else push to a send buffer and post.
  • MPI_Recv() will block on message reception, or
    posting of a matching buffered send, in which case it
    would get the data.
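
The send and receive rules in the last two bullets can be modeled without any kernel code. The sketch below is a toy, single-process simulation: the class and method names are hypothetical, it allows only one preposted receive per sender and receiver pair, and where the real MPI_Recv() would block it simply records the preposted buffer and returns.

```python
class CopyGroup:
    """Toy model of processes that have joined one kernel copy group."""

    def __init__(self):
        self.preposted = {}  # (src, dst) -> receive buffer waiting for a put
        self.buffered = {}   # (src, dst) -> queue of buffered sends
        self.copies = 0      # number of memory copies performed so far

    def _copy(self, data, buf):
        if len(data) > len(buf):
            raise ValueError("receive buffer is too small")
        buf[:len(data)] = data
        self.copies += 1

    def send(self, src, dst, data):
        """Put directly if a receive is preposted, else buffer and post."""
        buf = self.preposted.pop((src, dst), None)
        if buf is not None:
            self._copy(data, buf)  # one copy, straight into the receiver
            return "put"
        self.buffered.setdefault((src, dst), []).append(bytes(data))
        self.copies += 1           # the copy into the send buffer
        return "buffered"

    def recv(self, dst, src, buf):
        """Get a matching buffered send, else prepost the receive buffer."""
        queue = self.buffered.get((src, dst))
        if queue:
            self._copy(queue.pop(0), buf)
            return "got"
        self.preposted[(src, dst)] = buf  # the real call would block here
        return "preposted"


if __name__ == "__main__":
    group = CopyGroup()
    inbox = bytearray(4)
    print(group.recv(1, 0, inbox), group.send(0, 1, b"ping"), bytes(inbox))
```

The copy counter shows the trade-off: a send that finds a preposted receive costs one copy, while a send that arrives first costs two.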

33
0-copy SMP message-passing for Linux?
[Diagram: Process 0 calls MP_FreeSend() and Process 1 calls MP_MallocRecv(); a kernel virtual copy re-assigns pages 1, 2, and 3 in main memory from one process to the other.]
  • If the source node does not need to retain a
    copy, do an MP_FreeSend().
  • The kernel can re-assign the physical memory
    pages to the destination node.
  • The destination node then does an
    MP_MallocRecv().
  • Only the partially filled pages would need to be
    copied.
  • In Fortran, the source and dest buffers would
    need to be similarly aligned.
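
How much copying would remain can be estimated from the buffer's address and length. The sketch below assumes 4096-byte pages and invented function names; it only does the arithmetic and does not touch any real page tables.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def split_buffer(start, length, page_size=PAGE_SIZE):
    """Return (whole_pages, bytes_to_copy) for the buffer [start, start+length).

    Whole pages could be re-assigned by the kernel; the partially filled
    pages at either end would still have to be copied.
    """
    end = start + length
    first_full = -(-start // page_size) * page_size  # round start up
    last_full = (end // page_size) * page_size       # round end down
    if last_full <= first_full:
        return 0, length
    whole_pages = (last_full - first_full) // page_size
    return whole_pages, length - whole_pages * page_size


def similarly_aligned(src_start, dst_start, page_size=PAGE_SIZE):
    """True if both buffers begin at the same offset within a page."""
    return src_start % page_size == dst_start % page_size


if __name__ == "__main__":
    print(split_buffer(0, 8192))    # page-aligned buffer
    print(split_buffer(100, 8192))  # unaligned buffer
```

A page-aligned 8192-byte buffer needs no copying at all, while the same buffer starting 100 bytes into a page leaves only one whole page to re-assign and 4096 bytes to copy.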

34
  • Most applications do not take advantage of the
    network topology
  • → There are many different topologies, which
    can even vary at run-time
  • → no portable method for mapping to the
    topology
  • → loss of performance and scalability
  • NodeMap will automatically determine the
    network topology at run-time and pass the
    information to the application or message-passing
    library.

35
How NodeMap works
  • gethostname() → SMP processes
  • Latency and bandwidth → individual connections
  • Saturated network performance → regular meshes
  • Global shifts → identify regular mesh
    structures
  • Vendor functions when available
  • Static configuration files as a last resort

How NodeMap will be used
  • MPI_Cart_create( reorder = 1 ) → use NodeMap to
    provide the best mapping
  • MPI_Init() can run NodeMap and optimize global
    communications
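
The first step, using host names to find processes that share an SMP node, can be sketched as follows. This is an illustration rather than NodeMap's code: the function names are invented, and the list of host names is passed in directly, whereas in a real run each rank would obtain its own name (for example with socket.gethostname()) and the names would be gathered across all ranks.

```python
import socket
from collections import defaultdict


def smp_groups(hostnames):
    """Map each host name to the list of ranks running on that host.

    hostnames[rank] is the name reported by that rank.
    """
    groups = defaultdict(list)
    for rank, host in enumerate(hostnames):
        groups[host].append(rank)
    return dict(groups)


def same_node(hostnames, rank_a, rank_b):
    """True if two ranks can talk through shared memory on one SMP node."""
    return hostnames[rank_a] == hostnames[rank_b]


if __name__ == "__main__":
    print("This process runs on", socket.gethostname())
    print(smp_groups(["n0", "n0", "n1", "n1"]))
```

The later steps (latency, bandwidth, and saturation tests) need real measurements between nodes and are not modeled here.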

36
A parallel integral transport equation based
radiography simulation code
Feyzi Inanc, Bogdan Vasiliu, Dave Turner
Nuclear Science and Engineering 137, 173-182
(2001).
37
Performance on ALICE, normalized to 4 nodes
38
Summary
  • Provided an overview of the current state of
    parallel computing.
  • Measuring and tuning the performance is
    necessary, and easy.
  • Much research is being done to improve
    performance.
  • Channel bonding can double communication
    performance for small clusters at a minimal cost.
  • Parallel computing does take significant effort,
    but it opens up new areas of science.

39
Contact information
  • Dave Turner - turner_at_ameslab.gov
  • http://www.scl.ameslab.gov/Projects/MP_Lite/
  • http://www.scl.ameslab.gov/Projects/NetPIPE/

40
CPCM assisted clusters at Iowa State University
  • 9-node PC cluster for Math
  • 16 PC Octopus cluster for Biology/Bio-informatics
  • pre-built 22-processor Atipa cluster for
    Astronomy
  • 24-node Alpha cluster with GigE for Physics
  • 24-node PC cluster for Materials
  • 24-node Athlon cluster with GigE for Physics
  • 22-processor Athlon cluster with GigE for
    Magnetics

41
IBM RS/6000 Workstation Cluster
  • The cluster consists of 22 IBM 43P-260 and 2 IBM
    44P-270 workstations.
  • Each 43P node consists of
  • Dual 200MHz Power3 processors
  • (800 MFLOP peak)
  • 2.5 GB of RAM
  • 18 GB striped disk storage
  • Fast Ethernet
  • Gigabit Ethernet supporting Jumbo Frames
  • Each 44P node consists of
  • Quad 375MHz Power3 processors
  • (1500 MFLOP peak)
  • 16 GB of RAM
  • 72 GB of striped disk storage
  • Fast Ethernet
  • Dual Gigabit Ethernet adapters supporting Jumbo
    Frames

The Cluster is currently operated in a mixed
research/production environment with nearly 100%
aggregate utilization, mostly due to production
GAMESS calculations. The Cluster was made
possible by an IBM Shared University Research
(SUR) grant and by the DOE MICS program.
http://www.scl.ameslab.gov/Projects/IBMCluster/
42
IBM pSeries 64-bit workstation cluster
  • The cluster consists of 32 IBM pSeries p640
    workstations. Each p640 node consists of
  • Quad 375MHz Power3 II processors (1500 MFLOP
    peak)
  • 16 GB of RAM
  • 144 or 288 GB striped disk storage
  • Fast Ethernet
  • Dual Gigabit Ethernet adapters supporting Jumbo
    Frames
  • Dual Myrinet 2000 adapters (planned)
  • Aggregate total of
  • 128 CPUs (192 GigaFLOP peak)
  • 1/2 Terabyte of RAM
  • 6 terabytes of disk
  • Nodes will run a mixture of AIX 5.1L (64 bit
    kernel) and 64 bit PPC Linux.
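
The aggregate figures follow from the per-node numbers, as the short check below shows. It assumes that the 1500 MFLOP peak is per processor, which is the reading consistent with the 192 GigaFLOP total; the function name is made up for the example.

```python
def cluster_totals(nodes, cpus_per_node, mflops_per_cpu, ram_gb_per_node):
    """Return (total CPUs, peak GFLOPs, total RAM in GB) for the cluster."""
    cpus = nodes * cpus_per_node
    peak_gflops = cpus * mflops_per_cpu / 1000
    ram_gb = nodes * ram_gb_per_node
    return cpus, peak_gflops, ram_gb


cpus, peak_gflops, ram_gb = cluster_totals(32, 4, 1500, 16)
print(cpus, "CPUs,", peak_gflops, "GFLOP peak,", ram_gb, "GB of RAM")
```

The 512 GB of RAM is the "1/2 Terabyte" on the slide. The 6 terabytes of disk lies between 32 nodes at 144 GB and 32 nodes at 288 GB, so it presumably reflects a mix of the two configurations.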

The Cluster was made possible by an IBM Shared
University Research (SUR) grant, the Air Force
Office of Scientific Research, and by the DOE
MICS program.
http://www.scl.ameslab.gov/Projects/pCluster/
43
Channel-bonding in MP_Lite
[Diagram: on node 0 the application calls MP_Lite in user space, which splits the message into streams a and b; in kernel space each stream has its own large socket buffer, TCP/IP stack, dev_q_xmit call, device queue, and DMA transfer through the device driver to its own GigE card.]
Flow control may stop a given stream at several
places. With MP_Lite channel-bonding, each
stream is independent of the others.
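
The idea of independent streams can be sketched by striping one message across several sockets, each driven by its own pair of threads. This is not MP_Lite's implementation: local socket pairs stand in for the GigE cards, the receiver is assumed to know the size of each piece in advance, and all function names are invented for the example.

```python
import socket
import threading


def split_message(data, channels):
    """Split data into `channels` contiguous pieces of nearly equal size."""
    base, extra = divmod(len(data), channels)
    pieces, start = [], 0
    for i in range(channels):
        size = base + (1 if i < extra else 0)
        pieces.append(data[start:start + size])
        start += size
    return pieces


def recv_exact(sock, nbytes):
    """Receive exactly nbytes from a stream socket."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def bonded_transfer(data, channels=2):
    """Send data over `channels` independent streams and reassemble it."""
    pairs = [socket.socketpair() for _ in range(channels)]
    pieces = split_message(data, channels)
    received = [None] * channels

    def sender(i):
        pairs[i][0].sendall(pieces[i])

    def receiver(i):
        received[i] = recv_exact(pairs[i][1], len(pieces[i]))

    threads = [threading.Thread(target=f, args=(i,))
               for i in range(channels) for f in (sender, receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for a, b in pairs:
        a.close()
        b.close()
    return b"".join(received)


if __name__ == "__main__":
    message = bytes(range(256)) * 1000
    print(bonded_transfer(message, channels=2) == message)
```

Each stream has its own socket and buffers, so a stall on one stream does not block the other, which is the property the slide contrasts with kernel-level bonding.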
44
Linux kernel channel-bonding
[Diagram: on node 0 the application in user space writes to one large socket buffer and a single TCP/IP stack in kernel space; bonding.c then feeds two device queues (via dqx), each with a DMA transfer through the device driver to its own GigE card.]
A full device queue will stop the flow at
bonding.c to both device queues. Flow control on
the destination node may stop the flow out of the
socket buffer. In both of these cases, problems
with one stream can affect both streams.
45
SMP message-passing performance
without cache effects
LAM/MPI has the lowest latency at 1 us, with
MP_Lite at 2 us and MPICH2 and at 3 us. MP_Lite
and LAM/MPI do best in the intermediate
region. The non-temporal memory copy is tuned for
the cache case, kicking in above 128 kB to boost
performance by 50% for large messages. MPI/Pro is
also using an optimized memory copy routine.
With the data starting in main memory.
1.7 GHz dual-Xeon running RedHat 8.0