Intro To Parallel Computing (also known as Part I, or A Lot About Hardware)

Transcript and Presenter's Notes

1
Intro To Parallel Computing(also known as Part
I, or A Lot About Hardware)
  • John Urbanic
  • Pittsburgh Supercomputing Center
  • March 23, 2009

2
Purpose of this talk
  • This is the 50,000 ft. view of the parallel
    computing landscape. We want to orient you a bit
    before parachuting you down into the trenches to
    deal with MPI and OpenMP. Don't worry about the
    details here, or the lack thereof.
  • Later, after we have turned you into actual
    parallel programmers, we will climb back up in
    the Outro To Parallel Computing talk, and
    appreciate the perspective anew.

3
A quick outline
  • Motivation for Petaflops computing
  • Hopelessness of the Serial Way
  • Parallelism's many forms
  • Instruction
  • Multi-core
  • Shared Memory
  • Clusters
  • MPP
  • GPU/FPGA
  • RAID
  • Networks
  • Summary and handoff to reality check

4
Current needs
Current Desktop Domain
Which axis is most important?
5
Most popular culprit: CPU vs. DRAM
This limits the percentage of "speed of light"
(peak) performance that you can expect to get
from any one processor, especially for scientific
codes. This is why every workshop going has an
Optimization talk that gets into these dirty
details.
6
Next culprit: Single-thread performance is
falling off
Source: published SPECint data
7
Moore's Law is not at all dead
Intel process technology capabilities
[Figure: transistor for the 90nm process, 50nm feature shown. Source: Intel]
8
but it is causing issues: Shrinking transistors →
increased frequency
Shrink transistors 30% each generation:
transistor density doubles, oxide thickness
shrinks, frequency increases, and threshold
voltages decrease.
  • But gate thickness is approaching atomic
    dimensions. It just can't keep shrinking so fast,
    therefore:
  • Slowing frequency increases.
  • Less threshold voltage reduction.

9
Moore's Law at Intel, 1970-2005
[Chart: per-generation scaling trends, with factors such as 2.0x, 1.5x, 1.4x, 0.79x, 0.70x]
Power trend not sustainable
10
Not a new problem, just a new scale
[Chart: CPU Power (W) over time]
Cray-2 with cooling tower in foreground, circa
1985
11
Energy Efficient Performance
Power Limitations
Power = C x V^2 x Frequency, and Frequency ∝
Voltage, so Power ∝ Voltage^3 (C = capacitance of
the transistors switching)
[Chart: CPU Power (W)]
Reduce frequency and voltage to get a cubic
reduction in power. Use more transistors for
performance.
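To make the cubic relation concrete, here is a small worked example (my own numbers, anticipating the 15% rule of thumb used two slides later; constants of proportionality are left out):

    P = C V^2 f, \quad f \propto V \;\Rightarrow\; P \propto V^3
    P_{\text{scaled}} \propto (0.85\,V)^2 (0.85\,f) = 0.85^3\, C V^2 f \approx 0.61\, P_{\text{original}}

So dropping voltage and frequency by 15% cuts dynamic power per core by roughly 40%, which is the budget that multi-core designs spend on additional cores.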
12
Multiple cores deliver more performance per watt
[Diagram: one big core with its cache, power 4 and
single-thread performance 2, versus four small
cores C1-C4 sharing a cache, power 4 and aggregate
performance 4]
Rule of thumb: Power ∝ area, single-thread
performance ∝ area^0.5
Many-core is more power efficient.
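A quick worked check of that rule of thumb, using the same numbers as the diagram:

    \text{one big core of area } 4: \quad P \propto 4, \qquad \text{perf} \propto \sqrt{4} = 2
    \text{four small cores of area } 1: \quad P \propto 4 \times 1 = 4, \qquad \text{perf} \propto 4\sqrt{1} = 4

Same power budget, roughly twice the aggregate throughput, provided the workload actually parallelizes across all four cores.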
13
Example: Dual core with voltage scaling
RULE OF THUMB: a 15% reduction in voltage yields...
SINGLE CORE: Area 1, Voltage 1, Freq 1, Power 1, Perf 1
DUAL CORE: Area 2, Voltage 0.85, Freq 0.85, Power 1, Perf 1.8
14
Summary: we need to spread the performance out
over more transistors. We're going parallel.
15
Richardson's Computation, 1917
Prototypical Example: Weather Modeling
Courtesy John Burkhardt, Virginia Tech
16
Many levels and types of parallelism
  • Instruction
  • Multi-Core
  • Multi-socket
  • Clusters
  • SMP
  • MPPs
  • RAID/IO
  • GPU/FPGA

17
Instruction Level: Intel SSE
SSE Operation (SSE/SSE2/SSE3), in each core
[Diagram: two 128-bit source registers, one holding
X4 X3 X2 X1 and the other Y4 Y3 Y2 Y1. On the Core
microarchitecture a single-cycle SSE op produces
X4opY4, X3opY3, X2opY2, X1opY1 in the destination
register in clock cycle 1; the previous
microarchitecture needed two clock cycles (X1opY1,
X2opY2 in cycle 1, then X3opY3, X4opY4 in cycle 2)
for the same work.]
SIMD instructions compute multiple operations per
instruction.
(Graphics not representative of actual die photo
or relative size.)
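As a hedged illustration of what such a SIMD instruction looks like from C (not taken from the talk; a minimal sketch using the SSE intrinsics in xmmintrin.h, compiled with any x86 compiler that supports SSE):

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        /* Two 128-bit registers, each holding four packed single-precision
           floats (the X4..X1 and Y4..Y1 of the diagram above). */
        __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 y = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

        /* One SSE instruction performs all four additions at once. */
        __m128 r = _mm_add_ps(x, y);

        float out[4];
        _mm_storeu_ps(out, r);   /* store back to ordinary memory */
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
        return 0;
    }

In practice the compiler's auto-vectorizer often generates these instructions from ordinary loops, but the intrinsic form makes the 4-wide parallelism explicit.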
18
Multi-Core Some Current Offerings
  • AMD
  • Opteron
  • Athlon 64
  • Turion 64
  • Barcelona
  • ARM
  • MPCore (ARM9 and ARM11)
  • Broadcom
  • SiByte
  • Cradle Technologies
  • DSP processor
  • Cavium Networks
  • Octeon (16 MIPS cores)
  • IBM
  • Cell (with Sony and Toshiba)
  • POWER4,5,6
  • Intel
  • Core
  • Xeon
  • Motorola
  • Freescale dual-core PowerPC
  • Picochip
  • DSP devices (300 16-bit processor MIMD cores on
    one die)
  • Parallax
  • Propeller (eight 32-bit cores)
  • HP
  • PA-RISC
  • Raza Microelectronics
  • XLR (eight MIPS cores)
  • Stream Processors
  • Storm-1 (2 MIPS CPUs and DSP)
  • Sun Microsystems
  • UltraSPARC IV, UltraSPARC IV+,
  • UltraSPARC T1, T2
  • IntellaSys
  • seaForth-24.

19
The exotic near-future
Intel's 80-core Polaris research chip
Soon to be followed by their 24-48 core
production chip, Larrabee (2009; 2010 for 48 cores).
20
Multi-socket Motherboards
  • Dual and Quad socket boards are very common in
    the enterprise and HPC world.
  • Less desirable in consumer world.

21
Clusters
  • Thunderbird (Sandia National Labs)
  • Dell PowerEdge Series Capacity Cluster
  • 4096 dual 3.6 GHz Intel Xeon processors
  • 6 GB DDR-2 RAM per node
  • 4x InfiniBand interconnect
  • System X (Virginia Tech)
  • 1100 Dual 2.3 GHz PowerPC 970FX processors
  • 4 GB ECC DDR400 (PC3200) RAM
  • 80 GB S-ATA hard disk drive
  • One Mellanox Cougar InfiniBand 4x HCA
  • Running Mac OS X

22
Shared-Memory Processing
  • Each processor can access the entire data space
  • Pros
  • Easier to program
  • Amenable to automatic parallelism
  • Can be used to run large memory serial programs
  • Cons
  • Expensive
  • Difficult to implement on the hardware level
  • Processor count limited by contention/coherency
    (currently around 512)
  • Watch out for the NU (non-uniform) part of NUMA

23
Shared-Memory Processing
  • Programming
  • OpenMP, Pthreads, Shmem (a minimal OpenMP sketch
    follows this slide)
  • Examples
  • All multi-socket motherboards
  • SGI Altix
  • Intel Itanium 2 dual core processors linked by
    the NUMAFlex interconnect
  • Up to 512 processors (1024 cores) sharing up to
    128 TB of memory
  • Really want to keep SMP mode of operation to
    lower core counts because of NUMA
  • Usually intended for a hybrid mode of operation
    at larger scales

Columbia (NASA): 20 512-processor Altix SMP
computers, for a combined total of 10,240 processors
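The OpenMP sketch promised above: a hedged, minimal example of shared-memory parallelism (not from the talk), where every thread sees the same arrays and the pragma splits the loop iterations among them. Compile with something like gcc -fopenmp.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;

        /* All threads share a[] and b[]; the loop iterations are divided
           among them automatically. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * i;
            b[i] = 3.0 * i;
        }

        /* A reduction avoids a race on the shared accumulator. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %g (threads available: %d)\n",
               sum, omp_get_max_threads());
        return 0;
    }

Note the "NU part of NUMA" warning from the previous slide: on a large Altix, where those shared arrays physically live still matters for performance.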
24
Distributed Memory Machines
  • Each node in the computer has a locally
    addressable memory space
  • The computers are connected together via some
    high-speed network
  • InfiniBand, Myrinet, Giganet, etc.
  • Pros
  • Really large machines
  • Size limited only by gross physical
    considerations
  • Room size
  • Cable lengths (10s of meters)
  • Power/cooling capacity
  • Money!
  • Cheaper to build and run
  • Cons
  • Harder to program
  • Data Locality
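On distributed-memory machines the standard programming model is message passing, i.e. the MPI you will meet in the hands-on part. A hedged, minimal sketch (not from the talk): each process owns its own memory, and data moves only through explicit sends and receives.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id         */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        /* Pass a token around a ring: each rank receives from its left
           neighbor, adds its own rank, and sends to its right neighbor. */
        if (size == 1) {
            printf("only one process, nothing to pass\n");
        } else if (rank == 0) {
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("token after one lap: %d\n", token);  /* 1+2+...+(size-1) */
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            token += rank;
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Run with something like mpirun -np 4 ./ring; rank 0 should print 6.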

25
MPPs (Massively Parallel Processors)
Distributed memory at largest scale. Often
shared memory at lower hierarchies.
  • IBM BlueGene/L (LLNL)
  • 131,072 700 MHz processors
  • 256 MB of RAM per processor
  • Balanced compute speed with interconnect
  • Red Storm (Sandia National Labs)
  • 12,960 Dual Core 2.4 GHz Opterons
  • 4 GB of RAM per processor
  • Proprietary SeaStar interconnect

26
Hybrid Programming!
  • Need to use distributed memory design to reach
    large scales.
  • Leverage commodity processor shared memory at the
    board level.
  • It is an efficient way to get a pile of FLOPS.
  • But, a little more interesting from a programming
    perspective (hence Part II; a minimal hybrid
    sketch follows).
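The hybrid sketch mentioned above: a hedged, minimal combination (not from the talk) in which MPI spans the distributed-memory nodes and OpenMP uses the shared-memory cores within each node.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        /* Ask MPI to tolerate threaded use; 'provided' reports what we got. */
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One MPI process per node (distributed memory) ...                */
        double local = 0.0;

        /* ... and OpenMP threads across that node's cores (shared memory). */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (1.0 + i + rank);

        /* Combine the per-node partial results with one MPI collective.    */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %g (threads per rank: %d)\n",
                   total, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }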

27
Common MPPs and what makes them different (to
you)
  • While the CM-2 was SIMD (one instruction unit for
    multiple processors), all significant machines
    since then are MIMD (multiple instructions for
    multiple processors) and based on commodity
    processors
  • T3D/E Alpha
  • XT3/4 Opteron
  • Roadrunner Opteron/Cell
  • Altix Itanium 2
  • Red Storm Opteron
  • BlueGene L/P/Q Power 5,6,7
  • Clusters Take your pick
  • These all use similar compilers and can probably
    at least run any generic code you throw at them.
    Therefore, the single most distinctive
    characteristic of any of these machines is the
    network.
  • You shouldn't have to care about the network; it
    should be largely transparent to you as a
    programmer. But more than a few projects have
    felt the pain of treating all networks as usable
    enough.

28
Cores, Nodes, Processors, PEs?
  • Node is commonly used to refer to an actual
    physical unit, most commonly a circuit board or
    blade. These often have multiple processors.
  • Processors usually have more than one core.
  • The most unambiguous way to refer to the smallest
    useful computing device is as a Processing
    Element, or PE. This is usually the same as a
    single core.
  • I will try to use the term PE consistently here,
    but may slip up myself. Get used to it, as you
    will quite often hear all of the above terms used
    interchangeably where they shouldn't be.

29
Networks
  • 3 characteristics sum up the network
  • Latency
  • The time to send a 0 byte packet of data on the
    network
  • Bandwidth
  • The rate at which a very large packet of
    information can be sent
  • Topology
  • The configuration of the network that determines
    how processing units are directly connected.

30
Ethernet with Workstations
31
Complete Connectivity
32
Crossbar
33
Binary Tree
34
CM-5 Fat Tree
35
TCS Fat Tree
36
INTEL Paragon (2-D Mesh)
37
3-D Torus (XT3, XT4)
  • XT3 has Global Addressing hardware, and this
    helps to simulate shared memory.
  • Torus means that the ends are connected: a node
    on one edge (A in the original figure) is directly
    linked to the corresponding node on the opposite
    edge (B), so the cube has no real boundary. (A
    small sketch of the wraparound follows.)
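The promised sketch: a hedged example (not from the talk) of how torus wraparound appears to an MPI program through a periodic Cartesian communicator; MPI_Cart_shift returns the neighbor ranks, with the ends wrapping around.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Arrange the processes in a 3-D grid; periods[] = 1 makes every
           dimension wrap around, i.e. a torus rather than a plain mesh. */
        int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
        MPI_Dims_create(size, 3, dims);

        MPI_Comm torus;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

        /* Neighbors one hop away in the x direction; on the boundary the
           "missing" neighbor is the node on the far side of the machine. */
        int left, right;
        MPI_Cart_shift(torus, 0, 1, &left, &right);

        if (rank == 0)
            printf("grid %dx%dx%d; rank 0 x-neighbors: %d and %d\n",
                   dims[0], dims[1], dims[2], left, right);

        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }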

38
Hypercube (CM-2 version)
Not remotely drawn properly.
39
Other Network Factors
  • Commonly overlooked, important things
  • How much outstanding data can be on the network
    at a given time.
  • Highly scalable codes use asynchronous
    communication schemes, which require a large
    amount of data to be in flight on the network at
    a given time (see the sketch after this list).
  • Balance
  • If either the network or the compute nodes
    perform way out of proportion to the other, it
    makes for an unbalanced situation.
  • Hardware level support
  • Some routers can support things like network
    memory and hardware-level operations, which can
    greatly increase performance.
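The sketch referred to above: a hedged example (not from the talk) of the asynchronous pattern, using MPI's nonblocking calls so that several messages can be in flight on the network while the sender keeps computing.

    #include <stdio.h>
    #include <mpi.h>

    #define NMSG 4          /* messages kept in flight simultaneously */
    #define LEN  1024

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            double buf[NMSG][LEN];
            MPI_Request req[NMSG];

            if (rank == 0) {
                /* Post all sends at once; none of them blocks the caller. */
                for (int m = 0; m < NMSG; m++) {
                    for (int i = 0; i < LEN; i++) buf[m][i] = m + i;
                    MPI_Isend(buf[m], LEN, MPI_DOUBLE, 1, m, MPI_COMM_WORLD, &req[m]);
                }
                /* ... useful computation could overlap the transfers here ... */
                MPI_Waitall(NMSG, req, MPI_STATUSES_IGNORE);
            } else if (rank == 1) {
                for (int m = 0; m < NMSG; m++)
                    MPI_Irecv(buf[m], LEN, MPI_DOUBLE, 0, m, MPI_COMM_WORLD, &req[m]);
                MPI_Waitall(NMSG, req, MPI_STATUSES_IGNORE);
                printf("received %d messages of %d doubles\n", NMSG, LEN);
            }
        }

        MPI_Finalize();
        return 0;
    }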

40
Networks
  • InfiniBand, Myrinet, GigE, ...
  • Networks designed more for a small number of
    processors
  • SeaStar (Cray), Federation (IBM), Constellation
    (Sun), NUMALink (SGI)
  • Networks designed to scale to tens of thousands
    of processors.

41
Alternatives: General-Purpose GPU Computing
GPU performance has doubled about every 6 months
since the 1990s, much better than CPU performance
gains. This is made possible by the explicit
parallelism exposed in the graphics hardware and
algorithms (just add more pipelines). Very SIMD
oriented. Limited to algorithms that can be done
in a very, very data-parallel fashion. Also not
really designed to move data on and off the board
very quickly.
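To make "very, very data parallel" concrete, here is a hedged sketch in plain C (not GPU code, and not from the talk) of the kind of loop that maps well onto GPU hardware: every element is computed independently, so the work can be spread over thousands of lightweight threads.

    #include <stdio.h>

    #define N (1 << 20)
    static float x[N], y[N];

    /* SAXPY-style kernel: y[i] = a*x[i] + y[i]. Each iteration touches only
       its own element and depends on no other iteration, which is exactly
       the pattern a GPU's many pipelines can execute in parallel. */
    static void saxpy(int n, float a, const float *xs, float *ys)
    {
        for (int i = 0; i < n; i++)
            ys[i] = a * xs[i] + ys[i];
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy(N, 3.0f, x, y);        /* on a GPU, one thread would handle each i */
        printf("y[0] = %g\n", y[0]); /* 5 */
        return 0;
    }

The catch the slide points to is the data movement: copying x and y onto the GPU board and the result back can easily cost more than the computation saves.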
42
Parallel IO (RAID)
  • There are increasing numbers of applications for
    which many TB of data need to be written.
  • Checkpointing is also becoming very important due
    to MTBF issues (a whole other talk).
  • Build a large, fast, reliable filesystem from a
    collection of smaller drives.
  • Supposed to be transparent to the programmer.

43
Capacity vs. Capability
  • Capacity computing
  • Creating large supercomputers to facilitate large
    throughput of small parallel jobs
  • Cheaper, slower interconnects
  • Clusters running Linux, OS X, or Windows
  • Easy to build
  • Capability computing
  • Creating large supercomputers to enable
    computation at the largest scales
  • Running the entire machine to perform one task
  • Good fast interconnect and balanced performance
    important
  • Usually specialized hardware and operating systems

44
Summary (and good luck)
  • You now understand reasons why these machines
    look the way they do
  • Serial boxes can't get to PFLOPS
  • Need to spread out the processors
  • Need to communicate between these spread out
    processors
  • What you might not understand is how to use
    (program) these piles of processors
  • Many approaches
  • Many issues
  • You are going to learn a few of these approaches
  • and we will meet back here for Part II (Outro To
    Parallel Computing).