Intro To Parallel Computing (also known as Part I, or A Lot About Hardware)

Transcript and Presenter's Notes

1
Intro To Parallel Computing(also known as Part
I, or A Lot About Hardware)
  • John Urbanic
  • Pittsburgh Supercomputing Center
  • March 23, 2009

2
Purpose of this talk
  • This is the 50,000 ft. view of the parallel
    computing landscape. We want to orient you a bit
    before parachuting you down into the trenches to
    deal with MPI and OpenMP. Don't worry about the
    details here, or the lack thereof.
  • Later, after we have turned you into actual
    parallel programmers, we will climb back up in
    the Outro To Parallel Computing talk, and
    appreciate the perspective anew.

3
A quick outline
  • Motivation for Petaflops computing
  • Hopelessness of the Serial Way
  • Parallelism's many forms
  • Instruction
  • Multi-core
  • Shared Memory
  • Clusters
  • MPP
  • GPU/FPGA
  • RAID
  • Networks
  • Summary and handoff to reality check

4
Current needs
Current Desktop Domain
Which axis is most important?
5
Most popular culprit: CPU vs. DRAM
This limits the percentage of "speed of light"
(peak) performance that you can expect to get
from any one processor, especially for scientific
codes. This is why every workshop going has an
Optimization talk that gets into these dirty
details.
6
Next culprit: Single-thread performance is
falling off
Source: published SPECint data
7
Moore's Law is not at all dead
Intel process technology capabilities
[Figure: transistor for the 90nm process, 50nm feature shown. Source: Intel]
8
but it is causing issues: Shrinking transistors →
increased frequency
Shrink transistors 30% each generation:
transistor density doubles, oxide thickness
shrinks, frequency increases, and threshold
voltages decrease.
  • But gate thickness is approaching atomic
    dimensions. It just can't keep shrinking so fast,
    therefore:
  • Slowing frequency increases.
  • Less threshold voltage reduction.

9
Moore's Law at Intel, 1970-2005
[Chart: per-generation scaling trends, with factors such as 2.0x, 1.5x, 1.4x, 0.79x, 0.70x]
Power trend not sustainable
10
Not a new problem, just a new scale
[Chart: CPU Power (W) over time]
Cray-2 with cooling tower in foreground, circa
1985
11
Energy Efficient Performance
Power Limitations
Power = C x V^2 x Frequency, and Frequency ∝
Voltage, so Power ∝ Voltage^3 (C = capacitance of
the transistors switching)
[Chart: CPU Power (W)]
Reduce frequency and voltage to get a cubic
reduction in power. Use more transistors for
performance.
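To make the cubic relation concrete, here is a small worked example (my own numbers, anticipating the 15% rule of thumb used two slides later; constants of proportionality are left out):

    P = C V^2 f, \quad f \propto V \;\Rightarrow\; P \propto V^3
    P_{\text{scaled}} \propto (0.85\,V)^2 (0.85\,f) = 0.85^3\, C V^2 f \approx 0.61\, P_{\text{original}}

So dropping voltage and frequency by 15% cuts dynamic power per core by roughly 40%, which is the budget that multi-core designs spend on additional cores.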
12
Multiple cores deliver more performance per watt
[Diagram: one big core with its cache, power 4 and
single-thread performance 2, versus four small
cores C1-C4 sharing a cache, power 4 and aggregate
performance 4]
Rule of thumb: Power ∝ area, single-thread
performance ∝ area^0.5
Many-core is more power efficient.
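A quick worked check of that rule of thumb, using the same numbers as the diagram:

    \text{one big core of area } 4: \quad P \propto 4, \qquad \text{perf} \propto \sqrt{4} = 2
    \text{four small cores of area } 1: \quad P \propto 4 \times 1 = 4, \qquad \text{perf} \propto 4\sqrt{1} = 4

Same power budget, roughly twice the aggregate throughput, provided the workload actually parallelizes across all four cores.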
13
Example: Dual core with voltage scaling
RULE OF THUMB: a 15% reduction in voltage yields...
SINGLE CORE: Area 1, Voltage 1, Freq 1, Power 1, Perf 1
DUAL CORE: Area 2, Voltage 0.85, Freq 0.85, Power 1, Perf 1.8
14
Summary: we need to spread the performance out
over more transistors. We're going parallel.
15
Richardson's Computation, 1917
Prototypical Example: Weather Modeling
Courtesy John Burkhardt, Virginia Tech
16
Many levels and types of parallelism
  • Instruction
  • Multi-Core
  • Multi-socket
  • Clusters
  • SMP
  • MPPs
  • RAID/IO
  • GPU/FPGA

17
Instruction Level: Intel SSE
SSE Operation (SSE/SSE2/SSE3), in each core
[Diagram: two 128-bit source registers, one holding
X4 X3 X2 X1 and the other Y4 Y3 Y2 Y1. On the Core
microarchitecture a single-cycle SSE op produces
X4opY4, X3opY3, X2opY2, X1opY1 in the destination
register in clock cycle 1; the previous
microarchitecture needed two clock cycles (X1opY1,
X2opY2 in cycle 1, then X3opY3, X4opY4 in cycle 2)
for the same work.]
SIMD instructions compute multiple operations per
instruction.
(Graphics not representative of actual die photo
or relative size.)
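As a hedged illustration of what such a SIMD instruction looks like from C (not taken from the talk; a minimal sketch using the SSE intrinsics in xmmintrin.h, compiled with any x86 compiler that supports SSE):

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        /* Two 128-bit registers, each holding four packed single-precision
           floats (the X4..X1 and Y4..Y1 of the diagram above). */
        __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 y = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

        /* One SSE instruction performs all four additions at once. */
        __m128 r = _mm_add_ps(x, y);

        float out[4];
        _mm_storeu_ps(out, r);   /* store back to ordinary memory */
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
        return 0;
    }

In practice the compiler's auto-vectorizer often generates these instructions from ordinary loops, but the intrinsic form makes the 4-wide parallelism explicit.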
18
Multi-Core Some Current Offerings
  • AMD
  • Opteron
  • Athlon 64
  • Turion 64
  • Barcelona
  • ARM
  • MPCore (ARM9 and ARM11)
  • Broadcom
  • SiByte
  • Cradle Technologies
  • DSP processor
  • Cavium Networks
  • Octeon (16 MIPS cores)
  • IBM
  • Cell (with Sony and Toshiba)
  • POWER4,5,6
  • Intel
  • Core
  • Xeon
  • Motorola
  • Freescale dual-core PowerPC
  • Picochip
  • DSP devices (300 16-bit processor MIMD cores on
    one die)
  • Parallax
  • Propeller (eight 32-bit cores)
  • HP
  • PA-RISC
  • Raza Microelectronics
  • XLR (eight MIPS cores)
  • Stream Processors
  • Storm-1 (2 MIPS CPUs and DSP)
  • Sun Microsystems
  • UltraSPARC IV, UltraSPARC IV+,
  • UltraSPARC T1, T2
  • IntellaSys
  • seaForth-24.

19
The exotic near-future
Intel's 80-core Polaris research chip
Soon to be followed by their 24-48 core
production chip, Larrabee (2009; 2010 for 48 cores).
20
Multi-socket Motherboards
  • Dual and Quad socket boards are very common in
    the enterprise and HPC world.
  • Less desirable in consumer world.

21
Clusters
  • Thunderbird (Sandia National Labs)
  • Dell PowerEdge Series Capacity Cluster
  • 4096 dual 3.6 GHz Intel Xeon processors
  • 6 GB DDR-2 RAM per node
  • 4x InfiniBand interconnect
  • System X (Virginia Tech)
  • 1100 Dual 2.3 GHz PowerPC 970FX processors
  • 4 GB ECC DDR400 (PC3200) RAM
  • 80 GB S-ATA hard disk drive
  • One Mellanox Cougar InfiniBand 4x HCA
  • Running Mac OS X

22
Shared-Memory Processing
  • Each processor can access the entire data space
  • Pros
  • Easier to program
  • Amenable to automatic parallelism
  • Can be used to run large memory serial programs
  • Cons
  • Expensive
  • Difficult to implement on the hardware level
  • Processor count limited by contention/coherency
    (currently around 512)
  • Watch out for the NU (non-uniform) part of NUMA

23
Shared-Memory Processing
  • Programming
  • OpenMP, Pthreads, Shmem (a minimal OpenMP sketch
    follows this slide)
  • Examples
  • All multi-socket motherboards
  • SGI Altix
  • Intel Itanium 2 dual core processors linked by
    the NUMAFlex interconnect
  • Up to 512 processors (1024 cores) sharing up to
    128 TB of memory
  • Really want to keep SMP mode of operation to
    lower core counts because of NUMA
  • Usually intended for a hybrid mode of operation
    at larger scales

Columbia (NASA): 20 512-processor Altix SMP
computers, for a combined total of 10,240 processors
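The OpenMP sketch promised above: a hedged, minimal example of shared-memory parallelism (not from the talk), where every thread sees the same arrays and the pragma splits the loop iterations among them. Compile with something like gcc -fopenmp.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;

        /* All threads share a[] and b[]; the loop iterations are divided
           among them automatically. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * i;
            b[i] = 3.0 * i;
        }

        /* A reduction avoids a race on the shared accumulator. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %g (threads available: %d)\n",
               sum, omp_get_max_threads());
        return 0;
    }

Note the "NU part of NUMA" warning from the previous slide: on a large Altix, where those shared arrays physically live still matters for performance.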
24
Distributed Memory Machines
  • Each node in the computer has a locally
    addressable memory space
  • The computers are connected together via some
    high-speed network
  • InfiniBand, Myrinet, Giganet, etc.
  • Pros
  • Really large machines
  • Size limited only by gross physical
    considerations
  • Room size
  • Cable lengths (10s of meters)
  • Power/cooling capacity
  • Money!
  • Cheaper to build and run
  • Cons
  • Harder to program
  • Data Locality
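On distributed-memory machines the standard programming model is message passing, i.e. the MPI you will meet in the hands-on part. A hedged, minimal sketch (not from the talk): each process owns its own memory, and data moves only through explicit sends and receives.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id         */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        /* Pass a token around a ring: each rank receives from its left
           neighbor, adds its own rank, and sends to its right neighbor. */
        if (size == 1) {
            printf("only one process, nothing to pass\n");
        } else if (rank == 0) {
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("token after one lap: %d\n", token);  /* 1+2+...+(size-1) */
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            token += rank;
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Run with something like mpirun -np 4 ./ring; rank 0 should print 6.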

25
MPPs (Massively Parallel Processors)
Distributed memory at largest scale. Often
shared memory at lower hierarchies.
  • IBM BlueGene/L (LLNL)
  • 131,072 700 MHz processors
  • 256 MB of RAM per processor
  • Balanced compute speed with interconnect
  • Red Storm (Sandia National Labs)
  • 12,960 Dual Core 2.4 GHz Opterons
  • 4 GB of RAM per processor
  • Proprietary SeaStar interconnect

26
Hybrid Programming!
  • Need to use distributed memory design to reach
    large scales.
  • Leverage commodity processor shared memory at the
    board level.
  • It is an efficient way to get a pile of FLOPS.
  • But, a little more interesting from a programming
    perspective (hence Part II; a minimal hybrid
    sketch follows).
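The hybrid sketch mentioned above: a hedged, minimal combination (not from the talk) in which MPI spans the distributed-memory nodes and OpenMP uses the shared-memory cores within each node.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        /* Ask MPI to tolerate threaded use; 'provided' reports what we got. */
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One MPI process per node (distributed memory) ...                */
        double local = 0.0;

        /* ... and OpenMP threads across that node's cores (shared memory). */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (1.0 + i + rank);

        /* Combine the per-node partial results with one MPI collective.    */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %g (threads per rank: %d)\n",
                   total, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }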

27
Common MPPs and what makes them different (to
you)
  • While the CM-2 was SIMD (one instruction unit for
    multiple processors), all significant machines
    since then are MIMD (multiple instructions for
    multiple processors) and based on commodity
    processors
  • T3D/E Alpha
  • XT3/4 Opteron
  • Roadrunner Opteron/Cell
  • Altix Itanium 2
  • Red Storm Opteron
  • BlueGene L/P/Q Power 5,6,7
  • Clusters Take your pick
  • These all use similar compilers and can probably
    at least run any generic code you throw at them.
    Therefore, the single most distinctive
    characteristic of any of these machines is the
    network.
  • You shouldn't have to care about the network; it
    should be largely transparent to you as a
    programmer. But more than a few projects have
    felt the pain of treating all networks as usable
    enough.

28
Cores, Nodes, Processors, PEs?
  • Node is commonly used to refer to an actual
    physical unit, most commonly a circuit board or
    blade. These often have multiple processors.
  • Processors usually have more than one core.
  • The most unambiguous way to refer to the smallest
    useful computing device is as a Processing
    Element, or PE. This is usually the same as a
    single core.
  • I will try to use the term PE consistently here,
    but may slip up myself. Get used to it, as you
    will quite often hear all of the above terms used
    interchangeably where they shouldn't be.

29
Networks
  • 3 characteristics sum up the network
  • Latency
  • The time to send a 0 byte packet of data on the
    network
  • Bandwidth
  • The rate at which a very large packet of
    information can be sent
  • Topology
  • The configuration of the network that determines
    how processing units are directly connected.

30
Ethernet with Workstations
31
Complete Connectivity
32
Crossbar
33
Binary Tree
34
CM-5 Fat Tree
35
TCS Fat Tree
36
INTEL Paragon (2-D Mesh)
37
3-D Torus (XT3, XT4)
  • XT3 has Global Addressing hardware, and this
    helps to simulate shared memory.
  • Torus means that the ends are connected: a node
    on one edge (A in the original figure) is directly
    linked to the corresponding node on the opposite
    edge (B), so the cube has no real boundary. (A
    small sketch of the wraparound follows.)
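The promised sketch: a hedged example (not from the talk) of how torus wraparound appears to an MPI program through a periodic Cartesian communicator; MPI_Cart_shift returns the neighbor ranks, with the ends wrapping around.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Arrange the processes in a 3-D grid; periods[] = 1 makes every
           dimension wrap around, i.e. a torus rather than a plain mesh. */
        int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
        MPI_Dims_create(size, 3, dims);

        MPI_Comm torus;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

        /* Neighbors one hop away in the x direction; on the boundary the
           "missing" neighbor is the node on the far side of the machine. */
        int left, right;
        MPI_Cart_shift(torus, 0, 1, &left, &right);

        if (rank == 0)
            printf("grid %dx%dx%d; rank 0 x-neighbors: %d and %d\n",
                   dims[0], dims[1], dims[2], left, right);

        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }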

38
Hypercube (CM-2 version)
Not remotely drawn properly.
39
Other Network Factors
  • Commonly overlooked, important things
  • How much outstanding data can be on the network
    at a given time.
  • Highly scalable codes use asynchronous
    communication schemes, which require a large
    amount of data to be in flight on the network at
    a given time (see the sketch after this list).
  • Balance
  • If either the network or the compute nodes
    perform way out of proportion to the other, it
    makes for an unbalanced situation.
  • Hardware level support
  • Some routers can support things like network
    memory and hardware-level operations, which can
    greatly increase performance.
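The sketch referred to above: a hedged example (not from the talk) of the asynchronous pattern, using MPI's nonblocking calls so that several messages can be in flight on the network while the sender keeps computing.

    #include <stdio.h>
    #include <mpi.h>

    #define NMSG 4          /* messages kept in flight simultaneously */
    #define LEN  1024

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            double buf[NMSG][LEN];
            MPI_Request req[NMSG];

            if (rank == 0) {
                /* Post all sends at once; none of them blocks the caller. */
                for (int m = 0; m < NMSG; m++) {
                    for (int i = 0; i < LEN; i++) buf[m][i] = m + i;
                    MPI_Isend(buf[m], LEN, MPI_DOUBLE, 1, m, MPI_COMM_WORLD, &req[m]);
                }
                /* ... useful computation could overlap the transfers here ... */
                MPI_Waitall(NMSG, req, MPI_STATUSES_IGNORE);
            } else if (rank == 1) {
                for (int m = 0; m < NMSG; m++)
                    MPI_Irecv(buf[m], LEN, MPI_DOUBLE, 0, m, MPI_COMM_WORLD, &req[m]);
                MPI_Waitall(NMSG, req, MPI_STATUSES_IGNORE);
                printf("received %d messages of %d doubles\n", NMSG, LEN);
            }
        }

        MPI_Finalize();
        return 0;
    }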

40
Networks
  • InfiniBand, Myrinet, GigE, ...
  • Networks designed more for a small number of
    processors
  • SeaStar (Cray), Federation (IBM), Constellation
    (Sun), NUMALink (SGI)
  • Networks designed to scale to tens of thousands
    of processors.

41
Alternatives: General-Purpose GPU Computing
GPU performance has doubled about every 6 months
since the 1990s, much better than CPU performance
gains. This is made possible by the explicit
parallelism exposed in the graphics hardware and
algorithms (just add more pipelines). Very SIMD
oriented. Limited to algorithms that can be done
in a very, very data-parallel fashion. Also not
really designed to move data on and off the board
very quickly.
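To make "very, very data parallel" concrete, here is a hedged sketch in plain C (not GPU code, and not from the talk) of the kind of loop that maps well onto GPU hardware: every element is computed independently, so the work can be spread over thousands of lightweight threads.

    #include <stdio.h>

    #define N (1 << 20)
    static float x[N], y[N];

    /* SAXPY-style kernel: y[i] = a*x[i] + y[i]. Each iteration touches only
       its own element and depends on no other iteration, which is exactly
       the pattern a GPU's many pipelines can execute in parallel. */
    static void saxpy(int n, float a, const float *xs, float *ys)
    {
        for (int i = 0; i < n; i++)
            ys[i] = a * xs[i] + ys[i];
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy(N, 3.0f, x, y);        /* on a GPU, one thread would handle each i */
        printf("y[0] = %g\n", y[0]); /* 5 */
        return 0;
    }

The catch the slide points to is the data movement: copying x and y onto the GPU board and the result back can easily cost more than the computation saves.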
42
Parallel IO (RAID)
  • There are increasing numbers of applications for
    which many TB of data need to be written.
  • Checkpointing is also becoming very important due
    to MTBF issues (a whole other talk).
  • Build a large, fast, reliable filesystem from a
    collection of smaller drives.
  • Supposed to be transparent to the programmer.

43
Capacity vs. Capability
  • Capacity computing
  • Creating large supercomputers to facilitate large
    throughput of small parallel jobs
  • Cheaper, slower interconnects
  • Clusters running Linux, OS X, or Windows
  • Easy to build
  • Capability computing
  • Creating large supercomputers to enable
    computation at the largest scales
  • Running the entire machine to perform one task
  • Good fast interconnect and balanced performance
    important
  • Usually specialized hardware and operating systems

44
Summary (and good luck)
  • You now understand reasons why these machines
    look the way they do
  • Serial boxes can't get to PFLOPS
  • Need to spread out the processors
  • Need to communicate between these spread out
    processors
  • What you might not understand is how to use
    (program) these piles of processors
  • Many approaches
  • Many issues
  • You are going to learn a few of these approaches
  • and we will meet back here for Part II (Outro To
    Parallel Computing).