Title: Parallel Architectures
1. Parallel Architectures
- State of the Art
- Shared Memory Computers
2. Agenda
- System level
  - Shared memory
  - Distributed memory
- Microprocessor level
  - Multithreading
  - Multicore
3. Classification of Architectures
- Flynn's taxonomy: not very accurate, but widely used
  - SISD (Single Instruction, Single Data stream)
    - sequential von Neumann architecture
  - MISD (Multiple Instruction, Single Data stream)
    - do any exist?
  - SIMD (Single Instruction, Multiple Data stream)
    - lock-step machines
  - MIMD (Multiple Instruction, Multiple Data stream)
    - modern data- and instruction-parallel systems
4. SIMD Architectures
- Lock-step parallel computers, developed as an alternative to vector computers
- Front end (control unit) and back end (SIMD system)
  - sequential instructions performed by the front end
  - parallel instructions broadcast to the back end at each instruction cycle
- Massive parallelism: maybe thousands of processors
- Examples (see the SIMD sketch below)
  - Historical: Connection Machines, MasPar
  - Today: SIMD instructions (MMX, SSE), graphics processors, the Cell's SPEs
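The SSE instructions mentioned above bring the same lock-step idea to commodity CPUs. A minimal sketch in C, assuming an x86 compiler that provides the SSE intrinsics in <xmmintrin.h>: a single _mm_add_ps instruction adds four floats at once, which is the "single instruction, multiple data" pattern.

```c
/* SIMD sketch: one instruction (_mm_add_ps) operates on four floats
 * in lock step. Assumes an x86 CPU and compiler with SSE support. */
#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load four floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* four additions in one instruction */
    _mm_storeu_ps(c, vc);

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```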
5. MIMD Architectures
- Computer with multiple independently operating CPUs
- CPUs are connected with each other and with memory
  - many types of topologies: linear (bus), ring, tree, mesh, 3D torus
  - CPUs have independent channels to memory
- Two major kinds of MIMD machines
  - shared memory
  - distributed memory
  - now also distributed shared memory systems
- Most current HPC systems are MIMD machines
- Examples: SGI PowerChallenge, IBM SP2, HP Exemplar
6. Examples
[Figure: examples of distributed memory systems (nodes connected by a LAN) and shared memory systems]
7. Shared Memory Architectures
- Built during the 1980s, now popular again
- Multiple processors access the same main memory
  - connected to main memory via a bus
  - each processor has a local cache
- Cache coherence: when a memory location is updated, any copies in the local caches of other processors must be invalidated
- Advantage: easy to program, easy processor communication (see the OpenMP sketch below)
- Disadvantage
  - the bus is a bottleneck, so only small systems are possible
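A minimal sketch of how directly a shared memory machine can be programmed, assuming a C compiler with OpenMP support (the usual programming model on such systems): all threads read and write the same arrays, so parallelizing a loop needs only one directive and no explicit communication.

```c
/* Shared memory programming sketch: OpenMP threads all see the same
 * arrays, so one directive is enough to split the loop across them.
 * Compile with e.g. gcc -fopenmp (assumed). */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];

    /* Each thread works on a slice of the same shared arrays. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;
        b[i] = a[i] + 1.0;   /* no message passing needed */
    }

    printf("b[N-1] = %f (max threads: %d)\n", b[N - 1], omp_get_max_threads());
    return 0;
}
```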
8. Shared Memory
- All processes/threads access the same shared memory data space
[Figure: processes proc1 ... procN all attached to one shared memory space]
9. More Realistic View of Shared Memory Architecture
[Figure: cpu1 ... cpuN, each with its own cache (cache1 ... cacheN), connected by a bus to the shared memory]
10. Distributed Memory Architecture
- Built during the late 1980s and since then
- Multiple processors, each with its own local memory
  - no shared memory
- Processor/memory nodes connected via a network
- If data in one processor's memory is needed by another processor, it must be transmitted as messages over the network (see the message passing sketch below)
  - data exchange is slow in comparison to clock speed
- Hard to program
  - message passing is the "assembly language" of parallel programming
- Advantage: distributed memory parallel architectures scale easily
- Examples: Intel iPSC and Paragon, IBM SP2, Cray T3D and T3E
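A minimal sketch of the message passing model in C, assuming an MPI implementation is available (MPI is not named on the slide, but it is the standard library for this model): since no memory is shared, process 0 must explicitly send a value and process 1 must explicitly receive it.

```c
/* Distributed memory sketch: two processes exchange data only through
 * explicit messages over the network. Assumes an MPI library;
 * build/run with mpicc and mpirun (assumed tooling). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                            /* receive from rank 0 */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```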
11. Distributed Memory
- The alternative model to shared memory
[Figure: nodes proc1/mem1 ... procN/memN, each processor paired with its own local memory, connected by a network]
12. Distributed Memory Multicomputers
- Each node is a von Neumann machine
- Cost of local memory access << cost of remote memory access
- Variety of interconnects used (butterfly networks, hypercubes, multistage networks)
13. Distributed Shared Memory Systems
- Recent architecture; overcomes the size limits of shared memory systems
- Distributed memory system, but memory is globally addressed
  - so it looks like a shared memory system to users
- Hardware supports cache coherency
- Examples: HP Exemplar, SGI Origin, Altix
- Programmed as a shared memory system
14. Distributed Shared Memory
- For larger numbers of processors
[Figure: processors proc1 ... procN with local caches, connected by an interconnect; the distributed memories mem1 ... memN form one global shared memory space]
15. Why Shared Memory Parallelism?
- Shared memory parallelism was introduced to desktop computing
- Now it is becoming the basis for general-purpose computing
- Shared memory parallelism is powerful and very flexible
  - can provide large amounts of memory for a program
  - more power without raising the clock speed
  - easy to program compared to distributed memory
  - also a good starting point for learning more aggressive parallel programming
- The technology is fairly cost-effective
- Can help avoid excessive power consumption and overheating
  - multicore/multithreading
16. Agenda
- System level
  - Shared memory
  - Distributed memory
- Microprocessor level
  - Multithreading
  - Multicore
17. Generations of Microprocessors
- Serial processors
  - handle each instruction back to back
  - IPC (instructions per cycle) < 1
- Pipelined processors
  - overlap different stages of handling of different instructions
  - IPC = 1 as peak throughput
- Superscalar processors
  - issue/execute multiple instructions in parallel (instruction level parallelism, ILP)
  - multiple function units, out-of-order/speculative execution
  - IPC > 1
- Multicore/multithreaded processors
  - IPC >> 1
18. Limitations of Superscalar Uniprocessor Computers
- Limited instruction level parallelism
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (see the short example after this list)
  - Structural hazards: limited resources to support parallel execution of instructions
  - Control hazards: branches and jumps cause the wrong instructions to be fetched/executed
- Difficulties in increasing clock frequency
  - power consumption
  - heat dissipation
- Solution: thread level parallelism
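A tiny C sketch (illustrative only) of the data hazard named above: each statement needs the value produced by the one before it, so a superscalar processor cannot issue these operations in parallel no matter how many function units it has.

```c
/* Data hazard sketch: a dependence chain that limits ILP.
 * Each line reads the result of the previous line, so the
 * operations must effectively execute one after another. */
int chain(int a) {
    int b = a + 1;   /* must finish before the next line can start */
    int c = b * 2;   /* depends on b */
    int d = c - 3;   /* depends on c */
    return d;
}
```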
19. Multithreading
- Thread: a lightweight stream of instructions within a process
  - shared resources: memory address space, file descriptors
  - private state information (context): registers, program counter, stack, etc.
- Software multithreading: context maintained by software (OS)
  - useful for overlapping I/O and computation (see the thread sketch below)
- Hardware multithreading: thread context maintained by hardware
  - multiple processors: SMP
  - a multithreading processor
    - fine-grained and coarse-grained multithreading
    - simultaneous multithreading
  - multicore
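A minimal sketch of software multithreading in C with POSIX threads (assumed to be available from the OS): both threads run inside one process and share its address space, so they update the same counter, while each keeps its own private stack and registers.

```c
/* Software multithreading sketch with POSIX threads: two threads in
 * one process share its address space and update the same counter
 * (protected by a mutex); their contexts are managed by the OS. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* shared data, private stack */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* expect 200000 */
    return 0;
}
```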
20. Multithreaded Processors
- Fine-grain: switch threads on every instruction issue
  - round-robin thread interleaving (skipping stalled threads)
  - advantage: can hide throughput losses that come from both short and long stalls
  - disadvantage: slows down the execution of an individual thread; frequent thread switching is costly
- Coarse-grain: switches threads only on costly stalls (e.g., L2 cache misses)
  - advantages: thread switching is cheaper; much less likely to slow down the execution of an individual thread
  - disadvantage: limited ability to overcome throughput loss
    - the pipeline must be flushed and refilled on thread switches
21. Simultaneous Multithreading
- A combination of ILP and TLP
  - superscalar: issue/execute multiple instructions per cycle
  - TLP: hardware support for executing instructions from multiple threads simultaneously
- Improved resource utilization
  - shared function units, cache, TLB, etc.
- Pentium 4 Hyper-Threading: 5% die size increase with 30% performance gain
22. Hyper-Threading from Intel
- 1 physical processor supporting 2 logical processors
- Shared instruction trace cache and L1 D-cache
- Replicated PC and register renamer
- Other resources partitioned equally between the 2 threads
- Recombines partitioned resources when single-threaded (to ensure single-thread performance)
[Figure: Intel NetBurst Microarchitecture Pipeline with Hyper-Threading Technology]
23. ILP and TLP
[Figure: issue-slot utilization compared for superscalar, fine-grained multithreading, and SMT. Courtesy of Pratyusa Manadhata (CMU)]
24. Multicore Processors
- Multicore (chip multiprocessing)
  - several processors integrated into one chip
  - dedicated execution resources for each core
  - shared L1, off-chip datapath
  - L2 shared or dedicated
- Benefits of integration compared to SMP
  - increased communication bandwidth between processors
  - decreased latency
  - cost/performance efficiency
25. Dual-core Intel vs. AMD
- Major difference: how to ensure scalability, which is limited by memory and I/O bandwidth
  - Intel: increases the front side bus frequency
  - AMD: integrated memory controller plus 3 HyperTransport links (to I/O and other CPUs)
26. Performance: SMT vs. Multicore
- Sun Fire V490: 4 dual-core processors (4 x 2 threads)
- Intel Xeon: 2 SMT processors (2 x 2 threads)
- NAS parallel benchmarks in OpenMP
Chunhua Liao, Zhenying Liu, Lei Huang, and Barbara Chapman. "Evaluating OpenMP on Chip MultiThreading Platforms". First International Workshop on OpenMP, IWOMP 2005, Eugene, Oregon, USA, June 1-4, 2005.
27. Hybrid Multicore/Multithreading
- Sun's UltraSPARC T1 (Niagara): throughput computing, not ideal for scientific computing
- 8 single-issue, in-order cores, each 4-way fine-grained multithreaded
[Figure: eight 4-way MT SPARC pipelines plus shared I/O functions, connected through a crossbar to a 4-way banked L2 cache and the memory controllers]
28. More Hybrid Multicore/Multithreading
- Intel Itanium 2: dual core, 2-way coarse-grained multithreading
- IBM Power5: dual core, 4-way SMT
29. An Aggressive One: Tera MTA
- A large-scale multithreading pioneer
- Tera MultiThreaded Architecture
  - up to 256 processors
  - up to 128 threads per processor
  - fine-grained multithreading
  - 256 x 128 = 32,768 threads!
- Uniform shared memory
  - no data cache
  - up to 2.5 gigabytes per second via the interconnection network
- About 25 active streams per MTA processor are needed to overlap memory latency with computational processing
Tera Computer Company bought the remains of the Cray Research division of Silicon Graphics in 2000 and promptly renamed itself Cray Inc.
30. Asymmetric (Heterogeneous) Multiprocessing
- Combining general-purpose and special-purpose cores together
- Cell: 1 + 8 cores in one chip
  - 1 Power Processor Element (PPE)
    - general purpose, 64-bit RISC
  - 8 Synergistic Processor Elements (SPEs)
    - SIMD units, supporting both parallel and stream processing
[Figure: the Cell processor, used in game consoles]
31. Summary
- Almost all modern supercomputers are MIMD architectures
  - SIMD components are also popular
- The main distinction is between shared memory and distributed memory
  - but there is also distributed shared memory with cache coherency
- Microprocessors are increasingly powerful
  - relying more on thread level parallelism than on increasing frequency and complexity
32. Announcements
- Did you get an email from Dr. Chapman?
  - If not, please sign up with your name and email now.
- Class website
  - http://www.cs.uh.edu/chapman/teachpubs/6397
- There may be other invited speakers teaching this class on Wednesday.