Title: Parallel Architectures
1. Parallel Architectures
- State of the Art
- Shared Memory Computers
2. Agenda
- System level
  - Shared memory
  - Distributed memory
- Microprocessor level
  - Multithreading
  - Multicore
3. Classification of Architectures
- Flynn's taxonomy: not very accurate, but widely used
  - SISD (Single Instruction, Single Data stream)
    - sequential von Neumann architecture
  - MISD (Multiple Instruction, Single Data stream)
    - do any exist?
  - SIMD (Single Instruction, Multiple Data stream)
    - lock-step machines
  - MIMD (Multiple Instruction, Multiple Data stream)
    - modern data- and instruction-parallel systems
4. SIMD Architectures
- Lock-step parallel computers, developed as an alternative to vector computers
- Front end (control unit) and back end (SIMD system)
  - sequential instructions performed by the front end
  - parallel instructions broadcast to the back end at each instruction cycle
- Massive parallelism: maybe thousands of processors
- Examples (see the SIMD sketch below)
  - Historical: Connection Machines, MasPar
  - Today: SIMD instructions (MMX, SSE), graphics processors, the Cell's SPEs
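The SSE instructions mentioned above bring the same lock-step idea to commodity CPUs. A minimal sketch in C, assuming an x86 compiler that provides the SSE intrinsics in <xmmintrin.h>: a single _mm_add_ps instruction adds four floats at once, which is the "single instruction, multiple data" pattern.

```c
/* SIMD sketch: one instruction (_mm_add_ps) operates on four floats
 * in lock step. Assumes an x86 CPU and compiler with SSE support. */
#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load four floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* four additions in one instruction */
    _mm_storeu_ps(c, vc);

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```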
5. MIMD Architectures
- Computer with multiple independently operating CPUs
- CPUs are connected with each other and with memory
  - many types of topologies: linear (bus), ring, tree, mesh, 3D torus
  - CPUs have independent channels to memory
- Two major kinds of MIMD machines
  - shared memory
  - distributed memory
  - now also distributed shared memory systems
- Most current HPC systems are MIMD machines
- Examples: SGI PowerChallenge, IBM SP2, HP Exemplar
6. Examples
[Figure: examples of distributed memory systems (nodes connected by a LAN) and shared memory systems]
7. Shared Memory Architectures
- Built during the 1980s, now popular again
- Multiple processors access the same main memory
  - connected to main memory via a bus
  - each processor has a local cache
- Cache coherence: when a memory location is updated, any copies in the local caches of other processors must be invalidated
- Advantage: easy to program, easy processor communication (see the OpenMP sketch below)
- Disadvantage
  - the bus is a bottleneck, so only small systems are possible
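A minimal sketch of how directly a shared memory machine can be programmed, assuming a C compiler with OpenMP support (the usual programming model on such systems): all threads read and write the same arrays, so parallelizing a loop needs only one directive and no explicit communication.

```c
/* Shared memory programming sketch: OpenMP threads all see the same
 * arrays, so one directive is enough to split the loop across them.
 * Compile with e.g. gcc -fopenmp (assumed). */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];

    /* Each thread works on a slice of the same shared arrays. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;
        b[i] = a[i] + 1.0;   /* no message passing needed */
    }

    printf("b[N-1] = %f (max threads: %d)\n", b[N - 1], omp_get_max_threads());
    return 0;
}
```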
8. Shared Memory
- All processes/threads access the same shared memory data space
[Figure: processes proc1 ... procN all attached to one shared memory space]
9. More Realistic View of Shared Memory Architecture
[Figure: cpu1 ... cpuN, each with its own cache (cache1 ... cacheN), connected by a bus to the shared memory]
10. Distributed Memory Architecture
- Built during the late 1980s and since then
- Multiple processors, each with its own local memory
  - no shared memory
- Processor/memory nodes connected via a network
- If data in one processor's memory is needed by another processor, it must be transmitted as messages over the network (see the message passing sketch below)
  - data exchange is slow in comparison to clock speed
- Hard to program
  - message passing is the "assembly language" of parallel programming
- Advantage: distributed memory parallel architectures scale easily
- Examples: Intel iPSC and Paragon, IBM SP2, Cray T3D and T3E
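A minimal sketch of the message passing model in C, assuming an MPI implementation is available (MPI is not named on the slide, but it is the standard library for this model): since no memory is shared, process 0 must explicitly send a value and process 1 must explicitly receive it.

```c
/* Distributed memory sketch: two processes exchange data only through
 * explicit messages over the network. Assumes an MPI library;
 * build/run with mpicc and mpirun (assumed tooling). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                            /* receive from rank 0 */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```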
11. Distributed Memory
- The alternative model to shared memory
[Figure: nodes proc1/mem1 ... procN/memN, each processor paired with its own local memory, connected by a network]
12. Distributed Memory Multicomputers
- Each node is a von Neumann machine
- Cost of local memory access << cost of remote memory access
- Variety of interconnects used (butterfly networks, hypercubes, multistage networks)
13. Distributed Shared Memory Systems
- Recent architecture; overcomes the size limits of shared memory systems
- Distributed memory system, but memory is globally addressed
  - so it looks like a shared memory system to users
- Hardware supports cache coherency
- Examples: HP Exemplar, SGI Origin, Altix
- Programmed as a shared memory system
14. Distributed Shared Memory
- For larger numbers of processors
[Figure: processors proc1 ... procN with local caches, connected by an interconnect; the distributed memories mem1 ... memN form one global shared memory space]
15. Why Shared Memory Parallelism?
- Shared memory parallelism was introduced to desktop computing
- Now it is becoming the basis for general-purpose computing
- Shared memory parallelism is powerful and very flexible
  - can provide large amounts of memory for a program
  - more power without raising the clock speed
  - easy to program compared to distributed memory
  - also a good starting point for learning more aggressive parallel programming
- The technology is fairly cost-effective
- Can help avoid excessive power consumption and overheating
  - multicore/multithreading
16. Agenda
- System level
  - Shared memory
  - Distributed memory
- Microprocessor level
  - Multithreading
  - Multicore
17. Generations of Microprocessors
- Serial processors
  - handle each instruction back to back
  - IPC (instructions per cycle) < 1
- Pipelined processors
  - overlap different stages of handling of different instructions
  - IPC = 1 as peak throughput
- Superscalar processors
  - issue/execute multiple instructions in parallel (instruction level parallelism, ILP)
  - multiple function units, out-of-order/speculative execution
  - IPC > 1
- Multicore/multithreaded processors
  - IPC >> 1
18. Limitations of Superscalar Uniprocessor Computers
- Limited instruction level parallelism
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (see the short example after this list)
  - Structural hazards: limited resources to support parallel execution of instructions
  - Control hazards: branches and jumps cause the wrong instructions to be fetched/executed
- Difficulties in increasing clock frequency
  - power consumption
  - heat dissipation
- Solution: thread level parallelism
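A tiny C sketch (illustrative only) of the data hazard named above: each statement needs the value produced by the one before it, so a superscalar processor cannot issue these operations in parallel no matter how many function units it has.

```c
/* Data hazard sketch: a dependence chain that limits ILP.
 * Each line reads the result of the previous line, so the
 * operations must effectively execute one after another. */
int chain(int a) {
    int b = a + 1;   /* must finish before the next line can start */
    int c = b * 2;   /* depends on b */
    int d = c - 3;   /* depends on c */
    return d;
}
```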
19. Multithreading
- Thread: a lightweight stream of instructions within a process
  - shared resources: memory address space, file descriptors
  - private state information (context): registers, program counter, stack, etc.
- Software multithreading: context maintained by software (OS)
  - useful for overlapping I/O and computation (see the thread sketch below)
- Hardware multithreading: thread context maintained by hardware
  - multiple processors: SMP
  - a multithreading processor
    - fine-grained and coarse-grained multithreading
    - simultaneous multithreading
  - multicore
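A minimal sketch of software multithreading in C with POSIX threads (assumed to be available from the OS): both threads run inside one process and share its address space, so they update the same counter, while each keeps its own private stack and registers.

```c
/* Software multithreading sketch with POSIX threads: two threads in
 * one process share its address space and update the same counter
 * (protected by a mutex); their contexts are managed by the OS. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* shared data, private stack */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* expect 200000 */
    return 0;
}
```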
20. Multithreaded Processors
- Fine-grain: switch threads on every instruction issue
  - round-robin thread interleaving (skipping stalled threads)
  - advantage: can hide throughput losses that come from both short and long stalls
  - disadvantage: slows down the execution of an individual thread; frequent thread switching is costly
- Coarse-grain: switches threads only on costly stalls (e.g., L2 cache misses)
  - advantages: thread switching is cheaper; much less likely to slow down the execution of an individual thread
  - disadvantage: limited ability to overcome throughput loss
    - the pipeline must be flushed and refilled on thread switches
21. Simultaneous Multithreading
- A combination of ILP and TLP
  - superscalar: issue/execute multiple instructions per cycle
  - TLP: hardware support for executing instructions from multiple threads simultaneously
- Improved resource utilization
  - shared function units, cache, TLB, etc.
- Pentium 4 Hyper-Threading: 5% die size increase with 30% performance gain
22. Hyper-Threading from Intel
- 1 physical processor supporting 2 logical processors
- Shared instruction trace cache and L1 D-cache
- Replicated PC and register renamer
- Other resources partitioned equally between the 2 threads
- Recombines partitioned resources when single-threaded (to ensure single-thread performance)
[Figure: Intel NetBurst Microarchitecture Pipeline with Hyper-Threading Technology]
23. ILP and TLP
[Figure: issue-slot utilization compared for superscalar, fine-grained multithreading, and SMT. Courtesy of Pratyusa Manadhata (CMU)]
24. Multicore Processors
- Multicore (chip multiprocessing)
  - several processors integrated into one chip
  - dedicated execution resources for each core
  - shared L1, off-chip datapath
  - L2 shared or dedicated
- Benefits of integration compared to SMP
  - increased communication bandwidth between processors
  - decreased latency
  - cost/performance efficiency
25. Dual-core Intel vs. AMD
- Major difference: how to ensure scalability, which is limited by memory and I/O bandwidth
  - Intel: increases the front side bus frequency
  - AMD: integrated memory controller plus 3 HyperTransport links (to I/O and other CPUs)
26. Performance: SMT vs. Multicore
- Sun Fire V490: 4 dual-core processors (4 x 2 threads)
- Intel Xeon: 2 SMT processors (2 x 2 threads)
- NAS parallel benchmarks in OpenMP
Chunhua Liao, Zhenying Liu, Lei Huang, and Barbara Chapman. "Evaluating OpenMP on Chip MultiThreading Platforms". First International Workshop on OpenMP, IWOMP 2005, Eugene, Oregon, USA, June 1-4, 2005.
27. Hybrid Multicore/Multithreading
- Sun's UltraSPARC T1 (Niagara): throughput computing, not ideal for scientific computing
- 8 single-issue, in-order cores, each 4-way fine-grained multithreaded
[Figure: eight 4-way MT SPARC pipelines plus shared I/O functions, connected through a crossbar to a 4-way banked L2 cache and the memory controllers]
28. More Hybrid Multicore/Multithreading
- Intel Itanium 2: dual core, 2-way coarse-grained multithreading
- IBM Power5: dual core, 4-way SMT
29. An Aggressive One: Tera MTA
- A large-scale multithreading pioneer
- Tera MultiThreaded Architecture
  - up to 256 processors
  - up to 128 threads per processor
  - fine-grained multithreading
  - 256 x 128 = 32,768 threads!
- Uniform shared memory
  - no data cache
  - up to 2.5 gigabytes per second via the interconnection network
- About 25 active streams per MTA processor are needed to overlap memory latency with computational processing
Tera Computer Company bought the remains of the Cray Research division of Silicon Graphics in 2000 and promptly renamed itself Cray Inc.
30. Asymmetric (Heterogeneous) Multiprocessing
- Combining general-purpose and special-purpose cores together
- Cell: 1 + 8 cores in one chip
  - 1 Power Processor Element (PPE)
    - general purpose, 64-bit RISC
  - 8 Synergistic Processor Elements (SPEs)
    - SIMD units, supporting both parallel and stream processing
[Figure: the Cell processor, used in game consoles]
31. Summary
- Almost all modern supercomputers are MIMD architectures
  - SIMD components are also popular
- The main distinction is between shared memory and distributed memory
  - but there is also distributed shared memory with cache coherency
- Microprocessors are increasingly powerful
  - relying more on thread level parallelism than on increasing frequency and complexity
32. Announcements
- Did you get an email from Dr. Chapman?
  - If not, please sign up with your name and email now.
- Class website
  - http://www.cs.uh.edu/chapman/teachpubs/6397
- There may be other invited speakers teaching this class on Wednesday.