Title: Chapter 6: Multiprocessors Part I
1. Chapter 6: Multiprocessors Part I
- Introduction (Section 6.1)
- What is a parallel or multiprocessor system?
- Why parallel architecture?
- Performance potential
- Flynn classification
- Communication models (Section 6.1)
- Architectures (Section 6.1)
- Centralized shared-memory (Section 6.3)
- Distributed shared-memory (Section 6.5)
- More in Part II
2. What is a parallel or multiprocessor system?
- Multiple processor units working together to solve the same problem
- Key architectural issue: the communication model
3. Why parallel architectures?
- Absolute performance
  - Scientific computing
  - General-purpose computing
- Technology and architecture trends in high-performance computing
  - # of transistors on chip growing rapidly
  - Clock rates expected to go up, but slowly
  - Instruction-level parallelism valuable but limited
  - Complex architectures
  - ⇒ Coarser-level parallelism, as in MPs
- Trend seen in products from AMD, Compaq, HP, IBM, Intel, SGI, SUN, ...
4. Why parallel architectures? (Cont.)
- Cost-performance
  - µPs have made massive gains in performance: clock rates, ILP, caches
  - These commodity µPs are cheap; many more are sold than supercomputers
  - ⇒ Multiprocessors built from µPs are replacing traditional supercomputers
5. Performance Potential
- Amdahl's Law is pessimistic
  - Let s be the serial fraction
  - Let p be the fraction that can be parallelized n ways (s + p = 1)
  - Serial: SSPPPPPP (8 time units)
  - 6 processors (the six P's run concurrently, 3 time units):
      P1: S S P
      P2:     P
      P3:     P
      P4:     P
      P5:     P
      P6:     P
  - Speedup = 8/3 ≈ 2.67
  - In general, T(n) = s + p/n, so Speedup(n) = 1 / (s + p/n)
  - As n → ∞, T(n) → s and Speedup(n) → 1/s
  - Pessimistic: speedup is bounded by the serial fraction, no matter how many processors
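The arithmetic above can be sketched in a few lines of Python (the function name and the normalization s + p = 1 are mine, not from the slides):

```python
def amdahl_speedup(s, n):
    """Speedup with serial fraction s (0 <= s <= 1) on n processors."""
    p = 1.0 - s                  # parallelizable fraction
    return 1.0 / (s + p / n)

# The slide's example: 2 of 8 time units are serial, so s = 2/8
print(amdahl_speedup(2 / 8, 6))       # 8/3 ≈ 2.67
# As n grows, the speedup is capped at 1/s = 4
print(amdahl_speedup(2 / 8, 10**9))   # approaches 4.0
```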
6. Performance Potential (Cont.)
- Gustafson's Corollary
  - Amdahl's law holds if we run the same problem size on larger machines
  - But in practice, people run larger problems and "wait" the same time
7. Performance Potential (Cont.)
- Gustafson's Corollary (Cont.)
  - Assume, for larger problem sizes:
    - Serial time fixed (at s)
    - Parallel time proportional to problem size (the truth is more complicated)
  - Old serial: SSPPPPPP
  - 6 processors, scaled problem (still 8 time units):
      P1: S S P P P P P P
      P2:     P P P P P P
      P3:     P P P P P P
      P4:     P P P P P P
      P5:     P P P P P P
      P6:     P P P P P P
  - Hypothetical serial: SSPPPPPP PPPPPP PPPPPP PPPPPP PPPPPP PPPPPP
  - Speedup = (8 + 5·6)/8 = 38/8 = 4.75
  - Scaled serial time T'(n) = s + n·p, so T'(n) → ∞ as n → ∞: the scaled speedup grows without bound!
  - How does your algorithm "scale up"?
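The scaled-speedup arithmetic can be sketched the same way (s is the serial fraction of the *parallel* run time, normalized so s + p = 1; names are mine):

```python
def gustafson_speedup(s, n):
    """Scaled speedup: serial time fixed, parallel work grows with n."""
    p = 1.0 - s            # parallel fraction of the n-processor run
    return s + n * p       # hypothetical serial time / parallel time

# Slide example: on 6 processors the run is 2 serial + 6 parallel units,
# so s = 2/8 of the parallel run time
print(gustafson_speedup(2 / 8, 6))   # (8 + 5*6)/8 = 4.75
```

Unlike Amdahl's bound of 1/s, this grows linearly in n, which is why it is the optimistic reading.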
8. Flynn classification
- Single-Instruction Single-Data (SISD)
- Single-Instruction Multiple-Data (SIMD)
- Multiple-Instruction Single-Data (MISD)
- Multiple-Instruction Multiple-Data (MIMD)
9. Communication models
- Shared-memory
- Message passing
- Data parallel
10. Communication Models: Shared-Memory

  P        P        P
  ----interconnect----
  M  M  M  M  M  M  M

- Each node: a processor that runs a process
- One shared memory, accessible by any processor
  - The same address on two different processors refers to the same datum
- Therefore, write and read memory to:
  - Store and recall data
  - Communicate, synchronize (coordinate)
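A minimal sketch of this model using threads over one address space (the counter and lock are illustrative, not from the slides):

```python
import threading

counter = [0]              # one shared datum: the same "address" for every thread
lock = threading.Lock()    # synchronization (coordination) through shared memory

def worker():
    for _ in range(10_000):
        with lock:         # coordinate so increments don't interleave
            counter[0] += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])          # 40000: all threads wrote through the same location
```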
11. Communication Models: Message Passing

  P-M      P-M      P-M
  ----interconnect----

- Each node: a computer
  - Processor runs its own program (as in shared memory)
  - Memory is local to that node, unrelated to memory on other nodes
- Add messages for inter-node communication: send and receive, like mail
12. Communication Models: Data Parallel

  P-M      P-M      P-M
  ----interconnect----

- Virtual processor per datum
- Write sequential programs with a "conceptual PC" and let the parallelism be within the data (e.g., matrices)
  - C = A + B
- Typically a SIMD architecture, but MIMD can be as effective
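A minimal sketch of the C = A + B example with plain Python lists; in a real data-parallel system each per-element addition would be done by its own virtual processor, concurrently:

```python
A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]

# C = A + B: written as one conceptual step over the whole matrices;
# the parallelism is across the elements, not across instructions
C = [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]
print(C)   # [[11, 22], [33, 44]]
```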
13. Architectures
- All mechanisms can usually be synthesized by all hardware
- Key question: which communication model does the hardware support best?
- All small-scale systems are shared-memory
14. Which is the Best Communication Model to Support?
- Shared-memory
  - Used in small-scale systems
  - Easier to program for dynamic data structures
  - Lower-overhead communication for small data
  - Implicit movement of data with caching
  - Hard to build?
- Message-passing
  - Communication is explicit: harder to program?
  - Larger overheads in communication (OS intervention?)
  - Easier to build?
15. Shared-Memory Architecture
The model:

  PROC     PROC     PROC
  ----INTERCONNECT----
        MEMORY

- For now, assume the interconnect is a bus: a centralized architecture
16. Centralized Shared-Memory Architecture

  PROC     PROC     PROC
  --------BUS--------
        MEMORY
17. Centralized Shared-Memory Architecture (Cont.)
- For higher bandwidth (throughput)
- For lower latency
- Problem?
18. Centralized Shared-Memory Architecture (Cont.)
- For higher bandwidth (throughput)
- For lower latency
- Problem?

  PROC     PROC     PROC
  --------BUS--------
  MEMORY   MEMORY   MEMORY
19. Centralized Shared-Memory Architecture (Cont.)
- For higher bandwidth (throughput)
- For lower latency
- Problem?

  PROC     PROC     PROC
  --------BUS--------
  MEMORY   MEMORY   MEMORY

  PROC     PROC     PROC
  CACHE    CACHE    CACHE
  --------BUS--------
  MEMORY   MEMORY   MEMORY
20. Cache Coherence Problem

  PROC 1   PROC 2   ...   PROC n
  CACHE    CACHE          CACHE
   [A]
  --------BUS--------
  MEMORY        MEMORY
   [A]

- Location A is cached by one processor while memory also holds a copy; a write to either copy leaves the other stale
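A toy illustration of the problem (not any real protocol): two private caches over one memory, with no coherence mechanism, so a write by one processor leaves a stale copy in the other cache:

```python
class Cache:
    """Private cache with no coherence support."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                  # address -> cached value

    def read(self, addr):
        if addr not in self.lines:       # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]          # hit: use the (possibly stale) copy

    def write(self, addr, value):        # write-through, but no invalidation
        self.lines[addr] = value
        self.memory[addr] = value

memory = {"A": 0}
c1, c2 = Cache(memory), Cache(memory)

c1.read("A"); c2.read("A")   # both caches now hold A = 0
c1.write("A", 7)             # P1 updates A; memory sees 7
print(c2.read("A"))          # 0 -- P2 still reads its stale copy
```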
21. Cache Coherence Solutions
- Snooping
- Problem with a centralized architecture

  PROC 1   PROC 2   ...   PROC n
  CACHE    CACHE          CACHE
   [A]
  --------BUS--------
  MEMORY        MEMORY
   [A]
22. Distributed Shared-Memory (DSM) Architecture
- Use a higher-bandwidth interconnection network
- Uniform Memory Access architecture (UMA)

  PROC 1   PROC 2   ...   PROC n
  CACHE    CACHE          CACHE
  --GENERAL INTERCONNECT--
  MEMORY   MEMORY   MEMORY
23. Distributed Shared-Memory (DSM) - Cont.
- For lower latency: Non-Uniform Memory Access architecture (NUMA)
24. Distributed Shared-Memory (DSM) - Cont.
- For lower latency: Non-Uniform Memory Access architecture (NUMA)

  PROC     PROC     PROC
  CACHE    CACHE    CACHE
  MEM      MEM      MEM
  ---SWITCH/NETWORK---
25. Non-Bus Interconnection Networks
- Example interconnection networks
26. Distributed Shared-Memory - Coherence Problem
- Directory scheme
- Level of indirection!

  PROC     PROC     PROC
  CACHE    CACHE    CACHE
  MEM      MEM      MEM
  ---SWITCH/NETWORK---
27. Distributed Shared-Memory - Coherence Problem
- Directory scheme
- Level of indirection!

  PROC     PROC     PROC
  CACHE    CACHE    CACHE
  MEM      MEM      MEM
  DIR      DIR      DIR
  ---SWITCH/NETWORK---
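A toy sketch of the level of indirection a directory adds (illustrative only, not a real protocol): each line's home directory tracks which caches hold a copy and invalidates them point-to-point on a write, so no bus broadcast is needed:

```python
class Directory:
    """Home-node bookkeeping: which caches share each address."""
    def __init__(self):
        self.sharers = {}                    # addr -> set of sharing caches

    def record_read(self, addr, cache):
        self.sharers.setdefault(addr, set()).add(cache)

    def invalidate(self, addr, writer):
        # The indirection: consult the directory, then invalidate only sharers
        for cache in self.sharers.get(addr, set()) - {writer}:
            cache.lines.pop(addr, None)
        self.sharers[addr] = {writer}

class Cache:
    def __init__(self, memory, directory):
        self.memory, self.directory = memory, directory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch and register as sharer
            self.lines[addr] = self.memory[addr]
            self.directory.record_read(addr, self)
        return self.lines[addr]

    def write(self, addr, value):
        self.directory.invalidate(addr, self)
        self.lines[addr] = value
        self.memory[addr] = value

memory, directory = {"A": 0}, Directory()
c1, c2 = Cache(memory, directory), Cache(memory, directory)
c1.read("A"); c2.read("A")   # both caches hold A = 0; directory records both
c1.write("A", 7)             # directory invalidates c2's copy
print(c2.read("A"))          # 7 -- the stale copy was removed, so c2 refetches
```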