Title: Multiprocessors and Thread-Level Parallelism
Chapter 6
- Multiprocessors and Thread-Level Parallelism
6.1 Introduction
- Thread
  - A set of instructions that can be executed on a different processor
  - Example) a process, an iteration of a loop
  - Needs to be identified/generated by a compiler or programmer
  - Size: from hundreds to millions of instructions
- Thread-level parallelism
  - Run different threads on different processors
  - N threads → run N processors in parallel
  - To fully utilize a machine with N processors, a compiler or programmer needs to identify N threads
- Difference from instruction-level parallelism (ILP)
  - Thread-level parallelism is large-grain (also called coarse-grain) parallelism; ILP is small-grain (also called fine-grain) parallelism
  - Thread-level parallelism is identified by the programmer or compiler; ILP is mainly identified by HW
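
To make this concrete, the sketch below (not from the slides; the array size and thread count are illustrative choices) splits the iterations of a loop across N POSIX threads, so each thread can run on a different processor:

    /* Programmer-identified thread-level parallelism: loop iterations
       split across N threads. Compile with: cc -pthread tlp.c */
    #include <pthread.h>
    #include <stdio.h>

    #define N   4        /* threads, ideally one per processor */
    #define LEN 1000000

    static double a[LEN], b[LEN], c[LEN];

    static void *worker(void *arg) {
        long id = (long)arg;              /* thread index 0..N-1 */
        long chunk = LEN / N;
        long lo = id * chunk;
        long hi = (id == N - 1) ? LEN : lo + chunk;
        for (long i = lo; i < hi; i++)    /* each thread owns a disjoint slice */
            c[i] = a[i] + b[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[N];
        for (long id = 0; id < N; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int i = 0; i < N; i++)
            pthread_join(t[i], NULL);
        printf("c[0] = %f\n", c[0]);
        return 0;
    }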
A Taxonomy of Parallel Architectures
6.1 Introduction
- Flynn's categorization
  - SISD (single instruction stream, single data stream): uniprocessor
  - SIMD (single instruction stream, multiple data streams): instructions for multimedia extensions
  - MISD (multiple instruction streams, single data stream): no commercial multiprocessor, some special-purpose stream processors
  - MIMD (multiple instruction streams, multiple data streams): general form of parallel processors
- MIMD architectures
  - Centralized shared-memory architecture
    - → Fig. 6.1
    - Multiple processors share a single centralized memory
    - Simplest architecture
    - Normally a single-bus-based architecture
      - But multiple buses or a switch can also be used
    - At most a few dozen processors
      - Because scaling is difficult (the shared memory can become a bottleneck)
    - Most popular architecture
    - Also called symmetric (shared-memory) multiprocessors (SMP) or uniform memory access (UMA) machines
A Taxonomy of Parallel Architectures
6.1 Introduction
- MIMD architectures (continued)
  - Physically distributed memory
    - → Fig. 6.2
    - Advantages
      - Fast local memory access
      - Easy to scale up memory bandwidth in a cost-effective way
      - Assumption: most accesses are to the local memory in the node
    - Disadvantage
      - Slow remote memory access (communication between processors)
    - Variation: each node consists of multiple processors (i.e., an SMP)
Models for Communication and Memory Architecture
6.1 Introduction
- This discussion is for physically distributed memory systems
- Distributed shared-memory (DSM) architecture
  - Physically distributed but logically shared address space
    - A single address space over physically distributed memory
    - The same address on two processors refers to the same location
  - Interprocessor communication: load/store instructions through the shared memory
  - Also called nonuniform memory access (NUMA) architecture
- Message-passing multiprocessors
  - Each processor has its own private address space
    - The same address on two processors refers to different locations
  - Can be considered as multiple independent computers connected through a network → multicomputer, cluster
  - Interprocessor communication: explicit calls to a message-passing interface
    - MPI (Message Passing Interface): the most popular standard library for message passing (see the sketch below)
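
As a minimal illustration of the explicit-communication style (a sketch, not an example from the slides), a two-process MPI program in C could look like this; rank 0 sends a value from its private address space to rank 1:

    /* Two-process message passing. Compile with mpicc, run with
       mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;   /* data lives only in rank 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }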
Models for Communication and Memory Architecture
6.1 Introduction
- Performance metrics for communication mechanisms
  - Communication bandwidth
  - Communication latency
  - Communication latency hiding
    - Overlapping communication with computation or with other communication
- Advantages of the different communication mechanisms
  - Shared-memory communication
    - Advantages
      - The shared-memory model is a familiar programming environment
        - Ease of programming (and of compiler design)
      - Compatible with centralized-memory systems
        - OpenMP: a standardized programming interface for shared-memory multiprocessors (see the sketch below)
      - Easy to focus optimization effort on the critical parts
      - Lower communication overhead (normally lower latency)
        - Cache management may be automatically supported by HW
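
A minimal OpenMP sketch (illustrative, not from the slides) of why shared memory eases programming: a single directive parallelizes the loop, and communication happens implicitly through the shared arrays.

    /* Compile with, e.g., gcc -fopenmp omp_add.c */
    #include <omp.h>
    #include <stdio.h>

    #define LEN 1000000
    static double a[LEN], b[LEN], c[LEN];

    int main(void) {
        #pragma omp parallel for    /* iterations divided among threads */
        for (long i = 0; i < LEN; i++)
            c[i] = a[i] + b[i];
        printf("threads available: %d\n", omp_get_max_threads());
        return 0;
    }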
Models for Communication and Memory Architecture
6.1 Introduction
- Advantages of the different communication mechanisms (continued)
  - Message-passing communication
    - Advantages
      - Simple and scalable hardware implementation
      - Explicit communication
        - More observable and controllable communication
        - Can explicitly focus on a specific part
        - Fewer synchronization errors
        - Performance advantage from explicitly handling communication
    - Analogy: shared-memory communication is like high-level programming, while message passing is like assembly programming
Challenges of Parallel Processing
6.1 Introduction
- Limited parallelism
  - Ex) You want to achieve a speedup of 80 with 100 processors
    - Speedup = 1 / (Fraction_enhanced/Speedup_enhanced + (1 - Fraction_enhanced))
    - With parallelization as the enhancement: Speedup = 1 / (Fraction_parallel/Speedup_parallel + (1 - Fraction_parallel))
    - 80 = 1 / (Fraction_parallel/100 + (1 - Fraction_parallel))
    - Solving gives Fraction_parallel = 0.9975
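
Rearranging the equation gives Fraction_parallel = (1 - 1/Speedup) / (1 - 1/N). The tiny C check below (an illustration, not from the slides) confirms the 0.9975 figure:

    #include <stdio.h>

    int main(void) {
        double speedup = 80.0, n = 100.0;
        /* f = (1 - 1/S) / (1 - 1/N), from Amdahl's law solved for f */
        double f = (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n);
        printf("required parallel fraction = %.4f\n", f);  /* 0.9975 */
        return 0;
    }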
- Large latency of remote access in a parallel processor
  - → Figure 6.3
- Summary: SMP, DSM, UMA, NUMA, message passing, MPP (massively parallel processors)
6.3 Symmetric Shared-Memory Architectures
- Private data: used by a single processor
- Shared data: used by multiple processors
  - Caching of shared data causes a cache coherence problem
  - → Fig. 6.7
- Cache coherence problem
  - Informal definition: a memory system is coherent if any read of a data item returns the most recently written value of that data item → too vague
  - A memory system is coherent if
    - A read by a processor P to a location X that follows a write by P to X, with no writes to X by another processor occurring between the write and the read by P, always returns the value written by P
    - A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
      - Exactly when? → the memory consistency problem, Section 6.8
    - Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, a processor can never read the value of the location as 2 and then later read it as 1
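
The write-serialization property can be observed from software through C11 atomics, which expose the same per-location guarantee to the programmer (a sketch under that assumption, not an example from the text): whichever order the hardware serializes the two stores into, a reader's successive loads can never see that order reversed.

    /* Compile with: cc -std=c11 -pthread coherence.c */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int x;   /* the shared location X */

    static void *writer1(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        return NULL;
    }

    static void *writer2(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 2, memory_order_relaxed);
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        int first  = atomic_load_explicit(&x, memory_order_relaxed);
        int second = atomic_load_explicit(&x, memory_order_relaxed);
        /* Coherence: if the write of 2 overwrote the write of 1, then
           (first == 2 && second == 1) is impossible, and vice versa. */
        printf("reader saw %d then %d\n", first, second);
        return NULL;
    }

    int main(void) {
        pthread_t t[3];
        pthread_create(&t[0], NULL, writer1, NULL);
        pthread_create(&t[1], NULL, writer2, NULL);
        pthread_create(&t[2], NULL, reader, NULL);
        for (int i = 0; i < 3; i++)
            pthread_join(t[i], NULL);
        return 0;
    }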
Basic Schemes for Enforcing Coherence
6.3 Symmetric Shared-Memory Architectures
- Cache coherence protocol
  - Maintains coherence for multiple processors
  - Tracks the sharing state of every data block
- Two approaches
  - Directory based: a single central data structure stores the sharing status
  - Snooping: every cache keeps the sharing status of the blocks it holds
    - No centralized data structure
    - Usually implemented on the shared-memory bus
An Example Protocol
6.3 Symmetric Shared-Memory Architectures
- Write invalidate protocol for a write-back cache
  - → Fig. 6.10, Fig. 6.11, Fig. 6.12
  - In the left half of Fig. 6.11
    - All states needed for a write-back uniprocessor cache: invalid, valid (shared), dirty (exclusive)
    - All arcs needed for a write-back uniprocessor cache, except that a write hit on the shared state generates a write miss
  - Simplification
    - No distinction between a write hit and a write miss to a shared cache block
      - Place a write miss on the bus
      - Memory will supply the data
      - Any processor with a copy of the cache block invalidates it
  - A more complicated protocol distinguishes a write hit from a write miss
    - A write hit need not fetch data from memory → no data movement is necessary → just a status change → addition of a write invalidate transaction
    - Addition of a write invalidate transaction to the exclusive state
      - Eg) Write miss on an invalid block while another cache holds it in the exclusive state
        - Just invalidate that cache block; no data write-back
        - Just a change of ownership
        - This may depend on the size of the write (full-cache-block write or partial write)
  - Additional state: a clean-and-private state
    - No need to generate a bus transaction on a write to such a block
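
A software sketch of this three-state machine (a simplified illustration assuming the states and bus actions named above; real protocols live in the cache controller hardware, not in software):

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
    typedef enum { NO_BUS_OP, BUS_READ_MISS, BUS_WRITE_MISS,
                   BUS_WRITEBACK } BusOp;

    /* CPU-side transition for a read or write by the local processor. */
    static BusOp cpu_access(BlockState *s, int is_write) {
        switch (*s) {
        case INVALID:
            *s = is_write ? EXCLUSIVE : SHARED;
            return is_write ? BUS_WRITE_MISS : BUS_READ_MISS;
        case SHARED:
            if (is_write) {               /* simplified protocol: a write */
                *s = EXCLUSIVE;           /* hit also posts a write miss  */
                return BUS_WRITE_MISS;
            }
            return NO_BUS_OP;             /* read hit */
        case EXCLUSIVE:
            return NO_BUS_OP;             /* read or write hit on dirty block */
        }
        return NO_BUS_OP;
    }

    /* Snoop-side transition when another processor's miss is on the bus. */
    static BusOp snoop(BlockState *s, BusOp observed) {
        if (*s == EXCLUSIVE) {            /* we own the only dirty copy */
            *s = (observed == BUS_READ_MISS) ? SHARED : INVALID;
            return BUS_WRITEBACK;         /* supply/flush the dirty block */
        }
        if (*s == SHARED && observed == BUS_WRITE_MISS)
            *s = INVALID;                 /* invalidate our clean copy */
        return NO_BUS_OP;
    }

    int main(void) {
        BlockState p0 = INVALID, p1 = INVALID;
        cpu_access(&p0, 1);               /* P0 writes: invalid -> exclusive */
        snoop(&p1, BUS_WRITE_MISS);       /* P1 snoops P0's write miss */
        cpu_access(&p1, 0);               /* P1 reads: read miss on bus */
        snoop(&p0, BUS_READ_MISS);        /* P0 writes back, goes to shared */
        printf("P0=%d P1=%d\n", p0, p1);  /* both end up shared (1 1) */
        return 0;
    }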
An Example Protocol
6.3 Symmetric Shared-Memory Architectures
- Protocol assumption: operations are atomic, i.e., no intervening operation can occur
  - Eg) Atomic operation: detecting the write miss, acquiring the bus, and receiving a response as one indivisible step
  - Non-atomic operations may cause deadlock
- Microprocessors support cache coherence protocols
  - Ex) For the Pentium IV, 4 processors can be directly connected to a shared bus
Snooping Protocols
6.3 Symmetric Shared-Memory Architectures
- Two approaches
  - Write invalidate protocol
    - A processor obtains exclusive access to a data item before it writes that item
    - → Fig. 6.8
  - Write update protocol (or write broadcast protocol)
    - Update all the cached copies of a data item when that item is written
    - → Fig. 6.9
- Performance differences (see the traffic sketch below)
  - Multiple writes to the same item
    - Write invalidate: invalidate just once
    - Write update: update every time
  - Multiple writes to different words in the same cache block
    - Write invalidate: invalidate just once
    - Write update: update for every word
  - Delay between a write on one processor and a read on another processor
    - The write update scheme is faster
  - Write invalidate: less bus and memory traffic → more popular
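
A toy count of the bus traffic for the second case above (hypothetical numbers, only to make the comparison concrete): one processor writes each of 4 words in a block 10 times while another processor holds a copy.

    #include <stdio.h>

    int main(void) {
        int words = 4, writes_per_word = 10;
        int invalidate_traffic = 1;                   /* one invalidation,
                                                         then local hits */
        int update_traffic = words * writes_per_word; /* one bus update
                                                         per write */
        printf("invalidate: %d bus ops, update: %d bus ops\n",
               invalidate_traffic, update_traffic);
        return 0;
    }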
Basic Implementation Techniques
6.3 Symmetric Shared-Memory Architectures
- Write-through vs. write-back
  - Where the up-to-date data lives on a cache miss
    - Write-through cache: always in the memory
    - Write-back cache: either in the memory or in another processor's cache
      - The processor broadcasts the address to be read on the bus
      - If a processor has the dirty copy of the data, it sends the data block
  - Note: a write-back cache requires less memory traffic → preferred for multiprocessor systems
Basic Implementation Techniques
6.3 Symmetric Shared-Memory Architectures
- Data structure for a cache coherence protocol (see the struct sketch below)
  - Normal cache structure needed: cache tag, valid bit, dirty bit
  - Shared bit: indicates whether the block is shared by other processors
    - Write invalidation is not necessary for non-shared data
  - Owner of a cache block: the processor that exclusively contains the sole copy of the data block
- Every bus transaction must check the cache tags → may interfere with CPU cache accesses
  - Solution: duplicate tags
  - Solution: a multilevel cache with the inclusion property
    - Snooping checks the 2nd-level cache while the CPU accesses the 1st-level cache
    - Adopted in many designs
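
A minimal sketch of that per-block bookkeeping (field names and widths are illustrative assumptions, not from the text):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        uint64_t tag;      /* address tag for the block */
        bool valid;        /* block holds live data */
        bool dirty;        /* block modified; memory copy is stale */
        bool shared;       /* another processor may hold a copy */
    } CacheLineMeta;

    /* A write to a non-shared (private) block needs no bus invalidation. */
    static bool write_needs_bus_invalidate(const CacheLineMeta *line) {
        return line->valid && line->shared;
    }

    int main(void) {
        CacheLineMeta line = { .tag = 0x1234, .valid = true,
                               .dirty = false, .shared = false };
        printf("invalidate needed: %d\n",
               write_needs_bus_invalidate(&line));   /* prints 0 */
        return 0;
    }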