Title: Multiprocessors and Thread-Level Parallelism
Chapter 6
- Multiprocessors and Thread-Level Parallelism
6.1 Introduction
- Thread
  - A set of instructions that can be executed on a different processor
  - Example) a process, an iteration of a loop
  - Needs to be identified/generated by a compiler or programmer
  - Size: from hundreds to millions of instructions
- Thread-level parallelism
  - Run different threads on different processors
  - N threads → run N processors in parallel
  - To fully utilize a machine with N processors, a compiler or programmer needs to identify N threads
- Difference from instruction-level parallelism (ILP)
  - Thread-level parallelism is large-grain (also called coarse-grain) parallelism; ILP is small-grain (also called fine-grain) parallelism
  - Thread-level parallelism is identified by the programmer or compiler; ILP is mainly identified by HW
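
To make this concrete, the sketch below (not from the slides; the array size and thread count are illustrative choices) splits the iterations of a loop across N POSIX threads, so each thread can run on a different processor:

    /* Programmer-identified thread-level parallelism: loop iterations
       split across N threads. Compile with: cc -pthread tlp.c */
    #include <pthread.h>
    #include <stdio.h>

    #define N   4        /* threads, ideally one per processor */
    #define LEN 1000000

    static double a[LEN], b[LEN], c[LEN];

    static void *worker(void *arg) {
        long id = (long)arg;              /* thread index 0..N-1 */
        long chunk = LEN / N;
        long lo = id * chunk;
        long hi = (id == N - 1) ? LEN : lo + chunk;
        for (long i = lo; i < hi; i++)    /* each thread owns a disjoint slice */
            c[i] = a[i] + b[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[N];
        for (long id = 0; id < N; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int i = 0; i < N; i++)
            pthread_join(t[i], NULL);
        printf("c[0] = %f\n", c[0]);
        return 0;
    }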
A Taxonomy of Parallel Architectures
6.1 Introduction
- Flynn's categorization
  - SISD (single instruction stream, single data stream): uniprocessor
  - SIMD (single instruction stream, multiple data streams): instructions for multimedia extensions
  - MISD (multiple instruction streams, single data stream): no commercial multiprocessor, some special-purpose stream processors
  - MIMD (multiple instruction streams, multiple data streams): general form of parallel processors
- MIMD architectures
  - Centralized shared-memory architecture
    - → Fig. 6.1
    - Multiple processors share a single centralized memory
    - Simplest architecture
    - Normally a single-bus-based architecture
      - But multiple buses or a switch can also be used
    - At most a few dozen processors
      - Because scaling is difficult (the shared memory can become a bottleneck)
    - Most popular architecture
    - Also called symmetric (shared-memory) multiprocessors (SMP) or uniform memory access (UMA) machines
A Taxonomy of Parallel Architectures
6.1 Introduction
- MIMD architectures (continued)
  - Physically distributed memory
    - → Fig. 6.2
    - Advantages
      - Fast local memory access
      - Easy to scale up memory bandwidth in a cost-effective way
      - Assumption: most accesses are to the local memory in the node
    - Disadvantage
      - Slow remote memory access (communication between processors)
    - Variation: each node consists of multiple processors (i.e., an SMP)
Models for Communication and Memory Architecture
6.1 Introduction
- This discussion is for physically distributed memory systems
- Distributed shared-memory (DSM) architecture
  - Physically distributed but logically shared address space
    - A single address space over physically distributed memory
    - The same address on two processors refers to the same location
  - Interprocessor communication: load/store instructions through the shared memory
  - Also called nonuniform memory access (NUMA) architecture
- Message-passing multiprocessors
  - Each processor has its own private address space
    - The same address on two processors refers to different locations
  - Can be considered as multiple independent computers connected through a network → multicomputer, cluster
  - Interprocessor communication: explicit calls to a message-passing interface
    - MPI (Message Passing Interface): the most popular standard library for message passing (see the sketch below)
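
As a minimal illustration of the explicit-communication style (a sketch, not an example from the slides), a two-process MPI program in C could look like this; rank 0 sends a value from its private address space to rank 1:

    /* Two-process message passing. Compile with mpicc, run with
       mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;   /* data lives only in rank 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }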
Models for Communication and Memory Architecture
6.1 Introduction
- Performance metrics for communication mechanisms
  - Communication bandwidth
  - Communication latency
  - Communication latency hiding
    - Overlapping communication with computation or with other communication
- Advantages of the different communication mechanisms
  - Shared-memory communication
    - Advantages
      - The shared-memory model is a familiar programming environment
        - Ease of programming (and of compiler design)
      - Compatible with centralized-memory systems
        - OpenMP: a standardized programming interface for shared-memory multiprocessors (see the sketch below)
      - Easy to focus optimization effort on the critical parts
      - Lower communication overhead (normally lower latency)
        - Cache management may be automatically supported by HW
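
A minimal OpenMP sketch (illustrative, not from the slides) of why shared memory eases programming: a single directive parallelizes the loop, and communication happens implicitly through the shared arrays.

    /* Compile with, e.g., gcc -fopenmp omp_add.c */
    #include <omp.h>
    #include <stdio.h>

    #define LEN 1000000
    static double a[LEN], b[LEN], c[LEN];

    int main(void) {
        #pragma omp parallel for    /* iterations divided among threads */
        for (long i = 0; i < LEN; i++)
            c[i] = a[i] + b[i];
        printf("threads available: %d\n", omp_get_max_threads());
        return 0;
    }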
Models for Communication and Memory Architecture
6.1 Introduction
- Advantages of the different communication mechanisms (continued)
  - Message-passing communication
    - Advantages
      - Simple and scalable hardware implementation
      - Explicit communication
        - More observable and controllable communication
        - Can explicitly focus on a specific part
        - Fewer synchronization errors
        - Performance advantage from explicitly handling communication
    - Analogy: shared-memory communication is like high-level programming, while message passing is like assembly programming
Challenges of Parallel Processing
6.1 Introduction
- Limited parallelism
  - Ex) You want to achieve a speedup of 80 with 100 processors
    - Speedup = 1 / (Fraction_enhanced/Speedup_enhanced + (1 - Fraction_enhanced))
    - With parallelization as the enhancement: Speedup = 1 / (Fraction_parallel/Speedup_parallel + (1 - Fraction_parallel))
    - 80 = 1 / (Fraction_parallel/100 + (1 - Fraction_parallel))
    - Solving gives Fraction_parallel = 0.9975
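
Rearranging the equation gives Fraction_parallel = (1 - 1/Speedup) / (1 - 1/N). The tiny C check below (an illustration, not from the slides) confirms the 0.9975 figure:

    #include <stdio.h>

    int main(void) {
        double speedup = 80.0, n = 100.0;
        /* f = (1 - 1/S) / (1 - 1/N), from Amdahl's law solved for f */
        double f = (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n);
        printf("required parallel fraction = %.4f\n", f);  /* 0.9975 */
        return 0;
    }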
- Large latency of remote access in a parallel processor
  - → Figure 6.3
- Summary: SMP, DSM, UMA, NUMA, message passing, MPP (massively parallel processors)
6.3 Symmetric Shared-Memory Architectures
- Private data: used by a single processor
- Shared data: used by multiple processors
  - Caching of shared data causes a cache coherence problem
  - → Fig. 6.7
- Cache coherence problem
  - Informal definition: a memory system is coherent if any read of a data item returns the most recently written value of that data item → too vague
  - A memory system is coherent if
    - A read by a processor P to a location X that follows a write by P to X, with no writes to X by another processor occurring between the write and the read by P, always returns the value written by P
    - A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
      - Exactly when? → the memory consistency problem, Section 6.8
    - Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, a processor can never read the value of the location as 2 and then later read it as 1
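
The write-serialization property can be observed from software through C11 atomics, which expose the same per-location guarantee to the programmer (a sketch under that assumption, not an example from the text): whichever order the hardware serializes the two stores into, a reader's successive loads can never see that order reversed.

    /* Compile with: cc -std=c11 -pthread coherence.c */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int x;   /* the shared location X */

    static void *writer1(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        return NULL;
    }

    static void *writer2(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 2, memory_order_relaxed);
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        int first  = atomic_load_explicit(&x, memory_order_relaxed);
        int second = atomic_load_explicit(&x, memory_order_relaxed);
        /* Coherence: if the write of 2 overwrote the write of 1, then
           (first == 2 && second == 1) is impossible, and vice versa. */
        printf("reader saw %d then %d\n", first, second);
        return NULL;
    }

    int main(void) {
        pthread_t t[3];
        pthread_create(&t[0], NULL, writer1, NULL);
        pthread_create(&t[1], NULL, writer2, NULL);
        pthread_create(&t[2], NULL, reader, NULL);
        for (int i = 0; i < 3; i++)
            pthread_join(t[i], NULL);
        return 0;
    }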
Basic Schemes for Enforcing Coherence
6.3 Symmetric Shared-Memory Architectures
- Cache coherence protocol
  - Maintains coherence for multiple processors
  - Tracks the sharing state of every data block
- Two approaches
  - Directory based: a single central data structure stores the sharing status
  - Snooping: every cache keeps the sharing status of the blocks it holds
    - No centralized data structure
    - Usually implemented on the shared-memory bus
An Example Protocol
6.3 Symmetric Shared-Memory Architectures
- Write invalidate protocol for a write-back cache
  - → Fig. 6.10, Fig. 6.11, Fig. 6.12
  - In the left half of Fig. 6.11
    - All states needed for a write-back uniprocessor cache: invalid, valid (shared), dirty (exclusive)
    - All arcs needed for a write-back uniprocessor cache, except that a write hit on the shared state generates a write miss
  - Simplification
    - No distinction between a write hit and a write miss to a shared cache block
      - Place a write miss on the bus
      - Memory will supply the data
      - Any processor with a copy of the cache block invalidates it
  - A more complicated protocol distinguishes a write hit from a write miss
    - A write hit need not fetch data from memory → no data movement is necessary → just a status change → addition of a write invalidate transaction
    - Addition of a write invalidate transaction to the exclusive state
      - Eg) Write miss on an invalid block while another cache holds it in the exclusive state
        - Just invalidate that cache block; no data write-back
        - Just a change of ownership
        - This may depend on the size of the write (full-cache-block write or partial write)
  - Additional state: a clean-and-private state
    - No need to generate a bus transaction on a write to such a block
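
A software sketch of this three-state machine (a simplified illustration assuming the states and bus actions named above; real protocols live in the cache controller hardware, not in software):

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
    typedef enum { NO_BUS_OP, BUS_READ_MISS, BUS_WRITE_MISS,
                   BUS_WRITEBACK } BusOp;

    /* CPU-side transition for a read or write by the local processor. */
    static BusOp cpu_access(BlockState *s, int is_write) {
        switch (*s) {
        case INVALID:
            *s = is_write ? EXCLUSIVE : SHARED;
            return is_write ? BUS_WRITE_MISS : BUS_READ_MISS;
        case SHARED:
            if (is_write) {               /* simplified protocol: a write */
                *s = EXCLUSIVE;           /* hit also posts a write miss  */
                return BUS_WRITE_MISS;
            }
            return NO_BUS_OP;             /* read hit */
        case EXCLUSIVE:
            return NO_BUS_OP;             /* read or write hit on dirty block */
        }
        return NO_BUS_OP;
    }

    /* Snoop-side transition when another processor's miss is on the bus. */
    static BusOp snoop(BlockState *s, BusOp observed) {
        if (*s == EXCLUSIVE) {            /* we own the only dirty copy */
            *s = (observed == BUS_READ_MISS) ? SHARED : INVALID;
            return BUS_WRITEBACK;         /* supply/flush the dirty block */
        }
        if (*s == SHARED && observed == BUS_WRITE_MISS)
            *s = INVALID;                 /* invalidate our clean copy */
        return NO_BUS_OP;
    }

    int main(void) {
        BlockState p0 = INVALID, p1 = INVALID;
        cpu_access(&p0, 1);               /* P0 writes: invalid -> exclusive */
        snoop(&p1, BUS_WRITE_MISS);       /* P1 snoops P0's write miss */
        cpu_access(&p1, 0);               /* P1 reads: read miss on bus */
        snoop(&p0, BUS_READ_MISS);        /* P0 writes back, goes to shared */
        printf("P0=%d P1=%d\n", p0, p1);  /* both end up shared (1 1) */
        return 0;
    }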
An Example Protocol
6.3 Symmetric Shared-Memory Architectures
- Protocol assumption: operations are atomic, i.e., no intervening operation can occur
  - Eg) Atomic operation: detecting the write miss, acquiring the bus, and receiving a response as one indivisible step
  - Non-atomic operations may cause deadlock
- Microprocessors support cache coherence protocols
  - Ex) For the Pentium IV, 4 processors can be directly connected to a shared bus
Snooping Protocols
6.3 Symmetric Shared-Memory Architectures
- Two approaches
  - Write invalidate protocol
    - A processor obtains exclusive access to a data item before it writes that item
    - → Fig. 6.8
  - Write update protocol (or write broadcast protocol)
    - Update all the cached copies of a data item when that item is written
    - → Fig. 6.9
- Performance differences (see the traffic sketch below)
  - Multiple writes to the same item
    - Write invalidate: invalidate just once
    - Write update: update every time
  - Multiple writes to different words in the same cache block
    - Write invalidate: invalidate just once
    - Write update: update for every word
  - Delay between a write on one processor and a read on another processor
    - The write update scheme is faster
  - Write invalidate: less bus and memory traffic → more popular
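
A toy count of the bus traffic for the second case above (hypothetical numbers, only to make the comparison concrete): one processor writes each of 4 words in a block 10 times while another processor holds a copy.

    #include <stdio.h>

    int main(void) {
        int words = 4, writes_per_word = 10;
        int invalidate_traffic = 1;                   /* one invalidation,
                                                         then local hits */
        int update_traffic = words * writes_per_word; /* one bus update
                                                         per write */
        printf("invalidate: %d bus ops, update: %d bus ops\n",
               invalidate_traffic, update_traffic);
        return 0;
    }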
Basic Implementation Techniques
6.3 Symmetric Shared-Memory Architectures
- Write-through vs. write-back
  - Where the up-to-date data lives on a cache miss
    - Write-through cache: always in the memory
    - Write-back cache: either in the memory or in another processor's cache
      - The processor broadcasts the address to be read on the bus
      - If a processor has the dirty copy of the data, it sends the data block
  - Note: a write-back cache requires less memory traffic → preferred for multiprocessor systems
Basic Implementation Techniques
6.3 Symmetric Shared-Memory Architectures
- Data structure for a cache coherence protocol (see the struct sketch below)
  - Normal cache structure needed: cache tag, valid bit, dirty bit
  - Shared bit: indicates whether the block is shared by other processors
    - Write invalidation is not necessary for non-shared data
  - Owner of a cache block: the processor that exclusively contains the sole copy of the data block
- Every bus transaction must check the cache tags → may interfere with CPU cache accesses
  - Solution: duplicate tags
  - Solution: a multilevel cache with the inclusion property
    - Snooping checks the 2nd-level cache while the CPU accesses the 1st-level cache
    - Adopted in many designs
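
A minimal sketch of that per-block bookkeeping (field names and widths are illustrative assumptions, not from the text):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        uint64_t tag;      /* address tag for the block */
        bool valid;        /* block holds live data */
        bool dirty;        /* block modified; memory copy is stale */
        bool shared;       /* another processor may hold a copy */
    } CacheLineMeta;

    /* A write to a non-shared (private) block needs no bus invalidation. */
    static bool write_needs_bus_invalidate(const CacheLineMeta *line) {
        return line->valid && line->shared;
    }

    int main(void) {
        CacheLineMeta line = { .tag = 0x1234, .valid = true,
                               .dirty = false, .shared = false };
        printf("invalidate needed: %d\n",
               write_needs_bus_invalidate(&line));   /* prints 0 */
        return 0;
    }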