Title: Shared Memory Multiprocessors
1. Shared Memory Multiprocessors
- Avinash Karanth Kodi
- Department of Electrical and Computer Engineering
- University of Arizona, Tucson, AZ 85721
- E-mail: louri@ece.arizona.edu
- ECE 568 Introduction to Parallel Processing
2. What is a Multiprocessor?
- A collection of communicating processors
- Goals: balance load, reduce inherent communication and extra work
- A multi-cache, multi-memory system
- Role of these components is essential regardless of programming model
- Programming model and communication abstraction affect specific performance tradeoffs
...
3. Natural Extensions of Memory System
[Figure: three memory-system organizations for processors P1...Pn behind a switch - a shared first-level cache; centralized "dance hall" memory (UMA) with interleaved first-level caches and interleaved main memory; and distributed memory (NUMA)]
4. Bus-based Symmetric Multiprocessors (SMPs)
- Dominate the server market ($60 billion market)
- Attractive as throughput servers and for parallel programs
- Fine-grain resource sharing
- Uniform access via loads/stores
- Automatic data movement and coherent replication in caches
- Cheap and powerful extension
- Normal uniprocessor mechanisms to access data
- Key is extension of the memory hierarchy to support multiple processors
5. Caches are Critical for Performance
- Reduce average latency
- automatic replication closer to the processor
- Reduce average bandwidth
- Data is logically transferred from producer to consumer through memory
- store: reg --> mem
- load: reg <-- mem
- Many processors can share data efficiently
- What happens when store and load are executed on different processors?
6. Cache Coherence Problem in SMPs
[Figure: processors P1, P2, P3 each caching a copy of location u (value 5) above a shared memory]
Replicas in the caches of multiple processors in an SMP have to be updated or kept coherent.
7. Cache Coherence Problem
- Caches play a key role in all cases
- Reduce average data access time
- Reduce bandwidth demands placed on the shared interconnect
- Private processor caches create a problem
- Copies of a variable can be present in multiple caches
- A write by one processor may not become visible to others
- They'll keep accessing the stale value in their caches
- Cache coherence problem
- Arises with data sharing, I/O operations, process migration
- What do we do about it?
- Organize the memory hierarchy to make it go away
- Detect and take actions to eliminate the problem
8. Intuitive Memory Model and Coherence Protocols
- Reading an address should return the last value written to that address
- Easy in uniprocessors
- except for I/O
- The cache coherence problem in MPs is more pervasive and more performance critical
- 2 ways of maintaining cache coherence
- Invalidate-based protocols: invalidate replicas if a processor wants to write to a location
- Write-update protocols: update replicas with the written value
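The behavioral difference between the two protocol classes can be sketched in a few lines of Python. This is a toy model only; the function names and the way caches are represented are illustrative, not taken from any real controller.

```python
# Toy model contrasting write-invalidate and write-update protocols.
# Each "cache" is a dict mapping address -> value.

def write_invalidate(caches, writer, addr, value):
    """Writer keeps the new value; every other replica is invalidated."""
    for i, cache in enumerate(caches):
        if i == writer:
            cache[addr] = value
        else:
            cache.pop(addr, None)   # drop the stale replica

def write_update(caches, writer, addr, value):
    """Every cache that holds a replica receives the new value."""
    caches[writer][addr] = value
    for i, cache in enumerate(caches):
        if i != writer and addr in cache:
            cache[addr] = value

caches = [{'u': 5}, {'u': 5}, {}]          # P1, P2 hold u; P3 does not
write_invalidate(caches, 0, 'u', 7)
assert caches == [{'u': 7}, {}, {}]        # P2's replica invalidated

caches = [{'u': 5}, {'u': 5}, {}]
write_update(caches, 0, 'u', 7)
assert caches == [{'u': 7}, {'u': 7}, {}]  # P2's replica updated
```

Invalidation trades a later re-fetch by readers against cheaper writes; update pushes every write to all sharers, which wastes bandwidth when sharers never read the value again.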
9. Definition of a Cache Coherent System
- A multiprocessor system is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical total order of all memory accesses that is consistent with the results of the execution
- A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
- A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
- Writes to the same location are serialized, i.e. two writes to the same memory location by any 2 processors are seen in the same order by ALL processors
10. Cache Coherence Properties
Key properties:
- Write Propagation: writes by any processor should become visible to all other processors
- Write Serialization: all writes (from the same or different processors) are seen in the same order by all processors
2 classes of protocols:
- Snoopy protocols: for bus-based systems (SMPs)
- Directory-based protocols: for large-scale multiprocessors (point-to-point interconnects)
11. Definition of Snoopy Protocol
- A snooping protocol is a distributed algorithm represented by a collection of co-operating finite state machines. It is specified by the following components:
- the set of states associated with memory blocks in the local caches
- the state transition diagram, with the following input symbols:
- - Processor Requests
- - Bus Transactions
- the actions associated with each state transition
- The different states are co-ordinated by the bus transactions
12. Bus-based Snoopy Protocol
- Bus is a broadcast medium; caches know what they have
- Cache controller snoops all transactions on the shared bus
- a transaction is relevant if it is for a block the cache contains
- take action to ensure coherence
- invalidate, update, or supply value
- depends on the state of the block and the protocol
13. Example: Write-Through Invalidate
[Figure: processors P1, P2, P3 with private caches on a shared bus, together with memory and I/O devices]
- Cache controllers can snoop on the bus
- All bus transactions are visible to all cache controllers
- All controllers see the transactions in the same order
- Controllers can take action if the bus transaction is relevant, i.e. involves a memory block in its cache
- Coherence is maintained at the granularity of a cache block
14. Architectural Building Blocks
- Invalidation protocols: invalidate replicas if a processor writes a location
- Update protocols: update replicas with the written value
- Based on:
- Bus transactions with 3 phases
- - Bus arbitration
- - Command and address transmission
- - Data transfer
- FSM state transitions for a cache block
- - State information (e.g. invalid, valid, dirty) is available for blocks in a cache
- - State information for uncached blocks is implicitly defined (e.g. invalid or not present)
15. Design Choices
- Controller updates the state of blocks in response to processor and snoop events and generates bus transactions
- Snoopy protocol
- set of states
- state-transition diagram
- actions
- Basic choices
- Write-through vs. write-back
- Invalidate vs. update
16. Write-through Invalidate Protocol
- Two states per block in each cache
- as in a uniprocessor
- state of a block is a p-vector of states
- Hardware state bits associated with blocks that are in the cache
- other blocks can be seen as being in the invalid (not-present) state in that cache
- Writes invalidate all other caches
- can have multiple simultaneous readers of a block, but a write invalidates them
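A minimal sketch of this two-state protocol in Python. The Valid/Invalid states follow the slide; the class and method names are invented for illustration and omit real concerns such as bus arbitration and block granularity.

```python
# Two-state (Valid/Invalid) write-through invalidate protocol, as a toy model.

INVALID, VALID = "I", "V"

class Bus:
    def __init__(self):
        self.caches = []
    def attach(self, cache):
        self.caches.append(cache)
    def broadcast_write(self, writer, block):
        for c in self.caches:
            if c is not writer:
                c.snoop_write(block)

class Cache:
    def __init__(self, bus):
        self.state = {}                       # block -> state
        self.data = {}
        self.bus = bus
        bus.attach(self)

    def read(self, block, memory):
        if self.state.get(block, INVALID) == INVALID:
            self.data[block] = memory[block]  # read miss: fetch from memory
            self.state[block] = VALID
        return self.data[block]

    def write(self, block, value, memory):
        memory[block] = value                 # write-through: memory always current
        self.data[block] = value
        self.state[block] = VALID
        self.bus.broadcast_write(self, block) # others snoop and invalidate

    def snoop_write(self, block):
        self.state[block] = INVALID           # drop the stale replica

memory = {"u": 5}
bus = Bus()
p1, p2 = Cache(bus), Cache(bus)
assert p1.read("u", memory) == 5 and p2.read("u", memory) == 5
p1.write("u", 7, memory)
assert memory["u"] == 7
assert p2.state["u"] == INVALID               # P2 must re-fetch
assert p2.read("u", memory) == 7              # the miss brings the new value
```

Because every write goes straight to memory, memory is always up to date and snooping caches only ever need to invalidate, never supply data.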
17. MSI Protocol (1/3)
- State machine for CPU requests, for each cache block
- Invalid --(CPU read: place read miss on bus)--> Shared (read only)
- Invalid --(CPU write: place write miss on bus)--> Modified (read/write)
- Shared: CPU read hit stays in Shared
- Shared --(CPU read miss: place read miss on bus)--> Shared
- Shared --(CPU write: place write miss on bus)--> Modified
- Modified: CPU read hit and CPU write hit stay in Modified
- Modified --(CPU read miss: write back block, place read miss on bus)--> Shared
- Modified --(CPU write miss: write back cache block, place write miss on bus)--> Modified
18. MSI Protocol (2/3)
- State machine for bus requests, for each cache block
- Shared --(write miss for this block)--> Invalid
- Modified --(write miss for this block: write back block, abort memory access)--> Invalid
- Modified --(read miss for this block: write back block, abort memory access)--> Shared
19. MSI Protocol (3/3)
- Combined state machine for CPU requests and bus requests, for each cache block
- Invalid --(CPU read: place read miss on bus)--> Shared (read only)
- Invalid --(CPU write: place write miss on bus)--> Modified (read/write)
- Shared: CPU read hit stays in Shared; CPU read miss places a read miss on the bus
- Shared --(CPU write: place write miss on bus)--> Modified
- Shared --(write miss for this block)--> Invalid
- Modified: CPU read hit and CPU write hit stay in Modified
- Modified --(CPU read miss: write back block, place read miss on bus)--> Shared
- Modified --(CPU write miss: write back cache block, place write miss on bus)--> Modified
- Modified --(read miss for this block: write back block, abort memory access)--> Shared
- Modified --(write miss for this block: write back block, abort memory access)--> Invalid
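The transitions above can be captured as a table-driven finite state machine. The sketch below is a hand transcription of the diagrams; the event names are illustrative, not from any particular implementation.

```python
# Table-driven MSI state machine: (state, event) -> (next_state, actions).
# Entries not listed leave the state unchanged with no action.

MSI = {
    ("I", "CPU_read"):       ("S", ["place read miss on bus"]),
    ("I", "CPU_write"):      ("M", ["place write miss on bus"]),
    ("S", "CPU_read_miss"):  ("S", ["place read miss on bus"]),
    ("S", "CPU_write"):      ("M", ["place write miss on bus"]),
    ("S", "Bus_write_miss"): ("I", []),
    ("M", "CPU_read_miss"):  ("S", ["write back block", "place read miss on bus"]),
    ("M", "CPU_write_miss"): ("M", ["write back block", "place write miss on bus"]),
    ("M", "Bus_read_miss"):  ("S", ["write back block"]),
    ("M", "Bus_write_miss"): ("I", ["write back block"]),
}

def step(state, event):
    # e.g. a CPU read hit in S or M is an unlisted pair: stay put, do nothing
    return MSI.get((state, event), (state, []))

state, actions = step("I", "CPU_write")
assert state == "M" and actions == ["place write miss on bus"]
state, actions = step("M", "Bus_read_miss")
assert state == "S" and actions == ["write back block"]
state, actions = step("M", "CPU_read")        # read hit: no transition
assert state == "M" and actions == []
```

A real controller drives these actions onto the bus and must also handle races between its processor side and its snoop side; the table only captures the protocol's logical transitions.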
20. Example
[Table: per-step bus transactions and the resulting cache states of Processor 1 and Processor 2 and the memory contents]
Assumes the initial cache state is invalid and A1 and A2 map to the same cache block, but A1 ≠ A2
21. Example: Step 1
22. Example: Step 2
23. Example: Step 3
24. Example: Step 4
25. Example: Step 5
26. MESI Writeback Invalidation Protocol
States:
- Invalid (I)
- Shared (S): one or more copies
- Exclusive (E): one copy only
- Dirty or Modified (M): one copy only
Processor events:
- PrRd (read)
- PrWr (write)
Bus transactions:
- BusRd: asks for a copy with no intent to modify
- BusRdX: asks for a copy with intent to modify
- BusWB: updates memory
Actions:
- Update state, perform bus transaction, flush value onto bus
27. 4-State MESI Protocol
- Invalid (I)
- Exclusive (E)
- Shared (S)
- Modified (M)
28. Setup for Memory Consistency
- Coherence => writes to a location become visible to all in the same order
- But when does a write become visible?
- How do we establish orders between a write and a read by different processors?
- use event synchronization
- typically using more than one location!
29. Requirements for Memory Consistency (1/3)
Clearly, we need something more than coherence to give a shared address space a clear semantics, i.e. an ordering model that programmers can use to reason about possible results and hence the correctness of their programs.
30. Requirements for Memory Consistency (2/3)
- Determines the total order such that
- It gives the same result
- Operations by any particular process occur in the order they were issued
- The value returned by each read operation is the value written by the last write operation to that location in the total order
- The coherence protocol defines properties only for accesses to a single location
- Programs need, in addition, guaranteed properties for accesses to multiple locations
31. Requirements for Memory Consistency (3/3)
A memory consistency model for a shared address
space specifies constraints on the order in which
memory operations must appear to be performed
(i.e. to become visible to the processors) with
respect to one another. It includes operations
to the same location or to different
locations. Therefore, it subsumes coherence.
32. Sequential Consistency (1/3)
Definition (Lamport, 1979): A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program. Two constraints: program order and atomicity of memory operations.
33. Sequential Consistency (2/3)
- Program Order
- - Memory operations of a process must appear to become visible - to itself and others - in program order
- Write Atomicity
- - Maintain a single sequential order among all operations to all memory locations
[Figure: processors P0, P1, ..., Pn all accessing a single shared memory]
34. Sequential Consistency (3/3)
Result (A, B) = (1, 0): allowed under SC
(A, B) = (0, 2): NOT ALLOWED under SC
- SC does not ensure mutual exclusion; synchronization primitives are required
35. Base Cache Coherence Design
- Single-level write-back cache
- Invalidation protocol
- One outstanding memory request per processor
- Atomic memory bus transactions
- For BusRd, BusRdX: no intervening transactions allowed on the bus between issuing the address and receiving the data
- BusWB: address and data simultaneous, and sinked by the memory system before any new bus request
- Atomic operations within a process: one finishes before the next in program order starts
36. Cache Controller and Tags
- Cache controller is responsible for parts of a memory operation
- Uniprocessor, on a miss:
- - Assert request for bus
- - Wait for bus grant
- - Drive address and command lines
- - Wait for command to be accepted by the relevant device
- - Transfer data
- In a snoop-based multiprocessor, the cache controller must monitor both the bus and the processor
- Can view it as two controllers: bus-side and processor-side
- With a single-level cache: dual tags or dual-ported tag RAM
- Responds to bus transactions when necessary
37. Reporting Snoop Results: How?
- Collective response from caches must appear on the bus
- Example: in the MESI protocol, need to know
- - Is the block dirty, i.e. should memory respond or not?
- - Is the block shared, i.e. transition to E or S state on a read miss?
- Three wired-OR signals
- - Shared: asserted if any cache has a copy
- - Dirty: asserted if some cache has a dirty copy
- - - needn't know which, since it will do what's necessary
- - Snoop-valid: asserted when OK to check the other two signals
- - - actually inhibit until OK to check
- Illinois MESI requires a priority scheme for cache-to-cache transfers
- Which cache should supply the data when in shared state?
- Commercial implementations allow memory to provide the data
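The wired-OR lines can be modeled as a simple OR over what each controller would drive. This is a behavioral sketch; the function name and the dict-based cache model are illustrative.

```python
# Sketch of the Shared and Dirty wired-OR snoop-result lines.
# Each cache is a dict mapping block -> 'S' (clean copy) or 'M' (dirty copy).

def snoop_result(cache_states, block):
    """Return the (shared, dirty) wired-OR signals for a snooped block."""
    shared = any(block in c for c in cache_states)
    dirty = any(c.get(block) == "M" for c in cache_states)
    return shared, dirty

# Three caches: one clean copy of x, one dirty copy of y
caches = [{"x": "S"}, {"y": "M"}, {}]
assert snoop_result(caches, "x") == (True, False)   # memory should respond
assert snoop_result(caches, "y") == (True, True)    # the owning cache responds
assert snoop_result(caches, "z") == (False, False)  # no cache has it
```

The wired-OR is what makes the aggregation cheap in hardware: every controller drives its own contribution, and the bus line itself computes the OR without any central collector.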
38. Reporting Snoop Results: When?
As soon as possible; memory needs to know what to do. If none of the caches has a dirty copy, memory has to fetch the data. Three options:
- Fixed number of clocks from the address appearing on the bus
- - Dual tags required to reduce contention with the processor
- - Still must be conservative; the processor blocks access to tag memory on E -> M
- Variable delay
- - Memory assumes a cache will supply the data until all say "sorry"
- - Less conservative, more flexible, more complex
- - Memory can fetch the data and hold it just in case (SGI Challenge)
- Immediately: bit-per-block in memory
- - Main memory maintains a bit per block that indicates whether the block is modified in one of the caches
- - Extra hardware complexity in the commodity main memory system
39. Multi-level Cache Hierarchies
- How to snoop with multi-level caches?
- - independent bus snooping at every level (additional hardware: snooper, pins, duplication of tags)
- - maintain cache inclusion
- Requirements for inclusion
- - data in the higher-level cache is a subset of data in the lower-level cache
- - modified in higher level => marked modified in lower level
- Need to snoop only the lowest-level cache
- - If L2 says not present (modified), then not so in L1 either
- - If a BusRd is seen to a block that is modified in L1, L2 itself knows this
- Inclusion is not always automatically preserved
40. Violations of Inclusion
The two caches (L1, L2) may choose to replace different blocks:
- Differences in reference history: set-associative first-level cache with LRU replacement
- Split higher-level caches: instruction and data blocks go in different caches at L1, but may collide in L2
- Differences in block size
But a common case works automatically: L1 direct-mapped, with fewer sets than L2, and the same block size
41. Enhancements Required to the Cache Protocol
- Explicitly maintaining the inclusion property
- Propagate bus transactions from L2 to L1
- Propagate flushes and invalidations
- Propagate modified state from L1 to L2 on writes
- L2 cache must be updated before a flush due to a bus transaction
- Write-through L1, or a modified-but-stale bit per block in L2
- Dual cache tags are less important: each cache acts as a filter for the other
42. Further Enhancements
- Split bus transactions into request and response sub-transactions
- Separate arbitration for each phase
- Other transactions may intervene
- Improves bandwidth dramatically
- Response is matched to its request
- Buffering between bus and cache controllers
- Use multiple buses (address and data separately)
- To separate the address and data portions of the transaction
43. Split Transaction Buses: Example
- Split-transaction buses
- Separate the address and data portions of the transaction
[Figure: timing diagram over cycles 1-6 showing the address bus, snoop line, and data bus being used by overlapping transactions]
44. Problems in Scaling SMPs: Starfire
Sun StarFire uses 4 address buses. For 13 or fewer system boards, the maximum data capacity is limited by the crossbar; beyond 13, it is limited by the snoop bandwidth.
[Figure: bandwidth at an 83.3-MHz clock (in MBps and bytes per clock) vs. number of system boards (0-16). Snooping capacity is flat at 10,667 MBps (128 bytes per clock), while data-crossbar capacity with random addresses grows with board count, so large configurations are snoop limited]
Courtesy of Alan Charlesworth, "STARFIRE: Extending the SMP Envelope", IEEE Micro, Volume 18, Issue 1, Jan-Feb 1998, pages 39-49
45. Bandwidth Scaling: Sun Interconnects
46. Distributed Shared Memory Multiprocessors
Distributed Memory (NUMA)
- Separate memory per processor
- Local or remote access via the memory controller
- 1 cache coherency solution: non-cached pages
- Alternative: a directory that tracks the state of every block in every cache
- Which caches have copies of the block, dirty vs. clean, ...
- Info per memory block vs. per cache block?
- PLUS: in memory => simpler protocol (centralized/one location)
- MINUS: in memory => directory is f(memory size) vs. f(cache size)
- Prevent the directory becoming a bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of their blocks
47. Directory Protocol
- Similar to snoopy protocol: three states
- Shared: >= 1 processors have the data, memory up-to-date
- Uncached: no processor has it; not valid in any cache
- Exclusive: 1 processor (owner) has the data; memory out-of-date
- In addition to cache state, must track which processors have the data when in the shared state (usually a bit vector, 1 if the processor has a copy)
- Keep it simple(r):
- Writes to non-exclusive data => write miss
- Processor blocks until access completes
- Assume messages are received and acted upon in the order sent
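A directory entry with a sharer bit vector can be sketched as follows. The class and method names are invented for illustration, and the sketch is deliberately incomplete: a real protocol also fetches the dirty copy from the exclusive owner and handles writebacks, which are omitted here.

```python
# Toy directory entry: per-block state plus a sharer bit vector
# (bit i set => processor i holds a copy).

UNCACHED, SHARED, EXCLUSIVE = "U", "S", "E"

class DirectoryEntry:
    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.state = UNCACHED
        self.sharers = 0                     # bit vector of caches with a copy

    def read_miss(self, proc):
        # memory stays (or becomes) up to date; requester joins the sharers
        self.state = SHARED
        self.sharers |= 1 << proc

    def write_miss(self, proc):
        # every other sharer must be invalidated; writer becomes the owner
        to_invalidate = [p for p in range(self.nprocs)
                         if (self.sharers >> p) & 1 and p != proc]
        self.state = EXCLUSIVE               # memory is now out of date
        self.sharers = 1 << proc
        return to_invalidate                 # invalidation messages to send

entry = DirectoryEntry(nprocs=4)
entry.read_miss(0)
entry.read_miss(2)
assert entry.state == SHARED and entry.sharers == 0b0101
assert entry.write_miss(1) == [0, 2]         # invalidations go to P0 and P2
assert entry.state == EXCLUSIVE and entry.sharers == 0b0010
```

The bit vector is what replaces bus snooping: instead of broadcasting, the home node sends invalidations only to the processors whose bits are set.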
48. Directory Protocol
- No bus, and we don't want to broadcast
- the interconnect is no longer a single arbitration point
- all messages have explicit responses
- Terms: typically 3 processors are involved
- Local node: where a request originates
- Home node: where the memory location of an address resides
- Remote node: has a copy of a cache block, whether exclusive or shared