Title: Shared Memory Multiprocessors
1. Shared Memory Multiprocessors
- Avinash Karanth Kodi
- Department of Electrical and Computer Engineering
- University of Arizona, Tucson, AZ 85721
- E-mail: louri@ece.arizona.edu
- ECE 568 Introduction to Parallel Processing
2. What is a Multiprocessor?
- A collection of communicating processors
- Goals: balance load, reduce inherent communication and extra work
- A multi-cache, multi-memory system
- Role of these components is essential regardless of programming model
- Programming model and communication abstraction affect specific performance tradeoffs
...
3. Natural Extensions of Memory System
[Figure: three memory-system organizations for processors P1...Pn behind a switch - a shared first-level cache; centralized "dance hall" memory (UMA) with interleaved first-level caches and interleaved main memory; and distributed memory (NUMA)]
4. Bus-based Symmetric Multiprocessors (SMPs)
- Dominate the server market ($60 billion market)
- Attractive as throughput servers and for parallel programs
- Fine-grain resource sharing
- Uniform access via loads/stores
- Automatic data movement and coherent replication in caches
- Cheap and powerful extension
- Normal uniprocessor mechanisms to access data
- Key is extension of the memory hierarchy to support multiple processors
5. Caches are Critical for Performance
- Reduce average latency
- automatic replication closer to the processor
- Reduce average bandwidth
- Data is logically transferred from producer to consumer through memory
- store: reg --> mem
- load: reg <-- mem
- Many processors can share data efficiently
- What happens when store and load are executed on different processors?
6. Cache Coherence Problem in SMPs
[Figure: processors P1, P2, P3 each caching a copy of location u (value 5) above a shared memory]
Replicas in the caches of multiple processors in an SMP have to be updated or kept coherent.
7. Cache Coherence Problem
- Caches play a key role in all cases
- Reduce average data access time
- Reduce bandwidth demands placed on the shared interconnect
- Private processor caches create a problem
- Copies of a variable can be present in multiple caches
- A write by one processor may not become visible to others
- They'll keep accessing the stale value in their caches
- Cache coherence problem
- Arises with data sharing, I/O operations, process migration
- What do we do about it?
- Organize the memory hierarchy to make it go away
- Detect and take actions to eliminate the problem
8. Intuitive Memory Model and Coherence Protocols
- Reading an address should return the last value written to that address
- Easy in uniprocessors
- except for I/O
- The cache coherence problem in MPs is more pervasive and more performance critical
- 2 ways of maintaining cache coherence
- Invalidate-based protocols: invalidate replicas if a processor wants to write to a location
- Write-update protocols: update replicas with the written value
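The behavioral difference between the two protocol classes can be sketched in a few lines of Python. This is a toy model only; the function names and the way caches are represented are illustrative, not taken from any real controller.

```python
# Toy model contrasting write-invalidate and write-update protocols.
# Each "cache" is a dict mapping address -> value.

def write_invalidate(caches, writer, addr, value):
    """Writer keeps the new value; every other replica is invalidated."""
    for i, cache in enumerate(caches):
        if i == writer:
            cache[addr] = value
        else:
            cache.pop(addr, None)   # drop the stale replica

def write_update(caches, writer, addr, value):
    """Every cache that holds a replica receives the new value."""
    caches[writer][addr] = value
    for i, cache in enumerate(caches):
        if i != writer and addr in cache:
            cache[addr] = value

caches = [{'u': 5}, {'u': 5}, {}]          # P1, P2 hold u; P3 does not
write_invalidate(caches, 0, 'u', 7)
assert caches == [{'u': 7}, {}, {}]        # P2's replica invalidated

caches = [{'u': 5}, {'u': 5}, {}]
write_update(caches, 0, 'u', 7)
assert caches == [{'u': 7}, {'u': 7}, {}]  # P2's replica updated
```

Invalidation trades a later re-fetch by readers against cheaper writes; update pushes every write to all sharers, which wastes bandwidth when sharers never read the value again.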
9. Definition of a Cache Coherent System
- A multiprocessor system is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical total order of all memory accesses that is consistent with the results of the execution
- A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
- A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
- Writes to the same location are serialized, i.e. two writes to the same memory location by any 2 processors are seen in the same order by ALL processors
10. Cache Coherence Properties
Key properties:
- Write Propagation: writes by any processor should become visible to all other processors
- Write Serialization: all writes (from the same or different processors) are seen in the same order by all processors
2 classes of protocols:
- Snoopy protocols: for bus-based systems (SMPs)
- Directory-based protocols: for large-scale multiprocessors (point-to-point interconnects)
11. Definition of Snoopy Protocol
- A snooping protocol is a distributed algorithm represented by a collection of co-operating finite state machines. It is specified by the following components:
- the set of states associated with memory blocks in the local caches
- the state transition diagram, with the following input symbols:
- - Processor Requests
- - Bus Transactions
- the actions associated with each state transition
- The different states are co-ordinated by the bus transactions
12. Bus-based Snoopy Protocol
- Bus is a broadcast medium; caches know what they have
- Cache controller snoops all transactions on the shared bus
- a transaction is relevant if it is for a block the cache contains
- take action to ensure coherence
- invalidate, update, or supply value
- depends on the state of the block and the protocol
13. Example: Write-Through Invalidate
[Figure: processors P1, P2, P3 with private caches on a shared bus, together with memory and I/O devices]
- Cache controllers can snoop on the bus
- All bus transactions are visible to all cache controllers
- All controllers see the transactions in the same order
- Controllers can take action if the bus transaction is relevant, i.e. involves a memory block in its cache
- Coherence is maintained at the granularity of a cache block
14. Architectural Building Blocks
- Invalidation protocols: invalidate replicas if a processor writes a location
- Update protocols: update replicas with the written value
- Based on:
- Bus transactions with 3 phases
- - Bus arbitration
- - Command and address transmission
- - Data transfer
- FSM state transitions for a cache block
- - State information (e.g. invalid, valid, dirty) is available for blocks in a cache
- - State information for uncached blocks is implicitly defined (e.g. invalid or not present)
15. Design Choices
- Controller updates the state of blocks in response to processor and snoop events and generates bus transactions
- Snoopy protocol
- set of states
- state-transition diagram
- actions
- Basic choices
- Write-through vs. write-back
- Invalidate vs. update
16. Write-through Invalidate Protocol
- Two states per block in each cache
- as in a uniprocessor
- state of a block is a p-vector of states
- Hardware state bits associated with blocks that are in the cache
- other blocks can be seen as being in the invalid (not-present) state in that cache
- Writes invalidate all other caches
- can have multiple simultaneous readers of a block, but a write invalidates them
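A minimal sketch of this two-state protocol in Python. The Valid/Invalid states follow the slide; the class and method names are invented for illustration and omit real concerns such as bus arbitration and block granularity.

```python
# Two-state (Valid/Invalid) write-through invalidate protocol, as a toy model.

INVALID, VALID = "I", "V"

class Bus:
    def __init__(self):
        self.caches = []
    def attach(self, cache):
        self.caches.append(cache)
    def broadcast_write(self, writer, block):
        for c in self.caches:
            if c is not writer:
                c.snoop_write(block)

class Cache:
    def __init__(self, bus):
        self.state = {}                       # block -> state
        self.data = {}
        self.bus = bus
        bus.attach(self)

    def read(self, block, memory):
        if self.state.get(block, INVALID) == INVALID:
            self.data[block] = memory[block]  # read miss: fetch from memory
            self.state[block] = VALID
        return self.data[block]

    def write(self, block, value, memory):
        memory[block] = value                 # write-through: memory always current
        self.data[block] = value
        self.state[block] = VALID
        self.bus.broadcast_write(self, block) # others snoop and invalidate

    def snoop_write(self, block):
        self.state[block] = INVALID           # drop the stale replica

memory = {"u": 5}
bus = Bus()
p1, p2 = Cache(bus), Cache(bus)
assert p1.read("u", memory) == 5 and p2.read("u", memory) == 5
p1.write("u", 7, memory)
assert memory["u"] == 7
assert p2.state["u"] == INVALID               # P2 must re-fetch
assert p2.read("u", memory) == 7              # the miss brings the new value
```

Because every write goes straight to memory, memory is always up to date and snooping caches only ever need to invalidate, never supply data.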
17. MSI Protocol (1/3)
- State machine for CPU requests, for each cache block
- Invalid --(CPU read: place read miss on bus)--> Shared (read only)
- Invalid --(CPU write: place write miss on bus)--> Modified (read/write)
- Shared: CPU read hit stays in Shared
- Shared --(CPU read miss: place read miss on bus)--> Shared
- Shared --(CPU write: place write miss on bus)--> Modified
- Modified: CPU read hit and CPU write hit stay in Modified
- Modified --(CPU read miss: write back block, place read miss on bus)--> Shared
- Modified --(CPU write miss: write back cache block, place write miss on bus)--> Modified
18. MSI Protocol (2/3)
- State machine for bus requests, for each cache block
- Shared --(write miss for this block)--> Invalid
- Modified --(write miss for this block: write back block, abort memory access)--> Invalid
- Modified --(read miss for this block: write back block, abort memory access)--> Shared
19. MSI Protocol (3/3)
- Combined state machine for CPU requests and bus requests, for each cache block
- Invalid --(CPU read: place read miss on bus)--> Shared (read only)
- Invalid --(CPU write: place write miss on bus)--> Modified (read/write)
- Shared: CPU read hit stays in Shared; CPU read miss places a read miss on the bus
- Shared --(CPU write: place write miss on bus)--> Modified
- Shared --(write miss for this block)--> Invalid
- Modified: CPU read hit and CPU write hit stay in Modified
- Modified --(CPU read miss: write back block, place read miss on bus)--> Shared
- Modified --(CPU write miss: write back cache block, place write miss on bus)--> Modified
- Modified --(read miss for this block: write back block, abort memory access)--> Shared
- Modified --(write miss for this block: write back block, abort memory access)--> Invalid
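The transitions above can be captured as a table-driven finite state machine. The sketch below is a hand transcription of the diagrams; the event names are illustrative, not from any particular implementation.

```python
# Table-driven MSI state machine: (state, event) -> (next_state, actions).
# Entries not listed leave the state unchanged with no action.

MSI = {
    ("I", "CPU_read"):       ("S", ["place read miss on bus"]),
    ("I", "CPU_write"):      ("M", ["place write miss on bus"]),
    ("S", "CPU_read_miss"):  ("S", ["place read miss on bus"]),
    ("S", "CPU_write"):      ("M", ["place write miss on bus"]),
    ("S", "Bus_write_miss"): ("I", []),
    ("M", "CPU_read_miss"):  ("S", ["write back block", "place read miss on bus"]),
    ("M", "CPU_write_miss"): ("M", ["write back block", "place write miss on bus"]),
    ("M", "Bus_read_miss"):  ("S", ["write back block"]),
    ("M", "Bus_write_miss"): ("I", ["write back block"]),
}

def step(state, event):
    # e.g. a CPU read hit in S or M is an unlisted pair: stay put, do nothing
    return MSI.get((state, event), (state, []))

state, actions = step("I", "CPU_write")
assert state == "M" and actions == ["place write miss on bus"]
state, actions = step("M", "Bus_read_miss")
assert state == "S" and actions == ["write back block"]
state, actions = step("M", "CPU_read")        # read hit: no transition
assert state == "M" and actions == []
```

A real controller drives these actions onto the bus and must also handle races between its processor side and its snoop side; the table only captures the protocol's logical transitions.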
20. Example
[Table: per-step bus transactions and the resulting cache states of Processor 1 and Processor 2 and the memory contents]
Assumes the initial cache state is invalid and A1 and A2 map to the same cache block, but A1 ≠ A2
21. Example: Step 1
22. Example: Step 2
23. Example: Step 3
24. Example: Step 4
25. Example: Step 5
26. MESI Writeback Invalidation Protocol
States:
- Invalid (I)
- Shared (S): one or more copies
- Exclusive (E): one copy only
- Dirty or Modified (M): one copy only
Processor events:
- PrRd (read)
- PrWr (write)
Bus transactions:
- BusRd: asks for a copy with no intent to modify
- BusRdX: asks for a copy with intent to modify
- BusWB: updates memory
Actions:
- Update state, perform bus transaction, flush value onto bus
27. 4-State MESI Protocol
- Invalid (I)
- Exclusive (E)
- Shared (S)
- Modified (M)
28. Setup for Memory Consistency
- Coherence => writes to a location become visible to all in the same order
- But when does a write become visible?
- How do we establish orders between a write and a read by different processors?
- use event synchronization
- typically using more than one location!
29. Requirements for Memory Consistency (1/3)
Clearly, we need something more than coherence to give a shared address space a clear semantics, i.e. an ordering model that programmers can use to reason about possible results and hence the correctness of their programs.
30. Requirements for Memory Consistency (2/3)
- Determines the total order such that
- It gives the same result
- Operations by any particular process occur in the order they were issued
- The value returned by each read operation is the value written by the last write operation to that location in the total order
- The coherence protocol defines properties only for accesses to a single location
- Programs need, in addition, guaranteed properties for accesses to multiple locations
31. Requirements for Memory Consistency (3/3)
A memory consistency model for a shared address
space specifies constraints on the order in which
memory operations must appear to be performed
(i.e. to become visible to the processors) with
respect to one another. It includes operations
to the same location or to different
locations. Therefore, it subsumes coherence.
32. Sequential Consistency (1/3)
Definition (Lamport, 1979): A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program. Two constraints: program order and atomicity of memory operations.
33. Sequential Consistency (2/3)
- Program Order
- - Memory operations of a process must appear to become visible - to itself and others - in program order
- Write Atomicity
- - Maintain a single sequential order among all operations to all memory locations
[Figure: processors P0, P1, ..., Pn all accessing a single shared memory]
34. Sequential Consistency (3/3)
Result (A, B) = (1, 0): allowed under SC
(A, B) = (0, 2): NOT ALLOWED under SC
- SC does not ensure mutual exclusion; synchronization primitives are required
35. Base Cache Coherence Design
- Single-level write-back cache
- Invalidation protocol
- One outstanding memory request per processor
- Atomic memory bus transactions
- For BusRd, BusRdX: no intervening transactions allowed on the bus between issuing the address and receiving the data
- BusWB: address and data simultaneous, and sinked by the memory system before any new bus request
- Atomic operations within a process: one finishes before the next in program order starts
36. Cache Controller and Tags
- Cache controller is responsible for parts of a memory operation
- Uniprocessor, on a miss:
- - Assert request for bus
- - Wait for bus grant
- - Drive address and command lines
- - Wait for command to be accepted by the relevant device
- - Transfer data
- In a snoop-based multiprocessor, the cache controller must monitor both the bus and the processor
- Can view it as two controllers: bus-side and processor-side
- With a single-level cache: dual tags or dual-ported tag RAM
- Responds to bus transactions when necessary
37. Reporting Snoop Results: How?
- Collective response from caches must appear on the bus
- Example: in the MESI protocol, need to know
- - Is the block dirty, i.e. should memory respond or not?
- - Is the block shared, i.e. transition to E or S state on a read miss?
- Three wired-OR signals
- - Shared: asserted if any cache has a copy
- - Dirty: asserted if some cache has a dirty copy
- - - needn't know which, since it will do what's necessary
- - Snoop-valid: asserted when OK to check the other two signals
- - - actually inhibit until OK to check
- Illinois MESI requires a priority scheme for cache-to-cache transfers
- Which cache should supply the data when in shared state?
- Commercial implementations allow memory to provide the data
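The wired-OR lines can be modeled as a simple OR over what each controller would drive. This is a behavioral sketch; the function name and the dict-based cache model are illustrative.

```python
# Sketch of the Shared and Dirty wired-OR snoop-result lines.
# Each cache is a dict mapping block -> 'S' (clean copy) or 'M' (dirty copy).

def snoop_result(cache_states, block):
    """Return the (shared, dirty) wired-OR signals for a snooped block."""
    shared = any(block in c for c in cache_states)
    dirty = any(c.get(block) == "M" for c in cache_states)
    return shared, dirty

# Three caches: one clean copy of x, one dirty copy of y
caches = [{"x": "S"}, {"y": "M"}, {}]
assert snoop_result(caches, "x") == (True, False)   # memory should respond
assert snoop_result(caches, "y") == (True, True)    # the owning cache responds
assert snoop_result(caches, "z") == (False, False)  # no cache has it
```

The wired-OR is what makes the aggregation cheap in hardware: every controller drives its own contribution, and the bus line itself computes the OR without any central collector.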
38. Reporting Snoop Results: When?
As soon as possible; memory needs to know what to do. If none of the caches has a dirty copy, memory has to fetch the data. Three options:
- Fixed number of clocks from the address appearing on the bus
- - Dual tags required to reduce contention with the processor
- - Still must be conservative; the processor blocks access to tag memory on E -> M
- Variable delay
- - Memory assumes a cache will supply the data until all say "sorry"
- - Less conservative, more flexible, more complex
- - Memory can fetch the data and hold it just in case (SGI Challenge)
- Immediately: bit-per-block in memory
- - Main memory maintains a bit per block that indicates whether the block is modified in one of the caches
- - Extra hardware complexity in the commodity main memory system
39. Multi-level Cache Hierarchies
- How to snoop with multi-level caches?
- - independent bus snooping at every level (additional hardware: snooper, pins, duplication of tags)
- - maintain cache inclusion
- Requirements for inclusion
- - data in the higher-level cache is a subset of data in the lower-level cache
- - modified in higher level => marked modified in lower level
- Need to snoop only the lowest-level cache
- - If L2 says not present (modified), then not so in L1 either
- - If a BusRd is seen to a block that is modified in L1, L2 itself knows this
- Inclusion is not always automatically preserved
40. Violations of Inclusion
The two caches (L1, L2) may choose to replace different blocks:
- Differences in reference history: set-associative first-level cache with LRU replacement
- Split higher-level caches: instruction and data blocks go in different caches at L1, but may collide in L2
- Differences in block size
But a common case works automatically: L1 direct-mapped, with fewer sets than L2, and the same block size
41. Enhancements Required to the Cache Protocol
- Explicitly maintaining the inclusion property
- Propagate bus transactions from L2 to L1
- Propagate flushes and invalidations
- Propagate modified state from L1 to L2 on writes
- L2 cache must be updated before a flush due to a bus transaction
- Write-through L1, or a modified-but-stale bit per block in L2
- Dual cache tags are less important: each cache acts as a filter for the other
42. Further Enhancements
- Split bus transactions into request and response sub-transactions
- Separate arbitration for each phase
- Other transactions may intervene
- Improves bandwidth dramatically
- Response is matched to its request
- Buffering between bus and cache controllers
- Use multiple buses (address and data separately)
- To separate the address and data portions of the transaction
43. Split Transaction Buses: Example
- Split-transaction buses
- Separate the address and data portions of the transaction
[Figure: timing diagram over cycles 1-6 showing the address bus, snoop line, and data bus being used by overlapping transactions]
44. Problems in Scaling SMPs: Starfire
Sun StarFire uses 4 address buses. For 13 or fewer system boards, the maximum data capacity is limited by the crossbar; beyond 13, it is limited by the snoop bandwidth.
[Figure: bandwidth at an 83.3-MHz clock (in MBps and bytes per clock) vs. number of system boards (0-16). Snooping capacity is flat at 10,667 MBps (128 bytes per clock), while data-crossbar capacity with random addresses grows with board count, so large configurations are snoop limited]
Courtesy of Alan Charlesworth, "STARFIRE: Extending the SMP Envelope", IEEE Micro, Volume 18, Issue 1, Jan-Feb 1998, pages 39-49
45. Bandwidth Scaling: Sun Interconnects
46. Distributed Shared Memory Multiprocessors
Distributed Memory (NUMA)
- Separate memory per processor
- Local or remote access via the memory controller
- 1 cache coherency solution: non-cached pages
- Alternative: a directory that tracks the state of every block in every cache
- Which caches have copies of the block, dirty vs. clean, ...
- Info per memory block vs. per cache block?
- PLUS: in memory => simpler protocol (centralized/one location)
- MINUS: in memory => directory is f(memory size) vs. f(cache size)
- Prevent the directory becoming a bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of their blocks
47. Directory Protocol
- Similar to snoopy protocol: three states
- Shared: >= 1 processors have the data, memory up-to-date
- Uncached: no processor has it; not valid in any cache
- Exclusive: 1 processor (owner) has the data; memory out-of-date
- In addition to cache state, must track which processors have the data when in the shared state (usually a bit vector, 1 if the processor has a copy)
- Keep it simple(r):
- Writes to non-exclusive data => write miss
- Processor blocks until access completes
- Assume messages are received and acted upon in the order sent
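A directory entry with a sharer bit vector can be sketched as follows. The class and method names are invented for illustration, and the sketch is deliberately incomplete: a real protocol also fetches the dirty copy from the exclusive owner and handles writebacks, which are omitted here.

```python
# Toy directory entry: per-block state plus a sharer bit vector
# (bit i set => processor i holds a copy).

UNCACHED, SHARED, EXCLUSIVE = "U", "S", "E"

class DirectoryEntry:
    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.state = UNCACHED
        self.sharers = 0                     # bit vector of caches with a copy

    def read_miss(self, proc):
        # memory stays (or becomes) up to date; requester joins the sharers
        self.state = SHARED
        self.sharers |= 1 << proc

    def write_miss(self, proc):
        # every other sharer must be invalidated; writer becomes the owner
        to_invalidate = [p for p in range(self.nprocs)
                         if (self.sharers >> p) & 1 and p != proc]
        self.state = EXCLUSIVE               # memory is now out of date
        self.sharers = 1 << proc
        return to_invalidate                 # invalidation messages to send

entry = DirectoryEntry(nprocs=4)
entry.read_miss(0)
entry.read_miss(2)
assert entry.state == SHARED and entry.sharers == 0b0101
assert entry.write_miss(1) == [0, 2]         # invalidations go to P0 and P2
assert entry.state == EXCLUSIVE and entry.sharers == 0b0010
```

The bit vector is what replaces bus snooping: instead of broadcasting, the home node sends invalidations only to the processors whose bits are set.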
48. Directory Protocol
- No bus, and we don't want to broadcast
- the interconnect is no longer a single arbitration point
- all messages have explicit responses
- Terms: typically 3 processors are involved
- Local node: where a request originates
- Home node: where the memory location of an address resides
- Remote node: has a copy of a cache block, whether exclusive or shared