Shared Memory Multiprocessors - PowerPoint PPT Presentation

1
Shared Memory Multiprocessors
  • Avinash Karanth Kodi
  • Department of Electrical and Computer Engineering
  • University of Arizona, Tucson, AZ 85721
  • E-mail: louri@ece.arizona.edu
  • ECE 568 Introduction to Parallel Processing

2
What is a Multiprocessor?
  • A collection of communicating processors
  • Goals: balance load, reduce inherent
    communication and extra work
  • A multi-cache, multi-memory system
  • Role of these components is essential regardless
    of programming model
  • Programming model and communication abstraction
    affect specific performance tradeoffs

...
3
Natural Extensions of Memory System
[Figure: processors P1..Pn connected through a switch to interleaved
first-level caches and interleaved main memory; the slide contrasts
three organizations: Shared Cache, Centralized Memory (Dance Hall,
UMA), and Distributed Memory (NUMA)]
4
Bus-based Symmetric Multiprocessors (SMPs)
  • Dominate the server market (a $60 billion market)
  • Attractive as throughput servers and for parallel
    programs
  • Fine-grain resource sharing
  • Uniform access via loads/stores
  • Automatic data movement and coherent replication
    in caches
  • Cheap and powerful extension
  • Normal uniprocessor mechanisms to access data
  • Key is extension of memory hierarchy to support
    multiple processors

5
Caches are critical for performance
  • Reduce average latency
  • automatic replication closer to processor
  • Reduce average bandwidth
  • Data is logically transferred from producer to
    consumer via memory
  • store: reg --> mem
  • load: reg <-- mem
  • Many processors can share data efficiently
  • What happens when store and load are executed on
    different processors?

6
Cache Coherence Problem in SMPs
[Figure: processors P1, P2, P3 with private caches on a shared bus to
memory; several caches hold replicas of a shared value u = 5]
Replicas in the caches of multiple processors in
an SMP have to be updated or kept coherent
7
Cache Coherence Problem
  • Caches play key role in all cases
  • Reduce average data access time
  • Reduce bandwidth demands placed on shared
    interconnect
  • Private processor caches create a problem
  • Copies of a variable can be present in multiple
    caches
  • A write by one processor may not become visible
    to others
  • They'll keep accessing stale values in their
    caches
  • Cache coherence problem
  • data sharing, I/O Operations, Process Migration
  • What do we do about it?
  • Organize the memory hierarchy to make it go away
  • Detect and take actions to eliminate the problem

8
Intuitive Memory Model and Coherence Protocols
  • Reading an address should return the last value
    written to that address
  • Easy in uniprocessors
  • except for I/O
  • Cache coherence problem in MPs is more pervasive
    and more performance critical
  • Two ways of maintaining cache coherence
  • Invalidate-based Protocols: invalidate replicas
    if a processor wants to write to a location
  • Write-Update Protocols: update replicas with the
    written value
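The two approaches can be contrasted in a minimal sketch (illustrative Python, not from the slides; caches are modeled as plain dictionaries mapping addresses to values, and the function names are assumptions):

```python
# Hypothetical sketch contrasting invalidate- and update-based coherence.
# Caches are plain {address: value} dicts; bus traffic is elided.

def write_invalidate(caches, writer, addr, value):
    """The writer keeps the new value; all other replicas are dropped."""
    for i, cache in enumerate(caches):
        if i == writer:
            cache[addr] = value
        else:
            cache.pop(addr, None)        # invalidate any stale replica

def write_update(caches, writer, addr, value):
    """Every cache already holding a replica receives the new value."""
    for i, cache in enumerate(caches):
        if i == writer or addr in cache:
            cache[addr] = value

caches = [{0x10: 5}, {0x10: 5}, {}]
write_invalidate(caches, 0, 0x10, 7)
print(caches)                            # [{16: 7}, {}, {}]
```

Invalidation trades a later read miss in the other caches for less bus traffic per write; update keeps replicas warm at the cost of broadcasting every written value.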

9
Definition of a Cache Coherent System
  • A multiprocessor system is coherent if the results of any
    execution of a program are such that, for each location, it is
    possible to construct a hypothetical total order of all memory
    accesses that is consistent with the results of the execution
  • A read by a processor P to a location X that follows a write by P
    to X, with no writes of X by another processor occurring between
    the write and the read by P, always returns the value written by P
  • A read by a processor to location X that follows a write by
    another processor to X returns the written value if the read and
    write are sufficiently separated in time and no other writes to X
    occur between the two accesses
  • Writes to the same location are serialized, i.e. two writes to
    the same memory location by any two processors are seen in the
    same order by ALL processors

10
Cache Coherence Properties
Key Properties:
  - Write Propagation: writes by any processor
    should become visible to all other processors
  - Write Serialization: all writes (from the same
    or different processors) are seen in the
    same order by all processors
Two classes of protocols:
  - Snoopy Protocols: for bus-based systems (SMPs)
  - Directory-based Protocols: for large-scale
    multiprocessors (point-to-point interconnects)
11
Definition of Snoopy Protocol
  • A snooping protocol is a distributed algorithm represented by a
    collection of co-operating finite state machines. It is specified
    by the following components:
  • the set of states associated with memory blocks in the local
    caches
  • the state-transition diagram, with the following input symbols:
    - Processor Requests
    - Bus Transactions
  • the actions associated with each state transition
  • The different states are co-ordinated by the bus transactions

12
Bus-based Snoopy Protocol
  • Bus is a broadcast medium; caches know what they
    have
  • Cache controller snoops all transactions on the
    shared bus
  • a transaction is relevant if it is for a block
    the cache contains
  • take action to ensure coherence
  • invalidate, update, or supply value
  • depends on the state of the block and the protocol

13
Example: Write-Through Invalidate
[Figure: processors P1, P2, P3 with private caches and snooping cache
controllers on a shared bus, together with memory and I/O devices]
  • Cache controllers can snoop on the bus
  • All bus transactions are visible to all cache
    controllers
  • All controllers see the transactions in the
    same order
  • Controllers can take action if the bus
    transaction is relevant, i.e. it involves a
    memory block in their cache
  • Coherence is maintained at the granularity of a
    cache block

14
Architectural Building Blocks
  • Invalidation Protocols: invalidate replicas if a
    processor writes a location
  • Update Protocols: update replicas with the
    written value
  • Based on:
  • Bus transactions with 3 phases
  • - Bus arbitration
  • - Command and address transmission
  • - Data transfer
  • FSM state transitions for a cache block
  • - State information (e.g. invalid, valid, dirty)
    is available for blocks in a cache
  • - State information for uncached blocks is
    implicitly defined (e.g. invalid or not present)

15
Design Choices
  • Controller updates state of blocks in response to
    processor and snoop events and generates bus
    transactions
  • Snoopy protocol
  • set of states
  • state-transition diagram
  • actions
  • Basic Choices
  • Write-through vs Write-back
  • Invalidate vs. Update

16
Write-through Invalidate Protocol
  • Two states per block in each cache
  • as in uniprocessor
  • state of a block is a p-vector of states
  • Hardware state bits associated with blocks that
    are in the cache
  • other blocks can be seen as being in invalid
    (not-present) state in that cache
  • Writes invalidate all other caches
  • can have multiple simultaneous readers of a
    block, but a write invalidates them
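The two-state protocol above can be sketched as a small simulation (illustrative Python, not from the slides; the single-block model, class, and method names are assumptions):

```python
# Minimal sketch of the two-state (Valid/Invalid) write-through
# invalidate protocol for a single cache block. The bus is modeled as a
# Python list of caches that snoop each other's writes.

class WTCache:
    def __init__(self, bus):
        self.state = "I"                 # "V" (valid) or "I" (invalid)
        self.bus = bus
        bus.append(self)

    def read(self, memory):
        # A miss fetches the block from memory; either way the local
        # copy is valid afterwards. Write-through keeps memory current.
        self.state = "V"
        return memory[0]

    def write(self, memory, value):
        memory[0] = value                # write-through to memory
        self.state = "V"
        for other in self.bus:
            if other is not self:
                other.state = "I"        # snooped bus write invalidates

bus, memory = [], [0]
c1, c2 = WTCache(bus), WTCache(bus)
c1.read(memory); c2.read(memory)         # multiple simultaneous readers
c1.write(memory, 9)                      # invalidates c2's replica
print(c1.state, c2.state, memory[0])     # V I 9
```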

17
MSI Protocol (1/3)
  • State machine for CPU requests, for each cache
    block
[State-transition diagram, redrawn as a list:]
  Invalid  --CPU Read / place read miss on bus-->          Shared
  Invalid  --CPU Write / place write miss on bus-->        Modified
  Shared   --CPU Read hit-->                               Shared
  Shared   --CPU Read miss / place read miss on bus-->     Shared
  Shared   --CPU Write / place write miss on bus-->        Modified
  Modified --CPU Read hit or CPU Write hit-->              Modified
  Modified --CPU Read miss / write back block,
             place read miss on bus-->                     Shared
  Modified --CPU Write miss / write back cache block,
             place write miss on bus-->                    Modified
  Cache block states: Invalid, Shared (read-only),
  Modified (read/write)
18
MSI Protocol (2/3)
  • State machine for bus requests, for each cache
    block
[State-transition diagram, redrawn as a list:]
  Shared   --write miss for this block-->                  Invalid
  Modified --write miss for this block / write back
             block (abort memory access)-->                Invalid
  Modified --read miss for this block / write back
             block (abort memory access)-->                Shared
19
MSI Protocol (3/3)
  • State machine for CPU requests and for bus
    requests, for each cache block
[Combined state-transition diagram: the CPU-request transitions of
slide 17 overlaid with the bus-request transitions of slide 18 on the
same three states Invalid, Shared (read-only), Modified (read/write)]
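The MSI diagrams above can be condensed into a small simulation (an illustrative Python sketch, not from the slides; class and method names are assumptions, and data transfer and write-back are elided — only the state changes are modeled):

```python
# Sketch of the MSI per-block state machine. read()/write() implement
# the CPU-request transitions; snoop() implements the bus-request
# transitions. The bus is modeled as a broadcast to the other caches.

INVALID, SHARED, MODIFIED = "I", "S", "M"

class MSICache:
    def __init__(self, bus):
        self.state = INVALID
        self.bus = bus
        bus.append(self)

    def read(self):
        if self.state == INVALID:        # read miss: place BusRd
            self._broadcast("BusRd")
            self.state = SHARED
        # read hits in S or M need no bus transaction

    def write(self):
        if self.state != MODIFIED:       # write miss/upgrade: BusRdX
            self._broadcast("BusRdX")
            self.state = MODIFIED
        # write hit in M needs no bus transaction

    def snoop(self, txn):
        if txn == "BusRdX":
            self.state = INVALID         # another writer: invalidate
        elif txn == "BusRd" and self.state == MODIFIED:
            self.state = SHARED          # write back and downgrade

    def _broadcast(self, txn):
        for other in self.bus:
            if other is not self:
                other.snoop(txn)

bus = []
p1, p2 = MSICache(bus), MSICache(bus)
p1.write()                   # P1: I -> M
p2.read()                    # P2's BusRd forces P1 M -> S; P2 I -> S
print(p1.state, p2.state)    # S S
```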
20
Example
[Figure: Processor 1, Processor 2, bus, and memory; slides 20-25 step
through a trace, one bus transaction per step, in which the processors
access addresses A1 and A2]
Assumes the initial cache state is invalid and that
A1 and A2 map to the same cache block, but A1 != A2
21
Example: Step 1
22
Example: Step 2
23
Example: Step 3
24
Example: Step 4
25
Example: Step 5
26
MESI Writeback Invalidation Protocol
States
  - Invalid (I)
  - Shared (S): one or more
  - Exclusive (E): one only
  - Dirty or Modified (M): one only
Processor Events
  - PrRd (read)
  - PrWr (write)
Bus Transactions
  - BusRd: asks for a copy with no intent to modify
  - BusRdX: asks for a copy with intent to modify
  - BusWB: updates memory
Actions
  - Update state, perform bus transaction,
    flush value onto bus
27
4-State MESI Protocol
  • Invalid (I)
  • Exclusive (E)
  • Shared (S)
  • Modified (M)

28
Setup for Memory Consistency
  • Coherence => writes to a location become visible
    to all processors in the same order
  • But when does a write become visible?
  • How do we establish orders between a write and a
    read by different processors?
  • use event synchronization
  • typically use more than one location!

29
Requirements for Memory Consistency (1/3)
Clearly, we need something more than coherence to
give a shared address space a clear semantics,
i.e. an ordering model that programmers can use
to reason about possible results and hence the
correctness of their programs
30
Requirements for Memory Consistency (2/3)
  • Determines a total order such that:
  • it gives the same result as the actual execution
  • operations by any particular process occur in
    the order they were issued
  • the value returned by each read operation is
    the value written by the last write operation to
    that location in the total order
  • The coherence protocol defines such properties
    only for accesses to a single location
  • Programs need, in addition, guaranteed properties
    for accesses to multiple locations

31
Requirements for Memory Consistency (3/3)
A memory consistency model for a shared address
space specifies constraints on the order in which
memory operations must appear to be performed
(i.e. to become visible to the processors) with
respect to one another. It includes operations
to the same location or to different
locations. Therefore, it subsumes coherence.
32
Sequential Consistency (1/3)
Definition (Lamport, 1979): A multiprocessor is
sequentially consistent if the result of any
execution is the same as if the operations of
all the processors were executed in some
sequential order, and the operations of each
individual processor occur in this sequence in
the order specified by its program. Two
constraints: program order and atomicity of
memory operations.
33
Sequential Consistency (2/3)
  • Program Order
    - Memory operations of a process must appear to
      become visible - to itself and others - in
      program order
  • Write Atomicity
    - Maintain a single sequential order among all
      operations to all memory locations

[Figure: processors P0, P1, ..., Pn all sharing a single memory]
34
Sequential Consistency (3/3)
Result: (A, B) = (1, 0) is allowed under SC;
(A, B) = (0, 2) is NOT ALLOWED under SC
  • SC does not ensure mutual exclusion;
    synchronization primitives are required
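The stated outcomes are consistent with the classic two-variable program that appeared as a figure on the original slide: P1 executes A = 1 then B = 2, and P2 reads B then A, with both locations initially 0 (an assumption inferred from the allowed/disallowed results, not shown in the transcript). A brute-force Python sketch enumerates every interleaving that respects program order:

```python
from itertools import permutations

# Assumed program: P1 writes A=1 then B=2; P2 reads B then A.
P1 = [("w", "A", 1), ("w", "B", 2)]
P2 = [("r", "B"), ("r", "A")]

def sc_results():
    """All (value read for A, value read for B) pairs allowed under SC."""
    results = set()
    ops = [("P1", 0), ("P1", 1), ("P2", 0), ("P2", 1)]
    for order in permutations(ops):
        # SC keeps only interleavings that preserve each processor's
        # program order
        if [o for o in order if o[0] == "P1"] != [("P1", 0), ("P1", 1)]:
            continue
        if [o for o in order if o[0] == "P2"] != [("P2", 0), ("P2", 1)]:
            continue
        mem, reads = {"A": 0, "B": 0}, {}
        for proc, i in order:
            op = (P1 if proc == "P1" else P2)[i]
            if op[0] == "w":
                mem[op[1]] = op[2]
            else:
                reads[op[1]] = mem[op[1]]
        results.add((reads["A"], reads["B"]))
    return results

print(sorted(sc_results()))   # [(0, 0), (1, 0), (1, 2)]
```

(1, 0) appears because P2 can read B before P1 writes it yet read A after; (0, 2) never appears, because reading B = 2 implies the write of A already happened in the total order.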

35
Base Cache Coherence Design
  - Single-level write-back cache
  - Invalidation protocol
  - One outstanding memory request per processor
  - Atomic memory bus transactions
    - For BusRd, BusRdX: no intervening transactions
      allowed on the bus between issuing the address
      and receiving the data
    - BusWB: address and data simultaneous, and sinked
      by the memory system before any new bus request
  - Atomic operations within a process: one finishes
    before the next in program order starts
36
Cache Controller and Tags
Cache controller is responsible for parts of a
memory operation. Uniprocessor, on a miss:
  - Assert request for bus
  - Wait for bus grant
  - Drive address and command lines
  - Wait for command to be accepted by relevant device
  - Transfer data
In a snoop-based multiprocessor, the cache
controller must monitor both bus and processor.
Can view it as two controllers: bus-side and
processor-side. With a single-level cache: dual
tags or dual-ported tag RAM. Responds to bus
transactions when necessary.
37
Reporting Snoop Results How?
  • Collective response from caches must appear on
    the bus
  • Example: in the MESI protocol, need to know
  • - Is the block dirty, i.e. should memory respond
    or not?
  • - Is the block shared, i.e. transition to E or S
    state on a read miss?
  • Three wired-OR signals
  • - Shared: asserted if any cache has a copy
  • - Dirty: asserted if some cache has a dirty copy
  • - needn't know which cache, since it will do
    what's necessary
  • - Snoop-valid: asserted when OK to check the
    other two signals
  • actually inhibit until OK to check
  • Illinois MESI requires a priority scheme for
    cache-to-cache transfers
  • Which cache should supply the data when in shared
    state?
  • Commercial implementations allow memory to
    provide the data
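The wired-OR lines can be sketched as follows (illustrative Python, not from the slides; each cache drives its line from its MESI state and the bus observes the OR):

```python
# Sketch of the wired-OR snoop-result lines. Each cache asserts
# Shared if it holds any copy (S, E, or M) and Dirty if its copy is
# modified; the bus sees the OR across all caches.

def snoop_result(states):
    """Return (shared, dirty) as seen on the wired-OR bus lines."""
    shared = any(s in ("S", "E", "M") for s in states)
    dirty = any(s == "M" for s in states)
    return shared, dirty

# Read miss where one cache holds the block in M: Dirty is asserted,
# so memory must not respond; the owner supplies the data.
print(snoop_result(["I", "M", "I"]))   # (True, True)

# Read miss with no copies anywhere: neither line asserted, memory
# responds and the requester may load the block in E.
print(snoop_result(["I", "I", "I"]))   # (False, False)
```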

38
Reporting Snoop Results When?
As soon as possible: memory needs to know what to
do. If none of the caches has a dirty copy,
memory has to fetch the data. Three options:
  - Fixed number of clocks from the address
    appearing on the bus
    - Dual tags required to reduce contention with
      the processor
    - Still must be conservative: processor blocks
      access to tag memory on E -> M
  - Variable delay
    - Memory assumes a cache will supply the data
      until all say "sorry"
    - Less conservative, more flexible, more complex
    - Memory can fetch the data and hold it just in
      case (SGI Challenge)
  - Immediately: bit-per-block in memory
    - Main memory maintains a bit per block that
      indicates whether the block is modified in
      one of the caches
    - Extra hardware complexity in commodity main
      memory system
39
Multi-level Cache Hierarchies
  • How to snoop with multi-level caches?
  • - independent bus snooping at every level
    (additional hardware: snooper, pins, duplication
    of tags)
  • - maintain cache inclusion
  • Requirements for inclusion
  • - data in the higher-level cache is a subset of
    the data in the lower-level cache
  • - modified in higher-level => marked modified
    in lower-level
  • Then we need to snoop only the lowest-level cache
  • - If L2 says not present (or not modified), then
    it is not so in L1 either
  • - If a BusRd is seen to a block that is modified
    in L1, L2 itself knows this
  • Inclusion is not always automatically preserved
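The two inclusion requirements above can be written as an explicit check (illustrative Python, not from the slides; caches are modeled as {block_address: state} dicts, which is an assumption):

```python
# Sketch of the inclusion property: L1 (higher level, closer to the
# CPU) must be a subset of L2, and any block modified in L1 must also
# be marked modified in L2.

def inclusion_holds(l1, l2):
    for addr, state in l1.items():
        if addr not in l2:
            return False                 # L1 holds a block L2 evicted
        if state == "M" and l2[addr] != "M":
            return False                 # modified state not propagated
    return True

l1 = {0x40: "M", 0x80: "S"}
l2 = {0x40: "M", 0x80: "S", 0xC0: "S"}
print(inclusion_holds(l1, l2))           # True
l2.pop(0x80)                             # L2 replaces a block still in L1
print(inclusion_holds(l1, l2))           # False: inclusion violated
```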

40
Violations in Inclusion
The two caches (L1, L2) may choose to replace
different blocks. Causes:
  - Differences in reference history: set-associative
    first-level cache with LRU replacement
  - Split higher-level caches: instruction and data
    blocks go in different caches at L1, but may
    collide in L2
  - Differences in block size
But a common case works automatically:
L1 direct-mapped, fewer sets than in L2, and the
same block size
41
Enhancements required to Cache Protocol
  • Explicitly maintain the inclusion property
  • Propagate bus transactions from L2 to L1
  • Propagate flushes and invalidations
  • Propagate modified state from L1 to L2 on writes
  • L2 cache must be updated before a flush due to a
    bus transaction
  • Write-through L1, or a modified-but-stale bit
    per block in the L2 cache
  • Dual cache tags are less important: each cache is
    a filter for the other

42
Further Enhancements
  • Split bus transaction into request and response
    sub-transactions
  • Separate arbitration for each phase
  • Other transactions may intervene
  • Improves bandwidth dramatically
  • Response is matched to request
  • Buffering between bus and cache controllers
  • Use multiple buses (address and data separately)
  • To separate the address and data portions of the
    transaction

43
Split Transaction Buses Example
  • Split-transaction Buses
  • Separate the address and data portions of the
    transaction

[Figure: bus timing over cycles 1-6, showing the address bus, snoop
line, and data bus each carrying a different transaction in the same
cycle]
44
Problems in scaling SMPs: Starfire
Sun StarFire uses 4 address buses. For 13 or
fewer system boards, the maximum data capacity
is limited by the crossbar. Beyond 13, it is
limited by the snoop bandwidth
[Figure: bandwidth at an 83.3-MHz clock (MBps) and bytes per clock
versus number of system boards (0-16); the snooping capacity is flat
at 10,667 MBps (128 bytes per clock), while the data-crossbar capacity
with random addresses grows with board count, so larger systems become
snoop-limited]
Courtesy of Alan Charlesworth, "STARFIRE:
Extending the SMP Envelope", IEEE Micro, Volume
18, Issue 1, Jan-Feb 1998, pages 39-49
45
Bandwidth Scaling: Sun Interconnects
46
Distributed Shared Memory Multiprocessors
Distributed Memory (NUMA)
  • Separate memory per processor
  • Local or remote access via memory controller
  • One cache coherency solution: non-cached pages
  • Alternative: a directory per cache that tracks
    the state of every block in every cache
  • Which caches have copies of the block, dirty vs.
    clean, ...
  • Info per memory block vs. per cache block?
  • PLUS: in memory => simpler protocol
    (centralized/one location)
  • MINUS: in memory => directory is f(memory size)
    vs. f(cache size)
  • Prevent the directory becoming a bottleneck?
    Distribute directory entries with memory, each
    keeping track of which processors have copies of
    their blocks

47
Directory Protocol
  • Similar to Snoopy Protocol: three states
  • Shared: >= 1 processors have the data, memory
    up-to-date
  • Uncached: no processor has it; not valid in any
    cache
  • Exclusive: 1 processor (the owner) has the data;
    memory is out-of-date
  • In addition to cache state, must track which
    processors have the data when in the shared state
    (usually a bit vector: 1 if the processor has a
    copy)
  • Keep it simple(r)
  • Writes to non-exclusive data => write miss
  • Processor blocks until the access completes
  • Assume messages are received and acted upon in
    the order sent
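A directory entry as described above can be sketched as a state plus a presence bit-vector (illustrative Python, not from the slides; class, field, and method names are assumptions, and the message exchanges with remote nodes are elided):

```python
# Sketch of a per-memory-block directory entry: a state (Uncached /
# Shared / Exclusive) plus a presence bit-vector, one bit per processor.

class DirEntry:
    def __init__(self, nprocs):
        self.state = "Uncached"
        self.presence = [False] * nprocs

    def read_miss(self, proc):
        # The requester is added as a sharer. (A real protocol would
        # first ask an Exclusive owner to write back; elided here.)
        self.state = "Shared"
        self.presence[proc] = True

    def write_miss(self, proc):
        # All other sharers are invalidated; the requester becomes the
        # exclusive owner and memory becomes out-of-date.
        self.presence = [i == proc for i in range(len(self.presence))]
        self.state = "Exclusive"

d = DirEntry(4)
d.read_miss(0); d.read_miss(2)
print(d.state, d.presence)   # Shared [True, False, True, False]
d.write_miss(2)
print(d.state, d.presence)   # Exclusive [False, False, True, False]
```

The bit-vector is what makes invalidations targeted: instead of broadcasting on a bus, the home node sends messages only to the processors whose presence bits are set.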

48
Directory Protocol
  • No bus, and don't want to broadcast
  • interconnect is no longer a single arbitration
    point
  • all messages have explicit responses
  • Terms: typically 3 processors involved
  • Local node: where a request originates
  • Home node: where the memory location of an
    address resides
  • Remote node: has a copy of the cache block,
    whether exclusive or shared