Title: Cache Coherence Protocols in Shared Memory Multiprocessors
1Cache Coherence Protocols in Shared Memory
Multiprocessors
2Outline
- Introduction
- Background Information
- The cache coherence problem
- Cache Enforcement Strategies
- Consistency models
- Simple Solutions
- Hardware Protocols
- Snooping protocols
- Directory-based protocols
- Compiler and Software protocols
- Future work and conclusions
3The Cache Coherence Problem
- Caches allow greater performance by storing
frequently used data in faster memory
- Since all processors share the same address
space, it is possible for more than one processor
to cache an address (or data item) at a time
- If one processor updates the data item without
informing the other processor, inconsistencies
may result and cause incorrect executions
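The scenario above can be reproduced with a toy model. This is a minimal sketch, assuming write-through private caches with no coherence mechanism at all; the `Cache` class and the address `0x10` are hypothetical names chosen for illustration.

```python
# Hypothetical model: one shared memory, two private caches, no coherence.
class Cache:
    def __init__(self, memory):
        self.memory = memory      # shared backing store (dict: address -> value)
        self.lines = {}           # private copies (address -> value)

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]             # hit: return the private copy

    def write(self, addr, value):
        self.lines[addr] = value            # update the private copy
        self.memory[addr] = value           # write through to memory


memory = {0x10: 1}
p0, p1 = Cache(memory), Cache(memory)

p0.read(0x10)            # both processors cache address 0x10
p1.read(0x10)
p0.write(0x10, 2)        # P0 updates the item; P1 is not informed
stale = p1.read(0x10)    # P1 still hits on its old copy
print(stale, memory[0x10])   # prints "1 2"
```

P1 keeps reading the old value even though memory is up to date, which is the inconsistency a coherence protocol has to prevent.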
4Cache Coherence Problem
5Cache Coherence (cont.)
- For correct execution, coherence must be enforced
between the caches
- Two major factors are
- performance
- implementation cost
- Four primary design issues are
- coherence detection strategy
- coherence enforcement strategy
- precision of block-sharing information
- cache block size
6Cache Enforcement Strategies
- A cache enforcement strategy is the mechanism
which makes caches consistent
- write-update (WU)
- write-invalidate (WI)
- hybrid protocols, competitive-update (CU)
- Performance of WU and WI varies depending on the
application and the number of writes
- Hybrid protocols switch between WU and WI based
on the number of writes to a block
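The difference between the three strategies shows up in the messages a remote copy receives. The sketch below is illustrative only: it assumes one common formulation of competitive update, in which a copy is updated until it has received more than `threshold` writes without being used locally and is then invalidated. The function names and the threshold value are assumptions, not part of the slides.

```python
def remote_write(copy, policy, threshold=2):
    """Apply one remote write to a cached copy; return the message kind sent.

    copy is a dict with 'valid' and 'updates' (writes received since last local use).
    """
    if not copy["valid"]:
        return "none"                 # no copy here, nothing to send
    if policy == "WI":
        copy["valid"] = False         # write-invalidate: drop the copy
        return "invalidate"
    if policy == "WU":
        return "update"               # write-update: refresh the copy in place
    # CU (assumed rule): update until the copy has gone unused for too many writes
    copy["updates"] += 1
    if copy["updates"] > threshold:
        copy["valid"] = False
        return "invalidate"
    return "update"


def count_messages(policy, writes):
    """Messages seen by one idle remote copy over a run of remote writes."""
    copy = {"valid": True, "updates": 0}
    return [remote_write(copy, policy) for _ in range(writes)]


print(count_messages("WI", 5))  # ['invalidate', 'none', 'none', 'none', 'none']
print(count_messages("WU", 5))  # ['update', 'update', 'update', 'update', 'update']
print(count_messages("CU", 5))  # ['update', 'update', 'invalidate', 'none', 'none']
```

For a copy that is never read again, WI is cheapest and WU keeps paying for every write; the hybrid bounds that cost while still delivering the first few updates.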
7Consistency Models
- A consistency model defines how the consistency
of data values is maintained
- Some consistency models are
- sequential consistency
- weak consistency
- release consistency
- Weak consistency models are more efficient to
implement and require fewer coherence messages
8Shared Caches (1)
- Processors share a single cache, essentially
punting the problem
- Useful for very small machines, e.g., DPC in the
Encore, Alliant FX/8
- Problems are limited cache bandwidth and cache
interference
- Benefits are fine-grain sharing and prefetch
effects
9Non-cacheable Items (2)
- Make shared data non-cacheable
- One of the simplest software solutions
- Also possible in hardware: make cache locations
unreachable
10Broadcast Writes (3)
- Every cache write request is sent to all other
caches
- Each cache must first discover whether it holds
this data
- Other copies are either updated or invalidated
- Significant additional memory transactions occur
11Hardware Protocols
- Snoop Bus Mechanism
- Directory Based Methods
- Full Directory
- Limited Directory
- Chained Directory
12Snoop Bus Protocol
- Snooping protocols rely on a shared bus between
the processors for coherence
- On a processor write, the write is passed through
the cache to main memory on the bus
- Any processor caching the address may update or
invalidate its cache entry as appropriate
- Snooping protocols do not scale well beyond 32
processors because of the shared bus
- The choice between WU, WI, and CU is especially
important to reduce communication
13MESI (4-state) Invalidation Protocol
- Each line in the cache can be in one of 4 states
- Modified (exclusive): only in 1 cache, modified
- Exclusive (unmodified): only in 1 cache,
unmodified
- Shared (unmodified)
- Invalid
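The four states can be written down as a small transition function for a single cache line. This is a simplified sketch of the textbook MESI transitions, not a full protocol: it ignores write-backs and data transfer, and the event names `PrRd`, `PrWr`, `BusRd`, and `BusRdX` follow common textbook notation rather than anything defined in these slides.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def next_state(state, event, shared=False):
    """Return the next MESI state of one cache line.

    event:  'PrRd' / 'PrWr'   local processor read / write
            'BusRd' / 'BusRdX' read / read-for-ownership snooped from another cache
    shared: on a read miss, whether another cache already holds the line
    """
    if event == "PrRd":
        if state is State.I:
            return State.S if shared else State.E
        return state                      # read hit: no change
    if event == "PrWr":
        return State.M                    # from I or S, other copies are invalidated first
    if event == "BusRd":
        # another cache reads the line: M and E drop to S, I stays I
        return State.I if state is State.I else State.S
    if event == "BusRdX":
        return State.I                    # another cache is about to write
    raise ValueError(event)


line = State.I
trace = []
for event in ["PrRd", "PrWr", "BusRd", "BusRdX"]:
    line = next_state(line, event)
    trace.append(line.name)
print(trace)   # ['E', 'M', 'S', 'I']
```

The trace follows one line through a read miss with no other sharers, a local write, a remote read, and finally a remote write that invalidates it.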
14MESI State Transition Diagram
15MESI Example
16Directory-Based Protocols
- Directory-based protocols do not rely on a shared
bus to exchange coherence information (use
point-to-point connections)
- more scalable (can have hundreds of processors)
- each processor can have its own memory
- implement weak consistency for efficiency
17Directory-Based Protocols (cont.)
- Each node maintains a directory storing cache
information and memory information
- A processor communicates with the directory to
access memory
- If a processor requests a non-local memory page,
the directory uses its information to find the
page
- Then, it uses messages to retrieve the page and
ensure all other processors have consistent info.
- Since the directory maintains which processors
are caching the page, it only needs to send
messages to those processors
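The last point, that messages go only to the recorded sharers, can be sketched as follows. This is a minimal illustration assuming a write-invalidate policy and a directory that tracks sharers per block; the `Directory` class and its method names are hypothetical, and data transfer is left out.

```python
class Directory:
    """Sharer tracking for the blocks of one home node (illustrative only)."""

    def __init__(self):
        self.sharers = {}                 # block -> set of node ids caching it

    def read(self, block, node):
        # Record that `node` now holds a copy of `block`.
        self.sharers.setdefault(block, set()).add(node)

    def write(self, block, node):
        # Invalidate only the nodes recorded as sharers, not every node.
        targets = self.sharers.get(block, set()) - {node}
        self.sharers[block] = {node}      # the writer becomes the sole holder
        return sorted(targets)            # nodes that must receive an invalidation


d = Directory()
d.read(7, 0)
d.read(7, 3)
print(d.write(7, 3))   # [0]
```

With hundreds of nodes, a write here still costs one invalidation, whereas a broadcast scheme would contact every cache.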
18Directory-Based Protocols (cont.)
- Designing a directory requires defining
- cache block granularity
- cache controller design
- directory structure
- Cache block granularity is the size of the cache
and the size of a cache line
- CC-NUMA machines have a cache that is separate
from, and smaller than, main memory
- COMA machines use each node's entire memory as
cache for remote pages
- Block size affects performance (false sharing)
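False sharing follows directly from how addresses map to blocks: two unrelated variables that land in the same block are kept coherent as a unit. A small sketch, with addresses and block sizes chosen purely for illustration:

```python
def block_of(addr, block_size):
    """Index of the cache block that contains a byte address."""
    return addr // block_size


def falsely_shared(addr_a, addr_b, block_size):
    """True if two distinct addresses fall in the same cache block."""
    return addr_a != addr_b and block_of(addr_a, block_size) == block_of(addr_b, block_size)


# Two 4-byte counters, each written by a different processor.
print(falsely_shared(0x1000, 0x1004, block_size=64))  # True
print(falsely_shared(0x1000, 0x1004, block_size=4))   # False
```

Larger blocks exploit spatial locality but make this kind of accidental sharing more likely, which is why block size is a design parameter rather than a fixed choice.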
19Directory-Based Protocols (cont.)
- Cache controller is hardware that maintains the
directory and processes memory requests
- custom hardware
- programmable protocol processor
- The directory structure is how the cache and
memory information is organized
- p+1-bit full directory
- linked-list directories
- tagged directories
20Directory Models
- Full Directory
- Link to all caches for all shared locations
- Limited Directory
- Links to some of the caches having shared data, n < N
- Chained (linked) Directory
- Link to one cache, and from this cache to the others,
single/double link
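The three models trade directory storage against precision. The sketch below estimates the directory bits kept per memory block under simple assumptions: a full directory holds one presence bit per cache plus a dirty bit, a limited directory holds n pointers, and a chained directory holds only a head pointer, with the remaining links stored in the caches. Real designs add state bits that are not counted here, and the function names are hypothetical.

```python
def pointer_bits(N):
    """Bits needed to name one of N caches."""
    return max(1, (N - 1).bit_length())


def full_bits(N):
    return N + 1                      # one presence bit per cache plus a dirty bit


def limited_bits(N, n):
    return n * pointer_bits(N)        # n pointers, with n < N


def chained_bits(N):
    return pointer_bits(N)            # head pointer only; the chain lives in the caches


N = 64
print(full_bits(N), limited_bits(N, 4), chained_bits(N))   # 65 24 6
```

The full directory grows linearly with the number of caches, while the other two grow with the logarithm, at the cost of overflow handling (limited) or longer invalidation chains (chained).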
21Directory Sample (full)
22Lock-Based Protocols
- New work that promises to be more scalable than
directory protocols
- Implements scope consistency which is similar to
lazy release consistency
- Coherence information exchanged by reading and
writing notices from the lock which protects the
shared memory
- Currently implemented in software similar to
DSM, but may move to hardware if performance
gains can be realized
23Software Protocols
- Software protocols enforce consistency with
limited hardware support by relying either on the
compiler or specialized software handlers
- Similar to distributed shared memory (DSM)
systems but at a lower level
- sharing usually in blocks not pages
- needs to be more efficient for better performance
- architecture support for sharing
24Classification of Software Protocols
- Several criteria distinguish software protocols
- dynamism - compile-time or run-time analysis
- selectivity - level of coherence actions
- restrictiveness - conservative or as-needed
consistency enforcement
- adaptivity - can protocol adapt to access
patterns
- granularity - size and structure of coherence
data
- blocking - program block on which coherence is
enforced
- positioning - position of coherence instructions
- updating - how memory is updated after a write
- checking - how incoherence is detected
25Software Coherence with Limited Hardware Support
- Compiler must generate consistent code as no
hardware coherence is provided
- Hardware maintains time tags which are updated on
every write
- On a read, compiler generates coherence reads
which check time tags to ensure data is
consistent
- Relies on the compiler to detect reads which may
be inconsistent, and the hardware must maintain
these time tags
- Using tags, it is also possible to perform
dynamic self-invalidation of blocks
- Many techniques are based on using these time tags
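A time-tag check can be sketched as follows. This is a deliberately simplified model and not any specific published scheme: it assumes the hardware keeps one counter per address that is bumped on every write, that a cached copy remembers the tag it was fetched with, and that the compiler chooses between an ordinary read and a coherence read. All class and method names are hypothetical.

```python
class TaggedMemory:
    def __init__(self):
        self.data, self.tag = {}, {}

    def write(self, addr, value):
        self.data[addr] = value
        self.tag[addr] = self.tag.get(addr, 0) + 1   # tag updated on every write


class TaggedCache:
    def __init__(self, mem):
        self.mem, self.lines = mem, {}               # addr -> (value, tag at fetch)

    def read(self, addr):
        """Ordinary read: trust the cached copy if present."""
        if addr not in self.lines:
            self.lines[addr] = (self.mem.data[addr], self.mem.tag[addr])
        return self.lines[addr][0]

    def coherence_read(self, addr):
        """Read emitted by the compiler where the data may be stale."""
        cached = self.lines.get(addr)
        if cached is None or cached[1] != self.mem.tag[addr]:
            self.lines.pop(addr, None)               # self-invalidate the stale copy
        return self.read(addr)


mem = TaggedMemory()
cache = TaggedCache(mem)
mem.write(0, 1)
first = cache.read(0)              # 1, now cached with tag 1
mem.write(0, 2)                    # another processor writes; tag becomes 2
plain = cache.read(0)              # 1: an ordinary read does not check the tag
checked = cache.coherence_read(0)  # 2: tag mismatch forces a refetch
print(first, plain, checked)       # 1 1 2
```

The cost of the scheme depends on how precisely the compiler can tell which reads need the check: every unnecessary coherence read adds a tag comparison, and every missed one returns stale data.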
26Software Coherence with Limited Hardware Support
(cont.)
- If hardware has no time tags, Petersen and Li
developed an algorithm which uses only page
translation hardware and page status tables
- Sharing information is maintained by a software
handler at the page-level
- On a page access or fault, the software handler
checks the sharing information, updates page
tables, and performs coherence actions
- Slower than hardware as software handlers involve
the OS and are on the critical memory access path
27Enforcing Coherence by Restricting Parallelism
- Compilers can also guarantee coherence by
structuring the language to limit parallelism
- easier to enforce coherence
- limits the programmer and potential parallelism
- simplifies compiler design
- good performance can be achieved with no hardware
support
- Parallel language restrictions include
- doall parallel loops
- master/slave processes
28Optimizing Compilers
- Optimizing compilers are designed to maintain
coherence with limited hardware support without
overly restricting the programmer
- rely on detecting data dependencies
- may use synchronization variables (locks,
barriers)
- can provide the hardware with hints
- can detect when coherence is not needed
- may have problems with dynamic sharing
- offer good performance, but are hard to design
29Future Work
- Hardware protocols are well defined, and the
directory structure is near optimal
- Cost improvements can be obtained by mass
producing cache controller chips
- Software protocols are a good area for future
research because they are also applicable at
higher-levels of sharing (DSM, databases, ...)
- Optimizing compilers need to be improved to
detect data dependencies and optimize code for
the parallel environment
30Conclusions
- Hardware protocols offer the best performance but
require high hardware costs
- Software protocols can be used when there is no
hardware support with a slight performance
penalty
- Optimizing compilers can enforce coherence or
provide hints to the hardware
- A combination of hardware and compiler
optimizations is the best