Title: LimitLESS Directories: A Scalable Cache Coherence Scheme
1. LimitLESS Directories: A Scalable Cache Coherence Scheme
- By David Chaiken,
- John Kubiatowicz,
- Anant Agarwal
Presented by Sampath Rudravaram
2. Cache Coherence
- The gap between the computing power of microprocessors and that of the largest supercomputers is shrinking, while the price/performance advantage of microprocessors is increasing.
- Caches enhance the performance of multiprocessors by reducing network traffic and average memory access time.
- Cache coherence problems arise because multiple processors may be reading and modifying the same memory block within their own caches.
- Common solutions
- Snoopy coherence
- Directory-based coherence <--
- Compiler-directed coherence
3. Directory (Full-map)
- The message-based protocols allocate a section of the system's memory, called a directory.
- Each block of memory has an associated directory entry, which contains a bit for each cache in the system.
- That bit indicates whether or not the associated cache contains a copy of the memory block.
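The full-map entry described above can be sketched as a bit vector. This is a minimal illustration, not the Alewife implementation; the class name `FullMapEntry` and its methods are hypothetical.

```python
class FullMapEntry:
    """Directory entry for one memory block: one presence bit per cache.

    Hypothetical sketch; names are not taken from the paper.
    """

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.bits = 0  # bit i set => cache i holds a copy of the block

    def add_sharer(self, cache_id):
        self.bits |= 1 << cache_id

    def remove_sharer(self, cache_id):
        self.bits &= ~(1 << cache_id)

    def has_copy(self, cache_id):
        return bool((self.bits >> cache_id) & 1)

    def sharers(self):
        return [i for i in range(self.num_caches) if self.has_copy(i)]


entry = FullMapEntry(num_caches=8)
entry.add_sharer(2)
entry.add_sharer(5)
print(entry.sharers())
entry.remove_sharer(2)
print(entry.sharers())
```

Because the entry has a bit for every cache, it can record any number of sharers exactly, at the cost of one bit per cache per memory block.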
4. Directory-based Coherence
- The basic concept is that a processor must ask for permission to load an entry from primary memory into its cache.
- When an entry is changed, the directory must be notified either before the change is initiated or when it is complete.
- When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
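The request flow above can be sketched with a toy directory that uses the invalidate option (the update option is not modelled). The class and method names are hypothetical, and the sketch ignores message ordering and acknowledgments.

```python
class Directory:
    """Toy directory: tracks which caches hold each block (invalidate variant)."""

    def __init__(self):
        self.sharers = {}  # block address -> set of cache ids holding a copy

    def read_request(self, cache_id, block):
        # A cache asks permission to load the block; the directory records it.
        self.sharers.setdefault(block, set()).add(cache_id)

    def write_request(self, cache_id, block):
        # The directory is notified of the change and invalidates other copies.
        others = self.sharers.get(block, set()) - {cache_id}
        self.sharers[block] = {cache_id}
        return sorted(others)  # caches that must receive an invalidation


d = Directory()
d.read_request(0, 0x40)
d.read_request(1, 0x40)
d.read_request(2, 0x40)
to_invalidate = d.write_request(1, 0x40)
print(to_invalidate)
```

Only the caches recorded for the block receive invalidations, which is why no broadcast is needed.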
5. Directory-based Coherence
[Figure: entry layout: State | 1 | 2 | 3 | ... | N]
- Full-map directory entry
- Advantages:
- -> No broadcast is necessary
- Disadvantages:
- -> Coherence traffic is high because all requests go to the directory
- -> Large memory requirement (directory size grows as Θ(N²))
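The quadratic growth can be checked with a short calculation. This sketch assumes each node contributes a fixed amount of memory, so the number of blocks grows with N, and it ignores the state bits; the function name and the figure of 1024 blocks per node are hypothetical.

```python
def full_map_directory_bits(num_nodes, blocks_per_node):
    """Total presence bits when every block has one bit per node."""
    total_blocks = num_nodes * blocks_per_node  # memory grows with N
    return total_blocks * num_nodes             # N bits per directory entry


for n in (16, 64, 256):
    print(n, full_map_directory_bits(n, blocks_per_node=1024))
```

Doubling the number of nodes quadruples the directory storage under these assumptions.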
6. Directory-based Coherence
[Figure: entry layout: State | Node ID | Node ID | Node ID | Node ID]
- Limited directory entry
- Advantages:
- -> Its performance is comparable to that of a full-map scheme in cases where there is limited sharing of data between processors
- -> Cheaper to implement
- Disadvantages:
- -> The protocol is susceptible to thrashing when the number of processors sharing data exceeds the number of pointers in the directory entry
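The thrashing behaviour can be illustrated with a small simulation. This is a sketch under stated assumptions: the class `LimitedEntry` is hypothetical, and the victim is chosen first-in, first-out, which the slides do not specify.

```python
class LimitedEntry:
    """Limited directory entry with a fixed number of node-ID pointers."""

    def __init__(self, num_pointers):
        self.num_pointers = num_pointers
        self.pointers = []   # node IDs, oldest first
        self.evictions = 0   # copies invalidated to free a pointer

    def read_request(self, node_id):
        if node_id in self.pointers:
            return None
        evicted = None
        if len(self.pointers) == self.num_pointers:
            # Assumption: evict the oldest sharer (FIFO) to make room.
            evicted = self.pointers.pop(0)
            self.evictions += 1
        self.pointers.append(node_id)
        return evicted


def run(num_sharers, num_pointers=4, rounds=3):
    entry = LimitedEntry(num_pointers)
    for _ in range(rounds):
        for node in range(num_sharers):
            entry.read_request(node)
    return entry.evictions


print(run(num_sharers=4))  # sharers fit in the pointers
print(run(num_sharers=5))  # one sharer too many
```

With four sharers and four pointers nothing is evicted; adding a fifth sharer makes almost every read evict another reader's copy.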
7. LimitLESS (Limited directory Locally Extended through Software Support)
- The LimitLESS scheme attempts to combine the full-map and limited directory ideas in order to achieve a robust yet affordable and scalable cache coherence solution.
- The main idea behind this method is to handle the common case in hardware and the exceptional case in software.
- Limited directories implemented in hardware keep track of a fixed number of cached copies of each memory block. When the capacity of a directory entry is exceeded, the directory interrupts the local processor and a full-map directory is emulated in software.
8. [Slide content: protocol messages for hardware coherence; directory states; annotation of the state transition diagram]
9. Architectural Features LimitLESS
- Alewife is a large-scale multiprocessor with distributed shared memory and a cost-effective mesh network for communication.
- An Alewife node consists of a 33 MHz SPARCLE processor, 64 Kbytes of direct-mapped cache, 4 Mbytes of globally shared main memory, and a floating-point coprocessor.
11. [Figures: a 16-node Alewife machine; a 128-node Alewife chassis]
12. Architectural Features LimitLESS
- The machine must be capable of rapid trap handling (five to ten cycles).
- A rapid context-switching processor
- A finely tuned software trap architecture
- The processor needs complete access to coherence-related controller state.
- The directory controller must be able to invoke processor trap handlers when necessary.
- An interface to the network that allows the processor to launch and to intercept coherence protocol packets
- IPI (Interprocessor-Interrupt)
[Figure labels: Condition Bits, Processor, Controller, Trap Lines, Data Bus, Address Bus]
13. Architectural Features LimitLESS
- IPI provides a superset of the network functionality
- -> Used to send and receive cache protocol packets
- -> Used to send preemptive messages to remote processors
- Network packet structure
- Protocol opcode
- -> for cache coherence traffic
- Interrupt opcode
- -> for interprocessor messages
- Transmission of IPI packets
- -> enqueue the request on the IPI output queue
- Reception of IPI packets
- -> place the packet in the IPI input queue
- IPI input traps are synchronous.
[Figure: packet layout: Source processor | Packet Length | Opcode | Operand 1 | Operand 2 | ... | Operand m-1 | Data word | Data word 2 | ... | Data word n-1]
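The packet fields and the two IPI queues can be sketched as follows. Several details here are assumptions made only for illustration: the `INTERRUPT_BIT` flag used to tell interrupt opcodes from protocol opcodes, the way the length field is counted, and the loopback from output queue to input queue.

```python
from collections import deque

INTERRUPT_BIT = 0x80  # hypothetical flag marking an interrupt opcode


def make_packet(source, opcode, operands=(), data=()):
    """Build a packet with the fields named on the slide."""
    return {
        "source": source,
        # Assumption: length counts the opcode, operand, and data words.
        "length": 1 + len(operands) + len(data),
        "opcode": opcode,
        "operands": list(operands),
        "data": list(data),
    }


def is_interrupt(packet):
    return bool(packet["opcode"] & INTERRUPT_BIT)


ipi_output_queue = deque()  # processor -> network
ipi_input_queue = deque()   # controller -> processor

# Transmission: enqueue the request on the IPI output queue.
ipi_output_queue.append(make_packet(3, 0x01, operands=[0x1000]))
ipi_output_queue.append(make_packet(3, INTERRUPT_BIT | 0x02, data=[7, 8]))

# Reception, modelled here as a loopback: place each packet in the input queue.
while ipi_output_queue:
    ipi_input_queue.append(ipi_output_queue.popleft())

kinds = ["interrupt" if is_interrupt(p) else "protocol" for p in ipi_input_queue]
print(kinds)
```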
14. Queue-based diagram of the Alewife controller
15. Meta States, Trap Handler
- Meta States
- Trap Handler
- First-time overflow
- -> The trap code allocates a full-map bit-vector in local memory.
- -> Empty all hardware pointers and set the corresponding bits in the vector.
- -> Directory mode is set to Trap-On-Write before the trap returns.
- Additional overflow
- -> Empty all hardware pointers and set the corresponding bits in the vector.
- Termination (on WREQ or local write fault)
- -> Empty all hardware pointers.
- -> Record the identity of the requester in the directory.
- -> Set the AckCtr to the number of bits in the vector that are set.
- -> Place the directory in Normal mode, Write-Transaction state.
- -> Invalidate all caches with the bit set in the vector.
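The trap-handler steps above can be sketched as a small state machine. This is an illustration under assumptions, not the Alewife trap code: the class and attribute names are hypothetical, the requester that triggers an overflow is recorded directly in the vector, the vector is released after termination, and acknowledgment handling is omitted.

```python
NORMAL, TRAP_ON_WRITE = "Normal", "Trap-On-Write"


class LimitLESSEntry:
    def __init__(self, num_pointers, num_nodes):
        self.num_pointers = num_pointers
        self.num_nodes = num_nodes
        self.pointers = []      # hardware pointers (node IDs)
        self.mode = NORMAL      # meta state
        self.state = "Read-Only"
        self.vector = None      # software full-map bit-vector
        self.ack_ctr = 0        # acknowledgments still expected
        self.traps = 0          # number of software traps taken

    def read_request(self, node):
        in_vector = self.vector is not None and self.vector[node]
        if node in self.pointers or in_vector:
            return
        if len(self.pointers) < self.num_pointers:
            self.pointers.append(node)   # common case: hardware only
        else:
            self._overflow_trap(node)    # exceptional case: software

    def _overflow_trap(self, node):
        self.traps += 1
        if self.vector is None:          # first-time overflow
            self.vector = [False] * self.num_nodes
        for p in self.pointers:          # empty pointers into the vector
            self.vector[p] = True
        self.pointers = []
        self.vector[node] = True         # assumption: requester goes in vector
        self.mode = TRAP_ON_WRITE

    def write_request(self, node):
        if self.mode == TRAP_ON_WRITE:
            return self._termination_trap(node)
        targets = sorted(p for p in self.pointers if p != node)
        self.pointers = [node]
        self.ack_ctr = len(targets)
        self.state = "Write-Transaction"
        return targets

    def _termination_trap(self, node):
        self.traps += 1
        for p in self.pointers:          # empty pointers into the vector
            self.vector[p] = True
        self.pointers = [node]           # record the requester
        # Follows the slide literally; it does not say whether the
        # writer's own bit should be excluded.
        targets = [i for i, bit in enumerate(self.vector) if bit]
        self.ack_ctr = len(targets)      # number of bits set in the vector
        self.vector = None               # assumption: vector is released
        self.mode = NORMAL
        self.state = "Write-Transaction"
        return targets                   # caches to invalidate


entry = LimitLESSEntry(num_pointers=4, num_nodes=16)
for n in range(6):
    entry.read_request(n)
invalidated = entry.write_request(9)
print(entry.traps, entry.ack_ctr, invalidated)
```

In this run the first four readers are handled entirely by the hardware pointers, the fifth causes the one overflow trap, and the write causes the termination trap.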
16. PERFORMANCE MEASUREMENT
- Comparison of the performance of limited, LimitLESS, and full-map directories.
- Evaluated in terms of the total number of cycles needed to execute an application on a 64-processor Alewife machine.
17. Measurement Technique
ASIM, the Alewife System Simulator
18. Performance Results
-> Four-pointer limited protocol, full-map protocol, and LimitLESS scheme with Ts = 50
-> 64-node Alewife machine with 64 Kbyte caches and a 2D mesh network
19. Performance Results (contd.)
-> Result when the variable in Weather is not optimised.
20. Performance Results (contd.)
-> Result when the variable in Weather is optimised.
21. Performance Results (contd.)
-> Result when the emulation latency is 50 for the LimitLESS protocol.
22. Conclusion
- This paper proposed a new scheme for cache coherence, called LimitLESS, which is being implemented in the Alewife machine.
- Hardware requirements include rapid trap handling and a flexible processor interface to the network.
- Preliminary simulation results indicate that the LimitLESS scheme approaches the performance of a full-map directory protocol with the memory efficiency of a limited directory protocol.
- Furthermore, the LimitLESS scheme provides a migration path toward a future in which cache coherence is handled entirely in software.