LimitLESS Directories: A Scalable Cache Coherence Scheme


Slides: 23

Provided by: acade116


1
LimitLESS Directories: A Scalable Cache
Coherence Scheme

By David Chaiken,
John Kubiatowicz,
Anant Agarwal

Presented by Sampath Rudravaram
2
Cache Coherence

The gap between the computing power of
microprocessors and that of the largest
supercomputers is shrinking, while the
price/performance advantage of microprocessors is
increasing.
Caches enhance the performance of
multiprocessors by reducing network traffic and
average memory access time.
Cache coherence problems arise because multiple
processors may be reading and modifying the same
memory block within their own caches.

Common Solutions
Snoopy coherence
Directory-based coherence <--
Compiler-directed coherence

3
Directory (Full-map)

Message-based protocols allocate
a section of the system's memory
to a directory.
Each block of memory has an associated directory
entry, which contains a bit for each cache in the
system.
That bit indicates whether or not the associated
cache contains a copy of the memory block.
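
As a minimal sketch of the full-map entry described above, the per-cache presence bits can be modelled in Python as an integer used as a bit vector. The class and method names are hypothetical and do not come from the presentation.

```python
class FullMapEntry:
    """One directory entry: one presence bit per cache (illustrative names)."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.bits = 0  # bit i set => cache i holds a copy of the block

    def add_sharer(self, cache_id):
        self.bits |= 1 << cache_id

    def remove_sharer(self, cache_id):
        self.bits &= ~(1 << cache_id)

    def has_copy(self, cache_id):
        return bool(self.bits & (1 << cache_id))

    def sharers(self):
        return [i for i in range(self.num_caches) if self.bits & (1 << i)]


entry = FullMapEntry(num_caches=8)
entry.add_sharer(1)
entry.add_sharer(5)
print(entry.sharers())  # [1, 5]
```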

4
Directory based Coherence

The basic concept is that a processor must ask
for permission to load an entry from the primary
memory to its cache.
When an entry is changed, the directory must be
notified either before the change is initiated or
when it is complete.
When an entry is changed, the directory either
updates or invalidates the other caches holding that
entry.
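
The request flow above might be sketched as follows. This is only an illustration of the invalidate variant (the slide also allows an update variant); the `Directory` class, its method names, and the block address are assumptions made for the example.

```python
class Directory:
    """Tracks, per block, which caches hold a copy (invalidate-on-write sketch)."""

    def __init__(self):
        self.sharers = {}  # block address -> set of cache ids

    def read_request(self, block, cache_id):
        # The cache asks permission to load the block; the directory records it.
        self.sharers.setdefault(block, set()).add(cache_id)

    def write_request(self, block, cache_id):
        # Before the write proceeds, every other copy must be invalidated.
        others = self.sharers.get(block, set()) - {cache_id}
        self.sharers[block] = {cache_id}
        return sorted(others)  # caches that must receive an invalidation


d = Directory()
d.read_request(0x40, 0)
d.read_request(0x40, 2)
d.read_request(0x40, 3)
print(d.write_request(0x40, 2))  # [0, 3]
```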

5
Directory-based Coherence
[ State | 1 | 2 | 3 | ... | N ]

Full-map directory entry
Advantages:
-> No broadcast is necessary
Disadvantages:
-> Coherence traffic is high because all
requests go to the directory
-> Great need for memory (size grows as Θ(N^2))
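
To see where the quadratic growth comes from, here is a small worked calculation. It assumes that each node contributes a fixed number of memory blocks and it ignores the state bits; the function name and the figure of 1024 blocks per node are illustrative only.

```python
def full_map_directory_bits(num_nodes, blocks_per_node):
    """Total presence bits in a full-map directory (state bits ignored)."""
    total_blocks = num_nodes * blocks_per_node  # memory grows with the machine
    bits_per_entry = num_nodes                  # one bit per cache
    return total_blocks * bits_per_entry


for n in (16, 32, 64):
    print(n, full_map_directory_bits(n, blocks_per_node=1024))
# Doubling the node count quadruples the directory size.
```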

6
Directory-based Coherence
[ State | Node ID | Node ID | Node ID | Node ID ]

Limited directory entry
Advantages:
-> Its performance is comparable to that of a
full-map scheme in cases where there is limited
sharing of data between processors
-> Cheaper to implement
Disadvantages:
-> The protocol is susceptible to thrashing when
the number of processors sharing data exceeds the
number of pointers in the directory entry
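
The thrashing behaviour can be illustrated with a short sketch. It assumes a four-pointer entry and a first-in, first-out choice of which sharer to invalidate on overflow; the eviction policy and all names are assumptions for the example, not details taken from the slides.

```python
class LimitedEntry:
    """Directory entry with a fixed number of node pointers (illustrative)."""

    def __init__(self, num_pointers):
        self.num_pointers = num_pointers
        self.pointers = []   # node ids currently recorded as holding a copy
        self.evictions = 0   # invalidations forced by pointer overflow

    def read_request(self, node_id):
        if node_id in self.pointers:
            return None
        victim = None
        if len(self.pointers) == self.num_pointers:
            # No free pointer: invalidate the oldest sharer to make room.
            victim = self.pointers.pop(0)
            self.evictions += 1
        self.pointers.append(node_id)
        return victim


entry = LimitedEntry(num_pointers=4)
# Five nodes take turns reading one block: one more sharer than pointers.
for _ in range(3):
    for node in range(5):
        entry.read_request(node)
print(entry.evictions)  # 11 of the 15 reads force an invalidation
```

With only four sharers the same loop causes no evictions at all, which is why the limited scheme does well when sharing is limited.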

7
LimitLESS (Limited directory Locally Extended
through Software Support)

The LimitLESS scheme attempts to combine the
full-map and limited directory ideas in order to
achieve a robust yet affordable and scalable
cache coherence solution.
The main idea behind this method is to handle
the common case in hardware and the exceptional
case in software.
Limited directories implemented in hardware
keep track of a fixed number of cached copies of
each memory block. When the capacity of a directory
entry is exceeded, the directory interrupts the
local processor and a full-map directory is
emulated in software.

8
[Slide content: protocol messages for hardware coherence, directory states, and annotation of the state transition diagram]
9
Architectural Features LimitLESS

Alewife is a large-scale multiprocessor with
distributed shared memory and a cost-effective
mesh network for communication.
An Alewife node consists of a 33MHz SPARCLE
processor, 64K bytes of direct-mapped cache, 4M
bytes of globally-shared main memory, and a
floating-point coprocessor

11
A 16-node Alewife machine
A 128-node Alewife Chassis
12
Architectural Features LimitLESS

The machine must be capable of rapid trap handling
(five to ten cycles). Alewife provides this with:
-> A rapid context-switching processor
-> A finely tuned software trap architecture
The processor needs complete access to
coherence-related controller state.
The directory controller must be able to
invoke processor trap handlers when necessary.
An interface to the network must allow the
processor to launch and to intercept coherence
protocol packets:
IPI (Interprocessor Interrupt)

[Figure: signals between the processor and the controller: condition bits, trap lines, data bus, address bus]
13
Architectural Features LimitLESS

IPI provides
a superset of the network functionality
-> Used to send and receive cache protocol
packets
-> Used to send preemptive messages to remote
processors
Network Packet Structure
Protocol Opcode
-> for cache coherence traffic
Interrupt Opcode
-> for interprocessor messages
Transmission of IPI Packets
-> enqueue the request on the IPI output
queue
Reception of IPI Packets
-> place the packet in the IPI input queue
IPI input traps are synchronous.

Packet layout:
Source processor | Packet Length | Opcode
Operand 1 | Operand 2 | ... | Operand m-1
Data word 1 | Data word 2 | ... | Data word n-1
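
The packet layout and the transmit step above could be modelled roughly as below. This is a sketch under stated assumptions: the field types, the `OpcodeKind` enum, the rule that the length counts one header word plus all operand and data words, and the `send` helper are all hypothetical and are not taken from the Alewife design.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class OpcodeKind(Enum):
    PROTOCOL = "protocol"    # cache coherence traffic
    INTERRUPT = "interrupt"  # interprocessor messages


@dataclass
class Packet:
    source: int              # source processor
    kind: OpcodeKind
    opcode: str
    operands: List[int] = field(default_factory=list)
    data: List[int] = field(default_factory=list)

    def length(self):
        # Assumption: one header word plus all operand and data words.
        return 1 + len(self.operands) + len(self.data)


ipi_output_queue = []  # stand-in for the IPI output queue


def send(packet):
    """Transmission: enqueue the request on the output queue."""
    ipi_output_queue.append(packet)


send(Packet(source=3, kind=OpcodeKind.PROTOCOL, opcode="WREQ", operands=[0x40]))
print(ipi_output_queue[0].length())  # 2
```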
14
Queue based diagram of the Alewife controller
15
Meta States and Trap Handler

Meta States
Trap Handler
First-time overflow
- The trap code allocates a full-map
bit vector in local memory.
- Empty all hardware pointers and set the
corresponding bits in the vector.
- Directory mode is set to Trap-On-Write
before the trap returns.
Additional overflow
- Empty all hardware pointers and set the
corresponding bits in the vector.
Termination (on WREQ or local write fault)
- Empty all hardware pointers.
- Record the identity of the requester in the
directory.
- Set the AckCtr to the number of bits in the
vector that are set.
- Place the directory in Normal mode, Write
Transaction state.
- Invalidate all caches with the bit set in the
vector.
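
The three handler cases listed above can be sketched in Python. This is a simplified model, not the Alewife trap code: a set stands in for the bit vector, the requester that caused an overflow is assumed to be recorded in the vector, the vector is assumed to be released on termination, and every name here is hypothetical.

```python
class Entry:
    """Hardware directory entry (illustrative fields only)."""

    def __init__(self, num_pointers):
        self.num_pointers = num_pointers
        self.pointers = []     # hardware pointers
        self.mode = "Normal"   # meta state
        self.state = "Read-Only"
        self.ack_ctr = 0


software_vectors = {}  # block -> set of node ids (stands in for a bit vector)


def spill_pointers(block, entry):
    # Empty all hardware pointers and set the corresponding bits in the vector.
    vector = software_vectors.setdefault(block, set())  # allocated on first use
    vector.update(entry.pointers)
    entry.pointers.clear()
    return vector


def overflow_trap(block, entry, requester):
    # Handles both the first-time and the additional overflow cases.
    vector = spill_pointers(block, entry)
    vector.add(requester)          # assumption: the new reader goes in the vector
    entry.mode = "Trap-On-Write"


def termination_trap(block, entry, requester):
    # Runs on a write request (WREQ) or a local write fault.
    spill_pointers(block, entry)
    vector = software_vectors.pop(block)  # assumption: the vector is released
    entry.pointers.append(requester)      # record the requester in the directory
    entry.ack_ctr = len(vector)           # number of bits set in the vector
    entry.mode = "Normal"
    entry.state = "Write Transaction"
    return sorted(vector)                 # caches to invalidate


entry = Entry(num_pointers=2)
entry.pointers = [0, 1]
overflow_trap("blk", entry, requester=2)     # first overflow
entry.pointers = [3, 4]                      # hardware handles two more reads
overflow_trap("blk", entry, requester=5)     # additional overflow
print(termination_trap("blk", entry, requester=7))  # [0, 1, 2, 3, 4, 5]
print(entry.ack_ctr, entry.mode, entry.state)       # 6 Normal Write Transaction
```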

16
PERFORMANCE MEASUREMENT

Comparison of the performance of
limited, LimitLESS, and full-map directories.
Evaluated in terms of the total number of cycles
needed to execute an application on a
64-processor Alewife machine.

17
Measurement Technique
ASIM, the Alewife System Simulator
18
Performance Results
-> Four-pointer limited protocol, full-map
protocol, and LimitLESS scheme with Ts = 50
-> 64-node Alewife machine with 64K-byte caches
and a 2D mesh network
19
Performance Results (contd..)
-> Results when the variable in Weather is not
optimised.
20
Performance Results (contd..)
-> Results when the variable in Weather is
optimised.
21
Performance Results (Contd..)
-> Results when the emulation latency is 50 for
the LimitLESS protocol.
22
Conclusion

This paper proposed a new scheme for cache
coherence, called LimitLESS, which is being
implemented in the Alewife machine.
Hardware requirements include rapid trap handling
and a flexible processor interface to the
network.
Preliminary simulation results indicate that the
LimitLESS scheme approaches the performance of a
full-map directory protocol with the memory
efficiency of a limited directory protocol.
Furthermore, the LimitLESS scheme provides a
migration path toward a future in which cache
coherence is handled entirely in software.