Title: The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor
1 The Directory-Based Cache Coherence Protocol for
the DASH Multiprocessor
- Computer Systems Laboratory
- Stanford University
- Daniel Lenoski, James Laudon,
- Kourosh Gharachorloo, Anoop Gupta,
- and John Hennessy
2 Designing low-cost, high-performance
multiprocessors
- Message-passing (multicomputer)
- - distributed address space, local access only
- + more scalable
- - more cumbersome to program
- Shared-memory (multiprocessor)
- - single address space, remote access
- + simplicity (data partitioning, dynamic load
distribution)
- - consumes bandwidth, requires cache coherence
3 DASH (Directory Architecture for SHared memory)
- Distributes shared main memory among the processing
nodes to provide scalable memory bandwidth
- Distributed directory-based protocol to support
cache coherence
4 DASH architecture
- Processing node (cluster)
- - bus-based multiprocessor
- - snoopy protocol; amortizes cost of directory logic and
network interface
- Set of clusters
- - mesh interconnection network
- - distributed directory-based protocol keeps
summary info for each memory line, specifying the
clusters that are caching it
6 Details
- Cache--individual to each processor
- Memory--shared by processors within the same
cluster
- Directory memory--keeps track of all processors
caching a block; sends point-to-point messages
(invalidate/update), avoiding broadcast
- Remote Access Cache (RAC)--maintains state of
currently outstanding requests and buffers replies
from the network, releasing the waiting processor for
bus arbitration
7 Designing a distributed directory-based protocol
- Correctness issues
- - memory consistency model: strongly constrained or
less constrained?
- - deadlock: cycles in which consuming one request
requires generating the next
- - error handling: manage data integrity and fault
tolerance
- Performance issues
- - latency
- write misses: write buffer, release consistency
model
- read misses: minimize inter-cluster messages and
message delay
- - bandwidth: reduce serialization (queuing
delays) and message traffic; caches and distributed
memory in DASH
- Distributed control and complexity issues
- - distribute control to components; balance
system performance against complexity of the components
9 DASH prototype
- Cluster (node)
- Silicon Graphics PowerStation 4D/240
- 4 processors (MIPS R3000/R3010)
- L1 (64-Kbyte instruction, 64-Kbyte write-through
data)
- L2 (256-Kbyte write-back): converts the L1 write-through
protocol to write-back, holds cache
tags for snooping, and maintains consistency using the
Illinois MESI protocol
11 Memory bus
- Split into a 32-bit address bus and a 64-bit data bus
- Supports memory-to-cache and cache-to-cache transfers
- 16 bytes every 4 bus clocks with a latency of 6
bus clocks; maximum bandwidth 64 MB/s
- Retry mechanism: when a request requires service
from a remote cluster, the request is
signaled to retry, and the requesting
processor is masked from bus arbitration until the
reply arrives, avoiding unnecessary retries
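As a quick check of these numbers: a transfer of 16 bytes every 4 bus clocks yields the quoted 64-MB/s peak if the bus clock is about 16 MHz. The slide gives only the ratio; the clock rate is our inference.

```python
# Sanity check of the memory-bus figures above. The 16 MHz bus clock is
# an assumption consistent with the quoted 64-MB/s peak, not stated on
# the slide.
BUS_CLOCK_HZ = 16_000_000      # assumed bus clock rate
BYTES_PER_TRANSFER = 16
CLOCKS_PER_TRANSFER = 4

peak_bandwidth = BYTES_PER_TRANSFER * BUS_CLOCK_HZ // CLOCKS_PER_TRANSFER
print(peak_bandwidth)  # 64000000 bytes/s, i.e. 64 MB/s
```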
12 Modifications
- Directory controller board
- - maintains inter-cluster cache coherence and
interfaces to the interconnection network
- Directory controller (DC)--contains the directory
memory corresponding to this cluster's portion of main memory;
initiates outbound network requests
- Pseudo-CPU (PCPU)--buffers incoming requests and
issues them on the bus
- Reply controller (RC)--tracks outstanding
requests made by local processors, receives and
buffers the corresponding replies from remote
clusters, and acts as memory in case of a request retry
- Interconnection network--2 wormhole-routed meshes
(request and reply)
- HW monitoring logic, miscellaneous control and
status registers--logic samples directory board
and bus events to derive usage and performance
statistics
15 Directory memory
- - array of directory entries
- - one entry for each memory block
- - single state bit (shared/dirty)
- - a bit vector of pointers to each of the 16
clusters
- - directory information is combined with the bus
operation, address, and result of snooping within
the cluster
- - DC generates network messages and bus controls
16 Assume N processors. With each
cache block in memory: N presence bits (bit
vector) and 1 dirty bit (state bit)
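A minimal sketch of one such directory entry, as described on the two slides above: a single dirty bit plus a presence bit per cluster (16 in the prototype). The class and method names are ours, not DASH's actual hardware encoding.

```python
# Sketch of a DASH-style directory entry: one state (dirty) bit plus an
# N-bit presence vector, N = 16 clusters in the prototype.
NUM_CLUSTERS = 16

class DirectoryEntry:
    def __init__(self):
        self.dirty = False   # state bit: False = shared, True = dirty
        self.presence = 0    # bit i set => cluster i caches the block

    def add_sharer(self, cluster):
        self.presence |= 1 << cluster

    def sharers(self):
        return [c for c in range(NUM_CLUSTERS) if (self.presence >> c) & 1]

entry = DirectoryEntry()
entry.add_sharer(3)
entry.add_sharer(7)
print(entry.sharers())  # [3, 7]
```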
17 Remote Access Cache (RAC)
- Maintains state of currently outstanding
requests, managed by the RC
- Buffers replies from the network until the waiting
processor is released for bus arbitration
- Supplements the functionality of the
processors' caches
- Supplies data cache-to-cache when the released
processor retries the access
18 DASH cache coherence protocol
- Local cluster
- the cluster that contains the processor
originating a given request
- Home cluster
- the cluster that contains the main memory and
directory for a given physical memory address
- Remote cluster
- any other cluster
- Owning cluster
- a cluster that owns a dirty memory block
- Local memory
- the main memory associated with the local
cluster
- Remote memory
- any memory whose home is not the local cluster
19 DASH cache coherence protocol
- Invalidation-based ownership protocol
- Memory block states
- Uncached-remote--not cached by any remote
cluster
- Shared-remote--cached in an unmodified state by
one or more remote clusters
- Dirty-remote--cached in a modified state by a
single remote cluster
- Cache block states
- Invalid--the copy in the cache is stale
- Shared--other processors may be caching that location
- Dirty--this cache contains an exclusive copy of
the memory block, and the block has been
modified
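The two state spaces on this slide, written out as enums for reference. The naming is ours; in the real hardware the directory state is encoded as a single bit plus the presence vector.

```python
from enum import Enum, auto

# Directory (memory-block) states and cache-line states from the slide.
class DirectoryState(Enum):
    UNCACHED_REMOTE = auto()   # not cached by any remote cluster
    SHARED_REMOTE = auto()     # unmodified copies in >= 1 remote cluster
    DIRTY_REMOTE = auto()      # modified copy in exactly 1 remote cluster

class CacheLineState(Enum):
    INVALID = auto()           # copy is stale
    SHARED = auto()            # other caches may hold the line
    DIRTY = auto()             # exclusive, modified copy

print([s.name for s in DirectoryState])
```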
20 3 primitive operations
- Read request (load)
- Hit in L1: L1 simply supplies the data
- Hit in L2: a fill operation brings the required
block into L1
- Otherwise, send a read request on the bus
- Shared-local: simply transfer over the bus
- Dirty-local: RAC takes ownership of the cache line
- Uncached-remote/shared-remote: home sends data over
the reply network to the requesting cluster
- Dirty-remote: forward the request to the owning cluster;
the owning cluster sends data to the requesting cluster
and a sharing write-back request to the home cluster
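The remote cases above can be sketched as a dispatch at the home cluster's directory. This is a hedged illustration under our own message names and tuple format, not DASH's literal message encoding.

```python
# Sketch of the read-miss handling at the home directory. Each message
# is (name, source cluster, destination cluster); names are ours.
def handle_read(state, home, requester, owner=None):
    """Return the message sequence for a read reaching the home cluster."""
    if state in ("uncached-remote", "shared-remote"):
        # Home memory replies with the data over the reply mesh.
        return [("read-reply", home, requester)]
    if state == "dirty-remote":
        # Forward to the owner; it answers the requester directly and
        # sends a sharing write-back so home memory is updated too.
        return [("forward-read", home, owner),
                ("read-reply", owner, requester),
                ("sharing-writeback", owner, home)]
    raise ValueError(f"unknown state: {state}")

print(handle_read("dirty-remote", home=0, requester=1, owner=2))
```

Note how the dirty-remote case never routes the data through home first; that direct owner-to-requester reply is the forwarding strategy discussed on the next slide.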
21 Forwarding strategy
- + reduces latency through direct responses
- + processes many requests simultaneously
(multithreaded), reducing serialization
- - additional latency when simultaneous accesses
are made to the same block: the 1st request is
satisfied and the dirty cluster loses ownership, so the 2nd
request returns a negative acknowledge (NAK) that
forces a retry of the access
22 Read-exclusive request (store)
- In local memory: write and invalidate other
copies
- Dirty-remote: the owning processor invalidates that
block in its cache, sends a grant of ownership and the
data to the requesting cluster, and sends an ownership-update
message to the home cluster
- Uncached-remote/shared-remote: write; send
invalidation requests for any shared copies
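A companion sketch for the store cases above, again under our own message names. Invalidation acknowledgments are shown flowing to the requester, which collects them (the point of the acknowledgment slide that follows).

```python
# Sketch of read-exclusive (store-miss) handling at the home directory.
# Each message is (name, source cluster, destination cluster).
def handle_read_exclusive(state, home, requester, sharers=(), owner=None):
    if state == "dirty-remote":
        # Owner gives up the line, grants ownership with the data, and
        # tells home to update the directory.
        return [("forward-rdex", home, owner),
                ("ownership-and-data", owner, requester),
                ("ownership-update", owner, home)]
    # uncached-remote / shared-remote: home grants ownership, then any
    # remote sharers are invalidated and ack to the requester.
    msgs = [("rdex-reply", home, requester)]
    msgs += [("invalidate", home, s) for s in sharers]
    msgs += [("invalidate-ack", s, requester) for s in sharers]
    return msgs

print(handle_read_exclusive("shared-remote", 0, 1, sharers=(2, 3)))
```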
23 Acknowledgments
- needed for the requesting processor
to know when the store has completed with
respect to all processors
- maintain consistency: guarantee that the new owner will not lose ownership
before the directory has been updated
24 Write-back request
- A dirty cache line that is replaced must be
written back to memory
- Home cluster is local: write back to local main memory
- Home cluster is remote: send a message to the
remote home cluster, which updates its main memory
and marks the block uncached-remote
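The two write-back cases can be sketched as follows; `directory` here is an illustrative map from block to its state at the home cluster.

```python
# Sketch of the write-back cases above; names are illustrative.
def handle_writeback(block, local, home, directory):
    """Replace a dirty line: the data must go back to the home memory."""
    if home == local:
        # Dirty data retires to local main memory over the bus.
        return ["write-back on local bus"]
    # Remote home: one message carries the data; home updates its main
    # memory and marks the block uncached-remote in the directory.
    directory[block] = "uncached-remote"
    return [f"writeback-message {local}->{home}"]

d = {"A": "dirty-remote"}
print(handle_writeback("A", local=1, home=0, directory=d), d["A"])
```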
25 Bus-initiated cache transactions
- Transactions made by caches snooping the bus
- Read operation: a dirty cache supplies the data and
changes to the shared state
- Read-exclusive operation: invalidate all other
cached copies
- When a line in L2 is invalidated, L1 does the same
26 Exception conditions
- A request forwarded to a dirty cluster may
arrive there to find that the dirty cluster no
longer owns the data
- A prior access changed ownership
- The owning cluster performed a write-back
- Solution: the requesting cluster is sent a NAK response
and is required to reissue the request (release the
arbitration mask, treating it as a new request)
27 - Ownership bouncing between two remote clusters:
the requesting cluster receives multiple NAKs
- Time-out
- Return a bus error
- Solution: add an additional directory state (access
queue) that responds to all read-only requests and
grants ownership to each exclusive request on a
pseudo-random basis
28 - Separate request and reply networks: some messages sent
between 2 clusters can be received out of order
- Solution: acknowledge replies; out-of-order requests
receive a NAK response
29 - An invalidation request may overtake the read reply
whose copy it is trying to purge
- Solution: when the RAC detects an invalidation request for
a pending read, it changes the state of that RAC entry to
invalidate-read-pending; the RC then assumes that the read
reply is stale and treats the reply as a NAK
response
30 Deadlock
- HW
- 2 mesh networks, point-to-point message passing
- - consumption of an incoming message may require
the generation of another outgoing message
- Protocol
- Request messages
- read, read-exclusive, and invalidation requests
- Reply messages
- read and read-exclusive replies, invalidation acks
- Separate meshes by function
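The split above can be sketched as a simple classification. The key property is that replies can always be consumed without generating further messages, so the reply mesh can always drain, breaking the request-reply dependence cycle. Message names are ours.

```python
# Sketch of the request/reply mesh split described above.
REQUEST_MESSAGES = {"read", "read-exclusive", "invalidate"}
REPLY_MESSAGES = {"read-reply", "read-exclusive-reply", "invalidate-ack"}

def mesh_for(message):
    """Route each protocol message onto its dedicated mesh."""
    if message in REQUEST_MESSAGES:
        return "request-mesh"
    if message in REPLY_MESSAGES:
        return "reply-mesh"
    raise ValueError(message)

print(mesh_for("invalidate"), mesh_for("invalidate-ack"))
```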
31 Error handling
- Error-checking hardware
- ECC on main memory
- Parity checking on directory memory
- Length checking of network messages
- Checking for inconsistent bus and network messages
- Errors reported to the processor through bus errors and
associated error-capture registers
- The issuing processor times out the originating request or
fencing operation; the OS can clean up the state of a
line by using back-door paths that allow direct
addressing of the RAC and directory memory
32 Scalability of the DASH directory
- Amount of directory memory grows as memory size x number of processors
- Limited pointers per entry: no space wasted on
processors that are not caching the line
- Allow pointers to be shared between directory
entries
- Use a cache of directory entries to supplement or
replace the normal directory
- Sparse directories: limited pointers plus a coarse
vector
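A back-of-the-envelope calculation of why the full bit-vector scheme does not scale: one presence bit per cluster (plus the state bit) for every memory block means the overhead grows linearly with machine size. The 16-byte block size matches the prototype; treat the numbers as illustrative.

```python
# Directory overhead of a full bit-vector scheme, as a fraction of the
# memory it describes.
def directory_overhead(num_clusters, block_bytes=16):
    bits_per_entry = num_clusters + 1        # presence vector + state bit
    return bits_per_entry / (block_bytes * 8)

print(f"{directory_overhead(16):.3f}")   # ~0.133 at the prototype's 16 clusters
print(f"{directory_overhead(256):.3f}")  # exceeds 2.0 at 256 clusters
```

At 256 clusters the directory would outweigh the memory it tracks, which is what motivates the limited-pointer and sparse-directory alternatives listed above.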
33 Validation of the protocol
- 2 SW-simulator-based testing methods
- Low-level DASH system simulator that incorporates
the coherence protocol, caches, buses, and
interconnection network
- High-level functional simulator that models the
processors and executes parallel programs
- 2 schemes for testing the protocol
- Running existing parallel programs and comparing
output
- Test scripts
- Hardware testing
34 Comparison with the Scalable Coherent Interface
(SCI) protocol
- Similarities
- - rely on coherent caches maintained by
distributed directories
- - rely on distributed memories to provide
scalable memory bandwidth
- Differences
- - in SCI, the directory is a distributed sharing list
maintained by the caches
- - in DASH, all the directory info is placed with
main memory
35 - SCI advantages
- - amount of directory pointer storage grows naturally
with the number of processors
- - employs the SRAM technology used by caches
- - guarantees forward progress in all cases
- SCI disadvantages
- - distributed directory entries increase the complexity and
latency of the directory protocol; additional
update messages must be sent between caches
- - requires more inter-node communication