1
The Directory-Based Cache Coherence Protocol for
the DASH Multiprocessor
  • Computer Systems Laboratory
  • Stanford University
  • Daniel Lenoski, James Laudon,
  • Kourosh Gharachorloo, Anoop Gupta,
  • and John Hennessy

2
Designing a low-cost, high-performance multiprocessor
  • Message-passing (multicomputer)
  • - distributed address space, local access only
  • + more scalable
  • − more cumbersome to program
  • Shared-memory (multiprocessor)
  • - single address space, remote access
  • + programming simplicity (data partitioning, dynamic load distribution)
  • − consumes bandwidth, requires cache coherence

3
DASH (Directory Architecture for Shared memory)
  • Distributes shared main memory among the processing nodes to provide scalable memory bandwidth
  • Uses a distributed directory-based protocol to support cache coherence

4
DASH architecture
  • Processing node (cluster)
  • - bus-based multiprocessor
  • - snoopy protocol; amortizes the cost of the directory logic and network interface
  • Set of clusters
  • - mesh interconnection network
  • - distributed directory-based protocol keeps summary information for each memory line, specifying the clusters that are caching it

6
Details
  • Cache -- private to each processor
  • Memory -- shared by the processors within the same cluster
  • Directory memory -- keeps track of all clusters caching a block; sends point-to-point messages (invalidate/update) to avoid broadcast
  • Remote Access Cache (RAC) -- maintains the state of currently outstanding requests and buffers replies from the network, releasing the waiting processor for bus arbitration

7
Designing a distributed directory-based protocol
  • Correctness issues
  • - memory consistency model: strongly constrained or less constrained?
  • - deadlock: request loops, where servicing one request requires generating another
  • - error handling: managing data integrity and fault tolerance
  • Performance issues
  • - latency
  •   write misses: write buffer, release consistency model
  •   read misses: minimize the number and delay of inter-cluster messages
  • - bandwidth: reduce serialization (queuing delays) and message traffic; caches and distributed memory in DASH
  • Distributed control complexity issues
  • - distribute control to the components, balancing system performance against the complexity of the components

9
DASH prototype
  • Cluster (node)
  • Silicon Graphics PowerStation 4D/240
  • 4 processors (MIPS R3000/R3010)
  • L1 (64-Kbyte instruction, 64-Kbyte write-through data)
  • L2 (256-Kbyte write-back): converts write-through to write-back, provides cache tags for snooping, maintains consistency using the Illinois MESI protocol

11
  • Memory bus
  • Separated into a 32-bit address bus and a 64-bit data bus
  • Supports memory-to-cache and cache-to-cache transfers
  • 16 bytes every 4 bus clocks with a latency of 6 bus clocks; maximum bandwidth 64 MB/s
  • Retry mechanism: when a request requires service from a remote cluster, it is signaled to retry; the requesting processor is masked and later unmasked to avoid unnecessary retries
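The quoted peak bandwidth can be sanity-checked with a few lines of arithmetic. This is an illustrative sketch, not part of the original slides; the 16 MHz bus clock is an assumption about the 4D/240 bus, chosen because it makes the figures consistent.

```python
# Sketch: checking the quoted MPBUS peak bandwidth.
# Assumption: a 16 MHz bus clock (not stated on this slide).
BUS_CLOCK_HZ = 16_000_000
BYTES_PER_TRANSFER = 16      # 16 bytes per transfer, from the slide
CLOCKS_PER_TRANSFER = 4      # every 4 bus clocks, from the slide

transfers_per_second = BUS_CLOCK_HZ / CLOCKS_PER_TRANSFER
peak_bandwidth_mb_s = transfers_per_second * BYTES_PER_TRANSFER / 1_000_000
print(peak_bandwidth_mb_s)   # 64.0, matching the slide's 64 MB/s
```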

12
Modifications
  • Directory controller board
  • - maintains inter-node cache coherence; interfaces to the interconnection network
  • Directory controller (DC) - contains the directory memory corresponding to this cluster's portion of main memory; initiates outbound network requests
  • Pseudo-CPU (PCPU) - buffers incoming requests and issues them on the bus
  • Reply controller (RC) - tracks outstanding requests made by local processors; receives and buffers the corresponding replies from remote clusters; acts as memory in case of a request retry
  • Interconnection network - 2 wormhole-routed meshes (request and reply)
  • HW monitoring logic, miscellaneous control and status registers - logic samples directory-board and bus events to derive usage and performance statistics

15
  • Directory memory
  • - array of directory entries
  • - one entry for each memory block
  • - single state bit (shared/dirty)
  • - a bit vector of pointers, one for each of the 16 clusters
  • - directory information is combined with the bus operation, address, and result of snooping within the cluster
  • - the DC generates network messages and bus controls

16
Assume N processors. With each cache block in memory: N presence bits (bit vector) and 1 dirty bit (state bit).
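The entry format described above (a presence bit-vector plus a single dirty bit) can be sketched in a few lines. This is an illustrative model, not the hardware encoding; the class and method names are invented for the sketch.

```python
# Minimal sketch of a DASH-style directory entry: one dirty bit plus a
# 16-bit presence vector (one bit per cluster). Names are illustrative.
class DirectoryEntry:
    NUM_CLUSTERS = 16

    def __init__(self):
        self.dirty = False   # single state bit: shared vs. dirty
        self.presence = 0    # bit vector: which clusters cache the block

    def add_sharer(self, cluster):
        self.presence |= (1 << cluster)

    def remove_sharer(self, cluster):
        self.presence &= ~(1 << cluster)

    def sharers(self):
        return [c for c in range(self.NUM_CLUSTERS) if self.presence >> c & 1]

entry = DirectoryEntry()
entry.add_sharer(3)
entry.add_sharer(7)
print(entry.sharers())   # [3, 7]
```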
17
  • Remote Access Cache (RAC)
  • Maintains the state of currently outstanding requests (managed by the RC)
  • Buffers replies from the network; the waiting processor is released for bus arbitration
  • Supplements the functionality of the processors' caches
  • Supplies data cache-to-cache when the released processor retries the access

18
DASH cache coherence protocol
  • Local cluster
  • a cluster that contains the processor
    originating a given request
  • Home cluster
  • the cluster which contains the main memory and
    directory for a given physical memory address
  • Remote cluster
  • any other cluster
  • Owning cluster
  • the cluster that owns a dirty memory block
  • Local memory
  • the main memory associated with the local cluster
  • Remote memory
  • any memory whose home is not the local cluster

19
DASH cache coherence protocol
  • Invalidation-based ownership protocol
  • Memory block states
  • Uncached-remote -- not cached by any remote cluster
  • Shared-remote -- cached in an unmodified state by one or more remote clusters
  • Dirty-remote -- cached in a modified state by a single remote cluster
  • Cache block states
  • Invalid -- the copy in the cache is stale
  • Shared -- other processors may be caching that location
  • Dirty -- this cache contains an exclusive copy of the memory block, and the block has been modified

20
3 primitive operations
  • Read request (load)
  • Hit in L1: L1 simply supplies the data
  • Hit in L2: a fill operation brings the required block into L1
  • Otherwise, a read request is sent on the bus:
  • Shared-local: data is simply transferred over the bus
  • Dirty-local: the RAC takes ownership of the cache line
  • Uncached-remote/shared-remote: the home cluster sends the data over the reply network to the requesting cluster
  • Dirty-remote: the request is forwarded to the owning cluster, which sends the data to the requesting cluster and a sharing write-back request to the home cluster
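The remote read cases above amount to a dispatch on the block's directory state at the home cluster. The sketch below is a simplification under assumed names (a single function returning `(message, from, to)` tuples); the real protocol runs as distributed hardware, not one routine.

```python
# Sketch of the remote read-request cases above, dispatching on the
# block's directory state at the home cluster. States and message names
# follow the slide; the single-function structure is a simplification.
def handle_read(state, local, home, owner=None):
    """Return the inter-cluster messages generated for a read that
    misses in the local cluster, as (message, source, destination)."""
    if state in ("uncached-remote", "shared-remote"):
        # Home memory holds a clean copy: reply directly over the reply mesh.
        return [("reply-data", home, local)]
    if state == "dirty-remote":
        # Forward to the owner; it replies to the requester and sends a
        # sharing write-back to home so memory becomes consistent.
        return [("forward-read", home, owner),
                ("reply-data", owner, local),
                ("sharing-writeback", owner, home)]
    raise ValueError(state)

print(handle_read("dirty-remote", local=0, home=1, owner=2))
```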

21
  • Forwarding strategy
  • + reduces latency through direct responses
  • + processes many requests simultaneously (multithreaded), reducing serialization
  • − additional latency when simultaneous accesses are made to the same block: the 1st request is satisfied and the dirty cluster loses ownership; the 2nd request receives a negative acknowledgment (NAK) that forces a retry of the access

22
  • Read-exclusive request (store)
  • In local memory: write and invalidate other copies
  • Dirty-remote: the owning processor invalidates the block in its cache, sends a grant of ownership and the data to the requesting cluster, and sends an ownership-update message to the home cluster
  • Uncached-remote/shared-remote: write; send invalidation requests for copies in the shared state
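The read-exclusive cases above can be sketched in the same style as the read handler: a hypothetical dispatch function returning `(message, from, to)` tuples. The invalidation acknowledgments it emits are what let the requester know when the store is complete with respect to all processors.

```python
# Sketch of the read-exclusive (store) cases above. Message names are
# illustrative; the real protocol runs in distributed hardware.
def handle_read_exclusive(state, local, home, sharers=(), owner=None):
    """Return the inter-cluster messages for a store that misses locally."""
    if state in ("uncached-remote", "shared-remote"):
        msgs = [("reply-ownership", home, local)]
        # Invalidate every remote sharer; the requester collects the acks
        # so it knows when the store has completed everywhere.
        msgs += [("invalidate", home, s) for s in sharers]
        msgs += [("inval-ack", s, local) for s in sharers]
        return msgs
    if state == "dirty-remote":
        # The owner invalidates its copy, grants ownership and data to the
        # requester, and tells home to update the directory.
        return [("forward-rdex", home, owner),
                ("reply-ownership-data", owner, local),
                ("ownership-update", owner, home)]
    raise ValueError(state)
```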

23
Acknowledgments
- needed for the requesting processor to know when the store has completed with respect to all processors
- maintain consistency: guarantee that the new owner will not lose ownership before the directory has been updated
24
  • Write-back request
  • a dirty cache line that is replaced must be written back to memory
  • Home cluster is local: write back to main memory
  • Home cluster is remote: send a message to the remote home cluster, which updates its main memory and marks the block uncached-remote
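The two write-back cases above reduce to a single local/remote split; a minimal sketch in the same illustrative message style as the earlier handlers (names invented for the sketch):

```python
# Sketch of the write-back cases above: a replaced dirty line returns to
# its home memory either locally or via a network message that also
# marks the block uncached-remote in the home directory.
def handle_writeback(local, home):
    if home == local:
        return [("write-memory", local, local)]
    # Remote home: one message updates memory and the directory state.
    return [("writeback-msg", local, home)]
```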

25
Bus-initiated cache transactions
  • Transactions made by caches snooping the bus
  • Read operation: a dirty cache supplies the data and changes to the shared state
  • Read-exclusive operation: invalidate all other cached copies
  • When a line in L2 is invalidated, L1 does the same

26
Exception conditions
  • A request forwarded to a dirty cluster may arrive there to find that the dirty cluster no longer owns the data:
  • a prior access changed ownership
  • the owning cluster performed a write-back
  • Solution: the requesting cluster is sent a NAK response and is required to reissue the request (release the mask, treating it as a new request)

27
  • Ownership bouncing between two remote clusters: the requesting cluster receives multiple NAKs
  • Time-out
  • Return a bus error
  • Solution: add an additional directory state and an access queue; respond to all read-only requests, and grant ownership to each exclusive request on a pseudo-random basis

28
  • With separate request and reply networks, some messages sent between 2 clusters can be received out of order
  • Solution: acknowledgment replies; out-of-order requests receive a NAK response

29
  • An invalidation request overtakes the read reply whose copy it is trying to purge
  • Solution: when the RAC detects an invalidation request for a pending read, the state of that RAC entry changes to invalidated-read-pending; the RC then assumes the read reply is stale and treats it as a NAK response
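The RAC-entry state change described above can be sketched as two small transition functions. The state strings are illustrative, not the real RAC encoding.

```python
# Sketch of the slide's fix for an invalidation overtaking a read reply,
# modeled as RAC-entry state transitions. State names are illustrative.
def on_invalidate(rac_state):
    # An invalidation hitting a pending read marks the entry so the
    # eventual reply is known to be stale.
    return "invalidated-read-pending" if rac_state == "read-pending" else rac_state

def on_read_reply(rac_state):
    # A reply arriving after the invalidation is treated as a NAK,
    # forcing the processor to retry the read.
    return "nak-retry" if rac_state == "invalidated-read-pending" else "reply-ok"

# The problematic ordering: invalidate first, then the stale reply.
print(on_read_reply(on_invalidate("read-pending")))   # nak-retry
```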

30
Deadlock
  • HW
  • 2 mesh networks, point-to-point message passing
  • − consumption of an incoming message may require the generation of another outgoing message
  • Protocol
  • Request messages
  • read, read-exclusive, and invalidation requests
  • Reply messages
  • read and read-exclusive replies, invalidation acknowledgments
  • Separate meshes by function (requests vs. replies)

31
Error handling
  • Error-checking mechanisms
  • ECC on main memory
  • Parity checking on directory memory
  • Length checking of network messages
  • Checking for inconsistent bus and network messages
  • Errors are reported to the processor through bus errors and associated error-capture registers
  • The issuing processor times out the originating request or fencing operation; the OS can clean up the state of a line by using back-door paths that allow direct addressing of the RAC and directory memory

32
Scalability of the DASH directory
  • Amount of directory memory ∝ memory size × number of processors
  • Limited pointers per entry: no space is spent on processors that are not caching the line
  • Allow pointers to be shared between directory entries
  • Use a cache of directory entries to supplement or replace the full directory
  • Sparse directories: limited pointers plus a coarse vector
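The scaling concern above can be made concrete with a little arithmetic: a full bit vector costs one presence bit per cluster for every memory block, so overhead grows linearly with machine size. The 16-byte block size is an assumption taken from the prototype's bus-transfer granularity.

```python
# Illustrative arithmetic for full-bit-vector directory overhead:
# one presence bit per cluster plus one state bit, per memory block.
# Assumption: 16-byte blocks, matching the prototype's bus transfers.
def directory_overhead(num_clusters, block_bytes=16):
    """Directory bits per entry as a fraction of the bits in one block."""
    entry_bits = num_clusters + 1
    return entry_bits / (block_bytes * 8)

print(f"{directory_overhead(16):.1%}")    # 16 clusters: ~13% of memory
print(f"{directory_overhead(256):.1%}")   # 256 clusters: overhead explodes
```

This is why the slide's alternatives (limited pointers, directory caches, sparse directories with a coarse vector) trade precision for storage that grows more slowly than the full vector.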

33
Validation of the protocol
  • 2 software-simulator-based testing methods
  • a low-level DASH system simulator that incorporates the coherence protocol, caches, buses, and interconnection network
  • a high-level functional simulator that models the processors and executes parallel programs
  • 2 schemes for testing the protocol
  • running existing parallel programs and comparing output
  • test scripts
  • Hardware

34
Comparison with the Scalable Coherent Interface (SCI) protocol
  • Similarities
  • - both rely on coherent caches maintained by distributed directories
  • - both rely on distributed memories to provide scalable memory bandwidth
  • Differences
  • - in SCI, the directory is a distributed sharing list maintained by the caches
  • - in DASH, all directory information is placed with main memory

35
  • SCI advantages
  • - the amount of directory pointer storage grows naturally with the number of processors
  • - can employ the same SRAM technology used by the caches
  • - guarantees forward progress in all cases
  • SCI disadvantages
  • - distributed directory entries increase the complexity and latency of the directory protocol; additional update messages must be sent between caches
  • - requires more inter-node communication