1
Software Cache Coherent Shared Memory under Split-C
Ronny Krashinsky and Erik Machnicki
2
Motivation
  • Shared address space parallel programming is
    conceptually simpler than message passing
  • NOWs (networks of workstations) are more
    cost-effective than SMPs
  • However, NOWs are a more natural fit for message
    passing
  • Two approaches to supporting a shared address
    space on top of distributed memory:
  • 1. Simulate the hardware solution, using coherent
    replication
  • 2. Translate all accesses to shared variables into
    explicit messages

3
  • Split-C uses the second method (no caching)
  • This makes the Split-C implementation much
    simpler
  • The programmer labels variables as local or
    global
  • Global accesses become function calls into the
    Split-C library
  • Disadvantages
  • The demand on the programmer is much greater
  • The programmer must provide efficient data
    distribution and access
  • The programmer must manage "caching" by hand

4
Our Solution
  • Add automatic coherent caching to Split-C
  • SWCC-Split-C: Software Cache Coherent Split-C
  • (Almost) no changes to the Split-C programming
    language
  • The programmer gets a shared memory system with
    automatic replication on a NOW
  • The programmer's task is simpler: less emphasis
    on data placement
  • Good for irregular applications

5
  • Next
  • Design
  • Results
  • Conclusion

6
Design Overview
  • Fine-grained coherence at the level of blocks of
    memory
  • Simple MSI invalidate protocol
  • Directory structure tracks the state of blocks
    as they move through the system
  • Each block is associated with a home node
  • NACKs and retries are used to achieve coherence

7
Notation
8
  • Address Blocks
  • A Split-C global variable has a Processor Number
    and a Local Address (a virtual memory address)
  • SWCC partitions the entire address space into
    blocks
  • Coherence is maintained at the level of blocks
  • The upper bits of the Local Address part of a
    global variable determine its block address
    (see the sketch below)
  • Addresses associated with the directory structure
    and coherence protocol are block addresses
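To make the addressing concrete, here is a minimal C sketch of deriving a block address from a global pointer. The block size, macro, and type names are assumptions for illustration, not the actual SWCC-Split-C source.

    #include <stdint.h>

    #define BLOCK_SHIFT 6                               /* assumed 64-byte blocks */
    #define BLOCK_ADDR(la) ((uintptr_t)(la) >> BLOCK_SHIFT)

    typedef struct {
        int   proc;        /* processor number of the home node        */
        void *local_addr;  /* virtual memory address on that processor */
    } global_ptr;          /* illustrative stand-in for a Split-C global pointer */

    /* Coherence state is tracked per (proc, BLOCK_ADDR(local_addr)). */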

9
  • Directory Structure
  • Hash table of pointers to linked lists of
    directory entries
  • Lives in local memory (malloc'ed at the beginning
    of the program)

10
  • Directory Entry (sketched in C below)
  • Block Addr
  • State
  • Data
  • Linked list pointer
  • User vector (maintained and used only by the home
    node)
  • There is a directory entry for every shared block
    that a program accesses (not only at the home node)
  • At the home node, the directory entry holds a copy
    of the block's local memory
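The directory structures on the last two slides could look roughly like the following C sketch. BLOCK_SIZE, DIR_HASH_SIZE, the 64-bit user vector, and the field names are all assumptions, not the implementation's code.

    #include <stdint.h>

    #define BLOCK_SIZE    64       /* assumed coherence block size */
    #define DIR_HASH_SIZE 4096     /* assumed hash table size      */

    enum blk_state { INVALID, SHARED, MODIFIED, READ_BUSY, WRITE_BUSY };

    typedef struct dir_entry {
        uintptr_t         block_addr;        /* block address this entry tracks  */
        enum blk_state    state;             /* protocol state of the local copy */
        uint8_t           data[BLOCK_SIZE];  /* copy of the block's data         */
        struct dir_entry *next;              /* chain for hash table collisions  */
        uint64_t          user_vector;       /* sharer bitmap; home node only    */
    } dir_entry;

    /* Hash table of pointers to linked lists, malloc'ed at program start. */
    dir_entry **directory;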

11
  • Directory Lookup (hit); a code sketch follows
    this list
  • Calculate the directory hash table index
  • Load the address of the directory entry
  • Load the block addr field of the directory entry
  • Check that it matches the block addr of the
    global variable
  • Load the state of the directory entry
  • Check the state of the entry
  • Perform the memory access
  • "Only user" optimization (the home node is the
    block's only user)
  • Check that the node is the home node
  • Calculate the directory hash table index
  • Load the entry from the directory hash table
  • Check that the entry is NULL
  • Perform the memory access
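Building on the structures sketched above, this is a hedged sketch of the read-hit fast path; swcc_read_miss and the modulo hash are illustrative stand-ins for the real implementation.

    double swcc_read_miss(global_ptr gp);  /* hypothetical: READ_REQ to home, retry on NACK */

    double swcc_read_double(global_ptr gp) {
        uintptr_t  ba = BLOCK_ADDR(gp.local_addr);
        dir_entry *e  = directory[ba % DIR_HASH_SIZE];   /* hash table index */
        while (e != NULL && e->block_addr != ba)         /* walk the chain   */
            e = e->next;
        if (e != NULL && (e->state == SHARED || e->state == MODIFIED)) {
            uintptr_t off = (uintptr_t)gp.local_addr & (BLOCK_SIZE - 1);
            return *(double *)&e->data[off];             /* hit: read the cached copy */
        }
        return swcc_read_miss(gp);                       /* miss or busy: go remote */
    }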

12
  • Coherence Protocol
  • 3 stable states: Modified, Shared, Invalid
  • Also two transient states: Read-Busy and
    Write-Busy
  • If the data is available in the appropriate
    state, no communication.
  • Otherwise, the local node sends a request to the
    home node. The home node does the necessary
    processing to reply with the data, and may send
    invalidate or flush requests to remote nodes.
  • Serialization at the home node: NACKs and retries
  • Messages sent via Active Messages
  • Active Message deadlock rules?
  • State transition diagrams (simplified) follow;
    a home-node handler sketch appears after this list
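As one way to picture the home node's side of the protocol, here is a sketch of a READ_REQ handler; home_lookup, owner_of, and the am_* calls are hypothetical stand-ins for the Active Messages layer, not its actual API.

    /* Hypothetical helpers standing in for the AM layer and directory. */
    void am_reply_data(int node, uintptr_t ba, uint8_t *data, int len);
    void am_nack(int node, uintptr_t ba);
    void am_flush_req(int owner, uintptr_t ba);
    dir_entry *home_lookup(uintptr_t ba);   /* home-side directory lookup */
    int        owner_of(dir_entry *e);      /* owner from the user vector */

    void read_req_handler(int requester, uintptr_t ba) {
        dir_entry *e = home_lookup(ba);
        switch (e->state) {
        case SHARED:                              /* clean: reply immediately    */
            e->user_vector |= 1ull << requester;  /* record the new sharer       */
            am_reply_data(requester, ba, e->data, BLOCK_SIZE);
            break;
        case MODIFIED:                            /* dirty at a remote owner     */
            e->state = READ_BUSY;                 /* serialize at the home node  */
            am_flush_req(owner_of(e), ba);        /* recall the dirty copy first */
            break;
        default:                                  /* Read-Busy / Write-Busy      */
            am_nack(requester, ba);               /* NACK; the requester retries */
        }
    }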

13
[State transition diagram for the local node. Transitions are labeled event / message: READ and WRITE with no message when the block is already held, READ / READ_REQ and WRITE / WRITE_REQ when a request must go to the home node, and READ_RESP / WRITE_RESP arrivals completing the transitions.]
14
[Figure-only slide; no transcript available]
15
[State transition diagram for the remote node. Transitions are labeled request / response: FLUSH_REQ / FLUSH_RESP, FLUSH_X_REQ / FLUSH_X_RESP, and INV_REQ / INV_RESP.]
16
  • Other Design Points
  • Race conditions: a write-lock flag
  • Non-FIFO network: NACKs and retries
  • Duplicate requests
  • Bulk transactions
  • Stores

17
Performance Results
  • Micro-Benchmarks
  • Matrix-Multiply
  • EM3D

18
Read Micro-Benchmarks
19
Write Micro-Benchmarks
20
  • Matrix Multiplication
  • Naïve (sketched below)
  • Blocked
  • Optimized Blocked
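For reference, the naive version is essentially the textbook triple loop, shown here in plain C (in the benchmark the matrices are global spread arrays). The point of the comparison: without caching, every remote B element costs a network round trip; under SWCC, only the first access to each block misses.

    void naive_mm(int n, double A[n][n], double B[n][n], double C[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i][k] * B[k][j];   /* B[k][j] may be remote */
                C[i][j] = sum;
            }
    }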

21
Naïve MM, Scaling the Number of Processors
22
Naïve MM, Scaling the Matrix Size
23
MM, Fixed Size, Fixed Resources, Different Versions
[Chart: MFLOPS on a log scale vs. block size for the Naive, Basic Blocked, and Optimized Blocked versions.]
24
  • EM3D
  • H nodes and E nodes depend on each other
  • Each iteration the values of H nodes are updated
    based on the values of the E nodes it depends on
    and vice-versa.
  • parameters
  • number of nodes
  • degree of nodes
  • remote probability
  • distance span
  • number of iterations
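A minimal sketch of one EM3D half-iteration, with node types invented for illustration; in the Split-C benchmark the dependency pointers are global, so under SWCC remote E values arrive through the coherent cache rather than via explicit messages.

    typedef struct graph_node {
        double              value;
        int                 degree;   /* number of dependencies */
        struct graph_node **deps;     /* nodes this one reads   */
        double             *coeffs;   /* weight per dependency  */
    } graph_node;

    /* Update every H node from the E nodes it depends on;
     * the E-node half-iteration is symmetric. */
    void update_h_nodes(graph_node *h_nodes, int n_h) {
        for (int i = 0; i < n_h; i++)
            for (int j = 0; j < h_nodes[i].degree; j++)
                h_nodes[i].value -=
                    h_nodes[i].coeffs[j] * h_nodes[i].deps[j]->value;
    }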

25
EM3D, Scaling Remote Dependency Percentage
26
EM3D, Scaling Number of Processors
27
Conclusions
  • Automatic coherent caching can make the
    programmer's life easier
  • Initial data placement is less important
  • For some applications it is even more difficult
    to predict access patterns or do caching in the
    user program, e.g. Barnes-Hut or ray tracing
  • Cache coherence is also useful in exploiting
    spatial locality
  • Sometimes caching isn't useful and just adds
    overhead; potentially the user or compiler could
    decide to use caching on a per-variable basis