1
Software Cache Coherent Shared Memory under Split-C
Ronny Krashinsky and Erik Machnicki
2
Motivation
  • Shared address space parallel programming is
    conceptually simpler than message passing
  • NOWs (networks of workstations) are more
    cost-effective than SMPs
  • However, NOWs are a more natural fit for message
    passing
  • Two approaches to supporting a shared address
    space on top of distributed memory:
  • 1. Simulate the hardware solution, using coherent
    replication
  • 2. Translate all accesses to shared variables into
    explicit messages

3
  • Split-C uses the second method (no caching)
  • This makes the Split-C implementation much
    simpler
  • The programmer labels variables as local or
    global
  • Global accesses become function calls into the
    Split-C library
  • Disadvantages
  • The demand on the programmer is much greater
  • The programmer must provide efficient data
    distribution and access
  • The programmer must manage "caching" by hand

4
Our Solution
  • Add automatic coherent caching to Split-C
  • SWCC-Split-C: Software Cache Coherent Split-C
  • (Almost) no changes to the Split-C programming
    language
  • The programmer gets a shared memory system with
    automatic replication on a NOW
  • The programmer's task is simpler: less emphasis
    on data placement
  • Good for irregular applications

5
  • Next
  • Design
  • Results
  • Conclusion

6
Design Overview
  • Fine-grained coherence at the level of blocks of
    memory
  • Simple MSI invalidate protocol
  • Directory structure tracks the state of blocks
    as they move through the system
  • Each block is associated with a home node
  • NACKs and retries are used to achieve coherence

7
Notation
8
  • Address Blocks
  • A Split-C global variable has a Processor Number
    and a Local Address (a virtual memory address)
  • SWCC partitions the entire address space into
    blocks
  • Coherence is maintained at the level of blocks
  • The upper bits of the Local Address part of a
    global variable determine its block address
    (see the sketch below)
  • Addresses associated with the directory structure
    and coherence protocol are block addresses
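To make the addressing concrete, here is a minimal C sketch of deriving a block address from a global pointer. The block size, macro, and type names are assumptions for illustration, not the actual SWCC-Split-C source.

    #include <stdint.h>

    #define BLOCK_SHIFT 6                               /* assumed 64-byte blocks */
    #define BLOCK_ADDR(la) ((uintptr_t)(la) >> BLOCK_SHIFT)

    typedef struct {
        int   proc;        /* processor number of the home node        */
        void *local_addr;  /* virtual memory address on that processor */
    } global_ptr;          /* illustrative stand-in for a Split-C global pointer */

    /* Coherence state is tracked per (proc, BLOCK_ADDR(local_addr)). */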

9
  • Directory Structure
  • Hash table of pointers to linked lists of
    directory entries
  • Lives in local memory (malloc'ed at the beginning
    of the program)

10
  • Directory Entry (sketched in C below)
  • Block Addr
  • State
  • Data
  • Linked list pointer
  • User vector (maintained and used only by the home
    node)
  • There is a directory entry for every shared block
    that a program accesses (not only at the home node)
  • At the home node, the directory entry holds a copy
    of the block's local memory
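The directory structures on the last two slides could look roughly like the following C sketch. BLOCK_SIZE, DIR_HASH_SIZE, the 64-bit user vector, and the field names are all assumptions, not the implementation's code.

    #include <stdint.h>

    #define BLOCK_SIZE    64       /* assumed coherence block size */
    #define DIR_HASH_SIZE 4096     /* assumed hash table size      */

    enum blk_state { INVALID, SHARED, MODIFIED, READ_BUSY, WRITE_BUSY };

    typedef struct dir_entry {
        uintptr_t         block_addr;        /* block address this entry tracks  */
        enum blk_state    state;             /* protocol state of the local copy */
        uint8_t           data[BLOCK_SIZE];  /* copy of the block's data         */
        struct dir_entry *next;              /* chain for hash table collisions  */
        uint64_t          user_vector;       /* sharer bitmap; home node only    */
    } dir_entry;

    /* Hash table of pointers to linked lists, malloc'ed at program start. */
    dir_entry **directory;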

11
  • Directory Lookup (hit); a code sketch follows
    this list
  • Calculate the directory hash table index
  • Load the address of the directory entry
  • Load the block addr field of the directory entry
  • Check that it matches the block addr of the
    global variable
  • Load the state of the directory entry
  • Check the state of the entry
  • Perform the memory access
  • "Only user" optimization (the home node is the
    block's only user)
  • Check that the node is the home node
  • Calculate the directory hash table index
  • Load the entry from the directory hash table
  • Check that the entry is NULL
  • Perform the memory access
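Building on the structures sketched above, this is a hedged sketch of the read-hit fast path; swcc_read_miss and the modulo hash are illustrative stand-ins for the real implementation.

    double swcc_read_miss(global_ptr gp);  /* hypothetical: READ_REQ to home, retry on NACK */

    double swcc_read_double(global_ptr gp) {
        uintptr_t  ba = BLOCK_ADDR(gp.local_addr);
        dir_entry *e  = directory[ba % DIR_HASH_SIZE];   /* hash table index */
        while (e != NULL && e->block_addr != ba)         /* walk the chain   */
            e = e->next;
        if (e != NULL && (e->state == SHARED || e->state == MODIFIED)) {
            uintptr_t off = (uintptr_t)gp.local_addr & (BLOCK_SIZE - 1);
            return *(double *)&e->data[off];             /* hit: read the cached copy */
        }
        return swcc_read_miss(gp);                       /* miss or busy: go remote */
    }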

12
  • Coherence Protocol
  • 3 stable states: Modified, Shared, Invalid
  • Also two transient states: Read-Busy and
    Write-Busy
  • If the data is available in the appropriate
    state, no communication.
  • Otherwise, the local node sends a request to the
    home node. The home node does the necessary
    processing to reply with the data, and may send
    invalidate or flush requests to remote nodes.
  • Serialization at the home node: NACKs and retries
  • Messages sent via Active Messages
  • Active Message deadlock rules?
  • State transition diagrams (simplified) follow;
    a home-node handler sketch appears after this list
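As one way to picture the home node's side of the protocol, here is a sketch of a READ_REQ handler; home_lookup, owner_of, and the am_* calls are hypothetical stand-ins for the Active Messages layer, not its actual API.

    /* Hypothetical helpers standing in for the AM layer and directory. */
    void am_reply_data(int node, uintptr_t ba, uint8_t *data, int len);
    void am_nack(int node, uintptr_t ba);
    void am_flush_req(int owner, uintptr_t ba);
    dir_entry *home_lookup(uintptr_t ba);   /* home-side directory lookup */
    int        owner_of(dir_entry *e);      /* owner from the user vector */

    void read_req_handler(int requester, uintptr_t ba) {
        dir_entry *e = home_lookup(ba);
        switch (e->state) {
        case SHARED:                              /* clean: reply immediately    */
            e->user_vector |= 1ull << requester;  /* record the new sharer       */
            am_reply_data(requester, ba, e->data, BLOCK_SIZE);
            break;
        case MODIFIED:                            /* dirty at a remote owner     */
            e->state = READ_BUSY;                 /* serialize at the home node  */
            am_flush_req(owner_of(e), ba);        /* recall the dirty copy first */
            break;
        default:                                  /* Read-Busy / Write-Busy      */
            am_nack(requester, ba);               /* NACK; the requester retries */
        }
    }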

13
[State transition diagram for the local node. Transitions are labeled event / message: READ and WRITE with no message when the block is already held, READ / READ_REQ and WRITE / WRITE_REQ when a request must go to the home node, and READ_RESP / WRITE_RESP arrivals completing the transitions.]
14
[Figure-only slide; no transcript available]
15
[State transition diagram for the remote node. Transitions are labeled request / response: FLUSH_REQ / FLUSH_RESP, FLUSH_X_REQ / FLUSH_X_RESP, and INV_REQ / INV_RESP.]
16
  • Other Design Points
  • Race conditions: a write-lock flag
  • Non-FIFO network: NACKs and retries
  • Duplicate requests
  • Bulk transactions
  • Stores

17
Performance Results
  • Micro-Benchmarks
  • Matrix-Multiply
  • EM3D

18
Read Micro-Benchmarks
19
Write Micro-Benchmarks
20
  • Matrix Multiplication
  • Naïve (sketched below)
  • Blocked
  • Optimized Blocked
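For reference, the naive version is essentially the textbook triple loop, shown here in plain C (in the benchmark the matrices are global spread arrays). The point of the comparison: without caching, every remote B element costs a network round trip; under SWCC, only the first access to each block misses.

    void naive_mm(int n, double A[n][n], double B[n][n], double C[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i][k] * B[k][j];   /* B[k][j] may be remote */
                C[i][j] = sum;
            }
    }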

21
Naïve MM, Scaling the Number of Processors
22
Naïve MM, Scaling the Matrix Size
23
MM, Fixed Size, Fixed Resources, Different Versions
[Chart: MFLOPS on a log scale vs. block size for the Naive, Basic Blocked, and Optimized Blocked versions.]
24
  • EM3D
  • H nodes and E nodes depend on each other
  • Each iteration the values of H nodes are updated
    based on the values of the E nodes it depends on
    and vice-versa.
  • parameters
  • number of nodes
  • degree of nodes
  • remote probability
  • distance span
  • number of iterations
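A minimal sketch of one EM3D half-iteration, with node types invented for illustration; in the Split-C benchmark the dependency pointers are global, so under SWCC remote E values arrive through the coherent cache rather than via explicit messages.

    typedef struct graph_node {
        double              value;
        int                 degree;   /* number of dependencies */
        struct graph_node **deps;     /* nodes this one reads   */
        double             *coeffs;   /* weight per dependency  */
    } graph_node;

    /* Update every H node from the E nodes it depends on;
     * the E-node half-iteration is symmetric. */
    void update_h_nodes(graph_node *h_nodes, int n_h) {
        for (int i = 0; i < n_h; i++)
            for (int j = 0; j < h_nodes[i].degree; j++)
                h_nodes[i].value -=
                    h_nodes[i].coeffs[j] * h_nodes[i].deps[j]->value;
    }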

25
EM3D, Scaling Remote Dependency Percentage
26
EM3D, Scaling Number of Processors
27
Conclusions
  • Automatic coherent caching can make the
    programmer's life easier
  • Initial data placement is less important
  • For some applications it is even more difficult
    to predict access patterns or do caching in the
    user program, e.g. Barnes-Hut or ray tracing
  • Cache coherence is also useful in exploiting
    spatial locality
  • Sometimes caching isn't useful and just adds
    overhead; potentially the user or compiler could
    decide to use caching on a per-variable basis