Title: TreadMarks: Shared Memory Computing on Networks of Workstations
1 TreadMarks: Shared Memory Computing on Networks of Workstations
- C. Amza, A. L. Cox, S. Dwarkadas, P. J. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel
- Rice University
2 INTRODUCTION
- Distributed shared memory (DSM) is a software abstraction allowing a set of workstations connected by a LAN to share a single paged virtual address space
- The key issue in building a software DSM is minimizing the amount of data communication among the workstation memories
3 Why bother with DSM?
- Key idea is to build fast parallel computers that are
  - Cheaper than shared-memory multiprocessor architectures
  - As convenient to use
4 Conventional parallel architecture
Figure: several CPUs connected to a single shared memory.
5 Today's architecture
- Clusters of workstations are much more cost-effective
  - No need to develop complex bus and cache structures
  - Can use off-the-shelf networking hardware
    - Gigabit Ethernet
    - Myrinet (1.5 Gb/s)
  - Can quickly integrate the newest microprocessors
6 Limitations of the cluster approach
- Communication within a cluster of workstations is through message passing
  - Much harder to program than concurrent access to a shared memory
- Many big programs were written for shared-memory architectures
  - Converting them to a message-passing architecture is a nightmare
7 Distributed shared memory
Figure: DSM merges the individual main memories of the workstations into one shared global address space.
8 Distributed shared memory
- DSM makes a cluster of workstations look like a shared-memory parallel computer
  - Easier to write new programs
  - Easier to port existing programs
- Key problem is that DSM only provides the illusion of having a shared-memory architecture
  - Data must still move back and forth among the workstations
9 Munin
- Developed at Rice University
- Based on software objects (variables)
- Used the processor's virtual memory to detect accesses to the shared objects
- Included several techniques for reducing consistency-related communication
- Only ran on top of the V kernel
10 Munin main strengths
- Excellent performance
- Portability of programs
  - Allowed programs written for a multiprocessor architecture to run on a cluster of workstations with a minimal number of changes ("dusty decks")
11 Munin main weakness
- Very poor portability of Munin itself
  - Depended on some features of the V kernel
  - Not maintained since the late 80's
12 TreadMarks
- Provides DSM as an array of bytes
- Like Munin,
  - Uses release consistency
  - Offers a multiple-writer protocol to fight false sharing
- Runs at user level on a number of UNIX platforms
- Offers a very simple user interface
13 First example: Jacobi iteration
- Illustrates the use of barriers
- A barrier is a synchronization primitive that forces processes reaching it to wait until all processes have reached it
  - Forces processes to wait until all of them have completed a specific step
14 Jacobi iteration: overall organization
- Operates on a two-dimensional array
- Each processor works on a specific band of rows
- Boundary rows are shared
15 Jacobi iteration: overall organization
- During each iteration step, each array element is set to the average of its four neighbors
- Averages are stored in a scratch matrix and copied later into the shared matrix
16 Jacobi iteration: the barriers
- Mark the end of each computation phase
- Prevent processes from continuing the computation before all other processes have completed the previous phase and the new values are "installed"
- Include an implicit release() followed by an implicit acquire()
  - To be explained later
17 Jacobi iteration: declarations

    #define M ...             /* number of rows */
    #define N ...             /* number of columns */
    float (*grid)[N];         /* shared array */
    float scratch[M][N];      /* private array */
18Jacobi iteration startup
main() Tmk_startup() if (Tmk_proc_id 0
) grid Tmk_malloc(MNsizeof(float))
initialize grid // if Tmk_barrier(0)
length M/Tmk_nprocs begin
lengthTmk_proc_id end length(Tmk_proc_id
1)
19 Jacobi iteration: main loop

        for (number of iterations) {
            for (i = begin; i < end; i++)
                for (j = 0; j < N; j++)
                    scratch[i][j] = (grid[i-1][j] + grid[i+1][j]
                                   + grid[i][j-1] + grid[i][j+1]) / 4;
            Tmk_barrier(1);
            for (i = begin; i < end; i++)
                for (j = 0; j < N; j++)
                    grid[i][j] = scratch[i][j];
            Tmk_barrier(2);
        } /* main loop */
    } /* main */
20 Second example: TSP
- Traveling salesman problem
  - Finding the shortest path through a number of cities
- Program keeps a queue of partial tours
  - Most promising at the end
21TSP declarations
queue_type Queue int Shortest_length int
queue_lock_id, min_lock_id
22 TSP: startup

    main()
    {
        Tmk_startup();
        queue_lock_id = 0;
        min_lock_id = 1;
        if (Tmk_proc_id == 0) {
            Queue = Tmk_malloc(sizeof(queue_type));
            Shortest_length = Tmk_malloc(sizeof(int));
            initialize Queue and Shortest_length;
        } /* if */
        Tmk_barrier(0);
23 TSP: while loop

        while (true) {
            Tmk_lock_acquire(queue_lock_id);
            if (queue is empty) {
                Tmk_lock_release(queue_lock_id);
                Tmk_exit();
            }
            keep adding to queue until a long, promising tour appears at the head;
            Path = delete the tour from the head;
            Tmk_lock_release(queue_lock_id);
24 TSP: end of main

            length = recursively try all cities not on Path,
                     find the shortest tour length;
            Tmk_lock_acquire(min_lock_id);
            if (length < *Shortest_length)
                *Shortest_length = length;
            Tmk_lock_release(min_lock_id);
        } /* while */
    } /* main */
25 Critical sections
- All accesses to shared variables are surrounded by a pair
  - Tmk_lock_acquire(lock_id)
  - ... accesses to the shared variables ...
  - Tmk_lock_release(lock_id)
26 Implementation Issues
- Consistency issues
- False sharing
27 Consistency model (I)
- Shared data are replicated at times
  - To speed up read accesses
- All workstations must share a consistent view of all data
- Strict consistency is not possible
28 Consistency model (II)
- Various authors have proposed weaker consistency models
  - Cheaper to implement
  - Harder to use in a correct fashion
- TreadMarks uses software release consistency
  - Only requires the memory to be consistent at specific synchronization points
29 SW release consistency (I)
- Well-written parallel programs use locks to achieve mutual exclusion when they access shared variables
  - P(mutex) and V(mutex)
  - lock(csect) and unlock(csect)
  - acquire( ) and release( )
- Unprotected accesses can produce unpredictable results
30 SW release consistency (II)
- SW release consistency only guarantees the correctness of operations performed within an acquire/release pair
- No need to export the new values of shared variables until the release
- Must guarantee that a workstation has received the most recent values of all shared variables when it completes an acquire
31 SW release consistency (III)

    shared int x;

    /* Process 1 */
    acquire( );
    x = 1;
    release( );    /* export x = 1 */

    /* Process 2 */
    acquire( );    /* wait for new value of x */
    x = 2;
    release( );    /* export x = 2 */
32 SW release consistency (IV)
- Must still decide when to propagate the updated values
- TreadMarks uses lazy release
  - Delays propagation until an acquire is issued
- Its predecessor Munin used eager release
  - New values of shared variables were propagated at release time
33 SW release consistency (V)
Figure: eager release pushes updates to all sharers at release time; lazy release defers them until another process performs an acquire.
34 False sharing
Figure: one processor repeatedly accesses x while another accesses y; because x and y lie on the same page, that page moves back and forth between the main memories of the workstations.
35 Multiple-writer protocol (I)
- Designed to fight false sharing
- Uses a copy-on-write mechanism
  - Whenever a process is granted access to write-shared data, the page containing these data is marked copy-on-write
  - The first attempt to modify the contents of the page results in the creation of a copy of the page being modified (the twin)
36 Creating a twin
37 Multiple-writer protocol (II)
- At release time, TreadMarks
  - Performs a word-by-word comparison of the page and its twin
  - Stores the diff in the space used by the twin page
  - Informs all processors having a copy of the shared data of the update
- These processors will request the diff the first time they access the page
38 Creating a diff
39 Example
- Before: the page holds x = 1, y = 2
- First write access: a twin holding x = 1, y = 2 is created, then the write proceeds
- After: the modified page (x = 3, y = 2) is compared with its twin; the diff records that the new value of x is 3
40 Multiple-writer protocol (III)
- TreadMarks could, but does not, check for conflicting updates to write-shared pages
41 The TreadMarks system
- Runs entirely at user level
- Links with programs written in C, C++ and Fortran
- Uses UDP/IP for communication (or AAL3/4 if the machines are connected by an ATM LAN)
- Uses the SIGIO signal to speed up the processing of incoming requests
- Uses the mprotect( ) system call to control access to shared pages
42 Performance evaluation (I)
- Long discussion of two large TreadMarks applications
43 Performance evaluation (II)
- A previous paper compared the performance of TreadMarks with that of Munin
  - Munin performance was typically within 5 to 33 percent of the performance of hand-coded message-passing versions of the same programs
- TreadMarks was almost always better than Munin, with one exception
  - A 3-D FFT program
44 Performance evaluation (III)
- The 3-D FFT program was an iterative program that read some shared data outside any critical section
  - Doing otherwise would have been too costly
- Munin used eager release, which ensured that the values read were never far from their true values
  - Not true for TreadMarks!
45 Other DSM implementations (I)
- Sequentially-consistent software DSM (IVY)
  - Sends messages to other copies at each write
  - Much slower
- Software release consistency with eager release (Munin)
46 Other DSM implementations (II)
- Entry consistency (Midway)
  - Requires each variable to be associated with a synchronization object (typically a lock)
  - Acquire/release operations on a given synchronization object only involve the variables associated with that object
  - Requires less data traffic
  - Does not handle dusty decks well
47 Other DSM implementations (III)
- Structured DSM systems (Linda)
  - Offer the programmer a shared tuple space accessed through specific synchronized methods
  - Require a very different programming style
48 CONCLUSIONS
- Can build an efficient DSM entirely in user space
  - Modern UNIX systems offer all the required primitives
- Software release consistency model works very well
- Lazy release is almost always better than eager release