Title: TreadMarks: Shared Memory Computing on Networks of Workstations
1 TreadMarks: Shared Memory Computing on Networks of Workstations
- C. Amza, A. L. Cox, S. Dwarkadas, P. J. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel
- Rice University
2 INTRODUCTION
- Distributed shared memory (DSM) is a software abstraction allowing a set of workstations connected by a LAN to share a single paged virtual address space
- The key issue in building a software DSM is minimizing the amount of data communication among the workstation memories
3 Why bother with DSM?
- Key idea is to build fast parallel computers that are
  - Cheaper than shared-memory multiprocessor architectures
  - As convenient to use
4 Conventional parallel architecture
Figure: several CPUs connected to a single shared memory.
5 Today's architecture
- Clusters of workstations are much more cost-effective
  - No need to develop complex bus and cache structures
  - Can use off-the-shelf networking hardware
    - Gigabit Ethernet
    - Myrinet (1.5 Gb/s)
  - Can quickly integrate the newest microprocessors
6 Limitations of the cluster approach
- Communication within a cluster of workstations is through message passing
  - Much harder to program than concurrent access to a shared memory
- Many big programs were written for shared-memory architectures
  - Converting them to a message-passing architecture is a nightmare
7 Distributed shared memory
Figure: DSM merges the individual main memories of the workstations into one shared global address space.
8 Distributed shared memory
- DSM makes a cluster of workstations look like a shared-memory parallel computer
  - Easier to write new programs
  - Easier to port existing programs
- Key problem is that DSM only provides the illusion of having a shared-memory architecture
  - Data must still move back and forth among the workstations
9 Munin
- Developed at Rice University
- Based on software objects (variables)
- Used the processor's virtual memory to detect accesses to the shared objects
- Included several techniques for reducing consistency-related communication
- Only ran on top of the V kernel
10 Munin main strengths
- Excellent performance
- Portability of programs
  - Allowed programs written for a multiprocessor architecture to run on a cluster of workstations with a minimal number of changes ("dusty decks")
11 Munin main weakness
- Very poor portability of Munin itself
  - Depended on some features of the V kernel
  - Not maintained since the late 80's
12 TreadMarks
- Provides DSM as an array of bytes
- Like Munin,
  - Uses release consistency
  - Offers a multiple-writer protocol to fight false sharing
- Runs at user level on a number of UNIX platforms
- Offers a very simple user interface
13 First example: Jacobi iteration
- Illustrates the use of barriers
- A barrier is a synchronization primitive that forces processes reaching it to wait until all processes have reached it
  - Forces processes to wait until all of them have completed a specific step
14 Jacobi iteration: overall organization
- Operates on a two-dimensional array
- Each processor works on a specific band of rows
- Boundary rows are shared
15 Jacobi iteration: overall organization
- During each iteration step, each array element is set to the average of its four neighbors
- Averages are stored in a scratch matrix and copied later into the shared matrix
16 Jacobi iteration: the barriers
- Mark the end of each computation phase
- Prevent processes from continuing the computation before all other processes have completed the previous phase and the new values are "installed"
- Include an implicit release() followed by an implicit acquire()
  - To be explained later
17 Jacobi iteration: declarations

    #define M ...             /* number of rows */
    #define N ...             /* number of columns */
    float (*grid)[N];         /* shared array */
    float scratch[M][N];      /* private array */
18Jacobi iteration startup
main() Tmk_startup() if (Tmk_proc_id 0
) grid Tmk_malloc(MNsizeof(float))
initialize grid // if Tmk_barrier(0)
length M/Tmk_nprocs begin
lengthTmk_proc_id end length(Tmk_proc_id
1)
19 Jacobi iteration: main loop

        for (number of iterations) {
            for (i = begin; i < end; i++)
                for (j = 0; j < N; j++)
                    scratch[i][j] = (grid[i-1][j] + grid[i+1][j]
                                   + grid[i][j-1] + grid[i][j+1]) / 4;
            Tmk_barrier(1);
            for (i = begin; i < end; i++)
                for (j = 0; j < N; j++)
                    grid[i][j] = scratch[i][j];
            Tmk_barrier(2);
        } /* main loop */
    } /* main */
20 Second example: TSP
- Traveling salesman problem
  - Finding the shortest path through a number of cities
- Program keeps a queue of partial tours
  - Most promising at the end
21TSP declarations
queue_type Queue int Shortest_length int
queue_lock_id, min_lock_id
22 TSP: startup

    main()
    {
        Tmk_startup();
        queue_lock_id = 0;
        min_lock_id = 1;
        if (Tmk_proc_id == 0) {
            Queue = Tmk_malloc(sizeof(queue_type));
            Shortest_length = Tmk_malloc(sizeof(int));
            initialize Queue and Shortest_length;
        } /* if */
        Tmk_barrier(0);
23 TSP: while loop

        while (true) {
            Tmk_lock_acquire(queue_lock_id);
            if (queue is empty) {
                Tmk_lock_release(queue_lock_id);
                Tmk_exit();
            }
            keep adding to queue until a long, promising tour appears at the head;
            Path = delete the tour from the head;
            Tmk_lock_release(queue_lock_id);
24 TSP: end of main

            length = recursively try all cities not on Path,
                     find the shortest tour length;
            Tmk_lock_acquire(min_lock_id);
            if (length < *Shortest_length)
                *Shortest_length = length;
            Tmk_lock_release(min_lock_id);
        } /* while */
    } /* main */
25 Critical sections
- All accesses to shared variables are surrounded by a pair
  - Tmk_lock_acquire(lock_id)
  - ... accesses to the shared variables ...
  - Tmk_lock_release(lock_id)
26 Implementation Issues
- Consistency issues
- False sharing
27 Consistency model (I)
- Shared data are replicated at times
  - To speed up read accesses
- All workstations must share a consistent view of all data
- Strict consistency is not possible
28 Consistency model (II)
- Various authors have proposed weaker consistency models
  - Cheaper to implement
  - Harder to use in a correct fashion
- TreadMarks uses software release consistency
  - Only requires the memory to be consistent at specific synchronization points
29 SW release consistency (I)
- Well-written parallel programs use locks to achieve mutual exclusion when they access shared variables
  - P(mutex) and V(mutex)
  - lock(csect) and unlock(csect)
  - acquire( ) and release( )
- Unprotected accesses can produce unpredictable results
30 SW release consistency (II)
- SW release consistency only guarantees the correctness of operations performed within an acquire/release pair
- No need to export the new values of shared variables until the release
- Must guarantee that a workstation has received the most recent values of all shared variables when it completes an acquire
31 SW release consistency (III)

    shared int x;

    /* Process 1 */
    acquire( );
    x = 1;
    release( );    /* export x = 1 */

    /* Process 2 */
    acquire( );    /* wait for new value of x */
    x = 2;
    release( );    /* export x = 2 */
32 SW release consistency (IV)
- Must still decide when to propagate the updated values
- TreadMarks uses lazy release
  - Delays propagation until an acquire is issued
- Its predecessor Munin used eager release
  - New values of shared variables were propagated at release time
33 SW release consistency (V)
Figure: eager release pushes updates to all sharers at release time; lazy release defers them until another process performs an acquire.
34 False sharing
Figure: one processor repeatedly accesses x while another accesses y; because x and y lie on the same page, that page moves back and forth between the main memories of the workstations.
35 Multiple-writer protocol (I)
- Designed to fight false sharing
- Uses a copy-on-write mechanism
  - Whenever a process is granted access to write-shared data, the page containing these data is marked copy-on-write
  - The first attempt to modify the contents of the page results in the creation of a copy of the page being modified (the twin)
36 Creating a twin
37 Multiple-writer protocol (II)
- At release time, TreadMarks
  - Performs a word-by-word comparison of the page and its twin
  - Stores the diff in the space used by the twin page
  - Informs all processors having a copy of the shared data of the update
- These processors will request the diff the first time they access the page
38 Creating a diff
39 Example
- Before: the page holds x = 1, y = 2
- First write access: a twin holding x = 1, y = 2 is created, then the write proceeds
- After: the modified page (x = 3, y = 2) is compared with its twin; the diff records that the new value of x is 3
40 Multiple-writer protocol (III)
- TreadMarks could, but does not, check for conflicting updates to write-shared pages
41 The TreadMarks system
- Runs entirely at user level
- Links with programs written in C, C++ and Fortran
- Uses UDP/IP for communication (or AAL3/4 if the machines are connected by an ATM LAN)
- Uses the SIGIO signal to speed up the processing of incoming requests
- Uses the mprotect( ) system call to control access to shared pages
42 Performance evaluation (I)
- Long discussion of two large TreadMarks applications
43 Performance evaluation (II)
- A previous paper compared the performance of TreadMarks with that of Munin
  - Munin performance was typically within 5 to 33 percent of the performance of hand-coded message-passing versions of the same programs
- TreadMarks was almost always better than Munin, with one exception
  - A 3-D FFT program
44 Performance evaluation (III)
- The 3-D FFT program was an iterative program that read some shared data outside any critical section
  - Doing otherwise would have been too costly
- Munin used eager release, which ensured that the values read were never far from their true values
  - Not true for TreadMarks!
45 Other DSM implementations (I)
- Sequentially-consistent software DSM (IVY)
  - Sends messages to other copies at each write
  - Much slower
- Software release consistency with eager release (Munin)
46 Other DSM implementations (II)
- Entry consistency (Midway)
  - Requires each variable to be associated with a synchronization object (typically a lock)
  - Acquire/release operations on a given synchronization object only involve the variables associated with that object
  - Requires less data traffic
  - Does not handle dusty decks well
47 Other DSM implementations (III)
- Structured DSM systems (Linda)
  - Offer the programmer a shared tuple space accessed through specific synchronized methods
  - Require a very different programming style
48 CONCLUSIONS
- Can build an efficient DSM entirely in user space
  - Modern UNIX systems offer all the required primitives
- Software release consistency model works very well
- Lazy release is almost always better than eager release