1
TreadMarks: Shared Memory Computing on Networks
of Workstations
  • C. Amza, A. L. Cox, S. Dwarkadas, P.J. Keleher,
    H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel
    Rice University

2
INTRODUCTION
  • Distributed shared memory is a software
    abstraction allowing a set of workstations
    connected by a LAN to share a single paged
    virtual address space
  • Key issue in building a software DSM is
    minimizing the amount of data communication among
    the workstation memories

3
Why bother with DSM?
  • Key idea is to build fast parallel computers that
    are
  • Cheaper than shared memory multiprocessor
    architectures
  • As convenient to use

4
Conventional parallel architecture
(Diagram: several CPUs connected to a single shared memory.)
5
Today's architecture
  • Clusters of workstations are much more
    cost-effective
  • No need to develop complex bus and cache
    structures
  • Can use off-the-shelf networking hardware
  • Gigabit Ethernet
  • Myrinet (1.5 Gb/s)
  • Can quickly integrate newest microprocessors

6
Limitations of cluster approach
  • Communication within a cluster of workstations is
    through message passing
  • Much harder to program than concurrent access to
    a shared memory
  • Many big programs were written for shared memory
    architectures
  • Converting them to a message passing architecture
    is a nightmare

7
Distributed shared memory
(Diagram: the workstations' separate main memories appear
through DSM as one shared global address space.)
8
Distributed shared memory
  • DSM makes a cluster of workstations look like a
    shared memory parallel computer
  • Easier to write new programs
  • Easier to port existing programs
  • Key problem is that DSM only provides the
    illusion of having a shared memory architecture
  • Data must still move back and forth among the
    workstations

9
Munin
  • Developed at Rice University
  • Based on software objects (variables)
  • Used the processor's virtual memory to detect
    accesses to the shared objects
  • Included several techniques for reducing
    consistency-related communication
  • Only ran on top of the V kernel

10
Munin main strengths
  • Excellent performance
  • Portability of programs
  • Allowed programs written for a multiprocessor
    architecture to run on a cluster of workstations
    with a minimal number of changes ("dusty decks")

11
Munin main weakness
  • Very poor portability of Munin itself
  • Depended on some features of the V kernel
  • Not maintained since the late 1980s

12
TreadMarks
  • Provides DSM as an array of bytes
  • Like Munin,
  • Uses release consistency
  • Offers a multiple-writer protocol to fight false
    sharing
  • Runs at user-level on a number of UNIX platforms
  • Offers a very simple user interface

13
First example Jacobi iteration
  • Illustrates the use of barriers
  • A barrier is a synchronization primitive that
    forces processes accessing it to wait until all
    processes have reached it
  • Forces processes to wait until all of them have
    completed a specific step

14
Jacobi iteration overall organization
  • Operates on a two-dimensional array
  • Each processor works on a specific band of rows
  • Boundary rows are shared

15
Jacobi iteration overall organization
  • During each iteration step, each array element is
    set to the average of its four neighbors
  • Averages are stored in a scratch matrix and
    copied later into the shared matrix

16
Jacobi iteration the barriers
  • Mark the end of each computation phase
  • Prevent processes from continuing the
    computation before all other processes have
    completed the previous phase and the new values
    are "installed"
  • Include an implicit release() followed by an
    implicit acquire()
  • To be explained later

17
Jacobi iteration declarations
define M define N float grid //
shared array float scratchMN // private
array
18
Jacobi iteration startup
main() Tmk_startup() if (Tmk_proc_id 0
) grid Tmk_malloc(MNsizeof(float))
initialize grid // if Tmk_barrier(0)
length M/Tmk_nprocs begin
lengthTmk_proc_id end length(Tmk_proc_id
1)
19
Jacobi iteration main loop
for (number of iterations) for (i
begin i lt end i) for (j 0 j lt N
j) scratchij (gridi-j
gridij1)/4
Tmk_barrier(1) for (i begin i lt end i)
for (j 0 j lt N j) gridij
scratchij Tmk_barrier(2) // main
loop // main
20
Second example TSP
  • Traveling salesman problem
  • Finding the shortest path through a number of
    cities
  • Program keeps a queue of partial tours
  • Most promising at the end

21
TSP declarations
queue_type *Queue;               // shared queue of partial tours
int *Shortest_length;            // shared length of the best tour so far
int queue_lock_id, min_lock_id;
22
TSP startup
main()
{
   Tmk_startup();
   queue_lock_id = 0;
   min_lock_id = 1;
   if (Tmk_proc_id == 0) {
      Queue = Tmk_malloc(sizeof(queue_type));
      Shortest_length = Tmk_malloc(sizeof(int));
      initialize Heap and Shortest_length;
   } // if
   Tmk_barrier(0);
23
TSP while loop
   while (true) do {
      Tmk_lock_acquire(queue_lock_id);
      if (queue is empty) {
         Tmk_lock_release(queue_lock_id);
         Tmk_exit();
      } // if
      Keep adding to queue until a long, promising
         tour appears at the head;
      Path = Delete the tour from the head;
      Tmk_lock_release(queue_lock_id);
24
TSP end of main
      length = recursively try all cities not on Path,
               find the shortest tour length;
      Tmk_lock_acquire(min_lock_id);
      if (length < *Shortest_length)
         *Shortest_length = length;
      Tmk_lock_release(min_lock_id);
   } // while
} // main
25
Critical sections
  • All accesses to shared variables are surrounded
    by an acquire/release pair (see the sketch below)
  • Tmk_lock_acquire(lock_id)
  • Tmk_lock_release(lock_id)

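A minimal sketch of this pattern around a single shared counter, written in the style of the code on the earlier slides; the counter variable, the lock id, and the header name "Tmk.h" are assumptions made for this sketch, not part of the original slides.

// Minimal sketch of the acquire/release pattern around one shared counter.
// The counter, the lock id and the header name "Tmk.h" are assumptions.
#include "Tmk.h"

int *counter;                          // shared, allocated with Tmk_malloc
int  counter_lock_id = 0;              // lock protecting *counter (assumed)

main()
{
   Tmk_startup();
   if (Tmk_proc_id == 0)
      counter = Tmk_malloc(sizeof(int));
   Tmk_barrier(0);

   Tmk_lock_acquire(counter_lock_id);  // enter the critical section
   *counter = *counter + 1;            // no other process writes here
   Tmk_lock_release(counter_lock_id);  // new value becomes visible to the
                                       // next acquirer of this lock
   Tmk_exit();
}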
26
Implementation Issues
  • Consistency issues
  • False sharing

27
Consistency model (I)
  • Shared data are replicated at times
  • To speed up read accesses
  • All workstations must share a consistent view of
    all data
  • Strict consistency is not possible

28
Consistency model (II)
  • Various authors have proposed weaker consistency
    models
  • Cheaper to implement
  • Harder to use in a correct fashion
  • TreadMarks uses software release consistency
  • Only requires the memory to be consistent at
    specific synchronization points

29
SW release consistency (I)
  • Well-written parallel programs use locks to
    achieve mutual exclusion when they access shared
    variables
  • P(mutex) and V(mutex)
  • lock(csect) and unlock(csect)
  • acquire( ) and release( )
  • Unprotected accesses can produce unpredictable
    results

30
SW release consistency (II)
  • SW release consistency only guarantees the
    correctness of operations performed within an
    acquire/release pair
  • No need to export the new values of shared
    variables until the release
  • Must guarantee that a workstation has received the
    most recent values of all shared variables when
    it completes an acquire

31
SW release consistency (III)
Process 1:
   shared int x;
   acquire( );
   x = 1;
   release( );     // export x == 1

Process 2:
   shared int x;
   acquire( );     // wait for the new value of x
   x++;
   release( );     // export x == 2

32
SW release consistency (IV)
  • Must still decide when to propagate updated values
  • TreadMarks uses lazy release
  • Delays propagation until an acquire is issued
    (as sketched below)
  • Its predecessor Munin used eager release
  • New values of shared variables were propagated at
    release time

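A sketch of when update messages would travel under each policy, using the same acquire/release pattern; the lock id, the shared variable, and the header name "Tmk.h" are assumptions made for illustration, not TreadMarks internals.

#include "Tmk.h"                       // assumed header name

int *x;                                // shared, allocated with Tmk_malloc elsewhere

void writer(void)                      // runs on one processor
{
   Tmk_lock_acquire(0);
   *x = 42;
   Tmk_lock_release(0);
   // Eager release (Munin): the new value of x is pushed here to every
   // processor holding a copy, whether or not it will ever be read.
   // Lazy release (TreadMarks): nothing is sent at this point.
}

void reader(void)                      // runs later on another processor
{
   Tmk_lock_acquire(0);
   // Lazy release: the updates made by the last releaser of this lock
   // are fetched only now, because this acquire needs them.
   int value = *x;                     // guaranteed to see x == 42
   Tmk_lock_release(0);
   (void) value;
}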
33
SW release consistency (V)
(Diagram: message traffic under eager release vs. lazy release.)
34
False sharing
(Diagram: one processor repeatedly accesses x while another accesses y,
and both variables reside on the same page.)
The page containing x and y will move back and forth
between the main memories of the workstations.
35
Multiple-writer protocol (I)
  • Designed to fight false sharing
  • Uses a copy-on-write mechanism
  • Whenever a process is granted access to
    write-shared data, the page containing these data
    is marked copy-on-write
  • The first attempt to modify the contents of the
    page results in the creation of an unmodified copy
    of the page (the twin), as sketched below

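A minimal user-level sketch of this copy-on-write step, assuming one page protected with mprotect( ) and a SIGSEGV handler (a later slide notes that TreadMarks uses mprotect( ) to control access to shared pages); the names and the single-page scope are illustrative, not the actual TreadMarks code.

#include <signal.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

static char *shared_page;              // one write-shared page
static char  twin[PAGE_SIZE];          // copy made on the first write

static void on_write_fault(int sig, siginfo_t *si, void *ctx)
{
   (void) sig; (void) ctx;
   char *addr = (char *) si->si_addr;
   if (addr >= shared_page && addr < shared_page + PAGE_SIZE) {
      memcpy(twin, shared_page, PAGE_SIZE);                   // save the twin
      mprotect(shared_page, PAGE_SIZE, PROT_READ | PROT_WRITE);
   }                                                          // the write is retried
}

int main(void)
{
   struct sigaction sa;
   memset(&sa, 0, sizeof sa);
   sa.sa_flags = SA_SIGINFO;
   sa.sa_sigaction = on_write_fault;
   sigaction(SIGSEGV, &sa, NULL);

   // the page starts out read-only, i.e. marked copy-on-write
   shared_page = mmap(NULL, PAGE_SIZE, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

   shared_page[0] = 3;                 // first write faults; the handler makes
                                       // the twin and re-enables write access
   return 0;
}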
36
Creating a twin
37
Multiple-writer protocol (II)
  • At release time, TreadMarks
  • Performs a word-by-word comparison of the page
    and its twin (as sketched below)
  • Stores the diff in the space used by the twin
    page
  • Informs all processors having a copy of the
    shared data of the update
  • These processors will request the diff the first
    time they access the page

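A sketch of that word-by-word comparison: each changed word is recorded as an (offset, new value) pair and later applied to another copy of the page; the page size and the pair encoding are assumptions, not the actual TreadMarks diff format.

#include <stddef.h>
#include <stdio.h>

#define PAGE_WORDS 1024                // one page of 32-bit words (assumed)

typedef struct { size_t offset; unsigned value; } delta_t;

// Compare the modified page with its twin word by word and record
// every changed word as an (offset, new value) pair.
static size_t make_diff(const unsigned *page, const unsigned *twin,
                        delta_t *diff)
{
   size_t n = 0;
   for (size_t i = 0; i < PAGE_WORDS; i++)
      if (page[i] != twin[i]) {
         diff[n].offset = i;
         diff[n].value  = page[i];
         n++;
      }
   return n;
}

// Apply a diff received from another processor to a local copy of the page.
static void apply_diff(unsigned *copy, const delta_t *diff, size_t n)
{
   for (size_t i = 0; i < n; i++)
      copy[diff[i].offset] = diff[i].value;
}

int main(void)
{
   static unsigned twin[PAGE_WORDS];    // contents before the local write
   static unsigned page[PAGE_WORDS];    // the write-shared page
   static unsigned remote[PAGE_WORDS];  // another processor's copy
   static delta_t  diff[PAGE_WORDS];

   page[0] = 3;                              // local write
   size_t n = make_diff(page, twin, diff);   // one changed word
   apply_diff(remote, diff, n);              // the other copy catches up
   printf("%zu changed word(s), remote[0] = %u\n", n, remote[0]);
   return 0;
}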
38
Creating a diff
39
Example
Before the first write access, the page contains x = 1, y = 2;
the twin created at that moment holds the same values.
After the write, the page contains x = 3, y = 2; comparing it
with the twin yields the diff: the new value of x is 3.
40
Multiple-writer protocol (III)
  • TreadMarks could, but does not, check for
    conflicting updates to write-shared pages

41
The TreadMarks system
  • Entirely at user-level
  • Links to programs written in C, C++, and Fortran
  • Uses UDP/IP for communication (or AAL3/4 if
    machines are connected by an ATM LAN)
  • Uses SIGIO signal to speed up processing of
    incoming requests
  • Uses the mprotect( ) system call to control access
    to shared pages

42
Performance evaluation (I)
  • Long discussion of two large TreadMarks
    applications

43
Performance evaluation (II)
  • A previous paper compared performance of
    TreadMarks with that of Munin
  • Munin performance typically was within 5 to 33
    percent of the performance of hand-coded
    message-passing versions of the same programs
  • TreadMarks was almost always better than Munin
    with one exception
  • A 3-D FFT program

44
Performance Evaluation (III)
  • 3-D FFT program was an iterative program that
    read some shared data outside any critical
    section
  • Doing otherwise would have been too costly
  • Munin used eager release, which ensured that the
    values read were never far out of date
  • Not true for TreadMarks!

45
Other DSM Implementations (I)
  • Sequentially-Consistent Software DSM (IVY)
  • Sends messages to other copies at each write
  • Much slower
  • Software release consistency with eager release
    (Munin)

46
Other DSM Implementations (II)
  • Entry consistency (Midway)
  • Requires each variable to be associated with a
    synchronization object (typically a lock)
  • Acquire/release operations on a given
    synchronization object only involve the variables
    associated with that object
  • Requires less data traffic
  • Does not handle dusty decks well

47
Other DSM Implementations (III)
  • Structured DSM Systems (Linda)
  • Offer the programmer a shared tuple space
    accessed through specific synchronized operations
  • Require a very different programming style

48
CONCLUSIONS
  • Can build an efficient DSM entirely in user
    space
  • Modern UNIX systems offer all the required
    primitives
  • Software release consistency model works very
    well
  • Lazy release is almost always better than eager
    release