Title: Distributed Shared Memory
1 Distributed Shared Memory
- Distributed Shared Memory (DSM) systems build the shared memory abstraction on top of distributed memory machines
- The users see a virtual global address space; the message passing underneath is handled by the DSM transparently to the users
- We can then use shared memory programming techniques
- Software implementations of DSM: http://www.ics.uci.edu/javid/dsm/page.html
2 Three types of DSM implementations
- Page-based technique
- The virtual global address space is divided into equal-sized chunks (pages), which are spread over the machines
- A page is the minimal sharing unit
- A request by a process to access a non-local piece of memory results in a page fault
- a trap occurs and the DSM software fetches the required page of memory and restarts the instruction (see the sketch below)
- a decision has to be made whether to replicate pages or maintain only one copy of any page and move it around the network
- The granularity of the pages has to be decided before implementation
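- A minimal, Linux-specific sketch (not part of the original notes) of the trap mechanism a page-based DSM relies on: a page is mapped with no access rights, the first touch raises SIGSEGV, and the handler stands in for the DSM software that would fetch the page before the faulting instruction restarts.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *page;        /* stands in for a non-resident DSM page */
    static long  page_size;

    static void fault_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        if (info->si_addr >= (void *)page &&
            info->si_addr <  (void *)(page + page_size)) {
            /* A real DSM would request the page from its current owner here. */
            mprotect(page, page_size, PROT_READ | PROT_WRITE);
            memset(page, 42, page_size);   /* pretend this is the fetched data */
        }
    }

    int main(void)
    {
        struct sigaction sa;
        page_size = sysconf(_SC_PAGESIZE);

        /* Map one page with no access rights: any touch traps. */
        page = mmap(NULL, page_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* This access faults, the handler installs the page, and the
           faulting instruction is restarted transparently. */
        printf("page[0] = %d\n", page[0]);
        return 0;
    }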
3 Three types of DSM implementations
- Shared-variable based technique
- Only the variables and data structures required by more than one process are shared
- A variable is the minimal sharing unit
- Trade-off between consistency and network traffic
4 Three types of DSM implementations
- Object-based technique
- Memory can be conceptualised as an abstract space filled with objects (comprising data and methods)
- An object is the minimal sharing unit
- Trade-off between consistency and network traffic
5 OpenMP
- OpenMP stands for Open specification for Multi-Processing
- Used to assist compilers to understand and parallelise serial code better
- Can be used to specify shared memory parallelism in Fortran, C and C++ programs
- OpenMP is a specification for
- a set of compiler directives,
- run-time library routines, and
- environment variables
- Started mid-to-late 80s with the emergence of shared memory parallel computers with proprietary directive-driven programming environments
- OpenMP is an industry standard
6 OpenMP
- OpenMP specifications include
- OpenMP 1.0 for Fortran, 1997
- OpenMP 1.0 for C/C++, 1998
- OpenMP 2.0 for Fortran, 2000
- OpenMP 2.0 for C/C++, 2002
- OpenMP 2.5 for C/C++ and Fortran, 2005
- OpenMP Architecture Review Board: Compaq, HP, IBM, Intel, SGI, Sun
7 OpenMP programming model
- Shared Memory, thread-based parallelism
- Explicit parallelism
- Fork-join model
8 OpenMP code structure in C
    #include <omp.h>

    main ()
    {
        int var1, var2, var3;

        /* Serial code */
        ...

        /* Beginning of parallel section. Fork a team of threads. Specify variable scoping */
        #pragma omp parallel private(var1, var2) shared(var3)
        {
            /* Parallel section executed by all threads */
            ...
        }
        /* All threads join master thread and disband */

        /* Resume serial code */
        ...
    }
9 OpenMP code structure in Fortran
    PROGRAM HELLO
        INTEGER VAR1, VAR2, VAR3
    !   Serial code
        ...
    !   Beginning of parallel section. Fork a team of threads. Specify variable scoping
    !$OMP PARALLEL PRIVATE(VAR1, VAR2) SHARED(VAR3)
    !   Parallel section executed by all threads
        ...
    !$OMP END PARALLEL
    !   All threads join master thread and disband
    !   Resume serial code
        ...
    END
10 OpenMP Directives Format
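- The figure for this slide is not reproduced in the text dump; as a summary (not the original figure), the general directive format is:

    C/C++:    #pragma omp directive-name [clause, ...]
    Fortran:  !$OMP directive-name [clause, ...]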
11 OpenMP features
- OpenMP directives are ignored by compilers that don't support OpenMP, so the code can also be run on sequential machines (see the sketch below)
- Compiler directives are used to specify
- sections of code that can be executed in parallel
- critical sections
- scope of variables (private or shared)
- Mainly used to parallelise loops, e.g. separate threads handle separate iterations of the loop
- There is also a run-time library with several useful routines for checking the number of threads and number of processors, changing the number of threads, etc.
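- A small illustration (not from the notes): the program below compiles and runs with or without OpenMP support. Without it, the pragma is ignored and the standard _OPENMP macro is undefined, so the loop simply runs serially.

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>          /* only available when compiling with OpenMP */
    #endif

    int main(void)
    {
        int i, sum = 0;

        /* Ignored by compilers without OpenMP support; the loop is then serial. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < 100; i++)
            sum += i;

    #ifdef _OPENMP
        printf("sum = %d (compiled with OpenMP)\n", sum);
    #else
        printf("sum = %d (compiled without OpenMP)\n", sum);
    #endif
        return 0;
    }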
12 Fork-Join Model
- Multiple threads are created using the parallel construct
- For C and C++
    #pragma omp parallel
    {
        ... do stuff
    }
- For Fortran
    !$OMP PARALLEL
        ... do stuff
    !$OMP END PARALLEL
13 How many threads are generated?
- The number of threads in a parallel region is determined by the following factors, in order of precedence (see the sketch below)
- use of the omp_set_num_threads() library function
- setting of the OMP_NUM_THREADS environment variable
- implementation default, e.g. the number of CPUs on a node
- Threads are numbered from 0 (master thread) to N-1
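- A minimal sketch (not from the notes) showing the highest-precedence mechanism: the library call overrides whatever OMP_NUM_THREADS was set to.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Overrides the OMP_NUM_THREADS environment variable (higher precedence). */
        omp_set_num_threads(4);

        #pragma omp parallel
        {
            /* Threads are numbered 0 (master) .. N-1 */
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }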
14 Parallelising loops in OpenMP: the work-sharing construct
- A compiler directive specifies that the loop iterations can be done in parallel
- For C and C++
    #pragma omp parallel for
    for (i = 0; i < N; i++)
    {
        value[i] = compute(i);
    }
- For Fortran
    !$OMP PARALLEL DO
    DO i = 1, N
        value(i) = compute(i)
    END DO
    !$OMP END PARALLEL DO
- Thread scheduling can be used to specify the partitioning and allocation of iterations to threads, e.g.
    #pragma omp parallel for schedule(static,4)
- schedule(static, chunk): deal out blocks of iterations of size chunk to each thread
- schedule(dynamic, chunk): each thread takes a chunk of iterations and requests another when it finishes (see the sketch below)
15 Synchronisation in OpenMP
- Critical construct
- Barrier construct
16 Example of a Critical Section in OpenMP
    #include <omp.h>

    main()
    {
        int x;
        x = 0;
        #pragma omp parallel shared(x)
        {
            #pragma omp critical
            x = x + 1;
        }  /* end of parallel section */
    }
17 Example of a Barrier in OpenMP
    #include <omp.h>
    #include <stdio.h>

    int main (int argc, char *argv[])
    {
        int th_id, nthreads;
        #pragma omp parallel private(th_id)
        {
            th_id = omp_get_thread_num();
            printf("Hello World from thread %d\n", th_id);
            #pragma omp barrier
            if ( th_id == 0 ) {
                nthreads = omp_get_num_threads();
                printf("There are %d threads\n", nthreads);
            }
        }
        return 0;
    }
18 Data Scope Attributes in OpenMP
- OpenMP data scope attribute clauses are used to explicitly define how variables should be scoped
- These clauses are used in conjunction with several directives (e.g. PARALLEL, DO/for) to control the scoping of enclosed variables
- Three often encountered clauses
- shared
- private
- reduction
19 Shared and private data in OpenMP
- private(var) creates a local copy of var for each thread
- shared(var) states that var is a global variable to be shared among threads
- The default data storage attribute is shared
    !$OMP PARALLEL DO PRIVATE(xx,yy) SHARED(u,f)
    DO j = 1, m
        DO i = 1, n
            xx = -1.0 + dx * (i-1)
            yy = -1.0 + dy * (j-1)
            u(i,j) = 0.0
            f(i,j) = -alpha * (1.0 - xx*xx) * (1.0 - yy*yy)
        END DO
    END DO
    !$OMP END PARALLEL DO
20 Reduction Clause
- reduction (op : var)
- op is e.g. addition or logical OR. A local copy of the variable is made for each thread, the reduction operation is applied to each thread's copy, and the local values are then combined into a global value
    double ZZ, res = 0.0;
    #pragma omp parallel for reduction(+:res) private(ZZ)
    for (i = 1; i < N; i++) {
        ZZ  = i;
        res = res + ZZ;
    }
21 Run-Time Library Routines
- Can perform a variety of functions, including
- querying the number of threads / the thread number
- setting the number of threads
22 Run-Time Library Routines
- Query routines allow you to get the number of threads and the ID of a specific thread (a combined example is given below)
- id = omp_get_thread_num();          // thread no.
- Nthreads = omp_get_num_threads();   // number of threads
- The number of threads can also be specified at runtime
- omp_set_num_threads(Nthreads);
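- A short, self-contained sketch (not from the notes) putting these routines together; omp_get_num_procs(), used here to report the processor count, is an addition beyond the slide.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int Nthreads = 4;

        omp_set_num_threads(Nthreads);      /* request 4 threads */

        #pragma omp parallel
        {
            int id = omp_get_thread_num();  /* this thread's number, 0..N-1 */
            if (id == 0)
                printf("team size: %d, processors available: %d\n",
                       omp_get_num_threads(), omp_get_num_procs());
            printf("hello from thread %d\n", id);
        }
        return 0;
    }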
23 Environment Variables
- Control the execution of parallel code (see the sketch below)
- Four environment variables
- OMP_SCHEDULE: how iterations of a loop are scheduled
- OMP_NUM_THREADS: maximum number of threads
- OMP_DYNAMIC: enable or disable dynamic adjustment of the number of threads
- OMP_NESTED: enable or disable nested parallelism
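- As a small illustration (not in the notes), the variables are set in the shell before launching the program (e.g. OMP_NUM_THREADS=4) and the OpenMP runtime picks them up at start-up; the sketch below only reports what was set and how many threads a parallel region actually gets.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* The OpenMP runtime reads these before main() runs; here we just report them. */
        const char *nt = getenv("OMP_NUM_THREADS");
        const char *sc = getenv("OMP_SCHEDULE");
        printf("OMP_NUM_THREADS = %s\n", nt ? nt : "(not set)");
        printf("OMP_SCHEDULE    = %s\n", sc ? sc : "(not set)");

        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)
                printf("parallel region running with %d threads\n",
                       omp_get_num_threads());
        }
        return 0;
    }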
24 OpenMP compilers
- Since parallelism is mostly achieved by parallelising loops over shared memory, OpenMP compilers work well for multiprocessor SMPs and vector machines
- OpenMP could work for distributed memory machines, but would need a good distributed shared memory (DSM) implementation underneath
- For more information on OpenMP, see
- www.openmp.org
25 High Performance Computing Course Notes 2007-2008: Message Passing Programming I
26 Message Passing Programming
- Message passing is the most widely used parallel programming model
- Message passing works by creating a number of uniquely named tasks that interact by sending and receiving messages to and from one another (hence "message passing")
- Generally, processes communicate by sending data from the address space of one process to that of another
- Communication between processes (via files, pipes, sockets)
- Communication between threads within a process (via a global data area)
- Programs based on message passing can be standard sequential language programs (C/C++, Fortran), augmented with calls to library functions for sending and receiving messages
27 Message Passing Interface (MPI)
- MPI is a specification, not a particular implementation
- It does not specify process startup, error codes, the amount of system buffering, etc.
- MPI is a library, not a language
- The goals of MPI: functionality, portability and efficiency
- Message passing model -> MPI specification -> MPI implementation
28 OpenMP vs MPI
- In a nutshell
- MPI is used on distributed-memory systems
- OpenMP is used for code parallelisation on shared-memory systems
- Both are explicit parallelism
- Higher-level control (OpenMP), lower-level control (MPI)
29 A little history
- Message-passing libraries were developed for a number of early distributed memory computers
- By 1993 there were many vendor-specific implementations
- By 1994 MPI-1 came into being
- By 1996 MPI-2 was finalised
30 The MPI programming model
- MPI standards
- MPI-1 (1.1, 1.2), MPI-2 (2.0)
- Forwards compatibility is preserved between versions
- Standard bindings for C, C++ and Fortran; MPI bindings also exist for Python, Java etc. (all non-standard)
- We will stick to the C binding for the lectures and coursework. More info on MPI: www.mpi-forum.org
- Implementations: for your laptop, pick up MPICH, a free portable implementation of MPI (http://www-unix.mcs.anl.gov/mpi/mpich/index.htm)
- The coursework will use MPICH
31 MPI
- MPI is a complex system comprising 129 functions with numerous parameters and variants
- Six of them are indispensable, and a large number of useful programs can already be written with just these six
- The other functions add flexibility (datatypes), robustness (non-blocking send/receive), efficiency (ready-mode communication), modularity (communicators, groups) or convenience (collective operations, topologies)
- In the lectures we are going to cover the most commonly encountered functions
32 The MPI programming model
- A computation comprises one or more processes that communicate by calling library routines to send and receive messages to and from other processes
- (Generally) a fixed set of processes is created at the outset, one process per processor
- Different from PVM
33 Intuitive interfaces for sending and receiving messages
- Send(data, destination), Receive(data, source): the minimal interface
- Not enough in some situations; we also need
- Message matching: add a message_id at both the send and receive interfaces
- they become Send(data, destination, msg_id), Receive(data, source, msg_id)
- Message_id
- is expressed as an integer, termed the message tag
- allows the programmer to deal with the arrival of messages in an orderly fashion (queue them and then deal with them later)
34 How to express the data in the send/receive interfaces
- Early stages
- (address, length) for the send interface
- (address, max_length) for the receive interface
- These are not always good enough
- the data to be sent may not be in contiguous memory locations
- the storage format of the data may not be the same, or known in advance, on a heterogeneous platform
- Eventually, a triple (address, count, datatype) is used to express the data to be sent, and (address, max_count, datatype) for the data to be received
- This reflects the fact that a message contains much more structure than just a string of bits, for example (vector_A, 300, MPI_REAL)
- Programmers can also construct their own datatypes (see the sketch below)
- Now the interfaces become send(address, count, datatype, destination, msg_id) and receive(address, max_count, datatype, source, msg_id)
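- A minimal sketch (not from the notes) of a programmer-constructed datatype: MPI_Type_vector describes every other element of an array, so a strided, non-contiguous piece of memory can be sent with a single (address, count, datatype) triple. The name every_other is just illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, i;
        double a[10];
        MPI_Datatype every_other;   /* the new, user-constructed type */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* 5 blocks of 1 double each, stride 2: elements 0,2,4,6,8 of a[] */
        MPI_Type_vector(5, 1, 2, MPI_DOUBLE, &every_other);
        MPI_Type_commit(&every_other);

        if (rank == 0) {
            for (i = 0; i < 10; i++) a[i] = i;
            MPI_Send(a, 1, every_other, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double recv[5];
            /* receive the 5 strided doubles into a contiguous buffer */
            MPI_Recv(recv, 5, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            for (i = 0; i < 5; i++) printf("%g ", recv[i]);
            printf("\n");
        }

        MPI_Type_free(&every_other);
        MPI_Finalize();
        return 0;
    }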
35 How to distinguish messages
- A message tag is necessary, but not sufficient
- So the communicator is introduced
36 Communicators
- Messages are put into contexts
- Contexts are allocated at run time by the system in response to programmer requests (see the sketch below)
- The system can guarantee that each generated context is unique
- Processes belong to groups
- The notions of context and group are combined in a single object called a communicator
- A communicator identifies a group of processes and a communication context
- The MPI library defines an initial communicator, MPI_COMM_WORLD, which contains all the processes running in the system
- Messages from different process groups can have the same tag
- So the send interface becomes send(address, count, datatype, destination, tag, comm)
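- Beyond MPI_COMM_WORLD, a programmer can request new groups/contexts. The sketch below (not part of the notes, and using MPI_Comm_split, which is not one of the six basic functions) divides the processes into two communicators by the parity of their world rank; each new communicator has its own context and its own ranks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int world_rank, sub_rank;
        MPI_Comm sub_comm;   /* illustrative name for the new communicator */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Processes with the same "color" (rank parity) end up in the same
           new communicator. */
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
        MPI_Comm_rank(sub_comm, &sub_rank);

        printf("world rank %d -> rank %d in the %s communicator\n",
               world_rank, sub_rank, (world_rank % 2) ? "odd" : "even");

        MPI_Comm_free(&sub_comm);
        MPI_Finalize();
        return 0;
    }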
37 Status of received messages
- A message status structure is added to the receive interface
- The status holds information about the source, tag and actual message size
- In the C language, the source can be retrieved by accessing status.MPI_SOURCE,
- the tag can be retrieved by status.MPI_TAG, and
- the actual message size can be retrieved by calling the function MPI_Get_count(&status, datatype, &count)
- The receive interface becomes receive(address, max_count, datatype, source, tag, communicator, status) (see the sketch below)
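- A small sketch (not from the notes): the receiver allocates room for up to 100 integers, and after MPI_Recv the status tells it who sent the message, with which tag, and how many elements actually arrived.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, count;
        int buf[100] = {0};
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* send only 7 of the possible 100 elements */
            MPI_Send(buf, 7, MPI_INT, 1, 55, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 100, MPI_INT, 0, 55, MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);   /* actual message size */
            printf("source %d, tag %d, %d elements received\n",
                   status.MPI_SOURCE, status.MPI_TAG, count);
        }

        MPI_Finalize();
        return 0;
    }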
38 How to express source and destination
- The processes in a communicator (group) are identified by ranks
- If a communicator contains n processes, process ranks are integers from 0 to n-1
- The source and destination processes in the send/receive interface are given by their ranks
39 Some other issues
- In the receive interface, the tag can be a wildcard (MPI_ANY_TAG), which means a message with any tag will be received
- In the receive interface, the source can also be a wildcard (MPI_ANY_SOURCE), which matches any source (see the sketch below)
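- A brief sketch (not from the notes): process 0 receives one message from every other process using wildcards for both source and tag, then uses the status fields to find out where each message actually came from.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, i, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* accept messages from any source, with any tag, in arrival order */
            for (i = 1; i < size; i++) {
                MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                printf("got %d from rank %d (tag %d)\n",
                       value, status.MPI_SOURCE, status.MPI_TAG);
            }
        } else {
            value = rank * 10;
            MPI_Send(&value, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }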
40 MPI basics
- First six functions (C bindings)
- MPI_Send (buf, count, datatype, dest, tag, comm)
- Send a message
- buf: address of send buffer
- count: no. of elements to send (> 0)
- datatype: type of the elements
- dest: process id (rank) of the destination
- tag: message tag
- comm: communicator (handle)
43 MPI basics
- First six functions (C bindings)
- MPI_Send (buf, count, datatype, dest, tag, comm)
- Calculating the size of the data to be sent
- buf: address of send buffer
- count * sizeof(datatype): bytes of data to be sent
46 MPI basics
- First six functions (C bindings)
- MPI_Recv (buf, count, datatype, source, tag, comm, status)
- Receive a message
- buf: address of receive buffer (output parameter)
- count: max no. of elements in the receive buffer (> 0)
- datatype: type of the receive buffer elements
- source: process id (rank) of the source process, or MPI_ANY_SOURCE
- tag: message tag, or MPI_ANY_TAG
- comm: communicator
- status: status object
47 MPI basics
- First six functions (C bindings)
- MPI_Init (int *argc, char ***argv)
- Initiate a computation
- argc (number of arguments) and argv (argument vector) are the main program's arguments
- Must be called first, and once per process
- MPI_Finalize ( )
- Shut down a computation
- The last MPI call in a program
48 MPI basics
- First six functions (C bindings)
- MPI_Comm_size (MPI_Comm comm, int *size)
- Determine the number of processes in comm
- comm is the communicator handle; MPI_COMM_WORLD is the default (including all MPI processes)
- size holds the number of processes in the group
- MPI_Comm_rank (MPI_Comm comm, int *pid)
- Determine the id (rank) of the current (calling) process
- pid holds the id of the current process
49 MPI basics: a basic example
    #include "mpi.h"
    #include <stdio.h>
    int main(int argc, char *argv[])
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Hello, world. I am %d of %d\n", rank, nprocs);
        MPI_Finalize();
        return 0;
    }

    mpirun -np 4 myprog
    Hello, world. I am 1 of 4
    Hello, world. I am 3 of 4
    Hello, world. I am 0 of 4
    Hello, world. I am 2 of 4
50 MPI basics: send and recv example (1)
    #include "mpi.h"
    #include <stdio.h>
    int main(int argc, char *argv[])
    {
        int rank, size, i;
        int buffer[10];
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (size < 2) {
            printf("Please run with two processes.\n");
            MPI_Finalize();
            return 0;
        }
        if (rank == 0) {
            for (i = 0; i < 10; i++)
                buffer[i] = i;
            MPI_Send(buffer, 10, MPI_INT, 1, 123, MPI_COMM_WORLD);
        }
51 MPI basics: send and recv example (2)
        if (rank == 1) {
            for (i = 0; i < 10; i++)
                buffer[i] = -1;
            MPI_Recv(buffer, 10, MPI_INT, 0, 123, MPI_COMM_WORLD, &status);
            for (i = 0; i < 10; i++)
                if (buffer[i] != i)
                    printf("Error: buffer[%d] = %d but is expected to be %d\n",
                           i, buffer[i], i);
        }
        MPI_Finalize();
        return 0;
    }
52 MPI language bindings
- Standard (accepted) bindings for Fortran, C and C++
- Java bindings are work in progress
- JavaMPI: Java wrapper to native calls
- mpiJava: JNI wrappers
- jmpi: pure Java implementation of the MPI library
- MPIJ: same idea
- The Java Grande Forum is trying to sort it all out
- We will use the C bindings