HPC Parallel Programming: From Concept to Compile
1
HPC Parallel Programming: From Concept to Compile
  • A One-Day Introductory Workshop
  • October 12th, 2004

2
Schedule
  • Concept (a frame of mind)
  • Compile (application)

3
Introduction
  • Programming parallel computers
  • Compiler extension
  • Sequential programming language extension
  • Parallel programming layer
  • New parallel languages

4
Concept
  • Parallel Algorithm Design
  • Programming paradigms
  • Parallel Random Access Machine - the PRAM model
  • Result, Agenda, Specialist - the RAS model
  • Task / Channel - the PCAM model
  • Bulk Synchronous Parallel - the BSP model
  • Pattern Language

5
Compile
  • Serial
  • Introduction to OpenMP
  • Introduction to MPI
  • Profilers
  • Libraries
  • Debugging
  • Performance Analysis Formulas

6
  • By the end of this workshop you will be exposed
    to
  • Different parallel programming models and
    paradigms
  • Serial programming ("I'm not bad, I was built
    this way") and how you can optimize it
  • OpenMP and MPI
  • Libraries
  • Debugging

7
  • "Eu est velociter perfectus"
  • "Well Done is Quickly Done"
  • - Caesar Augustus

8
Introduction
  • What is Parallel Computing?
  • It is the ability to program in a language that
    allows you to explicitly indicate how different
    portions of the computation may be executed
    concurrently by different processors

9
  • Why do it?
  • The need for speed
  • How much speedup can be achieved is determined by
  • Amdahl's Law: S(p) = p/(1 + (p-1)f) where
  • f - fraction of the computation that cannot be
    divided into concurrent tasks, 0 ≤ f ≤ 1, and
  • p - the number of processors
  • So if we have 20 processors and a serial portion
    of 5% we will get a speedup of
    20/(1 + (20-1)(.05)) ≈ 10.26
  • Also Gustafson-Barsis's Law, which takes into
    account scalability, and
  • the Karp-Flatt Metric, which takes into account the
    parallel overhead, and
  • the Isoefficiency Relation, which is used to determine
    the range of processors for which a particular
    level of efficiency can be maintained. Parallel
    overhead increases as the number of processors
    increases, so to maintain efficiency increase the
    problem size
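  • As a quick check of the arithmetic, a minimal C
    sketch (my own, not part of the original deck) that
    evaluates Amdahl's Law:

      #include <stdio.h>

      /* Amdahl's Law: S(p) = p / (1 + (p - 1) * f),
         f = serial fraction, p = number of processors */
      double amdahl_speedup(int p, double f) {
          return p / (1.0 + (p - 1) * f);
      }

      int main(void) {
          /* the slide's example: 20 processors, f = 0.05 */
          printf("S(20) = %.2f\n", amdahl_speedup(20, 0.05)); /* 10.26 */
          return 0;
      }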

10
  • Why do parallel computing? Some other reasons
  • Time: Reduce the turnaround time of applications
  • Performance: Parallel computing is the only way
    to extend performance toward the TFLOP realm
  • Cost/Performance: Traditional vector computers
    become too expensive as one pushes the
    performance barrier
  • Memory: Applications often require memory that
    goes beyond that addressable by a single
    processor
  • Whole classes of important algorithms are ideal
    for parallel execution. Most algorithms can
    benefit from parallel processing, such as the
    Laplace equation, Monte Carlo methods, FFT (signal
    processing), and image processing
  • Life itself is a set of concurrent processes
  • Scientists use modeling so why not model systems
    in a way closer to nature

11
  • Many complex scientific problems require large
    computing resources. Problems such as
  • Quantum chemistry, statistical mechanics, and
    relativistic physics
  • Cosmology and astrophysics
  • Computational fluid dynamics and turbulence
  • Biology, genome sequencing, genetic engineering
  • Medicine
  • Global weather and environmental modeling
  • One such place is http://www-fp.mcs.anl.gov/grand-
    challenges/

12
Programming Parallel Computers
  • In 1988 four distinct paths for application
    software development on parallel computers were
    identified by McGraw and Axelrod
  • Extend an existing compiler to translate
    sequential programs into parallel programs
  • Extend an existing language with new operations
    that allow users to express parallelism
  • Add a new language layer on top of an existing
    sequential language
  • Define a totally new parallel language

13
Compiler extension
  • Design parallelizing compilers that exploit
    parallelism in existing programs written in a
    sequential language
  • Advantages
  • billions of dollars and thousands of years of
    programmer effort have already gone into legacy
    programs.
  • Automatic parallelization can save money and
    labour.
  • It has been an active area of research for over
    twenty years
  • Companies such as Parallel Software Products
    http://www.parallelsp.com/ offer compilers that
    translate F77 code into parallel programs for MPI
    and OpenMP
  • Disadvantages
  • Pits the programmer and compiler in a game of hide
    and seek. The programmer hides parallelism in DO
    loops and control structures and the compiler
    might irretrievably lose some parallelism

14
Sequential Programming Language Extension
  • Extend a sequential language with functions that
    allow programmers to create, terminate,
    synchronize and communicate with parallel
    processes
  • Advantages
  • Easiest, quickest, and least expensive since it
    only requires the development of a subroutine
    library
  • Libraries meeting the MPI standard exist for
    almost every parallel computer
  • Gives programmers flexibility with respect to
    program development
  • Disadvantages
  • Compiler is not involved in generation of
    parallel code therefore it cannot flag errors
  • It is very easy to write parallel programs that
    are difficult to debug

15
Parallel Programming layers
  • Think of a parallel program consisting of 2
    layers. The bottom layer contains the core of
    the computation which manipulates its portion of
    data to get its result. The upper layer
    controls creation and synchronization of
    processes. A compiler would then translate these
    two levels into code for execution on parallel
    machines
  • Advantages
  • Allows users to depict parallel programs as
    directed graphs with nodes depicting sequential
    procedures and arcs representing data dependences
    among procedures
  • Disadvantages
  • Requires programmer to learn and use a new
    parallel programming system

16
New Parallel Languages
  • Develop a parallel language from scratch. Let
    the programmer express parallel operations
    explicitly. The programming language Occam is one
    famous example:
    http://wotug.ukc.ac.uk/parallel/occam/
  • Advantages
  • Explicit parallelism means programmer and
    compiler are now allies instead of adversaries
  • Disadvantages
  • Requires development of new compilers. It
    typically takes years for vendors to develop
    high-quality compilers for their parallel
    architectures
  • Some parallel languages, such as C*, were never
    adopted as standards, severely compromising
    code portability
  • User resistance. Who wants to learn another
    language?

17
  • The most popular approach continues to be
    augmenting existing sequential languages with
    low-level constructs expressed by function calls
    or compiler directives
  • Advantages
  • Can exhibit high efficiency
  • Portable to a wide range of parallel systems
  • C, C++, and F90 with MPI or OpenMP are such examples
  • Disadvantages
  • More difficult to code and debug

18
Concept
  • An algorithm (from the OED) is a set of rules or
    process, usually one expressed in algebraic
    notation, now used in computing
  • A parallel algorithm is one in which the rules or
    process are concurrent
  • There is no simple recipe for designing parallel
    algorithms. However, it can benefit from a
    methodological approach. It allows the
    programmer to focus on machine-independent issues
    such as concurrency early in the design process
    and machine-specific aspects later
  • You will be introduced to such approaches and
    models and hopefully gain some insight into the
    design process
  • Examining these models is a good way to start
    thinking in parallel

19
Parallel Programming Paradigms
  • Parallel applications can be classified into well
    defined programming paradigms
  • Each paradigm is a class of algorithms that have
    the same control structure
  • Experience suggests that there are relatively
    few paradigms underlying most parallel programs
  • The choice of paradigm is determined by the
    computing resources which can define the level of
    granularity and type of parallelism inherent in
    the program which reflects the structure of
    either the data or application

20
Parallel Programming Paradigms
  • The most systematic definition of paradigms comes
    from a technical report from the University of
    Basel in 1993 entitled BACS: Basel Algorithm
    Classification Scheme
  • A generic tuple of factors which characterize a
    parallel algorithm
  • Process properties (structure, topology,
    execution)
  • Interaction properties
  • Data properties (partitioning, placement)
  • The following paradigms were described
  • Task-Farming (or Master/Slave)
  • Single Program Multiple Data (SPMD)
  • Data Pipelining
  • Divide and Conquer
  • Speculative Parallelism

21
PPP Task-Farming
  • Task-farming consists of two entities
  • Master which decomposes the problem into small
    tasks and distributes/farms them to the slave
    processes. It also gathers the partial results
    and produces the final computational result
  • Slave which gets a message with a task, executes
    the task and sends the result back to the master
  • It can use either static load balancing
    (distribution of tasks is all performed at the
    beginning of the computation) or dynamic
    load-balancing (when the number of tasks exceeds
    the number of processors or is unknown, or when
    execution times are not predictable, or when
    dealing with unbalanced problems). This paradigm
    responds quite well to the loss of processors and
    can be scaled by extending the single master to a
    set of masters
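  • A sketch of a task farm in C with MPI (my own
    illustration, not from the deck; the task payload
    and tags are hypothetical). The master farms out
    tasks dynamically and gathers partial results:

      #include <mpi.h>
      #include <stdio.h>

      #define NTASKS   100
      #define TAG_WORK 1
      #define TAG_STOP 2

      static double do_task(int t) { return (double)t * t; } /* stand-in work */

      int main(int argc, char **argv) {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);
          if (rank == 0) {                  /* master */
              int next = 0, active = 0;
              double sum = 0.0, r;
              MPI_Status st;
              /* seed every slave with one task (or stop it immediately) */
              for (int w = 1; w < size; w++) {
                  if (next < NTASKS) {
                      MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                      next++; active++;
                  } else {
                      MPI_Send(NULL, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                  }
              }
              while (active > 0) {          /* dynamic load balancing */
                  MPI_Recv(&r, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                           MPI_COMM_WORLD, &st);
                  sum += r;
                  if (next < NTASKS) {
                      MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                               MPI_COMM_WORLD);
                      next++;
                  } else {
                      MPI_Send(NULL, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                               MPI_COMM_WORLD);
                      active--;
                  }
              }
              printf("sum = %f\n", sum);
          } else {                          /* slave */
              int t;
              MPI_Status st;
              for (;;) {
                  MPI_Recv(&t, 1, MPI_INT, 0, MPI_ANY_TAG,
                           MPI_COMM_WORLD, &st);
                  if (st.MPI_TAG == TAG_STOP) break;
                  double r = do_task(t);
                  MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
              }
          }
          MPI_Finalize();
          return 0;
      }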

22
PPP Single Program Multiple Data (SPMD)
  • SPMD is the most commonly used paradigm
  • Each process executes the same piece of code but
    on a different part of the data which involves
    the splitting of the application data among the
    available processors. This is also referred to
    as geometric parallelism, domain decomposition,
    or data parallelism
  • Applications can be very efficient if the data is
    well distributed among the processes on a
    homogeneous system. If different workloads are
    evident then some sort of load balancing scheme
    is necessary during run-time execution
  • Highly sensitive to loss of a process. Usually
    results in a deadlock until the global
    synchronization point is reached
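  • A minimal SPMD sketch in C with MPI (my own
    illustration; the "work" is a placeholder): every
    process runs the same code on its own block of the
    data

      #include <mpi.h>
      #include <stdio.h>

      #define N 1000000

      int main(int argc, char **argv) {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          /* same program everywhere, different slice of the data */
          int lo = rank * (N / size);
          int hi = (rank == size - 1) ? N : lo + N / size;

          double local = 0.0, total;
          for (int i = lo; i < hi; i++)
              local += (double)i;           /* placeholder computation */

          /* combine the partial results on every process */
          MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM,
                        MPI_COMM_WORLD);
          if (rank == 0) printf("total = %f\n", total);
          MPI_Finalize();
          return 0;
      }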

23
PPP Data Pipelining
  • Data pipelining is fine-grained parallelism and
    is based on a functional decomposition approach
  • The tasks (capable of concurrent operation) are
    identified and each processor executes a small
    part of the total algorithm
  • It is one of the simplest and most popular functional
    decomposition paradigms and can also be referred
    to as data-flow parallelism.
  • Communication between stages of the pipeline can
    be asynchronous. Efficiency is directly
    dependent on the ability to balance the load
    across the stages
  • Often used in data reduction and image processing

24
PPP Divide and Conquer
  • The divide and conquer approach is well known in
    sequential algorithm development in which a
    problem is divided into two or more subproblems.
    Each subproblem is solved independently and the
    results combined
  • In parallel divide and conquer, the subproblems
    can be solved at the same time
  • Three generic computational operations: split,
    compute, and join (sort of like a virtual tree
    where the tasks are computed at the leaf nodes)
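  • A serial C sketch of the split/compute/join shape
    (my own illustration): the two recursive calls are
    independent subproblems, so a parallel version
    could hand each to a different task

      /* divide and conquer sum: split, compute at the leaves, join */
      double dc_sum(const double *a, int n) {
          if (n <= 4) {                     /* leaf: compute directly */
              double s = 0.0;
              for (int i = 0; i < n; i++) s += a[i];
              return s;
          }
          int half = n / 2;                 /* split */
          double left  = dc_sum(a, half);            /* independent ... */
          double right = dc_sum(a + half, n - half); /* ... subproblems */
          return left + right;              /* join */
      }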

25
PPP Speculative Parallelism
  • Employed when it is difficult to obtain
    parallelism through any one of the previous
    paradigms
  • Deals with complex data dependencies which can be
    broken down into smaller parts using some
    speculation or heuristic to facilitate the
    parallelism

26
PRAM Parallel Random Access Machine
  • Descendent of RAM (Random Access Machine)
  • A theoretical model of parallel computation in
    which an arbitrary but finite number of
    processors can access any value in an arbitrarily
    large shared memory in a single time step
  • Introduced in the 1970s it still remains popular
    since it is theoretically tractable and gives
    algorithm designers a common target. The
    Prentice Hall book from 1989 entitled The Design
    and Analysis of Parallel Algorithms gives a good
    introduction to the design of algorithms using
    this model

27
PRAM cont
  • The three most important variations on this model
    are
  • EREW (exclusive read exclusive write) where any
    memory location may be accessed only once in any
    one step
  • CREW (concurrent read exclusive write) where any
    memory location may be read any number of times
    during a single step but written to only once
    after the reads have finished
  • CRCW (concurrent read concurrent write) where any
    memory location may be written to or read from
    any number of times during a single step. Some
    rule or priority must be given to resolve
    multiple writes

28
PRAM cont
  • This model has problems
  • PRAMs cannot be emulated optimally on all
    architectures
  • Problem lies in the assumption that every
    processor can access the memory simultaneously in
    a single step. Even in hypercubes, messages must
    take several hops between source and destination,
    and the hop count grows logarithmically with the
    machine's size. As a result any buildable computer
    will experience a logarithmic slowdown relative to
    the PRAM model as its size increases
  • One solution is to take advantage of cases in
    which there is greater parallelism in the process
    than in the hardware it is running on, enabling
    each physical processor to emulate many virtual
    processors. An example of such is as follows

29
PRAM cont
  • Example
  • Process A sends request
  • Process B runs while request travels to memory
  • Process C runs while memory services request
  • Process D runs while reply returns to processor
  • Process A is re-scheduled
  • The efficiency with which physically realizable
    architectures could emulate the PRAM is dictated
    by the following theorem
  • If each of P processors sends a single message to
    a randomly-selected partner, it is highly
    probable that at least one processor will receive
    O(log P / log log P) messages, and some others
    will receive none, but
  • If each processor sends log P messages to
    randomly-selected partners, there is a high
    probability that no processor will receive more
    than 3 log P messages.
  • So if problem size is increased at least
    logarithmically faster than machine size,
    efficiency can be held constant. The problem is
    that this holds only for hypercubes, in which the
    number of communication links grows with the
    number of processors.
  • Several ways around the above limitation have been
    suggested, such as

30
PRAM cont
  • XPRAM where computations are broken up into steps
    such that no processor may communicate more than
    a certain number of times per single time step.
  • Programs which fit this model can be emulated
    efficiently
  • Problem is that it is difficult to design
    algorithms in which the frequency of
    communication decreases as the problem size
    increases
  • Bird-Meertens formalism where the allowed set of
    communications would be restricted to those which
    can be emulated efficiently
  • The scan-vector model proposed by Blelloch
    accounts for the relative distance of different
    portions of memory
  • Another option was proposed by Ranade in 1991
    which uses a butterfly network in which each node
    is a processor/memory pair. Routing messages in
    this model is complicated but the end result is
    an optimal PRAM emulation

31
Result, Agenda, Specialist Model
  • The RAS model was proposed by Nicholas Carriero
    and David Gelernter in their book How to Write
    Parallel Programs in 1990
  • To write a parallel program
  • Choose a pattern that is most natural to the
    problem
  • Write a program using the method that is most
    natural for that pattern, and
  • If the resulting program is not efficient, then
    transform it methodically into a more efficient
    version

32
RAS
  • Sounds simple. We can envision parallelism in
    terms of
  • Result - focuses on the shape of the finished
    product
  • Plan a parallel application around a data
    structure yielded as the final result. We get
    parallelism by computing all elements of the
    result simultaneously
  • Agenda - focuses on the list of tasks to be
    performed
  • Plan a parallel application around a particular
    agenda of tasks, and then assign many processes
    to execute the tasks
  • Specialist - focuses on the make-up of the work
  • Plan an application around an ensemble of
    specialists connected into a logical network of
    some kind. Parallelism results from all nodes
    being active simultaneously, much like
    pipelining

33
RAS - Result
  • In most cases the easiest way to think of a
    parallel program is to think of the resulting
    data structure. It is a good starting point for
    any problem whose goal is to produce a series of
    values with predictable organization and
    interdependencies
  • Such a program reads as follows
  • Build a data structure
  • Determine the value of all elements of the
    structure simultaneously
  • Terminate when all values are known
  • If all values are independent then all
    computations start in parallel. However, if some
    elements cannot be computed until certain other
    values are known, then those tasks are blocked
  • As a simple example consider adding two n-element
    vectors (i.e. add the ith elements of both and
    store the sum in another vector)
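  • A sketch of that example in C with OpenMP (my own
    illustration): every element of the result vector
    is independent, so all n additions may proceed
    concurrently

      /* result parallelism: compute all elements of c simultaneously */
      void vector_add(const double *a, const double *b, double *c, int n) {
          #pragma omp parallel for
          for (int i = 0; i < n; i++)
              c[i] = a[i] + b[i];
      }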

34
RAS - Agenda
  • Agenda parallelism adapts well to many different
    problems
  • The most flexible is the master-worker paradigm
  • in which a master process initializes the
    computation and creates a collection of identical
    worker processes
  • Each worker process is capable of performing any
    step in the computation
  • Workers seek a task to perform and then repeat
  • When no tasks remain, the program is finished
  • An example would be to find the lowest ratio of
    salary to dependents in a database. The master
    fills a bag with records and each worker draws
    from the bag, computes the ratio, sends the
    results back to the master. The master keeps
    track of the minimum and when tasks are complete
    reports the answer

35
RAS - Specialist
  • Specialist parallelism involves programs that are
    conceived in terms of a logical network.
  • Best understood as a network in which each node
    executes an autonomous computation and inter-node
    communication follows predictable paths
  • An example could be a circuit simulation where
    each element is realized by a separate process

36
RAS - Example
  • Consider a naïve n-body simulator where on each
    iteration of the simulation we calculate the
    prevailing forces between each body and all the
    rest, and update each body's position accordingly
  • With the result parallelism approach it is easy
    to restate the problem description as follows
  • Suppose n bodies and q iterations of the simulation;
    compute matrix M such that M[i, j] is the
    position of the ith body after the jth iteration
  • Define each entry in terms of other entries, i.e.
    write a function to compute position (i, j)

37
RAS - Example
  • With the agenda parallelism model we can
    repeatedly apply the transformation "compute next
    position" to all bodies in the set
  • So the steps involved would be to
  • Create a master process and have it generate n
    initial task descriptors (one for each body)
  • On the first iteration, each process repeatedly
    grabs a task descriptor and computes the next
    position of the corresponding body, until all
    task descriptors are used
  • The master can then store information about each
    body's position at the last iteration in a
    distributed table structure where each process
    can refer to it directly

38
RAS - Example
  • Finally, with the specialist parallelism approach
    we create a series of processes, each one
    specializing in a single body (i.e. each
    responsible for computing a single body's current
    position throughout the entire simulation)
  • At the start of each iteration, each process
    sends data to and receives data from each other
    process
  • The data included in the incoming group of
    messages is sufficient to allow each process
    to compute a new position for its body, then
    repeat

39
Task Channel model
  • THERE IS NO SIMPLE RECIPE FOR DESIGNING PARALLEL
    ALGORITHMS
  • However, with suggestions by Ian Foster and his
    book Designing and Building Parallel Programs
    there is a methodology we can use
  • The task/channel method is the one most often cited
    as a practical means to organize the design
    process
  • It represents a parallel computation as a set of
    tasks that may interact with each other by
    sending messages through channels
  • It can be viewed as a directed graph where
    vertices represent tasks and directed edges
    represent channels
  • A thorough examination of this design process
    will conclude with a practical example

40
  • A task is a program, its local memory, and a
    collection of I/O ports
  • The local memory contains the program's
    instructions and its private data
  • It can send local data values to other tasks via
    output ports
  • It also receives data values from other tasks via
    input ports
  • A channel is a message queue that connects one
    task's output port with another task's input port
  • Data values appear at the input port in the same
    order as they were placed in the output port of
    the channel
  • Tasks cannot receive data until it is sent (i.e.
    receiving is blocked)
  • Sending is never blocked

41
  • The four stages of Foster's design process are
  • Partitioning - the process of dividing the
    computation and data into pieces
  • Communication - the pattern of sends and receives
    between tasks
  • Agglomeration - the process of grouping tasks into
    larger tasks to simplify programming or improve
    performance
  • Mapping - the process of assigning tasks to
    processors
  • Commonly referred to as PCAM

42
Partitioning
  • Discover as much parallelism as possible. To
    this end strive to split the computation and data
    into smaller pieces
  • There are two approaches
  • Domain decomposition
  • Functional decomposition

43
PCAM partitioning domain decomposition
  • Domain decomposition is where you first divide
    the data into pieces and then determine how to
    associate computations with the data
  • Typically focus on the largest or most frequently
    accessed data structure in the program
  • Consider a 3D matrix. It can be partitioned as
  • Collection of 2D slices, resulting in a 1D
    collection of tasks
  • Collection of 1D slices, resulting in a 2D
    collection of tasks
  • Each matrix element separately resulting in a 3D
    collection of tasks
  • At this point in the design process it is usually
    best to maximize the number of tasks, hence the 3D
    partitioning is best
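  • For instance, a small C helper (my own sketch) that
    computes which slab of the first dimension a task
    owns in the 1D decomposition into 2D slices:

      /* task t of p owns planes [lo, hi) of an nx-deep 3D array;
         the remainder is spread one plane at a time over the first tasks */
      void slab_bounds(int nx, int p, int t, int *lo, int *hi) {
          int base = nx / p, rem = nx % p;
          *lo = t * base + (t < rem ? t : rem);
          *hi = *lo + base + (t < rem ? 1 : 0);
      }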

44
PCAM partitioning functional decomposition
  • Functional decomposition is complementary to
    domain decomposition in that the computation is
    first divided into pieces and then the data items
    are associated with each computation. This is
    often known as pipelining, which yields a collection
    of concurrent tasks
  • Consider brain surgery
  • Before surgery begins, a set of CT images is
    input to form a 3D model of the brain
  • The system tracks the position of the instruments
    converting physical coordinates into image
    coordinates and displaying them on a monitor.
    While one task is converting physical coordinates
    to image coordinates, another is displaying the
    previous image, and yet another is tracking the
    instrument for the next image. (Anyone remember
    the movie The Fantastic Voyage?)

45
PCAM Partitioning - Checklist
  • Regardless of decomposition we must maximize the
    number of primitive tasks since it is the upper
    bound on the parallelism we can exploit. Foster
    has presented us a checklist to evaluate the
    quality of the partitioning
  • There are at least an order of magnitude more
    tasks than processors on the target parallel
    machine. If not, there may be little flexibility
    in later design options
  • Avoid redundant computation and storage
    requirements since the design may not work well
    when the size of the problem increases
  • Tasks are of comparable size. If not, it may be
    hard to allocate each processor equal amounts of
    work
  • The number of tasks scale with problem size. If
    not, it may be difficult to solve larger problems
    when more processors are available
  • Investigate alternative partitioning to maximize
    flexibility later

46
PCAM-Communication
  • After the tasks have been identified it is
    necessary to understand the communication
    patterns between them
  • Communications are considered part of the
    overhead of a parallel algorithm, since the
    sequential algorithm does not need to do this.
    Minimizing this overhead is an important goal
  • Two such patterns, local and global, are more
    commonly used than others (structured/unstructured,
    static/dynamic, synchronous/asynchronous)
  • Local communication exists when a task needs
    values from a small number of other tasks (its
    neighbours) in order to perform a computation
  • Global communication exists when a large number of
    tasks must supply data in order to perform a
    computation (e.g. performing a parallel reduction
    operation computing the sum of values over N
    tasks)

47
PCAM Communication - checklist
  • These are guidelines and not hard and fast rules
  • Are the communication operations balanced between
    tasks? Unbalanced communication requirements
    suggest a non-scalable construct
  • Each task communicates only with a small number
    of neighbours
  • Tasks are able to communicate concurrently. If
    not the algorithm is likely to be inefficient and
    non-scalable.
  • Tasks are able to perform their computations
    concurrently

48
PCAM - Agglomeration
  • The first two steps of the design process were
    focused on identifying as much parallelism as
    possible
  • At this point the algorithm would probably not
    execute efficiently on any particular parallel
    computer. For example, if there are many orders of
    magnitude more tasks than processors it can lead
    to a significant overhead in communication
  • In the next two stages of the design we consider
    combining tasks into larger tasks and then
    mapping them onto physical processors to reduce
    parallel overhead

49
PCAM - Agglomeration
  • Agglomeration (according to the OED) is the process
    of collecting in a mass. In this case we try to
    group tasks into larger tasks to facilitate
    improvement in performance or to simplify
    programming.
  • There are three main goals to agglomeration
  • Lower communication overhead
  • Maintain the scalability of the parallel design,
    and
  • Reduce software engineering costs

50
PCAM - Agglomeration
  • How can we lower communication overhead?
  • By agglomerating tasks that communicate with each
    other, communication is completely eliminated,
    since data values controlled by the tasks are in
    the memory of the consolidated task. This
    process is known as increasing the locality of
    the parallel algorithm
  • Another way is to combine groups of transmitting
    and receiving tasks thereby reducing the number
    of messages sent. Sending fewer, longer messages
    takes less time than sending more, shorter
    messages since there is an associated startup
    cost (message latency) inherent with every
    message sent which is independent of the length
    of the message.

51
PCAM - Agglomeration
  • How can we maintain the scalability of the
    parallel design?
  • Ensure that you do not combine too many tasks
    since porting to a machine with more processors
    may be difficult.
  • For example part of your parallel program is to
    manipulate a 3D array 16 X 128 X 256 and the
    machine has 8 processors. By agglomerating the
    2nd and 3rd dimensions each task would be
    responsible for a submatrix of 2 X 128 X 256. We
    can even port this to a machine that has 16
    processors. However porting this to a machine
    with more processors might result in large
    changes to the parallel code. Therefore
    agglomerating the 2nd and 3rd dimension might not
    be a good idea. What about a machine with 50,
    64, or 128 processors?

52
PCAM - Agglomeration
  • How can we reduce software engineering costs?
  • By parallelizing a sequential program we can
    reduce time and expense of developing a similar
    parallel program. Remember Parallel Software
    Products

53
PCAM Agglomeration - checklist
  • Some of these points in this checklist emphasize
    quantitative performance analysis which becomes
    more important as we move from the abstract to
    the concrete
  • Has the agglomeration increased the locality of
    the parallel algorithm?
  • Do replicated computations take less time than
    the communications they replace?
  • Is the amount of replicated data small enough to
    allow the algorithm to scale?
  • Do agglomerated tasks have similar computational
    and communication costs?
  • Is the number of tasks an increasing function of
    the problem size?
  • Is the number of tasks as small as possible, yet
    at least as large as the number of processors on
    your parallel computer?
  • Is the trade-off between your chosen
    agglomeration and the cost of modifications to
    existing sequential code reasonable?

54
PCAM - Mapping
  • In this 4th and final stage we specify where each
    task is to execute
  • The goals of mapping are to maximize processor
    utilization and minimize interprocessor
    communications. Often these are conflicting
    goals
  • Processor utilization is maximized when the
    computation is balanced evenly. Conversely, it
    drops when one or more processors are idle
  • Interprocessor communication increases when two
    tasks connected by a channel are mapped to
    different processors. Conversely, it decreases
    when the two tasks connected by the channel are
    mapped to the same processor
  • Mapping every task to the same processor cuts
    communication to zero but utilization is reduced
    to 1/p. The point is to choose a
    mapping that represents a reasonable balance
    between the conflicting goals. The mapping problem
    has a name, and it is

55
PCAM - Mapping
  • The mapping problem is known to be NP-hard,
    meaning that no computationally tractable
    (polynomial-time) algorithm exists for evaluating
    these trade-offs in the general case. Hence we
    must rely on heuristics that can do a reasonably
    good job of mapping
  • Some strategies for decomposition of a problem
    are
  • Perfectly parallel
  • Domain
  • Control
  • Object-oriented
  • Hybrid/layered (multiple uses of the above)

56
PCAM Mapping decomposition - perfect
  • Perfectly parallel
  • Applications that require little or no
    inter-processor communication when running in
    parallel
  • Easiest type of problem to decompose
  • Results in nearly perfect speed-up
  • The pi example is almost perfectly parallel
  • The only communication occurs at the beginning of
    the problem when the number of divisions needs to
    be broadcast and at the end where the partial
    sums need to be added together
  • The calculation of the area of each slice
    proceeds independently
  • This would be true even if the area calculation
    were replaced by something more complex
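  • A sketch of that pi computation in C with MPI (my
    own illustration, assuming the usual midpoint-rule
    integration of 4/(1+x²) over [0, 1]):

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          int rank, size, n = 1000000;      /* number of slices */
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          /* one broadcast at the beginning ... */
          MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

          /* ... independent slice areas in the middle ... */
          double h = 1.0 / n, local = 0.0, pi;
          for (int i = rank; i < n; i += size) {
              double x = h * (i + 0.5);
              local += 4.0 / (1.0 + x * x);
          }
          local *= h;

          /* ... one reduction of the partial sums at the end */
          MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                     MPI_COMM_WORLD);
          if (rank == 0) printf("pi ~ %.10f\n", pi);
          MPI_Finalize();
          return 0;
      }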

57
PCAM mapping decomposition - domain
  • Domain decomposition
  • In simulation and modelling this is the most
    common solution
  • The solution space (which often corresponds to
    the real space) is divided up among the
    processors. Each processor solves its own little
    piece
  • Finite-difference methods and finite-element
    methods lend themselves well to this approach
  • The method of solution often leads naturally to a
    set of simultaneous equations that can be solved
    by parallel matrix solvers
  • Sometimes the solution involves some kind of
    transformation of variables (i.e. Fourier
    Transform). Here the domain is some kind of
    phase space. The solution and the various
    transformations involved can be parallelized

58
PCAM mapping decomposition - domain
  • Solution of a PDE (Laplace's Equation)
  • A finite-difference approximation
  • Domain is divided into discrete finite
    differences
  • Solution is approximated throughout
  • In this case, an iterative approach can be used
    to obtain a steady-state solution
  • Only nearest neighbour cells are considered in
    forming the finite difference
  • Gravitational N-body, structural mechanics,
    weather and climate models are other examples

59
PCAM mapping decomposition - control
  • Control decomposition
  • If you cannot find a good domain to decompose,
    your problem might lend itself to control
    decomposition
  • Good for
  • Unpredictable workloads
  • Problems with no convenient static structures
  • One type of control decomposition is functional
    decomposition
  • Problem is viewed as a set of operations. It is
    among operations where parallelization is done
  • Many examples in industrial engineering ( i.e.
    modelling an assembly line, a chemical plant,
    etc.)
  • Many examples in data processing where a series
    of operations is performed on a continuous stream
    of data

60
PCAM mapping decomposition - control
  • Examples
  • Image processing: given a series of raw images,
    perform a series of transformations that yield a
    final enhanced image. Solve this in a functional
    decomposition (each process represents a
    different function in the problem) using data
    pipelining
  • Game playing: games feature an irregular search
    space. One possible move may lead to a rich set
    of possible subsequent moves to search.

61
PCAM mapping decomposition - OO
  • Object-oriented decomposition is really a
    combination of functional and domain
    decomposition
  • Rather than thinking about dividing data or
    functionality, we look at the objects in the
    problem
  • The object can be decomposed as a set of data
    structures plus the procedures that act on those
    data structures
  • The goal of object-oriented parallel programming
    is distributed objects
  • Although conceptually clear, in practice it can
    be difficult to achieve good load balancing among
    the objects without a great deal of fine tuning
  • Works best for fine-grained problems and in
    environments where having functionality ready
    at-the-call is more important than worrying about
    under-worked processors (i.e. battlefield
    simulation)
  • Message passing is still explicit (no standard
    C++ compiler automatically parallelizes over
    objects).

62
PCAM mapping decomposition - OO
  • Example the client-server model
  • The server is an object that has data associated
    with it (i.e. a database) and a set of procedures
    that it performs (i.e. searches for requested
    data within the database)
  • The client is an object that has data associated
    with it (i.e. a subset of data that it has
    requested from the database) and a set of
    procedures it performs (i.e. some application
    that massages the data).
  • The server and client can run concurrently on
    different processors - an object-oriented
    decomposition of a parallel application
  • In the real world, this can be large scale when
    many clients (workstations running applications)
    access a large central database - kind of like a
    distributed supercomputer

63
PCAM mapping decomposition -summary
  • A good decomposition strategy is
  • Key to potential application performance
  • Key to programmability of the solution
  • There are many different ways of thinking about
    decomposition
  • Decomposition models (domain, control,
    object-oriented, etc.) provide standard templates
    for thinking about the decomposition of a problem
  • Decomposition should be natural to the problem
    rather than natural to the computer architecture
  • Communication does no useful work - keep it to a
    minimum
  • Always wise to see if a library solution already
    exists for your problem
  • Don't be afraid to use multiple decompositions in
    a problem if it seems to fit

64
PCAM mapping - considerations
  • If the communication pattern among tasks is
    regular, create p agglomerated tasks that
    minimize communication and map each task to its
    own processor
  • If the number of tasks is fixed and communication
    among them regular, but the time required to
    perform each task is variable, then some sort of
    cyclic or interleaved mapping of tasks to
    processors may result in a more balanced load
  • Dynamic load-balancing algorithms are needed when
    tasks are created and destroyed at run-time or the
    computation or communication costs of tasks vary
    widely

65
PCAM mapping - checklist
  • It is important to keep an open mind during the
    design process. These points can help you decide
    if you have done a good job of considering design
    alternatives
  • Is the design based on one task per processor or
    multiple tasks per processor?
  • Have both static and dynamic allocation of tasks
    to processors been considered?
  • If dynamic allocation of tasks is chosen, is the
    manager (task allocator) a bottleneck to
    performance?
  • If using probabilistic or cyclic methods, do you
    have a large enough number of tasks to ensure
    reasonable load balance (typically ten times as
    many tasks as processors are required)

66
PCAM example N-body problem
  • In a Newtonian n-body simulation, gravitational
    forces have infinite range. Sequential algorithms
    to solve these problems have time complexity of
    Θ(n²) per iteration where n is the number of
    objects
  • Let us suppose that we are simulating the motion
    of n particles of varying mass in 2D. During
    each iteration we need to compute the velocity
    vector of each particle, given the positions of
    all other particles.
  • Using the four stage process we get

67
PCAM example N-body problem
  • Partitioning
  • Assume we have one task per particle.
  • This particle must know the location of all other
    particles
  • Communication
  • A gather operation is a global communication that
    takes a dataset distributed among a group of
    tasks and collects the items on a single task
  • An all-gather operation is similar to gather,
    except at the end of communication every task has
    a copy of the entire dataset
  • We need to update the location of every particle
    so an all-gather is necessary
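  • In MPI this maps directly onto MPI_Allgather; a
    minimal sketch of my own, with 2D positions:

      #include <mpi.h>

      /* each of p tasks contributes the (x, y) position of its particle;
         afterwards all_pos (room for 2*p doubles) is complete everywhere */
      void share_positions(const double my_pos[2], double *all_pos,
                           MPI_Comm comm) {
          MPI_Allgather(my_pos, 2, MPI_DOUBLE,   /* send my coordinates */
                        all_pos, 2, MPI_DOUBLE,  /* receive 2 per task  */
                        comm);
      }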

68
PCAM example N-body problem
  • So put a channel between every pair of tasks
  • During each communication step each task sends its
    vector element to one other task. After n - 1
    communication steps, each task has the position
    of all other particles, and it can perform the
    calculations needed to determine the velocity and
    new location for its particle
  • Is there a quicker way? Suppose there were only
    two particles. If each task had a single
    particle, they can exchange copies of their
    values. What if there were four particles?
    After a single exchange step tasks 0 and 1 could
    both have particles v0 and v1 , likewise for
    tasks 2 and 3. If task 0 exchanges its pair of
    particles with task 2 and task 1 exchanges with
    task 3, then all tasks will have all four
    particles. A logarithmic number of exchange
    steps is sufficient to allow every processor to
    acquire the value originally held by every other
    processor. So in the ith exchange step, messages
    have length 2^(i-1)

69
PCAM example N-body problem
  • Agglomeration and mapping
  • In general, there are more particles n than
    processors p. If n is a multiple of p we can
    associate one task per processor and agglomerate
    n/p particles into each task.

70
PCAM - summary
  • Task/channel (PCAM) is a theoretical construct
    that represents a parallel computation as a set
    of tasks that may interact with each other by
    sending messages through channels
  • It encourages parallel algorithm designs that
    maximize local computations and minimize
    communications

71
BSP Bulk Synchronous Parallel
  • BSP model was proposed in 1989. It provides an
    elegant theoretical framework for bridging the
    gap between parallel hardware and software
  • BSP allows the programmer to design an
    algorithm as a sequence of large steps (supersteps
    in BSP language), each containing many basic
    computation or communication operations done in
    parallel and a global synchronization at the end,
    where all processors wait for each other to
    finish their work before they proceed with the
    next superstep.
  • BSP is currently used around the world, and a very
    good text (on which this segment is based) is
    called Parallel Scientific Computation by Rob
    Bisseling, published by Oxford University Press in 2004

72
BSP Bulk Synchronous Parallel
  • Some useful links
  • BSP Worldwide organization
  • http://www.bsp-worldwide.org
  • The Oxford BSP toolset (public domain, GNU
    license)
  • http://www.bsp-worldwide.org/implmnts/oxtool
  • The source files from the book together with test
    programs form a package called BSPedupack and can
    be found at
  • http://www.math.uu.nl/people/bisseling/software.ht
    ml
  • The MPI version called MPIedupack is also
    available from the previously mentioned site

73
BSP Bulk Synchronous Parallel
  • BSP satisfies all requirements of a useful
    parallel programming model
  • Simple enough to allow easy development and
    analysis of algorithms
  • Realistic enough to allow reasonably accurate
    modelling of real-life parallel computing
  • There exists a portability layer in the form of
    BSPlib
  • It has been efficiently implemented in the Oxford
    BSP toolset and Paderborn University BSP library
  • Currently being used as a framework for algorithm
    design and implementation on clusters of PCs,
    networks of workstations, shared-memory
    multiprocessors and large parallel machines with
    distributed memory

74
BSP Model
  • BSP comprises a computer architecture, a class
    of algorithms, and a function for charging costs
    to algorithms (hmm, no wonder it is a popular
    model)
  • The BSP computer
  • consists of a collection of processors, each with
    private memory,
  • and a communication network that allows
    processors to access each other's memories

75
BSP Model
  • The BSP algorithm is a series of supersteps which
    contain either a number of computation or
    communication steps followed by a global barrier
    synchronization (i.e. bulk synchronization)
  • What is one possible problem you see right away
    with designing algorithms this way?
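  • The shape of a superstep in C, sketched by me
    against the BSPlib primitives (bsp_begin, bsp_put,
    bsp_sync, etc.) as provided by the Oxford BSP
    toolset:

      #include "bsp.h"

      void spmd_part(void) {
          bsp_begin(bsp_nprocs());
          int p = bsp_nprocs(), s = bsp_pid();

          double x = 0.0;
          bsp_push_reg(&x, sizeof(double)); /* let others write into x */
          bsp_sync();                       /* superstep boundary */

          /* superstep: compute, communicate, then synchronize in bulk */
          double v = (double)s;             /* local computation */
          bsp_put((s + 1) % p, &v, &x, 0, sizeof(double));
          bsp_sync();                       /* all processors wait here */

          /* x now holds the value put by processor (s - 1 + p) % p */
          bsp_pop_reg(&x);
          bsp_end();
      }

      int main(int argc, char **argv) {
          bsp_init(spmd_part, argc, argv);
          spmd_part();
          return 0;
      }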

76
BSP Model
  • The BSP cost function is classified as an
    h-relation and consists of a superstep where at
    least one processor sends and receives at most h
    data words (real or integer). Therefore
    h = max(h_send, h_receive)
  • It assumes sends and receives are simultaneous
  • This charging cost is based on the assumption
    that the bottleneck is at the entry or exit of a
    communication network
  • The cost of an h-relation would be
  • T_comm(h) = hg + l, where
  • g is the communication cost per data word, and
    l is the global synchronization cost (both in
    time units of 1 flop), and the cost of a BSP
    algorithm is the expression
  • a + bg + cl, written as the triple (a, b, c), where
    a, b, c depend in general on p and on the problem size

77
BSP Bulk Synchronous Parallel
  • This model currently allows you to convert from
    BSP to MPI-2 using MPIedupack as an example (i.e.
    MPI can be used for programming in the BSP style)
  • The main difference between MPI and BSPlib is
    that MPI provides more opportunities for
    optimization by the user. However, BSP does
    impose a discipline that can prove fruitful in
    developing reusable code
  • The book contains an excellent section on sparse
    matrix-vector multiplication, and if you link to
    the website you can download some interesting
    solvers: http://www.math.uu.nl/people/bisseling/Mon
    driaan/mondriaan.html

78
Pattern Language
  • Primarily from the book Patterns for Parallel
    Programming by Mattson, Sanders, and Massingill,
    Addison-Wesley, 2004
  • From the back cover: "It's the first parallel
    programming guide written specifically to serve
    working software developers, not just computer
    scientists. The authors introduce a complete,
    highly accessible pattern language that will help
    any experienced developer think parallel and
    start writing effective code almost immediately"
  • The cliché "Don't judge a book by its cover"
    comes to mind

79
Pattern Language
  • We have come full circle. However, we have
    gained some knowledge along the way
  • A pattern language is not a programming language.
    It is an embodiment of design methodologies
    which provides domain-specific advice to the
    application designer
  • Design patterns were introduced into software
    engineering in 1987

80
Pattern Language
  • Organized into four design spaces (sound familiar?
    - PCAM)
  • Finding concurrency
  • Structure problem to expose exploitable
    concurrency
  • Algorithm structure
  • Structure the algorithm to take advantage of the
    concurrency found above
  • Supporting structures
  • Structured approaches that represent the program
    and shared data structures
  • Implementation mechanisms
  • Mapping patterns to processors

81
Concept - Summary
  • What is the common thread of all these models and
    paradigms?

82
Concept - Conclusion
  • You take a problem, break it up into n tasks and
    assign them to p processors - that's the science
  • How you break up the problem and exploit the
    parallelism - now that's the art

83
This page intentionally left blank

84
Compile
  • Serial/sequential program optimization
  • Introduction to OpenMP
  • Introduction to MPI
  • Profilers
  • Libraries
  • Debugging
  • Performance Analysis Formulas

85
Serial
  • Some of you may be thinking: why would I want to
    discuss serial execution in a talk about parallel
    computing?
  • Well, have you ever eaten just one bran flake
    or one rolled oat at a time?

86
Serial
  • Most of the serial optimization techniques can be
    used for any program, parallel or serial
  • Well-written assembler code will beat a high-level
    programming language any day, but who has the
    time to write a parallel application in assembler
    for one of the myriad of processors available?
    However, small sections of assembler might be
    more effective.
  • Reducing the memory requirements of an
    application is a good tool that frequently
    results in better processor performance
  • You can use these tools to write efficient code
    from scratch or to optimize existing code.
  • First attempts at optimization may be compiler
    options or modifying a loop. However, performance
    tuning is like trying to reach the speed of light:
    more and more time or energy is expended, but
    the peak performance is never reached. It may be
    best, before optimizing your program, to consider
    how much time and energy you have and are willing
    or allowed to commit. Remember, you may spend a
    lot of time optimizing for one processor/compiler
    only to be told to port the code to another system

87
Serial
  • Computers have become faster over the past years
    (Moore's Law). However, application speed has
    not kept pace. Why? Perhaps it is because
    programmers
  • Write programs without any knowledge of the
    hardware on which they will run
  • Do not know how to use compilers effectively (how
    many use the gnu compilers?)
  • Do not know how to modify code to improve
    performance

88
Serial Storage Problems
  • Avoid cache thrashing and memory bank contention
    by dimensioning multidimensional arrays so that
    the dimensions are not powers of two
  • Eliminate TLB (Translation Lookaside Buffer which
    translates virtual memory addresses into physical
    memory addresses) misses and memory bank
    contention by accessing arrays in unit stride. A
    TLB miss is when a process accesses memory which
    does not have its translation in the TLB
  • Avoid Fortran I/O interfaces such as open(),
    read(), write(), etc. since they are built on top
    of buffered I/O mechanisms (fopen(), fread(),
    fwrite(), etc.). Fortran adds additional
    functionality to the I/O routines, which leads to
    more overhead for doing the actual transfers
  • Do your own buffering for I/O and use system
    calls to transfer large blocks of data to and
    from files
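  • A C illustration (my own) of the unit-stride advice:

      #define N 1000

      /* C stores arrays row-major, so let the inner loop run over the
         last index: consecutive addresses, few TLB and cache misses */
      void scale_good(double a[N][N], double s) {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  a[i][j] *= s;
      }

      /* same result with stride-N access: each iteration jumps
         N * sizeof(double) bytes, thrashing the TLB and cache */
      void scale_bad(double a[N][N], double s) {
          for (int j = 0; j < N; j++)
              for (int i = 0; i < N; i++)
                  a[i][j] *= s;
      }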

89
Serial Compilers and HPC
  • A compiler takes a high-level language as input
    and produces assembler code which, once linked with
    other objects, forms an executable that can run
    on a computer
  • Initially programmers had no choice but to
    program in assembler for a specific processor.
    When processors changed, so would the code
  • Now programmers write in a high-level language
    that can be recompiled for other processors
    (source code compatibility). There is also
    object and binary compatibility

90
Serial the compiler and you
  • How the compiler generates good assembly code and
    things you can do to help it
  • Register allocation is when the compiler assigns
    quantities to registers. C and C++ have the
    register keyword. Some optimizations increase
    the number of registers required
  • C/C++ register data type - useful when the
    programmer knows the variable will be used many
    times and should not be reloaded from memory
  • C/C++ asm macro allows assembly code to be
    inserted directly into the instruction sequence.
    It makes code non-portable
  • C/C++ include file math.h generates faster code
    when used
  • Uniqueness of memory addresses. Different
    languages make assumptions on whether memory
    locations of variables are unique. Aliasing
    occurs when multiple elements have the same
    memory locations.
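  • For example (a sketch of my own), C99's restrict
    qualifier is one way to promise the compiler that
    addresses are unique:

      /* without restrict the compiler must assume x and y may overlap
         and so reload memory on every iteration */
      void axpy(int n, double a, const double * restrict x,
                double * restrict y) {
          for (int i = 0; i < n; i++)
              y[i] += a * x[i];
      }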

91
Serial The Compiler and You
  • Dead code elimination is the removal of code that
    is never used
  • Constant folding is when expressions with
    multiple constants can be folded together and
    evaluated at compile time (i.e. A = 3 + 4 can be
    replaced by A = 7). Propagation is when variable
    references are replaced by a constant value at
    compile time (i.e. A = 3 + 4, B = A + 3 can be
    replaced by A = 7 and B = 10)
  • Common subexpression elimination (i.e. A = B * (X+Y),
    C = D * (X+Y)) puts repeated expressions into a new
    variable (see the sketch after this list)
  • Strength reduction
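  • In C, the folding, propagation, and common
    subexpression transformations above look like this
    (my own illustration; the reconstructed operators
    are assumptions):

      void folding_example(void) {
          int a = 3 + 4;    /* constant folding:     a = 7  */
          int b = a + 3;    /* constant propagation: b = 10 */
          (void)a; (void)b;
      }

      /* common subexpression elimination: (x + y) computed once */
      double cse(double b, double d, double x, double y) {
          double t = x + y; /* the repeated subexpression */
          double a = b * t;
          double c = d * t;
          return a + c;
      }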

92
Serial Strength reductions
  • Replace integer multiplication or division with
    shift operations
  • Multiplies and divides are expensive
  • Replace 32-bit integer division by 64-bit
    floating-point division
  • Integer division is much more expensive than
    floating-point division
  • Replace floating-point multiplication with
    floating-point addition
  • Y = X + X is cheaper than Y = 2*X
  • Replace multiple floating-point divisions by
    division and multiplication
  • Division is one of the most expensive operations:
    a = x/z, b = y/z can be replaced by c = 1/z, a = x*c,
    b = y*c
  • Replace power function by floating-point
    multiplications
  • Power calculations are very expensive and take 50
    times longer than performing a multiplication, so
    Y = X**3 can be replaced by Y = X*X*X
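  • The same reductions in C (my own illustration):

      /* shifts instead of multiply/divide by powers of two */
      int times8(int n) { return n << 3; }  /* n * 8             */
      int div8(int n)   { return n >> 3; }  /* n / 8, for n >= 0 */

      /* addition instead of multiplication */
      double twice(double x) { return x + x; }      /* not 2.0 * x */

      /* one division and two multiplications instead of two divisions */
      void two_ratios(double x, double y, double z,
                      double *a, double *b) {
          double c = 1.0 / z;
          *a = x * c;
          *b = y * c;
      }

      /* multiplications instead of the power function */
      double cube(double x) { return x * x * x; }   /* not pow(x, 3.0) */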

93
Serial Single Loop Optimization
  • Induction variable optimization
  • when values in a loop are a linear function of
    the induction variable the code can be simplified
    by replacing the expression with a counter and
    replacing the multiplication by an addition
  • Prefetching
  • What happens when the compiler prefetches off
    the end of the array? (fortunately it is ignored)
  • Test promotion in loops
  • Branches in code greatly reduce performance since
    they interfere with pipelining
  • Loop peeling
  • Handle boundary conditions outside the loop (i.e.
    do not test for them inside the loop)
  • Fusion
  • If the loop control is the same (i.e. for (i = 0;
    i < n; i++)) for more than one loop, combine them
    together
  • Fission
  • Sometimes loops need to be split apart to help
    performance
  • Copying
  • Loop fission using dynamically allocated memory
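  • Two of these transformations sketched in C (my own
    illustration):

      /* loop peeling: boundary iterations handled outside the loop,
         so there is no per-iteration boundary test */
      void smooth(double *a, const double *b, int n) {
          a[0] = b[0];                              /* peeled first */
          for (int i = 1; i < n - 1; i++)
              a[i] = (b[i - 1] + b[i] + b[i + 1]) / 3.0;
          a[n - 1] = b[n - 1];                      /* peeled last  */
      }

      /* loop fusion: two loops over the same range become one,
         halving loop overhead and improving locality */
      void fused(double *a, double *b, const double *c, int n) {
          for (int i = 0; i < n; i++) {             /* was two i-loops */
              a[i] = 2.0 * c[i];
              b[i] = c[i] + 1.0;
          }
      }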