HPC Parallel Programming: From Concept to Compile
1
HPC Parallel Programming: From Concept to Compile
  • A One-Day Introductory Workshop
  • October 12th, 2004

2
Schedule
  • Concept (a frame of mind)
  • Compile (application)

3
Introduction
  • Programming parallel computers
  • Compiler extension
  • Sequential programming language extension
  • Parallel programming layer
  • New parallel languages

4
Concept
  • Parallel Algorithm Design
  • Programming paradigms
  • Parallel Random Access Machine - the PRAM model
  • Result, Agenda, Specialist - the RAS model
  • Task / Channel - the PCAM model
  • Bulk Synchronous Parallel - the BSP model
  • Pattern Language

5
Compile
  • Serial
  • Introduction to OpenMP
  • Introduction to MPI
  • Profilers
  • Libraries
  • Debugging
  • Performance Analysis Formulas

6
  • By the end of this workshop you will be exposed
    to
  • Different parallel programming models and
    paradigms
  • Serial programming ("I'm not bad, I was built
    this way") and how you can optimize it
  • OpenMP and MPI
  • Libraries
  • Debugging

7
  • "Eu est velociter perfectus"
  • "Well Done is Quickly Done"
  • - Caesar Augustus

8
Introduction
  • What is Parallel Computing?
  • It is the ability to program in a language that
    allows you to explicitly indicate how different
    portions of the computation may be executed
    concurrently by different processors

9
  • Why do it?
  • The need for speed
  • How much speedup can be achieved is determined by
  • Amdahl's Law: S(p) = p/(1 + (p-1)f) where
  • f - fraction of the computation that cannot be
    divided into concurrent tasks, 0 ≤ f ≤ 1, and
  • p - the number of processors
  • So if we have 20 processors and a serial portion
    of 5% we will get a speedup of
    20/(1 + (20-1)(.05)) ≈ 10.26
  • Also Gustafson-Barsis's Law, which takes into
    account scalability, and
  • the Karp-Flatt Metric, which takes into account the
    parallel overhead, and
  • the Isoefficiency Relation, which is used to determine
    the range of processors for which a particular
    level of efficiency can be maintained. Parallel
    overhead increases as the number of processors
    increases, so to maintain efficiency increase the
    problem size
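  • As a quick check of the arithmetic, a minimal C
    sketch (my own, not part of the original deck) that
    evaluates Amdahl's Law:

      #include <stdio.h>

      /* Amdahl's Law: S(p) = p / (1 + (p - 1) * f),
         f = serial fraction, p = number of processors */
      double amdahl_speedup(int p, double f) {
          return p / (1.0 + (p - 1) * f);
      }

      int main(void) {
          /* the slide's example: 20 processors, f = 0.05 */
          printf("S(20) = %.2f\n", amdahl_speedup(20, 0.05)); /* 10.26 */
          return 0;
      }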

10
  • Why do parallel computing? Some other reasons
  • Time: Reduce the turnaround time of applications
  • Performance: Parallel computing is the only way
    to extend performance toward the TFLOP realm
  • Cost/Performance: Traditional vector computers
    become too expensive as one pushes the
    performance barrier
  • Memory: Applications often require memory that
    goes beyond that addressable by a single
    processor
  • Whole classes of important algorithms are ideal
    for parallel execution. Most algorithms can
    benefit from parallel processing, such as the
    Laplace equation, Monte Carlo methods, FFT (signal
    processing), and image processing
  • Life itself is a set of concurrent processes
  • Scientists use modeling so why not model systems
    in a way closer to nature

11
  • Many complex scientific problems require large
    computing resources. Problems such as
  • Quantum chemistry, statistical mechanics, and
    relativistic physics
  • Cosmology and astrophysics
  • Computational fluid dynamics and turbulence
  • Biology, genome sequencing, genetic engineering
  • Medicine
  • Global weather and environmental modeling
  • One such place is http://www-fp.mcs.anl.gov/grand-
    challenges/

12
Programming Parallel Computers
  • In 1988 four distinct paths for application
    software development on parallel computers were
    identified by McGraw and Axelrod
  • Extend an existing compiler to translate
    sequential programs into parallel programs
  • Extend an existing language with new operations
    that allow users to express parallelism
  • Add a new language layer on top of an existing
    sequential language
  • Define a totally new parallel language

13
Compiler extension
  • Design parallelizing compilers that exploit
    parallelism in existing programs written in a
    sequential language
  • Advantages
  • billions of dollars and thousands of years of
    programmer effort have already gone into legacy
    programs.
  • Automatic parallelization can save money and
    labour.
  • It has been an active area of research for over
    twenty years
  • Companies such as Parallel Software Products
    http://www.parallelsp.com/ offer compilers that
    translate F77 code into parallel programs for MPI
    and OpenMP
  • Disadvantages
  • Pits the programmer and compiler in a game of hide
    and seek. The programmer hides parallelism in DO
    loops and control structures and the compiler
    might irretrievably lose some parallelism

14
Sequential Programming Language Extension
  • Extend a sequential language with functions that
    allow programmers to create, terminate,
    synchronize and communicate with parallel
    processes
  • Advantages
  • Easiest, quickest, and least expensive since it
    only requires the development of a subroutine
    library
  • Libraries meeting the MPI standard exist for
    almost every parallel computer
  • Gives programmers flexibility with respect to
    program development
  • Disadvantages
  • Compiler is not involved in generation of
    parallel code therefore it cannot flag errors
  • It is very easy to write parallel programs that
    are difficult to debug

15
Parallel Programming layers
  • Think of a parallel program consisting of 2
    layers. The bottom layer contains the core of
    the computation which manipulates its portion of
    data to get its result. The upper layer
    controls creation and synchronization of
    processes. A compiler would then translate these
    two levels into code for execution on parallel
    machines
  • Advantages
  • Allows users to depict parallel programs as
    directed graphs with nodes depicting sequential
    procedures and arcs representing data dependences
    among procedures
  • Disadvantages
  • Requires programmer to learn and use a new
    parallel programming system

16
New Parallel Languages
  • Develop a parallel language from scratch. Let
    the programmer express parallel operations
    explicitly. The programming language Occam is one
    famous example:
    http://wotug.ukc.ac.uk/parallel/occam/
  • Advantages
  • Explicit parallelism means programmer and
    compiler are now allies instead of adversaries
  • Disadvantages
  • Requires development of new compilers. It
    typically takes years for vendors to develop
    high-quality compilers for their parallel
    architectures
  • Some parallel languages, such as C*, were never
    adopted as standards, severely compromising
    code portability
  • User resistance. Who wants to learn another
    language?

17
  • The most popular approach continues to be
    augmenting existing sequential languages with
    low-level constructs expressed by function calls
    or compiler directives
  • Advantages
  • Can exhibit high efficiency
  • Portable to a wide range of parallel systems
  • C, C++, and F90 with MPI or OpenMP are such examples
  • Disadvantages
  • More difficult to code and debug

18
Concept
  • An algorithm (from the OED) is a set of rules or
    process, usually one expressed in algebraic
    notation, now used in computing
  • A parallel algorithm is one in which the rules or
    process are concurrent
  • There is no simple recipe for designing parallel
    algorithms. However, it can benefit from a
    methodological approach. It allows the
    programmer to focus on machine-independent issues
    such as concurrency early in the design process
    and machine-specific aspects later
  • You will be introduced to such approaches and
    models and hopefully gain some insight into the
    design process
  • Examining these models is a good way to start
    thinking in parallel

19
Parallel Programming Paradigms
  • Parallel applications can be classified into well
    defined programming paradigms
  • Each paradigm is a class of algorithms that have
    the same control structure
  • Experience suggests that there are relatively
    few paradigms underlying most parallel programs
  • The choice of paradigm is determined by the
    computing resources which can define the level of
    granularity and type of parallelism inherent in
    the program which reflects the structure of
    either the data or application

20
Parallel Programming Paradigms
  • The most systematic definition of paradigms comes
    from a technical report from the University of
    Basel in 1993 entitled BACS: Basel Algorithm
    Classification Scheme
  • A generic tuple of factors which characterize a
    parallel algorithm
  • Process properties (structure, topology,
    execution)
  • Interaction properties
  • Data properties (partitioning, placement)
  • The following paradigms were described
  • Task-Farming (or Master/Slave)
  • Single Program Multiple Data (SPMD)
  • Data Pipelining
  • Divide and Conquer
  • Speculative Parallelism

21
PPP Task-Farming
  • Task-farming consists of two entities
  • Master which decomposes the problem into small
    tasks and distributes/farms them to the slave
    processes. It also gathers the partial results
    and produces the final computational result
  • Slave which gets a message with a task, executes
    the task and sends the result back to the master
  • It can use either static load balancing
    (distribution of tasks is all performed at the
    beginning of the computation) or dynamic
    load-balancing (when the number of tasks exceeds
    the number of processors or is unknown, or when
    execution times are not predictable, or when
    dealing with unbalanced problems). This paradigm
    responds quite well to the loss of processors and
    can be scaled by extending the single master to a
    set of masters
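  • A sketch of a task farm in C with MPI (my own
    illustration, not from the deck; the task payload
    and tags are hypothetical). The master farms out
    tasks dynamically and gathers partial results:

      #include <mpi.h>
      #include <stdio.h>

      #define NTASKS   100
      #define TAG_WORK 1
      #define TAG_STOP 2

      static double do_task(int t) { return (double)t * t; } /* stand-in work */

      int main(int argc, char **argv) {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);
          if (rank == 0) {                  /* master */
              int next = 0, active = 0;
              double sum = 0.0, r;
              MPI_Status st;
              /* seed every slave with one task (or stop it immediately) */
              for (int w = 1; w < size; w++) {
                  if (next < NTASKS) {
                      MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                      next++; active++;
                  } else {
                      MPI_Send(NULL, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                  }
              }
              while (active > 0) {          /* dynamic load balancing */
                  MPI_Recv(&r, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                           MPI_COMM_WORLD, &st);
                  sum += r;
                  if (next < NTASKS) {
                      MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                               MPI_COMM_WORLD);
                      next++;
                  } else {
                      MPI_Send(NULL, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                               MPI_COMM_WORLD);
                      active--;
                  }
              }
              printf("sum = %f\n", sum);
          } else {                          /* slave */
              int t;
              MPI_Status st;
              for (;;) {
                  MPI_Recv(&t, 1, MPI_INT, 0, MPI_ANY_TAG,
                           MPI_COMM_WORLD, &st);
                  if (st.MPI_TAG == TAG_STOP) break;
                  double r = do_task(t);
                  MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
              }
          }
          MPI_Finalize();
          return 0;
      }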

22
PPP Single Program Multiple Data (SPMD)
  • SPMD is the most commonly used paradigm
  • Each process executes the same piece of code but
    on a different part of the data which involves
    the splitting of the application data among the
    available processors. This is also referred to
    as geometric parallelism, domain decomposition,
    or data parallelism
  • Applications can be very efficient if the data is
    well distributed among the processes on a
    homogeneous system. If different workloads are
    evident then some sort of load balancing scheme
    is necessary during run-time execution
  • Highly sensitive to loss of a process. Usually
    results in a deadlock until the global
    synchronization point is reached
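  • A minimal SPMD sketch in C with MPI (my own
    illustration; the "work" is a placeholder): every
    process runs the same code on its own block of the
    data

      #include <mpi.h>
      #include <stdio.h>

      #define N 1000000

      int main(int argc, char **argv) {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          /* same program everywhere, different slice of the data */
          int lo = rank * (N / size);
          int hi = (rank == size - 1) ? N : lo + N / size;

          double local = 0.0, total;
          for (int i = lo; i < hi; i++)
              local += (double)i;           /* placeholder computation */

          /* combine the partial results on every process */
          MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM,
                        MPI_COMM_WORLD);
          if (rank == 0) printf("total = %f\n", total);
          MPI_Finalize();
          return 0;
      }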

23
PPP Data Pipelining
  • Data pipelining is fine-grained parallelism and
    is based on a functional decomposition approach
  • The tasks (capable of concurrent operation) are
    identified and each processor executes a small
    part of the total algorithm
  • It is one of the simplest and most popular functional
    decomposition paradigms and can also be referred
    to as data-flow parallelism.
  • Communication between stages of the pipeline can
    be asynchronous. Efficiency is directly
    dependent on the ability to balance the load
    across the stages
  • Often used in data reduction and image processing

24
PPP Divide and Conquer
  • The divide and conquer approach is well known in
    sequential algorithm development in which a
    problem is divided into two or more subproblems.
    Each subproblem is solved independently and the
    results combined
  • In parallel divide and conquer, the subproblems
    can be solved at the same time
  • Three generic computational operations: split,
    compute, and join (sort of like a virtual tree
    where the tasks are computed at the leaf nodes)
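  • A serial C sketch of the split/compute/join shape
    (my own illustration): the two recursive calls are
    independent subproblems, so a parallel version
    could hand each to a different task

      /* divide and conquer sum: split, compute at the leaves, join */
      double dc_sum(const double *a, int n) {
          if (n <= 4) {                     /* leaf: compute directly */
              double s = 0.0;
              for (int i = 0; i < n; i++) s += a[i];
              return s;
          }
          int half = n / 2;                 /* split */
          double left  = dc_sum(a, half);            /* independent ... */
          double right = dc_sum(a + half, n - half); /* ... subproblems */
          return left + right;              /* join */
      }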

25
PPP Speculative Parallelism
  • Employed when it is difficult to obtain
    parallelism through any one of the previous
    paradigms
  • Deals with complex data dependencies which can be
    broken down into smaller parts using some
    speculation or heuristic to facilitate the
    parallelism

26
PRAM Parallel Random Access Machine
  • Descendent of RAM (Random Access Machine)
  • A theoretical model of parallel computation in
    which an arbitrary but finite number of
    processors can access any value in an arbitrarily
    large shared memory in a single time step
  • Introduced in the 1970s it still remains popular
    since it is theoretically tractable and gives
    algorithm designers a common target. The
    Prentice Hall book from 1989 entitled The Design
    and Analysis of Parallel Algorithms gives a good
    introduction to the design of algorithms using
    this model

27
PRAM cont
  • The three most important variations on this model
    are
  • EREW (exclusive read exclusive write) where any
    memory location may be accessed only once in any
    one step
  • CREW (concurrent read exclusive write) where any
    memory location may be read any number of times
    during a single step but written to only once
    after the reads have finished
  • CRCW (concurrent read concurrent write) where any
    memory location may be written to or read from
    any number of times during a single step. Some
    rule or priority must be given to resolve
    multiple writes

28
PRAM cont
  • This model has problems
  • PRAMs cannot be emulated optimally on all
    architectures
  • Problem lies in the assumption that every
    processor can access the memory simultaneously in
    a single step. Even in hypercubes, messages must
    take several hops between source and destination,
    and the hop count grows logarithmically with the
    machine's size. As a result any buildable computer
    will experience a logarithmic slowdown relative to
    the PRAM model as its size increases
  • One solution is to take advantage of cases in
    which there is greater parallelism in the process
    than in the hardware it is running on, enabling
    each physical processor to emulate many virtual
    processors. An example of such is as follows

29
PRAM cont
  • Example
  • Process A sends request
  • Process B runs while request travels to memory
  • Process C runs while memory services request
  • Process D runs while reply returns to processor
  • Process A is re-scheduled
  • The efficiency with which physically realizable
    architectures could emulate the PRAM is dictated
    by the following theorem
  • If each of P processors sends a single message to
    a randomly-selected partner, it is highly
    probable that at least one processor will receive
    O(log P / log log P) messages, and some others
    will receive none, but
  • If each processor sends log P messages to
    randomly-selected partners, there is a high
    probability that no processor will receive more
    than 3 log P messages.
  • So if problem size is increased at least
    logarithmically faster than machine size,
    efficiency can be held constant. The problem is
    that this holds only for hypercubes, in which the
    number of communication links grows with the
    number of processors.
  • Several ways around the above limitation have been
    suggested, such as

30
PRAM cont
  • XPRAM where computations are broken up into steps
    such that no processor may communicate more than
    a certain number of times per single time step.
  • Programs which fit this model can be emulated
    efficiently
  • Problem is that it is difficult to design
    algorithms in which the frequency of
    communication decreases as the problem size
    increases
  • Bird-Meertens formalism where the allowed set of
    communications would be restricted to those which
    can be emulated efficiently
  • The scan-vector model proposed by Blelloch
    accounts for the relative distance of different
    portions of memory
  • Another option was proposed by Ranade in 1991
    which uses a butterfly network in which each node
    is a processor/memory pair. Routing messages in
    this model is complicated but the end result is
    an optimal PRAM emulation

31
Result, Agenda, Specialist Model
  • The RAS model was proposed by Nicholas Carriero
    and David Gelernter in their book How to Write
    Parallel Programs in 1990
  • To write a parallel program
  • Choose a pattern that is most natural to the
    problem
  • Write a program using the method that is most
    natural for that pattern, and
  • If the resulting program is not efficient, then
    transform it methodically into a more efficient
    version

32
RAS
  • Sounds simple. We can envision parallelism in
    terms of
  • Result - focuses on the shape of the finished
    product
  • Plan a parallel application around a data
    structure yielded as the final result. We get
    parallelism by computing all elements of the
    result simultaneously
  • Agenda - focuses on the list of tasks to be
    performed
  • Plan a parallel application around a particular
    agenda of tasks, and then assign many processes
    to execute the tasks
  • Specialist - focuses on the make-up of the work
  • Plan an application around an ensemble of
    specialists connected into a logical network of
    some kind. Parallelism results from all nodes
    being active simultaneously, much like
    pipelining

33
RAS - Result
  • In most cases the easiest way to think of a
    parallel program is to think of the resulting
    data structure. It is a good starting point for
    any problem whose goal is to produce a series of
    values with predictable organization and
    interdependencies
  • Such a program reads as follows
  • Build a data structure
  • Determine the value of all elements of the
    structure simultaneously
  • Terminate when all values are known
  • If all values are independent then all
    computations start in parallel. However, if some
    elements cannot be computed until certain other
    values are known, then those tasks are blocked
  • As a simple example consider adding two n-element
    vectors (i.e. add the ith elements of both and
    store the sum in another vector)
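  • A sketch of that example in C with OpenMP (my own
    illustration): every element of the result vector
    is independent, so all n additions may proceed
    concurrently

      /* result parallelism: compute all elements of c simultaneously */
      void vector_add(const double *a, const double *b, double *c, int n) {
          #pragma omp parallel for
          for (int i = 0; i < n; i++)
              c[i] = a[i] + b[i];
      }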

34
RAS - Agenda
  • Agenda parallelism adapts well to many different
    problems
  • The most flexible is the master-worker paradigm
  • in which a master process initializes the
    computation and creates a collection of identical
    worker processes
  • Each worker process is capable of performing any
    step in the computation
  • Workers seek a task to perform and then repeat
  • When no tasks remain, the program is finished
  • An example would be to find the lowest ratio of
    salary to dependents in a database. The master
    fills a bag with records and each worker draws
    from the bag, computes the ratio, sends the
    results back to the master. The master keeps
    track of the minimum and when tasks are complete
    reports the answer

35
RAS - Specialist
  • Specialist parallelism involves programs that are
    conceived in terms of a logical network.
  • Best understood as a network in which each node
    executes an autonomous computation and inter-node
    communication follows predictable paths
  • An example could be a circuit simulation where
    each element is realized by a separate process

36
RAS - Example
  • Consider a naïve n-body simulator where on each
    iteration of the simulation we calculate the
    prevailing forces between each body and all the
    rest, and update each body's position accordingly
  • With the result parallelism approach it is easy
    to restate the problem description as follows
  • Suppose n bodies and q iterations of the simulation;
    compute matrix M such that M[i, j] is the
    position of the ith body after the jth iteration
  • Define each entry in terms of other entries, i.e.
    write a function to compute position (i, j)

37
RAS - Example
  • With the agenda parallelism model we can
    repeatedly apply the transformation "compute next
    position" to all bodies in the set
  • So the steps involved would be to
  • Create a master process and have it generate n
    initial task descriptors (one for each body)
  • On the first iteration, each process repeatedly
    grabs a task descriptor and computes the next
    position of the corresponding body, until all
    task descriptors are used
  • The master can then store information about each
    body's position at the last iteration in a
    distributed table structure where each process
    can refer to it directly

38
RAS - Example
  • Finally, with the specialist parallelism approach
    we create a series of processes, each one
    specializing in a single body (i.e. each
    responsible for computing a single body's current
    position throughout the entire simulation)
  • At the start of each iteration, each process
    sends data to and receives data from each other
    process
  • The data included in the incoming group of
    messages is sufficient to allow each process
    to compute a new position for its body, then
    repeat

39
Task Channel model
  • THERE IS NO SIMPLE RECIPE FOR DESIGNING PARALLEL
    ALGORITHMS
  • However, with suggestions by Ian Foster and his
    book Designing and Building Parallel Programs
    there is a methodology we can use
  • The task/channel method is the one most often cited
    as a practical means to organize the design
    process
  • It represents a parallel computation as a set of
    tasks that may interact with each other by
    sending messages through channels
  • It can be viewed as a directed graph where
    vertices represent tasks and directed edges
    represent channels
  • A thorough examination of this design process
    will conclude with a practical example

40
  • A task is a program, its local memory, and a
    collection of I/O ports
  • The local memory contains the program's
    instructions and its private data
  • It can send local data values to other tasks via
    output ports
  • It also receives data values from other tasks via
    input ports
  • A channel is a message queue that connects one
    task's output port with another task's input port
  • Data values appear at the input port in the same
    order as they were placed in the output port of
    the channel
  • Tasks cannot receive data until it is sent (i.e.
    receiving is blocked)
  • Sending is never blocked

41
  • The four stages of Foster's design process are
  • Partitioning - the process of dividing the
    computation and data into pieces
  • Communication - the pattern of sends and receives
    between tasks
  • Agglomeration - the process of grouping tasks into
    larger tasks to simplify programming or improve
    performance
  • Mapping - the process of assigning tasks to
    processors
  • Commonly referred to as PCAM

42
Partitioning
  • Discover as much parallelism as possible. To
    this end strive to split the computation and data
    into smaller pieces
  • There are two approaches
  • Domain decomposition
  • Functional decomposition

43
PCAM partitioning domain decomposition
  • Domain decomposition is where you first divide
    the data into pieces and then determine how to
    associate computations with the data
  • Typically focus on the largest or most frequently
    accessed data structure in the program
  • Consider a 3D matrix. It can be partitioned as
  • Collection of 2D slices, resulting in a 1D
    collection of tasks
  • Collection of 1D slices, resulting in a 2D
    collection of tasks
  • Each matrix element separately resulting in a 3D
    collection of tasks
  • At this point in the design process it is usually
    best to maximize the number of tasks, hence the 3D
    partitioning is best
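  • For instance, a small C helper (my own sketch) that
    computes which slab of the first dimension a task
    owns in the 1D decomposition into 2D slices:

      /* task t of p owns planes [lo, hi) of an nx-deep 3D array;
         the remainder is spread one plane at a time over the first tasks */
      void slab_bounds(int nx, int p, int t, int *lo, int *hi) {
          int base = nx / p, rem = nx % p;
          *lo = t * base + (t < rem ? t : rem);
          *hi = *lo + base + (t < rem ? 1 : 0);
      }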

44
PCAM partitioning functional decomposition
  • Functional decomposition is complementary to
    domain decomposition in that the computation is
    first divided into pieces and then the data items
    are associated with each computation. This is
    often known as pipelining, which yields a collection
    of concurrent tasks
  • Consider brain surgery
  • Before surgery begins, a set of CT images is
    input to form a 3D model of the brain
  • The system tracks the position of the instruments
    converting physical coordinates into image
    coordinates and displaying them on a monitor.
    While one task is converting physical coordinates
    to image coordinates, another is displaying the
    previous image, and yet another is tracking the
    instrument for the next image. (Anyone remember
    the movie The Fantastic Voyage?)

45
PCAM Partitioning - Checklist
  • Regardless of decomposition we must maximize the
    number of primitive tasks since it is the upper
    bound on the parallelism we can exploit. Foster
    has presented us a checklist to evaluate the
    quality of the partitioning
  • There are at least an order of magnitude more
    tasks than processors on the target parallel
    machine. If not, there may be little flexibility
    in later design options
  • Avoid redundant computation and storage
    requirements since the design may not work well
    when the size of the problem increases
  • Tasks are of comparable size. If not, it may be
    hard to allocate each processor equal amounts of
    work
  • The number of tasks scale with problem size. If
    not, it may be difficult to solve larger problems
    when more processors are available
  • Investigate alternative partitioning to maximize
    flexibility later

46
PCAM-Communication
  • After the tasks have been identified it is
    necessary to understand the communication
    patterns between them
  • Communications are considered part of the
    overhead of a parallel algorithm, since the
    sequential algorithm does not need to do this.
    Minimizing this overhead is an important goal
  • Two such patterns, local and global, are more
    commonly used than others (structured/unstructured,
    static/dynamic, synchronous/asynchronous)
  • Local communication exists when a task needs
    values from a small number of other tasks (its
    neighbours) in order to perform a computation
  • Global communication exists when a large number of
    tasks must supply data in order to perform a
    computation (e.g. performing a parallel reduction
    operation computing the sum of values over N
    tasks)

47
PCAM Communication - checklist
  • These are guidelines and not hard and fast rules
  • Are the communication operations balanced between
    tasks? Unbalanced communication requirements
    suggest a non-scalable construct
  • Each task communicates only with a small number
    of neighbours
  • Tasks are able to communicate concurrently. If
    not the algorithm is likely to be inefficient and
    non-scalable.
  • Tasks are able to perform their computations
    concurrently

48
PCAM - Agglomeration
  • The first two steps of the design process were
    focused on identifying as much parallelism as
    possible
  • At this point the algorithm would probably not
    execute efficiently on any particular parallel
    computer. For example, if there are many orders of
    magnitude more tasks than processors it can lead
    to a significant overhead in communication
  • In the next two stages of the design we consider
    combining tasks into larger tasks and then
    mapping them onto physical processors to reduce
    parallel overhead

49
PCAM - Agglomeration
  • Agglomeration (according to the OED) is the process
    of collecting in a mass. In this case we try to
    group tasks into larger tasks to facilitate
    improvement in performance or to simplify
    programming.
  • There are three main goals to agglomeration
  • Lower communication overhead
  • Maintain the scalability of the parallel design,
    and
  • Reduce software engineering costs

50
PCAM - Agglomeration
  • How can we lower communication overhead?
  • By agglomerating tasks that communicate with each
    other, communication is completely eliminated,
    since data values controlled by the tasks are in
    the memory of the consolidated task. This
    process is known as increasing the locality of
    the parallel algorithm
  • Another way is to combine groups of transmitting
    and receiving tasks thereby reducing the number
    of messages sent. Sending fewer, longer messages
    takes less time than sending more, shorter
    messages since there is an associated startup
    cost (message latency) inherent with every
    message sent which is independent of the length
    of the message.

51
PCAM - Agglomeration
  • How can we maintain the scalability of the
    parallel design?
  • Ensure that you do not combine too many tasks
    since porting to a machine with more processors
    may be difficult.
  • For example part of your parallel program is to
    manipulate a 3D array 16 X 128 X 256 and the
    machine has 8 processors. By agglomerating the
    2nd and 3rd dimensions each task would be
    responsible for a submatrix of 2 X 128 X 256. We
    can even port this to a machine that has 16
    processors. However porting this to a machine
    with more processors might result in large
    changes to the parallel code. Therefore
    agglomerating the 2nd and 3rd dimension might not
    be a good idea. What about a machine with 50,
    64, or 128 processors?

52
PCAM - Agglomeration
  • How can we reduce software engineering costs?
  • By parallelizing a sequential program we can
    reduce time and expense of developing a similar
    parallel program. Remember Parallel Software
    Products

53
PCAM Agglomeration - checklist
  • Some of these points in this checklist emphasize
    quantitative performance analysis which becomes
    more important as we move from the abstract to
    the concrete
  • Has the agglomeration increased the locality of
    the parallel algorithm?
  • Do replicated computations take less time than
    the communications they replace?
  • Is the amount of replicated data small enough to
    allow the algorithm to scale?
  • Do agglomerated tasks have similar computational
    and communication costs?
  • Is the number of tasks an increasing function of
    the problem size?
  • Is the number of tasks as small as possible, yet
    at least as large as the number of processors on
    your parallel computer?
  • Is the trade-off between your chosen
    agglomeration and the cost of modifications to
    existing sequential code reasonable?

54
PCAM - Mapping
  • In this 4th and final stage we specify where each
    task is to execute
  • The goals of mapping are to maximize processor
    utilization and minimize interprocessor
    communications. Often these are conflicting
    goals
  • Processor utilization is maximized when the
    computation is balanced evenly. Conversely, it
    drops when one or more processors are idle
  • Interprocessor communication increases when two
    tasks connected by a channel are mapped to
    different processors. Conversely, it decreases
    when the two tasks connected by the channel are
    mapped to the same processor
  • Mapping every task to the same processor cuts
    communication to zero but utilization is reduced
    to 1/p. The point is to choose a
    mapping that represents a reasonable balance
    between the conflicting goals. The mapping problem
    has a name, and it is

55
PCAM - Mapping
  • The mapping problem is known to be NP-hard,
    meaning that no computationally tractable
    (polynomial-time) algorithm exists for evaluating
    these trade-offs in the general case. Hence we
    must rely on heuristics that can do a reasonably
    good job of mapping
  • Some strategies for decomposition of a problem
    are
  • Perfectly parallel
  • Domain
  • Control
  • Object-oriented
  • Hybrid/layered (multiple uses of the above)

56
PCAM Mapping decomposition - perfect
  • Perfectly parallel
  • Applications that require little or no
    inter-processor communication when running in
    parallel
  • Easiest type of problem to decompose
  • Results in nearly perfect speed-up
  • The pi example is almost perfectly parallel
  • The only communication occurs at the beginning of
    the problem when the number of divisions needs to
    be broadcast and at the end where the partial
    sums need to be added together
  • The calculation of the area of each slice
    proceeds independently
  • This would be true even if the area calculation
    were replaced by something more complex
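  • A sketch of that pi computation in C with MPI (my
    own illustration, assuming the usual midpoint-rule
    integration of 4/(1+x²) over [0, 1]):

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          int rank, size, n = 1000000;      /* number of slices */
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          /* one broadcast at the beginning ... */
          MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

          /* ... independent slice areas in the middle ... */
          double h = 1.0 / n, local = 0.0, pi;
          for (int i = rank; i < n; i += size) {
              double x = h * (i + 0.5);
              local += 4.0 / (1.0 + x * x);
          }
          local *= h;

          /* ... one reduction of the partial sums at the end */
          MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                     MPI_COMM_WORLD);
          if (rank == 0) printf("pi ~ %.10f\n", pi);
          MPI_Finalize();
          return 0;
      }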

57
PCAM mapping decomposition - domain
  • Domain decomposition
  • In simulation and modelling this is the most
    common solution
  • The solution space (which often corresponds to
    the real space) is divided up among the
    processors. Each processor solves its own little
    piece
  • Finite-difference methods and finite-element
    methods lend themselves well to this approach
  • The method of solution often leads naturally to a
    set of simultaneous equations that can be solved
    by parallel matrix solvers
  • Sometimes the solution involves some kind of
    transformation of variables (i.e. Fourier
    Transform). Here the domain is some kind of
    phase space. The solution and the various
    transformations involved can be parallelized

58
PCAM mapping decomposition - domain
  • Solution of a PDE (Laplace's Equation)
  • A finite-difference approximation
  • Domain is divided into discrete finite
    differences
  • Solution is approximated throughout
  • In this case, an iterative approach can be used
    to obtain a steady-state solution
  • Only nearest neighbour cells are considered in
    forming the finite difference
  • Gravitational N-body, structural mechanics,
    weather and climate models are other examples

59
PCAM mapping decomposition - control
  • Control decomposition
  • If you cannot find a good domain to decompose,
    your problem might lend itself to control
    decomposition
  • Good for
  • Unpredictable workloads
  • Problems with no convenient static structures
  • One type of control decomposition is functional
    decomposition
  • Problem is viewed as a set of operations. It is
    among operations where parallelization is done
  • Many examples in industrial engineering ( i.e.
    modelling an assembly line, a chemical plant,
    etc.)
  • Many examples in data processing where a series
    of operations is performed on a continuous stream
    of data

60
PCAM mapping decomposition - control
  • Examples
  • Image processing: given a series of raw images,
    perform a series of transformations that yield a
    final enhanced image. Solve this in a functional
    decomposition (each process represents a
    different function in the problem) using data
    pipelining
  • Game playing: games feature an irregular search
    space. One possible move may lead to a rich set
    of possible subsequent moves to search.

61
PCAM mapping decomposition - OO
  • Object-oriented decomposition is really a
    combination of functional and domain
    decomposition
  • Rather than thinking about dividing data or
    functionality, we look at the objects in the
    problem
  • The object can be decomposed as a set of data
    structures plus the procedures that act on those
    data structures
  • The goal of object-oriented parallel programming
    is distributed objects
  • Although conceptually clear, in practice it can
    be difficult to achieve good load balancing among
    the objects without a great deal of fine tuning
  • Works best for fine-grained problems and in
    environments where having functionality ready
    at-the-call is more important than worrying about
    under-worked processors (i.e. battlefield
    simulation)
  • Message passing is still explicit (no standard
    C++ compiler automatically parallelizes over
    objects).

62
PCAM mapping decomposition - OO
  • Example the client-server model
  • The server is an object that has data associated
    with it (i.e. a database) and a set of procedures
    that it performs (i.e. searches for requested
    data within the database)
  • The client is an object that has data associated
    with it (i.e. a subset of data that it has
    requested from the database) and a set of
    procedures it performs (i.e. some application
    that massages the data).
  • The server and client can run concurrently on
    different processors - an object-oriented
    decomposition of a parallel application
  • In the real world, this can be large scale when
    many clients (workstations running applications)
    access a large central database - kind of like a
    distributed supercomputer

63
PCAM mapping decomposition -summary
  • A good decomposition strategy is
  • Key to potential application performance
  • Key to programmability of the solution
  • There are many different ways of thinking about
    decomposition
  • Decomposition models (domain, control,
    object-oriented, etc.) provide standard templates
    for thinking about the decomposition of a problem
  • Decomposition should be natural to the problem
    rather than natural to the computer architecture
  • Communication does no useful work - keep it to a
    minimum
  • Always wise to see if a library solution already
    exists for your problem
  • Don't be afraid to use multiple decompositions in
    a problem if it seems to fit

64
PCAM mapping - considerations
  • If the communication pattern among tasks is
    regular, create p agglomerated tasks that
    minimize communication and map each task to its
    own processor
  • If the number of tasks is fixed and communication
    among them regular, but the time required to
    perform each task is variable, then some sort of
    cyclic or interleaved mapping of tasks to
    processors may result in a more balanced load
  • Dynamic load-balancing algorithms are needed when
    tasks are created and destroyed at run-time or the
    computation or communication costs of tasks vary
    widely

65
PCAM mapping - checklist
  • It is important to keep an open mind during the
    design process. These points can help you decide
    if you have done a good job of considering design
    alternatives
  • Is the design based on one task per processor or
    multiple tasks per processor?
  • Have both static and dynamic allocation of tasks
    to processors been considered?
  • If dynamic allocation of tasks is chosen, is the
    manager (task allocator) a bottleneck to
    performance?
  • If using probabilistic or cyclic methods, do you
    have a large enough number of tasks to ensure
    reasonable load balance (typically ten times as
    many tasks as processors are required)

66
PCAM example N-body problem
  • In a Newtonian n-body simulation, gravitational
    forces have infinite range. Sequential algorithms
    to solve these problems have time complexity of
    Θ(n²) per iteration where n is the number of
    objects
  • Let us suppose that we are simulating the motion
    of n particles of varying mass in 2D. During
    each iteration we need to compute the velocity
    vector of each particle, given the positions of
    all other particles.
  • Using the four stage process we get

67
PCAM example N-body problem
  • Partitioning
  • Assume we have one task per particle.
  • This particle must know the location of all other
    particles
  • Communication
  • A gather operation is a global communication that
    takes a dataset distributed among a group of
    tasks and collects the items on a single task
  • An all-gather operation is similar to gather,
    except at the end of communication every task has
    a copy of the entire dataset
  • We need to update the location of every particle
    so an all-gather is necessary
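  • In MPI this maps directly onto MPI_Allgather; a
    minimal sketch of my own, with 2D positions:

      #include <mpi.h>

      /* each of p tasks contributes the (x, y) position of its particle;
         afterwards all_pos (room for 2*p doubles) is complete everywhere */
      void share_positions(const double my_pos[2], double *all_pos,
                           MPI_Comm comm) {
          MPI_Allgather(my_pos, 2, MPI_DOUBLE,   /* send my coordinates */
                        all_pos, 2, MPI_DOUBLE,  /* receive 2 per task  */
                        comm);
      }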

68
PCAM example N-body problem
  • So put a channel between every pair of tasks
  • During each communication step each task sends its
    vector element to one other task. After n - 1
    communication steps, each task has the position
    of all other particles, and it can perform the
    calculations needed to determine the velocity and
    new location for its particle
  • Is there a quicker way? Suppose there were only
    two particles. If each task had a single
    particle, they can exchange copies of their
    values. What if there were four particles?
    After a single exchange step tasks 0 and 1 could
    both have particles v0 and v1 , likewise for
    tasks 2 and 3. If task 0 exchanges its pair of
    particles with task 2 and task 1 exchanges with
    task 3, then all tasks will have all four
    particles. A logarithmic number of exchange
    steps is sufficient to allow every processor to
    acquire the value originally held by every other
    processor. So in the ith exchange step, messages
    have length 2^(i-1)

69
PCAM example N-body problem
  • Agglomeration and mapping
  • In general, there are more particles n than
    processors p. If n is a multiple of p we can
    associate one task per processor and agglomerate
    n/p particles into each task.

70
PCAM - summary
  • Task/channel (PCAM) is a theoretical construct
    that represents a parallel computation as a set
    of tasks that may interact with each other by
    sending messages through channels
  • It encourages parallel algorithm designs that
    maximize local computations and minimize
    communications

71
BSP Bulk Synchronous Parallel
  • BSP model was proposed in 1989. It provides an
    elegant theoretical framework for bridging the
    gap between parallel hardware and software
  • BSP allows the programmer to design an
    algorithm as a sequence of large steps (supersteps
    in BSP language), each containing many basic
    computation or communication operations done in
    parallel and a global synchronization at the end,
    where all processors wait for each other to
    finish their work before they proceed with the
    next superstep.
  • BSP is currently used around the world, and a very
    good text (on which this segment is based) is
    called Parallel Scientific Computation by Rob
    Bisseling, published by Oxford University Press in 2004

72
BSP Bulk Synchronous Parallel
  • Some useful links
  • BSP Worldwide organization
  • http://www.bsp-worldwide.org
  • The Oxford BSP toolset (public domain, GNU
    license)
  • http://www.bsp-worldwide.org/implmnts/oxtool
  • The source files from the book together with test
    programs form a package called BSPedupack and can
    be found at
  • http://www.math.uu.nl/people/bisseling/software.ht
    ml
  • The MPI version called MPIedupack is also
    available from the previously mentioned site

73
BSP Bulk Synchronous Parallel
  • BSP satisfies all requirements of a useful
    parallel programming model
  • Simple enough to allow easy development and
    analysis of algorithms
  • Realistic enough to allow reasonably accurate
    modelling of real-life parallel computing
  • There exists a portability layer in the form of
    BSPlib
  • It has been efficiently implemented in the Oxford
    BSP toolset and Paderborn University BSP library
  • Currently being used as a framework for algorithm
    design and implementation on clusters of PCs,
    networks of workstations, shared-memory
    multiprocessors and large parallel machines with
    distributed memory

74
BSP Model
  • BSP comprises a computer architecture, a class
    of algorithms, and a function for charging costs
    to algorithms (hmm, no wonder it is a popular
    model)
  • The BSP computer
  • consists of a collection of processors, each with
    private memory,
  • and a communication network that allows
    processors to access each other's memories

75
BSP Model
  • The BSP algorithm is a series of supersteps which
    contain either a number of computation or
    communication steps followed by a global barrier
    synchronization (i.e. bulk synchronization)
  • What is one possible problem you see right away
    with designing algorithms this way?
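  • The shape of a superstep in C, sketched by me
    against the BSPlib primitives (bsp_begin, bsp_put,
    bsp_sync, etc.) as provided by the Oxford BSP
    toolset:

      #include "bsp.h"

      void spmd_part(void) {
          bsp_begin(bsp_nprocs());
          int p = bsp_nprocs(), s = bsp_pid();

          double x = 0.0;
          bsp_push_reg(&x, sizeof(double)); /* let others write into x */
          bsp_sync();                       /* superstep boundary */

          /* superstep: compute, communicate, then synchronize in bulk */
          double v = (double)s;             /* local computation */
          bsp_put((s + 1) % p, &v, &x, 0, sizeof(double));
          bsp_sync();                       /* all processors wait here */

          /* x now holds the value put by processor (s - 1 + p) % p */
          bsp_pop_reg(&x);
          bsp_end();
      }

      int main(int argc, char **argv) {
          bsp_init(spmd_part, argc, argv);
          spmd_part();
          return 0;
      }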

76
BSP Model
  • The BSP cost function is classified as an
    h-relation and consists of a superstep where at
    least one processor sends and receives at most h
    data words (real or integer). Therefore
    h = max(h_send, h_receive)
  • It assumes sends and receives are simultaneous
  • This charging cost is based on the assumption
    that the bottleneck is at the entry or exit of a
    communication network
  • The cost of an h-relation would be
  • T_comm(h) = hg + l, where
  • g is the communication cost per data word, and
    l is the global synchronization cost (both in
    time units of 1 flop), and the cost of a BSP
    algorithm is the expression
  • a + bg + cl, written as the triple (a, b, c), where
    a, b, c depend in general on p and on the problem size

77
BSP Bulk Synchronous Parallel
  • This model currently allows you to convert from
    BSP to MPI-2 using MPIedupack as an example (i.e.
    MPI can be used for programming in the BSP style)
  • The main difference between MPI and BSPlib is
    that MPI provides more opportunities for
    optimization by the user. However, BSP does
    impose a discipline that can prove fruitful in
    developing reusable code
  • The book contains an excellent section on sparse
    matrix-vector multiplication, and if you link to
    the website you can download some interesting
    solvers: http://www.math.uu.nl/people/bisseling/Mon
    driaan/mondriaan.html

78
Pattern Language
  • Primarily from the book Patterns for Parallel
    Programming by Mattson, Sanders, and Massingill,
    Addison-Wesley, 2004
  • From the back cover: "It's the first parallel
    programming guide written specifically to serve
    working software developers, not just computer
    scientists. The authors introduce a complete,
    highly accessible pattern language that will help
    any experienced developer think parallel and
    start writing effective code almost immediately"
  • The cliché "Don't judge a book by its cover"
    comes to mind

79
Pattern Language
  • We have come full circle. However, we have
    gained some knowledge along the way
  • A pattern language is not a programming language.
    It is an embodiment of design methodologies
    which provides domain-specific advice to the
    application designer
  • Design patterns were introduced into software
    engineering in 1987

80
Pattern Language
  • Organized into four design spaces (sound familiar?
    - PCAM)
  • Finding concurrency
  • Structure problem to expose exploitable
    concurrency
  • Algorithm structure
  • Structure the algorithm to take advantage of the
    concurrency found above
  • Supporting structures
  • Structured approaches that represent the program
    and shared data structures
  • Implementation mechanisms
  • Mapping patterns to processors

81
Concept - Summary
  • What is the common thread of all these models and
    paradigms?

82
Concept - Conclusion
  • You take a problem, break it up into n tasks and
    assign them to p processors - that's the science
  • How you break up the problem and exploit the
    parallelism - now that's the art

83
This page intentionally left blank

84
Compile
  • Serial/sequential program optimization
  • Introduction to OpenMP
  • Introduction to MPI
  • Profilers
  • Libraries
  • Debugging
  • Performance Analysis Formulas

85
Serial
  • Some of you may be thinking: why would I want to
    discuss serial execution in a talk about parallel
    computing?
  • Well, have you ever eaten just one bran flake
    or one rolled oat at a time?

86
Serial
  • Most of the serial optimization techniques can be
    used for any program, parallel or serial
  • Well-written assembler code will beat a high-level
    programming language any day, but who has the
    time to write a parallel application in assembler
    for one of the myriad of processors available?
    However, small sections of assembler might be
    more effective.
  • Reducing the memory requirements of an
    application is a good tool that frequently
    results in better processor performance
  • You can use these tools to write efficient code
    from scratch or to optimize existing code.
  • First attempts at optimization may be compiler
    options or modifying a loop. However, performance
    tuning is like trying to reach the speed of light:
    more and more time or energy is expended, but
    the peak performance is never reached. It may be
    best, before optimizing your program, to consider
    how much time and energy you have and are willing
    or allowed to commit. Remember, you may spend a
    lot of time optimizing for one processor/compiler
    only to be told to port the code to another system

87
Serial
  • Computers have become faster over the past years
    (Moore's Law). However, application speed has
    not kept pace. Why? Perhaps it is because
    programmers
  • Write programs without any knowledge of the
    hardware on which they will run
  • Do not know how to use compilers effectively (how
    many use the gnu compilers?)
  • Do not know how to modify code to improve
    performance

88
Serial Storage Problems
  • Avoid cache thrashing and memory bank contention
    by dimensioning multidimensional arrays so that
    the dimensions are not powers of two
  • Eliminate TLB (Translation Lookaside Buffer which
    translates virtual memory addresses into physical
    memory addresses) misses and memory bank
    contention by accessing arrays in unit stride. A
    TLB miss is when a process accesses memory which
    does not have its translation in the TLB
  • Avoid Fortran I/O interfaces such as open(),
    read(), write(), etc. since they are built on top
    of buffered I/O mechanisms (fopen(), fread(),
    fwrite(), etc.). Fortran adds additional
    functionality to the I/O routines, which leads to
    more overhead for doing the actual transfers
  • Do your own buffering for I/O and use system
    calls to transfer large blocks of data to and
    from files
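  • A C illustration (my own) of the unit-stride advice:

      #define N 1000

      /* C stores arrays row-major, so let the inner loop run over the
         last index: consecutive addresses, few TLB and cache misses */
      void scale_good(double a[N][N], double s) {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  a[i][j] *= s;
      }

      /* same result with stride-N access: each iteration jumps
         N * sizeof(double) bytes, thrashing the TLB and cache */
      void scale_bad(double a[N][N], double s) {
          for (int j = 0; j < N; j++)
              for (int i = 0; i < N; i++)
                  a[i][j] *= s;
      }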

89
Serial Compilers and HPC
  • A compiler takes a high-level language as input
    and produces assembler code which, once linked with
    other objects, forms an executable that can run
    on a computer
  • Initially programmers had no choice but to
    program in assembler for a specific processor.
    When processors changed, so would the code
  • Now programmers write in a high-level language
    that can be recompiled for other processors
    (source code compatibility). There is also
    object and binary compatibility

90
Serial the compiler and you
  • How the compiler generates good assembly code and
    things you can do to help it
  • Register allocation is when the compiler assigns
    quantities to registers. C and C++ have the
    register keyword. Some optimizations increase
    the number of registers required
  • C/C++ register data type - useful when the
    programmer knows the variable will be used many
    times and should not be reloaded from memory
  • C/C++ asm macro allows assembly code to be
    inserted directly into the instruction sequence.
    It makes code non-portable
  • C/C++ include file math.h generates faster code
    when used
  • Uniqueness of memory addresses. Different
    languages make assumptions on whether memory
    locations of variables are unique. Aliasing
    occurs when multiple elements have the same
    memory locations.
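  • For example (a sketch of my own), C99's restrict
    qualifier is one way to promise the compiler that
    addresses are unique:

      /* without restrict the compiler must assume x and y may overlap
         and so reload memory on every iteration */
      void axpy(int n, double a, const double * restrict x,
                double * restrict y) {
          for (int i = 0; i < n; i++)
              y[i] += a * x[i];
      }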

91
Serial The Compiler and You
  • Dead code elimination is the removal of code that
    is never used
  • Constant folding is when expressions with
    multiple constants can be folded together and
    evaluated at compile time (i.e. A = 3 + 4 can be
    replaced by A = 7). Propagation is when variable
    references are replaced by a constant value at
    compile time (i.e. A = 3 + 4, B = A + 3 can be
    replaced by A = 7 and B = 10)
  • Common subexpression elimination (i.e. A = B * (X+Y),
    C = D * (X+Y)) puts repeated expressions into a new
    variable (see the sketch after this list)
  • Strength reduction
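  • In C, the folding, propagation, and common
    subexpression transformations above look like this
    (my own illustration; the reconstructed operators
    are assumptions):

      void folding_example(void) {
          int a = 3 + 4;    /* constant folding:     a = 7  */
          int b = a + 3;    /* constant propagation: b = 10 */
          (void)a; (void)b;
      }

      /* common subexpression elimination: (x + y) computed once */
      double cse(double b, double d, double x, double y) {
          double t = x + y; /* the repeated subexpression */
          double a = b * t;
          double c = d * t;
          return a + c;
      }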

92
Serial Strength reductions
  • Replace integer multiplication or division with
    shift operations
  • Multiplies and divides are expensive
  • Replace 32-bit integer division by 64-bit
    floating-point division
  • Integer division is much more expensive than
    floating-point division
  • Replace floating-point multiplication with
    floating-point addition
  • Y = X + X is cheaper than Y = 2*X
  • Replace multiple floating-point divisions by
    division and multiplication
  • Division is one of the most expensive operations:
    a = x/z, b = y/z can be replaced by c = 1/z, a = x*c,
    b = y*c
  • Replace power function by floating-point
    multiplications
  • Power calculations are very expensive and take 50
    times longer than performing a multiplication, so
    Y = X**3 can be replaced by Y = X*X*X
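  • The same reductions in C (my own illustration):

      /* shifts instead of multiply/divide by powers of two */
      int times8(int n) { return n << 3; }  /* n * 8             */
      int div8(int n)   { return n >> 3; }  /* n / 8, for n >= 0 */

      /* addition instead of multiplication */
      double twice(double x) { return x + x; }      /* not 2.0 * x */

      /* one division and two multiplications instead of two divisions */
      void two_ratios(double x, double y, double z,
                      double *a, double *b) {
          double c = 1.0 / z;
          *a = x * c;
          *b = y * c;
      }

      /* multiplications instead of the power function */
      double cube(double x) { return x * x * x; }   /* not pow(x, 3.0) */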

93
Serial Single Loop Optimization
  • Induction variable optimization
  • when values in a loop are a linear function of
    the induction variable the code can be simplified
    by replacing the expression with a counter and
    replacing the multiplication by an addition
  • Prefetching
  • What happens when the compiler prefetches off
    the end of the array? (fortunately it is ignored)
  • Test promotion in loops
  • Branches in code greatly reduce performance since
    they interfere with pipelining
  • Loop peeling
  • Handle boundary conditions outside the loop (i.e.
    do not test for them inside the loop)
  • Fusion
  • If the loop control is the same (i.e. for (i = 0;
    i < n; i++)) for more than one loop, combine them
    together
  • Fission
  • Sometimes loops need to be split apart to help
    performance
  • Copying
  • Loop fission using dynamically allocated memory
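  • Two of these transformations sketched in C (my own
    illustration):

      /* loop peeling: boundary iterations handled outside the loop,
         so there is no per-iteration boundary test */
      void smooth(double *a, const double *b, int n) {
          a[0] = b[0];                              /* peeled first */
          for (int i = 1; i < n - 1; i++)
              a[i] = (b[i - 1] + b[i] + b[i + 1]) / 3.0;
          a[n - 1] = b[n - 1];                      /* peeled last  */
      }

      /* loop fusion: two loops over the same range become one,
         halving loop overhead and improving locality */
      void fused(double *a, double *b, const double *c, int n) {
          for (int i = 0; i < n; i++) {             /* was two i-loops */
              a[i] = 2.0 * c[i];
              b[i] = c[i] + 1.0;
          }
      }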