Title: Parallel Programming
1 Parallel Programming
2 Designing Parallel Programs
- No one way to solve a problem
- Approach
- Decide on machine-independent, problem-level issues early
- Leave machine-specific concerns until later
3 Stages of the Design Process
- Partitioning
- Communication
- Agglomeration
- Mapping
- The result of the design process may be a program that is as simple or as complex as the problem requires
5 Partitioning
- Decomposition of the processes and data into small tasks
- Ignore machine issues
- The goal is a fine-grained decomposition
6 Partitioning
- The Process
- Focus on the data first: Domain Decomposition
7 Domain Decomposition
- Try to divide the data into small segments of approximately equal size (see the sketch below)
- Make this division along logical boundaries
- Try starting with the largest data structure or the one used the most
- Similar to creating objects in OO programming
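To make the idea concrete, here is a minimal sketch of a block domain decomposition of a 1D array of n elements across p tasks. The function name block_range, the half-open range convention, and the example sizes are illustrative assumptions, not part of the original design.

/* Sketch: block (domain) decomposition of a 1D array of n elements
 * across p tasks.  Each task owns a contiguous segment of roughly
 * equal size; the first (n % p) tasks receive one extra element. */
#include <stdio.h>

/* Compute the half-open range [*lo, *hi) owned by task `rank`. */
static void block_range(int n, int p, int rank, int *lo, int *hi)
{
    int base = n / p;          /* minimum elements per task    */
    int rem  = n % p;          /* tasks 0..rem-1 get one extra */
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

int main(void)
{
    int n = 10, p = 4;                 /* illustrative sizes */
    for (int r = 0; r < p; r++) {
        int lo, hi;
        block_range(n, p, r, &lo, &hi);
        printf("task %d owns [%d, %d)\n", r, lo, hi);
    }
    return 0;
}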
8 Example
- Class scheduling system discussion
9 Domain Decomposition
- Next, partition the necessary processing according to the data decomposition
- Associate the separate process tasks with the data that they use
- If a process requires data from several tasks, communication will be necessary between those tasks
10 Partitioning
- The Process
- Focus on the data first: Domain Decomposition
- Focus on the computations first: Functional Decomposition
11 Functional Decomposition
- For computation-intensive programs, concentrating on the processes first can be more effective
- This can also be a useful endeavor for program simplicity
12 Partitioning Checklist
- Does your partition define at least an order of magnitude more tasks than there are processors in your target computer?
- Provides flexibility in future design stages
- Does your partition avoid redundant computation and storage requirements?
- Provides scalability
13 Partitioning Checklist
- Are tasks of comparable size?
- Goes to processor allocation in the Mapping stage
- Does the number of tasks scale with problem size?
- Also relates to scalability: if a larger problem is presented, the number of tasks should increase rather than the size of each task
14 Partitioning Checklist
- Have you identified several alternative partitions?
- Later design stages may require a different decomposition; having alternatives in mind will help you recognize those situations, and alternatives are much easier to develop during this phase
15 Stages of the Design Process
- Partitioning
- Communication
- Agglomeration
- Mapping
- The result of the design process may be a program that is as simple or as complex as the problem requires
16 Communication
- When one task needs data associated with another task
- Conceptualized as a channel between tasks
- Two phases
- Define the channel structure
- Define the messages sent along the channels
17 Communication
- Differences between Domain and Functional decomposition
- Domain (data-centered) decomposition often requires complex communication structures
- Functional (process-centered) decomposition is usually straightforward communication-wise
18 Categorizing Communication Patterns
- Local/Global
- Structured/Unstructured
- Static/Dynamic
- Synchronous/Asynchronous
19 Local Communication
- When an operation requires data from only a small
number of other tasks
20 Task Algorithm: Jacobi Finite Difference
for t = 0 to T-1
  send X[i,j](t) to each neighbor
  receive X[i-1,j](t), X[i+1,j](t), X[i,j-1](t), X[i,j+1](t) from neighbors
  compute X[i,j](t+1) using Equation 2.1
endfor
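The per-point tasks above are usually agglomerated before implementation. The sketch below shows one possible MPI realization in which each process owns a strip of rows and exchanges ghost rows with its neighbors each iteration; the grid sizes, the iteration count, the simple 4-point average standing in for Equation 2.1, and all identifiers are illustrative assumptions.

/* Sketch: Jacobi sweep with an MPI halo exchange, assuming the fine-grained
 * per-point tasks are agglomerated into one horizontal strip of rows per
 * process. */
#include <mpi.h>
#include <string.h>

#define LOCAL_ROWS 4            /* interior rows owned by each process */
#define NCOLS      8            /* grid columns                        */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* rows 0 and LOCAL_ROWS+1 are ghost rows holding neighbor data */
    double u[LOCAL_ROWS + 2][NCOLS] = {{0}};
    double unew[LOCAL_ROWS + 2][NCOLS] = {{0}};
    for (int j = 0; j < NCOLS; j++) u[1][j] = rank + 1.0;   /* arbitrary data */

    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int t = 0; t < 10; t++) {
        /* send first interior row up, receive the ghost row from below */
        MPI_Sendrecv(u[1], NCOLS, MPI_DOUBLE, up, 0,
                     u[LOCAL_ROWS + 1], NCOLS, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* send last interior row down, receive the ghost row from above */
        MPI_Sendrecv(u[LOCAL_ROWS], NCOLS, MPI_DOUBLE, down, 1,
                     u[0], NCOLS, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* update each interior point from its four neighbors */
        for (int i = 1; i <= LOCAL_ROWS; i++)
            for (int j = 1; j < NCOLS - 1; j++)
                unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                     u[i][j - 1] + u[i][j + 1]);
        memcpy(u, unew, sizeof u);
    }

    MPI_Finalize();
    return 0;
}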
21 Gauss-Seidel version
- A reformulation of Jacobi that converges in fewer iterations
22 Gauss-Seidel version
23 Global Communication
- Many tasks need data from many other tasks
- Example: highly connected neural networks
24 Problems with local communication
25 Distributing Communication And Computation
- One alternative for the summation problem
- Doesn't solve the problem
26 Distributing Communication And Computation
27 Divide and Conquer
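As a concrete illustration of the divide-and-conquer summation, here is a minimal MPI sketch that forms a global sum in roughly log2(p) communication steps; the local values and the message tag are illustrative assumptions.

/* Sketch: divide-and-conquer global sum.  At each step, tasks whose rank
 * has the current bit set send their partial sum to the partner with that
 * bit cleared and drop out, so the result accumulates on task 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double sum = rank + 1.0;                  /* each task's local value */
    for (int step = 1; step < size; step <<= 1) {
        if (rank & step) {                    /* sender: forward and stop */
            MPI_Send(&sum, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;
        } else if (rank + step < size) {      /* receiver: accumulate     */
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += other;
        }
    }
    if (rank == 0) printf("global sum = %g\n", sum);
    MPI_Finalize();
    return 0;
}

In practice the same result is obtained with a single call to MPI_Reduce; the explicit loop is shown only to expose the tree structure.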
28 Unstructured and Dynamic Communication
- Irregular communication structure
- Changing structure
29 Asynchronous Communication
- Tasks have no innate knowledge of when their data will be needed by other tasks
- Consumers must request data from producers
30 Possible Mechanisms
- The data structure is completely distributed, with individual tasks requesting data and polling for data requests (see the sketch below)
- A separate set of tasks is used to request data and respond to data requests
- Use shared memory (watch out for gridlock)
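A minimal sketch of the first mechanism follows: each MPI process owns a slice of a distributed array, issues one request to a neighbor, and polls for incoming requests between units of computation. The ring request pattern, the tags, the array contents, and the one-request-each termination rule are illustrative assumptions.

/* Sketch: asynchronous access to a distributed data structure via polling.
 * Each task owns part of an array, asks its right neighbor for one element,
 * and keeps polling so it can answer the request arriving from its left
 * neighbor. */
#include <mpi.h>
#include <stdio.h>

#define TAG_REQ  1
#define TAG_DATA 2
#define NLOCAL   8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[NLOCAL];                         /* our slice of the data */
    for (int i = 0; i < NLOCAL; i++) local[i] = rank * 100.0 + i;

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    int want = 3;                                 /* index we ask for      */
    MPI_Request req;
    MPI_Isend(&want, 1, MPI_INT, right, TAG_REQ, MPI_COMM_WORLD, &req);

    int served = 0, answered = 0, flag;
    double reply = 0.0;
    MPI_Status st;
    while (!served || !answered) {
        /* poll for an incoming request for one of our elements */
        MPI_Iprobe(left, TAG_REQ, MPI_COMM_WORLD, &flag, &st);
        if (flag && !served) {
            int idx;
            MPI_Recv(&idx, 1, MPI_INT, left, TAG_REQ, MPI_COMM_WORLD, &st);
            MPI_Send(&local[idx], 1, MPI_DOUBLE, left, TAG_DATA, MPI_COMM_WORLD);
            served = 1;
        }
        /* poll for the reply to our own request */
        MPI_Iprobe(right, TAG_DATA, MPI_COMM_WORLD, &flag, &st);
        if (flag && !answered) {
            MPI_Recv(&reply, 1, MPI_DOUBLE, right, TAG_DATA, MPI_COMM_WORLD, &st);
            answered = 1;
        }
        /* ...a real task would do a unit of local computation here... */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("rank %d received element %d = %g from rank %d\n", rank, want, reply, right);
    MPI_Finalize();
    return 0;
}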
31 Communications Design Checklist
- Do all tasks perform about the same number of communication operations?
- Unbalanced communication is non-scalable
- Does each task communicate only with a small number of neighbors?
- Relates to the choice of local vs. global communication
32 Communications Design Checklist
- Are communication operations able to proceed concurrently?
- If not, the algorithm is likely to be inefficient and non-scalable
- Is the computation associated with different tasks able to proceed concurrently?
- Again, if not, likely inefficient and non-scalable
33 What do we have so far?
- The algorithm is still abstract, suitable for use as either a parallel or a distributed mechanism
34 Stages of the Design Process
- Partitioning
- Communication
- Agglomeration
- Mapping
- The result of the design process may be a program that is as simple or as complex as the problem requires
35 Agglomeration
- Start to look at the implementation and the particular system it will run on (sort of)
- Try to reduce the number of tasks to one that is manageable by the system
- Decide whether data replication is appropriate to reduce communication
37 Conflicting Goals
- Reduce communication costs
- Retain flexibility
- Reduce software engineering costs
38 Reduce communication costs
- Send less data
- Send fewer messages, even if the total data is the same
- Surface-to-volume effects: communication grows with the surface of a subdomain while computation grows with its volume, so larger subdomains communicate less per unit of computation
- Replicating computation
39 Replicating Computation
- For instances when tasks must wait for other
computations to complete
40 Avoiding Communication
- A tree
- A butterfly (see the sketch below)
- Alternative butterfly
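The butterfly can be sketched as a recursive-doubling exchange in which every task ends up holding the global sum, so no separate broadcast is needed; the sketch below assumes the number of tasks is a power of two and uses arbitrary local values.

/* Sketch: butterfly (recursive doubling) summation.  At step k each task
 * exchanges its partial sum with the partner whose rank differs in bit k,
 * so after log2(p) steps every task holds the global sum.  Assumes a
 * power-of-two number of tasks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double sum = rank + 1.0;                   /* local contribution */
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;             /* exchange across one dimension */
        double other;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &other, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += other;
    }
    printf("rank %d: global sum = %g\n", rank, sum);
    MPI_Finalize();
    return 0;
}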
41 Reducing Software Engineering Costs
- Keep in mind general maintenance issues
- When parallelizing existing sequential
algorithms, keeping existing code when possible
may be preferable to gaining a few extra
milliseconds in execution
42 Agglomeration Design Checklist
- Has agglomeration reduced communication costs by increasing locality?
- If not, you missed the point
- If agglomeration has replicated computation, have you verified that the benefits of this replication outweigh its costs for a range of problem sizes and processor counts?
- If not, you may be losing scalability
43 Agglomeration Design Checklist
- If agglomeration replicates data, have you verified that this does not compromise the scalability of your algorithm by restricting the range of problem sizes or processor counts that it can address?
- Has agglomeration yielded tasks with similar computation and communication costs?
44 Agglomeration Design Checklist
- Does the number of tasks still scale with problem size?
- If agglomeration eliminated opportunities for concurrent execution, have you verified that there is sufficient concurrency for current and future target computers?
45 Agglomeration Design Checklist
- Can the number of tasks be reduced still further, without introducing load imbalances, increasing software engineering costs, or reducing scalability?
- If you are parallelizing an existing sequential program, have you considered the cost of the modifications required to the sequential code?
46 Stages of the Design Process
- Partitioning
- Communication
- Agglomeration
- Mapping
- The result of the design process may be a program that is as simple or as complex as the problem requires
47 Mapping
- Now we need to specify exactly how our algorithm will run on the given hardware
- The ultimate goal is to minimize execution time
- Finding an optimal mapping in the general case is NP-complete
48 Strategies
- Tasks that can execute concurrently are assigned to different processors
- Tasks that communicate frequently are assigned to the same processor
49 Example
50 Static Task/Channels
- Simply minimize interprocessor communication
- Can also agglomerate tasks associated with the
same processor (creating one task per processor)
51 Complex Tasks/Channels
- Load-balancing techniques
- Tend to use heuristics
- Cost of additional computations must be considered
52 Load-Balancing
- Recursive Bisection
- Divides the domain into subdomains of approximately equal computational cost while minimizing communication
- The domain is cut in one dimension to create two subdomains, which are then recursively divided (see the sketch below)
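Below is a minimal one-dimensional recursive bisection sketch: weighted cells are split at the half-weight point and each half is handed half of the processors. The cell weights, the processor count, and the function names are illustrative assumptions; real implementations also choose the cut dimension and account for communication.

/* Sketch: 1-D recursive bisection of weighted grid cells.  The range
 * [lo, hi) is split where the cumulative weight reaches half the total,
 * and each half is assigned half of the available processors. */
#include <stdio.h>

#define NCELLS 16

static double w[NCELLS];       /* computational cost of each cell */
static int    owner[NCELLS];   /* processor assigned to each cell */

static void bisect(int lo, int hi, int proc_lo, int nprocs)
{
    if (nprocs == 1) {                       /* one processor left: take all */
        for (int i = lo; i < hi; i++) owner[i] = proc_lo;
        return;
    }
    double total = 0.0, half = 0.0;
    for (int i = lo; i < hi; i++) total += w[i];

    int cut = lo;                            /* find the half-weight cut     */
    while (cut < hi - 1 && half + w[cut] < total / 2.0) half += w[cut++];

    bisect(lo, cut, proc_lo, nprocs / 2);    /* left half, first processors  */
    bisect(cut, hi, proc_lo + nprocs / 2, nprocs - nprocs / 2);
}

int main(void)
{
    for (int i = 0; i < NCELLS; i++) w[i] = 1.0 + i % 3;   /* uneven costs */
    bisect(0, NCELLS, 0, 4);
    for (int i = 0; i < NCELLS; i++)
        printf("cell %2d (cost %.0f) -> processor %d\n", i, w[i], owner[i]);
    return 0;
}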
53 Load-Balancing
- Local Algorithms
- Recursive bisection requires global knowledge, while local algorithms do not
- Using information from neighboring processors, each processor compares its load with its neighbors' and transfers computation if the difference is above a threshold
54 Load-Balancing
- Probabilistic Methods
- Randomly assign tasks to processors
- Good when there is little communication between tasks or little locality
- Good for scalability
- Generates more communication
- Works well if there are a lot more tasks than processors
55 Load-Balancing
- Cyclic Mappings
- Used when the computational load per grid point varies and there is a good deal of locality
- Just like probabilistic mappings, except that tasks are assigned in a pattern
- Can also be performed in a block-cyclic fashion (see the sketch below)
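A minimal sketch of cyclic and block-cyclic ownership functions follows; the task count, processor count, block size, and function names are illustrative assumptions.

/* Sketch: cyclic and block-cyclic mapping of n tasks onto p processors. */
#include <stdio.h>

/* Cyclic: task i goes to processor i mod p. */
static int cyclic_owner(int i, int p)              { return i % p; }

/* Block-cyclic: tasks are grouped into blocks of size b, and the
 * blocks are dealt out to processors in round-robin order. */
static int block_cyclic_owner(int i, int p, int b)  { return (i / b) % p; }

int main(void)
{
    int n = 12, p = 3, b = 2;          /* illustrative sizes */
    for (int i = 0; i < n; i++)
        printf("task %2d -> cyclic %d, block-cyclic %d\n",
               i, cyclic_owner(i, p), block_cyclic_owner(i, p, b));
    return 0;
}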
56 Task-Scheduling Algorithms
- Used when a functional decomposition creates many tasks with weak locality
- Tasks are reformulated as data structures representing problems, which can then be dynamically assigned to processors (worker tasks)
- The allocation strategy is key
57 Manager/Worker
58 Manager/Worker
- Workers request tasks, which are provided by the manager
- Workers may send new tasks to the manager
- More efficient versions prefetch tasks to overlap computation and communication, or cache problems in the workers to reduce communication (a minimal sketch follows below)
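The basic (non-prefetching) scheme can be sketched in MPI as follows, with rank 0 acting as the manager handing out integer task indices on request. The tags, the task count, the integer payload, and the assumption of at least two processes are all illustrative.

/* Sketch: a minimal manager/worker scheme.  Rank 0 hands out task
 * indices on request and sends a stop signal when the pool is empty;
 * workers request, process, and repeat until told to stop. */
#include <mpi.h>
#include <stdio.h>

#define TAG_REQUEST 1
#define TAG_TASK    2
#define TAG_STOP    3
#define NTASKS      20

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                           /* manager */
        int next = 0, done = 0;
        while (done < size - 1) {
            int dummy;
            MPI_Status st;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                done++;
            }
        }
    } else {                                   /* worker */
        for (;;) {
            int dummy = 0, task;
            MPI_Status st;
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            printf("worker %d processing task %d\n", rank, task);
        }
    }
    MPI_Finalize();
    return 0;
}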
59 Hierarchical Manager/Worker
- Essentially a hierarchical tree of Manager/Worker subdivisions
- Can be more effective at reducing communications
60 Decentralized Schemes
- The task pool becomes a distributed data structure
- Processors each maintain a task pool requested by workers
- Access rules can be enforced to limit communications
- Scales up nicely
61 Mapping Design Checklist
- If considering an SPMD design for a complex problem, have you also considered an algorithm based on dynamic task creation and deletion?
- The latter may scale better and be easier to design, but may degrade performance
- Have you considered the reverse of the first point (starting from dynamic task creation and considering an SPMD design instead)?
62 Mapping Design Checklist
- If using a centralized load-balancing scheme, have you verified that the manager will not become a bottleneck?
- If using a dynamic load-balancing scheme, have you evaluated the relative costs of different strategies?
- Don't forget implementation costs
63 Mapping Design Checklist
- If using probabilistic or cyclic methods, do you have a large enough number of tasks to ensure reasonable load balance?
- Generally, at least ten times as many tasks as processors are needed