Title: Parallel Programming
1 Parallel Programming
2 Designing Parallel Programs
- No one way to solve a problem
- Approach
- Decide on machine-independent, problem-level issues early
- Leave machine-specific concerns until later
3 Stages of the Design Process
- Partitioning
- Communication
- Agglomeration
- Mapping
- The result of the design process may be a program that is as simple or as complex as the problem requires
5 Partitioning
- Decomposition of the processes and data into small tasks
- Ignore machine issues
- The goal is a fine-grained decomposition
6 Partitioning
- The Process
- Focus on the data first: Domain Decomposition
7 Domain Decomposition
- Try to divide the data into small segments of approximately equal size (see the sketch below)
- Make this division along logical boundaries
- Try starting with the largest data structure or the one used the most
- Similar to creating objects in OO programming
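To make the idea concrete, here is a minimal sketch of a block domain decomposition of a 1D array of n elements across p tasks. The function name block_range, the half-open range convention, and the example sizes are illustrative assumptions, not part of the original design.

/* Sketch: block (domain) decomposition of a 1D array of n elements
 * across p tasks.  Each task owns a contiguous segment of roughly
 * equal size; the first (n % p) tasks receive one extra element. */
#include <stdio.h>

/* Compute the half-open range [*lo, *hi) owned by task `rank`. */
static void block_range(int n, int p, int rank, int *lo, int *hi)
{
    int base = n / p;          /* minimum elements per task    */
    int rem  = n % p;          /* tasks 0..rem-1 get one extra */
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

int main(void)
{
    int n = 10, p = 4;                 /* illustrative sizes */
    for (int r = 0; r < p; r++) {
        int lo, hi;
        block_range(n, p, r, &lo, &hi);
        printf("task %d owns [%d, %d)\n", r, lo, hi);
    }
    return 0;
}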
8 Example
- Class scheduling system discussion
9 Domain Decomposition
- Next, partition the necessary processing according to the data decomposition
- Associate the separate process tasks with the data that they use
- If a process requires data from several tasks, communication will be necessary between those tasks
10 Partitioning
- The Process
- Focus on the data first: Domain Decomposition
- Focus on the computations first: Functional Decomposition
11 Functional Decomposition
- For computation-intensive programs, concentrating on the processes first can be more effective
- This can also be a useful endeavor for program simplicity
12 Partitioning Checklist
- Does your partition define at least an order of magnitude more tasks than there are processors in your target computer?
- Provides flexibility in future design stages
- Does your partition avoid redundant computation and storage requirements?
- Provides scalability
13 Partitioning Checklist
- Are tasks of comparable size?
- Goes to processor allocation in the Mapping stage
- Does the number of tasks scale with problem size?
- Also relates to scalability: if a larger problem is presented, the number of tasks should increase rather than the size of each task
14 Partitioning Checklist
- Have you identified several alternative partitions?
- Later design stages may require a different decomposition; having alternatives in mind will help you recognize those situations, and alternatives are much easier to develop during this phase
15 Stages of the Design Process
- Partitioning
- Communication
- Agglomeration
- Mapping
- The result of the design process may be a program that is as simple or as complex as the problem requires
16 Communication
- When one task needs data associated with another task
- Conceptualized as a channel between tasks
- Two phases
- Define the channel structure
- Define the messages sent along the channels
17 Communication
- Differences between Domain and Functional decomposition
- Domain (data-centered) decomposition often requires complex communication structures
- Functional (process-centered) decomposition is usually straightforward communication-wise
18 Categorizing Communication Patterns
- Local/Global
- Structured/Unstructured
- Static/Dynamic
- Synchronous/Asynchronous
19 Local Communication
- When an operation requires data from only a small
number of other tasks
20 Task Algorithm: Jacobi Finite Difference
for t = 0 to T-1
  send X[i,j](t) to each neighbor
  receive X[i-1,j](t), X[i+1,j](t), X[i,j-1](t), X[i,j+1](t) from neighbors
  compute X[i,j](t+1) using Equation 2.1
endfor
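The per-point tasks above are usually agglomerated before implementation. The sketch below shows one possible MPI realization in which each process owns a strip of rows and exchanges ghost rows with its neighbors each iteration; the grid sizes, the iteration count, the simple 4-point average standing in for Equation 2.1, and all identifiers are illustrative assumptions.

/* Sketch: Jacobi sweep with an MPI halo exchange, assuming the fine-grained
 * per-point tasks are agglomerated into one horizontal strip of rows per
 * process. */
#include <mpi.h>
#include <string.h>

#define LOCAL_ROWS 4            /* interior rows owned by each process */
#define NCOLS      8            /* grid columns                        */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* rows 0 and LOCAL_ROWS+1 are ghost rows holding neighbor data */
    double u[LOCAL_ROWS + 2][NCOLS] = {{0}};
    double unew[LOCAL_ROWS + 2][NCOLS] = {{0}};
    for (int j = 0; j < NCOLS; j++) u[1][j] = rank + 1.0;   /* arbitrary data */

    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int t = 0; t < 10; t++) {
        /* send first interior row up, receive the ghost row from below */
        MPI_Sendrecv(u[1], NCOLS, MPI_DOUBLE, up, 0,
                     u[LOCAL_ROWS + 1], NCOLS, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* send last interior row down, receive the ghost row from above */
        MPI_Sendrecv(u[LOCAL_ROWS], NCOLS, MPI_DOUBLE, down, 1,
                     u[0], NCOLS, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* update each interior point from its four neighbors */
        for (int i = 1; i <= LOCAL_ROWS; i++)
            for (int j = 1; j < NCOLS - 1; j++)
                unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                     u[i][j - 1] + u[i][j + 1]);
        memcpy(u, unew, sizeof u);
    }

    MPI_Finalize();
    return 0;
}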
21 Gauss-Seidel version
- A reformulation of Jacobi that converges in fewer iterations
22 Gauss-Seidel version
23 Global Communication
- Many tasks need data from many other tasks
- Example: highly connected neural networks
24 Problems with local communication
25 Distributing Communication And Computation
- One alternative for the summation problem
- Doesn't solve the problem
26 Distributing Communication And Computation
27 Divide and Conquer
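As a concrete illustration of the divide-and-conquer summation, here is a minimal MPI sketch that forms a global sum in roughly log2(p) communication steps; the local values and the message tag are illustrative assumptions.

/* Sketch: divide-and-conquer global sum.  At each step, tasks whose rank
 * has the current bit set send their partial sum to the partner with that
 * bit cleared and drop out, so the result accumulates on task 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double sum = rank + 1.0;                  /* each task's local value */
    for (int step = 1; step < size; step <<= 1) {
        if (rank & step) {                    /* sender: forward and stop */
            MPI_Send(&sum, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;
        } else if (rank + step < size) {      /* receiver: accumulate     */
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += other;
        }
    }
    if (rank == 0) printf("global sum = %g\n", sum);
    MPI_Finalize();
    return 0;
}

In practice the same result is obtained with a single call to MPI_Reduce; the explicit loop is shown only to expose the tree structure.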
28 Unstructured and Dynamic Communication
- Irregular communication structure
- Changing structure
29 Asynchronous Communication
- Tasks have no innate knowledge of when their data will be needed by other tasks
- Consumers must request data from producers
30 Possible Mechanisms
- The data structure is completely distributed, with individual tasks requesting data and polling for data requests (see the sketch below)
- A separate set of tasks is used to request data and respond to data requests
- Use shared memory (watch out for gridlock)
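A minimal sketch of the first mechanism follows: each MPI process owns a slice of a distributed array, issues one request to a neighbor, and polls for incoming requests between units of computation. The ring request pattern, the tags, the array contents, and the one-request-each termination rule are illustrative assumptions.

/* Sketch: asynchronous access to a distributed data structure via polling.
 * Each task owns part of an array, asks its right neighbor for one element,
 * and keeps polling so it can answer the request arriving from its left
 * neighbor. */
#include <mpi.h>
#include <stdio.h>

#define TAG_REQ  1
#define TAG_DATA 2
#define NLOCAL   8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[NLOCAL];                         /* our slice of the data */
    for (int i = 0; i < NLOCAL; i++) local[i] = rank * 100.0 + i;

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    int want = 3;                                 /* index we ask for      */
    MPI_Request req;
    MPI_Isend(&want, 1, MPI_INT, right, TAG_REQ, MPI_COMM_WORLD, &req);

    int served = 0, answered = 0, flag;
    double reply = 0.0;
    MPI_Status st;
    while (!served || !answered) {
        /* poll for an incoming request for one of our elements */
        MPI_Iprobe(left, TAG_REQ, MPI_COMM_WORLD, &flag, &st);
        if (flag && !served) {
            int idx;
            MPI_Recv(&idx, 1, MPI_INT, left, TAG_REQ, MPI_COMM_WORLD, &st);
            MPI_Send(&local[idx], 1, MPI_DOUBLE, left, TAG_DATA, MPI_COMM_WORLD);
            served = 1;
        }
        /* poll for the reply to our own request */
        MPI_Iprobe(right, TAG_DATA, MPI_COMM_WORLD, &flag, &st);
        if (flag && !answered) {
            MPI_Recv(&reply, 1, MPI_DOUBLE, right, TAG_DATA, MPI_COMM_WORLD, &st);
            answered = 1;
        }
        /* ...a real task would do a unit of local computation here... */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("rank %d received element %d = %g from rank %d\n", rank, want, reply, right);
    MPI_Finalize();
    return 0;
}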
31 Communications Design Checklist
- Do all tasks perform about the same number of communication operations?
- Unbalanced communication is non-scalable
- Does each task communicate only with a small number of neighbors?
- Relates to the choice of local vs. global communication
32 Communications Design Checklist
- Are communication operations able to proceed concurrently?
- If not, the algorithm is likely to be inefficient and non-scalable
- Is the computation associated with different tasks able to proceed concurrently?
- Again, if not, likely inefficient and non-scalable
33 What do we have so far?
- The algorithm is still abstract, suitable for use as either a parallel or a distributed mechanism
34 Stages of the Design Process
- Partitioning
- Communication
- Agglomeration
- Mapping
- The result of the design process may be a program that is as simple or as complex as the problem requires
35 Agglomeration
- Start to look at the implementation and the particular system it will run on (sort of)
- Try to reduce the number of tasks to one that is manageable by the system
- Decide whether data replication is appropriate to reduce communication
37 Conflicting Goals
- Reduce communication costs
- Retain flexibility
- Reduce software engineering costs
38 Reduce communication costs
- Send less data
- Send fewer messages, even if the total data is the same
- Surface-to-volume effects: communication grows with the surface of a subdomain while computation grows with its volume, so larger subdomains communicate less per unit of computation
- Replicating computation
39 Replicating Computation
- For instances when tasks must wait for other
computations to complete
40 Avoiding Communication
- A tree
- A butterfly (see the sketch below)
- Alternative butterfly
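The butterfly can be sketched as a recursive-doubling exchange in which every task ends up holding the global sum, so no separate broadcast is needed; the sketch below assumes the number of tasks is a power of two and uses arbitrary local values.

/* Sketch: butterfly (recursive doubling) summation.  At step k each task
 * exchanges its partial sum with the partner whose rank differs in bit k,
 * so after log2(p) steps every task holds the global sum.  Assumes a
 * power-of-two number of tasks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double sum = rank + 1.0;                   /* local contribution */
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;             /* exchange across one dimension */
        double other;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &other, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += other;
    }
    printf("rank %d: global sum = %g\n", rank, sum);
    MPI_Finalize();
    return 0;
}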
41 Reducing Software Engineering Costs
- Keep in mind general maintenance issues
- When parallelizing existing sequential
algorithms, keeping existing code when possible
may be preferable to gaining a few extra
milliseconds in execution
42 Agglomeration Design Checklist
- Has agglomeration reduced communication costs by increasing locality?
- If not, you missed the point
- If agglomeration has replicated computation, have you verified that the benefits of this replication outweigh its costs for a range of problem sizes and processor counts?
- If not, you may be losing scalability
43 Agglomeration Design Checklist
- If agglomeration replicates data, have you verified that this does not compromise the scalability of your algorithm by restricting the range of problem sizes or processor counts that it can address?
- Has agglomeration yielded tasks with similar computation and communication costs?
44 Agglomeration Design Checklist
- Does the number of tasks still scale with problem size?
- If agglomeration eliminated opportunities for concurrent execution, have you verified that there is sufficient concurrency for current and future target computers?
45 Agglomeration Design Checklist
- Can the number of tasks be reduced still further, without introducing load imbalances, increasing software engineering costs, or reducing scalability?
- If you are parallelizing an existing sequential program, have you considered the cost of the modifications required to the sequential code?
46 Stages of the Design Process
- Partitioning
- Communication
- Agglomeration
- Mapping
- The result of the design process may be a program that is as simple or as complex as the problem requires
47 Mapping
- Now we need to specify exactly how our algorithm will run on the given hardware
- The ultimate goal is to minimize execution time
- Finding an optimal mapping in the general case is NP-complete
48 Strategies
- Tasks that can execute concurrently are assigned to different processors
- Tasks that communicate frequently are assigned to the same processor
49 Example
50 Static Task/Channels
- Simply minimize interprocessor communication
- Can also agglomerate tasks associated with the
same processor (creating one task per processor)
51 Complex Tasks/Channels
- Load-balancing techniques
- Tend to use heuristics
- Cost of additional computations must be considered
52 Load-Balancing
- Recursive Bisection
- Divides the domain into subdomains of approximately equal computational cost while minimizing communication
- The domain is cut in one dimension to create two subdomains, which are then recursively divided (see the sketch below)
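Below is a minimal one-dimensional recursive bisection sketch: weighted cells are split at the half-weight point and each half is handed half of the processors. The cell weights, the processor count, and the function names are illustrative assumptions; real implementations also choose the cut dimension and account for communication.

/* Sketch: 1-D recursive bisection of weighted grid cells.  The range
 * [lo, hi) is split where the cumulative weight reaches half the total,
 * and each half is assigned half of the available processors. */
#include <stdio.h>

#define NCELLS 16

static double w[NCELLS];       /* computational cost of each cell */
static int    owner[NCELLS];   /* processor assigned to each cell */

static void bisect(int lo, int hi, int proc_lo, int nprocs)
{
    if (nprocs == 1) {                       /* one processor left: take all */
        for (int i = lo; i < hi; i++) owner[i] = proc_lo;
        return;
    }
    double total = 0.0, half = 0.0;
    for (int i = lo; i < hi; i++) total += w[i];

    int cut = lo;                            /* find the half-weight cut     */
    while (cut < hi - 1 && half + w[cut] < total / 2.0) half += w[cut++];

    bisect(lo, cut, proc_lo, nprocs / 2);    /* left half, first processors  */
    bisect(cut, hi, proc_lo + nprocs / 2, nprocs - nprocs / 2);
}

int main(void)
{
    for (int i = 0; i < NCELLS; i++) w[i] = 1.0 + i % 3;   /* uneven costs */
    bisect(0, NCELLS, 0, 4);
    for (int i = 0; i < NCELLS; i++)
        printf("cell %2d (cost %.0f) -> processor %d\n", i, w[i], owner[i]);
    return 0;
}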
53 Load-Balancing
- Local Algorithms
- Recursive bisection requires global knowledge, while local algorithms do not
- Using information from neighboring processors, each processor compares its load with its neighbors' and transfers computation if the difference is above a threshold
54 Load-Balancing
- Probabilistic Methods
- Randomly assign tasks to processors
- Good when there is little communication between tasks or little locality
- Good for scalability
- Generates more communication
- Works well if there are a lot more tasks than processors
55 Load-Balancing
- Cyclic Mappings
- Used when the computational load per grid point varies and there is a good deal of locality
- Just like probabilistic mappings, except that tasks are assigned in a pattern
- Can also be performed in a block-cyclic fashion (see the sketch below)
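A minimal sketch of cyclic and block-cyclic ownership functions follows; the task count, processor count, block size, and function names are illustrative assumptions.

/* Sketch: cyclic and block-cyclic mapping of n tasks onto p processors. */
#include <stdio.h>

/* Cyclic: task i goes to processor i mod p. */
static int cyclic_owner(int i, int p)              { return i % p; }

/* Block-cyclic: tasks are grouped into blocks of size b, and the
 * blocks are dealt out to processors in round-robin order. */
static int block_cyclic_owner(int i, int p, int b)  { return (i / b) % p; }

int main(void)
{
    int n = 12, p = 3, b = 2;          /* illustrative sizes */
    for (int i = 0; i < n; i++)
        printf("task %2d -> cyclic %d, block-cyclic %d\n",
               i, cyclic_owner(i, p), block_cyclic_owner(i, p, b));
    return 0;
}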
56 Task-Scheduling Algorithms
- Used when a functional decomposition creates many tasks with weak locality
- Tasks are reformulated as data structures representing problems, which can then be dynamically assigned to processors (worker tasks)
- The allocation strategy is key
57 Manager/Worker
58 Manager/Worker
- Workers request tasks, which are provided by the manager
- Workers may send new tasks to the manager
- More efficient versions prefetch tasks to overlap computation and communication, or cache problems in the workers to reduce communication (a minimal sketch follows below)
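The basic (non-prefetching) scheme can be sketched in MPI as follows, with rank 0 acting as the manager handing out integer task indices on request. The tags, the task count, the integer payload, and the assumption of at least two processes are all illustrative.

/* Sketch: a minimal manager/worker scheme.  Rank 0 hands out task
 * indices on request and sends a stop signal when the pool is empty;
 * workers request, process, and repeat until told to stop. */
#include <mpi.h>
#include <stdio.h>

#define TAG_REQUEST 1
#define TAG_TASK    2
#define TAG_STOP    3
#define NTASKS      20

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                           /* manager */
        int next = 0, done = 0;
        while (done < size - 1) {
            int dummy;
            MPI_Status st;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                done++;
            }
        }
    } else {                                   /* worker */
        for (;;) {
            int dummy = 0, task;
            MPI_Status st;
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            printf("worker %d processing task %d\n", rank, task);
        }
    }
    MPI_Finalize();
    return 0;
}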
59 Hierarchical Manager/Worker
- Essentially a hierarchical tree of Manager/Worker subdivisions
- Can be more effective at reducing communications
60 Decentralized Schemes
- The task pool becomes a distributed data structure
- Processors each maintain a task pool requested by workers
- Access rules can be enforced to limit communications
- Scales up nicely
61 Mapping Design Checklist
- If considering an SPMD design for a complex problem, have you also considered an algorithm based on dynamic task creation and deletion?
- The latter may scale better and be easier to design, but may degrade performance
- Have you considered the reverse of the first point (starting from dynamic task creation and considering an SPMD design instead)?
62 Mapping Design Checklist
- If using a centralized load-balancing scheme, have you verified that the manager will not become a bottleneck?
- If using a dynamic load-balancing scheme, have you evaluated the relative costs of different strategies?
- Don't forget implementation costs
63 Mapping Design Checklist
- If using probabilistic or cyclic methods, do you have a large enough number of tasks to ensure reasonable load balance?
- Generally, at least ten times as many tasks as processors are needed