Parallel Programming
Transcript and Presenter's Notes

1
Parallel Programming
  • September 4

2
Designing Parallel Programs
  • There is no single way to solve a problem
  • Approach:
  • Decide on problem-related issues early
  • Leave machine-specific concerns until later

3
Stages of the Design Process
  • Partitioning
  • Communication
  • Agglomeration
  • Mapping
  • Result of the design process could be a program
    that is as simple or as complex as the problem
    requires

4
Stages of the Design Process
  • Partitioning
  • Communication
  • Agglomeration
  • Mapping
  • Result of the design process could be a program
    that is as simple or as complex as the problem
    requires

5
Partitioning
  • Decomposition of the processes and data into
    small tasks
  • Ignore machine issues
  • The goal is a fine-grained decomposition

6
Partitioning
  • The Process
  • Focus on the data first: Domain Decomposition

7
Domain Decomposition
  • Try to divide the data into small segments of
    approximately equal size
  • Make this division along logical boundaries
  • Try starting with the largest data structure or
    the one used the most
  • Similar to creating objects in OO programming
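  As a concrete illustration, a minimal Go sketch of a block-wise
  domain decomposition: rows of a grid are divided into contiguous
  segments of approximately equal size, one per task. The row count,
  task count, and the helper name decomposeRows are assumptions made
  for the example.

    package main

    import "fmt"

    // decomposeRows splits rows 0..n-1 into p contiguous blocks of nearly
    // equal size (the first n%p blocks get one extra row) and returns the
    // half-open [start, end) range owned by each task.
    func decomposeRows(n, p int) [][2]int {
        blocks := make([][2]int, p)
        base, extra := n/p, n%p
        start := 0
        for i := 0; i < p; i++ {
            size := base
            if i < extra {
                size++ // spread the remainder so block sizes differ by at most one
            }
            blocks[i] = [2]int{start, start + size}
            start += size
        }
        return blocks
    }

    func main() {
        // 10 rows of data divided among 4 tasks: block sizes 3, 3, 2, 2.
        for task, b := range decomposeRows(10, 4) {
            fmt.Printf("task %d owns rows [%d, %d)\n", task, b[0], b[1])
        }
    }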

8
Example
  • Class scheduling system discussion

9
Domain Decomposition
  • Next, partition the processing required, guided
    by the data decomposition
  • Associate the separate process tasks with the
    data that they use
  • If a process requires data from several tasks,
    communication will be necessary between those
    tasks

10
Partitioning
  • The Process
  • Focus on the data first: Domain Decomposition
  • Focus on the computations first: Functional
    Decomposition

11
Functional Decomposition
  • For computation-intensive programs, concentrating
    on the processes first can be more effective
  • Functional decomposition can also simplify the
    overall program structure
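  A minimal Go sketch of the idea, assuming a toy pipeline (the
  stages, producing integers and squaring them, are invented for the
  example): the computation is split along functional lines, and each
  function becomes a task connected to the next by a channel.

    package main

    import "fmt"

    func main() {
        nums := make(chan int)
        squares := make(chan int)

        // Stage 1: a task that produces the input values.
        go func() {
            for i := 1; i <= 5; i++ {
                nums <- i
            }
            close(nums)
        }()

        // Stage 2: a separate functional task that squares each value.
        go func() {
            for n := range nums {
                squares <- n * n
            }
            close(squares)
        }()

        // Stage 3: the final task consumes the results.
        for s := range squares {
            fmt.Println(s)
        }
    }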

12
Partitioning Checklist
  • Does your partition define at least an order of
    magnitude more tasks than there are processors in
    your target computer?
  • Provides flexibility in future design stages
  • Does your partition avoid redundant computation
    and storage requirements?
  • Provides scalability

13
Partitioning Checklist
  • Are tasks of comparable size?
  • Affects processor allocation in the Mapping stage
  • Does the number of tasks scale with problem size?
  • Also affects scalability: for a larger problem,
    the number of tasks, rather than the size of each
    task, should increase

14
Partitioning Checklist
  • Have you identified several alternative
    partitions?
  • A later design stage may require a different
    decomposition. Alternatives are easiest to
    identify during this phase and will help you
    adapt if that happens.

15
Stages of the Design Process
  • Partitioning
  • Communication
  • Agglomeration
  • Mapping
  • Result of the design process could be a program
    that is as simple or as complex as the problem
    requires

16
Communication
  • When one task needs data associated with another
    task
  • Conceptualized as a channel between tasks
  • Two phases
  • Define the channel structure
  • Define the messages sent along the channels

17
Communication
  • Difference in Domain vs. Functional decomposition
  • Domain (data-centered) decomposition often
    requires complex communication structures
  • Functional (process-centered) decomposition
    usually has straightforward communication

18
Categorizing Communication Patterns
  • Local/Global
  • Structured/Unstructured
  • Static/Dynamic
  • Synchronous/Asynchronous

19
Local Communication
  • When an operation requires data from only a small
    number of other tasks

20
Task Algorithm: Jacobi finite difference (five-point stencil)
  for t = 0 to T-1
      send X_{i,j}(t) to each neighbor
      receive X_{i-1,j}(t), X_{i+1,j}(t), X_{i,j-1}(t), X_{i,j+1}(t) from neighbors
      compute X_{i,j}(t+1) using Equation 2.1
  endfor
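  For concreteness, a Go sketch of the same local-communication
  pattern, reduced to one dimension to keep it short (an assumption;
  Equation 2.1 is the two-dimensional five-point update). Each task
  owns one point, exchanges values with its two neighbors over
  channels each step, and relaxes toward their average. Grid size,
  step count, and boundary values are invented for the example.

    package main

    import (
        "fmt"
        "sync"
    )

    const (
        nPoints = 8  // interior grid points, one task per point
        nSteps  = 50 // relaxation iterations
    )

    func main() {
        // toRight[i] carries values from task i to task i+1;
        // toLeft[i] carries values from task i to task i-1.
        // Capacity 1 lets every task send before it receives.
        toRight := make([]chan float64, nPoints)
        toLeft := make([]chan float64, nPoints)
        for i := range toRight {
            toRight[i] = make(chan float64, 1)
            toLeft[i] = make(chan float64, 1)
        }

        result := make([]float64, nPoints)
        var wg sync.WaitGroup

        for i := 0; i < nPoints; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                x := 0.0
                for t := 0; t < nSteps; t++ {
                    // Send the current value to each neighbor that exists.
                    if i > 0 {
                        toLeft[i] <- x
                    }
                    if i < nPoints-1 {
                        toRight[i] <- x
                    }
                    // Receive the neighbors' values; fixed boundary values
                    // 0.0 and 1.0 stand in at the two ends.
                    lo, hi := 0.0, 1.0
                    if i > 0 {
                        lo = <-toRight[i-1]
                    }
                    if i < nPoints-1 {
                        hi = <-toLeft[i+1]
                    }
                    // Relax toward the neighbor average (1-D analogue of
                    // the five-point update in Equation 2.1).
                    x = (lo + hi) / 2
                }
                result[i] = x
            }(i)
        }
        wg.Wait()
        fmt.Println(result) // approaches a linear ramp between the boundaries
    }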
21
Gauss-Seidel version
  • Reformulation of Jacobi that converges in fewer
    iterations by using updated values as soon as
    they are available

22
Gauss-Seidel version
23
Global Communication
  • Many tasks need data from many other tasks
  • Example: highly connected neural networks

24
Problems with local communication
  • Centralized: communication funnels through a
    single task
  • Sequential: only one operation can proceed at a
    time

25
Distributing Communication And Computation
  • One alternative for the summation problem
  • Doesn't solve the problem: the summation still
    proceeds sequentially

26
Distributing Communication And Computation
  • Divide and Conquer
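  A minimal Go sketch of the divide-and-conquer summation (the
  function name parSum and the sample data are assumptions): each half
  of the data is summed concurrently and the partial sums combined, so
  the depth of the computation is logarithmic rather than linear in
  the number of values.

    package main

    import "fmt"

    // parSum splits the slice in half, sums each half in a separate task,
    // and combines the two partial sums.
    func parSum(xs []float64) float64 {
        if len(xs) <= 2 { // small enough: sum directly
            s := 0.0
            for _, x := range xs {
                s += x
            }
            return s
        }
        mid := len(xs) / 2
        leftCh := make(chan float64, 1)
        go func() { leftCh <- parSum(xs[:mid]) }() // conquer the left half concurrently
        rightSum := parSum(xs[mid:])               // and the right half in this task
        return <-leftCh + rightSum
    }

    func main() {
        data := []float64{3, 1, 4, 1, 5, 9, 2, 6}
        fmt.Println(parSum(data)) // 31
    }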

27
Divide and Conquer
28
Unstructured and Dynamic Communication
  • Irregular communication structure
  • Structure that changes as the computation proceeds

29
Asynchronous Communication
  • Tasks have no innate knowledge of when their data
    will be needed by other tasks
  • Consumers must request data from producers

30
Possible Mechanisms
  • The data structure is fully distributed; each task
    both requests the data it needs and polls for
    incoming data requests (sketched below)
  • A separate set of tasks is used to request data
    and respond to data requests
  • Use shared memory, taking care to avoid contention
    and deadlock
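  A Go sketch of the first mechanism (the request type and the data it
  serves are invented for the example): a task that owns part of the
  distributed data structure interleaves its own computation with
  polling for incoming data requests and answering them.

    package main

    import "fmt"

    // A request asks the owning task for the value stored under key and
    // carries a channel on which the answer should be sent back.
    type request struct {
        key   string
        reply chan int
    }

    func main() {
        requests := make(chan request)
        stop := make(chan struct{})
        finished := make(chan struct{})

        // A task that owns part of the distributed data structure: it
        // alternates between its own work and polling for data requests.
        go func() {
            defer close(finished)
            local := map[string]int{"a": 1, "b": 2} // this task's share of the data
            for {
                select {
                case req := <-requests: // a consumer asked for data: answer it
                    req.reply <- local[req.key]
                case <-stop: // no more work to do
                    return
                default:
                    // nothing pending: perform a unit of local computation
                }
            }
        }()

        // A consumer task requests a value it does not own and waits for it.
        reply := make(chan int, 1)
        requests <- request{key: "b", reply: reply}
        fmt.Println("received:", <-reply)
        close(stop)
        <-finished
    }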

31
Communications Design Checklist
  • Do all tasks perform about the same number of
    communication operations?
  • If not, the design is unbalanced and likely
    non-scalable
  • Does each task communicate only with a small
    number of neighbors?
  • Determines whether local or global communication
    structures are appropriate

32
Communications Design Checklist
  • Are communication operations able to proceed
    concurrently?
  • If not, the algorithm is likely to be inefficient
    and non-scalable
  • Is the computation associated with different
    tasks able to proceed concurrently?
  • Again, inefficient and non-scalable

33
What do we have so far?
  • The algorithm is still in an abstract form,
    suitable for either a parallel or a distributed
    implementation

34
Stages of the Design Process
  • Partitioning
  • Communication
  • Agglomeration
  • Mapping
  • Result of the design process could be a program
    that is as simple or as complex as the problem
    requires

35
Agglomeration
  • Begin to consider the implementation and, to some
    extent, the particular system it will run on
  • Try to reduce the number of tasks to a number the
    system can manage
  • Decide whether data replication is appropriate to
    reduce communication

36
(No Transcript)
37
Conflicting Goals
  • Reduce communication costs
  • Retain flexibility
  • Reduce software engineering costs

38
Reduce communication costs
  • Send less data
  • Send fewer messages, even if the total amount of
    data is the same
  • Exploit surface-to-volume effects (worked example
    below)
  • Replicating computation
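  To make the surface-to-volume effect concrete: if agglomeration
  gives each task an n-by-n block of a two-dimensional grid, the
  computation per step grows with the block's area (proportional to
  n squared), while the data exchanged with neighbors grows only with
  its perimeter (proportional to 4n). The communication-to-computation
  ratio is therefore roughly 4/n, so larger blocks mean relatively
  less communication.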

39
Replicating Computation
  • Useful when tasks would otherwise wait for other
    tasks to compute and communicate a value;
    recomputing it locally can be cheaper

40
Avoiding Communication
  • A tree
  • A butterfly
  • Alternative butterfly

41
Reducing Software Engineering Costs
  • Keep in mind general maintenance issues
  • When parallelizing existing sequential
    algorithms, keeping existing code when possible
    may be preferable to gaining a few extra
    milliseconds in execution

42
Agglomeration Design Checklist
  • Has agglomeration reduced communication costs by
    increasing locality?
  • If not, you missed the point
  • If agglomeration has replicated computation, have
    you verified that the benefits of this
    replication outweigh its costs, for a range of
    problem sizes and processor counts?
  • If not, you may be losing scalability

43
Agglomeration Design Checklist
  • If agglomeration replicates data, have you
    verified that this does not compromise the
    scalability of your algorithm by restricting the
    range of problem sizes or processor counts that
    it can address?
  • Has agglomeration yielded tasks with similar
    computation and communication costs?

44
Agglomeration Design Checklist
  • Does the number of tasks still scale with problem
    size?
  • If agglomeration eliminated opportunities for
    concurrent execution, have you verified that
    there is sufficient concurrency for current and
    future target computers?

45
Agglomeration Design Checklist
  • Can the number of tasks be reduced still further,
    without introducing load imbalances, increasing
    software engineering costs, or reducing
    scalability?
  • If you are parallelizing an existing sequential
    program, have you considered the cost of the
    modifications required to the sequential code?

46
Stages of the Design Process
  • Partitioning
  • Communication
  • Agglomeration
  • Mapping
  • Result of the design process could be a program
    that is as simple or as complex as the problem
    requires

47
Mapping
  • Now we need to specify exactly how our algorithm
    will run on the given hardware
  • The ultimate goal is to minimize execution time
  • Finding an optimal mapping in the general case is
    NP-complete

48
Strategies
  • Tasks that can execute concurrently are assigned
    to different processors
  • Tasks that communicate frequently are assigned to
    the same processor

49
Example
50
Static Task/Channels
  • Simply minimize interprocessor communication
  • Can also agglomerate tasks associated with the
    same processor (creating one task per processor)

51
Complex Tasks/Channels
  • Load-balancing techniques
  • Tend to use heuristics
  • Cost of additional computations must be considered

52
Load-Balancing
  • Recursive Bisection (sketched below)
  • Divides the domain into subdomains of
    approximately equal computational cost while
    minimizing communication
  • The domain is cut along one dimension to create
    two subdomains, which are then recursively divided
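  A minimal Go sketch of recursive coordinate bisection (the point
  type, weights, and alternating cut dimensions are assumptions for
  the example): at each level the points are split so the two halves
  carry roughly equal total computational cost, then each half is
  divided again along the next dimension.

    package main

    import (
        "fmt"
        "sort"
    )

    // A point with an associated computational cost (weight).
    type point struct {
        x, y   float64
        weight float64
    }

    // bisect recursively splits pts into `parts` subdomains of roughly
    // equal total weight, cutting along x at even depths and y at odd
    // depths.
    func bisect(pts []point, parts, depth int) [][]point {
        if parts <= 1 || len(pts) <= 1 {
            return [][]point{pts}
        }
        // Sort along the cut dimension for this level.
        if depth%2 == 0 {
            sort.Slice(pts, func(i, j int) bool { return pts[i].x < pts[j].x })
        } else {
            sort.Slice(pts, func(i, j int) bool { return pts[i].y < pts[j].y })
        }
        // Cut where the cumulative weight reaches half of the total.
        total := 0.0
        for _, p := range pts {
            total += p.weight
        }
        cut, acc := 0, 0.0
        for i, p := range pts {
            acc += p.weight
            cut = i + 1
            if acc >= total/2 {
                break
            }
        }
        left := bisect(pts[:cut], parts/2, depth+1)
        right := bisect(pts[cut:], parts-parts/2, depth+1)
        return append(left, right...)
    }

    func main() {
        pts := []point{
            {0, 0, 1}, {1, 0, 1}, {2, 0, 4}, {3, 0, 1},
            {0, 1, 1}, {1, 1, 1}, {2, 1, 2}, {3, 1, 1},
        }
        for i, sub := range bisect(pts, 4, 0) {
            fmt.Printf("subdomain %d: %d points\n", i, len(sub))
        }
    }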

53
Load-Balancing
  • Local Algorithms
  • Recursive bisection requires global knowledge;
    local algorithms do not
  • Using information from neighboring processors,
    each processor compares its load with that of its
    neighbors and transfers computation if the
    difference exceeds a threshold

54
Load-Balancing
  • Probabilistic Methods
  • Randomly assign tasks to processors
  • Good when there is little communication between
    tasks or little locality
  • Good for scalability
  • Generates more communication
  • Works well if there are a lot more tasks than
    processors

55
Load-Balancing
  • Cyclic Mappings
  • Used when the computational load per grid point
    varies and there is a good deal of locality
  • Like probabilistic mapping, except that tasks are
    assigned to processors in a regular pattern
    (sketched below, with the block-cyclic variant)
  • Can also be performed in a block-cyclic fashion
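  A Go sketch contrasting the mapping rules from this slide and the
  previous one (the function names are invented): probabilistic
  mapping assigns each task to a random processor, cyclic mapping
  deals tasks out round-robin, and block-cyclic mapping deals out
  contiguous blocks of tasks.

    package main

    import (
        "fmt"
        "math/rand"
    )

    // probabilisticMap assigns each task to a processor uniformly at random.
    func probabilisticMap(nTasks, nProcs int) []int {
        m := make([]int, nTasks)
        for t := range m {
            m[t] = rand.Intn(nProcs)
        }
        return m
    }

    // cyclicMap deals tasks out round-robin: task t goes to processor t mod P.
    func cyclicMap(nTasks, nProcs int) []int {
        m := make([]int, nTasks)
        for t := range m {
            m[t] = t % nProcs
        }
        return m
    }

    // blockCyclicMap deals out contiguous blocks of `block` tasks in a cycle,
    // preserving some locality within each block.
    func blockCyclicMap(nTasks, nProcs, block int) []int {
        m := make([]int, nTasks)
        for t := range m {
            m[t] = (t / block) % nProcs
        }
        return m
    }

    func main() {
        fmt.Println("probabilistic:", probabilisticMap(12, 3))
        fmt.Println("cyclic:       ", cyclicMap(12, 3))
        fmt.Println("block-cyclic: ", blockCyclicMap(12, 3, 2))
    }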

56
Task-Scheduling Algorithms
  • Used when a functional decomposition creates many
    tasks with weak locality
  • Tasks are reformulated as data structures
    representing problems that can then be
    dynamically assigned to processors (worker tasks)
  • Allocation strategy is key

57
Manager/Worker
58
Manager/Worker
  • Workers request tasks, which the manager provides
    (sketched below)
  • Workers may also send new tasks to the manager
  • More efficient versions prefetch tasks to overlap
    computation and communication, or cache problems
    in the workers to reduce communication
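  A minimal Go sketch of the basic manager/worker structure (the
  worker count, the task type, and squaring as the stand-in problem
  are invented): idle workers pull tasks from the manager's queue,
  which is what balances the load; prefetching and caching would be
  layered on top of this.

    package main

    import (
        "fmt"
        "sync"
    )

    // A task is a unit of work; here, just a number whose square we want.
    type task struct{ n int }

    func main() {
        const nWorkers = 3
        tasks := make(chan task)
        results := make(chan int)

        // Manager: hands out tasks to whichever worker asks next (the
        // channel itself plays the role of the manager's task queue).
        go func() {
            for i := 1; i <= 10; i++ {
                tasks <- task{n: i}
            }
            close(tasks) // no more work
        }()

        // Workers: repeatedly take a task, solve it, return the result.
        var wg sync.WaitGroup
        for w := 0; w < nWorkers; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for t := range tasks {
                    results <- t.n * t.n
                }
            }()
        }

        // Close the results channel once all workers have finished.
        go func() {
            wg.Wait()
            close(results)
        }()

        sum := 0
        for r := range results {
            sum += r
        }
        fmt.Println("sum of results:", sum) // 385
    }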

59
Hierarchical Manager/Worker
  • Essentially a hierarchical tree of Manager/Worker
    subdivisions
  • Can be more effective at reducing communications

60
Decentralized Schemes
  • The task pool becomes a distributed data structure
  • Each processor maintains its own task pool, from
    which workers request tasks
  • Access rules can be enforced to limit
    communications
  • Scales up nicely

61
Mapping Design Checklist
  • If considering an SPMD design for a complex
    problem, have you also considered an algorithm
    based on dynamic task creation and deletion?
  • The latter may scale better and be easier to
    design, but may degrade performance
  • Conversely, if considering a design based on
    dynamic task creation and deletion, have you also
    considered an SPMD algorithm?

62
Mapping Design Checklist
  • If using a centralized load-balancing scheme,
    have you verified that the manager will not
    become a bottleneck?
  • If using a dynamic load-balancing scheme, have
    you evaluated the relative costs of different
    strategies?
  • Don't forget implementation costs

63
Mapping Design Checklist
  • If using probabilistic or cyclic methods, do you
    have a large enough number of tasks to ensure
    reasonable load balance?
  • Generally, at least ten times as many tasks as
    processors are needed