Parallelization Strategies and Load Balancing

Slides: 33
Provided by: Universi473
Transcript and Presenter's Notes


1
Parallelization Strategies and Load Balancing
  • Some material borrowed from lectures of J.
    Demmel, UC Berkeley

2
Ideas for dividing work
  • Embarrassingly parallel computations
  • ideal case: after perhaps some initial
    communication, all processes operate
    independently until the end of the job
  • examples: computing pi; general Monte Carlo
    calculations; simple geometric transformation of
    an image
  • static or dynamic (worker pool) task assignment
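
A minimal sketch of the "computing pi" example, assuming a Monte Carlo estimate with static task assignment. The function names, seeds, and sample counts are illustrative and not from the slides. Threads keep the sketch short; for CPU-bound work in CPython a process pool would be the more usual choice.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def count_hits(seed: int, samples: int) -> int:
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # private generator: no state shared between tasks
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(num_tasks: int = 8, samples_per_task: int = 20_000) -> float:
    # Every task runs independently; the only communication is the final sum.
    with ThreadPoolExecutor(max_workers=4) as pool:
        hits = pool.map(count_hits, range(num_tasks), [samples_per_task] * num_tasks)
        total_hits = sum(hits)
    return 4.0 * total_hits / (num_tasks * samples_per_task)
```

Because each task owns its seed, the estimate is reproducible regardless of how the tasks are assigned to workers.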

3
Ideas for dividing work
  • Partitioning
  • partition the data, or the domain, or the task
    list, perhaps master/slave
  • examples: dot product of vectors; integration on
    a fixed interval; N-body problem using domain
    decomposition
  • static or dynamic task assignment; need for care
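
The dot product example can be sketched as a static data partition followed by a reduction. This is an illustrative sequential sketch: the helper names are hypothetical, and each partial sum stands in for the work one process would do.

```python
def block_ranges(n: int, parts: int) -> list[range]:
    """Split indices 0..n-1 into contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)  # first `extra` blocks get one more index
        ranges.append(range(start, start + size))
        start += size
    return ranges

def partitioned_dot(x, y, parts: int = 4):
    # Each block's partial sum could be computed by a different process;
    # the partial sums are then combined in a final reduction step.
    partials = [sum(x[i] * y[i] for i in r) for r in block_ranges(len(x), parts)]
    return sum(partials)
```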

4
Ideas for dividing work
  • Divide and Conquer
  • recursively partition the data, or the domain, or
    the task list
  • examples: tree algorithm for N-body problem;
    multipole; multigrid
  • usually dynamic work assignments

5
Ideas for dividing work
  • Pipelining
  • a sequence of tasks, each performed by one of a
    host of processors; functional decomposition
  • examples: upper triangular linear solves;
    pipeline sorts
  • usually dynamic work assignments

6
Ideas for dividing work
  • Synchronous Computing
  • same computation on different sets of data; often
    domain decomposition
  • examples: iterative linear system solves
  • often can schedule static work assignments, if
    data structures don't change

7
Load balancing
  • Determined by
  • Task costs
  • Task dependencies
  • Locality needs
  • Spectrum of solutions
  • Static - all information available before
    starting
  • Semi-Static - some info before starting
  • Dynamic - little or no info before starting
  • Survey of solutions
  • How each one works
  • Theoretical bounds, if any
  • When to use it

8
Load Balancing in General
  • Large literature
  • A closely related problem is scheduling, which is
    to determine the order in which tasks run

9
Load Balancing Problems
  • Task costs
  • Do all tasks have equal costs?
  • Task dependencies
  • Can all tasks be run in any order (including
    parallel)?
  • Task locality
  • Is it important for some tasks to be scheduled on
    the same processor (or nearby) to reduce
    communication cost?

10
Task Cost Spectrum

11
Task Dependency Spectrum

12
Task Locality Spectrum

13
Approaches
  • Static load balancing
  • Semi-static load balancing
  • Self-scheduling
  • Distributed task queues
  • Diffusion-based load balancing
  • DAG scheduling
  • Mixed Parallelism

14
Static Load Balancing
  • All information is available in advance
  • Common cases
  • dense matrix algorithms, e.g. LU factorization
  • done using blocked/cyclic layout
  • blocked for locality, cyclic for load balancing
  • usually a regular mesh, e.g., FFT
  • done using cyclic + transpose + blocked layout
    for 1D
  • sparse-matrix-vector multiplication
  • use graph partitioning, where graph does not
    change over time
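
To illustrate "blocked for locality, cyclic for load balancing", the sketch below compares the per-processor load of the two layouts. The cost model (row i costs i + 1 units, a triangular workload loosely resembling factorization) is an assumption made for illustration, not something the slides specify.

```python
def block_owner(i: int, n: int, p: int) -> int:
    """Blocked layout: contiguous rows go to the same processor."""
    block = -(-n // p)  # ceil(n / p)
    return i // block

def cyclic_owner(i: int, n: int, p: int) -> int:
    """Cyclic layout: rows are dealt out round-robin."""
    return i % p

def loads(owner, n: int, p: int) -> list[int]:
    # Hypothetical cost model: row i costs i + 1 units of work.
    work = [0] * p
    for i in range(n):
        work[owner(i, n, p)] += i + 1
    return work

print(loads(block_owner, 8, 4))   # [3, 7, 11, 15]
print(loads(cyclic_owner, 8, 4))  # [6, 8, 10, 12]
```

Under this cost model the most heavily loaded processor carries 15 units in the blocked layout and 12 in the cyclic one, at the price of scattering neighboring rows across processors.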

15
Semi-Static Load Balance
  • Domain changes slowly; locality is important
  • use static algorithm
  • do some computation, allowing some load imbalance
    on later steps
  • recompute a new load balance using static
    algorithm
  • Particle simulations, particle-in-cell (PIC)
    methods
  • tree-structured computations (Barnes Hut, etc.)
  • grid computations with dynamically changing grid,
    which changes slowly

16
Self-Scheduling
  • Self scheduling
  • Centralized pool of tasks that are available to
    run
  • When a processor completes its current task, look
    at the pool
  • If the computation of one task generates more,
    add them to the pool
  • Originally used for
  • Scheduling loops by compiler (really the
    runtime-system)
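
A minimal shared-memory sketch of the centralized pool described above, assuming Python threads and a thread-safe queue. The `process` callback and its `(result, new_tasks)` return convention are hypothetical choices for this example.

```python
import queue
import threading

def run_pool(initial_tasks, process, num_workers: int = 4) -> list:
    """Run tasks from one shared pool; process(task) returns (result, new_tasks)."""
    pool = queue.Queue()
    for t in initial_tasks:
        pool.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            task = pool.get()        # an idle worker looks at the pool
            result, new_tasks = process(task)
            for t in new_tasks:      # generated tasks are added to the pool
                pool.put(t)
            with lock:
                results.append(result)
            pool.task_done()         # called only after the children are queued

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    pool.join()                      # returns once every queued task is finished
    return results

def split_or_sum(task):
    """Example task: split a large index range in two, or sum a small one."""
    lo, hi = task
    if hi - lo > 4:
        mid = (lo + hi) // 2
        return 0, [(lo, mid), (mid, hi)]
    return sum(range(lo, hi)), []
```

For example, `sum(run_pool([(0, 100)], split_or_sum))` adds up the integers 0 through 99. The sketch omits error handling: if `process` raises, that worker stops and `pool.join()` would block.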

17
When is Self-Scheduling a Good Idea?
  • A set of tasks without dependencies
  • can also be used with dependencies, but most
    analysis has only been done for task sets without
    dependencies
  • Cost of each task is unknown
  • Locality is not important
  • Using a shared memory multiprocessor, so a
    centralized pool of tasks is fine

18
Variations on Self-Scheduling
  • Don't grab the smallest unit of parallel work.
  • Instead, grab a chunk of tasks of size K.
  • If K is large, access overhead for the task queue
    is small
  • If K is small, load balance is likely to be good
  • Four variations
  • Use a fixed chunk size
  • Guided self-scheduling
  • Tapering
  • Weighted Factoring
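
The trade-off in the choice of K can be seen in a small simulation. This is a sketch under simplifying assumptions: task costs are known to the simulator, queue accesses take zero time and are only counted, and the next processor to go idle always takes the next chunk.

```python
import heapq

def simulate_chunked(costs, num_procs: int, k: int):
    """Simulate self-scheduling with fixed chunk size k; returns (makespan, queue_accesses)."""
    free_at = [0.0] * num_procs      # time at which each processor becomes idle
    heapq.heapify(free_at)
    accesses = 0
    for start in range(0, len(costs), k):
        t = heapq.heappop(free_at)   # the next processor to go idle grabs a chunk
        heapq.heappush(free_at, t + sum(costs[start:start + k]))
        accesses += 1
    return max(free_at), accesses

costs = [9, 1, 1, 1, 1, 1, 1, 1]
print(simulate_chunked(costs, 2, 1))  # (9.0, 8)
print(simulate_chunked(costs, 2, 4))  # (12.0, 2)
```

With K = 1 the finish time is limited only by the single expensive task but the queue is accessed eight times; with K = 4 there are two accesses and a worse finish time.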

19
Variation 1: Fixed Chunk Size
  • How to compute optimal chunk size
  • Requires a lot of information about the problem
    characteristics, e.g., task costs, number of tasks
  • Needs an off-line algorithm; not useful in practice.
  • All tasks must be known in advance

20
Variation 2: Guided Self-Scheduling
  • Use larger chunks at the beginning to avoid
    excessive overhead and smaller chunks near the
    end to even out the finish times.
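
The slide does not state a chunk-size rule. A commonly cited formulation of guided self-scheduling gives each request ceil(R / p) tasks, where R is the number of tasks remaining and p the number of processors; the sketch below assumes that rule.

```python
import math

def guided_chunks(num_tasks: int, num_procs: int) -> list[int]:
    """Chunk sizes in the order they are handed out: ceil(remaining / p) per request."""
    chunks, remaining = [], num_tasks
    while remaining > 0:
        k = math.ceil(remaining / num_procs)
        chunks.append(k)
        remaining -= k
    return chunks

print(guided_chunks(100, 4))
# [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
```

The early chunks are large, which keeps queue overhead low, and the final chunks shrink to single tasks, which evens out the finish times.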

21
Variation 3: Tapering
  • Chunk size K_i is a function of not only the
    remaining work, but also the task cost variance
  • variance is estimated using history information
  • high variance => small chunk size should be used
  • low variance => larger chunks OK

22
Variation 4: Weighted Factoring
  • Similar to self-scheduling, but divide task cost
    by computational power of requesting node
  • Useful for heterogeneous systems
  • Also useful for shared-resource systems, e.g.,
    NOWs (networks of workstations)
  • as with Tapering, historical information is used
    to predict future speed
  • speed may depend on the other loads currently
    on a given processor
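
One way to read the idea above is that a faster node should receive a proportionally larger chunk. The sketch below assumes per-node speed estimates are available (in practice they would come from history) and assumes that about half of the remaining work is handed out per round of requests; both the `speeds` values and the 0.5 factor are illustrative assumptions, not taken from the slides.

```python
import math

def weighted_chunk(remaining: int, speeds: list[float], node: int) -> int:
    """Chunk size for `node`, proportional to its share of total estimated speed.

    Expects remaining >= 1.
    """
    share = speeds[node] / sum(speeds)
    # Assumption: distribute roughly half of the remaining work per round.
    k = math.ceil(0.5 * remaining * share)
    return max(1, min(k, remaining))

speeds = [1.0, 1.0, 2.0]              # hypothetical relative speeds
print(weighted_chunk(80, speeds, 2))  # 20
print(weighted_chunk(80, speeds, 0))  # 10
```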

23
Distributed Task Queues
  • The obvious extension of self-scheduling to
    distributed memory
  • Good when locality is not very important
  • Distributed memory multiprocessors
  • Shared memory with significant synchronization
    overhead
  • Tasks that are known in advance
  • The costs of tasks are not known in advance

24
DAG Scheduling
  • Directed acyclic graph (DAG) of tasks
  • nodes represent computation (weighted)
  • edges represent orderings and usually
    communication (may also be weighted)
  • it is not common to know the DAG in advance

25
DAG Scheduling
  • Two application domains where DAGs are known
  • Digital Signal Processing computations
  • Sparse direct solvers (mainly Cholesky, since it
    doesn't require pivoting).
  • Basic strategy: partition the DAG to minimize
    communication and keep all processors busy
  • NP-complete, so need approximations
  • Different from graph partitioning, which was for
    tasks with communication but no dependencies
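
Since the exact problem is NP-complete, heuristics are used. The sketch below shows one simple heuristic, greedy list scheduling in topological order, and is not presented as the method the slides have in mind. It assumes node weights only and ignores communication costs on edges; the task names and costs in the example are hypothetical. It uses `graphlib`, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

def list_schedule(cost: dict, deps: dict, num_procs: int):
    """Greedy list scheduling; returns ({task: (proc, start, finish)}, makespan).

    `deps` maps a task to the tasks that must finish before it can start.
    """
    proc_free = [0.0] * num_procs   # time at which each processor becomes idle
    placed = {}
    for task in TopologicalSorter(deps).static_order():
        # A task is ready once all of its predecessors have finished.
        ready = max((placed[d][2] for d in deps.get(task, ())), default=0.0)
        # Choose the processor on which the task can start earliest.
        proc = min(range(num_procs), key=lambda p: max(proc_free[p], ready))
        start = max(proc_free[proc], ready)
        placed[task] = (proc, start, start + cost[task])
        proc_free[proc] = start + cost[task]
    return placed, max(proc_free)

cost = {"a": 2, "b": 3, "c": 1, "d": 2}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
schedule, makespan = list_schedule(cost, deps, num_procs=2)
print(makespan)  # 7.0, the length of the critical path a -> b -> d
```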

26
Mixed Parallelism
  • Another variation - a problem with 2 levels of
    parallelism
  • coarse-grained task parallelism
  • good when many tasks, bad if few
  • fine-grained data parallelism
  • good when much parallelism within a task, bad if
    little

27
Mixed Parallelism
  • Adaptive mesh refinement
  • Discrete event simulation, e.g., circuit
    simulation
  • Database query processing
  • Sparse matrix direct solvers

28
Mixed Parallelism Strategies

29
Which Strategy to Use

30
Switch Parallelism: A Special Case

31
A Simple Performance Model for Data Parallelism

32
Values of Sigma - problem size