Principles of Parallel Algorithm Design - PowerPoint PPT Presentation

Slides: 76
Provided by: cevdeta

1
Principles of Parallel Algorithm Design
  • Prof. Dr. Cevdet Aykanat
  • Bilkent University
  • Department of Computer Engineering

2

Principles of Parallel Algorithm Design
  • Identifying concurrent tasks
  • Mapping tasks onto multiple processes
  • Distributing input, output, intermediate data
  • Managing access to shared data
  • Synchronizing processors

3

Principles of Parallel Algorithm Design
  • Identifying concurrent tasks
  • Mapping tasks onto multiple processes
  • Distributing input, output, intermediate data
  • Managing access to shared data
  • Synchronizing processors
  • several choices for each step
  • relatively few combinations lead to a good
    parallel algorithm
  • different choices yield the best performance on
  • different parallel architectures
  • different parallel programming paradigms

4
Decomposition, Tasks
  • decomposition
  • dividing a computation into smaller parts
  • some or all parts can be executed concurrently
  • atomic task
  • user defined
  • indivisible units of computation
  • same size or different size

5
(No Transcript)
6
Task Dependence Graphs (TDG)
  • directed acyclic graph (DAG)
  • nodes: atomic tasks
  • directed edges: dependencies
  • some tasks use data produced by other tasks
  • TDG can be weighted
  • node weight: amount of computation
  • edge weight: amount of data
  • multiple ways of expressing certain computations
  • different ways of arranging computations
    lead to different TDGs

7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
Granularity, Concurrency
  • granularity: number (#) and size of tasks
  • fine grain: large # of small tasks
  • coarse grain: small # of large tasks
  • degree of concurrency (DoC)
  • # of tasks that can be executed simultaneously
  • max DoC: maximum DoC at any given time
  • tree TDGs: max DoC = # of leaves (usually)
  • avg DoC: average DoC over the entire duration

11
(No Transcript)
12
Degree of Concurrency
  • depends on granularity
  • finer task granularity → larger DoC
  • there is a bound on how fine the granularity of
    a decomposition can be
  • depends on shape of TDG
  • shallow and wide TDG → larger DoC
  • deep and thin TDG → smaller DoC
  • critical path
  • longest directed path between a start node and a
    finish node
  • critical path length: sum of node weights along
    the path
  • avg DoC = total work / critical path length
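
The last two bullets can be checked on a small example. The sketch below is an illustration only, not part of the original slides: the TDG, its task names and its node weights are hypothetical, and edge weights are ignored so that only computation counts toward the critical path.

```python
from functools import lru_cache


def critical_path_length(weights, edges):
    """Weight of the heaviest directed path in a task dependence graph (a DAG)."""
    successors = {task: [] for task in weights}
    for src, dst in edges:
        successors[src].append(dst)

    @lru_cache(maxsize=None)
    def longest_from(task):
        # Work of this task plus the heaviest chain of tasks that depend on it.
        tail = max((longest_from(s) for s in successors[task]), default=0)
        return weights[task] + tail

    return max(longest_from(task) for task in weights)


def average_doc(weights, edges):
    """avg DoC = total work / critical path length."""
    return sum(weights.values()) / critical_path_length(weights, edges)


# Hypothetical TDG: four leaf tasks feed two combine tasks, which feed one root.
example_weights = {"a": 10, "b": 10, "c": 10, "d": 10, "e": 6, "f": 6, "g": 8}
example_edges = [("a", "e"), ("b", "e"), ("c", "f"), ("d", "f"),
                 ("e", "g"), ("f", "g")]
```

For this graph the total work is 60 and the critical path (a leaf, a combine task, the root) has length 24, so the average DoC is 2.5 even though four tasks can run at once at the start.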

13
(No Transcript)
14
Task Interaction Graph (TIG)
  • tasks share input, output or intermediate data
  • interactions among independent tasks of a TDG
  • TIG: pattern of interactions among tasks
  • node: task
  • edge: connects tasks that interact with each
    other
  • TIG can be weighted
  • node weight: amount of computation
  • edge weight: amount of interaction

15
(No Transcript)
16
Processes and Mapping
  • process vs processor
  • processes: logical computing agents that perform
    tasks
  • mapping: assigning tasks to processes
  • conflicting goals in a good mapping
  • maximize concurrency
  • map independent tasks to different processes
  • minimize idle time / interaction overhead
  • map tasks along critical path to same process
  • map tasks with high interaction to same process
  • e.g., mapping all tasks to the same process
    removes all interaction but also all concurrency

17
(No Transcript)
18
Decomposition Techniques
  • recursive decomposition
  • data decomposition
  • exploratory decomposition
  • speculative decomposition

19
Recursive Decomposition
  • divide-and-conquer (DAC) strategy → natural
    concurrency
  • divide: split problem into a set of independent
    subproblems
  • conquer: recursively solve each subproblem
  • combine: merge solutions of subproblems into a
    solution of the problem
  • if sequential algorithm is not based on DAC
  • restructure computation as a DAC algorithm
  • recursive decomposition to extract concurrency
  • e.g., finding minimum of an array A of n
    numbers
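
The minimum-finding example might be restructured as follows. This is a minimal sketch that runs serially; it only exposes the DAC structure, and the function name and index arguments are assumptions made for illustration. The two recursive calls touch disjoint halves of the array, so they are the tasks a parallel runtime could execute concurrently.

```python
def recursive_min(a, lo=0, hi=None):
    """Minimum of the non-empty slice a[lo:hi], written divide-and-conquer style."""
    if hi is None:
        hi = len(a)
    if hi - lo == 1:
        # Atomic task: a single element is its own minimum.
        return a[lo]
    mid = (lo + hi) // 2
    left = recursive_min(a, lo, mid)    # independent of the right half
    right = recursive_min(a, mid, hi)   # could run concurrently with the left half
    return min(left, right)             # combine step
```

The resulting TDG is a binary tree with n leaves and depth about log2(n), in contrast to the serial scan, whose n - 1 comparisons form a single chain.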

20
(No Transcript)
21
(No Transcript)
22
(No Transcript)
23
Data Decomposition
  • partition/decompose computational data domain
  • use this partition to induce task decomposition
  • tasks: similar operations on different data parts
  • partitioning output data
  • each output can be computed independently as a
    function of input
  • example: block matrix multiplication
  • data decomposition may not lead to a unique task
    decomposition
  • another example: computing itemset frequencies
  • input: transactions; output: itemset
    frequencies
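
As a rough illustration of output data partitioning, the sketch below splits the result of a matrix product into a 2 x 2 grid of blocks and creates one task per output block. It assumes square matrices of even order stored as lists of lists; the helper names are hypothetical, and the thread pool only shows the task structure (it is not meant to demonstrate a speedup).

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(x, y):
    """Plain serial matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def matadd(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def block(m, bi, bj, h):
    """Extract the h x h block at block-row bi, block-column bj."""
    return [row[bj * h:(bj + 1) * h] for row in m[bi * h:(bi + 1) * h]]


def output_block_task(a, b, bi, bj, h):
    # One task per output block: C[bi][bj] = A[bi][0]*B[0][bj] + A[bi][1]*B[1][bj]
    return matadd(matmul(block(a, bi, 0, h), block(b, 0, bj, h)),
                  matmul(block(a, bi, 1, h), block(b, 1, bj, h)))


def block_matmul(a, b):
    """Multiply two n x n matrices (n even) using four independent block tasks."""
    h = len(a) // 2
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {(bi, bj): pool.submit(output_block_task, a, b, bi, bj, h)
                   for bi in range(2) for bj in range(2)}
        blocks = {key: f.result() for key, f in futures.items()}
    # Stitch the four output blocks back into one matrix.
    top = [blocks[0, 0][r] + blocks[0, 1][r] for r in range(h)]
    bottom = [blocks[1, 0][r] + blocks[1, 1][r] for r in range(h)]
    return top + bottom
```

Splitting each block task further into its two block products (eight tasks plus four additions) would be a different task decomposition induced by the same data partitioning, which is the point of the "may not lead to a unique task decomposition" bullet.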

24
(No Transcript)
25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
Data Decomposition
  • partitioning input data
  • it may not be possible or desirable to partition
    output data
  • e.g., finding min, sum of a set of numbers,
    sorting
  • a task is created for each part of the input data
  • task: all computations that can be done using
    local data
  • a combine step may be needed to combine results
    of tasks
  • example: finding the sum of an array A of n
    numbers
  • example: computing itemset frequencies
  • partitioning both output and input data
  • output data partitioning is feasible
  • partitioning of input data offers additional
    concurrency
  • example: computing itemset frequencies
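
Input data partitioning for the itemset-frequency example could look like the sketch below. The representation (transactions and itemsets as frozensets) and the function names are assumptions for illustration; the partial counts are computed serially here, but each call on a slice of the transactions is an independent task.

```python
from collections import Counter


def count_itemsets(transactions, itemsets):
    """Task: everything that can be computed from one local part of the input."""
    counts = Counter()
    for transaction in transactions:
        for itemset in itemsets:
            if itemset <= transaction:   # itemset is contained in the transaction
                counts[itemset] += 1
    return counts


def partitioned_frequencies(transactions, itemsets, parts):
    """Split the transactions into `parts` pieces, count locally, then combine."""
    size = max(1, -(-len(transactions) // parts))   # ceiling division
    partial = [count_itemsets(transactions[i:i + size], itemsets)
               for i in range(0, len(transactions), size)]
    total = Counter()
    for counts in partial:
        total.update(counts)             # combine step: add the partial counts
    return total
```

Partitioning the output as well would mean giving each task only a subset of `itemsets`, so tasks could then be indexed by (transaction part, itemset part).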

29
Data Decomposition
  • partitioning intermediate data
  • multistage computations
  • partitioning input or output data of an
    intermediate stage
  • may lead to higher concurrency
  • some restructuring of the algorithm may be needed
  • example: block matrix multiplication
  • owner-computes rule
  • each part performs all computations involving
    data it owns
  • input: perform all computations that can be done
    using local data
  • output: compute all data in the partition

30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
Other Decomposition Techniques
  • exploratory decomposition
  • search of a configuration space for a solution
  • partition the search space into smaller parts
  • search each part concurrently
  • total parallel work can be <, =, or > total
    serial work
  • example: 15-puzzle problem
  • speculative decomposition
  • hybrid decompositions
  • computation structured into multiple stages
  • may apply different decompositions in different
    stages
  • examples: finding min of an array and quicksort
  • data decomposition, then recursive decomposition

34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
(No Transcript)
38
(No Transcript)
39
Characteristics of Tasks
  • task generation: static vs dynamic task
    generation
  • static: all tasks are known prior to execution
    of the algorithm
  • data decomposition: matrix multiplication
  • recursive decomposition: finding min of an array
  • dynamic: actual tasks and TDG/TIG not available a
    priori
  • rules, guidelines governing task generation may
    be known
  • recursive decomposition: quicksort
  • another example: ray tracing
  • task sizes: uniform vs non-uniform
  • complexity of mapping depends on this
  • tasks in matrix multiplication: uniform
  • tasks in quicksort: non-uniform

40
Characteristics of Tasks
  • knowledge of task sizes
  • can be used in mapping
  • known: tasks in decompositions for matrix
    multiplication
  • unknown: tasks in 15-puzzle problem
  • do not know a priori how many moves will lead to
    a solution
  • size of data associated with tasks
  • associated data must be available to the process
  • size and location of the associated data
  • consider data migration overhead in the mapping

41
Characteristics of Inter-Task Interactions
  • static vs dynamic
  • static: pattern and timing of interactions known
    a priori
  • static interaction: decompositions for matrix
    multiplication
  • message-passing paradigm (MPP)
  • active involvement of both interacting tasks
  • static interactions: easy to program
  • dynamic interactions: harder to program
  • tasks assigned additional synchronization and
    polling responsibilities
  • shared-address-space paradigm (SASP) can handle
    both equally easily
  • regular vs irregular (spatial structure)
  • regular: structure that can be exploited for
    efficient implementation
  • structured/curvilinear grids (implicit
    connectivity)
  • image dithering (example)
  • irregular: no such regular pattern exists
  • unstructured grids (connectivity maintained
    explicitly)
  • SpMxV (sparse matrix-vector multiplication)
  • irregular and dynamic interactions are harder to
    handle in MPP

42
(No Transcript)
43
Characteristics of Inter-Task Interactions
  • read-only vs read-write
  • read-only: tasks require read-only access to
    shared data
  • example: decompositions for matrix multiplication
  • read-write: tasks need to read and write
    shared data
  • example: heuristic search for 15-puzzle problem
  • one-way vs two-way
  • 2-way: data/work needed by a task is explicitly
    supplied by another
  • usually involves predefined producer and consumer
  • 1-way: only one of a pair of communicating tasks
    initiates and completes the interaction
  • read-only → 1-way; read-write → either
    1-way or 2-way
  • SASP can handle both interactions equally easily
  • MPP cannot handle 1-way interaction directly
  • source of data should explicitly send it to the
    recipient
  • static 1-way: easily converted to 2-way via
    program restructuring
  • dynamic 1-way: nontrivial program restructuring
    for converting to 2-way
  • polling: task checks for pending requests from
    others at regular intervals

44
Mapping Techniques
  • minimize overheads of parallel task execution
  • overhead: inter-process interaction
  • overhead: process idle time (uneven load
    distribution)
  • load balancing
  • balanced aggregate load: necessary but not
    sufficient
  • computations and interactions should be well
    balanced at each stage
  • example: 12-task decomposition (tasks 9-12
    depend on tasks 1-8)

45
(No Transcript)
46
Static vs Dynamic Mapping
  • static: distribute tasks prior to execution
  • static task generation: either static or dynamic
    mapping
  • good mapping: requires knowledge of task sizes,
    data sizes, TIG
  • non-trivial problem (usually NP-hard)
  • task sizes known but non-uniform
  • even if no TDG/TIG → number partitioning problem
  • dynamic: distribute workload during execution
  • dynamic task generation: dynamic mapping
  • task sizes unknown: dynamic mapping more effective
  • large data size: dynamic mapping costly (in MPP)
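
Because optimal number partitioning is hard, a static mapping of independent tasks with known, non-uniform sizes is usually computed heuristically. The sketch below uses the greedy largest-task-first rule (often called LPT); it is a swapped-in illustration, not a method named on the slides, and the task names and sizes in any call are hypothetical.

```python
import heapq


def lpt_mapping(task_sizes, p):
    """Greedy static mapping: give the next-largest task to the least-loaded process."""
    heap = [(0, proc) for proc in range(p)]          # (current load, process id)
    heapq.heapify(heap)
    assignment = {proc: [] for proc in range(p)}
    by_size = sorted(task_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for task, size in by_size:
        load, proc = heapq.heappop(heap)             # least-loaded process so far
        assignment[proc].append(task)
        heapq.heappush(heap, (load + size, proc))
    loads = {proc: sum(task_sizes[t] for t in tasks)
             for proc, tasks in assignment.items()}
    return assignment, loads
```

The heuristic ignores dependencies and interactions, so it only addresses the "no TDG/TIG" case from the bullet above, and it does not guarantee an optimal balance.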

47
Static-Mapping Schemes
  • mapping based on data partitioning
  • data partitioning induces a decomposition
  • partitioning selected with final mapping in mind
  • i.e., p-way data decomposition
  • dense arrays
  • sparse data structures, graphs (FE meshes)
  • mapping based on task partitioning
  • task dependence graphs, task interaction graphs
  • hierarchical partitioning
  • hybrid decomposition and mapping techniques

48
Array Distribution Schemes
  • block distributions: spatial locality of
    interaction
  • each process receives a contiguous block of
    entries
  • 1D: each part contains a block of consecutive
    rows
  • i.e., kth part contains rows kn/p ... (k+1)n/p - 1
  • 2D: checkerboard partitioning
  • higher dimensional distributions
  • higher degree of concurrency
  • less inter-process interaction
  • example: matrix multiplication
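
The 1D row-block formula can be written down directly. In this sketch the function name is an assumption, parts are numbered k = 0 ... p-1, and integer division is used so that the ranges still tile the rows when p does not divide n (the slide's formula implicitly assumes it does).

```python
def block_rows(k, n, p):
    """First and last row (inclusive) owned by part k in a 1D block distribution."""
    first = k * n // p
    last = (k + 1) * n // p - 1
    return first, last
```

For n = 12 and p = 4, part 0 owns rows 0 to 2 and part 3 owns rows 9 to 11.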

49
(No Transcript)
50
(No Transcript)
51
(No Transcript)
52
(No Transcript)
53
(No Transcript)
54
(No Transcript)
55
(No Transcript)
56
Array Distribution Schemes
  • cyclic distribution
  • amount of work differs for different matrix
    entries
  • examples: ray casting, dense LU factorization
  • block distribution leads to load imbalance
  • all processes have tasks from all parts of the
    matrix
  • good load balance, but complete loss of locality
  • block-cyclic distribution
  • partition array into more than p blocks
  • map blocks to processes in a round-robin
    (scattered) manner
  • randomized block distribution
  • when the distribution of work has some special
    pattern
  • adaptive 2D array partitionings
  • rectilinear, jagged, orthogonal bisection
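
To make the load-balance argument concrete, the sketch below maps rows to processes with a 1D block-cyclic rule and sums a made-up cost per row. The cost model (row i costs i + 1 units) is purely an assumption standing in for computations whose work grows across the matrix; the function names are also hypothetical.

```python
def block_cyclic_owner(i, block_size, p):
    """Process that owns row i when blocks of rows are dealt out round-robin."""
    return (i // block_size) % p


def process_loads(n, p, block_size):
    """Total cost per process under the assumed cost model: row i costs i + 1."""
    loads = [0] * p
    for i in range(n):
        loads[block_cyclic_owner(i, block_size, p)] += i + 1
    return loads
```

With n = 16 and p = 4, a block size of 4 reproduces the plain block distribution (loads 10, 26, 42, 58), a block size of 1 is the cyclic distribution (loads 28, 32, 36, 40), and a block size of 2 lies in between (loads 22, 30, 38, 46), trading some balance for more locality.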

57
(No Transcript)
58
(No Transcript)
59
(No Transcript)
60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
(No Transcript)
64
(No Transcript)
65
(No Transcript)
66
(No Transcript)
67
(No Transcript)
68
Dynamic Mapping Schemes
  • centralized schemes
  • all tasks maintained in a common pool or by a
    master process
  • idle processes take task(s) from central pool or
    master process
  • easier to implement
  • limited scalability: central pool/process becomes
    a bottleneck
  • chunk scheduling: idle processes get a group of
    tasks
  • danger of load imbalance due to large chunk sizes
  • decrease chunk size as program progresses
  • e.g., sorting entries in each row of a matrix
  • non-uniform tasks, unknown task sizes
  • e.g., image-space parallel ray casting
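
One way to decrease the chunk size as the program progresses is to hand each idle process a fixed fraction of the tasks still in the pool. The sketch below assumes the rule "chunk = ceil(remaining / p)", which resembles what is usually called guided self-scheduling; the slides do not specify a rule, so treat both the rule and the function name as assumptions.

```python
def decreasing_chunks(num_tasks, p, min_chunk=1):
    """Sizes of successive chunks handed out from a central pool of num_tasks tasks."""
    chunks = []
    remaining = num_tasks
    while remaining > 0:
        size = max(min_chunk, -(-remaining // p))   # ceil(remaining / p)
        size = min(size, remaining)                 # never hand out more than is left
        chunks.append(size)
        remaining -= size
    return chunks
```

Early chunks are large, which keeps the number of requests to the central pool low, while the final chunks are single tasks, which limits the imbalance a late, unexpectedly long chunk can cause.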

69
Dynamic Mapping Schemes
  • distributed schemes
  • tasks are distributed among processes
  • more scalable (no bottleneck)
  • critical parameters of distributed load balancing
  • how are sending and receiving processes paired?
  • who initiates the work transfer: sender or
    receiver?
  • how much work is transferred in each exchange?
  • when is the work transfer performed?
  • suitability to parallel architectures
  • both can be implemented in both SAS and MP
    paradigms
  • dynamic schemes require movement of tasks
  • computational granularity of tasks should be high
    in MP systems

70
Methods for Interaction Overheads
  • factors
  • volume and frequency of interaction
  • spatial and temporal pattern of interactions
  • maximizing data locality
  • minimize volume of data exchange
  • minimize overall volume of shared data
  • similar to maximizing temporal data locality
  • minimize frequency of interaction
  • high startup cost associated with each
    interaction
  • restructure algorithm so that shared data is
    accessed in large pieces
  • similar to increasing spatial locality of data
    access
  • minimizing contention and hot spots
  • multiple tasks try to access same resource
    concurrently
  • multiple simultaneous access to same memory
    block/bank
  • multiple processes sending messages to same
    process at the same time

71
Methods for Interaction Overheads
  • minimizing contention and hot spots
  • multiple tasks try to access same resource
    concurrently
  • multiple simultaneous access to same memory
    block/bank
  • multiple processes sending messages to same
    process simultaneously
  • e.g., matrix multiplication based on 2D
    partitioning
  • overlapping computations with interactions
  • early initiation of an interaction
  • support from programming paradigm, OS, hardware
  • MP: non-blocking message-passing primitives

72
Methods for Interaction Overheads
  • replicating data or computation
  • replicating frequently accessed read-only shared
    data
  • MP paradigm benefits more from data replication
  • replicated computation for shared intermediate
    results
  • using optimized collective interaction operations
  • usually use available implementations (e.g., by
    MPI)
  • sometimes, it may be better to write your own
    procedure
  • overlapping interactions with other interactions
  • example one-to-all broadcast

73
(No Transcript)
74
Parallel Algorithm Models
  • data-parallel model
  • data parallelism: identical operations applied
    concurrently on different data items
  • task graph model
  • task parallelism: independent tasks in a TDG
  • e.g., quicksort, sparse matrix factorization
  • work-pool or task-pool model
  • dynamic mapping of tasks onto processes
  • mapping may be centralized or distributed

75
Parallel Algorithm Models
  • master-slave or manager-worker model
  • master process generates work and allocates it to
    worker processes
  • pipeline or producer-consumer model
  • stream parallelism: execution of different
    programs on a data stream
  • each process in the pipeline is
  • consumer of the sequence of data items produced by
    the preceding process
  • producer of data for the process following in the
    pipeline
  • pipeline need not be a linear chain (it can be a
    DAG)
  • hybrid models
  • multiple models applied hierarchically
  • multiple models applied sequentially to different
    stages
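
The pipeline model can be sketched with one thread per stage and a FIFO queue between neighbours. This is an illustrative linear pipeline only; the `DONE` sentinel, the function names and the use of threads are assumptions, not something the slides prescribe.

```python
import queue
import threading

DONE = object()   # sentinel that tells a stage the stream has ended


def stage(fn, inbox, outbox):
    """Consume items from the preceding stage, produce results for the next one."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)          # pass the end-of-stream marker downstream
            return
        outbox.put(fn(item))


def run_pipeline(items, fns):
    """Push items through the stage functions fns, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(fns)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

While a later stage works on one item, an earlier stage can already process the next, which is the stream parallelism the slide refers to; the FIFO queues keep the output in input order.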