Title: Programming Parallel Algorithms NESL
1Programming Parallel Algorithms - NESL
- Guy E. Blelloch
- Presented by
- Michael Sirivianos
- Barbara Theodorides
2Problem Statement
- Why design a new language specifically for programming parallel algorithms?
- In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms.
- Over the same period there has been less success in developing good languages for programming parallel algorithms.
- There is a large gap between languages that are too low level (details obscure the meaning of the algorithm) and languages that are too high level (performance implications are unclear).
3NESL
- Nested Data Parallel Language
- Useful for teaching and implementing parallel algorithms.
- Bridges the gap: allows high-level descriptions of parallel algorithms but also has a straightforward mapping onto a performance model.
- Goals when designing NESL:
  - A language-based performance model that uses work and depth rather than a machine-based model that uses running time
  - Support for nested data-parallel constructs (ability to nest parallel calls)
4Analyzing performance
- Processor-based models: performance is calculated in terms of the number of instruction cycles a computation takes (its running time), as a function of input size and number of processors.
- Virtual models: higher-level models that can be mapped onto various real machines (e.g. PRAM, the Parallel Random Access Machine).
  - They can be mapped efficiently onto more realistic machines by simulating multiple processors of the PRAM on a single processor of a host machine.
  - Virtual models are easier to program.
5Measuring performance: Work and Depth
- Work: the total number of operations executed by a computation
  - specifies the running time on a sequential processor
- Depth: the longest chain of sequential dependencies in the computation
  - represents the best possible running time assuming an ideal machine with an unlimited number of processors
- Example: summing 16 numbers using a balanced binary tree
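The summation example above can be made concrete with a short Python sketch (Python is used only for illustration; it is not NESL, and the name `tree_sum` is hypothetical). It adds adjacent pairs level by level, charging one unit of work per addition and one unit of depth per tree level.

```python
def tree_sum(values):
    """Sum values with a balanced binary tree; return (total, work, depth)."""
    level = list(values)
    work = 0   # total number of additions performed
    depth = 0  # number of tree levels, i.e. the longest dependency chain
    while len(level) > 1:
        pairs = len(level) // 2
        # All additions on one level are independent and could run in parallel.
        nxt = [level[2 * i] + level[2 * i + 1] for i in range(pairs)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired element moves up unchanged
        work += pairs
        depth += 1
        level = nxt
    return (level[0] if level else 0), work, depth


total, work, depth = tree_sum(range(1, 17))
print(total, work, depth)  # 136 15 4
```

For 16 inputs the tree performs 8 + 4 + 2 + 1 = 15 additions in 4 levels, so the work is 15 and the depth is 4.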
6How can work and depth be incorporated into a computational model?
- Circuit model
  - Designing a circuit of logic gates
  - In the previous example, design a circuit in which the inputs are at the top, each + is an adder circuit, and each of the lines between adders is a bundle of wires.
  - Work: circuit size (number of gates)
  - Depth: longest path from an input to an output
7How can work and depth be incorporated into a computational model? (cont)
- Vector Machine Models
  - VRAM is a sequential RAM extended with a set of instructions that operate on vectors.
  - Each location in memory contains a whole vector
  - Vectors can vary in size during the computation
  - Vector instructions include element-wise operations (adding corresponding elements)
  - Depth: number of instructions executed by the machine
  - Work: sum of the lengths of the vectors
8How can work and depth be incorporated into a computational model? (cont)
- Vector Machine Models: Example
- Summation tree code
- Work: O(n + n/2 + ...) = O(n)
- Depth: O(log n)
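The slide's summation-tree code is not preserved in this transcript, so the following is only a Python sketch of the idea under an assumed cost accounting: each vector step is charged the length of the vector it consumes, and each step counts as one instruction of depth. The name `vector_sum` is hypothetical.

```python
def vector_sum(v):
    """Sum a vector by repeatedly adding its even- and odd-indexed elements.

    Returns (total, work, depth), where work is the sum of the lengths of
    the vectors operated on and depth is the number of vector steps.
    """
    v = list(v)
    work = 0
    depth = 0
    while len(v) > 1:
        if len(v) % 2 == 1:
            v.append(0)  # pad so both halves have equal length
        work += len(v)   # one element-wise step over a vector of this length
        depth += 1
        v = [a + b for a, b in zip(v[0::2], v[1::2])]
    return (v[0] if v else 0), work, depth


print(vector_sum(range(1, 17)))  # (136, 30, 4)
```

For n = 16 the vector lengths are 16, 8, 4 and 2, which sum to 30 and illustrate the n + n/2 + ... series, while the 4 steps match log2(16).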
9How can work and depth be incorporated into a computational model? (cont)
- Language-Based Models
  - Specify the costs of the primitive instructions and a set of rules for composing costs across program expressions.
  - Discuss the running time of the algorithms without introducing a specific machine model.
  - Using work and depth: work and depth costs are assigned to each function and scalar primitive of a language, and rules are specified for combining parallel and sequential expressions.
- Roughly speaking, when executing a set of tasks in parallel:
  - work = sum of the work of the tasks
  - depth = maximum of the depth of the tasks
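The composition rule above can be written down as two small combinators. This is a minimal Python sketch, not part of NESL: the `Cost` type and the `seq` and `par` helpers are hypothetical names, and the sequential rule (add both work and depth) is the usual assumption for tasks that run one after another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    work: int
    depth: int


def seq(*costs):
    """Tasks run one after another: work and depth both add up."""
    return Cost(sum(c.work for c in costs), sum(c.depth for c in costs))


def par(*costs):
    """Tasks run in parallel: work adds up, depth is the maximum."""
    return Cost(sum(c.work for c in costs),
                max((c.depth for c in costs), default=0))


tasks = [Cost(10, 3), Cost(4, 4), Cost(7, 2)]
print(par(*tasks))  # Cost(work=21, depth=4)
print(seq(*tasks))  # Cost(work=21, depth=9)
```

Running the same three tasks in parallel or in sequence costs the same work; only the depth changes.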
10Why Work and Depth?
- Work and depth have been used informally for many years to describe the performance of parallel algorithms
  - easier to describe
  - easier to think about
  - easier to analyze algorithms in terms of work and depth than in terms of running time and number of processors (processor-based model)
- Why are models based on work and depth better than processor-based models for programming and analyzing parallel algorithms?
  - Performance analysis is closely related to the code, and the code provides a clear abstraction of parallelism.
11Why Work and Depth? (cont)
- To support this claim they consider Quicksort.
- Sequential algorithm
  - Average-case running time O(n log n); depth of recursive calls O(log n)
- Parallel algorithm
12Quicksort (cont.)
- Code and analysis based on a processor-based model
- The code will have to specify:
  - how the sequence is partitioned across processors
  - how the subselection is implemented in parallel
  - how the recursive calls get partitioned among the processors
  - how the subcalls are synchronized
- In the case of Quicksort, this gets even more complicated: the recursive calls are not of equal sizes.
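By contrast, a language-based description needs none of those details. The following Python sketch mimics the nested data-parallel style of Quicksort and tracks work and depth alongside the result. The cost accounting is a simplifying assumption (each filtering pass over a sequence of length n is charged n work and unit depth), the middle element is used as the pivot for determinism, and the function name is hypothetical.

```python
def quicksort(s):
    """Return (sorted list, work, depth) under a simplified cost model."""
    if len(s) <= 1:
        return list(s), 1, 1
    pivot = s[len(s) // 2]
    # Each subselection is an independent element-wise filter.
    lesser = [e for e in s if e < pivot]
    equal = [e for e in s if e == pivot]
    greater = [e for e in s if e > pivot]
    # The two recursive calls are independent of each other, so their
    # work adds up while their depth is the maximum of the two.
    (left, w1, d1), (right, w2, d2) = [quicksort(p) for p in (lesser, greater)]
    work = len(s) + w1 + w2
    depth = 1 + max(d1, d2)
    return left + equal + right, work, depth


result, work, depth = quicksort([3, 1, 4, 1, 5, 9, 2, 6])
print(result)  # [1, 1, 2, 3, 4, 5, 6, 9]
```

Nothing in the code says how data or recursive calls are assigned to processors; unequal subproblem sizes only show up in the `max` that determines the depth.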
13Work and Depth and running time
- Running time at the two limits
  - Single processor: RT = work
  - Unlimited number of processors: RT = depth
- We can place upper and lower bounds on the running time T for a given number of processors P:
  - W/P <= T < W/P + D
  - valid under assumptions about communication and scheduling costs
- e.g. given memory latency L:
  - W/P <= T < W/P + L*D
- The communication cost among processors is not unit time, so D is multiplied by a latency factor. Bandwidth is not taken into account. In case of significantly different bandwidth, W should be divided by a large B factor and D by a small B factor.
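As a quick numeric illustration of the bounds W/P and W/P + L*D, here is a small Python helper. The function name and the sample values (W = 15, D = 4, as for a 16-element summation tree) are illustrative assumptions, not figures from the slides.

```python
def running_time_bounds(work, depth, processors, latency=1):
    """Return (lower, upper) bounds on running time for P processors.

    lower = W / P
    upper = W / P + L * D   (L = 1 gives the plain work-depth bound)
    """
    lower = work / processors
    upper = work / processors + latency * depth
    return lower, upper


W, D = 15, 4
for p in (1, 4, 16):
    print(p, running_time_bounds(W, D, p))

# With a memory latency of 10 time units the upper bound grows with D.
print(running_time_bounds(W, D, 4, latency=10))  # (3.75, 43.75)
```

As P grows, the W/P term shrinks and the depth term dominates, which is why depth limits the achievable speedup.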
14Work and Depth and running time (cont)
- Communication Bounds
- Work and depth do not take into account communication costs:
  - latency: time between making a remote request and receiving the reply
  - bandwidth: rate at which a processor can access memory
- Latency can be hidden.
  - Each processor has multiple parallel tasks (threads) to execute and therefore has plenty to do while waiting for replies
- Bandwidth cannot be hidden. While a processor is waiting for a data transfer to complete it is not able to perform other operations, and therefore remains idle.
15Nested Data-Parallelism and NESL
- Data-Parallelism: the ability to operate in parallel over sets of data
- Data-Parallel Languages or Collection-Oriented Languages
  - languages based on data-parallelism; can be either flat or nested
- Importance of nested parallelism
  - Used to implement nested loops and divide-and-conquer algorithms in parallel
  - Existing languages, such as C, do not have direct support for such nesting!
- NESL
  - Is a nested data-parallel language.
  - Designed in order to express nested parallelism in a simple way with a minimum set of structures
16NESL
- Supports data-parallelism by means of operations on sequences
- Apply-to-each construct, which uses a set-like notation
  - e.g. {a * a : a in [3, -4, -9, 5]}
  - Used over multiple sequences: {a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]}
- Ability to subselect elements of a sequence based on a filter
  - e.g. {a * a : a in [3, -4, -9, 5] | a > 0}
- Any function may be applied to each element of a sequence
  - e.g. {factorial(i) : i in [3, 1, 7]}
- Provides a set of functions on sequences, each of which can be implemented in parallel (sum, reverse, write)
  - e.g. write([0, 0, 0, 0, 0, 0, 0, 0], [(4,-2),(2,5),(5,9)])
- Nested parallelism: allows sequences to be nested and allows parallel functions to be used in an apply-to-each
  - e.g. {sum(a) : a in [[2,3], [8,3,9], [7]]}
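For readers who do not have a NESL interpreter at hand, the examples above map closely onto Python list comprehensions. This sketch assumes the apply-to-each examples read a * a, a + b, and a * a filtered by a > 0 (the operators are partly lost in this transcript), and the `write` helper is a hypothetical stand-in for NESL's `write`, applying (index, value) updates to a copy of the destination sequence. Python evaluates these sequentially; the point is only the notation.

```python
from math import factorial

squares = [a * a for a in [3, -4, -9, 5]]
sums = [a + b for a, b in zip([3, -4, -9, 5], [1, 2, 3, 4])]
positive_squares = [a * a for a in [3, -4, -9, 5] if a > 0]
factorials = [factorial(i) for i in [3, 1, 7]]


def write(dest, updates):
    """Return a copy of dest with each (index, value) pair written into it."""
    result = list(dest)
    for index, value in updates:
        result[index] = value
    return result


written = write([0] * 8, [(4, -2), (2, 5), (5, 9)])

# Nested parallelism: a parallel function (sum) inside an apply-to-each.
nested_sums = [sum(a) for a in [[2, 3], [8, 3, 9], [7]]]

print(squares)           # [9, 16, 81, 25]
print(sums)              # [4, -2, -6, 9]
print(positive_squares)  # [9, 25]
print(factorials)        # [6, 1, 5040]
print(written)           # [0, 0, 5, 0, -2, 9, 0, 0]
print(nested_sums)       # [5, 20, 7]
```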
17The performance Model
- Defines work and depth in terms of the work and depth of the primitive operations, and rules for composing the measures across expressions.
- In most cases W(e1 + e2) = 1 + W(e1) + W(e2), where the ei are expressions
- A similar rule is used for the depth.
- Rules
  - apply-to-each expression
  - if expression
18The performance Model (cont)
- Example: Factorial
- Consider the evaluation of the expression
  - e = {factorial(n) : n in a}, where a = [3, 1, 5, 2].
    function factorial(n) =
      if (n == 1) then 1
      else n * factorial(n-1)
- Using the rules for work and depth
- where W(==), W(*), and W(-) each have cost 1.
- The two unit constants come from the cost of the function call and the if-then-else statement.
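The slide's cost equations are not preserved in this transcript, so the following Python sketch evaluates the example under explicitly assumed rules: every primitive (==, *, -) costs 1, a function call and an if-then-else each cost 1, sequential composition adds both work and depth, and an apply-to-each costs 1 plus the sum of the element work (for work) or 1 plus the maximum element depth (for depth). The cost of evaluating the sequence a itself is ignored. The function names are hypothetical.

```python
def factorial_cost(n):
    """Return (work, depth) of factorial(n) for n >= 1 under the assumed costs."""
    call_and_if = 2  # one unit for the function call, one for the if-then-else
    compare = 1      # n == 1
    if n == 1:
        cost = call_and_if + compare
        return cost, cost
    inner_work, inner_depth = factorial_cost(n - 1)
    step = call_and_if + compare + 2  # plus one multiply and one subtract
    # The body is purely sequential, so work and depth grow together.
    return step + inner_work, step + inner_depth


def apply_to_each_cost(costs):
    """Combine per-element (work, depth) pairs for an apply-to-each."""
    work = 1 + sum(w for w, _ in costs)
    depth = 1 + max(d for _, d in costs)
    return work, depth


a = [3, 1, 5, 2]
print([factorial_cost(n) for n in a])                      # [(13, 13), (3, 3), (23, 23), (8, 8)]
print(apply_to_each_cost([factorial_cost(n) for n in a]))  # (48, 24)
```

Under these assumptions the work is driven by the sum over all elements, while the depth is driven only by the largest element (n = 5).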
19Examples of Parallel Algorithms in NESL
- Principles
  - An important aspect of developing a good parallel algorithm is designing one whose work is close to the time for a good sequential algorithm that solves the same problem.
  - Work-efficient: parallel algorithms are referred to as work-efficient relative to a sequential algorithm if their work is within a constant factor of the time of the sequential algorithm.
20Examples of Parallel Algorithms in NESL (cont)
- Primes
- Sieve of Eratosthenes
    1 procedure PRIMES(n)
    2   let A be an array of length n
    3   set all but the first element of A to TRUE
    4   for i from 2 to sqrt(n)
    5   begin
    6     if A[i] is TRUE
    7     then set all multiples of i up to n to FALSE
    8   end
- Line 7 is implemented by looping over the multiples, thus the algorithm takes O(n log log n) time.
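The pseudocode above translates almost line for line into Python. This is a sketch that uses a 0-indexed list and returns the primes strictly below n, which is an indexing choice made here rather than something the slide specifies; the function name is hypothetical.

```python
from math import isqrt


def primes_sieve(n):
    """Return all primes strictly less than n using the Sieve of Eratosthenes."""
    if n < 2:
        return []
    is_prime = [True] * n
    is_prime[0] = False
    is_prime[1] = False
    for i in range(2, isqrt(n) + 1):
        if is_prime[i]:
            # Corresponds to line 7: clear every multiple of i below n.
            for multiple in range(2 * i, n, i):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]


print(primes_sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```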
21Examples of Parallel Algorithms in NESL (cont)
- Primes (parallelized)
- Parallelize the line "set all multiples of i up to n to FALSE"
  - the multiples of a value i can be generated in parallel by [2*i:n:i]
  - and can be written into the array A in parallel with the write function
- The depth of this algorithm is O(sqrt(n)), since each iteration of the loop has constant depth and there are sqrt(n) iterations.
- The number of multiples is the same as the time of the sequential version.
- Since it does the same number of operations, the work is the same: O(n log log n).
22Examples of Parallel Algorithms in NESL (cont)
- Primes: Improving depth
- If we are given all the primes from 2 up to sqrt(n), we could then generate all the multiples of these primes at once: {[2*p:n:p] : p in sqr_primes}
    function primes(n) =
    if n == 2 then ([] int)
    else
      let sqr_primes = primes(isqrt(n));
          composites = {[2*p:n:p] : p in sqr_primes};
          flat_comps = flatten(composites);
          flags      = write(dist(true, n), {(i,false) : i in flat_comps});
          indices    = {i in [0:n]; fl in flags | fl}
      in drop(indices, 2);
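The recursive structure above can be mirrored in Python as a sketch. Two choices here are assumptions rather than statements about NESL: the recursion is made on the ceiling of sqrt(n), computed as `isqrt(n - 1) + 1`, so that every prime factor needed is available and the recursion reaches the base case; and the base case is written as n <= 2. Python runs the comprehensions sequentially, whereas each of them is a parallel operation in NESL.

```python
from math import isqrt


def primes(n):
    """Return all primes strictly less than n, recursing on about sqrt(n)."""
    if n <= 2:
        return []
    sqr_primes = primes(isqrt(n - 1) + 1)  # primes below ceil(sqrt(n))
    # All multiples of all small primes, generated "at once" (nested sequence).
    composites = [list(range(2 * p, n, p)) for p in sqr_primes]
    flat_comps = [i for multiples in composites for i in multiples]
    flags = [True] * n
    for i in flat_comps:  # stands in for a single parallel write
        flags[i] = False
    indices = [i for i, fl in enumerate(flags) if fl]
    return indices[2:]    # drop 0 and 1, which are not prime


print(primes(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Each level does a constant number of sequence operations, so in the parallel setting the depth is governed by the number of recursion levels.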
23Examples of Parallel Algorithms in NESL (cont)
- Primes: Improving depth
- Analysis of Work and Depth
  - Work: clearly most of the work is done at the top level of recursion, which does O(n log log n) work, and therefore the total work is O(n log log n).
  - Depth: since each recursion level has constant depth, the total depth is proportional to the number of levels. The size of the problem at level d is n^(1/2^d); setting n^(1/2^d) = 2 gives d = log log n, and therefore the depth is O(log log n).
- This algorithm remains work-efficient and greatly improves the depth.
24Examples of Parallel Algorithms in NESL (cont)
- Sparse Matrix Multiplication
- Sparse matrices: most elements are zero
- Representation in NESL

         2.0  -1.0    0     0
   A =  -1.0   2.0  -1.0    0
          0   -1.0   2.0  -1.0
          0     0   -1.0   2.0

   A = [[(0, 2.0), (1, -1.0)],
        [(0, -1.0), (1, 2.0), (2, -1.0)],
        [(1, -1.0), (2, 2.0), (3, -1.0)],
        [(2, -1.0), (3, 2.0)]]

- E.g. multiply a sparse matrix A with a dense vector x.
- The product Ax in NESL is {sum({v * x[i] : (i,v) in row}) : row in A}
- Let n be the number of nonzero elements in the row; then
  - depth of the computation = the depth of the sum = O(log n)
  - work = sum of the work across the elements = O(n)
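The same representation and product can be written in Python as a nested comprehension. This sketch assumes the row-of-(index, value)-pairs layout shown above; the function name `sparse_matvec` and the sample vector x are hypothetical.

```python
A = [
    [(0, 2.0), (1, -1.0)],
    [(0, -1.0), (1, 2.0), (2, -1.0)],
    [(1, -1.0), (2, 2.0), (3, -1.0)],
    [(2, -1.0), (3, 2.0)],
]


def sparse_matvec(matrix, x):
    """Multiply a sparse matrix (rows of (column, value) pairs) by a dense vector."""
    # Outer comprehension: one independent task per row.
    # Inner sum: a reduction over the nonzero elements of that row.
    return [sum(v * x[i] for i, v in row) for row in matrix]


x = [1.0, 2.0, 3.0, 4.0]
print(sparse_matvec(A, x))  # [0.0, 0.0, 0.0, 5.0]
```

The rows have different lengths, which is exactly the irregular nesting that a flat data-parallel language cannot express directly.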
25Examples of Parallel Algorithms in NESL (cont)
- Planar Convex Hull
- Problem: given n points in the plane, find which of them lie on the perimeter of the smallest convex region that contains all points.
- An example of nested parallelism for divide-and-conquer algorithms.
- Quickhull algorithm (similar to Quicksort)
  - The strategy is to pick a pivot element, split the data based on the pivot, and recurse on each of the split sets.
  - Worst-case work is O(n^2) and the worst-case depth is O(n).
26Examples of Parallel Algorithms in NESL (cont)
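- The hull is built from two calls: hsplit(set,A,P) and hsplit(set,P,A)
- cross_product(p, (A,P)) measures how far each point p lies from the line A-P
- pm: the point farthest from the line A-P
- Recursively: hsplit(set,A,pm) and hsplit(set,pm,P)
- Elements below the line are ignored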
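The steps above can be sketched in Python. This is an illustrative version, not the NESL code from the talk: the function names follow the slide, points are assumed to be (x, y) tuples, the leftmost and rightmost points are taken with Python's tuple ordering, and the two recursive calls run sequentially here although they are independent.

```python
def cross_product(o, line):
    """Signed area test: positive when o lies on the kept side of the line."""
    (xo, yo) = o
    ((x1, y1), (x2, y2)) = line
    return (x1 - xo) * (y2 - yo) - (y1 - yo) * (x2 - xo)


def hsplit(points, p1, p2):
    """Return the hull points from p1 up to (not including) p2."""
    cross = [cross_product(p, (p1, p2)) for p in points]
    # Keep only points strictly on one side of the line; the rest are ignored.
    packed = [p for p, c in zip(points, cross) if c > 0]
    if len(packed) < 2:
        return [p1] + packed
    pm = points[cross.index(max(cross))]  # point farthest from the line p1-p2
    return hsplit(packed, p1, pm) + hsplit(packed, pm, p2)


def quickhull(points):
    """Return the points on the convex hull of a list of (x, y) tuples."""
    leftmost = min(points)
    rightmost = max(points)
    return hsplit(points, leftmost, rightmost) + hsplit(points, rightmost, leftmost)


pts = [(0, 0), (4, 0), (2, 3), (1, 2), (3, 2), (2, 1)]
print(quickhull(pts))  # [(0, 0), (1, 2), (2, 3), (3, 2), (4, 0)]
```

The interior point (2, 1) is discarded during the first split, which shows why the actual work can be much smaller than the worst case.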
27Examples of Parallel Algorithms in NESL (cont)
- Performance analysis of Quickhull
  - Each recursive call has constant depth and O(n) work.
  - However, since many points might be deleted on each step, the work could be significantly less.
  - As in Quicksort, the worst-case work is O(n^2) and the worst-case depth is O(n).
  - For m hull points the best-case costs are O(n) work and O(log m) depth.
28Summary
- They formalize a clear-cut language-based model for analyzing performance
- The model based on work and depth is defined directly through a programming language, rather than a specific machine
- It can be applied to various classes of machines using mappings that account for the number of processors and for processing and communication costs.
- NESL allows simple description of parallel algorithms and makes use of data-parallel constructs and the ability to nest such constructs.
29Summary
- NESL hides the CPU/memory allocation and inter-processor communication details by providing an abstraction of parallelism.
- The current NESL implementation is based on an intermediate language (VCODE) and a library of low-level vector routines (CVL)
- For more information on how the NESL compiler is implemented:
  - "Implementation of a Portable Nested Data-Parallel Language", Guy E. Blelloch, Siddhartha Chatterjee, Jonathan C. Hardwick, Jay Sipelstein, and Marco Zagha.
30Discussion
- Parallel Processing - Sensor Network Analogy
  - Local processing -> Aggregation. Work corresponds to total aggregation cost.
  - Moving levels up -> Collecting aggregated results from children nodes.
  - Depth -> Depth of routing tree in sensor network. Implies communication cost.
  - Latency -> Cost to transmit data between motes.
- In parallel computation the goal is to reduce execution time.
- Sensor networks aim to reduce power consumption by minimizing communications. Execution time is also an issue when real-time requirements are imposed.
31Discussion
- NESL and TAG queries?
- Can latency be hidden by assigning multiple tasks to motes?
- Can you perform different operations on an array's elements in parallel? Is it hard to add one more parallelism mechanism besides apply-to-each and parallel functions?