Title: Parallel coding
1 Parallel coding
- Approaches to converting sequential programs to run on parallel machines
2 Goals
- Reduce wall-clock time
- Scalability
- Increase resolution
- Expand space without loss of efficiency
It's all about efficiency - common obstacles:
- Poor data communication
- Poor load balancing
- Inherently sequential algorithm nature
3 Efficiency
- Communication overhead: data transfer is at most 10^-3 of the processing speed
- Load balancing: an uneven load which is statically balanced may cause idle processor time
- Inherently sequential algorithms: if all tasks must be performed serially, there is no room for parallelization
- Lack of efficiency can cause a parallel code to perform worse than a comparable sequential code
4 Scalability
Amdahl's Law states that potential program
speedup is defined by the fraction of code (f)
which can be parallelized
5 Scalability

    speedup = 1 / (1 - f)

where f is the parallelizable fraction. With N processors, P the parallel fraction and S = 1 - P the serial fraction:

    speedup = 1 / (P/N + S)

    N       P = .50   P = .90   P = .99
    -----   -------   -------   -------
    10      1.82      5.26      9.17
    100     1.98      9.17      50.25
    1000    1.99      9.91      90.99
    10000   1.99      9.91      99.02
6 Before we start - Framework
- Code may be influenced/determined by the machine architecture - hence the need to understand the architecture
- Choose a programming paradigm
- Choose the compiler
- Determine communication
- Choose the network topology
- Add code to accomplish task control and communications
- Make sure the code is sufficiently optimized (may involve the architecture)
- Debug the code
- Eliminate lines that impose unnecessary overhead
7 Before we start - Program
- If we are starting with an existing serial program, debug the serial code completely
- Identify the parts of the program that can be executed concurrently
- Requires a thorough understanding of the algorithm
- Exploit any inherent parallelism which may exist
- May require restructuring of the program and/or algorithm; may require an entirely new algorithm
8 Before we start - Framework
- Architecture: Intel Xeon, 16 GB distributed memory, Rocks Cluster
- Compiler: Intel FORTRAN / pgf
- Network: star (mesh?)
- Overhead: make sure the communication channels aren't clogged (net admin)
- Optimized code: write C code when necessary, use CPU pipelines, use debugged programs
9 Sequential Coding Practice
10 Improvement methods
- Hardware: explicit - buy/design dedicated drivers and boards; implicit - buy new off-the-shelf products every so often
- Instruction sets: explicit - rely on them (MMX/SSE) and write assembly language; implicit - make use of the compiler's optimizations
- Memory: explicit - write dedicated memory handlers; implicit - adjust DO LOOP paging
- Cache: explicit - branch prediction / prefetching algorithms; implicit - adjust cache fetches by ordering the data streams manually
Sequential coding practice
11 The COMMON problem
Problem: COMMON blocks are copied as one chunk of data each time a process forks. The compiler doesn't distinguish between active COMMONs and redundant ones.
12 The COMMON problem
On NUMA (Non-Uniform Memory Access), MPP/SMP (massively parallel processing / symmetric multiprocessor) and vector machines this is rarely an issue. On a distributed computer (cluster) it is crucial - the network is congested by this!
13 The COMMON problem
- Resolution:
- Pass only the required data for the task
- Functional programming (pass arguments on the call)
- On shared memory architectures use shmXXX commands
- On distributed memory architectures use message passing
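A sketch of the "pass only the required data" idea, using a Python dict as a hypothetical stand-in for a large COMMON block (the field names are invented for illustration):

```python
# Sketch: the cost of shipping a whole "COMMON block" with every task,
# vs. passing only the arguments the task needs. The fields below are
# hypothetical stand-ins for a real model's state.
import pickle

common = {
    "grid": [[0.0] * 100 for _ in range(100)],  # needed by the task
    "history": list(range(100_000)),            # redundant for this task
    "dt": 0.1,                                  # needed by the task
}

def step_whole_block(block):
    """Anti-pattern: the entire block travels with the task."""
    return block["grid"][0][0] + block["dt"]

def step_args_only(grid_cell, dt):
    """Pass arguments on the call: only the required data."""
    return grid_cell + dt

# Bytes that would cross the network if the task inputs were serialized:
whole = len(pickle.dumps(common))
needed = len(pickle.dumps((common["grid"][0][0], common["dt"])))
print(whole, needed)   # the argument-only version is far smaller
```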
14 Swapping to secondary storage
Problem: swapping is transparent but uncontrolled - the kernel cannot predict which pages are needed next, only determine which are needed frequently.
"Swap space is a way to emulate physical RAM, right? No - generally swap space is a repository to hold things from memory when memory is low. Things in swap cannot be addressed directly and need to be paged into physical memory before use, so there's no way swap could be used to emulate memory. So no, 512M + 512M swap is not the same as 1G memory and no swap." - KernelTrap.org
15 Swapping to secondary storage - Example
- CPU: dual Intel Pentium 3, 1000 MHz
- RAM: 512 MB
- Compiler: Intel Fortran, optimization -O2 (default)
- Data: 381 MB x 2
16 Swapping to secondary storage
For processing 800 MB of data, 1 GB of data travels at hard-disk rate throughout the run.
17 Swapping to secondary storage
Resolution: prevent swapping by adjusting the amount of data to the user process' RAM size (read and write temporary files from/to disk).
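A minimal sketch of this resolution - staging data in explicit temporary files and processing it in RAM-sized chunks; the budget and the `process` step are placeholders:

```python
# Sketch: keep only one RAM-budget-sized chunk resident at a time,
# staging input and output on disk explicitly instead of letting the
# kernel swap the process. Sizes are tiny stand-ins for illustration.
import tempfile

RAM_BUDGET = 1024  # hypothetical per-process RAM limit, in bytes

def process(chunk: bytes) -> bytes:
    """Placeholder for the real per-chunk computation."""
    return chunk.upper()

def run(data: bytes) -> bytes:
    with tempfile.TemporaryFile() as src, tempfile.TemporaryFile() as dst:
        src.write(data)
        src.seek(0)
        while chunk := src.read(RAM_BUDGET):   # one chunk in memory at a time
            dst.write(process(chunk))
        dst.seek(0)
        return dst.read()   # (gathered here only so the sketch has a result)

result = run(b"abc" * 2000)   # ~6 KB of input processed in ~1 KB chunks
```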
18 Swapping to secondary storage
On every node: memory size 2 GB, predicted number of pending jobs 3.
- Use MOSIX for load balancing
- Work with data segments no greater than 600 MB/process (open files + memory + output buffers)
19 Paging, cache
Problem: as with swapping, memory pages go in and out of the CPU's cache. Again, the compiler cannot predict the ordering of pages into the cache; semi-controlled paging leads again to performance degradation.
Note: on-board memory is slower than cache memory (bus speed) but still faster than disk access.
20 Paging, cache
Cache size (Xeon): 512 KB. So: work in 512 KB chunks whenever possible (e.g. 256 x 256 double precision).
Resolution: prevent paging by adjusting the data size to the CPU cache.
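A sketch of cache-sized blocking, assuming the 512 KB Xeon cache quoted above (256 x 256 doubles x 8 bytes = 512 KB):

```python
# Sketch: walk a large 2D array in cache-sized tiles instead of making
# long strided passes. TILE = 256 gives 256 * 256 * 8 bytes = 512 KB of
# doubles, matching the cache size quoted above.
TILE = 256

def tiles(nrows, ncols, tile=TILE):
    """Yield (row0, row1, col0, col1) bounds of cache-sized blocks."""
    for r in range(0, nrows, tile):
        for c in range(0, ncols, tile):
            yield r, min(r + tile, nrows), c, min(c + tile, ncols)

def blocked_sum(a):
    """Sum a list-of-lists matrix tile by tile."""
    total = 0.0
    for r0, r1, c0, c1 in tiles(len(a), len(a[0])):
        for i in range(r0, r1):
            for j in range(c0, c1):
                total += a[i][j]
    return total
```

The same `tiles` generator can drive any per-block kernel, not just a sum.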
21 Example (figure)
22 Example - results (figure)
23 Workload summary
From fastest to slowest:
- Adjust to cache size
- Adjust to pages in sequence
- Adjust to RAM size
- Control disk activity
24 Sparse Arrays
- Current: dense (full) arrays
- All array indices are occupied in memory
- Matrix manipulations are usually element by element (no linear algebra manipulations when handling parameters on the grid)
25 Dense Arrays in HUCM - Cloud drop size distribution (F1)
- Number of nonzeros: 110,000; load: 5
- Number of nonzeros: 3,700; load: 0.2
26 Dense Arrays in HUCM - Cloud drop size distribution (F1) - lots of holes
- Number of nonzeros: 110,000; load: 14
- Number of nonzeros: 3,700; load: 0.5
27 Sparse Arrays
- Current: dense (full) arrays
- All array subscripts occupy memory
- Matrix manipulations are usually element by element (no linear algebra manipulations when handling parameters on the grid)
- Improvement: sparse arrays
- Only non-zero elements occupy memory cells (sparse notation)
- When calculating algebraic matrices, run the profiler to check for performance degradation due to the sparse data
28 Sparse Arrays - HOWTO
Actual storage: (J, I, val) triplets; the full matrix is only what is displayed.
SPARSE is a supported datatype in the Intel Math Kernel Library.
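The triplet scheme can be sketched in a few lines of Python (coordinate storage; the matrix values are invented for illustration):

```python
# Sketch of coordinate ("J I val" triplet) sparse storage: only the
# nonzeros are kept, as (row, col, value) triplets.
def to_coo(dense):
    """Dense list-of-lists -> list of (i, j, value) triplets."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]

def coo_matvec(triplets, x, nrows):
    """y = A @ x using only the stored nonzeros."""
    y = [0.0] * nrows
    for i, j, v in triplets:
        y[i] += v * x[j]
    return y

a = [[0, 0, 3],
     [0, 5, 0],
     [0, 0, 0]]
nz = to_coo(a)   # 2 stored cells instead of 9
```

For production use, a library format (e.g. the sparse types in Intel MKL, mentioned above) is preferable to hand-rolled triplets.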
29 DO LOOPs
- Current: loops are written with no respect for memory layout. Example: FORTRAN uses column-major subscripts
(Figure: memory layout vs. the virtual layout of a 2D array, column major)
30 DO LOOPs
- The order of the subscripts is crucial
- In the wrong order the data pointer advances many steps between accesses - many page faults
(Figure: memory layout vs. the virtual layout of a 2D array, column major)
31 DO LOOPs
- The order of the subscripts is crucial - traversing in the stored order advances the data pointer one element at a time
(Figure: memory layout vs. the virtual layout of a 2D array, column major)
32 DO LOOPs - example
(Figure: a 125 MB array example)
33 DO LOOPs
(Figure: wall-clock time - the DO loop vs. the print statement, a system call)
34 DO LOOPs
- Improvements:
- Reorder the DO LOOPs, or
- Rearrange the dimensions in an array:
  GFF2R(NI, NKR, NK, ICEMAX) -> GFF2R(ICEMAX, NKR, NK, NI)
- The innermost (fastest) running subscript should be leftmost; the outermost (slowest) running subscript rightmost
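The effect of subscript order can be illustrated by linearizing a 2D array the way FORTRAN does (column major) and measuring how far the data pointer moves between consecutive accesses; a Python sketch:

```python
# Sketch: why subscript order matters. We linearize a 2D array in
# column-major order (FORTRAN: address = i + j * nrows) and total the
# pointer movement for each loop nesting.
NROWS, NCOLS = 4, 3

def addr(i, j):
    """Column-major linear address of element (i, j)."""
    return i + j * NROWS

def stride_sum(order):
    """Total pointer movement for a full traversal in the given order."""
    if order == "i_inner":   # DO j / DO i - matches the layout
        seq = [addr(i, j) for j in range(NCOLS) for i in range(NROWS)]
    else:                    # DO i / DO j - fights the layout
        seq = [addr(i, j) for i in range(NROWS) for j in range(NCOLS)]
    return sum(abs(b - a) for a, b in zip(seq, seq[1:]))

print(stride_sum("i_inner"))   # 11: one step at a time
print(stride_sum("j_inner"))   # large jumps -> page faults, cache misses
```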
35 Parallel Coding Practice
36 Job Scheduling
- Current:
- Manual batch - hard to track, no monitoring of the control
- Improvements:
- Batch scheduling / parameter sweep (e.g. shell scripts, NIMROD)
- EASY/MAUI backfilling job schedulers
Parallel coding practice
37 Load balancing
- Current:
- Administrative - manual (and rough) load balancing (Haim)
- MPI, PVM - no load-balancing capabilities, software dependent
- RAMS - variable grid point area
- MM5, MPP - ?
- WRF - ?
- File system: NFS - a disaster! Client-side caching, no segmented file locks, network congestion
- Improvements:
- MOSIX - kernel-level governing, better monitoring of jobs, no stray (defunct) residues
- MOPI, DFSA (not PVFS, and definitely not NFS)
38 NFS client-side cache
- Every node has a non-concurrent mirror of the image
- Write: two writes to the same location may crash the system
- Read: old data may be read
39 Parallel I/O: Local / MOPI
MOPI - the MOSIX Parallel I/O System
40 Parallel I/O: Local / MOPI
- Local: can be adapted with minor changes in the source code
- MOPI: needs installation but requires no changes in the source code
41 Converting sequential to parallel
- An easy 5-step method:
- Hotspot identification
- Partition
- Communication
- Agglomeration
- Mapping
42 Hotspots Partition Comm Agglomerate Map
Parallelizing should be done methodically, in a clean, accurate and meticulous way. However intuitive parallel programming may be, it does not always allow straightforward, automatic, mechanical methods. One of the approaches is the methodical approach (Ian Foster). This particular method maximizes the potential for parallelizing and provides efficient steps that exploit this potential. Furthermore, it provides explicit checklists on completion of each step (not detailed here).
43 5-step: hotspots
Hotspots Partition Comm Agglomerate Map
Identify the hotspots: find the parts of a program which consume the most run time. Our goal here is to know which code segments can and should be parallelized. Why? Greatly improving code that consumes 10% of the run time may increase performance by 10%, whereas optimizing code that consumes 90% of the run time may enable an order-of-magnitude speedup. How? Algorithm inspection (in theory), by looking at the code, or by profiling (with tools such as prof or a 3rd-party profiler) to identify bottlenecks.
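A profiling sketch using Python's standard-library profiler in place of prof; the two phases are hypothetical stand-ins for parts of a real program:

```python
# Sketch: profiling to find hotspots. cheap_phase and hot_phase are
# hypothetical stand-ins for two phases of a program; the report shows
# which one dominates the run time.
import cProfile
import io
import pstats

def cheap_phase():
    return sum(range(1_000))

def hot_phase():
    return sum(i * i for i in range(200_000))

def program():
    cheap_phase()
    hot_phase()

profiler = cProfile.Profile()
profiler.enable()
program()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats()
report = buf.getvalue()
print(report)   # hot_phase dominates the cumulative-time column
```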
44 5-step: partition (1)
Hotspots Partition Comm Agglomerate Map
Definition: the ratio between computation and communication is known as granularity.
45 5-step: partition (2)
Hotspots Partition Comm Agglomerate Map
- Goal: partition the tasks into the most fine-grained ones
- Why?
- We want to discover all the available opportunities for parallel execution, and to provide flexibility when we introduce the following steps (communication, memory and other requirements will enforce the optimal agglomeration and mapping)
- How?
- Functional parallelism
- Data parallelism
- Data decomposition: sometimes it is easier to start off by partitioning the data into segments which are not mutually dependent
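A minimal sketch of data decomposition - splitting a 1D index domain into nearly equal, mutually independent segments, one per task:

```python
# Sketch: decompose an index domain 0..n into ntasks contiguous,
# non-overlapping segments, as evenly as possible.
def decompose(n, ntasks):
    """Return (start, stop) index ranges covering 0..n across ntasks."""
    base, extra = divmod(n, ntasks)
    bounds, start = [], 0
    for t in range(ntasks):
        stop = start + base + (1 if t < extra else 0)  # first tasks get the remainder
        bounds.append((start, stop))
        start = stop
    return bounds

print(decompose(10, 3))   # [(0, 4), (4, 7), (7, 10)]
```

The same idea extends to 2D/3D grids by decomposing each dimension independently.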
46 5-step: partition (3)
Hotspots Partition Comm Agglomerate Map
(Figure)
47 5-step: partition (4)
Hotspots Partition Comm Agglomerate Map
- Functional decomposition: partitioning the calculation into segments which are not mutually dependent (e.g. integration components are evaluated before the integration step)
48 5-step: partition (5)
Hotspots Partition Comm Agglomerate Map
(Figure)
49 5-step: communication (1)
Hotspots Partition Comm Agglomerate Map
- Communication occurs during data passing and synchronization. We strive to minimize data communication between tasks or to make it more coarse-grained
- Sometimes the master process may encounter too much incoming traffic. If large data chunks must be transferred, try to form hierarchies when aggregating the data
- The most efficient granularity depends on the algorithm and the hardware environment in which it runs
- Decomposing the data has a crucial role here; consider revisiting step 2
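The hierarchy idea can be sketched as a tree reduction, where partial results are combined pairwise in rounds instead of all converging on the master at once:

```python
# Sketch: tree-style aggregation. Instead of every worker sending its
# partial result straight to the master, results are combined pairwise
# in rounds, so no single node receives all the traffic.
def tree_reduce(values, combine):
    vals = list(values)
    while len(vals) > 1:
        nxt = [combine(vals[i], vals[i + 1])
               for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # odd element passes through to the next round
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

partials = [1, 2, 3, 4, 5]         # per-worker partial sums (illustrative)
print(tree_reduce(partials, lambda a, b: a + b))   # 15
```

With N workers the reduction takes O(log N) rounds instead of N messages into one node.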
50 5-step: communication (2)
Hotspots Partition Comm Agglomerate Map
Sending data out to sub-tasks: point-to-point is best for sending personalized data to each independent task. Broadcast is a good way to clog the network (all processors update the data, then need to send it back to the master), but we may find good use for it when a large computation can be performed once and lookup tables sent across the network. Collection is usually used to perform mathematics like min, max and sum. Shared memory systems synchronize using memory-locking techniques. Distributed memory systems may use blocking or non-blocking message passing; blocking message passing may be used for synchronization.
51 5-step: agglomeration
Hotspots Partition Comm Agglomerate Map
Extreme granularity is not a winning scheme. Agglomerating dependent tasks takes their communication off of the network, and increases the computational and memory-usage effectiveness of the processor which handles them. Rule of thumb: make sure there are an order of magnitude more tasks than processors.
52 5-step: map
Hotspots Partition Comm Agglomerate Map
- Optimization:
- Measure performance
- Locate problem areas
- Improve them
- Load balancing (performed by the task scheduler):
- Static load balancing - if the agglomerated tasks run in similar time
- Dynamic load balancing - if the number of tasks is unknown or if there are uneven run-times among the tasks (pool of tasks)
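A sketch of dynamic load balancing with a pool of tasks - workers pull the next task as soon as they finish, so uneven run-times even out (threads stand in for processes here):

```python
# Sketch: pool-of-tasks dynamic load balancing. Workers repeatedly pull
# from a shared queue; a worker that drew short tasks simply draws more,
# so no processor idles while tasks remain.
import queue
import threading

def run_pool(tasks, nworkers):
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()
            except queue.Empty:
                return                 # no tasks left: this worker is done
            r = t()                    # uneven per-task run-times are fine
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

out = run_pool([lambda i=i: i * i for i in range(8)], nworkers=3)
print(sorted(out))   # [0, 1, 4, 9, 16, 25, 36, 49]
```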
53 Project management
- The physical problem becomes more complex when we consider parallelizing; this implies large-scale project planning
- Work in the group should conform to one protocol to allow seamless integration - use the Concurrent Versions System (CVS)
- Study computer resources to their limits: program to fit calculations in the cache, message packets in the NIC buffers, and decompose files to minimize network traffic
- Form a work plan beforehand
54 References
- Foster I., Designing and Building Parallel Programs (http://www-unix.mcs.anl.gov/dbpp/)
- Egan J. I. and Teixeira T. J. (1988), Writing a UNIX Device Driver
- SP Parallel Programming Workshop, Maui High Performance Computing Center (http://www.hku.hk/cc/sp2/workshop/html)
- Amar L., Barak A. and Shiloh A. (2002), The MOSIX Parallel I/O System for Scalable I/O Performance. Proc. 14th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2002), pp. 495-500, Cambridge, MA.