Title: Parallel coding
1 Parallel coding
- Approaches to converting sequential programs to run on parallel machines
2 Goals
- Reduce wall-clock time
- Scalability
- Increase resolution
- Expand space without loss of efficiency
It's all about efficiency - common obstacles:
- Poor data communication
- Poor load balancing
- Inherently sequential algorithm nature
3 Efficiency
- Communication overhead: data transfer is at most 10^-3 of the processing speed
- Load balancing: an uneven load which is statically balanced may cause idle processor time
- Inherently sequential algorithms: if all tasks must be performed serially, there is no room for parallelization
- Lack of efficiency can cause a parallel code to perform worse than a comparable sequential code
4 Scalability
Amdahl's Law states that potential program
speedup is defined by the fraction of code (f)
which can be parallelized
5 Scalability

    speedup = 1 / (1 - f)

where f is the parallelizable fraction. With N processors, P the parallel fraction and S = 1 - P the serial fraction:

    speedup = 1 / (P/N + S)

    N       P = .50   P = .90   P = .99
    -----   -------   -------   -------
    10      1.82      5.26      9.17
    100     1.98      9.17      50.25
    1000    1.99      9.91      90.99
    10000   1.99      9.91      99.02
6 Before we start - Framework
- Code may be influenced/determined by the machine architecture - hence the need to understand the architecture
- Choose a programming paradigm
- Choose the compiler
- Determine communication
- Choose the network topology
- Add code to accomplish task control and communications
- Make sure the code is sufficiently optimized (may involve the architecture)
- Debug the code
- Eliminate lines that impose unnecessary overhead
7 Before we start - Program
- If we are starting with an existing serial program, debug the serial code completely
- Identify the parts of the program that can be executed concurrently
- Requires a thorough understanding of the algorithm
- Exploit any inherent parallelism which may exist
- May require restructuring of the program and/or algorithm; may require an entirely new algorithm
8 Before we start - Framework
- Architecture: Intel Xeon, 16 GB distributed memory, Rocks Cluster
- Compiler: Intel FORTRAN / pgf
- Network: star (mesh?)
- Overhead: make sure the communication channels aren't clogged (net admin)
- Optimized code: write C code when necessary, use CPU pipelines, use debugged programs
9 Sequential Coding Practice
10 Improvement methods
- Hardware: explicit - buy/design dedicated drivers and boards; implicit - buy new off-the-shelf products every so often
- Instruction sets: explicit - rely on them (MMX/SSE) and write assembly language; implicit - make use of the compiler's optimizations
- Memory: explicit - write dedicated memory handlers; implicit - adjust DO LOOP paging
- Cache: explicit - branch prediction / prefetching algorithms; implicit - adjust cache fetches by ordering the data streams manually
Sequential coding practice
11 The COMMON problem
Problem: COMMON blocks are copied as one chunk of data each time a process forks. The compiler doesn't distinguish between active COMMONs and redundant ones.
12 The COMMON problem
On NUMA (Non-Uniform Memory Access), MPP/SMP (massively parallel processing / symmetric multiprocessor) and vector machines this is rarely an issue. On a distributed computer (cluster) it is crucial - the network is congested by this!
13 The COMMON problem
- Resolution:
- Pass only the required data for the task
- Functional programming (pass arguments on the call)
- On shared memory architectures use shmXXX commands
- On distributed memory architectures use message passing
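A sketch of the "pass only the required data" idea, using a Python dict as a hypothetical stand-in for a large COMMON block (the field names are invented for illustration):

```python
# Sketch: the cost of shipping a whole "COMMON block" with every task,
# vs. passing only the arguments the task needs. The fields below are
# hypothetical stand-ins for a real model's state.
import pickle

common = {
    "grid": [[0.0] * 100 for _ in range(100)],  # needed by the task
    "history": list(range(100_000)),            # redundant for this task
    "dt": 0.1,                                  # needed by the task
}

def step_whole_block(block):
    """Anti-pattern: the entire block travels with the task."""
    return block["grid"][0][0] + block["dt"]

def step_args_only(grid_cell, dt):
    """Pass arguments on the call: only the required data."""
    return grid_cell + dt

# Bytes that would cross the network if the task inputs were serialized:
whole = len(pickle.dumps(common))
needed = len(pickle.dumps((common["grid"][0][0], common["dt"])))
print(whole, needed)   # the argument-only version is far smaller
```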
14 Swapping to secondary storage
Problem: swapping is transparent but uncontrolled - the kernel cannot predict which pages are needed next, only determine which are needed frequently.
"Swap space is a way to emulate physical RAM, right? No - generally swap space is a repository to hold things from memory when memory is low. Things in swap cannot be addressed directly and need to be paged into physical memory before use, so there's no way swap could be used to emulate memory. So no, 512M + 512M swap is not the same as 1G memory and no swap." - KernelTrap.org
15 Swapping to secondary storage - Example
- CPU: dual Intel Pentium 3, 1000 MHz
- RAM: 512 MB
- Compiler: Intel Fortran, optimization -O2 (default)
- Data: 381 MB x 2
16 Swapping to secondary storage
For processing 800 MB of data, 1 GB of data travels at hard-disk rate throughout the run.
17 Swapping to secondary storage
Resolution: prevent swapping by adjusting the amount of data to the user process' RAM size (read and write temporary files from/to disk).
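A minimal sketch of this resolution - staging data in explicit temporary files and processing it in RAM-sized chunks; the budget and the `process` step are placeholders:

```python
# Sketch: keep only one RAM-budget-sized chunk resident at a time,
# staging input and output on disk explicitly instead of letting the
# kernel swap the process. Sizes are tiny stand-ins for illustration.
import tempfile

RAM_BUDGET = 1024  # hypothetical per-process RAM limit, in bytes

def process(chunk: bytes) -> bytes:
    """Placeholder for the real per-chunk computation."""
    return chunk.upper()

def run(data: bytes) -> bytes:
    with tempfile.TemporaryFile() as src, tempfile.TemporaryFile() as dst:
        src.write(data)
        src.seek(0)
        while chunk := src.read(RAM_BUDGET):   # one chunk in memory at a time
            dst.write(process(chunk))
        dst.seek(0)
        return dst.read()   # (gathered here only so the sketch has a result)

result = run(b"abc" * 2000)   # ~6 KB of input processed in ~1 KB chunks
```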
18 Swapping to secondary storage
On every node: memory size 2 GB, predicted number of pending jobs 3.
- Use MOSIX for load balancing
- Work with data segments no greater than 600 MB/process (open files + memory + output buffers)
19 Paging, cache
Problem: as with swapping, memory pages go in and out of the CPU's cache. Again, the compiler cannot predict the ordering of pages into the cache; semi-controlled paging leads again to performance degradation.
Note: on-board memory is slower than cache memory (bus speed) but still faster than disk access.
20 Paging, cache
Cache size (Xeon): 512 KB. So: work in 512 KB chunks whenever possible (e.g. 256 x 256 double precision).
Resolution: prevent paging by adjusting the data size to the CPU cache.
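A sketch of cache-sized blocking, assuming the 512 KB Xeon cache quoted above (256 x 256 doubles x 8 bytes = 512 KB):

```python
# Sketch: walk a large 2D array in cache-sized tiles instead of making
# long strided passes. TILE = 256 gives 256 * 256 * 8 bytes = 512 KB of
# doubles, matching the cache size quoted above.
TILE = 256

def tiles(nrows, ncols, tile=TILE):
    """Yield (row0, row1, col0, col1) bounds of cache-sized blocks."""
    for r in range(0, nrows, tile):
        for c in range(0, ncols, tile):
            yield r, min(r + tile, nrows), c, min(c + tile, ncols)

def blocked_sum(a):
    """Sum a list-of-lists matrix tile by tile."""
    total = 0.0
    for r0, r1, c0, c1 in tiles(len(a), len(a[0])):
        for i in range(r0, r1):
            for j in range(c0, c1):
                total += a[i][j]
    return total
```

The same `tiles` generator can drive any per-block kernel, not just a sum.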
21 Example (figure)
22 Example - results (figure)
23 Workload summary
From fastest to slowest:
- Adjust to cache size
- Adjust to pages in sequence
- Adjust to RAM size
- Control disk activity
24 Sparse Arrays
- Current: dense (full) arrays
- All array indices are occupied in memory
- Matrix manipulations are usually element by element (no linear algebra manipulations when handling parameters on the grid)
25 Dense Arrays in HUCM - Cloud drop size distribution (F1)
- Number of nonzeros: 110,000; load: 5
- Number of nonzeros: 3,700; load: 0.2
26 Dense Arrays in HUCM - Cloud drop size distribution (F1) - lots of holes
- Number of nonzeros: 110,000; load: 14
- Number of nonzeros: 3,700; load: 0.5
27 Sparse Arrays
- Current: dense (full) arrays
- All array subscripts occupy memory
- Matrix manipulations are usually element by element (no linear algebra manipulations when handling parameters on the grid)
- Improvement: sparse arrays
- Only non-zero elements occupy memory cells (sparse notation)
- When calculating algebraic matrices, run the profiler to check for performance degradation due to the sparse data
28 Sparse Arrays - HOWTO
Actual storage: (J, I, val) triplets; the full matrix is only what is displayed.
SPARSE is a supported datatype in the Intel Math Kernel Library.
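The triplet scheme can be sketched in a few lines of Python (coordinate storage; the matrix values are invented for illustration):

```python
# Sketch of coordinate ("J I val" triplet) sparse storage: only the
# nonzeros are kept, as (row, col, value) triplets.
def to_coo(dense):
    """Dense list-of-lists -> list of (i, j, value) triplets."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]

def coo_matvec(triplets, x, nrows):
    """y = A @ x using only the stored nonzeros."""
    y = [0.0] * nrows
    for i, j, v in triplets:
        y[i] += v * x[j]
    return y

a = [[0, 0, 3],
     [0, 5, 0],
     [0, 0, 0]]
nz = to_coo(a)   # 2 stored cells instead of 9
```

For production use, a library format (e.g. the sparse types in Intel MKL, mentioned above) is preferable to hand-rolled triplets.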
29 DO LOOPs
- Current: loops are written with no respect for memory layout. Example: FORTRAN uses column-major subscripts
(Figure: memory layout vs. the virtual layout of a 2D array, column major)
30 DO LOOPs
- The order of the subscripts is crucial
- In the wrong order the data pointer advances many steps between accesses - many page faults
(Figure: memory layout vs. the virtual layout of a 2D array, column major)
31 DO LOOPs
- The order of the subscripts is crucial - traversing in the stored order advances the data pointer one element at a time
(Figure: memory layout vs. the virtual layout of a 2D array, column major)
32 DO LOOPs - example
(Figure: a 125 MB array example)
33 DO LOOPs
(Figure: wall-clock time - the DO loop vs. the print statement, a system call)
34 DO LOOPs
- Improvements:
- Reorder the DO LOOPs, or
- Rearrange the dimensions in an array:
  GFF2R(NI, NKR, NK, ICEMAX) -> GFF2R(ICEMAX, NKR, NK, NI)
- The innermost (fastest) running subscript should be leftmost; the outermost (slowest) running subscript rightmost
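The effect of subscript order can be illustrated by linearizing a 2D array the way FORTRAN does (column major) and measuring how far the data pointer moves between consecutive accesses; a Python sketch:

```python
# Sketch: why subscript order matters. We linearize a 2D array in
# column-major order (FORTRAN: address = i + j * nrows) and total the
# pointer movement for each loop nesting.
NROWS, NCOLS = 4, 3

def addr(i, j):
    """Column-major linear address of element (i, j)."""
    return i + j * NROWS

def stride_sum(order):
    """Total pointer movement for a full traversal in the given order."""
    if order == "i_inner":   # DO j / DO i - matches the layout
        seq = [addr(i, j) for j in range(NCOLS) for i in range(NROWS)]
    else:                    # DO i / DO j - fights the layout
        seq = [addr(i, j) for i in range(NROWS) for j in range(NCOLS)]
    return sum(abs(b - a) for a, b in zip(seq, seq[1:]))

print(stride_sum("i_inner"))   # 11: one step at a time
print(stride_sum("j_inner"))   # large jumps -> page faults, cache misses
```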
35 Parallel Coding Practice
36 Job Scheduling
- Current:
- Manual batch - hard to track, no monitoring of the control
- Improvements:
- Batch scheduling / parameter sweep (e.g. shell scripts, NIMROD)
- EASY/MAUI backfilling job schedulers
Parallel coding practice
37 Load balancing
- Current:
- Administrative - manual (and rough) load balancing (Haim)
- MPI, PVM - no load-balancing capabilities, software dependent
- RAMS - variable grid point area
- MM5, MPP - ?
- WRF - ?
- File system: NFS - a disaster! Client-side caching, no segmented file locks, network congestion
- Improvements:
- MOSIX - kernel-level governing, better monitoring of jobs, no stray (defunct) residues
- MOPI, DFSA (not PVFS, and definitely not NFS)
38 NFS client-side cache
- Every node has a non-concurrent mirror of the image
- Write: two writes to the same location may crash the system
- Read: old data may be read
39 Parallel I/O: Local / MOPI
MOPI - the MOSIX Parallel I/O System
40 Parallel I/O: Local / MOPI
- Local: can be adapted with minor changes in the source code
- MOPI: needs installation but requires no changes in the source code
41 Converting sequential to parallel
- An easy 5-step method:
- Hotspot identification
- Partition
- Communication
- Agglomeration
- Mapping
42 Hotspots Partition Comm Agglomerate Map
Parallelizing should be done methodically, in a clean, accurate and meticulous way. However intuitive parallel programming may be, it does not always allow straightforward, automatic, mechanical methods. One of the approaches is the methodical approach (Ian Foster). This particular method maximizes the potential for parallelizing and provides efficient steps that exploit this potential. Furthermore, it provides explicit checklists on completion of each step (not detailed here).
43 5-step: hotspots
Hotspots Partition Comm Agglomerate Map
Identify the hotspots: find the parts of a program which consume the most run time. Our goal here is to know which code segments can and should be parallelized. Why? Greatly improving code that consumes 10% of the run time may increase performance by 10%, whereas optimizing code that consumes 90% of the run time may enable an order-of-magnitude speedup. How? Algorithm inspection (in theory), by looking at the code, or by profiling (with tools such as prof or a 3rd-party profiler) to identify bottlenecks.
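A profiling sketch using Python's standard-library profiler in place of prof; the two phases are hypothetical stand-ins for parts of a real program:

```python
# Sketch: profiling to find hotspots. cheap_phase and hot_phase are
# hypothetical stand-ins for two phases of a program; the report shows
# which one dominates the run time.
import cProfile
import io
import pstats

def cheap_phase():
    return sum(range(1_000))

def hot_phase():
    return sum(i * i for i in range(200_000))

def program():
    cheap_phase()
    hot_phase()

profiler = cProfile.Profile()
profiler.enable()
program()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats()
report = buf.getvalue()
print(report)   # hot_phase dominates the cumulative-time column
```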
44 5-step: partition (1)
Hotspots Partition Comm Agglomerate Map
Definition: the ratio between computation and communication is known as granularity.
45 5-step: partition (2)
Hotspots Partition Comm Agglomerate Map
- Goal: partition the tasks into the most fine-grained ones
- Why?
- We want to discover all the available opportunities for parallel execution, and to provide flexibility when we introduce the following steps (communication, memory and other requirements will enforce the optimal agglomeration and mapping)
- How?
- Functional parallelism
- Data parallelism
- Data decomposition: sometimes it is easier to start off by partitioning the data into segments which are not mutually dependent
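A minimal sketch of data decomposition - splitting a 1D index domain into nearly equal, mutually independent segments, one per task:

```python
# Sketch: decompose an index domain 0..n into ntasks contiguous,
# non-overlapping segments, as evenly as possible.
def decompose(n, ntasks):
    """Return (start, stop) index ranges covering 0..n across ntasks."""
    base, extra = divmod(n, ntasks)
    bounds, start = [], 0
    for t in range(ntasks):
        stop = start + base + (1 if t < extra else 0)  # first tasks get the remainder
        bounds.append((start, stop))
        start = stop
    return bounds

print(decompose(10, 3))   # [(0, 4), (4, 7), (7, 10)]
```

The same idea extends to 2D/3D grids by decomposing each dimension independently.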
46 5-step: partition (3)
Hotspots Partition Comm Agglomerate Map
(Figure)
47 5-step: partition (4)
Hotspots Partition Comm Agglomerate Map
- Functional decomposition: partitioning the calculation into segments which are not mutually dependent (e.g. integration components are evaluated before the integration step)
48 5-step: partition (5)
Hotspots Partition Comm Agglomerate Map
(Figure)
49 5-step: communication (1)
Hotspots Partition Comm Agglomerate Map
- Communication occurs during data passing and synchronization. We strive to minimize data communication between tasks or to make it more coarse-grained
- Sometimes the master process may encounter too much incoming traffic. If large data chunks must be transferred, try to form hierarchies when aggregating the data
- The most efficient granularity depends on the algorithm and the hardware environment in which it runs
- Decomposing the data has a crucial role here; consider revisiting step 2
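The hierarchy idea can be sketched as a tree reduction, where partial results are combined pairwise in rounds instead of all converging on the master at once:

```python
# Sketch: tree-style aggregation. Instead of every worker sending its
# partial result straight to the master, results are combined pairwise
# in rounds, so no single node receives all the traffic.
def tree_reduce(values, combine):
    vals = list(values)
    while len(vals) > 1:
        nxt = [combine(vals[i], vals[i + 1])
               for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # odd element passes through to the next round
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

partials = [1, 2, 3, 4, 5]         # per-worker partial sums (illustrative)
print(tree_reduce(partials, lambda a, b: a + b))   # 15
```

With N workers the reduction takes O(log N) rounds instead of N messages into one node.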
50 5-step: communication (2)
Hotspots Partition Comm Agglomerate Map
Sending data out to sub-tasks: point-to-point is best for sending personalized data to each independent task. Broadcast is a good way to clog the network (all processors update the data, then need to send it back to the master), but we may find good use for it when a large computation can be performed once and lookup tables sent across the network. Collection is usually used to perform mathematics like min, max and sum. Shared memory systems synchronize using memory-locking techniques. Distributed memory systems may use blocking or non-blocking message passing; blocking message passing may be used for synchronization.
51 5-step: agglomeration
Hotspots Partition Comm Agglomerate Map
Extreme granularity is not a winning scheme. Agglomerating dependent tasks takes their communication off of the network, and increases the computational and memory-usage effectiveness of the processor which handles them. Rule of thumb: make sure there are an order of magnitude more tasks than processors.
52 5-step: map
Hotspots Partition Comm Agglomerate Map
- Optimization:
- Measure performance
- Locate problem areas
- Improve them
- Load balancing (performed by the task scheduler):
- Static load balancing - if the agglomerated tasks run in similar time
- Dynamic load balancing - if the number of tasks is unknown or if there are uneven run-times among the tasks (pool of tasks)
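A sketch of dynamic load balancing with a pool of tasks - workers pull the next task as soon as they finish, so uneven run-times even out (threads stand in for processes here):

```python
# Sketch: pool-of-tasks dynamic load balancing. Workers repeatedly pull
# from a shared queue; a worker that drew short tasks simply draws more,
# so no processor idles while tasks remain.
import queue
import threading

def run_pool(tasks, nworkers):
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()
            except queue.Empty:
                return                 # no tasks left: this worker is done
            r = t()                    # uneven per-task run-times are fine
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

out = run_pool([lambda i=i: i * i for i in range(8)], nworkers=3)
print(sorted(out))   # [0, 1, 4, 9, 16, 25, 36, 49]
```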
53 Project management
- The physical problem becomes more complex when we consider parallelizing; this implies large-scale project planning
- Work in the group should conform to one protocol to allow seamless integration - use the Concurrent Versions System (CVS)
- Study computer resources to their limits: program to fit calculations in the cache, message packets in the NIC buffers, and decompose files to minimize network traffic
- Form a work plan beforehand
54 References
- Foster I., Designing and Building Parallel Programs (http://www-unix.mcs.anl.gov/dbpp/)
- Egan J. I. and Teixeira T. J. (1988), Writing a UNIX Device Driver
- SP Parallel Programming Workshop, Maui High Performance Computing Center (http://www.hku.hk/cc/sp2/workshop/html)
- Amar L., Barak A. and Shiloh A. (2002), The MOSIX Parallel I/O System for Scalable I/O Performance. Proc. 14th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2002), pp. 495-500, Cambridge, MA.