Transcript and Presenter's Notes

Title: Parallel coding


1
Parallel coding
  • Approaches to converting sequential programs to
    run on parallel machines

2
Goals
  • Reduce wall-clock time
  • Scalability
  • increase resolution
  • expand space without loss of efficiency

It's all about efficiency
  • Poor data communication
  • Poor load balancing
  • Inherently sequential algorithm nature

3
Efficiency
  • Communication overhead: data transfer is at most
    about 10^-3 of the processing speed
  • Load balancing: an uneven load that is statically
    balanced may cause idle processor time
  • Inherently sequential algorithm nature: if all
    tasks must be performed serially, there is no
    room for parallelization
  • Lack of efficiency can cause a parallel code to
    perform worse than a similar sequential code

4
Scalability
Amdahl's Law states that potential program
speedup is defined by the fraction of code (f)
which can be parallelized
5
Scalability

  speedup = 1 / (1 - f)          (upper bound, unlimited processors)
  speedup = 1 / (P/N + S)        (N processors; P = f is the parallel
                                  fraction, S = 1 - f the serial fraction)

  speedup
  -----------------------------------------
        N      P = .50    P = .90    P = .99
    -----      -------    -------    -------
       10         1.82       5.26       9.17
      100         1.98       9.17      50.25
     1000         1.99       9.91      90.99
    10000         1.99       9.91      99.02

6
Before we start - Framework
  • Code may be influenced/determined by machine
    architecture
  • The need to understand the architecture
  • Choose a programming paradigm
  • Choose the compiler
  • Determine communication
  • Choose the network topology
  • Add code to accomplish task control and
    communications
  • Make sure the code is sufficiently optimized (may
    involve the architecture)
  • Debug the code
  • Eliminate lines that impose unnecessary overhead

7
Before we start - Program
  • If we are starting with an existing serial
    program, debug the serial code completely
  • Identify the parts of the program that can be
    executed concurrently
  • Requires a thorough understanding of the
    algorithm
  • Exploit any inherent parallelism which may exist.
  • May require restructuring of the program and/or
    algorithm. May require an entirely new algorithm.

8
Before we start - Framework
  • Architecture: Intel Xeon, 16 GB distributed
    memory, Rocks cluster
  • Compiler: Intel Fortran / pgf
  • Network: star (mesh?)
  • Overhead: make sure the communication channels
    aren't clogged (net admin)
  • Optimized code: write C code when necessary, use
    the CPU pipelines, use debugged programs

9
Sequential Coding Practice
10
Improvement methods

                      Explicit                            Implicit
  Hardware            Buy/design dedicated drivers        Buy new off-the-shelf products
                      and boards                          every so often
  Instruction sets    Rely on them (MMX/SSE) and          Make use of compiler
                      write assembly language             optimizations
  Memory              Write dedicated memory handlers     Adjust DO-loop paging
  Cache               Branch prediction / prefetching     Adjust cache fetches by ordering
                      algorithms                          the data streams manually
Sequential coding practice
11
The COMMON problem
Problem: COMMON blocks are copied as one chunk of
data each time a process forks. The compiler
doesn't distinguish between active COMMONs and
redundant ones.
Sequential coding practice
12
The COMMON problem
On NUMA (Non-Uniform Memory Access), MPP/SMP
(massively parallel processing / symmetric
multiprocessor) and vector machines this is rarely
an issue. On a distributed computer (a cluster) it
is crucial (the network is congested by this)!
Problem: COMMON blocks are copied as one chunk of
data each time a process forks. The compiler
doesn't distinguish between declared COMMONs and
redundant ones.
Sequential coding practice
13
The COMMON problem
  • Resolution
  • Pass only the data required by the task
  • Functional programming (pass arguments in the
    call), as in the sketch below
  • On shared-memory architectures, use the shmXXX
    system calls
  • On distributed-memory architectures, use message
    passing
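A minimal sketch of the "pass only what you need" idea; the names
(BIGDAT, work_on, nwork) are hypothetical and only illustrate the
pattern:

  ! Before: the whole block travels with every forked process.
  !     COMMON /BIGDAT/ BUF(10000000), NWORK
  ! After: each task receives just the slice it operates on.
  subroutine work_on(nwork, slice)
    implicit none
    integer, intent(in)    :: nwork
    real(8), intent(inout) :: slice(nwork)   ! only this task's data
    slice = 2.0d0 * slice                    ! placeholder for the real work
  end subroutine work_on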

Sequential coding practice
14
Swapping to secondary storage
Problem: swapping is transparent but uncontrolled;
the kernel cannot predict which pages are
needed next, only determine which are needed
frequently.
"Swap space is a way to emulate physical RAM,
right? No, generally swap space is a repository
to hold things from memory when memory is low.
Things in swap cannot be addressed directly and
need to be paged into physical memory before use,
so there's no way swap could be used to emulate
memory. So no, 512M memory + 512M swap is not the
same as 1G memory and no swap." (KernelTrap.org)
Sequential coding practice
15
Swapping to secondary storage - Example
Data: 381 MB x 2
CPU: dual Intel Pentium 3
Speed: 1000 MHz
RAM: 512 MB
Compiler: Intel Fortran
Optimization: -O2 (default)
Sequential coding practice
16
Swapping to secondary storage
For processing 800 MB of data, 1 GB of data travels
at hard-disk rate throughout the run
Sequential coding practice
17
Swapping to secondary storage
Problem: swapping is transparent but uncontrolled;
the kernel cannot predict which pages are
needed next, only determine which are needed
frequently.
"Swap space is a way to emulate physical RAM,
right? No, generally swap space is a repository
to hold things from memory when memory is low.
Things in swap cannot be addressed directly and
need to be paged into physical memory before use,
so there's no way swap could be used to emulate
memory. So no, 512M memory + 512M swap is not the
same as 1G memory and no swap." (KernelTrap.org)
Resolution: prevent swapping by fitting the amount
of data to the user process's RAM size (read and
write temporary files from/to disk).
Sequential coding practice
18
Swapping to secondary storage
On every node: memory size 2 GB, predicted number
of pending jobs 3
Use MOSIX for load balancing
Work with data segments no greater than
600 MB/process (open files + memory + output
buffers)
Sequential coding practice
19
Paging, cache
Problem: like swapping, memory pages go in and
out of the CPU's cache. Again, the compiler cannot
predict the ordering of pages into the cache.
Semi-controlled paging again leads to performance
degradation.
Note: on-board memory is slower than cache memory
(bus speed) but still faster than disk access
Sequential coding practice
20
Paging, cache
Cache size (Xeon): 512 KB. So work in 512 KB
chunks whenever possible (e.g. a 256 x 256
double-precision block).
Problem: like swapping, memory pages go in and
out of the CPU's cache. Again, the compiler cannot
predict the ordering of pages into the cache.
Semi-controlled paging again leads to performance
degradation.
Resolution: prevent paging by fitting the data
size to the CPU cache (see the sketch below)
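A minimal Fortran sketch of the idea (the array name and sizes are
assumptions, not from the slides): the work is tiled into 256 x 256
double-precision blocks, i.e. 256*256*8 bytes = 512 KB, matching the
quoted cache size:

  program cache_blocks
    implicit none
    integer, parameter :: n = 4096, blk = 256   ! one tile = 512 KB
    real(8), allocatable :: a(:, :)
    integer :: ib, jb, i, j
    allocate (a(n, n))
    a = 0.0d0
    do jb = 1, n, blk                           ! loop over tiles
      do ib = 1, n, blk
        do j = jb, min(jb + blk - 1, n)         ! all work stays inside one
          do i = ib, min(ib + blk - 1, n)       ! tile that fits in the cache
            a(i, j) = a(i, j) + 1.0d0
          end do
        end do
      end do
    end do
    print *, sum(a)     ! use the result so the loops are not optimized away
  end program cache_blocks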
Sequential coding practice
21
Example
Sequential coding practice
22
Example - results
Sequential coding practice
23
Workload summary
  • Adjust to cache size
  • Adjust to pages in sequence
  • Adjust to RAM size
  • Control disk activity

(from fastest to slowest)
24
Sparse Arrays
  • Current: dense (full) arrays
  • All array indices are occupied in memory
  • Matrix manipulations are usually element by
    element (no linear-algebra manipulations when
    handling parameters on the grid)

Sequential coding practice
25
Dense Arrays in HUCM - cloud drop size
distribution (F1)
Number of nonzeros: 110,000; load: 5%
Number of nonzeros: 3,700; load: 0.2%
Sequential coding practice
26
Dense Arrays in HUCM - cloud drop size
distribution (F1) - lots of LHOLEs
Number of nonzeros: 110,000; load: 14%
Number of nonzeros: 3,700; load: 0.5%
Sequential coding practice
27
Sparse Arrays
  • Current: dense (full) arrays
  • All array subscripts occupy memory
  • Matrix manipulations are usually element by
    element (no linear-algebra manipulations when
    handling parameters on the grid)
  • Improvement: sparse arrays
  • Only non-zero elements occupy memory cells
    (sparse notation)
  • When calculating algebraic matrices, run the
    profiler to check for performance degradation due
    to the sparse data

Sequential coding practice
28
Sparse Arrays - HOWTO
[Figure: the array as displayed (dense grid) vs. as
actually stored - one (J, I, value) triplet per
nonzero element]
SPARSE is a supported data type in the Intel Math
Kernel Library
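A minimal Fortran sketch of the triplet (coordinate) storage shown in
the figure; the names and the toy values are illustrative only:

  program sparse_coo
    implicit none
    integer, parameter :: nnzmax = 3700       ! e.g. the F1 case above
    integer :: irow(nnzmax), jcol(nnzmax), nnz, k
    real(8) :: val(nnzmax), total
    ! toy fill: in real code the triplets come from scanning the dense grid
    nnz = 3
    irow(1:3) = (/ 5, 7, 12 /)
    jcol(1:3) = (/ 1, 1, 4 /)
    val(1:3)  = (/ 0.1d0, 0.3d0, 2.5d0 /)
    ! element-by-element work touches only the stored nonzeros
    total = 0.0d0
    do k = 1, nnz
      total = total + val(k)      ! value at grid point (irow(k), jcol(k))
    end do
    print *, 'sum of nonzeros =', total
  end program sparse_coo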
Sequential coding practice
29
DO LOOPs
  • Current: DO loops are written with no regard to
    the memory layout. Example: FORTRAN uses
    column-major subscript ordering

[Figure: memory layout vs. the virtual layout of a
2D array (column major)]
Sequential coding practice
30
DO LOOPs
  • Order of the subscript is crucial
  • Data pointer advances many steps
  • Many page faults

[Figure: memory layout vs. the virtual layout of a
2D array (column major)]
Sequential coding practice
31
DO LOOPs
  • Order of the subscript is crucial

[Figure: memory layout vs. the virtual layout of a
2D array (column major)]
Sequential coding practice
32
DO LOOPs - example
(125 MB of data)
Sequential coding practice
33
DO LOOPs
[Figure: wall-clock time split between the DO loop
and the PRINT statement (a system call)]
Sequential coding practice
34
DO LOOPs
  • Improvements
  • Reorder the DO LOOPs, or
  • Rearrange the dimensions of the array:
  • GFF2R(NI, NKR, NK, ICEMAX) ->
  • GFF2R(ICEMAX, NKR, NK, NI)

(the leftmost dimension is the innermost, fastest
running subscript; the rightmost is the outermost,
slowest running one - see the sketch below)
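A minimal sketch of the reordering (array name and sizes are
assumptions): in Fortran the leftmost subscript is contiguous in
memory, so it should be driven by the innermost loop:

  program loop_order
    implicit none
    integer, parameter :: n = 2000
    real(8), allocatable :: a(:, :)
    integer :: i, j
    allocate (a(n, n))
    ! slow: the innermost loop varies the rightmost subscript, so
    ! consecutive iterations are n elements apart in memory (many misses)
    do i = 1, n
      do j = 1, n
        a(i, j) = 1.0d0
      end do
    end do
    ! fast: the innermost loop runs over the leftmost (contiguous) subscript
    do j = 1, n
      do i = 1, n
        a(i, j) = 2.0d0
      end do
    end do
    print *, a(n, n)
  end program loop_order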
Sequential coding practice
35
Parallel Coding Practice
36
Job Scheduling
  • Current
  • Manual batch jobs: hard to track, no monitoring
    or control
  • Improvements
  • Batch scheduling / parameter sweep (e.g. shell
    scripts, NIMROD)
  • EASY/MAUI backfilling job scheduler

Parallel coding practice
37
Load balancing
  • Current
  • Administrative: manual (and rough) load
    balancing (Haim)
  • MPI, PVM: the libraries have no load-balancing
    capabilities; software dependent
  • RAMS: variable grid-point area
  • MM5, MPP - ?
  • WRF - ?
  • File system (NFS): a disaster!!! Client-side
    caching, no segmented file locks, network
    congestion
  • Improvements
  • MOSIX: kernel-level governing, better monitoring
    of jobs, no stray (defunct) residues
  • MOPI + DFSA (not PVFS, and definitely not NFS)

Parallel coding practice
38
NFS client side cache
  • Every node has a non-concurrent mirror of the
    image
  • Write: two writes to the same location may crash
    the system
  • Read: old data may be read

39
Parallel I/O Local / MOPI
MOSIX Parallel I/O System
Parallel coding practice
40
Parallel I/O Local / MOPI
  • Local: can be adapted with a minor change in the
    source code
  • MOPI: needs installation but requires no changes
    in the source code

41
converting sequential to parallel
  • An easy 5-step method
  • Hotspot identification
  • Partition
  • Communication
  • Agglomeration
  • Mapping

Parallel coding practice
42
Hotspots Partition Comm Agglomerate Map
Parallelizing should be done methodically, in a
clean, accurate and meticulous way. However
intuitive parallel programming may seem, it does
not always allow straightforward, automatic,
mechanical methods. One of the approaches is the
methodical approach (Ian Foster). This particular
method maximizes the potential for parallelizing
and provides efficient steps that exploit this
potential. Furthermore, it provides explicit
checklists on completion of each step (not
detailed here).
Parallel coding practice
43
5-step hotspots
Hotspots Partition Comm Agglomerate Map
Identify the hotspots: the parts of a program
which consume the most run time. Our goal here is
to know which code segments can and should be
parallelized.
Why? For example, greatly improving code that
consumes 10% of the run time may increase
performance by at most 10%, whereas optimizing
code that consumes 90% of the runtime may enable
an order-of-magnitude speedup.
How? By algorithm inspection (in theory), by
looking at the code, or by profiling (tools such
as prof or another 3rd-party tool) to identify
bottlenecks.
Parallel coding practice
44
5-step partition1
Hotspots Partition Comm Agglomerate Map
Definition: the ratio between computation and
communication is known as granularity
Parallel coding practice
45
5-step partition2
Hotspots Partition Comm Agglomerate Map
  • Goal: partition the work into the finest-grained
    tasks
  • Why?
  • We want to discover all the available
    opportunities for parallel execution, and to
    provide flexibility when we introduce the
    following steps (communication, memory and other
    requirements will enforce the optimal
    agglomeration and mapping)
  • How?
  • Functional parallelism
  • Data parallelism
  • Data decomposition: sometimes it is easier to
    start off by partitioning the data into segments
    which are not mutually dependent (see the sketch
    below)
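A minimal sketch of data decomposition (the names ni, nproc and rank
are assumptions): each worker derives the contiguous slice of grid
columns it owns:

  program data_decomp
    implicit none
    integer :: ni, nproc, rank, chunk, i_first, i_last
    ni = 1000                      ! grid columns to split
    nproc = 8                      ! number of workers
    do rank = 0, nproc - 1
      chunk   = (ni + nproc - 1) / nproc    ! columns per worker, rounded up
      i_first = rank * chunk + 1
      i_last  = min((rank + 1) * chunk, ni)
      print *, 'worker', rank, 'owns columns', i_first, 'to', i_last
    end do
  end program data_decomp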

Parallel coding practice
46
5-step partition3
Hotspots Partition Comm Agglomerate Map
Parallel coding practice
47
5-step partition4
Hotspots Partition Comm Agglomerate Map
  • Goal: partition the work into the finest-grained
    tasks
  • Why?
  • We want to discover all the available
    opportunities for parallel execution, and to
    provide flexibility when we introduce the
    following steps (communication, memory and other
    requirements will enforce the optimal
    agglomeration and mapping)
  • How?
  • Functional parallelism
  • Data parallelism
  • Functional decomposition: partitioning the
    calculation into segments which are not mutually
    dependent (e.g. integration components are
    evaluated before the integration step)

Parallel coding practice
48
5-step partition5
Hotspots Partition Comm Agglomerate Map
Parallel coding practice
49
5-step communication1
Hotspots Partition Comm Agglomerate Map
  • Communication occurs during data passing and
    synchronization. We strive to minimize data
    communication between tasks, or to make it more
    coarse-grained
  • Sometimes the master process may encounter too
    much incoming traffic. If large data chunks must
    be transferred, try to form hierarchies for
    aggregating the data
  • The most efficient granularity depends on the
    algorithm and on the hardware environment in
    which it runs
  • Decomposing the data has a crucial role here;
    consider revisiting step 2

Parallel coding practice
50
5-step communication2
Hotspots Partition Comm Agglomerate Map
Sending data out to sub-tasks: point-to-point is
best for sending personalized data to each
independent task. Broadcast is a good way to clog
the network (all processors update the data, then
need to send it back to the master), but we may
find good use for it when a large computation can
be performed once and lookup tables can be sent
across the network. Collection is usually used to
perform mathematics like min, max and sum.
Shared-memory systems synchronize using
memory-locking techniques. Distributed-memory
systems may use blocking or non-blocking message
passing; blocking MP may be used for
synchronization (see the sketch below).
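A minimal MPI Fortran sketch (an illustration, not from the slides) of
two of the patterns just mentioned: a broadcast of a once-computed
lookup table and a collection (reduction) of per-task results:

  program comm_patterns
    implicit none
    include 'mpif.h'
    integer :: rank, nproc, ierr
    real(8) :: table(1000), partial, total
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
    ! broadcast: the master computes a lookup table once, sends it to all
    if (rank == 0) table = 1.0d0
    call MPI_BCAST(table, 1000, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
    ! collection: each task contributes a partial result, summed on the master
    partial = sum(table) * (rank + 1)
    call MPI_REDUCE(partial, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)
    if (rank == 0) print *, 'total =', total
    call MPI_FINALIZE(ierr)
  end program comm_patterns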
Parallel coding practice
51
5-step agglomeration
Hotspots Partition Comm Agglomerate Map
Extreme granularity is not a winning scheme.
Agglomeration of dependent tasks takes their
communication requirements off the network, and
increases the computational and memory-usage
effectiveness of the processor that handles them.
Rule of thumb: make sure there are an order of
magnitude more tasks than processors.
Parallel coding practice
52
5-step map
Hotspots Partition Comm Agglomerate Map
  • Optimization
  • Measure performance
  • Locate problem areas
  • Improve them
  • Load balancing (performed by the task scheduler)
  • Static load balancing: if the agglomerated tasks
    run in similar time
  • Dynamic load balancing: if the number of tasks
    is unknown or if there are uneven run-times among
    the tasks (a pool of tasks, as in the sketch
    below)
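A minimal sketch of a dynamic pool of tasks using blocking message
passing (the tags, task contents and counts are assumptions): the
master hands out task indices one at a time, so faster workers
automatically receive more tasks:

  program task_pool
    implicit none
    include 'mpif.h'
    integer, parameter :: TAG_WORK = 1, TAG_STOP = 2
    integer :: rank, nproc, ierr, status(MPI_STATUS_SIZE)
    integer :: ntasks, next, task, w
    real(8) :: result
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
    ntasks = 100                       ! assumed >= number of workers
    if (rank == 0) then                ! master: hand out tasks on demand
      next = 1
      do w = 1, nproc - 1              ! one initial task per worker
        call MPI_SEND(next, 1, MPI_INTEGER, w, TAG_WORK, MPI_COMM_WORLD, ierr)
        next = next + 1
      end do
      do task = 1, ntasks              ! collect results, refill or stop
        call MPI_RECV(result, 1, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, &
                      MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
        w = status(MPI_SOURCE)
        if (next <= ntasks) then
          call MPI_SEND(next, 1, MPI_INTEGER, w, TAG_WORK, MPI_COMM_WORLD, ierr)
          next = next + 1
        else
          call MPI_SEND(0, 1, MPI_INTEGER, w, TAG_STOP, MPI_COMM_WORLD, ierr)
        end if
      end do
    else                               ! worker: loop until told to stop
      do
        call MPI_RECV(task, 1, MPI_INTEGER, 0, MPI_ANY_TAG, &
                      MPI_COMM_WORLD, status, ierr)
        if (status(MPI_TAG) == TAG_STOP) exit
        result = dble(task)**2         ! placeholder for the real work
        call MPI_SEND(result, 1, MPI_DOUBLE_PRECISION, 0, TAG_WORK, &
                      MPI_COMM_WORLD, ierr)
      end do
    end if
    call MPI_FINALIZE(ierr)
  end program task_pool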

Parallel coding practice
53
Project management
  • The physical problem becomes more complex when we
    consider parallelizing
  • This implies large-scale project planning
  • Work in the group should conform to one protocol
    to allow seamless integration; use a Concurrent
    Versions System (CVS)
  • Study computer resources to their limits: program
    to fit calculations in the cache, message packets
    in the NIC buffers, and file decomposition to
    minimize network traffic
  • Form a work plan beforehand

54
References
  • Foster, I., Designing and Building Parallel
    Programs (http://www-unix.mcs.anl.gov/dbpp/)
  • Egan, J. I., and Teixeira, T. J. (1988), Writing
    a UNIX Device Driver
  • SP Parallel Programming Workshop, Maui High
    Performance Computing Center
    (http://www.hku.hk/cc/sp2/workshop/html)
  • Amar, L., Barak, A. and Shiloh, A. (2002), The
    MOSIX Parallel I/O System for Scalable I/O
    Performance. Proc. 14th IASTED International
    Conference on Parallel and Distributed Computing
    and Systems (PDCS 2002), pp. 495-500, Cambridge,
    MA.