Title: High Performance Fortran (HPF)
1 High Performance Fortran (HPF)
- Source: Chapter 7 of "Designing and Building Parallel Programs" (Ian Foster, 1995)
2 Question
- Can't we just have a clever compiler generate a parallel program from a sequential program?
- Fine-grained parallelism
  - x = a*b + c*d
- Trivial parallelism
  - for i = 1 to 100 do
      for j = 1 to 100 do
        C[i,j] = dotproduct(A[i,:], B[:,j])
      od
    od
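- As a concrete illustration (a minimal sketch of my own, not taken from the slides), the loop nest above can be written in Fortran 90; every C(i,j) depends only on a row of A and a column of B, so all 10,000 iterations are independent:

    program trivial_parallelism
      implicit none
      integer :: i, j
      real :: A(100,100), B(100,100), C(100,100)

      call random_number(A)   ! arbitrary test data
      call random_number(B)

      ! Each iteration computes one independent dot product,
      ! so the whole nest could run in parallel.
      do i = 1, 100
         do j = 1, 100
            C(i,j) = dot_product(A(i,:), B(:,j))
         end do
      end do

      print *, 'C(1,1) =', C(1,1)
    end program trivial_parallelism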
3 Automatic parallelism
- Automatic parallelization of arbitrary programs is extremely hard
- Solutions
  - Make restrictions on the source program
  - Restrict the kind of parallelism used
  - Use a semi-automatic approach
  - Use application-domain-oriented languages
4 High Performance Fortran (HPF)
- Designed by a forum from industry, government, and universities
- Extends Fortran 90
- Intended for computationally expensive numerical applications
- Portable to SIMD machines, vector processors, shared-memory MIMD, and distributed-memory MIMD
5 Fortran 90 - Base language of HPF
- Extends Fortran 77 with 'modern' features
  - abstract data types, modules
  - recursion
  - pointers, dynamic storage
- Array operators
  - A = B + C
  - A = A + 1.0
  - A(1:7) = B(1:7) + B(2:8)
  - WHERE (X /= 0) X = 1.0/X
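- The array operators above are shown in isolation; the following self-contained Fortran 90 sketch (array sizes and values are my own choices) puts them in a complete program:

    program array_ops
      implicit none
      real :: A(8), B(8), C(8), X(8)

      B = 1.0
      C = 2.0
      X = (/ 0.0, 1.0, 2.0, 0.0, 4.0, 5.0, 0.0, 7.0 /)

      A = B + C                  ! whole-array addition
      A = A + 1.0                ! scalar broadcast over an array
      A(1:7) = B(1:7) + B(2:8)   ! operation on array sections
      WHERE (X /= 0) X = 1.0/X   ! masked assignment: only nonzero elements

      print *, A
      print *, X
    end program array_ops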
6 Data parallelism
- Data parallelism: the same operation is applied to different data elements in parallel
- Data parallel program: a sequence of data parallel operations
- Overall approach
  - Programmer does the domain decomposition
  - Compiler partitions the operations automatically
- Data may be regular (array) or irregular (tree, sparse matrix)
- Most data parallel languages only deal with arrays
7 Data parallelism - Concurrency
- Explicit parallel operations
  - A = B + C   ! A, B, and C are arrays
- Implicit parallelism
  - do i = 1,m
      do j = 1,n
        A(i,j) = B(i,j) + C(i,j)
      enddo
    enddo
8 Compiling data parallel programs
- Programs are translated automatically into parallel SPMD (Single Program Multiple Data) programs
- Each processor executes the same program on a subset of the data
- Owner-computes rule (see the sketch below)
  - Each processor owns a subset of the data structures
  - Operations required for an element are executed by its owner
  - Each processor may read (but not modify) other elements
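- A sketch of the kind of SPMD code a compiler might generate under the owner-computes rule; the block-size formula and the names myid and P are my own assumptions, not mandated by HPF:

    ! Hypothetical owner-computes update of a block-distributed array X(N).
    ! myid (0..P-1) and P are assumed to come from the SPMD runtime.
    subroutine update_owned(X, N, myid, P)
      implicit none
      integer, intent(in)    :: N, myid, P
      real,    intent(inout) :: X(N)
      integer :: blk, lb, ub, i

      blk = (N + P - 1) / P        ! block size = ceiling(N/P)
      lb  = myid*blk + 1           ! first element this processor owns
      ub  = min((myid+1)*blk, N)   ! last element this processor owns

      do i = lb, ub                ! only the owner updates its elements
         X(i) = X(i) * 3.0
      end do
    end subroutine update_owned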
9 Example
- real s, X(100), Y(100)        ! s is a scalar, X and Y are arrays
  X = X * 3.0                   ! multiply each X(i) by 3.0
  do i = 2,99
    Y(i) = (X(i-1) + X(i+1))/2  ! communication required
  enddo
  s = SUM(X)                    ! communication required
- X and Y are distributed (partitioned)
- s is replicated on each machine
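- A sketch of the same fragment with HPF directives added (the BLOCK distribution over 4 processors and the initialisation of X are my own assumptions); the compiler would then generate the neighbour exchange for Y(i) and the global reduction for SUM:

    program distributed_example
      implicit none
      real    :: s, X(100), Y(100)
      integer :: i
      !HPF$ PROCESSORS pr(4)
      !HPF$ DISTRIBUTE X(BLOCK) ONTO pr
      !HPF$ ALIGN Y(:) WITH X(:)

      X = 1.0                        ! initialisation added for the sketch
      X = X * 3.0                    ! purely local: each owner scales its elements
      do i = 2, 99
         Y(i) = (X(i-1) + X(i+1))/2  ! needs boundary elements from neighbours
      end do
      s = SUM(X)                     ! global reduction; result replicated everywhere
    end program distributed_example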
10 HPF primitives for data distribution
- Directives
  - PROCESSORS: shape and size of the abstract processors
  - ALIGN: align elements of different arrays
  - DISTRIBUTE: distribute (partition) an array
- Directives affect the performance of the program, not its result
11 Processors directive
- !HPF$ PROCESSORS P(32)
- !HPF$ PROCESSORS Q(4,8)
- Mapping of abstract to physical processors is not specified in HPF (implementation-dependent)
12 Alignment directive
- Aligns an array with another array
- Specifies that specific elements should be mapped to the same processor
- real A(50), B(50), C(50,50)
  !HPF$ ALIGN A(I) WITH B(I)
  !HPF$ ALIGN A(I) WITH B(I+2)
  !HPF$ ALIGN A(:) WITH C(:,*)
13 Figure 7.6 from Foster's book
14 Distribution directive
- Specifies how elements should be partitioned among the local memories
- Each dimension can be distributed as follows (see the sketch below)
  - * : no distribution
  - BLOCK(n) : block distribution
  - CYCLIC(n) : cyclic distribution
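- A small sketch combining the three per-dimension formats (the array sizes, names, and processor shape are my own choices):

    program distribute_examples
      real D(12,12), E(12,12), F(12,12)
      !HPF$ PROCESSORS q(4)
      !HPF$ DISTRIBUTE D(BLOCK, *)     ONTO q   ! rows in contiguous blocks; columns not distributed
      !HPF$ DISTRIBUTE E(CYCLIC, *)    ONTO q   ! rows dealt out round-robin over the 4 processors
      !HPF$ DISTRIBUTE F(CYCLIC(2), *) ONTO q   ! rows dealt out in round-robin chunks of 2
      D = 0.0; E = 0.0; F = 0.0
      end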
15 Figure 7.7 from Foster's book
16 Example: Successive Over-Relaxation (SOR)
- Recall the algorithm discussed in the Introduction:

    float G[1:N, 1:M], Gnew[1:N, 1:M];
    for (step = 0; step < NSTEPS; step++)
      for (i = 2; i < N; i++)          /* update grid */
        for (j = 2; j < M; j++)
          Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
      G = Gnew;
17 Parallel SOR with message passing
- float G[lb-1:ub+1, 1:M], Gnew[lb-1:ub+1, 1:M];
  for (step = 0; step < NSTEPS; step++)
    SEND(cpuid-1, G[lb]);          /* send 1st row left   */
    SEND(cpuid+1, G[ub]);          /* send last row right */
    RECEIVE(cpuid-1, G[lb-1]);     /* receive from left   */
    RECEIVE(cpuid+1, G[ub+1]);     /* receive from right  */
    for (i = lb; i <= ub; i++)     /* update my rows      */
      for (j = 2; j < M; j++)
        Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
    G = Gnew;
18 Finite differencing (similar to SOR) in HPF
- See Ian Foster, Program 7.2 (uses a convergence criterion instead of a fixed number of steps, as sketched below)
- program hpf_finite_difference
  !HPF$ PROCESSORS pr(4)                    ! use 4 CPUs
  real X(100, 100), New(100, 100)           ! data arrays
  !HPF$ ALIGN New(:,:) WITH X(:,:)
  !HPF$ DISTRIBUTE X(BLOCK,*) ONTO pr       ! row-wise
  New(2:99, 2:99) = (X(1:98, 2:99) + X(3:100, 2:99) + &
                     X(2:99, 1:98) + X(2:99, 3:100))/4
  diffmax = MAXVAL(ABS(New - X))
  end
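- A sketch of how the update might be wrapped in a convergence loop (the tolerance, initialisation, and loop structure are my own guesses, not Program 7.2 verbatim):

    program hpf_fd_converge
      !HPF$ PROCESSORS pr(4)
      real X(100,100), New(100,100), diffmax
      !HPF$ ALIGN New(:,:) WITH X(:,:)
      !HPF$ DISTRIBUTE X(BLOCK,*) ONTO pr
      X = 0.0
      X(1,:) = 1.0                          ! assumed boundary condition
      New = X
      diffmax = 1.0
      do while (diffmax > 0.001)            ! iterate until the grid stops changing
         New(2:99, 2:99) = (X(1:98, 2:99) + X(3:100, 2:99) + &
                            X(2:99, 1:98) + X(2:99, 3:100))/4
         diffmax = MAXVAL(ABS(New - X))     ! global reduction over distributed arrays
         X = New
      end do
      print *, 'converged, diffmax =', diffmax
      end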
19 Changing the distribution
- Use a block distribution instead of a row distribution
- program hpf_finite_difference
  !HPF$ PROCESSORS pr(2,2)                    ! use a 2x2 grid
  real X(100, 100), New(100, 100)             ! data arrays
  !HPF$ ALIGN New(:,:) WITH X(:,:)
  !HPF$ DISTRIBUTE X(BLOCK, BLOCK) ONTO pr    ! block-wise
  New(2:99, 2:99) = (X(1:98, 2:99) + X(3:100, 2:99) + &
                     X(2:99, 1:98) + X(2:99, 3:100))/4
  diffmax = MAXVAL(ABS(New - X))
  end
20 Performance
- Distribution affects
  - Load balance
  - Amount of communication
- Example (communication costs):
    !HPF$ PROCESSORS pr(3)
    integer A(8), B(8), C(8)
    !HPF$ ALIGN B(:) WITH A(:)
    !HPF$ DISTRIBUTE A(BLOCK) ONTO pr
    !HPF$ DISTRIBUTE C(CYCLIC) ONTO pr
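- Worked mapping (my own illustration, assuming the usual HPF block size of ceiling(8/3) = 3):
    A, B (BLOCK):  pr(1) owns 1-3    pr(2) owns 4-6    pr(3) owns 7-8
    C (CYCLIC):    pr(1) owns 1,4,7  pr(2) owns 2,5,8  pr(3) owns 3,6
  So an element-wise statement such as A = A + B needs no communication (B is aligned with A), whereas A = A + C must fetch most elements of C from other processors.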
21 Figure 7.9 from Foster's book
22 Historical Evaluation
- See "The rise and fall of High Performance Fortran: an historical object lesson" by Ken Kennedy, Charles Koelbel, and Hans Zima. In Proceedings of the Third ACM SIGPLAN Conference on History of Programming Languages, June 2007.
- Optional; obtainable from the ACM Digital Library
23 Problems with HPF
- Immature compiler technology
  - Upgrading to Fortran 90 was complicated
  - Implementing the HPF extensions took much time
- The HPC community was impatient and started using MPI
- Missing features
  - Support for sparse arrays and other irregular data structures
- Obtaining portable performance was difficult
- Performance tuning was difficult
24 Impact of HPF
- Huge impact on parallel language design
- Very frequently cited
- Some impact on OpenMP (shared-memory standard)
- New wave of High Productivity Computing Systems (HPCS) languages: Chapel (Cray), Fortress (Sun), X10 (IBM)
- Used in extended form (HPF/JA) for the Japanese Earth Simulator
25 Conclusions
- High-level model
  - User specifies the data distribution
  - Compiler generates the parallel program and the communication
- More restrictive than the general message-passing model (only data parallelism)
  - Restricted to array-based data structures
- HPF programs are easy to modify, which enhances portability
  - Changing the data distribution only requires changing the directives