Title: J90 vs. T3E Program Design
1J90 vs. T3E Program Design
- Jonathan Carter
- High Performance Computing Department
2Overview
- J90 and T3E architectures
- Parallel Programming Models
- Models available on each platform
- Designing Parallel Programs
- Examples
- Some results
3J90 Architecture
4J90 Architecture
- 100 MHz, 200 MFlop vector processor
- 20-28 processors in one machine
- Shared memory of 1 Gword (8 byte word)
- SV1 upgrade promises 1 GFlop per processor
- Shared filesystems
5T3E Architecture
6T3E Architecture
- Processing elements (PEs), each composed of a CPU and local
  memory, are connected by a fast 3D torus network
7T3E Architecture
- 450 MHz, 900 MFlop EV5 superscalar processor
- 644 PEs, 512 PE maximum job size
- Distributed memory of 32 Mwords (8 byte words) per PE
- Shared filesystems
8Comparison
- J90
- Shared memory
- Dynamically allocated CPUs
- Time-shared CPU and memory
- T3E
- Distributed memory
- Statically allocated CPUs
- Dedicated CPU and memory
9Programming Environment
- Similar compilers and libraries
- Different tools and data representations
- Subtly different programming models
10Parallel Programming Models I
- Message Passing - a set of processes, each with local data; each process has a unique name and interacts with other processes by sending and receiving messages
- Flexible model - process creation and termination, multiple different programs execute
- Single Program Multiple Data (SPMD) - processes fixed at startup, copies of a single program execute
- Message Passing Interface (MPI), Parallel Virtual Machine (PVM)
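As a minimal illustration of the SPMD model (not from the original slides): every process runs the same program, learns its own rank, and can branch on it.

      program spmd_sketch
      include 'mpif.h'
      integer ierr, mype, npes
      call mpi_init(ierr)
      ! each copy of the program discovers its unique name (rank)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, npes, ierr)
      write(*,*) 'process', mype, ' of', npes
      call mpi_finalize(ierr)
      end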
11Parallel Programming Models II
- Shared Memory - similar to message passing, except that one-sided memory operations (puts and gets) are allowed
- Low latency, high bandwidth and less forced synchronization
- SGI/Cray shared memory library (shmem) - see the sketch below
- Fortran Co-arrays extension (F--)
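A rough sketch of a one-sided put with the shmem library (illustrative only; routine names shmem_my_pe, shmem_n_pes, shmem_put and shmem_barrier_all are given from memory and should be checked against the man pages).

      program shmem_sketch
      integer shmem_my_pe, shmem_n_pes
      integer mype, npes, right
      real source(10), target(10)
      ! target must be "symmetric" (same address on every PE),
      ! e.g. placed in a common block
      common /symm/ target
      mype = shmem_my_pe()
      npes = shmem_n_pes()
      source = real(mype)
      right = mod(mype+1, npes)
      ! one-sided: write source straight into the memory of PE "right";
      ! the remote PE issues no matching receive
      ! (shmem_put moves 64-bit words; default real on the T3E is 64-bit)
      call shmem_put(target, source, 10, right)
      call shmem_barrier_all()
      end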
12Parallel Programming Models III
- Data Parallelism - exploits the fact that often the same operation is applied to all elements of a data structure. For example, adding a scalar to all the elements of a real 1D array can be done in parallel (sketched below).
- High Performance Fortran (HPF) provides a data parallel framework. Focus is on indicating data distribution and indicating what operations can be done in parallel.
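For the scalar-plus-array example just mentioned, a minimal HPF sketch (illustrative only) might be:

      program scalar_add
      real a(1000)
!HPF$ DISTRIBUTE a(BLOCK)
      a = 0.0
      ! data-parallel array assignment: each processor updates
      ! only the block of a that it owns
      a = a + 2.5
      end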
13Parallel Programming Models IV
- Thread-based parallelism - a set of threads of control are spawned by a master process. Threads can access global data, but can also have private data. Only on shared-memory machines. Fine grained parallelism possible.
- Automatic parallelizing compilers
- Proprietary compiler directives and OpenMP
- POSIX threads
14Message Passing Interface - MPI
- A library of routines to write parallel programs using message passing
- Standard supported by most vendors
- MPI can be as simple or as complex as you like - a handful of routines suffice for many programs
- Simple one to one send and receive (cooperative communication) - see the sketch after this list
- Broadcasts and reductions
- MPI I/O is a parallel I/O standard (not on J90s)
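A hedged sketch of the cooperative send and receive mentioned above (illustrative; the message tag 99 and the 100-element buffer are arbitrary choices):

      program sendrecv_sketch
      include 'mpif.h'
      integer ierr, mype, status(MPI_STATUS_SIZE)
      real x(100)
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
      if (mype.eq.0) then
        x = 1.0
        ! cooperative: process 1 must post the matching receive
        call mpi_send(x, 100, MPI_REAL, 1, 99, MPI_COMM_WORLD, ierr)
      else if (mype.eq.1) then
        call mpi_recv(x, 100, MPI_REAL, 0, 99, MPI_COMM_WORLD,
     &                status, ierr)
      end if
      call mpi_finalize(ierr)
      end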
15High Performance Fortran - HPF
- HPF is a data-parallel language. Compiler directives act to distribute data and indicate loops that may be executed in parallel. Some functions are automatically parallelized, other constructs need directives or rearranging.
- High level, no explicit communication required
- Portable, many compilers exist
- Somewhat restrictive, not all algorithms can be specified
- Performance may not be that great. You definitely can't just compile an old Fortran code and hope for the best.
16HPF - Data Distribution
- Consider a 4 processor case
!HPF$ DISTRIBUTE A(BLOCK)
      dimension a(20)
      [figure: a(20) split into four contiguous blocks of five elements, one block on each of P1-P4]
!HPF$ DISTRIBUTE B(BLOCK,*)
      dimension b(8,20)
      [figure: the rows of b(8,20) split into four contiguous blocks of two rows, one block on each of P1-P4]
17HPF - Data Distribution
- Consider a 4 processor case
!HPF$ DISTRIBUTE A(CYCLIC)
      dimension a(20)
      [figure: the elements of a(20) dealt out round-robin, a(1) on P1, a(2) on P2, ..., a(5) on P1, ...]
!HPF$ DISTRIBUTE B(*,CYCLIC)
      dimension b(8,20)
      [figure: the columns of b(8,20) dealt out round-robin over P1-P4]
18Tasking Directives
- Most vendors provide compiler directives to indicate where a region of code may be executed in parallel.
- OpenMP is a standard for Fortran 90 and C, which should lead to portable programs (a minimal example follows below).
- High level, no explicit communication
- Threads can join and leave as program progresses, relaxed approach
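A minimal sketch of a tasking-directive parallel region (OpenMP form, illustrative only):

      program omp_sketch
      integer omp_get_thread_num, omp_get_num_threads
      integer mytid, nthr
      ! threads exist only inside the parallel region and
      ! join the master thread again at its end
!$omp parallel private(mytid, nthr)
      mytid = omp_get_thread_num()
      nthr  = omp_get_num_threads()
      write(*,*) 'thread', mytid, ' of', nthr
!$omp end parallel
      end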
19J90 Programming Models
- Automatic parallelizing compilers (f90, cc, CC)
- Cray and OpenMP compiler directives
- Message Passing Interface (MPI)
- Parallel Virtual Machine (PVM)
- Shared memory library (shmem)
20T3E Programming Models
- Message Passing Interface (MPI)
- High Performance Fortran (HPF)
- Parallel Virtual Machine (PVM)
- Shared memory library (shmem)
- Fortran Co-arrays (F--)
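The Fortran Co-arrays (F--) extension listed above is not shown elsewhere in these slides; a rough sketch, using the co-array syntax later standardized in Fortran 2008 (the T3E F-- compiler may differ in details), looks like:

      program coarray_sketch
      integer :: i
      ! one copy of x exists on every image (PE)
      real :: x[*]
      x = real(this_image())
      sync all
      if (this_image() .eq. 1) then
        do i = 1, num_images()
          ! a co-subscript reads another image's copy directly (one-sided)
          write(*,*) 'x on image', i, ' is', x[i]
        end do
      end if
      end program coarray_sketch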
21Designing Parallel Algorithms
- Partitioning - decompose the computation and the data into tasks.
- Communication - determine communication required to coordinate tasks.
- Agglomeration - evaluate both computation and communication with respect to performance and implementation costs; combine tasks if necessary.
- Mapping - assign tasks to processors either statically or dynamically.
[figure: problem -> partition -> communicate -> agglomerate -> map]
22Partitioning
- Two complementary ways to think about partitioning
- Domain decomposition - seek to divide the data into roughly equal portions per task (see the sketch after this list)
- Functional decomposition - seek to divide the computation into disjoint functions per task
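As a concrete illustration of a simple domain decomposition (the names n, npes, mype are assumed for this sketch), each task can take one contiguous block of the data:

      program block_bounds
      integer n, npes, mype, nlocal, nstart, nfinish
      parameter (n=100, npes=4)
      do mype = 0, npes-1
        nlocal  = n/npes
        nstart  = mype*nlocal + 1
        nfinish = nstart + nlocal - 1
        ! the last task absorbs any remainder
        if (mype.eq.npes-1) nfinish = n
        write(*,*) 'task', mype, nstart, nfinish
      end do
      end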
23Communication
- Types of communication
- Local - needs access to data from one or very few processes
- Global - needs access to data from all or most processes
- Static - an unchanging pattern
- Dynamic - a pattern changing with time
- Regular pattern
- Irregular pattern
24Agglomeration
- Consider
- Is it more efficient or easier to combine certain tasks?
- Is it more efficient or easier to replicate data or computation?
- Issues
- Granularity - computation vs. communication
- Flexibility - don't limit number of tasks or scalability
- Code reuse - seek to reuse old code or algorithms
25Mapping
- Map tasks to physical processors. For the J90 and T3E this is relatively simple, since both systems are homogeneous.
- Simple domain decomposition - fixed number of equal sized tasks, which are agglomerated to form a reasonable number of larger tasks, each of which maps to a process
- Complex domain decomposition - need a load balancing algorithm
- Functional decomposition - need a task-scheduling algorithm
26Example
- Calculate the energy of a system of particles
interacting via a Coulomb potential.
      real coord(3,n), charge(n)
      energy = 0.0
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &      (coord(2,i)-coord(2,j))**2 + (coord(3,i)-coord(3,j))**2)
          energy = energy + charge(i)*charge(j)*rdist
        end do
      end do
27MPI Example 1
- Functional decomposition
- each task will compute the same number of interactions
- accomplish this by dividing up the outer loop
- replicate data to make communication simple
- this approach will not scale
28MPI - Example 1
      include 'mpif.h'
      parameter (n=50000)
      dimension coord(3,n), charge(n)
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, npes, ierr)
      call initdata(n, coord, charge, mype)
      ! each process computes its share of the interactions
      e = energy(mype, npes, n, coord, charge)
      etotal = 0.0
      ! sum the partial energies onto process 0
      call mpi_reduce(e, etotal, 1, MPI_REAL, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierr)
      if (mype.eq.0) write(*,*) etotal
      call mpi_finalize(ierr)
      end
29MPI - Example 1
      subroutine initdata(n, coord, charge, mype)
      include 'mpif.h'
      dimension coord(3,n), charge(n)
      if (mype.eq.0) then
        GENERATE coords, charge
      end if
      ! broadcast data to slaves
      call mpi_bcast(coord, 3*n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      call mpi_bcast(charge, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      return
      end
30MPI - Example 1
      real function energy(mype, npes, n, coord, charge)
      dimension coord(3,n), charge(n)
      ! divide the interactions evenly: the sqrt gives each process a
      ! range of i whose triangular inner loop does equal work
      inter = n*(n-1)/npes
      nstart = nint(sqrt(real(mype*inter))) + 1
      nfinish = nint(sqrt(real((mype+1)*inter)))
      if (mype.eq.npes-1) nfinish = n
      total = 0.0
      do i = nstart, nfinish
        do j = 1, i-1
          ....
          total = total + charge(i)*charge(j)*rdist
        end do
      end do
      energy = total
      return
      end
31MPI - Example 2
- Domain decomposition
- each task takes a chunk of particles
- in turn, receives particle data from another process and computes all interactions between own data and received data
- repeat until all interactions are done
32MPI - Example 2
[figure: ring exchange among Proc 0-4, each owning a 20-particle block (1-20, 21-40, 41-60, 61-80, 81-100); at Step 1, 2, 3, ... each process holds its own block plus a block received from a neighbour and computes the interactions between the two, until every pair of blocks has met]
33
      subroutine initdata(n, coord, charge, mype, npes,
     &                    npepmax, nmax, nmin)
      include 'mpif.h'
      dimension coord(3,n), charge(n)
      integer status(MPI_STATUS_SIZE)
      itag = 0
      isender = 0
      if (mype.eq.0) then
        do ipe = 1, npes-1
          GENERATE coord, charge for PE ipe
          call mpi_send(coord, nj*3, MPI_REAL, ipe, itag,
     &                  MPI_COMM_WORLD, ierror)
          call mpi_send(charge, nj, MPI_REAL, ipe, itag,
     &                  MPI_COMM_WORLD, ierror)
        end do
        GENERATE coord, charge for self
      else
        ! receive particles
        call mpi_recv(coord, 3*n, MPI_REAL, isender, itag,
     &                MPI_COMM_WORLD, status, ierror)
        call mpi_recv(charge, n, MPI_REAL, isender, itag,
     &                MPI_COMM_WORLD, status, ierror)
      endif
      return
      end
34
      niter = npes/2
      do iter = 1, niter
        ! PE to send to and receive from
        if (ipsend.eq.npes-1) then
          ipsend = 0
        else
          ipsend = ipsend + 1
        end if
        if (iprecv.eq.0) then
          iprecv = npes-1
        else
          iprecv = iprecv - 1
        end if
        ! send and receive particles
        call mpi_sendrecv(coordi, 3*n, MPI_REAL, ipsend, itag,
     &       coordj, 3*n, MPI_REAL, iprecv, itag,
     &       MPI_COMM_WORLD, status, ierror)
        call mpi_sendrecv(chargei, n, MPI_REAL, ipsend, itag,
     &       chargej, n, MPI_REAL, iprecv, itag,
     &       MPI_COMM_WORLD, status, ierror)
        ! accumulate energy
        e = e + energy2(n, coordi, chargei, n, coordj, chargej)
      end do
35HPF Example
      parameter (n=50000)
      dimension coord(3,n), charge(n), ep(n)
!HPF$ DISTRIBUTE coord(*,BLOCK)
!HPF$ ALIGN charge(:) WITH coord(*,:)
!HPF$ ALIGN ep(:) WITH coord(*,:)
      call initdata(n, coord, charge)
      e = energy(n, coord, charge, ep)
      write(*,*) e
      stop
      end
36HPF Example
      real function energy(n, coord, charge, ep)
      implicit real (a-h,o-z)
      dimension coord(3,n), charge(n), ep(n)
!HPF$ DISTRIBUTE coord(*,BLOCK)
!HPF$ ALIGN charge(:) WITH coord(*,:)
!HPF$ ALIGN ep(:) WITH coord(*,:)
!HPF$ INDEPENDENT, NEW(rdist, j)
      do i = 1, n
        ep(i) = 0.0
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &      (coord(2,i)-coord(2,j))**2 + (coord(3,i)-coord(3,j))**2)
          ep(i) = ep(i) + charge(i)*charge(j)*rdist
        end do
      end do
      energy = sum(ep)
      return
      end
37Cray Specific Directives
      subroutine energy(n, coord, a)
      implicit real (a-h,o-z)
      dimension coord(3,n), a(n)
      total = 0.0
cmic$ parallel autoscope, shared(total), private(t,i,j,rdist)
      t = 0.0
cmic$ do parallel
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &      (coord(2,i)-coord(2,j))**2 + (coord(3,i)-coord(3,j))**2)
          t = t + a(i)*a(j)*rdist
        end do
      end do
cmic$ guard
      total = total + t
cmic$ end guard
cmic$ end parallel
      write(*,*) ' energy ', total
      return
      end
38OpenMP Directives
      function energy(n, coord, charge)
      dimension coord(3,n), charge(n)
      total = 0.0
!$omp parallel private(rdist,j)
!$omp do schedule(dynamic,64) reduction(+:total)
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &      (coord(2,i)-coord(2,j))**2 + (coord(3,i)-coord(3,j))**2)
          total = total + charge(i)*charge(j)*rdist
        end do
      end do
!$omp end do
!$omp end parallel
      energy = total
      return
      end
39Coulomb Interaction - T3E
40Coulomb Interaction - T3E
41Coulomb Interaction - J90
42Coulomb Interaction - J90
43Strategies
- Simple programs with low software development cost
- Automatic parallelizing compiler or compiler directives on J90
- High Performance Fortran on T3E (probably not optimal performance)
- Complex programs
- Compiler directives on J90
- Redesign with MPI for both J90 and T3E
44Further Information
- General Parallel Programming
- Designing and Building Parallel Programs, by Ian Foster. Addison-Wesley, ISBN 0-201-57594-9
- http://www-unix.mcs.anl.gov/dbpp/
- MPI
- Using MPI, by Gropp, Lusk and Skjellum. MIT Press, ISBN 0-262-57104-8
- MPI - The Complete Reference, Vol 1, by Snir, Otto, Huss-Lederman, Walker, and Dongarra. MIT Press, ISBN 0-262-69216-3
- MPI - The Complete Reference, Vol 2, by Gropp, Huss-Lederman, Lumsdaine, Lusk, Nitzberg, Saphir and Snir. MIT Press, ISBN 0-262-69216-3
- http://www-unix.mcs.anl.gov/mpi/
- HPF
- The High Performance Fortran Handbook, by Koelbel, Loveman, Schreiber, Steele, Jr., and Zosel. MIT Press, ISBN 0-262-61094-9
- http://www.crpc.rice.edu/HPFF/home.html
- Cray Tasking Directives
- CF90 Commands and Directives Reference Manual, SR-3901
- Cray C/C++ Reference Manual, SR-2179
- http://www.cray.com/products/software/publications/
- OpenMP
- CF90 Commands and Directives Reference Manual, SR-3901
- http://www.openmp.org/
- http://www.cray.com/products/software/publications/