Title: J90 vs. T3E Program Design
1J90 vs. T3E Program Design
- Jonathan Carter
- High Performance Computing Department
2Overview
- J90 and T3E architectures
- Parallel Programming Models
- Models available on each platform
- Designing Parallel Programs
- Examples
- Some results
3J90 Architecture
4J90 Architecture
- 100 MHz, 200 MFlop vector processor
- 20-28 processors in one machine
- Shared memory of 1 Gword (8 byte word)
- SV1 upgrade promises 1 GFlop per processor
- Shared filesystems
5T3E Architecture
6T3E Architecture
- Processing elements (PEs), each composed of a CPU and local
  memory, are connected by a fast 3D torus network
7T3E Architecture
- 450 MHz, 900 MFlop EV5 superscalar processor
- 644 PEs, 512 PE maximum job size
- Distributed memory of 32 Mwords (8 byte words) per PE
- Shared filesystems
8Comparison
- J90
- Shared memory
- Dynamically allocated CPUs
- Time-shared CPU and memory
- T3E
- Distributed memory
- Statically allocated CPUs
- Dedicated CPU and memory
9Programming Environment
- Similar compilers and libraries
- Different tools and data representations
- Subtly different programming models
10Parallel Programming Models I
- Message Passing - a set of processes, each with local data; each process has a unique name and interacts with other processes by sending and receiving messages
- Flexible model - process creation and termination, multiple different programs execute
- Single Program Multiple Data (SPMD) - processes fixed at startup, copies of a single program execute
- Message Passing Interface (MPI), Parallel Virtual Machine (PVM)
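As a minimal illustration of the SPMD model (not from the original slides): every process runs the same program, learns its own rank, and can branch on it.

      program spmd_sketch
      include 'mpif.h'
      integer ierr, mype, npes
      call mpi_init(ierr)
      ! each copy of the program discovers its unique name (rank)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, npes, ierr)
      write(*,*) 'process', mype, ' of', npes
      call mpi_finalize(ierr)
      end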
11Parallel Programming Models II
- Shared Memory - similar to message passing, except that one-sided memory operations (puts and gets) are allowed
- Low latency, high bandwidth and less forced synchronization
- SGI/Cray shared memory library (shmem) - see the sketch below
- Fortran Co-arrays extension (F--)
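A rough sketch of a one-sided put with the shmem library (illustrative only; routine names shmem_my_pe, shmem_n_pes, shmem_put and shmem_barrier_all are given from memory and should be checked against the man pages).

      program shmem_sketch
      integer shmem_my_pe, shmem_n_pes
      integer mype, npes, right
      real source(10), target(10)
      ! target must be "symmetric" (same address on every PE),
      ! e.g. placed in a common block
      common /symm/ target
      mype = shmem_my_pe()
      npes = shmem_n_pes()
      source = real(mype)
      right = mod(mype+1, npes)
      ! one-sided: write source straight into the memory of PE "right";
      ! the remote PE issues no matching receive
      ! (shmem_put moves 64-bit words; default real on the T3E is 64-bit)
      call shmem_put(target, source, 10, right)
      call shmem_barrier_all()
      end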
12Parallel Programming Models III
- Data Parallelism - exploits the fact that often the same operation is applied to all elements of a data structure. For example, adding a scalar to all the elements of a real 1D array can be done in parallel (sketched below).
- High Performance Fortran (HPF) provides a data parallel framework. Focus is on indicating data distribution and indicating what operations can be done in parallel.
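For the scalar-plus-array example just mentioned, a minimal HPF sketch (illustrative only) might be:

      program scalar_add
      real a(1000)
!HPF$ DISTRIBUTE a(BLOCK)
      a = 0.0
      ! data-parallel array assignment: each processor updates
      ! only the block of a that it owns
      a = a + 2.5
      end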
13Parallel Programming Models IV
- Thread-based parallelism - a set of threads of control are spawned by a master process. Threads can access global data, but can also have private data. Only on shared-memory machines. Fine grained parallelism possible.
- Automatic parallelizing compilers
- Proprietary compiler directives and OpenMP
- POSIX threads
14Message Passing Interface - MPI
- A library of routines to write parallel programs using message passing
- Standard supported by most vendors
- MPI can be as simple or as complex as you like - a handful of routines suffice for many programs
- Simple one to one send and receive (cooperative communication) - see the sketch after this list
- Broadcasts and reductions
- MPI I/O is a parallel I/O standard (not on J90s)
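A hedged sketch of the cooperative send and receive mentioned above (illustrative; the message tag 99 and the 100-element buffer are arbitrary choices):

      program sendrecv_sketch
      include 'mpif.h'
      integer ierr, mype, status(MPI_STATUS_SIZE)
      real x(100)
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
      if (mype.eq.0) then
        x = 1.0
        ! cooperative: process 1 must post the matching receive
        call mpi_send(x, 100, MPI_REAL, 1, 99, MPI_COMM_WORLD, ierr)
      else if (mype.eq.1) then
        call mpi_recv(x, 100, MPI_REAL, 0, 99, MPI_COMM_WORLD,
     &                status, ierr)
      end if
      call mpi_finalize(ierr)
      end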
15High Performance Fortran - HPF
- HPF is a data-parallel language. Compiler directives act to distribute data and indicate loops that may be executed in parallel. Some functions are automatically parallelized, other constructs need directives or rearranging.
- High level, no explicit communication required
- Portable, many compilers exist
- Somewhat restrictive, not all algorithms can be specified
- Performance may not be that great. You definitely can't just compile an old Fortran code and hope for the best.
16HPF - Data Distribution
- Consider a 4 processor case
!HPF$ DISTRIBUTE A(BLOCK)
      dimension a(20)
      [figure: a(20) split into four contiguous blocks of five elements, one block on each of P1-P4]
!HPF$ DISTRIBUTE B(BLOCK,*)
      dimension b(8,20)
      [figure: the rows of b(8,20) split into four contiguous blocks of two rows, one block on each of P1-P4]
17HPF - Data Distribution
- Consider a 4 processor case
!HPF$ DISTRIBUTE A(CYCLIC)
      dimension a(20)
      [figure: the elements of a(20) dealt out round-robin, a(1) on P1, a(2) on P2, ..., a(5) on P1, ...]
!HPF$ DISTRIBUTE B(*,CYCLIC)
      dimension b(8,20)
      [figure: the columns of b(8,20) dealt out round-robin over P1-P4]
18Tasking Directives
- Most vendors provide compiler directives to indicate where a region of code may be executed in parallel.
- OpenMP is a standard for Fortran 90 and C, which should lead to portable programs (a minimal example follows below).
- High level, no explicit communication
- Threads can join and leave as program progresses, relaxed approach
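A minimal sketch of a tasking-directive parallel region (OpenMP form, illustrative only):

      program omp_sketch
      integer omp_get_thread_num, omp_get_num_threads
      integer mytid, nthr
      ! threads exist only inside the parallel region and
      ! join the master thread again at its end
!$omp parallel private(mytid, nthr)
      mytid = omp_get_thread_num()
      nthr  = omp_get_num_threads()
      write(*,*) 'thread', mytid, ' of', nthr
!$omp end parallel
      end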
19J90 Programming Models
- Automatic parallelizing compilers (f90, cc, CC)
- Cray and OpenMP compiler directives
- Message Passing Interface (MPI)
- Parallel Virtual Machine (PVM)
- Shared memory library (shmem)
20T3E Programming Models
- Message Passing Interface (MPI)
- High Performance Fortran (HPF)
- Parallel Virtual Machine (PVM)
- Shared memory library (shmem)
- Fortran Co-arrays (F--)
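The Fortran Co-arrays (F--) extension listed above is not shown elsewhere in these slides; a rough sketch, using the co-array syntax later standardized in Fortran 2008 (the T3E F-- compiler may differ in details), looks like:

      program coarray_sketch
      integer :: i
      ! one copy of x exists on every image (PE)
      real :: x[*]
      x = real(this_image())
      sync all
      if (this_image() .eq. 1) then
        do i = 1, num_images()
          ! a co-subscript reads another image's copy directly (one-sided)
          write(*,*) 'x on image', i, ' is', x[i]
        end do
      end if
      end program coarray_sketch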
21Designing Parallel Algorithms
- Partitioning - decompose the computation and the data into tasks.
- Communication - determine communication required to coordinate tasks.
- Agglomeration - evaluate both computation and communication with respect to performance and implementation costs; combine tasks if necessary.
- Mapping - assign tasks to processors either statically or dynamically.
[figure: problem -> partition -> communicate -> agglomerate -> map]
22Partitioning
- Two complementary ways to think about partitioning
- Domain decomposition - seek to divide the data into roughly equal portions per task (see the sketch after this list)
- Functional decomposition - seek to divide the computation into disjoint functions per task
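As a concrete illustration of a simple domain decomposition (the names n, npes, mype are assumed for this sketch), each task can take one contiguous block of the data:

      program block_bounds
      integer n, npes, mype, nlocal, nstart, nfinish
      parameter (n=100, npes=4)
      do mype = 0, npes-1
        nlocal  = n/npes
        nstart  = mype*nlocal + 1
        nfinish = nstart + nlocal - 1
        ! the last task absorbs any remainder
        if (mype.eq.npes-1) nfinish = n
        write(*,*) 'task', mype, nstart, nfinish
      end do
      end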
23Communication
- Types of communication
- Local - needs access to data from one or very few processes
- Global - needs access to data from all or most processes
- Static - an unchanging pattern
- Dynamic - a pattern changing with time
- Regular pattern
- Irregular pattern
24Agglomeration
- Consider
- Is it more efficient or easier to combine certain tasks?
- Is it more efficient or easier to replicate data or computation?
- Issues
- Granularity - computation vs. communication
- Flexibility - don't limit number of tasks or scalability
- Code reuse - seek to reuse old code or algorithms
25Mapping
- Map tasks to physical processors. For the J90 and T3E this is relatively simple, since both systems are homogeneous.
- Simple domain decomposition - fixed number of equal sized tasks, which are agglomerated to form a reasonable number of larger tasks, each of which maps to a process
- Complex domain decomposition - need a load balancing algorithm
- Functional decomposition - need a task-scheduling algorithm
26Example
- Calculate the energy of a system of particles
interacting via a Coulomb potential.
      real coord(3,n), charge(n)
      energy = 0.0
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &      (coord(2,i)-coord(2,j))**2 + (coord(3,i)-coord(3,j))**2)
          energy = energy + charge(i)*charge(j)*rdist
        end do
      end do
27MPI Example 1
- Functional decomposition
- each task will compute the same number of interactions
- accomplish this by dividing up the outer loop
- replicate data to make communication simple
- this approach will not scale
28MPI - Example 1
      include 'mpif.h'
      parameter (n=50000)
      dimension coord(3,n), charge(n)
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, npes, ierr)
      call initdata(n, coord, charge, mype)
      ! each process computes its share of the interactions
      e = energy(mype, npes, n, coord, charge)
      etotal = 0.0
      ! sum the partial energies onto process 0
      call mpi_reduce(e, etotal, 1, MPI_REAL, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierr)
      if (mype.eq.0) write(*,*) etotal
      call mpi_finalize(ierr)
      end
29MPI - Example 1
      subroutine initdata(n, coord, charge, mype)
      include 'mpif.h'
      dimension coord(3,n), charge(n)
      if (mype.eq.0) then
        GENERATE coords, charge
      end if
      ! broadcast data to slaves
      call mpi_bcast(coord, 3*n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      call mpi_bcast(charge, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      return
      end
30MPI - Example 1
      real function energy(mype, npes, n, coord, charge)
      dimension coord(3,n), charge(n)
      ! divide the interactions evenly: the sqrt gives each process a
      ! range of i whose triangular inner loop does equal work
      inter = n*(n-1)/npes
      nstart = nint(sqrt(real(mype*inter))) + 1
      nfinish = nint(sqrt(real((mype+1)*inter)))
      if (mype.eq.npes-1) nfinish = n
      total = 0.0
      do i = nstart, nfinish
        do j = 1, i-1
          ....
          total = total + charge(i)*charge(j)*rdist
        end do
      end do
      energy = total
      return
      end
31MPI - Example 2
- Domain decomposition
- each task takes a chunk of particles
- in turn, receives particle data from another process and computes all interactions between own data and received data
- repeat until all interactions are done
32MPI - Example 2
[figure: ring exchange among Proc 0-4, each owning a 20-particle block (1-20, 21-40, 41-60, 61-80, 81-100); at Step 1, 2, 3, ... each process holds its own block plus a block received from a neighbour and computes the interactions between the two, until every pair of blocks has met]
33
      subroutine initdata(n, coord, charge, mype, npes,
     &                    npepmax, nmax, nmin)
      include 'mpif.h'
      dimension coord(3,n), charge(n)
      integer status(MPI_STATUS_SIZE)
      itag = 0
      isender = 0
      if (mype.eq.0) then
        do ipe = 1, npes-1
          GENERATE coord, charge for PE ipe
          call mpi_send(coord, nj*3, MPI_REAL, ipe, itag,
     &                  MPI_COMM_WORLD, ierror)
          call mpi_send(charge, nj, MPI_REAL, ipe, itag,
     &                  MPI_COMM_WORLD, ierror)
        end do
        GENERATE coord, charge for self
      else
        ! receive particles
        call mpi_recv(coord, 3*n, MPI_REAL, isender, itag,
     &                MPI_COMM_WORLD, status, ierror)
        call mpi_recv(charge, n, MPI_REAL, isender, itag,
     &                MPI_COMM_WORLD, status, ierror)
      endif
      return
      end
34
      niter = npes/2
      do iter = 1, niter
        ! PE to send to and receive from
        if (ipsend.eq.npes-1) then
          ipsend = 0
        else
          ipsend = ipsend + 1
        end if
        if (iprecv.eq.0) then
          iprecv = npes-1
        else
          iprecv = iprecv - 1
        end if
        ! send and receive particles
        call mpi_sendrecv(coordi, 3*n, MPI_REAL, ipsend, itag,
     &       coordj, 3*n, MPI_REAL, iprecv, itag,
     &       MPI_COMM_WORLD, status, ierror)
        call mpi_sendrecv(chargei, n, MPI_REAL, ipsend, itag,
     &       chargej, n, MPI_REAL, iprecv, itag,
     &       MPI_COMM_WORLD, status, ierror)
        ! accumulate energy
        e = e + energy2(n, coordi, chargei, n, coordj, chargej)
      end do
35HPF Example
      parameter (n=50000)
      dimension coord(3,n), charge(n), ep(n)
!HPF$ DISTRIBUTE coord(*,BLOCK)
!HPF$ ALIGN charge(:) WITH coord(*,:)
!HPF$ ALIGN ep(:) WITH coord(*,:)
      call initdata(n, coord, charge)
      e = energy(n, coord, charge, ep)
      write(*,*) e
      stop
      end
36HPF Example
      real function energy(n, coord, charge, ep)
      implicit real (a-h,o-z)
      dimension coord(3,n), charge(n), ep(n)
!HPF$ DISTRIBUTE coord(*,BLOCK)
!HPF$ ALIGN charge(:) WITH coord(*,:)
!HPF$ ALIGN ep(:) WITH coord(*,:)
!HPF$ INDEPENDENT, NEW(rdist, j)
      do i = 1, n
        ep(i) = 0.0
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &      (coord(2,i)-coord(2,j))**2 + (coord(3,i)-coord(3,j))**2)
          ep(i) = ep(i) + charge(i)*charge(j)*rdist
        end do
      end do
      energy = sum(ep)
      return
      end
37Cray Specific Directives
      subroutine energy(n, coord, a)
      implicit real (a-h,o-z)
      dimension coord(3,n), a(n)
      total = 0.0
cmic$ parallel autoscope, shared(total), private(t,i,j,rdist)
      t = 0.0
cmic$ do parallel
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &      (coord(2,i)-coord(2,j))**2 + (coord(3,i)-coord(3,j))**2)
          t = t + a(i)*a(j)*rdist
        end do
      end do
cmic$ guard
      total = total + t
cmic$ end guard
cmic$ end parallel
      write(*,*) ' energy ', total
      return
      end
38OpenMP Directives
      function energy(n, coord, charge)
      dimension coord(3,n), charge(n)
      total = 0.0
!$omp parallel private(rdist,j)
!$omp do schedule(dynamic,64) reduction(+:total)
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &      (coord(2,i)-coord(2,j))**2 + (coord(3,i)-coord(3,j))**2)
          total = total + charge(i)*charge(j)*rdist
        end do
      end do
!$omp end do
!$omp end parallel
      energy = total
      return
      end
39Coulomb Interaction - T3E
40Coulomb Interaction - T3E
41Coulomb Interaction - J90
42Coulomb Interaction - J90
43Strategies
- Simple programs with low software development cost
- Automatic parallelizing compiler or compiler directives on J90
- High Performance Fortran on T3E (probably not optimal performance)
- Complex programs
- Compiler directives on J90
- Redesign with MPI for both J90 and T3E
44Further Information
- General Parallel Programming
- Designing and Building Parallel Programs, by Ian Foster. Addison-Wesley, ISBN 0-201-57594-9
- http://www-unix.mcs.anl.gov/dbpp/
- MPI
- Using MPI, by Gropp, Lusk and Skjellum. MIT Press, ISBN 0-262-57104-8
- MPI - The Complete Reference, Vol 1, by Snir, Otto, Huss-Lederman, Walker, and Dongarra. MIT Press, ISBN 0-262-69216-3
- MPI - The Complete Reference, Vol 2, by Gropp, Huss-Lederman, Lumsdaine, Lusk, Nitzberg, Saphir and Snir. MIT Press, ISBN 0-262-69216-3
- http://www-unix.mcs.anl.gov/mpi/
- HPF
- The High Performance Fortran Handbook, by Koelbel, Loveman, Schreiber, Steele, Jr., and Zosel. MIT Press, ISBN 0-262-61094-9
- http://www.crpc.rice.edu/HPFF/home.html
- Cray Tasking Directives
- CF90 Commands and Directives Reference Manual, SR-3901
- Cray C/C++ Reference Manual, SR-2179
- http://www.cray.com/products/software/publications/
- OpenMP
- CF90 Commands and Directives Reference Manual, SR-3901
- http://www.openmp.org/
- http://www.cray.com/products/software/publications/