Title: APPM 4660
Turbulence Simulation and High-Performance Computing
- Joe Werne
- Colorado Research Associates Division (CoRA)
- NorthWest Research Associates, Inc. (NWRA)
- 3380 Mitchell Lane, Boulder, CO 80301
- werne_at_cora.nwra.com
- Computer power doubles every 1.5-2.0 years
- Network bandwidth doubles every 0.75-1.0 years
[Figure: example machines]
- 1978 Cray I: 160 million instructions per second, $20,000,000
- 1993 Cray C90: 16 PEs, 1 Gflops each, liquid-H2O cooled, $32M
- 1993 Cray T3D: 256 PEs, 146 Mflops each (37.5 Gflops total), $16M
- 1995 Sony PlayStation: 500 million instructions per second, $299
- 1991 2D convection, Cray YMP (2.66 Gflops): 10^2 hrs, 0.5 week
- 1996 rotating convection, Cray C90 (16 Gflops): 10^3 hrs, 1 month
- 1998 gravity-wave breaking, SGI O2 (50 Gflops): 10^4 hrs, 1 year
- 2000 wind shear, Cray T3E (600 Gflops): 10^5 hrs, 10 years
- 2005 wind shear, IBM P4 (20 Tflops): 10^6 hrs, 100 years
- Animations demonstrating turbulence simulation
- 2D convection: 1991 Cray YMP (10^2 hrs, 0.5 wk)
- rotating convection: 1996 Cray C90 (10^3 hrs, 1 mo)
- gravity-wave breaking: 1998 SGI Power Onyx (10^4 hrs, 1 year)
- KH (Kelvin-Helmholtz): 2000 Cray T3E (10^5 hrs, 10 years)
- KH: 2005 IBM P4 (10^6 hrs, 100 years)
[Animation panels: 3D wind shear, 3D wave breaking, 3D rotating convection, 2D convection]
- Characteristics of turbulent motion
- nonlinear, chaotic solutions
- large range of length and time scales
- probabilistic description useful/necessary
Resolution requirement: k_d/k_l ∝ Re^(3/4), so grid spacing dx ∝ Re^(-3/4).
Re(simulation) ~ 10^4 - 10^5; Re(stratosphere) ~ 10^7 - 10^8.
Cost ∝ dx^(-4) ∝ Re^3.
- Finite-difference versus spectral resolution
Finite difference: from the Taylor expansion
    f(x) = f(x0) + f'(x0) h + (1/2) f''(x0) h^2 + (1/6) f'''(x0) h^3 + ...
we obtain
    f'_j  ≈ (f_{j+1} - f_{j-1}) / (2h),       error O(h^2)
    f''_j ≈ (f_{j+1} - 2 f_j + f_{j-1}) / h^2, error O(h^2)
Spectral method:
    f(x)   =  Σ_k F(k) exp(ikx)
    f'(x)  =  Σ_k ik F(k) exp(ikx)
    f''(x) = -Σ_k k^2 F(k) exp(ikx)
    2π F(k) = ∫ f(x) exp(-ikx) dx
and if we integrate by parts, we have
    2π F(k) = (ik)^(-r) ∫ [d^r f(x)/dx^r] exp(-ikx) dx
so if the nth derivative is bounded, F(k) decays at least as fast as k^(-n), and the error is O(h^n).
All the numerical solutions presented were
computed with spectral methods.
- Getting big computers to run fast for you
- Two important issues:
- single-processor optimization
- parallelization efficiency
This used to be easy when Cray dominated the supercomputer market with vector processors and intelligent compilers. Now compilers are stupid and machine architectures are more complex. Even then, though, one thing you absolutely had to pay attention to was memory management. This is still the case!
- Getting big computers to run fast for you: Cray YMP
[Diagram: memory banks and CPUs]
This kind of gang-memory access kills performance because of inter-processor interference.
- Getting big computers to run fast for you: Cray YMP
[Diagram: memory banks and CPUs]
This kind of memory access is 8 times faster, but what changes must you make to your code?
- Getting big computers to run fast for you: Cray C90
As the number of processors grows, index-level parallelization becomes less effective. We must change the index order, dedicating the outer index to parallelization.
[Diagram: memory banks and CPUs]
- Getting big computers to run fast for you: MPP
MPPs heralded a paradigm shift: fast networks connecting many cheap, off-the-shelf CPUs. Shared memory was no more. Now memory is integral to each CPU, and managing cache memory is extremely important.
[Diagram: networked CPUs]
- Getting big computers to run fast for you
- Modern MPPs and cache thrashing
[Diagram: CPUs, cache memory (e.g., 4 words per line), and main memory (large, but much slower access), computing a(i) = b(i) + c(i)]
Register memory is faster still than cache, so learn a little about the platform on which you are running, and look for ways (like this) to maximize cache and register re-use. And be wary of profiling tools! Take-home message: high-performance computing is mostly about intelligent memory management.
- Getting big computers to run fast for you
- Modern MPPs and stupid compilers
Look at the documentation for single-processor optimization techniques. You may be surprised at what they suggest. For example,
    a(i) = a(i)*b(i) + c(i)*d(i) + e(i)*f(i)
may run much slower than
    a(i) = ( ( ( a(i)*b(i) ) + c(i)*d(i) ) + e(i)*f(i) )
- Getting big computers to run fast for you
- Parallelization
SPMD -- single program, multiple data: each PE (PE0 through PE5) runs an identical copy of the same program on its own portion of the data.

    program main
    !-----------------------------
          dimension a(nx,ny,nz)
          dimension b(nx,ny,nz)
          dimension c(nx,ny,nz)
    !-----------------------------
          do i=1,nx
            do j=1,ny
              do k=1,nz
                a(i,j,k) = b(i,j,k) + c(i,j,k)
              enddo
            enddo
          enddo
    !-----------------------------
          call subarg(a,b,c,nx,ny,nz)
          call sub2ar(b,c,nx,ny,nz)
    !-----------------------------
          stop
          end

(The same listing is replicated on each of PE0-PE5.)
- Getting big computers to run fast for you
- Parallelization
Amdahl's Law (for fixed total problem size):
    T(NCPU) = f*T/NCPU + (1-f)*T
where
    T(NCPU) = wall-clock time when running on NCPU processors
    T       = wall-clock time when running on 1 processor
    f       = parallel fraction of the code
    NCPU    = number of processors
Parallel, spectral, 3D Navier-Stokes solver
- Fully spectral (3D FFTs: ~75% of the computation)
- Radix 2, 3, 4, 5 FFTs
- Spectral modes and NCPU must be commensurate
- Communication: shmem, global transpose, data reduction
- Parallel I/O every 60 dt
Parallel, spectral, 3D Navier-Stokes solver