Title: APPM 4660
Turbulence Simulation and High-Performance Computing
- Joe Werne
- Colorado Research Associates Division (CoRA)
- NorthWest Research Associates, Inc. (NWRA)
- 3380 Mitchell Lane, Boulder, CO 80301
- werne_at_cora.nwra.com
- Computer power doubles every 1.5-2.0 years
- Network bandwidth doubles every 0.75-1.0 years
[Figure: example machines]
- 1978 Cray I: 160 million instructions per second, $20,000,000
- 1993 Cray C90: 16 PEs, 1 Gflops each, liquid-H2O cooled, $32M
- 1993 Cray T3D: 256 PEs, 146 Mflops each (37.5 Gflops total), $16M
- 1995 Sony PlayStation: 500 million instructions per second, $299
- 1991 2D convection, Cray YMP (2.66 Gflops): 10^2 hrs, 0.5 week
- 1996 rotating convection, Cray C90 (16 Gflops): 10^3 hrs, 1 month
- 1998 gravity-wave breaking, SGI O2 (50 Gflops): 10^4 hrs, 1 year
- 2000 wind shear, Cray T3E (600 Gflops): 10^5 hrs, 10 years
- 2005 wind shear, IBM P4 (20 Tflops): 10^6 hrs, 100 years
- Animations demonstrating turbulence simulation
- 2D convection: 1991 Cray YMP (10^2 hrs, 0.5 wk)
- rotating convection: 1996 Cray C90 (10^3 hrs, 1 mo)
- gravity-wave breaking: 1998 SGI Power Onyx (10^4 hrs, 1 year)
- KH (Kelvin-Helmholtz): 2000 Cray T3E (10^5 hrs, 10 years)
- KH: 2005 IBM P4 (10^6 hrs, 100 years)
[Animation panels: 3D wind shear, 3D wave breaking, 3D rotating convection, 2D convection]
- Characteristics of turbulent motion
- nonlinear, chaotic solutions
- large range of length and time scales
- probabilistic description useful/necessary
Resolution requirement: k_d/k_l ∝ Re^(3/4), so grid spacing dx ∝ Re^(-3/4).
Re(simulation) ~ 10^4 - 10^5; Re(stratosphere) ~ 10^7 - 10^8.
Cost ∝ dx^(-4) ∝ Re^3.
- Finite-difference versus spectral resolution
Finite difference: from the Taylor expansion
    f(x) = f(x0) + f'(x0) h + (1/2) f''(x0) h^2 + (1/6) f'''(x0) h^3 + ...
we obtain
    f'_j  ≈ (f_{j+1} - f_{j-1}) / (2h),       error O(h^2)
    f''_j ≈ (f_{j+1} - 2 f_j + f_{j-1}) / h^2, error O(h^2)
Spectral method:
    f(x)   =  Σ_k F(k) exp(ikx)
    f'(x)  =  Σ_k ik F(k) exp(ikx)
    f''(x) = -Σ_k k^2 F(k) exp(ikx)
    2π F(k) = ∫ f(x) exp(-ikx) dx
and if we integrate by parts, we have
    2π F(k) = (ik)^(-r) ∫ [d^r f(x)/dx^r] exp(-ikx) dx
so if the nth derivative is bounded, F(k) decays at least as fast as k^(-n), and the error is O(h^n).
All the numerical solutions presented were
computed with spectral methods.
- Getting big computers to run fast for you
- Two important issues:
- single-processor optimization
- parallelization efficiency
This used to be easy when Cray dominated the supercomputer market with vector processors and intelligent compilers. Now compilers are stupid and machine architectures are more complex. Even then, though, one thing you absolutely had to pay attention to was memory management. This is still the case!
- Getting big computers to run fast for you: Cray YMP
[Diagram: memory banks and CPUs]
This kind of gang-memory access kills performance because of inter-processor interference.
- Getting big computers to run fast for you: Cray YMP
[Diagram: memory banks and CPUs]
This kind of memory access is 8 times faster, but what changes must you make to your code?
- Getting big computers to run fast for you: Cray C90
As the number of processors grows, index-level parallelization becomes less effective. We must change the index order, dedicating the outer index to parallelization.
[Diagram: memory banks and CPUs]
- Getting big computers to run fast for you: MPP
MPPs heralded a paradigm shift: fast networks connecting many cheap, off-the-shelf CPUs. Shared memory was no more. Now memory is integral to each CPU, and managing cache memory is extremely important.
[Diagram: networked CPUs]
- Getting big computers to run fast for you
- Modern MPPs and cache thrashing
[Diagram: CPUs, cache memory (e.g., 4 words per line), and main memory (large, but much slower access), computing a(i) = b(i) + c(i)]
Register memory is faster still than cache, so learn a little about the platform on which you are running, and look for ways (like this) to maximize cache and register re-use. And be wary of profiling tools! Take-home message: high-performance computing is mostly about intelligent memory management.
- Getting big computers to run fast for you
- Modern MPPs and stupid compilers
Look at the documentation for single-processor optimization techniques. You may be surprised at what they suggest. For example,
    a(i) = a(i)*b(i) + c(i)*d(i) + e(i)*f(i)
may run much slower than
    a(i) = ( ( ( a(i)*b(i) ) + c(i)*d(i) ) + e(i)*f(i) )
- Getting big computers to run fast for you
- Parallelization
SPMD -- single program, multiple data: each PE (PE0 through PE5) runs an identical copy of the same program on its own portion of the data.

    program main
    !-----------------------------
          dimension a(nx,ny,nz)
          dimension b(nx,ny,nz)
          dimension c(nx,ny,nz)
    !-----------------------------
          do i=1,nx
            do j=1,ny
              do k=1,nz
                a(i,j,k) = b(i,j,k) + c(i,j,k)
              enddo
            enddo
          enddo
    !-----------------------------
          call subarg(a,b,c,nx,ny,nz)
          call sub2ar(b,c,nx,ny,nz)
    !-----------------------------
          stop
          end

(The same listing is replicated on each of PE0-PE5.)
- Getting big computers to run fast for you
- Parallelization
Amdahl's Law (for fixed total problem size):
    T(NCPU) = f*T/NCPU + (1-f)*T
where
    T(NCPU) = wall-clock time when running on NCPU processors
    T       = wall-clock time when running on 1 processor
    f       = parallel fraction of the code
    NCPU    = number of processors
Parallel, spectral, 3D Navier-Stokes solver
- Fully spectral (3D FFTs: ~75% of the computation)
- Radix 2, 3, 4, 5 FFTs
- Spectral modes and NCPU must be commensurate
- Communication: shmem, global transpose, data reduction
- Parallel I/O every 60 dt
Parallel, spectral, 3D Navier-Stokes solver