Transcript and Presenter's Notes

Title: APPM 4660


1
APPM 4660
  • Turbulence Simulation and High-Performance
    Computing
  • Joe Werne
  • Colorado Research Associates Division (CoRA)
  • NorthWest Research Associates, Inc. (NWRA)
  • 3380 Mitchell Lane, Boulder, CO 80301
  • werne@cora.nwra.com

2
  • Moore's Law (1965)

Computer power doubles every 1.5-2.0 years.
Network bandwidth doubles every 0.75-1.0 years.

  • 1978: Cray I, 160 million instructions per second, $20,000,000
  • 1993: Cray C90, 16 PEs, 1 Gflops each, liquid-H2O cooled, $32M
  • 1993: Cray T3D, 256 PEs, 146 Mflops each (37.5 Gflops total), $16M
  • 1995: Sony PlayStation, 500 million instructions per second, $299
3
  • Turbulence Simulation
  • 1991: 2D convection, Cray YMP (2.66 Gflops), 10^2 hrs ≈ 0.5 week
  • 1996: rotating convection, Cray C90 (16 Gflops), 10^3 hrs ≈ 1.0 month
  • 1998: gravity-wave breaking, SGI O2 (50 Gflops), 10^4 hrs ≈ 1.0 year
  • 2000: wind shear, Cray T3E (600 Gflops), 10^5 hrs ≈ 10 years
  • 2005: wind shear, IBM P4 (20 Tflops), 10^6 hrs ≈ 100 years

4
  • Animations demonstrating turbulence simulation
  • 2D convection: 1991, Cray YMP (10^2 hrs ≈ 0.5 wk)
  • rotating convection: 1996, Cray C90 (10^3 hrs ≈ 1 mo)
  • gravity-wave breaking: 1998, SGI Power Onyx (10^4 hrs ≈ 1 yr)
  • KH (Kelvin-Helmholtz) wind shear: 2000, Cray T3E (10^5 hrs ≈ 10 yrs)
  • KH wind shear: 2005, IBM P4 (10^6 hrs ≈ 100 yrs)

[Animation panels: 2D convection, 3D rotating convection, 3D wave breaking, 3D wind shear]
5-6
  • Characteristics of turbulent motion
  • nonlinear, chaotic solutions
  • large range of length and time scales
  • probabilistic description useful/necessary

k_d/k_l ∝ 1/dx ∝ Re^(3/4)
Re(simulation) ≈ 10^4-10^5
Re(stratosphere) ≈ 10^7-10^8
cost ∝ dx^(-4) ∝ Re^3
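To see where the cost scaling comes from (a standard DNS estimate): resolving the dissipation scale requires dx ∝ Re^(-3/4), so a 3D grid needs (Re^(3/4))^3 = Re^(9/4) points, and the time step must shrink with dx as well, contributing roughly another factor of Re^(3/4):

cost ∝ Re^(9/4) × Re^(3/4) = Re^3

Going from Re ≈ 10^5 (simulation) to Re ≈ 10^8 (stratosphere) therefore multiplies the cost by roughly (10^3)^3 = 10^9.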

7
  • Finite-difference versus spectral resolution

Finite difference:
f(x) = f(x0) + f'(x0) h + 1/2 f''(x0) h^2 + 1/6 f'''(x0) h^3 + ...
f'_j = (f_(j+1) - f_(j-1)) / (2h) + O(h^2)
f''_j = (f_(j+1) - 2 f_j + f_(j-1)) / h^2 + O(h^2)

Spectral method:
f(x) = Σ F(k) exp(ikx)
f'(x) = Σ ik F(k) exp(ikx)
f''(x) = -Σ k^2 F(k) exp(ikx)
2π F(k) = ∫ f(x) exp(-ikx) dx
and if we integrate by parts, we have
2π F(k) = (ik)^(-r) ∫ [d^r f(x)/dx^r] exp(-ikx) dx
so if the nth derivative is bounded, F(k) decays at least as fast as k^(-n), and the error is O(h^n).

All the numerical solutions presented were computed with spectral methods.
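To see the difference in practice, here is a minimal sketch (standard Fortran, my illustration, not the author's solver) that differentiates f(x) = sin(x) on a periodic grid two ways: with the second-order centered difference above, and with a naive discrete Fourier transform. The spectral result is exact to machine precision; the finite-difference error is O(h^2).

program fd_vs_spectral
  implicit none
  integer, parameter :: n = 16
  real(8), parameter :: pi = acos(-1.0d0)
  real(8) :: x(n), f(n), dfd(n), dsp(n), h
  complex(8) :: fk(-n/2:n/2-1), s
  integer :: j, k

  h = 2.0d0*pi/n
  do j = 1, n
    x(j) = (j-1)*h
    f(j) = sin(x(j))
  end do

  ! second-order centered difference with periodic wrap-around
  do j = 1, n
    dfd(j) = (f(mod(j,n)+1) - f(mod(j+n-2,n)+1)) / (2.0d0*h)
  end do

  ! naive DFT (O(n^2) here; a real code would use an FFT):
  ! F(k) = (1/n) sum_j f_j exp(-i k x_j)
  do k = -n/2, n/2-1
    s = (0.0d0, 0.0d0)
    do j = 1, n
      s = s + f(j)*exp(cmplx(0.0d0, -k*x(j), 8))
    end do
    fk(k) = s/n
  end do

  ! spectral derivative: f'(x_j) = sum_k ik F(k) exp(i k x_j)
  do j = 1, n
    s = (0.0d0, 0.0d0)
    do k = -n/2, n/2-1
      s = s + cmplx(0.0d0, dble(k), 8)*fk(k)*exp(cmplx(0.0d0, k*x(j), 8))
    end do
    dsp(j) = dble(s)
  end do

  print *, 'finite-difference max error:', maxval(abs(dfd - cos(x)))  ! ~2.6e-2
  print *, 'spectral max error:         ', maxval(abs(dsp - cos(x)))  ! ~1e-15
end program fd_vs_spectral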
8
  • Getting big computers to run fast for you
  • Two important issues
  • single-processor optimization
  • parallelization efficiency

This used to be easy when Cray dominated the supercomputer market with vector processors and compilers were intelligent. Now compilers are stupid, and machine architectures are more complex. Even then, though, one thing you absolutely had to pay attention to was memory management. This is still the case!
9-16
  • Getting big computers to run fast for you: Cray YMP

[Animated diagram, repeated across slides 9-16: CPUs reading from shared memory banks, first all contending for the same banks in lockstep, then staggered so each CPU works in a different bank]

Gang-memory access, with every CPU hitting the same banks at once, kills performance because of inter-processor interference. The staggered access pattern is 8 times faster, but what changes must you make to your code? (A sketch follows.)
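One classic answer (a generic illustration, not taken from the slides; the array names are hypothetical): pad the leading array dimension so consecutive accesses rotate through the banks instead of revisiting the same ones, and keep the innermost loop on the stride-1 index.

program bank_pad
  implicit none
  integer, parameter :: n = 1024
  ! With a(n,n) and a power-of-two n, consecutive elements of a row are
  ! n words apart and keep landing in the same few memory banks.  The
  ! +1 pad makes the stride relatively prime to the bank count, so
  ! successive accesses stagger across all the banks.
  real :: a(n+1, n), b(n+1, n)
  integer :: i, j

  b = 1.0
  do j = 1, n
    do i = 1, n          ! inner loop on the stride-1 (first) index
      a(i, j) = 2.0*b(i, j)
    end do
  end do
  print *, a(1, 1)
end program bank_pad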
17-18
  • Getting big computers to run fast for you: Cray C90

[Diagram, repeated on both slides: CPUs and memory banks]

As the number of processors grows, index-level parallelization is less effective. We must change the index order to dedicate the outer loop index to parallelization. (A sketch follows.)
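A minimal sketch of the restructured loop (my illustration; OpenMP stands in for the Cray autotasking directives of the era): the outer index is divided among CPUs while the inner, stride-1 index still vectorizes on each one.

program outer_parallel
  implicit none
  integer, parameter :: nx = 64, ny = 64, nz = 64
  real :: a(nx, ny, nz), b(nx, ny, nz)
  integer :: i, j, k

  b = 1.0
  !$omp parallel do private(i, j)
  do k = 1, nz           ! outer index: one chunk per CPU
    do j = 1, ny
      do i = 1, nx       ! inner index: stride-1, vectorizable
        a(i, j, k) = 2.0*b(i, j, k)
      end do
    end do
  end do
  !$omp end parallel do
  print *, a(1, 1, 1)
end program outer_parallel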
19
  • Getting big computers to run fast for you: MPPs

MPPs heralded a paradigm shift: fast networks connecting many cheap, off-the-shelf CPUs. Shared memory was no more. Now memory is integral to each CPU, and managing cache memory is extremely important.

[Diagram: CPUs, each with its own local memory]
20
  • Getting big computers to run fast for you
  • Modern MPPs and cache thrashing

[Diagram: CPUs with cache memory (e.g., 4 words per line) backed by main memory, which is large but much slower to access, evaluating a(i) = b(i) + c(i)]

Cache is faster than main memory, and register memory is faster than cache, so learn a little about the platform on which you are running, and look for ways like these (see the sketch below) to maximize cache and register re-use. And be wary of profiling tools! Take-home message: high-performance computing is mostly about intelligent memory management.
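A generic illustration of cache and register re-use (mine, not from the slides): fusing two sweeps over the same arrays so each cache line is loaded once and b(i), c(i) are re-used from registers.

program cache_reuse
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n), c(n), d(n)
  integer :: i

  b = 1.0
  c = 2.0
  ! Two separate loops would stream b and c from main memory twice:
  !   do i = 1, n;  a(i) = b(i) + c(i);  end do
  !   do i = 1, n;  d(i) = b(i) - c(i);  end do
  ! The fused loop touches each cache line of b and c once, and the
  ! second use of b(i) and c(i) comes from registers.
  do i = 1, n
    a(i) = b(i) + c(i)
    d(i) = b(i) - c(i)
  end do
  print *, a(1), d(n)
end program cache_reuse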
21
  • Getting big computers to run fast for you
  • Modern MPPs and stupid compilers

Look at the documentation for single-processor optimization techniques. You may be surprised at what it suggests. For example,

a(i) = a(i)*b(i) + c(i)*d(i) + e(i)*f(i)

may run much slower than

a(i) = ( ( ( a(i)*b(i) ) + c(i)*d(i) ) + e(i)*f(i) )
22
  • Getting big computers to run fast for you
  • Parallelization

! The same program runs on every PE, each working on its own block of
! the data.
      program main
!-----------------------------
      dimension a(nx,ny,nz)
      dimension b(nx,ny,nz)
      dimension c(nx,ny,nz)
!-----------------------------
      do i = 1, nx
        do j = 1, ny
          do k = 1, nz
            a(i,j,k) = b(i,j,k) + c(i,j,k)
          enddo
        enddo
      enddo
!-----------------------------
      call subarg(a,b,c,nx,ny,nz)
      call sub2ar(b,c,nx,ny,nz)
!-----------------------------
      stop
      end

SIMD -- single instruction, multiple data: each PE executes the same program on its own portion of the data.
[Diagram: the program above replicated on each of six processors, PE0 through PE5]
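A minimal sketch of the same pattern with explicit message passing (my illustration using MPI; per slide 24, the actual solver communicates with shmem): every rank runs identical code but owns its own z-slab of the data.

program spmd_demo
  use mpi
  implicit none
  integer, parameter :: nx = 32, ny = 32, nztot = 32
  integer :: nz, rank, nprocs, ierr, i, j, k
  real, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! assumes nprocs divides nztot (cf. "commensurate" on slide 24)
  nz = nztot/nprocs
  allocate(a(nx,ny,nz), b(nx,ny,nz), c(nx,ny,nz))
  b = 1.0
  c = real(rank)

  do k = 1, nz                   ! same loops on every PE,
    do j = 1, ny                 ! different slab of the data
      do i = 1, nx
        a(i,j,k) = b(i,j,k) + c(i,j,k)
      end do
    end do
  end do

  call MPI_Finalize(ierr)
end program spmd_demo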
23
  • Getting big computers to run fast for you
  • Parallelization

Amdahl's Law (for fixed total problem size):

T_NCPU = f T / NCPU + (1 - f) T

where
T_NCPU = wall-clock time when running on NCPU processors
T = wall-clock time when running on 1 processor
f = parallel fraction of the code
NCPU = number of processors
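For example, with f = 0.99 and NCPU = 256:

T_256 = 0.99 T / 256 + 0.01 T ≈ 0.0139 T

so the speedup is T / T_256 ≈ 72 on 256 processors; even a 1% serial fraction dominates at scale.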
24
Parallel, spectral, 3D Navier-Stokes Solver
  • Fully spectral (3D FFTs ≈ 75% of the computation)
  • Radix 2, 3, 4, 5 FFTs
  • Spectral modes and NCPUs must be commensurate
  • Communication: shmem, global transpose (sketch below), data reduction
  • Parallel I/O every 60 dt
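The global transpose is the communication step behind distributed 3D FFTs: each rank transforms the directions it holds locally, then all ranks exchange blocks so the remaining direction becomes local; this is also why the spectral mode counts and NCPUs must be commensurate, so the blocks divide evenly. A minimal sketch of the exchange (my illustration with MPI_Alltoall; the solver itself uses shmem):

subroutine global_transpose(src, dst, nloc, nprocs)
  use mpi
  implicit none
  integer :: nloc, nprocs, ierr
  ! src(:,p) holds the block this rank sends to rank p; on return,
  ! dst(:,p) holds the block received from rank p.  Afterward the
  ! previously distributed index is local, and the FFT along it can
  ! proceed entirely on-processor.
  real :: src(nloc, nprocs), dst(nloc, nprocs)
  call MPI_Alltoall(src, nloc, MPI_REAL, dst, nloc, MPI_REAL, &
                    MPI_COMM_WORLD, ierr)
end subroutine global_transpose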

25
Parallel, spectral, 3D Navier-Stokes Solver