Title: High-performance computing for environmental flows
Slide 1: High-performance computing for environmental flows
- Oliver Fringer
- Environmental Fluid Mechanics Laboratory
- Department of Civil and Environmental Engineering
- Stanford University
- Support: ONR Grants N00014-02-1-0204 and N00014-05-1-0294, NSF Grant 0113111, Leavell Family Faculty Scholarship
- 31 October 2005
Slide 2: Why high-performance computing?
- Case 1: Internal waves in the coastal ocean
- How is tidal energy dissipated in the world's oceans?
Strait of Gibraltar
Baja California
Image: http://cimss.ssec.wisc.edu/goes/misc/010930/010930_vis_anim.html
Image: http://envisat.esa.int/instruments/images/gibraltar_int_wave.gif
Slide 3: The fate of internal waves
Klymak & Moum, 2003
Venayagamoorthy & Fringer, 2004
Slide 4:
- Case 2: Coherent structures in the Snohomish estuary
- Can we predict stratification, sediment distribution, and bathymetry based on images of coherent structures on the surface?
Slide 5: (image-only slide, no transcript)
Slide 6: (figure: the energy cascade from forcing scales to dissipation scales)
- 10^7 m (10,000 km): energy input from the sun and moon (tides), precipitation, topography, and river flow
- 10^4 m (10 km): internal tides, wave steepening, eddy formation
- 10 m: nonlinear internal waves, coherent structures (energy cascade)
- 1 m: turbulence (the scale of interest)
- 10^-2 m (cm): dissipation/mixing (energy loss)
Slide 7: The ideal simulation: direct numerical simulation (DNS)
- Calculate all scales of motion
Slide 8: How big does my computer need to be?
- Rule of thumb 1: the number of grid points scales as (largest length scale / smallest length scale)^3.
- Internal wave example:
- Energy input scale: 10,000 km
- Energy dissipation scale: 1 m
- Total grid points: 10^21 (a billion trillion)
- Rule of thumb 2: the number of time steps is the total simulation time divided by the smallest time scale.
- Internal wave example:
- Total simulation time: 2 weeks
- Smallest time scale: 1 s
- Total time steps: 1,209,600
(Both rules are worked through in the sketch below.)
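A minimal sketch of the arithmetic behind both rules of thumb (pure Python; the scale values are the ones on this slide):

    # Rule of thumb 1: grid points ~ (largest scale / smallest scale)^3
    L_input = 1e7          # energy-input scale: 10,000 km, in meters
    L_dissipation = 1.0    # energy-dissipation scale: 1 m
    n_points = (L_input / L_dissipation) ** 3
    print(f"grid points: {n_points:.0e}")        # ~1e+21

    # Rule of thumb 2: time steps = total simulation time / smallest time scale
    t_total = 14 * 24 * 3600   # two weeks, in seconds
    dt = 1.0                   # smallest time scale: 1 s
    print(f"time steps: {t_total / dt:,.0f}")    # 1,209,600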
Slide 9: Calculation on the world's fastest computer: LLNL Blue Gene
- World's fastest computer: LLNL Blue Gene
- 131,072 processors
- Peak throughput: 281 TFlops (trillion floating-point operations per second)
- How many flops would be required for the internal wave simulation?
- Typical simulation codes: 100 flops per grid cell per time step
- 100 flops/cell/step × 10^21 grid points × 10^6 time steps = 10^29 flops
- How much memory?
- 220 bytes/grid cell × 10^21 grid points = 2.2 × 10^23 bytes = 100 billion TB of RAM!
- How long would it take to compute?
- 10^29 flops / (3 × 10^14 flops/s) ≈ 3 × 10^15 s ≈ 95 million years!
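A sketch of the same cost estimate. The grid count, flop rate, and byte count come from the slides; the 10%-of-peak sustained rate is my assumption, chosen because it roughly reproduces the slide's ~10^8-year figure (dividing by the full 281 TFlops peak would give about a tenth of that):

    n_points = 1e21            # grid points (previous slide)
    n_steps = 1e6              # ~1,209,600 time steps
    total_flop = 100 * n_points * n_steps   # 100 flop/cell/step -> 1e29 flop
    memory = 220 * n_points                 # bytes of RAM -> 2.2e23
    peak = 281e12                           # Blue Gene peak, flop/s
    sustained = 0.1 * peak                  # ASSUMPTION: ~10% of peak sustained
    years = total_flop / sustained / 3.15e7 # ~3.15e7 seconds per year
    print(f"{total_flop:.0e} flop, {memory:.1e} bytes, {years:.1e} years")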
Slide 10: A more feasible simulation
- Calculate a truncated range of scales; model the effect of the other scales on those of interest.
(figure: the energy-cascade schematic again, from energy input through the energy cascade to the scale of interest and energy loss; truncating the resolved range reduces the 95-million-year compute time to a 3-month simulation)
Slide 11: Multi-scale simulation techniques
- Adaptive mesh refinement (AMR)
- Overlay refined grids where length scales are small
Figure: M. Barad (U.C. Davis)
Slide 12: Grid savings when employing adaptive mesh refinement
- AMR simulations of internal waves in the South China Sea reduce the number of grid points by a factor of 50,000!
(map: Taiwan, China, the Philippines)
Figure: M. Barad (U.C. Davis)
Slide 13: Unstructured grids
- Potential savings: reduce the number of grid points by a factor of 1000.
Grid: B. Wang
Slide 14: Grid resolution is everything
- Grid size 3000 m (80K cells) → Umax = 2 cm/s
- Grid size 300 m (3M cells) → Umax = 16.1 cm/s
- Grid size 60 m (45M cells) → Umax = ?
- Field data from Petruncio et al. (1998), ITEX1 Mooring A2
- SUNTANS results with 300 m grid: max U = 16.1 cm/s
- Simulation time: 8 M2 tides
Figures: S. Jachec (plot axis: U velocity (cm/s))
Slide 15: SUNTANS overview
- SUNTANS:
- Stanford
- Unstructured
- Nonhydrostatic
- Terrain-following
- Adaptive
- Navier-Stokes
- Simulator
- Finite-volume prisms
- Parallel computing: gcc + MPI + ParMetis
(figures: side view and top view of the prismatic grid)
Slide 16: Parallel computing
8-processor partitioning of Monterey Bay
Slide 17: Amdahl's Law
- There is always a portion f of a code that executes sequentially and is not parallelizable.
- Speedup on Np processors: S = 1 / (f + (1 - f)/Np)
- As Np → ∞, S → 1/f
(plot: speedup vs. number of processors, with curves for ideal speedup and for larger problem size)
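A quick check of the formula (plain Python; the values of f are illustrative, not from the slides):

    # Amdahl's law: S(Np) = 1 / (f + (1 - f)/Np); S -> 1/f as Np -> infinity
    def speedup(f, n_proc):
        return 1.0 / (f + (1.0 - f) / n_proc)

    for f in (0.01, 0.05, 0.10):
        row = ", ".join(f"Np={n}: {speedup(f, n):6.1f}" for n in (8, 64, 512))
        print(f"f = {f:.2f} -> {row}  (limit 1/f = {1/f:.0f})")

Even a 1% serial fraction caps the speedup at 100, no matter how many processors are added.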
Slide 18: Parallel graph partitioning
- Given a graph, partition it subject to the following constraints:
- Balance the workload: all processors should perform the same amount of work, so each graph node is weighted by the local depth.
- Minimize the number of edge cuts: processors must communicate information at interprocessor boundaries, so the partitioning must minimize the number of edge cuts to minimize the cost of communication.
(figure: Delaunay edges and the Voronoi graph; Voronoi graph of Monterey Bay)
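A toy illustration of the two constraints (the five-node graph, depth weights, and partition here are made up for illustration, not the Monterey Bay graph; ParMetis solves the same bookkeeping problem at scale):

    from collections import defaultdict

    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]  # hypothetical graph
    depth = {0: 5.0, 1: 2.0, 2: 4.0, 3: 3.0, 4: 6.0}          # node weights (local depths)
    part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 0}                     # node -> processor

    load = defaultdict(float)
    for node, w in depth.items():
        load[part[node]] += w             # workload per processor: should be ~equal
    cuts = sum(part[u] != part[v] for u, v in edges)  # interprocessor communication

    print("loads:", dict(load), "edge cuts:", cuts)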
Slide 19: ParMetis: parallel unstructured graph partitioning (Karypis et al., U. Minnesota)
- Five-processor partitioning; workloads: 20.0%, 20.2%, 19.4%, 20.2%, 20.2%
- Original 1089-node graph of Monterey Bay, CA
- Depths are used as weights for the workload.
Slide 20: Cache-unfriendly code vs. ...
- Data is transferred from RAM to cache in blocks before it is loaded for use in the CPU: transfer (RAM, large → cache, small) is slow; load (cache → CPU) is fast.
- Consider the simple triangulation shown, with nodes numbered 0-17, where node 1's update uses values at nodes 0, 14, and 17:

    s(1) = s(1) + a*s(0) + b*s(14) + c*s(17)

Machine code:
1) Transfer block {s(0), s(1)}
2) Load s(0)
3) Load s(1)
4) Transfer block {s(14), s(15)}
5) Load s(14)
6) Transfer block {s(16), s(17)}
7) Load s(17)
8) Obtain new s(1)
Total: 4 loads, 3 transfers
Slide 21: Cache-friendly code
- Data is transferred from RAM to cache in blocks before it is loaded for use in the CPU.
- Consider the same triangulation with the nodes renumbered so that node 1's update uses values at nodes 0, 2, and 3:

    s(1) = s(1) + a*s(0) + b*s(2) + c*s(3)

Machine code:
1) Transfer block {s(0), s(1)}
2) Load s(0)
3) Load s(1)
4) Transfer block {s(2), s(3)}
5) Load s(2)
6) Load s(3)
7) Obtain new s(1)
Total: 4 loads, 2 transfers
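A small experiment showing the effect described on these two slides: gathering the same values through a contiguous index array (neighbors stored next to each other) vs. a shuffled one (neighbors scattered across RAM). The timings are machine-dependent; only the trend matters.

    import time
    import numpy as np

    n = 10_000_000
    s = np.arange(n, dtype=np.float64)
    orderings = {
        "contiguous": np.arange(n),             # cache-friendly neighbor layout
        "scattered": np.random.permutation(n),  # cache-unfriendly layout
    }
    for name, idx in orderings.items():
        t0 = time.perf_counter()
        s[idx].sum()                            # gather through the index, then reduce
        print(f"{name}: {time.perf_counter() - t0:.3f} s")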
Slide 22: ParMetis: parallel unstructured graph ordering (Karypis et al., U. Minnesota)
- Unordered Monterey graph with 1089 nodes vs. ordered Monterey graph with 1089 nodes
- Ordering increases per-processor performance by up to 20%.
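A serial analogue of this ordering step can be sketched with SciPy's reverse Cuthill-McKee reordering, which renumbers nodes so that neighbors sit close together in memory, reducing the matrix bandwidth (the random graph here is a stand-in for illustration, not the Monterey grid):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    A = sparse_random(200, 200, density=0.02, random_state=1, format="csr")
    A = (A + A.T).tocsr()                      # symmetrize: an undirected graph

    def bandwidth(m):
        rows, cols = m.nonzero()
        return int(np.max(np.abs(rows - cols)))

    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    B = A[perm][:, perm]                       # apply the new node numbering
    print("bandwidth before:", bandwidth(A), "after:", bandwidth(B))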
Slide 23: Pressure-split algorithm
- Pressure is split into its hydrostatic and hydrodynamic components.
- The hydrostatic pressure comprises the surface pressure, the barotropic pressure, and the baroclinic pressure.
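A sketch of the decomposition the slide names; the slide's own equation is not in this transcript, and the notation here (p_s for surface pressure, η for the free surface, ρ' for the density perturbation, q for the hydrodynamic/nonhydrostatic part) is mine:

    p = p_h + q,
    \qquad
    p_h = \underbrace{p_s(x,y,t)}_{\text{surface}}
        + \underbrace{\rho_0 g\,(\eta - z)}_{\text{barotropic}}
        + \underbrace{g\!\int_z^{\eta} \rho'(x,y,z',t)\,dz'}_{\text{baroclinic}}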
Slide 24: Boussinesq Navier-Stokes equations with pressure splitting
- Surface pressure gradient
- Internal waves: c ~ O(1 m/s)
- Surface waves: c ~ O(100 m/s)
- Acceleration: u ~ O(0.1 m/s)
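A sketch of the standard Boussinesq momentum equation under this splitting (Coriolis and forcing terms omitted; the notation follows the previous slide and is my reconstruction, not the transcript's):

    \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
    = -\,g\nabla_h \eta
      \;-\; \frac{g}{\rho_0}\,\nabla_h\!\int_z^{\eta} \rho'\,dz'
      \;-\; \frac{1}{\rho_0}\nabla q
      \;+\; \nu\,\nabla^2 \mathbf{u}

The annotations above point at the disparity of speeds: the surface pressure gradient carries the fast surface waves (c ~ O(100 m/s)), which is why that term is the stiff one, while internal waves (c ~ O(1 m/s)) and fluid accelerations (u ~ O(0.1 m/s)) are far slower.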
Slide 25: Hydrostatic vs. nonhydrostatic flows
- Most environmental flows are hydrostatic:
- Hyperbolic character
- Long horizontal length scales, i.e., long waves
- Only in small regions is the flow nonhydrostatic:
- Elliptic character
- Short length scales relative to the depth
(figure: a long wave (hydrostatic) at the free surface over steep bathymetry (nonhydrostatic) at the bottom)
Slide 26: When is a flow nonhydrostatic?
- Aspect ratio: δ = D/L, the ratio of the depth D to the horizontal length scale L; the hydrostatic approximation holds when δ² ≪ 1, and nonhydrostatic effects matter as δ approaches 1.
Slide 27: (image-only slide, no transcript)
Slide 28: When are internal waves nonhydrostatic?
(panel captions: CPU time 1 day; CPU time 3 days)
Slide 29: Hydrostatic vs. nonhydrostatic lock-exchange computation
- Hydrostatic
- Nonhydrostatic
- Domain size: 0.8 m by 0.1 m (grid: 400 by 100)
Slide 30: Conditioning of the pressure-Poisson equation
- The 2D x-z Poisson equation for the nonhydrostatic pressure is the starting point (sketched below).
- For large-aspect-ratio flows, the vertical second-derivative term dominates.
- To a good approximation, the vertical operator alone can therefore be retained.
- The preconditioned equation then uses a block-diagonal preconditioner.
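A sketch of the reasoning; the slide's equations are not in this transcript, and the nondimensionalization (x ~ L, z ~ D, δ = D/L) is my reconstruction:

    \frac{\partial^2 q}{\partial x^2} + \frac{\partial^2 q}{\partial z^2} = b
    \quad\xrightarrow{\;x = L\hat{x},\; z = D\hat{z}\;}\quad
    \delta^2 \frac{\partial^2 q}{\partial \hat{x}^2}
      + \frac{\partial^2 q}{\partial \hat{z}^2} = D^2 b,
    \qquad \delta = \frac{D}{L}

For δ ≪ 1 the vertical operator dominates, so M ≈ ∂²/∂z² approximates the full operator well; discretely, M couples only cells within the same water column, one block per column, which is what makes the preconditioner block-diagonal.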
Slide 31: Speedup with the preconditioner when applied to a domain with δ = D/L = 0.01
- No preconditioner: 22.8X
- Diagonal: 8.5X
- Block-diagonal: 1X
Slide 32: Problem size reduction
- All scales: 192-million-year compute time
- Reduced range of scales: 10 years (Monterey Bay with 50 m grid cells in a 100 km by 100 km domain with 10-second time steps)
- Parallel computing: 6 months
- Optimal load balancing: 5 months
- Optimal grid ordering/cache-friendliness: 4 months
- Optimal pressure preconditioning: 1 month
- Optimal time-stepping/other numerics: 2 weeks
- Total time savings: 5 billion times faster!
- For more information, visit http://suntans.stanford.edu