Title: High-performance computing for environmental flows
Slide 1: High-performance computing for environmental flows
- Oliver Fringer
- Environmental Fluid Mechanics Laboratory
- Department of Civil and Environmental Engineering
- Stanford University
- Support: ONR Grants N00014-02-1-0204 and N00014-05-1-0294, NSF Grant 0113111, Leavell Family Faculty Scholarship
- 31 October 2005
Slide 2: Why high-performance computing?
- Case 1: Internal waves in the coastal ocean
- How is tidal energy dissipated in the world's oceans?
Strait of Gibraltar
Baja California
Image: http://cimss.ssec.wisc.edu/goes/misc/010930/010930_vis_anim.html
Image: http://envisat.esa.int/instruments/images/gibraltar_int_wave.gif
Slide 3: The fate of internal waves
Klymak & Moum, 2003
Venayagamoorthy & Fringer, 2004
Slide 4:
- Case 2: Coherent structures in the Snohomish estuary
- Can we predict stratification, sediment distribution, and bathymetry based on images of coherent structures on the surface?
Slide 5: (image-only slide, no transcript)
Slide 6: (figure: the energy cascade from forcing scales to dissipation scales)
- 10^7 m (10,000 km): energy input from the sun and moon (tides), precipitation, topography, and river flow
- 10^4 m (10 km): internal tides, wave steepening, eddy formation
- 10 m: nonlinear internal waves, coherent structures (energy cascade)
- 1 m: turbulence (the scale of interest)
- 10^-2 m (cm): dissipation/mixing (energy loss)
Slide 7: The ideal simulation: direct numerical simulation (DNS)
- Calculate all scales of motion
Slide 8: How big does my computer need to be?
- Rule of thumb 1: the number of grid points scales as (largest length scale / smallest length scale)^3.
- Internal wave example:
- Energy input scale: 10,000 km
- Energy dissipation scale: 1 m
- Total grid points: 10^21 (a billion trillion)
- Rule of thumb 2: the number of time steps is the total simulation time divided by the smallest time scale.
- Internal wave example:
- Total simulation time: 2 weeks
- Smallest time scale: 1 s
- Total time steps: 1,209,600
(Both rules are worked through in the sketch below.)
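A minimal sketch of the arithmetic behind both rules of thumb (pure Python; the scale values are the ones on this slide):

    # Rule of thumb 1: grid points ~ (largest scale / smallest scale)^3
    L_input = 1e7          # energy-input scale: 10,000 km, in meters
    L_dissipation = 1.0    # energy-dissipation scale: 1 m
    n_points = (L_input / L_dissipation) ** 3
    print(f"grid points: {n_points:.0e}")        # ~1e+21

    # Rule of thumb 2: time steps = total simulation time / smallest time scale
    t_total = 14 * 24 * 3600   # two weeks, in seconds
    dt = 1.0                   # smallest time scale: 1 s
    print(f"time steps: {t_total / dt:,.0f}")    # 1,209,600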
Slide 9: Calculation on the world's fastest computer: LLNL Blue Gene
- World's fastest computer: LLNL Blue Gene
- 131,072 processors
- Peak throughput: 281 TFlops (trillion floating-point operations per second)
- How many flops would be required for the internal wave simulation?
- Typical simulation codes: 100 flops per grid cell per time step
- 100 flops/cell/step × 10^21 grid points × 10^6 time steps = 10^29 flops
- How much memory?
- 220 bytes/grid cell × 10^21 grid points = 2.2 × 10^23 bytes = 100 billion TB of RAM!
- How long would it take to compute?
- 10^29 flops / (3 × 10^14 flops/s) ≈ 3 × 10^15 s ≈ 95 million years!
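A sketch of the same cost estimate. The grid count, flop rate, and byte count come from the slides; the 10%-of-peak sustained rate is my assumption, chosen because it roughly reproduces the slide's ~10^8-year figure (dividing by the full 281 TFlops peak would give about a tenth of that):

    n_points = 1e21            # grid points (previous slide)
    n_steps = 1e6              # ~1,209,600 time steps
    total_flop = 100 * n_points * n_steps   # 100 flop/cell/step -> 1e29 flop
    memory = 220 * n_points                 # bytes of RAM -> 2.2e23
    peak = 281e12                           # Blue Gene peak, flop/s
    sustained = 0.1 * peak                  # ASSUMPTION: ~10% of peak sustained
    years = total_flop / sustained / 3.15e7 # ~3.15e7 seconds per year
    print(f"{total_flop:.0e} flop, {memory:.1e} bytes, {years:.1e} years")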
Slide 10: A more feasible simulation
- Calculate a truncated range of scales; model the effect of the other scales on those of interest.
(figure: the energy-cascade schematic again, from energy input through the energy cascade to the scale of interest and energy loss; truncating the resolved range reduces the 95-million-year compute time to a 3-month simulation)
Slide 11: Multi-scale simulation techniques
- Adaptive mesh refinement (AMR)
- Overlay refined grids where length scales are small
Figure: M. Barad (U.C. Davis)
Slide 12: Grid savings when employing adaptive mesh refinement
- AMR simulations of internal waves in the South China Sea reduce the number of grid points by a factor of 50,000!
(map: Taiwan, China, the Philippines)
Figure: M. Barad (U.C. Davis)
Slide 13: Unstructured grids
- Potential savings: reduce the number of grid points by a factor of 1000.
Grid: B. Wang
Slide 14: Grid resolution is everything
- Grid size 3000 m (80K cells) → Umax = 2 cm/s
- Grid size 300 m (3M cells) → Umax = 16.1 cm/s
- Grid size 60 m (45M cells) → Umax = ?
- Field data from Petruncio et al. (1998), ITEX1 Mooring A2
- SUNTANS results with 300 m grid: max U = 16.1 cm/s
- Simulation time: 8 M2 tides
Figures: S. Jachec (plot axis: U velocity (cm/s))
Slide 15: SUNTANS overview
- SUNTANS:
- Stanford
- Unstructured
- Nonhydrostatic
- Terrain-following
- Adaptive
- Navier-Stokes
- Simulator
- Finite-volume prisms
- Parallel computing: gcc + MPI + ParMetis
(figures: side view and top view of the prismatic grid)
Slide 16: Parallel computing
8-processor partitioning of Monterey Bay
Slide 17: Amdahl's Law
- There is always a portion f of a code that executes sequentially and is not parallelizable.
- Speedup on Np processors: S = 1 / (f + (1 - f)/Np)
- As Np → ∞, S → 1/f
(plot: speedup vs. number of processors, with curves for ideal speedup and for larger problem size)
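A quick check of the formula (plain Python; the values of f are illustrative, not from the slides):

    # Amdahl's law: S(Np) = 1 / (f + (1 - f)/Np); S -> 1/f as Np -> infinity
    def speedup(f, n_proc):
        return 1.0 / (f + (1.0 - f) / n_proc)

    for f in (0.01, 0.05, 0.10):
        row = ", ".join(f"Np={n}: {speedup(f, n):6.1f}" for n in (8, 64, 512))
        print(f"f = {f:.2f} -> {row}  (limit 1/f = {1/f:.0f})")

Even a 1% serial fraction caps the speedup at 100, no matter how many processors are added.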
Slide 18: Parallel graph partitioning
- Given a graph, partition it subject to the following constraints:
- Balance the workload: all processors should perform the same amount of work, so each graph node is weighted by the local depth.
- Minimize the number of edge cuts: processors must communicate information at interprocessor boundaries, so the partitioning must minimize the number of edge cuts to minimize the cost of communication.
(figure: Delaunay edges and the Voronoi graph; Voronoi graph of Monterey Bay)
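A toy illustration of the two constraints (the five-node graph, depth weights, and partition here are made up for illustration, not the Monterey Bay graph; ParMetis solves the same bookkeeping problem at scale):

    from collections import defaultdict

    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]  # hypothetical graph
    depth = {0: 5.0, 1: 2.0, 2: 4.0, 3: 3.0, 4: 6.0}          # node weights (local depths)
    part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 0}                     # node -> processor

    load = defaultdict(float)
    for node, w in depth.items():
        load[part[node]] += w             # workload per processor: should be ~equal
    cuts = sum(part[u] != part[v] for u, v in edges)  # interprocessor communication

    print("loads:", dict(load), "edge cuts:", cuts)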
Slide 19: ParMetis: parallel unstructured graph partitioning (Karypis et al., U. Minnesota)
- Five-processor partitioning; workloads: 20.0%, 20.2%, 19.4%, 20.2%, 20.2%
- Original 1089-node graph of Monterey Bay, CA
- Depths are used as weights for the workload.
Slide 20: Cache-unfriendly code vs. ...
- Data is transferred from RAM to cache in blocks before it is loaded for use in the CPU: transfer (RAM, large → cache, small) is slow; load (cache → CPU) is fast.
- Consider the simple triangulation shown, with nodes numbered 0-17, where node 1's update uses values at nodes 0, 14, and 17:

    s(1) = s(1) + a*s(0) + b*s(14) + c*s(17)

Machine code:
1) Transfer block {s(0), s(1)}
2) Load s(0)
3) Load s(1)
4) Transfer block {s(14), s(15)}
5) Load s(14)
6) Transfer block {s(16), s(17)}
7) Load s(17)
8) Obtain new s(1)
Total: 4 loads, 3 transfers
Slide 21: Cache-friendly code
- Data is transferred from RAM to cache in blocks before it is loaded for use in the CPU.
- Consider the same triangulation with the nodes renumbered so that node 1's update uses values at nodes 0, 2, and 3:

    s(1) = s(1) + a*s(0) + b*s(2) + c*s(3)

Machine code:
1) Transfer block {s(0), s(1)}
2) Load s(0)
3) Load s(1)
4) Transfer block {s(2), s(3)}
5) Load s(2)
6) Load s(3)
7) Obtain new s(1)
Total: 4 loads, 2 transfers
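A small experiment showing the effect described on these two slides: gathering the same values through a contiguous index array (neighbors stored next to each other) vs. a shuffled one (neighbors scattered across RAM). The timings are machine-dependent; only the trend matters.

    import time
    import numpy as np

    n = 10_000_000
    s = np.arange(n, dtype=np.float64)
    orderings = {
        "contiguous": np.arange(n),             # cache-friendly neighbor layout
        "scattered": np.random.permutation(n),  # cache-unfriendly layout
    }
    for name, idx in orderings.items():
        t0 = time.perf_counter()
        s[idx].sum()                            # gather through the index, then reduce
        print(f"{name}: {time.perf_counter() - t0:.3f} s")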
Slide 22: ParMetis: parallel unstructured graph ordering (Karypis et al., U. Minnesota)
- Unordered Monterey graph with 1089 nodes vs. ordered Monterey graph with 1089 nodes
- Ordering increases per-processor performance by up to 20%.
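A serial analogue of this ordering step can be sketched with SciPy's reverse Cuthill-McKee reordering, which renumbers nodes so that neighbors sit close together in memory, reducing the matrix bandwidth (the random graph here is a stand-in for illustration, not the Monterey grid):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    A = sparse_random(200, 200, density=0.02, random_state=1, format="csr")
    A = (A + A.T).tocsr()                      # symmetrize: an undirected graph

    def bandwidth(m):
        rows, cols = m.nonzero()
        return int(np.max(np.abs(rows - cols)))

    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    B = A[perm][:, perm]                       # apply the new node numbering
    print("bandwidth before:", bandwidth(A), "after:", bandwidth(B))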
Slide 23: Pressure-split algorithm
- Pressure is split into its hydrostatic and hydrodynamic components.
- The hydrostatic pressure comprises the surface pressure, the barotropic pressure, and the baroclinic pressure.
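A sketch of the decomposition the slide names; the slide's own equation is not in this transcript, and the notation here (p_s for surface pressure, η for the free surface, ρ' for the density perturbation, q for the hydrodynamic/nonhydrostatic part) is mine:

    p = p_h + q,
    \qquad
    p_h = \underbrace{p_s(x,y,t)}_{\text{surface}}
        + \underbrace{\rho_0 g\,(\eta - z)}_{\text{barotropic}}
        + \underbrace{g\!\int_z^{\eta} \rho'(x,y,z',t)\,dz'}_{\text{baroclinic}}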
Slide 24: Boussinesq Navier-Stokes equations with pressure splitting
- Surface pressure gradient
- Internal waves: c ~ O(1 m/s)
- Surface waves: c ~ O(100 m/s)
- Acceleration: u ~ O(0.1 m/s)
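A sketch of the standard Boussinesq momentum equation under this splitting (Coriolis and forcing terms omitted; the notation follows the previous slide and is my reconstruction, not the transcript's):

    \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
    = -\,g\nabla_h \eta
      \;-\; \frac{g}{\rho_0}\,\nabla_h\!\int_z^{\eta} \rho'\,dz'
      \;-\; \frac{1}{\rho_0}\nabla q
      \;+\; \nu\,\nabla^2 \mathbf{u}

The annotations above point at the disparity of speeds: the surface pressure gradient carries the fast surface waves (c ~ O(100 m/s)), which is why that term is the stiff one, while internal waves (c ~ O(1 m/s)) and fluid accelerations (u ~ O(0.1 m/s)) are far slower.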
Slide 25: Hydrostatic vs. nonhydrostatic flows
- Most environmental flows are hydrostatic:
- Hyperbolic character
- Long horizontal length scales, i.e., long waves
- Only in small regions is the flow nonhydrostatic:
- Elliptic character
- Short length scales relative to the depth
(figure: a long wave (hydrostatic) at the free surface over steep bathymetry (nonhydrostatic) at the bottom)
Slide 26: When is a flow nonhydrostatic?
- Aspect ratio: δ = D/L, the ratio of the depth D to the horizontal length scale L; the hydrostatic approximation holds when δ² ≪ 1, and nonhydrostatic effects matter as δ approaches 1.
Slide 27: (image-only slide, no transcript)
Slide 28: When are internal waves nonhydrostatic?
(panel captions: CPU time 1 day; CPU time 3 days)
Slide 29: Hydrostatic vs. nonhydrostatic lock-exchange computation
- Hydrostatic
- Nonhydrostatic
- Domain size: 0.8 m by 0.1 m (grid: 400 by 100)
Slide 30: Conditioning of the pressure-Poisson equation
- The 2D x-z Poisson equation for the nonhydrostatic pressure is the starting point (sketched below).
- For large-aspect-ratio flows, the vertical second-derivative term dominates.
- To a good approximation, the vertical operator alone can therefore be retained.
- The preconditioned equation then uses a block-diagonal preconditioner.
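A sketch of the reasoning; the slide's equations are not in this transcript, and the nondimensionalization (x ~ L, z ~ D, δ = D/L) is my reconstruction:

    \frac{\partial^2 q}{\partial x^2} + \frac{\partial^2 q}{\partial z^2} = b
    \quad\xrightarrow{\;x = L\hat{x},\; z = D\hat{z}\;}\quad
    \delta^2 \frac{\partial^2 q}{\partial \hat{x}^2}
      + \frac{\partial^2 q}{\partial \hat{z}^2} = D^2 b,
    \qquad \delta = \frac{D}{L}

For δ ≪ 1 the vertical operator dominates, so M ≈ ∂²/∂z² approximates the full operator well; discretely, M couples only cells within the same water column, one block per column, which is what makes the preconditioner block-diagonal.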
Slide 31: Speedup with the preconditioner when applied to a domain with δ = D/L = 0.01
- No preconditioner: 22.8X
- Diagonal: 8.5X
- Block-diagonal: 1X
Slide 32: Problem size reduction
- All scales: 192-million-year compute time
- Reduced range of scales: 10 years (Monterey Bay with 50 m grid cells in a 100 km by 100 km domain with 10-second time steps)
- Parallel computing: 6 months
- Optimal load balancing: 5 months
- Optimal grid ordering/cache-friendliness: 4 months
- Optimal pressure preconditioning: 1 month
- Optimal time-stepping/other numerics: 2 weeks
- Total time savings: 5 billion times faster!
- For more information, visit http://suntans.stanford.edu