Transcript and Presenter's Notes

Title: High-performance computing for environmental flows


1
High-performance computing for environmental flows
  • Oliver Fringer
  • Environmental Fluid Mechanics Laboratory
  • Department of Civil and Environmental Engineering
  • Stanford University
  • Support
  • ONR Grants N00014-02-1-0204, N00014-05-1-0294,
    NSF Grant 0113111, Leavell Family Faculty
    Scholarship
  • 31 October 2005

2
Why high-performance computing?
  • Case 1: Internal waves in the coastal ocean
  • How is tidal energy dissipated in the world's
    oceans?

Strait of Gibraltar
Baja California
Image: http://cimss.ssec.wisc.edu/goes/misc/010930/010930_vis_anim.html
Image: http://envisat.esa.int/instruments/images/gibraltar_int_wave.gif
3
The fate of internal waves
Klymak & Moum, 2003
Venayagamoorthy & Fringer, 2004
4
  • Case 2: Coherent structures in the Snohomish estuary
  • Can we predict stratification, sediment
    distribution, and bathymetry based on images of
    coherent structures on the surface?

5
(No Transcript)
6
[Figure: cascade of length scales from energy input to dissipation]
10^7 m (10,000 km): Sun/Moon, tides, precipitation, topography (energy input)
10^4 m (10 km): internal tides, river flow; wave steepening and eddy formation
10 m: nonlinear internal waves and coherent structures (scale of interest; energy cascade)
1 m: turbulence (dissipation/mixing)
10^-2 m (cm): energy loss
7
The ideal simulation: Direct numerical simulation (DNS)
Calculate all scales of motion
8
How big does my computer need to be?
  • Rule of thumb 1
  • Internal wave example
  • Energy input: 10,000 km
  • Energy dissipation: 1 m
  • Total grid points: 10^21 (1 billion trillion)
  • Rule of thumb 2
  • Internal wave example
  • Total simulation time: 2 weeks
  • Smallest time scale: 1 s
  • Total time steps: 1,209,600

9
Calculation on the world's fastest computer: LLNL Blue Gene
  • World's fastest computer: LLNL Blue Gene
  • 131,072 processors
  • Peak throughput: 281 TFlops (trillion floating-point operations per second)
  • How many flops would be required for the internal wave simulation?
  • Typical simulation codes: 100 flop/grid cell/time step
  • 100 flop/grid cell/time step × 10^21 grid points × 10^6 time steps = 10^29 flop
  • How much memory?
  • 220 bytes/grid cell × 10^21 grid points ≈ 2.2 × 10^23 bytes (100 billion TB of RAM!)
  • How long would it take to compute?
  • 10^29 flop / (3 × 10^14 flop/s) ≈ 3 × 10^15 s ≈ 95 million years! (A back-of-envelope sketch of this estimate follows below.)
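A back-of-envelope version of this estimate as a Python sketch. The grid-point count, time-step count, flop and byte counts per cell, and the 281 TFlops peak are taken from the slide; the sustained-efficiency factor is an assumption (real codes run well below peak), chosen so the result lands at the slide's order of ~10^8 years.

# Back-of-envelope DNS cost estimate (inputs from the slide; efficiency is assumed).
grid_points = 1e21            # total grid cells
time_steps = 1_209_600        # 2 weeks at 1 s per step
flop_per_cell_step = 100      # typical simulation code
bytes_per_cell = 220          # memory per grid cell

peak_flops = 281e12           # LLNL Blue Gene peak throughput
sustained_fraction = 0.1      # assumed fraction of peak actually sustained

total_flop = flop_per_cell_step * grid_points * time_steps   # ~1.2e29 flop
total_bytes = bytes_per_cell * grid_points                    # ~2.2e23 bytes
seconds = total_flop / (sustained_fraction * peak_flops)
years = seconds / 3.15e7                                      # seconds per year

print(f"work   : {total_flop:.1e} flop")
print(f"memory : {total_bytes:.1e} bytes ({total_bytes / 1e12:.1e} TB)")
print(f"runtime: {seconds:.1e} s (~{years / 1e6:.0f} million years)")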

10
A more feasible simulation
[Figure: the same cascade of scales, now truncated]
Calculate a truncated range of scales (a 3-month simulation instead of a 95-million-year compute time) and model the effect of the unresolved scales on the scales of interest.
11
Multi-scale simulation techniques
  • Adaptive mesh refinement (AMR)
  • Overlay refined grids where length scales are small

Figure: M. Barad (U.C. Davis)
12
Grid savings when employing adaptive mesh
refinement
AMR simulations of internal waves in the South China Sea reduce the number of grid points by a factor of 50,000!
Taiwan
China
Philippines
Figure: M. Barad (U.C. Davis)
13
Unstructured grids
  • Potential savings:
  • Reduce the number of grid points by a factor of 1000.

Grid: B. Wang
14
Grid resolution is everything
Grid size 3000 m (80K cells) → Umax = 2 cm/s
Grid size 300 m (3M cells) → Umax = 16.1 cm/s
Grid size 60 m (45M cells) → Umax = ?
Field data from Petruncio et al. (1998), ITEX1 Mooring A2
SUNTANS results with 300 m grid: Max U = 16.1 cm/s; simulation time: 8 M2 tides
Figures: S. Jachec
[Plot axis: U velocity (cm/s)]
15
SUNTANS Overview
  • SUNTANS
  • Stanford
  • Unstructured
  • Nonhydrostatic
  • Terrain-following
  • Adaptive
  • Navier-Stokes
  • Simulator
  • Finite-volume prisms
  • Parallel computing: gcc + MPI + ParMetis

[Figure: side and top views of the unstructured prism grid]
16
Parallel computing
8-processor partitioning of Monterey Bay
17
Amdahl's Law
  • There is always a portion of a code (f) that executes sequentially and is not parallelizable
  • Number of processors: Np
  • Speedup: S = 1 / (f + (1 - f)/Np) (sketched below)
  • As Np → ∞, S → 1/f

[Plot: speedup S vs. number of processors; curves approach the ideal speedup as the problem size grows]
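A minimal sketch of Amdahl's law in Python, assuming the standard form S = 1/(f + (1 - f)/Np); the serial fractions f used below are illustrative and not taken from the slide.

# Amdahl's law: speedup on Np processors when a fraction f of the work is serial.
def speedup(f: float, num_procs: int) -> float:
    return 1.0 / (f + (1.0 - f) / num_procs)

# Illustrative serial fractions (assumed, not from the slide).
for f in (0.01, 0.05, 0.10):
    rows = "  ".join(f"Np={n:>4}: {speedup(f, n):6.1f}x" for n in (8, 64, 512, 4096))
    print(f"f={f:.2f}  {rows}  (limit 1/f = {1.0 / f:.0f}x)")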
18
Parallel Graph Partitioning
  • Given a graph, partition it with the following constraints:
  • Balance the workload
  • All processors should perform the same amount of
    work.
  • Each graph node is weighted by the local depth.
  • Minimize the number of edge cuts
  • Processors must communicate information at
    interprocessor boundaries.
  • Graph partitioning must minimize the number of edge cuts in order to minimize the cost of communication (a toy example of both constraints follows below).

[Figure: Delaunay edges and Voronoi graph of Monterey Bay]
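A toy illustration of the two constraints in Python. The graph, depth weights, and partition below are made up for illustration; in SUNTANS the actual partitioning is done by ParMetis on the depth-weighted grid graph.

# Evaluate a candidate partition: depth-weighted workload balance and edge cuts.
# The graph, weights, and partition are illustrative only.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (1, 4)]
depth_weight = {0: 10.0, 1: 12.0, 2: 8.0, 3: 15.0, 4: 11.0, 5: 9.0}
partition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # node -> processor

# Constraint 1: every processor should carry about the same total weight.
loads = {}
for node, proc in partition.items():
    loads[proc] = loads.get(proc, 0.0) + depth_weight[node]

# Constraint 2: each edge whose endpoints live on different processors is a
# "cut" edge and implies interprocessor communication.
edge_cuts = sum(1 for a, b in edges if partition[a] != partition[b])

print("per-processor workload:", loads)
print("edge cuts:", edge_cuts)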
19
ParMetis: Parallel Unstructured Graph Partitioning (Karypis et al., U. Minnesota)
Five-processor partitioning; workloads: 20.0%, 20.2%, 19.4%, 20.2%, 20.2%
Original 1089-node graph of Monterey Bay, CA
Use the depths as weights for the workload
20
Cache-unfriendly code vs...
  • Data is transferred from RAM to cache in blocks before it is loaded for use in the CPU

Consider the simple triangulation shown
[Diagram: RAM (large) → cache (small) → CPU; block transfers from RAM are slow, loads from cache are fast]
Update: s(1) = s(1) + a·s(0) + b·s(14) + c·s(17)
Machine code: 1) transfer block {s(0), s(1)}; 2) load s(0); 3) load s(1); 4) transfer block {s(14), s(15)}; 5) load s(14); 6) transfer block {s(16), s(17)}; 7) load s(17); 8) obtain new s(1).
4 loads, 3 transfers
[Figure: triangulation with nodes numbered 0-17; node 1's stencil touches the non-contiguous nodes 0, 14, and 17]
21
Cache-friendly code
  • Data is transferred from RAM to cache in blocks before it is loaded for use in the CPU

Consider the simple triangulation shown
[Diagram: RAM (large) → cache (small) → CPU; block transfers from RAM are slow, loads from cache are fast]
Update: s(1) = s(1) + a·s(0) + b·s(2) + c·s(3)
Machine code: 1) transfer block {s(0), s(1)}; 2) load s(0); 3) load s(1); 4) transfer block {s(2), s(3)}; 5) load s(2); 6) load s(3); 7) obtain new s(1).
4 loads, 2 transfers
[Figure: reordered triangulation; node 1's stencil now touches the contiguous nodes 0, 2, and 3]
A small sketch of the block counting behind these two examples follows below.
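A Python sketch of why the ordering matters, assuming the simple cache model on these two slides: data moves from RAM in contiguous two-element blocks, so the same stencil update touches fewer blocks when node 1's neighbours are numbered contiguously. The counting function is illustrative; only the block size and the two stencils come from the slides.

# Count distinct cache blocks touched by a stencil read, assuming data is moved
# from RAM in contiguous blocks of BLOCK array elements (2 in the slide example).
BLOCK = 2

def blocks_touched(indices):
    """Number of distinct blocks needed to read the given array indices."""
    return len({i // BLOCK for i in indices})

unfriendly = [1, 0, 14, 17]   # node 1's stencil before reordering
friendly = [1, 0, 2, 3]       # node 1's stencil after reordering

print("cache-unfriendly:", blocks_touched(unfriendly), "block transfers")  # 3
print("cache-friendly:  ", blocks_touched(friendly), "block transfers")    # 2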
22
ParMetis: Parallel Unstructured Graph Ordering (Karypis et al., U. Minnesota)
Unordered Monterey graph with 1089 nodes
Ordered Monterey graph with 1089 nodes
Ordering increases per-processor performance by up to 20%
23
Pressure-Split Algorithm
  • Pressure is split into its hydrostatic and
    hydrodynamic components
  • Hydrostatic pressure

Surface pressure
Barotropic pressure
Baroclinic pressure
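The equations on this slide are images that did not survive the transcript. As a hedged sketch, a standard form of the split consistent with the three labels above is:

% Sketch only: standard hydrostatic/nonhydrostatic pressure split (not
% necessarily the exact notation used in SUNTANS).
\begin{align*}
  p &= p_h + q, \qquad q = \text{hydrodynamic (nonhydrostatic) pressure},\\
  p_h &= \underbrace{p_s(x,y,t)}_{\text{surface pressure}}
      + \underbrace{\rho_0\, g\,(\eta - z)}_{\text{barotropic pressure}}
      + \underbrace{g\!\int_z^{\eta}\rho'\,\mathrm{d}z'}_{\text{baroclinic pressure}} .
\end{align*}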
24
Boussinesq Navier-Stokes Equations with
pressure-splitting
Surface pressure gradient
Internal waves: c ~ O(1 m/s)
Surface waves: c ~ O(100 m/s)
Acceleration: u ~ O(0.1 m/s)
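The momentum equation itself is also missing from the transcript. A hedged sketch of the horizontal momentum equation in standard Boussinesq form with the split pressure gradient, annotated with the wave speeds quoted above (a generic form, not necessarily SUNTANS' exact equation):

% Sketch only: horizontal (x) momentum with the split pressure gradient.
\frac{\partial u}{\partial t} + \nabla\!\cdot(u\,\mathbf{u}) - fv =
  \underbrace{-\,g\,\frac{\partial \eta}{\partial x}}_{\text{surface waves, } c \sim O(100\ \mathrm{m/s})}
  \underbrace{-\,\frac{g}{\rho_0}\frac{\partial}{\partial x}\!\int_z^{\eta}\rho'\,\mathrm{d}z'}_{\text{internal waves, } c \sim O(1\ \mathrm{m/s})}
  \underbrace{-\,\frac{1}{\rho_0}\frac{\partial q}{\partial x}}_{\text{nonhydrostatic}}
  + \text{(friction terms)}

where the unsteady acceleration term scales with u ~ O(0.1 m/s).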
25
Hydrostatic vs. Nonhydrostatic flows
  • Most environmental flows are Hydrostatic
  • Hyperbolic character
  • Long horizontal length scales, i.e. long waves
  • Only in small regions is the flow Nonhydrostatic
  • Elliptic character
  • Short length scales relative to depth

Long wave (hydrostatic)
free surface
Steep bathymetry (nonhydrostatic)
bottom
26
When is a flow nonhydrostatic?
Aspect Ratio
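The criterion on the slide is an image; the usual scaling argument, consistent with the previous slide, is sketched below (the symbol names are assumptions):

% Aspect ratio of the motion: depth scale D over horizontal length scale L.
\delta = \frac{D}{L}
% Hydrostatic balance is a good approximation when \delta \ll 1 (long waves);
% nonhydrostatic effects matter when \delta = O(1) (short, steep motions).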
27
(No Transcript)
28
When are internal waves nonhydrostatic?
CPU time: 1 day vs. CPU time: 3 days
29
Hydrostatic vs. Nonhydrostatic lock exchange
computation
Hydrostatic
Nonhydrostatic
Domain size: 0.8 m by 0.1 m (grid: 400 by 100)
30
Conditioning of the Pressure-Poisson equation
  • The 2D x-z Poisson equation for the nonhydrostatic pressure contains horizontal and vertical second-derivative terms
  • For large aspect ratio flows, the vertical-derivative term dominates
  • To a good approximation, the full operator can therefore be replaced by its vertical part
  • The preconditioned equation then inverts only the vertical part, which is a block-diagonal preconditioner (a sketch follows below)
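The equations are missing from the transcript; a hedged sketch of the scaling argument, with the aspect ratio written as δ = D/L:

% Sketch only: 2D x-z pressure-Poisson equation for the nonhydrostatic pressure q,
%   q_xx + q_zz = b,
% nondimensionalized with x ~ L and z ~ D so that the aspect ratio appears:
\delta^2\,\frac{\partial^2 q}{\partial x^{*2}} + \frac{\partial^2 q}{\partial z^{*2}} = b^*,
\qquad \delta = \frac{D}{L}.
% For \delta \ll 1 the vertical term dominates, so M \approx \partial^2/\partial z^2,
% applied column by column, approximates the full operator and acts as a
% block-diagonal (per-water-column) preconditioner.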

31
Speedup with the preconditioner when applied to a domain with δ = D/L = 0.01
[Plot of solver performance: no preconditioner (22.8X), diagonal (8.5X), block-diagonal (1X)]
32
Problem size reduction
  • All scales: 192 million-year compute time
  • Reduced range of scales: 10 years (Monterey Bay with 50 m grid cells in a 100 km by 100 km domain with 10-second time steps)
  • Parallel computing: 6 months
  • Optimal load balancing: 5 months
  • Optimal grid ordering/cache-friendliness: 4 months
  • Optimal pressure preconditioning: 1 month
  • Optimal time-stepping/other numerics: 2 weeks
  • Total time savings: 5 billion times faster! (A quick check of this factor follows below.)
  • For more information visit http://suntans.stanford.edu
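A quick arithmetic check of the claimed savings in Python, using only the stage-by-stage times quoted above (the unit conversions are the only assumptions):

# Stage-by-stage compute times from the slide, converted to days.
YEAR, MONTH, WEEK = 365.25, 30.44, 7.0
stages = [
    ("reduced range of scales", 10 * YEAR),      # down from 192 million years (all scales)
    ("parallel computing", 6 * MONTH),
    ("optimal load balancing", 5 * MONTH),
    ("grid ordering / cache-friendliness", 4 * MONTH),
    ("pressure preconditioning", 1 * MONTH),
    ("time stepping / other numerics", 2 * WEEK),
]

previous = 192e6 * YEAR                           # all scales, in days
for name, days in stages:
    print(f"{name:36s} {previous / days:12.1f}x faster than the previous stage")
    previous = days

overall = (192e6 * YEAR) / stages[-1][1]
print(f"{'overall':36s} {overall:12.2e}x faster   # ~5 billion, as on the slide")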