Title: Optimizing Performance of the Lattice Boltzmann Method for Complex Structures
1. Optimizing Performance of the Lattice Boltzmann Method for Complex Structures
- Friedrich-Alexander University Erlangen/Nuremberg
- Department of Computer Science 10 (System Simulation)
- Regional Computing Center of Erlangen (RRZE)
2. Outline
- Introduction
- Lattice Boltzmann Method
- Implementation Aspects
- Application
- Implementation
- Optimization
- Results
- Conclusion
3. Lattice Boltzmann Method
- Boltzmann Equation
- Discretization of particle velocity space
- (finite set of discrete velocities)
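For reference, the two items above can be written out as follows. This is a sketch that assumes the single-relaxation-time (BGK) collision model, which the slides do not state explicitly; the symbols $\boldsymbol{\xi}$ (particle velocity), $\lambda$ (relaxation time) and $f^{\mathrm{eq}}$ (equilibrium distribution) are notation chosen here.

```latex
% Boltzmann equation with BGK collision operator (assumed model)
\frac{\partial f}{\partial t} + \boldsymbol{\xi} \cdot \nabla f
  = -\frac{1}{\lambda}\left(f - f^{\mathrm{eq}}\right)

% Velocity space restricted to a finite set \{\boldsymbol{\xi}_i\}, i = 0, \dots, Q-1
\frac{\partial f_i}{\partial t} + \boldsymbol{\xi}_i \cdot \nabla f_i
  = -\frac{1}{\lambda}\left(f_i - f_i^{\mathrm{eq}}\right)
```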
4. Lattice Boltzmann Method
- Different discretization schemes
- Numerical accuracy and stability
- Computational speed and simplicity
[Figure: velocity sets D3Q15, D3Q19 and D3Q27]
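The three velocity sets named above differ only in which neighbor directions they include. The following is a minimal sketch that enumerates them by squared velocity length; the function name `velocity_set` and the resulting ordering are illustrative assumptions, not the ordering used in the authors' solver.

```python
from itertools import product

def velocity_set(allowed_sq_speeds):
    """Return all 3D lattice velocities whose squared length is in the given set."""
    return [c for c in product((-1, 0, 1), repeat=3)
            if sum(k * k for k in c) in allowed_sq_speeds]

D3Q15 = velocity_set({0, 1, 3})     # rest + 6 axis neighbors + 8 cube diagonals
D3Q19 = velocity_set({0, 1, 2})     # rest + 6 axis neighbors + 12 face diagonals
D3Q27 = velocity_set({0, 1, 2, 3})  # all 27 combinations

print(len(D3Q15), len(D3Q19), len(D3Q27))
```

More directions mean more memory traffic and arithmetic per cell, which is the speed-versus-accuracy trade-off listed on the slide.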
5. Lattice Boltzmann Method
- Discretization in space x and time t
- Collision step
- Streaming step
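The two steps can be written out as below. This is a sketch assuming the BGK collision model with a dimensionless relaxation time $\tau$; the symbols $\tilde{f}_i$ (post-collision value), $\mathbf{e}_i$ (discrete velocity) and $\delta t$ (time step) are notation chosen here, since the original slide formulas did not survive extraction.

```latex
% Collision step: relax each distribution towards its equilibrium (BGK assumed)
\tilde{f}_i(\mathbf{x}, t) = f_i(\mathbf{x}, t)
  - \frac{1}{\tau}\left[f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t)\right]

% Streaming step: move the post-collision value to the neighbor in direction \mathbf{e}_i
f_i(\mathbf{x} + \mathbf{e}_i\,\delta t,\; t + \delta t) = \tilde{f}_i(\mathbf{x}, t)
```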
6. Lattice Boltzmann Method (Implementation Aspects)
- Discretization in space x and time t
- Collision step
- Streaming step
- Stream-Collide (Pull Method)
- Get the distributions from the neighboring cells in the source array and store the relaxed values to one cell in the destination array
- Collide-Stream (Push Method)
- Take the distributions from one cell in the source array and store the relaxed values to the neighboring cells in the destination array
[Figure: source and destination arrays]
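The difference between the two update orders is easiest to see on a toy lattice. The sketch below uses a periodic 1D lattice with three velocities and a made-up relaxation towards the cell mean; both are assumptions for illustration only and stand in for the D3Q19 model and the real collision operator.

```python
C = (0, 1, -1)  # toy D1Q3 velocities: rest, right, left (assumption)

def relax(cell, omega=0.8):
    """Toy relaxation towards the cell mean; a stand-in for the real collision."""
    mean = sum(cell) / len(cell)
    return [v - omega * (v - mean) for v in cell]

def push_step(src):
    """Collide-stream: relax one cell, scatter the results to its neighbors."""
    n = len(src)
    dst = [[0.0] * len(C) for _ in range(n)]
    for x in range(n):
        post = relax(src[x])
        for i, c in enumerate(C):
            dst[(x + c) % n][i] = post[i]
    return dst

def pull_step(src):
    """Stream-collide: gather from the neighbors, relax, store to one cell."""
    n = len(src)
    dst = []
    for x in range(n):
        gathered = [src[(x - c) % n][i] for i, c in enumerate(C)]
        dst.append(relax(gathered))
    return dst

state = [[1.0, 0.5, 0.2], [0.3, 0.9, 0.1], [0.4, 0.4, 0.7], [0.6, 0.2, 0.8]]
after_push = push_step(state)                            # stores pre-collision values
after_pull = pull_step([relax(cell) for cell in state])  # stores post-collision values
```

Both variants describe the same dynamics. They differ in whether the array holds pre-collision or post-collision values between time steps, and in whether the scattered memory accesses are writes (push) or reads (pull).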
7. Lattice Boltzmann Method (Implementation Aspects)
- Walls and obstacles: bounce-back rule
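A minimal sketch of the bounce-back rule: a distribution that would stream into a solid cell is returned to its own cell in the opposite direction. The direction ordering, the `OPPOSITE` table and the helper `bounce_back` are illustrative assumptions.

```python
from itertools import product

# D3Q19 velocities; the ordering is whatever itertools.product yields (assumption)
VELOCITIES = [c for c in product((-1, 0, 1), repeat=3)
              if sum(k * k for k in c) <= 2]

# OPPOSITE[i] is the index of the velocity pointing exactly against VELOCITIES[i]
OPPOSITE = [VELOCITIES.index(tuple(-k for k in c)) for c in VELOCITIES]

def bounce_back(post_collision, blocked):
    """Map each blocked direction i to (opposite direction, reflected value).

    post_collision: the 19 relaxed values of one fluid cell
    blocked: indices of directions whose neighbor cell is solid
    """
    return {OPPOSITE[i]: post_collision[i] for i in blocked}

values = [float(i) for i in range(len(VELOCITIES))]
east = VELOCITIES.index((1, 0, 0))
reflected = bounce_back(values, [east])
```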
8. Implementation Aspects
- Data dependencies
- Two grids
- Compressed grid
9. Implementation Aspects
    double precision f(0:xMax+1, 0:yMax+1, 0:zMax+1, 0:18, 0:1)
    do z = 1, zMax
      do y = 1, yMax
        do x = 1, xMax
          if ( fluidcell(x,y,z) ) then
            ! Collide
            LOAD f(x,y,z, 0:18, t)
            Relaxation (complex computations)
            ! Stream
            SAVE f(x  , y  , z  ,  0, t+1)
            SAVE f(x+1, y+1, z  ,  1, t+1)
            SAVE f(x  , y+1, z  ,  2, t+1)
            SAVE f(x-1, y+1, z  ,  3, t+1)
            ...
            SAVE f(x  , y-1, z-1, 18, t+1)
          endif
        enddo
      enddo
    enddo
10. Application
11. Application: Porous Media Combustion
- New technology for heating installations
- Porous Media Combustion (PMC)
- The fuel-air mixture no longer reacts in a free flame
- The combustion process takes place inside the pores of a porous medium that is placed in the reaction area
[Figure: dimension labels 60 mm and 20 mm]
12. Application: Porous Media Combustion
Figures courtesy of LSTM Uni-Erlangen, Thomas Zeiser
13. Application: Porous Media Combustion
- Various applications
- Modern steam engines, vehicle heaters
Figures courtesy of LSTM Uni-Erlangen, Thomas Zeiser
14. Application: Introducing the Complex Geometries Used
- Porous medium from PMC: silicon carbide (SiC)
- Many small obstacles
- Obstacle/fluid ratio: 2%
- High number of fluid-solid faces
15. Application: Introducing the Complex Geometries Used
- Second test geometry
- MC
- Huge obstacles, only a few fluid tubes
- Obstacle/fluid ratio: 50%
- Low number of fluid-solid faces
16. Implementation
17. Implementation
- Collision and streaming step in the same loop
- Push method
- Data representation in a 1D array
- Stores only fluid cells (thus saves memory)
- Indirect addressing of target cells (by an extra connectivity array)
- Boundary conditions (bounce back) handled implicitly
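The points above can be sketched on a small 2D example. The code below is an illustrative assumption throughout: it uses D2Q9 instead of D3Q19, a periodic domain, a toy relaxation, and hypothetical names (`build_connectivity`, `push_step`). It shows how a connectivity array lets a single loop over fluid cells handle both streaming and bounce back without any if-statement in the solver.

```python
# D2Q9 velocities as a 2D stand-in for D3Q19 (ordering is an assumption)
E = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1), (1, -1), (-1, 1)]
OPP = [E.index((-ex, -ey)) for ex, ey in E]
Q = len(E)

def build_connectivity(solid):
    """Preprocessing: number the fluid cells and store, for every
    (cell, direction), the flat 1D index the push step writes to."""
    ny, nx = len(solid), len(solid[0])
    cell_id = {}
    for y in range(ny):
        for x in range(nx):
            if not solid[y][x]:
                cell_id[(x, y)] = len(cell_id)
    conn = [0] * (len(cell_id) * Q)
    for (x, y), n in cell_id.items():
        for i, (ex, ey) in enumerate(E):
            target = ((x + ex) % nx, (y + ey) % ny)  # periodic domain (assumption)
            if target in cell_id:
                conn[n * Q + i] = cell_id[target] * Q + i  # normal streaming
            else:
                conn[n * Q + i] = n * Q + OPP[i]           # implicit bounce back
    return conn

def push_step(src, conn):
    """Solver: one loop over fluid cells, targets looked up indirectly."""
    dst = [0.0] * len(src)
    for n in range(len(src) // Q):
        cell = src[n * Q:(n + 1) * Q]
        mean = sum(cell) / Q  # toy relaxation, not the real collision operator
        for i in range(Q):
            dst[conn[n * Q + i]] = cell[i] - 0.5 * (cell[i] - mean)
    return dst

solid = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
conn = build_connectivity(solid)
src = [float(k % 7 + 1) for k in range(len(conn))]
dst = push_step(src, conn)
```

Because every target slot is written exactly once, the connectivity array is a permutation of the slot indices, and mass is conserved across the step.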
18. Implementation
- Indirect addressing and implicit bounce back
[Figure: cells i-1, i and i+1 next to an obstacle wall, with the corresponding entries of the connectivity array]
20. Implementation
- Preprocessor
- Sets up connectivity array
- Specifies all domain parameters
- Geometry and obstacles
- Traversing scheme
- Solver
- Reads in preprocessed information
- Performs lattice Boltzmann method (in single loop)
21. Implementation
- Rating
- 1D array compared to the standard implementation using a multidimensional array and three loops
- Advantages
- Saves memory
- The more obstacles in the domain, the higher the compression
- Implicit bounce back
- No extra routine or if-statement needed
- Drawbacks
- Indirect addressing
- Prevents the compiler from vectorizing and from applying other optimization techniques
- Consequently, worse performance
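The memory argument can be made concrete with a rough estimate. The sketch below rests on assumptions not stated in the slides: two copies of the distributions in both variants, 4-byte connectivity indices, and 18 connectivity entries per fluid cell. The function name `memory_bytes` is hypothetical.

```python
def memory_bytes(n_cells, obstacle_fraction, q=19, real_bytes=8, index_bytes=4):
    """Estimate memory for a full two-grid array and for a fluid-only 1D array.

    Assumptions: both variants keep two copies of the distributions; the
    1D variant adds (q - 1) connectivity indices per fluid cell.
    """
    full = n_cells * q * real_bytes * 2
    n_fluid = round(n_cells * (1.0 - obstacle_fraction))
    sparse = n_fluid * (q * real_bytes * 2 + (q - 1) * index_bytes)
    return full, sparse

full, sparse_half = memory_bytes(100 ** 3, 0.5)
_, sparse_dense = memory_bytes(100 ** 3, 0.9)
print(full, sparse_half, sparse_dense)
```

Under these assumptions the connectivity array adds a fixed overhead per fluid cell, so the net saving grows with the obstacle fraction.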
22. Optimization: Outline
- Memory Traversing Schemes
- Space-Filling Curves
- Blocking
- Memory Layouts
- Further Optimization Techniques
23. Memory Traversing Schemes: Space-Filling Curves
- What is a space-filling curve?
- Loosely speaking
- A one-dimensional curve that fills a higher-dimensional space
- Which curves were used?
- Hilbert
- Peano
- How are they constructed?
- Again, loosely
- By a mapping from a one-dimensional interval to the higher-dimensional space
- Then, by recursion, a new mapping of each part of the interval
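To make the recursive mapping tangible, here is a sketch of the well-known iterative conversion from a position along a 2D Hilbert curve to grid coordinates. It is a 2D illustration only; the slides use 3D curves built with a table-based approach, which this code does not reproduce. The function name `hilbert_d2xy` is an assumption.

```python
def hilbert_d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two; d runs from 0 to n*n - 1.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:              # rotate/reflect the current sub-square
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx              # move into the selected quadrant
        y += s * ry
        t //= 4
        s *= 2
    return x, y

order_2 = [hilbert_d2xy(2, d) for d in range(4)]
order_8 = [hilbert_d2xy(8, d) for d in range(64)]
```

Consecutive positions along the curve are always direct grid neighbors, which is the spatial-locality property that makes such curves attractive as a memory traversal order.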
24. Memory Traversing Schemes: Space-Filling Curves
- And how does the construction really work?
[Figure: construction on the interval from 0 to 1]
25. Memory Traversing Schemes: Space-Filling Curves
- How does that look in 3D?
- OK, but how to construct them?
26. Memory Traversing Schemes: Space-Filling Curves
- How to construct them?
- Table-based approach
- Hilbert: 48 productions with 8 entries and 7 connectors
- Peano: 8 productions with 27 entries and 26 connectors

Current level | Next level (entries alternating with connectors)
bne           | enb b ben n ben f fse e fse b bws s bws f wnf
fws           | swf f fsw w fsw b bes s bes f fne e fne b nwb
27. Memory Traversing Schemes: Space-Filling Curves
- Summary for space-filling curves
- Recursive production by segmentation
- Limitation in system sizes
- Hilbert: 2^(3n)
- Peano: 3^(3n)
- Increase spatial locality
- Enable mesh refinement
28. Memory Traversing Schemes: Blocking
- Implicit blocking technique
- Arrangement of data in a blocking manner
- Increases spatial locality
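A minimal sketch of what arranging data in a blocked manner could look like: the preprocessor enumerates cells block by block, and the resulting order defines where each cell is stored in the 1D array. The function name `blocked_order` and the cubic block shape are assumptions.

```python
def blocked_order(nx, ny, nz, b):
    """List all (x, y, z) cells block by block; inside a block, x runs fastest."""
    order = []
    for z0 in range(0, nz, b):
        for y0 in range(0, ny, b):
            for x0 in range(0, nx, b):
                for z in range(z0, min(z0 + b, nz)):
                    for y in range(y0, min(y0 + b, ny)):
                        for x in range(x0, min(x0 + b, nx)):
                            order.append((x, y, z))
    return order

order = blocked_order(8, 8, 8, 4)
position = {cell: n for n, cell in enumerate(order)}
```

In a plain lexicographic order of this 8x8x8 grid, the cell directly above (0, 0, 0) sits 64 entries away; in the blocked order it is only 16 entries away.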
29. Memory Traversing Schemes
- Notes on memory traversing schemes
- Pure preprocessing technique
- Only the arrangement of data in memory is changed
- No change for the solver
- No overhead for the solver (e.g. loop overhead for blocking)
30. Memory Layouts: Collision Optimized Layout
- Standard array layout F(i,x,y,z,t): Array-of-Structures
- Collision optimized
- Optimal read access: 2 cache lines per LUP (lattice site update)
- Bad write access: 19 stores in 19 cache lines
- But: depending on the system size, some of them are already/still in cache
31. Memory Layouts: Collision Optimized Layout
- 18 write accesses on one cell from 3 different z-layers
- 8 write accesses on one cell from 3 different rows
32. Memory Layouts: Collision Optimized Layout
- Performance of the collision optimized layout (P4, 512 kB L2)
33. Memory Layouts: Propagation Optimized Layout
- Optimized array layout F(x,y,z,i,t): Structure-of-Arrays
- Stride-1 access on x (inner loop)
- 19 cache lines per 16 LUPs in the read and write process
- 1 cache miss every 16th memory access
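The cache-line counts quoted for the two layouts can be checked with a small index calculation. The sketch assumes Fortran column-major ordering (first index fastest), 8-byte reals, and a 128-byte cache line; the line size is inferred from the "16 LUPs" figure and is not stated in the slides. All function names are hypothetical, and the time index t is left out.

```python
Q = 19            # D3Q19
REAL_BYTES = 8    # double precision
LINE_BYTES = 128  # assumed cache line size (16 doubles per line)

def index_collision_opt(i, x, y, z, nx, ny):
    """F(i,x,y,z): direction index fastest, so one cell is contiguous."""
    return i + Q * (x + nx * (y + ny * z))

def index_propagation_opt(x, y, z, i, nx, ny, nz):
    """F(x,y,z,i): x fastest, direction slowest."""
    return x + nx * (y + ny * (z + nz * i))

def lines_touched(indices):
    """Number of distinct cache lines covered by the given element indices."""
    return len({(k * REAL_BYTES) // LINE_BYTES for k in indices})

nx = ny = nz = 32

# Collision optimized: load all 19 values of the first cell
read_one_cell = lines_touched(
    index_collision_opt(i, 0, 0, 0, nx, ny) for i in range(Q))

# Propagation optimized: load all 19 values of 16 consecutive cells along x
read_16_cells = lines_touched(
    index_propagation_opt(x, 0, 0, i, nx, ny, nz)
    for x in range(16) for i in range(Q))

print(read_one_cell, read_16_cells)
```

The 152 bytes of one cell do not fit a single 128-byte line, so a cell load in the collision optimized layout spans about two lines, with the exact count depending on alignment.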
34. Memory Layouts: Propagation Optimized Layout
- Performance on Pentium 4, 512 kB L2
35. Further Optimization Techniques
- Additional bottlenecks
- Large loop body (causes register spills on IA32)
- Concurrent writing to 19 different cache lines interferes with the number of write-combine buffers on IA32 (6 for Intel Xeon/Nocona)
- Indirect addressing prevents the IA32 hardware prefetcher from preloading values for target cells (due to bounce back at obstacles)
- Solutions (implemented in the solver)
- Split up the loop into 5 loops of length Nx
- Manual block preload technique
- (Drawback: both techniques need a loop blocking scheme)
- These solutions are only needed for IA32
36. Results: Outline
- Architecture Descriptions
- Comparison 1D-Solver to Standard Solver
- Memory Layouts
- For Standard Solver
- For 1D-Solver
- Influence of Geometry
- MC
- SiC
- Memory Traversing Schemes
- Space-Filling Curves
- Blocking
37. Architecture Description: Nocona/Irwindale
- Test system: test machine at RRZE
- CPUs: Nocona (Irwindale), 3.6 GHz, 2 MB L2 cache
- Memory: DDR400, 6.4 GB/s
- Architectural specialties
- EM64T extension
- Hyperthreading
- One memory bus for both processors
38. Architecture Description: Itanium 2 / Altix
- Test system: RRZE SGI Altix
- CPUs: Itanium 2, 1.3 GHz, 3 MB L3 cache
- Memory: 112 GB distributed shared memory
- Architectural specialties
- Itanium 2
- EPIC (Explicitly Parallel Instruction Computing)
- No out-of-order execution
- Parallelization of instructions is left to the compiler (bundles)
- L1 cache only for integer data
- Altix
- ccNUMA with NUMALink 3
- Memory connected hierarchically by SHUBs
39. Architecture Description: AMD Opteron
- Test system: LSS HPC cluster
- CPUs: AMD Opteron, 2.2 GHz, 1 MB L2 cache, IA32-compatible
- Memory: DDR333, 5.2 GB/s
- Architectural specialties
- Compute nodes with four CPUs
- 4 GB RAM per CPU; each CPU can access 16 GB via ccNUMA
- Interconnect
- CPUs on one node: HyperTransport (6.4 GB/s)
- Nodes: InfiniBand (10 GB/s)
40. Comparison: 1D-Array Solver to Standard Solver
41. Memory Layouts for Standard Solver
42. Memory Layouts on Itanium 2
43. Memory Layouts on AMD Opteron
44. Memory Layouts on Nocona/Irwindale
45. Influence of Geometry: SiC foam (2% obstacles)
46. Influence of Geometry: MC (50% obstacles)
47. Memory Traversing Schemes
48. Memory Traversing Schemes
49. Conclusion
- The 1D-array data representation makes performance independent of the obstacle-to-fluid ratio
- Memory traversing by space-filling curves results in performance similar to spatial blocking
- Implementing SFCs is not worth the effort (if they are used as a memory traversing alternative only)
- Together with indirect addressing, the collision optimized layout with blocking is the best technique if the cache is larger than 1 MB
- Indeed, there are cases where the propagation optimized layout is not best
50. Outlook
- Future work could concern
- Space-filling curves
- A kind of staggered SFCs, with a separate curve for every direction
- Avoid wasting underused cache lines where lattice sites are neighboring cells which are visited much later
- Galerkin discretization or point-wise evaluation of LBM to enable a stack implementation in conjunction with SFCs
- BUT: for real-world problems, construction on non-cubic grids is necessary first
- Search for vectorization-enhancing techniques to overcome problems with indirect addressing on Itanium 2
- Search for reasons why the collision optimized layout is better than the propagation optimized layout
51. Acknowledgement / References
- Acknowledgement
- Bavarian Graduate School for Computational Engineering
- Thomas Zeiser (RRZE)
- Gerhard Wellein (RRZE)
- References
- S. Donath, T. Zeiser, G. Hager, J. Habich, G. Wellein: Optimizing Performance of the Lattice Boltzmann Method for Complex Structures on Cache-based Architectures
- G. Wellein, P. Lammers, G. Hager, S. Donath, T. Zeiser: Towards Optimal Performance for Lattice Boltzmann Applications on Terascale Computers
- G. Wellein, T. Zeiser, S. Donath, G. Hager: On the single processor performance of simple lattice Boltzmann kernels