Optimizing Performance of the Lattice Boltzmann Method for Complex Structures

About This Presentation

Title:

Optimizing Performance of the Lattice Boltzmann Method for Complex Structures

Description:

Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department of Computer Science 10 ... – PowerPoint PPT presentation

Slides: 52

Provided by: feri73

Transcript and Presenter's Notes

Title: Optimizing Performance of the Lattice Boltzmann Method for Complex Structures

1
Optimizing Performance of the Lattice Boltzmann
Method for Complex Structures

Friedrich-Alexander University Erlangen/Nuremberg
Department of Computer Science 10 (System
Simulation)
Regional Computing Center of Erlangen (RRZE)

2
Outline

Introduction
Lattice Boltzmann Method
Implementation Aspects
Application
Implementation
Optimization
Results
Conclusion

3
Lattice Boltzmann Method

Boltzmann Equation
Discretization of particle velocity space
(finite set of discrete velocities)

4
Lattice Boltzmann Method

Different discretization schemes
Numerical accuracy and stability
Computational speed and simplicity

D3Q15
D3Q19
D3Q27
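
The three velocity sets differ only in which neighbor vectors they include. As a minimal sketch (the function names and the ordering of the directions are illustrative assumptions, not the numbering used on the slides), they can be generated like this:

```python
from itertools import product

def velocity_set(max_norm_sq):
    """All lattice vectors in {-1,0,1}^3 whose squared length is <= max_norm_sq."""
    return [c for c in product((-1, 0, 1), repeat=3)
            if sum(k * k for k in c) <= max_norm_sq]

# D3Q27: rest vector, 6 axis neighbors, 12 face diagonals, 8 cube diagonals
D3Q27 = velocity_set(3)
# D3Q19: drops the 8 cube diagonals
D3Q19 = velocity_set(2)
# D3Q15: rest vector, 6 axis neighbors and the 8 cube diagonals
D3Q15 = [c for c in D3Q27 if sum(k * k for k in c) in (0, 1, 3)]

def opposite(vels):
    """Index of the reversed direction for every direction (used by bounce back)."""
    return [vels.index(tuple(-k for k in c)) for c in vels]
```

More directions generally mean more work and memory traffic per cell, which is the trade-off between accuracy/stability and speed named above.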
5
Lattice Boltzmann Method

Discretization in space x and time t

collision step
streaming step
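
The equations on this slide did not survive in the transcript. A common way to write the two steps, assuming the single-relaxation-time (BGK) collision model (an assumption; the symbols τ, f_i^eq and e_i are introduced here for illustration), is:

```latex
\begin{align}
\tilde f_i(\mathbf{x},t) &= f_i(\mathbf{x},t)
  - \frac{1}{\tau}\left[ f_i(\mathbf{x},t) - f_i^{\mathrm{eq}}(\rho,\mathbf{u}) \right]
  && \text{(collision step)} \\
f_i(\mathbf{x}+\mathbf{e}_i\,\Delta t,\; t+\Delta t) &= \tilde f_i(\mathbf{x},t)
  && \text{(streaming step)}
\end{align}
```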
6
Lattice Boltzmann Method (Implementation Aspects)

Discretization in space x and time t

collision step
streaming step

Stream-Collide (Pull-Method)
Get the distributions from the neighboring cells
in the source array and store the relaxed
values to one cell in the destination array
Collide-Stream (Push-Method)
Take the distributions from one cell in the
source array and store the relaxed values to
the neighboring cells in the destination array

[Figure: source array and destination array]
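
To make the difference between the two orderings concrete, here is a minimal sketch on a periodic one-dimensional D1Q3 lattice. The `relax` function is a hypothetical stand-in for the real collision operator, and all names are illustrative:

```python
# D1Q3 lattice: direction 0 is at rest, 1 moves in +x, 2 moves in -x
E = (0, 1, -1)

def relax(cell):
    """Hypothetical stand-in for the collision: moves each value halfway to the mean."""
    mean = sum(cell) / len(cell)
    return [0.5 * (v + mean) for v in cell]

def collide_stream_push(src):
    """Push: relax one source cell, scatter the results to neighboring cells."""
    n = len(src)
    dst = [[0.0] * 3 for _ in range(n)]
    for x in range(n):
        post = relax(src[x])
        for i, e in enumerate(E):
            dst[(x + e) % n][i] = post[i]
    return dst

def stream_collide_pull(src):
    """Pull: gather values from neighboring source cells, relax, store to one cell."""
    n = len(src)
    dst = [None] * n
    for x in range(n):
        gathered = [src[(x - e) % n][i] for i, e in enumerate(E)]
        dst[x] = relax(gathered)
    return dst
```

Both variants apply the same two operators and differ only in which one comes first within a time step, so a push step followed by a collision gives the same result as a collision followed by a pull step. What changes is the memory access pattern: push reads one cell contiguously and writes scattered, pull reads scattered and writes contiguously.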
7
Lattice Boltzmann Method (Implementation Aspects)

Walls and obstacles: bounce-back rule

8
Implementation Aspects

Data Dependencies
Two Grids
Compressed Grid

9
Implementation Aspects
double precision f(0:xMax+1, 0:yMax+1, 0:zMax+1, 0:18, 0:1)
do z = 1, zMax
  do y = 1, yMax
    do x = 1, xMax
      if ( fluidcell(x,y,z) ) then
        ! Collide
        LOAD f(x,y,z, 0:18, t)
        Relaxation (complex computations)
        ! Stream
        SAVE f(x  , y  , z  ,  0, t+1)
        SAVE f(x+1, y+1, z  ,  1, t+1)
        SAVE f(x  , y+1, z  ,  2, t+1)
        SAVE f(x-1, y+1, z  ,  3, t+1)
        ...
        SAVE f(x  , y-1, z-1, 18, t+1)
      endif
    enddo
  enddo
enddo
10
Application
11
Application: Porous Media Combustion (PMC)

New technology for heating installations
Porous Media Combustion
Fuel-air mixture no longer reacts in a free
flame
Combustion process takes place inside the pores
of a porous medium that is placed in the reaction
area

[Figure: dimension labels 60 mm and 20 mm]
12
Application: Porous Media Combustion (PMC)
Figures by courtesy of LSTM Uni-Erlangen, Thomas
Zeiser
13
Application: Porous Media Combustion (PMC)

Various applications
modern steam engines, vehicle heaters

Figures by courtesy of LSTM Uni-Erlangen, Thomas
Zeiser
14
Application: Introducing the Complex Geometries Used

Porous medium from PMC: silicon carbide (SiC)
Many small obstacles
Obstacle/fluid ratio: 2
High number of fluid-solid faces

15
Application: Introducing the Complex Geometries Used

Second test geometry
MC
Huge obstacles, only few fluid tubes
Obstacle/fluid ratio: 50
Low number of fluid-solid faces

16
Implementation
17
Implementation

Collision and streaming step in same loop
Push-Method
Data representation in 1D-Array
Stores only fluid cells (→ saves memory)
Indirect addressing of target cells (by extra
connectivity array)
Boundary conditions (Bounce Back) handled
implicitly
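
A minimal sketch of how a connectivity array can encode both the neighbor lookup and the bounce-back rule. It uses a small periodic 2D D2Q9 grid to stay short; the deck itself works in 3D with 19 directions, and all names here (`build_connectivity`, `push_step`, the array layout) are illustrative assumptions rather than the authors' code:

```python
# D2Q9 directions; OPP[i] is the index of the direction opposite to i.
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (-1, -1), (1, -1)]
OPP = [E.index((-ex, -ey)) for ex, ey in E]
Q = len(E)

def build_connectivity(solid):
    """Preprocessing: number the fluid cells and record every push target.

    solid[y][x] is True for obstacle cells. conn[n][i] is the flat index in the
    1D distribution array that receives direction i of fluid cell n.
    """
    ny, nx = len(solid), len(solid[0])
    cell_id = {}
    for y in range(ny):
        for x in range(nx):
            if not solid[y][x]:
                cell_id[(x, y)] = len(cell_id)   # only fluid cells get a slot
    conn = []
    for (x, y), n in cell_id.items():
        row = []
        for i, (ex, ey) in enumerate(E):
            tgt = ((x + ex) % nx, (y + ey) % ny)   # periodic domain (assumption)
            if tgt in cell_id:
                row.append(cell_id[tgt] * Q + i)   # ordinary propagation
            else:
                row.append(n * Q + OPP[i])         # bounce back into the same cell
        conn.append(row)
    return cell_id, conn

def push_step(f_src, conn, relax=lambda cell: cell):
    """Single loop over fluid cells; no obstacle test is needed in the solver."""
    f_dst = [0.0] * len(f_src)
    for n, row in enumerate(conn):
        post = relax(f_src[n * Q:(n + 1) * Q])
        for i in range(Q):
            f_dst[row[i]] = post[i]
    return f_dst
```

Because a blocked link simply points back at the reversed direction of the sending cell, the solver loop never branches on the geometry, which is what "handled implicitly" refers to.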

18
Implementation

Indirect addressing and implicit Bounce Back

[Figure: cells i-1, i and i+1 next to an obstacle wall, linked through the connectivity array]
19
Implementation

Indirect addressing and implicit Bounce Back

[Figure: cells i-1, i and i+1 next to an obstacle wall, linked through the connectivity array]
20
Implementation

Preprocessor
Sets up connectivity array
Specifies all domain parameters
Geometry and obstacles
Traversing scheme
Solver
Reads in preprocessed information
Performs lattice Boltzmann method (in single loop)

21
Implementation

Rating
1D-Array compared to standard implementation
using multidimensional array and three loops
Advantages
Saves memory
The more obstacles in the domain the higher is
the compression
Implicit Bounce Back
No extra routine or if-statement needed
Drawbacks
Indirect addressing
Prevents compiler from vectorization and other
optimization techniques
→ Consequently, worse performance
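
A rough memory count illustrates the trade-off. This is a back-of-the-envelope sketch under stated assumptions, none of which come from the slides: double precision values, two copies of the distributions in both variants, 4-byte connectivity indices, and 18 connectivity entries per fluid cell (the rest direction stays in place).

```python
Q = 19            # D3Q19 distributions per cell
DOUBLE = 8        # bytes per double precision value
INDEX = 4         # bytes per connectivity entry (assumption)
COPIES = 2        # source and destination array (assumption for both variants)

def full_bytes(nx, ny, nz):
    """Standard multidimensional array: every cell is stored, fluid or not."""
    return nx * ny * nz * Q * COPIES * DOUBLE

def sparse_bytes(n_fluid):
    """1D array: fluid cells only, plus the connectivity array."""
    return n_fluid * (Q * COPIES * DOUBLE + (Q - 1) * INDEX)

def break_even_fluid_fraction():
    """Fluid fraction below which the 1D array is the smaller representation."""
    per_cell_full = Q * COPIES * DOUBLE
    per_cell_sparse = per_cell_full + (Q - 1) * INDEX
    return per_cell_full / per_cell_sparse
```

Under these assumptions the connectivity array costs 72 extra bytes per fluid cell, so the 1D representation only becomes smaller once roughly a fifth of the domain is obstacle; beyond that, the saving grows with the obstacle fraction.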

22
Optimization - Outline

Memory Traversing Schemes
Space-Filling Curves
Blocking
Memory Layouts
Further Optimization Techniques

23
Memory Traversing Schemes Space-Filling Curves

What is a space-filling curve?
Loosely speaking
A one dimensional curve that fills a higher
dimensional space
Which curves were used?
Hilbert
Peano
How are they constructed?
Again, loosely
By a mapping from a one-dimensional interval to
the higher dimensional space
Then, by recursion new mapping of each part of
the interval
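
As an illustration of the recursive idea, here is the widely used iterative index-to-coordinate mapping for the two-dimensional Hilbert curve. This is a swapped-in 2D technique chosen for brevity, not the 3D table-based construction the deck uses; the function name `d2xy` is illustrative:

```python
def d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # flip the sub-square built so far
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x          # rotate by swapping the axes
        x += s * rx              # move into the quadrant chosen at this level
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Traversal order for an 8 x 8 grid: consecutive cells are always direct neighbors.
order = [d2xy(8, d) for d in range(64)]
```

Each loop iteration handles one level of the recursion: two bits of `d` select the quadrant, and the rotation keeps the sub-curves connected end to end.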

24
Memory Traversing Schemes Space-Filling Curves

And how does the construction really work?

[Figure: construction on the interval from 0 to 1]
25
Memory Traversing Schemes Space-Filling Curves

How does that look in 3D?

OK, but how to construct them?

26
Memory Traversing Schemes Space-Filling Curves

How to construct them?
Table based approach
Hilbert: 48 productions with 8 entries and 7
connectors
Peano: 8 productions with 27 entries and 26
connectors

Current level   Next level (8 entries alternating with 7 connectors)
bne             enb b ben n ben f fse e fse b bws s bws f wnf
fws             swf f fsw w fsw b bes s bes f fne e fne b nwb
27
Memory Traversing Schemes Space-Filling Curves

Summary for Space-Filling Curves
Recursive production by segmentation
Limitation in system sizes
Hilbert: 2^(3n)
Peano: 3^(3n)
Increase spatial locality
Enable mesh-refinement
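
The size limitation means the curve lives on a cube whose edge length is a power of 2 (Hilbert) or a power of 3 (Peano). A small sketch of what that implies for an arbitrary domain; the helper name `padded_edge` is hypothetical:

```python
def padded_edge(n, base):
    """Smallest power of `base` (at least base**1) that is >= n."""
    edge = base
    while edge < n:
        edge *= base
    return edge

# A domain with 100 cells per edge needs a 128^3 cube for Hilbert (2^(3*7) cells)
# and a 243^3 cube for Peano (3^(3*5) cells).
hilbert_edge = padded_edge(100, 2)
peano_edge = padded_edge(100, 3)
```

Since only fluid cells are stored in the 1D array, the padding affects the preprocessing effort rather than the solver's memory, but the coarser steps of the Peano sizes make the mismatch larger.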

28
Memory Traversing Schemes Blocking

Implicit blocking technique
Arrangement of data in a blocking manner
Increases spatial locality
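
A minimal sketch of what arranging the data "in a blocking manner" can look like: the preprocessor enumerates the cells block by block and numbers them in that order, so the solver's single loop automatically visits them blockwise. The function name and the cubic block shape are assumptions:

```python
def blocked_order(nx, ny, nz, b):
    """Enumerate all cells of an nx x ny x nz grid in blocks of edge length b."""
    order = []
    for z0 in range(0, nz, b):
        for y0 in range(0, ny, b):
            for x0 in range(0, nx, b):
                # all cells of one block are numbered consecutively
                for z in range(z0, min(z0 + b, nz)):
                    for y in range(y0, min(y0 + b, ny)):
                        for x in range(x0, min(x0 + b, nx)):
                            order.append((x, y, z))
    return order
```

The six nested loops exist only in the preprocessor; the position of a cell in `order` becomes its index in the 1D array.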

29
Memory Traversing Schemes

Notes on Memory Traversing Schemes
Pure preprocessing technique
Only arrangement of data in memory changed
No change for solver
No overhead for solver (e.g. loop overhead for
blocking)

30
Memory Layouts Collision Optimized Layout

Standard Array-Layout F(i,x,y,z,t)
Array-of-Structures
Collision optimized
Optimal read access: 2 cache lines per LUP
Bad write access: 19 stores in 19 cache lines
But: depending on system size, some of them are
already/still in cache

31
Memory Layouts Collision Optimized Layout

18 write accesses on one cell from 3 different
z-layers
8 write accesses on one cell from 3 different rows

32
Memory Layouts Collision Optimized Layout

Performance of Collision-Optimized Layout (P4,
512kB L2)

33
Memory Layouts Propagation Optimized Layout

Optimized Array-Layout F(x,y,z,i,t)
Structure-of-Arrays
stride-1-access on x (inner loop)
19 cache lines per 16 LUPs in read and write
process
1 cache miss each 16th memory access
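
The cache-line counts for the two layouts follow from simple index arithmetic. The sketch below assumes column-major (Fortran) storage, arrays aligned to a cache line, and 128-byte lines, which is inferred from the "16 values per line" figures on the slides; the function names are illustrative:

```python
Q = 19          # distributions per cell
DOUBLE = 8      # bytes per value
LINE = 128      # bytes per cache line (assumption: 16 doubles per line)

def idx_collision(i, x, y, z, nx, ny):
    """F(i,x,y,z): the 19 values of one cell are contiguous (array of structures)."""
    return i + Q * (x + nx * (y + ny * z))

def idx_propagation(i, x, y, z, nx, ny, nz):
    """F(x,y,z,i): one direction of all cells is contiguous (structure of arrays)."""
    return x + nx * (y + ny * (z + nz * i))

def lines_touched(indices):
    """Number of distinct cache lines covered by a set of array indices."""
    return len({(k * DOUBLE) // LINE for k in indices})

N = 16  # grid edge length used for the example
one_cell_aos = [idx_collision(i, 0, 0, 0, N, N) for i in range(Q)]
one_cell_soa = [idx_propagation(i, 0, 0, 0, N, N, N) for i in range(Q)]
row_soa = [idx_propagation(i, x, 0, 0, N, N, N) for i in range(Q) for x in range(16)]
```

Reading one cell touches 2 lines in the collision optimized layout but 19 in the propagation optimized one; the latter pays off only across the inner loop, where 16 consecutive cells share those same 19 lines.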

34
Memory Layouts Propagation Optimized Layout

Performance on Pentium 4, 512kB L2

35
Further Optimization Techniques

Additional Bottlenecks
Large loop body (causes register spills on IA32)
Concurrent writing to 19 different cache lines
interferes with number of write combine buffers
on IA32 (6 for Intel Xeon/Nocona)
Indirect addressing prevents IA32 hardware
prefetcher from preloading values for target
cells (due to bounce back at obstacles)
Solutions (implemented in Solver)
Split the loop into 5 loops of length Nx
Manual block preload technique
(Drawback: both techniques need a loop blocking
scheme)
These solutions are only needed for IA32

36
Results - Outline

Architecture Descriptions
Comparison 1D-Solver to Standard Solver
Memory Layouts
For Standard Solver
For 1D-Solver
Influence of Geometry
MC
SiC
Memory Traversing Schemes
Space-Filling Curves
Blocking

37
Architecture Description Nocona/Irwindale

Test system: test machine at RRZE
CPUs: Nocona (Irwindale), 3.6 GHz, 2 MB L2 cache
Memory: DDR400, 6.4 GB/s
Architectural specialties
EM64T extension
Hyperthreading
One memory bus for both processors

38
Architecture Description Itanium2 / Altix

Test system: RRZE SGI Altix
CPUs: Itanium 2, 1.3 GHz, 3 MB L3 cache
Memory: 112 GB distributed shared memory
Architectural specialties
Itanium 2
EPIC (Explicitly Parallel Instruction
Computing)
No out-of-order
Instruction-level parallelism is scheduled by the
compiler (→ bundles)
L1 cache only for integer data
Altix
ccNUMA with NUMALink 3
Memory connected hierarchically by SHUBs

39
Architecture Description AMD Opteron

Test system: LSS HPC cluster
CPUs: AMD Opteron, 2.2 GHz, 1 MB L2 cache,
IA32-compatible
Memory: DDR333, 5.2 GB/s
Architectural specialties
Compute nodes with four CPUs
4 GB RAM per CPU; each CPU can access 16 GB via
ccNUMA
Interconnect
CPUs on one node: HyperTransport (6.4 GB/s)
Nodes: Infiniband (10 GB/s)

40
Comparison 1D-Array Solver to Standard Solver
41
Memory Layouts for Standard Solver
42
Memory Layouts on Itanium 2
43
Memory Layouts on AMD Opteron
44
Memory Layouts on Nocona/Irwindale
45
Influence of Geometry: SiC foam (obstacle/fluid ratio 2)
46
Influence of Geometry: MC (obstacle/fluid ratio 50)
47
Memory Traversing Schemes
48
Memory Traversing Schemes
49
Conclusion

The 1D-array data representation makes performance
independent of the obstacle-to-fluid ratio
Memory traversing by space-filling curves results
in similar performance to spatial blocking
Implementation of SFCs is not worth the effort
(if they are used as a memory traversing
alternative only)
Together with indirect addressing, the Collision
Optimized Layout with blocking is the best technique
if the cache is larger than 1 MB
Indeed, there are cases where the Propagation
Optimized Layout is not best
50
Outlook

Future work could concern
Space-Filling Curves
A kind of staggered SFC: a separate curve for
every direction
Avoid wasting underused cache lines whose
lattice sites are neighboring cells that are
visited much later
Galerkin discretization or point-wise evaluation
of LBM to enable a stack implementation in
conjunction with SFCs
But: for real-world problems, construction on
non-cubic grids is necessary first
Search for vectorization-enhancing techniques to
overcome problems with indirect addressing on
Itanium 2
Search for reasons why Collision Optimized Layout
is better than Propagation Optimized Layout

51
Acknowledgement / References

Acknowledgement
Bavarian Graduate School for Computational
Engineering
Thomas Zeiser (RRZE)
Gerhard Wellein (RRZE)
References
S. Donath, T. Zeiser, G. Hager, J. Habich, G.
Wellein: Optimizing Performance of the Lattice
Boltzmann Method for Complex Structures on
Cache-based Architectures
G. Wellein, P. Lammers, G. Hager, S. Donath, T.
Zeiser: Towards Optimal Performance for Lattice
Boltzmann Applications on Terascale Computers
G. Wellein, T. Zeiser, S. Donath, G. Hager: On
the single processor performance of simple
lattice Boltzmann kernels