Title: EFFECTIVE PARALLELIZATION OF A TURBULENT FLOW SIMULATION
1. EFFECTIVE PARALLELIZATION OF A TURBULENT FLOW SIMULATION
- Masami Takata, Yoshinobu Yamamoto, Hayaru Shouno, Tomoaki Kunugi, Kazuki Joe
- Graduate School of Human Culture, Nara Women's Univ.
- Department of Nuclear Engineering, Kyoto Univ.
2. Contents
- Background
- Direct Numerical Simulation (DNS) of free-surface turbulent flow
- Parallelization methods
- Evaluation of the parallelization methods
- Conclusion
3. Background
- Free-surface turbulent flow
  - Occurs in industrial devices (e.g., a nuclear fusion reactor or a chemical plant)
- Analysis method for turbulent flow
  - Direct Numerical Simulations (DNSs)
  - High grid density leads to huge calculation time
- A transformation into parallelized form
  - For distributed memory parallel computers: MPI (Message Passing Interface)
4. DNS of the turbulent flow (1)
- Calculation conditions
  - Reynolds number: 2270
  - Prandtl number: 1
  - Grid (x, y, z): (64, 82, 64)
5. DNS of the turbulent flow (2)
- The incompressible Navier-Stokes equations
- Integration of the governing equations
  - A fractional step method
- Time integration
  - A second-order Adams-Bashforth scheme
  - A Crank-Nicolson scheme
- Spatial discretization
  - A second-order central differencing scheme (one time step is sketched below)
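As a reading aid, here is a sketch of one time step, assuming the common pairing in fractional step DNS codes (an assumption, not stated on the slide): the second-order Adams-Bashforth scheme for the explicit convective term N(u) and the Crank-Nicolson scheme for the viscous term L(u):

    \frac{u^{*} - u^{n}}{\Delta t} = \frac{3}{2} N(u^{n}) - \frac{1}{2} N(u^{n-1}) + \frac{1}{2}\left[ L(u^{*}) + L(u^{n}) \right]

    \nabla^{2} p^{n+1} = \frac{1}{\Delta t}\,\nabla\cdot u^{*}, \qquad u^{n+1} = u^{*} - \Delta t\,\nabla p^{n+1}

The Poisson equation for the pressure in the second line is what subroutine press (slides 20-23) solves with the conjugate residual method.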
6. DNS of the turbulent flow (3): The arrays for the DNS
- x, y, z: grid intervals and coordinates in the three dimensions
- dist x, dist y, dist z: the temperatures at the water surface or wall
- u, v, w: flow velocities in the x, y, and z directions
- fu, fuo, fv, fvo, fw, fwo: (convective term) + (viscous term) for u, v, and w
- t: the temperature
- ft, fto: (convective term) + (viscous term) for t
- p: (pressure)/(density)
7. DNS of the turbulent flow (4): The program flow of the DNS
[Flow diagram: initialize dist x, dist y, dist z, x, y, z, u, v, w, p, t and fu, fuo, fv, fvo, fw, fwo, ft, fto; file input of (u, v, w, p); then an iteration loop that computes fu, fv, fw, t, updates u, v, w from fuo, fvo, fwo, outputs u, v, w, solves p, outputs u, v, w, p, and updates ft, fto, t; file output at the end.]
8. Parallelization method
- Parallelized program
  - Data distribution
  - Minimum data communication
- MPI synchronous protocol
- MPI asynchronous protocol
9. Parallelization method 1: Data communication protocols (1)
- Synchronous protocol
  - Processors performing receive operations are suspended until the communications complete.
  - A processor can use only one communication protocol exclusively.
[Figure: the receive operation blocks until the completion of the communication; a sketch follows.]
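A minimal sketch of a synchronous (blocking) boundary exchange in MPI Fortran; the routine name, the array layout, and the neighbor ranks left and right are illustrative assumptions, not the authors' code:

c     Hypothetical sketch: blocking exchange of one z-boundary plane
c     of u between neighboring processors.
      subroutine sync_exchange(u, nx, ny, nz, left, right)
      include 'mpif.h'
      integer nx, ny, nz, left, right, ierr
      integer status(MPI_STATUS_SIZE)
      double precision u(nx, ny, 0:nz+1)
c     send the last interior plane to the right neighbor
c     (in a real code the send/receive order must be arranged,
c     e.g. by even/odd rank, to avoid deadlock)
      call MPI_SEND(u(1,1,nz), nx*ny, MPI_DOUBLE_PRECISION,
     &              right, 0, MPI_COMM_WORLD, ierr)
c     the processor is suspended here until the left neighbor's
c     plane has arrived: this is the synchronous protocol
      call MPI_RECV(u(1,1,0), nx*ny, MPI_DOUBLE_PRECISION,
     &              left, 0, MPI_COMM_WORLD, status, ierr)
      return
      end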
10. Parallelization method 1: Data communication protocols (2)
- Asynchronous protocol
  - Send and receive operations can be executed independently.
  - A processor can use several communication protocols simultaneously.
[Figure: the receive operation returns immediately; completion of the communication is checked later. A sketch follows.]
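The same exchange with the asynchronous (nonblocking) protocol, again an illustrative sketch rather than the authors' code: MPI_ISEND and MPI_IRECV return immediately, several transfers can be in flight at once, and MPI_WAIT is called only where the data is actually needed:

c     Hypothetical sketch: nonblocking version of the same exchange.
      subroutine async_exchange(u, nx, ny, nz, left, right)
      include 'mpif.h'
      integer nx, ny, nz, left, right, ierr
      integer req(2), status(MPI_STATUS_SIZE)
      double precision u(nx, ny, 0:nz+1)
c     post both operations; neither suspends the processor
      call MPI_IRECV(u(1,1,0), nx*ny, MPI_DOUBLE_PRECISION,
     &               left, 0, MPI_COMM_WORLD, req(1), ierr)
      call MPI_ISEND(u(1,1,nz), nx*ny, MPI_DOUBLE_PRECISION,
     &               right, 0, MPI_COMM_WORLD, req(2), ierr)
c     ... independent computation can overlap the transfers here ...
c     block only at the point where the boundary plane is required
      call MPI_WAIT(req(1), status, ierr)
      call MPI_WAIT(req(2), status, ierr)
      return
      end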
11. Parallelization method 1: Data communication protocols (3)
- Collective functions, available only with the synchronous protocol (see slide 14)
  - MPI_ALLREDUCE: a function that returns the results of reduce operations to all processors in a communication group
  - MPI_BCAST: a broadcast function
- Example calls are sketched below.
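A small self-contained example of the two collective calls (the variable names are illustrative): every rank contributes local_max and all ranks receive the global maximum; rank 0 then broadcasts the array p to the whole group:

      program collectives
      include 'mpif.h'
      integer ierr, rank
      double precision local_max, global_max, p(100)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      local_max = dble(rank)
c     reduce operation (here: maximum) whose result is returned
c     to all processors in the communication group
      call MPI_ALLREDUCE(local_max, global_max, 1,
     &     MPI_DOUBLE_PRECISION, MPI_MAX, MPI_COMM_WORLD, ierr)
      if (rank .eq. 0) p(1) = global_max
c     broadcast the array p from rank 0 to all other ranks
      call MPI_BCAST(p, 100, MPI_DOUBLE_PRECISION, 0,
     &     MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      end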
12. Parallelization method 2 (1)
The example loop nest used to explain the method:

      do 10 i = 1, 100000
 10      x(i) = i
      min = y(1)
      do 20 j = 2, 100000
         if (min .gt. y(j)) then
            min = y(j)
         endif
 20   continue
      do 30 k = 1, 100000
 30      z(k) = z(k) * k
      do 40 i = 1, 100000
 40      w(i) = x(I(i)) * min
13. Parallelization method 2 (2)
- The number of synchronous communications is two for each processor.
14. Parallelization method 2 (3)
- The number of asynchronous communications is eight for each processor.
- There is no asynchronous communication for global reduce operations.
15. Parallelization method 2 (4)
- With partial strip mining, the number of asynchronous communications is at most three (a strip-mining sketch follows).
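What strip mining looks like on the first loop of slide 12, as a hypothetical sketch (the block decomposition and variable names are assumptions): each processor iterates only over its own contiguous strip of the index space, so only reductions and boundary values have to be communicated:

      program strip
      include 'mpif.h'
      integer n
      parameter (n = 100000)
      integer ierr, rank, nprocs, i, lo, hi
      double precision x(n)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     bounds of this processor's contiguous strip
      lo = rank * (n / nprocs) + 1
      hi = (rank + 1) * (n / nprocs)
      if (rank .eq. nprocs - 1) hi = n
c     strip-mined form of "do 10 i = 1, 100000"
      do 10 i = lo, hi
 10      x(i) = i
      call MPI_FINALIZE(ierr)
      end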
16. Parallelization method 2 (5)
17. Parallelization method 2: The arrays for the DNS
- (The same array definitions as on slide 6.)
18. Parallelization method 2: The initialization part of the DNS
[Flow diagram: initialize fu, fuo, fv, fvo, fw, fwo, t, ft, fto and dist x, dist y, dist z, x, y, z, u, v, w, p, t; file input of (u, v, w, p); file output.]
19. Parallelization method 2: The calculation part of the DNS
[Flow diagram of the iteration loop: compute fu, fv, fw, t; update u, v, w; save fuo, fvo, fwo; output u, v, w; solve p; output u, v, w, p; update ft, fto, t.]
20. Parallelization method 3: The conjugate residual method for the Poisson equation (1) (subroutine press)
- Usage: the array p and some local arrays
- Dependencies exist in the decomposed arrays at the boundaries.
- If parallelization method (2) is adopted, MPI_WAIT (which suspends a processor until the completion of the asynchronous communication) causes a large waiting time.
- Two partitioning methods for subroutine press are considered.
21. Parallelization method 3: subroutine press (2)
- In method A (target: eight processors)
  - One processor runs subroutine norm (which returns the maximum element of a given array).
  - The remaining seven processors perform the calculation for the update.
- The disadvantages:
  - Each array must be assigned to seven processors, although the number of array elements is defined as a multiple of two.
  - The programmer's development effort increases (the number of communications increases).
22. Parallelization method 3: subroutine press (3)
- In method B
  - Strip mining over all the processors
  - Required synchronous communications:
    - MPI_ALLREDUCE (with the synchronous protocol): the function that returns the results of reduce operations to all processors in a communication group; used by
      - subroutine norm (which returns the maximum element of a given array)
      - subroutine inprod (which returns the sum of the elements of a given array)
    - MPI_BCAST (with the synchronous protocol): the array p calculated by each processor must be broadcast after the execution of subroutine press.
23. Parallelization method 3: subroutine press (4)
- The characteristics:
  - It mainly consists of a series of doall statements.
  - Strip mining uses all available processors.
  - The synchronous communications do not cause much overhead (each doall statement requires about the same calculation time).
- Method B is therefore adopted for subroutine press (a sketch of its reductions follows).
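A hypothetical sketch of the two reductions in method B (the routine names norm and inprod come from the slides, but the bodies and the strip bounds lo, hi are assumptions): each rank reduces over its own strip, then MPI_ALLREDUCE combines the partial results on all ranks:

c     Sketch: each rank passes only its local strip p(lo:hi).
      double precision function norm(p, lo, hi)
      include 'mpif.h'
      integer lo, hi, i, ierr
      double precision p(lo:hi), local, global
      local = p(lo)
      do 10 i = lo + 1, hi
 10      if (p(i) .gt. local) local = p(i)
c     synchronous collective: every rank receives the global maximum
      call MPI_ALLREDUCE(local, global, 1, MPI_DOUBLE_PRECISION,
     &     MPI_MAX, MPI_COMM_WORLD, ierr)
      norm = global
      return
      end

      double precision function inprod(p, lo, hi)
      include 'mpif.h'
      integer lo, hi, i, ierr
      double precision p(lo:hi), local, global
      local = 0.0d0
      do 20 i = lo, hi
 20      local = local + p(i)
c     synchronous collective: every rank receives the global sum
      call MPI_ALLREDUCE(local, global, 1, MPI_DOUBLE_PRECISION,
     &     MPI_SUM, MPI_COMM_WORLD, ierr)
      inprod = global
      return
      end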
24. Evaluation of the parallelization methods (1)
- Using MPI
- Parallelized programs for four and eight processors
- Experimental environment:
  - Sun SPARC Ultra-2 workstations (SunOS 5.6)
  - Memory capacity of 512 MB
  - LAN using 100BASE-TX as the communication medium
25. Evaluation of the parallelization methods (2)
26. The effectiveness
- Little idle (waiting) time
  - It corresponds mainly to the file I/O time for the calculation results.
- Reduction of the execution time
  - Theoretically: (the execution time of the sequential program) / (number of processors)
  - In practice, waiting and system time increase because of the communication overhead.
  - 4 processors: 28.7%
  - 8 processors: 15.79%
- The parallelization is extremely effective.
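Reading the two figures as percentages of the sequential execution time (an assumption; the slide gives only the bare numbers), the implied speedups work out to

    S_4 = \frac{100}{28.7} \approx 3.48 \quad (\text{ideal: } 4), \qquad S_8 = \frac{100}{15.79} \approx 6.33 \quad (\text{ideal: } 8),

which is consistent with the claim that communication overhead keeps the measured times slightly above the theoretical (sequential time)/(number of processors).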
27. Conclusion
- We proposed parallelization methods for a direct numerical simulation using MPI.
- The parallelization methods are more effective when the number of processors is a multiple of four.
- The execution time of the parallelized programs for four and eight processors decreased to roughly 1/4 and 1/8 of that of the original sequential program, respectively.
28. Future work
- Are the parallelization methods still effective with more processors?
- Parallelization methods for the case where the physical memory capacity is smaller than the total data set size.