Title: Dynamic Load Balancing Techniques for Nonlinear Structural Dynamics
1Dynamic Load Balancing Techniques for
Nonlinear Structural Dynamics
- SEI 2006 Structures Congress Presentation
Elisa Sotelino, Professor, CEE & CS Departments, Virginia Tech
Ammar T. Al-Sayegh, PhD Candidate, School of Civil Engineering, Purdue University
2Outline
- Research Objectives
- Types of Parallel FEA Algorithms
- Node-wise Algorithms
- Element-wise Algorithms
- Domain-wise Algorithms
- Proposed Row-wise Algorithm
- DLB Technique to Overcome Imbalance
- Numerical Example
- ParaStruc Intro
- Problem Definitions
- ParaStruc Results
- Questions & Comments?
3Objectives of this Development Effort
- To offer a new Parallel Finite Element Analysis (PFEA) algorithm with the following qualities:
- Robust
- Efficient
- Expandable
- Easy to implement
- To devise a Dynamic Load Balancing (DLB) scheme that works effectively with this new algorithm.
4What is a PFEA Algorithm?
- FEA can be broken down into:
- 1. Element state determination and force calculation
- Store Ke, Fe, element properties
- Get Df
- Compute Ke, Fe
- 2. Assembly of global stiffness matrix and load vector
- Store K, R
- Get Ke, Fe
- Compute K, R
- 3. Applying boundary conditions
- Store DOFf
- Get K, R
- Compute Kf, Rf
- 4. Solving for the nodal displacements
- Store Kf, Rf
- Get Kf, Rf
- Compute Df
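The four steps can be sketched on a small linear example. This is a minimal serial illustration, not code from the presentation: the three-spring chain, the function names, and the stiffness and load values are all assumptions chosen for brevity.

```python
import numpy as np

def element_state(k_e, d_e):
    """Step 1: stiffness Ke and resisting force Fe of a 2-node axial spring."""
    Ke = k_e * np.array([[1.0, -1.0], [-1.0, 1.0]])
    Fe = Ke @ d_e
    return Ke, Fe

def assemble(conn, Ke_list, n_dof, loads):
    """Step 2: add each Ke into the global stiffness K; R is the load vector."""
    K = np.zeros((n_dof, n_dof))
    for nodes, Ke in zip(conn, Ke_list):
        idx = np.array(nodes)
        K[np.ix_(idx, idx)] += Ke
    R = np.asarray(loads, dtype=float).copy()
    return K, R

def apply_bcs(K, R, free):
    """Step 3: keep only the free degrees of freedom (DOFf)."""
    return K[np.ix_(free, free)], R[free]

def solve(Kf, Rf):
    """Step 4: solve Kf Df = Rf for the free displacements."""
    return np.linalg.solve(Kf, Rf)

# Hypothetical model: three springs in series, node 0 fixed, unit load at node 3.
conn = [(0, 1), (1, 2), (2, 3)]
stiffness = [100.0, 100.0, 100.0]
n_dof = 4
D = np.zeros(n_dof)  # displacements from the previous step (all zero here)

Ke_list = [element_state(k, D[list(c)])[0] for k, c in zip(stiffness, conn)]
K, R = assemble(conn, Ke_list, n_dof, [0.0, 0.0, 0.0, 1.0])
free = np.array([1, 2, 3])
Kf, Rf = apply_bcs(K, R, free)
Df = solve(Kf, Rf)
```

Each function maps to one numbered step, so the Store/Get/Compute items on the slide correspond to the arguments and return values of the function for that step.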
6What is a PFEA Algorithm?
- A procedure that specifies which step is parallelized, how it is distributed, and how the data is communicated.
- Diagram: the four FEA steps connected by communication links and annotated with intra-step and inter-step dependencies: Compute-Store (Mem, Comm), Compute-Store (Comm), and Compute-Compute (Conc).
7Node-Wise Algorithms
- Assembly is partitioned according to nodes
- Elements are distributed to the pertinent assembly partitions
- Pros
- Lower communication.
- Robust.
- Cons
- Higher storage.
- Redundant computation.
- Solution concurrency/efficiency tradeoff.
- Handling nonlinearity is not trivial.
- Diagram: the four steps (1. element state determination, 2. assembly of K & R, 3. apply BCs, 4. solve for Df), with K, R and Kf, Rf as the communicated quantities.
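The storage and redundant-computation costs listed above can be illustrated with a short sketch. It assumes a hypothetical node-to-partition map (`node_owner`) and sends every element to each partition that owns one of its nodes; the names and the six-node chain are illustrative only and are not taken from the presentation.

```python
def distribute_elements(conn, node_owner, n_parts):
    """Send each element to every partition that owns one of its nodes."""
    parts = [[] for _ in range(n_parts)]
    for elem_id, nodes in enumerate(conn):
        owners = sorted({node_owner[n] for n in nodes})
        for p in owners:
            parts[p].append(elem_id)
    return parts

# Hypothetical chain of five 2-node elements over six nodes.
conn = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
node_owner = [0, 0, 0, 1, 1, 1]  # nodes 0-2 on partition 0, nodes 3-5 on partition 1

parts = distribute_elements(conn, node_owner, 2)
# Element 2 joins node 2 (partition 0) and node 3 (partition 1),
# so both partitions store it and both compute its Ke, Fe.
redundant = sum(len(p) for p in parts) - len(conn)
```

In this sketch, elements that straddle a partition boundary are the ones held and evaluated more than once.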
8Element-Wise Algorithms
- K & R are not explicitly assembled
- Element state determination and the iterative solution are parallelized
- Pros
- Lower storage.
- Higher concurrency.
- Better handling of nonlinearity.
- Cons
- Longer convergence (iterative solver).
- May not converge (iterative solver).
- High communication.
- Fine grained.
- Diagram: the standard assembly, BC, and solve steps are replaced by "3. Apply BCs on Ke" and "4. Solve for Df iteratively using Ke, R"; the communicated quantities are Ke, Fe and Df.
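To show what solving without an explicitly assembled K can look like, here is a minimal serial sketch of an element-by-element matrix-vector product driving a conjugate gradient loop. The choice of conjugate gradient, the masking used for the boundary conditions, and all names are assumptions made for illustration; the slides do not specify the iterative solver.

```python
import numpy as np

def ebe_matvec(conn, Ke_list, x):
    """Compute K @ x by looping over elements; K itself is never formed."""
    y = np.zeros_like(x)
    for nodes, Ke in zip(conn, Ke_list):
        idx = np.array(nodes)
        y[idx] += Ke @ x[idx]
    return y

def ebe_cg(conn, Ke_list, R, free, tol=1e-10, max_iter=200):
    """Conjugate gradient on the free DOFs using only element matrices."""
    mask = np.zeros(len(R))
    mask[free] = 1.0  # supported DOFs are held at zero displacement

    def apply_K(v):
        return mask * ebe_matvec(conn, Ke_list, mask * v)

    x = np.zeros(len(R))
    r = mask * R - apply_K(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Kp = apply_K(p)
        alpha = rs / (p @ Kp)
        x += alpha * p
        r -= alpha * Kp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Hypothetical model: three springs in series, node 0 fixed, unit load at node 3.
conn = [(0, 1), (1, 2), (2, 3)]
Ke_list = [100.0 * np.array([[1.0, -1.0], [-1.0, 1.0]]) for _ in conn]
R = np.array([0.0, 0.0, 0.0, 1.0])
D = ebe_cg(conn, Ke_list, R, free=[1, 2, 3])
```

In a parallel setting each processor would run the loop in `ebe_matvec` over its own elements and then sum the partial results, which is where the high, fine-grained communication listed under Cons would come from.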
9Domain-Wise Algorithms
- Split into domains, solve, then join back
- Pros
- Lower communication.
- Higher concurrency.
- Cons
- More computation.
- Higher storage.
- Handling nonlinearity is not trivial.
- Diagram: step 4 is split into 4-a (split Kf into Kd1-Kdn and Rf into Rd1-Rdn), 4-b (solve for the internal displacements of each domain), and 4-c (solve for the interface displacements); the communicated quantities are Ke, Fe, K, R, and Df.
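The split, solve, and join idea can be sketched with standard static condensation (a Schur complement on the interface DOFs). This is one common way to do it and may differ in detail and ordering from the scheme on the slide; the function name, the index sets, and the four-DOF example are assumptions, and the sketch requires a symmetric Kf whose domain interiors are coupled only through the interface.

```python
import numpy as np

def solve_by_domains(Kf, Rf, interiors, interface):
    """Condense each domain's interior DOFs, solve the interface, recover interiors."""
    b = np.asarray(interface)
    S = Kf[np.ix_(b, b)].copy()   # interface (Schur complement) matrix
    g = Rf[b].copy()              # condensed interface load
    saved = []
    for interior in interiors:    # independent per domain, hence parallelizable
        i = np.asarray(interior)
        Kii = Kf[np.ix_(i, i)]
        Kib = Kf[np.ix_(i, b)]
        X = np.linalg.solve(Kii, Kib)     # Kii^-1 Kib
        y = np.linalg.solve(Kii, Rf[i])   # Kii^-1 Ri
        S -= Kib.T @ X
        g -= Kib.T @ y
        saved.append((i, X, y))
    D = np.zeros_like(Rf)
    D[b] = np.linalg.solve(S, g)          # interface displacements
    for i, X, y in saved:
        D[i] = y - X @ D[b]               # interior displacements of each domain
    return D

# Hypothetical free system: four springs in series (k = 100), unit load at the tip.
Kf = np.array([[200.0, -100.0, 0.0, 0.0],
               [-100.0, 200.0, -100.0, 0.0],
               [0.0, -100.0, 200.0, -100.0],
               [0.0, 0.0, -100.0, 100.0]])
Rf = np.array([0.0, 0.0, 0.0, 1.0])
Df = solve_by_domains(Kf, Rf, interiors=[[0], [2, 3]], interface=[1])
```

The extra solves with `Kii` for every interface column are an example of the additional computation and storage that the Cons list attributes to domain-wise algorithms.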
10Proposed Row-Wise Algorithm
- Consider a structure analyzed with n = 3 processors
- 1. Create a vector of elements and subdivide the vector into n rows
- 2. Partition K & R into n rows distributed to n processors
- 3. Mark supported rows/columns, and redistribute free rows
- 4. Partition the displacement vector into rows, and solve for Df
- Diagram labels: 2. Assembly of K & R (store K, R; compute K, R); 4. Solve for Df (store Kf, Rf; compute Df); communication of K, R.
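Steps 2 and 3 of the row-wise scheme can be sketched as follows. The even contiguous split and the helper names are assumptions made for illustration; the presentation does not state how row blocks are sized.

```python
def split_rows(rows, n):
    """Split a list of row indices into n contiguous, nearly equal blocks."""
    q, r = divmod(len(rows), n)
    blocks, start = [], 0
    for p in range(n):
        size = q + (1 if p < r else 0)
        blocks.append(rows[start:start + size])
        start += size
    return blocks

def redistribute_free_rows(n_rows, supported, n):
    """Drop supported rows, then re-split the remaining free rows over n processors."""
    fixed = set(supported)
    free = [i for i in range(n_rows) if i not in fixed]
    return split_rows(free, n)

# Hypothetical system: 9 rows, n = 3 processors, rows 0-2 supported.
initial = split_rows(list(range(9)), 3)
balanced = redistribute_free_rows(9, supported=[0, 1, 2], n=3)
```

In this example the first processor initially holds only supported rows, so without the redistribution in step 3 it would have nothing to solve for.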
11Proposed Row-Wise Algorithm
- Now, consider the communication
- Diagram: step 1 (element state determination: store Ke, Fe, properties; compute Ke, Fe), with Ke, Fe, K, R, and Df as the communicated quantities.
12Proposed Row-Wise Algorithm
- Perfect case: no inter-processor communication (most often not true)
13Proposed Row-Wise Algorithm
- Consider the following structure
- Steps shown: 1. Element state determination, 2. Assembly, 3. Apply BCs, 4. Solve for Df, 5. Load balance
- Row 5 in System has multiplicity 3 in P0
- Row 8 in System has multiplicity 1 in P0
- Row 4 in Element has multiplicity 2 in P2
- Row 7 in Element has multiplicity 1 in P2
- Node labels in figure: N7, N5, N8, N2, N6, N4, N3
14Source of Nonlinearity Imbalance
- Nonlinearity
- Generated at the section level.
- Propagates to the element level through integration.
- Propagates to the structural level through assembly.
- Therefore, imbalance must be dealt with at
- Element state determination step: caused by more iterations on some processors than others.
- Structure displacement calculation step: caused by introduction of new nonzero or new zero elements in the stiffness matrix, residual force vector, or displacement increment vector.
15DLB Technique
- Element State Determination Step
- Update the multiplicity of each element row in the processor.
- Start the state determination with the highest multiplicity row for this processor, and end with the lowest multiplicity row of this processor.
- When the last row is reached, broadcast a signal requesting this processor's maximum multiplicity values of undetermined rows on other processors.
- Import the highest multiplicity row, and perform state determination on it.
- Structure Displacement Solution Step
- Update the multiplicity of each system row in the processor.
- Request the maximum multiplicity values for this processor in other processors.
- Compare the multiplicity of the local row that has the maximum multiplicity in this processor with the multiplicity of the remote row in this processor. If the remote multiplicity is higher, exchange the two rows.
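The element state determination part of the scheme can be imitated in a serial simulation. This is a sketch under strong assumptions, not the ParaStruc implementation: each row carries a single given multiplicity (the slides track it per processor), each row has a made-up cost standing in for its iteration count, and the broadcast and import are replaced by direct list access.

```python
def simulate_state_determination(local_rows, balance=True):
    """local_rows[p] is a list of (row_id, multiplicity, cost) held by processor p.

    Returns (finish time of the slowest processor, {row_id: processor that ran it}).
    """
    # Each processor works from its highest to its lowest multiplicity row.
    pending = [sorted(rows, key=lambda r: -r[1]) for rows in local_rows]
    clock = [0.0] * len(local_rows)
    done_by = {}
    active = set(range(len(local_rows)))
    while active:
        p = min(sorted(active), key=lambda q: clock[q])  # next processor to act
        if pending[p]:
            row = pending[p].pop(0)
        else:
            # Last local row reached: look for undetermined rows elsewhere.
            donors = [q for q in range(len(pending)) if q != p and pending[q]]
            if not balance or not donors:
                active.discard(p)
                continue
            q = max(donors, key=lambda d: pending[d][0][1])
            row = pending[q].pop(0)  # import the highest multiplicity row
        clock[p] += row[2]
        done_by[row[0]] = p
    return max(clock), done_by

# Hypothetical workload: rows on processor 1 need more iterations (higher cost).
rows = [
    [("a", 3, 1.0), ("b", 1, 1.0)],
    [("c", 2, 4.0), ("d", 2, 4.0), ("e", 1, 4.0)],
]
t_static, _ = simulate_state_determination(rows, balance=False)
t_dlb, owner = simulate_state_determination(rows, balance=True)
```

With these made-up costs the idle processor takes over one of the expensive rows, so the simulated finish time drops; the numbers only illustrate the mechanism and say nothing about the measured performance.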
16ParaStruc
- New, fully parallelized, structural finite element system.
- Built on Trilinos, a set of parallel numerical libraries.
- Lightweight. Contains a preprocessor and 3 classes only.
17Numerical Example: NL Cantilever
- Partition into 1,000 subelements
- Mesh into 1,000 fibers
- Integrate at 8 sections
- Apply End Point Load
- Analyzed on VT System X Supercomputer
- 1100 node (2200 processors) Apple Xserve G5 cluster
- Dual 2.3 GHz PowerPC 970FX processors / node
- 4 GB ECC DDR400 (PC3200) RAM / node
- 80 GB S-ATA hard disk drive / node
- Mellanox Cougar InfiniBand 4x HCA Interconnect
18Numerical Example Results
- Two factors controlling efficiency
- Storage cost (cache availability): positive impact for Np < 8
- Communication cost (latency): negative impact for Np > 8
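For reference, the usual definitions of speedup and parallel efficiency can be written as two small helpers. These are the standard textbook formulas, assumed here because the slides do not state which definition was used, and the function names are illustrative.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T1 / Tp."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_proc):
    """Parallel efficiency E = S / Np; E = 1 means ideal linear scaling."""
    return speedup(t_serial, t_parallel) / n_proc
```

Under this definition, a cache benefit raises E (possibly above 1) when each processor's share of the data becomes small enough to fit in cache, while communication latency lowers it as Np grows. The measured values from the presentation are not reproduced here.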
19Conclusions
- A new row-wise partitioning algorithm, together with a dynamic load balancing technique, has been developed that:
- Minimizes computation in each processor.
- Minimizes the required storage in each processor.
- Minimizes communication between processors.
- Balances the computation load among the
processors.
20Acknowledgements
- Special Thanks to
- Kuwait University for funding this research.
- Virginia Tech TeraScale group for making the
System X Supercomputer available for this
research and for their help and support in using
it.
21Questions & Comments?
System X Supercomputer at Virginia Tech