Dynamic Load Balancing Techniques for Nonlinear Structural Dynamics

1
Dynamic Load Balancing Techniques for Nonlinear Structural Dynamics

SEI 2006 Structures Congress Presentation

Elisa Sotelino, Professor, CEE & CS Departments, Virginia Tech
Ammar T. Al-Sayegh, PhD Candidate, School of Civil Engineering, Purdue University
2
Outline

Research Objectives
Types of Parallel FEA Algorithms
Node-wise Algorithms
Element-wise Algorithms
Domain-wise Algorithms
Proposed Row-wise Algorithm
DLB Technique to Overcome Imbalance
Numerical Example
ParaStruc Intro
Problem Definitions
ParaStruc Results
Questions & Comments?

3
Objectives of this Development Effort

To offer a new Parallel Finite Element Analysis (PFEA) algorithm with the following qualities:
Robust
Efficient
Expandable
Easy to implement
To devise a Dynamic Load Balancing (DLB) scheme that works effectively with this new algorithm.

4
What is a PFEA Algorithm?

FEA can be broken down into:

1. Element state determination and force calculation
Store: Ke, Fe, Element Properties
Get: Df
Compute: Ke, Fe

2. Assembly of global stiffness matrix and load vector
Store: K, R
Get: Ke, Fe
Compute: K, R

3. Applying boundary conditions
Store: DOFf
Get: K, R
Compute: Kf, Rf

4. Solving for the nodal displacements
Store: Kf, Rf
Get: Kf, Rf
Compute: Df
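
As an illustration of the four steps, here is a minimal serial sketch for a chain of 1D springs. It is not taken from the presentation or from ParaStruc: the function names, the spring model, and the numerical values are all assumptions chosen to keep the example small.

```python
# Minimal serial sketch of the four FEA steps for a chain of 1D springs.
# All names and values are illustrative assumptions, not ParaStruc code.

def element_state(k):
    """Step 1: element stiffness Ke for a 2-node spring of stiffness k."""
    return [[k, -k], [-k, k]]

def assemble(n_nodes, elements):
    """Step 2: assemble global K from (node_i, node_j, k) triples."""
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j, k in elements:
        Ke = element_state(k)
        for a, p in enumerate((i, j)):
            for b, q in enumerate((i, j)):
                K[p][q] += Ke[a][b]
    return K

def apply_bcs(K, R, fixed):
    """Step 3: keep only the free DOFs, giving Kf and Rf."""
    free = [d for d in range(len(R)) if d not in fixed]
    Kf = [[K[p][q] for q in free] for p in free]
    Rf = [R[p] for p in free]
    return Kf, Rf, free

def solve(Kf, Rf):
    """Step 4: solve Kf * Df = Rf by Gaussian elimination without pivoting
    (adequate for the small positive-definite system used here)."""
    n = len(Rf)
    A = [row[:] + [Rf[i]] for i, row in enumerate(Kf)]
    for c in range(n):
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for m in range(c, n + 1):
                A[r][m] -= f * A[c][m]
    Df = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][m] * Df[m] for m in range(r + 1, n))
        Df[r] = (A[r][n] - s) / A[r][r]
    return Df

# Three springs in series (k = 100 each), node 0 fixed, unit load at node 3.
elements = [(0, 1, 100.0), (1, 2, 100.0), (2, 3, 100.0)]
K = assemble(4, elements)
Kf, Rf, free = apply_bcs(K, [0.0, 0.0, 0.0, 1.0], fixed={0})
Df = solve(Kf, Rf)
```

Each function corresponds to one numbered step, so the Store/Get/Compute lists above map onto its return value, arguments, and body respectively.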

5
What is a PFEA Algorithm?

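FEA can be broken down into the four steps listed on the previous slide.

A PFEA algorithm is a procedure that specifies how those steps are carried out in parallel.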

6
What is a PFEA Algorithm?
A procedure that specifies which step is parallelized, how it is distributed, and how the data is communicated.

[Diagram: 2. Assembly of K & R (Store K, R; Compute K, R); 4. Solve for Df (Store Kf, Rf; Compute Df). Communication label: K, R. Dependency labels: Intra-Step Dependencies; Inter-Step Dependencies; Compute-Store (Mem, Comm); Compute-Store (Comm); Compute-Compute (Conc).]
7
Node-Wise Algorithms

Assembly is partitioned according to nodes
Elements distributed to pertinent assembly partitions

Pros
Lower communication.
Robust.

Cons
Higher storage.
Redundant computation.
Solution concurrency/efficiency tradeoff.
Handling nonlinearity is not trivial.

[Diagram: 1. Element St. Det. (Store Ke, Fe, Prop; Compute Ke, Fe); 2. Assembly of K & R (Store K, R; Compute K, R); 3. Apply BCs (Store DOFf; Compute Kf, Rf); 4. Solve for Df (Store Kf, Rf; Compute Df). Communication labels: K, R; Kf, Rf.]
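
The redundant computation and higher storage listed for node-wise algorithms can be illustrated with a small sketch. It assumes that an element is copied to every partition that owns one of its nodes, which is one plausible reading of "distributed to pertinent assembly partitions". The function name and the mesh are hypothetical.

```python
def node_wise_distribution(elements, node_owner):
    """Assign each element to every partition owning one of its nodes.

    elements: list of (node_i, node_j) pairs.
    node_owner: dict mapping node -> partition id.
    Returns a dict mapping partition id -> list of element indices.
    """
    parts = {p: [] for p in sorted(set(node_owner.values()))}
    for e, nodes in enumerate(elements):
        for p in sorted({node_owner[n] for n in nodes}):
            parts[p].append(e)
    return parts

# Chain of 4 elements over 5 nodes; nodes 0-2 on partition 0, nodes 3-4 on 1.
elements = [(0, 1), (1, 2), (2, 3), (3, 4)]
node_owner = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
parts = node_wise_distribution(elements, node_owner)

# Elements that straddle a partition boundary are stored and computed twice.
redundant = sum(len(v) for v in parts.values()) - len(elements)
```

In this toy mesh only the element joining nodes 2 and 3 is duplicated. In a 2D or 3D mesh the duplicated share would grow with the size of the partition boundaries.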
8
Element-Wise Algorithms

K & R are not explicitly assembled
Solve for Df iteratively
Element state determination & iterative solution parallelized

Pros
Lower storage.
Higher concurrency.
Better handling of nonlinearity (NL).

Cons
Longer convergence (iterative solver).
May not converge (iterative solver).
High communication.
Fine grained.

[Diagram: 1. Element St. Det. (Store Ke, Fe, Prop; Compute Ke, Fe); 2. Assembly of K & R (Store K, R; Compute K, R); 3. Apply BCs (Store DOFf; Compute Kf, Rf); 4. Solve for Df (Store Kf, Rf; Compute Df). Element-wise variants shown: 3. Apply BCs on Ke; 4. Solve for Df iteratively using Ke, R. Communication labels: Ke, Fe; Df; K, R; Kf, Rf.]
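
A minimal sketch of the element-wise idea follows: an iterative solver that only needs the product K x, accumulated element by element so that K is never assembled. The conjugate gradient method is used as a representative iterative solver, since the slides do not say which one the authors had in mind. Boundary conditions are applied at the element level by mapping supported DOFs to `None`. All names and values are assumptions.

```python
def ebe_matvec(elements, x):
    """Compute y = K x element by element, never forming K.

    elements: list of (dofs, Ke); dofs holds free-DOF indices, with None
    marking a supported DOF (boundary condition applied on Ke).
    """
    y = [0.0] * len(x)
    for dofs, Ke in elements:
        for a, p in enumerate(dofs):
            if p is None:
                continue
            for b, q in enumerate(dofs):
                if q is not None:
                    y[p] += Ke[a][b] * x[q]
    return y

def conjugate_gradient(elements, R, tol=1e-12, max_iter=100):
    """Solve K Df = R using only element-level products."""
    x = [0.0] * len(R)
    r = R[:]
    p = r[:]
    rr = sum(v * v for v in r)
    for it in range(max_iter):
        if rr < tol:
            return x, it
        Kp = ebe_matvec(elements, p)
        alpha = rr / sum(a * b for a, b in zip(p, Kp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ki for ri, ki in zip(r, Kp)]
        rr_new = sum(v * v for v in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter

# Three springs in series, first node supported, unit load at the free end.
Ke = [[100.0, -100.0], [-100.0, 100.0]]
elements = [((None, 0), Ke), ((0, 1), Ke), ((1, 2), Ke)]
Df, iterations = conjugate_gradient(elements, [0.0, 0.0, 1.0])
```

The loop in `ebe_matvec` is the part that would be spread across processors. Each product needs contributions from neighboring elements, which is consistent with the high communication and fine granularity listed under Cons.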
9
Domain-Wise Algorithms

Split to domains, solve, then join back

Pros
Lower communication.
Higher concurrency.

Cons
More computation.
Higher storage.
Handling nonlinearity is not trivial.

[Diagram: 1. Element St. Det. (Store Ke, Fe, Prop; Compute Ke, Fe); 2. Assembly of K & R (Store K, R; Compute K, R); 4. Solve for Df (Store Kf, Rf; Compute Df), subdivided into: 4-a. Split Kf to Kd1-Kdn and Rf to Rd1-Rdn; 4-b. Solve for Dd1-Ddn internal; 4-c. Solve for Dd1-Ddn interface. Communication labels: Ke, Fe; Df; K, R.]
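
The split/solve/join pattern can be sketched with static condensation on the smallest possible case: two subdomains, each with a single internal DOF, sharing one interface DOF. This is a generic substructuring example under those assumptions, not the formulation used by the authors. The dictionary keys and numerical values are hypothetical.

```python
def condense(Kii, Kib, Ri):
    """Eliminate a subdomain's single internal DOF (part of step 4-b).

    Returns the corrections this subdomain adds to the interface
    stiffness and the interface load.
    """
    return -Kib * Kib / Kii, -Kib * Ri / Kii

# 4-a: a three-spring chain (k = 100) split at its middle free node.
# Kbb is each subdomain's own contribution to the interface stiffness.
subdomains = [
    {"Kii": 200.0, "Kib": -100.0, "Kbb": 100.0, "Ri": 0.0},
    {"Kii": 100.0, "Kib": -100.0, "Kbb": 100.0, "Ri": 1.0},
]
Rb = 0.0  # external load applied directly at the interface DOF

S, g = 0.0, Rb
for d in subdomains:  # each pass is independent and could run concurrently
    dS, dg = condense(d["Kii"], d["Kib"], d["Ri"])
    S += d["Kbb"] + dS
    g += dg

Db = g / S  # 4-c: interface displacement
Di = [(d["Ri"] - d["Kib"] * Db) / d["Kii"] for d in subdomains]  # internal
```

The condensation and recovery passes are the extra work behind "More computation", while only the small interface system has to be shared between processors.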
10
Proposed Row-Wise Algorithm

Consider a structure analyzed with n = 3 processors

1. Create a vector of elements and subdivide the vector into n rows
2. Partition K & R into n rows distributed to n processors
3. Mark supported rows/cols, and redistribute free rows
4. Partition the displacement vector into rows, and solve for Df

[Diagram: 2. Assembly of K & R (Store K, R; Compute K, R); 4. Solve for Df (Store Kf, Rf; Compute Df). Communication label: K, R.]
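
A minimal sketch of the partitioning in steps 1 to 3, assuming a contiguous block distribution. The slides do not state how the rows are formed, so the `split_rows` helper, the problem sizes, and the set of supported DOFs are all hypothetical.

```python
def split_rows(items, n):
    """Split a list into n contiguous blocks whose sizes differ by at most 1."""
    q, r = divmod(len(items), n)
    blocks, start = [], 0
    for p in range(n):
        size = q + (1 if p < r else 0)
        blocks.append(items[start:start + size])
        start += size
    return blocks

n = 3                                            # number of processors
element_rows = split_rows(list(range(10)), n)    # step 1: 10 elements
system_rows = split_rows(list(range(12)), n)     # step 2: 12 rows of K and R

supported = {0, 1, 2}                            # step 3: mark supported DOFs
free = [d for d in range(12) if d not in supported]
free_rows = split_rows(free, n)                  # step 3: redistribute free rows
```

Redistributing after the supported rows are marked keeps the number of rows that actually enter the solve equal on every processor, instead of leaving the first processor with a single free row.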
11
Proposed Row-Wise Algorithm

Now, consider the communication

[Diagram: 1. Element St. Det. (Store Ke, Fe, Prop; Compute Ke, Fe). Communication labels: Ke, Fe; Df; K, R.]
12
Proposed Row-Wise Algorithm

Perfect case: no inter-processor communication

[Diagram labels: Most Often Not True; True]
13
Proposed Row-Wise Algorithm
5. Load Balance

Consider the following structure

[Diagram: structure with nodes N7, N5, N8, N2, N6, N4, N3, shown against steps 1. Element State Determination, 2. Assembly, 3. Apply BCs, 4. Solve for Df.]

- Row 5 in System has multiplicity 3 in P0
- Row 8 in System has multiplicity 1 in P0
- Row 4 in Element has multiplicity 2 in P2
- Row 7 in Element has multiplicity 1 in P2
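
The transcript does not define multiplicity, so the sketch below rests on an assumed reading. The multiplicity of a system row in a processor is taken as the number of elements held by that processor that contribute to the row. The multiplicity of an element row in a processor is taken as the number of system rows held by that processor that the element touches. The connectivity and ownership tables are hypothetical and are not the structure shown on the slide.

```python
def system_row_multiplicity(row, proc, element_dofs, element_owner):
    """Assumed definition: elements on `proc` contributing to system row `row`."""
    return sum(1 for e, dofs in element_dofs.items()
               if element_owner[e] == proc and row in dofs)

def element_row_multiplicity(elem, proc, element_dofs, row_owner):
    """Assumed definition: system rows on `proc` touched by element `elem`."""
    return sum(1 for d in element_dofs[elem] if row_owner[d] == proc)

# Hypothetical data: element -> system rows it contributes to, and owners.
element_dofs = {0: (4, 5), 1: (5, 6), 2: (5, 8), 3: (7, 8)}
element_owner = {0: 0, 1: 0, 2: 0, 3: 1}
row_owner = {4: 2, 5: 2, 6: 0, 7: 1, 8: 1}

m_row5_p0 = system_row_multiplicity(5, 0, element_dofs, element_owner)
m_row8_p0 = system_row_multiplicity(8, 0, element_dofs, element_owner)
m_elem0_p2 = element_row_multiplicity(0, 2, element_dofs, row_owner)
```

Under this reading, a high multiplicity means that most of the data a row needs is already local to the processor.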
14
Source of Nonlinearity Imbalance

Element stiffness matrix

Nonlinearity
Generated at the section level.
Propagates to the element level through integration.
Propagates to the structural level through assembly.
Therefore, imbalance must be dealt with at:
Element state determination step: caused by more iterations on some processors than others.
Structure displacement calculation step: caused by introduction of new nonzero or new zero elements in the stiffness matrix, residual force vector, or displacement increment vector.
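
One simple way to quantify the first source of imbalance is the ratio of the busiest processor's load to the mean load. This metric is a common convention rather than something defined in the slides. The iteration counts and ownership below are hypothetical.

```python
def imbalance(iterations, owner, n_procs):
    """Return (max load / mean load, per-processor load); 1.0 means balanced.

    iterations[e] is the number of state determination iterations element e
    needed, and owner[e] is the processor holding element e.
    """
    load = [0] * n_procs
    for elem, its in enumerate(iterations):
        load[owner[elem]] += its
    return max(load) / (sum(load) / n_procs), load

# Hypothetical counts: elements 0 and 1 have gone nonlinear and need
# several iterations, while the rest converge in one.
iterations = [6, 5, 1, 1, 1, 1]
owner = [0, 0, 1, 1, 2, 2]
ratio, load = imbalance(iterations, owner, 3)
```

Every processor holds two elements, yet processor 0 carries most of the work. A static partition based on element counts alone cannot see this.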

15
DLB Technique

Element State Determination Step
Update the multiplicity of each element row in the processor.
Start the state determination with the highest multiplicity row for this processor, and end with the lowest multiplicity row of this processor.
When the last row is reached, broadcast a signal requesting the maximum multiplicity values, with respect to this processor, of undetermined rows on other processors.
Import the highest multiplicity row, and perform state determination on it.
Structure Displacement Solution Step
Update the multiplicity of each system row in the processor.
Request the maximum multiplicity values for this processor in other processors.
Compare the multiplicity of the local row that has the maximum multiplicity in this processor with the multiplicity of the remote row in this processor. If the remote multiplicity is higher, exchange the two rows.
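
The element state determination part of the scheme can be sketched as a serial, lockstep simulation. It assumes that every row costs one time unit and that an idle processor imports exactly one row per turn. There is no real message passing, and the multiplicity table is hypothetical. The structure displacement solution step (row exchange) is not covered.

```python
def dlb_state_determination(local_rows, mult):
    """Simulate the element state determination step of the DLB scheme.

    local_rows: one list per processor of the element rows it holds.
    mult: mult[p][row] = multiplicity of `row` with respect to processor p.
    Each pass of the outer loop lets every processor determine one row.
    Returns, per processor, the rows it processed in order.
    """
    n = len(local_rows)
    # Highest multiplicity rows first on each processor.
    pending = [sorted(rows, key=lambda r: -mult[p][r])
               for p, rows in enumerate(local_rows)]
    done = [[] for _ in range(n)]
    while any(pending):
        for p in range(n):
            if pending[p]:
                row = pending[p].pop(0)
            else:
                # Idle: import the undetermined remote row with the highest
                # multiplicity with respect to this processor.
                remote = [(mult[p][r], q, r)
                          for q in range(n) for r in pending[q]]
                if not remote:
                    continue
                _, q, row = max(remote)
                pending[q].remove(row)
            done[p].append(row)
    return done

# Hypothetical setup: processor 0 starts with four rows, the others with one.
local_rows = [[0, 1, 2, 3], [4], [5]]
mult = [
    {0: 3, 1: 2, 2: 1, 3: 1, 4: 0, 5: 0},
    {0: 0, 1: 0, 2: 1, 3: 0, 4: 2, 5: 0},
    {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 2},
]
done = dlb_state_determination(local_rows, mult)
```

Working from the highest multiplicity down leaves a processor's least local rows until last, so those are the rows still available when an idle neighbor asks for work.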

16
ParaStruc

New, fully parallelized structural finite element system.
Built on Trilinos, a set of parallel numerical libraries.
Lightweight: contains a preprocessor and 3 classes only.

17
Numerical Example: NL Cantilever

Partition to 1,000 subelements
Mesh to 1,000 fibers
Integrate at 8 sections
Apply End Point Load

Analyzed on VT System X Supercomputer
1,100-node (2,200 processors) Apple Xserve G5 cluster
Dual 2.3 GHz PowerPC 970FX processors / node
4 GB ECC DDR400 (PC3200) RAM / node
80 GB S-ATA hard disk drive / node
Mellanox Cougar InfiniBand 4x HCA Interconnect

18
Numerical Example: Results

2 factors controlling efficiency
- Storage cost (cache availability): positive impact for Np < 8
- Communication cost (latency): negative impact for Np > 8
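
The transcript does not say how efficiency was measured. A minimal sketch, assuming the usual definitions of speedup (serial time over parallel time) and efficiency (speedup per processor), is given below. The timings are hypothetical and are not results from the presentation.

```python
def speedup(t1, tp):
    """Serial wall-clock time divided by parallel wall-clock time."""
    return t1 / tp

def efficiency(t1, tp, n_procs):
    """Speedup per processor; 1.0 corresponds to ideal linear scaling."""
    return speedup(t1, tp) / n_procs

# Hypothetical wall-clock times in seconds, for illustration only.
e = efficiency(t1=120.0, tp=20.0, n_procs=8)
```

With this definition, a cache benefit shows up as efficiency rising toward or above 1.0 as Np grows, and a communication penalty shows up as efficiency falling.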

19
Conclusions

A new row-wise partitioning algorithm together with a dynamic load balancing technique has been developed that:
Minimizes computation in each processor.
Minimizes the required storage in each processor.
Minimizes communication between processors.
Balances the computation load among the processors.

20
Acknowledgements

Special Thanks to
Kuwait University for funding this research.
Virginia Tech TeraScale group for making the
System X Supercomputer available for this
research and for their help and support in using
it.

21
Questions & Comments?
System X Supercomputer at Virginia Tech
