Title: Automatic Parallelization of Simulation Code from Equation Based Simulation Languages
1. Automatic Parallelization of Simulation Code from Equation Based Simulation Languages
- Peter Aronsson
- Industrial PhD student, PELAB, SaS, IDA
- Linköping University, Sweden
- Based on the Licentiate presentation at CPC'03
2. Outline
- Introduction
- Task Graphs
- Related work on scheduling and clustering
- Parallelization Tool
- Contributions
- Results
- Conclusions and Future Work
3. Introduction
- Modelica: an object oriented, equation based modeling language
- Modelica enables modeling and simulation of large and complex multi-domain systems
- Large need for parallel computation:
  - To decrease the time of executing simulations
  - To make large models possible to simulate at all
  - To meet hard real-time demands in hardware-in-the-loop simulations
4. Examples of Large, Complex Systems in Modelica
5. Modelica Example: DC Motor
6. Modelica Example

  model DCMotor
    import Modelica.Electrical.Analog.Basic.*;
    import Modelica.Electrical.Sources.StepVoltage;
    Resistor R1(R=10);
    Inductor I1(L=0.1);
    EMF emf(k=5.4);
    Ground ground;
    StepVoltage step(V=10);
    Modelica.Mechanics.Rotational.Inertia load(J=2.25);
  equation
    connect(R1.n, I1.p);
    connect(I1.n, emf.p);
    connect(emf.n, ground.p);
    connect(emf.flange_b, load.flange_a);
    connect(step.p, R1.p);
    connect(step.n, ground.p);
  end DCMotor;
7. Example: Flat Set of Equations

  R1.v = -R1.n.v + R1.p.v    0 = R1.n.i + R1.p.i    R1.i = R1.p.i    R1.i*R1.R = R1.v
  I1.v = -I1.n.v + I1.p.v    0 = I1.n.i + I1.p.i    I1.i = I1.p.i    I1.L*I1.der(i) = I1.v
  emf.v = -emf.n.v + emf.p.v    0 = emf.n.i + emf.p.i    emf.i = emf.p.i
  emf.w = emf.flange_b.der(phi)    emf.k*emf.w = emf.v    emf.flange_b.tau = -emf.i*emf.k
  ground.p.v = 0
  step.v = -step.n.v + step.p.v    0 = step.n.i + step.p.i    step.i = step.p.i
  step.signalSource.outPort.signal[1] =
    (if time < step.signalSource.p_startTime[1] then 0
     else step.signalSource.p_height[1]) + step.signalSource.p_offset[1]
  step.v = step.signalSource.outPort.signal[1]
  load.flange_a.phi = load.phi    load.flange_b.phi = load.phi
  load.w = load.der(phi)    load.a = load.der(w)
  load.a*load.J = load.flange_a.tau + load.flange_b.tau
  R1.n.v = I1.p.v    I1.p.i + R1.n.i = 0
  I1.n.v = emf.p.v    emf.p.i + I1.n.i = 0
  emf.n.v = step.n.v    step.n.v = ground.p.v    emf.n.i + ground.p.i + step.n.i = 0
  emf.flange_b.phi = load.flange_a.phi    emf.flange_b.tau + load.flange_a.tau = 0
  step.p.v = R1.p.v    R1.p.i + step.p.i = 0    load.flange_b.tau = 0
  step.signalSource.y = step.signalSource.outPort.signal
8. Plot of Simulation Result

9. Task Graphs
- Directed Acyclic Graph (DAG): G = (V, E, t, c)
  - V: set of nodes, representing computational tasks
  - E: set of edges, representing communication of data between tasks
  - t(v): execution cost of node v
  - c(i, j): communication cost of edge (i, j)
- Referred to as the delay model (macro dataflow model)
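The delay model above can be captured in a few lines of code. The following is an illustrative sketch (the class and method names are my own, not from the tool described in this presentation):

```python
# A minimal task-graph representation for the delay (macro dataflow) model:
# G = (V, E, t, c) with execution costs t(v) and communication costs c(i, j).
class TaskGraph:
    def __init__(self):
        self.t = {}      # t(v): execution cost of node v
        self.c = {}      # c(i, j): communication cost of edge (i, j)
        self.succ = {}   # adjacency list: successors of each node

    def add_node(self, v, cost):
        self.t[v] = cost
        self.succ.setdefault(v, [])

    def add_edge(self, i, j, cost):
        self.c[(i, j)] = cost
        self.succ[i].append(j)

# Example: a fork-join graph with cheap tasks and expensive messages,
# similar in spirit to the small task graph on the next slide.
g = TaskGraph()
for v, cost in [("a", 5), ("b", 5), ("c", 5), ("d", 5)]:
    g.add_node(v, cost)
for i, j in [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]:
    g.add_edge(i, j, 10)
```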
10. Small Task Graph Example
[Figure: small task graph with node execution costs of 5 and 10 and edge communication costs of 10]
11. Task Scheduling Algorithms
- Multiprocessor Scheduling Problem: for each task, assign
  - a starting time
  - a processor assignment (P1, ..., PN)
- Goal: minimize execution time, given
  - precedence constraints
  - execution costs
  - communication costs
- Algorithms in the literature:
  - List scheduling approaches (ERT, FLB)
  - Critical path scheduling approaches (TDS, MCP)
  - Categories: fixed number of processors, fixed c and/or t, ...
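To make the scheduling problem concrete, here is a sketch of a simple list scheduler in the spirit of ERT: repeatedly pick a ready task and place it on the processor where it can start earliest, paying c(i, j) for data that crosses processors. This is a generic illustration, not the algorithm used in the tool:

```python
# Sketch of list scheduling (ERT-style): greedily place each ready task on
# the processor giving the earliest start time, accounting for communication.
def list_schedule(tasks, succ, t, c, num_procs):
    pred = {v: [] for v in tasks}
    for i in succ:
        for j in succ[i]:
            pred[j].append(i)
    proc_free = [0.0] * num_procs   # when each processor is next idle
    placed = {}                     # v -> (processor, start, finish)
    remaining = set(tasks)
    while remaining:
        ready = [v for v in remaining if all(p in placed for p in pred[v])]
        v = sorted(ready)[0]        # simple fixed-order priority
        best = None
        for p in range(num_procs):
            # data from a predecessor on another processor pays c(i, v)
            data_ready = max([0.0] + [
                placed[i][2] + (0 if placed[i][0] == p else c[(i, v)])
                for i in pred[v]])
            start = max(proc_free[p], data_ready)
            if best is None or start < best[1]:
                best = (p, start)
        p, start = best
        placed[v] = (p, start, start + t[v])
        proc_free[p] = start + t[v]
        remaining.remove(v)
    return placed

# Fork-join example with t = 5 and c = 10: communication is so expensive
# that the whole graph ends up serialized on one processor.
tasks = ["a", "b", "c", "d"]
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
t = {v: 5 for v in tasks}
c = {e: 10 for e in [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]}
sched = list_schedule(tasks, succ, t, c, 2)
```

Note how the cheap tasks and expensive edges drive everything onto a single processor: this is exactly the fine-granularity problem discussed on the next slide.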
12. Granularity
- Granularity: g = min(t(v)) / max(c(i, j))
- Affects the scheduling result
  - E.g. TDS works best for high values of g, i.e. low communication cost
- Solutions:
  - Clustering algorithms
    - Idea: build clusters of nodes, where nodes in the same cluster are executed on the same processor
  - Merging algorithms
    - Merge tasks to increase computational cost
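The granularity measure above is a one-liner; a sketch:

```python
# Granularity of a task graph: g = min(t(v)) / max(c(i, j)).
# g < 1 means communication dominates computation (a fine grained graph),
# which is exactly the regime where algorithms such as TDS degrade.
def granularity(t, c):
    return min(t.values()) / max(c.values())

# Example: cheapest task costs 5, most expensive edge costs 20.
t = {"a": 5, "b": 5, "c": 10}
c = {("a", "c"): 10, ("b", "c"): 20}
g = granularity(t, c)   # 5 / 20 = 0.25
```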
13. Task Clustering/Merging Algorithms
- Task Clustering Problem
  - Build clusters of nodes such that the parallel time decreases
  - PT(n) = tlevel(n) + blevel(n)
  - Zeroing edges, i.e. putting several nodes into the same cluster, gives zero communication cost
  - Literature: Sarkar's internalization algorithm, Yang's DSC algorithm
- Task Merging Problem
  - Transform the task graph by merging nodes
  - Literature: e.g. the Grain Packing algorithm
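The tlevel/blevel quantities used in PT(n) can be computed by two passes over a topological order. The following sketch uses the standard definitions (tlevel excludes the node's own cost, blevel includes it); the function name is my own:

```python
# tlevel(n): longest path from an entry node to n (excluding t(n));
# blevel(n): longest path from n to an exit node (including t(n));
# parallel time PT = max over n of tlevel(n) + blevel(n).
def levels(tasks, succ, t, c):
    pred = {v: [] for v in tasks}
    for i in succ:
        for j in succ[i]:
            pred[j].append(i)
    order, seen = [], set()
    def visit(v):               # depth-first post-order
        if v in seen:
            return
        seen.add(v)
        for w in succ.get(v, []):
            visit(w)
        order.append(v)
    for v in tasks:
        visit(v)
    tlevel, blevel = {}, {}
    for v in reversed(order):   # topological order
        tlevel[v] = max([0] + [tlevel[i] + t[i] + c[(i, v)] for i in pred[v]])
    for v in order:             # reverse topological order
        blevel[v] = t[v] + max([0] + [c[(v, j)] + blevel[j]
                                      for j in succ.get(v, [])])
    pt = max(tlevel[v] + blevel[v] for v in tasks)
    return tlevel, blevel, pt

# Two-node chain a -> b with t = 5 and c = 10: PT = 5 + 10 + 5 = 20.
tl, bl, pt = levels(["a", "b"], {"a": ["b"]},
                    {"a": 5, "b": 5}, {("a", "b"): 10})
```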
14. Clustering vs. Merging
[Figure: the example task graph after clustering, with intra-cluster edge costs zeroed (left: clustered task graph), and the same graph with each cluster collapsed into a single node (right: merged task graph)]
15. DSC Algorithm
- Initially, put each node in a separate cluster
- Traverse the task graph
  - Merge clusters as long as the parallel time does not increase
- Low complexity: O((n + e) log n)
- Previously used by Andersson in ObjectMath (PELAB)
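The core accept/reject step of DSC can be sketched as follows. This is a deliberate simplification: parallel time is recomputed from scratch as a longest path with intra-cluster edges zeroed (ignoring serialization within a cluster), whereas the real DSC maintains it incrementally to reach the O((n + e) log n) bound. All names are illustrative:

```python
# Simplified PT: longest weighted path, where edges inside one cluster
# cost nothing. (Ignores intra-cluster serialization for brevity.)
def parallel_time(tasks, succ, t, c, cluster):
    memo = {}
    def blevel(v):
        if v not in memo:
            memo[v] = t[v] + max([0] + [
                (0 if cluster[v] == cluster[j] else c[(v, j)]) + blevel(j)
                for j in succ.get(v, [])])
        return memo[v]
    return max(blevel(v) for v in tasks)

# DSC-like sketch: start with one cluster per node, then accept an edge
# zeroing (merging its two clusters) only if PT does not increase.
def dsc_sketch(tasks, succ, t, c):
    cluster = {v: v for v in tasks}
    pt = parallel_time(tasks, succ, t, c, cluster)
    for (i, j) in sorted(c):             # examine each edge once
        trial = dict(cluster)
        target, src = trial[i], trial[j]
        for v in trial:                  # merge j's cluster into i's
            if trial[v] == src:
                trial[v] = target
        new_pt = parallel_time(tasks, succ, t, c, trial)
        if new_pt <= pt:                 # keep merges that do not hurt PT
            cluster, pt = trial, new_pt
    return cluster, pt

# Chain a -> b with t = 5, c = 10: zeroing the edge drops PT from 20 to 10.
cluster, pt = dsc_sketch(["a", "b"], {"a": ["b"]},
                         {"a": 5, "b": 5}, {("a", "b"): 10})
```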
16. Modelica Compilation
[Diagram: Modelica model (.mo) -> Modelica semantics -> flat Modelica (.mof) -> equation system (DAE) -> optimization -> rhs calculations -> C code, linked with a numerical solver]
- Structure of the simulation code:

  for (t = 0; t < stopTime; t += stepSize) {
    x_dot[t+1] = f(x_dot[t], x[t], t);
    x[t+1]     = ODESolver(x_dot[t+1]);
  }
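The simulation loop above can be sketched with an explicit Euler step standing in for the ODE solver. Here `f` represents the generated rhs code; the function and its signature are illustrative, not the actual solver interface:

```python
# Sketch of the simulation loop for x' = f(x, t), with explicit Euler
# standing in for the ODESolver call in the generated code.
def simulate(f, x0, t0, stop_time, step):
    t, x = t0, x0
    while t < stop_time:
        x_dot = f(x, t)           # rhs calculation (the generated C code)
        x = x + step * x_dot      # ODE solver step (explicit Euler here)
        t += step
    return x

# Integrating x' = 1 from 0 to 1 gives x(1) = 1.
x_final = simulate(lambda x, t: 1.0, 0.0, 0.0, 1.0, 0.25)
```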
17. Optimizations on Equations
- Simplification of equations
  - E.g. a = b, b = c: eliminate b
- BLT transformation, i.e. topological sorting into strongly connected components (BLT = Block Lower Triangular form)
- Index reduction; the index is how many times an equation needs to be differentiated in order to solve the equation system
- Mixed Mode / Inline Integration: methods of optimizing equations by reducing the size of equation systems
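The BLT transformation amounts to finding strongly connected components of the equation/variable dependency graph and ordering the resulting blocks topologically; each block must be solved together. A plain Kosaraju-style sketch (not the compiler's own code):

```python
# SCC-based BLT sketch: returns the strongly connected components of a
# dependency graph (node -> successor list) in topological order.
def scc_blocks(graph):
    order, seen = [], set()
    def dfs(v, g, out):                  # iterative DFS, post-order output
        stack = [(v, iter(g.get(v, [])))]
        seen.add(v)
        while stack:
            node, it = stack[-1]
            advanced = False
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(g.get(w, []))))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(node)
    for v in graph:                      # first pass: finishing order
        if v not in seen:
            dfs(v, graph, order)
    rev = {v: [] for v in graph}         # reversed graph
    for v in graph:
        for w in graph[v]:
            rev[w].append(v)
    seen.clear()
    blocks = []
    for v in reversed(order):            # second pass on reversed graph
        if v not in seen:
            comp = []
            dfs(v, rev, comp)
            blocks.append(sorted(comp))
    return blocks

# Equations 1 and 2 depend on each other (one block); equation 3 follows.
blocks = scc_blocks({1: [2], 2: [1, 3], 3: []})
```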
18. Generated C Code Content
- Assignment statements
- Arithmetic expressions (+, -, *, /), if-expressions
- Function calls
  - Standard math functions (sin, cos, log)
  - Modelica functions: user defined, side effect free
  - External Modelica functions: in an external library, written in Fortran or C
  - Calls to functions for solving subsystems of equations (linear or non-linear)
- Example application: a robot simulation has 27 000 lines of generated C code
19. Parallelization Tool Overview
[Diagram: Model (.mo) -> Modelica Compiler -> C code -> C compiler (with solver lib) -> sequential executable; the C code also feeds the Parallelizer -> parallel C code -> C compiler (with solver lib and MPI lib) -> parallel executable]
20. Parallelization Tool Internal Structure
[Diagram: sequential C code -> Parser -> Task Graph Builder -> Scheduler -> Code Generator -> parallel C code; a Symbol Table is shared by the stages, and the Scheduler also emits debug statistics]
21. Task Graph Building
- The first graph corresponds to individual arithmetic operations, assignments, function calls and variable definitions in the C code
- The second graph consists of clusters of tasks from the first task graph
- [Figure: example task graph at the expression level, with nodes for variable definitions (a, b, c, d), arithmetic operations (-, /) and a call to foo]
22. Investigated Scheduling Algorithms
- In the parallelization tool:
  - TDS (Task Duplication Scheduling algorithm)
  - Pre-clustering method
  - Full task duplication method
- In the experimental framework (Mathematica):
  - ERT
  - DSC
  - TDS
  - Full task duplication method
  - Task merging approaches (graph rewrite systems)
23. Method 1: Pre-Clustering Algorithm
- buildCluster(n: node, l: list of nodes, size: Integer)
  - Adds n to a new cluster
  - Repeatedly adds nodes until size(cluster) = size, in order of preference:
    - children of n (children with in-degree one go directly into the cluster)
    - siblings of n
    - parents of n
    - arbitrary nodes
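The growth step can be sketched as below. This is a simplified reading of the slide: candidates are tried in preference order (in-degree-one children first, then siblings, parents, and arbitrary nodes) until the size budget is met. All names are illustrative:

```python
# Sketch of the pre-clustering step: grow a cluster around a seed node n
# until a size budget is reached, preferring "cheap" neighbors first.
def build_cluster(n, nodes, succ, pred, size):
    cluster = [n]
    in_degree_one_children = [
        ch for ch in succ.get(n, []) if len(pred.get(ch, [])) == 1]
    siblings = [s for p in pred.get(n, [])
                for s in succ.get(p, []) if s != n]
    candidates = (in_degree_one_children
                  + siblings
                  + pred.get(n, [])                 # parents of n
                  + [v for v in nodes if v != n])   # arbitrary nodes
    for v in candidates:
        if len(cluster) >= size:
            break
        if v not in cluster:
            cluster.append(v)
    return cluster

# Fork-join graph: seed "a" picks up its in-degree-one children b and c.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
pred = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
cluster = build_cluster("a", ["a", "b", "c", "d"], succ, pred, 3)
```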
24. Managing Cycles
- When adding a node to a cluster, the resulting graph might have cycles
- [Figure: clustering a and b is illegal, since c is reachable from a and b is reachable from c, so the clustered graph is cyclic]
- The resulting graph is then not a DAG
  - Standard scheduling algorithms cannot be used
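A simple legality check for a candidate cluster is to ask whether any path leaving the cluster leads back into it; if so, contracting the cluster would create a cycle. A sketch (function name is my own):

```python
# Does merging `cluster` (a set of nodes) into one node create a cycle?
# BFS from the cluster's outside successors; reaching the cluster again
# means there is a path out of and back into the cluster.
def merge_creates_cycle(cluster, succ):
    outside = {w for v in cluster
               for w in succ.get(v, []) if w not in cluster}
    frontier, seen = list(outside), set(outside)
    while frontier:
        v = frontier.pop()
        for w in succ.get(v, []):
            if w in cluster:
                return True
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return False

# The slide's example: a -> c -> b, so clustering {a, b} is cyclic,
# while clustering {a, c} is fine.
succ = {"a": ["c"], "c": ["b"]}
```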
25. Pre-Clustering Results
- Did not produce speedup
  - Introduced far too many dependencies in the resulting task graph
  - Sequentialized the schedule
- Conclusion
  - For fine grained task graphs, such an algorithm needs task duplication to succeed
26. Method 2: Full Task Duplication
- For each node n with successor(n) = {}:
  - Put all pred(n) in one cluster
  - Repeat for all nodes in the cluster
- Rationale: if the depth of the graph is limited, task duplication is kept at a reasonable level and cluster sizes stay reasonably small
- Works well when communication cost >> execution cost
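The idea can be sketched as follows: every exit node gets its own cluster containing all of its transitive predecessors, so clusters need no communication at all, at the price of duplicating shared predecessors. The function name is illustrative:

```python
# Full-task-duplication sketch: one cluster per exit node, each holding
# the node and all of its (transitive) predecessors. Shared predecessors
# are duplicated into every cluster that needs them.
def ftd_clusters(tasks, succ):
    pred = {v: [] for v in tasks}
    for i in succ:
        for j in succ[i]:
            pred[j].append(i)
    sinks = [v for v in tasks if not succ.get(v)]
    clusters = []
    for s in sinks:
        cluster, stack = set(), [s]
        while stack:                     # walk predecessors transitively
            v = stack.pop()
            if v not in cluster:
                cluster.add(v)
                stack.extend(pred[v])
        clusters.append(cluster)
    return clusters

# Fork a -> b, a -> c: two clusters, with a duplicated into both.
clusters = ftd_clusters(["a", "b", "c"], {"a": ["b", "c"]})
```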
27. Full Task Duplication (2)
- Merging clusters:
  1. Merge clusters with a load balancing strategy, without increasing the maximum cluster size
  2. Merge the clusters with the greatest number of common nodes
  3. Repeat (2) until the required number of processors is met
28. Full Task Duplication Results
- Computed measurements
  - Execution cost of the largest cluster + communication cost
- Measured speedup
  - Executed on a PC cluster running Linux, with an SCI network interface, using SCAMPI
29. Robot Example: Computed Speedup
- Mixed Mode / Inline Integration
- [Figure: computed speedup with and without MM/II]
30. Thermofluid Pipe Executed on a PC Cluster
- Pressurewavedemo in the Thermofluid package, 50 discretization points
31. Thermofluid Pipe Executed on a PC Cluster (2)
- Pressurewavedemo in the Thermofluid package, 100 discretization points
32. Task Merging Using GRS
- Idea: a set of simple rules to transform a task graph to increase its granularity (and decrease parallel time)
- Use top level (and bottom level) as the metric
  - Parallel Time = max over n of tlevel(n) + blevel(n)
33. Rule 1
- Merge a single child with its only parent.
- Motivation: the merge does not decrease the amount of parallelism in the task graph, and the granularity can possibly increase.
- [Figure: parent p and its single child c merged into one node]
34. Rule 2
- Merge all parents of a node together with the node itself.
- Motivation: if the top level does not increase by the merge, the resulting task will increase in size, potentially increasing granularity.
- [Figure: parents p1 ... pn merged with the child c into one node]
35. Rule 3
- Duplicate a parent and merge it into each child node.
- Motivation: as long as no child's tlevel increases, duplicating p into the children reduces the number of nodes and increases granularity.
- [Figure: parent p duplicated into each of the children c1 ... cn]
36. Rule 4
- Merge siblings into a single node as long as a parameterized maximum execution cost is not exceeded.
- Motivation: this rule can be useful when several small predecessor nodes exist alongside a larger predecessor node that prevents a complete merge. It does not guarantee a decrease of PT.
- [Figure: the small sibling parents merged into one node, with the large parent left unmerged]
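The rules above share the same pattern-match-and-rewrite shape. As an illustration, here is a sketch of Rule 1 only (merge a single child into its only parent); the other rules differ only in their patterns. The function name and graph encoding are my own:

```python
# Rule 1 as a graph rewrite: a node with exactly one parent, whose parent
# has exactly one child, is merged into the parent. Applying the rule to
# a fixpoint collapses chains; the edge cost disappears with the merge.
def apply_rule1(t, succ):
    pred = {v: [] for v in t}
    for i in succ:
        for j in succ[i]:
            pred[j].append(i)
    for ch in list(t):
        ps = pred.get(ch, [])
        if len(ps) == 1 and len(succ.get(ps[0], [])) == 1:
            p = ps[0]
            t[p] += t[ch]               # merged task cost
            succ[p] = succ.pop(ch, [])  # child's out-edges move to parent
            del t[ch]
            return True                 # one rewrite applied
    return False

# Chain a -> b -> c with t = 5 each collapses into one task of cost 15.
t = {"a": 5, "b": 5, "c": 5}
succ = {"a": ["b"], "b": ["c"]}
while apply_rule1(t, succ):
    pass
```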
37. Results: Example
- Task graph from Modelica simulation code
- A small example from the mechanical domain
- About 100 nodes built at the expression level, originating from 84 equations and variables

38. Result: Task Merging Example

39. Result: Task Merging Example (continued)
40. Conclusions
- The pre-clustering approach did not work well for the fine grained task graphs produced by our parallelization tool
- The FTD method works reasonably well for some examples
- However, in general there is a need for better scheduling/clustering algorithms for fine grained task graphs
41. Conclusions (2)
- The simple delay model may not be enough
  - More advanced models require more complex scheduling and clustering algorithms
- Simulation code from equation based models is hard to extract parallelism from
  - New optimization methods on DAEs or ODEs are needed to increase parallelism
42. Conclusions: Task Merging Using GRS
- A task merging algorithm using a graph rewrite system (GRS) has been proposed
- Four rules with simple patterns => fast pattern matching
- Can easily be integrated into existing scheduling tools
- Successfully merges tasks considering
  - bandwidth and latency
  - task duplication
- Merging criterion: decrease parallel time (PT) by decreasing tlevel
- Tested on examples from simulation code
43. Future Work
- Designing and implementing better scheduling and clustering algorithms
  - Support for more advanced task graph models
  - Work better for high granularity values
- Try larger examples
- Test on different architectures
  - Shared memory machines
  - Dual processor machines
44. Future Work (2)
- Heterogeneous multiprocessor systems
  - Mixed DSP processors, RISC, CISC, etc.
- Enhancing the Modelica language with data parallelism
  - E.g. parallel loops, vector operations
  - Parallelize e.g. combined PDE and ODE problems in Modelica
  - Use e.g. ScaLAPACK for solving subsystems of linear equations; how should this be integrated into the scheduling algorithms?