Title: Automatic Parallelization of Simulation Code from Equation Based Simulation Languages
1. Automatic Parallelization of Simulation Code from Equation Based Simulation Languages
- Peter Aronsson
- Industrial PhD student, PELAB, SaS, IDA
- Linköping University, Sweden
- Based on the Licentiate presentation at CPC'03
2. Outline
- Introduction
- Task Graphs
- Related work on scheduling and clustering
- Parallelization Tool
- Contributions
- Results
- Conclusions and Future Work
3. Introduction
- Modelica: an object oriented, equation based modeling language
- Modelica enables modeling and simulation of large and complex multi-domain systems
- Large need for parallel computation:
  - To decrease the time of executing simulations
  - To make large models possible to simulate at all
  - To meet hard real-time demands in hardware-in-the-loop simulations
4. Examples of Large, Complex Systems in Modelica
5. Modelica Example: DC Motor
6. Modelica Example

  model DCMotor
    import Modelica.Electrical.Analog.Basic.*;
    import Modelica.Electrical.Sources.StepVoltage;
    Resistor R1(R=10);
    Inductor I1(L=0.1);
    EMF emf(k=5.4);
    Ground ground;
    StepVoltage step(V=10);
    Modelica.Mechanics.Rotational.Inertia load(J=2.25);
  equation
    connect(R1.n, I1.p);
    connect(I1.n, emf.p);
    connect(emf.n, ground.p);
    connect(emf.flange_b, load.flange_a);
    connect(step.p, R1.p);
    connect(step.n, ground.p);
  end DCMotor;
7. Example: Flat Set of Equations

  R1.v = -R1.n.v + R1.p.v    0 = R1.n.i + R1.p.i    R1.i = R1.p.i    R1.i*R1.R = R1.v
  I1.v = -I1.n.v + I1.p.v    0 = I1.n.i + I1.p.i    I1.i = I1.p.i    I1.L*I1.der(i) = I1.v
  emf.v = -emf.n.v + emf.p.v    0 = emf.n.i + emf.p.i    emf.i = emf.p.i
  emf.w = emf.flange_b.der(phi)    emf.k*emf.w = emf.v    emf.flange_b.tau = -emf.i*emf.k
  ground.p.v = 0
  step.v = -step.n.v + step.p.v    0 = step.n.i + step.p.i    step.i = step.p.i
  step.signalSource.outPort.signal[1] =
    (if time < step.signalSource.p_startTime[1] then 0
     else step.signalSource.p_height[1]) + step.signalSource.p_offset[1]
  step.v = step.signalSource.outPort.signal[1]
  load.flange_a.phi = load.phi    load.flange_b.phi = load.phi
  load.w = load.der(phi)    load.a = load.der(w)
  load.a*load.J = load.flange_a.tau + load.flange_b.tau
  R1.n.v = I1.p.v    I1.p.i + R1.n.i = 0
  I1.n.v = emf.p.v    emf.p.i + I1.n.i = 0
  emf.n.v = step.n.v    step.n.v = ground.p.v    emf.n.i + ground.p.i + step.n.i = 0
  emf.flange_b.phi = load.flange_a.phi    emf.flange_b.tau + load.flange_a.tau = 0
  step.p.v = R1.p.v    R1.p.i + step.p.i = 0    load.flange_b.tau = 0
  step.signalSource.y = step.signalSource.outPort.signal
8. Plot of Simulation Result

9. Task Graphs
- Directed Acyclic Graph (DAG): G = (V, E, t, c)
  - V: set of nodes, representing computational tasks
  - E: set of edges, representing communication of data between tasks
  - t(v): execution cost of node v
  - c(i, j): communication cost of edge (i, j)
- Referred to as the delay model (macro dataflow model)
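The delay model above can be captured in a few lines of code. The following is an illustrative sketch (the class and method names are my own, not from the tool described in this presentation):

```python
# A minimal task-graph representation for the delay (macro dataflow) model:
# G = (V, E, t, c) with execution costs t(v) and communication costs c(i, j).
class TaskGraph:
    def __init__(self):
        self.t = {}      # t(v): execution cost of node v
        self.c = {}      # c(i, j): communication cost of edge (i, j)
        self.succ = {}   # adjacency list: successors of each node

    def add_node(self, v, cost):
        self.t[v] = cost
        self.succ.setdefault(v, [])

    def add_edge(self, i, j, cost):
        self.c[(i, j)] = cost
        self.succ[i].append(j)

# Example: a fork-join graph with cheap tasks and expensive messages,
# similar in spirit to the small task graph on the next slide.
g = TaskGraph()
for v, cost in [("a", 5), ("b", 5), ("c", 5), ("d", 5)]:
    g.add_node(v, cost)
for i, j in [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]:
    g.add_edge(i, j, 10)
```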
10. Small Task Graph Example
[Figure: small task graph with node execution costs of 5 and 10 and edge communication costs of 10]
11. Task Scheduling Algorithms
- Multiprocessor Scheduling Problem: for each task, assign
  - a starting time
  - a processor assignment (P1, ..., PN)
- Goal: minimize execution time, given
  - precedence constraints
  - execution costs
  - communication costs
- Algorithms in the literature:
  - List scheduling approaches (ERT, FLB)
  - Critical path scheduling approaches (TDS, MCP)
  - Categories: fixed number of processors, fixed c and/or t, ...
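To make the scheduling problem concrete, here is a sketch of a simple list scheduler in the spirit of ERT: repeatedly pick a ready task and place it on the processor where it can start earliest, paying c(i, j) for data that crosses processors. This is a generic illustration, not the algorithm used in the tool:

```python
# Sketch of list scheduling (ERT-style): greedily place each ready task on
# the processor giving the earliest start time, accounting for communication.
def list_schedule(tasks, succ, t, c, num_procs):
    pred = {v: [] for v in tasks}
    for i in succ:
        for j in succ[i]:
            pred[j].append(i)
    proc_free = [0.0] * num_procs   # when each processor is next idle
    placed = {}                     # v -> (processor, start, finish)
    remaining = set(tasks)
    while remaining:
        ready = [v for v in remaining if all(p in placed for p in pred[v])]
        v = sorted(ready)[0]        # simple fixed-order priority
        best = None
        for p in range(num_procs):
            # data from a predecessor on another processor pays c(i, v)
            data_ready = max([0.0] + [
                placed[i][2] + (0 if placed[i][0] == p else c[(i, v)])
                for i in pred[v]])
            start = max(proc_free[p], data_ready)
            if best is None or start < best[1]:
                best = (p, start)
        p, start = best
        placed[v] = (p, start, start + t[v])
        proc_free[p] = start + t[v]
        remaining.remove(v)
    return placed

# Fork-join example with t = 5 and c = 10: communication is so expensive
# that the whole graph ends up serialized on one processor.
tasks = ["a", "b", "c", "d"]
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
t = {v: 5 for v in tasks}
c = {e: 10 for e in [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]}
sched = list_schedule(tasks, succ, t, c, 2)
```

Note how the cheap tasks and expensive edges drive everything onto a single processor: this is exactly the fine-granularity problem discussed on the next slide.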
12. Granularity
- Granularity: g = min(t(v)) / max(c(i, j))
- Affects the scheduling result
  - E.g. TDS works best for high values of g, i.e. low communication cost
- Solutions:
  - Clustering algorithms
    - Idea: build clusters of nodes, where nodes in the same cluster are executed on the same processor
  - Merging algorithms
    - Merge tasks to increase computational cost
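The granularity measure above is a one-liner; a sketch:

```python
# Granularity of a task graph: g = min(t(v)) / max(c(i, j)).
# g < 1 means communication dominates computation (a fine grained graph),
# which is exactly the regime where algorithms such as TDS degrade.
def granularity(t, c):
    return min(t.values()) / max(c.values())

# Example: cheapest task costs 5, most expensive edge costs 20.
t = {"a": 5, "b": 5, "c": 10}
c = {("a", "c"): 10, ("b", "c"): 20}
g = granularity(t, c)   # 5 / 20 = 0.25
```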
13. Task Clustering/Merging Algorithms
- Task Clustering Problem
  - Build clusters of nodes such that the parallel time decreases
  - PT(n) = tlevel(n) + blevel(n)
  - Zeroing edges, i.e. putting several nodes into the same cluster, gives zero communication cost
  - Literature: Sarkar's internalization algorithm, Yang's DSC algorithm
- Task Merging Problem
  - Transform the task graph by merging nodes
  - Literature: e.g. the Grain Packing algorithm
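The tlevel/blevel quantities used in PT(n) can be computed by two passes over a topological order. The following sketch uses the standard definitions (tlevel excludes the node's own cost, blevel includes it); the function name is my own:

```python
# tlevel(n): longest path from an entry node to n (excluding t(n));
# blevel(n): longest path from n to an exit node (including t(n));
# parallel time PT = max over n of tlevel(n) + blevel(n).
def levels(tasks, succ, t, c):
    pred = {v: [] for v in tasks}
    for i in succ:
        for j in succ[i]:
            pred[j].append(i)
    order, seen = [], set()
    def visit(v):               # depth-first post-order
        if v in seen:
            return
        seen.add(v)
        for w in succ.get(v, []):
            visit(w)
        order.append(v)
    for v in tasks:
        visit(v)
    tlevel, blevel = {}, {}
    for v in reversed(order):   # topological order
        tlevel[v] = max([0] + [tlevel[i] + t[i] + c[(i, v)] for i in pred[v]])
    for v in order:             # reverse topological order
        blevel[v] = t[v] + max([0] + [c[(v, j)] + blevel[j]
                                      for j in succ.get(v, [])])
    pt = max(tlevel[v] + blevel[v] for v in tasks)
    return tlevel, blevel, pt

# Two-node chain a -> b with t = 5 and c = 10: PT = 5 + 10 + 5 = 20.
tl, bl, pt = levels(["a", "b"], {"a": ["b"]},
                    {"a": 5, "b": 5}, {("a", "b"): 10})
```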
14. Clustering vs. Merging
[Figure: the example task graph after clustering, with intra-cluster edge costs zeroed (left: clustered task graph), and the same graph with each cluster collapsed into a single node (right: merged task graph)]
15. DSC Algorithm
- Initially, put each node in a separate cluster
- Traverse the task graph
  - Merge clusters as long as the parallel time does not increase
- Low complexity: O((n + e) log n)
- Previously used by Andersson in ObjectMath (PELAB)
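The core accept/reject step of DSC can be sketched as follows. This is a deliberate simplification: parallel time is recomputed from scratch as a longest path with intra-cluster edges zeroed (ignoring serialization within a cluster), whereas the real DSC maintains it incrementally to reach the O((n + e) log n) bound. All names are illustrative:

```python
# Simplified PT: longest weighted path, where edges inside one cluster
# cost nothing. (Ignores intra-cluster serialization for brevity.)
def parallel_time(tasks, succ, t, c, cluster):
    memo = {}
    def blevel(v):
        if v not in memo:
            memo[v] = t[v] + max([0] + [
                (0 if cluster[v] == cluster[j] else c[(v, j)]) + blevel(j)
                for j in succ.get(v, [])])
        return memo[v]
    return max(blevel(v) for v in tasks)

# DSC-like sketch: start with one cluster per node, then accept an edge
# zeroing (merging its two clusters) only if PT does not increase.
def dsc_sketch(tasks, succ, t, c):
    cluster = {v: v for v in tasks}
    pt = parallel_time(tasks, succ, t, c, cluster)
    for (i, j) in sorted(c):             # examine each edge once
        trial = dict(cluster)
        target, src = trial[i], trial[j]
        for v in trial:                  # merge j's cluster into i's
            if trial[v] == src:
                trial[v] = target
        new_pt = parallel_time(tasks, succ, t, c, trial)
        if new_pt <= pt:                 # keep merges that do not hurt PT
            cluster, pt = trial, new_pt
    return cluster, pt

# Chain a -> b with t = 5, c = 10: zeroing the edge drops PT from 20 to 10.
cluster, pt = dsc_sketch(["a", "b"], {"a": ["b"]},
                         {"a": 5, "b": 5}, {("a", "b"): 10})
```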
16. Modelica Compilation
[Diagram: Modelica model (.mo) -> Modelica semantics -> flat Modelica (.mof) -> equation system (DAE) -> optimization -> rhs calculations -> C code, linked with a numerical solver]
- Structure of the simulation code:

  for (t = 0; t < stopTime; t += stepSize) {
    x_dot[t+1] = f(x_dot[t], x[t], t);
    x[t+1]     = ODESolver(x_dot[t+1]);
  }
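The simulation loop above can be sketched with an explicit Euler step standing in for the ODE solver. Here `f` represents the generated rhs code; the function and its signature are illustrative, not the actual solver interface:

```python
# Sketch of the simulation loop for x' = f(x, t), with explicit Euler
# standing in for the ODESolver call in the generated code.
def simulate(f, x0, t0, stop_time, step):
    t, x = t0, x0
    while t < stop_time:
        x_dot = f(x, t)           # rhs calculation (the generated C code)
        x = x + step * x_dot      # ODE solver step (explicit Euler here)
        t += step
    return x

# Integrating x' = 1 from 0 to 1 gives x(1) = 1.
x_final = simulate(lambda x, t: 1.0, 0.0, 0.0, 1.0, 0.25)
```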
17. Optimizations on Equations
- Simplification of equations
  - E.g. a = b, b = c: eliminate b
- BLT transformation, i.e. topological sorting into strongly connected components (BLT = Block Lower Triangular form)
- Index reduction; the index is how many times an equation needs to be differentiated in order to solve the equation system
- Mixed Mode / Inline Integration: methods of optimizing equations by reducing the size of equation systems
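The BLT transformation amounts to finding strongly connected components of the equation/variable dependency graph and ordering the resulting blocks topologically; each block must be solved together. A plain Kosaraju-style sketch (not the compiler's own code):

```python
# SCC-based BLT sketch: returns the strongly connected components of a
# dependency graph (node -> successor list) in topological order.
def scc_blocks(graph):
    order, seen = [], set()
    def dfs(v, g, out):                  # iterative DFS, post-order output
        stack = [(v, iter(g.get(v, [])))]
        seen.add(v)
        while stack:
            node, it = stack[-1]
            advanced = False
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(g.get(w, []))))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(node)
    for v in graph:                      # first pass: finishing order
        if v not in seen:
            dfs(v, graph, order)
    rev = {v: [] for v in graph}         # reversed graph
    for v in graph:
        for w in graph[v]:
            rev[w].append(v)
    seen.clear()
    blocks = []
    for v in reversed(order):            # second pass on reversed graph
        if v not in seen:
            comp = []
            dfs(v, rev, comp)
            blocks.append(sorted(comp))
    return blocks

# Equations 1 and 2 depend on each other (one block); equation 3 follows.
blocks = scc_blocks({1: [2], 2: [1, 3], 3: []})
```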
18. Generated C Code Content
- Assignment statements
- Arithmetic expressions (+, -, *, /), if-expressions
- Function calls
  - Standard math functions (sin, cos, log)
  - Modelica functions: user defined, side effect free
  - External Modelica functions: in an external library, written in Fortran or C
  - Calls to functions for solving subsystems of equations (linear or non-linear)
- Example application: a robot simulation has 27 000 lines of generated C code
19. Parallelization Tool Overview
[Diagram: Model (.mo) -> Modelica Compiler -> C code -> C compiler (with solver lib) -> sequential executable; the C code also feeds the Parallelizer -> parallel C code -> C compiler (with solver lib and MPI lib) -> parallel executable]
20. Parallelization Tool Internal Structure
[Diagram: sequential C code -> Parser -> Task Graph Builder -> Scheduler -> Code Generator -> parallel C code; a Symbol Table is shared by the stages, and the Scheduler also emits debug statistics]
21. Task Graph Building
- The first graph corresponds to individual arithmetic operations, assignments, function calls and variable definitions in the C code
- The second graph consists of clusters of tasks from the first task graph
- [Figure: example task graph at the expression level, with nodes for variable definitions (a, b, c, d), arithmetic operations (-, /) and a call to foo]
22. Investigated Scheduling Algorithms
- In the parallelization tool:
  - TDS (Task Duplication Scheduling algorithm)
  - Pre-clustering method
  - Full task duplication method
- In the experimental framework (Mathematica):
  - ERT
  - DSC
  - TDS
  - Full task duplication method
  - Task merging approaches (graph rewrite systems)
23. Method 1: Pre-Clustering Algorithm
- buildCluster(n: node, l: list of nodes, size: Integer)
  - Adds n to a new cluster
  - Repeatedly adds nodes until size(cluster) = size, in order of preference:
    - children of n (children with in-degree one go directly into the cluster)
    - siblings of n
    - parents of n
    - arbitrary nodes
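The growth step can be sketched as below. This is a simplified reading of the slide: candidates are tried in preference order (in-degree-one children first, then siblings, parents, and arbitrary nodes) until the size budget is met. All names are illustrative:

```python
# Sketch of the pre-clustering step: grow a cluster around a seed node n
# until a size budget is reached, preferring "cheap" neighbors first.
def build_cluster(n, nodes, succ, pred, size):
    cluster = [n]
    in_degree_one_children = [
        ch for ch in succ.get(n, []) if len(pred.get(ch, [])) == 1]
    siblings = [s for p in pred.get(n, [])
                for s in succ.get(p, []) if s != n]
    candidates = (in_degree_one_children
                  + siblings
                  + pred.get(n, [])                 # parents of n
                  + [v for v in nodes if v != n])   # arbitrary nodes
    for v in candidates:
        if len(cluster) >= size:
            break
        if v not in cluster:
            cluster.append(v)
    return cluster

# Fork-join graph: seed "a" picks up its in-degree-one children b and c.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
pred = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
cluster = build_cluster("a", ["a", "b", "c", "d"], succ, pred, 3)
```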
24. Managing Cycles
- When adding a node to a cluster, the resulting graph might have cycles
- [Figure: clustering a and b is illegal, since c is reachable from a and b is reachable from c, so the clustered graph is cyclic]
- The resulting graph is then not a DAG
  - Standard scheduling algorithms cannot be used
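A simple legality check for a candidate cluster is to ask whether any path leaving the cluster leads back into it; if so, contracting the cluster would create a cycle. A sketch (function name is my own):

```python
# Does merging `cluster` (a set of nodes) into one node create a cycle?
# BFS from the cluster's outside successors; reaching the cluster again
# means there is a path out of and back into the cluster.
def merge_creates_cycle(cluster, succ):
    outside = {w for v in cluster
               for w in succ.get(v, []) if w not in cluster}
    frontier, seen = list(outside), set(outside)
    while frontier:
        v = frontier.pop()
        for w in succ.get(v, []):
            if w in cluster:
                return True
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return False

# The slide's example: a -> c -> b, so clustering {a, b} is cyclic,
# while clustering {a, c} is fine.
succ = {"a": ["c"], "c": ["b"]}
```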
25. Pre-Clustering Results
- Did not produce speedup
  - Introduced far too many dependencies in the resulting task graph
  - Sequentialized the schedule
- Conclusion
  - For fine grained task graphs, such an algorithm needs task duplication to succeed
26. Method 2: Full Task Duplication
- For each node n with successor(n) = {}:
  - Put all pred(n) in one cluster
  - Repeat for all nodes in the cluster
- Rationale: if the depth of the graph is limited, task duplication is kept at a reasonable level and cluster sizes stay reasonably small
- Works well when communication cost >> execution cost
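The idea can be sketched as follows: every exit node gets its own cluster containing all of its transitive predecessors, so clusters need no communication at all, at the price of duplicating shared predecessors. The function name is illustrative:

```python
# Full-task-duplication sketch: one cluster per exit node, each holding
# the node and all of its (transitive) predecessors. Shared predecessors
# are duplicated into every cluster that needs them.
def ftd_clusters(tasks, succ):
    pred = {v: [] for v in tasks}
    for i in succ:
        for j in succ[i]:
            pred[j].append(i)
    sinks = [v for v in tasks if not succ.get(v)]
    clusters = []
    for s in sinks:
        cluster, stack = set(), [s]
        while stack:                     # walk predecessors transitively
            v = stack.pop()
            if v not in cluster:
                cluster.add(v)
                stack.extend(pred[v])
        clusters.append(cluster)
    return clusters

# Fork a -> b, a -> c: two clusters, with a duplicated into both.
clusters = ftd_clusters(["a", "b", "c"], {"a": ["b", "c"]})
```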
27. Full Task Duplication (2)
- Merging clusters:
  1. Merge clusters with a load balancing strategy, without increasing the maximum cluster size
  2. Merge the clusters with the greatest number of common nodes
  3. Repeat (2) until the required number of processors is met
28. Full Task Duplication Results
- Computed measurements
  - Execution cost of the largest cluster + communication cost
- Measured speedup
  - Executed on a PC cluster running Linux, with an SCI network interface, using SCAMPI
29. Robot Example: Computed Speedup
- Mixed Mode / Inline Integration
- [Figure: computed speedup with and without MM/II]
30. Thermofluid Pipe Executed on a PC Cluster
- Pressurewavedemo in the Thermofluid package, 50 discretization points
31. Thermofluid Pipe Executed on a PC Cluster (2)
- Pressurewavedemo in the Thermofluid package, 100 discretization points
32. Task Merging Using GRS
- Idea: a set of simple rules to transform a task graph to increase its granularity (and decrease parallel time)
- Use top level (and bottom level) as the metric
  - Parallel Time = max over n of tlevel(n) + blevel(n)
33. Rule 1
- Merge a single child with its only parent.
- Motivation: the merge does not decrease the amount of parallelism in the task graph, and the granularity can possibly increase.
- [Figure: parent p and its single child c merged into one node]
34. Rule 2
- Merge all parents of a node together with the node itself.
- Motivation: if the top level does not increase by the merge, the resulting task will increase in size, potentially increasing granularity.
- [Figure: parents p1 ... pn merged with the child c into one node]
35. Rule 3
- Duplicate a parent and merge it into each child node.
- Motivation: as long as no child's tlevel increases, duplicating p into the children reduces the number of nodes and increases granularity.
- [Figure: parent p duplicated into each of the children c1 ... cn]
36. Rule 4
- Merge siblings into a single node as long as a parameterized maximum execution cost is not exceeded.
- Motivation: this rule can be useful when several small predecessor nodes exist alongside a larger predecessor node that prevents a complete merge. It does not guarantee a decrease of PT.
- [Figure: the small sibling parents merged into one node, with the large parent left unmerged]
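The rules above share the same pattern-match-and-rewrite shape. As an illustration, here is a sketch of Rule 1 only (merge a single child into its only parent); the other rules differ only in their patterns. The function name and graph encoding are my own:

```python
# Rule 1 as a graph rewrite: a node with exactly one parent, whose parent
# has exactly one child, is merged into the parent. Applying the rule to
# a fixpoint collapses chains; the edge cost disappears with the merge.
def apply_rule1(t, succ):
    pred = {v: [] for v in t}
    for i in succ:
        for j in succ[i]:
            pred[j].append(i)
    for ch in list(t):
        ps = pred.get(ch, [])
        if len(ps) == 1 and len(succ.get(ps[0], [])) == 1:
            p = ps[0]
            t[p] += t[ch]               # merged task cost
            succ[p] = succ.pop(ch, [])  # child's out-edges move to parent
            del t[ch]
            return True                 # one rewrite applied
    return False

# Chain a -> b -> c with t = 5 each collapses into one task of cost 15.
t = {"a": 5, "b": 5, "c": 5}
succ = {"a": ["b"], "b": ["c"]}
while apply_rule1(t, succ):
    pass
```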
37. Results: Example
- Task graph from Modelica simulation code
- A small example from the mechanical domain
- About 100 nodes built at the expression level, originating from 84 equations and variables

38. Result: Task Merging Example

39. Result: Task Merging Example (continued)
40. Conclusions
- The pre-clustering approach did not work well for the fine grained task graphs produced by our parallelization tool
- The FTD method works reasonably well for some examples
- However, in general there is a need for better scheduling/clustering algorithms for fine grained task graphs
41. Conclusions (2)
- The simple delay model may not be enough
  - More advanced models require more complex scheduling and clustering algorithms
- Simulation code from equation based models is hard to extract parallelism from
  - New optimization methods on DAEs or ODEs are needed to increase parallelism
42. Conclusions: Task Merging Using GRS
- A task merging algorithm using a graph rewrite system (GRS) has been proposed
- Four rules with simple patterns => fast pattern matching
- Can easily be integrated into existing scheduling tools
- Successfully merges tasks considering
  - bandwidth and latency
  - task duplication
- Merging criterion: decrease parallel time (PT) by decreasing tlevel
- Tested on examples from simulation code
43. Future Work
- Designing and implementing better scheduling and clustering algorithms
  - Support for more advanced task graph models
  - Work better for high granularity values
- Try larger examples
- Test on different architectures
  - Shared memory machines
  - Dual processor machines
44. Future Work (2)
- Heterogeneous multiprocessor systems
  - Mixed DSP processors, RISC, CISC, etc.
- Enhancing the Modelica language with data parallelism
  - E.g. parallel loops, vector operations
  - Parallelize e.g. combined PDE and ODE problems in Modelica
  - Use e.g. ScaLAPACK for solving subsystems of linear equations; how should this be integrated into the scheduling algorithms?