Automatic Parallelization of Simulation Code from Equation Based Simulation Languages
1
Automatic Parallelization of Simulation Code from
Equation Based Simulation Languages
  • Peter Aronsson
  • Industrial PhD student, PELAB, SaS, IDA
  • Linköping University, Sweden
  • Based on Licentiate presentation, CPC'03

2
Outline
  • Introduction
  • Task Graphs
  • Related work on Scheduling & Clustering
  • Parallelization Tool
  • Contributions
  • Results
  • Conclusion & Future Work

3
Introduction
  • Modelica
  • Object Oriented, Equation Based Modeling Language
  • Modelica enables modeling and simulation of large and complex multi-domain systems
  • Large need for parallel computation
  • To decrease the execution time of simulations
  • To make large models possible to simulate at all
  • To meet hard real-time demands in hardware-in-the-loop simulations

4
Examples of large complex systems in Modelica
5
Modelica Example - DCMotor
6
Modelica example
  • model DCMotor
  •   import Modelica.Electrical.Analog.Basic.*;
  •   import Modelica.Electrical.Sources.StepVoltage;
  •   Resistor R1(R=10);
  •   Inductor I1(L=0.1);
  •   EMF emf(k=5.4);
  •   Ground ground;
  •   StepVoltage step(V=10);
  •   Modelica.Mechanics.Rotational.Inertia load(J=2.25);
  • equation
  •   connect(R1.n, I1.p);
  •   connect(I1.n, emf.p);
  •   connect(emf.n, ground.p);
  •   connect(emf.flange_b, load.flange_a);
  •   connect(step.p, R1.p);
  •   connect(step.n, ground.p);
  • end DCMotor;

7
Example Flat set of Equations
  • R1.v = -R1.n.v + R1.p.v;  0 = R1.n.i + R1.p.i;  R1.i = R1.p.i;  R1.i*R1.R = R1.v
  • I1.v = -I1.n.v + I1.p.v;  0 = I1.n.i + I1.p.i;  I1.i = I1.p.i;  I1.L*I1.der(i) = I1.v
  • emf.v = -emf.n.v + emf.p.v;  0 = emf.n.i + emf.p.i;  emf.i = emf.p.i
  • emf.w = emf.flange_b.der(phi);  emf.k*emf.w = emf.v;  emf.flange_b.tau = -emf.i*emf.k
  • ground.p.v = 0;  step.v = -step.n.v + step.p.v;  0 = step.n.i + step.p.i;  step.i = step.p.i
  • step.signalSource.outPort.signal[1] = (if time < step.signalSource.p_startTime[1] then 0 else step.signalSource.p_height[1]) + step.signalSource.p_offset[1]
  • step.v = step.signalSource.outPort.signal[1];  load.flange_a.phi = load.phi;  load.flange_b.phi = load.phi
  • load.w = load.der(phi);  load.a = load.der(w);  load.a*load.J = load.flange_a.tau + load.flange_b.tau
  • R1.n.v = I1.p.v;  I1.p.i + R1.n.i = 0;  I1.n.v = emf.p.v;  emf.p.i + I1.n.i = 0
  • emf.n.v = step.n.v;  step.n.v = ground.p.v;  emf.n.i + ground.p.i + step.n.i = 0
  • emf.flange_b.phi = load.flange_a.phi;  emf.flange_b.tau + load.flange_a.tau = 0
  • step.p.v = R1.p.v;  R1.p.i + step.p.i = 0;  load.flange_b.tau = 0
  • step.signalSource.y = step.signalSource.outPort.signal

8
Plot of Simulation result
  (Figure: plot of load.flange_a.tau and load.w over the simulation interval.)

9
Task Graphs
  • Directed Acyclic Graph (DAG)
  • G = (V, E, t, c)
  • V: set of nodes, representing computational tasks
  • E: set of edges, representing communication of data between tasks
  • t(v): execution cost for node v
  • c(i,j): communication cost for edge (i,j)
  • Referred to as the delay model (macro dataflow model)
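As a concrete illustration, the delay model above can be represented with plain dictionaries. This is a hypothetical sketch: the task names, execution costs t(v), and communication costs c(i,j) below are invented for the example, not taken from the slides.

```python
# Delay-model task graph G = (V, E, t, c); all names and costs invented.
tasks = {"a": 10, "b": 5, "c": 5, "d": 10}          # t(v): execution cost per node
comm = {("a", "b"): 5, ("a", "c"): 5,
        ("b", "d"): 10, ("c", "d"): 10}             # c(i,j): cost per edge

def successors(v):
    """Nodes that receive data from v."""
    return [j for (i, j) in comm if i == v]

print(successors("a"))  # ['b', 'c']
```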

10
Small Task Graph Example
  (Figure: example task graph; node execution costs of 5 and 10, edge communication costs of 5 and 10.)
11
Task Scheduling Algorithms
  • Multiprocessor Scheduling Problem
  • For each task, assign
  • Starting time
  • Processor assignment (P1, ..., PN)
  • Goal: minimize execution time, given
  • Precedence constraints
  • Execution cost
  • Communication cost
  • Algorithms in literature
  • List Scheduling approaches (ERT, FLB)
  • Critical Path scheduling approaches (TDS, MCP)
  • Categories: fixed no. of processors, fixed c and/or t, ...

12
Granularity
  • Granularity g = min(t(v)) / max(c(i,j))
  • Affects scheduling result
  • E.g. TDS works best for high values of g, i.e. low communication cost
  • Solutions
  • Clustering algorithms
  • IDEA: build clusters of nodes, where nodes in the same cluster are executed on the same processor
  • Merging algorithms
  • Merge tasks to increase computational cost
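The granularity formula above is a one-liner in practice. A minimal sketch, with illustrative costs (not from the slides): the cheapest task costs 5, the most expensive edge costs 10, so g = 0.5, i.e. communication dominates the cheapest task and the graph counts as fine-grained.

```python
# Granularity of a task graph: g = min node cost / max edge cost.
# Costs below are illustrative.
t = {"a": 10, "b": 5, "c": 5, "d": 10}               # t(v)
c = {("a", "b"): 5, ("b", "d"): 10, ("c", "d"): 10}  # c(i,j)

g = min(t.values()) / max(c.values())
print(g)  # 0.5 -> fine-grained: communication cost exceeds the cheapest task
```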

13
Task Clustering/Merging Algorithms
  • Task Clustering Problem
  • Build clusters of nodes such that parallel time decreases
  • PT(n) = tlevel(n) + blevel(n)
  • By zeroing edges, i.e. putting several nodes into the same cluster => zero communication cost
  • Literature
  • Sarkar's Internalization alg., Yang's DSC alg.
  • Task Merging Problem
  • Transform the Task Graph by merging nodes
  • Literature: e.g. the Grain Packing alg.
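The tlevel/blevel definitions behind PT(n) = tlevel(n) + blevel(n) can be sketched recursively. This is a hedged illustration on an invented four-node graph: tlevel(n) is the longest path from an entry node to n (excluding t(n)), blevel(n) the longest path from n to an exit node (including t(n)), both counting edge costs.

```python
# Illustrative graph; names and costs invented.
t = {"a": 10, "b": 5, "c": 5, "d": 10}
c = {("a", "b"): 5, ("a", "c"): 5, ("b", "d"): 10, ("c", "d"): 10}
preds = lambda n: [i for (i, j) in c if j == n]
succs = lambda n: [j for (i, j) in c if i == n]

def tlevel(n):
    # Longest entry-to-n path, excluding n's own cost.
    return max((tlevel(p) + t[p] + c[(p, n)] for p in preds(n)), default=0)

def blevel(n):
    # Longest n-to-exit path, including n's own cost.
    return t[n] + max((c[(n, s)] + blevel(s) for s in succs(n)), default=0)

PT = max(tlevel(n) + blevel(n) for n in t)
print(PT)  # 40: the critical path a -> b -> d (or a -> c -> d)
```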

14
Clustering vs. Merging
  (Figure: the same task graph clustered versus merged. In the clustered graph, edges inside a cluster get zero communication cost; in the merged graph, each cluster becomes a single node with the summed execution cost.)
Clustered Task Graph
Merged Task Graph
15
DSC algorithm
  • Initially, put each node in a separate cluster
  • Traverse the Task Graph
  • Merge clusters as long as Parallel Time does not increase
  • Low complexity: O((n + e) log n)
  • Previously used by Andersson in ObjectMath (PELAB)
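The core idea of the algorithm (greatly simplified; this sketch is not full DSC, which maintains free lists and accounts for serialization of tasks placed in the same cluster) is to zero edges whenever that does not increase the parallel time. Graph names and costs are invented:

```python
# Simplified DSC-style clustering sketch: zeroing an edge models putting
# its endpoints in the same cluster. Intra-cluster serialization ignored.
t = {"a": 10, "b": 5, "c": 5, "d": 10}
edges = {("a", "b"): 5, ("a", "c"): 5, ("b", "d"): 10, ("c", "d"): 10}

def parallel_time(c):
    preds = lambda n: [i for (i, j) in c if j == n]
    def top(n):  # completion time of n along its longest incoming path
        return t[n] + max((top(p) + c[(p, n)] for p in preds(n)), default=0)
    return max(top(n) for n in t)

cost = dict(edges)
for e in edges:
    trial = dict(cost)
    trial[e] = 0                                  # tentatively merge clusters
    if parallel_time(trial) <= parallel_time(cost):
        cost = trial                              # keep merge: PT did not grow
print(parallel_time(cost))
```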

16
Modelica Compilation
  (Figure: compilation pipeline — Modelica model (.mo) -> [Modelica semantics] -> Flat Modelica (.mof) -> Equation system (DAE) -> Opt. -> C code with rhs calculations, linked with a numerical solver.)
Structure of the simulation code:
  for (t = t0; t < stopTime; t += stepSize) {
    x_dot[t+1] = f(x_dot[t], x[t], t);
    x[t+1] = ODESolver(x_dot[t+1]);
  }
17
Optimizations on equations
  • Simplification of equations
  • E.g. a=b, b=c: eliminate => b removed
  • BLT transformation, i.e. topological sorting into strongly connected components
  • (BLT = Block Lower Triangular form)
  • Index reduction; the index is how many times an equation needs to be differentiated in order to solve the equation system
  • Mixed Mode / Inline Integration: methods of optimizing equations by reducing the size of equation systems

18
Generated C Code Content
  • Assignment statements
  • Arithmetic expressions (+, -, *, /), if-expressions
  • Function calls
  • Standard math functions
  • sin, cos, log
  • Modelica functions
  • User defined, side effect free
  • External Modelica functions
  • In an external lib, written in Fortran or C
  • Calls to functions for solving subsystems of equations
  • Linear or non-linear
  • Example application
  • Robot simulation: 27 000 lines of generated C code

19
Parallelization Tool Overview
  (Figure: tool overview — Model (.mo) -> Modelica Compiler -> C code -> Parallelizer -> Parallel C code. The sequential C code plus the Solver lib goes through a C compiler to a sequential exe; the parallel C code plus the Solver lib and MPI lib goes through a C compiler to a parallel exe.)
20
Parallelization Tool Internal Structure
  (Figure: internal structure — Sequential C code -> Parser -> Task Graph Builder -> Scheduler -> Code Generator -> Parallel C code, with a shared Symbol Table and Debug/Statistics output.)
21
Task Graph building
  • First graph: corresponds to individual arithmetic operations, assignments, function calls and variable definitions in the C code
  • Second graph: clusters of tasks from the first task graph
  • Example

  (Figure: expression-level task graph — variable definitions feeding arithmetic operator nodes (+, -, /) and a call to foo, whose results are assigned to a, b, c and d.)
22
Investigated Scheduling Algorithms
  • Parallelization Tool
  • TDS (Task Duplication Scheduling algorithm)
  • Pre-Clustering Method
  • Full Task Duplication Method
  • Experimental Framework (Mathematica)
  • ERT
  • DSC
  • TDS
  • Full Task Duplication Method
  • Task Merging approaches (Graph Rewrite Systems)

23
Method 1: Pre-Clustering algorithm
  • buildCluster(n: node, l: list of nodes, size: Integer)
  • Adds n to a new cluster
  • Repeatedly adds nodes until size(cluster) = size
  • Children of n
  • Children with in-degree one go into the cluster
  • Siblings of n
  • Parents of n
  • Arbitrary nodes
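The growth order described above can be sketched as follows. This is a hedged illustration, not the tool's implementation: the graph shape and node names are invented, and only the first three candidate categories (in-degree-one children, siblings, parents) are shown.

```python
# Pre-clustering sketch: grow a cluster from n up to `size` nodes,
# preferring in-degree-one children, then siblings, then parents.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
pred = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

def build_cluster(n, size):
    cluster = [n]
    candidates = [ch for ch in succ[n] if len(pred[ch]) == 1]    # children, in-degree 1
    candidates += [s for p in pred[n] for s in succ[p] if s != n]  # siblings
    candidates += pred[n]                                          # parents
    for v in candidates:
        if len(cluster) >= size:
            break
        if v not in cluster:
            cluster.append(v)
    return cluster

print(build_cluster("b", 3))  # ['b', 'c', 'a']
```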

24
Managing cycles
  • When adding a node to a cluster, the resulting graph might have cycles
  • The graph resulting from clustering a and b is cyclic, since a and b can also be reached via c
  • The resulting graph is not a DAG
  • Cannot use standard scheduling algorithms

  (Figure: graph with nodes a, b, c, d, e illustrating the cycle created when a and b are clustered.)
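The cycle hazard above can be checked by contracting each cluster to a single node and testing acyclicity. A minimal sketch with an invented three-node graph a -> c -> b, where clustering a and b creates a two-node cycle (a full check would use a topological sort rather than only looking for two-cycles):

```python
# Contract clusters and detect the cycle introduced by clustering {a, b}.
edges = [("a", "c"), ("c", "b")]
cluster_of = {"a": 0, "b": 0, "c": 1}   # a and b placed in the same cluster

contracted = {(cluster_of[u], cluster_of[v]) for (u, v) in edges
              if cluster_of[u] != cluster_of[v]}
has_cycle = any((v, u) in contracted for (u, v) in contracted)
print(has_cycle)  # True: both 0 -> 1 and 1 -> 0 exist in the contracted graph
```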
25
Pre Clustering Results
  • Did not produce speedup
  • Introduced far too many dependencies in the resulting task graph
  • Sequentialized the schedule
  • Conclusion
  • For fine-grained task graphs
  • Such an algorithm needs task duplication to succeed

26
Method 2: Full Task Duplication
  • For each node n with successor(n) empty
  • Put all pred(n) in one cluster
  • Repeat for all nodes in the cluster
  • Rationale: if the depth of the graph is limited, task duplication will be kept at a reasonable level and cluster sizes reasonably small
  • Works well when communication cost >> execution cost
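The steps above can be sketched as: for every exit node, gather the node and all of its transitive predecessors into one cluster, duplicating shared ancestors into every cluster that needs them. The graph below is invented for illustration:

```python
# Full-task-duplication sketch: one cluster per exit node, containing the
# node and all its ancestors; shared ancestors are duplicated per cluster.
pred = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"]}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}

def ancestors(n):
    result = set()
    for p in pred[n]:
        result |= {p} | ancestors(p)
    return result

clusters = {n: {n} | ancestors(n) for n in pred if not succ[n]}
print(clusters)  # node 'a' is duplicated into both clusters
```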

27
Full Task Duplication (2)
  • Merging clusters
  • Merge clusters with a load balancing strategy, without increasing the maximum cluster size
  • Merge the clusters with the greatest number of common nodes
  • Repeat (2) until the number-of-processors requirement is met

28
Full Task Duplication Results
  • Computed measurements
  • Execution cost of largest cluster + communication cost
  • Measured speedup
  • Executed on a PC Linux cluster with an SCI network interface, using ScaMPI

29
Robot Example Computed Speedup
  • Mixed Mode / Inline Integration
  (Figure: computed speedup for the robot example, with and without MM/II.)
30
Thermofluid pipe executed on PC Cluster
  • Pressurewavedemo in the Thermofluid package, 50 discretization points

31
Thermofluid pipe executed on PC Cluster
  • Pressurewavedemo in the Thermofluid package, 100 discretization points

32
Task Merging using GRS
  • Idea: a set of simple rules to transform a task graph to increase its granularity (and decrease Parallel Time)
  • Use top level (and bottom level) as metric
  • Parallel Time = max over nodes n of (tlevel(n) + blevel(n))

33
Rule 1
  • Merge a single child with its only parent.
  • Motivation: The merge does not decrease the amount of parallelism in the task graph, and granularity can possibly increase.

  (Figure: parent p and its single child c merged into one node.)
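Rule 1 is straightforward to express operationally. A hedged sketch on an invented two-node graph: the merged task name `"pc"` and the rewiring-free setting (no other edges touch p or c) are simplifying assumptions, not part of the rule as stated.

```python
# Rule 1 sketch: merge child c into parent p when c is p's only child and
# p is c's only parent; the merged task costs t(p) + t(c), the edge vanishes.
t = {"p": 5, "c": 7}
edges = {("p", "c"): 10}

def rule1(t, edges, p, c):
    only_child = [j for (i, j) in edges if i == p] == [c]
    only_parent = [i for (i, j) in edges if j == c] == [p]
    if only_child and only_parent:
        t = {**{k: v for k, v in t.items() if k not in (p, c)},
             p + c: t[p] + t[c]}                    # merged node, summed cost
        edges = {e: w for e, w in edges.items() if e != (p, c)}
    return t, edges

print(rule1(t, edges, "p", "c"))  # ({'pc': 12}, {})
```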
34
Rule 2
  • Merge all parents of a node together with the node itself.
  • Motivation: If the top level does not increase by the merge, the resulting task will increase in size, potentially increasing granularity.

  (Figure: parents p1, p2, ..., pn merged together with their common child c.)
35
Rule 3
  • Duplicate the parent and merge it into each child node.
  • Motivation: As long as each child's tlevel does not increase, duplicating p into the child will reduce the number of nodes and increase granularity.

  (Figure: parent p duplicated into each of its children c1, c2, ..., cn.)
36
Rule 4
  • Merge siblings into a single node as long as a parameterized maximum execution cost is not exceeded.
  • Motivation: This rule can be useful if several small predecessor nodes exist alongside a larger predecessor node that prevents a complete merge. Does not guarantee a decrease of PT.

  (Figure: small sibling parents p1, ..., pk merged into one node; the larger parents pk+1, ..., pn remain separate before the common child c.)
37
Results Example
  • Task graph from Modelica simulation code
  • Small example from the mechanical domain
  • About 100 nodes built on expression level, originating from 84 equations and variables

38
Result Task Merging example
  • B = 1, L = 1

39
Result Task Merging example
  • B = 1, L = 10
  • B = 1, L = 100

40
Conclusions
  • The Pre-Clustering approach did not work well for the fine-grained task graphs produced by our parallelization tool
  • FTD Method
  • Works reasonably well for some examples
  • However, in general
  • Better scheduling/clustering algorithms for fine-grained task graphs are needed

41
Conclusions (2)
  • A simple delay model may not be enough
  • More advanced models require more complex scheduling and clustering algorithms
  • Simulation code from equation based models
  • Is hard to extract parallelism from
  • New optimization methods on DAEs or ODEs are needed to increase parallelism

42
Conclusions Task Merging using GRS
  • A task merging algorithm using GRS has been proposed
  • Four rules with simple patterns => fast pattern matching
  • Can easily be integrated into existing scheduling tools
  • Successfully merges tasks considering
  • Bandwidth & Latency
  • Task duplication
  • Merging criterion: decrease Parallel Time (PT) by decreasing tlevel
  • Tested on examples from simulation code

43
Future Work
  • Designing and implementing better scheduling and clustering algorithms
  • Support for more advanced task graph models
  • That work better for high granularity values
  • Try larger examples
  • Test on different architectures
  • Shared memory machines
  • Dual processor machines

44
Future Work (2)
  • Heterogeneous multiprocessor systems
  • Mixed DSP processors, RISC, CISC, etc.
  • Enhancing the Modelica language with data parallelism
  • e.g. parallel loops, vector operations
  • Parallelize e.g. combined PDE and ODE problems in Modelica
  • Using e.g. ScaLAPACK for solving subsystems of linear equations. How to integrate this into scheduling algorithms?