Title: Parallel Simulations on High-Performance Clusters
1Parallel Simulations on High-Performance Clusters
- C.D. Pham
- RESAM laboratory
- Univ. Lyon 1, France
- cpham_at_resam.univ-lyon1.fr
2Outline
- Background
- Discrete Event Simulation (DES)
- Parallel DES and the synchronization problems
- The CSAM Tool
- Architecture of the simulator kernel
- The communication network model
- Results
- On mono-processor cluster
- On multi-processor cluster
3Simulation
- To simulate is to reproduce the behavior of a physical system with a model
- Practically, computers are used to numerically simulate a logical model
- Simulations are used for performance evaluation and prediction of complex systems
- fluid dynamics, chemical reactions (continuous)
- communication network models: routing, congestion avoidance, mobile (discrete)
- Simulation is more flexible than analytical methods
4Discrete Event Simulation (DES)
- assumption that a system changes its state at
discrete points in simulation time
[Figure: time axis from 0 to 6Δt in steps of Δt (time-step), with event labels a1 to a4, d1 to d3, and S1, S3]
5DES concepts
- fundamental concepts
- system state (variables)
- state transitions (events)
- simulation time: totally ordered set of values representing time in the system being modeled
- the system state can only be modified upon reception of an event
- modeling can be
- event-oriented
- process-oriented
6Life cycle of a DES
- a DES system can be viewed as a collection of simulated objects and a sequence of event computations
- each event computation contains a time stamp indicating when that event occurs in the physical system
- each event computation may
- modify state variables
- schedule new events into the simulated future
- events are stored in a local event list
- events are processed in time stamped order
- usually, the simulation terminates when no events remain
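The life cycle above can be sketched as a small sequential event loop. This is a minimal illustration, not code from any particular simulator: the names `run_des`, `schedule`, `on_arrival` and `on_departure` are hypothetical, and the event list is assumed to be a binary heap keyed on the time stamp.

```python
import heapq
import itertools


def run_des(initial_events, handlers):
    """Process events in time stamp order until the event list is empty."""
    event_list = []              # local event list, kept as a heap
    tie = itertools.count()      # tie-breaker for events with equal time stamps
    clock = 0.0                  # current simulation time
    trace = []

    def schedule(timestamp, kind):
        # An event may only be scheduled into the simulated future.
        if timestamp < clock:
            raise ValueError("cannot schedule an event in the past")
        heapq.heappush(event_list, (timestamp, next(tie), kind))

    for timestamp, kind in initial_events:
        schedule(timestamp, kind)

    while event_list:            # no more events: termination
        clock, _, kind = heapq.heappop(event_list)
        trace.append((clock, kind))
        handlers[kind](clock, schedule)   # may modify state and schedule events
    return trace


def on_arrival(now, schedule):
    schedule(now + 2.0, "departure")      # assumed service time of 2 units


def on_departure(now, schedule):
    pass                                  # nothing further to schedule


trace = run_des([(1.0, "arrival"), (4.0, "arrival")],
                {"arrival": on_arrival, "departure": on_departure})
```

Because the smallest time stamp is always popped first and `schedule` rejects events in the past, the trace comes out in non-decreasing time stamp order.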
7A simple DES model
link model: delay = 5, send processing time = 5, receive processing time = 1
packet arrivals: P1 at 5, P2 at 12, P3 at 22
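The parameters of this model can be run through a small event-driven sketch. The slide's figure is not available here, so the interpretation is an assumption: each packet is processed by a single sender for the send processing time, travels over the link for the link delay, and is then processed by a single receiver for the receive processing time. The event names (`arrival`, `sent`, `at_receiver`, `received`) are hypothetical.

```python
import heapq

LINK_DELAY = 5
SEND_TIME = 5
RECV_TIME = 1
ARRIVALS = {"P1": 5, "P2": 12, "P3": 22}


def simulate():
    # Events are (time stamp, kind, packet) tuples held in a heap.
    events = [(t, "arrival", p) for p, t in ARRIVALS.items()]
    heapq.heapify(events)
    sender_free_at = 0
    receiver_free_at = 0
    log = []
    while events:
        now, kind, packet = heapq.heappop(events)
        log.append((now, kind, packet))
        if kind == "arrival":
            # Assumed: the sender handles one packet at a time.
            start = max(now, sender_free_at)
            sender_free_at = start + SEND_TIME
            heapq.heappush(events, (sender_free_at, "sent", packet))
        elif kind == "sent":
            heapq.heappush(events, (now + LINK_DELAY, "at_receiver", packet))
        elif kind == "at_receiver":
            # Assumed: the receiver also handles one packet at a time.
            start = max(now, receiver_free_at)
            receiver_free_at = start + RECV_TIME
            heapq.heappush(events, (receiver_free_at, "received", packet))
    return log


log = simulate()
received = {packet: t for t, kind, packet in log if kind == "received"}
```

Under these assumptions no packet has to queue, so each one is fully received 11 time units after it arrives (5 + 5 + 1).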
8Why does it work?
- events are processed in time stamp order
- an event at time t can only generate future events with timestamp greater than or equal to t (no event in the past)
- generated events are inserted into the event list, sorted by timestamp
- the event with the smallest timestamp is always processed first,
- causality constraints are implicitly maintained.
9Why change? It's so simple!
- models become larger and larger
- the simulation time is overwhelming or the simulation is just intractable
- example
- parallel programs with millions of lines of code,
- mobile networks with millions of mobile hosts,
- ATM networks with hundreds of complex switches,
- multicast model with thousands of sources,
- ever-growing Internet,
- and much more...
10Some figures to convince...
- ATM network models
- Simulation at the cell-level,
- 200 switches
- 1000 traffic sources, 50 Mbit/s
- 155 Mbit/s links,
- 1 simulation event per cell arrival.
More than 26 billion events to simulate 1 second!
30 hours if 1 event is processed in 1 µs
- simulation time increases as link speed increases,
- usually more than 1 event per cell arrival,
- how scalable is traditional simulation?
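The orders of magnitude above can be explored with a short back-of-the-envelope calculation. This sketch assumes the standard 53-byte ATM cell. The helper names are hypothetical, and the slide does not say how the 26 billion figure was derived, so it is taken as given.

```python
CELL_BITS = 53 * 8          # standard ATM cell: 53 bytes = 424 bits


def cells_per_second(bit_rate):
    """Cell rate of a source or link running at bit_rate (bit/s)."""
    return bit_rate / CELL_BITS


def wall_clock_hours(n_events, seconds_per_event):
    """Sequential run time for n_events at a fixed cost per event."""
    return n_events * seconds_per_event / 3600.0


source_rate = cells_per_second(50e6)    # about 118,000 cells/s per source
link_rate = cells_per_second(155e6)     # about 366,000 cells/s per link
hours_at_1us = wall_clock_hours(26e9, 1e-6)   # about 7.2 hours
```

At exactly 1 µs per event, 26 billion events come to about 7.2 hours. The 30-hour figure would correspond to roughly 4 µs per event (or to a larger event count), so the two numbers presumably rest on assumptions not shown on the slide. Either way, the run time grows linearly with the number of events.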
11Parallel simulation - principles
- execution of a discrete event simulation on a parallel or distributed system with several physical processors.
- the simulation model is decomposed into several sub-models that can be executed in parallel
- spatial partitioning,
- temporal partitioning,
- radically different from simple simulation replications.
12Parallel simulation - pros and cons
- pros
- reduction of the simulation time,
- increase of the model size,
- cons
- causality constraints are difficult to maintain,
- need for special mechanisms to synchronize the different processors,
- increases both the model and the simulation kernel complexity.
- challenges
- ease of use, transparency.
13Parallel simulation - example
14A simple PDES model
local event list
15Synchronization problems
- fundamental concepts
- each Logical Process (LP) can be at a different simulation time
- local causality constraints: events in each LP must be executed in time stamp order
- synchronization algorithms
- Conservative: avoids local causality violations by waiting until it's safe
- Optimistic: allows local causality violations but provisions are made to recover from them at runtime
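The conservative rule ("wait until it's safe") can be sketched as follows. This is a generic illustration in the style of the Chandy-Misra-Bryant null-message scheme, not the code of any particular simulator. It assumes FIFO input channels delivering messages with non-decreasing time stamps, and the class and method names are hypothetical.

```python
import heapq
import itertools


class LogicalProcess:
    def __init__(self, input_channels):
        # Channel clock: time stamp of the last message seen on each channel.
        self.channel_clock = {ch: 0.0 for ch in input_channels}
        self.event_list = []
        self.processed = []
        self._seq = itertools.count()   # tie-breaker for equal time stamps

    def receive(self, channel, timestamp, payload=None):
        # Assumed: each channel delivers non-decreasing time stamps.
        assert timestamp >= self.channel_clock[channel]
        self.channel_clock[channel] = timestamp
        if payload is not None:         # payload None models a null message
            heapq.heappush(self.event_list,
                           (timestamp, next(self._seq), payload))

    def safe_time(self):
        # No channel can later deliver a message below its own clock.
        return min(self.channel_clock.values())

    def process_safe_events(self):
        """Process only events that cannot be preceded by a later arrival."""
        done = []
        while self.event_list and self.event_list[0][0] <= self.safe_time():
            timestamp, _, payload = heapq.heappop(self.event_list)
            done.append((timestamp, payload))
        self.processed.extend(done)
        return done


lp = LogicalProcess(["A", "B"])
lp.receive("A", 3.0, "a1")
lp.receive("A", 7.0, "a2")
step1 = lp.process_safe_events()   # channel B is still at 0: LP must wait
lp.receive("B", 5.0, "b1")
step2 = lp.process_safe_events()   # safe up to time 5
lp.receive("B", 8.0)               # null message: only advances B's clock
step3 = lp.process_safe_events()   # safe up to time 7
```

The LP blocks in the first step even though it holds events, because channel B could still deliver something earlier. The null message in the last step carries no work but lets the LP advance.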
16CSAM (Pham, UCBL)
- CSAM: Conservative Simulator for ATM network Model
- Simulation at the cell-level
- Conservative and/or sequential
- C++ programming-style, predefined generic model of sources, switches, links
- New models can be easily created by deriving from base classes
- Configuration file that describes the topology
17CSAM - Kernel characteristics
- Exploits the lookahead of communication links, transparent to the user
- Virtual Input Channels
- reduces overhead for event manipulation,
- reduces overhead for null-messages handling.
- Cyclic event execution
- Message aggregation
- static aggregation size,
- asymmetric aggregation size on CLUMPS,
- sender-initiated,
- receiver-initiated.
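Message aggregation with a static aggregation size can be illustrated with a small sender-side buffer. This is a hypothetical sketch, not CSAM's implementation: the class `AggregatingChannel` and its `send` callback are assumptions, and a real kernel would hand each batch to the communication library instead of appending it to a list.

```python
class AggregatingChannel:
    """Buffers outgoing messages and sends them in batches of a fixed size."""

    def __init__(self, send, aggregation_size):
        self.send = send                        # callback taking one batch
        self.aggregation_size = aggregation_size
        self.buffer = []

    def put(self, timestamp, payload):
        self.buffer.append((timestamp, payload))
        # Sender-initiated: flush as soon as the static size is reached.
        if len(self.buffer) >= self.aggregation_size:
            self.flush()

    def flush(self):
        # Could also be triggered on request (receiver-initiated variant).
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()


sent_batches = []
channel = AggregatingChannel(sent_batches.append, aggregation_size=3)
for i in range(7):
    channel.put(float(i), f"cell{i}")
channel.flush()                                 # push out the remainder
batch_sizes = [len(batch) for batch in sent_batches]
```

Seven cells leave as three network messages instead of seven. The price is that buffered messages are delayed until the batch is flushed.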
18CSAM - Life cycle
19Test case: 78-switch ATM network
Distance-Vector Routing with dynamic link cost functions
Connection setup, admission control protocols
20Why is it difficult?
- Very small granularity: 1 message represents 1 cell transfer
- high level of message synchronisation
- very small computation/communication ratio
- Load imbalance between links
- large number of control messages
- partitioning and load balancing are difficult
21CSAM - Some results...
Routing protocols reconfiguration time
22CSAM - Some results...
23Parallel Simulation on High Performance Clusters
- Myrinet-based cluster of 12 Pentium Pro at 200 MHz, 64 MBytes, Linux
- Myrinet-based cluster of 4 dual Pentium Pro 450 MHz, 128 MBytes, Linux
- Myrinet board with LANai 4.1, 256 KB
- BIP, BIP-SMP, MPI/BIP, MPI/BIP-SMP communication libraries
24Speedup on a Myrinet cluster
Pentium Pro 200 MHz
More than 53 million events to simulate 0.31 s
25Speedup with CLUMPS
Dual Pentium Pro 450MHz
26Increasing the model size (CLUMPS)
Dual Pentium Pro 450MHz, 4x2 int
27Speedup on SGI/Cray Origin 2000
28Conclusions
- Parallel simulation is very sensitive to latency
- High-performance clusters are a good alternative to traditional massively parallel computers
- CLUMPS architectures are very attractive as the price of the communication cards can be cut in half