Title: Efficient Optimistic Parallel Simulations Using Reverse Computation
1. Efficient Optimistic Parallel Simulations Using Reverse Computation
- Chris Carothers
- Department of Computer Science
- Rensselaer Polytechnic Institute
- Kalyan Perumalla
- and
- Richard M. Fujimoto
- College of Computing
- Georgia Institute of Technology
2. Why Parallel/Distributed Simulation?
- Goal: speed up discrete-event simulation programs using multiple processors
- Enabling technology for:
- making intractable simulation models tractable
- turning off-line decision aids into on-line aids for time-critical situation analysis
- DPAT: a distributed simulation success story
- simulation model of the National Airspace
- developed at MITRE using Georgia Tech Time Warp (GTW)
- simulates 50,000 flights in < 1 minute, which used to take 1.5 hours
- web-based user interface
- to be used in the FAA Command Center for on-line what-if planning
- Parallel/distributed simulation has the potential to improve how what-if planning strategies are evaluated
3. How to Synchronize Distributed Simulations?
Parallel time-stepped simulation: lock-step execution
Parallel discrete-event simulation: must allow for sparse, irregular event computations
Problem: events arriving in the past
Solution: Time Warp
[Figure: two timelines of PE 1, PE 2, and PE 3 over virtual time; labels: barrier, processed event, straggler event]
4. Time Warp...
Local Control Mechanism: error detection and rollback
(1) undo state Δ's (2) cancel sent events
Global Control Mechanism: compute Global Virtual Time (GVT)
collect versions of state/events and perform I/O operations that are < GVT
[Figure: two timelines of LP 1, LP 2, and LP 3 over virtual time, with GVT marked; legend: unprocessed event, processed event, straggler event, committed event]
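The local and global control mechanisms above can be illustrated with a minimal sketch in C. It assumes a single integer of state per logical process and copy state saving; the names `LogicalProcess`, `lp_process`, `lp_rollback`, and `compute_gvt` are hypothetical and not taken from GTW. Cancelling sent events, re-executing rolled-back events, and fossil collection are omitted, and GVT is approximated as the minimum over local virtual times and the earliest in-transit timestamp.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HISTORY 64

/* One saved version of the state, taken before an event was processed. */
typedef struct {
    double timestamp;   /* virtual time of the processed event */
    long saved_state;   /* state as it was before that event ran */
} HistoryEntry;

typedef struct {
    long state;                        /* the LP's (tiny) model state */
    double lvt;                        /* local virtual time */
    HistoryEntry history[MAX_HISTORY]; /* processed, uncommitted events */
    size_t count;
} LogicalProcess;

void lp_init(LogicalProcess *lp)
{
    lp->state = 0;
    lp->lvt = 0.0;
    lp->count = 0;
}

/* Forward execution: save the old state, then apply the event. */
int lp_process(LogicalProcess *lp, double timestamp, long delta)
{
    if (lp->count == MAX_HISTORY) return -1;  /* history full */
    lp->history[lp->count].timestamp = timestamp;
    lp->history[lp->count].saved_state = lp->state;
    lp->count++;
    lp->state += delta;
    lp->lvt = timestamp;
    return 0;
}

/* Error detection: an event is a straggler if it arrives in the LP's past. */
int lp_is_straggler(const LogicalProcess *lp, double timestamp)
{
    return timestamp < lp->lvt;
}

/* Rollback: undo every processed event later than the straggler.
   Returns the number of events undone. */
size_t lp_rollback(LogicalProcess *lp, double straggler_ts)
{
    size_t undone = 0;
    while (lp->count > 0 &&
           lp->history[lp->count - 1].timestamp > straggler_ts) {
        lp->count--;
        lp->state = lp->history[lp->count].saved_state;
        undone++;
    }
    lp->lvt = (lp->count > 0) ? lp->history[lp->count - 1].timestamp : 0.0;
    return undone;
}

/* Simplified GVT: the minimum of all local virtual times and the
   earliest timestamp among messages still in transit. */
double compute_gvt(const double *lvts, size_t n, double min_in_transit)
{
    double gvt = min_in_transit;
    for (size_t i = 0; i < n; i++) {
        if (lvts[i] < gvt) gvt = lvts[i];
    }
    return gvt;
}
```

Note that every forward event pays for a state copy here, whether or not a rollback ever happens; that per-event cost is what the rest of the talk targets.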
5. Challenge: Efficient Implementation?
- Advantages
- automatically finds available parallelism
- makes development easier
- outperforms conservative schemes by a factor of N
- Disadvantages
- large memory requirements to support rollback operation
- state saving incurs high overheads for fine-grain event computations
- Time Warp is outside the performance envelope for many applications
Our solution: Reverse Computation
6. Outline...
- Reverse Computation
- Example: ATM Multiplexor
- Beneficial Application Properties
- Rules for Automation
- Reversible Random Number Generator
- Experimental Results
- Conclusions
- Future Work
7. Our Solution: Reverse Computation...
- Use Reverse Computation (RC)
- automatically generate reverse code from model source
- undo by executing reverse code
- Delivers better performance
- negligible overhead for forward computation
- significantly lower memory utilization
8. Example: ATM Multiplexor
Original, on cell arrival:
if (qlen < B) { qlen++; delays[qlen]++; } else { lost++; }
[Figure: ATM multiplexor, labeled N and B]
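A minimal sketch of how the cell-arrival handler could be instrumented and reversed is given below. It assumes the state variables `qlen`, `lost`, and a `delays` histogram as in the original handler, and it records the branch taken in a single bit, called `b1` here after the control-state variable named later in the talk. The struct and function names are hypothetical, and `B` is fixed at 100 only for illustration.

```c
#include <assert.h>

#define B 100  /* buffer capacity in cells (illustrative value) */

typedef struct {
    int qlen;           /* current queue length */
    int lost;           /* cells dropped because the buffer was full */
    int delays[B + 1];  /* delays[k]: cells that arrived to queue position k */
} MuxState;

/* Forward handler: same work as the original, plus one recorded bit. */
void cell_arrival_forward(MuxState *s, int *b1)
{
    if (s->qlen < B) {
        *b1 = 1;                /* remember which branch was taken */
        s->qlen++;
        s->delays[s->qlen]++;
    } else {
        *b1 = 0;
        s->lost++;
    }
}

/* Reverse handler: undo the constructive operations in reverse order. */
void cell_arrival_reverse(MuxState *s, int b1)
{
    if (b1 == 1) {
        s->delays[s->qlen]--;
        s->qlen--;
    } else {
        s->lost--;
    }
}
```

Only the bit `b1` has to be kept per event; the B+2 words of model state are never copied, because `++` and `--` undo each other exactly.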
9. Gains
- State size reduction
- from B+2 words to 1 word
- e.g., B = 100 => 100x reduction!
- Negligible overhead in forward computation
- overhead removed from forward computation
- moved to rollback phase
- Result
- significant increase in speed
- significant decrease in memory
- How?...
10. Beneficial Application Properties
- 1. Majority of operations are constructive
- e.g., ++, --, etc.
- 2. Size of control state < size of data state
- e.g., size of b1 < size of qlen, sent, lost, etc.
- 3. Perfectly reversible high-level operations gleaned from irreversible smaller operations
- e.g., random number generation
11. Rules for Automation...
Generation rules, and upper bounds on bit requirements, for various statement types
12. Destructive Assignment...
- Destructive assignment (DA)
- examples: x = y; x %= y
- requires all modified bytes to be saved
- Caveat
- reversing technique for DAs can degenerate to traditional incremental state saving
- Good news
- certain collections of DAs are perfectly reversible!
- queueing network models contain collections of easily/perfectly reversible DAs
- queue handling (swap, shift, tree insert/delete, ...)
- statistics collection (increment, decrement, ...)
- random number generation (reversible RNGs)
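The contrast between a lone destructive assignment and a reversible collection of them can be sketched as follows. This is an illustration under assumed names (`assign_forward`, `swap_forward`, and so on are hypothetical), not the authors' generated code: a single `x = y` needs the overwritten value saved, which is exactly incremental state saving, while a swap is built from three destructive assignments yet needs no saved data at all.

```c
#include <assert.h>

typedef struct {
    int x;
    int y;
} Vars;

/* A lone destructive assignment x = y: the old value of x is lost
   unless it is saved, so the reverse code needs one saved word. */
void assign_forward(Vars *v, int *saved_x)
{
    *saved_x = v->x;  /* degenerates to incremental state saving */
    v->x = v->y;
}

void assign_reverse(Vars *v, int saved_x)
{
    v->x = saved_x;
}

/* A swap is three destructive assignments, but as a group it is
   perfectly reversible: applying it again restores both values. */
void swap_forward(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

void swap_reverse(int *a, int *b)
{
    swap_forward(a, b);  /* swap is its own inverse; nothing was saved */
}
```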
13. Reversing an RNG?
double RNGGenVal(Generator g)
{
    long k, s;
    double u;
    u = 0.0;

    s = Cg[0][g];  k = s / 46693;
    s = 45991 * (s - k * 46693) - k * 25884;
    if (s < 0) s = s + 2147483647;
    Cg[0][g] = s;
    u = u + 4.65661287524579692e-10 * s;

    s = Cg[1][g];  k = s / 10339;
    s = 207707 * (s - k * 10339) - k * 870;
    if (s < 0) s = s + 2147483543;
    Cg[1][g] = s;
    u = u - 4.65661310075985993e-10 * s;
    if (u < 0) u = u + 1.0;

    s = Cg[2][g];  k = s / 15499;
    s = 138556 * (s - k * 15499) - k * 3979;
    if (s < 0.0) s = s + 2147483423;
    Cg[2][g] = s;
    u = u + 4.65661336096842131e-10 * s;
    if (u >= 1.0) u = u - 1.0;

    s = Cg[3][g];  k = s / 43218;
    s = 49689 * (s - k * 43218) - k * 24121;
    if (s < 0) s = s + 2147483323;
    Cg[3][g] = s;
    u = u - 4.65661357780891134e-10 * s;
    if (u < 0) u = u + 1.0;

    return (u);
}
Observation: k = s / 46693 is a destructive assignment.
Result: RC degrades to classic state saving. Can we do better?
14. RNGs: A Higher-Level View
The previous RNG is based on the following recurrence:
$x_{i,n} = a_i x_{i,n-1} \bmod m_i$
where $x_{i,n}$ is one of the four seed values in the $n$th set, $m_i$ is one of the four largest primes less than $2^{31}$, and $a_i$ is a primitive root of $m_i$.
Now, the above recurrence is in fact reversible. The inverse of $a_i$ modulo $m_i$ is defined as
$b_i = a_i^{m_i - 2} \bmod m_i$
Using $b_i$, we can generate the reverse recurrence as follows:
$x_{i,n-1} = b_i x_{i,n} \bmod m_i$
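The forward and reverse recurrences can be checked with a short sketch for one of the four component generators. It assumes 64-bit arithmetic is available (so the product of two values below 2^31 cannot overflow) and uses the multiplier 45991 and modulus 2147483647 from the first component of the generator shown earlier; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Modular multiplication; both operands are below 2^31,
   so the product fits in 64 bits. */
uint64_t mul_mod(uint64_t a, uint64_t b, uint64_t m)
{
    return (a * b) % m;
}

/* Modular exponentiation by repeated squaring. */
uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = mul_mod(result, base, m);
        base = mul_mod(base, base, m);
        exp >>= 1;
    }
    return result;
}

/* b = a^(m-2) mod m is the inverse of a modulo a prime m. */
uint64_t lcg_inverse_multiplier(uint64_t a, uint64_t m)
{
    return pow_mod(a, m - 2, m);
}

/* Forward step: x_n = a * x_{n-1} mod m. */
uint64_t lcg_forward(uint64_t x, uint64_t a, uint64_t m)
{
    return mul_mod(a, x, m);
}

/* Reverse step: x_{n-1} = b * x_n mod m. */
uint64_t lcg_reverse(uint64_t x, uint64_t b, uint64_t m)
{
    return mul_mod(b, x, m);
}
```

Under these assumptions a rollback of k random draws costs k reverse steps and no saved seed values; `b` is computed once per component.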
15. Reverse Code Efficiency...
- Future RNGs may result in even greater savings.
- Consider the MT19937 generator...
- has a period of 2^19937 - 1
- uses 2496 bytes for a single generator
- Property...
- Non-reversibility of individual steps does NOT imply that the computation as a whole is not reversible.
- Can we automatically find this higher-level reversibility?
- Other reversible structures include...
- circular shift operation
- insertion & deletion operations on trees (i.e., priority queues)
Reverse computation is well-suited for queueing network models!
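As an illustration of a reversible structure from the list above, here is a minimal sketch of a circular shift over an array, assuming a shift by one position; `rotate_left` and `rotate_right` are hypothetical names. Every element is overwritten in the forward direction, yet the opposite rotation restores the array exactly, with no saved state.

```c
#include <assert.h>
#include <stddef.h>

/* Forward: rotate left by one; the first element wraps to the end. */
void rotate_left(int *a, size_t n)
{
    if (n < 2) return;
    int first = a[0];
    for (size_t i = 0; i + 1 < n; i++) {
        a[i] = a[i + 1];
    }
    a[n - 1] = first;
}

/* Reverse: rotate right by one, which exactly undoes rotate_left. */
void rotate_right(int *a, size_t n)
{
    if (n < 2) return;
    int last = a[n - 1];
    for (size_t i = n - 1; i > 0; i--) {
        a[i] = a[i - 1];
    }
    a[0] = last;
}
```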
16. Performance Study
17. Why the large increase in parallel performance?
[Figure: performance plot; axis: million events/second]
18. Cache Performance...
Faults     TLB         P cache       S cache
SS 12pe    43966018    1283032615    162449694
RC 12pe    11595326    590555715     94771426
19. Related Work...
- Reverse computation used in
- low-power processors, debugging, garbage collection, database recovery, reliability, etc.
- All previous work either
- prohibits irreversible constructs, or
- uses a copy-on-write implementation for every modification (corresponds to incremental state saving)
- Many operate at a coarse, virtual page level
20. Contributions
- We identify that
- RC makes Time Warp usable for fine-grain models!
- disproves the previous belief that fine-grain models can't be optimistically simulated efficiently
- less memory consumption, more speed, without extra user effort
- RC generalizes state saving
- e.g., incremental state saving, copy state saving
- For certain data types, RC is more memory efficient than SS
- e.g., priority queues
21. Future Work
- Develop state minimization algorithms, by
- State compression: bit size for reversibility < bit size of data variables
- State reuse: same state bits for different statements
- based on liveness, analogous to register allocation
- Complete RC automation algorithm design: avoiding the straightforward incremental state saving approach
- Lossy integer and floating point arithmetic
- Jump statements
- Recursive functions
22. Geronimo! System Architecture
[Figure labels: High Performance Simulation Application; Geronimo; distributed compute server; rack-mounted CPUs (not in demonstration); multiprocessor]
Geronimo! features: (1) risky or speculative processing of object computations, (2) reverse computation to support the undo operation, (3) Active Code in a combined heterogeneous, shared-memory, message-passing environment...
23. Geronimo! Risky Processing...
- Execution framework
- objects
- schedule threads/tasks
- at some virtual time
- Applications
- discrete-event simulations
- scientific computing applications
CAVEAT: good performance relies on (cost of recovery * probability of failure) being less than the cost of being safe!
[Figure legend: processed thread, straggler thread, unprocessed thread]
24. Geronimo! Efficient Undo
- Traditional approach: State Saving
- save byte-copies of modified items
- high overhead for fine-granularity computations
- memory utilization is large
- need alternative for large-scale, fine-grain simulations
- Our approach: Reverse Computation
- automatically generate reverse code from model source
- utilize reverse code to do rollback
- negligible overhead for forward computation
- significantly lower memory utilization
- joint with Kalyan Perumalla and Richard Fujimoto
Observation: reverse computation treats code as state. This results in a code-state duality. Can we generalize this notion?
25. Geronimo! Active Code
- Key idea: allow object methods/code to be dynamically changed during run-time.
- objects can schedule in the future a new method or re-define old methods of other objects and themselves.
- objects can erase/delete methods on themselves or other objects.
- new methods can contain Active Code which can re-specialize itself or other objects.
- works in a heterogeneous environment.
- How is this useful?
- increases performance by allowing the program to consistently execute the common case fast.
- adaptive, perturbation-free monitoring of distributed systems.
- potential for increasing a language's expressive power.
- Our approach?
- Java? No, need higher performance; may be used in the future...
- Special compiler? No, can't keep up with changes to microprocessors.
26. Geronimo! Active Code Implementation
- Runtime infrastructure
- modifies source code tree
- starts a rebuild of the executable on another existing machine
- uses a system's native compiler
- Re-exec system call
- reloads only the new text or code segment of the new executable
- fixes up old stack to reflect new code changes
- fixes up pointers to functions
- will run in user-space for portability across platforms
- Language preprocessor
- instruments code to support stack and function pointer fix-up
- instruments code to support stack reconstruction and re-start process
27. Research Issues
- Software architecture for the heterogeneous, shared-memory, message-passing environment.
- Development of distributed algorithms that are fully optimized for this combination environment.
- What language to use for development: C, C++, or both?
- Geronimo! API.
- Active Code Language and Systems Support.
- Mapping relevant application types to this framework.
Homework problem: can you find specific applications/problems where we can apply Geronimo!?