Title: The Potential of Trace-Level Parallelism in Java Programs
 1The Potential of Trace-Level Parallelism in Java 
Programs
- Borys J. Bradel 
- Tarek S. Abdelrahman 
- University of Toronto 
- Principles and Practices of Programming in Java 
- September 7th 2007
2Motivation
- Gap exists between hardware and software 
- Hardware 
- Majority of computer chips contain multiple cores 
- Athlon X2, Core 2 Duo, Power5, Cell, Niagara 
- Software 
- Writing parallel software is difficult 
- Bridging the gap may lead to better utilization 
 of hardware and therefore improved performance
3Automatic Parallelization
- Traditional compile time 
- Perform analysis at compile time 
- Divide program based on analysis 
- Limited success 
- Runtime 
- New approach to automatic parallelization is 
 needed
- Combine analysis with runtime information 
- What information to use? 
- Trace-Based 
- Our solution is to use traces
 4How successful can using traces be?
- We answer this question by simulating trace 
 execution
- monitor a program's execution 
- simulate the execution of traces in parallel 
- Measure a practical upper-bound on parallelism 
- not an accurate measurement of performance 
5Outline
- Traces 
- Execution Model 
- Simulation Platform 
- Experimental Evaluation 
- Conclusion 
6Trace Definition
- A trace is a frequently executed sequence of 
 unique basic blocks or instructions
- Identified by a trace collection system at runtime
public static int foo() {
  int a = 0;
  for (int i = 0; i < n; i++)
    a += i;
  return a;
}
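To make the definition concrete, here is a minimal sketch of how a collector might turn an observed stream of basic blocks into traces. It is not RedSpot's algorithm: the class name, the `onBlock` hook, and the `hotThreshold` parameter are all assumptions made for illustration. A candidate sequence ends when a block is about to repeat (so its blocks stay unique), and it becomes a trace once it has been seen `hotThreshold` times.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical trace collector sketch; not the algorithm used by RedSpot. */
final class TraceCollectorSketch {
    private final int hotThreshold;                        // assumed tuning parameter
    private final List<String> current = new ArrayList<>(); // blocks seen since the last cut
    private final Map<List<String>, Integer> counts = new HashMap<>();
    private final List<List<String>> traces = new ArrayList<>();

    TraceCollectorSketch(int hotThreshold) {
        this.hotThreshold = hotThreshold;
    }

    /** Assumed hook: called each time a basic block starts executing. */
    void onBlock(String blockId) {
        if (current.contains(blockId)) {
            // The block is about to repeat, so the sequence so far is a
            // candidate made of unique blocks. Count how often we see it.
            List<String> candidate = new ArrayList<>(current);
            int seen = counts.merge(candidate, 1, Integer::sum);
            if (seen == hotThreshold) {
                traces.add(candidate);                     // frequently executed: keep it
            }
            current.clear();
        }
        current.add(blockId);
    }

    /** Traces identified so far, in the order they became hot. */
    List<List<String>> traces() {
        return traces;
    }
}
```

Feeding this sketch the block stream of a loop such as the one in foo() (an entry block, a loop test, then the loop body repeating) yields the loop body as the only trace once it has repeated often enough.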
 7Benefits
- Source code is not required 
- Granularity of parallelism can vary 
- Traces simplify control flow and analysis 
- Traces are simple to identify
8Execution Model
[Figure: execution model for a Method's CFG, contrasting sequential and parallel execution]
 9Dependence Communication
Dependences limit parallelism 
[Figure: Method, highlighting the statement a += i]
 10Dependence Communication
Different types of communication
- Instruction-Instruction
- Trace-Trace
- Trace-Instruction
[Figure: the communication delay between dependent code under each communication type]
 11Requirements
- Java Virtual Machine 
- Execute bytecode 
- Interpreted or compiled 
- Trace Collection System 
- monitor control flow 
- create traces
[Figure: JVM code execution passes control flow to the TCS]
 12Parallel Identification Engine
- Records memory information 
- Keeps track of dependences 
- Ignores instructions that read and write the 
 same variable, e.g. the dependence between i++ and
 itself is ignored
- Schedules instructions 
- Instruction Window 
- Communication 
- Processor Count 
[Figure: JVM code execution; labels: control flow, instruction info, traces]
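The dependence bookkeeping described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the engine's real code: the class and method names are hypothetical, memory locations are modelled as strings, and only read-after-write dependences between different trace instances are recorded. An instruction that both reads and writes a location (such as i++) contributes no dependence for that location.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical dependence tracker sketch; names and data model are assumptions. */
final class DependenceTrackerSketch {
    // Memory location -> id of the trace instance that last wrote it.
    private final Map<String, Integer> lastWriter = new HashMap<>();
    // Recorded dependences, encoded as "producer->consumer".
    private final Set<String> edges = new TreeSet<>();

    /** Assumed hook: called once per executed instruction. */
    void onInstruction(int traceId, Set<String> reads, Set<String> writes) {
        for (String loc : reads) {
            if (writes.contains(loc)) {
                continue; // reads and writes the same variable: ignored
            }
            Integer producer = lastWriter.get(loc);
            if (producer != null && producer != traceId) {
                edges.add(producer + "->" + traceId);
            }
        }
        for (String loc : writes) {
            lastWriter.put(loc, traceId);
        }
    }

    /** Dependences recorded so far. */
    Set<String> edges() {
        return edges;
    }
}
```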
 13Scheduling
- Record trace information when traces execute sequentially 
- Schedule when instruction window is full
[Figure: timeline marking each Schedule point]
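The record-then-schedule loop can be sketched as a small buffer that hands a full window to a scheduler. Everything here is hypothetical: the class name, the `onTraceExecuted` hook, and the convention that the scheduler is a function from a window of per-trace cycle counts to the parallel cycles needed for that window.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical sketch: buffer executed traces and schedule each full window. */
final class WindowedRecorderSketch {
    private final int capacity;                              // traces per window
    private final ToIntFunction<List<Integer>> scheduler;    // window -> parallel cycles (assumed)
    private final List<Integer> window = new ArrayList<>();
    private long totalParallelCycles = 0;

    WindowedRecorderSketch(int capacity, ToIntFunction<List<Integer>> scheduler) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.scheduler = scheduler;
    }

    /** Assumed hook: called after a trace finishes executing sequentially. */
    void onTraceExecuted(int traceCycles) {
        window.add(traceCycles);
        if (window.size() == capacity) {
            flush();                                         // window is full: schedule it
        }
    }

    /** Schedules whatever is buffered, e.g. at program exit. */
    void flush() {
        if (!window.isEmpty()) {
            totalParallelCycles += scheduler.applyAsInt(new ArrayList<>(window));
            window.clear();
        }
    }

    long totalParallelCycles() {
        return totalParallelCycles;
    }
}
```

Because windows are scheduled one after another in this sketch, a smaller window exposes less parallelism to the scheduler at a time.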
 14Schedule around Dependences
4 processors, 12 traces per window
- Dependent traces are scheduled far enough apart 
 to have correct execution
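As an illustration of scheduling around dependences, here is a simple greedy list scheduler. It is deliberately simpler than a Modified Critical-Path scheduler and is not the one used in the evaluation. The assumptions: traces are given in execution order, `deps[t]` lists earlier traces that trace `t` depends on, and a fixed `commDelay` is charged only when producer and consumer run on different processors.

```java
/** Hypothetical greedy list scheduler sketch; not a Modified Critical-Path scheduler. */
final class GreedySchedulerSketch {
    /**
     * Places each trace, in order, on the processor where it can start earliest.
     * Returns the finish time (in cycles) of every trace.
     */
    static int[] schedule(int[] cycles, int[][] deps, int processors, int commDelay) {
        int n = cycles.length;
        int[] finish = new int[n];
        int[] procOf = new int[n];
        int[] procFree = new int[processors];   // time at which each processor becomes idle
        for (int t = 0; t < n; t++) {
            int bestProc = 0;
            int bestStart = Integer.MAX_VALUE;
            for (int p = 0; p < processors; p++) {
                int start = procFree[p];
                for (int d : deps[t]) {
                    // A dependent trace must wait for its producer, plus the
                    // communication delay if the producer ran elsewhere.
                    int ready = finish[d] + (procOf[d] == p ? 0 : commDelay);
                    start = Math.max(start, ready);
                }
                if (start < bestStart) {
                    bestStart = start;
                    bestProc = p;
                }
            }
            procOf[t] = bestProc;
            finish[t] = bestStart + cycles[t];
            procFree[bestProc] = finish[t];
        }
        return finish;
    }

    /** Total parallel cycles: the latest finish time over all traces. */
    static int makespan(int[] finish) {
        int m = 0;
        for (int f : finish) {
            m = Math.max(m, f);
        }
        return m;
    }
}
```

With four independent 2-cycle traces on four processors the makespan is 2 cycles; adding a dependence between two of them pushes the dependent trace later, which is the effect this slide illustrates.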
15Speedup
- Ratio of two cycle counts 
- Numerator: cycles over all scheduled traces on a 
 one-processor system
- Denominator: cycles aggregated over all scheduled 
 traces on the parallel system
- Each trace executes sequentially on one processor 
- A cycle represents the write of one memory 
 location
B1: a += i; i++   (2 cycles)
    if (i < n) goto B1
B2
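The speedup computation can be written out as a short helper. This is a sketch with hypothetical names: `traceCycles` holds the cycle count of each scheduled trace (one cycle per memory write, so block B1 above counts as 2), and `parallelCycles` is the total number of cycles the same traces take on the parallel system.

```java
/** Hypothetical helper: speedup as one-processor cycles over parallel cycles. */
final class SpeedupSketch {
    static double speedup(int[] traceCycles, int parallelCycles) {
        if (parallelCycles <= 0) {
            throw new IllegalArgumentException("parallelCycles must be positive");
        }
        long sequentialCycles = 0;
        for (int c : traceCycles) {
            sequentialCycles += c;   // on one processor the traces run back to back
        }
        return (double) sequentialCycles / parallelCycles;
    }
}
```

For example, four executions of a 2-cycle trace take 8 cycles on one processor; if the parallel schedule needs 4 cycles, the speedup is 2.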
 16Experimental Evaluation
- Jupiter (Patrick Doyle) 
- RedSpot (Borys Bradel) 
- Modified Critical-Path scheduler (Min-You Wu) 
- Benchmarks 
- Java Grande Section 3 
- SPECjvm98
17Effect of Window Size 
 18Effect of Communication Cost 
 19Effect of Communication Type 
 20Effect of Processor Count 
 21Conclusion
- How successful can using traces be? 
- Built simulator to measure parallel execution of 
 traces
- Traces have the potential to be used to 
 parallelize programs
- Some benchmarks do not scale well 
- Some benchmarks scale very well 
- Most benchmarks have at least 2x speedup on four 
 processors
- Future work: create a system that performs 
 trace-based parallelization
22Jupiter and RedSpot
Interpreter
  emulate a = 0; emulate i = 0; emulate goto B2; call RedSpot
  emulate if (i < n) goto B1; call RedSpot
  emulate a += i; emulate i++; emulate if (i < n) goto B1; call RedSpot
Trace 1
  emulate a += i; emulate i++; emulate if (i < n) goto B1; call RedSpot
 23Parallel Identification Engine
Interpreter
  emulate if (i < n) goto B1; call RedSpot; call PIE
  emulate a += i; call PIE
  emulate i++; call PIE
  emulate if (i < n) goto B1; call RedSpot; call PIE
  emulate a += i; call PIE
  emulate i++; call PIE
  emulate if (i < n) goto B1; call RedSpot; call PIE
Call PIE for each instruction and each memory access
 24Processor Count
Maximum number of processors limits performance
2 processors 
 25Scheduling Window
Can only schedule a limited number of traces at a time
4 traces per window