Title: Dynamic Optimization
1 Dynamic Optimization
- David Kaeli
- Department of Electrical and Computer Engineering
- Northeastern University
- Boston, MA
- kaeli_at_ece.neu.edu
2 What is Dynamic Optimization?
- Allow a running binary to adapt to the underlying hardware system dynamically
- Perform optimization while not sacrificing performance
[Figure: inputs, a static source/binary, and the OS/HW platform feed a runtime dynamic optimization system that maintains a "fluid" binary.]
3 Why Dynamic versus Static?
- Allows code to adapt to
  - Changes in the microarchitecture of the underlying platform (related to binary translation)
  - Changes in program input
  - Environment dynamics (e.g., system load, system availability)
- Involves very little user interaction (optimization should be applied transparently)
- Source code is not needed
- Language independent
4 Challenges with Dynamic Optimization
- Reducing the associated overhead while maintaining transparency
- Addressing a range of workloads
- Selecting appropriate optimizations
5 Dynamic Optimization Systems
- Dynamo
  - HP Labs, PA-RISC/HP-UX
  - Runtime optimization
- Vulcan/Mojo
  - Microsoft Research, x86/IA64, Win2K
  - Desktop instrumentation, profiling, and optimization
- Jalapeno
  - IBM Research, JVM on PPC SMPs/AIX
  - Java JIT designed for research
- Latte
  - Seoul National University, Korea
  - Java JIT designed for efficient register allocation
6 Dynamo
[Figure: two execution models. Normal execution model: the application and libraries (native binary) run directly on the CPU platform. Dynamo execution model: the application and libraries run on top of Dynamo, which runs on the CPU platform.]
To the application, Dynamo looks like a software interpreter that executes the same instruction set as the underlying hardware interpreter (the CPU).
Many of these slides were provided by Evelyn Duesterwald.
7 Elements of Dynamo
- A novel performance delivery mechanism
  - Optimize the code when it executes, not when it is created
- A client-enabled performance mechanism
- Dynamic code re-layout
- Partial dynamic inlining/superblock formation
- Path-specific optimization
- Adaptive: machine- and input-specific
- Complementary to static optimization
- Transparent: requires no compiler support
8 Flow within Dynamo
[Flow diagram: the input native instruction stream enters an interpretation/profiling loop. Dynamo interprets until a taken branch, then looks up the branch target PC in the trace cache. On a hit, execution continues in the Dynamo code cache until a trace exit branch returns control to the loop. On a miss, Dynamo checks whether the target is a hot start-of-trace; if not, it keeps interpreting; if so, the trace selector captures a trace, the trace optimizer optimizes it, the trace is emitted into the code cache and linked by the trace linker, and the counter is recycled.]
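A minimal sketch of this interpret/lookup loop, in Python for readability; the helper functions, the counter placement, and the threshold are illustrative assumptions, not Dynamo's actual internals.

```python
# Minimal sketch of the interpret/lookup loop above. All helpers
# (interpret_until_taken_branch, execute_trace, build_trace) and the
# threshold are illustrative assumptions, not Dynamo's actual internals.

HOT_THRESHOLD = 50            # illustrative hotness threshold

def run(start_pc, interpret_until_taken_branch, execute_trace, build_trace):
    trace_cache = {}          # start-of-trace PC -> emitted, optimized trace
    counters = {}             # candidate start-of-trace PC -> counter

    pc = start_pc
    while pc is not None:
        # Interpret until a taken branch, then look up its target.
        target = interpret_until_taken_branch(pc)
        if target in trace_cache:
            # Hit: execute out of the code cache until a trace exit branch.
            pc = execute_trace(trace_cache[target])
            continue
        # Miss: bump the counter for this potential start-of-trace point.
        counters[target] = counters.get(target, 0) + 1
        if counters[target] >= HOT_THRESHOLD:
            # Hot start-of-trace: select, optimize, emit, and recycle the counter.
            trace_cache[target] = build_trace(target)
            counters.pop(target)
        pc = target
```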
9 Traces in Dynamo
Trace: a single-entry, join-free dynamic sequence of basic blocks.
[Figure: blocks A-E shown three ways -- as a control flow graph, in their original memory layout (including a call/return pair), and in the trace cache layout. The selected trace is laid out contiguously in the trace cache with the call partially inlined; trace exit branches go through trampolines that either connect to other traces or exit to the interpreter.]
10 Traces in Dynamo
- Traces are interprocedural forward paths
- Start-of-trace: target of a backward branch
- End-of-trace: taken backward branch
[Example CFG with blocks A through O containing a loop.]
11 paths through the loop: ABCEH, ABCEHKMO, ABCEHKNO, ABCEIKMO, ABCEIKNO, ABCFJL, ABCFJLNO, ABDFJL, ABDFJLNO, ABDGJL, ABDGJLNO
11 Traces in Dynamo: typical path profiles
- Approach
  - Profile all edge frequencies
  - Select the hot trace by following the highest-frequency branch outcome at each block
- Disadvantage
  - May construct an infeasible path, since it ignores branch correlation
- Overhead
  - Need to profile every conditional branch
[Same example CFG, blocks A through O.]
12 Traces in Dynamo: Next Executing Tail prediction
- Minimal profiling
  - Profile only start-of-trace points (block A): instrumentation points and counters are placed only at targets of backward branches
- Optimistic
  - At a hot start-of-trace, select the next executing tail (the path executed immediately after the start point becomes hot)
- Advantages
  - Very lightweight
  - Statistically likely to pick the hottest path
  - Selects only feasible paths
  - Easy to implement
[Same example CFG, blocks A through O.]
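A small sketch of Next Executing Tail (NET) selection as described above; the profiling hooks, threshold, and the assumed behavior of `interpret_block` are illustrative, not the Dynamo implementation.

```python
# Sketch of Next-Executing-Tail (NET) selection. The profiling hooks and the
# threshold are illustrative; interpret_block(pc) is assumed to return
# (next_pc, took_backward_branch) for the block starting at pc.

HOT = 50
counters = {}                         # backward-branch target -> counter

def next_executing_tail(start_pc, interpret_block, max_blocks=64):
    """Record the blocks executed immediately after a hot start-of-trace."""
    trace, pc = [], start_pc
    while len(trace) < max_blocks:
        trace.append(pc)
        pc, took_backward = interpret_block(pc)
        if took_backward:             # end-of-trace: taken backward branch
            break
    return trace

def on_backward_branch_target(pc, interpret_block):
    """The only instrumentation point: targets of backward branches."""
    counters[pc] = counters.get(pc, 0) + 1
    if counters[pc] >= HOT:
        counters.pop(pc)              # recycle the counter
        return next_executing_tail(pc, interpret_block)
    return None                       # not hot yet; keep interpreting
```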
13 Trace Selection
14 When to stop creating new traces
- Excessively high trace selection rates cause unacceptable overhead and potential thrashing in the Dynamo code cache
- We need the opportunity to amortize the cost of creating traces, so trace creation must sometimes be turned off
- Bailout is entered when the creation rate per unit time is excessively high
15 Trace Optimization
[Trace optimizer pipeline:]
- Input: list of trace blocks
- Build a lightweight intermediate representation (lite IR) with symbolic labels and an extended virtual register set
- Forward pass: optimization with integrated demand-driven analysis
- Backward pass
- Register allocation: scheduling and register allocation, retaining previous mappings
- Hand the optimized trace to the linker
16 Trace Optimization
- Are there any runtime optimization opportunities in statically optimized code?
- Limitations of static compiler optimization:
  - Cost of call-specific interprocedural optimization
  - Cost of path-specific optimization in the presence of complex control flow
  - Difficulty of predicting past indirect branches
  - Lack of access to shared libraries
  - Sub-optimal register allocation decisions
  - Register allocation for individual array elements or pointers
17 Path-specific optimizations
- Conservative optimizations (preserve precise signal delivery; memory-safe)
  - Partial procedure inlining
  - Redundant branch removal
  - Constant propagation
  - Constant folding
  - Copy propagation
- Aggressive optimizations
  - Redundant load removal
  - Runtime-disambiguated (guarded) load removal
  - Dead code elimination
  - Partially dead code sinking
  - Loop unrolling
  - Loop-invariant hoisting
- Aggressive optimizations can be made memory- and signal-safe through compiler hints or de-optimization
18 Dynamo Optimizations
- Constant propagation
  - Given an assignment x = c for some constant c
  - Replace all later uses of x with c, assuming that x will not be modified
[Example: before/after CFGs in which the constant assignment b = 3 is propagated to later uses of b.]
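A minimal constant-propagation sketch over a straight-line trace, using a toy three-address IR invented here for illustration (it is not Dynamo's IR or pass implementation).

```python
# Minimal constant propagation over a straight-line trace, using a toy
# three-address form (dest, op, arg1, arg2); arguments are variable names
# or integer literals. Illustrative only -- not Dynamo's IR or pass.

def propagate_constants(trace):
    consts = {}                              # variable -> known constant value
    out = []
    for dest, op, a1, a2 in trace:
        a1 = consts.get(a1, a1)              # replace uses with known constants
        a2 = consts.get(a2, a2)
        if op == "mov" and isinstance(a1, int):
            consts[dest] = a1                # record dest = constant
        else:
            consts.pop(dest, None)           # dest redefined by a non-constant op
        out.append((dest, op, a1, a2))
    return out

# b = 3; y = b + n  becomes  b = 3; y = 3 + n
print(propagate_constants([("b", "mov", 3, None), ("y", "add", "b", "n")]))
```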
19 Dynamo Optimizations
- Constant folding
  - Identifying that all operands of an operation are constant after macro expansion and constant propagation, and computing the result at optimization time
  - Easy for booleans; a little trickier for integers (exceptions such as divide-by-zero and overflow); for floating point this can be very tricky due to multiple FP formats
[Example: before/after CFGs in which an expression over the propagated constant b = 3 is folded.]
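An illustrative folding step in the same toy IR as the propagation sketch above, showing why the exception cases matter; the opcode set is an assumption for the example.

```python
# Illustrative folding step for integer operations in the same toy IR,
# showing why exceptions matter: a constant divide-by-zero is left alone so
# the runtime exception the original code would raise is preserved.

def fold(instr):
    dest, op, a1, a2 = instr
    if not (isinstance(a1, int) and isinstance(a2, int)):
        return instr                         # not all operands are constant
    if op == "add":
        return (dest, "mov", a1 + a2, None)
    if op == "div":
        if a2 == 0:
            return instr                     # keep the op; let it trap at runtime
        return (dest, "mov", a1 // a2, None)
    return instr                             # unhandled op: leave unchanged

print(fold(("y", "add", 3, 4)))              # -> ('y', 'mov', 7, None)
print(fold(("z", "div", 3, 0)))              # left unfolded to preserve the exception
```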
20 Dynamo Optimizations
- Partial load removal (LRE paper)
- Dead code elimination
  - A variable is dead if it is not used on any path from where it is defined to where the function exits
  - An instruction is dead if it computes only values that are not used on any executable path leading from the instruction
  - Dead code is often created by the application of other code optimizations (e.g., strength reduction, which replaces expensive ops with less expensive ops)
- Loop-invariant hoisting: moving invariant operations out of the loop body
- Fragment link-time optimizations: apply peephole optimization around links, looking for dead code to remove
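A sketch of dead-code elimination over a straight-line trace, in the same toy IR as the earlier sketches; it assumes pure arithmetic instructions, which is a simplification rather than what a production pass would accept.

```python
# Sketch of dead-code elimination over a straight-line trace via one
# backward liveness pass, in the same toy IR. It assumes pure arithmetic
# instructions (no stores or calls, which must never be deleted this way).

def eliminate_dead_code(trace, live_out):
    live = set(live_out)                     # values still needed at the trace exit
    kept = []
    for dest, op, a1, a2 in reversed(trace):
        if dest not in live:
            continue                         # result never used later: instruction is dead
        kept.append((dest, op, a1, a2))
        live.discard(dest)                   # this instruction defines dest
        for a in (a1, a2):
            if isinstance(a, str):
                live.add(a)                  # its operands become live
    return list(reversed(kept))

# t is computed but never used after the trace exits, so it is removed.
trace = [("t", "add", "a", "b"), ("y", "add", "a", 1)]
print(eliminate_dead_code(trace, live_out={"y"}))   # [('y', 'add', 'a', 1)]
```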
21 Implementation Issues
- Problem: a signal arrives while executing in the code cache
  - How can we achieve transparent signal delivery?
  - How can the original signal context be reconstructed?
- Dynamo approach: intercept all signals
  - Upon arrival of a signal at code cache location L, Dynamo first gains control
  - Save the code cache context
  - Retranslate the trace and record
    - Any changes in register mapping up to position L
    - The original code address of L
    - All context-modifying optimizations and the steps for de-optimization
  - Update the code cache context to obtain the native context
  - Load the native context and execute the original signal handler
22 Dynamic Code Cache
- Problem: how do we control the size of the dynamically recompiled code, and how do we react to phase changes?
- Adaptive flushing-based cache management scheme
  - Preemptive cache flushes
  - Fast allocation/de-allocation of traces
  - Removal of old and cold traces
  - Branch re-biasing to improve locality in the cache
- Configurable for various performance/memory-footprint trade-offs
- Default code cache size: 300 Kbytes
23 Dynamo Performance
[Performance chart: O2-compiled native binaries running under Dynamo on a PA-8000.]
24 Bailout
- Bail out if the trace selection rate exceeds a tolerable threshold
25 Bailout
- To prevent degradation, Dynamo keeps track of the current trace selection rate
- Virtual time is measured by counting the number of interpreted basic blocks elapsed while selecting N traces
- A threshold is used to judge whether a rate is high
- The trace selection rate is considered excessive if k consecutive high-rate time intervals have been encountered
- Bailout turns off trace selection and optimization; execution resumes in the original binary
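A sketch of this bailout heuristic; the parameter values and the class interface are made up for illustration and are not Dynamo's tuned settings.

```python
# Sketch of the bailout heuristic described above, with made-up parameters:
# "virtual time" is the number of interpreted basic blocks observed while
# selecting N traces; k consecutive high-rate intervals trigger bailout.

N_TRACES_PER_INTERVAL = 100       # traces per measurement interval (illustrative)
MIN_BLOCKS_PER_INTERVAL = 5000    # fewer interpreted blocks => rate too high (illustrative)
K_CONSECUTIVE = 3                 # consecutive high-rate intervals before bailing out

class BailoutMonitor:
    def __init__(self):
        self.blocks = 0           # interpreted basic blocks in the current interval
        self.traces = 0           # traces selected in the current interval
        self.high_intervals = 0   # consecutive high-rate intervals seen so far

    def on_interpreted_block(self):
        self.blocks += 1

    def on_trace_selected(self):
        """Return True when trace selection should be turned off (bailout)."""
        self.traces += 1
        if self.traces < N_TRACES_PER_INTERVAL:
            return False
        rate_too_high = self.blocks < MIN_BLOCKS_PER_INTERVAL
        self.high_intervals = self.high_intervals + 1 if rate_too_high else 0
        self.blocks = self.traces = 0         # start a new interval
        return self.high_intervals >= K_CONSECUTIVE
```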
26 Performance speedups with bailout
[Performance chart: O2-compiled native binaries running under Dynamo on a PA-8000.]
27 Memory Overhead: Dynamo text
- Total size: 273 Kbytes
- PA-RISC-dependent portion: 179 Kbytes (66%)
28 Summary of Dynamo
- Demonstrated the potential for dynamic optimization through an actual implementation
- Optimization impact tends to be program dependent
- More sophisticated bailout algorithms need to be devised
- Static compile-time hints should be used to help guide a dynamic optimization system
29 Vulcan (A. Srivastava)
- Provides both static and dynamic code modification
- Performs optimization on x86, IA64, and MSIL binaries
- Can work in the presence of multithreading and variable-length instructions (x86)
- Designed to perform modifications on a remote machine using a Distributed Component Object Model (DCOM) interface
- Can also serve as a binary translator
30 Mojo: Dynamic Optimization using Vulcan (Chaiken and Gillies)
- Targets a desktop x86/Windows 2000 environment
- Supports large, multithreaded applications that use exception handlers
- Requires no OS support
- Allows optimization across shared library boundaries
- Can be aided by information provided by a static compiler
31 Mojo Structure
[Block diagram: the Mojo Dispatcher sits between the original code (with exception handling through the NT DLL) and two caches, the Path Cache and the Basic Block Cache, which are fed by the Path Builder.]
32 Mojo Structure
1. Interrogate the Path Cache for a hit.
33 Mojo Structure
2. If hit, execute directly from the Path Cache; else interrogate the Basic Block Cache for a hit.
34 Mojo Structure
3. If hit in the Basic Block Cache, execute directly; else load the block from the original code.
35 Mojo Structure
Each time control returns to the Mojo Dispatcher, basic blocks are checked for hotness.
36 Mojo Structure
If a basic block is hot enough, Mojo turns on path building. Once a complete path has been built and optimized, it is placed in the Path Cache.
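A sketch of the dispatch order walked through on the slides above; the class, method names, and threshold are illustrative stand-ins rather than Mojo's internals.

```python
# Sketch of the Mojo dispatch order: path cache first, then the basic block
# cache, then the original code, with a hotness check each time control
# returns to the dispatcher. Names and the threshold are illustrative.

HOT = 50

class MojoDispatcher:
    def __init__(self, load_block_from_original, build_path):
        self.path_cache = {}      # start address -> optimized path
        self.bb_cache = {}        # address -> translated basic block
        self.hotness = {}         # address -> execution count
        self.load_block = load_block_from_original
        self.build_path = build_path

    def dispatch(self, addr):
        # 1-2. Path cache hit: execute the optimized path directly.
        if addr in self.path_cache:
            return self.path_cache[addr].execute()
        # 3. Basic block cache miss: translate the block from the original code.
        if addr not in self.bb_cache:
            self.bb_cache[addr] = self.load_block(addr)
        # 4. Control is back in the dispatcher: update hotness.
        self.hotness[addr] = self.hotness.get(addr, 0) + 1
        # 5. Hot enough: build and optimize a path, then place it in the path cache.
        if self.hotness[addr] >= HOT:
            self.path_cache[addr] = self.build_path(addr, self.bb_cache)
        # Execute the basic block; its rewritten branches return control here.
        return self.bb_cache[addr].execute()
```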
37 Mojo Components
- Mojo Dispatcher
  - The control point of the dynamic optimization system
  - Manages execution context using its own stack space
- Basic Block Cache
  - Holds basic blocks that have not yet become hot
  - Identifies basic block boundaries by dynamically decoding instruction bytes
  - Branches are modified to pass control to the dispatcher, along with the address of the next basic block to execute
  - Additional information is kept in the Basic Block Cache for use when constructing paths
38 Mojo Components
- Path Builder
  - Responsible for selecting, building, and optimizing hot paths
  - Maintains hotness information for basic blocks
  - Uses the same heuristic for building hot paths as Dynamo (the path executed next after a counter overflows)
  - Uses separate thresholds for back-edge targets and path-exit targets (needed to detect hot side exits when constructing a dynamic path)
  - Instructions are laid out contiguously (reordered), eliminating many taken conditional branches
39 Mojo Components: Path Builder
- Path termination: Dynamo only terminates paths on back edges; Mojo profiles both back edges and side exits, which yields longer paths
[Figure: original nested loops over blocks A, B, C, and the paths selected under Dynamo's back-edge-only profiling versus Mojo's back-edge and side-exit profiling (longer path).]
40 Exception Handling and Threads
- Mojo patches ntdll.dll
- Mojo captures the state of the machine before passing exceptions off to the dispatcher
- The dispatcher prevents the exception handler from polluting the Path Cache
- To handle multithreading, Mojo allocates a basic block cache per thread but uses a shared Path Cache
- Locking mechanisms are provided to access and update the shared Path Cache reliably
41 Mojo performance
[Performance chart; qsort, acker, and fib are recursive programs.]
42 Mojo performance: SPEC2000/SPEC95
43 Mojo Execution - Windows
44 Comments
- For simple programs with simple control flow, Mojo shows good improvement
- For larger programs with more dynamic control flow, Mojo is overwhelmed by the amount of path creation (the same problem encountered in Dynamo)
- A bailout strategy is needed, along with a better hot-path detection algorithm
- Future work is investigating how to use hints obtained during static compilation to aid dynamic optimization of the code
45 What is a JIT?
- A Just-In-Time compiler, developed to address the performance issues encountered with Java interpreters/translators
- Portability generally means lower performance; JITs attempt to bridge this gap
- JITs dynamically cache translated Java bytecodes and perform extensive optimization on the native instructions
- Given the overhead of an object-oriented programming model (frequent method calls), extensive exception checking, and the overhead of dynamic translation/compilation, the quality of the JIT must be high
46 Common JITs
- Java Development Kit JIT (Sun)
- Hotspot JIT (Sun)
- Kaffe (Transvirtual Technologies)
- Jalapeno (IBM Research)
- Latte (Seoul National University)
47 IBM Jalapeno JVM and JIT
- Designed specifically for servers
  - Shared-memory multiprocessor scalability
  - Manages a large number of concurrent threads
  - High availability
  - Rapid response and graceful degradation (an issue when garbage collection is involved)
- Mainly developed in Java (reliability?)
- Designed specifically for extensive dynamic optimization
48 The Jalapeno Adaptive Optimization System (AOS)
- Translates bytecodes directly to the native ISA
- Recompilation is performed in a separate thread from the application, and thus can be done in parallel with program execution
- The AOS has three components:
  - Runtime measurement system
  - Controller
  - Recompilation system
49 Jalapeno AOS Architecture
[Architecture diagram: the measurement subsystem collects raw data from the executing code and the hardware/VM performance monitor; organizers format this data, feeding the organizer event queue and profile data in the AOS database. The controller consumes the formatted data and the AOS database, forms instrumentation/compilation plans, and places them on the compilation queue. Compilation threads invoke the compilers (baseline, optimizing, ...) and install the new (instrumented/optimized) code into the executing code.]
50 Three Optimization Levels
- Level 0: on-the-fly optimizations performed during translation (constant propagation, constant folding, dead code detection)
- Level 1: adds to Level 0 common subexpression elimination, redundant load elimination, and aggressive inlining
- Level 2: adds to Level 1 flow-sensitive optimizations and array bounds check elimination
51 Controller model
- Decides when to recompile a method
- Decides which optimization level to use
- Measurements are used to guide the profiling strategy and select the hot methods to recompile
- An analytical model is also used to represent the costs and benefits of performing these tasks
52 When to recompile?
- Ti: expected total amount of time the program will spend executing method m at its current optimization level i
- Cj: cost of recompiling method m at optimization level j
- Tj: expected total amount of time the program will spend executing method m once optimized at level j
- For j = 0, 1, 2, choose the j that minimizes Cj + Tj
- If Cj + Tj < Ti, recompile m at level j
- Otherwise, decide not to recompile
53 When to recompile?
- To estimate Ti, we assume the program will run for a total time of Tf, and use profile data to indicate what percentage of the total execution time (Pm) is spent in method m (versus the rest of the program)
- We can then compute Ti = Tf * Pm
- This is the initial estimated execution time for method m; a new estimate is then computed based on the expected speedup of method m
- The above weight decays over time
54 How well does optimization work in Jalapeno?
55 Comments about Jalapeno
- Focused on method-granularity optimization
- Simple heuristics for predicting runtimes and benefits/costs are highly sensitive to cold vs. warm invocation of the application
- Newer work looks at method-specific optimizations that consider additional characteristics beyond the estimated runtime
56 Latte
- Addresses the inefficiencies of the stack-based Java bytecode machine by efficiently mapping stack space to a RISC register file
- Since traditional register coloring is an expensive algorithm, and allocation must be done in the same space as the runtime, this system looks at other ways to get good register allocation at a reduced cost
57 Java Translation to Native Code
- Identify control join points and subroutines in the bytecode using a depth-first search traversal
- Translate bytecodes into a control flow graph, mapping program variables to a set of pseudo-registers
- Perform traditional compiler optimizations
- Perform register allocation
- Convert the CFG into native host (SPARC) code
58 Treeregion Scheduling
- The CFG is partitioned into treeregions: single-entry, multiple-exit subgraphs shaped like trees
- Treeregions start at the beginning of the program or at join points, and end either at the end of the program or at new join points
- Liveness analysis is performed
- Individual treeregions are scheduled using a backward sweep followed by a forward sweep
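A sketch of the treeregion partitioning described above; the dict-of-successor-lists CFG encoding and the function name are assumptions made for the example, not Latte's data structures.

```python
# Sketch of treeregion partitioning: regions are rooted at the CFG entry or
# at join points (blocks with more than one predecessor) and grow along
# successor edges, stopping whenever the next block is itself a join point.

def tree_regions(cfg, entry):
    preds = {}
    for block, succs in cfg.items():
        for s in succs:
            preds.setdefault(s, []).append(block)
    joins = {b for b, p in preds.items() if len(p) > 1}

    regions, seen = [], set()
    for root in [entry] + sorted(joins - {entry}):
        region, stack = [], [root]
        while stack:
            b = stack.pop()
            if b in seen:
                continue
            seen.add(b)
            region.append(b)
            for s in cfg.get(b, []):
                if s not in joins:          # a join point starts a new region
                    stack.append(s)
        regions.append(region)
    return regions

# Diamond CFG: D has two predecessors, so it roots its own (trivial) region.
print(tree_regions({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}, "A"))
# -> [['A', 'C', 'B'], ['D']]
```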
59 How well does optimization work in Latte?
60 Comments on Latte
- Good register allocation can help improve the runtime performance of a dynamically tuned Java bytecode binary
- Optimization should target hot spots in the executable
- Latte provides very competitive performance compared with the Sun JDK and HotSpot compilation tools
61 Summary on Dynamic Optimization
- There is always a struggle to balance the costs and benefits of particular types of dynamic optimizers
- Dynamic optimizers can be workload dependent
- There is a lot of room in Java JITs to improve instruction scheduling and register allocation
- This is a rich area for future research on compilers and memory management