Title: Dynamic Optimization
1 Dynamic Optimization
- David Kaeli
- Department of Electrical and Computer Engineering
- Northeastern University
- Boston, MA
- kaeli_at_ece.neu.edu
2 What is Dynamic Optimization?
- Allow a running binary to adapt to the underlying hardware system dynamically
- Perform optimization while not sacrificing performance
[Figure: inputs, a static source/binary, and the OS/HW platform feed a runtime dynamic optimization system that maintains a "fluid" binary.]
3 Why Dynamic versus Static?
- Allows code to adapt to
  - Changes in the microarchitecture of the underlying platform (related to binary translation)
  - Changes in program input
  - Environment dynamics (e.g., system load, system availability)
- Involves very little user interaction (optimization should be applied transparently)
- Source code is not needed
- Language independent
4 Challenges with Dynamic Optimization
- Reducing the associated overhead while maintaining transparency
- Addressing a range of workloads
- Selecting appropriate optimizations
5 Dynamic Optimization Systems
- Dynamo
  - HP Labs, PA-RISC/HP-UX
  - Runtime optimization
- Vulcan/Mojo
  - Microsoft Research, x86/IA64, Win2K
  - Desktop instrumentation, profiling, and optimization
- Jalapeno
  - IBM Research, JVM on PPC SMPs/AIX
  - Java JIT designed for research
- Latte
  - Seoul National University, Korea
  - Java JIT designed for efficient register allocation
6 Dynamo
[Figure: two execution models. Normal execution model: the application and libraries (native binary) run directly on the CPU platform. Dynamo execution model: the application and libraries run on top of Dynamo, which runs on the CPU platform.]
To the application, Dynamo looks like a software interpreter that executes the same instruction set as the underlying hardware interpreter (the CPU).
Many of these slides were provided by Evelyn Duesterwald.
7 Elements of Dynamo
- A novel performance delivery mechanism
  - Optimize the code when it executes, not when it is created
- A client-enabled performance mechanism
- Dynamic code re-layout
- Partial dynamic inlining/superblock formation
- Path-specific optimization
- Adaptive: machine- and input-specific
- Complementary to static optimization
- Transparent: requires no compiler support
8 Flow within Dynamo
[Flow diagram: the input native instruction stream enters an interpretation/profiling loop. Dynamo interprets until a taken branch, then looks up the branch target PC in the trace cache. On a hit, execution continues in the Dynamo code cache until a trace exit branch returns control to the loop. On a miss, Dynamo checks whether the target is a hot start-of-trace; if not, it keeps interpreting; if so, the trace selector captures a trace, the trace optimizer optimizes it, the trace is emitted into the code cache and linked by the trace linker, and the counter is recycled.]
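A minimal sketch of this interpret/lookup loop, in Python for readability; the helper functions, the counter placement, and the threshold are illustrative assumptions, not Dynamo's actual internals.

```python
# Minimal sketch of the interpret/lookup loop above. All helpers
# (interpret_until_taken_branch, execute_trace, build_trace) and the
# threshold are illustrative assumptions, not Dynamo's actual internals.

HOT_THRESHOLD = 50            # illustrative hotness threshold

def run(start_pc, interpret_until_taken_branch, execute_trace, build_trace):
    trace_cache = {}          # start-of-trace PC -> emitted, optimized trace
    counters = {}             # candidate start-of-trace PC -> counter

    pc = start_pc
    while pc is not None:
        # Interpret until a taken branch, then look up its target.
        target = interpret_until_taken_branch(pc)
        if target in trace_cache:
            # Hit: execute out of the code cache until a trace exit branch.
            pc = execute_trace(trace_cache[target])
            continue
        # Miss: bump the counter for this potential start-of-trace point.
        counters[target] = counters.get(target, 0) + 1
        if counters[target] >= HOT_THRESHOLD:
            # Hot start-of-trace: select, optimize, emit, and recycle the counter.
            trace_cache[target] = build_trace(target)
            counters.pop(target)
        pc = target
```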
9 Traces in Dynamo
Trace: a single-entry, join-free dynamic sequence of basic blocks.
[Figure: blocks A-E shown three ways -- as a control flow graph, in their original memory layout (including a call/return pair), and in the trace cache layout. The selected trace is laid out contiguously in the trace cache with the call partially inlined; trace exit branches go through trampolines that either connect to other traces or exit to the interpreter.]
10 Traces in Dynamo
- Traces are interprocedural forward paths
- Start-of-trace: target of a backward branch
- End-of-trace: taken backward branch
[Example CFG with blocks A through O containing a loop.]
11 paths through the loop: ABCEH, ABCEHKMO, ABCEHKNO, ABCEIKMO, ABCEIKNO, ABCFJL, ABCFJLNO, ABDFJL, ABDFJLNO, ABDGJL, ABDGJLNO
11 Traces in Dynamo: typical path profiles
- Approach
  - Profile all edge frequencies
  - Select the hot trace by following the highest-frequency branch outcome at each block
- Disadvantage
  - May construct an infeasible path, since it ignores branch correlation
- Overhead
  - Need to profile every conditional branch
[Same example CFG, blocks A through O.]
12 Traces in Dynamo: Next Executing Tail prediction
- Minimal profiling
  - Profile only start-of-trace points (block A): instrumentation points and counters are placed only at targets of backward branches
- Optimistic
  - At a hot start-of-trace, select the next executing tail (the path executed immediately after the start point becomes hot)
- Advantages
  - Very lightweight
  - Statistically likely to pick the hottest path
  - Selects only feasible paths
  - Easy to implement
[Same example CFG, blocks A through O.]
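A small sketch of Next Executing Tail (NET) selection as described above; the profiling hooks, threshold, and the assumed behavior of `interpret_block` are illustrative, not the Dynamo implementation.

```python
# Sketch of Next-Executing-Tail (NET) selection. The profiling hooks and the
# threshold are illustrative; interpret_block(pc) is assumed to return
# (next_pc, took_backward_branch) for the block starting at pc.

HOT = 50
counters = {}                         # backward-branch target -> counter

def next_executing_tail(start_pc, interpret_block, max_blocks=64):
    """Record the blocks executed immediately after a hot start-of-trace."""
    trace, pc = [], start_pc
    while len(trace) < max_blocks:
        trace.append(pc)
        pc, took_backward = interpret_block(pc)
        if took_backward:             # end-of-trace: taken backward branch
            break
    return trace

def on_backward_branch_target(pc, interpret_block):
    """The only instrumentation point: targets of backward branches."""
    counters[pc] = counters.get(pc, 0) + 1
    if counters[pc] >= HOT:
        counters.pop(pc)              # recycle the counter
        return next_executing_tail(pc, interpret_block)
    return None                       # not hot yet; keep interpreting
```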
13 Trace Selection
14 When to stop creating new traces
- Excessively high trace selection rates cause unacceptable overhead and potential thrashing in the Dynamo code cache
- We need the opportunity to amortize the cost of creating traces, so trace creation must sometimes be turned off
- Bailout is entered when the creation rate per unit time is excessively high
15 Trace Optimization
[Trace optimizer pipeline:]
- Input: list of trace blocks
- Build a lightweight intermediate representation (lite IR) with symbolic labels and an extended virtual register set
- Forward pass: optimization with integrated demand-driven analysis
- Backward pass
- Register allocation: scheduling and register allocation, retaining previous mappings
- Hand the optimized trace to the linker
16 Trace Optimization
- Are there any runtime optimization opportunities in statically optimized code?
- Limitations of static compiler optimization:
  - Cost of call-specific interprocedural optimization
  - Cost of path-specific optimization in the presence of complex control flow
  - Difficulty of predicting past indirect branches
  - Lack of access to shared libraries
  - Sub-optimal register allocation decisions
  - Register allocation for individual array elements or pointers
17 Path-specific optimizations
- Conservative optimizations (preserve precise signal delivery; memory-safe)
  - Partial procedure inlining
  - Redundant branch removal
  - Constant propagation
  - Constant folding
  - Copy propagation
- Aggressive optimizations
  - Redundant load removal
  - Runtime-disambiguated (guarded) load removal
  - Dead code elimination
  - Partially dead code sinking
  - Loop unrolling
  - Loop-invariant hoisting
- Aggressive optimizations can be made memory- and signal-safe through compiler hints or de-optimization
18 Dynamo Optimizations
- Constant propagation
  - Given an assignment x = c for some constant c
  - Replace all later uses of x with c, assuming that x will not be modified
[Example: before/after CFGs in which the constant assignment b = 3 is propagated to later uses of b.]
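A minimal constant-propagation sketch over a straight-line trace, using a toy three-address IR invented here for illustration (it is not Dynamo's IR or pass implementation).

```python
# Minimal constant propagation over a straight-line trace, using a toy
# three-address form (dest, op, arg1, arg2); arguments are variable names
# or integer literals. Illustrative only -- not Dynamo's IR or pass.

def propagate_constants(trace):
    consts = {}                              # variable -> known constant value
    out = []
    for dest, op, a1, a2 in trace:
        a1 = consts.get(a1, a1)              # replace uses with known constants
        a2 = consts.get(a2, a2)
        if op == "mov" and isinstance(a1, int):
            consts[dest] = a1                # record dest = constant
        else:
            consts.pop(dest, None)           # dest redefined by a non-constant op
        out.append((dest, op, a1, a2))
    return out

# b = 3; y = b + n  becomes  b = 3; y = 3 + n
print(propagate_constants([("b", "mov", 3, None), ("y", "add", "b", "n")]))
```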
19 Dynamo Optimizations
- Constant folding
  - Identifying that all operands of an operation are constant after macro expansion and constant propagation, and computing the result at optimization time
  - Easy for booleans; a little trickier for integers (exceptions such as divide-by-zero and overflow); for floating point this can be very tricky due to multiple FP formats
[Example: before/after CFGs in which an expression over the propagated constant b = 3 is folded.]
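An illustrative folding step in the same toy IR as the propagation sketch above, showing why the exception cases matter; the opcode set is an assumption for the example.

```python
# Illustrative folding step for integer operations in the same toy IR,
# showing why exceptions matter: a constant divide-by-zero is left alone so
# the runtime exception the original code would raise is preserved.

def fold(instr):
    dest, op, a1, a2 = instr
    if not (isinstance(a1, int) and isinstance(a2, int)):
        return instr                         # not all operands are constant
    if op == "add":
        return (dest, "mov", a1 + a2, None)
    if op == "div":
        if a2 == 0:
            return instr                     # keep the op; let it trap at runtime
        return (dest, "mov", a1 // a2, None)
    return instr                             # unhandled op: leave unchanged

print(fold(("y", "add", 3, 4)))              # -> ('y', 'mov', 7, None)
print(fold(("z", "div", 3, 0)))              # left unfolded to preserve the exception
```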
20 Dynamo Optimizations
- Partial load removal (LRE paper)
- Dead code elimination
  - A variable is dead if it is not used on any path from where it is defined to where the function exits
  - An instruction is dead if it computes only values that are not used on any executable path leading from the instruction
  - Dead code is often created by the application of other code optimizations (e.g., strength reduction, which replaces expensive ops with less expensive ops)
- Loop-invariant hoisting: moving invariant operations out of the loop body
- Fragment link-time optimizations: apply peephole optimization around links, looking for dead code to remove
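A sketch of dead-code elimination over a straight-line trace, in the same toy IR as the earlier sketches; it assumes pure arithmetic instructions, which is a simplification rather than what a production pass would accept.

```python
# Sketch of dead-code elimination over a straight-line trace via one
# backward liveness pass, in the same toy IR. It assumes pure arithmetic
# instructions (no stores or calls, which must never be deleted this way).

def eliminate_dead_code(trace, live_out):
    live = set(live_out)                     # values still needed at the trace exit
    kept = []
    for dest, op, a1, a2 in reversed(trace):
        if dest not in live:
            continue                         # result never used later: instruction is dead
        kept.append((dest, op, a1, a2))
        live.discard(dest)                   # this instruction defines dest
        for a in (a1, a2):
            if isinstance(a, str):
                live.add(a)                  # its operands become live
    return list(reversed(kept))

# t is computed but never used after the trace exits, so it is removed.
trace = [("t", "add", "a", "b"), ("y", "add", "a", 1)]
print(eliminate_dead_code(trace, live_out={"y"}))   # [('y', 'add', 'a', 1)]
```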
21 Implementation Issues
- Problem: a signal arrives while executing in the code cache
  - How can we achieve transparent signal delivery?
  - How can the original signal context be reconstructed?
- Dynamo approach: intercept all signals
  - Upon arrival of a signal at code cache location L, Dynamo first gains control
  - Save the code cache context
  - Retranslate the trace and record
    - Any changes in register mapping up to position L
    - The original code address of L
    - All context-modifying optimizations and the steps for de-optimization
  - Update the code cache context to obtain the native context
  - Load the native context and execute the original signal handler
22 Dynamic Code Cache
- Problem: how do we control the size of the dynamically recompiled code, and how do we react to phase changes?
- Adaptive flushing-based cache management scheme
  - Preemptive cache flushes
  - Fast allocation/de-allocation of traces
  - Removal of old and cold traces
  - Branch re-biasing to improve locality in the cache
- Configurable for various performance/memory-footprint trade-offs
- Default code cache size: 300 Kbytes
23 Dynamo Performance
[Performance chart: O2-compiled native binaries running under Dynamo on a PA-8000.]
24 Bailout
- Bail out if the trace selection rate exceeds a tolerable threshold
25 Bailout
- To prevent degradation, Dynamo keeps track of the current trace selection rate
- Virtual time is measured by counting the number of interpreted basic blocks elapsed while selecting N traces
- A threshold is used to judge whether a rate is high
- The trace selection rate is considered excessive if k consecutive high-rate time intervals have been encountered
- Bailout turns off trace selection and optimization; execution resumes in the original binary
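A sketch of this bailout heuristic; the parameter values and the class interface are made up for illustration and are not Dynamo's tuned settings.

```python
# Sketch of the bailout heuristic described above, with made-up parameters:
# "virtual time" is the number of interpreted basic blocks observed while
# selecting N traces; k consecutive high-rate intervals trigger bailout.

N_TRACES_PER_INTERVAL = 100       # traces per measurement interval (illustrative)
MIN_BLOCKS_PER_INTERVAL = 5000    # fewer interpreted blocks => rate too high (illustrative)
K_CONSECUTIVE = 3                 # consecutive high-rate intervals before bailing out

class BailoutMonitor:
    def __init__(self):
        self.blocks = 0           # interpreted basic blocks in the current interval
        self.traces = 0           # traces selected in the current interval
        self.high_intervals = 0   # consecutive high-rate intervals seen so far

    def on_interpreted_block(self):
        self.blocks += 1

    def on_trace_selected(self):
        """Return True when trace selection should be turned off (bailout)."""
        self.traces += 1
        if self.traces < N_TRACES_PER_INTERVAL:
            return False
        rate_too_high = self.blocks < MIN_BLOCKS_PER_INTERVAL
        self.high_intervals = self.high_intervals + 1 if rate_too_high else 0
        self.blocks = self.traces = 0         # start a new interval
        return self.high_intervals >= K_CONSECUTIVE
```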
26 Performance speedups with bailout
[Performance chart: O2-compiled native binaries running under Dynamo on a PA-8000.]
27 Memory Overhead: Dynamo text
- Total size: 273 Kbytes
- PA-RISC-dependent portion: 179 Kbytes (66%)
28 Summary of Dynamo
- Demonstrated the potential for dynamic optimization through an actual implementation
- Optimization impact tends to be program dependent
- More sophisticated bailout algorithms need to be devised
- Static compile-time hints should be used to help guide a dynamic optimization system
29 Vulcan (A. Srivastava)
- Provides both static and dynamic code modification
- Performs optimization on x86, IA64, and MSIL binaries
- Can work in the presence of multithreading and variable-length instructions (x86)
- Designed to perform modifications on a remote machine using a Distributed Component Object Model (DCOM) interface
- Can also serve as a binary translator
30 Mojo: Dynamic Optimization using Vulcan (Chaiken and Gillies)
- Targets a desktop x86/Windows 2000 environment
- Supports large, multithreaded applications that use exception handlers
- Requires no OS support
- Allows optimization across shared library boundaries
- Can be aided by information provided by a static compiler
31 Mojo Structure
[Block diagram: the Mojo Dispatcher sits between the original code (with exception handling through the NT DLL) and two caches, the Path Cache and the Basic Block Cache, which are fed by the Path Builder.]
32 Mojo Structure
1. Interrogate the Path Cache for a hit.
33 Mojo Structure
2. If hit, execute directly from the Path Cache; else interrogate the Basic Block Cache for a hit.
34 Mojo Structure
3. If hit in the Basic Block Cache, execute directly; else load the block from the original code.
35 Mojo Structure
Each time control returns to the Mojo Dispatcher, basic blocks are checked for hotness.
36 Mojo Structure
If a basic block is hot enough, Mojo turns on path building. Once a complete path has been built and optimized, it is placed in the Path Cache.
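A sketch of the dispatch order walked through on the slides above; the class, method names, and threshold are illustrative stand-ins rather than Mojo's internals.

```python
# Sketch of the Mojo dispatch order: path cache first, then the basic block
# cache, then the original code, with a hotness check each time control
# returns to the dispatcher. Names and the threshold are illustrative.

HOT = 50

class MojoDispatcher:
    def __init__(self, load_block_from_original, build_path):
        self.path_cache = {}      # start address -> optimized path
        self.bb_cache = {}        # address -> translated basic block
        self.hotness = {}         # address -> execution count
        self.load_block = load_block_from_original
        self.build_path = build_path

    def dispatch(self, addr):
        # 1-2. Path cache hit: execute the optimized path directly.
        if addr in self.path_cache:
            return self.path_cache[addr].execute()
        # 3. Basic block cache miss: translate the block from the original code.
        if addr not in self.bb_cache:
            self.bb_cache[addr] = self.load_block(addr)
        # 4. Control is back in the dispatcher: update hotness.
        self.hotness[addr] = self.hotness.get(addr, 0) + 1
        # 5. Hot enough: build and optimize a path, then place it in the path cache.
        if self.hotness[addr] >= HOT:
            self.path_cache[addr] = self.build_path(addr, self.bb_cache)
        # Execute the basic block; its rewritten branches return control here.
        return self.bb_cache[addr].execute()
```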
37 Mojo Components
- Mojo Dispatcher
  - The control point of the dynamic optimization system
  - Manages execution context using its own stack space
- Basic Block Cache
  - Holds basic blocks that have not yet become hot
  - Identifies basic block boundaries by dynamically decoding instruction bytes
  - Branches are modified to pass control to the dispatcher, along with the address of the next basic block to execute
  - Additional information is kept in the Basic Block Cache for use when constructing paths
38 Mojo Components
- Path Builder
  - Responsible for selecting, building, and optimizing hot paths
  - Maintains hotness information for basic blocks
  - Uses the same heuristic for building hot paths as Dynamo (the path executed next after a counter overflows)
  - Uses separate thresholds for back-edge targets and path-exit targets (needed to detect hot side exits when constructing a dynamic path)
  - Instructions are laid out contiguously (reordered), eliminating many taken conditional branches
39 Mojo Components: Path Builder
- Path termination: Dynamo only terminates paths on back edges; Mojo profiles both back edges and side exits, which yields longer paths
[Figure: original nested loops over blocks A, B, C, and the paths selected under Dynamo's back-edge-only profiling versus Mojo's back-edge and side-exit profiling (longer path).]
40 Exception Handling and Threads
- Mojo patches ntdll.dll
- Mojo captures the state of the machine before passing exceptions off to the dispatcher
- The dispatcher prevents the exception handler from polluting the Path Cache
- To handle multithreading, Mojo allocates a basic block cache per thread but uses a shared Path Cache
- Locking mechanisms are provided to access and update the shared Path Cache reliably
41 Mojo performance
[Performance chart; qsort, acker, and fib are recursive programs.]
42 Mojo performance: SPEC2000/SPEC95
43 Mojo Execution - Windows
44 Comments
- For simple programs with simple control flow, Mojo shows good improvement
- For larger programs with more dynamic control flow, Mojo is overwhelmed by the amount of path creation (the same problem encountered in Dynamo)
- A bailout strategy is needed, along with a better hot-path detection algorithm
- Future work is investigating how to use hints obtained during static compilation to aid dynamic optimization of the code
45 What is a JIT?
- A Just-In-Time compiler, developed to address the performance issues encountered with Java interpreters/translators
- Portability generally means lower performance; JITs attempt to bridge this gap
- JITs dynamically cache translated Java bytecodes and perform extensive optimization on the native instructions
- Given the overhead of an object-oriented programming model (frequent method calls), extensive exception checking, and the overhead of dynamic translation/compilation, the quality of the JIT must be high
46 Common JITs
- Java Development Kit JIT (Sun)
- Hotspot JIT (Sun)
- Kaffe (Transvirtual Technologies)
- Jalapeno (IBM Research)
- Latte (Seoul National University)
47 IBM Jalapeno JVM and JIT
- Designed specifically for servers
  - Shared-memory multiprocessor scalability
  - Manages a large number of concurrent threads
  - High availability
  - Rapid response and graceful degradation (an issue when garbage collection is involved)
- Mainly developed in Java (reliability?)
- Designed specifically for extensive dynamic optimization
48 The Jalapeno Adaptive Optimization System (AOS)
- Translates bytecodes directly to the native ISA
- Recompilation is performed in a separate thread from the application, and thus can be done in parallel with program execution
- The AOS has three components:
  - Runtime measurement system
  - Controller
  - Recompilation system
49 Jalapeno AOS Architecture
[Architecture diagram: the measurement subsystem collects raw data from the executing code and the hardware/VM performance monitor; organizers format this data, feeding the organizer event queue and profile data in the AOS database. The controller consumes the formatted data and the AOS database, forms instrumentation/compilation plans, and places them on the compilation queue. Compilation threads invoke the compilers (baseline, optimizing, ...) and install the new (instrumented/optimized) code into the executing code.]
50 Three Optimization Levels
- Level 0: on-the-fly optimizations performed during translation (constant propagation, constant folding, dead code detection)
- Level 1: adds to Level 0 common subexpression elimination, redundant load elimination, and aggressive inlining
- Level 2: adds to Level 1 flow-sensitive optimizations and array bounds check elimination
51 Controller model
- Decides when to recompile a method
- Decides which optimization level to use
- Measurements are used to guide the profiling strategy and select the hot methods to recompile
- An analytical model is also used to represent the costs and benefits of performing these tasks
52 When to recompile?
- Ti: expected total amount of time the program will spend executing method m at its current optimization level i
- Cj: cost of recompiling method m at optimization level j
- Tj: expected total amount of time the program will spend executing method m once optimized at level j
- For j = 0, 1, 2, choose the j that minimizes Cj + Tj
- If Cj + Tj < Ti, recompile m at level j
- Otherwise, decide not to recompile
53 When to recompile?
- To estimate Ti, we assume the program will run for a total time of Tf, and use profile data to indicate what percentage of the total execution time (Pm) is spent in method m (versus the rest of the program)
- We can then compute Ti = Tf * Pm
- This is the initial estimated execution time for method m; a new estimate is then computed based on the expected speedup of method m
- The above weight decays over time
54 How well does optimization work in Jalapeno?
55 Comments about Jalapeno
- Focused on method-granularity optimization
- Simple heuristics for predicting runtimes and benefits/costs are highly sensitive to cold vs. warm invocation of the application
- Newer work looks at method-specific optimizations that consider additional characteristics beyond the estimated runtime
56 Latte
- Addresses the inefficiencies of the stack-based Java bytecode machine by efficiently mapping stack space to a RISC register file
- Since traditional register coloring is an expensive algorithm, and allocation must be done in the same space as the runtime, this system looks at other ways to get good register allocation at a reduced cost
57 Java Translation to Native Code
- Identify control join points and subroutines in the bytecode using a depth-first search traversal
- Translate bytecodes into a control flow graph, mapping program variables to a set of pseudo-registers
- Perform traditional compiler optimizations
- Perform register allocation
- Convert the CFG into native host (SPARC) code
58 Treeregion Scheduling
- The CFG is partitioned into treeregions: single-entry, multiple-exit subgraphs shaped like trees
- Treeregions start at the beginning of the program or at join points, and end either at the end of the program or at new join points
- Liveness analysis is performed
- Individual treeregions are scheduled using a backward sweep followed by a forward sweep
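A sketch of the treeregion partitioning described above; the dict-of-successor-lists CFG encoding and the function name are assumptions made for the example, not Latte's data structures.

```python
# Sketch of treeregion partitioning: regions are rooted at the CFG entry or
# at join points (blocks with more than one predecessor) and grow along
# successor edges, stopping whenever the next block is itself a join point.

def tree_regions(cfg, entry):
    preds = {}
    for block, succs in cfg.items():
        for s in succs:
            preds.setdefault(s, []).append(block)
    joins = {b for b, p in preds.items() if len(p) > 1}

    regions, seen = [], set()
    for root in [entry] + sorted(joins - {entry}):
        region, stack = [], [root]
        while stack:
            b = stack.pop()
            if b in seen:
                continue
            seen.add(b)
            region.append(b)
            for s in cfg.get(b, []):
                if s not in joins:          # a join point starts a new region
                    stack.append(s)
        regions.append(region)
    return regions

# Diamond CFG: D has two predecessors, so it roots its own (trivial) region.
print(tree_regions({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}, "A"))
# -> [['A', 'C', 'B'], ['D']]
```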
59 How well does optimization work in Latte?
60 Comments on Latte
- Good register allocation can help improve the runtime performance of a dynamically tuned Java bytecode binary
- Optimization should target hot spots in the executable
- Latte provides very competitive performance compared with the Sun JDK and HotSpot compilation tools
61 Summary on Dynamic Optimization
- There is always a struggle to balance the costs and benefits of particular types of dynamic optimizers
- Dynamic optimizers can be workload dependent
- There is a lot of room in Java JITs to improve instruction scheduling and register allocation
- This is a rich area for future research on compilers and memory management