Dynamic Optimization - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Dynamic Optimization
  • David Kaeli
  • Department of Electrical and Computer Engineering
  • Northeastern University
  • Boston, MA
  • kaeli@ece.neu.edu

2
What is Dynamic Optimization
  • Allow a running binary to adapt to the underlying
    hardware system dynamically
  • Perform optimization while not sacrificing
    performance

[Diagram: a "fluid" binary, produced from static source, receives its
input and runs under a runtime dynamic optimization system layered on
top of the OS/HW platform]
3
Why Dynamic versus Static
  • Allows code to adapt to
  • Changes in the microarchitecture of the
    underlying platform (related to binary
    translation)
  • Changes in program input
  • Environment dynamics (e.g., system load, system
    availability)
  • Involves very little user interaction
    (optimization should be applied transparently)
  • Source code is not needed
  • Language independent

4
Challenges with Dynamic Optimization
  • Reducing the associated overhead and maintaining
    transparency
  • Addressing a range of workloads
  • Selecting appropriate optimizations

5
Dynamic Optimization Systems
  • Dynamo
  • HP labs, PA-RISC/HPUX
  • Runtime optimization
  • Vulcan/Mojo
  • MS Research, X86-IA64/Win2K
  • Desktop instrumentation, profiling, and
    optimization
  • Jalapeno
  • IBM Research, JVM-PPC-SMPs/AIX
  • Java JIT designed for research
  • Latte
  • Seoul National University, Korea
  • Java JIT designed for efficient register
    allocation

6
Dynamo
[Diagram: in the normal execution model, the application and its
libraries (native binary) run directly on the CPU platform; in the
Dynamo execution model, Dynamo sits between them.]
To the application, Dynamo looks like a software interpreter that
executes the same instruction set as the underlying hardware
interpreter (the CPU).
Many of these slides were provided by Evelyn Duesterwald.
7
Elements of Dynamo
  • A novel performance delivery mechanism
  • Optimize the code when it executes, not when it
    is created
  • A client-enabled performance mechanism
  • Dynamic code re-layout
  • Partial dynamic inlining/superblock formation
  • Path-specific optimization
  • Adaptive: machine- and input-specific
  • Complementary to static optimization
  • Transparent: requires no compiler support

8
Flow within Dynamo
[Flow diagram: the input native instruction stream enters an
interpretation/profiling loop: interpret until a taken branch, then
look up the next PC in the trace cache. On a hit, execution continues
in the Dynamo code cache until an exit branch. On a miss, check
whether this PC is a hot start-of-trace; if not, keep interpreting.
If it is, the counter is recycled and the trace selector, trace
optimizer, trace emitter, and trace linker place an optimized trace
into the Dynamo code cache.]
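The loop above can be sketched in code. The following Python stand-in is illustrative only: the threshold, the single-successor program model, and all function and variable names are assumptions, not Dynamo internals.

```python
# Toy sketch of the Dynamo dispatch loop (illustrative Python).

HOT_THRESHOLD = 3  # interpretations of a start-of-trace before it is "hot"

def run(program, steps, is_trace_head):
    """Simulate the interpret/profile/translate loop.

    program: dict mapping a PC to the next PC (one taken branch per step).
    is_trace_head: PCs that qualify as start-of-trace points.
    Returns the set of PCs for which a trace was emitted into the cache.
    """
    trace_cache = {}            # start PC -> emitted trace (list of PCs)
    counters = {}               # start-of-trace PC -> execution count
    pc = next(iter(program))
    for _ in range(steps):
        if pc in trace_cache:
            # Hit: "execute" the cached trace, then follow its exit branch.
            pc = program[trace_cache[pc][-1]]
            continue
        # Miss: interpret until the next taken branch.
        next_pc = program[pc]
        if pc in is_trace_head:
            counters[pc] = counters.get(pc, 0) + 1
            if counters[pc] >= HOT_THRESHOLD:
                # Hot: grow a trace from here, emit it, recycle the counter.
                trace = [pc]
                cur = next_pc
                while cur in program and cur not in is_trace_head and cur not in trace:
                    trace.append(cur)
                    cur = program[cur]
                trace_cache[pc] = trace
                counters[pc] = 0
        pc = next_pc
    return set(trace_cache)
```

A three-block loop (0 → 1 → 2 → 0) with head 0 gets a trace emitted once block 0 has been interpreted enough times.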
9
Traces in Dynamo
Trace: a single-entry, join-free dynamic sequence
of basic blocks
[Diagram: the same blocks A-E shown three ways: as a control flow
graph, in their original memory layout (including a call to and
return from block E), and in the trace cache layout, where the hot
path is laid out contiguously, off-trace branches exit through a
trampoline to the interpreter, and the trace end connects to other
traces.]
10
Traces in Dynamo
  • Interprocedural forward path
  • start-of-trace: target of a backward branch
  • end-of-trace: taken backward branch

[Diagram: a control flow graph of blocks A-O containing a loop with
11 paths through it: ABCEH, ABCEHKMO, ABCEHKNO, ABCEIKMO, ABCEIKNO,
ABCFJL, ABCFJLNO, ABDFJL, ABDFJLNO, ABDGJL, ABDGJLNO]
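The start-of-trace and end-of-trace rules translate into a small filter over the dynamic branch stream. This Python sketch is illustrative; the (branch_pc, target_pc) event format is an assumption.

```python
def split_into_traces(taken_branches):
    """Split a dynamic stream of (branch_pc, target_pc) taken branches
    into traces: each trace starts at a backward-branch target and
    ends at the next taken backward branch (end-of-trace)."""
    traces, current = [], []
    for branch_pc, target_pc in taken_branches:
        current.append(branch_pc)
        if target_pc <= branch_pc:   # taken backward branch: end-of-trace
            traces.append(current)
            current = []             # the next target starts a new trace
    return traces
```

For example, a forward branch at PC 10 followed by a loop-closing backward branch at PC 30 yields one trace covering both branch sites.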
11
Traces in Dynamo: typical path profiles
  • Approach
  • profile all edge frequencies
  • select the hot trace by following the
    highest-frequency branch outcome at each block
  • Disadvantage
  • may select infeasible paths: edge profiles ignore
    branch correlation
  • Overhead
  • need to profile every conditional branch

[Diagram: the same control flow graph of blocks A-O]
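The edge-profile approach can be sketched as a greedy walk over counted edges (illustrative Python; the edge_counts format is an assumption). Note how the walk combines independently counted edges, which is exactly why it can stitch together an infeasible path.

```python
def select_hot_trace(head, edge_counts, max_len=16):
    """Greedy hot-trace selection from edge profiles.
    edge_counts: {(src, dst): frequency}.  From each block, follow the
    highest-frequency outgoing edge."""
    trace, node = [head], head
    while len(trace) < max_len:
        out = {dst: c for (src, dst), c in edge_counts.items() if src == node}
        if not out:
            break
        node = max(out, key=out.get)
        if node in trace:        # stop when the walk cycles back
            break
        trace.append(node)
    return trace
```

With edges A→B (10), A→C (2), B→D (7), B→E (9), the walk picks A, B, E even if the runs that took A→B mostly continued to D.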
12
Traces in Dynamo: Next Executing Tail prediction
  • Minimal profiling
  • profile only start-of-trace points (block A)
  • Optimistic
  • at a hot start-of-trace, select the next
    executing tail as the trace
  • Advantages
  • very lightweight: instrumentation points and
    counters only at targets of backward branches
  • statistically likely to pick the hottest path
  • selects only feasible paths
  • easy to implement

[Diagram: the same control flow graph of blocks A-O]
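NET selection can be sketched as follows (illustrative Python; the block-stream model, threshold, and all names are assumptions). Only trace heads carry counters, and the selected trace is literally the path executed right after the head becomes hot.

```python
def net_select(block_stream, heads, threshold=2):
    """Next-Executing-Tail trace selection sketch.
    block_stream: dynamic sequence of basic-block ids.
    heads: blocks that are backward-branch targets (the only
    instrumentation points).  Returns {head: selected trace}."""
    counters, traces = {}, {}
    recording, current = None, []
    for block in block_stream:
        if recording is not None:
            if block == recording or block in heads:
                # The tail ends when execution reaches another head.
                traces[recording] = [recording] + current
                recording, current = None, []
            else:
                current.append(block)
        if block in heads and block not in traces:
            counters[block] = counters.get(block, 0) + 1
            if counters[block] >= threshold:
                recording, current = block, []  # grab the next executing tail
    return traces
```

On the stream A B C A B C A with head A, the second arrival at A is hot, so the tail executed next (B, C) becomes the trace.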
13
Trace Selection
14
When to stop creating new traces
  • Excessively high trace selection rates cause
    unacceptable overhead and potential thrashing in
    the Dynamo code cache
  • We need the opportunity to amortize the cost of
    creating traces, thus we need to turn off trace
    creation sometimes
  • Bail out is entered when the creation rate per
    unit time is excessively high

15
Trace Optimization
[Pipeline: the list of trace blocks is used to build a lightweight
intermediate representation (IR) with symbolic labels and an extended
virtual register set. The "lite" IR then passes through a forward
pass (optimization with integrated demand-driven analysis), a
backward pass, and register allocation (scheduling and register
allocation that retain previous mappings), before reaching the
linker.]
16
Trace Optimization
  • Are there any runtime optimization opportunities
    in statically optimized code?
  • Limitations of static compiler optimization
  • cost of call-specific interprocedural
    optimization
  • cost of path-specific optimization in presence of
    complex control flow
  • difficulty of predicting past indirect branches
  • lack of access to shared libraries
  • sub-optimal register allocation decisions
  • register allocation for individual array elements
    or pointers

17
Path-specific optimizations
  • Conservative Optimizations
  • precise signal delivery
  • memory-safe
  • partial procedure inlining
  • redundant branch removal
  • constant propagation
  • constant folding
  • copy propagation
  • Aggressive Optimizations
  • redundant load removal
  • runtime disambiguated (guarded) load removal
  • dead code elimination
  • partially dead code sinking
  • loop unrolling
  • loop invariant hoisting
  • Aggressive optimization can be made memory- and
    signal-safe using compiler hints and
    de-optimization

18
Dynamo Optimizations
  • Constant propagation
  • Given an assignment x = c, replace all later uses
    of x with c, assuming that x will not be modified

[Diagram: before-and-after control flow graphs showing the assignment
b = 3 propagated to later uses of b along the trace]
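Constant propagation over a straight-line trace can be sketched as a single forward pass (illustrative Python; the three-address instruction format is an assumption, not Dynamo's IR).

```python
def propagate_constants(trace):
    """trace: list of (dest, op, args) three-address instructions,
    where args may be variable names or integer constants.
    Replaces uses of known-constant variables with their values."""
    consts, out = {}, []
    for dest, op, args in trace:
        # Substitute any argument whose value is currently known.
        new_args = tuple(consts.get(a, a) for a in args)
        if op == "const":
            consts[dest] = new_args[0]   # dest now holds a known constant
        else:
            consts.pop(dest, None)       # dest redefined: forget old value
        out.append((dest, op, new_args))
    return out
```

After `x = 3`, a later `y = x + 4` is rewritten to `y = 3 + 4`, setting up constant folding (next slide).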
19
Dynamo Optimizations
  • Constant folding
  • Identifying that all operands in an assignment
    are constant after macro expansion and constant
    propagation
  • Easy for booleans, a little trickier for integers
    (exceptions such as divide by zero and
    overflows), for FP this can be very tricky due to
    multiple FP formats

[Diagram: before-and-after control flow graphs for constant folding
following the b = 3 propagation]
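A hedged integer folding routine might look like this (illustrative Python; the op names and 32-bit width are assumptions). As the slide notes, the exceptional cases must be guarded: division by zero is left unfolded so the trap still occurs at runtime, and results that would overflow the machine width are not folded. (Python's `//` floors rather than truncating, which matches C semantics only for non-negative operands.)

```python
def fold(op, a, b, bits=32):
    """Fold a binary integer op when both operands are constants.
    Returns the folded constant, or None when folding is unsafe."""
    if not (isinstance(a, int) and isinstance(b, int)):
        return None                  # not all operands are constant
    if op == "div":
        if b == 0:
            return None              # keep the instruction: it must trap
        result = a // b
    elif op == "add":
        result = a + b
    elif op == "mul":
        result = a * b
    else:
        return None
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    if not lo <= result <= hi:
        return None                  # would overflow the machine integer
    return result
```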
20
Dynamo Optimizations
  • Partial load removal (see the LRE paper)
  • Dead code elimination
  • A variable is dead if it is not used on any path
    from where it is defined to where the function
    exits
  • An instruction is dead if it computes only values
    that are not used on any executable path leading
    from the instruction
  • Dead code is often created through the
    application of code optimization (e.g., strength
    reduction replacing expensive ops by less
    expensive ops)
  • Loop invariant hoisting: moving invariant
    operations out of the loop body
  • Fragment link-time optimizations: apply peephole
    optimization around the link point, looking for
    dead code to remove
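Liveness-based dead code elimination over a trace is a single backward pass (illustrative Python; the instruction format and the `live_out` parameter, the set of values that escape the trace, are assumptions).

```python
def eliminate_dead_code(trace, live_out):
    """trace: list of (dest, op, args) instructions.
    live_out: variable names live at trace exit.
    Drops instructions whose result is never used downstream."""
    live, kept = set(live_out), []
    for dest, op, args in reversed(trace):
        if dest in live:
            kept.append((dest, op, args))
            live.discard(dest)                      # dest is now defined here
            live.update(a for a in args if isinstance(a, str))
        # else: result never used before exit -> instruction is dead
    kept.reverse()
    return kept
```

An instruction computing `t3` is removed when only `t2` is live at exit, while the chain feeding `t2` is kept.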

21
Implementation Issues
  • Problem: a signal arrives while executing in the
    code cache
  • How can we achieve transparent signal delivery?
  • How can the original signal context be
    reconstructed?
  • Dynamo approach: intercept all signals
  • Upon arrival of a signal at code cache location
    L, Dynamo first gains control
  • Save code cache context
  • Retranslate the trace and record
  • Any changes in register mapping up to position L
  • Original code addresses of L
  • All context-modifying optimizations and steps for
    de-optimization
  • Update the code cache context to obtain native
    context
  • Load native context and execute original signal
    handler

22
Dynamic Code Cache
  • Problem: how to control the size of dynamically
    recompiled code?
  • How to react to phase changes?
  • Adaptive flushing based cache management scheme
  • Preemptive cache flushes
  • Fast allocation/de-allocation of traces
  • Removal of old and cold traces
  • Branch re-biasing to improve locality in cache
  • Configurable for various
    performance/memory-footprint trade-offs
  • Code cache default size: 300 KB

23
Dynamo Performance
[Chart: speedups of an -O2-compiled native binary running under
Dynamo on a PA-8000]
24
Bailout
  • Bail out if the trace selection rate exceeds a
    tolerable threshold

25
Bailout
  • To prevent degradation, Dynamo keeps track of the
    current trace selection rate
  • Virtual time is recorded by counting the number
    of interpreted BBs before we select N traces
  • A threshold is set to judge if a rate is high
  • The trace selection rate is considered excessive
    if k consecutive high rate time intervals have
    been encountered
  • Bailout turns off trace selection and
    optimization; execution resumes in the original
    binary
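The bailout policy can be sketched as a small monitor (illustrative Python; all thresholds and names are assumptions). Virtual time is measured as interpreted BBs per N trace selections; k consecutive high-rate intervals trigger bailout.

```python
class BailoutMonitor:
    """Sketch of Dynamo-style bailout on excessive trace selection."""

    def __init__(self, n_traces=50, min_bbs_per_interval=1000, k=3):
        self.n_traces = n_traces              # traces per virtual-time interval
        self.min_bbs = min_bbs_per_interval   # fewer BBs => rate is "high"
        self.k = k                            # tolerated consecutive high intervals
        self.bbs = self.traces = self.high_streak = 0
        self.bailed_out = False

    def interpreted_bb(self):
        self.bbs += 1                         # advance virtual time

    def selected_trace(self):
        self.traces += 1
        if self.traces == self.n_traces:      # one interval has elapsed
            high = self.bbs < self.min_bbs
            self.high_streak = self.high_streak + 1 if high else 0
            if self.high_streak >= self.k:
                self.bailed_out = True        # stop selecting/optimizing traces
            self.bbs = self.traces = 0
```

Two back-to-back intervals with almost no interpretation between selections (k = 2) trip the bailout.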

26
Performance speedups with bailout
[Chart: speedups with bailout enabled, for an -O2-compiled native
binary running under Dynamo on a PA-8000]
27
Memory Overhead: Dynamo text
Total size: 273 KB; PA-RISC-dependent portion: 179 KB (66%)
28
Summary of Dynamo
  • Demonstrated the potential for dynamic
    optimization through an actual implementation
  • Optimization impact tends to be program dependent
  • More sophisticated bailout algorithms need to be
    devised
  • Static compile-time hints should be used to help
    guide a dynamic optimization system

29
Vulcan (A. Srivastava)
  • Provides both static and dynamic code
    modification
  • Performs optimization on x86, IA64 and MSIL
    binaries
  • Can work in the presence of multithreading and
    variable length instructions (X86)
  • Designed to be able to perform modifications on a
    remote machine using a distributed common object
    model (DCOM) interface
  • Can also serve as a binary translator

30
Mojo: Dynamic Optimization using Vulcan
(Chaiken and Gillies)
  • Targets a desktop x86/Windows2000 environment
  • Supports large, multithreaded applications that
    use exception handlers
  • Requires no OS support
  • Allows optimization across shared library
    boundaries
  • Can be aided by information provided by a static
    compiler

31
Mojo Structure
[Diagram: Mojo's components: the Mojo Dispatcher at the center,
connected to the Path Cache, the Basic Block Cache, and the Path
Builder, with the original code, the NT DLL, and exception handling
on the input side]
32
Mojo Structure
1. Interrogate the Path Cache for a hit.
33
Mojo Structure
2. If hit, execute from the Path Cache directly;
else interrogate the Basic Block Cache for a hit.
34
Mojo Structure
3. If hit in the Basic Block Cache, execute
directly; else load the block from the original
code.
35
Mojo Structure
Each time control returns to the Mojo Dispatcher,
BBs are checked for hotness.
36
Mojo Structure
If a BB is hot enough, Mojo turns on path
building. Once a complete path has been built and
optimized, it is placed in the Path Cache.
37
Mojo Components
  • Mojo Dispatcher
  • Is the control point in the dynamic optimization
    system
  • Manages execution context using its own stack
    space
  • Basic Block Cache
  • Handles basic blocks that have not yet become hot
  • Identifies basic block boundaries by dynamically
    decoding instruction bytes
  • Branches are modified to pass control to the
    dispatcher, passing along the address of the next
    basic block to execute
  • Additional information is kept in the BBC that is
    used when constructing paths
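The lookup order from slides 32-34 can be sketched as one dispatch function (illustrative Python; the cache and hotness structures, the threshold, and all names are assumptions).

```python
HOT = 5  # illustrative hotness threshold

def dispatch(pc, path_cache, bb_cache, original_code, hotness):
    """Return (code_to_execute, source) following Mojo's lookup order:
    path cache, then basic block cache, then the original code."""
    if pc in path_cache:                      # 1. Path Cache hit
        return path_cache[pc], "path-cache"
    hotness[pc] = hotness.get(pc, 0) + 1      # BBs are checked for hotness
    if pc in bb_cache:                        # 2. Basic Block Cache hit
        return bb_cache[pc], "bb-cache"
    block = original_code[pc]                 # 3. load from original code
    bb_cache[pc] = block
    return block, "original"
```

The first dispatch of a PC loads from the original code and populates the Basic Block Cache; repeat dispatches hit the cache while the hotness counter climbs toward the path-building threshold.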

38
Mojo Components
  • Path Builder
  • Responsible for selecting, building and
    optimizing hot paths
  • Maintains hotness information for basic blocks
  • Utilizes the same heuristics for building hot
    paths as Dynamo (take the next executing path
    after counter overflow)
  • Utilizes separate thresholds for back edge
    targets and path exit targets (need to detect hot
    side exits when constructing a dynamic path)
  • Instructions are laid out contiguously
    (reordered), eliminating many taken conditional
    branches

39
Mojo Components Path Builder
  • Path Termination - Dynamo only terminates paths
    on back edges

[Diagram: for a pair of original nested loops, Dynamo's
back-edge-only profiling selects a short path, while Mojo's back-edge
and side-exit profiling selects a longer path]
40
Exception Handling and Threads
  • Mojo patches the ntdll.dll
  • Mojo captures the state of the machine before
    passing off exceptions to the dispatcher
  • The dispatcher prevents the exception handler
    from polluting the Path Cache
  • To handle multithreading, Mojo allocates a basic
    block cache per thread, but uses a shared Path
    Cache
  • Locking mechanisms are provided to access and
    update the shared Path Cache reliably

41
Mojo performance
qsort, acker and fib are recursive programs
42
Mojo performance SPEC2000/SPEC95
43
Mojo Execution - Windows
44
Comments
  • For simple programs with simple control flow,
    Mojo shows good improvement
  • For larger programs with more dynamic control
    flow, Mojo is overwhelmed with the amount of path
    creation (same problem that was encountered for
    Dynamo)
  • A bailout strategy is needed, along with a better
    hot-path detection algorithm
  • Future work is investigating how to use hints
    obtained during static compilation to aid in the
    dynamic optimization of the code

45
What is a JIT?
  • Just-in-Time compiler: developed to address the
    performance issues encountered with Java
    interpreter/translator performance
  • Portability generally means lower performance;
    JITs attempt to bridge this gap
  • JITs dynamically cache translated Java bytecodes
    and perform extensive optimization on the native
    instructions
  • Given the overhead of using an OO programming
    model (frequent method calls), extensive
    exception checking, and the overhead of dynamic
    translation/compilation, the quality of the JIT
    must be high

46
Common JITs
  • SUN Java Development Kit (Sun)
  • Hotspot JIT (Sun)
  • Kaffe (Transvirtual Technologies)
  • Jalapeno (IBM Research)
  • Latte (Seoul National University)

47
IBM Jalapeno JVM and JIT
  • Designed specifically for servers
  • Shared memory multiprocessor scalability
  • Manage a large number of concurrent threads
  • High availability
  • Rapid response and graceful degradation (an issue
    when garbage collection is involved)
  • Mainly developed in Java (reliability?)
  • Designed specifically for extensive dynamic
    optimization

48
The Jalapeno Adaptive Optimization System
  • Translates bytecodes directly to the native ISA
  • Recompilation is performed in a separate thread
    from the application, and thus can be done in
    parallel to program execution
  • AOS has three components
  • Runtime measurement system
  • Controller
  • Recompilation system

49
Jalapeno AOS Architecture
[Diagram: the Jalapeno AOS architecture. Executing
(instrumented/optimized) code feeds raw data through the hardware/VM
performance monitor into the measurement subsystem, where organizers
format it (via the organizer event queue) and store profile data in
the AOS database. The controller consumes the formatted data, forms
instrumentation/compilation plans, and places them on the compilation
queue; compilation threads invoke the compilers (Base, Opt, ) and
install the new code.]
50
Three Optimization Levels
  • Level 0: on-the-fly optimizations performed
    during translation (constant propagation,
    constant folding, dead code detection)
  • Level 1: adds to Level 0 common subexpression
    elimination, redundant load elimination, and
    aggressive inlining
  • Level 2: adds to Level 1 flow-sensitive
    optimizations and array bounds check elimination

51
Controller model
  • Decides when to recompile a method
  • Decides which optimization level to use
  • Measurements are used to guide the profiling
    strategy and select the hot methods to recompile
  • An analytical model is also used that represents
    the costs and benefits of performing these tasks

52
When to recompile?
  • Ti: current total amount of time the program will
    spend executing method m as-is
  • Cj: cost of recompiling method m at optimization
    level j
  • Tj: expected total amount of time the program
    will spend executing method m once optimized at
    level j
  • For j = 0, 1, 2, choose the j that minimizes
    Cj + Tj
  • If Cj + Tj < Ti, recompile at level j
  • Otherwise, decide not to recompile
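The decision rule above can be written out directly (illustrative Python; the list-based cost/time encoding is an assumption).

```python
def choose_level(Ti, costs, times):
    """Jalapeno-style recompilation decision sketch.
    Ti: remaining time in method m as-is.
    costs[j] = Cj, times[j] = Tj for levels j = 0, 1, 2.
    Recompile at the level minimizing Cj + Tj, but only if that beats
    leaving the method alone (Cj + Tj < Ti); else return None."""
    best = min(range(len(costs)), key=lambda j: costs[j] + times[j])
    if costs[best] + times[best] < Ti:
        return best
    return None
```

With Ti = 10, costs [1, 2, 5], and optimized times [8, 5, 4], level 1 wins (2 + 5 = 7 < 10); if the method only had 5 units of time left, no recompilation pays off.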

53
When to recompile?
  • To estimate Ti, we assume the program will run
    for a total time of Tf, and use profile data to
    indicate what percentage of the total execution
    time (Pm) is spent in method m (versus the rest
    of the program)
  • We can compute Ti as Ti = Tf × Pm
  • This is the initial estimated execution time for
    method m. A new Ti is computed based on an
    estimate of the speedup of method m.
  • The above weight decays over time.

54
How well does optimization work in Jalapeno?
55
Comments about Jalapeno
  • Focused on method-granularity optimization
  • Simple heuristics for predicting runtimes and
    benefits/costs are highly sensitive to cold vs.
    warm invocation of the application
  • New work looks at method specific optimizations
    that consider additional characteristics besides
    just the estimated runtime

56
Latte
  • Addresses the inefficiencies in the stack-based
    Java bytecode machine by efficiently mapping
    stack space to a RISC register file
  • Since traditional graph-coloring register
    allocation is expensive, and allocation must be
    done at runtime alongside the running program,
    this system looks at other ways to get good
    register allocation at reduced cost

57
Java Translation to Native Code
  • Identifies control join points and subroutines in
    the bytecode using a depth-first search traversal
  • Bytecodes are translated into a control flow
    graph, mapping program variables to a set of
    pseudo-registers
  • Traditional compiler optimizations are performed
  • Register allocation is performed
  • CFG is converted to native host (SPARC) code

58
Treeregion Scheduling
  • The CFG is partitioned into treeregions
    (single-entry, multiple-exit subgraphs shaped
    like trees)
  • Treeregions start at the beginning of the program
    or at join points, and end either at the end of
    the program or at new join points
  • Liveness analysis is performed
  • Individual treeregions are scheduled using a
    backward sweep, followed by a forward sweep
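Treeregion partitioning can be sketched as follows (illustrative Python; the CFG dictionary format and the function name are assumptions, not Latte's implementation). Join points, nodes with two or more predecessors, start new regions, so each region is a tree.

```python
def treeregions(cfg, entry):
    """cfg: {node: [successors]}.  Returns a list of treeregions, each
    a set of nodes.  Regions start at the entry or at a join point and
    stop where the next join points begin."""
    preds = {}
    for n, succs in cfg.items():
        for s in succs:
            preds.setdefault(s, []).append(n)
    joins = {n for n, p in preds.items() if len(p) >= 2}
    regions = []
    for root in [entry] + sorted(joins):
        region, stack = set(), [root]
        while stack:
            n = stack.pop()
            if n in region:
                continue
            region.add(n)
            for s in cfg.get(n, []):
                if s not in joins:          # a join starts its own region
                    stack.append(s)
        regions.append(region)
    return regions
```

A diamond (A branches to B and C, which rejoin at D) splits into two treeregions: the tree {A, B, C} and the join-rooted region {D}.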

59
How well does optimization work in Latte?
60
Comments on Latte
  • Good register allocation can help to improve the
    runtime performance of a dynamically tuned Java
    bytecode binary
  • Optimization should target hot spots in the
    executable
  • Can provide very competitive performance compared
    with the Sun JDK and HotSpot compilation tools

61
Summary on Dynamic Optimization
  • There is always a struggle to balance the costs
    and benefits of particular types of dynamic
    optimizers
  • Dynamic optimizers can be workload dependent
  • There exists a lot of room in Java JITs to
    improve upon instruction schedules and register
    allocation
  • This is a rich area for future research on
    compiler and memory management studies