Title: Efficient Binary Translation in Co-Designed Virtual Machines
1. Efficient Binary Translation in Co-Designed Virtual Machines
Feb. 28, 2006 -- Shiliang Hu
2. Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
3. The Dilemma: Binary Compatibility
- Two fundamentals for computer architects:
- Computer applications: ever-expanding.
- Software development is expensive.
- Software porting is also costly.
- → Standard software binary distribution format(s).
- Implementation technology: ever-evolving.
- Silicon technology has been evolving rapidly (Moore's Law).
- Trend: always needs ISA / architecture innovation.
- The dilemma: binary compatibility.
- Cause: coupled software binary distribution format and hardware/software interface.
4. Solution: Dynamic ISA Mapping
[Figure: software in the architected ISA (OS, drivers, library code, apps) targets the architected ISA (e.g., x86), which is dynamically translated to the implementation ISA (e.g., the fusible ISA) executed by the HW implementation (processors, memory system, I/O devices).]
- ISA mapping:
- Hardware-intensive translation: good for startup performance.
- Software dynamic optimization: good for hotspots.
- Can we combine the advantages of both?
- Startup: fast, hardware-intensive translation.
- Steady state: intelligent translation / optimization for hotspots.
5. Key: Efficient Binary Translation
- Startup curves for Windows workloads
6. Issue: Bad-Case Scenarios
- Short-running, fine-grain cooperating tasks
- Performance lost to slow startup cannot be compensated for before the tasks end.
- Real-time applications
- Real-time constraints can be compromised by a slow translation process.
- Multi-tasking, server-like applications
- Frequent context switches between resource-competing tasks.
- Limited code cache size causes re-translations when tasks are switched in and out.
- OS boot-up and shutdown (client, mobile platforms)
7. Related Work: State-of-the-Art VMs
- Pioneer: IBM System/38, AS/400.
- Products: Transmeta x86 processors
- Code Morphing Software + VLIW engine.
- Crusoe → Efficeon.
- Research projects: IBM DAISY, BOA
- Full-system translator + VLIW engine.
- DBT overhead: about 4000 PowerPC instructions per translated instruction.
- Other research projects: DBT for ILDP (H. Kim and J. E. Smith).
- Dynamic binary translation / optimization
- SW-based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL; Java and .NET HLL VM runtime systems.
- HW-based: trace cache fill units, rePLay, PARROT, etc.
8. Thesis Contributions
- Efficient Dynamic Binary Translation (DBT)
- DBT runtime overhead modeling → translation strategy.
- Efficient software translation algorithms.
- Simple hardware accelerators for DBT.
- Macro-op Execution µ-Arch (w/ Kim, Lipasti)
- Higher IPC and Higher clock speed potential.
- Reduced complexity at critical pipeline stages.
- An Integrated Co-designed x86 Virtual Machine
- Superior steady-state performance.
- Competitive startup performance.
- Complexity-effective, power efficient.
9. Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
10. The x86vm Framework
- Goal: an experimental research infrastructure to explore the co-designed x86 VM paradigm.
- Our co-designed x86 virtual machine design:
- Software components: VMM.
- Hardware components: microarchitecture timing simulators.
- Internal implementation ISA.
- Object-oriented design and implementation in C++.
11. The x86vm Framework
[Figure: software in the architected ISA (OS, drivers, library code, apps) supplies x86 instructions to the x86vmm layer (BOCHS 2.2, DBT, VMM runtime software, code cache(s)), which emits RISC-ops / macro-ops in the implementation ISA (e.g., the fusible ISA) to the hardware model (microarchitecture timing simulator).]
12. Two-Stage DBT System
- VMM runtime
- Orchestrates VM system execution.
- Runtime resource management.
- Precise state recovery.
- DBT
- BBT: basic block translator.
- SBT: hot superblock translator / optimizer. (A sketch of the staged translation flow follows below.)
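A minimal sketch of that two-stage flow, assuming a per-block execution counter and a single hot threshold; the names (`bbt_translate`, `sbt_optimize`, `HOT_THRESHOLD`) and the threshold value are illustrative, not the x86vm implementation:

```python
# Illustrative two-stage DBT dispatch loop (not the actual x86vm code).
# Cold blocks are translated by the basic-block translator (BBT);
# blocks whose execution count reaches a threshold are re-translated
# and optimized by the superblock translator (SBT).

HOT_THRESHOLD = 4096          # assumed value, for illustration only

code_cache = {}               # x86 start PC -> translated code
exec_count = {}               # x86 start PC -> execution counter

def bbt_translate(pc):
    """Straightforward cracking of one basic block into RISC-ops."""
    return f"<BBT translation of block at {pc:#x}>"

def sbt_optimize(pc):
    """Hotspot superblock formation, fusing, and optimization."""
    return f"<SBT macro-op translation of superblock at {pc:#x}>"

def dispatch(pc):
    exec_count[pc] = exec_count.get(pc, 0) + 1
    if pc not in code_cache:                  # cold: fast BBT translation
        code_cache[pc] = bbt_translate(pc)
    elif exec_count[pc] == HOT_THRESHOLD:     # hot: promote to SBT
        code_cache[pc] = sbt_optimize(pc)
    return code_cache[pc]
```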
13. Evaluation Methodology
- Reference/baseline: best-performing x86 processors
- Approximation to Intel Pentium M, AMD K7/K8.
- Experimental data collection: simulation
- Different models need different instantiations of x86vm.
- Benchmarks
- SPEC 2000 integer (SPEC2K). Binary generation: Intel C/C++ compiler, -O3 base optimization. Test data inputs, full runs.
- Winstone 2004 business suite (WSB2004): 500-million x86 instruction traces for 10 common Windows applications.
14. x86 Binary Characterization
- Instruction count expansion (x86 to RISC-ops); see the worked example after this list.
- ~40% for SPEC2K.
- ~50% for Windows workloads.
- More instruction management and communication.
- Redundancy and inefficiency.
- Code footprint expansion
- Nearly doubles if cracked into 32b fixed-length RISC-ops.
- 30-40% if cracked into 16/32b RISC-ops.
- Affects fetch efficiency and memory hierarchy performance.
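As a hypothetical illustration of where the expansion comes from: an x86 instruction with a memory operand cracks into the usual load / op / store pattern. The cracking table below is the textbook decomposition, not the exact x86vm cracker output.

```python
# Hypothetical cracking of a small x86 block into RISC-style micro-ops,
# illustrating instruction-count expansion (not the x86vm cracking rules).

x86_block = [
    "add dword ptr [eax+4], ebx",   # read-modify-write memory operand
    "mov ecx, [esp+8]",
    "jnz target",
]

cracked = {
    "add dword ptr [eax+4], ebx": ["LD  Rt, mem[Reax+4]",
                                   "ADD Rt, Rt, Rebx",
                                   "ST  Rt, mem[Reax+4]"],
    "mov ecx, [esp+8]":            ["LD  Recx, mem[Resp+8]"],
    "jnz target":                  ["BNZ target"],
}

micro_ops = [uop for insn in x86_block for uop in cracked[insn]]
expansion = len(micro_ops) / len(x86_block)
print(f"{len(x86_block)} x86 instructions -> {len(micro_ops)} RISC-ops "
      f"({(expansion - 1) * 100:.0f}% expansion)")
```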
15. Overview of the Baseline VM Design
- Goal: demonstrate the power of the VM paradigm via a specific x86 processor design.
- Architected ISA: x86.
- Co-designed VM software: SBT, BBT, and the VM runtime control system.
- Implementation ISA: fusible ISA (FISA).
- Efficient microarchitecture enabled: macro-op execution.
16. Fusible Instruction Set
- RISC-ops with unique features
- A fusible bit per instruction for fusing.
- Dense encoding: 16/32-bit ISA.
- Special features to support x86
- Condition codes.
- Addressing modes.
- Awareness of long immediate values.
[Figure: fusible ISA instruction formats. Each format carries a fusible (F) bit; 32-bit formats pair a 10b or 16b opcode with 5b register fields (Rds, Rsrc) and a 10b immediate/displacement, while 16-bit formats pair a 5b opcode with 5b register and/or 5b immediate fields.]
17. VMM: Virtual Machine Monitor
- Runtime controller
- Orchestrates VM operation: translation, translated code execution, etc.
- Code cache(s) management (see the lookup/chaining sketch below)
- Holds translations; translation lookup, chaining, eviction policy.
- BBT: initial emulation
- Straightforward cracking, no optimization.
- SBT: hotspot optimizer
- Fuses dependent instruction pairs into macro-ops.
- Precise state recovery routines.
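A minimal sketch of what translation lookup and chaining involve, assuming a hash-table code cache; the class and method names are illustrative, not the actual VMM interfaces:

```python
# Illustrative code cache with translation lookup and chaining
# (not the actual x86vm VMM data structures).

class CodeCache:
    def __init__(self):
        self.table = {}        # source x86 PC -> translation record

    def insert(self, x86_pc, native_code):
        self.table[x86_pc] = {"code": native_code, "exits": {}}

    def lookup(self, x86_pc):
        """Return the translation for x86_pc, or None on a miss
        (a miss hands control back to the VMM / BBT)."""
        return self.table.get(x86_pc)

    def chain(self, from_pc, exit_x86_pc):
        """Record a direct link from a translation exit to the translation
        of its successor, so later executions skip the VMM dispatcher."""
        src, dst = self.table.get(from_pc), self.table.get(exit_x86_pc)
        if src is not None and dst is not None:
            src["exits"][exit_x86_pc] = dst["code"]
            return True
        return False
```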
18. Microarchitecture: Macro-op Execution
- Enhanced out-of-order superscalar microarchitecture
- Processes and executes fused macro-ops as single instructions throughout the entire pipeline.
- Analogy: all lanes → car-pool lanes on a highway → reduce congestion with high throughput, AND raise the speed limit from 65 mph to 80 mph.
- Joint work with I. Kim and M. Lipasti.
[Figure: macro-op execution pipeline -- Fetch, Align/Fuse, Decode, Rename, Dispatch, Wake-up, Select, RF, EXE, MEM, WB, Retire -- with a fuse bit, collapsed 3-1 ALUs, and shared cache ports.]
19. Co-designed x86 pipeline front-end
20. Co-designed x86 pipeline back-end
21. Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT (HW)
- An Example Co-Designed x86 Processor
- Conclusions
22. Performance: A Memory Hierarchy Perspective
- Disk startup
- Initial program startup; module or task reloading after swap.
- Memory startup
- Long-duration context switches, phase changes: x86 code is still in memory, translated code is not.
- Code cache transient / startup
- Short-duration context switches, phase changes.
- Steady state
- Translated code is available and placed in the memory hierarchy.
23. Memory Startup Curves
24. Hotspot Behavior: WSB2004 (100M)
[Figure: hotspot distribution with the hot threshold marked.]
25. Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT Software for the x86
- Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
26. (Hotspot) Translation Procedure
1. Translation unit formation: form a hotspot superblock.
2. IR generation: crack x86 instructions into RISC-style micro-ops.
3. Machine state mapping: perform cluster analysis of embedded long immediate values and assign them to registers if necessary.
4. Dependency graph construction for the superblock.
5. Macro-op fusing algorithm: scan looking for dependent pairs to be fused. Forward scan, backward pairing. Two-pass fusing to prioritize ALU ops.
6. Register allocation: re-order fused dependent pairs together, extend live ranges for precise traps, use a consistent state mapping at superblock exits.
7. Code generation into the code cache. (A sketch of this procedure as a driver routine follows below.)
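A minimal sketch of the procedure as a driver routine; the helpers are trivial stand-ins whose names are placeholders for the seven steps, not the x86vm SBT API:

```python
# Illustrative driver for the hotspot (SBT) translation procedure.
# Each helper is a trivial stand-in for the corresponding step above.

def form_superblock(pc, profile):         return {"pc": pc, "x86": profile.get(pc, [])}
def crack_to_micro_ops(sb):               return [("UOP", x) for x in sb["x86"]]
def map_machine_state(uops):              return uops            # long-imm clustering
def build_dependence_graph(uops):         return {i: [] for i in range(len(uops))}
def fuse_macro_ops(uops, ddg):            return []              # (head, tail) index pairs
def allocate_registers(uops, pairs, ddg): return uops
def generate_code(sched):                 return [f"<native {u}>" for u in sched]

def translate_hotspot(pc, profile, code_cache):
    sb     = form_superblock(pc, profile)          # step 1: superblock formation
    uops   = crack_to_micro_ops(sb)                # step 2: IR generation
    uops   = map_machine_state(uops)               # step 3: machine state mapping
    ddg    = build_dependence_graph(uops)          # step 4: dependence graph
    pairs  = fuse_macro_ops(uops, ddg)             # step 5: two-pass fusing
    sched  = allocate_registers(uops, pairs, ddg)  # step 6: reorder, register allocation
    native = generate_code(sched)                  # step 7: code generation
    code_cache[pc] = native
    return native
```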
27. Macro-op Fusing Algorithm
- Objectives
- Maximize fused dependent pairs.
- Simple and fast.
- Heuristics
- Pipelined issue logic: only single-cycle ALU ops can be a head; minimize non-fused single-cycle ALU ops.
- Criticality: fuse instructions that are close in the original sequence. ALU-op criticality is easier to estimate.
- Simplicity: two or fewer distinct register operands per fused pair.
- Two-pass fusing algorithm (sketched below)
- The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head.
- The 2nd pass considers all kinds of RISC-ops as tail candidates.
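A minimal sketch of the two-pass scan under the heuristics above, assuming micro-ops are records with `dests`, `srcs`, `is_alu`, and `single_cycle` fields; the representation is illustrative, and the anti-scan / cycle checks of the real algorithm are omitted here:

```python
# Illustrative two-pass macro-op fusing scan (forward scan, backward
# pairing). Simplified: it checks dependence, the single-cycle-ALU-head
# rule, and the two-distinct-source-register limit.

def fuse(uops):
    fused_with = [None] * len(uops)              # tail index -> head index

    def try_pair(tail_idx, want_alu_tail):
        tail = uops[tail_idx]
        if (fused_with[tail_idx] is not None or tail_idx in fused_with
                or (want_alu_tail and not tail["is_alu"])):
            return
        for head_idx in range(tail_idx - 1, -1, -1):      # backward pairing
            head = uops[head_idx]
            if head_idx in fused_with or fused_with[head_idx] is not None:
                continue                                   # head already paired
            if not (head["is_alu"] and head["single_cycle"]):
                continue                                   # only 1-cycle ALU ops can head
            if not set(head["dests"]) & set(tail["srcs"]):
                continue                                   # must be a dependent pair
            srcs = set(head["srcs"]) | (set(tail["srcs"]) - set(head["dests"]))
            if len(srcs) <= 2:                             # <= 2 distinct source registers
                fused_with[tail_idx] = head_idx
                return

    for i in range(len(uops)):        # pass 1: ALU-op tails only
        try_pair(i, want_alu_tail=True)
    for i in range(len(uops)):        # pass 2: any remaining op as a tail
        try_pair(i, want_alu_tail=False)
    return fused_with

# Example: "ADD r1, r2, r3" followed by "AND r4, r1, 0x7f" form one pair.
uops = [
    {"is_alu": True, "single_cycle": True, "dests": ["r1"], "srcs": ["r2", "r3"]},
    {"is_alu": True, "single_cycle": True, "dests": ["r4"], "srcs": ["r1"]},
]
print(fuse(uops))   # [None, 0] -> uop 1 is fused with head uop 0
```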
28. Dependence Cycle Detection
- All cases are generalized to case (c) due to the anti-scan fusing heuristic. (A generic cycle check is sketched below.)
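Fusing a head and a tail into one issue unit can create a dependence cycle if some intermediate op consumes the head's result while the tail consumes that intermediate result. A generic check of that condition (my illustration; the slide's case analysis and the anti-scan heuristic themselves are not reproduced here):

```python
# Illustrative check: would fusing (head, tail) create a dependence
# cycle? True if an intermediate op lies on a path head -> ... -> tail,
# which would make the fused macro-op depend on itself.

def creates_cycle(head, tail, succs):
    """succs[i] = indices of ops that consume a value produced by op i."""
    stack = [m for m in succs[head] if m != tail]
    seen = set(stack)
    while stack:
        node = stack.pop()
        if tail in succs[node]:
            return True
        for nxt in succs[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# Example: op1 -> op2 -> op3, so fusing (op1, op3) would trap op2 in between.
succs = {1: [2], 2: [3], 3: []}
print(creates_cycle(1, 3, succs))   # True
print(creates_cycle(2, 3, succs))   # False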
29. Fusing Algorithm Example
x86 asm
-----------------------------------------------------------
1. lea   eax, DS:[edi + 01]
2. mov   DS:[080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and   eax, 0000007f
5. mov   edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops
-----------------------------------------------------------
1. ADD   Reax, Redi, 1
2. ST    Reax, mem[R14]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND   Reax, Reax, 0000007f
5. ADD   R11, Reax, Resi
6. LD    Redx, mem[R11 + 0x7c]

After fusing: Macro-ops
-----------------------------------------------------------
1. ADD   R12, Redi, 1      :: AND Reax, R12, 007f
2. ST    R12, mem[R14]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD   R11, Reax, Resi   :: LD  Redx, mem[R11 + 0x7c]
30. Instruction Fusing Profile (SPEC2K)
31. Instruction Fusing Profile (WSB2004)
32. Macro-op Fusing Profile
- Of all fused macro-ops (SPEC / WSB):
- 52% / 43% are ALU-ALU pairs.
- 30% / 35% are fused condition-test + conditional-branch pairs.
- 18% / 22% are ALU-MEM op pairs.
- Of all fused macro-ops:
- 70% are inter-x86-instruction fusions.
- 46% access two distinct source registers.
- 15% (i.e., 6% of all instruction entities) write two distinct destination registers.
33. DBT Software Runtime Overhead Profile
- Software BBT profile
- BBT overhead Δ_BBT: about 105 FISA instructions (85 cycles) per translated x86 instruction. Mostly for decoding and cracking.
- Software SBT profile
- SBT overhead Δ_SBT: about 1000 instructions per translated hotspot instruction.
34. Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT
- A Co-Designed x86 Processor
- Conclusions
35. Principles for Hardware Assist Design
- Goals
- Reduce VMM runtime software overhead significantly.
- Maintain HW complexity-effectiveness and power efficiency.
- HW and SW simplify each other → synergetic co-design.
- Observations (analytic modeling and simulation)
- High-performance startup (non-hotspot) is critical.
- Hotspot code is usually a small fraction of the overall footprint.
- Approach: BBT accelerators
- Front-end: dual-mode decoders.
- Back-end: HW assist functional unit(s).
36. Dual-Mode CISC (x86) Decoders
[Figure: two-stage decoder -- a µ-op decoder stage produces the opcode, operand designators, and other pipeline control signals.]
- Basic idea: a 2-stage decoder for a CISC ISA
- First published in Motorola 68K processor papers.
- Break a monolithic complex decoder into two separate, simpler decoder stages.
- Dual-mode CISC decoders
- CISC (x86) instructions pass through both stages.
- Internal RISC-ops pass through only the second stage.
37. Dual-Mode CISC (x86) Decoders
- Advantages
- High-performance startup, similar to a conventional superscalar design.
- No code cache needed for non-hotspot code.
- Smooth transition from conventional superscalar design.
- Disadvantages
- Complexity: an n-wide machine needs n such decoders.
- But so does a conventional design.
- Less power-efficient (than other VM schemes).
38. Hardware Assists as Functional Units
39. Hardware Assists as Functional Units
- Advantages
- High-performance startup.
- Power-efficient.
- Programmability and flexibility.
- Simplicity: only one simplified decoder needed.
- Disadvantages
- Runtime overhead: reduces Δ_BBT from 85 to about 20 cycles, but still involves some translation overhead.
- Memory space overhead: some extra code cache space for cold code.
- Must use the VM paradigm; more risky than the dual-mode decoder.
40. Machine Startup Models
- Ref: superscalar
- Conventional processor design as the baseline.
- VM.soft
- Software BBT and hotspot optimization via SBT.
- State-of-the-art VM design.
- VM.be
- BBT accelerated by back-end functional unit(s).
- VM.fe
- Dual-mode decoders at the pipeline front-end.
41. Startup Evaluation: Hardware Assists
[Figure: cumulative x86 IPC (normalized) versus time in cycles (log scale, 1 to 100,000,000) for Ref superscalar, VM.soft, VM.be, VM.fe, and VM.steady-state.]
42. Activity of HW Assists
43. Related Work on HW Assists for DBT
- HW assists for profiling
- Profile buffer [Conte96], BBB hotspot detector [Merten99], programmable HW path profiler [CGO05], etc.
- Profiling co-processor [Zilles01].
- Many others.
- HW assists for general VM technology
- System VM assists: Intel VT, AMD Pacifica.
- Transmeta Efficeon processor: a new execute instruction to accelerate the interpreter.
- HW assists for translation
- Trace cache fill unit [Friendly98].
- Customized buffers and instructions: rePLay, PARROT.
- Instruction path co-processor [Zhou00], etc.
44. Outline
- Introduction
- The x86vm Framework
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
45. Co-Designed x86 Processor Architecture
[Figure: the memory hierarchy holds x86 code and an I-cache of translated (macro-op) code; path 1 feeds x86 code through the vertical x86 decoder, path 2 feeds translated code through the horizontal micro-/macro-op decoder, and both paths continue to rename/dispatch, the issue buffer, and the pipeline EXE back-end.]
- Co-designed virtual machine paradigm
- Startup: simple hardware decode/crack for fast translation.
- Steady state: dynamic software translation/optimization for hotspots.
46. Processor Pipeline
[Figure: macro-op pipeline, annotated with reduced instruction traffic throughout, reduced forwarding, and a pipelined scheduler.]
- Macro-op pipeline for efficient hotspot execution
- Executes macro-ops.
- Higher IPC and higher clock speed potential.
- Shorter pipeline front-end.
47. Performance Simulation Configuration
48. Processor Evaluation: SPEC2K
49. Processor Evaluation: WSB2004
50. Performance Contributors
- Many factors contribute to the IPC performance improvement:
- Code straightening.
- Macro-op fusing and execution.
- Shortened pipeline front-end (reduced branch penalty).
- Collapsed 3-1 ALUs (resolve branches and addresses sooner).
- Besides the baseline and macro-op models, we model three intermediate configurations:
- M0: baseline + code cache.
- M1: M0 + macro-op fusing.
- M2: M1 + shorter pipeline front-end (macro-op mode).
- Macro-op: M2 + collapsed 3-1 ALUs.
51. Performance Contributors: SPEC2K
52. Performance Contributors: WSB2004
53. Outline
- Introduction
- The x86vm Framework
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
54. Conclusions
- The co-designed virtual machine paradigm
- Capability
- Binary compatibility.
- Functionality integration.
- Dynamic functionality upgrading.
- Performance
- Enables novel, efficient architectures.
- Superior steady state, competitive startup.
- Complexity effectiveness
- More flexibility for processor design, etc.
- Power efficiency.
55. Conclusions
- The co-designed x86 processor
- Capability
- Full x86 compatibility.
- Performance and power efficiency
- The macro-op execution engine and DBT are an efficient combination.
- Complexity effectiveness
- VMM: DBT software removes considerable HW complexity.
- µ-arch: can reduce pipeline width.
- Simplified 2-cycle pipelined scheduler.
- Simplified operand forwarding network.
56. Finale: Questions and Answers
Suggestions and comments are welcome. Thank you!
57. Acknowledgements
- Prof. James E. Smith (Advisor)
- Prof. Mikko H. Lipasti
- Dr. Ho-Seop Kim
- Dr. Ilhyun Kim
- Mr. Wooseok Chang
- Wisconsin Architecture Group
- Funding: NSF, IBM, and Intel
- My Family
58. Objectives for Computer Architects
- Efficient designs for computer systems
- More benefits.
- Less cost.
- Specifically, three dimensions:
- Capability: practically, what software code can run.
- High performance: higher IPC, higher clock speeds.
- Simplicity / complexity-effectiveness: less cost, more reliable; also → low power consumption.
59. VM Software
- VM mode
- VM runtime.
- Initial emulation of the architected ISA binary.
- DBT translation.
- Exception handling.
- Translated native mode
- Hotspot code executed in the code cache, chained together.
60. Fuse Macro-ops: An Illustrative Example
61. Why DBT Modeling?
- Understand translation overhead dynamics
- Profiling only gives samples.
- Translation strategy to reduce runtime overhead
- HW / SW division and collaboration.
- Translation stages, trigger mechanisms.
- Hot threshold setting
- Too low: excessive false positives.
- Too high: lost performance benefits.
62. Modeling via a Memory Hierarchy Perspective
63. Analytic Modeling: DBT Overhead
[Figure: performance versus translation complexity, comparing x86 code on the reference superscalar with BBT and SBT translated code; the translation overheads are labeled M_BBT · Δ_BBT and M_SBT · Δ_SBT.]
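Reading the figure's symbols at face value, the first-order overhead is the amount of code each translator processes times its per-instruction cost. A hedged reconstruction (M_BBT, M_SBT introduced here for the static instruction counts handled by BBT and SBT; Δ_BBT, Δ_SBT are the per-instruction costs quoted on slide 33), my reading rather than a formula copied from the dissertation:

```latex
% Assumed first-order DBT overhead model implied by the figure's labels.
% M_BBT, M_SBT: static instructions translated by BBT and SBT;
% Delta_BBT, Delta_SBT: per-instruction translation costs.
T_{\mathrm{translation}} \approx M_{\mathrm{BBT}}\,\Delta_{\mathrm{BBT}}
                               + M_{\mathrm{SBT}}\,\Delta_{\mathrm{SBT}}
```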
64. Analytic Modeling: Hot Threshold
[Figure: performance / translation overhead versus execution frequency for x86 code, BBT code, and SBT code, with the hot threshold marked.]
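The threshold trade-off in the figure can be captured by a simple breakeven argument; this is an illustrative reading, not the dissertation's exact model. With s the cycles saved per execution by SBT-optimized code over BBT code, optimizing a block pays off once its remaining execution count N satisfies:

```latex
% Assumed breakeven condition behind the hot threshold setting.
% s: cycles saved per execution by SBT code relative to BBT code.
N \cdot s \;\gtrsim\; \Delta_{\mathrm{SBT}}
\qquad\Longrightarrow\qquad
N_{\mathrm{threshold}} \approx \Delta_{\mathrm{SBT}} / s
```

A threshold set too low optimizes blocks that never recoup Δ_SBT (false positives); one set too high forgoes cycles that genuinely hot blocks could have saved.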
65. Modeling Evaluation
66. Modeling Evaluation (cont.)
67. Model Implications
- Translation overhead
- Mainly caused by initial emulation.
- Hotspot translation overhead is not significant (and can even be always beneficial) if no translation is wasted on false-positive hotspots.
- Hotspot detection
- The model explains many empirical threshold settings in the literature.
- Excessively low thresholds → excessive false-positive hotspots → excessive hotspot optimization overhead.
68. Efficient Dynamic Binary Translation
- Efficient binary translation
- Strategy: adaptive runtime optimization.
- Efficient algorithms for translation and optimization.
- Cooperation with hardware accelerators.
- Efficient translated native code
- Generic optimizations: native register mapping, long immediate values, register allocation.
- Implementation ISA optimizations: enable new microarchitecture execution → macro-op fusing.
- Architected ISA optimizations: runtime x86-specific optimizations.
69. Register Mapping and LIMM Conversion
- Objectives
- Efficient emulation of the x86 ISA on the fusible ISA.
- 32b long immediate values embedded in the x86 binary are problematic for a 32b instruction set → remove them.
- Solution: register mapping
- Map all x86 GPRs to the lower 16 R registers in the fusible ISA.
- Map all x86 FP/multimedia registers to the lower 24 F registers in the fusible ISA.
- Solution: long immediate conversion (sketched below)
- Scan the superblock looking for all long immediate values.
- Perform value clustering analysis and allocate registers to frequent long immediate values.
- Convert some x86-embedded long immediate values into a register access, or a register plus a short immediate, that can be handled in the implementation ISA.
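A minimal sketch of one way the conversion could work, assuming values "cluster" when they fall within the reach of a short displacement; `SHORT_IMM_MAX`, the register pool, and the clustering rule are illustrative assumptions, not the actual x86vm pass:

```python
# Illustrative long-immediate (LIMM) conversion for a superblock.
# Frequent long immediates get a register; nearby values are rewritten
# as base register + short displacement.

from collections import Counter

SHORT_IMM_MAX = 511            # assumed reach of a short immediate field
LIMM_REG_POOL = ["R20", "R21", "R22", "R23"]   # assumed scratch registers

def convert_long_immediates(superblock_imms):
    """superblock_imms: list of 32-bit immediates found in the superblock.
    Returns (register assignments, rewrite map: imm -> operand string)."""
    counts = Counter(superblock_imms)
    assigned, rewrites = {}, {}
    for imm, _ in counts.most_common():
        # Try to express imm relative to an already-assigned base value.
        base = next((b for b in assigned if abs(imm - b) <= SHORT_IMM_MAX), None)
        if base is not None:
            rewrites[imm] = f"{assigned[base]} + {imm - base:#x}"
        elif len(assigned) < len(LIMM_REG_POOL):
            reg = LIMM_REG_POOL[len(assigned)]
            assigned[imm] = reg
            rewrites[imm] = reg
        # else: leave the immediate in place (materialized by extra ops)
    return assigned, rewrites

regs, rw = convert_long_immediates([0x080B8658, 0x080B8658, 0x080B8670, 0x00CAFE00])
print(regs)   # 0x080B8658 -> 'R20', 0x00CAFE00 -> 'R21' (keys print in decimal)
print(rw)     # 0x080B8670 maps to 'R20 + 0x18'; the others map to their registers
```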
70. Code Re-ordering Algorithm
- Problem: a new code scheduling algorithm is needed to group dependent instructions together.
- Modern compilers schedule independent instructions apart, not dependent pairs together.
- Idea: partition MidSet → PreSet + PostSet (sketched below)
- PreSet: all instructions that must be moved before the head.
- MidSet: all instructions in the middle, between the head and the tail.
- PostSet: all instructions that can be moved after the tail.
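A minimal sketch of the partitioning for one fused (head, tail) pair, assuming a `depends_on(a, b)` predicate that is true when op `a` directly consumes a value produced by op `b`; the predicate and data layout are illustrative, not the thesis algorithm:

```python
# Illustrative MidSet partitioning for one fused (head, tail) pair.
# Ops strictly between head and tail that the tail (transitively) needs
# must move before the head (PreSet); the rest move after the tail
# (PostSet), leaving head and tail adjacent.

def partition_midset(head, tail, depends_on):
    mid = list(range(head + 1, tail))
    pre = set()
    for m in reversed(mid):               # consumers have higher indices
        if depends_on(tail, m) or any(depends_on(p, m) for p in pre):
            pre.add(m)
    pre_set  = [m for m in mid if m in pre]       # preserve original order
    post_set = [m for m in mid if m not in pre]
    return pre_set, post_set

def reorder(uops, head, tail, depends_on):
    """Emit the reordered sequence: PreSet, head, tail (fused), PostSet."""
    pre, post = partition_midset(head, tail, depends_on)
    return [uops[i] for i in pre + [head, tail] + post]
```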
71. Code Re-ordering Example
[Figure: input RISC-ops with fusing information; output macro-op sequence.]
72. BBT Overhead Profile
- Distributed fairly evenly across fetch/decode, semantics, and the x86 cracking loop.
- BBT: ~100 instructions of overhead per translated x86 instruction, similar to a main memory access.
73. HST Back-End Profile
- Light-weight opts (ProcLongImm, DDG setup, encode): tens of instructions each of overhead per x86 instruction (initial load from disk).
- Heavy-weight opts (uop translation, fusing, codegen): none dominates.
74. HW-Assisted DBT Overhead (100M)
75. Breakeven Points for Individual Benchmarks
76. Hotspots Detected vs. Runs
77. Hotspot Coverage vs. Threshold
78. Hotspot Coverage vs. Runs
79. DBT Complexity/Overhead Trade-off
80. Performance Evaluation: SPEC2000
81. Co-Designed x86 Processor
- Architecture enhancement
- The hardware/software co-designed paradigm enables novel designs and more desirable system features.
- Fusing dependent instruction pairs collapses the dataflow graph to reduce instruction management and inter-instruction communication.
- Complexity effectiveness
- Pipelined 2-cycle instruction scheduler.
- Significantly reduced ALU value forwarding network.
- DBT software removes a lot of hardware complexity.
- Power consumption implications
- Reduced pipeline width.
- Reduced inter-instruction communication and instruction management.
82. Processor Design Challenges
- CISC challenges -- suboptimal internal micro-ops
- Complex decoders, obsolete features/instructions.
- Instruction count expansion → more instruction management and communication.
- Redundancy and inefficiency in the cracked micro-ops.
- Solution: dynamic optimization.
- Other current challenges (CISC and RISC)
- Efficiency (nowadays, less performance gain per transistor).
- Power consumption has become acute.
- Solution: novel, efficient microarchitectures.
83. Superscalar x86 Pipeline Challenges
[Figure: conventional x86 pipeline -- Fetch, Align, x86 Decode1/2/3, Rename, Dispatch, atomic scheduler (wake-up + select), RF, EXE, WB, Retire.]
- Best-performing x86 processors, BUT:
- In general, several critical pipeline stages:
- Branch behavior → fetch bandwidth.
- Complex x86 decoders at the pipeline front-end.
- Complex issue logic: wake-up and select in the same cycle.
- Complex operand forwarding networks → wire delays.
- Instruction count expansion: high pressure on instruction management and communication mechanisms.
84. Related Work: x86 Processors
- AMD K7/K8 microarchitecture
- Macro-operations.
- High-performance, efficient pipeline.
- Intel Pentium M
- Micro-op fusion.
- Stack manager.
- High performance, low power.
- Transmeta x86 processors
- Co-designed x86 VM.
- VLIW engine + code morphing software.
85. Future Directions
- Co-designed virtual machine technology
- Confidence: a more realistic, exhaustive benchmark study is important for whole-workload behavior.
- Enhancement: more synergetic, complexity-effective HW/SW co-design techniques.
- Application: specific enabling techniques for specific novel computer architectures of the future.
- Example co-designed x86 processor design
- Confidence: study as above.
- Enhancement: HW µ-arch → reduce register write ports; VMM → more dynamic optimizations in SBT, e.g., CSE, a software stack manager, SIMDification.