Title: Efficient Binary Translation in Co-Designed Virtual Machines
1. Efficient Binary Translation in Co-Designed Virtual Machines
Feb. 28, 2006 -- Shiliang Hu
2. Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
3. The Dilemma: Binary Compatibility
- Two fundamentals for computer architects:
- Computer applications: ever-expanding.
- Software development is expensive.
- Software porting is also costly.
- → Standard software binary distribution format(s).
- Implementation technology: ever-evolving.
- Silicon technology has been evolving rapidly (Moore's Law).
- Trend: always needs ISA / architecture innovation.
- The dilemma: binary compatibility.
- Cause: coupled software binary distribution format and hardware/software interface.
4. Solution: Dynamic ISA Mapping
[Figure: software in the architected ISA (OS, drivers, library code, apps) targets the architected ISA (e.g., x86), which is dynamically translated to the implementation ISA (e.g., the fusible ISA) executed by the HW implementation (processors, memory system, I/O devices).]
- ISA mapping:
- Hardware-intensive translation: good for startup performance.
- Software dynamic optimization: good for hotspots.
- Can we combine the advantages of both?
- Startup: fast, hardware-intensive translation.
- Steady state: intelligent translation / optimization for hotspots.
5. Key: Efficient Binary Translation
- Startup curves for Windows workloads
6. Issue: Bad-Case Scenarios
- Short-running, fine-grain cooperating tasks
- Performance lost to slow startup cannot be compensated for before the tasks end.
- Real-time applications
- Real-time constraints can be compromised by a slow translation process.
- Multi-tasking, server-like applications
- Frequent context switches between resource-competing tasks.
- Limited code cache size causes re-translations when tasks are switched in and out.
- OS boot-up and shutdown (client, mobile platforms)
7. Related Work: State-of-the-Art VMs
- Pioneer: IBM System/38, AS/400.
- Products: Transmeta x86 processors
- Code Morphing Software + VLIW engine.
- Crusoe → Efficeon.
- Research projects: IBM DAISY, BOA
- Full-system translator + VLIW engine.
- DBT overhead: about 4000 PowerPC instructions per translated instruction.
- Other research projects: DBT for ILDP (H. Kim and J. E. Smith).
- Dynamic binary translation / optimization
- SW-based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL; Java and .NET HLL VM runtime systems.
- HW-based: trace cache fill units, rePLay, PARROT, etc.
8. Thesis Contributions
- Efficient Dynamic Binary Translation (DBT)
- DBT runtime overhead modeling → translation strategy.
- Efficient software translation algorithms.
- Simple hardware accelerators for DBT.
- Macro-op Execution µ-Arch (w/ Kim, Lipasti)
- Higher IPC and Higher clock speed potential.
- Reduced complexity at critical pipeline stages.
- An Integrated Co-designed x86 Virtual Machine
- Superior steady-state performance.
- Competitive startup performance.
- Complexity-effective, power efficient.
9. Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
10. The x86vm Framework
- Goal: an experimental research infrastructure to explore the co-designed x86 VM paradigm.
- Our co-designed x86 virtual machine design:
- Software components: VMM.
- Hardware components: microarchitecture timing simulators.
- Internal implementation ISA.
- Object-oriented design and implementation in C++.
11. The x86vm Framework
[Figure: software in the architected ISA (OS, drivers, library code, apps) supplies x86 instructions to the x86vmm layer (BOCHS 2.2, DBT, VMM runtime software, code cache(s)), which emits RISC-ops / macro-ops in the implementation ISA (e.g., the fusible ISA) to the hardware model (microarchitecture timing simulator).]
12. Two-Stage DBT System
- VMM runtime
- Orchestrates VM system execution.
- Runtime resource management.
- Precise state recovery.
- DBT
- BBT: basic block translator.
- SBT: hot superblock translator / optimizer. (A sketch of the staged translation flow follows below.)
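A minimal sketch of that two-stage flow, assuming a per-block execution counter and a single hot threshold; the names (`bbt_translate`, `sbt_optimize`, `HOT_THRESHOLD`) and the threshold value are illustrative, not the x86vm implementation:

```python
# Illustrative two-stage DBT dispatch loop (not the actual x86vm code).
# Cold blocks are translated by the basic-block translator (BBT);
# blocks whose execution count reaches a threshold are re-translated
# and optimized by the superblock translator (SBT).

HOT_THRESHOLD = 4096          # assumed value, for illustration only

code_cache = {}               # x86 start PC -> translated code
exec_count = {}               # x86 start PC -> execution counter

def bbt_translate(pc):
    """Straightforward cracking of one basic block into RISC-ops."""
    return f"<BBT translation of block at {pc:#x}>"

def sbt_optimize(pc):
    """Hotspot superblock formation, fusing, and optimization."""
    return f"<SBT macro-op translation of superblock at {pc:#x}>"

def dispatch(pc):
    exec_count[pc] = exec_count.get(pc, 0) + 1
    if pc not in code_cache:                  # cold: fast BBT translation
        code_cache[pc] = bbt_translate(pc)
    elif exec_count[pc] == HOT_THRESHOLD:     # hot: promote to SBT
        code_cache[pc] = sbt_optimize(pc)
    return code_cache[pc]
```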
13. Evaluation Methodology
- Reference/baseline: best-performing x86 processors
- Approximation to Intel Pentium M, AMD K7/K8.
- Experimental data collection: simulation
- Different models need different instantiations of x86vm.
- Benchmarks
- SPEC 2000 integer (SPEC2K). Binary generation: Intel C/C++ compiler, -O3 base optimization. Test data inputs, full runs.
- Winstone 2004 business suite (WSB2004): 500-million x86 instruction traces for 10 common Windows applications.
14. x86 Binary Characterization
- Instruction count expansion (x86 to RISC-ops); see the worked example after this list.
- ~40% for SPEC2K.
- ~50% for Windows workloads.
- More instruction management and communication.
- Redundancy and inefficiency.
- Code footprint expansion
- Nearly doubles if cracked into 32b fixed-length RISC-ops.
- 30-40% if cracked into 16/32b RISC-ops.
- Affects fetch efficiency and memory hierarchy performance.
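As a hypothetical illustration of where the expansion comes from: an x86 instruction with a memory operand cracks into the usual load / op / store pattern. The cracking table below is the textbook decomposition, not the exact x86vm cracker output.

```python
# Hypothetical cracking of a small x86 block into RISC-style micro-ops,
# illustrating instruction-count expansion (not the x86vm cracking rules).

x86_block = [
    "add dword ptr [eax+4], ebx",   # read-modify-write memory operand
    "mov ecx, [esp+8]",
    "jnz target",
]

cracked = {
    "add dword ptr [eax+4], ebx": ["LD  Rt, mem[Reax+4]",
                                   "ADD Rt, Rt, Rebx",
                                   "ST  Rt, mem[Reax+4]"],
    "mov ecx, [esp+8]":            ["LD  Recx, mem[Resp+8]"],
    "jnz target":                  ["BNZ target"],
}

micro_ops = [uop for insn in x86_block for uop in cracked[insn]]
expansion = len(micro_ops) / len(x86_block)
print(f"{len(x86_block)} x86 instructions -> {len(micro_ops)} RISC-ops "
      f"({(expansion - 1) * 100:.0f}% expansion)")
```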
15. Overview of the Baseline VM Design
- Goal: demonstrate the power of the VM paradigm via a specific x86 processor design.
- Architected ISA: x86.
- Co-designed VM software: SBT, BBT, and the VM runtime control system.
- Implementation ISA: fusible ISA (FISA).
- Efficient microarchitecture enabled: macro-op execution.
16. Fusible Instruction Set
- RISC-ops with unique features
- A fusible bit per instruction for fusing.
- Dense encoding: 16/32-bit ISA.
- Special features to support x86
- Condition codes.
- Addressing modes.
- Awareness of long immediate values.
[Figure: fusible ISA instruction formats. Each format carries a fusible (F) bit; 32-bit formats pair a 10b or 16b opcode with 5b register fields (Rds, Rsrc) and a 10b immediate/displacement, while 16-bit formats pair a 5b opcode with 5b register and/or 5b immediate fields.]
17. VMM: Virtual Machine Monitor
- Runtime controller
- Orchestrates VM operation: translation, translated code execution, etc.
- Code cache(s) management (see the lookup/chaining sketch below)
- Holds translations; translation lookup, chaining, eviction policy.
- BBT: initial emulation
- Straightforward cracking, no optimization.
- SBT: hotspot optimizer
- Fuses dependent instruction pairs into macro-ops.
- Precise state recovery routines.
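A minimal sketch of what translation lookup and chaining involve, assuming a hash-table code cache; the class and method names are illustrative, not the actual VMM interfaces:

```python
# Illustrative code cache with translation lookup and chaining
# (not the actual x86vm VMM data structures).

class CodeCache:
    def __init__(self):
        self.table = {}        # source x86 PC -> translation record

    def insert(self, x86_pc, native_code):
        self.table[x86_pc] = {"code": native_code, "exits": {}}

    def lookup(self, x86_pc):
        """Return the translation for x86_pc, or None on a miss
        (a miss hands control back to the VMM / BBT)."""
        return self.table.get(x86_pc)

    def chain(self, from_pc, exit_x86_pc):
        """Record a direct link from a translation exit to the translation
        of its successor, so later executions skip the VMM dispatcher."""
        src, dst = self.table.get(from_pc), self.table.get(exit_x86_pc)
        if src is not None and dst is not None:
            src["exits"][exit_x86_pc] = dst["code"]
            return True
        return False
```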
18. Microarchitecture: Macro-op Execution
- Enhanced out-of-order superscalar microarchitecture
- Processes and executes fused macro-ops as single instructions throughout the entire pipeline.
- Analogy: all lanes → car-pool lanes on a highway → reduce congestion with high throughput, AND raise the speed limit from 65 mph to 80 mph.
- Joint work with I. Kim and M. Lipasti.
[Figure: macro-op execution pipeline -- Fetch, Align/Fuse, Decode, Rename, Dispatch, Wake-up, Select, RF, EXE, MEM, WB, Retire -- with a fuse bit, collapsed 3-1 ALUs, and shared cache ports.]
19. Co-designed x86 pipeline front-end
20. Co-designed x86 pipeline back-end
21. Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT (HW)
- An Example Co-Designed x86 Processor
- Conclusions
22. Performance: A Memory Hierarchy Perspective
- Disk startup
- Initial program startup; module or task reloading after swap.
- Memory startup
- Long-duration context switches, phase changes: x86 code is still in memory, translated code is not.
- Code cache transient / startup
- Short-duration context switches, phase changes.
- Steady state
- Translated code is available and placed in the memory hierarchy.
23. Memory Startup Curves
24. Hotspot Behavior: WSB2004 (100M)
[Figure: hotspot distribution with the hot threshold marked.]
25. Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT Software for the x86
- Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
26. (Hotspot) Translation Procedure
1. Translation unit formation: form a hotspot superblock.
2. IR generation: crack x86 instructions into RISC-style micro-ops.
3. Machine state mapping: perform cluster analysis of embedded long immediate values and assign them to registers if necessary.
4. Dependency graph construction for the superblock.
5. Macro-op fusing algorithm: scan looking for dependent pairs to be fused. Forward scan, backward pairing. Two-pass fusing to prioritize ALU ops.
6. Register allocation: re-order fused dependent pairs together, extend live ranges for precise traps, use a consistent state mapping at superblock exits.
7. Code generation into the code cache. (A sketch of this procedure as a driver routine follows below.)
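A minimal sketch of the procedure as a driver routine; the helpers are trivial stand-ins whose names are placeholders for the seven steps, not the x86vm SBT API:

```python
# Illustrative driver for the hotspot (SBT) translation procedure.
# Each helper is a trivial stand-in for the corresponding step above.

def form_superblock(pc, profile):         return {"pc": pc, "x86": profile.get(pc, [])}
def crack_to_micro_ops(sb):               return [("UOP", x) for x in sb["x86"]]
def map_machine_state(uops):              return uops            # long-imm clustering
def build_dependence_graph(uops):         return {i: [] for i in range(len(uops))}
def fuse_macro_ops(uops, ddg):            return []              # (head, tail) index pairs
def allocate_registers(uops, pairs, ddg): return uops
def generate_code(sched):                 return [f"<native {u}>" for u in sched]

def translate_hotspot(pc, profile, code_cache):
    sb     = form_superblock(pc, profile)          # step 1: superblock formation
    uops   = crack_to_micro_ops(sb)                # step 2: IR generation
    uops   = map_machine_state(uops)               # step 3: machine state mapping
    ddg    = build_dependence_graph(uops)          # step 4: dependence graph
    pairs  = fuse_macro_ops(uops, ddg)             # step 5: two-pass fusing
    sched  = allocate_registers(uops, pairs, ddg)  # step 6: reorder, register allocation
    native = generate_code(sched)                  # step 7: code generation
    code_cache[pc] = native
    return native
```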
27. Macro-op Fusing Algorithm
- Objectives
- Maximize fused dependent pairs.
- Simple and fast.
- Heuristics
- Pipelined issue logic: only single-cycle ALU ops can be a head; minimize non-fused single-cycle ALU ops.
- Criticality: fuse instructions that are close in the original sequence. ALU-op criticality is easier to estimate.
- Simplicity: two or fewer distinct register operands per fused pair.
- Two-pass fusing algorithm (sketched below)
- The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head.
- The 2nd pass considers all kinds of RISC-ops as tail candidates.
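A minimal sketch of the two-pass scan under the heuristics above, assuming micro-ops are records with `dests`, `srcs`, `is_alu`, and `single_cycle` fields; the representation is illustrative, and the anti-scan / cycle checks of the real algorithm are omitted here:

```python
# Illustrative two-pass macro-op fusing scan (forward scan, backward
# pairing). Simplified: it checks dependence, the single-cycle-ALU-head
# rule, and the two-distinct-source-register limit.

def fuse(uops):
    fused_with = [None] * len(uops)              # tail index -> head index

    def try_pair(tail_idx, want_alu_tail):
        tail = uops[tail_idx]
        if (fused_with[tail_idx] is not None or tail_idx in fused_with
                or (want_alu_tail and not tail["is_alu"])):
            return
        for head_idx in range(tail_idx - 1, -1, -1):      # backward pairing
            head = uops[head_idx]
            if head_idx in fused_with or fused_with[head_idx] is not None:
                continue                                   # head already paired
            if not (head["is_alu"] and head["single_cycle"]):
                continue                                   # only 1-cycle ALU ops can head
            if not set(head["dests"]) & set(tail["srcs"]):
                continue                                   # must be a dependent pair
            srcs = set(head["srcs"]) | (set(tail["srcs"]) - set(head["dests"]))
            if len(srcs) <= 2:                             # <= 2 distinct source registers
                fused_with[tail_idx] = head_idx
                return

    for i in range(len(uops)):        # pass 1: ALU-op tails only
        try_pair(i, want_alu_tail=True)
    for i in range(len(uops)):        # pass 2: any remaining op as a tail
        try_pair(i, want_alu_tail=False)
    return fused_with

# Example: "ADD r1, r2, r3" followed by "AND r4, r1, 0x7f" form one pair.
uops = [
    {"is_alu": True, "single_cycle": True, "dests": ["r1"], "srcs": ["r2", "r3"]},
    {"is_alu": True, "single_cycle": True, "dests": ["r4"], "srcs": ["r1"]},
]
print(fuse(uops))   # [None, 0] -> uop 1 is fused with head uop 0
```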
28. Dependence Cycle Detection
- All cases are generalized to case (c) due to the anti-scan fusing heuristic. (A generic cycle check is sketched below.)
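Fusing a head and a tail into one issue unit can create a dependence cycle if some intermediate op consumes the head's result while the tail consumes that intermediate result. A generic check of that condition (my illustration; the slide's case analysis and the anti-scan heuristic themselves are not reproduced here):

```python
# Illustrative check: would fusing (head, tail) create a dependence
# cycle? True if an intermediate op lies on a path head -> ... -> tail,
# which would make the fused macro-op depend on itself.

def creates_cycle(head, tail, succs):
    """succs[i] = indices of ops that consume a value produced by op i."""
    stack = [m for m in succs[head] if m != tail]
    seen = set(stack)
    while stack:
        node = stack.pop()
        if tail in succs[node]:
            return True
        for nxt in succs[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# Example: op1 -> op2 -> op3, so fusing (op1, op3) would trap op2 in between.
succs = {1: [2], 2: [3], 3: []}
print(creates_cycle(1, 3, succs))   # True
print(creates_cycle(2, 3, succs))   # False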
29. Fusing Algorithm Example
x86 asm
-----------------------------------------------------------
1. lea   eax, DS:[edi + 01]
2. mov   DS:[080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and   eax, 0000007f
5. mov   edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops
-----------------------------------------------------------
1. ADD   Reax, Redi, 1
2. ST    Reax, mem[R14]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND   Reax, Reax, 0000007f
5. ADD   R11, Reax, Resi
6. LD    Redx, mem[R11 + 0x7c]

After fusing: Macro-ops
-----------------------------------------------------------
1. ADD   R12, Redi, 1      :: AND Reax, R12, 007f
2. ST    R12, mem[R14]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD   R11, Reax, Resi   :: LD  Redx, mem[R11 + 0x7c]
30. Instruction Fusing Profile (SPEC2K)
31. Instruction Fusing Profile (WSB2004)
32. Macro-op Fusing Profile
- Of all fused macro-ops (SPEC / WSB):
- 52% / 43% are ALU-ALU pairs.
- 30% / 35% are fused condition-test + conditional-branch pairs.
- 18% / 22% are ALU-MEM op pairs.
- Of all fused macro-ops:
- 70% are inter-x86-instruction fusions.
- 46% access two distinct source registers.
- 15% (i.e., 6% of all instruction entities) write two distinct destination registers.
33. DBT Software Runtime Overhead Profile
- Software BBT profile
- BBT overhead Δ_BBT: about 105 FISA instructions (85 cycles) per translated x86 instruction. Mostly for decoding and cracking.
- Software SBT profile
- SBT overhead Δ_SBT: about 1000 instructions per translated hotspot instruction.
34. Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT
- A Co-Designed x86 Processor
- Conclusions
35. Principles for Hardware Assist Design
- Goals
- Reduce VMM runtime software overhead significantly.
- Maintain HW complexity-effectiveness and power efficiency.
- HW and SW simplify each other → synergetic co-design.
- Observations (analytic modeling and simulation)
- High-performance startup (non-hotspot) is critical.
- Hotspot code is usually a small fraction of the overall footprint.
- Approach: BBT accelerators
- Front-end: dual-mode decoders.
- Back-end: HW assist functional unit(s).
36. Dual-Mode CISC (x86) Decoders
[Figure: two-stage decoder -- a µ-op decoder stage produces the opcode, operand designators, and other pipeline control signals.]
- Basic idea: a 2-stage decoder for a CISC ISA
- First published in Motorola 68K processor papers.
- Break a monolithic complex decoder into two separate, simpler decoder stages.
- Dual-mode CISC decoders
- CISC (x86) instructions pass through both stages.
- Internal RISC-ops pass through only the second stage.
37. Dual-Mode CISC (x86) Decoders
- Advantages
- High-performance startup, similar to a conventional superscalar design.
- No code cache needed for non-hotspot code.
- Smooth transition from conventional superscalar design.
- Disadvantages
- Complexity: an n-wide machine needs n such decoders.
- But so does a conventional design.
- Less power-efficient (than other VM schemes).
38. Hardware Assists as Functional Units
39. Hardware Assists as Functional Units
- Advantages
- High-performance startup.
- Power-efficient.
- Programmability and flexibility.
- Simplicity: only one simplified decoder needed.
- Disadvantages
- Runtime overhead: reduces Δ_BBT from 85 to about 20 cycles, but still involves some translation overhead.
- Memory space overhead: some extra code cache space for cold code.
- Must use the VM paradigm; more risky than the dual-mode decoder.
40. Machine Startup Models
- Ref: superscalar
- Conventional processor design as the baseline.
- VM.soft
- Software BBT and hotspot optimization via SBT.
- State-of-the-art VM design.
- VM.be
- BBT accelerated by back-end functional unit(s).
- VM.fe
- Dual-mode decoders at the pipeline front-end.
41. Startup Evaluation: Hardware Assists
[Figure: cumulative x86 IPC (normalized) versus time in cycles (log scale, 1 to 100,000,000) for Ref superscalar, VM.soft, VM.be, VM.fe, and VM.steady-state.]
42. Activity of HW Assists
43. Related Work on HW Assists for DBT
- HW assists for profiling
- Profile buffer [Conte96], BBB hotspot detector [Merten99], programmable HW path profiler [CGO05], etc.
- Profiling co-processor [Zilles01].
- Many others.
- HW assists for general VM technology
- System VM assists: Intel VT, AMD Pacifica.
- Transmeta Efficeon processor: a new execute instruction to accelerate the interpreter.
- HW assists for translation
- Trace cache fill unit [Friendly98].
- Customized buffers and instructions: rePLay, PARROT.
- Instruction path co-processor [Zhou00], etc.
44. Outline
- Introduction
- The x86vm Framework
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
45. Co-Designed x86 Processor Architecture
[Figure: the memory hierarchy holds x86 code and an I-cache of translated (macro-op) code; path 1 feeds x86 code through the vertical x86 decoder, path 2 feeds translated code through the horizontal micro-/macro-op decoder, and both paths continue to rename/dispatch, the issue buffer, and the pipeline EXE back-end.]
- Co-designed virtual machine paradigm
- Startup: simple hardware decode/crack for fast translation.
- Steady state: dynamic software translation/optimization for hotspots.
46. Processor Pipeline
[Figure: macro-op pipeline, annotated with reduced instruction traffic throughout, reduced forwarding, and a pipelined scheduler.]
- Macro-op pipeline for efficient hotspot execution
- Executes macro-ops.
- Higher IPC and higher clock speed potential.
- Shorter pipeline front-end.
47. Performance Simulation Configuration
48. Processor Evaluation: SPEC2K
49. Processor Evaluation: WSB2004
50. Performance Contributors
- Many factors contribute to the IPC performance improvement:
- Code straightening.
- Macro-op fusing and execution.
- Shortened pipeline front-end (reduced branch penalty).
- Collapsed 3-1 ALUs (resolve branches and addresses sooner).
- Besides the baseline and macro-op models, we model three intermediate configurations:
- M0: baseline + code cache.
- M1: M0 + macro-op fusing.
- M2: M1 + shorter pipeline front-end (macro-op mode).
- Macro-op: M2 + collapsed 3-1 ALUs.
51. Performance Contributors: SPEC2K
52. Performance Contributors: WSB2004
53. Outline
- Introduction
- The x86vm Framework
- Efficient Dynamic Binary Translation
- DBT Modeling and Translation Strategy
- Efficient DBT for the x86 (SW)
- Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
54. Conclusions
- The co-designed virtual machine paradigm
- Capability
- Binary compatibility.
- Functionality integration.
- Dynamic functionality upgrading.
- Performance
- Enables novel, efficient architectures.
- Superior steady state, competitive startup.
- Complexity effectiveness
- More flexibility for processor design, etc.
- Power efficiency.
55. Conclusions
- The co-designed x86 processor
- Capability
- Full x86 compatibility.
- Performance and power efficiency
- The macro-op execution engine and DBT are an efficient combination.
- Complexity effectiveness
- VMM: DBT software removes considerable HW complexity.
- µ-arch: can reduce pipeline width.
- Simplified 2-cycle pipelined scheduler.
- Simplified operand forwarding network.
56. Finale: Questions and Answers
Suggestions and comments are welcome. Thank you!
57. Acknowledgements
- Prof. James E. Smith (Advisor)
- Prof. Mikko H. Lipasti
- Dr. Ho-Seop Kim
- Dr. Ilhyun Kim
- Mr. Wooseok Chang
- Wisconsin Architecture Group
- Funding: NSF, IBM, and Intel
- My Family
58. Objectives for Computer Architects
- Efficient designs for computer systems
- More benefits.
- Less cost.
- Specifically, three dimensions:
- Capability: practically, what software code can run.
- High performance: higher IPC, higher clock speeds.
- Simplicity / complexity-effectiveness: less cost, more reliable; also → low power consumption.
59. VM Software
- VM mode
- VM runtime.
- Initial emulation of the architected ISA binary.
- DBT translation.
- Exception handling.
- Translated native mode
- Hotspot code executed in the code cache, chained together.
60. Fuse Macro-ops: An Illustrative Example
61. Why DBT Modeling?
- Understand translation overhead dynamics
- Profiling only gives samples.
- Translation strategy to reduce runtime overhead
- HW / SW division and collaboration.
- Translation stages, trigger mechanisms.
- Hot threshold setting
- Too low: excessive false positives.
- Too high: lost performance benefits.
62. Modeling via a Memory Hierarchy Perspective
63. Analytic Modeling: DBT Overhead
[Figure: performance versus translation complexity, comparing x86 code on the reference superscalar with BBT and SBT translated code; the translation overheads are labeled M_BBT · Δ_BBT and M_SBT · Δ_SBT.]
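Reading the figure's symbols at face value, the first-order overhead is the amount of code each translator processes times its per-instruction cost. A hedged reconstruction (M_BBT, M_SBT introduced here for the static instruction counts handled by BBT and SBT; Δ_BBT, Δ_SBT are the per-instruction costs quoted on slide 33), my reading rather than a formula copied from the dissertation:

```latex
% Assumed first-order DBT overhead model implied by the figure's labels.
% M_BBT, M_SBT: static instructions translated by BBT and SBT;
% Delta_BBT, Delta_SBT: per-instruction translation costs.
T_{\mathrm{translation}} \approx M_{\mathrm{BBT}}\,\Delta_{\mathrm{BBT}}
                               + M_{\mathrm{SBT}}\,\Delta_{\mathrm{SBT}}
```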
64. Analytic Modeling: Hot Threshold
[Figure: performance / translation overhead versus execution frequency for x86 code, BBT code, and SBT code, with the hot threshold marked.]
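The threshold trade-off in the figure can be captured by a simple breakeven argument; this is an illustrative reading, not the dissertation's exact model. With s the cycles saved per execution by SBT-optimized code over BBT code, optimizing a block pays off once its remaining execution count N satisfies:

```latex
% Assumed breakeven condition behind the hot threshold setting.
% s: cycles saved per execution by SBT code relative to BBT code.
N \cdot s \;\gtrsim\; \Delta_{\mathrm{SBT}}
\qquad\Longrightarrow\qquad
N_{\mathrm{threshold}} \approx \Delta_{\mathrm{SBT}} / s
```

A threshold set too low optimizes blocks that never recoup Δ_SBT (false positives); one set too high forgoes cycles that genuinely hot blocks could have saved.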
65. Modeling Evaluation
66. Modeling Evaluation (cont.)
67. Model Implications
- Translation overhead
- Mainly caused by initial emulation.
- Hotspot translation overhead is not significant (and can even be always beneficial) if no translation is wasted on false-positive hotspots.
- Hotspot detection
- The model explains many empirical threshold settings in the literature.
- Excessively low thresholds → excessive false-positive hotspots → excessive hotspot optimization overhead.
68. Efficient Dynamic Binary Translation
- Efficient binary translation
- Strategy: adaptive runtime optimization.
- Efficient algorithms for translation and optimization.
- Cooperation with hardware accelerators.
- Efficient translated native code
- Generic optimizations: native register mapping, long immediate values, register allocation.
- Implementation ISA optimizations: enable new microarchitecture execution → macro-op fusing.
- Architected ISA optimizations: runtime x86-specific optimizations.
69. Register Mapping and LIMM Conversion
- Objectives
- Efficient emulation of the x86 ISA on the fusible ISA.
- 32b long immediate values embedded in the x86 binary are problematic for a 32b instruction set → remove them.
- Solution: register mapping
- Map all x86 GPRs to the lower 16 R registers in the fusible ISA.
- Map all x86 FP/multimedia registers to the lower 24 F registers in the fusible ISA.
- Solution: long immediate conversion (sketched below)
- Scan the superblock looking for all long immediate values.
- Perform value clustering analysis and allocate registers to frequent long immediate values.
- Convert some x86-embedded long immediate values into a register access, or a register plus a short immediate, that can be handled in the implementation ISA.
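A minimal sketch of one way the conversion could work, assuming values "cluster" when they fall within the reach of a short displacement; `SHORT_IMM_MAX`, the register pool, and the clustering rule are illustrative assumptions, not the actual x86vm pass:

```python
# Illustrative long-immediate (LIMM) conversion for a superblock.
# Frequent long immediates get a register; nearby values are rewritten
# as base register + short displacement.

from collections import Counter

SHORT_IMM_MAX = 511            # assumed reach of a short immediate field
LIMM_REG_POOL = ["R20", "R21", "R22", "R23"]   # assumed scratch registers

def convert_long_immediates(superblock_imms):
    """superblock_imms: list of 32-bit immediates found in the superblock.
    Returns (register assignments, rewrite map: imm -> operand string)."""
    counts = Counter(superblock_imms)
    assigned, rewrites = {}, {}
    for imm, _ in counts.most_common():
        # Try to express imm relative to an already-assigned base value.
        base = next((b for b in assigned if abs(imm - b) <= SHORT_IMM_MAX), None)
        if base is not None:
            rewrites[imm] = f"{assigned[base]} + {imm - base:#x}"
        elif len(assigned) < len(LIMM_REG_POOL):
            reg = LIMM_REG_POOL[len(assigned)]
            assigned[imm] = reg
            rewrites[imm] = reg
        # else: leave the immediate in place (materialized by extra ops)
    return assigned, rewrites

regs, rw = convert_long_immediates([0x080B8658, 0x080B8658, 0x080B8670, 0x00CAFE00])
print(regs)   # 0x080B8658 -> 'R20', 0x00CAFE00 -> 'R21' (keys print in decimal)
print(rw)     # 0x080B8670 maps to 'R20 + 0x18'; the others map to their registers
```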
70. Code Re-ordering Algorithm
- Problem: a new code scheduling algorithm is needed to group dependent instructions together.
- Modern compilers schedule independent instructions apart, not dependent pairs together.
- Idea: partition MidSet → PreSet + PostSet (sketched below)
- PreSet: all instructions that must be moved before the head.
- MidSet: all instructions in the middle, between the head and the tail.
- PostSet: all instructions that can be moved after the tail.
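A minimal sketch of the partitioning for one fused (head, tail) pair, assuming a `depends_on(a, b)` predicate that is true when op `a` directly consumes a value produced by op `b`; the predicate and data layout are illustrative, not the thesis algorithm:

```python
# Illustrative MidSet partitioning for one fused (head, tail) pair.
# Ops strictly between head and tail that the tail (transitively) needs
# must move before the head (PreSet); the rest move after the tail
# (PostSet), leaving head and tail adjacent.

def partition_midset(head, tail, depends_on):
    mid = list(range(head + 1, tail))
    pre = set()
    for m in reversed(mid):               # consumers have higher indices
        if depends_on(tail, m) or any(depends_on(p, m) for p in pre):
            pre.add(m)
    pre_set  = [m for m in mid if m in pre]       # preserve original order
    post_set = [m for m in mid if m not in pre]
    return pre_set, post_set

def reorder(uops, head, tail, depends_on):
    """Emit the reordered sequence: PreSet, head, tail (fused), PostSet."""
    pre, post = partition_midset(head, tail, depends_on)
    return [uops[i] for i in pre + [head, tail] + post]
```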
71. Code Re-ordering Example
[Figure: input RISC-ops with fusing information; output macro-op sequence.]
72. BBT Overhead Profile
- Distributed fairly evenly across fetch/decode, semantics, and the x86 cracking loop.
- BBT: ~100 instructions of overhead per translated x86 instruction, similar to a main memory access.
73. HST Back-End Profile
- Light-weight opts (ProcLongImm, DDG setup, encode): tens of instructions each of overhead per x86 instruction (initial load from disk).
- Heavy-weight opts (uop translation, fusing, codegen): none dominates.
74. HW-Assisted DBT Overhead (100M)
75. Breakeven Points for Individual Benchmarks
76. Hotspots Detected vs. Runs
77. Hotspot Coverage vs. Threshold
78. Hotspot Coverage vs. Runs
79. DBT Complexity/Overhead Trade-off
80. Performance Evaluation: SPEC2000
81. Co-Designed x86 Processor
- Architecture enhancement
- The hardware/software co-designed paradigm enables novel designs and more desirable system features.
- Fusing dependent instruction pairs collapses the dataflow graph to reduce instruction management and inter-instruction communication.
- Complexity effectiveness
- Pipelined 2-cycle instruction scheduler.
- Significantly reduced ALU value forwarding network.
- DBT software removes a lot of hardware complexity.
- Power consumption implications
- Reduced pipeline width.
- Reduced inter-instruction communication and instruction management.
82. Processor Design Challenges
- CISC challenges -- suboptimal internal micro-ops
- Complex decoders, obsolete features/instructions.
- Instruction count expansion → more instruction management and communication.
- Redundancy and inefficiency in the cracked micro-ops.
- Solution: dynamic optimization.
- Other current challenges (CISC and RISC)
- Efficiency (nowadays, less performance gain per transistor).
- Power consumption has become acute.
- Solution: novel, efficient microarchitectures.
83. Superscalar x86 Pipeline Challenges
[Figure: conventional x86 pipeline -- Fetch, Align, x86 Decode1/2/3, Rename, Dispatch, atomic scheduler (wake-up + select), RF, EXE, WB, Retire.]
- Best-performing x86 processors, BUT:
- In general, several critical pipeline stages:
- Branch behavior → fetch bandwidth.
- Complex x86 decoders at the pipeline front-end.
- Complex issue logic: wake-up and select in the same cycle.
- Complex operand forwarding networks → wire delays.
- Instruction count expansion: high pressure on instruction management and communication mechanisms.
84. Related Work: x86 Processors
- AMD K7/K8 microarchitecture
- Macro-operations.
- High-performance, efficient pipeline.
- Intel Pentium M
- Micro-op fusion.
- Stack manager.
- High performance, low power.
- Transmeta x86 processors
- Co-designed x86 VM.
- VLIW engine + code morphing software.
85. Future Directions
- Co-designed virtual machine technology
- Confidence: a more realistic, exhaustive benchmark study is important for whole-workload behavior.
- Enhancement: more synergetic, complexity-effective HW/SW co-design techniques.
- Application: specific enabling techniques for specific novel computer architectures of the future.
- Example co-designed x86 processor design
- Confidence: study as above.
- Enhancement: HW µ-arch → reduce register write ports; VMM → more dynamic optimizations in SBT, e.g., CSE, a software stack manager, SIMDification.