Efficient Binary Translation in Co-Designed Virtual Machines - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Efficient Binary Translation In
Co-Designed Virtual Machines
Feb. 28th, 2006 -- Shiliang Hu
2
Outline
  • Introduction
  • The x86vm Infrastructure
  • Efficient Dynamic Binary Translation
  • DBT Modeling → Translation Strategy
  • Efficient DBT for the x86 (SW)
  • Hardware Assists for DBT (HW)
  • A Co-Designed x86 Processor
  • Conclusions

3
The Dilemma: Binary Compatibility
  • Two fundamentals for computer architects:
  • Computer applications: ever-expanding
  • Software development is expensive.
  • Software porting is also costly.
  • → Standard software binary distribution format(s).
  • Implementation technology: ever-evolving
  • Silicon technology has been evolving rapidly
    (Moore's Law).
  • Trend: always needs ISA / architecture innovation.
  • Dilemma: binary compatibility.
  • Cause: coupled software binary distribution
    format and hardware/software interface.

4
Solution: Dynamic ISA Mapping
[Diagram: software in architected ISA (OS, drivers, library code, apps) → architected ISA (e.g. x86) → dynamic translation → implementation ISA (e.g. fusible ISA) → HW implementation (processors, memory system, I/O devices)]
  • ISA mapping:
  • Hardware-intensive translation: good for startup
    performance.
  • Software dynamic optimization: good for hotspots.
  • Can we combine the advantages of both?
  • Startup: fast, hardware-intensive translation.
  • Steady state: intelligent translation /
    optimization for hotspots.

5
Key: Efficient Binary Translation
  • Startup curves for Windows workloads

6
Issue: Bad-Case Scenarios
  • Short-running, fine-grain cooperating tasks
  • Performance lost to slow startup cannot be
    compensated for before the tasks end.
  • Real-time applications
  • Real-time constraints can be compromised by the
    slow translation process.
  • Multi-tasking, server-like applications
  • Frequent context switches between
    resource-competing tasks.
  • Limited code cache size causes re-translations
    when tasks are switched in and out.
  • OS boot-up / shutdown (client, mobile platforms)

7
Related Work: State-of-the-Art VMs
  • Pioneer: IBM System/38, AS/400
  • Products: Transmeta x86 processors
  • Code Morphing Software + VLIW engine.
  • Crusoe → Efficeon.
  • Research projects: IBM DAISY, BOA
  • Full-system translator + VLIW engine.
  • DBT overhead: ~4000 PowerPC instructions per
    translated instruction.
  • Other research projects: DBT for ILDP (H. Kim,
    J. E. Smith).
  • Dynamic binary translation / optimization
  • SW-based (often user mode only): UQBT, Dynamo
    (RIO), IA-32 EL; Java and .NET HLL VM runtime
    systems.
  • HW-based: trace cache fill units, rePLay, PARROT,
    etc.

8
Thesis Contributions
  • Efficient dynamic binary translation (DBT)
  • DBT runtime overhead modeling → translation
    strategy.
  • Efficient software translation algorithms.
  • Simple hardware accelerators for DBT.
  • Macro-op execution µ-arch (w/ Kim, Lipasti)
  • Higher IPC and higher clock speed potential.
  • Reduced complexity at critical pipeline stages.
  • An integrated co-designed x86 virtual machine
  • Superior steady-state performance.
  • Competitive startup performance.
  • Complexity-effective, power-efficient.

9
Outline
  • Introduction
  • The x86vm Infrastructure
  • Efficient Dynamic Binary Translation
  • DBT Modeling → Translation Strategy
  • Efficient DBT for the x86 (SW)
  • Hardware Assists for DBT (HW)
  • A Co-Designed x86 Processor
  • Conclusions

10
The x86vm Framework
  • Goal: experimental research infrastructure to
    explore the co-designed x86 VM paradigm.
  • Our co-designed x86 virtual machine design:
  • Software components: VMM.
  • Hardware components: microarchitecture timing
    simulators.
  • Internal implementation ISA.
  • Object-oriented design & implementation in C++.

11
The x86vm Framework
[Diagram: software in architected ISA (OS, drivers, library code, apps) issues x86 instructions to the architected ISA (e.g. x86) layer; BOCHS 2.2 and the x86vmm (DBT, VMM runtime software, code cache(s)) translate them into RISC-ops / macro-ops of the implementation ISA (e.g. fusible ISA), executed by the hardware model (microarchitecture timing simulator)]
12
Two-stage DBT system
  • VMM runtime
  • Orchestrates VM system execution.
  • Runtime resource management.
  • Precise state recovery.
  • DBT
  • BBT: basic block translator.
  • SBT: hot superblock translator & optimizer.
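The two-stage policy above can be sketched as a dispatch loop: every block starts in fast BBT form, and an execution counter promotes it to optimized SBT form once it crosses a hot threshold. This is a minimal illustrative model; the names (`HOT_THRESHOLD`, `bbt_translate`, `sbt_optimize`) and the threshold value are assumptions, not taken from the thesis.

```python
# Hypothetical sketch of a two-stage DBT dispatch policy:
# BBT-translate on first use, promote to SBT once hot.
HOT_THRESHOLD = 4096  # assumed value; real systems tune this empirically

class TwoStageDBT:
    def __init__(self, bbt_translate, sbt_optimize):
        self.bbt_translate = bbt_translate   # fast basic-block translator
        self.sbt_optimize = sbt_optimize     # hotspot superblock optimizer
        self.code_cache = {}                 # entry PC -> translated code
        self.exec_count = {}                 # entry PC -> execution counter

    def lookup(self, pc):
        # Count executions; promote to optimized translation when hot.
        self.exec_count[pc] = self.exec_count.get(pc, 0) + 1
        if pc not in self.code_cache:
            self.code_cache[pc] = self.bbt_translate(pc)
        elif self.exec_count[pc] == HOT_THRESHOLD:
            self.code_cache[pc] = self.sbt_optimize(pc)
        return self.code_cache[pc]
```

In a real VMM the counter update and cache lookup happen in the dispatch path between translated blocks, so keeping this check cheap is part of the startup-overhead story of the following slides.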

13
Evaluation Methodology
  • Reference/baseline: best-performing x86
    processors
  • Approximation to Intel Pentium M, AMD K7/K8.
  • Experimental data collection: simulation
  • Different models need different instantiations of
    x86vm.
  • Benchmarks
  • SPEC 2000 integer (SPEC2K). Binary generation:
    Intel C/C++ compiler, -O3 base optimization. Test
    data inputs, full runs.
  • Winstone2004 business suite (WSB2004):
    500-million x86 instruction traces for 10 common
    Windows applications.

14
x86 Binary Characterization
  • Instruction count expansion (x86 to RISC-ops)
  • ~40% for SPEC2K.
  • ~50% for Windows workloads.
  • Increases instruction management and
    communication.
  • Redundancy and inefficiency.
  • Code footprint expansion
  • Nearly double if cracked into 32b fixed-length
    RISC-ops.
  • 30-40% if cracked into 16/32b RISC-ops.
  • Affects fetch efficiency and memory hierarchy
    performance.

15
Overview of Baseline VM Design
  • Goal: demonstrate the power of a VM paradigm via
    a specific x86 processor design.
  • Architected ISA: x86.
  • Co-designed VM software: SBT, BBT and VM runtime
    control system.
  • Implementation ISA: fusible ISA (FISA).
  • Efficient microarchitecture enabled: macro-op
    execution.

16
Fusible Instruction Set
  • RISC-ops with unique features:
  • A fusible bit per instruction for fusing.
  • Dense encoding, 16/32-bit ISA.
  • Special features to support x86:
  • Condition codes.
  • Addressing modes.
  • Aware of long immediate values.

[Instruction formats: each instruction carries a fusible (F) bit. 32-bit formats combine a 10b or 16-bit opcode with 5b Rsrc / 5b Rds fields and a 10b immediate/displacement; 16-bit formats combine a 5b opcode with 5b Rds and either a 5b immediate or a 5b Rsrc.]
17
VMM: Virtual Machine Monitor
  • Runtime controller
  • Orchestrates VM operation: translation, translated
    code, etc.
  • Code cache(s) management
  • Holds translations; translation lookup,
    chaining, eviction policy.
  • BBT: initial emulation
  • Straightforward cracking, no optimization.
  • SBT: hotspot optimizer
  • Fuses dependent instruction pairs into macro-ops.
  • Precise state recovery routines.
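The code-cache duties bulleted above (lookup, chaining, eviction) can be sketched in a few lines. This is a hypothetical simplification: real VMMs typically evict in large granularities rather than per block, and chaining patches actual branch instructions rather than a dictionary.

```python
# Minimal sketch of code-cache management: lookup by source PC,
# chaining of translation exits, and FIFO eviction when full.
from collections import OrderedDict

class CodeCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # source PC -> {"code":..., "exits":...}

    def insert(self, pc, code):
        if len(self.blocks) >= self.capacity:
            _, evicted = self.blocks.popitem(last=False)  # FIFO eviction
            # Unchain any exit that pointed at the evicted translation.
            for blk in self.blocks.values():
                for tgt, dest in list(blk["exits"].items()):
                    if dest is evicted:
                        blk["exits"][tgt] = None  # fall back to VMM lookup
        self.blocks[pc] = {"code": code, "exits": {}}

    def chain(self, from_pc, exit_target):
        # Patch an exit of one translation to jump straight to another,
        # bypassing the VMM dispatch loop on later executions.
        if from_pc in self.blocks and exit_target in self.blocks:
            self.blocks[from_pc]["exits"][exit_target] = \
                self.blocks[exit_target]
            return True
        return False
```

The unchaining step on eviction is what makes precise state recovery and re-translation safe: no stale translation may be reachable once its cache space is reclaimed.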

18
Microarchitecture: Macro-op Execution
  • Enhanced OoO superscalar microarchitecture
  • Processes and executes fused macro-ops as single
    instructions throughout the entire pipeline.
  • Analogy: all lanes → car-pool lanes on a highway →
    reduce congestion with high throughput, AND raise
    the speed limit from 65 mph to 80 mph.
  • Joint work with I. Kim & M. Lipasti.

[Pipeline diagram: Fetch → Align / Fuse (fuse bit) → Decode → Rename → Dispatch → Wake-up → Select → RF → EXE (collapsed 3-1 ALUs) → MEM (cache ports) → WB → Retire]
19
Co-designed x86 pipeline front-end
20
Co-designed x86 pipeline backend
21
Outline
  • Introduction
  • The x86vm Infrastructure
  • Efficient Dynamic Binary Translation
  • DBT Modeling → Translation Strategy
  • Efficient DBT for the x86 (SW)
  • Hardware Assists for DBT (HW)
  • An Example Co-Designed x86 Processor
  • Conclusions

22
Performance: Memory Hierarchy Perspective
  • Disk Startup
  • Initial program startup, module or task reloading
    after swap.
  • Memory Startup
  • Long duration context switch, phase changes. x86
    code is still in memory, translated code is not.
  • Code Cache Transient / Startup
  • Short duration context switch, phase changes.
  • Steady State
  • Translated code is available and placed in the
    memory hierarchy.

23
Memory Startup Curves
24
Hotspot Behavior: WSB2004 (100M)
Hot threshold
25
Outline
  • Introduction
  • The x86vm Infrastructure
  • Efficient Dynamic Binary Translation
  • DBT Modeling → Translation Strategy
  • Efficient DBT Software for the x86
  • Hardware Assists for DBT (HW)
  • A Co-Designed x86 Processor
  • Conclusions

26
(Hotspot) Translation Procedure
1. Translation unit formation: form hotspot
   superblock.
2. IR generation: crack x86 instructions into
   RISC-style micro-ops.
3. Machine state mapping: perform cluster analysis
   of embedded long immediate values and assign them
   to registers if necessary.
4. Dependency graph construction for the superblock.
5. Macro-op fusing algorithm: scan looking for
   dependent pairs to be fused. Forward scan,
   backward pairing. Two-pass fusing to prioritize
   ALU ops.
6. Register allocation: re-order fused dependent
   pairs together, extend live ranges for precise
   traps, use consistent state mapping at superblock
   exits.
7. Code generation to code cache.
27
Macro-op Fusing Algorithm
  • Objectives:
  • Maximize fused dependent pairs.
  • Simple & fast.
  • Heuristics:
  • Pipelined issue logic: only single-cycle ALU ops
    can be a head; minimize non-fused single-cycle
    ALU ops.
  • Criticality: fuse instructions that are close
    in the original sequence. ALU-op criticality is
    easier to estimate.
  • Simplicity: 2 or fewer distinct register operands
    per fused pair.
  • Two-pass fusing algorithm:
  • The 1st pass, a forward scan, prioritizes ALU ops:
    for each ALU-op tail candidate, look backward in
    the scan for its head.
  • The 2nd pass considers all kinds of RISC-ops as
    tail candidates.
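The two-pass scan above can be sketched as follows. This is an illustrative reduction of the heuristics to code, not the thesis implementation: instructions are modeled as `(op_class, dest, srcs)` tuples, only single-cycle ALU ops may head a pair, pairs are limited to two distinct source registers, and each pass scans forward over tail candidates while pairing backward to the nearest eligible head.

```python
# Hypothetical simplification of the two-pass macro-op fusing scan.
def fuse(insns, single_cycle={"ALU"}):
    fused_head = [None] * len(insns)   # tail index -> head index
    used = set()

    def try_pair(tail_idx, tail_classes):
        op, dest, srcs = insns[tail_idx]
        if op not in tail_classes or tail_idx in used:
            return
        # Backward pairing: nearest earlier producer that can head.
        for head_idx in range(tail_idx - 1, -1, -1):
            h_op, h_dest, h_srcs = insns[head_idx]
            if head_idx in used or h_op not in single_cycle:
                continue
            if h_dest not in srcs:
                continue                     # not a dependent pair
            distinct = set(h_srcs) | (set(srcs) - {h_dest})
            if len(distinct) > 2:
                continue                     # too many source operands
            fused_head[tail_idx] = head_idx
            used.update({head_idx, tail_idx})
            return

    for i in range(len(insns)):              # pass 1: ALU tails first
        try_pair(i, single_cycle)
    for i in range(len(insns)):              # pass 2: any tail kind
        try_pair(i, {"ALU", "MEM", "BR"})
    return fused_head
```

Running pass 1 first is what "prioritizes ALU ops": an ALU op claimed as a tail in pass 1 can no longer be consumed as a head or tail by a memory op in pass 2.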

28
Dependence Cycle Detection
  • All cases are generalized to (c) due to Anti-Scan
    Fusing Heuristic.

29
Fusing Algorithm Example
x86 asm:
-----------------------------------------------------
1. lea   eax, [DS:edi + 01]
2. mov   [DS:080b8658], eax
3. movzx ebx, [SS:ebp + ecx << 1]
4. and   eax, 0000007f
5. mov   edx, [DS:eax + esi << 0 + 0x7c]

RISC-ops:
-----------------------------------------------------
1. ADD   Reax, Redi, 1
2. ST    Reax, mem[R14]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND   Reax, 0000007f
5. ADD   R11, Reax, Resi
6. LD    Redx, mem[R11 + 0x7c]

After fusing: macro-ops
-----------------------------------------------------
1. ADD   R12, Redi, 1  ::  AND Reax, R12, 007f
2. ST    R12, mem[R14]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD   R11, Reax, Resi  ::  LD Redx, mem[R11 + 0x7c]

30
Instruction Fusing Profile (SPEC2K)
31
Instruction Fusing Profile (WSB2004)
32
Macro-op Fusing Profile
  • Of all fused macro-ops (SPEC / WSB):
  • 52% / 43% are ALU-ALU pairs.
  • 30% / 35% are fused condition test +
    conditional branch pairs.
  • 18% / 22% are ALU-MEM op pairs.
  • Of all fused macro-ops:
  • 70% are inter-x86-instruction fusions.
  • 46% access two distinct source registers.
  • 15% (i.e. 6% of all instruction entities) write
    two distinct destination registers.
33
DBT Software Runtime Overhead Profile
  • Software BBT profile
  • BBT overhead Δ_BBT: about 105 FISA instructions
    (85 cycles) per translated x86 instruction.
    Mostly for decoding and cracking.
  • Software SBT profile
  • SBT overhead Δ_SBT: about 1000 instructions per
    translated hotspot instruction.

34
Outline
  • Introduction
  • The x86vm Infrastructure
  • Efficient Dynamic Binary Translation
  • DBT Modeling → Translation Strategy
  • Efficient DBT for the x86 (SW)
  • Hardware Assists for DBT
  • A Co-Designed x86 Processor
  • Conclusions

35
Principles for Hardware Assist Design
  • Goals:
  • Reduce VMM runtime software overhead
    significantly.
  • Maintain HW complexity-effectiveness & power
    efficiency.
  • HW & SW simplify each other → synergetic
    co-design.
  • Observations (analytic modeling & simulation):
  • High-performance startup (non-hotspot) is
    critical.
  • Hotspot code is usually a small fraction of the
    overall footprint.
  • Approach: BBT accelerators
  • Front-end: dual-mode decoders.
  • Backend: HW assist functional unit(s).

36
Dual Mode CISC (x86) Decoders
[Diagram: two-stage decode — an x86 instruction passes through a µ-op decoder that produces the opcode, operand designators, and other pipeline control signals]
  • Basic idea: a 2-stage decoder for a CISC ISA.
  • First published in Motorola 68K processor papers.
  • Breaks a monolithic complex decoder into two
    separate, simpler decoder stages.
  • Dual-mode CISC decoders:
  • CISC (x86) instructions pass through both stages.
  • Internal RISC-ops pass through only the second
    stage.

37
Dual Mode CISC (x86) Decoders
  • Advantages:
  • High-performance startup → similar to a
    conventional superscalar design.
  • No code cache needed for non-hotspot code.
  • Smooth transition from conventional superscalar
    design.
  • Disadvantages:
  • Complexity: an n-wide machine needs n such
    decoders.
  • But so does a conventional design.
  • Less power-efficient (than other VM schemes).

38
Hardware Assists as Functional Units
39
Hardware Assists as Functional Units
  • Advantages:
  • High-performance startup.
  • Power-efficient.
  • Programmability and flexibility.
  • Simplicity: only one simplified decoder needed.
  • Disadvantages:
  • Runtime overhead: reduces Δ_BBT from 85 to about
    20 cycles, but still involves some translation
    overhead.
  • Memory space overhead: some extra code cache
    space for cold code.
  • Must use the VM paradigm; more risky than the
    dual-mode decoder.

40
Machine Startup Models
  • Ref: Superscalar
  • Conventional processor design as the baseline.
  • VM.soft
  • Software BBT and hotspot optimization via SBT.
  • State-of-the-art VM design.
  • VM.be
  • BBT accelerated by backend functional unit(s).
  • VM.fe
  • Dual-mode decoder at the pipeline front-end.

41
Startup Evaluation: Hardware Assists
[Chart: cumulative x86 IPC (normalized, 0 to 1.1) vs. time in cycles (1 to 100,000,000, log scale) for Ref: Superscalar, VM.soft, VM.be, VM.fe, and VM.steady-state]
42
Activity of HW Assists
43
Related Work on HW Assists for DBT
  • HW assists for profiling:
  • Profile buffer [Conte96], BBB hotspot detector
    [Merten99], programmable HW path profiler
    [CGO05], etc.
  • Profiling co-processor [Zilles01].
  • Many others.
  • HW assists for general VM technology:
  • System VM assists: Intel VT, AMD Pacifica.
  • Transmeta Efficeon processor: a new execute
    instruction to accelerate interpretation.
  • HW assists for translation:
  • Trace cache fill unit [Friendly98].
  • Customized buffers and instructions: rePLay,
    PARROT.
  • Instruction path co-processor [Zhou00], etc.

44
Outline
  • Introduction
  • The x86vm Framework
  • Efficient Dynamic Binary Translation
  • DBT Modeling → Translation Strategy
  • Efficient DBT for the x86 (SW)
  • Hardware Assists for DBT (HW)
  • A Co-Designed x86 Processor
  • Conclusions

45
Co-designed x86 processor architecture
[Diagram: co-designed x86 processor — x86 code lives in the memory hierarchy; (1) a vertical x86 decoder and (2) a horizontal micro-/macro-op decoder feed rename/dispatch and the issue buffer of the pipeline EXE backend; the I-cache holds translated (macro-op) code]
  • Co-designed virtual machine paradigm:
  • Startup: simple hardware decode/crack for fast
    translation.
  • Steady state: dynamic software
    translation/optimization for hotspots.

46
Processor Pipeline
[Diagram annotations: reduced instruction traffic throughout, reduced forwarding, pipelined scheduler]
  • Macro-op pipeline for efficient hotspot execution:
  • Executes macro-ops.
  • Higher IPC and higher clock speed potential.
  • Shorter pipeline front-end.

47
Performance Simulation Configuration
48
Processor Evaluation SPEC2K
49
Processor Evaluation WSB2004
50
Performance Contributors
  • Many factors contribute to the IPC performance
    improvement:
  • Code straightening.
  • Macro-op fusing and execution.
  • Shortened pipeline front-end (reduced branch
    penalty).
  • Collapsed 3-1 ALUs (resolve branches and
    addresses sooner).
  • Besides the baseline and macro-op models, we model
    three intermediate configurations:
  • M0: baseline + code cache.
  • M1: M0 + macro-op fusing.
  • M2: M1 + shorter pipeline front-end (macro-op
    mode).
  • Macro-op: M2 + collapsed 3-1 ALUs.

51
Performance Contributors SPEC2K
52
Performance Contributors WSB2004
53
Outline
  • Introduction
  • The x86vm Framework
  • Efficient Dynamic Binary Translation
  • DBT Modeling → Translation Strategy
  • Efficient DBT for the x86 (SW)
  • Hardware Assists for DBT (HW)
  • A Co-Designed x86 Processor
  • Conclusions

54
Conclusions
  • The co-designed virtual machine paradigm:
  • Capability:
  • Binary compatibility.
  • Functionality integration.
  • Dynamic functionality upgrading.
  • Performance:
  • Enables novel, efficient architectures.
  • Superior steady state, competitive startup.
  • Complexity effectiveness:
  • More flexibility for processor design, etc.
  • Power efficiency.

55
Conclusions
  • The co-designed x86 processor:
  • Capability:
  • Full x86 compatibility.
  • Performance and power efficiency:
  • The macro-op execution engine & DBT are an
    efficient combination.
  • Complexity effectiveness:
  • VMM → DBT SW removes considerable HW complexity.
  • µ-Arch → can reduce pipeline width.
  • Simplified 2-cycle pipelined scheduler.
  • Simplified operand forwarding network.

56
Finale: Questions & Answers
Suggestions and comments are welcome. Thank you!
57
Acknowledgement
  • Prof. James E. Smith (Advisor)
  • Prof. Mikko H. Lipasti
  • Dr. Ho-Seop Kim
  • Dr. Ilhyun Kim
  • Mr. Wooseok Chang
  • Wisconsin Architecture Group
  • Funding: NSF, IBM & Intel
  • My Family

58
Objectives for Computer Architects
  • Efficient designs for computer systems:
  • More benefits.
  • Fewer costs.
  • Specifically, three dimensions:
  • Capability → practically, what software code can
    run.
  • High performance → higher IPC, higher clock
    speeds.
  • Simplicity / complexity-effectiveness → less
    cost, more reliable; also → low power consumption.

59
VM Software
  • VM mode:
  • VM runtime.
  • Initial emulation of the architected ISA binary.
  • DBT translation.
  • Exception handling.
  • Translated native mode:
  • Hotspot code executed in the code cache, chained
    together.

60
Fuse Macro-ops: An Illustrative Example
61
Why DBT Modeling?
  • Understand translation overhead dynamics:
  • Profiling only gives samples.
  • Translation strategy to reduce runtime overhead:
  • HW / SW division, collaboration.
  • Translation stages, trigger mechanisms.
  • Hot threshold setting:
  • Too low: excessive false positives.
  • Too high: lost performance benefits.

[Chart: execution time vs. hot threshold]
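The threshold trade-off above reduces to a breakeven argument: promoting a block to SBT pays off only when its remaining executions save more cycles than the one-time SBT cost. A hedged numeric sketch, with all constants as illustrative placeholders rather than measurements from the thesis:

```python
# Breakeven reasoning for the hot threshold: optimization wins once
# future execution count * per-execution savings > one-time SBT cost.
def breakeven_threshold(sbt_overhead_cycles,
                        cycles_per_exec_bbt,
                        cycles_per_exec_sbt):
    # Smallest future execution count at which optimizing wins.
    savings = cycles_per_exec_bbt - cycles_per_exec_sbt
    if savings <= 0:
        return float("inf")      # optimization never pays off
    return sbt_overhead_cycles / savings

# Example (assumed numbers): 1000-cycle SBT cost and 1.0 vs 0.7
# cycles per executed instruction give a breakeven near 3333 runs.
```

Setting the threshold below this point optimizes false-positive hotspots at a loss; setting it far above forfeits savings on true hotspots, which is exactly the shape of the curve the chart sketches.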
62
Modeling via Memory Hierarchy Perspective

63
Analytic Modeling: DBT Overhead
[Chart: performance vs. translation complexity — from raw x86 code through BBT and SBT operating points, characterized by M_BBT · Δ_BBT and M_SBT · Δ_SBT, against the Ref: Superscalar baseline]
64
Analytic Modeling: Hot Threshold
[Chart: translation overhead vs. execution frequency — the hot threshold marks the crossover where SBT's higher translation cost is justified over BBT]
65
Modeling Evaluation
66
Modeling Evaluation
67
Model Implications
  • Translation overhead:
  • Mainly caused by initial emulation.
  • Hotspot translation overhead is not significant
    (or can always be beneficial) if no translation
    is allowed for false-positive hotspots.
  • Hotspot detection:
  • The model explains many empirical threshold
    settings in the literature.
  • Excessively low thresholds → excessive
    false-positive hotspots → excessive hotspot
    optimization overhead.

68
Efficient Dynamic Binary Translation
  • Efficient binary translation:
  • Strategy: adaptive runtime optimization.
  • Efficient algorithms for translation &
    optimization.
  • Cooperation with hardware accelerators.
  • Efficient translated native code:
  • Generic optimizations: native register mapping &
    long-immediate-value register allocation.
  • Implementation ISA optimizations: enable new
    microarchitecture execution → macro-op fusing.
  • Architected ISA optimizations: runtime
    x86-specific optimizations.

69
Register Mapping & LIMM Conversion
  • Objectives:
  • Efficient emulation of the x86 ISA on the fusible
    ISA.
  • 32b long immediate values embedded in the x86
    binary are problematic for a 32b instruction set
    → remove them.
  • Solution: register mapping
  • Map all x86 GPRs to the lower 16 R registers in
    the fusible ISA.
  • Map all x86 FP/multimedia registers to the lower
    24 F registers in the fusible ISA.
  • Solution: long immediate conversion
  • Scan the superblock looking for all long
    immediate values.
  • Perform value clustering analysis and allocate
    registers to frequent long immediate values.
  • Convert some x86 embedded long immediate values
    into a register access, or a register plus a
    short immediate that can be handled in the
    implementation ISA.
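The long-immediate conversion steps above can be sketched as follows. This is a hypothetical reduction: the "cluster analysis" is simplified to a frequency count plus a short-displacement check, and the register count and 10-bit displacement range are assumptions about the fusible ISA, not specifics from the thesis.

```python
# Sketch of LIMM conversion: allocate registers to frequent 32b
# immediates, rewrite others as base register + short displacement.
from collections import Counter

SHORT_IMM_MAX = 1023      # assumed 10-bit displacement range

def convert_long_imms(imms, num_regs=4):
    counts = Counter(imms)
    # Allocate registers to the most frequent immediate values.
    base_values = [v for v, _ in counts.most_common(num_regs)]
    mapping = {}          # immediate -> (base register index, displacement)
    for imm in counts:
        for idx, base in enumerate(base_values):
            if 0 <= imm - base <= SHORT_IMM_MAX:
                mapping[imm] = (idx, imm - base)
                break     # immediate reachable as base + short disp
    return base_values, mapping
```

Immediates left out of `mapping` would still need a multi-instruction materialization sequence in the implementation ISA; the point of the clustering is to make that case rare for hotspot code.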

70
Code Re-ordering Algorithm
  • Problem: a new code scheduling algorithm is
    needed to group dependent instructions together.
  • Modern compilers schedule independent
    instructions apart, not dependent pairs together.
  • Idea: partition MidSet → PreSet + PostSet
  • PreSet: all instructions that must be moved
    before the head.
  • MidSet: all instructions in the middle, between
    the head and tail.
  • PostSet: all instructions that can be moved after
    the tail.
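The partitioning above can be sketched as code: to place a dependent head/tail pair back-to-back, every MidSet instruction must legally move either before the head (PreSet) or after the tail (PostSet). This is a hypothetical simplification using only register dependences (no memory ordering, no register redefinition), with instructions as `(dest, srcs)` tuples.

```python
# Sketch of PreSet / PostSet partitioning for a head/tail pair.
def partition_midset(insns, head, tail):
    # insns: list of (dest, srcs); indices satisfy head < tail.
    def producers(i, upto):
        # Indices in (head, upto) whose results flow into insn i.
        deps, work = set(), list(insns[i][1])
        for j in range(upto - 1, head, -1):
            dest, srcs = insns[j]
            if dest in work:
                deps.add(j)
                work.extend(srcs)   # follow the dependence transitively
        return deps

    pre = producers(tail, tail)      # tail depends on these: move early
    post = set()                     # these depend on the head: move late
    live = {insns[head][0]}
    for j in range(head + 1, tail):
        dest, srcs = insns[j]
        if live & set(srcs):
            post.add(j)
            live.add(dest)
    if pre & post:
        return None                  # pairing impossible: conflicting needs
    return pre, post
```

An instruction required in both sets would have to execute both before the head and after the tail, so the pairing is rejected; this corresponds to the dependence-cycle cases the earlier slide generalizes.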

71
Code Re-ordering Example
In RISC-ops w/ fusing info.
Out macro-ops sequence
72
BBT overhead profile
  • Distributed evenly among fetch/decode, semantic
    routines, and the x86 crack loop.
  • BBT: ~100 instructions of overhead per translated
    x86 instruction, similar to a main memory access.

73
HST back-end profile
  • Light-weight opts (ProcLongImm, DDG setup,
    encode): tens of instructions each per x86
    instruction; overhead dominated by the initial
    load from disk.
  • Heavy-weight opts (µ-op translation, fusing,
    codegen): none dominates.

74
HW Assisted DBT Overhead (100M)
75
Breakeven Points for Individual Benchmarks
76
Hotspot Detected vs. runs
77
Hotspot Coverage vs. threshold
78
Hotspot Coverage vs. runs
79
DBT Complexity/Overhead Trade-off
80
Performance Evaluation SPEC2000
81
Co-designed x86 processor
  • Architecture enhancement:
  • A hardware/software co-designed paradigm enables
    novel designs & more desirable system features.
  • Fusing dependent instruction pairs → collapses
    the dataflow graph to reduce instruction
    management and inter-instruction communication.
  • Complexity effectiveness:
  • Pipelined 2-cycle instruction scheduler.
  • Significantly reduced ALU value forwarding
    network.
  • DBT software removes a lot of hardware
    complexity.
  • Power consumption implications:
  • Reduced pipeline width.
  • Reduced inter-instruction communication and
    instruction management.

82
Processor Design Challenges
  • CISC challenges: suboptimal internal micro-ops.
  • Complex decoders & obsolete
    features/instructions.
  • Instruction count expansion → management,
    communication overhead.
  • Redundancy & inefficiency in the cracked
    micro-ops.
  • Solution: dynamic optimization.
  • Other current challenges (CISC & RISC):
  • Efficiency (nowadays, less performance gain per
    transistor).
  • Power consumption has become acute.
  • Solution: novel, efficient microarchitectures.

83
Superscalar x86 Pipeline Challenges
[Pipeline diagram: Fetch → Align → x86 Decode1 / Decode2 / Decode3 → Rename → Dispatch → atomic scheduler (wake-up & select) → RF → EXE → WB → Retire]
  • Best-performing x86 processors, BUT
  • In general, several critical pipeline stages:
  • Branch behavior → fetch bandwidth.
  • Complex x86 decoders at the pipeline front-end.
  • Complex issue logic: wake-up and select in the
    same cycle.
  • Complex operand forwarding networks → wire
    delays.
  • Instruction count expansion: high pressure on
    instruction management and communication
    mechanisms.

84
Related Work: x86 processors
  • AMD K7/K8 microarchitecture:
  • Macro-operations.
  • High-performance, efficient pipeline.
  • Intel Pentium M:
  • Micro-op fusion.
  • Stack manager.
  • High performance, low power.
  • Transmeta x86 processors:
  • Co-designed x86 VM.
  • VLIW engine + code morphing software.

85
Future Directions
  • Co-designed virtual machine technology:
  • Confidence: a more realistic, exhaustive
    benchmark study is important for whole-workload
    behavior.
  • Enhancement: more synergetic,
    complexity-effective HW/SW co-design techniques.
  • Application: specific enabling techniques for
    specific novel computer architectures of the
    future.
  • Example co-designed x86 processor design:
  • Confidence: study as above.
  • Enhancement: HW µ-arch → reduce register write
    ports; VMM → more dynamic optimizations in SBT,
    e.g. CSE, software stack manager,
    SIMDification.