Using Dynamic Binary Translation to Fuse Dependent Instructions - PowerPoint PPT Presentation

Slides: 27
Provided by: shili5

1
Using Dynamic Binary Translation to Fuse
Dependent Instructions
  • Shiliang Hu and James E. Smith

2
Outline
  • Introduction
  • Fused Instruction Set
  • Fusing Algorithm
  • Evaluation
  • Conclusion

3
MicroArchitecture Model
  • Dependence-based architectures: ILDP (ISCA02),
    etc.
  • Fuse dependent instruction pairs so that they are
    processed as if they were single instructions in
    the processor pipeline
  • Both higher IPC and deeper pipelining can be
    achieved simultaneously.
  • Original proposal by I. Kim and M. Lipasti:
    Macro-op Scheduling, MICRO03 (hardware-intensive,
    RISC)
  • Other related work on pipelined scheduling logic.

4
Pipelined Scheduling Window
  • The critical path is the select-wakeup loop for
    single-cycle instructions. If the producer has
    latency > 1, wakeup can be done a cycle late, so
    wakeup and select can sit in different pipe stages

[Figure: Select and Wakeup stage timing for a chain of dependent instructions: a load into Reax from memory addressed by Resi and 4, an op on Reax and 7 writing Reax, an op on Reax and Rebx writing Rebx, and an op on Rebx and 4 writing Recx.]

5
Performance Implications
  • Pros
  • Effectively larger scheduling window by holding
    two instructions in the same window slot.
  • Effectively wider issue by issuing one slot with
    two fused instructions, two dependent
    instructions are kicked off for execution with a
    single issue decision.
  • Can pipeline the scheduling logic without a heavy
    penalty if there is high fusing rate.
  • Cons
  • Non-fused single cycle instructions have two
    cycle latency.
  • If the head (the 1st instr. in the pair) provides
    a value to another critical consumer, that
    consumer is delayed; however, most values are
    consumed only once.
  • If the tail (the 2nd instr. in the pair) has a
    critical dependence, slows down the wakeup of the
    pair.
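
The trade-off listed above can be illustrated with a toy latency model. This is only a sketch under assumptions that are not in the slides: a single chain of dependent single-cycle ALU ops, a two-stage pipelined wakeup/select loop, and a fused pair that occupies one scheduling slot and completes both of its ops within that slot's two cycles. The name `chain_cycles` is hypothetical.

```python
def chain_cycles(n_ops: int, pipelined: bool, fused_pairs: int = 0) -> int:
    """Cycles needed to run a chain of n_ops dependent single-cycle ALU ops.

    Assumed toy model (not taken from the slides):
      - unpipelined scheduler: one dependent op can issue every cycle
      - pipelined scheduler: one dependent scheduling slot every 2 cycles
      - a fused pair is one slot and finishes both ops in those 2 cycles
    """
    if fused_pairs < 0 or 2 * fused_pairs > n_ops:
        raise ValueError("fused_pairs must cover at most n_ops instructions")
    if not pipelined:
        return n_ops
    unfused_ops = n_ops - 2 * fused_pairs
    # Each fused pair costs 2 cycles for 2 ops; each unfused op costs 2 cycles.
    return 2 * fused_pairs + 2 * unfused_ops


if __name__ == "__main__":
    for pairs in (0, 2, 4):
        print(pairs, chain_cycles(8, pipelined=True, fused_pairs=pairs))
```

In this model a fully fused chain runs as fast on the pipelined scheduler as an unfused chain does on a single-cycle scheduler, while every op left unfused costs one extra cycle.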

6
Co-Designed Virtual Machine
  • Concurrently design ISA, microarchitecture, and
    dynamic binary translation (DBT) system
  • Examples: Transmeta Crusoe and Efficeon
    processors; IBM DAISY and BOA.
  • Our Design for x86
  • RISC-style implementation ISA with fuse bit
  • Fetch straightened code generated by fast DBT
  • Run on an enhanced dynamic superscalar

7
Implementation Instruction Set
  • Allocate 1-bit of each instruction, the fuse bit,
    to fuse two instructions in the pipeline
  • Dense instruction encoding: 16/32-bit instruction
    set design
  • Features specialized for efficient emulation of
    the x86 ISA: long immediates, condition codes,
    addressing modes, etc.

8
Fused Instruction Set

9
An Illustrative Example

10
Dynamic Binary Translation
  • Goals: simple, fast, and effective
  • Hot superblock detection and formation
  • Translation from x86 binary to fused instruction
    set
  • Code cache placement and linking among
    superblocks in the code cache

11
Hot Superblock Detection and Formation
  • Modified MRET (Most Recently Executed Tail):
    stop at indirect jumps. Threshold: 32. Max
    length: 256.
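
A minimal sketch of how such a detector could work. The threshold of 32, the length cap of 256, and stopping at indirect jumps come from the slide; everything else is an assumption for illustration: candidate heads are targets of backward taken branches, recording also ends at a backward taken branch, the length cap is counted in instructions, and the `Step` record and `form_superblocks` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

HOT_THRESHOLD = 32   # from the slide
MAX_LENGTH = 256     # from the slide; unit assumed to be instructions


@dataclass(frozen=True)
class Step:
    """One dynamically executed instruction (hypothetical record)."""
    pc: int
    next_pc: int
    is_branch: bool = False
    is_indirect: bool = False


def form_superblocks(stream: Iterable[Step]) -> Dict[int, List[int]]:
    """Return {head pc: list of pcs} for every superblock that became hot."""
    counters: Dict[int, int] = {}
    superblocks: Dict[int, List[int]] = {}
    recording: Optional[List[int]] = None
    for step in stream:
        backward = step.is_branch and step.next_pc <= step.pc
        if recording is not None:
            # Record the path executed right after the head became hot.
            recording.append(step.pc)
            if step.is_indirect or backward or len(recording) >= MAX_LENGTH:
                superblocks[recording[0]] = recording
                recording = None
        # Count candidate heads only while no recording is in progress.
        if recording is None and backward:
            head = step.next_pc
            if head in superblocks:
                continue
            counters[head] = counters.get(head, 0) + 1
            if counters[head] >= HOT_THRESHOLD:
                recording = []  # start recording with the next instruction
    return superblocks
```

Feeding it a four-instruction loop whose last instruction branches back to the first yields one superblock covering the loop body once the head has been reached often enough.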

12
Translation Procedure
Single Pass Algorithm
  1. Form superblocks using the modified MRET method
  2. Crack x86 instructions into RISC-like abstract
     micro-ops
  3. Perform cluster analysis of long immediates and
     assign them to registers
  4. Generate micro-ops in the implementation ISA
  5. Fusing algorithm: scan looking for dependent
     pairs to be fused (forward scan, backward
     pairing)
  6. Assign registers: extend live ranges for precise
     traps, use consistent state mapping at
     superblock exits
  7. Code generation

13
Cluster Analysis
  • Objectives
  • Remove embedded long immediates in x86 binary.
  • Reduce static and dynamic instructions.
  • Long Immediate Conversion.
  • Scan superblock looking for all long immediate
    values.
  • Perform value clustering analysis and allocate
    registers to frequent long immediate values.
  • Convert some x86 embedded long immediates into
    a register access, or a register plus a short
    immediate, that can be handled in the
    implementation ISA.
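
The conversion steps above can be sketched as a greedy clustering pass. This is an illustration, not the authors' algorithm: the short-immediate range, the default of three registers, the rule that a register must cover at least two uses, and the names `fits_short`, `coverage`, and `cluster_long_immediates` are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed short-immediate range of the implementation ISA (hypothetical).
SHORT_MIN, SHORT_MAX = -128, 127


def fits_short(value: int) -> bool:
    return SHORT_MIN <= value <= SHORT_MAX


def coverage(base: int, uses: Dict[int, int]) -> int:
    """Number of uses reachable as base + short immediate."""
    return sum(n for v, n in uses.items() if fits_short(v - base))


def cluster_long_immediates(
    immediates: List[int], max_regs: int = 3
) -> Dict[int, Tuple[int, int]]:
    """Map each converted long immediate to (register base value, offset)."""
    uses = dict(Counter(v for v in immediates if not fits_short(v)))
    bases: List[int] = []
    while uses and len(bases) < max_regs:
        # Greedily pick the value whose neighborhood covers the most uses.
        best = max(sorted(uses), key=lambda b: coverage(b, uses))
        if coverage(best, uses) < 2:
            break  # assumed: a register for a single use does not pay off
        bases.append(best)
        uses = {v: n for v, n in uses.items() if not fits_short(v - best)}
    mapping: Dict[int, Tuple[int, int]] = {}
    for value in set(immediates):
        if fits_short(value):
            continue  # short immediates need no conversion
        for base in bases:
            if fits_short(value - base):
                mapping[value] = (base, value - base)
                break
    return mapping
```

Each chosen base still has to be loaded into its register once, so the conversion only helps when a base is reused; values left out of the mapping stay as long immediates.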

14
Fusing Algorithm
  • Objectives
  • Maximize fused dependent pairs
  • Minimize non-fused single cycle ALU ops.
  • Heuristics
  • Only single cycle ALU ops can be a head.
  • Fuse instructions that are close in the original
    sequence cracked from x86 binary.
  • Fusing Algorithm
  • Single pass forward scan.
  • For each tail candidate, look backward in the
    scan for its head.
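
A minimal sketch of the forward-scan, backward-pairing idea. The single forward pass, the backward search for a head, and the rule that only single-cycle ALU ops can be heads come from the slide. The rest is assumed for illustration: micro-ops are register-only (memory ordering and branches are ignored), the tail is moved up next to its head, the search window `max_distance` stands in for the "close in the original sequence" heuristic, and `MicroOp`, `can_move_up`, and `fuse` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MicroOp:
    name: str
    dst: Optional[str]
    srcs: Tuple[str, ...]
    single_cycle_alu: bool = True


def can_move_up(ops: List[MicroOp], head_idx: int, tail_idx: int) -> bool:
    """Conservative register-hazard check for hoisting the tail to its head."""
    tail = ops[tail_idx]
    for mid in ops[head_idx + 1:tail_idx]:
        if mid.dst is not None and mid.dst in tail.srcs:
            return False  # tail reads a value produced in between
        if tail.dst is not None and (tail.dst in mid.srcs or tail.dst == mid.dst):
            return False  # tail would clobber a value still needed or rewritten
    return True


def fuse(ops: List[MicroOp], max_distance: int = 4) -> List[Tuple[int, int]]:
    """Return (head index, tail index) pairs chosen in one forward scan."""
    fused = [False] * len(ops)
    pairs: List[Tuple[int, int]] = []
    for t, tail in enumerate(ops):
        # Look backward, nearest candidate first, for an unfused head.
        for h in range(t - 1, max(-1, t - 1 - max_distance), -1):
            head = ops[h]
            if fused[h] or not head.single_cycle_alu or head.dst is None:
                continue
            if head.dst in tail.srcs and can_move_up(ops, h, t):
                fused[h] = fused[t] = True
                pairs.append((h, t))
                break
    return pairs
```

Because an op is marked as fused as soon as it becomes a tail, it can never be picked later as the head of another pair, which keeps every op in at most one pair.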

15
Dependence Cycle Detection
  • All cases are generalized in (d) due to Anti-Scan
    Fusing Heuristic

16
Dynamic x86 Superblock Size
  • Average superblock size is about 15 x86
    instructions, 20 RISC ops.
  • String instructions are common in some x86
    applications.

17
Static Translation Size
  • Variable-length ISA is only about 33% bigger than
    the x86 binary
  • Fixed-length ISA is 60% to 120% bigger than the
    original x86 binary.

18
Long Immediate Values Converted
  • Intra superblock conversion for now.
  • Address Displacement is easier to convert, but
    not the general long immediate values.

19
Registers For Long Immediate
  • Two or three registers are enough for 95% of
    dynamic superblocks.
  • Most SPEC2000INT benchmarks need no more than 5
    registers

20
Scheduling Density
  • Consistently high fusing rate across SPEC2000INT
    benchmarks.
  • A scheduling density of 1.5 means more than 60%
    of instructions are fused
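
The link between scheduling density and fusing rate can be checked with a short calculation. It assumes, which the slide does not state explicitly, that scheduling density is the number of instructions per scheduling slot and that a fused pair takes one slot: if a fraction f of instructions sits in pairs, the slots per instruction are 1 - f/2, so density d = 1 / (1 - f/2) and f = 2 (1 - 1/d). The function name `fused_fraction` is hypothetical.

```python
def fused_fraction(scheduling_density: float) -> float:
    """Fraction of instructions in fused pairs for a given density.

    Assumes density = instructions / scheduling slots, with one slot per
    fused pair and one slot per unfused instruction.
    """
    if not 1.0 <= scheduling_density <= 2.0:
        raise ValueError("density must lie between 1 (no fusing) and 2 (all fused)")
    return 2.0 * (1.0 - 1.0 / scheduling_density)


if __name__ == "__main__":
    print(round(fused_fraction(1.5), 3))  # prints 0.667
```

Under that assumption a density of 1.5 corresponds to about two thirds of instructions being fused, which agrees with the "more than 60%" reading.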

21
Non-Fused Instruction Profile
  • Consistently low single cycle ALU leftovers
    across SPEC2000INT
  • (23%) × (35%) means single-cycle ALU ops are
    about 8% of all instructions.

22
Distance Distribution of Fused Pairs
  • Most pairs are consecutive or very close in the
    original sequence of RISC ops cracked from the
    x86 superblock.

23
Code Re-organization
  • More than 50% of pairs are across x86 instruction
    boundaries.
  • Pairs of single-cycle ALU ops are about 60%.

24
Source Register Operands
  • 99% of fusable pairs have no more than 3 source
    register operands.
  • 95% of fusable pairs have no more than 2 source
    register operands.

25
Conclusion
  • High degree of fusing in typical x86 binaries:
    60% of all dynamic instructions
  • Two source register operands are enough for 95%
    of fusable dependent pairs.
  • Non-fused instructions are mostly LD, ST, BR, FP
    and NOPs
  • Little impact from pipelined issue
  • Variable-length ISA improves code density by 30%
    in our case
  • Co-designed VM featuring fused instruction
    execution is promising
  • Future work: complete the co-designed
    microarchitecture

26
Backup: Dynamic Binary Translation
  • Start program execution by interpretation;
    identify hot (frequently executed) program
    paths
  • Translate hot paths into translation cache
  • If program control flow reaches already
    translated code, execute natively
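
The three steps above can be sketched as a dispatch loop. This is a toy model, not the authors' VMM: the class `ToyVMM`, its callbacks, and the per-entry counter policy are hypothetical, and the threshold of 32 is borrowed from the superblock-formation slide.

```python
from typing import Callable, Dict, List, Tuple

HOT_THRESHOLD = 32  # borrowed from the superblock-formation slide


class ToyVMM:
    """Interpret cold code, translate hot entry points, run translations."""

    def __init__(
        self,
        interpret_block: Callable[[int], int],           # pc -> next pc
        translate: Callable[[int], Callable[[], int]],   # pc -> native code
    ) -> None:
        self.interpret_block = interpret_block
        self.translate = translate
        self.counters: Dict[int, int] = {}
        self.code_cache: Dict[int, Callable[[], int]] = {}
        self.log: List[Tuple[str, int]] = []

    def run(self, pc: int, halt_pc: int) -> None:
        while pc != halt_pc:
            native = self.code_cache.get(pc)
            if native is not None:
                # Translation found: execute it and follow its exit.
                self.log.append(("native", pc))
                pc = native()
                continue
            self.counters[pc] = self.counters.get(pc, 0) + 1
            if self.counters[pc] >= HOT_THRESHOLD:
                # Hot entry point: translate it into the code cache.
                self.code_cache[pc] = self.translate(pc)
                self.log.append(("translate", pc))
                continue  # the next iteration finds the translation
            self.log.append(("interpret", pc))
            pc = self.interpret_block(pc)
```

A real system would also link superblocks inside the code cache so that control can pass between translations without returning to the dispatcher; the sketch leaves that out.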

[Figure: DBT (VMM) control flow among the Interpret, Translate, and Native execution states. Transition labels: target translation found; not found (call-DBT instruction); threshold; end of superblock, translation not found; end of superblock, translation found.]