Confessions of a RAMP Heretic: Fast, Full-System, Cycle-Accurate x86/PowerPC/ARM/Sparc Simulators

1
Confessions of a RAMP Heretic: Fast, Full-System,
Cycle-Accurate x86/PowerPC/ARM/Sparc Simulators
  • Derek Chiou
  • University of Texas at Austin
  • Electrical and Computer Engineering

2
FAST Goals
  • Fast: as fast as possible
      • 2-3 orders of magnitude slower than target?
      • Fast enough to run real datasets to completion
      • Interactive?
  • Accurate: produce cycle-accurate numbers for
    modern microprocessors (Pentium M)
  • Complete: run unmodified operating systems,
    applications, ISAs,
  • Transparent: full visibility, no performance hit
  • Inexpensive: need thousands
  • Usable: quick changes, use RTL to generate
  • I/O: the MOST important part of systems

3
Functional/Timing Partitioning
  • Proven Partitioning
      • Asim, SimpleScalar, Timing-First, Memoized, etc.
  • Simplifies simulator
  • Promotes reuse
  • Same performance in software
      • Asim at 10 kHz
      • Most of the time spent in timing model!
  • Hardware???

4
FAST
[Figure: block diagram with labels Functional Model (ISA), Inst stream, Timing Model (Micro-architecture), Full-System Simulator, FPGA]
  • Functional model could be
      • Pure software (QEMU, Bochs, Simics, SimNow)
          • Use JIT for performance, very fast
          • No better hardware for executing ISA than
            processor
          • Can operate under the covers (flush cache for
            example)
      • Pure hardware (Hoe et al.)
      • Hybrid (Hoe et al.)
  • Timing model: very simple hardware
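
The functional/timing split on this slide can be sketched in software. The block below is a minimal illustration, not the FAST implementation: the `TraceEntry` format, the single-issue in-order pipeline, and the 3-cycle load latency are all assumptions made for the example. The point it shows is that the timing side consumes only the instruction stream and register names, never data values.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class TraceEntry:
    """One instruction as handed to the timing model (hypothetical format)."""
    pc: int
    op: str                       # "alu" or "load" in this sketch
    dest: Optional[int]           # destination register, if any
    srcs: Tuple[int, ...] = ()    # source registers


def functional_model(program: List[TraceEntry]) -> Iterator[TraceEntry]:
    """Stand-in for the ISA simulator: it only emits the instruction stream."""
    for entry in program:
        yield entry


def timing_model(stream: Iterator[TraceEntry], load_latency: int = 3) -> int:
    """Return the issue cycle of the last instruction on a single-issue,
    in-order pipeline. Only dependences and latencies are tracked."""
    ready_at = {}   # register -> first cycle at which its value can be used
    cycle = 0
    for inst in stream:
        issue = cycle + 1                      # at most one issue per cycle
        for reg in inst.srcs:
            issue = max(issue, ready_at.get(reg, 0))   # interlock on sources
        latency = load_latency if inst.op == "load" else 1
        if inst.dest is not None:
            ready_at[inst.dest] = issue + latency
        cycle = issue
    return cycle


program = [
    TraceEntry(pc=0, op="load", dest=1),
    TraceEntry(pc=4, op="alu", dest=2, srcs=(1, 1)),   # waits for the load
    TraceEntry(pc=8, op="alu", dest=3, srcs=(2, 0)),
]
cycles = timing_model(functional_model(program))
```

Because the two halves share only the stream, either one can be swapped (for example, a different ISA simulator in front of the same timing model) without touching the other.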

5
What is a FAST Timing Model?
[Figure: pipeline diagram. Recoverable labels: PC, Instruction Memory, IR, GPR File, Immed. Extend, ALU, Data Memory, Bypass/interlock, MD1, MD2, Trace]

6
More Complexity
  • Caches/TLBs?
      • Keep tags, pass address (virtual and physical if
        necessary)
      • Hits, misses determined but don't need data
  • Superscalar (multiple issue)?
      • Fetch and issue multiple instructions assuming
        they meet boundary constraints
      • Multiple functional units
      • Schedulers
      • Reorder buffer/instruction window
      • Pipeline control along with instructions
  • NO DATAPATH (and only part of control path)!!!!
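
The "keep tags, don't need data" idea can be illustrated with a tag-only cache model. This is a sketch under stated assumptions: the class name `TagOnlyCache` is hypothetical and LRU replacement is assumed, since the slides do not name a replacement policy.

```python
class TagOnlyCache:
    """Set-associative cache model that stores tags but no data.
    LRU replacement is an assumption of this sketch."""

    def __init__(self, size_bytes: int, ways: int, line_bytes: int):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        # Each set is a list of tags, least recently used first.
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit; update tag state either way."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)          # mark as most recently used
            self.hits += 1
            return True
        if len(tags) == self.ways:
            tags.pop(0)               # evict the least recently used tag
        tags.append(tag)
        self.misses += 1
        return False


cache = TagOnlyCache(size_bytes=32 * 1024, ways=4, line_bytes=16)
first = cache.access(0x0000)      # cold miss
second = cache.access(0x0004)     # same 16-byte line, so a hit
```

Hit/miss outcomes are all a timing model needs from a cache, which is why the data array can be left out entirely.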

7
Driving a Timing Model
8
Complexity: BP
  • Wrong-path instructions!
  • Implement BP (branch prediction) in timing model
  • Timing model forces ISA simulator to
    mis-speculate
  • Rollback, restore
  • BP only works in processor if it's fairly
    accurate
  • Degrades to trace-driven!
  • FAST simulators take advantage of the fact that
    most of the time the micro-architecture is on the
    right path
  • Most complexity (BP, parallelism) can be handled
    this way
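
As a rough illustration of keeping the predictor in the timing model, the sketch below runs a table of 2-bit saturating counters over the branch outcomes reported by the functional model and lists the dynamic branches where the two disagree. Those are the points at which the timing model would need wrong-path instructions. The table size, the initial counter value, and the `(inst_num, pc, taken)` tuple format are assumptions for the example.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by PC.
    Table size and initial state are assumptions of this sketch."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.counters = [1] * entries   # 0,1 predict not-taken; 2,3 taken

    def predict(self, pc: int) -> bool:
        return self.counters[pc % self.entries] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def find_mispredictions(branches, predictor):
    """branches: iterable of (inst_num, pc, taken) from the functional model.
    Returns the inst_nums where the prediction differs from the outcome."""
    wrong = []
    for inst_num, pc, taken in branches:
        if predictor.predict(pc) != taken:
            wrong.append(inst_num)
        predictor.update(pc, taken)
    return wrong


# A loop branch at one PC: taken four times, then falls through.
loop = [(n, 0x40, n < 4) for n in range(5)]
mispredicted = find_mispredictions(loop, TwoBitPredictor())
```

Only the two mispredicted branches require the functional model to leave the right path; the other three proceed without any rollback, which is the property the slide relies on.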

9
Parallelism: Detect Problem, Rollback
[Figure: four functional model (FM) blocks, four timing model (TM) blocks, Memory, Network, Memory Model]

10
Functional Model Rollback
  • Need to
      • Rollback, force branch
      • Rollback, restore and continue
  • How?
      • set_pc(inst_num, pc)
          • Set a particular dynamic instance of an
            instruction to a particular instruction pointed
            to by PC
          • Sufficient
  • Currently implemented with checkpoints
      • ISA state, memory, peripherals
  • Works for parallelism too
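
A toy version of `set_pc(inst_num, pc)` built on checkpoints might look like the following. Only the call name and its two arguments come from the slide; the `ToyFunctionalModel` class, its one-register ISA, and the per-instruction checkpoint granularity are assumptions. A real checkpoint would also cover memory and peripherals, as the slide notes.

```python
class ToyFunctionalModel:
    """Toy ISA simulator that checkpoints state before every dynamic
    instruction. The ISA (one accumulator, addi/bnez) is hypothetical."""

    def __init__(self, program):
        self.program = program      # pc -> (op, arg)
        self.pc = 0
        self.acc = 0                # the only architectural register
        self.inst_num = 0           # number of dynamic instructions executed
        self.checkpoints = {}       # inst_num -> (pc, acc) before it ran

    def step(self):
        self.checkpoints[self.inst_num] = (self.pc, self.acc)
        op, arg = self.program[self.pc]
        if op == "addi":
            self.acc += arg
            self.pc += 1
        elif op == "bnez":          # branch to arg if acc != 0
            self.pc = arg if self.acc != 0 else self.pc + 1
        self.inst_num += 1

    def set_pc(self, inst_num, pc):
        """Roll back so dynamic instruction inst_num is fetched from pc."""
        _, acc = self.checkpoints[inst_num]
        self.acc = acc
        self.pc = pc
        self.inst_num = inst_num
        # Drop checkpoints that belong to squashed instructions.
        self.checkpoints = {n: s for n, s in self.checkpoints.items()
                            if n <= inst_num}
        self.checkpoints[inst_num] = (pc, acc)


fm = ToyFunctionalModel({0: ("addi", 1), 1: ("bnez", 3),
                         2: ("addi", 10), 3: ("addi", 100)})
for _ in range(3):
    fm.step()                       # right path: the branch at pc 1 is taken
right_path = (fm.pc, fm.acc)

fm.set_pc(2, 2)                     # rollback, force branch: fall through
fm.step()
wrong_path = (fm.pc, fm.acc)

fm.set_pc(2, 3)                     # rollback, restore and continue
fm.step()
restored = (fm.pc, fm.acc)
```

The same call serves both cases listed under "Need to": the first `set_pc` forces the wrong path, and the second restores the correct one.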

11
RTL to Timing Model
[Figure: pipeline diagram. Recoverable labels: PC, Instruction Memory, IR, GPR File, Immed. Extend, ALU, Data Memory, Bypass/interlock, MD1, MD2, Trace]
Timing model perfectly models RTL. Verification???
12
Current FAST System
13
QEMU on Xilinx PowerPC
14
Status
  • x86 functional model boots Linux, targeting 80486
    to Pentium D-like and beyond (Dam Sunwoo)
  • Modified Bochs and QEMU
  • Branch-predicted multi-function unit, OOO timing
    model compiles in Bluespec (FAST group)
  • Synthesized for FPGA, 8.5K lines of code, rated
    Top 5 User!
  • Memory, disk models
  • Hope to have network model soon
  • Have straight pipeline 486 model with TLBs and
    caches
  • Preliminary statistics gathered in hardware
    timing model
  • RTL-to-timing model (Nikhil Patil)
  • Defining tools for ISA extension and timing model
    assembly

15
Timing Model Resources
  • OOO, superscalar, 2b branch prediction, five
    functional units, 32KB DCache
      • INTERFACE Fast_if TM IfcVB (interface bt.
        Bluespec Verilog)/CmdQ/Fetch/Decode/Rename/Execute:
        26% of V2P30 (3593 slices)
      • 22 block RAMs (out of 136)
      • ROB broken right now
  • Early configurable cache model (state shouldn't
    change much)
      • 32KB 4-way set-associative cache with 16B
        cache-lines
          • 165 slices (1% of a 2VP30)
          • 17 block RAMs (12% of a 2VP30)
      • 2MB 4-way set-associative cache with 64B
        cache-lines
          • 140 slices (1% of a 2VP30)
          • 40 block RAMs (29% of a 2VP30)
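
The two cache configurations above can be sanity-checked with a little arithmetic. The sketch below derives the number of sets and the tag width for each, and expresses the block-RAM counts as a share of the 136 block RAMs quoted on this slide. The 32-bit address width is an assumption (the slides do not state it), and the helper names are hypothetical.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits=32):
    """Derive line/set counts and tag width for a power-of-two cache.
    addr_bits=32 is an assumption; the slides do not give the width."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = sets.bit_length() - 1          # log2(sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return {"lines": lines, "sets": sets, "tag_bits": tag_bits,
            "tag_store_bits": lines * tag_bits}


def bram_share(used, total=136):
    """Whole-percent share of the device's block RAMs (rounded down)."""
    return used * 100 // total


small = cache_geometry(32 * 1024, ways=4, line_bytes=16)
large = cache_geometry(2 * 1024 * 1024, ways=4, line_bytes=64)
shares = (bram_share(17), bram_share(40))
```

Under these assumptions the 2MB model holds 16 times as many tags as the 32KB model, each somewhat narrower, which fits the slide's observation that the larger cache costs more block RAMs but no more slices.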

16
Current Performance
  • Functional model
      • Up to 500K x86 inst/sec today on V2P30 FPGA
      • Includes rollbacks assuming 5% mis-speculation
      • Not that optimized
      • 5 MIPS unmodified
      • 10M on 3.0 GHz Pentium 4
      • DRC box should give this performance
      • PowerPC ISA should be much faster!
      • PowerPC on PowerPC
  • Timing model
      • Not bottleneck!

17
Conclusions
  • 1MHz to 100MHz, cycle-accurate, full-system,
    multiprocessor x86, x86-64, PowerPC, ARM, Sparc
    simulator
  • Leverage extant full-system simulators
  • FPGA timing models maximize performance and
    statistics-gathering capabilities
  • Pretty much any timing model seems to fit into a
    single FPGA (Pentium M in V2P30?)
  • Uniprocessor, multi-processor capable
  • Tools can minimize creation/modification effort