FPGA-based Fast, Cycle-Accurate Full System Simulators

Slides: 13

Provided by: derek157

Learn more at: https://groups.csail.mit.edu

1
FPGA-based Fast, Cycle-Accurate Full System Simulators

Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil
University of Texas at Austin

2
Wouldn't it be nice to have a simulator that is

Fast
10M cycles per second, fast enough to run real datasets to completion
Accurate
Produces cycle-accurate numbers
Complete
Runs real operating systems, applications
Transparent
Can see everything in the processor, no performance hit
Inexpensive
Need thousands of them
Usable
Quick changes, easy to see performance

3
Software?

Software-based simulators inherently cannot achieve this speed and be cycle-accurate at the same time
A 128-entry, fully-associative TLB at the limit requires 128 load-and-compare operations
Arbitration requires first looking across multiple bidders
There are lots of these structures in a complex processor!
Thousands to tens of thousands of events per simulated cycle
Even with perfect parallelism, need a lot of CPUs
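
To make the TLB point concrete, here is a minimal Python sketch of a software model that mirrors the hardware structure entry by entry. The names (`TlbEntry`, `lookup`) and the entry layout are assumptions for illustration, not taken from any simulator mentioned in the slides. Hardware does all 128 compares in parallel in one cycle; this naive model pays for them one at a time.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    valid: bool
    vpn: int  # virtual page number (the tag that is compared)
    ppn: int  # physical page number (the translation)


def lookup(tlb, vpn):
    """Search the TLB sequentially; return (ppn or None, compares performed)."""
    compares = 0
    for entry in tlb:
        compares += 1  # one load + compare per entry visited
        if entry.valid and entry.vpn == vpn:
            return entry.ppn, compares
    return None, compares


# Hypothetical 128-entry TLB mapping page i to frame 1000 + i.
tlb = [TlbEntry(True, vpn=i, ppn=1000 + i) for i in range(128)]

hit_ppn, hit_cost = lookup(tlb, 127)     # match sits in the last entry
miss_ppn, miss_cost = lookup(tlb, 4096)  # no entry matches
```

A hit in the last entry and a miss both cost the full 128 compares, and this is only one of many such structures that must be evaluated for every simulated cycle.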

4
Hardware

Clearly, hardware is necessary
Reconfigurability (read: FPGAs) is required for flexibility
But how?

5
Full Implementation?

Take RTL code, compile for FPGA
Implementing full system in FPGA is prohibitively large
Shih-Lien Lu's group has a single original Pentium (586, 3.1M transistors) in the largest Xilinx FPGA
Emulate Pentium M in a single FPGA?
140M transistors
Instead, what about:
Accurately (to cycle resolution) simulate its behavior
Running real, unmodified applications, OS
With full visibility at full speed?
If execution speeds are reasonable, do I care?

Derek Chiou, UTexas, Austin
6
Can I Partition the Problem?

A 64b adder is way too big to be implemented as a single monolithic entity
But, I can implement 64 1b adders very easily with very little state and complexity
Partitioning is good if possible
But, how to partition?
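
The adder analogy can be sketched directly. The following Python is an illustrative assumption (the function names are hypothetical): a 64-bit add is composed from 64 one-bit full adders chained by their carries, each of which is trivially small on its own.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_add(x, y, width=64):
    """Add two unsigned integers by chaining `width` one-bit full adders."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry  # carry is the final carry-out (overflow)


small = ripple_add(5, 7)
wrapped = ripple_add(2**64 - 1, 1)
```

Each slice holds almost no state; the only coupling between slices is the single carry bit, which is what makes the partition cheap.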

7
Classic Partitioning

On module boundary
Caches, memories, ALUs, processors, memory controllers
Partitioning doesn't save state or complexity, but enables design to be partitioned over multiple FPGAs and software
Problems?

[Figure: pipelined processor datapath with PC, instruction memory, GPR file, immediate extend, ALU, data memory and bypass paths]

8
Functional/Timing Partition

Functional model simulates ISA
Timing model simulates micro-architecture
Asim and SimpleScalar are written like this
Software
One processor
Lots of interaction between functional and timing
Intended to avoid rollback of any component
Put timing model in FPGA???
Parallel component executed in hardware!

9
UT FAST Partitioning

On ISA/micro-architecture boundary (ISA FPGA)
Instruction trace generated by ISA simulator (e.g., Bochs, Simics)
Fast, full system but no timing information (could be hardware!!!)
What do we need to simulate in the timing model?
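
The split described above can be sketched in a few lines of Python. Everything here is a toy assumption made for illustration (the three-instruction ISA, the `TraceEntry` fields, the one-cycle load-use stall); it is not the FAST trace format. The functional model computes values and emits a trace; the timing model consumes only the trace and never touches data.

```python
from collections import namedtuple

# Hypothetical trace record: enough for timing, no data values.
TraceEntry = namedtuple("TraceEntry", "pc op dst srcs")


def functional_model(program, regs, memory):
    """Execute ISA semantics only and yield one trace entry per instruction."""
    for pc, (op, dst, a, b) in enumerate(program):
        if op == "li":        # load immediate
            regs[dst] = a
            srcs = ()
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
            srcs = (a, b)
        elif op == "load":    # address held in register a
            regs[dst] = memory[regs[a]]
            srcs = (a,)
        else:
            raise ValueError(f"unknown op {op!r}")
        yield TraceEntry(pc, op, dst, srcs)


def timing_model(trace):
    """Count cycles for a simple in-order pipeline with a load-use stall."""
    cycles = 0
    prev = None
    for entry in trace:
        cycles += 1
        if prev is not None and prev.op == "load" and prev.dst in entry.srcs:
            cycles += 1  # bubble: consumer directly follows the load
        prev = entry
    return cycles


program = [
    ("li", "r1", 8, None),
    ("load", "r2", "r1", None),
    ("add", "r3", "r2", "r2"),  # uses the load result immediately
    ("add", "r4", "r3", "r1"),
]
regs = {}
memory = {8: 21}
cycles = timing_model(functional_model(program, regs, memory))
```

The timing model sees register names and opcodes but no values, which is why it could be small enough to live in an FPGA while the functional model stays in software.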

[Figure: the same pipelined datapath, with the instruction trace feeding the instruction memory]

10
UT FAST Complex Processors

Straight pipelines are easy; what about:
Caches/TLBs?
Keep tags, pass address (virtual and physical if necessary)
Hits, misses determined but don't need data
Superscalar (multiple issue)?
Fetch and issue multiple instructions assuming they meet boundary constraints
Multiple functional units
Reservation stations
Reorder buffer
Pipeline control along with instructions
NO DATAPATH!!!
Timing Model speed almost unimportant!
Multi-cycle memories to create more ports
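
As an illustration of "keep tags, don't need data", here is a minimal Python sketch of a tag-only cache model. The direct-mapped organization, the sizes, and the class name `TagOnlyCache` are assumptions chosen for brevity; the slides do not specify a cache configuration.

```python
class TagOnlyCache:
    """Direct-mapped cache model that stores tags but no data."""

    def __init__(self, num_sets=64, line_bytes=32):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [None] * num_sets  # one tag per set, no data array
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # fill replaces whatever was in this set
        self.misses += 1
        return False


cache = TagOnlyCache()
# 0x0800 maps to the same set as 0x0000 and evicts it.
results = [cache.access(a) for a in (0x0000, 0x0004, 0x0800, 0x0000)]
```

Hit/miss behavior, and therefore timing, is fully determined by the address stream, so the data array (the bulk of a real cache) can be omitted from the timing model.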

11
Example of Complication: Branch Prediction

Must process mis-speculated instructions in timing model
Implement BP in timing model
Timing model forces ISA simulator to mis-speculate
Rollback, restore
Requires support from ISA simulator
Branch predictor predictor in ISA simulator?
BP only works in processor if it's fairly accurate
FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path
Most complexity (BP, parallelism) can be handled this way
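
The mis-speculation handshake can be sketched as follows. This Python is a toy under stated assumptions: the `FunctionalModel` interface (`checkpoint`, `execute_wrong_path`, `rollback`), the 1-bit "same as last time" predictor, and the fixed `resolve_depth` are all hypothetical and stand in for whatever the real ISA simulator and timing model provide.

```python
class FunctionalModel:
    """Toy ISA simulator whose only state is the index of the next branch."""

    def __init__(self, outcomes):
        self.outcomes = outcomes       # actual taken / not-taken per branch
        self.index = 0                 # architectural state
        self.wrong_path_executed = 0   # bookkeeping for this sketch
        self._saved = None

    def actual_outcome(self):
        return self.outcomes[self.index]

    def commit_branch(self):
        self.index += 1

    def checkpoint(self):
        self._saved = self.index

    def execute_wrong_path(self, count):
        self.index += count            # stand-in for polluting state
        self.wrong_path_executed += count

    def rollback(self):
        self.index = self._saved       # restore the checkpointed state


def run(functional, resolve_depth=3):
    """Timing model with a 1-bit predictor; returns (cycles, mispredicts)."""
    last_outcome = False
    cycles = mispredicts = 0
    while functional.index < len(functional.outcomes):
        actual = functional.actual_outcome()
        cycles += 1
        if last_outcome != actual:     # predictor guessed wrong
            mispredicts += 1
            functional.checkpoint()
            functional.execute_wrong_path(resolve_depth)
            cycles += resolve_depth    # wrong-path work until the branch resolves
            functional.rollback()
        last_outcome = actual
        functional.commit_branch()
    return cycles, mispredicts


model = FunctionalModel([True] * 8 + [False])
cycles, mispredicts = run(model)
```

Rollback is only paid on the two mispredicted branches out of nine here, which is the point of the slide: as long as the predictor is mostly right, the functional model mostly runs ahead undisturbed.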

12
Status & Conclusions

1MHz to 100MHz, cycle-accurate, full-system, multiprocessor simulator
Well, not quite that fast right now, but we are using an embedded 300MHz PowerPC 405 to simplify
x86, boots Linux, Windows, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo, Nikhil Patil)
Bochs functional model (looking at much faster models)
Heavily modified to add instruction trace and rollback
Branch-predicted superscalar model almost done in Bluespec and Verilog (John Xu, Huzefa Sanjeliwala)
Have straight pipeline 486 model with TLBs and caches
Statistics gathered in hardware
Very little if any probe effect
Tools to semi-automate micro-architectural and ISA level exploration
Orthogonality of models makes both simpler