PROTOFLEX: FPGA-Accelerated Hybrid Functional Simulator


1
PROTOFLEX: FPGA-Accelerated Hybrid Functional
Simulator

Eric S. Chung, Eriko Nurvitadhi, James C. Hoe,
Babak Falsafi, Ken Mai
{echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu

PROTOFLEX/SIMFLEX
2
Multiprocessor Functional Simulation

Functionally simulating one processor in software
is slow
Simulating many processors is even slower
Parallelism of FPGAs can scale up functional MP
simulation performance → conduct large-scale
(>64-way) SW research, cache simulations, perf
sampling studies, etc.
But we can't forfeit full-ISA, full-system
fidelity (run stock OS)

FPGAs offer an unprecedented level of scalability,
but full-system building effort can outweigh any
benefits
[Figure labels: CPU, FPGAs, Memory]
3
Combining FPGAs and simulators
[Figure labels: Target design, FPGA, Simulator, Cpu, Cpu]

Advantages
Leverage full-system simulators for reference
designs
Infrequent, complex behaviors remain simulated:
TLB misses, block memory instrs, disk I/O
instrs, SCSI disks, graphics, etc.

[Figure labels: Disk, Mem, Mem, Disk]

3 ways to map a target object to a hybrid-simulation
host
Emulation-only
Simulation-only
Transplantable
Transplant runtime system
target processors switch modes between FPGA and
simulator hosts
processors need not execute 100% in FPGA mode
e.g., implement only the frequently used ISA
subset in FPGA
4
It Really Works
Virtutech Simics (commercial simulator)
Xilinx XUP Virtex-II Pro 30
Our SPARC V9 core
Simics UltraSPARC
Embedded PowerPC
Transplant message interface

BlueSPARC specs
7k lines Bluespec
UltraSPARC III ISA
Validated against Simics w/ real apps (e.g.,
Solaris 8, SPEC2000, DB2, Oracle, etc.)
41 all instr groups implemented MMU
8kB I/D direct-mapped caches
multi-cycle func model (CPIideal = 5 @ 100 MHz)
16K LUTs (50% of XUP Virtex-II Pro 30)

Simulated target devices
DDR memory
Ethernet
SUN 3800 Server (1x UltraSPARC III, Solaris 8)

developed in 6 months; x86 also works
5
Is this the best we can do?

Reality check: transplants are expensive!
(10 ms = 1,000,000 cycles)
given CPI = 1 @ 100 MHz (100 MIPS), 1 transplant
per 1 million instructions increases CPI to 2
(50 MIPS)
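
The arithmetic behind this reality check can be sketched in Python. The function names are made up for illustration, and the model is the simple one implied by the slide: the transplant cost is amortized over the instructions executed between transplants.

```python
def effective_cpi(cpi_raw, transplant_cycles, instrs_per_transplant):
    """Raw CPI plus the transplant penalty amortized per instruction."""
    return cpi_raw + transplant_cycles / instrs_per_transplant


def mips(clock_mhz, cpi):
    """Instruction throughput in MIPS for a given clock and CPI."""
    return clock_mhz / cpi


# Slide numbers: CPI = 1 at 100 MHz, a transplant costs 1,000,000 cycles
# (10 ms at 100 MHz) and occurs once per 1,000,000 instructions.
cpi = effective_cpi(1.0, 1_000_000, 1_000_000)
print(cpi, mips(100, cpi))
```

With the slide's inputs this reproduces the drop from 100 MIPS to 50 MIPS.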
Recall lessons in hierarchical cache design
Hierarchical transplants
Run simulator kernel on nearby embedded
PowerPC
write SW to cover the entire ISA
only I/O operations need full transplant to
SIMICS (a 10x reduction in our case)

CPIeffective = 1.1 (hierarchical transplants)
CPIeffective = 2 (direct transplants to SIMICS)
FPGA fabric: coverage = 99.9999%, CPIraw = 1

Advantages
Now it makes sense to optimize towards CPIraw = 1
You actually need fewer instructions in
hardware (especially beneficial for x86)

Embedded PPC ISA sim: coverage = 99.99999%, CPI = 1,000
full-system SIMICS: coverage = 100%, CPItplant = 1,000,000
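
The two CPIeffective figures on this slide can be reproduced with a calculation modeled on multi-level cache miss penalties. The sketch below is an interpretation, not the authors' model: it assumes the coverage figures are cumulative, so 1 in 10^6 instructions leaves the FPGA fabric and 1 in 10^7 instructions goes all the way to the full-system simulator. The function name and argument layout are hypothetical.

```python
def hierarchical_cpi(cpi_raw, levels):
    """Effective CPI for a transplant hierarchy.

    levels is a list of (fraction_of_all_instructions, cycles_per_visit)
    pairs, one per fallback level below the FPGA fabric.
    """
    return cpi_raw + sum(frac * cycles for frac, cycles in levels)


# Flat scheme: every FPGA miss (1 in 10^6) is a full 1,000,000-cycle transplant.
flat = hierarchical_cpi(1.0, [(1e-6, 1_000_000)])

# Hierarchical scheme: FPGA misses go to the embedded PowerPC (1,000 cycles);
# only 1 in 10^7 instructions still needs the full transplant.
tiered = hierarchical_cpi(1.0, [(1e-6, 1_000), (1e-7, 1_000_000)])

print(round(flat, 3), round(tiered, 3))
```

Under these assumptions the flat scheme gives a CPI of about 2 and the hierarchical scheme about 1.1, matching the slide.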
6

Demo

7
How to build a 1024-node MP functional emulator,
without building 1024 nodes?
8
How fast do you need to simulate?
fast enough for 1024-way arch. studies
Aggregate Throughput

In the uniprocessor world
up to 100x slowdown for interactive software
research (e.g. Simics)
1,000x to 10,000x slowdown for design exploration
(e.g. cache simulation)

9
Different approaches to scale to 1K

Even for 1K-node MP, only 1000 to 10,000 MIPS
(aggregate) needed to do useful work
The obvious approach
build fast ISA core (estimate 100 MIPS per core)
physically replicate the core 1000 times
→ 10x to 100x faster than needed; why spend
effort and area on perf I don't need?
The better approach: think in terms of MIPS
build 100 MIPS ISA emulation engine supporting
multiple contexts
map 100 simulated processors onto single engine
with just 10 physical engines, I can emulate a
1000-way system (10 x 100 MIPS = 1000 MIPS)
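
The sizing argument above reduces to a few divisions, sketched here in Python. The helper names are hypothetical; the numbers come from the slide (100 MIPS per engine, a 1000-MIPS aggregate target, 1000 simulated processors).

```python
import math


def engines_needed(target_aggregate_mips, engine_mips):
    """Number of physical engines set by desired throughput, not node count."""
    return math.ceil(target_aggregate_mips / engine_mips)


def contexts_per_engine(num_target_cpus, num_engines):
    """Simulated processors multiplexed onto each engine."""
    return math.ceil(num_target_cpus / num_engines)


engines = engines_needed(1000, 100)
contexts = contexts_per_engine(1000, engines)
per_cpu_mips = 100 / contexts  # each simulated processor's share of an engine

# Brute-force replication for comparison: 1000 cores at 100 MIPS each.
replicated_mips = 1000 * 100

print(engines, contexts, per_cpu_mips, replicated_mips)
```

Replication yields 100,000 MIPS, which is the 10x to 100x overshoot relative to the 1000 to 10,000 MIPS target, while the multiplexed design meets the low end of the target with 10 engines.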

10
PROTOFLEXMP

Build 1000-MIPS simulator from 10s of emulation
engines
multiplex large # of emulated contexts onto few
emulation engines
Decide # of emulation engines to build from
desired performance, not from # of nodes to emulate

N-way target system
11
Interleaved Emulation Engine

Statically interleaved emulation engine (à la HEP)
issue new instr from new context per cycle →
maximize engine throughput
simple pipeline (no fwding or interlock if
# contexts > # pipe stages)
deeper pipelines for higher frequency (or complex
x86 instrs)
hide the latency of memory and transplants
It is actually easier to optimize instruction
throughput
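
A toy cycle-level model can illustrate why static interleaving avoids forwarding and interlocks. This is a simplified sketch under stated assumptions, not the engine's design: each context may have one instruction in flight, an instruction occupies the pipeline for `pipe_depth` cycles, and the issue slot rotates round-robin over the contexts. The function name and parameters are hypothetical.

```python
def simulate(num_contexts, pipe_depth, cycles):
    """Return (instructions issued, bubble cycles) for round-robin issue."""
    # Cycle at which each context's in-flight instruction has left the pipe.
    free_at = [0] * num_contexts
    issued = 0
    bubbles = 0
    for cycle in range(cycles):
        ctx = cycle % num_contexts  # static interleaving: fixed rotation
        if free_at[ctx] <= cycle:
            free_at[ctx] = cycle + pipe_depth
            issued += 1
        else:
            # The context's previous instruction is still in the pipeline,
            # so this issue slot is wasted.
            bubbles += 1
    return issued, bubbles


print(simulate(num_contexts=8, pipe_depth=5, cycles=1000))
print(simulate(num_contexts=2, pipe_depth=5, cycles=1000))
```

In this model, once the number of contexts is at least the pipeline depth, every cycle issues an instruction and no context ever has two dependent instructions in flight; with too few contexts most slots become bubbles.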
Open issues
How to manage very large # of contexts? Do we
have to dynamically page clusters of contexts
in and out of the engine?
How to fake memory capacity? How much DRAM to
emulate 1000-node system?