Title: PROTOFLEX: FPGA-Accelerated Hybrid Functional Simulator
1 PROTOFLEX: FPGA-Accelerated Hybrid Functional Simulator
- Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai
- {echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu
PROTOFLEX/SIMFLEX
2 Multiprocessor Functional Simulation
- Functionally simulating one processor in software is slow
- Simulating many processors is of course even slower
- Parallelism of FPGAs can scale up functional MP simulation perf -> conduct large-scale (>64-way) SW research, cache simulations, perf sampling studies, etc.
- But we can't forfeit full-ISA, full-system fidelity (run stock OS)
FPGAs: unprecedented level of scalability, but full-system building effort can outweigh any benefits
[Figure labels: CPU, FPGAs, Memory]
3 Combining FPGAs and simulators
[Figure: target design mapped onto FPGA and simulator; component labels: Cpu, Mem, Disk]
- Advantages
  - Leverage full-system simulators for reference designs
  - Infrequent, complex behaviors remain simulated
    - TLB misses, block memory instrs, disk I/O instrs, SCSI disks, graphics, ...
- 3 ways to map target object to hybrid-simulation host
  - Emulation-only, Simulation-only, Transplantable
- Transplant runtime system
  - target processors switch modes between FPGA and simulator hosts
  - processors need not execute 100% in FPGA mode
  - e.g., implement only the frequently used ISA subset in FPGA
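The transplant idea above can be sketched in a few lines of Python. This is only an illustrative model, not the actual PROTOFLEX runtime: the instruction-group names, the `FPGA_SUBSET` set, and the `TargetCpu` class are all hypothetical. A target processor stays on the FPGA host while it executes instruction groups implemented there, and moves to the simulator host when it meets anything else.

```python
from dataclasses import dataclass

# Hypothetical set of instruction groups implemented in the FPGA core.
FPGA_SUBSET = {"alu", "load", "store", "branch"}


@dataclass
class TargetCpu:
    host: str = "fpga"    # host currently running this target processor
    transplants: int = 0  # number of FPGA -> simulator switches so far


def step(cpu: TargetCpu, group: str) -> str:
    """Run one instruction on whichever host can handle its group."""
    wanted = "fpga" if group in FPGA_SUBSET else "simulator"
    if wanted == "simulator" and cpu.host == "fpga":
        # Unimplemented behavior: architectural state moves to the simulator.
        cpu.transplants += 1
    cpu.host = wanted
    return wanted


def run(trace):
    """Return the host used for each instruction and the transplant count."""
    cpu = TargetCpu()
    hosts = [step(cpu, group) for group in trace]
    return hosts, cpu.transplants


if __name__ == "__main__":
    trace = ["alu", "load", "tlb_miss", "alu", "disk_io", "disk_io", "store"]
    hosts, transplants = run(trace)
    print(hosts, transplants)
```

In this sketch, consecutive unimplemented instructions share one transplant, which is one plausible policy; a real runtime would also have to decide when to return to the FPGA.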
4 It Really Works
[Figure labels: Virtutech Simics (commercial simulator); Xilinx XUP Virtex-II Pro 30; our SPARC V9 core; Simics UltraSPARC; embedded PowerPC; transplant message interface]
- BlueSPARC specs
  - 7k lines Bluespec
  - UltraSPARC III ISA
  - Validated against Simics w/ real apps (e.g., Solaris 8, SPEC2000, DB2, Oracle, etc.)
  - 41 all instr groups implemented MMU
  - 8 kB I/D direct-mapped caches
  - multi-cycle func model (CPI_ideal = 5 @ 100 MHz)
  - 16K LUTs (50% of XUP Virtex-II Pro 30)
[Figure labels: simulated target devices; DDR memory; Ethernet; SUN 3800 Server (1x UltraSPARC III, Solaris 8)]
Developed in 6 months; x86 also works
5 Is this the best we can do?
- Reality check: transplants are expensive! (10 ms = 1,000,000 cycles)
  - given CPI = 1 @ 100 MHz (100 MIPS), 1 transplant per 1 million instructions increases CPI to 2 (50 MIPS)
- Recall lessons in hierarchical cache design
- Hierarchical transplants
  - Run simulator kernel on nearby embedded PowerPC
  - write SW to cover the entire ISA
  - only I/O operations need full transplant to SIMICS (a 10x reduction in our case)
- Advantages
  - Now it makes sense to optimize towards CPI_raw = 1
  - You actually need fewer instructions in hardware (especially beneficial for x86)
[Figure: host hierarchy and effective CPI]
  FPGA fabric: coverage = 99.9999%, CPI_raw = 1
  Embedded PPC ISA sim: coverage = 99.99999%, CPI = 1,000
  Full-system SIMICS: coverage = 100%, CPI_tplant = 1,000,000
  CPI_effective = 2 (FPGA fabric + SIMICS only); CPI_effective = 1.1 (with embedded PPC level)
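The effective-CPI figures on this slide can be reproduced with a weighted sum, in the same spirit as average memory access time for a cache hierarchy. The sketch below assumes that each "coverage" value is cumulative (the fraction of instructions handled at that level or a faster one) and that each CPI is the average cost of an instruction handled at that level; the function name `effective_cpi` is illustrative.

```python
def effective_cpi(levels):
    """Average CPI over a host hierarchy.

    levels: list of (cumulative_coverage, cpi) pairs, fastest host first.
    The last level must have coverage 1.0 so every instruction is handled.
    """
    total = 0.0
    previous_coverage = 0.0
    for coverage, cpi in levels:
        # Fraction of instructions that fall through to exactly this level.
        total += (coverage - previous_coverage) * cpi
        previous_coverage = coverage
    return total


# Flat: FPGA fabric backed directly by full-system SIMICS.
flat = effective_cpi([(0.999999, 1), (1.0, 1_000_000)])

# Hierarchical: embedded PPC ISA simulator sits between the two.
hierarchical = effective_cpi([(0.999999, 1), (0.9999999, 1_000), (1.0, 1_000_000)])

if __name__ == "__main__":
    print(f"flat: {flat:.3f}  hierarchical: {hierarchical:.3f}")
```

Under these assumptions the flat case comes out at about 2 and the hierarchical case at about 1.1, matching the slide. Nearly all of the remaining overhead is the 1-in-10-million full transplants to SIMICS.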
7 How to build a 1024-node MP functional emulator without building 1024 nodes?
8 How fast do you need to simulate?
Fast enough for 1024-way arch. studies
Aggregate Throughput
- In the uniprocessor world
  - up to 100x slowdown for interactive software research (e.g., Simics)
  - 1,000x to 10,000x slowdown for design exploration (e.g., cache simulation)
9 Different approaches to scale to 1K
- Even for a 1K-node MP, only 1,000 to 10,000 MIPS (aggregate) is needed to do useful work
- The obvious approach
  - build fast ISA core (estimate 100 MIPS per core)
  - physically replicate the core 1000 times
  - -> 10x to 100x faster than needed; why spend effort and area on perf I don't need?
- The better approach: think in terms of MIPS
  - build 100 MIPS ISA emulation engine supporting multiple contexts
  - map 100 simulated processors onto a single engine
  - with just 10 physical engines, I can emulate a 1000-way system (10 x 100 MIPS = 1000 MIPS)
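The sizing argument above is simple arithmetic, sketched here. The 100 MIPS per engine figure is the slide's estimate; the function names are illustrative only.

```python
import math


def engines_needed(target_mips, engine_mips=100):
    """Engines required to reach an aggregate throughput target."""
    return math.ceil(target_mips / engine_mips)


def contexts_per_engine(num_nodes, num_engines):
    """Simulated processors each engine must multiplex."""
    return math.ceil(num_nodes / num_engines)


if __name__ == "__main__":
    nodes = 1000
    for target in (1_000, 10_000):  # aggregate MIPS range from the slide
        n = engines_needed(target)
        print(target, "MIPS ->", n, "engines,",
              contexts_per_engine(nodes, n), "contexts each")
    # Replicating one 100-MIPS core per node gives 100,000 MIPS,
    # i.e. 10x to 100x more than the 1,000 to 10,000 MIPS target.
    print("replicated:", nodes * 100, "MIPS")
```

The engine count depends only on the throughput target, so emulating more nodes at the same aggregate MIPS changes the number of contexts per engine, not the number of engines.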
10 PROTOFLEXMP
- Build 1000-MIPS simulator from 10s of emulation engines
  - multiplex large # of emulated contexts onto few emulation engines
- Decide # of emulation engines to build from desired performance, not from # of nodes to emulate
[Figure label: N-way target system]
11 Interleaved Emulation Engine
- Statically interleaved emulation engine (a la HEP)
  - issue new instr from new context per cycle -> maximize engine throughput
  - simple pipeline (no fwding or interlock if # contexts > pipe stages)
  - deeper pipelines for higher frequency (or complex x86 instrs)
  - hide the latency of memory and transplants
- It is actually easier to optimize instruction throughput
- Open issues
  - How to manage very large # of contexts? Do we have to dynamically page clusters of contexts in and out of the engine?
  - How to fake memory capacity? How much DRAM to emulate 1000-node system?
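A toy model shows why static interleaving removes the need for forwarding or interlocks. It is a sketch under simplifying assumptions, not the engine's actual design: every instruction occupies the pipeline for exactly `pipe_stages` cycles, each context may have only one instruction in flight, and contexts are issued in strict round-robin order. The function name `interleave` is hypothetical.

```python
def interleave(num_contexts, pipe_stages, cycles):
    """Round-robin issue from num_contexts into a pipe_stages-deep pipeline.

    Returns (issued, stalls). A stall is a cycle in which the selected
    context still has an instruction in flight, so nothing can issue.
    """
    # Cycle at which each context's in-flight instruction has retired.
    ready_at = [0] * num_contexts
    issued = stalls = 0
    for cycle in range(cycles):
        ctx = cycle % num_contexts
        if ready_at[ctx] <= cycle:
            ready_at[ctx] = cycle + pipe_stages
            issued += 1
        else:
            stalls += 1
    return issued, stalls


if __name__ == "__main__":
    # More contexts than stages: one instruction issues every cycle.
    print(interleave(num_contexts=8, pipe_stages=5, cycles=80))
    # Fewer contexts than stages: the pipeline has idle cycles.
    print(interleave(num_contexts=2, pipe_stages=5, cycles=80))
```

With at least as many contexts as pipeline stages, a context's previous instruction has always left the pipeline before its next turn, so no two in-flight instructions ever belong to the same context and no dependency checking is needed.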
12 Conclusion
- Contributions
  - hybrid transplant simulation reduces FPGA development effort
  - proof-of-concept demonstrates up to 16 MIPS on select SPECINT
  - -> plan to run TPC-C on DB2 and Oracle on BEE2 (not enough DRAM on XUP)
- Future work
  - 1024-way system on 10-way interleaved emulation engines
- Thanks! Questions? echung@ece.cmu.edu
- PROTOFLEX/SIMFLEX (http://www.ece.cmu.edu/simflex)