Title: Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation
1. Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation
- Abhishek BHATTACHARJEE
- Gilberto CONTRERAS
- Margaret MARTONOSI
- Princeton University
Intl. Symp. on Low Power Electronics and Design (ISLPED), August 13, 2008
2. Problem: SW Simulators for Architectural Power Estimation
- Power has become a first-class design problem
- Affects power density, thermal behavior, packaging constraints
- Early-stage µ-arch perf/power evaluation is crucial
- Conventional SW simulators (Wattch, SimplePower, HotSpot)
- Flexible, low development time
- But SW simulations are too slow: 10-100 KIPS!
- Chips getting more complex: core counts, uncore, etc.
- Design space getting more complex: perf/power/thermal
- Must consider OS, workload interaction
Abhishek Bhattacharjee, Gilberto Contreras,
Margaret Martonosi, PRINCETON UNIVERSITY
3. Alternatives to Long Simulations
- Run application snippets, ignore OS
- Compromises result accuracy and credibility
- Parallelize simulator [Chidester et al. 02]
- Shared structures (LLC, coherence) limit scalability
- Hardware runtime monitoring [Joseph et al. 01, Bellosa et al. 02, Isci et al. 03, Contreras et al. 03]
- Fast evaluation time
- Restricted view of components
- Requires existing design
4. Our Approach: FPGA-Based Full-System Emulation
- Develop FPGA-based perf./power emulator of a proposed CMP machine
- 50-300 MHz → run full apps, OS
- Similar to HW monitoring
- Programmable → insert relevant monitors, model various designs
- Similar to SW simulations
- Bottom line: get the detail and full-system effects of real measurements before the machine is built
5. Past and Ongoing FPGA-Based Emulation Work
- Purely performance emulation
- HASim [Emer et al. 06], RAMP [Wawrzynek et al. 06]
- Modular, parameterizable perf. models on FPGAs
- Purely power emulation [Coburn et al. 05]
- RTL with power models on FPGA (area/latency overhead analysis)
- Performance and power emulation [Atienza et al. 06]
- Performance and thermal emulation of MPSoCs for existing cores
- Runs OS on host and communicates with FPGA
6. Presentation Outline
- Designing the emulator
- Validating emulator power models
- Evaluating emulator speedup
- Profiling application runtime power behavior
- Case study: Activity migration
- Conclusion
7. Steps in Designing Emulator
Emulator Design Steps:
  1. Choose target platform
  2. Choose candidate core design
  3. Design event counters
  4. Design power models
  5. Boot OS and run full apps
8. Target Emulation Platform
- Target FPGA platform: BEE2
- 5 Xilinx Virtex-II Pro 70 (V2P70) FPGAs (1 control/4 user)
- Current design on control unit
- Methodology extensible to other platforms
9. Candidate Core Design: Leon3 SparcV8 CMP
Candidate Core:    Leon3 Sparc V8 VHDL core
Clock Rate:        65 MHz
Organization:      2-core, L1 snoopy cache coherence (ARM AMBA AHB bus)
Pipeline:          Single-issue, in-order, 7-stage
Functional Units:  Adder, Shifter, Pipelined Mul/Div
L1 I-Cache:        4 KB, 2-way, 32-byte lines, LRR
L1 D-Cache:        4 KB, 2-way, 32-byte lines, LRR, write-through, virtually addressed
MMU:               8-entry I and D TLBs, LRU
- Currently use 60% of LUTs, 20% of BRAM
- Future: scale core count, L1 caches; add LLC, FPU
- Methodology extensible to other core designs
10. Inserting Event Counters
[Figure: block diagram of the emulated CMP. Labels: SparcV8 Core 0 and SparcV8 Core 1, each with a 7-stage integer pipeline, a 3-port register file, and 4 KB I- and D-caches; AHB controller; AHB bus; 64-bit event counters.]
- Memory-mapped counters
- Add to ISA: start/stop/reset counters
- 36 counters → 3% of LUTs, no impact on freq.
11. Power Model Development
- General form of component power model
- How to assign event Ei?
- Want power of emulated machine, not FPGA!
- Calibrate with gate-level simulations and microbenchmarks
- Please refer to paper for details
Assumed 0.13 µm technology: requires just switching power term (negligible leakage)
Large idle power: no clock gating
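To make the component power model concrete, here is a minimal sketch. It assumes a linear form: idle power plus, for each event type i, a per-event energy Ei multiplied by how often that event occurs. That form, the class name `ComponentPowerModel`, and the event counts are illustrative assumptions, not the authors' implementation; the energies are the register-file values quoted later in the talk. Leakage is left out, in line with the switching-power-only assumption above.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass
class ComponentPowerModel:
    """Hypothetical linear model: P = P_idle + sum_i(E_i * N_i) / T."""
    idle_power_mw: float               # power with no counted events
    event_energy_nj: Dict[str, float]  # assumed energy per event occurrence

    def average_power_mw(self, event_counts: Mapping[str, int],
                         cycles: int, clock_hz: float) -> float:
        seconds = cycles / clock_hz
        dynamic_nj = sum(self.event_energy_nj[name] * count
                         for name, count in event_counts.items())
        # nJ per second is nW; multiply by 1e-6 to express it in mW.
        return self.idle_power_mw + dynamic_nj / seconds * 1e-6


# Register-file energies from the talk; the counts below are made up.
regfile = ComponentPowerModel(
    idle_power_mw=18.83,
    event_energy_nj={"write": 0.53, "single_read": 0.29, "double_read": 0.39},
)
counts = {"write": 200_000, "single_read": 100_000, "double_read": 150_000}
# 650,000 cycles at 65 MHz is one 10 ms sampling interval.
regfile_power_mw = regfile.average_power_mw(counts, cycles=650_000, clock_hz=65e6)
print(f"register file: {regfile_power_mw:.2f} mW")
```

Feeding the model with event counter deltas per sampling interval would give a per-component power trace, which is the kind of data the later profiling slides rely on.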
12. Register File Power Model
- Write 500-instruction microbenchmarks
- Vary event/nop ratio
- Idle power: 18.83 mW; write: 0.53 nJ; single read: 0.29 nJ; double read: 0.39 nJ
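One way the event/nop sweep could yield these numbers is a straight-line fit of power against event rate: the intercept is the idle power and the slope is the per-event energy. The sketch below is an assumption about the procedure, not the authors' calibration flow. The data points are synthetic (generated from the write figures above, not measured), and it assumes one instruction per cycle at the 65 MHz clock.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


CLOCK_HZ = 65e6
# Hypothetical sweep: fraction of instructions that are register writes
# (the remainder are nops).
event_fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
# Events per second, assuming one instruction issues per cycle.
event_rates = [f * CLOCK_HZ for f in event_fractions]
# Synthetic "reference" power in mW, built from 18.83 mW idle and 0.53 nJ/write.
reference_mw = [18.83 + 0.53e-9 * rate * 1e3 for rate in event_rates]

idle_mw, slope_mw_s = fit_line(event_rates, reference_mw)
energy_nj = slope_mw_s * 1e6  # mW*s is mJ; 1 mJ = 1e6 nJ
print(f"idle = {idle_mw:.2f} mW, energy per write = {energy_nj:.2f} nJ")
```

With real gate-level power numbers in place of the synthetic list, the same fit would be repeated for each event type of each component.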
13. Full-System Emulator with OS and Applications
[Figure: full-system emulation setup. Labels: FPGA platform (BEE2 control unit); emulated CMP with SparcV8 Core 0, SparcV8 Core 1, AHB bus, and event counters for all modules; main memory; I/O; RS-232; Ethernet; host PC.]
Linux 2.6, applications (Spec 2006, Splash-2, PARSEC); knowledge of power models
14. Presentation Outline
- Designing the emulator
- Validating emulator power models
- Evaluating emulator speedup
- Profiling application runtime power behavior
- Case study: Activity migration
- Conclusion
15. Validating Emulator Power Models
- Extensive validation with Synopsys PrimeTime PX using:
- Validation µ-benchmarks
- 2x calibration µ-benchmarks, multiple event types
- Spec 2006 benchmarks
- Mcf, Libquantum, Bzip2, Gcc, Sjeng (train problem size)
- Run 5 distinct 1-million-instruction snapshots

Power model error relative to PrimeTime PX (%):
Module       µ-benchmarks   Spec 2006
Pipeline     7.51           7.58
Reg. File    6.03           6.23
I-Cache      6.81           7.21
D-Cache      7.21           7.41
AHB          5.66           7.30
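As an illustration of how a per-module error figure like those in the table could be computed, the sketch below averages the absolute percent error of the emulator's estimate against a reference value over several snapshots. Whether the talk reports a mean or a maximum is not stated here, so the mean is an assumption, and all power values in the example are invented.

```python
def mean_abs_percent_error(estimates, references):
    """Mean of |estimate - reference| / reference, expressed in percent."""
    if not estimates or len(estimates) != len(references):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(e - r) / r for e, r in zip(estimates, references))
    return 100.0 * total / len(estimates)


# Hypothetical power (mW) for five 1-million-instruction snapshots.
emulator_mw = [104.0, 95.0, 110.0, 90.0, 101.0]
reference_mw = [100.0, 100.0, 100.0, 100.0, 100.0]  # stand-in for gate-level results

error_percent = mean_abs_percent_error(emulator_mw, reference_mw)
print(f"mean absolute error: {error_percent:.2f}%")
```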
16. Results: Emulation Speedup
- Speedup over architectural simulator, Multifacet GEMS
- 2-core, 4 KB L1 caches
- Mcf, Libquantum, Bzip2, Gcc, Sjeng on each core with train size
- With Ruby: Max. 35x
- With Ruby + Opal: Max. 452x
- Even greater speedup expected for:
- Modeling greater core counts
- Collecting power/thermal data
- Greater FPGA clock
- Bigger caches
NOTE: GEMS host uses a 64-bit, 2-GHz dual-core AMD Athlon processor
17. Presentation Outline
- Designing the emulator
- Validating emulator power models
- Evaluating emulator speedup
- Profiling application runtime power behavior
- Case study: Activity migration
- Conclusion
18. Runtime Power Profiling
- Important for OS-controlled power-aware scheduling
- Modify Linux kernel to feed counter values to power models
- Read counters within 10 ms timer interrupt
- Sampling rate: multiples of 10 ms
- Access 36 counters in 5700 cycles → Max. 0.87% perturbation
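The perturbation figure follows from simple arithmetic: the cycles spent reading counters divided by the cycles in one sampling interval. The sketch below reproduces it, assuming the 65 MHz emulator clock given earlier in the talk; the function name is illustrative.

```python
def perturbation_percent(read_cycles, clock_hz, sample_period_ms):
    """Share of each sampling interval spent reading counters, in percent."""
    cycles_per_sample = clock_hz * sample_period_ms / 1000
    return 100.0 * read_cycles / cycles_per_sample


CLOCK_HZ = 65_000_000       # emulator clock from the core configuration
COUNTER_READ_CYCLES = 5700  # cost of reading all 36 counters

# Worst case: sampling on every 10 ms timer interrupt.
worst_case = perturbation_percent(COUNTER_READ_CYCLES, CLOCK_HZ, 10)
# Sampling every tenth interrupt (100 ms) costs a tenth as much.
relaxed = perturbation_percent(COUNTER_READ_CYCLES, CLOCK_HZ, 100)
print(f"10 ms sampling: {worst_case:.3f}%  100 ms sampling: {relaxed:.4f}%")
```

At 10 ms the result is about 0.877%, which the slide quotes as 0.87; longer sampling periods reduce the overhead proportionally.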
19. Runtime Power for LU (2 threads)
[Figure: runtime power profile for LU with 2 threads. Annotations:]
- CPU1 master, CPU0 idle (380 mW)
- Barrier: CPU0 spin-waiting
- Possible Reg. File hotspot: cannot be tracked on CPU composite profile
- Low power numbers and swing: 65 MHz clock, no L2, no FPU, no gating, simple pipeline
20. Case Study: Activity Migration
- Why study AM?
- Effective at handling hotspots [Choi et al. 07, Heo et al. 03]
- Our emulator is the ideal platform for AM studies
- Hotspots depend on component power
- Emulator directly provides this
- On-chip temperature rise/fall times: 100 ms
- Emulator fast enough to run OS and apps. beyond this time range
21. Linux Kernel Scheduler for AM
Avg. migration time ≈ 300 ms (65 MHz clock and small caches)
2 s interval for max. 15% migration penalty
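The 2 s interval can be checked with a short calculation: if every migration costs about 300 ms and at most one migration happens per interval, the worst-case share of time lost is the migration time divided by the interval. The function names below are illustrative.

```python
def migration_penalty(migration_time_ms, interval_ms):
    """Worst-case fraction of time lost if one migration occurs every interval."""
    return migration_time_ms / interval_ms


def min_interval_ms(migration_time_ms, max_penalty):
    """Shortest interval between migrations that keeps the penalty within bound."""
    return migration_time_ms / max_penalty


penalty = migration_penalty(300, 2000)  # 300 ms migration, 2 s interval
interval = min_interval_ms(300, 0.15)   # interval needed for a 15% cap
print(f"penalty = {penalty:.0%}, minimum interval = {interval:.0f} ms")
```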
22. Case Study: Activity Migration
- Example: AM on Bzip2, Mcf
- Power surge on pipeline triggers swap
- Bzip2: small working set, pipeline active most of run
- Mcf: large working set, lower-power phases
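A surge-triggered swap policy of this kind might look like the sketch below. It is a simplified, offline illustration and not the kernel scheduler from the talk: the threshold value, the function name `plan_swaps`, and the power trace are all hypothetical, and the sketch does not model how a swap changes the subsequent trace. The 2 s minimum interval comes from the scheduler slide.

```python
from typing import List, Optional, Sequence, Tuple


def plan_swaps(samples_mw: Sequence[Tuple[float, float]],
               sample_period_ms: int,
               surge_threshold_mw: float,
               min_interval_ms: int = 2000) -> List[int]:
    """Return the sample times (ms) at which the two cores' tasks would swap.

    samples_mw holds (core0, core1) pipeline power for each sample.
    """
    swaps: List[int] = []
    last_swap_ms: Optional[int] = None
    for index, (core0_mw, core1_mw) in enumerate(samples_mw):
        now_ms = index * sample_period_ms
        surge = max(core0_mw, core1_mw) > surge_threshold_mw
        other_is_cooler = min(core0_mw, core1_mw) <= surge_threshold_mw
        rested = last_swap_ms is None or now_ms - last_swap_ms >= min_interval_ms
        if surge and other_is_cooler and rested:
            swaps.append(now_ms)
            last_swap_ms = now_ms
    return swaps


# Hypothetical pipeline power trace, one sample per second.
trace = [(50, 40), (90, 40), (95, 40), (92, 41), (60, 40), (91, 42)]
swap_times_ms = plan_swaps(trace, sample_period_ms=1000, surge_threshold_mw=80)
print(swap_times_ms)
```

The surge at 2000 ms is skipped because less than 2 s has passed since the swap at 1000 ms, which is how the interval bounds the migration penalty.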
23. Presentation Outline
- Designing the emulator
- Validating emulator power models
- Evaluating emulator speedup
- Profiling application runtime power behavior
- Case study: Activity migration
- Conclusion
24. Conclusion
- Emulator combines HW speeds (65 MHz) with SW programmability
- 432x speedup over GEMS (Ruby + Opal)
- Power models accurate within 10% of Synopsys simulations
- Can model range of proposed designs
- Moore's Law applies to FPGAs too!
- Ongoing/future work
- Scaling design with higher core counts, larger caches
- GHz emulation
- DVFS emulation
- Thermal models