Title: Hardware Performance Counters for Detailed Runtime Power and Thermal Estimations: Experiences
1 Hardware Performance Counters for Detailed Runtime Power and Thermal Estimations: Experiences and Proposals
- Canturk ISCI, Gilberto CONTRERAS, Margaret MARTONOSI
2 Hardware Performance Counters (HPCs) Go beyond Performance
- Several explored research avenues
- Runtime power/thermal estimations
- Dynamic management
- Workload phases and application behavior prediction
- HPCs provide value beyond simulations
- Long timescales
- Real-system behavior
3 Hardware Performance Counters (HPCs) Go beyond Performance
- Runtime power
- Isci and Martonosi, MICRO 2003
- Contreras and Martonosi, submitted 2005
- Runtime thermal
- Lee and Skadron, HP-PAC in IPDPS 2005
- Dynamic power management
- Choi et al., ISLPED 2004
- Weißel and Bellosa, CASES 2002
- Dynamic thermal management
- Bellosa et al., COLP 2003
- Workload phases and application behavior prediction
- Isci and Martonosi, WWC 2003
- Duesterwald et al., PACT 2003
4 High-Performance Corner: P4 Power Estimation
- Motivation
- Fast (real-time)
- Estimated view of on-chip detail (per physical component)
- Design
- Developed heuristics using 24 events to approximate access rates for 22 chip components
- Used 15 counters with 4 rotations to collect all event data
- Validation
- Real-time estimates against real-time measured power
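The design above maps counter events to per-component access rates and then to power. The last step can be sketched as follows, under the assumption that each component's power scales linearly with its access rate up to a maximum, on top of a fixed idle term. The component names, wattages, and the `estimate_power` helper are hypothetical illustrations and are not taken from the slides.

```python
from typing import Dict

# Hypothetical per-component maximum dynamic power in watts (illustrative only).
MAX_POWER_W: Dict[str, float] = {"trace_cache": 4.0, "int_alu": 3.0, "fpu": 5.0}
IDLE_POWER_W = 10.0  # assumed constant idle power for the whole chip


def estimate_power(access_rates: Dict[str, float]) -> Dict[str, float]:
    """Scale each component's maximum power by its access rate (0.0 to 1.0)."""
    per_component: Dict[str, float] = {}
    for name, max_power in MAX_POWER_W.items():
        # Clamp the rate so a noisy heuristic cannot exceed the component maximum.
        rate = min(max(access_rates.get(name, 0.0), 0.0), 1.0)
        per_component[name] = rate * max_power
    per_component["total"] = IDLE_POWER_W + sum(per_component.values())
    return per_component
```

A per-component breakdown like this is what gives the "estimated view of on-chip detail"; the total is the quantity that could be compared against a measured power trace.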
5 P4 Power Estimator Results
[Figure: measured power compared with estimates for Gcc, Gzip, Vpr, Vortex, Gap, and Crafty]
- Average difference of 5% among all benchmarks
- SPEC CPU2000 and other applications
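6 Embedded Corner: PXA255 Power Estimation
$$\text{CPUPower}_{n \times 1} = \text{PerformanceEvents}_{n \times 5} \times \text{LinearParameters}_{5 \times 1} + \text{IdlePower}$$
$$\text{MemPower}_{n \times 1} = \text{PerformanceEvents}_{n \times 2} \times \text{LinearParameters}_{2 \times 1} + \text{IdlePower}$$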
- Motivation
- Runtime power optimizations under DVFS
- Design
- Parameter estimation (OLS) using dominant counter readings and live power measurements
- Power estimation at various CPU configurations
- Validation
- Comparison between estimates and real-time
measured power
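The parameter-estimation step can be illustrated with a small ordinary least squares fit. This is a minimal sketch, not the authors' tooling: it assumes the idle term is fitted as an intercept together with the linear parameters, and the helper names `solve_linear` and `fit_power_model` are hypothetical. It solves the normal equations in pure Python so it runs without extra packages.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_power_model(
    events: Sequence[Sequence[float]], power: Sequence[float]
) -> List[float]:
    """OLS fit of power = events . params + idle via the normal equations.

    Returns the linear parameters followed by the fitted idle term.
    """
    rows = [list(e) + [1.0] for e in events]  # trailing 1.0 models the idle term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * p for r, p in zip(rows, power)) for i in range(k)]
    return solve_linear(xtx, xty)
```

In practice a library routine such as `numpy.linalg.lstsq` would usually replace the hand-written solver; the fitted vector is then applied to new counter readings to produce runtime estimates.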
7 PXA255 Results
- 5% average error across 3 domains
- Java CDC
- Java CLDC
- SPEC2000
8 Proposals from Experiences
- 1. Track each physical unit individually for power and thermal
- Ex.: Dispatch Ports, Instr-n Queue1, Trace Cache, MEM, µop Queue, Allocate, Rename, Schedulers, µCode ROM, Instr-n Queue2, and EXE are all tracked with in-flight µops written to the µop queue
- Need individual utilization counts for each physical unit available on die for power and hotspot analyses
9 Proposals from Experiences
- 2. Need bitline activity counts
- Utilization is not complete information; power in part depends on switching factor
- Not necessarily fully detailed counts
- Accumulate bitwise XOR of current and previous input/output ports
- Sample RegFile ports/bit populations
[Figure: 30 mW (10%) swing, 400 MHz 1.3 V PXA255 processor]
10 Proposals from Experiences
[Figure: bit patterns A and B, 20 mW swing, 400 MHz 1.3 V PXA255 processor]
A: 00001 00001 00001 00001 00001 00001 00001 00001 00001 00001
B: 11111 00000 01111 00000 00111 00000 00011 00000 00001 00000
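The proposed bitline activity count (accumulating the bitwise XOR of current and previous port values) can be sketched in software as below. The `toggle_count` helper is a hypothetical name, and sequences A and B are taken as transcribed from the figure; both make the same number of accesses, so a utilization counter alone would not tell them apart.

```python
from typing import Iterable


def toggle_count(values: Iterable[int]) -> int:
    """Accumulate the popcount of the XOR between consecutive port values."""
    total = 0
    previous = None
    for value in values:
        if previous is not None:
            # Each set bit in the XOR is one bit position that switched.
            total += bin(previous ^ value).count("1")
        previous = value
    return total


SEQ_A = [0b00001] * 10
SEQ_B = [0b11111, 0b00000, 0b01111, 0b00000, 0b00111,
         0b00000, 0b00011, 0b00000, 0b00001, 0b00000]

print(toggle_count(SEQ_A))  # 0
print(toggle_count(SEQ_B))  # 25
```

A hardware counter would not need the full history: it only has to latch the previous value per port and add the XOR population to an accumulator, which is why the slides note that fully detailed counts are not required.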
11 Proposals from Experiences
- 3. More detailed off-chip/memory access support in the embedded domain
- Mem power: 40% of system power
- Tracking memory hierarchy transactions may help render better memory power estimates
- Main memory reads/writes
- Core DMA
- Transaction length in bytes
- Activity factors can be shared with RegFile
[Figure: REX memory power consumption (one 16b bank)]
12 Proposals from Experiences
- 4. Metrics related to queue occupancy
- Modern processor: several queues
- Depending on implementation, power ∝ queue occupancy
Buyuktosunoglu et al., ISLPED '02, "Tradeoffs in Power-Efficient Issue Queue Design"
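One way to picture an occupancy metric is a counter that adds the number of valid queue entries every cycle, so that software can divide by the cycle count to get an average. The `OccupancyCounter` class below is a hypothetical software model of such a counter, not a description of any existing hardware event.

```python
class OccupancyCounter:
    """Accumulates queue occupancy each cycle so software can read an average."""

    def __init__(self) -> None:
        self.cycles = 0
        self.entry_cycles = 0  # sum of valid entries over all observed cycles

    def tick(self, valid_entries: int) -> None:
        """Record one cycle in which the queue held `valid_entries` entries."""
        self.cycles += 1
        self.entry_cycles += valid_entries

    def average_occupancy(self) -> float:
        """Mean entries per cycle, or 0.0 if no cycles were recorded."""
        return self.entry_cycles / self.cycles if self.cycles else 0.0


counter = OccupancyCounter()
for entries in [0, 4, 8, 4]:
    counter.tick(entries)
print(counter.average_occupancy())  # 4.0
```

Where power does track occupancy, the average read from such a counter could feed a power model in the same way an access rate does.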
13 Proposals from Experiences
- 5. General/aggregate metrics in addition to specialized cases/breakdowns simplify runtime sampling for unit accesses
- P4 ex. 1, MOB: only event is MOB_load_replays
- Counts replays for unknown store addr./data, partial/unaligned addr. match
- No info for MOB entries/accesses/updates
- P4 ex. 2, FPU: has 8 separate events (with 2 dedicated ESCRs)
- Need at least 4 rotations to collect
- P4 ex. 3, INT ALU: no dedicated event
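The cost of counter rotations can be seen in a small simulation of time-multiplexed sampling. This sketch assumes a simple round-robin schedule over event groups and extrapolates each event's count by the fraction of intervals in which it was actually counted; the event names, the `read_event` callback, and `sample_with_rotation` itself are hypothetical.

```python
from itertools import cycle
from typing import Callable, Dict, List, Sequence


def sample_with_rotation(
    groups: Sequence[List[str]],
    read_event: Callable[[str, int], int],
    total_intervals: int,
) -> Dict[str, float]:
    """Rotate through event groups, one group per interval, then scale counts.

    Only one group is countable at a time, so each event is observed in a
    subset of intervals and its raw count is extrapolated to the full run.
    """
    raw = {event: 0 for group in groups for event in group}
    active = {event: 0 for group in groups for event in group}
    schedule = cycle(groups)
    for interval in range(total_intervals):
        group = next(schedule)
        for event in group:
            raw[event] += read_event(event, interval)
            active[event] += 1
    # Scale by total/active; events that were never scheduled are omitted.
    return {e: raw[e] * total_intervals / active[e] for e in raw if active[e]}
```

The extrapolation is only accurate when event rates stay roughly steady across intervals, so every additional rotation widens the window in which a phase change can distort the estimate. Aggregate events that cover a unit with a single counter avoid that loss.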
14 Additional Comments for HPC Design
- General/aggregate metrics in addition to specialized cases/breakdowns simplify runtime sampling for unit accesses
- Metrics related to RegFile accesses vs. forwarding
- Semi-distributed implementations will always induce dependencies among simultaneously countable events
- Higher parallelism among (power-oriented) metrics for minimal counter rotations at runtime
- Implementations that allow counter rotations without need for intermediate logging
- Partitioned / dual-mode / buffered counters
- Different events for different types of accesses to the same units with different-magnitude power implications
- e.g., branch scan < BHT update < BTA update
- Different API/SW demands
- Lightweight implementations for runtime analyses
- Per-thread for application profiling vs. global for real-time measurement comparisons and hotspots
15 Wishlist for Power/Thermal
- 1) For each physical unit on die, separate events to track utilization rates
- Sub-events for different types of accesses with different power costs
- 2) Bitline activity counters for switching units
- 3) Occupancy counters for related queues
- 4) Counter support for off-core memory accesses
- 5) High parallelism among power events for minimal counter rotations
16 Conclusions
- New opportunities remain to be explored in future PMC designs for power and thermal studies
- Direct correspondence to physical units
- Bitline and occupancy counters
- We believe in the feasibility of these additions with the continuing emphasis given to counter design, as long as power is also considered a primary design target.