Hardware Performance Counters for Detailed Runtime Power and Thermal Estimations: Experiences

HPCA-11, Feb 13, 2005.

1
Hardware Performance Counters for Detailed
Runtime Power and Thermal Estimations:
Experiences & Proposals
  • Canturk ISCI, Gilberto CONTRERAS, Margaret MARTONOSI

2
Hardware Performance Counters (HPCs) Go beyond
Performance
  • Several explored research avenues
      • Runtime power/thermal estimations
      • Dynamic management
      • Workload phases and application behavior
        prediction
  • HPCs provide value beyond simulations
      • Long timescales
      • Real-system behavior

3
Hardware Performance Counters (HPCs) Go beyond
Performance
  • Runtime power
      • Isci & Martonosi, MICRO 2003
      • Contreras & Martonosi, submitted 2005
  • Runtime thermal
      • Lee & Skadron, HP-PAC in IPDPS 2005
  • Dynamic power management
      • Choi et al., ISLPED 2004
      • Weißel & Bellosa, CASES 2002
  • Dynamic thermal management
      • Bellosa et al., COLP 2003
  • Workload phases and application behavior
    prediction
      • Isci & Martonosi, WWC 2003
      • Duesterwald et al., PACT 2003

4
High-Performance Corner: P4 Power Estimation
  • Idea
  • Motivation
      • Fast (real-time)
      • Estimated view of on-chip detail (per physical
        component)
  • Design
      • Developed heuristics using 24 events to
        approximate access rates for 22 chip components
      • Used 15 counters with 4 rotations to collect all
        event data
  • Validation
      • Real-time estimates against real-time measured
        power
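
The counter rotation mentioned under Design can be illustrated with a small sketch. It assumes a hypothetical round-robin schedule in which each rotation exposes a subset of events for one sampling interval, and it extrapolates each event's count to the full window by the fraction of intervals in which that event was observed. The event names, the grouping, and the helper `extrapolate_counts` are illustrative assumptions, not the actual P4 event assignment, which is constrained by which counters and ESCRs each event can use.

```python
from collections import defaultdict


def extrapolate_counts(rotations, samples):
    """Estimate full-window event counts from time-multiplexed samples.

    rotations: list of event-name lists; rotation i is active in intervals
               i, i + len(rotations), i + 2 * len(rotations), ...
    samples:   list of dicts, one per interval, mapping each event active in
               that interval to its raw count.
    """
    observed = defaultdict(int)   # raw count summed over observed intervals
    intervals = defaultdict(int)  # number of intervals each event was active
    for i, sample in enumerate(samples):
        for event in rotations[i % len(rotations)]:
            observed[event] += sample[event]
            intervals[event] += 1
    total = len(samples)
    # Scale each event by the inverse of the fraction of time it was observed.
    return {e: observed[e] * total / intervals[e] for e in observed}


# Hypothetical grouping into 4 rotations; event names are placeholders.
rotations = [["uops_queued", "tc_access"], ["l2_access"], ["fp_ops"], ["mob_replays"]]
samples = [
    {"uops_queued": 100, "tc_access": 40},
    {"l2_access": 10},
    {"fp_ops": 5},
    {"mob_replays": 2},
    {"uops_queued": 120, "tc_access": 60},
    {"l2_access": 14},
    {"fp_ops": 7},
    {"mob_replays": 4},
]
estimates = extrapolate_counts(rotations, samples)
```

Each event here is visible in 2 of 8 intervals, so every raw sum is scaled by 4. The extrapolation is only as good as the assumption that behavior in unobserved intervals resembles the observed ones, which is one reason fewer rotations are preferable.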

5
P4 Power Estimator Results
[Figure: power traces for Gcc, Gzip, Vpr, Vortex, Gap, and Crafty; legend: Measured]
  • Average difference of 5% among all benchmarks
  • SPEC CPU2000 and other applications

6
Embedded Corner: PXA255 Power Estimation
  • Idea
      CPU Power (n×1) = PerformanceEvents (n×5) × LinearParameters (5×1) + IdlePower
      Mem Power (n×1) = PerformanceEvents (n×2) × LinearParameters (2×1) + IdlePower
  • Motivation
      • Runtime power optimizations under DVFS
  • Design
      • Parameter estimation (OLS) using dominant counter
        readings and live power measurements
      • Power estimation at various CPU configurations
  • Validation
      • Comparison between estimates and real-time
        measured power
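
The linear model and OLS parameter estimation on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it fits power ≈ events · params + idle with ordinary least squares via the normal equations, using only the standard library. The helper names (`solve_linear`, `fit_ols`, `estimate_power`) and all numbers are made up for the example; two event columns are used to mirror the n×2 memory model.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(events, power):
    """Return (params, idle) minimizing the squared error of the linear model."""
    rows = [list(e) + [1.0] for e in events]  # constant column models idle power
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * p for r, p in zip(rows, power)) for i in range(k)]
    coeffs = solve_linear(xtx, xty)
    return coeffs[:-1], coeffs[-1]


def estimate_power(event_row, params, idle):
    """Apply the fitted model to one sample of counter readings."""
    return sum(e * w for e, w in zip(event_row, params)) + idle


# Synthetic training data (not measurements): power = 0.5*e0 + 2.0*e1 + 100.
events = [(10, 1), (20, 3), (5, 8), (40, 2), (15, 15)]
power = [0.5 * e0 + 2.0 * e1 + 100.0 for e0, e1 in events]
params, idle = fit_ols(events, power)
predicted = estimate_power((30, 4), params, idle)
```

In practice the training rows would come from counter readings paired with live power measurements, and a separate fit would be kept for each CPU configuration (voltage/frequency setting).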

7
PXA255 Results
  • DB CDC Java
  • 5% average error across 3 domains
      • Java CDC
      • Java CLDC
      • SPEC2000

8
Proposals from Experiences
  • 1. Track each physical unit individually for
    power & thermal
  • Example: DispatchPorts, Instr-n Queue1, TraceCache, MEM, µop Queue,
    Allocate, Rename, Schedulers, µCodeROM, Instr-n Queue2, and EXE are
    all tracked with in-flight µops written to µop queue
  • Need individual utilization counts for each
    physical unit available on die for power and
    hotspot analyses

9
Proposals from Experiences
  • 2. Need bitline activity counts
      • Utilization is not complete information; power
        in part depends on switching factor
      • Not necessarily fully detailed counts
      • Accumulate bitwise XOR of current and previous
        input/output ports
      • Sample RegFile ports/bit populations

[Figure: 30 mW (10%) power swing on the 400 MHz, 1.3 V PXA255 processor]

10
Proposals from Experiences
  • 2. Need bitline activity counts (continued)

[Figure: bit-pattern sequences A and B on the 400 MHz, 1.3 V PXA255 processor; 20 mW swing]
  B: 11111 00000 01111 00000 00111 00000 00011 00000 00001 00000
  A: 00001 00001 00001 00001 00001 00001 00001 00001 00001 00001
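
The proposed "accumulate bitwise XOR of current and previous input/output ports" can be sketched in software as a toggle counter: XOR consecutive values and add up the set bits. The sketch below applies it to sequences A and B from this slide. Treating the five-bit patterns as literal values is an assumption, since the slide appears to abbreviate wider operands, and `toggle_count` is a hypothetical helper name.

```python
def toggle_count(values):
    """Sum of bit flips between consecutive values (popcount of their XOR)."""
    total = 0
    for prev, cur in zip(values, values[1:]):
        total += bin(prev ^ cur).count("1")
    return total


# Sequence A repeats one value; sequence B alternates with all-zeros.
seq_a = [0b00001] * 10
seq_b = [0b11111, 0b00000, 0b01111, 0b00000, 0b00111,
         0b00000, 0b00011, 0b00000, 0b00001, 0b00000]

toggles_a = toggle_count(seq_a)  # 0
toggles_b = toggle_count(seq_b)  # 25
```

Both sequences make the same number of accesses, so a utilization counter alone cannot tell them apart; the toggle count separates them (0 versus 25 bit flips), which is the kind of switching information the slide argues is missing.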
11
Proposals from Experiences
  • 3. More detailed off-chip/memory access support
    in the embedded domain
  • Mem power is 40% of system power
  • Tracking memory hierarchy transactions may help
    render better memory power estimates
  • Main memory Read/Writes
  • Core DMA
  • Transaction length in bytes
  • Activity factors can be shared with RegFile

[Figure: REX Memory power consumption (one 16b bank)]

12
Proposals from Experiences
  • 4. Metrics related to queue occupancy
      • Modern processor: several queues
      • Depending on implementation: Power ∝ queue
        occupancy

Buyuktosunoglu et al., ISLPED'02: Tradeoffs in
Power-Efficient Issue Queue Design
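
A queue-occupancy metric of the kind proposed here could be modeled as a counter that adds the number of valid entries every cycle, so that dividing by elapsed cycles gives average occupancy. The class below is a hypothetical software model for illustration only. It does not describe an existing hardware event, and whether power actually scales with occupancy depends on the queue implementation, as the slide notes.

```python
class OccupancyCounter:
    """Software model of a hypothetical queue-occupancy counter."""

    def __init__(self):
        self.cycles = 0
        self.entry_cycles = 0  # sum over cycles of valid queue entries

    def tick(self, valid_entries):
        """Record one cycle in which `valid_entries` slots were occupied."""
        self.cycles += 1
        self.entry_cycles += valid_entries

    def average_occupancy(self):
        """Mean number of occupied entries per cycle (0.0 if no cycles seen)."""
        return self.entry_cycles / self.cycles if self.cycles else 0.0


counter = OccupancyCounter()
for valid in [0, 2, 4, 4, 6, 8]:  # made-up per-cycle occupancy trace
    counter.tick(valid)
avg = counter.average_occupancy()  # 4.0
```

Only the two accumulated values need to be read at sampling time, which keeps runtime overhead comparable to reading any other event counter.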
13
Proposals from Experiences
  • 5. General/aggregate metrics in addition to
    specialized cases/ breakdowns simplify runtime
    sampling for unit accesses
  • P4 ex. 1, MOB: only event is MOB_load_replays
      • Counts replays for unknown st addr./data,
        partial/unaligned addr. match
      • No info for MOB entries/accesses/updates
  • P4 ex. 2, FPU: has 8 separate events (with 2
    dedicated ESCRs)
      • Need at least 4 rotations to collect
  • P4 ex. 3, INT ALU: no dedicated event

14
Additional Comments for HPC Design
  • General/aggregate metrics in addition to
    specialized cases/ breakdowns simplify runtime
    sampling for unit accesses
  • Metrics related to RegFile accesses vs.
    forwarding
  • Semi-distributed implementations will always
    induce dependencies among simultaneously
    countable events
  • Higher parallelism among (power oriented) metrics
    for minimal counter rotations at runtime
  • Implementations that allow counter rotations
    without need for intermediate logging
  • Partitioned / Dual-mode / Buffered counters
  • Different events for different types of accesses
    to same units with different magnitude power
    implications
  • i.e. branch scan < BHT update < BTA update
  • Different API/SW demands
  • Lightweight implementations for runtime analyses
  • Per-thread for application profiling vs. global
    for real-time measurement comparisons and
    hotspots

15
Wishlist for Power/Thermal
  • 1) For each physical unit on die, separate events
    to track utilization rates
  • Sub events for different type of accesses with
    different power costs
  • 2) Bitline activity counters for switching units
  • 3) Occupancy counters for related queues
  • 4) Counter support for off-core memory accesses
  • 5) High parallelism among power events for
    minimal counter rotations

16
Conclusions
  • New opportunities remain to be explored in future
    PMC designs for power and thermal studies
  • Direct correspondence to physical units
  • Bitline and occupancy counters
  • We believe in the feasibility of these additions
    with the continuing emphasis given to counter
    design, as long as power is also considered a
    primary design target.