Title: Power Analyzer Program Review
1 Power Analyzer Program Review
- Dirk Grunwald, University of Colorado
2 Overview
- Clustered Voltage Scaling
  - Motivator to ensure our power analyzer can model new microarchitectural mechanisms
- Operating System Voltage Scaling
- Memory system control for low power
3 Lessons from Physics
4 Dynamic Voltage Scaling in SA-2
2 billion instructions at 750 MIPS: takes 2.7 seconds, consumes 1200 J.
5 Running just fast enough cuts energy 3-fold
[Figure: energy over a 12 second duty cycle, 1200 J versus 432 J]
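The arithmetic behind these two slides can be checked with a short sketch. It uses only the figures quoted on the slides (2 billion instructions, 750 MIPS, 1200 J, 432 J, a 12 second duty cycle). Two things are assumptions rather than statements from the slides: that "just fast enough" means stretching the work across the whole 12 second duty cycle, and that energy for a fixed amount of work scales with the square of the supply voltage. All names are illustrative.

```python
# Figures quoted on the slides.
INSTRUCTIONS = 2e9            # work per duty cycle
FULL_SPEED_MIPS = 750         # full-speed execution rate
FULL_SPEED_ENERGY_J = 1200.0  # energy when running at full speed
SLOW_ENERGY_J = 432.0         # energy when running "just fast enough"
DUTY_CYCLE_S = 12.0           # time available for the work


def runtime_seconds(instructions, mips):
    """Time needed to execute `instructions` at `mips` million instr/s."""
    return instructions / (mips * 1e6)


def required_mips(instructions, deadline_s):
    """Slowest rate (in MIPS) that still finishes within `deadline_s`."""
    return instructions / deadline_s / 1e6


full_speed_time = runtime_seconds(INSTRUCTIONS, FULL_SPEED_MIPS)
slow_mips = required_mips(INSTRUCTIONS, DUTY_CYCLE_S)  # assumes full 12 s is usable
energy_ratio = FULL_SPEED_ENERGY_J / SLOW_ENERGY_J

# Assumption (not from the slides): energy for fixed work scales with V^2,
# so the voltage fraction is the square root of the energy fraction.
implied_voltage_fraction = (SLOW_ENERGY_J / FULL_SPEED_ENERGY_J) ** 0.5

print(f"full-speed runtime: {full_speed_time:.2f} s")
print(f"rate needed to fill the duty cycle: {slow_mips:.0f} MIPS")
print(f"energy ratio: {energy_ratio:.2f}x")
print(f"implied voltage fraction under V^2 scaling: {implied_voltage_fraction:.2f}")
```

The energy ratio comes out a little under 3, which matches the "3-fold" claim as a round figure.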
6 You can exploit slack at many levels
- Circuits: dual Vdd or dual Vt designs
  - "Design methodology of Ultra Low-Power MPEG4 Codec Core Exploiting Voltage Scaling Techniques", Igarashi et al., DAC98: 50% power reduction, no performance loss
- Architecture
  - Macro-level clustered voltage scaling, multi-voltage multithreading
- Operating Systems
- Applications / Runtime Systems
7 Why are we looking at system-level savings modeling?
- Computer architecture folks are the only people who would watch an MPEG movie at 120 fps
- We ignore interactive applications
  - But that's the future of computing
- And we ignore human response times...
  - Do credit card transactions need to be faster than 1/10th of a second?
8 Clustered Voltage Scaling
- Voltage scaling normally refers to varying voltage over time.
- CVS is voltage scaling in space
  - Run part of a processor at a different V and f
- Historically, done at circuit level
- We're trying to exploit it at component level
9 Clustered Voltage Scaling
- CVS already applied at circuit level
  - Mitsubishi designed MPACT media processor w/ 2 voltage levels for 43% power savings, 10% area increase, no performance hit
10 Slack Scheduling
- Use inherent instruction dependencies and operational latencies to form alternate schedules
- Slack is the minimum time difference, in cycles, between when an instruction's output is produced and when it is consumed
- We want to exploit slack within a cluster of functional units
- Schedule slackful instructions to slower pipelines that run at ½ speed and reduced power
11 Exploiting Slack
add   r0, r1, r2     (A)
sub   r3, r4, r5     (B)
and   r9, 0x1, r9    (C)
ornot r5, r9, r10    (D)
xor   r2, r10, r11   (E)
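A minimal sketch of how slack could be computed for the five instructions above, using the slide's definition (cycles between when a result is produced and when it is first consumed). Several things here are assumptions, not taken from the slides: the last operand is the destination register, every instruction has a one-cycle latency, issue width is unlimited, and each instruction issues as soon as its inputs are ready. The function and variable names are illustrative.

```python
# Each entry: (label, source registers, destination register).
# Assumption: the last operand of each instruction is its destination.
PROGRAM = [
    ("A", ("r0", "r1"), "r2"),    # add   r0, r1, r2
    ("B", ("r3", "r4"), "r5"),    # sub   r3, r4, r5
    ("C", ("r9",), "r9"),         # and   r9, 0x1, r9
    ("D", ("r5", "r9"), "r10"),   # ornot r5, r9, r10
    ("E", ("r2", "r10"), "r11"),  # xor   r2, r10, r11
]
LATENCY = 1  # assumed uniform latency in cycles


def compute_slack(program, latency=LATENCY):
    """Return (issue cycle, slack) per instruction for an as-soon-as-possible schedule.

    Slack is the minimum gap, in cycles, between when an instruction's
    result is ready and when any consumer issues. Instructions whose
    result is not consumed inside the block get a slack of None.
    """
    last_writer = {}  # register -> label of its most recent producer
    issue = {}        # label -> earliest issue cycle
    consumers = {label: [] for label, _, _ in program}

    for label, sources, dest in program:
        producers = [last_writer[r] for r in sources if r in last_writer]
        # Issue once every producing instruction has finished.
        issue[label] = max((issue[p] + latency for p in producers), default=0)
        for p in producers:
            consumers[p].append(label)
        last_writer[dest] = label

    slack = {}
    for label, _, _ in program:
        ready = issue[label] + latency
        uses = consumers[label]
        slack[label] = min(issue[c] - ready for c in uses) if uses else None
    return issue, slack


issue, slack = compute_slack(PROGRAM)
for label, _, _ in PROGRAM:
    print(label, "issues at cycle", issue[label], "slack:", slack[label])
```

Under these assumptions only A has slack (one cycle): its result is ready after cycle 0, but E cannot issue until D has finished. B, C and D sit on the critical path and have none.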
12 Simulation Methodology
- Simulation architecture
  - SimpleScalar 3.0a w/ CaiWattch mode
  - 4-wide 21264, 16-entry RUU
- SPECint95 benchmarks
13 Results and Potential
- Over 90% of issue cycles have at least one instruction that has 1 cycle of slack
  - This means that 90% of the time, we could run one instruction on a slow pipe without impacting performance
- Between 1-7% have 2 cycles
- 68-87% of instructions with slack are integer
14 Story Gets Better With More Aggressive Processor
- Slack is affected by deeper RUU
  - More opportunities to find slack
  - More slack values > 1 cycle available
[Figure: panels labeled RUU Size 3, RUU Size 4, RUU Size 5]
15 Operating System Scheduling
- Goals: control power using clock / voltage scheduling
  - Real systems
  - Real apps
- To date: comparison study showing that previously proposed heuristics don't really work well
  - We know why
  - How to fix it
16 What are some challenges?
- How slow is fast enough?
- How do I tell the architecture?
- Enforce constraints?
- Not miss deadlines?
- Can I define benchmarks and evaluation
methodologies for human-scale computing where
voltage?
17 Difficult to predict application demands
Goal: Don't disturb application behavior.
Inelastic performance.
Speech Rendering
Audio Rendering
18 Prior work
- Weiser et al. and Govil et al.
  - Used intervals
  - Selected average / weighted average
  - Reported great success
- Pering et al.
  - Tried intervals
  - Switched to RTOS, which has high demands on applications
- Is RTOS really needed?
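To make the interval-based idea concrete, here is a generic sketch of the two predictor styles named above: a plain average and a weighted average over the CPU utilization seen in past scheduling intervals. It is not the algorithm from any of the cited papers, whose details are not given on the slides. The function names, the exponential form of the weighting, the default weight and the sample history are all hypothetical.

```python
def average_prediction(history):
    """Predict next-interval utilization as the mean of past intervals."""
    return sum(history) / len(history)


def weighted_average_prediction(history, weight=0.5):
    """Predict next-interval utilization with an exponentially weighted average.

    `weight` (hypothetical default) is the share given to the newest
    interval; older intervals decay geometrically.
    """
    estimate = history[0]
    for utilization in history[1:]:
        estimate = weight * utilization + (1 - weight) * estimate
    return estimate


# Hypothetical utilization trace for a bursty application (fraction of
# each interval the CPU was busy), oldest interval first.
history = [0.2, 1.0, 0.2, 1.0]

print("average:", average_prediction(history))
print("weighted average:", weighted_average_prediction(history))
```

A scheduler built on either predictor would then pick a clock setting able to serve the predicted utilization in the next interval.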
19 Evaluation
- Implement clock scheduling module in Linux 2.0.35 kernel
  - Extensible: can model all practical prior policies
- StrongARM SA-1100 provides 15 clock steps from 56 MHz to beyond 206 MHz
- Used modified motherboard
- Useful, but not critical in early study
- Drop from 1.5 V to 1.23 V only provides 10% power reduction
- Measured reasonable applications
  - Text speaker, chess player, Web browser, MPEG video player
20 Widely Varying Power Usage
- Better evaluation metric given resources is scheduling stability.
  - E.g. MPEG-1 player runs at 80% utilization at 206 MHz
  - Should be able to settle at 176 MHz at 93% utilization
- But, the best policy has widely varying power settings from
  - Best policy is: assume next scheduling interval is same as prior
  - Only run at fastest or slowest setting
  - This is awful, but we know why this happens
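The MPEG-1 figures above can be reproduced with a short sketch that rescales a measured utilization to another clock frequency and picks the slowest step that still fits. It assumes the work takes the same number of cycles at every frequency, which ignores memory-bound effects. The list of steps holds only the three frequencies mentioned on the slides, not the full set of 15 clock steps, and the function names are illustrative.

```python
def utilization_at(freq_mhz, measured_util, measured_freq_mhz):
    """Rescale utilization to `freq_mhz`, assuming a constant cycle count."""
    return measured_util * measured_freq_mhz / freq_mhz


def slowest_sufficient_step(steps_mhz, measured_util, measured_freq_mhz,
                            max_util=1.0):
    """Return the lowest clock step whose rescaled utilization fits `max_util`.

    Falls back to the fastest step if none is sufficient.
    """
    for freq in sorted(steps_mhz):
        if utilization_at(freq, measured_util, measured_freq_mhz) <= max_util:
            return freq
    return max(steps_mhz)


# Illustrative subset of clock steps (MHz); only values quoted on the slides.
STEPS_MHZ = [56, 176, 206]

target = slowest_sufficient_step(STEPS_MHZ, measured_util=0.80,
                                 measured_freq_mhz=206)
print("settle at", target, "MHz")
print("utilization there:", round(utilization_at(target, 0.80, 206), 3))
```

With these inputs the sketch settles at 176 MHz with a utilization of about 0.94, close to the 93% quoted on the slide.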
21 Leading to bursty energy demands...
22 What we're doing now
- All this work done in old O/S
  - Upgrading to Linux 2.4.0, interoperation with iPAQ Itsy
- Determine minimal O/S mechanism
  - Simple: "Go fast" vs. "I went too slow"
  - More complex: soft real time system
- Application-specific behavior state
  - Via queue length for events in e.g. Java applications
- Trying to implement control mechanism in a number of processor families
  - AMD K6-III Mobile
  - SpeedStep
  - XScale
23 Memory Management and Energy Efficiency
- Dynamic memory management is a large part of complex applications
- Implemented four memory management mechanisms on Itsy
  - No allocation, explicit allocation, conservative allocation, incremental conservative allocation
  - Measured processor, system with DAQ
- Energy not always correlated with performance
  - Interaction of CPU and memory system fairly complicated
  - SA-1 places CPU in sleep mode on memory traffic, slows clock.
  - Thus, memory traffic can take much time but less energy
- Plan to exploit interaction with O/S for powering down memory pages
24 Collateral Related Projects
- NSF ITR proposal funded on integrated power management of wireless and system resources
  - Management of 802.11b performance for location, trajectory, real time constraints
  - Ad hoc routing for global energy minimization
- Broader effort leading to an inter-departmental center
  - Colorado Center for low-power, ubiquitous, mobile and pervasive systems (CCLUMPS)
  - Circuits, architecture, telecom, computer services, applications