Title: PowerAnalyzer for Pocket Computers
1PowerAnalyzer for Pocket Computers
- Dr. Robert Graybill's PAC/C Program
- Second review May 22, 2001
- Todd Austin and Trevor Mudge, U. Michigan
- Dirk Grunwald, U. Colorado
- http://www.eecs.umich.edu/jringenb/power/
2Agenda
- Introduction
- Team members
- Milestones
- Action items from last review (25th September 2000)
- Budget overview
- Collaborations
- Work with other group in PAC/C and outside
- Presentation breakdown
- FastPower
- Quick estimates of power
- SimpleScalar ARM target support
- Power Analyzer: Data Sensitive Parameterized Architectural Level Power Estimator
- Microarchitectural power/performance estimator
- SystemPower
- Complete system
- Dynamic voltage scheduling
- Vertigo
- PowerScale
- Wrap up
3Team Members
- University of Colorado
- Dirk Grunwald
- Students
- Jason Casmira
- Soraya Ghiasi
- Brad Morrey
- Mike Neufeld
- Audon Tornquist
- University of Michigan
- Todd Austin
- Trevor Mudge
- Students
- Dan Ernst
- Kris Flautner
- Nam Kim
- Rajeev Krishna
- Jeff Ringenberg
- Chris Weaver
- Industrial Partners
- Intel
- Support for two students
- XScale evaluation boards
- Mentors: George Cai (Texas), Doug Carmean and Rich Uhlig (Portland), Chris Newburn (Santa Clara), Mike Morrow (Chandler, AZ)
- Cobalt Networks
- Equipment for system level modeling and optimization
- Compaq Computer
- Itsy motherboards
other support/fellows/postdocs
4Milestones
today
- SS/ARM available since mid-November, used by 10 PAC/C groups
- Power model design nearing completion
- Pending: integration of S/DRAM power, interconnect, framework for external device power
- Platform simulator on target for end of summer / early fall
- Initial release targets SA-1100, later release for full system
- DVS interface available for Crusoe, LRH board, SA-1100, AMD K6 w/PowerNow
5Action Item from Last Review
- Tracking Legend
- !Active action item
- Completed action item
- New action item since last update
- !Need to provide calibration of simulation models that the PAC/C community is using; specifically included is the need to provide specification and calibration of the SimpleScalar simulation tool to support the PAC/C community's use of the model. (See Power Analyzer: Data Sensitive Parameterized Architectural Level Power Estimator.)
- !Examine and suggest a list of the simulation tools that will be utilized to address future architecture developments, and determine what will be necessary to prepare for the next generation of architectures and their simulation and support, and what efforts will be necessary to have calibrated tools in place for future architectural developments. (See Power Analyzer: Data Sensitive Parameterized Architectural Level Power Estimator. List construction on-going.)
- !Present and offer benchmarks (MiBench) to PAC/C community. (See SimpleScalar ARM Target Support. Work on-going.)
- !Describe the modularity and parameterization for varying processes that can be addressed through the GUI interface proposed; define the interfaces, parameters, and inputs planned for the tools being developed and that will be available to the PAC/C community. Address modularity and the ability to work with other tools at multiple levels. (See FastPower, SimpleScalar ARM Target, SystemPower. Work on-going.)
- !What particular demonstrations would be recommended with Land Warrior: what elements will be available, how would they be of value, how could they be demonstrated? (Intel board, Transmeta. Massoud and Dennis are aware.)
- !Coordinate with UC Berkeley on ECAD and coordinate with UC Berkeley GSRC on tools and ECAD efforts. (Discussion and visit with Trevor Pering, now at Intel's MRL.)
- !Determine what input is required from the Land Warrior application to allow the pursuit of that application under this effort. (Benchmarks. Massoud and Dennis are aware.)
- !Provide a readable schedule for the program. Place major milestones within the schedule to be presented, including major program elements/milestones at approximately 6-week increments.
- !Provide financial summary information. (See next slide.)
- !Coordinate with the MIT PAC/C effort (Anantha Chandrakasan). (Intel board, and see later.)
6Budget Overview
- Budget: $800k / 2 years
- July 00 through June 02
- Colorado: $250k subcontract
- Received: $600k
- Through December 01
- Expenditure at right
- Summer months
- Full GSRA complement (5)
- Equipment purchases
7Collaborations
- Intel: Mike Morrow, LRH board
- XScale FPGA
- With Maxim fuel gauge to experiment with DVS
- Boards being tested
- Make available to
- Anantha Chandrakasan's group
- Bob Parker's group
- Vince Mooney and Krishna Palem's group
- Dennis Lane and Massoud Pedram's LW effort
- Intel Microprocessor Research Lab
- DVS work: Trevor Pering, Rich Uhlig
- Pentium 4 design study group
- Doug Carmean: pipelines vs. power
- IBM Austin
- Low power server: Gary Carpenter
8Publications
- K. Flautner and T. Mudge. Automatic performance-setting for dynamic voltage scaling. MOBICOM, Rome, July 2001 (to appear).
- T. Mudge. Power: A first class design constraint. Computer, vol. 34, no. 4, April 2001, pp. 52-57.
- T. Mudge. Power: A first class design constraint for future architectures. Proc. 7th Int. Conf. on High Performance Computing - HiPC (Springer Lecture Notes in Computer Science), Dec. 2000, Bangalore, India, pp. 215-224.
- K. Flautner, S. Reinhardt, T. Mudge. Thread-level parallelism and interactive performance of desktop applications. 9th Int. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), Nov. 2000, pp. 129-138.
- Jason Casmira and Dirk Grunwald. Dynamic Instruction Scheduling Slack. Proceedings of the 2000 KoolChips workshop, held in conjunction with MICRO-00.
- Soraya Ghiasi, Dirk Grunwald. A Comparison of Two Architectural Power Models. In Proceedings, Power Aware Computer Systems Workshop.
- Soraya Ghiasi, Jason Casmira and Dirk Grunwald. IPC Matching Mechanisms Using IPC Variation in Workloads with Externally Specified Rates to Reduce Power Consumption. 2000 Workshop on Complexity Effective Design.
- Dirk Grunwald, Phil Levis, Brad Morrey and Mike Neufeld. Policies for Dynamic Clock Scheduling. Proceedings of the 2000 Operating Systems Design and Implementation.
- Farkas et al. Quantifying the Energy Consumption of a Pocket Computer and Java Virtual Machine. Proceedings of the 2000 SIGMETRICS Conference on Computer System Performance.
9FastPower
- Soraya Ghiasi, Jason Casmira
- University of Colorado
10Goals
- Provide quick, accurate estimate of power use by applications on specific platform
- Models an existing processor platform (SA-1100)
- Useful for algorithmic analysis
- Useful for evaluating compiler optimizations
- Cheaper and faster than hooking up a scope
11Two classes of tools
- Functional simulator produces quick results
- Analysis of non-microarchitectural power reduction techniques
- Performance simulator produces slower, more accurate results
- Models all delays and stalls in the processor
- Better estimate of energy-delay product
- Drawbacks
- No Vector Floating Point Architecture (co-processor 10-11) instruction costs
- Currently only measure application performance
12Combination of 2 PAC/C Parts
PowerAnalyzer Components
Per-instruction Power Profiling (LART, then MIT)
13Instruction Energy
- MIT group measured energy cost of repeatedly executing single instructions
- LART group also has voltage-scaling data for same CPU, but not in as much detail. We use this to extrapolate across different clock speeds
- We assume standby on cache miss
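A minimal sketch of the per-instruction approach above. The energy table holds placeholder values (not MIT or LART measurements), and the extrapolation assumes that energy per instruction scales with the square of the supply voltage, a common first-order model rather than the tool's confirmed method. Cache-miss standby is not modeled here.

```python
# Hypothetical per-instruction energy table in nanojoules at a reference
# voltage; the numbers are placeholders, not measured values.
ENERGY_NJ = {"add": 1.0, "ldr": 1.5, "mul": 2.0}
REF_VOLTAGE = 1.5  # volts, assumed reference operating point


def scaled_energy_nj(opcode, voltage):
    """Extrapolate one instruction's energy to another supply voltage.

    Assumes dynamic energy scales as V^2 (first-order model).
    """
    return ENERGY_NJ[opcode] * (voltage / REF_VOLTAGE) ** 2


def program_energy_nj(counts, voltage=REF_VOLTAGE):
    """Sum energy over an instruction-count histogram {opcode: count}."""
    return sum(n * scaled_energy_nj(op, voltage) for op, n in counts.items())


counts = {"add": 100, "ldr": 40, "mul": 10}
print(program_energy_nj(counts))        # energy at the reference voltage
print(program_energy_nj(counts, 0.75))  # same run at half the voltage
```

Under these assumptions, halving the voltage cuts the estimated energy to a quarter; a real calibration would replace both the table and the scaling rule with measured data.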
14Time-varying Power Estimate
15Highlighting Program Source
- We count frequency and energy cost for each instruction in the application
- Perl scripts provide decorated source listing with energy per
- Assembly instruction
- Source line
- Simple, fast feedback on algorithm / compiler efficiency
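The decorated listing can be illustrated with a short sketch. The original tooling is Perl; this Python version is only an illustration, and the trace format, function names and energy values are all hypothetical.

```python
from collections import defaultdict


def energy_per_source_line(trace, energy_nj):
    """Aggregate energy by source line.

    trace: iterable of (source_line_number, opcode), one entry per executed
    instruction; energy_nj: {opcode: energy in nJ} (hypothetical values).
    """
    totals = defaultdict(float)
    for line_no, opcode in trace:
        totals[line_no] += energy_nj[opcode]
    return dict(totals)


def decorate(source_lines, totals):
    """Prefix each source line with its accumulated energy in nJ."""
    return [f"{totals.get(i, 0.0):8.1f} nJ | {text}"
            for i, text in enumerate(source_lines, start=1)]


source = ["x = a + b;", "y = x * c;"]
trace = [(1, "ldr"), (1, "ldr"), (1, "add"), (2, "ldr"), (2, "mul")]
table = {"add": 1.0, "ldr": 1.5, "mul": 2.0}
for row in decorate(source, energy_per_source_line(trace, table)):
    print(row)
```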
16Plans and Release Schedule
- Currently no measurements for memory
- Varies from system to system
- Different memory / CPU clock speeds an issue
- No measurements for FPU / Vector instructions
- Not provided in MIT or LART
- We will calibrate and fill in using Itsy hardware, where we can measure CPU and Memory/IO power
- Measurements and final tool will be included in the PowerAnalyzer suite by end of June
17SimpleScalar ARM Target Support
- Rajeev Krishna, Chris Weaver, Todd Austin
- University of Michigan
18Overview
- The SimpleScalar Tool Set
- ARM Target Implementation
- Instruction Emulation
- SA-1 Pipeline Model
- ARM Cross-Compiler Kit
- MiBench Benchmark Suite
- ARM Target Validation
- Functional Validation
- Performance Model Validation
- ARM Target Deployment
- Ongoing Work
- System Simulation
- Related and Future Work
19The SimpleScalar Tool Set
- Computer system simulation tools
- Developed at University of Michigan
- Maintained by SimpleScalar LLC
- Freely available to academic sites
- Target apps run on emulator
- SPEC95, SPEC2k, MiBench
- Tool set supports multiple ISAs
- PISA, Alpha, PPC, x86, ARM
- Multiple I/O targets supported
- Linux, OSF, Solaris system calls
- Device models (e.g., SA-1110)
- Modeling infrastructure enables analysis of programs and hardware
- performance
- power
- Very portable code base
[Figure: SimpleScalar/ARM layers: MiBench and Linux/ARM workloads; power/performance model (fetch, pipeline, SA-1100 core, predictor, caches); simulation kernel; ARM7 ISA and ARM FPA; Linux/ARM and iPAQ devices; host platform]
20ARM Target Instruction Emulation
- ARM ISA emulation support added to SimpleScalar tool set
- ARM 7 integer instruction set support
- Floating Point Accelerator (FPA) instruction set support
- Linux/ARM system call support added
- system calls are implemented by the simulator
- portable I/O, but does not capture OS execution
- ARM CISC instructions required microcode support
- required for accurate microarchitectural modeling
Instruction:
stmdb r13!,{r4-r8,r10-r15}
Microcode:
agen tmp1,r13,0
agen tmp0,tmp1,-16
stp r11,tmp0
agen r13,r13,-16
agen tmp0,tmp1,-12
stp r12,tmp0
agen tmp0,tmp1,-8
stp r14,tmp0
agen tmp0,tmp1,-4
stp r15,tmp0
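The micro-op listing above follows a regular pattern, which the following sketch reproduces for a store-multiple, decrement-before instruction with base writeback. The function name and the exact ordering rule (base writeback emitted right after the first store) are inferred from the slide's listing, not taken from the SimpleScalar source.

```python
def expand_stmdb(base, regs):
    """Expand 'stmdb r<base>!, {regs}' into agen/stp micro-ops.

    Registers are stored in ascending order at ascending addresses below
    the original base; the base register is decremented by 4 bytes per
    register (writeback).
    """
    regs = sorted(regs)
    size = 4 * len(regs)
    uops = [f"agen tmp1,r{base},0"]           # latch the original base
    for i, reg in enumerate(regs):
        offset = -size + 4 * i
        uops.append(f"agen tmp0,tmp1,{offset}")
        uops.append(f"stp r{reg},tmp0")
        if i == 0:
            # writeback placed after the first store, as on the slide
            uops.append(f"agen r{base},r{base},{-size}")
    return uops


for uop in expand_stmdb(13, [11, 12, 14, 15]):
    print(uop)
```

Called with registers r11, r12, r14 and r15, the sketch emits the same ten micro-ops shown on the slide.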
21Processor Performance Model
- SA-1 pipeline model implemented
- pipeline used in Intel's SA-11xx
- simple five stage pipeline
- two level memory hierarchy
- Challenging task due to lack of info on SA-1 microarchitecture
- derived many details from the compiler writer's guide
- used directed black-box testing to fill in the rest of the blanks
- prototype XScale model completed
- Intel's new StrongARM processor
- based on (sparse) published details
- will validate when we secure a real evaluation platform (mid-summer?)
[Figure: SA-1 pipeline (IF, ID, EX, MEM, WB) with I and D caches, IMMU and DMMU, and physical memory]
22ARM Cross-Compiler Kit
- Permits users to compile ARM binaries w/o ARM hardware
- Most users lack access to a real ARM target with a native compiler
- We use Rebel.com's NetWinder platforms to build native binaries
- GNU GCC targeted to ARM ISA
- includes soft-float support (permits compilation for non-FP hardware)
- GNU binutils targeted to ARM ISA
- GNU ld linker
- GNU binary utilities, e.g., objdump, nm, size, etc.
- pre-built C libraries for ARM ISA
- targeted to Linux system call interfaces
- portable code base
23MiBench Benchmark Suite
- Unencumbered embedded benchmark suite
- Includes source code and multiple benchmark inputs
- With binaries compiled for SimpleScalar/ARM simulator
- Preliminary report details benchmarks and performance characteristics
- Six embedded programming domains (22 benchmarks)
- Automotive/industrial
- Process control kernels from engine control, sensor monitoring
- Networking
- Shortest path router, Patricia tree, packet processor, CRC32
- Security
- Private and public key ciphers, digest routines
- 3DES, Blowfish, SHA, AES finalists
- Consumer
- Multimedia, image processing, entertainment
- JPEG, Dither, RGBA, MediaBench, DOOM
- Office
- Spell, Grep, Ghostscript Postscript Interpreter
- Telecommunications
- FFT, GSM, ADPCM
24ARM Target Validation
- ARM 7 ISA validated against reference implementation
- Functional validation implemented with random testing
- using the FuzzBuster framework
- Validated against real SA-1100 H/W
- Validated against ARM's ARMulator
- ARM FPA extensions validated against SoftFloat suite
- ARMulator and SA-1110 reference lack FP implementations
- SoftFloat suite implements reference FP with integer ISA
- Large validation effort
- 500 billion instructions tested
- 6 bugs found in the ARMulator! (reported to ARM Ltd)
[Figure: FuzzBuster feeds random instructions and state to the ARM target and to a reference implementation (ARMulator or SA-1100 H/W) and checks whether the results are correct]
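The random-testing idea can be sketched as a differential test: run the same random inputs through the implementation under test and a reference, and record any disagreement. This is not FuzzBuster's actual interface; the two 32-bit add functions are stand-ins chosen only to keep the example self-contained.

```python
import random


def reference_add(a, b):
    """Stand-in reference implementation: 32-bit wrapping add."""
    return (a + b) & 0xFFFFFFFF


def target_add(a, b):
    """Stand-in implementation under test (same result, written differently)."""
    return (a + b) % (1 << 32)


def fuzz(target, reference, trials=10_000, seed=0):
    """Run both implementations on random operands; return mismatching inputs."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        if target(a, b) != reference(a, b):
            mismatches.append((a, b))
    return mismatches


print(len(fuzz(target_add, reference_add)))  # number of disagreements found
```

A fixed seed keeps any failure reproducible, which matters when a mismatch has to be reported against the reference rather than the target.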
25Performance Model Validation
- Performance validation against SA-1110 platform
- Rebel.com NetWinder reference with SA-1 pipeline
- Microbenchmarks were used to reveal and test specific latencies
- e.g., branch mispredictions, cache misses, writeback stalls
- Final validation completed with macrobenchmark testing
- compared IPC of SA-1110 to IPCs computed by SA-1 performance model
- H/W IPCs computed using wall clock time, clock frequency, and known instruction counts
- Excellent IPC correlation across entire test suite
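The hardware IPC calculation described above is simple arithmetic: cycles are wall-clock time multiplied by clock frequency, and IPC is the instruction count divided by cycles. The numbers below are illustrative only, not results from the validation.

```python
def hardware_ipc(instruction_count, wall_clock_seconds, clock_hz):
    """IPC = instructions / cycles, with cycles = time * frequency."""
    cycles = wall_clock_seconds * clock_hz
    return instruction_count / cycles


def relative_error(simulated_ipc, measured_ipc):
    """Fractional difference between simulated and measured IPC."""
    return abs(simulated_ipc - measured_ipc) / measured_ipc


# Illustrative numbers only: 103 million instructions in 1 s at 206 MHz.
measured = hardware_ipc(103_000_000, 1.0, 206_000_000)
print(measured)
print(relative_error(0.52, measured))
```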
26SimpleScalar/ARM Deployment
- First release deployed Dec 2000
- unvalidated ARM ISA emulator
- SimpleScalar performance models
- Second release deployed March 2001
- validated ARM ISA emulator
- validated ARM microcode emulator
- SA-1 validated pipeline model
- ARM cross-compiler kit
- Current PAC/C users
- Georgia Tech, Northwestern, Arizona, Princeton, U-Delaware, USC, ISI, U-Toronto, UC-Irvine, Notre Dame
- We are actively providing support to these users
27SimpleScalar/ARM System Simulation
- System simulation development
- SA-1110 device set
- Compaq iPAQ reference hardware
- Linux MiBench workload
- Status
- Core components deployed
- Virtual memory, RTC, PIC, DMA, SER0 development ongoing
- Booting Linux kernel only requires serial, DMA, PIC, RTC, MMU
- Concurrent Development
- Develop SA-1100 specific devices
- Integrate Bochs platform simulator
[Figure: SA-1110 system bus connecting the SA-1 pipeline (I-cache, IMMU, D-cache, DMMU) to RTC, RAM, PIC, DMA, Flash, SER0, PCMCIA, SER1, SER2 and SER3; legend: complete/deployed, in development/test, next generation]
28SimpleScalar/ARM Platform Simulation
- Bochs Platform Simulator
- Liberally harvesting components from Bochs x86 platform simulator
- Functional models for IDE, Ethernet, VGA, CD-ROM, etc.
- Goal: common device interfaces across x86 and ARM simulators
- Augmenting Devices with Approximate Power Models
- Empirical measurements of 802.11, microdrive, flash, etc.
[Figure: SA-1110 system bus connecting the SA-1 pipeline (I-cache, IMMU, D-cache, DMMU) to RTC, RAM, PIC, DMA, SER0 and a Bochs device model]
29Related Work: SimpleScalar Visualization Tools
30Related Work: SimpleScalar I/O Traces
- The Challenges
- Conduct reproducible experiments with real-time systems
- Share sophisticated workloads between users
- The Solution: External I/O Traces
- Sim-EIO executes any workload, tracing all I/O activity
- Device I/O (or System Calls)
- External Interrupts
- DMA activity
- External I/O (EIO) traces can be executed on any simulator
- 100% reproducible execution
- Fully portable experiments
- Traces can be shared without needing to share traced program components
31Related Work: SimpleScalar/C30
- Many embedded targets feature a DSP
- for fast processing of multimedia workload components
- e.g., signal processing, codec routines, image processing
- typical architecture couples a general purpose microprocessor with a DSP
- Adding TI TMS320C30 (C30) ISA support to SimpleScalar
- integer and FP ISA components
- power control instructions
- Provided as an extensions component
- permits use of general purpose processor model and C30 model in tandem
- inter-processor communication implemented with bi-directional mailbox primitives
- requires a fairly sophisticated compiler tool set (in development)
[Figure: ARM core and C30 core coupled through shared memory and inter-processor interrupts]
32Future Work: Self-Tuned Digital Systems
- Electrical verification determines if implementation is robust
- Design must be functionally correct for all valid (T,V,p,f)
- Design must meet mean-time-to-failure goals (via power/current analysis)
- Verify functional correctness at slow corner (Tmax,Vmin,pslow,fmax)
- Verify power/current characteristics at fast corner (Tmin,Vmax,pfast,fmax)
- Additional margin on clock used to avoid any electrical faults
[Figure: design margin across temperature (min to max), voltage (min to max), frequency (min to max) and process (slow to fast)]
33Future Work: Self-Tuned Microprocessor Systems
[Figure: tuned core with checker (insts to verify) and clock/voltage generator driven by Vdd, temperature and error rate; temperature, voltage and frequency axes show the worst-case margin between the slow corner and actual operating conditions]
- Modern logic design methodologies are very conservative
- Large design margins consume power and performance
- System environment may not be reflective of slow corner
- Employ a checker to enable a self-tuned clock/voltage strategy
- Push clock, drop voltage until desired power-performance characteristics
- If system fails, checker will correct error, notify control system
- Reclaims design margins plus any temperature and voltage margins
- Example checkers
- CRC hardware for a crypto-processor
- DIVA checker processor for microprocessors
34Power Analyzer: Data Sensitive Parameterized Architectural Level Power Estimator
- Nam Sung Kim, Trevor Mudge
- University of Michigan
35Outline
- Previous Work
- Disadvantages of Previous Work
- Features of Power Analyzer
- Power Analyzer Methodology and Estimation Flow
- Power Characterization of Functional Units
- Construction of Power Model
- Experimental Results
- Conclusion and Future Work
36Previous Work
- Wattch Model
- Analytical power models for functional blocks consisting of memory and CAM
- Cai-Lim Model
- Using active and inactive power density model for each functional block
- Limitations of Previous Work
- Ignoring data-sensitivity of functional blocks in power consumption
- 8-bit MULT shows up to 198% difference in power consumption for various input data activities
- Lack of parameterized power models
- Lack of flexibility in power model data structure
- Supporting only fixed form of processor architecture
- Not considering I/O pad power
- I/O pads can consume up to 40% of entire chip power
- Using empirical power values from existing designs
- Unable to consider different architectural configurations
37Features of Power Analyzer
- Data Sensitive Power Model
- Assuming power consumption of functional blocks is proportional to Hamming Distance (HD) of applied input vector sequences
- Parameterized Power Model
- Assuming power consumption of functional block is proportional to bit width
- Rapid power characterization using small circuits
- Supporting analytical power model for cache and CAM structure
- Interconnect and Clock Power Model
- Providing interconnection and clock power estimation framework for different architectural configurations (future work)
- Hierarchical Power Model
- Supporting various levels of abstraction of power model for each functional unit
38Power Analyzer Methodology
- Power Consumption of Functional Blocks
- Different for various circuit design styles
- Fast Power Characterization of Functional Units
- Using existing functional block design or soft macro blocks provided by synthesis tool
- Using gate-level or transistor-level power simulation results for small bit width circuits having similar functionality
- Parameterized Power Models
- Extracting appropriate power values for target architecture using extrapolation technique
- Extracting interconnect and clock capacitance from the estimated transistor count, floorplan and technology (future work)
- Dynamic Power Simulation
- Extracting HD and access activity information of functional units using SimpleScalar architectural level simulation
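A minimal sketch of the data-sensitive part of the methodology: compute the Hamming distance between consecutive input vectors seen by a functional unit, then apply a linear power model. The coefficients `base_mw` and `mw_per_bit_flip` are hypothetical stand-ins for values that would come from gate-level characterization; the linear form matches the ALU and register file trends reported later, not the multiplier.

```python
def hamming_distance(a, b):
    """Number of bit positions that differ between two input vectors."""
    return bin(a ^ b).count("1")


def average_hd(vectors):
    """Mean Hamming distance between consecutive input vectors."""
    pairs = list(zip(vectors, vectors[1:]))
    return sum(hamming_distance(a, b) for a, b in pairs) / len(pairs)


def block_power_mw(vectors, base_mw, mw_per_bit_flip):
    """Linear data-sensitive model: P = base + k * average HD.

    base_mw and mw_per_bit_flip are hypothetical characterization
    coefficients, not values from the presentation.
    """
    return base_mw + mw_per_bit_flip * average_hd(vectors)


inputs = [0x00, 0xFF, 0x0F, 0x0F]
print(average_hd(inputs))
print(block_power_mw(inputs, base_mw=0.2, mw_per_bit_flip=0.05))
```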
39Power Analyzer Power Estimation Flow
Fig 1: Power Analyzer power estimation flow
40Power Characterization of Functional Units
- Power Characterization for Standard Cell Design
- Power Characterization of Functional Unit
- depending on target uP design methodology, circuit style, and process technology
- Target Processor
- MARS: Michigan ARM instruction set compatible microprocessor
- Target Technology
- 0.25um TSMC standard cell and I/O library
- Target Design Flow
- ASIC design flow: synthesizing all functional units using Synopsys Design Compiler and Avant! Rapid Memory Compiler
41Datapath Power - ALU
- Power of ALU Increases Linearly
- Linearly as bit width of ALU increases
- Linearly as HD of applied input vectors increases
Fig 2: ALU power consumption for various bit widths
Fig 3: ALU power consumption for various HD of inputs
42Datapath Power - Multiplier
- Power of Multiplier increases
- Quadratically as bit width increases
- Logarithmically as HD of applied input vectors
increases
Fig 4: MULT power consumption for various bit widths
Fig 5: MULT power consumption for various HD of inputs
43Datapath Power - Register File
- Power of Register File increases
- Linearly as bit width increases
- Linearly as HD of applied input vectors increases
Fig 6: RF power consumption for various bit widths
Fig 7: RF power consumption for various HD of inputs
44I/O Pad and Cache Power
- I/O Pad Power Consumption consists of
- Pad cell internal power
- Load capacitance power
- Most of I/O Pads
- Used by address and data buses
- 64 pads among 78 in/out pads
- Easy to monitor address and data bus activity at the architectural level simulation
- Cache Power Consumption consists of
- Decoder, Bit lines and Sense amps of Data and Tag Arrays
- Characterizing Energy Consumption using
- Capacitance value extraction from CACTI cache access time model (Jouppi)
- Most of cache power consumption is caused by bit lines and sense amp
- Power consumption of bit lines is data independent because of complementary structure of cache bitlines
45Power Analyzer Experimental Results
- Target Design Specification
- ARM instruction set and 32-bit datapath
- 32-bit ALU, 32x32 multiplier and 32 register files
- 4KB instruction and data caches and 32-bit address and data buses
- Power Measurement of Target Design using
- Test program consisting of various instructions
- PrimePower: gate-level dynamic power simulation tool
- Power Estimation Result
- Simulation cycle: 250 cycles at 50MHz
- Power value extrapolation from 8-bit synthesized circuits
- No output load for I/O pad power measurement
FUs       Measurement  Estimated  Error
ALU       1.14mW       0.76mW     33%
Mult      2.60mW       2.58mW     1%
Regfile   4.69mW       3.56mW     24%
I/O Pad   2.58mW       2.36mW     9%
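The error column can be reproduced from the measured and estimated values, assuming error is defined as the absolute difference relative to the measurement (the slide does not state the formula).

```python
RESULTS_MW = {  # functional unit: (measured, estimated), from the results above
    "ALU": (1.14, 0.76),
    "Mult": (2.60, 2.58),
    "Regfile": (4.69, 3.56),
    "I/O Pad": (2.58, 2.36),
}


def percent_error(measured, estimated):
    """Absolute estimation error as a percentage of the measured value."""
    return abs(measured - estimated) / measured * 100.0


for unit, (measured, estimated) in RESULTS_MW.items():
    print(f"{unit:8s} {percent_error(measured, estimated):5.1f}%")
```

With this definition the values round to the reported 33, 1, 24 and 9 percent, and the multiplier's unrounded error is about 0.8 percent.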
46Conclusion and Future Work
- Conclusion
- Suggesting power modeling methodology for standard cell design
- Constructing power model for standard cell
- datapath: ALU, register file, and multiplier
- I/O pads
- Experiment results show
- 0.8% to 33% errors against in-house ARM processor standard cell design with a test program
- Much of combinational block power is consumed by glitch power, one of the major causes of estimation errors
- Future Work
- Cache power model calibration against real design
- Need to make more cache power models for various cache design schemes
- Power model for random logic
- e.g. Control logic
- Clock and interconnection capacitance and power estimation framework
- Clock power: estimating entire chip area and assuming H-clock tree
- Interconnection power: estimating interconnection length among functional units using floorplan and FU area information
47SystemPower
- Soraya Ghiasi, Dirk Grunwald
- University of Colorado
48Goals
- Current SimpleScalar/PowerAnalyzer tools only measure application performance
- We want to analyze system or platform level power and performance.
- Operating system
- Applications
- Most importantly, we want to couple this with power models of external devices
- Compact Flash / Disk
- Wireless Networks
- Target people working on application or system level optimization (routing, compression, file systems, etc.)
49Design
- Based around a design similar to the Bochs platform simulator
- Leverage existing open source device models
- Leverage existing Bochs architecture
- Replace all x86 specific Bochs elements
- Instruction Set Architecture, instruction interpreter, memory system, timing model, etc.
- Power models will be drawn from core PowerAnalyzer components or profiling of existing hardware (like FastPower).
50We need to model MMU in full detail
51This being ARM, nothing is simple
52Device Models
- Devices come in two flavors
- Co-processors (MMU, Floating point)
- Memory mapped interfaces specified by unique addresses in physical memory
- E.g. 0xE0000820 is the address of the UART 0 data-out
- Each memory mapped device handled by specific Object
- E.g. UartMemoryRegion handles 0xE0000820
- FastMap used to handle the common case of accessing physical memory
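The fast-path/slow-path split above can be sketched as follows. Only the name UartMemoryRegion and the address 0xE0000820 come from the slide; the class layout, the method names, and the RAM base and size are hypothetical, and the simulator's real FastMap is not shown in the source.

```python
class UartMemoryRegion:
    """Hypothetical device object owning the UART 0 data-out address."""
    DATA_OUT = 0xE0000820  # address given on the slide

    def __init__(self):
        self.console = []

    def handles(self, addr):
        return addr == self.DATA_OUT

    def write(self, addr, value):
        self.console.append(chr(value & 0xFF))


class MemorySystem:
    def __init__(self, ram_base, ram_size, devices):
        self.ram_base, self.ram_size = ram_base, ram_size
        self.ram = {}            # sparse stand-in for physical memory
        self.devices = devices

    def write(self, addr, value):
        # Fast path: the common case is an ordinary physical-memory access.
        if self.ram_base <= addr < self.ram_base + self.ram_size:
            self.ram[addr] = value
            return
        # Slow path: find the device object that owns this address.
        for device in self.devices:
            if device.handles(addr):
                device.write(addr, value)
                return
        raise ValueError(f"unmapped address {addr:#x}")


uart = UartMemoryRegion()
memory = MemorySystem(ram_base=0xC0000000, ram_size=32 << 20, devices=[uart])
memory.write(0xC0001000, 42)          # fast path
memory.write(0xE0000820, ord("A"))    # slow path, reaches the UART
print(memory.ram, uart.console)
```

Checking the RAM range first keeps the common case to a single comparison, while device lookups pay the cost of a search only when they occur.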
53Diagram of Memory / Device Interaction
[Figure: a memory access (LD R0, R2) reaching physical memory through the FastPath or the SlowPath]
54ARM-in-Bochs: Booting the OS with sim-boot
- Memory system developed
- Flash, DRAM, Peripheral Control Modules, System Control Modules
- Memory Management Unit implemented
- Virtual to physical address translator
- Domain access restrictions
- Open HandHelds bootloader mostly working
- Finish implementing the full ARM instruction set
- Next up: load and run the kernel
55Bootloader Execution
U3 @00000328 F04000000
MTST 00000001 00000002 00000004 00000008 10000000 20000000 40000000 80000000
ENDM
56Bootloader (Continued)
STKP C19F3FFC
MMU table startC19F4000
Boot data startC19F8000
Boot data size00008000
Stack data baseC19F0000
Stack data size00004000
FLASH_BASE 00000000
Evacuating 1MB of Flash to DRAM at C1E00000
done
Map Flash virtual section to DRAM at C1E00000
btflash_init mfrid00890089 devid00180018
walking flash descriptors
57Bootloader (continued)
comp_inst at 711147 0x76dc 0x15148
btflash_init found flash 28F128J3A
flashDescriptor00015148
flashSectors00014F28
nsectors00000080
flash_size02000000
flash_address_mask01FFFFFF
get_param could not find parameter system_rev
dram_size 00000000
uncompress failed, rc 0xFFFFFFFB
>> Compaq OHH BootLoader, Rev 2-13-2
>> Last link date Wed May 16 19:48:03 MDT 2001
>> Contact bootldr@handhelds.org
>> ARM Processor RevE0008000
>> (c) 2000 Compaq Cambridge Research Laboratory
Press Return to start the OS now, any other key for monitor menu
eval param blk autoboot_timeout 0x01C9C380
58Current Bootloader Problems
walking flash descriptors
comp_inst at 711147 0x76dc 0x15148
btflash_init found flash 28F128J3A
flashDescriptor00015148
flashSectors00014F28
nsectors00000080
flash_size02000000
flash_address_mask01FFFFFF
get_param could not find parameter system_rev
dram_size 00000000
uncompress failed, rc 0xFFFFFFFB
>> Compaq OHH BootLoader, Rev 2-13-2
>> Last link date Wed May 16 19:48:03 MDT 2001
>> Contact bootldr@handhelds.org
59Current Problems
- PowerAnalyzer incorrectly decodes certain instructions that only appear in O/S, not applications
- Coprocessor and management instructions
- Also, O/S is expecting to handle FPU itself; PowerAnalyzer handles this automatically
- Adopting some tools from two Redhat/Cygnus projects (cgen and SID) to get around these problems
60Other necessary System structure
- Timers
- Simulated system clock
- DMA read and write functions
- Simulated interrupts
- We're using the design of Bochs (www.bochs.org) to guide our specific implementation
61Modeling Device Power
- The simulator sees microscopic events
- Write the current console output to 0xE0000820
- We need to convert this to reflect power demands of higher level components
- Initial start is focusing on measuring storage and PCMCIA cards
- On-chip devices measured by DAQ
- PCMCIA devices measured using Kaitek extended
63Compact Flash Energy (/32MB)
64Device Power Models
- Need better device models
- For example, wireless cards are
- Transmitting
- Receiving
- Idle
- Powered Off
- In power save mode
- Power transients (e.g. IBM Microdrive)
- Oceans of data
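One way to organize such a device model is a table of per-state power draw integrated over a timeline of state intervals. The milliwatt figures below are placeholders rather than measurements from the 802.11b slides, and power transients between states are ignored in this sketch.

```python
# Hypothetical per-state power draw in milliwatts for a wireless card.
STATE_POWER_MW = {
    "transmit": 1400.0,
    "receive": 900.0,
    "idle": 700.0,
    "power_save": 50.0,
    "off": 0.0,
}


def energy_mj(timeline):
    """Integrate energy over a list of (state, seconds) intervals.

    mW * s = mJ, so the result is in millijoules.
    """
    return sum(STATE_POWER_MW[state] * seconds for state, seconds in timeline)


timeline = [("idle", 2.0), ("transmit", 0.5), ("receive", 1.0),
            ("power_save", 10.0)]
print(energy_mj(timeline))
```

A fuller model would add an energy cost per state transition to capture transients such as disk spin-up.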
65Plans, Strategy
- Get basic framework up and running
- Try to maintain compatibility with Bochs (x86 platform simulator) and SID (platform framework from Redhat)
- Make devices pluggable to allow independent development
- Host SourceForge collaborative development for devices
66Time Line
- Should be able to boot O/S and run Linux by end of summer
- First target since we have access to source code and can examine use of the devices
- Could extend to WinCE
- Extend with power measurements and models
- Attempt validation against existing handheld (iPAQ) in fall
67Operating System DVS Scheduling
Michigan
- Use models of process interaction
- Daemon mediates voltage scaling
- Implementation on Transmeta
Colorado
- Extend interval methods (OSDI00)
- PID controller w/optional signal for responsiveness
- Accurately hits rate based applications w/o real time interface
- Implementation on SA-1100, AMD K6
- Allows comparison to RTLinux
- Multi-architecture DVS interface
[Figure: two software stacks above Linux kernel hooks. One shows applications and a GUI library (optional hints, e.g. "I'm important"), the user-mode Vertigod daemon that implements the performance-setting algorithm, and the Vertigo module (episode detection and tracking, communication with the policy daemon, event tracing, /proc interface). The other shows an application (optional hints), a policy daemon, a modular scheduler and a DVS module with event tracing and a /proc interface]
68Vertigo power management
- Reduce energy consumption by running slower without impacting the user experience.
- Operating voltage can be lowered with frequency.
- Faster performance is not always better.
- Wastes energy.
- Speed improvement may be imperceptible.
- Developed a performance-setting scheme for Linux.
- Automatic quantification of user experience.
- Works with existing user applications.
- Interactive, irregular, multiprogrammed workloads are supported.
- Implemented in the Linux kernel.
- Simulations indicate significant energy savings.
- Working on evaluating Vertigo on actual hardware.
69Vertigo
- Kris Flautner, Trevor Mudge
70Small performance reduction, big energy savings
20% performance reduction: 32% energy reduction
40% performance reduction: 55% energy reduction
71A utilization trace
Each horizontal quantum is a millisecond, height
corresponds to the utilization in that quantum.
72Episode classification
Interactive (Acrobat Reader), Producer (MP3
playback), and Consumer (esd sound daemon)
episodes.
73Episode classification for power reduction
Response time: the time it takes for the computer to respond to user initiated events.
- Faster is not always better.
- Fundamental limit to what is perceptible to humans.
- Movies: 20-30 frames per second.
- Perceptual causality: 50ms-100ms.
- Dragging objects on screen: 200ms.
- Non-continuous operation: 1-2sec.
- Periodic activity determines necessary performance for real-time tasks.
The goal is to run fast enough to meet the perception threshold; there is no point in running any faster.
74Finding interactive episodes
- One way: mouse click indicates start, long idle time indicates end.
- Not always accurate.
- Not all episodes are initiated by mouse click.
- Latency in finding the ends of episodes.
- Our approach: track inter-task communication.
- Accurate.
- Finds all interactive episodes.
- No latency.
- No program modifications required.
75Communication between tasks
76Implementation
- Some kernel modifications are required.
- Example: task switch, creation, exit notification.
- Episode detection implemented in a kernel module.
- Performance setting policy in user-space daemon.
77Experiments
- Implementation on Transmeta Crusoe-based notebook.
- Crusoe has the ability to dynamically change performance levels.
- Goal: validate software.
- Gather initial high-level results.
- Evaluation on Intel XScale LRH prototype board.
- Significantly larger performance range than Transmeta.
- Limited on-board devices.
- Custom-built Intel XScale based board.
- Capabilities similar to Compaq iPAQ.
- Include on-board power measurement infrastructure.
- Enable research into full-system power management.
78BUFScale: Boulder Unified Frequency/voltage Scaling API
79Goals
- NSF and Usenix funded research to develop O/S support for power efficient computing
- Design minimally invasive O/S demands for automatic voltage scaling
- Initial platform was SA-1100. Inflexible.
- Wanted cross-platform mechanism that could be used by numerous policies.
80Supported Platforms
- AMD PowerNow!
- 8 frequency steps, requires support from external VRM
- E.g. our Sony laptop supports 200, 300, 350, 400, 450, 500, 550, 600 (overclocked)
- StrongARM series (SA-110, SA-1100, etc.)
- 16 frequency steps, requires external VRM
- Usually only 11 (59 .. 206) used
- Transmeta 5600
- Variable speed steps, external VRM
- Abstracted by LongRun interface
- Sony PictureBook supports 4 levels
- Intel SpeedStep
- Two steps, external VRM supported by MX chipset
- Intel XScale
- Similar to StrongARM
Implemented, not validated
By end of May?
81BufScale Mechanism
- Abstract speed interface
- 0..100, by scaled integer fraction
- Mechanisms to query each speed setting / voltage
- Turns out to be similar to LongRun interface
- Linux Kernel loadable module
- Other modules record scheduler trace, actions
- Mechanism distinct from Policy
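A sketch of how an abstract 0..100 speed could map onto a platform's discrete steps, using the frequency list quoted for the AMD PowerNow! laptop. The function names and the round-up policy are assumptions for illustration; the actual BUFScale kernel interface is not shown in the source.

```python
# Frequency steps (MHz) quoted for the AMD PowerNow! laptop.
STEPS_MHZ = [200, 300, 350, 400, 450, 500, 550, 600]


def speed_to_step(percent, steps=STEPS_MHZ):
    """Map an abstract 0..100 speed to the slowest step that satisfies it.

    The request is read as a percentage of the fastest step; rounding up
    avoids delivering less performance than requested (assumed policy).
    """
    if not 0 <= percent <= 100:
        raise ValueError("speed must be in 0..100")
    wanted = steps[-1] * percent / 100
    for mhz in steps:
        if mhz >= wanted:
            return mhz
    return steps[-1]


def step_to_speed(mhz, steps=STEPS_MHZ):
    """Report a step as an integer percentage of the fastest step."""
    return mhz * 100 // steps[-1]


print(speed_to_step(50))   # 300
print(speed_to_step(0))    # 200
print(step_to_speed(450))  # 75
```

Because policies only see the 0..100 scale, the same policy code can run on platforms with different step tables, which is the point of separating mechanism from policy.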
83Speed Setting Policies
- Modular implementation of minimally invasive policies
- Uses any combination of past scheduling history
- Uses information about process state
- Initial work is control theory based: Proportional-Integral-Differential Controller (PID)
- Proportional: adjustment signal is directly tied to error
- Integral: adjustment signal is based on all previous errors (sum)
- Differential: adjustment signal is based on current error and previous error.
- But, with mechanism to indicate possible importance
- De-scheduled from I/O wait, emerge from sleep
- Captures unknown state
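The three terms described above correspond to a textbook discrete PID controller, sketched here driving an abstract 0..100 speed from a utilization error. The gains, the target utilization and the sign convention (error is positive when the CPU is busier than desired) are assumptions for illustration, and the importance signal is not modeled.

```python
class PIDController:
    """Textbook discrete PID; the gains used below are hypothetical."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.error_sum = 0.0
        self.prev_error = 0.0

    def update(self, target, measured):
        # Positive error: utilization is above target, so speed should rise.
        error = measured - target
        self.error_sum += error            # integral: sum of all errors
        delta = error - self.prev_error    # differential: current vs previous
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.error_sum
                + self.kd * delta)


def next_speed(speed, adjustment):
    """Apply the adjustment and clamp to the abstract 0..100 range."""
    return max(0.0, min(100.0, speed + adjustment))


pid = PIDController(kp=0.5, ki=0.1, kd=0.2)
speed = 50.0
for utilization in (100.0, 90.0):          # target utilization is 90%
    speed = next_speed(speed, pid.update(90.0, utilization))
    print(speed)
```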
84PID Controller
[Figure: PID controller; labels: 100, 90, Error]
85Performance at Lowest Speed
Utilization
86Performance at Medium Speed
87Hitting target utilization
88Current work
- Expand interface to new platforms
- Athlon-4: 1 GHz, commodity x86, PowerNow!
- SH-4 used in HP handhelds
- Primary research target is in multithreaded processor design
- Co-scheduling for energy efficiency
- Active feedback based on application characteristics
- Preliminary target is a commercially available SMT; will use simulator infrastructure this summer for detailed evaluation