PowerAnalyzer for Pocket Computers

1
PowerAnalyzer for Pocket Computers
  • Dr. Robert Graybill's PAC/C Program
  • Second review May 22, 2001
  • Todd Austin and Trevor Mudge, U. Michigan
  • Dirk Grunwald, U. Colorado
  • http://www.eecs.umich.edu/jringenb/power/

2
Agenda
  • Introduction
  • Team members
  • Milestones
  • Action items from last review, 25 September 2000
  • Budget overview
  • Collaborations
  • Work with other groups in PAC/C and outside
  • Presentation breakdown
  • FastPower
  • Quick estimates of power
  • SimpleScalar ARM target support
  • Power Analyzer: Data-Sensitive Parameterized
    Architectural-Level Power Estimator
  • Microarchitectural power/performance estimator
  • SystemPower
  • Complete system
  • Dynamic voltage scheduling
  • Vertigo
  • PowerScale
  • Wrap up

3
Team Members
  • University of Colorado
  • Dirk Grunwald
  • Students
  • Jason Casmira
  • Soraya Ghiasi
  • Brad Morrey
  • Mike Neufeld
  • Audon Tornquist
  • University of Michigan
  • Todd Austin
  • Trevor Mudge
  • Students
  • Dan Ernst
  • Kris Flautner
  • Nam Kim
  • Rajeev Krishna
  • Jeff Ringenberg
  • Chris Weaver
  • Industrial Partners
  • Intel
  • Support for two students
  • XScale evaluation boards
  • Mentors: George Cai (Texas); Doug Carmean and
    Rich Uhlig (Portland); Chris Newburn (Santa
    Clara); Mike Morrow (Chandler, AZ)
  • Cobalt Networks
  • Equipment for system-level modeling and
    optimization
  • Compaq Computer
  • Itsy motherboards

other support/fellows/postdocs
4
Milestones
today
  • SS/ARM available since mid-November, used by 10
    PAC/C groups
  • Power model design nearing completion
  • Pending: integration of S/DRAM power,
    interconnect, framework for external device power
  • Platform simulator on-target for end-of-summer /
    early fall
  • Initial release targets SA-1100, later release
    for full system
  • DVS interface available for Crusoe, LRH board,
    SA-1100, AMD K6 w/PowerNow

5
Action Item from Last Review
  •  Tracking Legend
  • ! Active action item
  • Completed action item
  • New action item since last update
  • !Need to provide calibration of simulation models
    that the PAC/C community is using, specifically
    including specification and calibration of the
    SimpleScalar simulation tool to support the PAC/C
    community's use of the model. (See Power Analyzer:
    Data-Sensitive Parameterized Architectural-Level
    Power Estimator.)
  • !Examine and suggest a list of the simulation
    tools that will be used to address future
    architecture developments; determine what will
    be necessary to prepare for the next generation
    of architectures and their simulation and
    support, and what efforts will be necessary to
    have calibrated tools in place. (See Power
    Analyzer: Data-Sensitive Parameterized
    Architectural-Level Power Estimator. List
    construction on-going.)
  • !Present and offer benchmarks (MiBench) to the
    PAC/C community. (See SimpleScalar ARM Target
    Support. Work on-going.)
  • !Describe the modularity and parameterization
    for varying processes that can be addressed
    through the proposed GUI interface; define the
    interfaces, parameters, and inputs planned for
    the tools being developed and that will be
    available to the PAC/C community. Address
    modularity and the ability to work with other
    tools at multiple levels. (See FastPower,
    SimpleScalar ARM Target, SystemPower. Work
    on-going.)
  • !What particular demonstrations would be
    recommended with Land Warrior: what elements
    will be available, how would they be of value,
    how could they be demonstrated? (Intel board,
    Transmeta. Massoud and Dennis are aware.)
  • !Coordinate with UC Berkeley on ECAD and with
    UC Berkeley GSRC on tools and ECAD efforts.
    (Discussion and visit with Trevor Pering, now at
    Intel's MRL.)
  • !Determine what input is required from the Land
    Warrior application to allow the pursuit of that
    application under this effort. (Benchmarks.
    Massoud and Dennis are aware.)
  • !Provide a readable schedule for the program.
    Place major milestones within the schedule to
    be presented, including major program
    elements/milestones at approximately 6-week
    increments.
  • !Provide financial summary information. (See next
    slide.)
  • !Coordinate with the MIT PAC/C effort (Anantha
    Chandrakasan). (Intel board, and see later.)

6
Budget Overview
  • Budget: 800k over 2 years
  • July '00 through June '02
  • Colorado: 250k subcontract
  • Received: 600k
  • Through December '01
  • Expenditure at right
  • Summer months
  • Full GSRA complement (5)
  • Equipment purchases

7
Collaborations
  • Intel: Mike Morrow, LRH board
  • XScale FPGA
  • With Maxim fuel gauge to experiment with DVS
  • Boards being tested
  • Make available to
  • Anantha Chandrakasan's group
  • Bob Parker's group
  • Vince Mooney and Krishna Palem's group
  • Dennis Lane and Massoud Pedram's LW effort
  • Intel Microprocessor Research Lab
  • DVS work: Trevor Pering, Rich Uhlig
  • Pentium 4 design study group
  • Doug Carmean: pipelines vs. power
  • IBM Austin
  • Low-power server: Gary Carpenter

8
Publications
  • K. Flautner and T. Mudge. Automatic
    performance-setting for dynamic voltage scaling.
    MOBICOM, Rome, July 2001 (to appear).
  • T. Mudge. Power: A first class design constraint.
    Computer, vol. 34, no. 4, April 2001, pp. 52-57.
  • T. Mudge. Power: A first class design constraint
    for future architectures. Proc. 7th Int. Conf. on
    High Performance Computing (HiPC), Springer
    Lecture Notes in Computer Science, Dec. 2000,
    Bangalore, India, pp. 215-224.
  • K. Flautner, S. Reinhardt, and T. Mudge.
    Thread-level parallelism and interactive
    performance of desktop applications. 9th Int.
    Conf. Architectural Support for Programming
    Languages and Operating Systems (ASPLOS-IX), Nov.
    2000, pp. 129-138.
  • Jason Casmira and Dirk Grunwald. Dynamic
    Instruction Scheduling Slack. Proceedings of the
    2000 KoolChips workshop, held in conjunction with
    MICRO-00.
  • Soraya Ghiasi and Dirk Grunwald. A Comparison of
    Two Architectural Power Models. Proceedings of the
    Power Aware Computer Systems Workshop.
  • Soraya Ghiasi, Jason Casmira, and Dirk Grunwald. IPC
    Matching Mechanisms Using IPC Variation in
    Workloads with Externally Specified Rates to
    Reduce Power Consumption. 2000 Workshop on
    Complexity Effective Design.
  • Dirk Grunwald, Phil Levis, Brad Morrey, and Mike
    Neufeld. Policies for Dynamic Clock Scheduling.
    Proceedings of the 2000 Symposium on Operating
    Systems Design and Implementation (OSDI).
  • Farkas et al. Quantifying the Energy Consumption
    of a Pocket Computer and Java Virtual Machine.
    Proceedings of the 2000 SIGMETRICS Conference on
    Computer System Performance.

9
FastPower
  • Soraya Ghiasi, Jason Casmira
  • University of Colorado

10
Goals
  • Provide quick, accurate estimate of power use by
    applications on specific platform
  • Models an existing processor platform (SA-1100)
  • Useful for algorithmic analysis
  • Useful for evaluating compiler optimizations
  • Cheaper and faster than hooking up a scope

11
Two classes of tools
  • Functional simulator produces quick results
  • Analysis of non-microarchitectural power
    reduction techniques
  • Performance simulator produces slower, more
    accurate results
  • Models all delays and stalls in the processor
  • Better estimate of energy-delay product
  • Drawbacks
  • No Vector Floating Point Architecture
    (co-processor 10-11) instruction costs
  • Currently only measure application performance

12
Combination of 2 PAC/C Parts
PowerAnalyzer Components
Per-instruction Power Profiling (LART, then MIT)
13
Instruction Energy
  • MIT group measured energy cost of repeatedly
    executing single instructions
  • LART group also has voltage-scaling data for same
    CPU, but not in as much detail. We use this to
    extrapolate across different clock speeds
  • We assume standby on cache miss

14
Time-varying Power Estimate
15
Highlighting Program Source
  • We count frequency and energy cost for each
    instruction in the application
  • Perl scripts provide decorated source listing
    with energy per
  • Assembly instruction
  • Source line
  • Simple, fast feedback on algorithm / compiler
    efficiency
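
The per-line accounting described on this slide can be sketched as follows. This is a minimal illustration in Python rather than the Perl scripts mentioned above; the energy table, the trace format, and the function name are hypothetical placeholders, not values or interfaces from the MIT or LART data.

```python
from collections import defaultdict

# Hypothetical per-instruction energy costs in nanojoules (placeholder
# values, not measurements from the MIT or LART data sets).
ENERGY_NJ = {"add": 1.0, "ldr": 2.0, "mul": 3.0}


def energy_by_source_line(trace):
    """Sum instruction energy per source line.

    `trace` is a list of (source_line, opcode) pairs, one per executed
    instruction (an assumed format for this sketch).
    """
    totals = defaultdict(float)
    for line, opcode in trace:
        totals[line] += ENERGY_NJ[opcode]
    return dict(totals)


trace = [(10, "ldr"), (10, "add"), (11, "mul"), (10, "add")]
per_line = energy_by_source_line(trace)
for line in sorted(per_line):
    print(f"line {line}: {per_line[line]:.1f} nJ")
```

Sorting the totals by energy instead of by line number would point directly at the most expensive source lines.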

16
Plans and Release Schedule
  • Currently no measurements for memory
  • Varies from system to system
  • Different memory / CPU clock speeds an issue
  • No measurements for FPU / Vector instructions
  • Not provided in MIT or LART
  • We will calibrate and fill-in using Itsy
    hardware, where we can measure CPU and Memory/IO
    power
  • Measurements and final tool will be included in the
    PowerAnalyzer suite by end of June

17
SimpleScalar ARM Target Support
  • Rajeev Krishna, Chris Weaver, Todd Austin
  • University of Michigan

18
Overview
  • The SimpleScalar Tool Set
  • ARM Target Implementation
  • Instruction Emulation
  • SA-1 Pipeline Model
  • ARM Cross-Compiler Kit
  • MiBench Benchmark Suite
  • ARM Target Validation
  • Functional Validation
  • Performance Model Validation
  • ARM Target Deployment
  • Ongoing Work
  • System Simulation
  • Related and Future Work

19
The SimpleScalar Tool Set
  • Computer system simulation tools
  • Developed at University of Michigan
  • Maintained by SimpleScalar LLC
  • Freely available to academic sites
  • Target apps run on emulator
  • SPEC95, SPEC2k, MiBench
  • Tool set supports multiple ISAs
  • PISA, Alpha, PPC, x86, ARM
  • Multiple I/O targets supported
  • Linux, OSF, Solaris system calls
  • Devices models (e.g., SA-1110)
  • Modeling infrastructure enables analysis of
    programs and hardware
  • performance
  • power
  • Very portable code base

[Diagram: SimpleScalar/ARM tool stack. Labels: MiBench and Linux/ARM; Power/Performance Model; Fetch, Pipeline, SA-1100 Core, Predictor, Caches; Simulation Kernel; ARM7 ISA, ARM FPA; Linux/ARM, iPAQ Devices; Host Platform]
20
ARM Target Instruction Emulation
  • ARM ISA emulation support added to SimpleScalar
    tool set
  • ARM 7 integer instruction set support
  • Floating Point Accelerator (FPA) instruction set
    support
  • Linux/ARM system call support added
  • system calls are implemented by the simulator
  • portable I/O, but does not capture OS execution
  • ARM CISC instructions required microcode support
  • required for accurate microarchitectural modeling

ARM instruction:
  stmdb r13!,r4-r8,r10-r15
Microcode expansion:
  agen tmp1,r13,0
  agen tmp0,tmp1,-16
  stp r11,tmp0
  agen r13,r13,-16
  agen tmp0,tmp1,-12
  stp r12,tmp0
  agen tmp0,tmp1,-8
  stp r14,tmp0
  agen tmp0,tmp1,-4
  stp r15,tmp0
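
The micro-op sequence above stores four registers (r11, r12, r14, r15), and the sketch below generates exactly that sequence for a store-multiple with writeback. The `agen` and `stp` mnemonics come from the slide; their operand meanings (address generation into a temporary, store of one register to that address) and the position of the base writeback are inferred from the example, and the function name is hypothetical. It is not SimpleScalar's actual microcode generator.

```python
def expand_stmdb(base, regs):
    """Expand 'stmdb r<base>!,{regs}' into agen/stp micro-ops.

    Registers are stored in ascending order at descending-before
    addresses: the lowest register lands at base - 4 * len(regs).
    """
    regs = sorted(regs)
    total = 4 * len(regs)
    uops = [f"agen tmp1,r{base},0"]  # latch the original base address
    for i, reg in enumerate(regs):
        uops.append(f"agen tmp0,tmp1,{4 * i - total}")  # address of this slot
        uops.append(f"stp r{reg},tmp0")                 # store one register
        if i == 0:
            # Base writeback, placed after the first store as on the slide.
            uops.append(f"agen r{base},r{base},{-total}")
    return uops


for uop in expand_stmdb(13, [11, 12, 14, 15]):
    print(uop)
```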
21
Processor Performance Model
  • SA-1 pipeline model implemented
  • pipeline used in Intel's SA-11xx
  • simple five stage pipeline
  • two level memory hierarchy
  • Challenging task due to lack of info on SA-1
    microarchitecture
  • derived many details from the compiler writer's
    guide
  • used directed black-box testing to fill in the
    rest of the blanks
  • prototype XScale model completed
  • Intel's new StrongARM processor
  • based on (sparse) published details
  • will validate when we secure a real evaluation
    platform (mid-summer?)

[Diagram: SA-1 pipeline stages IF, ID, EX, MEM, WB, with I and D caches, IMMU and DMMU, and physical memory]
22
ARM Cross-Compiler Kit
  • Permits users to compile ARM binaries w/o ARM
    hardware
  • Most users lack access to a real ARM target with
    a native compiler
  • We use Rebel.com's NetWinder platforms to build
    native binaries
  • GNU GCC targeted to ARM ISA
  • includes soft-float support (permits compilation
    for non-FP hardware)
  • GNU binutils targeted to ARM ISA
  • GNU ld linker
  • GNU binary utilities, e.g., objdump, nm, size, etc.
  • pre-built C libraries for ARM ISA
  • targeted to Linux system call interfaces
  • portable code base

23
MiBench Benchmark Suite
  • Unencumbered embedded benchmark suite
  • Includes source code and multiple benchmark
    inputs
  • With binaries compiled for SimpleScalar/ARM
    simulator
  • Preliminary report details benchmarks and
    performance characteristics
  • Six embedded programming domains (22 benchmarks)
  • Automotive/industrial
  • Process control kernels from engine control,
    sensor monitoring
  • Networking
  • Shortest path router, Patricia tree, packet
    processor, CRC32
  • Security
  • Private and Public key ciphers, digest routines
  • 3DES, Blowfish, SHA, AES finalists
  • Consumer
  • Multimedia, image processing, entertainment
  • JPEG, Dither, RGBA, MediaBench, DOOM
  • Office
  • Spell, Grep, Ghostscript Postscript Interpreter
  • Telecommunications
  • FFT, GSM, ADPCM

24
ARM Target Validation
  • ARM 7 ISA validated against reference
    implementation
  • Functional validation implemented with random
    testing
  • using the FuzzBuster framework
  • Validated against real SA-1100 H/W
  • Validated against ARM's ARMulator
  • ARM FPA extensions validated against SoftFloat
    suite
  • ARMulator and SA-1110 reference lack FP
    implementations
  • SoftFloat suite implements reference FP with
    integer ISA
  • Large validation effort
  • 500 billion instructions tested
  • 6 bugs found in the ARMulator! (reported to ARM
    Ltd)

[Diagram: FuzzBuster generates random instructions and state, runs them on the ARM target and on a reference implementation (ARMulator, SA-1100 H/W), and checks whether the results match]
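
The random-testing idea on this slide, feeding the same random inputs to the target and to a reference and comparing results, can be sketched as below. This is not the FuzzBuster framework itself: the two adder functions stand in for "ARM target" and "reference implementation", and all names are hypothetical.

```python
import random

MASK = 0xFFFFFFFF


def ref_add(a, b):
    """Reference model: 32-bit wrapping add."""
    return (a + b) & MASK


def target_add(a, b):
    """Implementation under test: add built from XOR and carry propagation."""
    while b:
        a, b = (a ^ b) & MASK, ((a & b) << 1) & MASK
    return a


def fuzz(trials, seed=0):
    """Return the random operand pairs on which target and reference differ."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        if ref_add(a, b) != target_add(a, b):
            mismatches.append((a, b))
    return mismatches


print(len(fuzz(1000)), "mismatches")
```

Fixing the seed keeps any failing case reproducible, which matters when a mismatch has to be reported to the owner of the reference implementation.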
25
Performance Model Validation
  • Performance validation against SA-1110 platform
  • Rebel.com NetWinder reference with SA-1 pipeline
  • Microbenchmarks were used to reveal and test
    specific latencies
  • e.g., branch mispredictions, cache misses,
    writeback stalls
  • Final validation completed with macrobenchmark
    testing
  • compared IPC of SA-1110 to IPCs computed by SA-1
    performance model
  • H/W IPCs computed using wall clock time, clock
    frequency, and known instruction counts
  • Excellent IPC correlation across entire test suite
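
The hardware IPC calculation described above (wall clock time, clock frequency, known instruction count) reduces to one division. A minimal sketch, with hypothetical function names and made-up example numbers:

```python
def hardware_ipc(instructions, wall_seconds, clock_hz):
    """IPC from a known instruction count, wall clock time, and clock rate."""
    cycles = wall_seconds * clock_hz
    return instructions / cycles


def relative_error(simulated_ipc, measured_ipc):
    """Relative difference between model IPC and hardware IPC."""
    return abs(simulated_ipc - measured_ipc) / measured_ipc


# Made-up example: 100M instructions in 1 second on a 200 MHz part.
measured = hardware_ipc(100_000_000, 1.0, 200_000_000)
print(measured, relative_error(0.52, measured))
```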

26
SimpleScalar/ARM Deployment
  • First release deployed Dec 2000
  • unvalidated ARM ISA emulator
  • SimpleScalar performance models
  • Second release deployed March 2001
  • validated ARM ISA emulator
  • validated ARM microcode emulator
  • SA-1 validated pipeline model
  • ARM cross-compiler kit
  • Current PAC/C users
  • Georgia Tech, Northwestern, Arizona, Princeton,
    U-Delaware, USC, ISI, U-Toronto, UC-Irvine, Notre
    Dame
  • We are actively providing support to these users

27
SimpleScalar/ARM System Simulation
SA-1110 System Bus
  • System simulation development
  • SA-1110 device set
  • Compaq iPAQ reference hardware
  • Linux and MiBench workload
  • Status
  • Core components deployed
  • Virtual memory, RTC, PIC, DMA, SER0 development
    ongoing
  • Booting Linux kernel only requires serial, DMA,
    PIC, RTC, MMU
  • Concurrent Development
  • Develop SA-1100 specific devices
  • Integrate Bochs platform simulator

[Diagram: SA-1110 system bus connecting the SA-1 pipeline (I-cache, IMMU, D-cache, DMMU) with RTC, RAM, PIC, DMA, Flash, SER0, PCMCIA, SER1, SER2, SER3. Legend: complete/deployed, in development/test, next generation]
28
SimpleScalar/ARM Platform Simulation
SA-1110 System Bus
  • Bochs Platform Simulator
  • Liberally harvesting components from Bochs x86
    platform simulator
  • Functional models for IDE, Ethernet, VGA, CD-ROM,
    etc.
  • Goal: common device interfaces across x86 and
    ARM simulators
  • Augmenting Devices with Approximate Power Models
  • Empirical measurements of 802.11, microdrive,
    flash, etc.

[Diagram: SA-1110 system bus connecting the SA-1 pipeline (I-cache, IMMU, D-cache, DMMU) with RTC, RAM, PIC, DMA, SER0, and a Bochs device model]
29
Related Work: SimpleScalar Visualization Tools
30
Related Work: SimpleScalar I/O Traces
  • The Challenges
  • Conduct reproducible experiments with real-time
    systems
  • Share sophisticated workloads between users
  • The Solution: External I/O Traces
  • Sim-EIO executes any workload, tracing all I/O
    activity
  • Device I/O (or System Calls)
  • External Interrupts
  • DMA activity
  • External I/O (EIO) traces can be executed on any
    simulator
  • 100% reproducible execution
  • Fully portable experiments
  • Traces can be shared without needing to share
    traced program components
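
The record-then-replay idea behind external I/O traces can be illustrated with a toy workload: every external input is logged during a live run, and a later run consumes the log instead of the live source. This is only a sketch of the concept; it does not reflect the actual sim-eio trace format, and all names are hypothetical.

```python
import random


def workload(read_input):
    """Toy workload whose result depends only on its external inputs."""
    return sum(read_input() for _ in range(5))


def record(source):
    """Run the workload against a live source, logging every input."""
    trace = []

    def read_input():
        value = source()
        trace.append(value)
        return value

    return workload(read_input), trace


def replay(trace):
    """Re-run the workload using only the recorded inputs."""
    events = iter(trace)
    return workload(lambda: next(events))


rng = random.Random()  # unseeded: the live run is not repeatable by itself
result, trace = record(lambda: rng.randint(0, 255))
print(result, replay(trace))
```

Because the replay never touches the live source, the trace alone is enough to share an experiment.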

31
Related Work: SimpleScalar/C30
  • Many embedded targets feature a DSP
  • for fast processing of multimedia workload
    components
  • e.g., signal processing, codec routines, image
    processing
  • typical architecture couples a general purpose
    microprocessor with a DSP
  • Adding TI TMS320C30 (C30) ISA support to
    SimpleScalar
  • integer and FP ISA components
  • power control instructions
  • Provided as an extension component
  • permits use of general purpose processor model
    and C30 model in tandem
  • inter-processor communication implemented with
    bi-directional mailbox primitives
  • requires a fairly sophisticated compiler tool set
    (in development)

[Diagram: ARM core and C30 core coupled through shared memory and inter-processor interrupts]
32
Future Work: Self-Tuned Digital Systems
  • Electrical verification determines if
    implementation is robust
  • Design must be functionally correct for all
    valid (T,V,p,f)
  • Design must meet mean-time-to-failure goals (via
    power/current analysis)
  • Verify functional correctness at slow
    corner (Tmax,Vmin,pslow,fmax)
  • Verify power/current characteristics at fast
    corner (Tmin,Vmax,pfast,fmax)
  • Additional margin on clock used to avoid any
    electrical faults

[Diagram: design margin across temperature, voltage, frequency, and process axes, each spanning min to max (process: slow to fast)]
33
Future Work: Self-Tuned Microprocessor Systems
[Diagram: a clock/voltage generator supplies clk and Vdd to a tuned core; a checker receives instructions to verify and reports the error rate; temperature is also monitored. A temperature/voltage/frequency plot contrasts the slow corner and its worst-case margin with actual operating conditions]
  • Modern logic design methodologies are very
    conservative
  • Large design margins consume power and
    performance
  • System environment may not be reflective of slow
    corner
  • Employ a checker to enable a self-tuned
    clock/voltage strategy
  • Push clock, drop voltage until desired
    power-performance characteristics
  • If system fails, checker will correct error,
    notify control system
  • Reclaims design margins plus any temperature and
    voltage margins
  • Example checkers
  • CRC hardware for a crypto-processor
  • DIVA checker processor for microprocessors
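
The "drop voltage until the checker objects" strategy can be sketched as a simple control loop. Everything here is an assumption for illustration: the error model, the millivolt values, and the function name are hypothetical, and a real controller would also adjust the clock and react to temperature.

```python
def self_tune(error_rate_at, v_start_mv, v_min_mv, step_mv, max_error_rate):
    """Lower the supply voltage while the checker's error rate stays acceptable.

    `error_rate_at(mv)` stands in for the checker's observed error rate at a
    given supply voltage (hypothetical interface). Voltages are integers in
    millivolts to avoid floating-point drift.
    """
    v = v_start_mv
    while v - step_mv >= v_min_mv and error_rate_at(v - step_mv) <= max_error_rate:
        v -= step_mv
    return v


def toy_error_model(mv):
    """Hypothetical part that starts failing below 1200 mV."""
    return 0.0 if mv >= 1200 else 0.05


print(self_tune(toy_error_model, 1800, 900, 50, 0.01))
```

The loop stops at the last voltage whose error rate is within budget, which is how the design margin gets reclaimed without giving up correctness: the checker catches whatever errors remain.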

34
Power Analyzer: Data-Sensitive Parameterized
Architectural-Level Power Estimator
  • Nam Sung Kim, Trevor Mudge
  • University of Michigan

35
Outline
  • Previous Work
  • Disadvantages of Previous Work
  • Features of Power Analyzer
  • Power Analyzer Methodology and Estimation Flow
  • Power Characterization of Functional Units
  • Construction of Power Model
  • Experimental Results
  • Conclusion and Future Work

36
Previous Work
  • Wattch Model
  • Analytical power models for functional blocks
    consisting of memory and CAM
  • Cai-Lim Model
  • Using active and inactive power density model for
    each functional block
  • Limitations of Previous Work
  • Ignoring data-sensitivity of functional blocks in
    power consumption
  • 8-bit MULT shows up to 198% difference in power
    consumption for various input data activities
  • Lack of parameterized power models
  • Lack of flexibility in power model data structure
  • Supporting only fixed form of processor
    architecture
  • Not considering I/O pad power
  • I/O pads can consume up to 40% of entire chip
    power
  • Using empirical power values from existing
    designs
  • Unable to consider different architectural
    configurations

37
Features of Power Analyzer
  • Data Sensitive Power Model
  • Assuming power consumption of functional blocks
    is proportional to Hamming Distance (HD) of
    applied input vector sequences
  • Parameterized Power Model
  • Assuming power consumption of functional block is
    proportional to bit width
  • Rapid power characterization using small circuits
  • Supporting analytical power model for cache and
    CAM structure
  • Interconnect and Clock Power Model
  • Providing interconnection and clock power
    estimation framework for different architectural
    configurations (future work)
  • Hierarchical Power Model
  • Supporting various levels of abstraction of power
    model for each functional unit
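
The data-sensitive model above assumes block power is proportional to the Hamming distance (HD) between consecutive input vectors. A minimal sketch of that calculation follows; the proportionality coefficient and the function names are hypothetical, not values from the Power Analyzer characterization.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two input vectors differ."""
    return bin(a ^ b).count("1")


def mean_hd(vectors):
    """Average Hamming distance between consecutive input vectors."""
    dists = [hamming_distance(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(dists) / len(dists)


def block_power_mw(vectors, mw_per_bit_flip):
    """Power under the proportional-to-HD assumption.

    `mw_per_bit_flip` is a hypothetical characterization coefficient.
    """
    return mw_per_bit_flip * mean_hd(vectors)


inputs = [0x00, 0xFF, 0x0F, 0x0F]  # distances: 8, 4, 0
print(mean_hd(inputs), block_power_mw(inputs, 0.25))
```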

38
Power Analyzer Methodology
  • Power Consumption of Functional Blocks
  • Different for various circuit design styles
  • Fast Power Characterization of Functional Units
  • Using existing functional block design or soft
    macro blocks provided by synthesis tool
  • Using gate-level or transistor level power
    simulation results for small bit width circuits
    having similar functionality
  • Parameterized Power Models
  • Extracting appropriate power values for target
    architecture using extrapolation technique
  • Extracting interconnect and clock capacitance
    from the estimated transistor count, floorplan
    and technology (future work)
  • Dynamic Power Simulation
  • Extracting HD and access activity information of
    functional units using SimpleScalar architectural
    level simulation

39
Power Analyzer Power Estimation Flow
Fig. 1: Power Analyzer power estimation flow
40
Power Characterization of Functional Units
  • Power Characterization for Standard Cell Design
  • Power Characterization of Functional Unit
  • depending on target uP design methodology,
    circuit style, and process technology
  • Target Processor
  • MARS: Michigan ARM instruction set compatible
    microprocessor
  • Target Technology
  • 0.25um TSMC standard cell and I/O library
  • Target Design Flow
  • ASIC design flow - Synthesizing all functional
    units using Synopsys Design Compiler and Avant!
    Rapid Memory Compiler

41
Datapath Power - ALU
  • Power of ALU increases
  • Linearly as bit width of ALU increases
  • Linearly as HD of applied input vectors increases

Fig. 2: ALU power consumption for various bit widths
Fig. 3: ALU power consumption for various HD of inputs
42
Datapath Power - Multiplier
  • Power of Multiplier increases
  • Quadratically as bit width increases
  • Logarithmically as HD of applied input vectors
    increases

Fig. 4: MULT power consumption for various bit widths
Fig. 5: MULT power consumption for various HD of inputs
43
Datapath Power - Register File
  • Power of Register File increases
  • Linearly as bit width increases
  • Linearly as HD of applied input vectors increases

Fig. 6: RF power consumption for various bit widths
Fig. 7: RF power consumption for various HD of inputs
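
The bit-width trends on the three datapath slides (linear for the ALU and register file, quadratic for the multiplier) suggest how a small characterized circuit could be scaled to a wider one. The sketch below assumes pure proportionality with no constant offset, which is a simplification of whatever extrapolation the tool actually uses; the input power value is made up.

```python
# Scaling exponent per functional unit, taken from the slide trends:
# ALU and register file grow linearly with bit width, multiplier quadratically.
SCALING_EXPONENT = {"alu": 1, "regfile": 1, "mult": 2}


def extrapolate_power(unit, power_small_mw, small_bits, target_bits):
    """Scale a small-circuit power value up to the target bit width."""
    ratio = target_bits / small_bits
    return power_small_mw * ratio ** SCALING_EXPONENT[unit]


# Made-up 8-bit characterization value of 0.2 mW, scaled to 32 bits.
print(extrapolate_power("alu", 0.2, 8, 32))
print(extrapolate_power("mult", 0.2, 8, 32))
```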
44
I/O Pad and Cache Power
  • I/O Pad Power Consumption consists of
  • Pad cell internal power
  • Load capacitance power
  • Most of I/O Pads
  • Used by address and data buses
  • 64 pads among 78 in/out pads
  • Easy to monitor address and data bus activity at
    the architectural level simulation
  • Cache Power Consumption consists of
  • Decoder, Bit lines and Sense amps of Data and Tag
    Arrays
  • Characterizing Energy Consumption using
  • Capacitance value extraction from CACTI cache
    access time model (Jouppi)
  • Most of cache power consumption is caused by bit
    lines and sense amp
  • Power consumption of bit lines is data
    independent because of complementary structure of
    cache bitlines

45
Power Analyzer Experimental Results
  • Target Design Specification
  • ARM instruction set and 32-bit datapath
  • 32 bit ALU, 32x32 multiplier and 32 register
    files
  • 4KB instruction and data caches and 32-bit
    address and data buses
  • Power Measurement of Target Design using
  • Test program consisting of various instructions
  • PrimePower - Gate-level dynamic power simulation
    tool
  • Power Estimation Result
  • Simulation length: 250 cycles at 50 MHz
  • Power value extrapolation from 8-bit synthesized
    circuits
  • No output load for I/O pad power measurement

FUs       Measurement   Estimated   Error
ALU       1.14 mW       0.76 mW     33%
Mult      2.60 mW       2.58 mW      1%
Regfile   4.69 mW       3.56 mW     24%
I/O Pad   2.58 mW       2.36 mW      9%
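
The error column follows from the measured and estimated values as |measured - estimated| / measured. A short check using the numbers from this slide (the function name is just for illustration):

```python
# (measured mW, estimated mW) per functional unit, from the results table.
RESULTS = {
    "ALU": (1.14, 0.76),
    "Mult": (2.60, 2.58),
    "Regfile": (4.69, 3.56),
    "I/O Pad": (2.58, 2.36),
}


def error_percent(measured, estimated):
    """Estimation error relative to the measured value, in percent."""
    return abs(measured - estimated) / measured * 100


for unit, (measured, estimated) in RESULTS.items():
    print(f"{unit}: {error_percent(measured, estimated):.1f}%")
```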
46
Conclusion and Future Work
  • Conclusion
  • Suggesting power modeling methodology for
    standard cell design
  • Constructing Power model for standard cell
  • datapath ALU, register file, and multiplier
  • I/O pads
  • Experimental results show
  • 0.8% to 33% error against in-house ARM processor
    standard cell design with a test program
  • Much of combinational block power is consumed by
    glitches, one of the major causes of estimation
    error
  • Future Work
  • Cache power model calibration against real design
  • Need to make more cache power models for various
    cache design scheme
  • Power model for random logic
  • e.g. Control logic
  • Clock and interconnection capacitance and power
    estimation framework
  • Clock power: estimating entire chip area and
    assuming H-clock tree
  • Interconnection power: estimating
    interconnection length among functional units
    using floorplan and FU area information

47
SystemPower
  • Soraya Ghiasi, Dirk Grunwald
  • University of Colorado

48
Goals
  • Current SimpleScalar/PowerAnalyzer tools only
    measure application performance
  • We want to analyze system or platform level power
    and performance.
  • Operating system
  • Applications
  • Most importantly, we want to couple this with
    power models of external devices
  • Compact Flash / Disk
  • Wireless Networks
  • Target: people working on application or system
    level optimization (routing, compression, file
    systems, etc.)

49
Design
  • Based around a design similar to the Bochs
    platform simulator
  • Leverage existing open source device models
  • Leverage existing Bochs architecture
  • Replace all x86 specific Bochs elements
  • Instruction Set Architecture, Instruction
    interpreter, memory system, timing model, etc
  • Power models will be drawn from core
    PowerAnalyzer components or profiling of existing
    hardware (like FastPower).

50
We need to model MMU in full detail
51
This being ARM, nothing is simple
52
Device Models
  • Devices come in two flavors
  • Co-processors (MMU, Floating point)
  • Memory mapped interfaces specified by unique
    addresses in physical memory
  • E.g. 0xE0000820 is the address of the UART 0
    data-out
  • Each memory mapped device handled by specific
    Object
  • E.g. UartMemoryRegion handles 0xE0000820
  • FastMap used to handle the common case of
    accessing physical memory
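
The two-path dispatch described above, a fast path for plain physical memory and per-device objects for memory-mapped addresses, can be sketched as follows. The class name `UartMemoryRegion` and the address 0xE0000820 come from the slide; the `MemorySystem` class, its methods, and the choice to place RAM at address 0 are simplifying assumptions, not the simulator's actual design.

```python
class UartMemoryRegion:
    """Device object that handles writes to the UART data-out register."""

    def __init__(self):
        self.output = []

    def write(self, addr, value):
        self.output.append(chr(value & 0xFF))


class MemorySystem:
    """Hypothetical dispatcher: RAM on the fast path, devices on the slow path."""

    def __init__(self, ram_size):
        self.ram = bytearray(ram_size)  # simplification: RAM starts at address 0
        self.devices = {}               # address -> device object

    def map_device(self, addr, device):
        self.devices[addr] = device

    def write_byte(self, addr, value):
        if addr < len(self.ram):
            self.ram[addr] = value & 0xFF           # fast path: plain memory
        else:
            self.devices[addr].write(addr, value)   # slow path: device object


UART0_DATA_OUT = 0xE0000820
uart = UartMemoryRegion()
mem = MemorySystem(ram_size=1024)
mem.map_device(UART0_DATA_OUT, uart)
mem.write_byte(4, 0x41)
mem.write_byte(UART0_DATA_OUT, ord("H"))
print(mem.ram[4], "".join(uart.output))
```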

53
Diagram of Memory / Device Interaction
[Diagram: an access such as LD R0, R2 is dispatched either via the FastPath to physical memory or via the SlowPath]
54
ARM-in-Bochs: Booting the OS with sim-boot
  • Memory system developed
  • Flash, DRAM, Peripheral Control Modules, System
    Control Modules
  • Memory Management Unit implemented
  • Virtual to physical address translator
  • Domain access restrictions
  • Open HandHelds bootloader mostly working
  • Finish implementing the full ARM instruction set
  • Next up - load and run the kernel

55
Bootloader Execution
U3 @00000328 F04000000 MTST 00000001 00000002 00000004 00000008
10000000 20000000 40000000 80000000 ENDM
56
Bootloader (Continued)
STKP C19F3FFC
MMU table start=C19F4000
Boot data start=C19F8000
Boot data size=00008000
Stack data base=C19F0000
Stack data size=00004000
FLASH_BASE 00000000
Evacuating 1MB of Flash to DRAM at C1E00000
done
Map Flash virtual section to DRAM at C1E00000
btflash_init mfrid=00890089 devid=00180018
walking flash descriptors
57
Bootloader (continued)
comp_inst at 711147 0x76dc 0x15148
btflash_init found flash 28F128J3A
flashDescriptor=00015148
flashSectors=00014F28
nsectors=00000080
flash_size=02000000
flash_address_mask=01FFFFFF
get_param could not find parameter system_rev
dram_size 00000000
uncompress failed, rc 0xFFFFFFFB
>> Compaq OHH BootLoader, Rev 2-13-2
>> Last link date Wed May 16 19:48:03 MDT 2001
>> Contact bootldr@handhelds.org
>> ARM Processor Rev=E0008000
>> (c) 2000 Compaq Cambridge Research Laboratory
Press Return to start the OS now, any other key for monitor menu
eval param blk
autoboot_timeout 0x01C9C380
58
Current Bootloader Problems
walking flash descriptors
comp_inst at 711147 0x76dc 0x15148
btflash_init found flash 28F128J3A
flashDescriptor=00015148
flashSectors=00014F28
nsectors=00000080
flash_size=02000000
flash_address_mask=01FFFFFF
get_param could not find parameter system_rev
dram_size 00000000
uncompress failed, rc 0xFFFFFFFB
>> Compaq OHH BootLoader, Rev 2-13-2
>> Last link date Wed May 16 19:48:03 MDT 2001
>> Contact bootldr@handhelds.org
59
Current Problems
  • PowerAnalyzer incorrectly decodes certain
    instructions that only appear in O/S, not
    applications
  • Coprocessor and management instructions
  • Also, O/S is expecting to handle FPU itself,
    PowerAnalyzer handles this automatically
  • Adopting some tools from two Redhat/Cygnus
    projects (cgen and SID) to get around these problems

60
Other necessary System structure
  • Timers
  • Simulated system clock
  • DMA read and write functions
  • Simulated interrupts
  • We're using the design of www.bochs.org to guide our
    specific implementation

61
Modeling Device Power
  • The simulator sees microscopic events
  • Write the current console output to 0xE0000820
  • We need to convert this to reflect power demands
    of higher level components
  • Initial start is focusing on measuring storage
    PCMCIA cards
  • On-chip devices measured by DAQ
  • PCMCIA devices measured using Kaitek extended

62
Measurements from 802.11b Cards
63
Compact Flash Energy (/32MB)
64
Device Power Models
  • Need better device models
  • For example, wireless cards are
  • Transmitting
  • Receiving
  • Idle
  • Powered Off
  • In power save mode
  • Power transients (e.g. IBM Microdrive)
  • Oceans of data
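
A state-based device power model of the kind this slide calls for can be sketched as energy = sum over states of (power in state × time in state). The state names follow the wireless-card list above; the power values and the function name are hypothetical placeholders, not measurements from the 802.11b cards.

```python
# Hypothetical per-state power draw in milliwatts (placeholder values).
POWER_MW = {
    "transmit": 1400,
    "receive": 900,
    "idle": 700,
    "power_save": 50,
    "off": 0,
}


def energy_mj(timeline):
    """Total energy in millijoules for a list of (state, seconds) intervals."""
    return sum(POWER_MW[state] * seconds for state, seconds in timeline)


timeline = [("transmit", 2), ("idle", 10), ("off", 5)]
print(energy_mj(timeline), "mJ")
```

A model like this ignores the power transients mentioned above; those would need an extra cost per state transition.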

65
Plans, Strategy
  • Get basic framework up and running
  • Try to maintain compatibility with Bochs (x86
    platform simulator) and SID (platform framework
    from Redhat)
  • Make devices pluggable to allow independent
    development
  • Host SourceForge collaborative development for
    devices

66
Time Line
  • Should be able to boot O/S and run Linux by end
    of summer
  • First target since we have access to source code
    and can examine use of the devices
  • Could extend to WinCE
  • Extend with power measurements and models
  • Attempt validation against existing handheld
    (iPAQ) in fall

67
Operating System DVS Scheduling
Michigan
  • Use models of process interaction
  • Daemon mediates voltage scaling
  • Implementation on Transmeta
Colorado
  • Extend interval methods (OSDI '00)
  • PID controller w/optional signal for
    responsiveness
  • Accurately hits rate based applications w/o real
    time interface
  • Implementation on SA-1100, AMD K6
  • Allows comparison to RTLinux
  • Multi-architecture DVS interface
  • Application (optional hints: "I'm Important")
  • Vertigod daemon
  • User-mode process.
  • Implements performance-setting algorithm.
  • Application(optional hints)
  • GUI Library(optional hints)

Linux Kernel
Linux Kernel
HOOKS
HOOKS
DVS
  • Vertigo Module
  • Episode detection and tracking.
  • Comm. with policy daemon.
  • Event tracing.
  • /proc interface.
  • Modular Scheduler
  • Policy Daemon
  • Event tracing.
  • /proc interface.

68
Vertigo power management
  • Reduce energy consumption by running slower
    without impacting the user experience.
  • Operating voltage can be lowered with frequency.
  • Faster performance is not always better.
  • Wastes energy.
  • Speed improvement may be imperceptible.
  • Developed a performance-setting scheme for Linux.
  • Automatic quantification of user experience.
  • Works with existing user applications.
  • Interactive, irregular, multiprogrammed workloads
    are supported.
  • Implemented in the Linux kernel.
  • Simulations indicate significant energy savings.
  • Working on evaluating Vertigo on actual hardware.
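
The link between running slower and saving energy can be illustrated with the standard first-order CMOS model, in which dynamic energy per cycle scales with the square of the supply voltage. The sketch below assumes, beyond what the slides state, that voltage can be lowered in direct proportion to frequency and that static (leakage) power is negligible; `relative_energy` is a hypothetical helper name. Real hardware usually cannot scale voltage this far, so measured savings would be expected to be smaller than this idealized figure.

```python
def relative_energy(speed_fraction):
    """Energy for a fixed amount of work, relative to full speed.

    ASSUMPTIONS (idealized, not from the slides):
      - dynamic energy per cycle is proportional to V^2
      - V scales linearly with clock frequency
      - leakage and other fixed costs are ignored
    The cycle count for a fixed job is constant, so total energy
    scales with V^2, which here equals speed_fraction^2.
    """
    if not 0 < speed_fraction <= 1:
        raise ValueError("speed_fraction must be in (0, 1]")
    relative_voltage = speed_fraction
    return relative_voltage ** 2


for slowdown in (0.2, 0.4):
    saved = 1 - relative_energy(1 - slowdown)
    print(f"{slowdown:.0%} slower: up to {saved:.0%} less energy (ideal model)")
```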

69
Vertigo
  • Kris Flautner, Trevor Mudge

70
Small performance reduction, big energy savings
20% performance reduction: 32% energy reduction
40% performance reduction: 55% energy reduction
71
A utilization trace
Each horizontal quantum is a millisecond, height
corresponds to the utilization in that quantum.
72
Episode classification
Interactive (Acrobat Reader), Producer (MP3
playback), and Consumer (esd sound daemon)
episodes.
73
Episode classification for power reduction
Response time: the time it takes for the computer
to respond to user-initiated events.
  • Faster is not always better.
  • Fundamental limit to what is perceptible to
    humans.
  • Movies: 20-30 frames per second.
  • Perceptual causality: 50ms-100ms.
  • Dragging objects on screen: 200ms.
  • Non-continuous operation: 1-2sec.
  • Periodic activity determines necessary
    performance for real-time tasks.

The goal is to run fast enough to meet the
perception threshold, no point to running any
faster.
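
The idea of running only fast enough to meet the perception threshold can be sketched as a small selection routine. This is an illustration, not the Vertigo algorithm: the function name, the speed levels, and the simplification that execution time scales inversely with speed (which ignores memory- and I/O-bound work) are all assumptions. The thresholds use the conservative end of the ranges listed above.

```python
# Perception thresholds in seconds, taken from the lower end of the
# ranges on the slide (the dictionary keys are hypothetical labels).
PERCEPTION_THRESHOLD_S = {
    "perceptual_causality": 0.050,
    "dragging": 0.200,
    "non_continuous": 1.0,
}


def slowest_adequate_speed(work_at_full_speed_s, threshold_s, speeds):
    """Return the slowest speed (fraction of full speed) whose
    response time still meets the threshold.

    ASSUMPTION: response time = work_at_full_speed_s / speed.
    If no speed meets the threshold, fall back to the fastest one.
    """
    for speed in sorted(speeds):
        if work_at_full_speed_s / speed <= threshold_s:
            return speed
    return max(speeds)


speeds = [0.3, 0.5, 0.75, 1.0]
threshold = PERCEPTION_THRESHOLD_S["perceptual_causality"]
print(slowest_adequate_speed(0.02, threshold, speeds))
```

In this example a 20 ms episode does not need full speed to stay under the 50 ms threshold, so a lower setting is chosen.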
74
Finding interactive episodes
  • One way: mouse click indicates start, long idle
    time indicates end.
  • Not always accurate.
  • Not all episodes are initiated by mouse click.
  • Latency in finding the ends of episodes.
  • Our approach: track inter-task communication.
  • Accurate.
  • Finds all interactive episodes.
  • No latency.
  • No program modifications required.
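
One plausible reading of "track inter-task communication" is sketched below: an episode begins when a user-interface task communicates with another task, grows to include every task reached from a member, and ends as soon as no member is still running. The event format, the function name, and the termination rule are assumptions made for illustration; the slides do not specify the actual kernel mechanism.

```python
def track_episode(events, ui_task):
    """Follow one interactive episode through an ordered event list.

    Hypothetical event formats:
      ("send", src, dst)  -- task src communicated with task dst
      ("block", task)     -- task stopped running (went idle)

    Returns (member_tasks, index_of_ending_event). The index is None
    if the episode is still in progress at the end of the trace.
    """
    members = set()   # tasks that belong to the episode
    running = set()   # members that have not yet gone idle
    for index, event in enumerate(events):
        if event[0] == "send":
            _, src, dst = event
            # Membership spreads along communication edges.
            if src == ui_task or src in members:
                members.add(dst)
                running.add(dst)
        elif event[0] == "block":
            running.discard(event[1])
            # The episode ends the moment every member is idle,
            # so no idle timeout is needed to find the end.
            if members and not running:
                return members, index
    return members, None


events = [
    ("send", "X", "editor"),
    ("send", "editor", "spell"),
    ("block", "editor"),
    ("send", "mp3", "esd"),   # unrelated background traffic
    ("block", "spell"),
]
print(track_episode(events, ui_task="X"))
```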

75
Communication between tasks
76
Implementation
  • Some kernel modifications are required.
  • Example: task switch, creation, exit
    notification.
  • Episode detection implemented in a kernel module.
  • Performance setting policy in user-space daemon.

77
Experiments
  • Implementation on Transmeta Crusoe-based
    notebook.
  • Crusoe has the ability to dynamically change
    performance levels.
  • Goal: validate software.
  • Gather initial high-level results.
  • Evaluation on Intel XScale LHR prototype board.
  • Significantly larger performance range than
    Transmeta.
  • Limited on-board devices.
  • Custom-built Intel XScale-based board.
  • Capabilities similar to Compaq iPAQ.
  • Includes on-board power measurement
    infrastructure.
  • Enable research into full-system power management.

78
  • BUFScale: Boulder Unified Frequency/voltage
    Scaling API

79
Goals
  • NSF- and USENIX-funded research to develop O/S
    support for power-efficient computing
  • Design minimally invasive O/S demands for
    automatic voltage scaling
  • Initial platform was SA-1100. Inflexible.
  • Wanted cross-platform mechanism that could be
    used by numerous policies.

80
Supported Platforms
  • AMD PowerNow!
  • 8 frequency steps, requires support from
    external VRM
  • E.g., our Sony laptop supports 200, 300, 350,
    400, 450, 500, 550, 600 MHz (overclocked)
  • StrongARM series (SA-110, SA-1100, etc)
  • 16 frequency steps, requires external VRM
  • Usually only 11 (59 .. 206 MHz) used
  • Transmeta 5600
  • Variable speedsteps, external VRM
  • Abstracted by LongRun interface
  • Sony picture book supports 4 levels
  • Intel SpeedStep
  • Two steps, external VRM supported by MX chipset
  • Intel XScale
  • Similar to StrongARM

Implemented, not validated
By end of May?
81
BufScale Mechanism
  • Abstract speed interface
  • 0..100, by scaled integer fraction
  • Mechanisms to query each speed setting / voltage
  • Turns out to be similar to LongRun interface
  • Linux Kernel loadable module
  • Other modules record scheduler trace, actions
  • Mechanism distinct from Policy

82
(No Transcript)
83
Speed Setting Policies
  • Modular implementation of minimally invasive
    policies
  • Uses any combination of past scheduling history
  • Uses information about process state
  • Initial work is control theory based -
    Proportional-Integral-Differential Controller
    (PID)
  • Proportional: adjustment signal is directly tied
    to error
  • Integral: adjustment signal is based on all
    previous errors (sum)
  • Differential: adjustment signal is based on
    current error and previous error.
  • But, with mechanism to indicate possible
    importance
  • De-scheduled from I/O wait, emerge from sleep
  • Captures unknown state

84
PID Controller
[Figure: labels "100", "90", "Error"]
85
Performance at Lowest Speed
Utilization
86
Performance at Medium Speed
87
Hitting target utilization
88
Current work
  • Expand interface to new platforms
  • Athlon-4 -- 1 GHz, commodity x86, PowerNow!
  • SH-4 used in HP handhelds
  • Primary research target is in multithreaded
    processor design
  • Co-scheduling for energy efficiency
  • Active feedback based on application
    characteristics
  • Preliminary target is a commercially available
    SMT, will use simulator infrastructure this
    summer for detailed evaluation