Title: Low Power System Design
Low Power System Design
- Feipei Lai
- 33664924
- flai_at_ntu.edu.tw
- CSIE 419
- Grading: Mid-term 30%, Paper presentation 40%, Final 30%
Key references
- Intl. Conf. on Computer-Aided Design (ICCAD)
- Intl. Symp. on Low Power Electronics and Design (ISLPED)
- IEEE Trans. on Computer-Aided Design (TCAD)
- ACM Trans. on Design Automation of Electronic Systems (TODAES)
- IEEE/ACM Design Automation Conference (DAC)
- Intl. Symp. on Circuits and Systems (ISCAS)
- IEEE Intl. Solid-State Circuits Conference (ISSCC)
Outline
- 1. Low-Power CMOS VLSI Design
- 2. Physics of Power Dissipation in CMOS FET Devices
- 3. Power Estimation
- 4. Synthesis for Lower Power
- 5. Low Voltage CMOS Circuits
- 6. Low-Power SRAM Architectures
- 7. Energy Recovery Techniques
- 8. Software Design for Low Power
- 9. Low Power SOC design
- 10. Embedded Software
Motivation
- Energy-efficient computing is required by
- Mobile electronic systems
- Large-scale electronic systems
- The quest for energy efficiency affects all aspects of system design
- Packaging and cooling costs
- Power supply rail design
- Noise immunity
Technology directions
- Just as with the CMOS replacement of HBTs (heterojunction bipolar transistors), a lower-performance, lower-power technology ultimately delivers superior system throughput because of the higher integration it enables.
- The International Technology Roadmap for Semiconductors (ITRS) projects that MOSFETs with an equivalent oxide thickness of 5 Å and junction depths of less than 10 nm will be in production within the next decade.
- While MOSFETs with 6 nm gate lengths have been demonstrated, performance and manufacturability problems remain.
Electronic system design
- Conceptualization and modeling
- From idea to model
- Design
- HW: computation, storage, and communication
- SW: application and system software
- Run-time management
- Run-time system management and control of all
units including peripherals
Examples
- Modeling
- Choice of algorithm
- Application-specific hardware vs. programmable hardware (software) implementation
- Word-width and precision
- Design
- Structural trade-off
- Resource sharing and logic supplies
- Management
- Operating system
- Dynamic power management
System models
- Modeling is an abstraction
- Represent important features and hide unnecessary details
- Functional models
- Capture functionality and requirements
- Executable models
- Support hw and/or sw compilation and simulation
- Implementation models
- Describe target realization
Algorithm selection
- Inputs
- A target macro-architecture
- Abstract functional/executable spec.
- Constraints
- Library of algorithms
- Objective
- Select the most energy-efficient algorithm that
satisfies constraints
Issues in algorithm selection
- Applicable only to general-purpose primitives with many alternative implementations
- Pre-characterization on the target architecture
- Limited search-space exploration
Approximate processing
- Introducing well-controlled errors can be advantageous for power
- Reduced data width (coarse discretization)
- Layered algorithms (successive approximations)
- Lossy communication
Processing elements
- Several classes of PEs
- General-purpose processors (e.g. RISC core)
- Digital signal processors (e.g. VLIW core)
- Programmable logic (e.g. LUT-based FPGA)
- Specialized processors (e.g. custom DCT core)
- Tradeoff: flexibility vs. efficiency
- Specialized is faster and more power-efficient
- General-purpose is flexible and inexpensive
Constrained optimization
- Design space
- Who does what and when (binding and scheduling)
- Supply voltage of the various PEs
- T_CLK = K · Vdd / (Vdd − Vt)²
- Design target
- Minimize power
- Performance constraint (e.g., T_iteration ≤ 21 µs)
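A small numeric sketch of the tradeoff implied by the delay model above: lowering Vdd saves energy roughly quadratically but stretches T_CLK. K, Vt, and the two candidate supply voltages are illustrative assumptions, not values from the course material.

```c
#include <stdio.h>

/* Delay model from the slide: Tclk = K * Vdd / (Vdd - Vt)^2 */
static double gate_delay(double k, double vdd, double vt)
{
    return k * vdd / ((vdd - vt) * (vdd - vt));
}

int main(void)
{
    const double k = 1.0, vt = 0.7;            /* illustrative constants        */
    const double vdd_hi = 3.3, vdd_lo = 1.5;   /* candidate supply voltages (V) */

    double slowdown     = gate_delay(k, vdd_lo, vt) / gate_delay(k, vdd_hi, vt);
    double energy_ratio = (vdd_lo * vdd_lo) / (vdd_hi * vdd_hi);  /* E ~ C * Vdd^2 */

    printf("slowdown: %.2fx, energy per operation: %.2fx\n", slowdown, energy_ratio);
    return 0;
}
```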
Datasheet Analysis
System Design
- Input
- The output of the conceptualization phase
- A macro-architectural template
- A hardware-software partition
- Component by component constraints
- Output
- Complete hardware design
Design process
- Specify computation, storage, template components, and software
- Synergistic process
- Fundamental tradeoff: general-purpose vs. application-specific
- Flexibility has a cost in terms of power
Application-specific computational units
- Synthesized from a high-level executable specification (behavioral synthesis)
- Supply voltage reduction
- Load capacitance reduction
- Minimization of switching activity
CMOS gate power equation
- P = C_L·V_DD²·f_0→1 + t_sc·V_DD·I_peak·f_0→1 + V_DD·I_leakage
- Dynamic term: C_L·V_DD²·f_0→1
- Short-circuit term: t_sc·V_DD·I_peak·f_0→1
- Leakage term: V_DD·I_leakage
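A minimal sketch that evaluates the three terms of the gate power equation above; all parameter values are placeholders standing in for library characterization data.

```c
#include <stdio.h>

/* Gate power = dynamic + short-circuit + leakage, per the slide:
 *   P = Cl*Vdd^2*f01 + tsc*Vdd*Ipeak*f01 + Vdd*Ileak
 */
static double gate_power(double cl, double vdd, double f01,
                         double tsc, double ipeak, double ileak)
{
    double p_dyn   = cl * vdd * vdd * f01;    /* charging the load capacitance   */
    double p_short = tsc * vdd * ipeak * f01; /* both devices briefly conducting */
    double p_leak  = vdd * ileak;             /* static leakage current          */
    return p_dyn + p_short + p_leak;
}

int main(void)
{
    /* placeholder values: 50 fF load, 1.2 V supply, 100 MHz switching rate,
       50 ps short-circuit window, 200 uA peak current, 10 nA leakage */
    printf("P = %.3e W\n",
           gate_power(50e-15, 1.2, 100e6, 50e-12, 200e-6, 10e-9));
    return 0;
}
```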
Power-driven voltage scaling
- Trade speed for power efficiency by scaling down the supply voltage
- Traditional speed-enhancing transformations can be exploited for low-power design (see the sketch after this list)
- Pipelining
- Parallelization
- Loop unrolling
- Re-timing
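A rough numeric sketch of why a speed-enhancing transformation such as parallelization saves power: duplicating a datapath lets each copy run at half the rate, so Vdd can be lowered until the delay roughly doubles; switched capacitance grows, but the quadratic Vdd term dominates. The threshold voltage, reference supply, and 15% capacitance overhead are assumptions, not course data.

```c
#include <stdio.h>

/* Relative delay ~ Vdd / (Vdd - Vt)^2, as in the earlier delay model. */
static double delay(double vdd, double vt)
{
    return vdd / ((vdd - vt) * (vdd - vt));
}

int main(void)
{
    const double vt = 0.7, vdd_ref = 3.3;

    /* Coarsely find a lower Vdd whose delay is about 2x the reference delay. */
    double vdd = vdd_ref;
    while (delay(vdd, vt) < 2.0 * delay(vdd_ref, vt) && vdd > vt + 0.1)
        vdd -= 0.01;

    /* Two parallel units at half frequency, ~15% extra capacitance (assumed). */
    double p_ref = 1.0  * vdd_ref * vdd_ref * 1.0;   /* P ~ C * Vdd^2 * f */
    double p_par = 2.30 * vdd     * vdd     * 0.5;   /* 2 * 1.15 * C, f/2 */

    printf("Vdd,parallel = %.2f V, P_parallel / P_ref = %.2f\n", vdd, p_par / p_ref);
    return 0;
}
```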
Advanced voltage scaling
- Multiple voltages
- Slow down non-critical paths with a lower supply voltage
- Two or more power grids
- High-efficiency voltage converters
Clock frequency reduction
- Lowering f_clk does not decrease energy
- But it may increase battery life
- Reduces power
- Multi-frequency clocks
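A tiny arithmetic sketch of the point above: halving f_clk halves power, but the same task then takes twice as long, so the energy drawn per task is unchanged. All numbers are illustrative.

```c
#include <stdio.h>

int main(void)
{
    const double c = 1e-9, vdd = 1.2;         /* switched capacitance, supply (assumed) */
    const double cycles = 1e6;                /* cycles the task needs                  */
    const double freqs[2] = {100e6, 50e6};    /* full vs. halved clock                  */

    for (int i = 0; i < 2; i++) {
        double power  = c * vdd * vdd * freqs[i];  /* P = C * Vdd^2 * f           */
        double time   = cycles / freqs[i];         /* task runs longer at lower f */
        double energy = power * time;              /* E = P * t: same both times  */
        printf("f = %3.0f MHz  P = %.4f W  E = %.4e J\n",
               freqs[i] / 1e6, power, energy);
    }
    return 0;
}
```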
Reducing load capacitance
- Reduce wiring capacitance
- Reduce local loads
- Reduce global interconnect
- Global interconnect can be reduced by improving spatial locality; trade off communication for computation
Reduce switching activity
- Improve correlation between consecutive inputs to functional macros
- Reduce glitching
- All basic high-level synthesis steps have been modified
- A synergistic approach yields the best results
Application-specific processors
- Parameterized processors tailored to a specific application
- Optimally exploit parallelism
- Eliminate unneeded features
- Applied to different architectures
- Single-issue cores → instruction subsetting
- Superscalar cores → number and type of functional units
- VLIW cores → functional units and compiler
Low-power core processors
- Low voltage
- Reduce wasted switching
- Specialized modes of operations/instructions
- Variable voltage supply
Exploiting a variable supply
- Supply voltage can be changed dynamically during system operation
- Quadratic power savings
- Circuit slowdown
- Just-in-time computation
- Stretch execution time up to the maximum tolerable (see the sketch after this list)
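A minimal sketch of just-in-time computation: from a hypothetical table of voltage/frequency operating points, pick the lowest-voltage one that still finishes the remaining work before its deadline, stretching execution up to the maximum tolerable time. The table values and workload are made up.

```c
#include <stdio.h>

struct op_point { double vdd; double f_hz; };   /* hypothetical operating points */

static const struct op_point points[] = {
    {1.8, 200e6}, {1.5, 150e6}, {1.2, 100e6}, {1.0, 60e6},
};

/* Return the lowest-voltage point that still meets the deadline. */
static struct op_point pick_point(double cycles, double deadline_s)
{
    struct op_point best = points[0];
    for (unsigned i = 0; i < sizeof points / sizeof points[0]; i++)
        if (cycles / points[i].f_hz <= deadline_s)
            best = points[i];            /* table is sorted fastest to slowest */
    return best;
}

int main(void)
{
    struct op_point p = pick_point(5e6, 60e-3);  /* 5M cycles, 60 ms deadline */
    printf("run at %.1f V, %.0f MHz\n", p.vdd, p.f_hz / 1e6);
    return 0;
}
```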
Variable-supply architecture
- High-efficiency adjustable DC-DC converter
- Adjustable synchronization
- Variable-frequency clock generator
- Self-timed circuits
Memory optimization
- Custom data processors
- Computation is less critical than data storage (for data-dominated applications)
- General-purpose processors
- A significant fraction of system power is consumed by memories
- Key idea: exploit locality
- Hierarchical memory
- Partitioned memory
Optimization approaches
- Fixed memory access patterns
- Optimize memory architecture
- Fixed memory architecture
- Optimize memory access patterns
- Concurrently optimize memory architecture and
accesses
Optimize memory architecture
- Data replication to localize accesses
- Implicit multi-level caches
- Explicit buffers
- Partitioning to minimize cost per access
- Multi-bank caches
- Partitioned memories
Optimize memory accesses
- Sequentialize memory accesses
- Reduce address bus transitions
- Exploit multiple small memories
- Localize program execution
- Fit frequently executed code into a small instruction buffer (or cache)
- Reduce storage requirements
Design of communication units
- Trends
- Faster computation blocks, larger chips
- Communication speed is critical
- Energy cost of communication is significant
- Multifaceted design approach
- On chip, networks, wireless
- Protocol stack
Optimize memory architecture and access patterns
- Two-phase process
- Specification (program) transformations
- Reduce memory requirements
- Improve regularity of accesses
- Build optimized memory architecture
Data encoding
- Theoretical results
- Bounds on transition-activity reduction
- The higher the entropy rate of the source, the lower the gain achievable by coding
- Practical applications
- Processor-memory (and other) busses
- Data busses, address busses
- Transition activity reduction does not guarantee
energy savings
Bus-invert coding for data busses
- Add a redundant line INV to the bus
- When INV = 0
- Data is equal to the remaining bus lines
- When INV = 1
- Data is the complement of the remaining bus lines
- Performance
- Peak: at most n/2 bus lines switch
- Average: the code is optimal; no other code with 1-bit redundancy can do better
- Average switching reduction is bus-width dependent
- E.g., 3.27 average transitions per cycle for an 8-bit bus
- Average switching per line decreases as busses get wider
- Use partitioned codes
- No longer optimal (among redundant codes)
- Implementation issues
- Difference (XOR) of two data samples and a majority vote (see the sketch below)
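A minimal C sketch of bus-invert encoding for an 8-bit data bus: XOR the new data with the value currently on the bus, count the lines that would toggle (standing in for the majority vote mentioned above), and assert INV (driving the complement) when more than half would switch. Function names and the driver in main are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Count bits set in an 8-bit value. */
static int popcount8(uint8_t x)
{
    int n = 0;
    while (x) { n += x & 1; x >>= 1; }
    return n;
}

/* Encode one bus word: returns the value to drive and sets *inv. */
static uint8_t bus_invert_encode(uint8_t prev_bus, uint8_t data, int *inv)
{
    int toggles = popcount8(prev_bus ^ data);   /* lines that would switch  */
    *inv = (toggles > 4);                       /* more than n/2 = 4 lines  */
    return *inv ? (uint8_t)~data : data;        /* receiver undoes ~ if INV */
}

int main(void)
{
    uint8_t bus = 0x00;
    uint8_t samples[] = {0xFF, 0xFE, 0x01};
    for (unsigned i = 0; i < sizeof samples; i++) {
        int inv;
        bus = bus_invert_encode(bus, samples[i], &inv);
        printf("data=0x%02X  bus=0x%02X  INV=%d\n", samples[i], bus, inv);
    }
    return 0;
}
```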
Encoding instruction addresses
- Most instruction addresses are consecutive
- Use Gray code
- Word-oriented machines
- Increments by 4 (32 bit) or by 8 (64 bit).
- Modify Gray code to switch 1 bit per increment
- Gray code adder for jumps
- Harder to partition
- Convert to Gray code after update
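A minimal sketch of the standard binary/Gray conversions behind the scheme above: consecutive addresses differ in a single bit after Gray encoding. This is generic conversion code, not code from the slides.

```c
#include <stdint.h>
#include <stdio.h>

static uint32_t bin_to_gray(uint32_t b) { return b ^ (b >> 1); }

static uint32_t gray_to_bin(uint32_t g)
{
    for (uint32_t shift = 1; shift < 32; shift <<= 1)
        g ^= g >> shift;
    return g;
}

int main(void)
{
    /* Consecutive addresses: exactly one Gray-coded line toggles per step. */
    for (uint32_t addr = 0; addr < 8; addr++)
        printf("bin=%u gray=0x%X back=%u\n",
               (unsigned)addr, (unsigned)bin_to_gray(addr),
               (unsigned)gray_to_bin(bin_to_gray(addr)));
    return 0;
}
```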
T0 code
- Add a redundant line INC to the bus
- When INC = 0
- Address is equal to the remaining bus lines
- When INC = 1
- Transmitter freezes the other bus lines
- Receiver increments the previously transmitted address by a parameter called the stride
- Asymptotically zero transitions for in-sequence address streams
- Better than Gray code
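A small sketch of a T0 encoder/decoder pair: for an in-sequence address the transmitter freezes the bus and asserts INC, and the receiver regenerates the address by adding the stride. Names and the stride value are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define STRIDE 4u   /* word-addressed, 32-bit machine (assumed) */

struct t0_tx { uint32_t last_addr, bus; };
struct t0_rx { uint32_t last_addr; };

/* Transmitter: returns INC; drives the bus only for out-of-sequence addresses. */
static int t0_send(struct t0_tx *tx, uint32_t addr)
{
    int inc = (addr == tx->last_addr + STRIDE);
    if (!inc)
        tx->bus = addr;              /* bus lines stay frozen when INC = 1 */
    tx->last_addr = addr;
    return inc;
}

/* Receiver: rebuilds the address from (bus, INC). */
static uint32_t t0_receive(struct t0_rx *rx, uint32_t bus, int inc)
{
    rx->last_addr = inc ? rx->last_addr + STRIDE : bus;
    return rx->last_addr;
}

int main(void)
{
    struct t0_tx tx = {0}; struct t0_rx rx = {0};
    uint32_t addrs[] = {0x100, 0x104, 0x108, 0x200, 0x204};
    for (unsigned i = 0; i < sizeof addrs / sizeof addrs[0]; i++) {
        int inc = t0_send(&tx, addrs[i]);
        printf("addr=0x%X bus=0x%X INC=%d decoded=0x%X\n",
               addrs[i], tx.bus, inc, t0_receive(&rx, tx.bus, inc));
    }
    return 0;
}
```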
Mixed bus encoding techniques
- T0_BI
- Use two redundant lines INC and INV
- Good for shared address/data busses
- Dual encoding
- Good for time-multiplexed address busses
- Use redundant line SEL
- SEL = 1 denotes addresses
- SEL is already present in the bus interface
- Dual T0
- Use T0 code when SEL is asserted.
- Dual T0_BI
- Use T0 when SEL is asserted; otherwise use BI
Impact of software
- For a given hardware platform, the energy to realize a function depends on the software
- Operating system
- Different algorithms to embody a function
- Different coding styles
- Application software compilation
Coding styles
- Use processor-specific instruction style
- Function calls style
- Conditionalized instructions (for ARM)
- Follow general guidelines for software coding
- Use table look-up instead of conditionals
- Make local copies of global variables so that they can be assigned to registers
- Avoid multiple memory look-ups with pointer chains (see the sketch after this list)
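Two of the guidelines above, sketched in C with illustrative names: a table lookup replacing a conditional chain, and a global copied into a local so the compiler can keep it in a register inside the loop.

```c
#include <stdio.h>

int g_scale;                              /* some global configuration value */

/* Table lookup instead of a conditional chain: fewer branches, less control switching. */
static int class_of(int x)
{
    static const int class_table[4] = {0, 1, 1, 2};
    return class_table[x & 3];
}

/* Copy the global into a local once so it can live in a register inside the loop. */
static long scale_all(const int *v, int n)
{
    int scale = g_scale;                  /* local copy of the global variable */
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += (long)v[i] * scale * class_of(v[i]);
    return sum;
}

int main(void)
{
    int data[] = {1, 2, 3, 4};
    g_scale = 3;
    printf("%ld\n", scale_all(data, 4));
    return 0;
}
```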
Example: ARM variable types
- The default int variable type is 18.2% more energy efficient than char or short
- Sign or zero extension is needed for shorter variable types
ARM conditional execution
- All ARM instructions are conditional
- Conditional execution reduces the number of
branches
Instruction-level analysis
- Analyze the execution of loops containing specific instructions
- The loop should be long enough to neglect loop overhead and short enough to avoid cache misses
- About 200 instructions
- Measure instruction base cost
- Measure inter-instruction effects
Compilation for low power: operation scheduling
- Reorder instructions
- Reduce inter-instruction effects
- Switching in the control part
- Cold scheduling
- Reorder instructions to reduce inter-instruction effects on the instruction bus
- Consider instruction op-codes
- Inter-instruction cost is the op-code Hamming distance
- Use a list scheduler whose priority criterion is tied to Hamming distance (see the sketch after this list)
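A minimal sketch of the cold-scheduling cost metric: among the ready instructions, prefer the one whose op-code has the smallest Hamming distance from the op-code issued last. The op-code encodings are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

static int hamming(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    int n = 0;
    while (x) { n += x & 1; x >>= 1; }
    return n;
}

/* Pick the ready instruction closest (in op-code bits) to the last one issued. */
static int pick_next(uint32_t last_op, const uint32_t *ready, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (hamming(last_op, ready[i]) < hamming(last_op, ready[best]))
            best = i;
    return best;
}

int main(void)
{
    uint32_t ready[] = {0x2A, 0x0F, 0x28};   /* made-up op-codes of ready instructions */
    int i = pick_next(0x2C, ready, 3);
    printf("issue ready[%d] = 0x%02X\n", i, ready[i]);
    return 0;
}
```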
Scheduling to reduce off-chip traffic
- Schedule instructions to minimize Hamming distance
- Scheduling algorithm
- Operates on the operations within each basic block
- Searches for linear orders consistent with the data flow
- Prunes the search space by avoiding redundant solutions (hashing sub-trees) and heuristically limiting the number of sub-trees
Compilation for low power: register assignment
- Minimize spills to memory
- Register labeling
- Reduce switching in the instruction register/bus and the register-file decoder by encoding
- Reduce the Hamming distance between addresses of consecutive register accesses
- This approach is complementary to cold scheduling
Other compiler optimizations
- Loop unrolling to reduce loop overhead
- Drawback: increased code space
- Software pipelining
- Decreases the number of stalls by fetching instructions from different iterations
- Eliminate tail recursion (see the sketch after this list)
- Reduces call overhead and use of the stack
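A small illustration of tail-recursion elimination: the recursive form pays call/return and stack overhead on every step, while the loop form does the same computation in place. The gcd function is just an example, not taken from the course.

```c
#include <stdio.h>

/* Tail-recursive form: each call pushes a frame and returns through it. */
static unsigned gcd_rec(unsigned a, unsigned b)
{
    return b == 0 ? a : gcd_rec(b, a % b);
}

/* Same computation with the tail call turned into a loop: no extra stack use. */
static unsigned gcd_loop(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

int main(void)
{
    printf("%u %u\n", gcd_rec(48, 36), gcd_loop(48, 36));
    return 0;
}
```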
Dynamic power management
- Systems are
- Designed to deliver peak performance
- Not needing peak performance most of the time
- Components are idle at times
- Dynamic power management (DPM)
- Put components into low-power, non-operational states when idle
- Power manager
- Observes and controls the system
- Power consumption of power manager is negligible
Structure of power-manageable systems
- A system consists of several components
- E.g., laptop: processor, memory, disk, display
- E.g., SoC: CPU, DSP, FPU, RF unit
- Components may
- Self-manage state transitions
- Be controlled externally
- Power manager
- Abstraction of power control unit
- May be realized in hardware or software
Power-manageable components
- Components with several internal states
- Corresponding to power and service levels
- Abstracted as a power state machine
- State diagram with
- Power and service annotation on states
- Power and delay annotation on edges
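A minimal data-structure sketch of such a power state machine: states annotated with power and service level, transitions annotated with energy and delay. The state names and all numbers below are placeholders.

```c
#include <stdio.h>

enum pstate { RUN, IDLE, SLEEP, N_STATES };

struct state_info { double power_w; double service; };   /* per-state annotation      */
struct edge_info  { double energy_j; double delay_s; };  /* per-transition annotation */

static const struct state_info states[N_STATES] = {
    [RUN]   = {1.50, 1.0},   /* full power, full service */
    [IDLE]  = {0.30, 0.0},   /* clock gated, no service  */
    [SLEEP] = {0.01, 0.0},   /* powered down, no service */
};

static const struct edge_info edges[N_STATES][N_STATES] = {
    [RUN][IDLE]  = {1e-6, 1e-6},
    [IDLE][RUN]  = {1e-6, 1e-6},
    [RUN][SLEEP] = {5e-3, 1e-3},
    [SLEEP][RUN] = {8e-3, 10e-3},
};

int main(void)
{
    /* e.g. cost of a RUN -> SLEEP -> RUN round trip */
    double e = edges[RUN][SLEEP].energy_j + edges[SLEEP][RUN].energy_j;
    double t = edges[RUN][SLEEP].delay_s  + edges[SLEEP][RUN].delay_s;
    printf("round trip: %.1e J, %.1e s, sleep power %.2f W\n",
           e, t, states[SLEEP].power_w);
    return 0;
}
```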
Predictive techniques
- Observe the time-varying workload
- Predict the idle period: T_pred ≈ T_idle
- Go to the sleep state if T_pred is long enough to amortize the state-transition cost
- Main issue: prediction accuracy (see the sketch after this list)
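A sketch of one possible predictive policy: estimate the next idle period with an exponential average of observed idle periods and shut down only when the prediction exceeds the break-even time needed to amortize the transition cost. The predictor, its weight, and the threshold are illustrative assumptions.

```c
#include <stdio.h>

#define BREAK_EVEN_S 0.050   /* time a sleep must last to pay for the transition (assumed) */

/* Exponential-average predictor: Tpred = a*Tidle_last + (1-a)*Tpred_old */
static double predict(double pred_old, double idle_last)
{
    const double a = 0.5;
    return a * idle_last + (1.0 - a) * pred_old;
}

int main(void)
{
    double pred = 0.0;
    double observed_idle[] = {0.010, 0.120, 0.150, 0.020};  /* seconds, made up */

    for (unsigned i = 0; i < sizeof observed_idle / sizeof observed_idle[0]; i++) {
        pred = predict(pred, observed_idle[i]);
        printf("idle=%.3f s  Tpred=%.3f s  -> %s\n",
               observed_idle[i], pred,
               pred > BREAK_EVEN_S ? "shut down" : "stay on");
    }
    return 0;
}
```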
When to use predictive techniques
- When the workload has memory
- Implementing predictive schemes
- Predictor families must be chosen based on the workload type
- Predictor parameters must be tuned to the instance-specific workload statistics
- When the workload is non-stationary or unknown, on-line adaptation is required
Operating system-based power management
- In systems with an operating system (OS)
- The OS knows of tasks running and waiting
- The OS should perform the DPM decisions
- Advanced Configuration and Power Interface (ACPI)
- Open standard to facilitate design of OS-based
power management
Implementations of DPM
- Shut down idle components
- Gate the clock of idle units
- Clock setting and voltage setting
- Support multiple-voltage, multiple-frequency components
- Components with multiple working power states