Title: Low Power System Design
Low Power System Design
- Feipei Lai
- 33664924
- flai_at_ntu.edu.tw
- CSIE 419
- Grading: Mid-term 30%, Paper presentation 40%, Final 30%
Key references
- Intl. Conf. on Computer-Aided Design (ICCAD)
- Intl. Symp. on Low Power Electronics and Design (ISLPED)
- IEEE Trans. on Computer-Aided Design (TCAD)
- ACM Trans. on Design Automation of Electronic Systems (TODAES)
- IEEE/ACM Design Automation Conference (DAC)
- Intl. Symp. on Circuits and Systems (ISCAS)
- IEEE Intl. Solid-State Circuits Conference (ISSCC)
Outline
- 1. Low-Power CMOS VLSI Design
- 2. Physics of Power Dissipation in CMOS FET Devices
- 3. Power Estimation
- 4. Synthesis for Lower Power
- 5. Low Voltage CMOS Circuits
- 6. Low-Power SRAM Architectures
- 7. Energy Recovery Techniques
- 8. Software Design for Low Power
- 9. Low Power SOC design
- 10. Embedded Software
Motivation
- Energy-efficient computing is required by
- Mobile electronic systems
- Large-scale electronic systems
- The quest for energy efficiency affects all aspects of system design
- Packaging and cooling costs
- Power supply rail design
- Noise immunity
Technology directions
- Just as with the CMOS replacement of HBTs (heterojunction bipolar transistors), a lower-performance, lower-power technology ultimately delivers superior system throughput because of the higher integration it enables.
- The International Technology Roadmap for Semiconductors (ITRS) projects that MOSFETs with an equivalent oxide thickness of 5 Å and junction depths of less than 10 nm will be in production within the next decade.
- While MOSFETs with 6 nm gate lengths have been demonstrated, performance and manufacturability problems remain.
Electronic system design
- Conceptualization and modeling
- From idea to model
- Design
- HW: computation, storage, and communication
- SW: application and system software
- Run-time management
- Run-time system management and control of all
units including peripherals
Examples
- Modeling
- Choice of algorithm
- Application-specific hardware vs. programmable hardware (software) implementation
- Word-width and precision
- Design
- Structural trade-off
- Resource sharing and logic supplies
- Management
- Operating system
- Dynamic power management
System models
- Modeling is an abstraction
- Represent important features and hide unnecessary details
- Functional models
- Capture functionality and requirements
- Executable models
- Support hw and/or sw compilation and simulation
- Implementation models
- Describe target realization
Algorithm selection
- Inputs
- A target macro-architecture
- Abstract functional/executable spec.
- Constraints
- Library of algorithms
- Objective
- Select the most energy-efficient algorithm that
satisfies constraints
Issues in algorithm selection
- Applicable only to general-purpose primitives with many alternative implementations
- Pre-characterization on the target architecture
- Limited search-space exploration
Approximate processing
- Introducing well-controlled errors can be advantageous for power
- Reduced data width (coarse discretization)
- Layered algorithms (successive approximations)
- Lossy communication
Processing elements
- Several classes of PEs
- General-purpose processors (e.g. RISC core)
- Digital signal processors (e.g. VLIW core)
- Programmable logic (e.g. LUT-based FPGA)
- Specialized processors (e.g. custom DCT core)
- Tradeoff: flexibility vs. efficiency
- Specialized is faster and more power-efficient
- General-purpose is flexible and inexpensive
Constrained optimization
- Design space
- Who does what and when (binding and scheduling)
- Supply voltage of the various PEs
- T_CLK = K · Vdd / (Vdd − Vt)²
- Design target
- Minimize power
- Performance constraint (e.g., T_iteration ≤ 21 µs)
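A small numeric sketch of the tradeoff implied by the delay model above: lowering Vdd saves energy roughly quadratically but stretches T_CLK. K, Vt, and the two candidate supply voltages are illustrative assumptions, not values from the course material.

```c
#include <stdio.h>

/* Delay model from the slide: Tclk = K * Vdd / (Vdd - Vt)^2 */
static double gate_delay(double k, double vdd, double vt)
{
    return k * vdd / ((vdd - vt) * (vdd - vt));
}

int main(void)
{
    const double k = 1.0, vt = 0.7;            /* illustrative constants        */
    const double vdd_hi = 3.3, vdd_lo = 1.5;   /* candidate supply voltages (V) */

    double slowdown     = gate_delay(k, vdd_lo, vt) / gate_delay(k, vdd_hi, vt);
    double energy_ratio = (vdd_lo * vdd_lo) / (vdd_hi * vdd_hi);  /* E ~ C * Vdd^2 */

    printf("slowdown: %.2fx, energy per operation: %.2fx\n", slowdown, energy_ratio);
    return 0;
}
```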
Datasheet Analysis
System Design
- Input
- The output of the conceptualization phase
- A macro-architectural template
- A hardware-software partition
- Component by component constraints
- Output
- Complete hardware design
Design process
- Specify computation, storage, template components, and software
- Synergistic process
- Fundamental tradeoff: general-purpose vs. application-specific
- Flexibility has a cost in terms of power
Application-specific computational units
- Synthesized from a high-level executable specification (behavioral synthesis)
- Supply voltage reduction
- Load capacitance reduction
- Minimization of switching activity
CMOS gate power equation
- P = C_L·V_DD²·f_0→1 + t_sc·V_DD·I_peak·f_0→1 + V_DD·I_leakage
- Dynamic term: C_L·V_DD²·f_0→1
- Short-circuit term: t_sc·V_DD·I_peak·f_0→1
- Leakage term: V_DD·I_leakage
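A minimal sketch that evaluates the three terms of the gate power equation above; all parameter values are placeholders standing in for library characterization data.

```c
#include <stdio.h>

/* Gate power = dynamic + short-circuit + leakage, per the slide:
 *   P = Cl*Vdd^2*f01 + tsc*Vdd*Ipeak*f01 + Vdd*Ileak
 */
static double gate_power(double cl, double vdd, double f01,
                         double tsc, double ipeak, double ileak)
{
    double p_dyn   = cl * vdd * vdd * f01;    /* charging the load capacitance   */
    double p_short = tsc * vdd * ipeak * f01; /* both devices briefly conducting */
    double p_leak  = vdd * ileak;             /* static leakage current          */
    return p_dyn + p_short + p_leak;
}

int main(void)
{
    /* placeholder values: 50 fF load, 1.2 V supply, 100 MHz switching rate,
       50 ps short-circuit window, 200 uA peak current, 10 nA leakage */
    printf("P = %.3e W\n",
           gate_power(50e-15, 1.2, 100e6, 50e-12, 200e-6, 10e-9));
    return 0;
}
```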
Power-driven voltage scaling
- Trade speed for power efficiency by scaling down the supply voltage
- Traditional speed-enhancing transformations can be exploited for low-power design (see the sketch after this list)
- Pipelining
- Parallelization
- Loop unrolling
- Re-timing
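A rough numeric sketch of why a speed-enhancing transformation such as parallelization saves power: duplicating a datapath lets each copy run at half the rate, so Vdd can be lowered until the delay roughly doubles; switched capacitance grows, but the quadratic Vdd term dominates. The threshold voltage, reference supply, and 15% capacitance overhead are assumptions, not course data.

```c
#include <stdio.h>

/* Relative delay ~ Vdd / (Vdd - Vt)^2, as in the earlier delay model. */
static double delay(double vdd, double vt)
{
    return vdd / ((vdd - vt) * (vdd - vt));
}

int main(void)
{
    const double vt = 0.7, vdd_ref = 3.3;

    /* Coarsely find a lower Vdd whose delay is about 2x the reference delay. */
    double vdd = vdd_ref;
    while (delay(vdd, vt) < 2.0 * delay(vdd_ref, vt) && vdd > vt + 0.1)
        vdd -= 0.01;

    /* Two parallel units at half frequency, ~15% extra capacitance (assumed). */
    double p_ref = 1.0  * vdd_ref * vdd_ref * 1.0;   /* P ~ C * Vdd^2 * f */
    double p_par = 2.30 * vdd     * vdd     * 0.5;   /* 2 * 1.15 * C, f/2 */

    printf("Vdd,parallel = %.2f V, P_parallel / P_ref = %.2f\n", vdd, p_par / p_ref);
    return 0;
}
```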
Advanced voltage scaling
- Multiple voltages
- Slow down non-critical paths with a lower supply voltage
- Two or more power grids
- High-efficiency voltage converters
Clock frequency reduction
- Lowering f_clk does not decrease energy
- But it may increase battery life
- Reduces power
- Multi-frequency clocks
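A tiny arithmetic sketch of the point above: halving f_clk halves power, but the same task then takes twice as long, so the energy drawn per task is unchanged. All numbers are illustrative.

```c
#include <stdio.h>

int main(void)
{
    const double c = 1e-9, vdd = 1.2;         /* switched capacitance, supply (assumed) */
    const double cycles = 1e6;                /* cycles the task needs                  */
    const double freqs[2] = {100e6, 50e6};    /* full vs. halved clock                  */

    for (int i = 0; i < 2; i++) {
        double power  = c * vdd * vdd * freqs[i];  /* P = C * Vdd^2 * f           */
        double time   = cycles / freqs[i];         /* task runs longer at lower f */
        double energy = power * time;              /* E = P * t: same both times  */
        printf("f = %3.0f MHz  P = %.4f W  E = %.4e J\n",
               freqs[i] / 1e6, power, energy);
    }
    return 0;
}
```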
Reducing load capacitance
- Reduce wiring capacitance
- Reduce local loads
- Reduce global interconnect
- Global interconnect can be reduced by improving spatial locality; trade off communication for computation
Reduce switching activity
- Improve correlation between consecutive inputs to functional macros
- Reduce glitching
- All basic high-level synthesis steps have been modified
- A synergistic approach yields the best results
Application-specific processors
- Parameterized processors tailored to a specific application
- Optimally exploit parallelism
- Eliminate unneeded features
- Applied to different architectures
- Single-issue cores → instruction subsetting
- Superscalar cores → number and type of functional units
- VLIW cores → functional units and compiler
Low-power core processors
- Low voltage
- Reduce wasted switching
- Specialized modes of operations/instructions
- Variable voltage supply
Exploiting a variable supply
- Supply voltage can be changed dynamically during system operation
- Quadratic power savings
- Circuit slowdown
- Just-in-time computation
- Stretch execution time up to the maximum tolerable (see the sketch after this list)
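A minimal sketch of just-in-time computation: from a hypothetical table of voltage/frequency operating points, pick the lowest-voltage one that still finishes the remaining work before its deadline, stretching execution up to the maximum tolerable time. The table values and workload are made up.

```c
#include <stdio.h>

struct op_point { double vdd; double f_hz; };   /* hypothetical operating points */

static const struct op_point points[] = {
    {1.8, 200e6}, {1.5, 150e6}, {1.2, 100e6}, {1.0, 60e6},
};

/* Return the lowest-voltage point that still meets the deadline. */
static struct op_point pick_point(double cycles, double deadline_s)
{
    struct op_point best = points[0];
    for (unsigned i = 0; i < sizeof points / sizeof points[0]; i++)
        if (cycles / points[i].f_hz <= deadline_s)
            best = points[i];            /* table is sorted fastest to slowest */
    return best;
}

int main(void)
{
    struct op_point p = pick_point(5e6, 60e-3);  /* 5M cycles, 60 ms deadline */
    printf("run at %.1f V, %.0f MHz\n", p.vdd, p.f_hz / 1e6);
    return 0;
}
```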
Variable-supply architecture
- High-efficiency adjustable DC-DC converter
- Adjustable synchronization
- Variable-frequency clock generator
- Self-timed circuits
Memory optimization
- Custom data processors
- Computation is less critical than data storage (for data-dominated applications)
- General-purpose processors
- A significant fraction of system power is consumed by memories
- Key idea: exploit locality
- Hierarchical memory
- Partitioned memory
Optimization approaches
- Fixed memory access patterns
- Optimize memory architecture
- Fixed memory architecture
- Optimize memory access patterns
- Concurrently optimize memory architecture and
accesses
Optimize memory architecture
- Data replication to localize accesses
- Implicit multi-level caches
- Explicit buffers
- Partitioning to minimize cost per access
- Multi-bank caches
- Partitioned memories
Optimize memory accesses
- Sequentialize memory accesses
- Reduce address bus transitions
- Exploit multiple small memories
- Localize program execution
- Fit frequently executed code into a small instruction buffer (or cache)
- Reduce storage requirements
Design of communication units
- Trends
- Faster computation blocks, larger chips
- Communication speed is critical
- Energy cost of communication is significant
- Multifaceted design approach
- On chip, networks, wireless
- Protocol stack
Optimize memory architecture and access patterns
- Two-phase process
- Specification (program) transformations
- Reduce memory requirements
- Improve regularity of accesses
- Build optimized memory architecture
Data encoding
- Theoretical results
- Bounds on transition-activity reduction
- The higher the entropy rate of the source, the lower the gain achievable by coding
- Practical applications
- Processor-memory (and other) busses
- Data busses, address busses
- Transition activity reduction does not guarantee
energy savings
Bus-invert coding for data busses
- Add a redundant line INV to the bus
- When INV = 0
- Data is equal to the remaining bus lines
- When INV = 1
- Data is the complement of the remaining bus lines
- Performance
- Peak: at most n/2 bus lines switch
- Average: the code is optimal; no other code with 1-bit redundancy can do better
- Average switching reduction is bus-width dependent
- E.g., 3.27 average transitions per cycle for an 8-bit bus
- Average switching per line decreases as busses get wider
- Use partitioned codes
- No longer optimal (among redundant codes)
- Implementation issues
- Difference (XOR) of two data samples and a majority vote (see the sketch below)
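A minimal C sketch of bus-invert encoding for an 8-bit data bus: XOR the new data with the value currently on the bus, count the lines that would toggle (standing in for the majority vote mentioned above), and assert INV (driving the complement) when more than half would switch. Function names and the driver in main are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Count bits set in an 8-bit value. */
static int popcount8(uint8_t x)
{
    int n = 0;
    while (x) { n += x & 1; x >>= 1; }
    return n;
}

/* Encode one bus word: returns the value to drive and sets *inv. */
static uint8_t bus_invert_encode(uint8_t prev_bus, uint8_t data, int *inv)
{
    int toggles = popcount8(prev_bus ^ data);   /* lines that would switch  */
    *inv = (toggles > 4);                       /* more than n/2 = 4 lines  */
    return *inv ? (uint8_t)~data : data;        /* receiver undoes ~ if INV */
}

int main(void)
{
    uint8_t bus = 0x00;
    uint8_t samples[] = {0xFF, 0xFE, 0x01};
    for (unsigned i = 0; i < sizeof samples; i++) {
        int inv;
        bus = bus_invert_encode(bus, samples[i], &inv);
        printf("data=0x%02X  bus=0x%02X  INV=%d\n", samples[i], bus, inv);
    }
    return 0;
}
```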
Encoding instruction addresses
- Most instruction addresses are consecutive
- Use Gray code
- Word-oriented machines
- Increments by 4 (32 bit) or by 8 (64 bit).
- Modify Gray code to switch 1 bit per increment
- Gray code adder for jumps
- Harder to partition
- Convert to Gray code after update
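A minimal sketch of the standard binary/Gray conversions behind the scheme above: consecutive addresses differ in a single bit after Gray encoding. This is generic conversion code, not code from the slides.

```c
#include <stdint.h>
#include <stdio.h>

static uint32_t bin_to_gray(uint32_t b) { return b ^ (b >> 1); }

static uint32_t gray_to_bin(uint32_t g)
{
    for (uint32_t shift = 1; shift < 32; shift <<= 1)
        g ^= g >> shift;
    return g;
}

int main(void)
{
    /* Consecutive addresses: exactly one Gray-coded line toggles per step. */
    for (uint32_t addr = 0; addr < 8; addr++)
        printf("bin=%u gray=0x%X back=%u\n",
               (unsigned)addr, (unsigned)bin_to_gray(addr),
               (unsigned)gray_to_bin(bin_to_gray(addr)));
    return 0;
}
```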
T0 code
- Add a redundant line INC to the bus
- When INC = 0
- Address is equal to the remaining bus lines
- When INC = 1
- Transmitter freezes the other bus lines
- Receiver increments the previously transmitted address by a parameter called the stride
- Asymptotically zero transitions for in-sequence address streams
- Better than Gray code
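A small sketch of a T0 encoder/decoder pair: for an in-sequence address the transmitter freezes the bus and asserts INC, and the receiver regenerates the address by adding the stride. Names and the stride value are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define STRIDE 4u   /* word-addressed, 32-bit machine (assumed) */

struct t0_tx { uint32_t last_addr, bus; };
struct t0_rx { uint32_t last_addr; };

/* Transmitter: returns INC; drives the bus only for out-of-sequence addresses. */
static int t0_send(struct t0_tx *tx, uint32_t addr)
{
    int inc = (addr == tx->last_addr + STRIDE);
    if (!inc)
        tx->bus = addr;              /* bus lines stay frozen when INC = 1 */
    tx->last_addr = addr;
    return inc;
}

/* Receiver: rebuilds the address from (bus, INC). */
static uint32_t t0_receive(struct t0_rx *rx, uint32_t bus, int inc)
{
    rx->last_addr = inc ? rx->last_addr + STRIDE : bus;
    return rx->last_addr;
}

int main(void)
{
    struct t0_tx tx = {0}; struct t0_rx rx = {0};
    uint32_t addrs[] = {0x100, 0x104, 0x108, 0x200, 0x204};
    for (unsigned i = 0; i < sizeof addrs / sizeof addrs[0]; i++) {
        int inc = t0_send(&tx, addrs[i]);
        printf("addr=0x%X bus=0x%X INC=%d decoded=0x%X\n",
               addrs[i], tx.bus, inc, t0_receive(&rx, tx.bus, inc));
    }
    return 0;
}
```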
Mixed bus encoding techniques
- T0_BI
- Use two redundant lines INC and INV
- Good for shared address/data busses
- Dual encoding
- Good for time-multiplexed address busses
- Use redundant line SEL
- SEL = 1 denotes addresses
- SEL is already present in the bus interface
- Dual T0
- Use T0 code when SEL is asserted.
- Dual T0_BI
- Use T0 when SEL is asserted; otherwise use BI
Impact of software
- For a given hardware platform, the energy to realize a function depends on the software
- Operating system
- Different algorithms to embody a function
- Different coding styles
- Application software compilation
Coding styles
- Use processor-specific instruction style
- Function calls style
- Conditionalized instructions (for ARM)
- Follow general guidelines for software coding
- Use table look-up instead of conditionals
- Make local copies of global variables so that they can be assigned to registers
- Avoid multiple memory look-ups with pointer chains (see the sketch after this list)
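Two of the guidelines above, sketched in C with illustrative names: a table lookup replacing a conditional chain, and a global copied into a local so the compiler can keep it in a register inside the loop.

```c
#include <stdio.h>

int g_scale;                              /* some global configuration value */

/* Table lookup instead of a conditional chain: fewer branches, less control switching. */
static int class_of(int x)
{
    static const int class_table[4] = {0, 1, 1, 2};
    return class_table[x & 3];
}

/* Copy the global into a local once so it can live in a register inside the loop. */
static long scale_all(const int *v, int n)
{
    int scale = g_scale;                  /* local copy of the global variable */
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += (long)v[i] * scale * class_of(v[i]);
    return sum;
}

int main(void)
{
    int data[] = {1, 2, 3, 4};
    g_scale = 3;
    printf("%ld\n", scale_all(data, 4));
    return 0;
}
```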
Example: ARM variable types
- The default int variable type is 18.2% more energy efficient than char or short
- Sign or zero extension is needed for shorter variable types
ARM conditional execution
- All ARM instructions are conditional
- Conditional execution reduces the number of
branches
Instruction-level analysis
- Analyze the execution of loops containing specific instructions
- The loop should be long enough to neglect loop overhead and short enough to avoid cache misses
- About 200 instructions
- Measure instruction base cost
- Measure inter-instruction effects
Compilation for low power: operation scheduling
- Reorder instructions
- Reduce inter-instruction effects
- Switching in the control part
- Cold scheduling
- Reorder instructions to reduce inter-instruction effects on the instruction bus
- Consider instruction op-codes
- Inter-instruction cost is the op-code Hamming distance
- Use a list scheduler whose priority criterion is tied to Hamming distance (see the sketch after this list)
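A minimal sketch of the cold-scheduling cost metric: among the ready instructions, prefer the one whose op-code has the smallest Hamming distance from the op-code issued last. The op-code encodings are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

static int hamming(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    int n = 0;
    while (x) { n += x & 1; x >>= 1; }
    return n;
}

/* Pick the ready instruction closest (in op-code bits) to the last one issued. */
static int pick_next(uint32_t last_op, const uint32_t *ready, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (hamming(last_op, ready[i]) < hamming(last_op, ready[best]))
            best = i;
    return best;
}

int main(void)
{
    uint32_t ready[] = {0x2A, 0x0F, 0x28};   /* made-up op-codes of ready instructions */
    int i = pick_next(0x2C, ready, 3);
    printf("issue ready[%d] = 0x%02X\n", i, ready[i]);
    return 0;
}
```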
Scheduling to reduce off-chip traffic
- Schedule instructions to minimize Hamming distance
- Scheduling algorithm
- Operates on the operations within each basic block
- Searches for linear orders consistent with the data flow
- Prunes the search space by avoiding redundant solutions (hashing sub-trees) and heuristically limiting the number of sub-trees
Compilation for low power: register assignment
- Minimize spills to memory
- Register labeling
- Reduce switching in the instruction register/bus and the register-file decoder by encoding
- Reduce the Hamming distance between addresses of consecutive register accesses
- This approach is complementary to cold scheduling
Other compiler optimizations
- Loop unrolling to reduce loop overhead
- Drawback: increased code space
- Software pipelining
- Decreases the number of stalls by fetching instructions from different iterations
- Eliminate tail recursion (see the sketch after this list)
- Reduces call overhead and use of the stack
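A small illustration of tail-recursion elimination: the recursive form pays call/return and stack overhead on every step, while the loop form does the same computation in place. The gcd function is just an example, not taken from the course.

```c
#include <stdio.h>

/* Tail-recursive form: each call pushes a frame and returns through it. */
static unsigned gcd_rec(unsigned a, unsigned b)
{
    return b == 0 ? a : gcd_rec(b, a % b);
}

/* Same computation with the tail call turned into a loop: no extra stack use. */
static unsigned gcd_loop(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

int main(void)
{
    printf("%u %u\n", gcd_rec(48, 36), gcd_loop(48, 36));
    return 0;
}
```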
Dynamic power management
- Systems are
- Designed to deliver peak performance
- Not needing peak performance most of the time
- Components are idle at times
- Dynamic power management (DPM)
- Put components into low-power, non-operational states when idle
- Power manager
- Observes and controls the system
- Power consumption of power manager is negligible
Structure of power-manageable systems
- A system consists of several components
- E.g., laptop: processor, memory, disk, display
- E.g., SoC: CPU, DSP, FPU, RF unit
- Components may
- Self-manage state transitions
- Be controlled externally
- Power manager
- Abstraction of power control unit
- May be realized in hardware or software
Power-manageable components
- Components with several internal states
- Corresponding to power and service levels
- Abstracted as a power state machine
- State diagram with
- Power and service annotation on states
- Power and delay annotation on edges
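A minimal data-structure sketch of such a power state machine: states annotated with power and service level, transitions annotated with energy and delay. The state names and all numbers below are placeholders.

```c
#include <stdio.h>

enum pstate { RUN, IDLE, SLEEP, N_STATES };

struct state_info { double power_w; double service; };   /* per-state annotation      */
struct edge_info  { double energy_j; double delay_s; };  /* per-transition annotation */

static const struct state_info states[N_STATES] = {
    [RUN]   = {1.50, 1.0},   /* full power, full service */
    [IDLE]  = {0.30, 0.0},   /* clock gated, no service  */
    [SLEEP] = {0.01, 0.0},   /* powered down, no service */
};

static const struct edge_info edges[N_STATES][N_STATES] = {
    [RUN][IDLE]  = {1e-6, 1e-6},
    [IDLE][RUN]  = {1e-6, 1e-6},
    [RUN][SLEEP] = {5e-3, 1e-3},
    [SLEEP][RUN] = {8e-3, 10e-3},
};

int main(void)
{
    /* e.g. cost of a RUN -> SLEEP -> RUN round trip */
    double e = edges[RUN][SLEEP].energy_j + edges[SLEEP][RUN].energy_j;
    double t = edges[RUN][SLEEP].delay_s  + edges[SLEEP][RUN].delay_s;
    printf("round trip: %.1e J, %.1e s, sleep power %.2f W\n",
           e, t, states[SLEEP].power_w);
    return 0;
}
```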
Predictive techniques
- Observe the time-varying workload
- Predict the idle period: T_pred ≈ T_idle
- Go to the sleep state if T_pred is long enough to amortize the state-transition cost
- Main issue: prediction accuracy (see the sketch after this list)
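A sketch of one possible predictive policy: estimate the next idle period with an exponential average of observed idle periods and shut down only when the prediction exceeds the break-even time needed to amortize the transition cost. The predictor, its weight, and the threshold are illustrative assumptions.

```c
#include <stdio.h>

#define BREAK_EVEN_S 0.050   /* time a sleep must last to pay for the transition (assumed) */

/* Exponential-average predictor: Tpred = a*Tidle_last + (1-a)*Tpred_old */
static double predict(double pred_old, double idle_last)
{
    const double a = 0.5;
    return a * idle_last + (1.0 - a) * pred_old;
}

int main(void)
{
    double pred = 0.0;
    double observed_idle[] = {0.010, 0.120, 0.150, 0.020};  /* seconds, made up */

    for (unsigned i = 0; i < sizeof observed_idle / sizeof observed_idle[0]; i++) {
        pred = predict(pred, observed_idle[i]);
        printf("idle=%.3f s  Tpred=%.3f s  -> %s\n",
               observed_idle[i], pred,
               pred > BREAK_EVEN_S ? "shut down" : "stay on");
    }
    return 0;
}
```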
When to use predictive techniques
- When the workload has memory
- Implementing predictive schemes
- Predictor families must be chosen based on the workload type
- Predictor parameters must be tuned to the instance-specific workload statistics
- When the workload is non-stationary or unknown, on-line adaptation is required
Operating system-based power management
- In systems with an operating system (OS)
- The OS knows of tasks running and waiting
- The OS should perform the DPM decisions
- Advanced Configuration and Power Interface (ACPI)
- Open standard to facilitate design of OS-based
power management
Implementations of DPM
- Shut down idle components
- Gate the clock of idle units
- Clock setting and voltage setting
- Support multiple-voltage, multiple-frequency components
- Components with multiple working power states