Transcript and Presenter's Notes

Title: Low Power System Design


1
Low Power System Design
  • Feipei Lai
  • 33664924
  • flai@ntu.edu.tw
  • CSIE 419
  • Grading: Mid-term 30%, Paper presentation 40%,
    Final 30%

2
Key references
  • Intl Conf. on CAD (ICCAD)
  • Intl Symp. on Low Power Electronics and Design
  • IEEE Trans. on CAD
  • ACM Trans. on DAES
  • IEEE/ACM DAC
  • Intl Symp. on Circuits and Systems
  • IEEE Intl Solid-State Circuits Conference

3
Outline
  • 1. Low-Power CMOS VLSI Design
  • 2. Physics of Power Dissipation in CMOS FET
    Devices
  • 3. Power Estimation
  • 4. Synthesis for Lower Power
  • 5. Low Voltage CMOS Circuits
  • 6. Low-Power SRAM Architectures
  • 7. Energy Recovery Techniques
  • 8. Software Design for Low Power
  • 9. Low Power SOC design
  • 10. Embedded Software

4
Motivation
  • Energy-efficient computing is required by
  • Mobile electronic systems
  • Large-scale electronic systems
  • The quest for energy efficiency affects all
    aspects of system design
  • Packaging costs and cooling costs
  • Power supply rail design
  • Noise immunity

5
Technology directions
6
  • Just as with the CMOS replacement of HBTs
    (heterojunction bipolar transistors), a
    lower-performance/lower-power technology will
    ultimately deliver superior system throughput
    because of the higher integration it enables.

7
  • The International Technology Roadmap for
    Semiconductors (ITRS) projects that MOSFETs with an
    equivalent oxide thickness of 5 Å and junction
    depths of less than 10 nm will be in production in
    the next decade.
  • While 6 nm gate-length MOSFETs have been
    demonstrated, performance and manufacturability
    problems remain.

8
Electronic system design
  • Conceptualization and modeling
  • From idea to model
  • Design
  • HW: computation, storage and communication
  • SW: application and system software
  • Run-time management
  • Run-time system management and control of all
    units including peripherals

9
Examples
  • Modeling
  • Choice of algorithm
  • Application-specific hardware vs. programmable
    hardware (software) implementation
  • Word-width and precision
  • Design
  • Structural trade-off
  • Resource sharing and logic supplies
  • Management
  • Operating system
  • Dynamic power management

10
(No Transcript)
11
System models
  • Modeling is an abstraction
  • Represent important features and hide unnecessary
    details
  • Functional models
  • Capture functionality and requirements
  • Executable models
  • Support hw and/or sw compilation and simulation
  • Implementation models
  • Describe target realization

12
Algorithm selection
  • Inputs
  • A target macro-architecture
  • Abstract functional/executable spec.
  • Constraints
  • Library of algorithms
  • Objective
  • Select the most energy-efficient algorithm that
    satisfies constraints

13
Issues in algorithm selection
  • Applicable only to general-purpose primitives
    with many alternative implementations
  • Pre-characterization on target architecture
  • Limited search space exploration

14
Approximate processing
  • Introducing well-controlled errors can be
    advantageous for power
  • Reduced data width (coarse discretization)
  • Layered algorithms (successive approximations)
  • Lossy communication

15
Processing elements
  • Several classes of PEs
  • General-purpose processors (e.g. RISC core)
  • Digital signal processors (e.g. VLIW core)
  • Programmable logic (e.g. LUT-based FPGA)
  • Specialized processors (e.g. custom DCT core)
  • Tradeoff: flexibility vs. efficiency
  • Specialized is faster and more power-efficient
  • General-purpose is flexible and inexpensive

16
Constrained optimization
  • Design space
  • Who does what and when (binding and scheduling)
  • Supply voltage of the various PEs
  • TCLK = K · Vdd / (Vdd - Vt)²  (see the sketch below)
  • Design target
  • Minimize power
  • Performance constraint (e.g., Titeration ≤ 21 µs)
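The delay model above and the quadratic dependence of switching energy on Vdd can be explored numerically. The following is a minimal sketch; the values of K, Vt and the switched capacitance C are illustrative assumptions, not taken from the slides.

/* Sketch: how supply voltage trades delay for energy, using the
 * slide's first-order model T_CLK = K*Vdd/(Vdd - Vt)^2 and E = C*Vdd^2.
 * K, Vt and C are illustrative values, not from the slides. */
#include <stdio.h>

int main(void)
{
    const double K  = 1.0e-9;   /* technology constant (assumed)            */
    const double Vt = 0.5;      /* threshold voltage in volts (assumed)     */
    const double C  = 1.0e-12;  /* switched capacitance in farads (assumed) */

    for (double vdd = 1.0; vdd <= 3.0; vdd += 0.5) {
        double delay  = K * vdd / ((vdd - Vt) * (vdd - Vt)); /* seconds   */
        double energy = C * vdd * vdd;                       /* joules/op */
        printf("Vdd=%.1f V  delay=%.3e s  energy/op=%.3e J\n",
               vdd, delay, energy);
    }
    return 0;
}

Running the scan makes the tradeoff explicit: energy per operation falls quadratically with Vdd while delay grows rapidly as Vdd approaches Vt.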

17
Datasheet Analysis
18
System Design
  • Input
  • The output of the conceptualization phase
  • A macro-architectural template
  • A hardware-software partition
  • Component by component constraints
  • Output
  • Complete hardware design

19
Design process
  • Specify computation, storage, template
    components, and software
  • Synergistic process
  • Fundamental tradeoff: general-purpose vs.
    application-specific
    application-specific
  • Flexibility has a cost in terms of power

20
Application-specific computational units
  • Synthesized from high-level executable
    specification (behavioral synthesis)
  • Supply voltage reduction
  • Load capacitance reduction
  • Minimization of switching activity

21
CMOS Gate Power Equation
  • P = CL·VDD²·f0→1 + tsc·VDD·Ipeak·f0→1 + VDD·Ileakage
  • Dynamic term: CL·VDD²·f0→1
  • Short-circuit term: tsc·VDD·Ipeak·f0→1
  • Leakage term: VDD·Ileakage
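A minimal sketch that evaluates the three terms of this equation; all component values are illustrative assumptions.

/* Sketch: evaluating the gate power equation from the slide,
 * P = CL*Vdd^2*f01 + tsc*Vdd*Ipeak*f01 + Vdd*Ileak.
 * All numeric values are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double CL    = 50e-15;   /* load capacitance, 50 fF (assumed)     */
    const double Vdd   = 1.2;      /* supply voltage, V (assumed)           */
    const double f01   = 100e6;    /* 0->1 output transitions per second    */
    const double tsc   = 50e-12;   /* short-circuit conduction time, s      */
    const double Ipeak = 200e-6;   /* peak short-circuit current, A         */
    const double Ileak = 1e-9;     /* leakage current, A (assumed)          */

    double p_dyn   = CL * Vdd * Vdd * f01;       /* dynamic term        */
    double p_short = tsc * Vdd * Ipeak * f01;    /* short-circuit term  */
    double p_leak  = Vdd * Ileak;                /* leakage term        */

    printf("dynamic       = %.3e W\n", p_dyn);
    printf("short-circuit = %.3e W\n", p_short);
    printf("leakage       = %.3e W\n", p_leak);
    printf("total         = %.3e W\n", p_dyn + p_short + p_leak);
    return 0;
}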

22
Power-driven voltage scaling
  • Trade speed for power efficiency by scaling down
    the supply voltage (see the sketch after this list)
  • Traditional speed-enhancing transformations can
    be exploited for low power design
  • Pipelining
  • Parallelization
  • Loop unrolling
  • Re-timing
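As a rough illustration of why these transformations help, the sketch below lowers Vdd on a two-way parallel datapath until the gate delay (using the slide-16 model) doubles; throughput is preserved while dynamic power drops roughly quadratically. The voltages and step size are illustrative assumptions, and the overhead of the parallel structure (muxes, routing) is ignored.

/* Sketch: the power benefit of parallelizing a datapath and scaling Vdd.
 * With two parallel units each may run at half speed, so Vdd can be
 * lowered until the gate delay (T = K*Vdd/(Vdd-Vt)^2) doubles.  The
 * values of Vt and the nominal Vdd are illustrative assumptions. */
#include <stdio.h>

static double delay(double vdd, double vt) {
    return vdd / ((vdd - vt) * (vdd - vt));   /* constant K factored out */
}

int main(void)
{
    const double Vt = 0.5, Vnom = 2.5;
    double d_nom = delay(Vnom, Vt);

    /* scan downward for the lowest Vdd whose delay is still <= 2*d_nom */
    double vdd = Vnom;
    while (vdd > Vt + 0.05 && delay(vdd - 0.01, Vt) <= 2.0 * d_nom)
        vdd -= 0.01;

    /* Two units switch twice the capacitance, but each at half the rate,
     * so dynamic power scales as (Vdd/Vnom)^2 for the same throughput
     * (overhead ignored). */
    double power_ratio = (vdd * vdd) / (Vnom * Vnom);
    printf("scaled Vdd = %.2f V, power = %.0f%% of original\n",
           vdd, 100.0 * power_ratio);
    return 0;
}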

23
Advanced voltage scaling
  • Multiple voltages
  • Slow down non-critical path with lower voltage
    supply
  • Two or more power grids
  • High-efficiency voltage converters

24
Clock frequency reduction
  • Reducing fclk does not decrease energy
  • But it may increase battery life
  • Reduce power
  • Multi-frequency clocks

25
Reducing load capacitance
  • Reduce wiring capacitance
  • Reduce local loads
  • Reduce global interconnect
  • Global interconnect can be reduced by improving
    spatial locality: trade off communication for
    computation

26
Reduce switching activity
  • Improve correlation between consecutive inputs to
    functional macros
  • Reduced glitching
  • All basic high-level-synthesis steps have been
    modified
  • A synergistic approach leads to the best results

27
Application-specific processors
  • Parameterized processors tailored to a specific
    application
  • Optimally exploit parallelism
  • Eliminate unneeded features
  • Applied to different architectures
  • Single-issue cores → instruction subsetting
  • Superscalar cores → number and type of functional
    units
  • VLIW cores → functional units and compiler

28
Low power core processors
  • Low voltage
  • Reduce wasted switching
  • Specialized modes of operations/instructions
  • Variable voltage supply

29
Exploiting variable supply
  • Supply voltage can be dynamically changed during
    system operation
  • Quadratic power savings
  • Circuit slowdown
  • Just-in-time computation
  • Stretch execution time up to the max tolerable

30
Variable-supply architecture
  • High-efficiency adjustable DC-DC converter
  • Adjustable synchronization
  • Variable-frequency clock generator
  • Self-timed circuits

31
Memory optimization
  • Custom data processors
  • Computation is less critical than data storage
    (for data-dominated applications)
  • General-purpose processors
  • A significant fraction of system power is
    consumed by memories
  • Key idea: exploit locality
  • Hierarchical memory
  • Partitioned memory

32
Optimization approaches
  • Fixed memory access patterns
  • Optimize memory architecture
  • Fixed memory architecture
  • Optimize memory access patterns
  • Concurrently optimize memory architecture and
    accesses

33
Optimize memory architecture
  • Data replication to localize accesses
  • Implicit multi-level caches
  • Explicit buffers
  • Partitioning to minimize cost per access
  • Multi-bank caches
  • Partitioned memories

34
Optimize memory accesses
  • Sequentialize memory accesses
  • Reduce address bus transitions
  • Exploit multiple small memories
  • Localize program execution
  • Fit frequently executed code into a small
    instruction buffer (or cache)
  • Reduce storage requirements

35
Design of communication units
  • Trends
  • Faster computation blocks, larger chips
  • Communication speed is critical
  • Energy cost of communication is significant
  • Multifaceted design approach
  • On chip, networks, wireless
  • Protocol stack

36
Optimize memory architecture and access patterns
  • Two-phase process
  • Specification (program) transformations
  • Reduce memory requirements
  • Improve regularity of accesses
  • Build optimized memory architecture

37
Data encoding
  • Theoretical results
  • Bounds on transition activity reduction
  • The higher the entropy rate of the source, the
    lower the gain achievable by coding
  • Practical applications
  • Processor-memory (and other) busses
  • Data busses, address busses
  • Transition activity reduction does not guarantee
    energy savings

38
Bus-Invert coding for data busses
  • Add redundant line INV to bus
  • When INV = 0
  • Data is equal to remaining bus lines
  • When INV = 1
  • Data is complement of remaining bus lines
  • Performance
  • Peak: at most n/2 bus lines switch
  • Average: the code is optimal; no other code with
    1-bit redundancy can do better
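A minimal sketch of the transmitter side of bus-invert coding, assuming an 8-bit bus; the majority vote is done by counting the lines that would toggle (the XOR-and-popcount implementation mentioned on the next slide). The sample data values are arbitrary.

/* Sketch of bus-invert encoding for an 8-bit data bus: the transmitter
 * XORs the new value with what is currently on the bus and, if more
 * than half of the lines would toggle (majority vote), drives the
 * complement and asserts the redundant INV line instead. */
#include <stdint.h>
#include <stdio.h>

static int popcount8(uint8_t x) {
    int n = 0;
    while (x) { n += x & 1; x >>= 1; }
    return n;
}

/* Encode `data`, given what is currently driven on the bus.
 * Returns the word to drive and sets *inv to the INV line value. */
static uint8_t bus_invert_encode(uint8_t data, uint8_t bus_now, int *inv)
{
    int toggles = popcount8((uint8_t)(data ^ bus_now)); /* lines that would switch */
    if (toggles > 4) {                                  /* majority of 8 lines      */
        *inv = 1;
        return (uint8_t)~data;
    }
    *inv = 0;
    return data;
}

int main(void)
{
    uint8_t bus = 0x00;
    uint8_t samples[] = { 0xFF, 0x0F, 0xF1, 0x00 };
    for (unsigned i = 0; i < sizeof samples; i++) {
        int inv;
        uint8_t drive = bus_invert_encode(samples[i], bus, &inv);
        printf("data=0x%02X -> bus=0x%02X INV=%d\n", samples[i], drive, inv);
        bus = drive;   /* receiver recovers data as INV ? ~bus : bus */
    }
    return 0;
}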

39
  • Average switching reduction is bus-width
    dependent
  • E.g., 3.27 average transitions per cycle for an
    8-bit bus
  • Average switching per line decreases as busses
    get wider
  • Use partitioned codes
  • No longer optimal (among redundant codes)
  • Implementation issues
  • Difference (XOR) of two data samples and majority
    vote

40
Encoding instruction addresses
  • Most instruction addresses are consecutive
  • Use Gray code
  • Word-oriented machines
  • Increments by 4 (32-bit words) or by 8 (64-bit
    words); see the sketch below
  • Modify Gray code to switch 1 bit per increment
  • Gray code adder for jumps
  • Harder to partition
  • Convert to Gray code after update
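A minimal sketch of the idea, assuming a 32-bit word-oriented machine: the word index is converted to Gray code, so each sequential fetch toggles exactly one address line. The conversion g = b ^ (b >> 1) is the standard binary-to-Gray mapping.

/* Sketch: sending instruction addresses in Gray code so that
 * consecutive addresses toggle exactly one bus line.  For a
 * word-oriented machine the word index (addr >> 2 for 32-bit words)
 * is encoded, so sequential fetches still differ by 1. */
#include <stdint.h>
#include <stdio.h>

static uint32_t to_gray(uint32_t b) { return b ^ (b >> 1); }

static int popcount32(uint32_t x) {
    int n = 0;
    while (x) { n += x & 1u; x >>= 1; }
    return n;
}

int main(void)
{
    uint32_t prev_code = to_gray(0);
    for (uint32_t addr = 4; addr <= 32; addr += 4) {   /* consecutive 32-bit fetches */
        uint32_t code = to_gray(addr >> 2);            /* encode the word index      */
        printf("addr=%2u  gray=0x%02X  toggled lines=%d\n",
               (unsigned)addr, (unsigned)code,
               popcount32(code ^ prev_code));
        prev_code = code;
    }
    return 0;
}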

41
T0 Code
  • Add redundant line INC to bus
  • When INC = 0
  • Address is equal to remaining bus lines
  • When INC = 1
  • Transmitter freezes other bus lines
  • Receiver increments previously transmitted
    address by a parameter called stride
  • Asymptotically zero transitions for sequences
  • Better than Gray code
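A minimal sketch of a T0 transmitter and receiver, assuming a stride of 4 (32-bit instruction words); the address trace is arbitrary.

/* Sketch of T0 encoding: for an in-sequence address the bus lines are
 * frozen and only the redundant INC line is asserted; the receiver
 * regenerates the address by adding the stride. */
#include <stdint.h>
#include <stdio.h>

#define STRIDE 4u   /* byte-addressed 32-bit instruction fetch */

typedef struct { uint32_t bus; int inc; } t0_word;

static t0_word t0_encode(uint32_t addr, uint32_t prev_addr, uint32_t prev_bus)
{
    t0_word w;
    if (addr == prev_addr + STRIDE) {
        w.bus = prev_bus;   /* freeze the lines: zero transitions */
        w.inc = 1;
    } else {
        w.bus = addr;       /* out-of-sequence address sent as-is */
        w.inc = 0;
    }
    return w;
}

static uint32_t t0_decode(t0_word w, uint32_t prev_addr)
{
    return w.inc ? prev_addr + STRIDE : w.bus;
}

int main(void)
{
    uint32_t addrs[] = { 0x100, 0x104, 0x108, 0x200, 0x204 };
    uint32_t prev_addr = 0, prev_bus = 0;
    for (unsigned i = 0; i < 5; i++) {
        t0_word w = t0_encode(addrs[i], prev_addr, prev_bus);
        printf("addr=0x%03X  bus=0x%03X  INC=%d  decoded=0x%03X\n",
               (unsigned)addrs[i], (unsigned)w.bus, w.inc,
               (unsigned)t0_decode(w, prev_addr));
        prev_addr = addrs[i];
        prev_bus  = w.bus;
    }
    return 0;
}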

42
Mixed bus encoding techniques
  • T0_BI
  • Use two redundant lines INC and INV
  • Good for shared address/data busses
  • Dual encoding
  • Good for time-multiplexed address busses
  • Use redundant line SEL
  • SEL = 1 denotes addresses
  • SEL is already present in the bus interface
  • Dual T0
  • Use T0 code when SEL is asserted.
  • Dual T0_BI
  • Use T0 when SEL is asserted, otherwise use BI

43
Impact of software
  • For a given hardware platform, the energy to
    realize a function depends on software
  • Operating system
  • Different algorithms to embody a function
  • Different coding styles
  • Application software compilation

44
Coding styles
  • Use processor-specific instruction style
  • Function-call style
  • Conditionalized instructions (for ARM)
  • Follow general guidelines for software coding (see
    the sketch after this list)
  • Use table look-up instead of conditionals
  • Make local copies of global variables so that
    they can be assigned to registers
  • Avoid multiple memory look-ups with pointer chains
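A minimal sketch of two of these guidelines in C; the global g_scale, the cost table and the class codes are hypothetical.

/* Sketch of two guidelines above: replacing a conditional chain with a
 * table look-up, and caching a global in a local so the compiler can
 * keep it in a register.  `g_scale` and the tables are hypothetical. */
#include <stdio.h>

int g_scale = 3;   /* hypothetical global read inside a hot loop */

/* conditional chain: several compare-and-branch instructions */
static int cost_branchy(int cls) {
    if (cls == 0) return 10;
    else if (cls == 1) return 25;
    else if (cls == 2) return 40;
    else return 70;
}

/* table look-up: one memory access, no branches */
static const int cost_table[4] = { 10, 25, 40, 70 };
static int cost_lut(int cls) { return cost_table[cls & 3]; }

int main(void)
{
    int total = 0;
    int scale = g_scale;            /* local copy: register-allocatable */
    for (int cls = 0; cls < 4; cls++)
        total += scale * cost_lut(cls);
    printf("branchy(2)=%d lut(2)=%d total=%d\n",
           cost_branchy(2), cost_lut(2), total);
    return 0;
}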

45
Example ARM variable types
  • The default int variable type is 18.2% more energy
    efficient than char or short (see the sketch below)
  • Sign or zero extension is needed for shorter
    variable types
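The 18.2% figure is the measurement quoted in the slide; the sketch below only illustrates the underlying reason, namely that a narrower-than-register loop variable forces extra truncation or extension work on each update.

/* Sketch: why a narrow loop variable can cost energy on a 32-bit core.
 * With `unsigned char i` the compiler must truncate/zero-extend i to 8
 * bits after each increment (e.g. an extra AND or UXTB on ARM); with a
 * natural-width `int` the increment is a single instruction. */
#include <stdio.h>

int main(void)
{
    volatile int sink = 0;

    /* narrow counter: extra extension work every iteration */
    for (unsigned char i = 0; i < 100; i++)
        sink += i;

    /* natural-width counter: no extension needed */
    for (int i = 0; i < 100; i++)
        sink += i;

    printf("%d\n", sink);
    return 0;
}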

46
ARM conditional execution
  • All ARM instructions are conditional
  • Conditional execution reduces the number of
    branches

47
Instruction-level analysis
  • Analyze loop execution containing specific
    instructions
  • Loop should be long enough to neglect overhead
    and short enough to avoid cache misses
  • About 200 instructions
  • Measure instruction base cost
  • Measure inter-instruction effects

48
Compilation for low power: operation scheduling
  • Reorder instructions
  • Reduce inter-instruction effects
  • Switching in control part
  • Cold scheduling
  • Reorder instructions to reduce inter-instruction
    effects on instruction bus
  • Consider instruction op-codes
  • Inter-instruction cost is op-code Hamming
    distance
  • Use a list scheduler whose priority criterion is
    tied to Hamming distance (see the sketch below)
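A minimal sketch of the priority function only, assuming hypothetical 8-bit op-codes; a real cold scheduler would apply this inside a list scheduler restricted to data-flow-ready instructions.

/* Sketch of the cold-scheduling priority: among ready instructions,
 * prefer the one whose op-code has the smallest Hamming distance from
 * the op-code just issued.  The 8-bit op-codes below are hypothetical. */
#include <stdint.h>
#include <stdio.h>

static int hamming(uint8_t a, uint8_t b) {
    uint8_t x = a ^ b;
    int n = 0;
    while (x) { n += x & 1; x >>= 1; }
    return n;
}

int main(void)
{
    uint8_t last_opcode = 0x2C;                   /* op-code just issued      */
    uint8_t ready[] = { 0x91, 0x2D, 0x6C, 0x13 }; /* data-flow-ready op-codes */

    int best = 0;
    for (int i = 1; i < 4; i++)
        if (hamming(ready[i], last_opcode) < hamming(ready[best], last_opcode))
            best = i;

    printf("issue op-code 0x%02X (Hamming distance %d from 0x%02X)\n",
           ready[best], hamming(ready[best], last_opcode), last_opcode);
    return 0;
}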

49
Scheduling to reduce off-chip traffic
  • Schedule instructions to minimize Hamming
    distance
  • Scheduling algorithm
  • Operations within each basic block
  • Searches for linear orders consistent with
    data-flow
  • Prunes search space by avoiding redundant
    solutions (hash sub-trees) and heuristically
    limits the number of sub-trees

50
Compilation for low power: register assignment
  • Minimize spills to memory
  • Register labeling
  • Reduce switching in instruction register/bus and
    register file decoder by encoding
  • Reduce Hamming distance between addresses of
    consecutive register accesses
  • This approach is complementary to cold scheduling

51
Other compiler optimizations
  • Loop unrolling to reduce overhead
  • Drawback: increased code space
  • Software pipelining
  • Decreases the number of stalls by overlapping
    instructions from different iterations
  • Eliminate tail recursion
  • Reduce overhead and use of stack

52
Dynamic power management
  • Systems are
  • Designed to deliver peak performance
  • But do not need peak performance most of the time
  • Components are idle at times
  • Dynamic power management (DPM)
  • Put components in low-power non-operational states
    when idle
  • Power manager
  • Observes and controls the system
  • Power consumption of power manager is negligible

53
Structure of power-manageable systems
  • Systems consist of several components
  • E.g., laptop: processor, memory, disk, display
  • E.g., SoC: CPU, DSP, FPU, RF unit
  • Components may
  • Self-manage state transitions
  • Be controlled externally
  • Power manager
  • Abstraction of power control unit
  • May be realized in hardware or software

54
Power manageable components
  • Components with several internal states
  • Corresponding to power and service levels
  • Abstracted as a power state machine
  • State diagram with
  • Power and service annotation on states
  • Power and delay annotation on edges
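A minimal sketch of how such an abstraction can be written down, with hypothetical states and illustrative power, service and transition-delay numbers.

/* Sketch of a power state machine abstraction for one component:
 * states carry power and service levels, edges carry a transition
 * delay cost.  All numbers are illustrative assumptions. */
#include <stdio.h>

enum pstate { ACTIVE, IDLE, SLEEP, NUM_STATES };

static const struct {
    const char *name;
    double power_mw;      /* power drawn while in this state     */
    double service;       /* relative service level (1.0 = full) */
} states[NUM_STATES] = {
    [ACTIVE] = { "active", 400.0, 1.0 },
    [IDLE]   = { "idle",    50.0, 0.0 },
    [SLEEP]  = { "sleep",    0.5, 0.0 },
};

/* transition delay in milliseconds: edge annotation of the state diagram */
static const double trans_ms[NUM_STATES][NUM_STATES] = {
    [ACTIVE][IDLE]  = 0.1,
    [IDLE][ACTIVE]  = 0.1,
    [ACTIVE][SLEEP] = 5.0,
    [SLEEP][ACTIVE] = 30.0,
};

int main(void)
{
    enum pstate from = ACTIVE, to = SLEEP;
    printf("%s (%.1f mW) -> %s (%.1f mW), transition %.1f ms\n",
           states[from].name, states[from].power_mw,
           states[to].name,   states[to].power_mw,
           trans_ms[from][to]);
    return 0;
}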

55
Predictive techniques
  • Observe time-varying workload
  • Predict the idle period: Tpred ≈ Tidle
  • Go to sleep state if Tpred is long enough to
    amortize state transition cost
  • Main issue: prediction accuracy (see the sketch
    below)
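A minimal sketch of one common predictive policy: an exponentially weighted average of past idle periods compared against a break-even time. The weight, the break-even time and the idle trace are illustrative assumptions.

/* Sketch of a predictive shutdown policy: predict the next idle period
 * with an exponentially weighted average of past idle periods, and go
 * to sleep only when the prediction exceeds the break-even time that
 * amortizes the state-transition cost.  Parameter values are assumed. */
#include <stdio.h>

#define ALPHA      0.5    /* weight of the newest observation            */
#define BREAK_EVEN 40.0   /* ms of idleness needed to pay for a shutdown */

int main(void)
{
    double observed_idle[] = { 10.0, 80.0, 90.0, 5.0, 120.0 };  /* ms */
    double t_pred = 0.0;

    for (int i = 0; i < 5; i++) {
        int go_sleep = t_pred > BREAK_EVEN;   /* decision for this period */
        printf("predicted %6.1f ms -> %-6s actual %6.1f ms\n",
               t_pred, go_sleep ? "sleep" : "stay", observed_idle[i]);
        /* update the predictor with the idle period actually observed */
        t_pred = ALPHA * observed_idle[i] + (1.0 - ALPHA) * t_pred;
    }
    return 0;
}

The last step of the trace shows the accuracy problem the slide points out: a long idle period following a short one is mispredicted, so the shutdown opportunity is missed.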

56
When to use predictive techniques
  • When workload has memory
  • Implementing predictive schemes
  • Predictor families must be chosen based on
    workload types
  • Predictor parameters must be tuned to the
    instance-specific workload statistics
  • When workload is non-stationary or unknown,
    on-line adaptation is required.

57
Operating system-based power management
  • In systems with an operating system (OS)
  • The OS knows which tasks are running and waiting
  • The OS should perform the DPM decisions
  • Advanced Configuration and Power Interface (ACPI)
  • Open standard to facilitate design of OS-based
    power management

58
(No Transcript)
59
Implementations of DPM
  • Shut down idle components
  • Gate clock of idle units
  • Clock setting and voltage setting
  • Support multiple-voltage multiple-frequency
    components
  • Components with multiple working power states