Design Technology for Low Power Radio Systems - PowerPoint PPT Presentation

1
Design Technology for Low Power Radio Systems
Rhett Davis Dept. of EECS Univ. of Calif. Berkeley
  • http://bwrc.eecs.berkeley.edu

2
Domain of Interest
  • Highly integrated system-on-a-chip (SOC)
    solutions
  • Wireless communications with associated
    processing, e.g. multimedia processing,
    compression, switching, etc.
  • Primary computation is high complexity dataflow
    with a relatively small amount of control

3
Why Systems-on-a-Chip (SOC)?
  • State-of-the-Art CMOS is easily able to implement
    complete systems (or what was on a board before)
  • A microprocessor core is only 1-2 mm2
    (1-2% of the area of a $4 chip)
  • Portability (size) is critical to meet the cost,
    power and size requirements of future wireless
    systems
  • Chips will be required to support the complete
    application (wireless internet, multimedia)
  • Dedicated stand-alone computation is replacing
    general purpose processors as the semiconductor
    industry driver

4
Cellular Phones: An Example
Digital Cellular Market (Phones Shipped)
Year   1996  1997  1998  1999  2000
Units  48M   86M   162M  260M  435M
(Courtesy Mike McMahon, Texas Instruments)
5
Cellular Phone Baseband SOC
Chip blocks (figure): ROM, MCU, DSP, Gates, RAM, Analog
2000 phones on each 8" wafer @ 0.15 µm Leff
1 Million Baseband Chips per Day!!!
(Courtesy Mike McMahon, Texas Instruments)
6
Wireless System Design Issues
  • It is now possible to use CMOS to integrate all
    digital radio functions, but what is the best
    architectural way to use CMOS???
  • Computation rates for wireless systems will
    easily range up to 100s of GOPS in signal
    processing (UI, control insignificant)
  • What's keeping us from achieving this in silicon?
  • What can we do about it?

7
Computational Efficiency Metrics
  • Definition: MOPS
  • Millions of algorithmically defined arithmetic
    operations per second (e.g. multiply, add, shift);
    in a GP processor, several instructions per useful
    operation
  • Figures of merit
  • MOPS/mW - Energy efficiency (battery life)
  • MOPS/mm2 - Area efficiency (cost)
  • Optimization of these efficiencies is the basic
    goal assuming functionality is met
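
The two figures of merit above can be sketched in a few lines of Python. This is only an illustration: the function names and the example workload (1000 MOPS, 20 mW, 4 mm2, 3 instructions per useful operation) are assumptions, not numbers from the slides.

```python
def useful_mops(mips: float, instructions_per_op: float) -> float:
    """Convert a processor's MIPS rating into useful MOPS, given how many
    instructions it spends per algorithmic operation."""
    return mips / instructions_per_op


def energy_efficiency(mops: float, power_mw: float) -> float:
    """Energy efficiency in MOPS/mW (relates to battery life)."""
    return mops / power_mw


def area_efficiency(mops: float, area_mm2: float) -> float:
    """Area efficiency in MOPS/mm2 (relates to cost)."""
    return mops / area_mm2


# Hypothetical design point, chosen only to exercise the functions.
print(energy_efficiency(1000.0, 20.0))  # 50.0 MOPS/mW
print(area_efficiency(1000.0, 4.0))     # 250.0 MOPS/mm2
print(useful_mops(300.0, 3.0))          # 100.0 useful MOPS
```

Separating `useful_mops` from the two ratios keeps the slide's distinction visible: a general-purpose processor's MIPS rating overstates its MOPS whenever several instructions are needed per useful operation.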

8
Energy-Efficiency of Architectures
Figure: Energy Efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, vs. Flexibility (Coverage)
  • Dedicated HW (direct mapped): 100-1000 MOPS/mW
  • Reconfigurable Processor/Logic, reconfiguration (???): potential of 10-100 MOPS/mW
  • Embedded µProcessors (microprocessor): 0.1-1 MIPS/mW
9
Software Processors Energy Trends
  • Primary means of performance increase of software
    processors has been by increasing clock rate
  • Decreasing Energy Efficiency

$E \propto C \cdot V_{DD}^2$
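
A minimal sketch of what the quadratic dependence of energy on supply voltage implies, assuming the switched capacitance stays fixed. The function name and the 2.5 V example are illustrative assumptions; the slide gives only the proportionality.

```python
def relative_switching_energy(vdd: float, vdd_ref: float = 1.0) -> float:
    """Energy per operation relative to the same capacitance switched at
    vdd_ref, using E proportional to C * VDD^2 with C held constant."""
    return (vdd / vdd_ref) ** 2


# Illustrative only: the same logic run at 2.5 V instead of 1 V.
print(relative_switching_energy(2.5))  # 6.25x the energy per operation
```

If a higher clock rate is bought by raising the supply voltage, each operation costs quadratically more energy, which is one way to read the "decreasing energy efficiency" bullet.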
10
Software Processors Area Trends
  • Increasing clock rate results in a memory
    bottleneck, addressed by bringing memory on-chip
  • Area is increasingly dominated by memory,
    degrading MOPS/mm2

16x16 multiplier (0.05 mm2)
DSP processor with 1 multiplier (25 mm2)
Why time multiplex to save area if the overhead
is much greater than the area saved????
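
The area comparison behind that question can be worked out directly from the two figures on the slide. The variable names below are assumptions made for the sketch.

```python
MULTIPLIER_AREA_MM2 = 0.05  # 16x16 multiplier, from the slide
DSP_AREA_MM2 = 25.0         # DSP processor with 1 multiplier, from the slide

# How many bare multipliers fit in the area of one DSP processor.
overhead_ratio = DSP_AREA_MM2 / MULTIPLIER_AREA_MM2

# Fraction of the DSP's area that is the multiplier itself.
multiplier_fraction = MULTIPLIER_AREA_MM2 / DSP_AREA_MM2

print(f"DSP area is about {overhead_ratio:.0f}x the multiplier area")
print(f"Multiplier is about {multiplier_fraction:.1%} of the DSP area")
```

With these figures the multiplier is roughly 0.2% of the processor, so the structure needed to time-multiplex one multiplier occupies hundreds of times the area it saves.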
11
Parallelism is the answer, but
  • Not by putting Von Neumann processors in parallel
    and programming with a sequential language
  • Attempts to do this have failed over and over
    again
  • The parallel computer compiler problem is very
    difficult
  • Not by trying to capture parallelism at the
    instruction level
  • Superscalar, VLIW, etc. are very inefficient
  • Hardware can't figure out the parallelism from a
    sequential language either
  • The problem is the initial sequential description
    (e.g. C) which is poorly matched to highly
    parallel applications

12
What is really happening
Starting with a parallel algorithmic description
Re-entering it using a sequential description
Then try to rediscover the parallelism
  • while (i = 0; i++; i < num)
  • a = a * c[i]
  • b[i] = sin(a * pi) + cos(a * pi)
  • outfil = b[i] * indata
We take this path so that we can use an
architecture that is orders of magnitude less
efficient in energy and area ??????
13
What can a fully parallel CMOS solution
potentially do?
  • In 0.25 micron a multiplier requires 0.05 mm2 and
    7 pJ per operation at 1 V. Adders and registers
    are about 10 times smaller and 10 times lower
    energy
  • Let's implement a 50 mm2, 0.25 micron chip using
    adders, registers and multipliers
  • We can have 2000 adders/registers and 200
    multipliers in less than 1/2 of the chip; also
    assume 1/3 of power goes into clocks
  • 25 MHz clock (1 volt) gives 50 GOPS at 100 mW
  • 500 MOPS/mW and 1000 MOPS/mm2
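
The estimate above can be reproduced with a short calculation. Two assumptions are made here that the slide does not state explicitly: every adder/register and multiplier performs one operation per clock cycle, and "1/3 of power goes into clocks" means one third of the total chip power.

```python
# Figures taken from the slide.
MULT_AREA_MM2 = 0.05
MULT_ENERGY_PJ = 7.0
ADDER_AREA_MM2 = MULT_AREA_MM2 / 10    # "about 10 times smaller"
ADDER_ENERGY_PJ = MULT_ENERGY_PJ / 10  # "10 times lower energy"
N_MULT = 200
N_ADD = 2000
CHIP_AREA_MM2 = 50.0
CLOCK_MHZ = 25.0
CLOCK_POWER_FRACTION = 1 / 3           # assumed to be a share of total power

datapath_area_mm2 = N_MULT * MULT_AREA_MM2 + N_ADD * ADDER_AREA_MM2

# Assumption: every unit completes one operation per cycle.
mops = (N_MULT + N_ADD) * CLOCK_MHZ

# pJ per cycle times MHz gives microwatts; divide by 1000 for milliwatts.
energy_per_cycle_pj = N_MULT * MULT_ENERGY_PJ + N_ADD * ADDER_ENERGY_PJ
datapath_power_mw = energy_per_cycle_pj * CLOCK_MHZ / 1000.0
total_power_mw = datapath_power_mw / (1 - CLOCK_POWER_FRACTION)

mops_per_mw = mops / total_power_mw
mops_per_mm2 = mops / CHIP_AREA_MM2

print(f"datapath area: {datapath_area_mm2:.1f} mm2 of {CHIP_AREA_MM2:.0f} mm2")
print(f"throughput:    {mops / 1000:.0f} GOPS")
print(f"total power:   {total_power_mw:.0f} mW")
print(f"efficiency:    {mops_per_mw:.0f} MOPS/mW, {mops_per_mm2:.0f} MOPS/mm2")
```

Under these assumptions the unrounded results come out near 55 GOPS, 105 mW, 524 MOPS/mW and 1100 MOPS/mm2, which matches the slide's rounded 50 GOPS, 100 mW, 500 MOPS/mW and 1000 MOPS/mm2.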

14
Start with a parallel description of the
algorithm
15
Then directly map into hardware
16
Results in fully parallel solutions
(numbers taken from vendor-published benchmarks)
Orders of magnitude lower efficiency even for an
optimized processor architecture
17
Reasons software solutions seem attractive
  • (1) Believed to reduce time-to-system-implementation
  • (2) Provides flexibility
  • (3) Locks the customers into an architecture they
    can't change
  • (4) Difficulty in getting dedicated SOC chips
    designed
  • Are these good reasons???

18
(1) Believed to reduce time-to-system
implementation
  • Software decreases time to get first prototype,
    but time to fully verified system is much longer
    (hardware is often ready but software still needs
    to be done)
  • Limitations of the software prototype often set
    the ultimate limit of the system performance
  • Software solutions can be shipped with bugs, not
    a real option for SOC

19
(2) Need flexibility
  • Software is not always flexible
  • Can be hard to verify
  • Flexibility does not imply software
    programmability
  • Domain specific design can have multiple modules,
    coefficients and local state control (use some of
    the factor of 100 in efficiency) to address a
    range of applications
  • Reconfiguration of interconnect can achieve
    flexibility with high levels of efficiency

20
Flexibility without software
Energy per Transform vs. FFT size
Transforms per Second per mm2 vs. FFT size
All results are scaled to 0.18 µm
21
Reasons software solutions seem attractive
  • (1) Believed to reduce time-to-system
    implementation
  • (2) Provides flexibility
  • (3) Locks the customers into an architecture they
    can't change
  • (4) Difficulty in getting dedicated SOC chips
    designed

22
Standard DSP-ASIC Design Flow
Problems
  • Three translations of design data
  • Requirements for re-verification at each stage
  • Uncontrolled looping when pipeline stalls

Prohibitively Long Design Time for Direct Mapped
Architectures
23
Direct Mapping Design Flow
Flow diagram (figure): Algorithm/System Simulation, Front-End, Back-End, Floorplan, RTL Libraries, Automated Flow, Mask Layout, Performance Estimates
  • Encourages iterations of layout
  • Controls looping
  • Reduces the flow to a single phase
  • Depends on fast automation

24
(an aside) Déjà vu???
  • An automated style of design with parameterized
    modules processed through foundries is just the
    reincarnation of good ole Silicon Compilation of
    >10 years ago
  • What happened?
  • A decline of research into design methodologies
  • A single dominant flow has resulted - the
    Verilog-Synopsys-Standard Cell
  • Lack of tool flows to support alternative styles
    of design
  • Research community lost access to technology
    and moved to highly sub-optimal processor and
    FPGA solutions

25
Capturing Design Decisions
  • Categories
  • Function - basic input-output behavior
  • Signal - physical signals and types
  • Circuit - transistors
  • Floorplan - physical positions

How to get layout and performance estimates in a
day?
26
Simplified View of the Flow
  • New Software
  • Generation of netlists from a dataflow graph
  • Merging of floorplan from last iteration
  • Automatic routing and performance analysis
  • Automation of flow as a dependency graph (UNIX
    MAKE program)
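
The idea of automating the flow as a dependency graph, as MAKE does, can be sketched in Python with the standard-library `graphlib` module (Python 3.9+). The step names below are hypothetical stand-ins for the flow stages listed above, not the actual targets used by the authors.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each step maps to the set of steps it depends on.
FLOW = {
    "netlist": {"dataflow_graph"},
    "merged_floorplan": {"netlist", "previous_floorplan"},
    "routed_layout": {"merged_floorplan"},
    "performance_estimates": {"routed_layout"},
}


def build_order(flow):
    """Return every node in an order where dependencies come first."""
    return list(TopologicalSorter(flow).static_order())


def stale_steps(flow, changed):
    """Return the steps that must be rerun, in order, when the given
    inputs or steps have changed (the make-style incremental rebuild)."""
    stale = set(changed)
    rerun = []
    for step in TopologicalSorter(flow).static_order():
        if flow.get(step, set()) & stale:
            stale.add(step)
            rerun.append(step)
    return rerun


# Editing only the floorplan skips netlist generation.
print(stale_steps(FLOW, {"previous_floorplan"}))
# Editing the dataflow graph reruns the whole flow.
print(stale_steps(FLOW, {"dataflow_graph"}))
```

Expressing the flow this way means a small change reruns only the steps downstream of it, which is what makes frequent layout iterations affordable.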

27
Why Simulink?
  • Simulink is an easy sell to algorithm developers
  • Closely integrated with popular system design
    tool Matlab
  • Successfully models digital and analog circuits

28
Modeling Datapath Logic
  • Discrete-Time (cycle accurate)
  • Fixed-Point Types (bit true)
  • Completely specify function and signal decisions
  • No need for RTL

Multiply / Accumulate
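
A minimal Python sketch of what "cycle accurate" and "bit true" mean for a multiply/accumulate block. This is not the Simulink model from the talk: the word widths, the wraparound overflow behaviour and the class name are all assumptions made for illustration.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer to a two's-complement word of the given width."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


class MultiplyAccumulate:
    """Registered MAC: the output seen in a cycle is the accumulator value
    stored at the end of the previous cycle (cycle accurate), and every
    value is held to a fixed word width (bit true)."""

    def __init__(self, in_bits: int = 16, acc_bits: int = 32):
        self.in_bits = in_bits    # assumed input width
        self.acc_bits = acc_bits  # assumed accumulator width
        self.acc = 0

    def step(self, a: int, b: int, reset: bool = False) -> int:
        """Advance one clock cycle and return the registered output."""
        out = self.acc
        a = wrap_signed(a, self.in_bits)
        b = wrap_signed(b, self.in_bits)
        base = 0 if reset else self.acc
        self.acc = wrap_signed(base + a * b, self.acc_bits)
        return out


# Narrow widths make the overflow visible after a few cycles.
mac = MultiplyAccumulate(in_bits=4, acc_bits=8)
outputs = [mac.step(3, 4), mac.step(7, 7), mac.step(7, 7),
           mac.step(7, 7), mac.step(0, 0)]
print(outputs)
```

Because the model fixes both the timing and the word lengths, its output can be compared sample for sample against the hardware, which is the sense in which no separate RTL description is needed to pin down function and signal decisions.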
29
Modeling Control Logic
  • Extended finite state-machine editor
  • Co-simulation with dataflow graph
  • New Software: Stateflow-VHDL translator
  • No need for RTL

Address Generator / MAC Reset
30
Specifying Circuit Decisions
Black Box
RTL Code or Data-path Generator Code or Custom Module
Stateflow-VHDL translator
Time-Multiplexed FIR Filter
  • Macro choices embedded in dataflow graph
  • Cross-check simulations required

31
Hierarchy Hardened Progressively
  • Macro characterization saved for fast estimates
  • Each level of hierarchy becomes a new hard macro
  • Higher levels of hierarchy are adjusted
  • When top level of hierarchy is hardened, the
    design is done

32
Capturing Floorplan Decisions
  • Commercial physical design tools used
  • Instance names in floorplan match dataflow graph
  • Placements merged on each iteration
  • Manhattan distance can be used for parasitic
    estimates
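
A rough sketch of how a Manhattan-distance parasitic estimate could work once instance placements are known. The instance names, coordinates and the capacitance-per-micron figure are placeholders invented for the example; none of them come from the talk.

```python
# Assumed placeholder value, NOT a figure from the source.
CAP_PER_UM_FF = 0.2


def manhattan_distance_um(a, b):
    """Manhattan (L1) distance between two (x, y) placements in microns."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def wire_capacitance_ff(src, dst, cap_per_um_ff=CAP_PER_UM_FF):
    """Lumped wire capacitance estimate in fF, proportional to the
    Manhattan distance between the two endpoints."""
    return manhattan_distance_um(src, dst) * cap_per_um_ff


# Hypothetical placements keyed by instance name, in microns.
placements = {"mac0": (0.0, 0.0), "addr_gen": (300.0, 400.0)}

distance = manhattan_distance_um(placements["mac0"], placements["addr_gen"])
cap = wire_capacitance_ff(placements["mac0"], placements["addr_gen"])
print(f"distance: {distance:.0f} um, estimated capacitance: {cap:.0f} fF")
```

Keying the placements by the same instance names used in the dataflow graph is what lets an estimate like this be attached to a net before any routing has been done.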

33
Reduced Impact of Interconnect
  • 0.18 µm

Long wires can be modeled as lumped capacitances
34
Race-Immune Clock Tree Synthesis
  • Race margin: 580 ps
  • 0.18 µm
  • VDD: 1 V

Demonstrated on a 600k transistor design
35
Example 1: Macro Hardening
Most time/disk space spent on extraction and
power simulation
36
Example 2: Test Chip
  • 300k transistors
  • 0.25 µm
  • 1.0 V
  • 25 MHz
  • 6.8 mm2
  • 14 mW
  • 2 phase clock
  • 3 layers of P&R hierarchy

Parallel Pipelined FIR Filter (8X decimation
filter for 12-bit 200 MHz ΣΔ)
37
TDMA Baseband Receiver
  • 600k transistors
  • 0.18 µm
  • 1.0 V
  • 25 MHz
  • 1.1 mm2
  • 21 mW
  • single phase clock
  • 5 clock domains
  • 2 layers of P&R hierarchy

Blocks (figure): carrier detection, frequency estimation, rotate and correlate, control
38
Conclusions
  • Direct-Mapped hardware is the most efficient use
    of silicon
  • Direct-Mapped hardware can be easier to design
    and verify than embedded hardware/software
    systems
  • Don't translate design data, refine it
  • Design with dataflow graphs, not sequential code
  • Design flow automation speeds up design space
    exploration