Title: Design Technology for Low Power Radio Systems
1. Design Technology for Low Power Radio Systems
Rhett Davis, Dept. of EECS, Univ. of Calif., Berkeley
- http://bwrc.eecs.berkeley.edu
2. Domain of Interest
- Highly integrated system-on-a-chip (SOC) solutions
- Wireless communications with associated processing, e.g. multimedia processing, compression, switching, etc.
- Primary computation is high-complexity dataflow with a relatively small amount of control
3. Why Systems-on-a-Chip (SOC)?
- State-of-the-art CMOS is easily able to implement complete systems (or what was on a board before)
- A microprocessor core is only 1-2 mm2 (1-2% of the area of a $4 chip)
- Portability (size) is critical to meet the cost, power and size requirements of future wireless systems
- Chips will be required to support the complete application (wireless internet, multimedia)
- Dedicated stand-alone computation is replacing general purpose processors as the semiconductor industry driver
4. Cellular Phones: An Example
Digital Cellular Market (Phones Shipped)
Year:   1996   1997   1998   1999   2000
Units:  48M    86M    162M   260M   435M
(Courtesy Mike McMahon, Texas Instruments)
5. Cellular Phone Baseband SOC
[Figure: baseband chip with blocks labeled ROM, MCU, DSP, Gates, RAM, Analog]
2000 phones on each 8" wafer @ 0.15 µm Leff
1 million baseband chips per day!
(Courtesy Mike McMahon, Texas Instruments)
6. Wireless System Design Issues
- It is now possible to use CMOS to integrate all digital radio functions, but what is the best architectural way to use CMOS?
- Computation rates for wireless systems will easily range up to 100s of GOPS in signal processing (UI and control are insignificant)
- What's keeping us from achieving this in silicon?
- What can we do about it?
7. Computational Efficiency Metrics
- Definition: MOPS
- Millions of algorithmically defined arithmetic operations per second (e.g. multiply, add, shift); in a GP processor, several instructions are needed per useful operation
- Figures of merit
- MOPS/mW: energy efficiency (battery life)
- MOPS/mm2: area efficiency (cost)
- Optimization of these efficiencies is the basic goal, assuming functionality is met
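The two figures of merit can be computed as in the following minimal sketch. The function names and the example processor numbers (400 MIPS, 4 instructions per useful operation, 200 mW, 20 mm2) are assumptions made for illustration and do not come from the slides.

```python
def mops_from_mips(mips, instructions_per_op):
    """Convert a processor's MIPS rating to useful MOPS.

    instructions_per_op is an assumed, workload-dependent figure: how many
    instructions the processor spends per algorithmic arithmetic operation.
    """
    return mips / instructions_per_op


def efficiency(mops, power_mw, area_mm2):
    """Return (MOPS/mW, MOPS/mm2), the two figures of merit."""
    return mops / power_mw, mops / area_mm2


# Hypothetical processor, illustrative numbers only.
useful_mops = mops_from_mips(400, 4)
energy_eff, area_eff = efficiency(useful_mops, 200, 20)
print(f"{useful_mops:.0f} MOPS, {energy_eff:.2f} MOPS/mW, {area_eff:.1f} MOPS/mm2")
```

Counting useful operations rather than instructions is what lets a processor and a dedicated datapath be compared on the same scale.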
8. Energy-Efficiency of Architectures
[Figure: energy efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, versus flexibility (coverage)]
- Dedicated HW (direct mapped): 100-1000 MOPS/mW
- Reconfigurable processor/logic: reconfiguration (???), potential of 10-100 MOPS/mW
- Embedded µProcessors (microprocessor): 0.1-1 MIPS/mW
9. Software Processors: Energy Trends
- Primary means of performance increase of software processors has been increasing clock rate
- Decreasing energy efficiency
$E \propto C \cdot V_{DD}^2$
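The quadratic dependence on supply voltage can be made concrete with a small sketch. It takes the proportionality constant as 1 (an assumption, corresponding to the energy drawn from the supply when charging a capacitance to VDD), and the example voltages are illustrative, not taken from the slides.

```python
def switching_energy_joules(capacitance_farads, vdd_volts):
    """Energy drawn from the supply to charge C to VDD: E = C * VDD^2."""
    return capacitance_farads * vdd_volts ** 2


def energy_ratio(vdd_new, vdd_old):
    """Relative energy per operation when only the supply voltage changes."""
    return (vdd_new / vdd_old) ** 2


# Illustrative voltages (assumed): halving VDD from 2.0 V to 1.0 V.
print(energy_ratio(1.0, 2.0))  # prints 0.25
```

Read the other way, any design style that needs a higher supply voltage to reach its clock rate pays for it quadratically in energy per operation.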
10. Software Processors: Area Trends
- Increasing clock rate results in a memory bottleneck, addressed by bringing memory on-chip
- Area is increasingly dominated by memory, degrading MOPS/mm2
16x16 multiplier (0.05 mm2)
DSP processor with 1 multiplier (25 mm2)
Why time multiplex to save area if the overhead is much greater than the area saved?
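The two areas quoted on the slide imply the following ratio. This is plain arithmetic on the slide's own numbers; the variable names are illustrative.

```python
MULTIPLIER_AREA_MM2 = 0.05  # 16x16 multiplier, from the slide
DSP_AREA_MM2 = 25.0         # DSP processor with one multiplier, from the slide

# How many bare multipliers would fit in the area of one DSP processor.
multipliers_per_dsp_area = round(DSP_AREA_MM2 / MULTIPLIER_AREA_MM2)
# Share of the DSP processor's area that its single multiplier occupies.
multiplier_share = MULTIPLIER_AREA_MM2 / DSP_AREA_MM2

print(multipliers_per_dsp_area, f"{multiplier_share:.1%}")
```

On these figures the multiplier is about 0.2% of the processor, so the area of one such DSP could instead hold on the order of 500 multipliers.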
11. Parallelism is the answer, but...
- Not by putting Von Neumann processors in parallel and programming with a sequential language
- Attempts to do this have failed over and over again
- The parallel computer compiler problem is very difficult
- Not by trying to capture parallelism at the instruction level
- Superscalar, VLIW, etc. are very inefficient
- Hardware can't figure out the parallelism from a sequential language either
- The problem is the initial sequential description (e.g. C), which is poorly matched to highly parallel applications
12. What is really happening
Starting with a parallel algorithmic description
Re-entering it using a sequential description
Then try to rediscover the parallelism
- While (i=0; i++; i<num)
-   a = a * c[i]
-   b[i] = sin(a * pi) + cos(a * pi)
-   ...
- Outfil = b[i] * indata
We take this path so that we can use an architecture that is orders of magnitude less efficient in energy and area?
13. What can a fully parallel CMOS solution potentially do?
- In 0.25 micron, a multiplier requires 0.05 mm2 and 7 pJ per operation at 1 V. Adders and registers are about 10 times smaller and 10 times lower energy
- Let's implement a 50 mm2, 0.25 micron chip using adders, registers and multipliers
- We can have 2000 adders/registers and 200 multipliers in less than 1/2 of the chip; also assume 1/3 of power goes into clocks
- 25 MHz clock (1 volt) gives 50 Gops at 100 mW
- 500 MOPS/mW and 1000 MOPS/mm2
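The estimate above can be reproduced step by step. The sketch below uses only the slide's numbers plus two stated assumptions: every adder/register and multiplier completes one operation per clock cycle, and "10 times smaller and lower energy" is taken as exactly a factor of 10. The names are illustrative.

```python
MULT_AREA_MM2 = 0.05                 # from the slide
MULT_ENERGY_PJ = 7.0                 # from the slide, at 1 V
ADD_AREA_MM2 = MULT_AREA_MM2 / 10    # "about 10 times smaller"
ADD_ENERGY_PJ = MULT_ENERGY_PJ / 10  # "10 times lower energy"
N_ADDERS = 2000
N_MULTIPLIERS = 200
CLOCK_HZ = 25e6
CHIP_AREA_MM2 = 50.0
CLOCK_POWER_SHARE = 1 / 3            # fraction of total power in clocks


def estimate():
    """Return area, throughput, power and efficiency for the example chip."""
    datapath_area = N_MULTIPLIERS * MULT_AREA_MM2 + N_ADDERS * ADD_AREA_MM2
    # Assumption: every unit completes one operation per clock cycle.
    mops = (N_MULTIPLIERS + N_ADDERS) * CLOCK_HZ / 1e6
    energy_per_cycle_pj = (N_MULTIPLIERS * MULT_ENERGY_PJ
                           + N_ADDERS * ADD_ENERGY_PJ)
    datapath_mw = energy_per_cycle_pj * 1e-12 * CLOCK_HZ * 1e3
    # Datapath power is the remaining 2/3 of the total.
    total_mw = datapath_mw / (1 - CLOCK_POWER_SHARE)
    return {
        "datapath_area_mm2": datapath_area,
        "gops": mops / 1e3,
        "power_mw": total_mw,
        "mops_per_mw": mops / total_mw,
        "mops_per_mm2": mops / CHIP_AREA_MM2,
    }


for name, value in estimate().items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the datapath occupies about 20 mm2 (less than half of the 50 mm2 chip) and delivers about 55 Gops at about 105 mW, which is roughly 520 MOPS/mW and 1100 MOPS/mm2, in line with the rounded figures on the slide.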
14. Start with a parallel description of the algorithm
15. Then directly map into hardware
16. Results in fully parallel solutions
(numbers taken from vendor-published benchmarks)
Orders of magnitude lower efficiency even for an optimized processor architecture
17. Reasons software solutions seem attractive
- (1) Believed to reduce time-to-system-implementation
- (2) Provides flexibility
- (3) Locks the customers into an architecture they can't change
- (4) Difficulty in getting dedicated SOC chips designed
- Are these good reasons?
18. (1) Believed to reduce time-to-system-implementation
- Software decreases time to get first prototype, but time to fully verified system is much longer (hardware is often ready but software still needs to be done)
- Limitations of the software prototype often set the ultimate limit of the system performance
- Software solutions can be shipped with bugs, which is not a real option for SOC
19. (2) Need flexibility
- Software is not always flexible
- Can be hard to verify
- Flexibility does not imply software programmability
- Domain-specific design can have multiple modules, coefficients and local state control (use some of the factor of 100 in efficiency) to address a range of applications
- Reconfiguration of interconnect can achieve flexibility with high levels of efficiency
20. Flexibility without software
[Figure: Energy per Transform vs. FFT size]
[Figure: Transforms per Second per mm2 vs. FFT size]
All results are scaled to 0.18 µm
21. Reasons software solutions seem attractive
- (1) Believed to reduce time-to-system-implementation
- (2) Provides flexibility
- (3) Locks the customers into an architecture they can't change
- (4) Difficulty in getting dedicated SOC chips designed
22. Standard DSP-ASIC Design Flow
Problems
- Three translations of design data
- Requirements for re-verification at each stage
- Uncontrolled looping when pipeline stalls
Prohibitively long design time for direct-mapped architectures
23. Direct Mapping Design Flow
[Figure: flow diagram with blocks labeled Algorithm/System, Simulation, Front-End, Back-End, Floorplan, RTL Libraries, Automated Flow, Mask Layout, Performance Estimates]
- Encourages iterations of layout
- Controls looping
- Reduces the flow to a single phase
- Depends on fast automation
24. (An aside) Déjà vu?
- An automated style of design with parameterized modules processed through foundries is just the reincarnation of good ole Silicon Compilation of >10 years ago
- What happened?
- A decline of research into design methodologies
- A single dominant flow has resulted: the Verilog-Synopsys-Standard Cell flow
- Lack of tool flows to support alternative styles of design
- Research community lost access to technology and moved to highly sub-optimal processor and FPGA solutions
25. Capturing Design Decisions
- Categories
- Function: basic input-output behavior
- Signal: physical signals and types
- Circuit: transistors
- Floorplan: physical positions
How to get layout and performance estimates in a day?
26. Simplified View of the Flow
- New software
- Generation of netlists from a dataflow graph
- Merging of floorplan from last iteration
- Automatic routing and performance analysis
- Automation of flow as a dependency graph (UNIX MAKE program)
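The idea of running the flow as a dependency graph, as MAKE does, can be sketched as follows. The stage names and their dependencies are hypothetical stand-ins chosen to mirror the steps listed above; the actual flow's targets are not given in the slides. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each target maps to the set of things it depends on.
FLOW = {
    "netlist": {"dataflow_graph"},
    "merged_floorplan": {"netlist", "previous_floorplan"},
    "routed_layout": {"merged_floorplan"},
    "performance_report": {"routed_layout"},
}


def stale_targets(changed_sources):
    """Return the targets that must be rebuilt, in a valid build order."""
    stale = set(changed_sources)
    order = []
    # static_order() yields every node after all of its dependencies.
    for node in TopologicalSorter(FLOW).static_order():
        if node in FLOW and FLOW[node] & stale:
            stale.add(node)
            order.append(node)
    return order


print(stale_targets({"previous_floorplan"}))
```

Editing only the floorplan leaves the netlist untouched and reruns the merge, routing and analysis steps, which is the property that keeps each design iteration short.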
27. Why Simulink?
- Simulink is an easy sell to algorithm developers
- Closely integrated with popular system design tool Matlab
- Successfully models digital and analog circuits
28. Modeling Datapath Logic
- Discrete-time (cycle accurate)
- Fixed-point types (bit true)
- Completely specify function and signal decisions
- No need for RTL
[Figure: Multiply / Accumulate]
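What "cycle accurate" and "bit true" mean for a multiply/accumulate block can be illustrated outside Simulink with a small Python model. This is only a sketch: the accumulator width, the two's-complement wraparound on overflow and the class interface are assumptions, not details taken from the slides.

```python
def wrap_signed(value, bits):
    """Wrap an integer into the two's-complement range of the given width."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


class MultiplyAccumulate:
    """Bit-true MAC: one call to clock() models one clock cycle."""

    def __init__(self, acc_bits=32):
        self.acc_bits = acc_bits
        self.acc = 0

    def clock(self, a, b, reset=False):
        """Update the accumulator register and return its new value."""
        base = 0 if reset else self.acc
        self.acc = wrap_signed(base + a * b, self.acc_bits)
        return self.acc


# An 8-bit accumulator makes the overflow behavior easy to see.
mac = MultiplyAccumulate(acc_bits=8)
print(mac.clock(10, 10, reset=True))  # 100
print(mac.clock(10, 10))              # 200 wraps to -56 in 8 bits
```

Because the model reproduces overflow exactly, its outputs can be compared bit for bit with a hardware simulation of the same block.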
29. Modeling Control Logic
- Extended finite state-machine editor
- Co-simulation with dataflow graph
- New software: Stateflow-VHDL translator
- No need for RTL
[Figure: Address Generator / MAC Reset]
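A control block of the kind named in the figure can be sketched as a tiny state machine. The behavior below (a wrapping address counter that asserts a MAC reset at address 0) is an assumed, simplified interpretation; the slides do not specify the state machine.

```python
from itertools import islice


def address_generator(num_taps):
    """Yield (address, mac_reset) once per clock cycle.

    The address counts 0 .. num_taps-1 and wraps; mac_reset is asserted
    on the first address so the accumulator restarts for each new output.
    """
    addr = 0
    while True:
        yield addr, addr == 0
        addr = (addr + 1) % num_taps


# First four cycles for a hypothetical 3-tap filter.
print(list(islice(address_generator(3), 4)))
```

Keeping control this small is consistent with the deck's premise that the computation is mostly dataflow with a relatively small amount of control.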
30. Specifying Circuit Decisions
[Figure: Time-Multiplexed FIR Filter. Labels: Black Box (RTL code, or data-path generator code, or custom module); Stateflow-VHDL translator]
- Macro choices embedded in dataflow graph
- Cross-check simulations required
31. Hierarchy Hardened Progressively
- Macro characterization saved for fast estimates
- Each level of hierarchy becomes a new hard macro
- Higher levels of hierarchy are adjusted
- When top level of hierarchy is hardened, the design is done
32. Capturing Floorplan Decisions
- Commercial physical design tools used
- Instance names in floorplan match dataflow graph
- Placements merged on each iteration
- Manhattan distance can be used for parasitic estimates
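A parasitic estimate from Manhattan distance might look like the sketch below. The capacitance per unit length (0.2 fF/µm) is a placeholder assumption, not a value from the slides, and the pin coordinates are hypothetical.

```python
def manhattan_um(p, q):
    """Manhattan (rectilinear) distance between two (x, y) points in µm."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def wire_capacitance_ff(p, q, cap_per_um_ff=0.2):
    """Estimate wire capacitance as one lumped value: length times C per µm.

    cap_per_um_ff is an assumed technology parameter for illustration.
    """
    return manhattan_um(p, q) * cap_per_um_ff


# Two hypothetical pin positions taken from a floorplan.
driver, load = (0, 0), (300, 400)
print(manhattan_um(driver, load), "um")
print(round(wire_capacitance_ff(driver, load), 1), "fF")
```

The estimate needs only placed instance positions, so it is available as soon as the floorplan exists and before any routing has been run.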
33. Reduced Impact of Interconnect
Long wires can be modeled as lumped capacitances
34. Race-Immune Clock Tree Synthesis
- Race margin 580 ps
- 0.18 µm
- VDD 1 V
Demonstrated on a 600k transistor design
35. Example 1: Macro Hardening
Most time/disk space spent on extraction and power simulation
36. Example 2: Test Chip
- 300k transistors
- 0.25 µm
- 1.0 V
- 25 MHz
- 6.8 mm2
- 14 mW
- 2-phase clock
- 3 layers of P&R hierarchy
Parallel Pipelined FIR Filter (8X decimation filter for 12-bit 200 MHz ΣΔ)
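For readers unfamiliar with decimation filters, the function below is a minimal behavioral sketch of a FIR filter that keeps one output in every 8. It is not the chip's parallel pipelined architecture, and the tap values in the example are placeholders. Note that 200 MHz divided by 8 is 25 MHz, which matches the clock rate listed for the test chip.

```python
def decimating_fir(samples, taps, factor=8):
    """FIR filter that computes only every `factor`-th output sample.

    Outputs start once the delay line is full, so no zero padding is used.
    """
    outputs = []
    for n in range(len(taps) - 1, len(samples), factor):
        # Dot product of the taps with the most recent len(taps) samples.
        outputs.append(sum(taps[k] * samples[n - k] for k in range(len(taps))))
    return outputs


# Placeholder 2-tap filter on a ramp input, decimating by 8.
print(decimating_fir(list(range(16)), [1, 1]))
```

Skipping the discarded outputs is where the savings come from: the arithmetic runs at the low output rate rather than at the input sample rate.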
37. TDMA Baseband Receiver
- 600k transistors
- 0.18 µm
- 1.0 V
- 25 MHz
- 1.1 mm2
- 21 mW
- single-phase clock
- 5 clock domains
- 2 layers of P&R hierarchy
[Figure: chip layout with blocks labeled carrier detection, frequency estimation, rotate & correlate, control]
38. Conclusions
- Direct-mapped hardware is the most efficient use of silicon
- Direct-mapped hardware can be easier to design and verify than embedded hardware/software systems
- Don't translate design data, refine it
- Design with dataflow graphs, not sequential code
- Design flow automation speeds up design space exploration