Design Technology for Low Power Radio Systems - PowerPoint PPT Presentation


Slides: 39

Provided by: bobb179


Transcript and Presenter's Notes

1
Design Technology for Low Power Radio Systems
Rhett Davis Dept. of EECS Univ. of Calif. Berkeley

http://bwrc.eecs.berkeley.edu

2
Domain of Interest

Highly integrated system-on-a-chip solutions
(SOCs)
Wireless communications with associated
processing, e.g. multimedia processing,
compression, switching, etc.
Primary computation is high-complexity dataflow
with a relatively small amount of control

3
Why Systems-on-a-Chip - SOC ?

State-of-the-Art CMOS is easily able to implement
complete systems (or what was on a board before)
A microprocessor core is only 1-2 mm2
(1-2% of the area of a $4 chip)
Portability (size) is critical to meet the cost,
power and size requirements of future wireless
systems
Chips will be required to support the complete
application (wireless internet, multimedia)
Dedicated stand-alone computation is replacing
general purpose processors as the semiconductor
industry driver

4
Cellular Phones: An Example
Digital Cellular Market (Phones Shipped)
Year   1996  1997  1998  1999  2000
Units  48M   86M   162M  260M  435M
(Courtesy Mike McMahon, Texas Instruments)
5
Cellular Phone Baseband SOC
Blocks: ROM, MCU, DSP, Gates, RAM, Analog
2000 phones on each 8-inch wafer @ .15 µm Leff
1 Million Baseband Chips per Day!!!
(Courtesy Mike McMahon, Texas Instruments)
6
Wireless System Design Issues

It is now possible to use CMOS to integrate all
digital radio functions but what is the best
architectural way to use CMOS???
Computation rates for wireless systems will
easily range up to 100s of GOPS in signal
processing (UI, control insignificant)
What's keeping us from achieving this in silicon?
What can we do about it?

7
Computational Efficiency Metrics

Definition: MOPS
Millions of algorithmically defined arithmetic
operations per second (e.g. multiply, add, shift); in a GP
processor, several instructions are needed per useful
operation
Figures of merit
MOPS/mW - Energy efficiency (battery life)
MOPS/mm2 - Area efficiency (cost)
Optimization of these efficiencies is the basic
goal assuming functionality is met

8
Energy-Efficiency of Architectures
[Chart: Energy Efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, vs. Flexibility (Coverage)]
Dedicated HW, direct mapped: 100-1000 MOPS/mW
Reconfigurable Processor/Logic, reconfiguration (???): potential of 10-100 MOPS/mW
Embedded µProcessors, microprocessor: .1-1 MIPS/mW
9
Software Processors Energy Trends

Primary means of performance increase of software
processors has been by increasing clock rate
Decreasing Energy Efficiency

E ∝ C × VDD²
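
The energy relation on this slide, read as switching energy proportional to capacitance times the square of the supply voltage, can be illustrated with a minimal sketch. The function name and the example voltages are assumptions chosen for illustration; they are not taken from the slides.

```python
def relative_switching_energy(vdd, vdd_ref, cap=1.0, cap_ref=1.0):
    """Energy per operation relative to a reference design.

    Uses the proportionality E ~ C * VDD^2, so only ratios are meaningful.
    """
    return (cap / cap_ref) * (vdd / vdd_ref) ** 2


# Hypothetical example: same switched capacitance, supply raised from 1.0 V to 2.0 V.
ratio = relative_switching_energy(vdd=2.0, vdd_ref=1.0)
print(f"Energy per operation grows by a factor of {ratio}")
```

Under this relation, any supply increase needed to reach a higher clock rate is paid for quadratically in energy per operation.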
10
Software Processors Area Trends

Increasing clock rate results in a memory
bottleneck addressed by bringing memory on-chip
Area is increasingly dominated by memory
degrading MOPS/mm2

16x16 multiplier (.05 mm2)
DSP processor with 1 multiplier (25 mm2)
Why time multiplex to save area if the overhead
is much greater than the area saved????
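
The two area figures on this slide can be compared with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, and it ignores everything except the quoted areas.

```python
MULTIPLIER_AREA_MM2 = 0.05  # 16x16 multiplier, figure quoted on the slide
DSP_AREA_MM2 = 25.0         # DSP processor with 1 multiplier, figure quoted on the slide

# How many bare multipliers fit in the area of one single-multiplier DSP.
overhead_ratio = round(DSP_AREA_MM2 / MULTIPLIER_AREA_MM2)

# Share of the DSP area that is the multiplier itself, in percent.
multiplier_share_pct = 100.0 * MULTIPLIER_AREA_MM2 / DSP_AREA_MM2

print(f"{overhead_ratio} multipliers fit in one DSP footprint")
print(f"The multiplier is about {multiplier_share_pct:.1f}% of the DSP area")
```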
11
Parallelism is the answer, but

Not by putting Von Neumann processors in parallel
and programming with a sequential language
Attempts to do this have failed over and over
again
The parallel computer compiler problem is very
difficult
Not by trying to capture parallelism at the
instruction level
Superscalar, VLIW, etc. are very inefficient
Hardware can't figure out the parallelism from a
sequential language either
The problem is the initial sequential description
(e.g. C) which is poorly matched to highly
parallel applications

12
What is really happening
Starting with a parallel algorithmic description
Re-entering it using a sequential description
Then try to rediscover the parallelism

for (i = 0; i < num; i++) {
    a = a * c[i];
    b[i] = sin(a * pi) + cos(a * pi);
    outfil = b[i] * indata;
}

We take this path so that we can use an
architecture that is orders of magnitude less
efficient in energy and area ??????
13
What can a fully parallel CMOS solution
potentially do?

In .25 micron, a multiplier requires .05 mm2 and
7 pJ per operation at 1 V. Adders and registers
are about 10 times smaller and 10 times lower
energy
Let's implement a 50 mm2, .25 micron chip using
adders, registers and multipliers
We can have 2000 adders/registers and 200
multipliers in less than 1/2 of the chip; also
assume 1/3 of power goes into clocks
25 MHz clock (1 volt) gives 50 Gops at 100 mW
500 MOPS/mW and 1000 MOPS/mm2
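
The estimate on this slide can be re-derived step by step. The sketch below uses only the figures quoted on the slide, plus one labeled assumption: every adder/register and multiplier performs one operation per clock cycle. It lands slightly above the slide's rounded numbers (about 55 Gops and 105 mW rather than 50 Gops and 100 mW).

```python
# Figures quoted on the slide (.25 micron, 1 V).
MULT_AREA_MM2 = 0.05
MULT_ENERGY_PJ = 7.0
ADDER_AREA_MM2 = MULT_AREA_MM2 / 10    # "about 10 times smaller"
ADDER_ENERGY_PJ = MULT_ENERGY_PJ / 10  # "10 times lower energy"
N_ADDERS = 2000
N_MULTS = 200
CHIP_AREA_MM2 = 50.0
CLOCK_HZ = 25e6
CLOCK_POWER_FRACTION = 1 / 3           # share of total power spent in clocks


def estimate():
    # Area taken by the arithmetic units alone.
    datapath_area_mm2 = N_ADDERS * ADDER_AREA_MM2 + N_MULTS * MULT_AREA_MM2
    # Assumption: every unit completes one operation per cycle.
    gops = (N_ADDERS + N_MULTS) * CLOCK_HZ / 1e9
    # Energy per cycle in pJ, converted to datapath power in mW.
    energy_per_cycle_pj = N_ADDERS * ADDER_ENERGY_PJ + N_MULTS * MULT_ENERGY_PJ
    datapath_power_mw = energy_per_cycle_pj * 1e-12 * CLOCK_HZ * 1e3
    # Clocks take 1/3 of the total, so the datapath is the remaining 2/3.
    total_power_mw = datapath_power_mw / (1 - CLOCK_POWER_FRACTION)
    return {
        "datapath_area_mm2": datapath_area_mm2,
        "gops": gops,
        "total_power_mw": total_power_mw,
        "mops_per_mw": gops * 1e3 / total_power_mw,
        "mops_per_mm2": gops * 1e3 / CHIP_AREA_MM2,
    }


for name, value in estimate().items():
    print(f"{name}: {value:.1f}")
```

The datapath area comes out at about 20 mm2, which is consistent with "less than 1/2 of the chip".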

14
Start with a parallel description of the
algorithm
15
Then directly map into hardware
16
Results in fully parallel solutions
(numbers taken from vendor-published
benchmarks) Orders of magnitude lower efficiency
even for an optimized processor architecture
17
Reasons software solutions seem attractive

(1) Believed to reduce time-to-system-implementation
(2) Provides flexibility
(3) Locks the customers into an architecture they
can't change
(4) Difficulty in getting dedicated SOC chips
designed
Are these good reasons???

18
(1) Believed to reduce time-to-system
implementation

Software decreases time to get first prototype,
but time to fully verified system is much longer
(hardware is often ready but software still needs
to be done)
Limitations of the software prototype often set the
ultimate limit of the system performance
Software solutions can be shipped with bugs, not
a real option for SOC

19
(2) Need flexibility

Software is not always flexible
Can be hard to verify
Flexibility does not imply software
programmability
Domain specific design can have multiple modules,
coefficients and local state control (use some of
the factor of 100 in efficiency) to address a
range of applications
Reconfiguration of interconnect can achieve
flexibility with high levels of efficiency

20
Flexibility without software
Energy per Transform vs. FFT size
Transforms per Second per mm2 vs. FFT size
All results are scaled to 0.18 µm
21
Reasons software solutions seem attractive

(1) Believed to reduce time-to-system
implementation
(2) Provides flexibility
(3) Locks the customers into an architecture they
can't change
(4) Difficulty in getting dedicated SOC chips
designed

22
Standard DSP-ASIC Design Flow
Problems

Three translations of design data
Requirements for re-verification at each stage
Uncontrolled looping when pipeline stalls

Prohibitively Long Design Time for Direct Mapped
Architectures
23
Direct Mapping Design Flow
[Flow diagram labels: Algorithm/System, Simulation, Back-End, Front-End, Floorplan, RTL Libraries, Automated Flow, Mask Layout, Performance Estimates]

Encourages iterations of layout
Controls looping
Reduces the flow to a single phase
Depends on fast automation

24
(an aside) Déjà vu???

An automated style of design with parameterized
modules processed through foundries is just the
reincarnation of good ole Silicon Compilation of
>10 years ago
What happened?
A decline of research into design methodologies
A single dominant flow has resulted - the
Verilog-Synopsys-Standard Cell
Lack of tool flows to support alternative styles
of design
Research community lost access to technology and
moved to highly sub-optimal processor and FPGA
solutions

25
Capturing Design Decisions

Categories
Function - basic input-output behavior
Signal - physical signals and types
Circuit - transistors
Floorplan - physical positions

How to get layout and performance estimates in a
day?
26
Simplified View of the Flow

New Software
Generation of netlists from a dataflow graph
Merging of floorplan from last iteration
Automatic routing and performance analysis
Automation of flow as a dependency graph (UNIX
MAKE program)
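
The idea of automating the flow as a dependency graph, in the manner of make, can be sketched as follows. The step names are hypothetical stand-ins for the stages listed above, and the standard-library graphlib module is used here in place of an actual MAKE program.

```python
from graphlib import TopologicalSorter

# target -> prerequisites, like a Makefile rule (hypothetical step names).
FLOW = {
    "netlist": {"dataflow_graph"},
    "floorplan": {"netlist", "previous_floorplan"},
    "routed_layout": {"floorplan"},
    "performance_report": {"routed_layout"},
}


def build_order(flow):
    """Return every node with prerequisites listed before their dependents."""
    return list(TopologicalSorter(flow).static_order())


def stale_targets(flow, changed):
    """Return the targets that must be rebuilt when the `changed` inputs change."""
    stale = set(changed)
    for node in build_order(flow):
        # A target is stale as soon as any of its prerequisites is stale.
        if flow.get(node, set()) & stale:
            stale.add(node)
    return stale - set(changed)


print(build_order(FLOW))
print(sorted(stale_targets(FLOW, {"previous_floorplan"})))
```

In this sketch, touching only the previous floorplan leaves the netlist untouched and reruns the downstream steps, which is the behavior that makes fast layout iterations practical.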

27
Why Simulink?

Simulink is an easy sell to algorithm developers
Closely integrated with popular system design
tool Matlab
Successfully models digital and analog circuits

28
Modeling Datapath Logic

Discrete-Time (cycle accurate)
Fixed-Point Types (bit true)
Completely specify function and signal decisions
No need for RTL

Multiply / Accumulate
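
What "cycle accurate" and "bit true" mean for a multiply/accumulate block can be sketched in plain Python. This is not the Simulink model from the slides: the accumulator width and the choice of wrap-around (rather than saturating) overflow are assumptions made for illustration.

```python
def wrap_signed(value, bits):
    """Wrap an integer into a two's-complement word of the given width."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


class MultiplyAccumulate:
    """One call to step() models one clock cycle (cycle accurate)."""

    def __init__(self, acc_bits=32):
        self.acc_bits = acc_bits  # assumed accumulator word length
        self.acc = 0              # accumulator register state

    def step(self, a, b, reset=False):
        """Advance one cycle and return the new accumulator register value."""
        base = 0 if reset else self.acc
        # Bit true: the result is wrapped exactly as a fixed-width register would.
        self.acc = wrap_signed(base + a * b, self.acc_bits)
        return self.acc


mac = MultiplyAccumulate(acc_bits=8)
print(mac.step(10, 10))             # fits in 8 bits
print(mac.step(10, 10))             # overflows and wraps around
print(mac.step(3, 4, reset=True))   # reset restarts the accumulation
```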
29
Modeling Control Logic

Extended finite state-machine editor
Co-simulation with dataflow graph
New Software: Stateflow-VHDL translator
No need for RTL

Address Generator / MAC Reset
30
Specifying Circuit Decisions
Black Box
RTL Code or Data-path Generator Code or Custom Module
Stateflow-VHDL translator
Time-Multiplexed FIR Filter

Macro choices embedded in dataflow graph
Cross-check simulations required

31
Hierarchy Hardened Progressively

Macro characterization saved for fast estimates
Each level of hierarchy becomes a new hard macro
Higher levels of hierarchy are adjusted
When top level of hierarchy is hardened, the
design is done

32
Capturing Floorplan Decisions

Commercial physical design tools used
Instance names in floorplan match dataflow graph
Placements merged on each iteration
Manhattan distance can be used for parasitic
estimates
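
Using Manhattan distance for parasitic estimates can be sketched as below. The capacitance per micron is a placeholder value, not a figure from the slides, and treating the wire as a single lumped capacitance is a simplification.

```python
def manhattan_distance(p, q):
    """Rectilinear distance between two (x, y) placements, in the same units."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def estimate_wire_capacitance_ff(p, q, cap_per_um_ff=0.2):
    """Lumped wire capacitance in fF for a route between two placements in um.

    cap_per_um_ff is a hypothetical technology constant; a real value would
    come from the process data.
    """
    return manhattan_distance(p, q) * cap_per_um_ff


# Hypothetical placements of a driver and a receiver, in microns.
driver = (0, 0)
receiver = (300, 400)
print(manhattan_distance(driver, receiver))
print(estimate_wire_capacitance_ff(driver, receiver, cap_per_um_ff=0.5))
```

Because the estimate needs only instance positions, it can be computed from the floorplan before any routing has been done.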

33
Reduced Impact of Interconnect

0.18 µm

Long wires can be modeled as lumped capacitances
34
Race-Immune Clock Tree Synthesis

Race margin: 580 ps
0.18 µm
VDD = 1 V

Demonstrated on a 600k transistor design
35
Example 1: Macro Hardening
Most time/disk space spent on extraction and
power simulation
36
Example 2: Test Chip

300k transistors
0.25 µm
1.0 V
25 MHz
6.8 mm2
14 mW
2 phase clock
3 layers of P&R hierarchy

Parallel Pipelined FIR Filter (8X decimation
filter for 12-bit 200 MHz ΣΔ)
37
TDMA Baseband Receiver

600k transistors
0.18 µm
1.0 V
25 MHz
1.1 mm2
21 mW
single phase clock
5 clock domains
2 layers of P&R hierarchy

carrier detection
frequency estimation
rotate correlate
control
38
Conclusions

Direct-Mapped hardware is the most efficient use
of silicon
Direct-Mapped hardware can be easier to design
and verify than embedded hardware/software
systems
Don't translate design data, refine it
Design with dataflow graphs, not sequential code
Design flow automation speeds up design space
exploration
