Title: L05-1
1- Architectural Exploration
- Area-Power tradeoff in 802.11a transmitter design
- Arvind
- Computer Science & Artificial Intelligence Lab
- Massachusetts Institute of Technology
2- This lecture has two purposes
- Illustrate how the area-power tradeoff can be studied at a high level for a realistic design
  - Example: 802.11a transmitter
- Illustrate some features of BSV
  - Static elaboration
  - Combinational circuits
  - Simple synchronous pipelines
  - Valid bits as the Maybe type in BSV
No prior understanding of 802.11a is necessary to follow this lecture
3- Bluespec Two-Level Compilation
Bluespec (Objects, Types, Higher-order functions)
Level 1 compilation (Lennart Augustsson @Sandburst, 2000-2002)
- Type checking
- Massive partial evaluation and static elaboration
Rules and Actions (Term Rewriting System)
Level 2 synthesis (James Hoe & Arvind @MIT, 1997-2000)
- Rule conflict analysis
- Rule scheduling
Object code (Verilog/C)
4- Static Elaboration
- At compile time
  - Inline function calls and datatypes
  - Instantiate modules with specific parameters
  - Resolve polymorphism/overloading
5- 802.11a Transmitter Overview
headers
Must produce one OFDM symbol every 4 µsec
24 Uncoded bits
data
6- Preliminary results [MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind
Design Block      Lines of Code (BSV)   Relative Area
Controller         49                    0%
Scrambler          40                    0%
Conv. Encoder     113                    0%
Interleaver        76                    1%
Mapper            112                   11%
IFFT               95                   85%
Cyc. Extender      23                    3%
Complex arithmetic libraries constitute another 200 lines of code
7- Combinational IFFT
All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...
8- 4-way Butterfly Node
function Vector#(4,Complex) bfly4
        (Vector#(4,Complex) t, Vector#(4,Complex) k);
- BSV has a very strong notion of types
  - Every expression has a type. Either it is declared by the user or automatically deduced by the compiler
  - The compiler verifies that the type declarations are compatible
9- BSV code: 4-way Butterfly
function Vector#(4,Complex) bfly4
        (Vector#(4,Complex) t, Vector#(4,Complex) k);
   Vector#(4,Complex) m = newVector(),
                      y = newVector(),
                      z = newVector();
   m[0] = k[0] * t[0]; m[1] = k[1] * t[1];
   m[2] = k[2] * t[2]; m[3] = k[3] * t[3];
   y[0] = m[0] + m[2]; y[1] = m[0] - m[2];
   y[2] = m[1] + m[3]; y[3] = i*(m[1] - m[3]);
   z[0] = y[0] + y[2]; z[1] = y[1] + y[3];
   z[2] = y[0] - y[2]; z[3] = y[1] - y[3];
   return(z);
endfunction
Polymorphic code: works on any type of numbers for which *, + and - have been defined
Note: Vector does not mean storage
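The butterfly above can be checked with a small software model. The following is a minimal Python sketch, not the BSV design: it uses Python complex numbers instead of fixed-point values, and the helper idft4 and the sample data are assumptions introduced only for the check. With all twiddles set to 1, the three layers (multiply, add/subtract, add/subtract) should reproduce an unnormalized 4-point inverse DFT.

```python
import cmath

def bfly4(t, k):
    """Radix-4 butterfly: one multiply layer, then two add/subtract layers."""
    m = [k[j] * t[j] for j in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    z = [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]
    return z

def idft4(x):
    """Unnormalized 4-point inverse DFT, used only as a reference (hypothetical helper)."""
    return [sum(x[j] * cmath.exp(2j * cmath.pi * j * n / 4) for j in range(4))
            for n in range(4)]

data = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 2 + 2j]   # arbitrary sample values
unit_twiddles = [1, 1, 1, 1]                     # assumption: no rotation applied
result = bfly4(unit_twiddles, data)
reference = idft4(data)
max_error = max(abs(a - b) for a, b in zip(result, reference))
print(max_error)
```

Because bfly4 only uses *, + and -, the same Python function also accepts other numeric types, which loosely mirrors the polymorphism remark on the slide.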
10- Combinational IFFT
stage_f function
repeat it three times
11- BSV Code: Combinational IFFT
function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
   //Declare vectors
   SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector);
   stage_data[0] = in_data;
   for (Integer stage = 0; stage < 3; stage = stage + 1)
      stage_data[stage+1] = stage_f(stage,stage_data[stage]);
   return(stage_data[3]);
endfunction
The for loop is unfolded and stage_f is inlined during static elaboration
Note: no notion of loops or procedures during execution
12- BSV Code: Combinational IFFT (Unfolded)
function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
   //Declare vectors
   SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector);
   stage_data[0] = in_data;
   for (Integer stage = 0; stage < 3; stage = stage + 1)
      stage_data[stage+1] = stage_f(stage,stage_data[stage]);
   return(stage_data[3]);
endfunction
The for loop unfolds into:
   stage_data[1] = stage_f(0,stage_data[0]);
   stage_data[2] = stage_f(1,stage_data[1]);
   stage_data[3] = stage_f(2,stage_data[2]);
stage_f can be inlined now; it could also have been inlined before loop unfolding. Does the order matter?
13- Bluespec Code for stage_f
function SVector#(64, Complex) stage_f
        (Bit#(2) stage, SVector#(64, Complex) stage_in);
begin
   for (Integer i = 0; i < 16; i = i + 1)
   begin
      Integer idx = i * 4;
      let twid = getTwiddle(stage, fromInteger(i));
      let y = bfly4(twid, stage_in[idx:idx+3]);
      stage_temp[idx]   = y[0]; stage_temp[idx+1] = y[1];
      stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
   end
   //Permutation
   for (Integer i = 0; i < 64; i = i + 1)
      stage_out[i] = stage_temp[permute[i]];
end
   return(stage_out);
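The structure of stage_f and the loop unfolding from slides 11 and 12 can be mirrored in software. The sketch below is a Python model under stated assumptions: get_twiddle and PERMUTE are placeholders (the real twiddle table and permutation are not given in these slides), so the output is not a true IFFT. It only illustrates that the loop and its hand-unfolded form compute the same thing.

```python
import cmath

def bfly4(t, k):
    """Radix-4 butterfly with the same dataflow as the slide."""
    m = [k[j] * t[j] for j in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]

def get_twiddle(stage, i):
    """Placeholder twiddles (assumption): NOT the real 802.11a table."""
    w = cmath.exp(2j * cmath.pi * i * (stage + 1) / 64)
    return [1, w, w * w, w * w * w]

# Placeholder stride permutation (assumption): any bijection on 0..63 would do here.
PERMUTE = [(i % 16) * 4 + i // 16 for i in range(64)]

def stage_f(stage, stage_in):
    """One IFFT stage: 16 butterflies over groups of 4, then a permutation."""
    stage_temp = [0] * 64
    for i in range(16):
        idx = 4 * i
        stage_temp[idx:idx + 4] = bfly4(get_twiddle(stage, i), stage_in[idx:idx + 4])
    return [stage_temp[PERMUTE[i]] for i in range(64)]

def ifft_loop(in_data):
    """The for-loop form from slide 11."""
    stage_data = [in_data, None, None, None]
    for stage in range(3):
        stage_data[stage + 1] = stage_f(stage, stage_data[stage])
    return stage_data[3]

def ifft_unfolded(in_data):
    """The hand-unfolded form from slide 12."""
    s1 = stage_f(0, in_data)
    s2 = stage_f(1, s1)
    s3 = stage_f(2, s2)
    return s3

sample = [complex(j, -j) for j in range(64)]   # arbitrary input
same = ifft_loop(sample) == ifft_unfolded(sample)
print(same)
```

In BSV the unfolding happens at compile time during static elaboration; in this Python model the loop simply runs at execution time, so the comparison is about results, not about the generated circuit.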
14- Architectural Exploration
15- Design Alternatives
- Reuse a block over multiple cycles
we expect:
  Throughput to decrease (less parallelism)
  Area to decrease (reusing a block)
The clock needs to run faster for the same throughput => hyper-linear increase in energy
Energy/unit work?
(more on power issues later)
16- Combinational IFFT: Opportunity for reuse
Reuse the same circuit three times
17- Circular pipeline: Reusing the Pipeline Stage
64, 4-way Muxes
Stage Counter
16 Bfly4s can be shared but not the three permutations. Hence the need for muxes
18- Superfolded circular pipeline: Just one Bfly-4 node!
19- Algorithmic Improvements
1. All three permutations can be made identical => more saving in area in the folded case
2. One multiplication can be removed from Bfly-4
20- Area improvements because of change in Algorithm
21- Which design consumes the least energy to transmit a symbol?
- Can we quickly code up all the alternatives?
  - single source with parameters?
Not practical in traditional hardware description languages like Verilog/VHDL
22- Pipelining a block
(C = combinational, P = pipelined, FP = folded pipeline)
Clock: C < P ≈ FP
Area: FP < C < P
Throughput: FP < C < P
23- Synchronous pipeline
rule sync-pipeline (True);
   inQ.deq();
   sReg1 <= f1(inQ.first());
   sReg2 <= f2(sReg1);
   outQ.enq(f3(sReg2));
endrule
This rule can fire only if
- inQ has an element
- outQ has space
Atomicity: Either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated
This is real IFFT code; just replace f1, f2 and f3 with the stage_f code
24- Stage functions f1, f2 and f3
function f1(x); return (stage_f(1,x)); endfunction
function f2(x); return (stage_f(2,x)); endfunction
function f3(x); return (stage_f(3,x)); endfunction
The stage_f function is given on slide 13
25- Problem: What about pipeline bubbles?
Red and Green tokens must move even if there is nothing in the inQ!
rule sync-pipeline (True);
   inQ.deq();
   sReg1 <= f1(inQ.first());
   sReg2 <= f2(sReg1);
   outQ.enq(f3(sReg2));
endrule
Also, if there is no token in sReg2 then nothing should be enqueued in the outQ
Valid bits or the Maybe type
Modify the rule to deal with these conditions
26- The Maybe type: data in the pipeline
typedef union tagged { void Invalid; data_T Valid; } Maybe#(type data_T);
Registers contain Maybe type values
rule sync-pipeline (True);
   if (inQ.notEmpty())
      begin sReg1 <= tagged Valid (f1(inQ.first())); inQ.deq(); end
   else sReg1 <= tagged Invalid;
   case (sReg1) matches
      tagged Valid .sx1: sReg2 <= tagged Valid (f2(sx1));
      tagged Invalid:    sReg2 <= tagged Invalid;
   endcase
   case (sReg2) matches
      tagged Valid .sx2: outQ.enq(f3(sx2));
   endcase
endrule
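The behaviour of the rule with valid bits can be traced cycle by cycle in software. This is a minimal Python sketch, not BSV: None stands in for Invalid, the stage functions f1, f2 and f3 are simple arithmetic placeholders rather than stage_f, and outQ is assumed to always have space. Each call to step models one firing of the rule, where all reads see the old register values and all writes commit together.

```python
from collections import deque

# Placeholder stage functions (assumption): stand-ins for the real stage_f stages.
def f1(x): return x + 1
def f2(x): return x * 2
def f3(x): return x - 3

def step(in_q, out_q, regs):
    """One firing of the rule. None models Invalid; reads use the old state."""
    s_reg1, s_reg2 = regs
    new_s_reg1 = f1(in_q.popleft()) if in_q else None
    new_s_reg2 = f2(s_reg1) if s_reg1 is not None else None
    if s_reg2 is not None:
        out_q.append(f3(s_reg2))
    return (new_s_reg1, new_s_reg2)

in_q, out_q = deque(), deque()
regs = (None, None)                 # both pipeline registers start Invalid
arrivals = {0: 10, 1: 20, 3: 30}    # cycle -> input token; cycle 2 is a bubble
for cycle in range(7):
    if cycle in arrivals:
        in_q.append(arrivals[cycle])
    regs = step(in_q, out_q, regs)

print(list(out_q))
```

The bubble at cycle 2 travels through sReg1 and sReg2 as an Invalid value, so tokens already in the pipeline keep moving and nothing is enqueued for the empty slot.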
27- Folded pipeline
The same code will work for superfolded pipelines by changing n and stage function f
rule folded-pipeline (True);
   if (stage==1)
      begin sxIn = inQ.first(); inQ.deq(); end
   else sxIn = sReg;
   sxOut = f(stage,sxIn);
   if (stage==n) outQ.enq(sxOut);
   else sReg <= sxOut;
   stage <= (stage==n)? 1 : stage+1;
endrule
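The folded rule can also be modelled in software to see how one block is reused over n cycles. The sketch below is a Python model under stated assumptions: f is a placeholder that records which stages ran, outQ is unbounded, and the model simply stalls when stage is 1 and inQ is empty (an assumption about how the rule's guard behaves).

```python
from collections import deque

def f(stage, x):
    """Placeholder stage function (assumption): appends the stage number as a digit."""
    return x * 10 + stage

def folded_step(in_q, out_q, state, n):
    """One firing of the folded rule; state is (stage, sReg)."""
    stage, s_reg = state
    if stage == 1:
        if not in_q:
            return state            # nothing to dequeue: the rule does not fire
        sx_in = in_q.popleft()
    else:
        sx_in = s_reg
    sx_out = f(stage, sx_in)
    if stage == n:
        out_q.append(sx_out)
    else:
        s_reg = sx_out
    next_stage = 1 if stage == n else stage + 1
    return (next_stage, s_reg)

n = 3
in_q, out_q = deque([0, 5]), deque()
state = (1, None)
for _ in range(2 * n):              # n cycles per input token
    state = folded_step(in_q, out_q, state, n)

print(list(out_q))
```

Each token occupies the single stage for n consecutive cycles, which is why the folded design needs a faster clock to sustain the same symbol rate.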
28- 802.11a Transmitter Synthesis results (Only the IFFT block is changing)
IFFT Design                 Area (mm²)   Throughput Latency (CLKs/sym)   Min. Freq Required
Pipelined                   5.25          4                              1.0 MHz
Combinational               4.91          4                              1.0 MHz
Folded (16 Bfly-4s)         3.97          4                              1.0 MHz
Super-Folded (8 Bfly-4s)    3.69          6                              1.5 MHz
SF (4 Bfly-4s)              2.45         12                              3.0 MHz
SF (2 Bfly-4s)              1.84         24                              6.0 MHz
SF (1 Bfly-4)               1.52         48                              12 MHz
All these designs were done in less than 24 hours!
TSMC 0.18 micron; numbers reported are before place and route.
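The "Min. Freq Required" column follows from the throughput requirement. As a short worked check in Python, assuming the 4 µs OFDM symbol period of 802.11a: a design that needs N clocks per symbol must run at least at N / 4 µs. The dictionary below copies the clocks-per-symbol column from the table; the names are illustrative only.

```python
SYMBOL_PERIOD_US = 4.0   # one OFDM symbol every 4 microseconds

# Clocks per symbol, copied from the table above.
clocks_per_symbol = {
    "Pipelined": 4,
    "Combinational": 4,
    "Folded (16 Bfly-4s)": 4,
    "Super-Folded (8 Bfly-4s)": 6,
    "SF (4 Bfly-4s)": 12,
    "SF (2 Bfly-4s)": 24,
    "SF (1 Bfly-4)": 48,
}

def min_freq_mhz(clks):
    """Clocks per microsecond is the same number as the frequency in MHz."""
    return clks / SYMBOL_PERIOD_US

min_freq = {name: min_freq_mhz(c) for name, c in clocks_per_symbol.items()}
for name, freq in min_freq.items():
    print(f"{name}: {freq} MHz")
```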
29- Why are the areas so similar?
- Folding should have given a 3x improvement in IFFT area
- BUT a constant twiddle allows low-level optimization on a Bfly-4 block
  - a 2.5x area reduction!
30- Summary
- It is essential to do architectural exploration for better (area, power, performance, ...) designs.
- It is possible to do so with new design tools and methodologies, e.g., Bluespec
- Better and faster tools for estimating area, timing and power would dramatically increase our capability to do architectural exploration.
31- Bluespec Learnings
- How to write highly parameterized combinational codes
- How to write rules for simple synchronous pipelines
- Effect of dynamic vs static values on generated circuits
- Using Maybe types to express valid/invalid data
Thanks
32- Backup slides
33- Function f for the folded pipeline is the same stage_f function but ...
function SVector#(64, Complex) stage_f
        (Bit#(2) stage, SVector#(64, Complex) stage_in);
begin
   for (Integer i = 0; i < 16; i = i + 1)
   begin
      Integer idx = i * 4;
      let twid = getTwiddle(stage, fromInteger(i));
      let y = bfly4(twid, stage_in[idx:idx+3]);
      stage_temp[idx]   = y[0]; stage_temp[idx+1] = y[1];
      stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
   end
   //Permutation
   for (Integer i = 0; i < 64; i = i + 1)
      stage_out[i] = stage_temp[permute[i]];
end
   return(stage_out);
... will cause a mux to be generated
34- Folded pipeline stage function f
35- Function f for the Superfolded pipeline (One Bfly-4 case)
- f will be invoked for 48 dynamic values of stage
  - each invocation will modify 4 numbers in sReg
  - after 16 invocations a permutation would be done on the whole sReg
36- Code for the Superfolded pipeline stage function
function SVector#(64, Complex) f
        (Bit#(6) stage, SVector#(64, Complex) stage_in);
begin
   let idx = stage mod 16;
   let twid = getTwiddle(stage div 16, idx);
   let y = bfly4(twid, stage_in[idx:idx+3]);
   stage_temp = stage_in;
   stage_temp[idx]   = y[0];
   stage_temp[idx+1] = y[1];
   stage_temp[idx+2] = y[2];
   stage_temp[idx+3] = y[3];
   for (Integer i = 0; i < 64; i = i + 1)
      stage_out[i] = stage_temp[permute[i]];
end
   return((idx == 15) ? stage_out : stage_temp);
One Bfly-4 case
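A software model can be used to check that 48 invocations of the superfolded function do the same work as three full stages. The Python sketch below makes several assumptions: get_twiddle and PERMUTE are placeholders (not the real tables), and the butterfly index is multiplied by 4 to address elements, as stage_f on slide 13 does. It is an illustration of the schedule, not the BSV code itself.

```python
import cmath

def bfly4(t, k):
    """Radix-4 butterfly with the same dataflow as slide 9."""
    m = [k[j] * t[j] for j in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]

def get_twiddle(stage, i):
    """Placeholder twiddles (assumption): NOT the real 802.11a table."""
    w = cmath.exp(2j * cmath.pi * i * (stage + 1) / 64)
    return [1, w, w * w, w * w * w]

# Placeholder permutation (assumption): a stride permutation on 0..63.
PERMUTE = [(i % 16) * 4 + i // 16 for i in range(64)]

def stage_f(stage, stage_in):
    """Full stage: 16 butterflies, then the permutation."""
    stage_temp = [0] * 64
    for i in range(16):
        idx = 4 * i
        stage_temp[idx:idx + 4] = bfly4(get_twiddle(stage, i), stage_in[idx:idx + 4])
    return [stage_temp[PERMUTE[i]] for i in range(64)]

def f_superfolded(step, s_reg):
    """One invocation: a single butterfly; permute only after the 16th of a stage."""
    i = step % 16                    # which butterfly within the stage
    idx = 4 * i                      # first of the 4 elements it touches
    twid = get_twiddle(step // 16, i)
    stage_temp = list(s_reg)
    stage_temp[idx:idx + 4] = bfly4(twid, s_reg[idx:idx + 4])
    stage_out = [stage_temp[PERMUTE[j]] for j in range(64)]
    return stage_out if i == 15 else stage_temp

sample = [complex(j, -j) for j in range(64)]   # arbitrary input
s_reg = sample
for step in range(48):                         # 3 stages x 16 butterflies
    s_reg = f_superfolded(step, s_reg)

reference = stage_f(2, stage_f(1, stage_f(0, sample)))
matches = s_reg == reference
print(matches)
```

In hardware the step value is dynamic, so selecting the 4 elements and the twiddle would require muxes, whereas in stage_f the index i is static and is resolved during elaboration.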
37- Experimental Results
- Nirav Dave, Mike Pellauer, Steve Gerding, Arvind
- MEMOCODE 2006
38- Expressing these designs in Bluespec was easy
Combinational
Pipelined
Folded (16 Bfly-4s)
Super-Folded (8 Bfly-4s)
Super-Folded (4 Bfly-4s)
Super-Folded (2 Bfly-4s)
Super-Folded (1 Bfly-4)
- All these designs were done in less than one day!
- Designers were experts in Bluespec
- Area and power estimates?
39- Bluespec Tool flow
Bluespec SystemVerilog source
Bluespec Compiler
Verilog 95 RTL
Verilog sim
RTL synthesis
gates
FPGA
40- 802.11a Transmitter Synthesis results (Only the IFFT block is changing)
IFFT Design                 Area (mm²)   Symbol Latency (CLKs)   Throughput Latency (CLKs/sym)   Min. Freq Required   Average Power (mW)
Pipelined                   5.25         12                       4                              1.0 MHz              4.92
Combinational               4.91         10                       4                              1.0 MHz              3.99
Folded (16 Bfly-4s)         3.97         12                       4                              1.0 MHz              7.27
Super-Folded (8 Bfly-4s)    3.69         15                       6                              1.5 MHz              10.9
SF (4 Bfly-4s)              2.45         21                      12                              3.0 MHz              14.4
SF (2 Bfly-4s)              1.84         33                      24                              6.0 MHz              21.1
SF (1 Bfly-4)               1.52         57                      48                              12 MHz               34.6
TSMC 0.18 micron; numbers reported are before place and route (Design Compiler). Power numbers are from Sequence PowerTheater.
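The question from slide 21 (which design consumes the least energy per symbol) can be estimated from the power column. The following Python sketch assumes that each average power figure applies while the design sustains one symbol every 4 µs at its minimum required frequency; under that assumption, energy per symbol is simply power times the symbol period (mW x µs = nJ). The numbers are copied from the table above.

```python
SYMBOL_PERIOD_US = 4.0   # one OFDM symbol every 4 microseconds

# (area in mm^2, average power in mW), copied from the table above.
designs = {
    "Pipelined": (5.25, 4.92),
    "Combinational": (4.91, 3.99),
    "Folded (16 Bfly-4s)": (3.97, 7.27),
    "Super-Folded (8 Bfly-4s)": (3.69, 10.9),
    "SF (4 Bfly-4s)": (2.45, 14.4),
    "SF (2 Bfly-4s)": (1.84, 21.1),
    "SF (1 Bfly-4)": (1.52, 34.6),
}

def energy_per_symbol_nj(power_mw):
    """mW times microseconds gives nanojoules."""
    return power_mw * SYMBOL_PERIOD_US

energy = {name: energy_per_symbol_nj(p) for name, (_, p) in designs.items()}
lowest_energy = min(energy, key=energy.get)
smallest_area = min(designs, key=lambda name: designs[name][0])

for name in designs:
    print(f"{name}: {designs[name][0]} mm^2, {energy[name]:.2f} nJ/symbol")
print("lowest energy:", lowest_energy)
print("smallest area:", smallest_area)
```

Under this assumption the smallest design is also the most energy-hungry per symbol, which is the area-power tradeoff the lecture sets out to illustrate.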
41- Power can be reduced further
- Right now all blocks in the transmitter run on the same clock
  => if we run IFFT faster then all other blocks also run faster
- Bluespec has facilities for Multiple Clock Domains and the design can be easily modified to run the earlier blocks at a lower clock rate