Title: Post Placement C-Slow Retiming for Xilinx Virtex FPGAs
1. Post Placement C-Slow Retiming for Xilinx Virtex FPGAs
- UC Berkeley Reconfigurable Architectures, Systems, and Software (BRASS) Group
- ACM Symposium on Field Programmable Gate Arrays (FPGA)
- February 2x, 2003
- http://www.cs.berkeley.edu/nweaver/cslow.html
2. Outline
- Automatically Double Your Throughput
- You paid for those registers, here's how to use them
- Retiming and C-slow Retiming
- The transformation
- C-slow Retiming and the Virtex FPGA
- The target
- Retiming 3 Benchmarks
- The tests
3. Retiming and Repipelining
- Retiming
- Automatically moving registers to minimize the clock period
- Benefits limited by the number of registers
- Algorithm developed by Leiserson et al.
- Repipelining
- Adding registers to the front or back
- Let retiming then move them around
- But What About Feedback Loops?
- Retiming and repipelining are of limited benefit when you have feedback loops
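A small sketch of why feedback loops limit retiming. It assumes the usual graph formulation: an edge from u to v carries w registers, and a retiming assigns each node an integer lag r, giving a new weight of w + r(v) - r(u). The function names, node names, and lag values are illustrative only. Around any loop the lags cancel, so retiming can move the registers in a loop but cannot add to them.

```python
def retime(edges, lag):
    """Apply a retiming: edge (u, v) with w registers gets w + lag[v] - lag[u]."""
    retimed = {}
    for (u, v), w in edges.items():
        new_w = w + lag.get(v, 0) - lag.get(u, 0)
        if new_w < 0:
            raise ValueError(f"illegal retiming: edge {u}->{v} would hold {new_w} registers")
        retimed[(u, v)] = new_w
    return retimed


def registers_on_loop(edges, loop):
    """Count the registers along a loop given as a list of nodes."""
    pairs = zip(loop, loop[1:] + loop[:1])
    return sum(edges[pair] for pair in pairs)


# Hypothetical three-node feedback loop with a single register on a->b.
edges = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 0}
loop = ["a", "b", "c"]

# A lag of -1 on b moves the register from b's input to b's output.
moved = retime(edges, {"b": -1})
print(moved)
print(registers_on_loop(edges, loop), registers_on_loop(moved, loop))
```

The register changes position, but the loop still holds exactly one register before and after.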
4. C-Slow Retiming
- Replace every register with a sequence of C registers
- With more registers, retiming can break the design into finer pieces
- Again proposed by Leiserson et al., to meet systolic slowdown
- Semantic altering transformation
- But resulting semantics are predictable and useful
- Ideal: C-slow in synthesis, retime after placement
- Our prototype: C-slow and retime after placement
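A rough sketch of what C-slowing buys on a feedback loop, using the same edge-weight view of registers. The node delays, the value of C, and the function names are assumptions chosen for illustration. The bound is idealized: it ignores register overhead and assumes retiming can spread the registers evenly.

```python
def c_slow(edges, c):
    """Replace every register with a sequence of c registers."""
    return {edge: w * c for edge, w in edges.items()}


def loop_period_bound(node_delay, edges, loop):
    """Idealized lower bound on clock period: loop delay / registers in the loop."""
    pairs = zip(loop, loop[1:] + loop[:1])
    registers = sum(edges[pair] for pair in pairs)
    return sum(node_delay[n] for n in loop) / registers


# Hypothetical loop: three logic nodes of 3 ns each, one register in the loop.
node_delay = {"a": 3, "b": 3, "c": 3}
edges = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 0}
loop = ["a", "b", "c"]

bound_before = loop_period_bound(node_delay, edges, loop)
bound_after = loop_period_bound(node_delay, c_slow(edges, 3), loop)
print(bound_before, bound_after)
```

With one register the loop cannot run faster than its full 9 ns delay. After 3-slowing, retiming has three registers to distribute, and the idealized bound drops to 3 ns.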
5. Design Semantics After C-Slowing
- Design operates on C independent data streams
- Data streams are externally interleaved on a round-robin basis
- Semantics apply to designs with Task Level Parallelism
- Encryption
- Counter (CTR) mode works on independent blocks
- Sequence matching
- Compare sequence vs database
- C-slowing improves throughput but adds latency and registers
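The changed semantics can be illustrated with a toy simulation. This is a minimal sketch, assuming a running-sum accumulator whose single feedback register is replaced by a chain of C registers. The accumulator, stream values, and helper names are hypothetical and not taken from the benchmarks.

```python
from collections import deque


def run_accumulator(inputs, c=1):
    """Simulate an accumulator whose feedback register is a chain of c registers."""
    regs = deque([0] * c)  # register chain in the feedback loop, oldest value first
    outputs = []
    for x in inputs:
        total = regs.popleft() + x  # the adder sees the value stored c cycles ago
        regs.append(total)
        outputs.append(total)
    return outputs


def interleave(streams):
    """Round-robin interleave equal-length streams into one input sequence."""
    return [x for group in zip(*streams) for x in group]


def deinterleave(values, c):
    """Split an interleaved sequence back into c streams."""
    return [values[i::c] for i in range(c)]


stream_a = [1, 2, 3]
stream_b = [10, 20, 30]

# The 2-slowed design processes both streams, one element per cycle.
combined = run_accumulator(interleave([stream_a, stream_b]), c=2)
out_a, out_b = deinterleave(combined, 2)
print(out_a, out_b)
```

Each stream sees the same result the original design would have produced on that stream alone, even though the streams share one copy of the logic.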
6. C-slowing, Retiming, and the Virtex FPGA
- Every 4-LUT has an associated register
- Register can, almost always, be used independently of the LUT
- LUTs can act as clocked shift registers (SRL16s)
- Used in our AES hand-benchmark
- Not used in our tool
- Many designs have low register utilization
- Excess of registers available in unoptimized designs
- Retiming best performed with/after placement
- Xilinx placement operates on mapped slices
- Need net delay information for better results
7. Sketch of Tool's Operation
- Convert .ncd to .xdl after placement
- Load design into graph representation
- Replace registers with edge annotations to represent registers
- Replace every single register with C registers
- Compute costs based on delay model
- Retime
- Convert edge annotations back to instance registers
- Write out .xdl, convert to .ncd
- Route
[Figure: tool flow diagram with stages labeled Placer and Router]
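The graph steps above can be sketched roughly as follows. This is not the prototype's code. The netlist format (a list of driver/sink pairs plus a set of register names), the instance names, and the function names are all assumptions made for illustration. The sketch also assumes there are no loops made only of registers and no parallel paths between the same pair of logic nodes with different register counts.

```python
def registers_to_edge_weights(nets, registers):
    """Fold register instances into edge annotations between logic nodes.

    nets: list of (driver, sink) pairs; registers: set of register names.
    Returns {(logic_driver, logic_sink): number_of_registers_on_the_way}.
    """
    fanout = {}
    for driver, sink in nets:
        fanout.setdefault(driver, []).append(sink)

    logic_nodes = {name for net in nets for name in net} - registers
    edges = {}
    for src in sorted(logic_nodes):
        # Walk forward from each logic node, counting registers passed through.
        stack = [(sink, 0) for sink in fanout.get(src, [])]
        while stack:
            node, count = stack.pop()
            if node in registers:
                stack.extend((nxt, count + 1) for nxt in fanout.get(node, []))
            else:
                edges[(src, node)] = count
    return edges


def c_slow(edges, c):
    """Replace every single register with c registers."""
    return {edge: w * c for edge, w in edges.items()}


# Hypothetical placed netlist: lut_a -> ff_1 -> lut_b -> lut_a
nets = [("lut_a", "ff_1"), ("ff_1", "lut_b"), ("lut_b", "lut_a")]
registers = {"ff_1"}

graph = registers_to_edge_weights(nets, registers)
slowed = c_slow(graph, 2)
print(graph)
print(slowed)
```

Retiming would then adjust these edge counts, and the final step would turn each count back into that many register instances on the corresponding net.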
8. Experiment 1: How Good is the Tool?
- Tool is a simple prototype
- Manhattan distance delay estimate
- No attempt to minimize flip-flops
- Basic flip-flop allocation
- Two benchmarks: AES and Smith/Waterman
- Hand mapped
- (optionally) hand placed
- (optionally) hand C-slowed and retimed
- Our best hand AES implementation
- 1.3 Gb/s
- <800 Slices, 10 BlockRAMs
- $10 part, Spartan II-100
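For the Manhattan distance delay estimate mentioned in the tool description, a minimal sketch might look like the following. The constants are placeholders rather than Virtex timing data, and the function name and coordinate convention are assumptions. The slides do not give the prototype's actual model.

```python
def manhattan_delay_ps(src, dst, base_ps=500, per_unit_ps=100):
    """Estimate net delay from placed (x, y) locations.

    base_ps and per_unit_ps are hypothetical constants, not measured values.
    """
    distance = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return base_ps + per_unit_ps * distance


# Two hypothetical slice locations, 3 columns and 4 rows apart.
estimate = manhattan_delay_ps((0, 0), (3, 4))
print(estimate)
```

An estimate like this needs only placement coordinates, which is why it can be computed before routing.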
9. Experiment 1: AES, Automatically Placed
- Just retiming is of no benefit
- Automatic C-slowing very effective
- But could do even better
10. Experiment 1: Smith/Waterman, Automatically Placed
- Again, just retiming is of no benefit
- C-slowing highly effective
- Within 7% of hand-built implementation
11. Experiment 1: Comments
- Just retiming is of no benefit
- Both designs limited by single cycle feedback loops
- C-Slowing very effective
- Able to automatically nearly double throughput
- Hand implementations more than doubled throughput
- Reasonable numbers of additional registers
- Limitations of prototype tool
- Flip-flop allocation routines could be better
- Some AES hand benchmarks used SRL16 delay chains
- Simple is pretty good
- Relatively simplistic implementation gets reasonably close to hand-mapped performance
12. Experiment 2: Retiming LEON
- Can we automatically C-slow a large, synthesized design?
- Leon 1: A synthesized, GPLed SPARC-compatible microprocessor core [1]
- 5 stage pipeline, integer only
- Modify register file to use BlockRAMs
- BlockRAMs are used as negative edge devices
- Remove caches, I/O, etc.
- Synthesize, using Synplify with CEs disabled
- Edit EDIF to replace Sets/Resets
- Retime and C-slow with prototype tool
- Prototype tool converts BlockRAMs to positive edge
- C-slow a microprocessor core...
- Get an interleaved multithreaded architecture
[1] Leon 1, by Jiri Gaisler, http://www.gaisler.com/leonmain.html
13. Experiment 2: Results
- Retiming alone worked surprisingly well
- 2-slowing very effective
- 3-slowing hit diminishing returns
6132 LUTs for all designs
14. Experiment 2: Comments
- Retiming alone worked surprisingly well
- Tool automatically converted BlockRAMs to positive-edge clocking and rebalanced the pipeline
- 2-slowing very effective
- Effectively doubled the initial throughput
- NO slowdown in latency over initial design because retiming was effective without C-slowing
- Used many more registers, but fewer registers than LUTs
- 3-slowing hit diminishing returns
- Too many registers required, combined with poor register allocation, leads to poor performance
15. Conclusions
- C-slow retiming is very effective
- "Automatically double your throughput"
- Benefits: More throughput
- Costs: More Flip Flops, worse latency
- Post-placement retiming appropriate
- Independent Flip Flop usage critical
- Have delay model for interconnect as well as logic
- Some room for improvement
- Faster/Better implementation
- Minimize Flip Flop usage as well as delay
- Use SRL16s
- Better placement of Flip Flops
- Experience suggests more Flip Flops/LUT would be useful
16. Backup Slide: Why Not Use (Current) Synthesis Tools?
- Many synthesis tools support retiming, but with caveats
- ONLY works for synthesized items
- AES and Smith/Waterman didn't use synthesis
- Can't automatically C-slow
- Can't retime through memory blocks
- Can't accurately guesstimate interconnect delay before placement
- >½ of the delay is the interconnect
- Can't effectively scavenge unused flip-flops before placement
- Xilinx placement operates on slices, not LUTs
17. Backup Slide: Why the limitations on total speedup?
- Absolute maximum
- Interconnect + LUT + Flip-Flop
- Practical maximums
- Too many flip-flops to allocate
- Only one flip-flop per LUT available
- Flip-flop allocation poor
- Quick and dirty greedy heuristic
- Works well for mild C-slowing
- Fails with highly aggressive C-slowing
- Tool doesn't minimize flip-flops
- Critical path is defined by the single worst path
- Tool uses cheap and dirty interconnect delay model
18. (Backup Slide) Design Restrictions to Enable C-slowing
- Resets and Clock Enables
- Convert to explicit logic
- Memories
- Increase by a factor of C
- Add high bits of addr to provide round-robin access
- Every stream sees an independent memory
- Global Set/Reset
- Convert to individual resets
- Still highly restrictive
- Interleave/deinterleave IO
- Requires external logic
- No asynchronous sets/resets
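The memory restriction can be illustrated with a small model. This is a sketch under assumptions: the class name, the use of the cycle count modulo C as the stream index, and the sizes are hypothetical. When C is not a power of two, a real design would round the extra address bits up.

```python
class CSlowedMemory:
    """Memory enlarged by a factor of c, with the stream index as the high address bits."""

    def __init__(self, addr_bits, c):
        self.addr_bits = addr_bits
        self.c = c
        self.words = [0] * (c << addr_bits)  # c times the original depth

    def physical_address(self, addr, cycle):
        if not 0 <= addr < (1 << self.addr_bits):
            raise ValueError("address out of range for the original memory")
        stream = cycle % self.c  # round-robin: stream i owns cycles i, i + c, ...
        return (stream << self.addr_bits) | addr

    def write(self, addr, value, cycle):
        self.words[self.physical_address(addr, cycle)] = value

    def read(self, addr, cycle):
        return self.words[self.physical_address(addr, cycle)]


# Hypothetical 16-word memory in a 2-slowed design.
mem = CSlowedMemory(addr_bits=4, c=2)
mem.write(3, 111, cycle=0)  # stream 0 writes its address 3
mem.write(3, 222, cycle=1)  # stream 1 writes its own address 3
print(mem.read(3, cycle=2), mem.read(3, cycle=3))
```

Both streams use the same logical address, yet neither sees the other's data, which is the "independent memory" property listed above.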