Title: Design of a High-Speed Asynchronous Turbo Decoder
1. Design of a High-Speed Asynchronous Turbo Decoder
Pankaj Golani, George Dimou, Mallika Prakash and Peter A. Beerel
Asynchronous CAD/VLSI Group, Ming Hsieh Electrical Engineering Department, University of Southern California
ASYNC 2007, Berkeley, California, March 12th 2007
2. Motivation and Goal
- Mainstream acceptance of asynchronous design
  - Leverage off the ASIC standard-cell library-based design flow
  - Achieve benefits significant enough to overcome synchronous momentum
- Our research goals for async designs
  - A high-speed standard-cell flow
  - Applications where async designs yield significant improvement in
    - throughput and throughput per area
    - energy efficiency
3. Single-Track Full Buffer (Ferretti '02)
[Figure: STFB pipeline stage with left (L) and right (R) channels]
- Follows a 2-phase protocol
- High-performance standard-cell circuit family
- Comparison to synchronous standard cells
  - 4.5x better latency
  - 1 GHz in 0.18µm
  - 2.4x faster than synchronous
  - 2.8x more area
4. Block Processing: Pipelining and Parallelism
[Figure: Steinhart Aquarium analogy — M parallel "people pipelines" with latency l]
- The first M cases arrive at t = l
- Let c be the per-person cycle time; subsequent M cases arrive every c time units
- Consider two scenarios
  - Baseline: cycle time C1, latency L1
  - Improved: cycle time C2 = C1/2.4, latency L2 = L1/4.5
- Questions
  - How does cycle time affect throughput?
  - How does latency affect throughput?
5. Block Processing: Combined Cycle-Time and Latency Effect
- Large K: throughput ratio → cycle-time ratio
- Small K: throughput ratio → latency ratio
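These two limits follow from a simple pipeline-fill model of the aquarium analogy: with M parallel pipelines of latency l and cycle time c, a K-item block finishes at l + (ceil(K/M) - 1)·c. A minimal sketch (function names are illustrative; the 2.4x and 4.5x factors are the ratios quoted on the previous slide):

```python
def block_time(K, M, latency, cycle):
    """Time to process a K-item block on M parallel pipelines:
    the first M items emerge after `latency`, and each subsequent
    group of M items emerges every `cycle` time units."""
    groups = -(-K // M)                  # ceil(K / M)
    return latency + (groups - 1) * cycle

def throughput_ratio(K, M, c1, l1):
    """Throughput of the improved design (cycle C1/2.4, latency L1/4.5,
    per the slides) relative to the baseline, for block size K."""
    base = K / block_time(K, M, l1, c1)
    improved = K / block_time(K, M, l1 / 4.5, c1 / 2.4)
    return improved / base
```

For K ≤ M the block time is pure latency, so the ratio equals the latency ratio (4.5); as K grows, the fill term dominates and the ratio approaches the cycle-time ratio (2.4).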
6. Talk Outline
- Turbo coding and decoding: an introduction
- Tree soft-input soft-output (SISO) decoder
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions
7. Turbo Coding: Introduction
- Error-correcting codes add redundancy
  - The input data is K bits
  - The output code word is N bits (N > K)
  - The code rate is r = K/N
- Types of codes
  - Linear codes
  - Convolutional codes (CC)
  - Turbo codes
8. Turbo Encoding: Introduction
[Figure: turbo encoder — outer CC, interleaver, inner CC]
- Turbo encoding
  - Berrou, Glavieux and Thitimajshima (1993)
  - Performance close to Shannon channel capacity
  - Typically uses two convolutional codes and an interleaver
- Interleaver used to improve error correction
  - increases the minimum distance of the code
  - creates a large block code
9. Turbo Decoding
- Turbo decoder components
  - Two soft-in soft-out (SISO) decoders, one for the inner CC and one for the outer CC
    - soft input: a priori estimates of the input data
    - soft output: a posteriori estimates of the input data
    - SISO often based on the Min-Sum formulation
  - Interleaver / de-interleaver
    - maps SISO outputs to SISO inputs
    - same permutation as used in the encoder
- The iterative nature of the algorithm leads to block processing
  - One SISO must finish before the next SISO starts
[Figure: decoder loop — received-data memory, inner SISO, de-interleaver, outer SISO, interleaver]
10. The Decoding Problem
- Requires finding paths in a graph called a trellis
  - Node: state j of the encoder at time index k
  - Edge: represents receiving a 0 or 1 in the node for state j at time k
  - Path: represents a possible decoded sequence
    - the algorithm finds multiple paths
- Example trellis: a 2-state encoder encoding K bits
[Figure: trellis from t = 0 to t = K; edge styles distinguish "sent bit is 1" from "sent bit is 0"; example decoded sequences 01000 and 10100]
11. Min-Sum SISO Problem Formulation
- Branch and path metrics
  - Branch metric (BM): indicates the difference between expected and received values
  - Path metric: sum of the associated branch metrics
- Min-Sum formulation: for each time index k, find
  - the minimum path metric over all paths for which bit k = 1
  - the minimum path metric over all paths for which bit k = 0
[Figure: example trellis from t = 0 to t = K; minimum path metric when bit k = 1 is 13, when bit k = 0 is 16]
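For a toy trellis, the Min-Sum quantities can be computed by brute force, enumerating every path and splitting by the value of bit k. This is a didactic sketch, not the hardware algorithm; the 2-state next-state function below is a made-up example:

```python
from itertools import product

def min_sum(branch_metric, K, num_states, k):
    """Min path metric over all length-K paths, split by the value of
    bit k. branch_metric(state, bit, t) gives the cost of taking `bit`
    from `state` at time t; the next-state rule is a toy 2-state model."""
    best = {0: float('inf'), 1: float('inf')}
    for bits in product((0, 1), repeat=K):
        state, metric = 0, 0
        for t, b in enumerate(bits):
            metric += branch_metric(state, b, t)
            state = (state + b) % num_states   # toy next-state function
        best[bits[k]] = min(best[bits[k]], metric)
    return best
```

With a branch metric that charges 1 per transmitted 1-bit, the all-zero path wins for bit k = 0 and a single-1 path wins for bit k = 1.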
12. Talk Outline
- Turbo coding and decoding: an introduction
- Tree SISO: low-latency turbo decoder architecture
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions
13. Conventional SISO: O(K) Latency
- Calculation of the minimum path can be divided into two phases
  - a forward state metric for time k and state j
  - a backward state metric for time k and state j
- A data-dependency loop prevents pipelining
  - Cycle time is limited to the latency of a 2-way add-compare-select (ACS)
- Latency is O(K)
[Figure: trellis from t = 0 to t = K showing the recursion at t = k-1, k, k+1; edge styles distinguish "received bit is 1" from "received bit is 0"]
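The forward phase can be sketched as the recursion below, where each step is a 2-way ACS. Because alpha[k] depends on alpha[k-1], the K steps are inherently serial, which is the data-dependency loop the slide refers to (the data structures here are illustrative):

```python
def acs_forward(branch_metrics, num_states, predecessors):
    """Forward state-metric recursion: alpha[k][j] is the best metric
    of any path reaching state j at time k.  Each entry is one
    add-compare-select over the predecessor states of j.
    branch_metrics[k][(i, j)] is the cost of edge i -> j at time k."""
    K = len(branch_metrics)
    alpha = [[0.0] * num_states]          # all states start at metric 0
    for k in range(K):
        prev = alpha[-1]                  # serial dependence: O(K) latency
        alpha.append([
            min(prev[i] + branch_metrics[k][(i, j)] for i in predecessors[j])
            for j in range(num_states)
        ])
    return alpha
```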
14. Tree SISO: Low-Latency Architecture
- Tree SISO (Beerel/Chugg, JSAC '01)
- Calculates BMs for larger and larger segments of the trellis
  - analogous to creating group-wise PG logic for tree adders
- The Tree SISO can process the entire trellis in parallel
- No data-dependency loops, so finer pipelining is possible
- Latency is O(log K)
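The segment-combining idea can be sketched with min-plus "matrix products": each trellis segment is summarized by a matrix of best metrics between its entry and exit states, and adjacent segments combine associatively, so K segments reduce in O(log K) levels instead of the O(K) serial recursion. A sketch of the principle, not the Tree SISO datapath:

```python
def combine(A, B):
    """Min-plus matrix product: merges the state-metric matrices of two
    adjacent trellis segments into one matrix for the merged segment."""
    n = len(A)
    return [[min(A[i][m] + B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def tree_reduce(segments):
    """Pairwise tree combination of per-step segment matrices:
    O(log K) combining levels, with each level fully parallel."""
    while len(segments) > 1:
        segments = [combine(segments[i], segments[i + 1])
                    if i + 1 < len(segments) else segments[i]
                    for i in range(0, len(segments), 2)]
    return segments[0]
```

Associativity of min-plus combination is what makes the tree legal, exactly as associativity of carry propagate/generate makes tree adders legal.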
15. Remainder of Talk Outline
- Turbo coding: an introduction
- Turbo decoding
- Tree SISO: low-latency turbo decoder architecture
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions
16. Synchronous Baseline Turbo Decoder
- Synchronous turbo decoder baseline
  - IBM 0.18µm process, Artisan standard-cell library
  - An SCCC code was used with a rate of ½
  - The number of iterations performed is 6
  - Gate-level pipelined to achieve high throughput
  - Performed timing-driven place-and-route
- Peak frequency of 475 MHz
- SISO area of 2.46 mm²
- To achieve high throughput, multiple blocks are instantiated
17. Asynchronous Turbo Decoder
- Static Single-Track Full-Buffer standard-cell library (Golani '06)
  - A total of (only) 14 cells in the IBM 0.18µm process
  - Extensive SPICE simulations were performed
    - optimized the trade-off between performance and robustness
- Chip design
  - Standard ASIC place-and-route flow (congestion-based)
  - ECO optimization flow
- Chip-level simulation
  - Performed on a critical sub-block (55K transistors)
  - Verified timing constraints
  - Measured latency and throughput using Synopsys NanoSim
18. Static Single-Track Full Buffer (Ferretti '01)
[Figure: sender and receiver connected by an SST channel carrying 1-of-N data; protocol phases labeled "holds low / drives high" and "holds high / drives low"]
- 1-of-N static single-track protocol
- Statically driving the line improves the noise margin
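The single-track handshake can be illustrated with a toy software model of the behavior sketched above: the sender drives one of N lines high to deposit a token, and the receiver drives that same line low to take it, so each wire makes exactly one round trip per symbol and is statically held between transitions. The class and method names are invented for illustration:

```python
class SingleTrackWire:
    """Toy model of a 1-of-N static single-track channel.  At most one
    of the N lines is high at a time; between transitions the line is
    statically held at its current value (the noise-margin benefit)."""
    def __init__(self, n):
        self.lines = [0] * n

    def send(self, value):
        """Sender drives line `value` high to deposit a token."""
        assert not any(self.lines), "channel busy: previous token not taken"
        self.lines[value] = 1

    def receive(self):
        """Receiver drives the high line back low, consuming the token."""
        assert any(self.lines), "channel empty: no token to take"
        value = self.lines.index(1)
        self.lines[value] = 0
        return value
```

Because the same wire carries both the data and the acknowledgment, the protocol is 2-phase: one rising and one falling transition per token.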
19. Asynchronous Implementation Challenges I
- Degradation in throughput
  - Unbalanced fork and join structures
  - The token on the short branch is stalled due to the imbalance
  - This slows down the overall fork-join
- Slack matching
  - Improves throughput by adding pipeline buffers
  - Identify fork/join bottlenecks and resolve them by adding buffers
  - After place-and-route, long wires can also create this problem
    - solved by adding buffers on long wires using the ECO flow
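The throughput loss from an unbalanced fork-join, and its repair by slack matching, can be reproduced with a toy token-flow simulation: unit-time buffers that hold at most one token each, a fork that injects into both branches, and a join that fires only when both branches deliver. All names and parameters here are illustrative:

```python
def simulate(branch_a, branch_b, cycles):
    """Token-flow model of a fork-join.  Each buffer holds at most one
    token and passes it downstream when the next buffer is empty.
    Returns the number of tokens delivered by the join in `cycles`."""
    A = [0] * branch_a
    B = [0] * branch_b
    out = 0
    for _ in range(cycles):
        if A[-1] and B[-1]:              # join fires when both arrive
            A[-1] = B[-1] = 0
            out += 1
        for buf in (A, B):               # shift tokens downstream first
            for i in range(len(buf) - 1, 0, -1):
                if not buf[i] and buf[i - 1]:
                    buf[i], buf[i - 1] = 1, 0
        if not A[0] and not B[0]:        # source injects into the fork
            A[0] = B[0] = 1
    return out
```

With a 1-deep short branch against a 3-deep long branch, the full short-branch buffer blocks the fork while the long branch drains, so the join delivers only about one token every 3 cycles; padding the short branch to the same depth (slack matching) restores roughly one token per cycle.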
20. Asynchronous Implementation Challenges II
[Figure: a full adder driving a separate fork cell vs. a full adder with an integrated fork]
- SSTFB implements only point-to-point communication
- Using dedicated fork cells
  - creates another pipeline stage
  - requires slack-matching buffers on the other paths
- Integrating the fork within the full adder
  - 45% less area than a full adder plus a fork cell
  - decreases the number of slack-matching buffers required
21. Asynchronous Implementation Challenges III
- 60% of the design consists of slack-matching buffers
- Most of the time these buffers occur in linear chains
[Figure: a chain of two buffers collapsed into a single SLACK2 cell]
- To save area and power, two new cells were created: SLACK2 and SLACK4
  - 17% area and 10% power improvement for SLACK2
  - 30% area and 19% power improvement for SLACK4
22. Remainder of Talk Outline
- Turbo coding: an introduction
- Turbo decoding
- Tree SISO: low-latency turbo decoder architecture
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions
23. Comparisons
- Synchronous
  - Peak frequency of 475 MHz
  - Logic area of 2.46 mm²
- Asynchronous
  - Peak frequency of 1.15 GHz
  - Logic area of 6.92 mm²
- Design-time comparison
  - Synchronous: 4 graduate-student months
  - Asynchronous: 12 graduate-student months
24. Sync vs. Async
[Figure: received memory feeding M pipelined 8-bit Tree SISOs through the interleaver/de-interleaver; for a K-bit block with latency l, the first M bits arrive at t = l, and with c the sync clock cycle time (475 MHz), subsequent M bits arrive every c time units]
- Two implementations
  - Sync: cycle time C1 and latency L1
  - Async: cycle time C2 = C1/2.4 and latency L2 = L1/4.5
- Desired comparisons
  - Throughput comparison vs. block size
  - Energy comparison vs. block size
25. Comparisons: Throughput / Area
[Plot: throughput/area vs. block size; annotations: 1.28 (M = 3), 2.13 (M = 8), 3.91 (M = 11)]
- For small block sizes, asynchronous provides better throughput/area
- As block size grows, the two implementations become comparable
- For block sizes of 512 bits, synchronous cannot achieve the async throughput
26. Comparisons: Energy/Block
- For equivalent throughputs and small block sizes, asynchronous is more energy efficient than synchronous
- Async advantages grow with a larger async library (e.g., with BUF1of4)
27. Conclusions
- Asynchronous turbo decoder vs. synchronous baseline
  - Static SSTFB offers significant improvements for small block sizes
    - more than 2x throughput/area
    - higher peak throughput (500 Mbps)
    - more energy efficient
    - well-suited for low-latency applications (e.g., voice)
- High-performance async is advantageous for applications that require
  - high performance (e.g., pipelining)
  - low latency
  - block processing for which parallelism has diminishing returns
    - synchronous design requires extensive parallelism to achieve equivalent throughput
28. Future Work
- Library design
  - A larger library with more than one size per cell
  - 1-of-4 encoding
- Async CAD
  - Automated slack matching
  - Static timing analysis
29. Questions?