Title: Design of a High-Speed Asynchronous Turbo Decoder
1. Design of a High-Speed Asynchronous Turbo Decoder
Pankaj Golani, George Dimou, Mallika Prakash and Peter A. Beerel
Asynchronous CAD/VLSI Group, Ming Hsieh Electrical Engineering Department, University of Southern California
ASYNC 2007, Berkeley, California, March 12th 2007
2. Motivation and Goal
- Mainstream acceptance of asynchronous design
  - Leverage off the ASIC standard-cell library-based design flow
  - Achieve benefits significant enough to overcome synchronous momentum
- Our research goals for async designs
  - A high-speed standard-cell flow
  - Applications where async designs yield significant improvement in
    - throughput and throughput per area
    - energy efficiency
3. Single-Track Full Buffer (Ferretti '02)
[Figure: STFB pipeline stage with left (L) and right (R) channels]
- Follows a 2-phase protocol
- High-performance standard-cell circuit family
- Comparison to synchronous standard cells
  - 4.5x better latency
  - 1 GHz in 0.18µm
  - 2.4x faster than synchronous
  - 2.8x more area
4. Block Processing: Pipelining and Parallelism
[Figure: Steinhart Aquarium analogy — M parallel "people pipelines" with latency l]
- The first M cases arrive at t = l
- Let c be the per-person cycle time; subsequent M cases arrive every c time units
- Consider two scenarios
  - Baseline: cycle time C1, latency L1
  - Improved: cycle time C2 = C1/2.4, latency L2 = L1/4.5
- Questions
  - How does cycle time affect throughput?
  - How does latency affect throughput?
5. Block Processing: Combined Cycle-Time and Latency Effect
- Large K: throughput ratio → cycle-time ratio
- Small K: throughput ratio → latency ratio
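These two limits follow from a simple pipeline-fill model of the aquarium analogy: with M parallel pipelines of latency l and cycle time c, a K-item block finishes at l + (ceil(K/M) - 1)·c. A minimal sketch (function names are illustrative; the 2.4x and 4.5x factors are the ratios quoted on the previous slide):

```python
def block_time(K, M, latency, cycle):
    """Time to process a K-item block on M parallel pipelines:
    the first M items emerge after `latency`, and each subsequent
    group of M items emerges every `cycle` time units."""
    groups = -(-K // M)                  # ceil(K / M)
    return latency + (groups - 1) * cycle

def throughput_ratio(K, M, c1, l1):
    """Throughput of the improved design (cycle C1/2.4, latency L1/4.5,
    per the slides) relative to the baseline, for block size K."""
    base = K / block_time(K, M, l1, c1)
    improved = K / block_time(K, M, l1 / 4.5, c1 / 2.4)
    return improved / base
```

For K ≤ M the block time is pure latency, so the ratio equals the latency ratio (4.5); as K grows, the fill term dominates and the ratio approaches the cycle-time ratio (2.4).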
6. Talk Outline
- Turbo coding and decoding: an introduction
- Tree soft-input soft-output (SISO) decoder
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions
7. Turbo Coding: Introduction
- Error-correcting codes add redundancy
  - The input data is K bits
  - The output code word is N bits (N > K)
  - The code rate is r = K/N
- Types of codes
  - Linear codes
  - Convolutional codes (CC)
  - Turbo codes
8. Turbo Encoding: Introduction
[Figure: turbo encoder — outer CC, interleaver, inner CC]
- Turbo encoding
  - Berrou, Glavieux and Thitimajshima (1993)
  - Performance close to Shannon channel capacity
  - Typically uses two convolutional codes and an interleaver
- Interleaver used to improve error correction
  - increases the minimum distance of the code
  - creates a large block code
9. Turbo Decoding
- Turbo decoder components
  - Two soft-in soft-out (SISO) decoders, one for the inner CC and one for the outer CC
    - soft input: a priori estimates of the input data
    - soft output: a posteriori estimates of the input data
    - SISO often based on the Min-Sum formulation
  - Interleaver / de-interleaver
    - maps SISO outputs to SISO inputs
    - same permutation as used in the encoder
- The iterative nature of the algorithm leads to block processing
  - One SISO must finish before the next SISO starts
[Figure: decoder loop — received-data memory, inner SISO, de-interleaver, outer SISO, interleaver]
10. The Decoding Problem
- Requires finding paths in a graph called a trellis
  - Node: state j of the encoder at time index k
  - Edge: represents receiving a 0 or 1 in the node for state j at time k
  - Path: represents a possible decoded sequence
    - the algorithm finds multiple paths
- Example trellis: a 2-state encoder encoding K bits
[Figure: trellis from t = 0 to t = K; edge styles distinguish "sent bit is 1" from "sent bit is 0"; example decoded sequences 01000 and 10100]
11. Min-Sum SISO Problem Formulation
- Branch and path metrics
  - Branch metric (BM): indicates the difference between expected and received values
  - Path metric: sum of the associated branch metrics
- Min-Sum formulation: for each time index k, find
  - the minimum path metric over all paths for which bit k = 1
  - the minimum path metric over all paths for which bit k = 0
[Figure: example trellis from t = 0 to t = K; minimum path metric when bit k = 1 is 13, when bit k = 0 is 16]
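For a toy trellis, the Min-Sum quantities can be computed by brute force, enumerating every path and splitting by the value of bit k. This is a didactic sketch, not the hardware algorithm; the 2-state next-state function below is a made-up example:

```python
from itertools import product

def min_sum(branch_metric, K, num_states, k):
    """Min path metric over all length-K paths, split by the value of
    bit k. branch_metric(state, bit, t) gives the cost of taking `bit`
    from `state` at time t; the next-state rule is a toy 2-state model."""
    best = {0: float('inf'), 1: float('inf')}
    for bits in product((0, 1), repeat=K):
        state, metric = 0, 0
        for t, b in enumerate(bits):
            metric += branch_metric(state, b, t)
            state = (state + b) % num_states   # toy next-state function
        best[bits[k]] = min(best[bits[k]], metric)
    return best
```

With a branch metric that charges 1 per transmitted 1-bit, the all-zero path wins for bit k = 0 and a single-1 path wins for bit k = 1.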
12. Talk Outline
- Turbo coding and decoding: an introduction
- Tree SISO: low-latency turbo decoder architecture
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions
13. Conventional SISO: O(K) Latency
- Calculation of the minimum path can be divided into two phases
  - a forward state metric for time k and state j
  - a backward state metric for time k and state j
- A data-dependency loop prevents pipelining
  - Cycle time is limited to the latency of a 2-way add-compare-select (ACS)
- Latency is O(K)
[Figure: trellis from t = 0 to t = K showing the recursion at t = k-1, k, k+1; edge styles distinguish "received bit is 1" from "received bit is 0"]
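The forward phase can be sketched as the recursion below, where each step is a 2-way ACS. Because alpha[k] depends on alpha[k-1], the K steps are inherently serial, which is the data-dependency loop the slide refers to (the data structures here are illustrative):

```python
def acs_forward(branch_metrics, num_states, predecessors):
    """Forward state-metric recursion: alpha[k][j] is the best metric
    of any path reaching state j at time k.  Each entry is one
    add-compare-select over the predecessor states of j.
    branch_metrics[k][(i, j)] is the cost of edge i -> j at time k."""
    K = len(branch_metrics)
    alpha = [[0.0] * num_states]          # all states start at metric 0
    for k in range(K):
        prev = alpha[-1]                  # serial dependence: O(K) latency
        alpha.append([
            min(prev[i] + branch_metrics[k][(i, j)] for i in predecessors[j])
            for j in range(num_states)
        ])
    return alpha
```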
14. Tree SISO: Low-Latency Architecture
- Tree SISO (Beerel/Chugg, JSAC '01)
- Calculates BMs for larger and larger segments of the trellis
  - analogous to creating group-wise PG logic for tree adders
- The Tree SISO can process the entire trellis in parallel
- No data-dependency loops, so finer pipelining is possible
- Latency is O(log K)
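The segment-combining idea can be sketched with min-plus "matrix products": each trellis segment is summarized by a matrix of best metrics between its entry and exit states, and adjacent segments combine associatively, so K segments reduce in O(log K) levels instead of the O(K) serial recursion. A sketch of the principle, not the Tree SISO datapath:

```python
def combine(A, B):
    """Min-plus matrix product: merges the state-metric matrices of two
    adjacent trellis segments into one matrix for the merged segment."""
    n = len(A)
    return [[min(A[i][m] + B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def tree_reduce(segments):
    """Pairwise tree combination of per-step segment matrices:
    O(log K) combining levels, with each level fully parallel."""
    while len(segments) > 1:
        segments = [combine(segments[i], segments[i + 1])
                    if i + 1 < len(segments) else segments[i]
                    for i in range(0, len(segments), 2)]
    return segments[0]
```

Associativity of min-plus combination is what makes the tree legal, exactly as associativity of carry propagate/generate makes tree adders legal.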
15. Remainder of Talk Outline
- Turbo coding: an introduction
- Turbo decoding
- Tree SISO: low-latency turbo decoder architecture
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions
16. Synchronous Baseline Turbo Decoder
- Synchronous turbo decoder baseline
  - IBM 0.18µm process, Artisan standard-cell library
  - An SCCC code was used with a rate of ½
  - The number of iterations performed is 6
  - Gate-level pipelined to achieve high throughput
  - Performed timing-driven place-and-route
- Peak frequency of 475 MHz
- SISO area of 2.46 mm²
- To achieve high throughput, multiple blocks are instantiated
17. Asynchronous Turbo Decoder
- Static Single-Track Full-Buffer standard-cell library (Golani '06)
  - A total of (only) 14 cells in the IBM 0.18µm process
  - Extensive SPICE simulations were performed
    - optimized the trade-off between performance and robustness
- Chip design
  - Standard ASIC place-and-route flow (congestion-based)
  - ECO optimization flow
- Chip-level simulation
  - Performed on a critical sub-block (55K transistors)
  - Verified timing constraints
  - Measured latency and throughput using Synopsys NanoSim
18. Static Single-Track Full Buffer (Ferretti '01)
[Figure: sender and receiver connected by an SST channel carrying 1-of-N data; protocol phases labeled "holds low / drives high" and "holds high / drives low"]
- 1-of-N static single-track protocol
- Statically driving the line improves the noise margin
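The single-track handshake can be illustrated with a toy software model of the behavior sketched above: the sender drives one of N lines high to deposit a token, and the receiver drives that same line low to take it, so each wire makes exactly one round trip per symbol and is statically held between transitions. The class and method names are invented for illustration:

```python
class SingleTrackWire:
    """Toy model of a 1-of-N static single-track channel.  At most one
    of the N lines is high at a time; between transitions the line is
    statically held at its current value (the noise-margin benefit)."""
    def __init__(self, n):
        self.lines = [0] * n

    def send(self, value):
        """Sender drives line `value` high to deposit a token."""
        assert not any(self.lines), "channel busy: previous token not taken"
        self.lines[value] = 1

    def receive(self):
        """Receiver drives the high line back low, consuming the token."""
        assert any(self.lines), "channel empty: no token to take"
        value = self.lines.index(1)
        self.lines[value] = 0
        return value
```

Because the same wire carries both the data and the acknowledgment, the protocol is 2-phase: one rising and one falling transition per token.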
19. Asynchronous Implementation Challenges I
- Degradation in throughput
  - Unbalanced fork and join structures
  - The token on the short branch is stalled due to the imbalance
  - This slows down the overall fork-join
- Slack matching
  - Improves throughput by adding pipeline buffers
  - Identify fork/join bottlenecks and resolve them by adding buffers
  - After place-and-route, long wires can also create this problem
    - solved by adding buffers on long wires using the ECO flow
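The throughput loss from an unbalanced fork-join, and its repair by slack matching, can be reproduced with a toy token-flow simulation: unit-time buffers that hold at most one token each, a fork that injects into both branches, and a join that fires only when both branches deliver. All names and parameters here are illustrative:

```python
def simulate(branch_a, branch_b, cycles):
    """Token-flow model of a fork-join.  Each buffer holds at most one
    token and passes it downstream when the next buffer is empty.
    Returns the number of tokens delivered by the join in `cycles`."""
    A = [0] * branch_a
    B = [0] * branch_b
    out = 0
    for _ in range(cycles):
        if A[-1] and B[-1]:              # join fires when both arrive
            A[-1] = B[-1] = 0
            out += 1
        for buf in (A, B):               # shift tokens downstream first
            for i in range(len(buf) - 1, 0, -1):
                if not buf[i] and buf[i - 1]:
                    buf[i], buf[i - 1] = 1, 0
        if not A[0] and not B[0]:        # source injects into the fork
            A[0] = B[0] = 1
    return out
```

With a 1-deep short branch against a 3-deep long branch, the full short-branch buffer blocks the fork while the long branch drains, so the join delivers only about one token every 3 cycles; padding the short branch to the same depth (slack matching) restores roughly one token per cycle.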
20. Asynchronous Implementation Challenges II
[Figure: a full adder driving a separate fork cell vs. a full adder with an integrated fork]
- SSTFB implements only point-to-point communication
- Using dedicated fork cells
  - creates another pipeline stage
  - requires slack-matching buffers on the other paths
- Integrating the fork within the full adder
  - 45% less area than a full adder plus a fork cell
  - decreases the number of slack-matching buffers required
21. Asynchronous Implementation Challenges III
- 60% of the design consists of slack-matching buffers
- Most of the time these buffers occur in linear chains
[Figure: a chain of two buffers collapsed into a single SLACK2 cell]
- To save area and power, two new cells were created: SLACK2 and SLACK4
  - 17% area and 10% power improvement for SLACK2
  - 30% area and 19% power improvement for SLACK4
22. Remainder of Talk Outline
- Turbo coding: an introduction
- Turbo decoding
- Tree SISO: low-latency turbo decoder architecture
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions
23. Comparisons
- Synchronous
  - Peak frequency of 475 MHz
  - Logic area of 2.46 mm²
- Asynchronous
  - Peak frequency of 1.15 GHz
  - Logic area of 6.92 mm²
- Design-time comparison
  - Synchronous: 4 graduate-student months
  - Asynchronous: 12 graduate-student months
24. Sync vs. Async
[Figure: received memory feeding M pipelined 8-bit Tree SISOs through the interleaver/de-interleaver; for a K-bit block with latency l, the first M bits arrive at t = l, and with c the sync clock cycle time (475 MHz), subsequent M bits arrive every c time units]
- Two implementations
  - Sync: cycle time C1 and latency L1
  - Async: cycle time C2 = C1/2.4 and latency L2 = L1/4.5
- Desired comparisons
  - Throughput comparison vs. block size
  - Energy comparison vs. block size
25. Comparisons: Throughput / Area
[Plot: throughput/area vs. block size; annotations: 1.28 (M = 3), 2.13 (M = 8), 3.91 (M = 11)]
- For small block sizes, asynchronous provides better throughput/area
- As block size grows, the two implementations become comparable
- For block sizes of 512 bits, synchronous cannot achieve the async throughput
26. Comparisons: Energy/Block
- For equivalent throughputs and small block sizes, asynchronous is more energy efficient than synchronous
- Async advantages grow with a larger async library (e.g., with BUF1of4)
27. Conclusions
- Asynchronous turbo decoder vs. synchronous baseline
  - Static SSTFB offers significant improvements for small block sizes
    - more than 2x throughput/area
    - higher peak throughput (500 Mbps)
    - more energy efficient
    - well-suited for low-latency applications (e.g., voice)
- High-performance async is advantageous for applications that require
  - high performance (e.g., pipelining)
  - low latency
  - block processing for which parallelism has diminishing returns
    - synchronous design requires extensive parallelism to achieve equivalent throughput
28. Future Work
- Library design
  - A larger library with more than one size per cell
  - 1-of-4 encoding
- Async CAD
  - Automated slack matching
  - Static timing analysis
29. Questions?