Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication

1
Efficient Asynchronous Protocol Converters for
Two-Phase Delay-Insensitive Global Communication

Amitava Mitra
Intel Corp., Bangalore, India
William F. McLaughlin
Columbia University, Electrical Engineering
Steven M. Nowick
Columbia University, Computer Science

2
Outline

Motivation and Contribution
System-on-Chip Concepts and Trends
Asynchronous Signaling Styles
Target Asynchronous SOC Architecture
Contribution
Proposed System Architecture
Experimental Results
Extensions: Other Signaling Styles
Conclusions and Future Work

3
System-on-Chip (SOC) Concept and Trends

Microelectronic trends enabling SOC design
Increasing integration density and chip size
Formerly discrete functions (memory, I/O) now
integrated
Popularity of multi-core designs
Heterogeneous SOC
Large complex chip with broad functionality
Many independent computation nodes
Multiple cores, memories, accelerators,
multimedia processing, etc.
Often includes multiple timing domains
Complex network-style interconnect fabric
Challenges in Heterogeneous SOC design
Wire costs not scaling down with device size
Increasing proportion of power and delay in
interconnect
Robust and high-performance interconnect design
High latencies between remote nodes
Mixed timing, timing variability/uncertainty
Need to support varied components: modular/scalable design

4
SOC Communication Fabric

Growing factor in overall system performance
Ideal Requirements
Speed: high throughput, low latency
Low power
Robust to timing variations
Flexibility: integrate modular IPs and upgrades
Asynchronous design well-suited to these goals
Timing-robust, flexible designs
Lower power than synchronous
Work by Quinton, Greenstreet, and Wilton (ICCD 2005)
GALS-style
Global LEDR interconnect + local synchronous blocks
Does not provide details of protocol converters

5
Asynchronous for SOC Communication

Advantages of asynchronous global communication
Delay-insensitive (DI) encoding
Removes timing constraints on global routing
No clock signals to route across chip
Significant power advantage
Can support both async and sync computation
Delay-insensitive async logic combats growing
variability concerns
GALS style: Globally-Asynchronous Locally-Synchronous
Several popular async signaling protocols
Dual rail four-phase, LEDR, 1-of-4, bundled data,
others
No single protocol ideal for both logic and
communication

6
Background: LEDR Signaling

Dual-rail encoding: two wires per bit, delay-insensitive
Level-encoding
Data rail holds actual data value
Parity rail holds parity value
Alternating-phase protocol
Encoding parity alternates between odd and even

[Table: LEDR encoding, listing the data rail and parity rail values for each bit value in each phase]
7
LEDR Signaling

Exactly one wire transition for each new data
item

Data rail carries bit value in both phases
Parity rail phase alternates with each data item
[Figure: LEDR waveform of the data and parity rails for seven data items (bit values 0, 1, 0, 0, 1, 1, 1), each labeled with its even or odd phase]
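
The LEDR properties on this slide can be checked with a minimal Python sketch. It assumes the common convention that the phase equals data XOR parity (even phase: parity equals data; odd phase: parity differs from data); the slides do not spell out the code table, and the names `ledr_encode` and `transitions` are illustrative only.

```python
def ledr_encode(bits):
    """Encode a bit sequence as (data, parity) rail pairs with alternating phase.

    Assumed convention: phase = data XOR parity, with 0 = even and 1 = odd.
    """
    codewords = []
    for i, bit in enumerate(bits):
        phase = i % 2          # phase alternates with each data item
        data = bit             # data rail carries the bit value in both phases
        parity = bit ^ phase   # parity rail chosen so data XOR parity == phase
        codewords.append((data, parity))
    return codewords


def transitions(prev, curr):
    """Count how many of the two rails changed between two codewords."""
    return sum(a != b for a, b in zip(prev, curr))


bits = [0, 1, 0, 0, 1, 1, 1]
codes = ledr_encode(bits)
flips = [transitions(p, c) for p, c in zip(codes, codes[1:])]
print(codes)
print(flips)  # one rail changes between every pair of consecutive items
```

Whenever the data bit repeats, only the parity rail toggles; whenever it changes, only the data rail toggles. Either way, exactly one wire switches per data item.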
8
Four-Phase Dual-Rail Signaling

Alternative DI Code
Key Differences
Four-phase (Return-to-Zero) protocol
Spacer (reset) state required between each data
item
One-hot encoding
True rail (encodes 1) and false rail (encodes 0)

[Figure: four-phase dual-rail waveform of the true and false rails for data values 1, 0, 1, 1, alternating between evaluation (one rail high) and reset (both rails low)]
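
A small Python sketch of the four-phase dual-rail protocol, assuming the rails are written as a (true_rail, false_rail) pair; the helper names are hypothetical. It inserts the spacer after every data item and counts the resulting wire transitions.

```python
SPACER = (0, 0)  # both rails low: the reset (return-to-zero) state


def four_phase_encode(bits):
    """Return (true_rail, false_rail) states, with a spacer after each item."""
    states = []
    for bit in bits:
        states.append((1, 0) if bit else (0, 1))  # evaluate: one rail high
        states.append(SPACER)                     # reset: both rails low
    return states


def count_transitions(states, start=SPACER):
    """Count individual wire transitions along a sequence of rail states."""
    total = 0
    prev = start
    for curr in states:
        total += sum(a != b for a, b in zip(prev, curr))
        prev = curr
    return total


bits = [1, 0, 1, 1]
states = four_phase_encode(bits)
print(states)
print(count_transitions(states) / len(bits))  # wire transitions per data item
```

Each data item costs one rising and one falling transition on the selected rail.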
9
Four-Phase Dual-Rail vs. LEDR

Advantages of four-phase dual-rail
Delay-insensitive logic using standard gates
Implementations are simple and fast: widely used
LEDR logic: complex, impractical
Disadvantages of four-phase dual-rail
System-level communication throughput
Spacer state doubles round-trip communication latency
LEDR: no spacer required
Power dissipation
Two transitions/bit (up and down) for each data item
LEDR: only one transition/bit
Conclusion
Four-phase dual-rail better for implementing
function blocks
LEDR is better for global communication

10
Target Asynchronous SOC Architecture
Our goal: protocol converters to enable this global LEDR SOC

Three major components
Global communication network (LEDR)
Local computation nodes (varied styles)
New requirement: protocol converters at interfaces
Allow full separation of computation and
communication

11
Contribution

High-speed protocol converters to enable
heterogeneous SOC architectures
Supports high-throughput, robust global
communication
LEDR encoding
Supports efficient design of local function
blocks
(i) 4-phase dual-rail, (ii) 1-of-4, (iii)
single-rail bundled data
Features
Family of low-latency protocol converters: supports the above 3 local encoding styles
High throughput
facilitates concurrent interaction of nodes
Timing-robust
converters almost entirely QDI
Low design effort
standard cell design flow
Fully implemented in 0.18 µm CMOS
Layout and simulation
FIFO throughputs up to 250 MHz

12
Two Target SOC Topologies

1. Pipeline-style topology
Feed-forward data path
uni-directional token flow
Receiving node returns a single ACK (control
signal)
Supports concurrency between nodes

Data feeds forward
Acknowledge sent back
13
Two SOC Topologies (cont.)

2. Server-style topology
Client passes data token to server
Server computes/returns data token to client
(result)
Explicit ACK unnecessary
Proposed SOC architecture supports both topologies

Four-phase server
Four-phase data client
Bi-directional data flow: data passed back to client on completion
14
Outline

Motivation and Contribution
Proposed System Architecture
Architecture Overview
System Simulation
Detailed Hardware Implementation
Timing Analysis
Experimental Results
Extensions: Other Signaling Styles
Conclusions and Future Work

15
Architecture Overview
Four-phase core
LEDR input
LEDR output

External LEDR interface, internal four-phase core
Four-phase signals are shown in red
Two-phase or transition signals are shown in
yellow

16
Control Signals

Two-phase control signals

Phase of LEDR input (request from left)
Phase of LEDR output (forward complete)
Acknowledge to left neighbor
Acknowledge from right neighbor
17
Control Signals

Four-phase control signals

Completion detect: four-phase evaluate and RZ
Enable: four-phase evaluate and RZ
18
System Simulation

LEDR inputs begin arriving at quiescent system

LEDR inputs arrive
Completion detection
19
System Simulation

Input completion detection sent to control

All input phases matching
Transition to new phase
20
System Simulation

Control enables four-phase evaluate phase

Enable rises
21
System Simulation

LEDR input converted to four-phase

Enable now high
One wire of each four-phase pair rises
22
System Simulation

Four-phase function evaluation

23
System Simulation

Four-phase bits decoded to LEDR
Each bit converted as soon as it computes

LEDR outputs to next node generated
Four-phase complete not used in evaluate phase
24
System Simulation

LEDR output completion detection

Output pairs
ACK from right may come any time after all pairs
are sent
25
System Simulation

Control enables four-phase reset phase

Enable falls
26
System Simulation

Function block inputs return-to-zero
ACK is sent concurrently to left

Enable now low
Pipeline concurrency: request new data during reset phase
27
System Simulation

Four-phase reset propagates through logic block

New data may arrive now that ACK has been sent
Reset Completion detection
Enable remains low
28
System Simulation

Four-phase reset completes
Complete internal cycle has now been performed

Complete falls
29
System Simulation

New evaluate phase begins when Enable rises again
Pre-conditions: reset finished, new data REQ, and old data ACK

Three-way synchronization
Input phase transitions when new data ready
ACK transitions when outputs safe to change
Complete low (means reset finished)
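
The three-way synchronization can be written as a simple predicate. This is a behavioral sketch only, not the control circuit from the talk: it assumes two-phase signals are modeled as 0/1 phase values, that new data shows up as an input phase different from the last one consumed, and that the right neighbor acknowledges by matching the phase of the last output sent. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ControlState:
    consumed_phase: int = 0  # phase of the last input item already processed
    sent_phase: int = 0      # phase of the last output item sent to the right


def enable_may_rise(state, input_phase, ack_from_right, complete):
    """Return True when all three pre-conditions for a new evaluate phase hold."""
    new_data_ready = input_phase != state.consumed_phase  # new data REQ
    outputs_safe = ack_from_right == state.sent_phase     # old data ACK
    reset_finished = complete == 0                        # Complete low
    return new_data_ready and outputs_safe and reset_finished


state = ControlState(consumed_phase=0, sent_phase=1)
print(enable_may_rise(state, input_phase=1, ack_from_right=1, complete=0))
print(enable_may_rise(state, input_phase=1, ack_from_right=0, complete=0))
print(enable_may_rise(state, input_phase=1, ack_from_right=1, complete=1))
```

Only the first call satisfies all three conditions; the second is still waiting for the ACK and the third for the reset to finish.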
30
Detailed Hardware Implementation
Four-phase core
LEDR input
LEDR output

Each block implemented in CMOS standard cells
Design has few non-QDI timing constraints

31
Four-phase Encode (Input Converter)

Converts LEDR input to four-phase dual-rail
Enable = 1: outputs evaluate based on LEDR data
Enable = 0: outputs reset (LEDR data blocked)
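
The input converter's behavior can be sketched as a pure function of the LEDR data rail and Enable. This is a behavioral model under the assumption that only the data rail feeds the conversion; it ignores any latching in the actual circuit, and `ledr_to_four_phase` is a hypothetical name.

```python
def ledr_to_four_phase(data_rail, enable):
    """Map one LEDR data rail value to a (true_rail, false_rail) pair.

    enable == 1: evaluate, exactly one rail goes high according to the data.
    enable == 0: reset, both rails are forced low (LEDR data blocked).
    """
    true_rail = enable & data_rail
    false_rail = enable & (1 - data_rail)
    return true_rail, false_rail


for enable in (0, 1):
    for data in (0, 1):
        print(enable, data, ledr_to_four_phase(data, enable))
```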

32
Four-phase Decode (Output Converter)

Converts four-phase bits to LEDR output
LEDR data rail encoding
Assert either S (1 value) or R (0 value), then
return-to-hold
More robust alternative: C-element

33
Four-phase Decode (Output Converter)

Converts four-phase bits to LEDR output
LEDR parity rail encoding

even phase
odd phase
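
A behavioral sketch of the output converter for one bit, in Python. It models the data rail as a set/reset element that holds its value during the spacer, and derives the parity rail from the data value and the current output phase. The parity rule (parity = data XOR phase, with 0 = even and 1 = odd) is an assumed convention, and the class is hypothetical rather than the circuit shown in the talk.

```python
class LedrOutputBit:
    """One-bit four-phase dual-rail to LEDR converter (behavioral model)."""

    def __init__(self):
        self.data = 0
        self.parity = 0

    def update(self, true_rail, false_rail, phase):
        if true_rail and not false_rail:
            self.data = 1                    # S asserted: value 1
        elif false_rail and not true_rail:
            self.data = 0                    # R asserted: value 0
        else:
            return self.data, self.parity    # spacer: hold both rails
        self.parity = self.data ^ phase      # assumed LEDR parity convention
        return self.data, self.parity


out = LedrOutputBit()
print(out.update(1, 0, phase=1))  # value 1 arrives in the odd phase
print(out.update(0, 0, phase=1))  # four-phase reset: LEDR output holds
print(out.update(0, 1, phase=0))  # value 0 arrives in the even phase
```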
34
1-Bit Completion Detectors

LEDR CD at input and output
Four-phase CD in function block
Both protocols have one gate CD
XOR (parity) for LEDR
OR for four-phase dual-rail

1-bit LEDR completion detector
1-bit four-phase completion detector
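
Both 1-bit completion detectors reduce to a single Boolean operation, as the following Python sketch shows. The function names are illustrative, and the mapping of XOR output 0 to the even phase is an assumed convention.

```python
def ledr_phase(data_rail, parity_rail):
    """1-bit LEDR completion detector: XOR of the rails gives the phase."""
    return data_rail ^ parity_rail


def four_phase_valid(true_rail, false_rail):
    """1-bit four-phase dual-rail completion detector: OR of the rails."""
    return true_rail | false_rail


print(ledr_phase(0, 0), ledr_phase(1, 1))  # assumed even-phase codewords
print(ledr_phase(0, 1), ledr_phase(1, 0))  # assumed odd-phase codewords
print(four_phase_valid(0, 0))              # spacer: no data present
print(four_phase_valid(1, 0), four_phase_valid(0, 1))  # valid data
```

The LEDR detector output toggles once per data item, while the four-phase detector output rises on evaluation and falls on reset.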
35
N-Bit Completion Detectors

C-element trees
Used for both LEDR and four-phase
C-element standard cell implementation (AOI222
w/feedback)
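
A C-element changes its output only when all of its inputs agree and holds its previous value otherwise. The Python sketch below models a tree of two-input C-elements combining N 1-bit completion signals; the class names and the pass-through handling of an odd leftover signal are assumptions made for illustration.

```python
class CElement:
    """Two-input Muller C-element: output follows the inputs when they agree."""

    def __init__(self, init=0):
        self.out = init

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


class CElementTree:
    """Combine N completion signals with a tree of two-input C-elements."""

    def __init__(self, n_inputs):
        self.levels = []
        width = n_inputs
        while width > 1:
            self.levels.append([CElement() for _ in range(width // 2)])
            width = width // 2 + width % 2

    def update(self, signals):
        signals = list(signals)
        for level in self.levels:
            merged = [c.update(signals[2 * i], signals[2 * i + 1])
                      for i, c in enumerate(level)]
            if len(signals) % 2:
                merged.append(signals[-1])  # odd leftover passes to next level
            signals = merged
        return signals[0]


tree = CElementTree(4)
print(tree.update([1, 1, 1, 0]))  # not all bits done: output stays 0
print(tree.update([1, 1, 1, 1]))  # all bits done: output rises
print(tree.update([0, 1, 1, 1]))  # partial reset: output holds 1
print(tree.update([0, 0, 0, 0]))  # all bits reset: output falls
```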

36
Control Block

Main purpose: controls 4-phase function block
4-phase eval requires 3-way synchronization
Function block: previous RZ complete
Primary inputs: new data arrival
Right interface (in pipeline): ACK received
In pipeline topology also sends left ACK

For pipeline topology only
37
Control Block

Converts two-phase inputs to four-phase outputs

Two-phase to four-phase conversion
38
Control Block Signaling Conversion
Pulse-mode (timed)
Transition-signal (falling or rising)
Four-phase (level-sensitive)
SR latch captures the pulse
Inverter and XNOR form simple pulse gen
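
The pulse generator and latch can be illustrated with a discrete-time Python sketch. It assumes the inverter output lags its input by exactly one time step, so the XNOR sees equal inputs for one step after every transition. The reset input driving the SR latch here is a made-up stimulus, not a signal taken from the talk.

```python
def xnor(a, b):
    return 1 - (a ^ b)


def pulse_train(samples, initial=0):
    """Emit a one-step pulse on every rising or falling input transition."""
    prev = initial
    pulses = []
    for x in samples:
        delayed_inverted = 1 - prev  # inverter output, one step behind input
        pulses.append(xnor(x, delayed_inverted))
        prev = x
    return pulses


def sr_latch(set_inputs, reset_inputs, initial=0):
    """Level output of an SR latch; holds its state when neither input is high."""
    q = initial
    outputs = []
    for s, r in zip(set_inputs, reset_inputs):
        if s and not r:
            q = 1
        elif r and not s:
            q = 0
        outputs.append(q)
    return outputs


signal = [0, 0, 1, 1, 1, 0, 0]            # transition signal, toggles twice
pulses = pulse_train(signal)
levels = sr_latch(pulses, [0, 0, 0, 0, 1, 0, 0])  # hypothetical reset input
print(pulses)
print(levels)
```

The pulse lasts only as long as the inverter delay, which is why the pulse width carries a two-sided timing constraint in a real implementation.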
39
Timing Requirements

Circuits almost entirely QDI
Exceptions
Control block
Two-sided timing constraint on length of pulse
Sensitive to both gate and wire delays
Careful layout required
Latches: simple hold-time constraints
SR latches can be replaced by C-elements
C-elements also have implementation-specific
timing constraints
SR latch much faster than our standard cell
C-element
D latch can be removed at cost of concurrency

40
Outline

Motivation and Contribution
Proposed System Architecture
Experimental Results
Design Methodology
Datapath Setup
Simulation Results
Latency and Throughput Analysis
Extensions: Other Signaling Styles
Conclusions and Future Work

41
Design Methodology

Standard cell design flow with complete layout
0.18 µm TSMC CMOS process
4 metal layers of 7 available used in routing
Custom place-and-route used
Only major layout concern: pulse generator circuit
Design could be automated with constraints on
pulse
Analog simulations based on layout-extracted
design
Test vectors including limiting fast and slow
cases

42
Datapath Implementation

Two function blocks implemented
An 8x8 carry-save multiplier
An empty FIFO stage
FIFO contains four-phase completion detector only
Demonstrates minimum possible node latency
Blocks are QDI in evaluate, but eager in reset
Implemented in combinational CMOS
DIMS-style logic (with C-elements) could be
used instead
QDI in both directions
Increases both forward and reverse latencies

43
Multiplier Layout

Includes dual rail multiplier and all conversion
circuits
Total area of 0.051 mm²
FIFO stage has area of 0.018 mm²

44
Measured Block Latencies
45
Performance Results

3 Metrics
Forward latency: input arrival → output data available
Average values: multiplier 6.8 ns, FIFO 2.9 ns
Stabilization time: input arrival → reset complete (circuit quiescent)
Multiplier 10.5 ns, FIFO 6.3 ns
Pipelined cycle time: minimum processing time per data item (steady state)
Multiplier 8.3 ns, FIFO 4.0 ns
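
The pipelined cycle times translate directly into throughput, since steady-state throughput is the reciprocal of the cycle time. A short Python check using the figures above (the function name is illustrative):

```python
def throughput_mhz(cycle_time_ns):
    """Steady-state throughput in MHz for a cycle time given in nanoseconds."""
    return 1e3 / cycle_time_ns


cycle_times_ns = {"multiplier": 8.3, "FIFO": 4.0}
for name, cycle in cycle_times_ns.items():
    print(f"{name}: {throughput_mhz(cycle):.1f} MHz")
```

The 4.0 ns FIFO cycle time corresponds to the 250 MHz throughput quoted elsewhere in the talk; the 8.3 ns multiplier cycle time works out to roughly 120 MHz.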

46
Performance Analysis

Forward latency overhead
2.2 ns for both nodes
Overhead independent of function block size
Includes
LEDR CD, control unit, input/output converters
Throughput increased by concurrency
Benefit: 2.2 ns reduction in cycle time (vs. post-reset ACK)
Savings achieved even in environment without
channel latency
Core converter overhead (no CD) extremely low
Only 1.1 ns average latency for converters + control
Completion detectors
Account for half of forward latency overhead
Account for 55% of FIFO cycle time
Faster CDs would provide big improvement
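
The completion-detector figures on this slide are consistent with each other, as a quick Python check shows. It assumes the forward latency overhead splits into converters plus control on one side and completion detection on the other, and it uses the 4.0 ns FIFO cycle time from the performance results.

```python
forward_overhead_ns = 2.2        # forward latency overhead, both nodes
converters_and_control_ns = 1.1  # core converter overhead without CDs
fifo_cycle_ns = 4.0              # FIFO pipelined cycle time
cd_share_of_cycle = 0.55         # CD share of FIFO cycle time

# Remaining forward overhead attributed to completion detection (assumption).
cd_forward_ns = forward_overhead_ns - converters_and_control_ns
# Absolute time spent in completion detection per FIFO cycle.
cd_cycle_ns = cd_share_of_cycle * fifo_cycle_ns

print(f"CD share of forward overhead: {cd_forward_ns / forward_overhead_ns:.0%}")
print(f"CD time per FIFO cycle: {cd_cycle_ns:.1f} ns")
```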

47
Outline

Motivation and Contribution
Proposed System Architecture
Experimental Results
Extensions: Other Signaling Styles
Converters for 1-of-4 function blocks
Converters for bundled data function block
Conclusions and Future Work

48
Extensions to Other Local Protocols

Only small changes to handle 1-of-4 or bundled
data
No change to control block
1-of-4 encoding
Input/output converters
Small changes to logic
Needs standard 1-of-4 completion detector
Single-rail bundled data
Input converter not needed: use LEDR data rail
Output converter
New basic circuit required (see paper for
details)
Function block completion detection
Use bundled done signal
Asymmetric delay chain (fast reset)

49
Outline

Background and Motivation
Contribution
Proposed System Architecture
Experimental Results
Extensions: Other Signaling Styles
Conclusions and Future Work
Summary and Conclusion
Future Work

50
Summary and Conclusions

Support heterogeneous SOCs using hybrid protocols
LEDR: low-power, delay-insensitive communication fabric
Dual-rail four-phase: simple, fast logic blocks
Designed converters for LEDR/four-phase SOC
Low latency, high throughput, timing robust
design
Robust concurrency system developed
Exploits four-phase reset to mask communication
time
Simulations with realistic mid-sized function
nodes
Demonstrated low latency overhead
Demonstrated low area overhead
Achieved throughputs up to 250 MHz for FIFO stage

51
Future Work

Evaluating system-level benefits
Determine design spaces where converters most
useful
Quantify benefits over using either protocol
exclusively
Optimal partitioning of converter nodes
Explore dependence on system topology
Potential applications: use in async SOCs
Beigne/Vivet GALS NoC Architectures (Async-06)
Scott et al. (Intel/Silistix) PXA27x System
(Async-07)
Dobkin/Ginosar/Kolodny fast LEDR serial links
(Async-06/07)
Convert 4-phase dual-rail to LEDR (for parallel
load)
