Title: Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication
Slide 1: Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication
- Amitava Mitra, Intel Corp., Bangalore, India
- William F. McLaughlin, Columbia University, Electrical Engineering
- Steven M. Nowick, Columbia University, Computer Science
Slide 2: Outline
- Motivation and Contribution
  - System-on-Chip Concepts and Trends
  - Asynchronous Signaling Styles
  - Target Asynchronous SOC Architecture
  - Contribution
- Proposed System Architecture
- Experimental Results
- Extensions: Other Signaling Styles
- Conclusions and Future Work
Slide 3: System-on-Chip (SOC) Concept and Trends
- Microelectronic trends enabling SOC design
  - Increasing integration density and chip size
  - Formerly discrete functions (memory, I/O) now integrated
  - Popularity of multi-core designs
- Heterogeneous SOC
  - Large, complex chip with broad functionality
  - Many independent computation nodes
    - Multiple cores, memories, accelerators, multimedia processing, etc.
  - Often includes multiple timing domains
  - Complex network-style interconnect fabric
- Challenges in heterogeneous SOC design
  - Wire costs not scaling down with device size
    - Increasing proportion of power and delay in interconnect
  - Robust and high-performance interconnect design
    - High latencies between remote nodes
    - Mixed timing, timing variability/uncertainty
  - Need to support varied components: modular/scalable design
Slide 4: SOC Communication Fabric
- Growing factor in overall system performance
- Ideal requirements
  - Speed: high throughput, low latency
  - Low power
  - Robust to timing variations
  - Flexibility: integrate modular IPs and upgrades
- Asynchronous design well-suited to these goals
  - Timing-robust, flexible designs
  - Lower power than synchronous
- Work by Quinton, Greenstreet, and Wilton (ICCD 2005)
  - GALS-style: global LEDR interconnect, local synchronous blocks
  - Does not provide details of protocol converters
Slide 5: Asynchronous for SOC Communication
- Advantages of asynchronous global communication
  - Delay-insensitive (DI) encoding
    - Removes timing constraints on global routing
  - No clock signals to route across chip
    - Significant power advantage
  - Can support both async and sync computation
    - Delay-insensitive async logic combats growing variability concerns
    - GALS style: Globally-Asynchronous Locally-Synchronous
- Several popular async signaling protocols
  - Dual-rail four-phase, LEDR, 1-of-4, bundled data, others
  - No single protocol ideal for both logic and communication
Slide 6: Background: LEDR Signaling
- Dual-rail encoding: two wires per bit, delay-insensitive
- Level-encoding
  - Data rail holds actual data value
  - Parity rail holds parity value
- Alternating-phase protocol
  - Encoding parity alternates between odd and even
[Table: LEDR encoding, giving the data rail and parity rail values for each bit value and phase]
Slide 7: LEDR Signaling
- Exactly one wire transition for each new data item
- Data rail carries the bit value in both phases
- Parity rail: phase alternates with each data item
[Figure: LEDR waveform showing the data and parity rails for an example bit sequence, with the phase alternating between even and odd]
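The LEDR rules above can be sketched in a few lines of Python. This is an illustrative model only, not the authors' circuit: the names `ledr_encode` and `rail_transitions` are hypothetical, and the sketch assumes the common LEDR convention that the phase is the XOR of the two rails (even when the rails are equal, odd when they differ), which the slides do not spell out.

```python
def ledr_encode(bits, first_phase=0):
    """Encode bits as (data, parity) rail pairs.

    Assumption: phase 0 = even (rails equal), phase 1 = odd (rails differ),
    and the phase alternates with every data item.
    """
    phase = first_phase
    encoded = []
    for bit in bits:
        encoded.append((bit, bit ^ phase))  # parity rail = data XOR phase
        phase ^= 1                          # alternate phase for the next item
    return encoded


def rail_transitions(prev, cur):
    """Count how many of the two rails change between consecutive codewords."""
    return sum(p != c for p, c in zip(prev, cur))


if __name__ == "__main__":
    state = (0, 1)  # assumed idle state: an odd-phase 0, so the first item is even
    for word in ledr_encode([0, 1, 0, 0, 1, 1, 1]):
        print(word, "rails changed:", rail_transitions(state, word))
        state = word
```

Under these assumptions, every step changes exactly one rail: the data rail when the bit value changes, and the parity rail otherwise.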
Slide 8: Four-Phase Dual-Rail Signaling
- Alternative DI code
- Key differences
  - Four-phase (return-to-zero) protocol
    - Spacer (reset) state required between each data item
  - One-hot encoding
    - True rail (encodes 1) and false rail (encodes 0)
[Figure: four-phase dual-rail waveform for data values 1, 0, 1, 1 on the true and false rails; one rail is high during evaluation, and both rails are low during reset]
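A minimal sketch of the four-phase dual-rail code described above, for illustration only. The names `four_phase_encode` and `count_transitions` are hypothetical; the model simply emits a one-hot codeword followed by the all-low spacer for each bit and counts rail transitions.

```python
SPACER = (0, 0)  # reset state: both rails low


def four_phase_encode(bits):
    """Return the (true_rail, false_rail) sequence, with a spacer after each item."""
    sequence = []
    for bit in bits:
        sequence.append((1, 0) if bit else (0, 1))  # evaluation: one rail high
        sequence.append(SPACER)                     # return-to-zero
    return sequence


def count_transitions(sequence, start=SPACER):
    """Count rail transitions over a sequence, starting from the spacer state."""
    total, prev = 0, start
    for cur in sequence:
        total += sum(p != c for p, c in zip(prev, cur))
        prev = cur
    return total


if __name__ == "__main__":
    seq = four_phase_encode([1, 0, 1, 1])
    print(seq)
    print("transitions:", count_transitions(seq))  # two per data item
```

Each data item costs one rising and one falling transition on the selected rail, which is the two-transitions-per-bit cost that the comparison with LEDR refers to.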
Slide 9: Four-Phase Dual-Rail vs. LEDR
- Advantages of four-phase dual-rail
  - Delay-insensitive logic using standard gates
  - Implementations are simple and fast, and widely used
  - LEDR: complex, impractical
- Disadvantages of four-phase dual-rail
  - System-level communication throughput
    - Spacer state doubles round-trip communication latency
    - LEDR: no spacer required
  - Power dissipation
    - Two transitions/bit (up and down) for each data item
    - LEDR: only one transition/bit
- Conclusion
  - Four-phase dual-rail is better for implementing function blocks
  - LEDR is better for global communication
Slide 10: Target Asynchronous SOC Architecture
Our goal: protocol converters to enable this global LEDR SOC
- Three major components
  - Global communication network (LEDR)
  - Local computation nodes (varied styles)
  - New requirement: protocol converters at interfaces
- Allow full separation of computation and communication
Slide 11: Contribution
- High-speed protocol converters to enable heterogeneous SOC architectures
  - Supports high-throughput, robust global communication
    - LEDR encoding
  - Supports efficient design of local function blocks
    - (i) 4-phase dual-rail, (ii) 1-of-4, (iii) single-rail bundled data
- Features
  - Family of low-latency protocol converters
    - Support the above 3 local encoding styles
  - High throughput
    - Facilitates concurrent interaction of nodes
  - Timing-robust
    - Converters almost entirely QDI
  - Low design effort
    - Standard cell design flow
- Fully implemented in 0.18 µm CMOS
  - Layout and simulation
  - FIFO throughputs up to 250 MHz
Slide 12: Two Target SOC Topologies
- 1. Pipeline-style topology
  - Feed-forward data path
    - Uni-directional token flow
  - Receiving node returns a single ACK (control signal)
  - Supports concurrency between nodes
[Figure: pipeline topology; data feeds forward, and the acknowledge is sent back]
Slide 13: Two SOC Topologies (cont.)
- 2. Server-style topology
  - Client passes data token to server
  - Server computes and returns data token (result) to client
  - Explicit ACK unnecessary
- Proposed SOC architecture supports both topologies
[Figure: four-phase server and four-phase data client; bi-directional data flow, with data passed back to the client on completion]
Slide 14: Outline
- Motivation and Contribution
- Proposed System Architecture
  - Architecture Overview
  - System Simulation
  - Detailed Hardware Implementation
  - Timing Analysis
- Experimental Results
- Extensions: Other Signaling Styles
- Conclusions and Future Work
Slide 15: Architecture Overview
[Figure: converter node with LEDR input, four-phase core, and LEDR output]
- External LEDR interface, internal four-phase core
- Four-phase signals are shown in red
- Two-phase (transition) signals are shown in yellow
Slide 16: Control Signals
- Two-phase control signals
  - Phase of LEDR input (request from left)
  - Phase of LEDR output (forward complete)
  - Acknowledge to left neighbor
  - Acknowledge from right neighbor
Slide 17: Control Signals
- Four-phase control signals
  - Completion detect: four-phase evaluate and RZ
  - Enable: four-phase evaluate and RZ
Slide 18: System Simulation
- LEDR inputs begin arriving at quiescent system
[Figure: LEDR inputs arrive; completion detection]
Slide 19: System Simulation
- Input completion detection sent to control
[Figure: all input phases matching; transition to new phase]
Slide 20: System Simulation
- Control enables four-phase evaluate phase
[Figure: Enable rises]
Slide 21: System Simulation
- LEDR input converted to four-phase
[Figure: Enable now high; one wire of each four-phase pair rises]
Slide 22: System Simulation
- Four-phase function evaluation
Slide 23: System Simulation
- Four-phase bits decoded to LEDR
- Each bit converted as soon as it computes
[Figure: LEDR outputs to next node generated; four-phase complete not used in evaluate phase]
Slide 24: System Simulation
- LEDR output completion detection
[Figure: output pairs; ACK from right may come any time after all pairs are sent]
Slide 25: System Simulation
- Control enables four-phase reset phase
[Figure: Enable falls]
Slide 26: System Simulation
- Function block inputs return to zero
- ACK is sent concurrently to left
[Figure: Enable now low; pipeline concurrency: request new data during reset phase]
Slide 27: System Simulation
- Four-phase reset propagates through logic block
[Figure: new data may arrive now that ACK has been sent; reset completion detection; Enable remains low]
Slide 28: System Simulation
- Four-phase reset completes
- Complete internal cycle has now been performed
[Figure: Complete falls]
Slide 29: System Simulation
- New evaluate phase begins when Enable rises again
- Pre-conditions: reset finished, new data REQ, and old data ACK
[Figure: three-way synchronization; input phase transitions when new data is ready; ACK transitions when outputs are safe to change; Complete low means reset finished]
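The three pre-conditions for a new evaluate phase can be written as a simple predicate. This is a behavioral sketch only, not the authors' control circuit: the function `enable_may_rise` and its parameter names are hypothetical, and it assumes that a two-phase event is detected by comparing a signal with its previously observed value.

```python
def enable_may_rise(req_phase, last_req_phase, ack, last_ack, complete):
    """Return True when all three pre-conditions for evaluation hold.

    Assumptions (not from the slides): each two-phase signal is compared
    against the value seen at the previous evaluation, so any transition
    counts as an event; `complete` is a four-phase level signal.
    """
    new_data = req_phase != last_req_phase  # input phase toggled: new data REQ
    old_data_acked = ack != last_ack        # ACK toggled: outputs safe to change
    reset_finished = complete == 0          # Complete low: reset finished
    return new_data and old_data_acked and reset_finished


if __name__ == "__main__":
    print(enable_may_rise(1, 0, 1, 0, complete=0))  # all three conditions met
    print(enable_may_rise(1, 0, 1, 0, complete=1))  # reset still in progress
```

The sketch ignores initialization (the first data item has no previous output to acknowledge) and says nothing about how the real control block stores the previous phases.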
Slide 30: Detailed Hardware Implementation
[Figure: converter node with LEDR input, four-phase core, and LEDR output]
- Each block implemented in CMOS standard cells
- Design has few non-QDI timing constraints
Slide 31: Four-phase Encode (Input Converter)
- Converts LEDR input to four-phase dual-rail
- Enable = 1: outputs evaluate based on LEDR data
- Enable = 0: outputs reset (LEDR data blocked)
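One way to picture the input converter's behavior is as simple gating of the LEDR data rail by Enable. The sketch below is an assumption about the logic function, not the authors' gate-level design, and `four_phase_encode_bit` is a hypothetical name.

```python
def four_phase_encode_bit(ledr_data, enable):
    """Map one LEDR data-rail value to a (true_rail, false_rail) pair.

    Assumed gating: with enable high, exactly one rail follows the data
    value; with enable low, both rails are forced low (reset).
    """
    true_rail = enable & ledr_data
    false_rail = enable & (ledr_data ^ 1)
    return true_rail, false_rail


if __name__ == "__main__":
    for enable in (1, 0):
        for data in (0, 1):
            print("enable", enable, "data", data, "->",
                  four_phase_encode_bit(data, enable))
```

Only the data rail is used here, because the LEDR data rail carries the bit value in both phases.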
Slide 32: Four-phase Decode (Output Converter)
- Converts four-phase bits to LEDR output
- LEDR data rail encoding
  - Assert either S (1 value) or R (0 value), then return-to-hold
  - More robust alternative: C-element
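The data-rail behavior described above (set or reset, then hold) matches a basic SR latch. The following is a behavioral sketch under the assumption that S is driven by the four-phase true rail and R by the false rail; the `SRLatch` class and `decode_data_rail` function are hypothetical names, and the parity rail is not modeled.

```python
class SRLatch:
    """Behavioral set/reset latch: holds its value when S and R are both low."""

    def __init__(self, q=0):
        self.q = q

    def update(self, s, r):
        if s and r:
            raise ValueError("S and R must not be asserted together")
        if s:
            self.q = 1
        elif r:
            self.q = 0
        return self.q  # unchanged when s == r == 0 (hold)


def decode_data_rail(latch, true_rail, false_rail):
    """Assumed wiring: true rail drives S, false rail drives R."""
    return latch.update(s=true_rail, r=false_rail)


if __name__ == "__main__":
    latch = SRLatch()
    for rails in [(1, 0), (0, 0), (0, 1), (0, 0)]:
        print(rails, "-> LEDR data rail", decode_data_rail(latch, *rails))
```

During the four-phase spacer both inputs are low, so the LEDR data rail keeps its value while the function block resets.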
Slide 33: Four-phase Decode (Output Converter)
- Converts four-phase bits to LEDR output
- LEDR parity rail encoding
[Figure: parity rail encoding for the even phase and the odd phase]
Slide 34: 1-Bit Completion Detectors
- LEDR CD at input and output
- Four-phase CD in function block
- Both protocols have a one-gate CD
  - XOR (parity) for LEDR
  - OR for four-phase dual-rail
[Figure: 1-bit LEDR completion detector; 1-bit four-phase completion detector]
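The two one-gate detectors can be modeled directly as Boolean functions. This is a small illustrative sketch with hypothetical function names; the mapping of XOR output 0 to "even" and 1 to "odd" is an assumed convention.

```python
def ledr_phase(data_rail, parity_rail):
    """1-bit LEDR completion detector: XOR of the rails gives the phase.

    Assumed convention: 0 = even phase, 1 = odd phase. A new data item
    has arrived when this output toggles.
    """
    return data_rail ^ parity_rail


def four_phase_valid(true_rail, false_rail):
    """1-bit four-phase completion detector: OR of the rails.

    1 = valid data (one rail high), 0 = spacer (both rails low).
    """
    return true_rail | false_rail


if __name__ == "__main__":
    print(ledr_phase(0, 0), ledr_phase(0, 1))              # even, then odd
    print(four_phase_valid(0, 0), four_phase_valid(1, 0))  # spacer, then data
```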
Slide 35: N-Bit Completion Detectors
- C-element trees
  - Used for both LEDR and four-phase
  - C-element standard cell implementation (AOI222 with feedback)
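A behavioral sketch of an N-bit completion detector built as a tree of two-input C-elements. The classes `CElement` and `CTree` are hypothetical, the tree shape is an arbitrary balanced split, and the model captures only logical behavior, not the AOI222 cell or its timing.

```python
class CElement:
    """Two-input Muller C-element: output follows the inputs when they agree,
    and holds its previous value when they differ."""

    def __init__(self, state=0):
        self.state = state

    def update(self, a, b):
        if a == b:
            self.state = a
        return self.state


class CTree:
    """Binary tree of C-elements combining n one-bit completion signals (n >= 1)."""

    def __init__(self, n):
        self.n = n
        if n > 1:
            self.node = CElement()
            self.left = CTree(n // 2)
            self.right = CTree(n - n // 2)

    def update(self, signals):
        if self.n == 1:
            return signals[0]
        half = self.n // 2
        return self.node.update(self.left.update(signals[:half]),
                                self.right.update(signals[half:]))


if __name__ == "__main__":
    tree = CTree(4)
    for signals in ([0, 0, 0, 0], [1, 1, 0, 1], [1, 1, 1, 1], [0, 1, 1, 1]):
        print(signals, "->", tree.update(signals))
```

The inputs would be the per-bit detector outputs: the phase bits for LEDR, or the valid bits for four-phase dual-rail. The tree output changes only once every bit has changed.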
Slide 36: Control Block
- Main purpose: controls 4-phase function block
- 4-phase eval requires 3-way synchronization
  - Function block: previous RZ complete
  - Primary inputs: new data arrival
  - Right interface (in pipeline): ACK received
- In pipeline topology: also sends left ACK
[Figure annotation: for pipeline topology only]
Slide 37: Control Block
- Converts two-phase inputs to four-phase outputs
[Figure: two-phase to four-phase conversion]
Slide 38: Control Block Signaling Conversion
[Figure: signaling conversion between transition-signal (falling or rising), pulse-mode (timed), and four-phase (level-sensitive) signals]
- Inverter and XNOR form a simple pulse generator
- SR latch captures the pulse
Slide 39: Timing Requirements
- Circuits almost entirely QDI
- Exceptions
  - Control block
    - Two-sided timing constraint on length of pulse
    - Sensitive to both gate and wire delays
    - Careful layout required
  - Latches: simple hold time constraints
    - SR latches can be replaced by C-elements
    - C-elements also have implementation-specific timing constraints
    - SR latch much faster than our standard cell C-element
    - D latch can be removed at cost of concurrency
Slide 40: Outline
- Motivation and Contribution
- Proposed System Architecture
- Experimental Results
  - Design Methodology
  - Datapath Setup
  - Simulation Results
  - Latency and Throughput Analysis
- Extensions: Other Signaling Styles
- Conclusions and Future Work
Slide 41: Design Methodology
- Standard cell design flow with complete layout
  - 0.18 µm TSMC CMOS process
  - 4 of the 7 available metal layers used in routing
- Custom place-and-route used
  - Only major layout concern: pulse generator circuit
  - Design could be automated with constraints on pulse
- Analog simulations based on layout-extracted design
  - Test vectors including limiting fast and slow cases
Slide 42: Datapath Implementation
- Two function blocks implemented
  - An 8x8 carry-save multiplier
  - An empty FIFO stage
    - FIFO contains four-phase completion detector only
    - Demonstrates minimum possible node latency
- Blocks are QDI in evaluate, but eager in reset
  - Implemented in combinational CMOS
  - DIMS-style logic (with C-elements) could be used instead
    - QDI in both directions
    - Increases both forward and reverse latencies
Slide 43: Multiplier Layout
- Includes dual-rail multiplier and all conversion circuits
- Total area of 0.051 mm²
- FIFO stage has area of 0.018 mm²
Slide 44: Measured Block Latencies
Slide 45: Performance Results
- 3 metrics
  - Forward latency: input arrival → output data available
    - Average values: multiplier 6.8 ns, FIFO 2.9 ns
  - Stabilization time: input arrival → reset complete (circuit quiescent)
    - Multiplier 10.5 ns, FIFO 6.3 ns
  - Pipelined cycle time: minimum processing time per data item (steady-state)
    - Multiplier 8.3 ns, FIFO 4.0 ns
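As a quick check on how the pipelined cycle times relate to throughput: steady-state throughput is the reciprocal of the cycle time, so the 4.0 ns FIFO cycle time corresponds to the 250 MHz figure quoted elsewhere in the talk. The helper name `throughput_mhz` below is hypothetical.

```python
def throughput_mhz(cycle_time_ns):
    """Steady-state throughput in MHz for a given pipelined cycle time in ns."""
    return 1e3 / cycle_time_ns  # 1 / (t * 1e-9 s) = (1e3 / t) MHz


if __name__ == "__main__":
    for name, cycle_ns in (("FIFO", 4.0), ("Multiplier", 8.3)):
        print(f"{name}: {throughput_mhz(cycle_ns):.1f} MHz")
```

By the same arithmetic, the 8.3 ns multiplier cycle time works out to roughly 120 MHz.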
Slide 46: Performance Analysis
- Forward latency overhead
  - 2.2 ns for both nodes
  - Overhead independent of function block size
  - Includes: LEDR CD, control unit, input/output converters
- Throughput increased by concurrency
  - Benefit: 2.2 ns reduction in cycle time (vs. post-reset ACK)
  - Savings achieved even in environment without channel latency
- Core converter overhead (no CD) extremely low
  - Only 1.1 ns average latency for converters and control
- Completion detectors
  - Account for half of forward latency overhead
  - Account for 55% of FIFO cycle time
  - Faster CDs would provide big improvement
Slide 47: Outline
- Motivation and Contribution
- Proposed System Architecture
- Experimental Results
- Extensions: Other Signaling Styles
  - Converters for 1-of-4 function blocks
  - Converters for bundled data function block
- Conclusions and Future Work
Slide 48: Extensions to Other Local Protocols
- Only small changes to handle 1-of-4 or bundled data
  - No change to control block
- 1-of-4 encoding
  - Input/output converters
    - Small changes to logic
  - Needs standard 1-of-4 completion detector
- Single-rail bundled data
  - Input converter not needed: use LEDR data rail
  - Output converter
    - New basic circuit required (see paper for details)
  - Function block completion detection
    - Use bundled done signal
    - Asymmetric delay chain (fast reset)
Slide 49: Outline
- Background and Motivation
- Contribution
- Proposed System Architecture
- Experimental Results
- Extensions: Other Signaling Styles
- Conclusions and Future Work
  - Summary and Conclusion
  - Future Work
Slide 50: Summary and Conclusions
- Support heterogeneous SOCs using hybrid protocols
  - LEDR: low-power, delay-insensitive communication fabric
  - Dual-rail four-phase: simple, fast logic blocks
- Designed converters for LEDR/four-phase SOC
  - Low-latency, high-throughput, timing-robust design
- Robust concurrency system developed
  - Exploits four-phase reset to mask communication time
- Simulations with realistic mid-sized function nodes
  - Demonstrated low latency overhead
  - Demonstrated low area overhead
  - Achieved throughputs up to 250 MHz for FIFO stage
Slide 51: Future Work
- Evaluating system-level benefits
  - Determine design spaces where converters are most useful
  - Quantify benefits over using either protocol exclusively
  - Optimal partitioning of converter nodes
  - Explore dependence on system topology
- Potential applications: use in async SOCs
  - Beigne/Vivet: GALS NoC architectures (Async-06)
  - Scott et al. (Intel/Silistix): PXA27x system (Async-07)
  - Dobkin/Ginosar/Kolodny: fast LEDR serial links (Async-06/07)
    - Convert 4-phase dual-rail to LEDR (for parallel load)