Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication


Title: Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication


1
Efficient Asynchronous Protocol Converters for
Two-Phase Delay-Insensitive Global Communication
  • Amitava Mitra
  • Intel Corp., Bangalore, India
  • William F. McLaughlin
  • Columbia University, Electrical Engineering
  • Steven M. Nowick
  • Columbia University, Computer Science

2
Outline
  • Motivation and Contribution
  • System-on-Chip Concepts and Trends
  • Asynchronous Signaling Styles
  • Target Asynchronous SOC Architecture
  • Contribution
  • Proposed System Architecture
  • Experimental Results
  • Extensions: Other Signaling Styles
  • Conclusions and Future Work

3
System-on-Chip (SOC) Concept and Trends
  • Microelectronic trends enabling SOC design
  • Increasing integration density and chip size
  • Formerly discrete functions (memory, I/O) now
    integrated
  • Popularity of multi-core designs
  • Heterogeneous SOC
  • Large complex chip with broad functionality
  • Many independent computation nodes
  • Multiple cores, memories, accelerators,
    multimedia processing, etc.
  • Often includes multiple timing domains
  • Complex network-style interconnect fabric
  • Challenges in Heterogeneous SOC design
  • Wire costs not scaling down with device size
  • Increasing proportion of power and delay in
    interconnect
  • Robust and high-performance interconnect design
  • High latencies between remote nodes
  • Mixed timing, timing variability/uncertainty
  • Need to support varied components:
    modular/scalable design

4
SOC Communication Fabric
  • Growing factor in overall system performance
  • Ideal Requirements
  • Speed: high throughput, low latency
  • Low power
  • Robust to timing variations
  • Flexibility: integrate modular IPs and upgrades
  • Asynchronous design well-suited to these goals
  • Timing-robust, flexible designs
  • Lower power than synchronous
  • Work by Quinton, Greenstreet, and Wilton (ICCD
    2005)
  • GALS-style
  • global LEDR interconnect with local synchronous
    blocks
  • does not provide details of protocol converters

5
Asynchronous for SOC Communication
  • Advantages of asynchronous global communication
  • Delay-insensitive (DI) encoding
  • Removes timing constraints on global routing
  • No clock signals to route across chip
  • Significant power advantage
  • Can support both async and sync computation
  • Delay-insensitive async logic combats growing
    variability concerns
  • GALS style: Globally-Asynchronous
    Locally-Synchronous
  • Several popular async signaling protocols
  • Dual rail four-phase, LEDR, 1-of-4, bundled data,
    others
  • No single protocol ideal for both logic and
    communication

6
Background: LEDR Signaling
  • Dual-rail encoding: two wires per bit,
    delay-insensitive
  • Level-encoding
  • Data rail holds actual data value
  • Parity rail holds parity value
  • Alternating-phase protocol
  • Encoding parity alternates between odd and even

[Table: LEDR encoding of each bit value (data rail, parity rail) in each phase]
7
LEDR Signaling
  • Exactly one wire transition for each new data
    item

[Figure: LEDR waveform showing the data and parity rails. The data rail carries the bit value in both phases; the parity rail phase alternates with each data item.]
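The LEDR rules on the two slides above can be sketched in Python. The rail polarity used here (rails equal in the even phase, different in the odd phase) is an assumption, since the encoding table did not survive in this transcript, and `ledr_encode`, `ledr_stream`, and `transitions` are illustrative names rather than anything from the talk.

```python
def ledr_encode(bit, phase):
    """Return (data_rail, parity_rail) for one bit.

    Assumed convention: phase 0 (even) means the rails are equal,
    phase 1 (odd) means the rails differ. The data rail always
    carries the bit value itself.
    """
    return bit, bit ^ phase


def ledr_stream(bits, first_phase=1):
    """Encode a bit sequence, alternating the phase with each data item."""
    phase = first_phase
    out = []
    for b in bits:
        out.append(ledr_encode(b, phase))
        phase ^= 1
    return out


def transitions(prev, curr):
    """Count the rails that changed between two consecutive codewords."""
    return sum(p != c for p, c in zip(prev, curr))


# Start from the even-phase codeword for 0, then send a few bits.
sent = [0, 1, 0, 0, 1, 1, 1]
words = [ledr_encode(0, 0)] + ledr_stream(sent)
changes = [transitions(a, b) for a, b in zip(words, words[1:])]
print(changes)
```

Under this assumed polarity, every entry of `changes` is 1: a new item with the same value toggles only the parity rail, and a new item with a different value toggles only the data rail.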
8
Four-Phase Dual-Rail Signaling
  • Alternative DI Code
  • Key Differences
  • Four-phase (Return-to-Zero) protocol
  • Spacer (reset) state required between each data
    item
  • One-hot encoding
  • True rail (encodes 1) false rail (encodes 0)

[Figure: four-phase dual-rail waveform for data values 1, 0, 1, 1 on the true and false rails, alternating between evaluation (one rail high) and reset (both rails low).]
9
Four-Phase Dual-Rail vs. LEDR
  • Advantages of four-phase dual-rail
  • Delay-insensitive logic using standard gates
  • Implementations are simple and fast; widely used
  • LEDR: complex, impractical
  • Disadvantages of four-phase dual-rail
  • System-level communication throughput
  • Spacer state doubles round-trip communication
    latency
  • LEDR: no spacer required
  • Power dissipation
  • Two transitions/bit (up and down) for each data
    item
  • LEDR: only one transition/bit
  • Conclusion
  • Four-phase dual-rail better for implementing
    function blocks
  • LEDR is better for global communication
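The transition counts behind this comparison can be checked with a small Python sketch. It assumes a one-bit channel, a four-phase channel that starts in the spacer state, and an LEDR polarity where the rails are equal in the even phase; the function names are illustrative only.

```python
def four_phase_words(bits):
    """Return (true_rail, false_rail) levels: a one-hot codeword, then a spacer."""
    words = [(0, 0)]  # start in the reset (spacer) state
    for b in bits:
        words.append((1, 0) if b else (0, 1))  # evaluate: one rail high
        words.append((0, 0))                   # reset: both rails low
    return words


def ledr_words(bits):
    """Return (data_rail, parity_rail) levels; assumed polarity: rails equal in even phase."""
    words = [(0, 0)]  # even-phase codeword for 0
    phase = 1
    for b in bits:
        words.append((b, b ^ phase))
        phase ^= 1
    return words


def count_transitions(words):
    """Total number of rail changes across consecutive codewords."""
    return sum(a != b for w0, w1 in zip(words, words[1:]) for a, b in zip(w0, w1))


bits = [1, 0, 1, 1]
four_phase_total = count_transitions(four_phase_words(bits))
ledr_total = count_transitions(ledr_words(bits))
print(four_phase_total, ledr_total)
```

For these four data items the four-phase channel makes 8 rail transitions and the LEDR channel makes 4, which is the two-versus-one ratio per bit stated on the slide.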

10
Target Asynchronous SOC Architecture
Our goal: protocol converters to enable this
global LEDR SOC
  • Three major components
  • Global communication network (LEDR)
  • Local computation nodes (varied styles)
  • New requirement: protocol converters at
    interfaces
  • Allow full separation of computation and
    communication

11
Contribution
  • High-speed protocol converters to enable
    heterogeneous SOC architectures
  • Supports high-throughput, robust global
    communication
  • LEDR encoding
  • Supports efficient design of local function
    blocks
  • (i) 4-phase dual-rail, (ii) 1-of-4, (iii)
    single-rail bundled data
  • Features
  • Family of low-latency protocol converters
  • support above 3 local encoding styles
  • High throughput
  • facilitates concurrent interaction of nodes
  • Timing-robust
  • converters almost entirely QDI
  • Low design effort
  • standard cell design flow
  • Fully implemented in 0.18 µm CMOS
  • Layout and simulation
  • FIFO throughputs up to 250 MHz

12
Two Target SOC Topologies
  • 1. Pipeline-style topology
  • Feed-forward data path
  • uni-directional token flow
  • Receiving node returns a single ACK (control
    signal)
  • Supports concurrency between nodes

Data feeds forward
Acknowledge sent back
13
Two SOC Topologies (cont.)
  • 2. Server-style topology
  • Client passes data token to server
  • Server computes/returns data token to client
    (result)
  • Explicit ACK unnecessary
  • Proposed SOC architecture supports both topologies

Four-phase server
Four-phase data client
Bi-directional data flow: data passed back to
client on completion
14
Outline
  • Motivation and Contribution
  • Proposed System Architecture
  • Architecture Overview
  • System Simulation
  • Detailed Hardware Implementation
  • Timing Analysis
  • Experimental Results
  • Extensions: Other Signaling Styles
  • Conclusions and Future Work

15
Architecture Overview
Four-phase core
LEDR input
LEDR output
  • External LEDR interface, internal four-phase core
  • Four-phase signals are shown in red
  • Two-phase or transition signals are shown in
    yellow

16
Control Signals
  • Two-phase control signals

Phase of LEDR input (request from left)
Phase of LEDR output (forward complete)
Acknowledge to left neighbor
Acknowledge from right neighbor
17
Control Signals
  • Four-phase control signals

Completion detect: four-phase evaluate and RZ
Enable: four-phase evaluate and RZ
18
System Simulation
  • LEDR inputs begin arriving at quiescent system

LEDR inputs arrive
Completion detection
19
System Simulation
  • Input completion detection sent to control

All input phases matching
Transition to new phase
20
System Simulation
  • Control enables four-phase evaluate phase

Enable rises
21
System Simulation
  • LEDR input converted to four-phase

Enable now high
One wire of each four-phase pair rises
22
System Simulation
  • Four-phase function evaluation

23
System Simulation
  • Four-phase bits decoded to LEDR
  • Each bit converted as soon as it computes

LEDR outputs to next node generated
Four-phase complete not used in evaluate phase
24
System Simulation
  • LEDR output completion detection

Output pairs
ACK from right may come any time after all pairs
are sent
25
System Simulation
  • Control enables four-phase reset phase

Enable falls
26
System Simulation
  • Function block inputs return-to-zero
  • ACK is sent concurrently to left

Enable now low
Pipeline concurrency: request new data during
reset phase
27
System Simulation
  • Four-phase reset propagates through logic block

New data may arrive now that ACK has been sent
Reset Completion detection
Enable remains low
28
System Simulation
  • Four-phase reset completes
  • Complete internal cycle has now been performed

Complete falls
29
System Simulation
  • New evaluate phase begins when Enable rises again
  • Pre-conditions: reset finished, new data REQ,
    and old data ACK

Three-way synchronization
Input phase transitions when new data ready
ACK transitions when outputs safe to change
Complete low (means reset finished)
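The three pre-conditions can be written as a single predicate. This is a behavioral sketch in Python, not the control circuit from the talk: the two-phase signals are modeled as levels that toggle per event, and the bookkeeping arguments `consumed_phase` and `out_phase` are hypothetical names introduced here.

```python
def enable_may_rise(complete, in_phase, consumed_phase, ack_right, out_phase):
    """Three-way synchronization before a new evaluate phase.

    complete:       four-phase Complete level (0 means reset finished)
    in_phase:       current phase of the LEDR inputs (two-phase request)
    consumed_phase: input phase of the last data item already processed
    ack_right:      two-phase acknowledge level from the right neighbor
    out_phase:      phase of the last LEDR output that was sent

    Assumption: the right neighbor acknowledges by toggling ack_right
    until it matches the phase of the output it has taken.
    """
    reset_finished = complete == 0
    new_data_req = in_phase != consumed_phase
    old_data_acked = ack_right == out_phase
    return reset_finished and new_data_req and old_data_acked


# All three conditions hold: Enable may rise.
ready = enable_may_rise(complete=0, in_phase=1, consumed_phase=0,
                        ack_right=1, out_phase=1)
# Reset still propagating (Complete high): Enable must stay low.
still_resetting = enable_may_rise(complete=1, in_phase=1, consumed_phase=0,
                                  ack_right=1, out_phase=1)
print(ready, still_resetting)
```

Any one missing condition keeps Enable low, which is what lets new data and the acknowledge arrive in either order while the reset is still completing.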
30
Detailed Hardware Implementation
Four-phase core
LEDR input
LEDR output
  • Each block implemented in CMOS standard cells
  • Design has few non-QDI timing constraints

31
Four-phase Encode (Input Converter)
  • Converts LEDR input to four-phase dual-rail
  • Enable = 1: outputs evaluate based on LEDR data
  • Enable = 0: outputs reset (LEDR data blocked)
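The behavior of the input converter for one bit can be sketched as follows. The gate-level form (two AND gates sharing Enable) is an assumption made for illustration, and the sketch leaves out any latching that the real circuit may use; only the input/output behavior comes from the slide.

```python
def ledr_to_four_phase(data_rail, enable):
    """Return (true_rail, false_rail) for one bit.

    Only the LEDR data rail is used, because it carries the bit value
    in both phases. Assumed form: two AND gates gated by Enable.
    """
    true_rail = enable & data_rail
    false_rail = enable & (1 - data_rail)
    return true_rail, false_rail


evaluate_one = ledr_to_four_phase(data_rail=1, enable=1)   # true rail rises
evaluate_zero = ledr_to_four_phase(data_rail=0, enable=1)  # false rail rises
blocked = ledr_to_four_phase(data_rail=1, enable=0)        # spacer: both low
print(evaluate_one, evaluate_zero, blocked)
```

With Enable low the output is the spacer regardless of the LEDR data, so new LEDR inputs can arrive during the reset phase without disturbing the function block.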

32
Four-phase Decode (Output Converter)
  • Converts four-phase bits to LEDR output
  • LEDR data rail encoding
  • Assert either S (1 value) or R (0 value), then
    return-to-hold
  • More robust alternative: C-element
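The set/reset behavior of the data-rail decoder can be modeled with a small latch class. This is a minimal sketch assuming that the true rail drives S and the false rail drives R; `SRLatch` and `decode_data_rail` are illustrative names, and the parity rail is not modeled.

```python
class SRLatch:
    """Minimal set/reset latch: S sets, R resets, both low holds."""

    def __init__(self, q=0):
        self.q = q

    def step(self, s, r):
        if s and r:
            raise ValueError("S and R must not be asserted together")
        if s:
            self.q = 1
        elif r:
            self.q = 0
        return self.q


def decode_data_rail(latch, true_rail, false_rail):
    """Drive the LEDR data rail from one four-phase dual-rail bit."""
    return latch.step(s=true_rail, r=false_rail)


latch = SRLatch()
# Evaluate 1, spacer, evaluate 0, spacer.
rails = [(1, 0), (0, 0), (0, 1), (0, 0)]
data_rail = [decode_data_rail(latch, t, f) for t, f in rails]
print(data_rail)
```

The LEDR data rail keeps its value through each spacer, which is the return-to-hold step: the four-phase reset is invisible on the LEDR side.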

33
Four-phase Decode (Output Converter)
  • Converts four-phase bits to LEDR output
  • LEDR parity rail encoding

even phase
odd phase
34
1-Bit Completion Detectors
  • LEDR CD at input and output
  • Four-phase CD in function block
  • Both protocols have a one-gate CD
  • XOR (parity) for LEDR
  • OR for four-phase dual-rail

1-bit LEDR completion detector
1-bit four-phase completion detector
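Both one-gate detectors are easy to state as Python functions. The function names are illustrative; which XOR output value is called even or odd is not recoverable from this transcript, so the sketch only reports the raw gate output.

```python
def ledr_phase(data_rail, parity_rail):
    """1-bit LEDR completion detector: XOR of the two rails gives the phase."""
    return data_rail ^ parity_rail


def four_phase_valid(true_rail, false_rail):
    """1-bit four-phase completion detector: OR is 1 when a data value is present."""
    return true_rail | false_rail


# The two LEDR codewords with equal rails share one phase,
# and the two with different rails share the other.
same_phase = ledr_phase(0, 0) == ledr_phase(1, 1)
other_phase = ledr_phase(0, 1) == ledr_phase(1, 0)
print(same_phase, other_phase, four_phase_valid(0, 0), four_phase_valid(1, 0))
```

The LEDR detector output toggles once per data item, while the four-phase detector output rises on evaluation and falls again on reset.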
35
N-Bit Completion Detectors
  • C-element trees
  • Used for both LEDR and four-phase
  • C-element standard cell implementation (AOI222
    w/feedback)
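A behavioral model of a C-element tree shows how the one-bit detector outputs combine. This is a sketch under stated assumptions: two-input C-elements, a tree built by pairing neighbors with an odd signal passed up unchanged, and class names (`CElement`, `CTree`) invented here. It does not model the AOI222 cell.

```python
class CElement:
    """Two-input Muller C-element: output follows the inputs when they agree, else holds."""

    def __init__(self, out=0):
        self.out = out

    def step(self, a, b):
        if a == b:
            self.out = a
        return self.out


class CTree:
    """Tree of two-input C-elements over n one-bit detector outputs."""

    def __init__(self, n):
        self.levels = []
        width = n
        while width > 1:
            self.levels.append([CElement() for _ in range(width // 2)])
            width = width // 2 + width % 2

    def step(self, signals):
        for level in self.levels:
            paired = [c.step(signals[2 * i], signals[2 * i + 1])
                      for i, c in enumerate(level)]
            if len(signals) % 2:  # an odd signal passes up unchanged
                paired.append(signals[-1])
            signals = paired
        return signals[0]


tree = CTree(4)
trace = [
    tree.step([1, 1, 1, 0]),  # one bit still missing: stays 0
    tree.step([1, 1, 1, 1]),  # all bits complete: rises
    tree.step([0, 1, 1, 1]),  # partial change: holds 1
    tree.step([0, 0, 0, 0]),  # all bits changed back: falls
]
print(trace)
```

Because the output changes only when every input has changed, the same tree serves both protocols: it waits for all bits to reach the new LEDR phase, or for all four-phase bits to become valid and later all to reset.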

36
Control Block
  • Main purpose: controls 4-phase function block
  • 4-phase eval requires 3-way synchronization
  • Function block: previous RZ complete
  • Primary inputs: new data arrival
  • Right interface (in pipeline): ACK received
  • In pipeline topology: also sends left ACK

For pipeline topology only
37
Control Block
  • Converts two-phase inputs to four-phase outputs

Two-phase to four-phase conversion
38
Control Block Signaling Conversion
Pulse-mode (timed)
Transition-signal (falling or rising)
Four-phase (level-sensitive)
SR latch captures the pulse
Inverter and XNOR form simple pulse gen
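The inverter-plus-XNOR pulse generator and the latch that captures its pulse can be illustrated with a discrete-time Python sketch. The unit of time, the inverter delay of two steps, and the source of the latch reset are all assumptions made for illustration; the slide only names the gates.

```python
def xnor(a, b):
    return 1 - (a ^ b)


def pulse_train(samples, inv_delay=2):
    """Pulse generator built from a delayed inverter and an XNOR gate.

    samples:   levels of a two-phase (transition) signal, one per time step
    inv_delay: assumed inverter delay in time steps; it sets the pulse width
    """
    pulses = []
    for t, x in enumerate(samples):
        delayed = samples[max(t - inv_delay, 0)]  # input as seen inv_delay steps ago
        inv_out = 1 - delayed
        pulses.append(xnor(x, inv_out))
    return pulses


def sr_capture(pulses, resets):
    """SR latch: a pulse sets the level; a (hypothetical) reset input clears it."""
    q, levels = 0, []
    for s, r in zip(pulses, resets):
        if s:
            q = 1
        elif r:
            q = 0
        levels.append(q)
    return levels


signal = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]   # one rising and one falling transition
pulses = pulse_train(signal)
levels = sr_capture(pulses, resets=[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
print(pulses)
print(levels)
```

In steady state the signal and its inverse differ, so the XNOR output is low; right after either kind of transition they briefly agree, producing a pulse whose width equals the inverter delay. That dependence on delay is why the pulse length carries a two-sided timing constraint.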
39
Timing Requirements
  • Circuits almost entirely QDI
  • Exceptions
  • Control block
  • Two-sided timing constraint on length of pulse
  • Sensitive to both gate and wire delays
  • Careful layout required
  • Latches: simple hold-time constraints
  • SR latches can be replaced by C-elements
  • C-elements also have implementation-specific
    timing constraints
  • SR latch much faster than our standard cell
    C-element
  • D latch can be removed at cost of concurrency

40
Outline
  • Motivation and Contribution
  • Proposed System Architecture
  • Experimental Results
  • Design Methodology
  • Datapath Setup
  • Simulation Results
  • Latency and Throughput Analysis
  • Extensions: Other Signaling Styles
  • Conclusions and Future Work

41
Design Methodology
  • Standard cell design flow with complete layout
  • 0.18 µm TSMC CMOS process
  • 4 metal layers of 7 available used in routing
  • Custom place-and-route used
  • Only major layout concern: pulse generator
    circuit
  • Design could be automated with constraints on
    pulse
  • Analog simulations based on layout-extracted
    design
  • Test vectors including limiting fast and slow
    cases

42
Datapath Implementation
  • Two function blocks implemented
  • An 8x8 carry-save multiplier
  • An empty FIFO stage
  • FIFO contains four-phase completion detector only
  • Demonstrates minimum possible node latency
  • Blocks are QDI in evaluate, but eager in reset
  • Implemented in combinational CMOS
  • DIMS-style logic (with C-elements) could be
    used instead
  • QDI in both directions
  • Increases both forward and reverse latencies

43
Multiplier Layout
  • Includes dual rail multiplier and all conversion
    circuits
  • Total area of 0.051 mm²
  • FIFO stage has area of 0.018 mm²

44
Measured Block Latencies
45
Performance Results
  • 3 Metrics
  • Forward Latency: input arrival → output data
    available
  • Average values: multiplier 6.8 ns, FIFO 2.9 ns
  • Stabilization Time: input arrival → reset
    complete (circuit quiescent)
  • Multiplier 10.5 ns, FIFO 6.3 ns
  • Pipelined Cycle Time: min processing time/data
    item (steady-state)
  • Multiplier 8.3 ns, FIFO 4.0 ns
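Steady-state throughput is the reciprocal of the pipelined cycle time, which a short Python calculation makes explicit. The FIFO result matches the 250 MHz figure quoted elsewhere in the talk; the multiplier figure is derived here and is not stated on the slides.

```python
def throughput_mhz(cycle_time_ns):
    """Steady-state throughput in MHz for a pipelined cycle time in ns."""
    return 1e3 / cycle_time_ns


fifo_mhz = throughput_mhz(4.0)
multiplier_mhz = throughput_mhz(8.3)
print(round(fifo_mhz), round(multiplier_mhz, 1))
```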

46
Performance Analysis
  • Forward latency overhead
  • 2.2 ns for both nodes
  • Overhead independent of function block size
  • Includes
  • LEDR CD, control unit, input/output converters
  • Throughput increased by concurrency
  • Benefit: 2.2 ns reduction in cycle time (vs.
    post-reset ACK)
  • Savings achieved even in environment without
    channel latency
  • Core converter overhead (no CD) extremely low
  • Only 1.1 ns average latency for converters and
    control
  • Completion detectors
  • Account for half of forward latency overhead
  • Account for 55% of FIFO cycle time
  • Faster CDs would provide big improvement
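The overhead figures on this slide are mutually consistent, as a short Python check shows. The inputs are the slide's numbers plus the 4.0 ns FIFO cycle time from the performance results; taking the completion-detector share of the FIFO cycle time as 55 percent is an assumption, and the derived values are arithmetic, not additional measurements.

```python
forward_overhead_ns = 2.2    # forward latency overhead, from the slide
converter_control_ns = 1.1   # converters and control only, from the slide
fifo_cycle_ns = 4.0          # FIFO pipelined cycle time, from the results slide

# Completion-detector part of the forward overhead and its share.
cd_forward_ns = forward_overhead_ns - converter_control_ns
cd_forward_share = cd_forward_ns / forward_overhead_ns

# Completion-detector part of the FIFO cycle time (assumed share: 55%).
cd_cycle_ns = 0.55 * fifo_cycle_ns

print(round(cd_forward_ns, 2), round(cd_forward_share, 2), round(cd_cycle_ns, 2))
```

The completion detectors come out at about 1.1 ns of the forward overhead (half of it) and about 2.2 ns of the FIFO cycle, which is why faster detectors are singled out as the main opportunity.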

47
Outline
  • Motivation and Contribution
  • Proposed System Architecture
  • Experimental Results
  • Extensions: Other Signaling Styles
  • Converters for 1-of-4 function blocks
  • Converters for bundled data function block
  • Conclusions and Future Work

48
Extensions to Other Local Protocols
  • Only small changes to handle 1-of-4 or bundled
    data
  • No change to control block
  • 1-of-4 encoding
  • Input/output converters
  • Small changes to logic
  • Needs standard 1-of-4 completion detector
  • Single-rail bundled data
  • Input converter: not needed; use LEDR data rail
  • Output converter
  • New basic circuit required (see paper for
    details)
  • Function block completion detection
  • Use bundled done signal
  • Asymmetric delay chain (fast reset)

49
Outline
  • Background and Motivation
  • Contribution
  • Proposed System Architecture
  • Experimental Results
  • Extensions: Other Signaling Styles
  • Conclusions and Future Work
  • Summary and Conclusion
  • Future Work

50
Summary and Conclusions
  • Support heterogeneous SOCs using hybrid protocols
  • LEDR: low-power, delay-insensitive communication
    fabric
  • Dual-rail four-phase: simple, fast logic blocks
  • Designed converters for LEDR/four-phase SOC
  • Low latency, high throughput, timing robust
    design
  • Robust concurrency system developed
  • Exploits four-phase reset to mask communication
    time
  • Simulations with realistic mid-sized function
    nodes
  • Demonstrated low latency overhead
  • Demonstrated low area overhead
  • Achieved throughputs up to 250 MHz for FIFO stage

51
Future Work
  • Evaluating system-level benefits
  • Determine design spaces where converters most
    useful
  • Quantify benefits over using either protocol
    exclusively
  • Optimal partitioning of converter nodes
  • Explore dependence on system topology
  • Potential applications: use in async SOCs
  • Beigne/Vivet: GALS NoC architectures (Async-06)
  • Scott et al. (Intel/Silistix): PXA27x system
    (Async-07)
  • Dobkin/Ginosar/Kolodny: fast LEDR serial links
    (Async-06/07)
  • Convert 4-phase dual-rail to LEDR (for parallel
    load)