Asynchronous Links, for NanoNets?
Transcript and Presenter's Notes
1
Asynchronous Links, for NanoNets?
  • Alex Yakovlev, University of Newcastle, UK

2
Motivation-1
  • At very deep submicron, gate delay is much less
    than interconnect delay; total interconnect
    length can reach several meters, and interconnect
    delay can be as much as 90% of total path delay
    in VDSM circuits
  • Timing is a problem, particularly for global
    wires
  • Multiple clock domains are a reality; interfacing
    between them is a problem
  • ITRS'05 predicted a 4x (8x) increase in global
    asynchronous signalling by 2012 (2020)

3
Motivation-2
  • Variability and uncertainty
  • Geometry and process: for long channels, intra-die
    variations are less correlated between different
    parts of the interconnect, both for wires and
    repeaters
  • e.g. M4 and M5 resistance/um massively differ,
    leading to mistracking (C. Visweswariah, SLIP'06)
  • e.g. at 250nm, clock skew has 25% variability due
    to interconnect variations (Y. Liu et al., DAC'00)
  • Behavioural crosstalk: sidewall capacitance can
    cause up to 7x variation in delay (R. Ho,
    M. Horowitz)

4
A Network on Chip
Async Links
5
Example from the Past: Fault-Tolerant Self-Timed
Ring (Varshavsky et al., 1986)
Built for an onboard airborne computer-control system
which tolerated up to two faults. The self-timed ring
was a GALS system with self-checking and
self-repair at the hardware level
Individually clocked subsystems
Self-timed adapters forming a ring
6
Communication Channel Adapter
Much higher reliability than a bus and other
forms of redundancy. The MCC was developed in
TTL-Schottky gate arrays, approx. 2K gates.
Data (DR, DS) is encoded using a 3-of-6 Sperner code
(16 data values per half-byte, plus 4 tokens for the
ring acquisition protocol); AR, AS are
acknowledgements; RR, RS are spare (self-repair)
lines
7
Outline
  • Token-based view of communication
  • Basics of asynchronous signalling
  • Self-timed data encoding
  • Pipelining
  • How to hide acknowledgements
  • Serial vs Parallel links
  • Arbiters and routers
  • Async2sync interface
  • CAD issues

8
Data exchange token-based view
[Figure: token flow from source through Tx and Rx to dest]
  • Question 1: when can Rx look at the incoming
    data?
  • Data validity issue: forming a well-defined token

9
Data exchange token-based view
[Figure: token flow from source through Tx and Rx to dest]
  • Question 1: when can Rx look at the data?
  • Data validity issue: forming a well-defined
    token
  • Question 2: when can Tx send new data?
  • Acknowledgement issue: separation between tokens

10
Data exchange token-based view
  • Question 1: when can Rx look at the data?
  • Data validity issue: forming a well-defined
    token
  • Question 2: when can Tx send new data?
  • Acknowledgement issue: separation between tokens
  • These are fundamental issues of flow control at
    the physical and link levels
  • The answers are determined by many design
    aspects: technology level, system architecture
    (application, pipelining), latency, throughput,
    power, design process, etc.

11
Tokens and spaces with global clocking
  • In globally clocked systems both Q1 and Q2 are
    resolved with the aid of clock pulses

12
Tokens and spaces
[Figure: source-Tx-Rx-dest channel with the data bundled with
D_valid, and separate Clk_tx and Clk_rx domains]
  • Without global clocking, Q1 can be resolved
    differently from Q2
  • E.g. Q1: source-synchronous (mesochronous),
    bundled data or self-synchronising codes; Q2:
    ack or stop signal, or local timing

13
Tokens and spaces
[Figure: the same channel with the data bundled with D_valid
and explicit ack signals at each stage]
  • Without global clocking, Q1 can be resolved
    differently from Q2
  • E.g. Q1: source-synchronous (mesochronous),
    bundled data or self-synchronising codes; Q2:
    ack or stop signal, or local timing

14
Petri net model
dest
Tx
Rx
source
Data Valid
Tx delay
Rx delay
One way delay, but may be unsafe!
dest
Tx
Rx
source
Data Valid
ack
Tx delay or ack
Rx delay or ack
Always safe but with a round trip delay!
15
Asynchronous handshake signalling
  • Valid data tokens and safe spaces between them
    can be created by different means of signalling
    and encoding
  • Level-based -> Return-To-Zero (RTZ) or 4-phase
    protocol
  • Transition-based -> Non-Return-to-Zero (NRZ) or
    2-phase protocol
  • Pulse-based, e.g. GasP
  • Phase-difference-based
  • Data encoding: bundled data (BD),
    delay-insensitive (DI)

16
Handshake Signalling Protocols
  • Level Signalling (RTZ or 4-phase)
  • Transition Signalling (NRZ or 2-phase)

[Figure: req/ack waveforms showing one handshake cycle for
each protocol]
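The difference between the two protocols is easy to see by listing the signal events per transfer; the sketch below (hypothetical helper functions, not part of the talk) prints four req/ack edges per RTZ cycle versus two per NRZ cycle.

```python
def rtz_cycle():
    """Return-to-Zero (4-phase): req+, ack+, req-, ack- for every data item."""
    return ["req=1", "ack=1", "req=0", "ack=0"]

def nrz_cycle(previous_level):
    """Non-Return-to-Zero (2-phase): every transition of req/ack is an event."""
    level = 1 - previous_level          # toggle relative to the previous cycle
    return [f"req={level}", f"ack={level}"], level

if __name__ == "__main__":
    print("RTZ, two transfers:", rtz_cycle() + rtz_cycle())
    events, level = nrz_cycle(0)
    more, _ = nrz_cycle(level)
    print("NRZ, two transfers:", events + more)
```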
17
Handshake Signalling Protocols
  • Pulse Signalling

[Figure: req and ack pulse waveforms, one cycle]

  • Single-track Signalling (GasP)

[Figure: a single wire carrying both req and ack]
18
GasP signalling
[Figure: GasP stage with pull-up from the predecessor (req),
pull-up/pull-down here, pull-down from the successor (ack),
and pulse-length control loops]
Source: R. Ho et al., ASYNC'04
19
Data encoding
  • Bundled data
  • Code is positional binary; the token is delimited
    by the Req signal, which arrives a safe set-up
    delay after the data
  • Delay-insensitive codes (tokens determined by the
    codeword values; require a spacer, or NULL, state
    if RTZ)
  • 1-of-2 (dual-rail per bit): systematic code;
    encoding and decoding are straightforward
  • m-of-n (n>2): not systematic, i.e. incurs
    encoding and decoding costs; optimal when m = n/2
  • One-hot, 1-of-n (n>2): completion detection is
    easy; not practical beyond n = 4
  • Systematic codes, such as Berger, incur complex
    completion detection

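As a small illustration of the first two encodings above, the Python sketch below (illustrative names only) builds a bundled-data word and its RTZ dual-rail equivalent, including the NULL spacer that separates consecutive tokens.

```python
def bundled_data(bits):
    """Bundled data: plain binary word plus a req line that must arrive
    a safe set-up delay after the data (timing not modelled here)."""
    return {"data": list(bits), "req": 1}

def dual_rail(bits):
    """RTZ dual-rail: each bit b becomes a pair (d.1, d.0);
    the all-zero pairs form the NULL spacer between tokens."""
    return [(1, 0) if b else (0, 1) for b in bits]

def null_spacer(width):
    return [(0, 0)] * width

if __name__ == "__main__":
    word = [1, 0, 1, 1]
    print("bundled  :", bundled_data(word))
    print("dual-rail:", dual_rail(word))
    print("spacer   :", null_spacer(len(word)))
```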
20
Bundled Data
[Figure: bundled-data handshakes. RTZ: Data, req and ack with
req/ack returning to zero each cycle. NRZ: Data with req/ack
transitions, one cycle per data item]
21
DI encoded data (Dual-Rail)
[Figure: dual-rail waveforms. RTZ: Data.0/Data.1 rails carry
Logical 0 and Logical 1 tokens separated by the NULL (spacer)
state, with an ack per cycle. NRZ: transitions on Data.0 and
Data.1 encode Logical 0 and Logical 1, with an ack per cycle]
22
DI encoded data (Dual-Rail)
[Figure: the same dual-rail RTZ and NRZ waveforms as on the
previous slide]
The NRZ coding leads to a complex logic implementation: it is
hard to track odd and even phases and logic values; hence see
LEDR below
23
DI codes (1-of-n and m-of-n)
  • 1-of-4
  • 0001 -> 00, 0010 -> 01, 0100 -> 10, 1000 -> 11
  • 2-of-4
  • 1100, 1010, 1001, 0110, 0101, 0011: total 6
    combinations (cf. 2-bit dual-rail: 4 combinations)
  • 3-of-6
  • 111000, 110100, ..., 000111: total 20 combinations
    (can encode 4 bits + 4 control tokens)
  • 2-of-7
  • 1100000, 1010000, ..., 0000011: total 21
    combinations (4 bits + 5 control tokens)

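The codeword counts quoted above follow directly from binomial coefficients; a short check in Python:

```python
from math import comb, log2

# (name, m, n) for the codes listed above
for name, m, n in [("1-of-4", 1, 4), ("2-of-4", 2, 4),
                   ("3-of-6", 3, 6), ("2-of-7", 2, 7)]:
    words = comb(n, m)              # number of legal m-of-n codewords
    bits = int(log2(words))         # whole data bits that fit
    spare = words - 2 ** bits       # left over for control tokens
    print(f"{name}: {words} codewords -> {bits} data bits + {spare} control tokens")
```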
24
DI codes completion detection and decoding
  • 1-of-4 completion detection is a 4-input OR gate
    (CD = d0 + d1 + d2 + d3)
  • Decoding 1-of-4 to dual-rail is a set of four
    2-input OR gates (q0.0 = d0 + d2, q0.1 = d1 + d3,
    q1.0 = d0 + d1, q1.1 = d2 + d3)
  • For m-of-n codes, CD and decoding are non-trivial

From J.Bainbridge et al, ASYNC03
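A direct transcription of these two small circuits into Python, for checking the 1-of-4 mapping given two slides earlier (signal names follow the slide):

```python
def completion_detect(d0, d1, d2, d3):
    """1-of-4 completion detection: one 4-input OR over the rails."""
    return d0 | d1 | d2 | d3

def decode_1of4(d0, d1, d2, d3):
    """Four 2-input ORs giving dual-rail outputs (q0 = low bit, q1 = high bit)."""
    return {"q0.0": d0 | d2, "q0.1": d1 | d3,
            "q1.0": d0 | d1, "q1.1": d2 | d3}

if __name__ == "__main__":
    # one-hot rails (d0..d3) for the values 00, 01, 10, 11
    for value, rails in [("00", (1, 0, 0, 0)), ("01", (0, 1, 0, 0)),
                         ("10", (0, 0, 1, 0)), ("11", (0, 0, 0, 1))]:
        print(value, completion_detect(*rails), decode_1of4(*rails))
```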
25
Incomplete DI codes
Incomplete 2-of-7: composed of 1-of-3 and 1-of-4
From J.Bainbridge et al ASYNC03
26
Phase-difference-based encoding (C. D'Alessandro
et al., ASYNC'06, '07)
  • The proposed system consists of encoding a bit of
    data in the phase relationship between two
    signals generated using a reference
  • This would ensure that any transient fault
    appearing on one of the reference signals will be
    ignored if it is not mirrored by a corresponding
    transition on the other line
  • Similarity with multi-wire communication

27
Phase encoding multiple rail
  • No group of wires has the same delay
  • All wires toggle when an item of data is sent
  • Increased number of states available (n wires ->
    n! states), hence more bits per symbol
  • The table below illustrates examples of phase
    encoding compared to the respective m-of-n
    counterparts
Type of Link    Number of states  Bits per Symbol  Extra states  Transitions per symbol  Symbols per packet  Transitions per packet
Phase enc. (4)  24                4                8             4                       32                  128
1-of-4          4                 2                0             2                       64                  128
Phase enc. (6)  720               9                208           6                       15                  90
3-of-6          20                4                4             6                       32                  192
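The table entries can be reproduced from first principles: n wires give n! distinguishable transition orderings, while an m-of-n RTZ symbol costs 2m transitions. A sketch, assuming the 128-bit packet size implied by the symbols-per-packet column:

```python
from math import ceil, comb, factorial, floor, log2

PACKET_BITS = 128                      # assumed; consistent with the table above

def phase_encoded(n_wires):
    states = factorial(n_wires)        # every ordering of the n transitions
    bits = floor(log2(states))
    symbols = ceil(PACKET_BITS / bits)
    # all n wires toggle once per symbol
    return states, bits, states - 2**bits, n_wires, symbols, n_wires * symbols

def m_of_n(m, n):
    states = comb(n, m)
    bits = floor(log2(states))
    symbols = ceil(PACKET_BITS / bits)
    trans = 2 * m                      # RTZ: m rising + m falling edges per symbol
    return states, bits, states - 2**bits, trans, symbols, trans * symbols

print("Phase enc. (4):", phase_encoded(4))
print("1-of-4        :", m_of_n(1, 4))
print("Phase enc. (6):", phase_encoded(6))
print("3-of-6        :", m_of_n(3, 6))
```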
28
Phase encoding Repeater
[Figure: repeater built from phase detectors (mutexes) that
resolve the pairwise arrival order of the wires: 1<3, 3<1,
2<3, 3<2, 1<2, 2<1]
29
Pipelines
Dual-rail pipeline
From J.Bainbridge S. Furber IEEE Micro, 2002
30
The problem of Acking
  • Question 2 ("when can Tx send new data?") has two
    aspects
  • Safety (not overflowing the channel, e.g. when Tx
    and Rx delays vary widely)
  • Performance (maximizing throughput and reducing
    latency)
  • Can we hide the ack (round-trip) delay?

31
To maintain throughput, more pipeline stages are
required, but that costs too much latency and power.
First minimize latency along a long wire (not
specific to asynchronous design), then maximize
throughput (using the wagging-tail buffer approach)
From R.Ho et al. ASYNC04
32
Use of wagging buffer approach
Alternate between top and bottom control
From R.Ho et al. ASYNC04
33
Wagging tail buffer approach
[Figure: wagging-tail buffer: the data channel is driven
alternately by a top (reqtop/acktop) and a bottom
(reqbot/ackbot) control channel]
Top and bottom control channels work at ½ the frequency
of the data channel
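A behavioural sketch of the steering idea (hypothetical class names; handshakes abstracted away): successive data items alternate between the two half-rate control channels, so each channel has a full data cycle's worth of time to complete its ack round trip.

```python
class HalfRateChannel:
    """A control channel that handshakes at half the data rate."""
    def __init__(self, name):
        self.name, self.items = name, []

    def push(self, item):
        self.items.append(item)        # req/ack handshake not modelled

def wagging_send(data, top, bot):
    """Steer alternate items to the top and bottom channels ('wagging')."""
    for i, item in enumerate(data):
        (top if i % 2 == 0 else bot).push(item)

if __name__ == "__main__":
    top, bot = HalfRateChannel("top"), HalfRateChannel("bot")
    wagging_send(list(range(8)), top, bot)
    print(top.name, top.items)         # even-indexed items
    print(bot.name, bot.items)         # odd-indexed items
```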
34
Serial Link vs Parallel Link (from R. Dobkin)
  • Why Serial Link?
  • Less interconnect area
  • Less routing congestion
  • Less coupling
  • Less power (depends on range)
  • The relative improvement grows with technology
    scaling. The example on the right refers to
  • Single-gate-delay serial link
  • Fully-shielded parallel link with an 8-gate-delay
    clock cycle
  • Equal bit-rate
  • Word width N = 8

[Figure: regions where the serial or the parallel link
dissipates less power / requires less area, plotted against
link length (mm) and technology node (nm)]
35
Serialization model
Tx
Rx


Acking at the bit level
36
Serialization model
Tx
Rx
Acking at the word level
37
Serialization model
Tx
Rx
Acking at the word level (with more concurrency)
38
Serial Link Top Structure (R.Dobkin, Async07)
  • Transition signalling instead of sampling:
    two-phase NRZ Level-Encoded Dual-Rail (LEDR)
    asynchronous protocol, a.k.a. data-strobe (DS)
  • Acknowledge per word instead of per bit
  • Synchronizers used at the level of the ack
    signals
  • Wave-pipelining over the channel
  • Differential encoding (DS-DE, IEEE 1355-95)
  • Reported throughput: 67 Gbit/s for a 65nm process
    (viz. one bit per 15ps, the expected FO4 inverter
    delay), based on simulations

39
Encoding Two Phase NRZ LEDR
  • Two Phase Non-Return-to-Zero Level Encoded Dual
    Rail
  • delta encoding (one transition per bit)

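A small Python model of the LEDR rule, under the usual convention (assumed here) that the phase rail equals the data value in even phases and its complement in odd phases; exactly one of the two rails then toggles for every bit sent.

```python
def ledr_encode(bits):
    """LEDR / data-strobe: (data rail, phase rail) per bit; the phase rail is
    the data value in even phases and its complement in odd phases."""
    rails, phase = [], 0
    for b in bits:
        rails.append((b, b ^ phase))
        phase ^= 1                     # even/odd phase alternates per bit
    return rails

def transitions(rails):
    """Count rail toggles between consecutive symbols."""
    return sum((a0 != b0) + (a1 != b1)
               for (a0, a1), (b0, b1) in zip(rails, rails[1:]))

if __name__ == "__main__":
    stream = [1, 1, 0, 1, 0, 0, 1]
    rails = ledr_encode(stream)
    print(rails)
    print(transitions(rails), "transitions for", len(stream) - 1, "bit boundaries")
```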
40
Transmitter Fast SR Approach (from R. Dobkin)
41
Receiver Splitter (from R. Dobkin)
42
Self Timed Networks
  • Router requires priority arbitration
  • Arbitration necessary at every router merge
  • Potential delay at every node on the path
  • BUT
  • Asynchronous merge/arbitration time is average,
    not worst case
  • Adapters to locally clocked cells require
    synchronization
  • Synchronization necessary when clocks are unknown
  • Occurs when receiving data (data valid), and when
    sending (acknowledge)
  • BUT
  • Time can be long (2 cycles?)
  • Must assume worst case time (maybe)

43
Router priority
[Figure: router structure with Link, Merge, Split and Flow
Control blocks]
  • Virtual channels implement the scheduling algorithm
  • Contention for the link is resolved by priority
    circuits

44
Asynchronous Arbiters
  • Multiway arbiters (e.g. for Xbar switches)
  • Cascaded mesh (latency N)
  • Cascaded Tree (latency logN)
  • Token-Ring (busy ring and lazy ring) (latency
    from 1 to N)
  • Priority arbiters (e.g. for routers with
    different QoS)
  • Static priority (topological order)
  • Dynamic priority (request arrives with priority
    code)
  • Ordered (time-priority) - multiway arbiter,
    followed by a FIFO buffer

45
Static Priority Arbiter
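A purely behavioural sketch of what a static-priority arbiter decides (the circuit on the slide resolves this with MUTEX elements; only the grant function is modelled here, with priority fixed by request index):

```python
def static_priority_arbiter(requests):
    """Grant the active request with the lowest index (highest static priority).

    `requests` is a tuple of 0/1 request lines; returns a one-hot grant vector.
    This models only the decision, not the metastability-filtering MUTEXes of
    a real asynchronous implementation.
    """
    grants = [0] * len(requests)
    for i, r in enumerate(requests):
        if r:
            grants[i] = 1
            break
    return grants

if __name__ == "__main__":
    print(static_priority_arbiter((0, 1, 1, 0)))   # -> [0, 1, 0, 0]
    print(static_priority_arbiter((0, 0, 0, 0)))   # -> [0, 0, 0, 0]
```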
46
Why Synchronizer?
[Figure: a single DFF sampling asynchronous DATA with CLK can
go metastable between 0 and 1 at its Q output. In the two-DFF
synchronizer, one clock cycle is allowed for the metastability
to resolve before the second DFF samples Q.]
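A toy Monte-Carlo model of why the extra flip-flop helps, with illustrative numbers (the τ and window values are assumptions in the 20-50 ps range quoted later, not values from this slide): flop 1 goes metastable when the data edge falls inside the window, and the chance that it is still unresolved after the spare clock cycle shrinks as e^(-T/τ).

```python
import math
import random

TAU = 30e-12      # metastability time constant (illustrative)
T_W = 30e-12      # metastability window around the clock edge (illustrative)
T_CLK = 500e-12   # 2 GHz clock: one whole cycle available for settling

def two_dff_failures(trials=1_000_000):
    """Count samples where flop 2 would still see an unresolved value."""
    failures = 0
    for _ in range(trials):
        offset = random.uniform(0.0, T_CLK)        # data edge vs. clock edge
        if offset < T_W:                           # flop 1 caught in the window
            # probability that metastability outlasts the spare cycle
            if random.random() < math.exp(-T_CLK / TAU):
                failures += 1
    return failures

if __name__ == "__main__":
    random.seed(0)
    print("unresolved after the extra cycle:", two_dff_failures(), "in 1e6 samples")
```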
47
CAD support: Async design flow
48
Synthesis of Asynchronous link interfaces
49
(No Transcript)
50
Boolean equations:
LDS = D + csc
DTACK = D
D = LDTACK · csc · DSr
51
Conclusions on Async Links
  • At the nm level, links will be more asynchronous;
    perhaps mesochronous first, to avoid global clock
    skew
  • Delay-insensitive codes can be used to tolerate
    inter-wire delay variability
  • Phase encoding can be used for higher
    power-per-bit efficiency and SEU tolerance
  • Acking will be mainly used for flow control (word
    level) and its overhead can be hidden by using
    the wagging buffer technique
  • Serial links save area and power for long
    interconnects, with buffering (pipelining) if one
    wants to maintain high throughput; they also
    simplify building switches
  • Synthesis tools can be used to build clock-free
    interfaces between different links
  • Asynchronous logic can be used for building
    higher level circuits, e.g. arbiters for switches
    and routers

52
  • And finally

53
ASYNC08 and NOCs08 plus SLIP08
  • Held in Newcastle upon Tyne, UK, 7-11 April 2008
    (SLIP on 5-6 April weekend)
  • async.org.uk/async2008
  • async.org.uk/nocs2008
  • Submission deadlines
  • Async08: Abstract Oct. 8, Full paper Oct. 15
  • NOCs08: Abstract Nov. 12, Full paper Nov. 19

54
Extras
  • More slides if I have time!

55
Chain Network Components
From J.Bainbridge S. Furber IEEE Micro, 2002
56
A Network on Chip
57
Transmitter Fast SR Approach (from R. Dobkin)
58
Receiver Splitter (from R. Dobkin)
59
Self Timed Networks
  • Router requires priority arbitration
  • Arbitration necessary at every router merge
  • Potential delay at every node on the path
  • BUT
  • Asynchronous merge/arbitration time is average,
    not worst case
  • Adapters to locally clocked cells require
    synchronization
  • Synchronization necessary when clocks are unknown
  • Occurs when receiving data (data valid), and when
    sending (acknowledge)
  • BUT
  • Time can be long (2 cycles?)
  • Must assume worst case time (maybe)

60
Router priority
[Figure: router structure with Link, Merge, Split and Flow
Control blocks]
  • Virtual channels implement the scheduling algorithm
  • Contention for the link is resolved by priority
    circuits

61
Static priority arbiter
62
Reliability and latency
  • Asynchronous arbiters fail only if time is
    bounded
  • Latency depends on fixed gates plus MUTEX lock
    time
  • τ for 2 channels, τ + τ·ln(N-1) for more
  • This is likely to be small compared with flow
    control latency
  • Synchronizers fail at (fairly) predictable rates
    but these rates may get worse
  • Latency can be 35τ now for good reliability

63
The synchronizer
  • Clock and valid can happen very close together
  • Flip Flop 1 gets caught in metastability
  • We wait until it is resolved (1-2 clock periods)

[Figure: flip-flops 1 and 2 clocked by CLK2, sampling DATA and
VALID arriving from the CLK1 domain]
64
MTBF
  • For a 0.18µm process, τ is 20-50 ps
  • Tw is similar
  • Suppose the clock and data frequencies are 2 GHz
  • t needs to be > 25τ (more than one clock period)
    to get MTBF > 28 days
  • 100 synchronizers: add 5τ
  • MTBF > 1 year: add 2τ
  • PVT variations: 5-10τ . . .

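These figures are consistent with the standard metastability MTBF model (the formula itself is not on the slide):

```latex
\mathrm{MTBF} = \frac{e^{\,t/\tau}}{T_w \, f_{clk} \, f_{data}}
```

Because the MTBF is exponential in t/τ, each extra requirement costs only a few τ of settling time: 100 synchronizers multiply the failure rate by 100, costing ln 100 ≈ 4.6, i.e. about 5τ more, and stretching the target from 28 days to a year costs ln(365/28) ≈ 2.6, i.e. about 2τ more.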
65
Event Histogram
Measurement
Convert to log scale; the slope is τ
66
Not always simple
More than one slope: 350ps, 120ps, 140ps
67
Synchronization Strategies
  • Avoid synchronization time (and arbitration time)
    by
  • predicting clocks, stoppable clocks
  • dedicate link paths for long periods of time
  • Minimize time by circuit methods
  • Higher power, better τ
  • Reducing apparent device variability - wide
    transistors
  • many parallel synchronizers increase throughput
  • Reduce average latency by speculation
  • Reduce synchronization time, detect errors and
    roll back

68
Timing regions can have predictable relationships
  • Locked
  • Two clocks from same source
  • Linked by PLL
  • One produced by dividing the other
  • Some asynchronous systems
  • Some GALS
  • Not locked together but predictable
  • Two clocks same frequency, but different
    oscillators.
  • As above, same frequency ratio

69
Don't synchronise when you don't need to
  • If the two clocks are locked together, you don't
    need a synchroniser, just an asynchronous FIFO
    big enough to accommodate any jitter/skew
  • FIFO must never overflow
  • Next read clock can be predicted and
    metastability avoided

70
Conflict Prediction
[Figure: receiver clock, transmitter clock and predicted
transmitter clock waveforms]
The synchronization problem is known a cycle in advance
of the receiver clock.
We can do this thanks to the periodic nature of
the clocks
71
Problems predicting next cycle
  • Difficult to predict
  • Multiple source clocks
  • Input/output interfaces
  • Dynamic jitter and noise
  • GALS start-up: clocks take several cycles to
    stabilise
  • Crosstalk
  • Power supply variations introduce noise into
    both data and clock
  • Temperature changes alter relative delays
  • As a proportion of cycle time, this is likely to
    increase with smaller geometries

72
Synchronizer reliability trends
  • Clock rates increase: 10 GHz gives 100ps for a
    cycle
  • Both data and clock rates up by n
  • τ down by n
  • Assume τ scales with cycle time: reliability
    (MTBF) of one synchronizer down by n
  • Number of synchronizers goes up by N
  • Die reliability down by N
  • Die-to-die and on-die variability increases to as
    much as 40%
  • 40% more time needed for all synchronizers

73
An example
  • Example
  • 10 GHz clock and data rate
  • τ = 10 ps
  • 100 synchronizers
  • MTBF required: 3.8 months (10^7 seconds)
  • Time required: 41τ, or 4.1 cycles; with a 40%
    variability margin, 5.8 cycles
  • Does this matter?

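The 41τ figure can be reproduced from the same MTBF model, assuming (my assumption, not stated on the slide) that Tw is comparable to τ:

```python
import math

tau = 10e-12                 # 10 ps, as on the slide
t_w = tau                    # assumed comparable to tau
f_clk = f_data = 10e9        # 10 GHz clock and data rate
mtbf_target = 1e7            # ~3.8 months, in seconds
n_sync = 100                 # synchronizers sharing the failure budget

# MTBF = exp(t/tau) / (Tw * fclk * fdata); solve for the settling time t that
# gives each of the 100 synchronizers an MTBF of n_sync * mtbf_target
t = tau * math.log(mtbf_target * n_sync * t_w * f_clk * f_data)

cycle = 1.0 / f_clk
print(f"t = {t / tau:.0f} tau = {t / cycle:.1f} cycles"
      f" (with 40% variability margin: {1.4 * t / cycle:.1f} cycles)")
```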
74
Power futures
  • Total synchronizer area/power small, BUT
  • τ is very sensitive to voltage/power: both n and p
    transistors can turn off at low voltages, so no
    gain
  • This affects MUTEX circuits as well

75
Power/speed tradeoffs
  • Increase Vdd when synchronisation required
  • Make synchronizer transistors wide to reduce
    variation and, to some extent, τ
  • Make many synchronizer circuits, and select the
    consistently fastest one
  • Avoid reducing synchronizer Vdd when running slow

76
Speculation
  • Mostly, the synchronizer does not need 35τ to
    settle
  • Only e^-10 (about 0.005%) of events need more
    than 10τ
  • Why not go ahead anyway, and try again if more
    time was needed?

77
Low latency synchronization
  • Data Available or Free to Write are produced
    early
  • After one cycle?
  • If they prove to be in error, synchronization
    failed
  • We only know this after two or more cycles
  • A Read Fail or Write Fail flag is then raised and
    the action can be repeated

[Figure: a FIFO between the write-clock and read-clock
domains. Speculative synchronizers derive Free to Write from
Full and Data Available from Not Empty early, and raise Write
Fail / Read Fail flags if the speculation turns out wrong]
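A behavioural sketch of the speculative read side (probabilities and cycle counts are illustrative only): Data Available is asserted after one cycle; in the rare case that the synchronizer had not actually settled, which is discovered two cycles later, Read Fail is raised and the read is repeated.

```python
import random

P_NOT_SETTLED = 4.5e-5   # ~e^-10: chance the speculation was wrong (illustrative)

def speculative_read():
    """Cycles taken by one read when Data Available is raised speculatively."""
    cycles = 1                               # speculate: data assumed available
    while random.random() < P_NOT_SETTLED:   # later check: it had not settled
        cycles += 2                          # Read Fail raised, read repeated
    return cycles

if __name__ == "__main__":
    random.seed(0)
    n = 100_000
    total = sum(speculative_read() for _ in range(n))
    print("average read latency:", total / n, "cycles")
```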
78
Comments
  • Synchronization time will be an issue for future
    GALS
  • Latency and throughput can be affected
  • Should the flit be large to reduce the effective
    overhead of time and power?
  • Some power/speed trade-off is possible
  • Higher-power synchronization can buy some
    performance
  • Speculation is complex
  • Is it worth it?