Title: Asynchronous Links, for NanoNets?
1Asynchronous Links, for NanoNets?
- Alex Yakovlev, University of Newcastle, UK
2Motivation-1
- At very deep submicron (VDSM), gate delay is much less than interconnect delay: total interconnect length can reach several meters, and interconnect delay can be as much as 90% of total path delay in VDSM circuits
- Timing is a problem, particularly for global wires
- Multiple clock domains are a reality, with the problem of interfacing between them
- ITRS05 predicted a 4x (8x) increase in global asynchronous signalling by 2012 (2020)
3Motivation-2
- Variability and uncertainty
- Geometry and process: for long channels, intra-die variations are less correlated for different parts of the interconnect, both for interconnects and repeaters
- e.g., M4 and M5 resistance/um massively differ, leading to mistracking (C. Visweswariah, SLIP06)
- e.g., 250nm clock skew has 25% variability due to interconnect variations (Y. Liu et al., DAC00)
- Behavioural: crosstalk (sidewall capacitance can cause up to 7x variation in delay (R. Ho, M. Horowitz))
4A Network on Chip
Async Links
5Example from the Past: Fault-Tolerant Self-Timed Ring (Varshavsky et al. 1986)
Built for an onboard airborne computer-control system which tolerated up to two faults. The self-timed ring was a GALS system with self-checking and self-repair at the hardware level
Individually clocked subsystems
Self-timed adapters forming a ring
6Communication Channel Adapter
Much higher reliability than a bus and other forms of redundancy
MCC was developed in TTL-Schottky gate arrays, approx. 2K gates
Data (DR, DS) is encoded using a 3-of-6 Sperner code (16 data values for a half-byte, plus 4 tokens for the ring acquisition protocol)
AR, AS: acknowledgements
RR, RS: spare (for self-repair) lines
7Outline
- Token-based view of communication
- Basics of asynchronous signalling
- Self-timed data encoding
- Pipelining
- How to hide acknowledgements
- Serial vs Parallel links
- Arbiters and routers
- Async2sync interface
- CAD issues
8Data exchange token-based view
Data
source
tx
rx
dest
- Question 1: when can Rx look at the incoming data?
- Data validity issue: forming a well-defined token
9Data exchange token-based view
Data
source
tx
rx
dest
- Question 1: when can Rx look at the data?
- Data validity issue: forming a well-defined token
- Question 2: when can Tx send new data?
- Acknowledgement issue: separation b/w tokens
10Data exchange token-based view
- Question 1: when can Rx look at the data?
- Data validity issue: forming a well-defined token
- Question 2: when can Tx send new data?
- Acknowledgement issue: separation b/w tokens
- These are fundamental issues of flow control at the physical and link levels
- The answers are determined by many design aspects: technology level, system architecture (application, pipelining), latency, throughput, power, design process etc.
11Tokens and spaces with global clocking
clk
- In globally clocked systems both Q1 and Q2 are
resolved with the aid of clock pulses
12Tokens and spaces
Data
source
tx
rx
dest
D_valid
Clk_rx
Clk_tx
bundle
- Without global clocking Q1 can be resolved differently from Q2
- E.g. Q1: source-synchronous (mesochronous), bundled data or self-synchronising codes; Q2: ack or stop signal, or by local timing
13Tokens and spaces
Data
source
tx
rx
dest
D_valid
ack
ack
bundle
ack
- Without global clocking Q1 can be resolved differently from Q2
- E.g. Q1: source-synchronous (mesochronous), bundled data or self-synchronising codes; Q2: ack or stop signal, or by local timing
14Petri net model
dest
Tx
Rx
source
Data Valid
Tx delay
Rx delay
One way delay, but may be unsafe!
dest
Tx
Rx
source
Data Valid
ack
Tx delay or ack
Rx delay or ack
Always safe but with a round trip delay!
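The difference between the two Petri net models can be illustrated with a small timing sketch. It is not taken from the slides: the function name, the zero wire delay and the instantaneous ack are assumptions made only for illustration. It counts how many tokens sit in the channel place at once, with and without the ack arc.

```python
def max_channel_occupancy(with_ack, tx_delay, rx_delay, n_tokens):
    """Return the largest number of tokens ever held in the channel place.

    Assumptions (not from the slides): wires have zero delay and the ack
    reaches Tx at the moment Rx has consumed the token.
    """
    send_times = []      # firing times of the Tx transition
    consume_times = []   # firing times of the Rx transition
    max_tokens = 0
    for _ in range(n_tokens):
        # Tx needs tx_delay between two consecutive sends.
        send = send_times[-1] + tx_delay if send_times else 0
        if with_ack and consume_times:
            # With the ack arc, Tx also waits until the previous token is consumed.
            send = max(send, consume_times[-1])
        # Rx serves tokens in order and needs rx_delay for each of them.
        start = max(send, consume_times[-1]) if consume_times else send
        send_times.append(send)
        consume_times.append(start + rx_delay)
        # Tokens already sent but not yet consumed at the time of this send.
        in_flight = sum(1 for c in consume_times if c > send)
        max_tokens = max(max_tokens, in_flight)
    return max_tokens


if __name__ == "__main__":
    print("no ack, slow Rx:", max_channel_occupancy(False, 1, 3, 10))
    print("ack,    slow Rx:", max_channel_occupancy(True, 1, 3, 10))
```

In this sketch the one-way model stays 1-safe only while Rx is at least as fast as Tx, whereas the acknowledged model never holds more than one token, at the price of Tx waiting for the round trip.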
15Asynchronous handshake signalling
- Valid data tokens and safe spaces between them can be created by different means of signalling and encoding
- Level-based -> Return-To-Zero (RTZ) or 4-phase protocol
- Transition-based -> Non-Return-to-Zero (NRZ) or 2-phase protocol
- Pulse-based, e.g. GasP
- Phase-difference-based
- Data encoding: bundled data (BD), delay-insensitive (DI)
16Handshake Signalling Protocols
- Level Signalling (RTZ or 4-phase)
- Transition Signalling (NRZ or 2-phase)
req
ack
One cycle
One cycle
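As a rough illustration of the two protocols, the sketch below lists the req/ack events produced per transferred item. The function names and the event labels are hypothetical; only the event ordering follows the usual 4-phase and 2-phase handshakes.

```python
def four_phase(n_items):
    """Level (RTZ) signalling: req+, ack+, req-, ack- for every item."""
    events = []
    for _ in range(n_items):
        events += ["req+", "ack+", "req-", "ack-"]
    return events


def two_phase(n_items):
    """Transition (NRZ) signalling: one req toggle answered by one ack toggle."""
    events, req, ack = [], 0, 0
    for _ in range(n_items):
        req ^= 1
        events.append("req+" if req else "req-")
        ack ^= 1
        events.append("ack+" if ack else "ack-")
    return events


if __name__ == "__main__":
    print(four_phase(2))  # 4 signal transitions per item
    print(two_phase(2))   # 2 signal transitions per item
```

The 2-phase version halves the number of transitions per item, but the receiver has to treat rising and falling edges alike.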
17Handshake Signalling Protocols
req
req
ack
ack
One cycle
- Single-track Signalling (GasP)
req
ack
18GasP signalling
Pull up from pred (req)
Pulse length control loops
Pull up from here (req)
Pull down here (ack)
Pull down from succ (ack)
Source: R. Ho et al., ASYNC04
19Data encoding
- Bundled data
- Code is positional binary; token is determined by the Req signal; Req arrives with a safe set-up delay from data
- Delay-insensitive codes (tokens determined by the codeword values; require a spacer, or NULL, state if RTZ)
- 1-of-2 (dual-rail per bit): systematic code, encoding and decoding straightforward
- m-of-n (n>2): not systematic, i.e. incur encoding and decoding costs; optimal when m=n/2
- One-hot, 1-of-n (n>2): completion detection is easy; not practical beyond n>4
- Systematic, such as Berger: incur complex completion detection
20Bundled Data
RTZ
Data
req
ack
NRZ
Data
req
ack
One cycle
One cycle
21DI encoded data (Dual-Rail)
[Timing diagram, RTZ: Data.0, Data.1 and ack; the codewords for Logical 0 and Logical 1 are separated by a NULL (spacer); one cycle per data item]
[Timing diagram, NRZ: Data.0, Data.1 and ack; Logical 0, Logical 1, Logical 1, Logical 1 sent in four cycles]
22DI encoded data (Dual-Rail)
[Timing diagram, RTZ: Data.0, Data.1 and ack; the codewords for Logical 0 and Logical 1 are separated by a NULL (spacer); one cycle per data item]
[Timing diagram, NRZ: Data.0, Data.1 and ack; Logical 0, Logical 1, Logical 1, Logical 1 sent in four cycles]
This NRZ coding leads to a complex logic implementation: it is hard to track odd and even phases and logic values; hence see LEDR below
23DI codes (1-of-n and m-of-n)
- 1-of-4
- 0001 -> 00, 0010 -> 01, 0100 -> 10, 1000 -> 11
- 2-of-4
- 1100, 1010, 1001, 0110, 0101, 0011: total 6 combinations (cf. 2-bit dual-rail: 4 comb.)
- 3-of-6
- 111000, 110100, ..., 000111: total 20 combinations (can encode 4 bits + 4 control tokens)
- 2-of-7
- 1100000, 1010000, ..., 0000011: total 21 combinations (4 bits + 5 control tokens)
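The codeword counts quoted above can be checked with a short enumeration. This is a minimal sketch; the helper names are hypothetical, and "control tokens" is taken here to mean the codewords left over after a whole number of bits has been encoded.

```python
from itertools import combinations
from math import comb, floor, log2


def m_of_n_codewords(m, n):
    """List every n-wire codeword with exactly m wires at 1."""
    words = []
    for ones in combinations(range(n), m):
        words.append("".join("1" if i in ones else "0" for i in range(n)))
    return words


def capacity(m, n):
    """Return (codewords, whole data bits, spare codewords) of an m-of-n code."""
    total = comb(n, m)
    bits = floor(log2(total))
    return total, bits, total - 2 ** bits


if __name__ == "__main__":
    for m, n in [(1, 4), (2, 4), (3, 6), (2, 7)]:
        print(f"{m}-of-{n}:", capacity(m, n))
```

The output agrees with the slide: 3-of-6 gives 20 codewords (4 bits plus 4 spare) and 2-of-7 gives 21 (4 bits plus 5 spare).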
24DI codes completion detection and decoding
- 1-of-4 completion detection is a 4-input OR gate (CD = d0 + d1 + d2 + d3)
- Decoding 1-of-4 to dual rail is a set of four 2-input OR gates (q0.0 = d0 + d2, q0.1 = d1 + d3, q1.0 = d0 + d1, q1.1 = d2 + d3)
- For m-of-n codes, CD and decoding are non-trivial
From J. Bainbridge et al., ASYNC03
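The OR-gate completion detector and decoder for 1-of-4 can be sketched in a few lines. The sketch assumes that wires d0..d3 carry the values 00, 01, 10, 11 respectively and that q0 is the less significant dual-rail bit; the function names are hypothetical.

```python
def completion_detect(d):
    """4-input OR: 1 when a 1-of-4 codeword is present, 0 for the spacer."""
    d0, d1, d2, d3 = d
    return d0 | d1 | d2 | d3


def decode_1of4_to_dual_rail(d):
    """Four 2-input ORs turn one 1-of-4 symbol into two dual-rail bits."""
    d0, d1, d2, d3 = d
    return {
        "q0.0": d0 | d2,  # bit 0 is 0 for values 00 and 10
        "q0.1": d1 | d3,  # bit 0 is 1 for values 01 and 11
        "q1.0": d0 | d1,  # bit 1 is 0 for values 00 and 01
        "q1.1": d2 | d3,  # bit 1 is 1 for values 10 and 11
    }


if __name__ == "__main__":
    for value in range(4):
        wires = tuple(1 if i == value else 0 for i in range(4))
        print(value, wires, completion_detect(wires), decode_1of4_to_dual_rail(wires))
```

The all-zero spacer maps to an all-zero dual-rail spacer, so the decoder preserves the return-to-zero phase.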
25Incomplete DI codes
Incomplete 2-of-7: composed of 1-of-3 and 1-of-4
From J. Bainbridge et al., ASYNC03
26Phase difference based encoding (C. D'Alessandro et al., ASYNC06, 07)
- The proposed system encodes a bit of data in the phase relationship between two signals generated using a reference
- This ensures that any transient fault appearing on one of the reference signals is ignored if it is not mirrored by a corresponding transition on the other line
- Similarity with multi-wire communication
27Phase encoding multiple rail
- No group of wires has the same delay
- All wires toggle when an item of data is sent
- Increased number of states available (n wires: n! states), hence more bits/symbol
- Table illustrates examples of phase encoding compared to the respective m-of-n counterpart
Type of link     Number of states   Bits per symbol   Extra states   Transitions per symbol   Symbols per packet   Transitions per packet
Phase enc. (4)   24                 4                 8              4                        32                   128
1-of-4           4                  2                 0              2                        64                   128
Phase enc. (6)   720                9                 208            6                        15                   90
3-of-6           20                 4                 4              6                        32                   192
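The table entries can be reproduced with a few lines of arithmetic. The sketch below rests on two assumptions that are inferred from the numbers rather than stated on the slide: a packet carries 128 bits, and an m-of-n symbol in RTZ costs 2m transitions (m rising, m falling). Function names are hypothetical.

```python
from math import ceil, comb, factorial, floor, log2

PACKET_BITS = 128  # assumption inferred from the table, not stated on the slide


def link_row(states, transitions_per_symbol):
    """Derive the remaining table columns from the state count."""
    bits = floor(log2(states))
    symbols = ceil(PACKET_BITS / bits)
    return {
        "states": states,
        "bits_per_symbol": bits,
        "extra_states": states - 2 ** bits,
        "transitions_per_symbol": transitions_per_symbol,
        "symbols_per_packet": symbols,
        "transitions_per_packet": symbols * transitions_per_symbol,
    }


def phase_encoding(n):
    """n wires ordered in time: n! states, every wire toggles once per symbol."""
    return link_row(factorial(n), n)


def m_of_n_rtz(m, n):
    """m-of-n code with return-to-zero: m wires rise and then fall again."""
    return link_row(comb(n, m), 2 * m)


if __name__ == "__main__":
    for name, row in [("Phase enc. (4)", phase_encoding(4)), ("1-of-4", m_of_n_rtz(1, 4)),
                      ("Phase enc. (6)", phase_encoding(6)), ("3-of-6", m_of_n_rtz(3, 6))]:
        print(name, row)
```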
28Phase encoding Repeater
1<3
3<1
2<3
3<2
1<2
2<1
Phase detectors (Mutexes)
29Pipelines
Dual-rail pipeline
From J. Bainbridge and S. Furber, IEEE Micro, 2002
30The problem of Acking
- Question 2 (when can Tx send new data?) has two aspects
- Safety (not to overflow the channel, or when Tx and Rx have much variation in delay)
- Performance (to maximize throughput and reduce latency)
- Can we hide the ack (round trip) delay?
31To maintain throughput more pipeline stages are required, but that costs too much latency and power
First minimize latency along a long wire (not specific to asynchronous), and then maximize throughput (using the wagging tail buffer approach)
From R. Ho et al., ASYNC04
32Use of wagging buffer approach
Alternate between top and bottom control
From R.Ho et al. ASYNC04
33Wagging tail buffer approach
reqtop
Top and bot control channels work at ½ frequency
of data channel
acktop
data
reqbot
ackbot
34Serial Link vs Parallel Link (from R. Dobkin)
- Why Serial Link?
- Less interconnect area
- Less routing congestion
- Less coupling
- Less power (depends on range)
- The relative improvement grows with technology scaling. The example on the right refers to:
- Single gate delay serial link
- Fully-shielded parallel link with 8 gate delay clock cycle
- Equal bit-rate
- Word width N=8
[Plot: Link Length (mm) vs Technology Node (nm), with regions where the serial link or the parallel link dissipates less power and requires less area]
35Serialization model
Tx
Rx
Acking at the bit level
36Serialization model
Tx
Rx
Acking at the word level
37Serialization model
Tx
Rx
Acking at the word level (with more concurrency)
38Serial Link Top Structure (R.Dobkin, Async07)
- Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS)
- Acknowledge per word instead of per bit
- Synchronizers used at the level of the ack signals
- Wave-pipelining over channel
- Differential encoding (DS-DE, IEEE1355-95)
- Reported throughput 67 Gbps for a 65nm process (viz. one bit per 15ps, the expected FO4 inverter delay), based on simulations
39Encoding Two Phase NRZ LEDR
- Two Phase Non-Return-to-Zero Level Encoded Dual Rail
- delta encoding (one transition per bit)
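A minimal sketch of LEDR, assuming the common formulation in which one rail carries the bit value and the second rail carries the value XOR-ed with an alternating phase. The rail order, the initial wire state (0, 0) and the function names are assumptions for illustration. The check at the end confirms the "one transition per bit" property.

```python
def ledr_encode(bits):
    """Return (data_rail, parity_rail) levels for each bit.

    The phase alternates 1, 0, 1, ... because the wires are assumed to
    start at (0, 0), which is a phase-0 state.
    """
    symbols = []
    for i, bit in enumerate(bits):
        phase = (i + 1) % 2
        symbols.append((bit, bit ^ phase))
    return symbols


def ledr_decode(symbols):
    """The data rail holds the bit value directly."""
    return [data for data, _ in symbols]


def transitions_per_symbol(symbols, start=(0, 0)):
    """Count how many of the two rails change for each new symbol."""
    counts, prev = [], start
    for sym in symbols:
        counts.append(int(sym[0] != prev[0]) + int(sym[1] != prev[1]))
        prev = sym
    return counts


if __name__ == "__main__":
    bits = [1, 1, 0, 0, 1]
    code = ledr_encode(bits)
    print(code, transitions_per_symbol(code))
```

Whenever the data rail does not change, the parity rail does, which is the same rule as data-strobe encoding.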
40Transmitter Fast SR Approach (from R. Dobkin)
41Receiver Splitter (from R. Dobkin)
42Self Timed Networks
- Router requires priority arbitration
- Arbitration necessary at every router merge
- Potential delay at every node on the path
- BUT
- Asynchronous merge/arbitration time is average, not worst case
- Adapters to locally clocked cells require synchronization
- Synchronization necessary when clocks are unknown
- Occurs when receiving data (data valid), and when sending (acknowledge)
- BUT
- Time can be long (2 cycles?)
- Must assume worst case time (maybe)
43Router priority
Flow Control
Link
Merge
Split
- Virtual channels implement scheduling algorithm
- Contention for link resolved by priority circuits
44Asynchronous Arbiters
- Multiway arbiters (e.g. for Xbar switches)
- Cascaded mesh (latency N)
- Cascaded tree (latency log N)
- Token-ring (busy ring and lazy ring) (latency from 1 to N)
- Priority arbiters (e.g. for routers with different QoS)
- Static priority (topological order)
- Dynamic priority (request arrives with priority code)
- Ordered (time-priority): multiway arbiter, followed by a FIFO buffer
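The three priority disciplines can be contrasted with a behavioural sketch. It models only the grant decision, not the circuits; the Request fields and the conventions "lowest channel index wins" and "higher priority code wins" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    channel: int       # input port index (topological position)
    arrival: float     # arrival time of the request
    priority: int = 0  # priority code carried by the request


def static_priority(pending):
    """Grant by fixed topological order: the lowest channel index wins."""
    return min(pending, key=lambda r: r.channel)


def dynamic_priority(pending):
    """Grant the highest priority code; ties fall back to channel order."""
    return max(pending, key=lambda r: (r.priority, -r.channel))


def ordered(pending):
    """Time-priority: grants leave the FIFO in order of arrival."""
    return sorted(pending, key=lambda r: r.arrival)


if __name__ == "__main__":
    reqs = [Request(2, 0.5, 1), Request(0, 0.9, 0), Request(1, 0.1, 3)]
    print(static_priority(reqs).channel)
    print(dynamic_priority(reqs).channel)
    print([r.channel for r in ordered(reqs)])
```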
45Static Priority Arbiter
46Why Synchronizer?
[Figure: DFF with DATA, CLK and Q waveforms; Q goes through metastability between 0 and 1]
Two DFF Synchronizer (DATA, DFF, DFF, Q, common CLK): here one clock cycle is used for the metastability to resolve.
47CAD support Async design flow
48Synthesis of Asynchronous link interfaces
50Boolean equations LDS D ? csc DTACK D D
LDTACK csc DSr
51Conclusions on Async Links
- At nm level links will be more asynchronous, perhaps first mesochronous, to avoid global clock skew
- Delay-insensitive codes can be used to tolerate inter-wire delay variability
- Phase-encoding can be used for higher power-bit efficiency and SEU tolerance
- Acking will be mainly used for flow control (word level), and its overhead can be hidden by using the wagging buffer technique
- Serial links save area and power for long interconnects, with buffering (pipelining) if one wants to maintain high throughput; they also simplify building switches
- Synthesis tools can be used to build clock-free interfaces between different links
- Asynchronous logic can be used for building higher level circuits, e.g. arbiters for switches and routers
53ASYNC08 and NOCs08 plus SLIP08
- Held in Newcastle upon Tyne, UK, 7-11 April 2008 (SLIP on 5-6 April weekend)
- async.org.uk/async2008
- async.org.uk/nocs2008
- Submission deadlines
- Async08: Abstract Oct. 8, Full paper Oct. 15
- NOCs08: Abstract Nov. 12, Full paper Nov. 19
54Extras
- More slides if I have time!
55Chain Network Components
From J. Bainbridge and S. Furber, IEEE Micro, 2002
62Reliability and latency
- Asynchronous arbiters fail only if time is bounded
- Latency depends on fixed gates plus MUTEX lock time
- τ for 2 channels, τ × ln(N-1) for more
- This is likely to be small compared with flow control latency
- Synchronizers fail at (fairly) predictable rates, but these rates may get worse
- Latency can be 35τ now for good reliability
63The synchronizer
- Clock and valid can happen very close together
- Flip Flop 1 gets caught in metastability
- We wait until it is resolved (1-2 clock periods)
DATA
VALID
1
2
CLK2
CLK1
64MTBF
- For a 0.18 µm process τ is 20-50 ps
- Tw is similar
- Suppose the clock and data frequencies are 2 GHz
- t needs to be > 25τ (more than one clock period) to get MTBF > 28 days
- 100 synchronizers: + 5τ
- MTBF > 1 year: + 2τ
- PVT variations: + 5-10τ . . .
65Event Histogram
Measurement
Convert to log scale, slope is τ
66Not always simple
More than one slope: 350ps, 120ps, 140ps
67Synchronization Strategies
- Avoid synchronization time (and arbitration time) by
- predicting clocks, stoppable clocks
- dedicate link paths for long periods of time
- Minimize time by circuit methods
- Higher power, better τ
- Reducing apparent device variability: wide transistors
- many parallel synchronizers increase throughput
- Reduce average latency by speculation
- Reduce synchronization time, detect errors and roll back
68Timing regions can have predictable relationships
- Locked
- Two clocks from same source
- Linked by PLL
- One produced by dividing the other
- Some asynchronous systems
- Some GALS
- Not locked together but predictable
- Two clocks same frequency, but different oscillators.
- As above, same frequency ratio
69Don't synchronise when you don't need to
- If the two clocks are locked together, you don't need a synchroniser, just an asynchronous FIFO big enough to accommodate any jitter/skew
- FIFO must never overflow
- Next read clock can be predicted and metastability avoided
70Conflict Prediction
Receiver Clock
Transmitter Clock
Predicted Transmitter Clock
Synchronization problem known a cycle in advance
of the Receiver clock.
We can do this thanks to the periodic nature of
the clocks
71Problems predicting next cycle
- Difficult to predict
- Multiple source clocks
- Input output interfaces
- Dynamic jitter and noise
- GALS start-up: clocks take several cycles to stabilise
- Crosstalk
- power supply variations introducing noise into both data and clock
- temperature changes alter relative delays
- As a proportion of cycle time, this is likely to increase with smaller geometries
72Synchronizer reliability trends
- Clock rates increase. 10 GHz gives 100ps for a cycle.
- Both data and clock rates up by n
- τ down by n
- Assume τ scales with cycle time: reliability (MTBF) of one synchronizer down by n
- Number of synchronizers goes up by N
- Die reliability down by N
- Die-to-die and on-die variability increases to as much as 40%
- 40% more time needed for all synchronizers
73An example
- Example
- 10 GHz clock and data rate
- τ = 10 ps
- 100 synchronizers
- MTBF required: 3.8 months (10^7 seconds)
- Time required: 41τ, or 4.1 cycles; + 40% = 5.8 cycles
- Does this matter?
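The figures in this example can be reproduced with the usual synchronizer reliability model, MTBF = exp(t/τ) / (Tw × fclk × fdata), divided by the number of synchronizers. The sketch below assumes Tw = τ = 10 ps (an earlier slide only says that Tw is similar to τ); the function name is hypothetical.

```python
from math import log


def required_settling(mtbf_s, tau_s, t_w_s, f_clk, f_data, n_sync=1):
    """Return (t / tau, t in seconds) needed to reach the given system MTBF.

    Model: MTBF_system = exp(t / tau) / (n_sync * Tw * f_clk * f_data).
    """
    ratio = log(mtbf_s * n_sync * t_w_s * f_clk * f_data)
    return ratio, ratio * tau_s


if __name__ == "__main__":
    tau = 10e-12
    ratio, t = required_settling(1e7, tau, t_w_s=tau, f_clk=10e9, f_data=10e9, n_sync=100)
    cycles = t * 10e9
    print(f"t = {ratio:.1f} tau = {cycles:.1f} cycles, with 40% margin {1.4 * cycles:.1f} cycles")
```

Under these assumptions the model gives about 41τ, i.e. roughly 4.1 cycles, and about 5.8 cycles once 40% is added for variability.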
74Power futures
- Total synchronizer area/power small, BUT
- τ is very sensitive to voltage/power: both n and p transistors can turn off at low voltages, giving no gain
- This affects MUTEX circuits as well
75Power/speed tradeoffs
- Increase Vdd when synchronisation required
- Make synchronizer transistors wide to reduce variation and, to some extent, τ
- Make many synchronizer circuits, and select the consistently fastest one
- Avoid reducing synchronizer Vdd when running slow
76Speculation
- Mostly, the synchronizer does not need 35τ to settle
- Only e^-10 (0.005%) need more than 10τ
- Why not go ahead anyway, and try again if more time was needed?
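The quoted fraction follows from the exponential decay of metastability: the chance that an event is still unresolved after k time constants is e^-k. The sketch below also adds a deliberately simple cost model for speculation, in which a failed guess costs a fixed retry penalty. That model and the function names are assumptions for illustration, not the scheme of the slides.

```python
from math import exp


def fraction_unresolved(k_tau):
    """Fraction of metastable events still unresolved after k_tau time constants."""
    return exp(-k_tau)


def expected_wait(fast_tau, retry_penalty_tau):
    """Average wait (in units of tau) when proceeding after fast_tau and retrying on failure."""
    return fast_tau + fraction_unresolved(fast_tau) * retry_penalty_tau


if __name__ == "__main__":
    print(f"{100 * fraction_unresolved(10):.4f} % need more than 10 tau")
    print(f"average wait: {expected_wait(10, 35):.4f} tau instead of 35 tau")
```

With a 10τ early decision and a 35τ retry penalty, the average wait in this toy model stays very close to 10τ.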
77Low latency synchronization
- Data Available, or Free to write, are produced early
- After one cycle?
- If they prove to be in error, synchronization failed
- Only know this after two or more cycles
- Read Fail or Write Fail flag is then raised and the action can be repeated.
[Figure: FIFO with two speculative synchronizers. Labels: DATA, Write Data, WRITE, Write clock, Full, Free to write, Write Fail; READ, Read done, Read Clock, Not Empty, Data Available, Read Fail]
78Comments
- Synchronization time will be an issue for future GALS
- Latency and throughput can be affected
- Should the flit be large to reduce the effective overhead of time and power?
- Some power/speed trade-off is possible
- Higher power synchronization can buy some performance?
- Speculation is complex
- Is it worth it?