Slides: 79

Provided by: davidki4

1
Asynchronous Links, for NanoNets?

Alex Yakovlev, University of Newcastle, UK

2
Motivation-1

At very deep submicron (VDSM), gate delay is much less than interconnect delay: total interconnect length can reach several meters, and interconnect delay can be as much as 90% of total path delay in VDSM circuits
Timing is a problem, particularly for global wires

Multiple clock domains are a reality; the problem is the interface between them
ITRS05 predicted a 4x (8x) increase in global asynchronous signalling by 2012 (2020)

3
Motivation-2

Variability and uncertainty
Geometry and process: for long channels, intra-die variations are less correlated for different parts of the interconnect, both for interconnects and repeaters
e.g. M4 and M5 resistance/um differ massively, leading to mistracking (C. Visweswariah, SLIP06)
e.g. 250nm clock skew has 25% variability due to interconnect variations (Y. Liu et al., DAC00)
Behavioural: crosstalk (sidewall capacitance can cause up to 7x variation in delay; R. Ho, M. Horowitz)

4
A Network on Chip
Async Links
5
Example from the Past: Fault-Tolerant Self-Timed Ring (Varshavsky et al. 1986)
Built for an onboard airborne computer-control system that tolerated up to two faults. The self-timed ring was a GALS system with self-checking and self-repair at the hardware level
Individually clocked subsystems
Self-timed adapters forming a ring
6
Communication Channel Adapter
Much higher reliability than a bus and other forms of redundancy
MCC was developed in TTL-Schottky gate arrays, approx. 2K gates
Data (DR, DS) is encoded using a 3-of-6 Sperner code (16 data values for a half-byte, plus 4 tokens for the ring acquisition protocol)
AR, AS: acknowledgements
RR, RS: spare (for self-repair) lines
7
Outline

Token-based view of communication
Basics of asynchronous signalling
Self-timed data encoding
Pipelining
How to hide acknowledgements
Serial vs Parallel links
Arbiters and routers
Async2sync interface
CAD issues

8
Data exchange: token-based view
[Diagram: Data passes from source through tx and rx to dest]

Question 1: when can Rx look at the incoming data?
Data validity issue: forming a well-defined token

9
Data exchange: token-based view
[Diagram: Data passes from source through tx and rx to dest]

Question 1: when can Rx look at the data?
Data validity issue: forming a well-defined token
Question 2: when can Tx send new data?
Acknowledgement issue: separation between tokens

10
Data exchange: token-based view

Question 1: when can Rx look at the data?
Data validity issue: forming a well-defined token
Question 2: when can Tx send new data?
Acknowledgement issue: separation between tokens
These are fundamental issues of flow control at the physical and link levels
The answers are determined by many design aspects: technology level, system architecture (application, pipelining), latency, throughput, power, design process etc.

11
Tokens and spaces with global clocking
clk

In globally clocked systems both Q1 and Q2 are
resolved with the aid of clock pulses

12
Tokens and spaces
Data
source
tx
rx
dest
D_valid
Clk_rx
Clk_tx
bundle

Without global clocking, Q1 can be resolved differently from Q2
E.g. Q1: source-synchronous (mesochronous), bundled data or self-synchronising codes
Q2: ack or stop signal, or by local timing

13
Tokens and spaces
Data
source
tx
rx
dest
D_valid
ack
ack
bundle
ack

Without global clocking, Q1 can be resolved differently from Q2
E.g. Q1: source-synchronous (mesochronous), bundled data or self-synchronising codes
Q2: ack or stop signal, or by local timing

14
Petri net model
[Model 1: source -> Tx -> Data Valid -> Rx -> dest, timed by Tx delay and Rx delay]
One-way delay, but may be unsafe!
[Model 2: source -> Tx -> Data Valid -> Rx -> dest, with ack returned; timed by "Tx delay or ack" and "Rx delay or ack"]
Always safe, but with a round-trip delay!
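The trade-off between the two models can be sketched numerically. The following is a minimal sketch, not the Petri net itself: the function names, the single-place channel and the integer time units are all assumptions made for illustration.

```python
def one_way(n_tokens, tx_delay, rx_delay):
    """Open-loop link: Tx launches a token every tx_delay time units.

    Rx needs rx_delay to absorb a token. A token that arrives while Rx is
    still busy is counted as lost (the 'unsafe' case). Returns tokens lost.
    """
    lost = 0
    rx_free_at = 0
    for i in range(n_tokens):
        arrival = i * tx_delay
        if arrival < rx_free_at:
            lost += 1                       # Rx was not ready: token overwritten
        else:
            rx_free_at = arrival + rx_delay
    return lost


def with_ack(n_tokens, tx_delay, rx_delay, wire_delay):
    """Closed-loop link: Tx waits for the ack before launching the next token.

    Nothing is lost, but every token pays two wire crossings
    (data forward, ack back). Returns the total transfer time.
    """
    per_token = tx_delay + wire_delay + rx_delay + wire_delay
    return n_tokens * per_token


print(one_way(10, tx_delay=2, rx_delay=3))                # Rx slower than Tx: tokens are lost
print(one_way(10, tx_delay=3, rx_delay=2))                # Rx faster than Tx: safe
print(with_ack(10, tx_delay=2, rx_delay=3, wire_delay=1))
```

The open-loop link is only safe while the delay assumption (Rx no slower than Tx) holds; the acknowledged link is safe regardless, at the cost of the round trip.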
15
Asynchronous handshake signalling

Valid data tokens and safe spaces between them
can be created by different means of signalling
and encoding
Level-based -> Return-To-Zero (RTZ) or 4-phase protocol
Transition-based -> Non-Return-to-Zero (NRZ) or 2-phase protocol
Pulse-based, e.g. GasP
Phase-difference-based
Data encoding: bundled data (BD), delay-insensitive (DI)

16
Handshake Signalling Protocols

Level Signalling (RTZ or 4-phase)

Transition Signalling (NRZ or 2-phase)

req
ack
One cycle
One cycle
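The difference between level and transition signalling is easiest to see by listing the wire events for each transfer. This is a small sketch with hypothetical function names; it only models the order of req/ack events, not timing.

```python
def four_phase(n_transfers):
    """Level (RTZ) signalling: req up, ack up, req down, ack down per transfer."""
    req = ack = 0
    events = []
    for _ in range(n_transfers):
        for wire in ("req", "ack", "req", "ack"):
            if wire == "req":
                req ^= 1
                events.append(("req", req))
            else:
                ack ^= 1
                events.append(("ack", ack))
    return events, (req, ack)


def two_phase(n_transfers):
    """Transition (NRZ) signalling: one req toggle and one ack toggle per transfer."""
    req = ack = 0
    events = []
    for _ in range(n_transfers):
        req ^= 1
        events.append(("req", req))
        ack ^= 1
        events.append(("ack", ack))
    return events, (req, ack)


ev4, final4 = four_phase(3)
ev2, final2 = two_phase(3)
print(len(ev4), final4)   # 4 events per transfer, wires always return to zero
print(len(ev2), final2)   # 2 events per transfer, wire levels depend on parity
```

Four-phase spends half of its events returning to zero; two-phase halves the transitions, but the receiver has to interpret both edges.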
17
Handshake Signalling Protocols

Pulse Signalling

req
req
ack
ack
One cycle

Single-track Signalling (GasP)

req
ack
18
GasP signalling
Pull up from pred (req)
Pulse length control loops
Pull up from here (req)
Pull down here (ack)
Pull down from succ (ack)
Source: R. Ho et al., ASYNC04
19
Data encoding

Bundled data
Code is positional binary; the token is determined by the Req signal, which arrives with a safe set-up delay from the data
Delay-insensitive codes (tokens are determined by the codeword values; they require a spacer, or NULL, state if RTZ)
1-of-2 (dual-rail per bit): systematic code, so encoding and decoding are straightforward
m-of-n (n>2): not systematic, i.e. incurs encoding and decoding costs; optimal when m=n/2
One-hot, 1-of-n (n>2): completion detection is easy; not practical for n>4
Systematic, such as Berger: incur complex completion detection
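As an illustration of the simplest delay-insensitive code listed above, here is a minimal sketch of dual-rail RTZ encoding with a spacer. The rail ordering `(data0, data1)` and all function names are assumptions chosen for the example.

```python
SPACER = (0, 0)  # NULL state on one dual-rail pair


def encode(bits):
    """Map each bit to a (data0, data1) pair: 0 -> (1, 0), 1 -> (0, 1)."""
    return [(1 - b, b) for b in bits]


def is_complete(word):
    """A codeword is complete when every pair has exactly one rail high."""
    return all(d0 != d1 for d0, d1 in word)


def is_spacer(word):
    """The spacer (NULL) has every rail low."""
    return all(pair == SPACER for pair in word)


def decode(word):
    """Recover the bits; refuse to decode a word that has not fully arrived."""
    if not is_complete(word):
        raise ValueError("incomplete codeword")
    return [d1 for _, d1 in word]


def rtz_sequence(words):
    """Interleave codewords with spacers, as RTZ signalling requires."""
    seq = []
    for bits in words:
        seq.append(encode(bits))
        seq.append([SPACER] * len(bits))
    return seq


word = encode([1, 0, 1])
print(word, decode(word))
print(is_complete([(0, 1), (0, 0), (0, 1)]))  # middle bit still NULL
```

Because validity is carried by the codeword itself, the receiver needs no timing assumption about when each wire arrives; it simply waits until `is_complete` holds.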

20
Bundled Data
RTZ
Data
req
ack
NRZ
Data
req
ack
One cycle
One cycle
21
DI encoded data (Dual-Rail)
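[Timing diagrams. RTZ: wires Data.0 and Data.1 with ack; a NULL (spacer) separates successive values (Logical 0, Logical 1); one cycle per value. NRZ: wires Data.0 and Data.1 with ack; each transition carries a value (Logical 0, Logical 1, Logical 1, Logical 1); one cycle per transition.]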
22
DI encoded data (Dual-Rail)
[Timing diagrams: RTZ and NRZ dual-rail signalling on wires Data.0 and Data.1 with ack]
NRZ: this coding leads to a complex logic implementation; it is hard to track odd and even phases and logic values, hence see LEDR below
23
DI codes (1-of-n and m-of-n)

1-of-4
0001 => 00, 0010 => 01, 0100 => 10, 1000 => 11
2-of-4
1100, 1010, 1001, 0110, 0101, 0011: total 6 combinations (cf. 2-bit dual-rail: 4 comb.)
3-of-6
111000, 110100, ..., 000111: total 20 combinations (can encode 4 bits + 4 control tokens)
2-of-7
1100000, 1010000, ..., 0000011: total 21 combinations (4 bits + 5 control tokens)
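The codeword counts above are binomial coefficients and can be checked by enumeration. The helper names below are hypothetical; the sketch simply lists all n-wire words with exactly m wires high and reports how many whole data bits fit, with the remainder available as control tokens.

```python
from itertools import combinations


def m_of_n_codewords(m, n):
    """All n-bit strings with exactly m ones, leftmost wires first."""
    words = []
    for ones in combinations(range(n), m):
        words.append("".join("1" if i in ones else "0" for i in range(n)))
    return words


def capacity(m, n):
    """Return (codewords, whole data bits, spare codewords for control)."""
    count = len(m_of_n_codewords(m, n))
    bits = count.bit_length() - 1        # floor(log2(count))
    return count, bits, count - 2 ** bits


for m, n in [(1, 4), (2, 4), (3, 6), (2, 7)]:
    print(f"{m}-of-{n}:", capacity(m, n))

print(m_of_n_codewords(2, 4))
```

For 3-of-6 this gives 20 codewords, i.e. 4 data bits plus 4 spare codewords, matching the slide; 2-of-7 gives 21, i.e. 4 bits plus 5 spares.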

24
DI codes: completion detection and decoding

1-of-4 completion detection is a 4-input OR gate (CD = d0 + d1 + d2 + d3)
Decoding 1-of-4 to dual rail is a set of four 2-input OR gates (q0.0 = d0 + d2; q0.1 = d1 + d3; q1.0 = d0 + d1; q1.1 = d2 + d3)
For m-of-n codes, CD and decoding are non-trivial

From J. Bainbridge et al., ASYNC03
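The OR-gate equations for 1-of-4 can be exercised directly. In this sketch the wire numbering is an assumption: wire d_k is taken to be the one that goes high for the 2-bit value k, and q1 is the high-order dual-rail bit.

```python
def completion_detect(d):
    """CD = d0 + d1 + d2 + d3: high as soon as any wire of the group is high."""
    d0, d1, d2, d3 = d
    return d0 | d1 | d2 | d3


def decode_1of4_to_dual_rail(d):
    """Four 2-input OR gates turn a 1-of-4 symbol into two dual-rail bits."""
    d0, d1, d2, d3 = d
    return {
        "q0.0": d0 | d2,   # low bit is 0 for values 00 and 10
        "q0.1": d1 | d3,   # low bit is 1 for values 01 and 11
        "q1.0": d0 | d1,   # high bit is 0 for values 00 and 01
        "q1.1": d2 | d3,   # high bit is 1 for values 10 and 11
    }


for value in range(4):
    wires = tuple(1 if k == value else 0 for k in range(4))
    print(value, wires, completion_detect(wires), decode_1of4_to_dual_rail(wires))
```

With all four wires low (the spacer), `completion_detect` returns 0 and every dual-rail output is low, so the spacer propagates through the decoder unchanged.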
25
Incomplete DI codes
Incomplete 2-of-7: composed of 1-of-3 and 1-of-4
From J. Bainbridge et al., ASYNC03
26
Phase difference based encoding (C. D'Alessandro et al., ASYNC06, 07)

The proposed system encodes a bit of data in the phase relationship between two signals generated using a reference
This ensures that any transient fault appearing on one of the reference signals is ignored if it is not mirrored by a corresponding transition on the other line
Similarity with multi-wire communication

27
Phase encoding: multiple rail

No group of wires has the same delay
All wires toggle when an item of data is sent
Increased number of states available (n wires = n! states), hence more bits/symbol
The table illustrates examples of phase encoding compared to the respective m-of-n counterparts

Type of link    States  Bits/symbol  Extra states  Transitions/symbol  Symbols/packet  Transitions/packet
Phase enc. (4)      24            4             8                   4              32                 128
1-of-4               4            2             0                   2              64                 128
Phase enc. (6)     720            9           208                   6              15                  90
3-of-6              20            4             4                   6              32                 192
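The table entries can be re-derived from the number of states. The sketch below assumes a 128-bit packet, which is inferred from the table rather than stated on the slide. It also assumes that every wire toggles once per phase-encoded symbol and that an RTZ m-of-n symbol costs 2m transitions (m up, m down).

```python
from math import ceil, comb, factorial

PACKET_BITS = 128  # assumption: inferred from the table, not given on the slide


def link_row(states, transitions_per_symbol, packet_bits=PACKET_BITS):
    """Return (states, bits/symbol, extra states, transitions/symbol,
    symbols/packet, transitions/packet) for one link type."""
    bits = states.bit_length() - 1            # floor(log2(states))
    extra = states - 2 ** bits                # states left over after the data
    symbols = ceil(packet_bits / bits)
    return (states, bits, extra, transitions_per_symbol,
            symbols, symbols * transitions_per_symbol)


rows = {
    "Phase enc. (4)": link_row(factorial(4), 4),      # n wires, n! orderings
    "1-of-4": link_row(comb(4, 1), 2 * 1),
    "Phase enc. (6)": link_row(factorial(6), 6),
    "3-of-6": link_row(comb(6, 3), 2 * 3),
}

for name, row in rows.items():
    print(f"{name:15s}", row)
```

Under these assumptions, six-wire phase encoding moves a packet with 90 transitions where 3-of-6 on the same six wires needs 192.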
28
Phase encoding: Repeater
[Diagram: phase detectors (mutexes), one per ordered pair of wires: 1<3, 3<1, 2<3, 3<2, 1<2, 2<1]
29
Pipelines
Dual-rail pipeline
From J. Bainbridge & S. Furber, IEEE Micro, 2002
30
The problem of Acking

Question 2 ("when can Tx send new data?") has two aspects:
Safety (not to overflow the channel, or when Tx and Rx have much variation in delay)
Performance (to maximize throughput and reduce latency)
Can we hide the ack (round trip) delay?

31
To maintain throughput, more pipeline stages are required, but that costs too much latency and power
First minimize latency along a long wire (not specific to asynchronous), and then maximize throughput (using the wagging tail buffer approach)
From R. Ho et al., ASYNC04
32
Use of wagging buffer approach
Alternate between top and bottom control
From R.Ho et al. ASYNC04
33
Wagging tail buffer approach
Top and bot control channels work at ½ the frequency of the data channel
[Signals: reqtop, acktop, data, reqbot, ackbot]
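The alternation between the two control channels can be pictured as a round-robin split of the data stream. This is only a behavioural sketch with hypothetical names (`wag`, `unwag`); it says nothing about the circuit in the cited paper.

```python
def wag(items):
    """Send even-numbered items via the top control channel, odd via bot."""
    top = items[0::2]
    bot = items[1::2]
    return top, bot


def unwag(top, bot):
    """Read the two channels alternately to restore the original order."""
    out = []
    for i in range(len(top) + len(bot)):
        src = top if i % 2 == 0 else bot
        out.append(src[i // 2])
    return out


top, bot = wag(list(range(7)))
print(top, bot)
print(unwag(top, bot))
```

Each control channel handles every other item, so its handshake runs at half the data rate, and the ack of one channel can overlap with the data transfer of the other.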
34
Serial Link vs Parallel Link (from R. Dobkin)

Why Serial Link?
Less interconnect area
Less routing congestion
Less coupling
Less power (depends on range)
The relative improvement grows with technology scaling. The example on the right refers to:
Single gate delay serial link
Fully-shielded parallel link with 8 gate delay clock cycle
Equal bit-rate
Word width N=8

[Plot: Link Length (mm) vs Technology Node (nm), with regions where the serial link or the parallel link dissipates less power, and where the serial link or the parallel link requires less area]
35
Serialization model
Tx
Rx

Acking at the bit level
36
Serialization model
Tx
Rx
Acking at the word level
37
Serialization model
Tx
Rx
Acking at the word level (with more concurrency)
38
Serial Link Top Structure (R.Dobkin, Async07)

Transition signalling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS)
Acknowledge per word instead of per bit
Synchronizers used at the level of the ack signals
Wave-pipelining over the channel
Differential encoding (DS-DE, IEEE1355-95)
Reported throughput: 67 Gbps for a 65nm process (viz. one bit per 15ps, the expected FO4 inverter delay), based on simulations

39
Encoding: Two-Phase NRZ LEDR

Two-Phase Non-Return-to-Zero Level Encoded Dual Rail
Delta encoding (one transition per bit)
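A minimal sketch of LEDR follows, assuming the usual formulation: one wire carries the bit value, and the second wire is chosen so that the phase (XOR of the two wires) alternates from symbol to symbol. The function names and the starting state `(0, 0)` are assumptions.

```python
def ledr_encode(bits, start=(0, 0)):
    """Encode bits as (data, parity) wire levels with alternating phase."""
    d, p = start
    out = []
    for b in bits:
        phase = d ^ p ^ 1          # next symbol has the opposite phase
        d, p = b, b ^ phase        # data wire shows the bit, parity fixes the phase
        out.append((d, p))
    return out


def ledr_decode(symbols):
    """The data wire level is the bit value in every phase."""
    return [d for d, _ in symbols]


def toggles_per_symbol(symbols, start=(0, 0)):
    """Count how many wires change between consecutive symbols."""
    counts = []
    prev = start
    for cur in symbols:
        counts.append((prev[0] != cur[0]) + (prev[1] != cur[1]))
        prev = cur
    return counts


symbols = ledr_encode([1, 1, 0, 0, 1])
print(symbols)
print(ledr_decode(symbols))
print(toggles_per_symbol(symbols))   # exactly one wire toggles per bit
```

When the new bit differs from the previous one the data wire toggles; when it repeats, the parity wire toggles instead. This is why the receiver can always tell a new symbol has arrived without a spacer.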

40
Transmitter Fast SR Approach (from R. Dobkin)
41
Receiver Splitter (from R. Dobkin)
42
Self Timed Networks

Router requires priority arbitration
Arbitration necessary at every router merge
Potential delay at every node on the path
BUT
Asynchronous merge/arbitration time is average
not worst case
Adapters to locally clocked cells require
synchronization
Synchronization necessary when clocks are unknown
Occurs when receiving data (data valid), and when
sending (acknowledge)
BUT
Time can be long (2 cycles?)
Must assume worst case time (maybe)

43
Router priority
Flow Control
Link
Merge
Split

Virtual channels implement scheduling algorithm
Contention for link resolved by priority circuits

44
Asynchronous Arbiters

Multiway arbiters (e.g. for Xbar switches)
Cascaded mesh (latency N)
Cascaded Tree (latency logN)
Token-Ring (busy ring and lazy ring) (latency
from 1 to N)
Priority arbiters (e.g. for routers with different QoS)
Static priority (topological order)
Dynamic priority (request arrives with priority
code)
Ordered (time-priority) - multiway arbiter,
followed by a FIFO buffer

45
Static Priority Arbiter
46
Why Synchronizer?
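[Diagrams: a DFF sampling DATA on CLK, where Q can go metastable between 0 and 1 when DATA changes close to the clock edge; and a two-DFF chain (DATA -> DFF -> DFF -> Q) clocked by CLK]
Here one clock cycle is used for the metastability to resolve.
Two DFF Synchronizer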
47
CAD support Async design flow
48
Synthesis of Asynchronous link interfaces
50
Boolean equations:
LDS = D ∨ csc
DTACK = D
D = LDTACK csc
csc = DSr ...
51
Conclusions on Async Links

At nm level, links will be more asynchronous; perhaps, at first, mesochronous to avoid global clock skew
Delay-insensitive codes can be used to tolerate inter-wire delay variability
Phase encoding can be used for higher power-bit efficiency and SEU tolerance
Acking will be mainly used for flow control (word level), and its overhead can be hidden by using the wagging buffer technique
Serial links save area and power for long interconnects, with buffering (pipelining) if one wants to maintain high throughput; they also simplify building switches
Synthesis tools can be used to build clock-free interfaces between different links
Asynchronous logic can be used for building higher-level circuits, e.g. arbiters for switches and routers

And finally

53
ASYNC08 and NOCs08 plus SLIP08

Held in Newcastle upon Tyne, UK, 7-11 April 2008
(SLIP on 5-6 April weekend)
async.org.uk/async2008
async.org.uk/nocs2008
Submission deadlines:
Async08: Abstract Oct. 8, Full paper Oct. 15
NOCs08: Abstract Nov. 12, Full paper Nov. 19

54
Extras

More slides if I have time!

55
Chain Network Components
From J. Bainbridge & S. Furber, IEEE Micro, 2002
62
Reliability and latency

Asynchronous arbiters fail only if time is bounded
Latency depends on fixed gates plus MUTEX lock time
τ for 2 channels, τ × ln(N-1) for more
This is likely to be small compared with flow control latency
Synchronizers fail at (fairly) predictable rates, but these rates may get worse
Latency can be 35τ now for good reliability

63
The synchronizer

Clock and valid can happen very close together
Flip Flop 1 gets caught in metastability
We wait until it is resolved (1-2 clock periods)

DATA
VALID
1
2
CLK2
CLK1
64
MTBF

For a 0.18 µm process, τ is 20-50 ps
Tw is similar
Suppose the clock and data frequencies are 2 GHz
t needs to be > 25τ (more than one clock period) to get MTBF > 28 days
100 synchronizers: +5τ
MTBF > 1 year: +2τ
PVT variations: +5-10τ ...
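The increments listed above follow from the usual exponential metastability model. The slide does not write the formula out, so the form below (with f_c and f_d as clock and data rates, and N as the number of synchronizers) is an assumption based on the symbols it uses.

```latex
\[
\mathrm{MTBF} = \frac{e^{t/\tau}}{T_w \, f_c \, f_d}
\quad\Longrightarrow\quad
t = \tau \ln\!\left(\mathrm{MTBF}\cdot T_w \, f_c \, f_d\right)
\]
% N independent synchronizers fail N times as often, so the same MTBF needs
\[
t_N = t + \tau \ln N, \qquad \ln 100 \approx 4.6
\]
% Raising the target MTBF by a factor k costs a further \tau \ln k, e.g.
\[
\ln\frac{365\ \text{days}}{28\ \text{days}} \approx 2.6
\]
```

Because t enters through an exponent, each extra requirement adds only a few τ; the slide quotes these increments rounded to whole multiples of τ.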

65
Event Histogram
Measurement
Convert to log scale; the slope gives τ
66
Not always simple
More than one slope: 350ps, 120ps, 140ps
67
Synchronization Strategies

Avoid synchronization time (and arbitration time) by
predicting clocks, stoppable clocks
dedicating link paths for long periods of time
Minimize time by circuit methods
Higher power, better τ
Reducing apparent device variability: wide transistors
Many parallel synchronizers increase throughput
Reduce average latency by speculation
Reduce synchronization time, detect errors and roll back

68
Timing regions can have predictable relationships

Locked
Two clocks from same source
Linked by PLL
One produced by dividing the other
Some asynchronous systems
Some GALS
Not locked together but predictable
Two clocks same frequency, but different
oscillators.
As above, same frequency ratio

69
Don't synchronise when you don't need to

If the two clocks are locked together, you don't need a synchroniser, just an asynchronous FIFO big enough to accommodate any jitter/skew
FIFO must never overflow
Next read clock can be predicted and metastability avoided

70
Conflict Prediction
Receiver Clock
Transmitter Clock
Predicted Transmitter Clock
Synchronization problem known a cycle in advance
of the Receiver clock.
We can do this thanks to the periodic nature of
the clocks
71
Problems predicting next cycle

Difficult to predict
Multiple source clocks
Input/output interfaces
Dynamic jitter and noise
GALS start-up: clocks take several cycles to stabilise
Crosstalk
Power supply variations introduce noise into both data and clock
Temperature changes alter relative delays
As a proportion of cycle time, this is likely to increase with smaller geometries

72
Synchronizer reliability trends

Clock rates increase. 10 GHz gives 100ps for a
cycle.
Both data and clock rates up by n
τ down by n
Assume τ scales with cycle time: reliability (MTBF) of one synchronizer down by n
Number of synchronizers goes up by N
Die reliability down by N
Die-to-die and on-die variability increases to as much as 40%
40% more time needed for all synchronizers

73
An example

Example
10 GHz clock and data rate
τ = 10 ps
100 synchronizers
MTBF required: 3.8 months (10^7 seconds)
Time required: 41τ, or 4.1 cycles; +40% = 5.8 cycles
Does this matter?
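The figures in this example can be reproduced with the standard relation MTBF = e^(t/τ) / (N · Tw · fc · fd). Two things are assumed here rather than taken from this slide: that this is the model behind the numbers, and that Tw equals τ (an earlier slide only says "Tw is similar"). The function name is hypothetical.

```python
import math


def settling_time_in_tau(mtbf_s, n_sync, tw_s, f_clk_hz, f_data_hz):
    """Settling time t, in units of tau, needed for a target die-level MTBF.

    Solves mtbf = exp(t / tau) / (n_sync * tw * f_clk * f_data) for t / tau.
    """
    return math.log(mtbf_s * n_sync * tw_s * f_clk_hz * f_data_hz)


tau = 10e-12          # 10 ps, from the slide
tw = tau              # assumption: Tw taken equal to tau
f = 10e9              # 10 GHz clock and data rate

t_over_tau = settling_time_in_tau(1e7, 100, tw, f, f)
cycles = t_over_tau * tau * f           # settling time in clock cycles
cycles_with_margin = 1.4 * cycles       # 40% extra for variability

print(round(t_over_tau, 1), round(cycles, 1), round(cycles_with_margin, 1))
```

Under these assumptions the result is about 41τ, i.e. roughly 4.1 cycles, growing to about 5.8 cycles with the 40% variability margin, in line with the slide.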

74
Power futures

Total synchronizer area/power is small, BUT
τ is very sensitive to voltage/power: both n and p transistors can turn off at low voltages, so there is no gain
This affects MUTEX circuits as well

75
Power/speed tradeoffs

Increase Vdd when synchronisation required
Make synchronizer transistors wide to reduce variation and, to some extent, τ
Make many synchronizer circuits, and select the
consistently fastest one
Avoid reducing synchronizer Vdd when running slow

76
Speculation

Mostly, the synchronizer does not need 35τ to settle
Only e^-10 (0.005%) need more than 10τ
Why not go ahead anyway, and try again if more time was needed?
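The case for speculation can be put in numbers. The slide's figure implies that the fraction of events still unresolved after kτ is e^-k; the retry penalty below is a hypothetical parameter added only to show the effect on average latency.

```python
import math


def fraction_unresolved(k_tau):
    """Fraction of synchronization events needing more than k_tau time constants."""
    return math.exp(-k_tau)


def expected_latency(spec_tau, retry_penalty_tau):
    """Average latency (in tau) when proceeding after spec_tau and paying
    retry_penalty_tau extra whenever the speculation turns out wrong."""
    return spec_tau + fraction_unresolved(spec_tau) * retry_penalty_tau


print(f"{100 * fraction_unresolved(10):.4f} %")   # share needing more than 10 tau
print(expected_latency(10, 35))                   # hypothetical 35 tau penalty
```

Even with a penalty as large as the full 35τ wait, the average stays within a small fraction of a τ above 10τ, because failures of the speculation are so rare.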

77
Low latency synchronization

Data Available, or Free to write, are produced early
After one cycle?
If they prove to be in error, synchronization failed
Only know this after two or more cycles
Read Fail or Write Fail flag is then raised and the action can be repeated.

[Diagram: a FIFO with DATA in and DATA out, Write clock and Read Clock, WRITE and READ controls; two speculative synchronizers produce Data Available (from Not Empty) and Free to write (from Full); Read Fail and Write Fail flags; Write Data and Read done signals]