Title: Asynchronous Pipelines
1. Asynchronous Pipelines
- Author: Peter Yeh
- Advisor: Professor Beerel
2. Motivation
- Can we reduce the communication overhead of asynchronous pipelines while hiding precharge time?
- Can we achieve a cycle time in asynchronous pipelines as fast as, if not faster than, the best synchronous counterparts?
3. Motivation: System Performance
- Fixed-stage pipeline
- Low pipeline usage: low latency is critical
- High pipeline usage: cycle time is the limiting factor, since new outputs must be generated as fast as possible
- Flexible-stage pipeline
- With zero forward overhead and a short cycle time, we can achieve a given desired throughput with fewer stages
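The stage-count argument above can be sketched with a toy model (the function name and all delay numbers below are illustrative, not from the slides): splitting a computation of total logic delay L evenly over N stages gives a per-stage cycle time of roughly L/N plus the scheme's per-cycle handshake overhead, so a lower-overhead scheme reaches a target throughput with fewer stages.

```python
import math

def stages_needed(total_logic_delay, per_stage_overhead, target_cycle_time):
    """Stages needed so each slice of logic plus handshake overhead fits the target cycle.

    Toy model: cycle time of an N-stage split is total_logic_delay/N + per_stage_overhead.
    """
    budget = target_cycle_time - per_stage_overhead
    if budget <= 0:
        raise ValueError("overhead alone exceeds the target cycle time")
    return math.ceil(total_logic_delay / budget)

# Illustrative numbers (arbitrary units): lower handshake overhead -> fewer stages.
print(stages_needed(40, 8, 13))   # high-overhead scheme: 8 stages
print(stages_needed(40, 2, 13))   # low-overhead scheme: 4 stages
```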
4. Motivation: System Performance
- Pipelines with loop dependencies
- The optimal cycle time is the sum of the latencies around the loop
- Pipelining is required to keep precharge/reset out of the critical path
- Our scheme requires fewer pipeline stages to achieve the same performance
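A minimal sketch of the loop bound stated above (illustrative values, not from the slides): with a single token circulating around a loop, the achievable cycle time can be no smaller than the sum of the forward latencies around the loop, nor smaller than any stage's local cycle time.

```python
def loop_cycle_time(loop_forward_latencies, local_cycle_times):
    """Achievable cycle time of a pipeline containing a single-token loop.

    The loop bound is the sum of forward latencies around the loop; the
    structural bound is the local cycle time of the slowest stage.
    """
    loop_bound = sum(loop_forward_latencies)
    return max(loop_bound, max(local_cycle_times))

# Illustrative: three stages of forward latency 2 each, local cycle times of 5.
print(loop_cycle_time([2, 2, 2], [5, 5, 5]))  # 6: the loop latency dominates
```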
5. Introduction
- Asynchronous pipeline schemes using a Taken Detector (TD)
- Best used in coarse-grained pipelines
- Two schemes targeting different requirements (a possible third SI scheme as well)
6. Outline
- Background review
- Sutherland
- Ted Williams
- Renaudin
- Martin
- Taken pipeline
- Performance comparison
- Conclusion
7. Definitions
- Stage: a collection of logic that is precharged or evaluated at the same time
- Cycle: the time it takes for a stage to start the next evaluation after the current one
- Forward latency: the time from the start of evaluation of the current stage to the start of evaluation of the next stage
8. Background Outline
- Sutherland's Micropipeline scheme
- Ted Williams' PS0 and PC0 pipeline schemes
- Renaudin's DCVSL pipeline scheme
- Martin's deep pipeline schemes
9. Sutherland's Micropipeline
- The father of asynchronous pipelines; presented in his Turing Award lecture
- Delay-insensitive
(Figure: Micropipeline stage diagram with C-elements and LOGIC blocks; R(in)/A(in) and R(out)/A(out) handshake signals along the D(in)-to-D(out) datapath.)
10. Williams' PC0
- Speed-independent
- Cycle time (P): 3tF↑ + 1tF↓ + 4tC + 4tD
- Forward latency (Lf): 1tF↑ + 1tD + 1tC
(Figure: PC0 pipeline of precharged function blocks F1-F3 with completion detectors D1-D3 and C-elements C1-C3; R(in)/A(in) and R(out)/A(out) handshakes along the D(in)-to-D(out) datapath.)
11. PC0 Timing Diagram
- The cycle time is shown in red arrows, while the blue arrows show the precharge phase
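Plugging component delays into the PC0 expressions quoted above gives the cycle time and forward latency directly (the garbled operators are read here as evaluation ↑ / precharge ↓ delays summed together; the unit delays below are illustrative):

```python
def pc0_cycle_time(tF_up, tF_dn, tC, tD):
    # P = 3tF(up) + 1tF(down) + 4tC + 4tD, per the PC0 slide
    return 3 * tF_up + tF_dn + 4 * tC + 4 * tD

def pc0_forward_latency(tF_up, tC, tD):
    # Lf = 1tF(up) + 1tD + 1tC
    return tF_up + tD + tC

# With unit delays, the four C-element and four detector delays dominate P.
print(pc0_cycle_time(1, 1, 1, 1))    # 12
print(pc0_forward_latency(1, 1, 1))  # 3
```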
12. Dependency Graph
(Figures: flat dependency graph unrolling the C, F, and D events of each stage (C1, F1, D1, C2, F2, D2, ...), and the folded dependency graph over a single C → F → D cycle with edge weights of -1, 0, and 1.)
13. Williams' PC1
- Cycle time (P): 2tF↑ + 4tC + 4tD
- Forward latency (Lf): 1tF↑ + 2tC + 1tD
(Figure: PC1 pipeline of precharged function blocks F1 and F2 with a C-latch, completion detectors DA, DB, and D2, and C-elements C1 and C2; R/A handshakes along the D(in)-to-D(out) datapath.)
14. Williams' PS0
- Not speed-independent
- Cycle time (P): 3tF↑ + 1tF↓ + 2tD
- Forward latency (Lf): 1tF↑
(Figure: PS0 pipeline of precharged function blocks F1-F3 with completion detectors D1-D3; A(in)/A(out) acknowledges along the D(in)-to-D(out) datapath.)
15. PS0 Timing Diagram
16. PS0 Timing Assumption
- The pipeline has to meet the following timing assumption on tF↓
17. Renaudin's DCVSL Pipeline
- Compared with Ted Williams' PC0 only
- Uses DCVSL exclusively
- Introduces latched DCVSL
- Improves cycle time but not forward latency
- Cycle time (P): 1tF↑ + 1tF↓ + 4tC + 2tD
- Forward latency (Lf): 1tF↑ + 1tC + 1tD
18. DCVS Logic Family
(Figure: DCVS logic and latched DCVS logic gate schematics.)
19. More on DCVSL
- Advantages
- Fast, based on dynamic domino-type logic
- Built-in four-phase handshaking
- Robust completion sensing
- Storage element
- Disadvantages
- Higher complexity: an increase in the number of transistors and in area
- Higher power dissipation
20. DCVS Pipeline
- Cycle time (P): 1tF↑ + 1tF↓ + 4tC + 2tD (= 2tF + 4tC + 2tD)
- Forward latency (Lf): 1tF↑ + 1tC + 1tD
(Figure: DCVS pipeline of precharged function blocks F1-F3 with completion detectors D1-D3 and C-elements C1-C3; R/A handshakes along the D(in)-to-D(out) datapath.)
21. DCVS Pipeline Timing Diagram
22. DCVS Dependency Graph
- Cycle time (P): 1tF↑ + 1tF↓ + 4tC + 2tD
- Forward latency (Lf): 1tF↑ + 1tC + 1tD
(Figure: folded dependency graph over the C, F, and D events with edge weights of -1, 0, and 1.)
23. Martin's Pipeline Schemes
- Deep pipelining
- Quasi delay-insensitive (QDI): no timing assumptions
- Based on different handshaking reshufflings
- The best scheme has high concurrency, which reduces control overhead
- Control logic is more complex
24. Basic Asynchronous Handshaking
(Figure: handshaking expansion showing the transition ordering of the Le, Re, L1, and R1 signals.)
- Reshuffling eliminates the explicit variable x
- Large control overhead
25. Handshaking Reshuffling
(Figure: reshuffled handshaking expansion of the Le, Re, L1, and R1 transitions.)
- Still waits for the predecessor to reset before resetting itself, giving larger overhead for more inputs
26. Precharge-Logic Half-Buffer
(Figure: PCHB handshaking expansion of the Le, Re, L1, and R1 transitions.)
- Doesn't wait for the predecessor to reset before it resets its outputs; however, the control logic waits for the predecessor's reset only after the current stage has reset
27. Precharge-Logic Full-Buffer
(Figure: PCFB handshaking expansion of the Le, Re, L1, R1, and en transitions.)
- Allows the neutrality test of the output data to overlap with raising the left enables
- Complex control logic; requires an extra state variable
28. Martin's PCHB Full-Adder
29. Martin's Pipelines in General
(Figure: pipeline of precharged function blocks F1-F3, each with a completion detector (D1-D3) and a Control block exchanging Le/Re handshakes; D(in)-to-D(out) datapath.)
- The cycle time is limited by the properties of QDI
- The next stage has to finish precharging before the current stage can evaluate the next input
30. Performance Analysis of PCFB
- The control logic can be seen as completion detection (D) plus a C-element (C)
- Reshuffling the handshaking changes the degree of concurrency, but it doesn't affect the best-case performance analysis
- Cycle time (P): 3tF↑ + 1tF↓ + 2tC + 2tD
- Forward latency (Lf): 1tF↑
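Collecting the cycle-time expressions quoted on the preceding slides in one place makes the control-overhead differences easy to compare (operators read as evaluation ↑ / precharge ↓ delays summed; the unit delays below are illustrative, not measurements):

```python
# tF_up/tF_dn: function-block evaluation/precharge delays,
# tC: C-element delay, tD: completion-detector delay.
def cycle_times(tF_up, tF_dn, tC, tD):
    return {
        "PC0":   3 * tF_up + tF_dn + 4 * tC + 4 * tD,
        "PS0":   3 * tF_up + tF_dn + 2 * tD,
        "DCVSL": tF_up + tF_dn + 4 * tC + 2 * tD,
        "PCFB":  3 * tF_up + tF_dn + 2 * tC + 2 * tD,
    }

# With all unit delays, PS0's lack of C-elements gives it the shortest cycle.
for name, p in sorted(cycle_times(1, 1, 1, 1).items(), key=lambda kv: kv[1]):
    print(name, p)
```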
31. Outline
- Background review
- Sutherland
- Ted Williams
- Renaudin
- Martin
- Taken pipeline
- Performance comparison
- Conclusion
32. Taken Pipeline
- Uses a Taken Detector
- Two schemes to satisfy different requirements
- Neither is speed-independent
33. Initial Idea
- Precharge only when the next stage has taken the current result
- Evaluate only when the next stage has precharged
- Similar in idea to Martin's pipeline schemes
34. Further Observation
- Precharge
- We can precharge the current stage as soon as the first-level logic of the next stage has evaluated, i.e., the next stage has taken the result
- Evaluate
- Evaluation can start as soon as the guarded N-transistors in the first-level logic of the next stage have turned off
35. Relax Precharge (RP) Constraint
- The current stage can precharge as soon as the first-level logic of the next stage has evaluated: the next stage has taken the result
- The current stage can evaluate as soon as the first-level logic of the next stage has precharged, blocking the new result from passing through
- No extra control logic is needed except the TD, which is similar to a completion detector
36. RP Pipeline Scheme
- Cycle time (P): 2tF↑ + 1tF1↑ + 1tF1↓ + 2tTD
- Forward latency (Lf): 1tF↑
(Figure: RP pipeline of precharged function blocks F1-F3 along the D(in)-to-D(out) datapath.)
37. RP Timing Diagram
38. RP Timing Assumption
- The timing assumption is easy to meet
39. RP Timing Assumption (cont.)
- tF1i is the delay of the first-level logic of stage i
- tF2i is the delay of the logic after the first level of stage i
- The rising and falling delays of the TD are assumed to be equal
40. Relax Evaluation (RE) Constraint
- The current stage can start its evaluation at about the same time as the next stage turns off the guarded N-transistors in its first-level logic
- Requires a generalized C-element, yet improves the cycle time
41. RE Pipeline Scheme
- The TD can be skewed for fast evaluation detection
- Cycle time (P): 2tF↑ + 1tF1↓ + 1tTD + 1tC
- Forward latency (Lf): 1tF↑
(Figure: RE pipeline of precharged function blocks F1-F3 with generalized C-elements (GC1) along the D(in)-to-D(out) datapath.)
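Using the P expressions from the RP and RE slides (operators reconstructed as sums of evaluation ↑ / precharge ↓ delays; the delay values below are illustrative), RE trades one generalized C-element delay for the first-level evaluation and one TD delay in the cycle:

```python
def rp_cycle_time(tF_up, tF1_up, tF1_dn, tTD):
    # RP: P = 2tF(up) + 1tF1(up) + 1tF1(down) + 2tTD
    return 2 * tF_up + tF1_up + tF1_dn + 2 * tTD

def re_cycle_time(tF_up, tF1_dn, tTD, tC):
    # RE: P = 2tF(up) + 1tF1(down) + 1tTD + 1tC
    return 2 * tF_up + tF1_dn + tTD + tC

# Illustrative delays: stage eval 2, first-level logic 1 (up and down),
# Taken Detector 0.5, generalized C-element 0.5.
print(rp_cycle_time(2, 1, 1, 0.5))    # 7.0
print(re_cycle_time(2, 1, 0.5, 0.5))  # 6.0
```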
42. RE Timing Diagram
43. RE Timing Assumption 1
44. RE Timing Assumption 2
- Evaluation constraint (minimum delay)
45. Issues in Fine-Grained Pipelines
- In a fine-grained pipeline, such as Martin's single-gate pipeline, the RE scheme may require buffering due to process variation
- Buffering is necessary because of the second timing assumption: the next gate (stage) may not have turned off its N-stack before the result from the current stage reaches it
46. Taken Detector (TD)
- Similar to a completion detector
- Detects both evaluation and precharge
- Its inputs are the outputs of the first-level logic of each stage
47. Datapath Merging and Splitting
- Datapath merging and splitting can be done similarly to Williams' style
48. Outline
- Background review
- Sutherland
- Ted Williams
- Renaudin
- Martin
- Taken pipeline
- Performance comparison
- Conclusion
49. Comparison of RE and Synchronous Skew-Tolerant Clocking
- Assuming a 4-stage pipeline (stages 1-4) and 4-phase clocking
- Synchronous
- Stage 1 starts its next evaluation after stage 4 starts evaluating
- Asynchronous
- Stage 1 starts its next evaluation after we detect the completion of the first-level logic of stage 3
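A back-of-the-envelope timeline for the comparison above (the first-level-logic and TD delays are assumed, illustrative values): in both styles stage i begins evaluating at (i - 1) times the stage delay, so the asynchronous scheme restarts stage 1 earlier whenever stage 3's first-level-logic and TD delays sum to less than a full stage delay.

```python
t_stage = 1.0  # per-stage evaluation time (balanced pipeline, per the next slide)
t_f1 = 0.25    # assumed delay of stage 3's first-level logic (illustrative)
t_td = 0.2     # assumed Taken Detector delay (illustrative)

# Stage i starts evaluating at (i - 1) * t_stage in both styles.
sync_restart = 3 * t_stage                 # stage 1 waits for stage 4 to start
async_restart = 2 * t_stage + t_f1 + t_td  # stage 3 starts, its first level
                                           # evaluates, and its TD fires
print(sync_restart, async_restart)         # 3.0 2.45
```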
50. Comparison Assumptions
- It is a balanced pipeline: all stages have equal evaluation time
- The precharge time is the same as the evaluation time
51. Graphical Comparison
52. Optimum Number of Stages
- Optimum Number of Stages (ONS)
- Cycle time is not the only factor in system performance; forward latency is also a limiting factor
- A larger cycle time can be compensated for by increasing the number of stages
- However, a high Lf means system throughput cannot be increased by adding more stages
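The trade-off above can be sketched with a toy model (all names and numbers are illustrative, not from the slides): deeper pipelining shrinks the cycle time, but any per-stage forward overhead accumulates in the total latency, which caps the useful pipeline depth.

```python
def throughput_and_latency(total_logic, fwd_overhead, cycle_overhead, n_stages):
    """Toy model of splitting a computation of total logic delay over N stages.

    Cycle time      P  = total_logic / N + cycle_overhead
    Forward latency Lf = total_logic + N * fwd_overhead  (overhead per stage)
    """
    p = total_logic / n_stages + cycle_overhead
    lf = total_logic + n_stages * fwd_overhead
    return 1.0 / p, lf

# With zero forward overhead, adding stages keeps raising throughput at no
# latency cost; with nonzero forward overhead, latency grows with depth.
for n in (2, 4, 8):
    print(n, throughput_and_latency(40, 0, 2, n), throughput_and_latency(40, 3, 2, n))
```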
53. Conclusion
- With Taken logic and some easy-to-meet timing requirements, we can achieve the best cycle time and forward latency
- The performance comparison with existing pipeline schemes is favorable
- Implementation is still required to prove the theory