Title: Asynchronous Pipelines
1. Asynchronous Pipelines
- Author: Peter Yeh
- Advisor: Professor Beerel
2. Motivation
- Can we reduce the communication overhead of asynchronous pipelines while hiding precharge time?
- Can we achieve a cycle time in asynchronous pipelines as fast as, if not faster than, the best synchronous counterparts?
3. Motivation: System Performance
- Fixed-stage pipeline
- Low pipeline usage: low latency is critical
- High pipeline usage: cycle time is the limiting factor, since new outputs must be generated as fast as possible
- Flexible-stage pipeline
- With zero forward overhead and a short cycle time, we can achieve a given desired throughput with fewer stages
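The stage-count argument above can be sketched with a toy model (the function name and all delay numbers below are illustrative, not from the slides): splitting a computation of total logic delay L evenly over N stages gives a per-stage cycle time of roughly L/N plus the scheme's per-cycle handshake overhead, so a lower-overhead scheme reaches a target throughput with fewer stages.

```python
import math

def stages_needed(total_logic_delay, per_stage_overhead, target_cycle_time):
    """Stages needed so each slice of logic plus handshake overhead fits the target cycle.

    Toy model: cycle time of an N-stage split is total_logic_delay/N + per_stage_overhead.
    """
    budget = target_cycle_time - per_stage_overhead
    if budget <= 0:
        raise ValueError("overhead alone exceeds the target cycle time")
    return math.ceil(total_logic_delay / budget)

# Illustrative numbers (arbitrary units): lower handshake overhead -> fewer stages.
print(stages_needed(40, 8, 13))   # high-overhead scheme: 8 stages
print(stages_needed(40, 2, 13))   # low-overhead scheme: 4 stages
```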
4. Motivation: System Performance
- Pipelines with loop dependencies
- The optimal cycle time is the sum of the latencies around the loop
- Pipelining is required to keep precharge/reset out of the critical path
- Our scheme requires fewer pipeline stages to achieve the same performance
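A minimal sketch of the loop bound stated above (illustrative values, not from the slides): with a single token circulating around a loop, the achievable cycle time can be no smaller than the sum of the forward latencies around the loop, nor smaller than any stage's local cycle time.

```python
def loop_cycle_time(loop_forward_latencies, local_cycle_times):
    """Achievable cycle time of a pipeline containing a single-token loop.

    The loop bound is the sum of forward latencies around the loop; the
    structural bound is the local cycle time of the slowest stage.
    """
    loop_bound = sum(loop_forward_latencies)
    return max(loop_bound, max(local_cycle_times))

# Illustrative: three stages of forward latency 2 each, local cycle times of 5.
print(loop_cycle_time([2, 2, 2], [5, 5, 5]))  # 6: the loop latency dominates
```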
5. Introduction
- Asynchronous pipeline schemes using a Taken Detector (TD)
- Best used in coarse-grained pipelines
- Two schemes targeting different requirements (a possible third SI scheme as well)
6. Outline
- Background review
- Sutherland
- Ted Williams
- Renaudin
- Martin
- Taken pipeline
- Performance comparison
- Conclusion
7. Definitions
- Stage: a collection of logic that is precharged or evaluated at the same time
- Cycle: the time it takes for a stage to start the next evaluation after the current one
- Forward latency: the time from the start of evaluation of the current stage to the start of evaluation of the next stage
8. Background Outline
- Sutherland's Micropipeline scheme
- Ted Williams' PS0 and PC0 pipeline schemes
- Renaudin's DCVSL pipeline scheme
- Martin's deep pipeline schemes
9. Sutherland's Micropipeline
- The father of asynchronous pipelines; presented in his Turing Award lecture
- Delay-insensitive
(Figure: Micropipeline stage diagram with C-elements and LOGIC blocks; R(in)/A(in) and R(out)/A(out) handshake signals along the D(in)-to-D(out) datapath.)
10. Williams' PC0
- Speed-independent
- Cycle time (P): 3tF↑ + 1tF↓ + 4tC + 4tD
- Forward latency (Lf): 1tF↑ + 1tD + 1tC
(Figure: PC0 pipeline of precharged function blocks F1-F3 with completion detectors D1-D3 and C-elements C1-C3; R(in)/A(in) and R(out)/A(out) handshakes along the D(in)-to-D(out) datapath.)
11. PC0 Timing Diagram
- The cycle time is shown in red arrows, while the blue arrows show the precharge phase
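Plugging component delays into the PC0 expressions quoted above gives the cycle time and forward latency directly (the garbled operators are read here as evaluation ↑ / precharge ↓ delays summed together; the unit delays below are illustrative):

```python
def pc0_cycle_time(tF_up, tF_dn, tC, tD):
    # P = 3tF(up) + 1tF(down) + 4tC + 4tD, per the PC0 slide
    return 3 * tF_up + tF_dn + 4 * tC + 4 * tD

def pc0_forward_latency(tF_up, tC, tD):
    # Lf = 1tF(up) + 1tD + 1tC
    return tF_up + tD + tC

# With unit delays, the four C-element and four detector delays dominate P.
print(pc0_cycle_time(1, 1, 1, 1))    # 12
print(pc0_forward_latency(1, 1, 1))  # 3
```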
12. Dependency Graph
(Figures: flat dependency graph unrolling the C, F, and D events of each stage (C1, F1, D1, C2, F2, D2, ...), and the folded dependency graph over a single C → F → D cycle with edge weights of -1, 0, and 1.)
13. Williams' PC1
- Cycle time (P): 2tF↑ + 4tC + 4tD
- Forward latency (Lf): 1tF↑ + 2tC + 1tD
(Figure: PC1 pipeline of precharged function blocks F1 and F2 with a C-latch, completion detectors DA, DB, and D2, and C-elements C1 and C2; R/A handshakes along the D(in)-to-D(out) datapath.)
14. Williams' PS0
- Not speed-independent
- Cycle time (P): 3tF↑ + 1tF↓ + 2tD
- Forward latency (Lf): 1tF↑
(Figure: PS0 pipeline of precharged function blocks F1-F3 with completion detectors D1-D3; A(in)/A(out) acknowledges along the D(in)-to-D(out) datapath.)
15. PS0 Timing Diagram
16. PS0 Timing Assumption
- The pipeline has to meet the following timing assumption on tF↓
17. Renaudin's DCVSL Pipeline
- Compared with Ted Williams' PC0 only
- Uses DCVSL exclusively
- Introduces latched DCVSL
- Improves cycle time but not forward latency
- Cycle time (P): 1tF↑ + 1tF↓ + 4tC + 2tD
- Forward latency (Lf): 1tF↑ + 1tC + 1tD
18. DCVS Logic Family
(Figure: DCVS logic and latched DCVS logic gate schematics.)
19. More on DCVSL
- Advantages
- Fast, based on dynamic domino-type logic
- Built-in four-phase handshaking
- Robust completion sensing
- Storage element
- Disadvantages
- Higher complexity: an increase in the number of transistors and in area
- Higher power dissipation
20. DCVS Pipeline
- Cycle time (P): 1tF↑ + 1tF↓ + 4tC + 2tD (= 2tF + 4tC + 2tD)
- Forward latency (Lf): 1tF↑ + 1tC + 1tD
(Figure: DCVS pipeline of precharged function blocks F1-F3 with completion detectors D1-D3 and C-elements C1-C3; R/A handshakes along the D(in)-to-D(out) datapath.)
21. DCVS Pipeline Timing Diagram
22. DCVS Dependency Graph
- Cycle time (P): 1tF↑ + 1tF↓ + 4tC + 2tD
- Forward latency (Lf): 1tF↑ + 1tC + 1tD
(Figure: folded dependency graph over the C, F, and D events with edge weights of -1, 0, and 1.)
23. Martin's Pipeline Schemes
- Deep pipelining
- Quasi delay-insensitive (QDI): no timing assumptions
- Based on different handshaking reshufflings
- The best scheme has high concurrency, which reduces control overhead
- Control logic is more complex
24. Basic Asynchronous Handshaking
(Figure: handshaking expansion showing the transition ordering of the Le, Re, L1, and R1 signals.)
- Reshuffling eliminates the explicit variable x
- Large control overhead
25. Handshaking Reshuffling
(Figure: reshuffled handshaking expansion of the Le, Re, L1, and R1 transitions.)
- Still waits for the predecessor to reset before resetting itself, giving larger overhead for more inputs
26. Precharge-Logic Half-Buffer
(Figure: PCHB handshaking expansion of the Le, Re, L1, and R1 transitions.)
- Doesn't wait for the predecessor to reset before it resets its outputs; however, the control logic waits for the predecessor's reset only after the current stage has reset
27. Precharge-Logic Full-Buffer
(Figure: PCFB handshaking expansion of the Le, Re, L1, R1, and en transitions.)
- Allows the neutrality test of the output data to overlap with raising the left enables
- Complex control logic; requires an extra state variable
28. Martin's PCHB Full-Adder
29. Martin's Pipelines in General
(Figure: pipeline of precharged function blocks F1-F3, each with a completion detector (D1-D3) and a Control block exchanging Le/Re handshakes; D(in)-to-D(out) datapath.)
- The cycle time is limited by the properties of QDI
- The next stage has to finish precharging before the current stage can evaluate the next input
30. Performance Analysis of PCFB
- The control logic can be seen as completion detection (D) plus a C-element (C)
- Reshuffling the handshaking changes the degree of concurrency, but it doesn't affect the best-case performance analysis
- Cycle time (P): 3tF↑ + 1tF↓ + 2tC + 2tD
- Forward latency (Lf): 1tF↑
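Collecting the cycle-time expressions quoted on the preceding slides in one place makes the control-overhead differences easy to compare (operators read as evaluation ↑ / precharge ↓ delays summed; the unit delays below are illustrative, not measurements):

```python
# tF_up/tF_dn: function-block evaluation/precharge delays,
# tC: C-element delay, tD: completion-detector delay.
def cycle_times(tF_up, tF_dn, tC, tD):
    return {
        "PC0":   3 * tF_up + tF_dn + 4 * tC + 4 * tD,
        "PS0":   3 * tF_up + tF_dn + 2 * tD,
        "DCVSL": tF_up + tF_dn + 4 * tC + 2 * tD,
        "PCFB":  3 * tF_up + tF_dn + 2 * tC + 2 * tD,
    }

# With all unit delays, PS0's lack of C-elements gives it the shortest cycle.
for name, p in sorted(cycle_times(1, 1, 1, 1).items(), key=lambda kv: kv[1]):
    print(name, p)
```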
31. Outline
- Background review
- Sutherland
- Ted Williams
- Renaudin
- Martin
- Taken pipeline
- Performance comparison
- Conclusion
32. Taken Pipeline
- Uses a Taken Detector
- Two schemes to satisfy different requirements
- Neither is speed-independent
33. Initial Idea
- Precharge only when the next stage has taken the current result
- Evaluate only when the next stage has precharged
- Similar in idea to Martin's pipeline schemes
34. Further Observation
- Precharge
- We can precharge the current stage as soon as the first-level logic of the next stage has evaluated, i.e., the next stage has taken the result
- Evaluate
- Evaluation can start as soon as the guarded N-transistors in the first-level logic of the next stage have turned off
35. Relax Precharge (RP) Constraint
- The current stage can precharge as soon as the first-level logic of the next stage has evaluated: the next stage has taken the result
- The current stage can evaluate as soon as the first-level logic of the next stage has precharged, blocking the new result from passing through
- No extra control logic is needed except the TD, which is similar to a completion detector
36. RP Pipeline Scheme
- Cycle time (P): 2tF↑ + 1tF1↑ + 1tF1↓ + 2tTD
- Forward latency (Lf): 1tF↑
(Figure: RP pipeline of precharged function blocks F1-F3 along the D(in)-to-D(out) datapath.)
37. RP Timing Diagram
38. RP Timing Assumption
- The timing assumption is easy to meet
39. RP Timing Assumption (cont.)
- tF1i is the delay of the first-level logic of stage i
- tF2i is the delay of the logic after the first level of stage i
- The rising and falling delays of the TD are assumed to be equal
40. Relax Evaluation (RE) Constraint
- The current stage can start its evaluation at about the same time as the next stage turns off the guarded N-transistors in its first-level logic
- Requires a generalized C-element, yet improves the cycle time
41. RE Pipeline Scheme
- The TD can be skewed for fast evaluation detection
- Cycle time (P): 2tF↑ + 1tF1↓ + 1tTD + 1tC
- Forward latency (Lf): 1tF↑
(Figure: RE pipeline of precharged function blocks F1-F3 with generalized C-elements (GC1) along the D(in)-to-D(out) datapath.)
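Using the P expressions from the RP and RE slides (operators reconstructed as sums of evaluation ↑ / precharge ↓ delays; the delay values below are illustrative), RE trades one generalized C-element delay for the first-level evaluation and one TD delay in the cycle:

```python
def rp_cycle_time(tF_up, tF1_up, tF1_dn, tTD):
    # RP: P = 2tF(up) + 1tF1(up) + 1tF1(down) + 2tTD
    return 2 * tF_up + tF1_up + tF1_dn + 2 * tTD

def re_cycle_time(tF_up, tF1_dn, tTD, tC):
    # RE: P = 2tF(up) + 1tF1(down) + 1tTD + 1tC
    return 2 * tF_up + tF1_dn + tTD + tC

# Illustrative delays: stage eval 2, first-level logic 1 (up and down),
# Taken Detector 0.5, generalized C-element 0.5.
print(rp_cycle_time(2, 1, 1, 0.5))    # 7.0
print(re_cycle_time(2, 1, 0.5, 0.5))  # 6.0
```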
42. RE Timing Diagram
43. RE Timing Assumption 1
44. RE Timing Assumption 2
- Evaluation constraint (minimum delay)
45. Issues in Fine-Grained Pipelines
- In a fine-grained pipeline, such as Martin's single-gate pipeline, the RE scheme may require buffering due to process variation
- Buffering is necessary because of the second timing assumption: the next gate (stage) may not have turned off its N-stack before the result from the current stage reaches it
46. Taken Detector (TD)
- Similar to a completion detector
- Detects both evaluation and precharge
- Its inputs are the outputs of the first-level logic of each stage
47. Datapath Merging and Splitting
- Datapath merging and splitting can be done similarly to Williams' style
48. Outline
- Background review
- Sutherland
- Ted Williams
- Renaudin
- Martin
- Taken pipeline
- Performance comparison
- Conclusion
49. Comparison of RE and Synchronous Skew-Tolerant Clocking
- Assuming a 4-stage pipeline (stages 1-4) and 4-phase clocking
- Synchronous
- Stage 1 starts its next evaluation after stage 4 starts evaluating
- Asynchronous
- Stage 1 starts its next evaluation after we detect the completion of the first-level logic of stage 3
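A back-of-the-envelope timeline for the comparison above (the first-level-logic and TD delays are assumed, illustrative values): in both styles stage i begins evaluating at (i - 1) times the stage delay, so the asynchronous scheme restarts stage 1 earlier whenever stage 3's first-level-logic and TD delays sum to less than a full stage delay.

```python
t_stage = 1.0  # per-stage evaluation time (balanced pipeline, per the next slide)
t_f1 = 0.25    # assumed delay of stage 3's first-level logic (illustrative)
t_td = 0.2     # assumed Taken Detector delay (illustrative)

# Stage i starts evaluating at (i - 1) * t_stage in both styles.
sync_restart = 3 * t_stage                 # stage 1 waits for stage 4 to start
async_restart = 2 * t_stage + t_f1 + t_td  # stage 3 starts, its first level
                                           # evaluates, and its TD fires
print(sync_restart, async_restart)         # 3.0 2.45
```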
50. Comparison Assumptions
- It is a balanced pipeline: all stages have equal evaluation time
- The precharge time is the same as the evaluation time
51. Graphical Comparison
52. Optimum Number of Stages
- Optimum Number of Stages (ONS)
- Cycle time is not the only factor in system performance; forward latency is also a limiting factor
- A larger cycle time can be compensated for by increasing the number of stages
- However, a high Lf means system throughput cannot be increased by adding more stages
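The trade-off above can be sketched with a toy model (all names and numbers are illustrative, not from the slides): deeper pipelining shrinks the cycle time, but any per-stage forward overhead accumulates in the total latency, which caps the useful pipeline depth.

```python
def throughput_and_latency(total_logic, fwd_overhead, cycle_overhead, n_stages):
    """Toy model of splitting a computation of total logic delay over N stages.

    Cycle time      P  = total_logic / N + cycle_overhead
    Forward latency Lf = total_logic + N * fwd_overhead  (overhead per stage)
    """
    p = total_logic / n_stages + cycle_overhead
    lf = total_logic + n_stages * fwd_overhead
    return 1.0 / p, lf

# With zero forward overhead, adding stages keeps raising throughput at no
# latency cost; with nonzero forward overhead, latency grows with depth.
for n in (2, 4, 8):
    print(n, throughput_and_latency(40, 0, 2, n), throughput_and_latency(40, 3, 2, n))
```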
53. Conclusion
- With Taken logic and some easy-to-meet timing requirements, we can achieve the best cycle time and forward latency
- The performance comparison with existing pipeline schemes is favorable
- Implementation is still required to prove the theory