Title: Clockless Logic: Dynamic Logic Pipelines (contd.)
1 Clockless Logic: Dynamic Logic Pipelines (contd.)
- Drawbacks of Williams PS0 Pipelines
- Lookahead Pipelines
2 Drawbacks of PS0 Pipelining
- Poor throughput
- long cycle time: 6 events per cycle
- data tokens are forced far apart in time
- Limited storage capacity
- at most 50% of stages can hold distinct tokens
- data tokens must be separated by at least one spacer
- Our Research Goals: address both issues
- still maintain very low latency
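The capacity limit above can be illustrated with a toy occupancy model. This is only a sketch: representing each stage as "T" (data token) or "S" (spacer), and the function names, are assumptions made for this example and not part of the PS0 design itself.

```python
from itertools import product

def is_legal(occupancy):
    """True if no two neighbouring stages both hold a data token ('T')."""
    return all(not (a == "T" and b == "T")
               for a, b in zip(occupancy, occupancy[1:]))

def max_tokens(num_stages):
    """Brute-force the largest token count over all legal occupancies."""
    best = 0
    for occ in product("TS", repeat=num_stages):
        if is_legal(occ):
            best = max(best, occ.count("T"))
    return best

n = 10  # stage count chosen for illustration
print(max_tokens(n), "of", n, "stages")  # 5 of 10 stages
```

With an even number of stages, the best legal pattern alternates tokens and spacers, which is where the 50% figure comes from.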
3 Recent Approaches
- 3 novel styles for high-speed async pipelining
- MOUSETRAP Pipelines [Singh/Nowick, TAU-00, ICCD-01]
- Lookahead Pipelines (LP) [Singh/Nowick, Async-00]
- High-Capacity Pipelines (HC) [Singh/Nowick, WVLSI-00]
- Goal: significantly improve throughput of PS0
- Two Distinct Strategies
- LP: introduce protocol optimizations
- shave off components from critical cycle
- HC: fundamentally new protocol
- greater concurrency, loosely-coupled stages
4 Outline
- New Asynchronous Pipelines
- MOUSETRAP Pipelines
- Lookahead Pipelines (LP)
- High-Capacity Pipelines (HC)
5 Lookahead Pipelines: Strategy 1
- Use non-neighbor communication
- stage receives information from multiple later stages
- allows early evaluation
Benefit: stage gets head-start on next cycle
6 Lookahead Pipelines: Strategy 2
- Use early completion detection
- completion detector moved before stage (not after)
- stage indicates "early done" in parallel with computation
[Figure: stage with an early completion detector]
Benefit: again, stage gets head-start on next cycle
7 Lookahead Pipelines: Overview
- 5 New Designs
- Dual-Rail Data Signaling
- LP3/1: early evaluation
- LP2/2: early done
- LP2/1: early evaluation + early done
- Single-Rail Bundled-Data Signaling
- LPSR2/2: early done
- LPSR2/1: early evaluation + early done
8 Dual-Rail Design 1: LP3/1
[Figure: stages N, N+1, N+2, each a processing block with a completion detector; labels: Data in, Data out, PC, Eval (from N+2)]
- Optimization: early evaluation
- each stage has two control inputs, from stages N+1 and N+2
- Idea: shorten precharge phase
- terminate precharge early, when N+2 is done evaluating
9 LP3/1 Protocol
- PRECHARGE N: when N+1 completes evaluation
- EVALUATE N: when N+2 completes evaluation
[Figure: timing diagram for stages N, N+1, N+2; labels: N evaluates, N+1 evaluates, N+2 evaluates, N+2 indicates "done"]
10 LP3/1: Comparison with PS0
[Figure: cycle of events across stages N, N+1, N+2 for each design]
- LP3/1: only 4 events in cycle!
- PS0: 6 events in cycle
11 LP3/1 Performance
[Figure: critical cycle, with the saved path marked]
Savings over PS0: 1 Precharge + 1 Completion Detection
12 LP3/1: Inside a Stage
Merging 2 Control Inputs
A NAND gate merges the 2 control inputs:
- Precharge when PC=1 (and Eval=0)
- Evaluate early when Eval=1 (or PC=0)
- Problem: early Eval=1 is non-persistent!
- may be de-asserted before stage completes evaluation!
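The merge rule above can be written out as a truth table. This is a minimal sketch, assuming the NAND takes PC and an inverted copy of Eval, and that an output of 0 puts the stage in precharge; that polarity, and the function names, are assumptions for illustration rather than details taken from the slide.

```python
def nand(a: bool, b: bool) -> bool:
    return not (a and b)

def stage_control(pc: bool, eval_: bool) -> bool:
    """Merged control for stage N.

    Assumed polarity: False means precharge, True means evaluate.
    Assumed wiring: NAND of PC and inverted Eval.
    """
    return nand(pc, not eval_)

for pc in (False, True):
    for eval_ in (False, True):
        phase = "evaluate" if stage_control(pc, eval_) else "precharge"
        print(f"PC={int(pc)} Eval={int(eval_)} -> {phase}")
```

Only the PC=1, Eval=0 row yields precharge; every other combination lets the stage evaluate, matching the two bullets above.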
13 LP3/1 Timing Constraints: Example
Problem (cont.): early Eval=1 is non-persistent
- Observation: PC=0 arrives soon after Eval=1, and is persistent
- Solution: no change!
- use PC as safe "takeover" for Eval!
- Timing Constraint: PC=0 must arrive before Eval is de-asserted
- simple one-sided timing requirement
- other constraints as well: all easily satisfied in practice
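The takeover idea can be traced with a small behavioural sketch. It assumes the stage precharges only while PC=1 and Eval=0, as in the merge rule of the previous slide; the event sequences are hypothetical and chosen only to show one run that meets the constraint and one that violates it.

```python
def phase(pc, eval_):
    """Assumed merge rule: precharge only when PC=1 and Eval=0."""
    return "precharge" if (pc and not eval_) else "evaluate"

def trace(events):
    """Map a sequence of (PC, Eval) pairs to the stage's phase at each step."""
    return [phase(pc, ev) for pc, ev in events]

# Constraint met: PC falls (step 3) before Eval is de-asserted (step 4).
safe = [(1, 0), (1, 1), (0, 1), (0, 0)]
# Constraint violated: Eval is de-asserted while PC is still 1 (step 3).
unsafe = [(1, 0), (1, 1), (1, 0), (0, 0)]

print(trace(safe))    # ['precharge', 'evaluate', 'evaluate', 'evaluate']
print(trace(unsafe))  # ['precharge', 'evaluate', 'precharge', 'evaluate']
```

In the second trace the stage drops back into precharge in the middle of its evaluation, which is the failure the one-sided constraint rules out.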
14 Dual-Rail Design 2: LP2/2
- Optimization: early done
- Idea: move completion detector before processing block
- stage indicates when it is about to precharge/evaluate
[Figure: stage with an early completion detector placed ahead of the processing block; labels: Data in, Data out, early done]
15 LP2/2 Completion Detector
- Modified completion detectors needed
- Done=1 when stage starts evaluating, and inputs are valid
- Done=0 when stage starts precharging
- asymmetric C-element
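The asymmetric behaviour described above can be modelled in a few lines. This is a behavioural sketch only, not a gate-level design: the class name, the `update` interface, and the "hold otherwise" rule are assumptions drawn from the usual definition of an asymmetric C-element.

```python
class EarlyDoneDetector:
    """Behavioural model of an asymmetric C-element used as an early detector.

    Assumed behaviour, following the bullets above:
      - Done rises when the stage is evaluating AND its inputs are valid.
      - Done falls as soon as the stage starts precharging,
        regardless of the inputs.
      - Otherwise Done holds its previous value.
    """

    def __init__(self):
        self.done = False

    def update(self, evaluating: bool, inputs_valid: bool) -> bool:
        if evaluating and inputs_valid:
            self.done = True
        elif not evaluating:
            self.done = False
        return self.done

det = EarlyDoneDetector()
steps = [(True, False), (True, True), (True, False), (False, True)]
print([det.update(ev, valid) for ev, valid in steps])
# [False, True, True, False]
```

The third step shows the asymmetry: once Done is set, losing input validity alone does not clear it; only the start of precharge does.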
16 LP2/2 Protocol
- Completion Detection
- performed in parallel with evaluation/precharge of stage
[Figure: timing diagram for stages N, N+1, N+2; labels: N evaluates, N+1 evaluates]
17 LP2/2 Performance
[Figure: critical cycle with numbered events]
LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
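The savings quoted for LP3/1 and LP2/2 can be turned into a rough cycle-time comparison. The sketch below assumes the 6 events of the PS0 cycle are 3 evaluations, 1 precharge and 2 completion detections; the per-event delays are hypothetical values in picoseconds, and the delay of the extra control gate in LP3/1 is ignored.

```python
# Hypothetical per-event delays in picoseconds (illustration only).
DELAY_PS = {"evaluate": 200, "precharge": 250, "detect": 300}

# Assumed breakdown of the 6 events on the PS0 critical cycle.
PS0 = {"evaluate": 3, "precharge": 1, "detect": 2}

def with_savings(base, **saved):
    """Return the event counts of `base` minus the events that were saved."""
    return {k: base[k] - saved.get(k, 0) for k in base}

def cycle_time(counts):
    """Sum of event delays on the critical cycle, in picoseconds."""
    return sum(counts[k] * DELAY_PS[k] for k in counts)

LP3_1 = with_savings(PS0, precharge=1, detect=1)
LP2_2 = with_savings(PS0, evaluate=1, precharge=1)

for name, counts in [("PS0", PS0), ("LP3/1", LP3_1), ("LP2/2", LP2_2)]:
    print(f"{name}: {sum(counts.values())} events, {cycle_time(counts)} ps")
```

With these made-up delays, both lookahead styles drop from 6 events to 4; which of the two is faster depends on whether an evaluation or a completion detection is the slower step.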
18 Dual-Rail Design 3: LP2/1
- Hybrid of LP3/1 and LP2/2. Combines:
- early evaluation of LP3/1
- early done of LP2/2
19 Lookahead Pipelines: Overview
- 5 New Designs
- Dual-Rail Data Signaling
- LP3/1: early evaluation
- LP2/2: early done
- LP2/1: early evaluation + early done
- Single-Rail Bundled-Data Signaling
- LPSR2/2: early done
- LPSR2/1: early evaluation + early done
20 Single-Rail Design: LPSR2/1
- Derivative of LP2/1, adapted to single-rail
- bundled-data: matched delays instead of completion detectors
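A matched delay stands in for a completion detector only if it is at least as slow as the logic it shadows. The check below is a minimal sketch of that one-sided requirement; the function name, the optional margin, and all delay values are hypothetical.

```python
def matched_delay_ok(path_delays_ps, matched_delay_ps, margin_ps=0):
    """True if the matched delay covers the slowest data path plus a margin."""
    return matched_delay_ps >= max(path_delays_ps) + margin_ps

# Hypothetical data-path delays through one stage, in picoseconds.
paths = [180, 210, 195]

print(matched_delay_ok(paths, matched_delay_ps=230, margin_ps=10))  # True
print(matched_delay_ok(paths, matched_delay_ps=200, margin_ps=10))  # False
```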
21 Inside an LPSR2/1 Stage
22 LPSR2/1 Protocol
[Figure: timing diagram for stages N, N+1, N+2; label: N evaluates]
23 Results
- Designed/simulated FIFOs for each pipeline style
- Experimental Setup
- design: 4-bit wide, 10-stage FIFO
- technology: 0.6 µm HP CMOS
- operating conditions: 3.3 V and 300 K
24 Comparison with Williams PS0
[Figure: results for the dual-rail and single-rail designs]
- LP2/1: >2X faster than Williams PS0
- LPSR2/1: 1.2 Giga items/sec
25 Comparison: LPSR2/1 vs. Molnar FIFOs
- LPSR2/1 FIFO: 1.2 Giga items/sec
- Adding logic processing to FIFO
- simply fold logic into dynamic gate => little overhead
- Comparison with Molnar FIFOs
- asp FIFO: 1.1 Giga items/sec
- more complex timing assumptions => not easily formalized
- requires explicit latches, separate from logic!
- adding logic processing between stages => significant overhead
- micropipeline: 1.7 Giga items/sec
- two parallel FIFOs, each only 0.85 Giga/sec
- very expensive transition latches
- cannot add logic processing to FIFO!
26 Practicality of Gate-Level Pipelining
When datapath is wide:
- Can often split into narrow streams
- Use localized completion detector for each stream
- need to examine only a few bits => small fan-in
- send "done" to only a few gates => small fan-out
- completion detection is fairly low cost!
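The idea of one small detector per narrow stream can be sketched as follows. The dual-rail encoding as (false_rail, true_rail) pairs, the 4-bit stream width, and the function names are assumptions for illustration; each detector here simply checks that every bit in its own stream has one rail asserted.

```python
SPACER = (0, 0)  # both rails low: no data yet

def encode_dual_rail(bits):
    """Encode each bit as a (false_rail, true_rail) pair."""
    return [(0, 1) if b else (1, 0) for b in bits]

def split_streams(word, width):
    """Cut a wide dual-rail word into narrow streams of `width` bits."""
    return [word[i:i + width] for i in range(0, len(word), width)]

def stream_done(stream):
    """Local completion detector: OR the rails of each bit, AND across bits."""
    return all(f or t for f, t in stream)

word = encode_dual_rail([1, 0, 1, 1, 0, 0, 1, 0])
word[6] = SPACER  # one bit of the second stream has not arrived yet
streams = split_streams(word, width=4)

print([stream_done(s) for s in streams])  # [True, False]
```

Each detector looks at only 4 bits, so its fan-in stays small however wide the full datapath is, and the first stream can signal done without waiting for the second.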
27 Conclusions
- Introduced several new dynamic pipelines
- Use two novel protocols
- early evaluation
- early done
- Especially suitable for fine-grain (gate-level) pipelining
- Very high throughputs obtained
- dual-rail: >2X improvement over Williams PS0
- single-rail: 1.2 Giga items/second in 0.6 µm CMOS
- Use easy-to-satisfy, one-sided timing constraints
- Robustly handle arbitrary-speed environments
- overcome a major shortcoming of Williams PS0 pipelines
- Recent Improvement: even faster single-rail pipeline (WVLSI-00)