Title: Clockless Computing
1Clockless Computing
- Montek Singh
- Thu, Sep 13, 2007
2Dynamic Logic Pipelines (contd.)
- Drawbacks of Williams PS0 Pipelines
- Lookahead Pipelines Singh/Nowick 2000
- High-Capacity Pipelines Singh/Nowick 2000
3Drawbacks of PS0 Pipelining
- Poor throughput
- long cycle time: 6 events per cycle
- data tokens are forced far apart in time
- Limited storage capacity
- at most 50% of stages can hold distinct tokens
- data tokens must be separated by at least one spacer
- My Research Goals: address both issues
- still maintain very low latency
4Recent Approaches
- 3 novel styles for high-speed async pipelining
- MOUSETRAP Pipelines: Singh/Nowick, TAU-00, ICCD-01
- Lookahead Pipelines (LP): Singh/Nowick, Async-00
- High-Capacity Pipelines (HC): Singh/Nowick, WVLSI-00
- Goal: significantly improve throughput of PS0
- Two Distinct Strategies
- LP: introduce protocol optimizations
- shave off components from critical cycle
- HC: fundamentally new protocol
- greater concurrency: loosely-coupled stages
5Outline
- New Asynchronous Pipelines
- MOUSETRAP Pipelines
- Lookahead Pipelines (LP)
- High-Capacity Pipelines (HC)
6Lookahead Pipeline Styles
- Singh and Nowick
- Async-2000
- Best Paper Award
7Lookahead Pipelines Strategy 1
- Use non-neighbor communication
- stage receives information from multiple later stages
- allows early evaluation
Benefit: stage gets head-start on next cycle
8Lookahead Pipelines Strategy 2
- Use early completion detection
- completion detector moved before stage (not after)
- stage indicates early done in parallel with computation
[Figure: early completion detector]
Benefit: again, stage gets head-start on next cycle
9Lookahead Pipelines Overview
- 5 New Designs
- Dual-Rail Data Signaling
- LP3/1: early evaluation
- LP2/2: early done
- LP2/1: early evaluation and early done
- Single-Rail Bundled-Data Signaling
- LPSR2/2: early done
- LPSR2/1: early evaluation and early done
10Dual-Rail Design 1 LP3/1
[Figure: LP3/1 stages N, N+1, N+2, each with a processing block and a completion detector; signals PC, Eval, Data in, Data out; Eval input of N comes from N+2]
- Optimization: early evaluation
- each stage has two control inputs, from stages N+1 and N+2
- Idea: shorten precharge phase
- terminate precharge early, when N+2 is done evaluating
11LP3/1 Protocol
- PRECHARGE N: when N+1 completes evaluation
- EVALUATE N: when N+2 completes evaluation
[Figure: LP3/1 protocol diagram for stages N, N+1, N+2; events: N evaluates, N+1 evaluates, N+2 evaluates, N+2 indicates done]
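The PRECHARGE and EVALUATE rules of this slide can be read as a small decision function. The sketch below is a simplified, purely combinational model, assuming boolean "done" flags from stages N+1 and N+2; the function name `lp31_phase` and its parameter names are illustrative and do not come from the original design.

```python
def lp31_phase(done_n1: bool, done_n2: bool) -> str:
    """Phase of stage N given the 'done' flags of stages N+1 and N+2.

    Assumed reading of the LP3/1 rules: precharge begins once N+1 has
    finished evaluating, and is cut short as soon as N+2 has finished
    evaluating.
    """
    if done_n1 and not done_n2:
        return "precharge"
    return "evaluate"


# Walk through one cycle as seen by stage N.
trace = [
    (False, False),  # nothing downstream has evaluated: N may evaluate
    (True, False),   # N+1 completed evaluation: N precharges
    (True, True),    # N+2 completed evaluation: N re-enters evaluate early
    (False, True),   # N+1 has precharged: N stays in evaluate
]
phases = [lp31_phase(d1, d2) for d1, d2 in trace]
print(phases)
```

The third step is the "early evaluation": stage N leaves precharge without waiting for N+1 to precharge first.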
12LP3/1 Comparison with PS0
[Figure: cycle diagrams for stages N, N+1, N+2 under LP3/1 and PS0]
- LP3/1: only 4 events in cycle!
- PS0: 6 events in cycle
13LP3/1 Performance
[Figure: cycle diagram with the saved path highlighted]
Savings over PS0: 1 Precharge + 1 Completion Detection
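A rough way to see the effect of the saved path is to add up the delays of the events on the critical cycle. The sketch below assumes that the six PS0 events are three evaluations, two completion detections and one precharge, which is consistent with the event counts and the stated savings but is not spelled out on the slides. The delay values are hypothetical placeholders, not measured numbers.

```python
# Hypothetical per-event delays in picoseconds (illustrative values only).
DELAY_PS = {"eval": 100, "cd": 150, "prech": 120}

# Assumed breakdown of each critical cycle into events.
PS0_CYCLE = ["eval", "eval", "eval", "cd", "prech", "cd"]  # 6 events
LP31_CYCLE = ["eval", "eval", "eval", "cd"]                # 4 events


def cycle_time(events, delay=DELAY_PS):
    """Sum the delays of the events on the critical cycle."""
    return sum(delay[e] for e in events)


ps0 = cycle_time(PS0_CYCLE)
lp31 = cycle_time(LP31_CYCLE)
print(f"PS0 cycle:   {ps0} ps")
print(f"LP3/1 cycle: {lp31} ps")
print(f"saved:       {ps0 - lp31} ps")
```

Whatever delays are plugged in, the difference is always one precharge plus one completion detection.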
14LP3/1 Inside a Stage
[Figure: stage circuit showing the old Eval input and the early Eval input]
- Timing Issues
- must satisfy several simple constraints
- Ex.: PC must arrive before Eval is de-asserted
- 1-sided timing requirement
- easily satisfied in practice
15Dual-Rail Design 2 LP2/2
- Optimization: early done
- Idea: move completion detector before processing block
- stage indicates when about to precharge/evaluate
[Figure: stage with early completion detector producing early done, placed before the processing block; Data in, Data out]
16LP2/2 Completion Detector
- Modified completion detectors needed
- Done=1 when stage starts evaluating, and inputs valid
- Done=0 when stage starts precharging
- asymmetric C-element
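The set and reset conditions for Done are not symmetric, which is why an asymmetric C-element fits. The behavioral sketch below models that asymmetry in software; the class name, the method signature and the initial state are assumptions for illustration only, and it says nothing about the transistor-level circuit.

```python
class EarlyDoneDetector:
    """Behavioral model of an asymmetric C-element driving 'done'."""

    def __init__(self) -> None:
        self.done = False  # assumed reset state: stage is precharged

    def update(self, evaluate: bool, inputs_valid: bool) -> bool:
        if evaluate and inputs_valid:
            self.done = True   # setting needs BOTH conditions
        elif not evaluate:
            self.done = False  # resetting needs only the precharge command
        # otherwise (evaluate asserted, inputs not valid): hold old value
        return self.done


det = EarlyDoneDetector()
states = [
    det.update(evaluate=True, inputs_valid=False),  # waiting for data
    det.update(evaluate=True, inputs_valid=True),   # data arrived: done=1
    det.update(evaluate=True, inputs_valid=False),  # inputs reset: hold
    det.update(evaluate=False, inputs_valid=True),  # precharge: done=0
]
print(states)
```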
17LP2/2 Protocol
- Completion Detection
- performed in parallel with evaluation/precharge of stage
[Figure: LP2/2 protocol diagram for stages N, N+1, N+2; events: N evaluates, N+1 evaluates]
18LP2/2 Performance
[Figure: LP2/2 cycle diagram with numbered events]
LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
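The stated savings can be written as a cycle-time comparison. The expressions below assume that the six-event PS0 cycle consists of three evaluations, two completion detections and one precharge; the symbols are notation introduced here, and any extra control-gate delay is ignored.

```latex
\begin{align*}
T_{\mathrm{PS0}}   &= 3\,t_{\mathrm{Eval}} + 2\,t_{\mathrm{CD}} + t_{\mathrm{Prech}} \\
T_{\mathrm{LP2/2}} &= T_{\mathrm{PS0}} - t_{\mathrm{Eval}} - t_{\mathrm{Prech}}
                    = 2\,t_{\mathrm{Eval}} + 2\,t_{\mathrm{CD}}
\end{align*}
```

Under that assumption the LP2/2 cycle also has four events, with the two completion detections still on the critical path.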
19Dual-Rail Design 3 LP2/1
- Hybrid of LP3/1 and LP2/2, combining:
- early evaluation of LP3/1
- early done of LP2/2
20Lookahead Pipelines Overview
- 5 New Designs
- Dual-Rail Data Signaling
- LP3/1: early evaluation
- LP2/2: early done
- LP2/1: early evaluation and early done
- Single-Rail Bundled-Data Signaling
- LPSR2/2: early done
- LPSR2/1: early evaluation and early done
21Single-Rail Design LPSR2/1
- Derivative of LP2/1, adapted to single-rail
- bundled-data: matched delays instead of completion detectors
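With bundled data, correctness rests on the matched delay being at least as long as the slowest path through the stage's logic. A minimal sketch of that check follows; the function name, the picosecond units and the 10% safety margin are assumptions chosen for illustration.

```python
def matched_delay_ok(matched_delay_ps: float,
                     worst_case_logic_ps: float,
                     margin: float = 0.1) -> bool:
    """True if the delay line is slower than the slowest data path.

    `margin` is a hypothetical safety factor (10% by default).
    """
    return matched_delay_ps >= worst_case_logic_ps * (1.0 + margin)


print(matched_delay_ok(120.0, 100.0))  # delay line comfortably slower
print(matched_delay_ok(105.0, 100.0))  # inside the margin: rejected
```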
22Inside an LPSR2/1 Stage
23LPSR2/1 Protocol
[Figure: LPSR2/1 protocol diagram for stages N, N+1, N+2; event: N evaluates]
24FIFO Results (simulations)
0.19 µm CMOS, 3.3 V, 300 K
[Figure: simulation results for dual-rail and single-rail designs]
- LP dual-rail: over 80% faster than Williams PS0
- comparable latency
- LP single-rail: even faster
25Practicality of Gate-Level Pipelining
When datapath is wide
- Can often split into narrow streams
- Use localized completion detector for each stream
- need to examine only a few bits
- → small fan-in
- send done to only a few gates
- → small fan-out
- completion detection is fairly low cost!
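The idea of one small detector per narrow stream can be sketched in software. The example below assumes a dual-rail encoding in which each bit is a pair (true rail, false rail) and (0, 0) is the spacer; the helper names and the stream width are illustrative choices, not part of the original design.

```python
from typing import List, Tuple

DualRailBit = Tuple[int, int]  # (true_rail, false_rail); (0, 0) = spacer


def stream_done(bits: List[DualRailBit]) -> bool:
    """A stream is done when every bit has exactly one rail asserted."""
    return all(t != f for t, f in bits)


def split_streams(word: List[DualRailBit], width: int) -> List[List[DualRailBit]]:
    """Cut a wide word into narrow streams of at most `width` bits."""
    return [word[i:i + width] for i in range(0, len(word), width)]


def local_done_signals(word: List[DualRailBit], width: int) -> List[bool]:
    """One done signal per stream; each detector sees only `width` bits."""
    return [stream_done(s) for s in split_streams(word, width)]


# 8-bit word in which some bits are still spacers.
word = [(1, 0), (0, 1), (1, 0), (0, 0), (0, 1), (0, 1), (0, 0), (0, 0)]
done = local_done_signals(word, width=2)
print(done)
```

Each detector here has a fan-in of two bits regardless of the total word width, which is the point of the localized scheme.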
26High-Capacity Pipelines
- Singh/Nowick WVLSI-00, ISSCC-02, Async-02
27HC Pipeline Style
- High-Capacity Pipelines (HC)
- bundled datapaths; dynamic logic function blocks
- latch-free: no explicit latches needed
- dynamic logic provides implicit latching
- novel highly-concurrent protocol: maximizes storage capacity
- traditional latch-free approaches: spacers limit capacity to 50%
- Key Idea: Obtain greater control of stage's operation
- separate control of pull-up/pull-down
- result: new "isolate" phase
- stage holds outputs/impervious to input changes
- Advantage: Each stage can hold a distinct data item
- 100% storage capacity
- Extra Benefit: Obtain greater concurrency
- → High throughput
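The 50% versus 100% capacity figures follow from a simple count. The sketch below is a rough static model, assuming each data item occupies one stage plus however many spacer stages the protocol requires after it; the function and parameter names are hypothetical.

```python
def max_distinct_tokens(num_stages: int, spacers_per_token: int) -> int:
    """Rough upper bound on distinct data items held in the pipeline."""
    stages_per_token = 1 + spacers_per_token
    return num_stages // stages_per_token


stages = 8
with_spacers = max_distinct_tokens(stages, spacers_per_token=1)
no_spacers = max_distinct_tokens(stages, spacers_per_token=0)
print(f"one spacer per item: {with_spacers}/{stages} stages hold data")
print(f"no spacers needed:   {no_spacers}/{stages} stages hold data")
```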
28HC Basic Structure
- Key Idea
- 2 independent control signals
- pc controls precharge
- eval controls evaluation
- Allows novel 3-phase cycle
- Evaluate
- Isolate (hold)
- Precharge
[Figure: HC pipeline stages N, N+1, N+2 with control signals pc, eval, ack and delay elements]
29HC Inside a Stage
- Independent Controls of pull-up and pull-down
- allows new 3rd phase: isolate
- pc asserted: precharge
- eval asserted: evaluate
- pc and eval de-asserted: enter isolate (hold) phase
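The mapping from the two control signals to the three phases can be written as a small function. This is a sketch only: treating "both asserted" as an error is an assumption made here (it would turn on pull-up and pull-down together), and the function name `hc_phase` is illustrative.

```python
def hc_phase(pc: bool, eval_: bool) -> str:
    """Phase of an HC stage from its two independent controls."""
    if pc and eval_:
        # Assumed illegal: pull-up and pull-down enabled at the same time.
        raise ValueError("pc and eval must not be asserted together")
    if pc:
        return "precharge"
    if eval_:
        return "evaluate"
    return "isolate"  # both de-asserted: hold outputs, ignore inputs


# One full 3-phase cycle: evaluate, isolate, precharge.
cycle = [hc_phase(False, True), hc_phase(False, False), hc_phase(True, False)]
print(cycle)
```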
30HC Protocol
[Figure: HC protocol for Stage N and Stage N+1; phases: Eval, Isolate, Precharge]
- Our protocol: only 2 synchronization arcs
- only 1 backward arc
- once stage N+1 evaluates, N can complete entire next cycle!
- Most Existing Protocols: 3 synchronization arcs
- 1 forward arc: data dependency
- 2 backward arcs: control synchronization
31Formal Specification of Controller
- Problem: Specification too concurrent for direct synthesis
- desired precharge condition: N and N+1 have evaluated same data
- problem: this condition not uniquely captured by given signals!
- N may evaluate next data item, while N+1 stuck on current item!
32Modified Specification of Controller
- Solution: Add a state variable, ok2pc
- ok2pc records whether N+1 has absorbed N's data item
- ok2pc resets immediately when N deletes item (N precharges)
- ok2pc is set when N+1 deletes item (N+1 precharges)
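The set and reset rules for ok2pc can be mimicked with a tiny state holder. In the sketch below, the initial value of ok2pc and the guard in `n_may_precharge` (ok2pc combined with "N+1 has evaluated") are an assumed reading of the slides rather than the published controller equations; all names other than ok2pc are hypothetical.

```python
class Ok2pcTracker:
    """Tracks the ok2pc state variable for stage N."""

    def __init__(self) -> None:
        self.ok2pc = True  # assumed initial value for an empty pipeline

    def n_precharges(self) -> None:
        self.ok2pc = False  # reset as soon as N deletes its item

    def n1_precharges(self) -> None:
        self.ok2pc = True   # set once N+1 deletes its item

    def n_may_precharge(self, n1_has_evaluated: bool) -> bool:
        # Assumed guard: N+1 shows "evaluated" AND that refers to N's item.
        return self.ok2pc and n1_has_evaluated


t = Ok2pcTracker()
log = []
log.append(t.n_may_precharge(True))   # N+1 evaluated N's item
t.n_precharges()
log.append(t.n_may_precharge(True))   # N+1 still shows the OLD item
t.n1_precharges()
log.append(t.n_may_precharge(False))  # N+1 empty, nothing absorbed yet
log.append(t.n_may_precharge(True))   # N+1 evaluated N's next item
print(log)
```

The second entry is the case the extra state variable exists for: N+1 still reports "evaluated", but only for the previous item, so N must not precharge yet.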
33Controller implementation
- Controller implementation is very simple
- each signal implemented using a single gate
- ok2pc typically off the critical path
34HC Stage Implementation
[Figure: HC stage controller; labels: NAND, INV, eval, pc, ack, req, done, delay]
35HC Operation
[Figure: HC operation for stages N and N+1; events: N evaluates, N+1 starts to evaluate, N precharges, N enables itself for next evaluation]
Cycle Time: 8 CMOS gate delays
36Performance
[Figure: HC cycle diagram for stages N, N+1, N+2; events: N evaluates, N+1 evaluates, N precharges, N enables itself for next evaluation]
37FIFO Results (simulations)
0.19 µm CMOS, 3.3 V, 300 K
[Figure: simulation results for dual-rail and single-rail designs]
- LP dual-rail: over 80% faster than Williams PS0
- comparable latency
- LP single-rail: even faster
38Fabricated Chip: HC FIFO