Title: CS 7810 Lecture 5
1. CS 7810 Lecture 5
The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays
M.S. Hrishikesh, N.P. Jouppi, K.I. Farkas, D. Burger, S.W. Keckler, P. Shivakumar
UT-Austin and Compaq, ISCA 2002
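2. Improvements in Clock Speed
[Figure: processor clock speed rising through 33 MHz, 66 MHz, 100 MHz, 200 MHz, 450 MHz, 1 GHz and 2 GHz, plotted against process generations from 1000nm to 130nm]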
3. Definitions
- Clock period = φ_logic + φ_latch + φ_skew + φ_jitter
- φ_logic: the actual work being done in one stage
- φ_latch: data has to be saved in latch registers at the end of
  each pipeline stage (1 FO4; 1 FO4 = 36ps at 100nm)
- φ_skew: two parts of the circuit may receive their clocks through
  different paths, resulting in a slight phase difference (0.3 FO4)
- φ_jitter: unpredictable variations (0.5 FO4)
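The clock-period definition above can be worked through numerically. The following is a minimal sketch, assuming the overhead values quoted on this slide (1 FO4 latch, 0.3 FO4 skew, 0.5 FO4 jitter) and 36ps per FO4 at 100nm; the function and constant names are hypothetical and chosen only for illustration.

```python
# Overhead values quoted on the slide, in FO4 delays.
LATCH_FO4 = 1.0
SKEW_FO4 = 0.3
JITTER_FO4 = 0.5
PS_PER_FO4_AT_100NM = 36.0  # 1 FO4 = 36 ps at 100nm (from the slide)


def clock_period_fo4(logic_fo4):
    """Clock period = logic + latch + skew + jitter, all in FO4."""
    return logic_fo4 + LATCH_FO4 + SKEW_FO4 + JITTER_FO4


def clock_frequency_ghz(period_fo4, ps_per_fo4=PS_PER_FO4_AT_100NM):
    """Convert a clock period in FO4 into a frequency in GHz."""
    period_ps = period_fo4 * ps_per_fo4
    return 1000.0 / period_ps  # 1000 ps per ns, and 1/ns is GHz


if __name__ == "__main__":
    for logic in (16, 8, 6, 4):
        period = clock_period_fo4(logic)
        print(f"logic {logic:>2} FO4 -> period {period:.1f} FO4, "
              f"{clock_frequency_ghz(period):.2f} GHz")
```

With 6 FO4 of useful logic the period comes to 7.8 FO4, so the fixed 1.8 FO4 of overhead is already close to a quarter of the cycle.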
4. Processor Model
- An Alpha-like processor with latencies updated for 100nm
- Simplification: the study is insensitive to the technology
  generation
- Note that all structures are perfectly pipelined: this is a
  "Limit of Pipelining" study
5. Effect of Deep Pipelining
Logic latencies: Add 16 FO4, Mpred 128 FO4, Load from mem 400 FO4, Mult 160 FO4; Overhead 2 FO4 per stage

Clock Period   add            mpred          load   mult   Total FO4s   Cycles   Clock speed
18 FO4         16+2 = 18      8x18 = 144     400    180    742          42       1.54GHz
10 FO4         8+2+8+2 = 20   16x10 = 160    400    200    780          78       2.78GHz
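The arithmetic on this slide can be reproduced with a short calculation. This is a sketch under stated assumptions: each pipelined operation is split into ceil(logic / logic-per-stage) stages that each pay the 2 FO4 overhead, the 400 FO4 memory load is treated as a fixed delay (as the slide's numbers imply), and 1 FO4 is taken as 36ps (the 100nm figure from the Definitions slide). The names `chain_latency`, `PIPELINED_OPS` and `MEMORY_LOAD_FO4` are hypothetical.

```python
import math

PS_PER_FO4 = 36.0  # assumption: 100nm technology, 1 FO4 = 36 ps

# Logic latency of each pipelined operation in FO4 (from the slide).
PIPELINED_OPS = {"add": 16, "mpred": 128, "mult": 160}
MEMORY_LOAD_FO4 = 400  # fixed delay; pipelining the core does not change it


def chain_latency(logic_per_stage, overhead=2):
    """Return (total FO4, cycles, clock GHz) for add -> mpred -> load -> mult."""
    period = logic_per_stage + overhead
    total = MEMORY_LOAD_FO4
    for logic in PIPELINED_OPS.values():
        stages = math.ceil(logic / logic_per_stage)
        total += stages * period  # every stage pays the overhead once
    cycles = math.ceil(total / period)
    ghz = 1000.0 / (period * PS_PER_FO4)
    return total, cycles, ghz


if __name__ == "__main__":
    for logic in (16, 8):
        total, cycles, ghz = chain_latency(logic)
        print(f"{logic + 2} FO4 clock: {total} FO4, {cycles} cycles, {ghz:.2f} GHz")
```

Halving the logic per stage raises the clock from 1.54GHz to 2.78GHz, yet the dependent chain takes 780 FO4 instead of 742 FO4, because the per-stage overhead is paid twice as often.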
6. Yet, Performance Increases
- Deepening a car assembly line → more cars being made at the
  same time → a new car rolls out at twice the frequency
- Independent instrs benefit from deep pipelining
- Dependent instrs are slowed down
- The latter dominates when pipelining overhead is a large
  fraction of clock period
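To see why overhead eventually dominates, the best-case gain for independent instructions can be estimated as the ratio of clock frequencies before and after deepening the pipeline. This is an illustrative sketch, not the study's model: the function name is hypothetical, and the 1.8 FO4 overhead is simply the sum of the latch, skew and jitter values from the Definitions slide.

```python
def throughput_gain(logic_before, logic_after, overhead):
    """Ratio of clock frequencies: the best-case gain for independent instrs."""
    return (logic_before + overhead) / (logic_after + overhead)


if __name__ == "__main__":
    # Ideal assembly line: halving the logic per stage doubles throughput.
    print(f"{throughput_gain(8, 4, 0.0):.2f}")   # 2.00
    # With 1.8 FO4 overhead the same halving gains less.
    print(f"{throughput_gain(8, 4, 1.8):.2f}")   # 1.69
    # At very shallow logic depth the overhead takes most of the cycle.
    print(f"{throughput_gain(2, 1, 1.8):.2f}")   # 1.36
```

The gain from each further halving shrinks as the logic depth approaches the overhead, while the slowdown for dependent instructions keeps growing.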
7. Example Latencies
Latency in cycles at each logic delay per stage:

Logic Delay   L1D   IssueQ   Int-Add
2 FO4         16    9        9
4 FO4         9     5        5
8 FO4         5     3        3
16 FO4        3     2        2
8. In-Order Processors
- With no overhead, when φ_logic reduces from 8 FO4 to 4 FO4,
  performance can go up by 100% (like in the car assembly line),
  but only goes up by 18%
- With overhead, max performance is seen for 6 FO4 for all three
  benchmark classes
- For the Cray, optimal logic depth per stage was 10.9 FO4 (Int)
  and 5.4 FO4 (vector)
- Degree of parallelism: vector > int-programs-today >
  int-programs-before (no caches!)
9. Out-of-Order Processors
- Optimal logic delay for integer is 6 FO4, for FP non-vector is
  5 FO4, for FP vector is 4 FO4
- These results are insensitive to overhead costs and
  microarchitecture optimizations
- P.S. The effect of o-o-o execution on performance:
  - Non-vector FP: 0.5 → 1.0
  - Integer: 0.8 → 1.8
  - Vector FP: 0.9 → 3.5
10. Out-of-Order Processors
11. Increased Pipeline Depth
- Reasons for IPC decrease:
  - Longer ALU latencies (not quantified)
  - Longer load latency (25% for 6-cyc increase)
  - Longer branch mpred cost (10%)
  - Longer wakeup+select (55%)
12. Pipelining Wakeup
- It takes a long time to broadcast tags across the entire issueq
- Hence, wake the first eight instructions in the first cycle,
  wake the next eight in the second, and so on
- This works well if most ready instructions are in the first
  stage: a 10-stage pipeline worsens performance by only 11%.
  Will this change the optimal logic depth?
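The segmented broadcast described above can be sketched as a mapping from issue-queue position to the cycle in which that entry sees a tag. This is an illustrative model only: it assumes position 0 is the head of the queue and that each wakeup stage covers eight consecutive entries; the function names are hypothetical.

```python
def wakeup_cycle(position, segment_size=8):
    """Cycle (1-based) in which the entry at `position` sees a broadcast tag.

    Position 0 is the head of the issue queue; each wakeup stage covers
    `segment_size` consecutive entries.
    """
    return position // segment_size + 1


def average_wakeup_cycle(ready_positions, segment_size=8):
    """Mean wakeup cycle over the positions of the instructions being woken."""
    cycles = [wakeup_cycle(p, segment_size) for p in ready_positions]
    return sum(cycles) / len(cycles)


if __name__ == "__main__":
    # Three dependents near the head, one in the second segment.
    print(average_wakeup_cycle([0, 3, 5, 9]))  # 1.25
```

When most dependents sit in the first segment, the average stays close to one cycle, which is the condition the slide gives for the scheme to work well.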
13. Instruction Select
- Stage-1 only goes through one arbiter
- Stages 2-4 have a pre-select and go through 2 arbiters
- Does well if most ready instrs are in stage-1 (4% loss)
[Figure: issue queue divided into stage 1 through stage 4, feeding 16-input arbiters; one path in the diagram is labeled 8]
14. IssueQ Compaction
- Both techniques work well only if instructions move up to
  occupy empty slots
- Wastes energy, increases complexity
- Correctness problems: what if you miss the tag while in transit?
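As a rough illustration of compaction, the sketch below shifts valid entries toward the head of the queue and counts how many entries had to move. It shows only the end result of an idealized full compaction and makes no claim about how hardware implements it; `None` marks an empty slot, entries are assumed unique, and both function names are hypothetical.

```python
def compact(queue):
    """Shift valid entries toward the head, preserving their age order."""
    valid = [entry for entry in queue if entry is not None]
    return valid + [None] * (len(queue) - len(valid))


def entries_moved(before, after):
    """Count slots that now hold an entry different from what they held before."""
    return sum(1 for b, a in zip(before, after) if a is not None and a != b)


if __name__ == "__main__":
    queue = ["i0", None, "i2", None, None, "i5"]
    packed = compact(queue)
    print(packed)                        # ['i0', 'i2', 'i5', None, None, None]
    print(entries_moved(queue, packed))  # 2
```

Every moved entry is a write into another slot, which gives a feel for the energy cost the slide mentions; an entry being moved is also the one at risk of missing a tag broadcast.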
15. Conclusions
- Logic per stage will only shrink by a factor of two, which
  limits clock speed improvements in the future
- Pipelining wakeup+select has the biggest impact on IPC
16. Related Work
- Hartstein and Puzak (IBM): Most programs have optimal pipeline
  depth between 13-30, corresponding to FO4 delays of 4-8
- Sprangle and Carmean (Intel): Optimum pipeline depth is 50-60,
  corresponding to FO4 delays of 4-5