CS 7810 Lecture 5

Provided by: RajeevBala4

1
CS 7810 Lecture 5
The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays
M.S. Hrishikesh, N.P. Jouppi, K.I. Farkas, D. Burger, S.W. Keckler, P. Shivakumar
UT-Austin and Compaq, ISCA'02
2
Improvements in Clock Speed
[Figure: clock speeds of 33MHz, 66MHz, 100MHz, 200MHz, 450MHz, 1GHz, and 2GHz as process technology shrinks from 1000nm to 130nm]
3
Definitions
  • Clock Period: $\phi = \phi_{logic} + \phi_{latch} + \phi_{skew} + \phi_{jitter}$
  • $\phi_{logic}$: the actual work being done in one stage
  • $\phi_{latch}$: data has to be saved in latch registers at the
    end of each pipeline stage (1 FO4 = 36ps at 100nm)
  • $\phi_{skew}$: two parts of the circuit may receive their
    clocks thru different paths, resulting in a slight
    phase difference (0.3 FO4)
  • $\phi_{jitter}$: unpredictable variations (0.5 FO4)
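
The clock-period breakdown on this slide can be tried out numerically. The following is a minimal sketch, not something from the lecture: the function names are hypothetical, and it assumes the parenthesized figures are per-stage overheads (latch 1 FO4, skew 0.3 FO4, jitter 0.5 FO4) and that 1 FO4 corresponds to 36ps at 100nm.

```python
# Per-stage overheads in FO4, read off the slide (assumed to be the
# latch, skew, and jitter terms of the clock period).
LATCH_FO4 = 1.0
SKEW_FO4 = 0.3
JITTER_FO4 = 0.5
PS_PER_FO4 = 36.0  # assumed: 1 FO4 = 36ps at 100nm


def clock_period_fo4(logic_fo4):
    """Clock period = logic + latch + skew + jitter, all in FO4."""
    return logic_fo4 + LATCH_FO4 + SKEW_FO4 + JITTER_FO4


def clock_ghz(logic_fo4):
    """Clock frequency in GHz for a given amount of logic per stage."""
    period_ps = clock_period_fo4(logic_fo4) * PS_PER_FO4
    return 1000.0 / period_ps


def overhead_fraction(logic_fo4):
    """Fraction of each cycle spent on overhead rather than useful logic."""
    period = clock_period_fo4(logic_fo4)
    return (period - logic_fo4) / period


if __name__ == "__main__":
    for logic in (16, 8, 6, 4, 2):
        print(f"{logic:2d} FO4 logic: period {clock_period_fo4(logic):.1f} FO4, "
              f"{clock_ghz(logic):.2f} GHz, "
              f"overhead {overhead_fraction(logic):.0%}")
```

As the logic per stage shrinks, the fixed 1.8 FO4 of overhead takes up a growing share of every cycle, which is why clock speed does not double when the logic depth is halved.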

4
Processor Model
  • An Alpha-like processor with latencies updated for 100nm
  • Simplification: the study is insensitive to the
    technology generation
  • Note that all structures are perfectly pipelined:
    this is a "Limit of Pipelining" study

5
Effect of Deep Pipelining
Logic latencies: Add 16 FO4, Mpred 128 FO4, Load from mem 400 FO4, Mult 160 FO4
Overhead per stage: 2 FO4

Clock Period   add        mpred    load   mult
18 FO4         16+2       8x18     400    180
10 FO4         8+2+8+2    16x10    400    200

Clock Period   FO4s                   Cycles   Clock speed
18 FO4         18+144+400+180 = 742   42       1.54GHz
10 FO4         20+160+400+200 = 780   78       2.78GHz
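
The arithmetic on this slide can be reproduced with a short script. This is a sketch under stated assumptions rather than the authors' model: `sequence_cost` is a hypothetical helper, the load is treated as a fixed 400 FO4 regardless of pipeline depth (as the slide's numbers suggest), and 1 FO4 is taken to be 36ps at 100nm.

```python
import math

OVERHEAD_FO4 = 2      # per-stage overhead from the slide
PS_PER_FO4 = 36.0     # assumed: 1 FO4 = 36ps at 100nm

# Logic latencies in FO4 from the slide; these get pipelined.
LOGIC_FO4 = {"add": 16, "mpred": 128, "mult": 160}
LOAD_FO4 = 400        # memory latency, treated as fixed in absolute time


def sequence_cost(clock_period_fo4):
    """Return (total FO4, total cycles, GHz) for add + mpred + load + mult."""
    logic_per_stage = clock_period_fo4 - OVERHEAD_FO4
    cycles = 0
    total_fo4 = 0
    for logic in LOGIC_FO4.values():
        # Each stage holds logic_per_stage of work plus the overhead.
        stages = math.ceil(logic / logic_per_stage)
        cycles += stages
        total_fo4 += stages * clock_period_fo4
    # The load is not deepened: it costs 400 FO4 however it is sliced.
    cycles += math.ceil(LOAD_FO4 / clock_period_fo4)
    total_fo4 += LOAD_FO4
    ghz = 1000.0 / (clock_period_fo4 * PS_PER_FO4)
    return total_fo4, cycles, ghz


if __name__ == "__main__":
    for period in (18, 10):
        fo4, cycles, ghz = sequence_cost(period)
        print(f"{period} FO4 clock: {fo4} FO4, {cycles} cycles, {ghz:.2f} GHz")
```

Run as a script, this gives 742 FO4 and 42 cycles at 1.54 GHz for the 18 FO4 clock, and 780 FO4 and 78 cycles at 2.78 GHz for the 10 FO4 clock: the deeper pipeline takes longer in absolute time for this dependent sequence even though it clocks faster.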
6
Yet, Performance Increases
  • Deepening a car assembly line → more cars
    being made at the same time → a new car rolls
    out at twice the freq
  • Independent instrs benefit from deep pipelining
  • Dependent instrs are slowed down
  • The latter dominates when pipelining overhead is
    a large fraction of clock period
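
The contrast between independent and dependent instructions can be illustrated with a toy timing model. This is a minimal sketch with assumed numbers (a 16 FO4 operation and 2 FO4 of overhead per stage); `run_time_fo4` is a hypothetical helper, not part of the study.

```python
import math

OVERHEAD_FO4 = 2   # assumed per-stage overhead
OP_LOGIC_FO4 = 16  # assumed logic latency of one operation (an add)


def run_time_fo4(num_ops, logic_per_stage, dependent):
    """Time in FO4 to finish num_ops identical operations."""
    period = logic_per_stage + OVERHEAD_FO4
    stages = math.ceil(OP_LOGIC_FO4 / logic_per_stage)
    if dependent:
        # Each op must wait for the previous result: no overlap at all.
        cycles = num_ops * stages
    else:
        # Ops overlap: fill the pipeline once, then one completes per cycle.
        cycles = stages + num_ops - 1
    return cycles * period


if __name__ == "__main__":
    for logic in (16, 8, 4, 2):
        indep = run_time_fo4(100, logic, dependent=False)
        dep = run_time_fo4(100, logic, dependent=True)
        print(f"{logic:2d} FO4/stage: independent {indep} FO4, dependent {dep} FO4")
```

With 100 operations, going from 16 to 8 FO4 of logic per stage cuts the independent case from 1800 to 1010 FO4, while the dependent chain grows from 1800 to 2000 FO4 because every extra stage adds another 2 FO4 of overhead to each link in the chain.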

7
Example Latencies
Logic Delay   L1D   IssueQ   Int-Add   (latencies in cycles)
2 FO4         16    9        9
4 FO4         9     5        5
8 FO4         5     3        3
16 FO4        3     2        2
8
In-Order Processors
  • With no overhead, when the logic delay per stage reduces
    from 8FO4 to 4FO4, performance can go up by 100% (like in
    the car assembly line), but only goes up by 18%
  • With overhead, max performance is seen for 6FO4
    for all three benchmark classes
  • For the Cray, optimal pipeline depth was 10.9FO4
    (Int) and 5.4FO4 (vector)
  • Degree of parallelism: vector > int-programs-today
    > int-programs-before (no caches!)

9
Out-of-Order Processors
  • Optimal logic delay for integer is 6FO4, for FP
    non-vector is 5FO4, for FP vector is 4FO4
  • These results are insensitive to overhead costs
    and microarchitecture optimizations
  • P.S. The effect of o-o-o execution on performance:
  • Non-vector FP: 0.5 → 1.0
  • Integer: 0.8 → 1.8
  • Vector FP: 0.9 → 3.5

10
Out-of-Order Processors
11
Increased Pipeline Depth
  • Reasons for IPC decrease:
  • Longer ALU latencies (not quantified)
  • Longer load latency (25% for 6-cyc increase)
  • Longer branch mpred cost (10%)
  • Longer wakeup+select (55%)

12
Pipelining Wakeup
  • It takes a long time to broadcast tags across the
    entire issueq
  • Hence, wake the first eight instructions in the
    first cycle, wake the next eight in the second, and
    so on
  • This works well if most ready instructions are in
    the first stage: a 10-stage pipeline worsens
    performance by only 11%
  • Will this change the optimal logic depth?
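
The segmented wakeup described on this slide can be sketched as a simple position-to-cycle mapping. This is an illustrative model only, with hypothetical names: it assumes queue position 0 is the head of the issue queue, that a tag broadcast reaches one eight-entry segment per cycle, and it ignores everything else about the scheduler.

```python
SEGMENT_SIZE = 8  # entries woken per cycle, per the slide


def wakeup_cycle(queue_position, segment_size=SEGMENT_SIZE):
    """Cycle (1-based) in which a tag broadcast reaches a queue entry.

    queue_position is 0-based, with 0 being the head of the issue queue.
    """
    return queue_position // segment_size + 1


def extra_wakeup_delay(ready_positions, segment_size=SEGMENT_SIZE):
    """Average extra cycles versus a single-cycle wakeup of the whole queue."""
    if not ready_positions:
        return 0.0
    extra = [wakeup_cycle(p, segment_size) - 1 for p in ready_positions]
    return sum(extra) / len(extra)


if __name__ == "__main__":
    # Consumers clustered near the head pay nothing extra.
    print(extra_wakeup_delay([0, 3, 7]))
    # Consumers spread across three segments pay one cycle on average.
    print(extra_wakeup_delay([0, 8, 16]))
```

When the instructions waiting on a tag sit in the first segment, the pipelined broadcast costs them nothing, which is the condition the slide gives for the scheme working well.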

13
Instruction Select
  • Stage-1 only goes through one arbiter
  • Stages 2-4 have a pre-select and go thru 2 arbiters
  • Does well if most ready instrs are in stage-1 (4% loss)

[Figure: issue queue divided into stages 1 through 4, with 16-input arbiters and 8-wide connections]
14
IssueQ Compaction
  • Both techniques work well only if instructions
    move up to occupy empty slots
  • Wastes energy, increases complexity
  • Correctness problems: what if you miss the tag
    while in transit?

15
Conclusions
  • Logic per stage will only shrink by a factor of two:
    this limits clock speed improvements in the future
  • Pipelining wakeup+select has the biggest impact
    on IPC

16
Related Work
  • Hartstein and Puzak (IBM): Most programs have
    optimal pipeline depth between 13-30,
    corresponding to FO4 delays of 4-8
  • Sprangle and Carmean (Intel): Optimum pipeline
    depth is 50-60, corresponding to FO4 delays of 4-5