Enhancing performance Pipelining Chapter 6 Part 1 Concepts

Provided by: csBing

Learn more at: https://www.cs.binghamton.edu

1
Enhancing Performance - Pipelining, Chapter 6, Part
1: Concepts

N. Guydosh
3/24/04

2
Introduction

Parallelism built into the processor hardware
The logical sequence of events in the execution
of an instruction is generally wasteful of time.
Example: while an instruction is doing arithmetic
using registers, the memory is idle ... why not
fetch the next instruction during this time?
The key idea is to overlap the processing of
multiple instructions.

3
Introduction: An Analogy

Analogous to an assembly line in a factory
The time to build an individual car does not
decrease, but the number of cars built per unit
time is greatly increased.
Multiple cars are built simultaneously.
While the engine is installed in one car, the
seats are installed in another ... at the same
time.
The success of an assembly line depends on how well
balanced it is ... we don't want one task (phase)
taking 10 minutes while another takes 1 hour.
... The car that finishes the short task at one
station will have to wait idle for the next
station to free up.
The series of work stations in an assembly line
is analogous to the functional units in a
processor data path.
A series of cars to be built passes through the
assembly line simultaneously - each work station
is busy.
On startup and stopping of the line, it takes
time equal to the sum of the work station times
to fill and empty the line (pipeline).

4
Introduction: Computer Instructions

In the assembly line example, the car becomes an
instruction.
The tasks done on the car become the instruction
phases (or stages).
The work stations become the functional units in
the data path.
The assembly line becomes the data path.
The data path simultaneously executing multiple
instructions is called a pipeline.

5
Introduction: Computer Instructions (cont.)

Pipelining improves instruction throughput rather
than individual instruction execution time
The time required to move an instruction one step
down the pipeline is one clock cycle
The length of a clock cycle is determined by the
time required for the slowest pipeline stage
because all stages must proceed at the same rate
The goal of the designer is to balance the length
of each stage - otherwise there will be idle time
during a stage.

6
Steps to Take

Decompose the processing of instructions into
phases
The simplest decomposition is two phases or stages:
fetch and execute
1st stage fetches and buffers the instruction
2nd stage (execution) receives the buffered
instruction from the 1st stage when it is free
While the 2nd stage is executing, the 1st stage takes
advantage of any unused memory cycles to fetch
and buffer the next instruction; this is called
instruction pre-fetch or fetch overlap

7
Steps to Take (Cont.)

Problems with this approach
Execution time is generally longer than fetch
time.
Fetch stage may have to wait before it can empty
its buffer
Ideally we would like to have the various stages
of instruction processing take the same amount of
time.
A conditional branch instruction makes the
address of the next instruction uncertain ...
thus fetch stage waits until the execute stage
(branch) determines the next instruction address
Both of the above situations result in performance loss
- the latter (conditional branch) can be reduced
by guessing at the outcome of the branch.

8
Steps to Take (Cont.)

An improvement would be to decompose the
instruction processing into smaller steps (finer
granularity)
There would be less variation in processing time
among the stages
These are the familiar phases of our instruction
execution:
IF: Instruction fetch
ID: Instruction decode and register fetch
EX: Execution and effective address calculation
MEM: Memory access (fetch memory operands)
WB: Write back (into register file)
The various phases (5 of them) will be more
nearly equal in duration
Example: register read and register write take only 1 ns
and all the other phases take 2 ns. Thus every
phase is allotted 2 ns - the register operations
idle for 1 ns during the register phases
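The timing argument on this slide can be sketched in a few lines of Python. The stage latencies are the ones quoted above; assigning the 1 ns register read to ID and the 1 ns register write to WB follows the phase list, and the function names are illustrative only.

```python
# Stage latencies in ns, as quoted on the slide: the register read (ID) and
# register write (WB) need 1 ns, every other phase needs 2 ns.
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}


def clock_cycle_ns(stage_ns):
    """The clock must be long enough for the slowest stage."""
    return max(stage_ns.values())


def idle_ns_per_stage(stage_ns):
    """Time each stage sits idle within one clock cycle."""
    cycle = clock_cycle_ns(stage_ns)
    return {name: cycle - t for name, t in stage_ns.items()}


print("clock cycle:", clock_cycle_ns(STAGE_NS), "ns")
print("idle time per stage:", idle_ns_per_stage(STAGE_NS))
```

Only the two register phases report idle time, which is the imbalance the slide describes.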

9
Steps to Take (Cont.)

Fundamental concept: In order to make each phase
as independent as possible of other phases, we
will use the single clock cycle data path (fig.
5.19, p. 360) and a multiple clock cycle timing
scheme.
A hybrid of the two schemes in chapter 5.
The single clock cycle data path has redundant
hardware which enhances parallelism and phase
independence.
ALU and two adders
But it is functionally the same as the multiple
clock data path.

10
Single Clock Cycle Datapath for Multi Clock Cycle
Timing
Fig 6.10
11
Performance Example

Execution of three consecutive lw instructions -
see p. 439
2 ns per phase, except for the register phases, which take 1 ns

12
Performance Example (cont.)

Ideally, with no delays for register operations, it
would take 8 ns to execute an lw instruction and
24 ns to do three of them sequentially.
In a 5-stage pipeline the three could be done in
14 ns.
Ideally we would expect to complete an
instruction every 8/5 = 1.6 ns in the 5-stage
pipeline.
Instead we see 2 ns between instructions.
This is due to an imbalance in the time for each
phase: most phases take 2 ns but the register
phases take only 1 ns.
Since an instruction is fired off every 2 ns in
the 5-stage pipeline, as opposed to every 8 ns in a
non-pipelined scheme, it would seem that the
performance advantage should be 8/2 = 4.
But what we see is 24/14 ≈ 1.7.
The reason we are not getting the 4:1 ratio is
that this example never filled the pipe: about
2/3 of the time was spent filling and emptying the
pipe.
Maximum parallelism is achieved only when the
pipe is filled.
Suppose we increase the number of instructions
executed by 1000:
Non-pipelined: 24 + 1000 × (8 ns/inst) = 8024 ns
Pipelined: 14 + 1000 × (2 ns/inst) = 2014 ns
Ratio: 8024/2014 ≈ 3.98 ≈ 8/2 = 4
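The arithmetic in this example can be reproduced with a short Python sketch. The constants are the figures from the slide; the closed form for the pipelined time (first instruction takes five cycles, each later one finishes one cycle after its predecessor) is the standard fill-then-stream model and assumes no stalls.

```python
STAGES = 5      # pipeline depth
CYCLE_NS = 2    # pipelined clock period: the slowest phase
LW_NS = 8       # non-pipelined lw: 2 + 1 + 2 + 2 + 1 ns


def non_pipelined_ns(n):
    """n instructions executed strictly one after another."""
    return n * LW_NS


def pipelined_ns(n):
    """The first instruction needs STAGES cycles; each later instruction
    completes one cycle after the one before it (no stalls assumed)."""
    return (STAGES + n - 1) * CYCLE_NS


for n in (3, 1003):
    t_seq, t_pipe = non_pipelined_ns(n), pipelined_ns(n)
    print(f"n={n}: {t_seq} ns vs {t_pipe} ns, speedup {t_seq / t_pipe:.2f}")
```

For n = 3 this gives 24 ns against 14 ns, and for n = 1003 it gives 8024 ns against 2014 ns, matching the slide.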

13
Principles

Principle: Keep the pipe full and make the phase
times as equal as possible.
Sometimes disruptions cause it to empty, and it
has to be refilled ... as can happen with
successful branches

14
Principles (cont.)

Principle: In order for a pipelined scheme to
work well, the data path stages (functional
units) must be designed in such a way that
instructions executing at a particular stage will
do so independently of instructions
simultaneously executing on other stages.
As in an assembly line the instruction should
flow through the data path from stage to stage
and not require the services of multiple stages
independently
It turns out that the single clock cycle data path
implementation we came up with in chapter 5 has
this property to a large degree.
Fig. 6.10, p. 450 (single-cycle data path) is an
idealized abstraction and must be modified to
make it work well in a pipelined environment.
Multi-cycle clocking will be added to this single
cycle datapath.
Instructions roughly flow from left to right as
they get executed. ... Instruction 1 could be in
the ID stage while instruction 2 is in the
EX stage.
Two exceptions to the left-to-right flow:
(a) The write back (WB) stage flows from the end of the pipe
to the register file in the middle of the pipe
(b) The MEM stage feeds back to the fetch stage with
a possible non-incremented branch address

15
Making the Pipeline Work

Pipeline phase buffering
Pipeline buffer registers between phases save
the data for the next phase, thus making a phase
immediately reusable by another instruction - see
fig. 6.12, p. 452
There is no pipeline register between the WB stage
and the ID phase (a right-to-left path). This
is OK since this is a natural interdependence
between instructions being executed ... an lw
places data in the register file and a later
instruction uses it. All instructions generally
change the state of the machine. ... These
kinds of instruction interdependence can get
hairy - see later

16
Pipelined Datapath Showing Pipeline Registers
Between Phases
Fig. 6.12
17
Preserving Information in the Pipeline

Data to be stored by the sw instruction: The data from
the rt register to be stored in memory is buffered in
the ID/EX pipeline register but needed in the MEM
stage.
... so it is automatically transferred to the
EX/MEM pipeline register during the EX phase.
See fig. 6.16, p. 457 or fig. 6.18.
Destination register number (rt) needed by the lw
instruction: In the lw instruction the register
number to write the data into is needed at the
output of the MEM phase (MEM/WB register) ...
but is first buffered in the IF/ID register ...
So it is automatically transferred through three
pipeline registers to the MEM/WB register where
it is needed. See fig. 6.18, p. 460.
This move is ID/EX → EX/MEM → MEM/WB → register file write
register specification.
Initially the given datapath did not have this
path in it (a deliberate bug).
If we did not make this correction, the
destination register number stored in ID/EX would
get overwritten by the next instruction coming
down the pipe, resulting in an error.
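The overwrite problem can be illustrated with a toy Python model. This is only a sketch: each pipeline register is reduced to the destination register number of the instruction it currently holds, one instruction is issued per clock with no stalls, and the function name `run` and its `read_from` parameter are hypothetical.

```python
PIPE_REGS = ["IF/ID", "ID/EX", "EX/MEM", "MEM/WB"]


def run(dest_regs, read_from):
    """Issue one instruction per clock. For each instruction, return the
    register number seen at write-back when it is read from `read_from`."""
    regs = {name: None for name in PIPE_REGS}   # (index, dest) per register
    stream = list(enumerate(dest_regs))
    seen = {}
    for clock in range(len(stream) + len(PIPE_REGS)):
        # Every pipeline register is rewritten on each clock edge.
        regs["MEM/WB"] = regs["EX/MEM"]
        regs["EX/MEM"] = regs["ID/EX"]
        regs["ID/EX"] = regs["IF/ID"]
        regs["IF/ID"] = stream[clock] if clock < len(stream) else None
        # The instruction held in MEM/WB is the one writing the register file.
        if regs["MEM/WB"] is not None:
            index, _ = regs["MEM/WB"]
            source = regs[read_from]
            seen[index] = source[1] if source is not None else None
    return [seen[i] for i in range(len(dest_regs))]


dests = [10, 11, 12, 13, 14]                 # illustrative register numbers
print("carried to MEM/WB:", run(dests, "MEM/WB"))
print("read from ID/EX:  ", run(dests, "ID/EX"))
```

Reading the number from MEM/WB returns each instruction's own destination. Reading it from ID/EX returns the destination of whichever instruction has moved into that register since, which is the error the slide warns about.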

18
Data Path Showing Information Preservation
Figure annotations: pass rt data for sw;
preserve destination register number for lw
Fig. 6.18
19
Notes For Scenarios Or Walk Thrus Given in the
Text (see overhead slides)

Single instruction lw: see fig. 6.13 through
6.15. Assumes the correction of fig. 6.18
The target register number (rt) is determined in
the decode/register fetch stage, but is not used
until the final stage (write-back).
Thus the rt number stored in the ID/EX register
must be moved along with the instruction to the
MEM/WB register where it is needed
This move is ID/EX → EX/MEM → MEM/WB → register
file write register specification
If we didn't do this, the destination register
number would get wiped out in ID/EX by the next
instruction coming down the pipe. ... And it
would be all over but the laughing.
This is essentially the same reason we put the
IRWrite control line on the instruction register
for the multi-clock non-pipelined case in chapter
5: a copy of certain data from the fetched
instruction must be maintained throughout the
execution of the instruction.

20
Notes On Scenarios Or Walk Thrus Given in the
Text (cont.)

Single instruction sw: see fig. 6.16 through 6.17.
Assumes the correction of fig. 6.18
First two stages (fetch and decode) identical to
lw
Shows need to keep information used in later
stages of execution of the instruction
The source data from rt (to be written to memory)
in the register file is fetched during the
decode/register fetch stage and stored in the
ID/EX register, but is needed in the
MEM stage for storage in memory.
Thus this field is transferred along with the
instruction from the ID/EX register to the EX/MEM
register, where it is now available for writing to
memory
This is similar to the situation in lw, but not
identical.

21
Notes On Scenarios Or Walk Thrus Given in the
Text (cont.)

Two instructions, lw and sub: see fig. 6.22
through 6.24
Illustrates that each instruction must visit
each phase even if it does not need any services
in that phase.
sub does not need the mem phase, so the ALU
output is merely passed to the next pipeline
register (MEM/WB) to await being written to the
register file

22
Graphical Representation Of Pipelines

Single clock cycle diagram
What was used in the scenarios
Shows state of the entire datapath during a
single clock cycle
All instructions in the pipeline identified by
labels above respective stages
Requires a sequence of such diagrams to show the
execution of instruction(s)
Example: figs. 6.22 - 6.24 ... walk-through of
two instructions, lw and sub
Multiple-clock-cycle pipeline diagram
Gives a high level overview
Shows the pipeline activity for all clock pulses
in a single diagram - see fig 6.20, p. 462
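A multiple-clock-cycle diagram is easy to generate as text. The sketch below prints one row per instruction and one column per clock cycle. It is not a reproduction of fig. 6.20: it assumes one instruction enters the pipe per cycle with no stalls, and the function name is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return text rows: a header of clock cycles, then one row per
    instruction showing which stage it occupies in each cycle."""
    total_cycles = len(STAGES) + len(instructions) - 1
    width = max(len(name) for name in instructions)
    header = " " * width + "".join(f"{c:>5}" for c in range(1, total_cycles + 1))
    rows = [header]
    for k, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[k + s] = stage      # instruction k enters IF in cycle k + 1
        rows.append(name.ljust(width) + "".join(f"{c:>5}" for c in cells))
    return rows


print("\n".join(pipeline_diagram(["lw", "sub"])))
```

Each row is the same five stages shifted right by one cycle, which is the staircase shape of the textbook figure.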

23
Multiple-clock-cycle pipeline diagram
Fig 6.20
Fig 6.21
24
What Can Go Wrong: Pipeline Hazards - A Preview

Hazard: a situation in which the next instruction
cannot execute in the following clock cycle
Structural hazard
What if there were only a single memory?
Example: an lw followed by other instructions
lw could be accessing data in the memory
while another instruction is being
fetched from the same memory.
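The single-memory conflict can be located mechanically. The following Python sketch assumes the five-stage ordering IF, ID, EX, MEM, WB, one instruction issued per cycle with no stalls, and that only lw and sw touch data memory. The function name and the opcode lists are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_conflicts(opcodes, uses_data_memory=("lw", "sw")):
    """Return (cycle, fetching_index, accessing_index) for every clock cycle
    in which one instruction is in IF while an lw/sw is in MEM, assuming a
    single shared memory."""
    conflicts = []
    mem_offset = STAGES.index("MEM")      # MEM comes 3 cycles after IF
    for i, op in enumerate(opcodes):
        fetch_index = i + mem_offset       # instruction in IF while i is in MEM
        if op in uses_data_memory and fetch_index < len(opcodes):
            cycle = i + mem_offset + 1     # 1-based clock cycle number
            conflicts.append((cycle, fetch_index, i))
    return conflicts


print(memory_conflicts(["lw", "add", "sub", "or"]))
```

With an lw first in the stream, the conflict appears in cycle 4: the lw is in MEM while the fourth instruction is being fetched.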

25
What Can Go Wrong: Pipeline Hazards - A Preview
(cont.)

Control hazard
The need to make a decision based on the results
of one instruction while others are already
executing. The decision may have an effect on
instructions already executing
Example: a conditional branch (beq) could
invalidate instructions already in execution if
the branch is successful.
Possible solutions:
Stall the instructions after
beq until the decision is determined (successful or
unsuccessful branch)
Predict or guess the outcome
of the decision. If you are correct then you run
at full speed; if you are wrong, then the following
instructions must be flushed and the pipe refills
from the new branched instruction stream.
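The trade-off between the two solutions can be put in rough numbers. The Python sketch below is a back-of-the-envelope model, not something taken from the slides: it ignores pipeline fill time, and the instruction count, branch count, penalty, and misprediction rate in the usage lines are hypothetical values chosen only for illustration.

```python
def cycles_with_stall(n_instructions, n_branches, stall_cycles):
    """Always stall: every branch holds up the pipe for `stall_cycles`."""
    return n_instructions + n_branches * stall_cycles


def cycles_with_prediction(n_instructions, n_branches, flush_cycles, miss_rate):
    """Predict: only mispredicted branches pay the flush-and-refill penalty."""
    return n_instructions + n_branches * miss_rate * flush_cycles


# Hypothetical workload: 1000 instructions, 200 of them branches,
# a 3-cycle penalty, and a 25% misprediction rate.
print(cycles_with_stall(1000, 200, 3))
print(cycles_with_prediction(1000, 200, 3, 0.25))
```

Under these assumed numbers, prediction pays the penalty on a quarter of the branches instead of all of them. That is the sense in which a correct guess lets the pipe run at full speed.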

26
What Can Go Wrong: Pipeline Hazards - A Preview
(cont.)

Data hazard
An instruction depends on the result of a
previous instruction still in the pipeline.
Example:
add $s0, $t0, $t1 ($s0 available in 5th stage)
sub $t2, $s0, $t3 ($s0 needed in 2nd stage)
A naïve approach would be to stall sub until the data
is ready: a performance penalty
Better: make the data available earlier by
forwarding (bypassing) stages - give the sub
instruction the result before it is written to the
register file
Sometimes even with forwarding a stall may be
necessary, for example:
lw $s0, 20($t1) ($s0 available in 5th stage; must access memory)
sub $t2, $s0, $t3 ($s0 needed in 2nd stage)

Control and data hazard resolution is easier said
than done: it complicates the control implementation
(details later)
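The two data-hazard cases on this slide can be told apart with a very small check. The Python sketch below only classifies the dependence between two adjacent instructions. It does not model the forwarding hardware, and the `Instr` type and the returned labels are illustrative names.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]          # register written, if any
    sources: Tuple[str, ...]     # registers read


def classify_hazard(producer, consumer):
    """Classify how `consumer` depends on the instruction just before it."""
    if producer.dest is None or producer.dest not in consumer.sources:
        return "none"
    if producer.op == "lw":
        # The loaded value does not exist until the memory access is done,
        # so forwarding alone cannot deliver it in time.
        return "stall + forward"
    # An ALU result can be bypassed to the next instruction.
    return "forward"


add = Instr("add", "$s0", ("$t0", "$t1"))
lw = Instr("lw", "$s0", ("$t1",))
sub = Instr("sub", "$t2", ("$s0", "$t3"))

print(classify_hazard(add, sub))
print(classify_hazard(lw, sub))
```

The add/sub pair is resolved by forwarding alone, while the lw/sub pair still needs a stall, matching the two examples above.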
27
Data Hazard Forwarding
28
Pipeline Control

Start with the controls used for the
non-pipelined case (single clock cycle with
controls) - fig. 5.19, p. 360

Fig 5.19
29
Pipeline Control (cont.)

No controls needed for pipeline registers - they
are written each clock cycle.
Each control line is associated with a component
active in only a single pipeline stage - thus
divide the control lines into potentially 5
groups (per stage) ... see fig. 6.29, p. 469
Since we are using the single clock data path,
the controls are only for the last 3 stages.
Thus, of the 5 potential groups, we will need
only three groups

30
Pipeline Control (cont.)

All controls are created during the decode phase
and stored in the ID/EX pipeline register,
extending the register.
As the clock pulses and the instruction advances
through the pipeline:
Control signals needed by the current execution
phase are utilized and ...
The remainder of them are passed to the next
pipeline register to be used by later phases.
See fig. 6.28 - 6.29, p. 469
This method of asserting control lines is
reminiscent of horizontal microcode where the
control lines (bits in the microword) are
asserted as the microword is executed
... The control bits in the phase registers play
this role - the controls for a particular phase
are asserted (become active) when the phase
occurs
When a stage is inactive, the control lines for
that stage are deasserted (killing that phase)
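The "use what you need, pass the rest along" scheme can be sketched in Python. The grouping into EX, MEM, and WB follows the slide. The signal names and the values shown for lw follow common MIPS textbook conventions but are assumptions here, not values read from fig. 6.28 or 6.29.

```python
# Assumed control values for lw, grouped by the stage that consumes them.
LW_CONTROL = {
    "EX":  {"RegDst": 0, "ALUSrc": 1},
    "MEM": {"MemRead": 1, "MemWrite": 0},
    "WB":  {"RegWrite": 1, "MemtoReg": 1},
}


def advance(control_groups, stage):
    """Use the group for `stage`; return (signals used now, groups passed
    on to the next pipeline register)."""
    used = control_groups.get(stage, {})
    passed_on = {s: g for s, g in control_groups.items() if s != stage}
    return used, passed_on


groups = LW_CONTROL              # contents of ID/EX right after decode
for stage in ("EX", "MEM", "WB"):
    used, groups = advance(groups, stage)
    print(stage, "uses", used, "| passes on", sorted(groups))
```

The set of control bits shrinks at each pipeline register, so no separate sequencer is needed to decide which signals are active in which cycle.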

31
Pipeline Control: Comparison With
Multi-clock Non-pipelined Control

In chapter 5 (multi-clock), the sequencing of
control required a special hardware
implementation of an FSM (see fig. 5.42, 5.43).
In this case the sequencing is embedded in the
pipeline structure itself (pipeline registers):
all control is computed during the instruction decode
phase and then passed along via pipeline
registers
The generation of the control values in the
decode phase is combinational logic done in one
clock pulse as in the single-clock design of
chapter 5.
Sequencing is achieved by an instruction moving
from one phase to another, the control signals
associated with a phase being presented as the
stage is entered via the pipeline register for
that phase.
In chapter 5 (multi-clock), instruction execution
took a variable number of clock cycles (see fig.
5.42); in this case all instructions take the same
number of cycles.

32
Pipeline Control (cont.)
Pass control signals along just like the data
Fig. 6.28
Fig. 6.29
33
Datapath with Pipeline Control
Fig. 6.30
34
Datapath with Pipeline Control: Changes from fig.
5.19 (single clock)

Changes from fig. 5.19 (single clock) to fig. 6.30
(pipelined):
Destination register (rt or rd) propagated
because of multi-clock timing.
PC set twice via PCSource for the MUX:
increment in fetch phase;
branch address if instruction is beq in MEM phase -
overwrites incremented value if successful.
Jump instruction not implemented in fig. 6.30.

35
A Scenario Showing Pipeline Controls: See pdf
figures 6.31 - 6.35

Scenario for the following sequence (see
overhead slides); note that
these instructions are independent of each
other!
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
A fully loaded pipeline (5 instructions), with
controls ... Takes 9 clock cycles to complete.
See fig. 6.31 - 6.35
Although one instruction begins (and completes)
each clock cycle, an individual instruction takes
five cycles to complete
Note the propagation of the destination register
thru the pipe starting in fig 6.33
It takes 4 cycles of filling the pipe before the
5-stage pipeline is operating at full efficiency
(see fig. 6.33).