Title: Enhancing performance Pipelining Chapter 6 Part 1 Concepts
1 Enhancing Performance - Pipelining, Chapter 6, Part 1: Concepts
2 Introduction
- Parallelism built into the processor hardware
- The logical sequence of events in the execution of an instruction is generally wasteful of time.
- Example: while an instruction is doing arithmetic using registers, the memory is idle ... why not fetch the next instruction during this time?
- The key idea is to overlap the processing of multiple instructions.
3 Introduction: An Analogy
- Analogous to an assembly line in a factory
- The time to build an individual car does not decrease, but the number of cars built per unit time is greatly increased.
- Multiple cars are built simultaneously
- While the engine is installed in one car, the seats are installed in another ... at the same time
- Success of the assembly line depends on how well balanced it is ... we don't want one task (phase) taking 10 minutes while another takes 1 hour. ... The car that has had the short task done at one station will have to wait idle for the next station to free up.
- The series of work stations in an assembly line is analogous to the functional units in a processor data path
- A series of cars to be built passes through the assembly line simultaneously - each work station is busy.
- On startup and stopping of the line, it takes a time equal to the sum of the work stations' times to fill and empty the line (pipeline).
4 Introduction: Computer Instructions
- In the assembly line example, the car becomes an instruction.
- The tasks done on the car become the instruction phases (or stages).
- The work stations become the functional units in the data path.
- The assembly line becomes the data path
- The data path simultaneously executing multiple instructions is called a pipeline
5 Introduction: Computer Instructions (cont.)
- Pipelining improves instruction throughput rather than individual instruction execution time
- The time required to move an instruction one step down the pipeline is one clock cycle
- The length of a clock cycle is determined by the time required for the slowest pipeline stage, because all stages must proceed at the same rate
- The goal of the designer is to balance the length of each stage - otherwise there will be idle time during a stage.
6 Steps to Take
- Decompose the processing of instructions into phases
- Simplest decomposition is two phases or stages: fetch and execute
- 1st stage fetches and buffers the instruction
- 2nd stage (execution) receives the buffered instruction from the 1st stage when it is free
- While the 2nd stage is executing, the 1st stage takes advantage of any unused memory cycles to fetch and buffer the next instruction; this is called instruction pre-fetch or fetch overlap
7 Steps to Take (Cont.)
- Problems with this approach
- Execution time is generally longer than fetch time.
- The fetch stage may have to wait before it can empty its buffer
- Ideally we would like to have the various stages of instruction processing take the same amount of time.
- A conditional branch instruction makes the address of the next instruction uncertain ... thus the fetch stage waits until the execute stage (branch) determines the next instruction address
- Both of the above situations result in performance loss - the latter (conditional branch) can be reduced by guessing at the outcome of the branch.
8 Steps to Take (Cont.)
- An improvement would be to decompose the instruction processing into smaller steps (finer granularity)
- There would be less variation in processing time among the stages
- These are the familiar phases of our instruction execution:
  IF: Instruction fetch
  ID: Instruction decode and register fetch
  EX: Execution and effective address calculation
  MEM: Memory access (fetch memory operands)
  WB: Write back (into register file)
- The various phases (5 of them) will be more nearly equal in duration
- Register read and register write take only 1 ns and all the other phases take 2 ns. Thus every phase will be allotted 2 ns - register operations will idle for 1 ns during the register phases
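The balance argument above can be checked with a few lines of arithmetic. This is only a sketch: the stage latencies are the 1 ns / 2 ns figures quoted on the slide, and the names `STAGE_NS`, `clock_cycle_ns`, and `idle_ns_per_stage` are made up for the illustration.

```python
# Stage latencies in ns, as quoted on the slide: the register phases
# (ID = register read, WB = register write) need 1 ns, the others 2 ns.
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}


def clock_cycle_ns(stage_ns):
    """The clock period must be long enough for the slowest stage."""
    return max(stage_ns.values())


def idle_ns_per_stage(stage_ns):
    """Time each stage sits idle within one clock cycle."""
    cycle = clock_cycle_ns(stage_ns)
    return {name: cycle - needed for name, needed in stage_ns.items()}


print(clock_cycle_ns(STAGE_NS))     # 2
print(idle_ns_per_stage(STAGE_NS))  # {'IF': 0, 'ID': 1, 'EX': 0, 'MEM': 0, 'WB': 1}
```

The two register phases are the only ones with slack, which is the 1 ns of idle time the slide mentions.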
9 Steps to Take (Cont.)
- Fundamental concept: In order to make each phase as independent as possible of other phases, we will use the single clock cycle data path (fig. 5.19, p. 360) and a multiple clock cycle timing scheme.
- A hybrid of the two schemes in chapter 5.
- The single clock cycle data path has redundant hardware which enhances parallelism and phase independence.
  - ALU and two adders
- But it is functionally the same as the multiple clock data path.
10 Single Clock Cycle Datapath for Multi Clock Cycle Timing
Fig 6.10
11 Performance Example
- Execution of three consecutive lw instructions (see p. 439)
- 2 ns per phase, except for the register phases, which take 1 ns
12 Performance Example (cont.)
- Ideally, with no delays for register operations, it would take 8 ns to execute an lw instruction and 24 ns to do three of them sequentially.
- In a 5 stage pipeline the three could be done in 14 ns.
- Ideally we would expect to complete an instruction every 8/5 = 1.6 ns for the 5 stage pipeline.
- Instead we see 2 ns between instructions
- This is due to an imbalance in the time for each phase: every phase is allotted 2 ns, but the register phases need only 1 ns
- Since an instruction is fired off every 2 ns in the 5 way pipeline as opposed to every 8 ns in a non pipelined scheme, it would seem that the performance advantage should be 8/2 = 4.
- But what we see is 24/14 ≈ 1.7
- The reason we are not getting the 4:1 ratio is that this example never filled the pipe: about 2/3 of the time was spent filling and emptying the pipe
- Maximum parallelism is achieved only when the pipe is filled
- Suppose we increase the number of instructions executed by 1000:
  Non pipelined: 24 + 1000 × (8 ns/inst) = 8024 ns
  Pipelined: 14 + 1000 × (2 ns/inst) = 2014 ns
  Ratio: 8024/2014 ≈ 3.98 ≈ 8/2 = 4
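The numbers in this example can be reproduced with a short calculation. This is a minimal sketch under the slide's assumptions (five phases of 2, 1, 2, 2, 1 ns, one instruction entering the pipe per clock, no hazards); the function names are made up for the illustration.

```python
STAGE_NS = [2, 1, 2, 2, 1]  # IF, ID, EX, MEM, WB latencies in ns


def non_pipelined_ns(n, stage_ns=STAGE_NS):
    """Each instruction runs start to finish before the next one begins."""
    return n * sum(stage_ns)


def pipelined_ns(n, stage_ns=STAGE_NS):
    """The first instruction needs one cycle per stage; each later
    instruction finishes one cycle after its predecessor."""
    cycle = max(stage_ns)  # clock period set by the slowest stage
    return (len(stage_ns) + n - 1) * cycle


for n in (3, 1003):
    seq, pipe = non_pipelined_ns(n), pipelined_ns(n)
    print(n, seq, pipe, round(seq / pipe, 2))
# 3 24 14 1.71
# 1003 8024 2014 3.98
```

As the instruction count grows, the fill and drain cycles matter less and the ratio approaches 8/2 = 4.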
13 Principles
- Principle: Keep the pipe full and make the phase times as equal as possible.
- Sometimes disruptions cause it to empty and have to be refilled ... as can happen with successful (taken) branches
14 Principles (cont.)
- Principle: In order for a pipelined scheme to work well, the data path stages (functional units) must be designed in such a way that instructions executing at a particular stage will do so independently of instructions simultaneously executing in other stages.
- As in an assembly line, the instruction should flow through the data path from stage to stage and not require the services of multiple stages independently
- It turns out the single clock cycle data path implementation we came up with in chapter 5 has this property to a large degree.
- fig. 6.10, p. 450 (single-cycle data path) is an idealistic abstraction and must be modified to make it work well in a pipelined environment. Multi-cycle clocking will be added to this single cycle datapath.
- Instructions roughly flow from left to right as they get executed. ... Instruction 1 could be in the ID stage while instruction 2 is in the EX stage.
- Two exceptions to the left to right flow:
  (a) The write back (WB) stage flows from the end of the pipe to the register file in the middle of the pipe
  (b) The mem stage feeds back to the fetch stage with a possible non incremented branch address
15 Making the Pipeline Work
- Pipeline phase buffering
- Pipeline buffer registers between phases save the data for the next phase, thus making a phase immediately reusable by another instruction (see fig 6.12, p. 452)
- There is no pipeline register between the WB stage and the ID phase (a right to left path). This is ok since this is a natural interdependence between instructions being executed ... an lw places data in the register file and a later instruction uses it. All instructions generally change the state of the machine. ... These kinds of instruction interdependence could get hairy ... see later
16 Pipelined Datapath Showing Pipeline Registers Between Phases
Fig. 6.12
17 Preserving Information in the Pipeline
- Data to be stored by the sw instruction: The data from the rt register to be stored in memory is buffered in the ID/EX pipeline register but needed in the mem stage.
- ... so it is automatically transferred to the EX/MEM pipeline register during the EX phase. See fig 6.16, p. 457 or fig 6.18
- Destination register number (rt) needed by the lw instruction: In the lw instruction the register number to write the data into is needed at the output of the mem phase (MEM/WB register) ... but is first buffered in the IF/ID register ... So it is automatically transferred through three pipeline registers to the MEM/WB register where it is needed. See fig. 6.18, p. 460. This move is ID/EX → EX/MEM → MEM/WB → register file write register specification
- Initially the given datapath did not have this path in it (a deliberate bug).
- If we did not make this correction, the destination register number stored in ID/EX would get overwritten by the next instruction coming down the pipe and thus would result in an error.
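The overwrite problem described above can be seen in a toy model. This sketch tracks only the destination register number as it moves ID/EX → EX/MEM → MEM/WB, one instruction entering per clock. The function name `simulate` and the register numbers 10, 11, 12 are hypothetical and chosen just for the illustration.

```python
def simulate(dest_regs):
    """Track destination register numbers through the pipeline registers.

    dest_regs: destination register number of each instruction, in
    program order, one instruction entering the pipe per clock cycle.
    Returns one (mem_wb, id_ex) pair for every cycle in which an
    instruction is in the WB stage: the number carried along to MEM/WB
    (what WB should use) and whatever sits in ID/EX at that moment.
    """
    id_ex = ex_mem = mem_wb = None
    trace = []
    # Feed the instructions, then three empty slots to drain the pipe.
    for incoming in list(dest_regs) + [None, None, None]:
        # Clock edge: each pipeline register latches its predecessor's value.
        mem_wb, ex_mem, id_ex = ex_mem, id_ex, incoming
        if mem_wb is not None:
            trace.append((mem_wb, id_ex))
    return trace


print(simulate([10, 11, 12]))
# [(10, 12), (11, None), (12, None)]
```

When the first instruction reaches write-back, ID/EX already holds the number belonging to an instruction two slots behind it (12 rather than 10), so a datapath that read the number from there would write the wrong register.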
18 Data Path Showing Information Preservation
Fig. 6.18 (annotations: pass rt data for sw; preserve destination register number for lw)
19 Notes For Scenarios Or Walk Thrus Given in the Text: See overhead slides
- Single instruction lw: see fig. 6.13 through 6.15. Assumes correction of fig 6.18
- The target register number (rt) is determined in the decode/register fetch stage, but is not used until the final stage (write-back).
- Thus the rt number stored in the ID/EX register must be moved along with the instruction to the MEM/WB register where it is needed
- This move is ID/EX → EX/MEM → MEM/WB → register file write register specification
- If we didn't do this, the destination register number would get wiped out in ID/EX by the next instruction coming down the pipe. ... And it would be all over but the laughing.
- This is essentially the same reason we put the IRWrite control line on the instruction register for the multi-clock non-pipelined case in chapter 5: a copy of certain data from the fetched instruction must be maintained throughout the execution of the instruction.
20 Notes On Scenarios Or Walk Thrus Given in the Text (cont.)
- Single instruction sw: see fig. 6.16 through 6.17. Assumes correction of fig 6.18
- First two stages (fetch and decode) identical to lw
- Shows the need to keep information used in later stages of execution of the instruction
- The source data from rt (to be written to memory) in the register file is fetched during the decode/register fetch stage and stored in the ID/EX register, but is needed in the mem stage for storage in memory.
- Thus this field is transferred along with the instruction from the ID/EX register to the EX/MEM register, where it is now available for writing to mem
- This is similar to the situation in lw, but not identical.
21 Notes On Scenarios Or Walk Thrus Given in the Text (cont.)
- Two instructions, lw and sub: see fig. 6.22 through 6.24
- Illustrates that each instruction must visit each phase even if it does not need any services in that phase.
- sub does not need the mem phase, so the ALU output is merely passed to the next pipeline register (MEM/WB) to await being written to the register file
22 Graphical Representation Of Pipelines
- Single clock cycle diagram
  - What was used in the scenarios
  - Shows the state of the entire datapath during a single clock cycle
  - All instructions in the pipeline identified by labels above respective stages
  - Requires a sequence of such diagrams to show the execution of instruction(s)
  - Example: figs. 6.22 - 6.24 ... walk thru of two instructions, lw and sub
- Multiple-clock-cycle pipeline diagram
  - Gives a high level overview
  - Shows the pipeline activity for all clock pulses in a single diagram - see fig 6.20, p. 462
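A multiple-clock-cycle diagram of the kind described above can be approximated in plain text. The sketch below is not a reproduction of the textbook figure. It assumes an ideal five-stage pipe with one instruction issued per clock and no stalls, and the helper name `pipeline_diagram` is invented for the illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return a text diagram: one row per instruction, one column per clock."""
    n_cycles = len(STAGES) + len(instructions) - 1
    width = max(len(name) for name in instructions)
    header = " " * width + "".join(
        f"{'CC' + str(c):>5}" for c in range(1, n_cycles + 1)
    )
    lines = [header]
    for i, name in enumerate(instructions):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i enters stage s in cycle i + s + 1
        lines.append(f"{name:<{width}}" + "".join(f"{c:>5}" for c in cells))
    return "\n".join(lines)


print(pipeline_diagram(["lw", "sub"]))
```

Each row is shifted one clock to the right of the row above it, so reading down a column shows which instruction occupies which stage in that clock cycle.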
23 Multiple-clock-cycle pipeline diagram
Fig 6.20
Fig 6.21
24 What Can Go Wrong: Pipeline Hazards - A Preview
- Hazard: a situation when the next instruction cannot execute in the following clock cycle
- Structural hazard
- What if there were only a single memory?
- Example: an lw followed by other instructions. The lw could be accessing data in the memory while a following instruction is attempting to be fetched from the same memory.
25 What Can Go Wrong: Pipeline Hazards - A Preview (cont.)
- Control hazard
- The need to make a decision based on the results of one instruction while others are already executing. The decision may have an effect on instructions already executing
- Example: a conditional branch (beq) could invalidate an instruction already in execution if the branch is successful.
- Possible solutions:
  - stall the instruction after beq until the decision is determined (successful or unsuccessful branch)
  - predict or guess the outcome of the decision. If you are correct then you run at full speed; if you are wrong, then the following instruction must be flushed and the pipe refills from the new branched instruction stream.
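The trade-off between the two solutions can be put in rough numbers. This is a sketch only: the penalty of 3 cycles and the list of branch outcomes are hypothetical values chosen for illustration (the slide gives no figures), and "predict" is modelled here as always guessing that the branch is unsuccessful.

```python
def cycles_lost_stalling(outcomes, penalty):
    """Always stall after a branch: every branch costs `penalty` cycles."""
    return len(outcomes) * penalty


def cycles_lost_predicting_not_taken(outcomes, penalty):
    """Guess 'not taken': only branches that are actually taken pay the
    penalty, because only then must the pipe be flushed and refilled."""
    return sum(penalty for taken in outcomes if taken)


# Hypothetical run: True means the branch was successful (taken).
outcomes = [True, False, False, True, False]
PENALTY = 3  # assumed number of cycles lost per stall or flush

print(cycles_lost_stalling(outcomes, PENALTY))              # 15
print(cycles_lost_predicting_not_taken(outcomes, PENALTY))  # 6
```

Under this simple model, guessing never costs more than stalling and costs less whenever some branches are unsuccessful.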
26 What Can Go Wrong: Pipeline Hazards - A Preview (cont.)
- Data hazard
- An instruction depends on the result of a previous instruction still in the pipeline.
- Example:
  add $s0, $t0, $t1    $s0 available in 5th stage
  sub $t2, $s0, $t3    $s0 needed in 2nd stage
- Naïve approach would be to stall sub until the data is ready: performance penalty
- Better: make the data available earlier by forwarding or bypassing stages; give the sub instruction the result before writing to the register file
- Sometimes even with forwarding a stall may be necessary, for example:
  lw $s0, 20($t1)      $s0 available in 5th stage, must access memory
  sub $t2, $s0, $t3    $s0 needed in 2nd stage
Control and data hazard resolution is easier said than done: it complicates the controls implementation - details later
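The two examples above can be told apart mechanically. The sketch below is a simplified illustration, not the detection logic of any real pipeline. Instructions are written by hand as (text, destination, sources) tuples with the usual MIPS `$` register prefix. The two-instruction window assumes a result becomes readable from the register file once the producer is three slots ahead. The function names are invented.

```python
def find_raw_hazards(program, window=2):
    """Report (producer, consumer, register) for every instruction that
    reads a register written by one of the `window` instructions before it.
    Conservative: it does not check for an intervening overwrite."""
    hazards = []
    for i, (_text, dest, _srcs) in enumerate(program):
        if dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][2]:
                hazards.append((i, j, dest))
    return hazards


def load_use_stalls(program):
    """With forwarding, assume only an lw immediately followed by a reader
    of the loaded register still needs a stall."""
    stalls = []
    for i in range(len(program) - 1):
        text, dest, _srcs = program[i]
        if text.split()[0] == "lw" and dest in program[i + 1][2]:
            stalls.append((i, i + 1, dest))
    return stalls


add_sub = [("add $s0, $t0, $t1", "$s0", ("$t0", "$t1")),
           ("sub $t2, $s0, $t3", "$t2", ("$s0", "$t3"))]
lw_sub = [("lw $s0, 20($t1)", "$s0", ("$t1",)),
          ("sub $t2, $s0, $t3", "$t2", ("$s0", "$t3"))]

print(find_raw_hazards(add_sub), load_use_stalls(add_sub))
# [(0, 1, '$s0')] []
print(find_raw_hazards(lw_sub), load_use_stalls(lw_sub))
# [(0, 1, '$s0')] [(0, 1, '$s0')]
```

Both pairs have the same dependence on `$s0`, but only the lw case is left with a stall once forwarding is assumed, because the loaded value does not exist until after the memory access.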
27 Data Hazard: Forwarding
28 Pipeline Control
- Start with the controls used for the non-pipelined case (single clock cycle with controls), fig 5.19, p. 360
Fig 5.19
29 Pipeline Control (cont.)
- No controls needed for pipeline registers - they are written each clock cycle.
- Each control line is associated with a component active in only a single pipeline stage - thus divide the control lines into potentially 5 groups (one per stage) ... see fig. 6.29, p. 469
- Since we are using the single clock data path, the controls are only for the last 3 stages.
- Thus, of the 5 potential groups, we will need only three groups
30 Pipeline Control (cont.)
- All controls are created during the decode phase and stored in the ID/EX pipeline register, extending the register.
- As the clock pulses and the instruction advances thru the pipeline:
  - Control signals needed by the current execution phase are utilized and ...
  - The remainder of them are passed to the next pipeline register to be used by later phases. See fig. 6.28 - 6.29, p. 469
- This method of asserting control lines is reminiscent of horizontal microcode, where the control lines (bits in the microword) are asserted as the microword is executed
- ... The control bits in the phase registers play this role - controls for a particular phase become asserted (active) when the phase occurs
- When a stage is inactive, the control lines for that stage are deasserted (killing that phase)
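The "use your group, pass the rest along" behaviour can be sketched with nested dictionaries standing in for the control fields of the pipeline registers. The signal names and values below are illustrative assumptions in the style of common MIPS textbook datapaths, not taken from the slides. `decode` and `enter_stage` are hypothetical helpers.

```python
# Control bits grouped by the stage that consumes them (EX, MEM, WB).
# Names and values are assumptions for illustration only.
CONTROL_TABLE = {
    "lw":  {"EX":  {"RegDst": 0, "ALUSrc": 1},
            "MEM": {"MemRead": 1, "MemWrite": 0},
            "WB":  {"RegWrite": 1, "MemtoReg": 1}},
    "sub": {"EX":  {"RegDst": 1, "ALUSrc": 0},
            "MEM": {"MemRead": 0, "MemWrite": 0},
            "WB":  {"RegWrite": 1, "MemtoReg": 0}},
}


def decode(op):
    """Decode phase: produce every control group at once (stored in ID/EX)."""
    return {stage: dict(bits) for stage, bits in CONTROL_TABLE[op].items()}


def enter_stage(controls, stage):
    """Use the bits for `stage`; hand the other groups to the next register."""
    used = controls[stage]
    passed_on = {s: bits for s, bits in controls.items() if s != stage}
    return used, passed_on


id_ex = decode("lw")
ex_bits, ex_mem = enter_stage(id_ex, "EX")     # EX/MEM keeps MEM and WB groups
mem_bits, mem_wb = enter_stage(ex_mem, "MEM")  # MEM/WB keeps only the WB group
wb_bits, leftover = enter_stage(mem_wb, "WB")

print(sorted(ex_mem), sorted(mem_wb), leftover)
# ['MEM', 'WB'] ['WB'] {}
```

No sequencer is involved: the control field simply shrinks by one group at each pipeline register until nothing is left after write-back.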
31 Pipeline Control: Comparison With Multi-clock Non-pipelined Control
- In chapter 5 (multi-clock), the sequencing of control required a special hardware implementation of an FSM (see fig. 5.42, 5.43); in this case the sequencing is embedded in the pipeline structure itself (pipeline registers).
- All control is computed during the instruction decode phase and then passed along via pipeline registers
- The generation of the control values in the decode phase is combinational logic done in one clock pulse, as in the single-clock design of chapter 5.
- Sequencing is achieved by an instruction moving from one phase to another, the control signals associated with a phase being presented as the stage is entered via the pipeline register for that phase.
- In chapter 5 (multi-clock) instruction execution took a variable number of clock cycles (see fig 5.42); in this case all instructions take the same number of cycles.
32 Pipeline Control (cont.)
Pass control signals along just like the data
Fig. 6.28
Fig. 6.29
33 Datapath with Pipeline Control
Fig. 6.30
34 Datapath with Pipeline Control: Changes from fig. 5.19 (single clock)
- Changes from fig. 5.19 (single clock) to fig 6.30 (pipelined)
- Destination register (rt or rd) propagated because of multi-clock timing.
- PC set twice via PCSource for the MUX:
  - increment in fetch phase
  - branch address, if the instruction is beq, in MEM phase; overwrites the incremented value if the branch is successful
- jump instruction not implemented in fig 6.30
35 A Scenario Showing Pipeline Controls: See pdf figures 6.31 - 6.35
- Scenario for the following sequence (see overhead slides); note that these instructions are independent of each other!
  lw $10, 20($1)
  sub $11, $2, $3
  and $12, $4, $5
  or $13, $6, $7
  add $14, $8, $9
- A fully loaded pipeline (5 instructions), with controls ... takes 9 clock cycles to complete. See fig 6.31 - 6.35
- Although one instruction begins (and completes) each clock cycle, an individual instruction takes five cycles to complete
- Note the propagation of the destination register thru the pipe starting in fig 6.33
- It takes 4 cycles before the 5 stage pipeline is operating at full efficiency (see fig 6.33): filling the pipe.
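The 9-cycle total and the 4-cycle fill time can be confirmed by counting how many stages are busy in each clock cycle. This is a small sketch that assumes one instruction enters per clock with no stalls. The function name is made up for the illustration.

```python
def busy_stages_per_cycle(n_instructions, n_stages=5):
    """Number of occupied pipeline stages in each clock cycle."""
    total_cycles = n_stages + n_instructions - 1
    busy = []
    for cycle in range(1, total_cycles + 1):
        # Instruction i (0-based) is in stage (cycle - 1 - i), if that
        # value is a valid stage index.
        count = sum(
            1 for i in range(n_instructions)
            if 0 <= cycle - 1 - i < n_stages
        )
        busy.append(count)
    return busy


print(busy_stages_per_cycle(5))  # [1, 2, 3, 4, 5, 4, 3, 2, 1]
```

The list has 9 entries, and all five stages are busy only in cycle 5, after 4 cycles of filling. The remaining cycles drain the pipe.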