CS/ECE 365 Computer Architecture presentation

1
CS/ECE 365 Computer Architecture

Lecture Jan 8, 2001
Soundararajan Ezekiel
Department of Computer Science
Ohio Northern University

2
Performance of pipelines with stalls

A stall causes pipeline performance to degrade from the ideal performance.
We will derive a simple equation for finding the actual speedup from pipelining:
$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}}$$
$$= \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}}$$

$$\text{Speedup from pipelining} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
Pipelining can be thought of as decreasing either the CPI or the clock cycle time.
It is traditional to use CPI to compare pipelines.
Assumption: the ideal CPI on a pipelined machine is always 1.

$$\text{CPI pipelined} = \text{Ideal CPI} + \text{Pipeline stall cycles per instruction} = 1 + \text{Pipeline stall cycles per instruction}$$
If we ignore the cycle time overhead of pipelining and assume that the stages are perfectly balanced, then the cycle times of the two machines can be taken as equal.

$$\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}}$$
Simple case: all instructions take the same number of cycles, which must also equal the number of pipeline stages (also called the depth of the pipeline).
In this case, unpipelined CPI = depth of the pipeline.

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
If there are no pipeline stalls, Speedup = Pipeline depth.
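
The final formula can be checked numerically. The following is a minimal Python sketch, assuming (as on the slides) perfectly balanced stages, no clock overhead, and an unpipelined CPI equal to the pipeline depth. The function name `pipeline_speedup` and the sample values (a 5-stage pipeline, 0.25 stall cycles per instruction) are illustrative assumptions, not figures from the lecture.

```python
def pipeline_speedup(depth, stall_cycles_per_instr=0.0):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction).

    Hypothetical helper: assumes the unpipelined CPI equals the pipeline
    depth and that both machines have the same clock cycle time.
    """
    if depth < 1 or stall_cycles_per_instr < 0:
        raise ValueError("depth must be >= 1 and stalls must be >= 0")
    return depth / (1.0 + stall_cycles_per_instr)


# Illustrative values only: a 5-stage pipeline.
no_stalls = pipeline_speedup(5)          # equals the depth when nothing stalls
with_stalls = pipeline_speedup(5, 0.25)  # stalls pull the speedup below the depth
print(no_stalls, with_stalls)
```

With no stalls the result equals the depth; any positive stall rate gives a smaller speedup.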

7
Alternatively

If we think of pipelining as improving the clock cycle time, we can assume that the CPI of the unpipelined machine, as well as the ideal CPI of the pipelined machine, is 1. This leads to:
$$\text{Speedup from pipelining} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
$$= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$

In cases where the pipe stages are perfectly balanced and there is no overhead, the clock cycle on the pipelined machine is smaller than the clock cycle of the unpipelined machine by a factor equal to the pipeline depth.

$$\text{Clock cycle pipelined} = \frac{\text{Clock cycle unpipelined}}{\text{Pipeline depth}}$$
$$\text{Pipeline depth} = \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
This leads to the following:

$$\text{Speedup from pipelining} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
$$= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}$$
Again, if there are no stalls, Speedup = Pipeline depth.
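
The clock-cycle view can be sketched the same way. This Python fragment assumes a CPI of 1 on the unpipelined machine and an ideal CPI of 1 on the pipelined one, as stated above. The function name `speedup_clock_view` and the 10 ns unpipelined cycle time are hypothetical values chosen only for illustration.

```python
def speedup_clock_view(cycle_unpipelined, cycle_pipelined, stall_cycles_per_instr=0.0):
    """Speedup = 1 / (1 + stalls) * (clock cycle unpipelined / clock cycle pipelined)."""
    if cycle_unpipelined <= 0 or cycle_pipelined <= 0:
        raise ValueError("cycle times must be positive")
    return (1.0 / (1.0 + stall_cycles_per_instr)) * (cycle_unpipelined / cycle_pipelined)


# Assumed numbers: balanced 5-stage pipeline with no overhead,
# so the pipelined cycle is the unpipelined cycle divided by the depth.
depth = 5
cycle_unpipelined = 10.0                     # ns, illustrative
cycle_pipelined = cycle_unpipelined / depth  # ns

print(speedup_clock_view(cycle_unpipelined, cycle_pipelined))
print(speedup_clock_view(cycle_unpipelined, cycle_pipelined, 0.25))
```

Because the cycle-time ratio equals the depth under these assumptions, the result agrees with the CPI-based derivation.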

11
Stall on branch performance

Question: Estimate the impact on the clock cycles per instruction (CPI) of stalling on branches. Assume all other instructions have a CPI of 1.
Page 189, gcc column: conditional branches are 17% of the instructions.
All other instructions have a CPI of 1.
A branch takes one extra clock cycle for the stall.
CPI = 1 + 0.17 × 1 = 1.17
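
The arithmetic behind this answer can be written out as a short Python sketch. The function name `cpi_with_branch_stalls` is a hypothetical helper; the inputs (17% branches, one stall cycle per branch, base CPI of 1) are the values given on the slide.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles_per_branch):
    """Average CPI when every branch stalls the pipeline for a fixed number of cycles."""
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must be between 0 and 1")
    return base_cpi + branch_fraction * stall_cycles_per_branch


# Slide values: base CPI 1, 17% conditional branches, 1 extra cycle each.
cpi = cpi_with_branch_stalls(1.0, 0.17, 1)
print(f"{cpi:.2f}")
```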

12
[Figure: pipeline timing diagram for add, beq, and lw on a time axis (2, 4, 6, 8), with intervals labeled 2 and 4]
13
Note

If we cannot resolve the branch in the second stage, the cost of stalling will be very high.
Predict: if you are pretty sure you have the right formula, then go ahead and do the second load of laundry.
If you are wrong, redo the load that was washed while guessing.

Computers use prediction to handle branches.
Simple prediction: always predict that branches fail (are not taken).
When the branch fails, the pipeline runs at full speed.
When the branch succeeds (is taken), the pipeline stalls.
A pipeline stall is nicknamed a bubble.
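
The effect of this simple prediction scheme on CPI can be sketched as follows. This is an illustrative Python model, not a result from the lecture: the function name `cpi_predict_not_taken`, the 50% taken rate, and the one-cycle penalty are all assumptions. Only the 17% branch frequency comes from the earlier slide.

```python
def cpi_predict_not_taken(branch_fraction, taken_fraction, taken_penalty_cycles, base_cpi=1.0):
    """Average CPI when branches are predicted not taken.

    Only taken branches pay the stall penalty; untaken branches
    run at full speed.
    """
    for name, value in (("branch_fraction", branch_fraction), ("taken_fraction", taken_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return base_cpi + branch_fraction * taken_fraction * taken_penalty_cycles


# 17% branches (from the slide); taken rate and penalty are assumed values.
always_stall = cpi_predict_not_taken(0.17, 1.0, 1)  # same as stalling on every branch
half_taken = cpi_predict_not_taken(0.17, 0.5, 1)    # hypothetical 50% taken rate
print(f"{always_stall:.3f} {half_taken:.3f}")
```

The fewer branches are actually taken, the closer the CPI gets to the ideal value of 1.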

15
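Branch is not taken
[Figure: pipeline timing diagram for add, beq, and lw on a time axis (2, 4, 6, 8) when the branch is not taken. Stages: Instruction fetch, Register read, ALU operation, Data access, Reg write]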
16
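When the branch is taken
[Figure: pipeline timing diagram for add, beq, lw, and or on a time axis (2, 4, 6, 8) when the branch is taken, with bubbles inserted after the branch]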
17
Third approach

Called delayed decision. In computers it is called the delayed branch, and it is the solution used in the MIPS architecture.
The delayed branch always executes the next sequential instruction, with the branch taking place after that one-instruction delay.
It is hidden from the MIPS assembly language programmer.

18
The pipe bubble has been replaced by add
[Figure: pipeline timing diagram for beq, add (in the delay slot), and lw on a time axis (2, 4, 6, 8). Stages: Instruction fetch, Register read, ALU operation, Data access, Reg write]
19
Data Hazards

In our laundry analogy, having to match left and right socks will stall the operation.
Example: we have an add instruction followed by a subtract that uses the sum (s0):
add s0, t1, t2
sub t2, s0, t3
The add writes its result only in the 5th stage.
We have to add 3 bubbles to the pipeline.

We can try to rely on compilers to avoid this type of hazard, but most of the time we will fail.
This type of hazard happens too often, and the delay is too long for compilers to rescue us.
Primary solution: we do not need to wait for the instruction to complete.
As soon as the ALU creates the sum for the add, supply it as an input for the subtract.
This method is called forwarding or bypassing.

21
Forwarding with two instructions

Forwarding is valid only if the destination stage is later in time than the source stage.
It will not prevent all stalls.
For example, if a load of s0 replaced the add, the loaded data would arrive too late to be used as the input to the sub.
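
The timing rule above can be made concrete with a small model. This is a minimal Python sketch assuming the classic five-stage pipeline named on the slides (IF, ID, EX, MEM, WB), full forwarding, and a consumer that needs its operand at the start of its EX stage. The table `RESULT_READY_STAGE` and the function `stalls_with_forwarding` are hypothetical names introduced for illustration.

```python
# Stage numbers: IF=1, ID=2, EX=3, MEM=4, WB=5.
# Stage at the end of which each producer's result exists (assumed model):
# ALU operations produce their result in EX, loads in MEM.
RESULT_READY_STAGE = {"add": 3, "sub": 3, "lw": 4}
EX_STAGE = 3


def stalls_with_forwarding(producer_op, distance=1):
    """Bubbles needed before a consumer `distance` instructions after the producer.

    The producer is fetched in cycle 1, so its result is ready at the end of
    cycle RESULT_READY_STAGE[producer_op]. An unstalled consumer reaches EX in
    cycle distance + EX_STAGE. Forwarding works only if that cycle is later
    than the cycle in which the result is produced.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    ready_cycle = RESULT_READY_STAGE[producer_op]
    consumer_ex_cycle = distance + EX_STAGE
    return max(0, ready_cycle + 1 - consumer_ex_cycle)


print(stalls_with_forwarding("add"))     # add followed directly by a dependent sub
print(stalls_with_forwarding("lw"))      # lw followed directly by a dependent sub
print(stalls_with_forwarding("lw", 2))   # one independent instruction in between
```

Under these assumptions an add feeding the next instruction needs no bubble, while a load feeding the next instruction still needs one.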

22
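Graphical representation of the instruction pipeline
[Figure: add s0, t0, t1 passing through the stages IF, ID, EX, MEM, WB on a time axis (2, 4, 6, 8)]
The shading indicates that the element is used by the instruction. MEM is white, which means that add does not access data memory. Shading on the right half means the element is read in that stage; shading on the left half means it is written in that stage.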
23
Graphical representation of forwarding
[Figure: add s0, t0, t1 followed by sub t2, s0, t3, each passing through the stages IF, ID, EX, MEM, WB on a time axis (2, 4, 6, 8)]
24

The connection shows the forwarding path from the output of the EX stage of add to the input of the EX stage of sub, replacing the value from register s0 read in the second stage of sub.

25
A stall is needed even with forwarding
[Figure: lw s0, 20(t1) followed by sub t2, s0, t3, each passing through the stages IF, ID, EX, MEM, WB on a time axis (2, 4, 6, 8)]
26
Reordering code to avoid pipeline stalls

Find the hazard in this code, then reorder the instructions to avoid it:

# reg t1 has the address of v[k]
lw t0, 0(t1)    # reg t0 (temp) = v[k]
lw t2, 4(t1)    # reg t2 = v[k+1]
sw t2, 0(t1)    # v[k] = reg t2
sw t0, 4(t1)    # v[k+1] = reg t0 (temp)
27
Answer

The hazard occurs on register t2, between the second lw and the first sw. Swapping the two sw instructions removes this hazard:

# reg t1 has the address of v[k]
lw t0, 0(t1)    # reg t0 (temp) = v[k]
lw t2, 4(t1)    # reg t2 = v[k+1]
sw t0, 4(t1)    # v[k+1] = reg t0 (temp)
sw t2, 0(t1)    # v[k] = reg t2
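
The answer can be checked mechanically. The Python sketch below flags every load whose destination register is read by the very next instruction, which is the hazard described here. It is a simplified model: the helper names `parse` and `load_use_hazards` are hypothetical, only lw and sw are understood, and any use of the loaded register by the next instruction is treated as a hazard.

```python
def parse(line):
    """Return (op, dest, sources) for lines such as 'lw t0, 0(t1)' or 'sw t2, 0(t1)'."""
    op, rest = line.split(None, 1)
    reg, mem = [part.strip() for part in rest.split(",")]
    base = mem[mem.index("(") + 1 : mem.index(")")]
    if op == "lw":
        return op, reg, [base]        # lw writes reg, reads the base register
    if op == "sw":
        return op, None, [reg, base]  # sw reads both registers, writes none
    raise ValueError(f"unsupported instruction: {op}")


def load_use_hazards(program):
    """List (load_index, user_index, register) for each load used by the next instruction."""
    hazards = []
    for i in range(len(program) - 1):
        op, dest, _ = parse(program[i])
        _, _, next_sources = parse(program[i + 1])
        if op == "lw" and dest in next_sources:
            hazards.append((i, i + 1, dest))
    return hazards


original = ["lw t0, 0(t1)", "lw t2, 4(t1)", "sw t2, 0(t1)", "sw t0, 4(t1)"]
reordered = ["lw t0, 0(t1)", "lw t2, 4(t1)", "sw t0, 4(t1)", "sw t2, 0(t1)"]

print(load_use_hazards(original))   # hazard on t2 between instructions 1 and 2
print(load_use_hazards(reordered))  # no load is used by the next instruction
```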
28
Hardware and software interface

Trading off compiler complexity against hardware complexity, the original MIPS processors avoided hardware to stall the pipeline by requiring software to follow a load with an instruction independent of that load. Such loads are called delayed loads.
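
What the software has to guarantee for delayed loads can be sketched in Python. This is an illustrative model only: instructions are hypothetical (op, dest, sources) tuples, and the function `fill_load_delay_slots` simply inserts a nop, the simplest independent instruction, whenever a load is immediately followed by a use of its result. A scheduler would first try to move a useful independent instruction into that slot instead.

```python
NOP = ("nop", None, ())


def fill_load_delay_slots(program):
    """Insert a nop after any load whose result is read by the very next instruction."""
    scheduled = []
    for i, instr in enumerate(program):
        scheduled.append(instr)
        op, dest, _ = instr
        has_next = i + 1 < len(program)
        if op == "lw" and has_next and dest in program[i + 1][2]:
            scheduled.append(NOP)  # keep the dependent instruction out of the delay slot
    return scheduled


# lw s0, 20(t1) followed by sub t2, s0, t3, as in the stall example.
program = [("lw", "s0", ("t1",)), ("sub", "t2", ("s0", "t3"))]
for instr in fill_load_delay_slots(program):
    print(instr)
```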
