Title: CS/ECE 365 Computer Architecture
- Lecture Jan 8, 2001
- Soundararajan Ezekiel
- Department of Computer Science
- Ohio Northern University
Performance of pipelines with stalls
- A stall causes the pipeline performance to degrade from the ideal performance.
- We will derive a simple equation for finding the actual speedup from pipelining.
- $\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}}$
- $= \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}}$
- $= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$
- Pipelining can be thought of as decreasing either the CPI or the clock cycle time.
- Traditionally, CPI is used to compare pipelines.
- Assumption: the ideal CPI on a pipelined machine is always 1.
- $\text{CPI pipelined} = \text{Ideal CPI} + \text{Pipeline stall cycles per instruction}$
- $= 1 + \text{Pipeline stall cycles per instruction}$
- If we ignore the cycle time overhead of pipelining and assume that the stages are perfectly balanced, then the cycle times of the two machines can be taken as equal.
- $\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}}$
- Simple case: all instructions take the same number of cycles, which must also equal the number of pipeline stages (also called the depth of the pipeline).
- In this case, $\text{CPI unpipelined} = \text{Pipeline depth}$.
- $\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$
- If there are no pipeline stalls, $\text{Speedup} = \text{Pipeline depth}$.
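The speedup formula above can be tried with a few numbers. The following is a minimal sketch; the function name and the sample values (a 5-stage pipeline with 0.25 stall cycles per instruction) are illustrative assumptions, not figures from the lecture.

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction).

    Assumes perfectly balanced stages, no pipelining overhead, and an
    unpipelined CPI equal to the pipeline depth, as in the derivation above.
    """
    if pipeline_depth < 1 or stall_cycles_per_instruction < 0:
        raise ValueError("depth must be >= 1 and stall cycles must be >= 0")
    return pipeline_depth / (1.0 + stall_cycles_per_instruction)


# Hypothetical example values (not from the lecture).
print(pipeline_speedup(5, 0.0))   # no stalls: speedup equals the depth, 5.0
print(pipeline_speedup(5, 0.25))  # 5 / 1.25 = 4.0
```

Note that every stall cycle per instruction enters the denominator directly, so even a modest stall rate takes a visible bite out of the ideal speedup.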
Alternatively
- If we think of pipelining as improving the clock cycle time, we can assume that the CPI of the unpipelined machine, as well as the ideal (stall-free) CPI of the pipelined machine, is 1. This leads to
- $\text{Speedup from pipelining} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$
- $= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$
- In cases where the pipe stages are perfectly balanced and there is no overhead, the clock cycle on the pipelined machine is smaller than the clock cycle of the unpipelined machine by a factor equal to the pipeline depth.
- $\text{Clock cycle pipelined} = \frac{\text{Clock cycle unpipelined}}{\text{Pipeline depth}}$
- $\text{Pipeline depth} = \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$
- This leads to the following:
- $\text{Speedup from pipelining} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$
- $= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}$
- Again, if there are no stalls, $\text{Speedup} = \text{Pipeline depth}$.
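The clock-cycle view of the same speedup can be sketched as follows. The function name and the timings (a 10 ns unpipelined clock, 2 ns and 2.5 ns pipelined clocks) are hypothetical values chosen for illustration. The second call shows what happens when the pipelined clock is slower than the perfectly balanced ideal.

```python
def speedup_from_clock(clock_unpipelined, clock_pipelined,
                       stall_cycles_per_instruction):
    """Speedup = (1 / (1 + stalls)) * (clock unpipelined / clock pipelined).

    Assumes the unpipelined CPI and the ideal pipelined CPI are both 1.
    """
    clock_ratio = clock_unpipelined / clock_pipelined
    return clock_ratio / (1.0 + stall_cycles_per_instruction)


# Hypothetical timings in nanoseconds (not from the lecture).
# Perfectly balanced 5-stage pipeline: 10 ns / 5 = 2 ns per stage.
print(speedup_from_clock(10.0, 2.0, 0.25))  # (10 / 2) / 1.25 = 4.0

# Same machine, but assume overhead stretches the pipelined clock to 2.5 ns.
print(speedup_from_clock(10.0, 2.5, 0.25))  # (10 / 2.5) / 1.25, about 3.2
```

In the balanced, no-overhead case the clock ratio equals the pipeline depth, so this reduces to the depth-based formula.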
Stall on branch performance
- Question: Estimate the impact on the clock cycles per instruction (CPI) of stalling on branches. Assume all other instructions have a CPI of 1.
- Page 189, gcc column: conditional branches are 17% of the instructions.
- All other instructions: CPI of 1
- Each branch takes one extra clock cycle
- CPI = 1 + 0.17 × 1 = 1.17
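The arithmetic behind this estimate can be written out as a short sketch. The function name is an illustrative choice; the 17% branch frequency and the one-cycle stall are the numbers used in the example above.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles_per_branch):
    """Average CPI when every branch stalls for a fixed number of cycles."""
    return base_cpi + branch_fraction * stall_cycles_per_branch


# Numbers from the example: 17% branches, one extra cycle per branch.
cpi = cpi_with_branch_stalls(1.0, 0.17, 1)
print(f"CPI = {cpi:.2f}")  # CPI = 1.17
```

Relative to the ideal CPI of 1, this corresponds to a slowdown of 1.17.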
[Figure: pipeline timing diagram; time axis 2, 4, 6, 8; instructions add, beq, lw]
Note
- If we cannot resolve the branch in the second stage, the cost will be very high.
- Predict: if you are pretty sure you have the right formula, then go ahead and do the second load of laundry.
- If you are wrong, redo the load that was washed while guessing.
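- Computers use prediction to handle branches.
- Simple prediction: always predict that branches will fail (are not taken).
- When the prediction is right, the pipeline proceeds at full speed.
- When the branch succeeds (is taken), the pipeline stalls.
- A pipeline stall is nicknamed a bubble.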
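The effect of this simple prediction scheme on CPI can be estimated with a small sketch. The function name, the one-cycle penalty default, and in particular the fraction of branches that are taken (50% here) are assumptions made for illustration; the lecture does not give a taken rate.

```python
def cpi_predict_not_taken(branch_fraction, taken_fraction,
                          penalty_cycles=1, base_cpi=1.0):
    """Average CPI when branches are always predicted not taken.

    Only branches that are actually taken pay the stall penalty;
    untaken branches proceed at full speed.
    """
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


# 17% branches as in the earlier example; the 50% taken rate is hypothetical.
print(cpi_predict_not_taken(0.17, 0.5))  # about 1.085
print(cpi_predict_not_taken(0.17, 1.0))  # every branch taken: same as stalling
```

With every branch taken, the result matches the always-stall case, so under this model prediction never does worse than stalling on every branch.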
Branch is not taken
[Figure: pipeline timing diagram; time axis 2, 4, 6, 8; instructions add, beq, lw; stages: Instruction fetch, Register read, ALU operation, Data access, Reg write]
When the branch is taken
[Figure: pipeline timing diagram; time axis 2, 4, 6, 8; rows labeled add, beq, lw, bubble (three times), or]
Third approach
- Called delayed decision; in computers it is called the delayed branch, and it is used in the MIPS architecture.
- The delayed branch always executes the next sequential instruction, with the branch taking place after that one-instruction delay.
- It is hidden from the MIPS assembly language programmer.
The pipeline bubble has been replaced by add
[Figure: pipeline timing diagram; time axis 2, 4, 6, 8; instructions beq, add, lw; stages: Instruction fetch, Register read, ALU operation, Data access, Reg write]
Data Hazards
- In our laundry analogy, waiting to pair a left sock with its right will stall the operation.
- Example: an add instruction followed by a subtract that uses the sum (s0):
  add s0, t1, t2
  sub t2, s0, t3
- add writes its result only in the 5th stage.
- We would have to add 3 bubbles to the pipeline.
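The bubble count can be checked with a simple cycle-counting sketch. It assumes a 5-stage pipeline (IF, ID, EX, MEM, WB numbered 1 to 5) in which registers are read in stage 2 and written in stage 5. The function name and the `same_cycle_ok` switch are illustrative assumptions; the count of 3 on the slide corresponds to the consumer reading strictly after the producer has written.

```python
def bubbles_without_forwarding(distance=1, write_stage=5, read_stage=2,
                               same_cycle_ok=False):
    """Bubbles needed between a producer and a dependent consumer.

    The producer is fetched in cycle 1, so it occupies stage s in cycle s.
    The consumer is fetched `distance` cycles later, so without stalls it
    occupies stage s in cycle distance + s.
    """
    write_cycle = write_stage
    # Earliest cycle in which the consumer may read the register.
    earliest_read = write_cycle if same_cycle_ok else write_cycle + 1
    unstalled_read = distance + read_stage
    return max(0, earliest_read - unstalled_read)


# add followed immediately by a dependent sub (distance 1).
print(bubbles_without_forwarding())                    # 3
# Assumption: register file can be written and read in the same cycle.
print(bubbles_without_forwarding(same_cycle_ok=True))  # 2
# Enough independent instructions in between: no bubbles needed.
print(bubbles_without_forwarding(distance=4))          # 0
```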
- We can try to rely on compilers to avoid this type of hazard, but most of the time we will fail.
- This type of dependence happens too often, and the delay is too long, for compilers to rescue us.
- Primary solution: we don't need to wait for the instruction to complete.
- As soon as the ALU creates the sum for the add, supply it as an input for the subtract.
- This method is called forwarding or bypassing.
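Forwarding with two instructions
- Forwarding is valid only if the destination stage is later in time than the source stage.
- It will not prevent all stalls.
- For example, suppose the first instruction were a load of s0 instead of an add.
- The loaded value is available only after the load's MEM stage, which is too late for the input of the sub's EX stage.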
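The timing rule for forwarding can be sketched as a comparison between the stage in which the producer's value becomes available and the stage in which the consumer needs it. The function name and stage numbering (IF=1, ID=2, EX=3, MEM=4, WB=5) are illustrative assumptions.

```python
EX, MEM = 3, 4  # stage numbers in a 5-stage pipeline (IF=1 ... WB=5)


def stalls_with_forwarding(producer_ready_stage, consumer_need_stage=EX,
                           distance=1):
    """Stalls still required when results are forwarded.

    The producer's value is ready at the end of the cycle in which it
    occupies `producer_ready_stage`. The consumer, fetched `distance`
    cycles later, needs the value at the start of `consumer_need_stage`.
    """
    first_usable_cycle = producer_ready_stage + 1
    unstalled_need_cycle = distance + consumer_need_stage
    return max(0, first_usable_cycle - unstalled_need_cycle)


# add then sub: the sum is ready after EX, just in time for the sub's EX.
print(stalls_with_forwarding(EX))   # 0
# lw then sub: the data is ready only after MEM, one cycle too late.
print(stalls_with_forwarding(MEM))  # 1
```

The one remaining stall in the load case is why a load followed immediately by a use of the loaded register still needs a bubble, even with forwarding.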
Graphical representation of instruction pipeline
[Figure: add s0, t0, t1 passing through the stages IF, ID, EX, MEM, WB; time axis 2, 4, 6, 8]
The shading indicates that the element is used by the instruction. MEM is white, which means add does not access data memory. Shading on the right half means the element is read in that stage; shading on the left half means it is written.
Graphical representation of forwarding
[Figure: add s0, t0, t1 followed by sub t2, s0, t3, each passing through the stages IF, ID, EX, MEM, WB; time axis 2, 4, 6, 8]
- The connection shows the forwarding path from the output of the EX stage of add to the input of the EX stage of sub, replacing the value from register s0 read in the second stage of sub.
Need stall even with forwarding
[Figure: lw s0, 20(t1) followed by sub t2, s0, t3, each passing through the stages IF, ID, EX, MEM, WB; time axis 2, 4, 6, 8]
Reordering code to avoid pipeline stall
- Find the hazard in this code, then reorder the instructions to avoid it:
  # reg t1 has the address of v[k]
  lw t0, 0(t1)    # reg t0 (temp) = v[k]
  lw t2, 4(t1)    # reg t2 = v[k+1]
  sw t2, 0(t1)    # v[k] = reg t2
  sw t0, 4(t1)    # v[k+1] = reg t0 (temp)
Answer
- The hazard occurs on register t2 between the second lw and the first sw. Swapping the two sw instructions removes this hazard:
  # reg t1 has the address of v[k]
  lw t0, 0(t1)    # reg t0 (temp) = v[k]
  lw t2, 4(t1)    # reg t2 = v[k+1]
  sw t0, 4(t1)    # v[k+1] = reg t0 (temp)
  sw t2, 0(t1)    # v[k] = reg t2
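The check performed by hand above can be automated with a small sketch that scans for a load followed immediately by an instruction reading the loaded register. The tuple encoding of instructions and the function name are illustrative assumptions, and the sketch handles only the lw and sw forms used in this example.

```python
def load_use_hazards(program):
    """Return indices of instructions that use a register loaded by the
    immediately preceding lw.

    Each instruction is a tuple (op, reg, offset, base) standing for
    'op reg, offset(base)'. lw writes reg and reads base; sw reads both.
    """
    hazards = []
    for i in range(1, len(program)):
        prev_op, prev_reg, _, _ = program[i - 1]
        op, reg, _, base = program[i]
        reads = {base} if op == "lw" else {reg, base}
        if prev_op == "lw" and prev_reg in reads:
            hazards.append(i)
    return hazards


original = [
    ("lw", "t0", 0, "t1"),
    ("lw", "t2", 4, "t1"),
    ("sw", "t2", 0, "t1"),  # uses t2 right after it is loaded
    ("sw", "t0", 4, "t1"),
]
reordered = [
    ("lw", "t0", 0, "t1"),
    ("lw", "t2", 4, "t1"),
    ("sw", "t0", 4, "t1"),
    ("sw", "t2", 0, "t1"),
]

print(load_use_hazards(original))   # [2]
print(load_use_hazards(reordered))  # []
```

The sketch looks only one instruction back, which matches the forwarding case discussed earlier, where a load needs a single independent instruction after it.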
Hardware and software interface
- As a trade-off between compiler and hardware complexity, the original MIPS processors avoided hardware to stall the pipeline by requiring software to follow a load with an instruction independent of that load. Such loads are called delayed loads.
Next class
- We will do some problems
- We will look at hazards from the DLX architecture point of view
- After that we will study the data path
- Then we will come back and apply pipelining to the data path