CS/ECE 365 Computer Architecture

1
CS/ECE 365 Computer Architecture
  • Lecture Jan 8, 2001
  • Soundararajan Ezekiel
  • Department of Computer Science
  • Ohio Northern University

2
Performance of pipelines with stalls
  • A stall causes pipeline performance to degrade
    from the ideal performance
  • We will derive a simple equation for finding the
    actual speedup from pipelining
  • Speedup from pipelining = (average instruction
    time unpipelined) / (average instruction time
    pipelined)
  • = (CPI unpipelined × clock cycle unpipelined) /
    (CPI pipelined × clock cycle pipelined)

3
  • = (CPI unpipelined / CPI pipelined) × (clock
    cycle unpipelined / clock cycle pipelined)
  • Pipelining can be thought of as decreasing either
    the CPI or the clock cycle time
  • Traditionally, CPI is used to compare pipelines
  • Assumption: the ideal CPI on a pipelined machine
    is always 1

4
  • CPI pipelined = ideal CPI + pipeline stall clock
    cycles per instruction
  • = 1 + pipeline stall clock cycles per
    instruction
  • If we ignore the cycle time overhead of
    pipelining and assume that the stages are
    perfectly balanced, then the cycle times of the
    two machines can be taken as equal

5
  • Speedup = CPI unpipelined / (1 + pipeline stall
    cycles per instruction)
  • Simple case: all instructions take the same
    number of cycles, which must also equal the
    number of pipeline stages (also called the depth
    of the pipeline)
  • In this case, unpipelined CPI = depth of the
    pipeline

6
  • Speedup = pipeline depth / (1 + pipeline stall
    cycles per instruction)
  • If there are no pipeline stalls,
  • Speedup = pipeline depth
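
The speedup formula on this slide can be checked numerically. The following is a minimal Python sketch; the function name and the sample values (a depth of 5 and 0.25 stall cycles per instruction) are illustrative assumptions, not figures from the lecture.

```python
def pipeline_speedup(depth, stall_cycles_per_instr=0.0):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction).

    Assumes perfectly balanced stages, no pipelining overhead, and an
    unpipelined CPI equal to the pipeline depth, as on the slides.
    """
    if depth < 1 or stall_cycles_per_instr < 0:
        raise ValueError("depth must be >= 1 and stalls must be >= 0")
    return depth / (1.0 + stall_cycles_per_instr)


# Hypothetical example values, chosen only for illustration.
no_stall = pipeline_speedup(5)           # no stalls: speedup equals the depth
with_stall = pipeline_speedup(5, 0.25)   # stalls reduce the speedup
print(no_stall)    # 5.0
print(with_stall)  # 4.0
```

With no stalls the result equals the pipeline depth, and any positive stall count pulls the speedup below the depth.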

7
Alternatively
  • If we think of pipelining as improving the clock
    cycle time,
  • assume the CPI of the unpipelined machine, as
    well as the CPI of the pipelined machine before
    stalls, is 1. This leads to
  • Speedup from pipelining = (CPI unpipelined / CPI
    pipelined) × (clock cycle unpipelined / clock
    cycle pipelined)
  • = (1 / (1 + pipeline stall cycles per
    instruction)) × (clock cycle unpipelined / clock
    cycle pipelined)

8
  • In cases where the pipe stages are perfectly
    balanced and there is no overhead, the clock
    cycle of the pipelined machine is smaller than
    the clock cycle of the unpipelined machine by a
    factor equal to the pipeline depth

9
  • Clock cycle pipelined = clock cycle unpipelined /
    pipeline depth
  • Pipeline depth = clock cycle unpipelined / clock
    cycle pipelined
  • This leads to the following:

10
  • Speedup from pipelining = (1 / (1 + pipeline
    stall cycles per instruction)) × (clock cycle
    unpipelined / clock cycle pipelined)
  • = (1 / (1 + pipeline stall cycles per
    instruction)) × pipeline depth
  • Again, if there are no stalls,
  • Speedup = pipeline depth

11
Stall on branch performance
  • Question: Estimate the impact on the clock cycles
    per instruction (CPI) of stalling on branches.
    Assume all other instructions have a CPI of 1
  • Page 189, gcc column: conditional branches are
    17% of the instructions
  • All other instructions have a CPI of 1
  • Each branch takes one extra clock cycle for the
    stall
  • CPI = 1 + 0.17 × 1 = 1.17
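
The arithmetic behind the 1.17 figure can be written out as a short Python sketch. The function and parameter names are assumptions made for illustration; the inputs (17% branches, one stall cycle per branch, base CPI of 1) are the ones stated on the slide.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles_per_branch, base_cpi=1.0):
    """Average CPI when every branch stalls the pipeline.

    Each instruction costs base_cpi cycles, and the fraction of instructions
    that are branches pays stall_cycles_per_branch extra cycles.
    """
    return base_cpi + branch_fraction * stall_cycles_per_branch


# Values from the slide: 17% conditional branches, 1 extra cycle each.
cpi = cpi_with_branch_stalls(0.17, 1)
print(round(cpi, 2))  # 1.17
```

Since the ideal CPI is 1, the same number is also the slowdown relative to a pipeline that never stalls on branches.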

12
[Figure: pipeline timing diagram for add, beq, and lw when stalling on the branch; time axis 2, 4, 6, 8]
13
Note
  • If we cannot resolve the branch in the second
    stage, the cost of stalling will be very high
  • Predict: if you are pretty sure you have the
    right formula, then go ahead and do the second
    load of laundry
  • If you are wrong, redo the load that was washed
    while guessing

14
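  • Computers use prediction to handle branches
  • Simple prediction: always predict that branches
    will fail (are not taken)
  • When the prediction is right, the pipeline runs
    at full speed
  • When the branch succeeds (is taken), the pipeline
    stalls
  • A pipeline stall is nicknamed a bubble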

15
Branch is not taken
[Figure: pipeline timing diagram for add, beq, and lw with the branch not taken; time axis 2, 4, 6, 8; stages: instruction fetch, register read, ALU operation, data access, register write]
16
When the branch is taken
[Figure: pipeline timing diagram for add, beq, lw, a row of bubbles, and or with the branch taken; time axis 2, 4, 6, 8]
17
Third approach
  • Called delayed decision; in computers it is
    called the delayed branch, and it is the
    solution used in the MIPS architecture
  • A delayed branch always executes the next
    sequential instruction, with the branch taking
    place after that one-instruction delay
  • It is hidden from the MIPS assembly language
    programmer

18
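The pipeline bubble has been replaced by add
[Figure: pipeline timing diagram for the delayed branch with beq, add, and lw; time axis 2, 4, 6, 8; stages: instruction fetch, register read, ALU operation, data access, register write]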
19
Data Hazards
  • In our laundry analogy, having to match left and
    right socks will stall the operation
  • Example: an add instruction followed by a
    subtract that uses the sum (s0)
  • add s0, t1, t2
    sub t2, s0, t3
  • add writes its result only in the 5th stage
  • So we would have to add 3 bubbles to the pipeline

20
  • We can try to rely on compilers to avoid this
    type of hazard, but most of the time this will
    fail
  • These hazards happen too often, and the delay is
    too long, to expect the compiler to rescue us
  • Primary solution: we do not need to wait for the
    instruction to complete
  • As soon as the ALU creates the sum for the add,
    supply it as an input for the subtract
  • This method is called forwarding or bypassing

21
Forwarding with two instructions
  • Valid only if the destination stage is later in
    time than the source stage
  • It will not prevent all stalls
  • For example, suppose the first instruction were a
    load of s0 instead of an add
  • The loaded data would arrive too late to be an
    input to the EX stage of the sub
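
The bubble counts discussed on the data hazard slides can be reproduced with a small timing model. This is a minimal Python sketch under stated assumptions, not the lecture's own method: stages are numbered IF=1 through WB=5, the consumer is an ALU instruction issued `distance` instructions after the producer, and without forwarding a register written in WB is assumed readable only in the following cycle (an assumption chosen because it matches the 3-bubble count given for add followed by sub). The function name `bubbles` is hypothetical.

```python
# Stage numbers for an instruction that enters the pipeline in cycle 1.
STAGE = {"IF": 1, "ID": 2, "EX": 3, "MEM": 4, "WB": 5}


def bubbles(producer_op, forwarding, distance=1):
    """Stall cycles needed between a producer and a dependent ALU consumer."""
    if forwarding:
        # The result is ready at the end of MEM for a load, EX otherwise,
        # and the consumer needs it at the start of its own EX stage.
        ready = STAGE["MEM"] if producer_op == "lw" else STAGE["EX"]
        need = STAGE["EX"] + distance
    else:
        # Assumption: the value is written in WB and can be read from the
        # register file only in a later cycle, in the consumer's ID stage.
        ready = STAGE["WB"]
        need = STAGE["ID"] + distance
    return max(0, ready + 1 - need)


print(bubbles("add", forwarding=False))  # 3
print(bubbles("add", forwarding=True))   # 0
print(bubbles("lw", forwarding=True))    # 1
```

Under these assumptions, forwarding removes the stall for add followed by sub, while a load followed by a dependent instruction still needs one bubble.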

22
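Graphical representation of the instruction pipeline
[Figure: add s0, t0, t1 passing through the stages IF, ID, EX, MEM, WB; time axis 2, 4, 6, 8]
Shading indicates that the element is used by the
instruction. MEM is white, which means add does not
access data memory. Shading on the right half means
the element is read in that stage; shading on the
left half means it is written in that stage.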
23
Graphical representation of forwarding
[Figure: add s0, t0, t1 followed by sub t2, s0, t3, each passing through the stages IF, ID, EX, MEM, WB; time axis 2, 4, 6, 8]
24
  • The connection shows the forwarding path from the
    output of the EX stage of add to the input of the
    EX stage of sub, replacing the value from
    register s0 read in the second stage of sub

25
Need a stall even with forwarding
[Figure: lw s0, 20(t1) followed by sub t2, s0, t3, each passing through the stages IF, ID, EX, MEM, WB; time axis 2, 4, 6, 8]
26
Reordering code to avoid pipeline stall
  • Find the hazard in this code, then reorder the
    instructions to avoid the stall

# reg t1 has the address of v[k]
lw t0, 0(t1)    # reg t0 (temp) = v[k]
lw t2, 4(t1)    # reg t2 = v[k+1]
sw t2, 0(t1)    # v[k] = reg t2
sw t0, 4(t1)    # v[k+1] = reg t0 (temp)
27
Answer
  • The hazard occurs on register t2 between the
    second lw and the first sw. Swapping the two sw
    instructions removes this hazard

# reg t1 has the address of v[k]
lw t0, 0(t1)    # reg t0 (temp) = v[k]
lw t2, 4(t1)    # reg t2 = v[k+1]
sw t0, 4(t1)    # v[k+1] = reg t0 (temp)
sw t2, 0(t1)    # v[k] = reg t2
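
The hazard in the exercise can also be found mechanically. The following Python sketch is a simplified illustration, not a full MIPS parser: it handles only the lw and sw forms used above and flags any instruction that reads a register loaded by the immediately preceding lw. The helper names `parse` and `load_use_hazards` are hypothetical.

```python
import re


def parse(line):
    """Return (op, registers written, registers read) for 'lw'/'sw' lines."""
    op, rest = line.split(None, 1)
    first, base = re.findall(r"[a-z]\w*", rest)  # e.g. 't0, 0(t1)' -> t0, t1
    if op == "lw":
        return op, {first}, {base}      # lw writes first, reads the base
    return op, set(), {first, base}     # sw reads the data and the base


def load_use_hazards(program):
    """List (index, register) pairs where an instruction uses a register
    loaded by the instruction directly before it."""
    hazards = []
    for i in range(1, len(program)):
        prev_op, prev_writes, _ = parse(program[i - 1])
        _, _, reads = parse(program[i])
        if prev_op == "lw":
            for reg in sorted(prev_writes & reads):
                hazards.append((i, reg))
    return hazards


original = ["lw t0, 0(t1)", "lw t2, 4(t1)", "sw t2, 0(t1)", "sw t0, 4(t1)"]
reordered = ["lw t0, 0(t1)", "lw t2, 4(t1)", "sw t0, 4(t1)", "sw t2, 0(t1)"]
print(load_use_hazards(original))   # [(2, 't2')]
print(load_use_hazards(reordered))  # []
```

The original order reports a hazard on t2 at the first sw, and the reordered version reports none, which agrees with the answer on the slide.
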
28
Hardware and software interface
  • Trade-off between compiler and hardware
    complexity: the original MIPS processors avoided
    hardware to stall the pipeline by requiring
    software to follow a load with an instruction
    independent of that load. Such loads are called
    delayed loads

29
Next class
  • We will do some problems
  • We will look at hazards from the DLX
    architecture point of view
  • After that we will study the datapath
  • Then come back and apply pipelining to the
    datapath