Computer Organization and Architecture, Chapter 6: Enhancing Performance with Pipelining



1
Computer Organization and Architecture
Chapter 6: Enhancing Performance with Pipelining
  • Yu-Lun Kuo
  • Computer Science and Information Engineering
  • Tunghai University, Taiwan
  • sscc6991_at_gmail.com

2
Review Single Cycle vs. Multiple Cycle Timing
3
How Can We Make It Even Faster?
  • Split the multiple instruction cycle into smaller
    and smaller steps
  • There is a point of diminishing returns where as
    much time is spent loading the state registers as
    doing the work
  • Pipelining
  • Multiple instructions are overlapped in execution
  • Key to making processors fast

4
Example Laundry
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes
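
As a concrete check of the laundry analogy, here is a minimal Python sketch using the stage times from this slide, under the usual simplifying assumption that a load can wait between stages; for four loads it gives 360 minutes sequential vs. 210 minutes pipelined.

    # Laundry pipelining sketch: wash 30 min, dry 40 min, fold 20 min (from the slide).
    STAGES = [("wash", 30), ("dry", 40), ("fold", 20)]

    def sequential_time(loads):
        # Each load runs start-to-finish before the next one begins.
        return loads * sum(t for _, t in STAGES)

    def pipelined_time(loads):
        # A new load enters a stage as soon as the previous load leaves it,
        # so completions are paced by the slowest stage (the dryer).
        slowest = max(t for _, t in STAGES)
        return sum(t for _, t in STAGES) + (loads - 1) * slowest

    print(sequential_time(4))   # 360 minutes sequentially
    print(pipelined_time(4))    # 210 minutes pipelined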

5
Sequential Laundry
6
Pipelined Laundry
7
Example Laundry
8
MIPS Instructions (p.371)
  • Classically take five steps
  • Fetch instruction from (instruction) memory (IF)
  • Read registers while decoding the instruction (ID)
  • Execute the operation or calculate an address
    (EX)
  • Access an operand in data memory (MEM)
  • Write the result into a register (WB)
  • Five stages

9
The schematic view
  • IF uses the (instruction) memory
  • ID uses the register file
  • EX uses the ALU
  • MEM uses the (data) memory
  • WB uses the register file
Very important to remember the content of this
slide
10
A Pipelined MIPS Processor
  • Start the next instruction before the current one
    has completed
  • Improves throughput
  • Total amount of work done in a given time

[Pipeline diagram: lw, sw, and an R-type instruction overlapped across Cycles 1 through 8, each entering its Decode stage one cycle after the previous instruction]
  • clock cycle (pipeline stage time) is limited by
    the slowest stage
  • for some instructions, some stages are wasted
    cycles

11
Single Cycle vs. Multiple Cycle vs. Pipeline
Multiple Cycle Implementation
12
Pipelined Execution Representation
13
Single-cycle vs. Pipelined Performance (p.372)
  • Single-cycle (non-pipelined)
  • Must allow for the slowest instruction (lw)
  • The clock cycle required for every instruction is
    800 ps
  • The time between the first and fourth
    instructions in the non-pipelined design:
    3 × 800 ps = 2400 ps

14
Figure 6.3
15
Single-cycle vs. Pipelined Performance (p.372)
  • Pipelined
  • All the pipeline stages take a single clock cycle
  • The clock cycle must be long enough to
    accommodate the slowest operation
  • So the clock cycle must be the worst-case stage
    time of 200 ps
  • The time between the first and fourth
    instructions: 3 × 200 ps = 600 ps
  • Total time for the three instructions:
    600 ps + 4 × 200 ps = 1400 ps

16
Pipelining Speedup (p.374)
  • Under ideal conditions and with a large number of
    instructions
  • The speedup from pipelining is approximately
    equal to the number of pipeline stages
  • Five-stage pipeline is nearly five times faster
  • The above example?
  • Pipelined time: 1400 ps
  • Non-pipelined time: 2400 ps
  • The ideal speedup is not reflected in the total
    execution time for only three instructions

17
Pipelining Speedup (p.374)
  • Pipelining involves some overhead
  • The sources of which will become clear shortly
  • Thus, the time per instruction in the pipelined
    processor will exceed the minimum possible
  • The speedup will be less than the number of
    pipeline stages
  • Also, in the example above the number of
    instructions is not large
  • If we increase the number of instructions
  • Add 1,000,000 instructions: the speedup approaches
    the ratio of the clock cycles (800 ps / 200 ps),
    as the sketch below shows
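
A quick back-of-the-envelope check (Python), using the 800 ps and 200 ps cycle times from the earlier example and the 1,000,000 extra instructions suggested on this slide:

    # Single-cycle vs. pipelined execution time for N instructions (5-stage pipeline).
    SINGLE_CYCLE = 800   # ps per instruction (set by the slowest instruction, lw)
    PIPE_CYCLE   = 200   # ps per pipeline stage (worst-case stage time)
    STAGES       = 5

    def nonpipelined(n):
        return n * SINGLE_CYCLE

    def pipelined(n):
        # The first instruction takes all 5 stages; each later one completes 200 ps after it.
        return STAGES * PIPE_CYCLE + (n - 1) * PIPE_CYCLE

    for n in (3, 1_000_003):
        print(n, nonpipelined(n), pipelined(n),
              round(nonpipelined(n) / pipelined(n), 2))
    # 3 instructions:         2400 ps vs. 1400 ps  -> speedup ~1.71
    # 1,000,003 instructions: ~8e8 ps vs. ~2e8 ps  -> speedup ~4.00 (the 800/200 cycle-time ratio)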

18
Pipeline Hazards (p.375)
  • Pipeline hazards
  • When the next instruction cannot execute in the
    following clock cycle
  • Three different types
  • Structural hazards
  • What if we had only one memory?
  • Data hazards
  • What if an instruction's input operands depend on
    the output of a previous instruction?
  • Control hazards
  • What about branches?

19
Structural Hazards (1/2) (p.375)
  • The hardware cannot support the combination of
    instructions that we want to execute in the same
    clock cycle
  • Hardware resources are not enough!
  • Ex. the laundry room
  • A combined washer-dryer vs. a separate washer and
    dryer

20
Structural Hazard (2/2) (p.375)
  • Suppose we had a single memory instead of two
    memories
  • If the pipeline in Figure 6.3 had a fourth
    instruction, then in the same clock cycle
  • The first instruction would be accessing data from
    memory
  • While the fourth instruction is fetching an
    instruction from the same memory
  • Without two memories, the pipeline could have a
    structural hazard

21
Structural Hazard: Single Memory
[Pipeline diagram (time in clock cycles vs. instruction order): lw followed by Inst 1 through Inst 4, all sharing a single memory]
22
Data Hazard (p.376)
  • The planned instruction cannot execute in the
    proper clock cycle
  • Because data that is needed to execute the
    instruction is not yet available
  • The pipeline must be stalled (Bubble)
  • Because one step must wait for another to
    complete
  • Ex. add $s0, $t0, $t1
  •     sub $t2, $s0, $t3
  • Have to add three bubbles to the pipeline (see the
    detection sketch below)
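
A minimal sketch of how this read-after-write dependence can be spotted (Python; the register-name encoding is illustrative, not the datapath's actual signal format):

    # Detect a read-after-write (RAW) data hazard between two adjacent instructions.
    def raw_hazard(producer_dest, consumer_sources):
        """True if the later instruction reads a register the earlier one writes."""
        return producer_dest in consumer_sources

    # add $s0, $t0, $t1   writes $s0
    # sub $t2, $s0, $t3   reads  $s0  -> hazard; without forwarding, bubbles are needed
    print(raw_hazard("$s0", ("$s0", "$t3")))   # True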

23
How About Register File Access?
[Pipeline diagram (time in clock cycles): add $1, ... followed several instructions later by add $2, $1, ...; when does the register file write of $1 happen relative to the read?]
24
How About Register File Access?
Fix the register file access hazard by doing register writes in the first half of the clock cycle and reads in the second half.
[Pipeline diagram: add $1, ... followed by add $2, $1, ...; the dependent read occurs in the second half of the same cycle as the write]
25
Register Usage Can Cause Data Hazards
  • Dependencies backward in time cause hazards

  add $1, ...
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5
  • Read before write data hazard

26
Register Usage Can Cause Data Hazards
  • Dependencies backward in time cause hazards

  add $1, ...
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5
  • Read before write data hazard

27
Loads Can Cause Data Hazards
  • Dependencies backward in time cause hazards

  lw  $1, 4($2)
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5
  • Load-use data hazard

28
One Way to Fix a Data Hazard
A data hazard can be fixed by waiting (a stall), but stalls impact CPI.
[Pipeline diagram: add $1, ... followed by a dependent instruction held back by stall bubbles]
29
Another Way to Fix a Data Hazard
Fix data hazards by forwarding results as soon as
they are available to where they are needed
  add $1, ...
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5
30
Forwarding (p.376)
  • Also called bypassing
  • Resolving a data hazard by retrieving the missing
    data element from internal buffers
  • Ex. lw $s0, 20($t1)
  •     sub $t2, $s0, $t3

Still need one stall
31
Another Way to Fix a Data Hazard
Fix data hazards by forwarding results as soon as
they are available to where they are needed
  add $1, ...
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5
32
Another Way to Fix a Data Hazard
Fix data hazards by forwarding results as soon as
they are available to where they are needed
  add $1, ...
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5
33
Forwarding with Load-use Data Hazards
  lw  $1, 4($2)
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5
34
Forwarding with Load-use Data Hazards
  lw  $1, 4($2)
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5
  • Will still need one stall cycle even with
    forwarding

35
Control Hazard (1/2) (p.379)
  • Also called a branch hazard
  • We need to make a decision based on the results of
    one instruction while other instructions are still
    executing
  • Solution 1: stall (bubble)
  • If the instruction is a branch → stall first
  • Put in enough extra hardware so that we can test
    registers, calculate the branch address, and
    update the PC during the second stage of the
    pipeline
  • Still slow, and the hardware cost is high

36
(No Transcript)
37
Branch Instructions Cause Control Hazards
  • Dependencies backward in time cause hazards

[Pipeline diagram (instruction order): the instructions after the branch (lw, Inst 3, Inst 4) depend backward in time on the branch outcome]
38
One Way to Fix a Control Hazard
Fix the branch hazard by waiting (stall), but this affects CPI.
[Pipeline diagram: beq followed by stalled instructions]
39
Control Hazard (2/2) (p.381)
  • Solution 2: predict
  • Always predict that branches will be untaken
  • When you're right → the pipeline proceeds at full
    speed
  • Do not jump to the branch target address
  • Only when branches are taken → the pipeline stalls
  • Need to add hardware for flushing instructions if
    we are wrong

40
(No Transcript)
41
Branch Prediction (1/2)
  • A method of resolving a branch hazard that
  • Assumes a given outcome for the branch and
    proceeds from that assumption rather than waiting
    to ascertain the actual outcome
  • Dynamic prediction of branches
  • Keep a history for each branch as taken or
    untaken
  • Use the recent past behavior to predict the
    future
  • Correctly predicts branches with over 90% accuracy

42
Branch Prediction (2/2)
  • If the prediction is wrong
  • Pipeline control must ensure that the instructions
    following the wrongly guessed branch have no
    effect
  • Restart the pipeline from the proper branch
    address
  • Keeping the history
  • Branch history table
  • Branch prediction buffer

43
Pipeline Hazards Illustrated
44
Pipelined Datapath
  • IF: Instruction fetch
  • ID: Instruction decode and register file read
  • EX: Execution or address calculation
  • MEM: Data memory access
  • WB: Write back

45
(No Transcript)
46
Pipeline Execution (p.387)
  • Assume
  • Register file is written in the first half of the
    clock cycle
  • Register file is read during the second half

47
Pipeline Register
  • Need to preserve the destination register address
    in the pipeline state registers

48
Five stages of lw (1/3) (p.388)
  • Instruction fetch
  • Read memory using the address in the PC
  • The instruction is placed in the IF/ID pipeline
    register
  • PC is incremented: PC = PC + 4 (ready for the next
    clock cycle)
  • Instruction decode and register file read
  • The IF/ID pipeline register supplies the 16-bit
    immediate field
  • Which is sign-extended to 32 bits
  • And the register numbers to read the two registers
  • All values are stored in the ID/EX pipeline
    register

49
Five stages of lw (2/3)
  • Execute and address calculation
  • Read the contents of register 1 and the
    sign-extended immediate from the ID/EX pipeline
    register
  • Add them using the ALU
  • The sum is placed in the EX/MEM pipeline register
  • Memory access
  • Read the data memory using the address from the
    EX/MEM pipeline register
  • Load the data into the MEM/WB pipeline register

50
Five stages of lw (3/3)
  • Write back
  • Final step
  • Reading the data from the MEM/WB pipeline
    register
  • Writing it into the register file
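
To tie the three preceding slides together, a minimal Python walk-through of a single lw moving through the four pipeline registers; the dictionary field names are simplified stand-ins for the real datapath signals:

    # One lw stepping through the IF/ID, ID/EX, EX/MEM, and MEM/WB registers.
    reg_file = {"$t1": 100, "$s0": 0}   # register file
    data_mem = {120: 42}                # data memory: word stored at address 100 + 20

    # IF: fetch the instruction, PC <- PC + 4; the instruction lands in IF/ID
    if_id  = {"instr": "lw $s0, 20($t1)", "pc_plus_4": 4}
    # ID: read $t1 and sign-extend the 16-bit immediate; values land in ID/EX
    id_ex  = {"reg1": reg_file["$t1"], "imm": 20, "rt": "$s0"}
    # EX: the ALU adds base + offset; the sum lands in EX/MEM
    ex_mem = {"alu": id_ex["reg1"] + id_ex["imm"], "rt": id_ex["rt"]}
    # MEM: read data memory at the ALU result; the word lands in MEM/WB
    mem_wb = {"data": data_mem[ex_mem["alu"]], "rt": ex_mem["rt"]}
    # WB: write the loaded word back into the register file
    reg_file[mem_wb["rt"]] = mem_wb["data"]

    print(reg_file["$s0"])   # 42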

51
[Five stages of lw: Chinese-language recap of the three preceding slides]

52
Figure 6.12 IF/ID
53
Figure 6.12 EX
54
Figure 6.13 EX
55
Figure 6.14 MEM
56
Figure 6.14 WB
57
Multiple-Clock-Cycle Pipeline (p.397)
[Multiple-clock-cycle pipeline diagram: Inst 0 through Inst 4, in instruction order]
58
6.3 Pipeline Control (1/2)
59
Pipeline Control (2/2) (p.403)
60
Control Settings
61
6.4 Data Hazards and Forwarding (p.400)
  • Starting the next instruction before the first
    one has finished
  • Dependencies go backward in time

62
Data Dependence Detection (p.406)
  • Hazard conditions
  • 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs = $2
    (sub → and)
  • 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
  • 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
    (sub → or)
  • 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
  • rs, rt: source registers
  • rd: destination register
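
A minimal Python rendering of these forwarding conditions for ALU input A (conditions 1a and 2a above), with the RegWrite guard discussed a few slides later; the record layout is illustrative:

    # Forwarding-unit conditions 1a/2a, guarded by RegWrite (and rd != 0).
    # ALU input B is handled analogously with ID/EX.RegisterRt (conditions 1b/2b).
    def forward_a(id_ex_rs, ex_mem, mem_wb):
        """Return which value feeds ALU input A: 'EX/MEM', 'MEM/WB', or 'ID/EX'."""
        if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
            return "EX/MEM"            # condition 1a: forward the just-computed ALU result
        if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
            return "MEM/WB"            # condition 2a: forward the older result
        return "ID/EX"                 # no hazard: use the register-file value

    # sub $2, $1, $3 followed by and $12, $2, $5: rs of the and is $2
    print(forward_a(2, {"RegWrite": True, "Rd": 2},
                       {"RegWrite": False, "Rd": 0}))   # 'EX/MEM'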

63
Ex. Data Hazard (on r1)
64
Dependencies backwards in time are hazards
65
Forward result from one stage to another
66
Dependence Detection (p.407)
  • Some instructions do not write a register
  • The scheme above would forward when it is
    unnecessary
  • So check whether the RegWrite signal will be
    active
  • Examine the WB control field of the pipeline
    register (EX/MEM) to determine this
  • The forwarded value comes from a pipeline register
  • Rather than waiting for the WB stage to write the
    register file
  • The pipeline registers hold the data to be
    forwarded
  • The instructions that forward here are the four
    R-format instructions: add, sub, and, or

67
6.5 Data Hazards and Stalls
  • Can't always forward
  • When an instruction tries to read a register
    following a load instruction that writes the same
    register
  • This is called a load-use data hazard

68
Stalling (p.413)
  • nop (which acts like a bubble)
  • An instruction that does no operation and changes
    no state
  • We can stall the pipeline by keeping an
    instruction in the same stage (see the detection
    sketch below)
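
A minimal sketch of the load-use check that forces such a stall (Python); the condition mirrors the textbook's hazard detection unit, with illustrative field names:

    # Hazard detection: stall when a load's destination is needed by the next instruction.
    def must_stall(id_ex, if_id):
        """True if the instruction now in ID needs the register a load in EX will produce."""
        return (id_ex["MemRead"] and
                id_ex["Rt"] in (if_id["Rs"], if_id["Rt"]))

    # lw $1, 4($2) in EX, sub $4, $1, $5 in ID -> one bubble needed even with forwarding
    print(must_stall({"MemRead": True, "Rt": 1}, {"Rs": 1, "Rt": 5}))   # True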

69
Corrected Datapath to Save RegWrite Addr
  • Need to preserve the destination register address
    in the pipeline state registers

70
Corrected Datapath to Save RegWrite Addr
  • Need to preserve the destination register address
    in the pipeline state registers

71
MIPS Pipeline Control Path Modifications
  • All control signals can be determined during
    Decode
  • and held in the state registers between pipeline
    stages

72
6.6 Branch Hazards
  • An instruction must be fetched at every clock
    cycle
  • But the branch decision is not made until the MEM
    pipeline stage
  • The delay in determining the proper instruction to
    fetch is called a
  • Branch hazard (control hazard)
  • Relatively simple to understand
  • Occurs less frequently than data hazards

73
Single Memory is a Structural Hazard
74
Handling Branch Hazard
  • Predict branch always not taken (p.418)
  • Reduce the delay of taken branches (p.418)
  • Dynamic branch prediction (p.421)

75
Predict Branch Not Taken (p.418)
  • Stalling until the branch completes is too slow
  • Assume instead
  • The branch will not be taken
  • Continue execution down the sequential path
  • If the branch is taken
  • The instructions that are being fetched and
    decoded must be discarded (flushed)
  • Execution continues at the branch target
  • Branches are untaken about half the time, and it
    costs little to discard the instructions

76
Pipeline on the Branch (p.417)
77
Reduce delay of branch
  • Reduce the delay of branches by moving branch
    execution earlier in the pipeline
  • Fewer instructions then need to be flushed
  • We already have the PC value and the immediate
    field in the IF/ID pipeline register
  • Move the branch adder from the EX stage to the ID
    stage
  • Only one instruction, the one in the IF stage,
    must be flushed
  • Add a control signal, IF.Flush, that zeroes the
    instruction field
  • Making the instruction a nop

78
Handling Branches
79
Flushing with Misprediction (Not Taken)
4:  beq $1, $2, 2
8:  sub $4, $1, $5
  • To flush the IF-stage instruction, assert IF.Flush
    to zero the instruction field of the IF/ID
    pipeline register (transforming it into a nop)

80
Reducing the Delay of Branches
81
Dynamic Branch Prediction (p.421)
  • Dynamic branch prediction
  • Look up the address of the instruction
  • To see if the branch was taken the last time
  • If so, fetch new instructions from the same place
    as the last time
  • Branch prediction buffer
  • Also called a branch history table
  • A small memory indexed by the lower portion of
    the address of the branch instruction
  • Contains a bit that says whether the branch was
    recently taken or not

82
Dynamic Branch Prediction (contd)
  • A prediction is just a hint that is assumed to be
    correct
  • Fetching begins in the predicted direction
  • If the hint turns out to be wrong
  • The incorrectly predicted instructions are deleted
  • The prediction bit is inverted and stored back
  • The proper sequence is fetched and executed

83
Loops and Prediction ex.(p.421)
  • A loop branch that is taken 9 times in a row,
    then not taken once
  • What is the prediction accuracy for this branch?
  • (Assume the prediction bit stays in the prediction
    buffer between executions of the loop)
  • Mispredict on the first and last loop iterations
  • The last is inevitable, since the bit will say
    taken
  • The first, because the bit was flipped on the
    final (not-taken) iteration of the previous
    execution of the loop
  • The branch is taken 90% of the time (9/(9+1))
  • But the prediction accuracy is only 80%
    (2 incorrect out of 10: 8/(8+2))
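
A minimal Python simulation of this example: a 1-bit predictor on a branch taken 9 times then not taken, with the prediction bit carried over from the previous execution of the loop:

    # 1-bit branch predictor on a loop branch: taken 9 times, then not taken once.
    outcomes = ([True] * 9 + [False]) * 2   # two back-to-back executions of the loop
    bit = False      # the fall-out of an earlier execution left the bit saying "not taken"
    correct = 0
    for taken in outcomes:
        correct += (bit == taken)
        bit = taken                          # 1-bit scheme: remember only the last outcome
    print(correct / len(outcomes))           # 0.8 -> 80% accuracy (first and last iterations miss)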

84
2-bit Prediction (p.422)
  • A 2-bit scheme, where we change the prediction
    only if we mispredict twice in a row

85
2-bit Predictors
  • A 2-bit scheme can give 90% accuracy, since a
    prediction must be wrong twice before the
    prediction is changed

[2-bit predictor state diagram: states 11 and 10 predict taken, states 01 and 00 predict not taken; each Taken outcome moves the state toward 11 and each Not-taken outcome moves it toward 00. For the loop example: right 9 times, wrong on the loop fall-out, right again on the 1st iteration of the next execution]
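
The same loop branch run through a 2-bit saturating counter (a minimal Python sketch of the state machine summarized above):

    # 2-bit saturating counter: states 0,1 predict not taken; states 2,3 predict taken.
    def run(outcomes, state=3):
        correct = 0
        for taken in outcomes:
            correct += ((state >= 2) == taken)
            # Move toward 3 on taken, toward 0 on not taken, saturating at the ends.
            state = min(state + 1, 3) if taken else max(state - 1, 0)
        return correct / len(outcomes)

    loop = ([True] * 9 + [False]) * 2        # same loop branch as the 1-bit example
    print(run(loop))                          # 0.9 -> only the loop fall-out is mispredicted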
86
Delayed Branch (p.423)
  • Compilers and assemblers try to place an
    instruction that always executes into the branch
    delay slot
  • Branch delay slot
  • The slot directly after a delayed branch
    instruction
  • Filled by an instruction that does not affect the
    branch

87
Scheduling Branch Delay Slots
A. From before the branch
     add $1, $2, $3
     if $2 = 0 then → delay slot (filled with the add)

B. From the branch target
     sub $4, $5, $6
     ...
     add $1, $2, $3
     if $1 = 0 then → delay slot (filled with the sub)

C. From the fall-through path
     add $1, $2, $3
     if $1 = 0 then → delay slot (filled with the sub)
     sub $4, $5, $6
  • A is the best choice, fills delay slot and
    reduces IC
  • In B and C, the sub instruction may need to be
    copied, increasing IC
  • In B and C, must be okay to execute sub when
    branch fails

88
6.8 Exceptions (p.432)
  • We wouldn't want this invalid value to
    contaminate other registers or memory locations
  • The hardware normally stops the offending
    instruction in midstream
  • The difficulty of always associating the correct
    exception with the correct instruction in a
    pipelined computer
  • Has led computer designers to relax this
    requirement in noncritical cases
  • Imprecise interrupts
  • Imprecise exceptions

89
Exceptions
  • Steps to handle exceptions
  • Flush the instructions in the IF, ID, and EX
    stages
  • Let all preceding instructions complete if they
    can
  • Save the restart PC
  • Call the OS to handle the exception
  • Return to the user code
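
A minimal Python sketch of these steps on a toy pipeline state; the EPC/Cause registers and the 0x8000_0180 handler address follow MIPS conventions, but the data structure itself is illustrative:

    # Handling an exception raised by the instruction currently in EX.
    HANDLER_PC = 0x8000_0180        # MIPS exception handler address

    def take_exception(pipe, cause):
        # 1. Flush the instructions in the IF, ID, and EX stages.
        for stage in ("IF", "ID", "EX"):
            pipe[stage] = "nop"
        # (Instructions already in MEM and WB are allowed to complete.)
        # 2. Save the restart PC and the cause, then jump to the OS handler.
        pipe["EPC"] = pipe["PC"]
        pipe["Cause"] = cause
        pipe["PC"] = HANDLER_PC
        return pipe
        # 3. After handling, the OS returns to EPC to resume the user code.

    state = {"PC": 0x4C, "IF": "lw", "ID": "or", "EX": "add (overflow)",
             "MEM": "and", "WB": "sub"}
    print(take_exception(state, cause="overflow"))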

90
6.9 Advanced Pipelining
  • Instruction-level parallelism (ILP)
  • Increase the pipeline depth to overlap more
    instructions
  • e.g., move from a four-stage to a six-stage
    pipeline
  • Advantage: higher clock rate
  • Disadvantage: longer load and branch delays
  • This is superpipelining
  • Or replicate the internal components (hardware)
  • So we can launch multiple instructions in every
    pipeline stage (multiple issue)
  • The execution rate can then exceed the clock rate
    (CPI < 1)
  • This is superscalar

91
SuperScalar
  • Superscalar MIPS: issue 2 instructions per cycle
  • One instruction is an ALU or branch operation; the
    other can be a load or a store
  • Requires more hardware resources
  • Two more read ports and one more write port on the
    register file
  • One more ALU

92
Two major ways of doing multiple issue
  • Division of work between the compiler and the
    hardware
  • Static multiple issue
  • Many decisions are made by the compiler before
    execution
  • Dynamic multiple issue
  • Many decisions are made during execution by the
    processor

93
Very Long Instruction Word (VLIW) (p.435)
  • VLIW
  • The set of instructions that issues together in
    one clock cycle
  • Is called an issue packet
  • It can be viewed as one large instruction with
    multiple operations

94
Different Pipelined Designs