Title: Lecture 13 Instruction Execution Pipeline
2Lecture 13 Instruction Execution Pipeline
- In this lecture, we will study
- Principle of pipeline
- Characteristics of pipeline
- Number of pipeline stages and the performance
- Delays of pipeline stages and the performance
- Instruction execution steps in RISC-S
- 5-stage instruction execution pipeline for RISC-S
- Ideal pipeline
- Hazards
- Improving RISC-S pipeline for hazards
3Car Wash Station
- Car wash stations
- 1 S(Spray water)
- 2 W(Wash with detergent and brush)
- 3 R(Rinse)
- 4 B(Blow dry)
- Each stage takes 1 minute(identical delay)
1st car: S W R B
2nd car:         S W R B
3rd car:                 S W R B
. . .
- To improve the profit
- Improve the speed of the wash stations: an expensive solution
- Improve the throughput with parallel wash stations: an expensive solution
- Improve the effective wash time with a pipeline: a less expensive solution
4Pipeline Principle
Ordinary car wash station: 1 car/4 min

Parallel car wash station: 4 cars/4 min
1st car: S W R B
2nd car: S W R B
3rd car: S W R B
4th car: S W R B
5th car:         S W R B
. . .

Pipeline car wash station: 1 car/1 min
1st car: S W R B
2nd car:   S W R B
3rd car:     S W R B
4th car:       S W R B
5th car:         S W R B
. . .
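The three car wash organizations can be compared with a short calculation. This is a minimal sketch, assuming 4 stages of 1 minute each and 4 parallel stations as in the slides; the function name `total_minutes` and its parameters are made up for illustration.

```python
def total_minutes(cars, mode, stages=4, stage_minutes=1, stations=4):
    """Total time to wash `cars` cars under one of the three organizations."""
    if cars == 0:
        return 0
    per_car = stages * stage_minutes
    if mode == "ordinary":
        # One car at a time goes through all four steps.
        return cars * per_car
    if mode == "parallel":
        # `stations` complete wash stations work side by side.
        batches = -(-cars // stations)  # ceiling division
        return batches * per_car
    if mode == "pipeline":
        # The first car needs `stages` cycles, every later car one more cycle.
        return (stages + cars - 1) * stage_minutes
    raise ValueError(f"unknown mode: {mode}")


for mode in ("ordinary", "parallel", "pipeline"):
    print(mode, total_minutes(100, mode))
# ordinary 400
# parallel 100
# pipeline 103
```

For many cars the pipeline approaches the throughput of four parallel stations while using the hardware of a single station.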
5Pipeline Terminology
- Pipeline Stage
- A pipeline consists of a finite number of pipeline stages
- Pipeline Cycle
- The delay of a pipeline stage is called the pipeline cycle
- Delays of the pipeline stages are not necessarily identical in practice, which complicates control
- The pipeline cycle can be made equal to the longest pipeline stage delay, at the cost of performance (a longer pipeline cycle time)
- Pipeline Latency
- Time from the beginning of a task to the completion of the task
- Ideal Pipeline
- Delays of the pipeline stages are identical (one pipeline cycle)
- All the pipeline stages are occupied with tasks to be executed
- Simple to control and provides the best performance: 1 instruction/cycle
6Pipeline Characteristics
I0 I1 I2 I3 I4 I5 I6 I7 . . . In-1 In
   I0 I1 I2 I3 I4 I5 I6 . . . In-2 In-1 In
      I0 I1 I2 I3 I4 I5 . . . In-3 In-2 In-1 In
         I0 I1 I2 I3 I4 . . . In-4 In-3 In-2 In-1 In
- Assuming that there are plenty of tasks (instructions) to be executed
- All of the pipeline stages are busy most of the time
- Pipeline Filling
- At the initial phase of the execution, pipeline stages are not fully occupied with tasks
- For an n-stage pipeline, the first (n-1) pipeline cycles are filling time
- Pipeline Draining
- At the final phase of the execution, pipeline stages are not fully occupied with tasks
- For an n-stage pipeline, the last (n-1) pipeline cycles are draining time
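The filling and draining phases can be made visible by counting how many stages are busy in each cycle. The sketch below assumes an ideal pipeline in which task i enters the first stage in cycle i; the helper name `busy_stages_per_cycle` is hypothetical.

```python
def busy_stages_per_cycle(n_stages, n_tasks):
    """Number of occupied stages in every cycle of an ideal pipeline."""
    total_cycles = n_stages + n_tasks - 1
    busy = []
    for cycle in range(total_cycles):
        # Task i sits in stage (cycle - i), valid while 0 <= cycle - i < n_stages.
        oldest = max(0, cycle - n_stages + 1)
        newest = min(cycle, n_tasks - 1)
        busy.append(newest - oldest + 1)
    return busy


print(busy_stages_per_cycle(4, 8))
# [1, 2, 3, 4, 4, 4, 4, 4, 3, 2, 1]
```

With 4 stages, the first 3 cycles (filling) and the last 3 cycles (draining) are the only ones in which the pipeline is not full.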
7Number of Pipeline Stages
- Comparison of car wash stations with a 4-stage (S, W, R, B) and a 2-stage (SW, RB) pipeline, with identical pipeline latency (4 minutes)
- 4-stage pipeline with a 1 minute pipeline cycle
- 2-stage pipeline with a 2 minute pipeline cycle
The more pipeline stages, the better the performance
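The comparison can be checked numerically. This is a rough sketch that assumes an ideal pipeline, so that k cars need (n + k - 1) pipeline cycles on n stages; `pipeline_minutes` is an illustrative name.

```python
def pipeline_minutes(cars, stages, cycle_minutes):
    """Time for `cars` cars on an ideal pipeline with identical stage delays."""
    return (stages + cars - 1) * cycle_minutes


four_stage = pipeline_minutes(100, stages=4, cycle_minutes=1)
two_stage = pipeline_minutes(100, stages=2, cycle_minutes=2)
print(four_stage, two_stage)  # 103 202
```

Both pipelines have a latency of 4 minutes for a single car, but the 4-stage pipeline finishes one car per minute while the 2-stage pipeline finishes one car every 2 minutes.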
8Delay of Pipeline Stages
- Comparison of 4-stage car wash stations with different pipeline stage delays
- Identical 1 minute delay for every stage
- S(0.5 min) - W(1.5 min) - R(0.5 min) - B(1.5 min) pipeline
Identical pipeline stage delays show better performance
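A small calculation shows why unequal delays hurt. The sketch assumes that the pipeline cycle is set to the longest stage delay, as described under Pipeline Terminology; the function name is illustrative.

```python
def pipeline_minutes_unequal(cars, stage_delays):
    """Total time when the pipeline cycle equals the slowest stage delay."""
    cycle = max(stage_delays)
    return (len(stage_delays) + cars - 1) * cycle


identical = pipeline_minutes_unequal(100, [1, 1, 1, 1])
unequal = pipeline_minutes_unequal(100, [0.5, 1.5, 0.5, 1.5])
print(identical, unequal)  # 103 154.5
```

Both pipelines do 4 minutes of work per car, but the unequal one is clocked at 1.5 minutes per cycle, so it delivers one car every 1.5 minutes instead of one per minute.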
9Instruction Execution Steps
10Instruction Execution Pipeline: RISC-S
- A 5-stage pipeline
- IF-DR-A-M-SR pipeline
- For the instruction execution pipeline, information has to be passed to the succeeding pipeline stage
- Inter-stage buffers made of latches are needed
- I/D buffer, D/A buffer, A/M buffer, M/S buffer
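The role of the inter-stage buffers can be sketched as a shift of instructions from stage to stage on every clock. In this simplified model (an assumption, not the RISC-S hardware) each list slot stands for a stage, and the hand-off between neighbouring slots stands in for the I/D, D/A, A/M, and M/S buffers; `run` and the instruction labels are made up.

```python
STAGES = ["IF", "DR", "A", "M", "SR"]


def run(program):
    """Return, for every cycle, which instruction occupies which stage."""
    slots = [None] * len(STAGES)   # slots[i] = instruction in STAGES[i]
    pending = list(program)
    trace = []
    while pending or any(s is not None for s in slots):
        # Clock edge: every instruction moves one stage to the right,
        # and the next instruction (if any) enters IF.
        fetched = pending.pop(0) if pending else None
        slots = [fetched] + slots[:-1]
        if any(s is not None for s in slots):
            trace.append(dict(zip(STAGES, slots)))
    return trace


for cycle, state in enumerate(run(["I0", "I1", "I2"]), start=1):
    print(cycle, state)
```

Three instructions take 5 + 3 - 1 = 7 cycles; in cycle 3 the trace shows I2 in IF, I1 in DR, and I0 in A.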
11IF Stage
Instruction Fetch and update PC stage
12DR Stage
- Instruction decoding and register read stage
- OP <- OP-code
- A(Rs1) <- R[IR14..18]
- t <- IR13
- B(Rs2) <- R[IR0..4]
- D(S2) <- (IR12)^19 ## IR0..12
- C(Cond) <- IR19..22
- cc(SCC) <- IR24
- (NPC <- NPC)
13A Stage
- ALU operations using operands, effective address computation, and condition test for conditional branches
- Memory reference instruction (t=1): AO <- NPC + D(imm32)
- LD instruction: C <- C
- Functional instruction (t=0): AO <- A op B
- C <- C
- Control instruction: AO <- NPC + D(imm32)
- T <- (flag(C) op 0)
- (OP <- OP)
- (NPC <- NPC)
14M Stage
- Memory access for read and write, and decision of the final PC value for branch instructions
- LD: DATA <- M[AO]
- ST: M[AO] <- B
- Functional instruction: AO <- AO
- Branch instruction: if T=0, PC <- AO
- if T=1, PC <- NPC
- (OP <- OP)
- (C <- C)
15SR Stage
- Store the result of the operation in a register for a functional instruction, and store the data read from memory in a register for a load instruction
- Functional instruction: R[C] <- AO
- LD: R[C] <- DATA
17Ideal Pipeline
- Ideal Pipeline
- Delays of the pipeline stages are identical (one pipeline cycle)
- All the pipeline stages are occupied with tasks, except during the filling time and draining time
- One task completes every pipeline cycle after the filling time
- Hazards prevent a pipeline from operating as an ideal pipeline even when the delays of the pipeline stages are identical
- Structural Hazard
- Data Hazard
- Control Hazard
18Structural Hazard
- Cases when Structural Hazards take place
- More than one instruction requires the same pipeline stage in the same clock cycle
- This never happens when the delays of the pipeline stages are identical
- More than one pipeline stage tries to use the same hardware resource in the same clock cycle
- IF and A stages: operation with the adder
- DR and SR stages: access to the register file
- IF and M stages: access to memory
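The resource conflicts listed above can be found mechanically. The sketch below assumes a full pipeline, where all five stages are active in the same cycle, so any resource used by two stages is a structural hazard; the `USES` table and resource names are an illustrative encoding of the three cases.

```python
STAGES = ["IF", "DR", "A", "M", "SR"]

# Hardware used by each stage before any hardware fix (illustrative encoding).
USES = {
    "IF": {"adder", "memory"},
    "DR": {"register file"},
    "A": {"adder"},
    "M": {"memory"},
    "SR": {"register file"},
}


def structural_conflicts(uses):
    """Map each shared resource to the stage pairs that compete for it."""
    conflicts = {}
    for i, first in enumerate(STAGES):
        for second in STAGES[i + 1:]:
            for resource in uses[first] & uses[second]:
                conflicts.setdefault(resource, []).append((first, second))
    return conflicts


for resource, pairs in sorted(structural_conflicts(USES).items()):
    print(resource, pairs)
# adder [('IF', 'A')]
# memory [('IF', 'M')]
# register file [('DR', 'SR')]
```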
19Example Structural Hazard
- Structural Hazard due to Adder: IF and A stage in the same cycle
- Structural Hazard due to Register
- Structural Hazard due to Memory
20Hardware Solution for Structural Hazards
- Adder Hazard in IF and A stages
- Include a simple +4 adder in the IF stage to avoid using the ALU of the A stage when calculating PC+4
- Register Hazard
- The register file can be made to perform write access in the first half of the clock cycle and read access in the second half
- Memory Hazard
- Dedicated memory, i.e., separate Instruction Memory and Data Memory
- 2-port memory
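The register file solution can be illustrated with a toy model in which the write of the SR stage is applied before the reads of the DR stage within the same cycle. The class name, the method `clock_cycle`, and the size of 32 registers are assumptions made for this sketch.

```python
class RegisterFile:
    """Toy register file: write in the first half cycle, read in the second."""

    def __init__(self, size=32):  # size is an assumption for the sketch
        self.regs = [0] * size

    def clock_cycle(self, write=None, reads=()):
        # First half of the cycle: the SR stage writes its result.
        if write is not None:
            index, value = write
            self.regs[index] = value
        # Second half of the cycle: the DR stage reads its operands.
        return [self.regs[index] for index in reads]


rf = RegisterFile()
print(rf.clock_cycle(write=(1, 99), reads=(1, 2)))  # [99, 0]
```

Because the write comes first, an instruction in DR sees the value written by the instruction that is in SR during the same cycle, and the two stages no longer compete for the register file.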
21Data Hazard
- Data Hazard is possible when more than one instruction in a sequence shares the same data

SLL R5, R1          IF DR A  M  SR
ADD R1, R2, R3         IF DR A  M  SR
AND R1, R4, R4            IF DR A  M  SR
SUB R5, R1, R6               IF DR A  M  SR
XOR R1, R7, R8                  IF DR A  M  SR

- Read After Write (RAW) Hazard
- Supposed to read the written data, but reading it takes place first
- Write After Read (WAR) Hazard
- Supposed to read first then write it, but writing it takes place first
- Write After Write (WAW) Hazard
- Data is written at the same location in the wrong order
22Data Hazards
- RAW Hazard
- Ii precedes Ij, and Ij tries to read a register or data memory location before Ii stores data there
- ADD R2, R3, R1
- AND R1, R4, R4
- WAR Hazard
- Ii precedes Ij, Ii reads data and Ij writes data at the same location, and the write takes place earlier than the read
- This never happens if all the instructions go through the same pipeline stages with the same delay, because instructions go through the SR stage (for writing) later than the DR stage (for reading)
- WAW Hazard
- Ii precedes Ij, and both Ii and Ij write data at the same location, but in the wrong order
- This also never happens if the assumption stated for WAR holds
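The three definitions can be expressed as a check on the destination and source registers of two instructions. In this sketch each instruction is given as a `(destination, sources)` pair, which sidesteps operand-order questions; the function reports which dependences exist, that is, which hazards are possible, not whether a given pipeline actually suffers from them.

```python
def possible_hazards(first, second):
    """Dependences between `first` (Ii) and a later instruction `second` (Ij)."""
    dest_i, sources_i = first
    dest_j, sources_j = second
    found = set()
    if dest_i is not None and dest_i in sources_j:
        found.add("RAW")  # Ij reads what Ii writes
    if dest_j is not None and dest_j in sources_i:
        found.add("WAR")  # Ij writes what Ii reads
    if dest_i is not None and dest_i == dest_j:
        found.add("WAW")  # both write the same location
    return found


# ADD R2, R3, R1 writes R1; AND R1, R4, R4 reads R1 and R4 and writes R4.
add = ("R1", {"R2", "R3"})
and_ = ("R4", {"R1", "R4"})
print(possible_hazards(add, and_))  # {'RAW'}
```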
23Forwarding Circuit For RAW Data Hazard
- Circuit that forwards the data to be stored in the SR stage to the ALU input
- MUX in the A stage
- Data to be stored in a register in the SR stage
- DATA, AO in the M/S buffer
- AO in the A/M buffer
- These values in the inter-stage buffers are forwarded to the ALU input MUX
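The selection made by the ALU input MUX might look like the following sketch. The buffer layout is hypothetical: each buffer is a dict with the made-up fields `dest` and `is_load` next to the `AO` and `DATA` values named above, and stall logic for loads is assumed to live elsewhere.

```python
def alu_operand(src_reg, reg_file, am_buffer, ms_buffer):
    """Pick the newest available value for one ALU source register."""
    # Newest producer first: its ALU result sits in the A/M buffer.
    # A load in the A/M buffer only holds an address in AO, so it cannot forward.
    if am_buffer and am_buffer["dest"] == src_reg and not am_buffer["is_load"]:
        return am_buffer["AO"]
    # Next: the value about to be written in the SR stage (M/S buffer).
    if ms_buffer and ms_buffer["dest"] == src_reg:
        return ms_buffer["DATA"] if ms_buffer["is_load"] else ms_buffer["AO"]
    # No forwarding needed: use the value read in the DR stage.
    return reg_file[src_reg]


regs = {"R1": 10, "R2": 20}
am = {"dest": "R1", "is_load": False, "AO": 42}
ms = {"dest": "R2", "is_load": True, "AO": 0, "DATA": 7}
print(alu_operand("R1", regs, am, ms), alu_operand("R2", regs, am, ms))  # 42 7
```

Checking the A/M buffer before the M/S buffer matters when two earlier instructions write the same register: the later of the two holds the value the program expects.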
24Instruction Scheduling with Forwarding Circuit
- Resolving data hazards on registers by forwarding: no delay

SLL R5, R1          IF DR A  M  SR
ADD R1, R2, R3         IF DR A  M  SR
AND R1, R4, R4            IF DR A  M  SR
SUB R5, R1, R6               IF DR A  M  SR
XOR R1, R7, R8                  IF DR A  M  SR
25Load Delay Due To RAW: Improvement by Forwarding Circuit
- Load delay: 2 cycles

LD R1, X            IF DR A  M  SR
stall
stall
ADD R1, R2, R3               IF DR A  M  SR
AND R1, R4, R4                  IF DR A  M  SR
SUB R5, R1, R6                     IF DR A  M  SR
XOR R1, R7, R8                        IF DR A  M  SR

- Load delay with forwarding: 1 cycle

LD R1, X            IF DR A  M  SR
stall
ADD R1, R2, R3            IF DR A  M  SR
AND R1, R4, R4               IF DR A  M  SR
SUB R5, R1, R6                  IF DR A  M  SR
XOR R1, R7, R8                     IF DR A  M  SR
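The stall counts in both diagrams follow from where a value is produced and where it is needed. The sketch below is one way to compute them, assuming the register file writes in the first half of a cycle and reads in the second half, and that forwarded values enter the consumer's A stage; stage numbers are IF=1, DR=2, A=3, M=4, SR=5, and the function name is illustrative.

```python
def raw_stalls(producer_is_load, forwarding, distance=1):
    """Stall cycles for a consumer issued `distance` instructions after a producer."""
    if forwarding:
        # Result is latched at the end of A (ALU result) or M (loaded data)
        # and must be available when the consumer enters its A stage.
        ready_stage = 4 if producer_is_load else 3
        required_gap = ready_stage + 1 - 3
    else:
        # Consumer's DR (stage 2) may share a cycle with the producer's SR (stage 5).
        required_gap = 5 - 2
    return max(0, required_gap - distance)


print(raw_stalls(producer_is_load=True, forwarding=False))   # 2
print(raw_stalls(producer_is_load=True, forwarding=True))    # 1
print(raw_stalls(producer_is_load=False, forwarding=True))   # 0
```

The last line corresponds to the no-delay case for functional instructions with forwarding.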
26Load Delay Due To RAW: Improvement by Software Scheduling

LD R1, X            IF DR A  M  SR
stall
ADD R1, R2, R3            IF DR A  M  SR
SUB R1, R5, R4               IF DR A  M  SR
LD R6, Y                        IF DR A  M  SR

Software Scheduling:

LD R1, X            IF DR A  M  SR
LD R6, Y               IF DR A  M  SR
ADD R1, R2, R3            IF DR A  M  SR
SUB R1, R5, R4               IF DR A  M  SR
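The effect of the reordering can be checked by counting load-use stalls. This sketch assumes a forwarding circuit, so that the only remaining stall is one cycle when an instruction reads a register loaded by the instruction immediately before it; instructions are written as `(op, destination, sources)` tuples chosen for the example.

```python
def load_use_stalls(program):
    """Count 1-cycle stalls caused by a load followed directly by its consumer."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        op, dest, _ = previous
        if op == "LD" and dest in current[2]:
            stalls += 1
    return stalls


original = [
    ("LD", "R1", ()),             # LD R1, X
    ("ADD", "R3", ("R1", "R2")),  # ADD R1, R2, R3
    ("SUB", "R4", ("R1", "R5")),  # SUB R1, R5, R4
    ("LD", "R6", ()),             # LD R6, Y
]
scheduled = [original[0], original[3], original[1], original[2]]

print(load_use_stalls(original), load_use_stalls(scheduled))  # 1 0
```

Moving the independent LD R6, Y into the slot after LD R1, X fills the load delay with useful work.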
27Control Hazard
- The address of the instruction after a branch instruction is determined in the M stage. Therefore, the next instruction fetch must be delayed until the branch instruction completes the M stage.

ADD R1, R2, R3      IF DR A  M  SR
JMP COND, X            IF DR A  M  SR
stall
stall
stall
next instruction                   IF DR A  M  SR

- Branch delay of 3 cycles
- The value of PC is decided by the value of T, which selects one of the input addresses to the MUX in the M stage: AO (branch address) or NPC(PC)
- The value of T is decided by testing the conditions in the A stage
- The branch address can be decided earlier if the branch condition can be tested earlier
28Reduction of Branch Effect
- If the branch address calculation and the condition test are made earlier, the branch delay can be reduced
- Move these operations to the DR stage
- Include an adder for branch address calculation in the DR stage
- Move the circuit that tests the branch condition to the DR stage
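The gain from resolving the branch earlier can be put in numbers. The sketch assumes that the next instruction can be fetched in the cycle after the stage that decides the PC has completed; `branch_delay` is an illustrative helper.

```python
STAGES = ["IF", "DR", "A", "M", "SR"]


def branch_delay(resolve_stage):
    """Stall cycles between a branch and the next fetched instruction."""
    # With no delay the next fetch happens while the branch is in DR (index 1);
    # waiting for `resolve_stage` to finish costs index(resolve_stage) cycles.
    return STAGES.index(resolve_stage)


print(branch_delay("M"), branch_delay("DR"))  # 3 1
```

Deciding the PC in the M stage costs 3 cycles per branch, while moving the address adder and the condition test to the DR stage leaves a delay of 1 cycle.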
29Branch Delay: Improvement by Software Rescheduling

ADD R1, R2, R3      IF DR A  M  SR
JMP COND, X            IF DR A  M  SR
stall
next instruction             IF DR A  M  SR

Branch delay: 1 cycle

Rescheduling:

JMP COND, X         IF DR A  M  SR
ADD R1, R2, R3         IF DR A  M  SR
next instruction          IF DR A  M  SR

No branch delay

This is possible only if COND is set by an instruction that still precedes the JMP after rescheduling. A conditional branch on the COND set by the ADD (which now follows the JMP) is not possible.
30Branch Delay: Improvement by Hardware Branch Predictor
Example code sequence:
ADD R1, R2, R3
JMP COND, X
LD R1, Y
SUB R3, R4, R5
X: ADD R1, R6, R5
- Predict TAKEN, and actually TAKEN: 1 cycle delay
- Predict TAKEN, and actually NOT TAKEN: 1 cycle delay
- Predict NOT TAKEN, and actually NOT TAKEN: no delay
- Predict NOT TAKEN, and actually TAKEN: 1 cycle delay
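The four cases can be collected in a small lookup, from which an average delay per branch follows. This is only a sketch: the delays are the ones listed for the four cases, while the fraction of taken branches (0.6 below) is an invented number used purely for illustration.

```python
def prediction_delay(predict_taken, actually_taken):
    """Delay in cycles for one branch, following the four cases above."""
    if not predict_taken and not actually_taken:
        return 0  # sequential fetch was already correct
    return 1


def average_delay(predict_taken, taken_fraction):
    """Expected delay per branch for a fixed prediction policy."""
    return (taken_fraction * prediction_delay(predict_taken, True)
            + (1 - taken_fraction) * prediction_delay(predict_taken, False))


taken_fraction = 0.6  # hypothetical share of taken branches
print(average_delay(True, taken_fraction))
print(average_delay(False, taken_fraction))
```

Under these delays, always predicting TAKEN costs one cycle per branch regardless of the outcome, whereas predicting NOT TAKEN costs a cycle only for the branches that are actually taken.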
31Branch Prediction Penalty