Title: Computer Organization and Architecture Chapter 6 Enhancing Performance with Pipelining
1Computer Organization and Architecture, Chapter 6: Enhancing Performance with Pipelining
- Yu-Lun Kuo
- Computer Sciences and Information Engineering
- University of Tunghai, Taiwan
- sscc6991_at_gmail.com
2Review Single Cycle vs. Multiple Cycle Timing
3How Can We Make It Even Faster?
- Split the multiple instruction cycle into smaller and smaller steps
- There is a point of diminishing returns where as much time is spent loading the state registers as doing the work
- Pipelining
- Multiple instructions are overlapped in execution
- Key to making processors fast
4Example Laundry
- Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
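The laundry timing can be checked with a small sketch. It assumes four loads, that each machine handles one load at a time, and that a finished load may wait in a basket until the next machine is free. The function names are illustrative and do not come from the slides.

```python
# Stage times in minutes, taken from the slide: washer, dryer, folder.
STAGES = [30, 40, 20]


def sequential_minutes(loads, stages=STAGES):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages=STAGES):
    """Overlap the loads; each stage still handles one load at a time."""
    finish = [0] * len(stages)  # time at which each stage last became free
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, duration in enumerate(stages):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


print(sequential_minutes(4))  # 360
print(pipelined_minutes(4))   # 210
```

Under these assumptions the 40-minute dryer, the slowest stage, sets the pace of the pipelined schedule.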
5Sequential Laundry
6Pipelined Laundry
7Example Laundry
8MIPS Instructions (p.371)
- Classically take five steps
- Fetch instruction from (instruction) memory (IF)
- Read register while decoding the instruction (ID)
- Execute the operation or calculate an address (EX)
- Access an operand in data memory (MEM)
- Write the result into a register (WB)
- Five stages
9The schematic view
[Figure: the five pipeline stages and the hardware each one uses. IF uses the memory, ID uses the register file, EX uses the ALU, Mem uses the memory, WB uses the register file.]
Very important to remember the content of this slide
10A Pipelined MIPS Processor
- Start the next instruction before the current one has completed
- Improves throughput
- Total amount of work done in a given time
[Figure: pipeline timing diagram over clock cycles 1 to 8, showing lw, sw, and an R-type instruction overlapped in execution]
- The clock cycle (pipeline stage time) is limited by the slowest stage
- For some instructions, some stages are wasted cycles
11Single Cycle vs. Multiple Cycle vs. Pipeline
Multiple Cycle Implementation
12Pipelined Execution Representation
13Single-cycle vs. Pipelined Performance (p.372)
- Single-cycle (non-pipelined)
- Must allow for the slowest instruction (lw)
- The time required for every instruction is 800 ps
- The time between the first and fourth instructions in the non-pipelined design
- 3 × 800 ps = 2400 ps
14Figure 6.3
15Single-cycle vs. Pipelined Performance (p.372)
- Pipeline
- All the pipeline stages take a single clock cycle
- The clock cycle must be long enough to accommodate the slowest operation
- The pipelined clock cycle must therefore be the worst-case stage time of 200 ps
- The time between the first and fourth instructions
- 3 × 200 ps = 600 ps
- So the total time is 600 ps + 4 × 200 ps = 1400 ps
16Pipelining Speedup (p.374)
- Under ideal conditions and with a large number of instructions
- The speedup from pipelining is approximately equal to the number of pipeline stages
- A five-stage pipeline is nearly five times faster
- The above example?
- Pipeline time 1400 ps
- Non-pipeline time 2400 ps
- The ideal speedup is not reflected in the total execution time for the three instructions
17Pipelining Speedup (p.374)
- Pipelining involves some overhead
- The source of which will be more clear shortly
- Thus, the time per instruction in the pipelined processor will exceed the minimum possible
- The speedup will be less than the number of pipeline stages
- The number of instructions is not large
- If we increased the number of instructions
- Add 1,000,000 instructions
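The effect of adding instructions can be worked through numerically. This is a minimal sketch that uses the 800 ps single-cycle time and 200 ps pipeline cycle quoted in the slides; the function names and the formula for the pipelined time (fill the five stages once, then finish one instruction per cycle) are assumptions made for illustration.

```python
STAGES = 5               # five-stage pipeline
SINGLE_CYCLE_PS = 800    # time per instruction without pipelining
PIPELINE_CYCLE_PS = 200  # clock cycle of the pipelined design


def nonpipelined_ps(n):
    """Total time for n instructions executed one after another."""
    return n * SINGLE_CYCLE_PS


def pipelined_ps(n):
    """Total time for n instructions: fill the pipeline, then one per cycle."""
    return (STAGES + n - 1) * PIPELINE_CYCLE_PS


def speedup(n):
    return nonpipelined_ps(n) / pipelined_ps(n)


print(nonpipelined_ps(3), pipelined_ps(3))  # 2400 1400
print(round(speedup(3), 2))                 # 1.71
print(round(speedup(1_000_003), 2))         # 4.0
```

With only three instructions the speedup is about 1.7. With 1,000,003 instructions it approaches 800/200 = 4, which is still below the stage count of five because the stages in this example are not perfectly balanced.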
18Pipeline Hazards (p.375)
- Pipeline Hazards
- When the next instruction cannot execute in the following clock cycle
- Three different types
- Structural hazards
- what if we had only one memory?
- Data hazards
- what if an instruction's input operands depend on the output of a previous instruction?
- Control hazards
- what about branches?
19Structural Hazards (1/2) (p.375)
- The hardware cannot support the combination of instructions that we want to execute in the same clock cycle
- Hardware resources are not enough!
- Ex. The laundry room
- Washer-dryer vs. separate washer and dryer
20Structural Hazard (2/2) (p.375)
- Suppose a single memory instead of two memories
- If the pipeline in Figure 6.3 had a fourth instruction
- Then in the same clock cycle
- The first instruction is accessing data from memory
- The fourth instruction is fetching an instruction from the same memory
- Without two memories, the pipeline could have a structural hazard
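The single-memory conflict can be located with a short sketch. It assumes an ideal five-stage pipeline in which instruction i (counting from 0) enters IF in cycle i + 1 and nothing stalls; the helper names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction instr_index in a 1-based cycle, or None."""
    pos = cycle - 1 - instr_index  # instruction i enters IF in cycle i + 1
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def memory_conflicts(num_instrs, loads):
    """Cycles where one shared memory is needed by both MEM and IF.

    loads is the set of instruction indices that access data memory.
    Returns (cycle, index in MEM, index in IF) tuples.
    """
    conflicts = []
    for cycle in range(1, num_instrs + len(STAGES)):
        fetching = [i for i in range(num_instrs)
                    if stage_in_cycle(i, cycle) == "IF"]
        in_mem = [i for i in sorted(loads)
                  if stage_in_cycle(i, cycle) == "MEM"]
        if fetching and in_mem:
            conflicts.append((cycle, in_mem[0], fetching[0]))
    return conflicts


# The first instruction is a load; a fourth instruction is fetched in cycle 4.
print(memory_conflicts(4, {0}))  # [(4, 0, 3)]
print(memory_conflicts(3, {0}))  # []
```

With three instructions there is no overlap; the conflict appears only once a fourth instruction is fetched while the first is in its MEM stage.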
21Structural Hazard Single Memory
[Figure: pipeline diagram, time (clock cycles) against instruction order: lw, Inst 1, Inst 2, Inst 3, Inst 4]
22Data Hazard (p.376)
- The planned instruction cannot execute in the proper clock cycle
- Because data that is needed to execute the instruction is not yet available
- The pipeline must be stalled (bubble)
- Because one step must wait for another to complete
- Ex. add $s0, $t0, $t1
- sub $t2, $s0, $t3
- Have to add three bubbles to the pipeline
23How About Register File Access?
[Figure: pipeline diagram, time (clock cycles) against instruction order: an add that writes $1, then Inst 1, Inst 2, Inst 3, then an add that reads $1 and writes $2]
24How About Register File Access?
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
[Figure: pipeline diagram, time (clock cycles) against instruction order: an add that writes $1, then Inst 1, Inst 2, then an add that reads $1 and writes $2]
25Register Usage Can Cause Data Hazards
- Dependencies backward in time cause hazards
Instr. Order:
add $1,
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
- Read before write data hazard
27Loads Can Cause Data Hazards
- Dependencies backward in time cause hazards
Instr. Order:
lw $1,4($2)
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
28One Way to Fix a Data Hazard
Can fix a data hazard by waiting (stalling), but this impacts CPI
[Figure: pipeline diagram in instruction order, starting with an add that writes $1]
29Another Way to Fix a Data Hazard
Fix data hazards by forwarding results as soon as
they are available to where they are needed
Instr. Order:
add $1,
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
30Forwarding (p.376)
- Also called bypassing
- Resolving a data hazard by retrieving the missing data element from internal buffers
- Ex. lw $s0, 20($t1)
- sub $t2, $s0, $t3
Still need one stall
34Forwarding with Load-use Data Hazards
Instr. Order:
lw $1,4($2)
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
- Will still need one stall cycle even with forwarding
35Control Hazard (1/2) (p.379)
- Also called branch hazard
- Need to make a decision based on the results of one instruction while others are executing
- Solution 1: Stall (bubble)
- If branch -> stall first
- Put in enough extra hardware
- We can test registers
- Calculate the branch address and update PC during the second stage of the pipeline
- Slow and high cost
37Branch Instructions Cause Control Hazards
- Dependencies backward in time cause hazards
[Figure: pipeline diagram in instruction order: lw, Inst 3, Inst 4]
38One Way to Fix a Control Hazard
Fix a branch hazard by waiting (stalling), but this affects CPI
[Figure: pipeline diagram in instruction order, starting with beq]
39Control Hazard (2/2) (p.381)
- Solution 2: Predict
- Always predict that branches will be untaken
- When you're right -> the pipeline proceeds at full speed
- Do not jump to the branch target address
- Only when branches are taken -> the pipeline stalls
- Need to add hardware for flushing instructions if we are wrong
41Branch Prediction (1/2)
- A method of resolving a branch hazard that
- Assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome
- Dynamic prediction of branches
- Keeping a history for each branch as taken or untaken
- Using the recent past behavior to predict the future
- Correctly predicts branches with over 90% accuracy
42Branch Prediction (2/2)
- If the prediction is wrong
- Pipeline control must ensure that the instructions following the wrongly guessed branch have no effect
- Restart the pipeline from the proper branch address
- Keeping the history
- Branch history table
- Branch prediction buffer
43Pipeline Hazards Illustrated
44Pipelined Datapath
- IF Instruction fetch
- ID Instruction decode and register file read
- EX Execution or address calculation
- MEM Data memory access
- WB Write back
46Pipeline Execution (p.387)
- Assume
- Register file is written in the first half of the clock cycle
- Register file is read during the second half
47Pipeline Register
- Need to preserve the destination register address
in the pipeline state registers
48Five stages of lw (1/3) (p.388)
- Instruction fetch
- Read the instruction from memory using the address in the PC
- Place it in the IF/ID pipeline register
- PC address <- PC + 4 (ready for next clock cycle)
- Instruction decode and register file read
- The IF/ID pipeline register supplies the 16-bit immediate field
- Which is sign-extended to 32 bits
- And the register numbers used to read the two registers
- All values are stored in the ID/EX pipeline register
49Five stages of lw (2/3)
- Execute and address calculation
- Reads the contents of register 1
- And the sign-extended immediate from the ID/EX pipeline register
- Adds them using the ALU
- The sum is placed in the EX/MEM pipeline register
- Memory access
- Reads the data memory using the address from the EX/MEM pipeline register
- Loads the data into the MEM/WB pipeline register
50Five stages of lw (3/3)
- Write back
- Final step
- Reads the data from the MEM/WB pipeline register
- Writes it into the register file
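The five stages of lw can be traced with a small sketch that passes values through dictionaries standing in for the pipeline registers. It handles a single lw in isolation, with memories modeled as Python dictionaries; the function names, field names, and example values are assumptions for illustration. The bit layout (opcode 35, then rs, rt, and a 16-bit immediate) follows the standard MIPS I-format.

```python
def sign_extend16(value):
    """Sign-extend the low 16 bits of value to a Python int."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def encode_lw(rt, rs, imm):
    """Build a MIPS lw word: opcode 35, base register rs, destination rt."""
    return (0x23 << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def run_lw(pc, regs, instr_mem, data_mem):
    """Carry one lw through IF, ID, EX, MEM, WB; return the next PC."""
    # IF: fetch the instruction and compute PC + 4
    if_id = {"instr": instr_mem[pc], "pc_plus_4": pc + 4}
    # ID: read the base register and sign-extend the immediate
    instr = if_id["instr"]
    rs = (instr >> 21) & 0x1F
    rt = (instr >> 16) & 0x1F
    id_ex = {"read_data1": regs[rs], "imm": sign_extend16(instr), "rt": rt}
    # EX: the ALU adds base and offset; the destination number travels along
    ex_mem = {"alu_result": id_ex["read_data1"] + id_ex["imm"],
              "rt": id_ex["rt"]}
    # MEM: read data memory at the computed address
    mem_wb = {"mem_data": data_mem[ex_mem["alu_result"]], "rt": ex_mem["rt"]}
    # WB: write the loaded value into the register file
    regs[mem_wb["rt"]] = mem_wb["mem_data"]
    return if_id["pc_plus_4"]


# lw $s0, 20($t1), where $s0 is register 16 and $t1 is register 9
regs = [0] * 32
regs[9] = 100
next_pc = run_lw(0, regs, {0: encode_lw(16, 9, 20)}, {120: 42})
print(regs[16], next_pc)  # 42 4
```

The destination register number is copied from one pipeline register to the next, which is why it has to be preserved in the pipeline state until the write-back stage.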
52Figure 6.12 IF/ID
53Figure 6.12 EX
54Figure 6.13 EX
55Figure 6.14 MEM
56Figure 6.14 WB
57Multiple-Clock-Cycle Pipeline (p.397)
[Figure: multiple-clock-cycle pipeline diagram in instruction order: Inst 0, Inst 1, Inst 2, Inst 3, Inst 4]
58 6.3 Pipeline Control (1/2)
59Pipeline Control (2/2) (p.403)
60Control Settings
61 6.4 Data Hazards and Forwarding (p.400)
- Starting next instruction before first is finished
- Dependencies go backward in time
62Data Dependence Detection (p.406)
- Hazard conditions
- 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (sub-and)
- 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
- 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs (sub-or)
- 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
- Rs, Rt: source registers
- Rd: destination register
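The four hazard conditions can be sketched as a small forwarding check. The dictionary field names are illustrative stand-ins for the pipeline register fields. Two refinements beyond the listed conditions are assumed here: the earlier instruction must actually write a register (RegWrite), and register $0, which always reads as zero in MIPS, is never forwarded.

```python
def forward_source(src_reg, ex_mem, mem_wb):
    """Return which pipeline register should supply src_reg, or None.

    ex_mem and mem_wb are dicts with 'reg_write' (bool) and 'rd' (int).
    EX/MEM is checked first because it holds the more recent result.
    """
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"   # conditions 1a / 1b
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"   # conditions 2a / 2b
    return None


def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Forwarding decision for the Rs and Rt operands of the EX-stage instruction."""
    return (forward_source(id_ex["rs"], ex_mem, mem_wb),
            forward_source(id_ex["rt"], ex_mem, mem_wb))


# Hypothetical snapshot: an instruction that writes $2 is in MEM while
# an instruction reading $2 (as Rs) and $5 (as Rt) is in EX.
print(forwarding_unit({"rs": 2, "rt": 5},
                      {"reg_write": True, "rd": 2},
                      {"reg_write": False, "rd": 0}))  # ('EX/MEM', None)
```

Checking EX/MEM before MEM/WB is a design choice: when both stages hold a write to the same register, the EX/MEM value is the newer one.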
63Ex. Data Hazard (on r1)
64Dependencies backwards in time are hazards
65Forward result from one stage to another
66Dependence Detection (p.407)
- Some instructions do not write registers
- The policy above would forward even when it is unnecessary
- Check whether the RegWrite signal will be active
- Examine the WB control field of the pipeline register (EX/MEM) to determine this
- The dependence is resolved from a pipeline register
- Rather than waiting for the WB stage to write the register file
- The pipeline registers hold the data to be forwarded
- The instructions that forward are the four R-format instructions
- add, sub, and, or
67 6.5 Data Hazards and Stalls
- Can't always forward
- When an instruction tries to read a register following a load instruction that writes the same register
- Called a load-use data hazard
68Stalling (p.413)
- Nop (which acts like a bubble)
- An instruction that does no operation to change state
- We can stall the pipeline by keeping an instruction in the same stage
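The stall decision for a load-use hazard can be sketched as follows. The test compares the destination of a load in EX with the source registers of the instruction in ID; the field names and the tuple format for instructions are assumptions for illustration, and full forwarding is assumed for every other dependence.

```python
def must_stall(id_ex, if_id):
    """True when the instruction in ID needs the result of a load still in EX."""
    return bool(id_ex["mem_read"]) and id_ex["rt"] in (if_id["rs"], if_id["rt"])


def count_load_use_stalls(program):
    """Count one-cycle bubbles in a list of (op, dest, src1, src2) tuples.

    Only a load followed immediately by a consumer of its destination
    register costs a bubble, since forwarding covers everything else.
    """
    stalls = 0
    for producer, consumer in zip(program, program[1:]):
        op, dest = producer[0], producer[1]
        if op == "lw" and dest in consumer[2:]:
            stalls += 1
    return stalls


program = [
    ("lw", 16, 9, None),   # lw  $s0, 20($t1)
    ("sub", 10, 16, 11),   # sub $t2, $s0, $t3
]
print(count_load_use_stalls(program))  # 1
```

Placing an independent instruction between the load and its consumer removes the bubble in this model.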
69Corrected Datapath to Save RegWrite Addr
- Need to preserve the destination register address
in the pipeline state registers
71MIPS Pipeline Control Path Modifications
- All control signals can be determined during Decode
- and held in the state registers between pipeline stages
72 6.6 Branch Hazards
- An instruction must be fetched at every clock cycle
- But the branch decision is not made until the MEM pipeline stage
- The delay in determining the proper instruction to fetch is called a
- Branch hazard (control hazard)
- Relatively simple to understand
- Occur less frequently than data hazards
73Single Memory is a Structural Hazard
74Handling Branch Hazard
- Predict branch always not taken (p.418)
- Reduce delay of taken branches (p.418)
- Dynamic branch prediction (p.421)
75Predict Branch Not Taken (p.418)
- Stalling until the branch is complete
- Too slow
- Assume
- The branch will not be taken
- Continue execution
- If the branch is taken
- The instructions that are being fetched and decoded must be discarded (flushed)
- Execution continues at the branch target
- If branches are untaken half the time, and it costs little to discard the instructions, this approach pays off
76Pipeline on the Branch (p.417)
77Reduce delay of branch
- Reduce the delay of a branch by moving branch execution earlier in the pipeline
- Fewer instructions need to be flushed
- We already have the PC value and the immediate field in the IF/ID pipeline register
- Move the branch adder from the EX stage to the ID stage
- Flush one instruction in the IF stage
- Add a control signal (IF.Flush) to zero the instruction field
- Making the instruction a nop
78Handling Branches
79Flushing with Misprediction (Not Taken)
4: beq $1,$2,2
8: sub $4,$1,$5
- To flush the IF stage instruction, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (transforming it into a nop)
80Reducing the Delay of Branches
81Dynamic Branch Prediction (p.421)
- Dynamic branch prediction
- Look up the address of the instruction
- To see if a branch was taken the last time
- If so, fetch new instructions from the same place as the last time
- Branch prediction buffer
- Branch history table
- A small memory indexed by the lower portion of the address of the branch instruction
- Contains a bit
- Says whether the branch was recently taken or not
82Dynamic Branch Prediction (contd)
- The prediction is just a hint that is assumed to be correct
- Fetching begins in the predicted direction
- If the hint turns out to be wrong
- Incorrectly predicted instructions are deleted
- The prediction bit is inverted and stored back
- The proper sequence is fetched and executed
83Loops and Prediction ex.(p.421)
- A loop branch that branches 9 times, then is not taken once
- What is the prediction accuracy for this branch?
- (prediction bit remains in prediction buffer)
- Mispredict on the first and last loop iterations
- Last: inevitable, since the bit will say taken
- First: the bit was flipped on the prior execution of the last iteration of the loop
- Branch is taken 90% of the time (9/(1+9))
- Prediction accuracy is 80% (2 incorrect) (8/(2+8))
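The 80% figure can be reproduced with a short simulation of a 1-bit predictor. This sketch assumes the loop is executed repeatedly and that the prediction bit starts at "not taken", which is the state left behind by a previous pass through the loop; the function name is illustrative.

```python
def one_bit_accuracy(outcomes, initial_taken=False):
    """Fraction of branches predicted correctly by a single prediction bit."""
    prediction = initial_taken
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # the bit always records the latest outcome
    return correct / len(outcomes)


# Taken 9 times, then not taken once; repeat the whole loop 10 times.
loop = [True] * 9 + [False]
print(one_bit_accuracy(loop * 10))  # 0.8
```

Each pass through the loop mispredicts twice, once on entry and once on exit, even though the branch itself is taken 90% of the time.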
84 2-bit Prediction (p.422)
- A 2-bit scheme changes the prediction only after it mispredicts twice
85 2-bit Predictors
- A 2-bit scheme can give 90% accuracy, since a prediction must be wrong twice before the prediction bit is changed
[Figure: 2-bit predictor state diagram. States 11 and 10 predict taken; states 01 and 00 predict not taken; the arcs are labeled Taken and Not taken. Loop example annotations: right 9 times, wrong on loop fall out, right on 1st iteration.]
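The 90% figure for the same loop can be checked with a 2-bit predictor. This sketch models the predictor as a saturating counter from 0 to 3 that predicts taken when the counter is 2 or 3, which is one common way to realize a 2-bit scheme and may differ in detail from the state diagram on the slide. The starting state of 3 is an assumption.

```python
def two_bit_accuracy(outcomes, state=3):
    """Fraction predicted correctly by a 2-bit saturating counter.

    States 3 and 2 predict taken; states 1 and 0 predict not taken.
    """
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


# Taken 9 times, then not taken once; repeat the whole loop 10 times.
loop = [True] * 9 + [False]
print(two_bit_accuracy(loop * 10))  # 0.9
```

The single not-taken outcome at loop exit only moves the counter from 3 to 2, so the first iteration of the next pass is still predicted taken.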
86Delayed Branch (p.423)
- Compilers and assemblers try to place an instruction into the branch delay slot that always executes after the branch
- Branch delay slot
- The slot directly after a delayed branch instruction
- Filled by an instruction that does not affect the branch
87Scheduling Branch Delay Slots
[Figure: three ways to schedule the branch delay slot. Panels: A. From before branch, B. From branch target, C. From fall through. Each panel shows add $1,$2,$3, a conditional branch (if $1=0 or if $2=0), and the delay slot; sub $4,$5,$6 appears as the candidate instruction.]
- A is the best choice: it fills the delay slot and reduces IC
- In B and C, the sub instruction may need to be copied, increasing IC
- In B and C, it must be okay to execute sub when the branch fails
88 6.8 Exceptions (p.432)
- We wouldn't want this invalid value to contaminate other registers or memory locations
- The hardware normally stops the offending instruction in midstream
- It is difficult to always associate the correct exception with the correct instruction in a pipelined computer
- This has led computer designers to relax this requirement in noncritical cases
- Imprecise interrupts
- Imprecise exceptions
89Exceptions
- Steps to handle exceptions
- Flush the instructions in the IF, ID and EX stages
- Let all preceding instructions complete if they can
- Save the restart PC
- Call the OS to handle the exception
- Return to the user code
90 6.9 Advanced Pipelining
- Instruction-level parallelism (ILP)
- Increasing the pipeline depth to overlap more instructions
- Move from four-stage to six-stage
- Advantage: higher clock rate
- Disadvantage: longer load and branch delay
- Superpipelining
- Replicate the internal components (hardware)
- Can launch multiple instructions in every pipeline stage (multiple issue)
- Allows the instruction execution rate to exceed the clock rate (CPI < 1)
- Superscalar
91SuperScalar
- Superscalar MIPS: 2 instructions
- One instruction is ALU or branch; the other could be a load or a store
- More hardware resources
- Two more read ports and one more write port to the register file
- One more ALU
92Two major ways (multiple-issue)
- Division of work between the compiler and the hardware
- Static multiple issue
- Many decisions are made by the compiler before execution
- Dynamic multiple issue
- Many decisions are made during execution by the processor
93Very Long instruction Word (VLIW) p.435
- VLIW
- The set of instructions that issue together in 1 clock cycle
- Called an issue packet
- Treated as one large instruction with multiple operations
94Different Pipelined Designs