Computer Organization and Architecture Chapter 6 Enhancing Performance with Pipelining presentation

1
Computer Organization and Architecture
Chapter 6 Enhancing Performance with Pipelining

Yu-Lun Kuo
Computer Sciences and Information Engineering
University of Tunghai, Taiwan
sscc6991_at_gmail.com

2
Review: Single Cycle vs. Multiple Cycle Timing
3
How Can We Make It Even Faster?

Split the multicycle instruction execution into smaller and smaller steps
There is a point of diminishing returns, where as much time is spent loading the state registers as doing the work
Pipelining
Multiple instructions are overlapped in execution
Key to making processors fast

4
Example: Laundry

Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes
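
The laundry timings above can be worked through with a short sketch. It assumes each load visits the washer, dryer, and folder in that order, that a station handles one load at a time, and that a finished load may wait for the next station; the function names are illustrative and not from the slides.

```python
# Stage times from the slide, in minutes: washer, dryer, folder.
STAGE_MINUTES = [30, 40, 20]

def sequential_minutes(loads, stages=STAGE_MINUTES):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)

def pipelined_minutes(loads, stages=STAGE_MINUTES):
    """Loads overlap; each station handles one load at a time."""
    # finish[s] = time at which station s finished its most recent load
    finish = [0] * len(stages)
    for _ in range(loads):
        ready = 0  # time at which this load is ready for the next station
        for s, minutes in enumerate(stages):
            start = max(ready, finish[s])
            finish[s] = start + minutes
            ready = finish[s]
    return finish[-1]

print(sequential_minutes(4))  # 360
print(pipelined_minutes(4))   # 210
```

Under these assumptions the dryer, the slowest station, sets the pace once the pipeline is full.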

5
Sequential Laundry
6
Pipelined Laundry
7
Example Laundry
8
MIPS Instructions (p.371)

Classically take five steps (five stages)
Fetch instruction from (instruction) memory (IF)
Read registers while decoding the instruction (ID)
Execute the operation or calculate an address (EX)
Access an operand in data memory (MEM)
Write the result into a register (WB)

9
The schematic view
IF uses the memory
ID uses the register file
EX uses the ALU
MEM uses the memory
WB uses the register file
Very important to remember the content of this slide
10
A Pipelined MIPS Processor

Start the next instruction before the current one
has completed
Improves throughput
Total amount of work done in a given time

(Figure: pipeline timing diagram spanning Cycle 1 to Cycle 8 for lw, sw, and an R-type instruction)

Clock cycle (pipeline stage time) is limited by the slowest stage
For some instructions, some stages are wasted cycles

11
Single Cycle vs. Multiple Cycle vs. Pipeline
Multiple Cycle Implementation
12
Pipelined Execution Representation
13
Single-cycle vs. Pipelined Performance (p.372)

Single-cycle (non-pipelined)
Must allow for the slowest instruction (lw)
The time required for every instruction is 800 ps
The time between the first and fourth instructions in the non-pipelined design
3 × 800 ps = 2400 ps

14
Figure 6.3
15
Single-cycle vs. Pipelined Performance (p.372)

Pipeline
All the pipeline stages take a single clock cycle
The clock cycle must be long enough to accommodate the slowest operation
So the pipelined clock cycle must be the worst-case stage time of 200 ps
The time between the first and fourth instructions
3 × 200 ps = 600 ps
So the total time for the three instructions is 600 ps + 4 × 200 ps = 1400 ps
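
The two totals can be reproduced with a small sketch. The per-stage latencies below are an assumption chosen to be consistent with the slide's figures (800 ps for lw, 200 ps for the slowest stage); the slides themselves state only the totals.

```python
# Assumed per-stage latencies in picoseconds: IF, ID, EX, MEM, WB.
STAGE_PS = [200, 100, 200, 200, 100]

def single_cycle_total(n, stages=STAGE_PS):
    """Every instruction takes one long cycle sized for the slowest instruction."""
    return n * sum(stages)

def pipelined_total(n, stages=STAGE_PS):
    """Clock period is set by the slowest stage; one instruction starts per cycle."""
    cycle = max(stages)
    return (len(stages) + n - 1) * cycle

print(single_cycle_total(3))  # 2400
print(pipelined_total(3))     # 1400
```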

16
Pipelining Speedup (p.374)

Under ideal conditions and with a large number of instructions
The speedup from pipelining is approximately equal to the number of pipeline stages
A five-stage pipeline is nearly five times faster
What about the example above?
Pipelined time: 1400 ps
Non-pipelined time: 2400 ps
The five-fold speedup is not reflected in the total execution time for the three instructions

17
Pipelining Speedup (p.374)

Pipelining involves some overhead
The source of which will become clear shortly
Thus, the time per instruction in the pipelined processor will exceed the minimum possible
The speedup will be less than the number of pipeline stages
The number of instructions in the example is not large
What if we increased the number of instructions?
Add 1,000,000 instructions
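
A rough sketch of how the speedup grows with the instruction count, using the 800 ps and 200 ps figures from the earlier slides. The formula assumes no stalls and one instruction entering the pipeline per cycle; the names are illustrative.

```python
SINGLE_CYCLE_PS = 800    # time per instruction without pipelining (from the slides)
PIPELINE_CYCLE_PS = 200  # pipelined clock cycle (from the slides)
STAGES = 5

def speedup(n):
    """Ratio of non-pipelined to pipelined total time for n instructions."""
    non_pipelined = n * SINGLE_CYCLE_PS
    pipelined = (STAGES + n - 1) * PIPELINE_CYCLE_PS
    return non_pipelined / pipelined

for n in (3, 1_000, 1_000_003):
    print(n, round(speedup(n), 3))
```

With these numbers the ratio approaches 800/200 = 4 rather than 5, because the stages are not perfectly balanced.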

18
Pipeline Hazards (p.375)

Pipeline hazards
Situations in which the next instruction cannot execute in the following clock cycle
Three different types
Structural hazards
What if we had only one memory?
Data hazards
What if an instruction's input operands depend on the output of a previous instruction?
Control hazards
What about branches?

19
Structural Hazards (1/2) (p.375)

The hardware cannot support the combination of instructions that we want to execute in the same clock cycle
There are not enough hardware resources
Ex. The laundry room
Washer-dryer combination vs. separate washer and dryer

20
Structural Hazard (2/2) (p.375)

Suppose we had a single memory instead of two memories
If the pipeline in Figure 6.3 had a fourth instruction
Then, in the same clock cycle
The first instruction is accessing data from memory
The fourth instruction is fetching an instruction from the same memory
Without two memories, the pipeline could have a structural hazard
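
The single-memory conflict described above can be located with a minimal sketch. It assumes one instruction enters the pipeline per cycle with no stalls and that only lw and sw touch memory in the MEM stage; the function name and the label format are hypothetical.

```python
from collections import defaultdict

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def memory_conflicts(program):
    """Return (cycle, users) pairs where a single shared memory would be
    needed by more than one instruction in the same cycle.

    program: list of opcodes; instruction i enters IF in cycle i + 1.
    """
    users = defaultdict(list)
    for i, op in enumerate(program):
        users[i + 1].append(f"{op}#{i} IF")  # instruction fetch
        if op in ("lw", "sw"):
            users[i + 1 + STAGES.index("MEM")].append(f"{op}#{i} MEM")
    return [(c, u) for c, u in sorted(users.items()) if len(u) > 1]

print(memory_conflicts(["lw", "add", "sub", "and"]))
# [(4, ['lw#0 MEM', 'and#3 IF'])]
```

In cycle 4 the load reads data while the fourth instruction is being fetched, which is the case the slide describes.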

21
Structural Hazard: Single Memory
(Figure: pipeline diagram; x-axis: time in clock cycles; y-axis: instruction order: lw, Inst 1, Inst 2, Inst 3, Inst 4)
22
Data Hazard (p.376)

The planned instruction cannot execute in the proper clock cycle
Because data that is needed to execute the instruction is not yet available
The pipeline must be stalled (bubble)
Because one step must wait for another to complete
Ex. add $s0, $t0, $t1
sub $t2, $s0, $t3
Without forwarding, we have to add three bubbles to the pipeline

23
How About Register File Access?
(Figure: pipeline diagram; x-axis: time in clock cycles; y-axis: instruction order: add $1, ; Inst 1; Inst 2; Inst 3; add $2,$1, )
24
How About Register File Access?
Fix the register file access hazard by doing reads in the second half of the cycle and writes in the first half
(Figure: pipeline diagram; x-axis: time in clock cycles; y-axis: instruction order: add $1, ; Inst 1; Inst 2; add $2,$1, )
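
A minimal sketch of how many bubbles a dependent instruction needs when there is no forwarding, with and without the split register-file access described above. It assumes the producer writes its result in WB (stage 5) and the consumer reads its operands in ID (stage 2); the function and parameter names are illustrative.

```python
# Stage positions in the five-stage pipeline (1-based).
ID_STAGE = 2  # consumer reads its source registers here
WB_STAGE = 5  # producer writes its result here

def bubbles_without_forwarding(distance, split_regfile):
    """Stall cycles for a consumer issued `distance` instructions after
    the producer (distance=1 means the very next instruction).

    split_regfile=True: write in the first half of the cycle, read in the
    second half, so a read in the same cycle as the write sees the new value.
    """
    producer_wb_cycle = WB_STAGE             # producer starts in cycle 1
    consumer_id_cycle = distance + ID_STAGE  # consumer starts in cycle 1 + distance
    earliest_read = producer_wb_cycle if split_regfile else producer_wb_cycle + 1
    return max(0, earliest_read - consumer_id_cycle)

print(bubbles_without_forwarding(1, split_regfile=False))  # 3
print(bubbles_without_forwarding(1, split_regfile=True))   # 2
```

Under these assumptions, a consumer three instructions behind the producer needs no bubbles once the register file is written and read in the same cycle.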
25
Register Usage Can Cause Data Hazards

Dependencies backward in time cause hazards

(Figure: pipeline diagram, instructions listed in program order)
add $1,
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5

Read before write data hazard

27
Loads Can Cause Data Hazards

Dependencies backward in time cause hazards

(Figure: pipeline diagram, instructions listed in program order)
lw $1,4($2)
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5

Load-use data hazard

28
One Way to Fix a Data Hazard
Can fix a data hazard by waiting (stalling), but this impacts CPI
(Figure: pipeline diagram, instruction order, starting with add $1, )
29
Another Way to Fix a Data Hazard
Fix data hazards by forwarding results as soon as they are available to where they are needed
(Figure: pipeline diagram, instructions listed in program order)
add $1,
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
30
Forwarding (p.376)

Also called bypassing
Resolving a data hazard by retrieving the missing data element from internal buffers
Ex. lw $s0, 20($t1)
sub $t2, $s0, $t3

Still need one stall
34
Forwarding with Load-use Data Hazards
(Figure: pipeline diagram, instructions listed in program order)
lw $1,4($2)
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5

Will still need one stall cycle even with forwarding

35
Control Hazard (1/2) (p.379)

Also called branch hazard
The need to make a decision based on the results of one instruction while others are executing
Solution 1: Stall (bubble)
If branch → stall first
Put in enough extra hardware so that
We can test registers
Calculate the branch address and update the PC during the second stage of the pipeline
Slow and high cost

37
Branch Instructions Cause Control Hazards

Dependencies backward in time cause hazards

(Figure: pipeline diagram; instruction order: lw, Inst 3, Inst 4)
38
One Way to Fix a Control Hazard
Fix a branch hazard by waiting (stalling), but this affects CPI
(Figure: pipeline diagram, instruction order, starting with beq)
39
Control Hazard (2/2) (p.381)

Solution 2: Predict
Always predict that branches will be untaken
When you're right → the pipeline proceeds at full speed
Does not jump to the branch target address
Only when branches are taken → the pipeline stalls
Need to add hardware for flushing instructions if we are wrong

41
Branch Prediction (1/2)

A method of resolving a branch hazard that
Assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome
Dynamic prediction of branches
Keeping a history for each branch as taken or untaken
Using the recent past behavior to predict the future
Correctly predicts branches with over 90% accuracy

42
Branch Prediction (2/2)

If the prediction is wrong
Pipeline control must ensure that the instructions following the wrongly guessed branch have no effect
Restart the pipeline from the proper branch address
Keeping the history
Branch history table
Branch prediction buffer

43
Pipeline Hazards Illustrated
44
Pipelined Datapath

IF: Instruction fetch
ID: Instruction decode and register file read
EX: Execution or address calculation
MEM: Data memory access
WB: Write back

46
Pipeline Execution (p.387)

Assume
Register file is written in the first half of the
clock cycle
Register file is read during the second half

47
Pipeline Register

Need to preserve the destination register address
in the pipeline state registers

48
Five stages of lw (1/3) (p.388)

Instruction fetch
The instruction is read from memory using the address in the PC
And placed in the IF/ID pipeline register
The PC is incremented: PC = PC + 4 (ready for the next clock cycle)
Instruction decode and register file read
The IF/ID pipeline register supplies the 16-bit immediate field
Which is sign-extended to 32 bits
And the register numbers used to read the two registers
All values are stored in the ID/EX pipeline register

49
Five stages of lw (2/3)

Execute and address calculation
Reads the contents of register 1
And the sign-extended immediate from the ID/EX pipeline register
Adds them using the ALU
The sum is placed in the EX/MEM pipeline register
Memory access
Reads the data memory using the address from the EX/MEM pipeline register
Loads the data into the MEM/WB pipeline register

50
Five stages of lw (3/3)

Write back
Final step
Reading the data from the MEM/WB pipeline
register
Writing it into the register file
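
The five stages of lw can be traced with a minimal sketch in which each pipeline register is a plain dictionary. The pre-decoded `(rs, rt, imm16)` instruction form, the field names, and the function names are all assumptions made for illustration; real hardware carries more fields and control signals.

```python
def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python int."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def run_lw(pc, instr_mem, regs, data_mem):
    """Walk one lw through the five stages and return the incremented PC.

    instr_mem maps address -> (rs, rt, imm16), an assumed decoded form.
    """
    # IF: fetch using the PC, save the instruction and PC + 4 in IF/ID
    if_id = {"instr": instr_mem[pc], "pc_plus_4": pc + 4}

    # ID: read the base register, sign-extend the immediate, and keep the
    # destination register number so it travels with the instruction
    rs, rt, imm16 = if_id["instr"]
    id_ex = {"read_data1": regs[rs], "imm": sign_extend16(imm16), "dest": rt}

    # EX: the ALU adds base register and immediate to form the address
    ex_mem = {"address": id_ex["read_data1"] + id_ex["imm"], "dest": id_ex["dest"]}

    # MEM: read data memory at that address
    mem_wb = {"load_data": data_mem[ex_mem["address"]], "dest": ex_mem["dest"]}

    # WB: write the loaded value into the register file
    regs[mem_wb["dest"]] = mem_wb["load_data"]
    return if_id["pc_plus_4"]

regs = {1: 0, 2: 100}
new_pc = run_lw(0, {0: (2, 1, 4)}, regs, {104: 42})  # lw $1, 4($2)
print(new_pc, regs[1])  # 4 42
```

Passing the destination register number from stage to stage is what lets the write in WB target the right register even though later instructions have since entered the pipeline.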

52
Figure 6.12 IF/ID
53
Figure 6.12 EX
54
Figure 6.13 EX
55
Figure 6.14 MEM
56
Figure 6.14 WB
57
Multiple-Clock-Cycle Pipeline (p.397)
(Figure: multiple-clock-cycle pipeline diagram; instruction order: Inst 0, Inst 1, Inst 2, Inst 3, Inst 4)
58
6.3 Pipeline Control (1/2)
59
Pipeline Control (2/2) (p.403)
60
Control Settings
61
6.4 Data Hazards and Forwarding (p.400)

Starting the next instruction before the first is finished
Dependencies go backward in time

62
Data Dependence Detection (p.406)

Hazard conditions
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (sub-and)
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs (sub-or)
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
Rs, Rt: source registers
Rd: destination register

63
Ex. Data Hazard (on r1)
64
Dependencies backwards in time are hazards
65
Forward result from one stage to another
66
Dependence Detection (p.407)

Some instructions do not write registers
So the hazard conditions alone would forward when it is unnecessary
Check whether the RegWrite signal will be active
By examining the WB control field of the pipeline register (EX/MEM)
The dependence is resolved by forwarding from a pipeline register
Rather than waiting for the WB stage to write the register file
The pipeline registers hold the data to be forwarded
The instructions we need to forward are the four R-format instructions
add, sub, and, or
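
The hazard conditions combined with the RegWrite check can be sketched as a small forwarding-selection function. The dictionary layout, the selector strings, and the example instructions in the comment are hypothetical; the `rd != 0` test is an added assumption that MIPS register $0 is hard-wired to zero and so never needs forwarding.

```python
def forwarding_controls(id_ex, ex_mem, mem_wb):
    """Return (forward_a, forward_b) selectors for the two ALU inputs.

    "ex_mem":  take the ALU result from the EX/MEM pipeline register
    "mem_wb":  take the value from the MEM/WB pipeline register
    "regfile": no hazard, use the value read in ID
    """
    def source_for(src):
        # Hazard 1a/1b: the instruction just ahead writes the register we need
        if ex_mem["RegWrite"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src:
            return "ex_mem"
        # Hazard 2a/2b: the instruction two ahead writes it
        if mem_wb["RegWrite"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src:
            return "mem_wb"
        return "regfile"

    return source_for(id_ex["rs"]), source_for(id_ex["rt"])

# Hypothetical case: and $12,$2,$5 in EX, with sub $2,$1,$3 one stage ahead
print(forwarding_controls(
    {"rs": 2, "rt": 5},
    {"RegWrite": True, "rd": 2},
    {"RegWrite": False, "rd": 0},
))  # ('ex_mem', 'regfile')
```

Checking EX/MEM before MEM/WB means the most recent result wins when both earlier instructions write the same register.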

67
6.5 Data Hazards and Stalls

Can't always forward
When an instruction tries to read a register following a load instruction that writes the same register
Called a load-use data hazard
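
A minimal sketch of a load-use check, in the style of a textbook hazard detection unit: stall when the instruction in EX is a load whose destination matches a source register of the instruction in ID. The field names are assumed for illustration, and the example reuses the lw/sub pair from the Forwarding slide.

```python
def must_stall(id_ex, if_id):
    """True when the instruction in ID must wait one cycle.

    id_ex: fields of the instruction now in EX (MemRead is set for a load,
           rt is the register the load will write)
    if_id: source register numbers of the instruction now in ID
    """
    return id_ex["MemRead"] and id_ex["rt"] in (if_id["rs"], if_id["rt"])

# lw $s0, 20($t1) followed by sub $t2, $s0, $t3
# MIPS register numbers: $t1 = 9, $t2 = 10, $t3 = 11, $s0 = 16
print(must_stall({"MemRead": True, "rt": 16}, {"rs": 16, "rt": 11}))   # True
print(must_stall({"MemRead": False, "rt": 16}, {"rs": 16, "rt": 11}))  # False
```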

68
Stalling (p.413)

Nops (which act like bubbles)
A nop is an instruction that does no operation to change state
We can stall the pipeline by keeping an instruction in the same stage

69
Corrected Datapath to Save RegWrite Addr

Need to preserve the destination register address
in the pipeline state registers

71
MIPS Pipeline Control Path Modifications

All control signals can be determined during Decode and held in the state registers between pipeline stages

72
6.6 Branch Hazards

An instruction must be fetched at every clock cycle
But the branch decision is not made until the MEM pipeline stage
The delay in determining the proper instruction to fetch is called a
Branch hazard (control hazard)
Relatively simple to understand
Occurs less frequently than data hazards

73
Single Memory is a Structural Hazard
74
Handling Branch Hazard

Predict branch always not taken (p.418)
Reduce delay of taken branches (p.418)
Dynamic branch prediction (p.421)

75
Predict Branch Not Taken (p.418)

Stalling until the branch is complete
Too slow
Assume
The branch will not be taken
Continue execution down the sequential instruction stream
If the branch is taken
The instructions that are being fetched and decoded must be discarded (flushed)
Execution continues at the branch target
If branches are untaken half the time and it costs little to discard the instructions, this halves the cost of control hazards

76
Pipeline on the Branch (p.417)
77
Reduce delay of branch

Reduce the delay of branches by moving branch execution earlier in the pipeline
Fewer instructions need to be flushed
We already have the PC value and the immediate field in the IF/ID pipeline register
Move the branch adder from the EX stage to the ID stage
Flush one instruction in the IF stage
Add a control signal (IF.Flush) to zero the instruction field of the IF/ID pipeline register
Making the instruction a nop

78
Handling Branches
79
Flushing with Misprediction (Not Taken)
4 beq $1,$2,2
8 sub $4,$1,$5

To flush the IF stage instruction, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (transforming it into a nop)

80
Reducing the Delay of Branches
81
Dynamic Branch Prediction (p.421)

Dynamic branch prediction
Look up the address of the instruction
To see if the branch was taken the last time
If so, fetch new instructions from the same place as the last time
Branch prediction buffer
Also called a branch history table
A small memory indexed by the lower portion of the address of the branch instruction
Contains a bit
That says whether the branch was recently taken or not

82
Dynamic Branch Prediction (contd)

Prediction is just a hint that is assumed to be correct
Fetching begins in the predicted direction
If the hint turns out to be wrong
Incorrectly predicted instructions are deleted
Prediction bit is inverted and stored back
Proper sequence is fetched and executed

83
Loops and Prediction ex.(p.421)

A loop branch that branches 9 times, then is not taken once
What is the prediction accuracy for this branch? (assume the prediction bit remains in the prediction buffer)
Mispredicts on the first and last loop iterations
Last: inevitable, since the bit will say taken
First: the bit was flipped on the prior execution of the last iteration of the loop
Branch is taken 90% of the time (9/(1+9))
Prediction accuracy is 80% (2 incorrect) (8/(2+8))
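
The 80% figure can be checked with a minimal sketch of a 1-bit predictor. It models a single prediction bit for one branch and ignores buffer indexing; the function name is illustrative.

```python
def one_bit_accuracy(outcomes, initial_taken=False):
    """Fraction of branches predicted correctly by a single prediction bit.

    outcomes: sequence of booleans, True when the branch is taken.
    The bit always records the most recent outcome.
    """
    bit = initial_taken
    correct = 0
    for taken in outcomes:
        correct += (bit == taken)
        bit = taken  # on a misprediction this flips the bit
    return correct / len(outcomes)

loop = [True] * 9 + [False]  # taken 9 times, then falls through
# The bit is left at "not taken" by the previous run of the loop.
print(one_bit_accuracy(loop, initial_taken=False))  # 0.8
```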

84
2-bit Prediction (p.422)

A 2-bit scheme changes the prediction only after two mispredictions

85
2-bit Predictors

A 2-bit scheme can give 90% accuracy, since a prediction must be wrong twice before the prediction bit is changed

(Figure: state diagram with four states. States 11 and 10 predict taken; states 01 and 00 predict not taken. Arcs are labeled Taken and Not taken. Annotations for the loop example: right 9 times, wrong on loop fall out, right on 1st iteration.)
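
The 90% figure for the same loop can be checked with a sketch of a 2-bit predictor. It uses a saturating counter, which is an assumed, common variant; the transitions in the slide's state diagram may differ in detail.

```python
def two_bit_accuracy(outcomes, state=0):
    """Accuracy of a 2-bit saturating counter (states 0..3).

    States 2 and 3 predict taken; states 0 and 1 predict not taken.
    """
    correct = 0
    for taken in outcomes:
        predict_taken = state >= 2
        correct += (predict_taken == taken)
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(outcomes)

loop = [True] * 9 + [False]
# Start from state 2, where one earlier pass through the loop leaves the counter.
print(two_bit_accuracy(loop, state=2))  # 0.9
```

The single not-taken outcome at loop exit moves the counter only one step, so the first iteration of the next pass is still predicted taken.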
86
Delayed Branch (p.423)

Compilers and assemblers try to place an instruction that always executes after the branch into the branch delay slot
Branch delay slot
The slot directly after a delayed branch instruction
Filled by an instruction that does not affect the branch

87
Scheduling Branch Delay Slots
A. From before branch
B. From branch target
C. From fall through
(Figure: three code sequences, each with a branch followed by a delay slot. Instructions shown: add $1,$2,$3; if $1=0 then; if $2=0 then; sub $4,$5,$6.)

A is the best choice: it fills the delay slot and reduces IC
In B and C, the sub instruction may need to be copied, increasing IC
In B and C, it must be okay to execute sub when the branch fails

88
6.8 Exceptions (p.432)

We wouldn't want this invalid value to contaminate other registers or memory locations
The hardware normally stops the offending instruction in midstream
Because of the difficulty of always associating the correct exception with the correct instruction in pipelined computers
Some computer designers relax this requirement in noncritical cases
Imprecise interrupts
Imprecise exceptions

89
Exceptions

Steps to handle exceptions
Flush the instructions in the IF, ID and EX stages
Let all preceding instructions complete if they
can
Save the restart PC
Call the OS to handle the exception
Return to the user code

90
6.9 Advanced Pipelining

Instruction-level parallelism (ILP)
Increasing the pipeline depth to overlap more instructions
Move from a four-stage to a six-stage pipeline
Advantage: higher clock rate
Disadvantage: longer load and branch delay
Superpipelining
Replicate the internal components (hardware)
Can launch multiple instructions in every pipeline stage (multiple issue)
Allows the execution rate to exceed the clock rate (CPI < 1)
Superscalar

91
SuperScalar

Superscalar MIPS: 2 instructions issued together
One instruction is an ALU operation or branch; the other could be a load or a store
More hardware resources
Two more read ports and one more write port on the register file
One more ALU

92
Two major ways (multiple-issue)

Division of work between the compiler and the
hardware
Static multiple issue
Many decisions are made by the compiler before
execution
Dynamic multiple issue
Many decisions are made during execution by the
processor

93
Very Long Instruction Word (VLIW) (p.435)

VLIW
The set of instructions that issue together in 1 clock cycle
Called an issue packet
Treated as one large instruction with multiple operations

94
Different Pipelined Designs
