CENG 450 Computer Systems and Architecture Lecture 7 - PowerPoint PPT Presentation


Slides: 40

Provided by: shin161


Transcript and Presenter's Notes

Title: CENG 450 Computer Systems and Architecture Lecture 7

1
CENG 450: Computer Systems and Architecture
Lecture 7

Amirali Baniasadi
amirali_at_ece.uvic.ca

2
This Lecture

Pipelining
ILP
Scheduling

3
Limits to pipelining

Hazards: circumstances that would cause incorrect execution if the next instruction were launched
Structural hazards: attempting to use the same hardware to do two different things at the same time
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

4
Data Hazard Even with Forwarding
Time (clock cycles)
5
Resolving this load hazard

Adding hardware? ... not
Detection?
Compilation techniques?
What is the cost of load delays?

6
Resolving the Load Data Hazard
[Figure: pipeline diagram with time (clock cycles) across and instruction order down; a Bubble delays the instructions that follow the load]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
How is this different from the instruction issue stall?
7
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd

Fast code
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
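
The difference between the slow and fast sequences can be checked with a minimal sketch. It assumes a one-cycle load delay (an instruction stalls only when it reads the register loaded by the instruction immediately before it); the tuple encoding `(op, dest, sources)` and the function name are illustrative, not from the lecture.

```python
def count_load_use_stalls(program):
    """Count stalls under an assumed 1-cycle load delay: an instruction that
    reads the register loaded by the immediately preceding LW stalls once."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "LW" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


# (op, destination register or None, source registers)
slow_code = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast_code = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]

print("slow code stalls:", count_load_use_stalls(slow_code))
print("fast code stalls:", count_load_use_stalls(fast_code))
```

Under this model the slow code stalls at the ADD and the SUB, while the fast code never places a use directly after its load.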

8
Instruction Set Connection

What is exposed about this organizational hazard
in the instruction set?
k cycle delay?
bad, CPI is not part of ISA
k instruction slot delay
load should not be followed by use of the value
in the next k instructions
Nothing, but code can reduce run-time delays
MIPS did the transformation in the assembler

9
Example Control Hazard
[Figure: pipelined datapath (Instr Cache, Reg File, Sign Extend, Zero? test, Data Cache, Next PC / Next SEQ PC MUXes, WB Data) showing a branch instruction followed by its sequential successors and then the branch target]
10
Example Branch Stall Impact

If 30% of instructions are branches, a 3-cycle stall is significant
Two-part solution
Determine branch taken or not sooner, AND
Compute taken branch address earlier
MIPS branch tests if register = 0 or ≠ 0
MIPS Solution
Move Zero test to ID/RF stage
Adder to calculate new PC in ID/RF stage
1 clock cycle penalty for branch versus 3
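
The cost of the branch stall can be worked out with a small sketch. It assumes an ideal base CPI of 1 and that every branch pays the full penalty; the function name is illustrative.

```python
def cpi_with_branch_stalls(branch_fraction, penalty_cycles, base_cpi=1.0):
    """Effective CPI when every branch adds `penalty_cycles` stall cycles.
    base_cpi = 1.0 is an assumption (ideal pipeline, no other stalls)."""
    return base_cpi + branch_fraction * penalty_cycles


cpi_three_cycle = cpi_with_branch_stalls(0.30, 3)  # branch resolved late
cpi_one_cycle = cpi_with_branch_stalls(0.30, 1)    # zero test and adder in ID/RF

print(f"3-cycle penalty: CPI = {cpi_three_cycle:.2f}")
print(f"1-cycle penalty: CPI = {cpi_one_cycle:.2f}")
```

With 30% branches, the 3-cycle penalty nearly doubles the CPI, which is why moving the test and the adder into ID/RF pays off.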

11
Pipelined MIPS Datapath
[Figure: pipelined MIPS datapath with stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back; shows the Next PC / Next SEQ PC MUX, Adder, Zero? test, Reg File, Sign Extend, and Data Memory, with the added branch logic labeled EXTRA HARDWARE]

Data stationary control
local decode for each instruction phase /
pipeline stage

12
Four Branch Hazard Alternatives

1. Stall until branch direction is clear
2. Predict Branch Not Taken
Execute successor instructions in sequence
Squash instructions in pipeline if branch actually taken
47% of MIPS branches not taken on average
PC+4 already calculated, so use it to get next instruction
3. Predict Branch Taken
53% of MIPS branches taken on average
But haven't calculated branch target address in MIPS
MIPS still incurs 1 cycle branch penalty
Other machines: branch target known before outcome

13
Four Branch Hazard Alternatives

4. Delayed Branch
Define branch to take place AFTER a following instruction
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken
(Branch delay of length n)
1 slot delay allows proper decision and branch target address in 5 stage pipeline
MIPS uses this
14
Delayed Branch

Where to get instructions to fill branch delay slot?
Before branch instruction
From the target address: only valuable when branch taken
From fall through: only valuable when branch not taken
Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots useful in computation
About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

15
Recall: Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1
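
A common textbook form of the speedup equation is sketched below. It is stated as an assumption about what this slide refers to, and assumes an ideal CPI of 1 and equal clock cycle times for the pipelined and unpipelined machines.

```latex
\text{Speedup}
  = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
  = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}
```

The second form applies when branch stalls are the only source of stall cycles.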
16
Example Evaluating Branch Alternatives

Assume
Conditional and unconditional branches = 14% of instructions; 65% of them change the PC

Scheduling scheme    Branch penalty   CPI    Speedup v. stall
Stall pipeline       3                1.42   1.0
Predict taken        1                1.14   1.26
Predict not taken    1                1.09   1.29
Delayed branch       0.5              1.07   1.31
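
The CPI column can be reproduced with a short sketch. It assumes a base CPI of 1, and that under predict-not-taken only the branches that change the PC (65%) pay the penalty; the names are illustrative.

```python
BRANCH_FRACTION = 0.14      # conditional + unconditional branches
CHANGE_PC_FRACTION = 0.65   # share of branches that change the PC


def branch_cpi(penalty, fraction_paying=1.0, base_cpi=1.0):
    """CPI = base + branch frequency x share of branches paying x penalty."""
    return base_cpi + BRANCH_FRACTION * fraction_paying * penalty


cpi = {
    "Stall pipeline": branch_cpi(3),
    "Predict taken": branch_cpi(1),
    "Predict not taken": branch_cpi(1, CHANGE_PC_FRACTION),
    "Delayed branch": branch_cpi(0.5),
}

for scheme, value in cpi.items():
    print(f"{scheme:18s} CPI = {value:.2f}")
```

Dividing the stall CPI by each scheme's CPI gives speedups close to the last column; the small differences appear to come from rounding of intermediate values.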

17
Example (A-24)

In MIPS R4000, it takes 3 cycles to know the target address, and an extra cycle to resolve the condition. Effective CPI? (unconditional 4%, untaken 6%, taken 10%)

CYCLE PENALTY
Branch scheme     Penalty Uncond.   Penalty Untaken   Penalty Taken
Stall             2                 3                 3
Predict taken     2                 3                 2
Predict untaken   2                 0                 3

CPI PENALTY
Branch scheme     Uncond.   Untaken   Taken   Total
Stall             0.08      0.18      0.3     0.56
Predict taken     0.08      0.18      0.2     0.46
Predict untaken   0.08      0         0.3     0.38

CPI: Stall 1.56, Predict Taken 1.46, Predict Untaken 1.38
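
The CPI penalty figures follow from multiplying each branch class's frequency by its cycle penalty. A minimal sketch, assuming a base CPI of 1 and using illustrative dictionary names:

```python
# Branch frequencies as fractions of all instructions
FREQ = {"uncond": 0.04, "untaken": 0.06, "taken": 0.10}

# Cycle penalties per scheme and branch class
PENALTY = {
    "Stall":           {"uncond": 2, "untaken": 3, "taken": 3},
    "Predict taken":   {"uncond": 2, "untaken": 3, "taken": 2},
    "Predict untaken": {"uncond": 2, "untaken": 0, "taken": 3},
}


def effective_cpi(scheme, base_cpi=1.0):
    """Base CPI plus the sum of frequency x penalty over branch classes."""
    return base_cpi + sum(FREQ[kind] * PENALTY[scheme][kind] for kind in FREQ)


for scheme in PENALTY:
    print(f"{scheme:16s} CPI = {effective_cpi(scheme):.2f}")
```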
18
Summary of Pipelining Basics

Hazards limit performance by preventing instructions from executing during their designated clock cycles
Structural Hazards: need more HW resources
Data Hazards: need forwarding, compiler scheduling
Control Hazards: early evaluation and PC, delayed branch, prediction
Increasing length of pipe increases impact of hazards
Pipelining helps instruction bandwidth, not latency
Compilers reduce cost of data and control hazards
Stall: increases CPI and decreases performance

19
What Is an ILP?

Principle: Many instructions in the code do not depend on each other
Result: Possible to execute them in parallel
ILP: Potential overlap among instructions (so they can be evaluated in parallel)
Issues:
Building compilers to analyze the code
Building special/smarter hardware to handle the code
ILP: Increase the amount of parallelism exploited among instructions
Seeks Good Results out of Pipelining

20
What Is ILP?

CODE A                CODE B
LD  R1, (R2)100       LD  R1, (R2)100
ADD R4, R1            ADD R4, R1
SUB R5, R1            SUB R5, R4
CMP R1, R2            SW  R5, (R2)100
ADD R3, R1            LD  R1, (R2)100

Code A: Possible to execute 4 instructions in parallel.
Code B: Can't execute more than one instruction per cycle.
Code A has higher ILP
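
The contrast between Code A and Code B can be made concrete with a dependence-level sketch. It rests on several assumptions not stated on the slide: the first operand is the destination and also a source, CMP writes a hypothetical FLAGS resource, and all memory traffic is modeled as one location named MEM.

```python
def dependence_levels(instrs):
    """For each instruction, return the earliest level at which it could run
    if every RAW, WAR, and WAW ordering is respected. Instructions that end
    up on the same level have no dependence on one another."""
    levels = []
    for j, (_, writes_j, reads_j) in enumerate(instrs):
        level = 0
        for i in range(j):
            _, writes_i, reads_i = instrs[i]
            raw = writes_i & reads_j   # j reads what i wrote
            war = reads_i & writes_j   # j overwrites what i read
            waw = writes_i & writes_j  # both write the same name
            if raw or war or waw:
                level = max(level, levels[i] + 1)
        levels.append(level)
    return levels


# (text, names written, names read)
code_a = [
    ("LD  R1,(R2)100", {"R1"}, {"R2", "MEM"}),
    ("ADD R4,R1", {"R4"}, {"R4", "R1"}),
    ("SUB R5,R1", {"R5"}, {"R5", "R1"}),
    ("CMP R1,R2", {"FLAGS"}, {"R1", "R2"}),
    ("ADD R3,R1", {"R3"}, {"R3", "R1"}),
]
code_b = [
    ("LD  R1,(R2)100", {"R1"}, {"R2", "MEM"}),
    ("ADD R4,R1", {"R4"}, {"R4", "R1"}),
    ("SUB R5,R4", {"R5"}, {"R5", "R4"}),
    ("SW  R5,(R2)100", {"MEM"}, {"R5", "R2"}),
    ("LD  R1,(R2)100", {"R1"}, {"R2", "MEM"}),
]

print("Code A levels:", dependence_levels(code_a))
print("Code B levels:", dependence_levels(code_b))
```

In Code A the four instructions after the load share one level, while in Code B every instruction sits on its own level, forming a serial chain.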

21
Out of Order Execution
Programmer: Instructions execute in-order
Processor: Instructions may execute in any order if results remain the same at the end
Out-of-Order:
B: ADD R3, R4
C: ADD R3, R5
A: LD R1, (R2)
D: CMP R3, R1
22
Scheduling

Scheduling: re-arranging instructions to maximize performance
Requires knowledge about structure of processor
Static Scheduling: done by compiler
Example
for (i=1000; i>0; i--) x[i] = x[i] + s;
Dynamic Scheduling: done by hardware
Dominates Server and Desktop markets (Pentium III, 4; MIPS R10000/12000; UltraSPARC III; PowerPC 603; etc.)

23
Pipeline Scheduling
Compiler schedules (moves) instructions to reduce stalls
Ex: code sequence a = b + c; d = e - f

Before scheduling
lw Rb, b
lw Rc, c
Add Ra, Rb, Rc //stall
sw a, Ra
lw Re, e
lw Rf, f
sub Rd, Re, Rf //stall
sw d, Rd

After scheduling
lw Rb, b
lw Rc, c
lw Re, e
Add Ra, Rb, Rc
lw Rf, f
sw a, Ra
sub Rd, Re, Rf
sw d, Rd
24
Basic Pipeline Scheduling

To avoid pipeline stall:
A dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency
Compiler's ability depends on:
Amount of ILP available in the program
Latencies of the functional units in the pipeline
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

25
Pipeline Scheduling Loop Unrolling

Basic Block
Set of instructions between entry points and between branches. A basic block has only one entry and one exit
Typically 4 to 7 instructions
Amount of overlap << 4 to 7 instructions
Obtain substantial performance enhancements:
Exploit ILP across multiple basic blocks
Loop Level Parallelism
Parallelism that exists within a loop: limited opportunity
Parallelism can cross loop iterations!
Techniques to convert loop-level parallelism to instruction-level parallelism
Loop Unrolling: the compiler's or the hardware's ability to exploit the parallelism inherent in the loop

26
Assumptions

Five-stage integer pipeline
Branches have delay of one clock cycle
ID stage: comparisons done, decisions made and PC loaded
No structural hazards
Functional units are fully pipelined or replicated (as many times as the pipeline depth)
FP Latencies

Integer load latency: 1; integer ALU operation latency: 0
27
Simple Loop Assembler Equivalent

x[i]'s are double/floating point type
R1 initially address of array element with the highest address
F2 contains the scalar value s
Register R2 is pre-computed so that 8(R2) is the last element to operate on

for (i=1000; i>0; i--) x[i] = x[i] + s;
Loop: LD   F0, 0(R1)     ; F0 = array element
      ADDD F4, F0, F2    ; add scalar in F2
      SD   F4, 0(R1)     ; store result
      SUBI R1, R1, 8     ; decrement pointer 8 bytes (DW)
      BNE  R1, R2, Loop  ; branch R1 != R2

28
Where are the stalls?

Unscheduled
Loop: LD F0, 0(R1)
stall
ADDD F4, F0, F2
stall
stall
SD F4, 0(R1)
SUBI R1, R1, 8
stall
BNE R1, R2, Loop
stall
10 clock cycles
Can we minimize?

Scheduled
Loop: LD F0, 0(R1)
SUBI R1, R1, 8
ADDD F4, F0, F2
stall
BNE R1, R2, Loop
SD F4, 8(R1)
6 clock cycles
3 cycles actual work; 3 cycles overhead
Can we minimize further?
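
The two cycle counts can be reproduced with a simple in-order issue model. This is a sketch under assumptions read off the stall positions above: one stall between LD and a dependent ADDD, two between ADDD and a dependent SD, one between SUBI and BNE, and one wasted cycle when the branch delay slot is left empty. The encoding and names are illustrative.

```python
# Extra stall cycles between a producer op and a dependent consumer op
# (assumed from the stall positions shown in the unscheduled loop).
STALLS = {("LD", "ADDD"): 1, ("ADDD", "SD"): 2, ("SUBI", "BNE"): 1}


def count_cycles(program):
    """Issue instructions in order, one per cycle, delaying each until the
    values it reads are available. An unfilled branch delay slot costs one
    extra cycle."""
    issue = []
    cycle = 0
    for j, (op, _, srcs) in enumerate(program):
        earliest = cycle + 1
        for i in range(j):
            producer_op, producer_dest, _ = program[i]
            if producer_dest is not None and producer_dest in srcs:
                ready = issue[i] + 1 + STALLS.get((producer_op, op), 0)
                earliest = max(earliest, ready)
        issue.append(earliest)
        cycle = earliest
    if program[-1][0] == "BNE":  # nothing scheduled into the delay slot
        cycle += 1
    return cycle


# (op, destination register or None, source registers)
unscheduled = [
    ("LD", "F0", ("R1",)), ("ADDD", "F4", ("F0", "F2")),
    ("SD", None, ("F4", "R1")), ("SUBI", "R1", ("R1",)),
    ("BNE", None, ("R1", "R2")),
]
scheduled = [
    ("LD", "F0", ("R1",)), ("SUBI", "R1", ("R1",)),
    ("ADDD", "F4", ("F0", "F2")), ("BNE", None, ("R1", "R2")),
    ("SD", None, ("F4", "R1")),
]

print("unscheduled:", count_cycles(unscheduled), "cycles")
print("scheduled:  ", count_cycles(scheduled), "cycles")
```

The model may place an individual stall in a different slot than the slide does, but the totals per iteration come out the same.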

29
Loop Unrolling
Four copies of loop

LD F0, 0(R1)
ADDD F4, F0, F2
SD F4 , 0(R1)
SUBI R1, R1, 8
BNE R1, R2, Loop
LD F0, 0(R1)
ADDD F4, F0, F2
SD F4 , 0(R1)
SUBI R1, R1, 8
BNE R1, R2, Loop
LD F0, 0(R1)
ADDD F4, F0, F2
SD F4 , 0(R1)
SUBI R1, R1, 8
BNE R1, R2, Loop
LD F0, 0(R1)

Four iteration code

Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD F4, 0(R1)
LD F6, -8(R1)
ADDD F8, F6, F2
SD F8, -8(R1)
LD F10, -16(R1)
ADDD F12, F10, F2
SD F12, -16(R1)
LD F14, -24(R1)
ADDD F16, F14, F2
SD F16, -24(R1)
SUBI R1, R1, 32
BNE R1, R2, Loop

Assumption: R1 is initially a multiple of 32, or the number of loop iterations is a multiple of 4
30
Loop Unroll Schedule

Loop: LD F0, 0(R1)
stall
ADDD F4, F0, F2
stall
stall
SD F4, 0(R1)
LD F6, -8(R1)
stall
ADDD F8, F6, F2
stall
stall
SD F8, -8(R1)
LD F10, -16(R1)
stall
ADDD F12, F10, F2
stall
stall
SD F12, -16(R1)
LD F14, -24(R1)

Schedule:
Loop: LD F0, 0(R1)
LD F6, -8(R1)
LD F10, -16(R1)
LD F14, -24(R1)
ADDD F4, F0, F2
ADDD F8, F6, F2
ADDD F12, F10, F2
ADDD F16, F14, F2
SD F4, 0(R1)
SD F8, -8(R1)
SD F12, -16(R1)
SUBI R1, R1, 32
BNE R1, R2, Loop
SD F16, 8(R1)
Scheduled: no stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?
Unscheduled: 28 clock cycles, or 7 per iteration. Can we minimize further?
31
Summary
Iteration: 10 cycles
After unrolling: 7 cycles
After scheduling: 6 cycles
After unrolling and scheduling: 3.5 cycles (no stalls)
32
Limits to Gains of Loop Unrolling

Decreasing benefit
A decrease in the amount of overhead amortized with each unroll
Example just considered:
Unrolled loop 4 times, no stall cycles; of the 14 cycles, 2 were loop overhead
If unrolled 8 times, the overhead is reduced from 1/2 cycle per iteration to 1/4
Code size limitations
Memory is at a premium
Larger size causes cache hit rate changes
Shortfall in registers (register pressure)
Increasing ILP leads to an increase in the number of live values; it may not be possible to allocate all the live values to registers
Compiler limitations: significant increase in complexity
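
The diminishing return from further unrolling can be sketched numerically. The model assumes the unrolled, scheduled loop has no stalls, 3 cycles of real work per original iteration (load, add, store), and 2 cycles of loop overhead (SUBI and BNE) per pass; the function names are illustrative.

```python
def overhead_per_iteration(unroll_factor, loop_overhead=2):
    """Loop-maintenance cycles charged to each original iteration."""
    return loop_overhead / unroll_factor


def cycles_per_iteration(unroll_factor, work_per_iter=3, loop_overhead=2):
    """Total cycles per original iteration, assuming no stall cycles."""
    total = unroll_factor * work_per_iter + loop_overhead
    return total / unroll_factor


for k in (4, 8, 16):
    print(f"unroll x{k:2d}: overhead {overhead_per_iteration(k):.3f}, "
          f"total {cycles_per_iteration(k):.3f} cycles per iteration")
```

Going from 4 to 8 copies saves a quarter cycle per iteration, and each further doubling saves half as much again, while code size keeps doubling.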

33
What if upper bound of the loop is unknown?

Suppose
Upper bound of the loop is n
Unroll the loop to make k copies of the body
Solution Generate pair of consecutive loops
First loop body same as original loop, execute
(n mod k) times
Second loop unrolled body (k copies of
original), iterate (n/k) times
For large values of n, most of the execution time
is spent in the unrolled loop body
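
The pair of consecutive loops can be sketched as follows, with k fixed at 4 and the running example of adding a scalar to every element. The function names are illustrative, and Python here only shows the control structure, not the performance effect.

```python
def add_scalar_simple(x, s):
    """Reference version: one element per iteration."""
    for i in range(len(x)):
        x[i] += s
    return x


def add_scalar_unrolled(x, s):
    """First loop runs (n mod 4) times with the original body; the second
    loop runs (n / 4) times with four copies of the body."""
    n = len(x)
    i = 0
    for _ in range(n % 4):       # leftover iterations
        x[i] += s
        i += 1
    for _ in range(n // 4):      # unrolled body, k = 4 copies
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += 4
    return x


data = [float(v) for v in range(10)]
print(add_scalar_unrolled(list(data), 1.5) == add_scalar_simple(list(data), 1.5))
```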

34
Summary Tricks of High Performance Processors

Out-of-order scheduling: to tolerate RAW hazard latency
Determine that the loads and stores can be exchanged, as loads and stores from different iterations are independent
This requires analyzing the memory addresses and finding that they do not refer to the same address
Find that it was OK to move the SD after the SUBI and BNE, and adjust the SD offset
Loop unrolling: increase scheduling scope for more latency tolerance
Find that loop unrolling is useful by finding that loop iterations are independent, except for the loop maintenance code
Eliminate extra tests and branches and adjust the loop maintenance code
Register renaming: remove WAR/WAW violations due to scheduling
Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations
Summary: schedule the code, preserving any dependences needed

35
Data Dependence

Data dependence
Indicates the possibility of a hazard
Determines the order in which results must be calculated
Sets an upper bound on how much parallelism can be exploited
But whether an actual hazard occurs, and the length of any stall, is determined by the pipeline
Dependence avoidance
Maintain the dependence but avoid the hazard: scheduling
Eliminate the dependence by transforming the code

36
Data Dependencies

1 Loop: LD F0, 0(R1)
2 ADDD F4, F0, F2
3 SUBI R1, R1, 8
4 BNE R1, R2, Loop ; delayed branch
5 SD F4, 8(R1) ; altered when moved past SUBI

37
Name Dependencies

Two instructions use the same name (register or memory location) but don't exchange data
Anti-dependence (WAR if a hazard for HW)
Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first
Output dependence (WAW if a hazard for HW)
Instruction i and instruction j write the same register or memory location; ordering between the instructions must be preserved
How to remove name dependencies?
They are not true dependencies

38
Register Renaming
Before renaming (WAW and WAR name dependences on F0 and F4):
1 Loop: LD F0, 0(R1)
2 ADDD F4, F0, F2
3 SD F4, 0(R1)
4 LD F0, -8(R1)
5 ADDD F4, F0, F2
6 SD F4, -8(R1)
7 LD F0, -16(R1)
8 ADDD F4, F0, F2
9 SD F4, -16(R1)
10 LD F0, -24(R1)
11 ADDD F4, F0, F2
12 SD F4, -24(R1)
13 SUBI R1, R1, 32
14 BNE R1, R2, LOOP

After renaming:
1 Loop: LD F0, 0(R1)
2 ADDD F4, F0, F2
3 SD F4, 0(R1)
4 LD F6, -8(R1)
5 ADDD F8, F6, F2
6 SD F8, -8(R1)
7 LD F10, -16(R1)
8 ADDD F12, F10, F2
9 SD F12, -16(R1)
10 LD F14, -24(R1)
11 ADDD F16, F14, F2
12 SD F16, -24(R1)
13 SUBI R1, R1, 32
14 BNE R1, R2, LOOP

No data is passed in F0, but can't reuse F0 in cycle 4.

Name Dependencies are Hard for Memory Accesses
Does 100(R4) = 20(R6)?
From different loop iterations, does 20(R6) = 20(R6)?
Our example required the compiler to know that if R1 doesn't change then 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
There were no dependencies between some loads and stores so they could be moved around
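
Register renaming itself can be sketched in a few lines. This is a simplified illustration, not the compiler's actual algorithm: every register write receives a fresh hypothetical name (T0, T1, ...), later reads follow the most recent name, and values that must survive past the end of the block are ignored.

```python
def rename_registers(instrs):
    """Give each written register a fresh name so that no two instructions
    write the same name (removing WAW) and no later write reuses a name
    that an earlier instruction still reads (removing WAR)."""
    current_name = {}   # architectural register -> latest fresh name
    fresh_count = 0
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = tuple(current_name.get(reg, reg) for reg in srcs)
        new_dest = dest
        if dest is not None:
            new_dest = f"T{fresh_count}"
            fresh_count += 1
            current_name[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Two unrolled iterations that reuse F0 and F4: (op, dest, sources)
body = [
    ("LD", "F0", ("R1",)), ("ADDD", "F4", ("F0", "F2")),
    ("SD", None, ("F4", "R1")),
    ("LD", "F0", ("R1",)), ("ADDD", "F4", ("F0", "F2")),
    ("SD", None, ("F4", "R1")),
]

for instr in rename_registers(body):
    print(instr)
```

After renaming, only the true (RAW) dependences remain: each ADDD still reads the result of its own LD, and each SD still reads the result of its own ADDD.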

39
Control Dependencies

Example
if p1 { S1; }
if p2 { S2; }
S1 is control dependent on p1; S2 is control dependent on p2 but not on p1
Two constraints:
An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch.
Control dependencies relaxed to get parallelism: loop unrolling