Title: CENG 450 Computer Systems and Architecture Lecture 7
1CENG 450 Computer Systems and Architecture, Lecture 7
- Amirali Baniasadi
- amirali_at_ece.uvic.ca
2This Lecture
- Pipelining
- ILP
- Scheduling
3Limits to pipelining
- Hazards: circumstances that would cause incorrect execution if the next instruction were launched
- Structural hazards: attempting to use the same hardware to do two different things at the same time
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
4Data Hazard Even with Forwarding
Figure: pipeline diagram; the horizontal axis is time (clock cycles)
5Resolving this load hazard
- Adding hardware? ... not
- Detection?
- Compilation techniques?
- What is the cost of load delays?
6Resolving the Load Data Hazard
Figure: pipeline diagram of instruction order against time (clock cycles), with a bubble inserted to resolve the load hazard
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
How is this different from the instruction issue stall?
7Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
- Slow code
- LW Rb,b
- LW Rc,c
- ADD Ra,Rb,Rc
- SW a,Ra
- LW Re,e
- LW Rf,f
- SUB Rd,Re,Rf
- SW d,Rd
- Fast code
- LW Rb,b
- LW Rc,c
- LW Re,e
- ADD Ra,Rb,Rc
- LW Rf,f
- SW a,Ra
- SUB Rd,Re,Rf
- SW d,Rd
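The difference between the two orderings can be checked with a small stall counter. This is only a sketch: it assumes a single load delay slot with full forwarding, so the only stall it models is an instruction that reads a register loaded by the instruction immediately before it. The function names are illustrative and not part of the lecture.

```python
def parse(line):
    """Split a line such as 'LW Rb,b' into (opcode, [operands])."""
    op, rest = line.split(None, 1)
    return op.upper(), [x.strip() for x in rest.split(",")]


def load_use_stalls(program):
    """Count one bubble each time an instruction reads the register that
    the immediately preceding LW loaded (assumed one-cycle load delay)."""
    stalls = 0
    prev_load_dest = None
    for line in program:
        op, operands = parse(line)
        if op == "SW":
            sources = [operands[1]]      # SW addr,Rs reads Rs
        elif op == "LW":
            sources = []                 # LW Rd,addr reads no register here
        else:
            sources = operands[1:]       # ADD/SUB Rd,Rs,Rt read Rs and Rt
        if prev_load_dest in sources:
            stalls += 1
        prev_load_dest = operands[0] if op == "LW" else None
    return stalls


slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]

for name, code in (("slow", slow), ("fast", fast)):
    stalls = load_use_stalls(code)
    print(f"{name}: {len(code)} instructions + {stalls} stall cycles")
```

Under these assumptions the slow ordering pays two load-use stalls (ADD after LW Rc, SUB after LW Rf) and the fast ordering pays none.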
8Instruction Set Connection
- What is exposed about this organizational hazard in the instruction set?
- k cycle delay?
- bad, CPI is not part of ISA
- k instruction slot delay
- load should not be followed by use of the value in the next k instructions
- Nothing, but code can reduce run-time delays
- MIPS did the transformation in the assembler
9Example Control Hazard
Figure: pipelined datapath (Instr Cache, Reg File, Sign Extend, Zero? test, Data Cache, MUXes, Next PC and Next SEQ PC; signals RS1, RS2, Imm, RD, WB Data). Instruction sequence shown: branch instruction, successor1, successor2, a further successor, then the branch target.
10Example Branch Stall Impact
- If 30% of instructions are branches, a 3-cycle stall is significant
- Two part solution
- Determine branch taken or not sooner, AND
- Compute taken branch address earlier
- MIPS branch tests if register = 0 or ≠ 0
- MIPS Solution
- Move Zero test to ID/RF stage
- Adder to calculate new PC in ID/RF stage
- 1 clock cycle penalty for branch versus 3
11Pipelined MIPS Datapath
Figure: pipelined MIPS datapath with stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back. Labels include Next PC, Next SEQ PC, Adder, Zero?, Reg File, Sign Extend, Data Memory, and "EXTRA HARDWARE".
- Data stationary control
- local decode for each instruction phase / pipeline stage
12Four Branch Hazard Alternatives
- 1: Stall until branch direction is clear
- 2: Predict Branch Not Taken
- Execute successor instructions in sequence
- Squash instructions in pipeline if branch actually taken
- 47% of MIPS branches not taken on average
- PC+4 already calculated, so use it to get next instruction
- 3: Predict Branch Taken
- 53% of MIPS branches taken on average
- But haven't calculated branch target address in MIPS
- MIPS still incurs 1 cycle branch penalty
- Other machines: branch target known before outcome
13Four Branch Hazard Alternatives
- 4: Delayed Branch
- Define branch to take place AFTER a following instruction
- branch instruction
- sequential successor 1
- sequential successor 2
- ........
- sequential successor n
- branch target if taken
- The n sequential successors form the branch delay of length n
- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
- MIPS uses this
14Delayed Branch
- Where to get instructions to fill branch delay slot?
- Before branch instruction
- From the target address: only valuable when branch taken
- From fall through: only valuable when branch not taken
- Compiler effectiveness for single branch delay slot
- Fills about 60% of branch delay slots
- About 80% of instructions executed in branch delay slots useful in computation
- About 50% (60% x 80%) of slots usefully filled
- Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
15RecallSpeed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1
16Example Evaluating Branch Alternatives
- Assume: conditional and unconditional branches = 14% of instructions, 65% of them change the PC
Scheduling scheme    Branch penalty   CPI    Speedup v. stall
Stall pipeline       3                1.42   1.0
Predict taken        1                1.14   1.26
Predict not taken    1                1.09   1.29
Delayed branch       0.5              1.07   1.31
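The CPI column above can be reproduced with a short calculation. This is a sketch under two assumptions: the base CPI is 1, and predict-not-taken pays its penalty only on the 65% of branches that change the PC (which is what the 1.09 figure implies). The helper name `branch_cpi` is illustrative.

```python
BRANCH_FREQ = 0.14       # conditional + unconditional branches
CHANGE_PC_FRACTION = 0.65  # fraction of branches that change the PC


def branch_cpi(penalty, fraction_paying=1.0, base_cpi=1.0):
    """CPI = base CPI + branch frequency x fraction paying x penalty."""
    return base_cpi + BRANCH_FREQ * fraction_paying * penalty


schemes = {
    "Stall pipeline":    branch_cpi(3),
    "Predict taken":     branch_cpi(1),
    "Predict not taken": branch_cpi(1, CHANGE_PC_FRACTION),
    "Delayed branch":    branch_cpi(0.5),
}

for name, cpi in schemes.items():
    print(f"{name:18s} CPI = {cpi:.2f}")
```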
17Example (A-24)
- In MIPS R4000, it takes 3 cycles to know the target address, and an extra cycle to resolve the condition. Effective CPI? (unconditional 4%, untaken 6%, taken 10%)
CYCLE PENALTY
Branch scheme     Uncond.   Untaken   Taken
Stall             2         3         3
Predict taken     2         3         2
Predict untaken   2         0         3
CPI PENALTY
Branch scheme     Uncond.   Untaken   Taken   Total
Stall             0.08      0.18      0.3     0.56
Predict taken     0.08      0.18      0.2     0.46
Predict untaken   0.08      0         0.3     0.38
- CPI: Stall 1.56, Predict taken 1.46, Predict untaken 1.38
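The CPI penalty figures are each branch class's frequency multiplied by its cycle penalty, summed over the three classes. A minimal sketch of that arithmetic, assuming a base CPI of 1 (the dictionary and function names are illustrative):

```python
# Branch frequencies from the example: unconditional 4%, untaken 6%, taken 10%.
FREQ = {"uncond": 0.04, "untaken": 0.06, "taken": 0.10}

# Cycle penalties per scheme and branch class, as given in the example.
PENALTY = {
    "Stall":           {"uncond": 2, "untaken": 3, "taken": 3},
    "Predict taken":   {"uncond": 2, "untaken": 3, "taken": 2},
    "Predict untaken": {"uncond": 2, "untaken": 0, "taken": 3},
}


def effective_cpi(scheme, base_cpi=1.0):
    """Base CPI plus the frequency-weighted branch penalty of one scheme."""
    cycles = PENALTY[scheme]
    return base_cpi + sum(FREQ[kind] * cycles[kind] for kind in FREQ)


for scheme in PENALTY:
    print(f"{scheme:16s} CPI = {effective_cpi(scheme):.2f}")
```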
18Summary of Pipelining Basics
- Hazards limit performance by preventing instructions from executing during their designated clock cycles
- Structural hazards: need more HW resources
- Data hazards: need forwarding, compiler scheduling
- Control hazards: early evaluation and PC, delayed branch, prediction
- Increasing length of pipe increases impact of hazards
- Pipelining helps instruction bandwidth, not latency
- Compilers reduce cost of data and control hazards
- Stalls increase CPI and decrease performance
19What Is an ILP?
- Principle: many instructions in the code do not depend on each other
- Result: possible to execute them in parallel
- ILP: potential overlap among instructions (so they can be evaluated in parallel)
- Issues
- Building compilers to analyze the code
- Building special/smarter hardware to handle the code
- ILP: increase the amount of parallelism exploited among instructions
- Seeks good results out of pipelining
20What Is ILP?
CODE A                 CODE B
LD  R1, (R2)100        LD  R1, (R2)100
ADD R4, R1             ADD R4, R1
SUB R5, R1             SUB R5, R4
CMP R1, R2             SW  R5, (R2)100
ADD R3, R1             LD  R1, (R2)100
- Code A: possible to execute 4 instructions in parallel.
- Code B: can't execute more than one instruction per cycle.
- Code A has higher ILP
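One way to see the difference is to place every instruction at the earliest level its dependences allow and then count how many instructions share a level. The sketch below makes several assumptions that are not stated in the slide: a two-operand instruction such as `ADD R4, R1` is taken to mean R4 = R4 + R1, CMP writes a separate `flags` name, and the memory word at (R2)100 is modeled as a single named location.

```python
def schedule_levels(program):
    """Give each instruction the earliest level permitted by its dependences.

    program is a list of (text, writes, reads) with writes/reads as sets.
    Instruction j is placed after any earlier instruction i with which it
    has a RAW, WAR, or WAW conflict.
    """
    levels = []
    for j, (_, writes_j, reads_j) in enumerate(program):
        level = 0
        for i in range(j):
            _, writes_i, reads_i = program[i]
            raw = writes_i & reads_j
            war = reads_i & writes_j
            waw = writes_i & writes_j
            if raw or war or waw:
                level = max(level, levels[i] + 1)
        levels.append(level)
    return levels


def max_parallel(program):
    """Largest number of instructions that share one level."""
    levels = schedule_levels(program)
    return max(levels.count(lv) for lv in set(levels))


MEM = "mem[R2+100]"   # assumed model of the location addressed by (R2)100

code_a = [
    ("LD  R1,(R2)100", {"R1"}, {"R2", MEM}),
    ("ADD R4,R1",      {"R4"}, {"R4", "R1"}),
    ("SUB R5,R1",      {"R5"}, {"R5", "R1"}),
    ("CMP R1,R2",      {"flags"}, {"R1", "R2"}),
    ("ADD R3,R1",      {"R3"}, {"R3", "R1"}),
]
code_b = [
    ("LD  R1,(R2)100", {"R1"}, {"R2", MEM}),
    ("ADD R4,R1",      {"R4"}, {"R4", "R1"}),
    ("SUB R5,R4",      {"R5"}, {"R5", "R4"}),
    ("SW  R5,(R2)100", {MEM},  {"R5", "R2"}),
    ("LD  R1,(R2)100", {"R1"}, {"R2", MEM}),
]

print("Code A levels:", schedule_levels(code_a), "max parallel:", max_parallel(code_a))
print("Code B levels:", schedule_levels(code_b), "max parallel:", max_parallel(code_b))
```

With these assumptions, the four instructions after the load in Code A land on the same level, while Code B forms a single chain.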
21 Out of Order Execution
Programmer: instructions execute in-order
Processor: instructions may execute in any order if results remain the same at the end
Out-of-Order
B: ADD R3, R4
C: ADD R3, R5
A: LD R1, (R2)
D: CMP R3, R1
22Scheduling
- Scheduling: re-arranging instructions to maximize performance
- Requires knowledge about structure of processor
- Static scheduling: done by compiler
- Example
- for (i=1000; i>0; i--) x[i] = x[i] + s;
- Dynamic scheduling: done by hardware
- Dominates server and desktop markets (Pentium III and 4, MIPS R10000/12000, UltraSPARC III, PowerPC 603, etc.)
23Pipeline Scheduling
Compiler schedules (moves) instructions to reduce stalls
Ex: code sequence a = b + c, d = e - f
Before scheduling
lw Rb, b
lw Rc, c
Add Ra, Rb, Rc //stall
sw a, Ra
lw Re, e
lw Rf, f
sub Rd, Re, Rf //stall
sw d, Rd
After scheduling
lw Rb, b
lw Rc, c
lw Re, e
Add Ra, Rb, Rc
lw Rf, f
sw a, Ra
sub Rd, Re, Rf
sw d, Rd
24Basic Pipeline Scheduling
- To avoid pipeline stall
- A dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency
- Compiler's ability depends on
- Amount of ILP available in the program
- Latencies of the functional units in the pipeline
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
25Pipeline Scheduling Loop Unrolling
- Basic Block
- Set of instructions between entry points and between branches. A basic block has only one entry and one exit
- Typically 4 to 7 instructions
- Amount of overlap << 4 to 7 instructions
- To obtain substantial performance enhancements: exploit ILP across multiple basic blocks
- Loop Level Parallelism
- Parallelism that exists within a loop: limited opportunity
- Parallelism can cross loop iterations!
- Techniques to convert loop-level parallelism to instruction-level parallelism
- Loop unrolling: the compiler's or the hardware's ability to exploit the parallelism inherent in the loop
26Assumptions
- Five-stage integer pipeline
- Branches have delay of one clock cycle
- ID stage: comparisons done, decisions made and PC loaded
- No structural hazards
- Functional units are fully pipelined or replicated (as many times as the pipeline depth)
- FP latencies
- Integer load latency: 1
- Integer ALU operation latency: 0
27Simple Loop Assembler Equivalent
- x[i] and s are double/floating point type
- R1 initially address of array element with the highest address
- F2 contains the scalar value s
- Register R2 is pre-computed so that 8(R2) is the last element to operate on
- for (i=1000; i>0; i--) x[i] = x[i] + s;
- Loop: LD F0, 0(R1) ; F0 = array element
- ADDD F4, F0, F2 ; add scalar in F2
- SD F4, 0(R1) ; store result
- SUBI R1, R1, 8 ; decrement pointer 8 bytes (DW)
- BNE R1, R2, Loop ; branch R1 != R2
28Where are the stalls?
- Unscheduled
- Loop: LD F0, 0(R1)
- stall
- ADDD F4, F0, F2
- stall
- stall
- SD F4, 0(R1)
- SUBI R1, R1, 8
- stall
- BNE R1, R2, Loop
- stall
- 10 clock cycles
- Can we minimize?
- Scheduled
- Loop: LD F0, 0(R1)
- SUBI R1, R1, 8
- ADDD F4, F0, F2
- stall
- BNE R1, R2, Loop
- SD F4, 8(R1)
- 6 clock cycles
- 3 cycles actual work; 3 cycles overhead
- Can we minimize further?
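The two cycle counts above can be reproduced with a small in-order issue model. This is a sketch under assumptions: the latency table is inferred from the stall pattern shown in the unscheduled sequence (one stall from LD to ADDD, two from ADDD to SD, one from SUBI to BNE), every other producer/consumer pair needs no stall, and a branch with nothing after it pays one cycle for its empty delay slot. The names `LATENCY`, `ready`, and `count_cycles` are illustrative.

```python
# Stall cycles required between a producer opcode and a dependent consumer.
# Assumed values, inferred from the stalls shown in the example.
LATENCY = {("LD", "ADDD"): 1, ("ADDD", "SD"): 2, ("SUBI", "BNE"): 1}


def ready(op, sources, prev_cycle, produced):
    """Earliest cycle at which op can issue after prev_cycle."""
    cycle = prev_cycle + 1
    for reg in sources:
        if reg in produced:
            prod_op, prod_cycle = produced[reg]
            cycle = max(cycle, prod_cycle + LATENCY.get((prod_op, op), 0) + 1)
    return cycle


def count_cycles(program):
    """program: list of (opcode, dest_or_None, source_registers)."""
    produced = {}   # register -> (producing opcode, issue cycle)
    cycle = 0
    for idx, (op, dest, sources) in enumerate(program):
        issue = ready(op, sources, cycle, produced)
        if op == "BNE" and idx + 1 < len(program):
            # The delay-slot instruction must directly follow the branch,
            # so hold the branch until that instruction can issue next.
            nxt_op, _, nxt_src = program[idx + 1]
            issue = max(issue, ready(nxt_op, nxt_src, issue, produced) - 1)
        cycle = issue
        if dest is not None:
            produced[dest] = (op, cycle)
    if program[-1][0] == "BNE":
        cycle += 1  # empty branch delay slot
    return cycle


unscheduled = [
    ("LD",   "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD",   None, ["F4", "R1"]),
    ("SUBI", "R1", ["R1"]),
    ("BNE",  None, ["R1", "R2"]),
]
scheduled = [
    ("LD",   "F0", ["R1"]),
    ("SUBI", "R1", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("BNE",  None, ["R1", "R2"]),
    ("SD",   None, ["F4", "R1"]),   # fills the branch delay slot
]

print("unscheduled:", count_cycles(unscheduled), "cycles")
print("scheduled:  ", count_cycles(scheduled), "cycles")
```

The model only tracks one pass through the loop body; it does not carry state across iterations.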
29Loop Unrolling
Four copies of loop
- LD F0, 0(R1)
- ADDD F4, F0, F2
- SD F4, 0(R1)
- SUBI R1, R1, 8
- BNE R1, R2, Loop
- LD F0, 0(R1)
- ADDD F4, F0, F2
- SD F4, 0(R1)
- SUBI R1, R1, 8
- BNE R1, R2, Loop
- LD F0, 0(R1)
- ADDD F4, F0, F2
- SD F4, 0(R1)
- SUBI R1, R1, 8
- BNE R1, R2, Loop
- LD F0, 0(R1)
- ADDD F4, F0, F2
- SD F4, 0(R1)
- SUBI R1, R1, 8
- BNE R1, R2, Loop
Four iteration code
- Loop: LD F0, 0(R1)
- ADDD F4, F0, F2
- SD F4, 0(R1)
- LD F6, -8(R1)
- ADDD F8, F6, F2
- SD F8, -8(R1)
- LD F10, -16(R1)
- ADDD F12, F10, F2
- SD F12, -16(R1)
- LD F14, -24(R1)
- ADDD F16, F14, F2
- SD F16, -24(R1)
- SUBI R1, R1, 32
- BNE R1, R2, Loop
Assumption: R1 is initially a multiple of 32, or the number of loop iterations is a multiple of 4
30Loop Unroll Schedule
- Loop: LD F0, 0(R1)
- stall
- ADDD F4, F0, F2
- stall
- stall
- SD F4, 0(R1)
- LD F6, -8(R1)
- stall
- ADDD F8, F6, F2
- stall
- stall
- SD F8, -8(R1)
- LD F10, -16(R1)
- stall
- ADDD F12, F10, F2
- stall
- stall
- SD F12, -16(R1)
- LD F14, -24(R1)
- Unscheduled (above): 28 clock cycles, or 7 per iteration. Can we minimize further?
Scheduled
- Loop: LD F0, 0(R1)
- LD F6, -8(R1)
- LD F10, -16(R1)
- LD F14, -24(R1)
- ADDD F4, F0, F2
- ADDD F8, F6, F2
- ADDD F12, F10, F2
- ADDD F16, F14, F2
- SD F4, 0(R1)
- SD F8, -8(R1)
- SD F12, -16(R1)
- SUBI R1, R1, 32
- BNE R1, R2, Loop
- SD F16, 8(R1)
- No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?
31Summary
- Original loop: 10 cycles per iteration
- Scheduling only: 6 cycles per iteration
- Unrolling only: 7 cycles per iteration
- Unrolling and scheduling: 3.5 cycles per iteration (no stalls)
32Limits to Gains of Loop Unrolling
- Decreasing benefit
- A decrease in the amount of overhead amortized with each unroll
- Example just considered
- Unrolled loop 4 times, no stall cycles; of 14 cycles, 2 were loop overhead
- If unrolled 8 times, the overhead is reduced from 1/2 cycle per iteration to 1/4
- Code size limitations
- Memory is premium
- Larger size causes cache hit rate changes
- Shortfall in registers (register pressure): increasing ILP leads to an increase in the number of live values; it may not be possible to allocate all the live values to registers
- Compiler limitations: significant increase in complexity
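The decreasing benefit can be put in numbers. In the example, 14 cycles covered 4 iterations and 2 of them were loop overhead (SUBI and BNE), leaving 3 cycles of work per iteration. The sketch below assumes the unrolled body can always be scheduled with no stalls, which the lecture shows only for an unroll factor of 4; the function names are illustrative.

```python
def overhead_per_iteration(unroll_factor, overhead_cycles=2):
    """Loop-maintenance cycles charged to each original iteration."""
    return overhead_cycles / unroll_factor


def cycles_per_iteration(unroll_factor, work_cycles=3, overhead_cycles=2):
    """Work cycles plus amortized overhead, assuming no stall cycles."""
    return work_cycles + overhead_per_iteration(unroll_factor, overhead_cycles)


for k in (4, 8, 16):
    print(f"unroll x{k:2d}: overhead {overhead_per_iteration(k):.3f}, "
          f"total {cycles_per_iteration(k):.3f} cycles per iteration")
```

Each doubling of the unroll factor halves an overhead that is already small, so the gain from 4 to 8 is only a quarter of a cycle per iteration.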
33What if upper bound of the loop is unknown?
- Suppose
- Upper bound of the loop is n
- Unroll the loop to make k copies of the body
- Solution: generate a pair of consecutive loops
- First loop: body same as original loop, execute (n mod k) times
- Second loop: unrolled body (k copies of original), iterate (n/k) times
- For large values of n, most of the execution time is spent in the unrolled loop body
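The pair of loops can be sketched for the running example x[i] = x[i] + s with k = 4. Python will not run this faster than the plain loop; the point is only the structure of the transformation. The function name and the constant `K` are illustrative.

```python
K = 4  # unroll factor, fixed here because the body below is written out 4 times


def add_scalar_unrolled(x, s):
    """Add s to every element of x in place using a 4-way unrolled loop."""
    n = len(x)
    i = 0
    # First loop: original body, executed (n mod k) times.
    for _ in range(n % K):
        x[i] += s
        i += 1
    # Second loop: unrolled body (k copies), iterated (n / k) times.
    for _ in range(n // K):
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += K
    return x


print(add_scalar_unrolled([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 0.5))
```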
34Summary Tricks of High Performance Processors
- Out-of-order scheduling: to tolerate RAW hazard latency
- Determine that the loads and stores can be exchanged, as loads and stores from different iterations are independent
- This requires analyzing the memory addresses and finding that they do not refer to the same address
- Find that it was ok to move the SD after the SUBI and BNE, and adjust the SD offset
- Loop unrolling: increase scheduling scope for more latency tolerance
- Find that loop unrolling is useful by finding that loop iterations are independent, except for the loop maintenance code
- Eliminate extra tests and branches and adjust the loop maintenance code
- Register renaming: remove WAR/WAW violations due to scheduling
- Use different registers to avoid unnecessary constraints that would be forced by using same registers for different computations
- Summary: schedule the code, preserving any dependences needed
35Data Dependence
- Data dependence
- Indicates the possibility of a hazard
- Determines the order in which results must be calculated
- Sets upper bound on how much parallelism can be exploited
- But, actual hazard and length of any stall is determined by pipeline
- Dependence avoidance
- Maintain the dependence but avoid hazard: scheduling
- Eliminate dependence by transforming the code
36Data Dependencies
- 1 Loop: LD F0, 0(R1)
- 2 ADDD F4, F0, F2
- 3 SUBI R1, R1, 8
- 4 BNE R1, R2, Loop ; delayed branch
- 5 SD F4, 8(R1) ; altered when moved past SUBI
37Name Dependencies
- Two instructions use same name (register or memory location) but don't exchange data
- Anti-dependence (WAR if a hazard for HW)
- Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first
- Output dependence (WAW if a hazard for HW)
- Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved
- How to remove name dependencies?
- They are not true dependencies
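The three kinds of dependence can be told apart from the read and write sets of two instructions. The sketch below considers registers only (memory locations are ignored) and uses three instructions from the unrolled loop before renaming; the function name `classify` is illustrative.

```python
def classify(i, j):
    """Classify the dependences between instruction i and a later instruction j.

    Each instruction is a (writes, reads) pair of register-name sets.
    RAW is a true data dependence; WAR (anti) and WAW (output) are
    name dependences.
    """
    writes_i, reads_i = i
    writes_j, reads_j = j
    kinds = set()
    if writes_i & reads_j:
        kinds.add("RAW")
    if reads_i & writes_j:
        kinds.add("WAR")
    if writes_i & writes_j:
        kinds.add("WAW")
    return kinds


ld_first = ({"F0"}, {"R1"})          # LD   F0, 0(R1)
addd = ({"F4"}, {"F0", "F2"})        # ADDD F4, F0, F2
ld_second = ({"F0"}, {"R1"})         # LD   F0, -8(R1)

print(classify(ld_first, addd))        # true dependence through F0
print(classify(addd, ld_second))       # anti-dependence on F0
print(classify(ld_first, ld_second))   # output dependence on F0
```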
38Register Renaming
Before renaming (WAW and WAR name dependences on F0 and F4)
1 Loop: LD F0, 0(R1)
2 ADDD F4, F0, F2
3 SD F4, 0(R1)
4 LD F0, -8(R1)
5 ADDD F4, F0, F2
6 SD F4, -8(R1)
7 LD F0, -16(R1)
8 ADDD F4, F0, F2
9 SD F4, -16(R1)
10 LD F0, -24(R1)
11 ADDD F4, F0, F2
12 SD F4, -24(R1)
13 SUBI R1, R1, 32
14 BNE R1, R2, LOOP
No data is passed in F0, but can't reuse F0 in cycle 4.
After renaming
1 Loop: LD F0, 0(R1)
2 ADDD F4, F0, F2
3 SD F4, 0(R1)
4 LD F6, -8(R1)
5 ADDD F8, F6, F2
6 SD F8, -8(R1)
7 LD F10, -16(R1)
8 ADDD F12, F10, F2
9 SD F12, -16(R1)
10 LD F14, -24(R1)
11 ADDD F16, F14, F2
12 SD F16, -24(R1)
13 SUBI R1, R1, 32
14 BNE R1, R2, LOOP
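The renaming step can be sketched as a single pass over the code: the first write to a register keeps its name, every later write to the same register is given a fresh one, and later reads follow the most recent name. This is an illustration only, not how any particular compiler does it; the pool of fresh names is chosen to match the registers used in the lecture's renamed listing, and the tuple layout is hypothetical.

```python
def rename(program, fresh_names):
    """Remove WAR/WAW name dependences by giving each redefinition a new register.

    program: list of (opcode, dest_or_None, sources); address operands such
    as '0(R1)' are plain strings and are left untouched.
    """
    fresh = iter(fresh_names)
    current = {}    # original register -> name currently holding its value
    renamed = []
    for op, dest, sources in program:
        sources = [current.get(s, s) for s in sources]
        if dest is not None:
            # First definition keeps its name; later ones get a fresh register.
            current[dest] = next(fresh) if dest in current else dest
            dest = current[dest]
        renamed.append((op, dest, sources))
    return renamed


# The four unrolled copies before renaming: every copy reuses F0 and F4.
program = []
for offset in (0, -8, -16, -24):
    addr = f"{offset}(R1)"
    program += [
        ("LD", "F0", [addr]),
        ("ADDD", "F4", ["F0", "F2"]),
        ("SD", None, ["F4", addr]),
    ]

renamed = rename(program, ["F6", "F8", "F10", "F12", "F14", "F16"])
for op, dest, sources in renamed:
    operands = ([dest] if dest else []) + sources
    print(op, ", ".join(operands))
```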
- Name dependencies are hard for memory accesses
- Does 100(R4) = 20(R6)?
- From different loop iterations, does 20(R6) = 20(R6)?
- Our example required the compiler to know that if R1 doesn't change then 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
- There were no dependencies between some loads and stores, so they could be moved around
39Control Dependencies
- Example
- if p1 {S1;}
- if p2 {S2;}
- S1 is control dependent on p1; S2 is control dependent on p2 but not on p1
- Two constraints
- An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
- An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch.
- Control dependencies are relaxed to get parallelism: loop unrolling