CENG 450 Computer Systems and Architecture, Lecture 7

1
CENG 450 Computer Systems and Architecture
Lecture 7
  • Amirali Baniasadi
  • amirali@ece.uvic.ca

2
This Lecture
  • Pipelining
  • ILP
  • Scheduling

3
Limits to pipelining
  • Hazards: circumstances that would cause incorrect
    execution if the next instruction were launched
  • Structural hazards: attempting to use the same
    hardware to do two different things at the same
    time
  • Data hazards: an instruction depends on the result of a
    prior instruction still in the pipeline
  • Control hazards: caused by the delay between the
    fetching of instructions and decisions about
    changes in control flow (branches and jumps)

4
Data Hazard Even with Forwarding
[Pipeline timing diagram; time axis in clock cycles: the load's result is still not available early enough for the immediately dependent instruction, even with forwarding.]
5
Resolving this load hazard
  • Adding hardware? ... not
  • Detection?
  • Compilation techniques?
  • What is the cost of load delays?

6
Resolving the Load Data Hazard
[Pipeline timing diagram; time axis in clock cycles. Instruction order:
  lw  r1, 0(r2)
  sub r4, r1, r6
  and r6, r1, r7
  or  r8, r1, r9
A bubble is inserted so the dependent sub does not reach the ALU until the loaded value is available from DMem.]
How is this different from the instruction issue
stall?
7
Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c
  d = e - f
assuming a, b, c, d, e, and f are in memory.

Slow code:                    Fast code:
  LW  Rb, b                     LW  Rb, b
  LW  Rc, c                     LW  Rc, c
  ADD Ra, Rb, Rc  //stall       LW  Re, e
  SW  a, Ra                     ADD Ra, Rb, Rc
  LW  Re, e                     LW  Rf, f
  LW  Rf, f                     SW  a, Ra
  SUB Rd, Re, Rf  //stall       SUB Rd, Re, Rf
  SW  d, Rd                     SW  d, Rd
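Not from the slides: a minimal, self-contained C sketch (the names Ins and count_stalls are illustrative) that mimics the payoff of this scheduling by counting load-use stalls for the two orderings above, assuming a single load delay slot and full forwarding otherwise.

  #include <stdio.h>

  typedef struct { int is_load, dst, src1, src2; } Ins;   /* 0 = "no register" */

  /* A stall occurs when an instruction reads the register that the
     immediately preceding instruction loaded from memory. */
  static int count_stalls(const Ins *code, int n) {
      int stalls = 0;
      for (int i = 1; i < n; i++)
          if (code[i - 1].is_load &&
              (code[i].src1 == code[i - 1].dst || code[i].src2 == code[i - 1].dst))
              stalls++;
      return stalls;
  }

  int main(void) {
      enum { Ra = 1, Rb, Rc, Rd, Re, Rf };
      Ins slow[] = {
          {1, Rb, 0, 0},    /* LW  Rb, b                 */
          {1, Rc, 0, 0},    /* LW  Rc, c                 */
          {0, Ra, Rb, Rc},  /* ADD Ra, Rb, Rc  <- stall  */
          {0, 0,  Ra, 0},   /* SW  a, Ra                 */
          {1, Re, 0, 0},    /* LW  Re, e                 */
          {1, Rf, 0, 0},    /* LW  Rf, f                 */
          {0, Rd, Re, Rf},  /* SUB Rd, Re, Rf  <- stall  */
          {0, 0,  Rd, 0},   /* SW  d, Rd                 */
      };
      Ins fast[] = {
          {1, Rb, 0, 0}, {1, Rc, 0, 0}, {1, Re, 0, 0}, {0, Ra, Rb, Rc},
          {1, Rf, 0, 0}, {0, 0, Ra, 0}, {0, Rd, Re, Rf}, {0, 0, Rd, 0},
      };
      printf("slow: %d stalls, fast: %d stalls\n",
             count_stalls(slow, 8), count_stalls(fast, 8));   /* prints 2 and 0 */
      return 0;
  }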

8
Instruction Set Connection
  • What is exposed about this organizational hazard
    in the instruction set?
  • A k-cycle delay?
  • Bad: CPI is not part of the ISA
  • A k-instruction slot delay?
  • A load should not be followed by a use of the value
    in the next k instructions
  • Nothing, but code scheduling can reduce run-time delays
  • MIPS did the transformation in the assembler

9
Example Control Hazard
[Pipelined datapath diagram: instruction cache, register file, ALU with Zero? test, data cache, and write-back. The instructions after a branch (successor1, successor2, ...) are fetched before the branch outcome and target address are known.]
10
Example Branch Stall Impact
  • If 30% of instructions are branches, a 3-cycle stall is
    significant
  • Two-part solution:
  • Determine whether the branch is taken or not sooner, AND
  • Compute the taken-branch target address earlier
  • MIPS branches test whether a register is = 0 or ≠ 0
  • MIPS solution:
  • Move the zero test to the ID/RF stage
  • Add an adder to calculate the new PC in the ID/RF stage
  • 1 clock cycle penalty for branches instead of 3

11
Pipelined MIPS Datapath
[Pipelined MIPS datapath diagram with stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back. EXTRA HARDWARE (an adder and a Zero? test) in the ID/RF stage computes the branch target and resolves the branch early.]
  • Data stationary control: local decode for each instruction phase /
    pipeline stage

12
Four Branch Hazard Alternatives
  • #1: Stall until the branch direction is clear
  • #2: Predict Branch Not Taken
  • Execute successor instructions in sequence
  • Squash instructions in the pipeline if the branch is
    actually taken
  • 47% of MIPS branches are not taken on average
  • PC+4 is already calculated, so use it to fetch the next
    instruction
  • #3: Predict Branch Taken
  • 53% of MIPS branches are taken on average
  • But the branch target address has not yet been calculated in
    MIPS
  • MIPS still incurs a 1-cycle branch penalty
  • Other machines: branch target known before outcome

13
Four Branch Hazard Alternatives
  • #4: Delayed Branch
  • Define the branch to take place AFTER a following
    instruction
  •   branch instruction
        sequential successor1
        sequential successor2
        ........
        sequential successorn
        branch target if taken
  • The n sequential successor slots form a branch delay of length n
  • A 1-slot delay allows a proper decision and the branch
    target address in a 5-stage pipeline
  • MIPS uses this
14
Delayed Branch
  • Where do we get instructions to fill the branch delay
    slot?
  • From before the branch instruction
  • From the target address: only valuable when the
    branch is taken
  • From the fall-through path: only valuable when the branch is not
    taken
  • Compiler effectiveness for a single branch delay
    slot:
  • Fills about 60% of branch delay slots
  • About 80% of instructions executed in branch
    delay slots are useful in the computation
  • About 50% (60% x 80%) of slots are usefully filled
  • Delayed branch downside: less attractive with 7-8 stage
    pipelines and multiple instructions issued per clock
    (superscalar)

15
Recall: Speedup Equation for Pipelining
For a simple RISC pipeline, ideal CPI = 1
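The equation itself did not survive this transcript; for an ideal CPI of 1, the usual form, consistent with the numbers on the next slide, is:

\[ \text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}} \]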
16
Example Evaluating Branch Alternatives
  • Assume: conditional + unconditional branches are 14% of
    instructions, and 65% of them change the PC

  Scheduling scheme     Branch penalty    CPI     Speedup vs. stall
  Stall pipeline              3           1.42          1.0
  Predict taken               1           1.14          1.26
  Predict not taken           1           1.09          1.29
  Delayed branch              0.5         1.07          1.31
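A worked check of the CPI column (arithmetic added here, not on the slide): CPI is the ideal CPI of 1 plus the branch frequency times the effective penalty, and predict-not-taken pays the penalty only for the 65% of branches that are taken:

\[ \text{CPI}_{\text{stall}} = 1 + 0.14 \times 3 = 1.42, \qquad \text{CPI}_{\text{not taken}} = 1 + 0.14 \times 0.65 \times 1 \approx 1.09 \]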

17
Example (A-24)
  • In the MIPS R4000, it takes 3 cycles to know the
    target address and an extra cycle to resolve the
    condition. What is the effective CPI? (Branch frequencies:
    unconditional 4%, untaken 6%, taken 10%.)

  Cycle penalty
  Branch scheme      Uncond. penalty   Untaken penalty   Taken penalty
  Stall                    2                 3                 3
  Predict taken            2                 3                 2
  Predict untaken          2                 0                 3

  CPI penalty
  Branch scheme      Uncond.   Untaken   Taken   Total
  Stall               0.08      0.18      0.3     0.56
  Predict taken       0.08      0.18      0.2     0.46
  Predict untaken     0.08      0         0.3     0.38

  • CPI: Stall 1.56, Predict Taken 1.46, Predict Untaken 1.38
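Worked arithmetic behind the totals (added here, not on the slide): each CPI-penalty entry is a branch frequency times its cycle penalty, and the effective CPI adds the total to the base CPI of 1, e.g. for the stall scheme:

\[ \text{CPI}_{\text{stall}} = 1 + 0.04 \times 2 + 0.06 \times 3 + 0.10 \times 3 = 1.56 \]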
18
Summary of Pipelining Basics
  • Hazards limit performance by preventing
    instructions from executing during their
    designated clock cycles
  • Structural hazards: need more HW resources
  • Data hazards: need forwarding and compiler
    scheduling
  • Control hazards: early evaluation of the PC, delayed
    branch, prediction
  • Increasing the length of the pipe increases the impact of
    hazards
  • Pipelining helps instruction bandwidth (throughput), not
    latency
  • Compilers reduce the cost of data and control hazards
  • Stalls increase CPI and decrease performance

19
What Is ILP?
  • Principle: many instructions in the code do not
    depend on each other
  • Result: it is possible to execute them in parallel
  • ILP: potential overlap among instructions (so
    they can be evaluated in parallel)
  • Issues:
  • Building compilers to analyze the code
  • Building special/smarter hardware to handle the
    code
  • Exploiting ILP: increase the amount of parallelism
    exploited among instructions
  • Seeks to get good results out of pipelining

20
What Is ILP?
  CODE A                   CODE B
  LD  R1, 100(R2)          LD  R1, 100(R2)
  ADD R4, R1               ADD R4, R1
  SUB R5, R1               SUB R5, R4
  CMP R1, R2               SW  R5, 100(R2)
  ADD R3, R1               LD  R1, 100(R2)
  • Code A: possible to execute 4 instructions in
    parallel.
  • Code B: can't execute more than one instruction
    per cycle.
  • Code A has higher ILP

21
Out of Order Execution
  • Programmer: instructions execute in order
  • Processor: instructions may execute in any order,
    as long as the results remain the same at the end

Out-of-order execution (labels show original program order):
  B: ADD R3, R4
  C: ADD R3, R5
  A: LD  R1, (R2)
  D: CMP R3, R1
22
Scheduling
  • Scheduling: re-arranging instructions to maximize
    performance
  • Requires knowledge about the structure of the processor
  • Static scheduling: done by the compiler
  • Example:
  •   for (i = 1000; i > 0; i--)  x[i] = x[i] + s;
  • Dynamic scheduling: done by hardware
  • Dominates the server and desktop markets (Pentium
    III and 4, MIPS R10000/12000, UltraSPARC III,
    PowerPC 603, etc.)

23
Pipeline Scheduling
The compiler schedules (moves) instructions to reduce
stalls. Example code sequence: a = b + c, d = e - f

Before scheduling:            After scheduling:
  lw  Rb, b                     lw  Rb, b
  lw  Rc, c                     lw  Rc, c
  Add Ra, Rb, Rc  //stall       lw  Re, e
  sw  a, Ra                     Add Ra, Rb, Rc
  lw  Re, e                     lw  Rf, f
  lw  Rf, f                     sw  a, Ra
  sub Rd, Re, Rf  //stall       sub Rd, Re, Rf
  sw  d, Rd                     sw  d, Rd
24
Basic Pipeline Scheduling
  • To avoid a pipeline stall:
  • A dependent instruction must be separated from
    the source instruction by a distance in clock
    cycles equal to the pipeline latency of that source instruction
  • The compiler's ability to schedule depends on:
  • The amount of ILP available in the program
  • The latencies of the functional units in the pipeline
  • Pipeline CPI = Ideal pipeline CPI + Structural
    stalls + Data hazard stalls + Control stalls
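As a worked instance (added here; the numbers come from the branch-alternative example a few slides back), a pipeline with no structural or data stalls and delayed branches would have:

\[ \text{CPI} = 1 + 0 + 0 + (0.14 \times 0.5) = 1.07 \]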

25
Pipeline Scheduling Loop Unrolling
  • Basic block:
  • A set of instructions between entry points and
    between branches. A basic block has only one
    entry and one exit
  • Typically 4 to 7 instructions
  • So the amount of overlap available within one block is much less
    than 4 to 7 instructions
  • To obtain substantial performance enhancements, we must
    exploit ILP across multiple basic blocks
  • Loop-level parallelism:
  • Parallelism that exists within a loop body is a limited
    opportunity
  • But parallelism can cross loop iterations!
  • Techniques to convert loop-level parallelism to
    instruction-level parallelism:
  • Loop unrolling: by the compiler, or the hardware's
    ability to exploit the parallelism inherent in
    the loop

26
Assumptions
  • Five-stage integer pipeline
  • Branches have a delay of one clock cycle
  • ID stage: comparisons done, decision made, and PC
    loaded
  • No structural hazards
  • Functional units are fully pipelined or
    replicated (as many times as the pipeline depth)
  • FP latencies: the latency table did not survive the
    transcript; the stall counts on the following slides imply a
    latency of 1 cycle from an FP load to a dependent FP ALU op
    and 2 cycles from an FP ALU op to a dependent store

Integer load latency: 1; integer ALU operation latency: 0
27
Simple Loop: Assembler Equivalent
  • The x[i] values are double-precision floating point
  • R1 initially holds the address of the array element with the
    highest address
  • F2 contains the scalar value s
  • Register R2 is pre-computed so that 8(R2) is the
    last element to operate on
  • for (i = 1000; i > 0; i--)  x[i] = x[i] + s;
  • Loop: LD   F0, 0(R1)     ; F0 = array element
  •       ADDD F4, F0, F2    ; add scalar in F2
  •       SD   F4, 0(R1)     ; store result
  •       SUBI R1, R1, 8     ; decrement pointer by 8 bytes (DW)
  •       BNE  R1, R2, Loop  ; branch if R1 != R2

28
Where are the stalls?
  • Unscheduled:
  • Loop: LD   F0, 0(R1)
  •       stall
  •       ADDD F4, F0, F2
  •       stall
  •       stall
  •       SD   F4, 0(R1)
  •       SUBI R1, R1, 8
  •       stall
  •       BNE  R1, R2, Loop
  •       stall
  • 10 clock cycles per iteration
  • Can we minimize?
  • Scheduled:
  • Loop: LD   F0, 0(R1)
  •       SUBI R1, R1, 8
  •       ADDD F4, F0, F2
  •       stall
  •       BNE  R1, R2, Loop
  •       SD   F4, 8(R1)    ; offset adjusted because SUBI moved above
  • 6 clock cycles per iteration
  • 3 cycles of actual work + 3 cycles of overhead
  • Can we minimize further?

29
Loop Unrolling
Four copies of the loop body (before cleanup):
  LD   F0, 0(R1)
  ADDD F4, F0, F2
  SD   F4, 0(R1)
  SUBI R1, R1, 8
  BNE  R1, R2, Loop
  LD   F0, 0(R1)
  ADDD F4, F0, F2
  SD   F4, 0(R1)
  SUBI R1, R1, 8
  BNE  R1, R2, Loop
  LD   F0, 0(R1)
  ADDD F4, F0, F2
  SD   F4, 0(R1)
  SUBI R1, R1, 8
  BNE  R1, R2, Loop
  LD   F0, 0(R1)
  ...

Four-iteration (unrolled) code:
  Loop: LD   F0, 0(R1)
        ADDD F4, F0, F2
        SD   F4, 0(R1)
        LD   F6, -8(R1)
        ADDD F8, F6, F2
        SD   F8, -8(R1)
        LD   F10, -16(R1)
        ADDD F12, F10, F2
        SD   F12, -16(R1)
        LD   F14, -24(R1)
        ADDD F16, F14, F2
        SD   F16, -24(R1)
        SUBI R1, R1, 32
        BNE  R1, R2, Loop

Assumption: R1 is initially a multiple of 32, or the
number of loop iterations is a multiple of 4
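Not on the slide: the same 4-way unrolling expressed at the C level (a sketch; the function name is illustrative, and it assumes, as the slide does, that the iteration count is a multiple of 4):

  /* 4-way unrolled version of: for (i = 1000; i > 0; i--) x[i] = x[i] + s; */
  void scale_add(double *x, double s) {      /* x has valid indices 1..1000 */
      for (int i = 1000; i > 0; i -= 4) {    /* iteration count is a multiple of 4 */
          x[i]     += s;
          x[i - 1] += s;
          x[i - 2] += s;
          x[i - 3] += s;
      }
  }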
30
Loop Unroll Schedule
Unrolled, before scheduling:
  Loop: LD   F0, 0(R1)
        stall
        ADDD F4, F0, F2
        stall
        stall
        SD   F4, 0(R1)
        LD   F6, -8(R1)
        stall
        ADDD F8, F6, F2
        stall
        stall
        SD   F8, -8(R1)
        LD   F10, -16(R1)
        stall
        ADDD F12, F10, F2
        stall
        stall
        SD   F12, -16(R1)
        LD   F14, -24(R1)
        ...
28 clock cycles, or 7 per iteration. Can we minimize further?

Unrolled and scheduled:
  Loop: LD   F0, 0(R1)
        LD   F6, -8(R1)
        LD   F10, -16(R1)
        LD   F14, -24(R1)
        ADDD F4, F0, F2
        ADDD F8, F6, F2
        ADDD F12, F10, F2
        ADDD F16, F14, F2
        SD   F4, 0(R1)
        SD   F8, -8(R1)
        SD   F12, -16(R1)
        SUBI R1, R1, 32
        BNE  R1, R2, Loop
        SD   F16, 8(R1)    ; in the branch delay slot, offset adjusted after SUBI
No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?
31
Summary
  Original loop iteration:              10 cycles
  After scheduling:                      6 cycles
  After unrolling (4x):                  7 cycles per iteration
  After unrolling and scheduling:        3.5 cycles per iteration (no stalls)
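One consequence of these numbers (arithmetic added here, not on the slide): unrolling plus scheduling buys roughly

\[ \frac{10 \text{ cycles/iteration}}{3.5 \text{ cycles/iteration}} \approx 2.9\times \]

over the original unscheduled loop.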
32
Limits to Gains of Loop Unrolling
  • Decreasing benefit:
  • Each additional unroll amortizes a smaller amount of the loop
    overhead
  • In the example just considered:
  • Unrolled 4 times with no stall cycles: of the 14
    cycles, 2 were loop overhead
  • If unrolled 8 times, the overhead drops from
    1/2 cycle per iteration to 1/4
  • Code size limitations:
  • Memory is at a premium
  • Larger code size changes the cache hit rate
  • Shortfall in registers (register pressure):
    increasing ILP increases the number of
    live values; it may not be possible to allocate all
    the live values to registers
  • Compiler limitations: significant increase in
    complexity

33
What if upper bound of the loop is unknown?
  • Suppose:
  • The upper bound of the loop is n
  • We unroll the loop to make k copies of the body
  • Solution: generate a pair of consecutive loops (see the C sketch below)
  • First loop: body the same as the original loop, executed
    (n mod k) times
  • Second loop: unrolled body (k copies of the
    original), iterated (n/k) times
  • For large values of n, most of the execution time
    is spent in the unrolled loop body
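Not from the slide: a minimal C sketch of this pair-of-loops idea for k = 4 and an unknown upper bound n (the array, scalar, and function name are illustrative assumptions):

  /* process elements x[1..n], unrolled by k = 4, for arbitrary n */
  void scale_add_n(double *x, double s, int n) {
      int i = n;
      /* first loop: the original body, executed (n mod 4) times */
      for (; i % 4 != 0; i--)
          x[i] += s;
      /* second loop: 4 copies of the body, iterated n/4 times */
      for (; i > 0; i -= 4) {
          x[i]     += s;
          x[i - 1] += s;
          x[i - 2] += s;
          x[i - 3] += s;
      }
  }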

34
Summary: Tricks of High-Performance Processors
  • Out-of-order scheduling: to tolerate RAW hazard
    latency
  • Determine that the loads and stores can be
    exchanged, because loads and stores from different
    iterations are independent
  • This requires analyzing the memory addresses and
    finding that they do not refer to the same
    address
  • Find that it is OK to move the SD after the SUBI
    and BNE, and adjust the SD offset
  • Loop unrolling: increase the scheduling scope for
    more latency tolerance
  • Find that loop unrolling is useful by finding
    that loop iterations are independent, except for
    the loop maintenance code
  • Eliminate the extra tests and branches and adjust the
    loop maintenance code
  • Register renaming: remove WAR/WAW violations caused
    by scheduling
  • Use different registers to avoid the unnecessary
    constraints that would be forced by using the same
    registers for different computations
  • Summary: schedule the code while preserving any
    dependences that are needed

35
Data Dependence
  • A data dependence:
  • Indicates the possibility of a hazard
  • Determines the order in which results must be
    calculated
  • Sets an upper bound on how much parallelism can be
    exploited
  • But the actual hazard and the length of any stall are
    determined by the pipeline
  • Dependence avoidance:
  • Maintain the dependence but avoid the hazard:
    scheduling
  • Eliminate the dependence by transforming the code

36
Data Dependencies
  • 1  Loop: LD   F0, 0(R1)
  • 2        ADDD F4, F0, F2
  • 3        SUBI R1, R1, 8
  • 4        BNE  R1, R2, Loop   ; delayed branch
  • 5        SD   F4, 8(R1)      ; offset altered when moved past SUBI

37
Name Dependencies
  • Two instructions use the same name (register or
    memory location) but don't exchange data
  • Anti-dependence (a WAR hazard for the HW):
  • Instruction j writes a register or memory
    location that instruction i reads from, and
    instruction i is executed first
  • Output dependence (a WAW hazard for the HW):
  • Instruction i and instruction j write the same
    register or memory location; the ordering between the
    instructions must be preserved
  • How can we remove name dependencies?
  • They are not true dependencies (rename the registers; see next slide)

38
Register Renaming
Before renaming (WAW and WAR dependences on F0 and F4):
  1  Loop: LD   F0, 0(R1)
  2        ADDD F4, F0, F2
  3        SD   F4, 0(R1)
  4        LD   F0, -8(R1)
  5        ADDD F4, F0, F2
  6        SD   F4, -8(R1)
  7        LD   F0, -16(R1)
  8        ADDD F4, F0, F2
  9        SD   F4, -16(R1)
  10       LD   F0, -24(R1)
  11       ADDD F4, F0, F2
  12       SD   F4, -24(R1)
  13       SUBI R1, R1, 32
  14       BNE  R1, R2, LOOP

After renaming:
  1  Loop: LD   F0, 0(R1)
  2        ADDD F4, F0, F2
  3        SD   F4, 0(R1)
  4        LD   F6, -8(R1)
  5        ADDD F8, F6, F2
  6        SD   F8, -8(R1)
  7        LD   F10, -16(R1)
  8        ADDD F12, F10, F2
  9        SD   F12, -16(R1)
  10       LD   F14, -24(R1)
  11       ADDD F16, F14, F2
  12       SD   F16, -24(R1)
  13       SUBI R1, R1, 32
  14       BNE  R1, R2, LOOP

No data is passed in F0 between iterations, but F0 cannot be reused in
instruction 4 of the un-renamed code without creating a name dependence.
  • Name dependencies are hard for memory accesses:
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?
  • Our example required the compiler to know that if R1
    doesn't change, then 0(R1) ≠ -8(R1) ≠
    -16(R1) ≠ -24(R1). There were no dependences
    between some loads and stores, so they could be
    moved around

39
Control Dependencies
  • Example:
  •   if (p1) { S1; }
  •   if (p2) { S2; }
  • S1 is control dependent on p1; S2 is control
    dependent on p2 but not on p1
  • Two constraints:
  • An instruction that is control dependent on a
    branch cannot be moved before the branch, so
    that its execution is no longer controlled by the
    branch
  • An instruction that is not control dependent on a
    branch cannot be moved to after the branch, so
    that its execution is controlled by the branch
  • Control dependencies are relaxed to get
    parallelism, e.g. by loop unrolling