Static Code Scheduling - PowerPoint PPT Presentation

Provided by: KimHaz

1
Static Code Scheduling
  • CS 671
  • April 1, 2008

2
Code Scheduling
  • Scheduling or reordering instructions to improve
    performance and/or guarantee correctness
  • Important for dynamically-scheduled architectures
  • Crucial (assumed!) for statically-scheduled
    architectures, e.g. VLIW or EPIC
  • Takes into account anticipated latencies
  • Machine-specific, performed later in the
    optimization pass
  • How does this contrast with our earlier
    exploration of code motion?

3
Why Must the Compiler Schedule?
  • Many machines are pipelined and expose some
    aspects of pipelining to the user (compiler)
  • Examples
  • Branch delay slots!
  • Memory-access delays
  • Multi-cycle operations
  • Some machines don't have scheduling hardware

4
Example
  • Assume loads take 2 cycles and branches have a
    delay slot.
  • ____cycles

5
Example
  • Assume loads take 2 cycles and branches have a
    delay slot.
  • ____cycles

6
Code Scheduling Strategy
  • Get resources operating in parallel
  • Integer data path
  • Integer multiply / divide hardware
  • FP adder, multiplier, divider
  • Method
  • Fill with computations that do not require result
    or same hardware resources
  • Drawbacks
  • Highly hardware dependent

7
Scheduling Approaches
  • Local
  • Branch scheduling
  • Basic-block scheduling
  • Global
  • Cross-block scheduling
  • Software pipelining
  • Trace scheduling
  • Percolation scheduling

8
Branch Scheduling
  • Two problems
  • Branches often take some number of cycles to
    complete
  • There can be a delay between a compare and its
    associated branch
  • A compiler will try to fill these slots with
    valid instructions (rather than nop)
  • Delay slots present in PA-RISC, SPARC, MIPS
  • Condition delay: PowerPC, Pentium

9
Recall from Architecture
  • IF: Instruction Fetch
  • ID: Instruction Decode
  • EX: Execute
  • MA: Memory Access
  • WB: Write Back

[Figure: three consecutive instructions overlapped in the pipeline, each passing through IF ID EX MA WB one cycle behind the previous one]

10
Control Hazards
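[Figure: pipeline diagram of a taken branch. The taken branch passes through IF ID EX MA WB; the next sequential instruction (Instr 1) is fetched and then squashed (IF --- --- --- ---); Branch Target and Branch Target + 1 then pass through IF ID EX MA WB.]
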
11
Data Dependences
  • If two operations access the same register, they
    are dependent
  • Types of data dependences

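Flow:    r1 = r2 op r3    r4 = r1 op 6    (r1 is written, then read)
Anti:    r1 = r2 op r3    r2 = r5 op 6    (r2 is read, then overwritten)
Output:  r1 = r2 op r3    r1 = r4 op 6    (r1 is written twice)
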
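The three dependence types can be told apart mechanically by comparing the registers each instruction reads and writes. The following is a minimal sketch, not part of the original slides; the `Instr` record and the `dependences` helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction

def dependences(first: Instr, second: Instr) -> Set[str]:
    """Kinds of register dependence from `first` to a later `second`."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("flow")    # second reads what first wrote
    if second.dest in first.srcs:
        kinds.add("anti")    # second overwrites what first read
    if first.dest == second.dest:
        kinds.add("output")  # both write the same register
    return kinds

i1 = Instr("r1", ("r2", "r3"))
print(dependences(i1, Instr("r4", ("r1",))))  # {'flow'}
print(dependences(i1, Instr("r2", ("r5",))))  # {'anti'}
print(dependences(i1, Instr("r1", ("r4",))))  # {'output'}
```

Only flow dependences carry a value from one instruction to the next; anti and output dependences come from reusing a register name.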
12
Data Hazards
Memory latency: data not ready

lw R1,0(R2)     IF    ID    EX    MA    WB
add R3,R1,R4          IF    ID    stall EX    MA    WB

13
Data Hazards
Instruction latency: execute takes > 1 cycle

addf R3,R1,R2   IF    ID    EX    EX    MA    WB
addf R3,R3,R4         IF    ID    stall EX    EX    MA    WB

Assumes floating-point ops take 2 execute cycles

14
Multi-cycle Instructions
  • Scheduling is particularly important for
    multi-cycle operations
  • Alpha instructions with > 1 cycle latency (partial
    list)
  • mull (32-bit integer multiply): 8
  • mulq (64-bit integer multiply): 16
  • addt (fp add): 4
  • mult (fp multiply): 4
  • divs (fp single-precision divide): 10
  • divt (fp double-precision divide): 23

15
Avoiding data hazards
  • Move loads earlier and stores later (assuming
    this does not violate correctness)
  • Other stalls may require more sophisticated
    re-ordering, e.g. ((ab)c)d becomes (ab)(cd)
  • How can we do this in a systematic way?
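To see why the re-ordering of ((ab)c)d into (ab)(cd) helps, compare the longest chain of dependent operations in each form. The sketch below is an illustration only: expressions are modeled as nested tuples, every operation is assumed to take one cycle, and `depth` and `balance` are hypothetical helper names.

```python
def depth(expr):
    """Longest chain of dependent operations in a nested-tuple expression."""
    if isinstance(expr, str):   # a leaf operand is available immediately
        return 0
    left, right = expr
    return 1 + max(depth(left), depth(right))

def balance(operands):
    """Reassociate a chain of one associative operator into a balanced tree."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return (balance(operands[:mid]), balance(operands[mid:]))

chain = ((("a", "b"), "c"), "d")        # ((ab)c)d
tree = balance(["a", "b", "c", "d"])    # (ab)(cd)
print(tree)                             # (('a', 'b'), ('c', 'd'))
print(depth(chain), depth(tree))        # 3 2
```

The balanced form exposes two independent operations that can overlap. The transformation is only valid when the operator is associative, which floating-point arithmetic in general is not.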

16
Example Without Scheduling
  • Assume
  • memory instrs take 3 cycles
  • mult takes 2 cycles (to have result in register)
  • rest take 1 cycle
  • ____cycles

17
Basic Block Dependence DAGs
  • Nodes - instructions
  • Edges - dependence between I1 and I2
  • When we cannot determine whether there is a
    dependence, we must assume there is one
  • a) lw R2, (R1)
  • b) lw R3, (R1) 4
  • c) R4 ← R2 + R3
  • d) R5 ← R2 - 1

[Figure: dependence DAG with edges a → c, b → c, and a → d, each labeled with latency 2]

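A dependence DAG for the four instructions a) through d) can be built by remembering the most recent writer of each register. This is a minimal sketch that tracks flow dependences through registers only; anti, output, and memory dependences would add further ordering edges. The tuple layout and the name `build_dag` are assumptions for illustration, and the load latency of 2 is taken from the edge labels in the slide's figure.

```python
def build_dag(instrs):
    """instrs: list of (name, dest, srcs, latency).
    Returns {(producer, consumer): latency} for register flow dependences."""
    edges = {}
    last_writer = {}    # register -> (instruction name, latency)
    for name, dest, srcs, latency in instrs:
        for reg in srcs:
            if reg in last_writer:
                producer, producer_latency = last_writer[reg]
                edges[(producer, name)] = producer_latency
        last_writer[dest] = (name, latency)
    return edges

block = [
    ("a", "R2", ("R1",), 2),        # lw R2, (R1)
    ("b", "R3", ("R1",), 2),        # lw R3, (R1) 4
    ("c", "R4", ("R2", "R3"), 1),   # R4 <- R2 + R3
    ("d", "R5", ("R2",), 1),        # R5 <- R2 - 1
]
print(build_dag(block))
# {('a', 'c'): 2, ('b', 'c'): 2, ('a', 'd'): 2}
```

Each edge weight is the latency of the producing instruction, that is, how long the consumer must wait after the producer issues.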
18
Example: Build the DAG
  • Assume
  • memory instrs take 3 cycles
  • mult takes 2 cycles (to have result in register)
  • rest take 1 cycle

19
Creating a schedule
  • Create a DAG of dependences
  • Determine priority
  • Schedule instructions with
  • Ready operands
  • Highest priority
  • Heuristics: if there are multiple possibilities,
    fall back on other priority functions

20
Operation Priority
  • Priority: need a mechanism to decide which ops
    to schedule first (when you have choices)
  • Common priority functions
  • Height: distance from exit node
  • Give priority to amount of work left to do
  • Slackness: inversely proportional to slack
  • Give priority to ops on the critical path
  • Register use: priority to nodes with more source
    operands and fewer destination operands
  • Reduces number of live registers
  • Uncover: high priority to nodes with many
    children
  • Frees up more nodes
  • Original order: when all else fails

21
Computing Priorities
  • Height(n) =
      • exec(n), if n is a leaf
      • max(Height(m)) + exec(n) otherwise, taken over
        all successors m of n
  • Critical path(s): path through the dependence
    DAG with the longest latency
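The recursive definition of Height(n) translates directly into a memoized function. The sketch below is illustrative only: `heights`, `succs`, and `exec_time` are hypothetical names, and the sample data reuses the small four-node DAG from the dependence-DAG slide with loads taking 2 cycles and other operations 1.

```python
from functools import lru_cache

def heights(succs, exec_time):
    """succs: {node: [successor nodes]}; exec_time: {node: cycles}.
    Returns {node: height} following the recursive definition."""
    @lru_cache(maxsize=None)
    def height(n):
        below = [height(m) for m in succs.get(n, ())]
        # A leaf contributes only its own execution time.
        return exec_time[n] + (max(below) if below else 0)
    return {n: height(n) for n in exec_time}

succs = {"a": ["c", "d"], "b": ["c"]}
exec_time = {"a": 2, "b": 2, "c": 1, "d": 1}
h = heights(succs, exec_time)
print(h)                 # {'a': 3, 'b': 3, 'c': 1, 'd': 1}
print(max(h.values()))   # 3
```

The largest height is the length of the critical path, and the nodes that achieve it are where critical paths begin.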

22
Example: Determine Height and CP
  • Assume
  • memory instrs take 3 cycles
  • mult takes 2 cycles (to have result in register)
  • rest take 1 cycle
  • Critical path: _______

23
Example: List Scheduling
_____cycles
24
Scheduling vs. Register Allocation
25
Register Renaming
26
VLIW
  • Very Long Instruction Word
  • Compiler determines exactly what is issued every
    cycle (before the program is run)
  • Schedules also account for latencies
  • All hardware changes result in a compiler change
  • Usually embedded systems (hence simple HW)
  • Itanium is actually an EPIC-style machine
    (accounts for most parallelism, not latencies)

27
Sample VLIW code
  • VLIW processor: 5 issue
  • 2 Add/Sub units (1 cycle)
  • 1 Mul/Div unit (2 cycle, unpipelined)
  • 1 LD/ST unit (2 cycle, pipelined)
  • 1 Branch unit (no delay slots)

Add/Sub     Add/Sub     Mul/Div     Ld/St       Branch
c = a + b   d = a - b   e = a * b   ld j x      nop
g = c + d   h = c - d   nop         ld k y      nop
nop         nop         i = j * c   ld f z      br g

28
Multi-Issue Scheduling Example
  • Machine: 2 issue, 1 memory port, 1 ALU
  • Memory port: 2 cycles, non-pipelined
  • ALU: 1 cycle

[Table: RU_map, with columns time | ALU | MEM and one row per time step 0-9]
[Table: Schedule, with columns time | Ready | Placed and one row per time step 0-9]

29
Earliest / Latest Sets
  • Machine: 2 issue, 1 memory port, 1 ALU
  • Memory port: 2 cycles, pipelined
  • ALU: 1 cycle

[Figure: dependence graph over operations 1-10; operations 1, 2, 4, and 9 are memory operations (marked m)]

30
List Scheduling Algorithm
  • Build dependence graph, calculate priority
  • Add all ops to UNSCHEDULED set
  • time = 0
  • while (UNSCHEDULED is not empty)
      • time++
      • READY = UNSCHEDULED ops whose incoming deps
        have been satisfied
      • Sort READY using priority function
      • For each op in READY (highest to lowest
        priority)
          • op can be scheduled at current time?
            (resources free?)
          • Yes: schedule it, op.issue_time = time
              • Mark resources busy in RU_map relative to
                issue time
              • Remove op from UNSCHEDULED/READY sets
          • No: continue
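The algorithm above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the course's reference implementation: cycles are numbered from 0 and the counter is advanced at the end of each iteration, each op is described by a hypothetical `(unit, latency, busy)` triple where `busy` is how many cycles it occupies its unit, and ties in priority are broken by name.

```python
def list_schedule(ops, preds, priority, units):
    """ops: {name: (unit, latency, busy)}; preds: {name: [predecessors]};
    priority: {name: number}; units: {unit: count}.
    Returns {name: issue_time}."""
    issue = {}
    ru_map = {}               # (time, unit) -> number of units in use
    unscheduled = set(ops)
    time = 0
    while unscheduled:
        # READY: every predecessor has issued and its result is available.
        ready = [o for o in unscheduled
                 if all(p in issue and issue[p] + ops[p][1] <= time
                        for p in preds.get(o, ()))]
        ready.sort(key=lambda o: (-priority[o], o))
        for o in ready:
            unit, _, busy = ops[o]
            slots = [(time + k, unit) for k in range(busy)]
            if all(ru_map.get(s, 0) < units[unit] for s in slots):
                for s in slots:           # mark resources busy
                    ru_map[s] = ru_map.get(s, 0) + 1
                issue[o] = time
                unscheduled.remove(o)
        time += 1
    return issue

# Two loads feeding two ALU ops on a machine with one ALU and one memory port.
ops = {"a": ("MEM", 2, 1), "b": ("MEM", 2, 1),
       "c": ("ALU", 1, 1), "d": ("ALU", 1, 1)}
preds = {"c": ["a", "b"], "d": ["a"]}
priority = {"a": 3, "b": 3, "c": 1, "d": 1}   # heights of the DAG nodes
schedule = list_schedule(ops, preds, priority, {"ALU": 1, "MEM": 1})
print(sorted(schedule.items(), key=lambda kv: kv[1]))
# [('a', 0), ('b', 1), ('d', 2), ('c', 3)]
```

Here `d` issues before `c` because its only operand is ready one cycle earlier; the priority function only decides between ops that are ready in the same cycle.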

31
Improving Basic Block Scheduling
  • Loop unrolling creates longer basic blocks
  • Register renaming can change register usage in
    blocks to remove immediate reuse of registers
  • Summary
  • Static scheduling complements (or replaces)
    dynamic scheduling by the hardware
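Register renaming within a basic block can be sketched as giving every write a fresh name and rewriting later reads to match. The code below is a simplified illustration, not a production allocator: it assumes an unlimited supply of fresh names `t0, t1, ...` that do not clash with existing registers, and it ignores values that are live out of the block, which would need to be copied back. `rename` is a hypothetical helper name.

```python
def rename(block):
    """block: list of (dest, srcs). Returns the block with every destination
    replaced by a fresh name and later sources rewritten accordingly."""
    current = {}    # original register -> most recent fresh name
    renamed = []
    for n, (dest, srcs) in enumerate(block):
        new_srcs = tuple(current.get(s, s) for s in srcs)
        fresh = f"t{n}"
        current[dest] = fresh
        renamed.append((fresh, new_srcs))
    return renamed

block = [
    ("r1", ("r2", "r3")),   # reads r2
    ("r2", ("r5",)),        # overwrites r2: anti dependence on the line above
    ("r4", ("r1", "r2")),   # reads both results
]
print(rename(block))
# [('t0', ('r2', 'r3')), ('t1', ('r5',)), ('t2', ('t0', 't1'))]
```

After renaming, the first two instructions no longer share a register, so the scheduler is free to reorder them; only the flow dependences into the last instruction remain.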