Title: Static Code Scheduling
1Static Code Scheduling
2Code Scheduling
- Scheduling or reordering instructions to improve performance and/or guarantee correctness
  - Important for dynamically-scheduled architectures
  - Crucial (assumed!) for statically-scheduled architectures, e.g. VLIW or EPIC
- Takes into account anticipated latencies
  - Machine-specific, performed later in the optimization pass
- How does this contrast with our earlier exploration of code motion?
3Why Must the Compiler Schedule?
- Many machines are pipelined and expose some aspects of pipelining to the user (compiler)
- Examples
  - Branch delay slots!
  - Memory-access delays
  - Multi-cycle operations
- Some machines don't have scheduling hardware
4Example
- Assume loads take 2 cycles and branches have a delay slot.
- ____ cycles
5Example
- Assume loads take 2 cycles and branches have a delay slot.
- ____ cycles
6Code Scheduling Strategy
- Get resources operating in parallel
  - Integer data path
  - Integer multiply / divide hardware
  - FP adder, multiplier, divider
- Method
  - Fill delays with computations that do not require the pending result or the same hardware resources
- Drawbacks
  - Highly hardware dependent
7Scheduling Approaches
- Local
  - Branch scheduling
  - Basic-block scheduling
- Global
  - Cross-block scheduling
  - Software pipelining
  - Trace scheduling
  - Percolation scheduling
8Branch Scheduling
- Two problems
  - Branches often take some number of cycles to complete
  - There can be a delay between a compare and its associated branch
- A compiler will try to fill these slots with valid instructions (rather than nop)
- Delay slots: present in PA-RISC, SPARC, MIPS
- Condition delay: PowerPC, Pentium
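Filling a branch delay slot can be sketched as follows. This is a minimal illustration, not a production pass: it assumes a single, always-executed delay slot, checks register dependences only (memory operations would need the same conservative treatment), and the `Instr` type, `fill_delay_slot` function, and MIPS-like instruction text are hypothetical names chosen for the example.

```python
from typing import List, NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    text: str                 # printable form, for illustration only
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read

def independent(a: Instr, b: Instr) -> bool:
    """True if a and b share no flow, anti, or output register dependence."""
    return not (
        (a.dest is not None and (a.dest in b.srcs or a.dest == b.dest))
        or (b.dest is not None and b.dest in a.srcs)
    )

def fill_delay_slot(block: List[Instr]) -> List[Instr]:
    """block is [..., branch, nop]; replace the nop with an earlier instruction if legal."""
    *body, branch, nop = block
    # Search backwards for an instruction that can move past everything after it.
    for i in range(len(body) - 1, -1, -1):
        cand = body[i]
        if all(independent(cand, later) for later in body[i + 1:] + [branch]):
            return body[:i] + body[i + 1:] + [branch, cand]
    return block  # nothing can move: keep the nop

nop = Instr("nop", None, ())
block = [
    Instr("add r1,r2,r3", "r1", ("r2", "r3")),
    Instr("sub r4,r5,r6", "r4", ("r5", "r6")),
    Instr("beq r4,r0,L", None, ("r4", "r0")),
    nop,
]
for instr in fill_delay_slot(block):
    print(instr.text)
```

Here the `sub` feeds the branch condition and must stay put, so the independent `add` is the one moved into the slot.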
9Recall from Architecture
- IF: Instruction Fetch
- ID: Instruction Decode
- EX: Execute
- MA: Memory access
- WB: Write back
[Figure: pipeline diagram of three instructions, each passing through IF, ID, EX, MA, WB.]
10Control Hazards
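[Figure: pipeline diagram of a taken branch. The taken branch passes through IF, ID, EX, MA, WB; Instr 1 is fetched (IF) and then squashed; Branch Target and Branch Target + 1 then pass through IF, ID, EX, MA, WB.]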
11Data Dependences
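- If two operations access the same register (and at least one of them writes it), they are dependent
- Types of data dependences
  - Output: r1 = r2 op r3, then r1 = r4 op 6 (both write r1)
  - Anti: r1 = r2 op r3, then r2 = r5 op 6 (r2 is read, then overwritten)
  - Flow: r1 = r2 op r3, then r4 = r1 op 6 (r1 is written, then read)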
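The three dependence types can be checked mechanically from each operation's destination and source registers. The sketch below is illustrative only; the `Op` type and `dependences` function are hypothetical names, and only register operands are modeled.

```python
from typing import NamedTuple, Optional, Set, Tuple

class Op(NamedTuple):
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read

def dependences(first: Op, second: Op) -> Set[str]:
    """Kinds of register dependence from `first` to a later `second`."""
    kinds = set()
    if first.dest is not None and first.dest in second.srcs:
        kinds.add("flow")     # read after write
    if second.dest is not None and second.dest in first.srcs:
        kinds.add("anti")     # write after read
    if first.dest is not None and first.dest == second.dest:
        kinds.add("output")   # write after write
    return kinds

base = Op("r1", ("r2", "r3"))                 # r1 = r2 op r3
print(dependences(base, Op("r1", ("r4",))))   # second op writes r1 again
print(dependences(base, Op("r2", ("r5",))))   # second op overwrites r2
print(dependences(base, Op("r4", ("r1",))))   # second op reads r1
```

Returning a set allows for a pair of operations that is dependent in more than one way at once, e.g. `r1 = r1 op 1` followed by `r1 = r1 op 2`.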
12Data Hazards
Memory latency: data not ready
lw  R1,0(R2)   IF    ID    EX    MA    WB
add R3,R1,R4         IF    ID    stall EX    MA    WB
13Data Hazards
Instruction latency: execute takes > 1 cycle
addf R3,R1,R2   IF    ID    EX    EX    MA    WB
addf R3,R3,R4         IF    ID    stall EX    EX    MA    WB
Assumes floating point ops take 2 execute cycles
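The stalls in both data-hazard cases can be counted with a small model. This is a sketch under simplifying assumptions: a single-issue, in-order machine where an instruction issues once all of its source registers are ready, and a result becomes ready `latency` cycles after issue. The tuple encoding and function names are hypothetical.

```python
from typing import List, Optional, Tuple

# (destination register or None, source registers, result latency in cycles)
Instr = Tuple[Optional[str], Tuple[str, ...], int]

def issue_cycles(block: List[Instr]) -> List[int]:
    """Issue cycle of each instruction on an in-order, single-issue machine."""
    ready_at = {}     # register -> first cycle its value can be used
    next_free = 0     # earliest cycle the next instruction may issue
    cycles = []
    for dest, srcs, latency in block:
        start = max([next_free] + [ready_at.get(s, 0) for s in srcs])
        cycles.append(start)
        if dest is not None:
            ready_at[dest] = start + latency
        next_free = start + 1
    return cycles

def stalls(block: List[Instr]) -> int:
    """Total stall cycles: issue slots lost relative to one instruction per cycle."""
    return issue_cycles(block)[-1] - (len(block) - 1)

load_use = [("R1", ("R2",), 2), ("R3", ("R1", "R4"), 1)]       # lw ; add
fp_chain = [("R3", ("R1", "R2"), 2), ("R3", ("R3", "R4"), 2)]  # addf ; addf
filled = [("R1", ("R2",), 2), ("R5", ("R6", "R7"), 1), ("R3", ("R1", "R4"), 1)]
print(stalls(load_use), stalls(fp_chain), stalls(filled))
```

The third block places an independent instruction between the load and its use, which is exactly what a scheduler tries to do to hide the latency.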
14Multi-cycle Instructions
- Scheduling is particularly important for multi-cycle operations
- Alpha instructions with > 1 cycle latency (partial list)
  - mull (32-bit integer multiply): 8
  - mulq (64-bit integer multiply): 16
  - addt (fp add): 4
  - mult (fp multiply): 4
  - divs (fp single-precision divide): 10
  - divt (fp double-precision divide): 23
15Avoiding data hazards
- Move loads earlier and stores later (assuming this does not violate correctness)
- Other stalls may require more sophisticated re-ordering, e.g. ((a+b)+c)+d becomes (a+b)+(c+d)
- How can we do this in a systematic way?
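To see why re-associating a chain of operations helps, the sketch below computes the earliest cycle at which an expression's value is available, assuming every operation has the same latency, all inputs are ready at cycle 0, and there are enough functional units to run independent operations in parallel. The nested-tuple encoding and the `finish_time` name are hypothetical. Note that re-association is only legal when it preserves the result; floating-point addition, for instance, is not associative.

```python
def finish_time(expr, latency=1):
    """Earliest cycle at which `expr` is available.

    `expr` is either a leaf name (a str, ready at cycle 0) or a pair
    (left, right) standing for one binary operation on two subexpressions.
    """
    if isinstance(expr, str):
        return 0
    left, right = expr
    return max(finish_time(left, latency), finish_time(right, latency)) + latency

chain = ((("a", "b"), "c"), "d")        # three dependent operations in a row
balanced = (("a", "b"), ("c", "d"))     # two independent operations, then one

# With a 4-cycle operation, the chain needs three back-to-back latencies
# while the balanced form needs only two.
print(finish_time(chain, latency=4), finish_time(balanced, latency=4))
```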
16Example Without Scheduling
- Assume
  - memory instrs take 3 cycles
  - mult takes 2 cycles (to have result in register)
  - rest take 1 cycle
- ____ cycles
17Basic Block Dependence DAGS
- Nodes: instructions
- Edges: dependence between I1 and I2
- When we cannot determine whether there is a dependence, we must assume there is one
- a) lw R2, (R1)
- b) lw R3, (R1) + 4
- c) R4 ← R2 + R3
- d) R5 ← R2 - 1
[Figure: dependence DAG with edges a → c, b → c, and a → d, each labeled 2.]
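Building the dependence DAG for a basic block can be sketched as a pairwise comparison of instructions. The code below is an illustration under stated assumptions: flow edges carry the producer's latency, anti/output edges and possible memory conflicts are given a delay of 1 (an assumption of this sketch), and any store is conservatively treated as dependent on every other memory operation because no alias analysis is done. `Instr` and `build_dag` are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    name: str
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read
    latency: int              # cycles until the result is usable
    is_load: bool = False
    is_store: bool = False

def build_dag(block: List[Instr]) -> Dict[Tuple[str, str], int]:
    """Map (earlier, later) -> edge delay for every dependent pair in the block."""
    edges = {}
    for j, later in enumerate(block):
        for earlier in block[:j]:
            delay = None
            if earlier.dest is not None and earlier.dest in later.srcs:
                delay = earlier.latency           # flow: wait for the result
            elif (earlier.dest is not None and earlier.dest == later.dest) or (
                later.dest is not None and later.dest in earlier.srcs
            ):
                delay = 1                         # output/anti: assumed 1 cycle
            elif (earlier.is_store and (later.is_load or later.is_store)) or (
                earlier.is_load and later.is_store
            ):
                delay = 1                         # may alias: assume dependent
            if delay is not None:
                edges[(earlier.name, later.name)] = delay
    return edges

# The four-instruction example, with loads assumed to take 2 cycles.
block = [
    Instr("a", "R2", ("R1",), 2, is_load=True),
    Instr("b", "R3", ("R1",), 2, is_load=True),
    Instr("c", "R4", ("R2", "R3"), 1),
    Instr("d", "R5", ("R2",), 1),
]
print(build_dag(block))
```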
18Example: Build the DAG
Assume: memory instrs take 3 cycles, mult takes 2 cycles (to have result in register), rest take 1 cycle
19Creating a schedule
- Create a DAG of dependences
- Determine priority
- Schedule instructions with
  - Ready operands
  - Highest priority
- Heuristics: if there are multiple possibilities, fall back on other priority functions
20Operation Priority
- Priority: need a mechanism to decide which ops to schedule first (when you have choices)
- Common priority functions
  - Height: distance from exit node
    - Gives priority to amount of work left to do
  - Slackness: inversely proportional to slack
    - Gives priority to ops on the critical path
  - Register use: priority to nodes with more source operands and fewer destination operands
    - Reduces number of live registers
  - Uncover: high priority to nodes with many children
    - Frees up more nodes
  - Original order: when all else fails
21Computing Priorities
- Height(n) =
  - exec(n), if n is a leaf
  - max(height(m)) + exec(n), for m, where m is a successor of n
- Critical path(s): path through the dependence DAG with longest latency
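The height recurrence translates directly into a memoized traversal of the DAG, and a critical path can be read off by repeatedly following the tallest successor. The sketch below uses a small made-up DAG; the node names, execution times, and function names are all hypothetical.

```python
from typing import Dict, List

def heights(exec_time: Dict[str, int], succs: Dict[str, List[str]]) -> Dict[str, int]:
    """height(n) = exec(n) for a leaf, else exec(n) + max height over successors."""
    memo: Dict[str, int] = {}

    def height(n: str) -> int:
        if n not in memo:
            below = [height(m) for m in succs.get(n, [])]
            memo[n] = exec_time[n] + (max(below) if below else 0)
        return memo[n]

    for n in exec_time:
        height(n)
    return memo

def critical_path(exec_time: Dict[str, int], succs: Dict[str, List[str]]) -> List[str]:
    """One longest-latency path: start at the tallest node, follow tallest successors."""
    h = heights(exec_time, succs)
    node = max(h, key=h.get)
    path = [node]
    while succs.get(node):
        node = max(succs[node], key=h.get)
        path.append(node)
    return path

# Hypothetical DAG: a -> c, a -> d, b -> c, c -> e
exec_time = {"a": 3, "b": 1, "c": 1, "d": 1, "e": 2}
succs = {"a": ["c", "d"], "b": ["c"], "c": ["e"]}
print(heights(exec_time, succs))
print(critical_path(exec_time, succs))
```

When several nodes tie for the maximum height there is more than one critical path; this sketch returns just one of them.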
22Example: Determine Height and CP
Assume: memory instrs take 3 cycles, mult takes 2 cycles (to have result in register), rest take 1 cycle
Critical path: _______
23Example: List Scheduling
_____ cycles
24Scheduling vs. Register Allocation
25Register Renaming
26VLIW
- Very Long Instruction Word
- Compiler determines exactly what is issued every cycle (before the program is run)
- Schedules also account for latencies
- All hardware changes result in a compiler change
- Usually embedded systems (hence simple HW)
- Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)
27Sample VLIW code
VLIW processor: 5 issue
- 2 Add/Sub units (1 cycle)
- 1 Mul/Div unit (2 cycle, unpipelined)
- 1 LD/ST unit (2 cycle, pipelined)
- 1 Branch unit (no delay slots)

Add/Sub      Add/Sub      Mul/Div      Ld/St        Branch
c = a + b    d = a - b    e = a * b    ld j = x     nop
g = c + d    h = c - d    nop          ld k = y     nop
nop          nop          i = j * c    ld f = z     br g
28Multi-Issue Scheduling Example
Machine: 2 issue, 1 memory port, 1 ALU
Memory port: 2 cycles, non-pipelined
ALU: 1 cycle
RU_map: columns time, ALU, MEM; blank rows for time 0 through 9
Schedule: columns time, Ready, Placed; blank rows for time 0 through 9
29Earliest Latest Sets
Machine: 2 issue, 1 memory port, 1 ALU
Memory port: 2 cycles, pipelined
ALU: 1 cycle
[Figure: dependence graph over nodes 1m, 2m, 3, 4m, 5, 6, 7, 8, 9m, 10.]
30List Scheduling Algorithm
- Build dependence graph, calculate priority
- Add all ops to UNSCHEDULED set
- time = 0
- while (UNSCHEDULED is not empty)
  - time = time + 1
  - READY = UNSCHEDULED ops whose incoming deps have been satisfied
  - Sort READY using priority function
  - For each op in READY (highest to lowest priority)
    - op can be scheduled at current time? (resources free?)
      - Yes: schedule it, op.issue_time = time
        - Mark resources busy in RU_map relative to issue time
        - Remove op from UNSCHEDULED/READY sets
      - No: continue
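The algorithm above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes one functional unit per resource class, models `RU_map` as a set of busy cycles per unit, breaks priority ties by name, and advances `time` at the end of each iteration so that the first issue cycle is 0. The `Op` type, its `busy` field, and the example operations are hypothetical.

```python
from typing import Dict, List, NamedTuple, Set

class Op(NamedTuple):
    name: str
    unit: str       # resource class, e.g. "ALU" or "MEM"
    latency: int    # cycles until the result can be used
    busy: int       # cycles the unit stays occupied (1 if pipelined)

def list_schedule(ops: List[Op], preds: Dict[str, List[str]],
                  priority: Dict[str, int]) -> Dict[str, int]:
    """Return {op name: issue cycle} for a dependence DAG given by `preds`."""
    by_name = {op.name: op for op in ops}
    unscheduled: Set[str] = set(by_name)
    issue: Dict[str, int] = {}
    ru_map: Dict[str, Set[int]] = {}   # unit -> cycles in which it is busy
    time = 0
    while unscheduled:
        # READY: every predecessor has issued and its result is available.
        ready = [n for n in unscheduled
                 if all(p in issue and issue[p] + by_name[p].latency <= time
                        for p in preds.get(n, []))]
        ready.sort(key=lambda n: (-priority[n], n))   # highest priority first
        for n in ready:
            op = by_name[n]
            busy = ru_map.setdefault(op.unit, set())
            needed = set(range(time, time + op.busy))
            if needed & busy:
                continue                   # resource conflict: try again later
            issue[n] = time                # schedule the op at the current time
            busy |= needed                 # mark the unit busy in RU_map
            unscheduled.remove(n)
        time += 1
    return issue

def example(mem_busy: int) -> Dict[str, int]:
    """Two loads feeding an add and a sub; memory port occupied mem_busy cycles."""
    ops = [Op("l1", "MEM", 2, mem_busy), Op("l2", "MEM", 2, mem_busy),
           Op("add", "ALU", 1, 1), Op("sub", "ALU", 1, 1)]
    preds = {"add": ["l1", "l2"], "sub": ["l1"]}
    priority = {"l1": 3, "l2": 3, "add": 1, "sub": 1}   # heights
    return list_schedule(ops, preds, priority)

print(example(mem_busy=2))   # non-pipelined memory port
print(example(mem_busy=1))   # pipelined memory port
```

Comparing the two runs shows how a non-pipelined memory port delays the second load, and with it everything that depends on that load.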
31Improving Basic Block Scheduling
- Loop unrolling creates longer basic blocks
- Register renaming can change register usage in blocks to remove immediate reuse of registers
- Summary
  - Static scheduling complements (or replaces) dynamic scheduling by the hardware