Title: Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning
Slide 1: Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning
PACT 2002, Charlottesville, Virginia, September 2002
- Alex Aletà
- Josep M. Codina
- Jesús Sánchez
- Antonio González
- David Kaeli
- {aaleta, jmcodina, fran, antonio}@ac.upc.es
- kaeli@ece.neu.edu
Slide 2: Clustered Architectures
- Current/future challenges in processor design:
- Delay in the transmission of signals
- Power consumption
- Architecture complexity
- Clustering: divide the system into semi-independent units
- Each unit → a cluster
- Fast intra-cluster interconnects
- Slow inter-cluster interconnects
- A common trend in commercial VLIW processors:
- TI's C6x
- Analog's TigerSHARC
- HP's LX
- Equator's MAP1000
Slide 3: Architecture Overview
Slide 4: Instruction Scheduling
- For non-clustered architectures, the scheduler deals with:
- Resources
- Dependences
- For clustered architectures, also:
- Cluster assignment
- Minimize inter-cluster communication delays
- Exploit communication locality
- This work focuses on modulo scheduling for clustered VLIW architectures
- Modulo scheduling: a technique to schedule loops
Slide 5: Talk Outline
- Previous work
- Proposed algorithm
- Overview
- Graph partitioning
- Pseudo-scheduling
- Performance evaluation
- Conclusions
Slide 6: MS for Clustered Architectures
- In previous work, two different approaches were proposed
- Two steps:
- Data Dependence Graph partitioning: each instruction is assigned to a cluster
- Scheduling: instructions are scheduled in a suitable slot, but only in the preassigned cluster
- One step: cluster assignment and scheduling are performed at the same time
Slide 7: Goal of the Work
- Both approaches have benefits
- Two steps:
- Global vision of the Data Dependence Graph
- Workload is better split among the clusters
- The number of communications is reduced
- One step:
- Local vision of the partial schedule
- Cluster assignment is performed with information from the partial schedule
- Goal: obtain an algorithm that combines the benefits of both approaches
Slide 8: Baseline
- Baseline scheme: GP [Aletà et al., MICRO-34]
- Cluster assignment is performed with a graph partitioning algorithm
- Feedback between the partitioner and the scheduler
- Results outperformed previous approaches
- Still, little information is available for cluster assignment
- New algorithm: a better partition
- Pseudo-schedules are used to guide the partition
- Global vision of the Data Dependence Graph
- More information to perform cluster assignment
Slide 9: Algorithm Overview
Flowchart: compute an initial partition and set II = MII, then start scheduling. For each operation, select the next operation j and try to schedule Op_j based on the current partition. If it cannot be scheduled there, try to move Op_j to another cluster. If it still cannot be scheduled, increase II, refine the partition, and restart the scheduling. A sketch of this loop is given below.
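A minimal Python sketch of the flow above, under a toy machine model (one FU per cluster, a fixed search horizon); the data structures and helper names are illustrative, not the paper's implementation, and the refinement step is only indicated by a comment:

```python
from collections import defaultdict

def try_schedule(ops, edges, partition, ii, n_clusters, horizon=64):
    """Greedy placement at a given II: honour the partition first, then fall
    back to the other clusters (a sketch, not the real scheduling engine)."""
    preds = defaultdict(list)
    for u, v, lat in edges:
        preds[v].append((u, lat))
    start, busy = {}, defaultdict(set)          # busy[c] = occupied cycles mod II
    for op in ops:                              # assumed in topological order
        earliest = max((start[u] + lat for u, lat in preds[op] if u in start),
                       default=0)
        clusters = [partition[op]] + [c for c in range(n_clusters)
                                      if c != partition[op]]
        for c in clusters:
            t = next((t for t in range(earliest, earliest + horizon)
                      if t % ii not in busy[c]), None)
            if t is not None:
                start[op] = t
                busy[c].add(t % ii)
                break
        else:
            return None                         # no cluster could take this op
    return start

def modulo_schedule(ops, edges, partition, n_clusters, mii):
    """Outer loop: start at II = MII and increase II on failure."""
    ii = mii
    while True:
        sched = try_schedule(ops, edges, partition, ii, n_clusters)
        if sched is not None:
            return ii, sched
        ii += 1                                 # the real algorithm also refines
                                                # the partition at this point
```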
Slide 10: Algorithm Overview
(The overview flowchart from Slide 9 is shown again.)
Slide 11: Graph Partitioning Background
- Problem statement: split the nodes into a pre-determined number of sets while optimizing some objective function
- Multilevel strategy:
- Coarsen the graph
- Iteratively fuse pairs of nodes into new macro-nodes
- Enhancing heuristics:
- Avoid excess load in any one set
- Reduce the execution time of the loops
Slide 12: Graph Coarsening
- Previous definitions: matching, slack
- Iterate until there are as many nodes as clusters
- Edges are weighted according to:
- The impact on execution time of adding a bus delay to the edge
- The slack of the edge
- Then the maximum-weight matching is selected
- Nodes linked by edges in the matching are fused into a single macro-node (see the sketch below)
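One possible rendering of this coarsening pass in Python; a greedy heavy-edge matching stands in here for the maximum-weight matching named on the slide, and `weight((u, v))` is assumed to combine the edge's slack with the cost of adding a bus delay to it:

```python
def coarsen(nodes, edges, weight, n_clusters):
    """Repeatedly contract a heavy matching until only as many macro-nodes
    as clusters remain (greedy sketch of the slide's coarsening loop)."""
    group = {n: frozenset([n]) for n in nodes}   # node -> its macro-node
    while len(set(group.values())) > n_clusters:
        matched, fused = set(), False
        for u, v in sorted(edges, key=weight, reverse=True):
            gu, gv = group[u], group[v]
            if gu == gv or gu & matched or gv & matched:
                continue                         # same macro-node, or already
                                                 # matched during this pass
            new = gu | gv                        # fuse the two macro-nodes
            for n in new:
                group[n] = new
            matched |= new
            fused = True
            if len(set(group.values())) == n_clusters:
                break
        if not fused:
            break                                # nothing left to contract
    return set(group.values())

# Hypothetical usage: 4 nodes, 2 clusters, all edge weights equal.
# parts = coarsen("ABCD", [("A", "B"), ("B", "C"), ("C", "D")],
#                 weight=lambda e: 1.0, n_clusters=2)
```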
Slide 13: Coarsening Example
Slide 14: Example (II)
1st step: partition induced in the original graph
(Figure panels: initial graph, induced partition, final graph.)
Slide 15: Reducing Execution Time
- An estimate of the execution time is needed
- Pseudo-schedules provide it
- Information obtained:
- II
- SC
- Lifetimes
- Spills
(how II and SC translate into an estimate is sketched below)
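For reference, the way II and SC usually turn into an execution-time estimate in modulo scheduling; treating SC as the stage count and ignoring the lifetime/spill terms is an assumption of this sketch, not necessarily the exact expression used in the paper:

```python
def estimated_cycles(ii, sc, n_iterations):
    """A loop modulo-scheduled with initiation interval II and SC stages
    issues a new iteration every II cycles and drains for SC - 1 stages."""
    return ii * (n_iterations + sc - 1)

# e.g. II = 2, SC = 5, 100 iterations -> 2 * 104 = 208 cycles
```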
Slide 16: Building Pseudo-schedules
- Dependences:
- Respected if possible
- Otherwise, a penalty on register pressure and/or execution time is assessed
- Cluster assignment:
- The partition is strictly followed (see the sketch below)
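A small Python sketch of one way to realize this construction: the cluster chosen by the partition is never overridden, and when no slot satisfies the dependences the operation is placed anyway and a penalty is recorded. The 1-FU-per-cluster model, the search window and the penalty accounting are assumptions, not the paper's exact rules:

```python
from collections import defaultdict

def pseudo_schedule(ops, edges, partition, ii, window=8):
    """Place every op in the cluster chosen by the partition; when the
    dependences cannot be honoured, place the op anyway and count a penalty."""
    preds = defaultdict(list)
    for u, v, lat in edges:
        preds[v].append((u, lat))
    start, busy, penalty = {}, defaultdict(set), 0
    for op in ops:                              # topological order assumed
        c = partition[op]                       # partition strictly followed
        earliest = max((start[u] + lat for u, lat in preds[op] if u in start),
                       default=0)
        # Prefer a free slot that respects the dependences...
        t = next((t for t in range(earliest, earliest + window)
                  if t % ii not in busy[c]), None)
        if t is None:
            t = earliest                        # ...otherwise break a constraint
            penalty += 1                        # and charge it to the estimate
        start[op] = t
        busy[c].add(t % ii)
    return start, penalty
```

From `start` one can then read off the schedule length (and hence SC), and approximate lifetimes and register pressure.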
Slide 17: Pseudo-schedule Example
- 2 clusters, 1 FU per cluster, 1 bus with a latency of 1 cycle, II = 2
- Instruction latency: 3
Partition: Cluster 1 = {A, B}, Cluster 2 = {D}
Pseudo-schedule attempt:
- Cycle 0: A (Cluster 1)
- Cycle 3: B (Cluster 1)
- Cycle 4: D (Cluster 2)
- Cycles 6 and 7: C → no valid slot
Slide 18: Pseudo-schedule Example (cont.)
Pseudo-schedule:
- Cycle 0: A (Cluster 1)
- Cycle 3: B (Cluster 1)
- Cycle 4: D (Cluster 2)
- Cycle 8: C (Cluster 1)
Induced partition: Cluster 1 = {A, B, C}, Cluster 2 = {D}
(DDG figure with nodes A, B, C and D showing the induced partition.)
Slide 19: Heuristic Description
- While there is improvement, iterate:
- Different partitions are obtained by moving nodes among clusters
- Partitions that overload the resources of any cluster are discarded
- The partition minimizing execution time is chosen
- In case of a tie, the one that minimizes register pressure is selected (a sketch of this loop follows)
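A compact version of this loop in Python; `estimate(partition)` is assumed to return an `(execution_time, register_pressure)` pair derived from a pseudo-schedule, and `overloaded(partition)` to flag partitions that exceed some cluster's resources. Both are stand-ins for the paper's actual checks:

```python
def refine(partition, nodes, n_clusters, estimate, overloaded):
    """Hill-climbing over single-node moves: accept any move that improves
    (execution_time, register_pressure), discarding overloaded partitions."""
    best = dict(partition)
    best_cost = estimate(best)
    improved = True
    while improved:
        improved = False
        for n in nodes:
            for c in range(n_clusters):
                if c == best[n]:
                    continue
                candidate = dict(best)
                candidate[n] = c                 # move one node to cluster c
                if overloaded(candidate):
                    continue                     # resources exceeded: discard
                cost = estimate(candidate)
                if cost < best_cost:             # tuple compare: time first,
                    best, best_cost = candidate, cost   # register pressure
                    improved = True                     # breaks the tie
    return best
```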
Slide 20: Algorithm Overview
(The overview flowchart from Slide 9 is shown again.)
Slide 21: The Scheduling Step
- To schedule the partition we use URACAM [Codina et al., PACT'01]
- Driven by a figure of merit
- Uses dynamic transformations to improve the partial schedule:
- Register communications: bus → memory
- Spill code on the fly: register pressure → memory
- If an instruction cannot be scheduled in the cluster assigned by the partition:
- Try all the other clusters
- Select the best one according to a figure of merit (see the sketch below)
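The fallback itself, as a tiny Python sketch; `can_place` and `merit` stand in for URACAM's feasibility checks and figure of merit, which this deck does not spell out:

```python
def place_with_fallback(op, home_cluster, other_clusters, can_place, merit):
    """If the partition's cluster refuses the instruction, try the remaining
    clusters and keep the feasible one with the best figure of merit."""
    if can_place(op, home_cluster):
        return home_cluster
    feasible = [c for c in other_clusters if can_place(op, c)]
    if not feasible:
        return None                 # caller reacts by increasing II / refining
    return max(feasible, key=lambda c: merit(op, c))
```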
Slide 22: Algorithm Overview
(The overview flowchart from Slide 9 is shown again.)
Slide 23: Partition Refinement
- II has increased, so a better partition may exist for the new II:
- New slots have been generated in each cluster
- More lifetimes are available
- A larger number of bus communications is allowed
- The coarsening process is repeated
- Only edges between nodes in the same set can appear in the matching
- After coarsening, the induced partition is the last partition that could not be scheduled
- The reducing-execution-time heuristic is then reapplied (see the sketch below)
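Put in terms of the coarsening sketch shown after Slide 12, the restriction amounts to filtering the contractible edges; whether the real implementation reuses the same matching machinery is an assumption of this sketch:

```python
def refine_after_ii_increase(nodes, edges, weight, failed_partition, n_clusters):
    """Re-coarsen using only edges internal to the failed partition, so the
    induced partition starts out identical to it; the move-based heuristic
    from the previous slide is then reapplied to the result."""
    intra = [(u, v) for u, v in edges
             if failed_partition[u] == failed_partition[v]]
    return coarsen(nodes, intra, weight, n_clusters)   # coarsen() as sketched above
```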
Slide 24: Benchmarks and Configurations
- Benchmarks: all of SPECfp95, using the ref input set
- Two schedulers evaluated:
- GP (previous work)
- Pseudo-schedule-guided partitioning (PSP)
Slide 25: GP vs PSP
Slide 26: GP vs PSP
Slide 27: Conclusions
- A new algorithm to perform MS for clustered VLIW architectures
- Cluster assignment based on multilevel graph partitioning
- The partitioning algorithm is improved:
- Based on pseudo-schedules
- Reliable information is available to guide the partition
- It outperforms previous work
- 38.5% speed-up for some configurations
Slide 28: Any questions?
Slide 29: GP vs PSP
Slide 30: Different Alternatives
Slide 31: Clustered Architectures
- Current/future challenges in processor design:
- Delay in the transmission of signals
- Power consumption
- Architecture complexity
- Solutions:
- VLIW architectures
- Clustering: divide the system into semi-independent units
- Fast intra-cluster interconnects
- Slow inter-cluster interconnects
- A common trend in commercial VLIW processors:
- TI's C6x, Analog's TigerSHARC
- HP's LX, Equator's MAP1000
Slide 32: Example (I)
1st step: coarsening the graph
Slide 33: Example (I)
1st step: partition induced in the original graph by the coarsening
Slide 34: Reducing Execution Time
- Heuristic description:
- Different partitions are obtained by moving nodes among clusters
- Partitions overloading the resources of any cluster are discarded
- The partition minimizing execution time is chosen
- In case of a tie, the one that minimizes register pressure
- An estimate of the execution time is needed
- Pseudo-schedules provide it
Slide 35: Pseudo-schedules
- Building pseudo-schedules:
- Dependences: respected if possible
- Otherwise, a penalty on register pressure and/or execution time is assumed
- Cluster assignment: the partition is strictly followed
- Valuable information can be estimated:
- II
- Length of the pseudo-schedule
- Register pressure