Transcript and Presenter's Notes

Title: Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning


1
Exploiting Pseudo-schedules to Guide Data
Dependence Graph Partitioning
PACT 2002, Charlottesville, Virginia, September 2002
  • Alex Aletà
  • Josep M. Codina
  • Jesús Sánchez
  • Antonio González
  • David Kaeli
  • {aaleta, jmcodina, fran, antonio}@ac.upc.es
  • kaeli@ece.neu.edu

2
Clustered Architectures
  • Current/future challenges in processor design
  • Delays in signal transmission
  • Power consumption
  • Architecture complexity
  • Clustering: divide the system into semi-independent
    units
  • Each unit → a cluster
  • Fast intra-cluster interconnects
  • Slow inter-cluster interconnects
  • Common trend in commercial VLIW processors
  • TI's C6x
  • Analog Devices' TigerSHARC
  • HP's Lx
  • Equator's MAP1000

3
Architecture Overview
4
Instruction Scheduling
  • For non-clustered architectures
  • Resources
  • Dependences
  • For clustered architectures, additionally
  • Cluster assignment
  • Minimize inter-cluster communication delays
  • Exploit communication locality
  • This work focuses on modulo scheduling for
    clustered VLIW architectures
  • A technique to schedule loops (a refresher on its
    II lower bound is sketched below)
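The algorithm presented later starts scheduling at II = MII, the standard lower bound on the initiation interval. As a quick refresher (not spelled out on the slide), MII is the larger of the resource-constrained and recurrence-constrained bounds; a minimal sketch:

```python
from math import ceil

def minimum_ii(uses_per_resource, units_per_resource, recurrences):
    """Lower bound (MII) that modulo scheduling starts from.

    uses_per_resource:  {resource: operations in the loop body that need it}
    units_per_resource: {resource: units of that resource available per cycle}
    recurrences:        [(latency, distance)] summed around each dependence cycle
    """
    res_mii = max(ceil(uses / units_per_resource[r])
                  for r, uses in uses_per_resource.items())
    rec_mii = max((ceil(lat / dist) for lat, dist in recurrences), default=1)
    return max(res_mii, rec_mii)

# Example: 6 operations on 2 FUs, one recurrence of total latency 3 and distance 1
print(minimum_ii({"fu": 6}, {"fu": 2}, [(3, 1)]))  # -> 3
```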

5
Talk Outline
  • Previous work
  • Proposed algorithm
  • Overview
  • Graph partitioning
  • Pseudo-scheduling
  • Performance evaluation
  • Conclusions

6
MS for Clustered Architectures
  • In previous work, two different approaches were
    proposed
  • Two steps
  • Data Dependence Graph partitioning: each
    instruction is assigned to a cluster
  • Scheduling: instructions are scheduled in a
    suitable slot, but only in the preassigned cluster
  • One step
  • Cluster assignment and scheduling are performed
    at the same time

7
Goal of the Work
  • Both approaches have benefits
  • Two steps
  • Global vision of the Data Dependence Graph
  • Workload is better split among different clusters
  • Number of communications is reduced
  • One step
  • Local vision of the partial schedule
  • Cluster assignment is performed with information
    from the partial schedule
  • Goal: obtain an algorithm that combines the
    benefits of both approaches

8
Baseline
  • Baseline scheme: GP [Aletà et al., MICRO-34]
  • Cluster assignment performed with a graph
    partitioning algorithm
  • Feedback between the partitioner and the
    scheduler
  • Results outperformed previous approaches
  • Still, little information is available for cluster
    assignment
  • New algorithm: a better partition
  • Pseudo-schedules are used to guide the partition
  • Global vision of the Data Dependence Graph
  • More information to perform cluster assignment

9
Algorithm Overview
[Flow chart] Compute an initial partition; set II = MII; start scheduling.
For each operation j: schedule Op_j based on the current partition; if it
cannot be scheduled there, try moving Op_j to another cluster; if it still
cannot be scheduled, increase II, refine the partition, and resume scheduling.
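A self-contained toy rendering of this flow chart is sketched below. It keeps only the skeleton: schedule in the preassigned cluster, fall back to the other clusters, and raise II when an operation does not fit. Inter-cluster bus delays and the partition-refinement step are omitted, and all names are illustrative rather than the authors' implementation.

```python
def modulo_schedule(ops, deps, partition, n_clusters, fus_per_cluster, mii, max_ii=64):
    """Toy driver for the partition-then-schedule loop of the flow chart.

    ops:       operation ids in topological order
    deps:      {op: [predecessors]} (unit latency, same-iteration edges only)
    partition: {op: cluster index}
    Returns (ii, {op: (cycle, cluster)}).
    """
    for ii in range(mii, max_ii + 1):
        used = [[0] * ii for _ in range(n_clusters)]   # modulo reservation tables
        placement, feasible = {}, True
        for op in ops:
            earliest = max((placement[p][0] + 1 for p in deps.get(op, ())
                            if p in placement), default=0)
            # Preferred cluster first, then every other cluster.
            candidates = [partition[op]] + [c for c in range(n_clusters)
                                            if c != partition[op]]
            for cluster in candidates:
                cycle = next((t for t in range(earliest, earliest + ii)
                              if used[cluster][t % ii] < fus_per_cluster), None)
                if cycle is not None:
                    used[cluster][cycle % ii] += 1
                    placement[op] = (cycle, cluster)
                    break
            else:                                      # no cluster could take op
                feasible = False
                break
        if feasible:
            return ii, placement
        # In the real algorithm the partition would be refined here before retrying.
    raise RuntimeError("no schedule found up to max_ii")

# 4 operations, 2 clusters, 1 FU per cluster: scheduled at the lower bound II = 2.
ii, placed = modulo_schedule(["a", "b", "c", "d"],
                             {"b": ["a"], "c": ["a"], "d": ["b", "c"]},
                             {"a": 0, "b": 0, "c": 1, "d": 1},
                             n_clusters=2, fus_per_cluster=1, mii=2)
```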
10
Algorithm Overview
[Flow chart repeated: same algorithm overview as above.]
11
Graph Partitioning Background
  • Problem statement
  • Split the nodes into a pre-determined number of
    sets while optimizing some objective function
  • Multilevel strategy
  • Coarsen the graph
  • Iteratively, fuse pairs of nodes into new
    macro-nodes
  • Enhancing heuristics
  • Avoid excess load in any one set
  • Reduce execution time of the loops

12
Graph Coarsening
  • Previous definitions
  • Matching
  • Slack
  • Iterate until there are as many macro-nodes as
    clusters
  • The edges are weighted according to
  • Impact on execution time of adding a bus delay to
    the edge
  • Slack of the edge
  • Then, select the maximum weight matching
  • Nodes linked by edges in the matching are fused
    into a single macro-node (see the sketch below)
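A sketch of this coarsening pass, assuming the edge weights (slack plus the execution-time impact of adding a bus delay) are given as input. The greedy pass below is only an approximation of the maximum weight matching used in the real algorithm, and the real algorithm would recompute the weights at each level.

```python
def coarsen(nodes, weighted_edges, n_clusters):
    """Fuse matched pairs into macro-nodes until there are as many
    macro-nodes as clusters.

    nodes:          iterable of operation ids
    weighted_edges: {(u, v): weight}
    Returns a list of frozensets, one per macro-node.
    """
    macro = [frozenset([n]) for n in nodes]
    while len(macro) > n_clusters:
        # Project the original edges onto the current macro-nodes.
        index = {n: i for i, m in enumerate(macro) for n in m}
        agg = {}
        for (u, v), w in weighted_edges.items():
            a, b = index[u], index[v]
            if a != b:
                key = (min(a, b), max(a, b))
                agg[key] = agg.get(key, 0) + w
        if not agg:                        # disconnected leftovers: fuse any two
            agg = {(0, 1): 0}
        # Greedy matching: heaviest edges first, both endpoints still free.
        taken, pairs = set(), []
        for (a, b), w in sorted(agg.items(), key=lambda kv: -kv[1]):
            if a not in taken and b not in taken:
                pairs.append((a, b))
                taken.update((a, b))
        # Fuse matched pairs, never dropping below n_clusters macro-nodes.
        pairs = pairs[: len(macro) - n_clusters]
        fused_ids = {x for p in pairs for x in p}
        macro = ([macro[a] | macro[b] for a, b in pairs] +
                 [m for i, m in enumerate(macro) if i not in fused_ids])
    return macro

# 4 operations, 2 clusters: A-B (weight 3) and C-D (weight 2) get fused.
print(coarsen("ABCD", {("A", "B"): 3, ("B", "C"): 1, ("C", "D"): 2}, 2))
```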

13
Coarsening Example
14
Example (II)
1st step: partition induced in the original graph
[Figure: initial graph, induced partition, and final graph]
15
Reducing Execution Time
  • An estimation of execution time is needed
  • Pseudo-schedules
  • Information obtained
  • II
  • SC (stage count)
  • Lifetimes
  • Spills (see the cost sketch below)
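One plausible way to turn these quantities into an execution-time estimate is the usual software-pipelining cost model; the formula below is that generic estimate, not necessarily the paper's exact model.

```python
def estimated_cycles(ii, sc, iterations, spill_ops=0, spill_penalty=1):
    """Generic software-pipelining cost: (iterations + SC - 1) stages of II
    cycles each, plus a charge per spill operation inserted."""
    return (iterations + sc - 1) * ii + spill_ops * spill_penalty

print(estimated_cycles(ii=2, sc=3, iterations=100))  # -> 204
```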

16
Building pseudo-schedules
  • Dependences
  • Respected if possible
  • Else a penalty on register pressure and/or
    execution time is assessed
  • Cluster assignment
  • The partition is strictly followed (a sketch of the
    construction follows below)
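A minimal sketch of pseudo-schedule construction under these rules. The fixed operation latency, the single-cycle bus delay and the exact penalty rule are assumptions for illustration; with a hypothetical set of dependence edges it reproduces the placements of the example on the next slides (A at cycle 0, B at 3, D at 4, C rejected at cycles 6 and 7 and placed at 8).

```python
def pseudo_schedule(ops, deps, partition, ii, fus_per_cluster=1,
                    latency=3, bus_delay=1):
    """Sketch: the partition is strictly followed (an operation is never moved
    to another cluster); dependences and resources are respected when a free
    modulo slot allows it, otherwise the operation is placed late anyway and a
    penalty is recorded."""
    used = {c: [0] * ii for c in set(partition.values())}
    start, penalty = {}, 0
    for op in ops:                                   # assume topological order
        cluster = partition[op]
        earliest = max((start[p] + latency +
                        (bus_delay if partition[p] != cluster else 0)
                        for p in deps.get(op, ()) if p in start), default=0)
        slot = next((t for t in range(earliest, earliest + ii)
                     if used[cluster][t % ii] < fus_per_cluster), None)
        if slot is None:                             # constraints cannot be met:
            slot, penalty = earliest + ii, penalty + 1   # place late, take a penalty
        else:
            used[cluster][slot % ii] += 1
        start[op] = slot
    length = (max(start.values()) + latency) if start else 0
    stage_count = -(-length // ii)                   # ceil(length / II)
    return start, stage_count, penalty

# Hypothetical DDG for the example: B and D depend on A, C depends on B.
# Cluster indices 0/1 stand for the slide's clusters 1/2.
start, sc, penalty = pseudo_schedule(["A", "B", "D", "C"],
                                     {"B": ["A"], "D": ["A"], "C": ["B"]},
                                     {"A": 0, "B": 0, "C": 0, "D": 1}, ii=2)
print(start)   # {'A': 0, 'B': 3, 'D': 4, 'C': 8}, with one penalty recorded
```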

17
Pseudo-schedule example
  • 2 clusters, 1 FU/cluster, 1 bus of latency 1, II = 2,
    instruction latency = 3

Partition:   Cluster 1: A, B    Cluster 2: D

Pseudo-schedule attempt:

  Cycle   Cluster 1   Cluster 2
    0     A
    1
    2
    3     B
    4                 D
    5
    6     C → NO
    7     C → NO
18
Pseudo-schedule example
  Cycle   Cluster 1   Cluster 2
    0     A
    1
    2
    3     B
    4                 D
    5
    6
    7
    8     C

Induced partition:   Cluster 1: A, B, C    Cluster 2: D

[Figure: Data Dependence Graph with nodes A, B, C, D]
19
Heuristic description
  • While there is improvement, iterate
  • Different partitions are obtained by moving nodes
    among clusters
  • Partitions that overload the resources of any
    cluster are discarded
  • The partition minimizing execution time is chosen
  • In case of tie, the one that minimizes register
    pressure is selected (see the sketch below)
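A direct rendering of this heuristic, where estimate(partition) returns the pair (execution time, register pressure) computed from a pseudo-schedule and overloads(partition) flags resource overload; both callables are assumed interfaces standing in for the pseudo-scheduler.

```python
def refine_partition(partition, nodes, clusters, estimate, overloads):
    """Move-based refinement: try single-node moves, discard moves that
    overload a cluster, keep the move that most reduces execution time
    (ties broken by register pressure)."""
    best, best_cost = dict(partition), estimate(partition)
    improved = True
    while improved:                                   # iterate while improving
        improved = False
        for node in nodes:
            for cluster in clusters:
                if cluster == best[node]:
                    continue
                candidate = dict(best)
                candidate[node] = cluster
                if overloads(candidate):              # discard overloaded partitions
                    continue
                cost = estimate(candidate)            # (exec_time, reg_pressure)
                if cost < best_cost:                  # lexicographic comparison
                    best, best_cost, improved = candidate, cost, True
    return best
```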

20
Algorithm Overview
[Flow chart repeated: the algorithm overview, now focusing on the scheduling step.]
21
The Scheduling Step
  • To schedule the partition we use URACAM [Codina
    et al., PACT'01]
  • Figure of merit
  • Uses dynamic transformations to improve the
    partial schedule
  • Register communications
  • Bus → memory
  • Spill code on the fly
  • Register pressure → memory
  • If an instruction cannot be scheduled in the
    cluster assigned by the partition
  • Try all other clusters
  • Select the best one according to a figure of merit
    (see the sketch below)
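The fallback policy in the last bullets can be sketched as follows; try_schedule and figure_of_merit are placeholder interfaces, not URACAM's actual API.

```python
def place_with_fallback(op, preferred, clusters, try_schedule, figure_of_merit):
    """If an operation cannot be scheduled in the cluster chosen by the
    partition, try all the other clusters and keep the alternative with the
    best figure of merit.  try_schedule(op, cluster) returns a tentative
    placement or None; figure_of_merit(placement) scores it."""
    placement = try_schedule(op, preferred)
    if placement is not None:
        return preferred, placement
    alternatives = [(c, p) for c in clusters if c != preferred
                    for p in (try_schedule(op, c),) if p is not None]
    if not alternatives:
        return None                 # nothing fits: the driver will increase II
    return max(alternatives, key=lambda cp: figure_of_merit(cp[1]))
```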

22
Algorithm Overview
[Flow chart repeated: the algorithm overview, now focusing on partition refinement.]
23
Partition Refinement
  • II has increased
  • A better partition can be found for the new II
  • New slots have been generated in each cluster
  • More lifetimes are available
  • A larger number of bus communications is allowed
  • The coarsening process is repeated
  • Only edges between nodes in the same set can
    appear in the matching (see the sketch below)
  • After coarsening, the induced partition will be
    the last partition that could not be scheduled
  • The execution-time-reduction heuristic is
    reapplied
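The constraint on the matching amounts to filtering the candidate edges before re-coarsening; a small sketch, compatible with the coarsening sketch shown after slide 12.

```python
def intra_set_edges(weighted_edges, failed_partition):
    """Keep only edges whose endpoints lie in the same set of the partition
    that could not be scheduled, so that only intra-set pairs can be fused."""
    return {(u, v): w for (u, v), w in weighted_edges.items()
            if failed_partition[u] == failed_partition[v]}
```

Because only intra-set edges can be contracted, the coarsened graph induces exactly the partition that failed, which the execution-time heuristic then refines for the new II.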

24
Benchmarks and Configurations
  • Benchmarks: all of SPECfp95, using the ref input
    set
  • Two schedulers evaluated
  • GP (previous work)
  • Pseudo-schedule (PSP)

25
GP vs PSP
26
GP vs PSP
27
Conclusions
  • A new algorithm to perform MS for clustered VLIW
    architectures
  • Cluster assignment based on multilevel graph
    partitioning
  • The partitioning algorithm is improved
  • Based on pseudo-schedules
  • Reliable information is available to guide the
    partition
  • Outperforms previous work
  • 38.5% speedup for some configurations

28
Any questions?
29
GP vs PSP
30
Different Alternatives
31
Clustered Architectures
  • Current/future challenges in processor design
  • Delays in signal transmission
  • Power consumption
  • Architecture complexity
  • Solutions
  • VLIW architectures
  • Clustering: divide the system into semi-independent
    units
  • Fast intra-cluster interconnects
  • Slow inter-cluster interconnects
  • Common trend in commercial VLIW processors
  • TI's C6x, Analog Devices' TigerSHARC
  • HP's Lx, Equator's MAP1000

32
Example (I)
1st step: coarsening the graph
33
Example (I)
1st step: partition induced in the original graph by the coarsening
34
Reducing Execution Time
  • Heuristic description
  • Different partitions are obtained by moving nodes
    among clusters
  • Partitions overloading the resources of any
    cluster are discarded
  • The partition minimizing execution time is chosen
  • In case of tie, the one that minimizes register
    pressure is selected
  • An estimation of execution time is needed
  • Pseudo-schedules

35
Pseudo-schedules
  • Building pseudo-schedules
  • Dependences
  • Respected if possible
  • Else a penalty on register pressure and/or
    execution time is assumed
  • Cluster assignment
  • Partition strictly followed
  • Valuable information can be estimated
  • II
  • Length of the pseudo-schedule
  • Register pressure