Title: EECS 583 Class 17 Multicluster Partitioning
1 EECS 583 Class 17: Multicluster Partitioning
- University of Michigan
- November 5, 2007
2 Reading Material
- Today's class
- "Region-based Hierarchical Operation Partitioning for Multicluster Processors," M. Chu et al., PLDI 2003.
- Next class
- "Register Allocation and Spilling Via Graph Coloring," G. Chaitin, Proc. 1982 SIGPLAN Symposium on Compiler Construction, 1982.
3 Modulo Scheduling Wrapup: What if We Don't Have Hardware Support?
- No predicates
- Predicates enable kernel-only code by selectively enabling/disabling operations to create prolog/epilog
- Now must create explicit prolog/epilog code segments
- No rotating registers
- Register names not automatically changed each iteration
- Must unroll the body of the software pipeline and explicitly rename
- Consider each register lifetime i in the loop
- Kmin = min unroll factor = MAX_i (ceiling((End_i - Start_i) / II))
- Create Kmin static names to handle maximum register lifetime
- Apply modulo variable expansion
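The unroll-factor formula above can be sketched in a few lines. This is a minimal illustration, not the compiler's actual implementation: the function names, the `(start, end)` tuple representation of a lifetime, and the example lifetimes are all assumptions made for the example.

```python
def min_unroll_factor(lifetimes, ii):
    """Kmin = max over lifetimes i of ceil((End_i - Start_i) / II).

    lifetimes: list of (start, end) cycle pairs, one per register lifetime
               (hypothetical representation).
    ii:        initiation interval of the modulo schedule.
    """
    if ii <= 0:
        raise ValueError("II must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return max(((end - start + ii - 1) // ii for start, end in lifetimes),
               default=1)


def static_names(reg, kmin):
    """Create Kmin static names for one rotating value (naming is illustrative)."""
    return [f"{reg}_{k}" for k in range(kmin)]


# Hypothetical lifetimes; the longest one spans 4 cycles.
lifetimes = [(0, 3), (1, 2), (2, 6)]
kmin = min_unroll_factor(lifetimes, ii=1)
names = static_names("r", kmin)
```

With II = 1 the longest lifetime (4 cycles) forces four copies of the loop body, each using a different static name; doubling II to 2 halves the requirement.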
4 No Predicates
[Figure: software pipeline with stages A-E. One panel shows kernel-only code with rotating registers and predicates, II = 1; the other shows the same pipeline expanded into explicit prolog, kernel, and epilog segments.]
Without predicates, must create explicit prolog and epilogs, but no explicit renaming is needed, as rotating registers take care of this
5 No Predicates and No Rotating Registers
Assume Kmin = 4 for this example
[Figure: software pipeline with stages A-E, each renamed into four versions (A1-A4 through E1-E4), laid out as a prolog, an unrolled kernel, and an epilog.]
6 Recap: Traditional VLIW Architectures
- Conventional VLIW
- Target architecture seen so far in class
- Large, centralized register file
- Many functional units connected
- Problems with conventional design
- Longer wires require longer latencies on RF accesses
- A large number of FUs connected to the register file requires more ports
- Register file access time increases quadratically with number of ports
[Figure: Conventional architecture: one register file (RF) connected to five FUs.]
7 Multicluster VLIW Architectures
- Multicluster VLIW
- Solution to problems with conventional VLIW architecture design
- Decentralized architecture by splitting RF and connecting subsets of the FUs
- Requires communication between clusters through an intercluster communication path
- Problem with multicluster VLIW
- Compilation must now deal with disjoint FU/RFs, and schedule operations accordingly
- Used in commercial processors
- Alpha 21264, TI C6x, etc.
[Figure: Clustered architecture: Cluster 1 and Cluster 2, each with its own register file and two FUs.]
8 Other Multicluster Architecture Designs
- Clusters can be homogeneous or heterogeneous
- Homogeneous means each cluster is identical
- Heterogeneous means FU number/types differ per cluster
- Communication paths can be intercluster buses or cross-cluster FU inputs
[Figure: two two-cluster designs, one using cross-cluster FU inputs and one using an intercluster bus; each cluster has its own register file and FUs.]
9 Multicluster Compilation Basics
- Goal: distribute operations evenly to balance workload while minimizing communication
- When two operations on separate clusters require communication, the interconnection network must be used
[Figure: Cluster 1 and Cluster 2, each with a register file and I and MEM units, joined by an interconnection network; the example ops LW and >> sit on different clusters and communicate through an intercluster move.]
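The two competing goals, balanced load and little communication, can be measured for any candidate assignment. The sketch below is illustrative only: the five-op DFG, the cluster names, and the rule that a value needs one move per (producer, destination cluster) pair are assumptions, not part of the lecture.

```python
from collections import Counter


def intercluster_moves(edges, cluster_of):
    """Count moves, assuming one move per (producer, destination cluster) pair.

    edges:      list of (producer, consumer) op pairs in the DFG.
    cluster_of: dict mapping each op to its assigned cluster.
    """
    crossings = {(src, cluster_of[dst])
                 for src, dst in edges
                 if cluster_of[src] != cluster_of[dst]}
    return len(crossings)


def cluster_loads(cluster_of):
    """Number of ops placed on each cluster."""
    return Counter(cluster_of.values())


# Hypothetical DFG: two chains (1 -> 3 and 2 -> 4) that join at op 5.
edges = [(1, 3), (2, 4), (3, 5), (4, 5)]
split = {1: "C1", 3: "C1", 5: "C1", 2: "C2", 4: "C2"}   # chains kept together
mixed = {1: "C1", 2: "C2", 3: "C2", 4: "C1", 5: "C1"}   # chains interleaved

split_cost = (intercluster_moves(edges, split), cluster_loads(split))
mixed_cost = (intercluster_moves(edges, mixed), cluster_loads(mixed))
```

Both assignments place three ops on C1 and two on C2, so they are equally balanced, but keeping each chain on one cluster needs a single move while interleaving them needs three.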
10 Cluster Assignment
- When do we want to do operation cluster assignment?
- Highly intertwined with scheduling and register allocation
- Assignment to clusters can change how well the code can be scheduled, which changes how well registers can be allocated
- Elcor's model
- Other possible models
- Combine cluster assignment with scheduling
- Combine all three
- Unifying any or all of these three steps can greatly increase complexity
[Figure: three phases in sequence: Cluster Assignment, Scheduling, Register Allocation.]
11 Bottom-Up Greedy (BUG) Algorithm
- First clustering algorithm, introduced for the Multiflow Trace architecture
- Used by Elcor
- Basic idea
- Recursive algorithm
- Go from exit ops to entry ops and pass along good cluster candidates for each op
- Go from entry ops back to exit ops and make final decisions
- Consider ops on critical path first
12 BUG Algorithm (cont.)
- Given an op and its immediate predecessors and successors, how to choose a good cluster?
- Op must get its input operands from its predecessors
- Perform some computation
- Send its output to its successors
- Want to pick the cluster such that this process completes soonest (greedy)
- A good choice depends on which clusters the op's predecessors and successors are assigned to
13 Definitions
- Available time
- When a source operand is computed
- Arrival time
- When a source operand has been moved to the current cluster
- Start time
- When all source operands are ready (max of arrival times) and resources are available
- Completion time
- When the result has been computed and moved to consumers
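14 Definitions Illustrated
[Figure: timeline relative to Op 3 on cluster C1. Op 2's result is ready at AvailableTime(Op2) and reaches C1 after a move at ArrivalTime(Op2, C1); for Op 1, AvailableTime(Op1) and ArrivalTime(Op1, C1) coincide. Op 3 begins at StartTime(Op3, C1), and CompletionTime(Op3, C1, C1, C2) marks when its result reaches its consumer, Op 4.]
- Choose a cluster for Op 3 to minimize Completion Time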
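The four definitions translate almost directly into small helper functions. The following is a sketch under stated assumptions: a one-cycle intercluster move, a single destination cluster per call, and the convention that an op starting at cycle t with latency L has its result available at t + L. The function names are illustrative.

```python
MOVE_LATENCY = 1  # assumed cost of one intercluster move, in cycles


def arrival_time(available, producer_cluster, cluster,
                 move_latency=MOVE_LATENCY):
    """When a value available on producer_cluster can be read on cluster."""
    if producer_cluster == cluster:
        return available
    return available + move_latency


def start_time(operands, cluster, resource_free_at=0):
    """Max of operand arrival times, delayed until a resource is free.

    operands: list of (available_time, producer_cluster) pairs.
    """
    ready = max((arrival_time(t, c, cluster) for t, c in operands), default=0)
    return max(ready, resource_free_at)


def completion_time(start, latency, cluster, dest_cluster):
    """When the result has been computed and delivered to dest_cluster."""
    return arrival_time(start + latency, cluster, dest_cluster)


# Op 1's result is available at cycle 1 on C1; Op 3 consumes it, and Op 3's
# own consumer lives on C1. Try both placements for Op 3.
op1 = (1, "C1")
results = {}
for cluster in ("C1", "C2"):
    s = start_time([op1], cluster)
    results[cluster] = (s, completion_time(s, 1, cluster, "C1"))
```

Placing Op 3 on C1 gives a start time of 1 and a completion time of 2; placing it on C2 costs a move in each direction, giving 2 and 4. Under these assumptions the greedy choice is C1.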
15 The Main Function: Assign
- Assign (Op, Dests)
- Upward pass (recursive call):
- for each Predecessor Pred of Op
- Est-clusters = Estimate (Op, Dests)
- Assign (Pred, Est-clusters)
- Downward pass (actual assignment):
- Est-clusters = Estimate (Op, Dests)
- Cluster = first cluster in Est-clusters
- Assign Op to Cluster
- Mark Cluster's resources busy at StartTime(Op, Cluster)
- The Estimate function returns a list of Clusters for which CompletionTime(Op, Cluster, Dests) is minimum
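The Assign/Estimate recursion can be made concrete with a short sketch. Several details here are assumptions rather than the lecture's definition: unit op latency, a one-cycle move, one issue slot per cluster per cycle, unassigned predecessors approximated by depth plus latency, and completion time taken as the earliest delivery to any candidate destination cluster. The class and method names are hypothetical.

```python
MOVE = 1     # assumed intercluster move latency
LATENCY = 1  # assumed latency of every op


class Bug:
    def __init__(self, preds, clusters):
        self.preds = preds        # op -> list of predecessor ops
        self.clusters = clusters  # candidate clusters, in preference order
        self.assigned = {}        # op -> chosen cluster
        self.avail = {}           # op -> cycle its result is available
        self.busy = set()         # (cluster, cycle) issue slots already used

    def depth(self, op):
        # Longest path from an entry op; entry ops have depth 0.
        return max((self.depth(p) + LATENCY for p in self.preds[op]), default=0)

    def arrival(self, pred, cluster):
        if pred in self.assigned:
            t = self.avail[pred]
            return t if self.assigned[pred] == cluster else t + MOVE
        # Upward pass: predecessor not placed yet, so approximate.
        return self.depth(pred) + LATENCY

    def start(self, op, cluster):
        t = max((self.arrival(p, cluster) for p in self.preds[op]), default=0)
        while (cluster, t) in self.busy:  # wait for a free issue slot
            t += 1
        return t

    def completion(self, op, cluster, dests):
        done = self.start(op, cluster) + LATENCY
        # Assumption: earliest delivery to any candidate destination.
        return min((done if d == cluster else done + MOVE for d in dests),
                   default=done)

    def estimate(self, op, dests):
        times = {c: self.completion(op, c, dests) for c in self.clusters}
        best = min(times.values())
        return [c for c in self.clusters if times[c] == best]

    def assign(self, op, dests):
        if op in self.assigned:
            return
        # Upward pass: visit deeper (more critical) predecessors first.
        for pred in sorted(self.preds[op], key=self.depth, reverse=True):
            self.assign(pred, self.estimate(op, dests))
        # Downward pass: commit this op to the first best cluster.
        cluster = self.estimate(op, dests)[0]
        t = self.start(op, cluster)
        self.assigned[op] = cluster
        self.busy.add((cluster, t))
        self.avail[op] = t + LATENCY


# Hypothetical DFG: 1 -> 3 -> 5 and 2 -> 4 -> 5, with op 5 as the only exit.
preds = {1: [], 2: [], 3: [1], 4: [2], 5: [3, 4]}
bug = Bug(preds, ["C1", "C2"])
bug.assign(5, [])
```

On this graph the sketch places ops 1, 3, 4, and 5 on C1 and op 2 on C2, and op 5 completes at cycle 4. Op 4 is a tie between the two clusters, so the result depends on the tie-breaking order; a real implementation would also model per-cluster FU types, which this sketch omits.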
16 BUG
- Traverses the DFG in a reverse depth-first-search fashion
- Upward pass
- Predecessors have not been assigned yet
- Use depth (estart) plus latency to approximate a predecessor's AvailableTime
- Estimate a set of good clusters for the current op
- Recursively assign predecessors with the current set as the predecessors' Dests
- Downward pass
- Make final cluster decisions for ops
17 Example
- Assume all ops are 1-cycle
- Each cluster can execute one op per cycle
- Cluster 1 can execute any op, cluster 2 can only
execute
[Figure: example DFG with clusters C1 and C2; some ops are marked M.]
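18 Example: left path, upward pass
- AvailTime(Op1) = 1
- CompTime(Op3, C1, C1) = 2, CompTime(Op3, C2, C1) = 3
- CompTime(Op5, C1, -) = 3
- CompTime(Op1, C1, C1) = 1, CompTime(Op1, C2, C1) = 2
[Figure: DFG path 1, 3, 5 shown at successive steps, with C1 recorded as the estimated cluster for the ops.]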
19 Example: left path, downward pass
- ArrivTime(Op1, C1) = 1, ArrivTime(Op1, C2) = 2
- StartTime(Op3, C1) = 1, CompTime(Op3, C1, C1) = 2
- StartTime(Op3, C2) = 2, CompTime(Op3, C2, C1) = 4
[Figure: DFG path 1, 3, 5 with C1 as the estimated cluster.]
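20 Example: right path, upward pass
- AvailTime(Op2) = 1
- StartTime(Op4, C1) = 2, CompTime(Op4, C1, C1) = 3
- StartTime(Op4, C2) = 1, CompTime(Op4, C2, C1) = 3
- CompTime(Op2, C1, C1, C2) = 3, CompTime(Op2, C2, C1, C2) = 1
[Figure: DFG with ops 1-5, with C1, C2 recorded as estimated clusters on the right path and C1 for Op 5.]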
21 Example: right path, downward pass
- ArrivTime(Op2, C1) = 2, ArrivTime(Op2, C2) = 1
- StartTime(Op4, C1) = 2, CompTime(Op4, C1, C1) = 3
- StartTime(Op4, C2) = 1, CompTime(Op4, C2, C1) = 3
- CompTime(Op5, C1, -) = 4
[Figure: DFG with ops 1-5 after the final cluster decisions.]
22 Class problem
[Figure: DFG to be assigned to clusters C1 and C2, with some ops marked M, and an empty schedule to fill in.]
23 Problems with BUG
- BUG does a fairly good job of partitioning the DFG, but it can be improved
- Problem 1: Local scope of the DFG
- Has a very narrow view of the DFG
- Doesn't consider the best global clustering
- Problem 2: Scheduler-centric
- Using the scheduler to determine the clustering is slow!
- BUG is not the only solution to cluster assignment
- Many different algorithms exist, using different techniques and different scopes, and occurring at different phases in the compilation process
- No clear-cut winner on the best algorithm for all situations
24 Local Scope
[Figure: the same 12-op DFG partitioned with local scope clustering and with global scope clustering; each version is shown with its per-cycle schedule and the intercluster moves it requires.]
25 Scheduler-centric Nature
- Cluster assignment during scheduling adds complexity
- Detailed resource model/reservation table is slow!
- Forces local decisions to be made
[Figure: per-cycle reservation tables for Cluster 1 and Cluster 2, with X marking occupied resource slots, next to the 12-op DFG.]
26 Region-based Hierarchical Operation Partitioning
- RHOP is one of many advanced clustering techniques
- Code is considered one region at a time
- Weight calculation produces hints about how operations affect the scheduler
- Partitioning uses a multilevel graph partitioner to cluster operations
[Figure: RHOP flow: Program, Region, Weight Calculation, Graph Partitioning.]
27 Weight Calculation
- Node weights are used to determine approximate resource usage
- Differs depending on how many FUs of each type there are per cluster
- Edge weights are used to determine where best to break the graph
- Where is intercluster communication free or preferred?
[Figure: a cluster with a register file and I, F, M, and B functional units, next to a 14-op DFG whose nodes are annotated with (estart, lstart) pairs.]
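The (estart, lstart) annotations in the figure come from a forward and a backward pass over the DFG, and the difference between them (slack) indicates where a move could be absorbed without lengthening the critical path. The sketch below computes only that slack; how RHOP turns slack into actual edge weights is not shown on the slide, so no weight formula is attempted here. The unit latency and the small example graph are assumptions.

```python
def estart_lstart(succs, latency=1):
    """Return (estart, lstart) dicts for a DAG given as op -> successor list."""
    preds = {op: [] for op in succs}
    for op, outs in succs.items():
        for s in outs:
            preds[s].append(op)

    # Topological order (Kahn's algorithm); the list grows while we walk it.
    indeg = {op: len(preds[op]) for op in succs}
    order = [op for op in succs if indeg[op] == 0]
    for op in order:
        for s in succs[op]:
            indeg[s] -= 1
            if indeg[s] == 0:
                order.append(s)

    # Forward pass: earliest start given predecessors.
    estart = {}
    for op in order:
        estart[op] = max((estart[p] + latency for p in preds[op]), default=0)

    # Backward pass: latest start that keeps the critical path length.
    length = max(estart.values())
    lstart = {}
    for op in reversed(order):
        lstart[op] = min((lstart[s] - latency for s in succs[op]),
                         default=length)
    return estart, lstart


def edge_slack(succs, estart, lstart, latency=1):
    """Cycles of delay each edge can absorb without stretching the schedule."""
    return {(u, v): lstart[v] - (estart[u] + latency)
            for u in succs for v in succs[u]}


# Hypothetical DFG: critical path 1 -> 3 -> 5 (and 2 -> 3), side branch 2 -> 4.
succs = {1: [3], 2: [3, 4], 3: [5], 4: [], 5: []}
estart, lstart = estart_lstart(succs)
slack = edge_slack(succs, estart, lstart)
```

Here op 4 gets (estart, lstart) = (1, 2), so the edge from op 2 to op 4 has one cycle of slack and could carry a one-cycle intercluster move for free, while every edge on the critical path has zero slack and is a poor place to cut.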
28 RHOP - Coarsening
- Coarsening takes highly related operations and groups them together to partition later
- Groups are based on edge weights
- Takes snapshots of how things are coarsened; these are considered together later
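One common way to group by edge weight in multilevel partitioners is heavy-edge matching: repeatedly merge the pair of groups joined by the heaviest remaining edge, and keep each intermediate grouping as a snapshot. The sketch below uses that generic technique for illustration; it is not necessarily the exact scheme RHOP uses, and the function names, edge weights, and stopping rule are assumptions.

```python
def coarsen_once(groups, edge_weights):
    """Merge disjoint pairs of groups, heaviest connecting edge first.

    groups:       list of frozensets of ops.
    edge_weights: dict mapping (u, v) op pairs to a weight.
    """
    owner = {op: g for g in groups for op in g}

    # Total weight between each pair of distinct groups.
    between = {}
    for (u, v), w in edge_weights.items():
        gu, gv = owner[u], owner[v]
        if gu != gv:
            key = frozenset((gu, gv))
            between[key] = between.get(key, 0) + w

    matched, merged = set(), []
    for pair, _ in sorted(between.items(), key=lambda kv: -kv[1]):
        a, b = tuple(pair)
        if a not in matched and b not in matched:
            matched.update((a, b))
            merged.append(a | b)
    merged.extend(g for g in groups if g not in matched)
    return merged


def coarsen(ops, edge_weights, target_groups=2):
    """Return all snapshots, from single ops down to target_groups groups."""
    levels = [[frozenset([op]) for op in ops]]
    while len(levels[-1]) > target_groups:
        nxt = coarsen_once(levels[-1], edge_weights)
        if len(nxt) == len(levels[-1]):  # nothing left to merge
            break
        levels.append(nxt)
    return levels


# Hypothetical weights: heavy edges along 1 -> 3 -> 5, light edge 2 -> 4.
weights = {(1, 3): 10, (3, 5): 10, (2, 4): 1, (4, 5): 5}
levels = coarsen([1, 2, 3, 4, 5], weights)
```

The first snapshot pairs ops 1 and 3 and ops 4 and 5, leaving op 2 alone; the second merges those pairs into one group of four. Keeping every level is what lets a later pass move whole groups first and individual ops afterwards.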
29 RHOP: Scheduling estimate
[Figure: per-cycle resource-usage estimates for the ops currently assigned to Cluster 1 and to Cluster 2.]
Cluster_wgt1 = 5.0, Cluster_wgt2 = 0.67
30 RHOP: Checking proposed moves
- Move groups of operations over and see how the move changes the load on the schedule estimate
[Figure: per-cycle schedule estimates for Cluster 1 and Cluster 2 after a proposed move of a group of ops from Cluster 1 to Cluster 2.]
SL(before) = 5.0, SL(after) = 4.5, Lgain = 0.5, Egain = -1.0, Mgain = 4.0
31 RHOP - Refinement