EECS 583 Class 17 Multicluster Partitioning - PowerPoint PPT Presentation


Provided by: scottm80


Transcript and Presenter's Notes


1
EECS 583 Class 17: Multicluster Partitioning

University of Michigan
November 5, 2007

2
Reading Material

Today's class
"Region-based Hierarchical Operation Partitioning for Multicluster Processors," M. Chu et al., PLDI 2003.
Next class
"Register Allocation and Spilling Via Graph Coloring," G. Chaitin, Proc. 1982 SIGPLAN Symposium on Compiler Construction, 1982.

3
Modulo Scheduling Wrapup: What if We Don't Have Hardware Support?

No predicates
Predicates enable kernel-only code by selectively enabling/disabling operations to create the prolog/epilog
Without them, explicit prolog/epilog code segments must be created
No rotating registers
Register names are not automatically changed each iteration
Must unroll the body of the software pipeline and explicitly rename
Consider each register lifetime i in the loop
Kmin = min unroll factor = $\max_i \lceil (End_i - Start_i) / II \rceil$
Create Kmin static names to handle the maximum register lifetime
Apply modulo variable expansion
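
The Kmin calculation can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the lifetime values, the function names, and the `name_k` naming scheme are all assumptions made for the example.

```python
import math

def min_unroll_factor(lifetimes, ii):
    """Kmin = max over lifetimes of ceil((end - start) / II).

    lifetimes: list of (start, end) cycles, one per register lifetime.
    """
    return max(math.ceil((end - start) / ii) for start, end in lifetimes)

def expanded_name(reg, iteration, kmin):
    """Static name a given loop iteration uses after unrolling kmin times.

    The reg_k naming scheme is hypothetical, chosen only for illustration.
    """
    return f"{reg}_{iteration % kmin}"

# Hypothetical lifetimes (start cycle, end cycle) with II = 2.
lifetimes = [(0, 3), (1, 8), (2, 4)]
kmin = min_unroll_factor(lifetimes, ii=2)
names = [expanded_name("r", it, kmin) for it in range(6)]
```

Here the longest lifetime spans 7 cycles, so with II = 2 four iterations can be live at once and four static names are needed; iteration 4 reuses the name of iteration 0.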

4
No Predicates
[Figure: kernel-only code with rotating registers and predicates (II = 1), shown next to the same loop expanded into explicit prolog, kernel, and epilog; stages A through E]
Without predicates, explicit prolog and epilog code must be created, but no explicit renaming is needed because rotating registers take care of it.
5
No Predicates and No Rotating Registers
Assume Kmin = 4 for this example
[Figure: prolog, unrolled kernel, and epilog for stages A through E, with each op carrying one of four static names (A1..A4, B1..B4, and so on)]
6
Recap Traditional VLIW Architectures

Conventional VLIW
Target architecture seen so far in class
Large, centralized register file
Many functional units connected
Problems with conventional design
Longer wires require longer latencies on RF
accesses
A large number of FUs connected to the register file requires more ports
Register file access time increases quadratically with the number of ports

[Figure: conventional architecture, a single register file (RF) connected to five FUs]
7
Multicluster VLIW Architectures

Multicluster VLIW
Solution to problems with conventional VLIW
architecture design.
Decentralized architecture by splitting RF and
connecting subsets of the FUs
Require communication between clusters through
intercluster communication path
Problem with Multicluster VLIW
Compilation must now deal with disjoint FU/RFs,
and schedule operations accordingly
Used in commercial processors
Alpha 21264, TI C6x, etc.

[Figure: clustered architecture, Cluster 1 and Cluster 2, each with its own register file and FUs]
8
Other Multicluster Architecture Designs

Clusters can be homogeneous or heterogeneous
Homogeneous means each cluster is identical
Heterogeneous means FU number/types differ per
cluster
Communication paths can be intercluster buses or
cross cluster FU inputs

[Figure: two two-cluster designs, one using cross-cluster FU inputs and one using an intercluster bus; each cluster has its own register file and FUs]
9
Multicluster Compilation Basics

Goal: distribute operations evenly to balance workload while minimizing communication
When two operations on separate clusters need to communicate, the interconnection network must be used
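
The two competing goals can be measured for any candidate assignment. The sketch below is an illustration under assumptions: the five-op DFG, the cluster names, and both helper functions are hypothetical, and a value is assumed to need only one move per destination cluster.

```python
def count_intercluster_moves(edges, cluster_of):
    """Count distinct (producer, destination cluster) pairs that need a move.

    edges: list of (producer, consumer) pairs in the DFG.
    cluster_of: dict mapping each op to its cluster.
    """
    moves = {
        (src, cluster_of[dst])
        for src, dst in edges
        if cluster_of[src] != cluster_of[dst]
    }
    return len(moves)

def load_per_cluster(cluster_of):
    """Number of ops assigned to each cluster."""
    loads = {}
    for cluster in cluster_of.values():
        loads[cluster] = loads.get(cluster, 0) + 1
    return loads

# Hypothetical DFG: 1 -> 3, 2 -> 3, 2 -> 4, 3 -> 5, 4 -> 5.
edges = [(1, 3), (2, 3), (2, 4), (3, 5), (4, 5)]
split = {1: "C1", 2: "C2", 3: "C1", 4: "C2", 5: "C1"}
all_on_one = {op: "C1" for op in split}

moves_split = count_intercluster_moves(edges, split)
moves_one = count_intercluster_moves(edges, all_on_one)
loads_split = load_per_cluster(split)
```

Putting everything on one cluster needs no moves but leaves the other cluster idle; the split assignment balances the load at the cost of two moves.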

[Figure: Cluster 1 and Cluster 2, each with a register file and I and MEM units, joined by an interconnection network; an intercluster move connects operations (>>, LW) placed on different clusters]
10
Cluster Assignment

When do we want to do operation cluster
assignment?
Highly intertwined with Scheduling and Register
Allocation
Assignment to clusters can change how well the
code can be scheduled, which changes how well
registers can be allocated.
Elcor's model
Other possible models
Combine cluster assignment with scheduling
Combine all three
Unifying any or all of these three steps can
greatly increase complexity

[Figure: three phases: Cluster Assignment, Scheduling, Register Allocation]
11
Bottom-Up Greedy (BUG) Algorithm

First clustering algorithm introduced for the
Multiflow Trace architecture
Used by Elcor
Basic idea
Recursive algorithm
Go from exit ops to entry ops and pass along good
cluster candidates for each op
Go from entry ops back to exit ops and make final
decisions
Consider ops on critical path first

12
BUG Algorithm (cont.)

Given an op and its immediate predecessors and
successors, how to choose a good cluster?
Op must get its input operands from its
predecessors
Perform some computation
Send its output to its successors
Want to pick cluster such that this process
completes soonest (greedy)
A good choice depends on which clusters the op's predecessors and successors are assigned to

13
Definitions

Available time
When a source operand is computed
Arrival time
When source operand is moved to current cluster
Start time
When all source operands are ready (max of
arrival times) and resources are available
Completion time
Result has been computed and moved to consumers
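
The four definitions can be written as small functions. This is a sketch under stated assumptions: a one-cycle intercluster move, one issue slot per cluster per cycle, and a cycle-numbering convention chosen for the example, so the numbers need not match those on the slides.

```python
MOVE_LATENCY = 1  # assumed cost of one intercluster move

def arrival_time(available, src_cluster, cluster):
    """When a source operand reaches `cluster`."""
    return available if src_cluster == cluster else available + MOVE_LATENCY

def start_time(sources, cluster, busy):
    """Max of arrival times, pushed later while the cluster is busy.

    sources: list of (available_time, producing_cluster) pairs.
    busy: set of cycles in which the cluster's resource is already taken.
    """
    t = max((arrival_time(a, c, cluster) for a, c in sources), default=0)
    while t in busy:
        t += 1
    return t

def completion_time(sources, cluster, dests, busy, latency=1):
    """Result computed and, if any consumer is elsewhere, moved to it."""
    done = start_time(sources, cluster, busy) + latency
    if any(d != cluster for d in dests):
        done += MOVE_LATENCY
    return done

# Hypothetical op with one source on C1 and one on C2, consumer on C1.
sources = [(1, "C1"), (1, "C2")]
on_c1 = completion_time(sources, "C1", ["C1"], busy=set())
on_c2 = completion_time(sources, "C2", ["C1"], busy=set())
on_c1_busy = completion_time(sources, "C1", ["C1"], busy={2})
```

With these inputs C1 finishes one cycle earlier than C2 because the result does not have to be moved back, unless C1 is already occupied in the start cycle.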

14
Definitions Illustrated
[Figure: timeline relative to Op 3 with ops 1 to 4 and one move, labeled AvailableTime(Op2), ArrivalTime(Op2, C1), AvailableTime(Op1) = ArrivalTime(Op1, C1), StartTime(Op3, C1), and CompletionTime(Op3, C1, {C1, C2})]

Choose a cluster for Op 3 to minimize CompletionTime

15
The Main Function: Assign

Assign (Op, Dests)
    for each Pred in Predecessors of Op          (upward pass: recursive call)
        Est-clusters = Estimate (Op, Dests)
        Assign (Pred, Est-clusters)
    Est-clusters = Estimate (Op, Dests)          (downward pass: actual assignment)
    Cluster = first cluster in Est-clusters
    Assign Op to Cluster
    Mark Cluster's resources busy at StartTime(Op, Cluster)

Estimate returns the list of Clusters for which CompletionTime(Op, Cluster, Dests) is minimum
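
A compact Python rendering of this recursion is given here as a sketch, not as Elcor's or Multiflow's code. The class name `Bug`, the one-cycle op and move latencies, the one-op-per-cycle resource model, and the five-op DFG are assumptions; visiting critical-path predecessors first is omitted for brevity.

```python
MOVE = 1      # assumed intercluster move latency
LATENCY = 1   # assumed: every op takes one cycle

class Bug:
    def __init__(self, preds, clusters, can_run):
        self.preds = preds          # op -> list of predecessor ops
        self.clusters = clusters    # cluster names in preference order
        self.can_run = can_run      # (op, cluster) -> bool
        self.cluster_of = {}        # final assignment
        self.start = {}             # op -> start cycle
        self.busy = {c: set() for c in clusters}

    def depth(self, op):
        # Stand-in for estart: longest path from an entry op.
        return max((self.depth(p) + LATENCY for p in self.preds[op]), default=0)

    def avail(self, pred):
        # (available time, cluster) of a result; cluster is None if unassigned.
        if pred in self.cluster_of:
            return self.start[pred] + LATENCY, self.cluster_of[pred]
        return self.depth(pred) + LATENCY, None

    def start_time(self, op, cluster):
        t = 0
        for p in self.preds[op]:
            a, c = self.avail(p)
            t = max(t, a + (MOVE if c not in (None, cluster) else 0))
        while t in self.busy[cluster]:
            t += 1
        return t

    def completion(self, op, cluster, dests):
        done = self.start_time(op, cluster) + LATENCY
        return done + (MOVE if any(d != cluster for d in dests) else 0)

    def estimate(self, op, dests):
        legal = [c for c in self.clusters if self.can_run(op, c)]
        best = min(self.completion(op, c, dests) for c in legal)
        return [c for c in legal if self.completion(op, c, dests) == best]

    def assign(self, op, dests):
        if op in self.cluster_of:
            return
        for pred in self.preds[op]:                  # upward pass
            self.assign(pred, self.estimate(op, dests))
        cluster = self.estimate(op, dests)[0]        # downward pass
        self.cluster_of[op] = cluster
        self.start[op] = self.start_time(op, cluster)
        self.busy[cluster].add(self.start[op])

# Hypothetical DFG: 1 -> 3 -> 5 and 2 -> 4 -> 5; C2 can only run ops 2 and 4.
preds = {1: [], 2: [], 3: [1], 4: [2], 5: [3, 4]}
bug = Bug(preds, ["C1", "C2"], lambda op, c: c == "C1" or op in {2, 4})
bug.assign(5, [])
```

Calling `assign` on the exit op with an empty destination list walks the left path first, then the right path. In this run Op 2 lands on C2 because C1 is already busy in the early cycles, which is the kind of resource-driven choice the downward pass is meant to capture.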

16
BUG

Traverses the DFG in a reverse depth-first-search fashion
Upward pass
Predecessors have not been assigned yet
Use depth (estart) plus latency to approximate the predecessors' AvailableTime
Estimate a set of good clusters for the current op
Recursively assign predecessors with the current set as the predecessors' Dests
Downward pass
Make final cluster decisions for ops

17
Example

Assume all ops are 1-cycle
Each cluster can execute one op per cycle
Cluster 1 can execute any op, cluster 2 can only
execute

[Figure: example setup with clusters C1 and C2]

18
Example: left path, upward pass
[Figure: left path of the DFG (ops 1, 3, 5) annotated with candidate cluster C1]
AvailTime(Op1) = 1
CompTime(Op3, C1, C1) = 2; CompTime(Op3, C2, C1) = 3
CompTime(Op5, C1, -) = 3
CompTime(Op1, C1, C1) = 1; CompTime(Op1, C2, C1) = 2
19
Example: left path, downward pass
[Figure: left path of the DFG (ops 1, 3, 5) with cluster C1 marked]
ArrivTime(Op1, C1) = 1; ArrivTime(Op1, C2) = 2
StartTime(Op3, C1) = 1; CompTime(Op3, C1, C1) = 2
StartTime(Op3, C2) = 2; CompTime(Op3, C2, C1) = 4
20
Example: right path, upward pass
[Figure: full DFG (ops 1 to 5) annotated with candidate clusters "C1,C2" and "C1"]
AvailTime(Op2) = 1
StartTime(Op4, C1) = 2; CompTime(Op4, C1, C1) = 3
StartTime(Op4, C2) = 1; CompTime(Op4, C2, C1) = 3
CompTime(Op2, C1, {C1, C2}) = 3; CompTime(Op2, C2, {C1, C2}) = 1
21
Example: right path, downward pass
[Figure: full DFG (ops 1 to 5) with cluster C1 marked]
ArrivTime(Op2, C1) = 2; ArrivTime(Op2, C2) = 1
StartTime(Op4, C1) = 2; CompTime(Op4, C1, C1) = 3
StartTime(Op4, C2) = 1; CompTime(Op4, C2, C1) = 3
CompTime(Op5, C1, -) = 4
22
Class problem
[Figure: class problem with clusters C1 and C2, a DFG containing M ops, and a schedule to fill in]
23
Problems with BUG

BUG does a fairly good job of partitioning the DFG, but it can be improved
Problem 1: Local scope of the DFG
Has a very narrow view of the DFG
Doesn't consider the best global clustering
Problem 2: Scheduler-centric
Using the scheduler to determine the clustering is slow!
BUG is not the only solution to cluster assignment
Many different algorithms exist; they use different techniques and different scopes, and run at different phases of the compilation process
No clear-cut winner on the best algorithm for all situations

24
Local Scope
[Figure: the same 12-op DFG under local scope clustering and under global scope clustering, each shown with its cycle-by-cycle schedule and intercluster moves]
25
Scheduler-centric Nature

Cluster Assignment during scheduling adds
complexity
Detailed resource model/reservation table is
slow!
Forces local decisions to be made

[Figure: per-cycle reservation tables for Cluster 1 and Cluster 2, filled in as ops 1 to 12 are placed]
26
Region-based Hierarchical Operation Partitioning

RHOP is one of many advanced clustering
techniques
Code is considered one region at a time
Weight calculation produces hints about how operations affect the scheduler
Partitioning uses multilevel graph partitioner to
cluster operations

[Figure: RHOP flow: Program, Region, Weight Calculation, Graph Partitioning]
27
Weight Calculation

Node weights are used to determine approximate
resource usage
Differs depending on how many FUs of each type
per cluster
Edge weights are used to determine where to best
break the graph
Where is intercluster communication free or
preferred?
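
One plausible input to edge weights is scheduling slack, computed from each op's earliest and latest start (estart, lstart). The sketch below computes both and the slack of every edge. Treating zero-slack edges as the ones that are expensive to cut is an assumption about how such hints could be used, not a statement of RHOP's exact formula; the small DFG and unit latency are also hypothetical.

```python
def estart_lstart(succs, latency=1):
    """Earliest and latest start of each op.

    succs: op -> list of successors, with keys listed in topological order
    (an assumption this sketch relies on).
    """
    ops = list(succs)
    estart = {op: 0 for op in ops}
    for op in ops:                       # forward pass
        for s in succs[op]:
            estart[s] = max(estart[s], estart[op] + latency)
    length = max(estart.values())
    lstart = {op: length for op in ops}
    for op in reversed(ops):             # backward pass
        for s in succs[op]:
            lstart[op] = min(lstart[op], lstart[s] - latency)
    return estart, lstart

def edge_slack(succs, estart, lstart, latency=1):
    """Cycles an edge can be stretched without lengthening the schedule."""
    return {
        (u, v): lstart[v] - estart[u] - latency
        for u in succs
        for v in succs[u]
    }

# Hypothetical DFG: a -> b -> c -> e is the critical path, a -> d -> e is not.
succs = {"a": ["b", "d"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
estart, lstart = estart_lstart(succs)
slack = edge_slack(succs, estart, lstart)
```

Edges along a -> b -> c -> e have zero slack, so an intercluster move on any of them would lengthen the schedule, while the edges through d each have one cycle to spare.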

[Figure: a cluster with a register file and I, F, M, B units, and a 14-op DFG in which each op is labeled (estart, lstart), from (0,0) at the top to (4,4) at the bottom]
28
RHOP - Coarsening

Coarsening takes highly-related operations and
groups them together to later partition
Groups based on edge weights
Takes snapshots of how things are coarsened,
later will consider them together
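
Grouping by edge weight with snapshots can be sketched with heavy-edge matching, a rule commonly used in multilevel graph partitioners. Whether RHOP uses exactly this rule is an assumption; the function names, the four-op graph, and its weights are hypothetical.

```python
def coarsen_once(groups, edge_weights):
    """Merge pairs of groups, heaviest connecting weight first.

    groups: list of frozensets of ops.
    edge_weights: {(u, v): weight} between individual ops.
    """
    group_of = {op: g for g in groups for op in g}
    between = {}  # total edge weight between each pair of distinct groups
    for (u, v), w in edge_weights.items():
        gu, gv = group_of[u], group_of[v]
        if gu != gv:
            key = frozenset((gu, gv))
            between[key] = between.get(key, 0) + w
    matched, merged = set(), []
    for key, _ in sorted(between.items(), key=lambda kv: -kv[1]):
        gu, gv = tuple(key)
        if gu not in matched and gv not in matched:
            matched.update((gu, gv))
            merged.append(gu | gv)
    merged.extend(g for g in groups if g not in matched)
    return merged

def coarsen(ops, edge_weights, target):
    """Return one snapshot of the grouping per coarsening level."""
    snapshots = [[frozenset([op]) for op in ops]]
    while len(snapshots[-1]) > target:
        nxt = coarsen_once(snapshots[-1], edge_weights)
        if len(nxt) == len(snapshots[-1]):
            break  # nothing left to merge
        snapshots.append(nxt)
    return snapshots

# Hypothetical chain 1 - 2 - 3 - 4 with a light middle edge.
weights = {(1, 2): 10, (2, 3): 1, (3, 4): 10}
snapshots = coarsen([1, 2, 3, 4], weights, target=2)
```

The heavy edges are merged first, leaving the light edge (2, 3) as the natural place to cut. Keeping every level in `snapshots` lets a later refinement step move whole groups at the coarse levels and smaller groups at the finer ones.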

29
RHOP - Scheduling estimate
[Figure: per-cycle load estimate for each cluster, with ops placed in their cycle ranges; Cluster_wgt1 = 5.0, Cluster_wgt2 = 0.67]
30
RHOP - Checking proposed moves

Move groups of operations over, see how it changes the load on the schedule estimate

[Figure: schedule estimate after moving a group of ops from Cluster 1 to Cluster 2; SL(before) = 5.0, SL(after) = 4.5, Lgain = 0.5, Egain = -1.0, Mgain = 4.0]
31
RHOP - Refinement
