EECS 583 Class 17 Multicluster Partitioning - PowerPoint PPT Presentation

Provided by: scottm80

1
EECS 583 Class 17: Multicluster Partitioning
  • University of Michigan
  • November 5, 2007

2
Reading Material
  • Today's class
  • Region-based Hierarchical Operation Partitioning
    for Multicluster Processors, M. Chu et al., PLDI
    2003.
  • Next class
  • Register Allocation and Spilling Via Graph
    Coloring, G. Chaitin, Proc. 1982 SIGPLAN
    Symposium on Compiler Construction, 1982.

3
Modulo Scheduling Wrapup: What if We Don't Have
Hardware Support?
  • No predicates
  • Predicates enable kernel-only code by selectively
    enabling/disabling operations to create
    prolog/epilog
  • Now must create explicit prolog/epilog code
    segments
  • No rotating registers
  • Register names not automatically changed each
    iteration
  • Must unroll the body of the software pipeline,
    explicitly rename
  • Consider each register lifetime i in the loop
  • Kmin = min unroll factor = MAX_i (ceiling((End_i -
    Start_i) / II))
  • Create Kmin static names to handle maximum
    register lifetime
  • Apply modulo variable expansion
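
The Kmin formula above can be sketched in a few lines. This is a minimal illustration, assuming each register lifetime is given as a (start, end) cycle pair; the function name and the sample lifetimes are hypothetical and not taken from the lecture.

```python
import math

def min_unroll_factor(lifetimes, ii):
    """Kmin = max over lifetimes i of ceil((End_i - Start_i) / II).

    lifetimes: list of (start, end) cycles, one per register lifetime
    ii: initiation interval of the software pipeline
    """
    return max(math.ceil((end - start) / ii) for start, end in lifetimes)

# Hypothetical lifetimes, for illustration only.
sample_lifetimes = [(0, 3), (1, 5), (2, 4)]
kmin_ii1 = min_unroll_factor(sample_lifetimes, ii=1)  # longest lifetime spans 4 cycles
kmin_ii2 = min_unroll_factor(sample_lifetimes, ii=2)  # same lifetimes, larger II
```

A larger II lets each static register name be reused sooner relative to the lifetime, so fewer kernel copies are needed.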

4
No Predicates
[Figure: software-pipelined loop with stages A through E, II = 1. Kernel-only code with rotating registers and predicates is contrasted with code that has an explicit prolog, kernel, and epilog.]
Without predicates, we must create explicit prologs
and epilogs, but no explicit renaming is needed
because rotating registers take care of it
5
No Predicates and No Rotating Registers
Assume Kmin = 4 for this example
[Figure: stages A through E with static register names 1 through 4. The prolog fills the pipeline (A1; B1 A2; C1 B2 A3; D1 C2 B3 A4), the unrolled kernel holds four copies of the loop body (E1 D2 C3 B4 A1; E2 D3 C4 B1 A2; E3 D4 C1 B2 A3; E4 D1 C2 B3 A4), and the epilog drains the pipeline.]
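
The renaming pattern in this example can be reproduced with a small simulation. This is a sketch under stated assumptions: II = 1, one stage per cycle, and iteration i uses static name (i mod Kmin) + 1. The function `pipeline_rows` is a hypothetical helper, and it only lists which stage instances execute in each cycle; it does not generate the separate prolog, kernel, and epilog code segments.

```python
def pipeline_rows(stages, kmin, iterations):
    """List the stage instances that execute in each cycle (II = 1).

    Stage s of iteration i runs in cycle i + s and is tagged with the
    static register name (i % kmin) + 1.
    """
    rows = []
    total_cycles = iterations + len(stages) - 1
    for cycle in range(total_cycles):
        row = []
        # Walk from the last stage to the first so the oldest iteration comes first.
        for s in reversed(range(len(stages))):
            iteration = cycle - s
            if 0 <= iteration < iterations:
                row.append(f"{stages[s]}{iteration % kmin + 1}")
        rows.append(row)
    return rows

rows = pipeline_rows("ABCDE", kmin=4, iterations=8)
for row in rows:
    print(" ".join(row))
```

The first four rows fill the pipeline, the full five-op rows cycle through the four register names, and the remaining rows drain it.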
6
Recap: Traditional VLIW Architectures
  • Conventional VLIW
  • Target architecture seen so far in class
  • Large, centralized register file
  • Many functional units connected
  • Problems with conventional design
  • Longer wires mean longer latencies on RF
    accesses
  • A large number of FUs connected to the register
    file requires more ports
  • Register file access time increases quadratically
    with number of ports

[Figure: Conventional Architecture. One register file (RF) connected to five FUs.]
7
Multicluster VLIW Architectures
  • Multicluster VLIW
  • Solution to problems with conventional VLIW
    architecture design.
  • Decentralizes the architecture by splitting the RF
    and connecting each part to a subset of the FUs
  • Requires communication between clusters through
    an intercluster communication path
  • Problem with Multicluster VLIW
  • Compilation must now deal with disjoint FU/RFs,
    and schedule operations accordingly
  • Used in commercial processors
  • Alpha 21264, TI C6x, etc.

[Figure: Clustered Architecture. Cluster 1 and Cluster 2, each with its own register file and two FUs.]
8
Other Multicluster Architecture Designs
  • Clusters can be homogeneous or heterogeneous
  • Homogeneous means each cluster is identical
  • Heterogeneous means FU number/types differ per
    cluster
  • Communication paths can be intercluster buses or
    cross cluster FU inputs

[Figure: two designs, each with Cluster 1 and Cluster 2 holding their own register files and FUs. One connects the clusters through cross-cluster FU inputs, the other through an intercluster bus.]
9
Multicluster Compilation Basics
  • Goal: distribute operations evenly to balance
    workload while minimizing communication
  • When two operations on separate clusters must
    communicate, the interconnection network must be
    used

[Figure: Cluster 1 and Cluster 2, each with a register file and I and MEM units, joined by an interconnection network. Operations shown include >> and LW, with an intercluster move between the clusters.]
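
The two competing goals, balanced workload and little communication, can be measured for any candidate assignment. The sketch below is illustrative only: `partition_cost` is a hypothetical helper, it charges one move per DFG edge that crosses clusters (a simplification, since one move can feed several consumers on the same cluster), and it measures balance as the difference in op counts.

```python
def partition_cost(edges, cluster_of, clusters):
    """Return (intercluster moves, load imbalance) for an assignment.

    edges: list of (producer, consumer) op pairs in the DFG
    cluster_of: dict mapping each op to its cluster
    clusters: all cluster names, including any left empty
    """
    # Simplification: every edge that crosses clusters costs one move.
    moves = sum(1 for src, dst in edges if cluster_of[src] != cluster_of[dst])
    load = {c: 0 for c in clusters}
    for cluster in cluster_of.values():
        load[cluster] += 1
    imbalance = max(load.values()) - min(load.values())
    return moves, imbalance

# Hypothetical 5-op DFG: 1 -> 3 -> 5 and 2 -> 4 -> 5.
edges = [(1, 3), (2, 4), (3, 5), (4, 5)]
split = {1: "C1", 3: "C1", 5: "C1", 2: "C2", 4: "C2"}
all_on_one = {op: "C1" for op in split}

split_cost = partition_cost(edges, split, ["C1", "C2"])
single_cost = partition_cost(edges, all_on_one, ["C1", "C2"])
```

Placing everything on one cluster removes all moves but leaves the other cluster idle; splitting the two chains balances the load at the price of one move.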
10
Cluster Assignment
  • When do we want to do operation cluster
    assignment?
  • Highly intertwined with Scheduling and Register
    Allocation
  • Assignment to clusters can change how well the
    code can be scheduled, which changes how well
    registers can be allocated.
  • Elcor's model
  • Other possible models
  • Combine cluster assignment with scheduling
  • Combine all three
  • Unifying any or all of these three steps can
    greatly increase complexity

[Figure: three phases shown in sequence: Cluster Assignment, Scheduling, Register Allocation]
11
Bottom-Up Greedy (BUG) Algorithm
  • First clustering algorithm introduced for the
    Multiflow Trace architecture
  • Used by Elcor
  • Basic idea
  • Recursive algorithm
  • Go from exit ops to entry ops and pass along good
    cluster candidates for each op
  • Go from entry ops back to exit ops and make final
    decisions
  • Consider ops on critical path first

12
BUG Algorithm (cont.)
  • Given an op and its immediate predecessors and
    successors, how to choose a good cluster?
  • Op must get its input operands from its
    predecessors
  • Perform some computation
  • Send its output to its successors
  • Want to pick cluster such that this process
    completes soonest (greedy)
  • A good choice depends on what clusters the op's
    predecessors and successors are assigned to

13
Definitions
  • Available time
  • When a source operand is computed
  • Arrival time
  • When source operand is moved to current cluster
  • Start time
  • When all source operands are ready (max of
    arrival times) and resources are available
  • Completion time
  • Result has been computed and moved to consumers
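
These four definitions translate directly into small functions. The sketch below makes several assumptions that the slide does not state: an intercluster move costs one cycle, resource availability is reduced to a single "first free cycle" number, and a move is charged on completion only when the op's cluster is not among the destination clusters. That last reading is the one that matches the numbers in the worked example later in the lecture.

```python
MOVE_LATENCY = 1  # assumed cost of one intercluster move, in cycles

def arrival_time(available, src_cluster, dst_cluster):
    """When an operand computed on src_cluster is usable on dst_cluster."""
    return available + (MOVE_LATENCY if src_cluster != dst_cluster else 0)

def start_time(arrivals, first_free_cycle=0):
    """Max of the operand arrival times, delayed until resources are free."""
    return max(max(arrivals, default=0), first_free_cycle)

def completion_time(start, latency, cluster, dest_clusters):
    """When the result is computed and, if needed, moved to its consumers."""
    needs_move = bool(dest_clusters) and cluster not in dest_clusters
    return start + latency + (MOVE_LATENCY if needs_move else 0)

# Op1 is available at cycle 1 on C1. Try placing its consumer Op3 on C2
# while Op3's own consumer is expected on C1.
arrive = arrival_time(1, "C1", "C2")
start = start_time([arrive])
done = completion_time(start, 1, "C2", ["C1"])
```

Keeping Op3 on C1 instead avoids both moves, which is why the greedy choice prefers it.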

14
Definitions Illustrated
[Figure: timeline relative to Op 3, with Ops 1 through 4. It marks AvailableTime(Op1), which equals ArrivalTime(Op1, C1); AvailableTime(Op2), followed by a move and ArrivalTime(Op2, C1); StartTime(Op3, C1); and CompletionTime(Op3, C1, C1, C2).]
  • Choose a cluster for Op 3 to minimize Completion
    Time

15
The Main Function: Assign

Assign(Op, Dests)
    for each Predecessor of Op                    (upward pass: recursive call)
        Est-clusters = Estimate(Op, Dests)
        Assign(Pred, Est-clusters)
    Est-clusters = Estimate(Op, Dests)            (downward pass: actual assignment)
    Cluster = first cluster in Est-clusters
    Assign Op to Cluster
    Mark Cluster's resources busy at StartTime(Op, Cluster)

  • Estimate function returns a list of Clusters for
    which CompletionTime(Op, Cluster, Dests) is
    minimum
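
The Assign and Estimate functions can be turned into a small runnable sketch. It is not the Multiflow or Elcor implementation. The assumptions are: every op takes one cycle, a move takes one cycle, each cluster issues one op per cycle, unassigned predecessors are approximated by depth plus latency, ties go to the first cluster in the list, and predecessors are visited in list order rather than critical path first. The class `Bug`, the early return for already placed ops, and the sample DFG (including the restriction of Op 5 to C1) are all hypothetical.

```python
LAT = 1   # assumed: every op takes one cycle
MOVE = 1  # assumed: an intercluster move takes one cycle

class Bug:
    def __init__(self, preds, clusters, can_run):
        self.preds = preds        # op -> list of predecessor ops
        self.clusters = clusters  # ordered list of cluster names
        self.can_run = can_run    # (op, cluster) -> bool
        self.busy = {c: set() for c in clusters}  # issue cycles already taken
        self.placed = {}          # op -> (cluster, start cycle)

    def estart(self, op):
        """Depth of op in the DFG; stands in for unassigned predecessors."""
        return max((self.estart(p) + LAT for p in self.preds[op]), default=0)

    def start_time(self, op, cluster):
        ready = 0
        for p in self.preds[op]:
            if p in self.placed:
                p_cluster, p_start = self.placed[p]
                arrival = p_start + LAT + (MOVE if p_cluster != cluster else 0)
            else:
                arrival = self.estart(p) + LAT  # approximate AvailableTime
            ready = max(ready, arrival)
        while ready in self.busy[cluster]:      # wait for a free issue slot
            ready += 1
        return ready

    def completion_time(self, op, cluster, dests):
        done = self.start_time(op, cluster) + LAT
        if dests and cluster not in dests:      # result must be moved
            done += MOVE
        return done

    def estimate(self, op, dests):
        times = {c: self.completion_time(op, c, dests)
                 for c in self.clusters if self.can_run(op, c)}
        best = min(times.values())
        return [c for c in self.clusters if times.get(c) == best]

    def assign(self, op, dests):
        if op in self.placed:   # added guard for ops shared by several consumers
            return
        for pred in self.preds[op]:                  # upward pass
            self.assign(pred, self.estimate(op, dests))
        cluster = self.estimate(op, dests)[0]        # downward pass
        start = self.start_time(op, cluster)
        self.placed[op] = (cluster, start)
        self.busy[cluster].add(start)

# Hypothetical DFG: 1 -> 3 -> 5 and 2 -> 4 -> 5; only C1 may run Op 5.
preds = {1: [], 2: [], 3: [1], 4: [2], 5: [3, 4]}
bug = Bug(preds, ["C1", "C2"], lambda op, c: c == "C1" or op != 5)
bug.assign(5, [])
```

Under these assumptions, `bug.placed` ends with Ops 1, 3, 4, and 5 on C1 and Op 2 on C2, and Op 5 starts at cycle 3, so it completes at cycle 4.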

16
BUG
  • Traverses DFG in a reverse depth-first-search
    fashion
  • Upward pass
  • Predecessors have not been assigned yet
  • Use depth (estart) plus latency to approximate
    predecessors' AvailableTime
  • Estimate a set of good clusters for current op
  • Recursively assign predecessors with current set
    as predecessors' Dests
  • Downward pass
  • Make final cluster decisions for ops

17
Example
  • Assume all ops are 1-cycle
  • Each cluster can execute one op per cycle
  • Cluster 1 can execute any op, cluster 2 can only
    execute

[Figure: example DFG with clusters C1 and C2]

18
Example: left path, upward pass
[Figure: DFG path Op1, Op3, Op5, annotated with candidate cluster C1]
AvailTime(Op1) = 1
CompTime(Op3, C1, C1) = 2, CompTime(Op3, C2, C1) = 3
CompTime(Op5, C1, -) = 3
CompTime(Op1, C1, C1) = 1, CompTime(Op1, C2, C1) = 2
19
Example: left path, downward pass
[Figure: DFG path Op1, Op3, Op5, annotated with cluster C1]
ArrivTime(Op1, C1) = 1, ArrivTime(Op1, C2) = 2
StartTime(Op3, C1) = 1, CompTime(Op3, C1, C1) = 2
StartTime(Op3, C2) = 2, CompTime(Op3, C2, C1) = 4
20
Example: right path, upward pass
[Figure: DFG with Ops 1 through 5; Op4 is annotated with candidate clusters C1, C2 and Op5 with C1]
AvailTime(Op2) = 1
StartTime(Op4, C1) = 2, CompTime(Op4, C1, C1) = 3
StartTime(Op4, C2) = 1, CompTime(Op4, C2, C1) = 3
CompTime(Op2, C1, C1, C2) = 3, CompTime(Op2, C2, C1, C2) = 1
21
Example: right path, downward pass
[Figure: DFG with Ops 1 through 5; Op5 is annotated with C1]
ArrivTime(Op2, C1) = 2, ArrivTime(Op2, C2) = 1
StartTime(Op4, C1) = 2, CompTime(Op4, C1, C1) = 3
StartTime(Op4, C2) = 1, CompTime(Op4, C2, C1) = 3
CompTime(Op5, C1, -) = 4
22
Class problem
[Figure: class problem DFG, clusters C1 and C2, and a blank schedule to fill in]
23
Problems with BUG
  • BUG does a fairly good job of partitioning the
    DFG, but it can be improved
  • Problem 1: Local scope of the DFG
  • Has a very narrow view of the DFG
  • Doesn't consider the best global clustering
  • Problem 2: Scheduler-centric
  • Using the scheduler to determine the clustering
    is slow!
  • BUG is not the only solution to cluster
    assignment
  • Many different algorithms exist, using different
    techniques and scopes, and running at different
    phases of the compilation process
  • No clear-cut winner on the best algorithm for all
    situations.

24
Local Scope
[Figure: the same 12-op DFG and its cycle-by-cycle schedule, shown twice: once under local scope clustering and once under global scope clustering, with intercluster moves marked.]
25
Scheduler-centric Nature
  • Cluster Assignment during scheduling adds
    complexity
  • Detailed resource model/reservation table is
    slow!
  • Forces local decisions to be made

[Figure: cycle-by-cycle resource reservation tables for Cluster 1 and Cluster 2, with Ops 1 through 12 placed and occupied slots marked X.]
26
Region-based Hierarchical Operation Partitioning
  • RHOP is one of many advanced clustering
    techniques
  • Code is considered one region at a time
  • Weight calculation determines hints for how
    operations affect scheduler
  • Partitioning uses multilevel graph partitioner to
    cluster operations

[Figure: RHOP flow: Program, Region, Weight Calculation, Graph Partitioning]
27
Weight Calculation
  • Node weights are used to determine approximate
    resource usage
  • Differs depending on how many FUs of each type
    per cluster
  • Edge weights are used to determine where to best
    break the graph
  • Where is intercluster communication free or
    preferred?

[Figure: an example cluster with a register file and I, F, M, and B function units, next to a 14-op DFG in which each op is annotated with (estart, lstart), ranging from (0,0) at the top to (4,4) at the bottom.]
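
The (estart, lstart) pairs are a natural input for deciding which edges are cheap to cut: an edge with slack can absorb an intercluster move, while a zero-slack edge cannot. The sketch below only computes estart, lstart, and per-edge slack for a small DFG. It assumes unit latency and uses a hypothetical 6-op graph. How RHOP turns slack into actual edge weights is not shown on the slide and is not reproduced here.

```python
def estart_lstart(preds, latency=1):
    """Earliest and latest start cycle of every op in a DAG."""
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)

    estart = {}
    def es(op):
        if op not in estart:
            estart[op] = max((es(p) + latency for p in preds[op]), default=0)
        return estart[op]
    for op in preds:
        es(op)

    length = max(estart.values())  # start cycle of the last op on the critical path
    lstart = {}
    def ls(op):
        if op not in lstart:
            lstart[op] = min((ls(s) - latency for s in succs[op]), default=length)
        return lstart[op]
    for op in preds:
        ls(op)
    return estart, lstart

def edge_slack(preds, latency=1):
    """Cycles an edge can be stretched without lengthening the schedule."""
    estart, lstart = estart_lstart(preds, latency)
    return {(p, op): lstart[op] - estart[p] - latency
            for op, ps in preds.items() for p in ps}

# Hypothetical DFG: critical path 1 -> 3 -> 5 -> 6, side path through 2 and 4.
preds = {1: [], 2: [], 3: [1], 4: [1, 2], 5: [3], 6: [4, 5]}
estart, lstart = estart_lstart(preds)
slack = edge_slack(preds)
```

In this graph the edges along 1, 3, 5, 6 have zero slack, while every edge touching Op 4 has one cycle of slack, so a partitioner would prefer to cut around Op 4.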
28
RHOP - Coarsening
  • Coarsening takes highly-related operations and
    groups them together to later partition
  • Groups based on edge weights
  • Takes snapshots of how things are coarsened,
    later will consider them together
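
A common way to coarsen in multilevel graph partitioners is to repeatedly merge the pair of groups joined by the heaviest edge, keeping a snapshot of each level. The sketch below assumes that approach; the slide does not spell out RHOP's exact grouping rules, and the function names and the 4-op example are hypothetical.

```python
def coarsen_once(groups, edge_weight):
    """Merge disjoint pairs of groups, heaviest connecting weight first."""
    def weight_between(g1, g2):
        return sum(w for (a, b), w in edge_weight.items()
                   if (a in g1 and b in g2) or (a in g2 and b in g1))

    pairs = sorted(((weight_between(g1, g2), i, j)
                    for i, g1 in enumerate(groups)
                    for j, g2 in enumerate(groups) if i < j),
                   reverse=True)
    merged, used = [], set()
    for weight, i, j in pairs:
        if weight > 0 and i not in used and j not in used:
            merged.append(groups[i] | groups[j])
            used.update((i, j))
    # Groups that found no partner carry over unchanged.
    merged.extend(g for i, g in enumerate(groups) if i not in used)
    return merged

def coarsen(ops, edge_weight, target_groups=2):
    """Return a snapshot of the grouping at every coarsening level."""
    snapshots = [[frozenset([op]) for op in ops]]
    while len(snapshots[-1]) > target_groups:
        coarser = coarsen_once(snapshots[-1], edge_weight)
        if len(coarser) == len(snapshots[-1]):  # nothing left to merge
            break
        snapshots.append(coarser)
    return snapshots

# Hypothetical chain 1 - 2 - 3 - 4 with a light edge in the middle.
snapshots = coarsen([1, 2, 3, 4], {(1, 2): 10, (2, 3): 1, (3, 4): 10})
```

The snapshots are kept so that a later refinement step can walk back through the levels and move whole groups, then smaller groups, between clusters.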

29
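RHOP: Scheduling estimate
[Figure: per-cycle schedule estimate for each cluster]
Cluster 1 (Ops 1, 2, 3, 4, 5, 6, 8, 9, 12, 14): per-cycle weights 2.5, 2.0, 0.5, 0.0, 0.0; Cluster_wgt1 = 5.0
Cluster 2 (Ops 7, 10, 11, 13): per-cycle weights 0.0, 0.33, 0.33, 0.0, 0.0; Cluster_wgt2 = 0.67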
30
RHOP: Checking proposed moves
  • Move groups of operations over, see how it
    changes the load on the schedule estimate

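[Figure: schedule estimate after moving a group of operations from Cluster 1 to Cluster 2]
Cluster 1 (Ops 1, 2, 3, 8, 12, 14): per-cycle weights 1.0, 0.0, 0.0, 0.0, 0.0
Cluster 2 (Ops 4, 5, 6, 7, 9, 10, 11, 13): per-cycle weights 1.33, 2.33, 0.83, 0.0, 0.0
SL(before) = 5.0, SL(after) = 4.5
Lgain = 0.5, Egain = -1.0, Mgain = 4.0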
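
The load numbers on these two slides are consistent with a simple reading: a cluster's weight is the sum of its per-cycle weights, the system load SL is the largest cluster weight, and Lgain is SL(before) minus SL(after). That reading is inferred from the figures rather than stated in the lecture, and the sketch below covers only Lgain. The formulas behind Egain and Mgain are not given, so they are left out.

```python
def cluster_weight(per_cycle_weights):
    """Total estimated load of one cluster (assumed: sum over cycles)."""
    return sum(per_cycle_weights)

def system_load(clusters):
    """Assumed: the system load is the load of the most heavily loaded cluster."""
    return max(cluster_weight(w) for w in clusters.values())

# Per-cycle weights read from the figures before and after the proposed move.
before = {"C1": [2.5, 2.0, 0.5, 0.0, 0.0], "C2": [0.0, 0.33, 0.33, 0.0, 0.0]}
after = {"C1": [1.0, 0.0, 0.0, 0.0, 0.0], "C2": [1.33, 2.33, 0.83, 0.0, 0.0]}

lgain = system_load(before) - system_load(after)
```

With the rounded per-cycle values this gives an Lgain of about 0.5, in line with the figure.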
31
RHOP - Refinement