Title: Code Generation for Clustered VLIW Processors
1. Code Generation for Clustered VLIW Processors
- M.Tech Project Part II Presentation
- Devadutt N (2004MCS2450)
- Under the guidance of Prof. M. Balakrishnan
2. Agenda
- Introduction
- Previous approaches
- Proposed approach
- Results
- Implementation notes
- Conclusions and future work
3. Introduction to the problem
- Motivation for clustering
  - The need for higher-performance processors leads to an increased number of functional units and a larger number of registers to extract more ILP.
  - Increasing the number of ports (n) in the register file increases area (n³), delay (n^(3/2)) and power (n³).
  - Complex bypass network
  - Increased clock period
4. Inter-cluster interconnects
[Figure: example with 3 clusters, each containing INT and MEM functional units, connected to the interconnect through R.A (read access) and W.A (write access) points]
5. Previous approaches for acyclic clustering
- Multi-phased or unified with scheduling
  - Unified Assign-and-Schedule (Emre Ozer et al.) and CARS (Krishnan Kailas et al.) are unified.
  - PCC (Desoli), Cluster Assignment for High-Performance Embedded Processors (Lapinskii) and Region-based Hierarchical Clustering (Chu) are multi-phased.
- Hierarchical or flat
  - PCC (Desoli), Region-based Hierarchical Clustering (Chu) and Affinity-based Clustering for Unrolled Loops (Krishnamurthy) follow the hierarchical approach.
6. Proposed approach to clustering
- Two-stage clustering (a minimal data model used by the sketches in this deck follows this slide)
- Step 1: Preprocessing
  - Pre-compute components consisting of nodes that are close to each other. Generate k components such that k > c, where c is the number of hardware clusters.
  - Motivation: use the graph structure to prevent a completely greedy allocation of operations to clusters.
- Step 2: Integrated clustering and scheduling
  - Cluster binding is done on the fly, along with scheduling.
  - Motivation: load balancing and better estimation of inter-cluster transfer costs.
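Note: purely as an illustration of the data the two stages operate on, here is a minimal Python data model that the later sketches in this deck are written against. The class and field names are assumptions for exposition, not the actual ELCOR/Trimaran IR.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Operation:
    op_id: int
    resource: str                  # e.g. "INT" or "MEM"
    early: int = 0                 # earliest feasible cycle (ASAP)
    late: int = 0                  # latest feasible cycle (ALAP)
    cluster: Optional[int] = None  # bound cluster, None while unbound
    cycle: Optional[int] = None    # scheduled cycle, None while unscheduled

@dataclass
class Component:
    comp_id: int
    ops: List[Operation] = field(default_factory=list)
    cluster: Optional[int] = None  # bound lazily during integrated scheduling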
7. Illustration of proposed approach
[Figure: example DFG partitioned between Cluster 1 and Cluster 2]
8. Step 1: Preprocessing - Component creation
- Parameters considered while creating components (a grouping sketch follows this slide):
  - Criticality (weight) of edges
  - Slack distribution¹
  - Resource usage of the component
  - Height of node: start by grouping nodes higher up
  - Cycle formation between components
  - Maximum path length of a component
¹ Based on Chu et al., "Region-based Hierarchical Operation Partitioning for Multicluster Processors", PLDI '03
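Note: a rough sketch of the criticality-driven grouping idea, assuming the DFG is given as weighted edges; the function and parameter names (build_components, k_factor) are illustrative. The real pass additionally weighs slack distribution, resource usage, cycle formation and path length, which are omitted here.

# Sketch: greedily merge nodes across the heaviest (most critical) edges
# until only k components remain, keeping k above the cluster count c.
def build_components(nodes, edges, num_clusters, k_factor=2):
    # edges: list of (src, dst, weight); heavier weight = more critical
    comp_of = {n: {n} for n in nodes}                 # node -> its component (a set)
    target_k = max(num_clusters * k_factor, num_clusters + 1)
    for src, dst, _w in sorted(edges, key=lambda e: -e[2]):
        if len({id(c) for c in comp_of.values()}) <= target_k:
            break                                     # k components reached
        a, b = comp_of[src], comp_of[dst]
        if a is not b:                                # merge across the critical edge
            a |= b
            for n in b:
                comp_of[n] = a
    seen, comps = set(), []
    for c in comp_of.values():                        # collect distinct components
        if id(c) not in seen:
            seen.add(id(c))
            comps.append(sorted(c))
    return comps

# Example: one critical edge (weight 10) and one weak edge (weight 1).
print(build_components([0, 1, 2], [(0, 1, 10), (1, 2, 1)], num_clusters=1))
# -> [[0, 1], [2]]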
9. Preprocessing: Component creation
- Hardware configuration: 2 clusters, each with 1 ARITH unit and 1 MEM unit
10. Preprocessing (2): Slack distribution
[Figure: DFG annotated with per-node slack values (0 or 1) and edge weights in parentheses (1, 8, 10)]
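Note: a minimal sketch of the slack bookkeeping behind this figure, slack(n) = ALAP(n) - ASAP(n) over the DFG. This is standard list-scheduling bookkeeping; the slack-distribution heuristic itself follows Chu et al. and is not reproduced here.

# Sketch: slack = ALAP - ASAP on a DAG whose node ids are in topological order.
def compute_slack(succ, latency, num_nodes, horizon):
    # succ[n]: successor node ids; latency[n]: latency of node n;
    # horizon: length of the critical path (ASAP time of the sink node).
    asap = [0] * num_nodes
    for n in range(num_nodes):
        for s in succ[n]:
            asap[s] = max(asap[s], asap[n] + latency[n])
    alap = [horizon] * num_nodes
    for n in reversed(range(num_nodes)):
        for s in succ[n]:
            alap[n] = min(alap[n], alap[s] - latency[n])
    return [alap[n] - asap[n] for n in range(num_nodes)]

# Example: critical chain 0 -> 2 -> 3 plus an off-path node 1 feeding node 3.
succ = {0: [2], 1: [3], 2: [3], 3: []}
latency = {0: 1, 1: 1, 2: 1, 3: 1}
print(compute_slack(succ, latency, num_nodes=4, horizon=2))
# -> [0, 1, 0, 0]: slack 0 on the critical path, slack 1 off it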
11. Preprocessing (3): Merging components
[Figure: component graph over m1-m7 with inter-component edge weights (1, 8, 10)]
12. Coarsening (4): Resource estimates
[Figure: per-cycle resource-usage estimates for the LD, MPY and ADD units of components m1-m7, including fractional values such as 0.33 and 0.5]
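Note: one plausible way to arrive at estimates like those in the figure is to spread each unscheduled operation's demand uniformly over its feasible cycle range, which is how fractional values such as 0.33 and 0.5 arise. The sketch below is an assumption about the estimator, not the exact implementation.

from collections import defaultdict

# Sketch: expected per-cycle usage of each resource type for one component.
# Each operation contributes 1/(range width) to every cycle it might occupy.
def resource_usage(ops):
    # ops: list of (resource, early_cycle, late_cycle) tuples
    usage = defaultdict(float)                  # (resource, cycle) -> expected use
    for res, early, late in ops:
        width = late - early + 1
        for cycle in range(early, late + 1):
            usage[(res, cycle)] += 1.0 / width
    return dict(usage)

# Example: a load fixed in cycle 0, plus a load that may go in cycle 0, 1 or 2.
print(resource_usage([("LD", 0, 0), ("LD", 0, 2)]))
# -> {('LD', 0): 1.33.., ('LD', 1): 0.33.., ('LD', 2): 0.33..}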
13. Preprocessing (5): Merging components
- Usage in cycle 0 for the Load unit exceeds 1, so these components are not merged.
[Figure: component graph (m1-m6, edge weights 1, 8, 10) for the rejected merge]
14. Preprocessing (6): Merging components
- Usage in the two components does not interfere, so they are merged.
[Figure: component graph for this merge]
15. Preprocessing (7): Merging components
- Usage in the two components does not interfere, so they are merged (a sketch of this interference test follows).
[Figure: component graph for this merge, together with the resulting schedule]
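Note: a sketch of the interference test this walkthrough implies: two components are merged only if, in every cycle, their combined expected usage of each resource stays within what a single cluster provides. This reconstruction is an assumption; the actual merge decision also weighs the other parameters listed on slide 8.

# Sketch: merge two components only if the combined per-cycle usage of every
# resource does not exceed the units available in one cluster.
def can_merge(usage_a, usage_b, units_per_cluster):
    # usage_*: dict mapping (resource, cycle) -> expected usage
    # units_per_cluster: dict mapping resource -> units in a single cluster
    combined = dict(usage_a)
    for key, val in usage_b.items():
        combined[key] = combined.get(key, 0.0) + val
    return all(val <= units_per_cluster[res]
               for (res, _cycle), val in combined.items())

# From the walkthrough: two loads both needing cycle 0 cannot be merged,
# but a cycle-0 load and a cycle-1 load can.
units = {"LD": 1}
print(can_merge({("LD", 0): 1.0}, {("LD", 0): 1.0}, units))  # -> False
print(can_merge({("LD", 0): 1.0}, {("LD", 1): 1.0}, units))  # -> True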
16. Preprocessing (8): Component graph
17. Integrated scheduling: Component ordering
- No ordering in binding; bind each component as late as possible.
- Prevent cycles during preprocessing; bind in topological order.
18. Preprocessing: Restricting maximum path length
- Needed only when cycle detection is being done.
- With long chains, we get a larger number of incoming edges.
- This causes large components to be bound very late.
- If we restrict the height, we may get better results.
- Partially implemented; yet to be tested.
19. Step 2: Integrated scheduling
- Input: a list of components
- Bind components and schedule operations in an interleaved fashion; bind each component as late as possible (a sketch follows this slide).
- When an operation is chosen for scheduling, check whether its containing component is bound; if not, bind it to the cluster where the bind cost would be minimum.
- The bind cost is defined by:
  - Cycles lost due to inter-cluster moves
  - Resource contention due to overloading a cluster
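Note: a schematic of the "bind as late as possible" rule: a component's cluster is chosen only when one of its operations is first picked for scheduling, and then the cheapest cluster wins. bind_cost here is a stand-in for the rc/ic cost developed on the following slides; the names are illustrative.

# Sketch: defer cluster binding until an operation of the component is picked,
# then choose the cluster with minimum bind cost.
def bind_on_demand(component, clusters, bind_cost):
    if component.get("cluster") is None:        # component not yet bound
        component["cluster"] = min(clusters,
                                   key=lambda c: bind_cost(component, c))
    return component["cluster"]

# Example with a toy cost that simply prefers the less-loaded cluster.
loads = {0: 3, 1: 1}
comp = {"name": "m2", "cluster": None}
cost = lambda component, c: loads[c]
print(bind_on_demand(comp, clusters=[0, 1], bind_cost=cost))  # -> 1
print(bind_on_demand(comp, clusters=[0, 1], bind_cost=cost))  # stays bound to 1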
20. Integrated scheduling: Load estimation
- We estimate the loss of cycles due to resource contention by counting the number of operations in the cluster that use a loaded resource.
- A resource is considered loaded in a cycle if the probability of it being in use is greater than a threshold p_t.
- This is done by maintaining, for every cycle in the cluster, the probability of each resource being used, and updating it as components are bound to the cluster.
21. Load estimation (2)
- Let the probability of a resource r being used in cycle u be p(r, u).
- If the range of cycles of a component m is [i, j], then the resource contention rc(i, j) is defined as
  rc(i, j) = |OpSet_r(i, j)|
- OpSet_r(i, j) is the set of operations O with schedule range [k, l], where [a, b] is the intersection of [k, l] and [i, j], such that
  max over u ∈ [a, b] of p(r, u) > p_t
  (a sketch of this computation follows)
22. Load estimation (3)
- The set of operations can be classified into 3 categories:
  - a) Bound and scheduled (bs)
  - b) Bound and unscheduled (bu)
  - c) Unbound (and unscheduled) (uu)
- The resource usage of the bs operations in each cycle is known accurately.
- The resource usage of the bu operations is spread across its [early_time, late_time] range on the bound cluster.
- For the uu operations we do not know the cluster binding. If we can estimate the affinity of the containing component of these operations to each cluster, then the resource usage of a component m on cluster c is defined as (see the sketch below)
  total_resource_usage: rs(m, c) = rs_bs(m, c) + rs_bu(m, c) + pb(m, c) · rs_uu(m, c)
  where pb(m, c) is the probability of mapping component m to cluster c.
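Note: the combined estimate written out as code; the three per-category usage tables and pb(m, c) are assumed to be maintained elsewhere, and the function signature is hypothetical.

# Sketch: expected usage of one (resource, cycle) slot on cluster c, combining
# bound+scheduled (exact), bound+unscheduled (spread over [early, late]) and
# unbound operations weighted by the binding probability pb(m, c).
def total_resource_usage(rs_bs, rs_bu, rs_uu, pb_mc, key):
    # rs_*: dicts mapping (resource, cycle) -> usage from that category
    return (rs_bs.get(key, 0.0)
            + rs_bu.get(key, 0.0)
            + pb_mc * rs_uu.get(key, 0.0))

# Example: one scheduled load, half an unscheduled load, and an unbound load
# whose component maps to this cluster with probability 0.5.
print(total_resource_usage({("LD", 0): 1.0}, {("LD", 0): 0.5},
                           {("LD", 0): 1.0}, pb_mc=0.5, key=("LD", 0)))  # -> 2.0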
23. Integrated scheduling: Intercommunication estimation
- The intercommunication cost (ic) is defined as
  ic(m, c) = |CriticalEdgeSet(m, c)|
- A component edge e(x, m) ∈ CriticalEdgeSet(m, c) if
  - x is bound to a cluster w,
  - w ≠ c, and
  - slack(e) < lat(e) + max. transfer cost
  (a sketch follows)
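Note: the critical-edge test as code. Reading a "+" between lat(e) and the maximum transfer cost into the garbled slide is my interpretation; lat and the transfer cost are passed in as plain numbers here.

# Sketch: intercommunication cost of binding component m to cluster c =
# number of incoming component edges whose producer sits on another cluster
# and whose slack cannot hide the latency plus the worst-case transfer.
def intercomm_cost(in_edges, c, max_transfer_cost):
    # in_edges: list of (producer_cluster, slack, latency); producer_cluster is
    # None if the producing component is still unbound.
    return sum(1 for w, slack, lat in in_edges
               if w is not None and w != c
               and slack < lat + max_transfer_cost)

# Example: one edge from cluster 0 with no slack, one already on cluster 1.
edges = [(0, 0, 1), (1, 3, 1)]
print(intercomm_cost(edges, c=1, max_transfer_cost=1))   # -> 1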
24. Integrated scheduling: Total cost estimation
- Hence, the total cost of binding a component m to a cluster c is
  total_cost(m, c) = α · rc(m) + β · ic(m, c)
- As of now, the implementation sets α = β = 1.
- We bind component m to the cluster c ∈ C for which total_cost(m, c) is minimum, where C is the set of clusters (a sketch follows).
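Note: the cost combination and the cluster choice, with α = β = 1 as in the current implementation. rc and ic are passed in as callables so the example stays standalone; the toy numbers match the totals in the illustration on slide 27.

# Sketch: pick the cluster minimising total_cost(m, c) = alpha*rc + beta*ic.
def choose_cluster(clusters, rc, ic, alpha=1.0, beta=1.0):
    # rc(c), ic(c): contention and intercommunication cost of binding m to c
    return min(clusters, key=lambda c: alpha * rc(c) + beta * ic(c))

rc = {0: 4, 1: 2}.get           # resource contention per cluster (toy values)
ic = {0: 0, 1: 1}.get           # intercommunication cost per cluster
print(choose_cluster([0, 1], rc, ic))   # -> 1, i.e. C2 (total cost 3 vs 4)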
25. Integrated scheduling: Illustration
[Figure: ready list with components m1-m4 selected, the load profile of m1, and the existing cluster loads]
- IC cost (C1) = 0, IC cost (C2) = 0
- m1 is bound to C1.
26. Integrated scheduling: Illustration (2)
[Figure: ready list (m2, m3, m4), component load, and the cluster loads before and after binding]
27. Integrated scheduling: Illustration (3)
- RC-tuple_r(i, j) is the resource contention tuple for resource r between cycles i and j, and is equal to <new load, load before binding>.
[Figure: RC-tuple_r(0, 1) values]
- IC cost (C1) = 0, IC cost (C2) = 1
- Total_cost (C1) = 4, Total_cost (C2) = 3
- m2 is bound to C2 and a move is inserted between 2 and 6.
28. Post-pass scheduling
- Spill code is inserted during register allocation; these operations need to be bound to clusters.
- General structure of spill code
- We don't form components for spill code.
- At schedule time, a greedy decision is taken to bind the operation to the cluster where its source/destination is present (a sketch follows).
- If no source/destination is present, map it to the cluster with the least load.
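Note: the greedy post-pass rule above as a small sketch; the argument names are assumptions.

# Sketch: bind a spill operation to the cluster holding its source or
# destination value; otherwise fall back to the least-loaded cluster.
def bind_spill_op(src_cluster, dst_cluster, cluster_load):
    if src_cluster is not None:
        return src_cluster
    if dst_cluster is not None:
        return dst_cluster
    return min(cluster_load, key=cluster_load.get)    # least-loaded cluster

print(bind_spill_op(None, 1, {0: 5, 1: 7}))     # -> 1 (follow the destination)
print(bind_spill_op(None, None, {0: 5, 1: 7}))  # -> 0 (least loaded)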
29. Modified ELCOR compilation architecture
Reference: Compilers Creating Custom Processors (CCCP) group, University of Michigan
30. Implementation notes
- Machine description (an illustrative configuration follows this slide)
  - Operations classified as INT, MEM, FLOAT, BRANCH
- Cluster configuration
  - Supports configuring the number of operations of each type per cluster
- Interconnect configuration
  - Supports multiple buses, point-to-point interconnects, extended reads and extended writes
  - Dedicated functional units for moving operands; issue slots are shared between moves and computation
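Note: purely illustrative of the kind of parameters the machine description exposes; the concrete MDES syntax used by Trimaran/ELCOR is different and is not shown here.

# Illustrative cluster/interconnect description (not the real MDES syntax).
machine = {
    "clusters": 2,
    # functional units of each operation class available per cluster
    "units_per_cluster": {"INT": 1, "MEM": 1, "FLOAT": 1, "BRANCH": 1},
    "interconnect": {
        "kind": "bus",              # buses supported today; point-to-point planned
        "buses": 1,
        "extended_reads": True,     # operand read directly from a remote RF
        "extended_writes": True,    # result written directly to a remote RF
        "move_units_share_issue_slots": True,
    },
}
print(machine["units_per_cluster"]["MEM"])   # -> 1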
31. Experimental setup
- We compare the results of the clustered VLIW against an unclustered VLIW (pure VLIW) with the same FU configuration.
- Naming convention
  - Pure VLIW: PV_imf_issue
  - Clustered: Cl_clusters_imf_issue
  (imf gives the number of INT, MEM and FLOAT units; issue is the issue width)
- Configurations compared
  - PV_222_4 vs Cl_2_111_4
  - PV_444_4 vs Cl_4_111_4
  - PV_444_8 vs Cl_4_222_8
32. Results (1)
33. Results (2)
34. Results for DCT
35. Conclusions and future work
- Conclusions
  - We have an approach that tries to balance between pre-pass and integrated clustering approaches.
  - The code generation flow is complete; the generated code is simulated and functionally validated.
- Future work (implementation)
  - Experiment on more benchmarks.
  - Fine-tune heuristics and constant values.
  - Current support is for buses; extend to support point-to-point interconnects (support in MDES has been introduced).