Code Generation for Clustered VLIW processors
1
Code Generation for Clustered VLIW processors
  • M.Tech Project Part II Presentation
  • Devadutt N (2004MCS2450)
  • Under the guidance of Prof. M. Balakrishnan

2
Agenda
  • Introduction
  • Previous approaches
  • Proposed approach
  • Results
  • Implementation notes
  • Conclusions and future work

3
Introduction to the problem
  • Motivation for clustering
  • The need for higher-performance processors leads to an increased number of functional units and a larger number of registers to extract more ILP
  • Increasing the number of ports (n) in the RF increases area (n^3), delay (n^(3/2)) and power (n^3)
  • Complex bypass network
  • Increased clock period

4
Inter-cluster interconnects
[Diagram: Clusters 1–3 with INT, INT, and MEM functional units, connected by R.A and W.A interconnect links]
Example with 3 clusters
5
Previous approaches for acyclic clustering
  • Multi-phased, or unified with scheduling
  • Unified Assign and Schedule (Emre Ozer et al.) and CARS (Krishnan Kailas et al.) are unified.
  • PCC (Desoli), Cluster Assignment for High-Performance Embedded Processors (Lapinskii), and region-based hierarchical clustering (Chu) are multi-phased.
  • Hierarchical or flat
  • PCC (Desoli), region-based hierarchical clustering (Chu), and affinity-based clustering for unrolled loops (Krishnamurthy) follow the hierarchical approach

6
Proposed approach to clustering
  • 2-stage clustering
  • Step 1: Preprocessing
  • Pre-compute components consisting of nodes which are close to each other. Generate k components, such that k > c, where c is the number of hardware clusters
  • Motivation: utilize the graph structure to prevent a completely greedy allocation of operations to clusters
  • Step 2: Integrated clustering and scheduling
  • On-the-fly cluster binding done along with scheduling
  • Motivation: load balancing and better estimation of inter-cluster transfer costs

7
Illustration of proposed approach
[Diagram: example dataflow graph partitioned across Cluster 1 and Cluster 2]
8
Step 1: Preprocessing Component creation
  • Parameters considered while creating components:
  • Criticality (weight) of edges
  • Slack distribution¹
  • Resource usage of the component
  • Height of node
  • Start by grouping nodes higher up
  • Cycle formation between components
  • Maximum path length of a component

¹ Based on Chu et al., "Region-based hierarchical operation partitioning for multicluster processors", PLDI '03
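The merging step behind these parameters can be sketched as a greedy coarsening pass that merges nodes along the heaviest (most critical) edges first, stopping once only k components remain so that k > c still holds. This is a minimal illustrative sketch: the function names and the union-find representation are assumptions, and the real implementation also weighs slack, resource usage, cycle formation, and path length.

```python
# Illustrative sketch: greedily coarsen a DFG into k components by
# merging along the heaviest edges first (union-find representation).

def build_components(nodes, edges, k):
    """edges: list of (weight, u, v); returns a node -> component-root map."""
    parent = {n: n for n in nodes}

    def find(n):
        # follow parent pointers to the component root, compressing the path
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    count = len(nodes)
    for _, u, v in sorted(edges, reverse=True):  # most critical edges first
        if count <= k:
            break  # keep k components, with k > number of clusters
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[rv] = ru  # merge the two components
            count -= 1
    return {n: find(n) for n in nodes}
```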
9
Preprocessing: Component creation
Hardware configuration: 2 clusters, each with 1 ARITH unit and 1 MEM unit
10
Preprocessing (2): Slack distribution
[Diagram: dataflow graph with per-node slack values (0 or 1) and weights in parentheses (10, 8, 1)]
11
Preprocessing (3): Merging components
[Diagram: component graph with components m1–m7 and edge weights 10, 8, and 1]
12
Coarsening (4): Resource estimates
[Diagram: components m1–m7 annotated with per-cycle usage estimates for LD, MPY, and ADD units (values such as 0.33, 0.5, 1, 2, 3)]
13
Preprocessing (5): Merging components
Usage in cycle 0 for the Load unit exceeds 1
[Diagram: component graph with components m1–m6 and edge weights]
14
Preprocessing (6): Merging components
Usage in the 2 components does not interfere; merging them
[Diagram: component graph with components m1–m6 and edge weights]
15
Preprocessing (7): Merging components
Usage in the 2 components does not interfere; merging them
[Diagram: merged component graph and the resulting schedule]
16
Preprocessing (8): Component graph
[Diagram: resulting component graph]
17
Integrated scheduling: Component ordering
  • 2 approaches:
  • No ordering in binding:
  • Bind each component as late as possible.
  • Prevent cycles during preprocessing.
  • Bind in topological order

18
Preprocessing: Restricting max. path length
  • Needed only when cycle detection is being done
  • With long chains, we get a larger number of incoming edges
  • This causes large components to be bound very late
  • If we restrict the height, we may get better results
  • Partially implemented, to be tested

19
Step 2: Integrated scheduling
  • Input: list of components
  • Bind components and schedule operations in an interleaved fashion
  • Bind each component as late as possible
  • When an operation is chosen for scheduling:
  • Check if the containing component is bound; if not, bind it to the cluster where the bind cost would be minimum
  • The bind cost is defined by:
  • Cycles lost due to inter-cluster moves
  • Resource contention due to overloading a cluster

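The interleaved bind-and-schedule loop described above can be sketched roughly as follows. Here `bind_cost` stands in for the cost model defined on the later slides, and all names are illustrative assumptions, not taken from the actual implementation.

```python
# Illustrative sketch of Step 2: list scheduling interleaved with
# on-the-fly, as-late-as-possible component binding.

def integrated_schedule(ready_ops, clusters, component_of, bind_cost):
    binding = {}          # component -> cluster, filled only when needed
    schedule = []
    for op in ready_ops:  # assume ops arrive in scheduling priority order
        comp = component_of[op]
        if comp not in binding:
            # bind the containing component where the bind cost is minimum
            binding[comp] = min(clusters, key=lambda c: bind_cost(comp, c))
        schedule.append((op, binding[comp]))
    return schedule, binding
```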
20
Integrated scheduling: Load estimation
  • We estimate the loss of cycles due to resource contention by counting the number of operations in the cluster which use a loaded resource
  • A resource is considered loaded in a cycle if the probability of it being in use is greater than a threshold p_t
  • This is done by maintaining the probability of a resource being used in every cycle in the cluster, and updating it as components are bound to the cluster

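A minimal sketch of this bookkeeping, assuming a per-resource map from cycle to usage probability; the threshold value p_t = 0.8 and the uniform spreading of an operation's usage over its schedule range are illustrative assumptions, since the slides do not fix either.

```python
# Sketch: track the probability that a resource is busy in each cycle;
# a resource counts as "loaded" in a cycle when that probability
# exceeds a threshold p_t.

P_T = 0.8  # assumed threshold; not specified in the slides

def update_usage(prob, op_range, weight):
    """Spread an operation's usage probability `weight` uniformly
    over its possible schedule range [early, late]."""
    early, late = op_range
    per_cycle = weight / (late - early + 1)
    for u in range(early, late + 1):
        prob[u] = prob.get(u, 0.0) + per_cycle
    return prob

def loaded_cycles(prob, p_t=P_T):
    """Cycles in which the resource is considered loaded."""
    return sorted(u for u, p in prob.items() if p > p_t)
```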
21
Load estimation (2)
  • Let the probability of a resource r being used in cycle u be represented by p(r,u)
  • If the range of cycles of a component m is [i,j], then the resource contention rc(i,j) is defined as
  • rc(i,j) = |OpSet_r[i,j]|
  • OpSet_r[i,j] is the set of operations O[a,b], where [k,l] is the schedule range of O and [a,b] is the intersection of the ranges [k,l] and [i,j], such that
  • ∀u ∈ [a,b]: max p(r,u) > p_t

22
Load estimation (3)
  • The set of operations can be classified into 3 categories as follows:
  • a) Bound and scheduled (bs)
  • b) Bound and unscheduled (bu)
  • c) Unbound (and unscheduled) (uu)
  • The resource usage of the bs operations in each cycle is known accurately
  • The resource usage of the bu operations is spread across its range [early_time, late_time] on the bound cluster
  • For the uu operations, we do not know the cluster binding.
  • If we can estimate the affinity of the containing component of these operations to each cluster, then the resource usage of a component m on cluster c is defined as
  • Total resource usage: rs(m,c) = rs_bs(m,c) + rs_bu(m,c) + pb(m,c) · rs_uu(m,c)
  • where pb(m,c) is the probability of mapping component m to cluster c

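Read as code, the total-usage formula above is simply a probability-weighted sum; a one-line sketch with illustrative names:

```python
def resource_usage(rs_bs, rs_bu, rs_uu, pb):
    """rs(m,c): exact usage of bound-and-scheduled ops, plus spread usage
    of bound-but-unscheduled ops, plus unbound-op usage weighted by the
    probability pb of mapping the component to this cluster."""
    return rs_bs + rs_bu + pb * rs_uu
```

For example, `resource_usage(1.0, 0.5, 2.0, 0.25)` combines one known slot, half a spread slot, and a quarter of the unbound demand.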
23
Integrated scheduling: Intercommunication estimation
  • The intercommunication cost (ic) is defined as follows:
  • ic(m,c) = |CriticalEdgeSet(m,c)|
  • A component edge e(x,m) ∈ CriticalEdgeSet(m,c)
  • if x is bound to a cluster w
  • and w ≠ c
  • and slack(e) < lat(e) + max. transfer cost

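A sketch of this edge test follows; the data structures (edge tuples and the `slack`/`lat` maps) are illustrative assumptions, as the slides do not specify how edges are stored.

```python
# Sketch of the intercommunication cost: count component edges whose
# source is already bound to a different cluster and whose slack cannot
# absorb the edge latency plus the worst-case transfer cost.

def ic_cost(component_edges, candidate_cluster, binding,
            slack, lat, max_transfer_cost):
    cost = 0
    for (x, m, e) in component_edges:    # edge e from component x into m
        w = binding.get(x)
        if w is not None and w != candidate_cluster:
            if slack[e] < lat[e] + max_transfer_cost:
                cost += 1                # e is a critical cross-cluster edge
    return cost
```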
24
Integrated scheduling: Total cost estimation
  • Hence, the total cost of binding a component m to a cluster c is
  • total_cost(m,c) = α · rc(m) + β · ic(m,c)
  • As of now, in the implementation, we have set α = β = 1
  • We bind component m to the cluster c such that
  • ∀c ∈ C, total_cost(m,c) is minimum
  • where C is the set of clusters

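The binding decision then reduces to an argmin over clusters. In this sketch `rc` is passed the candidate cluster as well (an assumption: the slide writes rc(m), but the contention estimate described earlier depends on the candidate cluster's load); the example numbers in the test mirror the totals on the illustration slide (4 vs. 3).

```python
# Sketch of the binding rule: total_cost(m,c) = alpha*rc + beta*ic,
# with alpha = beta = 1 as in the implementation described here.
# rc and ic are the estimators from the previous slides, passed in
# as callables.

def bind_component(m, clusters, rc, ic, alpha=1.0, beta=1.0):
    return min(clusters, key=lambda c: alpha * rc(m, c) + beta * ic(m, c))
```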
25
Integrated scheduling: Illustration
Component load (m1) at cycle 0
[Diagram: DFG nodes 2, 3, 1, 9; ready list and selected components m1–m4; existing cluster load]
IC cost (C1) = 0, IC cost (C2) = 0
m1 is bound to C1
26
Integrated scheduling: Illustration (2)
Component load (m1), cycle 0 contd.
[Diagram: DFG nodes 3, 1, 9; ready list m2, m3, m4; cluster load before and after binding]
27
Integrated scheduling: Illustration (3)
RC-tuple_r(i,j) is the resource contention tuple for resource r between cycles i and j, equal to <new load, load before binding>
[Table: RC-tuple_r(0,1) values]
IC cost (C1) = 0, IC cost (C2) = 1
Total cost (C1) = 4, Total cost (C2) = 3
m2 is bound to C2 and a move is inserted between 2 and 6
28
Post-pass scheduling
  • Spill code is inserted during register allocation; these operations need to be bound to clusters.
  • General structure of spill code
  • We don't form components
  • At schedule time, a greedy decision is taken to bind the operation to the cluster where its source/destination is present
  • If no source/destination is present, map to the cluster with the least load

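The greedy post-pass rule can be sketched as follows; the names are illustrative, and `operand_cluster` is an assumed map from already-bound operands to their clusters.

```python
# Sketch of the post-pass rule for spill code: bind each spill operation
# to the cluster holding one of its source/destination operands, or to
# the least loaded cluster when none is bound.

def bind_spill_op(operands, operand_cluster, cluster_load):
    for o in operands:
        c = operand_cluster.get(o)
        if c is not None:
            return c                     # cluster where an operand lives
    return min(cluster_load, key=cluster_load.get)   # least loaded cluster
```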
29
Modified ELCOR compilation architecture
Reference: Compilers Creating Custom Processors (CCCP) group, University of Michigan
30
Implementation notes
  • Machine description
  • Operations classified as
  • INT, MEM, FLOAT, BRANCH
  • Cluster configuration
  • Supports configuring the number of operations of each type
  • Interconnect configuration
  • Supports multiple buses, point-to-point interconnects, extended reads and extended writes
  • Dedicated functional units for moving operands. Issue slots shared between moves and computation

31
Experimental setup
  • We compare the results of the clustered VLIW with those of an unclustered VLIW (pure VLIW) with the same FU configuration
  • Naming convention:
  • Pure VLIW: PV_imf_issue
  • Clustered: Cl_clusters_imf_issue
  • Configurations compared:
  • PV_222_4 vs Cl_2_111_4
  • PV_444_4 vs Cl_4_111_4
  • PV_444_8 vs Cl_4_222_8

32
Results (1)
33
Results (2)
34
Results for DCT
35
Conclusions and Future work
  • Conclusions
  • We have an approach that tries to balance between pre-pass and integrated clustering approaches
  • The code generation flow is complete. The generated code is simulated and functionally validated.
  • Future work - Implementation
  • Experiment on more benchmarks
  • Fine-tune heuristics and constant values
  • Buses are currently supported; extend to support point-to-point interconnects (support in MDES has been introduced)