Code Generation for Clustered VLIW processors
1
Code Generation for Clustered VLIW processors
  • M.Tech Project Part II Presentation
  • Devadutt N (2004MCS2450)
  • Under the guidance of Prof. M. Balakrishnan

2
Agenda
  • Introduction
  • Previous approaches
  • Proposed approach
  • Results
  • Implementation notes
  • Conclusions and future work

3
Introduction to the problem
  • Motivation for clustering
  • The need for higher-performance processors leads to an increased number of functional units and a larger number of registers to extract more ILP
  • Increasing the number of ports (n) in the RF increases area (n^3), delay (n^(3/2)) and power (n^3)
  • Complex bypass network
  • Increased clock period

4
Inter-cluster interconnects
[Diagram: Clusters 1–3 with INT, INT, and MEM functional units, connected by R.A and W.A interconnect links]
Example with 3 clusters
5
Previous approaches for acyclic clustering
  • Multi-phased, or unified with scheduling
  • Unified Assign and Schedule (Emre Ozer et al.) and CARS (Krishnan Kailas et al.) are unified.
  • PCC (Desoli), Cluster Assignment for High-Performance Embedded Processors (Lapinskii), and region-based hierarchical clustering (Chu) are multi-phased.
  • Hierarchical or flat
  • PCC (Desoli), region-based hierarchical clustering (Chu), and affinity-based clustering for unrolled loops (Krishnamurthy) follow the hierarchical approach

6
Proposed approach to clustering
  • 2-stage clustering
  • Step 1: Preprocessing
  • Pre-compute components consisting of nodes which are close to each other. Generate k components, such that k > c, where c is the number of hardware clusters
  • Motivation: utilize the graph structure to prevent a completely greedy allocation of operations to clusters
  • Step 2: Integrated clustering and scheduling
  • On-the-fly cluster binding done along with scheduling
  • Motivation: load balancing and better estimation of inter-cluster transfer costs

7
Illustration of proposed approach
[Diagram: example dataflow graph partitioned across Cluster 1 and Cluster 2]
8
Step 1: Preprocessing Component creation
  • Parameters considered while creating components:
  • Criticality (weight) of edges
  • Slack distribution¹
  • Resource usage of the component
  • Height of node
  • Start by grouping nodes higher up
  • Cycle formation between components
  • Maximum path length of a component

¹ Based on Chu et al., "Region-based hierarchical operation partitioning for multicluster processors", PLDI '03
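The merging step behind these parameters can be sketched as a greedy coarsening pass that merges nodes along the heaviest (most critical) edges first, stopping once only k components remain so that k > c still holds. This is a minimal illustrative sketch: the function names and the union-find representation are assumptions, and the real implementation also weighs slack, resource usage, cycle formation, and path length.

```python
# Illustrative sketch: greedily coarsen a DFG into k components by
# merging along the heaviest edges first (union-find representation).

def build_components(nodes, edges, k):
    """edges: list of (weight, u, v); returns a node -> component-root map."""
    parent = {n: n for n in nodes}

    def find(n):
        # follow parent pointers to the component root, compressing the path
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    count = len(nodes)
    for _, u, v in sorted(edges, reverse=True):  # most critical edges first
        if count <= k:
            break  # keep k components, with k > number of clusters
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[rv] = ru  # merge the two components
            count -= 1
    return {n: find(n) for n in nodes}
```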
9
Preprocessing: Component creation
Hardware configuration: 2 clusters, each with 1 ARITH unit and 1 MEM unit
10
Preprocessing (2): Slack distribution
[Diagram: dataflow graph with per-node slack values (0 or 1) and weights in parentheses (10, 8, 1)]
11
Preprocessing (3): Merging components
[Diagram: component graph with components m1–m7 and edge weights 10, 8, and 1]
12
Coarsening (4): Resource estimates
[Diagram: components m1–m7 annotated with per-cycle usage estimates for LD, MPY, and ADD units (values such as 0.33, 0.5, 1, 2, 3)]
13
Preprocessing (5): Merging components
Usage in cycle 0 for the Load unit exceeds 1
[Diagram: component graph with components m1–m6 and edge weights]
14
Preprocessing (6): Merging components
Usage in the 2 components does not interfere; merging them
[Diagram: component graph with components m1–m6 and edge weights]
15
Preprocessing (7): Merging components
Usage in the 2 components does not interfere; merging them
[Diagram: merged component graph and the resulting schedule]
16
Preprocessing (8): Component graph
[Diagram: resulting component graph]
17
Integrated scheduling: Component ordering
  • 2 approaches:
  • No ordering in binding:
  • Bind each component as late as possible.
  • Prevent cycles during preprocessing.
  • Bind in topological order

18
Preprocessing: Restricting max. path length
  • Needed only when cycle detection is being done
  • With long chains, we get a larger number of incoming edges
  • This causes large components to be bound very late
  • If we restrict the height, we may get better results
  • Partially implemented, to be tested

19
Step 2: Integrated scheduling
  • Input: list of components
  • Bind components and schedule operations in an interleaved fashion
  • Bind each component as late as possible
  • When an operation is chosen for scheduling:
  • Check if the containing component is bound; if not, bind it to the cluster where the bind cost would be minimum
  • The bind cost is defined by:
  • Cycles lost due to inter-cluster moves
  • Resource contention due to overloading a cluster

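The interleaved bind-and-schedule loop described above can be sketched roughly as follows. Here `bind_cost` stands in for the cost model defined on the later slides, and all names are illustrative assumptions, not taken from the actual implementation.

```python
# Illustrative sketch of Step 2: list scheduling interleaved with
# on-the-fly, as-late-as-possible component binding.

def integrated_schedule(ready_ops, clusters, component_of, bind_cost):
    binding = {}          # component -> cluster, filled only when needed
    schedule = []
    for op in ready_ops:  # assume ops arrive in scheduling priority order
        comp = component_of[op]
        if comp not in binding:
            # bind the containing component where the bind cost is minimum
            binding[comp] = min(clusters, key=lambda c: bind_cost(comp, c))
        schedule.append((op, binding[comp]))
    return schedule, binding
```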
20
Integrated scheduling: Load estimation
  • We estimate the loss of cycles due to resource contention by counting the number of operations in the cluster which use a loaded resource
  • A resource is considered loaded in a cycle if the probability of it being in use is greater than a threshold p_t
  • This is done by maintaining the probability of a resource being used in every cycle in the cluster, and updating it as components are bound to the cluster

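A minimal sketch of this bookkeeping, assuming a per-resource map from cycle to usage probability; the threshold value p_t = 0.8 and the uniform spreading of an operation's usage over its schedule range are illustrative assumptions, since the slides do not fix either.

```python
# Sketch: track the probability that a resource is busy in each cycle;
# a resource counts as "loaded" in a cycle when that probability
# exceeds a threshold p_t.

P_T = 0.8  # assumed threshold; not specified in the slides

def update_usage(prob, op_range, weight):
    """Spread an operation's usage probability `weight` uniformly
    over its possible schedule range [early, late]."""
    early, late = op_range
    per_cycle = weight / (late - early + 1)
    for u in range(early, late + 1):
        prob[u] = prob.get(u, 0.0) + per_cycle
    return prob

def loaded_cycles(prob, p_t=P_T):
    """Cycles in which the resource is considered loaded."""
    return sorted(u for u, p in prob.items() if p > p_t)
```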
21
Load estimation (2)
  • Let the probability of a resource r being used in cycle u be represented by p(r,u)
  • If the range of cycles of a component m is [i,j], then the resource contention rc(i,j) is defined as
  • rc(i,j) = |OpSet_r[i,j]|
  • OpSet_r[i,j] is the set of operations O[a,b], where [k,l] is the schedule range of O and [a,b] is the intersection of the ranges [k,l] and [i,j], such that
  • ∀u ∈ [a,b]: max p(r,u) > p_t

22
Load estimation (3)
  • The set of operations can be classified into 3 categories as follows:
  • a) Bound and scheduled (bs)
  • b) Bound and unscheduled (bu)
  • c) Unbound (and unscheduled) (uu)
  • The resource usage of the bs operations in each cycle is known accurately
  • The resource usage of the bu operations is spread across its range [early_time, late_time] on the bound cluster
  • For the uu operations, we do not know the cluster binding.
  • If we can estimate the affinity of the containing component of these operations to each cluster, then the resource usage of a component m on cluster c is defined as
  • Total resource usage: rs(m,c) = rs_bs(m,c) + rs_bu(m,c) + pb(m,c) · rs_uu(m,c)
  • where pb(m,c) is the probability of mapping component m to cluster c

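Read as code, the total-usage formula above is simply a probability-weighted sum; a one-line sketch with illustrative names:

```python
def resource_usage(rs_bs, rs_bu, rs_uu, pb):
    """rs(m,c): exact usage of bound-and-scheduled ops, plus spread usage
    of bound-but-unscheduled ops, plus unbound-op usage weighted by the
    probability pb of mapping the component to this cluster."""
    return rs_bs + rs_bu + pb * rs_uu
```

For example, `resource_usage(1.0, 0.5, 2.0, 0.25)` combines one known slot, half a spread slot, and a quarter of the unbound demand.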
23
Integrated scheduling: Intercommunication estimation
  • The intercommunication cost (ic) is defined as follows:
  • ic(m,c) = |CriticalEdgeSet(m,c)|
  • A component edge e(x,m) ∈ CriticalEdgeSet(m,c)
  • if x is bound to a cluster w
  • and w ≠ c
  • and slack(e) < lat(e) + max. transfer cost

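A sketch of this edge test follows; the data structures (edge tuples and the `slack`/`lat` maps) are illustrative assumptions, as the slides do not specify how edges are stored.

```python
# Sketch of the intercommunication cost: count component edges whose
# source is already bound to a different cluster and whose slack cannot
# absorb the edge latency plus the worst-case transfer cost.

def ic_cost(component_edges, candidate_cluster, binding,
            slack, lat, max_transfer_cost):
    cost = 0
    for (x, m, e) in component_edges:    # edge e from component x into m
        w = binding.get(x)
        if w is not None and w != candidate_cluster:
            if slack[e] < lat[e] + max_transfer_cost:
                cost += 1                # e is a critical cross-cluster edge
    return cost
```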
24
Integrated scheduling: Total cost estimation
  • Hence, the total cost of binding a component m to a cluster c is
  • total_cost(m,c) = α · rc(m) + β · ic(m,c)
  • As of now, in the implementation, we have set α = β = 1
  • We bind component m to the cluster c such that
  • ∀c ∈ C, total_cost(m,c) is minimum
  • where C is the set of clusters

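The binding decision then reduces to an argmin over clusters. In this sketch `rc` is passed the candidate cluster as well (an assumption: the slide writes rc(m), but the contention estimate described earlier depends on the candidate cluster's load); the example numbers in the test mirror the totals on the illustration slide (4 vs. 3).

```python
# Sketch of the binding rule: total_cost(m,c) = alpha*rc + beta*ic,
# with alpha = beta = 1 as in the implementation described here.
# rc and ic are the estimators from the previous slides, passed in
# as callables.

def bind_component(m, clusters, rc, ic, alpha=1.0, beta=1.0):
    return min(clusters, key=lambda c: alpha * rc(m, c) + beta * ic(m, c))
```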
25
Integrated scheduling: Illustration
Component load (m1) at cycle 0
[Diagram: DFG nodes 2, 3, 1, 9; ready list and selected components m1–m4; existing cluster load]
IC cost (C1) = 0, IC cost (C2) = 0
m1 is bound to C1
26
Integrated scheduling: Illustration (2)
Component load (m1), cycle 0 contd.
[Diagram: DFG nodes 3, 1, 9; ready list m2, m3, m4; cluster load before and after binding]
27
Integrated scheduling: Illustration (3)
RC-tuple_r(i,j) is the resource contention tuple for resource r between cycles i and j, equal to <new load, load before binding>
[Table: RC-tuple_r(0,1) values]
IC cost (C1) = 0, IC cost (C2) = 1
Total cost (C1) = 4, Total cost (C2) = 3
m2 is bound to C2 and a move is inserted between 2 and 6
28
Post-pass scheduling
  • Spill code is inserted during register allocation; these operations need to be bound to clusters.
  • General structure of spill code
  • We don't form components
  • At schedule time, a greedy decision is taken to bind the operation to the cluster where its source/destination is present
  • If no source/destination is present, map to the cluster with the least load

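The greedy post-pass rule can be sketched as follows; the names are illustrative, and `operand_cluster` is an assumed map from already-bound operands to their clusters.

```python
# Sketch of the post-pass rule for spill code: bind each spill operation
# to the cluster holding one of its source/destination operands, or to
# the least loaded cluster when none is bound.

def bind_spill_op(operands, operand_cluster, cluster_load):
    for o in operands:
        c = operand_cluster.get(o)
        if c is not None:
            return c                     # cluster where an operand lives
    return min(cluster_load, key=cluster_load.get)   # least loaded cluster
```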
29
Modified ELCOR compilation architecture
Reference: Compilers Creating Custom Processors (CCCP) group, University of Michigan
30
Implementation notes
  • Machine description
  • Operations classified as
  • INT, MEM, FLOAT, BRANCH
  • Cluster configuration
  • Supports configuring the number of operations of each type
  • Interconnect configuration
  • Supports multiple buses, point-to-point interconnects, extended reads and extended writes
  • Dedicated functional units for moving operands. Issue slots shared between moves and computation

31
Experimental setup
  • We compare the results of the clustered VLIW with those of an unclustered VLIW (pure VLIW) with the same FU configuration
  • Naming convention:
  • Pure VLIW: PV_imf_issue
  • Clustered: Cl_clusters_imf_issue
  • Configurations compared:
  • PV_222_4 vs Cl_2_111_4
  • PV_444_4 vs Cl_4_111_4
  • PV_444_8 vs Cl_4_222_8

32
Results (1)
33
Results (2)
34
Results for DCT
35
Conclusions and Future work
  • Conclusions
  • We have an approach that tries to balance between pre-pass and integrated clustering approaches
  • The code generation flow is complete. The generated code is simulated and functionally validated.
  • Future work - Implementation
  • Experiment on more benchmarks
  • Fine-tune heuristics and constant values
  • Buses are currently supported; extend to support point-to-point interconnects (support in MDES has been introduced)