Region-based Hierarchical Operation Partitioning for Multicluster Processors

Slides: 19

Provided by: cecs8

Learn more at: https://www.cecs.uci.edu

Transcript and Presenter's Notes

1
Region-based Hierarchical Operation Partitioning
for Multicluster Processors

Michael Chu, Kevin Fan, Scott Mahlke
University of Michigan
Presented by Cristian Petrescu-Prahova

2
Clustered Register Files

Why?
Register file cost and access time grow with the
square of the number of register ports
Bypass logic grows quadratically with the number
of operations issued per cycle
Distance separating FUs from register file
increases with a large number of FUs
=> Clustered register files
Decentralized architecture with several small
register files
Each register file supplies operands to a subset
of FUs
Examples: Multiflow Trace, Alpha 21264, TI C6x, Analog Devices
TigerSHARC (two clusters); reconfigurable meshes?

3
Goal

Partition operations across the resources
available on each cluster to maximize ILP
Minimize inter-cluster communication
Rule of thumb
A processor with 2 identical clusters loses 20%
performance
A processor with 4 identical clusters loses 30%
performance
Nonidentical clusters lead to even more
performance loss

4
Well-Known Technique: Bottom-Up Greedy (BUG)

Recurse along DFG, critical path first
Assign each operation a cluster based on
estimates of when the operation and its
predecessors can complete earliest (from
scheduler)
Problem 1: makes local decisions (see figure)
Problem 2: is slow, since it needs to query accurate
cluster status info for each operation considered
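
To make the local-decision problem concrete, here is a minimal sketch of a greedy cluster assignment in the spirit of BUG. It is not the published algorithm: it visits operations in a given topological order instead of recursing critical path first, and it models each cluster as a single non-pipelined unit. The names greedy_assign, busy_until and move_latency are hypothetical.

```python
def greedy_assign(ops, preds, latency, num_clusters, move_latency=1):
    """Assign each op to the cluster where it is estimated to finish earliest.

    ops: operations in topological order (assumption of this sketch).
    preds: dict mapping an op to the list of its predecessors.
    latency: dict mapping an op to its latency in cycles.
    """
    cluster_of = {}
    finish = {}
    # Simplification: one non-pipelined unit per cluster.
    busy_until = [0] * num_clusters
    for op in ops:
        best = None
        for c in range(num_clusters):
            ready = 0
            for p in preds.get(op, []):
                # Operands produced on another cluster pay a move penalty.
                penalty = move_latency if cluster_of[p] != c else 0
                ready = max(ready, finish[p] + penalty)
            start = max(ready, busy_until[c])
            done = start + latency[op]
            # Strict '<' keeps the lowest-numbered cluster on ties.
            if best is None or done < best[0]:
                best = (done, c)
        done, c = best
        cluster_of[op] = c
        finish[op] = done
        busy_until[c] = done
    return cluster_of, finish


cluster_of, finish = greedy_assign(
    ops=["a", "b", "c"],
    preds={"c": ["a", "b"]},
    latency={"a": 1, "b": 1, "c": 1},
    num_clusters=2,
)
```

Each operation is placed once and never revisited, which is the local-decision weakness named above; every candidate cluster is also queried for every operation, which hints at the cost behind Problem 2.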

5
Region-Based Hierarchical Operation Partitioning

Works on acyclic DFGs extracted from the complete
program based on region decomposition. I assume a
region is a loop (?!?)
Two phases
Weight calculation: node and edge
Partitioning: coarsening and refining

6
Node Weight Calculation

Reflects the quantity of resources per operation
Ignores dependencies
Individual weight (FUs)
Shared weight (ports, buses)
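
As an illustration only, node weights can be sketched as the fraction of a resource that one operation consumes in one cycle. The reciprocal form below is an assumption, as are the names individual_weight, shared_weight and cluster_fus; the slide does not give the exact formulas.

```python
def individual_weight(op_type, cluster_fus):
    """Assumed individual weight: 1 / (number of FUs of this type in the cluster).

    cluster_fus: dict mapping an FU type such as 'I' or 'M' to a count.
    An op that the cluster cannot execute gets infinite weight.
    """
    count = cluster_fus.get(op_type, 0)
    return float("inf") if count == 0 else 1.0 / count


def shared_weight(shared_units):
    """Assumed shared weight: 1 / (number of shared units, e.g. ports or buses)."""
    return 1.0 / shared_units


w_int = individual_weight("I", {"I": 2, "M": 1})
w_float = individual_weight("F", {"I": 2, "M": 1})
w_port = shared_weight(4)
```

Under this reading a weight of 1.0 means the operation saturates the resource for a cycle, and dependencies play no part, matching the slide.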

7
Edge Weight Calculation

Measure of criticalness
Based on the notion of slack
First-come, first-served slack distribution
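
The notion of slack can be sketched with a standard earliest/latest start-time computation on the DFG. This sketch only computes per-edge slack; it does not model the first-come, first-served distribution of slack, and the function name edge_slacks is hypothetical.

```python
def edge_slacks(ops, edges, latency):
    """Return the slack of every edge (u, v) in an acyclic DFG.

    ops: operations in topological order.
    edges: list of (producer, consumer) pairs.
    latency: dict mapping an op to its latency in cycles.
    """
    preds = {o: [] for o in ops}
    succs = {o: [] for o in ops}
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    # Earliest start: all predecessors must have finished.
    earliest = {}
    for o in ops:
        earliest[o] = max((earliest[p] + latency[p] for p in preds[o]), default=0)
    length = max(earliest[o] + latency[o] for o in ops)

    # Latest start that does not stretch the critical path.
    latest = {}
    for o in reversed(ops):
        latest[o] = min((latest[s] for s in succs[o]), default=length) - latency[o]

    # Slack of an edge: how long the consumer can wait after the producer ends.
    return {(u, v): latest[v] - (earliest[u] + latency[u]) for u, v in edges}


slacks = edge_slacks(
    ops=["a", "b", "c", "d"],
    edges=[("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")],
    latency={"a": 1, "b": 3, "c": 1, "d": 1},
)
```

Edges with zero slack lie on the critical path; a partitioner would give them a high weight so that they are unlikely to be cut by an inter-cluster move.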

8
Coarsening Partitioning

Multilevel graph partitioning algorithm (Chaco,
Metis)
Works by coarsening highly related nodes into
partitions; takes into account only edge weights
Takes a snapshot at each step for use in the refinement step
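
One coarsening step can be sketched as heavy-edge matching, a common choice in multilevel partitioners. Whether RHOP uses exactly this matching rule is an assumption; coarsen_once and contract are hypothetical names.

```python
def coarsen_once(nodes, edge_weight):
    """Pair up nodes along the heaviest edges first.

    edge_weight: dict mapping (u, v) to a weight.
    Returns a dict mapping every node to the representative of its group.
    """
    group = {}
    # Visit edges from heaviest to lightest; ties keep insertion order.
    for (u, v), _w in sorted(edge_weight.items(), key=lambda kv: -kv[1]):
        if u != v and u not in group and v not in group:
            group[u] = u
            group[v] = u
    for n in nodes:
        group.setdefault(n, n)  # unmatched nodes stay alone
    return group


def contract(edge_weight, group):
    """Build the coarser graph: drop internal edges, sum parallel ones."""
    coarse = {}
    for (u, v), w in edge_weight.items():
        gu, gv = group[u], group[v]
        if gu != gv:
            key = (min(gu, gv), max(gu, gv))
            coarse[key] = coarse.get(key, 0) + w
    return coarse


weights = {("a", "b"): 10, ("b", "c"): 1, ("c", "d"): 8}
group = coarsen_once(["a", "b", "c", "d"], weights)
coarse = contract(weights, group)
```

Repeating the two calls and appending each group mapping to a list would give the per-step snapshots that the refinement phase walks back through.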

9
Refinement Partitioning

Traverses the coarsening stages in reverse, making
improvements to the initial partition
At each stage the coarsened nodes available at
that point are considered for movement to another
cluster
Highly related operations are grouped together at
each stage because we follow the coarsening
process backwards
Metrics
Cluster weight: an estimate of the load per cluster; the
cluster with the highest weight is denoted the
imbalanced cluster
System load: estimates the load across all clusters
Gain: the gain from moving operations into other clusters

10
Cluster Weight

Individual resource constraint per cluster, per
cycle (op groups)
Total node weight per cluster per cycle (shared
constraints)
Cycle weight per cluster
Cluster weight
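
The formulas for this slide are not in the transcript, so the following is one possible reading rather than the authors' definition. It assumes that the cycle weight is the larger of the heaviest op-group load and the shared load in that cycle, and that the cluster weight is the sum of cycle weights. The function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def cluster_weight(ops):
    """Estimate the load of one cluster.

    ops: list of (cycle, op_group, individual_w, shared_w) tuples for the
    operations currently assigned to the cluster.
    Returns (cluster weight, dict of per-cycle weights).
    """
    per_group = defaultdict(float)   # (cycle, op_group) -> individual load
    per_cycle = defaultdict(float)   # cycle -> shared load
    for cycle, op_group, individual_w, shared_w in ops:
        per_group[(cycle, op_group)] += individual_w
        per_cycle[cycle] += shared_w

    # Assumption: the most constrained resource dominates each cycle.
    cycle_w = dict(per_cycle)
    for (cycle, _group), load in per_group.items():
        cycle_w[cycle] = max(cycle_w[cycle], load)

    # Assumption: the cluster weight accumulates over cycles.
    return sum(cycle_w.values()), cycle_w


total, by_cycle = cluster_weight([
    (0, "I", 1.0, 0.25),
    (0, "I", 1.0, 0.25),
    (0, "M", 1.0, 0.25),
    (1, "I", 1.0, 0.25),
])
```

In this example cycle 0 carries two integer operations on a single integer unit, so its weight of 2.0 signals an overloaded cycle.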

11
System Load

Inter-cluster move overhead
Total load, based on cycle by cycle estimation

12
Gain

Load gain
Edge gain
Move gain
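
Of the three components, the edge gain is the easiest to illustrate. The sketch below assumes it is the change in total weight of edges that cross clusters when one node moves; the load and move gains are not modeled, and edge_gain is a hypothetical name.

```python
def edge_gain(node, target, cluster_of, edge_weight):
    """Reduction in cut-edge weight if `node` moves to cluster `target`.

    cluster_of: dict mapping a node to its current cluster.
    edge_weight: dict mapping (u, v) to a weight.
    A positive result means fewer or lighter edges cross clusters.
    """
    source = cluster_of[node]
    if target == source:
        return 0
    gain = 0
    for (u, v), w in edge_weight.items():
        if node not in (u, v):
            continue
        other = v if u == node else u
        if cluster_of[other] == source:
            gain -= w   # edge was internal, becomes cut
        elif cluster_of[other] == target:
            gain += w   # edge was cut, becomes internal
    return gain


placement = {"a": 0, "b": 0, "c": 1}
weights = {("a", "b"): 10, ("b", "c"): 1}
gain_b = edge_gain("b", 1, placement, weights)
gain_c = edge_gain("c", 0, placement, weights)
```

Moving b away from a would cut the heavy edge, so its gain is negative, while pulling c into cluster 0 removes the only cut edge.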

13
Example
14
Evaluation

Implemented using Trimaran tool set
Compared with BUG algorithm
5 DSP benchmarks (high ILP) and SPECint2000 (low
ILP)
5 machine configurations; functional unit types: integer (I),
float (F), memory (M), branch (B)

15
Improvement in dynamic total cycles of RHOP over
BUG
16
Comparison of BUG and RHOP clustering performance
versus a 1-cluster machine
2-1111 processor
4-1111 processor
17
Histogram of RHOP versus BUG
Achieved schedule length versus critical path
length. Numbers on top are dynamic execution
percentages
18
Compile-time performance: number of calls to the
resource table
