Region-based Hierarchical Operation Partitioning for Multicluster Processors

Slides: 19
Provided by: cecs8
Learn more at: https://www.cecs.uci.edu

Transcript and Presenter's Notes


1
Region-based Hierarchical Operation Partitioning
for Multicluster Processors
  • Michael Chu, Kevin Fan, Scott Mahlke
  • University of Michigan
  • Presented by Cristian Petrescu-Prahova

2
Clustered Register Files
  • Why?
  • Register file cost and access time grow with the
    square of the number of register ports
  • Bypass logic grows quadratically with the number
    of operations issued per cycle
  • Distance separating FUs from register file
    increases with a large number of FUs
  • => Clustered register files
  • Decentralized architecture with several small
    register files
  • Each register file supplies operands to a subset
    of FUs
  • Examples: Multiflow Trace, Alpha 21264, TI C6x, Analog
    Devices TigerSHARC (two clusters); reconfigurable meshes?

3
Goal
  • Partition operations across the resources
    available on each cluster to maximize ILP
  • Minimize inter-cluster communication
  • Rule of thumb
  • A processor with 2 identical clusters loses about 20%
    performance
  • A processor with 4 identical clusters loses about 30%
    performance
  • Nonidentical clusters lead to even more
    performance loss

4
Well-Known Technique: Bottom-Up Greedy (BUG)
  • Recurse along DFG, critical path first
  • Assign each operation a cluster based on
    estimates of when the operation and its
    predecessors can complete earliest (from
    scheduler)
  • Problem 1: makes local decisions (see figure)
  • Problem 2: is slow; it must query accurate cluster
    status information for each operation considered
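
To make the greedy behaviour concrete, here is a minimal sketch of a bottom-up-greedy style assignment. It is not the BUG implementation itself: the function name `bug_assign`, the one-issue-slot-per-cluster model, and the fixed `move_latency` are all assumptions made for illustration, and a plain topological order stands in for the critical-path-first recursion.

```python
from collections import defaultdict

def bug_assign(ops, preds, latency, num_clusters, move_latency=1):
    """Greedily place each op on the cluster where it can start earliest.

    ops:      operation names in topological order (assumed input form)
    preds:    dict op -> list of predecessor ops
    latency:  dict op -> latency in cycles
    Assumption: each cluster can issue one operation per cycle.
    """
    cluster_of, finish = {}, {}
    busy = defaultdict(set)  # cluster -> set of occupied issue cycles
    for op in ops:
        best = None
        for c in range(num_clusters):
            # Operands from another cluster pay an inter-cluster move.
            ready = 0
            for p in preds.get(op, []):
                penalty = 0 if cluster_of[p] == c else move_latency
                ready = max(ready, finish[p] + penalty)
            # This lookup is the "query cluster status" cost per operation.
            start = ready
            while start in busy[c]:
                start += 1
            if best is None or start < best[0]:
                best = (start, c)
        start, c = best
        busy[c].add(start)
        cluster_of[op] = c
        finish[op] = start + latency[op]
    return cluster_of, finish
```

Each decision looks only at the operation being placed and its already-placed predecessors, which is the "local decisions" problem; the inner scan of `busy` is a toy version of the per-operation resource query that makes the approach slow.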

5
Region-Based Hierarchical Operation Partitioning
  • Works on acyclic DFGs extracted from the complete
    program by region decomposition. I assume
    region = loop (?!?)
  • Two phases
  • Weight calculation: node and edge
  • Partitioning: coarsening and refining

6
Node Weight Calculation
  • Reflects the quantity of resources per operation
  • Ignores dependencies
  • Individual weight (FUs)
  • Shared weight (ports, buses)
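
The slide does not give the weight formulas, so the following is only a sketch under an assumed model: an operation's individual weight is the fraction of one cycle's capacity it consumes among the FUs of its kind, and its shared weight is the fraction it consumes of a shared resource such as ports or buses. The names `individual_weight` and `shared_weight` and the inverse-capacity formulas are assumptions, not taken from the presentation.

```python
from fractions import Fraction

def individual_weight(op_kind, cluster_fus):
    """Fraction of one cycle of the matching FUs that one op consumes.

    cluster_fus: dict kind -> number of FUs of that kind in the cluster,
                 e.g. {"I": 2, "F": 1, "M": 1, "B": 1}.
    Returns None if the cluster cannot execute this kind of operation.
    """
    n = cluster_fus.get(op_kind, 0)
    if n == 0:
        return None
    return Fraction(1, n)

def shared_weight(shared_capacity):
    """Fraction of a shared resource (ports, buses) one op consumes per cycle."""
    return Fraction(1, shared_capacity)
```

Neither function looks at the graph, which matches the point that node weights ignore dependencies.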

7
Edge Weight Calculation
  • Measure of criticality
  • Based on the notion of slack
  • First-come, first-served slack distribution
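
As an illustration of the slack notion, the sketch below computes per-edge slack on an acyclic DFG from earliest (ASAP) and latest (ALAP) start times. The function name `edge_slack` and the input layout are assumptions; the first-come, first-served distribution step and the mapping from slack to edge weight are not shown, since the slide does not spell them out.

```python
def edge_slack(order, succs, latency):
    """Slack of each edge (u, v): cycles v's input can slip without
    lengthening the critical path.

    order:   nodes in topological order
    succs:   dict node -> list of successor nodes
    latency: dict node -> latency in cycles
    """
    preds = {n: [] for n in order}
    for u in order:
        for v in succs.get(u, []):
            preds[v].append(u)
    # Earliest start: all predecessors must have finished.
    asap = {}
    for n in order:
        asap[n] = max((asap[p] + latency[p] for p in preds[n]), default=0)
    length = max(asap[n] + latency[n] for n in order)
    # Latest start that still meets the critical path length.
    alap = {}
    for n in reversed(order):
        latest_finish = min((alap[s] for s in succs.get(n, [])), default=length)
        alap[n] = latest_finish - latency[n]
    return {(u, v): alap[v] - asap[u] - latency[u]
            for u in order for v in succs.get(u, [])}
```

For a diamond a -> {b, c} -> d with latencies a=1, b=3, c=1, d=1, the edges through b get slack 0 (critical) and the edges through c both report slack 2. That 2 is a single budget shared along the path a -> c -> d, which is why some rule for distributing slack among edges is needed before turning slack into weights.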

8
Coarsening Partitioning
  • Multilevel graph partitioning algorithm (Chaco,
    Metis)
  • Works by coarsening highly related nodes into
    partitions; takes into account only edge weights
  • Takes a snapshot at each step for use in the
    refinement step
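
A minimal sketch of the coarsening idea follows, using heavy-edge matching as commonly described for multilevel partitioners. It is an illustration under assumptions, not the presented implementation: the name `coarsen`, the tuple-keyed `edges` dict, and stopping at `target` groups are all hypothetical choices.

```python
def coarsen(nodes, edges, target):
    """Repeatedly merge the most strongly connected groups.

    nodes:  iterable of node names
    edges:  dict (u, v) -> edge weight
    target: stop once this many groups remain (e.g. number of clusters)
    Returns the list of snapshots, one set of groups per coarsening stage.
    """
    group_of = {n: frozenset([n]) for n in nodes}
    snapshots = [set(group_of.values())]
    while len(snapshots[-1]) > target:
        # Sum the edge weights between each pair of current groups.
        between = {}
        for (u, v), w in edges.items():
            gu, gv = group_of[u], group_of[v]
            if gu != gv:
                key = frozenset([gu, gv])
                between[key] = between.get(key, 0) + w
        remaining = len(snapshots[-1])
        matched = set()
        # Heaviest edges first; each group merges at most once per stage.
        for key in sorted(between, key=between.get, reverse=True):
            if remaining <= target:
                break
            gu, gv = tuple(key)
            if gu in matched or gv in matched:
                continue
            merged = gu | gv
            for n in merged:
                group_of[n] = merged
            matched.update((gu, gv))
            remaining -= 1
        if not matched:
            break  # remaining groups are not connected to each other
        snapshots.append(set(group_of.values()))
    return snapshots
```

Only edge weights drive the merging, and the returned list of snapshots is what a refinement pass would walk backwards through.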

9
Refinement Partitioning
  • Traverses the coarsening stages in reverse, making
    improvements to the initial partition
  • At each stage the coarsened nodes available at
    that point are considered for movement to another
    cluster
  • Highly related operations are grouped together at
    each stage because we follow the coarsening
    process backwards
  • Metrics
  • Cluster weight: estimate of the load per cluster; the
    cluster with the highest weight is denoted the
    imbalanced cluster
  • System load: estimate of the load across all clusters
  • Gain: the gain of moving operations into other clusters

10
Cluster Weight
  • Individual resource constraint per cluster, per
    cycle (op groups)
  • Total node weight per cluster per cycle (shared
    constraints)
  • Cycle weight per cluster
  • Cluster weight
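
The formulas on this slide did not survive in the transcript, so the sketch below only illustrates one plausible reading: per cycle, take the worse of the per-kind individual load and the shared load, then take the worst cycle as the cluster weight. The name `cluster_weight`, the tuple layout, and the use of `max` to combine terms are assumptions.

```python
from collections import defaultdict
from fractions import Fraction

def cluster_weight(ops):
    """Estimate the load of one cluster.

    ops: list of (cycle, kind, individual_weight, shared_weight) tuples for
         the operations currently assigned to this cluster (assumed layout).
    Returns (cluster weight, dict cycle -> cycle weight).
    """
    indiv = defaultdict(Fraction)   # (cycle, kind) -> summed individual weight
    shared = defaultdict(Fraction)  # cycle -> summed shared weight
    for cycle, kind, iw, sw in ops:
        indiv[(cycle, kind)] += iw
        shared[cycle] += sw
    cycle_wgt = {}
    for cycle in shared:
        worst_kind = max(w for (c, _), w in indiv.items() if c == cycle)
        # Assumption: a cycle is as loaded as its tightest constraint.
        cycle_wgt[cycle] = max(worst_kind, shared[cycle])
    # Assumption: the cluster is as loaded as its worst cycle.
    return max(cycle_wgt.values(), default=Fraction(0)), cycle_wgt
```

Under this reading, a cycle weight above 1 means the cluster is asked to do more in that cycle than its resources allow, which is the situation refinement tries to relieve.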

11
System Load
  • Inter-cluster move overhead
  • Total load, based on cycle by cycle estimation

12
Gain
  • Load gain
  • Edge gain
  • Move gain
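
The slide lists the three gains without their definitions. As a hedged illustration, the sketch below computes an edge gain in the usual graph-partitioning sense (the change in cut edge weight when one node changes cluster) and combines it with a caller-supplied load gain by simple addition. Both function names and the additive combination are assumptions, not the presented formulas.

```python
def edge_gain(node, src, dst, cluster_of, edges):
    """Cut weight saved by moving `node` from cluster `src` to `dst`.

    cluster_of: dict node -> current cluster
    edges:      dict (u, v) -> edge weight
    Positive means fewer (or lighter) inter-cluster edges after the move.
    """
    gain = 0
    for (u, v), w in edges.items():
        if node not in (u, v):
            continue
        other = v if u == node else u
        if cluster_of[other] == dst:
            gain += w  # edge stops crossing clusters
        elif cluster_of[other] == src:
            gain -= w  # edge starts crossing clusters
        # Neighbours in a third cluster stay cut either way.
    return gain

def move_gain(load_gain, e_gain):
    """Assumed combination: total benefit is load gain plus edge gain."""
    return load_gain + e_gain
```

A refinement step would evaluate candidate moves out of the imbalanced cluster and apply the one with the best combined gain, if any is positive.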

13
Example

14
Evaluation
  • Implemented using Trimaran tool set
  • Compared with BUG algorithm
  • 5 DSP benchmarks (high ILP), SPECint2000 (low
    ILP)
  • 5 configurations; functional units: integer (I),
    float (F), memory (M), branch (B)

15
Improvement in dynamic total cycles of RHOP over
BUG

16
Comparison of BUG and RHOP clustering performance
versus a 1-cluster machine
  • 2-1111 processor
  • 4-1111 processor

17
Histogram of RHOP versus BUG
Achieved schedule length versus critical path
length. Numbers on top are dynamic execution
percentages

18
Compile-time performance: number of calls to the
resource table