Title: Inherently Workload-Balanced Clustered Microarchitecture
1Inherently Workload-Balanced Clustered
Microarchitecture
- Jaume Abella (1), Antonio González (1, 2)
(1) Computer Architecture Dept., UPC-Barcelona
(2) Intel Barcelona Research Center, Intel Labs-UPC, Barcelona
2Motivation (I)
Conventional microarchitectures are complex and
power-hungry
3Motivation (II)
- Key features of conventional clustered microarchitectures: workload balance and communications
  - Ideal balancing dramatically increases communications
  - No communications implies huge imbalance
  - A tradeoff must be found: it is not possible to minimize both
- This work's main contribution
  - Is it possible to avoid this tradeoff?
  - Yes, using Ring Clustered Microarchitectures.
4Objectives
- Low complexity and low power using clustered microarchitectures
- Reduce imbalance and communications simultaneously
- Achieve high performance
5Contents
- Motivation
- Background
- Conventional clustered microarchitectures
- Tradeoff implications
- Our approach: Ring Clustered Microarchitecture
- Results
- Conclusions
6Background (I)
- Conventional clustered microarchitectures
  - In case of high imbalance, instructions are sent to the least loaded cluster.
    - Hence, communications are required
  - Otherwise, instructions are sent to the cluster where their operands are available.
    - Hence, instructions concentrate in few clusters
7Background (II)
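[Figure: Cluster 1 and Cluster 2, each holding instructions and registers. Already-steered instructions i, i1, i2, i3, i4 are shown with registers R0, R1, R2, R3, R4; the incoming instruction is i5 (R5, R1, R2).]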
8Background (III)
[Figure: steering logic dispatching instructions to Cluster 1, Cluster 2 and Cluster 3]
High activity during some cycles. Hence, high temperature!
10Ring Clustered Microarchitecture (I)
- Clusters interconnected in a ring topology
- Data produced in a cluster is available ONLY in the following cluster in the ring
  - There are no fast bypasses within a cluster
  - There are fast bypasses between one cluster and the following one in the ring
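The forwarding rule above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the hop count assumes that values travel in one direction around the ring, which the slides do not state.

```python
def home_cluster(producer: int, num_clusters: int) -> int:
    """Cluster where a value produced in `producer` becomes available
    through the fast bypass: the following cluster in the ring."""
    return (producer + 1) % num_clusters


def extra_hops(producer: int, consumer: int, num_clusters: int) -> int:
    """Hops beyond the fast bypass needed to reach `consumer`.

    Assumption (not from the slides): values only move forward
    around the ring, one cluster per hop.
    """
    return (consumer - home_cluster(producer, num_clusters)) % num_clusters
```

With 4 clusters, a value produced in cluster 3 is available in cluster 0 at no extra cost, while consuming a value in the cluster that produced it is the most distant case under this assumption.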
11Ring Clustered Microarchitecture (II)
[Figure: four clusters C1, C2, C3 and C4, shown for the Conventional and for the Ring organization]
- Conventional: bypass to the same cluster in 0 cycles (a simple integer operation and bypass take 1 cycle)
- Ring: bypass to the following cluster in 0 cycles
12Ring Clustered Microarchitecture (III)
[Figure: Conventional vs. Ring datapath. Labels: Cluster K, Cluster K+1, register file (read, write), data / wakeup / bypass]
13Ring Clustered Microarchitecture (IV)
- New designs are required for
  - Issue queues
  - Register files
  - Functional units
- so that data can be sent quickly to the following cluster instead of the same one
14Ring Clustered Microarchitecture (V)
Registers for each cluster
I1. R1 <- 1
I2. R2 <- R1, 1
I3. R3 <- R1, R2
I4. R4 <- R1, R3
I5. R5 <- R1, 3
[Figure labels: 0, 1, 2, 3]
16Evaluation Framework
- Processor
  - 8 clusters
  - Reorder buffer size: 256 instructions
- Each cluster
  - Issue queue size: 16 integer + 16 FP instructions
  - Registers: 48 integer + 48 FP registers
- Number of buses and issue width per cluster

Configuration   Issue Width   Buses
1bus_1IW        1             1
1bus_2IW        2             1
2bus_1IW        1             2
2bus_2IW        2             2
17Steering Algorithms (I)
- Smart heuristic takes into account
  - Workload imbalance (only conventional microarchitecture)
  - Number of required communications
  - Distance of communications

Conventional microarchitecture:
  If workload imbalance is high then
    Instruction is sent to the least loaded cluster
  Else
    Instruction is sent to the cluster requiring fewer and shorter communications
  endif

Ring microarchitecture:
  Instruction is sent to the cluster requiring fewer and shorter communications
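The smart heuristic can be sketched as follows. This is a minimal illustration, not the authors' implementation: the imbalance measure (maximum minus minimum load), the threshold, the tie-breaking order, and the `distance` callback are all assumptions introduced here. `operand_homes` lists the clusters where each source operand is available.

```python
from typing import Callable, Optional, Sequence, Tuple


def comm_cost(cluster: int, operand_homes: Sequence[int],
              distance: Callable[[int, int], int]) -> Tuple[int, int]:
    """(number of communications, total distance) if run in `cluster`."""
    dists = [distance(home, cluster) for home in operand_homes]
    return (sum(1 for d in dists if d > 0), sum(dists))


def steer_smart(operand_homes: Sequence[int], loads: Sequence[int],
                distance: Callable[[int, int], int],
                imbalance_threshold: Optional[int] = None) -> int:
    """Pick a cluster for one instruction.

    Pass `imbalance_threshold` for the conventional microarchitecture;
    leave it as None for the ring, which ignores workload imbalance.
    """
    clusters = range(len(loads))
    if (imbalance_threshold is not None
            and max(loads) - min(loads) > imbalance_threshold):
        # High imbalance: send to the least loaded cluster.
        return min(clusters, key=lambda c: loads[c])
    # Otherwise: fewer communications first, then shorter ones.
    return min(clusters, key=lambda c: comm_cost(c, operand_homes, distance))
```

For example, with 4 clusters, a forward ring distance `lambda a, b: (b - a) % 4`, and operands available in clusters 1 and 2, the sketch picks cluster 2: it needs one communication of distance 1, whereas cluster 1 needs one of distance 3.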
18Steering Algorithms (II)
- Simple heuristic takes into account
  - First operand

Ring and Conventional microarchitectures:
  Instruction is sent to the cluster where the first operand is (will be) available
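A small sketch of the simple heuristic shows why the ring tends to spread a dependence chain across clusters while the conventional organization keeps it in one. The function names are hypothetical, and sending instructions without a register operand to cluster 0 is an assumption made only to keep the example short.

```python
from typing import List, Optional


def steer_simple(first_operand_producer: Optional[int],
                 num_clusters: int, ring: bool) -> int:
    """Send the instruction where its first operand is (will be) available."""
    if first_operand_producer is None:
        return 0  # assumption: no register operand, default to cluster 0
    if ring:
        # Ring: the value is available in the following cluster.
        return (first_operand_producer + 1) % num_clusters
    # Conventional: the value is available in the producing cluster.
    return first_operand_producer


def steer_chain(length: int, num_clusters: int, ring: bool) -> List[int]:
    """Steer a chain where each instruction consumes the previous result."""
    placement: List[int] = []
    producer: Optional[int] = None
    for _ in range(length):
        producer = steer_simple(producer, num_clusters, ring)
        placement.append(producer)
    return placement
```

Under these assumptions, `steer_chain(5, 4, ring=False)` places every instruction in cluster 0, while `steer_chain(5, 4, ring=True)` visits clusters 0, 1, 2, 3 and then 0 again.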
19Performance (I)
20Performance (II)
- Simple steering algorithm
21Activity Distribution
- Metric: for every dispatch block, the maximum number of instructions sent to the same cluster, averaged over all dispatch blocks
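One way to compute this activity metric is sketched below. The input representation is an assumption: each dispatch block is given as the list of cluster indices its instructions were steered to.

```python
from collections import Counter
from typing import Sequence


def avg_max_per_block(blocks: Sequence[Sequence[int]]) -> float:
    """Average, over dispatch blocks, of the largest number of
    instructions that a single cluster received in that block."""
    # Empty blocks are skipped (assumption: they carry no activity).
    maxima = [max(Counter(block).values()) for block in blocks if block]
    if not maxima:
        return 0.0
    return sum(maxima) / len(maxima)
```

For instance, a block whose four instructions all go to cluster 0 contributes 4, and a block that spreads four instructions over four clusters contributes 1, so the two together average 2.5. Lower values indicate that activity is distributed more evenly.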
23Conclusions
- The ring microarchitecture
  - achieves higher performance than the conventional one, especially for simple steering algorithms
  - balances the workload inherently
  - requires low communication resources
  - distributes the activity much better than the conventional one
24Q&A