Title: Overview
1Overview
- Motivation (Kevin)
- Thermal issues (Kevin)
- Power modeling (David)
- Thermal management (David)
- Optimal DTM (Lev)
- Clustering (Antonio)
- Power distribution (David)
- What current chips do (Lev)
- HotSpot (Kevin)
2The clustering approach
- Reduce complexity by partitioning
- Less latency, area, power and temperature
- Fast, simple, distributed units
- Communication latency is heterogeneous and exposed to the microarchitecture
- Localize critical communication within clusters (fast wires)
3The clustering approach (...)
- Smaller structures consume less power
- Higher power efficiency [Zyuban, IEEE Transactions 01]
- Partitioning simplifies power management
- Via clock/power gating techniques [Bahar, ISCA 01]
- Via dynamic cluster resizing [González, ICCD 03]
- Via DVS/DFS
- Partitioning reduces temperature
- Activity is distributed [Chaparro, TACS 04]
- Hopping schemes can be applied [Chaparro, TACS 04]
- Adds flexibility for temperature-effective layouts
- IPC overheads due to communication/imbalance
- Compensated by shorter latency/clock period [Palacharla, ISCA 97; Canal, HPCA 00]
4Clustered microarchitecture
- Dynamic steering
- Distributed Issue, Registers, FUs
- Inter-cluster register communication
5On-demand communication
- Map table tracks locations of register values
- At rename
- allocate a register for the result, in the assigned cluster
- if a source operand is in a remote cluster
- insert a copy instruction in the remote cluster
- allocate a register for the copy
- At commit
- free the register(s) allocated by the previous mapping of the logical register
Canal, PACT99
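The rename steps above can be sketched as follows. This is a minimal illustration, not the implementation from the cited work: it assumes a map table that stores, for each logical register, one physical register per cluster holding a copy, plus a per-cluster free list. The names `Renamer`, `rename` and `free_lists` are hypothetical, and stalls on an empty free list are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Renamer:
    """Toy map table: logical register -> {cluster: physical register}."""
    free_lists: list                       # one list of free physical regs per cluster
    table: dict = field(default_factory=dict)

    def rename(self, srcs, dst, cluster):
        """Rename one instruction that was steered to `cluster`.

        Returns (physical sources, physical destination, copies inserted,
        registers to free when this instruction commits).
        """
        copies, phys_srcs = [], []
        for log in srcs:
            mapping = self.table[log]
            if cluster not in mapping:
                # The operand lives only in remote cluster(s): insert a copy
                # there and allocate a local register to receive the value.
                remote = next(iter(mapping))
                local_reg = self.free_lists[cluster].pop(0)
                copies.append((remote, mapping[remote], cluster, local_reg))
                mapping[cluster] = local_reg
            phys_srcs.append(mapping[cluster])
        # The previous mapping of dst (possibly several registers, one per
        # cluster) is released only at commit.
        to_free = list(self.table.get(dst, {}).items())
        phys_dst = self.free_lists[cluster].pop(0)
        self.table[dst] = {cluster: phys_dst}
        return phys_srcs, phys_dst, copies, to_free
```

With the table contents shown on the Rename slide (logical 2 mapped in CL1 and CL2, logical 3 in CL0, CL1 and CL3), steering an instruction with sources 2 and 3 to cluster 1 needs no copy, while steering it to cluster 2 inserts one copy for logical 3.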
6Rename
Renaming Table
Log CL0 CL1 CL2 CL3
0 X 27 X X
1 18 X X 9
2 X 3 15 X
3 5 10 X 13
4 X X 12 14
5 4 X X X
6 X 1 24 X
7 2 X X X
8 X 2 X 9
Log CL0 CL1 CL2 CL3
0 X 27 X X
1 X 14 X X
2 X 3 15 X
3 5 10 X 13
4 X X 12 14
5 4 X X X
6 X 1 24 X
7 2 X X X
8 X 2 X 9
Cluster 1, Steering Logic
           src1  src2  src3  src4  src5  dst
Logical    2     3     X     X     X     1
Physical
7Copy instructions
Copy instruction
Renaming Table
Log CL0 CL1 CL2 CL3
0 X 27 X X
1 18 X X 9
2 X 3 15 X
3 5 10 X 13
4 X X 12 14
5 4 X X X
6 X 1 24 X
7 2 X X X
8 X 2 X 9
Log CL0 CL1 CL2 CL3
0 X 27 X X
1 13 X X 5
2 X 3 15 X
3 5 10 27 13
4 X X 12 14
5 4 X X X
6 X 1 24 X
7 2 X X X
8 X 2 X 9
Log CL0 CL1 CL2 CL3
0 X 27 X X
1 X X 14 X
2 X 3 15 X
3 5 10 27 13
4 X X 12 14
5 4 X X X
6 X 1 24 X
7 2 X X X
8 X 2 X 9
Cluster 2, Steering Logic
           src1  src2  src3  src4  src5  dst
Logical    2     3     X     X     X     1
Physical
8Broadcast communication
- Values sent to all register files
- Local file is updated earlier than remote ones
- Registers are replicated in all files
- Register storage waste
- Increase in power
- Values are written multiple times
- Increase in power
- May reduce communication penalties
- Values are present everywhere
- But not at the same time
- E.g. Alpha 21264
9Cluster assignment schemes
- Main goals
- Minimize inter-cluster communication penalty
- Maximize workload balance
- Main approaches
- Static approaches [Farkas, Micro 97; Sastry, PLDI 98]
- Less flexible than dynamic ones: poor load balancing
- Dynamic, dependence-based [Palacharla, ISCA 97; Alpha 21264; Kemp, ICPP 96]
- Only consider dependences through unavailable operands
- Lack specific balancing mechanisms
- Dynamic, workload balance oriented [Baniasadi 00]
- Only suitable for architectures with low communication penalty
- Dynamic, dependence-based and workload balance oriented [Canal, HPCA 2000; Parcerisa, PACT 2002]
- Tries to find the best trade-off between communications and workload balance
10Cluster assignment schemes
- Accurate-Rebalancing Priority RMB
- 1- To minimize communication penalties
- If a source register is unavailable, choose the producer's cluster
- Else select the clusters with the highest number of source regs. mapped
- 2- Choose the least loaded one of the above
- Exception: if imbalance > threshold, then exclude clusters with positive workload, prior to applying rules 1 and 2
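The two rules and the exception can be sketched as below. This is a hedged illustration rather than the exact scheme from the cited papers: it assumes per-cluster workload counters expressed relative to the average (positive means overloaded) and measures imbalance as the spread between the most and least loaded cluster. The function name `steer` and all of its parameters are hypothetical.

```python
def steer(src_clusters, pending_producer, workload, threshold):
    """Pick a cluster for one instruction.

    src_clusters: for each source operand, the set of clusters where it is mapped.
    pending_producer: cluster producing a not-yet-available source, or None.
    workload: per-cluster counters relative to the average (assumed form).
    threshold: imbalance above which overloaded clusters are excluded.
    """
    candidates = list(range(len(workload)))
    imbalance = max(workload) - min(workload)  # assumed imbalance metric
    if imbalance > threshold:
        # Exception: drop clusters with positive workload before rules 1 and 2.
        # Fall back to all clusters if that would leave no candidate.
        candidates = [c for c in candidates if workload[c] <= 0] or candidates
    # Rule 1: minimize communication penalties.
    if pending_producer is not None and pending_producer in candidates:
        preferred = [pending_producer]
    else:
        mapped = {c: sum(c in s for s in src_clusters) for c in candidates}
        best = max(mapped.values())
        preferred = [c for c in candidates if mapped[c] == best]
    # Rule 2: choose the least loaded of the preferred clusters.
    return min(preferred, key=lambda c: workload[c])
```

When the producer's cluster is itself overloaded beyond the threshold, this sketch gives balance priority and accepts the extra communication, which is how the exception is read here.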
11Evaluation
12Dynamic vs. static steering
S. Sastry, S.Palacharla and J.E.Smith, PLDI 1998
13Data cache architectures
González, WMPI 04
[Figure labels: Backend (x4), L1 Dcache, UL2]
- Dcache is a cluster
- Single Load/Store queue
- Simple disambiguation
14Data cache architecture (II)
- Attraction caches
- Lines are copied on demand
- A coherence scheme is needed
- Steering must exploit data locality
15Data cache architecture (III)
- Replicated
- Area cost
- Traffic due to store broadcast
[Figure labels: UL2; DL1 (x4); BE 1, BE 2, BE 3, BE 4]
16Data cache architecture (IV)
- Interleaved
- Word/line interleaved
- Steering needs to predict the bank
UL2
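As an illustration of the two interleaving options, the bank holding an address can be computed as below. The line size, word size and bank count are assumed values, and `LastBankPredictor` is a hypothetical stand-in for whatever bank predictor the steering logic would use, since steering happens before the address is known.

```python
LINE_BYTES = 64   # assumed cache line size
WORD_BYTES = 8    # assumed word size
NUM_BANKS = 4     # assumed: one bank per cluster


def word_interleaved_bank(addr):
    """Consecutive words map to consecutive banks."""
    return (addr // WORD_BYTES) % NUM_BANKS


def line_interleaved_bank(addr):
    """Consecutive lines map to consecutive banks."""
    return (addr // LINE_BYTES) % NUM_BANKS


class LastBankPredictor:
    """Hypothetical per-PC predictor: guess the bank this load touched last time."""

    def __init__(self):
        self.last = {}

    def predict(self, pc):
        # Unseen loads default to bank 0.
        return self.last.get(pc, 0)

    def update(self, pc, addr, bank_of=word_interleaved_bank):
        self.last[pc] = bank_of(addr)
```

A load steered to the predicted bank's cluster accesses its data locally when the prediction is right and pays an inter-cluster transfer otherwise.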
17Memory issues
- Disambiguation
- Load/Store queues are distributed
- Stores are allocated in all clusters
- Address is computed in one and broadcast
- Loads go to memory once previous stores know their addresses
- Memory coherence
- Write-Invalidate / Write-Update protocols
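The disambiguation rule above can be sketched as follows, assuming each cluster keeps its own copy of the store queue as a dictionary from program-order sequence number to address, with `None` until the address is broadcast. All names are hypothetical and coherence traffic is not modeled.

```python
def allocate_store(queues, seq):
    """A store is allocated in every cluster's queue, address still unknown."""
    for q in queues:
        q[seq] = None


def broadcast_address(queues, seq, addr):
    """One cluster computes the address and broadcasts it to all the queues."""
    for q in queues:
        q[seq] = addr


def load_can_issue(queue, load_seq):
    """A load may go to memory once all older stores know their addresses."""
    return all(addr is not None for seq, addr in queue.items() if seq < load_seq)
```

For example, after `allocate_store(queues, 3)` a load with sequence number 5 is held in every cluster until `broadcast_address(queues, 3, addr)` has been applied.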
18Performance comparison
19Thermal benefits of clustering
Example layout for a quad-cluster architecture
20Temperature metrics
- AbsMax
- Maximum sensed temperature
- Average
- Average temperature across time and area
- AverageMax
- Average over time of the maximum sensed temperature
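The three metrics can be computed from a matrix of sensor readings as sketched below. The sketch assumes every sensor covers an equal area, so the average across area is an unweighted mean; the name `temperature_metrics` is hypothetical.

```python
def temperature_metrics(samples):
    """samples[t][s] is the temperature read by sensor s at time step t.

    Returns (AbsMax, Average, AverageMax).
    """
    per_step_max = [max(row) for row in samples]
    abs_max = max(per_step_max)
    # Average across time and area (equal-area sensors assumed).
    average = sum(sum(row) for row in samples) / sum(len(row) for row in samples)
    # Average over time of the hottest reading at each step.
    average_max = sum(per_step_max) / len(per_step_max)
    return abs_max, average, average_max
```

For two sensors sampled twice, `temperature_metrics([[60, 70], [80, 50]])` gives an AbsMax of 80, an Average of 65 and an AverageMax of 75.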
21Clustering reduces temperature
22Clustering effects
- May end up with higher power densities!
- Simpler and smaller units may create hotspots
- Layout must be thermal-effective
- Surround hotspots by cold areas
- Activity steering must be smart
- Other techniques (e.g. throttling) can be applied at smaller granularity
- Aim at particular clusters without affecting others
23Dynamic cluster resizing
González, ICCD 03
24Dynamic cluster resizing
- Proposal
- Dynamically compute the energy of blocks
- Schedulers, FUs, DL0s, etc
- Dynamically compute the energy × delay² (ED²P) of the processor
- Use different configurations for different intervals
- Measure the optimal configuration
- Gate-off (disable) useless units
- Scheduler level
- Backend level
25Dynamic cluster resizing
[Figure labels: UL2 cache; I; Decode, Rename, Steer; BE1, BE2, BE3, BE4, BE5, BEn]
ED²P(x) < ED²P(x+1) < ED²P(x-1) ?
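One way to read the interval-based search is sketched below: after running with x active backends, compare its energy × delay² against the neighboring configurations and move to whichever is lowest. This is an assumption about the mechanism, not the algorithm from González, ICCD 03. The callback `measure` is a hypothetical hook that returns the (energy, delay) observed over one interval for a given number of active backends.

```python
def ed2p(energy, delay):
    """Energy x delay^2 metric."""
    return energy * delay ** 2


def next_config(x, measure, min_x, max_x):
    """Choose the number of active backends for the next interval.

    measure(cfg) -> (energy, delay) is a hypothetical measurement hook.
    """
    best, best_cost = x, ed2p(*measure(x))
    for cand in (x + 1, x - 1):
        if min_x <= cand <= max_x:
            cost = ed2p(*measure(cand))
            if cost < best_cost:
                best, best_cost = cand, cost
    return best
```

Backends not included in the chosen configuration are the ones that would be gated off for that interval.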
26Dynamic cluster resizing
27Cluster hopping
- Motivation
- Power and average temperature savings when statically Vdd gating clusters
[Figure caption: Temperatures in the backend area when gating all but the indicated cluster(s). Reductions are measured over the in-box ambient temperature (45 °C), relative to a baseline quad-cluster architecture.]
28Cluster hopping
- Based on activity migration [Heo, ISLPED 03]
- Vdd gate a subset of clusters
- Rotate clusters to spread activity over time
- Gated clusters cannot provide any register value
- Before gating, some register values must be evicted
- Cache/DTLB contents are lost
- Unless some low power (e.g. drowsy) mode is used
- Proactive and/or reactive behavior
- Proactive: per interval basis
- Reactive: on thermal events
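A minimal sketch of a hopping controller that combines both behaviors, assuming a simple round-robin rotation of the active set. The rotation order, the interval length and the trigger temperature are illustrative assumptions, and `rotate` and `hop` are hypothetical names.

```python
def rotate(active, num_clusters):
    """Shift every active cluster index by one (round-robin rotation)."""
    return {(c + 1) % num_clusters for c in active}


def hop(active, temps, cycle, interval, trigger_temp):
    """Return (new active set, clusters to drain before Vdd gating).

    Proactive: hop at every interval boundary.
    Reactive: hop as soon as an active cluster reaches trigger_temp.
    """
    proactive = cycle % interval == 0
    reactive = any(temps[c] >= trigger_temp for c in active)
    if proactive or reactive:
        new_active = rotate(active, len(temps))
        # Clusters leaving the active set must evict live register values
        # before being gated; cache/DTLB contents are lost unless a
        # low-power retention mode is used.
        return new_active, active - new_active
    return set(active), set()
```

With clusters 0 and 1 active out of four, a hop activates clusters 1 and 2 and marks cluster 0 for draining.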
29Cluster hopping schemes
Effective at reducing average temperature (thus
leakage) but not max temperature
30Thermal-aware steering
- Try to minimize max temperature
- Take into account cluster temperature when deciding destination
- Some examples
- Cold
- Dispatch to coldest cluster with available resources
- Lowest average temperature
- Lowest peak temperature
- T-Cold
- Like Cold but discard clusters that are too hot
- If difference in temperature with previous cluster (ordered by temperature) is higher than a threshold
31Thermal-aware steering
- T-Thermal
- Minimize communications unless candidate cluster is too hot
- If temperature difference > threshold → priority to the colder
- Otherwise → priority to the one that minimizes communications, and in case of a tie maximize workload balance (instructions in the schedulers)
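The three policies (Cold, T-Cold, T-Thermal) can be sketched as below. These are interpretations of the bullets, not the evaluated implementations: per-cluster temperatures are taken as a single number each, T-Cold returns `None` (stall) when only discarded clusters have free resources, and T-Thermal compares the communication-optimal cluster against the coldest one. All function and parameter names are hypothetical.

```python
def cold(temps, has_resources):
    """Coldest cluster among those with available resources."""
    ok = [c for c in range(len(temps)) if has_resources[c]]
    return min(ok, key=lambda c: temps[c]) if ok else None


def t_cold(temps, has_resources, threshold):
    """Like cold(), but discard clusters beyond the first large temperature gap."""
    order = sorted(range(len(temps)), key=lambda c: temps[c])
    keep = [order[0]]
    for prev, cur in zip(order, order[1:]):
        if temps[cur] - temps[prev] > threshold:
            break  # cur and every hotter cluster are considered too hot
        keep.append(cur)
    ok = [c for c in keep if has_resources[c]]
    return ok[0] if ok else None


def t_thermal(temps, comms, sched_occupancy, threshold):
    """Minimize communications unless that cluster is too hot.

    comms[c]: communications needed if steered to c.
    sched_occupancy[c]: instructions in c's schedulers (tie-breaker).
    """
    clusters = range(len(temps))
    best = min(clusters, key=lambda c: (comms[c], sched_occupancy[c]))
    coldest = min(clusters, key=lambda c: temps[c])
    if temps[best] - temps[coldest] > threshold:
        return coldest
    return best
```

The only difference between `cold` and `t_cold` in this reading is that `t_cold` prefers stalling over dispatching to a cluster that is much hotter than the rest.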
32Thermal-aware steering
- Thermal-aware steering standalone
33Hopping thermal steering
34Clustering the front-end
Parcerisa, TR 02
Distributed Back-end
35Distributed branch predictor
- Broadcast every prediction (next PC) to all clusters
- Hardware loop: predictor uses PC as index
- insert bubble when switching the predictor cluster (2)
- if interleaving by low order bits: frequent bubbles
- Solution
- Pipeline prediction ahead of I-cache, interleave by hi-bits
- Bubble only when high level interleave boundary crossed (2)
36Impact of distributing branch predictor
- Bank switching
- SpecInt95: every 24 instructions
- Mbench: every 133 instructions
- IPC loss
- SpecInt95: 0.5%
- Mbench: no loss
37Distributed cluster assignment
- Make local assignments and broadcast them to all clusters
- Loop: steering logic uses assignments made by other clusters
- Partial solution: use outdated info (2 cycles)
- Problem: outdated dependences → generates communications
- Solution
- anticipate dependence-checking and
- override assignment, if dependence was violated
38Impact of distributing assignment
- W/o assignment overriding
- 0.42 communications / instruction
- More than 10% IPC loss
- With assignment overriding
- 0.17 communications / instruction
- Less than 2% IPC loss
39Thermal benefits
- Clustering the rename table and the reorder buffer [Chaparro, 04]
40Summary
- Clustering is thermal-effective (in addition to complexity-effective)
- Reduces power
- Distributes activity
- Clustering enables effective temperature control schemes
- Adaptive configuration
- DVS/DFS
- Cluster hopping
- Thermal steering