Title: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)
1Organizing the Last Line of Defense before
hitting the Memory Wall for Chip-Multiprocessors
(CMPs)
- C. Liu, A. Sivasubramaniam, M. Kandemir
- The Pennsylvania State University
- anand_at_cse.psu.edu
2Outline
- CMPs and L2 organization
- Shared Processor-based Split L2
- Evaluation using SpecOMP/Specjbb
- Summary of Results
3Why CMPs?
- Can exploit coarser granularity of parallelism
- Better use of anticipated billion-transistor designs
- Multiple and simpler cores
- Commercial and research prototypes
- Sun MAJC
- Piranha
- IBM Power 4/5
- Stanford Hydra
- .
4Higher pressure on memory system
- Multiple active threads -> larger working set
- Solution?
- Bigger Cache.
- Faster interconnect.
- What if we have to go off-chip?
- The cores need to share the limited pins.
- Impact of off-chip accesses may be much worse than incurring a few extra cycles on-chip.
- Needs close scrutiny of on-chip caches.
5On-chip Cache Hierarchy
- Assume 2 levels
- L1 (I/D) is private
- What about L2?
- L2 is the last line of defense before going
off-chip, and is the focus of this paper.
6Private (P) L2
[Diagram: private L1 (I/D) caches per core, a coherence protocol, and off-chip memory]
- Advantages: less interconnect traffic; insulates L2 units
- Disadvantages: duplication; load imbalance
7Shared-Interleaved (SI) L2
[Diagram: private L1 (I/D) caches per core, a coherence protocol, and a shared L2]
- Advantages: no duplication; balances the load
- Disadvantages: interconnect traffic; interference between cores
8Desirables
- Approach the behavior of private L2s when sharing is not significant.
- Approach the behavior of private L2s when load is balanced or when there is interference.
- Approach the behavior of a shared L2 when there is significant sharing.
- Approach the behavior of a shared L2 when demands are uneven.
9Shared Processor-based Split L2
[Diagram: L1 caches, a "Table and Split Select" stage, and the L2 splits]
- Processors/cores are allocated L2 splits.
10Lookup
- Look up all splits allocated to the requesting core simultaneously.
- If not found, look in all other splits (extra latency).
- If found there, move the block over to one of the requesting core's splits (chosen randomly) and remove it from the other split.
- Else, go off-chip and place the block in one of the requesting core's splits (chosen randomly).
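The lookup steps above can be sketched as a small toy model. This is only an illustration under stated assumptions, not the authors' implementation: the class name `SplitL2`, the `allocation` map, and the string return values are hypothetical, and capacity limits, replacement, and timing are left out entirely.

```python
import random


class SplitL2:
    """Toy model of the split L2 lookup (no capacity or eviction modelled)."""

    def __init__(self, allocation, seed=0):
        # allocation: hypothetical map from core id to the split ids it owns
        self.allocation = allocation
        self.splits = {s: set() for owned in allocation.values() for s in owned}
        self.rng = random.Random(seed)

    def access(self, core, block):
        own = self.allocation[core]
        # Step 1: probe the requesting core's own splits
        # (simultaneously in hardware, sequentially here).
        if any(block in self.splits[s] for s in own):
            return "local_hit"
        # Either remaining outcome places the block in a randomly chosen own split.
        target = self.rng.choice(own)
        # Step 2: probe all other splits, which costs extra latency.
        for split_id, blocks in self.splits.items():
            if split_id not in own and block in blocks:
                blocks.remove(block)             # remove from the other split
                self.splits[target].add(block)   # move into the requester's split
                return "remote_hit"
        # Step 3: not on chip at all, so fetch from off-chip memory.
        self.splits[target].add(block)
        return "miss"
```

With two cores owning two splits each, a first access by core 0 is a miss, a repeat is a local hit, and a later access by core 1 is a remote hit that migrates the block, so the block never exists in more than one split.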
11Note
- A core cannot place blocks that evict blocks useful to another core (as in the Private case).
- A core can look at (shared) blocks of other cores at a slightly higher cost that is still lower than an off-chip access (as in the Shared case).
- There is at most 1 copy of a block in L2.
12Shared Split Uniform (SSU)
[Diagram: L1 (I/D) caches, "Table and Split Select", and the L2 splits]
13Shared Split Non-Uniform (SSN)
[Diagram: L1 (I/D) caches, "Table and Split Select", and the L2 splits]
14Split Table
[Table: one row per processor (P0 to P3), with X marks indicating the L2 splits allocated to that processor]
15Evaluation
- Using Simics complete system simulator
- Benchmarks: SpecOMP2000 and Specjbb
- Reference dataset used
- Several billion instructions were simulated.
- A bus interconnect was simulated with MESI.
16Default configuration
- # of processors: 8
- L1 size: 8KB
- L1 line size: 32 bytes
- L1 associativity: 4-way
- L1 latency: 1 cycle
- L2 size: 2MB total
- L2 line size: 64 bytes
- L2 associativity: 4-way
- L2 latency: 10 cycles
- L2 splits: 8 (SI, SSU)
- L2 splits: 16 (SSN)
- Memory access: 120 cycles
- Bus arbitration: 5 cycles
- Replacement: strict LRU
17Benchmarks (SpecOMP and Specjbb)
Benchmark   L1 Misses   L1 Miss Rate   L2 Misses   L2 Miss Rate   # of Inst (m)
ammp        53.1m       0.007          2.1m        0.062          25,528
applu       111.2m      0.009          26.4m       0.168          21,519
apsi        378.9m      0.117          27.2m       0.083          15,713
art_m       66.1m       0.009          25.7m       0.507          22,967
fma3d       18.9m       0.002          6.2m        0.239          26,189
galgel      111.4m      0.014          10.7m       0.127          24,051
swim        261.6m      0.111          95.9m       0.296          7,761
mgrid       333.2m      0.153          68.3m       0.185          10,294
specjbb     828.5m      0.353          22.7m       0.083          9,413
18SSN Terminology
- With a total L2 of 2MB (16 splits of 128K each) to be allocated to 8 cores, SSN-152 refers to:
- 512K (4 splits) allocated to 1 CPU
- 256K (2 splits) allocated to each of 5 CPUs
- 128K (1 split) allocated to each of 2 CPUs
- Determining how much to allocate to each CPU (and when) is postponed to future work.
- Here, we use a profile-based approach driven by L2 demands.
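The SSN-152 naming above can be checked with a short sketch. It assumes, by generalizing from this single example, that the three digits count the CPUs receiving 4, 2, and 1 splits respectively; the helper name `decode_ssn` and that generalization are illustrative assumptions, not something the slides define.

```python
SPLIT_KB = 128                 # 2MB total / 16 splits
SPLITS_PER_CLASS = (4, 2, 1)   # assumed meaning of the three digits


def decode_ssn(name, cores=8, total_splits=16):
    """Return the number of splits per CPU for a name such as 'SSN-152'."""
    digits = name.split("-")[1]
    allocation = []
    for cpu_count, splits in zip((int(d) for d in digits), SPLITS_PER_CLASS):
        allocation += [splits] * cpu_count
    # Sanity check: every core is covered and every split is handed out.
    if len(allocation) != cores or sum(allocation) != total_splits:
        raise ValueError(f"{name} does not cover {cores} cores / {total_splits} splits")
    return allocation


splits_per_cpu = decode_ssn("SSN-152")
kb_per_cpu = [n * SPLIT_KB for n in splits_per_cpu]
```

For SSN-152 this gives one CPU with 512K, five with 256K, and two with 128K, which adds up to the full 16 splits (2MB).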
19Application behavior
- Intra-application heterogeneity
- Spatial (among CPUs): allocate non-uniform splits to different CPUs.
- Temporal (for each CPU): change the number of splits allocated to a CPU at different points in time.
- Inter-application heterogeneity
- Different applications running at the same time can have different L2 demands.
20Definition
- SHF (Spatial Heterogeneity Factor)
- THF (Temporal Heterogeneity Factor)
21Spatial Heterogeneity Factor
22Temporal Heterogeneity Factor
23Results SI
24Results SSU
25Results SSN
26Summary of Results
- When P does better than S (e.g. apsi), SSU/SSN does as well as (if not better than) P.
- When S does better than P (e.g. swim, mgrid, specjbb), SSU/SSN does as well as (if not better than) S.
- In nearly all cases (except applu), some configuration of SSU/SSN does the best.
- On average, we get over 11% improvement in IPC over the best S/P configuration(s).
27Inter-application Heterogeneity
- Different applications have different L2 demands
- These applications could even be running
concurrently on different CPUs.
28Inter-application results
- ammp + apsi: low + high.
- ammp + fma3d: both low.
- swim + apsi: both high, imbalanced + balanced.
- swim + mgrid: both high, imbalanced + imbalanced.
29Inter-application: ammp + apsi
- SSN-152
- 1.25MB dynamically allocated to apsi, 0.75MB to ammp.
- Graph shows the rough 5:3 allocation.
- Better overall IPC value.
- Low miss rate for apsi, without affecting the miss rate of ammp.
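As a quick arithmetic check on the allocation quoted above, the two capacities can be converted into 128K splits and reduced to a ratio. This is a minimal sketch; the variable names are illustrative, and the 128K split size is taken from the 16-split SSN configuration.

```python
from fractions import Fraction

SPLIT_MB = Fraction(2, 16)      # 2MB total over 16 splits = 128K per split

apsi_mb = Fraction(5, 4)        # 1.25MB allocated to apsi
ammp_mb = Fraction(3, 4)        # 0.75MB allocated to ammp

apsi_splits = apsi_mb / SPLIT_MB    # number of 128K splits held by apsi
ammp_splits = ammp_mb / SPLIT_MB    # number of 128K splits held by ammp
ratio = apsi_splits / ammp_splits   # reduces automatically to lowest terms

print(f"apsi: {apsi_splits} splits, ammp: {ammp_splits} splits, ratio {ratio}")
```

The two shares together account for all 16 splits, and their ratio reduces to 5/3.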
30Concluding Remarks
- Shared Processor-based Split L2 is a flexible way of approaching the behavior of shared or private L2 (based on what is preferable).
- It accommodates spatial and temporal heterogeneity in L2 demands both within an application and across applications.
- Becomes even more important with higher off-chip accesses.
31Future Work
- How to configure the split sizes statically,
dynamically and a combination of the two?
32Backup Slides
34Meaning
- Captures the heterogeneity between CPUs (spatial) or over the epochs (temporal) of the load imposed on the L2 structure.
- Weighted by L1 accesses to reflect the effect on the overall IPC.
- If the overall accesses are low, there is not going to be a significant impact on the IPC even though the standard deviation is high.
35Results P
36Results SI
37Results SSU
38Results
In swim, mgrid, and specjbb, the high L1 miss rate means higher pressure on L2, which results in significant IPC improvement (30.9% to 42.5%).
Except for applu, the shared split L2 performs the best.
39Why does private L2 do better in some cases?
- L2 performance
- The degree of sharing
- The imbalance of load imposed on L2
- For applu and swim + apsi:
- Only 12% of the blocks are shared at any time, mainly between 2 CPUs.
- Not much spatial/temporal heterogeneity.
40Why do we use IPC instead of execution time?
- We could not run any of the benchmarks to completion, since we are using the reference dataset.
- Another possible indicator is the number of iterations of a certain loop (for example, the dominating loop) executed per unit of time.
- We did this and found a direct correlation between the IPC value and the number of iterations.
Loop                                   Private avg. time   Private IPC   SSU avg. time   SSU IPC
apsi loop calling dctdx() (mainloop)   3,349m cycles       3.44          3,048m cycles   3.79
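The correlation claimed above can be checked against the one row of data given. The sketch below simply compares the relative reduction in loop time with the relative gain in IPC; the variable names are illustrative, and a single data point can only suggest agreement, not establish it.

```python
# Values from the apsi main-loop row (cycles in millions).
private_cycles_m, private_ipc = 3349, 3.44
ssu_cycles_m, ssu_ipc = 3048, 3.79

# How much faster the loop runs under SSU, measured by time per iteration.
time_speedup = private_cycles_m / ssu_cycles_m
# How much higher the IPC is under SSU.
ipc_gain = ssu_ipc / private_ipc

print(f"loop-time speedup: {time_speedup:.3f}, IPC gain: {ipc_gain:.3f}")
```

Both ratios come out at roughly 1.10, which is consistent with using IPC as a stand-in for execution time on this loop.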
41Results
42Closer look specjbb
- SSU is over 31% better than the private L2.
- Direct correlation between the L2 misses and the IPC values.
- P never exceeds 2.5, while SSU sometimes pushes over 3.0.
43Sensitivity Larger L2
- 2MB -> 4MB -> 8MB
- Miss rates go down, and the differences arising from miss rates diminish. swim still gets considerable savings.
- If application sizes keep growing, the split shared L2 is still going to help.
- More splits of L2 -> finer granularity -> could help SSN.
44Sensitivity Longer memory access
- 120 cycles -> 240 cycles: benefits are amplified.
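One rough way to see why a longer memory latency amplifies the benefit is the textbook average memory access time (AMAT) formula. The sketch below is a back-of-the-envelope illustration and not the simulation used in the evaluation: it plugs in the default L1 and L2 latencies and the swim miss rates from the benchmark table, assumes the L2 miss rate is measured per L2 access, and ignores bus arbitration, contention, and overlap of misses with computation.

```python
def amat(l1_miss_rate, l2_miss_rate, mem_cycles, l1_cycles=1, l2_cycles=10):
    """Average memory access time in cycles for a two-level hierarchy."""
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * mem_cycles)


# swim miss rates from the benchmark table (L1 = 0.111, L2 = 0.296).
swim_120 = amat(0.111, 0.296, mem_cycles=120)
swim_240 = amat(0.111, 0.296, mem_cycles=240)

# Cycles per access attributable to going off-chip in each case.
offchip_120 = 0.111 * 0.296 * 120
offchip_240 = 0.111 * 0.296 * 240

print(f"AMAT at 120 cycles: {swim_120:.2f}, at 240 cycles: {swim_240:.2f}")
```

Under these assumptions the off-chip term doubles with the latency, so every L2 miss that a better organization avoids is worth twice as many cycles.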