Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

1
Organizing the Last Line of Defense before
hitting the Memory Wall for Chip-Multiprocessors
(CMPs)

C. Liu, A. Sivasubramaniam, M. Kandemir
The Pennsylvania State University
anand@cse.psu.edu

2
Outline

CMPs and L2 organization
Shared Processor-based Split L2
Evaluation using SpecOMP/Specjbb
Summary of Results

3
Why CMPs?

Can exploit coarser granularity of parallelism
Better use of anticipated billion transistor
designs
Multiple and simpler cores
Commercial and research prototypes
Sun MAJC
Piranha
IBM Power 4/5
Stanford Hydra
.

4
Higher pressure on memory system

Multiple active threads => larger working set
Solution?
Bigger Cache.
Faster interconnect.
What if we have to go off-chip?
The cores need to share the limited pins.
Impact of off-chip accesses may be much worse
than incurring a few extra cycles on-chip
Needs a close scrutiny of on-chip caches.

5
On-chip Cache Hierarchy

Assume 2 levels
L1 (I/D) is private
What about L2?
L2 is the last line of defense before going
off-chip, and is the focus of this paper.

6
Private (P) L2

[Figure: L1 (I/D) caches, private L2 units, coherence protocol, off-chip memory]
Advantages: less interconnect traffic; insulates L2 units
Disadvantages: duplication; load imbalance

7
Shared-Interleaved (SI) L2

[Figure: L1 (I/D) caches, coherence protocol, shared-interleaved L2]
Advantages: no duplication; balances the load
Disadvantages: interconnect traffic; interference between cores

8
Desirables

Approach the behavior of private L2s when
sharing is not significant.
Approach the behavior of private L2s when load is
balanced or when there is interference.
Approach the behavior of a shared L2 when there is
significant sharing.
Approach the behavior of a shared L2 when demands are
uneven.

9
Shared Processor-based Split L2

[Figure: L1 caches, table and split select, L2 splits]
Processors/cores are allocated L2 splits.

10
Lookup

Look up all splits allocated to the requesting core
simultaneously.
If not found, look in all other splits
(extra latency).
If found there, move the block to one of the
requesting core's splits (chosen randomly) and
remove it from the other split.
Else, go off-chip and place the block in one of the
requesting core's splits (chosen randomly).
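
The lookup steps above can be sketched as a toy model. This is only an illustration under stated assumptions, not the authors' implementation: the class name `SplitL2`, the `split_table` dictionary, and the string return values are hypothetical, and the model ignores sets, ways, LRU replacement, and timing.

```python
import random


class SplitL2:
    """Toy model of a processor-based split L2 (no sets, ways, or LRU)."""

    def __init__(self, num_splits, split_table, seed=0):
        # split_table (hypothetical name) maps core id -> list of split indices.
        self.splits = [set() for _ in range(num_splits)]
        self.split_table = split_table
        self.rng = random.Random(seed)

    def lookup(self, core, block):
        own = self.split_table[core]
        # Step 1: probe every split allocated to the requesting core.
        if any(block in self.splits[s] for s in own):
            return "local hit"
        # The block will land in one of the core's own splits, chosen randomly.
        target = self.rng.choice(own)
        # Step 2: probe all other splits (extra latency in the real design).
        for s, contents in enumerate(self.splits):
            if s not in own and block in contents:
                # Migrate the block so that only one copy exists in L2.
                contents.remove(block)
                self.splits[target].add(block)
                return "remote hit"
        # Step 3: miss in every split, so fetch off-chip and fill a local split.
        self.splits[target].add(block)
        return "off-chip"


l2 = SplitL2(num_splits=4, split_table={0: [0, 1], 1: [2, 3]})
for core in (0, 0, 1, 1):
    print(core, l2.lookup(core, 0x40))
```

In this sketch a block brought in by core 0 is later found by core 1 as a remote hit and migrates to one of core 1's splits, so a core never evicts from splits it does not own.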

11
Note

A core cannot place blocks that evict blocks
useful to another core (as in the Private case).
A core can look at (shared) blocks of other cores
at a slightly higher cost, though not as high as
an off-chip access (as in the Shared case).
There is at most 1 copy of a block in L2.

12
Shared Split Uniform (SSU)

[Figure: L1 (I/D) caches, table and split select, L2 splits]

13
Shared Split Non-Uniform (SSN)

[Figure: L1 (I/D) caches, table and split select, L2 splits]

14
Split Table

[Figure: split table with one row per processor (P0 to P3); an X marks each L2 split allocated to that processor]

15
Evaluation

Using the Simics complete-system simulator.
Benchmarks: SpecOMP2000 and Specjbb.
Reference dataset used.
Several billion instructions were simulated.
A bus interconnect was simulated with MESI.

16
Default configuration

Parameter       Value        Parameter         Value
# of proc       8            L2 Assoc          4-way
L1 Size         8KB          L2 Latency        10 cycles
L1 Line Size    32 Byte      L2 Splits         8 (SI, SSU)
L1 Assoc        4-way        L2 Splits         16 (SSN)
L1 Latency      1 cycle      MEM Access        120 cycles
L2 Size         2MB total    Bus Arbitration   5 cycles
L2 Line Size    64 Byte      Replacement       Strict LRU
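
The split sizes implied by this configuration can be worked out with a short calculation. This is a sketch under one assumption that the slides do not state: that the 4-way associativity applies within each split. The function name `split_geometry` is hypothetical.

```python
KB = 1024
L2_TOTAL_BYTES = 2 * 1024 * KB   # 2MB total L2
LINE_BYTES = 64                  # L2 line size
WAYS = 4                         # assumed to apply within each split


def split_geometry(num_splits):
    """Return (split size in KB, lines per split, sets per split)."""
    split_bytes = L2_TOTAL_BYTES // num_splits
    lines = split_bytes // LINE_BYTES
    sets = lines // WAYS
    return split_bytes // KB, lines, sets


print(split_geometry(8))    # SI and SSU: (256, 4096, 1024)
print(split_geometry(16))   # SSN:        (128, 2048, 512)
```

The 128KB figure for 16 splits matches the split size used in the SSN terminology.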
17
Benchmarks (SpecOMP and Specjbb)

Benchmark   L1        L1 Miss Rate   L2       L2 Miss Rate   # of Inst (m)
ammp        53.1m     0.007          2.1m     0.062          25,528
applu       111.2m    0.009          26.4m    0.168          21,519
apsi        378.9m    0.117          27.2m    0.083          15,713
art_m       66.1m     0.009          25.7m    0.507          22,967
fma3d       18.9m     0.002          6.2m     0.239          26,189
galgel      111.4m    0.014          10.7m    0.127          24,051
swim        261.6m    0.111          95.9m    0.296          7,761
mgrid       333.2m    0.153          68.3m    0.185          10,294
specjbb     828.5m    0.353          22.7m    0.083          9,413
18
SSN Terminology

With a total L2 of 2MB (16 splits of 128K each)
to be allocated to 8 cores, SSN-152 refers to:
512K (4 splits) allocated to 1 CPU
256K (2 splits) allocated to each of 5 CPUs
128K (1 split) allocated to each of 2 CPUs
Determining how much to allocate to each CPU (and
when) is postponed to future work.
Here, we use a profile-based approach based on L2
demands.
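
The naming scheme above can be sketched as a small decoder. Only SSN-152 is defined in the slides; treating the three digits of any other name as the number of CPUs that receive 4, 2, and 1 splits is an assumption, and the function name `ssn_allocation` is hypothetical.

```python
def ssn_allocation(name, sizes=(4, 2, 1)):
    """Expand a name such as 'SSN-152' into a per-CPU list of split counts.

    Each digit is read as the number of CPUs that receive 4, 2, and 1
    splits respectively (an assumed generalization of the SSN-152 example).
    """
    digits = name.split("-")[1]
    alloc = []
    for n_cpus, n_splits in zip((int(d) for d in digits), sizes):
        alloc.extend([n_splits] * n_cpus)
    return alloc


alloc = ssn_allocation("SSN-152")
print(alloc)                    # [4, 2, 2, 2, 2, 2, 1, 1]
print(len(alloc), sum(alloc))   # 8 16
print(sum(alloc) * 128, "KB")   # 2048 KB
```

The totals confirm that the example accounts for all 8 CPUs and all 16 splits of 128K, which is the full 2MB.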

19
Application behavior

Intra-application heterogeneity
Spatial (among CPUs): allocate non-uniform
splits to different CPUs.
Temporal (for each CPU): change the number of
splits allocated to a CPU at different points in
time.
Inter-application heterogeneity
Different applications running at the same time can
have different L2 demands.

20
Definition

SHF (Spatial Heterogeneity Factor)
THF (Temporal Heterogeneity Factor)

21
Spatial Heterogeneity Factor
22
Temporal Heterogeneity Factor
23
Results SI
24
Results SSU
25
Results SSN
26
Summary of Results

When P does better than S (e.g. apsi), SSU/SSN
does at least as well as P.
When S does better than P (e.g. swim, mgrid,
specjbb), SSU/SSN does at least as well as S.
In nearly all cases (except applu), some
configuration of SSU/SSN does the best.
On average we get over 11% improvement in IPC
over the best S/P configuration(s).

27
Inter-application Heterogeneity

Different applications have different L2 demands
These applications could even be running
concurrently on different CPUs.

28
Inter-application results

ammp + apsi: low + high.
ammp + fma3d: both low.
swim + apsi: both high, imbalanced + balanced.
swim + mgrid: both high, imbalanced + imbalanced.

29
Inter-application: ammp + apsi

SSN-152
1.25MB dynamically allocated to apsi, 0.75MB to
ammp.
Graph shows the rough 5:3 allocation.
Better overall IPC value.

Low miss rate for apsi without affecting the
miss rate of ammp.
30
Concluding Remarks

Shared Processor-based Split L2 is a flexible way
of approaching the behavior of shared or private
L2 (based on what is preferable)
It accommodates spatial and temporal
heterogeneity in L2 demands both within an
application and across applications.
Becomes even more important with higher off-chip
accesses.

31
Future Work

How to configure the split sizes statically,
dynamically and a combination of the two?

32
Backup Slides
34
Meaning

Capture the heterogeneity between CPUs (spatial)
or over the epochs (temporal) of the load imposed
on the L2 structure.
Weighting by L1 accesses reflects the effect on the
overall IPC.
If the overall accesses are low, there is not going
to be a significant impact on the IPC even though
the standard deviation is high.

35
Results P
36
Results SI
37
Results SSU
38
Results
In swim, mgrid, and specjbb, a high L1 miss rate
means higher pressure on L2, which results in a
significant IPC improvement (30.9% to 42.5%).
Except for applu, the shared split L2 performs the best.
39
Why private L2 does better in some?

L2 performance depends on:
The degree of sharing
The imbalance of load imposed on L2
For applu and swim + apsi,
Only 12% of the blocks are shared at any time,
mainly shared between 2 CPUs.
Not much spatial/temporal heterogeneity.

40
Why do we use IPC instead of execution time?

We could not run any of the benchmarks to
completion, since we are using the reference dataset.
Another possible indicator is the number of
iterations executed of a certain loop (for example,
the dominating loop) per unit of time.
We did this and found a direct correlation
between the IPC value and the number of
iterations.

                                       Private                 SSU
                                       Average time    IPC     Average time    IPC
apsi loop calling dctdx() (mainloop)   3,349m cycles   3.44    3,048m cycles   3.79
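
As a quick check of the correlation claim, the apsi row can be turned into relative changes. This is plain arithmetic on the four numbers in that row; the variable names are hypothetical.

```python
# Values taken from the apsi row: average loop time in cycles, and IPC.
private_cycles, private_ipc = 3349e6, 3.44
ssu_cycles, ssu_ipc = 3048e6, 3.79

# Relative IPC gain of SSU over Private.
ipc_gain = (ssu_ipc - private_ipc) / private_ipc
# Relative reduction in average loop time.
cycle_saving = (private_cycles - ssu_cycles) / private_cycles

print(f"IPC gain: {ipc_gain:.1%}, loop time saved: {cycle_saving:.1%}")
```

The two figures come out at roughly 10% and 9%, so the higher IPC and the shorter loop time move together for this loop.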
41
Results
42
Closer look specjbb

SSU is over 31% better than the private L2.
Direct correlation between the L2 misses and the
IPC values.
P never exceeds 2.5, while SSU sometimes pushes
over 3.0.

43
Sensitivity: Larger L2

2MB -> 4MB -> 8MB
Miss rates go down, so the differences arising from
miss rates diminish. swim still gets considerable
savings.
If application sizes keep growing, the split
shared L2 is still going to help.
More splits of L2 -> finer granularity -> could
help SSN.

44
Sensitivity: Longer memory access

120 cycles -> 240 cycles
Benefits are amplified.
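
One way to see why a longer memory latency amplifies the benefit is the textbook average memory access time (AMAT) formula. This is a simplified sketch, not the simulator's timing model: it ignores bus arbitration, remote-split latency, and overlap between accesses. It uses the default L1 and L2 latencies and swim's miss rates from the benchmark table; the 0.05 reduction in L2 miss rate is a purely hypothetical value for illustration.

```python
def amat(l1_miss_rate, l2_miss_rate, mem_latency, l1_latency=1, l2_latency=10):
    """Average memory access time in cycles for a two-level hierarchy."""
    return l1_latency + l1_miss_rate * (l2_latency + l2_miss_rate * mem_latency)


L1_MR, L2_MR = 0.111, 0.296   # swim, from the benchmark table
DELTA = 0.05                  # hypothetical reduction in L2 miss rate

for mem_latency in (120, 240):
    base = amat(L1_MR, L2_MR, mem_latency)
    improved = amat(L1_MR, L2_MR - DELTA, mem_latency)
    print(mem_latency, round(base, 2), round(base - improved, 3))
```

Because the saving is `l1_miss_rate * DELTA * mem_latency`, the same reduction in L2 misses saves twice as many cycles per access at 240 cycles as at 120 cycles.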
