Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

1
Organizing the Last Line of Defense before
hitting the Memory Wall for Chip-Multiprocessors
(CMPs)
  • C. Liu, A. Sivasubramaniam, M. Kandemir
  • The Pennsylvania State University
  • anand@cse.psu.edu

2
Outline
  • CMPs and L2 organization
  • Shared Processor-based Split L2
  • Evaluation using SpecOMP/Specjbb
  • Summary of Results

3
Why CMPs?
  • Can exploit coarser granularity of parallelism
  • Better use of anticipated billion transistor
    designs
  • Multiple and simpler cores
  • Commercial and research prototypes
  • Sun MAJC
  • Piranha
  • IBM Power 4/5
  • Stanford Hydra
  • ...

4
Higher pressure on memory system
  • Multiple active threads -> larger working set
  • Solution?
  • Bigger Cache.
  • Faster interconnect.
  • What if we have to go off-chip?
  • The cores need to share the limited pins.
  • Impact of off-chip accesses may be much worse
    than incurring a few extra cycles on-chip
  • Needs close scrutiny of on-chip caches.

5
On-chip Cache Hierarchy
  • Assume 2 levels
  • L1 (I/D) is private
  • What about L2?
  • L2 is the last line of defense before going
    off-chip, and is the focus of this paper.

6
Private (P) L2
[Figure: two cores, each with a private L1 (I/D) and a
private L2, kept consistent by a coherence protocol in
front of off-chip memory]
Advantages: less interconnect traffic; insulates L2 units
Disadvantages: duplication; load imbalance
7
Shared-Interleaved (SI) L2
[Figure: cores with private L1 (I/D) caches sharing one
interleaved L2, kept consistent by a coherence protocol]
Advantages: no duplication; balances the load
Disadvantages: interconnect traffic; interference between cores
8
Desirables
  • Approach the behavior of private L2 when the
    sharing is not significant
  • Approach the behavior of private L2 when the load
    is balanced or when there is interference
  • Approach the behavior of shared L2 when there is
    significant sharing
  • Approach the behavior of shared L2 when demands
    are uneven.

9
Shared Processor-based Split L2
[Figure: private L1s reach the L2 splits through a table
and split-select stage]
Processors/cores are allocated L2 splits.
10
Lookup
  • Look up all splits allocated to requesting core
    simultaneously.
  • If not found, then look at all other splits
    (extra latency).
  • If found, move the block over to one of its splits
    (chosen randomly), and remove it from the other
    split.
  • Else, go off-chip and place the block in one of its
    splits (chosen randomly). (This flow is sketched in
    code below.)

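A minimal C sketch of the lookup flow above, assuming a per-core bitmask of allocated splits. probe_split(), remove_from_split(), insert_into_split(), fetch_off_chip() and random_allocated_split() are hypothetical helpers standing in for the cache machinery; this illustrates the policy, not the authors' simulator code.

#include <stdbool.h>

#define NUM_SPLITS 16

/* Hypothetical helpers (declarations only). */
bool probe_split(int split, unsigned long addr);
void remove_from_split(int split, unsigned long addr);
void insert_into_split(int split, unsigned long addr);
void fetch_off_chip(unsigned long addr);
int  random_allocated_split(unsigned short mask);

/* Returns the split that ends up holding the block for this core. */
int l2_lookup(int core, unsigned long addr, const unsigned short alloc_mask[])
{
    unsigned short mine = alloc_mask[core];

    /* 1. Probe all splits allocated to the requesting core
     *    (in parallel in hardware; sequentially here). */
    for (int s = 0; s < NUM_SPLITS; s++)
        if (((mine >> s) & 1) && probe_split(s, addr))
            return s;                              /* hit in own split */

    /* 2. Miss in own splits: probe the remaining splits (extra latency). */
    for (int s = 0; s < NUM_SPLITS; s++) {
        if ((mine >> s) & 1)
            continue;
        if (probe_split(s, addr)) {
            /* Found in another core's split: migrate the block into one of
             * this core's splits (chosen randomly) and remove the old copy,
             * so at most one copy of a block exists in the L2. */
            int dst = random_allocated_split(mine);
            remove_from_split(s, addr);
            insert_into_split(dst, addr);
            return dst;
        }
    }

    /* 3. Not on chip: fetch from memory and place the block in one of the
     *    requesting core's splits (chosen randomly). */
    int dst = random_allocated_split(mine);
    fetch_off_chip(addr);
    insert_into_split(dst, addr);
    return dst;
}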
11
Note
  • Note, a core cannot place blocks that evict
    blocks useful to another core (as in the Private
    case).
  • A core can look at (shared) blocks of other cores
    at a slightly higher cost, though not as high as
    an off-chip access (as in the Shared case).
  • There is at most 1 copy of a block in L2.

12
Shared Split Uniform (SSU)
[Figure: each core's private L1 (I/D) maps through the
table and split select to an equal number of L2 splits]
13
Shared Split Non-Uniform (SSN)
[Figure: each core's private L1 (I/D) maps through the
table and split select to a non-uniform number of L2
splits]
14
Split Table
P0: X X X
P1: X X X X
P2: X X
P3: X
(each X marks an L2 split allocated to that processor;
one possible representation is sketched below)
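One way to picture the split table is as a per-core bitmask over the L2 splits, with the split-select step picking one of the core's splits at random for placement. The masks and names below are illustrative, not the paper's hardware design; random_allocated_split() is one possible definition of the helper assumed in the earlier lookup sketch.

#include <stdlib.h>

#define NUM_CORES  4
#define NUM_SPLITS 16

/* One bit per L2 split, one mask per core; the X marks in the table
 * above correspond to set bits (hypothetical assignment shown). */
static unsigned short split_table[NUM_CORES] = {
    0x0007,  /* P0: splits 0-2 (3 splits) */
    0x0078,  /* P1: splits 3-6 (4 splits) */
    0x0180,  /* P2: splits 7-8 (2 splits) */
    0x0200,  /* P3: split 9    (1 split)  */
};

/* Split select: choose one of the core's allocated splits at random
 * when a block has to be placed or migrated.
 * Assumes the mask has at least one bit set. */
int random_allocated_split(unsigned short mask)
{
    int candidates[NUM_SPLITS], n = 0;
    for (int s = 0; s < NUM_SPLITS; s++)
        if ((mask >> s) & 1)
            candidates[n++] = s;
    return candidates[rand() % n];
}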
15
Evaluation
  • Using the Simics complete-system simulator
  • Benchmarks: SpecOMP2000 and Specjbb
  • Reference dataset used
  • Several billion instructions were simulated.
  • A bus interconnect was simulated with MESI.

16
Default configuration
# of proc: 8             L2 Assoc: 4-way
L1 Size: 8KB             L2 Latency: 10 cycles
L1 Line Size: 32 Byte    L2 Splits: 8 (SI, SSU)
L1 Assoc: 4-way          L2 Splits: 16 (SSN)
L1 Latency: 1 cycle      MEM Access: 120 cycles
L2 Size: 2MB total       Bus Arbitration: 5 cycles
L2 Line Size: 64 Byte    Replacement: Strict LRU
17
Benchmarks (SpecOMP + Specjbb)
Benchmark    L1 Misses    L1 Miss Rate    L2 Misses    L2 Miss Rate    # of Inst (m)
ammp 53.1m 0.007 2.1m 0.062 25,528
applu 111.2m 0.009 26.4m 0.168 21,519
apsi 378.9m 0.117 27.2m 0.083 15,713
art_m 66.1m 0.009 25.7m 0.507 22,967
fma3d 18.9m 0.002 6.2m 0.239 26,189
galgel 111.4m 0.014 10.7m 0.127 24,051
swim 261.6m 0.111 95.9m 0.296 7,761
mgrid 333.2m 0.153 68.3m 0.185 10,294
specjbb 828.5m 0.353 22.7m 0.083 9,413
18
SSN Terminology
  • With a total L2 of 2MB (16 splits of 128K each)
    to be allocated to 8 cores, SSN-152 refers to
  • 512K (4 splits) allocated to 1 CPU
  • 256K (2 splits) allocated to each of 5 CPUs
  • 128K (1 split) allocated to each of 2 CPUs
  • Determining how much to allocate to each CPU (and
    when) is postponed to future work.
  • Here, we use a profile-based approach driven by
    the applications' L2 demands. (A quick check of
    the SSN-152 arithmetic follows below.)

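As a quick check of the arithmetic above: 1x4 + 5x2 + 2x1 = 16 splits of 128K each, i.e. the whole 2MB L2, spread over all 8 CPUs. A tiny C sketch of that sanity check (the array layout and names are just for illustration):

#include <assert.h>

int main(void)
{
    /* SSN-152: 1 CPU gets 4 splits, 5 CPUs get 2, 2 CPUs get 1. */
    int cpus_in_class[3]    = {1, 5, 2};
    int splits_per_class[3] = {4, 2, 1};
    int split_kb = 128, cpus = 0, splits = 0;

    for (int i = 0; i < 3; i++) {
        cpus   += cpus_in_class[i];
        splits += cpus_in_class[i] * splits_per_class[i];
    }

    assert(cpus == 8);                  /* all 8 cores covered     */
    assert(splits == 16);               /* all 16 splits allocated */
    assert(splits * split_kb == 2048);  /* adds up to the 2MB L2   */
    return 0;
}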
19
Application behavior
  • Intra-application heterogeneity
  • Spatial (among CPUs): allocate non-uniform
    numbers of splits to different CPUs.
  • Temporal (for each CPU): change the number of
    splits allocated to a CPU at different points in
    time.
  • Inter-application heterogeneity
  • Different applications running at same time can
    have different L2 demands.

20
Definition
  • SHF (Spatial Heterogeneity Factor)
  • THF (Temporal Heterogeneity Factor)

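The formulas themselves appear on the next two slides (not transcribed). A reconstruction consistent with backup slide 34 (heterogeneity of the per-CPU or per-epoch L2 load, weighted by L1 accesses so that lightly loaded runs do not inflate the factor) would take roughly the following form; the weighting term and symbols here are assumptions, not the paper's exact definitions:

\[
\mathrm{SHF} = w \cdot \sqrt{\tfrac{1}{P}\textstyle\sum_{i=1}^{P}\bigl(\ell_i - \bar{\ell}\bigr)^{2}},
\qquad
\mathrm{THF} = w \cdot \sqrt{\tfrac{1}{E}\textstyle\sum_{e=1}^{E}\bigl(\ell_e - \bar{\ell}\bigr)^{2}},
\]

where \(\ell_i\) is the L2 load imposed by CPU \(i\) (for SHF) or during epoch \(e\) (for THF), \(\bar{\ell}\) is the corresponding mean, and the weight \(w\) grows with the number of L1 accesses.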
21
Spatial Heterogeneity Factor
22
Temporal Heterogeneity Factor
23
Results: SI
24
Results: SSU
25
Results: SSN
26
Summary of Results
  • When P does better than S (e.g. apsi), SSU/SSN
    does as well as (if not better than) P.
  • When S does better than P (e.g. swim, mgrid,
    specjbb), SSU/SSN does as well as (if not better
    than) S.
  • In nearly all cases (except applu), some
    configuration of SSU/SSN does the best.
  • On average, we get over an 11% improvement in IPC
    over the best S/P configuration(s).

27
Inter-application Heterogeneity
  • Different applications have different L2 demands
  • These applications could even be running
    concurrently on different CPUs.

28
Inter-application results
  • ammp + apsi: low + high.
  • ammp + fma3d: both low.
  • swim + apsi: both high, imbalanced + balanced.
  • swim + mgrid: both high, imbalanced + imbalanced.

29
Inter-application: ammp + apsi
  • SSN-152
  • 1.25MB dynamically allocated to apsi, 0.75MB to
    ammp (10 splits vs. 6 splits of 128K each).
  • Graph shows the rough 5:3 allocation.
  • Better overall IPC value.

Low miss rate for apsi, without affecting the
miss rate of ammp.
30
Concluding Remarks
  • Shared Processor-based Split L2 is a flexible way
    of approaching the behavior of shared or private
    L2 (based on what is preferable)
  • It accommodates spatial and temporal
    heterogeneity in L2 demands both within an
    application and across applications.
  • Becomes even more important as off-chip accesses
    become more expensive.

31
Future Work
  • How to configure the split sizes statically,
    dynamically, or with a combination of the two?

32
Backup Slides
33
(No Transcript)
34
Meaning
  • Capture the heterogeneity of the load imposed on
    the L2 structure across CPUs (spatial) or across
    epochs (temporal).
  • Weighting by L1 accesses reflects the effect on
    the overall IPC.
  • If the overall accesses are low, there is not
    going to be a significant impact on the IPC even
    if the standard deviation is high.

35
Results: P
36
Results: SI
37
Results: SSU
38
Results
In swim, mgrid, and specjbb, the high L1 miss rate
means higher pressure on the L2, which results in
significant IPC improvements (30.9% to 42.5%).
Except for applu, the shared split L2 performs the
best.
39
Why does private L2 do better in some cases?
  • L2 performance
  • The degree of sharing
  • The imbalance of load imposed on L2
  • For applu and swim + apsi,
  • Only 12% of the blocks are shared at any time,
    mainly shared between 2 CPUs.
  • Not much spatial/temporal heterogeneity.

40
Why do we use IPC instead of execution time?
  • We could not finish any of the benchmarks, since
    we are using the reference dataset.
  • Another possible indicator is the number of
    iterations of a certain loop (for example, the
    dominating loop) executed per unit of time.
  • We did this and found a direct correlation
    between the IPC value and the number of
    iterations.

apsi loop calling dctdx() (main loop):
  Private: average time 3,349m cycles, IPC 3.44
  SSU: average time 3,048m cycles, IPC 3.79
41
Results
42
Closer look: specjbb
  • SSU is over 31% better than the private L2.
  • Direct correlation between the L2 misses and the
    IPC values.
  • P's IPC never exceeds 2.5, while SSU sometimes
    pushes over 3.0.

43
Sensitivity: Larger L2
  • 2MB -> 4MB -> 8MB
  • Miss rates go down, and differences arising from
    miss rates diminish; swim still gets considerable
    savings.
  • If application sizes keep growing, the shared
    split L2 will still help.
  • More L2 splits -> finer granularity -> could help
    SSN.

44
Sensitivity: Longer memory access
120 cycles -> 240 cycles. Benefits are amplified.