Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

1
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
  • Sangyeun Cho and Lei Jin

Dept. of Computer Science, University of Pittsburgh
2
Multicore distributed L2 caches
  • L2 caches typically sub-banked and distributed
  • IBM Power4/5: 3 banks
  • Sun Microsystems T1: 4 banks
  • Intel Itanium2 (L3): many sub-arrays
  • (Distributed L2 caches + switched NoC) → NUCA
  • Hardware-based management schemes:
  • Private caching
  • Shared caching
  • Hybrid caching

3
Private caching
  • (+) short hit latency (always local)
  • (-) high on-chip miss rate
  • (-) long miss resolution time
  • (-) complex coherence enforcement

[Diagram: access flow: 1. L1 miss, 2. L2 access (hit or miss), 3. on a miss, access the directory (a copy on chip, or a global miss)]
4
Shared caching
  • (+) low on-chip miss rate
  • (+) straightforward data location
  • (+) simple coherence (no replication)
  • (-) long average hit latency

[Diagram: access flow: 1. L1 miss, 2. L2 access (hit or miss)]

5
Our work
  • Placing flexibility as the top design
    consideration
  • OS-level data to L2 cache mapping
  • Simple hardware based on shared caching
  • Efficient mapping maintenance at page granularity
  • Demonstrating the impact using different policies

6
Talk roadmap
  • Data mapping, a key property
  • Flexible page-level mapping
  • Goals
  • Architectural support
  • OS design issues
  • Management policies
  • Conclusion and future work

7
Data mapping, the key
  • Data mapping = deciding data location (i.e., which cache slice)
  • Private caching
  • Data mapping determined by program location
  • Mapping created at miss time
  • No explicit control
  • Shared caching
  • Data mapping determined by address (see the sketch below)
  • slice number = (block address) mod (N_slice)
  • Mapping is static
  • Cache block installation at miss time
  • No explicit control
  • (Run-time can impact location within slice)

Mapping granularity: block
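To make the two granularities concrete, here is a minimal C sketch contrasting address-based slice selection at block granularity (shared caching) with an OS-chosen slice recorded per page. The constants and names (N_SLICE, BLOCK_BITS, PAGE_BITS, page_to_slice, DEMO_PAGES) are illustrative assumptions, not values from the slides.

```c
#include <stdint.h>
#include <stdio.h>

#define N_SLICE    16      /* assumed number of L2 cache slices */
#define BLOCK_BITS 6       /* assumed 64-byte cache blocks      */
#define PAGE_BITS  12      /* assumed 4kB pages                 */
#define DEMO_PAGES 256     /* size of the small demo table      */

/* Demo table: the slice chosen by the OS for each physical page. */
static unsigned page_to_slice[DEMO_PAGES];

/* Shared caching: slice number = (block address) mod N_slice.
 * The slice is fixed by the address; software has no control. */
static unsigned slice_by_block(uint64_t paddr)
{
    return (unsigned)((paddr >> BLOCK_BITS) % N_SLICE);
}

/* Page-level mapping: the OS records a slice per physical page at
 * allocation time, so the same lookup becomes a policy decision. */
static unsigned slice_by_page(uint64_t paddr)
{
    return page_to_slice[(paddr >> PAGE_BITS) % DEMO_PAGES];
}

int main(void)
{
    uint64_t addr = 0x123456740ULL;

    page_to_slice[(addr >> PAGE_BITS) % DEMO_PAGES] = 5;  /* OS decision */
    printf("block-granularity slice: %u\n", slice_by_block(addr));
    printf("page-granularity slice:  %u\n", slice_by_page(addr));
    return 0;
}
```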
8
Changing cache mapping granularity
[Diagram: mapping granularity changed from memory blocks to memory pages]
  • miss rate?
  • impact on existing techniques? (e.g., prefetching)
  • latency?

9
Observation: page-level mapping

[Diagram: memory pages of Program 1 and Program 2 mapped onto cache slices by OS page allocation]
  • → Mapping data to different cache slices is feasible
  • → Key: OS page allocation policies
  • Flexible
10
Goal 1: performance management
→ Proximity-aware data mapping
11
Goal 2: power management
→ Usage-aware cache shut-off
12
Goal 3: reliability management
→ On-demand cache isolation
13
Goal 4: QoS management
→ Contract-based cache allocation
14
Architectural support
Method 1: bit selection
  slice_num = (page_num) mod (N_slice)
Method 2: region table
  if region_x_low ≤ page_num ≤ region_x_high, use the slice registered for region x
Method 3: page table (TLB)
  the slice_num for each page_num is stored in the page table / TLB entry

[Diagram: the data address is checked against the bit-selection logic, the region table (reg_table), and the TLB]
  • → Simple hardware support is enough
  • → A combined scheme is feasible (sketched below)
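A minimal C sketch of the three lookup methods, assuming a 16-slice chip and a small fixed-size region table. The structures and field names (region_entry, tlb_entry, N_REGION) are illustrative assumptions, not the hardware described on the slide; the combined scheme is shown as a simple fallback from the region table to bit selection.

```c
#include <stdbool.h>
#include <stdint.h>

#define N_SLICE   16       /* assumed number of cache slices         */
#define N_REGION  4        /* assumed number of region-table entries */

/* Method 1: bit selection -- slice_num = (page_num) mod (N_slice). */
unsigned slice_bit_selection(uint64_t page_num)
{
    return (unsigned)(page_num % N_SLICE);
}

/* Method 2: region table -- each entry maps a page-number range
 * [low, high] onto one slice; unmatched pages fall back to Method 1. */
struct region_entry {
    uint64_t low, high;    /* inclusive page-number range  */
    unsigned slice;        /* slice for pages in the range */
    bool     valid;
};

unsigned slice_region_table(const struct region_entry rt[N_REGION],
                            uint64_t page_num)
{
    for (int i = 0; i < N_REGION; i++)
        if (rt[i].valid && page_num >= rt[i].low && page_num <= rt[i].high)
            return rt[i].slice;
    return slice_bit_selection(page_num);
}

/* Method 3: page table / TLB -- the slice number travels with the
 * address translation, so the OS picks it per page at allocation time. */
struct tlb_entry {
    uint64_t vpn, ppn;     /* virtual and physical page numbers */
    unsigned slice;        /* slice chosen by the OS            */
};

unsigned slice_from_tlb(const struct tlb_entry *e)
{
    return e->slice;
}
```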
15
Some OS design issues
  • Congruence group CG(i)
  • Set of physical pages mapped to slice i
  • A free list for each i → multiple free lists
  • On each page allocation, consider:
  • Data proximity
  • Cache pressure
  • (e.g.) Profitability function P = f(M, L, P, Q, C) (see the sketch below)
  • M: miss rates
  • L: network link status
  • P: current page allocation status
  • Q: QoS requirements
  • C: cache configuration
  • Impact on process scheduling
  • Leverage existing frameworks
  • Page coloring: multiple free lists
  • NUMA OS: process scheduling & page allocation
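To show how these pieces fit together, here is a minimal C sketch of page allocation over per-slice free lists driven by a profitability function. The scoring formula and every name (free_list, slice_state, profitability, alloc_page_for_core, hops) are assumptions for illustration, not the authors' implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define N_SLICE 16

/* One free list per congruence group CG(i): the physical pages
 * that map to cache slice i. */
struct free_list {
    uint64_t *pages;       /* free physical page numbers in CG(i) */
    size_t    count;
};

static struct free_list cg[N_SLICE];

/* Per-slice inputs to the profitability function P = f(M, L, P, Q, C). */
struct slice_state {
    double miss_rate;      /* M: observed miss rate                          */
    double link_util;      /* L: network link utilization                    */
    double pressure;       /* P: current page allocation status              */
    double qos_weight;     /* Q: QoS requirement weight                      */
    int    usable;         /* C: slice present/enabled in this configuration */
};

/* Hypothetical scoring: prefer nearby, lightly loaded, usable slices. */
static double profitability(const struct slice_state *s, int hops)
{
    if (!s->usable)
        return -1.0;
    return s->qos_weight /
           (1.0 + hops + s->miss_rate + s->link_util + s->pressure);
}

/* On a page allocation for a process on `core`, pick the most
 * profitable slice that still has free pages and hand out one page. */
uint64_t alloc_page_for_core(int core,
                             const struct slice_state st[N_SLICE],
                             int (*hops)(int core, int slice))
{
    int    best   = -1;
    double best_p = -1.0;

    for (int i = 0; i < N_SLICE; i++) {
        if (cg[i].count == 0)
            continue;
        double p = profitability(&st[i], hops(core, i));
        if (p > best_p) {
            best_p = p;
            best   = i;
        }
    }
    if (best < 0)
        return UINT64_MAX;                   /* no free page anywhere (demo) */
    return cg[best].pages[--cg[best].count]; /* pop a page from CG(best)     */
}
```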

16
Working example
[Diagram: a program running on a 16-slice (4x4) chip; each page it touches is placed on the slice with the highest profitability, e.g., P(4) = 0.9, P(6) = 0.8, P(5) = 0.7 for one page and P(1) = 0.95, P(6) = 0.9, P(4) = 0.8 for the next]
  • → Static vs. dynamic mapping
  • → Program information (e.g., profile)
  • Proper run-time monitoring needed
17
Page mapping policies
18
Simulating private caching
For a page requested from a program running on core i, map the page to cache slice i

[Chart: L2 cache latency (cycles) vs. L2 cache slice size, SPEC2k INT and FP, private caching vs. OS-based mapping]
  • → Simulating private caching is simple
  • → Similar or better performance
19
Simulating large private caching
For a page requested from a program running on core i, map the page to cache slice i; also spread pages

[Chart: relative performance (time^-1), SPEC2k INT and FP, private vs. OS-based mapping (one bar labeled 1.93); 512kB cache slice]
20
Simulating shared caching
For a page requested from a program running on core i, map the page to all cache slices (round-robin, random, ...)

[Chart: L2 cache latency (cycles) vs. L2 cache slice size, SPEC2k INT and FP, OS-based mapping vs. shared caching (outlier bars labeled 129 and 106)]
  • → Simulating shared caching is simple
  • Mostly similar behavior/performance
  • Pathological cases (e.g., applu)
21
Simulating clustered caching
For a page requested from a program running on a core in group j, map the page to a cache slice within that group (round-robin, random, ...)

[Chart: relative performance (time^-1), private vs. OS-based vs. shared; 4 cores used, 512kB cache slice]
  • → Simulating clustered caching is simple
  • Lower miss traffic than private
  • Lower on-chip traffic than shared

(A combined sketch of these four page-mapping policies follows.)
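A minimal C sketch of the four page-mapping policies from slides 18 through 21, expressed as one slice-selection routine. The enum names, the GROUP_SIZE constant, the round-robin state, and in particular the "occasionally spread" heuristic for the large-private case are illustrative assumptions.

```c
#define N_SLICE     16     /* assumed number of slices             */
#define GROUP_SIZE  4      /* assumed 4 slices (cores) per cluster */

enum policy { PRIVATE, PRIVATE_SPREAD, SHARED_RR, CLUSTERED_RR };

/* Pick the slice for a newly requested page from `core`.
 * `rr` is round-robin state kept by the page allocator. */
int pick_slice(enum policy p, int core, unsigned *rr)
{
    switch (p) {
    case PRIVATE:          /* slide 18: always the requesting core's slice */
        return core;
    case PRIVATE_SPREAD:   /* slide 19: mostly local, occasionally spread
                            * (the spreading heuristic is an assumption)   */
        return (++*rr % 4 == 0) ? (int)(*rr % N_SLICE) : core;
    case SHARED_RR:        /* slide 20: spread over all slices             */
        return (int)(++*rr % N_SLICE);
    case CLUSTERED_RR:     /* slide 21: spread within the core's group     */
        return (core / GROUP_SIZE) * GROUP_SIZE + (int)(++*rr % GROUP_SIZE);
    }
    return core;
}
```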
22
Profile-driven page mapping
  • Using profiling, collect:
  • Inter-page conflict information
  • Per-page access count information
  • Page-mapping cost function (per slice)
  • Given the program location, the page to map, and the previously mapped pages:
  • cost = (# conflicts × miss penalty) + weight × (# accesses × latency)
  • (first term: miss cost; second term: latency cost)
  • weight as a knob
  • Larger value → more weight on proximity (than miss rate)
  • Optimize both miss rate and data proximity
  • Theoretically important to understand limits
  • Can be practically important, too (a cost-function sketch follows)
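A minimal C sketch of the per-slice cost function described above. Only the shape of the formula comes from the slide; the profile inputs (conflicts, accesses), the MISS_PENALTY constant, and the helper names are hypothetical.

```c
#define N_SLICE      16
#define MISS_PENALTY 300.0     /* assumed off-chip miss penalty, in cycles */

/* Profile-derived inputs for mapping one page onto one candidate slice. */
struct profile_in {
    double conflicts;   /* expected conflicts with pages already on the slice */
    double accesses;    /* profiled access count of the page                  */
};

/* cost(slice) = (# conflicts x miss penalty) + weight x (# accesses x latency)
 * First term: miss cost.  Second term: latency cost.
 * A larger weight puts more emphasis on proximity than on miss rate. */
double page_cost(const struct profile_in *in, double latency, double weight)
{
    return in->conflicts * MISS_PENALTY + weight * in->accesses * latency;
}

/* Map the page to the slice with the lowest total cost. */
int best_slice(const struct profile_in in[N_SLICE],
               const double latency[N_SLICE], double weight)
{
    int best = 0;
    for (int i = 1; i < N_SLICE; i++)
        if (page_cost(&in[i], latency[i], weight) <
            page_cost(&in[best], latency[best], weight))
            best = i;
    return best;
}
```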
23
Profile-driven page mapping, cont'd

[Chart: breakdown of L2 cache accesses (local on-chip hits, remote on-chip hits, misses) as the weight knob varies; 256kB L2 cache slice]
24
Profile-driven page mapping, cont'd

[Diagram: where the pages of GCC are mapped relative to the program location; 256kB L2 cache slice]
25
Profile-driven page mapping, cont'd

[Chart: performance improvement over shared caching (one bar labeled 108); 256kB L2 cache slice]
  • → Room for performance improvement
  • → Best of the two, or better than both
  • Dynamic mapping schemes desired
26
Isolating faulty caches
When there are faulty cache slices, avoid mapping pages to them (see the sketch below)

[Chart: relative L2 cache latency vs. number of cache slice deletions, shared caching vs. OS-based mapping; 4 cores running a multiprogrammed workload, 512kB cache slice]
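A minimal sketch of the fault-isolation policy: the page allocator simply skips congruence groups whose slice is marked faulty. The faulty bitmap and the linear-probe fallback are assumptions.

```c
#include <stdbool.h>

#define N_SLICE 16

static bool faulty[N_SLICE];   /* set when a slice is found faulty, e.g., by built-in test */

/* Return the preferred slice if it is healthy, otherwise the next
 * healthy one (a linear probe stands in for a distance-aware search).
 * Returns -1 only if every slice is faulty. */
int healthy_slice(int preferred)
{
    for (int d = 0; d < N_SLICE; d++) {
        int s = (preferred + d) % N_SLICE;
        if (!faulty[s])
            return s;
    }
    return -1;
}
```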
27
Conclusion
  • Flexibility will become important in future
    multicores
  • Many shared resources
  • Allows us to implement high-level policies
  • OS-level page-granularity data-to-slice mapping
  • Low hardware overhead
  • Flexible
  • Several management policies studied
  • Mimicking private/shared/clustered caching
    straightforward
  • Performance-improving schemes

28
Future work
  • Dynamic mapping schemes
  • Performance
  • Power
  • Performance monitoring techniques
  • Hardware-based
  • Software-based
  • Data migration and replication support

29
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Thank you!
  • Sangyeun Cho and Lei Jin

Dept. of Computer Science, University of Pittsburgh
30
Multicores are here
Quad cores (2007)
31
A multicore outlook
???
32
A processor model
  • Private L1 I/D caches
  • 8kB-32kB
  • Local unified L2
  • 128kB-512kB
  • 8-18 cycles
  • Switched network
  • 2-4 cycles/switch
  • Distributed directory
  • Scatter hotspots

[Diagram: each tile has a processor core, a local L2 cache slice, and a router; many cores (e.g., 16)]
33
Other approaches
  • Hybrid/flexible schemes
  • Core clustering [Speight et al., ISCA 2005]
  • Flexible CMP cache sharing [Huh et al., ICS 2004]
  • Flexible bank mapping [Liu et al., HPCA 2004]
  • Improving shared caching
  • Victim replication [Zhang and Asanovic, ISCA 2005]
  • Improving private caching
  • Cooperative caching [Chang and Sohi, ISCA 2006]
  • CMP-NuRAPID [Chishti et al., ISCA 2005]