Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead

Transcript and Presenter's Notes

1
Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
  • Zoltan Majo and Thomas R. Gross
  • Department of Computer Science
  • ETH Zurich

2
NUMA multicores
[Diagram: two multicore processors connected by an interconnect (IC), each with its own local memory]
3
NUMA multicores
  • Two problems:
  • NUMA: interconnect overhead

[Diagram: processes A and B access their memory (MA, MB) across the interconnect (IC)]
4
NUMA multicores
  • Two problems:
  • NUMA: interconnect overhead
  • multicore: cache contention

[Diagram: processes A and B share the last-level cache of one processor; their memory (MA, MB) is reached over the interconnect (IC)]
5
Outline
  • NUMA experimental evaluation
  • Scheduling
  • N-MASS
  • N-MASS evaluation

6
Multi-clone experiments
  • Memory behavior of unrelated programs
  • Intel Xeon E5520
  • 4 clones of soplex (SPEC CPU2006)
  • local clones: run on the processor that holds their memory
  • remote clones: run on the other processor, so their memory accesses cross the interconnect (placement sketch below)

[Diagram: placement of the four soplex clones (C) and their memory (M) on the two processors]
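The local/remote placement used in these experiments can be reproduced with numactl. The sketch below is a minimal illustration of that setup, assuming a two-node machine and a hypothetical soplex invocation; it is not the authors' actual scripts.

```python
# Minimal sketch: launch soplex clones so that "local" clones keep their
# memory on the node they run on, while "remote" clones allocate memory on
# the other node. Paths and node numbers are illustrative assumptions.
import subprocess

SOPLEX_CMD = ["./soplex", "ref.mps"]   # hypothetical benchmark invocation

def start_clone(cpu_node, mem_node):
    """Run one clone on cpu_node with its memory bound to mem_node."""
    cmd = ["numactl", f"--cpunodebind={cpu_node}", f"--membind={mem_node}"]
    return subprocess.Popen(cmd + SOPLEX_CMD)

# Local clones: execute on node 0, memory on node 0.
clones = [start_clone(cpu_node=0, mem_node=0) for _ in range(2)]
# Remote clones: execute on node 0, memory on node 1, so every last-level
# cache miss crosses the interconnect.
clones += [start_clone(cpu_node=0, mem_node=1) for _ in range(2)]

for c in clones:
    c.wait()
```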
7
(No Transcript)
8
Performance of schedules
  • Which is the best schedule?
  • Baseline: single-program execution mode

9
Execution time
[Chart: slowdown relative to baseline for local clones, remote clones, and the average]
10
Outline
  • NUMA experimental evaluation
  • Scheduling
  • N-MASS
  • N-MASS evaluation

11
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
  • Two steps:
  • Step 1: maximum-local mapping
  • Step 2: cache-aware refinement

12
Step 1: Maximum-local mapping

[Diagram: each process (A–D) is placed on the processor that holds its memory (MA–MD)]
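A minimal sketch of the idea behind Step 1, assuming (as stated on the implementation slide) that each process has all of its memory on a single node; the data structures and names are illustrative, not the paper's code.

```python
# Illustrative sketch of maximum-local mapping: run every process on the
# NUMA node that holds its memory, so no memory access crosses the
# interconnect. `home_node` maps a process to the node holding its memory.

def maximum_local_mapping(processes, home_node):
    """Return a process -> node mapping that maximizes local accesses."""
    return {p: home_node[p] for p in processes}

# Example: the memory of A and B is on node 0, that of C and D on node 1.
print(maximum_local_mapping("ABCD", {"A": 0, "B": 0, "C": 1, "D": 1}))
# {'A': 0, 'B': 0, 'C': 1, 'D': 1}
```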
13
Default OS scheduling
[Diagram: the default OS schedule ignores memory placement, so some processes run on the processor remote to their memory (MA–MD)]
14
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
  • Two steps:
  • Step 1: maximum-local mapping
  • Step 2: cache-aware refinement

15
Step 2: Cache-aware refinement
  • In an SMP

[Diagram: initial mapping of the processes and their memories (MA–MD)]
16
Step 2: Cache-aware refinement
  • In an SMP

[Diagram: a process is moved to the other processor to balance cache pressure; in an SMP this move has no remote-memory cost]
17
Step 2: Cache-aware refinement
  • In an SMP

[Diagram: resulting placement of processes A–D in the SMP case]
18
Step 2: Cache-aware refinement
  • In a NUMA system

[Diagram: the same processes A–D on a NUMA machine, where each memory MA–MD resides on one of the two nodes]
19
Step 2: Cache-aware refinement
  • In a NUMA system

[Diagram: moving a process to the other processor balances cache pressure but turns its memory accesses into remote accesses]
20
Step 2: Cache-aware refinement
  • In a NUMA system

[Diagram: resulting placement of processes A–D in the NUMA case]
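One way to picture the refinement step is as moving processes away from an over-pressured cache while weighing the NUMA penalty such a move introduces. The sketch below is an illustrative reconstruction under that reading, not the exact algorithm of the paper; all names and the capacity threshold are assumptions.

```python
# Illustrative sketch of cache-aware refinement on a two-node machine.
# `pressure[p]` is the process's MPKI; `numa_penalty[p]` is the slowdown it
# would suffer if its memory accesses became remote. Both are assumed inputs.

def cache_aware_refinement(mapping, pressure, numa_penalty, capacity):
    """Rebalance cache pressure between nodes 0 and 1, preferring moves
    that incur the smallest NUMA penalty."""
    def load(node):
        return sum(pressure[p] for p, n in mapping.items() if n == node)

    hot, cold = (0, 1) if load(0) >= load(1) else (1, 0)
    while load(hot) > capacity:
        candidates = [p for p, n in mapping.items() if n == hot]
        if not candidates:
            break
        # Moving a process makes its accesses remote, inflating its cost by
        # the NUMA penalty; pick the process for which this hurts least.
        victim = min(candidates, key=lambda p: pressure[p] * numa_penalty[p])
        mapping[victim] = cold
    return mapping
```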
21
Performance factors
  • Two factors cause performance degradation:
  • NUMA penalty: slowdown due to remote memory access
  • cache pressure
  • local processes: misses per 1000 instructions (MPKI)
  • remote processes: MPKI x NUMA penalty (see the sketch below)

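A small helper, as a sketch of the pressure metric described above; the function name and arguments are illustrative.

```python
# Cache-pressure estimate from this slide (illustrative helper).
# mpki: last-level cache misses per 1000 instructions (misses / KINST).

def cache_pressure(mpki, runs_remote, numa_penalty):
    """Pressure a process puts on the shared cache.

    Local process:  MPKI
    Remote process: MPKI x NUMA penalty (each miss also pays the cost of a
                    remote memory access over the interconnect)
    """
    return mpki * numa_penalty if runs_remote else mpki
```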
22
Implementation
  • User-mode extension to the Linux scheduler
  • Performance metrics: hardware performance counter feedback (see the MPKI sketch below)
  • NUMA penalty: either perfect information from program traces or an estimate based on MPKI
  • All memory of a process is allocated on one processor

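The counter feedback could be gathered along the lines of the sketch below, which derives MPKI from perf stat samples. The event names and output parsing are assumptions (they vary across CPUs and perf versions); this is not the authors' implementation.

```python
# Illustrative sketch: sample hardware counters of a running process with
# `perf stat` and compute MPKI (misses per 1000 instructions). The event
# LLC-load-misses is used here as an assumption; names differ between CPUs.
import subprocess

def measure_mpki(pid, seconds=1.0):
    """Attach perf to `pid` for `seconds` and return its MPKI."""
    result = subprocess.run(
        ["perf", "stat", "-x", ",",
         "-e", "LLC-load-misses,instructions",
         "-p", str(pid), "--", "sleep", str(seconds)],
        capture_output=True, text=True)
    counts = {}
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counts[fields[2]] = int(fields[0])
    misses = counts.get("LLC-load-misses", 0)
    instructions = counts.get("instructions", 1)
    return 1000.0 * misses / instructions
```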
23
Outline
  • NUMA experimental evaluation
  • Scheduling
  • N-MASS
  • N-MASS evaluation

24
Workloads
  • SPEC CPU2006 subset
  • 11 multi-program workloads (WL1–WL11)
  • 4-program workloads (WL1–WL9)
  • 8-program workloads (WL10, WL11)

[Chart: NUMA penalty of the selected benchmarks, ranging from CPU-bound to memory-bound]
25
Memory allocation setup
  • Where the memory of each process is allocated
    influences performance
  • Controlled setup: memory allocation maps

26
Memory allocation maps
[Diagram: allocation map 0000, with the memory (MA–MD) of all four processes A–D placed on node 0]
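Reading the map as one digit per process, giving the node that holds that process's memory, is an assumption based on this example; the decoder below only illustrates that reading.

```python
# Illustrative decoder for an allocation map string such as "0000":
# one digit per process (A-D), giving the node that holds its memory.
# The encoding itself is an assumption based on the example on this slide.

def decode_allocation_map(digits, processes="ABCD"):
    return {p: int(d) for p, d in zip(processes, digits)}

print(decode_allocation_map("0000"))  # all memory on node 0 (unbalanced)
print(decode_allocation_map("0011"))  # memory split across the nodes (balanced)
```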
27
Memory allocation maps
[Diagram: another allocation map for processes A–D]
28
Memory allocation maps
[Diagram: allocation maps for processes A–D, grouped into unbalanced and balanced maps]
29
Evaluation
  • Baseline: Linux average
  • the Linux scheduler is non-deterministic, so the baseline is the average performance degradation over all possible cases (see the sketch below)
  • N-MASS with perfect NUMA penalty information

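A minimal sketch of that averaging, interpreting the "possible cases" as process-to-core placements and assuming a degradation(placement) callback that returns the workload's slowdown for one concrete placement:

```python
# Illustrative sketch of the "Linux average" baseline: average the workload's
# degradation over every possible assignment of processes to cores (assumes
# at most as many processes as cores).
from itertools import permutations
from statistics import mean

def linux_average(processes, cores, degradation):
    placements = permutations(cores, len(processes))
    return mean(degradation(p) for p in placements)
```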
30
WL9: Linux average
[Chart: average slowdown relative to single-program mode]
31
WL9: N-MASS
[Chart: average slowdown relative to single-program mode]
32
WL1: Linux average and N-MASS
[Chart: average slowdown relative to single-program mode]
33
N-MASS performance
  • N-MASS reduces performance degradation by up to 22%
  • Which factor is more important: interconnect overhead or cache contention?
  • Compare:
  • maximum-local
  • N-MASS (maximum-local + cache-aware refinement step)

34
Data locality vs. cache balancing (WL9)
[Chart: performance improvement relative to Linux average]
35
Data locality vs. cache balancing (WL1)
[Chart: performance improvement relative to Linux average]
36
Data locality vs. cache balancing
  • Data locality is more important than cache balancing
  • Cache balancing gives performance benefits mostly with unbalanced allocation maps
  • What if information about the NUMA penalty is not available?

37
Estimating NUMA penalty
[Chart: NUMA penalty as a function of MPKI, with a fitted regression line]
  • NUMA penalty is not directly measurable
  • Estimate: fit a linear regression onto MPKI data (see the sketch below)

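A minimal sketch of the estimation step, assuming a set of profiled (MPKI, NUMA penalty) pairs; the sample values below are placeholders, not measurements from the paper.

```python
# Illustrative sketch: fit a linear model of the NUMA penalty as a function
# of MPKI, then predict the penalty for a program where only MPKI is known.
import numpy as np

mpki_samples    = np.array([0.5, 2.0, 5.0, 12.0, 25.0])     # profiled MPKI
penalty_samples = np.array([1.02, 1.08, 1.18, 1.35, 1.60])  # profiled slowdown

slope, intercept = np.polyfit(mpki_samples, penalty_samples, deg=1)

def estimate_numa_penalty(mpki):
    """Predicted remote-access slowdown for a process with the given MPKI."""
    return intercept + slope * mpki
```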
38
Estimate-based N-MASS performance
[Chart: performance improvement relative to Linux average]
39
Conclusions
  • N-MASS: a NUMA- and multicore-aware scheduler
  • Data locality optimizations are more beneficial than cache contention avoidance
  • Better performance metrics are needed for scheduling

40
Thank you! Questions?