Title: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
1. Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
- Zoltan Majo and Thomas R. Gross
- Department of Computer Science
- ETH Zurich
2. NUMA multicores
[Diagram: two processors linked by the interconnect (IC)]
3. NUMA multicores
- Two problems
  - NUMA: interconnect overhead
[Diagram: processes A and B run on one processor while their memory (MA, MB) sits on the other, so accesses cross the interconnect (IC)]
4. NUMA multicores
- Two problems
  - NUMA: interconnect overhead
  - multicore: cache contention
[Diagram: A and B additionally share the last-level cache of their processor while their memory (MA, MB) is accessed over the interconnect (IC)]
5. Outline
- NUMA: experimental evaluation
- Scheduling
  - N-MASS
- N-MASS evaluation
6. Multi-clone experiments
- Memory behavior of unrelated programs
- Intel Xeon E5520
- 4 clones of soplex (SPEC CPU2006)
  - local clone: memory on the processor where the clone runs
  - remote clone: memory on the other processor
[Diagram: placement of the clones (C) and their memory (M) on the two processors]
8. Performance of schedules
- Which is the best schedule?
- Baseline: single-program execution mode
9. Execution time
[Chart: slowdown relative to the single-program baseline for local clones, remote clones, and the average]
10. Outline
- NUMA: experimental evaluation
- Scheduling
  - N-MASS
- N-MASS evaluation
11. N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
- Two steps
  - Step 1: maximum-local mapping
  - Step 2: cache-aware refinement
12. Step 1: Maximum-local mapping
[Diagram: each process (A, B, C, D) is placed on the processor that holds its memory (MA, MB, MC, MD)]
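Step 1 can be sketched in a few lines. The function and parameter names below are illustrative, and the per-node core limit plus the spill rule for full nodes are my assumptions, not details from the talk:

```python
def maximum_local_mapping(memory_node, cores_per_node=4):
    """Sketch of Step 1: place each process on the NUMA node that
    holds its memory (N-MASS allocates all memory of a process on
    a single node).  `cores_per_node` is an assumed capacity limit."""
    load = {}       # number of processes already placed per node
    mapping = {}
    for pid, node in memory_node.items():
        target = node
        if load.get(node, 0) >= cores_per_node:
            # home node is full: spill to the least-loaded node (remote)
            nodes = set(memory_node.values()) | set(load)
            target = min(nodes, key=lambda n: load.get(n, 0))
        mapping[pid] = target
        load[target] = load.get(target, 0) + 1
    return mapping

# Processes A and B have memory on node 0, C and D on node 1:
print(maximum_local_mapping({"A": 0, "B": 0, "C": 1, "D": 1}))
# every process ends up local to its memory
```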
13. Default OS scheduling
[Diagram: the default scheduler may place processes (A, B, C, D) on processors other than the ones holding their memory (MA, MB, MC, MD)]
14. N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
- Two steps
  - Step 1: maximum-local mapping
  - Step 2: cache-aware refinement
15. Step 2: Cache-aware refinement
[Diagram sequence, slides 15 to 20: starting from the maximum-local mapping, processes (A, B, C, D) are migrated between processors to balance cache pressure while their memory (MA, MB, MC, MD) stays in place]
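One way to picture Step 2 is as a rebalancing loop over the pressure metric from slide 21. The heuristic below (the `ratio` threshold and the lowest-MPKI victim choice) is my illustration, not the published N-MASS rule:

```python
def cache_aware_refinement(placement, memory_node, mpki, numa_penalty, ratio=1.5):
    """Sketch of Step 2 (an illustrative heuristic, not the exact
    N-MASS algorithm).  Cache pressure per node follows slide 21:
    a local process contributes its MPKI, a remote one contributes
    MPKI * NUMA penalty.  While one node's pressure exceeds the
    other's by `ratio`, the cheapest-to-move process (lowest MPKI)
    migrates off the hot node."""
    place = dict(placement)

    def pressure(node):
        return sum(mpki[p] * (1.0 if memory_node[p] == node else numa_penalty)
                   for p in place if place[p] == node)

    for _ in range(len(place)):            # bound the number of migrations
        hot, cold = (0, 1) if pressure(0) >= pressure(1) else (1, 0)
        if pressure(hot) <= ratio * max(pressure(cold), 1e-9):
            break                          # pressures are balanced enough
        victim = min((p for p in place if place[p] == hot), key=lambda p: mpki[p])
        place[victim] = cold               # run it remotely, relieving the hot cache
    return place
```

Migrating a process away from its memory trades interconnect overhead for reduced cache contention, which is exactly the tension the talk's title names.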
21. Performance factors
- Two factors cause performance degradation
  - NUMA penalty: slowdown due to remote memory access
  - cache pressure
    - local processes: misses per 1000 instructions (MPKI)
    - remote processes: MPKI x NUMA penalty
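The pressure metric above can be written down directly; the numbers in the example are made up for illustration:

```python
def cache_pressure(mpki, is_remote, numa_penalty):
    """Slide 21's metric: a local process contributes its MPKI
    (misses per 1000 instructions); a remote process contributes
    MPKI * NUMA penalty, since each of its misses is more expensive."""
    return mpki * numa_penalty if is_remote else mpki

# Made-up numbers: a process with 5 MPKI and a NUMA penalty of 1.5
print(cache_pressure(5.0, is_remote=False, numa_penalty=1.5))  # 5.0
print(cache_pressure(5.0, is_remote=True,  numa_penalty=1.5))  # 7.5
```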
22. Implementation
- User-mode extension to the Linux scheduler
- Performance metrics
  - hardware performance counter feedback
- NUMA penalty
  - perfect information from program traces
  - estimate based on MPKI
- All memory for a process is allocated on one processor
23. Outline
- NUMA: experimental evaluation
- Scheduling
  - N-MASS
- N-MASS evaluation
24. Workloads
- SPEC CPU2006 subset
- 11 multi-program workloads (WL1 to WL11)
  - 4-program workloads (WL1 to WL9)
  - 8-program workloads (WL10, WL11)
[Chart: NUMA penalty of the selected benchmarks, ranging from CPU-bound to memory-bound]
25. Memory allocation setup
- Where the memory of each process is allocated influences performance
- Controlled setup: memory allocation maps
26. Memory allocation maps
[Diagram: allocation map 0000: processes A, B, C, D with all of their memory (MA, MB, MC, MD) on one processor]
27. Memory allocation maps
[Diagram: a different allocation map for processes A, B, C, D]
28. Memory allocation maps
[Diagram: allocation maps classified as unbalanced (memory concentrated on one processor) or balanced (memory spread over both)]
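The map encoding and the balanced/unbalanced split can be illustrated in a few lines. The digit-per-process reading of the map string is inferred from the map "0000" on slide 26 and is an assumption:

```python
def classify_map(alloc_map):
    """One digit per process (A, B, C, D), giving the processor (0 or 1)
    that holds that process's memory; encoding inferred from map "0000"
    on slide 26.  A map counts as balanced when memory is split evenly
    across the two processors."""
    on_node0 = alloc_map.count("0")
    return "balanced" if on_node0 == len(alloc_map) // 2 else "unbalanced"

print(classify_map("0000"))  # unbalanced: all memory on processor 0
print(classify_map("0011"))  # balanced: two memories per processor
```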
29. Evaluation
- Baseline: Linux average
  - the Linux scheduler is non-deterministic
  - average performance degradation over all possible cases
- N-MASS with perfect NUMA penalty information
30. WL9: Linux average
[Chart: average slowdown relative to single-program mode]
31. WL9: N-MASS
[Chart: average slowdown relative to single-program mode]
32. WL1: Linux average and N-MASS
[Chart: average slowdown relative to single-program mode]
33. N-MASS performance
- N-MASS reduces performance degradation by up to 22%
- Which factor is more important: interconnect overhead or cache contention?
- Compare
  - maximum-local alone
  - N-MASS (maximum-local + cache-aware refinement step)
34. Data locality vs. cache balancing (WL9)
[Chart: performance improvement relative to Linux average]
35. Data locality vs. cache balancing (WL1)
[Chart: performance improvement relative to Linux average]
36. Data locality vs. cache balancing
- Data locality is more important than cache balancing
- Cache balancing gives performance benefits mostly with unbalanced allocation maps
- What if information about the NUMA penalty is not available?
37. Estimating the NUMA penalty
- The NUMA penalty is not directly measurable
- Estimate: fit a linear regression onto MPKI data
[Chart: NUMA penalty plotted against MPKI with the fitted regression line]
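The estimate can be sketched as an ordinary least-squares fit of NUMA penalty against MPKI; the calibration data points below are made up:

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a + b*x, used here to
    estimate the NUMA penalty of a process from its measurable MPKI."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Made-up calibration data: (MPKI, measured NUMA penalty) pairs
mpki_samples    = [0.5, 2.0, 5.0, 10.0]
penalty_samples = [1.02, 1.10, 1.25, 1.50]
a, b = fit_linear(mpki_samples, penalty_samples)
estimate = a + b * 7.0   # predicted penalty for a process with 7 MPKI
```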
38. Estimate-based N-MASS performance
[Chart: performance improvement relative to Linux average]
39. Conclusions
- N-MASS: a NUMA- and multicore-aware scheduler
- Data locality optimizations are more beneficial than cache contention avoidance
- Better performance metrics are needed for scheduling
40. Thank you! Questions?