Title: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
1. Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
- Zoltan Majo and Thomas R. Gross
- Department of Computer Science
- ETH Zurich
2. NUMA multicores
[Diagram: two processors linked by the interconnect (IC)]
3. NUMA multicores
- Two problems
  - NUMA: interconnect overhead
[Diagram: processes A and B run on one processor while their memory (MA, MB) sits on the other, so accesses cross the interconnect (IC)]
4. NUMA multicores
- Two problems
  - NUMA: interconnect overhead
  - multicore: cache contention
[Diagram: A and B additionally share the last-level cache of their processor while their memory (MA, MB) is accessed over the interconnect (IC)]
5. Outline
- NUMA: experimental evaluation
- Scheduling
  - N-MASS
- N-MASS evaluation
6. Multi-clone experiments
- Memory behavior of unrelated programs
- Intel Xeon E5520
- 4 clones of soplex (SPEC CPU2006)
  - local clone: memory on the processor where the clone runs
  - remote clone: memory on the other processor
[Diagram: placement of the clones (C) and their memory (M) on the two processors]
8. Performance of schedules
- Which is the best schedule?
- Baseline: single-program execution mode
9. Execution time
[Chart: slowdown relative to the single-program baseline for local clones, remote clones, and the average]
10. Outline
- NUMA: experimental evaluation
- Scheduling
  - N-MASS
- N-MASS evaluation
11. N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
- Two steps
  - Step 1: maximum-local mapping
  - Step 2: cache-aware refinement
12. Step 1: Maximum-local mapping
[Diagram: each process (A, B, C, D) is placed on the processor that holds its memory (MA, MB, MC, MD)]
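Step 1 can be sketched in a few lines. The function and parameter names below are illustrative, and the per-node core limit plus the spill rule for full nodes are my assumptions, not details from the talk:

```python
def maximum_local_mapping(memory_node, cores_per_node=4):
    """Sketch of Step 1: place each process on the NUMA node that
    holds its memory (N-MASS allocates all memory of a process on
    a single node).  `cores_per_node` is an assumed capacity limit."""
    load = {}       # number of processes already placed per node
    mapping = {}
    for pid, node in memory_node.items():
        target = node
        if load.get(node, 0) >= cores_per_node:
            # home node is full: spill to the least-loaded node (remote)
            nodes = set(memory_node.values()) | set(load)
            target = min(nodes, key=lambda n: load.get(n, 0))
        mapping[pid] = target
        load[target] = load.get(target, 0) + 1
    return mapping

# Processes A and B have memory on node 0, C and D on node 1:
print(maximum_local_mapping({"A": 0, "B": 0, "C": 1, "D": 1}))
# every process ends up local to its memory
```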
13. Default OS scheduling
[Diagram: the default scheduler may place processes (A, B, C, D) on processors other than the ones holding their memory (MA, MB, MC, MD)]
14. N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
- Two steps
  - Step 1: maximum-local mapping
  - Step 2: cache-aware refinement
15. Step 2: Cache-aware refinement
[Diagram sequence, slides 15 to 20: starting from the maximum-local mapping, processes (A, B, C, D) are migrated between processors to balance cache pressure while their memory (MA, MB, MC, MD) stays in place]
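One way to picture Step 2 is as a rebalancing loop over the pressure metric from slide 21. The heuristic below (the `ratio` threshold and the lowest-MPKI victim choice) is my illustration, not the published N-MASS rule:

```python
def cache_aware_refinement(placement, memory_node, mpki, numa_penalty, ratio=1.5):
    """Sketch of Step 2 (an illustrative heuristic, not the exact
    N-MASS algorithm).  Cache pressure per node follows slide 21:
    a local process contributes its MPKI, a remote one contributes
    MPKI * NUMA penalty.  While one node's pressure exceeds the
    other's by `ratio`, the cheapest-to-move process (lowest MPKI)
    migrates off the hot node."""
    place = dict(placement)

    def pressure(node):
        return sum(mpki[p] * (1.0 if memory_node[p] == node else numa_penalty)
                   for p in place if place[p] == node)

    for _ in range(len(place)):            # bound the number of migrations
        hot, cold = (0, 1) if pressure(0) >= pressure(1) else (1, 0)
        if pressure(hot) <= ratio * max(pressure(cold), 1e-9):
            break                          # pressures are balanced enough
        victim = min((p for p in place if place[p] == hot), key=lambda p: mpki[p])
        place[victim] = cold               # run it remotely, relieving the hot cache
    return place
```

Migrating a process away from its memory trades interconnect overhead for reduced cache contention, which is exactly the tension the talk's title names.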
21. Performance factors
- Two factors cause performance degradation
  - NUMA penalty: slowdown due to remote memory access
  - cache pressure
    - local processes: misses per 1000 instructions (MPKI)
    - remote processes: MPKI x NUMA penalty
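The pressure metric above can be written down directly; the numbers in the example are made up for illustration:

```python
def cache_pressure(mpki, is_remote, numa_penalty):
    """Slide 21's metric: a local process contributes its MPKI
    (misses per 1000 instructions); a remote process contributes
    MPKI * NUMA penalty, since each of its misses is more expensive."""
    return mpki * numa_penalty if is_remote else mpki

# Made-up numbers: a process with 5 MPKI and a NUMA penalty of 1.5
print(cache_pressure(5.0, is_remote=False, numa_penalty=1.5))  # 5.0
print(cache_pressure(5.0, is_remote=True,  numa_penalty=1.5))  # 7.5
```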
22. Implementation
- User-mode extension to the Linux scheduler
- Performance metrics
  - hardware performance counter feedback
- NUMA penalty
  - perfect information from program traces
  - estimate based on MPKI
- All memory for a process is allocated on one processor
23. Outline
- NUMA: experimental evaluation
- Scheduling
  - N-MASS
- N-MASS evaluation
24. Workloads
- SPEC CPU2006 subset
- 11 multi-program workloads (WL1 to WL11)
  - 4-program workloads (WL1 to WL9)
  - 8-program workloads (WL10, WL11)
[Chart: NUMA penalty of the selected benchmarks, ranging from CPU-bound to memory-bound]
25. Memory allocation setup
- Where the memory of each process is allocated influences performance
- Controlled setup: memory allocation maps
26. Memory allocation maps
[Diagram: allocation map 0000: processes A, B, C, D with all of their memory (MA, MB, MC, MD) on one processor]
27. Memory allocation maps
[Diagram: a different allocation map for processes A, B, C, D]
28. Memory allocation maps
[Diagram: allocation maps classified as unbalanced (memory concentrated on one processor) or balanced (memory spread over both)]
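The map encoding and the balanced/unbalanced split can be illustrated in a few lines. The digit-per-process reading of the map string is inferred from the map "0000" on slide 26 and is an assumption:

```python
def classify_map(alloc_map):
    """One digit per process (A, B, C, D), giving the processor (0 or 1)
    that holds that process's memory; encoding inferred from map "0000"
    on slide 26.  A map counts as balanced when memory is split evenly
    across the two processors."""
    on_node0 = alloc_map.count("0")
    return "balanced" if on_node0 == len(alloc_map) // 2 else "unbalanced"

print(classify_map("0000"))  # unbalanced: all memory on processor 0
print(classify_map("0011"))  # balanced: two memories per processor
```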
29. Evaluation
- Baseline: Linux average
  - the Linux scheduler is non-deterministic
  - average performance degradation over all possible cases
- N-MASS with perfect NUMA penalty information
30. WL9: Linux average
[Chart: average slowdown relative to single-program mode]
31. WL9: N-MASS
[Chart: average slowdown relative to single-program mode]
32. WL1: Linux average and N-MASS
[Chart: average slowdown relative to single-program mode]
33. N-MASS performance
- N-MASS reduces performance degradation by up to 22%
- Which factor is more important: interconnect overhead or cache contention?
- Compare
  - maximum-local alone
  - N-MASS (maximum-local + cache-aware refinement step)
34. Data locality vs. cache balancing (WL9)
[Chart: performance improvement relative to Linux average]
35. Data locality vs. cache balancing (WL1)
[Chart: performance improvement relative to Linux average]
36. Data locality vs. cache balancing
- Data locality is more important than cache balancing
- Cache balancing gives performance benefits mostly with unbalanced allocation maps
- What if information about the NUMA penalty is not available?
37. Estimating the NUMA penalty
- The NUMA penalty is not directly measurable
- Estimate: fit a linear regression onto MPKI data
[Chart: NUMA penalty plotted against MPKI with the fitted regression line]
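The estimate can be sketched as an ordinary least-squares fit of NUMA penalty against MPKI; the calibration data points below are made up:

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a + b*x, used here to
    estimate the NUMA penalty of a process from its measurable MPKI."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Made-up calibration data: (MPKI, measured NUMA penalty) pairs
mpki_samples    = [0.5, 2.0, 5.0, 10.0]
penalty_samples = [1.02, 1.10, 1.25, 1.50]
a, b = fit_linear(mpki_samples, penalty_samples)
estimate = a + b * 7.0   # predicted penalty for a process with 7 MPKI
```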
38. Estimate-based N-MASS performance
[Chart: performance improvement relative to Linux average]
39. Conclusions
- N-MASS: a NUMA- and multicore-aware scheduler
- Data locality optimizations are more beneficial than cache contention avoidance
- Better performance metrics are needed for scheduling
40. Thank you! Questions?