Adaptive Memory-Management Techniques for High Performance Computing on NUMA Multiprocessors - PowerPoint PPT Presentation

1
Adaptive Memory-Management Techniques for High
Performance Computing on NUMA Multiprocessors
Paul Slavin, 21 March 2007
Supervisor: Dr. Len Freeman
Advisor: Prof. John Gurd
This work is supported by the Engineering and
Physical Sciences Research Council.
2
Programme
  • Problem Overview - Background and definitions
  • Existing Solutions - Discussion and critique
  • New Techniques - Analysis, experimental
    evaluation and results
  • Ongoing Research - Current tasks and proposed
    direction

3
The NUMA Environment
4
Performance Implications
  • Variable latency between local and remote
    accesses.
  • Inter-node communication inevitable.
  • Memory access constitutes large component of
    execution time.
  • Locality becomes important determinant of
    performance.

5
Memory Placement
  • Explicit Placement in code - Directives. -
    Placement by programmer.
  • Compile-time Placement
  • Dynamic Run-time Placement

6
Page Migration (i)
7
Page Migration (ii)
8
Applied Page Migration
  • What to migrate? To where? When?
  • Dynamic runtime behaviour requires runtime
    measurement.
  • Sampling, thresholds and extrapolation.
  • Linear sequence of memory accesses.

9
Performance Deficiencies
  • Little or no effect on NPB applications.
  • Migration cost itself not prohibitive.
  • Poor results due to poor migration decisions.
  • Assumes continued relevance of historic
    information.
  • Phase changes, complexity cause problems.

10
Initial Approaches
  • Shared memory and UNIX signals - shmget(),
    mprotect(). - Catch and handle the SIGSEGV
    raised on access. - Reveals origin and
    destination of accesses. - Excessive overhead.
  • L2 cache misses as proxy information - Query
    hardware counters. - High overhead, low
    relevance. - Poor portability.

11
Types of Solution
  • Ideal scenario would be access to
    application-specific information.
  • Semantics of underlying algorithm.
  • Difficult to extract from machine view of
    execution.
  • Can application-specific insight be collected
    and presented to migration daemon?

12
Proxy Information
Feedback Guided Dynamic Loop Scheduling
13
Equipartitioned Workload
14
Relevance to Memory Placement
  • Internal representation of the application's
    runtime environment.
  • Represents affinity between CPUs and memory.
  • Schedule of future work.
  • Memory may be placed before accesses.

15
Implementation
  • Evaluate physical topology of machine.
  • Determine physical location of virtual addresses
    relative to CPUs.
  • Use scheduling information to determine optimal
    placement for application.
  • System call allows address ranges to be migrated.

16
Implementation Details
  • Multiple parallel implementations - pthreads. -
    m_fork(), sproc().
  • Multiple scheduling schemes - Typical FGDLS. -
    Integrated-area historic FGDLS.

17
Evaluation
  • Two implementations evaluated - Matrix-Vector
    Multiplication - NPB Conjugate Gradient (C
    version)
  • 16 processor SGI Origin 3400, IRIX OS - 4 procs,
    1GB per node.
  • Creates Memory Locality Domains.
  • Applies addr2node() syscall.
  • Performs migration with migr_range_migrate()
    call.

18
Results
19
Analysis
  • Why the difference in results?
  • Algorithms have underlying similarities, but
    implementations differ.
  • Data structures and method of parallelisation. -
    Contiguity of memory regions. - Unity of
    per-processor workloads. - System page size vs.
    size of data structure.

20
Influence of Page Size
  • Page size relative to size of data structure.
  • Sequencing and contiguity of data - Sorted vs.
    unsorted tree. - Random distribution -> random
    neighbours. - Sequential distribution ->
    sequential neighbours.
  • Page represents minimum granularity of migration.
  • Reduced overheads from migrating data en masse.
  • Can data structures be transformed to become more
    amenable to page migration?

21
Generalised Scheme
  • Scheme is implemented entirely in user-level
    code.
  • Portable between applications, OSes and
    architectures.
  • Current implementations modify source-code.
  • Information production vs. consumption.
  • API could provide level of abstraction.
  • Expose application-specific information to any
    consumer.

22
Object-oriented Allocation
  • Managed Array object.
  • Single object represents multiple allocations. -
    Local buffer per node. - ...or delegate migration
    to the OS.
  • Function pointers assign operations to data.
  • Internal relocation guided by scheduling. -
    Transparent data transformation. - Suitable
    interval, nested loops.

23
Memory Management API
  • Linux Kernel module.
  • Receives application-specific info via /dev
    entry.
  • Exposes this via /proc filesystem.
  • Allows for separation of production and
    consumption of information.
  • Application information is generated near to its
    source, but consumed wherever memory can be
    managed most efficiently.

24
Current and Future Tasks
  • Collate and publish results.
  • Complete API and allocator.
  • Evaluate with LA application.
  • Other data points.
  • Detailed, quantitative analysis of results.