Adaptive Memory-Management Techniques for High Performance Computing on NUMA Multiprocessors - PowerPoint PPT Presentation

1
Adaptive Memory-Management Techniques for High
Performance Computing on NUMA Multiprocessors
Paul Slavin, 21 March 2007
Supervisor: Dr. Len Freeman
Advisor: Prof. John Gurd
This work is supported by the Engineering and
Physical Sciences Research Council.
2
Programme
  • Problem Overview - Background and definitions
  • Existing Solutions - Discussion and critique
  • New Techniques - Analysis, experimental
    evaluation and results
  • Ongoing Research - Current tasks and proposed
    direction

3
The NUMA Environment
4
Performance Implications
  • Variable latency between local and remote
    accesses.
  • Inter-node communication inevitable.
  • Memory access constitutes large component of
    execution time.
  • Locality becomes important determinant of
    performance.

5
Memory Placement
  • Explicit Placement in code - Directives. -
    Placement by programmer.
  • Compile-time Placement
  • Dynamic Run-time Placement

6
Page Migration (i)
7
Page Migration (ii)
8
Applied Page Migration
  • What to migrate? To where? When?
  • Dynamic runtime behaviour requires runtime
    measurement.
  • Sampling, thresholds and extrapolation.
  • Linear sequence of memory accesses.

9
Performance Deficiencies
  • Little or no effect on NPB applications.
  • Migration cost itself not prohibitive.
  • Poor results due to poor migration decisions.
  • Assumes continued relevance of historic
    information.
  • Phase changes, complexity cause problems.

10
Initial Approaches
  • Shared memory and UNIX signals - shmget(),
    mprotect(). - Catch and handle the SIGSEGV
    raised on access. - Reveals origin and
    destination of accesses. - Excessive overhead.
  • L2 cache misses as proxy information - Query
    hardware counters. - High overhead, low
    relevance. - Poor portability.

11
Types of Solution
  • Ideal scenario would be access to
    application-specific information.
  • Semantics of underlying algorithm.
  • Difficult to extract from machine view of
    execution.
  • Can application-specific insight be collected
    and presented to migration daemon?

12
Proxy Information
Feedback Guided Dynamic Loop Scheduling
13
Equipartitioned Workload
14
Relevance to Memory Placement
  • Internal representation of the application's
    runtime environment.
  • Represents affinity between CPUs and memory.
  • Schedule of future work.
  • Memory may be placed before accesses.

15
Implementation
  • Evaluate physical topology of machine.
  • Determine physical location of virtual addresses
    relative to CPUs.
  • Use scheduling information to determine optimal
    placement for application.
  • System call allows address ranges to be migrated.

16
Implementation Details
  • Multiple parallel implementations - pthreads. -
    m_fork(), sproc().
  • Multiple scheduling schemes - Typical FGDLS. -
    Integrated-area historic FGDLS.

17
Evaluation
  • Two implementations evaluated - Matrix-Vector
    Multiplication - NPB Conjugate Gradient (C
    version)
  • 16 processor SGI Origin 3400, IRIX OS - 4 procs,
    1GB per node.
  • Creates Memory Locality Domains.
  • Applies addr2node() syscall.
  • Performs migration with migr_range_migrate()
    call.

18
Results
19
Analysis
  • Why the difference in results?
  • Algorithms have underlying similarities, but
    implementations differ.
  • Data structures and method of parallelisation. -
    Contiguity of memory regions. - Unity of
    per-processor workloads. - System page size vs.
    size of data structure.

20
Influence of Page Size
  • Page size relative to size of data structure.
  • Sequencing and contiguity of data - Sorted vs.
    unsorted tree. - Random distribution -> random
    neighbours. - Sequential distribution ->
    sequential neighbours.
  • Page represents minimum granularity of migration.
  • Reduced overheads from migrating data en masse.
  • Can data structures be transformed to become more
    amenable to page migration?

21
Generalised Scheme
  • Scheme is implemented entirely in user-level
    code.
  • Portable between applications, OSes and
    architectures.
  • Current implementations modify source-code.
  • Information production vs. consumption.
  • API could provide level of abstraction.
  • Expose application-specific information to any
    consumer.

22
Object-oriented Allocation
  • Managed Array object.
  • Single object represents multiple allocations. -
    Local buffer per node. - ...or delegate migration
    to the OS.
  • Function pointers assign operations to data.
  • Internal relocation guided by scheduling. -
    Transparent data transformation. - Suitable
    interval, nested loops.

23
Memory Management API
  • Linux Kernel module.
  • Receives application-specific info via /dev
    entry.
  • Exposes this via /proc filesystem.
  • Allows for separation of production and
    consumption of information.
  • Application information is generated near to its
    source, but consumed wherever memory can be
    managed most efficiently.

24
Current and Future Tasks
  • Collate and publish results.
  • Complete API and allocator.
  • Evaluate with LA application.
  • Other data points.
  • Detailed, quantitative analysis of results.