Dissecting On-node Memory Performance - PowerPoint PPT Presentation

Description:

Title: PowerPoint Presentation Author: Gyllenhaal, John C. Last modified by: Todd Gamblin Created Date: 1/1/1601 12:00:00 AM Document presentation format – PowerPoint PPT presentation

Learn more at: https://www.paradyn.org
1
Dissecting On-node Memory Performance with
MemAxes
Petascale Tools Workshop 2014
Madison, WI, August 4-7, 2014
Alfredo Gimenez, Todd Gamblin, Martin Schulz,
Peer-Timo Bremer, Barry Rountree, Abhinav
Bhatele, Ilir Jusufi, and Bernd Hamann
LLNL
UC Davis
2
Memory Access Sampling
  • Recent hardware additions allow us to precisely
    sample events, including memory accesses
      • Intel PEBS, AMD IBS
  • Memory access samples contain:
      • The instruction pointer
      • The address accessed
      • How many core clock cycles elapsed during the
        access
      • Where in the memory hierarchy the address was
        resolved (e.g. L1 cache, Local RAM, Remote RAM)
  • We need a way to meaningfully interpret these
    samples
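
As a rough illustration of the sample fields listed above, the following Python sketch models one memory access sample and sums access cycles per memory-hierarchy level. The class name `MemSample`, its field names, and all sample values are hypothetical; they are not the PEBS or IBS record layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemSample:
    """One sampled memory access (hypothetical field names)."""
    ip: int        # instruction pointer of the sampled instruction
    address: int   # data address that was accessed
    cycles: int    # core clock cycles elapsed during the access
    source: str    # where the access was resolved, e.g. "L1", "Local RAM"


def cycles_by_source(samples):
    """Sum access cycles for each memory-hierarchy level."""
    totals = defaultdict(int)
    for s in samples:
        totals[s.source] += s.cycles
    return dict(totals)


# Made-up samples, for illustration only.
samples = [
    MemSample(ip=0x400A10, address=0x7F000000, cycles=4, source="L1"),
    MemSample(ip=0x400A10, address=0x7F001000, cycles=210, source="Local RAM"),
    MemSample(ip=0x400B24, address=0x7F800040, cycles=350, source="Remote RAM"),
    MemSample(ip=0x400B24, address=0x7F000008, cycles=5, source="L1"),
]
totals = cycles_by_source(samples)
print(totals)
```

A per-level total like this says where the cycles went but not why, which is the gap that the added context is meant to fill.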

3
Adding Context
  • Can better understand memory references with
    appropriate context
  • Contexts include:
      • The code
      • The node hardware topology
      • Calling context (call path)
      • The application (e.g. fluid dynamics)
  • Other work by Liu and Mellor-Crummey has looked at
    mapping memory access latency to particular
    variables, call paths, and access patterns.

4
We can already get coarse-grained application
context for some codes
  • Physics data is available in data structures
  • Time steps are easy to mark in the code
  • Per-process performance is easy to get
      • Just turn on counters at the beginning of the run
      • Read them periodically
  • What if we want finer-grained attribution?
      • How to tie measurements to data structures?
      • How to slice and dice the data?

[Figure: FLOP/s per MPI process (Aluminum)]
5
Node topology is easy to get, but not shown
clearly.
  • PEBS provides metadata for node topology
  • Want to highlight connections clearly to show:
      • Load distribution
      • Bandwidth
      • Resource contention
  • Existing visualization from hwloc (right):
      • Does not scale
      • Clutters connections between components

6
We have developed a measurement tool for
collecting detailed context
  • Use PEBS sampling for hardware information
  • Supplement with application instrumentation for
    mapping addresses to physical coordinates

SMT (Semantic Memory Tree): data structure used
to map sampled instruction operands to callbacks
7
Currently the developer has to instrument the
application manually
  • Add calls to get metadata for allocated objects:
      • Label string
      • Start and end addresses
      • Size of each element
      • Number of elements
      • Callback to map address to physical coordinates
  • Metadata must be provided by the programmer
  • Could easily be implemented in libraries
  • Lots of common mesh libraries would be
    interesting for this.
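
The metadata listed above could be bundled roughly as in the following Python sketch. This is not the tool's actual instrumentation API: the class `DataObject`, its methods, and the 4 x 3 row-major grid in the example are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class DataObject:
    """Hypothetical metadata record for one allocated array."""
    label: str                                    # label string
    start: int                                    # start address
    elem_size: int                                # size of each element (bytes)
    num_elems: int                                # number of elements
    to_coords: Callable[[int], Tuple[int, ...]]   # element index -> coordinates

    @property
    def end(self) -> int:
        """One past the last byte of the object."""
        return self.start + self.elem_size * self.num_elems

    def contains(self, address: int) -> bool:
        return self.start <= address < self.end

    def coords_of(self, address: int) -> Tuple[int, ...]:
        """Map a sampled address to physical coordinates via the callback."""
        if not self.contains(address):
            raise ValueError("address outside object")
        index = (address - self.start) // self.elem_size
        return self.to_coords(index)


# Assumed example: a 4 x 3 grid of 8-byte values stored row-major.
NX = 4
pressure = DataObject(
    label="pressure",
    start=0x1000,
    elem_size=8,
    num_elems=12,
    to_coords=lambda i: (i % NX, i // NX),
)

# An address in the middle of element 6 maps to grid cell (2, 1).
print(pressure.coords_of(0x1000 + 8 * 6 + 3))
```

Keeping the coordinate mapping as a callback leaves the layout knowledge with the application or mesh library, which matches the point that libraries could supply this metadata.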

8
Instrumentation
  • Specify DataObjects

Add additional semantic attributes and define
attribution function (optional)
9
Semantic Memory Tree
[Figure: Semantic Memory Range Tree instrumentation. A binary search tree over address ranges (root 0x0F to 0xF6; children 0x0F to 0x80 and 0xA2 to 0xF6; leaves 0x0F to 0x20, 0x40 to 0x80, 0xA2 to 0xC2, 0xE0 to 0xF6) indexes the data buffers Velocity, Pressure, Temp, and Density. Sampled addresses are mapped to the application domain, and performance data is recorded in the application domain.]
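
The address-to-buffer lookup on this slide can be approximated with a sorted list and binary search, as in the Python sketch below. It swaps the slide's binary search tree for the standard `bisect` module, treats ranges as half-open, and pairs buffer labels with address ranges arbitrarily, since the transcript does not preserve which buffer owns which range. `RangeIndex` is a hypothetical name.

```python
import bisect


class RangeIndex:
    """Map an address to the labeled range containing it.

    Ranges are (start, end, label), half-open [start, end), and are
    assumed not to overlap.
    """

    def __init__(self, ranges):
        self._ranges = sorted(ranges)
        self._starts = [r[0] for r in self._ranges]

    def lookup(self, address):
        # Find the last range whose start is <= address.
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        start, end, label = self._ranges[i]
        return label if address < end else None


# Leaf ranges taken from the slide; the label pairing is an assumption.
index = RangeIndex([
    (0x0F, 0x20, "Temp"),
    (0x40, 0x80, "Velocity"),
    (0xA2, 0xC2, "Pressure"),
    (0xE0, 0xF6, "Density"),
])

# Record latency per buffer for a few made-up (address, cycles) samples.
latency = {}
for address, cycles in [(0x10, 5), (0x45, 200), (0x30, 7), (0xE0, 90)]:
    label = index.lookup(address)
    if label is not None:
        latency[label] = latency.get(label, 0) + cycles
print(latency)
```

Addresses that fall between registered buffers, such as 0x30 here, are simply left unattributed.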
10
Lagrangian Hydrodynamics: LULESH
[Figure panels: 2D; 3D; 3D with mapped performance data]
11
We have developed MemAxes, a tool for analyzing
on-node memory performance
  • Measurement component samples memory instructions
  • We map latency information onto A) source code,
    B) node topology
  • C) Pie chart shows percent of total latency
    selected
  • D) Parallel coordinates view allows exploration
    of correlations
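
The "percent of total latency selected" figure can be computed as in the Python sketch below, assuming a selection is a set of per-axis value ranges as in parallel-coordinates brushing. The sample fields, the function names, and the data are hypothetical.

```python
def is_selected(sample, brush):
    """True if the sample lies inside every brushed axis range (inclusive)."""
    return all(lo <= sample[axis] <= hi for axis, (lo, hi) in brush.items())


def percent_latency_selected(samples, brush):
    """Share of total access cycles carried by the selected samples."""
    total = sum(s["cycles"] for s in samples)
    if total == 0:
        return 0.0
    chosen = sum(s["cycles"] for s in samples if is_selected(s, brush))
    return 100.0 * chosen / total


# Made-up samples: core id, array index, and access cycles.
samples = [
    {"core": 0, "index": 10, "cycles": 100},
    {"core": 2, "index": 20, "cycles": 300},
    {"core": 2, "index": 500, "cycles": 50},
    {"core": 1, "index": 30, "cycles": 50},
]

print(percent_latency_selected(samples, {"core": (2, 2)}))
print(percent_latency_selected(samples, {"core": (2, 2), "index": (0, 100)}))
```

Brushing a second axis can only narrow the selection, so the reported share never grows as more axes are constrained.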

12
Linked views clearly show on-node locality
problems
  • Parallel coordinates view shows correlation
    between array index and core id in LULESH
  • Linked node topology view shows data motion for
    highlighted memory operations
  • A contiguous chunk of an array is initially split
    between threads on four cores
  • Using an optimized affinity scheme, we improve
    locality
  • Performance improved by 10%
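
One simple way to spot the kind of locality problem described above is to count how many distinct cores touch each contiguous chunk of an array. The Python sketch below does this for made-up (array index, core id) samples. It is not the analysis MemAxes performs, and the two access patterns are invented for contrast.

```python
from collections import defaultdict


def cores_per_chunk(samples, chunk_size):
    """Return {chunk number: sorted list of core ids that accessed it}."""
    cores = defaultdict(set)
    for index, core in samples:
        cores[index // chunk_size].add(core)
    return {chunk: sorted(ids) for chunk, ids in sorted(cores.items())}


# Hypothetical pattern: neighboring elements are handled by different cores.
interleaved = [(i, i % 4) for i in range(8)]
# Hypothetical pattern: each core owns one contiguous block of elements.
blocked = [(i, i // 4) for i in range(8)]

print(cores_per_chunk(interleaved, chunk_size=4))
print(cores_per_chunk(blocked, chunk_size=4))
```

In the interleaved pattern every chunk is shared by four cores, while in the blocked pattern each chunk stays on a single core.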

[Figure panels: default thread affinity with poor locality; optimized thread affinity with good locality. Logo: PIPER]
13
Hyperion Thread/Core Binding
Improved cache usage: 44% fewer access cycles, 10%
total speedup
14
Future work
  • Back-port perf_events API to production TOSS 2
    kernel
      • Currently unable to do fine-grained memory
        sampling on production machines due to PMU access
        limits
      • Affects some Intel thread tools as well
  • More detailed architecture mapping
      • Sandy Bridge LLC ring interconnect information?
      • Other node architecture features?
  • Instrument AMR libraries for proper context
    attribution
      • Study per-patch memory behavior
      • Study blocking behavior of solvers
  • How to query large instruction traces effectively?