TAU: Recent Advances

About This Presentation

Title:

TAU: Recent Advances

Description:

TAU: Recent Advances. KTAU: Kernel-Level Measurement for Integrated Parallel Performance Views ... Ktau-Off: Kernel patched with Ktau and instrumentations compiled-in. ... – PowerPoint PPT presentation

Number of Views:41

Avg rating:3.0/5.0

Slides: 45

Provided by: aroonn

Learn more at: http://www.cs.uoregon.edu

Category:

more less

Transcript and Presenter's Notes

Title: TAU: Recent Advances

1
TAU Recent Advances

KTAU Kernel-Level Measurement for Integrated
Parallel Performance Views
TAUg Runtime Global Performance Data Access
Using MPI
Aroon Nataraj
Performance Research Lab
University of Oregon

2
KTAU Outline

Introduction
Motivations
Objectives
Architecture / Implementation Choices
Experimentation the performance views
Perturbation Study
ZeptoOS KTAU on Blue Gene / L
Future work and directions
Acknowledgements

3
Introduction ZeptoOS and TAU

DOE OS/RTS for Extreme Scale Scientific
Computation(Fastos)
Conduct OS research to provide effective
OS/Runtime for petascale systems
ZeptoOS (under Fastos)
Scalable components for petascale architectures
Joint project Argonne National Lab and University
of Oregon
ANL Putting light-weight kernel (based on Linux)
on BG/L and other platforms (XT3)
University of Oregon
Kernel performance monitoring, tuning
KTAU
Integration of TAU infrastructure in Linux Kernel
Integration with ZeptoOS, installation on BG/L
Port to 32-bit and 64-bit Linux platforms

4
KTAU Motivation

Application Performance
user-level execution performance
OS-level operations performance
Domains Time and Hardware Perf. Metrics
PAPI (Performance Application Programming
Interface)
Exposes virtualized hardware counters
TAU (Tuning and Analysis Utility)
Measures a lot of the interesting user-level
entities parallel application, MPI, libraries
Time domain
Uses PAPI to correlate counter information to
source

5
KTAU Motivation

As HPC systems continue to scale to larger
processor counts
Application performance more sensitive
New OS factors become performance bottlenecks
(E.g. Petrini03, Jones03, other works)
Isolating these system-level issues as
bottlenecks is non-trivial
from Petrini03
Comprehensive performance understanding
Observation of all performance factors
Relative contributions and interrelationship can
we correlate?

6
KTAU Motivation continuedProgram - OS
Interactions

Program OS Interactions - Direct vs. Indirect
Entry Points
Direct - Applications invoke the OS for certain
services
Syscalls (and internal OS routines called
directly from syscalls)
Indirect - OS takes actions without explicit
invocation by application
Preemptive Scheduling
(HW) Interrupt handling
OS-background activity (keeping track of time and
timers, bottom-half handling, etc)
Indirect interactions can occur at any OS entry
(not just when entering through Syscalls)
Direct Interactions easier to handle
Synchronous with user-code and in process-context
Indirect Interactions more difficult to handle
Usually asynchronous and in interrupt-context
Hard to measure and harder to correlate/integrate
with app. Measurements
But can argue Indirect interactions may be
unrelated to task? Why measure?

7
KTAU Motivation continuedKernel-wide vs.
Process-centric

Kernel-wide - Aggregate kernel activity of all
active processes in system
Understand overall OS behavior, identify and
remove kernel hot spots.
Cannot show what parts of app. spend time in OS
and why
Process-centric perspective - OS performance
within context of a specific applications
execution
Virtualization and Mapping performance to process
Interactions between programs, daemons, and
system services
Tune OS for specific workload or tune application
to better conform to OS config.
Expose real source of performance problems (in
the OS or the application)

8
KTAU Motivation continuedExisting Approaches

User-space Only measurement tools
Many tools only work at user-level and cannot
observe system-level performance influences
Kernel-level Only measurement tools
Most only provide the kernel-wide perspective
lack proper mapping/virtualization
Some provide process-centric views but cannot
integrate OS and user-level measurements
Combined or Integrated User/Kernel Measurement
Tools
A few powerful tools allow fine-grained
measurement and correlation of kernel and
user-level performance
Typically these focus only on Direct OS
interactions. Indirect interactions not merged.
Using Combinations of above tools
Without better integration, does not allow
fine-grained correlation between OS and App.
Many kernel tools do not explicitly recognize
Parallel workloads (e.g. MPI ranks)
Need an integrated approach to parallel perf.
observation, analyses

9
KTAU High-Level Objectives

Support low-overhead OS performance measurement
at multiple levels of function and detail
Provide both kernel-wide and process-centric
perspectives of OS performance
Merge user-level and kernel-level performance
information across all program-OS interactions
Provide online information and the ability to
function without a daemon where possible
Support both profiling and tracing for
kernel-wide and process-centric views in parallel
systems
Leverage existing parallel performance analysis
tools
Support for observing, collecting and analyzing
parallel data

10
KTAU Outline

Introduction
Motivations
Objectives
Architecture / Implementation Choices
Experimentation the performance views
Perturbation Study
ZeptoOS KTAU on Blue Gene / L
Future work and directions
Acknowledgements

11
KTAU Architecture
12
KTAU Arch. / Impl. Choices

Instrumentation
Static Source instrumentation
Macro Map-ID Map block of code and
process-context to unique index (dense id-space)
easy array lookup.
Macro Start, Stop provide the mapping index and
process-context is implicit
Measurement
Differentiate between local/self and
inter-context access. HPC codes primarily use
self.
Store performance data in PCB (task_struct)
Integrating Kernel/User Performance state
Dont assume synchronous kernel-entry or
process-context
Have to use memory mapping between kernel and
appl. State
Pinning shared state in memory
Kernel Call Groups program-OS interactions
summary
Analyses and Visualization Use TAU facilities

13
KTAU Controlled Experiments

Controlled Experiments
Exercise kernel in controlled fashion
Check if KTAU produces the expected correct and
meaningful views
Test machines
Neutron 4-CPU Intel P3 Xeon 550MHz, 1GB RAM,
Linux 2.6.14.3(ktau)
Neuronic 16-node 2-CPU Intel P4 Xeon 2.8GHz, 2GB
RAM/node, Redhat Enterprise Linux 2.4(ktau)
Benchmarks
NPB LU application NPB
Simulated computational fluid dynamics (CFD)
application. A regular-sparse, block lower and
upper triangular system solution.
LMBENCH LMBENCH
Suite of micro-benchmarks exercising Linux kernel
A few others not shown (e.g. SKAMPI)

14
KTAU Controlled Examples continued
Profiling
15
KTAU Controlled Examples continuedTracing
Fine-grained Tracing Shows detail inside
interrupts and bottom halves
Using VAMPIR Trace Visualization VAMPIR
16
KTAU Larger-Scale Runs

Run parallel benchmarks on larger-scale (128
dual-cpu nodes)
Identify (and remove) system-level performance
issues
Understand perturbation overheads introduced by
KTAU
NPB benchmark LU Application NPB
Simulated computational fluid dynamics (CFD)
application. A regular-sparse, block lower and
upper triangular system solution.
ASC benchmark Sweep3D Sweep3d
Solves a 3-D, time-independent, neutron particle
transport equation on an orthogonal mesh.
Test machine Chiba-City Linux cluster (ANL)
128 dual-CPU Pentium III, 450MHz, 512MB RAM/node,
Linux 2.6.14.2 (ktau) kernel, connected by
Ethernet

17
KTAU Larger-Scale Runs

By chance experienced problems on Chiba
Initially ran NPB-LU and Sweep3D codes on 128x1
configuration
Then ran on 64x2 configuration
Extreme performance hit (72 slower!) with the
64x2 runs
Used KTAU views to identify and solve issues
iteratively
Eventually brought performance gap to 13 for LU
and 9 for Sweep.

18
KTAU Larger-scale Runs
User-level MPI_Recv
MPI_Recv OS Interactions
Two ranks - relatively very low MPI_Recv() time.
Two ranks - MPI_Recv() diff. from Mean in
OS-SCHED.
19
KTAU Larger-scale Runs
Voluntary Scheduling
Preemptive Scheduling
Note x-axis log scale
Two ranks have very low voluntary scheduling
durations.
(Same) Two ranks have very large preemptive
scheduling.
20
KTAU Larger-scale Runs
ccn10 Node-level View
Interrupt Activity
NPB LU processes PID4066, PID4068 active. No
other significant activity! Why the Pre-emption?
64x2 Pinned Interrupt Activity Bimodal across
MPI ranks.
21
KTAU Larger-scale Runs
Use Merged performance data to identify
imbalance.Why does purely compute bound region
have lots of I/O?
TCP within Compute Time
TCP within Compute Calls
100 More background OS-TCP activity in Compute
phase. More imbalance!
22
KTAU Larger-scale Runs
Cost / Call of OS-level TCP
OS-TCP in SMP Costlier

IRQ-Balancing blindly distributes interrupts and
bottom-halves.
E.g. Handling TCP related BH in CPU-0 for
LU-process on CPU-1
Cache issues! COMSWARE

23
KTAU Perturbation Study

Five different Configurations
Base Vanilla kernel, un-instrumented benchmark
Ktau-Off Kernel patched with Ktau and
instrumentations compiled-in. But all
instrumentations turned Off (boot-time control)
Prof-All All kernel instrumentations turned On.
Prof-Sched Only scheduler subssystems
instrumentations turned on
Prof-AllTAU ProfAll, but also with user-level
Tau instrumentation enabled
NPB LU application benchmark
16 nodes, 5 different configurations, Mean over 5
runs each
ASC Sweep3D
128 nodes, Base and Prof-AllTAU, Mean over 5
runs each.
Test machine Chiba-City ANL

24
KTAU Perturbation Study
Sweep3d on 128 Nodes
Base ProfAllTAU Elapsed Time 368.25
369.9 Avg Slow.
0.49
Complete Integrated Profiling Cost under 3 on
Avg. and as low as 1.58.
Disabled probe effect.
Single instrumentation very cheap. E.g.
Scheduling.
25
KTAU Outline

Introduction
Motivations
Objectives
Architecture / Implementation Choices
Experimentation the performance views
Perturbation Study
ZeptoOS KTAU on Blue Gene / L
Future work and directions
Acknowledgements

26
ZeptoOS KTAU On Blue Gene / L (BG/L)

I/O Node
Open source modified Linux Kernel (2.4, 2.6) -
ZeptoOS
Control I/O Daemon (CIOD) handles I/O syscalls
from Compute nodes in pset.
Compute Node
IBM proprietary (closed-source) light-weight
kernel
No scheduling or virtual memory support
Forwards I/O syscalls to CIOD on I/O node
KTAU on I/O Node
Integrated into ZeptoOS config and build system.
Require KTAU-D (daemon) as CIOD is closed-source.
KTAU-D periodically monitors sys-wide or
individual process
Visualization of trace/profile of ZeptoOS, CIOD
using Paraprof, Vampir/Jumpshot.

27
KTAU On BG/L
28
KTAU On Bg/L continuedEarly Experiences
CIOD Kernel Trace zoomed-in (running iotest
benchmark)
29
KTAU On Bg/L continuedEarly Experiences
30
KTAU On Bg/L continuedEarly Experiences
Correlating CIOD and RPC-IOD Activity
31
KTAU Future Work

Dynamic measurement control - enable/disable
events w/o recompilation or reboot
Improve performance data sources that KTAU can
access - E.g. PAPI
Improve integration with TAUs user-space
capabilities to provide even better correlation
of user and kernel performance information
full callpaths,
phase-based profiling,
merged user/kernel traces
Integration of Tau, Ktau with Supermon (possibly
MRNet?), TAUg (next)
Porting efforts IA-64, PPC-64 and AMD Opteron
ZeptoOS Planned characterization efforts
BGL I/O node
Dynamically adaptive kernels

32
TAUg Outline

Overview
Motivation
Design
Programming Interface
Experimentation
Overheads

33
TAUg Motivation

While an application is running, there exists a
virtual global performance state
All events, profiled on all processes and threads
Need runtime, application-level access to the
state
Load balancing
CQoS Computational Quality of Service
Other adaptive runtime behavior
Need scalable solution
Many large applications already use MPI

34
TAUg Overview

TAU generates and provides access to the local
performance state
MPI provides scalable communication
infrastructure to promote the local states to the
global state
TAUg (global) performance view
Subset of events in the local performance state
TAUg (global) performance communicator
Subset of MPI processes in the application
Querying the view provides selective access to
the global performance state

35
TAUg Design
36
TAUg Programming Interface

TAU_REGISTER_VIEW()
Selects a subset of events
TAU_REGISTER_COMMUNICATOR()
Selects a subset of processes
TAU_GET_VIEW()
Input view ID, communicator ID, exchange type
(all-to-all, one-to-all, all-to-one), source/sink
process rank (ignored for all-to-all
communication)
Output vector of performance data
Uses scalable MPI collectives to exchange data

37
TAUg experiments

Simulation application
Demonstrates functionality of TAUg in a simulated
heterogeneous cluster to provide access to global
performance view for load balancing
ASC benchmark sPPM
Solves a 3D gas dynamics problem on a uniform
Cartesian mesh using a simplified version of the
PPM (Piecewise Parabolic Method) code.
ASC benchmark Sweep3D
Solves a 3-D, time-independent, neutron particle
transport equation on an orthogonal mesh.
Test machines MCR, ALC (LLNL)

38
TAUg Overhead Simulation
Less than 0.1 overhead, one event, all
processes, 10 timesteps in 62.4 seconds for 128
CPUs, weak scaling
Load-balancing gave 28 speedup
39
TAUg Overhead sPPM
Less than 0.1 overhead, all events, all
processes, 20 timesteps in 120 seconds for 64
CPUs, weak scaling
40
TAUg Overhead Sweep3D
Less than 1.3 overhead, one event, all
processes, 200 timesteps in 250 seconds for 512
CPUs, strong scaling
41
Support Acknowledgements

Department of Energys Office of Science
(contract no. DE-FG02-05ER25663) and
National Science Foundation (grant no. NSF CCF
0444475)

42
References

petrini03F. Petrini, D. J. Kerbyson, and S.
Pakin, The case of the missing supercomputer
performance Achieving optimal performance on the
8,192 processors of asci q, in SC 03
jones03 T. Jones and et al., Improving the
scalability of parallel jobs by adding parallel
awareness to the operating system, in SC 03
PAPI S. Browne et al., A Portable Programming
Interface for Performance Evaluation on Modern
Processors. The International Journal of High
Performance Computing Applications,
14(3)189--204, Fall 2000.
VAMPIR W. E. Nagel et. al., VAMPIR
Visualization and analysis of MPI resources,
Supercomputer, vol. 12, no. 1, pp. 6980, 1996.
ZeptoOS ZeptoOS The small linux for big
computers, http//www.mcs.anl.gov/zeptoos/
NPB D.H. Bailey et. al., The nas parallel
benchmarks, The International Journal of
Supercomputer Applications, vol. 5, no. 3, pp.
6373, Fall 1991.

43
References

Sweep3d A. Hoise et. al., A general
predictive performance model for wavefront
algorithms on clusters of SMPs, in International
Conference on Parallel Processing, 2000
LMBENCH L. W. McVoy and C. Staelin, lmbench
Portable tools for performance analysis, in
USENIX Annual Technical Conference, 1996, pp.
279294
TAU TAU Tuning and Analysis Utilities,
http//www.cs.uoregon.edu/research/paracomp/tau/
KTAU-BGL A. Nataraj, A. Malony, A. Morris, and
S. Shende, Early experiences with ktau on the
ibm bg/l, in EuroPar06, European Conference on
Parallel Processing, 2006.
KTAU A. Nataraj et al., Kernel-Level
Measurement for Integrated Parallel Performance
Views the KTAU Project (under submission)

44
Team

Aroon Nataraj, PhD Student KTAU
Kevin Huck, PhD Student - TAUg
Prof. Allen D Malony
Dr. Sameer Shende, Senior Scientist
Alan Morris, Senior Software Engineer
Suravee Suthikulpanit , MS Student (Graduated) -
KTAU

Write a Comment

User Comments (0)

About PowerShow.com

TAU: Recent Advances - PowerPoint PPT Presentation

TAU: Recent Advances

TAU: Recent Advances. KTAU: Kernel-Level Measurement for Integrated Parallel Performance Views ... Ktau-Off: Kernel patched with Ktau and instrumentations compiled-in. ... – PowerPoint PPT presentation