Performance Analysis and Optimization through Run-time Simulation and Statistics


1
Performance Analysis and Optimization through
Run-time Simulation and Statistics
  • Philip J. Mucci
  • University of Tennessee
  • mucci@cs.utk.edu
  • http://www.cs.utk.edu/mucci

2
Motivation
  • Tuning real DOD and DOE applications!
  • Performance on most codes is low.
  • Poor overall efficiency due to poor single node
    performance.
  • Codes show good scalability only because of the
    above, combined with fast interconnects.
  • The tuning expertise is not there among application
    developers, nor should it need to be.

3
Description
  • Use data available at run-time to improve
    compilation and optimization technology.
  • Empirically determine how well the code maps to
    the underlying architecture.
  • Bottlenecks can be identified and possibly
    corrected by an explicit set of rules and
    transformations.

4
Information not being used
  • Hardware statistics gathered through simulation
    or monitoring can identify the problem.
  • Cache and branching behavior
  • Cycle/Load/Store/FLOP counts
  • Bottleneck determination
  • Reference pattern
  • Dynamic memory placement

5
Problem Areas
  • Efficient use of the memory hierarchy
  • Register re-use
  • Aliasing (see the sketch after this list)
  • Inlining
  • Demotion
  • Algorithms (iterative vs. direct)
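
The aliasing item above is the kind of problem a compiler cannot fix
without help. A minimal sketch (my example, not the slides'; C99's
restrict qualifier is used for illustration): if the compiler must
assume out may overlap a, it has to reload a's elements around every
store instead of keeping them in registers.

    /* Hypothetical illustration of the aliasing bullet: without
     * "restrict", the compiler must assume out and a may overlap,
     * so it cannot keep a[i] values in registers across the stores.
     * C99's restrict qualifier promises no overlap. */
    void smooth(double *restrict out, const double *restrict a, int n)
    {
        for (int i = 1; i < n - 1; i++)
            /* with restrict, a[i-1], a[i], a[i+1] can be carried in
             * registers between iterations instead of being reloaded */
            out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
    }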

6
Solutions
  • Understanding (tutorials, reference material)
  • Tools
  • Preprocessors
  • Compilers
  • Manpower

7
Increasing Cache Performance
  • How do we make better use of the memory hierarchy?
  • For computer scientists, it's not that hard. We
    need the right tools.
  • How much can we automate?
  • Through available tools and source analysis we
    can usually narrow the problem down to the
    offending function.

8
Cache Simulation
  • Instrumentation of routines
  • Run of the executable
  • Analysis and correlation with source code!
  • Old idea, new implementation.
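
In miniature, the instrument-run-analyze cycle might look like the
following hedged sketch. This is not the actual tool: sim_ref,
LINE_SIZE, NUM_LINES, and the direct-mapped model are all assumptions.
The instrumenter inserts a call before each load and store, and the
simulator keeps statistics that can later be correlated with file and
line.

    /* A minimal sketch of the idea: a direct-mapped cache model
     * driven by hooks that a source-level instrumenter would insert
     * before each memory reference.  All names are hypothetical. */
    #include <stdio.h>
    #include <stdint.h>

    #define LINE_SIZE 32u               /* bytes per cache line */
    #define NUM_LINES 1024u             /* direct-mapped lines  */

    static uintptr_t tags[NUM_LINES];   /* resident line address per set */
    static unsigned long refs, misses;

    /* Called on every load/store; file/line permit correlating
     * misses back to the source code. */
    static void sim_ref(const void *addr, const char *file, int line)
    {
        uintptr_t lineaddr = (uintptr_t)addr / LINE_SIZE;
        unsigned set = (unsigned)(lineaddr % NUM_LINES);

        refs++;
        if (tags[set] != lineaddr) {    /* miss: line not resident */
            misses++;
            tags[set] = lineaddr;
            (void)file; (void)line;     /* a real tool would log these */
        }
    }

    int main(void)
    {
        static double x[4096], y[4096];

        /* What the instrumenter might emit for: y[i] = x[i] + 1.0; */
        for (int i = 0; i < 4096; i++) {
            sim_ref(&x[i], __FILE__, __LINE__);   /* load x[i]  */
            sim_ref(&y[i], __FILE__, __LINE__);   /* store y[i] */
            y[i] = x[i] + 1.0;
        }
        printf("refs=%lu misses=%lu (%.2f%% miss ratio)\n",
               refs, misses, 100.0 * misses / refs);
        return 0;
    }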

9
Cache Simulation
  • Hardware independence
  • Information on:
  • Locality
  • Placement
  • Reference pattern and Reuse
  • Line usage

10
Locality
  • Spatial and Temporal
  • misses/memory reference
  • misses/re-use
  • Conflict vs. Capacity

11
Placement
  • Padding can be very important.
  • Not always possible to determine during the
    static-analysis phase.
  • The reference pattern can affect the padding
    needed.
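
A classic illustration of why padding matters (my example, not the
slides'): two arrays that are each an exact multiple of a
direct-mapped cache's size map x[i] and y[i] to the same set, so they
evict each other on every iteration; a small pad between them removes
the conflict. CACHE_BYTES and the pad of 16 doubles are illustrative
values.

    /* Hypothetical padding example.  Without "pad", x[i] and y[i]
     * fall in the same direct-mapped set and evict each other on
     * every loop iteration; the pad shifts y's mapping.  The struct
     * fixes the arrays' relative layout in memory. */
    #include <stddef.h>

    #define CACHE_BYTES (32 * 1024)           /* illustrative cache size */
    #define N (CACHE_BYTES / sizeof(double))

    static struct {
        double x[N];
        double pad[16];    /* inter-array padding */
        double y[N];
    } a;

    double dot(void)
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            s += a.x[i] * a.y[i];
        return s;
    }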

12
Reference Pattern
  • Again, not always possible to do during static
    analysis.
  • Even harder to analyze when dealing with
    pseudo-optimized code.
  • Examples: stencils, sparse solvers, etc.

13
Reuse
  • Blocking is critical to applications where there
    is re-use.
  • We need to identify re-use potential, to spot the
    areas on which blocking and register allocation
    should be focused.
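
A standard blocking example (mine, not the slides'): tiling a matrix
multiply so a block of b stays resident in cache and a[i][k] is
carried in a register while they are hot. N and the tile edge B are
illustrative values; B would be tuned to the target cache.

    /* Hypothetical blocked (tiled) matrix multiply: the jj/kk tiles
     * keep a B x B block of b resident in cache, and r carries
     * a[i][k] in a register across the inner loop.  Assumes c is
     * zero-initialized by the caller and N is a multiple of B. */
    #define N 512
    #define B 32    /* tile edge; tuned to the cache in practice */

    void matmul_blocked(double c[N][N],
                        const double a[N][N], const double b[N][N])
    {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double r = a[i][k];        /* register re-use */
                        for (int j = jj; j < jj + B; j++)
                            c[i][j] += r * b[k][j];
                    }
    }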

14
Source Code Mapping
  • Most cache tools are hard to use and hard to
    relate back to the source code.
  • This tool simulates the cache(s) on each memory
    reference and thus can easily correlate the data
    with the source.
  • Instrumentation is at the source level, not in the
    object code.

15
Statistics
  • Global, per file, per statement, per reference
  • References, misses, cold misses, re-used
    references
  • Conflict/Re-use matrix
  • M(A,B) = x means that some element of A ejected
    some element of B from the cache x times, counted
    only when that element of A has been in the cache
    before.
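
A hedged sketch of how such a matrix might be maintained. The array
ids, the on_eviction hook, and the seen-before flag are all
assumptions about the implementation, which the slides do not show.

    /* Hypothetical conflict/re-use matrix update.  Each cache line
     * is attributed to the source array it holds (an integer id);
     * per the definition above, an ejection of B by A is counted
     * only when the incoming element of A was in the cache before. */
    #define NUM_ARRAYS 16
    static unsigned long M[NUM_ARRAYS][NUM_ARRAYS];

    static void on_eviction(int incoming_array, int evicted_array,
                            int incoming_seen_before)
    {
        if (incoming_seen_before)
            M[incoming_array][evicted_array]++;
    }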

16
Development status
  • GUI for selective instrumentation
  • Real parsers (F90, C, C++)
  • Better report generation

17
Implementation
  • Simulator written in C
  • Instrumentation in Perl
  • GUI in Java
  • Report generator in Perl

18
Relevance
  • Why shouldn't this technology be part of a
    feedback loop?
  • Compile with instrumentation
  • Run
  • Recompile with information from the run
  • Watch out for input-sensitivity issues.

19
Integration
  • Identifying and correcting poor cache behavior
    can be made explicit and part of a compiler.
    (Ideally a source-to-source transformer or
    preprocessor)
  • Simulator can stand alone for detailed analysis
    and optimization by CS folks.
  • Our knowledge and expertise made available
    through the tools.

20
Hardware Counters
  • Virtually every processor available has hardware
    counters
  • The interfaces and documentation are poor or
    non-existent.
  • Hardware differs greatly, as do the counters'
    semantics.
  • Useful for measurement, analysis, optimization,
    modeling and benchmarking.

21
Performance Data Standard
  • Standardize an API to obtain hardware performance
    counters
  • Standardize the definitions of what those
    counters mean
  • API is lightweight and portable

22
Performance Data Standard
  • Target platforms:
  • R10K, R12K
  • P2SC, Power PC 604e, Power 3
  • Sun Ultra 2/3
  • Intel PII, Katmai, Merced
  • Alpha 21164, 21264

23
Performance Data Standard
  • Motivation:
  • Portable performance tools
  • Optimization through feedback
  • Developers wanting simple and accurate timing and
    statistics
  • Modeling, evaluation

24
Performance Data Standard
  • A small number of useful measurement points:
  • Timing: cycles, microseconds
  • I/D cache misses, invalidations
  • Branch mispredictions
  • Load, store, FLOP, and instruction counts
  • I/D TLB misses

25
Performance Data Standard API
  • Efficient counter multiplexing
  • Thread safety
  • Functions for:
  • start, stop, reset, get, accumulate, query,
    control
  • Use the best available vendor-supported interface
    or API
  • Possible pairing with DAIS, Dyninst for naming
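
This standardization effort is what later became PAPI. As a hedged
sketch of the start/stop/read style listed above, here is a fragment
written against the high-level calls of the early PAPI releases;
these names postdate the talk and are not fixed by the slides.

    /* Counting cycles and L1 data-cache misses with the high-level
     * interface of early PAPI releases (the eventual product of this
     * API effort); the slides themselves predate the specification. */
    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int events[2] = { PAPI_TOT_CYC, PAPI_L1_DCM };
        long long counts[2];

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        /* ... region of code being measured ... */

        if (PAPI_stop_counters(counts, 2) != PAPI_OK)  /* stop and read */
            return 1;

        printf("cycles %lld, L1 D-cache misses %lld\n",
               counts[0], counts[1]);
        return 0;
    }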

26
Development status
  • Research on the various machines' available
    hardware and interfaces
  • Compilation of findings, web page, and mailing
    list
  • API specification to appear in mid-August for
    discussion
  • Vendors are lurking
  • http://www.cs.utk.edu/mucci/pdsa

27
Deliverables
  • API for O2K, T3E, SP
  • Portable prof implementation

28
People
  • Shirley Browne (UT)
  • Jeff Brown (LANL)
  • Jeff Durachta (IBM, LANL)
  • Christopher Kerr (IBM, LANL)
  • George Ho (UT)
  • Kevin London (UT)
  • Philip Mucci (UT, Sandia)

29
Rice/UTK Collaboration
  (Diagram: DOD and DOE applications drive the collaboration; Rice
  contributes optimization technology, UT low-level support and tools.)
30
Deliverables
  • F90 and C preprocessor with feedback
  • Cache tool
  • Analysis and optimization of poorly performing
    codes
  • Performance API