Profiling tools, by Vitaly Kroivets, for the Software Design Seminar. Contents: introduction; the software optimization process, optimization traps and pitfalls; benchmarks ...
Transcript and Presenter's Notes

Title: Different events collected


1
Different events collected (modules view)
System-wide look at the software running on the system
Our program
CPI (clocks per instruction): a good average indication of overall efficiency
2
Hotspot Graph
Click on a hotspot bar and VTune displays the source code view
Each bar represents one of the functions of our program
3
Source View
Test_if function
4
See how much time is spent on each line
Annotated Source View (of a module)
Check this for loop!
10% of CPU time is spent in a few statements
5
VTune Tuning Assistant
  • In a few clicks we reached the performance problem!
  • Now, how do we solve it?
  • The Tuning Assistant highlights performance problems
  • Provides the approximate time lost to each performance problem
  • Its database contains performance metrics based on Intel's
    experience of tuning hundreds of applications
  • Analyzes the data gathered from our application
  • Generates tuning recommendations for each hotspot
  • Gives the user an idea of what might be done to fix the problem

6
Tuning Assistant Report
7
Hotspot Assistant Report: Penalties
8
Hotspot Assistant Report
9
Call Graph Mode
  • Provides a pictorial view of program flow to quickly identify
    critical functions and call sequences
  • Call graph profiling reveals:
  • The structure of your program at the function level
  • The number of times a function is called from a particular
    location
  • The time spent in each function
  • Functions on the critical path

10
Call Graph Screenshot
The function summary pane
Critical path (displayed as red lines): the call sequence in the
application that took the most time to execute
Switch to Call List view
11
Call Graph (Cont.)
Additional info is available by hovering the mouse over the functions
Wait time: how much time was spent waiting for an event to occur
12
Jump to Source view
13
Call Graph Call List View
Caller Functions are the functions that called
the Focus Function
Callee Functions are the functions that are called by the Focus
Function
14
Counter Monitor
  • Use the Counter Monitor feature of the VTune analyzer to collect
    and display performance counter data. Counter Monitor selectively
    polls performance counters, which are grouped categorically into
    performance objects.
  • With the VTune analyzer, you can:
  • Monitor selected counters in performance objects
  • Correlate performance counter data with data collected by other
    features of the VTune analyzer, such as sampling
  • Trigger the collection of counter data on events other than a
    periodic timer

15
Counter Monitor
16
Getting Help
  • Context-sensitive help
  • Online Help repository

17
VTune Summary
  • Pros: allows you to get the best possible performance out of the
    Intel architecture
  • Cons: extreme tuning requires a deep understanding of processor
    and OS internals

18
Valgrind
  • Multi-purpose Linux x86 profiling tool

19
Valgrind Toolkit
  • Memcheck is a memory debugger
  • detects memory-management problems
  • Cachegrind is a cache profiler
  • performs a detailed simulation of the I1, D1 and L2 caches in
    your CPU
  • Massif is a heap profiler
  • performs detailed heap profiling by taking regular snapshots of
    a program's heap
  • Helgrind is a thread debugger
  • finds data races in multithreaded programs

20
Memcheck Features
  • When a program is run under Memcheck's
    supervision, all reads and writes of memory are
    checked, and calls to malloc/new/free/delete are
    intercepted
  • Memcheck can detect
  • Use of uninitialised memory
  • Reading/writing memory after it has been free'd
  • Reading/writing off the end of malloc'd blocks
  • Reading/writing inappropriate areas on the stack
  • Memory leaks -- where pointers to malloc'd blocks
    are lost forever
  • Passing of uninitialised and/or unaddressable memory to system
    calls
  • Mismatched use of malloc/new/new[] vs free/delete/delete[]
  • Overlapping src and dst pointers in memcpy() and
    related functions
  • Some misuses of the POSIX pthreads API

21
Memcheck Example
Access of unallocated memory
Use of a non-initialized value
Memory leak
Using free on memory allocated by new
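Roughly the same four errors in a minimal, self-contained C++ sketch (an illustration, not the code from the screenshot); running it under Memcheck reports each one:

// build: g++ -g bugs.cc -o bugs        (hypothetical file name)
// run:   valgrind --tool=memcheck --leak-check=yes ./bugs
#include <cstdlib>
#include <cstdio>

int main() {
    int* heap = (int*)malloc(4 * sizeof(int));
    printf("%d\n", heap[4]);        // access of unallocated memory (just past the block)

    int uninit;
    if (uninit > 0)                 // use of a non-initialized value
        printf("positive\n");

    int* leaked = new int[10];      // memory leak: the pointer is lost when main returns
    leaked[0] = 1;

    int* mixed = new int;           // allocated with new ...
    free(mixed);                    // ... released with free: mismatched allocator

    free(heap);
    return 0;
}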
22
Memcheck Example (Cont.)
  • Compile the program with the -g flag
  • g++ -g -c a.cc ; g++ -g -o a.out a.o
  • Execute valgrind
  • valgrind --tool=memcheck --leak-check=yes a.out > log
  • View the log

--leak-check=yes: debug leaks
a.out: the executable name
23
Memcheck report
24
Memcheck report (cont.): leaks detected
STACK
25
Cachegrind
  • Detailed cache profiling can be very useful for
    improving the performance of the program
  • On a modern x86 machine, an L1 miss will cost
    around 10 cycles, and an L2 miss can cost as much
    as 200 cycles
  • Cachegrind performs detailed simulation of the
    I1, D1 and L2 caches in your CPU
  • Can accurately pinpoint the sources of cache
    misses in your code
  • Identifies number of cache misses, memory
    references and instructions executed for each
    line of source code, with per-function,
    per-module and whole-program summaries
  • Cachegrind runs programs about 20--100x slower
    than normal
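As a rough illustration (not from the slides) of the kind of problem Cachegrind pinpoints: the inner loop below walks a row-major array column by column, so nearly every load misses D1 and L2, and cg_annotate attributes those misses to that single source line.

// build: g++ -g -O2 stride.cc -o stride      (hypothetical example)
// run:   valgrind --tool=cachegrind ./stride
#include <cstdio>

const int N = 1024;
static double a[N][N];

int main() {
    double sum = 0.0;
    // Column-major traversal of a row-major array: each access jumps
    // N * sizeof(double) bytes, so almost every load is a cache miss.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            sum += a[i][j];
    // Swapping the two loops (row-major order) removes most of the misses.
    printf("%f\n", sum);
    return 0;
}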

26
How to run
  • Put valgrind --tool=cachegrind in front of the normal
    command-line invocation
  • Example: valgrind --tool=cachegrind ls -l
  • When the program finishes, Cachegrind will print summary cache
    statistics. It also collects line-by-line information in a file
    cachegrind.out.<pid>
  • Execute cg_annotate to get an annotated source file
  • cg_annotate --7618 a.cc > a.cc.annotated

a.cc: the source file to annotate
--7618: the PID of the run
27
Cachegrind Summary output
28
Cachegrind Summary output
Data caches READ performance
D1 cache read misses
29
Cachegrind Summary output
Data caches WRITE performance
30
Cachegrind Accuracy
  • Valgrind's cache profiling has a number of
    shortcomings
  • It doesn't account for kernel activity -- the
    effect of system calls on the cache contents is
    ignored
  • It doesn't account for other process activity
    (although this is probably desirable when
    considering a single program)
  • It doesn't account for virtual-to-physical address mappings;
    hence the entire simulation is not a true representation of
    what's happening in the cache

31
Massif tool
  • Massif is a heap profiler: it measures how much heap memory
    programs use. It can give information about:
  • Heap blocks
  • Heap administration blocks
  • Stack sizes
  • Helps to reduce the amount of memory the program uses
  • smaller programs interact better with caches and avoid paging
  • Detects leaks that aren't detected by traditional leak-checkers
    such as Memcheck
  • That's because the memory isn't ever actually lost (a pointer to
    it remains), but it's not in use anymore; see the sketch below
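A small sketch (an assumed example, not from the slides) of such a "leak": every block stays reachable through a live pointer, so Memcheck reports nothing lost, yet Massif shows the heap climbing steadily.

// run: valgrind --tool=massif ./log_demo      (hypothetical program name)
#include <vector>
#include <string>
#include <cstdio>

// A log that only ever grows: every entry remains reachable, so a
// leak-checker sees no lost blocks, but the memory is never used again.
static std::vector<std::string> g_log;

void handle_request(int i) {
    g_log.push_back("request " + std::to_string(i));  // never trimmed or reused
}

int main() {
    for (int i = 0; i < 100000; ++i)
        handle_request(i);
    printf("handled %zu requests\n", g_log.size());
    return 0;
}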

32
Executing Massif
  • Run valgrind --tool=massif prog
  • Produces the following:
  • Summary
  • Graph picture
  • Report
  • The summary will look like this:
  • Total spacetime: 2,258,106 ms.B
  • Heap: 24.0%
  • Heap admin: 2.2%
  • Stack(s): 73.7%

Spacetime: space (in bytes) multiplied by time (in milliseconds)
Heap: number of words allocated on the heap, via malloc(), new and
new[]
33
Spacetime Graphs
34
Spacetime Graph (Cont.)
  • Each band represents a single line of source code
  • It's the height of a band that's important
  • Triangles on the x-axis show each point at which a memory census
    was taken
  • Not necessarily evenly spread: Massif only takes a census when
    memory is allocated or de-allocated
  • The time on the x-axis is wall-clock time
  • not ideal, because you can get different graphs for different
    executions of the same program due to random OS delays

35
Text/HTML Report example
  • Contains a lot of extra information about
    heap allocations that you don't see in the graph.

Shows the places in the program where the most memory was allocated
36
Valgrind how it works
  • Valgrind is compiled into a shared object, valgrind.so. The
    valgrind shell script sets the LD_PRELOAD environment variable to
    point to valgrind.so. This causes the .so to be loaded as an
    extra library into any subsequently executed dynamically-linked
    ELF binary
  • The dynamic linker allows each .so in the process image to have
    an initialization function which is run before main(). It also
    allows each .so to have a finalization function run after main()
    exits (see the sketch after this list)
  • When valgrind.so's initialization function is called by the
    dynamic linker, the synthetic CPU starts up. The real CPU remains
    locked inside valgrind.so until the end of the run
  • System calls are intercepted; signal handlers are monitored
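A stripped-down sketch of the load-time mechanism described above (my own illustration, not Valgrind's actual code): a preloaded shared object whose initialization function the dynamic linker runs before main() and whose finalization function runs after it exits.

// build: g++ -fPIC -shared hook.cc -o hook.so      (hypothetical file names)
// use:   LD_PRELOAD=./hook.so ./some_program
#include <cstdio>

// Run by the dynamic linker before the target program's main().
__attribute__((constructor))
static void hook_init() {
    fprintf(stderr, "hook.so: loaded before main()\n");
    // This is the point where Valgrind's real initializer starts the synthetic CPU.
}

// Run after main() returns (or exit() is called).
__attribute__((destructor))
static void hook_fini() {
    fprintf(stderr, "hook.so: finalizer after main()\n");
}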

37
Valgrind Summary
  • Valgrind will save hours of debugging time
  • Valgrind can help speed up your programs
  • Valgrind runs on x86-Linux
  • Valgrind works with programs written in any
    language
  • Valgrind is actively maintained
  • Valgrind can be used with other tools (gdb)
  • Valgrind is easy to use
  • uses dynamic binary translation, so there is no need to modify,
    recompile or re-link applications. Just prefix the command line
    with valgrind and everything works
  • Valgrind is not a toy
  • Used by large projects (25 million lines of code)
  • Valgrind is free

38
Other Tools
  • Tools not included in this presentation
  • IBM Purify
  • Parasoft Insure
  • KCachegrind
  • OProfile
  • GCC's and GLIBC's debugging hooks

39
Writing Fast Programs
  • Select the right algorithm
  • Implement it efficiently
  • Detect hotspots using a profiler and fix them
  • Understanding of the target system architecture, such as the
    cache structure, is often required
  • Use platform-specific compiler extensions: memory pre-fetching,
    cache-control instructions, branch prediction, SIMD instructions
    (see the sketch below)
  • Write multithreaded applications (Hyper-Threading Technology)
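As one concrete example of such extensions, GCC exposes prefetch and branch-prediction hints as builtins; a small sketch (an assumed example, not from the presentation):

#include <cstddef>

long sum_with_hints(const long* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Ask the hardware to start fetching data needed a few iterations ahead
        // (a real implementation would guard the index against running past the end).
        __builtin_prefetch(&data[i + 16], /*rw=*/0, /*locality=*/1);
        // Tell the compiler the negative case is rare, so the common path stays straight-line.
        if (__builtin_expect(data[i] < 0, 0))
            continue;
        sum += data[i];
    }
    return sum;
}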

40
CPU Architecture (Pentium 4)
Out-of-order Execution!
41
Instruction Execution
Diagram: the dispatch unit issues instructions from the instruction
pool to the execution units (two integer units, two floating-point
units, a memory load unit and a memory store unit)
42
Keeping CPU Busy
  • Processors are limited by data dependencies and the speed of
    instructions
  • Keep data dependencies low
  • A good blend of instructions keeps all execution units busy at
    the same time
  • Waiting for memory with nothing else to execute is the most
    common reason for slow applications
  • Goals: ready instructions, a good mix of instructions and
    predictable branches
  • Remove branches if possible (see the sketch below)
  • Reduce the randomness of branches; avoid function pointers and
    jump tables
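A tiny sketch of branch removal (illustrative, not from the slides): the comparison result is used as an arithmetic value, so the hot loop has no data-dependent conditional jump to mispredict.

// Branchy: the predictor must guess the comparison for every element.
int count_small_branchy(const int* v, int n, int limit) {
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (v[i] < limit)
            count++;
    return count;
}

// Branchless: the comparison yields 0 or 1 and is added directly,
// which compilers typically turn into straight-line code.
int count_small_branchless(const int* v, int n, int limit) {
    int count = 0;
    for (int i = 0; i < n; ++i)
        count += (v[i] < limit);
    return count;
}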

43
Memory Overview (Pentium 4)
  • L1 cache (data only): 8 KB
  • Execution Trace Cache that stores up to 12K decoded micro-ops
  • L2 Advanced Transfer Cache (data + instructions): 256 KB,
    3 times slower than L1
  • L3 cache: 4 MB (optional)
  • Main RAM (usually 64 MB to 4 GB), 10 times slower than L1