Different events collected - PowerPoint PPT Presentation

About This Presentation

Description:

Profiling tools. By Vitaly Kroivets for Software Design Seminar. Contents: Introduction; Software optimization process, optimization traps and pitfalls; Benchmark ...

Slides: 44

Provided by: vastUccs

Learn more at: https://vast.uccs.edu

Transcript and Presenter's Notes

1
Different events collected modules view
System-wide look at software running on the
system
Our program
CPI (cycles per instruction): a good average indicator
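CPI is the number of clock cycles divided by the number of instructions retired. A minimal sketch of that arithmetic follows; the function name is hypothetical and the counts in the usage note are made up for illustration, not taken from the slide.

```cpp
#include <cstdint>

// Cycles per instruction: clock cycles consumed divided by instructions retired.
// Returns 0.0 when no instructions were counted, to avoid dividing by zero.
double cycles_per_instruction(std::uint64_t cycles, std::uint64_t instructions) {
    if (instructions == 0) return 0.0;
    return static_cast<double>(cycles) / static_cast<double>(instructions);
}
```

For instance, 3000 cycles over 1500 instructions gives a CPI of 2.0. A module whose CPI is much higher than the rest of the system is usually a reasonable place to start looking.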
2
Hotspot Graph
Click on a hotspot bar and VTune displays the source code view
Each bar represents one of the functions of our program
3
Source View
Test_if function
4
See how much time is spent on each line
Annotated Source View (of module)
Check this for loop!
10% of CPU time spent in a few statements
5
VTune Tuning assistant

In a few clicks we reached the performance problem!
Now, how do we solve it?
The Tuning Assistant highlights performance problems
Provides the approximate time lost to each performance problem
Its database contains performance metrics based on Intel's experience of tuning hundreds of applications
Analyzes the data gathered from our application
Generates tuning recommendations for each hotspot
Gives the user an idea of what might be done to fix the problem

6
Tuning Assistance Report
7
Hotspot Assistant Report Penalties
8
Hotspot Assistant Report
9
Call Graph Mode

Provides a pictorial view of program flow to quickly identify critical functions and call sequences
Call graph profiling reveals:
The structure of your program at the function level
The number of times a function is called from a particular location
The time spent in each function
The functions on a critical path

10
Call Graph Screenshot
the function summary pane
Critical path, displayed as red lines: the call sequence in an application that took the most time to execute
Switch to Call-list View
11
Call Graph (Cont.)
Additional info is available by hovering the mouse over the functions
Wait time: how much time was spent waiting for an event to occur
12
Jump to Source view
13
Call Graph Call List View
Caller Functions are the functions that called the Focus Function
Callee Functions are the functions that are called by the Focus Function
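The caller/callee relationship can be sketched as two lookups over a list of recorded call edges. This is only an illustration of the idea: the `CallEdge` type and both function names are hypothetical and have nothing to do with VTune's internal data model.

```cpp
#include <string>
#include <utility>
#include <vector>

// One recorded call edge: first = caller, second = callee.
using CallEdge = std::pair<std::string, std::string>;

// Functions that called `focus` (its callers), in recording order.
std::vector<std::string> callers_of(const std::vector<CallEdge>& edges,
                                    const std::string& focus) {
    std::vector<std::string> result;
    for (const auto& edge : edges) {
        if (edge.second == focus) result.push_back(edge.first);
    }
    return result;
}

// Functions that `focus` called (its callees), in recording order.
std::vector<std::string> callees_of(const std::vector<CallEdge>& edges,
                                    const std::string& focus) {
    std::vector<std::string> result;
    for (const auto& edge : edges) {
        if (edge.first == focus) result.push_back(edge.second);
    }
    return result;
}
```

Choosing a different focus function simply re-runs both lookups, which is roughly what switching focus in a call-list view amounts to.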
14
Counter Monitor

Use the Counter Monitor feature of the VTune analyzer to collect and display performance counter data.
Counter Monitor selectively polls performance counters, which are grouped by category into performance objects.
With the VTune analyzer, you can:
Monitor selected counters in performance objects.
Correlate performance counter data with data
collected by other features in the VTune
analyzer, such as sampling.
Trigger the collection of counter data on events
other than a periodic timer.

15
Counter Monitor
16
Getting Help

Context sensitive help
Online Help repository

17
VTune Summary

Pros: lets you get the best possible performance out of Intel architecture
Cons: extreme tuning requires a deep understanding of processor and OS internals

18
Valgrind

Multi-purpose Linux x86 profiling tool

19
Valgrind Toolkit

Memcheck is a memory debugger: it detects memory-management problems
Cachegrind is a cache profiler: it performs detailed simulation of the I1, D1 and L2 caches in your CPU
Massif is a heap profiler: it performs detailed heap profiling by taking regular snapshots of a program's heap
Helgrind is a thread debugger: it finds data races in multithreaded programs

20
Memcheck Features

When a program is run under Memcheck's
supervision, all reads and writes of memory are
checked, and calls to malloc/new/free/delete are
intercepted
Memcheck can detect
Use of uninitialised memory
Reading/writing memory after it has been free'd
Reading/writing off the end of malloc'd blocks
Reading/writing inappropriate areas on the stack
Memory leaks -- where pointers to malloc'd blocks
are lost forever
Passing of uninitialised and/or unaddressable memory to system calls
Mismatched use of malloc/new/new[] vs free/delete/delete[]
Overlapping src and dst pointers in memcpy() and
related functions
Some misuses of the POSIX pthreads API

21
Memcheck Example
Access of unallocated memory
Use of an uninitialized value
Memory leak
Calling free() on memory allocated by new
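The example program on this slide was an image and did not survive in the transcript. As a stand-in, here is a hypothetical a.cc with one deliberately buggy function per error class listed above; all function names are invented for illustration, and this is not the original slide code.

```cpp
#include <cstdlib>

// Each function contains ONE deliberate bug of a kind Memcheck reports.
// Illustrative only: do not copy these patterns into real code.

// Access of unallocated memory: writes one element past the end of the block.
void write_past_end() {
    int* a = new int[10];
    a[10] = 0;              // invalid write: valid indices are 0..9
    delete[] a;
}

// Use of an uninitialized value: the result depends on an indeterminate int.
int branch_on_uninitialized() {
    int x;                  // never assigned
    return (x > 0) ? 1 : 0;
}

// Memory leak: the only pointer to the block is lost when the function returns.
int leak_block() {
    int* p = new int[10];
    p[0] = 7;
    return p[0];            // p is never passed to delete[]
}

// Mismatched deallocation: memory obtained with new is released with free().
void free_after_new() {
    int* p = new int(1);
    std::free(p);           // should be `delete p;`
}
```

Calling these functions from main() and running the binary under Memcheck should produce one report per function, each with a stack trace pointing at the offending line when the program is built with debug information.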
22
Memcheck Example (Cont.)

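Compile the program with the -g flag (debug information):
g++ -g a.cc -o a.out
Execute valgrind:
valgrind --tool=memcheck --leak-check=yes ./a.out > log 2>&1
(--leak-check=yes enables leak debugging; a.out is the executable name. Valgrind writes its report to stderr, hence 2>&1.)
View log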
23
Memcheck report
24
Memcheck report (cont.)
Leaks detected
Stack
25
Cachegrind

Detailed cache profiling can be very useful for
improving the performance of the program
On a modern x86 machine, an L1 miss will cost
around 10 cycles, and an L2 miss can cost as much
as 200 cycles
Cachegrind performs detailed simulation of the
I1, D1 and L2 caches in your CPU
Can accurately pinpoint the sources of cache
misses in your code
Identifies number of cache misses, memory
references and instructions executed for each
line of source code, with per-function,
per-module and whole-program summaries
Cachegrind runs programs about 20--100x slower
than normal
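A classic case where such per-line miss counts help is traversal order over a two-dimensional array. The sketch below is a hypothetical example (the `Matrix` type and function names are not from the slides): both functions compute the same sum, but one walks memory sequentially and the other with a large stride.

```cpp
#include <cstddef>
#include <vector>

// A rows x cols matrix stored contiguously in row-major order, filled with 1s.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 1) {}
    int at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Inner loop walks along a row: consecutive accesses are adjacent in memory.
long long sum_row_major(const Matrix& m) {
    long long total = 0;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            total += m.at(r, c);
    return total;
}

// Inner loop walks down a column: consecutive accesses are `cols` elements
// apart, so on large matrices each one tends to touch a different cache line.
long long sum_col_major(const Matrix& m) {
    long long total = 0;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            total += m.at(r, c);
    return total;
}
```

On a matrix much larger than the data cache, a cache profiler would be expected to attribute far more D1 read misses to the inner loop of the column-major version, although the exact numbers depend on the cache configuration being simulated.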

26
How to run

Put valgrind --tool=cachegrind in front of the normal command line invocation
Example: valgrind --tool=cachegrind ls -l
When the program finishes, Cachegrind prints summary cache statistics. It also collects line-by-line information in a file named cachegrind.out.pid, where pid is the process ID
Run cg_annotate to get an annotated source file:
cg_annotate --7618 a.cc > a.cc.annotated
(7618 is the PID; a.cc is the source file)
27
Cachegrind Summary output
28
Cachegrind Summary output
Data caches READ performance
D1 cache read misses
29
Cachegrind Summary output
Data caches WRITE performance
30
Cachegrind Accuracy

Valgrind's cache profiling has a number of
shortcomings
It doesn't account for kernel activity -- the
effect of system calls on the cache contents is
ignored
It doesn't account for other process activity
(although this is probably desirable when
considering a single program)
It doesn't account for virtual-to-physical address mappings; hence the entire simulation is not a true representation of what's happening in the cache

31
Massif tool

Massif is a heap profiler - it measures how much
heap memory programs use. It can give information
about
Heap blocks
Heap administration blocks
Stack sizes
Helps to reduce the amount of memory the program uses
Smaller programs interact better with caches and avoid paging
Detects leaks that aren't detected by traditional leak-checkers, such as Memcheck
That's because the memory isn't ever actually lost (a pointer to it remains) but it's not in use anymore
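The kind of leak described above, where memory stays reachable but is never used again, can be sketched with a container that only grows. The `RequestLog` class is a hypothetical example, not something from the slides.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical request log: every handled request is appended and never removed.
// The vector owns all entries, so nothing is "lost" in the leak-checker sense,
// yet the old entries are never read again and the heap keeps growing.
class RequestLog {
public:
    void handle(const std::string& request) {
        history_.push_back(request);
    }
    std::size_t retained() const { return history_.size(); }

private:
    std::vector<std::string> history_;
};
```

Because the vector frees everything when the object is destroyed, a leak checker would typically see nothing wrong at exit, whereas a heap profile over time should show a band attributed to the `push_back` call that keeps getting taller.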

32
Executing Massif

Run valgrind --tool=massif prog
Produces the following:
Summary
Graph picture
Report
The summary will look like this:
Total spacetime: 2,258,106 ms.B
Heap: 24.0%
Heap admin: 2.2%
Stack(s): 73.7%

Spacetime is space (in bytes) multiplied by time (in milliseconds).
Heap is the number of words allocated on the heap via malloc(), new and new[].
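To make the ms.B unit concrete, the sketch below approximates spacetime from a series of memory censuses by treating memory use as a step function between samples. The `Census` struct and the function name are hypothetical, and Massif's own calculation may differ in detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One census: wall-clock time in milliseconds and bytes in use at that moment.
struct Census {
    std::uint64_t time_ms;
    std::uint64_t bytes;
};

// Approximates spacetime in ms.B: the size seen at each census is assumed to
// hold until the next census, and bytes x elapsed milliseconds are summed.
// Samples are assumed to be sorted by increasing time.
std::uint64_t spacetime_ms_bytes(const std::vector<Census>& samples) {
    std::uint64_t total = 0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        const std::uint64_t elapsed = samples[i].time_ms - samples[i - 1].time_ms;
        total += samples[i - 1].bytes * elapsed;
    }
    return total;
}
```

With censuses of 100 bytes at 0 ms, 300 bytes at 10 ms and 0 bytes at 30 ms, this gives 100 x 10 + 300 x 20 = 7,000 ms.B. The percentages in the summary are then each category's share of the total.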
33
Spacetime Graphs
34
Spacetime Graph (Cont.)

Each band represents a single line of source code
It's the height of a band that's important
Triangles on the x-axis show each point at which a memory census was taken
Not necessarily evenly spread: Massif only takes a census when memory is allocated or de-allocated
The time on the x-axis is wall-clock time
This is not ideal because different executions of the same program can produce different graphs, due to random OS delays

35
Text/HTML Report example

Contains a lot of extra information about
heap allocations that you don't see in the graph.

Shows places in the program where most memory was
allocated
36
Valgrind how it works

Valgrind is compiled into a shared object,
valgrind.so. The shell script valgrind sets the
LD_PRELOAD environment variable to point to
valgrind.so. This causes the .so to be loaded as
an extra library to any subsequently executed
dynamically-linked ELF binary
The dynamic linker allows each .so in the process
image to have an initialization function which is
run before main(). It also allows each .so to
have a finalization function run after main()
exits
When valgrind.so's initialization function is called by the dynamic linker, the synthetic CPU starts up. The real CPU remains locked in valgrind.so until the end of the run
System calls are intercepted; signal handlers are monitored
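The initialization and finalization hooks mentioned above can be sketched with the constructor and destructor function attributes, which are GCC/Clang extensions rather than standard C++. This only illustrates the hook mechanism; it is not Valgrind's code, and all names are hypothetical.

```cpp
#include <cstdio>

// Set by the initialization hook so its effect can be observed from main().
static bool g_initialized_before_main = false;

// GCC/Clang extension: this function runs before main() starts.
__attribute__((constructor)) static void on_load() {
    g_initialized_before_main = true;
}

// GCC/Clang extension: this function runs after main() returns or exit() is called.
__attribute__((destructor)) static void on_unload() {
    std::puts("finalization hook ran");
}

// Lets code in main() check whether the initialization hook has already run.
bool initialized_before_main() {
    return g_initialized_before_main;
}
```

When the same functions live in a shared object that is loaded into another program, for example through LD_PRELOAD, the hooks run around that program's main(), which is the entry point the description above relies on.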

37
Valgrind Summary

Valgrind will save hours of debugging time
Valgrind can help speed up your programs
Valgrind runs on x86-Linux
Valgrind works with programs written in any
language
Valgrind is actively maintained
Valgrind can be used with other tools (gdb)
Valgrind is easy to use
It uses dynamic binary translation, so there is no need to modify, recompile or re-link applications. Just prefix the command line with valgrind and everything works
Valgrind is not a toy
Used by large projects with 25 million lines of code
Valgrind is free

38
Other Tools

Tools not included in this presentation
IBM Purify
Parasoft Insure
KCachegrind
Oprofile
GCC's and GLIBC's debugging hooks

39
Writing Fast Programs

Select the right algorithm
Implement it efficiently
Detect hotspots using a profiler and fix them
Understanding of the target system architecture, such as cache structure, is often required
Use platform-specific compiler extensions: memory pre-fetching, cache-control instructions, branch prediction, SIMD instructions
Write multithreaded applications (Hyper-Threading Technology)
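As one example of a platform-specific compiler extension, the sketch below uses `__builtin_prefetch`, a GCC/Clang builtin, on an indirect access pattern. The function name and the lookahead distance of 8 are assumptions chosen for illustration; a real value would have to be found by measurement.

```cpp
#include <cstddef>
#include <vector>

// Sums values[indices[i]]. The indirect access pattern is hard for hardware
// prefetchers to predict. Every entry of `indices` is assumed to be a valid
// index into `values`.
long long gather_sum(const std::vector<int>& values,
                     const std::vector<std::size_t>& indices) {
    const std::size_t kLookahead = 8;  // assumed distance; tune per machine
    long long total = 0;
    for (std::size_t i = 0; i < indices.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + kLookahead < indices.size()) {
            // Hint only: ask the CPU to start loading an element needed soon.
            __builtin_prefetch(&values[indices[i + kLookahead]]);
        }
#endif
        total += values[indices[i]];
    }
    return total;
}
```

The prefetch is only a hint and never changes the result. Whether it helps at all depends on the data size and the processor, so it is worth confirming with a profiler before and after.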

40
CPU Architecture (Pentium 4)
Out-of-order Execution !
41
Instruction Execution
[Diagram labels: instruction pool, dispatch unit, execution units (integer x2, floating point x2, memory load, memory save)]
42
Keeping CPU Busy

Processors are limited by data dependencies and the speed of instructions
Keep data dependencies low
A good blend of instructions keeps all execution units busy at the same time
Waiting for memory with nothing else to execute is the most common reason for slow applications
Goals: ready instructions, a good mix of instructions and predictable branches
Remove branches if possible
Reduce randomness of branches; avoid function pointers and jump tables
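The advice to remove branches can be illustrated with a counting loop written two ways. Both function names are hypothetical, and an optimizing compiler may already turn the first version into the second, so this is a sketch of the idea rather than a guaranteed speedup.

```cpp
#include <vector>

// Branchy version: one conditional jump per element, which is hard to
// predict when the data is effectively random.
int count_below_branchy(const std::vector<int>& data, int limit) {
    int count = 0;
    for (int v : data) {
        if (v < limit) ++count;
    }
    return count;
}

// Branchless version: the comparison result (0 or 1) is added directly,
// so the loop body contains no data-dependent jump.
int count_below_branchless(const std::vector<int>& data, int limit) {
    int count = 0;
    for (int v : data) {
        count += static_cast<int>(v < limit);
    }
    return count;
}
```

Both functions return the same count for any input; only the shape of the generated code differs, which is the sort of change worth checking with a profiler.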

43
Memory Overview (Pentium 4)

L1 cache (data only): 8 kbytes
Execution Trace Cache: stores up to 12K decoded micro-ops
L2 Advanced Transfer Cache (data + instructions): 256 kbytes, 3 times slower than L1
L3 cache: 4 MB (optional)
Main RAM (usually 64 MB to 4 GB): 10 times slower than L1
