Title: Different events collected
1 Different events collected: modules view
System-wide look at the software running on the system
Our program
CPI (cycles per instruction): a good average indicator
2 Hotspot Graph
Click on a hotspot bar and VTune displays the source code view
Each bar represents one of the functions of our program
3 Source View
Test_if function
4 See how much time is spent on each line
Annotated Source View (of module)
Check this for loop!
10% of CPU time is spent in a few statements
5 VTune Tuning Assistant
- In a few clicks we reached the performance problem!
- Now, how do we solve it?
- Tuning Assistant highlights performance problems
- Provides the approximate time lost to each performance problem
- Its database contains performance metrics based on Intel's experience of tuning hundreds of applications
- Analyzes the data gathered from our application
- Generates tuning recommendations for each hotspot
- Gives the user an idea of what might be done to fix the problem
6 Tuning Assistance Report
7 Hotspot Assistant Report: Penalties
8 Hotspot Assistant Report
9 Call Graph Mode
- Provides a pictorial view of program flow to quickly identify critical functions and call sequences
- Call graph profiling reveals
  - Structure of your program at the function level
  - Number of times a function is called from a particular location
  - The time spent in each function
  - Functions on the critical path
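To make the per-location call counts above concrete, here is a minimal sketch of the data a call graph holds. The `CallGraph` class and its method names are hypothetical and are not part of VTune; a real profiler gathers this through instrumentation rather than explicit calls.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical recorder (not a VTune API): counts calls per caller/callee edge.
class CallGraph {
public:
    // Note that `caller` invoked `callee` once more.
    void record(const std::string& caller, const std::string& callee) {
        ++edges_[std::make_pair(caller, callee)];
    }

    // Number of times `callee` was called from `caller` (0 if never).
    int calls(const std::string& caller, const std::string& callee) const {
        auto it = edges_.find(std::make_pair(caller, callee));
        return it == edges_.end() ? 0 : it->second;
    }

private:
    std::map<std::pair<std::string, std::string>, int> edges_;
};
```

Keying on the (caller, callee) pair rather than on the callee alone is what lets the same function show different counts from different call sites.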
10 Call Graph Screenshot
The function summary pane
Critical path, displayed as red lines: the call sequence in an application that took the most time to execute
Switch to Call List view
11 Call Graph (Cont.)
Additional info is available by hovering the mouse over the functions
Wait time: how much time was spent waiting for an event to occur
12 Jump to Source view
13 Call Graph: Call List View
Caller Functions are the functions that called the Focus Function
Callee Functions are the functions that are called by the Focus Function
14 Counter Monitor
- Use the Counter Monitor feature of VTune to collect and display performance counter data. Counter Monitor selectively polls performance counters, which are grouped categorically into performance objects.
- With the VTune analyzer, you can
  - Monitor selected counters in performance objects
  - Correlate performance counter data with data collected by other features of the VTune analyzer, such as sampling
  - Trigger the collection of counter data on events other than a periodic timer
15 Counter Monitor
16 Getting Help
- Context-sensitive help
- Online Help repository
17 VTune Summary
- Pros: allows you to get the best possible performance out of Intel architecture
- Cons: extreme tuning requires a deep understanding of processor and OS internals
18 Valgrind
- Multi-purpose Linux x86 profiling tool
19 Valgrind Toolkit
- Memcheck is a memory debugger
  - detects memory-management problems
- Cachegrind is a cache profiler
  - performs detailed simulation of the I1, D1 and L2 caches in your CPU
- Massif is a heap profiler
  - performs detailed heap profiling by taking regular snapshots of a program's heap
- Helgrind is a thread debugger
  - finds data races in multithreaded programs
20 Memcheck Features
- When a program is run under Memcheck's supervision, all reads and writes of memory are checked, and calls to malloc/new/free/delete are intercepted
- Memcheck can detect
  - Use of uninitialised memory
  - Reading/writing memory after it has been free'd
  - Reading/writing off the end of malloc'd blocks
  - Reading/writing inappropriate areas on the stack
  - Memory leaks, where pointers to malloc'd blocks are lost forever
  - Passing of uninitialised and/or unaddressable memory to system calls
  - Mismatched use of malloc/new/new[] vs free/delete/delete[]
  - Overlapping src and dst pointers in memcpy() and related functions
  - Some misuses of the POSIX pthreads API
21 Memcheck Example
Access of unallocated memory
Using a non-initialized value
Memory leak
Using free() on memory allocated by new
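The slide's own example program did not survive extraction. The sketch below is a hypothetical stand-in, not the original code, showing where each of the four defects listed above could sit. The function name `sum_of_squares` and the `DEMO_BUGS` macro are assumptions made for illustration. Built normally, the code is clean. Built with `-DDEMO_BUGS`, it contains the four defects for Memcheck to examine.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical example: sums the squares 0^2 .. (n-1)^2 via a heap array.
// Compile with -DDEMO_BUGS to enable the four intentional defects.
int sum_of_squares(int n) {
    int* squares = new int[n];
    for (int i = 0; i < n; ++i) squares[i] = i * i;

#ifdef DEMO_BUGS
    squares[n] = 0;             // (1) access of unallocated memory (one past the end)
    int total;                  // (2) non-initialized value, read by += below
#else
    int total = 0;
#endif
    for (int i = 0; i < n; ++i) total += squares[i];

#ifdef DEMO_BUGS
    int* leaked = new int[16];  // (3) memory leak: the pointer is lost on return
    leaked[0] = total;
    std::free(squares);         // (4) free() on memory allocated by new[]
#else
    delete[] squares;           // correct pairing for new[]
#endif
    return total;
}
```

Memcheck reports an uninitialised value only once it affects visible behaviour, such as a conditional jump or a system call, so defect (2) shows up when the caller prints or branches on the result, not at the addition itself.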
22 Memcheck Example (Cont.)
- Compile the program with the -g flag
  - g++ -g a.cc -o a.out
- Execute valgrind
  - valgrind --tool=memcheck --leak-check=yes a.out 2> log (Valgrind writes its report to stderr)
- View log
Debug leaks: --leak-check=yes
Executable name: a.out
23 Memcheck report
24 Memcheck report (cont.): Leaks detected
STACK
25 Cachegrind
- Detailed cache profiling can be very useful for improving the performance of the program
  - On a modern x86 machine, an L1 miss will cost around 10 cycles, and an L2 miss can cost as much as 200 cycles
- Cachegrind performs detailed simulation of the I1, D1 and L2 caches in your CPU
- Can accurately pinpoint the sources of cache misses in your code
- Identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries
- Cachegrind runs programs about 20-100x slower than normal
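As an illustration of the kind of code where per-line miss counts are informative, the sketch below sums the same matrix in two traversal orders and adds a rough penalty estimate built from the 10-cycle and 200-cycle figures quoted above. All function names are hypothetical, and the estimate is only a back-of-the-envelope formula, not something Cachegrind computes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored row-major, walking memory sequentially.
long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same result, but each access jumps a whole row ahead, which tends to
// produce many more D1 misses once the matrix outgrows the cache.
long sum_column_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}

// Rough penalty in cycles, assuming ~10 per L1 miss and ~200 per L2 miss.
long estimated_miss_cycles(long l1_misses, long l2_misses) {
    return 10 * l1_misses + 200 * l2_misses;
}
```

Both traversals return the same sum; only the memory access pattern differs, which is the kind of difference a per-line cache profile is meant to expose.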
26 How to run
- Run valgrind --tool=cachegrind in front of the normal command line invocation
  - Example: valgrind --tool=cachegrind ls -l
- When the program finishes, Cachegrind will print summary cache statistics. It also collects line-by-line information in a file cachegrind.out.<pid>
- Execute cg_annotate to get an annotated source file
  - cg_annotate --7618 a.cc > a.cc.annotated
Source files: a.cc
PID: 7618
27 Cachegrind Summary output
28 Cachegrind Summary output
Data cache READ performance
D1 cache read misses
29 Cachegrind Summary output
Data cache WRITE performance
30 Cachegrind Accuracy
- Valgrind's cache profiling has a number of shortcomings
  - It doesn't account for kernel activity: the effect of system calls on the cache contents is ignored
  - It doesn't account for other process activity (although this is probably desirable when considering a single program)
  - It doesn't account for virtual-to-physical address mappings; hence the entire simulation is not a true representation of what's happening in the cache
31 Massif tool
- Massif is a heap profiler: it measures how much heap memory programs use. It can give information about
  - Heap blocks
  - Heap administration blocks
  - Stack sizes
- Helps to reduce the amount of memory the program uses
  - smaller programs interact better with caches and avoid paging
- Detects leaks that aren't detected by traditional leak-checkers, such as Memcheck
  - That's because the memory isn't ever actually lost (a pointer remains to it) but it's not in use anymore
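A minimal sketch of the last point, a "leak" that a lost-pointer check cannot see: the hypothetical `RequestLog` class below keeps every entry reachable, so nothing is ever lost, yet the heap grows for as long as `record` is called without `trim`. The class and its methods are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical log that is appended to but, unless trim() is called, never shrinks.
class RequestLog {
public:
    // Every recorded line stays reachable through lines_, so a
    // lost-pointer leak check has nothing to report.
    void record(const std::string& line) { lines_.push_back(line); }

    std::size_t size() const { return lines_.size(); }

    // One possible fix: keep only the most recent `keep` entries.
    void trim(std::size_t keep) {
        if (lines_.size() > keep)
            lines_.erase(lines_.begin(),
                         lines_.end() - static_cast<std::ptrdiff_t>(keep));
    }

private:
    std::vector<std::string> lines_;
};
```

In a heap profile this pattern would be expected to appear as a band that keeps growing over the run, pointing at the allocation made inside `record`.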
32 Executing Massif
- Run valgrind --tool=massif prog
- Produces the following
  - Summary
  - Graph picture
  - Report
- The summary will look like this
  - Total spacetime: 2,258,106 ms.B
  - Heap: 24.0%
  - Heap admin: 2.2%
  - Stack(s): 73.7%
Spacetime: space (in bytes) multiplied by time (in milliseconds).
Heap: number of words allocated on the heap, via malloc(), new and new[].
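The spacetime unit (bytes multiplied by milliseconds) can be sketched as a simple sum over memory samples. This is only an illustration of the unit, under the assumption that each sampled size holds until the next sample; it is not Massif's actual implementation, and the `Sample` struct and function name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical memory sample: bytes in use at a given time.
struct Sample {
    long time_ms;
    long bytes;
};

// Approximates spacetime in ms.B by holding each sample's byte count
// constant until the following sample.
long spacetime_ms_bytes(const std::vector<Sample>& samples) {
    long total = 0;
    for (std::size_t i = 0; i + 1 < samples.size(); ++i)
        total += samples[i].bytes * (samples[i + 1].time_ms - samples[i].time_ms);
    return total;
}
```

For samples of 100 bytes at 0 ms, 300 bytes at 10 ms and 0 bytes at 30 ms, this gives 100 x 10 + 300 x 20 = 7,000 ms.B.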
33 Spacetime Graphs
34 Spacetime Graph (Cont.)
- Each band represents a single line of source code
- It's the height of a band that's important
- Triangles on the x-axis show each point at which a memory census was taken
  - Not necessarily evenly spread: Massif only takes a census when memory is allocated or de-allocated
- The time on the x-axis is wall-clock time
  - not ideal, because you can get different graphs for different executions of the same program, due to random OS delays
35 Text/HTML Report example
- Contains a lot of extra information about heap allocations that you don't see in the graph
Shows places in the program where most memory was allocated
36 Valgrind: how it works
- Valgrind is compiled into a shared object, valgrind.so. The shell script valgrind sets the LD_PRELOAD environment variable to point to valgrind.so. This causes the .so to be loaded as an extra library into any subsequently executed dynamically-linked ELF binary
- The dynamic linker allows each .so in the process image to have an initialization function which is run before main(). It also allows each .so to have a finalization function run after main() exits
- When valgrind.so's initialization function is called by the dynamic linker, the synthetic CPU starts up. The real CPU remains locked in valgrind.so until the end of the run
- System calls are intercepted; signal handlers are monitored
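The initialization and finalization hooks described above can be sketched with the GCC/Clang `constructor` and `destructor` function attributes. This is a generic illustration of the mechanism, not Valgrind's source. The names `on_load`, `on_unload` and `init_ran` are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for state a preloaded library would set up.
static bool g_init_ran = false;

// GCC/Clang extension: run by the loader before main() starts.
__attribute__((constructor)) static void on_load() {
    g_init_ran = true;  // a tool would start its machinery here
}

// GCC/Clang extension: run after main() returns or exit() is called.
__attribute__((destructor)) static void on_unload() {
    // a tool would flush its final report here
}

// Lets the program observe that the hook ran before main().
bool init_ran() { return g_init_ran; }
```

To try the preload path, one could build this as a shared object with `g++ -shared -fPIC hook.cc -o hook.so` (the file names are arbitrary) and run a program with `LD_PRELOAD=./hook.so` set in its environment.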
37 Valgrind Summary
- Valgrind will save hours of debugging time
- Valgrind can help speed up your programs
- Valgrind runs on x86-Linux
- Valgrind works with programs written in any language
- Valgrind is actively maintained
- Valgrind can be used with other tools (gdb)
- Valgrind is easy to use
  - uses dynamic binary translation, so there is no need to modify, recompile or re-link applications. Just prefix the command line with valgrind and everything works
- Valgrind is not a toy
  - Used by large projects of 25 million lines of code
- Valgrind is free
38 Other Tools
- Tools not included in this presentation
  - IBM Purify
  - Parasoft Insure++
  - KCachegrind
  - Oprofile
  - GCC's and GLIBC's debugging hooks
39 Writing Fast Programs
- Select the right algorithm
- Implement it efficiently
- Detect hotspots using a profiler and fix them
- Understanding of the target system architecture, such as cache structure, is often required
- Use platform-specific compiler extensions: memory pre-fetching, cache-control instructions, branch prediction, SIMD instructions
- Write multithreaded applications (Hyper-Threading Technology)
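As a sketch of what platform-specific compiler extensions look like in practice, the function below uses two GCC/Clang builtins, `__builtin_prefetch` and `__builtin_expect`. The function name, the prefetch distance of 16 elements, and the assumption that negative values are rare are all illustrative choices, not recommendations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the non-negative elements of v, skipping negative ones.
long sum_non_negative(const std::vector<int>& v) {
    const std::size_t kAhead = 16;  // prefetch distance, chosen arbitrarily
    long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        // Memory pre-fetching: ask for data we will need a few iterations later.
        if (i + kAhead < v.size())
            __builtin_prefetch(&v[i + kAhead]);
        // Branch prediction hint: tell the compiler negatives are unlikely.
        if (__builtin_expect(v[i] < 0, 0))
            continue;
        total += v[i];
    }
    return total;
}
```

For a simple sequential scan like this, hardware prefetchers usually do the job already, so hints of this kind are worth keeping only when a profiler shows a measurable gain.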
40 CPU Architecture (Pentium 4)
Out-of-order execution!
41 Instruction Execution
[Diagram: instruction pool, dispatch unit, and execution units (integer x2, floating point x2, memory load, memory save)]
42 Keeping CPU Busy
- Processors are limited by data dependencies and speed of instructions
- Keep data dependencies low
- A good blend of instructions keeps all execution units busy at the same time
- Waiting for memory with nothing else to execute is the most common reason for slow applications
- Goals: ready instructions, a good mix of instructions and predictable branches
- Remove branches if possible
- Reduce randomness of branches; avoid function pointers and jump tables
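Two of the points above, removing branches and keeping data dependencies low, can be sketched as follows. The function names are hypothetical, and an optimizing compiler may already perform either transformation on its own, so treat this as an illustration of the idea rather than a guaranteed speedup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branchy version: the if depends on the data, so random input is hard to predict.
long sum_above_branchy(const std::vector<int>& v, int threshold) {
    long total = 0;
    for (int x : v)
        if (x > threshold) total += x;
    return total;
}

// Branch-free version: the comparison yields 0 or 1 and is used as a multiplier.
long sum_above_branchless(const std::vector<int>& v, int threshold) {
    long total = 0;
    for (int x : v)
        total += static_cast<long>(x) * (x > threshold);
    return total;
}

// Two independent accumulators: consecutive additions no longer all wait
// on the same running total, so more of them can be in flight at once.
long sum_two_accumulators(const std::vector<int>& v) {
    long a = 0, b = 0;
    std::size_t i = 0;
    for (; i + 1 < v.size(); i += 2) {
        a += v[i];
        b += v[i + 1];
    }
    if (i < v.size()) a += v[i];  // odd element left over
    return a + b;
}
```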
43 Memory Overview (Pentium 4)
- L1 cache (data only): 8 kbytes
- Execution Trace Cache that stores up to 12K decoded micro-ops
- L2 Advanced Transfer Cache (data + instructions): 256 kbytes, 3 times slower than L1
- L3: 4 MB cache (optional)
- Main RAM (usually 64M-4G), 10 times slower than L1