Title: IBM Hardware Performance Monitor (hpm)
1 IBM Hardware Performance Monitor (hpm)
- NPACI Parallel Computing Institute August, 2002
2 What is Performance?
- Where is time spent and how is it spent?
- MIPS: Millions of Instructions Per Second
- MFLOPS: Millions of Floating-Point Operations Per Second
- Run time / CPU time
3 What is a Performance Monitor?
- Provides detailed processor/system data
- Processor Monitors
- Typically a group of registers
- Special-purpose registers keep track of programmable events
- Non-intrusive counts result in accurate measurement of processor events
- Typical events counted are instructions, floating point instructions, cache misses, etc.
- System Level Monitors
- Can be hardware or software
- Intended to measure system activity
- Examples
- Bus monitor: measures memory traffic, can analyze cache coherency issues in a multiprocessor system
- Network monitor: measures network traffic, can analyze web traffic internally and externally
4 Hardware Counter Motivations
- To understand the execution behavior of application code
- Why not use software?
- Strength: simple, GUI interface
- Weakness: large overhead, intrusive, higher level of abstraction and simplicity
- How about using a simulator?
- Strength: control, low-level, accurate
- Weakness: limit on the size of code, difficult to implement, time-consuming to run
- When should we directly use hardware counters?
- When software and simulators are not available or not enough
- Strength: non-intrusive, instruction-level analysis, moderate control, very accurate, low overhead
- Weakness: not typically reusable, needs OS kernel support
5 Ptools Project
- PMAPI Project
- Common standard API for the industry
- Supported by IBM, SUN, SGI, COMPAQ, etc.
- PAPI Project
- Standard application programming interface
- Portable, available through a module
- Can access hardware counter info
- HPM Toolkit
- Easy to use
- Doesn't affect code performance
- Uses hardware counters
- Designed specifically for IBM SPs and Power
6 Problem Set
- Should we collect all events all the time?
- Not necessary and wasteful
- What counts should be used?
- Gather only what you need
- Cycles
- Committed instructions
- Loads
- Stores
- L1/L2 misses
- L1/L2 stores
- Committed fl. pt. instructions
- Branches
- Branch misses
- TLB misses
- Cache misses
7 POWER3 Architecture
8 IBM HPM Toolkit
- High Performance Monitor
- Developed for performance measurement of applications running on IBM Power3 systems. It consists of:
- A utility (hpmcount)
- An instrumentation library (libhpm)
- A graphical user interface (hpmviz)
- Requires PMAPI kernel extensions to be loaded
- Works on IBM 630 and 604e processors
- Based on IBM's PMAPI low-level interface
9 HPM Count
- Utility for performance measurement of applications
- Extra logic inserted into the processor to count specific events
- Updated at every cycle
- Provides a summary output at the end of the execution:
- Wall clock time
- Resource usage statistics
- Hardware performance counter information
- Derived hardware metrics
- Serial/parallel: gives performance numbers for each task
10 HPM Usage: HW Event Categories
- EVENT SET 1 (floating point performance and usage of floating point units)
- Cycles
- Inst. completed
- TLB misses
- Stores completed
- Loads completed
- FPU0 ops
- FPU1 ops
- FMAs executed
- EVENT SET 2 (data locality and usage of level 1 data cache)
- Cycles
- Inst. completed
- TLB misses
- Stores dispatched
- L1 store misses
- Loads dispatched
- L1 load misses
- LSU idle
- EVENT SET 3 (performance and usage of level 1 instruction cache)
- Cycles
- Inst. dispatched
- Inst. completed
- Cycles w/ 0 inst. completed
- I-cache misses
- FXU0 ops
- FXU1 ops
- FXU2 ops
- EVENT SET 4 (usage of level 2 data cache and branch prediction)
- Cycles
- Loads dispatched
- L1 load misses
- L2 load misses
- Stores dispatched
- L2 store misses
- Comp. unit waiting on load
- LSU idle
11 HPM for Whole Program using HPMCOUNT
- Installed in /usr/local/apps/hpm, /usr/local/apps/HPM_V2.3
- Environment settings
- setenv LIBHPM_EVENT_SET 1 (or 2, 3, 4)
- setenv MP_LABELIO YES -> to correlate each line of output with the corresponding task
- setenv MP_STDOUTMODE taskID (e.g. 0) -> to discard output from other tasks
- Usage
- poe hpmcount ./a.out -nodes 1 -tasks_per_node 1 -rmpool 1 -s <set> -e ev,ev -h
- -h displays a help message
- -e ev0,ev1,... : list of event numbers, separated by commas
- ev<i> corresponds to the event selected for counter <i>
- -s predefined set of events
12 Derived Hardware Metrics
- Hardware counters provide only raw counts
- 8 counters on Power3
- Enough info for generation of derived metrics on each execution
- Derived metrics:
- Floating point rate
- Computational intensity
- Instructions per load/store
- Loads/stores per data cache miss
- Cache hit rate
- Loads per load miss
- Stores per store miss
- Loads per TLB miss
- FMA percentage
- Branches mispredicted
13 HPMCOUNT Output (Event Set 1): Resource Usage Statistics
- Total execution time of instrumented code (wall time): 6.218496 seconds
- Total amount of time in user mode: 5.860000 seconds
- Total amount of time in system mode: 3.120000 seconds
- Maximum resident set size: 23408 Kbytes
- Average shared memory use in text segment: 97372 Kbytes*sec
- Average unshared memory use in data segment: 13396800 Kbytes*sec
- Number of page faults without I/O activity: 5924
- Number of page faults with I/O activity: 12
- Number of times process was swapped out: 0
- Number of times file system performed INPUT: 0
- Number of times file system performed OUTPUT: 0
- Number of IPC messages sent: 0
- Number of IPC messages received: 0
- Number of signals delivered: 0
- Number of voluntary context switches: 2840
- Number of involuntary context switches: 27740
14 HPMCOUNT Output (Event Set 1, continued): Counter Values
- Instrumented section 1 - Label: ALL - process 1 - file swim_omp.f, lines 89 <--> 189
- Count: 1
- Wall clock time: 6.216718 seconds
- Total time in user mode: 5.35645462067771 seconds
- Exclusive duration: 0.012166 seconds
- PM_CYC (Cycles): 2008608171
- PM_INST_CMPL (Instructions completed): 1891769436
- PM_TLB_MISS (TLB misses): 2374441
- PM_ST_CMPL (Stores completed): 274169278
- PM_LD_CMPL (Loads completed): 672275023
- PM_FPU0_CMPL (FPU 0 instructions): 528010431
- PM_FPU1_CMPL (FPU 1 instructions): 245779486
- PM_EXEC_FMA (FMAs executed): 270299532
15 Timers
- Time usually reports three metrics:
- User time
- The time used by your code on the CPU, also called CPU time
- Total time in user mode = Cycles / Processor Frequency
- System time
- The time used by your code running kernel code (doing I/O, writing to disk, printing to the screen, etc.)
- It is worth minimizing the system time by speeding up the disk I/O, doing I/O in parallel, or doing I/O in the background while your CPU computes in the foreground
- Wall clock time
- Total execution time: user time plus system time plus the time spent idle (waiting for resources)
- In parallel performance tuning, only wall clock time counts
- Interprocessor communication consumes a significant amount of your execution time (user/system time usually doesn't account for it); you need to rely on wall clock time for all the time consumed by the job
16 Floating Point Measures
- PM_FPU0_CMPL (FPU 0 instructions)
- The POWER3 processor has two Floating Point Units (FPUs) which operate in parallel. Each FPU can start a new instruction at every cycle. This counter shows the number of floating point instructions that have been executed by the first FPU.
- PM_FPU1_CMPL (FPU 1 instructions)
- This counter shows the number of floating point instructions (add, multiply, subtract, divide, multiply-add) that have been processed by the second FPU.
- PM_EXEC_FMA (FMAs executed)
- This is the number of Floating point Multiply-Add (FMA) instructions. This instruction does a computation of the following type: x = s + a * b, so two floating point operations are done within one instruction. The compiler generates this instruction as often as possible to speed up the program, but sometimes additional manual optimization is necessary to replace single multiply instructions and corresponding add instructions by one FMA.
17 HPMCOUNT Output (Event Set 1, continued): Derived Metrics
- Utilization rate: 86.162 %
- TLB misses per cycle: 0.118 %
- Estimated latency from TLB misses: 4.432 sec
- Avg number of loads per TLB miss: 283.130
- Load and store operations: 946.444 M
- Instructions per load/store: 1.999
- MIPS: 304.304
- Instructions per cycle: 0.942
- HW floating point instructions per cycle: 0.385
- Floating point instructions + FMAs: 1044.089 M
- Floating point instructions + FMA rate: 167.949 Mflip/s
- FMA percentage: 51.777
- Computation intensity: 1.103
18 Total Flop Rate
- Floating point instructions + FMA rate
- This is the most often mentioned performance index, the MFlops rate.
- The peak performance of the POWER3-II processor is 1500 MFlops (375 MHz clock x 2 FPUs x 2 flops per FMA instruction).
- Many applications do not reach more than 10 percent of this peak performance.
- Average number of loads per TLB miss
- This value is the ratio PM_LD_CMPL / PM_TLB_MISS. Each time a TLB miss has been processed, fast access to a new page of data is possible. Small values for this metric indicate that the program has poor data locality; a redesign of the data structures in the program may result in significant performance improvements.
- Computation intensity
- Computational intensity is the ratio of floating point operations to load and store operations.
19 HPM for Part of Program using LIBHPM
- Instrumentation library for performance measurement of Fortran, C and C++ applications
- Collects information and performs summarization during run-time; generates a performance file for each task
- Uses the same set of hardware counter events used by hpmcount
- User can specify an event set with the file libHPMevents
- For each instrumented point in a program, libhpm provides output:
- Total count
- Total duration (wall clock time)
- Hardware performance counter information
- Hardware derived metrics
- Supports:
- Multiple instrumentation points, nested instrumentation
- OpenMP and threaded applications
- Multiple calls to an instrumented point
20 LIBHPM Functions
- C / C++
- hpmInit(taskID)
- hpmTerminate(taskID)
- hpmStart(instID)
- hpmStop(instID)
- hpmTstart(instID)
- hpmTstop(instID)
- Fortran
- f_hpminit(taskID)
- f_hpmterminate(taskID)
- f_hpmstart(instID)
- f_hpmstop(instID)
- f_hpmtstart(instID)
- f_hpmtstop(instID)
21 Using LIBHPM - C
- Declaration
- #include "libhpm.h"
- C usage
- MPI_Comm_rank( MPI_COMM_WORLD, &taskID );
- hpmInit( taskID, "hpm_test" );
- hpmStart( 1, "outer call" );
- ... code segment to be timed ...
- hpmStop( 1 );
- hpmTerminate( taskID );
- Compilation
- mpcc_r -I/usr/local/apps/HPM_V2.3/include -O3 -lhpm_r -lpmapi -lm -qarch=pwr3 -qstrict -qsmp=omp -L/usr/local/apps/HPM_V2.3/lib hpm_test.c -o hpm_test.x
22 Using LIBHPM - Fortran
- Declaration
- include "f_hpm.h"
- Fortran usage
- CALL MPI_COMM_RANK( MPI_COMM_WORLD, taskID, ierr )
- call f_hpminit(taskID)
- call f_hpmstart(instID)
- ... code segment to be timed ...
- call f_hpmstop(instID)
- call f_hpmterminate(taskID)
- CALL MPI_FINALIZE(ierr)
- Compilation
- mpxlf_r -I/usr/local/apps/HPM_V2.3/include -qsuffix=cpp=f -O3 -qarch=pwr3 -qstrict -qsmp=omp -L/usr/local/apps/HPM_V2.3/lib -lhpm_r -lpmapi -lm hpm_test.f -o hpm_test.x
23 Using LIBHPM - Threads
- call f_hpminit(taskID)
- parallel do
- call f_hpmtstart(10)
- do_work
- call f_hpmtstop(10)
- end parallel do
- parallel do
- call f_hpmtstart(20 + my_thread_ID)
- do_work
- call f_hpmtstop(20 + my_thread_ID)
- end parallel do
- call f_hpmterminate(taskID)
24 HPM Example Code in C

#include <mpi.h>
#include <stdio.h>
#include "libhpm.h"
#define n 10000

int main(int argc, char *argv[])
{
  int taskID, i, numprocs;
  double a[n+1], b[n+1], c[n+1];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &taskID);

  hpmInit(taskID, "hpm_test");
  hpmStart(1, "section 1");
  for (i = 1; i < n+1; i++) {
    a[i] = i;
    b[i] = n - 1;
  }
  hpmStop(1);

  hpmStart(2, "section 2");
  for (i = 2; i < n+1; i++)
    c[i] = a[i] * b[i] + a[i] / b[i];
  hpmStop(2);

  hpmTerminate(taskID);
  MPI_Finalize();
  return 0;
}
25 HPM Example Code in Fortran
- program hpm_test
- parameter (n=10000)
- integer taskID, ierr, numtasks
- dimension a(n), b(n), c(n)
- include "mpif.h"
- include "f_hpm.h"
- call MPI_INIT(ierr)
- call MPI_COMM_RANK(MPI_COMM_WORLD, taskID, ierr)
- call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
- call f_hpminit(taskID, "hpm_test")
- call f_hpmstart(5, "section1")
- do i = 1, n
- a(i) = real(i)
- b(i) = real(n-i)
- enddo
- call f_hpmstop(5)
- call f_hpmterminate(taskID)
- call MPI_FINALIZE(ierr)
- end
26 Compiling and Linking
- FF = mpxlf_r
- HPM_DIR = /usr/local/apps/HPM_V2.3
- HPM_INC = -I$(HPM_DIR)/include
- HPM_LIB = -L$(HPM_DIR)/lib -lhpm_r -lpmapi -lm
- FFLAGS = -qsuffix=cpp=f -O3 -qarch=pwr3 -qstrict -qsmp=omp
- Note: -qsuffix=cpp=f is only required for Fortran code with a .f suffix
- hpm_test.x: hpm_test.f
- $(FF) $(HPM_INC) $(FFLAGS) hpm_test.f $(HPM_LIB) -o hpm_test.x
27 HPMVIZ
- Takes as input the performance files generated by libhpm
- Usage
- > hpmviz <performance files (.viz)>
- Defines a range of values considered satisfactory:
- Red: below the predefined minimum recommended value
- Green: above the threshold value
- HPMVIZ left pane of the window
- Displays, for each instrumented point identified by its label, the inclusive duration, exclusive duration, and count
- HPMVIZ right pane of the window
- Shows the corresponding source code, which can be edited and saved
- The metrics windows
- Display the task ID, thread ID, count, exclusive duration, inclusive duration, and the derived hardware metrics
28 HPMVIZ
29 IBM SP HPM Toolkit Summary
- A complete problem set
- Derived metrics
- Analysis of error messages
- Analysis of derived metrics
- HPMCOUNT: very accurate with low overhead, non-intrusive, general view of the whole program
- LIBHPM: same event sets as hpmcount, for part of a program
- HPMVIZ: easier viewing of the hardware counter information and derived metrics
30 HPM References
- HPM README file in /usr/local/apps/HPM_V2.3
- Online documentation
- http://www.sdsc.edu/SciApps/IBM_tools/hpm.html
31 Lab Session for HPM: Environment Setup
- Setup for running X-windows applications on PCs
- 1. Login to b80login.sdsc.edu using CRT (located in Applications (Common)).
- 2. Launch Exceed (located in either Applications (Common) or as a shortcut on your desktop called "Humming Bird").
- 3. Set your environment; for csh:
- setenv DISPLAY t-wolf.sdsc.edu:0.0
- where "t-wolf", for example, is the name of the PC you are using
- 4. Copy files from the /work/Training/HPM_Training directory into your own working space:
- Create a directory to work with HPM: mkdir HPM
- Change directories into the new directory: cd HPM
- Copy files into the new directory: cp /work/Training/HPM_Training/* .
- 5. Go to /work/Training/HPM_Training/simple/
32 Lab Session for HPM: Running HPM
- 1. Compile either the Fortran or C example with the following:
- make -f makefile_f (or makefile_c)
- 2. Run the executable either interactively or by batch
- Interactive command:
- poe hpm_test.x -nodes 1 -tasks_per_node 2 -euilib ip -euidevice en0
- 3. Explore the hpmcount summary output, looking at both usage and resource statistics