Provided by: DongJ9

Title: IBM Hardware Performance Monitor (hpm)

1
IBM Hardware Performance Monitor (hpm)

NPACI Parallel Computing Workshop February 5,
2002 at SDSC

2
What is Performance?

Where is time spent and how is time spent?
MIPS: Millions of Instructions Per Second
Not necessarily indicative of the amount of useful work done
MFLOPS: Millions of Floating-Point Operations Per Second
A better metric for numerically intensive codes, but different platforms measure flops differently, and flops are not completely indicative of useful work done
Run time/CPU time
The only true measure of code performance! Accounts for algorithmic improvements to code.
Can be converted to cycles.
Counting cycles means:
Estimate how many cycles your loop(s) should take
Compare to measured times (converted to cycles) and tune the code to narrow the difference
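
The cycle-counting idea on this slide can be sketched in a few lines of C. This is only an illustration: the function names seconds_to_cycles and cycle_gap are hypothetical, and the clock rate is a parameter you must supply for your own machine (the slides do not state one).

```c
#include <assert.h>

/* Convert a measured run time in seconds to processor cycles.
 * clock_hz is the processor clock rate in Hz (an input you supply;
 * it is not given in the slides). */
double seconds_to_cycles(double seconds, double clock_hz)
{
    assert(seconds >= 0.0 && clock_hz > 0.0);
    return seconds * clock_hz;
}

/* Ratio of measured cycles to the cycles you estimated for the loop.
 * A value near 1.0 means the code runs about as fast as estimated;
 * larger values show how much room is left for tuning. */
double cycle_gap(double measured_seconds, double clock_hz,
                 double estimated_cycles)
{
    assert(estimated_cycles > 0.0);
    return seconds_to_cycles(measured_seconds, clock_hz) / estimated_cycles;
}
```

For example, a loop estimated at 500 million cycles that takes 2 seconds on a 375 MHz clock (an assumed figure) has a gap of 1.5, so it uses half again as many cycles as estimated.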

3
What is a Performance Monitor?

Provides detailed processor/system data
Processor Monitors
Typically a group of registers
Special purpose registers keep track of
programmable events
Non-intrusive counts result in accurate
measurement of processor events
Typical events counted are instructions, floating point instructions, cache misses, etc.
System Level Monitors
Can be h/w or s/w
Intended to measure system activity
Examples:
Bus monitor: measures memory traffic, can analyze cache coherency issues in a multiprocessor system
Network monitor: measures network traffic, can analyze web traffic internally and externally

4
Hardware Counter Motivations

To understand execution behavior of application
code
Why not use software?
Strength: simple, GUI interface
Weakness: large overhead, intrusive, higher level, abstraction and simplicity
How about using a simulator?
Strength: control, low-level, accurate
Weakness: limit on size of code, difficult to implement
When should we directly use hardware counters?
When software and simulators are not available or not sufficient
Strength: non-intrusive, instruction level analysis, moderate control, very accurate, low overhead
Weakness: not typically reusable, needs OS kernel support

5
Problem Set

Should we collect all events all the time?
No. It is unnecessary and wasteful.
What counts should be used?
Safe to say: gather only what you need

Cycles
Committed Instructions
Loads
Stores
L1/L2 misses
L1/L2 stores

Committed floating point instructions
Branches
Branch misses
TLB misses
Cache misses

6
POWER3 Architecture

7
IBM HPM Toolkit

Hardware Performance Monitor
Developed for performance measurement of applications running on IBM Power3 systems. It consists of:
A utility (hpmcount)
An instrumentation library (libhpm)
A graphical user interface (hpmviz)
Requires PMAPI kernel extensions to be loaded
Works on IBM 630 and 604e processors

8
HPM Count

Utility for performance measurement of an application
Extra logic inserted in the processor counts specific events
Updated at every cycle
Provides a summary output at the end of the execution:
Wall clock time
Resource usage statistics
Hardware performance counters information
Derived hardware metrics

9
HPM Usage: HW Event Categories

EVENT SET 1 (floating point performance and usage of floating point units)
Cycles
Inst. completed
TLB misses
Stores completed
Loads completed
FPU0 ops
FPU1 ops
FMAs executed

EVENT SET 2 (data locality and usage of level 1 data cache)
Cycles
Inst. completed
TLB misses
Stores dispatched
L1 store misses
Loads dispatched
L1 load misses
LSU idle

EVENT SET 3 (performance and usage of level 1 instruction cache)
Cycles
Inst. dispatched
Inst. completed
Cycles w/ 0 inst. completed
I cache misses
FXU0 ops
FXU1 ops
FXU2 ops

EVENT SET 4 (usage of level 2 data cache and branch prediction)
Cycles
Loads dispatched
L1 load misses
L2 load misses
Stores dispatched
L2 store misses
Comp. unit waiting on load
LSU idle

10
HPM for Whole Program using HPMCOUNT

Installed in /usr/local/apps/hpm,
/usr/local/apps/HPM_V2.3
Environment setting
setenv LIBHPM_EVENT_SET 1 (or 2, 3, 4)
setenv MP_LABELIO YES -> to correlate each line of output with the corresponding task
setenv MP_STDOUTMODE taskID (e.g. 0) -> to discard output from other tasks
Usage
poe hpmcount ./a.out -nodes 1 -tasks_per_node 1 -rmpool 1 [-s <set>] [-e ev[,ev]*] [-h]
-h: displays a help message
-e ev0,ev1,...: list of event numbers, separated by commas
ev<i> corresponds to the event selected for counter <i>
-s: predefined set of events

11
Derived Hardware Metrics

Hardware counters provide only raw counts
8 counters on Power3
Enough info for generation of derived metrics on
each execution
Derived Metrics
Floating point rate
Computational Intensity
Instructions per load/store
Loads/stores per data cache miss
Cache hit rate
Loads per load miss
Stores per store miss
Loads per TLB miss
FMA
Branches mispredicted

12
HPMCOUNT Output (Event1)

---usage statistics---
Total amount of time in user mode: 141.130000 seconds
Total amount of time in system mode: 36.300000 seconds
Maximum resident set size: 25516 Kbytes
Average shared memory use in text segment: 1978356 Kbytes*sec
Average unshared memory use in data segment: 357949904 Kbytes*sec
Number of page faults without I/O activity: 6750
Number of page faults with I/O activity: 81
Number of times process was swapped out: 0
Number of times file system performed INPUT: 0
Number of times file system performed OUTPUT: 0
Number of IPC messages sent: 0
Number of IPC messages received: 0
Number of signals delivered: 0
Number of voluntary context switches: 266907
Number of involuntary context switches: 2128527
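
The slides do not say how hpmcount gathers these usage statistics. The field names closely match the members of struct rusage filled in by the POSIX getrusage() call, so the following is a minimal sketch, under that assumption, of how a program could print a few of the same numbers for itself. The function name print_usage_statistics is hypothetical, and the units of ru_maxrss differ between operating systems.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1.0e6;
}

/* Print some resource usage figures for the calling process.
 * Returns 0 on success, -1 if getrusage() fails. */
int print_usage_statistics(FILE *out)
{
    struct rusage ru;

    assert(out != NULL);
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;

    fprintf(out, "Total amount of time in user mode: %f seconds\n",
            tv_seconds(ru.ru_utime));
    fprintf(out, "Total amount of time in system mode: %f seconds\n",
            tv_seconds(ru.ru_stime));
    /* ru_maxrss is in Kbytes on many systems, but not on all of them. */
    fprintf(out, "Maximum resident set size: %ld\n", ru.ru_maxrss);
    fprintf(out, "Number of page faults without I/O activity: %ld\n",
            ru.ru_minflt);
    fprintf(out, "Number of page faults with I/O activity: %ld\n",
            ru.ru_majflt);
    fprintf(out, "Number of voluntary context switches: %ld\n",
            ru.ru_nvcsw);
    fprintf(out, "Number of involuntary context switches: %ld\n",
            ru.ru_nivcsw);
    return 0;
}
```

Calling print_usage_statistics(stdout) just before a program exits gives a rough counterpart to the first part of the hpmcount summary, without any hardware counter data.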

13
HPMCOUNT (Event1 continued)

---Resource statistics---
Wall Clock Time: 35.099596 seconds
Total time in user mode: 54.0518473203182 seconds
Average duration: 0.0146248
Standard deviation: 0.0112495
Exclusive duration: 0.191238 seconds
PM_CYC (Cycles): 20271809159
PM_INST_CMPL (Instructions completed): 14974657747
PM_TLB_MISS (TLB misses): 4474101
PM_ST_CMPL (Stores completed): 2687036544
PM_LD_CMPL (Loads completed): 5220888450
PM_FPU0_CMPL (FPU 0 instructions): 2581927160
PM_FPU1_CMPL (FPU 1 instructions): 519835526
PM_EXEC_FMA (FMAs executed): 792849657

14
HPMCOUNT (Event1 continued)

Utilization rate: 153.988 %
Avg number of loads per TLB miss: 1166.913
Load and store operations: 7907.925 M
Instructions per load/store: 1.894
MIPS: 426.633
Instructions per cycle: 0.739
HW Float points instructions per Cycle: 0.153
Floating point instructions + FMAs: 3894.612 M
Float point instructions + FMA rate: 110.959 Mflip/s
FMA percentage: 40.715 %
Computation intensity: 0.492
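
The derived metrics can be reproduced from the raw Event Set 1 counts. The sketch below is not taken from the hpmcount sources: the formulas are inferred by checking them against the numbers printed on these slides, and the type and function names (hpm_set1_counts, hpm_set1_metrics, hpm_slide_counts, hpm_derive) are hypothetical.

```c
#include <assert.h>

/* Raw Event Set 1 counts, one field per PM_* event in the output above. */
typedef struct {
    double wall_seconds; /* Wall Clock Time */
    double cyc;          /* PM_CYC */
    double inst_cmpl;    /* PM_INST_CMPL */
    double tlb_miss;     /* PM_TLB_MISS */
    double st_cmpl;      /* PM_ST_CMPL */
    double ld_cmpl;      /* PM_LD_CMPL */
    double fpu0_cmpl;    /* PM_FPU0_CMPL */
    double fpu1_cmpl;    /* PM_FPU1_CMPL */
    double exec_fma;     /* PM_EXEC_FMA */
} hpm_set1_counts;

/* Derived metrics; formulas are inferred from the slide output. */
typedef struct {
    double loads_per_tlb_miss;
    double load_store_ops;        /* loads + stores */
    double inst_per_load_store;
    double mips;
    double inst_per_cycle;
    double hw_fp_per_cycle;       /* (FPU0 + FPU1) / cycles */
    double fp_plus_fma;           /* FPU0 + FPU1 + FMA */
    double mflips;                /* fp_plus_fma per second, in millions */
    double fma_percentage;        /* 100 * 2 * FMA / fp_plus_fma */
    double computation_intensity; /* fp_plus_fma / load_store_ops */
} hpm_set1_metrics;

/* The counts shown on the slides above. */
hpm_set1_counts hpm_slide_counts(void)
{
    hpm_set1_counts c = {
        35.099596, 20271809159.0, 14974657747.0, 4474101.0,
        2687036544.0, 5220888450.0, 2581927160.0, 519835526.0,
        792849657.0
    };
    return c;
}

hpm_set1_metrics hpm_derive(const hpm_set1_counts *c)
{
    hpm_set1_metrics m;

    assert(c != 0 && c->wall_seconds > 0.0 && c->cyc > 0.0);
    assert(c->tlb_miss > 0.0);

    m.load_store_ops = c->ld_cmpl + c->st_cmpl;
    m.fp_plus_fma = c->fpu0_cmpl + c->fpu1_cmpl + c->exec_fma;
    assert(m.load_store_ops > 0.0 && m.fp_plus_fma > 0.0);

    m.loads_per_tlb_miss = c->ld_cmpl / c->tlb_miss;
    m.inst_per_load_store = c->inst_cmpl / m.load_store_ops;
    m.mips = c->inst_cmpl / c->wall_seconds / 1.0e6;
    m.inst_per_cycle = c->inst_cmpl / c->cyc;
    m.hw_fp_per_cycle = (c->fpu0_cmpl + c->fpu1_cmpl) / c->cyc;
    m.mflips = m.fp_plus_fma / c->wall_seconds / 1.0e6;
    m.fma_percentage = 100.0 * 2.0 * c->exec_fma / m.fp_plus_fma;
    m.computation_intensity = m.fp_plus_fma / m.load_store_ops;
    return m;
}
```

Passing the result of hpm_slide_counts() to hpm_derive() gives values that agree with the slide to the printed precision, for example about 0.739 instructions per cycle and about 110.96 Mflip/s. The rates here are computed from wall clock time, which is what matches the slide figures.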

15
HPM for Part of Program using LIBHPM

Instrumentation library for performance measurement of Fortran, C and C++ applications
Collects information and performs summarization during run-time, generates a performance file for each task
Uses the same sets of hardware counter events used by hpmcount
User can specify an event set with the file libHPMevents
For each instrumented point in a program, libhpm provides output:
Total count
Total duration (wall clock time)
Hardware performance counters information
Hardware derived metrics
Supports:
Multiple instrumentation points, nested instrumentation
OpenMP and threaded applications
Multiple calls to an instrumented point

16
LIBHPM Functions

C / C++
hpmInit(taskID)
hpmTerminate(taskID)
hpmStart(instID)
hpmStop(instID)
hpmTstart(instID)
hpmTstop(instID)

Fortran
f_hpminit(taskID)
f_hpmterminate(taskID)
f_hpmstart(instID)
f_hpmstop(instID)
f_hpmtstart(instID)
f_hpmtstop(instID)

17
Using LIBHPM - C

Declaration
#include "libhpm.h"
C usage
MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
hpmInit(taskID, "hpm_test");
hpmStart(1, "outer call");
/* code segment to be timed */
hpmStop(1);
hpmTerminate(taskID);
Compilation
mpcc_r -I/usr/local/apps/HPM_V2.3/include -O3 -lhpm_r -lpmapi -lm -qarch=pwr3 -qstrict -qsmp=omp -L/usr/local/apps/HPM_V2.3/lib hpm_test.c -o hpm_test.x

18
Using LIBHPM - Fortran

Declaration
#include "f_hpm.h"
Fortran usage
call MPI_COMM_RANK(MPI_COMM_WORLD, taskID, ierr)
call f_hpminit(taskID)
call f_hpmstart(instID)
! code segment to be timed
call f_hpmstop(instID)
call f_hpmterminate(taskID)
call MPI_FINALIZE(ierr)
Compilation
mpxlf_r -I/usr/local/apps/HPM_V2.3/include -qsuffix=cpp=f -O3 -qarch=pwr3 -qstrict -qsmp=omp -L/usr/local/apps/HPM_V2.3/lib -lhpm_r -lpmapi -lm hpm_test.f -o hpm_test.x

19
Using LIBHPM - Threads

call f_hpminit(taskID)
//do
  call f_hpmtstart(10)
  do_work
  call f_hpmtstop(10)
end //do
//do
  call f_hpmtstart(20+my_thread_ID)
  do_work
  call f_hpmtstop(20+my_thread_ID)
end //do
call f_hpmterminate(taskID)

20
HPM Code in C
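
#include <mpi.h>
#include <stdio.h>
#include "libhpm.h"
#define n 10000

int main(int argc, char *argv[])
{
    int taskID, i, numprocs;
    double a[n], b[n], c[n];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
    hpmInit(taskID, "hpm_test");

    /* Section 1: initialize the arrays; indices stay within 0..n-1 */
    hpmStart(1, "section 1");
    for (i = 1; i < n; i++) {
        a[i] = i;
        b[i] = n - 1;
    }
    hpmStop(1);

    /* Section 2: floating point work on the arrays */
    hpmStart(2, "section 2");
    for (i = 2; i < n; i++) {
        c[i] = a[i] * b[i] + a[i] / b[i];
    }
    hpmStop(2);

    hpmTerminate(taskID);
    MPI_Finalize();
    return 0;
}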

21
HPM Code in Fortran

      program hpm_test
      parameter (n=10000)
      integer taskID,ierr,numtasks
      dimension a(n),b(n),c(n)
      include "mpif.h"
#include "f_hpm.h"
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,taskID,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,numtasks,ierr)
      call f_hpminit(taskID,"hpm_test")
      call f_hpmstart(5,"section1")
      do i=1,n
        a(i)=real(i)
        b(i)=real(n-i)
      enddo
      call f_hpmstop(5)
      call f_hpmterminate(taskID)
      call MPI_FINALIZE(ierr)
      end

22
Compiling and Linking

FF = mpxlf_r
HPM_DIR = /usr/local/apps/HPM_V2.3
HPM_INC = -I$(HPM_DIR)/include
HPM_LIB = -L$(HPM_DIR)/lib -lhpm_r -lpmapi -lm
FFLAGS = -qsuffix=cpp=f -O3 -qarch=pwr3 -qstrict -qsmp=omp
# Note: -qsuffix=cpp=f is only required for Fortran code with the .f suffix
hpm_test.x: hpm_test.f
	$(FF) $(HPM_INC) $(FFLAGS) hpm_test.f $(HPM_LIB) -o hpm_test.x

23
HPMVIZ

takes as input the performance files generated by
libhpm
Usage
> hpmviz <performance files (.viz)>
Lets the user define a range of values considered satisfactory
Red: below the predefined minimum recommended value
Green: above the threshold value
HPMVIZ left pane of the window
displays for each instrumented point, identified
by its label, the inclusive duration, exclusive,
and count.
HPMVIZ right pane of the window
shows the corresponding source code which can be
edited and saved.
The metrics windows
display the task ID, Thread ID, count, exclusive
duration, inclusive duration, and the derived
hardware metrics.

24
HPMVIZ
25
IBM SP HPM Toolkit Summary

A complete problem set
Derived metrics
Analysis of error message
Analyze derived metrics
HPMCOUNT: very accurate with low overhead, non-intrusive, general view for whole program
LIBHPM: same event sets as hpmcount, for part of a program
HPMVIZ: easier to view the hardware counter information and derived metrics

26
HPM References