1
Lecture 2: Performance Measurement
2
Performance Evaluation
  • The primary duty of software developers is to
    create functionally correct programs
  • Performance evaluation is the part of software
    development that ensures programs also perform well

3
Performance Analysis Cycle
  • Include an optimization phase in the development
    cycle, just like the testing and debugging phases

Code Development → functionally complete and correct program →
Measure → Analyze → Modify / Tune → complete, correct and
well-performing program → Usage
4
Goals of Performance Analysis
  • The goal of performance analysis is to provide
    quantitative information about the performance of
    a computer system

5
Goals of Performance Analysis
  • Compare alternatives: when purchasing a new
    computer system, provide quantitative information
    for the decision
  • Determine the impact of a feature: when designing
    a new system or upgrading, provide a
    before-and-after comparison
  • System tuning: find the parameter settings that
    produce the best overall performance
  • Identify relative performance: quantify the
    performance relative to previous generations
  • Performance debugging: identify performance
    problems and correct them
  • Set expectations: determine the expected
    capabilities of the next generation

6
Performance Evaluation
  • Performance Evaluation steps
  • Measurement / Prediction
  • What to measure? How to measure?
  • Modeling for prediction
  • Simulation
  • Analytical Modeling
  • Analysis / Reporting
  • Performance metrics

7
Performance Measurement
  • Interval Timers
  • Hardware Timers
  • Software Timers

8
Performance Measurement
  • Hardware Timers
  • Counter value is read from a memory location
  • Time is calculated as

[Figure: a clock with period Tc drives an n-bit counter whose
value is read over the processor memory bus]

Time = (x2 − x1) × Tc, where x1 and x2 are the counter values
read at the start and end of the interval
9
Performance Measurement
  • Software Timers
  • Interrupt-based
  • When an interrupt occurs, the interrupt-service
    routine increments the timer value, which is read
    by a program
  • Time is calculated as

[Figure: a clock with period Tc feeds a prescaling counter
whose output drives the processor interrupt input]

Time = (x2 − x1) × Tc, where Tc here is the period between
timer interrupts (after prescaling)
10
Performance Measurement
  • Timer Rollover
  • Occurs when an n-bit counter undergoes a
    transition from its maximum value 2^n − 1 to zero
  • There is a trade-off between roll over time and
    accuracy

Tc       32-bit rollover    64-bit rollover
10 ns    42 s               5850 years
1 µs     1.2 hours          0.5 million years
1 ms     49 days            0.5 × 10^9 years
11
Timers
  • Solution
  • Use a 64-bit integer (rolls over after more than
    half a million years at microsecond resolution)
  • Or have the timer return two values
  • One represents seconds
  • One represents microseconds since the last second
  • With 32-bit values, the rollover then takes over
    100 years

12
Performance Measurement
  • Interval Timers
  • T0 ← read current time
  • Event being timed
  • T1 ← read current time
  • Time for the event is T1 − T0
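As a concrete illustration, here is a minimal sketch of this pattern in C using the POSIX gettimeofday() call (one of the timer interfaces discussed later); foo() is a placeholder standing in for the event being timed:

    #include <stdio.h>
    #include <sys/time.h>

    /* convert a struct timeval (seconds + microseconds) to seconds */
    double to_seconds(struct timeval tv) {
        return (double)tv.tv_sec + tv.tv_usec * 1e-6;
    }

    void foo(void) {                      /* placeholder event */
        volatile long s = 0;
        for (long i = 0; i < 10000000; i++) s += i;
    }

    int main(void) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);          /* T0: read current time */
        foo();                            /* event being timed */
        gettimeofday(&t1, NULL);          /* T1: read current time */
        printf("event took %.6f seconds\n",
               to_seconds(t1) - to_seconds(t0));
        return 0;
    }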

13
Performance Measurement
  • Timer Overhead
  • T1: time from initiating the read_time call until
    the current time is read
  • T2: time from the read until the timer call
    returns and the event begins
  • T3: duration of the event itself
  • T4: time from the event's end until the second
    read_time call reads the current time
  • Measured time
  • Tm = T2 + T3 + T4
  • Desired measurement
  • Te = Tm − (T2 + T4) = Tm − (T1 + T2), since T1 ≈ T4
  • Timer overhead
  • Tovhd = T1 + T2
  • Te should be 100-1000 times greater than Tovhd.

[Figure: timeline of a timed event marked with the
intervals T1, T2, T3, T4]
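To estimate Tovhd in practice, one common trick is to read the timer twice with nothing in between; the difference approximates T1 + T2. A minimal sketch, continuing the gettimeofday()-based example above (to_seconds() as defined there):

    struct timeval a, b;
    gettimeofday(&a, NULL);   /* first read */
    gettimeofday(&b, NULL);   /* second read, no event in between */
    /* the difference approximates the timer overhead Tovhd = T1 + T2 */
    printf("timer overhead = %.9f seconds\n",
           to_seconds(b) - to_seconds(a));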
14
Performance Measurement
  • Timer Resolution
  • Resolution is the smallest change that can be
    detected by an interval timer.

n·Tc < Te < (n+1)·Tc. If Tc is large relative to
the event being measured, it may be impossible to
measure the duration of the event.
15
Performance Measurement
  • Measuring Short Intervals
  • Te < Tc: the event is shorter than one clock period

[Figure: depending on where the event falls relative to the
clock edges, the counter registers either 1 tick (a clock
edge falls inside Te) or 0 ticks (no edge inside Te)]
16
Performance Measurement
  • Measuring Short Intervals
  • Solution: repeat the measurement n times
  • Average execution time Te = (m × Tc) / n
  • m = number of 1s measured
  • Alternatively, Te = (Tt / n) − h
  • Tt = total execution time of n repetitions
  • h = per-repetition overhead

[Figure: n repetitions of an event of duration Te, measured
against clock period Tc over a total time Tt]
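A minimal sketch of the second approach in C, assuming the to_seconds() helper from the earlier example (foo() and the repetition count N are placeholders): time N back-to-back repetitions, estimate the per-repetition loop overhead h with an empty loop, and subtract. Compile without optimization, or the empty loop may be removed:

    #define N 1000000
    struct timeval t0, t1;
    double tt, h;
    int i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < N; i++)
        foo();                             /* event repeated N times */
    gettimeofday(&t1, NULL);
    tt = to_seconds(t1) - to_seconds(t0);  /* Tt: total time */

    gettimeofday(&t0, NULL);
    for (i = 0; i < N; i++)
        ;                                  /* empty loop: repetition overhead */
    gettimeofday(&t1, NULL);
    h = (to_seconds(t1) - to_seconds(t0)) / N;

    printf("Te = %.9f seconds\n", tt / N - h);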
17
Performance Measurement
  • Time
  • Elapsed time / wall-clock time / response time
  • Latency to complete a task, including disk
    accesses, memory accesses, I/O, operating system
    overhead, and everything else (including time
    consumed by other programs on a time-sharing system)
  • CPU time
  • The time CPU is computing, not including I/O time
    or waiting time
  • User time / user CPU time
  • CPU time spent in the program
  • System time / system CPU time
  • CPU time spent in the operating system performing
    tasks requested by the program

18
Performance Measurement
  • UNIX time command
  • Example output: 90.7u 12.9s 2:39 65%
  • 90.7u: user time (90.7 s of user CPU time)
  • 12.9s: system time (12.9 s of system CPU time)
  • 2:39: elapsed time (2 min 39 s)
  • 65%: percentage of elapsed time spent on this
    process ((90.7 + 12.9) / 159 ≈ 65%)
  • Drawbacks
  • Resolution is in milliseconds
  • Different sections of the code cannot be timed
19
Timers
  • A timer is a function, subroutine, or program that
    can be used to return the amount of time spent in
    a section of code.

zero = 0.0;
t0 = timer(zero);
< code segment >
t1 = timer(t0);
time = t1;

t0 = timer();
< code segment >
t1 = timer();
time = t1 - t0;
20
Timers
  • Read Wadleigh & Crawford, pp. 130-136, for
    time, clock, gettimeofday, etc.

21
Timers
  • Measuring Timer Resolution

main() {
    . . .
    zero = 0.0;
    t0 = timer(zero);
    t1 = 0.0;
    j = 0;
    while (t1 == 0.0) {
        j++;
        zero = 0.0;
        t0 = timer(zero);
        foo(j);
        t1 = timer(t0);
    }
    printf("It took %d iterations for a nonzero time\n", j);
    if (j == 1)
        printf("timer resolution < %13.7f seconds\n", t1);
    else
        printf("timer resolution is %13.7f seconds\n", t1);
}

foo(n) {
    . . .
    i = 0;
    for (j = 0; j < n; j++)
        i++;
    return(i);
}
22
Timers
  • Measuring Timer Resolution
  • Using clock(): It took 682 iterations for a nonzero
    time; timer resolution is 0.0200000 seconds
  • Using times(): It took 720 iterations for a nonzero
    time; timer resolution is 0.0200000 seconds
  • Using getrusage(): It took 7374 iterations for a
    nonzero time; timer resolution is 0.0002700 seconds
23
Timers
  • Spin Loops
  • For codes that take less time to run than the
    resolution of the timer
  • The first call to a function may require an
    inordinate amount of time, so the minimum over
    all timed calls may be the desired value.

main() {
    . . .
    zero = 0.0;
    t2 = 100000.0;
    for (j = 0; j < n; j++) {
        t0 = timer(zero);
        foo(j);
        t1 = timer(t0);
        t2 = min(t2, t1);
    }
    t2 = t2 / n;
    printf("Minimum time is %13.7f seconds\n", t2);
}

foo(n) {
    . . .
    < code segment >
}
24
Profilers
  • A profiler automatically inserts timing calls into
    an application to collect timing data
  • It is used to identify the portions of the
    program that consume the largest fraction of the
    total execution time.
  • It may also be used to find system-level
    bottlenecks in a multitasking system.
  • Profilers may alter the timing of a program's
    execution

25
Profilers
  • Data collection techniques
  • Sampling-based
  • These profilers use a predefined clock; at every
    multiple of the clock tick, the program is
    interrupted and its state information is recorded.
  • They give a statistical profile of the program's
    behavior.
  • They may miss some important events.
  • Event-based
  • Events are defined (e.g. entry into a subroutine)
    and data about these events are collected.
  • The collected information shows the exact
    execution frequencies.
  • This approach has substantial run-time overhead
    and memory requirements.
  • Information kept
  • Trace-based: the profiler keeps all information
    it collects.
  • Reductionist: only statistical information is
    collected.

26
Performance Evaluation
  • Performance Evaluation steps
  • Measurement / Prediction
  • What to measure? How to measure?
  • Modeling for prediction
  • Analysis / Reporting
  • Performance metrics

27
Predicting Performance
  • Performance of simple kernels can be predicted to
    a high degree of accuracy
  • Predicted (theoretical) performance and measured
    performance should be close
  • It is preferred that the measured performance be
    over 80% of the theoretical peak performance

28
Performance Metrics
  • Time
  • Elapsed time / wall-clock time / response time
  • CPU time
  • User time / user CPU time
  • System time / system CPU time

29
Performance Modeling
  • CPU Performance

CPU time = (instructions / program) × (cycles / instruction) × (time / cycle)

CPI = cycles per instruction

CPU time = instruction count × CPI × (1 / clock rate)
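As a quick worked example with made-up numbers: a program that executes 10^9 instructions with an average CPI of 1.5 on a 2 GHz processor takes 10^9 × 1.5 × (1 / (2 × 10^9)) = 0.75 seconds of CPU time.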
30
Performance Modeling
  • CPU Performance
  • CPI
  • is an average
  • depends on the design of micro-architecture
    (hardwired/microprogrammed, pipelined)
  • Number of instructions
  • is the number of instructions executed at runtime
  • Depends on
  • instruction set architecture (ISA)
  • compiler

CPU time = instruction count × CPI × (1 / clock rate)
31
Performance Modeling
  • CPU Performance
  • Drawbacks
  • In modern computers, no program runs without the
    operating system also running on the hardware
  • Comparing performance between machines with
    different operating systems would therefore be unfair

32
Performance Evaluation
  • Performance Evaluation steps
  • Measurement / Prediction
  • What to measure? How to measure?
  • Modeling for prediction
  • Simulation
  • Analytical Modeling
  • Queuing Theory
  • Analysis / Reporting
  • Performance metrics

33
Performance Metrics
  • Performance Comparison
  • Relative performance

PerformanceX = 1 / Execution timeX

Performance ratio = PerformanceX / PerformanceY = Execution timeY / Execution timeX
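For example, with made-up numbers: if machine X runs a program in 10 s and machine Y takes 15 s, the ratio is 15 / 10 = 1.5, so X is 1.5 times faster than Y.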
34
Performance Metrics
  • Relative Performance
  • If the workload consists of more than one program,
    total execution time may be used.
  • If more than one machine is being compared, one of
    them must be selected as the reference.

35
Performance Metrics
  • Throughput
  • Total amount of work done in a given time
  • Measured in tasks per time unit
  • Can be used for
  • Operating system performance
  • Pipeline performance
  • Multiprocessor performance

36
Performance Metrics
  • Statistical Analysis
  • Used to compare performance
  • Workload consists of many programs
  • Depends on the nature of the data as well as
    distribution of the test results

37
Performance Metrics
  • Statistical Analysis
  • Arithmetic mean
  • May be misleading if the data are skewed or
    scattered

Arithmetic mean = (Σ xi) / n,  1 ≤ i ≤ n

           MA      MB      MC
Prog1        50     100     500
Prog2       400     800     800
Prog3      5550    5100    4700
Average    2000    2000    2000
38
Performance Metrics
  • Statistical Analysis
  • Weighted average
  • weight is the frequency of each program in daily
    processing
  • Results may change with a different set of
    execution frequencies

Weighted average = Σ wi · xi,  1 ≤ i ≤ n

          weight    MA      MB      MC
Prog1       60%       50     100     500
Prog2       30%      400     800     800
Prog3       10%     5550    5100    4700
Average              705     810    1010
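For MA, for instance: 0.6 × 50 + 0.3 × 400 + 0.1 × 5550 = 30 + 120 + 555 = 705.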
39
Performance Metrics
  • Statistical Analysis
  • Geometric mean
  • Results are stated in relation to the performance
    of a reference machine

Geometric mean = (Π xi)^(1/n),  1 ≤ i ≤ n

           MA     norm. to MB    MB (reference)   norm. to MB    MC     norm. to MB
Prog1        50   2                 100           1                500   0.2
Prog2       400   2                 800           1                800   1
Prog3      5550   0.92             5100           1               4700   1.085
Average           1.54                            1                      0.60
  • Results are consistent no matter which system is
    chosen as reference
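For MA, for instance: (2 × 2 × 0.92)^(1/3) ≈ 1.54, and the ratio 1.54 / 0.60 between MA and MC stays the same whichever machine is used to normalize the table.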

40
Performance Metrics
  • Statistical Analysis
  • Harmonic mean
  • Used to compare performance results that are
    expressed as a rate (e.g. operations per second,
    throughput, etc.)
  • The slowest rates have the greatest influence on
    the result
  • → it identifies areas where performance can be
    improved

Harmonic mean = n / (Σ 1/xi),  1 ≤ i ≤ n
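For two programs running at 10 and 40 MFLOPS, the harmonic mean is 2 / (1/10 + 1/40) = 16 MFLOPS, much closer to the slower rate than the arithmetic mean of 25.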
41
Performance Metrics
  • MIPS (Million instructions per second)
  • Includes both integer and floating point
    performance
  • Number of instructions in a program varies
    between different computers
  • Number of instructions varies between different
    programs on the same computer

MIPS = Instruction count / (Execution time × 10^6) = Clock rate / (CPI × 10^6)
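For instance, a 500 MHz machine with an average CPI of 2 delivers (500 × 10^6) / (2 × 10^6) = 250 MIPS.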
42
Performance Metrics
  • MFLOPS
  • (Million floating point operations per second)
  • Gives the performance of floating-point
    operations only
  • Different mixes of integer and floating-point
    operations may have different execution times
  • Integer and floating-point units work
    independently
  • Instruction and data caches provide instruction
    and data concurrently

43
Performance Metrics
  • Utilization
  • Speciality ratio
  • Speciality ratio ≈ 1 → general-purpose system

Utilization = Busy time / Total time

Speciality ratio = Maximum performance / Minimum performance
44
Performance Metrics
  • Asymptotic and Half performance
  • r∞: asymptotic performance
  • n1/2: half-performance length

T = (n + n1/2) / r∞,   where r∞ = 1/t and n1/2 = t0/t

[Figure: T versus n is a straight line with slope 1/r∞ = t and
intercept t0; T = 2t0 at n = n1/2, and the line crosses the
n-axis at −n1/2]
45
Performance Evaluation Methods
  • Benchmarking
  • Monitoring
  • Analytical Modeling
  • Queuing Theory

46
Benchmarking
  • A benchmark is a program that is run on a computer
    to measure its performance and compare it with
    that of other machines
  • The best benchmark is the user's workload: the
    mixture of programs and operating-system commands
    that users run on a machine.
  • → Not practical
  • Standard benchmarks

47
Benchmarking
  • Types of Benchmarks
  • Synthetic benchmarks
  • Toy benchmarks
  • Kernels
  • Real Applications

48
Benchmarking
  • Synthetic benchmarks
  • Artificially created benchmark programs that
    represent the average frequency of operations of
    a large set of programs
  • Whetstone benchmark
  • Dhrystone benchmark
  • Rhealstone benchmark

49
Benchmarking
  • Synthetic benchmarks
  • Whetstone benchmark
  • First written in Algol 60 in 1972; today Fortran,
    C/C++ and Java versions are available
  • Represents the workload of numerical applications
  • Measures floating-point arithmetic performance
  • Unit is millions of Whetstone instructions per
    second (MWIPS)
  • Shortcomings
  • Does not represent constructs in modern
    languages, such as pointers, etc.
  • Does not consider cache effects

50
Benchmarking
  • Synthetic benchmarks
  • Dhrystone benchmark
  • First written in Ada in 1984; today a C version
    is available
  • Statistics were collected on system software, such
    as operating systems, compilers and editors, and
    on a few numerical programs
  • Measures integer and string performance; no
    floating-point operations
  • Unit is the number of program iterations completed
    per second
  • Shortcomings
  • Does not represent real-life programs
  • Compiler optimization overstates system
    performance
  • Small code that may fit in the instruction cache

51
Benchmarking
  • Synthetic benchmarks
  • Rhealstone benchmark
  • Multi-tasking real-time systems
  • Factors are
  • Task switching time
  • Pre-emption time
  • Interrupt latency time
  • Semaphore shuffling time
  • Deadlock breaking time
  • Datagram throughput time
  • Metric is Rhealstones per second

Rhealstones per second = Σ (i = 1 to 6) wi · (1/ti)
52
Benchmarking
  • Toy benchmarks
  • 10-100 lines of code whose result is known before
    the toy program is run
  • Quick sort
  • Sieve of Eratosthenes
  • Finds prime numbers (animation:
    http://upload.wikimedia.org/wikipedia/commons/8/8c/New_Animation_Sieve_of_Eratosthenes.gif)

func sieve(var N)
    var PrimeArray as array of size N
    initialize PrimeArray to all true
    for i from 2 to N
        for each j from i + 1 to N, where i divides j
            set PrimeArray(j) = false
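A runnable C version of the same idea (a sketch; it strikes out multiples starting from 2i, which visits exactly the j > i that i divides, as in the pseudocode above):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 100

    int main(void) {
        char *prime = malloc(N + 1);
        for (int i = 0; i <= N; i++)
            prime[i] = 1;                   /* assume all prime */
        for (int i = 2; i <= N; i++)
            for (int j = 2 * i; j <= N; j += i)
                prime[j] = 0;               /* j is divisible by i */
        for (int i = 2; i <= N; i++)
            if (prime[i]) printf("%d ", i);
        printf("\n");
        free(prime);
        return 0;
    }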
53
Benchmarking
  • Kernels
  • Key pieces of codes from real applications.
  • LINPACK and BLAS
  • Livermore Loops
  • NAS

54
Benchmarking
  • Kernels
  • LINPACK and BLAS Libraries
  • LINPACK: linear algebra package
  • Measures floating-point computing power
  • Solves a system of linear equations Ax = b with
    Gaussian elimination
  • Metric is MFLOP/s
  • DAXPY: the most time-consuming routine
  • Used as the measure for the TOP500 list
  • BLAS: Basic Linear Algebra Subprograms
  • LINPACK makes use of the BLAS library

55
Benchmarking
  • Kernels
  • LINPACK and BLAS Libraries
  • SAXPY: Scalar Alpha X Plus Y
  • Y = a·X + Y, where X and Y are vectors and a is a
    scalar
  • SAXPY for single precision and DAXPY for double
    precision
  • Generic implementation

for (int i = m; i < n; i++)
    y[i] = a * x[i] + y[i];
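Wrapped as a self-contained function, a sketch looks like this (the reference BLAS daxpy additionally takes stride arguments incx/incy, omitted here):

    /* y = a*x + y over elements m..n-1, double precision */
    void daxpy(int m, int n, double a, const double *x, double *y) {
        for (int i = m; i < n; i++)
            y[i] = a * x[i] + y[i];
    }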
56
Benchmarking
  • Kernels
  • Livermore Loops
  • Developed at LLNL
  • Originally in Fortran, now also in C
  • 24 numerical application kernels, such as
  • hydrodynamics fragment,
  • incomplete Cholesky conjugate gradient,
  • inner product,
  • banded linear systems solution, tridiagonal
    linear systems solution,
  • general linear recurrence equations,
  • first sum, first difference,
  • 2-D particle in a cell, 1-D particle in a cell,
  • Monte Carlo search,
  • location of a first array minimum, etc.
  • Metrics are arithmetic, geometric and harmonic
    mean of CPU rate

57
Benchmarking
  • Kernels
  • NAS Parallel Benchmarks
  • Developed at the NASA Advanced Supercomputing
    division
  • Paper-and-pencil benchmarks: algorithmic
    specifications rather than fixed code
  • 11 benchmarks, such as
  • Discrete Poisson equation,
  • Conjugate gradient
  • Fast Fourier Transform
  • Bucket sort
  • Embarrassingly parallel
  • Nonlinear PDE solution
  • Data traffic, etc.

58
Benchmarking
  • Real Applications
  • Programs that are run by many users
  • C compiler
  • Text processing software
  • Frequently used user applications
  • Modified scripts used to measure particular
    aspects of system performance, such as
    interactive behavior, multiuser behavior

59
Benchmarking
  • Benchmark Suites
  • Desktop Benchmarks
  • SPEC benchmark suite
  • Server Benchmarks
  • SPEC benchmark suite
  • TPC
  • Embedded Benchmarks
  • EEMBC

60
Benchmarking
  • SPEC Benchmark Suite
  • Desktop Benchmarks
  • CPU-intensive
  • SPEC CPU2000
  • 12 integer (CINT2000) and 14 floating-point
    (CFP2000) benchmarks
  • Real application programs
  • C compiler
  • Finite element modeling
  • Fluid dynamics, etc.
  • Graphics-intensive
  • SPECviewperf
  • Measures rendering performance using OpenGL
  • SPECapc
  • Pro/Engineer: 3D rendering with solid models
  • SolidWorks: 3D CAD/CAM design tool;
    CPU-intensive and I/O-intensive tests
  • Unigraphics: solid modeling for an aircraft
    design
  • Server Benchmarks
  • SPECWeb for web servers
  • SPECSFS for NFS performance, throughput-oriented

61
Benchmarking
  • TPC Benchmark Suite
  • Server Benchmark
  • Transaction processing (TP) benchmarks
  • Real applications
  • TPC-C: simulates a complex query environment
  • TPC-H: ad hoc decision support
  • TPC-R: business decision support system where
    users run a standard set of queries
  • TPC-W: business-oriented transactional web server
  • Measures performance in transactions per second;
    throughput is measured only when the response-time
    limit is met.
  • Allows cost-performance comparisons

62
Benchmarking
  • EEMBC Benchmarks
  • for embedded computing systems
  • 34 benchmarks from 5 different application
    classes
  • Automotive/industrial
  • Consumer
  • Networking
  • Office automation
  • Telecommunications

63
Timers
  • Roll Over
  • Suppose a timer returns 32-bit integer data and
    measures microseconds.
  • It rolls over after 2^32 microseconds (≈ 1.2
    hours)
  • Timers that measure milliseconds and use 32-bit
    data roll over after 2^32 milliseconds (≈ 49 days)
  • There is a trade-off between roll-over time and
    accuracy.

64
Performance Evaluation
  • Performance Evaluation steps
  • Measurement / Prediction
  • What to measure? How to measure?
  • Modeling for prediction
  • Analysis / Reporting
  • Performance metrics