Title: CS 2200 Lecture 04b Metrics
Slide 1: CS 2200 Lecture 04b: Metrics
- (Lectures based on the work of Jay Brockman, Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy, Ken MacKenzie, Richard Murphy, and Michael Niemier)
Slide 2: Determining Performance
- Response time
  - Usually something the user cares about!
  - Also referred to as execution time
  - The time between the start and completion of an event
- Throughput
  - The amount of work done in a given amount of time
- There are tradeoffs between the two: improving one often adversely affects the other (of course)
Slide 3: Let's look at an example
- If planes were computers:
  - 747: highest throughput, Concorde: lowest execution time, 737: cheapest, DC-8: longest range
  - Which one is best???
- Here, throughput depends on capacity as well as speed
Slide 4: So, how do we compare?
- Best to stick with execution time! (more later)
- If we say X is faster than Y, we mean the execution time is lower on X than on Y.
- Alternatively, "X is n times faster than Y" means:

    n = Execution time_Y / Execution time_X = Performance_X / Performance_Y

    (since Performance = 1 / Execution time)

- Example: if Performance_X is 50 MHz and Performance_Y is 200 MHz, then n = 50/200 = 1/4, therefore X is 4 times slower than Y. (A small sketch follows below.)
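A minimal sketch of this comparison, treating performance as the inverse of execution time and reusing the slide's 50 MHz / 200 MHz figures; the helper name times_faster is just for this sketch:

```python
def times_faster(exec_time_y: float, exec_time_x: float) -> float:
    """Return n such that machine X is n times faster than machine Y."""
    return exec_time_y / exec_time_x

# Treat performance as the inverse of execution time, using the slide's
# illustrative ratings: Performance_X = 50 MHz, Performance_Y = 200 MHz.
exec_time_x = 1.0 / 50.0     # arbitrary time units
exec_time_y = 1.0 / 200.0
n = times_faster(exec_time_y, exec_time_x)
print(n)                     # 0.25, i.e. X is 4 times slower than Y
```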
Slide 5: The time's the thing
- There are many ways to consider execution time
  - Wall clock time, response time, and elapsed time: all different names for the latency to complete a task
    - (including disk/memory accesses, I/O, OS overhead, etc.)
  - CPU time: the time the CPU is working on a specific task, excluding I/O, program switching, etc.
    - User CPU time: CPU time spent in the program itself
    - System CPU time: time spent by the OS doing program-specific tasks
- Beware of other impostor metrics!
  - Execution time, or time for a task in general, is the only true and fair basis for comparison
Slide 6: Evaluating Performance
- For a true comparison of machine X to machine Y, you must run the same thing on X and Y!
- But what do you use to compare?
  - Ideally, you would have users run a typical workload over some period of time
  - This is not practical, however
  - Instead, different programs are used to gauge performance: benchmarks
Slide 7: Benchmarks
- In order of relevance...
- (1) Real programs
  - Obviously, different users use different programs
  - But a large subset of people WILL use C compilers, Windows products (gasp!), etc.
- (2) Kernels
  - What are they? Bits of a real program
  - No user would actually run them
  - They isolate/characterize the performance of individual machine features
Slide 8: More Benchmarks
- (3) Toy benchmarks
  - Usually short programs with previously known results
  - More useful for intro-to-programming assignments
- (4) Synthetic benchmarks
  - Similar to kernels: they try to match the average frequency of operations and operands of a large set of programs
  - Don't compute anything a user would ever want
  - More details and examples forthcoming
- Nothing is perfect, but industry demands a standard
Slide 9: What do people really use to benchmark?
- Most common are the Standard Performance Evaluation Corporation (SPEC) suites
  - SPEC INT and SPEC FP, for short
- They use a variety of applications
  - Variety lessens the weakness of any one particular benchmark
  - Variety helps keep opportunists from optimizing just for good benchmark performance
Slide 10: An Example Suite: SPEC95 INT
- Consists of 8 C programs used as benchmarks:
  - go: artificial intelligence, plays the game of go
  - m88ksim: Motorola 88K chip simulator, runs a test program
  - gcc: new version of GCC, builds SPARC code
  - compress: compresses and decompresses a file in memory
  - li: LISP interpreter
  - ijpeg: graphic compression and decompression
  - perl: manipulates strings and prime numbers in Perl
  - vortex: a database program
- See http://www.spec.org for more
Slide 11: Benchmarkers Beware!
- Benchmark suites are not perfect, but they are used because they provide a uniform metric
- Performance determines success and failure, and companies and researchers know it
- People falsely inflate benchmark performance
  - Add optimizations that enhance only benchmark performance
  - Hand-coded library calls (a name change will erase this one)
  - Special microcode for high-frequency segments
  - Register reservations for key constants
  - Compiler recognizes the benchmark and runs a different version
Slide 12: An example...for more confusion
- Consider three machines, A, B, and C, and two programs, P1 and P2
- All of the following are true:
  - A is 10 times faster than B for program P1
  - B is 10 times faster than A for program P2
  - A is 20 times faster than C for program P1
  - C is 50 times faster than A for program P2
  - B is 2 times faster than C for program P1
  - C is 5 times faster than B for program P2
- Which is the best?
Slide 13: Interpreting the example
- Any one of the previous statements is true!
- But which computer is better? By how much?
- One way: go back to execution time!
  - B is 9.1 times faster than A for programs P1 and P2 combined
  - C is 25 times faster than A for programs P1 and P2 combined
  - C is 2.75 times faster than B for programs P1 and P2 combined
- Given this, if we had to pick one configuration over another, what should we consider? (The totals are worked out in the sketch below.)
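The slide does not list the underlying execution times; the numbers below are one illustrative set (an assumption, not from the lecture) that reproduces every ratio on the previous slide, so the totals-based comparison can be checked:

```python
# Illustrative execution times (seconds) consistent with the slide's ratios;
# these specific numbers are assumed for the sketch, not given in the lecture.
times = {
    "A": {"P1": 1.0,  "P2": 1000.0},
    "B": {"P1": 10.0, "P2": 100.0},
    "C": {"P1": 20.0, "P2": 20.0},
}

total = {m: sum(t.values()) for m, t in times.items()}
print(total)                    # {'A': 1001.0, 'B': 110.0, 'C': 40.0}
print(total["A"] / total["B"])  # ~9.1  -> B is 9.1x faster than A overall
print(total["A"] / total["C"])  # ~25   -> C is 25x faster than A overall
print(total["B"] / total["C"])  # 2.75  -> C is 2.75x faster than B overall
```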
Slide 14: Some other options
- Gory math! Means, etc., etc.
- We'll only talk about two here:
  - The arithmetic mean
    - An average of the execution times that tracks total execution time
  - The weighted arithmetic mean
    - Can be used if the programs are not run equally often
    - (e.g., P1 is 40% of the load and P2 is 60% of the load; see the sketch below)
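A minimal sketch of the two means, using machine B's assumed times from the previous sketch and the slide's 40%/60% split:

```python
# Weighted arithmetic mean of execution times, using machine B's assumed
# times from the previous sketch and the slide's 40%/60% workload split.
p1_time, p2_time = 10.0, 100.0      # seconds (illustrative)
weights = {"P1": 0.4, "P2": 0.6}    # fraction of the workload

arithmetic_mean = (p1_time + p2_time) / 2
weighted_mean = weights["P1"] * p1_time + weights["P2"] * p2_time

print(arithmetic_mean)  # 55.0 seconds
print(weighted_mean)    # 64.0 seconds: P2's larger share pulls the mean up
```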
Slide 15: Other ways to measure performance
- (1) Use MIPS (millions of instructions per second)
  - MIPS is a rate of operations per unit time
  - Performance can be specified as the inverse of execution time, so faster machines have a higher MIPS rating
  - So, bigger MIPS = faster machine. Right?
    MIPS = Instruction Count / (Exec. Time x 10^6) = Clock Rate / (CPI x 10^6)
Slide 16: Wrong!!!
- There are 3 significant problems with using MIPS
- Problem 1
  - MIPS is instruction-set dependent
  - (And different computer brands usually have different instruction sets)
- Problem 2
  - MIPS varies between programs on the same computer
- Problem 3
  - MIPS can vary inversely to performance!
- Let's look at an example of why MIPS doesn't work
Slide 17: A MIPS Example (1)
- Consider the following computer, with instruction counts (in millions) for each instruction class:

                 Class A   Class B   Class C
    Compiler 1      5         1         1
    Compiler 2     10         1         1

- The machine runs at 100 MHz.
- Class A instructions require 1 clock cycle, class B instructions require 2 clock cycles, and class C instructions require 3 clock cycles.

    CPI = CPU Clock Cycles / Instruction Count = [ sum over i = 1..n of (CPI_i x C_i) ] / Instruction Count
Slide 18: A MIPS Example (2)

    CPI_1 = [(5 x 1) + (1 x 2) + (1 x 3)] x 10^6 cycles / [(5 + 1 + 1) x 10^6 instructions] = 10/7 = 1.43
    CPI_2 = [(10 x 1) + (1 x 2) + (1 x 3)] x 10^6 cycles / [(10 + 1 + 1) x 10^6 instructions] = 15/12 = 1.25

    MIPS_1 = 100 MHz / (10/7) = 70.0
    MIPS_2 = 100 MHz / 1.25 = 80.0

- So, compiler 2 has a higher MIPS rating and should be faster?
Slide 19: A MIPS Example (3)
- Now let's compare CPU time:

    CPU Time = (Instruction Count x CPI) / Clock Rate

    CPU Time_1 = (7 x 10^6 x 1.43) / (100 x 10^6) = 0.10 seconds
    CPU Time_2 = (12 x 10^6 x 1.25) / (100 x 10^6) = 0.15 seconds

- Therefore program 1 is faster despite a lower MIPS rating! (A sketch of the whole example follows below.)
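The whole example in one place, as a minimal Python sketch using the instruction counts and cycle costs from the slides above (the dictionary names are just for illustration):

```python
# Cycle cost per instruction class and the per-compiler instruction
# counts (in millions) from the example above.
cycles_per_class = {"A": 1, "B": 2, "C": 3}
counts = {
    "compiler 1": {"A": 5, "B": 1, "C": 1},
    "compiler 2": {"A": 10, "B": 1, "C": 1},
}
clock_rate = 100e6  # 100 MHz

for name, mix in counts.items():
    instr = sum(mix.values()) * 1e6
    cycles = sum(cycles_per_class[c] * n for c, n in mix.items()) * 1e6
    cpi = cycles / instr
    mips = clock_rate / (cpi * 1e6)
    cpu_time = instr * cpi / clock_rate
    print(f"{name}: CPI={cpi:.2f}  MIPS={mips:.1f}  CPU time={cpu_time:.2f}s")

# compiler 1: CPI=1.43  MIPS=70.0  CPU time=0.10s
# compiler 2: CPI=1.25  MIPS=80.0  CPU time=0.15s
# Higher MIPS, yet slower: MIPS can vary inversely to performance.
```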
Slide 20: Other bad benchmarks/fallacies
- "MFLOPS is a consistent and useful measure of performance"
  - Stands for millions of floating-point operations per second
  - Similar issues as MIPS
  - But even less fair, because the set of floating-point operations is not consistent across machines
- "Synthetic benchmarks predict performance"
  - These are not REAL programs
  - Compilers, etc. can artificially inflate performance
  - They don't reward optimization of behavior seen in real programs
  - What's an example of a synthetic benchmark?
Slide 21: More bad benchmarks/fallacies
- "Benchmarks remain valid indefinitely"
  - This is not true!
  - Companies will engineer for benchmark performance
  - These people are, well...let's say not very honest
Slide 22: Useful/important performance metrics
(Note the important-sounding title!)
- Let's talk about usable and important metrics
- One of the most important principles in computer design is to make the common case fast
- Specifically:
  - In making a design trade-off, favor the frequent case over the infrequent one
  - Improving the frequent event will help performance, too
  - Often, the frequent case is simpler than the infrequent one anyhow
Slide 23: Amdahl's Law
- Quantifies performance gain
- Amdahl's Law defined:
  - The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the enhancement is actually used.
- Amdahl's Law defines speedup:

    Speedup = Perf. for entire task using enhancement when possible / Perf. for entire task without using enhancement

    or

    Speedup = Execution time for entire task without enhancement / Execution time for entire task using enhancement when possible
Slide 24: Amdahl's Law and Speedup
- Speedup tells us how much faster the machine will run with an enhancement
- There are 2 things to consider (see the sketch below):
- 1st: the fraction of the computation time in the original machine that can use the enhancement
  - e.g., if a program executes in 30 seconds and 15 seconds of the execution can use the enhancement, the fraction is 1/2
  - (always <= 1)
- 2nd: the improvement gained by the enhancement (i.e., how much faster the task runs when the enhanced mode is used)
  - e.g., if the enhanced task takes 3.5 seconds and the original task took 7, we say the speedup is 2
  - (always > 1)
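Both inputs to Amdahl's Law, computed from the slide's mini-examples (a minimal sketch):

```python
# The two quantities Amdahl's Law needs, using the slide's mini-examples.
total_time, enhanceable_time = 30.0, 15.0   # seconds
fraction_enhanced = enhanceable_time / total_time
print(fraction_enhanced)                    # 0.5 (always <= 1)

original_time, enhanced_time = 7.0, 3.5     # seconds for the enhanced portion
speedup_enhanced = original_time / enhanced_time
print(speedup_enhanced)                     # 2.0 (always > 1)
```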
25Amdahls Law Equations
Fractionenhanced
Execution timenew
Execution timeold x
(1 Fractionenhanced)
Speedupenhanced
1
Execution Timeold
Speedupoverall
Execution Timenew
Fractionenhanced
(1 Fractionenhanced)
Speedupenhanced
Use previous equation, Solve for speedup
Please, please, please, dont just try to
memorize these equations and plug numbers into
them. Its always important to think about the
problem too!
Slide 26: Amdahl's Law Example
- A certain machine has a floating-point multiply that runs too slowly
  - It adversely affects benchmark performance
- One option:
  - Re-design the FP multiply hardware to make it run 15 times faster than it currently does
- However, the manager thinks:
  - Re-designing all of the FP hardware to make each FP instruction run 3 times faster is the way to go
- FP multiplies account for 10% of execution time
- FP instructions as a whole account for 30% of execution time
- Which improvement is better?
Slide 27: Amdahl's Law Example (cont.)
- The speedup gained by improving the multiply instruction is:
  - 1 / [(1 - 0.1) + (0.1 / 15)] = 1.10
- The speedup gained by improving all of the floating-point instructions is:
  - 1 / [(1 - 0.3) + (0.3 / 3)] = 1.25
- Believe it or not, the manager is right!
  - Improving all of the FP instructions, despite the smaller per-instruction improvement, is the better way to go (see the sketch below)
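A minimal sketch of the comparison; the helper name amdahl_speedup is just for this example:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup from enhancing a fraction of the execution time."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Option 1: FP multiply (10% of time) made 15x faster
print(amdahl_speedup(0.10, 15))   # ~1.10

# Option 2: all FP instructions (30% of time) made 3x faster
print(amdahl_speedup(0.30, 3))    # 1.25, the manager's option wins
```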
Slide 28: What does Amdahl's Law tell us?
- Diminishing returns exist (just like in economics)
  - Speedup diminishes as improvements are added
  - Corollary: if only a fraction of the code is affected, you can't speed up the task by more than the reciprocal of (1 - fraction); a short derivation follows below
- It serves as a guide to how much an enhancement will improve performance AND where to spend your resources
  - (It really is all about money after all!)
- Overall goal:
  - Spend your resources where you get the most improvement!
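To see where that bound comes from, let the enhancement's speedup grow without limit in the overall-speedup formula from the equations slide:

```latex
\lim_{S_e \to \infty} \mathrm{Speedup}_{\mathrm{overall}}
  = \lim_{S_e \to \infty} \frac{1}{(1 - f) + f / S_e}
  = \frac{1}{1 - f}
```

For example, if only 75% of the execution time can use an enhancement (f = 0.75), no enhancement, however fast, can deliver more than a 4x overall speedup.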
Slide 29: Execution time, Execution time, Execution time
- Execution time is significantly affected by the CPU clock rate
- The clock is also referred to/referenced by:
  - Clock ticks, clock periods, clocks, cycles, clock cycles
- Clock time is generally given as either:
  - Clock period (e.g., 2 ns)
  - Clock rate (e.g., 500 MHz)
- CPU time for a program can be expressed as:
  - CPU time = CPU clock cycles for program x Clock cycle time
  - OR
  - CPU time = CPU clock cycles for program / Clock rate
Slide 30: More CPU metrics
- Instruction count also figures into the mix
  - It can affect throughput, execution time, etc.
- We are interested in the instruction path length, or instruction count (IC)
- Using this information and the total number of clock cycles for the program, we can determine the clock Cycles Per Instruction (CPI)
- Note: sometimes you see the inverse, Instructions Per Clock cycle (IPC); this is really the same metric
Slide 31: Relating the metrics
- New metrics/formulas lead to alternative ways of expressing others:
  - CPU time = IC x CPI x Clock cycle time
  - OR
  - CPU time = (IC x CPI) / Clock rate
- DON'T memorize formulas. Think units!
- In fact, let's expand the above equation into units:
  - (instructions / program) x (clock cycles / instruction) x (seconds / clock cycle) = seconds / program
  - Execution time falls out! (A quick numeric sketch follows below.)
Slide 32: A CPU: The Bigger Picture
- Recall: CPU time = IC x CPI x Clock cycle time
- We can see that CPU performance is dependent on:
  - Clock rate (clock cycle time), CPI, and instruction count
- CPU time is directly proportional to all 3
  - Therefore an x% improvement in any one of them leads to an x% improvement in CPU performance
- But, everything usually affects everything:
    [Diagram: hardware technology and organization determine the clock cycle time; organization and the ISA determine the CPI; the ISA and compiler technology determine the instruction count.]
Slide 33: More detailed metrics
- Remember:
  - Not all instructions execute in the same number of clock cycles
  - Different programs have different instruction mixes
- Therefore we must weight the CPU time equation:

    CPU clock cycles = sum over i of (CPI_i x IC_i)

  - IC_i = number of times instruction i is executed
  - CPI_i = average number of clock cycles for instruction i
- Note: CPI should be measured and not just calculated, as you must take cache misses, etc. into account
Slide 34: An example
- Assume that we've made the following measurements:
  - Frequency of FP operations: 25%
  - Average CPI of FP operations: 4.0
  - Average CPI of other instructions: 1.33
  - Frequency of the FP square root (FPSQR) instruction: 2%
  - CPI of FPSQR: 20
- There are two new design alternatives to consider:
  - It is possible to reduce the CPI of FPSQR to 2!
  - It's also possible to reduce the average CPI of all FP operations to 2
- Which one is better for overall CPU performance?
Slide 35: An example, continued
- First we need to calculate a base for comparison:
  - CPI_original = (4.0 x 0.25) + (1.33 x 0.75) = 2.0
  - (Note: NO equations, just reasoning!)
- Next, compute the CPI for the enhanced-FPSQR option:
  - CPI_new FPSQR = CPI_original - 0.02 x (CPI_old FPSQR - CPI_new FPSQR)
  - CPI_new FPSQR = 2.0 - 0.02 x (20.0 - 2) = 1.64
- Now, we can compute the CPI for the new-FP option:
  - CPI_new FP = (0.75 x 1.33) + (0.25 x 2) = 1.5
  - This CPI is lower than that of the first alternative (reducing the FPSQR CPI to 2)
  - Therefore the speedup with this enhancement is 2.00 / 1.5 = 1.33 (see the sketch below)
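The whole example as a minimal Python sketch, using the measurements from the previous slide (variable names are just for illustration):

```python
# Instruction mix and CPIs from the example above.
freq_fp, cpi_fp       = 0.25, 4.0    # all floating-point operations
freq_other, cpi_other = 0.75, 1.33   # everything else
freq_fpsqr, cpi_fpsqr = 0.02, 20.0   # FP square root (a subset of the FP ops)

cpi_original = freq_fp * cpi_fp + freq_other * cpi_other    # ~2.0

# Alternative 1: reduce FPSQR's CPI from 20 to 2
cpi_alt1 = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)    # ~1.64

# Alternative 2: reduce the average CPI of all FP ops to 2
cpi_alt2 = freq_fp * 2.0 + freq_other * cpi_other           # ~1.5

print(cpi_original, cpi_alt1, cpi_alt2)
print(cpi_original / cpi_alt2)   # ~1.33 overall speedup for alternative 2
```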