Title: COMP 206: Computer Architecture and Implementation
1. COMP 206: Computer Architecture and Implementation
- Montek Singh
- Wed., Aug 26, 2002
2. Amdahl's Law
"Bottleneckology: Evaluating Supercomputers", Jack Worlton, COMPCON 85, pp. 405-406
The average execution rate (performance) is the weighted harmonic mean of the individual rates, each weighted by the fraction of results generated at that rate.
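In symbols: if a fraction $f_i$ of the results is generated at rate $R_i$, the average rate is
\[
R_{\mathrm{avg}} \;=\; \frac{1}{\sum_i f_i / R_i}.
\]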
3. Example of Amdahl's Law
30% of results are generated at the rate of 1 MFLOPS, 20% at 10 MFLOPS, and 50% at 100 MFLOPS. What is the average performance? What is the bottleneck?
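Applying the weighted harmonic mean:
\[
R_{\mathrm{avg}} \;=\; \frac{1}{\frac{0.3}{1} + \frac{0.2}{10} + \frac{0.5}{100}} \;=\; \frac{1}{0.325} \;\approx\; 3.08 \text{ MFLOPS}.
\]
The 1 MFLOPS rate is the bottleneck: it accounts for $0.3/0.325 \approx 92\%$ of the total time.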
4. Amdahl's Law (HP3 book, pp. 40-41)
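The law in the form given there: if an enhancement speeds up a fraction $f$ of the computation by a factor $s$, the overall speedup is
\[
\text{Speedup} \;=\; \frac{1}{(1-f) + f/s} \;\le\; \frac{1}{1-f}.
\]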
5. Implications of Amdahl's Law
- The performance improvement provided by a feature is limited by how often that feature is used
- As stated, Amdahl's Law is valid only if the system always works at exactly one of the rates
- If a non-blocking cache is used, or there is overlap between CPU and I/O operations, Amdahl's Law as given here is not applicable
- The bottleneck is the most promising target for improvements
- Make the common case fast
- Infrequent events, even if they consume a lot of time, will make little difference to performance
- Typical use: change only one parameter of the system, and compute the effect of this change
- The same program, with the same input data, should run on the machine in both cases
6. Make The Common Case Fast
- All instructions require an instruction fetch; only a fraction require a data fetch/store
- Optimize instruction access over data access
- Programs exhibit locality
- Spatial locality: items with addresses near one another tend to be referenced close together in time
- Temporal locality: recently accessed items are likely to be accessed in the near future
- Access to small memories is faster
- Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories.
7. Make The Common Case Fast (2)
- What is the common case?
- The rate at which the system spends most of its time
- The bottleneck
- What does this statement mean precisely?
- Make the common case faster, rather than making some other case faster
- Make the common case faster by a certain amount, rather than making some other case faster by the same amount
- Absolute amount?
- Relative amount?
- This principle is merely an informal statement of a frequently correct consequence of Amdahl's Law
8. Make The Common Case Fast (3a)
A machine produces 20% and 80% of its results at the rates of 1 and 3 MFLOPS, respectively. Which is more advantageous: to improve the 1 MFLOPS rate, or to improve the 3 MFLOPS rate?
Generalize the problem: assume the rates are x and y MFLOPS.
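For a small change of the same absolute amount to either rate, the analysis follows from differentiating the harmonic mean. With fractions $f = 0.2$ and $1-f = 0.8$, the average rate is $R(x,y) = \left(f/x + (1-f)/y\right)^{-1}$, and
\[
\frac{\partial R}{\partial x} = \frac{f}{x^2}\,R^2,
\qquad
\frac{\partial R}{\partial y} = \frac{1-f}{y^2}\,R^2,
\]
so improving $x$ wins exactly when $f/x^2 > (1-f)/y^2$. At $(x,y) = (1,3)$: $0.2 > 0.8/9 \approx 0.089$.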
At (x, y) = (1, 3), this indicates that it is better to improve x, the 1 MFLOPS rate, which is not the common case.
9. Make The Common Case Fast (3b)
Let's say that we want to make the same relative change to one or the other rate, rather than the same absolute change.
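For the same relative change, compare $x\,\partial R/\partial x = (f/x)\,R^2$ against $y\,\partial R/\partial y = \left((1-f)/y\right)R^2$; these are proportional to the fractions of total time spent at each rate. At $(x,y) = (1,3)$: $f/x = 0.2$ while $(1-f)/y \approx 0.267$.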
At (x, y) = (1, 3), this indicates that it is better to improve y, the 3 MFLOPS rate, which is the common case.
If there are two different execution rates,
making the common case faster by the same
relative amount is always more advantageous than
the alternative. However, this does not
necessarily hold if we make absolute changes of
the same magnitude. For three or more rates,
further analysis is needed.
10. Basics of Performance
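Presumably the standard performance equations (HP3, Ch. 1):
\[
\text{CPU time} \;=\; \text{IC} \times \text{CPI} \times t_{\text{clock}} \;=\; \frac{\text{IC} \times \text{CPI}}{\text{clock rate}},
\qquad
\text{Performance} \;=\; \frac{1}{\text{CPU time}},
\]
where IC is the dynamic instruction count and CPI the average clock cycles per instruction.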
11. Details of CPI
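Breaking CPI down by instruction class $i$, with $\text{IC}_i$ instructions of class $i$ each costing $\text{CPI}_i$ cycles:
\[
\text{CPI} \;=\; \frac{\text{CPU clock cycles}}{\text{IC}} \;=\; \sum_i \frac{\text{IC}_i}{\text{IC}} \times \text{CPI}_i.
\]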
12. MIPS
- Machines with different instruction sets?
- Programs with different instruction mixes?
- Dynamic frequency of instructions
- Uncorrelated with performance
- Marketing metric
- Meaningless Indicator of Processor Speed
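The metric itself, for reference:
\[
\text{MIPS} \;=\; \frac{\text{IC}}{\text{CPU time} \times 10^6} \;=\; \frac{\text{clock rate}}{\text{CPI} \times 10^6}.
\]
Dividing out the instruction count is precisely what makes MIPS blind to instruction-set and instruction-mix differences, per the objections above.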
13. MFLOP/s
- Popular in supercomputing community
- Often not where time is spent
- Not all FP operations are equal
- Normalized MFLOP/s
- Can magnify performance differences
- A better algorithm (e.g., with better data reuse)
can run faster even with a higher FLOP count
- DGEQRF vs. DGEQR2 in LAPACK
14. Aspects of CPU Performance
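The body of this slide is presumably the classic HP table of which factors affect which term of the CPU time equation (a reconstruction, not verbatim from the slide):

                    Inst. count   CPI   Clock rate
    Program              X         X
    Compiler             X         X
    Instruction set      X         X
    Organization                   X        X
    Technology                              X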
15. Example 1 (HP2, p. 31)
Which change is more effective on a certain machine: speeding up 10-fold the floating point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and that the two changes are mutually exclusive.)
Fsqrt = fraction of FP sqrt results
Rsqrt = rate of producing FP sqrt results
Fnon-sqrt = fraction of non-sqrt results
Rnon-sqrt = rate of producing non-sqrt results
Ffp = fraction of FP results
Rfp = rate of producing FP results
Fnon-fp = fraction of non-FP results
Rnon-fp = rate of producing non-FP results
Rbefore = average rate of producing results before enhancement
Rafter = average rate of producing results after enhancement
16. Example 1 (Solution using Amdahl's Law)
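With Amdahl's Law:
\[
\text{Speedup}_{\text{sqrt}} = \frac{1}{0.8 + 0.2/10} = \frac{1}{0.82} \approx 1.22,
\qquad
\text{Speedup}_{\text{FP}} = \frac{1}{0.5 + 0.5/2} = \frac{1}{0.75} \approx 1.33.
\]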
Improving all FP operations is more effective
17. Example 2
Which CPU performs better? Why?
18. Example 2 (Solution)
If the clock cycle time of A were only 1.1 times the clock cycle time of B, then CPU B would have about 9% higher performance.
19. Example 3
A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. We therefore hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?
20. Example 3 (Solution)
Before change
After change
Since CPU time increases, the change will not improve performance.
21. Example 4
A load-store machine has the characteristics shown below. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?
22. Example 4 (Solution)
Without optimization
With optimization
Performance increases, but MIPS decreases!
23. Performance of (Blocking) Caches
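The standard model for a blocking cache (HP3):
\[
\text{CPU time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{memory accesses}}{\text{instruction}} \times \text{miss rate} \times \text{miss penalty} \right) \times t_{\text{clock}}.
\]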
24. Example
Assume we have a machine where the CPI is 2.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 40% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the machine be if all memory accesses were cache hits?
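Each instruction makes one instruction fetch plus 0.4 data accesses on average, so:
\[
\text{CPI}_{\text{real}} = 2.0 + 1.4 \times 0.02 \times 25 = 2.7,
\qquad
\text{Speedup} = \frac{2.7}{2.0} = 1.35.
\]
The all-hits machine would be 1.35 times faster.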
25. Means
26. Weighted Means
27. Relations among Means
Equality holds if and only if all the elements
are identical.
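For positive values $x_1, \dots, x_n$, the three classical means in their weighted forms (weights $w_i \ge 0$, $\sum_i w_i = 1$) and the relation in question are:
\[
\text{AM} = \sum_i w_i x_i, \qquad
\text{GM} = \prod_i x_i^{w_i}, \qquad
\text{HM} = \frac{1}{\sum_i w_i / x_i},
\qquad
\text{AM} \ge \text{GM} \ge \text{HM}.
\]
With $w_i = 1/n$ these reduce to the familiar unweighted means.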
28. Summarizing Computer Performance
"Characterizing Computer Performance with a Single Number", J. E. Smith, CACM, October 1988, pp. 1202-1206
- The starting point is universally accepted: the time required to perform a specified amount of computation is the ultimate measure of computer performance
- How should we summarize (reduce to a single number) the measured execution times (or measured performance values) of several benchmark programs?
- Two required properties:
- A single-number performance measure for a set of benchmarks expressed in units of time should be directly proportional to the total (weighted) time consumed by the benchmarks
- A single-number performance measure for a set of benchmarks expressed as a rate should be inversely proportional to the total (weighted) time consumed by the benchmarks
29. Arithmetic Mean for Times
Smaller is better for execution times
30. Harmonic Mean for Rates
Larger is better for execution rates
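In symbols, for benchmark times $t_i$ with weights $w_i$ and the corresponding rates $r_i$:
\[
\text{Weighted AM of times} = \sum_i w_i t_i,
\qquad
\text{Weighted HM of rates} = \frac{1}{\sum_i w_i / r_i},
\]
respectively directly and inversely proportional to the total weighted time, consistent with the two required properties on slide 28.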
31. Avoid the Geometric Mean
- If benchmark execution times are normalized to some reference machine, and means of normalized execution times are computed, only the geometric mean gives consistent results no matter what the reference machine is (see Figure 1.17 in HP3, p. 38)
- This has led to declaring the geometric mean the preferred method of summarizing execution time (e.g., SPEC)
- Smith's comments:
- "The geometric mean does provide a consistent measure in this context, but it is consistently wrong."
- "If performance is to be normalized with respect to a specific machine, an aggregate performance measure such as total time or harmonic mean rate should be calculated before any normalizing is done. That is, benchmarks should not be individually normalized first."
32. Programs to Evaluate Performance
- (Toy) Benchmarks
- 10-100 line program
- sieve, puzzle, quicksort
- Synthetic Benchmarks
- Attempt to match average frequencies of real workloads
- Whetstone, Dhrystone
- Kernels
- Time-critical excerpts of real programs
- Livermore loops
- Real programs
- gcc, compress
33. SPEC: Standard Performance Evaluation Corporation
- First round 1989 (SPEC CPU89)
- 10 programs yielding a single number
- Second round 1992 (SPEC CPU92)
- SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
- Compiler flags unlimited; March 93 SPEC newsletter lists these for the DEC 4000 Model 610:
- spice: unix.c: /def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)")
- wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
- nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
- Third round 1995 (SPEC CPU95)
- Single flag setting for all programs; new set of programs (8 integer, 10 floating point)
- Phased out in June 2000
- SPEC CPU2000 released April 2000
34. SPEC95 Details
- Reference machine
- Sun SPARCstation 10/40
- 128 MB memory
- Sun SC 3.0.1 compilers
- Benchmarks larger than SPEC92
- Larger code size
- More memory activity
- Minimal calls to library routines
- Greater reproducibility of results
- Standardized build and run environment
- Manual intervention forbidden
- Definitions of baseline tightened
- Multiple numbers
- SPECint_base95, SPECint95, SPECfp_base95, SPECfp95
Source: SPEC
35. Trends in Integer Performance
Source: Microprocessor Report 13(17), 27 Dec 1999
36. Trends in Floating Point Performance
Source: Microprocessor Report 13(17), 27 Dec 1999
37. SPEC95 Ratings of Processors
Source: Microprocessor Report, 24 Apr 2000
38. SPEC95 vs SPEC CPU2000
Source: Microprocessor Report, 17 Apr 2000
Read "SPEC CPU2000: Measuring CPU Performance in the New Millennium", John L. Henning, Computer, July 2000, pp. 28-35
39. SPEC CPU2000 Example
- Baseline machine: Sun Ultra 5, 300 MHz UltraSPARC IIi, 256 KB L2
- Running time ratios scaled by a factor of 100
- Reference score of baseline machine is 100
- Reference time of 176.gcc should be 1100, not 110
- Example shows a 667 MHz Alpha processor on both CINT2000 and CINT95
Source: Microprocessor Report, 17 Apr 2000
40. Performance Evaluation
- Given that sales are a function of performance relative to the competition, there is big investment in improving the product as reported by the performance summary
- Good products are created when you have
- Good benchmarks
- Good ways to summarize performance
- If the benchmarks/summary are inadequate, then choose between improving the product for real programs vs. improving the product to get more sales
- Sales almost always wins!
- Execution time is the measure of computer performance!
- What about cost?
41. Cost of Integrated Circuits
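The standard HP3 cost model, which the next slide explains:
\[
\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}},
\]
\[
\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}},
\]
\[
\text{Die yield} = \text{Wafer yield} \times \left(1 + \frac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}.
\]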
42. Explanations
The second term in "Dies per wafer" corrects for the rectangular dies near the periphery of round wafers.
Die yield assumes a simple empirical model: defects are randomly distributed over the wafer, and yield is inversely proportional to the complexity of the fabrication process (indicated by α).
α = 3 for modern processes implies that the cost of a die is proportional to (Die area)^4.
43. Real World Examples
"Revised Model Reduces Cost Estimates", Linley Gwennap, Microprocessor Report 10(4), 25 Mar 1996
- 0.25-micron process standard, 0.18-micron available now
- BiCMOS is dead
- See data for current processors on slide 71
- Silicon-on-Insulator (SOI) process in the works
44. Moore's Law
"Cramming More Components onto Integrated Circuits", G. E. Moore, Electronics, pp. 114-117, April 1965
- Historical context
- Predicting implications of technology scaling
- Makes over 25 predictions, and all of them have come true
- Read the paper and find out these predictions!
- Moore's Law: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year."
- Based on extrapolation from five points!
- Later, a more accurate formula
- Technology scaling of integrated circuits following this trend has been the driver of much economic productivity over the last two decades
45. Moore's Law in Action at Intel
Source: Microprocessor Report 9(6), 8 May 1995
46. Moore's Law At Risk?
Source: Microprocessor Report, 24 Aug 1998
47. Characteristics of Workstation Processors
Source: Microprocessor Report, 24 Apr 2000
48. Where Do The Transistors Go?
Source: Microprocessor Report, 24 Apr 2000
- Logic contributes a (vanishingly) small fraction of the number of transistors
- Memory (mostly on-chip cache) is the biggest fraction
- Computing is free, communication is expensive
49. Chip Photographs
Source: http://micro.magnet.fsu.edu/chipshots/index.html
UltraSparc
HP-PA 8000
50. Embedded Processors
- More new instruction sets introduced in 1999 than in the PC market over the last 15 years
- Hot trends of 1999
- Network processors
- Configurable cores
- VLIW-based processors
- ARM unit sales now surpass 68K/ColdFire unit sales
- Diversity of market supports a wide range of performance, power, and cost
Source: Microprocessor Report, 17 Jan 2000
51. Power-Performance Tradeoff (Embedded)
Source: Microprocessor Report, 17 Jan 2000