Title: Based on slides from D. Patterson and
1COM 249 Computer Organization andAssembly
LanguageChapter 1 (continued)Performance
Based on slides from D. Patterson
and www-inst.eecs.berkeley.edu/cs152/
2Understanding Performance
- Algorithm
- Determines number of operations executed (IC and
possibly CPI) - Programming language, compiler architecture
- Determine number of machine instructions executed
per operation (IC, CPI) - Instruction set architecture
- Determines the instructions needed for a
function, the cycles needed for each instruction
and the clock rate of the processor (IC, CPI, and
Clock rate) - Processor and memory system
- Determine how fast instructions are executed
- I/O system (including OS)
- Determines how fast I/O operations are executed
1.4 Performance
3Comparison of Airplanes
Airplane Passenger Capacity Cruising Range (miles) Cruising Speed (MPH) Passenger Throughput (Pass x mph)
Boeing 777 375 4630 610 228,750
Boeing 747 470 4150 610 286,700
Concord 132 4000 1350 178,200
Douglas DC 8-50 146 8720 544 79,424
4Defining Performance
Which airplane has the best performance?
5Response Time and Throughput
- Response time
- How long it takes to do a task
- Throughput
- Total work done per unit time
- e.g., tasks/transactions/ per hour
- How are response time and throughput affected by
- Replacing the processor with a faster version?
- Adding more processors?
- Well focus on response time for now
6Response Time and Throughput
- Decreasing response time usually improves
throughput. - Do these changes increase response time or
throughput or both? - Replacing processor with faster one?
- Adding additional processors?
Both
Only throughput, since no one task gets done
faster.
7Relative Performance
- Define Performance
- Performance 1/Execution Time
- X is n time faster than Y
- Example time taken to run a program
- 10s on A, 15s on B
- Execution TimeB / Execution TimeA 15s / 10s
1.5 - So A is 1.5 times faster than B
8Measuring Execution Time
- Elapsed time
- Total response time, including all aspects
- Processing, I/O, OS overhead, idle time
- Determines system performance
- CPU time
- Time spent processing a given job
- Discounts I/O time, other jobs shares
- Comprises user CPU time and system CPU time
- Different programs are affected differently by
CPU and system performance
9CPU Clocking
- Operation of digital hardware governed by a
constant-rate clock
Clock period
Clock (cycles)
Data transferand computation
Update state
- Clock period duration of a clock cycle
- e.g., 250ps 0.25ns 2501012s
- Clock frequency (rate) cycles per second
- e.g., 4.0GHz 4000MHz 4.0109Hz
10Program Execution Time(CPU Time)
- CPU TIME Total CPU Clock Cycles X Clock Cycle
Length - Total CPU Clock Cycles Number of Instructions X
CPI - CPI Clock Cycles per Instruction
- Average number of clock cycles each instruction
takes to execute - CPU Designers Choice Trade off between the
number of instructions and the duration of the
clock cycle - Long cycle powerful but complex instructions
(CISC) - Short cycle simple instructions (RISC)
11Instruction Count and CPI
- Instruction Count for a program
- Determined by program, ISA and compiler
- Average cycles per instruction
- Determined by CPU hardware
- If different instructions have different CPI
- Average CPI affected by instruction mix
12CPI Example
- Computer A Cycle Time 250ps, CPI 2.0
- Computer B Cycle Time 500ps, CPI 1.2
- Same ISA
- Which is faster, and by how much?
A is faster
by this much
13Classic CPU Performance Equation
- Basic performance equation in terms of IC, CPI
and clock cycle time - CPU timeInstruction count x CPI x Clock cycle
time - Or
- since clock rate is the inverse of clock
cycle time - CPU Time Instruction count x CPI
- Cock Rate
14CPI in More Detail
- If different instruction classes take different
numbers of cycles
Relative frequency
15CPI Example
- Alternative compiled code sequences using
instructions in classes A, B, C
Class A B C
CPI for class 1 2 3
IC in sequence 1 2 1 2
IC in sequence 2 4 1 1
- Sequence 1 IC 5
- Clock Cycles 21 12 23 10
- Avg. CPI 10/5 2.0
- Sequence 2 IC 6
- Clock Cycles 41 12 13 9
- Avg. CPI 9/6 1.5
16Basic Components of Performance
Components of Performance Units of Measure
CPU execution time Seconds
Instruction count Instructions executed
Clock cycles per Instruction (CPI) Average number of clock cycles per instruction
Clock cycle time Seconds per clock cycle
17Performance Summary
The BIG Picture
- Performance depends on
- Algorithm affects IC, possibly CPI
- Programming language affects IC, CPI
- Compiler affects IC, CPI
- Instruction set architecture affects IC, CPI, Tc
18Power Trends
1.5 The Power Wall
WHY?
1000
30
5V ? 1V
19Power Wall
- Clock rate and power have increased over 25 years
and 8 computer generations and then flattened off - They grew together because they are correlated
- They slowed down because we have run into a
practical power limit for cooling
microprocessors-thermal problems
20CMOS and Power
- Dominant technology for IC (integrated circuits)
is CMOS(complementary metal oxide semiconductor). - For CMOS primary power dissipation is dynamic
power power consumed during switching. - Frequency switched is a function of clock rate.
- Capacitive load is a function of the number of
transistors and the technology. - Therefore, power has been reduced 30 times by
30
5v 1v
1000
21Reducing Power
- Suppose a new CPU has
- 85 of capacitive load of old CPU
- 15 voltage and 15 frequency reduction
- The power wall
- We cant reduce voltage further
- We cant remove more heat
- How else can we improve performance?
New design
22Uniprocessor Performance
1.6 The Sea Change The Switch to Multiprocessors
Constrained by power, instruction-level
parallelism, memory latency
23Sea Change
- Sea Change is an idiom meaning a profound
transformation or big change. - Taken from Shakespeares The Tempest, when Ariel
sings - "Full fathom five thy father lies,Of his bones
are coral made,Those are pearls that were his
eyes,Nothing of him that doth fade,But doth
suffer a sea-change,into something rich and
strange - http//en.wikipedia.org/wiki/Sea_change
24From Uniprocessor to Multiprocessor
- Power limits have forced change in the design of
microprocessors. - Microprocessors now have multiple processors or
cores per chip. - Called multicore (dual core, quad core, etc.)
- Plan to double the number of cores per chip every
two years. - Programmers need to rewrite their programs to
take advantage of multiple processors.
25Multiprocessors
- Multicore microprocessors
- More than one processor per chip
- Requires explicitly parallel programming
- Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
- Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization
26Multicore Microprocessors
Product AMD OpteronX4 Barcelona Intel Nehalem IBM Power6 Sun Ultra Spark T2 Niagara2
Cores per chip 4 4 2 8
Clock Rate 2.5 GHz 2.5GHz 4.7 GHz 1.4 GHz
Power 120 W 100 W 100 W 94 W
27Parallelism
- Programmers need to switch to explicitly parallel
programming. - Pipelining (Chapter 4) is an elegant technique to
overlap the execution of instructions. - Instruction-level parallelism abstracts the
parallel nature of the hardware so the programmer
and compiler can think of sequential instruction
execution.
28Parallel Programming
- Hard to write parallel programs
- Parallel programming is by definition performance
programming and must be fast. (If speed is not
needed write sequentially.) - For parallel hardware, programmer must divide the
application so that each processor has same
amount to do, with little overhead. - See Newspaper story analogy p. 43
29Real Stuff Manufacturing AMD chip
- Manufacture of a chip begins with silicon (found
in sand). - Silicon is a semiconductor does not conduct
electricity well. - Material added to silicon to form
- Conductors (copper or aluminum)
- Insulators (plastic or glass)
- Conduct or insulate under special conditions (as
a switch or transistor) - VLSI (very large scale integration) circuit is
millions of conductors, insulators and switches
in a small package.
30Manufacturing ICs
1.7 Real Stuff The AMD Opteron X4
- Yield proportion of working dies per wafer
- http//www.intel.com/museum/onlineexhibits.htm
31AMD Opteron X2 Wafer
- X2 300mm wafer, 117 chips, 90nm technology
- X4 45nm technology
32Integrated Circuit Cost
- Nonlinear relation to area and defect rate
- Wafer cost and area are fixed
- Defect rate determined by manufacturing process
- Die area determined by architecture and circuit
design
33SPEC CPU Benchmark
- Benchmarks are programs used to measure
performance - Supposedly typical of actual workload
- Standard Performance Evaluation Corp (SPEC)
- Develops benchmarks for CPU, I/O, Web,
- SPEC CPU2006
- Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
- Normalize relative to reference machine
- Summarize as geometric mean of performance ratios
- CINT2006 (integer) and CFP2006 (floating-point)
34CINT2006 for Opteron X4 2356
Name Description IC109 CPI Tc (ns) Exec time Ref time SPECratio
perl Interpreted string processing 2,118 0.75 0.40 637 9,777 15.3
bzip2 Block-sorting compression 2,389 0.85 0.40 817 9,650 11.8
gcc GNU C Compiler 1,050 1.72 0.47 24 8,050 11.1
mcf Combinatorial optimization 336 10.00 0.40 1,345 9,120 6.8
go Go game (AI) 1,658 1.09 0.40 721 10,490 14.6
hmmer Search gene sequence 2,783 0.80 0.40 890 9,330 10.5
sjeng Chess game (AI) 2,176 0.96 0.48 37 12,100 14.5
libquantum Quantum computer simulation 1,623 1.61 0.40 1,047 20,720 19.8
h264avc Video compression 3,102 0.80 0.40 993 22,130 22.3
omnetpp Discrete event simulation 587 2.94 0.40 690 6,250 9.1
astar Games/path finding 1,082 1.79 0.40 773 7,020 9.1
xalancbmk XML parsing 1,058 2.70 0.40 1,143 6,900 6.0
Geometric mean Geometric mean Geometric mean Geometric mean Geometric mean Geometric mean Geometric mean 11.7
High cache miss rates
35SPEC Power Benchmark
- Power consumption of server at different workload
levels - Performance ssj_ops/sec
- Power Watts (Joules/sec)
36SPECpower_ssj2008 for X4
Target Load Target Load Performance (ssj_ops/sec) Performance (ssj_ops/sec) Average Power (Watts) Average Power (Watts)
100 231,867 295
90 211,282 286
80 185,803 275
70 163,427 265
60 140,160 256
50 118,324 246
40 920,35 233
30 70,500 222
20 47,126 206
10 23,066 180
0 0 141
Overall sum Overall sum 1,283,590 2,605
?ssj_ops/ ?power ?ssj_ops/ ?power ?ssj_ops/ ?power ?ssj_ops/ ?power 493
37Pitfalls and Fallacies
1.8 Fallacies and Pitfalls
- Fallacies- commonly held misconceptions, usually
presented with a counter example. - Pitfalls- easily made mistakes, often
generalizations of principles that are true in a
limited context. - Purpose of these sections is to help you to avoid
making mistakes.
38Amdahls Law
- Amdahl's Law governs the speedup of using
parallel processors on a problem, versus using
only one serial processor. - Before we examine Amdahl's Law, we should gain a
better understanding of what is meant by speedup.
1.8 Fallacies and Pitfalls
- Speedup is the time it takes a program to execute
in serial (with one processor) divided by the
time it takes to execute in parallel (with many
(j) processors). The formula for speedup is - S T(1)
- T(j)
- Efficiency is the speedup, divided by the number
of processors used. - http//cs.wlu.edu/whaleyt/classes/parallel/topics
/amdahl.html
39Amdahls Law
- If N is the number of processors, s is the amount
of time spent (by a serial processor) on serial
parts of a program and p is the amount of time
spent (by a serial processor) on parts of the
program that can be done in parallel, then
Amdahl's law says that speedup is given by - Speedup (s p ) / (s p / N ) 1 / (s p /
N ), - where we have set total time s p 1 for
algebraic simplicity. - http//www.scl.ameslab.gov/Publications/Gus/Amdahl
sLaw/Amdahls.html
40Limitations of Amdahls Law
If a program needs 20 hours using a single
processor, and a particular portion of 1 hour
cannot be parallelized, while the remaining
portion of 19 hours (95) can be parallelized,
then regardless of how many processors we devote
to a parallelized execution of this program, the
minimal execution time can not be less than that
critical 1 hour. Hence the speed up is limited up
to 20x.
41Pitfall Amdahls Law
- Pitfall Expecting the improvement of one aspect
of a computer to increase overall performance by
an amount proportional to the size of the
improvement. - Suppose a program runs in 100 seconds on a
computer, with multiply operations responsible
for 80 seconds of this time. - How much must the speed of the multiplication
improve to have the program run 5 times faster?
42Pitfall Amdahls Law
1.8 Fallacies and Pitfalls
- Improving an aspect of a computer and expecting a
proportional improvement in overall performance
- Example multiply accounts for 80s/100s
- How much improvement in multiply performance to
get 5 overall?
- Corollary make the common case fast
43Uses for Amdahls Law
- Estimate performance improvements
- Together with CPU performance equation, use to
evaluate potential enhancements - Use the Corollary Make the common case fast to
enhance performance- easier than optimizing the
rare case. - Use to examine the practical limits on the number
of parallel processors.
44Fallacy Low Power at Idle
- Fallacy Computers at low utilization use little
power. - Look back at X4 power benchmark
- At 100 load 295W
- At 50 load 246W (83)
- At 10 load 180W (61)
- Google data center
- Mostly operates at 10 50 load
- At 100 load less than 1 of the time
- Consider designing processors to make power
proportional to load
45Pitfall MIPS as a Performance Metric
- PitfallUsing a subset of the performance
equation as a performance metric - MIPS Millions of Instructions Per Second
- Doesnt account for
- Differences in ISAs between computers
- Differences in complexity between instructions
- CPI varies between programs on a given CPU
46Performance Measurements
Measurement Computer A Computer B
Instruction Count 10 billion 8 billion
Clock Rate 4 GHz 4 GHz
CPI 1.0 1.1
- Which computer has the higher MIPS rating?
- Which computer is faster?
- MIPS Clock rate A 4 GHz 4000 B 4GHz
3630 - CPI x106 1.0 x106
1.1 x106
CPU time IC x CPI A10 x 109 x1 2.5 B
8 x 109x1.1 2.2
Clock rate 4GHz
4Ghz
47Concluding Remarks
- Cost/performance is improving
- Due to underlying technology development
- Hierarchical layers of abstraction
- In both hardware and software
- Instruction set architecture
- The hardware/software interface
- Execution time the best performance measure
- Seconds Instructions x Clock Cycles x Seconds
- Program Program Instruction
Clock cycle - Execution time is the only valid measure of
performance.
1.9 Concluding Remarks
48Concluding Remarks
1.9 Concluding Remarks
- Power is a limiting factor
- Use parallelism to improve performance
- Via multiple processors
- Exploiting the locality of accesses to a memory
hierarchy,via caches - The key hardware technology for modern processors
is silicon. - Historical Perspective ( SEE 1.10 on CD)