1
COM 249 Computer Organization and Assembly Language
Chapter 1 (continued): Performance
Based on slides from D. Patterson
and www-inst.eecs.berkeley.edu/~cs152/
2
Understanding Performance
  • Algorithm
    – Determines the number of operations executed
      (IC and possibly CPI)
  • Programming language, compiler, architecture
    – Determine the number of machine instructions
      executed per operation (IC, CPI)
  • Instruction set architecture
    – Determines the instructions needed for a
      function, the cycles needed for each
      instruction, and the clock rate of the
      processor (IC, CPI, and clock rate)
  • Processor and memory system
    – Determine how fast instructions are executed
  • I/O system (including OS)
    – Determines how fast I/O operations are executed

1.4 Performance
3
Comparison of Airplanes

Airplane          Passenger  Cruising Range  Cruising Speed  Passenger Throughput
                  Capacity   (miles)         (mph)           (passengers × mph)
Boeing 777        375        4,630           610             228,750
Boeing 747        470        4,150           610             286,700
Concorde          132        4,000           1,350           178,200
Douglas DC-8-50   146        8,720           544             79,424
4
Defining Performance
Which airplane has the best performance? It depends
on the measure: the Concorde has the highest
cruising speed, but the 747 has the highest
passenger throughput.
5
Response Time and Throughput
  • Response time
    – How long it takes to do a task
  • Throughput
    – Total work done per unit time
    – e.g., tasks or transactions per hour
  • How are response time and throughput affected by
    – Replacing the processor with a faster version?
    – Adding more processors?
  • We'll focus on response time for now

6
Response Time and Throughput
  • Decreasing response time usually improves
    throughput.
  • Do these changes improve response time,
    throughput, or both?
    – Replacing the processor with a faster one?
      Both.
    – Adding additional processors? Only throughput,
      since no single task gets done faster.
7
Relative Performance
  • Define performance as the inverse of execution
    time:
    – Performance = 1 / Execution Time
  • "X is n times faster than Y" means
    – Performance_X / Performance_Y
      = Execution Time_Y / Execution Time_X = n
  • Example: time taken to run a program
    – 10 s on A, 15 s on B
    – Execution Time_B / Execution Time_A
      = 15 s / 10 s = 1.5
    – So A is 1.5 times faster than B
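The definitions above can be sketched in a few lines of Python (a minimal illustration; the 10 s and 15 s figures are the example's values):

```python
# Performance is the inverse of execution time.
def performance(exec_time_s):
    return 1.0 / exec_time_s

time_a, time_b = 10.0, 15.0  # seconds to run the program on A and on B

# "A is n times faster than B": the ratio of performances,
# which equals the inverse ratio of execution times.
n = time_b / time_a
print(n)  # 1.5
```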

8
Measuring Execution Time
  • Elapsed time
    – Total response time, including all aspects:
      processing, I/O, OS overhead, idle time
    – Determines system performance
  • CPU time
    – Time spent processing a given job
    – Discounts I/O time and other jobs' shares
    – Comprises user CPU time and system CPU time
  • Different programs are affected differently by
    CPU and system performance

9
CPU Clocking
  • Operation of digital hardware is governed by a
    constant-rate clock

[Clock-signal diagram: within each clock period,
data transfer and computation occur, then state is
updated.]

  • Clock period: duration of a clock cycle
    – e.g., 250 ps = 0.25 ns = 250 × 10⁻¹² s
  • Clock frequency (rate): cycles per second
    – e.g., 4.0 GHz = 4000 MHz = 4.0 × 10⁹ Hz

10
Program Execution Time (CPU Time)
  • CPU Time = Total CPU Clock Cycles × Clock Cycle
    Time
  • Total CPU Clock Cycles = Number of Instructions
    × CPI
  • CPI = clock cycles per instruction
    – Average number of clock cycles each
      instruction takes to execute
  • CPU designer's choice: trade off the number of
    instructions against the duration of the clock
    cycle
    – Long cycle: powerful but complex instructions
      (CISC)
    – Short cycle: simple instructions (RISC)

11
Instruction Count and CPI
  • Instruction count for a program
    – Determined by program, ISA and compiler
  • Average cycles per instruction
    – Determined by CPU hardware
    – If different instructions have different CPIs,
      the average CPI is affected by the instruction
      mix

12
CPI Example
  • Computer A: cycle time 250 ps, CPI 2.0
  • Computer B: cycle time 500 ps, CPI 1.2
  • Same ISA
  • Which is faster, and by how much?

CPU Time_A = IC × 2.0 × 250 ps = 500 × IC ps
CPU Time_B = IC × 1.2 × 500 ps = 600 × IC ps
A is faster by 600 / 500 = 1.2 times.
13
Classic CPU Performance Equation
  • Basic performance equation in terms of IC, CPI
    and clock cycle time:
    – CPU Time = Instruction Count × CPI × Clock
      Cycle Time
  • Or, since clock rate is the inverse of clock
    cycle time:
    – CPU Time = (Instruction Count × CPI) / Clock
      Rate
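A quick numeric check that the two forms agree (the instruction count here is a made-up illustration value; the 250 ps cycle time is from the clocking example above):

```python
ic = 1_000_000               # hypothetical instruction count
cpi = 2.0                    # average clock cycles per instruction
cycle_time = 250e-12         # 250 ps clock cycle
clock_rate = 1 / cycle_time  # = 4 GHz

t1 = ic * cpi * cycle_time   # CPU Time = IC x CPI x clock cycle time
t2 = ic * cpi / clock_rate   # CPU Time = IC x CPI / clock rate
assert abs(t1 - t2) < 1e-12  # identical (0.5 ms for these values)
```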

14
CPI in More Detail
  • If different instruction classes take different
    numbers of cycles, use a weighted average CPI:
    – Clock Cycles = Σᵢ (CPIᵢ × ICᵢ)
    – CPI = Clock Cycles / Instruction Count
      = Σᵢ CPIᵢ × (ICᵢ / Instruction Count)
    – ICᵢ / Instruction Count is the relative
      frequency of class i
15
CPI Example
  • Alternative compiled code sequences use
    instructions in classes A, B, C

Class             A   B   C
CPI for class     1   2   3
IC in sequence 1  2   1   2
IC in sequence 2  4   1   1

  • Sequence 1: IC = 5
    – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
    – Avg. CPI = 10/5 = 2.0
  • Sequence 2: IC = 6
    – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
    – Avg. CPI = 9/6 = 1.5
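The sequence comparison above can be reproduced directly; this small sketch uses the table's values:

```python
cpi_for_class = {"A": 1, "B": 2, "C": 3}
seq1 = {"A": 2, "B": 1, "C": 2}   # IC per class, sequence 1
seq2 = {"A": 4, "B": 1, "C": 1}   # IC per class, sequence 2

def cycles_and_cpi(mix):
    # Total cycles is the CPI-weighted sum of instruction counts.
    cycles = sum(cpi_for_class[c] * ic for c, ic in mix.items())
    total_ic = sum(mix.values())
    return cycles, total_ic, cycles / total_ic

print(cycles_and_cpi(seq1))  # (10, 5, 2.0)
print(cycles_and_cpi(seq2))  # (9, 6, 1.5)
```

Note that sequence 2 executes more instructions yet takes fewer cycles, which is why IC alone is not a performance measure.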

16
Basic Components of Performance
Component of Performance              Units of Measure
CPU execution time                    Seconds
Instruction count                     Instructions executed
Clock cycles per instruction (CPI)    Average number of clock cycles per instruction
Clock cycle time                      Seconds per clock cycle
17
Performance Summary
The BIG Picture
  • Performance depends on
    – Algorithm: affects IC, possibly CPI
    – Programming language: affects IC, CPI
    – Compiler: affects IC, CPI
    – Instruction set architecture: affects IC,
      CPI, Tc

18
Power Trends
1.5 The Power Wall
Why did power flatten off?
  • In CMOS IC technology:
    – Power = Capacitive Load × Voltage² × Frequency
    – Over this period, clock frequency grew about
      1000×, supply voltage fell from 5 V to 1 V,
      and power still grew about 30×.
19
Power Wall
  • Clock rate and power increased together over 25
    years and 8 computer generations, then
    flattened off
  • They grew together because they are correlated
  • They slowed down because we have run into a
    practical limit on cooling microprocessors
    (thermal problems)

20
CMOS and Power
  • The dominant technology for integrated circuits
    (ICs) is CMOS (complementary metal oxide
    semiconductor).
  • For CMOS, the primary power dissipation is
    dynamic power, consumed during switching:
    – Power = Capacitive Load × Voltage² × Frequency
  • Frequency switched is a function of clock rate.
  • Capacitive load is a function of the number of
    transistors and the technology.
  • Therefore, although frequency has grown about
    1000×, power has grown only about 30×, because
    the supply voltage dropped from 5 V to 1 V,
    cutting the V² term by roughly 25×.
21
Reducing Power
  • Suppose a new CPU has
    – 85% of the capacitive load of the old CPU
    – 15% voltage reduction and 15% frequency
      reduction
  • Then P_new / P_old = 0.85 × 0.85² × 0.85
    = 0.85⁴ ≈ 0.52
  • The power wall
    – We can't reduce voltage further
    – We can't remove more heat
  • How else can we improve performance?
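The reduction works out as follows (a sketch of the CMOS dynamic-power relation in normalized units; the 0.85 factors are the reductions in the example):

```python
def dynamic_power(cap_load, voltage, freq):
    # CMOS dynamic power: capacitive load x voltage^2 x switching frequency
    return cap_load * voltage**2 * freq

p_old = dynamic_power(1.0, 1.0, 1.0)      # old design, normalized to 1
p_new = dynamic_power(0.85, 0.85, 0.85)   # 85% load, 15% lower V and f
print(p_new / p_old)   # 0.85**4, about 0.52
```

Voltage enters squared, which is why lowering it has been the main lever for controlling power.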
22
Uniprocessor Performance
1.6 The Sea Change: The Switch to Multiprocessors
Single-processor performance growth is now
constrained by power, available instruction-level
parallelism, and memory latency.
23
Sea Change
  • "Sea change" is an idiom meaning a profound
    transformation or big change.
  • Taken from Shakespeare's The Tempest, when
    Ariel sings:
  • "Full fathom five thy father lies, / Of his
    bones are coral made, / Those are pearls that
    were his eyes, / Nothing of him that doth fade,
    / But doth suffer a sea-change / Into something
    rich and strange."
  • http://en.wikipedia.org/wiki/Sea_change

24
From Uniprocessor to Multiprocessor
  • Power limits have forced a change in the design
    of microprocessors.
  • Microprocessors now have multiple processors,
    or cores, per chip.
    – Called multicore (dual core, quad core, etc.)
  • The plan is to double the number of cores per
    chip every two years.
  • Programmers need to rewrite their programs to
    take advantage of multiple processors.

25
Multiprocessors
  • Multicore microprocessors
    – More than one processor per chip
  • Require explicitly parallel programming
    – Compare with instruction-level parallelism,
      where hardware executes multiple instructions
      at once, hidden from the programmer
    – Hard to do: programming for performance,
      load balancing, optimizing communication and
      synchronization

26
Multicore Microprocessors
Product          AMD Opteron X4   Intel Nehalem   IBM Power6   Sun UltraSPARC T2
                 (Barcelona)                                   (Niagara 2)
Cores per chip   4                4               2            8
Clock rate       2.5 GHz          2.5 GHz         4.7 GHz      1.4 GHz
Power            120 W            100 W           100 W        94 W
27
Parallelism
  • Programmers need to switch to explicitly
    parallel programming.
  • Pipelining (Chapter 4) is an elegant technique
    that overlaps the execution of instructions.
  • Instruction-level parallelism abstracts the
    parallel nature of the hardware, so the
    programmer and compiler can think in terms of
    sequential instruction execution.

28
Parallel Programming
  • Parallel programs are hard to write
  • Parallel programming is by definition
    performance programming; it must be fast. (If
    speed is not needed, write a sequential
    program.)
  • For parallel hardware, the programmer must
    divide the application so that each processor
    has the same amount of work to do, with little
    overhead.
  • See the newspaper-story analogy, p. 43

29
Real Stuff: Manufacturing an AMD Chip
  • Manufacture of a chip begins with silicon
    (found in sand).
  • Silicon is a semiconductor: it does not conduct
    electricity well.
  • Materials are added to silicon to form
    – Conductors (copper or aluminum)
    – Insulators (plastic or glass)
    – Materials that conduct or insulate under
      special conditions (as a switch, or
      transistor)
  • A VLSI (very large scale integration) circuit
    is millions of conductors, insulators and
    switches in a small package.

30
Manufacturing ICs
1.7 Real Stuff: The AMD Opteron X4
  • Yield: proportion of working dies per wafer
  • http://www.intel.com/museum/onlineexhibits.htm

31
AMD Opteron X2 Wafer
  • X2: 300 mm wafer, 117 chips, 90 nm technology
  • X4: 45 nm technology

32
Integrated Circuit Cost
  • Cost per die has a nonlinear relation to die
    area and defect rate
    – Wafer cost and area are fixed
    – Defect rate is determined by the
      manufacturing process
    – Die area is determined by architecture and
      circuit design
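The nonlinearity can be sketched with the standard cost model (the wafer price and defect density below are made-up illustration values, not Opteron data):

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    # Approximation: wafer area / die area (ignores dies lost at the edge).
    return (math.pi * (wafer_diameter_mm / 2) ** 2) / die_area_mm2

def die_yield(defects_per_mm2, die_area_mm2):
    # Standard yield model: 1 / (1 + defects_per_area * die_area / 2)^2
    return 1 / (1 + defects_per_mm2 * die_area_mm2 / 2) ** 2

def cost_per_die(wafer_cost, wafer_diameter_mm, die_area_mm2,
                 defects_per_mm2):
    good_dies = (dies_per_wafer(wafer_diameter_mm, die_area_mm2)
                 * die_yield(defects_per_mm2, die_area_mm2))
    return wafer_cost / good_dies

# Doubling the die area MORE than doubles cost:
# half as many dies per wafer AND a lower yield on each.
small = cost_per_die(1000, 300, 100, 0.005)
large = cost_per_die(1000, 300, 200, 0.005)
assert large > 2 * small
```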

33
SPEC CPU Benchmark
  • Benchmarks are programs used to measure
    performance
    – Supposedly typical of the actual workload
  • Standard Performance Evaluation Corporation
    (SPEC)
    – Develops benchmarks for CPU, I/O, Web, etc.
  • SPEC CPU2006
    – Elapsed time to execute a selection of
      programs
    – Negligible I/O, so it focuses on CPU
      performance
    – Normalized relative to a reference machine
    – Summarized as the geometric mean of
      performance ratios
    – CINT2006 (integer) and CFP2006
      (floating-point)

34
CINT2006 for Opteron X4 2356
Name        Description                    IC×10⁹  CPI    Tc (ns)  Exec time  Ref time  SPECratio
perl        Interpreted string processing  2,118   0.75   0.40     637        9,777     15.3
bzip2       Block-sorting compression      2,389   0.85   0.40     817        9,650     11.8
gcc         GNU C compiler                 1,050   1.72   0.47     724        8,050     11.1
mcf         Combinatorial optimization     336     10.00  0.40     1,345      9,120     6.8
go          Go game (AI)                   1,658   1.09   0.40     721        10,490    14.6
hmmer       Search gene sequence           2,783   0.80   0.40     890        9,330     10.5
sjeng       Chess game (AI)                2,176   0.96   0.48     837        12,100    14.5
libquantum  Quantum computer simulation    1,623   1.61   0.40     1,047      20,720    19.8
h264avc     Video compression              3,102   0.80   0.40     993        22,130    22.3
omnetpp     Discrete event simulation      587     2.94   0.40     690        6,250     9.1
astar       Games/path finding             1,082   1.79   0.40     773        7,020     9.1
xalancbmk   XML parsing                    1,058   2.70   0.40     1,143      6,900     6.0
Geometric mean                                                                          11.7

(mcf's CPI of 10.0 reflects high cache miss rates.)
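The summary figure is the geometric mean of the per-benchmark SPECratios, which can be checked directly from the table:

```python
import math

specratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8,
              22.3, 9.1, 9.1, 6.0]

# Geometric mean = exp(mean of the logs). SPEC uses it so that
# renormalizing to a different reference machine rescales all
# ratios by the same factor.
geo_mean = math.exp(sum(math.log(r) for r in specratios)
                    / len(specratios))
print(round(geo_mean, 1))  # 11.7
```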
35
SPEC Power Benchmark
  • Measures power consumption of a server at
    different workload levels
    – Performance: ssj_ops/sec
    – Power: watts (joules/sec)

36
SPECpower_ssj2008 for X4
Target Load (%)  Performance (ssj_ops/sec)  Average Power (Watts)
100              231,867                    295
90               211,282                    286
80               185,803                    275
70               163,427                    265
60               140,160                    256
50               118,324                    246
40               92,035                     233
30               70,500                     222
20               47,126                     206
10               23,066                     180
0                0                          141
Overall sum      1,283,590                  2,605

Σssj_ops / Σpower = 493
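The overall metric is the total ssj_ops divided by the total power across all load levels, reproducible from the table:

```python
# ssj_ops and average power at each target load, 100% down to 0%.
ssj_ops = [231_867, 211_282, 185_803, 163_427, 140_160, 118_324,
           92_035, 70_500, 47_126, 23_066, 0]
watts   = [295, 286, 275, 265, 256, 246, 233, 222, 206, 180, 141]

overall = sum(ssj_ops) / sum(watts)
print(sum(ssj_ops), sum(watts), round(overall))  # 1283590 2605 493
```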
37
Pitfalls and Fallacies
1.8 Fallacies and Pitfalls
  • Fallacies: commonly held misconceptions,
    usually presented with a counterexample.
  • Pitfalls: easily made mistakes, often
    generalizations of principles that are true in
    a limited context.
  • The purpose of these sections is to help you
    avoid making such mistakes.

38
Amdahl's Law
  • Amdahl's Law governs the speedup from using
    parallel processors on a problem, versus using
    only one serial processor.
  • Before we examine Amdahl's Law, we should gain
    a better understanding of what is meant by
    speedup.
  • Speedup is the time it takes a program to
    execute in serial (with one processor) divided
    by the time it takes to execute in parallel
    (with j processors):
    – S = T(1) / T(j)
  • Efficiency is the speedup divided by the number
    of processors used.
  • http://cs.wlu.edu/whaleyt/classes/parallel/topics/amdahl.html

39
Amdahl's Law
  • If N is the number of processors, s is the
    amount of time spent (by a serial processor) on
    the serial parts of a program, and p is the
    amount of time spent (by a serial processor) on
    the parts that can be done in parallel, then
    Amdahl's Law says the speedup is
    – Speedup = (s + p) / (s + p/N) = 1 / (s + p/N),
  • where we have set total time s + p = 1 for
    algebraic simplicity.
  • http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html
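The law as stated above, in code (with s + p = 1; the 95%-parallel figures anticipate the next slide's example):

```python
def speedup(s, p, n):
    # Amdahl's Law: serial time s, parallelizable time p, n processors.
    return (s + p) / (s + p / n)

# With 95% of the work parallelizable, speedup approaches
# 1/s = 20 no matter how many processors are added.
for n in (1, 10, 100, 1_000_000):
    print(n, round(speedup(0.05, 0.95, n), 2))
```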

40
Limitations of Amdahl's Law
If a program needs 20 hours on a single processor,
and a particular 1-hour portion cannot be
parallelized, while the remaining 19 hours (95%)
can be, then no matter how many processors we
devote to parallel execution, the minimum execution
time cannot be less than that critical 1 hour.
Hence the speedup is limited to at most 20×.
41
Pitfall: Amdahl's Law
  • Pitfall: expecting the improvement of one
    aspect of a computer to increase overall
    performance by an amount proportional to the
    size of the improvement.
  • Suppose a program runs in 100 seconds on a
    computer, with multiply operations responsible
    for 80 seconds of this time.
  • How much must the speed of multiplication
    improve to make the program run 5 times faster?

42
Pitfall: Amdahl's Law
  • Improving one aspect of a computer and
    expecting a proportional improvement in overall
    performance
  • Example: multiply accounts for 80 s out of 100 s
    – To run 5× faster, total time must drop to
      100/5 = 20 s
    – But the non-multiply portion alone takes
      20 s, so the multiply time would have to drop
      to 0
    – Can't be done!
  • Corollary: make the common case fast

43
Uses for Amdahl's Law
  • Estimate performance improvements
    – Together with the CPU performance equation,
      use it to evaluate potential enhancements
  • Use the corollary "make the common case fast"
    to enhance performance; it is easier than
    optimizing the rare case.
  • Use it to examine the practical limits on the
    number of parallel processors.

44
Fallacy Low Power at Idle
  • Fallacy: computers at low utilization use
    little power.
  • Look back at the X4 power benchmark:
    – At 100% load: 295 W
    – At 50% load: 246 W (83%)
    – At 10% load: 180 W (61%)
  • Google data centers
    – Mostly operate at 10%–50% load
    – Are at 100% load less than 1% of the time
  • Consider designing processors to make power
    proportional to load

45
Pitfall: MIPS as a Performance Metric
  • Pitfall: using a subset of the performance
    equation as a performance metric
  • MIPS: Millions of Instructions Per Second
  • Doesn't account for
    – Differences in ISAs between computers
    – Differences in complexity between instructions
  • CPI varies between programs on a given CPU, so
    MIPS does too

46
Performance Measurements
Measurement        Computer A   Computer B
Instruction count  10 billion   8 billion
Clock rate         4 GHz        4 GHz
CPI                1.0          1.1

  • Which computer has the higher MIPS rating?
  • Which computer is faster?

MIPS = Clock Rate / (CPI × 10⁶)
  A: 4×10⁹ / (1.0 × 10⁶) = 4000 MIPS
  B: 4×10⁹ / (1.1 × 10⁶) ≈ 3636 MIPS
CPU Time = IC × CPI / Clock Rate
  A: 10×10⁹ × 1.0 / 4×10⁹ = 2.5 s
  B: 8×10⁹ × 1.1 / 4×10⁹ = 2.2 s
A has the higher MIPS rating, but B is faster.
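The two rankings disagree, as a quick computation confirms:

```python
def mips(clock_rate_hz, cpi):
    # MIPS = instructions per second / 10^6 = clock rate / (CPI * 10^6)
    return clock_rate_hz / (cpi * 1e6)

def cpu_time(ic, cpi, clock_rate_hz):
    return ic * cpi / clock_rate_hz

mips_a, mips_b = mips(4e9, 1.0), mips(4e9, 1.1)  # 4000 vs ~3636
time_a = cpu_time(10e9, 1.0, 4e9)                # 2.5 s
time_b = cpu_time(8e9, 1.1, 4e9)                 # 2.2 s

assert mips_a > mips_b and time_b < time_a   # higher MIPS, yet slower
```

B's lower MIPS rating comes from its higher CPI, but B executes fewer instructions, so it finishes the program first; that is exactly why MIPS is a pitfall.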
47
Concluding Remarks
  • Cost/performance is improving
    – Due to underlying technology development
  • Hierarchical layers of abstraction
    – In both hardware and software
  • Instruction set architecture
    – The hardware/software interface
  • Execution time: the best performance measure
    – Seconds/Program = (Instructions/Program)
      × (Clock Cycles/Instruction)
      × (Seconds/Clock Cycle)
  • Execution time is the only valid measure of
    performance.

1.9 Concluding Remarks
48
Concluding Remarks
  • Power is a limiting factor
    – Use parallelism to improve performance, via
      multiple processors
    – Exploit the locality of accesses to a memory
      hierarchy, via caches
  • The key hardware technology for modern
    processors is silicon.
  • Historical perspective: see §1.10 on the CD