CS 1104 Help Session II: Performance Measures

Provided by: polar

1
CS 1104 Help Session II: Performance Measures
  • Colin Tan
  • ctank_at_comp.nus.edu.sg
  • http://www.comp.nus.edu.sg/ctank

2
Basic Concepts: Instruction Execution Cycles
  • Processors execute instructions in several steps:
  • Instruction Fetch (IF)
  • Instructions are fetched from memory and placed
    into an Instruction Register (IR).
  • Instruction Decode (ID)
  • The opcode portion of the instruction is sent to
    a decoder, which generates control signals.
  • Control signals tell the Arithmetic Logic Unit
    (ALU) what to do with the data: add, rotate the
    bits, etc.
  • The operands portion may be sent to the
    register-file to fetch register data, or sent
    directly to the ALU to be operated on (for
    constants).
  • Operand Fetch (OF)
  • Data required for the operation is taken from
    memory or the register file and sent to the ALU
    inputs

3
Basic Concepts: Instruction Execution Cycles
  • Execution steps (cont'd)
  • Instruction Execute (IE)
  • The ALU computes the results based on the data
    fetched and the control signals generated.
  • Writeback (WB)
  • The results are written back to the destination
    register or memory location.

4
Basic Concepts: The Need for Synchronization
  • How will the processor know:
  • When the instruction has been fetched and placed
    into IR?
  • If the instruction is not yet in IR, neither the
    opcodes nor operands will make sense!
  • Decoding nonsense and fetching invalid data leads
    to incorrect execution.
  • When the instruction has been decoded?
  • If the instructions have not been decoded
    completely, the ALU is receiving invalid control
    signals.
  • When the operands have been fetched?
  • If the operands have not yet been fetched from
    the registers or from memory, then the inputs to
    the ALU are invalid, and the ALU will compute
    invalid results!

5
Basic Concepts: Clock Cycles
  • The other steps (IE, OF, WB) also need to know
    when to proceed in order to work correctly.
  • To coordinate each step, the processor relies on
    a series of ticks called clock cycles (CC).
  • CC1: Perform IF
  • By the end of CC1, the instruction is definitely
    sitting in IR, and the decoder can proceed to
    interpret the opcode.
  • CC2: Perform ID
  • Decode the instruction in IR, and generate all
    the control signals by the end of this clock
    cycle.
  • CC3: Perform OF
  • Fetch the data from registers or from memory.
    All the data must be ready and presented to the
    ALU by the end of this clock cycle.

6
Basic Concepts: Clock Cycles
  • CC4: Perform IE
  • The ALU must operate (i.e. add, subtract etc.) on
    the inputs and produce the results by the end of
    this clock cycle.
  • CC5: Perform WB
  • The outputs of the ALU must be written back to
    register or memory by the end of this clock
    cycle.
  • CC6: Start IF of next instruction
  • If every step obeys the constraints laid out
    here, then each step will know for sure that the
    results of the previous step are already
    available before starting, and execution will
    proceed correctly.
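
The CC1 to CC6 walk-through above can be sketched in a few lines of Python. This is only an illustration, assuming a non-pipelined processor in which every stage takes exactly one clock cycle; the name stage_at is made up for this sketch and is not part of the original notes.

```python
# Stage names in the order the slides give them.
STAGES = ["IF", "ID", "OF", "IE", "WB"]


def stage_at(clock_cycle):
    """Return (instruction_index, stage) active in a 1-based clock cycle.

    Assumes a non-pipelined processor where every stage takes exactly
    one clock cycle, as in the CC1..CC6 walk-through.
    """
    if clock_cycle < 1:
        raise ValueError("clock cycles are numbered from 1")
    instruction_index, stage_index = divmod(clock_cycle - 1, len(STAGES))
    return instruction_index, STAGES[stage_index]


# CC1..CC5 belong to instruction 0; CC6 starts the IF of instruction 1.
for cc in range(1, 7):
    instr, stage = stage_at(cc)
    print(f"CC{cc}: instruction {instr} performs {stage}")
```

Because each stage only starts on a tick, a stage can rely on the previous stage having finished by the end of the preceding cycle.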

7
Basic Concepts: Instruction Classes
  • A typical processor supports many instructions.
  • Typically instructions are divided into groups:
  • Arithmetic Instructions: add, sub, mul, div, mod
  • Bitwise Instructions: rol, ror, shl, shr, and,
    or, not
  • Floating Point Instructions: fadd, fsub, fmul,
    fdiv
  • Load/Store Instructions: lw, sw
  • Etc.

8
Basic Concepts: Class Cycles Per Instruction
  • We have seen how instructions take several clock
    cycles to execute (in our example, each
    instruction takes 5 clock cycles).
  • Each instruction actually takes a different number
    of clock cycles to execute, depending on how
    complex the instruction is, or how slow each
    stage of the instruction is.
  • E.g. Floating-point adds: more complex than
    integer adds, and require more clock cycles.
  • lw, sw: access memory, which takes more clock
    cycles to fetch an operand from than registers
    do.

9
Basic Concepts: Class Cycles Per Instruction
  • The Class Cycles Per Instruction (class CPI) is
    the average number of clock cycles required by
    instructions within a particular class.
  • E.g.
  • # of cycles for ADD = 2 cycles
  • # of cycles for SUB = 2 cycles
  • # of cycles for MUL = 4 cycles
  • # of cycles for DIV = 8 cycles
  • ---------------
  • Total = 16 cycles
  • Average = 16/4 = 4 CPI.
  • So the class CPI for this class of instructions
    is 4.
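
The averaging in this example can be checked with a short Python sketch. The cycle counts come from the slide; the function name class_cpi is an assumption made for illustration.

```python
# Cycle counts from the slide's arithmetic-class example.
cycles = {"ADD": 2, "SUB": 2, "MUL": 4, "DIV": 8}


def class_cpi(cycles_by_instruction):
    """Average the per-instruction cycle counts of one instruction class."""
    if not cycles_by_instruction:
        raise ValueError("a class needs at least one instruction")
    return sum(cycles_by_instruction.values()) / len(cycles_by_instruction)


print(class_cpi(cycles))  # 4.0
```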

10
Basic Concepts: Instruction Frequency
  • A program (e.g. Microsoft Word) is made up of
    many instructions coming from each of the
    different classes of instructions.
  • The share of a program's instructions that
    belongs to a class is called the instruction
    frequency of that class.
  • This is often expressed as a percentage or as a
    fraction of the total instruction count.

11
Basic Concepts: Overall Cycles Per Instruction
  • The class instruction frequencies and the class
    CPIs can be used to compute the overall Cycles
    Per Instruction, or overall CPI, of a particular
    program.
  • Each type of instruction would take a different
    number of clock cycles.
  • A program consists of several different types of
    instructions.
  • The overall CPI is the average number of cycles
    required to execute each instruction, across all
    types of instructions.

12
Calculating Overall CPI
  • Find the overall CPI of a program running on a
    processor with the class CPIs and instruction
    frequencies shown here:
  • Class   CPI   Instruction Frequency
  • A       3     0.4
  • B       2     0.25
  • C       4     0.15
  • D       5     0.20

13
Calculating Overall CPI
  • Let's assume that the total number of
    instructions is IC. Then there are 0.4IC
    instructions in class A, 0.25IC in class B,
    0.15IC in class C and 0.2IC in class D.
  • Total number of clock cycles used by instructions
    in class A is 0.4IC x 3, class B is 0.25IC x 2,
    class C is 0.15IC x 4, class D is 0.2IC x 5.
  • Hence total number of clock cycles used by this
    program is 0.4IC x 3 + 0.25IC x 2 + 0.15IC x 4 +
    0.2IC x 5.
  • Number of instructions is IC. Hence average
    number of cycles per instruction (average CPI) is
    (0.4IC x 3 + 0.25IC x 2 + 0.15IC x 4 + 0.2IC x
    5)/1.0IC.
  • IC cancels off, leaving 0.4 x 3 + 0.25 x 2 + 0.15
    x 4 + 0.2 x 5 = 1.2 + 0.5 + 0.6 + 1.0, the famous
    Overall CPI. Final answer is 3.3.
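
As a quick check of the weighted sum, here is a minimal Python sketch. The class CPIs and frequencies are the ones in the table for classes A to D (CPI 3, 2, 4, 5 with frequencies 0.4, 0.25, 0.15, 0.20); evaluating the sum term by term gives 1.2 + 0.5 + 0.6 + 1.0 = 3.3. The function name overall_cpi is made up for this sketch.

```python
# (class CPI, instruction frequency) per class, from the table in the slides.
classes = {"A": (3, 0.4), "B": (2, 0.25), "C": (4, 0.15), "D": (5, 0.20)}


def overall_cpi(table):
    """Weighted sum of class CPIs.

    Only valid as-is when the instruction frequencies sum to 1.0.
    """
    return sum(cpi * freq for cpi, freq in table.values())


print(round(overall_cpi(classes), 2))  # 3.3
```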

14
Calculating Overall CPI
  • Suppose the previous program was re-compiled with
    a different compiler, and the CPI/instruction
    frequency table is modified to the one below:
  • Class   CPI   Instruction Frequency
  • A       3     0.2
  • B       2     0.35
  • C       4     0.15
  • D       5     0.20
15
Calculating Overall CPI
  • We take a short-cut and use the famous formula:
  • Overall CPI = 3 x 0.2 + 2 x 0.35 + 4 x 0.15 + 5
    x 0.2
  • = 2.9
  • If we left the answer like this, it would be
    WRONG!
  • Reason: The instruction frequencies do not add up
    to 1.0!
  • Returning to the definitions, let's compute the
    total number of clock cycles taken by this
    program:
  • Total Clock Cycles = 0.2IC x 3 + 0.35IC x 2 +
    0.15IC x 4 + 0.2IC x 5
  • Total number of instructions = 0.2IC + 0.35IC +
    0.15IC + 0.2IC
  • = 0.9IC

16
Calculating Overall CPI
  • Finding the overall CPI:
  • (0.2IC x 3 + 0.35IC x 2 + 0.15IC x 4 + 0.2IC x 5)
    / (0.9IC)
  • Canceling out IC, we get
  • (0.2 x 3 + 0.35 x 2 + 0.15 x 4 + 0.2 x 5) / 0.9
  • Final answer is 3.22.
  • Moral: Always divide the weighted sum you get by
    the total frequency. In the previous example, the
    total frequency was 1.0, and we didn't have a
    problem. Here this is not the case.
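
The moral can be built into the calculation itself. The sketch below always divides the weighted sum by the total frequency, so it works whether or not the frequencies add up to 1.0. The table values are the re-compiled program's (frequencies 0.2, 0.35, 0.15, 0.20); the function name normalized_cpi is an assumption for illustration.

```python
# (class CPI, instruction frequency) for the re-compiled program.
recompiled = {"A": (3, 0.2), "B": (2, 0.35), "C": (4, 0.15), "D": (5, 0.20)}


def normalized_cpi(table):
    """Total clock cycles divided by total instructions (both per unit IC)."""
    weighted_cycles = sum(cpi * freq for cpi, freq in table.values())
    total_frequency = sum(freq for _, freq in table.values())
    if total_frequency <= 0:
        raise ValueError("instruction frequencies must sum to a positive value")
    return weighted_cycles / total_frequency


print(round(normalized_cpi(recompiled), 2))  # 3.22
```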

17
Calculating Peak CPI
  • The peak overall CPI is obtained when every
    instruction in a program is from the fastest
    class. Using our previous example, we will have
    peak performance if our instruction frequencies
    are as shown.
  • This will give us a peak CPI of 0.0 x 3 + 1.0 x 2
    + 0.0 x 4 + 0.0 x 5 = 2.0
  • Class   CPI   Instruction Frequency
  • A       3     0.0
  • B       2     1.0
  • C       4     0.0
  • D       5     0.0
18
Calculating Peak CPI
  • In general, the peak overall CPI will be the CPI
    of the fastest class.
  • It is not possible to modify the class CPIs
    without modifying the hardware organization
    itself.
  • However, by hacking the hardware, the peak class
    CPI can be as low as 0!
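
Since the peak overall CPI is just the CPI of the fastest class, it reduces to taking a minimum over the class CPIs. A minimal sketch, using the class CPIs from the running example (the helper name peak_cpi is made up here):

```python
# Class CPIs from the running example.
class_cpis = {"A": 3, "B": 2, "C": 4, "D": 5}


def peak_cpi(cpis):
    """Peak overall CPI: every instruction comes from the fastest class."""
    return min(cpis.values())


fastest_class = min(class_cpis, key=class_cpis.get)
print(fastest_class, peak_cpi(class_cpis))  # B 2
```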

19
Basic Concepts: Clock Rate
  • We have seen how the processor coordinates the
    various instruction execution stages using a
    common tick, or clock cycle.
  • The number of ticks per second is called the
    clock rate, or clock frequency.
  • Obviously the higher the clock rate, the faster
    each stage has to complete, and therefore the
    faster the processor completes an instruction
  • This implies that a higher clock rate will give
    you faster processors.
  • However there is a limit to how fast each stage
    can do something.
  • Cranking the clock rate beyond the capabilities
    of the hardware will cause execution to fail.

20
Basic Concepts: Clock Rate
  • To overcome speed limitations, processor
    designers often make compromises in the designs
    for each stage
  • The compromises allow each stage to work faster
    than before, allowing you to crank up the clock
    rate faster than ever.
  • Such compromises give you faster execution rates
    under ideal circumstances, but may give you worse
    performance under normal circumstances.
  • This is because the compromises result in higher
    class CPIs.
  • Hence a faster clock rate may actually result in
    poorer performance.
  • This translates to longer execution times for a
    program.
  • The length of a clock cycle measured in seconds
    is called the clock cycle time or clock period.
    It is equal to the reciprocal of the frequency
    (i.e. cycle time = 1/(clock rate)).
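
The reciprocal relationship is easy to try out. In this small sketch the 500 MHz figure is the clock rate used in the worked example later in these notes; the function name is an assumption for illustration.

```python
def cycle_time_seconds(clock_rate_hz):
    """Clock period in seconds: the reciprocal of the clock rate in Hz."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return 1.0 / clock_rate_hz


# A 500 MHz clock ticks every 2 nanoseconds.
print(cycle_time_seconds(500e6))  # 2e-09
```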

21
Execution Time
  • The execution time T of a program is the amount
    of time a program takes to run to completion.
  • This will depend on the overall CPI, the total
    number of instructions executed (IC), and the
    clock rate (R) of the processor.
  • IC x CPI will give us the total number of clock
    cycles used to execute all the instructions in
    the program.
  • (IC x CPI) / R will give us the execution time.
  • If my program takes 10,000 cycles, and if my
    clock produces 100,000 cycles per second, then my
    program would take 10,000/100,000 = 0.1 seconds
    to execute.
  • Hence T = (IC x CPI)/R

22
Execution Time
  • From the previous example, suppose the program
    has a total of 15,000,000 instructions, and
    suppose that the clock rate of the processor is
    500 MHz. What is the total execution time of the
    program?
  • With the overall CPI of 0.4 x 3 + 0.25 x 2 + 0.15
    x 4 + 0.2 x 5 = 3.3:
  • T = (15 x 10^6) x 3.3 / (500 x 10^6) = 0.099
    seconds.
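
The formula T = (IC x CPI)/R can be sketched as a one-line function. The overall CPI is recomputed here from the class table (CPI 3, 2, 4, 5 with frequencies 0.4, 0.25, 0.15, 0.20) so the arithmetic can be followed end to end; the function and variable names are made up for this sketch.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """T = (IC x CPI) / R, in seconds."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return instruction_count * cpi / clock_rate_hz


# The generic example: 10,000 cycles on a clock of 100,000 cycles per second.
print(execution_time(10_000, 1, 100_000))  # 0.1

# The worked example: 15,000,000 instructions at 500 MHz.
cpi = 0.4 * 3 + 0.25 * 2 + 0.15 * 4 + 0.2 * 5  # weighted sum from the class table
t = execution_time(15_000_000, cpi, 500e6)
print(round(t, 3))  # 0.099
```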

23
Execution Time Issues
  • The execution time computed applies only to
    this program. Other programs will have different
    execution times.
  • Execution time is affected by:
  • Hardware Organization: This affects individual
    class CPIs, and hence the overall CPI.
  • E.g. ADD instructions implemented using
    carry-propagate adders will have much higher CPIs
    than those implemented using carry-generate
    adders.
  • Compiler Technology: This affects the individual
    class frequencies.
  • A good compiler will select more instructions
    from faster classes to accomplish the same
    objective.

24
Execution Time Issues
  • Execution time is affected by (cont'd):
  • The program being run
  • Different programs will have different
    instruction distributions (i.e. different
    instruction class frequencies), resulting in
    different overall CPIs.
  • Different programs will have different
    instruction counts IC
  • Instruction Set Architecture
  • A richer ISA will give the compiler more choices
    of instructions to use to minimize IC, CPI or
    both.
  • All this will give you different execution time T.

25
Benchmarking
  • Benchmarks allow us to determine the performance
    of a system, usually relative to another system.
  • A common benchmark that we use is execution time.
    We take the same program and run it on two
    machines, and compare their execution times.
  • We cannot use overall CPI or clock frequency
    alone as a basis for comparison:
  • High clock frequency processors may make
    compromises that dramatically increase individual
    class CPIs, and hence overall CPI.
  • Instructions may have very low CPIs because clock
    cycle times are very long.
  • Long clock cycle times mean that the processor
    may be able to accomplish more than one step in
    one clock cycle, leading to lower cycle
    requirements.
  • Unfortunately, due to the low clock rate,
    performance may still be poor.

26
Benchmarking: Execution Time Example
  • The processor in the previous example is
    optimized, and the new class CPIs are shown
    below. Clock frequencies and instruction counts
    remain the same. How much faster is the new
    machine than the old?
  • Class   CPI   Instruction Frequency
  • A       2     0.4
  • B       1     0.25
  • C       5     0.15
  • D       4     0.20
27
Benchmarking: Execution Time Example
  • Overall CPI = 2 x 0.4 + 1 x 0.25 + 5 x 0.15 + 4 x
    0.2
  • = 2.6
  • New Execution Time = 2.6 x (15 x 10^6) / (500 x
    10^6)
  • = 0.078 s
  • Previous Execution Time = 3.3 x (15 x 10^6) /
    (500 x 10^6) = 0.099 s
  • We can measure the speed-up by taking the old
    execution time and dividing it by the new:
  • Speedup = 0.099 / 0.078 = 1.27
  • This figure of 1.27 means that the new design is
    1.27 times faster than the old one.
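
The speedup comparison can be sketched as follows. Both execution times are recomputed from their class tables (old class CPIs 3, 2, 4, 5 and new class CPIs 2, 1, 5, 4, with frequencies 0.4, 0.25, 0.15, 0.20), which gives overall CPIs of 3.3 and 2.6. The names used here are assumptions for illustration only.

```python
FREQUENCIES = [0.4, 0.25, 0.15, 0.20]   # classes A, B, C, D
OLD_CPIS = [3, 2, 4, 5]                 # original machine
NEW_CPIS = [2, 1, 5, 4]                 # optimized machine
IC = 15_000_000                         # instruction count
R = 500e6                               # clock rate in Hz


def execution_time(class_cpis, frequencies, instruction_count, clock_rate_hz):
    """T = IC x (sum of CPI_i x f_i) / R; frequencies are assumed to sum to 1."""
    overall = sum(c * f for c, f in zip(class_cpis, frequencies))
    return instruction_count * overall / clock_rate_hz


def speedup(old_time, new_time):
    """Old execution time divided by new execution time."""
    return old_time / new_time


old_t = execution_time(OLD_CPIS, FREQUENCIES, IC, R)
new_t = execution_time(NEW_CPIS, FREQUENCIES, IC, R)
print(round(old_t, 3), round(new_t, 3), round(speedup(old_t, new_t), 2))
# 0.099 0.078 1.27
```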

28
Benchmarking: Instruction Throughput
  • Measuring how fast a machine can execute a
    particular program is just one way of determining
    performance.
  • Another good measure is instruction throughput,
    or how many instructions a processor can execute
    per second.
  • The most common measure for throughput is MIPS,
    which is short for Millions of Instructions Per
    Second.
  • This is not to be confused with the MIPS R2000.
    In that case, MIPS is actually a company's
    name.
  • So we have two meanings for MIPS:
  • Millions of Instructions Per Second
  • The company that makes the R2000.

29
Benchmarking: MIPS Example
  • Find the MIPS rating for both machines used in
    these notes.
  • CPI for first machine = 3.3 (0.4 x 3 + 0.25 x 2 +
    0.15 x 4 + 0.2 x 5)
  • This means that every instruction requires, on
    average, 3.3 cycles.
  • The clock rate is 500 MHz, so each second there
    are 500 x 10^6 cycles.
  • Therefore you can execute 500 x 10^6 / 3.3 =
    151.5 x 10^6 instructions per second, or 151.5
    MIPS.
  • CPI for second machine = 2.6
  • Clock rate remains the same at 500 x 10^6 Hz.
  • So throughput is 500 x 10^6 / 2.6 = 192.3 x 10^6
    instructions per second, or 192.3 MIPS.
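
The MIPS rating is the clock rate divided by the overall CPI, scaled to millions. A minimal sketch, with a made-up function name and the two overall CPIs that follow from the class tables in these notes (3.3 for the first machine, 2.6 for the second):

```python
def mips_rating(clock_rate_hz, overall_cpi):
    """Millions of instructions per second = R / (CPI x 10^6)."""
    if overall_cpi <= 0:
        raise ValueError("CPI must be positive")
    return clock_rate_hz / (overall_cpi * 1e6)


first = mips_rating(500e6, 0.4 * 3 + 0.25 * 2 + 0.15 * 4 + 0.2 * 5)   # CPI 3.3
second = mips_rating(500e6, 0.4 * 2 + 0.25 * 1 + 0.15 * 5 + 0.2 * 4)  # CPI 2.6
print(round(first, 1), round(second, 1))  # 151.5 192.3
```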

30
Types of Benchmarks
  • Micro-Benchmarks
  • These are very small benchmarks aimed primarily
    at gauging the peak performance of a processor.
  • Kernel Benchmarks
  • These are very small benchmarks designed to
    measure processor performance (e.g. benchmarks to
    measure MIPS ratings).
  • Full Application Benchmarks
  • These use actual applications (or simulations of
    actual applications) to measure the performance
    of CPU, memory and I/O systems. They give a good
    idea of how the system will perform running such
    applications.
  • Target Workload
  • These use the actual programs that are going to
    be run on the system to measure performance.

31
Amdahl's Law
  • Amdahl's Law basically states that:
  • Execution time depends on a number of factors,
    such as the speeds of various classes of
    instructions.
  • If you improved the performance of one factor by
    X times, then the overall improvement in
    execution time will always be less than X.
  • If we were to improve the execution time of a
    particular class of instructions, then the new
    execution time is given by
  • New Ex Time = Ex Time of unaffected classes + (Ex
    Time of affected class / speedup)

32
Amdahl's Law
  • Suppose a program runs in 100 seconds on a
    machine, and multiplies account for 80 seconds of
    this time. What improvement in execution time
    will we have if we (i) improved multiplies by 5
    times, (ii) improved the other instructions by 10
    times?
  • i) New ex time = unaffected time + affected time
    / speedup
  • = 20 + 80/5 = 36 seconds
  • Improvement = 100/36 = 2.78 times faster.
  • ii) New ex time = 80 + 20/10 = 82 seconds
  • Improvement = 100/82 = 1.22 times faster.
  • Moral: Always improve the common case to get the
    best increase in performance!
  • Here the common case is the multiply (80%).
    Improving multiplies by 5 times gives far better
    gains than improving the other instructions (20%)
    by 10 times!
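
Both cases of this example follow from the same formula, which the sketch below evaluates directly. The function name amdahl_new_time is an assumption for illustration; the numbers are the ones in the example (100 seconds total, 80 seconds of multiplies).

```python
def amdahl_new_time(unaffected_time, affected_time, speedup):
    """New time = unaffected time + affected time / speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return unaffected_time + affected_time / speedup


TOTAL = 100  # seconds, of which 80 are spent in multiplies

case_i = amdahl_new_time(unaffected_time=20, affected_time=80, speedup=5)
case_ii = amdahl_new_time(unaffected_time=80, affected_time=20, speedup=10)

print(case_i, round(TOTAL / case_i, 2))    # 36.0 2.78
print(case_ii, round(TOTAL / case_ii, 2))  # 82.0 1.22
```

Note that even an unlimited speedup of the multiplies could never bring the total below the 20 seconds spent elsewhere, which is why the overall gain is always less than the speedup applied to one class.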

33
Summary
  • We looked at how instructions take several steps
    to execute, and each step is synchronized with
    the tick of a clock: a clock cycle.
  • Execution time is the only reliable way to tell
    which machine is faster.
  • Machine performance may also be measured using
    instruction throughput:
  • How many instructions can this machine execute in
    1 second?
  • Amdahl's law allows us to see how much
    improvement we need to make to a class of
    instructions in order to achieve a desired level
    of improvement in performance.