Advanced Computer Architecture - PowerPoint PPT Presentation

Provided by: rfox

1
Advanced Computer Architecture
  • We will consider issues in current architecture
    design and implementation
  • RISC instruction sets
  • Pipelining
  • Instruction-level parallelism
  • Block-level parallelism
  • Thread-level parallelism
  • Multiprocessors
  • Improving cache performance
  • Optimizing virtual memory usage
  • In CSC 362, we focused on
  • the roles of the components in the architecture
  • the structure of the architecture (how things
    connect together)
  • Here, we focus on
  • Using available technology to improve computer
    performance
  • Using quantitative measures to test architectural
    ideas
  • Using a RISC instruction set for examples
  • Discussing a variety of software and hardware
    techniques to provide optimization
  • Attempting to force as much parallelism out of
    the code as possible

2
Measuring Performance
  • We might use one of the following terms to
    measure performance
  • MIPS, MegaFLOPS
  • neither of these terms tells us how the processor
    performs on the other type of operation
  • Clock speed (GHz rating)
  • misleading, as we will explore throughout the
    semester
  • Execution time
  • worthwhile on an unloaded system
  • Throughput
  • number of programs / unit time useful for
    servers and large systems
  • Wall-clock time
  • CPU time, user CPU time, system CPU time
  • CPU time = user CPU time + system CPU time
  • System performance
  • on an unloaded system
  • note: CPU performance = 1 / execution time
  • What does it mean that one computer is faster
    than another?

3
Meaning of Performance
  • "X is n times faster than Y" means
  • Exec time Y / Exec time X = n
  • Perf X / Perf Y = n
  • Example
  • if the throughput of X is 1.3 times higher than Y
  • then the number of tasks that can be executed on
    X is 1.3 times the number executed on Y in the
    same amount of time
  • Example
  • X executes program p1 in 0.32 seconds
  • Y executes program p1 in 0.38 seconds
  • X is 0.38 / 0.32 = 1.19 times faster
  • that is, 19% faster
  • To validly compare two computers' performance, we
    must compare performance on the same program
  • Additionally, each computer may perform better
    on different programs
  • e.g., C1 runs P1 faster than C2 but C2 runs P2
    faster than C1
  • we might use weighted averages or geometric
    means, as well as distributions, to derive a
    single processor's overall performance (see pages
    34-37 if you are interested)
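
The "n times faster" definition above can be checked with a few lines of Python. This is only a sketch; the function name `times_faster` is made up for illustration, and the timings are the ones from the p1 example.

```python
def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y (n = time_Y / time_X)."""
    return exec_time_y / exec_time_x


# Timings from the slide: X runs p1 in 0.32 s, Y runs p1 in 0.38 s.
n = times_faster(0.32, 0.38)
print(f"X is {n:.2f} times faster than Y ({(n - 1) * 100:.0f}% faster)")
```

The same ratio applies to performance, since performance is the reciprocal of execution time.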

4
Benchmarks
  • A benchmark suite is a set of programs that test
    different performance metrics
  • Example: test array capabilities, floating-point
    operations, loops, ...
  • SPEC benchmark suites are commonly cited
  • SPEC 96 is the most recent benchmark, see figure
    1.13 on page 31
  • Reporting benchmark results must include
  • compiler settings and version
  • input
  • OS
  • number/size of disks
  • Results must be reproducible
  • Four levels of programs can be used to test
    performance
  • Real programs
  • e.g., C compiler, CAD tool
  • programs with input, output, options that the
    user selects
  • Kernels
  • extract key pieces of programs and test just those
  • Toy benchmarks
  • 10-100 lines of code such as quicksort whose
    performance is known in advance
  • Synthetic benchmarks
  • try to match average frequency of operations to
    simulate larger programs
  • Only real programs are used today
  • These others have been discredited since computer
    architects and compiler writers will optimize
    systems to perform well on these specific
    benchmarks/kernels

5
Principles of Computer Design
  • As computer architecture research has progressed,
    several key design concepts have been identified
  • The goal today is to further exploit each of
    these because they provide a great deal of
    performance speedup
  • We will examine these and use a quantitative
    approach to identify the extent of the speedup
  • Take advantage of parallelism
  • Using multiple hardware components (ALU
    functional units, memory modules, register ports,
    disk drives, etc.), we can attempt to execute
    instructions and threads in parallel
  • Principle of locality of reference
  • Used to design memory systems so that we can
    attempt to keep in cache the data and
    instructions that will most likely be referenced
    soon
  • Focus on the common case
  • As we see next, if we can achieve a small speedup
    for executing the common case, it is better than
    achieving a large speedup for an uncommon case

6
Amdahl's Law
  • In order to explore architectural improvements,
    we need a mechanism to gauge the speedup of our
    improvements
  • Amdahl's Law allows us to compute the speedup that
    can be gained by using a particular feature, as
    follows
  • Given an enhancement E
  • Speedup = performance with E /
    performance without E
  • or
  • Speedup = execution time without E /
    execution time with E
  • This law uses two factors
  • Fraction of the computation time in the original
    machine that can be converted to take advantage
    of the enhancement (F)
  • Improvement gained by the enhanced execution mode
    (how much faster will the task run if the
    enhanced mode is used for the entire program?) (S)

Speedup = 1 / ((1 - F) + F / S)

7
Examples
  • Example 1
  • Web server is to be enhanced
  • new CPU is 10 times faster on computation than
    old CPU
  • the original CPU spent 40% of its time processing
    and 60% of its time waiting for I/O
  • What will the speedup be?
  • Fraction enhancement used = 40%
  • Speedup in enhanced mode = 10
  • Speedup = 1 / ((1 - 0.4) + 0.4 / 10) = 1.56
  • Example 2
  • A benchmark consists of
  • 20% FP sqrt
  • 50% FP operations (including sqrt)
  • 50% other operations
  • Enhancement options are
  • add FP sqrt hardware to speed up sqrt performance
    by a factor of 10
  • enhance all FP operations by a factor of 1.6
  • Speedup FP sqrt = 1 / ((1 - 0.2) + 0.2 / 10) = 1.22
  • Speedup all FP = 1 / ((1 - 0.5) + 0.5 / 1.6) = 1.23
  • The enhancement to support the common case is
    (slightly) better
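
Amdahl's Law and the two examples above can be reproduced with a short Python sketch. The function name `amdahl_speedup` is chosen here for illustration; F and S are the fraction and enhanced-mode speedup defined on the previous slide.

```python
def amdahl_speedup(fraction: float, enhanced_speedup: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction) + fraction / enhanced_speedup)


# Example 1: 40% of the time is computation, new CPU is 10 times faster.
web_server = amdahl_speedup(0.40, 10)

# Example 2: FP sqrt is 20% of the time, all FP is 50% of the time.
fp_sqrt = amdahl_speedup(0.20, 10)
all_fp = amdahl_speedup(0.50, 1.6)

print(f"web server: {web_server:.2f}")
print(f"FP sqrt hardware: {fp_sqrt:.2f}, all FP: {all_fp:.2f}")
```

Note that the sqrt hardware is the larger local speedup (10 versus 1.6), yet the enhancement that applies to more of the execution time still wins overall.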

8
CPU Performance Formulae
  • CPU time = CPU clock cycles × clock cycle time
  • CPU clock cycles = the number of clock cycles
    that elapse during the execution of the given
    program
  • clock cycle time is the reciprocal of the clock
    rate, that is, how much time elapses for one
    clock cycle, which gives us
  • CPU time = CPU clock cycles for program / clock rate
  • CPU time = IC × CPI × clock cycle time
  • IC - instruction count (number of instructions)
  • CPI - clock cycles per instruction
  • IC × CPI = CPU clock cycles
  • CPI = CPU clock cycles / IC
  • CPU time = (Σ CPIi × ICi) × clock cycle time
  • Average CPI = Σ (CPIi × ICi) / total instruction
    count
  • In the last two equations, CPIi and ICi are for each
    type of operation (for instance, the CPI and
    number of adds, the CPI and number of loads, ...)
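
As a minimal sketch of these formulae, the Python below computes average CPI and CPU time from a per-type instruction mix. The mix and the 1 GHz clock are hypothetical numbers picked only to exercise the formulas; they do not come from the slides.

```python
def total_cycles(mix):
    """Sum of CPI_i * IC_i over (instruction_count, cpi) pairs."""
    return sum(ic * cpi for ic, cpi in mix)


def average_cpi(mix):
    """Average CPI = total clock cycles / total instruction count."""
    return total_cycles(mix) / sum(ic for ic, _ in mix)


def cpu_time(mix, clock_rate_hz):
    """CPU time = CPU clock cycles / clock rate (seconds)."""
    return total_cycles(mix) / clock_rate_hz


# Hypothetical program: (instruction count, CPI) for each instruction type.
mix = [(500_000, 1), (300_000, 2), (200_000, 3)]
print(f"average CPI = {average_cpi(mix):.2f}")
print(f"CPU time at 1 GHz = {cpu_time(mix, 1e9) * 1e3:.2f} ms")
```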

9
Example
  • Assume
  • frequency of FP operations = 25% (including
    sqrt) and frequency of FP sqrt = 2%
  • average CPI of FP operations = 4.0, CPI of FP
    sqrt = 20
  • average CPI of other instructions = 1.33
  • CPI = 4 × 25% + 1.33 × 75% = 2.0
  • Two alternatives
  • reduce CPI of FP sqrt to 2 or
  • reduce average CPI of all FP ops (including sqrt)
    to 2.5
  • CPI new FP sqrt = CPI original - 2% × (20 - 2) =
    1.64
  • CPI new FP = 75% × 1.33 + 25% × 2.5 = 1.625
  • Speedup new FP sqrt = CPI original / CPI new FP
    sqrt = 2.0 / 1.64 = 1.22
  • Speedup new FP = CPI original / CPI new FP = 2.0 /
    1.625 = 1.23
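
The arithmetic in this example can be replayed in Python. The sketch assumes the 25% FP frequency includes sqrt, which is what the CPI calculation implies, and that speedup is CPI original divided by CPI new (IC and clock rate are unchanged). Variable names are illustrative.

```python
FP_FREQ, FP_CPI = 0.25, 4.0        # all FP operations (assumed to include sqrt)
SQRT_FREQ, SQRT_CPI = 0.02, 20.0   # FP sqrt only
OTHER_FREQ, OTHER_CPI = 0.75, 1.33

cpi_original = FP_FREQ * FP_CPI + OTHER_FREQ * OTHER_CPI

# Alternative 1: FP sqrt CPI drops from 20 to 2.
cpi_new_sqrt = cpi_original - SQRT_FREQ * (SQRT_CPI - 2.0)

# Alternative 2: average CPI of all FP operations drops to 2.5.
cpi_new_fp = OTHER_FREQ * OTHER_CPI + FP_FREQ * 2.5

speedup_sqrt = cpi_original / cpi_new_sqrt
speedup_fp = cpi_original / cpi_new_fp
print(f"CPI: {cpi_original:.2f} -> {cpi_new_sqrt:.2f} (sqrt) or {cpi_new_fp:.2f} (all FP)")
print(f"speedup: {speedup_sqrt:.2f} (sqrt) vs {speedup_fp:.2f} (all FP)")
```

The two speedups agree with the Amdahl's Law results for the same scenario on the earlier Examples slide.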

10
Computing Speedup: Which Formula?
  • We can compute speedup by
  • comparing the CPU time before and
    after an enhancement
  • or by using Amdahl's Law
  • Which should we use?
  • the two approaches give the same answer
  • let's demonstrate this with an example
  • Benchmark consists of 35% loads, 15% stores, 40%
    ALU operations and 10% branches
  • CPI for each instruction is 5 for loads and
    stores and 4 for ALU and branches (since this is
    an integer benchmark, the floating point
    registers are not used)
  • Consider that we could keep more values in
    registers by moving them to floating point
    registers rather than storing and then reloading
    these values in memory
  • Let's have the compiler replace some of the
    loads/stores with register moves
  • this enhancement is done by the compiler, so it
    costs us nothing!
  • assuming that the compiler can eliminate 20% of the
    loads/stores from the program, how worthwhile is it?

11
Solution
  • We change some loads/stores to ALU operations
  • so overall CPI goes down, IC remains the same
  • Solution 1: compute CPU time differences
  • CPU time = IC × CPI × clock cycle time
  • CPI old = 50% × 5 + 50% × 4 = 4.5
  • CPI new = 40% × 5 + 60% × 4 = 4.4
  • Since IC and the clock cycle time have not changed,
    speedup is simply CPI old / CPI new
  • Speedup = 4.5 / 4.4 = 1.0227, a 2.27% speedup
  • Solution 2: Amdahl's Law
  • Speedup of enhanced mode is from 5 cycles to 4
    cycles, or 5 / 4 = 1.25
  • Fraction used = fraction of the execution time
    where we use register moves instead of loads/stores
  • overall CPI is 4.5
  • enhancement used on 20% of loads/stores
  • 20% × 50% × 5 = 0.5 clock cycles out of 4.5, or 0.5
    / 4.5 = 11.1% of the time
  • Amdahl's Law: 1 / ((1 - F) + F / S) = 1 / ((1 -
    0.111) + 0.111 / 1.25) = 1 / 0.9778 = 1.0227, a 2.27%
    speedup
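
A short Python check that the two solutions agree. It is a sketch of the slide's arithmetic only, with illustrative variable names; loads/stores are 50% of instructions at CPI 5, everything else is CPI 4, and 20% of loads/stores become 4-cycle register moves.

```python
# Solution 1: ratio of CPIs (IC and clock cycle time are unchanged).
cpi_old = 0.50 * 5 + 0.50 * 4
cpi_new = 0.40 * 5 + 0.60 * 4
speedup_cpu_time = cpi_old / cpi_new

# Solution 2: Amdahl's Law.
# F = share of the original cycles spent in the loads/stores that get replaced.
fraction = (0.20 * 0.50 * 5) / cpi_old
enhanced = 5 / 4  # each replaced instruction goes from 5 cycles to 4
speedup_amdahl = 1 / ((1 - fraction) + fraction / enhanced)

print(f"CPU time ratio: {speedup_cpu_time:.4f}")
print(f"Amdahl's Law:   {speedup_amdahl:.4f}")
```

The key step in the Amdahl version is that F is a fraction of execution time (cycles), not a fraction of the instruction count.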

12
Why MIPS Can Be Misleading
  • Assume a load-store machine with a breakdown of
  • 43% ALU
  • 21% loads
  • 12% stores
  • 24% branches
  • CPI = 1 for ALU operations
  • CPI = 2 for all other operations
  • Optimizing compiler is able to discard 50% of ALU
    operations
  • Ignoring system issues, if the machine has a 2
    nanosecond clock cycle (500 MHz) and an unoptimized
    CPI of 1.57,
  • what is the MIPS rating for the optimized and
    unoptimized versions? does the MIPS value agree
    with the execution time?
  • MIPS = IC / (execution time × 10^6)
  • exec time = IC × CPI / clock rate
  • so, MIPS = clock rate / (CPI × 10^6)
  • CPI unoptimized = 1.57
  • MIPS unoptimized = 500 MHz / (1.57 × 10^6) = 318.5
  • CPI optimized = (0.43 / 2 × 1 + 0.57 × 2) / (1 - 0.43
    / 2) = 1.73
  • MIPS optimized = 500 MHz / (1.73 × 10^6) = 289.0
  • The optimized program will execute faster because
    it has fewer instructions, but its CPI is larger
    because a greater portion of the instructions
    have a higher CPI, and therefore its MIPS rating
    is lower
  • So, MIPS and execution time are not directly
    related!
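
The following Python sketch works through the same example and also computes relative execution time, which the slide argues for in words. Names are illustrative. Because it keeps the unrounded optimized CPI (about 1.726 instead of 1.73), its optimized MIPS value comes out slightly above the slide's 289.0.

```python
CLOCK_RATE_HZ = 500e6  # 2 ns clock cycle


def mips(cpi: float) -> float:
    """MIPS = clock rate / (CPI * 10^6)."""
    return CLOCK_RATE_HZ / (cpi * 1e6)


# Unoptimized: 43% ALU at CPI 1, 57% other instructions at CPI 2.
ic_unopt = 1.0  # instruction count, normalized to the original program
cpi_unopt = 0.43 * 1 + 0.57 * 2

# Optimized: half of the ALU instructions are removed.
ic_opt = 1.0 - 0.43 / 2
cpi_opt = (0.43 / 2 * 1 + 0.57 * 2) / ic_opt

# Execution time per original instruction = IC * CPI / clock rate.
time_unopt = ic_unopt * cpi_unopt / CLOCK_RATE_HZ
time_opt = ic_opt * cpi_opt / CLOCK_RATE_HZ

print(f"unoptimized: {mips(cpi_unopt):.1f} MIPS, {time_unopt * 1e9:.2f} ns per original instruction")
print(f"optimized:   {mips(cpi_opt):.1f} MIPS, {time_opt * 1e9:.2f} ns per original instruction")
```

The optimized program has the lower MIPS rating and the shorter execution time.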

13
Sample Problem 1
  • Consider adding register-memory ALU instructions
    to a machine that previously only permitted
    register-register ALU operations
  • Assume a benchmark with the following breakdown
    of operations is used to test this enhancement
  • ALU operations: 43%, CPI = 1
  • Loads: 21%, CPI = 2
  • Stores: 12%, CPI = 2
  • Branches: 24%, CPI = 2
  • The new ALU register-memory operation has the
    following consequences
  • ALU register-memory operations have CPI = 2 and
    branches now have a CPI = 3
  • But, 25% of data loaded are only used once, so
    the new ALU register-memory instruction can
    be used in place of the load + ALU operation pair
  • Is it worth it?

14
Solution
  • CPI old = 0.43 × 1 + 0.57 × 2 = 1.57
  • 3 changes
  • some ALU operations use the new mode, which changes
    their CPI
  • fewer loads
  • all branches have a higher CPI
  • We have a new distribution
  • 25% of ALU operations become ALU-memory
    operations
  • 25% × 43% = 11%, so we remove this many loads,
    giving us 89% as many instructions as previously
  • Loads = (21% - 25% × 43%) / 89% = 11%
  • Stores = 12% / 89% = 13%
  • ALU operations = 43% / 89% = 48%
  • Branches = 24% / 89% = 27%
  • CPI new = 0.11 × 2 + 0.13 × 2 + 0.27 × 3 + 0.48 × (0.25
    × 2 + 0.75 × 1) = 1.89
  • CPU time = IC × CPI × clock cycle time (CCT)
  • Clock cycle time remains unchanged
  • CPI has been recomputed
  • IC in the new system is 89% of the old system
  • CPU old = IC × 1.57 × CCT
  • CPU new = 0.89 × IC × 1.89 × CCT
  • Speedup = 1.57 / (0.89 × 1.89) = 0.933
  • this is a slowdown, so this enhancement is not an
    improvement!
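
The solution can be cross-checked in Python by counting cycles per original instruction, which avoids renormalizing the mix. This sketch follows the solution's reading that 25% of ALU operations absorb a load; names are illustrative. It uses unrounded percentages, so it reports a speedup of about 0.92 rather than the slide's rounded 0.93; either way the result is a slowdown.

```python
ALU, LOAD, STORE, BRANCH = 0.43, 0.21, 0.12, 0.24

# Old machine: cycles per instruction of the original program.
cycles_old = ALU * 1 + LOAD * 2 + STORE * 2 + BRANCH * 2

# New machine: 25% of ALU operations become register-memory (CPI 2),
# each one removes a load, and every branch now costs 3 cycles.
converted = 0.25 * ALU
cycles_new = (
    (ALU - converted) * 1   # remaining register-register ALU operations
    + converted * 2         # register-memory ALU operations
    + (LOAD - converted) * 2
    + STORE * 2
    + BRANCH * 3
)

relative_ic = 1.0 - converted  # new instruction count / old instruction count
speedup = cycles_old / cycles_new
print(f"new IC is {relative_ic:.1%} of the old IC")
print(f"speedup = {speedup:.3f}")
```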

15
Sample Problem 2
  • Assume a machine with a perfect cache
  • And the following instruction mix breakdown
  • ALU: 43%, CPI = 1
  • Loads: 21%, CPI = 2
  • Stores: 12%, CPI = 2
  • Branches: 24%, CPI = 2
  • An imperfect cache has a miss rate of 5% for
    instructions and 10% for data and a miss penalty
    of 40 cycles
  • How much faster is the machine with the perfect
    cache?
  • CPI perfect cache machine = 0.43 × 1 + 0.57 × 2 =
    1.57
  • Because of cache misses, we have to recompute the
    CPI for every instruction type based on misses
    during instruction fetch (5%) and misses during
    data accesses (10%), where a miss adds 40 cycles
    to the CPI
  • CPI imperfect cache machine = 0.43 × (1 + 0.05 × 40)
    + 0.21 × (2 + 0.05 × 40 + 0.10 × 40) + 0.12 × (2 + 0.05
    × 40 + 0.10 × 40) + 0.24 × (2 + 0.05 × 40) = 4.89
  • Perfect machine is 4.89 / 1.57 = 3.11 times faster
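
A Python sketch of the same calculation, with the instruction mix stored as a small table. Every instruction pays the instruction-fetch miss cost; only loads and stores pay the data miss cost. Names such as `MIX` and `cpi_with_misses` are illustrative.

```python
MISS_PENALTY = 40       # cycles added per miss
INSTR_MISS_RATE = 0.05  # applies to every instruction fetch
DATA_MISS_RATE = 0.10   # applies only to loads and stores

# (name, frequency, base CPI, accesses data memory?)
MIX = [
    ("alu", 0.43, 1, False),
    ("load", 0.21, 2, True),
    ("store", 0.12, 2, True),
    ("branch", 0.24, 2, False),
]


def cpi_with_misses(base_cpi: float, touches_data: bool) -> float:
    """Base CPI plus the average stall cycles from cache misses."""
    stalls = INSTR_MISS_RATE * MISS_PENALTY
    if touches_data:
        stalls += DATA_MISS_RATE * MISS_PENALTY
    return base_cpi + stalls


cpi_perfect = sum(freq * cpi for _name, freq, cpi, _data in MIX)
cpi_imperfect = sum(freq * cpi_with_misses(cpi, data) for _name, freq, cpi, data in MIX)

print(f"perfect cache CPI = {cpi_perfect:.2f}")
print(f"imperfect cache CPI = {cpi_imperfect:.2f}")
print(f"perfect cache machine is {cpi_imperfect / cpi_perfect:.2f} times faster")
```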

16
Sample Problem 3
  • Architects are considering one of two
    enhancements for their processor
  • Enhancement 1 can be used 20% of the time and
    offers a speedup of 3
  • Enhancement 2 offers a speedup of 7
  • What fraction of the time will the second
    enhancement have to be used in order to achieve
    the same overall speedup as the first
    enhancement?
  • speedup from enhancement 1 = 1 / ((1 - 0.2) + 0.2 /
    3) = 1.154
  • So, for the second enhancement to match, we have
    1.154 = 1 / ((1 - x) + x / 7) and we must solve
    for x
  • using some algebra, we get
  • 1.154 = 1 / (1 - 7x / 7 + x / 7) = 1 / (1 - 6x /
    7) = 1 / ((7 - 6x) / 7) = 7 / (7 - 6x)
  • so 7 - 6x = 7 / 1.154, which gives 6x = 7 - 7 /
    1.154 = 0.934, or x = 0.934 / 6 = 0.156
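
The algebra generalizes: solving target = 1 / ((1 - x) + x / S) for x gives x = (1 - 1 / target) / (1 - 1 / S). The Python sketch below uses that closed form (the function names are made up for illustration) and then plugs x back into Amdahl's Law as a check.

```python
def amdahl_speedup(fraction: float, enhanced_speedup: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction) + fraction / enhanced_speedup)


def required_fraction(target_speedup: float, enhanced_speedup: float) -> float:
    """Fraction x for which an enhancement of speedup S reaches the target."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / enhanced_speedup)


target = amdahl_speedup(0.20, 3)  # overall speedup of enhancement 1
x = required_fraction(target, 7)  # usage needed by enhancement 2

print(f"target speedup = {target:.3f}")
print(f"enhancement 2 must be used {x:.1%} of the time")
print(f"check: {amdahl_speedup(x, 7):.3f}")
```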

17
Sample Problem 4
  • We will compare a CISC machine and a RISC machine
    on a benchmark
  • The machines have the following characteristics
  • CISC machine has CPIs of
  • 4 for load/store, 3 for ALU/branch, 10 for
    call/return
  • CPU clock rate of 1.75 GHz
  • RISC machine has a CPI of 1.2 (as it is
    pipelined) and a CPU clock rate of 1 GHz
  • CISC machine uses complex instructions, so the
    CISC version of the benchmark is 40% smaller than
    the same benchmark on the RISC machine (that is,
    CISC IC is 40% less than RISC IC)
  • The benchmark has a breakdown of
  • 38% loads, 10% stores, 35% ALU operations, 3%
    calls, 3% returns, and 11% branches
  • Which machine will run the benchmark in less time?

18
Solution
  • We compare the CPU time for both machines
  • CPU time = IC × CPI / clock rate
  • Since both machines have GHz in their clock rate,
    to simplify, we will drop the GHz value
  • CISC machine
  • First, compute the CISC machine's CPI given the
    individual CPIs for the machine and the
    benchmark's breakdown of instructions
  • CPI = 4 × (0.38 + 0.10) + 3 × (0.35 + 0.11) + 10 ×
    (0.03 + 0.03) = 3.9
  • CPU time CISC = IC CISC × 3.9 / 1.75
  • RISC machine
  • CPU time RISC = IC RISC × 1.2 / 1 = 1.2 × IC RISC
  • Recall that the CISC machine has 40% fewer
    instructions, so IC CISC = 0.6 × IC RISC
  • CPU time CISC = 0.6 × IC RISC × 3.9 / 1.75 = 1.34 ×
    IC RISC
  • CPU time RISC = 1.2 × IC RISC
  • Since the RISC CPU time is smaller, it is faster
    by 1.34 / 1.2 = 1.12, or 12% faster
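
The comparison can be replayed in Python with times expressed per RISC instruction. This is a sketch with illustrative names. It keeps unrounded values, so the final ratio is about 1.11 rather than the 1.12 obtained from the rounded 1.34; the conclusion is the same.

```python
# Benchmark mix (fractions of the instruction count).
MIX = {"load": 0.38, "store": 0.10, "alu": 0.35,
       "call": 0.03, "return": 0.03, "branch": 0.11}

# CISC cycles per instruction type.
CISC_CPI = {"load": 4, "store": 4, "alu": 3,
            "branch": 3, "call": 10, "return": 10}

CISC_CLOCK_GHZ = 1.75
RISC_CLOCK_GHZ = 1.0
RISC_CPI = 1.2
CISC_IC_RATIO = 0.6  # CISC IC is 40% less than RISC IC

cisc_cpi = sum(MIX[kind] * CISC_CPI[kind] for kind in MIX)

# CPU time = IC * CPI / clock rate, in ns per RISC instruction.
cisc_time = CISC_IC_RATIO * cisc_cpi / CISC_CLOCK_GHZ
risc_time = 1.0 * RISC_CPI / RISC_CLOCK_GHZ

print(f"CISC CPI = {cisc_cpi:.2f}")
print(f"CISC time = {cisc_time:.3f}, RISC time = {risc_time:.3f} (ns per RISC instruction)")
print(f"RISC is {cisc_time / risc_time:.2f} times faster")
```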