Advanced Computer Architecture - PowerPoint PPT Presentation

Provided by: rfox

1
Advanced Computer Architecture

We will consider issues in current architecture design and implementation:
  RISC instruction sets
  Pipelining
  Instruction-level parallelism
  Block-level parallelism
  Thread-level parallelism
  Multiprocessors
  Improving cache performance
  Optimizing virtual memory usage

In CSC 362, we focused on
  the roles of the components in the architecture
  the structure of the architecture (how things connect together)
Here, we focus on
  using available technology to improve computer performance
  using quantitative measures to test architectural ideas
  using a RISC instruction set for examples
  discussing a variety of software and hardware techniques to provide optimization
  attempting to force as much parallelism out of the code as possible

2
Measuring Performance

We might use one of the following terms to measure performance:
  MIPS, MegaFLOPS
    neither of these terms tells us how the processor performs on the other type of operation
  Clock speed (GHz rating)
    misleading, as we will explore throughout the semester
  Execution time
    worthwhile on an unloaded system
  Throughput
    number of programs / unit time; useful for servers and large systems
  Wall-clock time
  CPU time, user CPU time, system CPU time
    CPU time = user CPU time + system CPU time
  System performance
    measured on an unloaded system
    note: CPU performance = 1 / execution time
What does it mean that one computer is faster than another?

3
Meaning of Performance

"X is n times faster than Y" means
  Exec time Y / Exec time X = n
  Perf X / Perf Y = n
Example
  if the throughput of X is 1.3 times that of Y, then X can execute 1.3 times as many tasks as Y in the same amount of time
Example
  X executes program p1 in 0.32 seconds
  Y executes program p1 in 0.38 seconds
  X is 0.38 / 0.32 = 1.19 times faster, that is, 19% faster
To validly compare two computers' performance, we must compare performance on the same program
Additionally, each computer may perform better on different programs
  e.g., C1 runs P1 faster than C2 but C2 runs P2 faster than C1
  we might use weighted averages or geometric means, as well as distributions, to derive a single processor's overall performance (see pages 34-37 if you are interested)
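The "n times faster" definition can be checked with a few lines of Python. This is a minimal sketch; the function name times_faster is an illustrative choice, and the timings are the ones from the p1 example.

```python
def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y (n = time_Y / time_X)."""
    return exec_time_y / exec_time_x


# Timings from the example: X runs p1 in 0.32 s, Y runs p1 in 0.38 s.
n = times_faster(0.32, 0.38)
print(f"X is {n:.2f} times faster than Y ({(n - 1) * 100:.0f}% faster)")
```

Because performance is the reciprocal of execution time, the same ratio is obtained from Perf X / Perf Y.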

4
Benchmarks

A benchmark suite is a set of programs that test different performance metrics
  Example: test array capabilities, floating-point operations, loops, ...
SPEC benchmark suites are commonly cited
  SPEC 96 is the most recent benchmark, see figure 1.13 on page 31
Reporting benchmark results must include
  compiler settings and version
  input
  OS
  number/size of disks
Results must be reproducible

Four levels of programs can be used to test performance
  Real programs
    e.g., C compiler, CAD tool
    programs with input, output, options that the user selects
  Kernels
    remove key pieces of programs and just test those
  Toy benchmarks
    10-100 lines of code, such as quicksort, whose performance is known in advance
  Synthetic benchmarks
    try to match average frequency of operations to simulate larger programs
Only real programs are used today
  The others have been discredited since computer architects and compiler writers will optimize systems to perform well on these specific benchmarks/kernels

5
Principles of Computer Design

As computer architecture research has progressed, several key design concepts have been identified
  The goal today is to further exploit each of these because they provide a great deal of performance speedup
  We will examine these and use a quantitative approach to identify the extent of the speedup
Take advantage of parallelism
  Using multiple hardware components (ALU functional units, memory modules, register ports, disk drives, etc.) we can attempt to execute instructions and threads in parallel
Principle of locality of reference
  Used to design memory systems so that we can attempt to keep in cache the data and instructions that will most likely be referenced soon
Focus on the common case
  As we see next, a small speedup for the common case can be worth more than a large speedup for an uncommon case

6
Amdahl's Law

In order to explore architectural improvements, we need a mechanism to gauge the speedup of our improvements
Amdahl's Law allows us to compute the speedup that can be gained by using a particular feature, as follows
Given an enhancement E
  Speedup = performance with E / performance without E
  or
  Speedup = execution time without E / execution time with E

This law uses two factors
  Fraction of the computation time in the original machine that can be converted to take advantage of the enhancement (F)
  Improvement gained by the enhanced execution mode (how much faster will the task run if the enhanced mode is used for the entire program?) (S)

Speedup = 1 / ((1 - F) + F / S)

7
Examples

Example 1
  Web server is to be enhanced
    new CPU is 10 times faster on computation than old CPU
    the original CPU spent 40% of its time processing and 60% of its time waiting for I/O
  What will the speedup be?
    Fraction enhancement used = 40%
    Speedup in enhanced mode = 10
    Speedup = 1 / ((1 - .4) + .4 / 10) = 1.56
Example 2
  A benchmark consists of
    20% FP sqrt
    50% FP operations (including sqrt)
    50% other operations
  Enhancement options are
    add FP sqrt hardware to speed up sqrt performance by a factor of 10
    enhance all FP operations by a factor of 1.6
  Speedup FP sqrt = 1 / ((1 - .2) + .2 / 10) = 1.22
  Speedup all FP = 1 / ((1 - .5) + .5 / 1.6) = 1.23
  The enhancement to support the common case is (slightly) better
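Both examples can be reproduced with a small helper. This is a sketch only; the name amdahl_speedup and its parameter names are illustrative, with fraction_enhanced playing the role of F and speedup_enhanced the role of S.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Example 1: 40% of the time is computation, new CPU is 10x faster on it.
web_server = amdahl_speedup(0.40, 10)

# Example 2: FP sqrt hardware (20% of time, 10x) vs. all FP ops (50%, 1.6x).
sqrt_only = amdahl_speedup(0.20, 10)
all_fp = amdahl_speedup(0.50, 1.6)

print(f"web server: {web_server:.2f}")
print(f"FP sqrt:    {sqrt_only:.2f}")
print(f"all FP:     {all_fp:.2f}")
```

Note that even an infinitely fast enhancement cannot push the overall speedup past 1 / (1 - F), which is why the fraction of time affected matters so much.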

8
CPU Performance Formulae

CPU time = CPU clock cycles × clock cycle time
  CPU clock cycles = the number of clock cycles that elapse during the execution of the given program
  clock cycle time is the reciprocal of the clock rate, that is, how much time elapses for one clock cycle, which gives us
CPU time = CPU clock cycles for prog / clock rate
CPU time = IC × CPI × clock cycle time
  IC - instruction count (number of instructions)
  CPI - clock cycles per instruction
IC × CPI = CPU clock cycles
CPI = CPU clock cycles / IC
CPU time = (Σ CPIi × ICi) × clock cycle time
Average CPI = Σ (CPIi × ICi) / Total Instruction Count
  In the latter two equations, CPIi and ICi are for each type of operation (for instance, the CPI and number of adds, the CPI and number of loads, ...)
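The formulae above translate directly into code. The sketch below is illustrative: the function names, the instruction mix, and the 1 GHz clock are all hypothetical values chosen for the example, not figures from the slides.

```python
def average_cpi(mix):
    """mix is a list of (instruction_count, cpi) pairs, one per instruction type."""
    total_ic = sum(ic for ic, _ in mix)
    total_cycles = sum(ic * cpi for ic, cpi in mix)
    return total_cycles / total_ic


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = IC x CPI x clock cycle time = IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical program: 500 one-cycle, 300 two-cycle, 200 three-cycle instructions.
hypothetical_mix = [(500, 1), (300, 2), (200, 3)]
avg = average_cpi(hypothetical_mix)      # (500 + 600 + 600) / 1000
seconds = cpu_time(1000, avg, 1e9)       # assumed 1 GHz clock
print(avg, seconds)
```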

9
Example

Assume
  frequency of FP operations = 25% (including sqrt) and frequency of FP sqrt = 2%
  average CPI of FP operations = 4.0, CPI of FP sqrt = 20
  average CPI of other instructions = 1.33
  CPI original = 4 × 25% + 1.33 × 75% = 2.0
Two alternatives
  reduce CPI of FP sqrt to 2, or
  reduce average CPI of all FP ops (including sqrt) to 2.5
CPI new FP sqrt = CPI original - 2% × (20 - 2) = 1.64
CPI new FP = 75% × 1.33 + 25% × 2.5 = 1.625
Speedup new FP = CPI original / CPI new FP = 2.0 / 1.625 = 1.23
Speedup new FP sqrt = CPI original / CPI new FP sqrt = 2.0 / 1.64 = 1.22
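The arithmetic for the two alternatives can be checked as follows. This is a sketch with illustrative variable names; it keeps unrounded intermediate values, so the CPIs differ slightly in the third decimal from the rounded slide figures while the speedups round to the same values.

```python
# Original machine: 25% FP ops at CPI 4.0, 75% other instructions at CPI 1.33.
cpi_original = 0.25 * 4.0 + 0.75 * 1.33          # 1.9975, rounded to 2.0 on the slide

# Alternative 1: FP sqrt (2% of instructions) drops from CPI 20 to CPI 2.
cpi_new_sqrt = cpi_original - 0.02 * (20 - 2)

# Alternative 2: every FP operation averages CPI 2.5.
cpi_new_fp = 0.75 * 1.33 + 0.25 * 2.5

# IC and clock rate are unchanged, so speedup is just the CPI ratio.
speedup_sqrt = cpi_original / cpi_new_sqrt
speedup_fp = cpi_original / cpi_new_fp
print(f"sqrt only: {speedup_sqrt:.2f}, all FP: {speedup_fp:.2f}")
```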

10
Computing Speedup: which formula?

We can compute speedup by
  determining the difference in CPU time before and after an enhancement
  or by using Amdahl's Law
Which should we use?
  the two approaches give the same result
  let's demonstrate this with an example
Benchmark consists of 35% loads, 15% stores, 40% ALU operations and 10% branches
  CPI for each instruction is 5 for loads and stores and 4 for ALU and branches (since this is an integer benchmark, the floating point registers are not used)
Consider that we could keep more values in registers by moving them to floating point registers rather than storing and then reloading these values in memory
Let's have the compiler replace some of the loads/stores with register moves
  this enhancement is done by the compiler, so it costs us nothing!
  assuming that the compiler can replace 20% of the loads/stores in the program, how worthwhile is it?

11
Solution

We change some loads/stores to ALU operations
  so overall CPI goes down, IC remains the same
Solution 1: compute CPU time differences
  CPU Time = IC × CPI / Clock Rate
  CPIold = 50% × 5 + 50% × 4 = 4.5
  CPInew = 40% × 5 + 60% × 4 = 4.4
  Since IC and clock rate have not changed, speedup is simply CPIold / CPInew
  Speedup = 4.5 / 4.4 = 1.0227, a 2.27% speedup
Solution 2: Amdahl's Law
  Speedup of enhanced mode is from 5 cycles to 4 cycles, or 5 / 4 = 1.25
  Fraction used = fraction of the execution time where we use conversions instead of loads/stores
    overall CPI is 4.5
    enhancement used on 20% of loads/stores
    20% × 50% × 5 = .5 clock cycles out of 4.5, or .5 / 4.5 = 11.1% of the time
  Amdahl's Law: 1 / ((1 - F) + F / S) = 1 / ((1 - .111) + .111 / 1.25) = 1 / .9778 = 1.0227, a 2.27% speedup
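That the two solutions agree can be confirmed numerically. The sketch below uses illustrative variable names and the figures from this example: half the instructions are loads/stores at CPI 5, and 20% of them become CPI-4 register moves.

```python
# Solution 1: ratio of CPIs (IC and clock rate are unchanged).
cpi_old = 0.50 * 5 + 0.50 * 4
cpi_new = 0.40 * 5 + 0.60 * 4
speedup_cpu_time = cpi_old / cpi_new

# Solution 2: Amdahl's Law.
# F = share of the original execution time spent in the converted loads/stores.
fraction = (0.20 * 0.50 * 5) / cpi_old
speedup_enhanced = 5 / 4
speedup_amdahl = 1 / ((1 - fraction) + fraction / speedup_enhanced)

print(f"CPU time: {speedup_cpu_time:.4f}, Amdahl: {speedup_amdahl:.4f}")
```

The key detail is that F is a fraction of execution time, not of instruction count, which is why the 10% of instructions being converted corresponds to 11.1% of the time.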

12
Why MIPS Can Be Misleading

Assume a load-store machine with a breakdown of
  43% ALU
  33% load/store (21% loads, 12% stores)
  24% branch
CPI = 1 for ALU operations
CPI = 2 for all other operations
Optimizing compiler is able to discard 50% of ALU operations
Ignoring system issues, if the machine has a 2 nanosecond clock cycle (500 MHz) and 1.57 unoptimized CPI,
  what is the MIPS rating for the optimized and unoptimized versions? Does the MIPS value agree with the execution time?

MIPS = IC / (Execution Time × 10^6)
  exec time = IC × CPI / clock rate
  so, MIPS = clock rate / (CPI × 10^6)
CPIunoptimized = 1.57
  MIPSunoptimized = 500 MHz / (1.57 × 10^6) = 318.5
CPIoptimized = (.43 / 2 × 1 + .57 × 2) / (1 - .43 / 2) = 1.73
  MIPSoptimized = 500 MHz / (1.73 × 10^6) = 289.0
The optimized program will execute faster because it has fewer instructions, but its CPI is larger because a greater portion of the instructions have a higher CPI, and therefore its MIPS rating is lower
So, a higher MIPS rating does not necessarily mean a shorter execution time!
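The disagreement between MIPS and execution time can be made concrete. This is a sketch with illustrative names; instruction counts are expressed relative to the unoptimized program, and because it keeps the unrounded optimized CPI (about 1.726) it reports roughly 289.7 MIPS rather than the 289.0 obtained from the rounded 1.73.

```python
CLOCK_RATE_HZ = 500e6  # 2 ns clock cycle


def mips(cpi, clock_rate_hz=CLOCK_RATE_HZ):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# Unoptimized: 43% ALU at CPI 1, 57% everything else at CPI 2.
ic_unopt = 1.0
cpi_unopt = 0.43 * 1 + 0.57 * 2

# Optimized: half of the ALU operations are discarded.
ic_opt = 1.0 - 0.43 / 2
cpi_opt = ((0.43 / 2) * 1 + 0.57 * 2) / ic_opt

# Execution time per instruction of the original program, in seconds.
time_unopt = ic_unopt * cpi_unopt / CLOCK_RATE_HZ
time_opt = ic_opt * cpi_opt / CLOCK_RATE_HZ

print(f"unoptimized: {mips(cpi_unopt):.1f} MIPS, {time_unopt * 1e9:.2f} ns")
print(f"optimized:   {mips(cpi_opt):.1f} MIPS, {time_opt * 1e9:.2f} ns")
```

The optimized version has the lower MIPS rating and the shorter execution time.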

13
Sample Problem 1

Consider adding register-memory ALU instructions to a machine that previously only permitted register-register ALU operations
Assume a benchmark with the following breakdown of operations is used to test this enhancement
  ALU operations: 43%, CPI 1
  Loads: 21%, CPI 2
  Stores: 12%, CPI 2
  Branches: 24%, CPI 2
The new ALU register-memory operation has the following consequences
  ALU register-memory operations have CPI 2 and branches now have CPI 3
  But 25% of the ALU operations use a loaded value that is used only once, so the new ALU register-memory instruction can be used in place of the load + ALU operation pair
Is it worth it?

14
Solution

CPIold = .43 × 1 + .57 × 2 = 1.57
3 changes
  some ALU operations use the new mode, which changes their CPI
  fewer loads
  all branches have higher CPI
We have a new distribution
  25% of ALU operations become ALU-memory operations
  25% × 43% = 11%, so we remove this many loads, giving us 89% as many instructions as previously
  Loads: (21% - 25% × 43%) / 89% = 11%
  Stores: 12% / 89% = 13%
  ALU operations: 43% / 89% = 48%
  Branches: 24% / 89% = 27%

CPInew = .11 × 2 + .13 × 2 + .27 × 3 + .48 × (.25 × 2 + .75 × 1) = 1.89
CPU Time = IC × CPI × clock cycle time (CCT)
  Clock cycle time remains unchanged
  CPI has been recomputed
  IC in the new system is 89% of the old system
CPUold = IC × 1.57 × CCT
CPUnew = .89 × IC × 1.89 × CCT
Speedup = 1.57 / (.89 × 1.89) = .933
  this is a slowdown, so this enhancement is not an improvement!
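A way to sidestep the rounding in the renormalized percentages is to count cycles per instruction of the original program. The sketch below does that with illustrative names; it yields a speedup of about 0.92, slightly lower than the figure obtained from the rounded percentages, and leads to the same conclusion.

```python
# Original mix: (fraction of instructions, CPI).
old_mix = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}
cycles_old = sum(frac * cpi for frac, cpi in old_mix.values())

# 25% of ALU ops absorb their load and become register-memory ops (CPI 2).
replaced = 0.25 * 0.43

cycles_new = (
    (0.43 - replaced) * 1    # remaining register-register ALU ops
    + replaced * 2           # new register-memory ALU ops
    + (0.21 - replaced) * 2  # loads that are still needed
    + 0.12 * 2               # stores are unchanged
    + 0.24 * 3               # branches now cost 3 cycles
)

# Clock cycle time is unchanged, so speedup is the ratio of total cycles.
speedup = cycles_old / cycles_new
print(f"old: {cycles_old:.4f}, new: {cycles_new:.4f}, speedup: {speedup:.3f}")
```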

15
Sample Problem 2

Assume a machine with a perfect cache
And the following instruction mix breakdown
  ALU: 43%, CPI 1
  Loads: 21%, CPI 2
  Stores: 12%, CPI 2
  Branches: 24%, CPI 2
An imperfect cache has a miss rate of 5% for instructions and 10% for data, and a miss penalty of 40 cycles
How much faster is the machine with the perfect cache?

CPIperfectcachemachine = .43 × 1 + .57 × 2 = 1.57
Because of cache misses, we have to recompute the CPI for every instruction type based on misses during instruction fetch (5%) and misses during data accesses (10%), where a miss adds 40 cycles to the CPI
CPIimperfectcachemachine = .43 × (1 + .05 × 40) + .21 × (2 + .05 × 40 + .10 × 40) + .12 × (2 + .05 × 40 + .10 × 40) + .24 × (2 + .05 × 40) = 4.89
Perfect machine is 4.89 / 1.57 = 3.11 times faster
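The same calculation can be sketched in code. The names are illustrative; each entry in the mix records whether the instruction type also makes a data access, since only loads and stores pay the data-miss penalty while every instruction pays the instruction-fetch penalty.

```python
MISS_PENALTY = 40                 # cycles
INSTR_MISS_RATE = 0.05
DATA_MISS_RATE = 0.10

# (fraction, base CPI, accesses data memory?)
mix = [
    (0.43, 1, False),  # ALU
    (0.21, 2, True),   # loads
    (0.12, 2, True),   # stores
    (0.24, 2, False),  # branches
]

cpi_perfect = sum(frac * cpi for frac, cpi, _ in mix)

cpi_imperfect = sum(
    frac * (cpi
            + INSTR_MISS_RATE * MISS_PENALTY
            + (DATA_MISS_RATE * MISS_PENALTY if uses_data else 0))
    for frac, cpi, uses_data in mix
)

ratio = cpi_imperfect / cpi_perfect
print(f"perfect: {cpi_perfect:.2f}, imperfect: {cpi_imperfect:.2f}, ratio: {ratio:.2f}")
```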

16
Sample Problem 3

Architects are considering one of two enhancements for their processor
  Enhancement 1 can be used 20% of the time and offers a speedup of 3
  Enhancement 2 offers a speedup of 7
What fraction of the time will the second enhancement have to be used in order to achieve the same overall speedup as the first enhancement?
  speedup from enhancement 1 = 1 / ((1 - .2) + .2 / 3) = 1.154
So, for the second enhancement to match, we have 1.154 = 1 / ((1 - x) + x / 7) and we must solve for x
  using some algebra, we get
  1.154 = 1 / (1 - 7x / 7 + x / 7) = 1 / (1 - 6x / 7) = 1 / ((7 - 6x) / 7) = 7 / (7 - 6x)
  so 7 - 6x = 7 / 1.154, which gives 6x = 7 - 7 / 1.154 = 0.934, or x = 0.934 / 6 = 0.156
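The algebra generalizes: rearranging Speedup = 1 / ((1 - F) + F / S) for F gives F = (1 - 1 / Speedup) / (1 - 1 / S). The sketch below uses that rearrangement; the function names are illustrative.

```python
def amdahl_speedup(fraction, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1 / ((1 - fraction) + fraction / speedup_enhanced)


def required_fraction(target_speedup, speedup_enhanced):
    """Solve Amdahl's Law for F given the overall speedup and S."""
    return (1 - 1 / target_speedup) / (1 - 1 / speedup_enhanced)


target = amdahl_speedup(0.20, 3)     # enhancement 1
x = required_fraction(target, 7)     # fraction needed for enhancement 2
print(f"target speedup: {target:.3f}, required fraction: {x:.3f}")
```

Substituting x back into amdahl_speedup(x, 7) returns the target speedup, which is a quick check on the rearrangement.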

17
Sample Problem 4

We will compare a CISC machine and a RISC machine on a benchmark
The machines have the following characteristics
  CISC machine has CPIs of
    4 for load/store, 3 for ALU/branch, 10 for call/return
    CPU clock rate of 1.75 GHz
  RISC machine has a CPI of 1.2 (as it is pipelined) and a CPU clock rate of 1 GHz
CISC machine uses complex instructions, so the CISC version of the benchmark is 40% smaller than the same benchmark on the RISC machine (that is, CISC IC is 40% less than RISC IC)
The benchmark has a breakdown of
  38% loads, 10% stores, 35% ALU operations, 3% calls, 3% returns, and 11% branches
Which machine will run the benchmark in less time?

18
Solution

We compare the CPU time for both machines
  CPU time = IC × CPI / Clock rate
  Since both machines have GHz in their clock rate, to simplify, we will drop the GHz value
CISC machine
  First, compute the CISC machine's CPI given the individual CPIs for the machine and the benchmark's breakdown of instructions
  CPI = 4 × (.38 + .10) + 3 × (.35 + .11) + 10 × (.03 + .03) = 3.9
  CPU time CISC = IC CISC × 3.9 / 1.75
RISC machine
  CPU time RISC = IC RISC × 1.2 / 1 = IC RISC × 1.2
Recall that the CISC machine has 40% fewer instructions, so IC CISC = .6 × IC RISC
  CPU time CISC = .6 × IC RISC × 3.9 / 1.75 = 1.34 × IC RISC
  CPU time RISC = 1.2 × IC RISC
Since the RISC CPU time is smaller, it is faster by 1.34 / 1.2 = 1.12, or 12% faster
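The comparison can be reproduced as follows. This is a sketch with illustrative names; times are expressed per RISC instruction with clock rates in GHz (so the unit is nanoseconds), and because nothing is rounded along the way the final ratio comes out near 1.11 rather than the 1.12 obtained from the rounded 1.34.

```python
# CISC CPI from the benchmark mix.
cisc_cpi = (
    4 * (0.38 + 0.10)     # loads and stores
    + 3 * (0.35 + 0.11)   # ALU operations and branches
    + 10 * (0.03 + 0.03)  # calls and returns
)

RISC_CPI = 1.2
CISC_CLOCK_GHZ = 1.75
RISC_CLOCK_GHZ = 1.0
CISC_IC_RATIO = 0.6  # CISC executes 40% fewer instructions than RISC

# CPU time = IC x CPI / clock rate, per RISC instruction.
time_cisc = CISC_IC_RATIO * cisc_cpi / CISC_CLOCK_GHZ
time_risc = 1.0 * RISC_CPI / RISC_CLOCK_GHZ

ratio = time_cisc / time_risc
print(f"CISC CPI: {cisc_cpi:.2f}")
print(f"CISC: {time_cisc:.2f}, RISC: {time_risc:.2f}, RISC faster by {ratio:.2f}x")
```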