Rosetta Demonstrator Project MASC, Adelaide University and Ashenden Designs presentation

1

Lec 3 Sept 2
complete Chapter 1
exercises from Chapter 1
quiz 1
Chapter 2 start

2
Performance Summary
The BIG Picture

Performance depends on
Algorithm: affects IC, possibly CPI
Programming language: affects IC, CPI
Compiler: affects IC, CPI
Instruction set architecture: affects IC, CPI, Tc
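The equation that this summary refers to did not survive in the transcript. A sketch of the standard relation, assuming IC is the instruction count, CPI the average clock cycles per instruction, and Tc the clock cycle time (the same symbols as the bullets above):

```latex
\text{CPU time} = \text{IC} \times \text{CPI} \times T_c
  = \frac{\text{Instructions}}{\text{Program}}
    \times \frac{\text{Clock cycles}}{\text{Instruction}}
    \times \frac{\text{Seconds}}{\text{Clock cycle}}
```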

3
Exercise 1.2.1 For a color display using 8 bits
for each primary color (R, G, B) per pixel and
with a resolution of 1280 x 800 pixels, what
should be the size (in bytes) of the frame buffer
to store a frame?
Each frame requires 1280 x 800 x 3 = 3,072,000 bytes, or about 3 Mbytes.
If a computer has 3 GB memory to store such frames, how many frames
can be stored?
(3 x 10^9) / (3 x 10^6) = 1000 frames (approximately)
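The arithmetic above can be checked with a short Python sketch. The function name is hypothetical, and "3 GB" is assumed to mean 3 x 10^9 bytes, as the slide's own calculation implies.

```python
# Frame buffer sizing for Exercise 1.2.1.
BYTES_PER_PIXEL = 3  # one byte (8 bits) each for R, G and B


def frame_buffer_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Bytes needed to hold one frame."""
    return width * height * bytes_per_pixel


frame = frame_buffer_bytes(1280, 800)
memory = 3 * 10**9  # assumption: "3 GB" read as decimal gigabytes
whole_frames = memory // frame  # only complete frames count
print(frame, whole_frames)
```

With the unrounded frame size the memory holds 976 whole frames; the slide's figure of 1000 comes from rounding the frame to 3 x 10^6 bytes.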
5
Exercise 1.3 Consider 3 processors P1, P2 and P3
with the same instruction set, with clock rates and
CPI given below
      Clock rate   CPI
P1    2 GHz        1.5
P2    1.5 GHz      1.0
P3    3 GHz        2.5
1.3.1. Which processor has the highest performance?
Suppose the program has N instructions.
Time taken to execute on P1 is 1.5 N / (2 x 10^9) = 0.75 N x 10^-9 s
Time taken to execute on P2 is N / (1.5 x 10^9) = 0.67 N x 10^-9 s
Time taken to execute on P3 is 2.5 N / (3 x 10^9) = 0.83 N x 10^-9 s
6
Time taken to execute on P1 is 1.5 N / (2 x 10^9) = 0.75 N x 10^-9 s
Time taken to execute on P2 is N / (1.5 x 10^9) = 0.67 N x 10^-9 s
Time taken to execute on P3 is 2.5 N / (3 x 10^9) = 0.83 N x 10^-9 s
P2 has the best performance
(since it takes the least time to execute).
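A minimal Python sketch of the same comparison. Because every processor runs the same N instructions, it is enough to compare time per instruction (CPI divided by clock rate). The dictionary layout and function name are illustrative choices, not part of the exercise.

```python
# Clock rate (Hz) and CPI for each processor in Exercise 1.3.
processors = {
    "P1": (2.0e9, 1.5),
    "P2": (1.5e9, 1.0),
    "P3": (3.0e9, 2.5),
}


def seconds_per_instruction(clock_hz, cpi):
    """Average time to execute one instruction."""
    return cpi / clock_hz


times = {name: seconds_per_instruction(hz, cpi)
         for name, (hz, cpi) in processors.items()}
fastest = min(times, key=times.get)  # least time means best performance
for name, t in times.items():
    print(f"{name}: {t * 1e9:.2f} ns per instruction")
print("fastest:", fastest)
```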
8
Exercise 1.3 Consider 3 processors P1, P2 and P3
with the same instruction set, with clock rates and
CPI given below
      Clock rate   CPI
P1    2 GHz        1.5
P2    1.5 GHz      1.0
P3    3 GHz        2.5
1.3.2. If the processors each execute a program in 10 seconds,
find the number of cycles and the number of instructions.
Time taken to execute on P1 is
1.5 N1 / (2 x 10^9) = 0.75 N1 x 10^-9 = 10
So N1 = 1.33 x 10^10
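The slide works out only the instruction count for P1. A sketch that applies the same two relations (cycles = time x clock rate, instructions = cycles / CPI) to all three processors; the variable names are assumptions made for illustration.

```python
# Clock rate (Hz) and CPI for each processor in Exercise 1.3.
processors = {
    "P1": (2.0e9, 1.5),
    "P2": (1.5e9, 1.0),
    "P3": (3.0e9, 2.5),
}
RUN_TIME_S = 10.0  # each processor runs its program for 10 seconds

results = {}
for name, (clock_hz, cpi) in processors.items():
    cycles = RUN_TIME_S * clock_hz   # cycles elapsed in 10 s
    instructions = cycles / cpi      # instructions completed in those cycles
    results[name] = (cycles, instructions)
    print(f"{name}: {cycles:.2e} cycles, {instructions:.2e} instructions")
```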
10
Exercise 1.4.3 Given below are the number of
instructions of a program
arith   store   load   branch   total
500     50      100    50       700
Assuming the instructions take 1, 5, 5 and 2
cycles, what is the execution time in a 2 GHz
processor?
Solution: time to execute = cycle time x CPI x no. of inst
Cycle time = 1 / (2 x 10^9) s
CPI = (500 x 1/700 + 50 x 5/700 + 100 x 5/700 + 50 x 2/700) = 1350/700
So the total time = 675 x 10^-9 sec
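The same solution as a Python sketch: sum the cycles per instruction class, then divide by the clock rate. The dictionary names are illustrative.

```python
# Instruction counts and cycles per instruction class (Exercise 1.4.3).
counts = {"arith": 500, "store": 50, "load": 100, "branch": 50}
cycles_per_class = {"arith": 1, "store": 5, "load": 5, "branch": 2}
CLOCK_HZ = 2.0e9  # 2 GHz processor

total_instructions = sum(counts.values())
total_cycles = sum(counts[k] * cycles_per_class[k] for k in counts)
average_cpi = total_cycles / total_instructions
exec_time_s = total_cycles / CLOCK_HZ  # same as cycle time x CPI x instructions

print(total_instructions, total_cycles, round(average_cpi, 3), exec_time_s)
```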
12

Exercise 1.6
Compilers have a profound impact on the
performance of an application on a given
processor. This problem will explore the impact
compilers have on execution time.
       Compiler A                       Compiler B
       No. instructions   Exec. time    No. instructions   Exec. time
(a)    1.0 x 10^9         1 s           1.2 x 10^9         1.4 s
(b)    1.4 x 10^9         0.8 s         1.2 x 10^9         0.7 s

Find the average CPI for each program given that
the processor has a cycle time of 1 ns.
Exec. time = CPI x cycle time x no. of inst
(a) Compiler A: CPI = 1 / (10^-9 x 10^9) = 1
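The slide shows the CPI for one case only. A sketch that rearranges the same equation (CPI = exec. time / (cycle time x no. of inst)) for all four entries of the table; the key names are illustrative.

```python
CYCLE_TIME_S = 1e-9  # 1 ns cycle time, as given in the exercise

# (program, compiler) -> (instruction count, execution time in seconds)
measurements = {
    ("a", "A"): (1.0e9, 1.0),
    ("a", "B"): (1.2e9, 1.4),
    ("b", "A"): (1.4e9, 0.8),
    ("b", "B"): (1.2e9, 0.7),
}

cpi = {key: exec_time / (instructions * CYCLE_TIME_S)
       for key, (instructions, exec_time) in measurements.items()}

for (program, compiler), value in cpi.items():
    print(f"({program}) compiler {compiler}: CPI = {value:.2f}")
```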
13
Power Trends
1.5 The Power Wall

In CMOS IC technology
Power = Capacitive load x Voltage^2 x Frequency
Slide annotations: power x30, voltage 5V -> 1V, frequency x1000
14
Reducing Power

Suppose a new CPU has
85% of capacitive load of old CPU
15% voltage and 15% frequency reduction
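The resulting power ratio is not shown in the transcript. A minimal sketch, assuming the relation power = capacitive load x voltage^2 x clock rate used elsewhere in these slides, with each quantity scaled to 85% of its old value:

```python
# Scaling factors of the new CPU relative to the old one.
CAPACITIVE_LOAD_SCALE = 0.85  # 85% of the old capacitive load
VOLTAGE_SCALE = 0.85          # 15% voltage reduction
FREQUENCY_SCALE = 0.85        # 15% frequency reduction

# Power is proportional to capacitive load x voltage^2 x frequency.
power_ratio = CAPACITIVE_LOAD_SCALE * VOLTAGE_SCALE**2 * FREQUENCY_SCALE
print(f"new power / old power = {power_ratio:.2f}")
```

Under these assumptions the new CPU draws roughly half the power of the old one (0.85 to the fourth power).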

The power wall
We can't reduce voltage further
We can't remove more heat
How else can we improve performance?

15
Exercise 1.7

1.7.4. Given the following information about each
processor, calculate its capacitive load
Processor: 80286
Clock rate: 12.5 MHz
Power: 3.3 W
Voltage: 5 V
Solution: Use the equation
power = capacitive load x voltage^2 x clock rate
Capacitive load = 3.3 / (5 x 5 x 12.5) x 10^-6
= 0.01056 x 10^-6 F
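The rearranged equation can be written as a small Python function. The function name is hypothetical; it uses the slide's form of the equation, with no extra constant factor.

```python
def capacitive_load(power_w, voltage_v, clock_hz):
    """Solve power = C x V^2 x f for C (in farads)."""
    return power_w / (voltage_v**2 * clock_hz)


# Intel 80286 figures from the exercise: 3.3 W, 5 V, 12.5 MHz.
load_80286 = capacitive_load(3.3, 5.0, 12.5e6)
print(f"{load_80286:.4e} F")
```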

16
Uniprocessor Performance
1.6 The Sea Change: The Switch to Multiprocessors
Constrained by power, instruction-level
parallelism, memory latency
17
Multiprocessors
General-purpose uni-cores have reached limits of
historic performance scaling
Power consumption
Wire delays
DRAM access latency
Diminishing returns of more instruction-level parallelism
Slide from Prof. Saman Amarasinghe
18
Multiprocessors

Multicore microprocessors
More than one processor per chip
Requires explicitly parallel programming
Compare with instruction level parallelism
Hardware executes multiple instructions at once
Hidden from the programmer
Hard to do
Programming for performance
Load balancing
Optimizing communication and synchronization

19
Manufacturing ICs
1.7 Real Stuff: The AMD Opteron X4

Yield: proportion of working dies per wafer

20
Integrated Circuit Cost

Nonlinear relation to area and defect rate
Wafer cost and area are fixed
Defect rate determined by manufacturing process
Die area determined by architecture and circuit
design
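The cost formulas did not survive in the transcript. A sketch of the standard textbook model that these bullets describe. The yield expression is one common empirical approximation, assumed here rather than recovered from the slide.

```latex
\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}

\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}

\text{Yield} = \frac{1}{\bigl(1 + \text{Defects per area} \times \text{Die area}/2\bigr)^{2}}
```

Die area appears both in dies per wafer and in yield, which is why cost per die grows faster than linearly with area.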

21
SPEC CPU Benchmark

Programs used to measure performance
Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
Develops benchmarks for CPU, I/O, Web,
SPEC CPU2006
Elapsed time to execute a selection of programs
Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine
Summarize as geometric mean of performance ratios
CINT2006 (integer) and CFP2006 (floating-point)

22
CINT2006 for Opteron X4 2356
Name        Description                    IC x 10^9  CPI    Tc (ns)  Exec time  Ref time  SPECratio
perl        Interpreted string processing  2,118      0.75   0.40     637        9,777     15.3
bzip2       Block-sorting compression      2,389      0.85   0.40     817        9,650     11.8
gcc         GNU C Compiler                 1,050      1.72   0.40     724        8,050     11.1
mcf         Combinatorial optimization     336        10.00  0.40     1,345      9,120     6.8
go          Go game (AI)                   1,658      1.09   0.40     721        10,490    14.6
hmmer       Search gene sequence           2,783      0.80   0.40     890        9,330     10.5
sjeng       Chess game (AI)                2,176      0.96   0.40     837        12,100    14.5
libquantum  Quantum computer simulation    1,623      1.61   0.40     1,047      20,720    19.8
h264avc     Video compression              3,102      0.80   0.40     993        22,130    22.3
omnetpp     Discrete event simulation      587        2.94   0.40     690        6,250     9.1
astar       Games/path finding             1,082      1.79   0.40     773        7,020     9.1
xalancbmk   XML parsing                    1,058      2.70   0.40     1,143      6,900     6.0
Geometric mean                                                                             11.7
High cache miss rates
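As a check on the summary row, the geometric mean of the twelve SPECratio values can be recomputed with a short Python sketch. The list is copied from the table; the variable names are illustrative.

```python
import math

# SPECratio column of the CINT2006 table, in row order.
spec_ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5,
               14.5, 19.8, 22.3, 9.1, 9.1, 6.0]

# Geometric mean: n-th root of the product, computed via logarithms.
geometric_mean = math.exp(sum(math.log(r) for r in spec_ratios) / len(spec_ratios))
print(f"{geometric_mean:.1f}")
```

Each SPECratio is the reference time divided by the measured execution time, so the geometric mean summarizes relative rather than absolute performance.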
23
Amdahl's Law
f = fraction unaffected, p = speedup of the rest
speedup = 1 / (f + (1 - f)/p)
Amdahl's law: speedup achieved if a
fraction f of a task is unaffected and the
remaining 1 - f part runs p times as fast.
25
Amdahl's Law in design
Example

A processor spends 30% of its time on flp
addition, 25% on flp mult,
and 10% on flp division. Evaluate the following
enhancements, each
costing the same to implement
Redesign of the flp adder to make it twice as
fast.
Redesign of the flp multiplier to make it three
times as fast.
Redesign the flp divider to make it 10 times as
fast.
Solution
Adder redesign speedup = 1 / (0.7 + 0.3 / 2) = 1.18
Multiplier redesign speedup = 1 / (0.75 + 0.25 / 3) = 1.20
Divider redesign speedup = 1 / (0.9 + 0.1 / 10) = 1.10
What if both the adder and the multiplier are
redesigned?
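The closing question can be explored with a small sketch that extends the same formula to several enhanced parts at once: the unaffected fraction stays as it is, and each enhanced fraction is divided by its own speedup factor. The function name and argument layout are assumptions made for illustration.

```python
def amdahl_speedup(enhancements):
    """Overall speedup for a list of (fraction_of_time, speedup_factor) pairs."""
    unaffected = 1.0 - sum(fraction for fraction, _ in enhancements)
    enhanced = sum(fraction / factor for fraction, factor in enhancements)
    return 1.0 / (unaffected + enhanced)


adder = amdahl_speedup([(0.30, 2)])        # adder twice as fast
multiplier = amdahl_speedup([(0.25, 3)])   # multiplier three times as fast
divider = amdahl_speedup([(0.10, 10)])     # divider ten times as fast
both = amdahl_speedup([(0.30, 2), (0.25, 3)])  # adder and multiplier together

print(f"{adder:.2f} {multiplier:.2f} {divider:.2f} {both:.2f}")
```

Under this model, redesigning both the adder and the multiplier gives a speedup of about 1.46, more than either change alone.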

26
Amdahl's Law: limit to improvement

Pitfall: improving an aspect of a computer and expecting a
proportional improvement in overall performance

1.8 Fallacies and Pitfalls

Example: multiply accounts for 80s/100s
How much improvement in multiply performance to
get 5x overall?

Can't be done!
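The reasoning behind this answer can be written out as follows. This is a sketch in which n denotes the assumed improvement factor for multiply, and the remaining 20 s of the 100 s run is unaffected.

```latex
T_{\text{improved}} = \frac{T_{\text{affected}}}{n} + T_{\text{unaffected}}

\frac{100}{5} = 20 = \frac{80}{n} + 20
\quad\Longrightarrow\quad \frac{80}{n} = 0
```

No finite n satisfies the last equation: the unaffected 20 s alone already uses up the whole target time.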

Corollary make the common case fast

27
Pitfall: MIPS as a Performance Metric

MIPS: Millions of Instructions Per Second
Doesn't account for
Differences in ISAs between computers
Differences in complexity between instructions

CPI varies between programs on a given CPU
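The defining relation is missing from the transcript. A sketch of the usual definition, which shows why MIPS depends on CPI and therefore on the program being run:

```latex
\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}}
            = \frac{\text{Clock rate}}{\text{CPI} \times 10^{6}}
```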

28
Concluding Remarks

Cost/performance is improving
Due to underlying technology development
Hierarchical layers of abstraction
In both hardware and software
Instruction set architecture
The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
Use parallelism to improve performance

1.9 Concluding Remarks
