Rosetta Demonstrator Project MASC, Adelaide University and Ashenden Designs



1
  • Lec 3 Sept 2
  • complete Chapter 1
  • exercises from Chapter 1
  • quiz 1
  • Chapter 2 start

2
Performance Summary
The BIG Picture
  • Performance depends on
  • Algorithm affects IC, possibly CPI
  • Programming language affects IC, CPI
  • Compiler affects IC, CPI
  • Instruction set architecture affects IC, CPI, Tc
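
The factors listed above (IC, CPI, Tc) combine in the usual CPU time relation, CPU time = IC x CPI x Tc. A minimal Python sketch of that relation follows; the function name `cpu_time` and all sample numbers are illustrative assumptions, not values from the slides.

```python
def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time in seconds = IC x CPI x Tc."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical baseline: 1e9 instructions, CPI 2.0, 0.5 ns cycle (2 GHz).
baseline = cpu_time(1e9, 2.0, 0.5e-9)

# Hypothetical compiler improvement: 10% fewer instructions, same CPI and Tc.
improved = cpu_time(0.9e9, 2.0, 0.5e-9)

print(f"baseline: {baseline:.2f} s, improved: {improved:.2f} s")
```

Changing any one factor scales the result linearly, which is why the algorithm, language, compiler and ISA all show up in the list.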

3
Exercise 1.2.1 For a color display using 8 bits
for each primary color (R, G, B) per pixel and
with a resolution of 1280 x 800 pixels, what
should be the size (in bytes) of the frame buffer
to store a frame?
  • Each frame requires 1280 x 800 x 3 = 3,072,000 bytes, or about 3 Mbytes
If a computer has 3 GB of memory to store such frames, how many frames
can be stored?
  • 3 x 10^9 / (3 x 10^6) = 1000 frames
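The arithmetic for this exercise can be checked with a short Python sketch. It assumes 1 GB = 10^9 bytes, as the slide does; the variable names are illustrative. Note that the slide rounds a frame to 3 Mbytes, so the exact frame count comes out slightly lower than 1000.

```python
WIDTH, HEIGHT = 1280, 800
BYTES_PER_PIXEL = 3  # 8 bits for each of R, G, B

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
memory_bytes = 3 * 10**9  # assumes 1 GB = 10^9 bytes, as on the slide

# Whole frames that fit, using the exact frame size.
frames_exact = memory_bytes // frame_bytes
# Whole frames that fit, using the slide's 3 Mbyte approximation.
frames_rounded = memory_bytes // (3 * 10**6)

print(frame_bytes, frames_exact, frames_rounded)
```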
4
Exercise 1.3 Consider 3 processors P1, P2 and P3
with the same instruction set, with clock rates and
CPI given below
      clock rate   CPI
P1    2 GHz        1.5
P2    1.5 GHz      1.0
P3    3 GHz        2.5
5
Exercise 1.3.1 Which processor has the highest performance?
Suppose the program has N instructions.
  • Time taken to execute on P1 is 1.5 N / (2 x 10^9) = 0.75 N x 10^-9 s
  • Time taken to execute on P2 is N / (1.5 x 10^9) = 0.67 N x 10^-9 s
  • Time taken to execute on P3 is 2.5 N / (3 x 10^9) = 0.83 N x 10^-9 s
6
P2 has the best performance
(since it takes the least time to execute).
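The comparison in Exercise 1.3.1 can be reproduced with a small Python sketch. Since all three processors share an instruction set, the instruction count N cancels out and only the time per instruction (CPI / clock rate) matters. The dictionary layout and function name are illustrative assumptions.

```python
# (clock rate in Hz, CPI) for each processor, taken from the exercise.
processors = {"P1": (2e9, 1.5), "P2": (1.5e9, 1.0), "P3": (3e9, 2.5)}


def time_per_instruction(clock_hz, cpi):
    """Average seconds per instruction = CPI / clock rate."""
    return cpi / clock_hz


times = {name: time_per_instruction(*spec) for name, spec in processors.items()}
fastest = min(times, key=times.get)

for name, t in times.items():
    print(f"{name}: {t * 1e9:.2f} ns per instruction")
print("fastest:", fastest)
```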
7
Exercise 1.3.2 If the
processors each execute a program in 10 seconds,
find the number of cycles and the number of
instructions.
8
Exercise 1.3.2 Solution
  • Time taken to execute on P1 is 1.5 N1 / (2 x 10^9) = 0.75 N1 x 10^-9 = 10 s
  • So N1 = 1.33 x 10^10
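The slide works only the instruction count for P1. Applying the same relations (cycles = time x clock rate, instructions = cycles / CPI) to all three processors can be sketched in Python as follows; the names are illustrative, and the P2 and P3 figures are computed here rather than taken from the slides.

```python
RUN_TIME_S = 10.0
# (clock rate in Hz, CPI) for each processor, taken from the exercise.
processors = {"P1": (2e9, 1.5), "P2": (1.5e9, 1.0), "P3": (3e9, 2.5)}

results = {}
for name, (clock_hz, cpi) in processors.items():
    cycles = RUN_TIME_S * clock_hz      # cycles = time x clock rate
    instructions = cycles / cpi         # instructions = cycles / CPI
    results[name] = (cycles, instructions)
    print(f"{name}: {cycles:.3e} cycles, {instructions:.3e} instructions")
```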
9
Exercise 1.4.3 Given below are the number of
instructions of a program
arith   store   load   branch   total
500     50      100    50       700
Assuming the instructions take 1, 5, 5 and 2
cycles respectively, what is the execution time on a 2 GHz
processor?
10
Exercise 1.4.3 Solution
  • Time to execute = cycle time x CPI x no. of instructions
  • Cycle time = 1 / (2 x 10^9) s
  • CPI = 500/700 + 50 x 5/700 + 100 x 5/700 + 50 x 2/700 = 1350/700
  • So the total time = 675 x 10^-9 sec
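The same calculation as a Python sketch, using the instruction mix and per-class cycle counts from the exercise. The dictionary names are illustrative assumptions.

```python
CLOCK_HZ = 2e9

# Instruction counts and cycles per instruction class, from the exercise.
counts = {"arith": 500, "store": 50, "load": 100, "branch": 50}
cycles_per = {"arith": 1, "store": 5, "load": 5, "branch": 2}

total_instructions = sum(counts.values())
total_cycles = sum(counts[k] * cycles_per[k] for k in counts)

average_cpi = total_cycles / total_instructions
exec_time_s = total_cycles / CLOCK_HZ  # same as IC x CPI x cycle time

print(total_cycles, round(average_cpi, 3), exec_time_s)
```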
11
  • Exercise 1.6
  • Compilers have a profound impact on the
    performance of an application on a given
    processor. This problem will explore the impact
    compilers have on execution time.
         compiler A                      compiler B
         no. instructions  exec. time    no. instructions  exec. time
    (a)  1.0 x 10^9        1 s           1.2 x 10^9        1.4 s
    (b)  1.4 x 10^9        0.8 s         1.2 x 10^9        0.7 s

Find the average CPI for each program given that
the processor has a cycle time of 1 ns.
12
  • Exercise 1.6 Solution
  • Exec. time = CPI x cycle time x no. of instructions
  • (a) Compiler A: CPI = 1 / (10^-9 x 10^9) = 1
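The slide works only case (a) for compiler A. A Python sketch that applies the same rearranged relation, CPI = exec time / (no. of instructions x cycle time), to all four entries of the table follows. The function and dictionary names are illustrative, and the three values not shown on the slide are computed here, not quoted from it.

```python
CYCLE_TIME_S = 1e-9  # 1 ns, as given in the exercise


def average_cpi(instruction_count, exec_time_s, cycle_time_s=CYCLE_TIME_S):
    """CPI = execution time / (instruction count x cycle time)."""
    return exec_time_s / (instruction_count * cycle_time_s)


# (case, compiler) -> (instruction count, execution time in seconds)
programs = {
    ("a", "A"): (1.0e9, 1.0),
    ("a", "B"): (1.2e9, 1.4),
    ("b", "A"): (1.4e9, 0.8),
    ("b", "B"): (1.2e9, 0.7),
}

cpis = {key: average_cpi(ic, t) for key, (ic, t) in programs.items()}
for (case, compiler), cpi in cpis.items():
    print(f"({case}) compiler {compiler}: CPI = {cpi:.2f}")
```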
13
Power Trends
1.5 The Power Wall
  • In CMOS IC technology

[Slide annotations: x1000, x30, 5 V → 1 V]
14
Reducing Power
  • Suppose a new CPU has
  • 85% of capacitive load of old CPU
  • 15% voltage and 15% frequency reduction
  • The power wall
  • We can't reduce voltage further
  • We can't remove more heat
  • How else can we improve performance?
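
The power saving for the new CPU described above can be estimated with a short Python sketch. It assumes the relation power = capacitive load x voltage^2 x frequency (the one used in Exercise 1.7 of these slides) and reads a 15% reduction as a scale factor of 0.85; the constant names are illustrative.

```python
CAP_SCALE = 0.85   # new CPU has 85% of the old capacitive load
VOLT_SCALE = 0.85  # 15% voltage reduction
FREQ_SCALE = 0.85  # 15% frequency reduction

# Ratio of new power to old power; voltage enters squared.
power_ratio = CAP_SCALE * VOLT_SCALE**2 * FREQ_SCALE

print(f"new power / old power = {power_ratio:.2f}")
```

Under these assumptions the new CPU draws roughly half the power of the old one.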

15
Exercise 1.7
  • 1.7.4. Given the following information about each
    processor, calculate its capacitive load
  • Processor 80286 clock rate 12.5 MHz
  • power 3.3 W
  • voltage 5 V
  • Solution: Use the equation
  • power = capacitive load x voltage^2 x clock
    rate
  • Capacitive load = 3.3 / (5 x 5 x 12.5) x 10^-6
    = 0.01056 x 10^-6 F
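
The rearranged equation can be checked with a minimal Python sketch using the 80286 figures from the exercise. The function name is an illustrative assumption.

```python
def capacitive_load(power_w, voltage_v, clock_hz):
    """Rearranged from power = capacitive load x voltage^2 x clock rate."""
    return power_w / (voltage_v**2 * clock_hz)


# 80286 figures from the exercise: 3.3 W, 5 V, 12.5 MHz.
load_farads = capacitive_load(3.3, 5.0, 12.5e6)

print(f"{load_farads:.4e} F")
```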

16
Uniprocessor Performance
1.6 The Sea Change: The Switch to Multiprocessors
Constrained by power, instruction-level
parallelism, memory latency
17
Multiprocessors
General-purpose uni-cores have reached limits of
historic performance scaling
  • Power consumption
  • Wire delays
  • DRAM access latency
  • Diminishing returns of more instruction-level parallelism
Slide from Prof. Saman Amarasinghe
18
Multiprocessors
  • Multicore microprocessors
  • More than one processor per chip
  • Requires explicitly parallel programming
  • Compare with instruction level parallelism
  • Hardware executes multiple instructions at once
  • Hidden from the programmer
  • Hard to do
  • Programming for performance
  • Load balancing
  • Optimizing communication and synchronization

19
Manufacturing ICs
1.7 Real Stuff: The AMD Opteron X4
  • Yield: proportion of working dies per wafer

20
Integrated Circuit Cost
  • Nonlinear relation to area and defect rate
  • Wafer cost and area are fixed
  • Defect rate determined by manufacturing process
  • Die area determined by architecture and circuit
    design
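
The nonlinear relation mentioned above can be illustrated with a Python sketch. The transcript does not preserve the cost formulas, so this sketch assumes the commonly used textbook approximations: dies per wafer ≈ wafer area / die area, yield = 1 / (1 + defects per area x die area / 2)^2, and cost per die = cost per wafer / (dies per wafer x yield). All numeric inputs are hypothetical.

```python
def die_yield(defects_per_mm2, die_area_mm2):
    """Assumed yield model: 1 / (1 + defects per area x die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_mm2 * die_area_mm2 / 2.0) ** 2


def cost_per_die(wafer_cost, wafer_area_mm2, die_area_mm2, defects_per_mm2):
    """Cost per die = wafer cost / (dies per wafer x yield)."""
    dies_per_wafer = wafer_area_mm2 / die_area_mm2  # ignores edge losses
    return wafer_cost / (dies_per_wafer * die_yield(defects_per_mm2, die_area_mm2))


# Hypothetical inputs: 5000 cost units per wafer, 70,000 mm^2 wafer,
# 0.002 defects per mm^2.
small = cost_per_die(5000, 70_000, 100, 0.002)
large = cost_per_die(5000, 70_000, 200, 0.002)

print(f"100 mm^2 die: {small:.2f}, 200 mm^2 die: {large:.2f}")
```

Doubling the die area more than doubles the cost per die in this model, because fewer dies fit on the wafer and a smaller share of them work.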

21
SPEC CPU Benchmark
  • Programs used to measure performance
  • Supposedly typical of actual workload
  • Standard Performance Evaluation Corp (SPEC)
  • Develops benchmarks for CPU, I/O, Web,
  • SPEC CPU2006
  • Elapsed time to execute a selection of programs
  • Negligible I/O, so focuses on CPU performance
  • Normalize relative to reference machine
  • Summarize as geometric mean of performance ratios
  • CINT2006 (integer) and CFP2006 (floating-point)

22
CINT2006 for Opteron X4 2356
Name        Description                    IC x 10^9  CPI    Tc (ns)  Exec time  Ref time  SPECratio
perl        Interpreted string processing  2,118      0.75   0.40     637        9,777     15.3
bzip2       Block-sorting compression      2,389      0.85   0.40     817        9,650     11.8
gcc         GNU C Compiler                 1,050      1.72   0.40     724        8,050     11.1
mcf         Combinatorial optimization     336        10.00  0.40     1,345      9,120     6.8
go          Go game (AI)                   1,658      1.09   0.40     721        10,490    14.6
hmmer       Search gene sequence           2,783      0.80   0.40     890        9,330     10.5
sjeng       Chess game (AI)                2,176      0.96   0.40     837        12,100    14.5
libquantum  Quantum computer simulation    1,623      1.61   0.40     1,047      20,720    19.8
h264avc     Video compression              3,102      0.80   0.40     993        22,130    22.3
omnetpp     Discrete event simulation      587        2.94   0.40     690        6,250     9.1
astar       Games/path finding             1,082      1.79   0.40     773        7,020     9.1
xalancbmk   XML parsing                    1,058      2.70   0.40     1,143      6,900     6.0
Geometric mean                                                                             11.7
High cache miss rates
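The summary row of the table can be reproduced with a Python sketch. It takes the SPECratio column as printed, computes the geometric mean, and checks one ratio (perl) as reference time divided by execution time. The helper name `geometric_mean` is an illustrative assumption.

```python
import math

# SPECratio column from the CINT2006 table, in row order.
spec_ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0]


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# SPECratio = reference time / execution time, shown here for the perl row.
perl_ratio = 9777 / 637
gm = geometric_mean(spec_ratios)

print(f"perl SPECratio = {perl_ratio:.1f}, geometric mean = {gm:.1f}")
```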
23
Amdahl's Law
f = fraction unaffected, p = speedup
of the rest
speedup = 1 / (f + (1 - f) / p)
Amdahl's law: the speedup achieved if a
fraction f of a task is unaffected and the
remaining 1 - f part runs p times as fast.
24
Amdahl's Law in design
Example
  • A processor spends 30% of its time on floating-point (flp)
    addition, 25% on flp multiplication, and 10% on flp division.
    Evaluate the following enhancements, each costing the same
    to implement:
  • Redesign of the flp adder to make it twice as
    fast.
  • Redesign of the flp multiplier to make it three
    times as fast.
  • Redesign of the flp divider to make it 10 times as
    fast.

 
25
Amdahl's Law in design
Example
  • Solution
  • Adder redesign speedup = 1 / (0.7 + 0.3 / 2)
    = 1.18
  • Multiplier redesign speedup = 1 / (0.75 + 0.25 /
    3) = 1.20
  • Divider redesign speedup = 1 / (0.9 + 0.1 / 10)
    = 1.10
  • What if both the adder and the multiplier are
    redesigned?
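
The three speedups above, and the closing question about redesigning both the adder and the multiplier, can be explored with a Python sketch. It uses Amdahl's law in the form speedup = 1 / (f + (1 - f) / p), where f is the unaffected fraction. The function name is illustrative, and the combined figure is computed here rather than taken from the slides.

```python
def amdahl_speedup(f_unaffected, p):
    """Overall speedup when a fraction (1 - f) runs p times as fast."""
    return 1.0 / (f_unaffected + (1.0 - f_unaffected) / p)


adder = amdahl_speedup(0.70, 2)        # 30% of time, 2x faster
multiplier = amdahl_speedup(0.75, 3)   # 25% of time, 3x faster
divider = amdahl_speedup(0.90, 10)     # 10% of time, 10x faster

# Both adder and multiplier redesigned: 45% of the time is unaffected,
# 30% runs 2x faster and 25% runs 3x faster.
both = 1.0 / (0.45 + 0.30 / 2 + 0.25 / 3)

print(f"{adder:.2f} {multiplier:.2f} {divider:.2f} {both:.2f}")
```

The 10x divider gives the smallest gain because division accounts for the smallest share of the time.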

 
26
Amdahl's Law: limit to improvement
  • Improving an aspect of a computer and expecting a
    proportional improvement in overall performance

1.8 Fallacies and Pitfalls
  • Example: multiply accounts for 80 s out of 100 s
  • How much improvement in multiply performance to
    get 5x overall?
  • Can't be done!
  • Corollary: make the common case fast
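
Why the target cannot be reached is easy to see numerically. The Python sketch below assumes the relation improved time = affected time / improvement factor + unaffected time, with the 80 s and 20 s split from the example; the names and the sampled improvement factors are illustrative.

```python
def improved_time(t_affected, t_unaffected, factor):
    """Execution time after speeding up only the affected part."""
    return t_affected / factor + t_unaffected


T_MULTIPLY, T_OTHER = 80.0, 20.0
total = T_MULTIPLY + T_OTHER

# Overall speedup for increasingly aggressive multiply improvements.
speedups = {
    n: total / improved_time(T_MULTIPLY, T_OTHER, n)
    for n in (2, 10, 100, 1_000_000)
}
upper_bound = total / T_OTHER  # reached only if multiply took zero time

for n, s in speedups.items():
    print(f"multiply {n}x faster -> overall {s:.3f}x")
print("upper bound:", upper_bound)
```

The overall speedup approaches the bound but never reaches it, since the remaining 20 s are untouched.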

27
Pitfall: MIPS as a Performance Metric
  • MIPS: Millions of Instructions Per Second
  • Doesn't account for
  • Differences in ISAs between computers
  • Differences in complexity between instructions
  • CPI varies between programs on a given CPU
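
A small Python sketch can show how MIPS misleads across different ISAs. It assumes the usual definition, MIPS = instruction count / (execution time x 10^6). Both machines and all their numbers are hypothetical, chosen so that the machine with the lower MIPS rating finishes the same task sooner.

```python
def exec_time_s(instruction_count, cpi, clock_hz):
    """Execution time = IC x CPI / clock rate."""
    return instruction_count * cpi / clock_hz


def mips(instruction_count, time_s):
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (time_s * 1e6)


# Hypothetical machine A: simple instructions, more of them.
ic_a, cpi_a, clock_a = 1.0e9, 1.0, 1e9
# Hypothetical machine B: richer instructions, fewer of them, higher CPI.
ic_b, cpi_b, clock_b = 0.5e9, 1.5, 1e9

time_a = exec_time_s(ic_a, cpi_a, clock_a)
time_b = exec_time_s(ic_b, cpi_b, clock_b)
mips_a = mips(ic_a, time_a)
mips_b = mips(ic_b, time_b)

print(f"A: {time_a:.2f} s, {mips_a:.0f} MIPS")
print(f"B: {time_b:.2f} s, {mips_b:.0f} MIPS")
```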

28
Concluding Remarks
  • Cost/performance is improving
  • Due to underlying technology development
  • Hierarchical layers of abstraction
  • In both hardware and software
  • Instruction set architecture
  • The hardware/software interface
  • Execution time: the best performance measure
  • Power is a limiting factor
  • Use parallelism to improve performance

1.9 Concluding Remarks