Title: EECS 470
1. EECS 470
- Computer Architecture
- Lecture 2
- Coverage: Chapters 1-2
2. A Quantitative Approach
- Hardware system performance is generally easy to quantify
  - Machine A is 10 times faster than Machine B
- Of course, Machine B's advertising will show the opposite conclusion
  - Example: Pentium 4 vs. AMD Hammer
- Many software systems tend to have much more subjective performance evaluations.
3. Measuring Performance
- Use Total Execution Time
  - A is 3 times faster than B for programs P1, P2
  - Issue: emphasizes long-running programs
  \frac{1}{n} \sum_{i=1}^{n} Time_i
4. Measuring Performance
- Weighted Execution Time
- What if P1 is executed far more frequently?
  Weighted arithmetic mean (AM) = \sum_{i=1}^{n} Weight_i \times Time_i, \qquad \sum_{i=1}^{n} Weight_i = 1
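A hypothetical worked example (the weights and times here are assumed, not from the slides): with Time_1 = 1, Time_2 = 1000 and Weight_1 = 0.999, Weight_2 = 0.001,

  0.999 \times 1 + 0.001 \times 1000 = 2.0,

versus an unweighted arithmetic mean of 500.5, so the weighting completely changes the picture when P1 dominates the workload.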
5. Measuring Performance
- Normalized Execution Time
  - Compare machine performance to a reference machine and report a ratio.
  - SPEC ratings measure relative performance to a reference machine.
6. Example using execution times

            CompA   CompB
  Prog1         1      10
  Prog2      1000     100
  Total      1001     111

- Conclusion: B is faster than A. It is 1001/111, or about 9.0 times faster.
7. Averaging Performance Over Benchmarks
- Arithmetic mean (AM):  AM = \frac{1}{n} \sum_{i=1}^{n} Time_i
- Geometric mean (GM):   GM = \sqrt[n]{\prod_{i=1}^{n} Time_i}
- Harmonic mean (HM):    HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{Rate_i}}
8. Which is the right Mean?
- Arithmetic: when dealing with execution time
- Harmonic: when dealing with rates
  - flops
  - MIPS
  - Hertz
- Geometric mean gives an equi-weighted average
9. Use Harmonic Mean with Rates

  Execution times:
            million flops   CompA   CompB   CompC
  Prog1               100       1      10      20
  Prog2               100    1000     100      20
  Total time                 1001     111      40

  Rates (mflops) from the table above:
            CompA   CompB   CompC
  Prog1     100      10       5
  Prog2       0.1     1       5
  AM         50.05    5.5     5
  GM          3.2     3.2     5
  HM          0.2     1.8     5

- Notice that the total-time ordering is preserved in the HM of the rates.
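A minimal sketch in C (not from the slides) that recomputes the three means for the rates in the table above and shows that only the HM ranks the machines the same way total execution time does:

  #include <stdio.h>
  #include <math.h>

  /* Rates in mflops for Prog1 and Prog2 on each machine, from the table above. */
  static const double rates[3][2] = {
      {100.0, 0.1},   /* CompA */
      { 10.0, 1.0},   /* CompB */
      {  5.0, 5.0},   /* CompC */
  };

  int main(void) {
      const char *names[3] = {"CompA", "CompB", "CompC"};
      const int n = 2;
      for (int m = 0; m < 3; m++) {
          double sum = 0.0, prod = 1.0, inv = 0.0;
          for (int i = 0; i < n; i++) {
              sum  += rates[m][i];
              prod *= rates[m][i];
              inv  += 1.0 / rates[m][i];
          }
          /* The AM of rates is misleading; the HM (n over the sum of reciprocals)
             matches the total-time ranking (CompC fastest, CompA slowest).       */
          printf("%s: AM=%.2f GM=%.2f HM=%.2f\n",
                 names[m], sum / n, pow(prod, 1.0 / n), n / inv);
      }
      return 0;
  }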
10. Normalized Times
- Don't take the AM of normalized execution times

              Times            Normalized to A    Normalized to B
           CompA    CompB        A        B          A        B
  Prog1        1       10        1       10         0.1       1
  Prog2     1000      100        1        0.1       10        1
  AM         500.5     55.0      1        5.05       5.05     1
  GM          31.6     31.6      1        1          1        1

- The AM of normalized times gives contradictory answers: normalized to A it says B is 5.05 times slower, normalized to B it says A is 5.05 times slower. Which one is right?
- The GM doesn't track total execution time (last line): it rates A and B as equal even though their totals differ by a factor of 9.
11. Notes: Benchmarks
- AM ≥ GM
- GM(Xi) / GM(Yi) = GM(Xi / Yi)  (a numeric check follows below)
- The GM is unaffected by normalizing; it just doesn't track execution time
  - Why does SPEC use it?
- SPEC: System Performance Evaluation Cooperative
  - http://www.specbench.org/
- EEMBC: benchmarks for embedded applications (Embedded Microprocessor Benchmark Consortium)
  - http://www.eembc.org/
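A quick numeric check of the ratio property, using the execution times from slide 10 (CompA: 1 and 1000; CompB: 10 and 100):

  \frac{GM(T_A)}{GM(T_B)} = \frac{\sqrt{1 \times 1000}}{\sqrt{10 \times 100}} = \frac{31.6}{31.6} = 1 = \sqrt{\frac{1}{10} \times \frac{1000}{100}} = GM\left(\frac{T_A}{T_B}\right)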
12. Amdahl's Law
- Rule of thumb: make the common case faster

  Execution\ time_{new} = Execution\ time_{old} \times \left[ (1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}} \right]

- (Attack the longest-running part until it no longer is the longest; repeat.)
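A hypothetical worked example (the fraction and speedup are assumed for illustration): if the enhanced part is 40% of the old execution time and it is sped up by 10x,

  Execution\ time_{new} = Execution\ time_{old} \times \left(0.6 + \frac{0.4}{10}\right) = 0.64 \times Execution\ time_{old},

an overall speedup of only 1/0.64 \approx 1.56, even though part of the program got 10 times faster.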
13. Instruction Set Design
- Software systems: named variables, complex semantics.
- Hardware systems: tight timing requirements, small storage structures, simple semantics.
- Instruction set: the interface between these very different software and hardware systems.
14. Design decisions
- How much state is in the microarchitecture?
  - Registers, flags, IP/PC
- How is that state accessed/manipulated?
  - Operand encoding
- What commands are supported?
  - Opcodes, opcode encoding
15. Design Challenges, or why is architecture still relevant?
- Clock frequency is increasing
  - This changes the number of levels of gates that can be completed each cycle, so old designs don't work.
  - It also tends to increase the ratio of time spent on wires (fixed speed of light).
- Power
  - Faster chips are hotter; bigger chips are hotter.
16. Design Challenges (cont.)
- Design complexity
  - More complex designs to fix frequency/power issues lead to increased development/testing costs
  - Failures (design or transient) can be difficult to understand (and fix)
- We seem far less willing to live with hardware errors (e.g. the Pentium FDIV bug) than software errors
  - which are often dealt with through upgrades that we pay for!
17. Techniques for Encoding Operands
- Explicit operands
  - Include a field to specify which state data is referenced
  - Example: register specifier
- Implicit operands
  - All state data can be inferred from the opcode
  - Example: function return (CISC-style)
18. Accumulator
- Architectures with one implicit register
  - Acts as source and/or destination
  - One other source explicit
- Example: C = A + B
  - Load A    // (Acc)umulator ← A
  - Add B     // Acc ← Acc + B
  - Store C   // C ← Acc
- Ref: "Instruction Level Distributed Processing: Adapting to Shifting Technology"
19. Stack
- Architectures with an implicit stack
  - Acts as source(s) and/or destination
  - Push and Pop operations have 1 explicit operand
- Example: C = A + B
  - Push A    // Stack: A
  - Push B    // Stack: A, B
  - Add       // Stack: A+B
  - Pop C     // C ← A+B; Stack: (empty)
- Compact encoding, but it may require more instructions
20. Registers
- Most general (and common) approach
  - Small array of storage
  - Explicit operands (register file index)
- Example: C = A + B

  Register-memory:     Load/store:
  Load R1, A           Load R1, A
                       Load R2, B
  Add R3, R1, B        Add R3, R1, R2
  Store R3, C          Store R3, C
21. Memory
- Big array of storage
  - More complex ways of indexing than registers
- Build addressing modes to support efficient translation of software abstractions (see the sketch below)
  - Uses less space in the instruction than a 32-bit immediate field
  - A[i]: use base (A) + displacement (i) (scaled?)
  - a.ptr: use base (ptr) + displacement (a)
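A minimal C sketch (not from the slides; the function names are made up and the assembly in the comments is hypothetical RISC-like syntax) of how array indexing maps onto base + displacement and scaled-index addressing:

  #include <stdint.h>

  int32_t A[100];

  int32_t read_elem(int32_t *base, int i) {
      /* Address = base + 4*i: one scaled-index (or base + displacement) load,
         e.g.  Load R3, (R1 + R2*4)   with R1 = base, R2 = i.                  */
      return base[i];
  }

  int32_t read_seventh(void) {
      /* A constant index folds into the displacement:
         e.g.  Load R3, 28(R1)   with R1 = &A[0] (element 7 of a 4-byte array). */
      return A[7];
  }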
22. Addressing modes
- Register:           Add R4, R3
- Immediate:          Add R4, #3
- Base/Displacement:  Add R4, 100(R1)
- Register Indirect:  Add R4, (R1)
- Indexed:            Add R4, (R1+R2)
- Direct:             Add R4, (1001)
- Memory Indirect:    Add R4, @(R3)
- Autoincrement:      Add R4, (R2)+
23. Other Memory Issues
- What is the size of each element in memory?
  - Byte: values 0-255
  - Half word: values 0-65535
  - Word: values 0-4B (about 4 billion)
  [Figure: byte-, half-word-, and word-sized memory arrays, each starting at address 0x000]
24. Other Memory Issues
- Big-endian or little-endian? Store 0x114488FF
  - Big-endian: address 0x000 points to the most significant byte; memory holds 11 44 88 FF
  - Little-endian: address 0x000 points to the least significant byte; memory holds FF 88 44 11
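A small C sketch (not from the slides) that stores the slide's word and prints the bytes in memory order, so the output reveals whether the machine is big- or little-endian:

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      uint32_t word = 0x114488FF;                        /* value from the slide */
      const unsigned char *p = (const unsigned char *)&word;
      /* Big-endian:    11 44 88 FF  (offset 0 holds the most significant byte)
         Little-endian: FF 88 44 11  (offset 0 holds the least significant byte) */
      for (int i = 0; i < 4; i++)
          printf("offset %d: %02X\n", i, p[i]);
      return 0;
  }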
25. Other Memory Issues
- Non-word loads? ldb R3, (000)
  - Memory at 0x000: 11 44 88 FF
  - The byte 0x11 at address 0x000 is loaded; R3 = 00 00 00 11
26. Other Memory Issues
- Non-word loads? ldb R3, (003)
  - Memory at 0x000: 11 44 88 FF
  - The byte 0xFF at address 0x003 is loaded and sign extended; R3 = FF FF FF FF
27. Other Memory Issues
- Non-word loads? ldbu R3, (003)
  - Memory at 0x000: 11 44 88 FF
  - The byte 0xFF at address 0x003 is loaded and zero filled; R3 = 00 00 00 FF
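A C sketch (not from the slides) that mimics the ldb/ldbu difference using signed and unsigned byte types:

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      uint8_t mem[4] = {0x11, 0x44, 0x88, 0xFF};   /* bytes from the slides */

      int32_t  ldb  = (int8_t)mem[3];   /* sign extend:  0xFF -> 0xFFFFFFFF (-1) */
      uint32_t ldbu = mem[3];           /* zero fill:    0xFF -> 0x000000FF      */

      printf("ldb  R3,(003) -> 0x%08X\n", (unsigned)ldb);
      printf("ldbu R3,(003) -> 0x%08X\n", (unsigned)ldbu);
      return 0;
  }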
28. Other Memory Issues
- Alignment?
  - Word accesses: only addresses ending in 00 (binary)
  - Half-word accesses: only addresses ending in 0
  - Byte accesses: any address
  - Example: ldw R3, (002) is illegal, since 0x002 is not word-aligned
- Why is it important to be aligned? How can it be enforced?
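One way alignment can be enforced (a sketch, not from the slides) is to check the low address bits and reject or trap on a mismatch, which is cheap in hardware:

  #include <stdint.h>
  #include <stdbool.h>

  /* A word (4-byte) access needs the two low address bits to be 00,
     a half-word access needs the low bit to be 0; bytes go anywhere. */
  static bool word_aligned(uint32_t addr)      { return (addr & 0x3u) == 0; }
  static bool halfword_aligned(uint32_t addr)  { return (addr & 0x1u) == 0; }

  /* word_aligned(0x002) is false, so ldw R3, (002) would be rejected (trap). */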
29. Techniques for Encoding Operators
- The opcode is translated into control signals that
  - direct data (MUX control)
  - select the operation for the ALU
  - set read/write selects for registers/memory/PC
- Tradeoff between how flexible the control is and how compact the opcode encoding is
  - Microcode: direct control of signals (Improv)
  - Opcode: compact representation of a set of control signals
- You can make decode easier with careful opcode selection (as done in HW1); see the sketch below
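A minimal decode sketch in C (not from the slides; the 32-bit field layout is a hypothetical MIPS-like format), showing how a fixed encoding lets the control fields fall out with simple shifts and masks:

  #include <stdint.h>

  /* Hypothetical format: [31:26] opcode, [25:21] rs, [20:16] rt, [15:0] imm. */
  typedef struct { uint32_t opcode, rs, rt, imm; } decoded;

  static decoded decode(uint32_t inst) {
      decoded d;
      d.opcode = (inst >> 26) & 0x3Fu;    /* 6-bit opcode -> control signals  */
      d.rs     = (inst >> 21) & 0x1Fu;    /* 5-bit source register specifier  */
      d.rt     = (inst >> 16) & 0x1Fu;    /* 5-bit target register specifier  */
      d.imm    =  inst        & 0xFFFFu;  /* 16-bit immediate / displacement  */
      return d;
  }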
30. Handling Control Flow
- Conditional branches (short range)
- Unconditional branches (jumps)
- Function calls
- Returns
- Traps (OS calls and exceptions)
- Predicates (conditional retirement)
31. Encoding branch targets
- PC-relative addressing
  - Makes linking code easier (see the sketch below)
- Indirect addressing
  - Jumps into shared libraries, virtual functions, case/switch statements
- Some unusual modes to simplify target address calculation
  - (segment + offset) or (trap number)
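A sketch (not from the slides; it assumes a hypothetical ISA with a signed 16-bit word offset relative to the next sequential instruction) of why PC-relative targets keep code position-independent:

  #include <stdint.h>

  /* The target depends only on the distance from the branch, so the same
     encoded offset works wherever the code is loaded -- easier linking.   */
  static uint32_t branch_target(uint32_t pc, int16_t word_offset) {
      return pc + 4 + (uint32_t)((int32_t)word_offset * 4);
  }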
32. Condition codes
- Flags
  - Implicit flag(s) specified in the opcode (bgt)
  - Flag(s) set by earlier instructions (compare, add, etc.)
- Register
  - Uses a register; requires an explicit specifier
- Comparison operation
  - Two registers, with the compare operation specified in the opcode
33. Higher Level Semantics: Functions
- Function call semantics
  - Save PC + 1 instruction for the return
  - Manage parameters
  - Allocate space on the stack
  - Jump to the function
- Simple approach
  - Use a jump instruction plus other instructions
- Complex approach
  - Build the implicit operations into a new call instruction
34. Role of the Compiler
- Compilers make the complexity of the ISA (from the programmer's point of view) less relevant.
  - Non-orthogonal ISAs are more challenging.
- State allocation (register allocation) is better left to compiler heuristics.
- Complex semantics lead to more global optimization, which is easier for a machine to do.
  - People are good at optimizing 10 lines of code. Compilers are good at optimizing 10M lines.
35. Next time
- Compiler optimizations
- Interaction between compilers and architectures
- Higher-level machine codes (Java VM)
- Starting pipelining (Appendix A)