Title: How to measure, report, and summarize performance?
1 Performance
- How to measure, report, and summarize performance?
- What factors determine the performance of a computer?
- Critical to purchase and design decisions
- best performance?
- least cost?
- best performance/cost?
- Questions:
- Why is some hardware better than others for different programs?
- What factors of system performance are hardware related? (e.g., do we need a new machine, or a new operating system?)
- How does the machine's instruction set affect performance?
2 Computer Performance
- Response time (execution time)
- the time between the start and completion of a task
- Throughput
- the total amount of work done in a given time
- Q: If we replace the processor with a faster one, what do we increase?
- A: Response time and throughput
- Q: If we add an additional processor to a system, what do we increase?
- A: Throughput
3 Book's Definition of Performance
- For some program running on machine X, PerformanceX = 1 / Execution timeX
- "X is n times faster than Y": n = PerformanceX / PerformanceY
- Problem: Machine A runs a program in 10 seconds and machine B in 15 seconds. How much faster is A than B?
- Answer: n = PerformanceA / PerformanceB = Execution timeB / Execution timeA = 15/10 = 1.5
- A is 1.5 times faster than B.
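The definition above can be checked with a short Python sketch (illustrative only; the function name `speedup` is ours):

```python
# Relative performance: Performance_X = 1 / ExecutionTime_X, so
# "X is n times faster than Y" means n = time_Y / time_X.
def speedup(time_x, time_y):
    """Return n such that machine X (time_x) is n times faster than Y (time_y)."""
    return time_y / time_x

# Machine A: 10 s, machine B: 15 s
print(speedup(10, 15))  # 1.5, so A is 1.5 times faster than B
```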
4 Execution Time
- Elapsed time (wall-clock time or response time)
- counts everything (disk and memory accesses, I/O, etc.)
- a useful number, but often not good for comparison purposes
- CPU time
- doesn't count I/O or time spent running other programs
- can be broken up into system time and user time
- Our focus: user CPU time
- time spent executing the lines of code that are "in" our program
5 Clock Cycles
- Instead of reporting execution time in seconds, we often use cycles
- Execution time = # of clock cycles × cycle time
- Clock ticks indicate when to start activities (one abstraction)
- cycle time (period) = time between ticks = seconds per cycle
- clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
- A 200 MHz clock has a cycle time of 1 / (200 × 10^6) s = 5 ns
6 How to Improve Performance
- Execution time = # of clock cycles × cycle time
- So, to improve performance (everything else being equal) you can either
- reduce the # of required clock cycles for a program, or
- decrease the clock period or, said another way, increase the clock frequency.
7 Different numbers of cycles for different instructions
- Multiplication takes more time than addition
- Floating point operations take longer than integer ones
- Accessing memory takes more time than accessing registers
- Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)
- Another point: the same instruction might require a different number of cycles on a different machine
8 Example
- A program runs in 10 seconds on computer A, which has a 400 MHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new technology to substantially increase the clock rate, but this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A. What clock rate should we tell the designer to target?
- Clock cyclesA = 10 s × 400 MHz = 4 × 10^9 cycles
- Clock cyclesB = 1.2 × 4 × 10^9 cycles = 4.8 × 10^9 cycles
- Execution time = # of clock cycles × cycle time
- Clock rateB = Clock cyclesB / Execution timeB = 4.8 × 10^9 cycles / 6 s = 800 MHz
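The same arithmetic as a small Python sketch (integer units are used so the results are exact; the variable names are ours):

```python
# Machine A: 10 s at 400 MHz; machine B needs 1.2x the cycles in 6 s.
time_a = 10                      # seconds
rate_a = 400_000_000             # 400 MHz in Hz
cycles_a = time_a * rate_a       # 4 x 10^9 cycles
cycles_b = cycles_a * 12 // 10   # 1.2 times as many cycles
rate_b = cycles_b // 6           # clock rate needed to finish in 6 s
print(rate_b)                    # 800000000, i.e. 800 MHz
```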
9 Now that we understand cycles
- A given program will require
- some number of instructions (machine instructions)
- some number of cycles
- some number of seconds
- We have a vocabulary that relates these quantities:
- cycle time (seconds per cycle)
- clock rate (cycles per second)
- CPI (cycles per instruction): an AVERAGE VALUE!
- a floating point intensive application might have a higher CPI
- MIPS (millions of instructions per second): this would be higher for a program using simple instructions
10 Performance
- Performance is determined by execution time
- Related variables:
- # of cycles to execute the program
- # of instructions in the program
- # of cycles per second (clock rate)
- average # of cycles per instruction (CPI)
- average # of instructions per second (MIPS)
- Common pitfall: thinking one of the variables is indicative of performance when it really isn't.
11 CPI Example
- Suppose we have two implementations of the same instruction set architecture (ISA). For some program, machine A has a clock cycle time of 10 ns and a CPI of 2.0; machine B has a clock cycle time of 20 ns and a CPI of 1.2. Which machine is faster for this program, and by how much?
- Time per instruction for A = 2.0 × 10 ns = 20 ns
- Time per instruction for B = 1.2 × 20 ns = 24 ns
- A is 24/20 = 1.2 times faster
- If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?
- Answer: # of instructions
12 # of Instructions Example
- A compiler designer has two alternatives for a certain code sequence. There are three different classes of instructions, A, B, and C, and they require one, two, and three cycles, respectively. The first sequence has 5 instructions: 2 of A, 1 of B, and 2 of C. The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Which sequence will be faster? What are the CPI values?
- Sequence 1: 2×1 + 1×2 + 2×3 = 10 cycles; CPI1 = 10 / 5 = 2
- Sequence 2: 4×1 + 1×2 + 1×3 = 9 cycles; CPI2 = 9 / 6 = 1.5
- Sequence 2 is faster.
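The cycle counts can be tabulated with a short Python sketch (names are ours):

```python
# Cycle counts and CPI for the two instruction sequences.
cycles = {"A": 1, "B": 2, "C": 3}

def total_cycles(mix):
    """mix maps instruction class to instruction count."""
    return sum(cycles[c] * n for c, n in mix.items())

seq1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions
print(total_cycles(seq1), total_cycles(seq1) / 5)  # 10 2.0
print(total_cycles(seq2), total_cycles(seq2) / 6)  # 9 1.5
```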
13 MIPS
- Million Instructions Per Second
- MIPS = instruction count / (execution time × 10^6)
- MIPS is easy to understand, but
- does not take into account the capabilities of the instructions: the instruction counts of different instruction sets differ
- varies between programs even on the same computer
- can vary inversely with performance!
14 MIPS example
- Two compilers are being tested for a 100 MHz machine with three different classes of instructions, A, B, and C, which require one, two, and three cycles, respectively. Compiler 1: compiled code uses 5 million class A, 1 million class B, and 1 million class C instructions. Compiler 2: compiled code uses 10 million class A, 1 million class B, and 1 million class C instructions.
- Which sequence will be faster according to MIPS?
- Which sequence will be faster according to execution time?
15 MIPS example
- Cycles and instructions:
- Compiler 1: 10 million cycles, 7 million instructions
- Compiler 2: 15 million cycles, 12 million instructions
- Execution time = clock cycles / clock rate
- Execution time1 = 10 × 10^6 / (100 × 10^6) = 0.1 s
- Execution time2 = 15 × 10^6 / (100 × 10^6) = 0.15 s
- MIPS = instruction count / (execution time × 10^6)
- MIPS1 = 7 × 10^6 / (0.1 × 10^6) = 70
- MIPS2 = 12 × 10^6 / (0.15 × 10^6) = 80
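The inversion (higher MIPS, slower program) shows up directly in a Python sketch (names are ours):

```python
# MIPS vs. execution time for the two compilers on the 100 MHz machine.
CLOCK = 100_000_000              # Hz
CYCLES = {"A": 1, "B": 2, "C": 3}

def stats(mix):
    """mix maps instruction class to instruction count."""
    instrs = sum(mix.values())
    cyc = sum(CYCLES[c] * n for c, n in mix.items())
    time = cyc / CLOCK                    # seconds
    mips = instrs / (time * 1_000_000)    # native MIPS rating
    return time, mips

t1, m1 = stats({"A": 5_000_000, "B": 1_000_000, "C": 1_000_000})
t2, m2 = stats({"A": 10_000_000, "B": 1_000_000, "C": 1_000_000})
print(t1, round(m1))  # 0.1 70  -> compiler 1 is faster...
print(t2, round(m2))  # 0.15 80 -> ...but compiler 2 has the higher MIPS
```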
16 Benchmarks
- Performance is best determined by running a real application
- Use programs typical of the expected workload
- or typical of the expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
- Small benchmarks
- nice for architects and designers
- easy to standardize
- can be abused
- SPEC (System Performance Evaluation Cooperative)
- companies have agreed on a set of real programs and inputs
- can still be abused
- valuable indicator of performance (and compiler technology)
17 SPEC 95
18 SPEC 89
- Compiler effects on performance depend on the application.
19 SPEC 95
- Organisational enhancements improve performance.
- Doubling the clock rate does not double the performance.
20 Amdahl's Law
- Version 1:
- Execution time after improvement = execution time unaffected + (execution time affected / amount of improvement)
- Version 2:
- Speedup = performance after improvement / performance before improvement = execution time before improvement / execution time after improvement
- With unaffected time n, affected time a, and improvement factor p:
- execution time before = n + a
- execution time after = n + a/p
- Principle: Make the common case fast
21 Amdahl's Law
- Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?
- 100 s / 4 = 80 s / n + 20 s
- 5 s = 80 s / n
- n = 80 s / 5 s = 16
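Solving the same equation in Python (the function name is ours):

```python
# Amdahl's Law: solve  total/target = (total - affected) + affected/n  for n.
def required_improvement(total, affected, target_speedup):
    unaffected = total - affected
    new_total = total / target_speedup
    return affected / (new_total - unaffected)

print(required_improvement(100, 80, 4))  # 16.0
```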
22 Amdahl's Law
- Example: A benchmark program spends half of the time executing floating point instructions.
- We improve the performance of the floating point unit by a factor of four.
- What is the speedup?
- Time before = 10 s
- Time after = 5 s + 5 s / 4 = 6.25 s
- Speedup = 10 / 6.25 = 1.6
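The general form of this calculation, as a sketch (the fraction-based formulation is equivalent to the slide's times):

```python
# Speedup when a fraction f of the time is improved by a factor p.
def amdahl_speedup(f, p):
    return 1.0 / ((1.0 - f) + f / p)

print(amdahl_speedup(0.5, 4))  # 1.6
```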
23 Machine Instructions
- Language of the machine
- Lowest level of programming; directly controls the hardware
- Assembly instructions are symbolic versions of machine instructions
- More primitive than higher level languages
- Very restrictive
- Programs are stored in memory; one instruction is fetched and executed at a time
- We'll be working with the MIPS instruction set architecture
24 MIPS instruction set
- Load from memory
- Store in memory
- Logic operations
- and, or, negation, shift, ...
- Arithmetic operations
- addition, subtraction, ...
- Branch
25 Instruction types
- 1 operand
- Jump address
- Jump register number
- 2 operands
- Multiply reg1, reg2
- 3 operands
- Add reg1, reg2, reg3
26 MIPS arithmetic
- Instructions have 3 operands
- Operand order is fixed (destination first)
- Example: C code: A = B + C
- MIPS code: add $s0, $s1, $s2
- $s0, etc. are registers (associated with variables by the compiler)
27 MIPS arithmetic
- Design Principle 1: simplicity favours regularity.
- Of course this complicates some things...
- C code: A = B + C + D; E = F - A
- MIPS code:
  add $t0, $s1, $s2
  add $s0, $t0, $s3
  sub $s4, $s5, $s0
- Operands must be registers; 32 registers are provided
- Design Principle 2: smaller is faster.
28 Registers vs. Memory
- Arithmetic instructions' operands are registers
- The compiler associates variables with registers
- What about programs with lots of variables?
29 Memory Organization
- Viewed as a large, single-dimension array, with an address.
- A memory address is an index into the array
- "Byte addressing" means that the index points to a byte of memory.
- [Figure: memory as an array of bytes; addresses 0, 1, 2, 3, ... each select 8 bits of data]
30 Memory Organization
- Bytes are nice, but most data items use larger "words"
- For MIPS, a word is 32 bits or 4 bytes.
- 2^32 bytes with byte addresses from 0 to 2^32 - 1
- 2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4
- Words are aligned, i.e., the 2 least significant bits of a word address are equal to 0.
- Registers hold 32 bits of data
- [Figure: memory as an array of words; addresses 0, 4, 8, 12, ... each select 32 bits of data]
31 Load and store instructions
- Example: C code: A[8] = h + A[8]
- MIPS code:
  lw  $t0, 32($s3)
  add $t0, $s2, $t0
  sw  $t0, 32($s3)
- word offset 8 equals byte offset 32
- Store word has destination last
- Remember: arithmetic operands are registers, not memory!
32 So far we've learned
- MIPS: loading and storing words but addressing bytes; arithmetic on registers only
- Instruction          Meaning
  add $s1, $s2, $s3    $s1 = $s2 + $s3
  sub $s1, $s2, $s3    $s1 = $s2 - $s3
  lw  $s1, 100($s2)    $s1 = Memory[$s2 + 100]
  sw  $s1, 100($s2)    Memory[$s2 + 100] = $s1
33 Machine Language
- Instructions, like registers and words of data, are also 32 bits long
- Example: add $t0, $s1, $s2
- R-type instruction format:
  000000 10001 10010 01000 00000 100000
  op     rs    rt    rd    shamt funct
- op: opcode, basic operation
- rs: 1st source register
- rt: 2nd source register
- rd: destination register
- shamt: shift amount
- funct: function, selects the specific variant of the operation
34 Machine Language
- Introduce a new type of instruction format
- I-type, for data transfer instructions
- Example: lw $t0, 32($s2)
  35   18   9    32
  op   rs   rt   16-bit number
- rt: destination register
- a new instruction format, but fields 1-3 are the same
- Design Principle 3: good design demands good compromises
35 Stored Program Concept
- Instructions are groups of bits
- Programs are stored in memory, to be read or written just like data
- Fetch & Execute cycle:
- instructions are fetched and put into a special register
- bits in the register "control" the subsequent actions
- fetch the next instruction and continue
- Memory holds data, programs, compilers, editors, etc.
36 Control
- Decision making instructions
- alter the control flow,
- i.e., change the "next" instruction to be executed
- MIPS conditional branch instructions:
  bne $t0, $t1, Label   # branch if not equal
  beq $t0, $t1, Label   # branch if equal
- Example (if): if (i == j) h = i + j;
  bne $s0, $s1, Label
  add $s3, $s0, $s1
  Label: ...
37 Control
- MIPS unconditional branch instruction: j Label
- Example (if-then-else): if (i != j) h = i + j; else h = i - j;
  beq $s4, $s5, Label1
  add $s3, $s4, $s5
  j Label2
  Label1: sub $s3, $s4, $s5
  Label2: ...
38 Control
- Example (loop):
  Loop: ...
        i = i + j
        if (i != h) go to Loop
  ...
- MIPS code:
  Loop: ...
        add $s1, $s1, $s2   # i = i + j
        bne $s1, $s3, Loop  # go to Loop if i != h
  ...
39 So far
- Instruction            Meaning
  add $s1, $s2, $s3      $s1 = $s2 + $s3
  sub $s1, $s2, $s3      $s1 = $s2 - $s3
  lw  $s1, 100($s2)      $s1 = Memory[$s2 + 100]
  sw  $s1, 100($s2)      Memory[$s2 + 100] = $s1
  bne $s4, $s5, Label    next instr. is at Label if $s4 != $s5
  beq $s4, $s5, Label    next instr. is at Label if $s4 == $s5
  j   Label              next instr. is at Label
- Formats: R, I, J
40 Control Flow
- We have beq and bne; what about branch-if-less-than?
- New instruction: set on less than
  slt $t0, $s1, $s2   # if $s1 < $s2 then $t0 = 1 else $t0 = 0
- slt and bne can be used to implement branch on less than:
  slt $t0, $s0, $s1
  bne $t0, $zero, Less
- Note that the assembler needs a register to do this; there are register conventions for the MIPS assembly language
- we can now build general control structures
41 MIPS Register Convention
- $at (register 1): reserved for the assembler
- $k0, $k1 (registers 26-27): reserved for the operating system
42 Procedure calls
- Procedures and subroutines allow reuse and structuring of code
- Steps:
- Place parameters in a place where the procedure can access them
- Transfer control to the procedure
- Acquire the storage needed for the procedure
- Perform the desired task
- Place the results in a place where the calling program can access them
- Return control to the point of origin
43 Register assignments for procedure calls
- $a0...$a3: four argument registers for passing parameters
- $v0...$v1: two return value registers
- $ra: return address register
- use of argument and return value registers: compiler
- handling of the control passing mechanism: machine
- jump-and-link instruction: jal ProcAddress
- saves the return address (PC + 4) in $ra (the Program Counter holds the address of the current instruction)
- loads ProcAddress into the PC
- return jump: jr $ra
- loads the return address into the PC
44 Stack
- Used if the four argument registers and two return value registers are not enough, or if nested subroutines (a subroutine calls another one) are used
- Can also contain temporary data
- The stack is a last-in-first-out structure in memory
- The stack pointer ($sp) points at the top of the stack
- Push and pop operations
- The MIPS stack grows from higher addresses to lower addresses
45 Stack and Stack Pointer
46 Constants
- Small constants are used quite frequently, e.g., A = A + 5; B = B - 1
- Solution 1: put constants in memory and load them
- To add a constant to a register:
  lw  $t0, AddrConstant($zero)
  add $sp, $sp, $t0
- Solution 2: to avoid extra instructions, keep the constant inside the instruction itself:
  addi $29, $29, 4    # i means immediate
  slti $8, $18, 10
  andi $29, $29, 6
- Design Principle 4: make the common case fast.
47 How about larger constants?
- We'd like to be able to load a 32 bit constant into a register
- Must use two instructions; new "load upper immediate" instruction: lui $t0, 1010101010101010
- Then must get the lower order bits right, i.e., ori $t0, $t0, 1010101010101010
- After lui: 1010101010101010 0000000000000000
- ori with:  0000000000000000 1010101010101010
- Result:    1010101010101010 1010101010101010
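The effect of the lui/ori pair can be simulated with Python bit operations (a sketch of the two steps, not MIPS itself):

```python
# lui places the 16-bit immediate in the upper half of the register
# (lower half zero); ori then ORs in the low 16 bits.
upper = 0b1010101010101010
lower = 0b1010101010101010
reg = upper << 16            # effect of lui
reg |= lower                 # effect of ori
print(format(reg, "032b"))   # 10101010101010101010101010101010
```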
48 Overview of MIPS
- simple instructions, all 32 bits wide
- very structured, no unnecessary baggage
- only three instruction formats:
  R: op rs rt rd shamt funct
  I: op rs rt 16-bit address
  J: op 26-bit address
- rely on the compiler to achieve performance: what are the compiler's goals?
- help the compiler where we can
49 Addresses in Branches and Jumps
- Instructions:
- bne $t4, $t5, Label: next instruction is at Label if $t4 != $t5
- beq $t4, $t5, Label: next instruction is at Label if $t4 == $t5
- j Label: next instruction is at Label
- Formats:
  I: op rs rt 16-bit address
  J: op 26-bit address
- Addresses are not 32 bits. How do we handle this, as we did with load and store instructions?
50 Addresses in Branches
- Instructions:
- bne $t4, $t5, Label: next instruction is at Label if $t4 != $t5
- beq $t4, $t5, Label: next instruction is at Label if $t4 == $t5
- Format:
  I: op rs rt 16-bit address
- Could specify a register (like lw and sw do) and add it to the 16-bit address
- use the Instruction Address Register (PC = program counter)
- most branches are local (principle of locality)
- Jump instructions just use the high order bits of the PC
- address boundaries of 256 MB
51 MIPS addressing mode summary
- Register addressing
- operand in a register
- Base or displacement addressing
- operand in memory; the address is the sum of a register and a constant in the instruction
- Immediate addressing
- operand is a constant within the instruction
- PC-relative addressing
- address is the sum of the PC and a constant in the instruction
- used e.g. in branch instructions
- Pseudodirect addressing
- jump address is the 26 bits of the instruction concatenated with the upper bits of the PC
- Additional addressing modes exist in other computers
52 MIPS addressing mode summary
53 To summarize
54 Assembly Language vs. Machine Language
- Assembly provides a convenient symbolic representation
- much easier than writing down numbers
- e.g., destination first
- Machine language is the underlying reality
- e.g., destination is no longer first
- Assembly can provide 'pseudoinstructions'
- e.g., move $t0, $t1 exists only in assembly
- would be implemented using add $t0, $t1, $zero
- When considering performance you should count real instructions
55 Alternative Architectures
- Design alternative:
- provide more powerful operations than found in MIPS
- goal is to reduce the number of instructions executed
- danger is a slower cycle time and/or a higher CPI
- Sometimes referred to as RISC vs. CISC
- Reduced Instruction Set Computers
- Complex Instruction Set Computers
- virtually all new instruction sets since 1982 have been RISC
56 Reduced Instruction Set Computers
- Common characteristics of all RISCs
- Single cycle issue
- Small number of fixed length instruction formats
- Load/store architecture
- Large number of registers
- Additional characteristics of most RISCs
- Small number of instructions
- Small number of addressing modes
- Fast control unit
57 An alternative architecture: 80x86
- 1978: The Intel 8086 is announced (16 bit architecture)
- 1980: The 8087 floating point coprocessor is added
- 1982: The 80286 increases the address space to 24 bits and adds instructions
- 1985: The 80386 extends to 32 bits, with new addressing modes
- 1989-1995: The 80486, Pentium, and Pentium Pro add a few instructions (mostly designed for higher performance)
- 1997: MMX is added
- Intel had a 16-bit microprocessor two years before its competitors' more elegant architectures, which led to the selection of the 8086 as the CPU for the IBM PC
- This history illustrates the impact of the "golden handcuffs" of compatibility: an architecture that is difficult to explain and impossible to love
58 A dominant architecture: 80x86
- See your textbook for a more detailed description
- Complexity:
- instructions from 1 to 17 bytes long
- one operand must act as both a source and destination
- one operand can come from memory
- complex addressing modes, e.g., base or scaled index with 8 or 32 bit displacement
- Saving grace:
- the most frequently used architectural components are not too difficult to implement
- compilers avoid the portions of the architecture that are slow
59 Summary
- Instruction complexity is only one variable
- lower instruction count vs. higher CPI / lower clock rate
- Design principles:
- simplicity favours regularity
- smaller is faster
- good design demands good compromises
- make the common case fast
- Instruction set architecture
- a very important abstraction indeed!
60 Arithmetic
- Where we've been:
- performance (seconds, cycles, instructions)
- abstractions: instruction set architecture; assembly language and machine language
- What's up ahead:
- implementing the architecture
61 Arithmetic
- We start with the Arithmetic Logic Unit (ALU)
62 Numbers
- Bits are just bits (no inherent meaning); conventions define the relationship between bits and numbers
- Binary numbers (base 2): 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 ...; decimal: 0 ... 2^n - 1
- Of course it gets more complicated: numbers are finite (overflow); fractions and real numbers; negative numbers
- How do we represent negative numbers? i.e., which bit patterns will represent which numbers?
- Octal and hexadecimal numbers
- Floating-point numbers
63 Possible Representations of Signed Numbers
- Sign Magnitude | One's Complement | Two's Complement
  000 = +0       | 000 = +0         | 000 = 0
  001 = +1       | 001 = +1         | 001 = +1
  010 = +2       | 010 = +2         | 010 = +2
  011 = +3       | 011 = +3         | 011 = +3
  100 = -0       | 100 = -3         | 100 = -4
  101 = -1       | 101 = -2         | 101 = -3
  110 = -2       | 110 = -1         | 110 = -2
  111 = -3       | 111 = -0         | 111 = -1
- Issues: balance, number of zeros, ease of operations.
- Two's complement is best.
64 MIPS
- 32 bit signed numbers:
  0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
  0000 0000 0000 0000 0000 0000 0000 0001two = +1ten
  0000 0000 0000 0000 0000 0000 0000 0010two = +2ten
  ...
  0111 1111 1111 1111 1111 1111 1111 1110two = +2,147,483,646ten
  0111 1111 1111 1111 1111 1111 1111 1111two = +2,147,483,647ten
  1000 0000 0000 0000 0000 0000 0000 0000two = -2,147,483,648ten
  1000 0000 0000 0000 0000 0000 0000 0001two = -2,147,483,647ten
  1000 0000 0000 0000 0000 0000 0000 0010two = -2,147,483,646ten
  ...
  1111 1111 1111 1111 1111 1111 1111 1101two = -3ten
  1111 1111 1111 1111 1111 1111 1111 1110two = -2ten
  1111 1111 1111 1111 1111 1111 1111 1111two = -1ten
65 Two's Complement Operations
- Negating a two's complement number: invert all bits and add 1
- Remember: negate and invert are different operations.
- You negate a number but invert a bit.
- Converting n-bit numbers into numbers with more than n bits:
- the MIPS 16 bit immediate gets converted to 32 bits for arithmetic
- copy the most significant bit (the sign bit) into the other bits:
  0010 -> 0000 0010
  1010 -> 1111 1010
- "sign extension"
- MIPS load byte instructions:
- lbu: no sign extension
- lb: sign extension
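Sign extension can be sketched in Python (the function is ours, operating on the raw bit pattern):

```python
# Sign extension: copy the sign bit into the new upper bits.
def sign_extend(value, from_bits, to_bits):
    if (value >> (from_bits - 1)) & 1:    # sign bit set: negative number
        value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return value

print(format(sign_extend(0b0010, 4, 8), "08b"))  # 00000010
print(format(sign_extend(0b1010, 4, 8), "08b"))  # 11111010
```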
66 Addition & Subtraction
- Just like in grade school (carry/borrow 1s):
    0111      0111      0110
  + 0110    - 0110    - 0101
    1101      0001      0001
- Two's complement operations are easy
- subtraction using addition of negative numbers:
    0111
  + 1010
   10001
- Overflow (result too large for the finite computer word)
- e.g., adding two n-bit numbers does not yield an n-bit number:
    0111
  + 0001
    1000
67 Detecting Overflow
- No overflow when adding a positive and a negative number
- No overflow when the signs are the same for subtraction
- Overflow occurs when the value affects the sign:
- overflow when adding two positives yields a negative
- or, adding two negatives gives a positive
- or, subtract a negative from a positive and get a negative
- or, subtract a positive from a negative and get a positive
- Consider the operations A + B and A - B
- Can overflow occur if B is 0? No.
- Can overflow occur if A is 0? Yes.
68 Effects of Overflow
- An exception (interrupt) occurs
- Control jumps to a predefined address for the exception
- The interrupted address is saved for possible resumption
- Details depend on the software system / language
- example: flight control vs. homework assignment
- We don't always want to detect overflow: new MIPS instructions addu, addiu, subu (note: addiu still sign-extends; note: sltu, sltiu for unsigned comparisons)
69 Logical Operations
- and, andi: bit-by-bit AND
- or, ori: bit-by-bit OR
- sll: shift left logical
- srl: shift right logical
- 0101 1010 shifted left two steps gives 0110 1000
- 0110 1010 shifted right three bits gives 0000 1101
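The two shift examples, checked on 8-bit values (logical shifts; the mask keeps the result in 8 bits):

```python
x = 0b01011010
print(format((x << 2) & 0xFF, "08b"))  # 01101000
y = 0b01101010
print(format(y >> 3, "08b"))           # 00001101
```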
70 Logical unit
- Let's build a logical unit to support the and and or instructions
- we'll just build a 1-bit unit and use 32 of them
- op = 0: and; op = 1: or
- Possible implementation (sum-of-products):
  res = a·b + a·op + b·op
71 Review: The Multiplexor
- Selects one of the inputs to be the output, based on a control input
- [Figure: IEC symbol of a 4-input MUX]
- Let's build our logical unit using a MUX
72 Different Implementations
- Not easy to decide the best way to build something
- Don't want too many inputs to a single gate
- Don't want to have to go through too many gates
- For our purposes, ease of comprehension is important
- We use multiplexors
- Let's look at a 1-bit ALU for addition:
  cout = a·b + a·cin + b·cin
  sum = a xor b xor cin
- How could we build a 1-bit ALU for AND, OR and ADD?
- How could we build a 32-bit ALU?
73 Building a 32 bit ALU for AND, OR and ADD
- We need a 4-input MUX.
74 What about subtraction (a - b)?
- Two's complement approach: just negate b and add.
- A clever solution:
- In a multiple-bit ALU the least significant CarryIn has to be equal to 1 for subtraction.
75 Tailoring the ALU to the MIPS
- Need to support the set-on-less-than instruction (slt)
- remember: slt is an arithmetic instruction
- produces a 1 if rs < rt and 0 otherwise
- use subtraction: (a - b) < 0 implies a < b
- Need to support test for equality (beq $t5, $t6, $t7)
- use subtraction: (a - b) = 0 implies a = b
76 Supporting slt
- Other ALUs
- Most significant ALU
77 32 bit ALU supporting slt
- a < b is equivalent to a - b < 0, thus Set is the sign bit of the result.
78 Final ALU including test for equality
- Notice the control lines:
  000 = and
  001 = or
  010 = add
  110 = subtract
  111 = slt
- Note: Zero is a 1 when the result is zero!
79 Conclusion
- We can build an ALU to support the MIPS instruction set
- key idea: use a multiplexor to select the output we want
- we can efficiently perform subtraction using two's complement
- we can replicate a 1-bit ALU to produce a 32-bit ALU
- Important points about hardware:
- all of the gates are always working
- the speed of a gate is affected by the number of inputs to the gate
- the speed of a circuit is affected by the number of gates in series (on the "critical path", or the deepest level of logic)
- Our primary focus is comprehension; however,
- clever changes to organization can improve performance (similar to using better algorithms in software)
- we'll look at examples for addition, multiplication and division
80 Problem: ripple carry adder is slow
- A 32-bit ALU is much slower than a 1-bit ALU.
- There is more than one way to do addition.
- the two extremes: ripple carry and sum-of-products
- Can you see the ripple? How could you get rid of it?
- c1 = b0·c0 + a0·c0 + a0·b0
- c2 = b1·c1 + a1·c1 + a1·b1    c2 = c2(a0, b0, c0, a1, b1)
- c3 = b2·c2 + a2·c2 + a2·b2    c3 = c3(a0, b0, c0, a1, b1, a2, b2)
- c4 = b3·c3 + a3·c3 + a3·b3    c4 = c4(a0, b0, c0, a1, b1, a2, b2, a3, b3)
- Not feasible! Too many inputs to the gates.
81 Carry-lookahead adder
- An approach in between the two extremes
- Motivation:
- If we didn't know the value of carry-in, what could we do?
- When would we always generate a carry? gi = ai · bi
- When would we propagate the carry? pi = ai + bi
- Look at the truth table!
- Did we get rid of the ripple?
- c1 = g0 + p0·c0
- c2 = g1 + p1·c1    c2 = g1 + p1·g0 + p1·p0·c0
- c3 = g2 + p2·c2    c3 = g2 + p2·g1 + p2·p1·g0 + p2·p1·p0·c0
- c4 = g3 + p3·c3    c4 = ...
- Feasible! A smaller number of inputs to the gates.
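The generate/propagate recurrence can be sketched in Python (evaluated iteratively here; the lookahead adder expands the same recurrence into two-level logic):

```python
# Carry recurrence c_{i+1} = g_i + p_i * c_i, with g_i = a_i AND b_i
# (generate) and p_i = a_i OR b_i (propagate).
def carries(a_bits, b_bits, c0=0):
    """a_bits, b_bits: lists of bits, least significant first."""
    cs, c = [], c0
    for a, b in zip(a_bits, b_bits):
        g = a & b            # generate
        p = a | b            # propagate
        c = g | (p & c)
        cs.append(c)
    return cs                # [c1, c2, ..., cn]

# 0111 + 0001, written LSB first: the carry ripples through the low bits.
print(carries([1, 1, 1, 0], [1, 0, 0, 0]))  # [1, 1, 1, 0]
```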
82 1-bit adder
- a b cin | cout sum
  0 0 0   |  0    0
  0 0 1   |  0    1
  0 1 0   |  0    1
  0 1 1   |  1    0
  1 0 0   |  0    1
  1 0 1   |  1    0
  1 1 0   |  1    0
  1 1 1   |  1    1
83 Use principle to build bigger adders
- Can't build a 16 bit CLA adder (too big)
- Could use ripple carry between 4-bit CLA adders
- Better: use the CLA principle again!
- Principle shown in the figure. See textbook for details.
84 Multiplication
- More complicated than addition
- can be accomplished via shifting and addition
- More time and more area
- Let's look at 2 versions based on the grammar school algorithm:
      0010   (multiplicand)
   x  1011   (multiplier)
      0010
     0010
    0000
   0010
   -------
   0010110
- Negative numbers: easy way: convert to positive and multiply
- there are better techniques
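The grammar school (shift-and-add) algorithm, sketched in Python for unsigned operands:

```python
# Shift-and-add multiplication, as in the 0010 x 1011 example:
# for each 1 bit of the multiplier, add the suitably shifted multiplicand.
def shift_add_multiply(multiplicand, multiplier):
    product, shift = 0, 0
    while multiplier:
        if multiplier & 1:                    # this multiplier bit is 1:
            product += multiplicand << shift  # add the shifted multiplicand
        multiplier >>= 1
        shift += 1
    return product

print(format(shift_add_multiply(0b0010, 0b1011), "07b"))  # 0010110 (2 x 11 = 22)
```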
85 Multiplication, First Version
86 Multiplication, Final Version
87 Booth's Algorithm
- The grammar school method was implemented using addition and shifting
- Booth's algorithm also uses subtraction
- Based on two bits of the multiplier: either add, subtract or do nothing; always shift
- Handles two's complement numbers
88 Fast multipliers
- Combinational implementations
- conventional multiplier algorithm
- partial products with AND gates
- adders
- lots of modifications
- Sequential implementations
- pipelined multiplier
- registers between levels of logic
- result delayed
- effective speed of multiple multiplications increased
89 Four-Bit Binary Multiplication
- Multiplicand:                            B3   B2   B1   B0
- Multiplier:                         x    A3   A2   A1   A0
- 1st partial product:                    A0B3 A0B2 A0B1 A0B0
- 2nd partial product:               A1B3 A1B2 A1B1 A1B0
- 3rd partial product:          A2B3 A2B2 A2B1 A2B0
- 4th partial product:     A3B3 A3B2 A3B1 A3B0
- Final product:        P7 P6   P5   P4   P3   P2   P1   P0
90 Classical Implementation
91 Pipelined Multiplier
- [Figure: pipelined multiplier; clocked registers between the levels of logic]
92 Division
- Simple method:
- initialise the remainder with the dividend
- start from the most significant end
- subtract the divisor from the remainder if possible (quotient bit 1)
- shift the divisor to the right and repeat
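The steps above can be sketched as restoring division in Python (unsigned operands; names are ours):

```python
# Restoring division: subtract the (shifted) divisor where it fits
# (quotient bit 1), then shift the divisor right and repeat.
def divide(dividend, divisor, bits=4):
    remainder, quotient = dividend, 0
    for i in range(bits - 1, -1, -1):       # most significant end first
        if remainder >= (divisor << i):
            remainder -= divisor << i
            quotient |= 1 << i
    return quotient, remainder

print(divide(0b0111, 0b0010))  # (3, 1): 7 / 2 = 3 remainder 1
```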
93 Division, First Version
94 Division, Final Version
- The same hardware can be used for multiply and divide.
95 Floating Point (a brief look)
- We need a way to represent:
- numbers with fractions, e.g., 3.1416
- very small numbers, e.g., 0.000000001
- very large numbers, e.g., 3.15576 × 10^9
- Representation:
- sign, exponent, significand: (-1)^sign × significand × 2^exponent
- more bits for the significand gives more accuracy
- more bits for the exponent increases the range
- IEEE 754 floating point standard
- single precision: 8 bit exponent, 23 bit significand
- double precision: 11 bit exponent, 52 bit significand
96 IEEE 754 floating-point standard
- Leading 1 bit of significand is implicit
- Exponent is biased to make sorting easier
- all 0s is the smallest exponent, all 1s is the largest
- bias of 127 for single precision and 1023 for double precision
- summary: (-1)^sign × (1 + significand) × 2^(exponent - bias)
- Example:
- decimal: -0.75 = -3/4 = -3/2^2
- binary: -0.11 = -1.1 × 2^-1
- floating point: exponent = 126 = 0111 1110
- IEEE single precision: 1 01111110 10000000000000000000000
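The encoding of -0.75 can be checked with Python's struct module (packing as a big-endian single-precision float and reading the raw bits back):

```python
import struct

# Pack -0.75 as an IEEE 754 single and reinterpret the bytes as an integer.
bits = struct.unpack(">I", struct.pack(">f", -0.75))[0]
print(format(bits, "032b"))  # 10111111010000000000000000000000
```

The printed pattern splits into sign 1, exponent 01111110 (126), and significand 1000...0, matching the slide.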
97 Floating-point addition
- 1. Shift the significand of the number with the lesser exponent right until the exponents match
- 2. Add the significands
- 3. Normalise the sum, checking for overflow or underflow
- 4. Round the sum
98 Floating-point addition
99 Floating-point multiplication
- 1. Add the exponents
- 2. Multiply the significands
- 3. Normalise the product, checking for overflow or underflow
- 4. Round the product
- 5. Determine the sign of the product
100 Floating Point Complexities
- Operations are somewhat more complicated (see text)
- In addition to overflow we can have underflow
- Accuracy can be a big problem
- IEEE 754 keeps two extra bits, guard and round, during intermediate calculations
- four rounding modes
- positive divided by zero yields infinity
- zero divided by zero yields "not a number"
- other complexities
- Implementing the standard can be tricky
101 Chapter Four Summary
- Computer arithmetic is constrained by limited precision
- Bit patterns have no inherent meaning, but standards do exist:
- two's complement
- IEEE 754 floating point
- Computer instructions determine the meaning of the bit patterns
- Performance and accuracy are important, so there are many complexities in real machines (i.e., algorithms and implementation)
- We are ready to move on (and implement the processor)