Title: How to measure, report, and summarize performance?
1 Performance
- How to measure, report, and summarize performance?
- What factors determine the performance of a computer?
- Critical to purchase and design decisions
- best performance?
- least cost?
- best performance/cost?
- Questions:
- Why is some hardware better than others for different programs?
- What factors of system performance are hardware related? (e.g., do we need a new machine, or a new operating system?)
- How does the machine's instruction set affect performance?
2 Computer Performance
- Response time (execution time)
- the time between the start and completion of a task
- Throughput
- the total amount of work done in a given time
- Q: If we replace the processor with a faster one, what do we increase?
- A: Response time and throughput
- Q: If we add an additional processor to a system, what do we increase?
- A: Throughput
3 Book's Definition of Performance
- For some program running on machine X, PerformanceX = 1 / Execution timeX
- "X is n times faster than Y": n = PerformanceX / PerformanceY
- Problem: Machine A runs a program in 10 seconds and machine B in 15 seconds. How much faster is A than B?
- Answer: n = PerformanceA / PerformanceB = Execution timeB / Execution timeA = 15/10 = 1.5
- A is 1.5 times faster than B.
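The definition above can be checked with a short Python sketch (illustrative only; the function name `speedup` is ours):

```python
# Relative performance: Performance_X = 1 / ExecutionTime_X, so
# "X is n times faster than Y" means n = time_Y / time_X.
def speedup(time_x, time_y):
    """Return n such that machine X (time_x) is n times faster than Y (time_y)."""
    return time_y / time_x

# Machine A: 10 s, machine B: 15 s
print(speedup(10, 15))  # 1.5, so A is 1.5 times faster than B
```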
4 Execution Time
- Elapsed time (wall-clock time or response time)
- counts everything (disk and memory accesses, I/O, etc.)
- a useful number, but often not good for comparison purposes
- CPU time
- doesn't count I/O or time spent running other programs
- can be broken up into system time and user time
- Our focus: user CPU time
- time spent executing the lines of code that are "in" our program
5 Clock Cycles
- Instead of reporting execution time in seconds, we often use cycles
- Execution time = # of clock cycles × cycle time
- Clock ticks indicate when to start activities (one abstraction)
- cycle time (period) = time between ticks = seconds per cycle
- clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
- A 200 MHz clock has a cycle time of 1 / (200 × 10^6) s = 5 ns
6 How to Improve Performance
- Execution time = # of clock cycles × cycle time
- So, to improve performance (everything else being equal) you can either
- reduce the # of required clock cycles for a program, or
- decrease the clock period or, said another way, increase the clock frequency.
7 Different numbers of cycles for different instructions
- Multiplication takes more time than addition
- Floating point operations take longer than integer ones
- Accessing memory takes more time than accessing registers
- Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)
- Another point: the same instruction might require a different number of cycles on a different machine
8 Example
- A program runs in 10 seconds on computer A, which has a 400 MHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new technology to substantially increase the clock rate, but this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A. What clock rate should we tell the designer to target?
- Clock cyclesA = 10 s × 400 MHz = 4 × 10^9 cycles
- Clock cyclesB = 1.2 × 4 × 10^9 cycles = 4.8 × 10^9 cycles
- Execution time = # of clock cycles × cycle time
- Clock rateB = Clock cyclesB / Execution timeB = 4.8 × 10^9 cycles / 6 s = 800 MHz
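The same arithmetic as a small Python sketch (integer units are used so the results are exact; the variable names are ours):

```python
# Machine A: 10 s at 400 MHz; machine B needs 1.2x the cycles in 6 s.
time_a = 10                      # seconds
rate_a = 400_000_000             # 400 MHz in Hz
cycles_a = time_a * rate_a       # 4 x 10^9 cycles
cycles_b = cycles_a * 12 // 10   # 1.2 times as many cycles
rate_b = cycles_b // 6           # clock rate needed to finish in 6 s
print(rate_b)                    # 800000000, i.e. 800 MHz
```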
9 Now that we understand cycles
- A given program will require
- some number of instructions (machine instructions)
- some number of cycles
- some number of seconds
- We have a vocabulary that relates these quantities:
- cycle time (seconds per cycle)
- clock rate (cycles per second)
- CPI (cycles per instruction): an AVERAGE VALUE!
- a floating point intensive application might have a higher CPI
- MIPS (millions of instructions per second): this would be higher for a program using simple instructions
10 Performance
- Performance is determined by execution time
- Related variables:
- # of cycles to execute the program
- # of instructions in the program
- # of cycles per second (clock rate)
- average # of cycles per instruction (CPI)
- average # of instructions per second (MIPS)
- Common pitfall: thinking one of the variables is indicative of performance when it really isn't.
11 CPI Example
- Suppose we have two implementations of the same instruction set architecture (ISA). For some program, machine A has a clock cycle time of 10 ns and a CPI of 2.0; machine B has a clock cycle time of 20 ns and a CPI of 1.2. Which machine is faster for this program, and by how much?
- Time per instruction for A = 2.0 × 10 ns = 20 ns
- Time per instruction for B = 1.2 × 20 ns = 24 ns
- A is 24/20 = 1.2 times faster
- If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?
- Answer: # of instructions
12 # of Instructions Example
- A compiler designer has two alternatives for a certain code sequence. There are three different classes of instructions, A, B, and C, and they require one, two, and three cycles, respectively. The first sequence has 5 instructions: 2 of A, 1 of B, and 2 of C. The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Which sequence will be faster? What are the CPI values?
- Sequence 1: 2×1 + 1×2 + 2×3 = 10 cycles; CPI1 = 10 / 5 = 2
- Sequence 2: 4×1 + 1×2 + 1×3 = 9 cycles; CPI2 = 9 / 6 = 1.5
- Sequence 2 is faster.
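The cycle counts can be tabulated with a short Python sketch (names are ours):

```python
# Cycle counts and CPI for the two instruction sequences.
cycles = {"A": 1, "B": 2, "C": 3}

def total_cycles(mix):
    """mix maps instruction class to instruction count."""
    return sum(cycles[c] * n for c, n in mix.items())

seq1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions
print(total_cycles(seq1), total_cycles(seq1) / 5)  # 10 2.0
print(total_cycles(seq2), total_cycles(seq2) / 6)  # 9 1.5
```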
13 MIPS
- Million Instructions Per Second
- MIPS = instruction count / (execution time × 10^6)
- MIPS is easy to understand, but
- does not take into account the capabilities of the instructions: the instruction counts of different instruction sets differ
- varies between programs even on the same computer
- can vary inversely with performance!
14 MIPS example
- Two compilers are being tested for a 100 MHz machine with three different classes of instructions, A, B, and C, which require one, two, and three cycles, respectively. Compiler 1: compiled code uses 5 million class A, 1 million class B, and 1 million class C instructions. Compiler 2: compiled code uses 10 million class A, 1 million class B, and 1 million class C instructions.
- Which sequence will be faster according to MIPS?
- Which sequence will be faster according to execution time?
15 MIPS example
- Cycles and instructions:
- Compiler 1: 10 million cycles, 7 million instructions
- Compiler 2: 15 million cycles, 12 million instructions
- Execution time = clock cycles / clock rate
- Execution time1 = 10 × 10^6 / (100 × 10^6) = 0.1 s
- Execution time2 = 15 × 10^6 / (100 × 10^6) = 0.15 s
- MIPS = instruction count / (execution time × 10^6)
- MIPS1 = 7 × 10^6 / (0.1 × 10^6) = 70
- MIPS2 = 12 × 10^6 / (0.15 × 10^6) = 80
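The inversion (higher MIPS, slower program) shows up directly in a Python sketch (names are ours):

```python
# MIPS vs. execution time for the two compilers on the 100 MHz machine.
CLOCK = 100_000_000              # Hz
CYCLES = {"A": 1, "B": 2, "C": 3}

def stats(mix):
    """mix maps instruction class to instruction count."""
    instrs = sum(mix.values())
    cyc = sum(CYCLES[c] * n for c, n in mix.items())
    time = cyc / CLOCK                    # seconds
    mips = instrs / (time * 1_000_000)    # native MIPS rating
    return time, mips

t1, m1 = stats({"A": 5_000_000, "B": 1_000_000, "C": 1_000_000})
t2, m2 = stats({"A": 10_000_000, "B": 1_000_000, "C": 1_000_000})
print(t1, round(m1))  # 0.1 70  -> compiler 1 is faster...
print(t2, round(m2))  # 0.15 80 -> ...but compiler 2 has the higher MIPS
```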
16 Benchmarks
- Performance is best determined by running a real application
- Use programs typical of the expected workload
- or typical of the expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
- Small benchmarks
- nice for architects and designers
- easy to standardize
- can be abused
- SPEC (System Performance Evaluation Cooperative)
- companies have agreed on a set of real programs and inputs
- can still be abused
- valuable indicator of performance (and compiler technology)
17 SPEC 95
18 SPEC 89
- Compiler effects on performance depend on the application.
19 SPEC 95
- Organisational enhancements improve performance.
- Doubling the clock rate does not double the performance.
20 Amdahl's Law
- Version 1:
- Execution time after improvement = execution time unaffected + (execution time affected / amount of improvement)
- Version 2:
- Speedup = performance after improvement / performance before improvement = execution time before improvement / execution time after improvement
- With unaffected time n, affected time a, and improvement factor p:
- execution time before = n + a
- execution time after = n + a/p
- Principle: Make the common case fast
21 Amdahl's Law
- Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?
- 100 s / 4 = 80 s / n + 20 s
- 5 s = 80 s / n
- n = 80 s / 5 s = 16
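Solving the same equation in Python (the function name is ours):

```python
# Amdahl's Law: solve  total/target = (total - affected) + affected/n  for n.
def required_improvement(total, affected, target_speedup):
    unaffected = total - affected
    new_total = total / target_speedup
    return affected / (new_total - unaffected)

print(required_improvement(100, 80, 4))  # 16.0
```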
22 Amdahl's Law
- Example: A benchmark program spends half of the time executing floating point instructions.
- We improve the performance of the floating point unit by a factor of four.
- What is the speedup?
- Time before = 10 s
- Time after = 5 s + 5 s / 4 = 6.25 s
- Speedup = 10 / 6.25 = 1.6
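The general form of this calculation, as a sketch (the fraction-based formulation is equivalent to the slide's times):

```python
# Speedup when a fraction f of the time is improved by a factor p.
def amdahl_speedup(f, p):
    return 1.0 / ((1.0 - f) + f / p)

print(amdahl_speedup(0.5, 4))  # 1.6
```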
23 Machine Instructions
- Language of the machine
- Lowest level of programming; directly controls the hardware
- Assembly instructions are symbolic versions of machine instructions
- More primitive than higher level languages
- Very restrictive
- Programs are stored in memory; one instruction is fetched and executed at a time
- We'll be working with the MIPS instruction set architecture
24 MIPS instruction set
- Load from memory
- Store in memory
- Logic operations
- and, or, negation, shift, ...
- Arithmetic operations
- addition, subtraction, ...
- Branch
25 Instruction types
- 1 operand
- Jump address
- Jump register number
- 2 operands
- Multiply reg1, reg2
- 3 operands
- Add reg1, reg2, reg3
26 MIPS arithmetic
- Instructions have 3 operands
- Operand order is fixed (destination first)
- Example: C code: A = B + C
- MIPS code: add $s0, $s1, $s2
- $s0, etc. are registers (associated with variables by the compiler)
27 MIPS arithmetic
- Design Principle 1: simplicity favours regularity.
- Of course this complicates some things...
- C code: A = B + C + D; E = F - A
- MIPS code:
  add $t0, $s1, $s2
  add $s0, $t0, $s3
  sub $s4, $s5, $s0
- Operands must be registers; 32 registers are provided
- Design Principle 2: smaller is faster.
28 Registers vs. Memory
- Arithmetic instructions' operands are registers
- The compiler associates variables with registers
- What about programs with lots of variables?
29 Memory Organization
- Viewed as a large, single-dimension array, with an address.
- A memory address is an index into the array
- "Byte addressing" means that the index points to a byte of memory.
- [Figure: memory as an array of bytes; addresses 0, 1, 2, 3, ... each select 8 bits of data]
30 Memory Organization
- Bytes are nice, but most data items use larger "words"
- For MIPS, a word is 32 bits or 4 bytes.
- 2^32 bytes with byte addresses from 0 to 2^32 - 1
- 2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4
- Words are aligned, i.e., the 2 least significant bits of a word address are equal to 0.
- Registers hold 32 bits of data
- [Figure: memory as an array of words; addresses 0, 4, 8, 12, ... each select 32 bits of data]
31 Load and store instructions
- Example: C code: A[8] = h + A[8]
- MIPS code:
  lw  $t0, 32($s3)
  add $t0, $s2, $t0
  sw  $t0, 32($s3)
- word offset 8 equals byte offset 32
- Store word has destination last
- Remember: arithmetic operands are registers, not memory!
32 So far we've learned
- MIPS: loading and storing words but addressing bytes; arithmetic on registers only
- Instruction          Meaning
  add $s1, $s2, $s3    $s1 = $s2 + $s3
  sub $s1, $s2, $s3    $s1 = $s2 - $s3
  lw  $s1, 100($s2)    $s1 = Memory[$s2 + 100]
  sw  $s1, 100($s2)    Memory[$s2 + 100] = $s1
33 Machine Language
- Instructions, like registers and words of data, are also 32 bits long
- Example: add $t0, $s1, $s2
- R-type instruction format:
  000000 10001 10010 01000 00000 100000
  op     rs    rt    rd    shamt funct
- op: opcode, basic operation
- rs: 1st source register
- rt: 2nd source register
- rd: destination register
- shamt: shift amount
- funct: function, selects the specific variant of the operation
34 Machine Language
- Introduce a new type of instruction format
- I-type, for data transfer instructions
- Example: lw $t0, 32($s2)
  35   18   9    32
  op   rs   rt   16-bit number
- rt: destination register
- a new instruction format, but fields 1-3 are the same
- Design Principle 3: good design demands good compromises
35 Stored Program Concept
- Instructions are groups of bits
- Programs are stored in memory, to be read or written just like data
- Fetch & Execute cycle:
- instructions are fetched and put into a special register
- bits in the register "control" the subsequent actions
- fetch the next instruction and continue
- Memory holds data, programs, compilers, editors, etc.
36 Control
- Decision making instructions
- alter the control flow,
- i.e., change the "next" instruction to be executed
- MIPS conditional branch instructions:
  bne $t0, $t1, Label   # branch if not equal
  beq $t0, $t1, Label   # branch if equal
- Example (if): if (i == j) h = i + j;
  bne $s0, $s1, Label
  add $s3, $s0, $s1
  Label: ...
37 Control
- MIPS unconditional branch instruction: j Label
- Example (if-then-else): if (i != j) h = i + j; else h = i - j;
  beq $s4, $s5, Label1
  add $s3, $s4, $s5
  j Label2
  Label1: sub $s3, $s4, $s5
  Label2: ...
38 Control
- Example (loop):
  Loop: ...
        i = i + j
        if (i != h) go to Loop
  ...
- MIPS code:
  Loop: ...
        add $s1, $s1, $s2   # i = i + j
        bne $s1, $s3, Loop  # go to Loop if i != h
  ...
39 So far
- Instruction            Meaning
  add $s1, $s2, $s3      $s1 = $s2 + $s3
  sub $s1, $s2, $s3      $s1 = $s2 - $s3
  lw  $s1, 100($s2)      $s1 = Memory[$s2 + 100]
  sw  $s1, 100($s2)      Memory[$s2 + 100] = $s1
  bne $s4, $s5, Label    next instr. is at Label if $s4 != $s5
  beq $s4, $s5, Label    next instr. is at Label if $s4 == $s5
  j   Label              next instr. is at Label
- Formats: R, I, J
40 Control Flow
- We have beq and bne; what about branch-if-less-than?
- New instruction: set on less than
  slt $t0, $s1, $s2   # if $s1 < $s2 then $t0 = 1 else $t0 = 0
- slt and bne can be used to implement branch on less than:
  slt $t0, $s0, $s1
  bne $t0, $zero, Less
- Note that the assembler needs a register to do this; there are register conventions for the MIPS assembly language
- we can now build general control structures
41 MIPS Register Convention
- $at (register 1): reserved for the assembler
- $k0, $k1 (registers 26-27): reserved for the operating system
42 Procedure calls
- Procedures and subroutines allow reuse and structuring of code
- Steps:
- Place parameters in a place where the procedure can access them
- Transfer control to the procedure
- Acquire the storage needed for the procedure
- Perform the desired task
- Place the results in a place where the calling program can access them
- Return control to the point of origin
43 Register assignments for procedure calls
- $a0...$a3: four argument registers for passing parameters
- $v0...$v1: two return value registers
- $ra: return address register
- use of argument and return value registers: compiler
- handling of the control passing mechanism: machine
- jump-and-link instruction: jal ProcAddress
- saves the return address (PC + 4) in $ra (the Program Counter holds the address of the current instruction)
- loads ProcAddress into the PC
- return jump: jr $ra
- loads the return address into the PC
44 Stack
- Used if the four argument registers and two return value registers are not enough, or if nested subroutines (a subroutine calls another one) are used
- Can also contain temporary data
- The stack is a last-in-first-out structure in memory
- The stack pointer ($sp) points at the top of the stack
- Push and pop operations
- The MIPS stack grows from higher addresses to lower addresses
45 Stack and Stack Pointer
46 Constants
- Small constants are used quite frequently, e.g., A = A + 5; B = B - 1
- Solution 1: put constants in memory and load them
- To add a constant to a register:
  lw  $t0, AddrConstant($zero)
  add $sp, $sp, $t0
- Solution 2: to avoid extra instructions, keep the constant inside the instruction itself:
  addi $29, $29, 4    # i means immediate
  slti $8, $18, 10
  andi $29, $29, 6
- Design Principle 4: make the common case fast.
47 How about larger constants?
- We'd like to be able to load a 32 bit constant into a register
- Must use two instructions; new "load upper immediate" instruction: lui $t0, 1010101010101010
- Then must get the lower order bits right, i.e., ori $t0, $t0, 1010101010101010
- After lui: 1010101010101010 0000000000000000
- ori with:  0000000000000000 1010101010101010
- Result:    1010101010101010 1010101010101010
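The effect of the lui/ori pair can be simulated with Python bit operations (a sketch of the two steps, not MIPS itself):

```python
# lui places the 16-bit immediate in the upper half of the register
# (lower half zero); ori then ORs in the low 16 bits.
upper = 0b1010101010101010
lower = 0b1010101010101010
reg = upper << 16            # effect of lui
reg |= lower                 # effect of ori
print(format(reg, "032b"))   # 10101010101010101010101010101010
```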
48 Overview of MIPS
- simple instructions, all 32 bits wide
- very structured, no unnecessary baggage
- only three instruction formats:
  R: op rs rt rd shamt funct
  I: op rs rt 16-bit address
  J: op 26-bit address
- rely on the compiler to achieve performance: what are the compiler's goals?
- help the compiler where we can
49 Addresses in Branches and Jumps
- Instructions:
- bne $t4, $t5, Label: next instruction is at Label if $t4 != $t5
- beq $t4, $t5, Label: next instruction is at Label if $t4 == $t5
- j Label: next instruction is at Label
- Formats:
  I: op rs rt 16-bit address
  J: op 26-bit address
- Addresses are not 32 bits. How do we handle this, as we did with load and store instructions?
50 Addresses in Branches
- Instructions:
- bne $t4, $t5, Label: next instruction is at Label if $t4 != $t5
- beq $t4, $t5, Label: next instruction is at Label if $t4 == $t5
- Format:
  I: op rs rt 16-bit address
- Could specify a register (like lw and sw do) and add it to the 16-bit address
- use the Instruction Address Register (PC = program counter)
- most branches are local (principle of locality)
- Jump instructions just use the high order bits of the PC
- address boundaries of 256 MB
51 MIPS addressing mode summary
- Register addressing
- operand in a register
- Base or displacement addressing
- operand in memory; the address is the sum of a register and a constant in the instruction
- Immediate addressing
- operand is a constant within the instruction
- PC-relative addressing
- address is the sum of the PC and a constant in the instruction
- used e.g. in branch instructions
- Pseudodirect addressing
- jump address is the 26 bits of the instruction concatenated with the upper bits of the PC
- Additional addressing modes exist in other computers
52 MIPS addressing mode summary
53 To summarize
54 Assembly Language vs. Machine Language
- Assembly provides a convenient symbolic representation
- much easier than writing down numbers
- e.g., destination first
- Machine language is the underlying reality
- e.g., destination is no longer first
- Assembly can provide 'pseudoinstructions'
- e.g., move $t0, $t1 exists only in assembly
- would be implemented using add $t0, $t1, $zero
- When considering performance you should count real instructions
55 Alternative Architectures
- Design alternative:
- provide more powerful operations than found in MIPS
- goal is to reduce the number of instructions executed
- danger is a slower cycle time and/or a higher CPI
- Sometimes referred to as RISC vs. CISC
- Reduced Instruction Set Computers
- Complex Instruction Set Computers
- virtually all new instruction sets since 1982 have been RISC
56 Reduced Instruction Set Computers
- Common characteristics of all RISCs
- Single cycle issue
- Small number of fixed length instruction formats
- Load/store architecture
- Large number of registers
- Additional characteristics of most RISCs
- Small number of instructions
- Small number of addressing modes
- Fast control unit
57 An alternative architecture: 80x86
- 1978: The Intel 8086 is announced (16 bit architecture)
- 1980: The 8087 floating point coprocessor is added
- 1982: The 80286 increases the address space to 24 bits and adds instructions
- 1985: The 80386 extends to 32 bits, with new addressing modes
- 1989-1995: The 80486, Pentium, and Pentium Pro add a few instructions (mostly designed for higher performance)
- 1997: MMX is added
- Intel had a 16-bit microprocessor two years before its competitors' more elegant architectures, which led to the selection of the 8086 as the CPU for the IBM PC
- This history illustrates the impact of the "golden handcuffs" of compatibility: an architecture that is difficult to explain and impossible to love
58 A dominant architecture: 80x86
- See your textbook for a more detailed description
- Complexity:
- instructions from 1 to 17 bytes long
- one operand must act as both a source and destination
- one operand can come from memory
- complex addressing modes, e.g., base or scaled index with 8 or 32 bit displacement
- Saving grace:
- the most frequently used architectural components are not too difficult to implement
- compilers avoid the portions of the architecture that are slow
59 Summary
- Instruction complexity is only one variable
- lower instruction count vs. higher CPI / lower clock rate
- Design principles:
- simplicity favours regularity
- smaller is faster
- good design demands good compromises
- make the common case fast
- Instruction set architecture
- a very important abstraction indeed!
60 Arithmetic
- Where we've been:
- performance (seconds, cycles, instructions)
- abstractions: instruction set architecture; assembly language and machine language
- What's up ahead:
- implementing the architecture
61 Arithmetic
- We start with the Arithmetic Logic Unit (ALU)
62 Numbers
- Bits are just bits (no inherent meaning); conventions define the relationship between bits and numbers
- Binary numbers (base 2): 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 ...; decimal: 0 ... 2^n - 1
- Of course it gets more complicated: numbers are finite (overflow); fractions and real numbers; negative numbers
- How do we represent negative numbers? i.e., which bit patterns will represent which numbers?
- Octal and hexadecimal numbers
- Floating-point numbers
63 Possible Representations of Signed Numbers
- Sign Magnitude | One's Complement | Two's Complement
  000 = +0       | 000 = +0         | 000 = 0
  001 = +1       | 001 = +1         | 001 = +1
  010 = +2       | 010 = +2         | 010 = +2
  011 = +3       | 011 = +3         | 011 = +3
  100 = -0       | 100 = -3         | 100 = -4
  101 = -1       | 101 = -2         | 101 = -3
  110 = -2       | 110 = -1         | 110 = -2
  111 = -3       | 111 = -0         | 111 = -1
- Issues: balance, number of zeros, ease of operations.
- Two's complement is best.
64 MIPS
- 32 bit signed numbers:
  0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
  0000 0000 0000 0000 0000 0000 0000 0001two = +1ten
  0000 0000 0000 0000 0000 0000 0000 0010two = +2ten
  ...
  0111 1111 1111 1111 1111 1111 1111 1110two = +2,147,483,646ten
  0111 1111 1111 1111 1111 1111 1111 1111two = +2,147,483,647ten
  1000 0000 0000 0000 0000 0000 0000 0000two = -2,147,483,648ten
  1000 0000 0000 0000 0000 0000 0000 0001two = -2,147,483,647ten
  1000 0000 0000 0000 0000 0000 0000 0010two = -2,147,483,646ten
  ...
  1111 1111 1111 1111 1111 1111 1111 1101two = -3ten
  1111 1111 1111 1111 1111 1111 1111 1110two = -2ten
  1111 1111 1111 1111 1111 1111 1111 1111two = -1ten
65 Two's Complement Operations
- Negating a two's complement number: invert all bits and add 1
- Remember: negate and invert are different operations.
- You negate a number but invert a bit.
- Converting n-bit numbers into numbers with more than n bits:
- the MIPS 16 bit immediate gets converted to 32 bits for arithmetic
- copy the most significant bit (the sign bit) into the other bits:
  0010 -> 0000 0010
  1010 -> 1111 1010
- "sign extension"
- MIPS load byte instructions:
- lbu: no sign extension
- lb: sign extension
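Sign extension can be sketched in Python (the function is ours, operating on the raw bit pattern):

```python
# Sign extension: copy the sign bit into the new upper bits.
def sign_extend(value, from_bits, to_bits):
    if (value >> (from_bits - 1)) & 1:    # sign bit set: negative number
        value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return value

print(format(sign_extend(0b0010, 4, 8), "08b"))  # 00000010
print(format(sign_extend(0b1010, 4, 8), "08b"))  # 11111010
```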
66 Addition & Subtraction
- Just like in grade school (carry/borrow 1s):
    0111      0111      0110
  + 0110    - 0110    - 0101
    1101      0001      0001
- Two's complement operations are easy
- subtraction using addition of negative numbers:
    0111
  + 1010
   10001
- Overflow (result too large for the finite computer word)
- e.g., adding two n-bit numbers does not yield an n-bit number:
    0111
  + 0001
    1000
67 Detecting Overflow
- No overflow when adding a positive and a negative number
- No overflow when the signs are the same for subtraction
- Overflow occurs when the value affects the sign:
- overflow when adding two positives yields a negative
- or, adding two negatives gives a positive
- or, subtract a negative from a positive and get a negative
- or, subtract a positive from a negative and get a positive
- Consider the operations A + B and A - B
- Can overflow occur if B is 0? No.
- Can overflow occur if A is 0? Yes.
68 Effects of Overflow
- An exception (interrupt) occurs
- Control jumps to a predefined address for the exception
- The interrupted address is saved for possible resumption
- Details depend on the software system / language
- example: flight control vs. homework assignment
- We don't always want to detect overflow: new MIPS instructions addu, addiu, subu (note: addiu still sign-extends; note: sltu, sltiu for unsigned comparisons)
69 Logical Operations
- and, andi: bit-by-bit AND
- or, ori: bit-by-bit OR
- sll: shift left logical
- srl: shift right logical
- 0101 1010 shifted left two steps gives 0110 1000
- 0110 1010 shifted right three bits gives 0000 1101
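The two shift examples, checked on 8-bit values (logical shifts; the mask keeps the result in 8 bits):

```python
x = 0b01011010
print(format((x << 2) & 0xFF, "08b"))  # 01101000
y = 0b01101010
print(format(y >> 3, "08b"))           # 00001101
```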
70 Logical unit
- Let's build a logical unit to support the and and or instructions
- we'll just build a 1-bit unit and use 32 of them
- op = 0: and; op = 1: or
- Possible implementation (sum-of-products):
  res = a·b + a·op + b·op
71 Review: The Multiplexor
- Selects one of the inputs to be the output, based on a control input
- [Figure: IEC symbol of a 4-input MUX]
- Let's build our logical unit using a MUX
72 Different Implementations
- Not easy to decide the best way to build something
- Don't want too many inputs to a single gate
- Don't want to have to go through too many gates
- For our purposes, ease of comprehension is important
- We use multiplexors
- Let's look at a 1-bit ALU for addition:
  cout = a·b + a·cin + b·cin
  sum = a xor b xor cin
- How could we build a 1-bit ALU for AND, OR and ADD?
- How could we build a 32-bit ALU?
73 Building a 32 bit ALU for AND, OR and ADD
- We need a 4-input MUX.
74 What about subtraction (a - b)?
- Two's complement approach: just negate b and add.
- A clever solution:
- In a multiple-bit ALU the least significant CarryIn has to be equal to 1 for subtraction.
75 Tailoring the ALU to the MIPS
- Need to support the set-on-less-than instruction (slt)
- remember: slt is an arithmetic instruction
- produces a 1 if rs < rt and 0 otherwise
- use subtraction: (a - b) < 0 implies a < b
- Need to support test for equality (beq $t5, $t6, $t7)
- use subtraction: (a - b) = 0 implies a = b
76 Supporting slt
- Other ALUs
- Most significant ALU
77 32 bit ALU supporting slt
- a < b is equivalent to a - b < 0, thus Set is the sign bit of the result.
78 Final ALU including test for equality
- Notice the control lines:
  000 = and
  001 = or
  010 = add
  110 = subtract
  111 = slt
- Note: Zero is a 1 when the result is zero!
79 Conclusion
- We can build an ALU to support the MIPS instruction set
- key idea: use a multiplexor to select the output we want
- we can efficiently perform subtraction using two's complement
- we can replicate a 1-bit ALU to produce a 32-bit ALU
- Important points about hardware:
- all of the gates are always working
- the speed of a gate is affected by the number of inputs to the gate
- the speed of a circuit is affected by the number of gates in series (on the "critical path", or the deepest level of logic)
- Our primary focus is comprehension; however,
- clever changes to organization can improve performance (similar to using better algorithms in software)
- we'll look at examples for addition, multiplication and division
80 Problem: ripple carry adder is slow
- A 32-bit ALU is much slower than a 1-bit ALU.
- There is more than one way to do addition.
- the two extremes: ripple carry and sum-of-products
- Can you see the ripple? How could you get rid of it?
- c1 = b0·c0 + a0·c0 + a0·b0
- c2 = b1·c1 + a1·c1 + a1·b1    c2 = c2(a0, b0, c0, a1, b1)
- c3 = b2·c2 + a2·c2 + a2·b2    c3 = c3(a0, b0, c0, a1, b1, a2, b2)
- c4 = b3·c3 + a3·c3 + a3·b3    c4 = c4(a0, b0, c0, a1, b1, a2, b2, a3, b3)
- Not feasible! Too many inputs to the gates.
81 Carry-lookahead adder
- An approach in between the two extremes
- Motivation:
- If we didn't know the value of carry-in, what could we do?
- When would we always generate a carry? gi = ai · bi
- When would we propagate the carry? pi = ai + bi
- Look at the truth table!
- Did we get rid of the ripple?
- c1 = g0 + p0·c0
- c2 = g1 + p1·c1    c2 = g1 + p1·g0 + p1·p0·c0
- c3 = g2 + p2·c2    c3 = g2 + p2·g1 + p2·p1·g0 + p2·p1·p0·c0
- c4 = g3 + p3·c3    c4 = ...
- Feasible! A smaller number of inputs to the gates.
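The generate/propagate recurrence can be sketched in Python (evaluated iteratively here; the lookahead adder expands the same recurrence into two-level logic):

```python
# Carry recurrence c_{i+1} = g_i + p_i * c_i, with g_i = a_i AND b_i
# (generate) and p_i = a_i OR b_i (propagate).
def carries(a_bits, b_bits, c0=0):
    """a_bits, b_bits: lists of bits, least significant first."""
    cs, c = [], c0
    for a, b in zip(a_bits, b_bits):
        g = a & b            # generate
        p = a | b            # propagate
        c = g | (p & c)
        cs.append(c)
    return cs                # [c1, c2, ..., cn]

# 0111 + 0001, written LSB first: the carry ripples through the low bits.
print(carries([1, 1, 1, 0], [1, 0, 0, 0]))  # [1, 1, 1, 0]
```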
82 1-bit adder
- a b cin | cout sum
  0 0 0   |  0    0
  0 0 1   |  0    1
  0 1 0   |  0    1
  0 1 1   |  1    0
  1 0 0   |  0    1
  1 0 1   |  1    0
  1 1 0   |  1    0
  1 1 1   |  1    1
83 Use principle to build bigger adders
- Can't build a 16 bit CLA adder (too big)
- Could use ripple carry between 4-bit CLA adders
- Better: use the CLA principle again!
- Principle shown in the figure. See textbook for details.
84 Multiplication
- More complicated than addition
- can be accomplished via shifting and addition
- More time and more area
- Let's look at 2 versions based on the grammar school algorithm:
      0010   (multiplicand)
   x  1011   (multiplier)
      0010
     0010
    0000
   0010
   -------
   0010110
- Negative numbers: easy way: convert to positive and multiply
- there are better techniques
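The grammar school (shift-and-add) algorithm, sketched in Python for unsigned operands:

```python
# Shift-and-add multiplication, as in the 0010 x 1011 example:
# for each 1 bit of the multiplier, add the suitably shifted multiplicand.
def shift_add_multiply(multiplicand, multiplier):
    product, shift = 0, 0
    while multiplier:
        if multiplier & 1:                    # this multiplier bit is 1:
            product += multiplicand << shift  # add the shifted multiplicand
        multiplier >>= 1
        shift += 1
    return product

print(format(shift_add_multiply(0b0010, 0b1011), "07b"))  # 0010110 (2 x 11 = 22)
```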
85 Multiplication, First Version
86 Multiplication, Final Version
87 Booth's Algorithm
- The grammar school method was implemented using addition and shifting
- Booth's algorithm also uses subtraction
- Based on two bits of the multiplier: either add, subtract or do nothing; always shift
- Handles two's complement numbers
88 Fast multipliers
- Combinational implementations
- conventional multiplier algorithm
- partial products with AND gates
- adders
- lots of modifications
- Sequential implementations
- pipelined multiplier
- registers between levels of logic
- result delayed
- effective speed of multiple multiplications increased
89 Four-Bit Binary Multiplication
- Multiplicand:                            B3   B2   B1   B0
- Multiplier:                         x    A3   A2   A1   A0
- 1st partial product:                    A0B3 A0B2 A0B1 A0B0
- 2nd partial product:               A1B3 A1B2 A1B1 A1B0
- 3rd partial product:          A2B3 A2B2 A2B1 A2B0
- 4th partial product:     A3B3 A3B2 A3B1 A3B0
- Final product:        P7 P6   P5   P4   P3   P2   P1   P0
90 Classical Implementation
91 Pipelined Multiplier
- [Figure: pipelined multiplier; clocked registers between the levels of logic]
92 Division
- Simple method:
- initialise the remainder with the dividend
- start from the most significant end
- subtract the divisor from the remainder if possible (quotient bit 1)
- shift the divisor to the right and repeat
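The steps above can be sketched as restoring division in Python (unsigned operands; names are ours):

```python
# Restoring division: subtract the (shifted) divisor where it fits
# (quotient bit 1), then shift the divisor right and repeat.
def divide(dividend, divisor, bits=4):
    remainder, quotient = dividend, 0
    for i in range(bits - 1, -1, -1):       # most significant end first
        if remainder >= (divisor << i):
            remainder -= divisor << i
            quotient |= 1 << i
    return quotient, remainder

print(divide(0b0111, 0b0010))  # (3, 1): 7 / 2 = 3 remainder 1
```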
93 Division, First Version
94 Division, Final Version
- The same hardware can be used for multiply and divide.
95 Floating Point (a brief look)
- We need a way to represent:
- numbers with fractions, e.g., 3.1416
- very small numbers, e.g., 0.000000001
- very large numbers, e.g., 3.15576 × 10^9
- Representation:
- sign, exponent, significand: (-1)^sign × significand × 2^exponent
- more bits for the significand gives more accuracy
- more bits for the exponent increases the range
- IEEE 754 floating point standard
- single precision: 8 bit exponent, 23 bit significand
- double precision: 11 bit exponent, 52 bit significand
96 IEEE 754 floating-point standard
- Leading 1 bit of significand is implicit
- Exponent is biased to make sorting easier
- all 0s is the smallest exponent, all 1s is the largest
- bias of 127 for single precision and 1023 for double precision
- summary: (-1)^sign × (1 + significand) × 2^(exponent - bias)
- Example:
- decimal: -0.75 = -3/4 = -3/2^2
- binary: -0.11 = -1.1 × 2^-1
- floating point: exponent = 126 = 0111 1110
- IEEE single precision: 1 01111110 10000000000000000000000
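The encoding of -0.75 can be checked with Python's struct module (packing as a big-endian single-precision float and reading the raw bits back):

```python
import struct

# Pack -0.75 as an IEEE 754 single and reinterpret the bytes as an integer.
bits = struct.unpack(">I", struct.pack(">f", -0.75))[0]
print(format(bits, "032b"))  # 10111111010000000000000000000000
```

The printed pattern splits into sign 1, exponent 01111110 (126), and significand 1000...0, matching the slide.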
97 Floating-point addition
- 1. Shift the significand of the number with the lesser exponent right until the exponents match
- 2. Add the significands
- 3. Normalise the sum, checking for overflow or underflow
- 4. Round the sum
98 Floating-point addition
99 Floating-point multiplication
- 1. Add the exponents
- 2. Multiply the significands
- 3. Normalise the product, checking for overflow or underflow
- 4. Round the product
- 5. Determine the sign of the product
100 Floating Point Complexities
- Operations are somewhat more complicated (see text)
- In addition to overflow we can have underflow
- Accuracy can be a big problem
- IEEE 754 keeps two extra bits, guard and round, during intermediate calculations
- four rounding modes
- positive divided by zero yields infinity
- zero divided by zero yields "not a number"
- other complexities
- Implementing the standard can be tricky
101 Chapter Four Summary
- Computer arithmetic is constrained by limited precision
- Bit patterns have no inherent meaning, but standards do exist:
- two's complement
- IEEE 754 floating point
- Computer instructions determine the meaning of the bit patterns
- Performance and accuracy are important, so there are many complexities in real machines (i.e., algorithms and implementation)
- We are ready to move on (and implement the processor)