CS 2200 Lecture 05a Metrics - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: CS 2200 Lecture 05a Metrics


1
CS 2200 Lecture 05a: Metrics
  • (Lectures based on the work of Jay Brockman,
    Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
    Ken MacKenzie, Richard Murphy, and Michael
    Niemier)

2
Determining Performance
  • Response Time
  • Usually something the user cares about!
  • Also referred to as execution time
  • The time between the start and completion of an
    event
  • Throughput
  • The amount of work done in a given amount of time
  • There are tradeoffs between the two: improving one
    often adversely affects the other

3
Let's look at an example
  • If planes were computers
  • 747: highest throughput; Concorde: lowest execution
    time (fastest); 737: cheapest; DC-8: most range
  • Which one is best???

Here, throughput = capacity × speed
4
So, how do we compare?
  • Best to stick with execution time! (more later)
  • If we say X is faster than Y, we mean the
    execution time is lower on X than on Y.
  • Alternatively

X is n times faster than Y means:

  n = Execution timeY / Execution timeX = PerformanceX / PerformanceY

where Performance = 1 / Execution time.

e.g. PerformanceX = 50 MHz, PerformanceY = 200 MHz:
  50/200 = 1/4, therefore X is 4 times slower than Y
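A minimal Python sketch of this relationship (the execution-time values below are hypothetical; the 50 MHz / 200 MHz figures are the ones above):

  # Relative performance from execution times (hypothetical values).
  def times_faster(exec_time_y, exec_time_x):
      """n such that X is n times faster than Y."""
      return exec_time_y / exec_time_x

  print(times_faster(10.0, 2.5))   # 4.0: X is 4 times faster than Y
  print(50e6 / 200e6)              # 0.25: the 50 MHz machine is 4 times slower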
5
The time's the thing
  • There are many ways to consider exec. time
  • Wall clock time, response time, and elapsed time
    are all different names for the latency to complete
    a task
  • (including disk/memory accesses, I/O, OS, etc.)
  • CPU time = time the CPU is working on a specific
    task, excluding I/O, program switching, etc.
  • User CPU time = CPU time spent in the program
  • System CPU time = time spent by the OS doing
    program-specific tasks
  • Beware of other imposter metrics!
  • Execution time, or time for a task in general,
    is the only true and fair measure of
    comparison

6
Evaluating Performance
  • For a true comparison of machine X to Y, must run
    the same thing on X and Y!
  • But what do you use to compare?
  • Ideally, would have users run a typical workload
    over some period of time
  • This is not practical, however
  • Instead, different programs are used to gauge
    performance: benchmarks

7
Benchmarks
  • In the order of relevance...
  • (1) Real Programs
  • Obviously different users use different programs
  • But, a large subset of people WILL use C
    compilers, Windows products, etc.
  • (2) Kernels
  • What are they? Bits of a real program
  • No user would actually run them
  • Isolate/characterize performance of individual
    machine features

8
More Benchmarks
  • (3) Toy Benchmarks
  • Usually short programs with previously known
    results.
  • More useful for intro. to programming
    assignments
  • (4) Synthetic Benchmarks
  • Similar to kernels; try to match the avg. frequency
    of operations and operands of a large set of
    programs
  • Don't compute anything a user would even want
  • More details and examples forthcoming
  • Nothing is perfect, but industry demands a standard

9
What do people really use to benchmark?
  • Most common are the
  • Standard Performance Evaluation Corporation's
    (SPEC) suites
  • SPEC INT and SPEC FP for short
  • Uses a variety of applications
  • Variety will lessen the weakness of any one
    particular benchmark
  • Variety will help keep opportunists from
    optimizing just for good benchmark performance

10
An Example Suite SPEC95 INT
  • Consists of 8 C programs used as benchmarks
  • go: Artificial intelligence; plays the game of go
  • m88ksim: Motorola 88K chip simulator; runs a test
    program
  • gcc: New version of GCC; builds SPARC code
  • compress: Compresses and decompresses a file in
    memory
  • li: LISP interpreter
  • ijpeg: Graphic compression and decompression
  • perl: Manipulates strings and prime numbers in
    Perl
  • vortex: A database program
  • See http://www.spec.org for more

11
Benchmarkers Beware!
  • Benchmark suites are not perfect, but are used b/c
    they provide a uniform metric
  • Performance determines success and failure, and
    companies and researchers know it
  • People falsely inflate benchmark performance
  • Add optimizations to enhance performance
  • Hand-coded library calls (name change will erase
    this one)
  • Special microcode for high frequency segments
  • Register reservations for key constants
  • Compiler recognizes benchmark and runs different
    version

12
An example... for more confusion
  • Consider the execution times of two programs (P1
    and P2) on three machines (A, B, and C)
  • All of the following are true
  • A is 10 times faster than B for program P1
  • B is 10 times faster than A for program P2
  • A is 20 times faster than C for program P1
  • C is 50 times faster than A for program P2
  • B is 2 times faster than C for program P1
  • C is 5 times faster than B for program P2
  • Which is the best?

13
Interpreting the example
  • Any one of the previous statements is true!
  • But, which computer is better? By how much?
  • One way: Go back to execution time!
  • B is 9.1 times faster than A for programs P1 and
    P2
  • C is 25 times faster than A for programs P1 and
    P2
  • C is 2.75 times faster than B for programs P1 and
    P2
  • Given this, if we had to pick one configuration over
    another, what should we consider?

14
Some other options
  • Means, etc. etc.
  • We'll only talk about two here
  • The arithmetic mean
  • An average of the execution times that tracks the
    total execution time.
  • The weighted arithmetic mean
  • Can be used if programs are not run equally often
  • (e.g. P1 = 40% of the load and P2 = 60% of the
    load); see the sketch below
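A small illustration with hypothetical execution times and the 40%/60% weights from the bullet above:

  # Arithmetic vs. weighted arithmetic mean of execution times (made-up data).
  times   = [10.0, 2.0]   # execution times of P1 and P2, in seconds
  weights = [0.4, 0.6]    # P1 is 40% of the workload, P2 is 60%

  arithmetic_mean = sum(times) / len(times)                      # 6.0 s
  weighted_mean   = sum(w * t for w, t in zip(weights, times))   # 0.4*10 + 0.6*2 = 5.2 s
  print(arithmetic_mean, weighted_mean)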

15
Other ways to measure performance
  • (1) Use MIPS (millions of instructions/second)
  • MIPS is a rate of operations/unit time.
  • Performance can be specified as the inverse of
    execution time so faster machines have a higher
    MIPS rating
  • So, bigger MIPS = faster machine. Right?

MIPS = Instruction Count / (Exec. Time × 10^6)
     = Clock Rate / (CPI × 10^6)
16
Wrong!!!
  • 3 significant problems with using MIPS
  • Problem 1
  • MIPS is instruction set dependent.
  • (And different computer brands usually have
    different instruction sets)
  • Problem 2
  • MIPS varies between programs on the same computer
  • Problem 3
  • MIPS can vary inversely to performance!
  • Let's look at an example of why MIPS doesn't
    work

17
A MIPS Example (1)
  • Consider the following computer

Instruction counts (in millions) for each instruction
class: code from compiler 1 executes 5 million class-A,
1 million class-B, and 1 million class-C instructions;
code from compiler 2 executes 10 million class-A,
1 million class-B, and 1 million class-C instructions.
The machine runs at 100 MHz.
A class-A instruction requires 1 clock cycle, class B
requires 2 clock cycles, and class C requires 3 clock
cycles.
CPU Clock Cycles = Σ (i = 1 to n) CPIi × Ci

CPI = CPU Clock Cycles / Instruction Count
18
A MIPS Example (2)
CPI1 = [(5×1) + (1×2) + (1×3)] × 10^6 cycles
       / [(5 + 1 + 1) × 10^6] = 10/7 = 1.43

CPI2 = [(10×1) + (1×2) + (1×3)] × 10^6 cycles
       / [(10 + 1 + 1) × 10^6] = 15/12 = 1.25

So, compiler 2 has a higher MIPS rating and
should be faster?

MIPS2 = 100 MHz / 1.25 = 80.0
19
A MIPS Example (3)
  • Now lets compare CPU time

CPU Time = Instruction Count × CPI / Clock Rate

CPU Time1 = 7 × 10^6 × 1.43 / (100 × 10^6) = 0.10 seconds

CPU Time2 = 12 × 10^6 × 1.25 / (100 × 10^6) = 0.15 seconds
Therefore program 1 is faster despite a lower
MIPS!
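The whole example can be re-checked in a few lines of Python (instruction counts and cycle costs as given on the previous slides):

  # MIPS vs. CPU time for the two compilers (100 MHz machine; classes A/B/C
  # take 1/2/3 cycles; instruction counts in millions from the slides).
  clock_rate = 100e6
  cycles_per_class = [1, 2, 3]

  def evaluate(counts_in_millions):
      counts = [c * 1e6 for c in counts_in_millions]
      ic     = sum(counts)
      cycles = sum(k * c for k, c in zip(cycles_per_class, counts))
      cpi    = cycles / ic
      mips   = clock_rate / (cpi * 1e6)
      time   = ic * cpi / clock_rate
      return cpi, mips, time

  print(evaluate([5, 1, 1]))    # (~1.43, ~70 MIPS, 0.10 s)
  print(evaluate([10, 1, 1]))   # ( 1.25,  80 MIPS, 0.15 s): higher MIPS, yet slower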
20
Other bad benchmarks/fallacies
  • MFLOPS is a consistent and useful measure of
    performance
  • Stands for Million floating-point
    operations/second
  • Similar issues as MIPS
  • But even less fair b/c set of floating point
    operations is not consistent across machines
  • Synthetic benchmarks predict performance
  • These are not REAL programs
  • Compilers, etc. can artificially inflate
    performance
  • Don't reward optimization of behavior in real
    programs

21
More bad benchmarks/fallacies
  • Benchmarks remain valid indefinitely
  • This is not true!
  • Companies will engineer for benchmark
    performance
  • These people are, well... let's say, not very honest

22
Useful/important performance metrics
(Note important sounding title!)
  • Let's talk about usable and important metrics
  • One of the most important principles in computer
    design is
  • to make the common case fast.
  • Specifically
  • In making a design trade-off, favor the frequent
    case over the infrequent one
  • Improving the frequent event will help
    performance too
  • Often, the frequent case is simpler than the
    infrequent one anyhow

23
Amdahl's Law
  • Quantifies performance gain
  • Amdahl's Law defined
  • The performance improvement to be gained from
    using some faster mode of execution is limited by
    the amount of time the enhancement is actually
    used.
  • Amdahl's Law defines speedup

Speedup = Perf. for entire task using enhancement when
          possible / Perf. for entire task without using
          enhancement

or

Speedup = Execution time for entire task without
          enhancement / Execution time for entire task
          using enhancement when possible
24
Amdahl's Law and Speedup
  • Speedup tells us how much faster the machine will
    run with an enhancement
  • 2 things to consider
  • 1st
  • Fraction of the computation time in the original
    machine that can use the enhancement
  • i.e. if a program executes in 30 seconds and 15
    seconds of exec. uses the enhancement, fraction = 1/2
    (always ≤ 1)
  • 2nd
  • Improvement gained by enhancement (i.e. how much
    faster does the program run overall)
  • i.e. if the enhanced task takes 3.5 seconds and the
    original task took 7, we say the speedup is 2
    (always > 1)

25
Amdahls Law Equations
Fractionenhanced
Execution timenew
Execution timeold x
(1 Fractionenhanced)
Speedupenhanced
1
Execution Timeold
Speedupoverall

Execution Timenew
Fractionenhanced
(1 Fractionenhanced)
Speedupenhanced
Use previous equation, Solve for speedup
Please, please, please, dont just try to
memorize these equations and plug numbers into
them. Its always important to think about the
problem too!
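A minimal sketch of the speedup formula, handy for sanity-checking answers (the example numbers are made up):

  # Overall speedup per Amdahl's Law.
  def overall_speedup(fraction_enhanced, speedup_enhanced):
      return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

  # e.g. the enhancement covers half the original execution time and is 2x faster:
  print(overall_speedup(0.5, 2.0))   # 1.33...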
26
Deriving the previous formula

Speedupoverall = Execution Timeold / Execution Timenew
              = 1 / [(1 − Fractionenhanced) + Fractionenhanced / Speedupenhanced]

(note: should be > 1; otherwise, performance
gets worse)

Let's do an example on the board.
27
Amdahl's Law Example
  • A certain machine has a
  • Floating point multiply that runs too slow
  • It adversely affects benchmark performance.
  • One option
  • Re-design the FP multiply hardware to make it run
    15 times faster than it currently does.
  • However, the manager thinks
  • Re-designing all of the FP hardware to make each
    FP instruction run 3 times faster is the way to
    go.
  • FP multiplies account for 10% of execution time.
  • FP instructions as a whole account for 30% of
    execution time.
  • Which improvement is better?

28
Amdahl's Law Example (cont.)
  • The speedup gained by improving the multiply
    instruction is
  • 1 / (1-0.1) (0.1/15) 1.10
  • The speedup gained by improving all of the
    floating point instructions is
  • 1 / (1-0.3) (.3/3) 1.25
  • Believe it or not, the manager is right!
  • Improving all of the FP instructions despite the
    lesser improvement is the better way to go
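The same comparison, checked numerically with the numbers above:

  # Amdahl's Law check of the two design options.
  mult_only = 1 / ((1 - 0.10) + 0.10 / 15)   # speed up only FP multiply 15x: ~1.10
  all_fp    = 1 / ((1 - 0.30) + 0.30 / 3)    # speed up all FP instructions 3x: 1.25
  print(mult_only, all_fp)                   # the second option wins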

29
What does Amdahl's Law tell us?
  • Diminishing returns exist (just like in
    economics)
  • Speedup diminishes as improvements are added
  • Corollary: if only a fraction of the code is
    affected, you can't speed up the task by more than
    the reciprocal of (1 − fraction)
  • Serves as a guide as to how much an enhancement
    will improve performance AND where to spend your
    resources
  • (It really is all about money after all!)
  • Overall goal
  • Spend your resources where you get the most
    improvement!

30
Execution time, Execution time, Execution time
  • Significantly affected by the CPU rate
  • Also referred to/referenced by
  • Clock ticks, clock periods, clocks, cycles, clock
    cycles
  • Clock time generally referenced by
  • Clock period (e.g. 2 ns)
  • Clock rate (e.g. 500 MHz)
  • CPU time for program can be expressed as
  • CPU time = CPU clock cycles for program × Clock
    cycle time
  • OR
  • CPU time = CPU clock cycles for program / Clock rate

31
More CPU metrics
  • Instruction count also figures into the mix
  • Can affect throughput, execution time, etc.
  • Interested in
  • instruction path length and instruction count
    (IC)
  • Using this information and the total number of
    clock cycles for a program, we can determine the
    clock Cycles Per Instruction (CPI)
  • Note: Sometimes you see the inverse, Instructions
    Per clock Cycle (IPC); this is really the same
    metric

32
Relating the metrics
  • New metrics/formulas lead to alternative ways of
    expressing others
  • CPU Time = IC × CPI × Clock cycle time
  • OR: CPU Time = IC × CPI / Clock rate
  • DON'T memorize formulas. Think units!
  • In fact, let's expand the above equation into
    units: (instructions/program) × (cycles/instruction)
    × (seconds/cycle) = seconds/program
  • Execution time falls out (see the sketch below)
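A quick numeric sketch of the formula (hypothetical values, chosen to match the earlier MIPS example):

  # CPU time = IC x CPI x clock cycle time.
  ic         = 12e6        # instructions
  cpi        = 1.25        # average cycles per instruction
  cycle_time = 1 / 100e6   # seconds per cycle (100 MHz clock)
  print(ic * cpi * cycle_time)   # 0.15 s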

33
A CPU The Bigger Picture
  • Recall
  • We can see CPU performance dependent on
  • Clock rate, CPI, and instruction count
  • CPU time is directly proportional to all 3
  • Therefore an x improvement in any one variable
    leads to an x improvement in CPU performance
  • But, everything usually affects everything

(Diagram: CPU time depends on Instruction Count, CPI,
and Clock Cycle Time, which are in turn shaped by
compiler technology, ISAs, organization, and hardware
technology.)
34
More detailed metrics
  • Remember
  • Not all instructions execute in the same # of clock
    cycles
  • Different programs have different instruction
    mixes
  • Therefore we must weight the CPU time eqn:
    CPU clock cycles = Σi (CPIi × ICi)
  • ICi = # of times instruction i is executed
  • CPIi = avg. # of clock cycles for instruction i
  • Note: CPI should be measured and not calculated,
    as you must take cache misses, etc. into account

35
An example
  • It's included in the following slides, but we'll
    work it out together on the board first

36
An example
  • Assume that we've made the following
    measurements
  • Frequency of FP operations = 25%
  • Average CPI of FP operations = 4.0
  • Average CPI of other instructions = 1.33
  • Frequency of FP Square Root (FPSQR) instruction
    = 2%
  • CPI of FPSQR = 20
  • There are two new design alternatives to
    consider
  • It is possible to reduce the CPI of FPSQR to 2!
  • It's also possible to reduce the average CPI of
    all FP operations to 2
  • Which one is better for overall CPU performance?

37
An example continued
  • First we need to calculate a base for comparison
  • CPIoriginal = (4.0 × 0.25) + (1.33 × 0.75) = 2.0
  • Note: NO equations!!!
  • Next, compute CPI for the enhanced FPSQR option
  • CPInew FPSQR = CPIoriginal − 0.02 × (CPIold FPSQR −
    CPInew FPSQR)
  • CPInew FPSQR = 2.0 − 0.02 × (20.0 − 2) = 1.64
  • Now, we can compute a new FP CPI
  • CPInew FP = (0.75 × 1.33) + (0.25 × 2) = 1.5
  • This CPI is lower than the first alternative (of
    reducing the FPSQR CPI to 2)
  • Therefore, the speedup is 1.33 with this
    enhancement (2.00/1.5); see the sketch below
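The same arithmetic in a few lines of Python (numbers from the slides; small rounding differences aside):

  # CPI comparison for the two FP design alternatives.
  cpi_base      = 0.25 * 4.0 + 0.75 * 1.33          # ~2.0
  cpi_new_fpsqr = cpi_base - 0.02 * (20.0 - 2.0)    # ~1.64
  cpi_new_fp    = 0.25 * 2.0 + 0.75 * 1.33          # ~1.50
  print(cpi_base / cpi_new_fpsqr)    # ~1.22 speedup from the FPSQR-only fix
  print(cpi_base / cpi_new_fp)       # ~1.33 speedup from fixing all FP ops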

38
CS 2200 Lecture 05b: The LC-2200 Datapath
  • (Lectures based on the work of Jay Brockman,
    Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
    Ken MacKenzie, Richard Murphy, and Michael
    Niemier)

39
Five classic components (of an architecture)
(Remember this???)
40
Let's take a closer look...
Processor
Control
Datapath
41
Review Digital Logic
  • Combinational logic: gates, ROMs
  • (how it's implemented)
  • tri-state buffer: 0/1 or Z (unconnected)
  • Sequential logic: edge-triggered flip-flops
  • (a.k.a. memory), stores state
  • all state stored in edge-triggered flip-flops
  • single clock: exactly one clock goes to every
    flip-flop
  • Finite State Machines (FSMs)
  • Moore & Mealy forms
  • state-transition diagram
  • state-transition table

A combination of combinational logic
and sequential logic (used to build and
control real and useful things)
42
Digital Logic reading?
  • Patterson & Hennessy, Appendix B
  • nice quick read
  • old-CS3760 class notes
  • your ECE 2030 book/notes

43
Today
  • recipes for computation
  • combinational
  • sequential: single-bus datapath + control
  • single-bus datapath for the LC-2200
  • slow but straightforward
  • used in Project 1

44
Computation
  • We've designed computation elements
  • add/subtract, and/or/xor/not
  • could do multiply & divide?
  • How do you build bigger computations?

(Diagram: computation elements with 32-bit inputs and
outputs.)
45
An adder in Boolean gates
  • This is just 1 bit, but obviously we can scale it
    up
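As a sketch of what that one-bit slice computes, here is a full adder written with Python's Boolean/bitwise operators; chaining such slices (ripple carry) scales it to wider operands:

  # One-bit full adder: sum and carry-out from a, b, and carry-in.
  def full_adder(a, b, cin):
      s    = a ^ b ^ cin                       # sum bit
      cout = (a & b) | (a & cin) | (b & cin)   # carry out
      return s, cout

  print(full_adder(1, 1, 0))   # (0, 1)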

46
Example
  • y = a + bx + cx²
  • all numbers (x, y) and constants (a, b, c) are
    32-bit integers

(Diagram: x → f(x) → y.)
47
Example: combinational implementation
  • y = a + bx + cx²

(Diagram: multipliers form bx and cx²; adders combine
a + bx + cx² to produce y from inputs a, b, c, x.)
48
Combinational Example Timing
  • Suppose ADD requires 10 ns and MUL 100 ns
  • Tpd of the whole circuit?

49
Combinational Circuit
  • Delay is minimum possible
  • Tpd = 210 ns (see the arithmetic sketch below)
  • imposed by dataflow of the desired computation!
  • Circuit cost is maximum
  • two adders
  • three multipliers
  • No flexibility
  • equation is hardwired in the circuit topology
  • maybe the constants (A, B, C) could be set by
    switches
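The 210 ns figure is just the longest dependence chain; a small arithmetic sketch, assuming the bx and a + bx operations overlap with the x·x → c·x² chain:

  # Critical path of the combinational circuit: x*x -> c*(x*x) -> final add.
  t_mul, t_add = 100, 10      # ns, from the slide
  print(2 * t_mul + t_add)    # 210 ns; bx (100 ns) and a+bx (110 ns) finish earlier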

50
Sequential Circuit
  • A sequential circuit would let us re-use
    functional units and save hardware cost
  • But how to wire them up??

One of each type of functional unit, plus some storage
51
A recipe: the single-bus datapath
(y = a + bx + cx²)
  • One common bus (32 bits wide)

52
A recipe: the single-bus datapath
(y = a + bx + cx²)
One common bus (32 bits wide)
One of each type of functional unit (MUL, ADD); each
output is connected to the bus via a tri-state buffer
(DrMUL, DrADD)
53
A recipe: the single-bus datapath
(y = a + bx + cx²)
One common bus (32 bits wide)
One of each type of functional unit (MUL, ADD); outputs
connected to the bus via tri-state buffers
Inputs connected to the bus via registers (A, B, C, D,
with load signals LdA, LdB, LdC, LdD), i.e. some
sequential logic
54
A recipe: the single-bus datapath
(y = a + bx + cx²)
One common bus (32 bits wide)
One of each type of functional unit (MUL, ADD); outputs
connected to the bus via tri-state buffers
Inputs connected to the bus via registers (Y, A, B, C, D,
with load signals LdY, LdA, LdB, LdC, LdD)
Other, e.g. constants and I/O: the x input (driven onto
the bus by DrX), a small ROM holding the constants
(addr 0 = a, 1 = b, 2 = c, 3 = unused; 2-bit romaddr,
driven onto the bus by DrROM), and the output y taken
from register Y
55
A recipe (the single-bus datapath)
y = a + bx + cx²
Ex. A = 2, B = 4, C = 6, x = 2
Part 1: C ← x, D ← x
Part 2: D ← x², C ← 6
Part 3: A ← Cx²
Part 4: C ← 4, D ← x
Part 5: B ← Bx
Part 6: B ← Bx + Cx² (or RegA + RegB)
Part 7: A ← A
Part 8: Y ← RegA + RegB
56
Something more complex
  • We want to execute the instructions in the LC2200
    ISA
  • Build a generic datapath similar to what was done
    for solving the y = a + bx + cx² problem

57
Big Picture
  • Fetch the instruction from memory
  • Decode the instruction and decide what to do
  • Execute the instruction
  • Repeat.
  • What hardware do we need to
  • Add 2 registers together and store the result in
    a 3rd?
  • Let's look at the LC-2200
  • (or alternatively the MIPS)
  • (we'll do them both, but it's your choice as to
    which is first)

58
LC-2200 datapath
PC
Let's look at a generic instruction (we start
with the PC, which stores the address of the
next instruction to be executed)
59
LC-2200 datapath
PC
PC indexes memory; data-out = the instruction encoding
60
PC
LC-2200 datapath
We store the output of memory in IR (side note:
why do we need to do this? Ideally we could use
bits of the word at that address to set ALU
functions, etc.)
IR
61
LC-2200 Instruction Types
All are encoded in single, 32-bit words.
R-type (Register-Register):
bits 31..28 = OP, 27..24 = RA, 23..20 = RB,
19..4 = unused, 3..0 = RD
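A small sketch of pulling those R-type fields out of a 32-bit word (field positions from the format above; the example encoding is made up for illustration):

  # Decode R-type fields of an LC-2200 instruction word.
  def decode_rtype(word):
      op = (word >> 28) & 0xF   # IR[31..28]
      ra = (word >> 24) & 0xF   # IR[27..24]
      rb = (word >> 20) & 0xF   # IR[23..20]
      rd = word & 0xF           # IR[3..0]
      return op, ra, rb, rd

  print(decode_rtype(0x01200003))   # (0, 1, 2, 3): add, RA=1, RB=2, RD=3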
62
LC-2200 datapath
(Datapath: PC, IR, and a register file of 16 × 32-bit
registers with regno, Din, Dout, and WrREG.)
Strip off the IR bits used to index the register file
(RA and RB source registers are always encoded in the
same place).
What if the instruction is another type? (It's OK:
random un-needed registers will still be read, but not
used, b/c of control signals set by the opcode.)
63
LC-2200 datapath
(Datapath: PC, IR, register file, and a temporary
register A with load signal LdA.)
The 1st value read is stored in a temporary register
(if the temporaries are seen as unconditional inputs
to the ALU).
64
LC-2200 datapath
(Datapath: PC, IR, register file, and temporary
registers A and B with load signals LdA, LdB.)
The 2nd value read is stored in a temporary register
(if the temporaries are seen as unconditional inputs
to the ALU).
65
LC-2200 datapath
(Datapath: PC, IR, register file, temporaries A and B
feeding the ALU.)
Opcodes: add = 0000, nand = 0001, addi = 0010,
lw = 0011, sw = 0100, beq = 0101, jalr = 0110,
halt = 0111
ALU func (2 bits, could take from opcode): 00 = ADD,
01 = NAND, 10 = A − B, 11 = A + 1
66
Questions: Did you understand what we just did? Can
someone show me how an add works? (Draw what happens,
using bits in the PC and registers as a starting
point.)
67
Consider the LW instruction: lw s0, 4(s1)
lw (opcode 0011): RB ← MEM[RA + Offset]
(Datapath: PC, IR, register file, temporaries A and B,
ALU. The regno input is where s1 goes.)
Instruction format: bits 31..28 = OP, 27..24 = RA,
23..20 = RB, 19..0 = immediate (20-bit signed)
68
Consider the LW instruction: lw s0, 4(s1)
lw (opcode 0011): RB ← MEM[RA + Offset]
The offset is encoded as an immediate value; explain
sign extending (we'll see it with MIPS): we can't just
send 100 (binary) onto the 32-bit bus, we would get
garbage otherwise.
(Datapath: PC, IR, register file, temporaries A and B,
ALU.)
Instruction format: bits 31..28 = OP, 27..24 = RA,
23..20 = RB, 19..0 = immediate (20-bit signed)
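A sketch of what sign-extending the 20-bit immediate means (purely illustrative):

  # Sign-extend a 20-bit immediate to a full-width integer.
  def sign_extend_20(value):
      value &= 0xFFFFF            # keep the low 20 bits
      if value & 0x80000:         # bit 19 set -> negative
          value -= 1 << 20
      return value

  print(sign_extend_20(0x00004))   #  4
  print(sign_extend_20(0xFFFFC))   # -4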
69
Consider the LW instruction: lw s0, 4(s1)
(Datapath: now adds a MAR. PC, IR, register file,
temporaries A and B, ALU as before; used control
code 10.)
70
Consider the LW instruction: lw s0, 4(s1)
MAR = memory address register (indexes memory). We
still need the destination register.
(It's common, e.g. with MIPS, to use the ALU to
increment the PC while decoding, etc.)
71
Making it more real...
  • This is getting very complicated!!!
  • For reasons involving implementation details, the
    simplest (lowest cost, lowest performance)
    technique is to use a single bus to connect all
    the various functional units

( Big jump from this to the next example: PC + 1 or
PC + 4 )
72
Bus
(Diagram: several functional units attached to one
shared bus.)
Use a bus instead of dedicated wiring for everything.
Pro: simpler HW. Con: timing protocols/contention
73
Bus: Only one functional unit at a time can drive the
bus
(Diagram: several functional units attached to one
shared bus.)
74
Bus: Any (and all) functional units can access the bus
(Diagram: functional units latch values from the bus
while one unit drives it.)
75
Questions?
76
LC-2200 Datapath (in terms of a bus structure)
(Datapath figure: one 32-bit bus connecting
  • memory: 1024 × 32 bits, with Addr, Din, Dout, and
    WrMEM
  • register file: 16 × 32 bits, with regno, Din, Dout,
    and WrREG
  • temporaries A and B (LdA, LdB) feeding the ALU;
    ALU func (2 bits): 00 = ADD, 01 = NAND, 10 = A − B,
    11 = A + 1
  • IR31..0, with a sign-extend unit on IR19..0 (the
    20-bit immediate); here's the sign-extend example
Fields peeled off the IR for the control logic:
RA = IR27..24, RB = IR23..20, RD = IR3..0 (4-bit
register numbers), OP = IR31..28 (4-bit opcode), and
Z, a 1-bit boolean to the control logic.)
77
Recall our basic add instruction
78
LC-2200 Datapath
(Datapath figure: PC, IR, register file, temporaries A
and B, ALU, as before.)
Did we leave anything out?
79
Need to increment PC!
80
LC-2200 Datapath (PC used to index memory)
(Bus-structure datapath figure.)
1) Let the PC control the bus (DrPC)
2) Only want to load the MAR (LdMAR)
81
LC-2200 Datapath (in terms of a bus structure)
(Bus-structure datapath figure.)
1) Let memory control the bus (DrMEM)
2) IR is loaded (LdIR)
82
LC-2200 Datapath (in terms of a bus structure)
(Bus-structure datapath figure.)
1) Let the PC control the bus (DrPC)
2) Load it into a register (LdA)
83
LC-2200 Datapath (in terms of a bus structure)
(Bus-structure datapath figure.)
1) Increment the PC (ALU func = A + 1)
2) The ALU controls the bus (DrALU)
3) The PC is loaded with the next inst. to fetch (LdPC)
84
LC-2200 Datapath (in terms of a bus structure)
(Bus-structure datapath figure.)
1) Index the register file
2) The register file controls the bus (DrREG)
3) Write the 1st temporary register (A) (LdA)
85
LC-2200 Datapath (in terms of a bus structure)
(Bus-structure datapath figure.)
1) Index the register file
2) The register file controls the bus (DrREG)
3) Write the 2nd temporary register (B) (LdB)
86
LC-2200 Datapath (in terms of a bus structure)
(Bus-structure datapath figure.)
1) Get the opcode, do an ADD
2) The ALU drives the bus (DrALU)
3) Write the register file (WrREG)
4) Index the register file (with the destination
register number)
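Putting the last few slides together, here is a toy register-transfer sketch of fetching and executing one add. The signal names come from the datapath figure, but the exact step ordering is an assumption:

  # Toy simulation of fetch + execute for: add (r3 <- r1 + r2).
  MEM = {0: 0x01200003}        # instruction memory (opcode 0000, RA=1, RB=2, RD=3)
  REG = {1: 10, 2: 32, 3: 0}   # register file
  PC  = 0

  MAR = PC                     # DrPC,  LdMAR
  IR  = MEM[MAR]               # DrMEM, LdIR
  PC  = PC + 1                 # DrALU (func = A + 1), LdPC
  A   = REG[(IR >> 24) & 0xF]  # DrREG with regno = RA, LdA
  B   = REG[(IR >> 20) & 0xF]  # DrREG with regno = RB, LdB
  REG[IR & 0xF] = A + B        # DrALU (func = ADD), regno = RD, WrREG
  print(REG[3])                # 42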
87
What about all of those red blocks??? (Note: some
foreshadowing on the way.)
88
Let's revisit our 1st example: the single-bus
datapath, HW = function
Minimal HW; the cost is in time
y = a + bx + cx²
(Single-bus datapath figure: one common 32-bit bus;
MUL and ADD outputs connected to the bus via tri-state
buffers; inputs via registers Y, A, B, C, D; a ROM
holding a, b, c; the x input; other constants and I/O.)
89
We'd use an FSM for control (more later, like a
lecture or 2)
  • Datapath control inputs are FSM outputs
  • (i.e. the control signals that control the HW are
    generated by the FSM)
  • Datapath status outputs (none in this case) would
    be FSM inputs
  • FSM contains as many states as required

load/don't load registers
drive/don't drive buses
what input is needed?
90
Try Designing States! (y = a + bx + cx²)
0 DrX, LdC, LdD — read the X input into both the C and
D registers
1 DrMUL, LdC — write X·X into the C register
Might be slightly different than ordering in
early example
91
Recall: the single-bus datapath, HW = function
Minimal HW; the cost is in time
y = a + bx + cx²
(Same single-bus datapath figure as before: one common
32-bit bus, MUL and ADD via tri-state buffers,
registers Y, A, B, C, D, the ROM holding a/b/c, the x
input, other constants and I/O.)
92
FSM states
0 DrX, LdC, LdD
1 DrMUL, LdC (compute c·x·x)
2 DrROM(C), LdD
3 DrMUL, LdB
4 DrX, LdC (compute b·x)
5 DrROM(B), LdD
6 DrMUL, LdA
7 DrADD, LdB (compute a + ...)
8 DrROM(A), LdA
9 DrADD, LdY
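The following rough Python trace mirrors those ten states, one register transfer per state (same example values as before; which registers feed MUL and ADD is inferred from the state list):

  # Register-transfer trace of the 10 FSM states for y = a + b*x + c*x^2.
  rom = {0: 2, 1: 4, 2: 6}      # a = 2, b = 4, c = 6
  x = 2
  A = B = C = D = 0
  C, D = x, x                   # 0: DrX,      LdC, LdD
  C = C * D                     # 1: DrMUL,    LdC   (x*x)
  D = rom[2]                    # 2: DrROM(C), LdD   (c)
  B = C * D                     # 3: DrMUL,    LdB   (c*x*x)
  C = x                         # 4: DrX,      LdC
  D = rom[1]                    # 5: DrROM(B), LdD   (b)
  A = C * D                     # 6: DrMUL,    LdA   (b*x)
  B = A + B                     # 7: DrADD,    LdB   (b*x + c*x*x)
  A = rom[0]                    # 8: DrROM(A), LdA   (a)
  Y = A + B                     # 9: DrADD,    LdY
  print(Y)                      # 34 == 2 + 4*2 + 6*2**2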
93
Timing?
5 + 100 + 5 = 110 ns per state
(Component delays: registers = 5 ns, tri-states = 5 ns,
MUL = 100 ns, ADD = 10 ns, ROM = 20 ns.)
94
Timing Details
(Timing diagram: the bus, the ADD and MUL outputs, and
register C, with an arbitrary time to generate
control.)
95
Timing (cont'd)
0 DrX, LdC, LdD
1 DrMUL, LdC
2 DrROM(C), LdD
3 DrMUL, LdB
110 ns × 10 states = 1100 ns
4 DrX, LdC
5 DrROM(B), LdD
6 DrMUL, LdA
7 DrADD, LdB
8 DrROM(A), LdA
9 DrADD, LdY
96
Timing
  • Datapath circuit: 1100 ns
  • Hardwired circuit: 210 ns
  • But -- where is the extra time going??
  • 1. loss of parallelism in computation
  • 2. worst-case timing assumptions
  • 3. overhead of flexibility (registers/tristates)
  • 4. loss of parallelism in communication
    (single-bus bottleneck)

97
1. Parallelism in Computation
  • Tpd as drawn = 210 ns
  • Suppose it had to be sequential?

(Diagram: the same dataflow executed one operation at a
time: 3 MULs + 2 ADDs = 320 ns.)
98
2. Worst-Case Timing Assumptions: Clock cycle sized to
fit MUL → 500 ns for five ops
(Same datapath figure: registers 5 ns, tri-states 5 ns,
MUL 100 ns, ADD 10 ns, ROM 20 ns.)
But we have to do more...
99
3. Cost of Flexibility: +10 ns (10%) for
registers/tri-states → 550 ns for five ops
registers 5 ns
y
MUL 100 ns
ADD 10 ns
ROM 0 a 1 b 2 c 3 unused
2
20 ns
romaddr
x
tri-states 5 ns
100
4. Parallelism in Communication
0 DrX, LdC, LdD
1 DrMUL, LdC
2 DrROM(C), LdD
3 DrMUL, LdB
How many of these states actually compute
something?
4 DrX, LdC
5 DrROM(B), LdD
6 DrMUL, LdA
What are the rest of the states doing?
7 DrADD, LdB
8 DrROM(A), LdA
9 DrADD, LdY
101
(4. Parallelism in Communication)
(Diagram: the dataflow graph again; even a fully
sequential schedule of the five operations takes only
320 ns.)
102
Single bus recipe summary
  • One common bus (32 bits wide)
  • One of each type of functional unit
  • Inputs from bus via registers
  • Outputs connected to the bus via tri-state
    buffers
  • Any other pseudo-functional units
  • I/O
  • constants
  • temporary storage

103
General-Purpose Computation
  • Story so far
  • 1. combinational
  • 2. sequential, using single-bus recipe
  • However, single-bus recipe still requires new
    functional units and a new FSM for every problem.
    How can we make it universal?

104
Universal Machine
  • 1. enough functional units
  • (pretty easy...)
  • 2. enough memory for constants, temporaries, etc
  • 3. (the crux) FSM is an interpreter