Week 5 Lecture slides presentation

1
Cosc 3P92

Week 5 Lecture slides

"Voters quickly forget what a man says."
Richard M. Nixon (1913-1994), former U.S. President
2
Hardware components MIC (overview)

MAR and MDR are registers which latch the
addresses and data prior to processing

3
Hardware components MIC (overview)

MAR holds word addresses; memory is byte addressed,
with 4-byte words.
MAR is shifted 2 bits left on its way to the address bus.
Causes word addresses 0, 1, 2, 3 to select byte addresses 0, 4, 8, 12.
Keeps words aligned.
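
A minimal sketch of the word-to-byte address mapping described on this slide, in Python. The function name and the 32-bit address width are assumptions made for illustration, not part of the slides.

```python
WORD_SIZE_BYTES = 4  # one word = 32 bits = 4 bytes


def word_to_byte_address(word_addr: int) -> int:
    """Shift a word address left by 2 bits to get the byte address of that word."""
    # Mask to 32 bits (assumed bus width), so the top 2 bits are discarded.
    return (word_addr << 2) & 0xFFFFFFFF


# Word addresses 0..3 map to byte addresses that are multiples of 4.
byte_addresses = [word_to_byte_address(w) for w in range(4)]
print(byte_addresses)
```

Because the low 2 bits of the result are always zero, every access through MAR lands on a word boundary.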

4
Hardware components MIC (overview)

Each micro instruction controls
register enables
bus enables
ALU
Memory
Next Micro instruction address

5
Hardware components MIC (overview)
6
Memory control

MAR - memory address register
the CPU writes the address of the memory location to read or write
MBR - memory buffer register
contains the data being written or read
both act as latches, holding the address and data until
memory has finished using them.

7
Control unit

main functions of a control unit
- instruction interpretation
- instruction sequencing
the control unit is a finite-state machine.

8
Execution unit

An execution unit consists of
a register section
an ALU
some dedicated hardware or firmware

9
Data transfer within a CPU

A single-bus architecture
To compute R2 <- R0 + R1
1. A <- R0,
2. B <- R1,
3. R2 <- A + B

10
Data transfer within a CPU

A two-bus architecture
To compute R2 <- R0 + R1
1. Buffer <- R0 + R1 (via Bus A and Bus B),
2. R2 <- Buffer (via either Bus A or Bus B).

11
Data transfer within a CPU

A three-bus architecture
To compute R2 <- R0 + R1
1. R2 <- R0 + R1 (via Bus A, Bus B and Bus C).
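
The three bus organizations can be compared with a small sketch in Python. It assumes the operation being computed is an addition of R0 and R1 into R2; the function names and the dictionary used as a register file are hypothetical. Each function performs the transfers listed on the slides and returns the number of steps used.

```python
def single_bus(regs: dict) -> int:
    a = regs["R0"]            # step 1: A <- R0
    b = regs["R1"]            # step 2: B <- R1
    regs["R2"] = a + b        # step 3: R2 <- A + B
    return 3


def two_bus(regs: dict) -> int:
    buffer = regs["R0"] + regs["R1"]  # step 1: Buffer <- R0 + R1 (Bus A and Bus B)
    regs["R2"] = buffer               # step 2: R2 <- Buffer (Bus A or Bus B)
    return 2


def three_bus(regs: dict) -> int:
    regs["R2"] = regs["R0"] + regs["R1"]  # step 1: all three buses used at once
    return 1


results = {}
for organization in (single_bus, two_bus, three_bus):
    regs = {"R0": 5, "R1": 7, "R2": 0}
    steps = organization(regs)
    results[organization.__name__] = (steps, regs["R2"])
print(results)
```

All three produce the same R2; only the number of transfer steps differs.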

12
Design of control units

Hardwired approach

The control unit is treated as a synchronous
(i.e., clocked) sequential circuit and is
implemented as a hardwired state machine.

13
Microprogramming

Use of memory to implement the control unit
Instructions are implemented as sequences of
microinstructions stored in control memory
Each machine language instruction is interpreted
by circuitry, and executed using sequences of
microprogram instructions
Micro-programs are much like assembled code,
except:
there is a direct mapping between instruction fields and
hardware components of the CPU.
control fields are specified.
timing is critical; parallelism can be exploited.

14
Microprogramming

What is being controlled?
data paths: inter-register connections
control points: hardware enabling lines which
govern register-to-register communications
the idea is that we can control the operation of the ALU
and micro-control unit using combinations of
control fields encoded in micro-instructions

15
Microprogramming

Each control point specifies a micro-operation
All micro operations which may be executed in
parallel can be specified in a single micro
instruction.
Factors which limit parallel operations:
Buses must have only 1 input active at a time.
Registers can be either read or written,
not both at the same time.

16
Microprogramming

Basic microinstruction formats (see overheads)

17
Data path

32-bit registers (none are user-accessible)
B bus: main one to ALU
C bus: from ALU back to registers
H reg: contains other operand for ALU
loaded by performing a null op on the data and sending
it to H

18
Data path

ALU control: 6 control lines
shifter: 2 control lines
1. logical shift left 8 bits (SLL8)
2. arithmetic shift right 1 bit (SRA1)

19
Data path timing

Four sub-cycles
1. control signals set up (w)
2. registers loaded onto B bus (x)
3. ALU and shifter operate (y)
4. results available to registers on C (z)

20
Data path timing

These are implicit sub-cycles: they rely on the
timing of previous steps
Only real clock signals used:
falling edge of clock (starts the cycle)
rising edge (loading from C in step 4)
The ALU is continually processing all intermediate
values it sees. Its output only makes sense at
the appropriate time above (after step 3)
Can operate on and save a register in 1 clock cycle:
load PC onto B
increment
save to PC

21
Memory again

2 memory ports
32-bit port: MAR, MDR (read, write)
word addresses
8-bit port: PC, MBR
byte fetched from the address in PC (read only)
byte addresses
MBR can be loaded signed or unsigned onto the B bus
reads into MBR are called fetches
control
black arrow: enable from C bus
white arrow: enable onto B bus
2 bus control signals give 4 cases:
out to B
in from C
out to B / in from C
none

22
Memory again

MAR aligned to words (32 bits, 4 bytes) (Fig 4.4)
Memory data is available 2 cycles from when the read was
initiated
avail. at end of 2nd cycle, so 3rd cycle can use
it

23
Microinstructions

29 signals for data path
1. 9 signals to control C bus output into
registers
2. 9 signals to enable registers onto B bus
3. 8 signals for ALU, shifter functions (6 + 2)
4. 2 signals for memory W/R via MAR/MDR
5. 1 signal for memory fetch via PC/MBR
Issues
may load more than 1 reg from C (9 bits)
but never load more than 1 reg onto B (a 4-bit
encoded field will force this) -> 4 signals.
Need 2 more fields for determining next m.i.
NextAddr (9 bits, addr space of 512)
conditional jumps (3 bits)
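
The field widths above add up to one 36-bit microinstruction word. A minimal sketch in Python of packing and unpacking such a word: the widths follow the slides (NextAddr 9, JAM 3, ALU and shifter 6 + 2 = 8, C 9, memory 2 + 1 = 3, encoded B 4), while the field order, the field names and the helper functions are assumptions for illustration.

```python
# (name, width in bits), most significant field first (assumed order).
FIELD_WIDTHS = [
    ("NextAddr", 9),
    ("JAM", 3),
    ("ALU", 8),
    ("C", 9),
    ("Mem", 3),
    ("B", 4),
]


def total_width() -> int:
    return sum(width for _, width in FIELD_WIDTHS)


def pack(fields: dict) -> int:
    """Pack named field values into a single integer word; missing fields are 0."""
    word = 0
    for name, width in FIELD_WIDTHS:
        value = fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Split a packed word back into its named fields."""
    fields = {}
    for name, width in reversed(FIELD_WIDTHS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


example = {"NextAddr": 0x092, "JAM": 0b001, "ALU": 0x3C, "C": 0x004, "Mem": 0b010, "B": 4}
packed = pack(example)
print(total_width(), hex(packed), unpack(packed))
```

Encoding the B bus source in 4 bits rather than 9 one-hot signals is what brings the word down to 36 bits, and it also makes it impossible to enable two registers onto B at once.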

24
Microinstructions

Fields
Addr: address of next micro-instruction
JAM: determines how next m.i. is selected
ALU: ALU, shifter control
C: which registers are written from C bus
Mem: memory functions
B: B bus source (encoded)

25
Example micro-architecture Mic-1
26
Example microarchitecture Mic-1

sequencer: executes microinstructions
Two tasks
set control signals for system
determine next m.i. to execute
control store: contains m.i.s for interpreting ISA
instns.
each m.i. is a 36-bit word as in Fig 4.5
each m.i. specifies its successor
MPC: MicroProgram Counter
9-bit address of next m.i. to execute
MIR: MicroInstruction Register
36-bit m.i. being executed
Note that bits in MIR may directly control other
parts of the circuit
eg. C

27
Mic-1 operation cycle

Basic ALU cycle
1. set up the inputs to the ALU
2. let the ALU do its computation
3. store the results
Clock cycles for Mic-1
1. MIR loaded (during subcycle w)
2. MIR signals control data path (B bus; note H
always enabled) (subcycle x)
3. B and H inputs are stable, and the ALU computes its
output; shifter finishes; N, Z bits stable
(subcycle y)
4. shifter, N, Z outputs loaded from C bus into
registers (subcycle z)
rising clock edge determines end
MPC is calculated at this point as
well, ready for MIR to be reloaded
Memory read is initiated at end too
Note that all the above will complete in 1 cycle
microinstructions can specify all these
operations in parallel

28
Mic-1 sequencing

First, 9-bit NextAddr field copied into MPC
JAM inspected
000: use MPC as it is
if JAMN (or JAMZ) set, then N bit (or Z bit) is ORed
into high bit of MPC
hence next address is either NextAddr, or NextAddr with
its high bit set to 1
if JMPC set: MBR byte ORed with low byte of NextAddr
field
permits multiway jumps
can quickly branch to m.i.s for just-loaded
opcodes (ie. opcode number = address in control
store!)
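
The sequencing rule on this slide can be written as a small function. This is a sketch in Python under the slide's description only; the function name and argument names are assumptions, and all flags are taken to be 0 or 1.

```python
def next_mpc(next_addr: int, jmpc: int, jamn: int, jamz: int,
             n: int, z: int, mbr: int) -> int:
    """Compute the 9-bit address of the next microinstruction.

    next_addr: 9-bit NextAddr field of the current microinstruction
    jmpc, jamn, jamz: the three JAM bits
    n, z: ALU status bits from the current cycle
    mbr: 8-bit value currently in MBR
    """
    # High bit: NextAddr's own high bit, ORed with N or Z when JAMN or JAMZ is set.
    high = (next_addr >> 8) & 1
    high |= (jamn & n) | (jamz & z)
    # Low 8 bits: NextAddr's low byte, ORed with MBR when JMPC is set.
    low = next_addr & 0xFF
    if jmpc:
        low |= mbr & 0xFF
    return (high << 8) | low


# JAM = 000: the NextAddr field is used unchanged.
print(hex(next_mpc(0x092, 0, 0, 0, n=1, z=1, mbr=0xFF)))
# JAMZ set and Z = 1: the high bit is forced to 1.
print(hex(next_mpc(0x092, 0, 0, 1, n=0, z=1, mbr=0x00)))
# JMPC set with NextAddr = 0: jump straight to the address equal to the opcode in MBR.
print(hex(next_mpc(0x000, 1, 0, 0, n=0, z=0, mbr=0x60)))
```

The two possible targets of a conditional jump differ only in the high bit, which is why the assembler must place them 256 words apart.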

29
Microinstructions and notation

As in assembler programming, it helps to use
higher-level notation instead of raw numeric m.i.
fields
can specify everything that happens in 1 clock
cycle
permits parallelism, eg. prefetch next instns
Notation: high-level, but directly translatable
to single m.i.s
Examples
SP = SP + 1 (incr SP by 1)
MDR = SP (copy SP into MDR)
MDR = SP + H; rd (add SP and H, save in MDR, and
initiate a read)
SP = MDR = SP + 1 (incr SP, load into both MDR, SP)

30
Microinstructions and notation

Memory takes 2 cycles
MAR = SP; rd (initiates the read; the value will arrive in MDR)
(another instn)
memory ready now!
next address: assumed to be the labeled next
m.i. after the current one (unless a conditional
jump)
if (Z) goto L1 else goto L2: sets JAMZ
L1 and L2 have the same low 8 bits (set by assembler)
Summary of legal operations on operands

31
Example M.I. implementation IJVM

A stack-based virtual machine which Mic-1 is
designed to implement.
All instructions access the stack; no general
registers are used by compiler
eg. parameter passing (Fig 4.8)
eg. arithmetic (Fig 4.9)
Recall
JVM instruction formats (Fig 5.15)
Java memory usage, registers (Fig 4.10)
Complete instruction set (Fig 4.11)
Example translated code (Fig 4.14)

32
(No Transcript)
33
JVM Instruction Formats
34
Memory area of IJVM
35
IJVM Instruction Set
36
Translating Java to IJVM
37
Implementation (cont)

See overheads (book pages 234-236)
Note
each m.i. contains the address of the next m.i.
micro-assembler labels all m.i.s appropriately,
and must put them at the right control store
addresses (equiv. to opcode)
the sequenced m.i.s may reside in any free area
of control store! Microassembler automatically sets next
address fields.
only explicit gotos will override this
sequencing
Two parts
1. fetch next byte for next instn (done at Main1)
2. branch to that opcode address and carry out
instruction
Fetching instructions (Main1)
PC always points to next instruction in Java
application program
can be reset by branches (see goto5, T, F,...)
When Main1 is executed, the next opcode is assumed to be ready.
the fetch at Main1 is for the following byte (the next opcode,
if there are no operands). Hence instns with operands must
fetch the next opcode themselves (eg. see bipush2)

38
Implementation (cont)

Example 1: iadd (pop 2 words from stack, push
their sum)
iadd1: reads next-to-top word in stack (TOS
register already contains top of stack word);
decrements SP ready for writing the result
iadd2: gets TOS ready for addition (put in H)
iadd3: add next-to-top value (read in iadd1) to
H, update TOS, save result in MDR for writing
Example 2: dup (copy top stack word and push
it)
dup1: incr SP, copy to MAR
dup2: save TOS (top stack word) to new SP location, write
it
note: can't write it in dup1, because both SP and
MDR must be updated thru the data path, and not both
at once
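
A rough simulation of the iadd and dup steps described above, in Python. The class and attribute names are hypothetical, memory is modelled as a dictionary of word addresses, 32-bit overflow is ignored, and the memory read started in iadd1 is simply assumed to have arrived in MDR by the time iadd3 runs.

```python
class Mic1Stack:
    """Toy model of the Mic-1 registers involved in stack operations."""

    def __init__(self, stack_words):
        # Word address -> value; the stack grows towards higher addresses.
        self.mem = dict(enumerate(stack_words))
        self.sp = len(stack_words) - 1   # SP points at the top word
        self.tos = stack_words[-1]       # TOS caches the top word
        self.mar = self.mdr = self.h = 0

    def iadd(self):
        # iadd1: MAR = SP = SP - 1; rd  (read next-to-top word)
        self.sp -= 1
        self.mar = self.sp
        pending_read = self.mem[self.mar]
        # iadd2: H = TOS  (the read completes while this step runs)
        self.h = self.tos
        self.mdr = pending_read
        # iadd3: MDR = TOS = MDR + H; wr
        self.mdr = self.tos = self.mdr + self.h
        self.mem[self.mar] = self.mdr

    def dup(self):
        # dup1: MAR = SP = SP + 1
        self.sp += 1
        self.mar = self.sp
        # dup2: MDR = TOS; wr
        self.mdr = self.tos
        self.mem[self.mar] = self.mdr


machine = Mic1Stack([5, 7, 3])
machine.iadd()   # pops 7 and 3, pushes 10
machine.dup()    # pushes a second copy of 10
print(machine.sp, machine.tos, machine.mem)
```

iadd2 does useful work (loading H) during the cycle in which the memory read is still in flight, so no cycle is wasted waiting.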

39
Implementation (cont)

Example 3: goto offset (unconditional branch)
(Fig 4.22)
goto1: save addr of opcode to OPC (old PC)
goto2: get the 2nd byte of offset (1st byte
already in MBR)
goto3: shift 1st byte left 8 bits
goto4: OR low byte into high byte
goto5: add 16-bit offset to (old) PC; get next
opcode
goto6: goto Main1
Note: pause needed in goto6 (must wait for the
opcode fetch started in goto5 to complete)
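
The arithmetic carried out by goto3 to goto5 can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes the first offset byte is treated as signed and the second as unsigned, so that backward branches work.

```python
def goto_target(opcode_addr: int, offset_byte1: int, offset_byte2: int) -> int:
    """Return the new PC for 'goto offset', given the two offset bytes."""
    # First offset byte is treated as signed (sign-extended).
    h = offset_byte1 & 0xFF
    if h & 0x80:
        h -= 0x100
    h <<= 8                      # goto3: shift 1st byte left 8 bits
    h |= offset_byte2 & 0xFF     # goto4: OR in the (unsigned) low byte
    return opcode_addr + h       # goto5: add 16-bit offset to the old PC


print(goto_target(100, 0x00, 0x05))  # forward branch
print(goto_target(100, 0xFF, 0xFE))  # backward branch
```

The offset is relative to the address of the goto opcode itself, which is why goto1 saves that address in OPC before PC moves on.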

40
(No Transcript)
41
Improving performance

1. Faster clock, transistors, electrical circuits
2. simpler organization yields shorter clock
cycles
eg. get rid of (B bus) decoder
3. Merge interpreter loop with microcode (pt 2)
(Figs 4.23, 4.24)
saves extra cycles if done in all instns
significant speedup!
4. Three buses
(Figs 4.25, 4.26)
reduces need for separate m.i.s to load H reg

42
(No Transcript)
43
2 Bus v.s. 3 Bus
44
Improving performance

5. Instruction fetch unit (Fig 4.27)
in Mic-1, ALU is used to increment PC and fetch
instns
this uses up instn. cycles
An IFU can be used instead
1. pre-fetches all instns outside of main data
path
2. pre-fetches operands: if they are required,
they are there (else garbage, but ignored anyway)

45
Fetch Unit
46
Improving performance

Instruction fetch unit (cont)
shift register always loaded with next bytes
from memory
MBR1 (1 byte, as before) and new MBR2 (2 bytes)
values from shift reg dumped into both MBR1, MBR2
after every instn read; if needed, they are
quickly put onto data path as reqd
need some fetching logic to know when to read
more bytes into shift register, when to refresh
MBR1, MBR2
IMAR: separate memory addr reg (separate from
MAR)
own dedicated incrementer (no need for ALU)
IFU must keep PC incremented properly, depending
on instn length (if MBR1, MBR2 used)
branches may reset PC as well (from C)

47
Improving performance

Mic-2
A, B buses
IFU
new IJVM microprogram (Fig 4.30; see overheads)
smaller, faster
MBR1 always has next opcode (due to IFU)

48
Mic-2
49
Improving performance: 6. Pipelining

divide instn. execution into modular steps and
carry out different steps for sequential instns
simultaneously
instruction-level parallelism
superscalar: single pipeline with parallel
functional units
most instns take more than 1 cycle to complete;
with pipelining, roughly n instns complete in n cycles
To implement it (Fig 4.31)
add latches to A, B, C buses
they keep values stable during sub-cycles; can
use values in 3 sections of the data path
(i) loading before ALU (A, B)
(ii) doing ALU, shift, and loading C latch
(iii) storing C back into registers

50
Mic-3
51
Improving performance: 6. Pipelining

need 3 cycles now to complete 1 instn
but maximum delay between components is
shorter (about 1/3), so can speed up clock
advantage: throughput; 3 instns can be
processed simult.
all parts of data path are busy... none are idle
(usually)
best analogy: car factory assembly line
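
The throughput argument can be checked with a little arithmetic. A sketch in Python, assuming an ideal 3-stage pipeline with no stalls, where each stage takes one (shorter) clock cycle and an unpipelined instruction takes all three stage times back to back; the function names and the choice of time units are assumptions.

```python
def unpipelined_time(n_instns: int, stages: int = 3, stage_time: int = 1) -> int:
    """Each instruction runs all stages before the next one starts."""
    return n_instns * stages * stage_time


def pipelined_time(n_instns: int, stages: int = 3, stage_time: int = 1) -> int:
    """The first instruction takes `stages` cycles; each later one finishes 1 cycle after."""
    if n_instns == 0:
        return 0
    return (stages + n_instns - 1) * stage_time


n = 100
print(unpipelined_time(n), pipelined_time(n))
```

Each instruction still takes 3 cycles from start to finish; the gain is that a new one completes every cycle once the pipeline is full, so the speedup approaches the number of stages for long instruction sequences.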

52
Pipelining (cont)

(Figs 4.32, 4.33, 4.44)
interpreting instns in pipelined processor
(Mic-4)
new sub-cycles: microsteps
takes 3 cycles to process instn (steps i, ii, iii
from earlier)
call latches A, B, C (like registers)
advantage (Fig 4.33) is that different stages can
work independently of one another now
more stages in pipeline can mean higher throughput,
provided stalls are rare

53
(No Transcript)
54
(No Transcript)
55
Pipelining (cont)

One complication: memory reads
takes 2 cycles to get word from memory
hence a m.i. that uses a word in MDR must wait
until it's available
called a true or RAW (read after write)
dependence
pipeline must stall until it is ready
ideally, put other m.i.s into the wait states
Another complication: conditional branches
cannot always predict which instn to fetch/put into
pipeline
have to squash or flush pipeline when a jump
ruins sequence of instns

56
Pipelines and branch prediction

unconditional branches
fetch unit needs to know in advance where to
access instns
a jump instn. isn't decoded right away, and so
F.U. won't know branch location until later
the slot after the branch is called the delay slot
soln: compiler places other executable instns in
the delay slot, that it knows can be executed
conditional branches
dynamic prediction: carried out during run time
keep a running table of branch instn addresses,
along with a branch/no branch bit
if branch in table, and branch bit set, then
predict it will be taken -> fetch its target
can use 2 prediction bits: the prediction changes only
after it has been wrong twice in a row (extra logic)
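
A minimal sketch of 2-bit dynamic prediction in Python, using a saturating counter per branch address. The class name, the helper function and the choice of initial counter value (weakly not taken) are assumptions for illustration; real predictors index a fixed-size hardware table rather than a dictionary.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: 0, 1 predict not taken; 2, 3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value 0..3

    def predict(self, addr: int) -> bool:
        return self.counters.get(addr, 1) >= 2

    def update(self, addr: int, taken: bool) -> None:
        counter = self.counters.get(addr, 1)
        self.counters[addr] = min(counter + 1, 3) if taken else max(counter - 1, 0)


def count_mispredictions(predictor, addr, outcomes) -> int:
    """Run the predictor over a sequence of actual outcomes and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(addr) != taken:
            misses += 1
        predictor.update(addr, taken)
    return misses


# A loop branch: taken 9 times, then falls through; the loop is run twice.
outcomes = ([True] * 9 + [False]) * 2
misses = count_mispredictions(TwoBitPredictor(), 0x40, outcomes)
print(misses)
```

The single not-taken outcome at loop exit only weakens the counter, so the branch is still predicted taken when the loop is entered again.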

57
Pipelines and branch prediction

static branch prediction: carried out during
compile time
if a loop branch is nearly always taken, then have a field
in the instn. which tells CPU that the branch should
be predicted taken (eg. UltraSPARC)
can do simulations to determine how cond.
branches behave

58
Improving performance: out-of-order exec, reg
renaming

instruction ops can take varying numbers of clock cycles
in superscalar systems, some functional units
need more time to process their instns
problem: can't exec one instn that requires
results of another
means the pipeline stalls until register values
are computed when subsequent instns require them.
soln: reorder instructions, so that there is no idle
waiting
overall exec must be identical to linear order
dependencies
RAW (read after write): try to read reg before
another instn has written it.
WAR (write after read): try to write before
another has read it
WAW (write after write): try to write a reg that an earlier
instn has yet to write; the writes must stay in order.
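
The three dependence types can be detected mechanically from the registers each instruction reads and writes. A sketch in Python, assuming 3-register instructions with one destination and a tuple of sources; the `Instr` type and the `hazards` function are hypothetical names.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Dependences that constrain running `second` before or alongside `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")   # second overwrites what first still has to read
    if first.dest == second.dest:
        found.add("WAW")   # both write the same register
    return found


add = Instr("R3", ("R1", "R2"))           # R3 = R1 + R2
print(hazards(add, Instr("R4", ("R3", "R5"))))   # reads R3
print(hazards(add, Instr("R1", ("R5", "R6"))))   # overwrites R1
print(hazards(add, Instr("R3", ("R4", "R5"))))   # also writes R3
```

Only RAW is a true dependence on data; WAR and WAW arise from reusing register names, which is what register renaming addresses.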

59
In-order exec, in-order completion

decode in cycle n, exec in n+1, writeback in n+2 (except
multiply in n+3)
2 instns decoded simult.
uses scoreboard: 1 counter per reg, keeping track
of instns using it as a source or destination
keeps track of max regs that can be processed
concurrently

60
Out-of-order exec, reg renaming (cont)

idea: execute instns so long as resources are
available, and no conflicts
move order of instns to permit this
registers are renamed automatically to reduce
conflicts: secret regs
eg. if a register is in conflict, rename it so
conflict is removed.
copy values to original named reg later if
required.
result: huge performance gain (we're trying to
make pipeline maximally useful!)
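
A minimal sketch of register renaming in Python. It assumes the simplest possible scheme, where every write is given a fresh "secret" register (named S0, S1, ... here, purely for illustration) and later reads are redirected to the most recent name; the function name and the tuple-based instruction format are hypothetical.

```python
def rename(program):
    """Rename destinations to fresh secret registers.

    program: list of (dest, (src, ...)) tuples in original program order.
    Returns the renamed program and the final architectural -> secret mapping.
    """
    mapping = {}   # architectural register -> current secret register
    renamed = []
    for index, (dest, srcs) in enumerate(program):
        # Sources read whichever secret register currently holds the value.
        new_srcs = tuple(mapping.get(src, src) for src in srcs)
        secret = f"S{index}"          # fresh register for every write
        mapping[dest] = secret
        renamed.append((secret, new_srcs))
    return renamed, mapping


program = [
    ("R1", ("R2", "R3")),   # R1 = R2 op R3
    ("R4", ("R1", "R5")),   # reads R1 (true RAW dependence, must be kept)
    ("R1", ("R6", "R7")),   # rewrites R1: WAW with instn 1, WAR with instn 2
]
renamed, mapping = rename(program)
print(renamed)
print(mapping)
```

After renaming, the third instruction writes its own register, so it no longer conflicts with the first two and can be executed early; the mapping records which secret register must eventually be copied back to R1.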

61
Improving performance: speculative exec

(basic) block: a section of sequential code (Fig 4.45)
Can increase throughput by moving instructions
beyond their blocks
hoisting: moving an instruction over a branch
speculative execution: executing an instruction
before it is known whether it will be needed
OK to do it so long as there is no side effect
(eg. write to memory, trap/interrupt)
may sometimes cause slowdown if spec. exec
fetches an instn from memory that isn't needed
otherwise, idea is to move slower instructions up
the queue so that their processing can occur in
the interim
some solns
speculative instns only fetch/exec instructions
that are in the cache
poison bits: don't set traps automatically; wait
until that instn is actually executed, and if a
poison bit is set, then set the trap

62
Speculative exec
63
Example 1: Pentium II

1. Fetch/decode (Fig 4.46)
fetches instns and breaks them into m.i.s
2. Dispatch/exec
takes m.i.s and execs them
3. Retirement unit
completes exec, stores reg values (speculative
exec)
1, 2, 3 above act as a high-level pipeline
ROB (reorder buffer): table of m.i.s to execute
Fetch/decode (Fig 4.47)
7-stage pipeline
multiple formats, sizes mean instn decoding is
involved
analyzes instns to determine size,
branch-prediction
usually between 1 and 4 m.i.s per ISA instn.
uses reg renaming
both static, dynamic branch prediction used
Dispatch/exec (Fig 4.48)
5 m.i.s can be exec'd at once

64
P2-micro architecture
65
(No Transcript)
66
Example 2: UltraSPARC II

(Fig 4.49)
RISC: all instns are 3-register microinstns
already
branch prediction: (i) cache flags (ii) 2-bit
prediction (iii) compiler directions in instns
tries to exec 4 instns in parallel all the time
instns may be executed out of order
9-stage pipeline (Fig 4.50)
split integer, float pipelines
int pipeline adds 2 stages (N1, N2) to keep it the same length as fp

67
UltraSPARC
68
UltraSPARC Pipeline
69
Example 3: picoJava II

(Fig 4.51)
instn, data caches are optional
register file (64 entries)
contains top 64 words of stack
dribbling: reg file read/written to memory when
it gets too empty/full
free access, w/o accessing caches (which may
not be used)

70
(No Transcript)
71

6-stage pipeline (Fig 4.52)
CISC instns
not superscalar: instns fetched, retired in order
(unlike Pentium II)
no branch prediction alg (economy)

72
Folding

Folding (Figs 4.53, 4.54, 4.55)
replace a set of m.i.s with one m.i.
looks up patterns in a table (Fig 4.55), and replaces
with equivalent m.i.
only possible if operands are high in stack, in
register file
huge gain in speed, like RISC performance

73
(No Transcript)
74
(No Transcript)
75
Comparing these examples

common features
all m.i.s contain opcode, 2 source regs, dest
reg
1 m.i. per cycle
deep pipelines
split instn and data caches
Pentium II: complexity is in deconstructing its
CISC instns into micro-operations
JVM: complexity is in folding sets of m.i.s into
single operations
UltraSPARC: most straightforward to implement,
because instns require minimal decoding (all RISC
instructions are micro-operations already!)

76
The end
