Title: Week 5 Lecture slides
1Cosc 3P92
Voters quickly forget what a man says. Richard
M. Nixon (1913-1994) Former U.S. President
2Hardware components MIC (overview)
- MAR and MDR are registers which latch the
addresses and data prior to processing
3Hardware components MIC (overview)
- Translate byte address 0, 1, 2, 3 to 4 byte
words.
- Shift 2 bits left.
- Causes word 0, 1, 2, 3 to be addressed.
- Alignment of words.
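Sketch (not from the slides): the shift-by-2 mapping between word and byte addresses, assuming 4-byte words and that the word address is what gets shifted left. The function names are hypothetical.

```python
WORD_SIZE_BYTES = 4  # assumption: 32-bit words, as on the later data path slides


def word_to_byte_address(word_addr: int) -> int:
    """Shift a word address left 2 bits to get the byte address of that word."""
    return word_addr << 2


def byte_to_word_address(byte_addr: int) -> int:
    """Drop the low 2 bits: bytes 0..3 fall in word 0, bytes 4..7 in word 1."""
    return byte_addr >> 2


if __name__ == "__main__":
    for w in range(4):
        print("word", w, "starts at byte", word_to_byte_address(w))
```

Because the low 2 bits are always zero after the shift, every access is word-aligned.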
4Hardware components MIC (overview)
- Each micro instruction controls
- register enables
- bus enables
- ALU
- Memory
- Next Micro instruction address
5Hardware components MIC (overview)
6Memory control
- MAR - memory address register
- CPU writes addresses of memory to read, write
- MBR - memory buffer register
- contains data for write or read
- both act as latches to hold addr, data until
memory finished using them.
7Control unit
- main functions of a control unit
- instruction interpretation
- instruction sequencing
- the control unit is a finite-state machine.
8Execution unit
- An execution unit consists of
- a register section
- an ALU
- some dedicated hardware or firmware
9Data transfer within a CPU
- A single-bus architecture
- To compute R2 <- R0 + R1
- 1. A <- R0,
- 2. B <- R1,
- 3. R2 <- A + B
10Data transfer within a CPU
- A two-bus architecture
- To compute R2 <- R0 + R1
- 1. Buffer <- R0 + R1 (via Bus A and Bus B),
- 2. R2 <- Buffer (via either Bus A or Bus B).
11Data transfer within a CPU
- A three-bus architecture
- To compute R2 <- R0 + R1
- 1. R2 <- R0 + R1 (via Bus A, Bus B and Bus C).
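Sketch (not from the slides): the three bus organizations written as lists of register transfers, one transfer per clock step, assuming the operation is an addition. The names `run`, `SINGLE_BUS`, `TWO_BUS` and `THREE_BUS` are hypothetical.

```python
def run(steps, regs):
    """Execute one register transfer per clock step; return the step count."""
    for dest, compute in steps:
        regs[dest] = compute(regs)
    return len(steps)


# Each entry is (destination, function of the current register contents).
SINGLE_BUS = [
    ("A", lambda r: r["R0"]),           # 1. A <- R0
    ("B", lambda r: r["R1"]),           # 2. B <- R1
    ("R2", lambda r: r["A"] + r["B"]),  # 3. R2 <- A + B
]
TWO_BUS = [
    ("Buffer", lambda r: r["R0"] + r["R1"]),  # 1. Buffer <- R0 + R1
    ("R2", lambda r: r["Buffer"]),            # 2. R2 <- Buffer
]
THREE_BUS = [
    ("R2", lambda r: r["R0"] + r["R1"]),      # 1. R2 <- R0 + R1
]

if __name__ == "__main__":
    for name, steps in [("single", SINGLE_BUS), ("two", TWO_BUS), ("three", THREE_BUS)]:
        regs = {"R0": 2, "R1": 3}
        print(name, "bus:", run(steps, regs), "steps, R2 =", regs["R2"])
```

The result is the same in each case; only the number of steps differs.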
12Design of control units
- The control unit is treated as a synchronous
(i.e., clocked) sequential circuit and is
implemented as a hardwired state machine.
13Microprogramming
- Use of memory to implement the control unit
- Instructions are implemented as sequences of
micro-instructions stored in control memory
- Each machine language instruction is interpreted
by circuitry, and executed using sequences of
microprogram instructions
- Micro-programs are much like assembled code,
except
- direct mapping between instruction fields and
hardware components of the CPU.
- control fields are specified.
- timing is critical; parallelism can be exploited.
14Microprogramming
- What is being controlled?
- data paths: inter-register connections
- control points: hardware enabling lines which
govern register-to-register communications
- idea is that we can control the operation of ALU
and micro-control unit using combinations of
control fields encoded in micro-instructions
15Microprogramming
- Each control point specifies a micro-operation
- All micro operations which may be executed in
parallel can be specified in a single micro
instruction.
- Factors which determine parallel operations.
- Buses must only have 1 input active at a time.
- Registers can be either read/written
- Not both at the same time.
16Microprogramming
- Basic microinstruction formats (see overheads)
17Data path
- 32-bit registers (none are user-accessible)
- B bus: main one to ALU
- C bus: from ALU back to registers
- H reg: contains other operand for ALU
- loaded by performing null op on data, and sending
it to H
18Data path
- ALU control: 6 control lines
- shifter: 2 control lines
- 1. logical shift left 8 bits
- 2. arithmetic shift right 1 bit
19Data path timing
- Four sub-cycles
- 1. control signals set up (w)
- 2. registers loaded on B bus (x)
- 3. ALU and shifter (y)
- 4. results available to registers on C (z)
20Data path timing
- These are implicit sub-cycles: they rely on
timing of previous steps
- Only real clock signals used
- falling edge of clock (starts the cycle)
- rising edge (loading from C in step 4)
- ALU is continually processing all intermediate
values it sees. Its output only makes sense at
the appropriate time above (after 3)
- Can operate and save a register in 1 clock cycle
- load PC to B
- inc
- save to PC
21Memory again
- 2 memory buffers
- 32-bit port: MAR, MDR (read, write)
- word addresses
- 8-bit MBR
- low byte from PC (read only)
- byte addresses
- can be loaded signed, unsigned onto B bus
- reads into MBR are called "fetches"
- control
- black arrow: enable from C bus
- white arrow: enable onto B bus
- 2 bus control
- out B
- in C
- out B / in C
- none
22Memory again
- MAR aligned to words (32 bits, 4 bytes) (Fig 4.4)
- Memory is available 2 cycles from when read was
initiated
- avail. at end of 2nd cycle, so 3rd cycle can use
them
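Sketch (not from the slides): a toy model of the 2-cycle read latency. A read started in cycle 1 delivers its word into MDR at the end of cycle 2, so cycle 3 is the first that can use it. The class `WordMemory` and its methods are hypothetical.

```python
class WordMemory:
    """Toy memory whose reads complete at the end of the cycle after they start."""

    def __init__(self, words):
        self.words = list(words)
        self.mdr = 0          # stays stale until the read completes
        self._pending = None  # [end-of-cycle ticks still to wait, word address]

    def start_read(self, mar: int) -> None:
        """Initiate a read of word address `mar` during the current cycle."""
        self._pending = [1, mar]

    def end_of_cycle(self) -> None:
        """Call once at the end of every clock cycle."""
        if self._pending is None:
            return
        if self._pending[0] == 0:
            self.mdr = self.words[self._pending[1]]
            self._pending = None
        else:
            self._pending[0] -= 1


if __name__ == "__main__":
    mem = WordMemory([10, 20, 30])
    mem.start_read(2)   # cycle 1: read initiated
    mem.end_of_cycle()  # end of cycle 1: MDR still stale
    mem.end_of_cycle()  # end of cycle 2: MDR now holds the word
    print("MDR seen in cycle 3:", mem.mdr)
```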
23Microinstructions
- 29 signals for data path
- 1. 9 signals to control C bus output into
registers
- 2. 9 signals to enable registers onto B bus
- 3. 8 signals for ALU, shifter functions (6 ALU, 2 shifter)
- 4. 2 signals for memory W/R via MAR/MDR
- 5. 1 signal for memory fetch via PC/MBR
- Issues
- may load more than 1 reg from C (9 bits)
- but never load more than 1 reg onto B (4 bits,
encoded will force this) --> 4 signals.
- Need 2 more fields for determining next m.i.
- NextAddr (9 bits, addr space of 512)
- conditional jumps (3 bits)
24Microinstructions
- Fields
- Addr: address of next micro-instruction
- JAM: determines how next m.i. selected
- ALU: ALU, shifter control
- C: which registers written from C bus
- Mem: memory functions
- B: B source (encoded)
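Sketch (not from the slides): packing the six fields into one 36-bit word. The widths assumed are Addr 9, JAM 3, ALU 8 (6 ALU + 2 shifter), C 9, Mem 3, B 4, which add up to 36. The bit order, with Addr as the most significant field, is an assumption taken from the order of the list above. `pack` and `unpack` are hypothetical helpers.

```python
# (field name, width in bits), most significant field first (assumed order).
FIELDS = [("addr", 9), ("jam", 3), ("alu", 8), ("c", 9), ("mem", 3), ("b", 4)]


def pack(**values: int) -> int:
    """Build a microinstruction word; fields that are not given default to 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Split a microinstruction word back into its named fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


if __name__ == "__main__":
    mi = pack(addr=0x102, jam=0b001, alu=0b00110101, c=0b000000100, mem=0b010, b=1)
    print(f"{mi:036b}")
    print(unpack(mi))
```

Only B is encoded (4 bits select at most one register). C keeps one bit per register so several registers can be loaded at once.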
25Example micro-architecture Mic-1
26Example microarchitecture Mic-1
- sequencer executes microinstructions
- Two tasks
- set control signals for system
- determine next m.i. to execute
- control store: contains m.i. for interpreting ISA
instns.
- each instn a 36-bit word like Fig 4.5
- each m.i specifies its successor
- MPC: MicroProgram Counter
- 9-bit address of next m.i. to execute
- MIR: MicroInstruction Register
- 36-bit m.i. being executed
- Note that bits in MIR may directly control other
parts of the circuit
- eg. C
27Mic-1 operation cycle
- Basic ALU cycle
- 1. set up the inputs to the ALU
- 2. let the ALU do its computation
- 3. store the results
- Clock cycles for Mic-1
- 1. MIR enabled (during subcycle w)
- 2. MIR signals control data path (B bus; note H
always enabled) (subcycle x)
- 3. B and H inputs are stable, and ALU computes
output; shifter finishes; N, Z bits stable
(subcycle y)
- 4. shifter, N, Z outputs loaded from C bus into
registers
- rising clock edge determines end
- MIR is reloaded and calculated at this point as
well
- Memory read is initiated at end too
- Note that all the above will complete in 1 cycle
- microinstructions can specify all these
operations in parallel
28Mic-1 sequencing
- First, 9-bit next addr field copied into MPC
- JAM inspected
- 000: use MPC as it is
- if JAMN (or JAMZ) set, then N bit (or Z) is ORed
with high-bit of MPC
- hence next address is either MPC, or MPC with
high-bit ORed with 1
- JMPC set: MBR byte ORed with low byte of NextAddr
field
- permits multiway jumps
- can quickly branch to instn for just-loaded
opcodes (ie. opcode number = address in control
store!)
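Sketch (not from the slides): the sequencing rules above as one function. The name `next_mpc` and its argument order are hypothetical; the JAM field is passed as three separate flags.

```python
def next_mpc(next_addr: int, jmpc: bool, jamn: bool, jamz: bool,
             n: bool, z: bool, mbr: int) -> int:
    """Compute the 9-bit address of the next microinstruction."""
    mpc = next_addr & 0x1FF             # 9-bit NextAddr copied into MPC
    if (jamn and n) or (jamz and z):
        mpc |= 0x100                    # OR the condition into the high bit
    if jmpc:
        mpc |= mbr & 0xFF               # OR the MBR byte into the low 8 bits
    return mpc


if __name__ == "__main__":
    # Conditional jump on Z: the two targets differ only in the high bit.
    print(hex(next_mpc(0x092, False, False, True, n=False, z=True, mbr=0)))
    print(hex(next_mpc(0x092, False, False, True, n=False, z=False, mbr=0)))
    # Multiway jump: with NextAddr = 0 the opcode in MBR is the target address.
    print(hex(next_mpc(0x000, True, False, False, n=False, z=False, mbr=0x60)))
```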
29Microinstructions and notation
- As in assembler programming, helps to use
higher-level notation instead of raw numeric m.i.
fields
- can specify everything that happens in 1 clock
cycle
- permits parallelism eg. prefetch next instns
- Notation: high-level, but directly translatable
to single m.i.s
- Examples
- SP = SP + 1 (incr SP by 1)
- MDR = SP (copy SP into MDR)
- MDR = SP + H; rd (add SP and H, save in MDR, and
initiate a read)
- SP = MDR = SP + 1 (incr SP, load into both MDR, SP)
30Microinstructions and notation
- Memory takes 2 cycles
- MAR = SP; rd (assign value into MDR)
- (another instn)
- memory ready now!
- next addresses: assume it is the labeled next
m.i. after current one (unless a conditional
jump)
- if (Z) goto L1; else goto L2 (sets JAMZ)
- L1 and L2 have the same low 8 bits (set by assembler)
- Summary of legal operations on operands
31Example M.I. implementation IJVM
- A stack-based virtual machine that Mic-1 is
designed to implement.
- All instructions access the stack; no general
registers are used by compiler
- eg. parameter passing (Fig 4.8)
- eg. arithmetic (Fig 4.9)
- Recall
- JVM instruction formats (Fig 5.15)
- Java memory usage, registers (Fig 4.10)
- Complete instruction set (Fig 4.11)
- Example translated code (Fig 4.14)
33JVM Instruction Formats
34Memory area of IJVM
35IJVM Instruction Set
36Translating Java to IJVM
37Implementation (cont)
- See overheads (book page 234-236)
- Note
- each m.i. contains address of next instn
- micro-assembler labels all instns appropriately,
and must put them in right control store
addresses (equiv. to opcode)
- the sequenced instns may reside in any free area
of control store! Microassembler auto sets next
address fields.
- only explicit gotos will override this
sequencing
- Two parts
- 1. fetch next byte for next instn (done at Main1)
- 2. branch to that opcode address and carry out
instruction
- Fetching instructions (Main1)
- PC always points to next instruction in Java
application program
- can be reset by branches (see goto5, T, F,...)
- When Main1 executed, assumed next opcode ready.
The fetch at Main1 is for next opcode. Hence
instns must fetch it if necessary (eg. see bipush2)
38Implementation (cont)
- Example 1: iadd (pop 2 words from stack, push
their sum)
- iadd1: reads next-to-top word in stack (TOS
register already contains top of stack word);
bumps down the SP for writing result
- iadd2: sets TOS ready for addition (put in H)
- iadd3: add next-to-top value (read in iadd1) to
H, update TOS, save result in MDR for writing
- Example 2: dup (copy top stack word and push
it)
- dup1: incr SP pointer, copy to MAR
- dup2: save TOS (top stack word) to new SP, write
it
- note: can't write it in dup1, because both SP and
MDR must be updated thru data path, and not both
at once
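Sketch (not from the slides): the iadd and dup micro-steps acting on a toy register set. Assumptions: the stack grows toward higher word addresses, memory reads and writes complete immediately (the real 2-cycle latency is ignored), and the register transfers in the comments are one plausible reading of the step descriptions above. The class `Mic1Stack` is hypothetical.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


class Mic1Stack:
    """Toy model of SP, TOS, MAR, MDR and H with a small word-addressed memory."""

    def __init__(self, stack_words):
        self.mem = list(stack_words) + [0] * 8  # a little headroom for pushes
        self.sp = len(stack_words) - 1          # SP points at the top word
        self.tos = self.mem[self.sp]            # TOS caches the top word
        self.mar = self.mdr = self.h = 0

    def iadd(self):
        # iadd1: MAR = SP = SP - 1; rd   (read next-to-top word)
        self.sp -= 1
        self.mar = self.sp
        self.mdr = self.mem[self.mar]
        # iadd2: H = TOS
        self.h = self.tos
        # iadd3: MDR = TOS = MDR + H; wr
        self.mdr = self.tos = (self.mdr + self.h) & MASK32
        self.mem[self.mar] = self.mdr

    def dup(self):
        # dup1: MAR = SP = SP + 1
        self.sp += 1
        self.mar = self.sp
        # dup2: MDR = TOS; wr
        self.mdr = self.tos
        self.mem[self.mar] = self.mdr


if __name__ == "__main__":
    m = Mic1Stack([3, 4])
    m.iadd()
    m.dup()
    print("stack:", m.mem[: m.sp + 1], "TOS:", m.tos)
```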
39Implementation (cont)
- Example 3: goto offset (unconditional branch)
- Fig 4.22
- goto1: save addr of opcode to OPC (old PC)
- goto2: get the 2nd byte of offset (1st byte
already in MBR)
- goto3: shift 1st byte left 8 bits
- goto4: OR low byte into high byte
- goto5: add 16-bit offset to (old) PC; get next
opcode
- goto6: goto Main1
- Note: pause needed in goto6 (must wait 2 extra
cycles)
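Sketch (not from the slides): the arithmetic done by goto3 to goto5. Assumptions: the first offset byte is treated as signed and the second as unsigned, so the 16-bit offset can be negative (a backward branch). The helper names are hypothetical.

```python
def sign_extend_byte(b: int) -> int:
    """Interpret the low 8 bits of b as a signed two's-complement value."""
    b &= 0xFF
    return b - 0x100 if b & 0x80 else b


def goto_target(opcode_addr: int, offset_hi: int, offset_lo: int) -> int:
    """Return the new PC for a goto whose opcode sits at opcode_addr."""
    h = sign_extend_byte(offset_hi) << 8  # goto3: shift 1st byte left 8 bits
    h |= offset_lo & 0xFF                 # goto4: OR in the low byte
    return opcode_addr + h                # goto5: add offset to old PC


if __name__ == "__main__":
    print(goto_target(100, 0x00, 0x05))  # forward branch
    print(goto_target(100, 0xFF, 0xFE))  # backward branch
```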
41Improving performance
- 1. Faster clock, transistors, electrical circuits
- 2. simpler organization yields shorter clock
cycles
- eg. get rid of (B bus) decoder
- 3. Merge interpreter loop with microcode (pt 2)
- 4.23, 4.24
- saves extra cycles if done in all instns
- significant speedup!
- 4. Three-busses
- 4.25, 4.26
- reduces need for separate instns to load H reg
43 2 Bus vs. 3 Bus
44Improving performance
- 5. Instruction fetch unit (Fig 4.27)
- in Mic-1, ALU is used to increment PC and fetch
instns
- this uses up instn. cycles
- IFU can be used
- 1. pre-fetches all instns outside of main data
path
- 2. pre-fetches operands: if they are required,
they are there (else garbage, but ignored anyway)
45Fetch Unit
46Improving performance
- Instruction fetch unit (cont)
- shift register always loaded with next bytes
from memory
- MBR1 (1 byte, as before) and new MBR2 (2 bytes)
- values from shift reg dumped into both MBR1, MBR2
after every instn read; if needed, they are
quickly put onto data path as reqd
- need some fetching logic to know when to read
more bytes into shift register, when to refresh
MBR1, MBR2
- IMAR: separate memory addr reg (separate from
MAR)
- own dedicated incrementer (no need for ALU)
- IFU must keep PC incremented properly, depending
on instn length (if MBR1, MBR2 used)
- branches may reset PC as well (from C)
47Improving performance
- Mic-2
- A, B buses
- IFU
- new IJVM 4.30, See overheads
- smaller, faster
- MBR1 always has next opcode (due to IFU)
48Mic-2
49Improving performance 6. Pipelining
- divide instn. execution into modular steps and
carry out different steps for seql. instns
simultaneously
- instruction-level parallelism
- superscalar: single pipeline with parallel
functional units
- most instns take more than 1 cycle to complete
- with pipelining: about n instns in n cycles (once
the pipeline is full)
- To implement it (Fig 4.31)
- add latch to A, B, C buses
- they keep values stable during sub-cycles; can
use values in 3 sections of the data path
- (i) loading before ALU (A, B)
- (ii) doing ALU, shift, and loading C latch
- (iii) storing C back into registers
50Mic-3
51Improving performance 6. Pipelining
- need 3 cycles now to complete 1 instn
- but maximum delay between all components is
shorter (1/3) so can speed up clock
- advantage: throughput -- 3 instns can be
processed simult.
- all parts of data path are busy... none are idle
(usually)
- best analogy: car factory assembly line
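Sketch (not from the slides): an idealized timing comparison. Assumptions: the data path splits into 3 equal stages, so the pipelined clock is exactly 3 times faster, and there are no stalls or flushes. Real pipelines fall short of this. The function names are hypothetical.

```python
def unpipelined_time(n_instructions: int, cycle_time: float) -> float:
    """One long cycle per instruction, one instruction at a time."""
    return n_instructions * cycle_time


def pipelined_time(n_instructions: int, cycle_time: float, stages: int = 3) -> float:
    """The first instruction needs `stages` short cycles, each later one adds 1."""
    if n_instructions == 0:
        return 0.0
    short_cycle = cycle_time / stages
    return (stages + n_instructions - 1) * short_cycle


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, unpipelined_time(n, 3.0), pipelined_time(n, 3.0))
```

A single instruction gains nothing (3 short cycles equal 1 long one); the gain is in throughput, which approaches 3x as n grows.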
52Pipelining (cont)
- 4.32, 4.33, 4.44
- interpreting instns in pipelined processor
(Mic-4)
- new sub-cycles: microsteps
- takes 3 cycles to process instn (steps i, ii, iii
from earlier)
- call latches A, B, C (like registers)
- advantage (Fig 4.33) is that different stages can
work independently of one another now
- more stages in pipeline means higher efficiency
55Pipelining (cont)
- One complication: memory reads
- takes 2 cycles to get word from memory
- hence a m.i. that uses a word in MDR must wait
until it's available
- called a true or RAW (read after write)
dependence
- pipeline must stall until it is ready
- ideally, put other m.i. instns in wait states
- Another complication: conditional branches
- cannot predict which instn to fetch/put into
pipeline
- have to squash or flush pipeline when a jump
ruins sequence of instns
56Pipelines and branch prediction
- unconditional branches
- fetch unit needs to know in advance where to
access instns
- a jump instn. isn't decoded right away, and so
F.U. won't know branch location until later;
called the delay slot
- soln: compiler places other executable instns in
delay, that it knows can be executed
- conditional branches
- dynamic prediction: carried out during run time
- keep a running table of branched instn addresses,
along with a branch/no branch bit
- if branch in table, and branch bit set, then
predict it will be taken --> fetch it
- can use 2 prediction bits: prediction only changes
after it has been wrong twice in a row (extra logic)
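Sketch (not from the slides): a dynamic predictor with 2 bits per branch. This is the common 2-bit saturating counter, which may differ in detail from the state machine on the overheads. The table is a plain dictionary keyed by branch address, and the class name and the initial counter value are assumptions.

```python
class TwoBitPredictor:
    """Counter values 0 and 1 predict not taken; 2 and 3 predict taken."""

    INITIAL = 1  # assumption: unseen branches start as weakly not taken

    def __init__(self):
        self.table = {}  # branch instruction address -> counter in 0..3

    def predict(self, addr: int) -> bool:
        return self.table.get(addr, self.INITIAL) >= 2

    def update(self, addr: int, taken: bool) -> None:
        """Move the counter one step toward the actual outcome, saturating."""
        counter = self.table.get(addr, self.INITIAL)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.table[addr] = counter


if __name__ == "__main__":
    p = TwoBitPredictor()
    outcomes = [True] * 5 + [False] + [True] * 3  # a loop branch with one exit
    for taken in outcomes:
        print("predicted", p.predict(0x40), "actual", taken)
        p.update(0x40, taken)
```

A single loop exit does not flip the prediction, so the next run of the loop is still predicted correctly.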
57Pipelines and branch prediction
- static branch prediction: carried out during
compile time
- if a loop nearly always done, then have a field
in the instn. which tells CPU that branch should
be fetched (eg. UltraSPARC)
- can do simulations to determine how cond.
branches executed
58Improving performance out-of-order exec, reg
renaming
- instruction ops can take varying clock cycles
- superscalar systems mean those functional units
need more time to process their instns
- problem: can't exec one instn that requires
results of another
- means the pipeline stalls until register values
are computed when subsequent instns require them.
- soln: move instruction order, so that no idle
waiting
- overall exec must be identical to linear order
- dependencies
- RAW (read after write): try to read reg before
another instn has written it.
- WAR (write after read): try to write before
another has read it
- WAW (write after write): both write simult.
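Sketch (not from the slides): classifying the dependences between two instructions from their register read and write sets. The dictionary layout and the function name `hazards` are hypothetical.

```python
def hazards(first: dict, second: dict) -> set:
    """Return the dependences of `second` on `first` (first is earlier in program order).

    Each instruction is a dict with 'reads' and 'writes' sets of register names.
    """
    found = set()
    if first["writes"] & second["reads"]:
        found.add("RAW")  # second reads what first writes
    if first["reads"] & second["writes"]:
        found.add("WAR")  # second overwrites what first still has to read
    if first["writes"] & second["writes"]:
        found.add("WAW")  # both write the same register
    return found


if __name__ == "__main__":
    i1 = {"reads": {"R0", "R1"}, "writes": {"R3"}}  # R3 = R0 * R1
    i2 = {"reads": {"R3", "R2"}, "writes": {"R4"}}  # R4 = R3 + R2
    i3 = {"reads": {"R5"}, "writes": {"R0"}}        # R0 = R5 + 1
    print(hazards(i1, i2), hazards(i1, i3))
```

Only RAW is a true dependence; WAR and WAW come from reusing register names.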
59In-order exec, in-order completion
- decode in cyc n, exec n+1, writeback n+2 (except
multiply in n+3)
- 2 instns decoded simult.
- uses scoreboard: 1 counter per reg keeping track
of instns using it as a source or destination
- keeps track of max regs that can be processed
concurrently
60Out-of-order exec, reg renaming (cont)
- idea: execute instns so long as resources are
available, and no conflicts
- move order of instns to permit this
- registers are renamed automatically to reduce
conflicts: secret regs
- eg. if a register is in conflict, rename it so
conflict is removed.
- copy values to original named reg later if
required.
- result: huge performance gain (we're trying to
make pipeline maximally useful!)
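Sketch (not from the slides): renaming every destination to a fresh "secret" register. Assumptions: the supply of secret registers S0, S1, ... is unlimited, and each instruction is written as (destination, sources). Real hardware has a finite pool and frees registers at retirement. The function `rename` is hypothetical.

```python
def rename(instructions):
    """Give each result a fresh secret register and redirect later reads to it.

    Returns the renamed instructions and the final map from each original
    register to the secret register holding its latest value.
    """
    latest = {}   # original register name -> secret register with its newest value
    renamed = []
    for index, (dest, sources) in enumerate(instructions):
        # Sources read the newest renamed copy, or the original if never written.
        new_sources = tuple(latest.get(s, s) for s in sources)
        secret = f"S{index}"
        latest[dest] = secret
        renamed.append((secret, new_sources))
    return renamed, latest


if __name__ == "__main__":
    program = [
        ("R1", ("R2", "R3")),  # R1 = R2 op R3
        ("R2", ("R4", "R5")),  # WAR with the first instn on R2
        ("R1", ("R1", "R2")),  # WAW with the first instn on R1
    ]
    renamed, latest = rename(program)
    for instr in renamed:
        print(instr)
    print("copy back later if required:", latest)
```

After renaming, only the true (RAW) dependences remain, so the first two instructions can run in either order.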
61Improving performance speculative exec
- block: a section of sequential code (Fig 4.45)
- Can increase throughput by moving instructions
beyond their blocks
- hoisting: moving an instruction over a branch
- speculative execution: executing an instruction
before it is known whether it will be needed
- OK to do it so long as there is no side effect
(eg. write to memory, trap/interrupt)
- may sometimes cause slowdown if spec. exec
fetches an instn from memory that isn't needed
- otherwise, idea is to move slower instructions up
the queue so that their processing can occur in
the interim
- some solns
- speculative instns only fetch/exec instructions
that are in the cache
- poison bits: don't set traps automatically; wait
until that instn actually executed, and if a
poison bit is set, then set the trap
62Speculative exec
63Example 1 Pentium II
- 1. Fetch/decode (Fig 4.46)
- fetches instns and breaks them into m.i.s
- 2. dispatch/exec
- takes m.i.s and execs them
- 3. retirement unit
- completes exec, stores reg values (speculative
exec)
- 1, 2, 3 above act as high-level pipeline
- ROB (reorder buffer): table of m.i.s to execute
- Fetch/decode (Fig 4.47)
- 7-stage pipeline
- multiple formats, sizes means instn decoding is
involved
- analyzes instns to determine size,
branch-prediction
- usually between 1 and 4 m.i.s per ISA instn.
- uses reg renaming
- both static, dynamic branch prediction used
- Dispatch/exec (Fig 4.48)
- 5 m.i.s can be exec'd at once
64P2-micro architecture
66Example 2 UltraSPARC II
- 4.49
- RISC: all instns are 3-register microinstns
already
- branch prediction: (i) cache flags (ii) 2-bit
prediction (iii) compiler directions in instns
- tries to exec 4 instns in parallel all the time
- instns may be executed out of order
- 9-stage pipeline (Fig 4.50)
- split integer, float pipelines
- int adds 2 stages (N1, N2) to keep it same as fp
67UltraSPARC
68UltraSPARC Pipeline
69Example 3 picoJava II
- 4.51
- instn, data caches are optional
- register file (64 entries)
- contains top 64 words of stack
- dribbling: reg file read/written to memory when
it gets too empty/full
- free access, w/o accessing caches (which may
not be used)
71- 6-stage pipeline 4.52
- CISC instns
- not superscalar: instns fetched, retired in order
(unlike Pentium II)
- no branch prediction alg (economy)
72Folding
- Folding 4.53, 4.54, 4.55
- replace a set of m.i.s with one m.i.
- looks up patterns in a table (Fig 4.55), and replaces
with equivalent m.i.
- only possible if operands are high in stack, in
register file
- huge gain in speed, like RISC performance
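Sketch (not from the slides): folding one pattern into a single three-register operation. The pattern used here (two loads, an add, a store) and the folded opcode name `ADD3` are illustrative assumptions, not the actual picoJava table. The sketch also ignores the requirement that the operands already sit in the register file.

```python
# Hypothetical pattern: load two locals, add them, store the result in a local.
PATTERN = ["ILOAD", "ILOAD", "IADD", "ISTORE"]


def fold(program):
    """Replace each occurrence of PATTERN with one folded operation.

    Instructions are tuples such as ("ILOAD", 7) or ("IADD",). The folded
    form is ("ADD3", source_1, source_2, destination).
    """
    folded = []
    i = 0
    while i < len(program):
        window = program[i:i + len(PATTERN)]
        if [op for op, *_ in window] == PATTERN:
            folded.append(("ADD3", window[0][1], window[1][1], window[3][1]))
            i += len(PATTERN)
        else:
            folded.append(program[i])
            i += 1
    return folded


if __name__ == "__main__":
    program = [("ILOAD", 7), ("ILOAD", 5), ("IADD",), ("ISTORE", 3), ("DUP",)]
    print(fold(program))
```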
75Comparing these examples
- common features
- all m.i.s contain opcode, 2 source regs, dest
reg
- 1 m.i. per cycle
- deep pipelines
- split instn and data caches
- Pentium II: complexity is in deconstructing its
CISC instns into micro-operations
- JVM: complexity is in folding sets of m.i.s into
single operations
- UltraSPARC: most straight-forward to implement,
because instns require minimal decoding (all RISC
instructions are micro-operations already!)
76The end