Chapter 4 The Microarchitecture Level - PowerPoint PPT Presentation

1
Chapter 4 The Microarchitecture Level
  • CS 271 Computer Architecture
  • Indiana University Purdue University Fort Wayne
  • Mark Temte

2
Microarchitecture level context
3
Microarchitecture level
  • The ISA level instructions are also known as
    macroinstructions
  • Familiar from assembly language
  • ADD, LOAD, STORE, BRANCH, etc.

Java:               c = a + b;
Assembly language:  LOAD   R3, a      (R3 is register 3)
                    ADD    R3, b
                    STORE  R3, c
                    BRANCH L4
4
Microarchitecture level
  • The control unit within the CPU must generate
    signals to fetch and execute each of the ISA
    level macroinstructions
  • How?
  • Create a microcomputer within the control unit
  • This microcomputer runs microprograms consisting
    of microinstructions that act on the data path
  • There is one microprogram for each
    macroinstruction
  • Execution of the microprogram interprets the
    corresponding macroinstruction

5
Data path
  • This data path of the CPU consists of those parts
    exclusive of the control unit
  • Consists of the ALU, registers, and internal
    buses
  • Example
  • The following slide shows the data path of a
    fictitious computer called IJVM
  • Integer Java Virtual Machine
  • 32-bit data path
  • 32-bit registers
  • 32 1-bit ALUs

6
(No Transcript)
7
Data path
  • The ALU has 6 control lines
  • F0, F1: select AND, OR, COMP, or SUM
  • ENA: gate A inputs into ALU
  • ENB: gate B inputs into ALU
  • INVA: complement A inputs
  • INC: assert carry into low-order bit of ALU
  • The shifter has 2 control lines with 3 actions
  • Negating both lines causes no shift
  • SLL8: Shift Left Logical 8 (shift left 1 byte
    with 0 fill)
  • SRA1: Shift Right Arithmetic 1, not changing
    leftmost bit
  • This divides a twos complement number by 2

8
Example dividing by 2
  • Divide twos complement representation of -14 by 2

Let n = 6 bits
Represent magnitude 14 (decimal)     001110
Complement each bit                  110001
Add 1                               +     1
Result: -14 (decimal) =              110010

Apply SRA1 to -14 = 110010           111001   What is this?
Complement each bit                  000110
Add 1                               +     1
Obtain 7 (decimal)                   000111
Thus SRA1 produced 111001 = -7 (decimal)
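The worked example above can be checked with a short Python sketch. The helper names `sra1` and `to_signed` are illustrative only and do not come from the slides; the 6-bit width matches the example.

```python
def sra1(pattern: int, n: int = 6) -> int:
    """Shift an n-bit pattern right by one bit, keeping the leftmost (sign) bit."""
    mask = (1 << n) - 1
    pattern &= mask
    sign = pattern & (1 << (n - 1))   # isolate the leftmost bit
    return (pattern >> 1) | sign      # shift, then restore the sign bit


def to_signed(pattern: int, n: int = 6) -> int:
    """Interpret an n-bit pattern as a twos complement number."""
    if pattern & (1 << (n - 1)):
        return pattern - (1 << n)
    return pattern


shifted = sra1(0b110010)                             # -14 in 6 bits
print(format(shifted, "06b"), to_signed(shifted))    # 111001 -7
```

For odd negative values the result is rounded toward minus infinity rather than toward zero, so the "divide by 2" reading is exact only for even inputs.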
9
Recall . . .
10
(No Transcript)
11
Example
  • How could you increment the SP register?
  • Look at the data path again
  • Enable SP to the B bus
  • Compute B + 1 as follows . . .
  • Assert ENB
  • Assert lines for SUM
  • Assert INC
  • No shift
  • When shifter output has stabilized, write the C
    bus back into the SP
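A minimal sketch of the ALU settings used to increment SP, assuming a simplified software model of the control lines. The function `alu`, the string names for the F0/F1 functions, and the reading of COMP as "complement of B" are assumptions made for illustration; the slides do not give the bit encodings.

```python
MASK = 0xFFFFFFFF  # 32-bit data path


def alu(a, b, func, ena=0, enb=0, inva=0, inc=0):
    """Simplified model of the ALU control lines (illustrative only)."""
    a = a if ena else 0          # ENA gates the A input
    if inva:
        a = ~a & MASK            # INVA complements the A input
    b = b if enb else 0          # ENB gates the B input
    if func == "AND":
        return a & b
    if func == "OR":
        return a | b
    if func == "COMP":           # assumed here to mean complement of B
        return ~b & MASK
    if func == "SUM":
        return (a + b + inc) & MASK   # INC is the carry into the low-order bit
    raise ValueError(f"unknown ALU function: {func}")


sp = 41
sp = alu(0, sp, "SUM", enb=1, inc=1)   # ENB, SUM, INC asserted; no shift
print(sp)                              # 42
```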

12
Incrementing the SP register
  • Precise timing of the write pulse to SP is
    important

13
Memory operations
  • There are two ports to memory
  • MAR / MDR
  • 32-bit data port
  • Load the MAR with a 30-bit address
  • The address (multiplied by 4) goes to memory
  • To multiply, simply shift the address left by 2
    bits
  • The MDR receives data (READ) or provides data
    (WRITE)
  • PC / MBR
  • 8-bit data port
  • Only for reading
  • Load the PC with a 32-bit address
  • The address goes to memory
  • The MBR receives a byte
  • Usually an ISA instruction code
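The MAR address conversion can be sketched in one line of Python. The function name is hypothetical, and treating the MAR value as a 30-bit word address follows the slide's description.

```python
def mar_to_byte_address(mar: int) -> int:
    """Convert a 30-bit word address in MAR to the byte address sent to memory."""
    return (mar & 0x3FFFFFFF) << 2   # shifting left by 2 multiplies by 4


print(mar_to_byte_address(5))        # 20
```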

14
Memory operations
  • The MBR is gated onto the B bus in two ways
  • Signed (with sign extension)
  • Unsigned
  • There is one control signal for each
  • Only one of these signals may be asserted at a
    time
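The two ways of gating the 8-bit MBR onto the 32-bit B bus can be sketched as follows. The function names are illustrative and do not appear in the slides.

```python
def mbr_unsigned(mbr: int) -> int:
    """Gate MBR onto the B bus with the upper 24 bits set to zero."""
    return mbr & 0xFF


def mbr_signed(mbr: int) -> int:
    """Gate MBR onto the B bus, copying bit 7 into the upper 24 bits."""
    mbr &= 0xFF
    if mbr & 0x80:
        return mbr | 0xFFFFFF00
    return mbr


print(hex(mbr_unsigned(0xFE)), hex(mbr_signed(0xFE)))   # 0xfe 0xfffffffe
```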

15
Control signals for the data path
  • There are 29 signals in all
  • 9 - Selects a register to gate to the B-bus
  • A 4-bit code is decoded 16 ways
  • Only 9 ways are used
  • Saves 5 bits
  • 9 - Selects a register to load from the C-bus
  • 8 - ALU and shifter operations
  • 2 - read / write using MAR / MDR
  • 1 - fetch using PC / MBR
  • These control each data path cycle
  • Falling edge of clock to next rising edge
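A small sketch of the B-bus decoder idea: a 4-bit code is expanded into 16 one-hot enable lines, of which only 9 are wired to registers. The function name is hypothetical, and the slides do not say which code selects which register.

```python
def decode_b(code: int) -> list:
    """Expand a 4-bit code into 16 one-hot enable lines."""
    if not 0 <= code < 16:
        raise ValueError("code must fit in 4 bits")
    return [1 if i == code else 0 for i in range(16)]


lines = decode_b(3)
print(lines.index(1), sum(lines))   # 3 1
print(9 - 4)                        # 5 bits saved per microinstruction
```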

16
Microinstruction format
  • Each microinstruction sets up control signals on
    the data path for the next data path cycle
  • Each microinstruction is 36 bits
  • 24 control bits for the data path
  • 24 = 29 - 5
  • 9-bit address of the next microinstruction
  • 3-bit condition code for branching
  • Note each microinstruction specifies its
    successor
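The 36-bit format can be sketched as bit-field packing. The field widths follow the counts given above (9 + 3 + 8 + 9 + 3 + 4 = 36, with the 4-bit B field being the encoded register select). The field names, their left-to-right order, and the helpers `pack` and `unpack` are assumptions for illustration.

```python
# (name, width) from high-order to low-order bits; the order is an assumption
FIELDS = [("next_address", 9), ("jam", 3), ("alu", 8),
          ("c", 9), ("mem", 3), ("b", 4)]


def pack(**values) -> int:
    """Pack named fields into one 36-bit microinstruction word."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if value >> width:                      # also rejects negative values
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Split a 36-bit word back into its named fields."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out


w = pack(next_address=0x1FF, jam=0b101, b=4)
print(unpack(w)["next_address"], unpack(w)["jam"], unpack(w)["b"])   # 511 5 4
```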

17
Microinstruction format
18
Mic-1 architecture
  • Mic-1 is an example architecture we will study
  • Consists of . . .
  • Control store
  • 512 x 36 bit memory for all the microprograms
  • This is ROM
  • MPC
  • MicroProgram Counter
  • MIR
  • MicroInstruction Register

19
(No Transcript)
20
Mic-1 fetch / execute cycle
  • At the falling edge of the clock, the MIR is
    loaded
  • The following components then operate and
    stabilize
  • Decoder
  • B-bus
  • ALU
  • Shifter
  • C-bus
  • Also, the N and Z outputs from the ALU go to
    flip-flops
  • At the next rising edge of the clock
  • Registers and N, Z flip-flops are loaded
  • MDR and MBR are loaded from memory
  • The address of the next microinstruction is
    calculated while the clock is high and the cycle
    repeats

21
Recall the timing of the data path cycle
22
MPC address calculation
  • Address is just NEXT_ADDRESS when JAM = 000
  • However . . .
  • JAMN = 1 causes OR of N-bit with high-order MPC
    bit
  • JAMZ = 1 causes OR of Z-bit with high-order MPC
    bit
  • JMPC = 1 causes bitwise OR of MBR and 8 low-order
    bits of NEXT_ADDRESS
  • Typically, NEXT_ADDRESS = 0 when JMPC = 1
  • This permits a branch to the address in the MBR
  • This address typically is identical to the ISA
    op-code
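The rules above can be sketched as a small function. The name `next_mpc` and its argument list are hypothetical; the sketch reads "OR with the high-order MPC bit" as ORing the N or Z flag into bit 8 of NEXT_ADDRESS when the matching JAM bit is set.

```python
def next_mpc(next_address, jmpc=0, jamn=0, jamz=0, n=0, z=0, mbr=0):
    """Compute the 9-bit address of the next microinstruction."""
    high = ((next_address >> 8) & 1) | (jamn & n) | (jamz & z)
    low = next_address & 0xFF
    if jmpc:
        low |= mbr & 0xFF            # bitwise OR of MBR into the low 8 bits
    return (high << 8) | low


print(hex(next_mpc(0x092)))                   # 0x92  (JAM = 000)
print(hex(next_mpc(0x092, jamz=1, z=1)))      # 0x192 (256 locations higher)
print(hex(next_mpc(0x000, jmpc=1, mbr=0x60))) # 0x60  (branch to the op-code)
```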

23
MPC address calculation
24
ISA (macroarchitecture) of the IJVM
  • Memory model
  • Format of methods
  • The IJVM instruction set
  • Local variable frames and the operand stack
  • How a method call is implemented

25
The IJVM memory model
26
The IJVM memory model
  • The constant pool
  • Contains constants, strings, and pointers
  • E.g., pointer to the base address of each method
  • Loaded when the program is loaded
  • Register CPP points to the base of the constant
    pool
  • The constant pool is read-only
  • The method area
  • Contains method code
  • Register PC points to the next instruction
  • Organized as a byte array
  • Operand stack

27
Method format
  • The executable code in a method is preceded by .
    . .
  • Two bytes giving the number of parameters
  • Two bytes giving the size of the local variable
    area
  • The local variable area size is needed to
    initialize the SP to the top of the local
    variable frame

number of parameters
size of LV area
PC
executable code
28
The IJVM instruction set
  • The IJVM instruction set appears on the next
    slide
  • There are 20 instructions altogether
  • Many of the instructions require just a single
    byte
  • These have no operands
  • DUP, IADD, IAND, IOR, IRETURN, ISUB, NOP, POP,
    SWAP, WIDE
  • Others have an additional single 1-byte operand
  • BIPUSH, ILOAD, ISTORE
  • Some have a single 2-byte operand
  • GOTO, IFEQ, IFLT, IF_ICMPEQ, INVOKEVIRTUAL, LDC_W
  • One has two 1-byte operands
  • IINC

29
(No Transcript)
30
Using the IADD instruction
  • To add local variables j and k and save the sum
    in local variable i . . .

ILOAD j    // push a copy of local variable j on the top of the stack
ILOAD k    // push a copy of local variable k on the top of the stack
IADD       // pop 2 words from the stack and push their sum back
ISTORE i   // pop top word from stack and store in local variable i
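The four-instruction sequence can be traced with a toy interpreter. This is a sketch of the stack behaviour only, not of the Mic-1 microcode; the `run` function, the tuple encoding of instructions, and the use of a dictionary for local variables are all assumptions for illustration.

```python
def run(program, local_vars):
    """Interpret a tiny subset of IJVM (ILOAD, ISTORE, IADD) on a Python list."""
    stack = []
    for op, *args in program:
        if op == "ILOAD":
            stack.append(local_vars[args[0]])      # push a copy of the variable
        elif op == "ISTORE":
            local_vars[args[0]] = stack.pop()      # pop into the variable
        elif op == "IADD":
            b = stack.pop()
            a = stack.pop()
            stack.append((a + b) & 0xFFFFFFFF)     # 32-bit words
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return stack


variables = {"i": 0, "j": 3, "k": 4}
run([("ILOAD", "j"), ("ILOAD", "k"), ("IADD",), ("ISTORE", "i")], variables)
print(variables["i"])   # 7
```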
31
Sample program fragment
Note: The branch instruction IF_ICMPEQ has a
16-bit signed offset that is added to the address
of the current op-code to target L1
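A minimal sketch of the branch target calculation. The function name is hypothetical, and the byte order (high-order offset byte first) is an assumption not stated on the slide.

```python
def branch_target(opcode_address: int, offset_hi: int, offset_lo: int) -> int:
    """Add a 16-bit signed offset to the address of the branch op-code."""
    offset = ((offset_hi & 0xFF) << 8) | (offset_lo & 0xFF)
    if offset & 0x8000:            # sign bit set: the offset is negative
        offset -= 0x10000
    return opcode_address + offset


print(branch_target(100, 0x00, 0x0A))   # 110 (forward branch)
print(branch_target(100, 0xFF, 0xF6))   # 90  (backward branch, offset -10)
```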
32
The local variable frame
  • The local variable frame is where the local
    variables of a method are stored
  • A new local variable frame is created whenever a
    method is called
  • Each local variable frame is pushed onto a stack
    in memory called the operand stack
  • The stack space occupied by a local variable
    frame is released when the associated method
    returns

33
Operand stack example
  • Suppose method A calls method B, which calls
    method C
  • The SP (Stack Pointer) register holds the index
    of the top of the stack
  • The LV (Local Variable pointer) register holds
    the base address of the local variable frame

SP
frame for C
LV
frame for B
frame for A
34
Operand stack example
  • Note how the stack space for B and C is recycled

35
Detailed local variable frame structure
  • The local variable frame also . . .
  • Holds all the parameters set up on the stack in
    advance by the caller
  • Saves the LV and PC registers of the caller
  • The saved PC value is the return address within
    the caller

36
Detailed local variable frame structure
37
Calling a method
  • Call a method using instruction
  • INVOKEVIRTUAL disp
  • Parameter disp gives the position in the constant
    pool holding a pointer to the called method
  • INVOKEVIRTUAL does the following
  • Sets register LV to the value in SP - (number of
    parameters)
  • Set the value in the location pointed to by LV to
    the value in register SP + (number of local
    variables) + 1
  • Increment register SP by (number of local
    variables)
  • Push caller's register PC (return address) on the
    stack
  • Set register PC to the 5th byte in the called
    method
  • Push the caller's original LV value on the stack
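The steps listed for INVOKEVIRTUAL can be traced with a short sketch that follows the slide literally. The function name, the use of a Python list as the stack memory, and the argument names are assumptions; how parameters are counted relative to the slot that LV ends up pointing to is taken from the slide as given.

```python
def invokevirtual(stack, sp, lv, pc, n_params, n_locals, method_addr):
    """Build a new local variable frame; returns the new (SP, LV, PC)."""
    new_lv = sp - n_params                 # LV = SP - (number of parameters)
    stack[new_lv] = sp + n_locals + 1      # link: where the caller's PC is saved
    sp += n_locals                         # reserve space for local variables
    sp += 1
    stack[sp] = pc                         # push caller's PC (return address)
    sp += 1
    stack[sp] = lv                         # push caller's original LV
    new_pc = method_addr + 4               # 5th byte: skip the two 2-byte headers
    return sp, new_lv, new_pc


memory = [0] * 32
sp, lv, pc = invokevirtual(memory, sp=10, lv=5, pc=100,
                           n_params=2, n_locals=3, method_addr=200)
print(sp, lv, pc)            # 15 8 204
print(memory[memory[lv]])    # 100, the saved return address
```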

38
Intermediate results
  • The operand stack is used for storing method
    intermediate results
  • These are pushed on the operand stack above the
    local variable frame
  • The return result is the final intermediate
    result
  • It is always left immediately above the local
    variable frame
  • The other intermediate results have already been
    popped
  • Look at Figure 4-9 again
  • IRETURN reverses the steps of INVOKEVIRTUAL

39
Returning from a method
40
The Mic-1 microprogram for IJVM
  • Recall that there is one Mic-1 microprogram for
    each of the IJVM macroinstructions
  • There is also a microprogram for instruction
    fetch
  • Altogether, these microprograms are referred to
    as the Mic-1 microprogram for the IJVM
  • Microinstructions are described using a special
    notation
  • 36 bits could be used instead for each
    microinstruction
  • It is more readable to indicate how the bits
    should be set rather than what they are set to
  • Caution: be sure that what is indicated by the
    notation is physically possible

41
Microinstruction notation
  • Some examples
  • Everything on a line is done in one clock cycle
  • The desired result must be physically possible
  • For example, MDR = SP + MDR is illegal, since the
    ALU needs one input from register H

PC = PC + 1; fetch; goto (MBR)
MAR = SP = SP - 1; rd
H = TOS
MDR = TOS = MDR + H; wr; goto Main1
42
Sequencing of instructions
  • All instructions have an implicit or explicit goto
  • Sequential instructions are not necessarily
    sequential in the control store
  • The microinstruction sequence for a
    macroinstruction starts at the control store
    address that corresponds to the numerical value
    of the macroinstruction's op-code
  • For example, the IADD op-code is 0x60 and the
    microinstruction sequence starts at location 0x60
  • The following microinstruction can be located
    anywhere in the control store

43
Microinstruction branching
  • Example
  • Pass TOS through the ALU and look at the Z bit
  • L1 and L2 must be exactly 256 locations apart
  • Example
  • Unconditional branch to instruction pointed to by
    the MBR
  • Convention: At the start of any
    macroinstruction, register TOS always contains a
    copy of the value at the top of the operand stack
  • Register OPC is a scratch register
  • Often saves the op-code

Z = TOS; if (Z) goto L1; else goto L2
goto (MBR)
44
The Mic-1 microprogram for IJVM
  • There are 112 microinstructions in all
  • Starts with the line labeled Main1
  • Before the macroprogram runs . . .
  • the PC contains the address just before the 1st
    macroinstruction
  • the MBR contains 0 (the NOP op-code)
  • Main1 fetches the next macroinstruction op-code
    and branches to the start of the microinstruction
    sequence for the current macroinstruction
  • The last microinstruction in the sequence
    branches back to Main1
  • On the following slides, focus on instructions
    marked with

45
The Mic-1 microprogram for IJVM
46
The Mic-1 microprogram for IJVM
47
The Mic-1 microprogram for IJVM
48
The Mic-1 microprogram for IJVM
49
The Mic-1 microprogram for IJVM
50
Design issues
  • We will modify the Mic-1 design in order to
    increase performance
  • Changes involve . . .
  • Eliminating decoding
  • Reducing the path length
  • The path length is the average number of
    microinstructions per macroinstruction
  • The path length can be reduced by . . .
  • Eliminating Main1
  • Using a 3-bus architecture
  • Adding an independent fetch unit

51
Eliminating decoding
  • Decoding the B-bus slows the potential clock rate
  • The decoding must be completed before anything
    else can happen
  • Cost to eliminate decoding
  • 5 bits in each microinstruction
  • Altogether, 41 bits will be needed instead of 36

52
Eliminating Main1
  • At Main1 there is a microinstruction to fetch the
    opcode of the next macroinstruction
  • This microinstruction can be eliminated by
    merging its code onto the end of the microcode
    sequence of each macroinstruction
  • Usually this can be done in parallel with other
    activity for a saving of 1 cycle
  • This may not always be possible

Main1: PC = PC + 1; fetch; goto (MBR)
53
Eliminating Main1
  • Microinstruction sequence for POP with Main1 code
    merged onto the end

The original order of microinstruction execution
for POP
54
Three-bus architecture
  • This change allows two registers to be added in
    just one clock cycle
  • There is no need to waste a cycle moving one of
    the registers to the H register earlier

55
Adding an independent fetch unit
  • This new specialized functional unit is called
    the IFU
  • Instruction Fetch Unit
  • It independently fetches macroinstruction
    opcodes and processes macroinstruction operands
  • Operands like varnum, disp, offset, etc.
  • This eliminates the Main1 microinstruction
    entirely
  • No longer necessary to merge Main1 code onto the
    end of each microcode sequence

56
Adding an independent fetch unit
  • The IFU gives a dramatic improvement in
    performance, but . . .
  • The IFU is surprisingly complicated
  • Due to branching and operand handling
  • There are some necessary changes in the data path
    due to the IFU
  • In addition to MBR, a new 2-byte register MBR2 is
    added to the data path for holding 2-byte
    operands
  • This eliminates the need to combine two bytes in
    the data path to form an offset or disp
  • The old MBR is renamed MBR1

57
The IFU
  • The PC is now updated by the microprogram only
    when a branch occurs
  • The IFU maintains its own copy of the PC in a
    private register called IMAR
  • The IFU increments the IMAR independently of the
    data path
  • The IFU reads 4 bytes at a time from the user
    program into a special shift register capable of
    holding 5 bytes

58
The IFU
59
Mic-2
  • The revised microarchitecture is called Mic-2
  • Mic-2 includes . . .
  • 3-bus architecture
  • Prefetching using the IFU
  • Shorter microprogram
  • 81 microinstructions instead of 112
  • Major performance gain

60
Mic-2
61
The new microprogram for Mic-2
62
The new microprogram for Mic-2
63
(No Transcript)
64
Additional modifications
  • The clock cycle time can be reduced with a
    pipelined design
  • We first add latch registers to the data path

65
Pipelined design
  • This design latches . . .
  • The A and B inputs to the ALU
  • Output from the ALU
  • The old clock cycle is broken into 3 microcycles
  • The clock is adjusted to run approximately 3
    times as fast
  • Now parts of three microinstructions can be
    processed in parallel
  • We need to add a cache memory so memory
    operations can keep up
  • The ALU is active every cycle
  • Not just in the middle of the old cycle

66
The pipeline in action
67
The SWAP instruction
SWAP with pipelining
68
The SWAP instruction
  • With pipelining, note the need to stall the
    pipeline occasionally
  • The third microinstruction caused the pipeline to
    stall for two cycles
  • The SWAP now requires only 11 microcycles instead
    of 3 x (6 normal cycles) = 18 microcycles

69
Mic-3
  • The revised microarchitecture is called Mic-3
  • Mic-3 includes a 4-stage pipeline with stages . .
    .
  • Fetch
  • Latch A and B
  • Calculate with the ALU
  • Writeback

70
Additional modifications
  • Mic-3 still has a problem
  • Various microinstructions contain microbranches
  • Conditional branch
  • Branch with a target microinstruction not known
    in advance
  • For example, the last microinstruction in a
    sequence always branches to a target not known in
    advance
  • Consider the swap6 microinstruction
  • The next microinstruction cannot be prefetched
  • This could cause havoc with the microinstruction
    pipeline
  • There is a separate MIR for each microinstruction
    in the pipeline
  • The pipeline must stall until the next
    microinstruction is known
  • The next microinstruction must be anticipated
  • Add two more components to the design
  • Decoding unit
  • Queueing unit

71
Decoding unit
  • The decoding unit knows which incoming bytes are
    opcodes and which are operands like varnum and
    disp
  • The incoming opcode is an index into a ROM table
    within the decoding unit
  • The indexed row gives . . .
  • The number of bytes associated with the
    opcode
  • This allows the decoding unit to know when it
    fetches the next opcode
  • The address in the control store of the first
    microinstruction of the sequence associated with
    the opcode

72
Queueing unit
  • The queueing unit contains . . .
  • The old control store (ROM)
  • The microinstructions in the control store for a
    given sequence are now consecutive rather than
    scattered
  • No need for each microinstruction to designate
    its successor
  • A hardware queue of microinstructions (RAM)
  • The microinstruction queue holds the proper
    sequence of microinstructions across ISA
    macroinstruction boundaries

73
Queueing unit
  • Microinstructions have a modified format
  • No longer need the NEXT_ADDRESS field
  • No longer have JAM bits
  • Have added bits for selecting the A bus
  • Also there are two new bits in each
    microinstruction
  • Final bit
  • Goto bit

74
Queueing unit
  • The Final bit is set in the last
    microinstruction in each sequence
  • It is used to indicate the end of the sequence
    for the current macroinstruction and reactivate
    the IFU
  • The Goto bit marks microinstructions that have
    conditional branches (at the ISA level)
  • These microinstructions have a different format
    from other microinstructions
  • Have JAM bits
  • Contain an index into the control store

75
Queueing unit operation (input side)
  • Starting with the first microinstruction of a
    sequence, the queueing unit . . .
  • Copies sequential instructions from the control
    store into the hardware queue of
    microinstructions
  • Copying continues through the first
    microinstruction with the Final bit set
  • If the Goto bit is not set, the queueing unit . .
    .
  • Gets the index associated with the next
    opcode from the decoding unit
  • Continues copying microinstructions from the
    sequence for the new opcode into the hardware
    queue of microinstructions
  • Copying continues until a Goto bit is set or the
    queue of microinstructions is full

76
Queueing unit operation (input side)
  • When the Goto bit is set (conditional branch)
  • The queueing unit stops copying microinstructions
    from the control store into its hardware queue
  • The unit stalls until the microbranch has been
    resolved
  • The fetch queue in the IFU may have to be cleaned
    up also

77
Queueing unit operation (output side)
  • On the output side, the queueing unit
  • Dequeues microinstructions from its queue
  • Feeds them into a queue of four MIRs
  • One MIR for each stage of the data path part of
    the pipeline

78
(No Transcript)
79
Mic-4
  • The revised microarchitecture is called Mic-4
  • Mic-4 includes a 7-stage pipeline with stages . .
    .
  • IFU
  • Decoding unit
  • Queueing unit
  • Latch operands
  • ALU
  • Register writeback
  • Memory
  • See circled numbers on Figure 4-35

80
Cache memory
  • The bottleneck in the Mic-4 design is with memory
  • Memory latency is the delay for read and write
  • Memory bandwidth is the number of bytes involved
    in each read or write
  • For a given memory technology, an increase in
    bandwidth causes an increase in latency
  • The fastest memory technology is not cost
    effective
  • Cache memory is the cost effective alternative

CPU
cache memory
main memory
81
Cache memory terminology
  • Spatial locality
  • Nearby addresses are likely to be needed soon
  • Bring in more bytes than needed from the vicinity
    of each reference for later use
  • Temporal locality
  • Recently used addresses are likely to be needed
    again
  • Don't discard these right away

82
Cache memory terminology
  • Cache line
  • The block of bytes brought in when a cache miss
    occurs
  • Typically 4, 8, 16, 32, or 64 consecutive bytes
  • Unified cache
  • Contains both data and instructions
  • Split cache
  • Separate caches for data and instructions
  • Allows parallel access
  • Effectively doubles bandwidth
  • Instruction cache usually read-only from the CPU

83
Several levels of cache are common
84
Direct-mapped cache
  • A direct-mapped cache is organized into rows
  • Each row contains
  • Valid bit
  • Set whenever the row is loaded
  • Bit is clear only when cache line is empty
  • Tag
  • Consists of the high-order address bits
  • Cache line (the data)
  • The next slide is an example of a direct-mapped
    cache
  • with 2048 rows
  • with a 32-byte cache line

85
(No Transcript)
86
Direct-mapped cache
  • The example cache responds to 32-bit addresses
  • The 11-bit line field selects the row of the
    cache
  • The 3-bit word field selects the word of data
    within the cache line
  • The 2-bit byte field selects a byte within the
    word
  • Each row of the cache is shared by all addresses
    with the same line field bits
  • The 16 tag bits of the address are loaded into
    the 16-bit tag field when the cache line is loaded
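The field split described above (16 tag bits, 11 line bits, 3 word bits, 2 byte bits) can be sketched with shifts and masks. The function name `split_address` is illustrative only.

```python
def split_address(addr: int):
    """Split a 32-bit address into (tag, line, word, byte) fields."""
    byte = addr & 0x3              # low 2 bits: byte within the word
    word = (addr >> 2) & 0x7       # next 3 bits: word within the cache line
    line = (addr >> 5) & 0x7FF     # next 11 bits: row of the cache
    tag = (addr >> 16) & 0xFFFF    # high 16 bits: tag
    return tag, line, word, byte


tag, line, word, byte = split_address(0x12345678)
print(hex(tag), line, word, byte)   # 0x1234 691 6 0
```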

87
Direct-mapped cache
  • When the cache is referenced . . .
  • The tag bits of the address are compared with the
    bits in the tag field of the row selected by the
    line bits
  • A cache hit occurs if the tag bits are the same
  • A cache miss occurs if the tag bits are different
  • Cache hit
  • The needed word or byte of the cache line is read
    or written
  • Cache miss
  • The existing cache line must be written back to
    memory if it has been modified
  • Replace the cache line with the new data from
    memory
  • Update the tag field
  • Read or write the needed word or byte

88
Set-associative cache
  • Usually 2 or 4 direct-mapped lines per row
  • All tag fields are simultaneously compared
  • On a cache miss, one of the lines must be
    discarded
  • Which one?
  • (LRU) Least Recently Used

89
Writing to a cache
  • When should the copy in main memory be updated?
  • Write through
  • Immediately update
  • More memory traffic
  • Write deferred or write back
  • Wait until the cache line is replaced
  • Write allocation
  • For a cache miss on write, bring the line into
    the cache and write to it there
  • This is in contrast to writing directly to memory
  • Usually used with write deferred

90
Microarchitecture examples
  • Three architectures are considered
  • Pentium 4
  • UltraSPARC-III
  • Intel 8051
  • First two are very similar
  • Three-bus architecture
  • Pipelines
  • Split cache

Note: We will skip the following textbook sections
Section 4.5.2 Branch prediction
Section 4.5.3 Out-of-order execution and register renaming
Section 4.5.4 Speculative execution
91
Microarchitecture examples
  • Pentium 4
  • CISC architecture on the outside (at the ISA
    level)
  • The way it appears to assembly language
    programmers
  • Huge and unwieldy instruction set backward
    compatible with 8088
  • Only 8 visible registers: EAX, EBX, ECX, EDX,
    etc.
  • 32-bit architecture with 64-bit memory bus
  • RISC architecture on the inside (at
    microarchitecture level)
  • Microarchitecture named NetBurst
  • Complete break from Pentium III and earlier
    microarchitectures
  • Up to 126 microinstructions active at a time
  • 120 scratch registers
  • Two double-speed integer ALUs and two
    double-speed floating-point ALUs
  • 12 billion integer operations possible each
    second at 3 GHz
  • The Mic-4 resembles the Pentium 4 in many ways
  • However, Pentium 4 has out-of-order execute
    capability
  • Read on your own
  • Textbook pages 312 - 317

92
Overview of the NetBurst Microarchitecture
93
Microarchitecture examples
  • UltraSPARC-III Cu
  • Cu indicates copper wiring on chip (not aluminum)
  • No microarchitecture level
  • True RISC architecture
  • Needs special hardware for graphics and
    multimedia instructions
  • 64-bit data path and registers
  • 128-bit memory bus
  • Microarchitecture much simpler than Pentium 4
  • There is a simpler ISA level to implement
  • 14-stage pipeline
  • Read on your own
  • Textbook pages 317 - 323

94
14-stage UltraSPARC-III pipeline
95
Microarchitecture examples
  • Intel 8051
  • Similar to Mic-1, but more RISC-like than
    CISC-like
  • Only about 60,000 transistors
  • Primary design goal: cheap, rather than fast
  • No pipelining, no caching, and in-order issue,
    execute, and retirement
  • Single main bus
  • Registers ACC, B, and SP
  • Similar to Intel 8088's AX, BX, and SP
  • TMP1 and TMP2 are latches for ALU
  • For embedded applications there are . . .
  • Three 16-bit timers for real-time control
  • Four 8-bit I/O ports
  • Read on your own
  • Textbook pages 323 - 325

96
Intel 8051