The Elements of Computers - PowerPoint PPT Presentation

Slides: 186
Provided by: cwin

1
The Elements of Computers
  • A processor able to interpret and execute
    programs
  • A memory for storing the programs and the data
    they process
  • Input-output equipment for transferring
    information between the computer and the outside
    world.

2
The brain versus the Computer
  • Brain: central processing unit (CPU)
  • program control unit: control, instructions
  • arithmetic-logic unit (ALU): execution, data
  • Similarities and differences
  • digital or discrete information: abacus
  • analog or continuous information: slide rule

3
The Turing Machine
Processor P
Read-write head
Memory tape M
4
An Abstract Computer
  • The Turing machine was introduced by the English
    mathematician Alan M. Turing in 1936.
  • The tape M as a memory
  • unbounded length
  • blank or one of a small set of symbols
  • The processor P
  • a small number of internal states
  • linked to M via the read-write head

5
Instruction Format
  • Sh Ti Oj Sk
  • the current state of processor is Sh
  • the symbol it expects to read on the square of M
    under the read-write head is Ti
  • perform the action Oj
  • write a new symbol
  • move the tape to left or right
  • change the state of P to Sk

6
Add Two Unary Numbers via Turing Machine
  • Instructions and comments:

    Instruction   Comment
    S0 b R S1     move read-write head one square to right
    S1 1 R S1     move read-write head rightward across n1
    S1 b 1 S2     replace blank between n1 and n2 by 1
    S2 1 L S2     move read-write head leftward across n1
    S2 b R S3     blank square reached; move one square to right
    S3 1 b S3     replace left-most 1 by blank
    S3 b H S3     halt; the result n1 + n2 is now on the tape
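
The unary-addition program can be traced with a small simulator. The following is a minimal sketch, not the deck's own implementation: the function names `run_turing_machine` and `add_unary`, the dictionary encoding of the Sh Ti Oj Sk instructions, and the step limit are all assumptions made for illustration.

```python
def run_turing_machine(program, tape, head=0, state="S0", blank="b", max_steps=10_000):
    """Run instructions of the form (Sh, Ti) -> (Oj, Sk) until the action H."""
    cells = dict(enumerate(tape))  # sparse tape: square index -> symbol
    for _ in range(max_steps):
        symbol = cells.get(head, blank)
        action, next_state = program[(state, symbol)]
        if action == "H":          # halt
            break
        if action == "R":          # move head one square right
            head += 1
        elif action == "L":        # move head one square left
            head -= 1
        else:                      # any other action writes that symbol
            cells[head] = action
        state = next_state
    else:
        raise RuntimeError("step limit reached without halting")
    return "".join(cells.get(i, blank) for i in range(min(cells), max(cells) + 1))


# The seven instructions from the slide, keyed by (state, symbol read).
ADD_UNARY = {
    ("S0", "b"): ("R", "S1"),
    ("S1", "1"): ("R", "S1"),
    ("S1", "b"): ("1", "S2"),
    ("S2", "1"): ("L", "S2"),
    ("S2", "b"): ("R", "S3"),
    ("S3", "1"): ("b", "S3"),
    ("S3", "b"): ("H", "S3"),
}


def add_unary(n1, n2):
    """Place n1 and n2 on the tape in unary, run the program, count the 1s."""
    tape = "b" + "1" * n1 + "b" + "1" * n2 + "b"
    return run_turing_machine(ADD_UNARY, tape).count("1")
```

The head is assumed to start on the blank square immediately to the left of n1, which is what the first instruction expects.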

7
A Little Flavor of RISC
  • A universal TM can by itself perform every
    reasonable computation.
  • t different tape symbols
  • s different processor states
  • ts < 30 implies that it can have a very small
    instruction set

8
Limitations of Computers
  • Unsolvable problems
  • no Turing machine and no practical computer can
    solve
  • Goldbach's conjecture
  • Undecidable problems
  • TM halting problem
  • finite-state machines

9
Limitations of Computers
  • Intractable problems
  • no computer can solve a given problem in a
    reasonable amount of time
  • finding an Euler circuit in a graph (by contrast,
    a tractable problem)
  • traveling salesman problem
  • scheduling of airline flights
  • routing of wires in an electronic circuit
  • sequencing of steps in a factory assembly line
  • Brute force

10
Speed Limitations
  • The time complexity of an algorithm
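  • order f(n), denoted $O(f(n))$
  • computing time grows with the problem size n
  • Effect of a computer speedup by 100 on four
    algorithms (maximum problem size before and after):

    Time complexity   Original size   Size after speedup
    $O(n)$            $n_1$           $n_1' = 100\,n_1$
    $O(n^2)$          $n_2$           $n_2' = 10\,n_2$
    $O(n^{100})$      $n_3$           $n_3' = 1.047\,n_3$
    $O(2^n)$          $n_4$           $n_4' = 6.644 + n_4$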
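
The four entries follow from asking how large a problem fits in the same running time once each step is 100 times cheaper. A minimal sketch, assuming a hypothetical helper `new_problem_size` and string labels for the four complexity classes:

```python
import math


def new_problem_size(complexity, n, speedup=100.0):
    """Largest problem size solvable in the original time on a faster machine."""
    if complexity == "n":        # f(n') = speedup * f(n)  ->  n' = speedup * n
        return n * speedup
    if complexity == "n^2":      # n'^2 = speedup * n^2
        return n * math.sqrt(speedup)
    if complexity == "n^100":    # n'^100 = speedup * n^100
        return n * speedup ** (1 / 100)
    if complexity == "2^n":      # 2^n' = speedup * 2^n  ->  additive gain only
        return n + math.log2(speedup)
    raise ValueError(f"unknown complexity class: {complexity}")
```

For the exponential algorithm the speedup adds a constant to the problem size instead of multiplying it, which is why faster hardware barely helps with intractable problems.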

11
The Mechanical Era
  • Babbage's Difference Engine.
  • The Analytical Engine.
  • 1896: Hollerith formed a company that was
    renamed IBM in 1924.

12
Electronic Computers
  • The first generation
  • stored program concept
  • mathematician John von Neumann (1903-1957)
  • vacuum-tube computers (1940-1950)
  • ferrite-core memory until the 1970s
  • machine language
  • assembly language
  • IAS computer

13
The IAS Computer(I)
  • 12-bit address
  • $2^{12}$ = 4K 40-bit words
  • a pair of 20-bit instructions
  • fixed-point number system
  • one-address instruction format

14
The IAS Computer(II)
  • Program control unit (PCU)
  • AR: memory address register
  • IR: instruction opcode register
  • IBR: next-instruction buffer register
  • PC: program counter
  • Data processing unit (DPU)
  • AC: accumulator register
  • DR: general-purpose data register
  • MQ: multiplier-quotient register

15
The IAS Computer(III)
  • Hardware description language (HDL) or
    register-transfer language (RTL)
  • AC := M(100); AC := AC + M(100); M(102) := AC

16
The IAS Computer(IV)
  • instruction type
  • data transfer
  • AC := MQ
  • AC := M(X)
  • data processing
  • AC := AC + M(X)
  • AC := AC × 2
  • program control
  • go to M(X, 0:19)
  • if AC ≥ 0 then go to M(X, 0:19)

17
The Shortcomings of IAS Computer
  • Self-modifying programs are difficult to
    debug...
  • The small amount of storage...
  • No procedure call or return instructions
  • Lack of text processing (biased toward numerical
    computation)...
  • I/O instructions are not mentioned...

18
The Contribution of the First Generation
Computers
  • Use of a CPU with a small set of registers
  • A separate main memory for instruction and data
    storage
  • An instruction set with a limited range of
    operations and addressing capabilities.
  • The term von Neumann computer has become
    synonymous with a computer of conventional design.

19
The Second Generation Computer(I)
  • The transistor, a high-speed electronic switch,
    versus the vacuum tube
  • Ferrite cores for the main memories
  • Magnetic disks for the secondary memories

20
The Second Generation Computer(II)
  • More registers: index registers
  • index instructions
  • array
  • More program control instructions
  • call
  • return
  • More scientific: floating-point instructions
  • $M \times B^{-E}$

21
The Second Generation Computer(III)
  • Input/Output operations
  • trivial data-transfer task
  • at very low speeds compared to CPU
  • Input-output processors
  • channels
  • let CPU execution and IO data transfer proceed
    independently

22
The Second Generation Computer(IV)
  • Programming languages
  • high-level languages
  • Scientific languages
  • 1954, FORmula TRANslation (FORTRAN)
  • Business language
  • 1959, Common Business Oriented Language (COBOL)

23
The Second Generation Computer(V)
  • System management
  • batch processing
  • a rudimentary version of operating system
  • multiprogramming
  • time-sharing system
  • keep CPU and IOPs busy by overlapping CPU and IO
    operations.

24
A Nonstandard Architecture: STACK Computers
  • Top of the stack (TOS)
  • push operation
  • pop operation
  • stack pointer (SP)
  • generally slower than von Neumann machine
  • Pocket calculator
  • CALL sub and RETURN

25
A Nonstandard Architecture: STACK Computers (II)
  • Z := W + 3 × (X − Y)
  • Polish notation
  • Z := W 3 X Y − × +

PUSH W
PUSH 3
PUSH X
PUSH Y
SUBTRACT
MULTIPLY
ADD
POP Z
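
The push/pop sequence above can be checked with a toy stack machine. This is a sketch under stated assumptions: the interpreter `run_stack_program`, the string form of the instructions, and the rule that a numeric PUSH operand is a literal are all illustrative choices, not part of the original slides.

```python
def run_stack_program(program, memory):
    """Execute zero-address stack instructions against a name -> value memory."""
    stack = []
    for instruction in program:
        op, *operand = instruction.split()
        if op == "PUSH":
            name = operand[0]
            # Assumption: a numeric operand is a literal, anything else names a memory cell.
            stack.append(int(name) if name.isdigit() else memory[name])
        elif op == "POP":
            memory[operand[0]] = stack.pop()
        else:
            right = stack.pop()   # top of stack (TOS)
            left = stack.pop()    # element below TOS
            if op == "ADD":
                stack.append(left + right)
            elif op == "SUBTRACT":
                stack.append(left - right)
            elif op == "MULTIPLY":
                stack.append(left * right)
            else:
                raise ValueError(f"unknown instruction: {op}")
    return memory


PROGRAM = ["PUSH W", "PUSH 3", "PUSH X", "PUSH Y",
           "SUBTRACT", "MULTIPLY", "ADD", "POP Z"]

result = run_stack_program(PROGRAM, {"W": 10, "X": 7, "Y": 2})
```

With W = 10, X = 7, Y = 2 the program leaves Z = 10 + 3 × (7 − 2) = 25 in memory, matching the infix expression.
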
26
The Third Generation Computer(I)
  • 1961, Integrated circuits (IC)
  • a large number of transistors to be combined on a
    tiny piece of semiconductor material, usually
    silicon.
  • Standardize computer
  • Software compatible
  • 1964 IBM/360
  • 1970 IBM/370
  • 1979 IBM/4300
  • 1990 IBM/390
  • about 200 distinct instructions

27
The Third Generation Computer(II)
  • Two major control states of CPU
  • a supervisor state
  • a user state
  • Architecture
  • microprogramming
  • placed in a special control memory in the PCU
  • a CPU can execute floating-point instruction
    without floating-point arithmetic circuits

28
The Third Generation Computer(III)
  • Supercomputer
  • CDC Cyber series
  • pipelining
  • involves overlapping the execution of
    instructions
  • multiprocessor
  • to be executed simultaneously
  • Minicomputers
  • DEC, Digital Equipment Corp. (1965)
  • Programmed Data Processor (PDP)
  • low cost

29
The VLSI Era
  • SSI
  • MSI
  • LSI
  • VLSI very large-scale integration
  • ULSI or MCM

30
CMOS: A Zero-Detection Circuit (I)
  • $z = \overline{x_0 + x_1 + x_2 + x_3}$
  • $x_0x_1x_2x_3 = 0000$ makes $z = 1$
  • $x_0x_1x_2x_3 = 0001$ makes $z = 0$

31
CMOS: A Zero-Detection Circuit (II)
  • Transistor
  • Gate
  • Inverter (NOT)

32
CMOS: A Zero-Detection Circuit (III)
  • Gate NAND

33
CMOS: A Zero-Detection Circuit (IV)
  • Gate NOR

34
Integrated Circuits
  • In 1959, Texas Instruments and Fairchild Corps.
  • Chip dimensions 10x10mm to 30x30x4mm with 300 or
    more pins
  • IC density SSI, MSI, LSI, VLSI, ULSI...

35
Integrated Circuits
[Figure: IC density on a logarithmic scale (1, 10^3, 10^6, 10^9) versus year (1960 to 2010), marking SSI, MSI, the 4-bit, 8-bit, 16-bit, 32-bit, and 64-bit microprocessors, and the 1K-bit, 1M-bit, and 1G-bit DRAMs]
36
Introduction to CMOS
Complementary Metal-Oxide-Semiconductor
  • Why CMOS?
  • Basic Concepts
  • CMOS Technology

37
Why CMOS?
  • Low power dissipation
  • at stable logic 0 or logic 1
  • $P_{static}$: leakage current
  • $P_{dynamic}$: switching current,
    charging/discharging of $C_{load}$
  • $P_{dynamic} \propto$ frequency
  • A distinct advantage: it leads to reduced heating

38
Why CMOS?
  • High Logic Integration Density
  • freedom to adjust the size by demand
  • linewidth as small as 0.1 μm resolution is
    possible by using optical lithography
  • logic density is increased as size decreased
  • achieve greater integration densities than that
    of a bipolar technology

39
Why CMOS?
  • Logic Swings
  • rail-to-rail output logic voltage swings
  • better noise immunity
  • more reliable logic circuits
  • a bipolar TTL gate output range: [0.3, 3.6] volts
  • a CMOS gate output range: [0, 5] volts

40
Why CMOS?
  • Symmetrical Transient Response
  • the time to switch from a logic 0 to a logic 1
    can be made equal to the time needed to switch
    from a logic 1 to a logic 0
  • Simplifies timing in a large system design

41
Why CMOS?
  • Bipolar Integrated Circuits
  • bipolar emitter-coupled logic (ECL) is the
    fastest silicon logic available
  • due to much higher power dissipation levels and
    subsequent heating, it has not taken over the
    microprocessor market
  • BiCMOS tends to provide the best aspects of both
    worlds

42
Why CMOS?
  • Gallium Arsenide?
  • The electron mobility is much larger in GaAs
  • can react to higher frequencies
  • Materials Costs
  • Technology Know-How
  • Applications

43
Chapter 2: Design Methodology
  • design process
  • gate level
  • register level
  • processor level
  • computer-aided design
  • analysis methods

44
System Design
  • a large and complex system, such as a computer
  • a collection of connected components

45
System Representation (I)
  • A system is modeled by a directed graph
  • a set of nodes V = {v1, v2, v3, ..., vn}
  • a set of edges E = {(v1, v2), (v1, v3), ...,
    (vn-1, vn)}
  • edge e = (vi, vj) connects node vi to node vj

46
System Representation (II)
  • a set of information processing components C
  • a set of lines S that carry information signals
    between components
  • the system G is associating C with S

47
Structure versus Behavior (I)
  • structure
  • a graph
  • the abstract graph consisting of block diagram
    with no function information
  • behavior
  • a truth table or a mathematical equation
  • determines, for any given input signal to the
    system, the corresponding output

48
Structure versus Behavior (II)
  • neither can be derived from the other
  • a schematic diagram, block diagram
  • conveys structure rather than behavior
  • needs more formal descriptions, a text, truth
    table, a list of equations

49
HDL Hardware Description Language (I)
  • HDL (hardware description language): Babbage's
    notations
  • VHDL - based on the programming language Ada
  • Verilog - based on the programming language C
  • Both are embodied in formal standards sponsored
    by IEEE (the Institute of Electrical and
    Electronics Engineers)

50
HDL Hardware Description Language (II)
  • more precise
  • technology independent
  • descriptions of gate and register levels
  • documentation
  • suitable for CAD programs
  • long and verbose

51
Half Adder - Block Symbol
[Block symbol: Half_adder with inputs x and y and outputs sum and carry]
52
Half Adder - Truth Table
53
Half Adder - Behavior
  • entity half_adder is
      port (x, y: in bit; sum, carry: out bit);
    end half_adder;
    architecture behavior of half_adder is
    begin
      sum <= x xor y;
      carry <= x and y;
    end behavior;

54
Half Adder - Structure
  • architecture structure of half_adder is
      component xor_circuit
        port (a, b: in bit; c: out bit);
      end component;
      component nand_gate
        port (d, e: in bit; f: out bit);
      end component;
      signal alpha: bit;
    begin
      XOR: xor_circuit port map (a=>x, b=>y, c=>sum);
      NAND1: nand_gate port map (d=>x, e=>y, f=>alpha);
      NAND2: nand_gate port map (d=>alpha, e=>alpha, f=>carry);
    end structure;
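
The behavioral and structural descriptions should define the same function. A minimal sketch that checks this exhaustively; the Python function names are hypothetical stand-ins for the two VHDL architectures:

```python
def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)


def half_adder_behavior(x, y):
    """Mirrors the behavioral description: sum = x xor y, carry = x and y."""
    return x ^ y, x & y


def half_adder_structure(x, y):
    """Mirrors the structural description: one XOR and two NAND components."""
    total = x ^ y                 # XOR component drives sum
    alpha = nand(x, y)            # NAND1 drives the internal signal alpha
    carry = nand(alpha, alpha)    # NAND2 with tied inputs acts as an inverter
    return total, carry


# Compare the two descriptions over the whole truth table.
EQUIVALENT = all(
    half_adder_behavior(x, y) == half_adder_structure(x, y)
    for x in (0, 1) for y in (0, 1)
)
```

The second NAND gate has both inputs tied to alpha, so it inverts NAND(x, y) back into x AND y.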

55
Half Adder - Block Diagram
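[Block diagram: xor_circuit XOR takes x and y on inputs a and b and drives sum from output c; nand_gate NAND1 takes x and y on inputs d and e and drives alpha from output f; nand_gate NAND2 takes alpha on both inputs d and e and drives carry from output f]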
56
Exclusive-OR - Block Diagram
[Block diagram: x1 ⊕ x2 built from inputs x1 and x2 using two NOT gates, two AND gates, and one OR gate]
57
Exclusive-OR - Truth Table
58
Gate Level
  • combinational logic
  • z(x1, x2, ..., xn)
  • truth table
  • logic circuits
  • standard gates
  • functionally complete gate types
  • {AND, OR, NOT}
  • {AND, NOT}
  • {NAND}
  • {NOR}

59
Full Adder - Truth Table
60
Gate Level (Logic Level)
  • processing with binary digits (bits)
  • 0 and 1
  • design components
  • simple and memoryless logic gates
  • flip-flops of bit-storage devices
  • combinational logic
  • flip-flops
  • sequential circuits

61
Combinational Logic(I)
  • A combinational function is a logic or Boolean
    function
  • mapping a set of $2^n$ input combinations of n
    binary variables onto the output values 0 and 1
  • z(x1, x2, ..., xn)
  • function z can be defined as a truth table

62
Combinational Logic(II)
  • The truth table of the full adder is shown in
    Figure 2.9(a) on page 74
  • a pair of three-binary-variable functions
  • the sum output s0(x0, y0, c-1)
  • the carry output c0(x0, y0, c-1)
  • realization using half adders
  • realization using AND and OR gates
  • realization using NAND, NOR, and NOT gates

63
Standard Gates
  • AND
  • x1 · x2 = 1 if and only if x1 and x2 are both 1
  • OR
  • x1 + x2 = 1 if and only if x1 or x2 or both are 1
  • EXCLUSIVE-OR
  • x1 ⊕ x2 = 1 if and only if x1 or x2 but not both
    are 1
  • NOT (inverter)
  • $\overline{x_1} = 1$ if and only if x1 = 0

64
Functional Complete (I)
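  • {AND, OR, NOT}
  • {AND, NOT}
  • $a + b = \overline{\overline{a + b}} = \overline{\overline{a} \cdot \overline{b}}$
  • {NAND}
  • $\overline{a} = \overline{a \cdot a}$
  • $a \cdot b = \overline{\overline{a \cdot b}} = \overline{\overline{a \cdot b} \cdot \overline{a \cdot b}}$
  • $a + b = \overline{\overline{a + b}} = \overline{\overline{a} \cdot \overline{b}} = \overline{\overline{a \cdot a} \cdot \overline{b \cdot b}}$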

65
Functional Complete (II)
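  • {NOR}
  • $\overline{a} = \overline{a + a}$
  • $a \cdot b = \overline{\overline{a \cdot b}} = \overline{\overline{a} + \overline{b}} = \overline{\overline{a + a} + \overline{b + b}}$
  • $a + b = \overline{\overline{a + b}} = \overline{\overline{a + b} + \overline{a + b}}$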
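
Functional completeness of NAND and of NOR can be confirmed by building NOT, AND, and OR from each gate alone and comparing against the ordinary operators. The helper names below are assumptions for illustration only:

```python
def nand(a, b):
    return 1 - (a & b)


def nor(a, b):
    return 1 - (a | b)


# NOT, AND, OR built only from NAND.
def not_from_nand(a):
    return nand(a, a)

def and_from_nand(a, b):
    return nand(nand(a, b), nand(a, b))

def or_from_nand(a, b):
    return nand(nand(a, a), nand(b, b))


# NOT, AND, OR built only from NOR.
def not_from_nor(a):
    return nor(a, a)

def and_from_nor(a, b):
    return nor(nor(a, a), nor(b, b))

def or_from_nor(a, b):
    return nor(nor(a, b), nor(a, b))


BITS = (0, 1)
ALL_MATCH = all(
    not_from_nand(a) == not_from_nor(a) == 1 - a
    and and_from_nand(a, b) == and_from_nor(a, b) == (a & b)
    and or_from_nand(a, b) == or_from_nor(a, b) == (a | b)
    for a in BITS for b in BITS
)
```

Each construction is a direct transcription of one identity, with the double inversion realized by feeding a gate's output into both inputs of a second gate.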

66
Boolean Algebra
  • George Boole (1815-1864)
  • Boolean equation
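  • $s_0 = \overline{x_0}\,\overline{y_0}\,c_{-1} + \overline{x_0}\,y_0\,\overline{c_{-1}} + x_0\,\overline{y_0}\,\overline{c_{-1}} + x_0\,y_0\,c_{-1}$
  • $c_0 = (x_0 + c_{-1})(x_0 + y_0)(y_0 + c_{-1})$
  • SOP (sum-of-products)
  • POS (product-of-sums)
  • two-level logic circuit: the longest IO path
    determines the propagation delay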
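
The sum-of-products form for the sum and the product-of-sums form for the carry can be checked against ordinary binary addition. A minimal sketch, with `full_adder_equations` as a hypothetical name and x, y, c standing for x0, y0, c-1:

```python
def full_adder_equations(x, y, c):
    """Evaluate the SOP sum and POS carry equations on 0/1 inputs."""
    nx, ny, nc = 1 - x, 1 - y, 1 - c          # complemented inputs
    s = (nx & ny & c) | (nx & y & nc) | (x & ny & nc) | (x & y & c)
    carry = (x | c) & (x | y) & (y | c)
    return s, carry


# A full adder must satisfy x + y + c = s + 2 * carry for every input.
CORRECT = all(
    x + y + c == s + 2 * carry
    for x in (0, 1) for y in (0, 1) for c in (0, 1)
    for s, carry in [full_adder_equations(x, y, c)]
)
```

Both equations are two-level: every input reaches the output through at most one AND stage and one OR stage, ignoring input inverters.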

67
Balance Logic Design
  • The balance between hardware cost and operating
    speed depends on IC technology considerations
  • Two-level adder has the shortest propagation
    delay
  • Two-level adder has more gates and has a higher
    hardware cost

68
Logic Synthesizer
  • To design circuits automatically via
    computer-aided synthesis tools.
  • Restrictions of the synthesizer
  • fan-in of a gate
  • fan-out of a gate
  • gate minimization
  • an intractable problem
  • only practical for small circuits

69
Flip-Flops (I)
  • A flip-flop is a 1-bit storage element
  • a sequential logic circuit = a combinational
    circuit + memory
  • synchronization
  • external clock signal CK of a flip-flop
  • Four-bit ripple-carry D-flip-flop a serial
    adder

70
Flip-Flops (II)
  • Edge triggering state changes around one edge of
    CK (clock signal)
  • 0-to-1
  • 1-to-0
  • an edge-triggered D (delay) flip-flop
  • 0-to-1 triggering edge of clock signal of CK
  • other well-known flip-flops
  • JK flip-flop
  • SR flip-flop
  • T flip-flop

71
Flip-Flops (III)
  • Edge triggering
  • a sequence of discrete state values y(i)
  • one for every clock cycle i
  • Timing diagram - Figure 2.11
  • Characteristic equation of D flip-flop
  • y(i+1) = D(i)

72
Sequential circuits
  • A combinational circuit + a set of flip-flops
  • A serial adder: Figure 2.12
  • A four-bit-stream serial adder: Figure 2.13
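
A serial adder combines one full adder with one D flip-flop that holds the carry between clock cycles. The sketch below assumes bits arrive least-significant first and that the flip-flop starts at 0; the name `serial_add` is hypothetical.

```python
def serial_add(x_bits, y_bits):
    """Add two equal-length bit streams, least-significant bit first."""
    carry = 0          # flip-flop state y(i), assumed reset to 0
    sum_bits = []
    for x, y in zip(x_bits, y_bits):
        sum_bits.append(x ^ y ^ carry)                # combinational sum output
        d = (x & y) | (x & carry) | (y & carry)       # carry-out feeds the D input
        carry = d                                     # y(i+1) = D(i) on the clock edge
    return sum_bits, carry


# 6 + 3 = 9 with four-bit streams, LSB first.
bits, final_carry = serial_add([0, 1, 1, 0], [1, 1, 0, 0])
```

The loop body is the combinational part and the assignment to `carry` plays the role of the clocked flip-flop, so one addition of n-bit numbers takes n clock cycles.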

73
Register Level
  • Register-transfer level
  • a grouped, ordered sets of small combinational or
    sequential circuits
  • process or store words or vectors
  • combinational
  • Multiplexers
  • Decoders and encoders
  • sequential
  • Shift registers
  • Counters

74
Component Types
  • MSI parts in IC series
  • Standard cells in VLSI
  • with or without the functional completeness
    property
  • no universal graphic symbols
  • usually by an abbreviated description of their
    behavior

75
Generic Block Representation of a Register-Level
Component
  • Data input lines
  • Data output lines
  • Control input lines
  • select lines
  • enable lines
  • clock lines
  • etc.
  • Control output lines

[Block symbol: multifunction unit with m-bit and k-bit line groups]
76
Generic Block Representation
  • select lines: choose one of several possible
    operations that the unit is to perform
  • enable lines: specify the time or condition for a
    selected operation to be performed
  • active, enable, or asserted state
  • an overbar: active low, that is, the enable or
    active value is 0

77
Operations
  • Gate level: $B = \{0, 1\}$
  • Register level: $B^m$, the set of $2^m$ m-bit words

78
Multiplexers (MUX)
  • a device intended to route data from one of
    several sources to a common destination
  • k-input, m-bit MUX, $k = 2^p$

[Figure: multiplexer (MUX) with m-bit data inputs X0, X1, ..., X(2^p - 1), a p-bit select input S, an enable input e, and an m-bit data output Z]
79
Multiplexers as Function Generators
  • A $2^n$-input, 1-bit multiplexer MUX can generate
    any n-variable function
  • z(v1, v2, ..., vn-1)
  • A 2-input, 4-bit multiplexer: Figure 2.20
  • An 8-input multiplexer: Figure 2.21
  • Multiplexer-based full adder: Figure 2.22
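
Wiring a truth table to the data inputs and the variables to the select lines turns a multiplexer into a function generator. A minimal sketch, assuming hypothetical helpers `mux` and `function_from_truth_table` and select bits given most-significant first:

```python
def mux(data_inputs, select_bits):
    """1-bit multiplexer: route the data input addressed by the select bits."""
    index = 0
    for bit in select_bits:            # most-significant select bit first
        index = (index << 1) | bit
    return data_inputs[index]


def function_from_truth_table(truth_table):
    """Fix the data inputs to a truth table; the variables drive the select lines."""
    return lambda *variables: mux(truth_table, variables)


# Full-adder outputs from two 8-input multiplexers, indexed by (x, y, c).
full_adder_sum = function_from_truth_table([0, 1, 1, 0, 1, 0, 0, 1])
full_adder_carry = function_from_truth_table([0, 0, 0, 1, 0, 1, 1, 1])
```

This version spends one data input per truth-table row; it does not attempt the economy of applying one variable to the data inputs.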

80
Decoders
  • 1-out-of-$2^n$ or 1/$2^n$ decoder
  • a 1/4 decoder Figure 2.23
  • used in RAMs to select storage cells to be read
    from or written into.

81
Encoders
  • To generate the address or index of an active
    input line
  • $2^n$-to-n encoder
  • x0x1x2x3x4x5x6x7 = 00000010 gives z0z1z2 = 110

83
Processor (System) Level
  • The highest in the computer design hierarchy
  • concerned with the storage and processing of
    blocks of information
  • more complex and based on VLSI technology
  • very much a heuristic process

84
Processor-Level Components
  • Four main groups
  • processors
  • memories
  • IO devices
  • interconnection networks

85
Central Processing Unit
  • A general-purpose, instruction-set processor
  • specialized processors such as IOPs
  • operates on word-organized instructions and data

86
Chapter Three: Processor Basics
  • the overall design of instruction-set processors
  • CPU of a computer
  • microprocessors: RISC and CISC types

87
CPU Organization
  • Fundamentals
  • External communication
  • User and supervisor modes
  • CPU operation
  • Accumulator-based CPU
  • Programming consideration
  • Instruction set
  • Program execution

88
Fundamentals
  • To execute sequences of instructions (programs),
    which are stored in an external main memory.
  • Program execution steps
  • CPU transfers instructions with operands from
    main memory to registers in CPU.
  • CPU executes the instructions sequentially except
    when execution sequence is altered by a branch
    instruction.
  • when necessary, CPU transfers results from CPU
    registers to main memory

89
External Communication(I)
  • without a cache
  • CPU communicates directly with the main memory
  • a high-capacity multi-chip RAM (random-access
    memory)
  • disadvantage: speed disparity
  • CPU is significantly faster than memory (5 to 10
    times)

90
External Communication(II)
  • with a cache
  • CM positioned between CPU and MM
  • CM is faster and smaller than MM
  • CM can reside wholly or in part in CPU
  • typically permits CPU to load or store in a
    single clock cycle
  • advantage: CM is transparent to the CPU's
    instructions
  • CM and MM appear as a single, seamless memory
    space of $2^m$ addressable storage locations
  • further discussion in chapter 6

91
External Communication(III)
  • with IO devices
  • IO ports: IO devices are associated with
    addressable registers
  • CPU can load/store a word from/to IO ports
  • IO-mapped versus memory-mapped IO
  • memory-mapped: IO ports share the same set of
    memory addresses
  • IO-mapped: IO instructions produce IO control
    signals but not memory-referencing signals

92
User and Supervisor Modes
  • user programs and supervisor programs
  • a user or application program handles a specific
    application
  • a supervisor program manages various routine
    aspects of the computer system
  • normally, CPU switches back and forth between
    user and supervisor programs
  • interrupt is a way of requesting and switching to
    supervisor mode

93
CPU Operation(I)
  • Overview of CPU behavior (Figure 3.2, p. 140)
  • instruction cycle: a fetch step and an execution
    step
  • micro-operations (register-transfer operations)
    within an instruction cycle

94
CPU Operation(II)
  • the time required for the shortest well-defined
    CPU micro-operation is the CPU cycle time or
    clock period, Tclock
  • Tclock: the CPU cycle time
  • f: the CPU's clock frequency in MHz
  • Tclock = 1/f
  • each instruction is fetched from M in one CPU
    cycle when M is CM
  • the execution step runs in another CPU cycle
95
Accumulator-Based CPU(I)
  • to keep CPU relatively small
  • a small set of registers and circuits to
    implement a functionally complete set of
    instructions
  • the central role of registers --- the accumulator
    register

96
Accumulator-Based CPU(II)
  • a small accumulator-based CPU (Figure 3.3, p.
    141)
  • PCU and DPU
  • fetch step
  • IR.AR := M(PC)
  • IR := op, AR := adr
  • load/store
  • AC := M(adr)
  • M(adr) := AC

Programming Consideration(I)
  • data processing
  • three-operand operations
  • Z := X + Y

99
Programming Consideration(II)
  • single-address instructions (p. 142)

    HDL format       ASM format
    AC := M(X)       LD X
    DR := AC         MOV DR,AC
    AC := M(Y)       LD Y
    AC := AC + DR    ADD
    M(Z) := AC       ST Z

  • implicit operands: AC and DR
  • load/store architecture
  • only uses the load and store instructions to
    access memory

100
Programming Consideration(III)
  • memory-referencing instruction form:
    AC := f_i(AC, M(adr))

    HDL format         ASM format
    AC := M(X)         LD X
    AC := AC + M(Y)    ADD Y
    M(Z) := AC         ST Z

  • more complicated instruction-decoding logic in
    PCU and more execution time in ADD
  • fewer instructions --- reduced overall execution
    time?

101
Programming Consideration(IV)
  • the cost-performance debate of RISC versus CISC

102
Instruction Set
  • an instruction set with the flavor of RISC (a
    load/store architecture)
  • data transfer: load, store, move register
  • data processing: add, subtract, and, not
  • program control: branch, branch zero
  • (Figure 3.4, p. 143)
  • Example 3.1: a multiplication program (p. 144)

103
A Multiplication Program(I)
    Line  Location  Instruction or data
    0     one       00001
    1     mult      N
    2     ac        00000
    3     prod      00000
    4               ST ac
    5     loop      LD mult
    6               BZ exit
    .
    17              BRA loop
    18    exit      ...

104
A Multiplication Program(II)
    Line  Location  Instruction or data
    7               LD one
    8               MOV DR, AC
    9               LD mult
    10              SUB
    11              ST mult
    12              LD ac
    13              MOV DR, AC
    14              LD prod
    15              ADD
    16              ST prod
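
Lines 4 to 17 of the program can be traced with a toy interpreter for the accumulator machine. This is a sketch under assumptions not stated on the slides: the multiplicand is taken to be in AC on entry, SUB is taken to compute AC := AC − DR, and the names `PROGRAM`, `LABELS`, and `multiply` are invented for illustration.

```python
# Instructions at lines 4..17, keyed by line number: (opcode, operand).
PROGRAM = {
    4: ("ST", "ac"),
    5: ("LD", "mult"),
    6: ("BZ", "exit"),
    7: ("LD", "one"),
    8: ("MOV", None),    # MOV DR, AC
    9: ("LD", "mult"),
    10: ("SUB", None),
    11: ("ST", "mult"),
    12: ("LD", "ac"),
    13: ("MOV", None),   # MOV DR, AC
    14: ("LD", "prod"),
    15: ("ADD", None),
    16: ("ST", "prod"),
    17: ("BRA", "loop"),
}
LABELS = {"loop": 5, "exit": 18}


def multiply(multiplicand, n):
    """Run the program with AC = multiplicand and mult = N; return prod."""
    memory = {"one": 1, "mult": n, "ac": 0, "prod": 0}
    ac, dr, pc = multiplicand, 0, 4
    while pc in PROGRAM:               # line 18 (exit) is outside the program
        op, arg = PROGRAM[pc]
        pc += 1                        # PC := PC + 1 during the fetch step
        if op == "LD":
            ac = memory[arg]
        elif op == "ST":
            memory[arg] = ac
        elif op == "MOV":
            dr = ac
        elif op == "ADD":
            ac = ac + dr
        elif op == "SUB":
            ac = ac - dr               # assumed operand order
        elif op == "BZ" and ac == 0:
            pc = LABELS[arg]
        elif op == "BRA":
            pc = LABELS[arg]
    return memory["prod"]
```

Each pass through the loop decrements mult by one and adds the saved multiplicand to prod, so the loop runs N times before BZ exits.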

105
Program Execution
  • cycle by cycle execution (pp. 146)
  • PCU actions
  • fetch cycle includes the pair of
    register-transfer operations
  • IR.AR := M(PC)
  • PC := PC + 1

106
Architecture Extensions(I)
  • multipurpose register set for storing data and
    address
  • register file
  • additional data, instruction, and address types
  • fixed-point multiply and divide instructions
  • call and return instructions

107
Architecture Extensions(II)
  • register to indicate computation status
  • condition code or flag register
  • zero result or divide by zero, ...
  • program control stack
  • procedure calling
  • external interrupts
  • push-down stack - stack pointer
  • Figure 3.7, pp. 148.

108
Pipelining(I)
  • CPU speedup techniques
  • cache memories
  • instruction-level parallelism
  • in DPU
  • in PCU
  • Overlapping instructions in a two-stage
    instruction pipeline
  • Figure 3.8, pp. 150.

109
Pipelining(II)
  • Branch instruction
  • reduce the efficiency of instruction pipelining
  • More than two stages
  • to increase the level of parallelism attainable

110
ARM6 Microprocessor
  • Organization of the ARM6
  • SR, PC, WDR, RDR, AR, IR, ALU, Shifter, Buses
    (Figure 3.9, pp. 152)
  • Core instruction set of the ARM6
  • Data transfer, Data processing, Program control
    (Figure 3.10, pp. 153 )
  • Shift or rotation operation
  • LSL: logical shift left
  • MOV R0, R1, LSL #2 performs R0 := R1 × 4

111
Motorola 680X0 family
  • Organization of the 68020
  • D0-D7, A0-A7, PC, CC (Figure 3.11, pp. 155)
  • Instruction set of the 68020
  • Data transfer, Data processing, Program control,
    External synchronization (Figure 3.12, pp.
    156-157 )

112
680X0 ASM for Vector Addition
  • Vector addition (Figure 3.13, pp. 158)
  •        MOVE.L  #2001, A0
           MOVE.L  #3001, A1
           MOVE.L  #4001, A2
    START  ABCD    -(A0), -(A1)
           MOVE.B  (A1), -(A2)
           CMPA    #1001, A0
           BNE     START

113
Chapter Six: Memory Organization
  • impact on performance
  • survey storage-device technologies
  • multilevel hierarchical memory systems
  • cache memories

114
Memory Types
  • CPU registers
  • working memory for temporary storage of
    instructions and data
  • Main memory (primary memory)
  • five or more clock cycles are usual
  • Secondary memory
  • in milliseconds
  • Cache
  • one to three clock cycles

115
Performance and Cost
  • Cost/performance trade-off
  • cost of memory: c = C/S dollars/bit
  • access time tA 10y

117
The Von Neumann Bottleneck
  • The speed mismatch between the CPU and main
    memory.
  • Storage density has grown rapidly
  • Access times have decreased at a much slower rate
  • capacity of a single-chip RAM
  • 1975: 4 Kb (kilobit)
  • 1985: 256 Kb
  • 1995: 16 Mb (megabit)

118
Access Modes(I)
  • Random-access memory
  • storage can be accessed in any order
  • access time is independent of location
  • Serial-access memory
  • storage can be accessed only in a certain
    predetermined sequence
  • magnetic disks, magnetic tapes, and optical disks
    (CD-ROM)
  • access time depends on its position relative to
    the read-write head

119
Access Modes(II)
  • Serial access tends to be slower than random
    access.
  • Semirandom-access mode
  • magnetic disks and CD-ROM
  • if each track has its own read-write head, tracks
    can be accessed randomly
  • access within track is serial

120
Memory Retention(I)
  • Read-only memory
  • memories cannot be altered on-line
  • a nonerasable storage device
  • compact disk ROM
  • Programmable read-only memory
  • memories can be changed off-line
  • CD-recordable disk (CD-R) as a programmable CD

121
Memory Retention(II)
  • DRO destructive readout
  • reading will destroy the stored data
  • restoration
  • each read operation is followed by a write
    operation that restores the data
  • NDRO: nondestructive readout
  • reading does not affect the stored data

122
Memory Retention(III)
  • Dynamic memory
  • Refreshing periodically
  • a stored 1 tends to become 0, or vice versa, due
    to some physical decay process
  • a capacitor charge representing a stored 1 tends
    toward 0 as the charge leaks away
  • Static memory
  • requires no refreshing
  • shorter access time (faster) than DRAM

123
Memory Retention(IV)
  • Volatile
  • destroy the storage data if power is lost
  • most IC memories are volatile
  • Nonvolatile
  • most magnetic and optical memories

124
Memory Retention(V)
  • Cycle time $t_M$
  • the minimum time that must elapse between the
    start of two consecutive access operations
  • $t_M \geq t_A$
  • Data-transfer rate or bandwidth $b_M$
  • $b_M = w/t_M$
  • w is the number of bits that can be transferred
    simultaneously to or from the memory

125
Memory Retention(VI)
  • Reliability - MTBF
  • the mean time before failure
  • no moving parts (mechanical motion) has much
    higher reliability
  • very high density or data-transfer rate has the
    reliability problem
  • error-detecting and error-correcting codes can
    increase the reliability of any memory
  • Performance parameters tA tM bM

126
Memory Retention(VII)
Technology               Primary     Access      Alter-    Permanence            Access
                         medium      mode        ability                         time
Bipolar semiconductor    Electronic  Random      R/W       NDRO, volatile        10 ns
Metal oxide (MOS)        Electronic  Random      R/W       DRO, NDRO, volatile   50 ns
semiconductor
Magnetic (hard) disk     Magnetic    Semirandom  R/W       NDRO, nonvolatile     10 ms
Magneto-optical disk     Optical     Semirandom  R/W       NDRO, nonvolatile     50 ms
Compact disk ROM         Optical     Semirandom  R         NDRO, nonvolatile     100 ms
Magnetic tape cartridge  Magnetic    Serial      R/W       NDRO, nonvolatile     1 s
  • Figure 6.6, p. 407

127
Random-Access Memory(II)
  • Semiconductor RAMs DRAM and SRAM
  • a capacitor with one transistor versus six
    transistors (Figure 6.9, pp. 410)
  • destructive readout and subsequently written back
    to the cell is required for DRAM

128
MT4LC8M8E1 (Micron Technology, 1997)
  • 64 Mb ($2^{26}$ bits) = $2^{23}$ 8-bit bytes = 8M × 8 bits
  • memory address size m = 23
  • data word size w = 8
  • 13-bit row address (external address lines)
  • 10-bit column address
  • control lines: RAS, CAS, WE, OE
  • address bus: A0-A12
  • data bus: DQ1-DQ8 (32 pins)

129
MT4LC8M8E1 (Micron Technology, 1997)
  • 64 Mb ($2^{26}$ bits) = $2^{23}$ 8-bit bytes = 8M × 8 bits
  • $t_A$ = 50 ns, $t_M$ = 90 ns, page mode
  • $t_{REF}$: refresh at least once every 64 ms
  • refresh an entire row of storage locations in a
    single read cycle
  • if a one-row read operation takes 90 ns, the total
    refreshing time is 90 ns × 8192 = 0.737 ms, and
    the fraction of time spent refreshing is
    0.737 / 64 = 1.15% (a negligible amount)
  • Figure 6.13
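
The refresh-overhead estimate is a two-step calculation that can be reproduced directly. A minimal sketch, with `refresh_overhead` as a hypothetical helper and the figures taken from the slide (8192 rows from the 13-bit row address, 90 ns per row, 64 ms refresh interval):

```python
def refresh_overhead(rows, row_cycle_ns, refresh_interval_ms):
    """Return (total refresh time in ms, fraction of time spent refreshing)."""
    refresh_time_ms = rows * row_cycle_ns * 1e-6   # ns -> ms
    return refresh_time_ms, refresh_time_ms / refresh_interval_ms


total_ms, fraction = refresh_overhead(rows=2 ** 13, row_cycle_ns=90,
                                      refresh_interval_ms=64)
```

The fraction comes out a little above one percent, which is why refresh is treated as a negligible cost for this part.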

130
Other Semiconductor Memories(I)
  • Read-only memory (ROM)
  • without writing ability and nonvolatile
  • store permanent code at the instruction and
    microinstruction levels
  • Programmable ROM
  • can be programmed only once
  • can be programmed repeatedly (FPGA)
  • erased in bulk off-line

131
Other Semiconductor Memories(II)
  • Flash memory: similar to a PROM
  • nonvolatile storage and can be programmed and
    erased on-line
  • can be programmed a bit at a time
  • can be erased only in large blocks, that is, by
    a flash erase process
  • can randomly read a bit and write a block
  • storage density and access time are comparable to
    those of DRAM

132
Fast RAM interfaces(I)
  • The gap between the speeds of microprocessors and
    those of cheap but slow DRAMs
  • Use a bigger memory word
  • Access more than one word at a time
  • interleaving rule
  • interference or contention occurs if two or more
    addresses require simultaneous access to the same
    module

133
Fast RAM interfaces(II)
  • Synchronous DRAM (SDRAM)
  • achieves a speed doubling by pipelining its
    internal operations and by implementing two-way
    address interleaving.
  • Cached DRAM (CDRAM)
  • an on-chip cache realized by a small, fast SRAM
    that acts as a high-speed buffer or front-end
    memory for the main DRAM
  • have a fast burst mode of operation

134
Fast RAM interfaces(III)
  • Rambus DRAM (1992)
  • the master transmits an initial packet on
    Rambus channel
  • each Rambus DRAM chip examines it
  • DRAM unit Ri containing the address returns
    ready or busy to the master
  • if Ri ready, the master proceeds to transfer to
    or from Ri a data packet of up to 256 bytes in
    burst mode at speed up to 500 MB/s, that is 1
    byte every 2 ns.
  • if Ri busy, the master must try again later...

135
Serial-Access Memories
  • Tracks
  • data transfer to or from a track serially
  • low cost per bit
  • long access time
  • read-write head positioning time
  • slow speed of tracks moving
  • serial data transfer

136
Access Methods(I)
  • Seek time tS
  • the average time to move a head from one track to
    another
  • Rotational latency time tL
  • the average time to rotate the information cell
    close to the head
  • Block
  • all words in a block are stored in consecutive
    locations such that an entire block takes one
    seek and one latency time

137
Access Methods(II)
  • Data-transfer rate
  • V cm/s the speed of the stored information
    relative to the read-write head
  • T bits/cm the storage density along the track
  • data-transfer rate = TV bits/s

138
Access Methods(III)
  • Time to access a block in a serial-access memory:
    tB = tS + 1/(2r) + n/(rN)
  • tS: the average seek time
  • r: the revolutions per second
  • 1/(2r): the average latency of a track
  • n: the number of words per block
  • N: the number of words per track
  • n/(rN): the data transfer time

139
Magnetic Hard-Disk
  • 9.3 GB Quantum XP39100, 1996 pp. 422
  • tS = 7.9 ms
  • r = 0.12 revs/ms
  • n = 8
  • N = 144 × 512 = 73,728 bytes/track
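
As a rough illustration, the block-access formula tB = tS + 1/(2r) + n/(rN) can be evaluated with the disk parameters above. The block size in the example call (one 512-byte sector) is an assumption chosen for illustration and is not taken from the slide; the function name is hypothetical.

```python
def block_access_time_ms(t_seek_ms, revs_per_ms, block_words, track_words):
    """tB = tS + 1/(2r) + n/(rN), with all times in milliseconds."""
    latency = 1 / (2 * revs_per_ms)                     # average rotational latency
    transfer = block_words / (revs_per_ms * track_words)  # time to read the block
    return t_seek_ms + latency + transfer


# tS = 7.9 ms, r = 0.12 revs/ms, N = 144 x 512 bytes per track (from the slide);
# n = 512 bytes is an assumed block size.
t_b = block_access_time_ms(7.9, 0.12, 512, 144 * 512)
print(f"{t_b:.2f} ms")
```

With these numbers the seek time and rotational latency dominate; the transfer term is well under a tenth of a millisecond.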

140
Magnetic Tape
  • Cartridge or Cassette
  • Data stored in parallel
  • 80-track tape with
  • density 110 Kb/in
  • tape speed 50 in/s
  • max. data-transfer rate = 110 Kb/in × 80 / 8 × 50 in/s
    = 55 MB/s
  • 200 m tape: 55/50 × 200/0.0254 = 8,661 MB = 8.661 GB
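
The tape figures can be reproduced directly. The sketch below only restates the slide's arithmetic, taking 1 K = 1000 and 1 inch = 0.0254 m; the constant names are illustrative.

```python
DENSITY_BITS_PER_IN = 110_000   # 110 Kb/in per track
TRACKS = 80                     # tracks read in parallel
SPEED_IN_PER_S = 50             # tape speed
TAPE_LENGTH_M = 200
METERS_PER_INCH = 0.0254

# bits per inch across all tracks -> bytes per inch -> bytes per second
rate_bytes_per_s = DENSITY_BITS_PER_IN * TRACKS / 8 * SPEED_IN_PER_S

# bytes per inch times the tape length in inches
capacity_bytes = rate_bytes_per_s / SPEED_IN_PER_S * (TAPE_LENGTH_M / METERS_PER_INCH)

print(f"{rate_bytes_per_s / 1e6:.0f} MB/s, {capacity_bytes / 1e9:.3f} GB")
```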

141
Optical Memory
  • CD-ROM (compact disk)
  • CD-R (recordable)
  • CD-RW (rewritable)
  • DVD (digital video disk)

142
Memory Systems
  • General characteristics
  • Multilevel memories
  • Hierarchical organization
  • Two key design issues
  • automatic translation of addresses
  • dynamic relocation of data

143
Multilevel Memories
  • An n-level system (M1, M2, ..., Mn)
  • two level
  • main memory (semiconductor DRAMs)
  • secondary memory (magnetic-disk units)
  • three level
  • cache memory (semiconductor SRAMs)
  • split cache (Instruction I-cache, Data D-cache)
  • four level
  • level 1 cache
  • level 2 cache
  • both are the nonsplit or unified caches

144
General Characteristics(I)
  • Two adjacent memory levels Mi and Mi+1
  • cost per bit Ci > Ci+1
  • access time tAi < tAi+1
  • storage capacity Si < Si+1
  • Communication between levels
  • CPU can communicate directly with M1
  • M1 can communicate directly with M2, and so on
  • except that CPU can bypass cache and go to main
    memory

145
General Characteristics(II)
  • Relocation of addresses and transferring data
    between two adjacent levels is a relatively slow
    process
  • requires some means of predicting the future
    addresses generated by the CPU

146
Cache and Virtual Memory
  • The cache and main memory act as a single memory
    to the software
  • The main and secondary memories are NOT
    transparent to system software
  • The main and secondary memories are transparent
    to user code --- virtual memory --- like a
    single, larger, and directly addressable memory

147
Reasons for Virtual Memory
  • To free user programs from the need of storage
    allocation
  • To permit sharing of memory space among users
  • To make programs independent of physical
    configuration and capacity of memories
  • To achieve the very low access time and cost per
    bit

148
Locality of Reference(I)
  • The characteristic of computer programs
  • the predictability of memory addresses
  • the locality of reference

149
Locality of Reference(II)
  • Spatial locality
  • instructions and data are specified, and
    subsequently stored in memory, in sequence
  • Temporal locality --- Working set W(t, T)
  • loops in programs tend to be executed
    repeatedly
  • W(t, T), the set of addresses referenced during
    the time interval (t-T, t), tends to change
    rather slowly

150
Cost and Performance(I)
  • Factors of performance
  • the address-reference statistics
  • the access time (tA)
  • the storage capacity (Si)
  • the size of blocks (pages) (SPi)
  • the allocation algorithm (blocks-swapping
    process)

151
Cost and Performance(II)
  • The average cost per bit of memory
  • to reach the goal of making c approach c2
  • S1 must be smaller than S2
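
The slide's formula for the average cost per bit did not survive extraction. Assuming the usual capacity-weighted average for a two-level memory, c = (c1·S1 + c2·S2) / (S1 + S2), the sketch below shows why c approaches c2 when S1 is much smaller than S2. The numbers are made up for illustration.

```python
def average_cost_per_bit(c1, s1, c2, s2):
    """Assumed definition: capacity-weighted average cost of a two-level memory."""
    return (c1 * s1 + c2 * s2) / (s1 + s2)


# Hypothetical figures: M1 is 100x more expensive per bit than M2.
small_m1 = average_cost_per_bit(c1=100, s1=1, c2=1, s2=999)
large_m1 = average_cost_per_bit(c1=100, s1=500, c2=1, s2=500)
print(small_m1, large_m1)  # the small-M1 system costs close to c2 = 1 per bit
```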

152
Cost and Performance(III)
  • The performance can be measured by the hit ratio
    H
  • the probability that a virtual address generated
    by the CPU refers to information currently stored
    in the faster memory
  • hit: reference to M1
  • miss: reference to M2
  • miss ratio = 1 - H

153
Cost and Performance(IV)
  • N1 = the number of references to M1
  • N2 = the number of references to M2
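
The formula linking N1 and N2 to the hit ratio is missing from the extracted slide. Assuming the standard estimate H = N1 / (N1 + N2), a minimal sketch looks like this; the reference counts are hypothetical.

```python
def hit_ratio(n1, n2):
    """Assumed estimate: fraction of references satisfied by M1."""
    return n1 / (n1 + n2)


h = hit_ratio(n1=950, n2=50)   # hypothetical reference counts
miss = 1 - h
print(h, round(miss, 2))
```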

154
Address Translation(I)
  • Address mapping or address translation process
  • map virtual address onto real address
  • by programmer
  • by compiler
  • by loader
  • by run-time

155
Address Translation(II)
  • Static translation
  • complete the translation when the program is loaded
  • Dynamic translation
  • complete the translation during execution
  • run-time address translation by MMU (Memory
    management unit)

156
Base Addressing
  • Effective address = base + displacement:
    Aeff = B + D, or Aeff = B.D (concatenation)
  • Limit address: length of block
  • Figure 6.25, pp. 434
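
Base addressing with a bounds check might be sketched as follows. Treating the limit as the block length, and rejecting displacements outside it, is an assumption made for illustration; the function name is hypothetical.

```python
def effective_address(base, displacement, limit):
    """Aeff = B + D, rejecting displacements outside the block (assumed check)."""
    if not 0 <= displacement < limit:
        raise ValueError("displacement outside the block")
    return base + displacement


print(hex(effective_address(0x4000, 0x10, 0x100)))  # 0x4010
```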

157
Translation Look-Aside Buffer
  • TLB
  • AV = BV.D
  • BR = TLB(BV)
  • AR = BR.D
  • Figure 6.26, pp. 434
  • MIPS R2/3000, pp. 434-435
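
The three TLB steps (split AV into BV.D, look up BR, form AR = BR.D) can be modeled with a dictionary. The 12-bit displacement width and the table contents are assumptions for illustration, not values from the slide.

```python
D_BITS = 12  # assumed displacement width


def translate(av, tlb):
    """Return the real address AR = BR.D, or None on a TLB miss."""
    bv = av >> D_BITS                # virtual base (block) address
    d = av & ((1 << D_BITS) - 1)     # displacement is passed through unchanged
    br = tlb.get(bv)                 # BR = TLB(BV)
    if br is None:
        return None
    return (br << D_BITS) | d


tlb = {0x12: 0x7}                    # hypothetical mapping BV -> BR
print(hex(translate(0x12345, tlb)))  # 0x7345
print(translate(0x99000, tlb))       # None (miss)
```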

158
Pages and Segments(I)
  • A page (frame) is a fixed-size block
  • suitable for physical partitioning and swapping
    of information
  • A segment is a logical block of program or data
  • its boundaries correspond to the natural program
    or data boundaries
  • stack segment

159
Pages and Segments(II)
  • Two-stage address translation
  • AV = SI.PI.D
  • PB = Segment TLB(SB.SI)
  • P = Page TLB(PB.PI)
  • AR = P.D
  • Figure 6.30, pp. 440
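
A minimal sketch of the two-stage translation, assuming a 4-bit page index and an 8-bit displacement and modeling both tables as dictionaries keyed by (base, index) pairs. The field widths, table contents, and names are all hypothetical.

```python
PI_BITS, D_BITS = 4, 8  # assumed field widths


def translate(av, sb, segment_table, page_table):
    """AV = SI.PI.D -> PB = segment_table[SB.SI] -> P = page_table[PB.PI] -> AR = P.D"""
    d = av & ((1 << D_BITS) - 1)
    pi = (av >> D_BITS) & ((1 << PI_BITS) - 1)
    si = av >> (D_BITS + PI_BITS)
    pb = segment_table[(sb, si)]     # stage 1: segment index -> page-table base
    p = page_table[(pb, pi)]         # stage 2: page index -> page frame
    return (p << D_BITS) | d


segment_table = {(0, 2): 5}          # segment 2 uses the page table at base 5
page_table = {(5, 3): 0x1A}          # page 3 of that segment is in frame 0x1A
print(hex(translate(0x237F, 0, segment_table, page_table)))  # 0x1a7f
```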

160
Memory Address Translation in Intel Pentium
  • 32-bit linear address
  • N = 32-bit effective address AV + Segment
    TLB(STB.14-bit Ls) = 10-bit Nd . 10-bit Np .
    12-bit displacement
  • AR = Page table TLB(Page directory
    TLB(PDB.Nd).Np)
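
The 10/10/12 split of the 32-bit linear address can be written with shifts and masks. This sketch only extracts the three fields; the table lookups themselves are not modeled, and the function name is illustrative.

```python
def split_linear(n):
    """Split a 32-bit linear address into (Nd, Np, displacement)."""
    nd = (n >> 22) & 0x3FF   # top 10 bits: page-directory index
    np_ = (n >> 12) & 0x3FF  # next 10 bits: page-table index
    d = n & 0xFFF            # low 12 bits: displacement within the page
    return nd, np_, d


print(split_linear(0x00403005))  # (1, 3, 5)
```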

161
Page Size
  • Utilization versus page size
  • Figure 6.32, pp. 443
  • Hit ratio versus page size
  • Figure 6.33, pp. 443

162
Memory Allocation
  • Nonpreemptive allocation
  • none of the blocks already occupying memory can be
    overwritten or moved
  • first fit
  • best fit
  • Preemptive allocation
  • relocation is allowed
  • move
  • replace
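
First fit and best fit can be contrasted on a small free list. The sketch below represents each free region as a hypothetical (start, length) pair and returns the index of the chosen region.

```python
def first_fit(free_regions, size):
    """Index of the first free region large enough, or None."""
    for i, (_start, length) in enumerate(free_regions):
        if length >= size:
            return i
    return None


def best_fit(free_regions, size):
    """Index of the smallest free region large enough, or None."""
    candidates = [(length, i) for i, (_start, length) in enumerate(free_regions)
                  if length >= size]
    return min(candidates)[1] if candidates else None


free = [(0, 300), (500, 120), (900, 100)]
print(first_fit(free, 100), best_fit(free, 100))  # 0 2
```

First fit stops at the 300-word region, while best fit scans the whole list and picks the exact 100-word match.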

163
Replacement policies
  • FIFO
  • LRU
  • OPT
  • Figure 6.36, pp. 448
  • Figure 6.37, pp. 449
  • Figure 6.38, pp. 450
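
The three policies can be compared by counting faults on a reference string. The simulation below is a sketch: the reference string and the frame count are illustrative and are not taken from the figures cited on the slide.

```python
def count_faults(refs, frames, policy):
    """Count page faults for policy in {"FIFO", "LRU", "OPT"}."""
    resident = []   # FIFO: oldest loaded first; LRU: least recently used first
    faults = 0
    for t, page in enumerate(refs):
        if page in resident:
            if policy == "LRU":          # a hit refreshes recency under LRU only
                resident.remove(page)
                resident.append(page)
            continue
        faults += 1
        if len(resident) == frames:
            if policy == "OPT":          # evict the page whose next use is farthest away
                future = refs[t + 1:]
                victim = max(resident,
                             key=lambda p: future.index(p) if p in future else len(future))
            else:                        # FIFO and LRU both evict the head of the list
                victim = resident[0]
            resident.remove(victim)
        resident.append(page)
    return faults


refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
for policy in ("FIFO", "LRU", "OPT"):
    print(policy, count_faults(refs, 3, policy))
```

OPT needs knowledge of future references, so it serves as a lower bound for comparison rather than a policy that can be built in hardware.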

164
Caches(I)
  • History
  • appeared as early as 1968 in the IBM S/360
  • in 1980, caches directly address the von Neumann
    bottleneck by providing the CPU with fast,
    single-cycle access to its external memory

165
Caches(II)
  • A cache serves as
  • a fast intermediate memory
  • a buffer between the CPU and its main memory
  • TLBs within a MMU
  • Data buffers built into high-speed secondary
    memory devices

166
Caches(III)
  • Access time ratio
  • (M1, M2) is around tA2/tA1 = 5/1
  • (M2, M3) is around tA3/tA2 = 1000/1
  • By high-speed hardware circuits rather than by
    software routines
  • Figure 6.39, pp. 453

167
Cache Organization(I)
  • Cache data memory (cache blocks or lines)
  • Cache tag memory (cache directory)

[Figure: cache M1, showing the cache tag memory and cache data memory with the Hit, Address, Control, and Data signals]
168
Cache Organization(II)
  • Performance factors
  • time to match tag address
  • time to access data memory
  • use the SRAM technology of 10ns access time
  • Two general organizations
  • look-aside
  • look-through
  • Figure 6.41, pp. 454

169
Look-Aside Cache
  • CPU placing a real address on system bus
  • cache comparing address to tag
  • if hit, read or write operates on cache
  • if miss, read or write operates on main memory
    and a block of data including its address is
    transferred from main memory to cache
  • if miss, use the block replacement policy such as
    LRU to determine where to place the incoming
    block
  • the block transfer could tie up the system bus

170
Look-Through Cache
  • CPU placing a real address on a separate local
    bus
  • cache access and memory access can proceed
    concurrently
  • CPU sends memory requests to main memory only
    after a cache miss
  • to speed up cache-main memory transfer, the local
    bus between cache and main memory can be wider
    than the system bus, for example as wide as the
    cache block size (a 16-byte or 128-bit data bus)
  • disadvantages
  • higher complexity and cost
  • longer main memory access time if miss occurs

171
Cache Operation(I)
  • Read operation - Figure 6.42, a cache with 4-byte
    block size and 12-bit address
  • Write operation - Figure 6.43
  • a temporary inconsistency between cache and main
    memory is possible
  • preventing the improper use of stale data is the
    cache coherence or cache consistency problem
  • between multiprocessors
  • between single-CPU and IO controllers
  • a systematic updating policy (chapter 7)

172
Cache Operation(II)
  • a systematic updating policy
  • a change (dirty) bit for each cache block
  • cache write-back or copy-back technique
  • if replacing occurs, the block data is written
    back to main memory when its change bit is on
  • disadvantages
  • has a temporary inconsistency before write-back
  • complicates recovery from system failures

173
Cache Operation(III)
  • cache write-through policy
  • write data to both cache and main memory for
    every memory write cycle
  • uses more write cycles than the write-back policy,
    which slows system performance

174
Address Mapping
  • To quickly determine whether a tag address is
    present in the cache
  • the fastest technique is to use the associative
    or content addressing scheme to compare all tags
    simultaneously

175
Associative Addressing
  • Fields of an item in a CAM (content-addressable
    memory)
  • KEY: stored address
  • DATA: information to be accessed
  • memory access request
  • an associative cache uses the tag as the key
  • the incoming tag is compared simultaneously to all
    tags in the cache's tag memory
  • if a cache hit occurs, a match signal triggers the
    memory access from the cache's data field
  • if a cache miss occurs, the request is forwarded to
    main memory

176
Associative Memory(I)
  • A fixed-length word for each unit
  • Mask register
  • to identify the bit positions (need not be
    adjacent) that define the key
  • Match circuit
  • to compare all stored words with the key bits
    simultaneously
  • Select circuit
  • to enable the data field to be accessed
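
The roles of the mask register and match circuit can be imitated in software. In this sketch a word matches when it agrees with the key in every bit position selected by the mask. Hardware compares all words at once, whereas this hypothetical `cam_search` loops over them.

```python
def cam_search(words, key, mask):
    """Indices of stored words that equal the key on all masked bit positions."""
    return [i for i, w in enumerate(words) if (w ^ key) & mask == 0]


words = [0b1010_0001, 0b1010_1111, 0b0110_0001]
# Only the upper four bits define the key; the lower four bits are ignored.
print(cam_search(words, key=0b1010_0000, mask=0b1111_0000))  # [0, 1]
```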

177
Associative Memory(II)
  • about 10 transistors per bit of associative memory
    (Figure 6.44-46, pp. 458-460)
  • the cache's LRU block replacement policy is
    implemented by special hardware that constantly
    monitors cache usage.

178
Direct Mapping(I)
  • An alternative, simpler address-mapping
    technique for caches
  • Divide M1 into s1 sets M1(0), M1(1), ..., M1(s1-1),
    where s1 = 2^s
  • each set is a block of n consecutive words

179
Direct Mapping(II)
  • M2(i) is mapped into M1(j) if j = i (modulo s1)
  • if s1 = 2^6 = 64, blocks with addresses i, i+64,
    i+128, i+192, ... can be mapped into M1(i)
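
The direct-mapping rule j = i (modulo s1) is a one-line function. With s1 = 64, as on the slide, the sketch lists which of the first 256 main-memory blocks compete for one cache set; the choice of set 5 is arbitrary.

```python
S1 = 2 ** 6  # 64 sets in the cache M1


def cache_set(i, s1=S1):
    """Main-memory block M2(i) maps to cache set M1(i mod s1)."""
    return i % s1


competitors = [i for i in range(256) if cache_set(i) == 5]
print(competitors)  # [5, 69, 133, 197]
```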

180
Set-Associative Addressing
  • k-way set-associative mapping
  • Each set contains k = 2^h blocks
  • Permits up to k members of the same equivalence
    class to be stored in the cache simultaneously
  • M2(i) and M2(j) are in the same class if i = j
    (modulo s1)
  • One-way set-associative mapping is equivalent to
    direct mapping
  • Two-way, four-way, eight-way, ...
  • Figure 6.49, pp. 463

181
Design of a 2-Way Set-Associative Cache(I)
  • 8KB 2-way set-associative addressing
  • Example of a 32-bit processor
  • (Figure 6.50, pp. 464)
  • 8B block, VAX-11/780 in 1978
  • 32B block, PowerPC/603 in 1993

182
Design of a 2-Way Set-Associative Cache(II)
  • 32-bit Memory address
  • Tag: 20 bits → 20 bits per cache tag
  • Set address: 9 bits → 512 sets
  • Displacement: 3 bits → 64 bits per block
  • Cache architecture
  • Tag RAM → 512 × 20 × 2 (T0 and T1)
  • Data RAM → 512 × 64 × 2 (D0 and D1)
  • 2 20-bit tag comparators

183
Design of a 2-Way Set-Associative Cache(III)
  • Cache operation
  • Use 9-bit set address to read T0 and T1 and
    compare both outputs with Atag simultaneously
  • if a match occurs, Ti is used to initiate a
    memory access of Di to or from 64-bit data bus
  • if a smaller data bus is used, a block needs
    several cycles to transfer data
  • if a miss occurs, a 64-bit block is swapped from
    main memory to cache
  • VAX-11/780 uses a random replacement and
    write-through updating policy
  • PowerPC/603 uses an LRU and write-back policy
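
The lookup described above can be sketched as a toy model that tracks tags only, using the 20/9/3 address split. Filling an empty way first and otherwise picking a random way is an assumption in the spirit of the random-replacement policy mentioned for the VAX; the class and function names are hypothetical.

```python
import random

TAG_BITS, SET_BITS, DISP_BITS = 20, 9, 3


def split_address(addr):
    """Split a 32-bit address into (tag, set index, displacement)."""
    disp = addr & ((1 << DISP_BITS) - 1)
    set_index = (addr >> DISP_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (DISP_BITS + SET_BITS)
    return tag, set_index, disp


class TwoWayCache:
    def __init__(self):
        # tags[s] holds the tags stored in T0 and T1 for set s
        self.tags = [[None, None] for _ in range(1 << SET_BITS)]

    def access(self, addr):
        """Return True on a hit; on a miss, load the block and return False."""
        tag, s, _ = split_address(addr)
        ways = self.tags[s]
        if tag in ways:                    # both tags are compared with Atag
            return True
        way = ways.index(None) if None in ways else random.randrange(2)
        ways[way] = tag                    # assumed: empty way first, else random
        return False


cache = TwoWayCache()
print(cache.access(0x12345678), cache.access(0x12345678))  # False True
```

The model's capacity matches the slide: 512 sets × 2 ways × 8 bytes = 8 KB.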

184
Structure versus Performance
  • The type of information to store in cache
  • The dimension of cache
  • The control method of cache
  • The impact on performance

185
Cache Types(I)
  • By different access behavior patterns
  • A unified cache stores both instructions and data
    together
  • A split cache has two independent units
  • an I-cache for instructions
  • few write operations
  • more temporal and spatial locality
  • a D-cache for data

186
Cache Types(II)
  • By the level in the memory hierarchy
  • Primary cache Level 1 (L1) cache
  • via part of on-chip memory of a microprocessor
    chip
  • Secondary cache Level 2 (L2) cache
  • via an off-chip memory

187
Performance
  • tB tA2 the block-transfer time from main
    memory to cache can be identical to a single