Title: The Elements of Computers
1. The Elements of Computers
- A processor able to interpret and execute programs
- A memory for storing the programs and the data they process
- Input-output equipment for transferring information between the computer and the outside world
2. The Brain versus the Computer
- Brain vs. central processing unit (CPU)
  - program control unit: controls instructions
  - arithmetic-logic unit (ALU): executes on data
- Similarities and differences
  - digital or discrete information: abacus
  - analog or continuous information: slide rule
3. The Turing Machine
- Processor P
- Read-write head
- Memory tape M
4. An Abstract Computer
- The Turing machine was introduced by the English mathematician Alan M. Turing in 1936.
- The tape M as a memory
  - unbounded length
  - each square holds a blank or one of a small set of symbols
- The processor P
  - a small number of internal states
  - linked to M via the read-write head
5. Instruction Format
- Sh Ti Oj Sk
  - the current state of the processor is Sh
  - the symbol it expects to read on the square of M under the read-write head is Ti
  - perform the action Oj
    - write a new symbol, or
    - move the head left or right on the tape
  - change the state of P to Sk
6. Add Two Unary Numbers via a Turing Machine
  Instruction    Comment
  S0 b R S1      move read-write head one square to the right
  S1 1 R S1      move read-write head rightward across n1
  S1 b 1 S2      replace the blank between n1 and n2 by 1
  S2 1 L S2      move read-write head leftward across n1
  S2 b R S3      blank square reached; move one square to the right
  S3 1 b S3      replace the left-most 1 by a blank
  S3 b H S3      halt; the result n1 + n2 is now on the tape
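The quadruple program above can be run directly. The following is an illustrative sketch, not part of the original notes: a minimal Turing-machine simulator in Python whose program table is exactly the seven instructions of slide 6.

```python
# A minimal Turing-machine simulator running slide 6's unary-addition program.
# Each quadruple (Sh, Ti) -> (Oj, Sk) means: in state Sh reading symbol Ti,
# perform action Oj (write a symbol, move L/R, or halt H) and enter state Sk.
PROGRAM = {
    ("S0", "b"): ("R", "S1"),   # step off the leading blank
    ("S1", "1"): ("R", "S1"),   # scan rightward across n1
    ("S1", "b"): ("1", "S2"),   # replace the blank between n1 and n2 by 1
    ("S2", "1"): ("L", "S2"),   # scan leftward across n1
    ("S2", "b"): ("R", "S3"),   # leftmost blank reached; move one square right
    ("S3", "1"): ("b", "S3"),   # erase the leftmost 1
    ("S3", "b"): ("H", "S3"),   # halt: n1 + n2 ones remain on the tape
}

def run(n1, n2):
    # Tape layout: blank, n1 ones, one blank separator, n2 ones.
    tape = {}
    for i in range(n1):
        tape[1 + i] = "1"
    for i in range(n2):
        tape[n1 + 2 + i] = "1"
    state, head = "S0", 0
    while True:
        action, state = PROGRAM[(state, tape.get(head, "b"))]
        if action == "H":
            break
        elif action == "R":
            head += 1
        elif action == "L":
            head -= 1
        else:
            tape[head] = action          # write a symbol
    return sum(1 for s in tape.values() if s == "1")

print(run(3, 2))  # 5: the unary sum 3 + 2
```

The unbounded tape is modeled by a dictionary that returns blank for any square never written.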
7. A Little Flavor of RISC
- A universal TM can by itself perform every reasonable computation.
  - t different tape symbols
  - s different processor states
  - t × s < 30, which implies that it can have a very small instruction set
8. Limitations of Computers
- Unsolvable problems
  - no Turing machine and no practical computer can solve them
  - Goldbach's conjecture
- Undecidable problems
  - the TM halting problem
  - finite-state machines
9. Limitations of Computers
- Intractable problems
  - no computer can solve a given problem in a reasonable amount of time
  - finding an Euler circuit in a graph
  - traveling salesman problem
  - scheduling of airline flights
  - routing of wires in an electronic circuit
  - sequencing of steps in a factory assembly line
- Brute force
10. Speed Limitations
- The time complexity of an algorithm
  - order f(n), denoted O(f(n))
  - computing time grows with the problem size n
- Effect of a computer speedup of 100 on four algorithms (largest solvable problem size):
  - O(n):     n1 becomes 100 n1
  - O(n^2):   n2 becomes 10 n2
  - O(n^100): n3 becomes 1.047 n3
  - O(2^n):   n4 becomes n4 + 6.644
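The four figures on slide 10 follow from one observation: if an O(f(n)) algorithm solves size n in a given time, a machine 100 times faster solves the n' satisfying f(n') = 100 f(n) in that same time. A quick check (a sketch, not from the notes):

```python
# Reproduce slide 10's speedup-by-100 figures from f(n') = 100 * f(n).
import math

speedup = 100
print(speedup ** (1 / 1))     # O(n):     new size = 100   * n1
print(speedup ** (1 / 2))     # O(n^2):   new size = 10    * n2
print(speedup ** (1 / 100))   # O(n^100): new size ~ 1.047 * n3
print(math.log2(speedup))     # O(2^n):   new size = n4 + 6.644 (additive!)
```

Note the qualitative difference: for polynomial algorithms the speedup multiplies the solvable size, while for the exponential algorithm it only adds a constant.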
11. The Mechanical Era
- Babbage's Difference Engine
- The Analytical Engine
- In 1896 Hollerith formed a company, renamed IBM in 1924.
12. Electronic Computers
- The first generation
  - stored-program concept
  - mathematician John von Neumann (1903-1957)
  - vacuum-tube computers (1940s-1950s)
  - ferrite-core memory until the 1970s
  - machine language
  - assembly language
  - IAS computer
13. The IAS Computer (I)
- 12-bit address
- 2^12 = 4K 40-bit words
- a pair of 20-bit instructions per word
- fixed-point number system
- one-address instruction format
14. The IAS Computer (II)
- Program control unit (PCU)
  - AR: memory address register
  - IR: instruction opcode register
  - IBR: next-instruction buffer register
  - PC: program counter
- Data processing unit (DPU)
  - AC: accumulator register
  - DR: general-purpose data register
  - MQ: multiplier-quotient register
15. The IAS Computer (III)
- Hardware description language (HDL) or register-transfer language (RTL)
- Example: AC ← M(100); AC ← AC + M(100); M(102) ← AC
16. The IAS Computer (IV)
- Instruction types
  - data transfer
    - AC ← MQ
    - AC ← M(X)
  - data processing
    - AC ← AC + M(X)
    - AC ← AC × 2
  - program control
    - go to M(X, 0:19)
    - if AC ≥ 0 then go to M(X, 0:19)
17. The Shortcomings of the IAS Computer
- The self-modification process is difficult to debug
- The small amount of storage
- No procedure call or return instructions
- Lack of text processing (biased toward numerical computation)
- I/O instructions are not specified
18. The Contributions of the First-Generation Computers
- The use of a CPU with a small set of registers
- A separate main memory for instruction and data storage
- An instruction set with a limited range of operations and addressing capabilities
- The term "von Neumann computer" has become synonymous with a computer of conventional design.
19. The Second-Generation Computer (I)
- The transistor, a high-speed electronic switch, versus the vacuum tube
- Ferrite cores for the main memories
- Magnetic disks for the secondary memories
20. The Second-Generation Computer (II)
- More registers: index registers
  - index instructions
  - arrays
- More program-control instructions
  - call
  - return
- More scientific floating-point instructions
  - numbers of the form M × B^E
21. The Second-Generation Computer (III)
- Input/output operations
  - trivial data-transfer tasks
  - at very low speeds compared to the CPU
- Input-output processors (IOPs)
  - channels
  - let CPU execution and I/O data transfer proceed independently
22. The Second-Generation Computer (IV)
- Programming languages
  - high-level languages
- Scientific language
  - 1954, FORmula TRANslation (FORTRAN)
- Business language
  - 1959, COmmon Business Oriented Language (COBOL)
23. The Second-Generation Computer (V)
- System management
  - batch processing
    - a rudimentary version of an operating system
  - multiprogramming
  - time-sharing systems
  - keep the CPU and IOPs busy by overlapping CPU and I/O operations
24. A Nonstandard Architecture: Stack Computers
- Top of the stack (TOS)
- push operation
- pop operation
- stack pointer (SP)
- generally slower than a von Neumann machine
- Pocket calculators
- CALL sub and RETURN
25. A Nonstandard Architecture: Stack Computers (II)
- Z := W + 3 × (X - Y)
- Polish (postfix) notation
  - Z := W 3 X Y - × +
- PUSH W; PUSH 3; PUSH X; PUSH Y; SUBTRACT; MULTIPLY; ADD; POP Z
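The push/pop sequence on slide 25 can be traced mechanically. Here is a toy stack machine in Python (an illustrative sketch, not from the notes) that executes exactly that program for Z := W + 3 × (X - Y):

```python
# A toy stack machine executing slide 25's postfix program.
def run(program, env):
    stack = []
    for op, *arg in program:
        if op == "PUSH":
            stack.append(env.get(arg[0], arg[0]))  # variable name or literal
        elif op == "SUBTRACT":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
        elif op == "MULTIPLY":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "POP":
            env[arg[0]] = stack.pop()              # store the result
    return env

prog = [("PUSH", "W"), ("PUSH", 3), ("PUSH", "X"), ("PUSH", "Y"),
        ("SUBTRACT",), ("MULTIPLY",), ("ADD",), ("POP", "Z")]
env = run(prog, {"W": 10, "X": 7, "Y": 2})
print(env["Z"])  # 10 + 3 * (7 - 2) = 25
```

Note that no operand addresses appear in the arithmetic instructions: operands are implicitly the top elements of the stack, which is the defining property of this architecture.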
26. The Third-Generation Computer (I)
- 1961, integrated circuits (ICs)
  - a large number of transistors combined on a tiny piece of semiconductor material, usually silicon
- Standardized computers
- Software compatibility
  - 1964 IBM/360
  - 1970 IBM/370
  - 1979 IBM/4300
  - 1990 IBM/390
  - about 200 distinct instructions
27. The Third-Generation Computer (II)
- Two major control states of the CPU
  - a supervisor state
  - a user state
- Architecture
  - microprogramming
    - placed in a special control memory in the PCU
    - a CPU can execute floating-point instructions without floating-point arithmetic circuits
28. The Third-Generation Computer (III)
- Supercomputers
  - CDC Cyber series
  - pipelining
    - involves overlapping the execution of instructions
  - multiprocessors
    - instructions executed simultaneously
- Minicomputers
  - DEC, Digital Equipment Corp. (1965)
  - Programmed Data Processor (PDP)
  - low cost
29. The VLSI Era
- SSI: small-scale integration
- MSI: medium-scale integration
- LSI: large-scale integration
- VLSI: very large-scale integration
- ULSI or MCM (multichip module)
30. CMOS: A Zero-Detection Circuit (I)
- z = x0'·x1'·x2'·x3'
- x0x1x2x3 = 0000 makes z = 1
- x0x1x2x3 = 0001 makes z = 0
31. CMOS: A Zero-Detection Circuit (II)
- Transistor
- Gate
- Inverter (NOT)
32. CMOS: A Zero-Detection Circuit (III)
33. CMOS: A Zero-Detection Circuit (IV)
34. Integrated Circuits
- In 1959, Texas Instruments and Fairchild Corps.
- Chip dimensions: 10×10 mm to 30×30×4 mm, with 300 or more pins
- IC density: SSI, MSI, LSI, VLSI, ULSI...
35. Integrated Circuits
- (Chart: IC density versus year, 1960-2010, from SSI and MSI through the 4-, 8-, 16-, 32-, and 64-bit microprocessors and the 1K-bit, 1M-bit, and 1G-bit DRAMs)
36. Introduction to CMOS
- Complementary Metal-Oxide-Semiconductor
- Why CMOS?
- Basic concepts
- CMOS technology
37. Why CMOS?
- Low power dissipation
  - at a stable logic 0 or logic 1
  - Pstatic: leakage current
  - Pdynamic: switching current, the charging/discharging of Cload
  - Pdynamic ∝ frequency
- A distinct advantage: it leads to reduced heating
38. Why CMOS?
- High logic integration density
  - freedom to adjust transistor sizes on demand
  - linewidths as small as 0.1 μm are possible using optical lithography
  - logic density increases as feature size decreases
  - achieves greater integration densities than a bipolar technology
39. Why CMOS?
- Logic swings
  - rail-to-rail output logic voltage swings
  - better noise immunity
  - more reliable logic circuits
  - a bipolar TTL gate output ranges over [0.3, 3.6] volts
  - a CMOS gate output ranges over [0, 5] volts
40. Why CMOS?
- Symmetrical transient response
  - the time to switch from a logic 0 to a logic 1 can be made equal to the time needed to switch from a logic 1 to a logic 0
  - simplifies timing in a large system design
41. Why CMOS?
- Bipolar integrated circuits
  - bipolar emitter-coupled logic (ECL) is the fastest silicon logic available
  - due to much higher power dissipation levels and the subsequent heating, it has not taken over the microprocessor market
  - BiCMOS tends to provide the best aspects of both worlds
42. Why CMOS?
- Gallium arsenide?
  - the electron mobility is much larger in GaAs
  - it can react to higher frequencies
  - materials costs
  - technology know-how
  - applications
43. Chapter 2: Design Methodology
- design process
  - gate level
  - register level
  - processor level
- computer-aided design
- analysis methods
44. System Design
- a large and complex system, such as a computer
- a collection of connected components
45. System Representation (I)
- A system modeled by a directed graph
  - a set of nodes V = {v1, v2, v3, ..., vn}
  - a set of edges E = {(v1, v2), (v1, v3), ..., (vn-1, vn)}
  - edge e = (vi, vj) connects node vi to node vj
46. System Representation (II)
- a set of information-processing components C
- a set of lines S that carry information signals between components
- the system G associates C with S
47. Structure versus Behavior (I)
- structure
  - a graph
  - the abstract graph consisting of a block diagram with no function information
- behavior
  - a truth table or a mathematical equation
  - determines, for any given input signal to the system, its corresponding output
48. Structure versus Behavior (II)
- neither can be derived from the other
- a schematic diagram or block diagram
  - conveys structure rather than behavior
- behavior needs more formal descriptions: text, a truth table, a list of equations
49. HDL: Hardware Description Language (I)
- HDL: Hardware Description Language, going back to Babbage's notations
- VHDL: based on the programming language Ada
- Verilog: based on the programming language C
- Both are embodied in formal standards sponsored by the IEEE (the Institute of Electrical and Electronics Engineers)
50. HDL: Hardware Description Language (II)
- more precise
- technology independent
- descriptions at the gate and register levels
- documentation
- suitable for CAD programs
- long and verbose
51. Half Adder - Block Symbol
- (Block symbol: inputs x, y; outputs sum, carry)
52. Half Adder - Truth Table
53. Half Adder - Behavior
  entity half_adder is
    port (x, y: in bit; sum, carry: out bit);
  end half_adder;

  architecture behavior of half_adder is
  begin
    sum   <= x xor y;
    carry <= x and y;
  end behavior;
54. Half Adder - Structure
  architecture structure of half_adder is
    component xor_circuit
      port (a, b: in bit; c: out bit);
    end component;
    component nand_circuit
      port (d, e: in bit; f: out bit);
    end component;
    signal alpha: bit;
  begin
    XOR1:  xor_circuit  port map (a=>x, b=>y, c=>sum);
    NAND1: nand_circuit port map (d=>x, e=>y, f=>alpha);
    NAND2: nand_circuit port map (d=>alpha, e=>alpha, f=>carry);
  end structure;
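Both VHDL views describe the same device, which is easy to confirm outside a simulator. The sketch below (not from the notes) models the behavioral view and the structural view in Python and checks that they agree on all four input combinations; note how NAND1 followed by NAND2 (an inverter made from a NAND) yields x AND y for the carry.

```python
# Python model of the half adder's two VHDL views.
def half_adder_behavior(x, y):
    # behavioral: sum = x xor y, carry = x and y
    return x ^ y, x & y

def nand(a, b):
    return 1 - (a & b)

def half_adder_structure(x, y):
    alpha = nand(x, y)            # NAND1
    carry = nand(alpha, alpha)    # NAND2 inverts alpha, giving x AND y
    return x ^ y, carry           # the XOR gate produces the sum

for x in (0, 1):
    for y in (0, 1):
        assert half_adder_behavior(x, y) == half_adder_structure(x, y)
print("both views agree on all four input combinations")
```

This is exactly the point of slide 48: the structural netlist and the behavioral equations are independent descriptions that must be checked against each other.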
55. Half Adder - Block Diagram
- (Diagram: xor_circuit XOR1 with inputs a=x, b=y and output c=sum; nand_circuit NAND1 with inputs d=x, e=y and output f=alpha; nand_circuit NAND2 with both inputs tied to alpha and output f=carry)
56. Exclusive-OR - Block Diagram
- (Diagram: x1 ⊕ x2 realized from two NOT gates, two AND gates, and one OR gate)
57. Exclusive-OR - Truth Table
58. Gate Level
- combinational logic
  - z(x1, x2, ..., xn)
  - truth table
  - logic circuits
- standard gates
- functionally complete gate types
  - AND, OR, NOT
  - AND, NOT
  - NAND
  - NOR
59. Full Adder - Truth Table
60. Gate Level (Logic Level)
- processing with binary digits (bits)
  - 0 and 1
- design components
  - simple and memoryless logic gates
  - flip-flops as bit-storage devices
- combinational logic
- flip-flops
- sequential circuits
61. Combinational Logic (I)
- A combinational function is a logic or Boolean function
  - mapping the set of 2^n input combinations of n binary variables onto the output values 0 and 1
  - z(x1, x2, ..., xn)
  - the function z can be defined by a truth table
62. Combinational Logic (II)
- The truth table of the full adder, as shown in Figure 2.9(a), page 74
  - a pair of three-variable functions
  - the sum output s0(x0, y0, c-1)
  - the carry output c0(x0, y0, c-1)
- realization using half adders
- realization using AND and OR gates
- realization using NAND, NOR, and NOT gates
63. Standard Gates
- AND
  - x1·x2 = 1 if and only if x1 and x2 are both 1
- OR
  - x1 + x2 = 1 if and only if x1 or x2 or both are 1
- EXCLUSIVE-OR
  - x1 ⊕ x2 = 1 if and only if x1 or x2, but not both, is 1
- NOT (inverter)
  - x1' = 1 if and only if x1 = 0
64. Functional Completeness (I)
- AND, OR, NOT
- AND, NOT
  - a + b = (a'·b')'
- NAND (written a ↑ b)
  - a' = a ↑ a
  - a·b = (a ↑ b)' = (a ↑ b) ↑ (a ↑ b)
  - a + b = a' ↑ b' = (a ↑ a) ↑ (b ↑ b)
65. Functional Completeness (II)
- NOR (written a ↓ b)
  - a' = a ↓ a
  - a·b = a' ↓ b' = (a ↓ a) ↓ (b ↓ b)
  - a + b = (a ↓ b)' = (a ↓ b) ↓ (a ↓ b)
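Since each identity involves only two variables, an exhaustive check over the four input pairs proves them. A small verification sketch (assumed Python rendering, not from the notes):

```python
# Exhaustively verify the NAND and NOR completeness identities above.
def nand(a, b): return 1 - (a & b)
def nor(a, b):  return 1 - (a | b)

for a in (0, 1):
    for b in (0, 1):
        # NAND realizes NOT, AND, and OR
        assert nand(a, a) == 1 - a                           # a' = a NAND a
        assert nand(nand(a, b), nand(a, b)) == a & b         # AND from NAND
        assert nand(nand(a, a), nand(b, b)) == a | b         # OR via De Morgan
        # NOR realizes NOT, AND, and OR
        assert nor(a, a) == 1 - a                            # a' = a NOR a
        assert nor(nor(a, a), nor(b, b)) == a & b            # AND via De Morgan
        assert nor(nor(a, b), nor(a, b)) == a | b            # OR from NOR
print("NAND and NOR are each functionally complete via these identities")
```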
66. Boolean Algebra
- George Boole (1815-1864)
- Boolean equations for the full adder
  - s0 = x0'·y0'·c-1 + x0'·y0·c-1' + x0·y0'·c-1' + x0·y0·c-1
  - c0 = (x0 + c-1)(x0 + y0)(y0 + c-1)
- SOP (sum of products)
- POS (product of sums)
- two-level logic circuit: the longest input-output path sets the propagation delay
67. Balanced Logic Design
- Balancing hardware cost against operating speed depends on IC technology considerations
- A two-level adder has the shortest propagation delay
- A two-level adder has more gates and a higher hardware cost
68. Logic Synthesizer
- Designs circuits automatically via computer-aided synthesis tools
- Restrictions of a synthesizer
  - fan-in of a gate
  - fan-out of a gate
- Gate minimization
  - an intractable problem
  - exact minimization is only practical for small circuits
69. Flip-Flops (I)
- A flip-flop is a 1-bit storage element
- a sequential logic circuit = a combinational circuit + memory
- synchronization
  - external clock signal CK of a flip-flop
- a four-bit ripple-carry adder versus a serial adder built with a D flip-flop
70. Flip-Flops (II)
- Edge triggering: state changes around one edge of CK (the clock signal)
  - 0-to-1
  - 1-to-0
- an edge-triggered D (delay) flip-flop
  - 0-to-1 triggering edge of the clock signal CK
- other well-known flip-flops
  - JK flip-flop
  - SR flip-flop
  - T flip-flop
71. Flip-Flops (III)
- Edge triggering
  - a sequence of discrete state values y(i)
  - one for every clock cycle i
- Timing diagram: Figure 2.11
- Characteristic equation of the D flip-flop
  - y(i+1) = D(i)
72. Sequential Circuits
- A combinational circuit + a set of flip-flops
- A serial adder: Figure 2.12
- A four-bit-stream serial adder: Figure 2.13
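The serial adder of Figure 2.12 is a one-bit full adder plus a single D flip-flop that carries the state between clock cycles, so y(i+1) = D(i) is exactly the carry recurrence. A behavioral sketch (not from the notes; bits are fed LSB first, one per clock cycle):

```python
# Behavioral model of a serial adder: full adder + carry D flip-flop.
def serial_add(a_bits, b_bits):
    carry = 0                        # state held by the D flip-flop
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        out.append(total % 2)        # combinational sum output this cycle
        carry = total // 2           # D input, becomes the state at cycle i+1
    return out

# 4-bit example, LSB first: 0110 (6) + 0011 (3) = 1001 (9)
print(serial_add([0, 1, 1, 0], [1, 1, 0, 0]))  # [1, 0, 0, 1], i.e. 9
```

One full adder thus adds words of any length, trading n clock cycles for the n copies of the adder a ripple-carry design would need.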
73. Register Level
- Register-transfer level
  - grouped, ordered sets of small combinational or sequential circuits
  - process or store words or vectors
- combinational
  - multiplexers
  - decoders and encoders
- sequential
  - shift registers
  - counters
74. Component Types
- MSI parts in IC series
- Standard cells in VLSI
- with or without the functional-completeness property
- no universal graphic symbols
  - usually specified by an abbreviated description of their behavior
75. Generic Block Representation of a Register-Level Component
- Data input lines
- Data output lines
- Control input lines
  - select lines
  - enable lines
  - clock lines
  - etc.
- Control output lines
- (Block diagram: a multifunction unit with m data input lines and k data output lines)
76. Generic Block Representation
- select lines: choose one of several possible operations that the unit is to perform
- enable lines: give the time or condition for a selected operation to be performed
  - active, enabled, or asserted state
  - an overbar denotes low enable: the active value is 0
77. Operations
- Gate level: B = {0, 1}
- Register level: B^m, the set of 2^m m-bit words
78. Multiplexers (MUX)
- a device intended to route data from one of several sources to a common destination
- a k-input, m-bit MUX, where k = 2^p
- (Block diagram: m-bit data inputs X0, X1, ..., X(2^p - 1); p select lines S; enable line e; m-bit data output Z)
79. Multiplexers as Function Generators
- A 2^n-input, 1-bit multiplexer can generate any n-variable function
  - z(v1, v2, ..., vn)
- A 2-input, 4-bit multiplexer: Figure 2.20
- An 8-input multiplexer: Figure 2.21
- A multiplexer-based full adder: Figure 2.22
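The trick behind using a MUX as a function generator is that the select lines address the truth table: wire the function's truth-table column to the 2^n data inputs and the variables to the select lines. A sketch (not from the notes) realizing the 3-variable full-adder sum with an 8-input MUX:

```python
# A 2^n-input, 1-bit MUX as a function generator.
def mux(data, select):
    # data: the 2^n one-bit data inputs; select: n select lines as an integer
    return data[select]

# Truth-table column of the full-adder sum s(x, y, c), indexed by xyc in binary.
SUM_COLUMN = [0, 1, 1, 0, 1, 0, 0, 1]

for x in (0, 1):
    for y in (0, 1):
        for c in (0, 1):
            select = (x << 2) | (y << 1) | c      # x, y, c on the select lines
            assert mux(SUM_COLUMN, select) == (x + y + c) % 2
print("the 8-input MUX reproduces the 3-variable sum function")
```

Changing the function requires only rewiring the constant data inputs, not redesigning any gates.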
80. Decoders
- a 1-out-of-2^n, or 1/2^n, decoder
- a 1/4 decoder: Figure 2.23
- used in RAMs to select the storage cells to be read from or written into
81. Encoders
- generate the address or index of an active input line
- a 2^n-to-n encoder
- example: x0x1x2x3x4x5x6x7 = 00000010 gives z0z1z2 = 110
83. Processor (System) Level
- The highest level in the computer design hierarchy
- concerned with the storage and processing of blocks of information
- more complex, and based on VLSI technology
- design at this level is very much a heuristic process
84. Processor-Level Components
- Four main groups
  - processors
  - memories
  - I/O devices
  - interconnection networks
85. Central Processing Unit
- A general-purpose, instruction-set processor
  - versus specialized processors such as IOPs
- operates on word-organized instructions and data
86. Chapter Three: Processor Basics
- the overall design of instruction-set processors
- the CPU of a computer
- microprocessors: RISC and CISC types
87. CPU Organization
- Fundamentals
  - External communication
  - User and supervisor modes
  - CPU operation
  - Accumulator-based CPU
- Programming considerations
- Instruction set
- Program execution
88. Fundamentals
- The CPU executes sequences of instructions (programs), which are stored in an external main memory.
- Program execution steps
  - The CPU transfers instructions and their operands from main memory to CPU registers.
  - The CPU executes the instructions sequentially, except when the execution sequence is altered by a branch instruction.
  - When necessary, the CPU transfers results from CPU registers to main memory.
89. External Communication (I)
- without a cache
  - the CPU communicates directly with the main memory
    - a high-capacity multichip RAM (random-access memory)
  - disadvantage: speed disparity
    - the CPU is significantly faster than memory (5 to 10 times)
90. External Communication (II)
- with a cache
  - CM (cache memory) positioned between the CPU and MM (main memory)
  - CM is faster and smaller than MM
  - CM can reside wholly or in part in the CPU
  - typically permits the CPU to load or store in a single clock cycle
  - advantage: CM is transparent to the CPU's instructions
  - the CPU sees CM and MM as forming a single, seamless memory space of 2^m addressable storage locations
  - further discussion in Chapter 6
91. External Communication (III)
- with I/O devices
  - I/O ports: I/O devices are associated with addressable registers
  - the CPU can load/store a word from/to I/O ports
  - I/O-mapped versus memory-mapped I/O
    - memory-mapped: I/O ports share the same set of memory addresses
    - I/O-mapped: I/O instructions produce I/O control signals rather than memory-referencing signals
92. User and Supervisor Modes
- user programs and supervisor programs
  - a user or application program handles a specific application
  - a supervisor program manages various routine aspects of the computer system
- normally, the CPU switches back and forth between user and supervisor programs
- an interrupt is a way of requesting and switching to supervisor mode
93. CPU Operation (I)
- Overview of CPU behavior (Figure 3.2, p. 140)
  - instruction cycle: a fetch step and an execution step
  - micro-operations (register-transfer operations) within an instruction cycle
94. CPU Operation (II)
- the shortest well-defined CPU micro-operation takes one CPU cycle time, or clock period, Tclock
  - Tclock: the CPU cycle time
  - f: the CPU's clock frequency in MHz
  - Tclock = 1/f
- each instruction is fetched from M in one CPU cycle when M is a cache
- the execution step runs in another CPU cycle
95. Accumulator-Based CPU (I)
- keeps the CPU relatively small
  - a small set of registers and circuits to implement a functionally complete set of instructions
- the central role of one register: the accumulator AC
96. Accumulator-Based CPU (II)
- a small accumulator-based CPU (Figure 3.3, p. 141)
- PCU and DPU
- fetch step
  - IR.AR ← M(PC)
  - IR holds the opcode op, AR the address adr
- load/store
  - AC ← M(adr)
  - M(adr) ← AC
98. Programming Considerations (I)
- data processing
  - three-operand operations
  - Z := X + Y
99. Programming Considerations (II)
- single-address instructions (p. 142)
  - HDL format        ASM format
    AC ← M(X)         LD X
    DR ← AC           MOV DR,AC
    AC ← M(Y)         LD Y
    AC ← AC + DR      ADD
    M(Z) ← AC         ST Z
- implicit operands AC and DR
- load/store architecture
  - only the load and store instructions access memory
100. Programming Considerations (III)
- memory-referencing instruction form: AC ← f(AC, M(adr))
  - HDL format        ASM format
    AC ← M(X)         LD X
    AC ← AC + M(Y)    ADD Y
    M(Z) ← AC         ST Z
- more complicated instruction-decoding logic in the PCU and more execution time for ADD
- fewer instructions --- reduced overall execution time?
101. Programming Considerations (IV)
- the cost/performance debate of RISC versus CISC
102. Instruction Set
- the flavor of a RISC instruction set (a load/store architecture)
  - data transfer: load, store, move register
  - data processing: add, subtract, and, not
  - program control: branch, branch-on-zero
- (Figure 3.4, p. 143)
- Example 3.1: a multiplication program (p. 144)
103. A Multiplication Program (I)
  Line  Location  Instruction or data
  0     one       00001
  1     mult      N
  2     ac        00000
  3     prod      00000
  4               ST ac
  5     loop      LD mult
  6               BZ exit
  ...
  17              BRA loop
  18    exit      ...
104. A Multiplication Program (II)
  Line  Location  Instruction or data
  7               LD one
  8               MOV DR,AC
  9               LD mult
  10              SUB
  11              ST mult
  12              LD ac
  13              MOV DR,AC
  14              LD prod
  15              ADD
  16              ST prod
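The loop body of slides 103-104 multiplies by repeated addition: each pass decrements `mult` by one (lines 7-11) and adds `ac` into `prod` (lines 12-16), until the BZ test at line 6 finds `mult` zero. A high-level Python paraphrase (a sketch, not the actual machine code):

```python
# High-level paraphrase of the accumulator multiplication program.
def multiply(n, ac):
    mult, prod = n, 0
    while True:
        if mult == 0:        # BZ exit: done when the multiplier is zero
            break
        mult -= 1            # LD one; MOV DR,AC; LD mult; SUB; ST mult
        prod += ac           # LD ac; MOV DR,AC; LD prod; ADD; ST prod
    return prod              # BRA loop repeats until mult reaches zero

print(multiply(5, 7))  # 35
```

The contrast with a single MUL instruction is the point of the example: a minimal instruction set forces multiplication into an O(N) software loop.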
105. Program Execution
- cycle-by-cycle execution (p. 146)
- PCU actions
  - the fetch cycle comprises the pair of register-transfer operations
    - IR.AR ← M(PC)
    - PC ← PC + 1
106. Architecture Extensions (I)
- a multipurpose register set for storing data and addresses
  - register file
- additional data, instruction, and address types
- fixed-point multiply and divide instructions
- call and return instructions
107. Architecture Extensions (II)
- a register to indicate computation status
  - condition code or flag register
  - zero result, divide by zero, ...
- program-control stack
  - procedure calling
  - external interrupts
  - push-down stack with a stack pointer
- Figure 3.7, p. 148
108. Pipelining (I)
- CPU speedup techniques
  - cache memories
  - instruction-level parallelism
    - in the DPU
    - in the PCU
- Overlapping instructions in a two-stage instruction pipeline
- Figure 3.8, p. 150
109. Pipelining (II)
- Branch instructions
  - reduce the efficiency of instruction pipelining
- More than two stages
  - increase the level of parallelism attainable
110. ARM6 Microprocessor
- Organization of the ARM6
  - SR, PC, WDR, RDR, AR, IR, ALU, shifter, buses (Figure 3.9, p. 152)
- Core instruction set of the ARM6
  - data transfer, data processing, program control (Figure 3.10, p. 153)
- Shift or rotation operations
  - LSL: logical shift left
  - MOV R0, R1, LSL #2 performs R0 ← R1 × 4
111. Motorola 680X0 Family
- Organization of the 68020
  - D0-D7, A0-A7, PC, CC (Figure 3.11, p. 155)
- Instruction set of the 68020
  - data transfer, data processing, program control, external synchronization (Figure 3.12, pp. 156-157)
112. 680X0 ASM for Vector Addition
- Vector addition (Figure 3.13, p. 158)
         MOVE.L  #2001, A0
         MOVE.L  #3001, A1
         MOVE.L  #4001, A2
  START: ABCD    -(A0), -(A1)
         MOVE.B  (A1), -(A2)
         CMPA    #1001, A0
         BNE     START
113. Chapter Six: Memory Organization
- impact on performance
- survey of storage-device technologies
- multilevel hierarchical memory systems
- cache memories
114. Memory Types
- CPU registers
  - working memory for temporary storage of instructions and data
- Main memory (primary memory)
  - five or more clock cycles per access is usual
- Secondary memory
  - access times in milliseconds
- Cache
  - one to three clock cycles
115. Performance and Cost
- Cost/performance trade-off
  - cost of memory: c = C/S dollars per bit
  - access time tA
117. The von Neumann Bottleneck
- The speed mismatch between the CPU and main memory
- Storage density has grown rapidly, but access time has decreased at a much slower rate
- capacity of a single-chip RAM
  - 1975: 4 Kb (kilobits)
  - 1985: 256 Kb
  - 1995: 16 Mb (megabits)
118. Access Modes (I)
- Random-access memory
  - storage can be accessed in any order
  - access time is independent of location
- Serial-access memory
  - storage can be accessed only in a certain predetermined sequence
  - magnetic disks, magnetic tapes, and optical disks (CD-ROM)
  - access time depends on a location's position relative to the read-write head
119. Access Modes (II)
- Serial access tends to be slower than random access
- Semirandom-access mode
  - magnetic disks and CD-ROMs
  - if each track has its own read-write head, tracks can be accessed randomly
  - access within a track is serial
120. Memory Retention (I)
- Read-only memory
  - contents cannot be altered on-line
  - a nonerasable storage device
  - compact disk ROM
- Programmable read-only memory
  - contents can be changed off-line
  - CD-recordable disk (CD-R) as a programmable CD
121. Memory Retention (II)
- DRO: destructive readout
  - reading destroys the stored data
  - restoration
    - each read operation must be followed by a write that restores the data
- NDRO: nondestructive readout
  - reading does not affect the stored data
122. Memory Retention (III)
- Dynamic memory
  - requires periodic refreshing
  - a stored 1 tends toward 0, or vice versa, due to some physical decay process
  - a capacitor representing a stored 1 tends toward 0 as its charge leaks away
- Static memory
  - requires no refreshing
  - lower access time, i.e. faster, than DRAM
123. Memory Retention (IV)
- Volatile
  - the stored data is destroyed if power is lost
  - most IC memories are volatile
- Nonvolatile
  - most magnetic and optical memories
124. Memory Retention (V)
- Cycle time: the elapsed time tM
  - the minimum time that must elapse between the start of two consecutive access operations
  - tM ≥ tA
- Data-transfer rate, or bandwidth, bM
  - bM = w/tM
  - w is the number of bits that can be transferred simultaneously to or from the memory
125. Memory Retention (VI)
- Reliability: MTBF
  - the mean time before failure
  - memories with no moving parts (no mechanical motion) have much higher reliability
  - very high density or data-transfer rates raise reliability problems
  - error-detecting and error-correcting codes can increase the reliability of any memory
- Performance parameters: tA, tM, bM
126. Memory Retention (VII)
  Technology               Medium      Access mode  Alterability  Permanence           Access time
  Bipolar semiconductor    Electronic  Random       R/W           NDRO, volatile       10 ns
  MOS semiconductor        Electronic  Random       R/W           DRO/NDRO, volatile   50 ns
  Magnetic (hard) disk     Magnetic    Semirandom   R/W           NDRO, nonvolatile    10 ms
  Magneto-optical disk     Optical     Semirandom   R/W           NDRO, nonvolatile    50 ms
  Compact disk ROM         Optical     Semirandom   R             NDRO, nonvolatile    100 ms
  Magnetic tape cartridge  Magnetic    Serial       R/W           NDRO, nonvolatile    1 s
127. Random-Access Memory (II)
- Semiconductor RAMs: DRAM and SRAM
  - a capacitor with one transistor versus six transistors (Figure 6.9, p. 410)
  - DRAM readout is destructive, so the data must subsequently be written back to the cell
128. MT4LC8M8E1 (Micron Technology, 1997)
- 64 Mb = 2^26 bits = 2^23 8-bit bytes = 8M × 8 bits
  - memory address size m = 23
  - data word size w = 8
- 13-bit row address (external address lines)
- 10-bit column address
- control lines: RAS, CAS, WE, OE
- address bus A0-A12
- data bus DQ1-DQ8 (32 pins in total)
129. MT4LC8M8E1 (Micron Technology, 1997)
- 64 Mb = 2^26 bits = 2^23 8-bit bytes = 8M × 8 bits
- tA = 50 ns, tM = 90 ns, page mode
- tREF: refresh at least once every 64 ms
  - refreshing handles an entire row of storage locations in a single read cycle
  - if a one-row read operation takes 90 ns, the total refreshing time is 90 ns × 8192 = 0.737 ms, so the fraction of time spent refreshing is 0.737/64 ≈ 1.15% (a negligible amount)
- Figure 6.13
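The refresh-overhead arithmetic on slide 129 is easy to recheck (a sketch of the calculation, not from the datasheet):

```python
# Refresh overhead for an 8192-row DRAM: one 90 ns read cycle refreshes one
# row, and every row must be refreshed at least once every 64 ms.
rows, cycle_ns, t_ref_ms = 8192, 90, 64

full_refresh_ms = rows * cycle_ns * 1e-6        # 8192 * 90 ns in ms
overhead_pct = full_refresh_ms / t_ref_ms * 100

print(round(full_refresh_ms, 3))   # 0.737 ms to refresh every row once
print(round(overhead_pct, 2))      # about 1.15 percent of all memory time
```

So refreshing steals only about 1% of the memory's time, which is why DRAM's need for refresh does not undermine its cost advantage.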
130. Other Semiconductor Memories (I)
- Read-only memory (ROM)
  - no writing ability, and nonvolatile
  - stores permanent code at the instruction and microinstruction levels
- Programmable ROM
  - some can be programmed only once
  - some can be programmed repeatedly (FPGA)
    - erased in bulk off-line
131. Other Semiconductor Memories (II)
- Flash memory: like a PROM, but
  - nonvolatile storage that can be programmed and erased on-line
  - can be programmed a bit at a time
  - can be erased only in large blocks, that is, a "flash" erase process
  - can randomly read a bit but writes a block
  - storage density and access time are comparable to those of DRAM
132. Fast RAM Interfaces (I)
- The speed gap between microprocessors and cheap but slow DRAMs
- Use a bigger memory word
- Access more than one word at a time
  - interleaving rule
  - interference or contention occurs if two or more addresses require simultaneous access to the same module
133. Fast RAM Interfaces (II)
- Synchronous DRAM (SDRAM)
  - achieves a speed doubling by pipelining its internal operations and by implementing two-way address interleaving
- Cached DRAM (CDRAM)
  - an on-chip cache realized by a small, fast SRAM that acts as a high-speed buffer or front-end memory for the main DRAM
  - has a fast burst mode of operation
134. Fast RAM Interfaces (III)
- Rambus DRAM (1992)
  - the master transmits an initial packet on the Rambus channel
  - each Rambus DRAM chip examines it
  - the DRAM unit Ri containing the address returns ready or busy to the master
  - if Ri is ready, the master proceeds to transfer to or from Ri a data packet of up to 256 bytes in burst mode, at speeds up to 500 MB/s, that is, 1 byte every 2 ns
  - if Ri is busy, the master must try again later
135. Serial-Access Memories
- Tracks
  - data transfer to or from a track is serial
- low cost per bit
- long access time
  - read-write head positioning time
  - slow speed of track movement
  - serial data transfer
136. Access Methods (I)
- Seek time tS
  - the average time to move a head from one track to another
- Rotational latency time tL
  - the average time to rotate the desired information cell to the head
- Block
  - all words in a block are stored in consecutive locations, so that an entire block costs one seek and one latency time
137. Access Methods (II)
- Data-transfer rate
  - V cm/s: the speed of the stored information relative to the read-write head
  - T bits/cm: the storage density along the track
  - transfer rate = T × V bits/s
138. Access Methods (III)
- Time to access a block in a serial-access memory
  - tB = tS + 1/(2r) + n/(rN)
  - tS: the average seek time
  - r: the revolutions per second
  - 1/(2r): the average rotational latency of a track
  - n: the number of words per block
  - N: the number of words per track
  - n/(rN): the data-transfer time
139. Magnetic Hard Disk
- 9.3 GB Quantum XP39100, 1996 (p. 422)
  - tS = 7.9 ms
  - r = 0.12 revs/ms
  - n = 8
  - N = 144 × 512 = 73,728 bytes/track
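Plugging the Quantum XP39100 figures from slide 139 into the block-access formula tB = tS + 1/(2r) + n/(rN) gives the following (a sketch; it assumes n and N are counted in the same unit, here bytes, so the transfer term is tiny):

```python
# Block access time for slide 139's disk using slide 138's formula.
tS = 7.9            # average seek time, ms
r = 0.12            # revolutions per ms
n = 8               # units per block (assumed bytes, matching N below)
N = 144 * 512       # 73,728 bytes per track

tB = tS + 1 / (2 * r) + n / (r * N)
print(round(tB, 2))   # about 12.07 ms
```

The result is dominated by the seek time (7.9 ms) and the rotational latency (1/(2 × 0.12) ≈ 4.17 ms); the data-transfer term is under a microsecond, which is why serial-access devices favor large blocks.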
140. Magnetic Tape
- Cartridge or cassette
- Data stored in parallel across tracks
- An 80-track tape with
  - density 110 Kb/in
  - tape speed 50 in/s
  - max. data-transfer rate: 110K × 80 / 8 × 50 = 55 MB/s
  - 200 m of tape: 55/50 × 200/0.0254 ≈ 8.661 GB
141. Optical Memory
- CD-ROM (compact disk read-only memory)
- CD-R (recordable)
- CD-RW (rewritable)
- DVD (digital video disk)
142. Memory Systems
- General characteristics
- Multilevel memories
  - hierarchical organization
- Two key design issues
  - automatic translation of addresses
  - dynamic relocation of data
143. Multilevel Memories
- An n-level system (M1, M2, ..., Mn)
- two levels
  - main memory (semiconductor DRAMs)
  - secondary memory (magnetic-disk units)
- three levels
  - plus cache memory (semiconductor SRAMs)
  - split cache (instruction I-cache, data D-cache)
- four levels
  - level 1 cache
  - level 2 cache
  - both as nonsplit, or unified, caches
144. General Characteristics (I)
- Two adjacent memory levels Mi and Mi+1
  - cost per bit: ci > ci+1
  - access time: tAi < tAi+1
  - storage capacity: Si < Si+1
- Communication between levels
  - the CPU can communicate directly with M1
  - M1 can communicate directly with M2, and so on
  - except that the CPU can bypass the cache and go directly to main memory
145. General Characteristics (II)
- Relocating addresses and transferring data between two adjacent levels is a relatively slow process
- requires a reasonably predictable way to guess the future addresses generated by the CPU
146Cache and Virtual Memory
- The cache and main memory act as a single memory to the software
- The main and secondary memories are NOT transparent to system software
- The main and secondary memories are transparent to user code --- virtual memory --- like a single, larger, directly addressable memory
147Reasons for Virtual Memory
- To free user programs from the need to manage storage allocation
- To permit sharing of memory space among users
- To make programs independent of the physical configuration and capacity of the memories
- To achieve both a very low access time and a low cost per bit
148Locality of Reference(I)
- The characteristic of computer programs
- the predictability of memory addresses
- the locality of reference
149Locality of Reference(II)
- Spatial locality
- instructions and data are stored sequentially in memory, so references cluster around neighboring addresses
- Temporal locality --- working set W(t, T)
- loops in programs tend to be executed repeatedly, so recently referenced items are soon referenced again
- W(t, T), the set of blocks referenced during the time interval (t-T, t), tends to change rather slowly
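The working set can be made concrete as the set of distinct blocks referenced in the window (t-T, t]; the reference trace below is invented for illustration:

```python
def working_set(trace, t, T):
    """W(t, T): distinct blocks referenced in the window (t-T, t] of the trace."""
    return set(trace[max(0, t - T):t])

# A loopy reference trace: the same few blocks recur for long stretches
trace = [1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5, 4, 5, 4, 5]
print(working_set(trace, t=9, T=6))   # {1, 2, 3}
print(working_set(trace, t=15, T=6))  # {4, 5}
```

The set stays small within a loop and only shifts when the program moves to a new phase, which is what makes demand paging practical.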
150Cost and Performance(I)
- Factors of performance
- the address-reference statistics
- the access time (tA)
- the storage capacity (Si)
- the size of blocks (pages) (SPi)
- the allocation algorithm (blocks-swapping
process)
151Cost and Performance(II)
- The average cost per bit of memory
- to reach the goal of making c approach c2, S1 must be much smaller than S2
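The standard two-level average-cost formula c = (c1S1 + c2S2)/(S1 + S2) makes the point concrete; the prices below are invented for illustration:

```python
def avg_cost_per_bit(c1, S1, c2, S2):
    """Average cost per bit of a two-level memory (M1, M2)."""
    return (c1 * S1 + c2 * S2) / (S1 + S2)

# Hypothetical prices: M1 is 100x more expensive per bit than M2
c1, c2 = 100.0, 1.0
print(round(avg_cost_per_bit(c1, 1, c2, 1000), 3))    # 1.099: close to c2
print(round(avg_cost_per_bit(c1, 500, c2, 1000), 3))  # 34.0: S1 too large
```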
152Cost and Performance(III)
- The performance can be measured by the hit ratio H
- the probability that a virtual address generated by the CPU is currently stored in the faster memory
- hit: reference to M1
- miss: reference to M2
- miss ratio = 1 - H
153Cost and Performance(IV)
- N1: the number of references to M1
- N2: the number of references to M2
- H = N1/(N1 + N2)
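With these counts, the hit ratio and (one common form of) the resulting average access time follow directly; the sample numbers are illustrative:

```python
def hit_ratio(N1, N2):
    """H = N1 / (N1 + N2): fraction of references satisfied by M1."""
    return N1 / (N1 + N2)

def effective_access_time(H, tA1, tA2):
    """One common form of the two-level average access time."""
    return H * tA1 + (1 - H) * tA2

H = hit_ratio(N1=950, N2=50)
print(H)                                              # 0.95
print(round(effective_access_time(H, 10, 100), 2))    # 14.5
```

Even a 5% miss ratio pulls the average well above tA1, which is why high hit ratios matter so much.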
154Address Translation(I)
- Address mapping or address translation process
- map virtual addresses onto real addresses
- by programmer
- by compiler
- by loader
- at run time
155Address Translation(II)
- Static translation
- complete the translation as the program is loaded
- Dynamic translation
- complete the translation during execution
- run-time address translation by the MMU (memory management unit)
156Base Addressing
- Effective address = base + displacement: Aeff = B + D, or Aeff = B.D (concatenation)
- Limit address bounds the length of a block
- Figure 6.25, pp. 434
157Translation Look-Aside Buffer
- TLB
- AV = BV.D
- BR = TLB(BV)
- AR = BR.D
- Figure 6.26, pp. 434
- MIPS R2/3000, pp. 434-435
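The TLB translation above can be sketched as a dictionary lookup; the entries and the 12-bit displacement width are hypothetical:

```python
# A TLB caches recent virtual-base -> real-base translations.
tlb = {0x12: 0x7A, 0x34: 0x05}   # hypothetical BV -> BR entries

def translate(BV, D, page_bits=12):
    """AV = BV.D  ->  AR = BR.D, where BR = TLB(BV)."""
    if BV not in tlb:
        raise KeyError("TLB miss: walk the page table in main memory")
    BR = tlb[BV]
    return (BR << page_bits) | D   # concatenate BR with the displacement

print(hex(translate(0x12, 0xABC)))  # 0x7aabc
```

The displacement passes through unchanged; only the base field is replaced, which is what makes the lookup fast.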
158Pages and Segments(I)
- Page (frame) is a fixed-size block
- suitable for physical partitioning and swapping of information
- Segment is a logical block of program or data
- its boundary corresponds to a natural program or data boundary
- stack segment
159Pages and Segments(II)
- Two-stage address translation
- AV = SI.PI.D
- PB = Segment TLB(SB.SI)
- P = Page TLB(PB.PI)
- AR = P.D
- Figure 6.30, pp. 440
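The two-stage translation can be sketched the same way; the table contents and field widths below are invented:

```python
segment_tlb = {0x1: 0x40}          # SI -> PB (page-table base), hypothetical
page_tlb = {(0x40, 0x2): 0x9F}     # (PB, PI) -> P (page frame), hypothetical

def translate(SI, PI, D, page_bits=12):
    """AV = SI.PI.D  ->  AR = P.D via two TLB lookups."""
    PB = segment_tlb[SI]           # stage 1: segment TLB
    P = page_tlb[(PB, PI)]         # stage 2: page TLB
    return (P << page_bits) | D

print(hex(translate(0x1, 0x2, 0x345)))  # 0x9f345
```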
160Memory Address Translation in Intel Pentium
- 32-bit linear address
- 32-bit effective address AV: Segment TLB(STB.14-bit Ls) = 10-bit Nd . 10-bit Np . 12-bit displacement
- AR = Page table TLB(Page directory TLB(PDB.Nd).Np)
161Page Size
- Utilization versus page size
- Figure 6.32, pp. 443
- Hit ratio versus page size
- Figure 6.33, pp. 443
162Memory Allocation
- Nonpreemptive allocation
- no block already occupying memory can be overwritten or moved
- first fit
- best fit
- Preemptive allocation
- relocation is allowed
- move
- replace
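First fit and best fit can be sketched over a list of free-region sizes (the sizes are invented):

```python
def first_fit(free_regions, size):
    """Return the index of the first free region big enough, else None."""
    for i, r in enumerate(free_regions):
        if r >= size:
            return i
    return None

def best_fit(free_regions, size):
    """Return the index of the smallest free region big enough, else None."""
    fits = [(r, i) for i, r in enumerate(free_regions) if r >= size]
    return min(fits)[1] if fits else None

free = [300, 600, 350, 200]
print(first_fit(free, 350))  # 1: the region of 600 is found first
print(best_fit(free, 350))   # 2: the region of 350 wastes nothing
```

First fit is faster per request; best fit reduces the leftover fragment at the cost of scanning every region.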
163Replacement policies
- FIFO
- LRU
- OPT
- Figure 6.36, pp. 448
- Figure 6.37, pp. 449
- Figure 6.38, pp. 450
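The FIFO and LRU policies can be simulated in a few lines; the reference string is the classic textbook example, not one taken from the figures:

```python
from collections import OrderedDict, deque

def count_misses_fifo(refs, frames):
    q, resident, misses = deque(), set(), 0
    for p in refs:
        if p not in resident:
            misses += 1
            if len(resident) == frames:
                resident.discard(q.popleft())  # evict the oldest arrival
            resident.add(p)
            q.append(p)
    return misses

def count_misses_lru(refs, frames):
    cache, misses = OrderedDict(), 0
    for p in refs:
        if p in cache:
            cache.move_to_end(p)               # refresh recency on a hit
        else:
            misses += 1
            if len(cache) == frames:
                cache.popitem(last=False)      # evict the least recently used
            cache[p] = True
    return misses

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(count_misses_fifo(refs, 3), count_misses_lru(refs, 3))  # 9 10
```

With this string and 3 frames, LRU actually misses more than FIFO; neither matches OPT, which has perfect knowledge of future references.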
164Caches(I)
- History
- appeared as early as 1968 in the IBM S/360
- since 1980, caches have directly addressed the von Neumann bottleneck by providing the CPU with fast, single-cycle access to its external memory
165Caches(II)
- A cache serves as
- a fast intermediate memory
- a buffer between the CPU and its main memory
- TLBs within a MMU
- Data buffers built into high-speed secondary
memory devices
166Caches(III)
- Access time ratio
- (M1, M2) is around tA2/tA1 = 5/1
- (M2, M3) is around tA3/tA2 = 1000/1
- By high-speed hardware circuits rather than by software routines
- Figure 6.39, pp. 453
167Cache Organization(I)
- Cache data memory (cache blocks or lines)
- Cache tag memory (cache directory)
[Figure: cache organization --- cache M1 comprises the cache data memory and the cache tag memory; the hit signal, address, control, and data lines connect it to the CPU]
168Cache Organization(II)
- Performance factors
- time to match tag address
- time to access data memory
- uses SRAM technology with a 10 ns access time
- Two general organizations
- look-aside
- look-through
- Figure 6.41, pp. 454
169Look-Aside Cache
- CPU places a real address on the system bus
- cache compares the address to its tags
- if hit, the read or write operates on the cache
- if miss, the read or write operates on main memory, and a block of data including the addressed word is transferred from main memory to the cache
- on a miss, a block replacement policy such as LRU determines where to place the incoming block
- the block transfer can tie up the system bus
170Look-Through Cache
- CPU places a real address on a separate local bus
- cache access and memory access can proceed concurrently
- CPU sends memory requests to main memory only after a cache miss
- to speed up cache-main memory transfers, the local bus between cache and main memory can be wider than the system bus, e.g. as wide as the cache block size (a 16-byte or 128-bit data bus)
- disadvantages
- higher complexity and cost
- longer main-memory access time when a miss occurs
171Cache Operation(I)
- Read operation: Figure 6.42, a cache with 4-byte block size and 12-bit addresses
- Write operation: Figure 6.43
- a temporary inconsistency between cache and main memory is possible
- preventing the improper use of stale data is the cache coherence or cache consistency problem
- between multiprocessors
- between a single CPU and I/O controllers
- requires a systematic updating policy (chapter 7)
172Cache Operation(II)
- a systematic updating policy
- a change (dirty) bit for each cache block
- cache write-back or copy-back technique
- on replacement, a block is written back to main memory only if its change bit is set
- disadvantages
- a temporary inconsistency exists until the write-back occurs
- complicates recovery from system failures
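The change-bit bookkeeping can be sketched with a toy model (not any particular machine):

```python
class WriteBackCache:
    """Toy direct-mapped write-back cache: one dirty (change) bit per block."""
    def __init__(self):
        self.blocks = {}   # index -> (tag, data, dirty)

    def write(self, index, tag, data, memory):
        victim = self.blocks.get(index)
        if victim and victim[0] != tag and victim[2]:
            memory[(victim[0], index)] = victim[1]  # write back the dirty victim
        self.blocks[index] = (tag, data, True)      # set the change bit

memory = {}
cache = WriteBackCache()
cache.write(0, tag=1, data="A", memory=memory)  # stays dirty in the cache
cache.write(0, tag=2, data="B", memory=memory)  # evicts block (1, 0): write-back
print(memory)  # {(1, 0): 'A'}
```

Main memory is only updated at eviction time, which saves write cycles but leaves it stale until then.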
173Cache Operation(III)
- cache write-through policy
- writes data to both cache and main memory on every memory write cycle
- uses more write cycles than the write-back policy, slowing system performance
174Address Mapping
- To quickly determine whether a tag address is present in the cache
- the fastest technique is associative or content addressing, which compares all tags simultaneously
175Associative Addressing
- Fields of an item in a CAM (content-addressable memory)
- KEY: the stored address
- DATA: the information to be accessed
- memory access request
- an associative cache uses the tag as the key
- the incoming tag is compared simultaneously to all tags in the cache's tag memory
- if a cache hit occurs, a match signal triggers the memory access from the cache's data field
- if a cache miss occurs, the request is forwarded to main memory
176Associative Memory(I)
- A fixed-length word for each stored item
- Mask register
- identifies the bit positions (which need not be adjacent) that define the key
- Match circuit
- compares every bit of the key simultaneously
- Select circuit
- enables the data field to be accessed
177Associative Memory(II)
- about 10 transistors per bit of associative memory (Figures 6.44-46, pp. 458-460)
- the cache's LRU block replacement policy is implemented by special hardware that constantly monitors cache usage
178Direct Mapping(I)
- An alternative, simpler address-mapping technique for caches
- Divide M1 into s1 sets M1(0), M1(1), ..., M1(s1-1), where s1 = 2^s
- each set is a block of n consecutive words
179Direct Mapping(II)
- M2(i) maps into M1(j) if j = i (modulo s1)
- if s1 = 2^6 = 64 blocks, then blocks with addresses i, i+64, i+128, i+192, ... can be mapped into M1(i)
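The modulo mapping is easy to check for s1 = 64; the block numbers are examples:

```python
s1 = 2 ** 6   # 64 cache blocks (sets)

def cache_block(i):
    """Main-memory block M2(i) maps into cache block M1(i mod s1)."""
    return i % s1

# Blocks i, i+64, i+128, ... all compete for the same cache block
print([cache_block(b) for b in (5, 5 + 64, 5 + 128, 5 + 192)])  # [5, 5, 5, 5]
```

This competition among blocks of the same class is what set-associative mapping relaxes.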
180Set-Associative Addressing
- k-way set-associative mapping
- Each set contains k = 2^h blocks
- Permits up to k members of the same equivalence class to be stored in the cache simultaneously
- M2(i) and M2(j) are in the same class if i = j (modulo s1)
- One-way set-associative mapping is equivalent to direct mapping
- Two-way, four-way, eight-way, ...
- Figure 6.49, pp. 463
181Design of a 2-Way Set-Associative Cache(I)
- 8KB 2-way set-associative addressing
- Example of a 32-bit processor
- (Figure 6.50, pp. 464)
- 8B block, VAX-11/780 in 1978
- 32B block, PowerPC/603 in 1993
182Design of a 2-Way Set-Associative Cache(II)
- 32-bit memory address
- Tag: 20 bits -> 20 bits per cache tag
- Set address: 9 bits -> 512 sets
- Displacement: 3 bits -> 64 bits (8 bytes) per block
- Cache architecture
- Tag RAM: 512 x 20 x 2 (T0 and T1)
- Data RAM: 512 x 64 x 2 (D0 and D1)
- two 20-bit tag comparators
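Splitting a 32-bit address into this cache's tag, set, and displacement fields can be sketched as follows (the address value is arbitrary):

```python
SET_BITS, DISP_BITS = 9, 3   # 512 sets, 8-byte (64-bit) blocks

def split_address(addr):
    """Return (tag, set_index, displacement) for a 32-bit address."""
    disp = addr & ((1 << DISP_BITS) - 1)
    set_index = (addr >> DISP_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (DISP_BITS + SET_BITS)   # the remaining 20 bits
    return tag, set_index, disp

tag, s, d = split_address(0xDEADBEEF)
print(hex(tag), s, d)  # 0xdeadb 477 7
```

The 9-bit set index selects one of the 512 sets; both tags in that set are then compared against the 20-bit tag field in parallel.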
183Design of a 2-Way Set-Associative Cache(III)
- Cache operation
- use the 9-bit set address to read T0 and T1 and compare both outputs with Atag simultaneously
- if a match occurs in Ti, it initiates a memory access of Di to or from the 64-bit data bus
- if a smaller data bus is used, a block needs several cycles to transfer its data
- if a miss occurs, a 64-bit block is swapped from main memory into the cache
- VAX-11/780 uses a random replacement and write-through updating policy
- PowerPC/603 uses an LRU and write-back policy
184Structure versus Performance
- The type of information to store in cache
- The dimension of cache
- The control method of cache
- The impact on performance
185Cache Types(I)
- By different access behavior patterns
- A unified cache stores both instructions and data together
- A split cache has two independent units
- an I-cache for instructions
- few write operations
- more temporal and spatial locality
- a D-cache for data
186Cache Types(II)
- By the level in the memory hierarchy
- Primary cache: Level 1 (L1) cache
- implemented as part of the on-chip memory of a microprocessor chip
- Secondary cache: Level 2 (L2) cache
- implemented in off-chip memory
187Performance
- tB = tA2: the block-transfer time from main memory to cache can be identical to a single