OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview
Transcript and Presenter's Notes
1
OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview
  • Chris Gilmore <grimjack@cs.pdx.edu>
  • Portland State University/OMSE

2
Today
  • Caches
  • DLX Assembly
  • CPU Overview


3
Computer System (Idealized)
(Diagram: CPU and Memory connected to a Disk Controller and Disk)
4
The Big Picture: Where Are We Now?
  • The Five Classic Components of a Computer
  • Next topics:
  • Simple caching techniques
  • Many ways to improve cache performance

5
Recap: Levels of the Memory Hierarchy
(Diagram: the memory hierarchy; upper levels are faster, lower levels
are larger. Processor <-(instr./operands)-> Cache <-(blocks)->
Memory <-(pages)-> Disk <-(files)-> Tape)
6
Recap: exploit locality to achieve fast memory
  • Two Different Types of Locality:
  • Temporal Locality (Locality in Time): if an item
    is referenced, it will tend to be referenced
    again soon.
  • Spatial Locality (Locality in Space): if an item
    is referenced, items whose addresses are close by
    tend to be referenced soon. (See the C sketch below.)
  • By taking advantage of the principle of locality:
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at the speed offered by the
    fastest technology.
  • DRAM is slow but cheap and dense
  • Good choice for presenting the user with a BIG
    memory system
  • SRAM is fast but expensive and not very dense
  • Good choice for providing the user FAST access
    time.
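A small C sketch of both kinds of locality (illustrative only):

#include <stdio.h>

int main(void)
{
    int a[1024];
    long sum = 0;
    for (int i = 0; i < 1024; i++)
        a[i] = i;
    /* Spatial locality: a[i] and a[i+1] sit in the same or the next
       cache block, so sequential access reuses each fetched block.
       Temporal locality: sum (and the loop code itself) is touched
       on every iteration. */
    for (int i = 0; i < 1024; i++)
        sum += a[i];
    printf("%ld\n", sum);
    return 0;
}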

7
Memory Hierarchy Terminology
  • Hit: data appears in some block in the upper
    level (example: Block X)
  • Hit Rate: the fraction of memory accesses found in
    the upper level
  • Hit Time: time to access the upper level, which
    consists of RAM access time + time to determine hit/miss
  • Miss: data needs to be retrieved from a block in
    the lower level (Block Y)
  • Miss Rate = 1 - (Hit Rate)
  • Miss Penalty: time to replace a block in the
    upper level + time to deliver the block to the processor
  • Hit Time << Miss Penalty

(Diagram: blocks move between upper level memory (Blk X) and lower
level memory (Blk Y); data flows to and from the processor.)
8
The Art of Memory System Design
Workload or benchmark programs drive the processor, which
issues a reference stream to the memory system:
<op,addr>, <op,addr>, <op,addr>, . . .  where op = i-fetch, read, or write.
Optimize the memory system organization to
minimize the average memory access time for
typical workloads.
9
Example: Fully Associative
  • Fully Associative Cache
  • No Cache Index
  • Compare the Cache Tags of all cache entries in
    parallel
  • Example: with 32 B blocks, we need N
    27-bit comparators
  • By definition, Conflict Miss = 0 for a fully
    associative cache

(Diagram: the 32-bit address splits into a 27-bit Cache Tag (bits
31..5) and a Byte Select (bits 4..0, e.g. 0x01); each entry's valid
bit and cache tag feed a comparator (X), all operating in parallel
over the cache data (bytes 0..31, 32..63, ...).)
10
Example: 1 KB Direct Mapped Cache with 32 B Blocks
  • For a 2^N byte cache:
  • The uppermost (32 - N) bits are always the Cache
    Tag
  • The lowest M bits are the Byte Select (Block Size
    = 2^M)
  • (A field-extraction sketch in C follows below.)

(Diagram: the block address splits into Cache Tag (bits 31..10,
example 0x50), Cache Index (bits 9..5, ex. 0x00), and Byte Select
(bits 4..0, ex. 0x01). The valid bit and cache tag are stored as part
of the cache state; the index selects one of 32 entries, e.g. entry 1
holds tag 0x50 and data bytes 32..63 of the 1024-byte data array.)
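A sketch of that field extraction in C (the address is hypothetical,
chosen to reproduce the slide's example fields):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 1 KB direct-mapped cache with 32 B blocks: N = 10, M = 5, so
       Byte Select = bits 4..0, Cache Index = bits 9..5, Tag = bits 31..10. */
    uint32_t addr     = 0x00014001;         /* hypothetical address      */
    uint32_t byte_sel = addr & 0x1F;        /* low 5 bits  -> 0x01       */
    uint32_t index    = (addr >> 5) & 0x1F; /* next 5 bits -> 0x00       */
    uint32_t tag      = addr >> 10;         /* top 22 bits -> 0x50       */
    printf("tag=0x%x index=0x%x byte=0x%x\n", tag, index, byte_sel);
    return 0;
}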
11
Set Associative Cache
  • N-way set associative: N entries for each Cache
    Index
  • N direct mapped caches operate in parallel
  • Example: Two-way set associative cache
  • Cache Index selects a set from the cache
  • The two tags in the set are compared to the input
    in parallel
  • Data is selected based on the tag result

(Diagram: the Cache Index selects one set; both ways' valid bits and
cache tags are compared against the address tag, the compare results
are ORed into a Hit signal, and a mux (Sel0/Sel1) picks the matching
way's cache block.)
12
Disadvantage of Set Associative Cache
  • N-way Set Associative Cache versus Direct Mapped
    Cache
  • N comparators vs. 1
  • Extra MUX delay for the data
  • Data comes AFTER Hit/Miss decision and set
    selection
  • In a direct mapped cache, Cache Block is
    available BEFORE Hit/Miss
  • Possible to assume a hit and continue. Recover
    later if miss.

13
Block Size Tradeoff
  • Larger block size takes advantage of spatial
    locality, BUT:
  • Larger block size means larger miss penalty
  • Takes longer to fill up the block
  • If block size is too big relative to cache size,
    miss rate will go up
  • Too few cache blocks
  • In general, Average Access Time =
    Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
    (see the numeric sketch below)

(Plots vs. block size: miss penalty rises with block size; miss rate
first falls as spatial locality is exploited, then rises once too few
blocks remain (compromising temporal locality); average access time is
therefore U-shaped.)
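A numeric sketch of that formula in C (the cycle counts and miss rate
below are assumptions for illustration, not data from the slides):

#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;   /* cycles, assumed */
    double miss_penalty = 50.0;  /* cycles, assumed */
    double miss_rate    = 0.05;  /* assumed         */
    double amat = hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
    printf("AMAT = %.2f cycles\n", amat);  /* 0.95 + 2.50 = 3.45 */
    return 0;
}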
14
A Summary on Sources of Cache Misses
  • Compulsory (cold start or process migration,
    first reference): first access to a block
  • Cold fact of life: not a whole lot you can do
    about it
  • Note: if you are going to run billions of
    instructions, compulsory misses are insignificant
  • Conflict (collision):
  • Multiple memory locations mapped to the same
    cache location
  • Solution 1: increase cache size
  • Solution 2: increase associativity
  • Capacity:
  • Cache cannot contain all blocks accessed by the
    program
  • Solution: increase cache size
  • Coherence (Invalidation): other process (e.g.,
    I/O) updates memory

15
Sources of Cache Misses: Quiz
Assume constant cost. For each organization (Direct Mapped,
N-way Set Associative, Fully Associative), rate:
  Cache Size: Small, Medium, Big?
  Compulsory Miss
  Conflict Miss
  Capacity Miss
  Coherence Miss
Choices: Zero, Low, Medium, High, Same
16
Sources of Cache Misses: Answer

                   Direct Mapped   N-way Set Assoc.   Fully Associative
  Cache Size       Big             Medium             Small
  Compulsory Miss  Same            Same               Same
  Conflict Miss    High            Medium             Zero
  Capacity Miss    Low             Medium             High
  Coherence Miss   Same            Same               Same

Note: if you are going to run billions of
instructions, compulsory misses are insignificant.
17
Recap: Four Questions for Caches and Memory
Hierarchy
  • Q1: Where can a block be placed in the upper
    level? (Block placement)
  • Q2: How is a block found if it is in the upper
    level? (Block identification)
  • Q3: Which block should be replaced on a miss?
    (Block replacement)
  • Q4: What happens on a write? (Write strategy)

18
Q1: Where can a block be placed in the upper
level?
  • Block 12 placed in an 8-block cache:
  • Fully associative, direct mapped, 2-way set
    associative
  • S.A. mapping = Block Number modulo Number of Sets
    (worked out in the sketch below)

(Diagram: fully associative: block 12 can go anywhere among frames
0-7; direct mapped: only into frame (12 mod 8) = 4; 2-way set
associative: anywhere in set (12 mod 4) = 0.)
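The placement arithmetic from the slide, sketched in C:

#include <stdio.h>

int main(void)
{
    int block = 12, frames = 8;
    /* direct mapped: exactly one candidate frame */
    printf("direct mapped: frame %d\n", block % frames);
    /* 2-way set associative: 4 sets of 2 frames each */
    int set = block % (frames / 2);
    printf("2-way: set %d (frames %d or %d)\n", set, 2 * set, 2 * set + 1);
    /* fully associative: any frame */
    printf("fully associative: any of frames 0..%d\n", frames - 1);
    return 0;
}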
19
Q2: How is a block found if it is in the upper
level?
(Address fields: the index acts as Set Select; the block offset
acts as Data Select.)
  • Direct indexing (using index and block offset),
    tag compares, or combination
  • Increasing associativity shrinks index, expands
    tag

20
Q3: Which block should be replaced on a miss?
  • Easy for Direct Mapped
  • Set Associative or Fully Associative:
  • Random
  • FIFO
  • LRU (Least Recently Used) - a sketch follows below
  • LFU (Least Frequently Used)
  • Data cache miss rates, LRU vs. Random:

              2-way            4-way            8-way
  Size        LRU     Random   LRU     Random   LRU     Random
  16 KB       5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
  64 KB       1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
  256 KB      1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
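For a 2-way set, LRU needs only one bit per set; a minimal C sketch
(the structure and function names are ours, for illustration):

#include <stdbool.h>

/* One 2-way set: 'lru' names the way used least recently,
   i.e. the next victim. */
struct set2 {
    unsigned tag[2];
    bool valid[2];
    int lru;
};

/* Returns the way that ends up holding 'tag'; evicts on a miss. */
static int lookup(struct set2 *s, unsigned tag)
{
    for (int w = 0; w < 2; w++)
        if (s->valid[w] && s->tag[w] == tag) {  /* hit */
            s->lru = 1 - w;   /* the other way becomes LRU */
            return w;
        }
    int victim = s->lru;      /* miss: evict the LRU way */
    s->tag[victim] = tag;
    s->valid[victim] = true;
    s->lru = 1 - victim;
    return victim;
}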

21
Q4: What happens on a write?
  • Write through: the information is written to both
    the block in the cache and to the block in the
    lower-level memory.
  • Write back: the information is written only to the
    block in the cache. The modified cache block is
    written to main memory only when it is replaced.
  • Is the block clean or dirty?
  • Pros and cons of each?
  • WT: read misses cannot result in writes;
    coherency is easier
  • WB: repeated writes to a block cause no extra
    memory writes
  • WT is always combined with write buffers so that
    writes don't wait for lower level memory

22
Write Buffer for Write Through
(Diagram: Processor -> Cache and Write Buffer -> DRAM)
  • A Write Buffer is needed between the Cache and
    Memory
  • Processor: writes data into the cache and the
    write buffer
  • Memory controller: writes contents of the buffer
    to memory
  • Write buffer is just a FIFO
  • Typical number of entries: 4
  • Works fine if store frequency (w.r.t. time) <<
    1 / DRAM write cycle
  • Memory system designer's nightmare:
  • Store frequency (w.r.t. time) > 1 / DRAM write
    cycle
  • Write buffer saturation
23
Write Buffer Saturation
(Diagram: Processor -> Cache and Write Buffer -> DRAM)
  • Store frequency (w.r.t. time) > 1 / DRAM write
    cycle
  • If this condition exists for a long period of time
    (CPU cycle time too quick and/or too many store
    instructions in a row):
  • Store buffer will overflow no matter how big you
    make it
  • The CPU Cycle Time < DRAM Write Cycle Time
  • Solutions for write buffer saturation:
  • Use a write back cache
  • Install a second level (L2) cache (does this
    always work?)

(Diagram: Processor -> Cache -> Write Buffer -> L2 Cache -> DRAM)
24
Write-miss Policy: Write Allocate versus Not
Allocate
  • Assume a 16-bit write to memory location 0x0
    that causes a miss
  • Do we read in the block?
  • Yes: Write Allocate
  • No: Write Not Allocate

(Diagram: the same 1 KB direct-mapped cache as before; the address
splits into Cache Tag (example 0x00), Cache Index (ex. 0x00), and
Byte Select (ex. 0x00); entry 0 currently holds tag 0x50, so the
write misses.)
25
Impact on Cycle Time
Cache Hit Time is directly tied to clock rate; it
increases with cache size and with associativity.
Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
Time = IC x CT x (ideal CPI + memory stalls)
26
What happens on a Cache miss?
  • For an in-order pipeline, 2 options:
  • Freeze pipeline in Mem stage (popular early on:
    Sparc, R4000)
      IF ID EX Mem stall stall stall stall Mem Wr
         IF ID EX  stall stall stall stall stall Ex Wr
  • Use Full/Empty bits in registers + MSHR queue
  • MSHR = Miss Status/Handler Registers (Kroft).
    Each entry in this queue keeps track of the
    status of outstanding memory requests to one
    complete memory line.
  • Per cache-line: keep info about the memory address.
  • For each word: the register (if any) that is waiting
    for the result.
  • Used to merge multiple requests to one memory
    line
  • A new load creates an MSHR entry and sets the destination
    register to Empty. The load is released from the
    pipeline.
  • An attempt to use the register before the result returns
    causes the instruction to block in the decode stage.
  • Limited out-of-order execution with respect to
    loads. Popular with in-order superscalar
    architectures.
  • Out-of-order pipelines already have this
    functionality built in (load queues, etc.).

27
Improving Cache Performance: 3 general options
Time = IC x CT x (ideal CPI + memory stalls)
Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty)
                           = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
28
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
29
3Cs Absolute Miss Rate (SPEC92)
(Plot: absolute miss rate vs. cache size, split into the 3Cs;
the compulsory component is vanishingly small.)
30
2:1 Cache Rule
miss rate of a 1-way associative cache of size X ≈
miss rate of a 2-way associative cache of size X/2
(Plot: the gap between the curves is the conflict-miss component.)
31
3Cs Relative Miss Rate
(Plot: the 3Cs as fractions of total misses vs. cache size.)
Flaws: fixed block size. Good: insight => invention.
32
1. Reduce Misses via Larger Block Size
33
2. Reduce Misses via Higher Associativity
  • 2:1 Cache Rule:
  • Miss Rate of a DM cache of size N = Miss Rate of a 2-way cache
    of size N/2
  • Beware: execution time is the only final measure!
  • Will clock cycle time increase?
  • Hill [1988] suggested hit time for 2-way vs.
    1-way: external cache +10%, internal +2%

34
Example: Avg. Memory Access Time vs. Miss Rate
  • Example: assume CCT = 1.10 for 2-way, 1.12 for
    4-way, 1.14 for 8-way vs. CCT of direct mapped
  • (AMAT = Average Memory Access Time; for the larger caches,
    AMAT is often not improved by more associativity)

  Cache Size (KB)   1-way   2-way   4-way   8-way
  1                 2.33    2.15    2.07    2.01
  2                 1.98    1.86    1.76    1.68
  4                 1.72    1.67    1.61    1.53
  8                 1.46    1.48    1.47    1.43
  16                1.29    1.32    1.32    1.32
  32                1.20    1.24    1.25    1.27
  64                1.14    1.20    1.21    1.23
  128               1.10    1.17    1.18    1.20
35
3. Reducing Misses via a Victim Cache
  • How to combine the fast hit time of direct mapped
    yet still avoid conflict misses?
  • Add a buffer to hold data discarded from the cache
  • Jouppi [1990]: a 4-entry victim cache removed 20%
    to 95% of conflicts for a 4 KB direct mapped data
    cache
  • Used in Alpha, HP machines

(Diagram: a small fully associative victim cache (a tag-and-comparator
per line of data) sits between the direct-mapped cache's TAGS/DATA
arrays and the next lower level in the hierarchy.)
36
4. Reducing Misses by Hardware Prefetching
  • E.g., instruction prefetching:
  • Alpha 21064 fetches 2 blocks on a miss
  • Extra block placed in a stream buffer
  • On miss, check the stream buffer
  • Works with data blocks too:
  • Jouppi [1990]: 1 data stream buffer got 25% of misses
    from a 4KB cache; 4 streams got 43%
  • Palacharla & Kessler [1994]: for scientific
    programs, 8 streams got 50% to 70% of misses
    from two 64KB, 4-way set associative caches
  • Prefetching relies on having extra memory
    bandwidth that can be used without penalty

37
5. Reducing Misses by Software Prefetching Data
  • Data prefetch:
  • Load data into register (HP PA-RISC loads)
  • Cache prefetch: load into cache (MIPS IV,
    PowerPC, SPARC v. 9)
  • Special prefetching instructions cannot cause
    faults; a form of speculative execution
  • Issuing prefetch instructions takes time:
  • Is the cost of prefetch issues < the savings in reduced
    misses?
  • Higher superscalar reduces the difficulty of issue
    bandwidth

38
6. Reducing Misses by Compiler Optimizations
  • McFarling [1989] reduced cache misses by 75% on an
    8KB direct mapped cache with 4 byte blocks, in
    software
  • Instructions:
  • Reorder procedures in memory so as to reduce
    conflict misses
  • Profiling to look at conflicts (using tools they
    developed)
  • Data:
  • Merging Arrays: improve spatial locality by a
    single array of compound elements vs. 2 arrays
  • Loop Interchange: change nesting of loops to
    access data in the order stored in memory (see the
    sketch below)
  • Loop Fusion: combine 2 independent loops that
    have the same looping and some variables overlap
  • Blocking: improve temporal locality by accessing
    blocks of data repeatedly vs. going down whole
    columns or rows
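As an illustration of loop interchange (a generic C example in the
spirit of the slide, not McFarling's code):

#define N 1024
static double x[N][N];

/* Before: C stores x row-major, so fixing j and walking i strides
   through memory N doubles at a time; each cache block yields one use. */
void scale_columnwise(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* After interchange: consecutive iterations touch consecutive
   addresses, so every word of a fetched block is used. */
void scale_rowwise(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}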

39
Improving Cache Performance (Continued)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
40
0. Reducing Penalty: Faster DRAM / Interface
  • New DRAM technologies:
  • RAMBUS: same initial latency, but much higher
    bandwidth
  • Synchronous DRAM
  • Better bus interfaces
  • CRAY technique: only use SRAM

41
1. Reducing Penalty: Read Priority over Write on
Miss
(Diagram: Processor -> Cache and Write Buffer -> DRAM)
  • A Write Buffer allows reads to bypass writes
  • Processor: writes data into the cache and the
    write buffer
  • Memory controller: writes contents of the buffer
    to memory
  • Write buffer is just a FIFO
  • Typical number of entries: 4
  • Works fine if store frequency (w.r.t. time) <<
    1 / DRAM write cycle
  • Memory system designer's nightmare:
  • Store frequency (w.r.t. time) > 1 / DRAM write
    cycle
  • Write buffer saturation

42
1. Reducing Penalty: Read Priority over Write on
Miss
  • Write-Buffer issues:
  • Write through with write buffers offers RAW
    conflicts with main memory reads on cache misses
  • If we simply wait for the write buffer to empty, we might
    increase the read miss penalty (old MIPS 1000: by 50%)
  • => Check write buffer contents before a read;
    if no conflicts, let the memory access continue
  • Write Back?
  • Read miss replacing a dirty block:
  • Normal: write the dirty block to memory, and then do
    the read
  • Instead: copy the dirty block to a write buffer,
    then do the read, and then do the write
  • The CPU stalls less since it restarts as soon as the
    read is done

43
2. Reduce Penalty: Early Restart and Critical
Word First
  • Don't wait for the full block to be loaded before
    restarting the CPU
  • Early restart: as soon as the requested word of
    the block arrives, send it to the CPU and let the
    CPU continue execution
  • Critical Word First: request the missed word first
    from memory and send it to the CPU as soon as it
    arrives; let the CPU continue execution while
    filling the rest of the words in the block. Also
    called wrapped fetch and requested word first
  • Generally useful only with large blocks
  • Spatial locality is a problem: we tend to want the
    next sequential word, so it is not clear whether
    early restart helps
44
3. Reduce Penalty: Non-blocking Caches
  • Non-blocking cache or lockup-free cache: allow the
    data cache to continue to supply cache hits
    during a miss
  • requires F/E bits on registers or out-of-order
    execution
  • requires multi-bank memories
  • "hit under miss" reduces the effective miss
    penalty by working during the miss vs. ignoring CPU
    requests
  • "hit under multiple miss" or "miss under miss"
    may further lower the effective miss penalty by
    overlapping multiple misses
  • Significantly increases the complexity of the
    cache controller, as there can be multiple
    outstanding memory accesses
  • Requires multiple memory banks (otherwise cannot
    be supported)
  • Pentium Pro allows 4 outstanding memory misses

45
Value of Hit Under Miss for SPEC
Hit under n misses:
  • FP programs on average: AMAT = 0.68 -> 0.52 ->
    0.34 -> 0.26
  • Int programs on average: AMAT = 0.24 -> 0.20 ->
    0.19 -> 0.19
  • 8 KB Data Cache, Direct Mapped, 32B block, 16
    cycle miss

46
4. Reduce Penalty: Second-Level Cache
  • L2 Equations:
  • AMAT = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
  • Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
  • AMAT = Hit Time(L1) +
    Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))
    (a numeric sketch follows below)
  • Definitions:
  • Local miss rate: misses in this cache divided by
    the total number of memory accesses to this cache
    (Miss Rate(L2))
  • Global miss rate: misses in this cache divided by
    the total number of memory accesses generated by
    the CPU (Miss Rate(L1) x Miss Rate(L2))
  • Global Miss Rate is what matters

(Diagram: Proc -> L1 Cache -> L2 Cache)
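Plugging assumed numbers into those equations (illustration only; none
of these values come from the slides):

#include <stdio.h>

int main(void)
{
    double hit_l1 = 1.0, hit_l2 = 10.0, pen_l2 = 100.0; /* cycles, assumed */
    double mr_l1 = 0.05, mr_l2_local = 0.5;             /* rates, assumed  */
    double pen_l1 = hit_l2 + mr_l2_local * pen_l2;  /* Miss Penalty(L1) = 60 */
    double amat = hit_l1 + mr_l1 * pen_l1;          /* 1 + 0.05 * 60 = 4.00  */
    printf("AMAT = %.2f cycles, global L2 miss rate = %.3f\n",
           amat, mr_l1 * mr_l2_local);
    return 0;
}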
47
Reducing Misses: which techniques apply to the L2 Cache?
  • Reducing Miss Rate
  • 1. Reduce Misses via Larger Block Size
  • 2. Reduce Conflict Misses via Higher
    Associativity
  • 3. Reducing Conflict Misses via Victim Cache
  • 4. Reducing Misses by HW Prefetching Instr, Data
  • 5. Reducing Misses by SW Prefetching Data
  • 6. Reducing Capacity/Conf. Misses by Compiler
    Optimizations

48
L2 cache block size vs. A.M.A.T.
  • 32KB L1, 8 byte path to memory

49
Improving Cache Performance (Continued)
  • 1. Reduce the miss rate,
  • 2. Reduce the miss penalty, or
  • 3. Reduce the time to hit in the cache
  • - Lower Associativity (victim caching)?
  • - 2nd-level cache
  • - Careful Virtual Memory Design

50
Summary 1/3
  • The Principle of Locality:
  • A program is likely to access a relatively small
    portion of the address space at any instant of
    time.
  • Temporal Locality: locality in time
  • Spatial Locality: locality in space
  • Three (+1) Major Categories of Cache Misses:
  • Compulsory Misses: sad facts of life. Example:
    cold start misses.
  • Conflict Misses: increase cache size and/or
    associativity. Nightmare scenario: ping pong
    effect!
  • Capacity Misses: increase cache size
  • Coherence Misses: caused by external processors
    or I/O devices
  • Cache Design Space:
  • total size, block size, associativity
  • replacement policy
  • write-hit policy (write-through, write-back)
  • write-miss policy

51
Summary 2/3: The Cache Design Space
  • Several interacting dimensions:
  • cache size
  • block size
  • associativity
  • replacement policy
  • write-through vs write-back
  • write allocation
  • The optimal choice is a compromise:
  • depends on access characteristics
  • workload
  • use (I-cache, D-cache, TLB)
  • depends on technology / cost
  • Simplicity often wins

(Diagram: the design space sketched as trade-off curves of Factor A
vs. Factor B over cache size, associativity, and block size, running
from Good/Less to Bad/More.)
52
Summary 3/3: Cache Miss Optimization
  • Lots of techniques people use to improve the miss
    rate of caches:

  Technique                         MR   MP   HT   Complexity
  Larger Block Size                 +    -         0
  Higher Associativity              +         -    1
  Victim Caches                     +              2
  Pseudo-Associative Caches         +              2
  HW Prefetching of Instr/Data      +              2
  Compiler Controlled Prefetching   +              3
  Compiler Reduce Misses            +              0

  (MR = miss rate, MP = miss penalty, HT = hit time;
   "+" helps that factor, "-" hurts it)
53
Onto Assembler!
  • What is assembly language?
  • A machine-specific programming language
  • one-to-one correspondence between statements and
    native machine language
  • matches the machine's instruction set and architecture

54
What is an assembler?
  • Systems Level Program
  • Usually works in conjunction with the compiler
  • translates assembly language source code to
    machine language
  • object file - contains machine instructions,
    initial data, and information used when loading
    the program
  • listing file - contains a record of the
    translation process, line numbers, addresses,
    generated code and data, and a symbol table


55
Why learn assembly?
  • Learn how a processor works
  • Understand basic computer architecture
  • Explore the internal representation of data and
    instructions
  • Gain insight into hardware concepts
  • Allows creation of small and efficient programs
  • Allows programmers to bypass high-level language
    restrictions
  • Might be necessary to accomplish certain
    operations


56
Machine Representation
  • A language of numbers, called the Processor's
    Instruction Set:
  • The set of basic operations a processor can
    perform
  • Each instruction is coded as a number
  • Instructions may be one or more bytes
  • Every number corresponds to an instruction


57
Assembly vs Machine
  • Machine Language Programming
  • Writing a list of numbers representing the bytes
    of machine instructions to be executed and data
    constants to be used by the program
  • Assembly Language Programming
  • Using symbolic instructions to represent the raw
    data that will form the machine language program
    and initial data constants


58
Assembly
  • Mnemonics represent Machine Instructions
  • Each mnemonic used represents a single machine
    instruction
  • The assembler performs the translation
  • Some mnemonics require operands
  • Operands provide additional information
  • register, constant, address, or variable
  • Assembler Directives


59
Instruction Set Architecture: a Critical
Interface
(Diagram: the instruction set is the interface layer between
software above and hardware below.)
The ISA is the portion of the machine that is visible to the
programmer or the compiler writer.
60
Good ISA
  • Lasts through many implementations (portability,
    compatibility)
  • Can be used for many different applications
    (generality)
  • Provides convenient functionality to higher levels
  • Permits an efficient implementation at lower
    levels


61
Von Neumann Machines
  • Von Neumann invented the stored program computer in
    1945
  • Instead of program code being hardwired, the
    program code (instructions) is placed in memory
    along with data

(Diagram: Control and ALU connected to a single memory holding both
Program and Data.)
62
Basic ISA Classes
  • Memory-to-Memory Machines:
  • Every instruction contains a full memory address
    for each operand
  • Maybe the simplest ISA design
  • However, memory is slow
  • And memory is big (lots of address bits)


63
Memory-to-memory machine
  • Assumptions:
  • Two operands per operation
  • the first operand is also the destination
  • Memory address: 16 bits (2 bytes)
  • Operand size: 32 bits (4 bytes)
  • Instruction code: 8 bits (1 byte)
  • Example: A = B + C (hypothetical code)
  • mov A, B   ; A <- B
  • add A, C   ; A <- B + C
  • 5 bytes for each instruction
  • 4 bytes to fetch each operand
  • 4 bytes to store the result
  • add needs 17 bytes and mov needs 13 bytes
  • Total: 30 bytes of memory traffic

64
Why CPU Storage?
  • A small amount of storage in the CPU:
  • To reduce memory traffic by keeping repeatedly
    used operands in the CPU
  • Avoid re-referencing memory
  • Avoid having to specify the full memory address of
    the operand
  • This is a perfect example of "make the common
    case fast."
  • Simplest case:
  • A machine with 1 cell of CPU storage: the
    accumulator

65
Accumulator Machine
  • Assumptions:
  • Two operands per operation
  • 1st operand in the accumulator
  • 2nd operand in memory
  • the accumulator is also the destination (except for
    store)
  • Memory address: 16 bits (2 bytes)
  • Operand size: 32 bits (4 bytes)
  • Instruction code: 8 bits (1 byte)
  • Example: A = B + C (hypothetical code)
  • Load B    ; acc <- B
  • Add C     ; acc <- B + C
  • Store A   ; A <- acc
  • 3 bytes for each instruction
  • 4 bytes to load or store the second operand
  • 7 bytes per instruction
  • 21 bytes total memory traffic


66
Stack Machines
  • Instruction sets are based on a stack model of
    execution.
  • Aimed for compact instruction encoding
  • Most instructions manipulate the top few data items
    (mostly the top 2) of a pushdown stack.
  • The top few items of the stack are kept in the CPU
  • Ideal for evaluating expressions (the stack holds
    intermediate results)
  • Were thought to be a good match for high level
    languages
  • Awkward:
  • Become very slow if the stack grows beyond CPU local
    storage
  • No simple way to get data from the middle of the stack


67
Stack Machines
  • Binary arithmetic and logic operations:
  • Operands: top 2 items on the stack
  • Operands are removed from the stack
  • Result is placed on top of the stack
  • Unary arithmetic and logic operations:
  • Operand: top item on the stack
  • Operand is replaced by the result of the operation
  • Data move operations:
  • Push: place memory data on top of the stack
  • Pop: move the top of the stack to memory


68
General Purpose Register Machines
  • With stack machines, only the top two elements of
    the stack are directly available to instructions.
    In general purpose register machines, the CPU
    storage is organized as a set of registers which
    are equally available to the instructions
  • Frequently used operands are placed in registers
    (under program control)
  • Reduces instruction size
  • Reduces memory traffic


69
General Purpose Registers Dominate
  • 1975-present: all machines use general purpose
    registers
  • Advantages of registers:
  • registers are faster than memory
  • registers are easier for a compiler to use
  • e.g., (A*B) - (C*D) - (E*F) can do the multiplies in
    any order
  • registers can hold variables:
  • memory traffic is reduced, so the program speeds up
    (since registers are faster than memory)
  • code density improves (since a register is named with
    fewer bits than a memory location)


70
Classifying General Purpose Register Machines
  • General purpose register machines are
    sub-classified based on whether or not memory
    operands can be used by typical ALU instructions:
  • Register-memory machines: machines where some ALU
    instructions can specify at least one memory
    operand and one register operand
  • Load-store machines: the only instructions that
    can access memory are the load and the store
    instructions


71
Comparing number of instructions
  • Code sequence for A = B + C for five classes of
    instruction sets:

  Stack:                  push B
                          push C
                          add
                          pop A
  Accumulator:            load B
                          add C
                          store A
  Memory to Memory:       mov A, B
                          add A, C
  Register (Register-memory):
                          load R1, B
                          add R1, C
                          store A, R1
  Register (Load-store):  load R1, B
                          load R2, C
                          add R1, R1, R2
                          store A, R1

DLX/MIPS is one of these (a load-store machine).
72
Instruction Set Definition
  • Objects: architectural entities, machine state:
  • Registers
  • General purpose
  • Special purpose (e.g. program counter, condition
    code, stack pointer)
  • Memory locations
  • Linear address space: 0, 1, 2, ..., 2^s - 1
  • Operations: instruction types:
  • Data operation
  • Arithmetic
  • Logical
  • Data transfer
  • Move (from register to register)
  • Load (from memory location to register)
  • Store (from register to memory location)
  • Instruction sequencing
  • Branch (conditional)
  • Jump (unconditional)

73
Topic DLX
  • An instructional architecture
  • Much nicer and easier to understand than x86
    (barf)
  • The Plan: teach DLX, then move to x86/y86
  • DLX: a RISC ISA, very similar to MIPS
  • Great link to learn more about DLX:
  • http://www.softpanorama.org/Hardware/architecture.shtml#DLX

74
DLX Architecture
  • Based on observations about instruction set
    architecture
  • Emphasizes:
  • Simple load-store instruction set
  • Design for pipeline efficiency
  • Design for use as a compiler target
  • DLX registers:
  • 32 32-bit GPRs named R0, R1, ..., R31
  • 32-bit FPRs accessed as F0, F2, ..., F30
  • Accessed independently for 32-bit data
  • Accessed in pairs for 64-bit (double-precision)
    data
  • Register R0 is hard-wired to zero
  • Other status registers, e.g., the floating-point
    status register
  • Byte addressable in big-endian with 32-bit
    addresses
  • Arithmetic instructions: operands must be
    registers

75
MIPS Software Conventions for Registers
   0      zero   constant 0
   1      at     reserved for assembler
   2-3    v0-v1  expression evaluation and function results
   4-7    a0-a3  arguments
   8-15   t0-t7  temporaries, caller saves (callee can clobber)
   16-23  s0-s7  callee saves (callee must save)
   24-25  t8-t9  temporaries (cont'd)
   26-27  k0-k1  reserved for OS kernel
   28     gp     pointer to global area
   29     sp     stack pointer
   30     fp     frame pointer
   31     ra     return address (HW)
76
Addressing Modes
This table shows the most common modes.

  Addressing Mode     Example Instruction  Meaning                        When Used
  Register            Add R4, R3           R[R4] <- R[R4] + R[R3]         When a value is in a register.
  Immediate           Add R4, #3           R[R4] <- R[R4] + 3             For constants.
  Displacement        Add R4, 100(R1)      R[R4] <- R[R4] + M[100+R[R1]]  Accessing local variables.
  Register Deferred   Add R4, (R1)         R[R4] <- R[R4] + M[R[R1]]      Using a pointer or a computed address.
  Absolute            Add R4, (1001)       R[R4] <- R[R4] + M[1001]       Used for static data.
77
Memory Organization
  • Viewed as a large, single-dimension array, with
    an address.
  • A memory address is an index into the array
  • "Byte addressing" means that the index points to
    a byte of memory.

(Diagram: memory as an array of bytes; addresses 0, 1, 2, ... each
index 8 bits of data.)
78
Memory Addressing
  • Bytes are nice, but most data items use larger
    "words"
  • For DLX, a word is 32 bits or 4 bytes.
  • 2 questions for the design of an ISA:
  • Since one could read a 32-bit word as four loads
    of bytes from sequential byte addresses, or as one
    load word from a single byte address:
  • How do byte addresses map to word addresses?
  • Can a word be placed on any byte boundary?


79
Addressing Objects: Endianness and Alignment
  • Big Endian: address of most significant byte =
    word address (xx00 = Big End of word)
  • IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA
  • Little Endian: address of least significant byte =
    word address (xx00 = Little End of word)
  • Intel 80x86, DEC Vax, DEC Alpha (Windows NT)
    (A C sketch for observing this follows below.)

(Diagram: within a word, little endian numbers the bytes 3 2 1 0 from
msb to lsb; big endian numbers them 0 1 2 3. Aligned words start at
addresses that are multiples of 4; misaligned ones do not.)

Alignment: requires that objects fall on an address
that is a multiple of their size.
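A common C sketch for observing a machine's endianness:

#include <stdio.h>

int main(void)
{
    unsigned int word = 0x01020304;
    unsigned char *p = (unsigned char *)&word;
    /* Little endian: byte 0 holds 0x04 (the little end).
       Big endian:    byte 0 holds 0x01 (the big end).    */
    printf("byte 0 = 0x%02x -> %s endian\n",
           p[0], p[0] == 0x04 ? "little" : "big");
    return 0;
}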
80
Assembly Language vs. Machine Language
  • Assembly provides a convenient symbolic
    representation
  • much easier than writing down numbers
  • e.g., destination first
  • Machine language is the underlying reality
  • e.g., destination is no longer first
  • Assembly can provide 'pseudoinstructions'
  • e.g., "move r10, r11" exists only in assembly
  • would be implemented using "add r10, r11, r0"
  • When considering performance you should count
    real instructions

81
Stored Program Concept
  • Instructions are bits
  • Programs are stored in memory, to be read or
    written just like data
  • Fetch & Execute Cycle:
  • Instructions are fetched and put into a special
    register
  • Bits in the register "control" the subsequent
    actions
  • Fetch the next instruction and continue

(Diagram: one memory holds data and programs: compilers, editors,
etc.)
82
DLX arithmetic
  • ALU instructions can have 3 operands:
  • add R1, R2, R3
  • sub R1, R2, R3
  • Operand order is fixed (destination first)
  • Example:
    C code:    A = B + C
    DLX code:  add r1, r2, r3
    (registers are associated with variables by the compiler)

83
DLX arithmetic
  • Design Principle: simplicity favors regularity.
    Why?
  • Of course this complicates some things...
    C code:     A = B + C + D;  E = F - A;
    MIPS code:  add r1, r2, r3
                add r1, r1, r4
                sub r5, r6, r1
  • Operands must be registers; only 32 registers are
    provided
  • Design Principle: smaller is faster. Why?

84
Executing assembly instructions
  • The program counter holds the instruction address
  • The CPU fetches the instruction from memory and puts it
    into the instruction register
  • Control logic decodes the instruction and tells
    the register file, ALU and other registers what
    to do
  • For an ALU operation (e.g. add), data flows from the
    register file, through the ALU, and back to the register
    file

85
ALU Execution Example
86
ALU Execution Example
87
Memory Instructions
  • Load and store instructions:
  • lw r11, offset(r10)
  • sw r11, offset(r10)
  • Example:
    C code:    A[8] = h + A[8];
    (assume h is in r2 and the base address of array A is in r3)
  • DLX code:  lw r4, 32(r3)
               add r4, r2, r4
               sw r4, 32(r3)
  • Store word has the destination last
  • Remember: arithmetic operands are registers, not
    memory!

88
Memory Operations - Loads
  • Load data from memory:
  • lw R6, 0(R5)   ; R6 <- mem[0x14]  (assuming R5 = 0x14)

89
Memory Operations - Stores
  • Storing data to memory works essentially the same
    way:
  • sw R6, 0(R5)
  • R6 = 200; let's assume R5 = 0x18
  • mem[0x18] <- 200

90
So far weve learned
So far we've learned:
  • DLX: loading words but addressing bytes;
    arithmetic on registers only
  • Instruction          Meaning
    add r1, r2, r3       r1 = r2 + r3
    sub r1, r2, r3       r1 = r2 - r3
    lw r1, 100(r2)       r1 = Memory[r2 + 100]
    sw r1, 100(r2)       Memory[r2 + 100] = r1

91
Use of Registers
  • Example:
  • a = (b + c) - (d + e);   // C statement
  • // r1-r5 hold a-e
  • add r10, r2, r3
  • add r11, r4, r5
  • sub r1, r10, r11
  • a = b + A[4];   // add an array element to a var
  • // r3 has the address of A
  • lw r4, 16(r3)
  • add r1, r2, r4


92
Use of Registers: load and store
  • Example:
  • A[8] = a + A[6];   // A is in r3, a is in r2
  • lw r1, 24(r3)    ; r1 gets A[6]'s contents
  • add r1, r2, r1   ; r1 gets the sum
  • sw r1, 32(r3)    ; the sum is put in A[8]


93
load and store
  • Example:
  • a = b + A[i];   // A is in r3; a, b, i in r1, r2, r4
  • add r11, r4, r4     ; r11 = 2 * i
  • add r11, r11, r11   ; r11 = 4 * i
  • add r11, r11, r3    ; r11 = addr. of A[i]
  •                     ;   (r3 + (4*i))
  • lw r10, 0(r11)      ; r10 = A[i]
  • add r1, r2, r10     ; a = b + A[i]


94
Example Swap
  • Swapping words
  • r2 has the base address of the array v

C:  temp = v[0]; v[0] = v[1]; v[1] = temp;

swap: lw r10, 0(r2)   ; r10 = v[0]
      lw r11, 4(r2)   ; r11 = v[1]
      sw r10, 4(r2)   ; v[1] = old v[0]
      sw r11, 0(r2)   ; v[0] = old v[1]

95
DLX Instruction Format
  • Instruction Formats: I-type, R-type, J-type

  I-type:  6 | 5 | 5 | 16
  R-type:  6 | 5 | 5 | 5 | 11
  J-type:  6 | 26
96
Machine Language
  • Instructions, like registers and words of data,
    are also 32 bits long
  • Example: add r10, r1, r2
  • registers have numbers: 10, 1, 2
  • Instruction format (R-type: 6 | 5 | 5 | 5 | 11):
  • 000000 00001 00010 01010 00000100000
97
Machine Language
  • Consider the load-word and store-word
    instructions:
  • What would the regularity principle have us do?
  • New principle: good design demands a compromise
  • Introduce a new type of instruction format:
  • I-type, for data transfer instructions
  • the other format was R-type, for register operations
  • Example: lw r10, 32(r2)

  I-type (loads/stores): 6 | 5 | 5 | 16
  100011 00010 01010 0000000000100000
98
Machine Language
  • Jump instructions:
  • Example: j .L1

  J-type (jump, jump and link, trap,
  return from exception): 6 | 26
  000010 | offset to .L1
99
DLX Instruction Format
  • Instruction Formats: I-type, R-type, J-type

  I-type:  6 | 5 | 5 | 16
  R-type:  6 | 5 | 5 | 5 | 11
  J-type:  6 | 26
100
Instructions for Making Decisions
  • beq reg1, reg2, L1
  • Go to the statement labeled L1 if the value in
    reg1 equals the value in reg2
  • bne reg1, reg2, L1
  • Go to the statement labeled L1 if the value in
    reg1 does not equal the value in reg2
  • j L1
  • Unconditional jump
  • jr r10
  • Jump register: jump to the instruction address
    held in register r10

101
Making Decisions
  • Example:
  • if (a != b) goto L1;   // x,y,z,a,b mapped to r1-r5
  • x = y + z;
  • L1: x = x - a;
  • bne r4, r5, L1       ; goto L1 if a != b
  • add r1, r2, r3       ; x = y + z (skipped if a != b)
  • L1: sub r1, r1, r4   ; x = x - a (always executed)

102
if-then-else
  • Example:
  • if (a == b) x = y + z;
  • else x = y - z;
  • bne r4, r5, Else     ; goto Else if a != b
  • add r1, r2, r3       ; x = y + z
  • j Exit               ; goto Exit
  • Else: sub r1, r2, r3 ; x = y - z
  • Exit:

103
Example Loop with array index
  • Loop:  g = g + A[i];  i = i + j;
           if (i != h) goto Loop;  ....
  • r1, r2, r3, r4 hold g, h, i, j; the array base is in r5
  • LOOP: add r11, r3, r3    ; r11 = 2 * i
          add r11, r11, r11  ; r11 = 4 * i
          add r11, r11, r5   ; r11 = addr. of A[i]
          lw r10, 0(r11)     ; load A[i]
          add r1, r1, r10    ; g = g + A[i]
          add r3, r3, r4     ; i = i + j
          bne r3, r2, LOOP

104
Other decisions
  • Set R1 if R2 less than R3: slt R1, R2, R3
  • Compares two registers, R2 and R3
  • R1 = 1 if R2 < R3; else R1 = 0 (R2 >= R3)
  • Example: slt r11, r1, r2
  • Branch less than:
  • Example: if (A < B) goto LESS;
  • slt r11, r1, r2      ; r11 = 1 if A < B
  • bne r11, r0, LESS

105
Loops
  • Example:
  • while (A[i] == k)    // i, j, k in r3, r4, r5
  •   i = i + j;         // A is in r6
  • Loop: sll r11, r3, 2    ; r11 = 4 * i
  •       add r11, r11, r6  ; r11 = addr. of A[i]
  •       lw r10, 0(r11)    ; r10 = A[i]
  •       bne r10, r5, Exit ; goto Exit if A[i] != k
  •       add r3, r3, r4    ; i = i + j
  •       j Loop            ; goto Loop
  • Exit:

106
Addresses in Branches and Jumps
  • Instructions:
  • bne r14, r15, Label   ; next instruction is at Label
    if r14 != r15
  • beq r14, r15, Label   ; next instruction is at Label
    if r14 == r15
  • j Label               ; next instruction is at Label
  • Formats:
    I-type: op | rs | rt | 16 bit address
    J-type: op | 26 bit address
  • Addresses are not 32 bits. How do we handle
    this with large programs?
  • First idea: limit the branch space to the
    first 2^16 addresses
107
Addresses in Branches
  • Instructions:
  • bne r14, r15, Label   ; next instruction is at Label if
    r14 != r15
  • beq r14, r15, Label   ; next instruction is at Label if
    r14 == r15
  • Format:
    I-type: op | rs | rt | 16 bit address
  • Treat the 16 bit number as an offset to the PC
    register: PC-relative addressing
  • Word offset instead of byte offset. Why?
  • most branches are local (principle of locality)
  • Jump instructions just use the high order bits of
    the PC: pseudodirect addressing
  • 32-bit jump address = 4 most significant bits of
    the PC concatenated with the 26-bit word address
    (i.e. a 28-bit byte address)
  • Address boundaries of 256 MB (the target arithmetic
    is sketched in C below)
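The target arithmetic, sketched in C (MIPS-style; the function names
are ours):

#include <stdint.h>

/* PC-relative branch: sign-extended 16-bit word offset from PC+4. */
uint32_t branch_target(uint32_t pc, int16_t offset)
{
    return (pc + 4) + ((int32_t)offset << 2);
}

/* Pseudodirect jump: 26-bit word address spliced under the
   4 high-order bits of PC+4 (hence the 256 MB boundaries). */
uint32_t jump_target(uint32_t pc, uint32_t addr26)
{
    return ((pc + 4) & 0xF0000000u) | (addr26 << 2);
}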
108
Conditional Branch Distance
(Plot: distribution of conditional branch distances; 65% of integer
branches span 2 to 4 instructions.)
109
Conditional Branch Addressing
  • PC-relative, since most branches are relatively
    close to the current PC
  • At least 8 bits suggested (±128 instructions)
  • Compare Equal/Not Equal is most important for
    integer programs (86%)

110
PC-relative addressing
  • For larger distances, Jump register (jr) is required.

111
Example
  • LOOP: mult $9, $19, $10   ; R9 = R19 * R10
           lw   $8, 1000($9)   ; R8 = Mem[R9 + 1000]
           bne  $8, $21, EXIT
           add  $19, $19, $20  ; i = i + j
           j    LOOP
    EXIT:  ...
  • Assume the address of LOOP is 0x8000
    (the j instruction encodes opcode 2 with target 0x8000)
112
Procedure calls
  • Procedures or subroutines:
  • Needed for structured programming
  • Steps followed in executing a procedure call:
  • Place parameters in a place where the procedure
    (callee) can access them
  • Transfer control to the procedure
  • Acquire the storage resources needed for the
    procedure
  • Perform the desired task
  • Place results in a place where the calling
    program (caller) can access them
  • Return control to the point of origin

113
Resources Involved
  • Registers used for procedure calling:
  • a0 - a3: four argument registers in which to
    pass parameters
  • v0 - v1: two value registers in which to
    return values
  • r31: one return address register to return to
    the point of origin
  • Transferring control to the callee:
  • jal ProcedureAddress
  • jump-and-link to the procedure address
  • the return address (PC + 4) is saved in r31
  • Example: jal 20000
  • Returning control to the caller:
  • jr r31
  • the instruction following the jal is executed next

114
Memory Stacks
Useful for stacked environments/subroutine call and
return, even if an operand stack is not part of the
architecture.

(Diagram: stacks that grow up vs. stacks that grow down. With memory
addresses running from low ("0, Little") to high ("infinitely big"),
the stack pointer SP moves one way or the other as entries a, b, c
are pushed, depending on the growth direction.)
115
Calling conventions
int func(int g, int h, int i, int j)
{
    int f;
    f = (g + h) - (i + j);
    return f;
}
// g,h,i,j are in a0,a1,a2,a3; f is in r8

func: addi sp, sp, -12   ; make room in stack for 3 words
      sw r11, 8(sp)      ; save the regs we want to use
      sw r10, 4(sp)
      sw r8, 0(sp)
      add r10, a0, a1    ; r10 = g + h
      add r11, a2, a3    ; r11 = i + j
      sub r8, r10, r11   ; r8 has the result
      add v0, r8, r0     ; return reg v0 has f
116
Calling (cont.)
  • lw r8, 0(sp)     ; restore r8
  • lw r10, 4(sp)    ; restore r10
  • lw r11, 8(sp)    ; restore r11
  • addi sp, sp, 12  ; restore sp
  • jr ra
  • we did not have to restore r10-r19 (caller saved;
    the callee can clobber them)
  • we do need to restore r1-r8 (they must be preserved
    by the callee)

117
Nested Calls
Stacking of Subroutine Calls, Returns, and
Environments:

(Diagram: procedure A calls B, B calls C, C returns, B calls D, and
so on; the stack of environments grows and shrinks accordingly:
A; A,B; A,B,C; A,B; A,B,D; A,B; A.)
  • Some machines provide a memory stack as part of
    the architecture (e.g., VAX, JVM)
  • Sometimes stacks are implemented via software
    convention

118
Compiling a String Copy Proc.
void strcpy(char x[], char y[])
{
    int i = 0;
    while ((x[i] = y[i]) != 0)
        i++;
}

// x's and y's base addresses are in a0 and a1
strcpy: addi sp, sp, -4     ; reserve 1 word of space in stack
        sw r8, 0(sp)        ; save r8
        add r8, zero, zero  ; i = 0
L1:     add r11, a1, r8     ; addr. of y[i] in r11
        lb r12, 0(r11)      ; r12 = y[i]
        add r13, a0, r8     ; addr. of x[i] in r13
        sb r12, 0(r13)      ; x[i] = y[i]
        beq r12, zero, L2   ; if y[i] == 0 goto L2
        addi r8, r8, 1      ; i++
        j L1                ; go to L1
L2:     lw r8, 0(sp)        ; restore r8
        addi sp, sp, 4      ; restore sp
        jr ra               ; return
119
IA-32
  • 1978: The Intel 8086 is announced (16 bit
    architecture)
  • 1980: The 8087 floating point coprocessor is
    added
  • 1982: The 80286 increases address space to 24
    bits, adds instructions
  • 1985: The 80386 extends to 32 bits, new
    addressing modes
  • 1989-1995: The 80486, Pentium, Pentium Pro add a
    few instructions (mostly designed for higher
    performance)
  • 1997: 57 new MMX instructions are added,
    Pentium II
  • 1999: The Pentium III adds another 70
    instructions (SSE)
  • 2001: Another 144 instructions (SSE2)
  • 2003: AMD extends the architecture to increase
    the address space to 64 bits, widens all registers
    to 64 bits, and makes other changes (AMD64)
  • 2004: Intel capitulates and embraces AMD64
    (calls it EM64T) and adds more media extensions
  • This history illustrates the impact of the
    "golden handcuffs" of compatibility: adding new
    features as someone might add clothing to a
    packed bag, yielding an architecture that is
    difficult to explain and impossible to love

120
IA-32 Overview
  • Complexity:
  • Instructions from 1 to 17 bytes long
  • one operand must act as both a source and
    destination
  • one operand can come from memory
  • complex addressing modes, e.g., base or scaled
    index with 8 or 32 bit displacement
  • Saving grace:
  • the most frequently used instructions are not too
    difficult to build
  • compilers avoid the portions of the architecture
    that are slow
  • what the 80x86 lacks in style is made up in
    quantity, making it beautiful from the right
    perspective

121
IA32 Registers
  • Oversimplified architecture:
  • Four 32-bit general purpose registers:
  • eax, ebx, ecx, edx
  • "al" is a register name meaning the lower 8 bits of
    eax
  • Stack Pointer:
  • esp
  • Fun fact:
  • Once upon a time, x86 was only a 16-bit CPU
  • So, when they upgraded x86 to 32 bits...
  • they added an "e" in front of every register name and
    called it "extended"

122
Intel 80x86 Integer Registers
GPR0  EAX  Accumulator
GPR1  ECX  Count register: string, loop
GPR2  EDX  Data register: multiply, divide
GPR3  EBX  Base address register
GPR4  ESP  Stack pointer
GPR5  EBP  Base pointer for base of stack segment
GPR6  ESI  Index register
GPR7  EDI  Index register
CS         Code segment pointer
SS         Stack segment pointer
DS         Data segment pointer
ES         Extra data segment pointer
FS         Data segment 2
GS         Data segment 3
PC    EIP  Instruction counter
      EFLAGS  Condition codes
123
x86 Assembly
  • mov <dest>, <src>
  • Move the value from <src> into <dest>
  • Used to set initial values
  • add <dest>, <src>
  • Add the value from <src> to <dest>
  • sub <dest>, <src>
  • Subtract the value from <src> from <dest>


124
x86 Assembly
push <target>: push the value in <target> onto
the stack. Also decrements the stack pointer,
ESP (remember, the stack grows from high to low).
pop <target>: pop the value from the top of the
stack, put it in <target>. Also increments the
stack pointer, ESP.

125
x86 Assembly
jmp <address>: jump to an instruction (like
goto). Changes the EIP to <address>.
call <address>: a function call. Pushes the
address of the next instruction onto the stack,
and jumps to <address>.

126
x86 Assembly
lea <dest>, <src>: load the effective address of
<src> into register <dest>. Used for pointer
arithmetic (no actual memory reference).
int <value>: interrupt; a hardware signal to the
operating system kernel, with flag <value>.
int 0x80 means a Linux system call.

127
x86 Assembly
Condition Codes:
CF: Carry Flag - overflow detection (unsigned)
ZF: Zero Flag
SF: Sign Flag
OF: Overflow Flag - overflow detection (signed)
Condition codes are usually accessed through
conditional branches (not directly).

128
Interrupt convention
int 0x80: system call interrupt
eax: system call number (e.g. 1 = exit, 2 = fork,
3 = read, 4 = write)
ebx: argument 1
ecx: argument 2
edx: argument 3
(A minimal C sketch of this convention follows below.)
129
CISC vs RISC
RISC: Reduced Instruction Set Computer (DLX)
CISC: Complex Instruction Set Computer (x86)
Both have their advantages.

130
RISC
  • Not very many instructions
  • All instructions are the same length, in both
    execution time and bit length
  • Results in simpler CPUs (easier to optimize)
  • Usually takes more instructions to do the same work