Computer Architecture Basics: Pipeline and Memory Hierarchy (PowerPoint Presentation Transcript)

1
Computer Architecture Basics: Pipeline and Memory Hierarchy
  • Lynn Choi
  • Dept. of Computer and Electronics Engineering

2
The Fetch-Execute Cycle
[Figure 5.3: The Fetch-Execute Cycle]
3
Motivation
  • Non-pipelined design
  • Single-cycle implementation
  • The cycle time is determined by the slowest instruction
  • Every instruction takes the same amount of time
  • Multi-cycle implementation
  • Divide the execution of an instruction into multiple steps
  • Each instruction may take a variable number of steps (clock cycles)
  • Pipelined design
  • Divide the execution of an instruction into multiple steps (stages)
  • Overlap the execution of different instructions in different stages
  • Each cycle, a different instruction is executed in each stage
  • For example, in a 5-stage pipeline (Fetch-Decode-Read-Execute-Write),
  • 5 instructions are executed concurrently in the 5 pipeline stages
  • One instruction completes every cycle (instead of every 5 cycles)
  • Can increase the throughput of the machine by 5 times

4
Pipeline Example
LD R1 <- A
ADD R5, R3, R4
LD R2 <- B
SUB R8, R6, R7
ST C <- R5

5-stage pipeline: Fetch - Decode - Read - Execute - Write

Non-pipelined processor: 25 cycles = number of instrs (5) x number of stages (5)
Pipelined processor: 9 cycles = start-up latency (4) + number of instrs (5) (see the cycle-count sketch below)
[Pipeline timing diagram: the first 4 cycles fill the pipeline, after which one instruction completes per cycle while the last instructions drain the pipeline.]
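
The two cycle counts above follow directly from the formulas on this slide. Below is a minimal sketch (not from the slides) that computes them for an ideal, hazard-free pipeline.

def non_pipelined_cycles(num_instrs, num_stages=5):
    # Each instruction occupies the whole datapath for num_stages cycles.
    return num_instrs * num_stages

def pipelined_cycles(num_instrs, num_stages=5):
    # (num_stages - 1) start-up cycles to fill the pipeline,
    # then one instruction completes per cycle.
    return (num_stages - 1) + num_instrs

print(non_pipelined_cycles(5))  # 25 cycles
print(pipelined_cycles(5))      # 9 cycles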
5
Data Dependence Hazards
  • Data Dependence
  • Read-After-Write (RAW) dependence
  • True dependence
  • The consumer must read the data after the producer has produced it
  • Write-After-Write (WAW) dependence
  • Output dependence
  • The result of a later instruction could be overwritten by an earlier instruction
  • Write-After-Read (WAR) dependence
  • Anti dependence
  • Must not overwrite the value before its consumer has read it
  • Notes
  • WAW and WAR are called false dependences, which happen due to storage conflicts
  • All three types of dependences can happen for both registers and memory locations
  • Dependences are characteristics of programs, not machines (see the sketch below)
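
As a concrete illustration of the three dependence types, here is a minimal sketch (not from the slides) that classifies the dependences of a later instruction on an earlier one from their source and destination registers; the encoding of instructions as sets of register names is an assumption for illustration.

def classify_dependences(earlier, later):
    # Each instruction is a dict: {'dst': set of written regs, 'src': set of read regs}.
    deps = []
    if earlier['dst'] & later['src']:
        deps.append('RAW')   # true dependence: later reads what earlier writes
    if earlier['dst'] & later['dst']:
        deps.append('WAW')   # output dependence: both write the same register
    if earlier['src'] & later['dst']:
        deps.append('WAR')   # anti dependence: later overwrites what earlier reads
    return deps

# MULT R3, R1, R2 followed by SUB R3, R3, R4 (instructions 3 and 5 of Example 1 below)
mult = {'dst': {'R3'}, 'src': {'R1', 'R2'}}
sub  = {'dst': {'R3'}, 'src': {'R3', 'R4'}}
print(classify_dependences(mult, sub))  # ['RAW', 'WAW']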

6
Example 1
1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R3, R2
5 SUB R3, R3, R4
6 ST A <- R3

RAW dependences: 1->3, 2->3, 2->4, 3->4, 3->5, 4->5, 5->6
WAW dependence: 3->5
WAR dependences: 4->5, 1->6 (memory location A)

Execution time: 18 cycles = start-up latency (4) + number of instrs (6) + number of pipeline bubbles (8)
[Pipeline timing diagram: dependent instructions repeat their D and R stages, inserting bubbles until the producing instructions write their results back.]
Pipeline bubbles due to RAW dependences (Data
Hazards)
7
Example 2
Changes:
1. Assume that MULT execution takes 6 cycles instead of 1 cycle
2. Assume that we have separate ALUs for MULT and ADD/SUB

1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R5, R6
5 SUB R3, R1, R4
6 ST A <- R3

Execution time: 18 cycles = start-up latency (4) + number of instrs (6) + number of pipeline bubbles (8)

(Dead code: instruction 3's R3 result is never read before instruction 5 overwrites it.)
[Pipeline timing diagram: MULT occupies the EXE stage for 6 cycles; bubbles appear due to RAW and WAW dependences, and MULT completes out of order (out-of-order (OOO) completion).]
Multi-cycle execution like MULT can cause
out-of-order completion
8
Pipeline stalls (Pipeline interlock)
  • Need reg-id comparators for
  • RAW dependences
  • Reg-id comparators between the sources of a
    consumer instruction in REG stage and the
    destinations of producer instructions in EXE, WRB
    stages
  • WAW dependences
  • Reg-id comparators between the destination of an
    instruction in REG stage and the destinations of
    instructions in EXE stage (if the instruction in
    EXE stage takes more execution cycles than the
    instruction in REG)
  • WAR dependences
  • Can never cause the pipeline to stall since
    register read of an instruction always happens
    earlier than the write of a following instruction
  • If there is a match, recycle (stall) the dependent instructions
  • The current instruction in the REG stage needs to be recycled, and all the
    instructions in the FET and DEC stages need to be recycled as well (a minimal
    sketch of the RAW check follows below)
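
Below is a minimal sketch (not from the slides) of the RAW interlock check described above: compare the source reg-ids of the instruction in the REG stage against the destination reg-ids of the instructions still in the EXE and WRB stages. The stage representation is an assumption for illustration.

def must_stall_raw(reg_stage_srcs, exe_stage_dst, wrb_stage_dst):
    # Return True if the instruction in REG must stall (be recycled) this cycle.
    in_flight_dsts = {d for d in (exe_stage_dst, wrb_stage_dst) if d is not None}
    return bool(set(reg_stage_srcs) & in_flight_dsts)

# ADD R4, R3, R2 sits in REG while MULT (destination R3) is still in EXE:
print(must_stall_raw(['R3', 'R2'], exe_stage_dst='R3', wrb_stage_dst=None))  # True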

9
Data Bypass (Forwarding)
  • Motivation
  • Minimize the pipeline stalls due to data
    dependence (RAW) hazards
  • Idea
  • Let's propagate the result as soon as it is available from the ALU or from
    memory (in parallel with the register write)
  • Requires
  • A data path from the ALU output to the inputs of the execution units (input
    of the integer ALU, address or data input of the memory pipeline, etc.)
  • The Register Read stage can read data from the register file or from the
    output of the previous execution stage
  • Requires a MUX in front of the input of the execution stage (see the sketch
    below)
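
Here is a minimal sketch (not from the slides) of the bypass MUX selection just described: the operand entering the execution stage is taken from the freshest in-flight result if one exists, otherwise from the register file. The two forwarding sources are an assumption for illustration.

def select_operand(src_reg, regfile, exe_dst, exe_result, wrb_dst, wrb_result):
    # Choose the value of src_reg that enters the EXE stage this cycle.
    if src_reg == exe_dst and exe_result is not None:
        return exe_result          # forward from the ALU output
    if src_reg == wrb_dst and wrb_result is not None:
        return wrb_result          # forward from the write-back stage
    return regfile[src_reg]        # no hazard: read the register file

regfile = {'R2': 7, 'R3': 0}
# MULT's R3 result (42) is still in EXE when ADD R4, R3, R2 needs it:
print(select_operand('R3', regfile, 'R3', 42, None, None))  # 42, not the stale 0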

10
Datapath w/ Forwarding
11
Example 1 with Bypass
1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R3, R2
5 SUB R3, R3, R4
6 ST A <- R3

Execution time: 10 cycles = start-up latency (4) + number of instrs (6) + number of pipeline bubbles (0)
[Pipeline timing diagram: with bypassing, all 6 instructions flow through F, D, R, E, W back to back with no stall cycles.]
12
Example 2 with Bypass
1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R5, R6
5 SUB R3, R1, R4
6 ST A <- R3
[Pipeline timing diagram: MULT occupies the EXE stage for 6 cycles; bypassing removes the RAW stalls, but pipeline bubbles remain due to the WAW dependence between instructions 3 and 5.]
13
Pipeline Hazards
  • Data Hazards
  • Caused by data (RAW, WAW, WAR) dependences
  • Require
  • Pipeline interlock (stall) mechanism to detect
    dependences and generate machine stall cycles
  • Reg-id comparators between instrs in the REG stage and instrs in the
    EXE/WRB stages
  • Stalls due to RAW hazards can be reduced by the bypass network
  • Requires reg-id comparators, data bypass paths, and MUXes
  • Structural Hazards
  • Caused by resource constraints
  • Require pipeline stall mechanism to detect
    structural constraints
  • Control (Branch) Hazards
  • Caused by branches
  • Instruction fetch of the next instruction has to wait until the target
    (including the branch condition) of the current branch instruction is
    resolved
  • Use
  • Pipeline stall to delay the fetch of the next
    instruction
  • Predict the next target address (branch
    prediction) and if wrong, flush all the
    speculatively fetched instructions from the
    pipeline

14
Structural Hazard Example
  • Assume that
  • We have 1 memory unit and 1 integer ALU unit
  • LD takes 2 cycles and MULT takes 4 cycles

1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R5, R6
5 SUB R3, R1, R4
6 ST A <- R3
[Pipeline timing diagram: with a single memory unit and a single ALU, later instructions must wait for the busy unit (structural hazards), in addition to the RAW stalls.]
15
Structural Hazard Example
1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R5, R6
5 SUB R3, R1, R4
6 OR R10 <- R3, R1

  • Assume that
  • 1. We have 1 pipelined memory unit, 1 integer add unit, and 1 integer multiply unit
  • 2. LD takes 2 cycles and MULT takes 4 cycles

[Pipeline timing diagram: the RAW stalls remain, and structural hazards now occur at the register-file write port when multiple instructions try to complete in the same cycle.]
16
Control Hazard Example (Stall)
  • 1 LD R1 <- A
  • 2 LD R2 <- B
  • 3 MULT R3, R1, R2
  • 4 BEQ R1, R2, TARGET
  • 5 SUB R3, R1, R4
  • ST A <- R3
  • TARGET:

[Pipeline timing diagram: instruction fetch stalls after the BEQ until the branch target is known (control hazard), in addition to the RAW stalls.]
17
Control Hazard Example (Flush)
  • 1 LD R1 <- A
  • 2 LD R2 <- B
  • 3 MULT R3, R1, R2
  • 4 BEQ R1, R2, TARGET
  • 5 SUB R3, R1, R4
  • ST A <- R3
  • TARGET: ADD R4, R1, R2

[Pipeline timing diagram: instructions after the BEQ are fetched and executed speculatively; once the branch target is known, they are flushed from the pipeline on a branch misprediction.]
18
Memory Hierarchy
  • Motivated by
  • Principles of Locality
  • Locality principle
  • Spatial locality: nearby references are likely
  • Example: arrays, program code
  • Access a block of contiguous words
  • Temporal locality: the same reference is likely to recur soon
  • Example: loops, reuse of variables
  • Keep recently accessed data closer to the processor
  • Speed vs. Size vs. Cost tradeoff
  • SRAM - DRAM - Disk - Tape
  • As you go down the hierarchy, memory becomes
    bigger, slower, cheaper
  • As you go up the hierarchy, memory becomes
    smaller, faster, more expensive

19
Levels of Memory Hierarchy
(Faster/smaller at the top; slower/larger at the bottom)

Level        | Capacity/Access Time | Moved By                  | Unit of Transfer
Registers    | 100s of bytes        | Program/Compiler (1-16B)  | Instruction operands
Cache        | KBs - MBs            | H/W (16B - 512B)          | Cache line
Main Memory  | GBs                  | OS (512B - 64MB)          | Page
Disk         | 100s of GBs          | User (any size)           | File
Tape         | Infinite             |                           |
20
Cache
  • A small but fast memory located between processor
    and main memory
  • Benefits
  • Reduce load latency
  • Reduce store latency
  • Reduce bus traffic (on-chip caches)
  • Cache Block Allocation (When to place)
  • On a read miss
  • On a write miss
  • Write-allocate vs. no-write-allocate
  • Cache Block Placement (Where to place)
  • Fully-associative cache
  • Direct-mapped cache
  • Set-associative cache

21
Fully Associative Cache
[Figure: 32b word, 4-word (16B) cache block. A 32-bit VA gives a 4GB virtual address space (DRAM) with 2^28 memory blocks; the 32KB cache (SRAM) has 2^11 cache blocks (cache lines). A memory block can be placed into any cache block location!]
22
Fully Associative Cache
[Figure: 32KB data RAM and tag RAM with 2^11 entries. The address is split into a tag (bits 31-4) and an offset (bits 3-0); the tag is compared against every valid tag RAM entry in parallel, and on a match the selected word/byte is sent to the CPU (cache hit).]

Advantages: 1. High hit rate  2. Fast
Disadvantages: 1. Very expensive
23
Direct Mapped Cache
[Figure: a 32-bit VA gives a 4GB virtual address space (DRAM) with 2^28 memory blocks; the 32KB cache (SRAM) has 2^11 cache blocks. A memory block can be placed into only a single cache block: memory blocks 0, 2^11, 2*2^11, ..., (2^17-1)*2^11 all map to cache block 0.]
24
Direct Mapped Cache
[Figure: 32KB data RAM and tag RAM with 2^11 entries. The address is split into a tag (bits 31-15), an index (bits 14-4), and an offset (bits 3-0); the index selects one entry through a decoder, the stored tag is compared with the address tag, and on a match the selected word/byte is sent to the CPU (cache hit). A sketch of this address split follows below.]

Advantages: 1. Simple HW  2. Fast implementation
Disadvantages: 1. Low hit rate
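
The tag/index/offset split in the figure above can be written out directly. Below is a minimal sketch (not from the slides), assuming the 16B blocks (4 offset bits) and 2^11 cache blocks (11 index bits) used on this slide.

OFFSET_BITS = 4           # 16-byte cache block
INDEX_BITS  = 11          # 2^11 = 2048 cache blocks (32KB / 16B)

def split_address(addr):
    # Decompose a 32-bit address into (tag, index, offset) for a direct-mapped cache.
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x00012344)
print(hex(tag), hex(index), hex(offset))  # 0x2 0x234 0x4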
25
Set Associative Cache
In an M-way set associative cache, a memory block can be placed into any of M cache blocks!

[Figure: the 32KB cache (SRAM) is organized as a 2-way set-associative cache: Way 0 and Way 1, each with 2^10 sets (2^11 cache blocks in total). Memory blocks 0, 2^10, 2*2^10, ..., (2^18-1)*2^10 all map to set 0.]
26
Set Associative Cache
[Figure: 32KB data RAM and tag RAM organized in 2 ways of 2^10 entries each. The address is split into a tag (bits 31-14), an index (bits 13-4), and an offset (bits 3-0); the index selects one set through a decoder, the tags of both ways are compared with the address tag in parallel, and a way MUX (Wmux) selects the hitting way before the word/byte is sent to the CPU (cache hit). A lookup sketch follows below.]

Most caches are implemented as set-associative caches!
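
Below is a minimal sketch (not from the slides) of the 2-way set-associative lookup in the figure above, assuming the 4-bit offset and 10-bit index shown on this slide; the Python lists stand in for the tag RAM.

OFFSET_BITS, INDEX_BITS, NUM_WAYS = 4, 10, 2
NUM_SETS = 1 << INDEX_BITS

# ways[w][set] holds the stored tag for that block, or None if the entry is invalid.
ways = [[None] * NUM_SETS for _ in range(NUM_WAYS)]

def lookup(addr):
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag   = addr >> (OFFSET_BITS + INDEX_BITS)
    for w in range(NUM_WAYS):            # the hardware compares both ways in parallel
        entry = ways[w][index]
        if entry is not None and entry == tag:
            return True, w               # cache hit in way w
    return False, None                   # cache miss

addr = 0x00012344
index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
ways[1][index] = addr >> (OFFSET_BITS + INDEX_BITS)   # pretend the block is cached in way 1
print(lookup(addr))   # (True, 1)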
27
Cache Block Replacement
  • Random
  • Just pick one and replace it
  • Pseudo-random: use a simple hash of the address
  • LRU (least recently used)
  • Needs to keep a timestamp per block
  • Expensive due to the global compare
  • Pseudo-LRU: approximate LRU using bit tags
  • The replacement policy is critical for small caches (an LRU sketch follows below)
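
Here is a minimal sketch (not from the slides) of LRU replacement for a single cache set, using an ordered dictionary in place of the per-block timestamps mentioned above.

from collections import OrderedDict

class LRUSet:
    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.blocks = OrderedDict()          # tag -> block, least recently used first

    def access(self, tag):
        if tag in self.blocks:               # hit: mark the block most recently used
            self.blocks.move_to_end(tag)
            return 'hit'
        if len(self.blocks) >= self.num_ways:
            self.blocks.popitem(last=False)  # miss: evict the least recently used block
        self.blocks[tag] = None              # allocate the new block
        return 'miss'

s = LRUSet(num_ways=2)
print([s.access(t) for t in [0xA, 0xB, 0xA, 0xC, 0xB]])
# ['miss', 'miss', 'hit', 'miss', 'miss']  (0xB is evicted by 0xC, then reloaded)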

28
Write Policy
  • Write-through
  • Write to the cache and to the next level of the memory hierarchy
  • Simple to design, keeps memory consistent
  • Generates more write traffic
  • Write-back
  • Only write to the cache (not to the lower level)
  • Update memory when a dirty block is replaced
  • Less write traffic; writes are independent of main memory
  • Complex to design, memory is inconsistent (until the write-back)
  • Write-Allocate Policy
  • Write-allocate: allocate a cache block on a write miss
  • Typically used with a write-back cache
  • No-write-allocate
  • Typically used with a write-through cache (a sketch of the two write policies follows below)
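
Below is a minimal sketch (not from the slides) contrasting the two write policies above for a write hit, plus the write-back action on eviction; the dictionary-based cache model is an assumption for illustration.

def write_through(cache, memory, addr, value):
    cache[addr] = value          # write the cache...
    memory[addr] = value         # ...and the next level immediately (more write traffic)

def write_back(cache, dirty, addr, value):
    cache[addr] = value          # write only the cache
    dirty.add(addr)              # mark the block dirty; memory is updated later

def evict(cache, dirty, memory, addr):
    if addr in dirty:            # a dirty block is written back when it is replaced
        memory[addr] = cache[addr]
        dirty.discard(addr)
    del cache[addr]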

29
3 + 1 Types of Cache Misses
  • Cold-start misses (or compulsory misses): the first access to a block always misses
  • These misses occur even in an infinite cache
  • Capacity misses: if the set of memory blocks needed by a program is bigger than the cache, capacity misses will occur due to cache block replacement
  • These misses occur even in a fully associative cache
  • Conflict misses (or collision misses): in a direct-mapped or set-associative cache, too many blocks can be mapped to the same set
  • Invalidation misses (or sharing misses): cache blocks can be invalidated due to coherence traffic

30
Miss Rates (SPEC92)
31
Cache Performance vs Block Size
[Figure: the miss penalty (access time + transfer time) grows with block size, the miss rate first falls and then rises with block size, and the average access time has a sweet spot at an intermediate block size.]
32
Multi-level Cache
  • For L1 organization,
  • AMAT = Hit_Time + Miss_Rate x Miss_Penalty
  • For L1/L2 organization,
  • AMAT = Hit_Time_L1 + Miss_Rate_L1 x (Hit_Time_L2 + Miss_Rate_L2 x Miss_Penalty_L2)
    (see the sketch after this list)
  • Advantages
  • For capacity misses and conflict misses in L1, a
    significant penalty reduction
  • Disadvantages
  • For L1-L2 misses, miss penalty increases slightly
  • L2 does not help compulsory misses
  • Design Issues
  • Size(L2) >> Size(L1)
  • Usually, Block_size(L2) > Block_size(L1)
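
The AMAT formulas above can be evaluated directly. Here is a minimal sketch (not from the slides); the latencies and miss rates are assumed numbers for illustration only.

def amat_l1(hit_time, miss_rate, miss_penalty):
    # Average memory access time for a single-level (L1 only) organization.
    return hit_time + miss_rate * miss_penalty

def amat_l1_l2(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, miss_penalty_l2):
    # The L1 miss penalty is itself the AMAT of the L2 cache.
    return hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * miss_penalty_l2)

# Assumed example latencies (cycles) and miss rates:
print(amat_l1(hit_time=1, miss_rate=0.05, miss_penalty=100))                   # 6.0
print(amat_l1_l2(1, 0.05, hit_l2=10, miss_rate_l2=0.2, miss_penalty_l2=100))   # 2.5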

33
Memory Performance Parameters
  • Access Time
  • The time elapsed from asserting an address to
    when the data is available on the output
  • Row Access Time: the time elapsed from asserting RAS to when the row is
    available in the row buffer
  • Column Access Time: the time elapsed from asserting CAS to when the valid
    data is present on the output pins
  • Cycle Time
  • The minimum time between two different requests
    to memory
  • Latency
  • The time to access the first word of a block
  • Bandwidth
  • Transmission rate (bytes per second)

34
DRAM Structure
[Figure: DRAM structure. The address is split into a row address and a column address; the row decoder selects one row of the memory array, and the column decoder/multiplexer selects the data output from that row.]
35
Pentium III Example
[Figure: the Pentium III processor contains a 16KB I-Cache and a 16KB D-Cache feeding the Pentium III core pipeline, plus a 256KB 8-way 2nd-level cache connected over an 800 MHz, 256b data bus. The FSB (133 MHz, 64b data, 32b address) connects the processor over the system bus to a host-to-PCI bridge with AGP graphics and to main memory over a multiplexed (RAS/CAS) memory bus. DIMMs of 16 16Mx4b 133 MHz SDRAM chips constitute a 128MB DRAM module with a 64b data bus.]
36
Virtual Memory
  • Objective
  • Large address spaces -> easy programming
  • Provide the illusion of an infinite amount of memory
  • Program code/data can exceed the main memory size
  • Processes can be partially resident in memory
  • Protection of code and data
  • Privilege levels
  • Access rights: read/modify/execute permission
  • Sharing of code and data
  • Benefits
  • Easier programming
  • Software portability
  • Increased CPU utilization
  • More programs can run at the same time
  • Virtual address space
  • The programmer's view of infinite memory
  • Physical address space
  • The machine's physical memory

37
Virtual Memory
  • Require the following functions
  • Memory allocation (Placement)
  • Memory deallocation (Replacement)
  • Memory mapping (Translation)
  • Memory management
  • Automatic movement of data between main memory
    and secondary storage
  • Main memory contains only the most frequently used portions of a process's
    address space
  • Illusion of infinite memory (size of secondary
    storage) but access time equal to main memory
  • Usually implemented by demand paging

38
Paging
  • Divide the address space into fixed-size pages (page frames)
  • A VA consists of (VPN, offset)
  • A PA consists of (PPN, offset)
  • Map a virtual page to a physical page at runtime (a translation sketch follows below)
  • Demand paging: bring in a page on a page miss
  • A page table entry (PTE) contains
  • Virtual page number
  • Physical page number
  • Presence bit
  • Reference bit
  • Dirty bit
  • Access control: read/write/execute
  • Privilege level
  • Disk address
  • Internal fragmentation
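
Here is a minimal sketch (not from the slides) of the (VPN, offset) split and the VA-to-PA mapping just described, assuming 4KB pages and a dictionary in place of the page table.

PAGE_SIZE = 4096                        # assumed page size (12 offset bits)
OFFSET_BITS = 12

page_table = {0x12345: 0x00042}         # VPN -> PPN, for pages present in memory

def translate(va):
    vpn, offset = va >> OFFSET_BITS, va & (PAGE_SIZE - 1)
    if vpn not in page_table:
        raise RuntimeError('page miss: bring the page in from secondary storage')
    return (page_table[vpn] << OFFSET_BITS) | offset

print(hex(translate(0x12345678)))       # 0x42678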

39
TLB
  • TLB (Translation Lookaside Buffer)
  • Cache of page table entries (PTEs)
  • On a TLB hit, the virtual-to-physical translation is done without accessing
    the page table
  • On a TLB miss, the page table must be searched for the mapping, which is
    inserted into the TLB before processing continues (see the sketch below)
  • TLB configuration
  • On the order of 100 entries, fully- or set-associative cache
  • usually separate I-TLB and D-TLB, accessed every
    cycle
  • Different virtual memory faults
  • TLB miss - PTE not in TLB
  • PTE miss - PTE not in main memory
  • page miss - page not in main memory
  • access violation
  • privilege violation
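
Below is a minimal sketch (not from the slides) of the TLB hit/miss flow above, again assuming 4KB pages and dictionaries in place of the TLB and page table.

OFFSET_BITS = 12                           # assumed 4KB pages
page_table = {0x12345: 0x00042}            # VPN -> PPN
tlb = {}                                   # small cache of PTEs: VPN -> PPN

def translate_with_tlb(va):
    vpn, offset = va >> OFFSET_BITS, va & ((1 << OFFSET_BITS) - 1)
    if vpn in tlb:                         # TLB hit: no page table access needed
        ppn = tlb[vpn]
    else:                                  # TLB miss: search the page table...
        ppn = page_table[vpn]              # (a KeyError here would be a PTE miss)
        tlb[vpn] = ppn                     # ...and insert the mapping into the TLB
    return (ppn << OFFSET_BITS) | offset

print(hex(translate_with_tlb(0x12345678)))  # TLB miss, then fill: 0x42678
print(hex(translate_with_tlb(0x12345abc)))  # TLB hit: 0x42abc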

40
TLB and Cache Implementation of DECStation 3100