Computer Architecture Basics: Pipeline and Memory Hierarchy (PowerPoint Presentation Transcript)

1
Computer Architecture Basics: Pipeline and Memory Hierarchy
  • Lynn Choi
  • Dept. of Computer and Electronics Engineering

2
The Fetch-Execute Cycle
[Figure 5.3: The Fetch-Execute Cycle]
3
Motivation
  • Non-pipelined design
  • Single-cycle implementation
  • The cycle time is determined by the slowest instruction
  • Every instruction takes the same amount of time
  • Multi-cycle implementation
  • Divide the execution of an instruction into multiple steps
  • Each instruction may take a variable number of steps (clock cycles)
  • Pipelined design
  • Divide the execution of an instruction into multiple steps (stages)
  • Overlap the execution of different instructions in different stages
  • Each cycle, a different instruction is executed in each stage
  • For example, in a 5-stage pipeline (Fetch-Decode-Read-Execute-Write),
  • 5 instructions are executed concurrently in the 5 pipeline stages
  • One instruction completes every cycle (instead of every 5 cycles)
  • Can increase the throughput of the machine by 5 times

4
Pipeline Example
LD R1 <- A
ADD R5, R3, R4
LD R2 <- B
SUB R8, R6, R7
ST C <- R5

5-stage pipeline: Fetch - Decode - Read - Execute - Write

Non-pipelined processor: 25 cycles = number of instrs (5) x number of stages (5)
Pipelined processor: 9 cycles = start-up latency (4) + number of instrs (5) (see the cycle-count sketch below)
[Pipeline timing diagram: the first 4 cycles fill the pipeline, after which one instruction completes per cycle while the last instructions drain the pipeline.]
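
The two cycle counts above follow directly from the formulas on this slide. Below is a minimal sketch (not from the slides) that computes them for an ideal, hazard-free pipeline.

def non_pipelined_cycles(num_instrs, num_stages=5):
    # Each instruction occupies the whole datapath for num_stages cycles.
    return num_instrs * num_stages

def pipelined_cycles(num_instrs, num_stages=5):
    # (num_stages - 1) start-up cycles to fill the pipeline,
    # then one instruction completes per cycle.
    return (num_stages - 1) + num_instrs

print(non_pipelined_cycles(5))  # 25 cycles
print(pipelined_cycles(5))      # 9 cycles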
5
Data Dependence Hazards
  • Data Dependence
  • Read-After-Write (RAW) dependence
  • True dependence
  • The consumer must read the data after the producer has produced it
  • Write-After-Write (WAW) dependence
  • Output dependence
  • The result of a later instruction could be overwritten by an earlier instruction
  • Write-After-Read (WAR) dependence
  • Anti dependence
  • Must not overwrite the value before its consumer has read it
  • Notes
  • WAW and WAR are called false dependences, which happen due to storage conflicts
  • All three types of dependences can happen for both registers and memory locations
  • Dependences are characteristics of programs, not machines (see the sketch below)
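
As a concrete illustration of the three dependence types, here is a minimal sketch (not from the slides) that classifies the dependences of a later instruction on an earlier one from their source and destination registers; the encoding of instructions as sets of register names is an assumption for illustration.

def classify_dependences(earlier, later):
    # Each instruction is a dict: {'dst': set of written regs, 'src': set of read regs}.
    deps = []
    if earlier['dst'] & later['src']:
        deps.append('RAW')   # true dependence: later reads what earlier writes
    if earlier['dst'] & later['dst']:
        deps.append('WAW')   # output dependence: both write the same register
    if earlier['src'] & later['dst']:
        deps.append('WAR')   # anti dependence: later overwrites what earlier reads
    return deps

# MULT R3, R1, R2 followed by SUB R3, R3, R4 (instructions 3 and 5 of Example 1 below)
mult = {'dst': {'R3'}, 'src': {'R1', 'R2'}}
sub  = {'dst': {'R3'}, 'src': {'R3', 'R4'}}
print(classify_dependences(mult, sub))  # ['RAW', 'WAW']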

6
Example 1
1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R3, R2
5 SUB R3, R3, R4
6 ST A <- R3

RAW dependences: 1->3, 2->3, 2->4, 3->4, 3->5, 4->5, 5->6
WAW dependence: 3->5
WAR dependences: 4->5, 1->6 (memory location A)

Execution time: 18 cycles = start-up latency (4) + number of instrs (6) + number of pipeline bubbles (8)
[Pipeline timing diagram: dependent instructions repeat their D and R stages, inserting bubbles until the producing instructions write their results back.]
Pipeline bubbles due to RAW dependences (Data
Hazards)
7
Example 2
Changes:
1. Assume that MULT execution takes 6 cycles instead of 1 cycle
2. Assume that we have separate ALUs for MULT and ADD/SUB

1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R5, R6
5 SUB R3, R1, R4
6 ST A <- R3

Execution time: 18 cycles = start-up latency (4) + number of instrs (6) + number of pipeline bubbles (8)

(Dead code: instruction 3's R3 result is never read before instruction 5 overwrites it.)
[Pipeline timing diagram: MULT occupies the EXE stage for 6 cycles; bubbles appear due to RAW and WAW dependences, and MULT completes out of order (out-of-order (OOO) completion).]
Multi-cycle execution like MULT can cause
out-of-order completion
8
Pipeline stalls (Pipeline interlock)
  • Need reg-id comparators for
  • RAW dependences
  • Reg-id comparators between the sources of a
    consumer instruction in REG stage and the
    destinations of producer instructions in EXE, WRB
    stages
  • WAW dependences
  • Reg-id comparators between the destination of an
    instruction in REG stage and the destinations of
    instructions in EXE stage (if the instruction in
    EXE stage takes more execution cycles than the
    instruction in REG)
  • WAR dependences
  • Can never cause the pipeline to stall since
    register read of an instruction always happens
    earlier than the write of a following instruction
  • If there is a match, recycle (stall) the dependent instructions
  • The current instruction in the REG stage needs to be recycled, and all the
    instructions in the FET and DEC stages need to be recycled as well (a minimal
    sketch of the RAW check follows below)
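
Below is a minimal sketch (not from the slides) of the RAW interlock check described above: compare the source reg-ids of the instruction in the REG stage against the destination reg-ids of the instructions still in the EXE and WRB stages. The stage representation is an assumption for illustration.

def must_stall_raw(reg_stage_srcs, exe_stage_dst, wrb_stage_dst):
    # Return True if the instruction in REG must stall (be recycled) this cycle.
    in_flight_dsts = {d for d in (exe_stage_dst, wrb_stage_dst) if d is not None}
    return bool(set(reg_stage_srcs) & in_flight_dsts)

# ADD R4, R3, R2 sits in REG while MULT (destination R3) is still in EXE:
print(must_stall_raw(['R3', 'R2'], exe_stage_dst='R3', wrb_stage_dst=None))  # True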

9
Data Bypass (Forwarding)
  • Motivation
  • Minimize the pipeline stalls due to data
    dependence (RAW) hazards
  • Idea
  • Let's propagate the result as soon as it is available from the ALU or from
    memory (in parallel with the register write)
  • Requires
  • A data path from the ALU output to the inputs of the execution units (input
    of the integer ALU, address or data input of the memory pipeline, etc.)
  • The Register Read stage can read data from the register file or from the
    output of the previous execution stage
  • Requires a MUX in front of the input of the execution stage (see the sketch
    below)
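
Here is a minimal sketch (not from the slides) of the bypass MUX selection just described: the operand entering the execution stage is taken from the freshest in-flight result if one exists, otherwise from the register file. The two forwarding sources are an assumption for illustration.

def select_operand(src_reg, regfile, exe_dst, exe_result, wrb_dst, wrb_result):
    # Choose the value of src_reg that enters the EXE stage this cycle.
    if src_reg == exe_dst and exe_result is not None:
        return exe_result          # forward from the ALU output
    if src_reg == wrb_dst and wrb_result is not None:
        return wrb_result          # forward from the write-back stage
    return regfile[src_reg]        # no hazard: read the register file

regfile = {'R2': 7, 'R3': 0}
# MULT's R3 result (42) is still in EXE when ADD R4, R3, R2 needs it:
print(select_operand('R3', regfile, 'R3', 42, None, None))  # 42, not the stale 0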

10
Datapath w/ Forwarding
11
Example 1 with Bypass
1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R3, R2
5 SUB R3, R3, R4
6 ST A <- R3

Execution time: 10 cycles = start-up latency (4) + number of instrs (6) + number of pipeline bubbles (0)
[Pipeline timing diagram: with bypassing, all 6 instructions flow through F, D, R, E, W back to back with no stall cycles.]
12
Example 2 with Bypass
1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R5, R6
5 SUB R3, R1, R4
6 ST A <- R3
[Pipeline timing diagram: MULT occupies the EXE stage for 6 cycles; bypassing removes the RAW stalls, but pipeline bubbles remain due to the WAW dependence between instructions 3 and 5.]
13
Pipeline Hazards
  • Data Hazards
  • Caused by data (RAW, WAW, WAR) dependences
  • Require
  • Pipeline interlock (stall) mechanism to detect
    dependences and generate machine stall cycles
  • Reg-id comparators between instrs in the REG stage and instrs in the
    EXE/WRB stages
  • Stalls due to RAW hazards can be reduced by the bypass network
  • Requires reg-id comparators, data bypass paths, and MUXes
  • Structural Hazards
  • Caused by resource constraints
  • Require pipeline stall mechanism to detect
    structural constraints
  • Control (Branch) Hazards
  • Caused by branches
  • Instruction fetch of the next instruction has to wait until the target
    (including the branch condition) of the current branch instruction is
    resolved
  • Use
  • Pipeline stall to delay the fetch of the next
    instruction
  • Predict the next target address (branch
    prediction) and if wrong, flush all the
    speculatively fetched instructions from the
    pipeline

14
Structural Hazard Example
  • Assume that
  • We have 1 memory unit and 1 integer ALU unit
  • LD takes 2 cycles and MULT takes 4 cycles

1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R5, R6
5 SUB R3, R1, R4
6 ST A <- R3
[Pipeline timing diagram: with a single memory unit and a single ALU, later instructions must wait for the busy unit (structural hazards), in addition to the RAW stalls.]
15
Structural Hazard Example
1 LD R1 <- A
2 LD R2 <- B
3 MULT R3, R1, R2
4 ADD R4, R5, R6
5 SUB R3, R1, R4
6 OR R10 <- R3, R1

  • Assume that
  • 1. We have 1 pipelined memory unit, 1 integer add unit, and 1 integer multiply unit
  • 2. LD takes 2 cycles and MULT takes 4 cycles

[Pipeline timing diagram: the RAW stalls remain, and structural hazards now occur at the register-file write port when multiple instructions try to complete in the same cycle.]
16
Control Hazard Example (Stall)
  • 1 LD R1 <- A
  • 2 LD R2 <- B
  • 3 MULT R3, R1, R2
  • 4 BEQ R1, R2, TARGET
  • 5 SUB R3, R1, R4
  • ST A <- R3
  • TARGET:

[Pipeline timing diagram: instruction fetch stalls after the BEQ until the branch target is known (control hazard), in addition to the RAW stalls.]
17
Control Hazard Example (Flush)
  • 1 LD R1 <- A
  • 2 LD R2 <- B
  • 3 MULT R3, R1, R2
  • 4 BEQ R1, R2, TARGET
  • 5 SUB R3, R1, R4
  • ST A <- R3
  • TARGET: ADD R4, R1, R2

[Pipeline timing diagram: instructions after the BEQ are fetched and executed speculatively; once the branch target is known, they are flushed from the pipeline on a branch misprediction.]
18
Memory Hierarchy
  • Motivated by
  • Principles of Locality
  • Locality principle
  • Spatial locality: nearby references are likely
  • Example: arrays, program code
  • Access a block of contiguous words
  • Temporal locality: the same reference is likely to recur soon
  • Example: loops, reuse of variables
  • Keep recently accessed data closer to the processor
  • Speed vs. Size vs. Cost tradeoff
  • SRAM - DRAM - Disk - Tape
  • As you go down the hierarchy, memory becomes
    bigger, slower, cheaper
  • As you go up the hierarchy, memory becomes
    smaller, faster, more expensive

19
Levels of Memory Hierarchy
(Faster/smaller at the top; slower/larger at the bottom)

Level        | Capacity/Access Time | Moved By                  | Unit of Transfer
Registers    | 100s of bytes        | Program/Compiler (1-16B)  | Instruction operands
Cache        | KBs - MBs            | H/W (16B - 512B)          | Cache line
Main Memory  | GBs                  | OS (512B - 64MB)          | Page
Disk         | 100s of GBs          | User (any size)           | File
Tape         | Infinite             |                           |
20
Cache
  • A small but fast memory located between processor
    and main memory
  • Benefits
  • Reduce load latency
  • Reduce store latency
  • Reduce bus traffic (on-chip caches)
  • Cache Block Allocation (When to place)
  • On a read miss
  • On a write miss
  • Write-allocate vs. no-write-allocate
  • Cache Block Placement (Where to place)
  • Fully-associative cache
  • Direct-mapped cache
  • Set-associative cache

21
Fully Associative Cache
[Figure: 32b word, 4-word (16B) cache block. A 32-bit VA gives a 4GB virtual address space (DRAM) with 2^28 memory blocks; the 32KB cache (SRAM) has 2^11 cache blocks (cache lines). A memory block can be placed into any cache block location!]
22
Fully Associative Cache
[Figure: 32KB data RAM and tag RAM with 2^11 entries. The address is split into a tag (bits 31-4) and an offset (bits 3-0); the tag is compared against every valid tag RAM entry in parallel, and on a match the selected word/byte is sent to the CPU (cache hit).]

Advantages: 1. High hit rate  2. Fast
Disadvantages: 1. Very expensive
23
Direct Mapped Cache
[Figure: a 32-bit VA gives a 4GB virtual address space (DRAM) with 2^28 memory blocks; the 32KB cache (SRAM) has 2^11 cache blocks. A memory block can be placed into only a single cache block: memory blocks 0, 2^11, 2*2^11, ..., (2^17-1)*2^11 all map to cache block 0.]
24
Direct Mapped Cache
[Figure: 32KB data RAM and tag RAM with 2^11 entries. The address is split into a tag (bits 31-15), an index (bits 14-4), and an offset (bits 3-0); the index selects one entry through a decoder, the stored tag is compared with the address tag, and on a match the selected word/byte is sent to the CPU (cache hit). A sketch of this address split follows below.]

Advantages: 1. Simple HW  2. Fast implementation
Disadvantages: 1. Low hit rate
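
The tag/index/offset split in the figure above can be written out directly. Below is a minimal sketch (not from the slides), assuming the 16B blocks (4 offset bits) and 2^11 cache blocks (11 index bits) used on this slide.

OFFSET_BITS = 4           # 16-byte cache block
INDEX_BITS  = 11          # 2^11 = 2048 cache blocks (32KB / 16B)

def split_address(addr):
    # Decompose a 32-bit address into (tag, index, offset) for a direct-mapped cache.
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x00012344)
print(hex(tag), hex(index), hex(offset))  # 0x2 0x234 0x4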
25
Set Associative Cache
In an M-way set associative cache, a memory block can be placed into any of M cache blocks!

[Figure: the 32KB cache (SRAM) is organized as a 2-way set-associative cache: Way 0 and Way 1, each with 2^10 sets (2^11 cache blocks in total). Memory blocks 0, 2^10, 2*2^10, ..., (2^18-1)*2^10 all map to set 0.]
26
Set Associative Cache
[Figure: 32KB data RAM and tag RAM organized in 2 ways of 2^10 entries each. The address is split into a tag (bits 31-14), an index (bits 13-4), and an offset (bits 3-0); the index selects one set through a decoder, the tags of both ways are compared with the address tag in parallel, and a way MUX (Wmux) selects the hitting way before the word/byte is sent to the CPU (cache hit). A lookup sketch follows below.]

Most caches are implemented as set-associative caches!
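
Below is a minimal sketch (not from the slides) of the 2-way set-associative lookup in the figure above, assuming the 4-bit offset and 10-bit index shown on this slide; the Python lists stand in for the tag RAM.

OFFSET_BITS, INDEX_BITS, NUM_WAYS = 4, 10, 2
NUM_SETS = 1 << INDEX_BITS

# ways[w][set] holds the stored tag for that block, or None if the entry is invalid.
ways = [[None] * NUM_SETS for _ in range(NUM_WAYS)]

def lookup(addr):
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag   = addr >> (OFFSET_BITS + INDEX_BITS)
    for w in range(NUM_WAYS):            # the hardware compares both ways in parallel
        entry = ways[w][index]
        if entry is not None and entry == tag:
            return True, w               # cache hit in way w
    return False, None                   # cache miss

addr = 0x00012344
index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
ways[1][index] = addr >> (OFFSET_BITS + INDEX_BITS)   # pretend the block is cached in way 1
print(lookup(addr))   # (True, 1)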
27
Cache Block Replacement
  • Random
  • Just pick one and replace it
  • Pseudo-random: use a simple hash of the address
  • LRU (least recently used)
  • Needs to keep a timestamp per block
  • Expensive due to the global compare
  • Pseudo-LRU: approximate LRU using bit tags
  • The replacement policy is critical for small caches (an LRU sketch follows below)
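
Here is a minimal sketch (not from the slides) of LRU replacement for a single cache set, using an ordered dictionary in place of the per-block timestamps mentioned above.

from collections import OrderedDict

class LRUSet:
    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.blocks = OrderedDict()          # tag -> block, least recently used first

    def access(self, tag):
        if tag in self.blocks:               # hit: mark the block most recently used
            self.blocks.move_to_end(tag)
            return 'hit'
        if len(self.blocks) >= self.num_ways:
            self.blocks.popitem(last=False)  # miss: evict the least recently used block
        self.blocks[tag] = None              # allocate the new block
        return 'miss'

s = LRUSet(num_ways=2)
print([s.access(t) for t in [0xA, 0xB, 0xA, 0xC, 0xB]])
# ['miss', 'miss', 'hit', 'miss', 'miss']  (0xB is evicted by 0xC, then reloaded)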

28
Write Policy
  • Write-through
  • Write to the cache and to the next level of the memory hierarchy
  • Simple to design, keeps memory consistent
  • Generates more write traffic
  • Write-back
  • Only write to the cache (not to the lower level)
  • Update memory when a dirty block is replaced
  • Less write traffic; writes are independent of main memory
  • Complex to design, memory is inconsistent (until the write-back)
  • Write-Allocate Policy
  • Write-allocate: allocate a cache block on a write miss
  • Typically used with a write-back cache
  • No-write-allocate
  • Typically used with a write-through cache (a sketch of the two write policies follows below)
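
Below is a minimal sketch (not from the slides) contrasting the two write policies above for a write hit, plus the write-back action on eviction; the dictionary-based cache model is an assumption for illustration.

def write_through(cache, memory, addr, value):
    cache[addr] = value          # write the cache...
    memory[addr] = value         # ...and the next level immediately (more write traffic)

def write_back(cache, dirty, addr, value):
    cache[addr] = value          # write only the cache
    dirty.add(addr)              # mark the block dirty; memory is updated later

def evict(cache, dirty, memory, addr):
    if addr in dirty:            # a dirty block is written back when it is replaced
        memory[addr] = cache[addr]
        dirty.discard(addr)
    del cache[addr]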

29
3 + 1 Types of Cache Misses
  • Cold-start misses (or compulsory misses): the first access to a block always misses
  • These misses occur even in an infinite cache
  • Capacity misses: if the set of memory blocks needed by a program is bigger than the cache, capacity misses will occur due to cache block replacement
  • These misses occur even in a fully associative cache
  • Conflict misses (or collision misses): in a direct-mapped or set-associative cache, too many blocks can be mapped to the same set
  • Invalidation misses (or sharing misses): cache blocks can be invalidated due to coherence traffic

30
Miss Rates (SPEC92)
31
Cache Performance vs Block Size
[Figure: the miss penalty (access time + transfer time) grows with block size, the miss rate first falls and then rises with block size, and the average access time has a sweet spot at an intermediate block size.]
32
Multi-level Cache
  • For L1 organization,
  • AMAT = Hit_Time + Miss_Rate x Miss_Penalty
  • For L1/L2 organization,
  • AMAT = Hit_Time_L1 + Miss_Rate_L1 x (Hit_Time_L2 + Miss_Rate_L2 x Miss_Penalty_L2)
    (see the sketch after this list)
  • Advantages
  • For capacity misses and conflict misses in L1, a
    significant penalty reduction
  • Disadvantages
  • For L1-L2 misses, miss penalty increases slightly
  • L2 does not help compulsory misses
  • Design Issues
  • Size(L2) >> Size(L1)
  • Usually, Block_size(L2) > Block_size(L1)
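
The AMAT formulas above can be evaluated directly. Here is a minimal sketch (not from the slides); the latencies and miss rates are assumed numbers for illustration only.

def amat_l1(hit_time, miss_rate, miss_penalty):
    # Average memory access time for a single-level (L1 only) organization.
    return hit_time + miss_rate * miss_penalty

def amat_l1_l2(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, miss_penalty_l2):
    # The L1 miss penalty is itself the AMAT of the L2 cache.
    return hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * miss_penalty_l2)

# Assumed example latencies (cycles) and miss rates:
print(amat_l1(hit_time=1, miss_rate=0.05, miss_penalty=100))                   # 6.0
print(amat_l1_l2(1, 0.05, hit_l2=10, miss_rate_l2=0.2, miss_penalty_l2=100))   # 2.5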

33
Memory Performance Parameters
  • Access Time
  • The time elapsed from asserting an address to
    when the data is available on the output
  • Row Access Time: the time elapsed from asserting RAS to when the row is
    available in the row buffer
  • Column Access Time: the time elapsed from asserting CAS to when the valid
    data is present on the output pins
  • Cycle Time
  • The minimum time between two different requests
    to memory
  • Latency
  • The time to access the first word of a block
  • Bandwidth
  • Transmission rate (bytes per second)

34
DRAM Structure
[Figure: DRAM structure. The address is split into a row address and a column address; the row decoder selects one row of the memory array, and the column decoder/multiplexer selects the data output from that row.]
35
Pentium III Example
[Figure: the Pentium III processor contains a 16KB I-Cache and a 16KB D-Cache feeding the Pentium III core pipeline, plus a 256KB 8-way 2nd-level cache connected over an 800 MHz, 256b data bus. The FSB (133 MHz, 64b data, 32b address) connects the processor over the system bus to a host-to-PCI bridge with AGP graphics and to main memory over a multiplexed (RAS/CAS) memory bus. DIMMs of 16 16Mx4b 133 MHz SDRAM chips constitute a 128MB DRAM module with a 64b data bus.]
36
Virtual Memory
  • Objective
  • Large address spaces -> easy programming
  • Provide the illusion of an infinite amount of memory
  • Program code/data can exceed the main memory size
  • Processes can be partially resident in memory
  • Protection of code and data
  • Privilege levels
  • Access rights: read/modify/execute permission
  • Sharing of code and data
  • Benefits
  • Easier programming
  • Software portability
  • Increased CPU utilization
  • More programs can run at the same time
  • Virtual address space
  • The programmer's view of infinite memory
  • Physical address space
  • The machine's physical memory

37
Virtual Memory
  • Require the following functions
  • Memory allocation (Placement)
  • Memory deallocation (Replacement)
  • Memory mapping (Translation)
  • Memory management
  • Automatic movement of data between main memory
    and secondary storage
  • Main memory contains only the most frequently used portions of a process's
    address space
  • Illusion of infinite memory (size of secondary
    storage) but access time equal to main memory
  • Usually implemented by demand paging

38
Paging
  • Divide the address space into fixed-size pages (page frames)
  • A VA consists of (VPN, offset)
  • A PA consists of (PPN, offset)
  • Map a virtual page to a physical page at runtime (a translation sketch follows below)
  • Demand paging: bring in a page on a page miss
  • A page table entry (PTE) contains
  • Virtual page number
  • Physical page number
  • Presence bit
  • Reference bit
  • Dirty bit
  • Access control: read/write/execute
  • Privilege level
  • Disk address
  • Internal fragmentation
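
Here is a minimal sketch (not from the slides) of the (VPN, offset) split and the VA-to-PA mapping just described, assuming 4KB pages and a dictionary in place of the page table.

PAGE_SIZE = 4096                        # assumed page size (12 offset bits)
OFFSET_BITS = 12

page_table = {0x12345: 0x00042}         # VPN -> PPN, for pages present in memory

def translate(va):
    vpn, offset = va >> OFFSET_BITS, va & (PAGE_SIZE - 1)
    if vpn not in page_table:
        raise RuntimeError('page miss: bring the page in from secondary storage')
    return (page_table[vpn] << OFFSET_BITS) | offset

print(hex(translate(0x12345678)))       # 0x42678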

39
TLB
  • TLB (Translation Lookaside Buffer)
  • Cache of page table entries (PTEs)
  • On a TLB hit, the virtual-to-physical translation is done without accessing
    the page table
  • On a TLB miss, the page table must be searched for the mapping, which is
    inserted into the TLB before processing continues (see the sketch below)
  • TLB configuration
  • On the order of 100 entries, fully- or set-associative cache
  • usually separate I-TLB and D-TLB, accessed every
    cycle
  • Different virtual memory faults
  • TLB miss - PTE not in TLB
  • PTE miss - PTE not in main memory
  • page miss - page not in main memory
  • access violation
  • privilege violation
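
Below is a minimal sketch (not from the slides) of the TLB hit/miss flow above, again assuming 4KB pages and dictionaries in place of the TLB and page table.

OFFSET_BITS = 12                           # assumed 4KB pages
page_table = {0x12345: 0x00042}            # VPN -> PPN
tlb = {}                                   # small cache of PTEs: VPN -> PPN

def translate_with_tlb(va):
    vpn, offset = va >> OFFSET_BITS, va & ((1 << OFFSET_BITS) - 1)
    if vpn in tlb:                         # TLB hit: no page table access needed
        ppn = tlb[vpn]
    else:                                  # TLB miss: search the page table...
        ppn = page_table[vpn]              # (a KeyError here would be a PTE miss)
        tlb[vpn] = ppn                     # ...and insert the mapping into the TLB
    return (ppn << OFFSET_BITS) | offset

print(hex(translate_with_tlb(0x12345678)))  # TLB miss, then fill: 0x42678
print(hex(translate_with_tlb(0x12345abc)))  # TLB hit: 0x42abc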

40
TLB and Cache Implementation of DECStation 3100