Title: CPUs
1. CPUs
- CPU performance
- CPU power consumption
2. Elements of CPU performance
- Cycle time
- CPU pipeline
- Memory system
3. Pipelining
- Several instructions are executed simultaneously, at different stages of completion (a rough cycle-count estimate follows this list)
- Various conditions can cause pipeline bubbles that reduce utilization
- branches
- memory system delays
- etc.
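A rough cycle-count estimate (idealized, one cycle per stage; the symbols are ours, not from the slides): once the pipeline is full, one instruction completes every cycle, so N instructions on a k-stage pipeline take about
$T_{\text{pipelined}} \approx (k + N - 1)\, t_{\text{cycle}}$, versus $T_{\text{unpipelined}} = k N\, t_{\text{cycle}}$ with no overlap.
Bubbles add extra cycles on top of the first estimate, which is why they reduce utilization.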
4. Pipeline structures
- The ARM7 has a 3-stage pipeline:
- fetch instruction from memory
- decode opcode and operands
- execute
- The ARM9 has a 5-stage pipeline:
- Instruction fetch
- Decode
- Execute
- Data memory access
- Register write
5. ARM9 core instruction pipeline
6. ARM7 pipeline execution
[Figure: add r0,r1,#5; sub r2,r3,r6; and cmp r2,#3 moving through the fetch/decode/execute stages over time steps 1-3]
7. Performance measures
- Latency: the time it takes for an instruction to get through the pipeline
- Throughput: the number of instructions executed per time period
- Pipelining increases throughput without reducing latency (see the note after this list)
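To make the distinction concrete (idealized k-stage pipeline; symbols are ours): latency $= k\, t_{\text{cycle}}$, while steady-state throughput $= 1 / t_{\text{cycle}}$. Splitting the work into more stages lets $t_{\text{cycle}}$ shrink, raising throughput, but each instruction still traverses all k stages, so its latency does not go down.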
8. Pipeline stalls
- If every step cannot be completed in the same amount of time, the pipeline stalls
- Bubbles introduced by a stall increase latency and reduce throughput (a CPI estimate follows)
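A standard way to account for the cost of bubbles (generic textbook averaging, not numbers from the slides): $CPI_{\text{effective}} = CPI_{\text{ideal}} + \text{stall cycles per instruction}$, and throughput $= 1 / (CPI_{\text{effective}}\, t_{\text{cycle}})$, so every bubble shows up directly in the stall term.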
9. ARM multi-cycle LDMIA instruction
[Figure: ldmia r0,{r2,r3} occupies the execute stage for two cycles (ex ld r2, ex ld r3); the following sub r2,r3,r6 and cmp r2,#3 are fetched and decoded but their execute stages are pushed back]
10. Control stalls
- Branches often introduce stalls (branch penalty)
- Stall time may depend on whether branch is taken
- May have to squash instructions that have already started executing
- Don't know what to fetch until the condition is evaluated (a rough penalty estimate follows)
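A back-of-the-envelope estimate of the branch contribution to stalls (hypothetical symbols, not from the slides): with branch frequency $f_b$, fraction of branches that pay the penalty $p$, and a penalty of $b$ cycles, branch stall cycles per instruction $\approx f_b \times p \times b$.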
11. ARM pipelined branch
[Figure: pipelined execution of an ARM branch over time]
12. Delayed branch
- To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch are always executed, whether or not the branch is taken
13. Example: ARM7 execution time
- Determine the execution time of an FIR filter:
- for (i = 0; i < N; i++)
-   f = f + c[i] * x[i];
- Only the branch in the loop test may take more than one cycle.
- BLT loop takes 1 cycle best case, 3 worst case (see the estimate below).
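Turning the per-instruction counts into a loop estimate (a sketch; $t_{\text{body}}$ is a hypothetical cycle count for the loop body and index update, not a number from the slides): $t_{\text{FIR}} \approx N\, (t_{\text{body}} + t_{\text{BLT}})\, t_{\text{cycle}}$, with $t_{\text{BLT}}$ between 1 and 3 cycles per iteration.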
14. ARM10 processor execution time
- Impossible to describe briefly the exact behavior of all instructions in all circumstances; execution time depends on:
- Branch prediction
- Prefetch buffer
- Branch folding
- The independent Load/Store Unit
- Data alignment
- How many accesses hit in the cache and TLB
15. ARM10 integer core
[Figure: ARM10 integer core block diagram]
16. Integer core
- Prefetch Unit
- Fetches instructions from I-cache or external memory
- Predicts the outcome of branches whenever it can
- Integer Unit
- Decode
- Barrel shifter, ALU, Multiplier
- Main instruction sequencer
- Load/store Unit
- Loads or stores two registers (64 bits) per cycle
- Decouples from the integer unit after the first access of an LDM or STM instruction
- Supports Hit-Under-Miss (HUM) operation
17. Pipeline
- Fetch
- I-cache access, branch prediction
- Issue
- Initial instruction decode
- Decode
- Final instruction decode, register read for ALU op, forwarding, and initial interlock resolution
- Execute
- Data address calculation, shift, flag setting, CC check, branch mispredict detection, and store data register read
- Memory
- Data cache access
- Write
- Register writes, instruction retirement
18. Typical operations
19. Load or store operation
20. LDR operation that misses
21. Interlocks
- Integer core
- Forwarding to resolve data dependencies between instructions
- Pipeline interlocks
- Data dependency interlocks
- Instructions that have a source register that is loaded from memory by the previous instruction (see the C sketch after this list)
- Hardware dependency
- A new load waiting for the LSU to finish an existing LDM or STM
- A load that misses when the HUM slot is already occupied
- A new multiply waiting for a previous multiply to free up the first stage of the multiplier
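A minimal C sketch of the data dependency case above (the function name and shape are ours, chosen only for illustration; assumes the compiler keeps the values in registers):

    /* Compiles to roughly:
     *   ldr rX, [p]      ; load *p from memory
     *   add rY, rX, #1   ; source register rX was loaded by the previous instruction
     * The add is the data dependency interlock case: it must wait until the
     * loaded value can be forwarded to it. */
    int load_use(const int *p)
    {
        int x = *p;      /* the load */
        return x + 1;    /* the dependent add */
    }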
22. Pipeline forwarding paths
23. Example of interlocking and forwarding
- Execute-to-execute
- mov r0, #1
- add r1, r0, #1
- Memory-to-execute
- mov r0, #1
- sub r1, r2, #2
- add r2, r0, #1
24. Example of interlocking and forwarding, cont'd
- Single-cycle interlock
- ldr r0, [r1, r2]
- str r3, [r0, r4]
[Figure: pipeline diagram of the ldr/str pair showing the single-cycle interlock: the str needs r0, which the preceding ldr is still loading, before it can read r0 and r4 for its address]
25. Superscalar execution
- A superscalar processor can execute several instructions per cycle.
- Uses multiple pipelined data paths.
- Programs execute faster, but it is harder to determine how much faster (see the example below).
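A hedged C illustration of why the speedup is hard to pin down (variable names are ours; whether the pairs actually issue together depends on the compiler's scheduling and the width of the machine):

    /* On a 2-way superscalar CPU the first pair of adds is independent and
     * could issue in the same cycle; the second pair is serialized because
     * the final add needs the result of the one before it. */
    void pairs(int b, int c, int e, int f, int out[4])
    {
        int a1 = b + c;   /* independent of d1 ... */
        int d1 = e + f;   /* ... so a1 and d1 could issue together */

        int a2 = b + c;
        int d2 = a2 + f;  /* depends on a2: must wait for it */

        out[0] = a1; out[1] = d1; out[2] = a2; out[3] = d2;
    }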
26. Data dependencies
- Execution time depends on operands, not just the opcode.
- A superscalar CPU checks data dependencies dynamically.
[Figure: data dependency graph for add r2,r0,r1 followed by add r3,r2,r5; the second add consumes the r2 produced by the first]
27. Memory system performance
- Caches introduce indeterminacy in execution time
- Depends on order of execution
- Cache miss penalty: added time due to a cache miss
- Several reasons for a miss: compulsory, conflict, capacity (an access-time estimate follows)
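A standard estimate of what a cache adds to the average access (generic formula; symbols are ours): with hit time $t_{\text{hit}}$, miss rate $r_{\text{miss}}$, and miss penalty $t_{\text{penalty}}$, the average access time is $t_{\text{hit}} + r_{\text{miss}} \times t_{\text{penalty}}$. Because $r_{\text{miss}}$ depends on the order of execution, this is where the indeterminacy above comes from.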