CPUs - PowerPoint PPT Presentation

Provided by: wayne74

Transcript and Presenter's Notes

Title: CPUs


1
CPUs
  • CPU performance
  • CPU power consumption

2
Elements of CPU performance
  • Cycle time
  • CPU pipeline
  • Memory system

3
Pipelining
  • Several instructions are executed simultaneously
    at different stages of completion
  • Various conditions can cause pipeline bubbles
    that reduce utilization:
      • branches
      • memory system delays
      • etc.

4
Pipeline structures
  • ARM7 has a 3-stage pipeline:
      • fetch instruction from memory
      • decode opcode and operands
      • execute
  • ARM9 has a 5-stage pipeline:
      • instruction fetch
      • decode
      • execute
      • data memory access
      • register write

5
ARM9 core instruction pipeline
6
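ARM7 pipeline execution
[Figure: pipeline timing diagram. The instructions add r0,r1,#5, sub r2,r3,r6 and cmp r2,#3 enter the pipeline on successive cycles and overlap between the fetch and execute stages; the horizontal axis is time in cycles (1, 2, 3, ...).]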
7
Performance measures
  • Latency: time it takes for an instruction to get
    through the pipeline
  • Throughput: number of instructions executed per
    time period
  • Pipelining increases throughput without reducing
    latency
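
The two measures can be put into numbers with a small sketch. It assumes an ideal pipeline with one cycle per stage and no stalls; the function names are hypothetical and not from the slides. Latency stays at one cycle per stage, while n instructions finish in stages + n - 1 cycles, so throughput approaches one instruction per cycle as n grows.

```c
#include <assert.h>

/* Latency of one instruction in cycles: it must pass through every stage.
   Assumption: each stage takes exactly one cycle. */
unsigned pipe_latency_cycles(unsigned stages)
{
    assert(stages > 0);
    return stages;
}

/* Cycles to finish n instructions on an ideal pipeline with no stalls:
   the first needs `stages` cycles, each later one completes one cycle
   after its predecessor. */
unsigned pipe_total_cycles(unsigned stages, unsigned n)
{
    assert(stages > 0);
    return n == 0 ? 0 : stages + n - 1;
}

/* Throughput in instructions per cycle for a run of n instructions. */
double pipe_throughput(unsigned stages, unsigned n)
{
    unsigned cycles = pipe_total_cycles(stages, n);
    return cycles == 0 ? 0.0 : (double)n / (double)cycles;
}
```

For the 3-stage ARM7 pipeline, pipe_total_cycles(3, 3) gives 5 cycles for three instructions, against 9 cycles if each instruction ran to completion before the next one started.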

8
Pipeline stalls
  • If not every step can be completed in the same
    amount of time, the pipeline stalls
  • Bubbles introduced by a stall increase latency and
    reduce throughput

9
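ARM multi-cycle LDMIA instruction
[Figure: pipeline timing diagram for ldmia r0,{r2,r3}, sub r2,r3,r6 and cmp r2,#3. The ldmia occupies the execute stage for two cycles (ex ld r2, ex ld r3), so the sub and cmp behind it reach execute (ex sub, ex cmp) later; the horizontal axis is time.]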
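
A rough cycle count for this kind of sequence can be sketched as follows. The model is a simplification made for illustration, not the documented ARM7 timing: execute is the last stage, every earlier stage takes one cycle, the only stalls come from instructions that hold the execute stage for more than one cycle, and ldmia is charged one execute cycle per register loaded. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles for an in-order pipeline in which instruction i occupies the
   execute stage for exec_cycles[i] cycles.
   Assumptions: execute is the final stage (as in the 3-stage ARM7
   pipeline), earlier stages take one cycle each, no other stalls. */
unsigned cycles_with_multicycle_execute(unsigned stages,
                                        const unsigned *exec_cycles,
                                        size_t n)
{
    assert(stages > 0);
    if (n == 0)
        return 0;

    unsigned total = stages - 1;   /* cycles before the first execute begins */
    for (size_t i = 0; i < n; i++) {
        assert(exec_cycles[i] > 0);
        total += exec_cycles[i];   /* execute serves one instruction at a time */
    }
    return total;
}
```

Under these assumptions the ldmia, sub, cmp sequence with execute times {2, 1, 1} takes 6 cycles on a 3-stage pipeline, one more than the 5 cycles of three single-cycle instructions.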
10
Control stalls
  • Branches often introduce stalls (branch penalty)
  • Stall time may depend on whether branch is taken
  • May have to squash instructions that already
    started executing
  • Don't know what to fetch until the condition is
    evaluated
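
Because the stall time depends on whether the branch is taken, an average cost per branch can be estimated by weighting the two cases. The sketch below is illustrative only; the function name and the idea of a fixed taken fraction are assumptions, and the cycle counts are parameters rather than figures for any particular core.

```c
#include <assert.h>

/* Average cycles per branch, weighting the fall-through cost and the
   taken cost (which includes squashed instructions) by how often the
   branch is taken. taken_fraction must lie in [0, 1]. */
double avg_branch_cycles(double taken_fraction,
                         unsigned not_taken_cycles,
                         unsigned taken_cycles)
{
    assert(taken_fraction >= 0.0 && taken_fraction <= 1.0);
    return (1.0 - taken_fraction) * not_taken_cycles
         + taken_fraction * taken_cycles;
}
```

With a 1-cycle fall-through cost and a 3-cycle taken cost, a branch taken half the time averages 2 cycles.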

11
ARM pipelined branch
[Figure: pipeline timing diagram of a branch on ARM; the horizontal axis is time.]
12
Delayed branch
  • To increase pipeline efficiency, a delayed branch
    mechanism requires that the n instructions after
    the branch always execute, whether or not the
    branch is taken

13
Example ARM7 execution time
  • Determine execution time of FIR filter:
      for (i=0; i<N; i++)
          f = f + c[i]*x[i];
  • Only the branch in the loop test may take more than
    one cycle.
  • BLT loop takes 1 cycle best case, 3 worst case.
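
A self-contained version of the filter, with the loop body read as f = f + c[i]*x[i], might look like the sketch below. The helper functions bound only the cycles spent in the loop-test branch, using the 1-cycle and 3-cycle figures quoted above; their names, and the assumption that the branch executes once per iteration, are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* FIR filter: f = sum of c[i] * x[i] for i in [0, n). */
int fir(const int *c, const int *x, size_t n)
{
    assert(n == 0 || (c != NULL && x != NULL));
    int f = 0;
    for (size_t i = 0; i < n; i++)
        f = f + c[i] * x[i];
    return f;
}

/* Bounds on cycles spent in the loop-test branch alone:
   1 cycle per BLT in the best case, 3 cycles in the worst case.
   Assumption: the branch executes once per loop iteration. */
unsigned blt_cycles_best(unsigned n)  { return n * 1u; }
unsigned blt_cycles_worst(unsigned n) { return n * 3u; }
```

For N = 4 the branch contributes between 4 and 12 cycles; the rest of the loop body adds a fixed cost per iteration on top of that.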

14
ARM10 processor execution time
  • Impossible to describe briefly the exact behavior
    of all instructions in all circumstances
  • Branch prediction
  • Prefetch buffer
  • Branch folding
  • The independent Load/Store Unit
  • Data alignment
  • How many accesses hit in the cache and TLB

15
ARM10 integer core
[Figure: ARM10 integer core block diagram, including the label "3 instrs".]
16
Integer core
  • Prefetch Unit
      • Fetches instructions from I-cache or external
        memory
      • Predicts the outcome of branches whenever it can
  • Integer Unit
      • Decode
      • Barrel shifter, ALU, multiplier
      • Main instruction sequencer
  • Load/Store Unit
      • Loads or stores two registers (64 bits) per cycle
      • Decouples from the integer unit after the first
        access of an LDM or STM instruction
      • Supports Hit-Under-Miss (HUM) operation

17
Pipeline
  • Fetch
      • I-cache access, branch prediction
  • Issue
      • Initial instruction decode
  • Decode
      • Final instruction decode, register read for ALU
        op, forwarding, and initial interlock resolution
  • Execute
      • Data address calculation, shift, flag setting, CC
        check, branch mispredict detection, and store
        data register read
  • Memory
      • Data cache access
  • Write
      • Register writes, instruction retirement

18
Typical operations
19
Load or store operation
20
LDR operation that misses
21
Interlocks
  • Integer core
      • Forwarding to resolve data dependencies between
        instructions
  • Pipeline interlocks
      • Data dependency interlocks
          • Instructions that have a source register that
            is loaded from memory by the previous
            instruction
      • Hardware dependency
          • A new load waiting for the LSU to finish an
            existing LDM or STM
          • A load that misses when the HUM slot is already
            occupied
          • A new multiply waiting for a previous multiply
            to free up the first stage of the multiplier
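
The data dependency case can be expressed as a simple check between two adjacent instructions. The sketch below uses a hypothetical, simplified instruction record (register numbers only) and is not a model of the actual ARM10 interlock logic; it only detects a source register that the previous instruction loads from memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_REG (-1)

/* Hypothetical instruction record: one destination register and up to
   three source registers, identified by register number. */
typedef struct {
    bool is_load;   /* true if the destination is loaded from memory */
    int  dest;      /* destination register, or NO_REG */
    int  src[3];    /* source registers, NO_REG where unused */
} Instr;

/* True if `next` reads a register that `prev` loads from memory. */
bool load_use_interlock(const Instr *prev, const Instr *next)
{
    assert(prev != NULL && next != NULL);
    if (!prev->is_load || prev->dest == NO_REG)
        return false;
    for (size_t i = 0; i < 3; i++) {
        if (next->src[i] == prev->dest)
            return true;
    }
    return false;
}
```

A load into r0 followed by a store that uses r0 as an address register triggers the check; a mov into r0 followed by an add that reads r0 does not, since that dependency can be handled by forwarding.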

22
Pipeline forwarding paths
23
Example of interlocking and forwarding
  • Execute-to-execute
      mov r0, #1
      add r1, r0, #1
  • Memory-to-execute
      mov r0, #1
      sub r1, r2, #2
      add r2, r0, #1
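
In both examples the forwarding path follows from how far the consumer sits behind the producer of r0. The sketch below encodes that reading; it assumes single-cycle ALU producers such as mov, and the enum and function names are hypothetical. Larger distances are reported as needing neither of the two paths named here.

```c
#include <assert.h>

typedef enum {
    FWD_NONE,               /* neither of the two paths above applies */
    FWD_EXECUTE_TO_EXECUTE, /* consumer immediately follows the producer */
    FWD_MEMORY_TO_EXECUTE   /* one other instruction sits in between */
} FwdPath;

/* Choose the forwarding path from the distance, in instructions, between
   the producer of a register and the instruction that reads it.
   Assumption: the producer is a single-cycle ALU operation. */
FwdPath forwarding_path(unsigned distance)
{
    assert(distance > 0);
    switch (distance) {
    case 1:  return FWD_EXECUTE_TO_EXECUTE;
    case 2:  return FWD_MEMORY_TO_EXECUTE;
    default: return FWD_NONE;
    }
}
```

In the first example the add is one instruction behind the mov (distance 1); in the second the sub separates them (distance 2).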

24
Example of interlocking and forwarding, contd
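  • Single cycle interlock
      ldr r0, [r1, r2]
      str r3, [r0, r4]

[Figure: pipeline diagram of the ldr and str passing through the fetch, issue, decode, execute, memory and write stages, with the labels r1+r2, r0 read, r0+r4, r3 read and r3 write.]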
25
Superscalar execution
  • A superscalar processor can execute several
    instructions per cycle.
  • Uses multiple pipelined data paths.
  • Programs execute faster, but it is harder to
    determine how much faster.

26
Data dependencies
  • Execution time depends on operands, not just
    opcode.
  • Superscalar CPU checks data dependencies
    dynamically

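[Figure: data-flow graph for add r2,r0,r1 followed by add r3,r2,r5. r0 and r1 feed the first add, which produces r2; r2 and r5 feed the second add, which produces r3.]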
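
The effect of such a dependency on issue can be sketched for a two-wide, in-order machine. The model is deliberately minimal and is an assumption for illustration: only read-after-write hazards between adjacent instructions prevent pairing, and every instruction otherwise takes one cycle. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical three-operand ALU instruction: dest = src1 op src2. */
typedef struct {
    int dest, src1, src2;   /* register numbers */
} AluOp;

/* True if `second` reads the register written by `first`
   (a read-after-write dependency). */
bool raw_dependent(const AluOp *first, const AluOp *second)
{
    return second->src1 == first->dest || second->src2 == first->dest;
}

/* Cycles to issue n instructions in order, at most two per cycle.
   Assumption: only a read-after-write hazard between adjacent
   instructions stops them from issuing together. */
unsigned dual_issue_cycles(const AluOp *ops, size_t n)
{
    assert(n == 0 || ops != NULL);
    unsigned cycles = 0;
    size_t i = 0;
    while (i < n) {
        if (i + 1 < n && !raw_dependent(&ops[i], &ops[i + 1]))
            i += 2;     /* independent pair issues in the same cycle */
        else
            i += 1;     /* dependent or final instruction issues alone */
        cycles++;
    }
    return cycles;
}
```

The pair add r2,r0,r1 and add r3,r2,r5 needs two cycles because the second add reads r2; replacing r2 with an unrelated source register lets both issue in one cycle, which is why the same opcodes can take different times.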
27
Memory system performance
  • Caches introduce indeterminacy in execution time
      • Depends on order of execution
  • Cache miss penalty: added time due to a cache
    miss
  • Several reasons for a miss: compulsory, conflict,
    capacity
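
The dependence on execution order can be seen with a toy cache model. The sketch below assumes a hypothetical direct-mapped cache of four one-block lines that starts empty; the names, the cache size and the simple cost formula are illustrative and not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINES 4   /* hypothetical tiny direct-mapped cache */

/* Count misses for a sequence of block addresses on a direct-mapped
   cache with CACHE_LINES one-block lines, starting empty. */
unsigned count_misses(const unsigned *blocks, size_t n)
{
    assert(n == 0 || blocks != NULL);
    unsigned tag[CACHE_LINES];
    bool valid[CACHE_LINES] = { false };
    unsigned misses = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned line = blocks[i] % CACHE_LINES;
        if (!valid[line] || tag[line] != blocks[i]) {
            misses++;                /* line is empty or holds another block */
            valid[line] = true;
            tag[line] = blocks[i];   /* keep the full block number as the tag */
        }
    }
    return misses;
}

/* Total access time: every access pays the hit time, and each miss
   adds the miss penalty on top. */
unsigned memory_cycles(unsigned accesses, unsigned misses,
                       unsigned hit_cycles, unsigned miss_penalty)
{
    return accesses * hit_cycles + misses * miss_penalty;
}
```

Blocks 0 and 4 map to the same line, so the order 0, 4, 0, 4 misses four times while 0, 0, 4, 4 misses only twice: the same four accesses cost different amounts of time depending on their order.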