Title: CPUs
1. CPUs
- CPU performance
- CPU power consumption
2. Elements of CPU performance
- Cycle time
- CPU pipeline
- Memory system
3. Pipelining
- Several instructions are executed simultaneously, at different stages of completion (a rough cycle-count estimate follows this list)
- Various conditions can cause pipeline bubbles that reduce utilization
- branches
- memory system delays
- etc.
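A rough cycle-count estimate (idealized, one cycle per stage; the symbols are ours, not from the slides): once the pipeline is full, one instruction completes every cycle, so N instructions on a k-stage pipeline take about
$T_{\text{pipelined}} \approx (k + N - 1)\, t_{\text{cycle}}$, versus $T_{\text{unpipelined}} = k N\, t_{\text{cycle}}$ with no overlap.
Bubbles add extra cycles on top of the first estimate, which is why they reduce utilization.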
4. Pipeline structures
- The ARM7 has a 3-stage pipeline:
- fetch instruction from memory
- decode opcode and operands
- execute
- The ARM9 has a 5-stage pipeline:
- Instruction fetch
- Decode
- Execute
- Data memory access
- Register write
5. ARM9 core instruction pipeline
6. ARM7 pipeline execution
[Figure: add r0,r1,#5; sub r2,r3,r6; and cmp r2,#3 moving through the fetch/decode/execute stages over time steps 1-3]
7. Performance measures
- Latency: the time it takes for an instruction to get through the pipeline
- Throughput: the number of instructions executed per time period
- Pipelining increases throughput without reducing latency (see the note after this list)
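To make the distinction concrete (idealized k-stage pipeline; symbols are ours): latency $= k\, t_{\text{cycle}}$, while steady-state throughput $= 1 / t_{\text{cycle}}$. Splitting the work into more stages lets $t_{\text{cycle}}$ shrink, raising throughput, but each instruction still traverses all k stages, so its latency does not go down.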
8. Pipeline stalls
- If every step cannot be completed in the same amount of time, the pipeline stalls
- Bubbles introduced by a stall increase latency and reduce throughput (a CPI estimate follows)
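A standard way to account for the cost of bubbles (generic textbook averaging, not numbers from the slides): $CPI_{\text{effective}} = CPI_{\text{ideal}} + \text{stall cycles per instruction}$, and throughput $= 1 / (CPI_{\text{effective}}\, t_{\text{cycle}})$, so every bubble shows up directly in the stall term.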
9. ARM multi-cycle LDMIA instruction
[Figure: ldmia r0,{r2,r3} occupies the execute stage for two cycles (ex ld r2, ex ld r3); the following sub r2,r3,r6 and cmp r2,#3 are fetched and decoded but their execute stages are pushed back]
10. Control stalls
- Branches often introduce stalls (branch penalty)
- Stall time may depend on whether branch is taken
- May have to squash instructions that have already started executing
- Don't know what to fetch until the condition is evaluated (a rough penalty estimate follows)
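A back-of-the-envelope estimate of the branch contribution to stalls (hypothetical symbols, not from the slides): with branch frequency $f_b$, fraction of branches that pay the penalty $p$, and a penalty of $b$ cycles, branch stall cycles per instruction $\approx f_b \times p \times b$.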
11. ARM pipelined branch
[Figure: pipelined execution of an ARM branch over time]
12. Delayed branch
- To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch are always executed, whether or not the branch is taken
13. Example: ARM7 execution time
- Determine the execution time of an FIR filter:
- for (i = 0; i < N; i++)
-   f = f + c[i] * x[i];
- Only the branch in the loop test may take more than one cycle.
- BLT loop takes 1 cycle best case, 3 worst case (see the estimate below).
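Turning the per-instruction counts into a loop estimate (a sketch; $t_{\text{body}}$ is a hypothetical cycle count for the loop body and index update, not a number from the slides): $t_{\text{FIR}} \approx N\, (t_{\text{body}} + t_{\text{BLT}})\, t_{\text{cycle}}$, with $t_{\text{BLT}}$ between 1 and 3 cycles per iteration.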
14. ARM10 processor execution time
- Impossible to describe briefly the exact behavior of all instructions in all circumstances; execution time depends on:
- Branch prediction
- Prefetch buffer
- Branch folding
- The independent Load/Store Unit
- Data alignment
- How many accesses hit in the cache and TLB
15. ARM10 integer core
[Figure: ARM10 integer core block diagram]
16. Integer core
- Prefetch Unit
- Fetches instructions from I-cache or external memory
- Predicts the outcome of branches whenever it can
- Integer Unit
- Decode
- Barrel shifter, ALU, Multiplier
- Main instruction sequencer
- Load/store Unit
- Loads or stores two registers (64 bits) per cycle
- Decouples from the integer unit after the first access of an LDM or STM instruction
- Supports Hit-Under-Miss (HUM) operation
17. Pipeline
- Fetch
- I-cache access, branch prediction
- Issue
- Initial instruction decode
- Decode
- Final instruction decode, register read for ALU op, forwarding, and initial interlock resolution
- Execute
- Data address calculation, shift, flag setting, CC check, branch mispredict detection, and store data register read
- Memory
- Data cache access
- Write
- Register writes, instruction retirement
18. Typical operations
19. Load or store operation
20. LDR operation that misses
21. Interlocks
- Integer core
- Forwarding to resolve data dependencies between instructions
- Pipeline interlocks
- Data dependency interlocks
- Instructions that have a source register that is loaded from memory by the previous instruction (see the C sketch after this list)
- Hardware dependency
- A new load waiting for the LSU to finish an existing LDM or STM
- A load that misses when the HUM slot is already occupied
- A new multiply waiting for a previous multiply to free up the first stage of the multiplier
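A minimal C sketch of the data dependency case above (the function name and shape are ours, chosen only for illustration; assumes the compiler keeps the values in registers):

    /* Compiles to roughly:
     *   ldr rX, [p]      ; load *p from memory
     *   add rY, rX, #1   ; source register rX was loaded by the previous instruction
     * The add is the data dependency interlock case: it must wait until the
     * loaded value can be forwarded to it. */
    int load_use(const int *p)
    {
        int x = *p;      /* the load */
        return x + 1;    /* the dependent add */
    }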
22. Pipeline forwarding paths
23. Example of interlocking and forwarding
- Execute-to-execute
- mov r0, #1
- add r1, r0, #1
- Memory-to-execute
- mov r0, #1
- sub r1, r2, #2
- add r2, r0, #1
24. Example of interlocking and forwarding, cont'd
- Single-cycle interlock
- ldr r0, [r1, r2]
- str r3, [r0, r4]
[Figure: pipeline diagram of the ldr/str pair showing the single-cycle interlock: the str needs r0, which the preceding ldr is still loading, before it can read r0 and r4 for its address]
25. Superscalar execution
- A superscalar processor can execute several instructions per cycle.
- Uses multiple pipelined data paths.
- Programs execute faster, but it is harder to determine how much faster (see the example below).
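A hedged C illustration of why the speedup is hard to pin down (variable names are ours; whether the pairs actually issue together depends on the compiler's scheduling and the width of the machine):

    /* On a 2-way superscalar CPU the first pair of adds is independent and
     * could issue in the same cycle; the second pair is serialized because
     * the final add needs the result of the one before it. */
    void pairs(int b, int c, int e, int f, int out[4])
    {
        int a1 = b + c;   /* independent of d1 ... */
        int d1 = e + f;   /* ... so a1 and d1 could issue together */

        int a2 = b + c;
        int d2 = a2 + f;  /* depends on a2: must wait for it */

        out[0] = a1; out[1] = d1; out[2] = a2; out[3] = d2;
    }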
26. Data dependencies
- Execution time depends on operands, not just the opcode.
- A superscalar CPU checks data dependencies dynamically.
[Figure: data dependency graph for add r2,r0,r1 followed by add r3,r2,r5; the second add consumes the r2 produced by the first]
27. Memory system performance
- Caches introduce indeterminacy in execution time
- Depends on order of execution
- Cache miss penalty: added time due to a cache miss
- Several reasons for a miss: compulsory, conflict, capacity (an access-time estimate follows)
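A standard estimate of what a cache adds to the average access (generic formula; symbols are ours): with hit time $t_{\text{hit}}$, miss rate $r_{\text{miss}}$, and miss penalty $t_{\text{penalty}}$, the average access time is $t_{\text{hit}} + r_{\text{miss}} \times t_{\text{penalty}}$. Because $r_{\text{miss}}$ depends on the order of execution, this is where the indeterminacy above comes from.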