Title: CS 161 Computer Architecture: Introduction to Advanced Architectures, Lecture 13
1 CS 161 Computer Architecture: Introduction to Advanced Architectures, Lecture 13
- Instructor: L.N. Bhuyan
- www.cs.ucr.edu/bhuyan
- Adapted from notes by Dave Patterson (http.cs.berkeley.edu/patterson)
2 Stages of Execution in Pipelined MIPS
- 5-stage instruction pipeline
- 1) I-fetch: Fetch Instruction, Increment PC
- 2) Decode: Decode Instruction, Read Registers
- 3) Execute: Mem-reference: Calculate Address; R-format: Perform ALU Operation
- 4) Memory: Load: Read Data from Data Memory; Store: Write Data to Data Memory
- 5) Write Back: Write Data to Register
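3 Pipelined Execution Representation
[Figure: four instructions in program order (vertical axis: Program Flow), each passing through IFtch, Dcd, Exec, Mem, WB along the horizontal axis (Time), with each instruction starting one clock cycle after the previous one]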
- To simplify the pipeline, every instruction takes the same number of steps, called stages
- One clock cycle per stage
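The timing implied by "one clock cycle per stage" can be checked with a small sketch. It assumes an ideal pipeline with no stalls, and it compares against a hypothetical unpipelined machine that runs all five steps of one instruction before starting the next. The function names are illustrative and are not from the lecture.

```python
STAGES = ["IFtch", "Dcd", "Exec", "Mem", "WB"]

def pipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles to finish all instructions when one enters per cycle and nothing stalls."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1

def unpipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles when each instruction runs all of its stages before the next one starts."""
    return num_stages * num_instructions

def pipeline_diagram(num_instructions):
    """One row per instruction; column c holds the stage occupied in clock cycle c + 1."""
    total = pipelined_cycles(num_instructions)
    rows = []
    for i in range(num_instructions):
        cells = [""] * total
        for s, name in enumerate(STAGES):
            cells[i + s] = name  # instruction i enters the pipeline in cycle i + 1
        rows.append(cells)
    return rows

if __name__ == "__main__":
    for row in pipeline_diagram(4):
        print(" ".join(f"{cell:<5}" for cell in row))
    print(pipelined_cycles(4), "cycles pipelined vs", unpipelined_cycles(4), "unpipelined")
```

For four instructions the sketch gives 8 cycles pipelined against 20 unpipelined; as the instruction count grows, the ratio approaches the number of stages.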
4 Review: Single-cycle Datapath for MIPS
[Figure: single-cycle MIPS datapath divided into stages; surviving labels: Stage 5, Instruction Memory (Imem), Data Memory (Dmem)]
- Use datapath figure to represent pipeline
5 Graphical Pipeline Representation
[Figure: instructions Load, Add, Store, Sub, Or listed in instruction order (vertical axis: Instr. Order), each drawn as a row of datapath blocks including IM, Reg, DM, Reg across Time (clock cycles)]
(right half highlighted means read, left half write)
6 Required Changes to Datapath
- Introduce registers to separate the 5 stages by putting IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the datapath.
- The next PC value is computed in the 3rd step, but we need to bring in the next instruction in the next cycle: move the PCSrc mux to the 1st stage.
- The branch address is computed in the 3rd stage. With pipelining, the PC value has changed by then! Must carry the PC value along with the instruction. Width of IF/ID register = (IR) + (PC) = 64 bits.
- For a lw instruction, we need the write register address at stage 5. But the IR is now occupied by another instruction! So we must carry the IR destination field as we move along the stages. See the connection in the figure. Length of ID/EX register = (Reg1) + (Reg2) + (offset) + (PC) + destn = 133 bits.
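The register widths can be reproduced by adding up the fields each pipeline register carries. The IF/ID and ID/EX fields are the ones listed on the slide. The EX/MEM and MEM/WB field lists are an assumption, chosen to be consistent with the 102-bit and 69-bit labels on the pipelined datapath figure; the slides do not spell them out.

```python
WORD = 32      # width of an instruction, a PC value, a register value, or a sign-extended offset
REG_ADDR = 5   # width of a register number (MIPS has 32 registers)

PIPELINE_REGISTERS = {
    # Fields as listed on the slide
    "IF/ID": {"IR": WORD, "PC": WORD},
    "ID/EX": {"Reg1": WORD, "Reg2": WORD, "offset": WORD, "PC": WORD, "destn": REG_ADDR},
    # Assumed field lists (not given on the slide), chosen to match the figure's width labels
    "EX/MEM": {"branch target": WORD, "Zero": 1, "ALU result": WORD,
               "Reg2": WORD, "destn": REG_ADDR},
    "MEM/WB": {"memory data": WORD, "ALU result": WORD, "destn": REG_ADDR},
}

def width(fields):
    """Total width in bits of one pipeline register, excluding control bits."""
    return sum(fields.values())

if __name__ == "__main__":
    for name, fields in PIPELINE_REGISTERS.items():
        print(f"{name}: {width(fields)} bits")
```

These totals exclude the control bits that the pipelined control adds to each register.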
7 Pipelined Datapath (with Pipeline Regs) (6.2)
[Figure: pipelined datapath with stages Fetch, Decode, Execute, Memory, Write Back, separated by pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; labeled components include PC, Imem, Regs, ALU, Dmem, adders, Shift left 2, Sign extend (16 to 32), and muxes; pipeline register width labels: 64 bits, 133 bits, 102 bits, 69 bits]
8 Pipelined Control (6.3)
- Start with single-cycle controller
- Group control lines by pipeline stage needed
- Extend pipeline registers with control bits
[Figure: the Control unit decodes the Instruction into EX, Mem, and WB groups of control bits that travel through the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; Mem group: Branch, MemRead, MemWrite; WB group: MemToReg, RegWrite]
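As a rough sketch of "extend pipeline registers with control bits", the control word produced in Decode can be modeled as three groups, with each stage consuming its own group and passing the rest along. The Mem and WB signal names come from the figure. The EX signal names (RegDst, ALUOp, ALUSrc) and all the values in the table follow the usual textbook single-cycle controller and are assumptions here; None marks a don't-care.

```python
# Control bits grouped by the stage that uses them (EX group names are assumed).
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp": "10", "ALUSrc": 0},
                 "Mem": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"MemToReg": 0, "RegWrite": 1}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp": "00", "ALUSrc": 1},
                 "Mem": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"MemToReg": 1, "RegWrite": 1}},
    "sw":       {"EX": {"RegDst": None, "ALUOp": "00", "ALUSrc": 1},
                 "Mem": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"MemToReg": None, "RegWrite": 0}},
    "beq":      {"EX": {"RegDst": None, "ALUOp": "01", "ALUSrc": 0},
                 "Mem": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"MemToReg": None, "RegWrite": 0}},
}

def decode(op):
    """Decode stage: all three control groups are written into ID/EX."""
    return dict(CONTROL[op])

def pass_stage(bits, consumed_group):
    """Drop the group used by the stage just finished; the rest moves to the next register."""
    return {group: value for group, value in bits.items() if group != consumed_group}

# Trace the control bits of a lw through the pipeline registers.
id_ex = decode("lw")                # EX, Mem, WB
ex_mem = pass_stage(id_ex, "EX")    # Mem, WB
mem_wb = pass_stage(ex_mem, "Mem")  # WB only

if __name__ == "__main__":
    for name, bits in (("ID/EX", id_ex), ("EX/MEM", ex_mem), ("MEM/WB", mem_wb)):
        print(name, sorted(bits))
```

The point of the grouping is that each pipeline register only needs to be widened by the control bits that are still unused downstream.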
9 Problems for Pipelining
- Hazards prevent the next instruction from executing during its designated clock cycle, limiting speedup
- Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
- Control hazards: conditional branches and other instructions may stall the pipeline, delaying later instructions (must check detergent level before washing next load)
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline (matching socks in later load)
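A data hazard of the read-after-write kind can be spotted by comparing each instruction's source registers with the destinations of the instructions just ahead of it. The sketch below assumes a 5-stage pipeline without forwarding in which the register file is written in the first half of a cycle and read in the second half, so only the two preceding instructions matter. The `Instr` type and the four-instruction program are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    text: str
    dest: Optional[str]      # register written, or None (e.g. for a store)
    srcs: Tuple[str, ...]    # registers read

def data_hazards(program, window=2):
    """Return (producer index, consumer index, register) for each read-after-write
    dependence whose producer is at most `window` instructions ahead of the consumer."""
    hazards = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            producer = program[i]
            if producer.dest is not None and producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
    return hazards

# Hypothetical program used only to exercise the check.
PROGRAM = [
    Instr("add r1, r2, r3", "r1", ("r2", "r3")),
    Instr("sub r4, r1, r5", "r4", ("r1", "r5")),
    Instr("or  r6, r7, r8", "r6", ("r7", "r8")),
    Instr("and r9, r1, r6", "r9", ("r1", "r6")),
]

if __name__ == "__main__":
    for i, j, reg in data_hazards(PROGRAM):
        print(f"{PROGRAM[j].text!r} needs {reg} from {PROGRAM[i].text!r}")
```

In this program the `and` also reads r1, but its producer is three instructions ahead, so under the stated assumptions the value is already in the register file.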
10 MIPS R4000 pipeline
11 Advanced Architectural Concepts
- Can we achieve CPI < 1? (i.e., can we have IPC > 1?) State-of-the-Art Microprocessor
- Superscalar execution or Instruction Level Parallelism (ILP)
- Deeper Pipeline => Dynamic Branch Prediction => Speculation => Recovery
- Out-of-order Execution => Instruction Window and Prefetch => Reorder Buffers
- VLIW, Ex: Intel/HP Itanium
12 Instruction Level Parallelism (ILP): IPC > 1
[Figure: instructions in program order (vertical axis: Program Flow) passing through IFtch, Dcd, Exec, Mem, WB over Time, with more than one instruction occupying the same stage in a cycle; ILP = 2]
Ex: Pentium, SPARC, MIPS 10000, IBM Power PC
13 HW Schemes: Instruction Parallelism
- Key idea: allow instructions behind a stall to proceed
- DIVD F0,F2,F4
- ADDD F10,F0,F8
- SUBD F12,F8,F14
- Enables out-of-order execution => out-of-order completion
- ID stage checks for hazards. If no hazards, issue the instruction for execution. Scoreboard dates to CDC 6600 in 1963
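The three-instruction example can be traced with a toy timing model. This is a sketch, not the CDC 6600 scoreboard: it assumes one instruction issues per cycle in program order, an instruction starts executing once its source registers are ready, every instruction has its own functional unit, and WAR/WAW hazards are ignored. The latencies are made-up values for illustration.

```python
# Assumed execution latencies in cycles (not from the slides).
LATENCY = {"DIVD": 10, "ADDD": 2, "SUBD": 2}

# (opcode, destination register, source registers), as on the slide.
PROGRAM = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F12", ("F8", "F14")),
]

def schedule(program, latency):
    """Return (opcode, issue cycle, start cycle, finish cycle) for each instruction."""
    ready = {}      # register -> cycle at which its pending result becomes available
    timeline = []
    for issue_cycle, (op, dest, srcs) in enumerate(program, start=1):
        # Wait for every source register that an earlier instruction is still producing.
        start = max([issue_cycle] + [ready.get(s, 0) for s in srcs])
        finish = start + latency[op]
        ready[dest] = finish
        timeline.append((op, issue_cycle, start, finish))
    return timeline

TIMELINE = schedule(PROGRAM, LATENCY)
COMPLETION_ORDER = [op for op, _, _, _ in sorted(TIMELINE, key=lambda t: t[3])]

if __name__ == "__main__":
    for op, issue, start, finish in TIMELINE:
        print(f"{op}: issue {issue}, start {start}, finish {finish}")
    print("completion order:", COMPLETION_ORDER)
```

ADDD waits for F0 from DIVD, while SUBD has no dependence on either and finishes first, which is the out-of-order completion the slide refers to.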
14 How ILP Works
- Issuing multiple instructions per cycle would require fetching multiple instructions from memory per cycle; the number issued per cycle is called the Superscalar degree or Issue width
- To find independent instructions, we must have a big pool of instructions to choose from, called the instruction buffer (IB). As IB length increases, the complexity of the decoder (control) increases, which increases the datapath cycle time
- Prefetch instructions sequentially by an IFU that operates independently from datapath control. Fetch the instruction at (PC) + L, where L is the IB size, or as directed by the branch predictor.
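Choosing independent instructions from the instruction buffer can be sketched as a scan that skips any instruction conflicting with an earlier one still in the buffer. The selection rule is a deliberately conservative assumption (it checks RAW, WAW, and WAR against every earlier buffer entry, including ones picked in the same cycle); real issue logic is more involved, and the function names are illustrative.

```python
def conflicts(later, earlier):
    """True if `later` has a RAW, WAW, or WAR dependence on `earlier`."""
    _, l_dest, l_srcs = later
    _, e_dest, e_srcs = earlier
    raw = e_dest in l_srcs    # later reads what earlier writes
    waw = e_dest == l_dest    # both write the same register
    war = l_dest in e_srcs    # later overwrites what earlier still has to read
    return raw or waw or war

def select_issue_group(buffer, issue_width):
    """Pick up to `issue_width` instructions, oldest first, that do not depend on
    any earlier instruction in the buffer."""
    group = []
    for index, instr in enumerate(buffer):
        if len(group) == issue_width:
            break
        if not any(conflicts(instr, earlier) for earlier in buffer[:index]):
            group.append(instr)
    return group

# Instruction buffer holding (opcode, destination, sources).
IB = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F12", ("F8", "F14")),
]

if __name__ == "__main__":
    print([op for op, _, _ in select_issue_group(IB, issue_width=2)])
```

The scan compares each entry against all earlier entries, so the checking work grows roughly with the square of the buffer length, which is one way to see why a longer IB makes the control logic more complex.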
15 Microarchitecture of an ILP-based CPU (Power PC)
17 Very Long Instruction Word (VLIW): IPC > 1
[Figure: instruction words in program order (vertical axis: Program Flow) passing through IFtch and Dcd, then several Exec units side by side, then Mem and WB, over Time]
Ex: Itanium
18 TriMedia TM32 Architecture
[Figure: block diagram with the following labels]
- 32-bit peripheral bus
- 64-bit memory bus
- multi-port register file, 128 words x 32 bits
- bypass network
- data cache, 16 KB
- five functional units (FU)
- PC
- instruction cache, 32 KB
- Compressed code in the Instruction Cache
19 What is Multiprocessing
- Parallelism at the instruction level is limited because of data dependency => speedup is limited!!
- Program-level parallelism is abundant, like Do I = 1000 (Loop Level Parallelism). How about employing multiple processors to execute the loops => Parallel processing or Multiprocessing
- With a billion transistors on a chip, we can put a few CPUs in one chip => Chip multiprocessor
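Loop-level parallelism amounts to splitting independent iterations among processors and combining the partial results. The sketch below assumes the 1000 iterations are independent and uses a made-up loop body. It runs the chunks on a thread pool only to show the partitioning; in CPython, threads do not speed up CPU-bound work like this, so the example illustrates the decomposition rather than the speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def split_iterations(n, workers):
    """Split iterations 0..n-1 into `workers` contiguous, nearly equal ranges."""
    base, extra = divmod(n, workers)
    ranges, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

def loop_body(i):
    """Hypothetical loop body; each iteration is independent of the others."""
    return i * i

def partial_sum(chunk):
    """Work done by one processor: run the loop body over its share of iterations."""
    return sum(loop_body(i) for i in chunk)

def parallel_loop(n=1000, workers=4):
    """Run the chunks concurrently and combine the per-worker results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, split_iterations(n, workers)))

if __name__ == "__main__":
    print([len(r) for r in split_iterations(1000, 4)])
    print(parallel_loop(1000, 4))
```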
20 Hardware Multithreading
- We need a hardware multithreading technique because switching between threads in software (Ex: multitasking) is very time-consuming (Why?), so it is not suitable for hiding main memory access (as opposed to I/O access)
- Provide multiple PCs and register sets on the CPU so that thread switching can occur without having to store the register contents in main memory (on the stack, as is done for context switching).
- Several threads reside in the CPU simultaneously, and execution switches between the threads on main memory access.
- How about both multiprocessors and multithreading on a chip? => Network Processor
21 Hardware Multithreading
- How can we guarantee no dependencies between instructions in a pipeline?
- One way is to interleave execution of instructions from different program threads on the same pipeline
- Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe
- T1: LW r1, 0(r2)
- T2: ADD r7, r1, r4
- T3: XORI r5, r4, 12
- T4: SW 0(r7), r5
- T1: LW r5, 12(r1)
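The interleaving above can be reproduced with a round-robin fetch schedule. The sketch assumes each thread has its own register set, that an instruction fetched in cycle c reads registers in cycle c + 1 and writes back in cycle c + 4, and that a read must happen in a cycle strictly after the write-back (no same-cycle write-then-read). The function names are illustrative.

```python
DECODE_OFFSET = 1      # register read happens one cycle after fetch
WRITEBACK_OFFSET = 4   # write-back happens four cycles after fetch

def interleave(threads, slots):
    """Round-robin fetch: cycle c (0-based) fetches the next instruction of thread c mod N."""
    names = list(threads)
    next_index = {name: 0 for name in names}
    schedule = []
    for cycle in range(slots):
        name = names[cycle % len(names)]
        program = threads[name]
        if next_index[name] < len(program):
            schedule.append((cycle, name, program[next_index[name]]))
            next_index[name] += 1
    return schedule

def reads_after_writeback(num_threads):
    """True if a thread's next instruction reads registers strictly after the
    thread's previous instruction has written back."""
    return num_threads + DECODE_OFFSET > WRITEBACK_OFFSET

THREADS = {
    "T1": ["LW r1, 0(r2)", "LW r5, 12(r1)"],
    "T2": ["ADD r7, r1, r4"],
    "T3": ["XORI r5, r4, 12"],
    "T4": ["SW 0(r7), r5"],
}

if __name__ == "__main__":
    for cycle, name, text in interleave(THREADS, slots=5):
        print(f"cycle {cycle}: {name} {text}")
    print("4 threads enough:", reads_after_writeback(4))
    print("3 threads enough:", reads_after_writeback(3))
```

Under these assumptions four threads are the minimum: T1's second LW is fetched in cycle 4 and reads r1 in cycle 5, one cycle after the first LW wrote it back.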
22 Architectural Comparisons (cont.)
[Figure: issue slots over Time (processor cycle) compared for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot]
23 Intel IXP2400 Network Processor
- XScale core replaces StrongARM
- 1.4 GHz target in 0.13-micron
- Nearest neighbor routes added between microengines
- Hardware to accelerate CRC operations and random number generation
- 16-entry CAM
24 IBM Cell Processor
SPU: Synergistic Processor Unit
25 Chip Multiprocessors