CS 161 Computer Architecture: Introduction to Advanced Architectures, Lecture 13

Slides: 26

Provided by: davep173

Learn more at: http://www.cs.ucr.edu

Transcript and Presenter's Notes

1
CS 161 Computer Architecture: Introduction to
Advanced Architectures, Lecture 13

Instructor: L.N. Bhuyan
www.cs.ucr.edu/bhuyan
Adapted from notes by Dave Patterson
(http.cs.berkeley.edu/patterson)

2
Stages of Execution in Pipelined MIPS

5-stage instruction pipeline
1) I-fetch: Fetch Instruction, Increment PC
2) Decode: Decode Instruction, Read Registers
3) Execute: Mem-reference: Calculate Address;
   R-format: Perform ALU Operation
4) Memory: Load: Read Data from Data Memory;
   Store: Write Data to Data Memory
5) Write Back: Write Data to Register

3
Pipelined Execution Representation
[Figure: four instructions, each passing through IFtch, Dcd, Exec, Mem, WB. Horizontal axis: Time. Vertical axis: Program Flow.]

To simplify pipeline, every instruction takes
same number of steps, called stages
One clock cycle per stage
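
The overlap described on this slide can be sketched in a few lines of Python. This is only an illustration under the slide's own assumptions (every instruction takes the same five stages, one clock cycle per stage, no stalls); the function names `total_cycles`, `pipeline_schedule`, and `render` are made up for this sketch.

```python
STAGES = ["IFtch", "Dcd", "Exec", "Mem", "WB"]


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles to finish all instructions in an ideal (stall-free) pipeline."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles; each later one
    # finishes one cycle after its predecessor.
    return num_stages + num_instructions - 1


def pipeline_schedule(num_instructions):
    """Map each instruction index to {cycle: stage}, cycles counted from 0."""
    return {
        i: {i + s: stage for s, stage in enumerate(STAGES)}
        for i in range(num_instructions)
    }


def render(num_instructions):
    """Draw one row per instruction and one column per clock cycle."""
    schedule = pipeline_schedule(num_instructions)
    cycles = total_cycles(num_instructions)
    rows = []
    for i in range(num_instructions):
        cells = [schedule[i].get(c, "").ljust(5) for c in range(cycles)]
        rows.append(" ".join(cells).rstrip())
    return "\n".join(rows)


print(render(4))
print("cycles:", total_cycles(4))
```

With four instructions the sketch reports 8 cycles, compared with 4 x 5 = 20 cycles if each instruction had to finish all five stages before the next one started.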

4
Review Single-cycle Datapath for MIPS
[Figure: single-cycle MIPS datapath. Labels: Stage 5, Instruction Memory (Imem), Data Memory (Dmem).]

Use datapath figure to represent pipeline

5
Graphical Pipeline Representation
[Figure: instructions Load, Add, Store, Sub, Or, each drawn as a row of datapath blocks (IM, Reg, DM, Reg). Horizontal axis: Time (clock cycles). Vertical axis: Instr. Order.]
(right half highlighted means read, left half
write)
6
Required Changes to Datapath

Introduce registers to separate the 5 stages by
putting IF/ID, ID/EX, EX/MEM, and MEM/WB
registers in the datapath.
The next PC value is computed in the 3rd step, but
we need to bring in the next instruction in the
next cycle. Move the PCSrc mux to the 1st stage.
The branch address is computed in the 3rd stage.
With the pipeline, the PC value has changed by
then! Must carry the PC value along with the
instruction. Width of IF/ID register =
(IR) + (PC) = 64 bits.
For the lw instruction, we need the write register
address at stage 5. But the IR is now occupied by
another instruction! So we must carry the IR
destination field as we move along the stages.
See the connection in the figure. Width of ID/EX
register = (Reg1) + (Reg2) + (offset) + (PC) +
destination register = 133 bits.
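
The width arithmetic above can be checked with a short Python sketch. The IF/ID and ID/EX field lists come from this slide. The field breakdowns for EX/MEM and MEM/WB are an assumption based on the usual textbook MIPS datapath; only their totals (102 and 69 bits) appear in the deck, as labels on the pipelined datapath figure. The names `PIPELINE_REGISTERS` and `width` are invented for the sketch.

```python
WORD = 32      # instruction, PC, register value, sign-extended offset
REG_ADDR = 5   # register number field (MIPS has 32 registers)

PIPELINE_REGISTERS = {
    # From the slide text
    "IF/ID": {"IR": WORD, "PC": WORD},
    "ID/EX": {"Reg1": WORD, "Reg2": WORD, "offset": WORD,
              "PC": WORD, "dest": REG_ADDR},
    # Assumed field breakdown (not spelled out in the slides)
    "EX/MEM": {"branch_target": WORD, "zero": 1, "alu_result": WORD,
               "Reg2": WORD, "dest": REG_ADDR},
    "MEM/WB": {"mem_data": WORD, "alu_result": WORD, "dest": REG_ADDR},
}


def width(name):
    """Total number of data bits held in the named pipeline register."""
    return sum(PIPELINE_REGISTERS[name].values())


for name in PIPELINE_REGISTERS:
    print(name, width(name), "bits")
```

These widths count data fields only; control bits carried along the pipeline are extra.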

7
Pipelined Datapath (with Pipeline Regs)(6.2)
[Figure: pipelined datapath divided into Fetch, Decode, Execute, Memory, and Write Back, with Imem, Regs, ALU, Dmem, sign extend, shift left 2, adders, and muxes separated by the pipeline registers. Register widths shown: IF/ID 64 bits, ID/EX 133 bits, EX/MEM 102 bits, MEM/WB 69 bits.]
8
Pipelined Control (6.3)

Start with single-cycle controller
Group control lines by pipeline stage needed
Extend pipeline registers with control bits

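[Figure: the Control unit decodes the instruction and produces three groups of control bits (EX, Mem, WB) that travel through the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Mem group: Branch, MemRead, MemWrite. WB group: MemToReg, RegWrite.]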
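
A minimal Python sketch of the grouping idea: control bits are generated once and each pipeline register carries only the groups still needed downstream. The Mem and WB signal names are the ones on this slide; the EX signal names (`RegDst`, `ALUOp1`, `ALUOp0`, `ALUSrc`) are an assumption taken from the usual textbook MIPS controller, and `CONTROL_GROUPS`, `CARRIED`, and `control_bits` are hypothetical names.

```python
CONTROL_GROUPS = {
    "EX": ["RegDst", "ALUOp1", "ALUOp0", "ALUSrc"],   # assumed, not on the slide
    "Mem": ["Branch", "MemRead", "MemWrite"],
    "WB": ["MemToReg", "RegWrite"],
}

# Groups that each pipeline register must still carry forward.
CARRIED = {
    "ID/EX": ["EX", "Mem", "WB"],
    "EX/MEM": ["Mem", "WB"],
    "MEM/WB": ["WB"],
}


def control_bits(register):
    """Number of control bits added to the given pipeline register."""
    return sum(len(CONTROL_GROUPS[group]) for group in CARRIED[register])


for register in CARRIED:
    print(register, control_bits(register))
```

Under these assumptions the extension shrinks from 9 bits in ID/EX to 5 in EX/MEM and 2 in MEM/WB, because each stage consumes its own group.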
9
Problems for Pipelining

Hazards prevent the next instruction from
executing during its designated clock cycle,
limiting speedup
Structural hazards: HW cannot support this
combination of instructions (single person to
fold and put clothes away)
Control hazards: conditional branches and other
instructions may stall the pipeline, delaying
later instructions (must check detergent level
before washing next load)
Data hazards: instruction depends on the result
of a prior instruction still in the pipeline
(matching socks in later load)

10
MIPS R4000 pipeline
11
Advanced Architectural Concepts

Can we achieve CPI < 1? (i.e., can we have IPC >
1?) State-of-the-Art Microprocessor:
Superscalar execution or Instruction Level
Parallelism (ILP)
Deeper Pipeline => Dynamic Branch Prediction =>
Speculation => Recovery
Out-of-order Execution => Instruction Window
and Prefetch => Reorder Buffers
VLIW, Ex: Intel/HP Itanium

12
Instruction Level Parallelism (ILP): IPC > 1
[Figure: pipeline diagram with two instructions entering per cycle, each passing through IFtch, Dcd, Exec, Mem, WB. Horizontal axis: Time. Vertical axis: Program Flow. ILP = 2.]
Ex: Pentium, SPARC, MIPS 10000, IBM Power PC
13
HW Schemes: Instruction Parallelism

Key idea: Allow instructions behind a stall to
proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
Enables out-of-order execution => out-of-order
completion
The ID stage checks for hazards. If there are no
hazards, issue the instruction for execution.
Scoreboard dates to CDC 6600 in 1963
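
The three-instruction example can be walked through with a small Python sketch of the hazard check. It is a simplification, not the CDC 6600 scoreboard itself: it assumes DIVD is still executing when the later instructions are examined and it only compares register names. `Instr` and `hazards` are hypothetical names.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


PROGRAM = [
    Instr("DIVD", "F0", ("F2", "F4")),
    Instr("ADDD", "F10", ("F0", "F8")),
    Instr("SUBD", "F12", ("F8", "F14")),
]


def hazards(instr, older_in_flight):
    """Return the hazard kinds between instr and older unfinished instructions."""
    found = set()
    for older in older_in_flight:
        if older.dest in instr.srcs:
            found.add("RAW")   # instr reads a result not yet written
        if older.dest == instr.dest:
            found.add("WAW")   # both write the same register
        if instr.dest in older.srcs:
            found.add("WAR")   # instr overwrites a value older still reads
    return found


# Assumption: DIVD is long-running and has not finished yet.
print("ADDD:", hazards(PROGRAM[1], PROGRAM[:1]))
print("SUBD:", hazards(PROGRAM[2], PROGRAM[:2]))
```

ADDD has a RAW hazard on F0 and must wait for DIVD, while SUBD shares no registers with the unfinished instructions, so it can proceed and complete out of order.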

14
How ILP Works

Issuing multiple instructions per cycle requires
fetching multiple instructions from memory per
cycle => called superscalar degree or issue width
To find independent instructions, we must have a
big pool of instructions to choose from, called
the instruction buffer (IB). As IB length
increases, the complexity of the decoder (control)
increases, which increases the datapath cycle time
Prefetch instructions sequentially by an IFU that
operates independently from datapath control.
Fetch instruction (PC)+L, where L is the IB size,
or as directed by the branch predictor.

15
Microarchitecture of an ILP-based CPU (Power PC)
16
(No Transcript)
17
Very Large Instruction Word (VLIW): IPC > 1
[Figure: pipeline diagram with stages IFtch, Dcd, Exec, Mem, WB and multiple Exec slots per instruction word. Horizontal axis: Time. Vertical axis: Program Flow.]
Ex: Itanium
18
TriMedia TM32 Architecture
32-bit peripheral bus
64-bit memory bus
Multi-port register file, 128 words x 32 bits
Bypass network
Data cache, 16 KB
5 functional units (FU)
PC
Instruction cache, 32 KB
Compressed code in the instruction cache
19
What is Multiprocessing

Parallelism at the instruction level is limited
because of data dependency => speedup is
limited!!
Program-level parallelism is abundant, e.g.,
Do I 1000 (loop-level parallelism). How about
employing multiple processors to execute the
loops? => Parallel processing or multiprocessing
With a billion transistors on a chip, we can put
a few CPUs on one chip => Chip multiprocessor
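
Loop-level parallelism can be sketched in Python by splitting the iterations of a 1000-iteration loop across workers. This is a structural illustration only: `loop_body` is a made-up independent iteration, the worker count of 4 is arbitrary, and threads in standard CPython do not run pure-Python code truly in parallel, so no real speedup should be expected from this version.

```python
from concurrent.futures import ThreadPoolExecutor


def split_iterations(n, workers):
    """Split range(n) into `workers` contiguous chunks of near-equal size."""
    base, extra = divmod(n, workers)
    chunks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def loop_body(i):
    """Hypothetical loop iteration with no dependence on other iterations."""
    return i * i


def run_chunk(chunk):
    """Work done by one processor: its share of the iterations."""
    return sum(loop_body(i) for i in chunk)


def parallel_loop(n=1000, workers=4):
    """Run the chunks on separate workers and combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_chunk, split_iterations(n, workers)))


print(parallel_loop())
```

The split works only because the iterations are independent; a loop whose iterations depend on earlier ones would hit the same data-dependency limit as ILP.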

20
Hardware Multithreading

We need a hardware multithreading technique
because switching between threads in software is
very time-consuming (Why?), so it is not suitable
for switching on main memory access (as opposed
to I/O access, Ex: multitasking)
Provide multiple PCs and register sets on the CPU
so that thread switching can occur without having
to store the register contents in main memory
(on the stack, as is done for context switching).
Several threads reside in the CPU simultaneously,
and execution switches between the threads on
main memory access.
How about both multiprocessors and multithreading
on a chip? => Network Processor

21
Hardware Multithreading

How can we guarantee no dependencies between
instructions in a pipeline?
One way is to interleave execution of
instructions from different program threads on
the same pipeline
Interleave 4 threads, T1-T4, on a non-bypassed
5-stage pipe
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, 12
T4: SW r5, 0(r7)
T1: LW r5, 12(r1)
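
Why four threads suffice on a non-bypassed 5-stage pipe can be checked with a short Python sketch. It assumes strict round-robin issue of one instruction per cycle, register reads in the Decode stage (one cycle after fetch), and register writes in Write Back (four cycles after fetch), and it does not rely on writing and reading the register file in the same cycle. The names `issue_cycles` and `free_of_intra_thread_hazards` are made up for the sketch.

```python
WB_OFFSET = 4     # Write Back is the 5th stage: fetch cycle + 4
READ_OFFSET = 1   # registers are read in Decode: fetch cycle + 1


def issue_cycles(num_threads, instrs_per_thread):
    """Fetch cycle of each instruction under round-robin interleaving."""
    return {
        t: [k * num_threads + t for k in range(instrs_per_thread)]
        for t in range(num_threads)
    }


def free_of_intra_thread_hazards(num_threads):
    """True if each instruction reads registers strictly after the previous
    instruction of the same thread has written back."""
    gap = num_threads   # cycles between two fetches of the same thread
    return gap + READ_OFFSET > WB_OFFSET


print(issue_cycles(4, 2))
for n in (1, 2, 3, 4):
    print(n, "threads:", free_of_intra_thread_hazards(n))
```

In the slide's example the first T1 load writes r1 in cycle 4 (counting from 0), and the second T1 load, fetched in cycle 4, reads r1 in cycle 5, so no interlock or bypass is needed.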

22
Architectural Comparisons (cont.)
[Figure: issue-slot usage over time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]
23
Intel IXP2400 Network Processor