Title: Superscalar Microprocessor Architecture
1 Superscalar Microprocessor Architecture
- EE382M
- Lizy Kurian John
- Chapter 1
- Some slides adapted from Prof. Hoe, CMU and
Instructor Garcia, Berkeley
2 Superscalar Microprocessor Architecture
- Super + scalar
- Scalar: one operation at a time
- Superscalar: more than one operation at a time
- Vector: multiple concurrent operations on vectors/arrays
- Microprocessor: a processor implemented in one or a small number of semiconductor chips
- Architecture (dictionary definition): the style, design, and structure of anything; the organization, layout, or distribution of resources
3 Microprocessors
- PCs, workstations, servers, hand-held, mobile, automobile, supercomputers
- 100 microprocessors per person
- 1 B microprocessors shipped per year
- Microcontrollers, embedded microprocessors
- Instruction Set Processors
- ISA (Instruction Set Architecture)
- Microarchitecture
- Implementation
4 Evolution of Single-Chip Microprocessors
5 ISA vs Microarchitecture
- Specification vs implementation
- Specification: what does it do?
- Implementation: how does it do it?
- Synthesis and Analysis
- Synthesis: find an implementation based on the spec
- Analysis: examine an implementation to see how well it meets the spec (correctness and effectiveness)
- Effectiveness: performance bottleneck analysis
6 Specification vs Implementation
- Specification
- Synthesis
- Implementation
- Analysis
- ISA: Instruction Set Architecture
- HDL: Hardware Description Language
- RTL: Register Transfer Language
7 Anatomy of Engineering Design
- Specification: behavioral description of "What does it do?"
- Synthesis: search for possible solutions; pick the best one.
- Implementation: structural description of "How is it constructed?"
- Analysis: figure out if the design meets the specification. Does it do the right thing? How well does it perform?
8 Dynamic-Static Interface
The DSI (e.g., the ISA) is a contract between the program and the machine.
9 Where to place the DSI
[Diagram: an HLL program at the top maps down to hardware at the bottom through candidate interface placements DSI-1, DSI-2, and DSI-3; machine styles shown: DEL, CISC, VLIW, RISC]
10 Where to place the HSI
Moderately complex ISA
[Diagram: an HLL program at the top maps down to hardware through three assembly-level interfaces (Assembly-1/2/3) corresponding to DEL, CISC, and RISC machines; examples shown: Fortran machine and LISP machine (DEL), x86 (CISC), PowerPC and MIPS (RISC)]
11 Hardware Design
Implementation issues / levels of abstraction:
- Architect: specification level
- Microarchitect: design at behavioral level (e.g., HDL)
- Design at structural level (circuit design)
- Layout
12 Software Design
Implementation issues / levels of abstraction:
- Architect: specification level
- Designer: block level
- Programmer: code
13 Course Scope
- INSTRUCTION SET ARCHITECTURE (ISA): programmer/compiler view. Functional appearance to its immediate user/system programmer.
- IMPLEMENTATION (microarchitecture): processor designer view. Logical structure or organization that performs the architecture.
- REALIZATION (chip): chip/system designer view. Physical structure that embodies the implementation.
14 Why are we going for Superscalar?
Performance. What is performance? MIPS? MFLOPS? MHz?
Exec Time = N × CPI_avg × clock period
Exec time of what? Benchmark programs.
15 Iron Law of Processor Performance
Processor performance: Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
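As a quick illustration of the iron law, the sketch below compares two hypothetical design points (the numbers are assumptions for illustration, not from the text):

```python
# Iron Law: exec_time = instructions * CPI * clock_period
def exec_time(n_instructions, cpi, clock_period_ns):
    """Execution time in nanoseconds for a program of
    n_instructions at the given average CPI and cycle time."""
    return n_instructions * cpi * clock_period_ns

# Same program (1e9 instructions), two hypothetical designs:
base = exec_time(1e9, 2.0, 1.0)    # CPI 2.0 at a 1 ns (1 GHz) clock
faster = exec_time(1e9, 1.2, 1.25)  # lower CPI but a slower clock
print(base / faster)  # speedup ≈ 1.33: CPI gain outweighs the clock loss
```

The point of the exercise: all three terms matter, so a microarchitectural change that lowers CPI can win even if it lengthens the cycle time.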
16 Improving Performance
- Reduce N
- Reduce CPI
- Reduce clock period / increase clock frequency
17 Improving Performance
RISC, CISC, pipelining, caches, multiple issue, out-of-order execution, speculation/prediction, MMX (SIMD extensions), custom design, dynamic logic, copper? SOI?
18 CISC
- Reduces the number of instructions
- Encodes instructions densely
- Less gap between the HLL and the underlying machine
19 RISC
- Reduce the cycles taken to execute an instruction (okay to slightly increase the total number of instructions)
- Less chip area on control
- Move the HSI towards hardware
20 RISC
- Load/store architecture
- Uniform and orthogonal ISA
- Simple explicit operations and operands
- Fewer addressing modes
- Large general-purpose register set
21 CISC and RISC examples
- CISC: Intel Pentium 4; AMD K5, K6, K7, Opteron
- RISC: MIPS R2000-R14000; Sun SPARC, UltraSPARC, Sunfire; HP PA-RISC; IBM PowerPC; ARM processors
22 Instruction Set Architecture
- ISA: the boundary between software and hardware
- Specifies the logical machine that is visible to the programmer
- Also a functional spec for the processor designers
- What needs to be specified by an ISA:
- Operations: add, sub, mult
- Temporary operand storage in the CPU: accumulator, stacks, registers
- Number of operands per instruction
- Operand location: where and how to specify the operands
- Type and size of operands
- Instruction-to-binary encoding
23 Microarchitecture
- All the structures necessary to implement the ISA
- All the structures necessary to give good performance:
- Pipelining
- Caches
- Prefetching
- Superscalar and out-of-order mechanisms
- Branch predictors
24 Steps in Executing Instructions
- 1) IFetch: fetch instruction, increment PC
- 2) Decode: decode instruction, read registers
- 3) Execute: Mem-ref: calculate address; Arith-log: perform operation
- 4) Memory: Load: read data from memory; Store: write data to memory
- 5) Write Back: write data to register
25 Review: Datapath for MIPS
- Use the datapath figure to represent the pipeline
26 Pipelined Execution Representation
- Every instruction must take the same number of steps, also called pipeline stages, so some will go idle sometimes
27 Pipelining Basics
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Time to fill the pipeline and time to drain it reduce the speedup, e.g., 2.3X vs. 4X
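The fill/drain effect can be sketched numerically, assuming one task completes per cycle once the pipe is full:

```python
def pipeline_speedup(stages, tasks):
    """Speedup of a 'stages'-deep pipeline over unpipelined execution.
    Counting fill/drain, the tasks finish in (stages + tasks - 1)
    cycles instead of stages * tasks cycles."""
    return (stages * tasks) / (stages + tasks - 1)

print(round(pipeline_speedup(4, 4), 1))     # 2.3 -- far below the ideal 4X
print(round(pipeline_speedup(4, 1000), 1))  # 4.0 -- once the pipe stays full
```

With only a handful of tasks in flight, the fill and drain cycles dominate; the 2.3X-vs-4X figure on the slide corresponds to a 4-stage pipe processing just 4 tasks.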
28 General Definitions
- Latency: time to completely execute a certain task
- Throughput: amount of work that can be done over a period of time
29 Pipelined Design
- Motivation: increase throughput with little increase in hardware
- Latency required for each task remains the same, or may even increase slightly
30 Evaluating Pipelining Using Laws of Parallel Processing
Amdahl's Law, efficiency, vectorizability
31 Amdahl's Law
- In parallel processing, the serial part limits total performance
- T = original execution time
- S = fraction of time in serial code, e.g., 0.2
- P = fraction of time in parallel code = 1 - S
- N = number of parallel units
- Speedup = 1 / (S + P/N)
- Max speedup = 1/S
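The formula can be checked with a short sketch, using the slide's example value S = 0.2:

```python
def amdahl_speedup(serial_frac, n):
    """Amdahl's Law: speedup = 1 / (S + P/N), with P = 1 - S."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

print(amdahl_speedup(0.2, 4))      # 2.5 with four parallel units
print(amdahl_speedup(0.2, 10**6))  # approaches the max speedup 1/S = 5
```

Even with a million parallel units, a 20% serial fraction caps the speedup just below 5: the serial part is the bottleneck.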
32 Fig 1.5: Amdahl's Law
33 Amdahl's Law
Assume I-mix: loads 25%, branches 20%, taken branches 66.6% of branches; the hardware uses a predict-not-taken (NT) policy; the branch penalty is 4 cycles and the load penalty is 1 cycle. Speedup of a 6-stage pipeline under these circumstances:
Eq 1.6: S = 1 / (g1/1 + g2/2 + g3/3 + ... + gN/N)
S = 1 / (0.133/2 + 0.25/5 + 0.617/6) ≈ 4.5
Ideal S = 6: the difference between peak and actual pipelining improvement.
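The slide's arithmetic can be reproduced directly; here each g_i is the fraction of instructions seeing effective pipeline depth i (a sketch based on the numbers above):

```python
# Fractions of instructions and the effective pipeline depth each sees:
# taken branches (0.20 * 0.666 ≈ 0.133) with a 4-cycle penalty -> depth 2;
# loads (0.25) with a 1-cycle penalty -> depth 5;
# everything else (0.617) -> the full depth 6.
groups = [(0.133, 2), (0.25, 5), (0.617, 6)]

speedup = 1.0 / sum(g / depth for g, depth in groups)
print(round(speedup, 2))  # ≈ 4.56, versus the ideal speedup of 6
```

Stalls shave roughly a quarter off the ideal 6X, which is the gap between peak and actual pipelining improvement the slide refers to.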
34 Easing Sequential Bottleneck
Eq 1.8: Speedup S = 1 / ((1 - f) + f/N)
Eq 1.5: Speedup S = 1 / ((1 - f) + f/6)
Eq 1.10: S = 1 / ((1 - f)/2 + f/6)
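To see what halving the sequential term buys (Eq 1.10 vs. Eq 1.5), a quick sketch assuming a vectorizable fraction f = 0.8:

```python
def speedup(seq_weight, f, n):
    """Generalized Amdahl speedup: 1 / (seq_weight*(1-f) + f/n).
    seq_weight = 1.0 reproduces Eq 1.5; seq_weight = 0.5 models the
    sequential part itself being sped up by 2 (Eq 1.10)."""
    return 1.0 / (seq_weight * (1.0 - f) + f / n)

f = 0.8  # assumed parallelizable (vectorizable) fraction
print(round(speedup(1.0, f, 6), 2))  # Eq 1.5: sequential part unchanged -> 3.0
print(round(speedup(0.5, f, 6), 2))  # Eq 1.10: sequential part halved -> 4.29
```

Speeding up the sequential part by only 2x lifts the overall speedup from 3.0 to about 4.3, illustrating why easing the sequential bottleneck pays off so well.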
35 Easing Sequential Bottleneck Using ILP
36 Instruction-Level Parallelism
- ILP: the aggregate degree of parallelism that can be achieved by the concurrent execution of multiple instructions
- Measured in number of instructions (per cycle)
37 ILP: Instruction-Level Parallelism
- ILP is a measure of the amount of inter-instruction dependences
- Average ILP = no. of instructions / no. of cycles required
- code1: ILP = 1, i.e., must execute serially
- code2: ILP = 3, i.e., can execute at the same time

code2: r1 ← r2 + 1; r3 ← r9 / 17; r4 ← r0 - r10
code1: r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3
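Average ILP for straight-line code can be computed with a simple flow-dependence pass; the helper below is a sketch, with the register operands transcribed from the slide:

```python
def avg_ilp(instrs):
    """instrs: list of (dest, [sources]). An instruction can issue
    one cycle after the latest producer of its sources (flow
    dependences only). Average ILP = instructions / cycles required."""
    ready = {}  # register -> cycle in which its value is produced
    cycles = 0
    for dest, srcs in instrs:
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        cycles = max(cycles, cycle)
    return len(instrs) / cycles

# code1: each instruction reads the previous one's result (serial chain)
code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
# code2: the three instructions are independent
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]
print(avg_ilp(code1))  # 1.0
print(avg_ilp(code2))  # 3.0
```

This only tracks true (flow) dependences; anti- and output dependences, covered on the next slide, would constrain an in-order machine further.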
38 Inter-instruction Dependences
39 Scope of ILP Analysis
r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3
r11 ← r12 + 1; r13 ← r19 / 17; r14 ← r0 - r20
Out-of-order execution permits more ILP to be exploited
40 Purported Limits on ILP
- Weiss and Smith [1984]: 1.58
- Sohi and Vajapeyam [1987]: 1.81
- Tjaden and Flynn [1970]: 1.86
- Tjaden and Flynn [1973]: 1.96
- Uht [1986]: 2.00
- Smith et al. [1989]: 2.00
- Jouppi and Wall [1988]: 2.40
- Johnson [1991]: 2.50
- Acosta et al. [1986]: 2.79
- Wedig [1982]: 3.00
- Butler et al. [1991]: 5.8
- Melvin and Patt [1991]: 6
- Wall [1991]: 7
- Kuck et al. [1972]: 8
- Riseman and Foster [1972]: 51
- Nicolau and Fisher [1984]: 90
41 Optimistic and Pessimistic ILP Estimates
- Flynn's bottleneck: ILP is less than 2 (1970/1973)
- Fisher's optimism: ILP is much greater than 2 (1984)
- Johnson: ILP is 2 (1991)
- Butler et al.: ILP is greater than 2 (1991)
42 Parameters for ILP Machines
- Operation Latency (OL): number of machine cycles required for execution of an instruction, i.e., the number of cycles until the result is available
- Machine Parallelism (MP): number of instructions in flight
- Issue Latency (IL): number of cycles required between issuing 2 consecutive instructions ("issue" means initiating an instruction into the pipeline)
- Issue Parallelism (IP): max number of instructions that can be issued in a machine cycle
43 Scalar Pipeline
- Scalar Pipeline (baseline)
- Machine Parallelism MP = D (pipeline depth)
- Operation Latency OL = 1
- Peak IPC = 1
44 Superpipelined Machine
- Superpipelined Execution
- MP = D × M; IP = 1 (per minor cycle)
- OL = M minor cycles; IL = 1 minor cycle
- Peak IPC = 1 per minor cycle (M per baseline cycle)
- major cycle = M minor cycles
[Pipeline timing diagram: instructions 1-6 flowing through stages IF, DE, EX, WB, one issued per minor cycle]
45 Superpipelined MIPS R4000
46 Superscalar Machines
- Superscalar (Pipelined) Execution
- IP = N; IL = 1; MP = N × D
- OL = 1 baseline cycle
- Peak IPC = N per baseline cycle
47 Superscalar and Superpipelined
- Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e., if N = M then both have about the same peak IPC.
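The parameters of the three machine styles can be tied together in a tiny calculator (a sketch using the slides' D, M, N definitions; the default depth D = 6 is an assumption for illustration):

```python
def peak_ipc(style, D=6, M=1, N=1):
    """Return (peak instructions per *baseline* cycle, machine
    parallelism MP) per the slides' definitions: D = pipeline depth,
    M = minor cycles per baseline cycle, N = issue width."""
    if style == "scalar":
        return 1, D        # one issue per cycle, D instructions in flight
    if style == "superpipelined":
        return M, D * M    # 1 per minor cycle = M per baseline cycle
    if style == "superscalar":
        return N, N * D    # N issued per baseline cycle
    raise ValueError(style)

print(peak_ipc("superpipelined", M=3))  # (3, 18)
print(peak_ipc("superscalar", N=3))     # (3, 18) -- same peak when N = M
```

The matching outputs for M = N = 3 illustrate the slide's point: equal-degree superpipelined and superscalar machines have the same peak throughput and machine parallelism.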
48 VLIW
- Very Long Instruction Word (VLIW)
- One big VLIW instruction separately directs each functional unit
- MultiFlow TRACE, TI C6X, IA-64
- IA-64 EPIC is really VLIW
- EPIC (Explicitly Parallel Instruction Computing)
add r1,r2,r3
load r4,r54
mov r6,r2
mul r7,r8,r9
[Diagram: the four operations above packed into one VLIW instruction and dispatched for execution in parallel, one per functional unit (FU)]
49 VLIW vs Superscalar
- VLIW: compiler finds parallelism
- Superscalar: hardware finds parallelism
- VLIW: simpler hardware
- Superscalar: more complex hardware
- VLIW: less power
- Superscalar: more power
- VLIW: works only if the compiler has done the right things
- Superscalar: works even with a lousy compiler
50 Problem Set 1
- 1.6
- 1.11
- 1.15 (only first part: perf improvement)
- 1.19, 1.20, 1.21, 1.22, 1.23, 1.24, 1.25, 1.27, 1.29, 1.30