Title: Superscalar Microprocessor Architecture
1 Superscalar Microprocessor Architecture
- EE382M
- Lizy Kurian John
- Chapter 1
- Some slides adapted from Prof. Hoe, CMU and
Instructor Garcia, Berkeley
2 Superscalar Microprocessor Architecture
- Super + scalar
- Scalar: one operation at a time
- Superscalar: more than one operation at a time
- Vector: multiple concurrent operations on vectors/arrays
- Microprocessor: a processor implemented in one or a small number of semiconductor chips
- Architecture (dictionary definition): the style, design, and structure of anything; the organization, layout, or distribution of resources
3 Microprocessors
- PCs, workstations, servers, hand-held, mobile, automobile, supercomputers
- 100 microprocessors per person
- 1 B microprocessors shipped per year
- Microcontrollers, embedded microprocessors
- Instruction Set Processors
- ISA (Instruction Set Architecture)
- Microarchitecture
- Implementation
4 Evolution of Single-Chip Microprocessors
5 ISA vs Microarchitecture
- Specification vs implementation
- Specification: what does it do?
- Implementation: how does it do it?
- Synthesis and Analysis
- Synthesis: find an implementation based on the spec
- Analysis: examine an implementation to see how well it meets the spec (correctness and effectiveness)
- Effectiveness: performance bottleneck analysis
6 Specification vs Implementation
- Specification
- Synthesis
- Implementation
- Analysis
- ISA: Instruction Set Architecture
- HDL: Hardware Description Language
- RTL: Register Transfer Language
7 Anatomy of Engineering Design
- Specification: behavioral description of "What does it do?"
- Synthesis: search for possible solutions; pick the best one.
- Implementation: structural description of "How is it constructed?"
- Analysis: figure out if the design meets the specification. Does it do the right thing? How well does it perform?
8 Dynamic-Static Interface
The DSI (e.g., the ISA) is a contract between the program and the machine.
9 Where to place the DSI
[Diagram: an HLL program at the top maps down to hardware at the bottom through candidate interface placements DSI-1, DSI-2, and DSI-3; machine styles shown: DEL, CISC, VLIW, RISC]
10 Where to place the HSI
Moderately complex ISA
[Diagram: an HLL program at the top maps down to hardware through three assembly-level interfaces (Assembly-1/2/3) corresponding to DEL, CISC, and RISC machines; examples shown: Fortran machine and LISP machine (DEL), x86 (CISC), PowerPC and MIPS (RISC)]
11 Hardware Design
Implementation issues / levels of abstraction:
- Architect: specification level
- Microarchitect: design at behavioral level (e.g., HDL)
- Design at structural level (circuit design)
- Layout
12 Software Design
Implementation issues / levels of abstraction:
- Architect: specification level
- Designer: block level
- Programmer: code
13 Course Scope
- INSTRUCTION SET ARCHITECTURE (ISA): programmer/compiler view. Functional appearance to its immediate user/system programmer.
- IMPLEMENTATION (microarchitecture): processor designer view. Logical structure or organization that performs the architecture.
- REALIZATION (chip): chip/system designer view. Physical structure that embodies the implementation.
14 Why are we going for Superscalar?
Performance. What is performance? MIPS? MFLOPS? MHz?
Exec Time = N × CPI_avg × clock period
Exec time of what? Benchmark programs.
15 Iron Law of Processor Performance
Processor performance: Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
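As a quick illustration of the iron law, the sketch below compares two hypothetical design points (the numbers are assumptions for illustration, not from the text):

```python
# Iron Law: exec_time = instructions * CPI * clock_period
def exec_time(n_instructions, cpi, clock_period_ns):
    """Execution time in nanoseconds for a program of
    n_instructions at the given average CPI and cycle time."""
    return n_instructions * cpi * clock_period_ns

# Same program (1e9 instructions), two hypothetical designs:
base = exec_time(1e9, 2.0, 1.0)    # CPI 2.0 at a 1 ns (1 GHz) clock
faster = exec_time(1e9, 1.2, 1.25)  # lower CPI but a slower clock
print(base / faster)  # speedup ≈ 1.33: CPI gain outweighs the clock loss
```

The point of the exercise: all three terms matter, so a microarchitectural change that lowers CPI can win even if it lengthens the cycle time.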
16 Improving Performance
- Reduce N
- Reduce CPI
- Reduce clock period / increase clock frequency
17 Improving Performance
RISC, CISC, pipelining, caches, multiple issue, out-of-order execution, speculation/prediction, MMX (SIMD extensions), custom design, dynamic logic, copper? SOI?
18 CISC
- Reduces the number of instructions
- Encodes instructions densely
- Less gap between the HLL and the underlying machine
19 RISC
- Reduce the cycles taken to execute an instruction (okay to slightly increase the total number of instructions)
- Less chip area on control
- Move the HSI towards hardware
20 RISC
- Load/store architecture
- Uniform and orthogonal ISA
- Simple explicit operations and operands
- Fewer addressing modes
- Large general-purpose register set
21 CISC and RISC examples
- CISC: Intel Pentium 4; AMD K5, K6, K7, Opteron
- RISC: MIPS R2000-R14000; Sun SPARC, UltraSPARC, Sunfire; HP PA-RISC; IBM PowerPC; ARM processors
22 Instruction Set Architecture
- ISA: the boundary between software and hardware
- Specifies the logical machine that is visible to the programmer
- Also a functional spec for the processor designers
- What needs to be specified by an ISA:
- Operations: add, sub, mult
- Temporary operand storage in the CPU: accumulator, stacks, registers
- Number of operands per instruction
- Operand location: where and how to specify the operands
- Type and size of operands
- Instruction-to-binary encoding
23 Microarchitecture
- All the structures necessary to implement the ISA
- All the structures necessary to give good performance:
- Pipelining
- Caches
- Prefetching
- Superscalar and out-of-order mechanisms
- Branch predictors
24 Steps in Executing Instructions
- 1) IFetch: fetch instruction, increment PC
- 2) Decode: decode instruction, read registers
- 3) Execute: Mem-ref: calculate address; Arith-log: perform operation
- 4) Memory: Load: read data from memory; Store: write data to memory
- 5) Write Back: write data to register
25 Review: Datapath for MIPS
- Use the datapath figure to represent the pipeline
26 Pipelined Execution Representation
- Every instruction must take the same number of steps, also called pipeline stages, so some will go idle sometimes
27 Pipelining Basics
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Time to fill the pipeline and time to drain it reduce the speedup, e.g., 2.3X vs. 4X
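The fill/drain effect can be sketched numerically, assuming one task completes per cycle once the pipe is full:

```python
def pipeline_speedup(stages, tasks):
    """Speedup of a 'stages'-deep pipeline over unpipelined execution.
    Counting fill/drain, the tasks finish in (stages + tasks - 1)
    cycles instead of stages * tasks cycles."""
    return (stages * tasks) / (stages + tasks - 1)

print(round(pipeline_speedup(4, 4), 1))     # 2.3 -- far below the ideal 4X
print(round(pipeline_speedup(4, 1000), 1))  # 4.0 -- once the pipe stays full
```

With only a handful of tasks in flight, the fill and drain cycles dominate; the 2.3X-vs-4X figure on the slide corresponds to a 4-stage pipe processing just 4 tasks.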
28 General Definitions
- Latency: time to completely execute a certain task
- Throughput: amount of work that can be done over a period of time
29 Pipelined Design
- Motivation: increase throughput with little increase in hardware
- Latency required for each task remains the same, or may even increase slightly
30 Evaluating Pipelining Using Laws of Parallel Processing
Amdahl's Law, efficiency, vectorizability
31 Amdahl's Law
- In parallel processing, the serial part limits total performance
- T = original execution time
- S = fraction of time in serial code, e.g., 0.2
- P = fraction of time in parallel code = 1 - S
- N = number of parallel units
- Speedup = 1 / (S + P/N)
- Max speedup = 1/S
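The formula can be checked with a short sketch, using the slide's example value S = 0.2:

```python
def amdahl_speedup(serial_frac, n):
    """Amdahl's Law: speedup = 1 / (S + P/N), with P = 1 - S."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

print(amdahl_speedup(0.2, 4))      # 2.5 with four parallel units
print(amdahl_speedup(0.2, 10**6))  # approaches the max speedup 1/S = 5
```

Even with a million parallel units, a 20% serial fraction caps the speedup just below 5: the serial part is the bottleneck.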
32 Fig 1.5: Amdahl's Law
33 Amdahl's Law
Assume I-mix: loads 25%, branches 20%, taken branches 66.6% of branches; the hardware uses a predict-not-taken (NT) policy; the branch penalty is 4 cycles and the load penalty is 1 cycle. Speedup of a 6-stage pipeline under these circumstances:
Eq 1.6: S = 1 / (g1/1 + g2/2 + g3/3 + ... + gN/N)
S = 1 / (0.133/2 + 0.25/5 + 0.617/6) ≈ 4.5
Ideal S = 6: the difference between peak and actual pipelining improvement.
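The slide's arithmetic can be reproduced directly; here each g_i is the fraction of instructions seeing effective pipeline depth i (a sketch based on the numbers above):

```python
# Fractions of instructions and the effective pipeline depth each sees:
# taken branches (0.20 * 0.666 ≈ 0.133) with a 4-cycle penalty -> depth 2;
# loads (0.25) with a 1-cycle penalty -> depth 5;
# everything else (0.617) -> the full depth 6.
groups = [(0.133, 2), (0.25, 5), (0.617, 6)]

speedup = 1.0 / sum(g / depth for g, depth in groups)
print(round(speedup, 2))  # ≈ 4.56, versus the ideal speedup of 6
```

Stalls shave roughly a quarter off the ideal 6X, which is the gap between peak and actual pipelining improvement the slide refers to.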
34 Easing Sequential Bottleneck
Eq 1.8: Speedup S = 1 / ((1 - f) + f/N)
Eq 1.5: Speedup S = 1 / ((1 - f) + f/6)
Eq 1.10: S = 1 / ((1 - f)/2 + f/6)
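To see what halving the sequential term buys (Eq 1.10 vs. Eq 1.5), a quick sketch assuming a vectorizable fraction f = 0.8:

```python
def speedup(seq_weight, f, n):
    """Generalized Amdahl speedup: 1 / (seq_weight*(1-f) + f/n).
    seq_weight = 1.0 reproduces Eq 1.5; seq_weight = 0.5 models the
    sequential part itself being sped up by 2 (Eq 1.10)."""
    return 1.0 / (seq_weight * (1.0 - f) + f / n)

f = 0.8  # assumed parallelizable (vectorizable) fraction
print(round(speedup(1.0, f, 6), 2))  # Eq 1.5: sequential part unchanged -> 3.0
print(round(speedup(0.5, f, 6), 2))  # Eq 1.10: sequential part halved -> 4.29
```

Speeding up the sequential part by only 2x lifts the overall speedup from 3.0 to about 4.3, illustrating why easing the sequential bottleneck pays off so well.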
35 Easing Sequential Bottleneck Using ILP
36 Instruction-Level Parallelism
- ILP: the aggregate degree of parallelism that can be achieved by the concurrent execution of multiple instructions
- Measured in number of instructions (per cycle)
37 ILP: Instruction-Level Parallelism
- ILP is a measure of the amount of inter-instruction dependences
- Average ILP = no. of instructions / no. of cycles required
- code1: ILP = 1, i.e., must execute serially
- code2: ILP = 3, i.e., can execute at the same time

code2: r1 ← r2 + 1; r3 ← r9 / 17; r4 ← r0 - r10
code1: r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3
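Average ILP for straight-line code can be computed with a simple flow-dependence pass; the helper below is a sketch, with the register operands transcribed from the slide:

```python
def avg_ilp(instrs):
    """instrs: list of (dest, [sources]). An instruction can issue
    one cycle after the latest producer of its sources (flow
    dependences only). Average ILP = instructions / cycles required."""
    ready = {}  # register -> cycle in which its value is produced
    cycles = 0
    for dest, srcs in instrs:
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        cycles = max(cycles, cycle)
    return len(instrs) / cycles

# code1: each instruction reads the previous one's result (serial chain)
code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
# code2: the three instructions are independent
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]
print(avg_ilp(code1))  # 1.0
print(avg_ilp(code2))  # 3.0
```

This only tracks true (flow) dependences; anti- and output dependences, covered on the next slide, would constrain an in-order machine further.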
38 Inter-instruction Dependences
39 Scope of ILP Analysis
r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3
r11 ← r12 + 1; r13 ← r19 / 17; r14 ← r0 - r20
Out-of-order execution permits more ILP to be exploited
40 Purported Limits on ILP
- Weiss and Smith [1984]: 1.58
- Sohi and Vajapeyam [1987]: 1.81
- Tjaden and Flynn [1970]: 1.86
- Tjaden and Flynn [1973]: 1.96
- Uht [1986]: 2.00
- Smith et al. [1989]: 2.00
- Jouppi and Wall [1988]: 2.40
- Johnson [1991]: 2.50
- Acosta et al. [1986]: 2.79
- Wedig [1982]: 3.00
- Butler et al. [1991]: 5.8
- Melvin and Patt [1991]: 6
- Wall [1991]: 7
- Kuck et al. [1972]: 8
- Riseman and Foster [1972]: 51
- Nicolau and Fisher [1984]: 90
41 Optimistic and Pessimistic ILP Estimates
- Flynn's bottleneck: ILP is less than 2 (1970/1973)
- Fisher's optimism: ILP is much greater than 2 (1984)
- Johnson: ILP is 2 (1991)
- Butler et al.: ILP is greater than 2 (1991)
42 Parameters for ILP Machines
- Operation Latency (OL): number of machine cycles required for execution of an instruction, i.e., the number of cycles until the result is available
- Machine Parallelism (MP): number of instructions in flight
- Issue Latency (IL): number of cycles required between issuing 2 consecutive instructions ("issue" means initiating an instruction into the pipeline)
- Issue Parallelism (IP): max number of instructions that can be issued in a machine cycle
43 Scalar Pipeline
- Scalar Pipeline (baseline)
- Machine Parallelism MP = D (pipeline depth)
- Operation Latency OL = 1
- Peak IPC = 1
44 Superpipelined Machine
- Superpipelined Execution
- MP = D × M; IP = 1 (per minor cycle)
- OL = M minor cycles; IL = 1 minor cycle
- Peak IPC = 1 per minor cycle (M per baseline cycle)
- major cycle = M minor cycles
[Pipeline timing diagram: instructions 1-6 flowing through stages IF, DE, EX, WB, one issued per minor cycle]
45 Superpipelined MIPS R4000
46 Superscalar Machines
- Superscalar (Pipelined) Execution
- IP = N; IL = 1; MP = N × D
- OL = 1 baseline cycle
- Peak IPC = N per baseline cycle
47 Superscalar and Superpipelined
- Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e., if N = M then both have about the same peak IPC.
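The parameters of the three machine styles can be tied together in a tiny calculator (a sketch using the slides' D, M, N definitions; the default depth D = 6 is an assumption for illustration):

```python
def peak_ipc(style, D=6, M=1, N=1):
    """Return (peak instructions per *baseline* cycle, machine
    parallelism MP) per the slides' definitions: D = pipeline depth,
    M = minor cycles per baseline cycle, N = issue width."""
    if style == "scalar":
        return 1, D        # one issue per cycle, D instructions in flight
    if style == "superpipelined":
        return M, D * M    # 1 per minor cycle = M per baseline cycle
    if style == "superscalar":
        return N, N * D    # N issued per baseline cycle
    raise ValueError(style)

print(peak_ipc("superpipelined", M=3))  # (3, 18)
print(peak_ipc("superscalar", N=3))     # (3, 18) -- same peak when N = M
```

The matching outputs for M = N = 3 illustrate the slide's point: equal-degree superpipelined and superscalar machines have the same peak throughput and machine parallelism.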
48 VLIW
- Very Long Instruction Word (VLIW)
- One big VLIW instruction separately directs each functional unit
- MultiFlow TRACE, TI C6X, IA-64
- IA-64 EPIC is really VLIW
- EPIC (Explicitly Parallel Instruction Computing)
add r1,r2,r3
load r4,r54
mov r6,r2
mul r7,r8,r9
[Diagram: the four operations above packed into one VLIW instruction and dispatched for execution in parallel, one per functional unit (FU)]
49 VLIW vs Superscalar
- VLIW: compiler finds parallelism
- Superscalar: hardware finds parallelism
- VLIW: simpler hardware
- Superscalar: more complex hardware
- VLIW: less power
- Superscalar: more power
- VLIW: works only if the compiler has done the right things
- Superscalar: works even with a lousy compiler
50 Problem Set 1
- 1.6
- 1.11
- 1.15 (only first part: perf improvement)
- 1.19, 1.20, 1.21, 1.22, 1.23, 1.24, 1.25, 1.27, 1.29, 1.30