Superscalar Microprocessor Architecture

Transcript and Presenter's Notes

1
Superscalar Microprocessor Architecture
  • EE382M
  • Lizy Kurian John
  • Chapter 1
  • Some slides adapted from Prof. Hoe, CMU and
    Instructor Garcia, Berkeley

2
Superscalar Microprocessor Architecture
  • Super scalar
  • Scalar: one
  • Superscalar: more than one
  • Vector: multiple concurrent operations on
    vectors/arrays
  • Microprocessor: a processor implemented in one
    or a small number of semiconductor chips
  • Architecture (dictionary definition): the
    style, design, and structure of anything;
    the organization, layout, or distribution of
    resources

3
Microprocessors
  • PCs, workstations, servers, hand-held, mobile,
    automobile, supercomputers
  • 100 microprocessors per person
  • 1 billion microprocessors shipped per year
  • Microcontrollers, embedded microprocessors
  • Instruction Set Processors
  • ISA (Instruction Set Architecture)
  • Microarchitecture
  • Implementation

4
Evolution of Single-Chip Microprocessors
5
ISA vs Microarchitecture
  • Specification vs implementation
  • Specification: what does it do?
  • Implementation: how does it do it?
  • Synthesis and Analysis
  • Synthesis: find an implementation based on the spec
  • Analysis: examines an implementation to see how
    well it meets the spec (correctness and
    effectiveness)
  • Effectiveness: performance, bottleneck analysis

6
Specification vs Implementation
  • Specification
  • Synthesis
  • Implementation
  • Analysis
  • ISA: Instruction Set Architecture
  • HDL: Hardware Description Language
  • RTL: Register Transfer Language

7
Anatomy of Engineering Design
  • Specification: behavioral description of "What
    does it do?"
  • Synthesis: search for possible solutions; pick
    the best one.
  • Implementation: structural description of "How is
    it constructed?"
  • Analysis: figure out whether the design meets the
    specification.
  • Does it do the right thing? How well does
    it perform?

8
Dynamic-Static Interface
The DSI (the ISA) is a contract between the program and
the machine.
9
Where to place DSI
Figure: the DSI can be placed at different levels between the HLL
program and the hardware (DSI-1, DSI-2, DSI-3), corresponding to
styles such as DEL, CISC, VLIW, and RISC.
10
Where to place the HSI
Figure: placement of the HSI for a moderately complex ISA. The HLL
program is compiled to one of several assembly-level interfaces
(Assembly-1/2/3) above the hardware; examples range from DEL
(Fortran machine, LISP machine) through CISC (x86) to RISC
(PowerPC, MIPS).
11
Hardware Design
Implementation issues at successive levels of abstraction:
Architect: specification level
Microarchitect: design at the behavioral level (e.g., HDL)
Design at the structural level (circuit design)
Layout
12
Software Design
Implementation issues at successive levels of abstraction:
Architect: specification level
Block level
Code designer / programmer
13
Course Scope
  • INSTRUCTION SET ARCHITECTURE (ISA)
  • programmer/compiler view - Functional appearance
    to its immediate user/system programmer
  • IMPLEMENTATION (microarchitecture)
  • processor designer view - Logical structure or
    organization that performs the architecture
  • REALIZATION (Chip)
  • chip/system designer view - Physical structure
    that embodies the implementation

14
Why are we going for Superscalar?
Performance. What is performance? MIPS? MFLOPS? MHz?
Exec Time = N × CPI_avg × Clock period
Exec time of what? Benchmark programs.
15
Iron Law of Processor Performance
Processor Performance
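The equation itself did not survive this transcript; a common statement of the Iron Law, consistent with the execution-time formula on the previous slide, is:

```latex
\[
\frac{\text{Time}}{\text{Program}}
  \;=\; \frac{\text{Instructions}}{\text{Program}}
        \times \frac{\text{Cycles}}{\text{Instruction}}
        \times \frac{\text{Time}}{\text{Cycle}}
  \;=\; N \times CPI_{\text{avg}} \times T_{\text{clock}},
\qquad
\text{Performance} \;=\; \frac{1}{\text{Time}/\text{Program}}
\]
```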
16
Improving Performance
Reduce N
Reduce CPI
Reduce clock period / increase clock frequency
17
Improving Performance
RISC / CISC
Pipelining
Caches
Multiple issue, out-of-order execution
Speculation / prediction
MMX (SIMD extensions)
Custom design, dynamic logic
Copper? SOI?
18
CISC
Reduces the number of instructions
Encodes instructions densely
Less gap between the HLL and the underlying machine
19
RISC
Reduce the cycles taken to execute an instruction
(okay to slightly increase the total number of instructions)
Less chip area spent on control
Move the HSI towards hardware
20
RISC
Load/store architecture
Uniform and orthogonal ISA
Simple, explicit operations and operands
Fewer addressing modes
Large general-purpose register set
21
CISC RISC examples
CISC: Intel Pentium 4; AMD K5, K6, K7, Opteron
RISC: MIPS R2000-R14000; Sun SPARC, UltraSPARC, Sunfire;
HP PA-RISC; IBM PowerPC; ARM processors

22
Instruction Set Architecture
  • ISA: the boundary between software and hardware
  • Specifies the logical machine that is visible to
    the programmer
  • Also a functional spec for the processor
    designers
  • What needs to be specified by an ISA:
  • Operations: add, sub, mult
  • Temporary operand storage in the CPU:
    accumulator, stacks, registers
  • Number of operands per instruction
  • Operand location: where and how to specify the
    operands
  • Type and size of operands
  • Instruction-to-binary encoding

23
Microarchitecture
  • All the structures necessary to implement the ISA
  • All the structures necessary to give good
    performance
  • Pipelining
  • Caches
  • Prefetching
  • Superscalar and out-of-order mechanisms
  • Branch Predictors

24
Steps in Executing Instructions
  • 1) IFetch: fetch instruction, increment PC
  • 2) Decode: decode instruction, read registers
  • 3) Execute: Mem-ref: calculate address;
    Arith-log: perform operation
  • 4) Memory: Load: read data from memory;
    Store: write data to memory
  • 5) Write Back: write data to register

25
Review Datapath for MIPS
  • Use datapath figure to represent pipeline

26
Pipelined Execution Representation
  • Every instruction must take the same number of steps,
    also called pipeline stages, so some stages will
    sometimes go idle
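Since the pipeline figure itself is not reproduced in this transcript, here is a minimal sketch (stage names IF, ID, EX, MEM, WB assumed from the five steps on slide 24) that prints such a diagram, showing how the stages of consecutive instructions overlap:

```python
# Minimal sketch: print a pipeline diagram for a classic 5-stage pipeline.
# Stage names are assumed from the five steps listed earlier in the deck.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions: int) -> None:
    total_cycles = n_instructions + len(STAGES) - 1
    print("      " + " ".join(f"c{c + 1:<3}" for c in range(total_cycles)))
    for i in range(n_instructions):
        row = ["    "] * total_cycles          # blank cell for each cycle
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:<4}"          # stage s occupies cycle i + s
        print(f"inst{i + 1} " + " ".join(row))

pipeline_diagram(3)
```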

27
Pipelining Basics
  • Pipelining doesn't help the latency of a single task;
    it helps the throughput of the entire workload
  • Multiple tasks operate simultaneously using
    different resources
  • Potential speedup = number of pipe stages
  • Time to fill the pipeline and time to drain it
    reduce the speedup, e.g., 2.3X vs. 4X
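A rough illustration of the fill/drain effect (a sketch under a simple model, not from the slides: an un-pipelined unit takes k cycles per task, the pipeline needs k cycles to fill and then completes one task per cycle):

```python
# Sketch of pipeline speedup with fill/drain overhead (simple model,
# not from the slides): ideal speedup is k, but short runs fall well below it.

def pipeline_speedup(k: int, n: int) -> float:
    """Speedup of a k-stage pipeline over an un-pipelined unit for n tasks."""
    unpipelined_cycles = n * k        # each task takes k cycles, one at a time
    pipelined_cycles = k + (n - 1)    # k cycles to fill, then one task per cycle
    return unpipelined_cycles / pipelined_cycles

print(round(pipeline_speedup(k=4, n=4), 1))     # 2.3x for a short run of 4 tasks
print(round(pipeline_speedup(k=4, n=1000), 1))  # ~4.0x once fill time is amortized
```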

28
General Definitions
  • Latency: time to completely execute a certain
    task
  • Throughput: amount of work that can be done over
    a period of time

29
Pipelined Design
  • Motivation: increase throughput with little
    increase in hardware
  • Latency required for each task remains the same
    or may even increase slightly.

30
Evaluating Pipelining Using Laws of Parallel Processing
Amdahl's Law
Efficiency
Vectorizability

31
Amdahls Law
  • In parallel processing, the serial part limits total
    performance
  • T: original execution time
  • S: fraction of time in serial code, e.g., 0.2
  • P: fraction of time in parallel code = 1 - S
  • N: number of parallel units
  • Speedup = 1 / (S + P/N)
  • Max speedup = 1/S
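A quick sketch of the formula above in code form (using the slide's example serial fraction S = 0.2, for which the ceiling is 1/S = 5x):

```python
# Sketch of Amdahl's Law as stated above: Speedup = 1 / (S + P/N), P = 1 - S.

def amdahl_speedup(serial_fraction: float, n_units: int) -> float:
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)

print(amdahl_speedup(0.2, 4))       # 2.5x with 4 parallel units
print(amdahl_speedup(0.2, 10**6))   # approaches the max speedup 1/S = 5x
```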

32
Fig. 1.5: Amdahl's Law
33
Amdahl's Law
Assume an instruction mix with loads = 25%, branches = 20%, and
taken branches = 66.6% of branches; the hardware uses a predict
not-taken (NT) policy; the branch penalty is 4 cycles and the load
penalty is 1 cycle. What is the speedup of a 6-stage pipeline under
these circumstances?
Eq 1.6: S = 1 / (g1/1 + g2/2 + g3/3 + ... + gN/N)
S = 1 / (0.13/2 + 0.25/5 + 0.62/6) ≈ 4.5
Ideal S = 6
Note the difference between peak and actual pipelining improvement.
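An arithmetic check of the 4.5 figure (a sketch; each denominator is read as the effective pipeline depth for that instruction class, i.e., 6 minus its penalty):

```python
# Arithmetic check of the slide's example (a sketch): 13% taken branches with
# an effective depth of 6 - 4 = 2, 25% loads with 6 - 1 = 5, 62% others with 6.

fractions_and_depths = [(0.13, 2), (0.25, 5), (0.62, 6)]
speedup = 1.0 / sum(g / k for g, k in fractions_and_depths)
print(round(speedup, 2))   # ~4.58, versus the ideal speedup of 6
```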
34
Easing Sequential Bottleneck
Eq 1.8: Speedup S = 1 / ((1 - f) + f/N)
Eq 1.5: Speedup S = 1 / ((1 - f) + f/6)
Eq 1.10: S = 1 / ((1 - f)/2 + f/6), i.e., the sequential
fraction itself overlapped to a degree of 2
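A small sketch evaluating Eq 1.5 for a 6-stage pipeline, to show how the non-pipelined fraction (1 - f) limits the achievable speedup:

```python
# Sketch: Eq 1.5, S = 1 / ((1 - f) + f/6), for a few values of the
# pipelined (vectorizable) fraction f.

def speedup(f: float, n: int = 6) -> float:
    return 1.0 / ((1.0 - f) + f / n)

for f in (1.0, 0.9, 0.8, 0.5):
    print(f, round(speedup(f), 2))   # 6.0, 4.0, 3.0, 1.71
```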
35
Easing Sequential Bottleneck using ILP
36
Instruction-Level Parallelism
  • ILP - the aggregate degree of parallelism that
    can be achieved by the concurrent execution of
    multiple instructions
  • Measured in number of instructions

37
ILP: Instruction-Level Parallelism
  • ILP is a measure of the inter-instruction dependences
    in a program
  • Average ILP = no. of instructions / no. of cycles required
  • code1: ILP = 1
  • i.e., must execute serially
  • code2: ILP = 3
  • i.e., all three can execute at the same time

code1: r1 ← r2 + 1;  r3 ← r1 / 17;  r4 ← r0 - r3
code2: r1 ← r2 + 1;  r3 ← r9 / 17;  r4 ← r0 - r10
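A minimal sketch (not from the slides) of how these ILP figures can be computed: number of instructions divided by the length of the longest true-dependence (RAW) chain:

```python
# Sketch: estimate ILP as (number of instructions) / (longest RAW chain),
# for the code1/code2 sequences above. Register names as in the slide.

def ilp(instructions):
    """instructions: list of (dest, [sources]) in program order."""
    depth = {}      # register -> dependence-chain length of its latest producer
    longest = 0
    for dest, srcs in instructions:
        level = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dest] = level
        longest = max(longest, level)
    return len(instructions) / longest

code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]   # serial chain
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]  # independent
print(ilp(code1), ilp(code2))   # 1.0 3.0
```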
38
Inter-instruction Dependences
  • True dependence: Read-after-Write (RAW)
  • Anti-dependence: Write-after-Read (WAR)
  • Output dependence: Write-after-Write (WAW), e.g.,
    r3 ← r1 op r2;  r5 ← r3 op r4;  r3 ← r6 op r7
39
Scope of ILP Analysis
r1 ← r2 + 1;  r3 ← r1 / 17;  r4 ← r0 - r3
r11 ← r12 + 1;  r13 ← r19 / 17;  r14 ← r0 - r20
Out-of-order execution permits more ILP to be
exploited.
40
Purported Limits on ILP
  • Weiss and Smith [1984]: 1.58
  • Sohi and Vajapeyam [1987]: 1.81
  • Tjaden and Flynn [1970]: 1.86
  • Tjaden and Flynn [1973]: 1.96
  • Uht [1986]: 2.00
  • Smith et al. [1989]: 2.00
  • Jouppi and Wall [1988]: 2.40
  • Johnson [1991]: 2.50
  • Acosta et al. [1986]: 2.79
  • Wedig [1982]: 3.00
  • Butler et al. [1991]: 5.8
  • Melvin and Patt [1991]: 6
  • Wall [1991]: 7
  • Kuck et al. [1972]: 8
  • Riseman and Foster [1972]: 51
  • Nicolau and Fisher [1984]: 90

41
Optimistic and Pessimistic ILP Estimates
  • Flynn's bottleneck: ILP is less than 2
    (1970/1973)
  • Fisher's optimism: ILP is much greater than 2
    (1984)
  • Johnson: ILP is 2 (1991)
  • Butler et al.: ILP is greater than 2 (1991)

42
Parameters for ILP Machines
  • Operation Latency (OL): number of machine cycles
    required for execution of an instruction; the
    number of cycles until the result is available
  • Machine Parallelism (MP): number of instructions
    in flight
  • Issue Latency (IL): number of cycles required
    between issuing two consecutive instructions
  • (Issue means initiating an instruction into the
    pipeline)
  • Issue Parallelism (IP): maximum number of
    instructions that can be issued in a machine
    cycle

43
Scalar Pipeline
  • Scalar pipeline (baseline)
  • Instruction Parallelism = D
  • Operation Latency = 1
  • Peak IPC = 1

44
Superpipelined Machine
  • Superpipelined Execution
  • MP = D × M; IP = 1 (per minor cycle)
  • OL = M minor cycles; IL = 1 minor cycle
  • Peak IPC = 1 per minor cycle (M per baseline
    cycle)

Figure: superpipelined timing diagram; each major (baseline) cycle
consists of M minor cycles, and successive instructions 1-6 flow
through the IF, DE, EX, and WB stages one minor cycle apart.
45
Superpipelined MIPS R4000
46
Superscalar Machines
  • Superscalar (Pipelined) Execution
  • IP = N; IL = 1; MP = N × D
  • OL = 1 baseline cycle
  • Peak IPC = N per baseline cycle

47
Superscalar and Superpipelined
  • Superscalar and superpipelined machines of equal
    degree have roughly the same performance, i.e., if
    N = M then both have about the same IPC.

48
VLIW
  • Very Long Instruction Word (VLIW)
  • One big VLIW instruction separately directs each
    functional unit
  • Multiflow TRACE, TI C6x, IA-64
  • IA-64 EPIC is really VLIW
  • EPIC (Explicitly Parallel Instruction Computing)

Figure: one VLIW instruction packs four operations
(add r1,r2,r3 | load r4,r54 | mov r6,r2 | mul r7,r8,r9),
each directed to its own functional unit (FU) for execution.
49
VLIW vs Superscalar
  • VLIW: compiler finds parallelism
  • Superscalar: hardware finds parallelism
  • VLIW: simpler hardware
  • Superscalar: more complex hardware
  • VLIW: less power
  • Superscalar: more power
  • VLIW: works only if the compiler has done the right
    things
  • Superscalar: works even with a lousy compiler

50
Problem Set 1
  • 1.6
  • 1.11
  • 1.15 (only the first part: performance improvement)
  • 1.19,1.20,1.21,1.22,1.23,1.24,1.25,1.27,
  • 1.29,1.30