Title: Microarchitecture: Hows and Whys!
1 Microarchitecture: Hows and Whys!
- Saeed Beiki
- SIC, Intel Innovation Center (IIC)
- Dubai, Dubai Internet City (DIC), Microprocessor TechZone Staff Congress - November 2007
2 Who Am I?
- A Senior Information Contributor (SIC) at Intel Innovation Center, Dubai
- Involved in other fields that aren't related to the present subject
3 Agenda
- Memory Subsystem
- Segmented and Harvard Models
- Virtual Addresses and TLB
- General Principles
- Five Steps: IA, ID, EX, DA, WB
- Instruction Lifecycle
- RISC vs. CISC Architecture
- Register-Register vs. Register-Memory Architecture
- ILP
- What ILP Is?
- Pipelining
- Pipeline Hazards
- Block-Instructions, Pipeline Bubbles and Stall Cycles
- Locality of Reference
- Data Forwarding
- Limits
- Superscalar Processing
- Single-Issue vs. Multi-Issue Architecture
- Limits
4 Agenda (cont.)
- DLP
- What DLP Is?
- SIMD and MIMD vs. SISD
- SIMD Streaming Types
- Limits
- Branch Prediction
- What BP Is?
- Types
- Static
- Dynamic (BTB/BHT)
- BHB
- BTB
- RAB
- Limits
- OOO (Out-of-Order) or Dynamic Execution
- Speculation phenomenon
- Register-Renaming and RAT
- ROB
- RLSB
5 Agenda (cont.)
- Evolutions (selected)
- Drive Stages
- GEHL
- Profile Propagation
- Overall Problems
- Instruction Fetch / Data Access latency
- Wasted-Cycles on Deeply-Pipelined Architectures
- Low-rated Entropy
- SupILP-DLP
- My Proposals
- Sequential Speculation
- Custom Storage
- Line Splitter Strategy
- Q&A
- Estimated Time: 2:15 hours
6 Memory Subsystem
7 Segmented and Harvard Models
- The memory is split into exclusive parts (segments) for any computing resource identity
- CODE: Text (or Code)
- DATA: Stack, Heap, BSS, Data
- Effect of Register Length and Register State
- Registers and Base Locations and Indexing
- Properties of segments
- CODE: Instructions, RO
- DATA
- Data: Global, Static, Initialized, Writable, Fixed
- BSS: Global, Static, Uninitialized, Writable, Fixed
- Heap: Other Data Types, RunTime-Alloc, Writable, Non-Fixed
- Stack: Other Data Types, Temporary, Writable, Abstract, Non-Fixed; LIFO or FIFO types
- Harvard model?
8 Virtual Address
- Access State
- Object Hierarchy
- Translation Look-aside Buffer (TLB)
- Memory
- Disk
- TLB-Miss
- Memory-Miss
- Disk-Failure
- TLB Size Calculation
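One way to make the "TLB Size Calculation" bullet concrete is the TLB reach: how much memory a fully populated TLB covers before a miss is forced. A minimal sketch in Python; the 64-entry and 4 KiB figures are illustrative assumptions, not numbers from the talk.

```python
def tlb_reach(entries: int, page_size: int) -> int:
    """Memory (in bytes) covered by a fully populated TLB."""
    return entries * page_size

# Illustrative: 64 entries mapping 4 KiB pages cover 256 KiB,
# so a working set larger than that starts taking TLB misses.
print(tlb_reach(64, 4096))  # 262144
```

Larger pages raise the reach without adding entries, which is the usual motivation for huge-page support.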
9 General Principles
10 Five Key Steps in Instruction Processing
- Any computer system should perform the required steps below to execute an instruction completely and successfully
- Instruction Access (IA)
- Instruction Decode (ID)
- Execution (EX)
- Data Access (DA)
- Write-Back (WB)
11 Instruction Access (IA)
- Read from Memory Subsystem
- Indicated by Program-Counter (PC)
- Only Physical Addresses!? Yes, TLB!
- Can it be equated with Instruction Fetch (IF)? Nope!
- Latency is alive
- Latency in Translation
- Latency in Reading/Fetching
12 Instruction Decode (ID)
- Control Information
- Operation Code (OPCode)
- Immediate Data
- Embedded
- Consequence
- Target Address
- Instruction, e.g., Branch instruction(s)
- Data, e.g., Load/Store instruction(s)
- Register Data Fetch
- CISC to RISC conversion
13 Execute (EX)
- Perspectives
- Mathematics operations
- Movement operations
- Effective Address (EA)
- Address Offsets
- Indirect Memory Reference
- Register Address-housing
- Execution Units
- Memory units
- Arithmetic units
- Execution Core?
- Flush (back)
14 Data Access (DA)
- Give your ticket, take your data!
- Address Bus
- Data Bus
- Control Bus
- So what about Memory Store instructions?
15 Write-back (WB)
- What is Write-back?
- Data Load/ Data Store
- Types
- Register
- Memory
- Disk
- Register-Memory vs. Register-Register
16 So what is IT, then?
- Instruction Translation (IT)
- OS-side
- Assemblers
- Stateful IT
- Pre-Explored IT (PEIT)
- Roles of Directives
- Roles of Memory layout
- Processor-based IT
- Note: So our five-stage model will be modified and will run into challenges with some complexities.
17 Instruction Lifecycle
[Diagram: instruction lifecycle: Load, Decode, Fetch, Execution, Write-back, Retire, with a Flush path looping back]
- Facts
- Reload
- Pre-Fetch
- OOO Execution
- In-Order Retire
- Infinite flush hazards
18 RISC vs. CISC Architecture
- RISC (Reduced Instruction Set Computer)
- CISC (Complex Instruction Set Computer)
- Execution speed
- Memory Parsing
- Symbol states
- Conversion Rule
- Every modern CISC processor converts its instructions internally to RISC-like operations
19 Register-Register vs. Register-Memory
- Register-Register (RISC)
- Only Load/Store operations can access Memory
- Register-Memory (CISC)
- Designated units (like the ALU) can access Memory
- Contrast: Register-Memory vs. the RISC principle
- Architectural Contract
- Contrast of CISC-to-RISC Conversion
- Solution: Extra RISC Instructions
- Clock Rate state
- Hardware Optimization state
20 Instruction-Level Parallelism (ILP)
21 Abstract
- What is ILP?
- Executing more instructions at the same time, in parallel
- Single-core ILP
- Multi-core ILP
- Methods
- Pipelining
- Superscalar
- Super-pipelining
- Combinations
22 Pipelining
[Diagram: Normal (non-pipelined) execution: one instruction passes through IA, ID, EX, DA, WB over cycles 1-5 before the next begins. Pipelined execution: a new instruction enters IA each cycle, so the five stages of successive instructions overlap.]
- First instruction: 5 cycles
- Subsequent instructions: 1-cycle apparent latency
- Still, each one has 5 stages (5 cycles)
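The cycle counts above can be checked with a tiny model: an ideal pipeline finishes the first instruction after one full pass through the stages, then retires one instruction per cycle. A sketch, ignoring hazards and stalls:

```python
def unpipelined_cycles(n_instr: int, n_stages: int = 5) -> int:
    # every instruction runs all stages before the next one starts
    return n_instr * n_stages

def pipelined_cycles(n_instr: int, n_stages: int = 5) -> int:
    # first instruction fills the pipe; the rest complete 1 cycle apart
    return n_stages + (n_instr - 1)

print(unpipelined_cycles(10))  # 50
print(pipelined_cycles(10))    # 14
```

As the instruction count grows, the speedup approaches the number of stages, which is why deeper pipelines look attractive until hazards enter the picture.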
23 Pipelining (cont.)
- Pipeline Hazard?
- Hazard types
- Data hazard, aka Data Dependency (e.g., unavailable data access)
- Control hazard (e.g., pipelined branch instructions)
- Structural hazard (e.g., instruction conflicts, same-time access (STA) problem)
- This is why we separate instruction and data flow ports
- Stall (freeze) time
- Wait to load, wait to fetch, wait to execute
- Pipeline Bubble
- Blocked pipes
- A related set of instructions
- Non-blocked pipes
24 Pipelining (cont.)
- Locality of Reference
- Data Bypassing technique
- Data Splitting technique
- Pipelining Limits
- Deep pipelines are more prone to hazards
- Solution
- Hyper-pipelining (super-pipelining)
- Circuit design
- Hazard cranking
- Circuit skew
- Energy (Watt) limitations
- Latch and setup-and-hold time
25 Superscalar Processing
- Single-issue architecture
- One clock cycle, one instruction
- Superscalar architecture
- Fetch Bandwidth
- Limits
- Data hazards
- Duplicating the hardware?
26 Putting It Together!
- Duplication is still here
- Pipeline duplication
- OOO entrance
- Execution types
- Floating point
- Integer
- media
- Media instructions
- Video processing
- Audio processing
- Solutions: 3DNow! (AMD), SSE, and AVX
27 Data-Level Parallelism (DLP)
28 Abstract
- What is DLP?
- Objective of DLP
- For the sake of media instructions
- Requirements
- More data needs to be accessed (DA) for a single instruction
- One instruction is repeated over a data set
- Data elements are commonly independent
29 SIMD and MIMD vs. SISD
[Diagram: SISD: one instruction stream drives one data stream through a single EX unit. SIMD: one instruction stream drives multiple data streams through parallel EX units. MIMD: multiple instruction streams each drive their own data stream and EX unit.]
- Objectives and Limits
- SISD Single Instruction, Single Data
- SIMD Single Instruction, Multiple Data
- MIMD Multiple Instruction, Multiple Data
- Oops! What about MISD? Is that applicable? And why should it be implemented?
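The SISD/SIMD contrast above can be sketched in plain Python: SISD issues one add per element, while SIMD conceptually issues one vector add per group of independent lanes. The 4-lane width is an illustrative assumption.

```python
LANES = 4  # hypothetical vector width

def sisd_add(a, b):
    # one instruction, one data pair per step
    return [x + y for x, y in zip(a, b)]

def simd_add(a, b, lanes=LANES):
    out = []
    for i in range(0, len(a), lanes):
        # each iteration stands for ONE vector instruction that adds
        # `lanes` independent element pairs at once
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
    return out

a, b = list(range(8)), list(range(8))
assert sisd_add(a, b) == simd_add(a, b)  # same result, fewer "instructions"
```

The results are identical; the win is that the SIMD version needs only `len(a) / lanes` instruction issues, which is exactly why the lanes must be independent.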
30 SIMD Streaming Types
- Multiple data Load/Store operations
- Timing constraints of media type nature
- Data Cache
31 Faults and Limits
- Data underflow
- More and More Memory pressure (latency)
- Solutions
- Instruction Buffers (Inst. TLB)
- Data Buffers (Data TLB)
- I-Fetch Latency
- Cache hierarchy
- Branch Prediction
32 Branch Prediction
33 Abstract
- What is Branch Prediction (BP)?
- Guessing the branch direction
- Branch direction
- Forward
- Backward
- Pre-fetch phenomenon
- Fetch before Need
- There are instructions that allow software to cope with locality of data (which is limited)
- Load Request Buffering
- Branch types
- Forward Conditional (+PC)
- Backward Conditional (-PC)
- Unconditional
34 Branch Prediction Types
- Static Prediction
- Statistical Analysis
- 4/1 Comparison
- 60% of Forward branches are taken
- 85% of Backward branches are taken (for the sake of LOOPs)
- So coding style is critical!
- Accuracy is not absolute!
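The heuristic behind these statistics is commonly implemented as "backward taken, forward not taken" (BTFN): a backward branch usually closes a loop, so predict it taken. A minimal sketch:

```python
def static_predict(branch_pc: int, target_pc: int) -> bool:
    # backward branch (target before the branch) -> predict taken (loop)
    # forward branch -> predict not taken
    return target_pc < branch_pc

assert static_predict(0x100, 0x0C0) is True   # loop back-edge
assert static_predict(0x100, 0x140) is False  # forward skip
```

Because the prediction depends only on branch direction, compilers (and coding style) can lay out code so that the common path matches the heuristic.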
- Dynamic Prediction
- BHB (Branch History Buffer) or Branch History Table (BHT)
- Indexed by address-bit portions (usually 2 bits) of recently taken branches
- Accuracy depends on the indexing bits
- Indexing bits depend on the bound of accessible memory
- BTB (Branch Target Buffer)
- Storing actual addresses of recently taken branches
35 Branch History Buffer (BHB)
- 1-bit indexing
- 1 = taken
- 0 = not taken
- Fault: long loops cause misprediction
- 2 bits to achieve more accuracy
- Misprediction
- Bits are inverted
- Relational Branches
- Global History Counter (GHC) or Two-level predictor
- Global Branch History (GBH)
- Updating other branches' bits
- Implementation
- GShare Algorithm
- GBHR (Global Branch History Register)
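A 2-bit BHT combined with the gshare indexing mentioned above can be sketched as follows; the table size and hashing are simplified assumptions, not a real design:

```python
class GsharePredictor:
    """2-bit saturating counters indexed by PC XOR global history."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # 1 = weakly not-taken
        self.ghr = 0  # Global Branch History Register (GBHR)

    def _index(self, pc: int) -> int:
        return (pc ^ self.ghr) & self.mask    # gshare hash

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2  # states 2,3 = taken

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        c = self.table[i]
        # saturating counter: never leaves the 0..3 range
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        # shift the outcome into the global history
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

p = GsharePredictor()
for _ in range(20):        # a loop branch that is always taken
    p.update(0x40, True)
assert p.predict(0x40) is True
```

The 2-bit hysteresis means a single odd outcome (like a loop exit) flips the counter only one step, so the next loop entry is still predicted correctly.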
36 Branch Target Buffer (BTB)
- Storing Instruction address
- Storing Target address
- Scope of Pre-fetching
- Recent Branch Addresses → Next PC
- Subroutines
- It's a branch, though!
- Return Addresses?
- Return Address Buffer (RAB)
- Caching recent return addresses
- Repetitive subroutines
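The Return Address Buffer is essentially a small hardware stack: calls push the return PC, returns pop the predicted target. A minimal sketch (the 16-entry depth is an illustrative assumption):

```python
class ReturnAddressBuffer:
    def __init__(self, depth: int = 16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc: int) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)        # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        # predicted return target; None means no prediction (underflow)
        return self.stack.pop() if self.stack else None

rab = ReturnAddressBuffer()
rab.on_call(0x1000)   # outer call
rab.on_call(0x2000)   # nested call
assert rab.on_return() == 0x2000  # inner return predicted first
assert rab.on_return() == 0x1000
```

The LIFO discipline is what makes this beat a BTB for returns: the same `ret` instruction targets a different address on every call site, but the stack always has the right one on top.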
37 Limits
- Cache Size
- Bus Rate
- Loop Detection
38 Out-of-Order (OOO) Execution
39 Abstract
- Steps
- Speculating related instruction blocks
- Executing related instruction blocks out of order in the execution units
- Committing results (data) or instructions in order
- Retiring instructions
40 OOO (or Dynamic/Speculative) Execution
- OOO or Speculative or Dynamic Execution
- They are the same, with different names
- How does the OOO process work?
- Fetching, Decoding (and implicit conversion to RISC instructions)
- Execution
- Instructions execute in whatever order best matches the available resources
- So they may be executed out of their original order
- The results will be committed back
- Write-back
- Stronger branch prediction is required!
- Memory bandwidth waste
- Execution time waste
- Power waste
41 Register Renaming
- Fact
- OOO-executed blocks can't access the same registers
- OOO execution needs more than the real registers to keep track of results
- Solution: having virtual registers
- Register Alias Table (RAT)
- It renames and maps GPRs to a set of temporary, chip-internal register locations
- Maximum number of instances of each register?
- 128 / 8 = 16
- Does it depend on the processor's bit width? No!
- Committed back when instructions are committed back!
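The RAT can be sketched as a map from architectural registers to physical registers: every new write grabs a fresh physical register, which removes false write-after-write conflicts. The register-file sizes are illustrative assumptions:

```python
class RegisterAliasTable:
    def __init__(self, n_arch: int = 8, n_phys: int = 32):
        # architectural register i initially lives in physical register i
        self.mapping = {r: r for r in range(n_arch)}
        self.free = list(range(n_arch, n_phys))

    def rename_dest(self, arch_reg: int) -> int:
        # a write allocates a fresh physical register for its result
        phys = self.free.pop(0)
        self.mapping[arch_reg] = phys
        return phys

    def lookup_src(self, arch_reg: int) -> int:
        # a read uses whichever physical register holds the newest value
        return self.mapping[arch_reg]

rat = RegisterAliasTable()
p1 = rat.rename_dest(1)  # first write to r1
p2 = rat.rename_dest(1)  # second write to r1: no WAW conflict
assert p1 != p2
assert rat.lookup_src(1) == p2  # readers see the newest mapping
```

Because both writes to r1 land in different physical registers, the two producing instructions can execute in either order; only retirement has to respect program order.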
42 Reorder Buffer (ROB) / Retirement Unit
- Roles
- Keeping track of Instruction Status (e.g., Available Data)
- Keeping track of Instruction State (e.g., Completed, Flushed, Refetched, etc.)
- Retiring instructions
- Instructions use the RAT
- Instructions will be dispatched to EUs (as data becomes available)
- Instructions will be queued in RSs (Reservation Stations)
- Instructions will be retired in their original order
- Performance improvement
- Results can be bypassed directly to another instruction's renamed registers
- Data hazards are limited
- The pipeline queue keeps moving in sequence
43 Recent Load/Store Buffer (RLSB)
- Speculative approach on data
- Store instructions
- Executing stores only once we are sure the data should be changed
- Stores will be buffered and committed
- Program order is maintained through the RLSB
- Load instructions
- They're more critical and latency-bound
- Speculative execution of loads
- OOO calculation of the EA
- OOO Cache Access
- Avoiding access conflicts
- Relational Cache Access (to know a store instruction is in the pipeline and not yet committed back)
- Bypassing Store results
- If a load requires a former store's result
- Saves load time
44 Evolutions
45 Drive Stage
- More parallelism
- Used to drive data across the microchip
- Overcoming IC Design
- Transmitting signals across wires
- Design is no longer concerned only with transistor speed
- Physical metal: Aluminum → Copper
- A paradox
- A deep pipeline has to run at a higher frequency to do the same work as a shorter pipeline
46 Geometric History Length (GEHL)
- Predictor Tables (Ti)
- Indexed with hash functions of the GBH and Branch Addresses
- Functions map branches to table entries
- Geometric series: L(i) = a^(i-1) × L(1)
- Counters (Ci)
- One counter is read from each predictor table
- Counters are Signed (S)
- Counter scope: number of tables (M)
- Prediction Calculation
- S = M/2 + Σ_{0 ≤ i < M} C(i)
- S ≥ 0 → Taken
- S < 0 → Not Taken
- Predictor Update
- Done only on a misprediction or when the sum is smaller than a threshold: |S| < θ
- Outcome Not Taken → C(i) = C(i) - 1
- Outcome Taken → C(i) = C(i) + 1
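The prediction and update rules above can be sketched as follows. The hash, table sizes, and threshold are simplified placeholders, not the parameters of the actual GEHL design:

```python
class GEHL:
    def __init__(self, n_tables: int = 4, theta: int = 4):
        self.tables = [dict() for _ in range(n_tables)]
        # geometric history lengths: L(i) = 2**(i-1) * L(1), with L(1) = 1
        self.lengths = [2 ** i for i in range(n_tables)]
        self.theta = theta
        self.history = []  # global branch history bits

    def _index(self, pc: int, i: int) -> int:
        h = 0
        for bit in self.history[-self.lengths[i]:]:
            h = (h << 1) | bit
        return (pc ^ h) & 0x3FF              # toy hash of PC and history

    def _sum(self, pc: int) -> int:
        s = len(self.tables) // 2            # S = M/2 + sum of C(i)
        for i, t in enumerate(self.tables):
            s += t.get(self._index(pc, i), 0)
        return s

    def predict(self, pc: int) -> bool:
        return self._sum(pc) >= 0            # S >= 0 -> Taken

    def update(self, pc: int, taken: bool) -> None:
        s = self._sum(pc)
        # update only on a misprediction or a low-confidence sum
        if (s >= 0) != taken or abs(s) <= self.theta:
            for i, t in enumerate(self.tables):
                k = self._index(pc, i)
                t[k] = t.get(k, 0) + (1 if taken else -1)
        self.history.append(int(taken))

g = GEHL()
for _ in range(10):
    g.update(0x10, False)    # branch is consistently not taken
assert g.predict(0x10) is False
```

Each table sees a geometrically longer slice of history, so short-range and long-range correlations both contribute to the signed sum that decides the prediction.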
47 Profile Propagation
- Software-based
- Map program sections between old and new versions of a program
- Matching .text and .data blocks
- Fuzzy Match
- Changing pairs (primary, secondary)
- Hardware-based
- ID match
- OPCode sequence matching
- uOP down-ride time matching (statistically)
- uOP Block matching
- Data value matching
- Data position matching
48 Overall Problems (a quick and brief overview of important problems in the topics discussed)
49 Overall Problems
- Instruction Fetch / Data Access latency
- Latency is still alive
- Solutions for I-Fetch latency: Branch prediction, Super-pipelining
- Wasted Cycles on Deeply-Pipelined Architectures
- More pipeline stages are good, but the performance hit is great when a misprediction occurs
- Flush back
- Low-rated Entropy
- Entropy vs. Block state
- SupILP-DLP (Super Instruction/Data-Level Parallelism)
- More Prediction accuracy
- More Pipeline stages
- More Superscalar execution
50 Proposals
51 Sequential Speculation
[Diagram: sequential speculation flow: node ONE branches "To Take (2)" and "To Take (4)" into TWO; a Miss leads "To Take (n2)" into THREE; Miss paths trigger "Hier. Back" steps and finally a TLB Entry Edit.]
52 Custom Storage
- Write/Read storage
- Signed by Directive
- Bus Bandwidth Limit
- Matrix objective
[Diagram: Data TLB and Inst. TLB alongside the L1 and L2 Caches, plus IWR and DWR custom storages; mixed, with different depths.]
53 Line Splitter Strategy
- Splitting and indexing dependent on attributes: attrib1, attrib2, ..., attrib n
- Storing in Custom Storages (Matrix), if free space is available (signed as not used by the software side)
- It's like Drive Stages
[Diagram: Custom Storage matrix view: the data or instruction flow is split by attributes attrib1 through attrib n across index lines 1 through n.]
54 Conclusion
- Latency is always alive!
- I think, sooner or later, software-based optimization scenarios (like PPO) are going to come into play! I just mentioned some.
- Hardware optimization is not just Clock Cycle Rate improvement. It includes more accurate branch prediction, wire attribute improvements, and of course calculus innovations, too! I just mentioned some, too.
55 References
- Intel Architecture Optimization Manual, Intel.com
- Intel IPP (Integrated Performance Primitives) manuals, Intel.com
- Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel.com
- Intel 64 and IA-32 Architectures Software Developer's Manual - Documentation Changes, Intel.com
- Intel 64 Architecture Memory Ordering White Paper, Intel.com
- Intel 64 and IA-32 Architectures Software Developer's Manual - Volume 3A - System Programming Guide, Part 1, Intel.com
- Intel Debugger (IDB) Manual, Intel.com
- Intel Hyper-Threading Technology Technical User's Guide, Intel.com
- Intel Math Kernel Library (Reference Manual), Intel.com
- Intel Pentium 4 Processor Optimization (Reference Manual), Intel.com
- Intel Pentium 4 Processor with 512-KB L2 Cache on 0.13 Micron Process Thermal Design Guidelines, Intel.com
- Intel SSE4 Programming Reference, Intel.com
- Intel Virtualization Technology (VT) in Converged Application Platforms, Intel.com
- Arstechnica.com resource website
- i386 - System V Application Binary Interface, Intel.com
- IA64 - Intel Itanium - System V Application Binary Interface, Intel.com
- ARM Architecture Reference Manual
56 Q&A: Any questions?
57 Thank You: "The important thing is not to stop questioning." - Albert Einstein