Title: Microarchitecture: Hows and Whys!
1 Microarchitecture: Hows and Whys!
- Saeed Beiki
- SIC, Intel Innovation Center (IIC)
- Dubai, Dubai Internet City (DIC), Microprocessor TechZone Staff Congress - November 2007
2 Who Am I?
- A Senior Information Contributor (SIC) at Intel Innovation Center, Dubai
- Involved in other fields that aren't related to the present subject
3 Agenda
- Memory Subsystem
- Segmented and Harvard Models
- Virtual Addresses and TLB
- General Principles
- Five Steps: IA, ID, EX, DA, WB
- Instruction Lifecycle
- RISC vs. CISC Architecture
- Register-Register vs. Register-Memory Architecture
- ILP
- What ILP Is?
- Pipelining
- Pipeline Hazards
- Block-Instructions, Pipeline Bubbles and Stall Cycles
- Locality of Reference
- Data Forwarding
- Limits
- Superscalar Processing
- Single-Issue vs. Multi-Issue Architecture
- Limits
4 Agenda (cont.)
- DLP
- What DLP Is?
- SIMD and MIMD vs. SISD
- SIMD Streaming Types
- Limits
- Branch Prediction
- What BP Is?
- Types
- Static
- Dynamic (BTB/BHT)
- BHB
- BTB
- RAB
- Limits
- OOO (Out-of-Order) or Dynamic Execution
- Speculation phenomenon
- Register-Renaming and RAT
- ROB
- RLSB
5 Agenda (cont.)
- Evolutions (selected)
- Drive Stages
- GEHL
- Profile Propagation
- Overall Problems
- Instruction Fetch / Data Access latency
- Wasted-Cycles on Deeply-Pipelined Architectures
- Low-rated Entropy
- SupILP-DLP
- My Proposals
- Sequential Speculation
- Custom Storage
- Line Splitter Strategy
- Q&A
- Estimated Time: 2:15 hours
6 Memory Subsystem
7 Segmented and Harvard Models
- The memory is split into exclusive parts (segments) for any computing resource identity
- CODE: Text (or Code)
- DATA: Stack, Heap, BSS, Data
- Effect of Register Length and Register State
- Registers and Base Locations and Indexing
- Properties of segments
- CODE: Instructions, RO
- DATA
- Data: Global, Static, Initialized, Writable, Fixed
- BSS: Global, Static, Uninitialized, Writable, Fixed
- Heap: Other Data Types, RunTime-Alloc, Writable, Non-Fixed
- Stack: Other Data Types, Temporary, Writable, Abstract, Non-Fixed; LIFO or FIFO types
- Harvard model?
8 Virtual Address
- Access State
- Object Hierarchy
- Translation Look-aside Buffer (TLB)
- Memory
- Disk
- TLB-Miss
- Memory-Miss
- Disk-Failure
- TLB Size Calculation
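One way to make the "TLB Size Calculation" bullet concrete is the TLB reach: how much memory a fully populated TLB covers before a miss is forced. A minimal sketch in Python; the 64-entry and 4 KiB figures are illustrative assumptions, not numbers from the talk.

```python
def tlb_reach(entries: int, page_size: int) -> int:
    """Memory (in bytes) covered by a fully populated TLB."""
    return entries * page_size

# Illustrative: 64 entries mapping 4 KiB pages cover 256 KiB,
# so a working set larger than that starts taking TLB misses.
print(tlb_reach(64, 4096))  # 262144
```

Larger pages raise the reach without adding entries, which is the usual motivation for huge-page support.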
9 General Principles
10 Five Key Steps in Instruction Processing
- Any computer system should perform the required steps below to execute an instruction completely and successfully
- Instruction Access (IA)
- Instruction Decode (ID)
- Execution (EX)
- Data Access (DA)
- Write-Back (WB)
11 Instruction Access (IA)
- Read from Memory Subsystem
- Indicated by Program-Counter (PC)
- Only Physical Addresses!? Yes, TLB!
- Can it be equated with Instruction Fetch (IF)? Nope!
- Latency is alive
- Latency in Translation
- Latency in Reading/Fetching
12 Instruction Decode (ID)
- Control Information
- Operation Code (OPCode)
- Immediate Data
- Embedded
- Consequence
- Target Address
- Instruction, e.g., Branch instruction(s)
- Data, e.g., Load/Store instruction(s)
- Register Data Fetch
- CISC to RISC conversion
13 Execute (EX)
- Perspectives
- Mathematics operations
- Movement operations
- Effective Address (EA)
- Address Offsets
- Indirect Memory Reference
- Register Address-housing
- Execution Units
- Memory units
- Arithmetic units
- Execution Core?
- Flush (back)
14 Data Access (DA)
- Give your ticket, take your data!
- Address Bus
- Data Bus
- Control Bus
- So what about Memory Store instructions?
15 Write-back (WB)
- What is Write-back?
- Data Load/ Data Store
- Types
- Register
- Memory
- Disk
- Register-Memory vs. Register-Register
16 So what is IT, then?
- Instruction Translation (IT)
- OS-side
- Assemblers
- Stateful IT
- Pre-Explored IT (PEIT)
- Roles of Directives
- Roles of Memory layout
- Processor-based IT
- Note: So our five-stage model will be modified and will run into challenges with some complexities.
17 Instruction Lifecycle
[Diagram: instruction lifecycle: Load, Decode, Fetch, Execution, Write-back, Retire, with a Flush path looping back]
- Facts
- Reload
- Pre-Fetch
- OOO Execution
- In-Order Retire
- Infinite flush hazards
18 RISC vs. CISC Architecture
- RISC (Reduced Instruction Set Computer)
- CISC (Complex Instruction Set Computer)
- Execution speed
- Memory Parsing
- Symbol states
- Conversion Rule
- Every modern CISC processor converts its instructions internally to RISC-like operations
19 Register-Register vs. Register-Memory
- Register-Register (RISC)
- Only Load/Store operations can access Memory
- Register-Memory (CISC)
- Designated units (like the ALU) can access Memory
- Contrast: Register-Memory vs. the RISC principle
- Architectural Contract
- Contrast of CISC-to-RISC Conversion
- Solution: Extra RISC Instructions
- Clock Rate state
- Hardware Optimization state
20 Instruction-Level Parallelism (ILP)
21 Abstract
- What is ILP?
- Executing more instructions at the same time, in parallel
- Single-core ILP
- Multi-core ILP
- Methods
- Pipelining
- Superscalar
- Super-pipelining
- Combinations
22 Pipelining
[Diagram: Normal (non-pipelined) execution: one instruction passes through IA, ID, EX, DA, WB over cycles 1-5 before the next begins. Pipelined execution: a new instruction enters IA each cycle, so the five stages of successive instructions overlap.]
- First instruction: 5 cycles
- Subsequent instructions: 1-cycle apparent latency
- Still, each one has 5 stages (5 cycles)
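The cycle counts above can be checked with a tiny model: an ideal pipeline finishes the first instruction after one full pass through the stages, then retires one instruction per cycle. A sketch, ignoring hazards and stalls:

```python
def unpipelined_cycles(n_instr: int, n_stages: int = 5) -> int:
    # every instruction runs all stages before the next one starts
    return n_instr * n_stages

def pipelined_cycles(n_instr: int, n_stages: int = 5) -> int:
    # first instruction fills the pipe; the rest complete 1 cycle apart
    return n_stages + (n_instr - 1)

print(unpipelined_cycles(10))  # 50
print(pipelined_cycles(10))    # 14
```

As the instruction count grows, the speedup approaches the number of stages, which is why deeper pipelines look attractive until hazards enter the picture.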
23 Pipelining (cont.)
- Pipeline Hazard?
- Hazard types
- Data hazard, aka Data Dependency (e.g., unavailable data access)
- Control hazard (e.g., pipelined branch instructions)
- Structural hazard (e.g., instruction conflicts, same-time access (STA) problem)
- This is why we separate instruction and data flow ports
- Stall (freeze) time
- Wait to load, wait to fetch, wait to execute
- Pipeline Bubble
- Blocked pipes
- A related set of instructions
- Non-blocked pipes
24 Pipelining (cont.)
- Locality of Reference
- Data Bypassing technique
- Data Splitting technique
- Pipelining Limits
- Deep pipelines are more prone to hazards
- Solution
- Hyper-pipelining (super-pipelining)
- Circuit design
- Hazard cranking
- Circuit skew
- Energy (Watt) limitations
- Latch and setup-and-hold time
25 Superscalar Processing
- Single-issue architecture
- One clock cycle, one instruction
- Superscalar architecture
- Fetch Bandwidth
- Limits
- Data hazards
- Duplicating the hardware?
26 Putting It Together!
- Duplication is still here
- Pipeline duplication
- OOO entrance
- Execution types
- Floating point
- Integer
- media
- Media instructions
- Video processing
- Audio processing
- Solutions: 3DNow! (AMD), SSE, and AVX
27 Data-Level Parallelism (DLP)
28 Abstract
- What is DLP?
- Objective of DLP
- For the sake of media instructions
- Requirements
- More data needs to be accessed (DA) for a single instruction
- One instruction is repeated over a data set
- Data elements are commonly independent
29 SIMD and MIMD vs. SISD
[Diagram: SISD: one instruction stream drives one data stream through a single EX unit. SIMD: one instruction stream drives multiple data streams through parallel EX units. MIMD: multiple instruction streams each drive their own data stream and EX unit.]
- Objectives and Limits
- SISD Single Instruction, Single Data
- SIMD Single Instruction, Multiple Data
- MIMD Multiple Instruction, Multiple Data
- Oops! What about MISD? Is that applicable? And why should it be implemented?
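The SISD/SIMD contrast above can be sketched in plain Python: SISD issues one add per element, while SIMD conceptually issues one vector add per group of independent lanes. The 4-lane width is an illustrative assumption.

```python
LANES = 4  # hypothetical vector width

def sisd_add(a, b):
    # one instruction, one data pair per step
    return [x + y for x, y in zip(a, b)]

def simd_add(a, b, lanes=LANES):
    out = []
    for i in range(0, len(a), lanes):
        # each iteration stands for ONE vector instruction that adds
        # `lanes` independent element pairs at once
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
    return out

a, b = list(range(8)), list(range(8))
assert sisd_add(a, b) == simd_add(a, b)  # same result, fewer "instructions"
```

The results are identical; the win is that the SIMD version needs only `len(a) / lanes` instruction issues, which is exactly why the lanes must be independent.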
30 SIMD Streaming Types
- Multiple data Load/Store operations
- Timing constraints of media type nature
- Data Cache
31 Faults and Limits
- Data underflow
- More and More Memory pressure (latency)
- Solutions
- Instruction Buffers (Inst. TLB)
- Data Buffers (Data TLB)
- I-Fetch Latency
- Cache hierarchy
- Branch Prediction
32 Branch Prediction
33 Abstract
- What is Branch Prediction (BP)?
- Guessing the branch direction
- Branch direction
- Forward
- Backward
- Pre-fetch phenomenon
- Fetch before Need
- There are instructions that allow software to cope with locality of data (which is limited)
- Load Request Buffering
- Branch types
- Forward Conditional (+PC)
- Backward Conditional (-PC)
- Unconditional
34 Branch Prediction Types
- Static Prediction
- Statistical Analysis
- 4/1 Comparison
- 60% of Forward branches are taken
- 85% of Backward branches are taken (for the sake of LOOPs)
- So coding style is critical!
- Accuracy is not absolute!
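The heuristic behind these statistics is commonly implemented as "backward taken, forward not taken" (BTFN): a backward branch usually closes a loop, so predict it taken. A minimal sketch:

```python
def static_predict(branch_pc: int, target_pc: int) -> bool:
    # backward branch (target before the branch) -> predict taken (loop)
    # forward branch -> predict not taken
    return target_pc < branch_pc

assert static_predict(0x100, 0x0C0) is True   # loop back-edge
assert static_predict(0x100, 0x140) is False  # forward skip
```

Because the prediction depends only on branch direction, compilers (and coding style) can lay out code so that the common path matches the heuristic.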
- Dynamic Prediction
- BHB (Branch History Buffer) or Branch History Table (BHT)
- Indexed by address-bit portions (usually 2 bits) of recently taken branches
- Accuracy depends on the indexing bits
- Indexing bits depend on the bound of accessible memory
- BTB (Branch Target Buffer)
- Storing actual addresses of recently taken branches
35 Branch History Buffer (BHB)
- 1-bit indexing
- 1 = taken
- 0 = not taken
- Fault: long loops cause misprediction
- 2 bits to achieve more accuracy
- Misprediction
- Bits are inverted
- Relational Branches
- Global History Counter (GHC) or Two-level predictor
- Global Branch History (GBH)
- Updating other branches' bits
- Implementation
- GShare Algorithm
- GBHR (Global Branch History Register)
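A 2-bit BHT combined with the gshare indexing mentioned above can be sketched as follows; the table size and hashing are simplified assumptions, not a real design:

```python
class GsharePredictor:
    """2-bit saturating counters indexed by PC XOR global history."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # 1 = weakly not-taken
        self.ghr = 0  # Global Branch History Register (GBHR)

    def _index(self, pc: int) -> int:
        return (pc ^ self.ghr) & self.mask    # gshare hash

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2  # states 2,3 = taken

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        c = self.table[i]
        # saturating counter: never leaves the 0..3 range
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        # shift the outcome into the global history
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

p = GsharePredictor()
for _ in range(20):        # a loop branch that is always taken
    p.update(0x40, True)
assert p.predict(0x40) is True
```

The 2-bit hysteresis means a single odd outcome (like a loop exit) flips the counter only one step, so the next loop entry is still predicted correctly.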
36 Branch Target Buffer (BTB)
- Storing Instruction address
- Storing Target address
- Scope of Pre-fetching
- Recent Branch Addresses → Next PC
- Subroutines
- It's a branch, though!
- Return Addresses?
- Return Address Buffer (RAB)
- Caching recent return addresses
- Repetitive subroutines
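The Return Address Buffer is essentially a small hardware stack: calls push the return PC, returns pop the predicted target. A minimal sketch (the 16-entry depth is an illustrative assumption):

```python
class ReturnAddressBuffer:
    def __init__(self, depth: int = 16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc: int) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)        # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        # predicted return target; None means no prediction (underflow)
        return self.stack.pop() if self.stack else None

rab = ReturnAddressBuffer()
rab.on_call(0x1000)   # outer call
rab.on_call(0x2000)   # nested call
assert rab.on_return() == 0x2000  # inner return predicted first
assert rab.on_return() == 0x1000
```

The LIFO discipline is what makes this beat a BTB for returns: the same `ret` instruction targets a different address on every call site, but the stack always has the right one on top.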
37 Limits
- Cache Size
- Bus Rate
- Loop Detection
38 Out-of-Order (OOO) Execution
39 Abstract
- Steps
- Speculating related instruction blocks
- Executing related instruction blocks out of order in the execution units
- Committing results (data) or instructions in order
- Retiring instructions
40 OOO (or Dynamic/Speculative) Execution
- OOO or Speculative or Dynamic Execution
- They are the same, with different names
- How does the OOO process work?
- Fetching, Decoding (and implicit conversion to RISC instructions)
- Execution
- Instructions execute in whatever order best matches the available resources
- So they may be executed out of their original order
- The results will be committed back
- Write-back
- Stronger branch prediction is required!
- Memory bandwidth waste
- Execution time waste
- Power waste
41 Register Renaming
- Fact
- OOO-executed blocks can't access the same registers
- OOO execution needs more than the real registers to keep track of results
- Solution: having virtual registers
- Register Alias Table (RAT)
- It renames and maps GPRs to a set of temporary, chip-internal register locations
- Maximum number of instances of each register?
- 128 / 8 = 16
- Does it depend on the processor's bit width? No!
- Committed back when instructions are committed back!
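The RAT can be sketched as a map from architectural registers to physical registers: every new write grabs a fresh physical register, which removes false write-after-write conflicts. The register-file sizes are illustrative assumptions:

```python
class RegisterAliasTable:
    def __init__(self, n_arch: int = 8, n_phys: int = 32):
        # architectural register i initially lives in physical register i
        self.mapping = {r: r for r in range(n_arch)}
        self.free = list(range(n_arch, n_phys))

    def rename_dest(self, arch_reg: int) -> int:
        # a write allocates a fresh physical register for its result
        phys = self.free.pop(0)
        self.mapping[arch_reg] = phys
        return phys

    def lookup_src(self, arch_reg: int) -> int:
        # a read uses whichever physical register holds the newest value
        return self.mapping[arch_reg]

rat = RegisterAliasTable()
p1 = rat.rename_dest(1)  # first write to r1
p2 = rat.rename_dest(1)  # second write to r1: no WAW conflict
assert p1 != p2
assert rat.lookup_src(1) == p2  # readers see the newest mapping
```

Because both writes to r1 land in different physical registers, the two producing instructions can execute in either order; only retirement has to respect program order.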
42 Reorder Buffer (ROB) / Retirement Unit
- Roles
- Keeping track of Instruction Status (e.g., Available Data)
- Keeping track of Instruction State (e.g., Completed, Flushed, Refetched, etc.)
- Retiring instructions
- Instructions use the RAT
- Instructions will be dispatched to EUs (as data becomes available)
- Instructions will be queued in RSs (Reservation Stations)
- Instructions will be retired in their original order
- Performance improvement
- Results can be bypassed directly to another instruction's renamed registers
- Data hazards are limited
- The pipeline queue keeps moving in sequence
43 Recent Load/Store Buffer (RLSB)
- Speculative approach on data
- Store instructions
- Executing stores only once we are sure the data should be changed
- Stores will be buffered and committed
- Program order is maintained through the RLSB
- Load instructions
- They're more critical and latency-bound
- Speculative execution of loads
- OOO calculation of the EA
- OOO Cache Access
- Avoiding access conflicts
- Relational Cache Access (to know a store instruction is in the pipeline and not yet committed back)
- Bypassing Store results
- If a load requires a former store's result
- Saves load time
44 Evolutions
45 Drive Stage
- More parallelism
- Used to drive data across the microchip
- Overcoming IC Design
- Transmitting signals across wires
- Design is no longer concerned only with transistor speed
- Physical metal: Aluminum → Copper
- A paradox
- A deep pipeline has to run at a higher frequency to do the same work as a shorter pipeline
46 Geometric History Length (GEHL)
- Predictor Tables (Ti)
- Indexed with hash functions of the GBH and Branch Addresses
- Functions map branches to table entries
- Geometric series: L(i) = a^(i-1) × L(1)
- Counters (Ci)
- One counter is read from each predictor table
- Counters are Signed (S)
- Counter scope: number of tables (M)
- Prediction Calculation
- S = M/2 + Σ_{0 ≤ i < M} C(i)
- S ≥ 0 → Taken
- S < 0 → Not Taken
- Predictor Update
- Done only on a misprediction or when the sum is smaller than a threshold: |S| < θ
- Outcome Not Taken → C(i) = C(i) - 1
- Outcome Taken → C(i) = C(i) + 1
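The prediction and update rules above can be sketched as follows. The hash, table sizes, and threshold are simplified placeholders, not the parameters of the actual GEHL design:

```python
class GEHL:
    def __init__(self, n_tables: int = 4, theta: int = 4):
        self.tables = [dict() for _ in range(n_tables)]
        # geometric history lengths: L(i) = 2**(i-1) * L(1), with L(1) = 1
        self.lengths = [2 ** i for i in range(n_tables)]
        self.theta = theta
        self.history = []  # global branch history bits

    def _index(self, pc: int, i: int) -> int:
        h = 0
        for bit in self.history[-self.lengths[i]:]:
            h = (h << 1) | bit
        return (pc ^ h) & 0x3FF              # toy hash of PC and history

    def _sum(self, pc: int) -> int:
        s = len(self.tables) // 2            # S = M/2 + sum of C(i)
        for i, t in enumerate(self.tables):
            s += t.get(self._index(pc, i), 0)
        return s

    def predict(self, pc: int) -> bool:
        return self._sum(pc) >= 0            # S >= 0 -> Taken

    def update(self, pc: int, taken: bool) -> None:
        s = self._sum(pc)
        # update only on a misprediction or a low-confidence sum
        if (s >= 0) != taken or abs(s) <= self.theta:
            for i, t in enumerate(self.tables):
                k = self._index(pc, i)
                t[k] = t.get(k, 0) + (1 if taken else -1)
        self.history.append(int(taken))

g = GEHL()
for _ in range(10):
    g.update(0x10, False)    # branch is consistently not taken
assert g.predict(0x10) is False
```

Each table sees a geometrically longer slice of history, so short-range and long-range correlations both contribute to the signed sum that decides the prediction.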
47 Profile Propagation
- Software-based
- Map program sections between old and new versions of a program
- Matching .text and .data blocks
- Fuzzy Match
- Changing pairs (primary, secondary)
- Hardware-based
- ID match
- OPCode sequence matching
- uOP down-ride time matching (statistically)
- uOP Block matching
- Data value matching
- Data position matching
48 Overall Problems (a quick and brief overview of important problems in the topics discussed)
49 Overall Problems
- Instruction Fetch / Data Access latency
- Latency is still alive
- Solutions for I-Fetch latency: Branch prediction, Super-pipelining
- Wasted Cycles on Deeply-Pipelined Architectures
- More pipeline stages are good, but the performance hit is great when a misprediction occurs
- Flush back
- Low-rated Entropy
- Entropy vs. Block state
- SupILP-DLP (Super Instruction/Data-Level Parallelism)
- More Prediction accuracy
- More Pipeline stages
- More Superscalar execution
50 Proposals
51 Sequential Speculation
[Diagram: sequential speculation flow: node ONE branches "To Take (2)" and "To Take (4)" into TWO; a Miss leads "To Take (n2)" into THREE; Miss paths trigger "Hier. Back" steps and finally a TLB Entry Edit.]
52 Custom Storage
- Write/Read storage
- Signed by Directive
- Bus Bandwidth Limit
- Matrix objective
[Diagram: Data TLB and Inst. TLB alongside the L1 and L2 Caches, plus IWR and DWR custom storages; mixed, with different depths.]
53 Line Splitter Strategy
- Splitting and indexing dependent on attributes: attrib1, attrib2, ..., attrib n
- Storing in Custom Storages (Matrix), if free space is available (signed as not used by the software side)
- It's like Drive Stages
[Diagram: Custom Storage matrix view: the data or instruction flow is split by attributes attrib1 through attrib n across index lines 1 through n.]
54 Conclusion
- Latency is always alive!
- I think, sooner or later, software-based optimization scenarios (like PPO) are going to come into play! I just mentioned some.
- Hardware optimization is not just Clock Cycle Rate improvement. It includes more accurate branch prediction, wire attribute improvements, and of course calculus innovations, too! I just mentioned some, too.
55 References
- Intel Architecture Optimization Manual, Intel.com
- Intel IPP (Integrated Performance Primitives) manuals, Intel.com
- Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel.com
- Intel 64 and IA-32 Architectures Software Developer's Manual - Documentation Changes, Intel.com
- Intel 64 Architecture Memory Ordering White Paper, Intel.com
- Intel 64 and IA-32 Architectures Software Developer's Manual - Volume 3A - System Programming Guide, Part 1, Intel.com
- Intel Debugger (IDB) Manual, Intel.com
- Intel Hyper-Threading Technology Technical User's Guide, Intel.com
- Intel Math Kernel Library (Reference Manual), Intel.com
- Intel Pentium 4 Processor Optimization (Reference Manual), Intel.com
- Intel Pentium 4 Processor with 512-KB L2 Cache on 0.13 Micron Process Thermal Design Guidelines, Intel.com
- Intel SSE4 Programming Reference, Intel.com
- Intel Virtualization Technology (VT) in Converged Application Platforms, Intel.com
- Arstechnica.com resource website
- i386 - System V Application Binary Interface, Intel.com
- IA64 - Intel Itanium - System V Application Binary Interface, Intel.com
- ARM Architecture Reference Manual
56 Q&A: Any questions?
57 Thank You: "The important thing is not to stop questioning." - Albert Einstein