Title: SPEAR: Speculative Pre-Execution Assisted by CompileR
SPEAR: Speculative Pre-Execution Assisted by CompileR
- Jean-Luc Gaudiot
- University of California, Irvine
- Won Woo Ro
- California State University, Northridge
E-Seminar, IFIP Working Group 10.3, September 7, 2005
PASCAL: PArallel Systems and Computer Architecture Lab.
University of California, Irvine
Outline
- High-Performance Processor Design Issues
  - Instruction Level Parallelism
  - Memory Wall Problem on Superscalar
- Background
  - Speculative Pre-Execution
  - Single-Chip Multithreading
- SPEAR: Speculative Pre-Execution Assisted by Compiler
  - Design Characteristics
  - Hardware/Software Descriptions
- Evaluation and Analysis
- Summary
High-Performance Microprocessor Design
- Instruction Level Parallelism
  - Multiple instruction issue and out-of-order execution
  - Aggressive branch prediction and speculative execution
  - Examples: superscalar and VLIW
- Various processors for Thread Level Parallelism
  - Chip Multiprocessor and Simultaneous Multithreading
  - Extracting more ILP by exploiting TLP
- Challenge
  - Increasing performance gap between processor and memory
Challenge: the Memory Wall Problem
[Figure: Processor-Memory Performance Gap. Processor performance improves roughly 55%/year while memory improves roughly 7%/year. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach (2003).]
- Technological trend
  - Memory latency is getting longer relative to microprocessor speed
- Problems
  - Considerable performance degradation upon cache misses
  - Pipeline stalls during memory accesses
- Today's applications contain more
  - irregular memory accesses (e.g., pointer chasing)
  - memory-related operations (data-intensive applications)
Limitations of Current Approaches
- Cache
  - Works well only if there is locality
  - Weak for large data sets and irregular memory accesses
- Cache prefetching
  - Depends on predictability
  - Hardware prefetching: based on past and present execution-time behavior
  - Software prefetching: hard to insert prefetches for irregular access patterns
- Latency-hiding techniques (SMT)
  - Enhance utilization and throughput at the thread level
  - No latency reduction for a single-threaded program
- Solution: cache prefetching by early execution of the future miss-causing load instructions (see the sketch below)
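As a concrete illustration (not from the talk) of why software prefetching struggles with irregular accesses: in a pointer chase, the prefetch address is itself produced by the miss-causing load, so a conventional prefetch cannot be hoisted far ahead. A minimal C sketch using GCC's __builtin_prefetch:

```c
#include <stddef.h>

struct node { struct node *next; long val; };

/* The address n->next is produced by the very load that misses, so
 * the prefetch runs at best one hop ahead of the demand access.
 * Pre-executing the whole address-generating slice avoids this. */
long sum_list(struct node *n) {
    long s = 0;
    while (n != NULL) {
        __builtin_prefetch(n->next);   /* only one hop ahead */
        s += n->val;
        n = n->next;
    }
    return s;
}
```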
Early Scheduling of Load Instructions: Static or Dynamic?
- Scheduling of load instructions is crucial to hiding memory latency
  - Reduces pipeline stalls
- Static or dynamic scheduling?
- Current ILP architectures
  - VLIW: static instruction scheduling at compile time
  - Superscalar: dynamic instruction scheduling at run time
  - Decoupled architectures are in between
- Trade-off: hardware complexity vs. compiler compatibility
[Figure: where the scheduling work happens. Starting from the high-level language, VLIW performs dependency analysis and resource binding in the compiler front-end; a decoupled architecture separates instructions by functionality; superscalar defers the work to the hardware back-end (instruction window and wake-up logic, functional-unit assignment, instruction execution).]
Superscalar: Why Have a Large Scheduling Window?
- Effective for finding more independent instructions to execute
- How can dynamic scheduling address the long-memory-latency problem?
  - A large window helps uncover independent instructions in order to hide the long memory latency
  - Dynamic scheduling performs a limited form of dynamic preloading
[Figure: the instruction scheduling window, with instructions entering ("last in") and leaving ("first in").]
Limitation of the Scheduling Window
- Dynamic scheduling in superscalar processors
  - Centralized, atomic structure
  - Globally accessed and cannot be pipelined
- The issue window and bypass logic will be the most critical parts of future superscalar architectures
  - Palacharla, Jouppi, and Smith, "Complexity-Effective Superscalar Processors," ISCA-24, 1997
- The long wire delay of a centralized design will be a major performance bottleneck
  - Agarwal, Hrishikesh, Keckler, and Burger, "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures," ISCA-27, 2000
  - The complexity of the scheduling window impacts cycle time
- Hence: decentralized design and static scheduling

Summary: we aim at achieving early execution of load instructions by static scheduling of extra threads.
Outline
- High-Performance Processor Design Issues
  - Instruction Level Parallelism
  - Memory Wall Problem
- Background
  - Speculative Pre-Execution
  - Single-Chip Multithreading
- SPEAR: Speculative Pre-Execution Assisted by Compiler
  - Design Characteristics
  - Hardware/Software Descriptions
- Evaluation and Analysis
- Summary
Access/Execute Decoupled Architectures
- Two separate executions of the sequential instruction stream
  - Memory access and computation
- Motivation
  - Exploiting ILP between the two streams
  - The slip distance between the streams provides memory latency hiding
- Executing the two instruction streams on separate processors
  - Out-of-order execution between the two streams
  - Communication through queues
- Examples: DAE, PIPE, ZS-1, and WM
[Figure: decoupled organization. The Access Processor runs the access instructions against main memory and communicates with the Execute Processor, which runs the execute instructions, through the Load Data Queue and the Store Data Queue.] A toy model of the queue-based slip follows.
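A toy single-threaded model of the idea, assuming a bounded FIFO as the load data queue (the names and sizes are illustrative, not from any real decoupled machine): the access stream slips ahead of the execute stream by up to the queue depth, and that slack is what hides load latency.

```c
#include <stdio.h>

#define QSZ 8     /* load data queue depth = maximum slip distance */
#define N   32

int main(void) {
    double a[N], ldq[QSZ];
    int head = 0, tail = 0, count = 0, fetched = 0;
    double sum = 0.0;

    for (int i = 0; i < N; i++) a[i] = 0.5 * i;

    for (int consumed = 0; consumed < N; consumed++) {
        /* access stream: run ahead while the queue has room */
        while (fetched < N && count < QSZ) {
            ldq[tail] = a[fetched++];      /* "load" into the queue */
            tail = (tail + 1) % QSZ;
            count++;
        }
        /* execute stream: consume one queued value per step */
        sum += ldq[head];
        head = (head + 1) % QSZ;
        count--;
    }
    printf("sum = %f\n", sum);
    return 0;
}
```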
What is Speculative Pre-Execution?
- A further extension of the traditional access/execute decoupled architecture
  - Instead of decoupling and pre-executing every load instruction, it pre-executes only the miss-causing load instructions
- A data prefetching method that identifies and pre-executes the future miss-causing load instructions and their backward slices
  - A backward slice contains the instructions on which those load instructions are data-dependent
- Speculative execution of the prefetching thread on a spare hardware context provides effective cache prefetching
Speculative Pre-Execution
- Define the probable cache-miss instructions (delinquent loads) and their backward slices as the prefetching thread (p-thread)
  - A cache access profile is used to identify the delinquent loads
- Execute the p-thread as an auxiliary thread in a multithreaded manner
  - The p-thread is triggered at run time
  - It is lightweight and can run faster than the normal thread (see the example below)
[Figure: delinquent loads and their backward slices marked in the instruction stream.]
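A hand-worked illustration of what a p-thread looks like (my sketch in C; SPEAR itself extracts slices from the binary): the backward slice keeps only the address-generating instructions of the delinquent load, so the p-thread can run ahead and warm the cache.

```c
#include <stddef.h>

typedef struct node { struct node *next; long payload[15]; } node_t;

/* Main thread: the payload access is the delinquent load. */
long sum(node_t *head) {
    long s = 0;
    for (node_t *n = head; n != NULL; n = n->next)
        s += n->payload[0];                     /* delinquent load */
    return s;
}

/* p-thread: only the backward slice of the delinquent load survives
 * (the pointer chase and the address computation); the accumulation
 * is dropped because the load does not depend on it. */
void p_thread(node_t *head) {
    for (node_t *n = head; n != NULL; n = n->next)
        (void)*(volatile long *)&n->payload[0]; /* touch the line */
}
```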
Working Example of Speculative Pre-Execution

    for (k = 0; k < N; k++)
        x[k] = q + y[k] * (r * z[k+10] + t * z[k+11]);

Dynamic instruction stream for the main program:

    iteration i:
        lw    $24, 24($sp)
        mul   $25, $24, 8
        la    $8, z
        addu  $9, $25, $8
        l.d   $f16, 88($9)
        l.d   $f18, 0($sp)
        mul.d $f4, $f16, $f18
        l.d   $f6, 8($sp)
        l.d   $f8, 80($9)
        mul.d $f10, $f6, $f8
        add.d $f16, $f4, $f10
        la    $10, y
        addu  $11, $25, $10
        l.d   $f18, 0($11)
        mul.d $f6, $f16, $f18
        l.d   $f8, 16($sp)
        add.d $f4, $f6, $f8
        la    $12, x
        addu  $13, $25, $12
        s.d   $f4, 0($13)
        lw    $14, 24($sp)
        addu  $15, $14, 1
        sw    $15, 24($sp)
        blt   $15, 1024, 33

    iteration i+1:
        lw    $24, 24($sp)
        mul   $25, $24, 8
        la    $8, z
        addu  $9, $25, $8
        l.d   $f16, 88($9)
        l.d   $f18, 0($sp)
        mul.d $f4, $f16, $f18
        l.d   $f6, 8($sp)
        l.d   $f8, 80($9)
        mul.d $f10, $f6, $f8
        add.d $f16, $f4, $f10
        la    $10, y
        addu  $11, $25, $10
        l.d   $f18, 0($11)
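For this loop, a plausible p-thread, assuming y and z are global double arrays (my sketch, not the slide's actual slice), keeps only the induction variable and the address arithmetic feeding the likely-missing loads:

```c
extern double y[], z[];   /* assumed globals, matching the loop above */

/* Hypothetical p-thread for the kernel: touch the loads that are
 * likely to miss and drop all floating-point computation. */
void p_thread(int k, int n) {
    for (; k < n; k++) {
        (void)*(volatile double *)&z[k + 10];
        (void)*(volatile double *)&z[k + 11];
        (void)*(volatile double *)&y[k];
    }
}
```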
Executing the P-thread in a Multithreaded Manner
[Figure: over time, the p-thread runs alongside and ahead of the main program flow.]
- Parallel execution of the p-thread and the main program thread
- Uses hardware multithreading such as SMT or CMP
- Either reduces the cache misses or reduces the cache-miss penalty
Single-Chip (On-Chip) Multithreading
- Two approaches to multithreaded execution
  - Simultaneous Multithreading (SMT) and Chip Multiprocessor (CMP)
  - Both support simultaneous execution of multiple threads
  - SMT supports multiple contexts on a superscalar core
  - CMP arranges multiple small processing units on a chip
- The differences lie in resource sharing
  - In SMT, every structure (instruction fetch queue, scheduling window, reorder buffer, registers, and functional units) is shared between threads
    - Better utilization and resource sharing
  - In CMP, every processing unit has dedicated resources
    - It is implemented with simpler atomic hardware structures (hence a higher clock rate)
Outline
- High-Performance Processor Design Issues
  - Instruction Level Parallelism
  - Memory Wall Problem on Superscalar
- Background
  - Speculative Pre-Execution
  - Single-Chip Multithreading
- SPEAR: Speculative Pre-Execution Assisted by Compiler
  - Design Characteristics
  - Hardware/Software Descriptions
- Evaluation and Analysis
- Summary
Various Implementations of Speculative Pre-Execution
[Figure: three implementation paths, from software to hardware. (1) An SP compiler (source-to-source) transforms the high-level source code into a main program plus p-thread that run on an SMT processor. (2) An assembly/binary translator derives the main program and p-thread from an existing binary executable. (3) A hardware-based scheme constructs the p-thread inside the execution core at run time. The first two are software-based static approaches; the third is the hardware-based dynamic approach.]
Related Work: Speculative Pre-Execution

Software-Based Approach (static, compile-time), working from source code or binary through a compiler or binary analyzer to a p-thread:
- Software-Controlled Pre-Execution (Luk, ISCA 2001): analysis of the high-level-language program, not automated
- Software-Based Speculative Precomputation (Liao et al., PLDI 2002): post-pass binary adaptation, Itanium
- Compiler Algorithms for Pre-Execution (Kim and Yeung, ASPLOS 2002): analysis of the high-level-language program, automated compiler

Hardware-Based Approach (dynamic, run-time), constructing the p-thread from the I-cache, instruction queue, or reorder buffer:
- Dependence Graph Precomputation (Annavaram et al., ISCA 2001): H/W based, front-end draws the dependence graph
- Slice Processors (Moshovos et al., ICS 2001): H/W based, back-end driven
- Dynamic Speculative Precomputation (Collins et al., MICRO 2001): H/W based, back-end driven, chaining p-slices
Two Main Operations of Speculative Pre-Execution
1. Construct the p-thread
2. Trigger the p-thread (at run time)
SPEAR: Speculative Pre-Execution Assisted by compileR
[Figure: design space. P-thread construction can be hardware-based or compiler-based; p-thread triggering can be hardware-controlled or software-controlled. SPEAR pairs compiler-based construction with hardware-controlled triggering.]
- The two steps, p-thread construction and p-thread triggering, are not required to be bound to the same methodology
- Compiler-based construction can also benefit from hardware-controlled triggering
- A hybrid model of speculative pre-execution
Hardware Description of SPEAR
- Based on an SMT architecture; the p-thread is extracted from the instruction fetch queue (IFQ)
[Figure: SPEAR pipeline (fetch, detect p-thread, decode, execute, writeback, commit). The fetch stage holds the P-thread Detector and the PC-indexed P-thread Table ("D-load detected!!" marks a hit); the P-thread Extractor feeds the shared decoder; the main thread and p-thread have separate register files and ROBs; the functional units, D-cache, and L2 cache/main memory path are shared between the two threads; a p-thread indicator tags p-thread instructions.]
- Three additional hardware structures, each tied to a pipeline stage (a rough model follows):
  - PD (P-thread Detector): detects the p-thread instructions
  - PT (P-thread Table): provides the p-thread information
  - PE (P-thread Extractor): extracts the p-thread instructions and sends them to the decoder
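As a rough model of the PD/PT interaction (every name, field, and size here is illustrative, not the actual SPEAR hardware), the fetch stage can be thought of as probing a small PC-indexed table each cycle; on a hit, the stored slice is injected into the spare context:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t trigger_pc;     /* PC that fires the p-thread        */
    const uint32_t *slice;   /* pre-built backward-slice encoding */
    size_t slice_len;
} pthread_entry_t;

#define PT_SIZE 16
static pthread_entry_t pt[PT_SIZE];  /* filled from the SPEAR binary */

/* Conceptually called once per fetch: hardware-controlled triggering
 * of a compiler-constructed slice. */
const pthread_entry_t *check_trigger(uint64_t fetch_pc) {
    for (size_t i = 0; i < PT_SIZE; i++)
        if (pt[i].slice != NULL && pt[i].trigger_pc == fetch_pc)
            return &pt[i];   /* extractor injects slice into spare context */
    return NULL;
}
```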
The SPEAR Compiler
[Figure: tool flow. gcc compiles the source code into a regular PISA executable; (1) a PFG drawing tool extracts program structure information, (2) a profiling tool gathers dynamic information (d-loads and loop counts) from runs on input data, (3) a program slicing tool builds the p-threads, and (4) an attaching tool emits the SPEAR binary.]
- Four individual modules are implemented (based on SimpleScalar 3.0)
- The input is a SimpleScalar executable and the output is a SPEAR binary; a minimal sketch of the slicing step follows
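A minimal sketch of what the program slicing module has to do, assuming a simplified instruction form with one destination and up to two source registers (the real tool works on PISA binaries and must also handle memory dependences and control flow):

```c
#include <stdbool.h>

#define NREG 64   /* generous register-name space for the sketch */

typedef struct {
    int dst;      /* destination register, -1 if none */
    int src[2];   /* source registers, -1 if unused   */
} insn_t;

/* Walk backwards from the delinquent load, collecting the producers
 * of every register the slice still needs: a classic backward slice.
 * The caller must zero-initialize in_slice[]. Returns slice size. */
int backward_slice(const insn_t *code, int dload, bool *in_slice) {
    bool needed[NREG] = { false };
    int n = 0;

    in_slice[dload] = true; n++;
    for (int s = 0; s < 2; s++)
        if (code[dload].src[s] >= 0) needed[code[dload].src[s]] = true;

    for (int i = dload - 1; i >= 0; i--) {
        int d = code[i].dst;
        if (d >= 0 && needed[d]) {        /* producer of a needed reg */
            in_slice[i] = true; n++;
            needed[d] = false;            /* satisfied by this def */
            for (int s = 0; s < 2; s++)
                if (code[i].src[s] >= 0) needed[code[i].src[s]] = true;
        }
    }
    return n;
}
```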
Control-Flow Detection for the P-thread
[Figure: two cases over basic blocks B1 to B4, with the delinquent load and backward slices 1 and 2 distributed across the blocks; in Case 1 backward slice 2 lies in B2, in Case 2 it lies in B3.]
- A static-only method for the backward slice yields both backward slices 1 and 2
- Case 1: the profiling tool indicates that the majority of cache misses happen when the program runs via B3
  - The p-thread does not need to include backward slice 2 in B2
- Case 2: the outer-loop execution causes more cache misses at the load in B2 than the inner-loop execution does
  - The p-thread does not need to include backward slice 2 in B3
(A miniature version of Case 1 is sketched below.)
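Case 1 in miniature (a hypothetical example, not the slide's code): if the profile says the delinquent load almost always misses on one path, the slice feeding the other path can be dropped from the p-thread.

```c
extern long a[], b[], idx[];   /* assumed globals for illustration */

/* Main thread: a[idx[i]] is the delinquent load, and the profile
 * says most misses occur on the 'cond' path. */
long pick(int i, int cond) {
    if (cond)
        return a[idx[i]];    /* hot, miss-prone path */
    else
        return b[i];         /* cold path, rarely misses */
}

/* Pruned p-thread: only the hot path's address slice is kept; the
 * b[i] slice is dropped, just as backward slice 2 is dropped. */
void p_pick(int i) {
    (void)*(volatile long *)&a[idx[i]];
}
```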
Compiler Support for P-thread Construction: Pointer

The kernel of the Pointer stressmark:

    while (...) {
        partition = field[index];
        ...
        for (ll = 0; ll < w; ll++) {
            x = field[index + ll];
            if (x > max) high++;
            else if (x > min) {
                partition = x;
                balance = 0;
                for (lll = ll + 1; lll < w; lll++)
                    if (field[index + lll] > partition) balance++;
                if (balance + high == w / 2) break;
                else if (balance + high > w / 2) min = partition;
                else { max = partition; high++; }
            }
        }
        if (min == max) break;
        index = (partition + hops) % (f - w);
        hops++;
    }

[Figure: control-flow graph of the kernel, blocks A to E. A: partition = field[index] (the delinquent load); C: x = field[index + ll]; D: the loop test ll < w; E: index = (partition + hops) % (f - w) and the while condition.]
Compiler Support for P-thread Construction: Pointer (cont.)

(The same kernel as on the previous slide, now with the trigger marked.)

Basic trigger; the extracted p-thread slice is:

    x = field[index + ll];
    partition = x;
    index = (partition + hops) % (f - w);
    partition = field[index];

[Figure: blocks A to E again, with the trigger instruction placed ahead of the delinquent load's block.]
Outline
- High-Performance Processor Design Issues
  - Instruction Level Parallelism
  - Memory Wall Problem
- Background
  - Speculative Pre-Execution
  - Single-Chip Multithreading
- SPEAR: Speculative Pre-Execution Assisted by Compiler
  - Design Characteristics
  - Hardware/Software Descriptions
- Evaluation and Analysis
- Summary
SPEAR Simulation Parameters

| Parameter | Value |
|---|---|
| Branch prediction mode | bimodal |
| Branch table size | 2048 |
| Issue width | 8 |
| Commit width | 8 |
| Instruction fetch queue size | varies (128, 256) |
| Reorder buffer size | 128 instructions |
| Integer functional units | ALU (x4), MUL/DIV |
| Floating-point functional units | ALU (x4), MUL/DIV |
| Number of memory ports | 2 |
| Data L1 cache configuration | 256 sets, 32-byte blocks, 4-way set associative, LRU |
| Data L1 cache latency | 1 CPU clock cycle |
| Unified L2 cache configuration | 1024 sets, 64-byte blocks, 4-way set associative, LRU |
| Unified L2 cache latency | 12 CPU clock cycles |
| Memory access latency | 120 CPU clock cycles |

The architecture simulator is based on the SimpleScalar 3.0 tool set.
SPEAR Evaluation: Benchmark Descriptions

| Suite | Name (abbreviation) | Skipped insns | Simulated insns | Load exec. | D-load exec. | Static d-loads | P-thread insns |
|---|---|---|---|---|---|---|---|
| Stressmark | Pointer | full run | 85.9M | 30.7M | 669,405 | 1 | 15 |
| Stressmark | Update | full run | 53.2M | 12.8M | 39,045 | 1 | 42 |
| Stressmark | Neighborhood (nbh) | full run | 763.1M | 297.4M | 15.4M | 4 | 51 |
| Stressmark | Transitive Closure (tr) | 1B | 929.7M | 325.7M | 23.2M | 5 | 97 |
| Stressmark | Matrix | 300M | 500M | 117.3M | 21.1M | 59 | 415 |
| Stressmark | Field | full run | 552.9M | 86.5M | 2M | 2 | 71 |
| DIS Benchmarks | Data Management (dm) | full run | 507.5M | 172.4M | 6M | 102 | 471 |
| DIS Benchmarks | Ray Tracing (ray) | 300M | 1B | 280.7M | 3.8M | 15 | 46 |
| DIS Benchmarks | Fast Fourier Transform (fft) | 1B | 500M | 140.2M | 7.7M | 143 | 1,129 |
| SPEC CINT2000 | 164.gzip | 1B | 500M | 107.5M | 49.2M | 93 | 336 |
| SPEC CINT2000 | 181.mcf | 1B | 500M | 164.1M | 2.6M | 43 | 134 |
| SPEC CINT2000 | 175.vpr | 1B | 500M | 126.8M | 10.3M | 126 | 606 |
| SPEC CINT2000 | 256.bzip2 | 1B | 500M | 107M | 16.5M | 107 | 564 |
| SPEC CFP2000 | 183.equake | 1B | 1B | 207.7M | 1.3M | 119 | 444 |
| SPEC CFP2000 | 179.art | 1B | 500M | 105.2M | 64.5M | 51 | 140 |
Benchmark Behavior
[Figure: per-benchmark performance relative to an ideal memory system, with two series per benchmark. Callouts on the chart mark some benchmarks as not memory-intensive and one as dominated by i-cache misses.]

| Benchmark | sus.128.8/Ideal (%) | Perfect d-load/Ideal (%) |
|---|---|---|
| pointer | 84.7 | 99.7 |
| update | 72.2 | 99.9 |
| nbh | 84.5 | 99.9 |
| tr | 24.1 | 39.7 |
| matrix | 35.3 | 96.3 |
| field | 99.6 | 99.8 |
| dm | 53.8 | 58.1 |
| ray | 29.3 | 31.3 |
| fft | 53.3 | 54.2 |
| gzip | 93.6 | 96.8 |
| mcf | 12.6 | 25.9 |
| vpr | 27.4 | 42.2 |
| bzip2 | 56.2 | 99.0 |
| equake | 64.3 | 92.9 |
| art | 20.6 | 63.2 |

Averages: 53.1% and 74.3%, respectively.

Speed-up of the perfect d-load case:

| Benchmark | pointer | update | nbh | tr | matrix | field | dm | ray | fft | gzip | mcf | vpr | bzip2 | equake | art |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Speed-up | 1.178 | 1.383 | 1.183 | 1.650 | 2.731 | 1.002 | 1.079 | 1.069 | 1.016 | 1.035 | 2.059 | 1.539 | 1.763 | 1.445 | 3.061 |

Average: 1.546 (a 54.6% speed-up).
Performance Results: Speed-Up
- Improves performance for 11 out of 15 applications
- The best result: an 87.6% improvement for mcf
- Four applications (tr, field, fft, and gzip) experience a slight performance degradation
  - Resource conflicts may cause the problem

| Configuration | Performance improvement |
|---|---|
| SPEAR-128 | 12.7% |
| SPEAR-256 | 20.1% |
Speed-Up with Dedicated Resources for the P-thread
- field and gzip are not memory-intensive programs
- ray's performance depends strongly on I-cache misses
- For dm and fft, the selection of d-loads may be inadequate
- fft cannot even reduce the cache misses
  - The p-thread is too large
  - Cache pollution may occur

| Configuration | Performance improvement |
|---|---|
| SPEAR.sf-128 | 18.9% |
| SPEAR.sf-256 | 26.3% |

A 6.2% additional performance improvement with dedicated resources.
Results Compared to Perfect D-load Prefetching

| Benchmark | pointer | update | nbh | tr | matrix | field | dm | ray | fft | gzip | mcf | vpr | bzip2 | equake | art |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| % of perfect d-load performance | 98.7 | 81.7 | 98.6 | 79.8 | 62.6 | 100.0 | 98.5 | 97.9 | 94.7 | 98.8 | 91.3 | 72.2 | 72.9 | 85.5 | 60.7 |

On average, SPEAR achieves 86.3% of the performance of the perfect d-load case.
Performance Improvement and Cache Hit Rate
[Figure: speed-up plotted against cache hit ratio for the baseline superscalar and SPEAR.sf-256.]
Latency Tolerance
- Various L2 cache latencies and memory access latencies
- The prefetching capability of the SPEAR architecture provides robust performance under long latencies
Long Memory Latency Tolerance
- Performance is normalized to the shortest memory latency for each configuration
- On average, the performance degradation at the longest latency compared with the shortest:
  - SPEAR-128 loses 39.7% and SPEAR-256 loses 38.4% of the baseline performance
  - The superscalar loses 48.5%
Outline
- High-Performance Processor Design Issues
  - Instruction Level Parallelism
  - Memory Wall Problem
- Background
  - Speculative Pre-Execution
  - Single-Chip Multithreading
- SPEAR: Speculative Pre-Execution Assisted by Compiler
  - Design Characteristics
  - Hardware/Software Descriptions
- Evaluation and Analysis
- Summary
Summary
- Speculative execution of the future cache-miss slice in a multithreaded manner provides effective data prefetching
- An automated binary tool (the SPEAR compiler) and the supporting hardware have been proposed
- Compiler-assisted p-thread construction and hardware-supported p-thread triggering mesh quite well and yield good performance results
- Over 15 benchmark programs, the SPEAR.sf-256 model achieves a 26.3% improvement over the baseline superscalar architecture

http://pascal.eng.uci.edu/
Future Work
- Circuit-level analysis considering VLSI implementation
  - For clock speed, area, and power consumption estimation
  - Given its complexity effectiveness, the possibility of using this scheme in embedded processors should be investigated
- More algorithms for choosing an efficient prefetching thread
  - D-load selection as well as p-thread selection
  - Selective execution among the p-thread instructions
- Multiple p-thread execution under multiprogramming workloads
End of the Presentation
Thank you very much!