Title: SPEAR: Speculative Pre-Execution Assisted by CompileR
SPEAR: Speculative Pre-Execution Assisted by CompileR
- Jean-Luc Gaudiot
- University of California, Irvine
- Won Woo Ro
- California State University, Northridge
E-Seminar, IFIP Working Group 10.3, September 7, 2005
PASCAL: PArallel Systems and Computer Architecture Lab.
University of California, Irvine
Outline
- High-Performance Processor Design Issues
  - Instruction Level Parallelism
  - Memory Wall Problem on Superscalar
- Background
  - Speculative Pre-Execution
  - Single-Chip Multithreading
- SPEAR: Speculative Pre-Execution Assisted by Compiler
  - Design Characteristics
  - Hardware/Software Descriptions
- Evaluation and Analysis
- Summary
High-Performance Microprocessor Design
- Instruction Level Parallelism
  - Multiple instruction issue and out-of-order execution
  - Aggressive branch prediction and speculative execution
  - Examples: superscalar and VLIW
- Various processors for Thread Level Parallelism
  - Chip Multiprocessor and Simultaneous Multithreading
  - Extracting more ILP by exploiting TLP
- Challenge
  - Increasing performance gap between processor and memory
Challenge: the Memory Wall Problem
[Figure: Processor-Memory Performance Gap. Processor performance improves roughly 55%/year while memory improves roughly 7%/year. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach (2003).]
- Technological trend
  - Memory latency is getting longer relative to microprocessor speed
- Problems
  - Considerable performance degradation upon cache misses
  - Pipeline stalls during memory accesses
- Today's applications contain more
  - irregular memory accesses (e.g., pointer chasing)
  - memory-related operations (data-intensive applications)
Limitations of Current Approaches
- Cache
  - Works well only if there is locality
  - Weak for large data sets and irregular memory accesses
- Cache prefetching
  - Depends on predictability
  - Hardware prefetching: based on past and present execution-time behavior
  - Software prefetching: hard to insert prefetches for irregular access patterns
- Latency-hiding techniques (SMT)
  - Enhance utilization and throughput at the thread level
  - No latency reduction for a single-threaded program
- Solution: cache prefetching by early execution of the future miss-causing load instructions (see the sketch below)
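As a concrete illustration (not from the talk) of why software prefetching struggles with irregular accesses: in a pointer chase, the prefetch address is itself produced by the miss-causing load, so a conventional prefetch cannot be hoisted far ahead. A minimal C sketch using GCC's __builtin_prefetch:

```c
#include <stddef.h>

struct node { struct node *next; long val; };

/* The address n->next is produced by the very load that misses, so
 * the prefetch runs at best one hop ahead of the demand access.
 * Pre-executing the whole address-generating slice avoids this. */
long sum_list(struct node *n) {
    long s = 0;
    while (n != NULL) {
        __builtin_prefetch(n->next);   /* only one hop ahead */
        s += n->val;
        n = n->next;
    }
    return s;
}
```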
Early Scheduling of Load Instructions: Static or Dynamic?
- Scheduling of load instructions is crucial to hiding memory latency
  - Reduces pipeline stalls
- Static or dynamic scheduling?
- Current ILP architectures
  - VLIW: static instruction scheduling at compile time
  - Superscalar: dynamic instruction scheduling at run time
  - Decoupled architectures are in between
- Trade-off: hardware complexity vs. compiler compatibility
[Figure: where the scheduling work happens. Starting from the high-level language, VLIW performs dependency analysis and resource binding in the compiler front-end; a decoupled architecture separates instructions by functionality; superscalar defers the work to the hardware back-end (instruction window and wake-up logic, functional-unit assignment, instruction execution).]
Superscalar: Why Have a Large Scheduling Window?
- Effective for finding more independent instructions to execute
- How can dynamic scheduling address the long-memory-latency problem?
  - A large window helps uncover independent instructions in order to hide the long memory latency
  - Dynamic scheduling performs a limited form of dynamic preloading
[Figure: the instruction scheduling window, with instructions entering ("last in") and leaving ("first in").]
Limitation of the Scheduling Window
- Dynamic scheduling in superscalar processors
  - Centralized, atomic structure
  - Globally accessed and cannot be pipelined
- The issue window and bypass logic will be the most critical parts of future superscalar architectures
  - Palacharla, Jouppi, and Smith, "Complexity-Effective Superscalar Processors," ISCA-24, 1997
- The long wire delay of a centralized design will be a major performance bottleneck
  - Agarwal, Hrishikesh, Keckler, and Burger, "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures," ISCA-27, 2000
  - The complexity of the scheduling window impacts cycle time
- Hence: decentralized design and static scheduling

Summary: we aim at achieving early execution of load instructions by static scheduling of extra threads.
Outline
- High-Performance Processor Design Issues
  - Instruction Level Parallelism
  - Memory Wall Problem
- Background
  - Speculative Pre-Execution
  - Single-Chip Multithreading
- SPEAR: Speculative Pre-Execution Assisted by Compiler
  - Design Characteristics
  - Hardware/Software Descriptions
- Evaluation and Analysis
- Summary
Access/Execute Decoupled Architectures
- Two separate executions of the sequential instruction stream
  - Memory access and computation
- Motivation
  - Exploiting ILP between the two streams
  - The slip distance between the streams provides memory latency hiding
- Executing the two instruction streams on separate processors
  - Out-of-order execution between the two streams
  - Communication through queues
- Examples: DAE, PIPE, ZS-1, and WM
[Figure: decoupled organization. The Access Processor runs the access instructions against main memory and communicates with the Execute Processor, which runs the execute instructions, through the Load Data Queue and the Store Data Queue.] A toy model of the queue-based slip follows.
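A toy single-threaded model of the idea, assuming a bounded FIFO as the load data queue (the names and sizes are illustrative, not from any real decoupled machine): the access stream slips ahead of the execute stream by up to the queue depth, and that slack is what hides load latency.

```c
#include <stdio.h>

#define QSZ 8     /* load data queue depth = maximum slip distance */
#define N   32

int main(void) {
    double a[N], ldq[QSZ];
    int head = 0, tail = 0, count = 0, fetched = 0;
    double sum = 0.0;

    for (int i = 0; i < N; i++) a[i] = 0.5 * i;

    for (int consumed = 0; consumed < N; consumed++) {
        /* access stream: run ahead while the queue has room */
        while (fetched < N && count < QSZ) {
            ldq[tail] = a[fetched++];      /* "load" into the queue */
            tail = (tail + 1) % QSZ;
            count++;
        }
        /* execute stream: consume one queued value per step */
        sum += ldq[head];
        head = (head + 1) % QSZ;
        count--;
    }
    printf("sum = %f\n", sum);
    return 0;
}
```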
What is Speculative Pre-Execution?
- A further extension of the traditional access/execute decoupled architecture
  - Instead of decoupling and pre-executing every load instruction, it pre-executes only the miss-causing load instructions
- A data prefetching method that identifies and pre-executes the future miss-causing load instructions and their backward slices
  - A backward slice contains the instructions on which those load instructions are data-dependent
- Speculative execution of the prefetching thread on a spare hardware context provides effective cache prefetching
Speculative Pre-Execution
- Define the probable cache-miss instructions (delinquent loads) and their backward slices as the prefetching thread (p-thread)
  - A cache access profile is used to identify the delinquent loads
- Execute the p-thread as an auxiliary thread in a multithreaded manner
  - The p-thread is triggered at run time
  - It is lightweight and can run faster than the normal thread (see the example below)
[Figure: delinquent loads and their backward slices marked in the instruction stream.]
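A hand-worked illustration of what a p-thread looks like (my sketch in C; SPEAR itself extracts slices from the binary): the backward slice keeps only the address-generating instructions of the delinquent load, so the p-thread can run ahead and warm the cache.

```c
#include <stddef.h>

typedef struct node { struct node *next; long payload[15]; } node_t;

/* Main thread: the payload access is the delinquent load. */
long sum(node_t *head) {
    long s = 0;
    for (node_t *n = head; n != NULL; n = n->next)
        s += n->payload[0];                     /* delinquent load */
    return s;
}

/* p-thread: only the backward slice of the delinquent load survives
 * (the pointer chase and the address computation); the accumulation
 * is dropped because the load does not depend on it. */
void p_thread(node_t *head) {
    for (node_t *n = head; n != NULL; n = n->next)
        (void)*(volatile long *)&n->payload[0]; /* touch the line */
}
```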
Working Example of Speculative Pre-Execution

    for (k = 0; k < N; k++)
        x[k] = q + y[k] * (r * z[k+10] + t * z[k+11]);

Dynamic instruction stream for the main program:

    iteration i:
        lw    $24, 24($sp)
        mul   $25, $24, 8
        la    $8, z
        addu  $9, $25, $8
        l.d   $f16, 88($9)
        l.d   $f18, 0($sp)
        mul.d $f4, $f16, $f18
        l.d   $f6, 8($sp)
        l.d   $f8, 80($9)
        mul.d $f10, $f6, $f8
        add.d $f16, $f4, $f10
        la    $10, y
        addu  $11, $25, $10
        l.d   $f18, 0($11)
        mul.d $f6, $f16, $f18
        l.d   $f8, 16($sp)
        add.d $f4, $f6, $f8
        la    $12, x
        addu  $13, $25, $12
        s.d   $f4, 0($13)
        lw    $14, 24($sp)
        addu  $15, $14, 1
        sw    $15, 24($sp)
        blt   $15, 1024, 33

    iteration i+1:
        lw    $24, 24($sp)
        mul   $25, $24, 8
        la    $8, z
        addu  $9, $25, $8
        l.d   $f16, 88($9)
        l.d   $f18, 0($sp)
        mul.d $f4, $f16, $f18
        l.d   $f6, 8($sp)
        l.d   $f8, 80($9)
        mul.d $f10, $f6, $f8
        add.d $f16, $f4, $f10
        la    $10, y
        addu  $11, $25, $10
        l.d   $f18, 0($11)
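For this loop, a plausible p-thread, assuming y and z are global double arrays (my sketch, not the slide's actual slice), keeps only the induction variable and the address arithmetic feeding the likely-missing loads:

```c
extern double y[], z[];   /* assumed globals, matching the loop above */

/* Hypothetical p-thread for the kernel: touch the loads that are
 * likely to miss and drop all floating-point computation. */
void p_thread(int k, int n) {
    for (; k < n; k++) {
        (void)*(volatile double *)&z[k + 10];
        (void)*(volatile double *)&z[k + 11];
        (void)*(volatile double *)&y[k];
    }
}
```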
Executing the P-thread in a Multithreaded Manner
[Figure: over time, the p-thread runs alongside and ahead of the main program flow.]
- Parallel execution of the p-thread and the main program thread
- Uses hardware multithreading such as SMT or CMP
- Either reduces the cache misses or reduces the cache-miss penalty
Single-Chip (On-Chip) Multithreading
- Two approaches to multithreaded execution
  - Simultaneous Multithreading (SMT) and Chip Multiprocessor (CMP)
  - Both support simultaneous execution of multiple threads
  - SMT supports multiple contexts on a superscalar core
  - CMP arranges multiple small processing units on a chip
- The differences lie in resource sharing
  - In SMT, every structure (instruction fetch queue, scheduling window, reorder buffer, registers, and functional units) is shared between threads
    - Better utilization and resource sharing
  - In CMP, every processing unit has dedicated resources
    - It is implemented with simpler atomic hardware structures (hence a higher clock rate)
Outline
- High-Performance Processor Design Issues
  - Instruction Level Parallelism
  - Memory Wall Problem on Superscalar
- Background
  - Speculative Pre-Execution
  - Single-Chip Multithreading
- SPEAR: Speculative Pre-Execution Assisted by Compiler
  - Design Characteristics
  - Hardware/Software Descriptions
- Evaluation and Analysis
- Summary
Various Implementations of Speculative Pre-Execution
[Figure: three implementation paths, from software to hardware. (1) An SP compiler (source-to-source) transforms the high-level source code into a main program plus p-thread that run on an SMT processor. (2) An assembly/binary translator derives the main program and p-thread from an existing binary executable. (3) A hardware-based scheme constructs the p-thread inside the execution core at run time. The first two are software-based static approaches; the third is the hardware-based dynamic approach.]
Related Work: Speculative Pre-Execution

Software-Based Approach (static, compile-time), working from source code or binary through a compiler or binary analyzer to a p-thread:
- Software-Controlled Pre-Execution (Luk, ISCA 2001): analysis of the high-level-language program, not automated
- Software-Based Speculative Precomputation (Liao et al., PLDI 2002): post-pass binary adaptation, Itanium
- Compiler Algorithms for Pre-Execution (Kim and Yeung, ASPLOS 2002): analysis of the high-level-language program, automated compiler

Hardware-Based Approach (dynamic, run-time), constructing the p-thread from the I-cache, instruction queue, or reorder buffer:
- Dependence Graph Precomputation (Annavaram et al., ISCA 2001): H/W based, front-end draws the dependence graph
- Slice Processors (Moshovos et al., ICS 2001): H/W based, back-end driven
- Dynamic Speculative Precomputation (Collins et al., MICRO 2001): H/W based, back-end driven, chaining p-slices
Two Main Operations of Speculative Pre-Execution
1. Construct the p-thread
2. Trigger the p-thread (at run time)
SPEAR: Speculative Pre-Execution Assisted by compileR
[Figure: design space. P-thread construction can be hardware-based or compiler-based; p-thread triggering can be hardware-controlled or software-controlled. SPEAR pairs compiler-based construction with hardware-controlled triggering.]
- The two steps, p-thread construction and p-thread triggering, are not required to be bound to the same methodology
- Compiler-based construction can also benefit from hardware-controlled triggering
- A hybrid model of speculative pre-execution
Hardware Description of SPEAR
- Based on an SMT architecture; the p-thread is extracted from the instruction fetch queue (IFQ)
[Figure: SPEAR pipeline (fetch, detect p-thread, decode, execute, writeback, commit). The fetch stage holds the P-thread Detector and the PC-indexed P-thread Table ("D-load detected!!" marks a hit); the P-thread Extractor feeds the shared decoder; the main thread and p-thread have separate register files and ROBs; the functional units, D-cache, and L2 cache/main memory path are shared between the two threads; a p-thread indicator tags p-thread instructions.]
- Three additional hardware structures, each tied to a pipeline stage (a rough model follows):
  - PD (P-thread Detector): detects the p-thread instructions
  - PT (P-thread Table): provides the p-thread information
  - PE (P-thread Extractor): extracts the p-thread instructions and sends them to the decoder
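As a rough model of the PD/PT interaction (every name, field, and size here is illustrative, not the actual SPEAR hardware), the fetch stage can be thought of as probing a small PC-indexed table each cycle; on a hit, the stored slice is injected into the spare context:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t trigger_pc;     /* PC that fires the p-thread        */
    const uint32_t *slice;   /* pre-built backward-slice encoding */
    size_t slice_len;
} pthread_entry_t;

#define PT_SIZE 16
static pthread_entry_t pt[PT_SIZE];  /* filled from the SPEAR binary */

/* Conceptually called once per fetch: hardware-controlled triggering
 * of a compiler-constructed slice. */
const pthread_entry_t *check_trigger(uint64_t fetch_pc) {
    for (size_t i = 0; i < PT_SIZE; i++)
        if (pt[i].slice != NULL && pt[i].trigger_pc == fetch_pc)
            return &pt[i];   /* extractor injects slice into spare context */
    return NULL;
}
```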
The SPEAR Compiler
[Figure: tool flow. gcc compiles the source code into a regular PISA executable; (1) a PFG drawing tool extracts program structure information, (2) a profiling tool gathers dynamic information (d-loads and loop counts) from runs on input data, (3) a program slicing tool builds the p-threads, and (4) an attaching tool emits the SPEAR binary.]
- Four individual modules are implemented (based on SimpleScalar 3.0)
- The input is a SimpleScalar executable and the output is a SPEAR binary; a minimal sketch of the slicing step follows
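A minimal sketch of what the program slicing module has to do, assuming a simplified instruction form with one destination and up to two source registers (the real tool works on PISA binaries and must also handle memory dependences and control flow):

```c
#include <stdbool.h>

#define NREG 64   /* generous register-name space for the sketch */

typedef struct {
    int dst;      /* destination register, -1 if none */
    int src[2];   /* source registers, -1 if unused   */
} insn_t;

/* Walk backwards from the delinquent load, collecting the producers
 * of every register the slice still needs: a classic backward slice.
 * The caller must zero-initialize in_slice[]. Returns slice size. */
int backward_slice(const insn_t *code, int dload, bool *in_slice) {
    bool needed[NREG] = { false };
    int n = 0;

    in_slice[dload] = true; n++;
    for (int s = 0; s < 2; s++)
        if (code[dload].src[s] >= 0) needed[code[dload].src[s]] = true;

    for (int i = dload - 1; i >= 0; i--) {
        int d = code[i].dst;
        if (d >= 0 && needed[d]) {        /* producer of a needed reg */
            in_slice[i] = true; n++;
            needed[d] = false;            /* satisfied by this def */
            for (int s = 0; s < 2; s++)
                if (code[i].src[s] >= 0) needed[code[i].src[s]] = true;
        }
    }
    return n;
}
```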
Control-Flow Detection for the P-thread
[Figure: two cases over basic blocks B1 to B4, with the delinquent load and backward slices 1 and 2 distributed across the blocks; in Case 1 backward slice 2 lies in B2, in Case 2 it lies in B3.]
- A static-only method for the backward slice yields both backward slices 1 and 2
- Case 1: the profiling tool indicates that the majority of cache misses happen when the program runs via B3
  - The p-thread does not need to include backward slice 2 in B2
- Case 2: the outer-loop execution causes more cache misses at the load in B2 than the inner-loop execution does
  - The p-thread does not need to include backward slice 2 in B3
(A miniature version of Case 1 is sketched below.)
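Case 1 in miniature (a hypothetical example, not the slide's code): if the profile says the delinquent load almost always misses on one path, the slice feeding the other path can be dropped from the p-thread.

```c
extern long a[], b[], idx[];   /* assumed globals for illustration */

/* Main thread: a[idx[i]] is the delinquent load, and the profile
 * says most misses occur on the 'cond' path. */
long pick(int i, int cond) {
    if (cond)
        return a[idx[i]];    /* hot, miss-prone path */
    else
        return b[i];         /* cold path, rarely misses */
}

/* Pruned p-thread: only the hot path's address slice is kept; the
 * b[i] slice is dropped, just as backward slice 2 is dropped. */
void p_pick(int i) {
    (void)*(volatile long *)&a[idx[i]];
}
```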
Compiler Support for P-thread Construction: Pointer

The kernel of the Pointer stressmark:

    while (...) {
        partition = field[index];
        ...
        for (ll = 0; ll < w; ll++) {
            x = field[index + ll];
            if (x > max) high++;
            else if (x > min) {
                partition = x;
                balance = 0;
                for (lll = ll + 1; lll < w; lll++)
                    if (field[index + lll] > partition) balance++;
                if (balance + high == w / 2) break;
                else if (balance + high > w / 2) min = partition;
                else { max = partition; high++; }
            }
        }
        if (min == max) break;
        index = (partition + hops) % (f - w);
        hops++;
    }

[Figure: control-flow graph of the kernel, blocks A to E. A: partition = field[index] (the delinquent load); C: x = field[index + ll]; D: the loop test ll < w; E: index = (partition + hops) % (f - w) and the while condition.]
Compiler Support for P-thread Construction: Pointer (cont.)

(The same kernel as on the previous slide, now with the trigger marked.)

Basic trigger; the extracted p-thread slice is:

    x = field[index + ll];
    partition = x;
    index = (partition + hops) % (f - w);
    partition = field[index];

[Figure: blocks A to E again, with the trigger instruction placed ahead of the delinquent load's block.]
Outline
- High-Performance Processor Design Issues
  - Instruction Level Parallelism
  - Memory Wall Problem
- Background
  - Speculative Pre-Execution
  - Single-Chip Multithreading
- SPEAR: Speculative Pre-Execution Assisted by Compiler
  - Design Characteristics
  - Hardware/Software Descriptions
- Evaluation and Analysis
- Summary
SPEAR Simulation Parameters

| Parameter | Value |
|---|---|
| Branch prediction mode | bimodal |
| Branch table size | 2048 |
| Issue width | 8 |
| Commit width | 8 |
| Instruction fetch queue size | varies (128, 256) |
| Reorder buffer size | 128 instructions |
| Integer functional units | ALU (x4), MUL/DIV |
| Floating-point functional units | ALU (x4), MUL/DIV |
| Number of memory ports | 2 |
| Data L1 cache configuration | 256 sets, 32-byte blocks, 4-way set associative, LRU |
| Data L1 cache latency | 1 CPU clock cycle |
| Unified L2 cache configuration | 1024 sets, 64-byte blocks, 4-way set associative, LRU |
| Unified L2 cache latency | 12 CPU clock cycles |
| Memory access latency | 120 CPU clock cycles |

The architecture simulator is based on the SimpleScalar 3.0 tool set.
SPEAR Evaluation: Benchmark Descriptions

| Suite | Name (abbreviation) | Skipped insns | Simulated insns | Load exec. | D-load exec. | Static d-loads | P-thread insns |
|---|---|---|---|---|---|---|---|
| Stressmark | Pointer | full run | 85.9M | 30.7M | 669,405 | 1 | 15 |
| Stressmark | Update | full run | 53.2M | 12.8M | 39,045 | 1 | 42 |
| Stressmark | Neighborhood (nbh) | full run | 763.1M | 297.4M | 15.4M | 4 | 51 |
| Stressmark | Transitive Closure (tr) | 1B | 929.7M | 325.7M | 23.2M | 5 | 97 |
| Stressmark | Matrix | 300M | 500M | 117.3M | 21.1M | 59 | 415 |
| Stressmark | Field | full run | 552.9M | 86.5M | 2M | 2 | 71 |
| DIS Benchmarks | Data Management (dm) | full run | 507.5M | 172.4M | 6M | 102 | 471 |
| DIS Benchmarks | Ray Tracing (ray) | 300M | 1B | 280.7M | 3.8M | 15 | 46 |
| DIS Benchmarks | Fast Fourier Transform (fft) | 1B | 500M | 140.2M | 7.7M | 143 | 1,129 |
| SPEC CINT2000 | 164.gzip | 1B | 500M | 107.5M | 49.2M | 93 | 336 |
| SPEC CINT2000 | 181.mcf | 1B | 500M | 164.1M | 2.6M | 43 | 134 |
| SPEC CINT2000 | 175.vpr | 1B | 500M | 126.8M | 10.3M | 126 | 606 |
| SPEC CINT2000 | 256.bzip2 | 1B | 500M | 107M | 16.5M | 107 | 564 |
| SPEC CFP2000 | 183.equake | 1B | 1B | 207.7M | 1.3M | 119 | 444 |
| SPEC CFP2000 | 179.art | 1B | 500M | 105.2M | 64.5M | 51 | 140 |
Benchmark Behavior
[Figure: per-benchmark performance relative to an ideal memory system, with two series per benchmark. Callouts on the chart mark some benchmarks as not memory-intensive and one as dominated by i-cache misses.]

| Benchmark | sus.128.8/Ideal (%) | Perfect d-load/Ideal (%) |
|---|---|---|
| pointer | 84.7 | 99.7 |
| update | 72.2 | 99.9 |
| nbh | 84.5 | 99.9 |
| tr | 24.1 | 39.7 |
| matrix | 35.3 | 96.3 |
| field | 99.6 | 99.8 |
| dm | 53.8 | 58.1 |
| ray | 29.3 | 31.3 |
| fft | 53.3 | 54.2 |
| gzip | 93.6 | 96.8 |
| mcf | 12.6 | 25.9 |
| vpr | 27.4 | 42.2 |
| bzip2 | 56.2 | 99.0 |
| equake | 64.3 | 92.9 |
| art | 20.6 | 63.2 |

Averages: 53.1% and 74.3%, respectively.

Speed-up of the perfect d-load case:

| Benchmark | pointer | update | nbh | tr | matrix | field | dm | ray | fft | gzip | mcf | vpr | bzip2 | equake | art |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Speed-up | 1.178 | 1.383 | 1.183 | 1.650 | 2.731 | 1.002 | 1.079 | 1.069 | 1.016 | 1.035 | 2.059 | 1.539 | 1.763 | 1.445 | 3.061 |

Average: 1.546 (a 54.6% speed-up).
Performance Results: Speed-Up
- Improves performance for 11 out of 15 applications
- The best result: an 87.6% improvement for mcf
- Four applications (tr, field, fft, and gzip) experience a slight performance degradation
  - Resource conflicts may cause the problem

| Configuration | Performance improvement |
|---|---|
| SPEAR-128 | 12.7% |
| SPEAR-256 | 20.1% |
Speed-Up with Dedicated Resources for the P-thread
- field and gzip are not memory-intensive programs
- ray's performance depends strongly on I-cache misses
- For dm and fft, the selection of d-loads may be inadequate
- fft cannot even reduce the cache misses
  - The p-thread is too large
  - Cache pollution may occur

| Configuration | Performance improvement |
|---|---|
| SPEAR.sf-128 | 18.9% |
| SPEAR.sf-256 | 26.3% |

A 6.2% additional performance improvement with dedicated resources.
Results Compared to Perfect D-load Prefetching

| Benchmark | pointer | update | nbh | tr | matrix | field | dm | ray | fft | gzip | mcf | vpr | bzip2 | equake | art |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| % of perfect d-load performance | 98.7 | 81.7 | 98.6 | 79.8 | 62.6 | 100.0 | 98.5 | 97.9 | 94.7 | 98.8 | 91.3 | 72.2 | 72.9 | 85.5 | 60.7 |

On average, SPEAR achieves 86.3% of the performance of the perfect d-load case.
Performance Improvement and Cache Hit Rate
[Figure: speed-up plotted against cache hit ratio for the baseline superscalar and SPEAR.sf-256.]
Latency Tolerance
- Various L2 cache latencies and memory access latencies
- The prefetching capability of the SPEAR architecture provides robust performance under long latencies
Long Memory Latency Tolerance
- Performance is normalized to the shortest memory latency for each configuration
- On average, the performance degradation at the longest latency compared with the shortest:
  - SPEAR-128 loses 39.7% and SPEAR-256 loses 38.4% of the baseline performance
  - The superscalar loses 48.5%
Outline
- High-Performance Processor Design Issues
  - Instruction Level Parallelism
  - Memory Wall Problem
- Background
  - Speculative Pre-Execution
  - Single-Chip Multithreading
- SPEAR: Speculative Pre-Execution Assisted by Compiler
  - Design Characteristics
  - Hardware/Software Descriptions
- Evaluation and Analysis
- Summary
Summary
- Speculative execution of the future cache-miss slice in a multithreaded manner provides effective data prefetching
- An automated binary tool (the SPEAR compiler) and the supporting hardware have been proposed
- Compiler-assisted p-thread construction and hardware-supported p-thread triggering mesh quite well and yield good performance results
- Over 15 benchmark programs, the SPEAR.sf-256 model achieves a 26.3% improvement over the baseline superscalar architecture

http://pascal.eng.uci.edu/
Future Work
- Circuit-level analysis considering VLSI implementation
  - For clock speed, area, and power consumption estimation
  - Given its complexity effectiveness, the possibility of using this scheme in embedded processors should be investigated
- More algorithms for choosing an efficient prefetching thread
  - D-load selection as well as p-thread selection
  - Selective execution among the p-thread instructions
- Multiple p-thread execution under multiprogramming workloads
End of the Presentation
Thank you very much!