SPEAR: Speculative Pre-Execution Assisted by CompileR

1
SPEAR: Speculative Pre-Execution Assisted by CompileR
  • Jean-Luc Gaudiot
  • University of California, Irvine
  • Won Woo Ro
  • California State University, Northridge

E-Seminar, IFIP Working Group 10.3, September 7, 2005
PASCAL: PArallel Systems and Computer Architecture Lab.
University of California, Irvine
2
Outline
  • High Performance Processor Design Issues
    - Instruction Level Parallelism
    - Memory Wall Problem on Superscalar
  • Background
    - Speculative Pre-Execution
    - Single-Chip Multithreading
  • SPEAR: Speculative Pre-Execution Assisted by Compiler
    - Design Characteristics
    - Hardware/Software Descriptions
  • Evaluation and Analysis
  • Summary

3
High-Performance Microprocessor Design
  • Instruction Level Parallelism
    - Multiple instruction issue and out-of-order execution
    - Aggressive branch prediction and speculative execution
    - Examples: superscalar and VLIW
  • Various processors for Thread Level Parallelism
    - Chip Multiprocessor and Simultaneous Multithreading
    - Extracting more ILP by exploiting TLP
  • Challenge
    - Increasing performance gap between processor and memory

4
Challenge: Memory Wall Problem
[Figure: processor-memory performance gap; processor performance improves ~55%/year while memory improves ~7%/year. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach (2003)]
  • Technological trend
    - Memory latency is getting longer relative to the microprocessor speed
  • Problems
    - Considerable performance degradation upon cache misses
    - Pipeline stalls during memory access
  • Today's applications contain more
    - irregular memory accesses (e.g., pointer chasing; see the sketch below)
    - memory-related operations (data-intensive applications)
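
As an illustration of an irregular access pattern that defeats stride-based prefetching, consider a linked-list traversal (a minimal C sketch; the list type and names are our own, not from the talk):

    typedef struct node {
        int val;
        struct node *next;   /* next address is unknown until this node is loaded */
    } node;

    /* Each iteration's load address depends on the previous load, so a
     * stride-based hardware prefetcher cannot run ahead of the program. */
    long sum_list(const node *p)
    {
        long sum = 0;
        while (p) {
            sum += p->val;
            p = p->next;     /* likely cache miss: pointer chasing */
        }
        return sum;
    }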

5
Limitations of Current Approaches
  • Cache
    - Works well only if there exists locality
    - Weak for large data sets and irregular memory accesses
  • Cache prefetching
    - Depends on predictability
    - Hardware prefetching: predictions are based on past and present execution-time behavior
    - Software prefetching: hard to insert prefetches for irregular access patterns (see the sketch below)
  • Latency-hiding techniques (SMT)
    - Enhance utilization and throughput at the thread level
    - No latency reduction for a single-threaded program
  • Solution: cache prefetching by early execution of the future miss-causing load instructions
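
For a regular array scan, a compiler or programmer can insert a prefetch a fixed distance ahead; for a pointer chase there is no address to prefetch until the previous load completes. A minimal sketch using GCC's __builtin_prefetch (the distance of 16 elements is an arbitrary assumption):

    /* Regular access: the address of a[k+16] is known early, so a
     * software prefetch can be scheduled well before it is needed. */
    long sum_array(const long *a, int n)
    {
        long sum = 0;
        for (int k = 0; k < n; k++) {
            if (k + 16 < n)
                __builtin_prefetch(&a[k + 16]);  /* GCC builtin */
            sum += a[k];
        }
        return sum;
    }
    /* No equivalent prefetch can be placed in the linked-list loop above,
     * since p->next is not known until each node arrives. */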

6
Early Scheduling of Load Instructions: Static or Dynamic?
  • Scheduling of load instructions is crucial to hiding memory latency
    - Reduces pipeline stalls
  • Static or dynamic scheduling? Current ILP architectures:
    - VLIW: static instruction scheduling at compile-time
    - Superscalar: dynamic instruction scheduling at run-time
    - Decoupled architectures are in between
  • Trade-off: hardware complexity vs. compiler compatibility

[Figure: where scheduling happens; starting from a high-level language, the VLIW compiler front-end performs dependency analysis and resource binding; a decoupled architecture's front-end separates instructions by functionality; a superscalar defers the work to the hardware back-end (instruction window and wake-up logic, functional unit assignment, instruction execution)]
7
Superscalar: Why Have a Large Scheduling Window?
  • Effective for finding more independent instructions to execute
  • How can dynamic scheduling solve the long-memory-latency problem?
    - A large window helps uncover independent instructions in order to hide the long memory latency
    - Dynamic scheduling performs a limited form of dynamic preloading

[Figure: instruction scheduling window; instructions enter at the last-in end and issue from the first-in end]
8
Limitation of the Scheduling Window
  • Dynamic scheduling in a superscalar
    - Centralized and atomic structure
    - Globally accessed and cannot be pipelined
  • The issue window and bypass logic will be the most critical parts of future superscalar architectures
    - Palacharla, Jouppi, and Smith, "Complexity-Effective Superscalar Processors," ISCA-24, 1997
  • The long wire delays of a centralized design will be a major performance bottleneck
    - Agarwal, Hrishikesh, Keckler, and Burger, "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures," ISCA-27, 2000
  • The complexity of the scheduling window impacts cycle time
  • Therefore: decentralized design and static scheduling

Summary: we aim at achieving early execution of load instructions by static scheduling of extra threads.
9
Outline
  • High Performance Processor Design Issues
    - Instruction Level Parallelism
    - Memory Wall Problem
  • Background
    - Speculative Pre-Execution
    - Single-Chip Multithreading
  • SPEAR: Speculative Pre-Execution Assisted by Compiler
    - Design Characteristics
    - Hardware/Software Descriptions
  • Evaluation and Analysis
  • Summary

10
Access/Execute Decoupled Architectures
  • Two separate executions of the sequential instruction stream
    - Memory access and computation
  • Motivation
    - Exploiting ILP between the two streams
    - The slip distance between the streams provides memory latency hiding
  • Executing the two instruction streams on separate processors
    - Out-of-order execution between the two streams
    - Communication through queues
  • Examples: DAE, PIPE, ZS-1, and WM

[Figure: access/execute decoupled architecture; the access processor runs the access instructions against main memory and feeds the execute processor's execute instructions through a load data queue, receiving results through a store data queue. A C sketch follows.]
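
A minimal single-threaded C sketch of the idea (the queue, names, and sizes are our own illustration, not from the talk): the access stream runs ahead and pushes loaded values into a queue that the execute stream drains.

    #define QSIZE 64

    /* Load data queue connecting the two streams
     * (a real machine would stall when it is empty or full). */
    static double q[QSIZE];
    static unsigned head, tail;

    static void   q_push(double v) { q[tail++ % QSIZE] = v; }
    static double q_pop(void)      { return q[head++ % QSIZE]; }

    /* Access stream: address computation and loads only. It can slip
     * ahead of the execute stream, hiding memory latency. */
    void access_stream(const double *a, const int *idx, int n)
    {
        for (int i = 0; i < n; i++)
            q_push(a[idx[i]]);       /* the potentially slow loads */
    }

    /* Execute stream: consumes loaded values and computes. */
    double execute_stream(int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += 2.0 * q_pop();
        return sum;
    }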
11
What is Speculative Pre-Execution?
  • A further extension of the traditional access/execute decoupled architecture
    - Instead of decoupling and pre-executing every load instruction, it pre-executes only the likely miss-causing load instructions
  • A data prefetching method that identifies and pre-executes the future miss-causing load instructions and their backward slices (see the sketch below)
    - A backward slice includes the instructions on which those load instructions have data dependences
  • Speculative execution of the prefetching thread on a spare hardware context provides effective cache prefetching
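
A hedged C sketch of the concept (the loop and its split are our own example): the p-thread is the backward slice of a delinquent load, reduced to the address computation plus a prefetch, run slightly ahead on a spare context.

    /* Main thread: h[key[i]] is the delinquent load; key[i] makes it irregular. */
    long main_loop(const long *h, const int *key, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += h[key[i]];            /* delinquent load */
        return sum;
    }

    /* p-thread: only the backward slice of the delinquent load
     * (the load of key[i] and the address computation), issued as prefetches. */
    void p_thread(const long *h, const int *key, int n)
    {
        for (int i = 0; i < n; i++)
            __builtin_prefetch(&h[key[i]]);  /* GCC builtin, illustrative */
    }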

12
Speculative Pre-Execution
  • Define probable cache-miss instructions (delinquent loads) and their backward slice as the prefetching thread (p-thread)
    - A cache access profile is used to identify the delinquent loads
  • Execute the p-thread as an auxiliary thread in a multithreaded manner
    - The p-thread is triggered at run-time
    - It is lightweight and can run faster than the main thread

[Figure: the p-thread is carved out of the dynamic instruction stream]
14
Working Example of Speculative Pre-Execution
Source loop:

    for (k = 0; k < N; k++)
        x[k] = q + y[k] * (r * z[k+10] + t * z[k+11]);

Dynamic instruction stream for the main program:

    # iteration i
    lw    $24, 24($sp)
    mul   $25, $24, 8
    la    $8, z
    addu  $9, $25, $8
    l.d   $f16, 88($9)
    l.d   $f18, 0($sp)
    mul.d $f4, $f16, $f18
    l.d   $f6, 8($sp)
    l.d   $f8, 80($9)
    mul.d $f10, $f6, $f8
    add.d $f16, $f4, $f10
    la    $10, y
    addu  $11, $25, $10
    l.d   $f18, 0($11)
    mul.d $f6, $f16, $f18
    l.d   $f8, 16($sp)
    add.d $f4, $f6, $f8
    la    $12, x
    addu  $13, $25, $12
    s.d   $f4, 0($13)
    lw    $14, 24($sp)
    addu  $15, $14, 1
    sw    $15, 24($sp)
    blt   $15, 1024, 33

    # iteration i+1
    lw    $24, 24($sp)
    mul   $25, $24, 8
    la    $8, z
    addu  $9, $25, $8
    l.d   $f16, 88($9)
    l.d   $f18, 0($sp)
    mul.d $f4, $f16, $f18
    l.d   $f6, 8($sp)
    l.d   $f8, 80($9)
    mul.d $f10, $f6, $f8
    add.d $f16, $f4, $f10
    la    $10, y
    addu  $11, $25, $10
    l.d   $f18, 0($11)
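
As a C-level rendering of what a p-thread for this loop could look like (our own sketch, not from the talk; the prefetch distance of 8 iterations is an arbitrary assumption), only the index computation of the backward slice survives:

    /* p-thread sketch: touch the lines the main thread will need soon. */
    void p_thread(const double *y, const double *z, int N)
    {
        for (int k = 0; k < N - 8; k++) {
            __builtin_prefetch(&z[k + 10 + 8]);  /* covers z[k+10], z[k+11] */
            __builtin_prefetch(&y[k + 8]);
        }
    }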
15
Executing the P-thread in a Multithreaded Manner
[Figure: timeline of the main program flow with the p-thread running ahead of it]
  • Parallel execution of the p-thread and the main program thread
    - Using hardware multithreading such as SMT or CMP
  • Either reduces the cache misses or the cache-miss penalty

16
Single-Chip (On-Chip) Multithreading
  • Two approaches to multithreaded execution
    - Simultaneous Multithreading (SMT) and Chip Multiprocessor (CMP)
    - Both support simultaneous execution of multiple threads
    - SMT supports multiple contexts on one superscalar core
    - CMP arranges multiple small processing units on a chip
  • The differences lie in resource sharing
    - In SMT, every structure (instruction fetch queue, scheduling window, reorder buffer, registers, and functional units) is shared between threads: better utilization and resource sharing
    - In CMP, every processing unit has dedicated resources: it is implemented with simpler atomic hardware structures (hence a higher clock rate)

17
Outline
  • High Performance Processor Design Issues
    - Instruction Level Parallelism
    - Memory Wall Problem on Superscalar
  • Background
    - Speculative Pre-Execution
    - Single-Chip Multithreading
  • SPEAR: Speculative Pre-Execution Assisted by Compiler
    - Design Characteristics
    - Hardware/Software Descriptions
  • Evaluation and Analysis
  • Summary

18
Various Implementations of Speculative Pre-Execution
[Figure: three implementation paths. (1) Software-based static approach: an SP compiler performs a source-to-source transformation of the high-level source code into a main program plus p-thread for an SMT processor. (2) A software assembly/binary translator post-processes an existing binary executable into a main program plus p-thread. (3) Hardware-based dynamic approach: the execution core constructs the p-thread from the binary at run-time.]
19
Related Work: Speculative Pre-Execution
Software-based approaches (static, compile-time): a compiler or binary analyzer derives the p-thread from the source code or the binary.
  • Software-Controlled Pre-Execution (Luk, ISCA 2001): analysis of the high-level-language program, not automated
  • Software-Based Speculative Precomputation (Liao et al., PLDI 2002): post-pass binary adaptation, Itanium
  • Compiler Algorithms for Pre-Execution (Kim and Yeung, ASPLOS 2002): analysis of the high-level-language program, automated compiler
Hardware-based approaches (dynamic, run-time): the p-thread is constructed from the instruction stream (I-cache, instruction queue, reorder buffer).
  • Dependence Graph Precomputation (Annavaram et al., ISCA 2001): hardware-based, the front-end draws the dependence graph
  • Slice Processors (Moshovos et al., ICS 2001): hardware-based, back-end driven
  • Dynamic Speculative Precomputation (Collins et al., MICRO 2001): hardware-based, back-end driven, chaining p-slices
20
Two Main Operations of Speculative Pre-Execution
  • Construct the p-thread
  • Trigger the p-thread (at run-time)
21
SPEAR: Speculative Pre-Execution Assisted by CompileR
[Figure: design space; p-thread construction can be hardware-based or compiler-based, and p-thread triggering (at run-time) can be hardware-controlled or software-controlled. SPEAR pairs compiler-based construction with hardware-controlled triggering.]
  • The two steps of p-thread construction and p-thread triggering are not required to be bound to the same methodology
  • Compiler-based construction can also benefit from hardware-controlled triggering
  • A hybrid model of speculative pre-execution
22
Hardware Description for SPEAR
  • Based on an SMT architecture
  • The p-thread is extracted from the IFQ
[Figure: SPEAR pipeline (fetch, decode, detect p-thread, execute, writeback, commit); the I-cache, IFQ, functional units, D-cache, and L2 cache/main memory are shared between the two threads, while the main thread and p-thread have separate register files and ROBs. The P-thread Detector watches the PC for delinquent loads ("d-load detected!"), the P-thread Table supplies the p-thread information, and the P-thread Extractor pulls the p-thread instructions from the IFQ, marks them with a p-thread indicator, and sends them to the decoder.]
  • Three additional hardware structures (an illustrative entry layout follows the list)
    - PD: detects the p-thread instructions
    - PT: provides the p-thread information
    - PE: extracts the p-thread instructions and sends them to the decoder
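
As an illustration only (the field names and widths are our assumptions, not the paper's), a P-thread Table entry might hold what detection and extraction need:

    #include <stdint.h>

    /* Hypothetical PT entry: the PD matches trigger_pc against the fetch PC;
     * the PE then knows where the p-thread's instructions live and how many
     * to extract and hand to the decoder. */
    struct pt_entry {
        uint32_t trigger_pc;     /* PC of the trigger / delinquent load */
        uint32_t pthread_start;  /* first instruction of the slice */
        uint8_t  pthread_len;    /* number of instructions in the slice */
        uint8_t  live_ins;       /* main-thread registers to copy in */
        uint8_t  valid;
    };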

23
The SPEAR Compiler
[Figure: SPEAR compiler flow; gcc compiles the source code into a regular binary (PISA executable); (1) a PFG-drawing tool extracts program-structure information; (2) a profiling tool, run with input data, gathers dynamic information (d-loads and loop counts); (3) a program-slicing tool constructs the p-threads; (4) an attaching tool emits the SPEAR binary]
  • Four individual modules are implemented (based on SimpleScalar 3.0)
  • The input is a SimpleScalar executable and the output is a SPEAR binary

24
Control-Flow Detection for the P-thread
[Figure: two control-flow cases over basic blocks B1-B4, each containing backward slices 1 and 2 and a delinquent load on different paths]
  • A static-only method for computing the backward slice yields both backward slices 1 and 2
  • Case 1: the profiling tool indicates that the majority of cache misses happen when the program runs through B3 (see the sketch after this list)
    - The p-thread does not need to include backward slice 2 in B2
  • Case 2: the outer-loop execution causes more cache misses at the load in B2 than the inner-loop execution does
    - The p-thread does not need to include backward slice 2 in B3
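
A hedged C sketch of Case 1 (our own example, not from the talk): profiling shows that the misses arrive through one path, so the slice feeding the other path is pruned from the p-thread.

    /* The delinquent load a[idx] misses mostly when the else-path (B3)
     * produced idx, so profile-guided slicing keeps only B3's slice in
     * the p-thread and drops the B2 computation. */
    int kernel(const int *a, const int *b, int i, int flag)
    {
        int idx;
        if (flag)
            idx = i * 2;     /* B2: rarely leads to a miss -> pruned */
        else
            idx = b[i];      /* B3: most misses come through here -> kept */
        return a[idx];       /* delinquent load */
    }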

25
Compiler Support for P-thread Construction
(Pointer stressmark)

    while (...) {
        partition = field[index];
        ...
        for (ll = 0; ll < w; ll++) {
            x = field[index + ll];
            if (x > max) high++;
            else if (x > min) {
                partition = x;
                balance = 0;
                for (lll = ll + 1; lll < w; lll++)
                    if (field[index + lll] > partition) balance++;
                if (balance + high == w/2) break;
                else if (balance + high > w/2) min = partition;
                else { max = partition; high++; }
            }
        }
        if (min == max) break;
        index = (partition + hops) % (f - w);
        hops++;
    }

[Figure: the loop partitioned into basic blocks. A: partition = field[index], the delinquent load; B: the loop control (ll < w); C: x = field[index + ll]; D: the inner computation; E: index = (partition + hops) % (f - w), hops++, and the while condition.]
26
Compiler Support for P-thread Construction
(Pointer stressmark; the same loop as on the previous slide)

Basic p-thread, reached from the trigger instruction:

    x = field[index + ll];
    partition = x;
    index = (partition + hops) % (f - w);
    partition = field[index];     /* pre-executes the next delinquent load */

[Figure: the trigger instruction is inserted into the main program's control flow (blocks A-E) so that the p-thread runs ahead of the next occurrence of the delinquent load]
27
Outline
  • High Performance Processor Design Issues
    - Instruction Level Parallelism
    - Memory Wall Problem
  • Background
    - Speculative Pre-Execution
    - Single-Chip Multithreading
  • SPEAR: Speculative Pre-Execution Assisted by Compiler
    - Design Characteristics
    - Hardware/Software Descriptions
  • Evaluation and Analysis
  • Summary

28
SPEAR Simulation Parameters
Branch predictor: bimodal
Branch table size: 2048 entries
Issue width: 8
Commit width: 8
Instruction fetch queue size: varies (128, 256)
Reorder buffer size: 128 instructions
Integer functional units: ALU (x4), MUL/DIV
Floating-point functional units: ALU (x4), MUL/DIV
Number of memory ports: 2
Data L1 cache: 256 sets, 32-byte blocks, 4-way set-associative, LRU
Data L1 cache latency: 1 CPU clock cycle
Unified L2 cache: 1024 sets, 64-byte blocks, 4-way set-associative, LRU
Unified L2 cache latency: 12 CPU clock cycles
Memory access latency: 120 CPU clock cycles
The architecture simulator is based on the
SimpleScalar 3.0 tool set
29
SPEAR Evaluation Benchmark Descriptions
Suite | Name (abbreviation) | Skipped instr. | Simulated instr. | Loads executed | D-loads executed | Static d-loads | P-thread instr.
Stressmark | Pointer | none (full run) | 85.9M | 30.7M | 669,405 | 1 | 15
Stressmark | Update | none (full run) | 53.2M | 12.8M | 39,045 | 1 | 42
Stressmark | Neighborhood (nbh) | none (full run) | 763.1M | 297.4M | 15.4M | 4 | 51
Stressmark | Transitive Closure (tr) | 1B | 929.7M | 325.7M | 23.2M | 5 | 97
Stressmark | Matrix | 300M | 500M | 117.3M | 21.1M | 59 | 415
Stressmark | Field | none (full run) | 552.9M | 86.5M | 2M | 2 | 71
DIS Benchmarks | Data Management (dm) | none (full run) | 507.5M | 172.4M | 6M | 102 | 471
DIS Benchmarks | Ray Tracing (ray) | 300M | 1B | 280.7M | 3.8M | 15 | 46
DIS Benchmarks | Fast Fourier Transform (fft) | 1B | 500M | 140.2M | 7.7M | 143 | 1,129
SPEC CINT2000 | 164.gzip | 1B | 500M | 107.5M | 49.2M | 93 | 336
SPEC CINT2000 | 181.mcf | 1B | 500M | 164.1M | 2.6M | 43 | 134
SPEC CINT2000 | 175.vpr | 1B | 500M | 126.8M | 10.3M | 126 | 606
SPEC CINT2000 | 256.bzip2 | 1B | 500M | 107M | 16.5M | 107 | 564
SPEC CFP2000 | 183.equake | 1B | 1B | 207.7M | 1.3M | 119 | 444
SPEC CFP2000 | 179.art | 1B | 500M | 105.2M | 64.5M | 51 | 140
30
Benchmark Behavior
Benchmark | sus.128.8/Ideal (%) | Perfect d-load/Ideal (%) | Speed-up of perfect d-load
pointer | 84.7 | 99.7 | 1.178
update | 72.2 | 99.9 | 1.383
nbh | 84.5 | 99.9 | 1.183
tr | 24.1 | 39.7 | 1.650
matrix | 35.3 | 96.3 | 2.731
field | 99.6 | 99.8 | 1.002 (not memory-intensive)
dm | 53.8 | 58.1 | 1.079
ray | 29.3 | 31.3 | 1.069 (i-cache misses)
fft | 53.3 | 54.2 | 1.016
gzip | 93.6 | 96.8 | 1.035 (not memory-intensive)
mcf | 12.6 | 25.9 | 2.059
vpr | 27.4 | 42.2 | 1.539
bzip2 | 56.2 | 99.0 | 1.763
equake | 64.3 | 92.9 | 1.445
art | 20.6 | 63.2 | 3.061
Average | 53.1 | 74.3 | 1.546 (54.6% speed-up)
31
Performance Results: Speed-Up
  • Performance improves for 11 out of 15 applications
    - The best result: an 87.6% improvement for mcf
  • Four applications (tr, field, fft, and gzip) experience a slight performance degradation
    - Resource conflicts may cause the problem

Configuration | Performance improvement
SPEAR-128 | 12.7%
SPEAR-256 | 20.1%
32
Speed-Up: Dedicated Resources for the P-thread
  • Field and gzip are not memory-intensive programs
  • Ray: performance strongly depends on I-cache misses
  • Dm and fft: the selection of d-loads may be inadequate
    - Fft cannot even reduce the cache misses: the p-thread is too large, and cache pollution may happen

Configuration | Performance improvement
SPEAR.sf-128 | 18.9%
SPEAR.sf-256 | 26.3%

6.2% additional performance improvement with dedicated resources
33
Results Compared to the Perfect D-load Prefetching
SPEAR performance as a fraction of perfect d-load prefetching (%):
pointer 98.7 | update 81.7 | nbh 98.6 | tr 79.8 | matrix 62.6
field 100.0 | dm 98.5 | ray 97.9 | fft 94.7 | gzip 98.8
mcf 91.3 | vpr 72.2 | bzip2 72.9 | equake 85.5 | art 60.7

On average, SPEAR achieves 86.3% of the performance of the perfect d-load case.
34
Performance Improvement and Cache Hit Rate
[Figure: speed-up plotted against D-cache hit ratio for the baseline superscalar and SPEAR.sf-256]
35
Latency Tolerance
[Figure: performance under various L2 cache latencies and memory access latencies]
The prefetching capability of the SPEAR architecture provides robust performance at long latencies.
36
Long Memory Latency Tolerance
  • Performance is normalized to the shortest memory latency for each configuration
  • On average, comparing the longest latency to the shortest:
    - SPEAR-128 loses 39.7% and SPEAR-256 loses 38.4% of the baseline performance
    - The superscalar loses 48.5%

37
Outline
  • High Performance Processor Design Issues
    - Instruction Level Parallelism
    - Memory Wall Problem
  • Background
    - Speculative Pre-Execution
    - Single-Chip Multithreading
  • SPEAR: Speculative Pre-Execution Assisted by Compiler
    - Design Characteristics
    - Hardware/Software Descriptions
  • Evaluation and Analysis
  • Summary

38
Summary
  • Speculative execution of the future cache-miss slice in a multithreaded manner provides effective data prefetching
  • An automated binary tool (the SPEAR compiler) and the supporting hardware have been proposed
  • Compiler-assisted p-thread construction and hardware-supported p-thread triggering mesh well and yield good performance results
  • Over the 15 benchmark programs, the SPEAR.sf-256 model achieves a 26.3% improvement over the baseline superscalar architecture

http://pascal.eng.uci.edu/
39
Future Work
  • Circuit-level analysis considering VLSI implementation
    - For clock speed, area, and power consumption estimation
  • Since the scheme is complexity-effective, its use in embedded processors should be investigated
  • More algorithms for choosing an efficient prefetching thread
    - D-load selection as well as p-thread selection
    - Selective execution among the p-thread instructions
  • Multiple p-thread execution under multiprogramming workloads

40
End of the Presentation
  • Questions or comments?

Thank you very much!!