Microarchitectural Techniques to Exploit Repetitive Computations and Values

1
Microarchitectural Techniques to Exploit
Repetitive Computations and Values
Thesis defense (Barcelona, December 14, 2005)
  • Carlos Molina Clemente

Advisors: Antonio González and Jordi Tubella
2
Outline
  • Motivation and Objectives
  • Overview of Proposals
  • To improve the memory system
  • To speed-up the execution of instructions
  • Non Redundant Data Cache
  • Trace-Level Speculative Multithreaded Arch.
  • Conclusions and Future Work

3
Outline
  • Motivation and Objectives
  • Overview of Proposals
  • To improve the memory system
  • To speed-up the execution of instructions
  • Non Redundant Data Cache
  • Trace-Level Speculative Multithreaded Arch.
  • Conclusions and Future Work

4
Motivation
  • Programs are general by design
  • real-world programs
  • operating systems
  • Often designed with an eye to
  • future expansion
  • code reuse
  • Input sets have little variation

5
Types of Repetition
Repetition
z = F(x, y)
6
Repetitive Computations
[Figure: repetitive computations, y-axis from 0 to 100. Spec CPU2000, 500 million instructions]
7
Types of Repetition
Repetition
z = F(x, y)
8
Repetitive Values
[Figure: repetitive values, y-axis from 0 to 100. Spec CPU2000, 500 million instructions, analysis of destination value]
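As a rough illustration of the two kinds of repetition, the sketch below counts them over a toy dynamic instruction trace. The trace format and both definitions are assumptions made for this example, not the methodology of the thesis: a computation is counted as repeated when the same static instruction sees source values it has seen before, and a value is counted as repeated when the same static instruction produces a destination value it has produced before.

```python
from collections import defaultdict

def repetition_stats(trace):
    """trace: iterable of (pc, source_values, dest_value) tuples (hypothetical format).

    Returns (percent repeated computations, percent repeated values).
    """
    seen_inputs = defaultdict(set)    # pc -> source tuples already observed
    seen_outputs = defaultdict(set)   # pc -> destination values already observed
    total = rep_comp = rep_val = 0
    for pc, srcs, dest in trace:
        total += 1
        if srcs in seen_inputs[pc]:
            rep_comp += 1
        if dest in seen_outputs[pc]:
            rep_val += 1
        seen_inputs[pc].add(srcs)
        seen_outputs[pc].add(dest)
    if total == 0:
        return 0.0, 0.0
    return 100.0 * rep_comp / total, 100.0 * rep_val / total

toy_trace = [
    (0x10, (1, 2), 3),
    (0x10, (1, 2), 3),    # same inputs again: repeated computation and repeated value
    (0x10, (0, 3), 3),    # new inputs, same result: repeated value only
    (0x14, (5, 5), 10),
]
comp_pct, val_pct = repetition_stats(toy_trace)
print(comp_pct, val_pct)
```

Under these definitions every repeated computation is also a repeated value, but not the other way around, which is why the two are measured separately.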
9
Objectives
10
Experimental Framework
  • Methodology
  • Analysis of benchmarks
  • Definition of proposal
  • Evaluation of proposal
  • Tools
  • Atom
  • Cacti 3.0
  • Simplescalar Tool Set
  • Benchmarks
  • Spec CPU95
  • Spec CPU2000

11
Outline
  • Motivation and Objectives
  • Overview of Proposals
  • To improve the memory system
  • To speed-up the execution of instructions
  • Non Redundant Data Cache
  • Trace-Level Speculative Multithreaded Arch.
  • Conclusions and Future Work

12
Techniques to Improve Memory
Value Repetition
13
Redundant Store Instructions
  • Do NOT modify memory

STORE (@i, Value Y)
  • If (Value X = Value Y) then

Redundant Store
  • Contributions
  • Redundant stores
  • Analysis of repetition in the same storage location
  • Redundant stores applied to reduce memory traffic
  • Main results
  • 15-25% of store instructions are redundant
  • 5-20% memory traffic reduction

Molina, González, Tubella, Reducing Memory Traffic via Redundant Store Instructions, HPCN'99
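The redundant-store condition can be sketched in a few lines. This is a minimal software model, assuming that Value X is the value already held at address @i and that a store to a never-written address is not redundant; the function name and the store-list format are hypothetical.

```python
def count_redundant_stores(stores, memory=None):
    """stores: iterable of (address, value) pairs.

    Returns (redundant_count, total_count, final_memory).
    """
    memory = {} if memory is None else dict(memory)
    redundant = total = 0
    for addr, value in stores:
        total += 1
        if addr in memory and memory[addr] == value:
            redundant += 1          # Value X at @addr already equals Value Y
        else:
            memory[addr] = value    # only non-redundant stores modify memory
    return redundant, total, memory

stores = [(0x100, 7), (0x100, 7), (0x104, 0), (0x100, 8), (0x104, 0)]
redundant, total, final_mem = count_redundant_stores(stores)
print(redundant, total)
```

In this toy sequence two of the five stores leave memory unchanged, so a filter placed before the memory hierarchy could drop them without affecting program state.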
14
Non Redundant Data Cache
  • If (Value A = Value D) then

Value Repetition
  • Non redundant data cache (NRC)
  • Contributions
  • Analysis of repetition in several storage
    locations
  • Main results
  • On average, a value is stored 4 times at any
    given time
  • NRC: -32% area, -13% energy, -25% latency, +5%
    miss

Molina, Aliagas, García, Tubella, González, Non Redundant Data Cache, ISLPED'03
Aliagas, Molina, García, González, Tubella, Value Compression to Reduce Power in Data Caches, EUROPAR'03
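The idea of storing each distinct value only once can be illustrated with a toy model in which addresses hold pointers into a table of unique values. This is only a sketch under stated assumptions: the class name, the dictionary used to find an existing value, and the absence of any replacement policy are illustrative choices, not the hardware organization of the NRC.

```python
class NonRedundantStore:
    """Toy model: a pointer table indexes a value table of distinct values."""

    def __init__(self):
        self.pointer_table = {}   # address -> index into value_table
        self.value_table = []     # each distinct value is stored once
        self._index_of = {}       # value -> index (stand-in for a hardware lookup)

    def write(self, addr, value):
        idx = self._index_of.get(value)
        if idx is None:                       # first time this value is seen
            idx = len(self.value_table)
            self.value_table.append(value)
            self._index_of[value] = idx
        self.pointer_table[addr] = idx        # repeated values share one entry

    def read(self, addr):
        return self.value_table[self.pointer_table[addr]]

nrc = NonRedundantStore()
for addr in (0x00, 0x04, 0x08, 0x0C):
    nrc.write(addr, 0)            # the same value written to four locations
nrc.write(0x10, 42)
print(len(nrc.pointer_table), len(nrc.value_table))
```

Five addresses end up sharing two value entries; the saving comes from a pointer being narrower than the value it replaces.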
15
Outline
  • Motivation and Objectives
  • Overview of Proposals
  • To improve the memory system
  • To speed-up the execution of instructions
  • Non Redundant Data Cache
  • Trace-Level Speculative Multithreaded Arch.
  • Conclusions and Future Work

16
Techniques to Speed-up Instruction Execution
Computation Repetition
  • Avoid serialization caused by data dependences
  • Determine results of instructions without
    executing them
  • The goal is to speed up the execution of programs

22
Instruction Level Reuse (ILR)
Reuse Table
RCB
  • Redundant Computation Buffer (RCB)
  • Contributions
  • Performance potential of ILR
  • Main results
  • Ideal ILR speed-up of 1.5
  • RCB speed-up of 1.1 (outperforms previous
    proposals)

Molina, González, Tubella, Dynamic Removal of Redundant Computations, ICS'99
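Instruction-level reuse can be pictured as memoization keyed by the instruction address and its operand values. The sketch below is a generic reuse table with a hypothetical capacity and FIFO replacement; it is not the RCB design, whose organization is not described in these slides.

```python
import operator

class ReuseTable:
    """Generic reuse table: (pc, operand values) -> previously computed result."""

    def __init__(self, capacity=4):
        self.capacity = capacity   # hypothetical size
        self.entries = {}          # insertion-ordered, so the first key is the oldest
        self.hits = 0
        self.lookups = 0

    def execute(self, pc, op, a, b):
        self.lookups += 1
        key = (pc, a, b)
        if key in self.entries:
            self.hits += 1
            return self.entries[key]          # result reused, nothing is executed
        result = op(a, b)                     # normal execution
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))   # evict the oldest entry
        self.entries[key] = result
        return result

rt = ReuseTable()
results = [rt.execute(0x40, operator.add, 2, 3) for _ in range(3)]
print(results, rt.hits, rt.lookups)
```

The first execution fills the table and the next two are served from it, so dependent instructions could receive the result without waiting for a functional unit.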
23
Trace Level Reuse (TLR)
  • Contributions
  • Trace Level Reuse
  • Initial design issues for integrating TLR
  • Performance potential of TLR
  • Main results
  • Ideal TLR speed-up of 3.6
  • 4K-entry table: 25% of reuse, average trace size
    of 6

González, Tubella, Molina, Trace-Level Reuse, ICPP'99
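Trace-level reuse extends the same idea from one instruction to a whole sequence. A minimal sketch, assuming a table indexed by the trace's starting PC whose entries record the live-input register values, the live-output values and the PC that follows the trace; the entry layout and register names are hypothetical.

```python
def try_reuse_trace(table, pc, regs):
    """table: start PC -> list of (live_inputs, live_outputs, next_pc).

    Returns the PC after the trace on a hit, or None when no entry matches.
    """
    for live_in, live_out, next_pc in table.get(pc, []):
        # Live-input test: every input register must hold the recorded value.
        if all(regs.get(r) == v for r, v in live_in.items()):
            regs.update(live_out)   # apply the effects of the whole trace at once
            return next_pc          # all instructions in the trace are skipped
    return None                     # no match: execute the instructions normally

trace_table = {0x400: [({"r1": 4, "r2": 6}, {"r3": 10, "r4": 24}, 0x418)]}
regs = {"r1": 4, "r2": 6, "r3": 0, "r4": 0}
next_pc = try_reuse_trace(trace_table, 0x400, regs)
miss = try_reuse_trace(trace_table, 0x400, {"r1": 5, "r2": 6})
print(hex(next_pc), regs, miss)
```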
24
Trace Level Speculation (TLS)
  • Two orthogonal issues
  • Compiler analysis to support TSMA
  • Contributions
  • Trace Level Speculative Multithreaded Architecture
  • Main results
  • Speed-up of 1.38 with 20% misspeculations

Molina, González, Tubella, Trace-Level Speculative Multithreaded Architecture (TSMA), ICCD'02
Molina, González, Tubella, Compiler Analysis for TSMA, INTERACT'05
Molina, Tubella, González, Reducing Misspeculation Penalty in TSMA, ISHPC'05
25
Objectives and Proposals
  • To improve the memory system
  • Redundant store instructions
  • Non redundant data cache
  • To speed-up the execution of instructions
  • Redundant computation buffer (ILR)
  • Trace-level reuse buffer (TLR)
  • Trace-level speculative multithreaded
    architecture (TLS)

26
Outline
  • Motivation and Objectives
  • Overview of Proposals
  • To improve the memory system
  • To speed-up the execution of instructions
  • Non Redundant Data Cache
  • Trace-Level Speculative Multithreaded Arch.
  • Conclusions and Future Work

27
Motivation
  • Caches occupy close to 50% of total die area
  • Caches are responsible for a significant part of
    the total power dissipated by a processor

28
Data Value Repetition
[Figure: axes are percentage of repetitive values and percentage of time. Spec CPU2000, 1 billion instructions, 256KB data cache]
29
Conventional Cache
  • If (Value A = Value D) then

Value Repetition
30
Non Redundant Data Cache
Pointer Table
Value Table
Die Area Reduction
31
Non Redundant Data Cache
Pointer Table
Value Table
32
Non Redundant Data Cache
[Diagram: Pointer Table and Value Table (figure labels: 1, 2, 1)]
33
Data Value Inlining
  • Some values can be represented with a small
    number of bits (Narrow Values)
  • Narrow values can be inlined into pointer area
  • Simple sign extension is applied
  • Benefits
  • enlarges effective capacity of VT
  • reduces latency
  • reduces power dissipation
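Inlining a narrow value amounts to a range check on the way in and a sign extension on the way out. The sketch below assumes 32-bit data words and a hypothetical 16-bit pointer field with a separate "inlined" flag; the widths used in the thesis are not given in these slides.

```python
WORD_BITS = 32
POINTER_BITS = 16   # hypothetical width of the pointer field

def to_signed(word, bits=WORD_BITS):
    """Interpret the low `bits` bits of word as a two's-complement integer."""
    word &= (1 << bits) - 1
    return word - (1 << bits) if word >> (bits - 1) else word

def is_narrow(word):
    """True when the word survives truncation to POINTER_BITS plus sign extension."""
    v = to_signed(word)
    return -(1 << (POINTER_BITS - 1)) <= v < (1 << (POINTER_BITS - 1))

def encode(word, allocate_in_vt):
    """Return (inlined_flag, field): the value itself if narrow, else a VT index."""
    if is_narrow(word):
        return True, word & ((1 << POINTER_BITS) - 1)
    return False, allocate_in_vt(word)

def decode(inlined, field, value_table):
    if inlined:
        # Sign-extend the narrow field back to a full word.
        return to_signed(field, POINTER_BITS) & ((1 << WORD_BITS) - 1)
    return value_table[field]

value_table = []

def allocate_in_vt(word):
    value_table.append(word)
    return len(value_table) - 1

narrow = encode(0xFFFFFFFE, allocate_in_vt)   # -2 fits in 16 bits, no VT entry used
wide = encode(0x12345678, allocate_in_vt)     # needs a Value Table entry
print(narrow, wide, hex(decode(*narrow, value_table)), hex(decode(*wide, value_table)))
```

Narrow values never touch the Value Table, which is where the capacity, latency and power benefits listed above come from.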

34
Non Redundant Data Cache
[Diagram: Pointer Table and Value Table with Data Value Inlining (figure labels: F, 2, 1234, 0)]
35
Miss Rate vs Die Area
[Figure: miss ratio vs. die area (x-axis ticks at 0.1, 0.5 and 1.0 cm2) for L2 caches of 256KB, 512KB, 1MB, 2MB and 4MB; series VT50, VT30, VT20 and CONV. Spec CPU2000, 1 billion instructions]
36
Results
  • Caches ranging from 256 KB to 4 MB

37
Outline
  • Motivation and Objectives
  • Overview of Proposals
  • To improve the memory system
  • To speed-up the execution of instructions
  • Non Redundant Data Cache
  • Trace-Level Speculative Multithreaded Arch.
  • Conclusions and Future Work

38
Trace Level Speculation
  • Avoids serialization caused by data dependences
  • Skips multiple instructions in a row
  • Predicts values based on the past
  • Solves live-input test
  • Introduces penalties due to misspeculations

39
Trace Level Speculation
  • Two orthogonal issues
  • microarchitecture support for trace speculation
  • control and data speculation techniques
  • prediction of initial and final points
  • prediction of live output values
  • Trace Level Speculative Multithreaded
    Architecture (TSMA)
  • does not introduce significant misspeculation
    penalties
  • Compiler Analysis
  • based on static analysis that uses profiling data

40
Trace Level Speculation with Live Output Test
ST
NST
41
TSMA Block Diagram
Look Ahead Buffer
42
Compiler Analysis
  • Focuses on
  • developing effective trace selection schemes for
    TSMA
  • based on static analysis that uses profiling data
  • Trace Selection
  • Graph Construction (CFG DDG)
  • Graph Analysis

43
Graph Analysis
  • Two important issues
  • initial and final point of a trace
  • maximize trace length minimize misspeculations
  • predictability of live output values
  • prediction accuracy and utilization degree
  • Three basic heuristics
  • Procedure Trace Heuristic
  • Loop Trace Heuristic
  • Instruction Chaining Trace Heuristic

44
Trace Speculation Engine
  • Traces are communicated to the hardware
  • at program loading time
  • filling a special hardware structure (trace
    table)
  • Each entry of the trace table contains
  • initial PC
  • final PC
  • live-output values information
  • branch history
  • frequency counter
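The trace table entry listed above maps naturally onto a small record type. The sketch below uses the 128-entry, 4-way set-associative organization given under Simulation Parameters; the set-index function, the 4-byte instruction size, and the replacement policy (evict the least frequently used entry in the set) are assumptions made for this example.

```python
from dataclasses import dataclass, field

NUM_ENTRIES = 128
WAYS = 4
NUM_SETS = NUM_ENTRIES // WAYS

@dataclass
class TraceEntry:
    initial_pc: int
    final_pc: int
    live_outputs: dict = field(default_factory=dict)   # register -> predicted value
    branch_history: int = 0
    frequency: int = 0

class TraceTable:
    def __init__(self):
        self.sets = [[] for _ in range(NUM_SETS)]

    def _set_for(self, pc):
        return self.sets[(pc >> 2) % NUM_SETS]   # assumes 4-byte instructions

    def insert(self, entry):
        ways = self._set_for(entry.initial_pc)
        if len(ways) >= WAYS:
            # Hypothetical policy: drop the least frequently used trace in the set.
            ways.remove(min(ways, key=lambda e: e.frequency))
        ways.append(entry)

    def lookup(self, pc):
        for entry in self._set_for(pc):
            if entry.initial_pc == pc:
                entry.frequency += 1
                return entry
        return None

table = TraceTable()
table.insert(TraceEntry(0x1000, 0x1040, {"r3": 7}))
hit = table.lookup(0x1000)
miss = table.lookup(0x2000)
print(hit, miss)
```

A hit at fetch time would give the speculation engine the final PC to jump to and the live-output values to predict.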

45
Simulation Parameters
  • Base microarchitecture
  • out-of-order machine, 4 instructions per cycle
  • I-cache 16KB, D-cache 16KB, shared L2 256KB
  • bimodal predictor
  • 64-entry ROB; FUs: 4 int, 2 div, 2 mul, 4 fp
  • TSMA additional structures
  • each thread: instruction window, reorder buffer, register
    file
  • speculative data cache: 1KB
  • trace table: 128 entries, 4-way set associative
  • look-ahead buffer: 128 entries
  • verification engine: up to 8 instructions per
    cycle

46
Speedup
[Figure: speedup, y-axis from 1.00 to 1.45. Spec CPU2000, 250 million instructions]
47
Misspeculations
Spec CPU2000, 250 million instructions
48
Outline
  • Motivation and Objectives
  • Overview of Proposals
  • To improve the memory system
  • To speed-up the execution of instructions
  • Non Redundant Data Cache
  • Trace-Level Speculative Multithreaded Arch.
  • Conclusions and Future Work

49
Conclusions
  • Repetition is very common in programs
  • Can be applied
  • to improve the memory system
  • to speed-up the execution of instructions
  • Investigated several alternatives
  • Novel cache organizations
  • Instruction level reuse approach
  • Trace level reuse concept
  • Trace level speculation architecture

50
Future Work
  • Value repetition in instruction caches
  • Profiling to support data value reuse schemes
  • Traces starting at different PCs
  • Value prediction in TSMA
  • Multiple speculations in TSMA
  • Multiple threads in TSMA

51
Publications
  • Value Repetition in Cache Organizations
  • Reducing Memory Traffic Via Redundant Store
    Instructions, HPCN'99
  • Non Redundant Data Cache, ISLPED'03
  • Value Compression to Reduce Power in Data Caches,
    EUROPAR'03
  • Instruction and Trace Level Reuse
  • The Performance Potential of Data Value Reuse,
    TR-UPC-DAC98
  • Dynamic Removal of Redundant Computations, ICS'99
  • Trace Level Reuse, ICPP'99
  • Trace Level Speculation
  • Trace-Level Speculative Multithreaded
    Architecture, ICCD'02
  • Compiler Analysis for TSMA, INTERACT'05
  • Reducing Misspeculation Penalty in TSMA, ISHPC'05

52
Microarchitectural Techniques to Exploit
Repetitive Computations and Values
Thesis defense (Barcelona, December 14, 2005)
  • Carlos Molina Clemente

Advisors: Antonio González and Jordi Tubella