Gautham K.Dorai and - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
TRANSPARENT THREADS
  • Gautham K. Dorai and
  • Dr. Donald Yeung
  • ECE Dept., Univ. Maryland, College Park

2
SMT Processors
Priority Mechanisms
Pipeline
Multiple Threads
  • ICOUNT, IQPOSN, MISSCOUNT, BRCOUNT
  • [Tullsen, ISCA '96]

3
Individual Threads run Slower!
Individual Thread Performance: 31% degradation
4
Single-Thread Performance
  • Multiprogramming (Process Scheduling)
  • Subordinate Threading (Prefetching/Pre-Execution,
    Cache Management, Branch Prediction, etc.)
  • Performance Monitoring (Dynamic Profiling)

5
Transparent Threads
Foreground Thread
Background Thread (Transparent)
0% SLOWDOWN
6
Single-Thread Performance
  • Multiprogramming
  • Latency of critical high-priority process
  • Subordinate Threading
  • Performance Monitoring
  • Benefit vs Cost (Overhead) Tradeoff

7
Road Map
  • Motivation
  • Transparent Threads
  • Experimental Evaluation
  • Transparent Software Prefetching
  • Conclusion

8
Shared vs. Private
Transparency: No stealing of shared resources
[Figure: SMT pipeline resources classified as shared or private: PC, ROB, Predictor, Functional Units, Issue Queues, Register File, I-Cache, Register Map, D-Cache, Fetch Queue]
9
Slots, Buffers and Memories
  • SLOTS: Allocation based on current cycle only
  • BUFFERS: Allocation based on future cycles
  • MEMORIES: Allocation based on future cycles

[Figure: pipeline resources grouped into slots, buffers, and memories: PC, ROB, Predictor, Issue Units, Issue Queues, Register File, I-Cache, Register Map, D-Cache, Fetch Queue]
10
Slot Prioritization
ICOUNT(background) = ICOUNT(background) + Inst-Window Size
Fetch policy: ICOUNT 2.N
[Figure: Foreground and Background threads competing for the Fetch Slots of an I-Cache block]
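
The priority bias on this slide can be sketched in a few lines of C. This is only an illustration, not the simulator's code: it assumes that an ICOUNT-style policy fetches from the thread with the lowest in-flight instruction count and that a background thread's count is raised by the instruction-window size. The names `thread_t`, `effective_icount`, and `pick_fetch_thread` are hypothetical, and the sketch picks a single thread per cycle rather than the two that ICOUNT 2.N allows.

```c
#include <stddef.h>

/* Hypothetical per-thread fetch state; field names are illustrative. */
typedef struct {
    int icount;        /* instructions currently in flight for this thread */
    int is_background; /* nonzero for a transparent (background) thread */
} thread_t;

/* Raise a background thread's count by the instruction-window size, so it
   never beats a foreground thread whose count is at most the window size. */
static int effective_icount(const thread_t *t, int window_size) {
    return t->is_background ? t->icount + window_size : t->icount;
}

/* ICOUNT-style choice: return the index of the thread with the lowest
   effective count (ties go to the lower index), or -1 if n == 0. */
static int pick_fetch_thread(const thread_t *threads, size_t n, int window_size) {
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (best < 0 ||
            effective_icount(&threads[i], window_size) <
            effective_icount(&threads[best], window_size)) {
            best = (int)i;
        }
    }
    return best;
}
```

With a foreground thread holding 100 in-flight instructions and an idle background thread, the foreground thread is still chosen, because the background count is compared as 0 + window size.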
13
Buffer Transparency
[Figure: Fetch Hardware (PC1, PC2) feeding the Fetch Queue, Issue Queue, and ROB, which hold both Foreground and Background instructions]
14
Background Thread Window Partitioning
  • Limit on Background Thread Instructions
  • Stops fetch when ICOUNT reaches limit

[Figure: Fetch Hardware (PC1, PC2), Fetch Queue, Issue Queue, and ROB, with a Partition between Foreground and Background entries]
15
Background Thread Window Partitioning
  • Foreground Thread can occupy all available
    entries

[Figure: Fetch Hardware (PC1, PC2), Fetch Queue, Issue Queue, and ROB, with a Partition between Foreground and Background entries]
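
The partitioning rule on the two slides above can be sketched as a fetch gate. This is a minimal sketch under stated assumptions: the struct, the function name `may_fetch`, and the example sizes in the note below are hypothetical, and the real hardware applies the limit per buffer rather than to a single counter.

```c
/* Hypothetical per-thread state for background window partitioning. */
typedef struct {
    int icount;        /* in-flight instructions for this thread */
    int is_background; /* nonzero for a transparent (background) thread */
} thread_t;

/* A background thread may fetch only while its in-flight count is below
   the partition limit; a foreground thread is limited only by the total
   window capacity. Returns nonzero if the thread may fetch this cycle. */
static int may_fetch(const thread_t *t, int total_in_flight,
                     int window_size, int background_limit) {
    if (total_in_flight >= window_size)
        return 0; /* window is full for every thread */
    if (t->is_background && t->icount >= background_limit)
        return 0; /* background thread has reached its partition */
    return 1;
}
```

For example, with an assumed 128-entry window and a 32-entry background limit, a background thread holding 32 instructions is stopped while a foreground thread holding 90 may keep fetching.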
16
Background Thread Flushing
  • No limit on Background Thread

[Figure: Fetch Hardware (PC1, PC2), Fetch Queue, Issue Queue, and ROB, with Head and Tail pointers for Foreground and Background entries]
17
Background Thread Flushing
  • No limit on Background Thread
  • Flush Triggered on Conflict

[Figure: Fetch Hardware (PC1, PC2), Fetch Queue, Issue Queue, and ROB, with Head and Tail pointers for Foreground and Background entries]
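
A simple occupancy model can illustrate flush-on-conflict. It is a sketch under assumptions, not the evaluated design: it tracks only entry counts for one shared buffer, it flushes a single background entry per conflict, and all names (`buffer_t`, `bg_allocate`, `fg_allocate`) are hypothetical.

```c
/* Hypothetical occupancy model of one shared buffer (e.g. an issue queue). */
typedef struct {
    int capacity;   /* total entries in the buffer */
    int fg_entries; /* entries held by the foreground thread */
    int bg_entries; /* entries held by the background thread */
} buffer_t;

/* The background thread may take any free entry: there is no partition.
   Returns 1 on success, 0 if the buffer is full. */
static int bg_allocate(buffer_t *b) {
    if (b->fg_entries + b->bg_entries >= b->capacity)
        return 0;
    b->bg_entries++;
    return 1;
}

/* The foreground thread allocates an entry. If the buffer is full and the
   background thread holds entries, that is a conflict, and one background
   entry is flushed to make room. Returns the number of entries flushed
   (0 or 1), or -1 if the foreground thread alone fills the buffer. */
static int fg_allocate(buffer_t *b) {
    int flushed = 0;
    if (b->fg_entries + b->bg_entries >= b->capacity) {
        if (b->bg_entries == 0)
            return -1;   /* genuinely full: the foreground thread stalls */
        b->bg_entries--; /* conflict: flush one background entry */
        flushed = 1;
    }
    b->fg_entries++;
    return flushed;
}
```

The design point the model captures is that the background thread is free to use idle entries, yet the foreground thread never waits for one that the background thread occupies.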
19
Foreground Thread Flushing
  • Instructions remain stagnant in ROB
  • Flush Triggered on load miss at head
  • Flush Stagnated Entries

[Figure: Fetch Hardware (PC1, PC2), Fetch Queue, Issue Queue, and ROB, with a Load Miss at the Head of the ROB]
20
Foreground Thread Flushing
  • Flush F Entries from the tail
  • Block the fetch for T Cycles

[Figure: Fetch Hardware (PC1, PC2), Fetch Queue, Issue Queue, and ROB, with a Load Miss at the Head of the ROB]
21

Foreground Thread Flushing
  • After T Cycles, fetch is allowed again
  • F and T depend on R (Residual Cache Latency)

[Figure: Fetch Hardware (PC1, PC2), Fetch Queue, Issue Queue, and ROB, with a Load Miss at the Head of the ROB]
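
The flush-then-block sequence on the three slides above can be sketched as a small state machine. The slides do not say how F and T are derived from R, so the mapping in `choose_f_and_t` is a pure placeholder; all names here are hypothetical, and the model tracks only an entry count and a countdown.

```c
/* Hypothetical state for foreground-thread flushing. */
typedef struct {
    int rob_entries;   /* foreground instructions currently in the ROB */
    int fetch_blocked; /* cycles remaining before fetch may resume */
} fg_state_t;

/* Placeholder policy: the real mapping from R to F and T is not given on
   the slides, so these ratios are purely illustrative. */
static void choose_f_and_t(int residual_latency, int *f, int *t) {
    *f = residual_latency / 4;
    *t = residual_latency / 2;
}

/* Called when a load at the head of the ROB misses in the cache.
   Flushes F entries from the tail and blocks fetch for T cycles.
   Returns the number of entries actually flushed. */
static int on_head_load_miss(fg_state_t *s, int residual_latency) {
    int f, t;
    choose_f_and_t(residual_latency, &f, &t);
    if (f > s->rob_entries - 1)
        f = s->rob_entries - 1; /* never flush the missing load itself */
    if (f < 0)
        f = 0;
    s->rob_entries -= f;
    s->fetch_blocked = t;
    return f;
}

/* Called once per cycle; returns nonzero if the thread may fetch. */
static int tick_and_may_fetch(fg_state_t *s) {
    if (s->fetch_blocked > 0) {
        s->fetch_blocked--;
        return 0;
    }
    return 1;
}
```

The entries freed by the flush, and the fetch slots left idle while the countdown runs, are what the background thread can use during the miss.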
22
SimpleScalar-based SMT
23
Benchmark Suites
  • Evaluate Transparency Mechanisms
  • Transparent Software Prefetching

24
Transparency Mechanisms
  • EP: Equal Priority
  • SP: Slot Prioritization
  • BP: Background Thread Window Partitioning (32 Entries)
  • BF: Background Thread Flushing
  • PC: Private Caches
  • PP: Private Predictor
25
Transparency Mechanisms
  • Equal Priority: 30% slowdown
  • Slot Prioritization: 16% slowdown
  • Background Window Partitioning: 9% slowdown
  • Background Thread Flushing: 3% slowdown

[Chart: bars labeled EP, SP, BP, BF, PC, and PP]
26
Performance Mechanisms
  • EP: Equal Priority
  • 2B: ICOUNT 2.8
  • 2F: ICOUNT 2.8 with Flushing
  • 2P: Foreground Thread Window Partitioning (112F, 32B)
27
Performance Mechanisms
  • Equal Priority: 31% degradation
  • ICOUNT 2.8: 41% slower than EP
  • ICOUNT 2.8 with Foreground Thread Flushing: 23% slower than EP
  • Foreground Thread Window Partitioning: 13% slower than EP

[Chart: Normalized IPC for 2B, 2F, 2P, and EP]
28
Transparent Software Prefetching
Conventional Software Prefetching
  • In-lined Prefetch Code
  • Profitability of Prefetching
  • Benefit vs Cost tradeoff (Profiling required)

    for (I = 0; I < N-PD; I += 8) { prefetch(b[I]); /* computation on b[I] and z */ }

Transparent Software Prefetching
  • Offload Prefetch Code to Transparent Threads
  • Zero Overhead, No Profiling Required

    Computation Thread:
    for (I = 0; I < N-PD; I += 8) { /* computation on b[I] and z */ }

    Transparent Prefetch Thread:
    for (I = 0; I < N-PD; I += 8) prefetch(b[I]);
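
The two code shapes compared on this slide can be sketched in C. This is an illustration under assumptions, not the code from the talk: the computation is replaced by a simple sum, `pd` stands in for the prefetch distance, the stride of 8 assumes 8-byte elements and 64-byte cache lines, and `__builtin_prefetch` is the GCC/Clang builtin, which is only a hint to the hardware. Running `prefetch_only` on a transparent thread is what the slide proposes; the sketch just shows the routine such a thread would execute.

```c
#include <stddef.h>

/* Conventional software prefetching: prefetch instructions are in-lined
   with the computation, so the main thread pays their overhead. */
static double sum_inline_prefetch(const double *b, size_t n, size_t pd) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + pd < n)
            __builtin_prefetch(&b[i + pd]); /* hint only; no visible effect */
        sum += b[i];
    }
    return sum;
}

/* Split form: the computation loop carries no prefetch code... */
static double sum_compute_only(const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += b[i];
    return sum;
}

/* ...and the prefetch loop is a separate routine for a background thread.
   It touches one element per assumed cache line and changes no program
   state, so running it or skipping it cannot alter the result. */
static void prefetch_only(const double *b, size_t n) {
    for (size_t i = 0; i < n; i += 8)
        __builtin_prefetch(&b[i]);
}
```

Because the prefetch routine has no side effects on program state, it can be applied without first profiling whether prefetching pays off: in the worst case it wastes only resources the foreground thread was not using.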
29

Transparent Software Prefetching
  • NP: No Prefetching
  • PF: Naive Conventional Software Prefetching
  • PS: Profiled Conventional Software Prefetching
  • TSP: Transparent Software Prefetching

[Chart: Normalized Execution Time for NP, PF, PS, and TSP on VPR]
30
Transparent Software Prefetching
  • Naïve Software Prefetching: 19.6% overhead, 0.8% performance gain
  • Selective Software Prefetching: 14.13% overhead, 2.47% performance gain
  • Transparent Software Prefetching: 1.38% overhead, 9.52% performance gain

[Chart: NP, PF, PS, and TSP bars for VPR, BZIP, GAP, EQUAKE, ART, AMMP, and IRREG]
31
Conclusions
  • Transparency Mechanisms
      • 3% overhead on foreground thread
      • Less than 1% without cache and predictor contention
  • Throughput Mechanisms
      • Within 23% of Equal Priority
  • Transparent Software Prefetching
      • 9.52% gain with 1.38% overhead
      • Eliminates the need for profiling
  • Availability of spare bandwidth
      • Can be used transparently for interesting applications

32
Related Work
  • Tullsen's work on flushing mechanisms
    [Tullsen, Micro-2001]
  • Raasch's work on prioritization
    [Raasch, MTEAC Workshop 1999]
  • Snavely's work on job scheduling
    [Snavely, ICMM-2001]
  • Chappell's work on Subordinate Multithreading and
    Dubois's work on Assisted Execution
    [Chappell, ISCA-1999; Dubois, Tech-Report Oct 98]

33
Foreground Thread Window Partitioning
  • Advantages
      • Minimal guaranteed entries
  • Disadvantages
      • Transparency minimized

[Figure: Fetch Hardware (PC1, PC2), Fetch Queue, and Issue Queue, with a Partition between Foreground and Background entries]
35
Transparency Mechanisms
[Chart: EP, SP, and BF bars for each benchmark]
36
Transparency Mechanisms
[Chart: EP, SP, and BF bars for each benchmark]
37
Transparency Mechanisms
38
Transparent Software Prefetching
[Chart: NP, PF, PS, TSP, and NF bars for each of seven benchmarks]