Title: Decoupled Prefetching for Data Intensive Applications
1 Decoupled Prefetching for Data Intensive Applications
Wonwoo Ro
Parallel and Distributed Processing Center (PDPC), University of Southern California
2 Overview: HiDISC Architecture
3 Outline
- Introduction
- Decoupled Architecture
- Data Intensive Applications
- De-SMT (Decoupled SMT)
- Conclusion
4 Memory Wall Problem
/* C code:          i = i + 2;        */
/* MIPS assembly:   lw   $24, 0($15)
                    addu $25, $24, 2
                    sw   $25, 0($15)  */
5 Limitation of Present Solutions
- Huge cache
  - Slow, and works well only if the working set fits in the cache and locality exists
- Prefetching
  - Hardware prefetching
    - Cannot be tailored to each application
    - Behavior is based only on past and present execution-time behavior
  - Software prefetching
    - Must ensure the overheads of prefetching do not outweigh the benefits
    - Hard to insert prefetches for irregular access patterns (see the sketch below)
- SMT
  - Enhances utilization and throughput at the thread level
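To make the software-prefetching point concrete, here is a minimal C sketch (not from the slides): a fixed prefetch distance works for a regular array walk, but for a linked structure the next address is not known early enough. The node layout, the DIST constant, and the use of the GCC/Clang __builtin_prefetch intrinsic are assumptions for illustration.

#include <stddef.h>

/* Regular access: the address DIST iterations ahead is known in advance,
 * so a software prefetch can be issued well before the data is needed. */
void sum_array(const int *a, size_t n, long *sum)
{
    enum { DIST = 16 };                        /* assumed prefetch distance */
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST]);
        *sum += a[i];
    }
}

/* Irregular access: the next address becomes known only after the current
 * node is loaded, so a prefetch can hide at most one hop of latency. */
struct node { int value; struct node *next; };

void sum_list(const struct node *p, long *sum)
{
    while (p) {
        if (p->next)
            __builtin_prefetch(p->next->next); /* still serialized on p->next */
        *sum += p->value;
        p = p->next;
    }
}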
6 Outline
- Introduction
- Decoupled Architecture
- Data Intensive Applications
- De-SMT (Decoupled SMT)
- Conclusion
7 Memory Wall Solutions
- Using multithreading
  - Enhances utilization and throughput
- Lowering the height of the wall
  - New DRAM technologies: RDRAM, DDR DRAM
  - Integration of memory and processor (IRAM)
- Prefetching
  - Single stream: hardware/software
  - Multiple streams: decoupled architecture
[Figure: program flow and the memory wall]
8 Decoupled Architectures
[Figure: program flow and the memory wall]
9 Decoupled Architectures
- Exploit ILP by executing the two instruction streams on separate processors
- Out-of-order issue comes cheaply by using queues
- The access processor slips far ahead of the execute processor (see the sketch below)
  - Hides memory latency
- Instruction-level parallelism due to the multiple streams
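As an informal illustration of the access/execute split (a software analogue, not the hardware design in the slides), the access stream performs the loads and pushes operands into a queue while the execute stream consumes them. The queue depth QSIZE and the indexed-sum workload are assumptions.

#include <stddef.h>

#define QSIZE 64                      /* assumed queue depth */

static long   queue[QSIZE];
static size_t head, tail;

/* Access stream: runs ahead and tolerates memory latency.
 * Flow control (stalling when the queue is full) is omitted for brevity,
 * so this sketch assumes n <= QSIZE. */
static void access_stream(const long *data, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        queue[tail++ % QSIZE] = data[idx[i]];   /* load, then enqueue */
}

/* Execute stream: consumes operands from the queue with no exposed
 * load latency and performs the computation. */
static long execute_stream(size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += queue[head++ % QSIZE];           /* dequeue, then compute */
    return sum;
}

In a real decoupled machine the two streams run on separate processors and the queue is a hardware structure that stalls the access processor when full.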
10 Decoupled Architectures: History
- Early decoupled architectures (1980s)
  - PIPE and the Astronautics ZS-1
  - Exploited a functional partitioning of instruction streams
  - Compared against a simple RISC with a cache
- Decoupled architectures in the 1990s
  - Emergence of superscalar architectures
  - Decoupled architecture with a cache
  - A decoupled architecture is comparable to a superscalar with less complexity
    - Out-of-order issue, register renaming, and loop unrolling
  - Defeated
    - Not a von Neumann architecture, so compilers were not ready
    - Scientific applications had been the major benchmarks
11 Decoupled Architectures in the New Millennium
- Supporting cache prefetching
  - Cache misses are more costly than ever
  - PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
- Attractive for their decentralized characteristics
  - Complexity-effective design
- Speculative data-driven multithreading (Roth and Sohi)
  - Microarchitectural-level access decoupling
  - Restricted to only those accesses that are likely to result in cache misses
- Aggressive pointer following supported by the access stream
12 Outline
- Introduction
- Decoupled Architecture
- Data Intensive Applications
- De-SMT (Decoupled SMT)
- Conclusion
13 Pointer-Based Data Structures
- Linked lists, trees, and graphs
  - Popular in various applications, including compilers and databases
  - Object-oriented programming is the trend (C++ and Java)
- Serial nature of pointer dereferences
  - Known as the pointer-chasing problem (see the sketch below)
  - Makes the memory wall problem more serious
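For example, in a binary-search-tree lookup (a hypothetical illustration, not one of the DIS codes), the address of the next node is unknown until the current node has been loaded, so the miss latencies along the search path add up serially:

struct tnode { int key; struct tnode *left, *right; };

/* Each step issues a load whose address depends on the previous load,
 * so cache misses along the search path cannot be overlapped. */
const struct tnode *bst_find(const struct tnode *t, int key)
{
    while (t && t->key != key)
        t = (key < t->key) ? t->left : t->right;   /* dependent pointer dereference */
    return t;
}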
14 DIS Benchmarks
- Atlantic Aerospace DIS benchmark suite
  - Application-oriented benchmarks
  - Many defense applications employ large data sets, non-contiguous memory accesses, and no temporal locality
  - Requires a compiler that can link multiple object files
- Atlantic Aerospace Stressmark suite
  - Smaller and more specific procedures
  - Seven individual data-intensive benchmarks
  - Directly illustrate particular elements of the DIS problem, with simplicity but reduced realism
15Stressmark Suite
DIS Stressmark Suite Version 1.0, Atlantic
Aerospace Division
16 Pointer Stressmark
- Basic idea: repeatedly follow pointers to randomized locations in memory
- The memory access pattern is unpredictable
  - Randomized memory access pattern
  - Insufficient temporal and spatial locality for conventional cache architectures
- The HiDISC architecture provides lower memory access latency
17 Pointer Stressmark
partition = field[index];
for (ll = 0; ll < w; ll++) {
    x = field[index+ll];
    if (x > max) high++;
    else if (x > min) {
        partition = x;
        balance = 0;
        for (lll = ll+1; lll < w; lll++)
            if (field[index+lll] > partition) balance++;
        /* if it results in the median, break */
        if (balance + high == w/2) break;
        else if (balance + high > w/2) min = partition;
        else max = partition;
    }
}
18 Pointer Stressmark
- Sequential access to the elements in a window (w)
  - Provides some locality
  - A 32-byte cache line still causes at least 2 cache misses at w = 15 (with 4-byte elements, a 15-element window spans 60 bytes, i.e., at least two lines)
- Decoupled architecture
  - Effectively reduces L1 cache misses by prefetching
  - The first access in each hop still causes a cache miss
19 Decoupling of Pointer Stressmark

Original inner loop:
for (i = j+1; i < w; i++) {
    if (field[index+i] > partition) balance++;
}
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else max = partition;
high++;

Computation Processor code:
while (not EOD)
    if (field > partition) balance++;
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else max = partition;
high++;

Access Processor code:
for (i = j+1; i < w; i++) {
    load(field[index+i]);
    GET_SCQ;
}
send(EOD token);

Cache Management code (inner loop for the next indexing):
for (i = j+1; i < w; i++) {
    prefetch(field[index+i]);
    PUT_SCQ;
}
20 Update Stressmark
- Pointer following with memory updates
- Companion stressmark to the Pointer stressmark
  - Essentially the same memory access pattern
- Because of the memory writes, it cannot be parallelized
21 Field Stressmark
- The Field stressmark emphasizes regular access to large quantities of data
- Sequential pattern matching
  - Field = database, token = key
  - Scans a field of strings to find token strings
  - At each matching instance, the element is modified
  - For every token, this process is performed over the whole field (see the sketch below)
- A large field size results in cache misses
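A minimal sketch of this access pattern (an illustration only, not the DIS Field code): each token triggers a sequential sweep over the whole field, and the element at each match is modified. The byte-oriented types, the xor update, and the name scan_field are assumptions.

#include <string.h>
#include <stddef.h>

static size_t scan_field(unsigned char *field, size_t field_len,
                         const unsigned char *token, size_t token_len)
{
    size_t matches = 0;
    /* Regular, sequential sweep over the whole field for one token. */
    for (size_t i = 0; i + token_len <= field_len; i++) {
        if (memcmp(&field[i], token, token_len) == 0) {
            field[i] ^= 0xFF;            /* modify the element at the match */
            matches++;
        }
    }
    return matches;
}

When the field is much larger than the cache, each sweep streams through memory with little reuse, which is the behavior the stressmark is designed to expose.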
22 Transitive Closure Stressmark
- Floyd-Warshall algorithm to find all-pairs shortest paths
- Input: adjacency matrix of a directed graph
- Output: adjacency matrix of the shortest-path transitive closure
- Semi-regular access to elements of multiple matrices concurrently
23 Transitive Closure Stressmark (cont.)
for (k = 0; k < n; k++) {
    unsigned int old, new;
    unsigned int *dtemp;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            old = *(din + j*n + i);
            new = *(din + j*n + k) + *(din + k*n + i);
            *(dout + j*n + i) = (new < old ? new : old);
        } /* end of loop for j */
    }     /* end of loop for i */
    dtemp = dout;
    dout = din;
    din = dtemp;
}
- Because the stride inside the innermost (j) loop is n, spatial locality is destroyed
- On each iteration of the j-loop, the memory address increases by n elements
24 Stressmarks
- Hand-compile the 7 individual benchmarks
  - Use gcc as the front-end
  - Manually partition each benchmark into the three instruction streams and insert synchronizing instructions
- Modifications to the simulator
  - Supports dynamic linking of shared libraries
    - Loads libc.so.1 and libm.so
  - Supports multiple memory spaces
25 Simulator Overview
- Based on the MIPS RISC pipeline architecture (dsim)
  - Supports the MIPS1 and MIPS2 ISAs
  - Supports dynamic linking of shared libraries
    - Loads shared libraries into the simulator's memory
- Hand compiling
  - Use SGI cc or gcc as the front-end
  - Produce each of the three code streams
  - Use SGI cc as the compiler back-end
  - Manually modify the three .s files into HiDISC assembly (.hs)
  - Convert the .hs files back to .s for the three streams (hs2s)
[Tool flow: .c -> (cc -mips2 -S) -> .s -> .cp.s / .ap.s / .cmp.s -> (cc -mips2 -o, with shared libraries) -> .cp / .ap / .cmp -> dsim]
26 Outline
- Introduction
- Decoupled Architecture
- Data Intensive Applications
- De-SMT (Decoupled SMT)
- Conclusion
27 Multiple Issue Processors
- Instruction-level parallelism (ILP)
  - Multiple-issue paradigm
  - Inherently out-of-order (OOO) issue
- Two prevailing architectures
  - Superscalar: dynamic dependency checking (hardware)
  - VLIW: static dependency checking (compiler)
- Superscalar has conquered the commercial world
  - A feasible variation of the traditional pipelined processor
  - Flexible adaptation to commercial workloads
28 Out-of-Order Superscalar Limitations
- Limited ILP within a basic block leads to idle functional units
  - Speculation is also wasted
- Increased memory access delay relative to the processor clock
  - Results in memory stalls and pipeline penalties
- Solutions
  - Trace-driven processors
    - Speculative processors, multiscalar
  - TLP
    - CMP: localizes processor resources
    - SMT: efficient use of FUs, latency tolerance
29 Future of the Wide-Issue Superscalar
- Window wakeup and selection logic will be the most critical part of a superscalar
  - Palacharla 1997, "Complexity-Effective Superscalar Processors," ISCA-24
- Instruction windows are centralized resources
  - Poor wire scaling as semiconductor devices shrink
  - The amount of state that can be accessed in a single clock cycle will cease to grow
  - Improvements in clock rate and IPC become directly antagonistic (Agarwal et al., 2000, ISCA-27)
=> Decentralized design
30 Decoupled SMT (De-SMT)
[Figure: each of Programs 1-4 is split into an Execute Thread and an Access Thread]
- The decoupled architecture reduces memory latency, while SMT increases the utilization of the functional units
- The complexity of the scheduling logic can be reduced by the decoupled characteristics
31 Why Does SMT Need Decoupling?
- SMT inherits the circuit complexity of a wide-issue superscalar with a monolithic scheduler
  - OOO issue is questionable in an SMT architecture
  - Additional logic (multiple PCs and register files) adds complexity
- Some cache performance problems also appear due to the multithreaded behavior of SMT
  - Cache pollution due to multiple threads
  - Bad things can happen simultaneously
32 Decoupled Architectures on SMT
- Dynamically utilize the functional units through SMT features
- The weakest point of decoupled architectures
  - Blocking of execution due to synchronization
  - Blocking caused by control flow
  - A blocked thread can be dropped and another thread switched in
- Each stream of the original decoupled architecture can be implemented as a thread in SMT (see the sketch below)
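As a rough software analogue of this idea (a sketch under stated assumptions, not the De-SMT microarchitecture), each program could be split into an access thread that performs the loads and an execute thread that computes, connected by a bounded queue; blocking on a full or empty queue is where an SMT core would switch to another ready thread. The queue depth and the struct job workload are assumptions.

#include <pthread.h>
#include <stddef.h>

#define QDEPTH 64                     /* assumed queue depth */

static long            q[QDEPTH];
static size_t          q_head, q_tail, q_count;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cv   = PTHREAD_COND_INITIALIZER;

struct job { const long *data; const size_t *idx; size_t n; long result; };

/* Access thread: performs the (potentially slow) loads and blocks when the
 * queue is full -- the point at which an SMT core would run another thread. */
static void *access_thread(void *arg)
{
    struct job *j = arg;
    for (size_t i = 0; i < j->n; i++) {
        long v = j->data[j->idx[i]];
        pthread_mutex_lock(&q_lock);
        while (q_count == QDEPTH)
            pthread_cond_wait(&q_cv, &q_lock);
        q[q_tail] = v; q_tail = (q_tail + 1) % QDEPTH; q_count++;
        pthread_cond_signal(&q_cv);
        pthread_mutex_unlock(&q_lock);
    }
    return NULL;
}

/* Execute thread: consumes operands from the queue and computes. */
static void *execute_thread(void *arg)
{
    struct job *j = arg;
    long sum = 0;
    for (size_t i = 0; i < j->n; i++) {
        pthread_mutex_lock(&q_lock);
        while (q_count == 0)
            pthread_cond_wait(&q_cv, &q_lock);
        sum += q[q_head]; q_head = (q_head + 1) % QDEPTH; q_count--;
        pthread_cond_signal(&q_cv);
        pthread_mutex_unlock(&q_lock);
    }
    j->result = sum;
    return NULL;
}

Creating one such thread pair per program with pthread_create mimics the figure on slide 30, where four programs each contribute an access thread and an execute thread to the SMT core.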
33 Outline
- Introduction
- Decoupled Architecture
- Data Intensive Applications
- De-SMT (Decoupled SMT)
- Conclusion
34 Conclusion
- Decoupled architectures can reduce memory access latency by separating program streams
- Decoupled prefetching is an attractive alternative for data-intensive applications
  - Between von Neumann and data-flow
- Decoupled architectures have merit as complexity-effective ILP processors
- Access streams can be supported by the thread concept in multithreading architectures