Title: Decoupled Prefetching for Data Intensive Applications
1 Decoupled Prefetching for Data Intensive Applications
Wonwoo Ro
Parallel and Distributed Processing Center (PDPC), University of Southern California
2 Overview: HiDISC Architecture
3 Outline
- Introduction
- Decoupled Architecture
- Data Intensive Applications
- De-SMT (Decoupled SMT)
- Conclusion
4 Memory Wall Problem
/* C code:          i = i + 2;        */
/* MIPS assembly:   lw   $24, 0($15)
                    addu $25, $24, 2
                    sw   $25, 0($15)  */
5 Limitation of Present Solutions
- Huge cache
  - Slow, and works well only if the working set fits in the cache and locality exists
- Prefetching
  - Hardware prefetching
    - Cannot be tailored to each application
    - Behavior is based only on past and present execution-time behavior
  - Software prefetching
    - Must ensure the overheads of prefetching do not outweigh the benefits
    - Hard to insert prefetches for irregular access patterns (see the sketch below)
- SMT
  - Enhances utilization and throughput at the thread level
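To make the software-prefetching point concrete, here is a minimal C sketch (not from the slides): a fixed prefetch distance works for a regular array walk, but for a linked structure the next address is not known early enough. The node layout, the DIST constant, and the use of the GCC/Clang __builtin_prefetch intrinsic are assumptions for illustration.

#include <stddef.h>

/* Regular access: the address DIST iterations ahead is known in advance,
 * so a software prefetch can be issued well before the data is needed. */
void sum_array(const int *a, size_t n, long *sum)
{
    enum { DIST = 16 };                        /* assumed prefetch distance */
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST]);
        *sum += a[i];
    }
}

/* Irregular access: the next address becomes known only after the current
 * node is loaded, so a prefetch can hide at most one hop of latency. */
struct node { int value; struct node *next; };

void sum_list(const struct node *p, long *sum)
{
    while (p) {
        if (p->next)
            __builtin_prefetch(p->next->next); /* still serialized on p->next */
        *sum += p->value;
        p = p->next;
    }
}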
6 Outline
- Introduction
- Decoupled Architecture
- Data Intensive Applications
- De-SMT (Decoupled SMT)
- Conclusion
7 Memory Wall Solutions
- Using multithreading
  - Enhances utilization and throughput
- Lowering the height of the wall
  - New DRAM technologies: RDRAM, DDR DRAM
  - Integration of memory and processor (IRAM)
- Prefetching
  - Single stream: hardware/software
  - Multiple streams: decoupled architecture
[Figure: program flow and the memory wall]
8 Decoupled Architectures
[Figure: program flow and the memory wall]
9 Decoupled Architectures
- Exploit ILP by executing the two instruction streams on separate processors
- Out-of-order issue comes cheaply by using queues
- The access processor slips far ahead of the execute processor (see the sketch below)
  - Hides memory latency
- Instruction-level parallelism due to the multiple streams
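As an informal illustration of the access/execute split (a software analogue, not the hardware design in the slides), the access stream performs the loads and pushes operands into a queue while the execute stream consumes them. The queue depth QSIZE and the indexed-sum workload are assumptions.

#include <stddef.h>

#define QSIZE 64                      /* assumed queue depth */

static long   queue[QSIZE];
static size_t head, tail;

/* Access stream: runs ahead and tolerates memory latency.
 * Flow control (stalling when the queue is full) is omitted for brevity,
 * so this sketch assumes n <= QSIZE. */
static void access_stream(const long *data, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        queue[tail++ % QSIZE] = data[idx[i]];   /* load, then enqueue */
}

/* Execute stream: consumes operands from the queue with no exposed
 * load latency and performs the computation. */
static long execute_stream(size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += queue[head++ % QSIZE];           /* dequeue, then compute */
    return sum;
}

In a real decoupled machine the two streams run on separate processors and the queue is a hardware structure that stalls the access processor when full.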
10 Decoupled Architectures: History
- Early decoupled architectures (1980s)
  - PIPE and the Astronautics ZS-1
  - Exploited a functional partitioning of instruction streams
  - Compared against a simple RISC with a cache
- Decoupled architectures in the 1990s
  - Emergence of superscalar architectures
  - Decoupled architecture with a cache
  - A decoupled architecture is comparable to a superscalar with less complexity
    - Out-of-order issue, register renaming, and loop unrolling
  - Defeated
    - Not a von Neumann architecture, so compilers were not ready
    - Scientific applications had been the major benchmarks
11 Decoupled Architectures in the New Millennium
- Supporting cache prefetching
  - Cache misses are more costly than ever
  - PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
- Attractive for their decentralized characteristics
  - Complexity-effective design
- Speculative data-driven multithreading (Roth and Sohi)
  - Microarchitectural-level access decoupling
  - Restricted to only those accesses that are likely to result in cache misses
- Aggressive pointer following supported by the access stream
12 Outline
- Introduction
- Decoupled Architecture
- Data Intensive Applications
- De-SMT (Decoupled SMT)
- Conclusion
13 Pointer-Based Data Structures
- Linked lists, trees, and graphs
  - Popular in various applications, including compilers and databases
  - Object-oriented programming is the trend (C++ and Java)
- Serial nature of pointer dereferences
  - Known as the pointer-chasing problem (see the sketch below)
  - Makes the memory wall problem more serious
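For example, in a binary-search-tree lookup (a hypothetical illustration, not one of the DIS codes), the address of the next node is unknown until the current node has been loaded, so the miss latencies along the search path add up serially:

struct tnode { int key; struct tnode *left, *right; };

/* Each step issues a load whose address depends on the previous load,
 * so cache misses along the search path cannot be overlapped. */
const struct tnode *bst_find(const struct tnode *t, int key)
{
    while (t && t->key != key)
        t = (key < t->key) ? t->left : t->right;   /* dependent pointer dereference */
    return t;
}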
14 DIS Benchmarks
- Atlantic Aerospace DIS benchmark suite
  - Application-oriented benchmarks
  - Many defense applications employ large data sets, non-contiguous memory accesses, and no temporal locality
  - Requires a compiler that can link multiple object files
- Atlantic Aerospace Stressmark suite
  - Smaller and more specific procedures
  - Seven individual data-intensive benchmarks
  - Directly illustrate particular elements of the DIS problem, with simplicity but reduced realism
15Stressmark Suite
DIS Stressmark Suite Version 1.0, Atlantic
Aerospace Division
16 Pointer Stressmark
- Basic idea: repeatedly follow pointers to randomized locations in memory
- The memory access pattern is unpredictable
  - Randomized memory access pattern
  - Insufficient temporal and spatial locality for conventional cache architectures
- The HiDISC architecture provides lower memory access latency
17 Pointer Stressmark
partition = field[index];
for (ll = 0; ll < w; ll++) {
    x = field[index+ll];
    if (x > max) high++;
    else if (x > min) {
        partition = x;
        balance = 0;
        for (lll = ll+1; lll < w; lll++)
            if (field[index+lll] > partition) balance++;
        /* if it results in the median, break */
        if (balance + high == w/2) break;
        else if (balance + high > w/2) min = partition;
        else max = partition;
    }
}
18 Pointer Stressmark
- Sequential access to the elements in a window (w)
  - Provides some locality
  - A 32-byte cache line still causes at least 2 cache misses at w = 15 (with 4-byte elements, a 15-element window spans 60 bytes, i.e., at least two lines)
- Decoupled architecture
  - Effectively reduces L1 cache misses by prefetching
  - The first access in each hop still causes a cache miss
19 Decoupling of Pointer Stressmark

Original inner loop:
for (i = j+1; i < w; i++) {
    if (field[index+i] > partition) balance++;
}
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else max = partition;
high++;

Computation Processor code:
while (not EOD)
    if (field > partition) balance++;
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else max = partition;
high++;

Access Processor code:
for (i = j+1; i < w; i++) {
    load(field[index+i]);
    GET_SCQ;
}
send(EOD token);

Cache Management code (inner loop for the next indexing):
for (i = j+1; i < w; i++) {
    prefetch(field[index+i]);
    PUT_SCQ;
}
20 Update Stressmark
- Pointer following with memory updates
- Companion stressmark to the Pointer stressmark
  - Essentially the same memory access pattern
- Because of the memory writes, it cannot be parallelized
21 Field Stressmark
- The Field stressmark emphasizes regular access to large quantities of data
- Sequential pattern matching
  - Field = database, token = key
  - Scans a field of strings to find token strings
  - At each matching instance, the element is modified
  - For every token, this process is performed over the whole field (see the sketch below)
- A large field size results in cache misses
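A minimal sketch of this access pattern (an illustration only, not the DIS Field code): each token triggers a sequential sweep over the whole field, and the element at each match is modified. The byte-oriented types, the xor update, and the name scan_field are assumptions.

#include <string.h>
#include <stddef.h>

static size_t scan_field(unsigned char *field, size_t field_len,
                         const unsigned char *token, size_t token_len)
{
    size_t matches = 0;
    /* Regular, sequential sweep over the whole field for one token. */
    for (size_t i = 0; i + token_len <= field_len; i++) {
        if (memcmp(&field[i], token, token_len) == 0) {
            field[i] ^= 0xFF;            /* modify the element at the match */
            matches++;
        }
    }
    return matches;
}

When the field is much larger than the cache, each sweep streams through memory with little reuse, which is the behavior the stressmark is designed to expose.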
22 Transitive Closure Stressmark
- Floyd-Warshall algorithm to find all-pairs shortest paths
- Input: adjacency matrix of a directed graph
- Output: adjacency matrix of the shortest-path transitive closure
- Semi-regular access to elements of multiple matrices concurrently
23 Transitive Closure Stressmark (cont.)
for (k = 0; k < n; k++) {
    unsigned int old, new;
    unsigned int *dtemp;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            old = *(din + j*n + i);
            new = *(din + j*n + k) + *(din + k*n + i);
            *(dout + j*n + i) = (new < old ? new : old);
        } /* end of loop for j */
    }     /* end of loop for i */
    dtemp = dout;
    dout = din;
    din = dtemp;
}
- Because the stride inside the innermost (j) loop is n, spatial locality is destroyed
- On each iteration of the j-loop, the memory address increases by n elements
24 Stressmarks
- Hand-compile the 7 individual benchmarks
  - Use gcc as the front-end
  - Manually partition each benchmark into the three instruction streams and insert synchronizing instructions
- Modifications to the simulator
  - Supports dynamic linking of shared libraries
    - Loads libc.so.1 and libm.so
  - Supports multiple memory spaces
25 Simulator Overview
- Based on the MIPS RISC pipeline architecture (dsim)
  - Supports the MIPS1 and MIPS2 ISAs
  - Supports dynamic linking of shared libraries
    - Loads shared libraries into the simulator's memory
- Hand compiling
  - Use SGI cc or gcc as the front-end
  - Produce each of the three code streams
  - Use SGI cc as the compiler back-end
  - Manually modify the three .s files into HiDISC assembly (.hs)
  - Convert the .hs files back to .s for the three streams (hs2s)
[Tool flow: .c -> (cc -mips2 -S) -> .s -> .cp.s / .ap.s / .cmp.s -> (cc -mips2 -o, with shared libraries) -> .cp / .ap / .cmp -> dsim]
26 Outline
- Introduction
- Decoupled Architecture
- Data Intensive Applications
- De-SMT (Decoupled SMT)
- Conclusion
27 Multiple Issue Processors
- Instruction-level parallelism (ILP)
  - Multiple-issue paradigm
  - Inherently out-of-order (OOO) issue
- Two prevailing architectures
  - Superscalar: dynamic dependency checking (hardware)
  - VLIW: static dependency checking (compiler)
- Superscalar has conquered the commercial world
  - A feasible variation of the traditional pipelined processor
  - Flexible adaptation to commercial workloads
28 Out-of-Order Superscalar Limitations
- Limited ILP within a basic block leads to idle functional units
  - Speculation is also wasted
- Increased memory access delay relative to the processor clock
  - Results in memory stalls and pipeline penalties
- Solutions
  - Trace-driven processors
    - Speculative processors, multiscalar
  - TLP
    - CMP: localizes processor resources
    - SMT: efficient use of FUs, latency tolerance
29 Future of the Wide-Issue Superscalar
- Window wakeup and selection logic will be the most critical part of a superscalar
  - Palacharla 1997, "Complexity-Effective Superscalar Processors," ISCA-24
- Instruction windows are centralized resources
  - Poor wire scaling as semiconductor devices shrink
  - The amount of state that can be accessed in a single clock cycle will cease to grow
  - Improvements in clock rate and IPC become directly antagonistic (Agarwal et al., 2000, ISCA-27)
=> Decentralized design
30 Decoupled SMT (De-SMT)
[Figure: each of Programs 1-4 is split into an Execute Thread and an Access Thread]
- The decoupled architecture reduces memory latency, while SMT increases the utilization of the functional units
- The complexity of the scheduling logic can be reduced by the decoupled characteristics
31 Why Does SMT Need Decoupling?
- SMT inherits the circuit complexity of a wide-issue superscalar with a monolithic scheduler
  - OOO issue is questionable in an SMT architecture
  - Additional logic (multiple PCs and register files) adds complexity
- Some cache performance problems also appear due to the multithreaded behavior of SMT
  - Cache pollution due to multiple threads
  - Bad things can happen simultaneously
32 Decoupled Architectures on SMT
- Dynamically utilize the functional units through SMT features
- The weakest point of decoupled architectures
  - Blocking of execution due to synchronization
  - Blocking caused by control flow
  - A blocked thread can be dropped and another thread switched in
- Each stream of the original decoupled architecture can be implemented as a thread in SMT (see the sketch below)
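As a rough software analogue of this idea (a sketch under stated assumptions, not the De-SMT microarchitecture), each program could be split into an access thread that performs the loads and an execute thread that computes, connected by a bounded queue; blocking on a full or empty queue is where an SMT core would switch to another ready thread. The queue depth and the struct job workload are assumptions.

#include <pthread.h>
#include <stddef.h>

#define QDEPTH 64                     /* assumed queue depth */

static long            q[QDEPTH];
static size_t          q_head, q_tail, q_count;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cv   = PTHREAD_COND_INITIALIZER;

struct job { const long *data; const size_t *idx; size_t n; long result; };

/* Access thread: performs the (potentially slow) loads and blocks when the
 * queue is full -- the point at which an SMT core would run another thread. */
static void *access_thread(void *arg)
{
    struct job *j = arg;
    for (size_t i = 0; i < j->n; i++) {
        long v = j->data[j->idx[i]];
        pthread_mutex_lock(&q_lock);
        while (q_count == QDEPTH)
            pthread_cond_wait(&q_cv, &q_lock);
        q[q_tail] = v; q_tail = (q_tail + 1) % QDEPTH; q_count++;
        pthread_cond_signal(&q_cv);
        pthread_mutex_unlock(&q_lock);
    }
    return NULL;
}

/* Execute thread: consumes operands from the queue and computes. */
static void *execute_thread(void *arg)
{
    struct job *j = arg;
    long sum = 0;
    for (size_t i = 0; i < j->n; i++) {
        pthread_mutex_lock(&q_lock);
        while (q_count == 0)
            pthread_cond_wait(&q_cv, &q_lock);
        sum += q[q_head]; q_head = (q_head + 1) % QDEPTH; q_count--;
        pthread_cond_signal(&q_cv);
        pthread_mutex_unlock(&q_lock);
    }
    j->result = sum;
    return NULL;
}

Creating one such thread pair per program with pthread_create mimics the figure on slide 30, where four programs each contribute an access thread and an execute thread to the SMT core.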
33 Outline
- Introduction
- Decoupled Architecture
- Data Intensive Applications
- De-SMT (Decoupled SMT)
- Conclusion
34 Conclusion
- Decoupled architectures can reduce memory access latency by separating program streams
- Decoupled prefetching is an attractive alternative for data-intensive applications
  - Between von Neumann and data-flow
- Decoupled architectures have merit as complexity-effective ILP processors
- Access streams can be supported by the thread concept in multithreading architectures