1
Decoupled Prefetching for Data Intensive Application
  • HiDISC and beyond

Wonwoo Ro, Parallel and Distributed Processing Center (PDPC), University of Southern California
2
Overview: HiDISC Architecture
3
Outline
  • Introduction
  • Decoupled Architecture
  • Data Intensive Application
  • De-SMT (Decoupled SMT)
  • Conclusion

4
Memory Wall Problem
// C code
i = i + 2;

// MIPS assembly
lw   $24, 0($15)
addu $25, $24, 2
sw   $25, 0($15)
5
Limitation of Present Solutions
  • Huge cache
    • Slow, and works well only if the working set fits in the cache and there is locality
  • Prefetching
    • Hardware prefetching
      • Cannot be tailored for each application
      • Prefetch decisions are based on past and present execution-time behavior
    • Software prefetching (see the sketch after this list)
      • Must ensure the overheads of prefetching do not outweigh the benefits
      • Hard to insert prefetches for irregular access patterns
  • SMT
    • Enhances utilization and throughput at the thread level
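To make the software-prefetching trade-off concrete, here is a minimal sketch, assuming GCC's __builtin_prefetch and an assumed prefetch distance PREFETCH_AHEAD (neither comes from the slides). It shows the easy, regular-stride case; for irregular, pointer-based access the next address is not known early enough to issue a useful prefetch.

/* Minimal software-prefetching sketch (illustrative only, not from the slides).
 * PREFETCH_AHEAD is an assumed tuning knob: too small hides no latency,
 * too large pollutes the cache, so the overhead can outweigh the benefit. */
#include <stddef.h>

#define PREFETCH_AHEAD 16

long sum_array(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0 /* read */, 1 /* low locality */);
        sum += a[i];
    }
    return sum;
}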

6
Outline
  • Introduction
  • Decoupled Architecture
  • Data Intensive Application
  • De-SMT (Decoupled SMT)
  • Conclusion

7
Memory Wall Solutions
  • Using multithreading
    • Enhances utilization and throughput
  • Lowering the height of the wall
    • New DRAM technology: RDRAM, DDR DRAM
    • Integration of memory and processor (IRAM)
  • Prefetching
    • Single stream: hardware/software
    • Multiple streams: decoupled architecture

(Figure: program flow running into the memory wall)
8
Decoupled Architectures
(Figure: program flow and the memory wall)
9
Decoupled Architectures
  • Exploiting ILP by executing the two instruction streams on separate processors
  • Out-of-order issue comes cheaply by using queues
  • The access processor can slip far ahead of the execute processor
    • Hides memory latency
  • Instruction-level parallelism due to the multiple streams (a sketch follows this list)
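A minimal sketch of the access/execute split, written as plain C with a software FIFO (my own illustration, not HiDISC code; the queue size and function names are invented). The access stream does all address computation and loads and pushes values into the queue; the execute stream only consumes them, which is how the access side can slip ahead and hide load latency.

/* Conceptual access/execute decoupling sketch (illustrative only).
 * In a real decoupled machine the queue is a hardware FIFO and the two
 * streams run concurrently; here they are just two functions. */
#include <stddef.h>

#define QSIZE 1024            /* assumed queue capacity (n <= QSIZE) */
static long   queue[QSIZE];
static size_t q_tail, q_head;

/* Access stream: performs the address arithmetic and memory accesses. */
void access_stream(const long *a, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n && q_tail < QSIZE; i++)
        queue[q_tail++] = a[idx[i]];      /* loads run ahead of the consumer */
}

/* Execute stream: consumes queued values; no address computation here. */
long execute_stream(size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n && q_head < q_tail; i++)
        sum += queue[q_head++];
    return sum;
}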

10
Decoupled Architectures History
  • Early decoupled architectures (1980s)
    • PIPE and the Astronautics ZS-1
    • Exploited a functional partitioning of instruction streams
    • Compared against a simple RISC with a cache
  • Decoupled architectures in the 1990s
    • Emergence of superscalar architectures
    • Decoupled architectures with caches
    • A decoupled architecture is comparable to a superscalar with less complexity
      • Out-of-order issue, register renaming, and loop unrolling
    • Defeated
      • Not a von Neumann architecture → compilers were not ready
      • Scientific applications had been the major benchmarks

11
Decoupled Architectures in the New Millennium
  • Supporting cache prefetching
    • Cache miss penalties are larger than ever
    • PIMs and RAMBUS offer a high-bandwidth memory system
      • Useful for speculative prefetching
  • Attractive for its decentralized characteristics
    • Complexity-effective design
  • Speculative data-driven multithreading (Roth and Sohi)
    • Microarchitecture-level access decoupling
    • Restricted to only those accesses that are likely to result in cache misses
    • Aggressive pointer following supported by the access stream

12
Outline
  • Introduction
  • Decoupled Architecture
  • Data Intensive Application
  • De-SMT (Decoupled SMT)
  • Conclusion

13
Pointer-Based Data Structures
  • Linked lists, trees, and graphs
  • Popular in various applications, including compilers and databases
  • Object-oriented programming is the trend (C++ and Java)
  • Serial nature of pointer dereferences
    • Known as the pointer-chasing problem (see the sketch below)
  • Makes the memory wall problem more serious
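A minimal C sketch of the pointer-chasing problem (illustrative only, not from the slides): each node's address is known only after the previous load returns, so dependent misses serialize and a stride-based hardware prefetcher cannot guess the next address.

/* Pointer-chasing sketch: the next address is produced by the current load,
 * so cache misses cannot be overlapped. */
struct node {
    long         value;
    struct node *next;
};

long sum_list(const struct node *p)
{
    long sum = 0;
    while (p != NULL) {
        sum += p->value;
        p = p->next;    /* dependent load: unknown until the previous one completes */
    }
    return sum;
}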

14
DIS Benchmarks
  • Atlantic Aerospace DIS benchmark suite
    • Application-oriented benchmarks
    • Many defense applications employ large data sets, non-contiguous memory accesses, and no temporal locality
    • Requires a compiler that can link multiple object files
  • Atlantic Aerospace Stressmarks suite
    • Smaller and more specific procedures
    • Seven individual data-intensive benchmarks
    • Directly illustrate particular elements of the DIS problem, with simplicity but reduced realism

15
Stressmark Suite
(Table: DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division)
16
Pointer Stressmark
  • Basic idea: repeatedly follow pointers to randomized locations in memory
    • Memory access pattern is unpredictable
  • Randomized memory access pattern
    • Insufficient temporal and spatial locality for conventional cache architectures
  • The HiDISC architecture provides lower memory access latency

17
Pointer Stressmark
partition = field[index];
for (ll = 0; ll < w; ll++) {
    x = field[index + ll];
    if (x > max) high++;
    else if (x > min) {
        partition = x;
        balance = 0;
        for (lll = ll + 1; lll < w; lll++)
            if (field[index + lll] > partition) balance++;
        // if it results in the median, break
        if (balance + high == w/2) break;
        else if (balance + high > w/2)
            min = partition;
        else max = partition;
    }
}

18
Pointer Stressmark
  • Sequential access to the elements in a window (w)
  • Locality
    • A 32-byte cache line causes at least 2 cache misses at w = 15 (e.g., with 4-byte elements, a 15-element window spans 60 bytes, i.e., at least two 32-byte lines)
  • Decoupled architecture
    • Effectively reduces L1 cache misses by prefetching
    • The first access in each hop still causes a cache miss

19
Decoupling of Pointer Stressmark
Original inner loop:
    for (i = j+1; i < w; i++)
        if (field[index + i] > partition) balance++;
    if (balance + high == w/2) break;
    else if (balance + high > w/2) min = partition;
    else { max = partition; high++; }

Computation Processor Code:
    while (not EOD)
        if (field > partition) balance++;
    if (balance + high == w/2) break;
    else if (balance + high > w/2) min = partition;
    else { max = partition; high++; }

Access Processor Code:
    for (i = j+1; i < w; i++) {
        load(field[index + i]);
        GET_SCQ;
    }
    send(EOD token);

Cache Management Code:
    /* inner loop for the next indexing */
    for (i = j+1; i < w; i++) {
        prefetch(field[index + i]);
        PUT_SCQ;
    }
20
Update Stressmark
  • Pointer following with memory update (a sketch follows this list)
  • Companion stressmark to the Pointer stressmark
    • Essentially the same memory access pattern
  • Due to the memory writes, it cannot be parallelized
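A hedged sketch of the behavior described above, assuming a simple field layout and update rule of my own (not the actual DIS Update code): every hop reads the visited element, writes it back, and uses the value just read to choose the next index, so the hops form a serial read-modify-write chain.

/* Illustrative-only sketch of "pointer following with memory update". */
unsigned int follow_and_update(unsigned int *field, unsigned int n,
                               unsigned int index, unsigned int hops)
{
    for (unsigned int h = 0; h < hops; h++) {
        unsigned int next = field[index] % n;   /* value read decides the next hop */
        field[index] += 1;                      /* write back to the visited cell */
        index = next;                           /* serial dependence across hops */
    }
    return index;
}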

21
Field Stressmark
  • The Field stressmark emphasizes regular access to large quantities of data
  • Sequential pattern matching (a sketch follows this list)
    • Field: database, Token: key
    • Scanning a field of strings to find token strings
    • At each matching instance, elements are modified
    • For every token, this process is performed over the whole field
  • Large field sizes result in cache misses
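A hedged sketch of the scan-and-modify pattern described above (illustrative only; the token representation and the modification applied at a match are assumptions, not the actual DIS Field code).

/* Illustrative-only sketch: scan the whole field once per token and
 * modify the element that follows each match. */
#include <stddef.h>
#include <string.h>

unsigned long scan_field(unsigned char *field, size_t field_len,
                         const unsigned char *token, size_t token_len)
{
    unsigned long matches = 0;
    for (size_t i = 0; i + token_len <= field_len; i++) {
        if (memcmp(&field[i], token, token_len) == 0) {
            matches++;
            if (i + token_len < field_len)
                field[i + token_len] ^= 0xFF;   /* assumed modification rule */
        }
    }
    return matches;
}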

22
Transitive Closure Stressmark
  • Floyd-Warshall algorithm to find all-pairs shortest paths
  • Input: adjacency matrix of a directed graph
  • Output: adjacency matrix of the shortest-path transitive closure
  • Semi-regular access to elements in multiple matrices concurrently

23
Transitive Closure Stressmark (Cont)
for (k = 0; k < n; k++) {
    unsigned int old, new;
    unsigned int *dtemp;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            old = *(din + j*n + i);
            new = *(din + j*n + k) + *(din + k*n + i);
            *(dout + j*n + i) = (new < old ? new : old);
        } /* end of loop for j */
    } /* end of loop for i */
    dtemp = dout;
    dout = din;
    din = dtemp;
}
  • Due to the stride of n in the innermost loop, spatial locality is destroyed
    • In each iteration of the innermost (j) loop, the memory address increases by n

24
Stressmarks
  • Hand-compile the seven individual benchmarks
    • Use gcc as the front end
    • Manually partition each benchmark into the three instruction streams and insert synchronizing instructions
  • Modifications to the simulator
    • Supports dynamic linking of shared libraries
      • Loading libc.so.1 and libm.so
    • Supports multiple memory spaces

25
Simulator Overview
  • Based on a MIPS RISC pipeline architecture (dsim)
    • Supports the MIPS1 and MIPS2 ISAs
    • Supports dynamic linking of shared libraries
      • Loads shared libraries into the simulator's memory
  • Hand compiling
    • Use SGI cc or gcc as the front end
    • Produce each stream of code
    • Use SGI cc as the compiler back end

  • Modify the three .s files into HiDISC assembly (.hs)
  • Convert the .hs files back to .s for the three codes (hs2s)

Tool flow: .c → (cc -mips2 -S) → .s → .cp.s / .ap.s / .cmp.s → (cc -mips2 -o, shared library) → .cp / .ap / .cmp → dsim
26
Outline
  • Introduction
  • Decoupled Architecture
  • Data Intensive Application
  • De-SMT (Decoupled SMT)
  • Conclusion

27
Multiple Issue Processor
  • Instruction-level parallelism (ILP)
    • Multiple-issue paradigm
    • Inherently out-of-order (OOO) issue
  • Two prevailing architectures
    • Superscalar: dynamic dependency checking (HW)
    • VLIW: static dependency checking (compiler)
  • Superscalars conquered the commercial world
    • A feasible variation of the traditional pipelined processor
    • Flexible adaptation to commercial workloads

28
Out of Order Superscalar Limitation
  • Limited ILP within a basic block → idle functional units
    • Speculation is also wasted
  • Increased memory access delay versus the processor clock
    • Results in memory stalls and pipeline penalties
  • Solutions
    • Trace-driven processors
      • Speculative processors, multiscalar
    • TLP
      • CMP: localizes processor resources
      • SMT: efficient use of FUs, latency tolerance

29
Future of the Wide Issue Superscalar
  • Window wakeup and selection logic will be the most critical part of a superscalar
    • Palacharla et al., 1997, Complexity-Effective Superscalar Processors, ISCA-24
    • The instruction window is a centralized resource
  • Poor wire scaling as semiconductor devices shrink
    • The amount of state that can be accessed in a single clock cycle will cease to grow
    • Improvements in clock rate and IPC become directly antagonistic
    • Agarwal et al., 2000, ISCA-27

→ Decentralized design
30
Decoupled SMT (De-SMT)
(Figure: Programs 1-4, each partitioned into an Access Thread and an Execute Thread)
  • The decoupled architecture reduces memory latency, while SMT increases the utilization of the functional units
  • The complexity of the scheduling logic can be reduced by the decoupled characteristics

31
Why does SMT Need Decoupling ?
  • SMT inherits the circuit complexity of a wide-issue superscalar with a monolithic scheduler
    • OOO issue is questionable in an SMT architecture
    • Additional logic (multiple PCs and register files) adds complexity
  • Some cache performance problems also appear due to the multithreading behavior of SMT
    • Cache pollution due to multiple threads
    • Bad things can happen simultaneously

32
Decoupled Architectures on SMT
  • Dynamically utilize the functional units through SMT features
  • The weakest point of a decoupled architecture
    • Blocking of execution due to synchronization
    • Blocking caused by control flow
    • A blocked thread can be dropped and execution switched to another thread
  • Each stream of the original decoupled architecture can be implemented as a thread in SMT (see the sketch below)
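A minimal sketch of that last point, assuming POSIX threads and a bounded software queue that stands in for the hardware queue (my own illustration, not the De-SMT design): the access and execute streams run as two threads, and whenever one of them blocks on the queue an SMT core could simply keep issuing from the other thread.

/* Illustrative-only sketch: access and execute streams as two threads
 * sharing a bounded queue. Only the queue is shown; the access thread
 * would call q_put() after each load, the execute thread q_get() for
 * each operand it consumes. */
#include <pthread.h>
#include <stddef.h>

#define QSIZE 64

static long            q_buf[QSIZE];
static size_t          q_head, q_tail, q_count;
static pthread_mutex_t q_lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  q_not_empty = PTHREAD_COND_INITIALIZER;

void q_put(long v)                                /* access thread side */
{
    pthread_mutex_lock(&q_lock);
    while (q_count == QSIZE)                      /* slipped too far ahead: block */
        pthread_cond_wait(&q_not_full, &q_lock);
    q_buf[q_tail] = v;
    q_tail = (q_tail + 1) % QSIZE;
    q_count++;
    pthread_cond_signal(&q_not_empty);
    pthread_mutex_unlock(&q_lock);
}

long q_get(void)                                  /* execute thread side */
{
    pthread_mutex_lock(&q_lock);
    while (q_count == 0)                          /* consumer caught up: block */
        pthread_cond_wait(&q_not_empty, &q_lock);
    long v = q_buf[q_head];
    q_head = (q_head + 1) % QSIZE;
    q_count--;
    pthread_cond_signal(&q_not_full);
    pthread_mutex_unlock(&q_lock);
    return v;
}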

33
Outline
  • Introduction
  • Decoupled Architecture
  • Data Intensive Application
  • De-SMT (Decoupled SMT)
  • Conclusion

34
Conclusion
  • Decoupled architecture can reduce memory access latency by separating the program streams
  • Decoupled prefetching is an attractive alternative for data-intensive applications
    • Between von Neumann and data-flow
  • Decoupled architecture has merits as a complexity-effective ILP processor
  • Access streams can be supported by the thread concept in a multithreading architecture