Memory Access Optimization of Dynamic Binary Translation for Reconfigurable Architectures

1
Memory Access Optimization of Dynamic Binary Translation for Reconfigurable Architectures

November 9, 2005
KAIST
Sejong Oh, Tag Gon Kim

2
Outline

Introduction
Binary Translation Framework
Dynamic Translation for Loop Pipelining
Experimental Results
Conclusion

3
Reconfigurable Architectures
L1: lw r2, r9, r3
    lw r6, r10, r3
    add r4, r4, 1
    add r1, r4, r2 << 2
    sw r11, r3, r1
    add r3, r3, 1
    blt L1, r3, 15

Reconfigurable Architecture (RA)
Spatial computation with flexibility
High computation performance
Low power dissipation
High degree of flexibility
Modern Reconfigurable Architectures
Partial, dynamic reconfiguration
Existing architectures
FPGAs
XPP, Morphosys, picoArray, QuickSilver
EDGE, WaveScalar, etc

[Figure: the loop mapped onto the RA; 1 cycle / iteration]

4
Binary Translation for Reconfigurable Architectures

Binary Translation for RAs
Translate hot-spot instructions to a netlist and a sw/hw interface
Attractions
Executable code is migrated without source code
BT is compatible with the existing design flow
No modification of tools on host processor
Binary is a universal language

[Figure: two routes to a hardware implementation on the RA. Labels: high-level description, high-level tool, kernel; binary, BT, kernel, netlist]
5
Motivation

Performance degradation
Optimizations for RAs
Transformations at high-level representation
Binary loses high-level information

[Figure: high-level design flow for RAs. Labels: specialized language, C/C++ (extension), optimizations (loop transforms, vectorization), mapper, netlists]
6
Related Works

Dynamic binary translation
Much research for general-purpose architectures
Dynamo[HP00], DAISY[IBM96], FX!32[DIGITAL97], etc.
Binary translation for reconfigurable architectures
Dynamic partitioning at the binary level [ICCAD02, DAC03]
Automatic co-processor synthesis from a software executable
http://www.criticalblue.com
Software migration from TI 6000 binary to FPGAs [DAC04]

7
Target Architecture
Migrate kernels for better performance
[Figure: reconfigurable SoC. Labels: binary, memory, host processor, profiling, hardware region selection, reconfigurable processor, binary translator, modified binary, netlists]
8
Binary Translation Framework
Binary
Partitioning Information

Dynamic optimizer
Runtime analysis
Netlist transform
Dynamic translator
Control RA and dynamic optimizers
Stub code
Control transfer to dynamic translator

Preprocessing
machine-independent RTL
Dataflow Generation / Optimization
dataflow graph
Static analysis/optimization
Generation of dynamic optimizer
Dynamic Translator
dataflow graphs, code for optimizations
Stub code
BackEnd for dataflow
Binary Modification
Netlists
Modified binary
9
Flow of Control
Modified binary
    ldr r4, .L19
    sub fp, ip, #4
    ldr r6, .L19+4
    pc = 0x8025
Stub code
    Save context
    ...
Dynamic translator
    Runtime test
    Configure RA, if needed
    ...
    Execute RA
    Restore context
Kernel (loop) instructions
    ldr r4, .L19
    ldr r3, .L19+4
    ldr r2, [r4, #-60]
    ldr r1, [r3, #-60]
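The control flow on this slide can be sketched in C as follows. This is only a minimal model under assumptions: `stub_dispatch`, `ra_state`, `context`, and `ra_execute` are hypothetical names, the RA is reduced to "which kernel is loaded", and the fallback to the original loop instructions when the runtime test fails is an interpretation of the diagram rather than something the slide states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NREGS 16

/* Register context saved by the stub (layout is illustrative). */
struct context { unsigned regs[NREGS]; };

/* Minimal model of the RA: which kernel's netlist is loaded. */
struct ra_state {
    int loaded_kernel;   /* -1 when nothing is configured */
    int configurations;  /* number of reconfigurations so far */
};

/* Stand-in for kernel execution on the RA; here it only bumps r0. */
static void ra_execute(struct context *ctx)
{
    ctx->regs[0] += 1;
}

/* Steps from the slide: save context, runtime test, configure RA if
 * needed, execute RA, restore context. Returns false when the runtime
 * test fails (assumed: the caller then runs the original loop code). */
bool stub_dispatch(struct ra_state *ra, int kernel_id,
                   bool runtime_test_passed, struct context *cpu)
{
    assert(ra != NULL && cpu != NULL);
    struct context saved = *cpu;          /* save context */
    if (!runtime_test_passed)
        return false;                     /* fall back to kernel instructions */
    if (ra->loaded_kernel != kernel_id) { /* configure RA, if needed */
        ra->loaded_kernel = kernel_id;
        ra->configurations++;
    }
    ra_execute(&saved);                   /* execute RA */
    *cpu = saved;                         /* restore context */
    return true;
}
```

Calling `stub_dispatch` twice for the same kernel reconfigures the RA only on the first call, which is the point of "Configure RA, if needed".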
10
Loop Pipelining

Essential optimization for realizing the performance of a dataflow machine
Removes dependences (token edges) between memory operations
Yields a large performance improvement
High-level synthesis relies on dependence/pointer analysis
At the binary level, this analysis tends to be conservative

void fir(int *in, int *out, int len)
[Figure: dataflow graph of the fir kernel. Labels: AGEN, in0 to in4, c0 to c4, o]
Removing token edges with dynamic compilation
Overview of Dynamic Loop Pipelining

Observation on applications
Dependences between streams are statically determined

void IMG_wave_horz(const short *restrict in_data, const short *restrict qmf, const short *restrict mqmf, short *restrict out_data, int cols)
Dependences among streams are fixed statically in most kernels

If two streams are independent at the first execution of a kernel
They are probably independent until the end of program execution
If they are dependent once
They stay dependent to the end
12
Overview of Dynamic Loop Pipelining (2)

Strategy for dataflow generation
Generate an optimistic netlist at first
Assume that all streams are independent
If the assumptions turn out to be false at runtime, transform the netlist into a more conservative one
Overcomes the drawback of binary translation

[Figure: execution time along t, with marker M at the point where the assumptions prove to be false]
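The optimistic strategy can be sketched as a small per-kernel state machine. The names `kernel_state`, `kernel_init`, and `kernel_select` are hypothetical, and the sketch assumes a single switch from optimistic to conservative, following the earlier observation that streams that are dependent once stay dependent to the end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which netlist variant a kernel currently uses (names are illustrative). */
enum netlist_kind { NETLIST_OPTIMISTIC, NETLIST_CONSERVATIVE };

struct kernel_state {
    enum netlist_kind netlist;  /* variant to configure on the RA */
    int transforms;             /* how many times the netlist was rewritten */
};

/* Start optimistic: assume all streams are independent. */
void kernel_init(struct kernel_state *k)
{
    assert(k != NULL);
    k->netlist = NETLIST_OPTIMISTIC;
    k->transforms = 0;
}

/* Called before each kernel execution with the inspector's verdict.
 * Once a dependence is observed, the kernel stays conservative. */
enum netlist_kind kernel_select(struct kernel_state *k, bool streams_dependent)
{
    assert(k != NULL);
    if (streams_dependent && k->netlist == NETLIST_OPTIMISTIC) {
        k->netlist = NETLIST_CONSERVATIVE;  /* netlist transform */
        k->transforms++;
    }
    return k->netlist;
}
```

Keeping the state sticky means the transform cost is paid at most once per kernel in this model; a real translator might hold several partial netlists instead of two variants.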
13
Flow of Dynamic Loop Pipelining
Dataflow graph

Static analysis/optimization
Recover memory access pattern
Discover dependences symbolically
Generate dependence inspectors
Output: base netlist, partial netlist, inspectors

[Figure: dynamic translator flowchart. Labels: first execution (yes/no), dependence inspection, execution with dependence inspection, netlist transform, store netlist and result of inspection, error (yes/no), configuration, execution, context restore]
14
Static Analysis and Optimization
Dataflow graph

Runtime constant propagation
Propagate values in context through dataflow

Runtime-constant propagation
Access pattern recovery
Induction variable analysis
Range analysis
Memory access descriptors

foo(int *p, int *o, int cols, int rows, int w)
//save context
for (i = 0; i < cols*(rows-2) - 2; i++) [loop body illegible in the transcript]

[Figure: dataflow graph with runtime constants propagated. Labels: r5, c(r3), c(r6), c(r7), c(r8), c(r7)-2, c(r6)-2, c(r8)*(c(r7)-2), and an address expression built from c(r3), c(r6), and a shift << 2]
c(k): value of register k at context
15
Static Analysis and Optimization (2)
Dataflow graph
Runtime-constant propagation
Access pattern recovery
Induction variable analysis
Range analysis
Memory access descriptors
[Figure: the dataflow graph of the previous slide, continued. Labels: r5, c(r3), c(r6), c(r7), c(r8), c(r7)-2, c(r6)-2, c(r8)*(c(r7)-2)]
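A memory access descriptor and the range analysis that consumes it might look like the following sketch. The struct layout and the names `access_desc`, `addr_range`, and `access_range` are assumptions for illustration; the slides do not show the actual descriptor format. The idea is that once the base address is a runtime constant and the induction variable gives a stride and trip count, the touched address range follows by arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory access descriptor for one stream in a loop:
 * address(i) = base + i * stride, for i = 0 .. trips-1. */
struct access_desc {
    int64_t base;    /* start address, e.g. a runtime constant c(rK) */
    int64_t stride;  /* bytes advanced per iteration (may be negative) */
    int64_t trips;   /* loop trip count, must be >= 1 */
    int64_t width;   /* bytes touched per access, e.g. 4 for a 32-bit load */
};

struct addr_range { int64_t lo, hi; };  /* inclusive byte range */

/* Range analysis: smallest inclusive byte range covering every access. */
struct addr_range access_range(const struct access_desc *d)
{
    assert(d != NULL && d->trips >= 1 && d->width >= 1);
    int64_t first = d->base;
    int64_t last  = d->base + d->stride * (d->trips - 1);
    struct addr_range r;
    r.lo = first < last ? first : last;
    r.hi = (first < last ? last : first) + d->width - 1;
    return r;
}
```

For a stream starting at address 1000 with a 4-byte stride, 10 iterations, and 4-byte accesses, the range is 1000 to 1039 inclusive.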
16
Static Analysis and Optimization (3)
Dataflow graph
Runtime-constant propagation
Induction variable analysis
Range analysis
Memory access descriptors, e.g. M[i-2], M[j-4], M[j], M[i]
Construct access clusters
Intra-cluster token edge insertion
Inter-cluster token edge generation
Dependence inspector generation
Inspector: are two clusters dependent?
Outputs: base dataflow, token networks, code for inspectors
Dataflow backend produces the base netlist and partial netlists
Binary modification produces the modified binary
17
Dependence Inspector

Inspector
Small code fragment that tests for dependence between two clusters
Inequalities for the range test
Execution
[Figure: timelines (t) of host and RA showing two arrangements of inspector and kernel execution, with data commit]
18
Dependence Inspector (2)

Example

r2 = 0
L1: r3 = M32[r2]
    r4 = M32[r2+1]
    r5 = f(r3, r4)
    M32[r6] = r5
    r2 = r2 + 1
    r6 = r6 + 2
    b r2 < r10, L1

inspector: context -> {dep, indep}
//range1 = (c(r2), c(r2)+c(r10)+3+1)
//range2 = (c(r6), c(r6)+c(r10)<<1+3)
if (c(r2) < c(r6)+c(r10)<<1) return indep
else if (c(r6)+c(r10)<<1 < c(r2)) return indep
return dep

[Figure: write (w) and read (r) address ranges]
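A range-test inspector of this kind can be sketched as a standard interval-disjointness check. This is an illustration under assumptions, not the authors' generated code: `byte_range` and `inspect` are hypothetical names, ranges are taken as inclusive byte ranges already computed from context values, and two clusters are reported independent only when their ranges cannot overlap.

```c
#include <assert.h>
#include <stdint.h>

enum dep_result { INDEP, DEP };

/* Inclusive byte range touched by one access cluster. */
struct byte_range { int64_t lo, hi; };

/* Range test: the clusters are independent exactly when the two
 * ranges are disjoint, i.e. one ends before the other begins. */
enum dep_result inspect(struct byte_range a, struct byte_range b)
{
    assert(a.lo <= a.hi && b.lo <= b.hi);
    if (a.hi < b.lo)
        return INDEP;   /* a lies entirely below b */
    if (b.hi < a.lo)
        return INDEP;   /* b lies entirely below a */
    return DEP;         /* ranges overlap: keep the token edge */
}
```

The test costs two comparisons per cluster pair, which keeps the inspector cheap enough to run on the host before each kernel invocation. It is conservative in the sense that overlapping ranges are reported dependent even if the individual accesses never touch the same address.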
19
Evaluation Framework