Title: Memory Access Optimization of Dynamic Binary Translation for Reconfigurable Architectures
1. Memory Access Optimization of Dynamic Binary Translation for Reconfigurable Architectures
- November 9, 2005
- KAIST
- Sejong Oh, Tag Gon Kim
2. Outline
- Introduction
- Binary Translation Framework
- Dynamic Translation for Loop Pipelining
- Experimental Results
- Conclusion
3. Reconfigurable Architectures
    L1: lw  r2, r9, r3
        lw  r6, r10, r3
        add r4, r4, 1
        add r1, r4, r2 << 2
        sw  r11, r3, r1
        add r3, r3, 1
        blt L1, r3, 15
- Reconfigurable Architecture (RA)
- Spatial computation with flexibility
- High computation performance
- Low power dissipation
- High degree of flexibility
- Modern Reconfigurable Architectures
- Partial, dynamic reconfiguration
- Existing architectures
- FPGAs
- XPP, Morphosys, picoArray, QuickSilver
- EDGE, WaveScalar, etc.
[Figure: the loop mapped onto the RA as a dataflow graph; annotation: 1 cycle / iteration]
4. Binary Translation for Reconfigurable Architectures
- Binary Translation for RAs
- Translate hot-spot instructions to a netlist and a SW/HW interface
- Attractions
- Executable code is migrated without source code
- BT is compatible with the existing design flow
- No modification of tools on host processor
- Binary is a universal language
[Figure: two routes to a hardware implementation on the RA. A kernel taken from a high-level description goes through a high-level tool; a kernel taken from a binary goes through BT (binary -> netlist).]
5. Motivation
- Performance degradation
- Optimizations for RAs
- Transformations at high-level representation
- Binary loses high-level information
[Figure: high-level flow. Specialized language or C/C++ (extension) -> optimizations (loop transforms, vectorization) -> mapper -> netlists]
6. Related Work
- Dynamic binary translation
- Much research for general-purpose architectures
- Dynamo [HP00], DAISY [IBM96], FX!32 [DIGITAL97], etc.
- Binary translation for reconfigurable architectures
- Dynamic partitioning at the binary level [ICCAD02, DAC03]
- Automatic co-processor synthesis from software executable
- http://www.criticalblue.com
- Software migration from TI 6000 binary to FPGAs [DAC04]
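7. Target Architecture
[Figure: reconfigurable SoC containing a host processor, a reconfigurable processor, and memory. Labels: profiling, hardware region selection, binary translator. The binary translator takes the binary and produces a modified binary and netlists. Annotation: Migrate kernels for better performance]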
8. Binary Translation Framework
- Dynamic optimizer
- Runtime analysis
- Netlist transform
- Dynamic translator
- Control RA and dynamic optimizers
- Stub code
- Control transfer to dynamic translator
[Figure: translation flow. Inputs: binary, partitioning information. Stages: preprocessing (machine-independent RTL) -> dataflow generation / optimization (dataflow graph) -> static analysis/optimization -> generation of dynamic optimizer (dataflow graphs, code for optimizations) -> backend for dataflow, binary modification. Outputs: dynamic translator, stub code, netlists, modified binary]
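9. Flow of Control
[Figure: control passes from the modified binary through the stub code to the dynamic translator, in place of the kernel (loops) instructions; label: pc 0x8025]
Steps shown, in order:
- Save context
- Runtime test
- Configure RA, if needed
- Execute RA
- Restore context
Instruction fragments shown in the figure:
    ldr r4, .L19
    sub fp, ip, 4
    ldr r6, .L194
    ...
    ldr r4, .L19
    ldr r3, .L194
    ldr r2, r4, -60
    ldr r1, r3, -60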
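The control flow on this slide can be sketched in C as follows. This is a minimal illustration only: the names `stub_entry`, `ra_state`, `trace`, and the `step` enum are hypothetical and do not come from the slides, and the runtime test is assumed to pass so that the RA path is always taken.

```c
#include <stdbool.h>

/* Hypothetical step labels mirroring the boxes on the slide. */
enum step { SAVE_CONTEXT, RUNTIME_TEST, CONFIGURE_RA, EXECUTE_RA, RESTORE_CONTEXT };

/* Small log of the steps taken, so the order can be inspected. */
typedef struct {
    enum step steps[8];
    int count;
} trace;

/* Assumed RA state: is this kernel's netlist currently loaded? */
typedef struct {
    bool configured;
} ra_state;

static void record(trace *t, enum step s)
{
    if (t->count < 8)
        t->steps[t->count++] = s;
}

/* One pass through the stub: the RA is configured only when needed. */
static void stub_entry(ra_state *ra, trace *t)
{
    record(t, SAVE_CONTEXT);
    record(t, RUNTIME_TEST);      /* assumed to succeed in this sketch */
    if (!ra->configured) {
        record(t, CONFIGURE_RA);
        ra->configured = true;
    }
    record(t, EXECUTE_RA);
    record(t, RESTORE_CONTEXT);
}
```

Calling `stub_entry` twice on the same `ra_state` records the configuration step only on the first call, which is the "if needed" condition on the slide.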
10. Loop Pipelining
- Essential optimization to realize the performance of a dataflow machine
- Removes dependences (token edges) between memory operations
- A great deal of performance improvement
- Dependence/pointer analysis as in high-level synthesis
- Tends to be conservative at the binary level
void fir(int *in, int *out, int len)
[Figure: dataflow graph of the fir kernel with address generators (AGEN), inputs in0 to in4, coefficients c0 to c4, and output o. Caption: Removing token edges with dynamic compilation]
11. Overview of Dynamic Loop Pipelining
- Observation on applications
- Dependences between streams are statically determined
void IMG_wave_horz( const short *restrict in_data, const short *restrict qmf, const short *restrict mqmf, short *restrict out_data, int cols )
Dependences among streams are fixed statically in most kernels
- If two streams are independent at the first execution of the kernel
- They are probably independent to the end of program execution
- If they are dependent once
- They are dependent to the end
12. Overview of Dynamic Loop Pipelining (2)
- Strategy for dataflow generation
- Generate an optimistic netlist the first time
- Assuming that all streams are independent
- If the assumptions turn out to be false at runtime, transform the netlist to a more conservative one
- Overcome the drawback of binary translation
[Figure: execution time of the kernel over time t; annotation: Assumptions prove to be false]
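The strategy above can be sketched as a small state machine. This is only an illustration under stated assumptions: `kernel_state`, `netlist_kind`, and `select_netlist` are hypothetical names, and the result of the runtime dependence check is passed in as a plain flag. The switch to the conservative netlist is made sticky, following the observation that streams that are dependent once stay dependent.

```c
#include <stdbool.h>

/* Hypothetical labels for the two netlist variants. */
typedef enum { NETLIST_OPTIMISTIC, NETLIST_CONSERVATIVE } netlist_kind;

typedef struct {
    netlist_kind current;  /* netlist used for the next execution */
    int transforms;        /* how many times the netlist was transformed */
} kernel_state;

/* Start by assuming that all streams are independent. */
static void kernel_init(kernel_state *k)
{
    k->current = NETLIST_OPTIMISTIC;
    k->transforms = 0;
}

/* Pick the netlist for one kernel execution.
 * streams_dependent is the (assumed) outcome of a runtime check. */
static netlist_kind select_netlist(kernel_state *k, bool streams_dependent)
{
    if (k->current == NETLIST_OPTIMISTIC && streams_dependent) {
        /* Assumption proved false: fall back and stay conservative. */
        k->current = NETLIST_CONSERVATIVE;
        k->transforms++;
    }
    return k->current;
}
```

With this shape the cost of the transform is paid at most once per kernel, and every later execution reuses the stored choice.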
13. Flow of Dynamic Loop Pipelining
Static analysis/optimization (input: dataflow graph; output: base netlist, partial netlist, inspectors)
- Recover memory access pattern
- Discover dependences symbolically
- Generate dependence inspectors
[Figure: flowchart of the dynamic translator. Nodes: First Execution (yes/no), Dependence Inspection, Execution with dependence inspection, Netlist transform, Store netlist and result of inspection, Error (yes/no), Configuration, Execution, Context Restore]
14. Static Analysis and Optimization
- Runtime constant propagation
- Propagate values in context through dataflow
Analysis steps: dataflow graph -> runtime-constant propagation -> access pattern recovery (induction variable analysis, range analysis) -> memory access descriptors
Example kernel:
    foo(int *p, int *o, int cols, int rows, int *w)
    //save context
    for (i = 0; i < cols*(rows-2) - 2; i++)
        ... (loop body not recoverable from the slide text; it accesses p, w, and o)
[Figure: dataflow graph of the loop annotated with runtime constants, including the loop bound built from c(r8)*(c(r7)-2) and c(r6)-2, the induction register r5, and an address expression based on c(r3) with a << 2 scaling]
c(k): value of register k at context
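Once the base address, stride, and trip count of an access are known as runtime constants, its address range follows from simple arithmetic. The sketch below shows one possible form of a memory access descriptor; the struct layout and the names `access_desc`, `byte_range`, and `access_range` are assumptions for illustration, not the descriptor format used in the talk.

```c
#include <stdint.h>

/* Hypothetical memory access descriptor for one strided stream. */
typedef struct {
    uint32_t base;    /* address of the first access */
    int32_t  stride;  /* byte distance between consecutive iterations */
    uint32_t trips;   /* trip count, assumed to be at least 1 */
    uint32_t size;    /* bytes touched per access */
} access_desc;

/* Inclusive range of bytes touched by the stream. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} byte_range;

static byte_range access_range(access_desc d)
{
    int64_t first = (int64_t)d.base;
    int64_t last  = first + (int64_t)d.stride * (int64_t)(d.trips - 1);
    /* A negative stride walks downward, so order the two end points. */
    int64_t lo = first < last ? first : last;
    int64_t hi = (first < last ? last : first) + (int64_t)d.size - 1;
    byte_range r;
    r.lo = (uint32_t)lo;
    r.hi = (uint32_t)hi;
    return r;
}
```

For example, 16 word accesses starting at 0x1000 with a stride of 4 bytes cover 0x1000 through 0x103F.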
15. Static Analysis and Optimization (2)
[Figure: the same dataflow graph after runtime-constant propagation and access pattern recovery (induction variable analysis, range analysis), yielding memory access descriptors]
16. Static Analysis and Optimization (3)
Analysis steps: dataflow graph -> runtime-constant propagation -> induction variable analysis -> range analysis -> memory access descriptors -> construct access clusters -> intra-cluster token edge insertion -> inter-cluster token edge generation -> dependence inspector generation
Results: base dataflow, token networks, code for inspectors
- Dataflow backend: base netlist, partial netlists
- Binary modification: modified binary
Inspector: are two clusters dependent?
[Figure: access clusters containing M[i-2], M[i], M[j-4], M[j]]
17. Dependence Inspector
- Inspector
- Small code fragment for dependence test between two clusters
- Inequalities for range test
- Execution
[Figure: execution timeline (t) on host and RA; labels: Inspector, Kernel, Data commit]
18. Dependence Inspector (2)
Kernel (r: reads through r2, w: writes through r6):
        r2 = 0
    L1: r3 = M32[r2]
        r4 = M32[r2+1]
        r5 = f(r3, r4)
        M32[r6] = r5
        r2 = r2 + 1
        r6 = r6 + 2
        b r2 < r10, L1
Inspector: context -> {dep, indep}
    //range1 (c(r2), c(r2)+c(r10)+3+1)
    //range2 (c(r6), c(r6)+(c(r10)<<1)+3)
    if (c(r2)+c(r10)+3+1 < c(r6)) return indep
    else if (c(r6)+(c(r10)<<1)+3 < c(r2)) return indep
    return dep
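The range test used by an inspector amounts to checking whether two address intervals are disjoint. A minimal sketch, assuming inclusive byte ranges; the names `addr_range`, `dep_result`, and `inspect` are hypothetical and chosen only for this example:

```c
#include <stdint.h>

typedef enum { INDEP = 0, DEP = 1 } dep_result;

/* Inclusive range of addresses touched by one access cluster. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} addr_range;

/* Two clusters are independent when one range ends before the other starts. */
static dep_result inspect(addr_range read, addr_range write)
{
    if (read.hi < write.lo)
        return INDEP;   /* read range lies entirely below the write range */
    if (write.hi < read.lo)
        return INDEP;   /* write range lies entirely below the read range */
    return DEP;         /* ranges overlap: keep the token edge */
}
```

The test is conservative: it reports a dependence whenever the ranges overlap, even if the individual strided accesses inside them never touch the same address.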
19. Evaluation Framework
- Simulator for the target architecture
- ARM and reconfigurable processor
- Implemented in SystemC
[Figure: simulator blocks: ARMulator, Reconfigurable Array, Configuration Manager, Memory Unit, Memory]
20. Preliminary Results
- Experimental setup
- Speedup against ARM processor (SimpleScalar)
- Inspectors are amortized
- Simulation parameters
- ALU: 1 cycle
- Memory bandwidth: 4 ports
- Memory access: 2 cycles
- Multiplier: 2 cycles
- Data steering operators: log2(# of inputs x a)
- Speedup
- Up to 3.02
[Figure: speedup chart; annotations: "47,97", "fir is not pipelined"]
21. Preliminary Results of Loop Pipelining (2)
[Figure: results chart; data labels: 28, 19, 8]
22. Conclusion and Further Work
- Dynamic compilation technique for RAs
- Overcome the drawback of binary translation
- Generate netlist optimistically
- Discover dependences among streams at runtime
- Further work
- Accesses with complex addresses
- Look-up tables
- Other optimizations with runtime decisions
23. Thank you!