Memory Access Optimization of Dynamic Binary Translation for Reconfigurable Architectures

1
Memory Access Optimization of Dynamic Binary Translation for Reconfigurable Architectures
  • November 9, 2005
  • KAIST
  • Sejong Oh, Tag Gon Kim

2
Outline
  • Introduction
  • Binary Translation Framework
  • Dynamic Translation for Loop Pipelining
  • Experimental Results
  • Conclusion

3
Reconfigurable Architectures
    L1: lw  r2, r9, r3
        lw  r6, r10, r3
        add r4, r4, 1
        add r1, r4, r2 << 2
        sw  r11, r3, r1
        add r3, r3, 1
        blt L1, r3, 15
  • Reconfigurable Architecture (RA)
  • Spatial computation with flexibility
  • High computation performance
  • Low power dissipation
  • High degree of flexibility
  • Modern Reconfigurable Architectures
  • Partial, dynamic reconfiguration
  • Existing architectures
  • FPGAs
  • XPP, Morphosys, picoArray, QuickSilver
  • EDGE, WaveScalar, etc

[Figure: the loop above drawn as a dataflow graph; 1 cycle / iteration]

4
Binary Translation for Reconfigurable Architectures
  • Binary Translation for RAs
  • Translate hot-spot instructions to netlist and
    sw/hw interface
  • Attractions
  • Executable code is migrated without source code
  • BT is compatible with the existing design flow
  • No modification of tools on host processor
  • Binary is a universal language

[Figure: two routes to a hardware implementation on the RA. Labels: high-level description, high-level tool, kernel; binary, BT, kernel, netlist]
5
Motivation
  • Performance degradation
  • Optimizations for RAs
  • Transformations at high-level representation
  • Binary loses high-level information

[Figure: high-level flow. Specialized language or C/C++ (extension) -> optimizations (loop transforms, vectorization) -> mapper -> netlists]
6
Related Works
  • Dynamic binary translation
  • Much research for general-purpose architectures
  • Dynamo[HP00], DAISY[IBM96], FX!32[DIGITAL97], etc.
  • Binary translation for reconfigurable
    architectures
  • Dynamic partitioning on binary-level [ICCAD02,
    DAC03]
  • Automatic co-processor synthesis from software
    executable
  • http://www.criticalblue.com
  • Software migration from TI 6000 binary to FPGAs
    [DAC04]

7
Target Architecture
[Figure: reconfigurable SoC with memory, a host processor, and a reconfigurable processor. Profiling and hardware region selection feed the binary translator, which turns the binary into a modified binary and netlists. Caption: Migrate kernels for better performance]
8
Binary Translation Framework
  • Dynamic optimizer
      • Runtime analysis
      • Netlist transform
  • Dynamic translator
      • Control RA and dynamic optimizers
  • Stub code
      • Control transfer to dynamic translator

[Figure: translation flow. Inputs: binary, partitioning information. Preprocessing -> machine-independent RTL -> dataflow generation / optimization -> dataflow graph -> static analysis/optimization and generation of dynamic optimizer -> dataflow graphs, code for optimizations -> back end for dataflow and binary modification. Outputs: netlists and a modified binary containing the dynamic translator and stub code]
9
Flow of Control
Modified binary
    ldr r4, .L19
    sub fp, ip, #4
    ldr r6, .L194
    pc 0x8025
Stub code
    Save context
    ...
Dynamic translator
    Runtime test
    Configure RA, if needed
    ...
    Execute RA
    Restore context
Kernel (loops) instructions
    ldr r4, .L19
    ldr r3, .L194
    ldr r2, [r4, #-60]
    ldr r1, [r3, #-60]
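The control flow on this slide (save context, runtime test, configure the RA if needed, execute, restore context) can be sketched in C. This is only an illustration: `context_t`, `ra_t`, `translator_enter`, and `sample_runtime_test` are hypothetical names, the RA is replaced by counters, and restoring the whole context is a simplification, since a real kernel would also write its results back.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS 16

/* Host register file captured by the stub code. */
typedef struct {
    unsigned regs[NUM_REGS];
} context_t;

/* Stand-in for the reconfigurable array (RA). */
typedef struct {
    int loaded_netlist;  /* id of the netlist currently configured, -1 if none */
    int configs;         /* number of reconfigurations so far                  */
    int runs;            /* number of kernel executions so far                 */
} ra_t;

/* Hypothetical runtime test: picks netlist 0 or 1 from a context value. */
static int sample_runtime_test(const context_t *ctx)
{
    return ctx->regs[0] == 0 ? 0 : 1;
}

/* Entry point reached from the stub code in the modified binary. */
static void translator_enter(ra_t *ra, context_t *live,
                             int (*runtime_test)(const context_t *))
{
    assert(ra != NULL && live != NULL && runtime_test != NULL);

    context_t saved = *live;              /* save context */
    int netlist = runtime_test(&saved);   /* runtime test */

    if (ra->loaded_netlist != netlist) {  /* configure RA, if needed */
        ra->loaded_netlist = netlist;
        ra->configs++;
    }

    ra->runs++;                           /* execute RA (stand-in) */
    live->regs[1] = 0xDEADu;              /* pretend a register was clobbered */

    *live = saved;                        /* restore context */
}
```

Calling `translator_enter` twice with the same context reconfigures the stand-in RA only once, which is the point of the "if needed" step.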
10
Loop Pipelining
  • Essential optimization to realize the performance
    of a dataflow machine
  • Removing dependences (token edges) between memory
    operations
  • A great deal of performance improvement
  • Dependence/pointer analysis in high-level
    synthesis
  • Tends to be conservative at the binary level

void fir(int *in, int *out, int len)
[Figure: dataflow graph of fir with address generators (AGEN), inputs in0 to in4, coefficients c0 to c4, and output o. Caption: Removing token edges with dynamic compilation]
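To see why a binary-level analysis must be conservative here, consider a possible body for `fir`. The slide gives only the signature, so the five-tap structure (suggested by in0 to in4 and c0 to c4 in the figure) and the coefficient values below are assumptions. Nothing in the signature says whether `in` and `out` overlap, so without more information every store to `out[i]` has to be ordered against later loads from `in`.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 5

/* Hypothetical coefficients c0..c4; the slide does not give their values. */
static const int coeff[TAPS] = { 1, 2, 3, 2, 1 };

/* Writes out[0 .. len-1]; reads in[0 .. len + TAPS - 2]. */
void fir(int *in, int *out, int len)
{
    assert(in != NULL && out != NULL);
    for (int i = 0; i < len; i++) {
        int acc = 0;
        for (int k = 0; k < TAPS; k++)
            acc += coeff[k] * in[i + k];  /* five loads per iteration */
        out[i] = acc;                     /* one store per iteration  */
    }
}
```

With disjoint buffers the iterations are independent and can be pipelined. If a caller passes `out = in + 1`, each store feeds the next iteration's loads, so the token edges between the store and the loads are really needed.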
11
Overview of Dynamic Loop Pipelining
  • Observation on applications
  • Dependences between streams are statically
    determined

void IMG_wave_horz(const short *restrict in_data, const short *restrict qmf,
                   const short *restrict mqmf, short *restrict out_data, int cols)
Dependences among streams are fixed statically in most kernels
  • If two streams are independent at the first
    execution of a kernel, they are probably independent
    to the end of program execution
  • If two streams are dependent once, they are
    dependent to the end

12
Overview of Dynamic Loop Pipelining (2)
  • Strategy for dataflow generation
  • Optimistic netlist the first time
  • Assuming that all streams are independent
  • If assumptions turn out to be false at runtime,
    transform the netlist into a more conservative one
  • Overcomes the drawback of binary translation

[Figure: execution time along a time axis t, with markers M and the annotation "Assumptions prove to be false"]
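The strategy above amounts to a small one-way state machine per kernel: start optimistic, and switch to the conservative netlist the first time the independence assumption fails. The sketch below is a hedged illustration; `kernel_state_t`, `kernel_init`, and `kernel_enter` are hypothetical names, and the inspector's verdict is simply passed in as a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { NETLIST_OPTIMISTIC, NETLIST_CONSERVATIVE } netlist_kind_t;

typedef struct {
    netlist_kind_t netlist;  /* netlist currently chosen for this kernel */
    int transforms;          /* number of netlist transforms performed   */
} kernel_state_t;

/* First execution: assume that all streams are independent. */
static void kernel_init(kernel_state_t *k)
{
    assert(k != NULL);
    k->netlist = NETLIST_OPTIMISTIC;
    k->transforms = 0;
}

/* Called on each kernel entry with the result of the runtime check.
 * Once the assumption has failed, the kernel stays conservative
 * ("dependent once, dependent to the end"). */
static netlist_kind_t kernel_enter(kernel_state_t *k, bool streams_independent)
{
    assert(k != NULL);
    if (!streams_independent && k->netlist == NETLIST_OPTIMISTIC) {
        k->netlist = NETLIST_CONSERVATIVE;  /* netlist transform */
        k->transforms++;
    }
    return k->netlist;
}
```

Because the transition is one-way, the cost of the netlist transform is paid at most once per kernel in this sketch.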
13
Flow of Dynamic Loop Pipelining
Dataflow graph
  • Recover memory access pattern
  • Discover dependences symbolically
  • Generate dependence inspectors

[Figure: flow of dynamic loop pipelining. Static analysis/optimization produces the base netlist, partial netlist, and inspectors. Dynamic translator steps: first execution? (yes/no), dependence inspection, execution with dependence inspection, netlist transform, store netlist and result of inspection, error? (yes/no), configuration, execution, context restore]
14
Static Analysis and Optimization
Dataflow graph
  • Runtime constant propagation
  • Propagate values in context through dataflow

[Figure: dataflow graph of the example loop passing through runtime-constant propagation and access pattern recovery (induction variable analysis, range analysis) to produce memory access descriptors. Node labels include r5, c(r3), c(r6), c(r7), c(r8), c(r7)-2, c(r6)-2, c(r8)*(c(r7)-2), and M]
Example source: foo(int *p, int *o, int cols, int rows, int w), which saves the context and then runs for (i = 0; i < cols*(rows-2) - 2; i++)
c(k): value of register k at context
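A memory access descriptor can be pictured as an affine address expression whose constants come from the saved context and whose iteration count comes from range analysis. The layout below is an assumption for illustration (the slides do not show the descriptor format): the address at iteration i is taken to be base + (offset + stride * i) * element size, and `access_desc_t`, `access_addr`, and `access_range` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one strided memory access inside a loop. */
typedef struct {
    uint32_t base;    /* runtime constant from the context, e.g. c(r3) */
    int32_t  offset;  /* constant element offset                       */
    int32_t  stride;  /* elements advanced per iteration               */
    uint32_t trip;    /* iteration count found by range analysis       */
    unsigned shift;   /* log2 of the element size in bytes             */
} access_desc_t;

/* Closed byte range [lo, hi] touched by an access over the whole loop. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} range_t;

/* Address touched at iteration i. */
static uint32_t access_addr(const access_desc_t *d, uint32_t i)
{
    int64_t elem = (int64_t)d->offset + (int64_t)d->stride * (int64_t)i;
    return (uint32_t)((int64_t)d->base + elem * ((int64_t)1 << d->shift));
}

/* The address is monotonic in i, so the extremes are the first and last. */
static range_t access_range(const access_desc_t *d)
{
    assert(d != NULL && d->trip > 0);
    uint32_t first = access_addr(d, 0);
    uint32_t last = access_addr(d, d->trip - 1);
    range_t r;
    r.lo = first < last ? first : last;
    r.hi = (first < last ? last : first) + (1u << d->shift) - 1u;
    return r;
}
```

Once every access is summarized by such a range, comparing two accesses needs only a few integer comparisons at runtime. Address wrap-around is ignored in this sketch.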
15
Static Analysis and Optimization (2)
[Figure: the dataflow graph of the previous slide, continued through runtime-constant propagation, access pattern recovery (induction variable analysis, range analysis), and memory access descriptors]
16
Static Analysis and Optimization (3)
  • Dataflow graph
  • Runtime-constant propagation
  • Induction variable analysis
  • Range analysis, producing memory access descriptors
  • Construct access clusters (example accesses: M[i-2], M[i], M[j-4], M[j])
  • Intra-cluster token edge insertion
  • Inter-cluster token edge generation
  • Dependence inspector generation
  • Inspector: are two clusters dependent?
  • Intermediate results: base dataflow, token networks, code for inspectors
  • Dataflow backend and binary modification
  • Outputs: base netlist, partial netlists, modified binary
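Cluster construction can be illustrated with the four accesses named on the slide. The grouping rule used here, one cluster per symbolic base so that M[i-2] and M[i] share a cluster while M[j-4] and M[j] form another, is an assumption about how clusters are formed. `access_t`, `cluster_t`, `build_clusters`, and `inspector_pairs` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CLUSTERS 8

/* One memory access: symbolic base plus a constant offset. */
typedef struct {
    int base_id;  /* hypothetical id of the base, e.g. 0 for i and 1 for j */
    int offset;   /* constant offset, e.g. -2 for M[i-2]                   */
} access_t;

/* Accesses that share a base; their relative distances are known statically. */
typedef struct {
    int base_id;
    int min_off;
    int max_off;
    int count;
} cluster_t;

/* Groups accesses by base; returns the number of clusters written to out. */
static int build_clusters(const access_t *acc, int n, cluster_t *out)
{
    assert(acc != NULL && out != NULL);
    int nc = 0;
    for (int a = 0; a < n; a++) {
        int c = 0;
        while (c < nc && out[c].base_id != acc[a].base_id)
            c++;
        if (c == nc) {                     /* first access with this base */
            assert(nc < MAX_CLUSTERS);
            out[c].base_id = acc[a].base_id;
            out[c].min_off = acc[a].offset;
            out[c].max_off = acc[a].offset;
            out[c].count = 0;
            nc++;
        }
        if (acc[a].offset < out[c].min_off) out[c].min_off = acc[a].offset;
        if (acc[a].offset > out[c].max_off) out[c].max_off = acc[a].offset;
        out[c].count++;
    }
    return nc;
}

/* Upper bound on inspectors: one per unordered pair of clusters. */
static int inspector_pairs(int nc)
{
    return nc * (nc - 1) / 2;
}
```

Dependences inside a cluster can be resolved statically from the offsets. Only pairs of clusters need a runtime inspector, which keeps the number of runtime tests small.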
17
Dependence Inspector
  • Inspector
  • Small code fragment for dependence test between
    two clusters
  • Inequalities for the range test
  • Execution

[Figure: two execution timelines (t) across host and RA, showing inspector, kernel, and data commit]
18
Dependence Inspector (2)
  • Example

        r2 = 0
    L1: r3 = M32[r2]
        r4 = M32[r2+1]
        r5 = f(r3, r4)
        M32[r6] = r5
        r2 = r2 + 1
        r6 = r6 + 2
        b r2 < r10, L1

    inspector: context -> {dep, indep}
        // range1
        (c(r2), c(r2) + c(r10) + 3 + 1)
        // range2
        (c(r6), c(r6) + (c(r10) << 1) + 3)
        if (c(r2) < c(r6) + (c(r10) << 1)) return indep
        else if (c(r6) + (c(r10) << 1) < c(r2)) return indep
        return dep

[Figure: write (w) and read (r) address ranges]
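The inspector for this example can be written as a short C function. The sketch below takes the two range expressions from the slide and applies the usual interval-disjointness test (one range ends before the other begins). That form of the test is an assumption about what the range check intends. `context_t`, `range_t`, `ranges_disjoint`, and `inspector_independent` are hypothetical names, and address wrap-around is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Register values captured when the context is saved: c(rk) on the slide. */
typedef struct {
    uint32_t r2;   /* start of the read stream  */
    uint32_t r6;   /* start of the write stream */
    uint32_t r10;  /* loop bound                */
} context_t;

/* Closed address interval [lo, hi]. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} range_t;

/* Two closed intervals are disjoint when one ends before the other begins. */
static bool ranges_disjoint(range_t a, range_t b)
{
    return a.hi < b.lo || b.hi < a.lo;
}

/* Returns true ("indep") when reads and writes cannot touch the same bytes. */
static bool inspector_independent(const context_t *c)
{
    assert(c != NULL);
    range_t reads  = { c->r2, c->r2 + c->r10 + 3u + 1u };    /* range1 */
    range_t writes = { c->r6, c->r6 + (c->r10 << 1) + 3u };  /* range2 */
    return ranges_disjoint(reads, writes);
}
```

The test costs a handful of integer operations on the host, which is why it can be amortized over the kernel's execution.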
19
Evaluation Framework
  • Simulator for the target architecture
  • ARM and reconfigurable processor
  • Implemented in SystemC

[Figure: simulator structure with ARMulator, reconfigurable array, configuration manager, memory unit, and memory]
20
Preliminary Results
  • Experimental setup
      • Speedup against ARM processor (SimpleScalar)
      • Inspectors are amortized
  • Simulation parameters
      • ALU: 1 cycle
      • Memory bandwidth: 4 ports
      • Memory access: 2 cycles
      • Multiplier: 2 cycles
      • Data steering operators: log2(# of inputs x a)
  • Speedup
      • Up to 3.02

[Figure: speedup chart; value label 47,97; annotation: fir is not pipelined]
21
Preliminary Results of Loop Pipelining (2)
[Figure: results chart for loop pipelining; value labels 28, 19, 8]
22
Conclusion and Further Work
  • Dynamic compilation technique for RAs
  • Overcome the drawback of binary translation
  • Generate netlist optimistically
  • Discover dependences among streams at runtime
  • Further work
  • Accesses with complex address
  • Look-up tables
  • Other optimizations with runtime decisions

23
Thank you!