Title: Fine Grained Application Source Code Profiling for ASIP Design
1. Fine Grained Application Source Code Profiling for ASIP Design
- Kingshuk Karuri, Mohammad Al Faruque, Stefan Kraemer, Rainer Leupers, Gerd Ascheid, Heinrich Meyr
Institute for Integrated Signal Processing Systems, RWTH Aachen University, Germany
2. Organization
- Introduction
- µ-Profiling for ASIP Design
- µ-Profiler at work: MP3 ASIP Case Study
- Performance and accuracy
- Conclusions
3. Introduction
- Mapping an embedded application to an architecture
- Two major goals
  - Flexibility (Programmability)
  - Efficiency (MIPS/Watt)
[Figure: an Application mapped through Behavioral Synthesis, a C-Compiler, or an open "?" path onto ASIC, ARM/MIPS, and ASIP targets, arranged by Flexibility and Efficiency.]
4. Pre-architecture Exploration
Algorithm design (Matlab, SPW, ...)
C code generation or implementation
?
initial processor architecture
architecture optimization
e.g. LISATek
Extensive APPLICATION PROFILING is required
5. Related Work: Fine-Grained Profiling
- High-level SW performance estimation
  - P. Giusto, G. Martin et al., "Reliable Estimation of Execution Time of Embedded Software", DATE 2001
  - L. Lavagno, J. R. Bammi et al., "Software Performance Estimation Strategies in a System-level Design Tool", CODES 2000
- Focus on memory and communication design
  - L. Cai, A. Gerstlauer et al., "Retargetable Profiling for Rapid, Early System-level Design Space Exploration", DAC 2004
  - M. Ravasi, M. Mattavelli, "High-level Algorithmic Complexity Evaluation for System Design", Journal of Systems Architecture, no. 48, Elsevier, 2003
- So far no dedicated profilers for ASIP design
6. µ-Profiling Approach
7. Profiling for ASIP Design
- Traditional application code profiling
  - Goal: optimization of the computationally intensive areas of a given application
  - Used to identify application hot-spots that are manually optimized later
  - Usually done at C source code level or assembly level
- Profiling for ASIP design
  - Goal: optimization of a target architecture for an already optimized application source code
  - Identification of application characteristics useful for micro-architecture and ISA design
8. Profiling at Assembly Level
Algorithm design (Matlab, SPW, ...)
C code generation or implementation
initial processor architecture
Assembly Level
architecture optimization
- highly accurate
- machine-specific
- needs an initial architecture
- slow (1000x native C)
9. Profiling at C Source Code Level
Algorithm design (Matlab, SPW, ...)
Source Level
C code generation or implementation
e.g. gprof/gcov
- fast
- can be done on host machine
- only reports per C line data
- C operator level information unavailable
- cannot capture effects of code optimization
initial architecture
architecture optimization
10. µ-Profiling Approach
- Source level profiling is too coarse grained
  - Only at C function or C source line granularity
  - No capture of hidden operations, e.g. address arithmetic
  - Cannot capture the effects of compiler optimizations
  - Potentially misleading profiling results
- Assembly level profiling is too target specific
  - Needs an initial target architecture to generate an ISS
  - Comparatively slow
- New approach: profile at the intermediate representation (IR) level
  - All C operators, data and control flow are explicit
  - High level code optimizations can be performed
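11. Profiling at IR Level
C source:
    int p, flag; float *b, a[20];
    p = 5;
    f = (flag) ? *b : *(a + p * 2);
IR with explicit operations and control flow:
    p = 5;
    if (flag) goto LL1;
    t1 = (char *)a;
    t2 = p * 2;
    t3 = t2 * sizeof(float);
    t4 = t3 + t1;
    t5 = (float *)t4;
    t6 = *t5;
    goto LL2;
    LL1: t6 = *b;
    LL2: f = t6;
After optimization:
    t2 = 10
    t3 = 10 * 4
    t4 = t3 + t1 is replaced by t4 = t1 + 40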
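The lowering on this slide can be sketched as compilable C, to show which operations a single source line hides. This is an illustration only: the function names `select_source` and `select_ir` are hypothetical, the temporaries follow the slide's naming, and the result of both branches is kept in `t6` so that the types are consistent.

```c
#include <assert.h>
#include <stddef.h>

/* Source-level form: one C line that hides a multiply, a scale,
 * an address addition and a load. */
float select_source(int flag, const float *b, const float *a, int p)
{
    return flag ? *b : *(a + p * 2);
}

/* Hand-written three-address form of the same expression.
 * The caller must keep p * 2 inside the bounds of a. */
float select_ir(int flag, const float *b, const float *a, int p)
{
    const char *t1, *t4;
    const float *t5;
    int t2;
    size_t t3;
    float t6;

    assert(p >= 0);
    if (flag) goto LL1;
    t1 = (const char *)a;              /* base address as a byte pointer */
    t2 = p * 2;                        /* index computation */
    t3 = (size_t)t2 * sizeof(float);   /* scale the index to a byte offset */
    t4 = t1 + t3;                      /* address arithmetic */
    t5 = (const float *)t4;            /* back to an element pointer */
    t6 = *t5;                          /* load */
    goto LL2;
LL1:
    t6 = *b;
LL2:
    return t6;
}
```

With `p` known to be 5, a compiler can fold `t2` and `t3` into the constant byte offset 40 (assuming a 4-byte `float`), which is the simplification noted on the slide.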
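12. µ-Profiler Tool Architecture
[Figure: tool flow. C Source Code -> C Front End -> 3 Address IR (with Optimizations) -> Code Instrumenter -> Compiler -> Object Code -> Linker (together with the Profiler Library) -> a.out on host machine. The Code Instrumenter annotates IR statements, such as an assignment to x computed from a and b, with calls to IncStatementExecCount() and IncOperatorUseCount(), which the Profiler Library provides.]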
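A minimal sketch of what instrumented code linked against a profiler library might look like. The two function names come from the slide; their argument lists, the `OpKind` enumeration, the counter arrays and the example loop are assumptions made for illustration, not the tool's actual interface.

```c
#include <assert.h>

enum { MAX_STATEMENTS = 64 };
typedef enum { OP_ADD, OP_MUL, OP_LOAD, OP_KIND_COUNT } OpKind;

/* Counters kept by the (hypothetical) profiler library. */
static unsigned long stmt_exec_count[MAX_STATEMENTS];
static unsigned long op_use_count[OP_KIND_COUNT];

/* Assumed signature: one counter per statement id. */
void IncStatementExecCount(int stmt_id)
{
    assert(stmt_id >= 0 && stmt_id < MAX_STATEMENTS);
    stmt_exec_count[stmt_id]++;
}

/* Assumed signature: one counter per operator kind. */
void IncOperatorUseCount(OpKind op)
{
    assert((int)op >= 0 && (int)op < OP_KIND_COUNT);
    op_use_count[op]++;
}

/* Instrumented form of: for (i = 0; i < n; i++) x = x + a[i] * b;
 * Only the loop body statement is instrumented in this sketch. */
int instrumented_mac(const int *a, int n, int b)
{
    int x = 0;
    for (int i = 0; i < n; i++) {
        IncStatementExecCount(0);
        IncOperatorUseCount(OP_LOAD);
        IncOperatorUseCount(OP_MUL);
        IncOperatorUseCount(OP_ADD);
        x = x + a[i] * b;
    }
    return x;
}
```

Because the instrumented program is ordinary C, it compiles and runs on the host machine, and the counters can be dumped once the run finishes.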
13. µ-Profiler User Interface
14. Using the µ-Profiler
15. Case Study: MP3 Decoder ASIP
- Goal
  - Use µ-Profiler to tailor an initial processor architecture for a given application
- Target application
  - Publicly available MP3 decoder ANSI C source code
- Initial target architecture
  - CoWare LISATek RISC (LT RISC)
  - Simple fixed point RISC template architecture with 32 bit instruction words
16. Initial Estimates
- Coarse estimation
  - No FPU in initial architecture
  - SW FPU emulation 100x slower than HW FPU
  - Real time constraint for the MP3 standard: 38 frames per second
  - > 60 M cycles per MP3 frame
  - > 2 GHz clock frequency @ 192 kbps
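The clock estimate follows from multiplying the frame rate by the cycles per frame: 38 frames/s times 60 M cycles/frame gives 2.28 G cycles/s, which is where the bound above 2 GHz comes from. A small sketch of that arithmetic, with the helper name `required_clock_hz` chosen here for illustration:

```c
#include <assert.h>

/* Minimum clock frequency in Hz needed to process every frame in real time. */
double required_clock_hz(double frames_per_second, double cycles_per_frame)
{
    assert(frames_per_second > 0.0 && cycles_per_frame > 0.0);
    return frames_per_second * cycles_per_frame;
}
```

Calling `required_clock_hz(38.0, 60e6)` yields 2.28e9. Since 60 M cycles per frame is itself a lower bound, the real requirement on the initial architecture is at least this high.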
17. Hints from µ-Profiling
- Need a HW FPU
  - Pay with significant area overhead?
- ints are in range [-7012, 17664]
  - Reduce int width from 32-bit to 16-bit
  - Migrate from 32 to 16-bit integer ALU
  - Significant amount of area reduction
  - Use the extra area for HW FPU
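The 16-bit conclusion can be checked with a sketch of dynamic value range tracking: record the minimum and maximum value an integer variable takes at run time, then compute the narrowest two's-complement width that covers the range. The `ValueRange` type and the function names are hypothetical and do not describe the tool's internals.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

typedef struct {
    long long min;
    long long max;
    bool seen;      /* true once at least one value was recorded */
} ValueRange;

void range_init(ValueRange *r)
{
    r->min = LLONG_MAX;
    r->max = LLONG_MIN;
    r->seen = false;
}

/* Called by instrumented code for every value assigned to the variable. */
void range_record(ValueRange *r, long long v)
{
    if (v < r->min) r->min = v;
    if (v > r->max) r->max = v;
    r->seen = true;
}

/* Narrowest signed width n such that [min, max] lies inside
 * [-2^(n-1), 2^(n-1) - 1]. The search is capped at 63 bits. */
int range_bits_needed(const ValueRange *r)
{
    assert(r->seen);
    int bits = 1;
    while (bits < 63) {
        long long lo = -(1LL << (bits - 1));
        long long hi = (1LL << (bits - 1)) - 1;
        if (r->min >= lo && r->max <= hi)
            break;
        bits++;
    }
    return bits;
}
```

For the range reported on the slide, the upper bound 17664 exceeds 2^14 = 16384, so 15 bits are not enough and 16 bits are required.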
18. More Hints from µ-Profiling
- Almost all integer comparisons are >
  - Leave out others from ISA
- > 98% of int immediates occupy less than 8 bits
  - Reduce immediate instruction field from 16 to 8 bits
- Few far jumps
  - Reduce jump field from 20 to 16 bits
- Reduce instruction word size to 24 bits
  - Significant amount of code size reduction
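A statistic like the immediate-width hint could be derived by counting how many observed immediates fit a candidate field width. The sketch below assumes a signed two's-complement field, which the slide does not specify, and uses hypothetical function names.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if v is representable in a signed field of `bits` bits. */
int fits_signed(long long v, int bits)
{
    assert(bits >= 1 && bits <= 62);
    long long lo = -(1LL << (bits - 1));
    long long hi = (1LL << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Fraction (0.0 to 1.0) of the n immediates that fit in `bits` bits. */
double fraction_fitting(const long long *imm, size_t n, int bits)
{
    assert(n > 0);
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) {
        if (fits_signed(imm[i], bits))
            ok++;
    }
    return (double)ok / (double)n;
}
```

Evaluating `fraction_fitting` for several widths over the immediates seen during profiling shows how small the field can become before a noticeable share of constants needs an extra instruction.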
19. Final MP3 ASIP Architecture
- Hardware synthesis (gate-level)
  - 0.18 µm CMOS lib
  - Target clock frequency that safely meets real time constraints: 25 MHz
- Net effect
  - 300x cycle count reduction
  - No area increase
  - Considerable code size reduction
20. Performance and accuracy
21. Speed: µ-Profiler, gcc, and ISS
[Figure: bar chart of relative execution time. Configurations compared: gcc, basic profiling, basic + dynamic value range profiling, basic + dynamic value range + trace generation, MIPS ISS. Values shown: 1x, 3.1x, 18.9x, 95x, 788.5x.]
22. Accuracy: MIPS ISS vs µ-Profiler
- Average deviation without optimizations: 36%
- Average deviation with optimizations: 23%
23. Cycle Count Estimates: LT RISC ISS vs µ-Profiler
- Average deviation without optimizations: 27%
- Average deviation with optimizations: 11%
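The slides do not define how the average deviation is computed. One plausible reading, sketched here as an assumption, is the mean of the absolute relative differences between the profiler estimate and the ISS reference over a set of benchmarks. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mean absolute relative deviation of estimates from references, in percent. */
double average_deviation_percent(const double *estimate,
                                 const double *reference, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(reference[i] != 0.0);
        double diff = estimate[i] - reference[i];
        double ref = reference[i];
        if (diff < 0.0) diff = -diff;   /* absolute difference */
        if (ref < 0.0) ref = -ref;      /* absolute reference */
        sum += diff / ref;
    }
    return 100.0 * sum / (double)n;
}
```

Over- and underestimates both count as positive deviation under this definition, so errors in opposite directions do not cancel out.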
24. Conclusions
25. Conclusions
- State-of-the-art ASIP ISA and micro-architecture exploration tools
- Pre-architecture exploration tools can make ASIP design even more efficient
- µ-Profiler can help designers make early design decisions on the initial ASIP architecture
- Future work
  - Accurate cost estimation of profiler hints
  - Automatic translation to ADL models
  - More case studies with diverse applications and architectures
26. Thank you