1. Profile-Guided Optimization Targeting High Performance Embedded Applications
- David Kaeli
- Murat Bicer
- Efe Yardimci
- Center for Subsurface Sensing and
- Imaging Systems (CenSSIS)
- Northeastern University
- Jeffrey Smith
- Mercury Computer
2. Why Consider Using Profile-Guided Optimization?
- Much of the potential performance available on data-parallel systems cannot be obtained due to unpredictable control flow and data flow in programs
- Memory system performance continues to dominate the performance of many data-parallel applications
- Program profiles provide clues to the compiler/linker/runtime to
  - Enable more aggressive use of interprocedural optimizations
  - Eliminate bottlenecks in the data flow/control flow
  - Improve a program's layout on the available memory hierarchy
- Applications can then be developed at higher levels of programming abstraction (e.g., from UML) and tuned for performance later
3. Profile Guidance
- Obtain run-time profiles in the form of
  - Procedure call graphs, basic block traversals
  - Program variable value profiles
  - Hardware performance counters (using PCL)
    - Cache and TLB misses, pipeline stalls, heap allocations, synchronization messages
- Utilize run-time profiles as input to
  - Provide user feedback (e.g., program visualization)
  - Perform profile-driven compilation (recompile using the profile)
  - Enable dynamic optimization (just-in-time compilation)
  - Evaluate software testing coverage
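To illustrate the first bullet group, a weighted procedure call graph of the kind TATL or gprof reports can be sketched in a few lines of Python using the standard `sys.setprofile` hook. This is only a sketch of the idea; the `fib` workload and all names here are invented for illustration and are not part of the tools listed on the next slide:

```python
import sys
from collections import Counter

def build_call_graph(root, *args):
    """Run root(*args) under a profile hook and return a weighted call
    graph: (caller_name, callee_name) -> number of calls observed."""
    edges = Counter()

    def tracer(frame, event, arg):
        # 'call' fires once per Python function invocation.
        if event == "call" and frame.f_back is not None:
            caller = frame.f_back.f_code.co_name
            callee = frame.f_code.co_name
            edges[(caller, callee)] += 1

    sys.setprofile(tracer)
    try:
        root(*args)
    finally:
        sys.setprofile(None)
    return edges

# Hypothetical workload: a recursive procedure.
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

graph = build_call_graph(fib, 5)
```

The resulting edge weights (e.g., how often `fib` calls itself) are exactly the call-frequency annotations the coloring algorithm on slide 9 consumes.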
4. Profiling Tools
- Mercury Tools
  - TATL: Trace Analysis Tool and Library
  - Procedure profiles
- GNU gprof
- PowerPC Performance Counters
  - PCL: Performance Counter Library
  - PM API targeting the PowerPC
- Green Hills Compiler
  - MULTI profiling support
- Custom instrumentation drivers
5. Data-Parallel Applications
[Figure: closed-loop optimization flow. A program run produces a program profile (counter values, program paths, variable values), which is fed back to the compiler to drive compile-time and binary-level optimizations of the program binary. Example data-parallel applications: SAR, GPR, MRI, Software Defined Radio.]
6. Target Optimizations
- Compile-time
  - Aggressive procedure inlining
  - Aggressive constant propagation
  - Program variable specialization
  - Procedure cloning
  - Removal of redundant loads/stores
- Link-time
  - Code reordering utilizing coloring
  - Static data reordering
- Dynamic (during runtime)
  - Heap layout optimization
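A minimal sketch of what profile-guided procedure cloning plus constant propagation buys, written as hand-applied Python rather than compiler IR. The value profile assumed here (most calls pass `scale == 1`) and every function name are invented for illustration:

```python
def filter_scale(pixels, scale):
    # General version: `scale` is an arbitrary run-time value.
    return [p * scale for p in pixels]

# Hypothetical value profile: `scale` was 1 in the vast majority of
# profiled calls, so a clone is generated with the constant propagated
# and the multiply folded away.
def filter_scale__scale1(pixels):
    return list(pixels)  # p * 1 == p

def filter_scale_dispatch(pixels, scale):
    # Call sites that cannot be proven to pass the hot value
    # get a cheap guard in front of the specialized clone.
    if scale == 1:
        return filter_scale__scale1(pixels)
    return filter_scale(pixels, scale)
```

The clone removes work on the common path while the guard preserves correctness on the cold path, which is the essence of profile-driven specialization.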
7. Memory Performance is Key to Scalability in Data-Parallel Applications
- The performance gap between processor technology and memory technology continues to grow
- Hierarchical memory systems (multi-level caches) have been used to bridge this gap
- Embedded processing applications place a heavy burden on the supporting memory system
- Applications will need to adapt (potentially dynamically) to better utilize the available memory system
8. Cache Line Coloring
- Attempts to reorder a program executable by coloring the cache space, avoiding caller-callee conflicts in the cache
- Can be driven by either statically generated call graphs or profile data
- Improves upon the work of Pettis and Hansen by considering the organization of the cache space (i.e., cache size, line size, associativity)
- Can be applied at different levels of granularity (procedures, basic blocks), both intra- and inter-procedurally
9. Cache Line Coloring Algorithm
[Figure: example call graph over procedures A, B, and E, with edge weights 90 and 40.]
- Build the program call graph
  - nodes represent procedures
  - edges represent calls
  - edge weights represent call frequencies
- Prune edges based on a threshold value
- Sort graph edges and process them in decreasing edge-weight order
- Place procedures in the cache space, avoiding color conflicts
- Fill in gaps with the remaining procedures
- Reduces execution time by up to 49% for data compression algorithms
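The placement steps above can be sketched as follows. This is a deliberately simplified model, not the published algorithm: it assumes a direct-mapped cache, measures procedure sizes in whole cache lines, and uses a greedy first-fit offset search; the cache parameters and the A/B/E example graph are hypothetical:

```python
NUM_LINES = 8  # hypothetical direct-mapped I-cache with 8 lines

def colors(start, size):
    """Set of cache lines (colors) a procedure of `size` lines
    occupies when placed at line offset `start`."""
    return {(start + i) % NUM_LINES for i in range(size)}

def color_placement(sizes, edges, threshold=0):
    """Greedy coloring: prune light edges, process the rest in
    decreasing weight order, and place each unplaced procedure at the
    first offset whose colors do not clash with its placed neighbor."""
    placed = {}  # procedure name -> start line offset

    def place(proc, avoid):
        for start in range(NUM_LINES):
            if not (colors(start, sizes[proc]) & avoid):
                placed[proc] = start
                return
        placed[proc] = 0  # no conflict-free offset exists; fall back

    hot = sorted((e for e in edges if e[2] > threshold),
                 key=lambda e: -e[2])
    for caller, callee, _w in hot:
        for proc, other in ((caller, callee), (callee, caller)):
            if proc not in placed:
                avoid = (colors(placed[other], sizes[other])
                         if other in placed else set())
                place(proc, avoid)
    # Fill in gaps: procedures on no hot edge take any free offset.
    for proc in sizes:
        placed.setdefault(proc, 0)
    return placed

# Example call graph from the slide: A calls B (weight 90) and E (40).
layout = color_placement({"A": 2, "B": 2, "E": 2},
                         [("A", "B", 90), ("A", "E", 40)])
```

After placement, each heavy caller-callee pair maps to disjoint sets of cache lines, so the hot call paths no longer evict each other.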
10. Data Memory Access
- A disproportionate number of data cache misses are caused by accesses to dynamically allocated (heap) memory
- Increases in cache size do not effectively reduce data cache misses caused by heap accesses
- A small number of objects account for a large percentage of heap misses (90/10 rule)
- Existing memory allocation routines tend to balance allocation speed and memory usage (locality preservation has not been a major concern)
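The 90/10 observation suggests a simple post-processing pass over a heap-miss profile: rank objects by misses and keep the smallest prefix covering most of the total. A sketch, with an entirely invented profile:

```python
def hot_objects(miss_counts, coverage=0.90):
    """Given per-object heap-miss counts, return the smallest set of
    objects (ranked by miss count) that together account for at least
    `coverage` of all misses -- the 90/10 objects worth laying out."""
    total = sum(miss_counts.values())
    hot, covered = [], 0
    for obj, misses in sorted(miss_counts.items(),
                              key=lambda kv: -kv[1]):
        if covered >= coverage * total:
            break
        hot.append(obj)
        covered += misses
    return hot

# Hypothetical profile: object name -> data-cache misses attributed
# to it. 700 + 200 = 900 of 1000 misses, so two objects cover 90%.
profile = {"tree_nodes": 700, "hash_table": 200, "strings": 60,
           "scratch_buf": 30, "misc": 10}
```

Only the returned objects need the conflict-aware treatment described on the next slides; everything else can go through the fast default allocator.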
11. Miss Rates (%) vs. Cache Configurations
12. Profile-Driven Data Layout
- We have developed a profile-guided approach to allocating heap objects to improve heap behavior
- The idea is to use existing knowledge of the computing platform (e.g., cache organization), combined with profile data, to enable the target application to execute more efficiently
- Mapping temporally local memory blocks possessing high reference counts to the same cache area will generate a significant number of cache misses
13. Allocation
- We have developed our own malloc routine, which uses a conflict profile to avoid allocating potentially conflicting addresses
- A multi-step allocation algorithm is repeated until a non-conflicting allocation is made
- If all steps produce conflicts, the allocation is made within the wilderness region
- If conflicts still occur in the wilderness region, we allocate the conflicting chunks anyway (creating a hole)
  - Allocation occurs at the first non-conflicting address after the chunk
  - The hole is immediately freed, causing minimal space wastage (though possibly some limited fragmentation)
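The multi-step/wilderness policy can be sketched as follows. This is a hypothetical toy model, not the actual malloc: it allocates single cache lines, takes candidate addresses from a free list, ignores chunk sizes, coalescing, and allocator metadata, and represents the conflict profile as a set of cache sets to avoid:

```python
NUM_SETS = 16   # hypothetical direct-mapped data cache: 16 sets
LINE = 32      # bytes per cache line

def cache_set(addr):
    """Cache set a byte address maps to."""
    return (addr // LINE) % NUM_SETS

def conflict_free_alloc(free_list, conflict_sets, wilderness_top):
    """Return (allocated_addr, new_wilderness_top)."""
    assert len(conflict_sets) < NUM_SETS, "profile must leave a free set"
    # Multi-step search: try existing free chunks first.
    for addr in free_list:
        if cache_set(addr) not in conflict_sets:
            return addr, wilderness_top
    # Every step conflicted: carve from the wilderness region, skipping
    # past conflicting lines. The skipped bytes form a hole, which the
    # real allocator frees immediately (minimal wastage, some possible
    # fragmentation).
    addr = wilderness_top
    while cache_set(addr) in conflict_sets:
        addr += LINE
    return addr, addr + LINE  # allocation, then the new wilderness top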
14. Runtime Improvements over Non-Optimized Heap Layout
15. Future Work
- Present algorithms have only been evaluated on uniprocessor platforms
- Follow-on work will target Mercury RACE multiprocessor systems
- Target applications will include
  - FM3TR for Software Defined Radio
  - Steepest Descent Fast Multipole Method (SDFMM) for demining applications
16. Related Publications
- "Improving the Performance of Heap-based Memory Access," E. Yardimci and D. Kaeli, Proc. of the Workshop on Memory Performance Issues, June 2001.
- "Accurate Simulation and Evaluation of Code Reordering," J. Kalamatianos and D. Kaeli, Proc. of the IEEE International Symposium on Performance Analysis of Systems and Software, May 2000.
- "Model Based Parallel Programming with Profile-Guided Application Optimization," J. Smith and D. Kaeli, Proc. of the 4th Annual High Performance Embedded Computing Workshop, MIT Lincoln Labs, Lexington, MA, September 2000, pp. 85-86.
- "Cache Line Coloring Using Real and Estimated Profiles," A. Hashemi, J. Kalamatianos, D. Kaeli and W. Meleis, Digital Technical Journal, Special Issue on Tools and Languages, February 1999.
- "Parameter Value Characterization of Windows NT-based Applications," J. Kalamatianos and D. Kaeli, Workload Characterization: Methodology and Case Studies, IEEE Computer Society, 1999, pp. 142-149.
17. Related Publications (also see http://www.ece.neu.edu/info/architecture/publications.html)
- "Analysis of Temporal-based Program Behavior for Improved Instruction Cache Performance," J. Kalamatianos, A. Khalafi, H. Hashemi, D. Kaeli and W. Meleis, IEEE Transactions on Computers, Vol. 10, No. 2, February 1999, pp. 168-175.
- "Memory Architecture Dependent Program Mapping," B. Calder, A. Hashemi, and D. Kaeli, US Patent No. 5,963,972, October 5, 1999.
- "Temporal-based Procedure Reordering for Improved Instruction Cache Performance," Proc. of the 4th HPCA, Feb. 1998, pp. 244-253.
- "Efficient Procedure Mapping Using Cache Line Coloring," H. Hashemi, D. Kaeli and B. Calder, Proc. of PLDI'97, June 1997, pp. 171-182.
- "Procedure Mapping Using Static Call Graph Estimation," Proc. of the Workshop on the Interaction Between Compilers and Computer Architecture, TCCA News, 1997.