1
Profile-Guided Optimization Targeting High
Performance Embedded Applications
  • David Kaeli
  • Murat Bicer
  • Efe Yardimci
  • Center for Subsurface Sensing and
    Imaging Systems (CenSSIS)
  • Northeastern University
  • Jeffrey Smith
  • Mercury Computer

2
Why Consider Using Profile-Guided Optimization?
  • Much of the potential performance available on
    data-parallel systems cannot be obtained due to
    unpredictable control flow and data flow in
    programs
  • Memory system performance continues to dominate
    the performance of many data-parallel
    applications
  • Program profiles provide clues to the
    compiler/linker/runtime to
  • Enable more aggressive use of interprocedural
    optimizations
  • Eliminate bottlenecks in the data flow/control
    flow and
  • Improve a program's layout on the available
    memory hierarchy
  • Applications can then be developed at higher
    levels of programming abstraction (e.g., from
    UML) and tuned for performance later

3
Profile Guidance
  • Obtain run-time profiles in the form of
  • Procedure call graphs, basic block traversals
  • Program variable value profiles
  • Hardware performance counters (using PCL)
  • Cache and TLB misses, pipeline stalls, heap
    allocations, synchronization messages
  • Utilize run-time profiles as input to
  • Provide user feedback (e.g., program
    visualization)
  • Perform profile-driven compilation (recompile
    using the profile)
  • Enable dynamic optimization (just-in-time
    compilation)
  • Evaluate software testing coverage
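As an illustration of the profile types listed above, the sketch below shows one simple way a call-graph edge profile could be gathered by instrumentation. The names (profile_call, profile_dump) are hypothetical and are not the TATL, gprof, or PCL interfaces; those tools collect comparable data automatically.

/* Hypothetical call-graph edge profiler (illustrative only; not the TATL,
 * gprof, or PCL interface).  Each instrumented call site records one
 * caller->callee edge; the dump can later drive a profile-guided rebuild. */
#include <stdio.h>
#include <string.h>

#define MAX_EDGES 1024

struct edge { const char *caller, *callee; unsigned long count; };
static struct edge edges[MAX_EDGES];
static int n_edges;

/* Record one traversal of the caller->callee edge. */
void profile_call(const char *caller, const char *callee)
{
    int i;
    for (i = 0; i < n_edges; i++) {
        if (strcmp(edges[i].caller, caller) == 0 &&
            strcmp(edges[i].callee, callee) == 0) {
            edges[i].count++;
            return;
        }
    }
    if (n_edges < MAX_EDGES) {
        edges[n_edges].caller = caller;
        edges[n_edges].callee = callee;
        edges[n_edges].count  = 1;
        n_edges++;
    }
}

/* Emit the weighted call graph in a form an offline optimizer could read. */
void profile_dump(FILE *out)
{
    int i;
    for (i = 0; i < n_edges; i++)
        fprintf(out, "%s %s %lu\n",
                edges[i].caller, edges[i].callee, edges[i].count);
}

An instrumented call site would invoke profile_call("fft", "bit_reverse") just before the call (names hypothetical); the resulting edge weights are exactly the input the code-reordering passes on later slides consume.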

4
Profiling Tools
  • Mercury Tools
  • TATL (Trace Analysis Tool and Library)
  • Procedure profiles
  • GNU gprof
  • PowerPC Performance Counters
  • PCL (Performance Counter Library)
  • PM API targeting the PowerPC
  • Green Hills Compiler
  • MULTI profiling support
  • Custom instrumentation drivers

5
Data Parallel Applications
[Diagram: profile feedback loop. A program run produces a profile (counter values, program paths, variable values) that feeds back into compile-time and binary-level optimizations of the program binary. Example data-parallel applications: SAR, GPR, MRI, Software Defined Radio.]
6
Target Optimizations
  • Compile-time
  • Aggressive procedure inlining
  • Aggressive constant propagation
  • Program variable specialization
  • Procedure cloning
  • Removal of redundant loads/stores
  • Link-time
  • Code reordering utilizing coloring
  • Static data reordering
  • Dynamic (during runtime)
  • Heap layout optimization
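As a sketch of how the compile-time items above interact, suppose a value profile shows that one parameter of a hot routine is almost always a particular constant. The example below (hypothetical routine and values, not from the target applications) shows procedure cloning plus variable specialization: a clone is emitted for the hot value and call sites are guarded, after which constant propagation removes the redundant work inside the clone.

/* Illustrative only: the effect of profile-guided cloning and
 * specialization.  Assume value profiling showed scale == 1.0f on the
 * vast majority of calls. */

/* Original, general-purpose routine. */
void scale_buffer(float *buf, int n, float scale)
{
    int i;
    for (i = 0; i < n; i++)
        buf[i] *= scale;
}

/* Clone specialized for the hot value: after constant propagation the
 * multiply by 1.0f (and hence the whole loop) disappears. */
static void scale_buffer_1(float *buf, int n)
{
    (void)buf;
    (void)n;            /* multiplying by 1.0f is a no-op */
}

/* Hot call sites are rewritten to guard on the profiled value. */
void scale_buffer_dispatch(float *buf, int n, float scale)
{
    if (scale == 1.0f)
        scale_buffer_1(buf, n);          /* common case: specialized clone */
    else
        scale_buffer(buf, n, scale);     /* rare case: original routine   */
}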

7
Memory Performance is Key to Scalability in
Data-Parallel Applications
  • The performance gap between processor technology
    and memory technology continues to grow
  • Hierarchical memory systems (multi-level caches)
    have been used to bridge this gap
  • Embedded processing applications place a heavy
    burden on the supporting memory system
  • Applications will need to adapt (potentially
    dynamically) to better utilize the available
    memory system

8
Cache Line Coloring
  • Attempts to reorder a program executable by
    coloring the cache space, avoiding caller-callee
    conflicts in a cache
  • Can be driven by either statically-generated call
    graphs or profile data
  • Improves upon the work of Pettis and Hansen by
    considering the organization of the cache space
    (i.e., cache size, line size, associativity)
  • Can be used with different levels of granularity
    (procedures, basic blocks) and both intra- and
    inter-procedurally

9
Cache Line Coloring Algorithm
[Figure: example call graph with procedures A, B, and E and edge weights 40 and 90]
  • Build program call graph
  • nodes represent procedures
  • edges represent calls
  • edge weights represent call frequencies
  • Prune edges based on a threshold value
  • Sort graph edges and process in decreasing edge
    weight order
  • Place procedures in the cache space, avoiding
    color conflicts
  • Fill in gaps with remaining procedures
  • Reduces execution time by up to 49% for data
    compression algorithms
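The placement step can be sketched as follows. This is a simplified illustration with assumed cache parameters (direct-mapped, 32 KB, 32-byte lines) and data structures, not the authors' implementation; a real pass also handles associativity, colors at basic-block granularity, and fills the remaining gaps with cold procedures.

/* Simplified cache line coloring placement (illustrative).  Assumes every
 * procedure is smaller than the cache. */
#define CACHE_SIZE (32 * 1024)          /* assumed I-cache size    */
#define LINE_SIZE  32                   /* assumed cache line size */
#define NUM_COLORS (CACHE_SIZE / LINE_SIZE)

struct proc {
    const char *name;
    unsigned    size;    /* bytes                */
    unsigned    start;   /* chosen layout offset */
};

/* Do two placed procedures share any cache-line color? */
static int colors_conflict(const struct proc *a, const struct proc *b)
{
    unsigned a_first = (a->start / LINE_SIZE) % NUM_COLORS;
    unsigned a_lines = (a->size + LINE_SIZE - 1) / LINE_SIZE;
    unsigned b_first = (b->start / LINE_SIZE) % NUM_COLORS;
    unsigned b_lines = (b->size + LINE_SIZE - 1) / LINE_SIZE;
    unsigned i, j;

    for (i = 0; i < a_lines; i++)
        for (j = 0; j < b_lines; j++)
            if ((a_first + i) % NUM_COLORS == (b_first + j) % NUM_COLORS)
                return 1;
    return 0;
}

/* Process call-graph edges in decreasing weight order: place the callee at
 * the next free offset, then slide it one cache line at a time until its
 * colors no longer clash with its (already placed) caller. */
void place_callee(const struct proc *caller, struct proc *callee,
                  unsigned *next_free)
{
    callee->start = *next_free;
    while (colors_conflict(caller, callee))
        callee->start += LINE_SIZE;
    *next_free = callee->start + callee->size;
}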

10
Data Memory Access
  • A disproportionate number of data cache misses
    are caused by accesses to dynamically allocated
    (heap) memory
  • Increases in cache size do not effectively reduce
    data cache misses caused by heap accesses
  • A small number of objects account for a large
    percentage of heap misses (90/10 rule)
  • Existing memory allocation routines tend to
    balance allocation speed and memory usage
    (locality preservation has not been a major
    concern)

11
Miss rates (%) vs. Cache Configurations
12
Profile-driven Data Layout
  • We have developed a profile-guided approach to
    allocating heap objects to improve heap behavior
  • The idea is to use existing knowledge of the
    computing platform (e.g., cache organization),
    combined with profile data, to enable the target
    application to execute more efficiently
  • Mapping temporally local memory blocks possessing
    high reference counts to the same cache area will
    generate a significant number of cache misses
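The platform knowledge referred to above is essentially the mapping from an address range to cache sets. A small helper like the one below (assumed parameters for a direct-mapped data cache; NUM_SETS changes with associativity) lets the layout pass predict which heap blocks would collide.

/* Which cache sets does the block [addr, addr + size) occupy?  Illustrative
 * helper with assumed cache parameters. */
#include <stddef.h>
#include <stdint.h>

#define DLINE_SIZE 32                    /* assumed data-cache line size */
#define NUM_SETS   1024                  /* e.g., 32 KB direct-mapped    */

/* Mark the sets touched by the block in a caller-supplied bitmap. */
void mark_sets(uintptr_t addr, size_t size, unsigned char sets[NUM_SETS])
{
    uintptr_t line;
    uintptr_t first = addr / DLINE_SIZE;
    uintptr_t last  = (addr + size - 1) / DLINE_SIZE;

    for (line = first; line <= last; line++)
        sets[line % NUM_SETS] = 1;
}

/* Two frequently referenced blocks conflict if their footprints intersect. */
int footprints_conflict(const unsigned char a[NUM_SETS],
                        const unsigned char b[NUM_SETS])
{
    int s;
    for (s = 0; s < NUM_SETS; s++)
        if (a[s] && b[s])
            return 1;
    return 0;
}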

13
Allocation
  • We have developed our own malloc routine which
    uses a conflict profile to avoid allocating
    potentially conflicting addresses
  • A multi-step allocation algorithm is repeated
    until a non-conflicting allocation is made
  • If all steps produce conflicts, allocation is
    made within the wilderness region
  • If conflicts still occur in the wilderness
    region, we allocate these conflicting chunks
    (creating a hole)
  • Allocation occurs at the first non-conflicting
    address after the chunk
  • The hole is immediately freed, causing minimal
    space wastage (though possibly some limited
    fragmentation)
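A minimal sketch of the multi-step loop above, assuming the mark_sets/footprints_conflict helpers from the previous sketch and a profile-derived bitmap hot_sets of cache sets owned by hot heap objects. The names and structure are hypothetical; this is not the authors' allocator, and a real version would also manage the wilderness region explicitly.

/* Illustrative conflict-aware allocation wrapper (not the authors'
 * implementation).  Candidate blocks that would conflict with hot objects
 * are kept temporarily as "holes" and freed once a clean address is found. */
#include <stdlib.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SETS  1024
#define MAX_TRIES 4

extern unsigned char hot_sets[NUM_SETS];   /* built from the conflict profile */
void mark_sets(uintptr_t addr, size_t size, unsigned char sets[NUM_SETS]);
int  footprints_conflict(const unsigned char a[NUM_SETS],
                         const unsigned char b[NUM_SETS]);

void *pg_malloc(size_t size)
{
    void *holes[MAX_TRIES];
    int   n_holes = 0;
    void *p = NULL;
    int   i;

    for (i = 0; i < MAX_TRIES; i++) {
        unsigned char sets[NUM_SETS] = { 0 };

        p = malloc(size);
        if (p == NULL)
            break;                          /* out of memory: give up */
        mark_sets((uintptr_t)p, size, sets);
        if (!footprints_conflict(sets, hot_sets))
            break;                          /* non-conflicting address found */
        if (i == MAX_TRIES - 1)
            break;                          /* last resort: accept the conflict */
        holes[n_holes++] = p;               /* conflicting chunk becomes a hole */
        p = NULL;
    }

    /* Free the holes immediately; the space loss is small (some limited
     * fragmentation) and the surviving block sits at a better address. */
    while (n_holes-- > 0)
        free(holes[n_holes]);

    return p;
}

With a conflict profile in hand, a wrapper like this would simply replace malloc at the allocation sites identified as hot.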

14
Runtime improvements over non-optimized heap
layout
15
Future Work
  • Present algorithms have only been evaluated on
    uniprocessor platforms
  • Follow-on work will target Mercury RACE
    multiprocessor systems
  • Target applications will include
  • FM3TR for Software Defined Radio
  • Steepest Descent Fast Multipole Method (SDFMM)
    and related methods for demining applications

16
Related Publications
  • "Improving the Performance of Heap-based Memory
    Access," E. Yardimci and D. Kaeli, Proc. of the
    Workshop on Memory Performance Issues, June 2001.
  • "Accurate Simulation and Evaluation of Code
    Reordering," J. Kalamatianos and D. Kaeli, Proc.
    of the IEEE International Symposium on
    Performance Analysis of Systems and Software,
    May 2000.
  • "Model Based Parallel Programming with
    Profile-Guided Application Optimization," J.
    Smith and D. Kaeli, Proc. of the 4th Annual High
    Performance Embedded Computing Workshop, MIT
    Lincoln Labs, Lexington, MA, September 2000,
    pp. 85-86.
  • "Cache Line Coloring Using Real and Estimated
    Profiles," A. Hashemi, J. Kalamatianos, D. Kaeli
    and W. Meleis, Digital Technical Journal, Special
    Issue on Tools and Languages, February 1999.
  • "Parameter Value Characterization of Windows
    NT-based Applications," J. Kalamatianos and D.
    Kaeli, Workload Characterization: Methodology and
    Case Studies, IEEE Computer Society, 1999,
    pp. 142-149.

17
Related Publications (also see
http://www.ece.neu.edu/info/architecture/publications.html)
  • "Analysis of Temporal-based Program Behavior for
    Improved Instruction Cache Performance," J.
    Kalamatianos, A. Khalafi, H. Hashemi, D. Kaeli
    and W. Meleis, IEEE Transactions on Computers,
    Vol. 10, No. 2, February 1999, pp. 168-175.
  • "Memory Architecture Dependent Program Mapping,"
    B. Calder, A. Hashemi, and D. Kaeli, US Patent
    No. 5,963,972, October 5, 1999.
  • "Temporal-based Procedure Reordering for Improved
    Instruction Cache Performance," Proc. of the 4th
    HPCA, Feb. 1998, pp. 244-253.
  • "Efficient Procedure Mapping Using Cache Line
    Coloring," H. Hashemi, D. Kaeli and B. Calder,
    Proc. of PLDI'97, June 1997, pp. 171-182.
  • "Procedure Mapping Using Static Call Graph
    Estimation," Proc. of the Workshop on the
    Interaction Between Compilers and Computer
    Architecture, TCCA News, 1997.