Title: Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements
1 Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements
- Gautam Chakrabarti and Fred Chow
- PathScale, LLC.
2 Outline
- Motivation
- Types of structure layout optimizations
- Criteria for structure layout optimizations
- Implementation details
- Performance results
- Future work
- Conclusion
3 Motivation
- Poor data locality in many applications
- High data cache miss rates
- Growing gap between processor and memory speeds
- Our Aim
  - Make applications more cache-friendly
- Our Approach
  - Change layout of data structures
  - Requires whole-program optimization
  - Use Inter-Procedural Analysis and Optimizations (IPA)
4 IPA
- Summarization
- Analysis
- Optimization
5 Types of Structure Layout Optimizations
    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long long l;
        char c;
        struct struct_A *next;
    };

    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long long l;
        char c;
    };
6 Structure Splitting Example
Original:
    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long long l;
        char c;
        struct struct_A *next;
    };
After splitting:
    struct new_struct_A {
        double d1;
        int i;
        long long l;
        struct new_struct_A *next;
        struct cold_sub_struct_A *p;
    };

    struct cold_sub_struct_A {
        double d2;
        float f;
        char c;
    };
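The effect of splitting on field accesses can be sketched as follows. This is a minimal illustration, not the compiler's output: the struct definitions follow the slide, while the helper names `sum_d1` and `get_d2` are hypothetical. A hot traversal touches only the smaller hot node, and an access to a cold field goes through the added pointer `p`.

```c
#include <assert.h>
#include <stddef.h>

/* Cold fields, moved out of the main node (as on the slide). */
struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};

/* Hot fields plus a link to the cold part (as on the slide). */
struct new_struct_A {
    double d1;
    int i;
    long long l;
    struct new_struct_A *next;
    struct cold_sub_struct_A *p;
};

/* Hypothetical hot loop: walks the list and reads only hot fields. */
double sum_d1(const struct new_struct_A *head)
{
    double sum = 0.0;
    for (; head != NULL; head = head->next)
        sum += head->d1;
    return sum;
}

/* Hypothetical cold access: what was node->d2 before splitting
   becomes node->p->d2, at the cost of one extra indirection. */
double get_d2(const struct new_struct_A *node)
{
    assert(node != NULL && node->p != NULL);
    return node->p->d2;
}
```

The extra indirection on cold fields is the price paid for packing more hot nodes into each cache line, which is why the choice of hot and cold fields matters.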
7 Structure Peeling Example
Original:
    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long long l;
        char c;
    };
After peeling:
    struct new_struct_A {
        double d1;
        int i;
        long long l;
    };

    struct cold_sub_struct_A {
        double d2;
        float f;
        char c;
    };
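As a rough illustration of peeling, assume an array of the original structure is replaced by two parallel arrays, one per peeled structure. The struct definitions follow the slide; the function `sum_hot` and the parallel-array arrangement are assumptions made for this sketch. No link pointer is needed between the two parts, because element k of one array corresponds to element k of the other.

```c
#include <assert.h>
#include <stddef.h>

/* Hot part after peeling (as on the slide). */
struct new_struct_A {
    double d1;
    int i;
    long long l;
};

/* Cold part after peeling (as on the slide). */
struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};

/* Hypothetical hot loop: it reads only the hot array, so the cold
   fields never occupy the cache lines this loop brings in. */
double sum_hot(const struct new_struct_A *hot, size_t n)
{
    double sum = 0.0;
    assert(hot != NULL || n == 0);
    for (size_t k = 0; k < n; k++)
        sum += hot[k].d1 + (double)hot[k].i + (double)hot[k].l;
    return sum;
}
```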
8 Criteria for structure layout optimizations
- Legality Analysis
  - Type cast
  - Address of a field is taken
  - Escaped types
  - Parameter types
  - Full visibility to IPA
  - Alignment restrictions
- Profitability Analysis
  - Hotness
  - Affinity
  - Field accesses at loop level
  - Size
9 Implementation Details
- Step 1: Type information summarization (IPL)
- Step 2: Symbol table merging (IPA)
- Step 3: Legality and profitability analysis (IPA analysis)
- Step 4: Transforming the program (IPA optimization)
10 Implementation Details: Type information summarization
- Information summarization in IPL
- Framework for computing static profiles using heuristics
- New TY flag: TY_NO_SPLIT
- SUMMARY_TY_INFO
- SUMMARY_LOOP
  - For each DO_LOOP, WHILE_DO, DO_WHILE
  - Bit-vector to track field accesses of up to N structures for each loop
  - Considers field accesses immediately inside the loop
  - These fields are considered affine to each other
  - Execution count of statements immediately inside the loop
  - From statically estimated profiles or from runtime feedback
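The per-loop bit-vector described above can be pictured with a small sketch. The record layout, the names `loop_summary`, `record_field_access`, and `fields_are_affine`, and the 32-field limit are all assumptions made for illustration; this is not the actual SUMMARY_LOOP definition in Open64, and it tracks a single structure type rather than up to N.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_FIELDS 32  /* assumed limit for this sketch */

/* Hypothetical per-loop summary for one structure type. */
struct loop_summary {
    uint32_t field_bv;    /* bit j set: field j is accessed immediately inside the loop */
    uint64_t exec_count;  /* estimated or measured execution count for the loop body */
};

/* Mark a field as accessed immediately inside this loop. */
void record_field_access(struct loop_summary *ls, unsigned field)
{
    assert(field < MAX_FIELDS);
    ls->field_bv |= (uint32_t)1 << field;
}

/* Two fields are treated as affine if both are accessed in the same loop. */
int fields_are_affine(const struct loop_summary *ls, unsigned a, unsigned b)
{
    assert(a < MAX_FIELDS && b < MAX_FIELDS);
    uint32_t mask = ((uint32_t)1 << a) | ((uint32_t)1 << b);
    return (ls->field_bv & mask) == mask;
}
```

A bit-vector keeps the summary compact and makes the affinity test a single mask comparison, which matters when summaries from many loops are merged during IPA.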
11 Implementation Details: IPA Analysis
- Inter-procedurally update statically estimated execution count of PUs
- Update statically estimated loop frequencies in SUMMARY_LOOP
- Consider SUMMARY_LOOP from the hottest P PUs
- Determine candidates for structure-layout transformation
- Determine new layout of structures
12 Implementation Details: IPA Analysis Example
Loop   F4   F3   F2   F1   BV
L1     -    22   -    22   0101
L2     -    -    14   -    0010
L3     -    12   -    12   0101
L4     8    8    -    -    1100
L5     -    6    -    6    0101

Group  F4   F3   F2   F1
AG1    -    40   -    40
AG2    -    -    14   -
AG3    8    8    -    -

Li: Loops; Fj: Fields in a struct; BV: Bit-vector; AGk: Affinity groups
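The numbers in this example are consistent with a simple grouping rule, sketched below. The slide does not state the rule, so this is an assumption: loops with identical bit-vectors are merged into one affinity group, and the group's weight is the sum of their execution counts. The names `loop_info`, `affinity_group`, and `build_affinity_groups` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One loop's summary: bit 0 = F1, bit 1 = F2, bit 2 = F3, bit 3 = F4. */
struct loop_info {
    unsigned bv;
    unsigned count;
};

/* One affinity group: the set of fields and the accumulated weight. */
struct affinity_group {
    unsigned bv;
    unsigned weight;
};

/* Merge loops with identical bit-vectors. 'out' must have room for n
   entries; the number of groups produced is returned. */
size_t build_affinity_groups(const struct loop_info *loops, size_t n,
                             struct affinity_group *out)
{
    size_t ngroups = 0;
    assert(loops != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t g = 0;
        while (g < ngroups && out[g].bv != loops[i].bv)
            g++;                       /* look for an existing group */
        if (g == ngroups) {            /* none found: start a new one */
            out[g].bv = loops[i].bv;
            out[g].weight = 0;
            ngroups++;
        }
        out[g].weight += loops[i].count;
    }
    return ngroups;
}
```

Fed the five loops from the table, this yields three groups with weights 40 (F3, F1), 14 (F2), and 8 (F4, F3). How a field that appears in more than one group, such as F3, is finally placed is not covered by this sketch.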
13 Implementation Details: Transforming the program
- New type definitions
- Field table update
- Field access statements
- New symbols
- Assignment statements
Example
    struct S {
        // N fields
        struct T *p;
        // M fields
    };

    struct T {
        // AG1 fields
        // AG2 fields
    };

    // peel T

    struct S {
        // N fields
        struct T1 *p1;
        struct T2 *p2;
        // M fields
    };

    struct T1 {
        // AG1 fields
    };

    struct T2 {
        // AG2 fields
    };
14 Implementation Details: Transforming the program (continued)
- Function calls to memory management routines
Example
    p = (T *) malloc (N * sizeof (T));
    if (p == NULL) exit (1);
- Detect memory management routine calls involving transformed type T
- Replicate call, assignment statements
- Update size of memory being allocated
- Handle comparisons involving pointer p
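A possible shape for the rewritten allocation site, after T has been peeled into T1 and T2, is sketched below. The field contents of `T1` and `T2`, the helper name `alloc_peeled`, and the error-return convention are assumptions for illustration; the original example calls exit (1) on failure, and the code Open64 actually generates may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the peeled types; the real field sets come from AG1 and AG2. */
struct T1 { double d1; int i; long long l; };
struct T2 { double d2; float f; char c; };

/* Hypothetical helper mirroring the transformed allocation:
   the malloc call and the assignment are replicated once per new type,
   each size uses the new type, and the NULL comparison on p becomes a
   comparison on every replacement pointer. Returns 0 on success. */
int alloc_peeled(size_t n, struct T1 **p1, struct T2 **p2)
{
    assert(p1 != NULL && p2 != NULL);
    *p1 = malloc(n * sizeof(struct T1));
    *p2 = malloc(n * sizeof(struct T2));
    if (*p1 == NULL || *p2 == NULL) {   /* was: if (p == NULL) */
        free(*p1);
        free(*p2);
        *p1 = NULL;
        *p2 = NULL;
        return -1;
    }
    return 0;
}
```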
15 Performance Results
- Compilation options: -Ofast at 32-bit ABI
- Speedup due to structure layout optimizations
- Platforms: AMD Opteron (2.8GHz, 4GB, 1MB); AMD Barcelona (2.0GHz, 8GB, 512KB); Intel EM64T (3.4GHz, 4GB, 1MB); Intel Core (3.0GHz, 4GB, 4MB); SiCortex MIPS (500MHz, 4GB, 256KB)
Benchmark        Opteron  Barcelona  EM64T  Core   MIPS   Geometric Mean
179.art          134      66         56     47     41     62.5
181.mcf          24       23         23     31     13     22.0
462.libquantum   32       17         40     72     62     39.6
Geometric Mean   46.9     29.6       37.2   47.2   32.1   37.9
16 Performance Results (continued)
- Compilation options: -Ofast at 64-bit ABI
- Speedup due to structure layout optimizations
- Platforms: AMD Opteron (2.8GHz, 4GB, 1MB); AMD Barcelona (2.0GHz, 8GB, 512KB); Intel EM64T (3.4GHz, 4GB, 1MB); Intel Core (3.0GHz, 4GB, 4MB); SiCortex MIPS (500MHz, 4GB, 256KB)
Benchmark        Opteron  Barcelona  EM64T  Core   MIPS   Geometric Mean
179.art          169      66         53     60     45     69.3
181.mcf          25       35         12     30     7      18.6
462.libquantum   82       51         75     70     69     68.6
Geometric Mean   70.2     49.0       36.3   50.1   27.9   44.6
17 Performance Results (continued)
- Compilation options: -Ofast at 64-bit ABI
- Multiple copies of 462.libquantum running on multi-core chip
- Platform: Quad-core AMD Barcelona (2.0 GHz, 8GB, 512KB, 2MB)
  - 3rd level cache shared among 4 cores
- Speedup from structure layout optimizations
Benchmark        1 copy   2 copies   4 copies
462.libquantum   51       69         123
18 Future Work
- Tune static profile estimation
- Fewer restrictions
- Integrate with field-reordering
19 Conclusion
- A framework for performing structure layout transformations is now available in the Open64 compiler.
- The superior infrastructure in the Open64 compiler helped us implement the optimizations cleanly and with relatively little effort.
- Substantial speedups are possible on some of the SPEC CPU2000 and CPU2006 benchmarks.
- Structure layout optimization is a required feature for a compiler to remain competitive.