Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements

1
Structure Layout Optimizations in the Open64 Compiler:
Design, Implementation and Measurements

Gautam Chakrabarti
and
Fred Chow
PathScale, LLC.

2
Outline

Motivation
Types of structure layout optimizations
Criteria for structure layout optimizations
Implementation details
Performance results
Future work
Conclusion

3
Motivation

Poor data locality in many applications
High data cache miss rates
Growing gap between processor and memory speeds

Our Aim
Make applications more cache-friendly

Our Approach
Change layout of data structures
Requires whole-program optimization
Use Inter-Procedural Analysis and Optimizations (IPA)

4
IPA

Summarization
Analysis
Optimization

5
Types of Structure Layout Optimizations

Structure splitting

struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
    struct struct_A *next;
};

Structure peeling

struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
};

6
Structure Splitting Example

Original structure:

struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
    struct struct_A *next;
};

After splitting:

struct new_struct_A {
    double d1;
    int i;
    long long l;
    struct new_struct_A *next;
    struct cold_sub_struct_A *p;
};

struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};

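The split layout above can be exercised with a small C sketch. The type and field names follow the slide; the helpers node_new, sum_hot, list_free and split_demo are hypothetical names introduced only for illustration and are not part of Open64. The sketch assumes the hot loop reads only d1, i and l, so it never follows the pointer p to the cold part.

```c
#include <assert.h>
#include <stdlib.h>

/* Cold fields moved out of the hot structure (names follow the slide). */
struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};

/* Hot fields stay together; p links each node to its cold part. */
struct new_struct_A {
    double d1;
    int i;
    long long l;
    struct new_struct_A *next;
    struct cold_sub_struct_A *p;
};

/* Hypothetical helper: allocate one node together with its cold part. */
struct new_struct_A *node_new(double d1, int i, long long l,
                              struct new_struct_A *next)
{
    struct new_struct_A *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->p = calloc(1, sizeof *n->p);
    if (n->p == NULL) {
        free(n);
        return NULL;
    }
    n->d1 = d1;
    n->i = i;
    n->l = l;
    n->next = next;
    return n;
}

/* Hot loop: reads only hot fields and never dereferences p. */
double sum_hot(const struct new_struct_A *head)
{
    double sum = 0.0;
    for (const struct new_struct_A *n = head; n != NULL; n = n->next)
        sum += n->d1 * n->i + (double)n->l;
    return sum;
}

/* Release every node and its cold part. */
void list_free(struct new_struct_A *head)
{
    while (head != NULL) {
        struct new_struct_A *next = head->next;
        free(head->p);
        free(head);
        head = next;
    }
}

/* Builds a two-node list and sums its hot fields. */
double split_demo(void)
{
    struct new_struct_A *a = node_new(1.5, 2, 3, NULL);
    struct new_struct_A *b = node_new(0.5, 4, 1, a);
    assert(a != NULL && b != NULL);
    double sum = sum_hot(b);
    list_free(b);
    return sum;
}
```

Because the cold fields sit behind a pointer, each list node that the traversal touches is smaller than the original struct_A node, which is the intended cache benefit.
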
7
Structure Peeling Example

Original structure:

struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
};

After peeling:

struct new_struct_A {
    double d1;
    int i;
    long long l;
};

struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};

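A minimal C sketch of what peeling means for an array of these structures, assuming the program originally used one array of struct_A. After peeling, the hot and cold parts live in two separate arrays that share the same index, with no linking pointer. The array length N_ELEMS and the functions sum_hot_fields and peel_demo are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Original layout from the slide, kept only for the size comparison. */
struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
};

/* Peeled layout (names follow the slide). */
struct new_struct_A {
    double d1;
    int i;
    long long l;
};

struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};

enum { N_ELEMS = 4 };   /* hypothetical array length */

/* One array of struct_A becomes two arrays indexed the same way. */
static struct new_struct_A hot[N_ELEMS];
static struct cold_sub_struct_A cold[N_ELEMS];

/* Hot loop: walks only the hot array. */
long long sum_hot_fields(const struct new_struct_A *h, size_t n)
{
    long long sum = 0;
    for (size_t k = 0; k < n; k++)
        sum += h[k].i + h[k].l;
    return sum;
}

/* Fills both arrays, then sums the hot fields i and l. */
long long peel_demo(void)
{
    /* The hot element is expected to be smaller than the original one. */
    assert(sizeof(struct new_struct_A) < sizeof(struct struct_A));
    for (int k = 0; k < N_ELEMS; k++) {
        hot[k].d1 = 0.0;
        hot[k].i = k;
        hot[k].l = 10LL * k;
        cold[k].d2 = 0.0;
        cold[k].f = 0.0f;
        cold[k].c = 'x';
    }
    return sum_hot_fields(hot, N_ELEMS);
}
```

Unlike splitting, no pointer connects the two parts, which is why the peeling example on the slide starts from a structure without a next field.
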
8
Criteria for structure layout optimizations

Legality Analysis
Type cast
Address of a field is taken
Escaped types
Parameter types
Full visibility to IPA
Alignment restrictions

Profitability Analysis
Hotness
Affinity
Field accesses at loop level
Size
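
Two of the legality criteria can be illustrated with ordinary C code. The sketch below is one plausible reading of why a type cast or a field address would block the transformation; the struct rec and the functions copy_raw, addr_of_d2 and legality_demo are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <string.h>

struct rec {
    double d1;
    double d2;
    int i;
};

/* Type cast: the object is viewed as raw bytes, so the code depends on
   the original field layout. */
void copy_raw(struct rec *dst, const struct rec *src)
{
    memcpy((char *)dst, (const char *)src, sizeof *src);
}

/* Address of a field is taken: the pointer can later be used in ways
   that are hard to trace back to the structure. */
double *addr_of_d2(struct rec *r)
{
    return &r->d2;
}

/* Uses both constructs; returns 1 when the values match. */
int legality_demo(void)
{
    struct rec a = { 1.0, 2.0, 3 };
    struct rec b;
    copy_raw(&b, &a);
    assert(b.i == 3);
    *addr_of_d2(&b) = 5.0;
    return b.d2 == 5.0 && b.d1 == 1.0;
}
```

Code like this stays correct only if struct rec keeps its layout, so a type used this way would presumably be excluded from splitting or peeling.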

9
Implementation Details

Step 1: Type information summarization (IPL)
Step 2: Symbol table merging (IPA)
Step 3: Legality and profitability analysis (IPA analysis)
Step 4: Transforming the program (IPA optimization)

10
Implementation Details: Type information summarization

Information summarization in IPL
Framework for computing static profiles using heuristics
New TY flag: TY_NO_SPLIT
SUMMARY_TY_INFO
SUMMARY_LOOP
For each DO_LOOP, WHILE_DO, DO_WHILE
Bit-vector to track field accesses of up to N structures for each loop
Considers field accesses immediately inside loop
These fields are considered affine to each other
Execution count of statements immediately inside loop
From statically estimated profiles or from runtime feedback

11
Implementation Details: IPA Analysis

Inter-procedurally update statically estimated execution count of PUs
Update statically estimated loop frequencies in SUMMARY_LOOP
Consider SUMMARY_LOOP from the hottest P PUs
Determine candidates for structure-layout transformation
Determine new layout of structures

12
Implementation Details: IPA Analysis Example

Loop   F4   F3   F2   F1   BV
L1     -    22   -    22   0101
L2     -    -    14   -    0010
L3     -    12   -    12   0101
L4     8    8    -    -    1100
L5     -    6    -    6    0101

Group  F4   F3   F2   F1
AG1    -    40   -    40
AG2    -    -    14   -
AG3    8    8    -    -

Li: Loops
Fj: Fields in a struct
AGk: Affinity groups
BV: Bit-vector of fields accessed in the loop

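The numbers in this example can be reproduced with a short C sketch. It assumes that loops with identical bit-vectors are merged into one affinity group and that their execution counts are added; the slides do not spell out the exact rule, so this is only one reading that matches the table. The types loop_summary and affinity_group and the functions build_groups and group_weight are hypothetical and are not Open64 data structures.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_GROUPS = 8 };   /* hypothetical capacity */

/* Hypothetical stand-in for a per-loop summary record. */
struct loop_summary {
    unsigned fields;       /* bit j-1 set: field Fj is accessed in the loop */
    unsigned long count;   /* execution count of the loop body */
};

struct affinity_group {
    unsigned fields;       /* fields that belong to the group */
    unsigned long weight;  /* summed execution counts */
};

/* Merge loops with identical bit-vectors; returns the number of groups. */
size_t build_groups(const struct loop_summary *loops, size_t n_loops,
                    struct affinity_group *groups, size_t max_groups)
{
    size_t n_groups = 0;
    for (size_t i = 0; i < n_loops; i++) {
        size_t g = 0;
        while (g < n_groups && groups[g].fields != loops[i].fields)
            g++;
        if (g == n_groups) {
            if (n_groups == max_groups)
                continue;              /* out of room: skip this loop */
            groups[g].fields = loops[i].fields;
            groups[g].weight = 0;
            n_groups++;
        }
        groups[g].weight += loops[i].count;
    }
    return n_groups;
}

/* Loops L1..L5 from the slide; 0x5 is bit-vector 0101, 0xC is 1100. */
static const struct loop_summary slide_loops[] = {
    { 0x5, 22 }, { 0x2, 14 }, { 0x5, 12 }, { 0xC, 8 }, { 0x5, 6 },
};

/* Weight of the group with exactly this field set, or 0 if none. */
unsigned long group_weight(unsigned fields)
{
    struct affinity_group groups[MAX_GROUPS];
    size_t n = build_groups(slide_loops,
                            sizeof slide_loops / sizeof slide_loops[0],
                            groups, MAX_GROUPS);
    assert(n <= MAX_GROUPS);
    for (size_t g = 0; g < n; g++)
        if (groups[g].fields == fields)
            return groups[g].weight;
    return 0;
}
```

With the slide's data, loops L1, L3 and L5 share bit-vector 0101, so their counts 22, 12 and 6 add up to the 40 shown for AG1.
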
13
Implementation Details: Transforming the program

New type definitions
Field table update
Field access statements
New symbols
Assignment statements

Example

struct S {
    // N fields
    struct T *p;
    // M fields
};

struct T {
    // AG1 fields
    // AG2 fields
};

// peel T

struct S {
    // N fields
    struct T1 *p1;
    struct T2 *p2;
    // M fields
};

struct T1 {
    // AG1 fields
};

struct T2 {
    // AG2 fields
};

14
Implementation Details: Transforming the program (continued)

Function calls to memory management routines

Example
p = (T *) malloc (N * sizeof (T));
if (p == NULL) exit (1);

Detect memory management routine calls involving
transformed type T
Replicate call, assignment statements
Update size of memory being allocated
Handle comparisons involving pointer p
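
The steps listed above can be sketched in C for a type T peeled into T1 and T2. The field contents of T1 and T2 and the functions alloc_peeled, free_peeled and alloc_demo are hypothetical. The sketch assumes the NULL comparison on p becomes a test on both new pointers, and it returns an error code where the slide's example calls exit (1).

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double d1; int i; } T1;    /* hypothetical AG1 fields */
typedef struct { double d2; char c; } T2;   /* hypothetical AG2 fields */

/* Original code (from the slide):
 *     p = (T *) malloc (N * sizeof (T));
 *     if (p == NULL) exit (1);
 * After peeling, the call and the assignment are replicated per new type,
 * each with its own element size. Returns 0 on success, -1 on failure. */
int alloc_peeled(size_t n, T1 **p1, T2 **p2)
{
    *p1 = (T1 *) malloc(n * sizeof(T1));
    *p2 = (T2 *) malloc(n * sizeof(T2));
    /* The comparison on p is assumed to cover both new pointers. */
    if (*p1 == NULL || *p2 == NULL) {
        free(*p1);
        free(*p2);
        *p1 = NULL;
        *p2 = NULL;
        return -1;
    }
    return 0;
}

/* The matching free call is replicated in the same way. */
void free_peeled(T1 *p1, T2 *p2)
{
    free(p1);
    free(p2);
}

/* Allocates 16 elements, touches the last one in each array. */
int alloc_demo(void)
{
    T1 *p1;
    T2 *p2;
    if (alloc_peeled(16, &p1, &p2) != 0)
        return -1;
    assert(p1 != NULL && p2 != NULL);
    p1[15].i = 7;
    p2[15].c = 'z';
    int ok = (p1[15].i == 7 && p2[15].c == 'z');
    free_peeled(p1, p2);
    return ok;
}
```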

15
Performance Results

Compilation options: -Ofast at 32-bit ABI
Speedup due to structure layout optimizations

Platform                             179.art   181.mcf   462.libquantum   Geometric Mean
AMD Opteron (2.8GHz, 4GB, 1MB)       134       24        32               46.9
AMD Barcelona (2.0GHz, 8GB, 512KB)   66        23        17               29.6
Intel EM64T (3.4GHz, 4GB, 1MB)       56        23        40               37.2
Intel Core (3.0 GHz, 4GB, 4MB)       47        31        72               47.2
SiCortex MIPS (500MHz, 4GB, 256KB)   41        13        62               32.1
Geometric Mean                       62.5      22.0      39.6             37.9

16
Performance Results (continued)

Compilation options: -Ofast at 64-bit ABI
Speedup due to structure layout optimizations

Platform                             179.art   181.mcf   462.libquantum   Geometric Mean
AMD Opteron (2.8GHz, 4GB, 1MB)       169       25        82               70.2
AMD Barcelona (2.0GHz, 8GB, 512KB)   66        35        51               49.0
Intel EM64T (3.4GHz, 4GB, 1MB)       53        12        75               36.3
Intel Core (3.0 GHz, 4GB, 4MB)       60        30        70               50.1
SiCortex MIPS (500MHz, 4GB, 256KB)   45        7         69               27.9
Geometric Mean                       69.3      18.6      68.6             44.6

17
Performance Results (continued)

Compilation options: -Ofast at 64-bit ABI
Multiple copies of 462.libquantum running on multi-core chip
Platform: Quad-core AMD Barcelona (2.0 GHz, 8GB, 512KB, 2MB)
3rd level cache shared among 4 cores
Speedup from structure layout optimizations

Benchmark        1 copy   2 copies   4 copies
462.libquantum   51       69         123

18
Future Work

Tune static profile estimation
Fewer restrictions
Integrate with field-reordering

19
Conclusion

A framework for performing structure layout transformations is now available in the Open64 compiler.
The superior infrastructure in the Open64 compiler helped us implement the optimizations cleanly and with relatively little effort.
Substantial speedups are possible on some of the CPU2000 and CPU2006 SPEC benchmarks.
Structure layout optimization is a required feature for a compiler to remain competitive.
