Title: Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
1. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
- Martin C. Rinard
- Pedro C. Diniz
- University of California, Santa Barbara
- Santa Barbara, California 93106
- {martin,pedro}@cs.ucsb.edu
- http://www.cs.ucsb.edu/{martin,pedro}
2. Goal
- Develop a Parallelizing Compiler for Object-Oriented Computations
- Current Focus
  - Irregular Computations
  - Dynamic Data Structures
- Future
  - Persistent Data
  - Distributed Computations
- New Analysis Technique: Commutativity Analysis
3. Structure of Talk
- Model of Computation
- Example
- Commutativity Testing
- Steps To Practicality
- Experimental Results
- Conclusion
4. Model of Computation
[Figure: an operation, invoked on an object in its initial object state, executes to produce a new object state and a set of invoked operations]
5. Graph Traversal Example
  class graph {
    int val, sum;
    graph *left, *right;
  };

  void graph::traverse(int v) {
    sum += v;
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }
Goal: Execute left and right traverse operations in parallel
6. Parallel Traversal
7. Commuting Operations in Parallel Traversal
8. Model of Computation
- Operations: Method Invocations
  - In Example: Invocations of graph::traverse
    - left->traverse(3)
    - right->traverse(2)
- Objects: Instances of Classes
  - In Example: Graph Nodes
- Instance Variables Implement Object State
  - In Example: val, sum, left, right
10. Separable Operations
- Each Operation Consists of Two Sections
  - Object Section: Only Accesses Receiver Object
  - Invocation Section: Only Invokes Operations
  - Both Sections Can Access Parameters
11. Basic Approach
- Compiler Chooses a Computation to Parallelize
  - In Example: Entire graph::traverse Computation
- Compiler Computes Extent of the Computation
  - Representation of All Operations in Computation
  - Current Representation: Set of Methods
  - In Example: graph::traverse
- Do All Pairs of Operations in Extent Commute?
  - No: Generate Serial Code
  - Yes: Generate Parallel Code
  - In Example: All Pairs Commute
12. Code Generation: For Each Method in Parallel Computation
- Augments Class Declaration with Mutual Exclusion Lock
- Generates Driver Version of Method
  - Invoked from Serial Code to Start Parallel Execution
  - Invokes Parallel Version of Operation
  - Waits for Entire Parallel Computation to Finish
- Generates Parallel Version of Method
  - Object Section
    - Lock Acquired at Beginning
    - Lock Released at End
    - Ensures Atomic Execution
  - Invocation Section
    - Invoked Operations Execute in Parallel
    - Invokes Parallel Version
13. Code Generation in Example: Driver Version
Class Declaration:
  class graph {
    lock mutex;
    int val, sum;
    graph *left, *right;
  };
Driver Version:
  void graph::traverse(int v) {
    parallel_traverse(v);
    wait();
  }
14. Parallel Version in Example
  void graph::parallel_traverse(int v) {
    mutex.acquire();
    sum += v;
    mutex.release();
    if (left != NULL) spawn(left->parallel_traverse(val));
    if (right != NULL) spawn(right->parallel_traverse(val));
  }
15. Compiler Structure
[Flow: Computation Selection (entire computation of each method) -> Extent Computation (traverse call graph to extract extent) -> Commutativity Testing (all pairs of operations in extent) -> if all operations commute, Generate Parallel Code; if operations may not commute, Generate Serial Code]
16. Traditional Approach
- Data Dependence Analysis
- Analyzes Reads and Writes
- Independent Pieces of Code Execute in Parallel
- Demonstrated Success for Array-Based Programs
17. Data Dependence Analysis in Example
- For Data Dependence Analysis to Succeed in Example
  - left and right traverse Must Be Independent
  - left and right Subgraphs Must Be Disjoint
  - Graph Must Be a Tree
- Depends on Global Topology of Data Structure
  - Analyze Code that Builds Data Structure
  - Extract and Propagate Topology Information
- Fails for Graphs
18. Properties of Commutativity Analysis
- Oblivious to Data Structure Topology
- Local Analysis
- Simple Analysis
- Wide Range of Computations
- Lists, Trees and Graphs
- Updates to Central Data Structure
- General Reductions
- Introduces Synchronization
- Relies on Commuting Operations
20. Commutativity Testing Conditions
- Do Two Operations A and B Commute?
- Compiler Considers Two Execution Orders
  - A;B: A executes before B
  - B;A: B executes before A
- Compiler Must Check Two Conditions
  - Instance Variables: New values of instance variables are the same in both execution orders
  - Invoked Operations: A and B together directly invoke the same set of operations in both execution orders
21. Commutativity Testing Conditions [Figure]
22. Commutativity Testing Algorithm
- Symbolic Execution
  - Compiler Executes Operations
  - Computes with Expressions, not Values
- Compiler Symbolically Executes Operations in Both Execution Orders
  - Expressions for New Values of Instance Variables
  - Expressions for Multiset of Invoked Operations
23. Expression Simplification and Comparison
- Compiler Applies Rewrite Rules to Simplify Expressions
  - a*(b+c) -> (a*b)+(a*c)
  - b+(a+c) -> (a+b+c)
  - a+if(b<c,d,e) -> if(b<c,a+d,a+e)
- Compiler Compares Corresponding Expressions
  - If All Equal: Operations Commute
  - If Not All Equal: Operations May Not Commute
24. Commutativity Testing Example
- Two Operations: r->traverse(v1) and r->traverse(v2)
- In Order r->traverse(v1); r->traverse(v2)
  - Instance Variables: New sum = (sum+v1)+v2
  - Invoked Operations: if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val)), if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val))
- In Order r->traverse(v2); r->traverse(v1)
  - Instance Variables: New sum = (sum+v2)+v1
  - Invoked Operations: if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val)), if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val))
25. Important Special Case
- Independent Operations Commute
26. Important Special Case
- Independent Operations Commute
- Conditions for Independence
  - Operations Have Different Receivers
  - Neither Operation Writes an Instance Variable that the Other Operation Accesses
- Detecting Independent Operations
  - In Type-Safe Languages
    - Class Declarations
    - Instance Variable Accesses
  - Pointer or Alias Analysis
27. Analysis in Current Compiler
- Dependence Analysis
  - Operations on Objects of Different Classes
  - Independent Operations on Objects of Same Class
- Symbolic Commutativity Testing
  - Dependent Operations on Objects of Same Class
- Future
  - Integrate Pointer or Alias Analysis
  - Integrate Array Data Dependence Analysis
29. Programming Model Extensions
- Extensions for Read-Only Data
  - Allow Operations to Freely Access Read-Only Data
  - Enhances Ability of Compiler to Represent Expressions
  - Increases Set of Programs that Compiler Can Analyze
- Analysis Granularity Extensions
  - Integrate Operations into Callers for Analysis Purposes
  - Coarsens Commutativity Testing Granularity
  - Reduces Number of Pairs Tested for Commutativity
  - Enhances Effectiveness of Commutativity Testing
30. Optimizations
- Synchronization Optimizations
  - Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data
  - Reduce Number of Acquire and Release Constructs
- Parallel Loop Optimization
- Suppress Exploitation of Excess Concurrency
31. Extent Constants
- Motivation: Allow Parallel Operations to Freely Access Read-Only Data
- Extent Constant Variable: Global variable or instance variable written by no operation in extent
- Extent Constant Expression: Expression whose value depends only on extent constant variables or parameters
- Extent Constant Value: Value computed by extent constant expression
- Extent Constant: Automatically generated opaque constant used to represent an extent constant value
- Requires
  - Interprocedural Data Usage Analysis
    - Result Summarizes How Operations Access Instance Variables
  - Interprocedural Pointer Analysis for Reference Parameters
32. Extent Constant Variables in Example
  void graph::traverse(int v) {
    sum += v;
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }
(val, left, and right are extent constant variables: no operation in the extent writes them)
33. Advantages of Extent Constants
- Extent Constants Extend Programming Model
  - Enable Direct Global Variable Access
  - Enable Direct Access of Objects other than Receiver
- Extent Constants Make Compiler More Effective
  - Enable Compact Representations of Large Expressions
  - Enable Compiler to Represent Values Computed by Otherwise Unanalyzable Constructs
34. Auxiliary Operations
- Motivation: Coarsen Granularity of Commutativity Testing
- An Operation is an Auxiliary Operation if its Entire Computation
  - Only Computes Extent Constant Values
  - Only Externally Visible Writes are to Local Variables of Caller
- Auxiliary Operations are Conceptually Part of Caller
  - Analysis Integrates Auxiliary Operations into Caller
  - Represents Computed Values using Extent Constants
- Requires
  - Interprocedural Data Usage Analysis
  - Interprocedural Pointer Analysis for Reference Parameters
  - Intraprocedural Reaching Definition Analysis
35. Auxiliary Operation Example
  int graph::square_and_add(int v) {
    return (val*val + v);
  }

  void graph::traverse(int v) {
    sum += square_and_add(v);
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }
(val is an extent constant variable and v is a parameter, so val*val + v is an extent constant expression)
36. Advantages of Auxiliary Operations
- Coarsen Granularity of Commutativity Testing
  - Reduces Number of Pairs Tested for Commutativity
  - Enhances Effectiveness of Commutativity Testing Algorithm
- Support Modular Programming
37. Synchronization Optimizations
- Goal: Eliminate or Reduce Synchronization Overhead
- Synchronization Elimination
  - If an Operation Only Computes Extent Constant Values, Then the Compiler Does Not Generate Lock Acquire and Release
- Lock Coarsening
  - Data: Use One Lock for Multiple Objects
  - Computation: Generate One Lock Acquire and Release for Multiple Operations on the Same Object
38. Data Lock Coarsening Example
Original Code:
  class vector {
    lock mutex;
    double val[NDIM];
    void vector::add(double *v) {
      mutex.acquire();
      for (int i = 0; i < NDIM; i++) val[i] += v[i];
      mutex.release();
    }
  };
  class body {
    lock mutex;
    double phi;
    vector acc;
    void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      mutex.release();
      acc.add(v);
    }
  };
Optimized Code:
  class vector {
    double val[NDIM];
    void vector::add(double *v) {
      for (int i = 0; i < NDIM; i++) val[i] += v[i];
    }
  };
  class body {
    lock mutex;
    double phi;
    vector acc;
    void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
      mutex.release();
    }
  };
39. Computation Lock Coarsening Example
Original Code:
  class body {
    lock mutex;
    double phi;
    vector acc;
    void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
      mutex.release();
    }
    void body::loopsub(body *b) {
      int i;
      for (i = 0; i < N; i++) {
        this->gravsub(b+i);
      }
    }
  };
Optimized Code:
  class body {
    lock mutex;
    double phi;
    vector acc;
    void body::gravsub(body *b) {
      double p, v[NDIM];
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
    }
    void body::loopsub(body *b) {
      int i;
      mutex.acquire();
      for (i = 0; i < N; i++) {
        this->gravsub(b+i);
      }
      mutex.release();
    }
  };
40. Parallel Loops
- Goal: Generate Efficient Code for Parallel Loops
- If a Loop is in the Following Form
    for (i = exp1; i < exp2; i += exp3) {
      exp4->op(exp5, exp6, ...);
    }
  Where exp1, exp2, ... are Extent Constant Expressions
- Then Compiler Generates Parallel Loop Code
41. Parallel Loop Optimization
- Without Parallel Loop Optimization
  - Each Loop Iteration Generates a Task
  - Tasks are Created and Scheduled Sequentially
  - Each Iteration Incurs Task Creation and Scheduling Overhead
- With Parallel Loop Optimization
  - Generated Code Immediately Exposes All Iterations
  - Scheduler Operates on Chunks of Loop Iterations
  - Each Chunk of Iterations Incurs Scheduling Overhead
- Advantages
  - Enables Compact Representation for Loop Computation
  - Reduces Task Creation and Scheduling Overhead
  - Parallelizes Overhead
42. Suppressing Excess Concurrency
- Goal: Reduce Overhead of Exploiting Parallelism
- Goal Achieved by Generating Computations that
  - Execute Operations Serially with No Parallelization Overhead
  - Use Synchronization Required to Execute Safely in Parallel Context
- Mechanism: Mutex Versions of Methods
  - Object Section
    - Acquires Lock at Beginning
    - Releases Lock at End
  - Invocation Section
    - Operations Execute Serially
    - Invokes Mutex Version
- Current Policy
  - Each Parallel Loop Iteration Invokes Mutex Version of Operation
  - Suppresses Parallel Execution Within Iterations of Parallel Loops
44. Methodology
- Built Prototype Compiler
- Built Run Time System
  - Concurrency Generation and Task Management
  - Dynamic Load Balancing
  - Synchronization
- Acquired Two Complete Applications
  - Barnes-Hut N-Body Solver
  - Water Code
- Automatically Parallelized Applications
- Ran Applications on Stanford DASH Machine
- Compared Performance with Highly Tuned, Explicitly Parallel Versions from SPLASH-2 Benchmark Suite
45. Prototype Compiler
- Clean Subset of C++
- Sage++ is Front End
- Structured as a Source-To-Source Translator
  - Analysis Finds Parallel Loops and Methods
  - Compiler Generates Annotation File
    - Identifies Parallel Loops and Methods
    - Classes to Augment with Locks
  - Code Generator Reads Annotation File
    - Generates Parallel Versions of Methods
    - Inserts Synchronization and Parallelization Code
- Parallelizes Unannotated Programs
46. Major Restrictions
- Motivation: Simplify Implementation of Prototype
- No Virtual Methods
- No Operator or Method Overloading
- No Multiple Inheritance or Templates
- No typedef, struct, union or enum Types
- Global Variables must be Class Types
- No Static Members or Pointers to Members
- No Default Arguments or Variable Numbers of Arguments
- No Operation Accesses a Variable Declared in a Class from which its Receiver Class Inherits
47. Run Time Library
- Motivation: Provide Basic Concurrency Management
- Single Program, Multiple Data Execution Model
- Single Address Space
- Alternate Serial and Parallel Phases
- Library Provides
  - Task Creation and Synchronization Primitives
  - Dynamic Load Balancing
- Implemented On
  - Stanford DASH Shared-Memory Multiprocessor
  - SGI Shared-Memory Multiprocessors
48. Applications
- Barnes-Hut
  - O(N lg N) N-Body Solver
  - Space Subdivision Tree
  - 1500 Lines of C++ Code
- Water
  - Simulates Liquid Water
  - O(N^2) Algorithm
  - 1850 Lines of C++ Code
49. Obtaining Serial C++ Version of Barnes-Hut
- Started with Explicitly Parallel Version (SPLASH-2)
- Removed Parallel Constructs to get Serial C
- Converted to Clean Object-Based C++
- Major Structural Changes
  - Eliminated Scheduling Code and Data Structures
  - Split a Loop in Force Computation Phase
  - Introduced New Field into Particle Data Structure
50. Obtaining Serial C++ Version of Water
- Started with Serial C translated from FORTRAN
- Converted to Clean Object-Based C++
- Major Structural Change
  - Auxiliary Objects for O(N^2) Phases
51. Commutativity Statistics for Barnes-Hut
[Bar chart: pairs per parallel extent (Position: 3 Methods; Force: 6 Methods; Velocity: 3 Methods), broken down into Symbolically Executed Pairs, Independent Pairs, and Pairs Tested for Commutativity; y-axis 0-20 pairs]
52. Auxiliary Operation Statistics for Barnes-Hut
[Bar chart: Auxiliary Operation Call Sites per parallel extent (Position: 3 Methods; Force: 6 Methods; Velocity: 3 Methods); y-axis 0-15 call sites]
53. Performance Results for Barnes-Hut
54. Performance Analysis
- Motivation: Understand Behavior of Parallelized Program
- Instrumented Code to Measure Execution Time Breakdowns
  - Parallel Idle: Time Spent Idle in a Parallel Section
  - Serial Idle: Time Spent Idle in a Serial Section
  - Blocked: Time Spent Waiting to Acquire a Lock Held by Another Processor
  - Parallel Compute: Time Spent Doing Useful Work in a Parallel Section
  - Serial Compute: Time Spent Doing Useful Work in a Serial Section
55. Performance Analysis for Barnes-Hut
[Two stacked-bar charts: Cumulative Total Time (seconds) vs Number of Processors (1, 2, 4, 8, 16, 24, 32). Barnes-Hut on DASH, Data Set: 8K Particles (y-axis to 120 seconds) and Data Set: 16K Particles (y-axis to 300 seconds)]
56. Performance Results for Water
[Two speedup graphs: Speedup vs Number of Processors (0-32). Water on DASH, Data Set: 343 Molecules and Data Set: 512 Molecules]
57. Performance Results for Computation Replication Version of Water
[Two speedup graphs: Speedup vs Number of Processors. Water on DASH, Data Set: 343 Molecules and Data Set: 512 Molecules]
58. Commutativity Statistics for Water
[Bar chart: pairs per parallel extent (Virtual: 3 Methods; Forces: 2 Methods; Loading: 4 Methods; Momenta: 2 Methods; Energy: 5 Methods), broken down into Symbolically Executed Pairs, Independent Pairs, and Pairs Tested for Commutativity; y-axis 0-15 pairs]
59. Auxiliary Operation Statistics for Water
[Bar chart: Auxiliary Operation Call Sites per parallel extent (Virtual: 3 Methods; Forces: 2 Methods; Loading: 4 Methods; Momenta: 2 Methods; Energy: 5 Methods); y-axis 0-15 call sites]
60. Performance Analysis for Water
[Two stacked-bar charts: Cumulative Total Time (seconds) vs Number of Processors (1, 2, 4, 8, 16, 24, 32). Water on DASH, Data Set: 343 Molecules (y-axis to 600 seconds) and Data Set: 512 Molecules (y-axis to 1400 seconds)]
61. Future Work
- Relative Commutativity
- Integrate Other Analysis Frameworks
- Pointer or Alias Analysis
- Array Data Dependence Analysis
- Analysis Problems
- Synchronization Optimizations
- Analysis Granularity Optimizations
- Generation of Self-Tuning Code
- Message Passing Implementation
62. Related Work
- Bernstein (IEEE Transactions on Computers 1966)
- Reduction Analysis
  - Ghuloum and Fisher (PPOPP 95)
  - Pinter and Pinter (POPL 92)
  - Callahan (LCPC 91)
- Commuting Operations in Parallel Languages
  - Rinard and Lam (PPOPP 91)
  - Steele (POPL 90)
  - Barth, Nikhil and Arvind (FPCA 91)
- Dependence Analysis for Pointer-Based Data Structures
  - Landi, Ryder and Zhang (PLDI 93)
  - Hendren, Hummel and Nicolau (PLDI 92)
  - Plevyak, Karamcheti and Chien (LCPC 93)
  - Chase, Wegman and Zadeck (PLDI 90)
  - Larus and Hilfinger (PLDI 88)
  - Ghiya and Hendren (POPL 96)
  - Ruf (PLDI 95)
  - Wilson and Lam (PLDI 95)
  - Deutsch (PLDI 94)
  - Choi, Burke and Carini (POPL 93)
64. Conclusion
- Commutativity Analysis
  - New Analysis Framework for Parallelizing Compilers
- Basic Idea
  - Recognize Commuting Operations
  - Generate Parallel Code
- Current Focus
  - Dynamic, Pointer-Based Data Structures
  - Good Initial Results
- Future
  - Persistent Data
  - Distributed Computations
65. Latest Version of Paper
- http://www.cs.ucsb.edu/martin/paper/pldi96.ps
66. What if Operations Do Not Commute?
- Parallel Tree Traversal
- Example: Distance of Node from Root
  class tree {
    int distance;
    tree *left;
    tree *right;
  };
  void tree::set_distance(int d) {
    distance = d;
    if (left != NULL) left->set_distance(d+1);
    if (right != NULL) right->set_distance(d+1);
  }
67. Equivalent Computation with Commuting Operations
  void tree::zero_distance() {
    distance = 0;
    if (left != NULL) left->zero_distance();
    if (right != NULL) right->zero_distance();
  }
  void tree::sum_distance(int d) {
    distance = distance + d;
    if (left != NULL) left->sum_distance(d+1);
    if (right != NULL) right->sum_distance(d+1);
  }
  void tree::set_distance(int d) {
    zero_distance();
    sum_distance(d);
  }
68. Theoretical Result
- For Any Tree Traversal on Data With
  - A Commutative Operator (for example, +) that has
  - A Zero Element (for example, 0)
- There Exists a Program P such that
  - P Computes the Traversal
  - Commutativity Analysis Can Automatically Parallelize P
- Complexity Results
  - Program P is Asymptotically Optimal if the Data Structure is a Perfectly Balanced Tree
  - Program P has Complexity O(N^2) if the Data Structure is a Linked List
69. Pure Object-Based Model of Computation
- Goal
  - Obtain a Powerful, Clean Model of Computation
  - Enable Compiler to Analyze Program
- Objects: Instances of Classes
  - Implement State with Instance Variables
    - Primitive Types from Underlying Language (int, ...)
    - References to Other Objects
    - Nested Objects
- Operations: Invocations of Methods
  - Each Operation Has a Single Receiver Object