Title: Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
1. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
- Martin C. Rinard
- Pedro C. Diniz
- University of California, Santa Barbara
- Santa Barbara, California 93106
- {martin,pedro}@cs.ucsb.edu
- http://www.cs.ucsb.edu/{martin,pedro}
2. Goal
- Develop a Parallelizing Compiler for Object-Oriented Computations
- Current Focus
  - Irregular Computations
  - Dynamic Data Structures
- Future
  - Persistent Data
  - Distributed Computations
- New Analysis Technique: Commutativity Analysis
3. Structure of Talk
- Model of Computation
- Example
- Commutativity Testing
- Steps To Practicality
- Experimental Results
- Conclusion
4. Model of Computation
[Figure: an operation, invoked on an object in its initial object state, executes to produce a new object state and a set of invoked operations]
5. Graph Traversal Example
  class graph {
    int val, sum;
    graph *left, *right;
  };

  void graph::traverse(int v) {
    sum += v;
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }
Goal: Execute left and right traverse operations in parallel
6. Parallel Traversal
7. Commuting Operations in Parallel Traversal
8. Model of Computation
- Operations: Method Invocations
  - In Example: Invocations of graph::traverse
    - left->traverse(3)
    - right->traverse(2)
- Objects: Instances of Classes
  - In Example: Graph Nodes
- Instance Variables Implement Object State
  - In Example: val, sum, left, right
10. Separable Operations
- Each Operation Consists of Two Sections
  - Object Section: Only Accesses Receiver Object
  - Invocation Section: Only Invokes Operations
  - Both Sections Can Access Parameters
11. Basic Approach
- Compiler Chooses a Computation to Parallelize
  - In Example: Entire graph::traverse Computation
- Compiler Computes Extent of the Computation
  - Representation of All Operations in Computation
  - Current Representation: Set of Methods
  - In Example: graph::traverse
- Do All Pairs of Operations in Extent Commute?
  - No: Generate Serial Code
  - Yes: Generate Parallel Code
  - In Example: All Pairs Commute
12. Code Generation: For Each Method in Parallel Computation
- Augments Class Declaration with Mutual Exclusion Lock
- Generates Driver Version of Method
  - Invoked from Serial Code to Start Parallel Execution
  - Invokes Parallel Version of Operation
  - Waits for Entire Parallel Computation to Finish
- Generates Parallel Version of Method
  - Object Section
    - Lock Acquired at Beginning
    - Lock Released at End
    - Ensures Atomic Execution
  - Invocation Section
    - Invoked Operations Execute in Parallel
    - Invokes Parallel Version
13. Code Generation in Example: Driver Version
Class Declaration:
  class graph {
    lock mutex;
    int val, sum;
    graph *left, *right;
  };
Driver Version:
  void graph::traverse(int v) {
    parallel_traverse(v);
    wait();
  }
14. Parallel Version in Example
  void graph::parallel_traverse(int v) {
    mutex.acquire();
    sum += v;
    mutex.release();
    if (left != NULL) spawn(left->parallel_traverse(val));
    if (right != NULL) spawn(right->parallel_traverse(val));
  }
15. Compiler Structure
[Flow: Computation Selection (entire computation of each method) -> Extent Computation (traverse call graph to extract extent) -> Commutativity Testing (all pairs of operations in extent) -> if all operations commute, Generate Parallel Code; if operations may not commute, Generate Serial Code]
16. Traditional Approach
- Data Dependence Analysis
- Analyzes Reads and Writes
- Independent Pieces of Code Execute in Parallel
- Demonstrated Success for Array-Based Programs
17. Data Dependence Analysis in Example
- For Data Dependence Analysis to Succeed in Example
  - left and right traverse Must Be Independent
  - left and right Subgraphs Must Be Disjoint
  - Graph Must Be a Tree
- Depends on Global Topology of Data Structure
  - Analyze Code that Builds Data Structure
  - Extract and Propagate Topology Information
- Fails for Graphs
18. Properties of Commutativity Analysis
- Oblivious to Data Structure Topology
- Local Analysis
- Simple Analysis
- Wide Range of Computations
- Lists, Trees and Graphs
- Updates to Central Data Structure
- General Reductions
- Introduces Synchronization
- Relies on Commuting Operations
20. Commutativity Testing Conditions
- Do Two Operations A and B Commute?
- Compiler Considers Two Execution Orders
  - A;B: A executes before B
  - B;A: B executes before A
- Compiler Must Check Two Conditions
  - Instance Variables: New values of instance variables are the same in both execution orders
  - Invoked Operations: A and B together directly invoke the same set of operations in both execution orders
21. Commutativity Testing Conditions [Figure]
22. Commutativity Testing Algorithm
- Symbolic Execution
  - Compiler Executes Operations
  - Computes with Expressions, not Values
- Compiler Symbolically Executes Operations in Both Execution Orders
  - Expressions for New Values of Instance Variables
  - Expressions for Multiset of Invoked Operations
23. Expression Simplification and Comparison
- Compiler Applies Rewrite Rules to Simplify Expressions
  - a*(b+c) -> (a*b)+(a*c)
  - b+(a+c) -> (a+b+c)
  - a+if(b<c,d,e) -> if(b<c,a+d,a+e)
- Compiler Compares Corresponding Expressions
  - If All Equal: Operations Commute
  - If Not All Equal: Operations May Not Commute
24. Commutativity Testing Example
- Two Operations: r->traverse(v1) and r->traverse(v2)
- In Order r->traverse(v1); r->traverse(v2)
  - Instance Variables: New sum = (sum+v1)+v2
  - Invoked Operations: if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val)), if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val))
- In Order r->traverse(v2); r->traverse(v1)
  - Instance Variables: New sum = (sum+v2)+v1
  - Invoked Operations: if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val)), if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val))
25. Important Special Case
- Independent Operations Commute
26. Important Special Case
- Independent Operations Commute
- Conditions for Independence
  - Operations Have Different Receivers
  - Neither Operation Writes an Instance Variable that the Other Operation Accesses
- Detecting Independent Operations
  - In Type-Safe Languages
    - Class Declarations
    - Instance Variable Accesses
  - Pointer or Alias Analysis
27. Analysis in Current Compiler
- Dependence Analysis
  - Operations on Objects of Different Classes
  - Independent Operations on Objects of Same Class
- Symbolic Commutativity Testing
  - Dependent Operations on Objects of Same Class
- Future
  - Integrate Pointer or Alias Analysis
  - Integrate Array Data Dependence Analysis
29. Programming Model Extensions
- Extensions for Read-Only Data
  - Allow Operations to Freely Access Read-Only Data
  - Enhances Ability of Compiler to Represent Expressions
  - Increases Set of Programs that Compiler Can Analyze
- Analysis Granularity Extensions
  - Integrate Operations into Callers for Analysis Purposes
  - Coarsens Commutativity Testing Granularity
  - Reduces Number of Pairs Tested for Commutativity
  - Enhances Effectiveness of Commutativity Testing
30. Optimizations
- Synchronization Optimizations
  - Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data
  - Reduce Number of Acquire and Release Constructs
- Parallel Loop Optimization
- Suppress Exploitation of Excess Concurrency
31. Extent Constants
- Motivation: Allow Parallel Operations to Freely Access Read-Only Data
- Extent Constant Variable: Global variable or instance variable written by no operation in extent
- Extent Constant Expression: Expression whose value depends only on extent constant variables or parameters
- Extent Constant Value: Value computed by extent constant expression
- Extent Constant: Automatically generated opaque constant used to represent an extent constant value
- Requires
  - Interprocedural Data Usage Analysis
    - Result Summarizes How Operations Access Instance Variables
  - Interprocedural Pointer Analysis for Reference Parameters
32. Extent Constant Variables in Example
  void graph::traverse(int v) {
    sum += v;
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }
(val, left, and right are extent constant variables: no operation in the extent writes them)
33. Advantages of Extent Constants
- Extent Constants Extend Programming Model
  - Enable Direct Global Variable Access
  - Enable Direct Access of Objects other than Receiver
- Extent Constants Make Compiler More Effective
  - Enable Compact Representations of Large Expressions
  - Enable Compiler to Represent Values Computed by Otherwise Unanalyzable Constructs
34. Auxiliary Operations
- Motivation: Coarsen Granularity of Commutativity Testing
- An Operation is an Auxiliary Operation if its Entire Computation
  - Only Computes Extent Constant Values
  - Only Externally Visible Writes are to Local Variables of Caller
- Auxiliary Operations are Conceptually Part of Caller
  - Analysis Integrates Auxiliary Operations into Caller
  - Represents Computed Values using Extent Constants
- Requires
  - Interprocedural Data Usage Analysis
  - Interprocedural Pointer Analysis for Reference Parameters
  - Intraprocedural Reaching Definition Analysis
35. Auxiliary Operation Example
  int graph::square_and_add(int v) {
    return (val*val + v);
  }

  void graph::traverse(int v) {
    sum += square_and_add(v);
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }
(val is an extent constant variable and v is a parameter, so val*val + v is an extent constant expression)
36. Advantages of Auxiliary Operations
- Coarsen Granularity of Commutativity Testing
  - Reduces Number of Pairs Tested for Commutativity
  - Enhances Effectiveness of Commutativity Testing Algorithm
- Support Modular Programming
37. Synchronization Optimizations
- Goal: Eliminate or Reduce Synchronization Overhead
- Synchronization Elimination
  - If an Operation Only Computes Extent Constant Values, Then the Compiler Does Not Generate Lock Acquire and Release
- Lock Coarsening
  - Data: Use One Lock for Multiple Objects
  - Computation: Generate One Lock Acquire and Release for Multiple Operations on the Same Object
38. Data Lock Coarsening Example
Original Code:
  class vector {
    lock mutex;
    double val[NDIM];
    void vector::add(double *v) {
      mutex.acquire();
      for (int i = 0; i < NDIM; i++) val[i] += v[i];
      mutex.release();
    }
  };
  class body {
    lock mutex;
    double phi;
    vector acc;
    void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      mutex.release();
      acc.add(v);
    }
  };
Optimized Code:
  class vector {
    double val[NDIM];
    void vector::add(double *v) {
      for (int i = 0; i < NDIM; i++) val[i] += v[i];
    }
  };
  class body {
    lock mutex;
    double phi;
    vector acc;
    void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
      mutex.release();
    }
  };
39. Computation Lock Coarsening Example
Original Code:
  class body {
    lock mutex;
    double phi;
    vector acc;
    void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
      mutex.release();
    }
    void body::loopsub(body *b) {
      int i;
      for (i = 0; i < N; i++) {
        this->gravsub(b+i);
      }
    }
  };
Optimized Code:
  class body {
    lock mutex;
    double phi;
    vector acc;
    void body::gravsub(body *b) {
      double p, v[NDIM];
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
    }
    void body::loopsub(body *b) {
      int i;
      mutex.acquire();
      for (i = 0; i < N; i++) {
        this->gravsub(b+i);
      }
      mutex.release();
    }
  };
40. Parallel Loops
- Goal: Generate Efficient Code for Parallel Loops
- If a Loop is in the Following Form
    for (i = exp1; i < exp2; i += exp3) {
      exp4->op(exp5, exp6, ...);
    }
  Where exp1, exp2, ... are Extent Constant Expressions
- Then Compiler Generates Parallel Loop Code
41. Parallel Loop Optimization
- Without Parallel Loop Optimization
  - Each Loop Iteration Generates a Task
  - Tasks are Created and Scheduled Sequentially
  - Each Iteration Incurs Task Creation and Scheduling Overhead
- With Parallel Loop Optimization
  - Generated Code Immediately Exposes All Iterations
  - Scheduler Operates on Chunks of Loop Iterations
  - Each Chunk of Iterations Incurs Scheduling Overhead
- Advantages
  - Enables Compact Representation for Loop Computation
  - Reduces Task Creation and Scheduling Overhead
  - Parallelizes Overhead
42. Suppressing Excess Concurrency
- Goal: Reduce Overhead of Exploiting Parallelism
- Goal Achieved by Generating Computations that
  - Execute Operations Serially with No Parallelization Overhead
  - Use Synchronization Required to Execute Safely in Parallel Context
- Mechanism: Mutex Versions of Methods
  - Object Section
    - Acquires Lock at Beginning
    - Releases Lock at End
  - Invocation Section
    - Operations Execute Serially
    - Invokes Mutex Version
- Current Policy
  - Each Parallel Loop Iteration Invokes Mutex Version of Operation
  - Suppresses Parallel Execution Within Iterations of Parallel Loops
44. Methodology
- Built Prototype Compiler
- Built Run Time System
  - Concurrency Generation and Task Management
  - Dynamic Load Balancing
  - Synchronization
- Acquired Two Complete Applications
  - Barnes-Hut N-Body Solver
  - Water Code
- Automatically Parallelized Applications
- Ran Applications on Stanford DASH Machine
- Compared Performance with Highly Tuned, Explicitly Parallel Versions from SPLASH-2 Benchmark Suite
45. Prototype Compiler
- Clean Subset of C++
- Sage++ is Front End
- Structured as a Source-To-Source Translator
  - Analysis Finds Parallel Loops and Methods
  - Compiler Generates Annotation File
    - Identifies Parallel Loops and Methods
    - Classes to Augment with Locks
  - Code Generator Reads Annotation File
    - Generates Parallel Versions of Methods
    - Inserts Synchronization and Parallelization Code
- Parallelizes Unannotated Programs
46. Major Restrictions
- Motivation: Simplify Implementation of Prototype
- No Virtual Methods
- No Operator or Method Overloading
- No Multiple Inheritance or Templates
- No typedef, struct, union or enum Types
- Global Variables must be Class Types
- No Static Members or Pointers to Members
- No Default Arguments or Variable Numbers of Arguments
- No Operation Accesses a Variable Declared in a Class from which its Receiver Class Inherits
47. Run Time Library
- Motivation: Provide Basic Concurrency Management
- Single Program, Multiple Data Execution Model
- Single Address Space
- Alternate Serial and Parallel Phases
- Library Provides
  - Task Creation and Synchronization Primitives
  - Dynamic Load Balancing
- Implemented On
  - Stanford DASH Shared-Memory Multiprocessor
  - SGI Shared-Memory Multiprocessors
48. Applications
- Barnes-Hut
  - O(N lg N) N-Body Solver
  - Space Subdivision Tree
  - 1500 Lines of C++ Code
- Water
  - Simulates Liquid Water
  - O(N^2) Algorithm
  - 1850 Lines of C++ Code
49. Obtaining Serial C++ Version of Barnes-Hut
- Started with Explicitly Parallel Version (SPLASH-2)
- Removed Parallel Constructs to get Serial C
- Converted to Clean Object-Based C++
- Major Structural Changes
  - Eliminated Scheduling Code and Data Structures
  - Split a Loop in Force Computation Phase
  - Introduced New Field into Particle Data Structure
50. Obtaining Serial C++ Version of Water
- Started with Serial C translated from FORTRAN
- Converted to Clean Object-Based C++
- Major Structural Change
  - Auxiliary Objects for O(N^2) Phases
51. Commutativity Statistics for Barnes-Hut
[Bar chart: pairs per parallel extent (Position: 3 Methods; Force: 6 Methods; Velocity: 3 Methods), broken down into Symbolically Executed Pairs, Independent Pairs, and Pairs Tested for Commutativity; y-axis 0-20 pairs]
52. Auxiliary Operation Statistics for Barnes-Hut
[Bar chart: Auxiliary Operation Call Sites per parallel extent (Position: 3 Methods; Force: 6 Methods; Velocity: 3 Methods); y-axis 0-15 call sites]
53. Performance Results for Barnes-Hut
54. Performance Analysis
- Motivation: Understand Behavior of Parallelized Program
- Instrumented Code to Measure Execution Time Breakdowns
  - Parallel Idle: Time Spent Idle in a Parallel Section
  - Serial Idle: Time Spent Idle in a Serial Section
  - Blocked: Time Spent Waiting to Acquire a Lock Held by Another Processor
  - Parallel Compute: Time Spent Doing Useful Work in a Parallel Section
  - Serial Compute: Time Spent Doing Useful Work in a Serial Section
55. Performance Analysis for Barnes-Hut
[Two stacked-bar charts: Cumulative Total Time (seconds) vs Number of Processors (1, 2, 4, 8, 16, 24, 32). Barnes-Hut on DASH, Data Set: 8K Particles (y-axis to 120 seconds) and Data Set: 16K Particles (y-axis to 300 seconds)]
56. Performance Results for Water
[Two speedup graphs: Speedup vs Number of Processors (0-32). Water on DASH, Data Set: 343 Molecules and Data Set: 512 Molecules]
57. Performance Results for Computation Replication Version of Water
[Two speedup graphs: Speedup vs Number of Processors. Water on DASH, Data Set: 343 Molecules and Data Set: 512 Molecules]
58. Commutativity Statistics for Water
[Bar chart: pairs per parallel extent (Virtual: 3 Methods; Forces: 2 Methods; Loading: 4 Methods; Momenta: 2 Methods; Energy: 5 Methods), broken down into Symbolically Executed Pairs, Independent Pairs, and Pairs Tested for Commutativity; y-axis 0-15 pairs]
59. Auxiliary Operation Statistics for Water
[Bar chart: Auxiliary Operation Call Sites per parallel extent (Virtual: 3 Methods; Forces: 2 Methods; Loading: 4 Methods; Momenta: 2 Methods; Energy: 5 Methods); y-axis 0-15 call sites]
60. Performance Analysis for Water
[Two stacked-bar charts: Cumulative Total Time (seconds) vs Number of Processors (1, 2, 4, 8, 16, 24, 32). Water on DASH, Data Set: 343 Molecules (y-axis to 600 seconds) and Data Set: 512 Molecules (y-axis to 1400 seconds)]
61. Future Work
- Relative Commutativity
- Integrate Other Analysis Frameworks
- Pointer or Alias Analysis
- Array Data Dependence Analysis
- Analysis Problems
- Synchronization Optimizations
- Analysis Granularity Optimizations
- Generation of Self-Tuning Code
- Message Passing Implementation
62. Related Work
- Bernstein (IEEE Transactions on Computers 1966)
- Reduction Analysis
  - Ghuloum and Fisher (PPOPP 95)
  - Pinter and Pinter (POPL 92)
  - Callahan (LCPC 91)
- Commuting Operations in Parallel Languages
  - Rinard and Lam (PPOPP 91)
  - Steele (POPL 90)
  - Barth, Nikhil and Arvind (FPCA 91)
- Dependence Analysis for Pointer-Based Data Structures
  - Landi, Ryder and Zhang (PLDI 93)
  - Hendren, Hummel and Nicolau (PLDI 92)
  - Plevyak, Karamcheti and Chien (LCPC 93)
  - Chase, Wegman and Zadeck (PLDI 90)
  - Larus and Hilfinger (PLDI 88)
  - Ghiya and Hendren (POPL 96)
  - Ruf (PLDI 95)
  - Wilson and Lam (PLDI 95)
  - Deutsch (PLDI 94)
  - Choi, Burke and Carini (POPL 93)
64. Conclusion
- Commutativity Analysis
  - New Analysis Framework for Parallelizing Compilers
- Basic Idea
  - Recognize Commuting Operations
  - Generate Parallel Code
- Current Focus
  - Dynamic, Pointer-Based Data Structures
  - Good Initial Results
- Future
  - Persistent Data
  - Distributed Computations
65. Latest Version of Paper
- http://www.cs.ucsb.edu/martin/paper/pldi96.ps
66. What if Operations Do Not Commute?
- Parallel Tree Traversal
- Example: Distance of Node from Root
  class tree {
    int distance;
    tree *left;
    tree *right;
  };
  void tree::set_distance(int d) {
    distance = d;
    if (left != NULL) left->set_distance(d+1);
    if (right != NULL) right->set_distance(d+1);
  }
67. Equivalent Computation with Commuting Operations
  void tree::zero_distance() {
    distance = 0;
    if (left != NULL) left->zero_distance();
    if (right != NULL) right->zero_distance();
  }
  void tree::sum_distance(int d) {
    distance = distance + d;
    if (left != NULL) left->sum_distance(d+1);
    if (right != NULL) right->sum_distance(d+1);
  }
  void tree::set_distance(int d) {
    zero_distance();
    sum_distance(d);
  }
68. Theoretical Result
- For Any Tree Traversal on Data With
  - A Commutative Operator (for example, +) that has
  - A Zero Element (for example, 0)
- There Exists a Program P such that
  - P Computes the Traversal
  - Commutativity Analysis Can Automatically Parallelize P
- Complexity Results
  - Program P is Asymptotically Optimal if the Data Structure is a Perfectly Balanced Tree
  - Program P has Complexity O(N^2) if the Data Structure is a Linked List
69. Pure Object-Based Model of Computation
- Goal
  - Obtain a Powerful, Clean Model of Computation
  - Enable Compiler to Analyze Program
- Objects: Instances of Classes
  - Implement State with Instance Variables
    - Primitive Types from Underlying Language (int, ...)
    - References to Other Objects
    - Nested Objects
- Operations: Invocations of Methods
  - Each Operation Has a Single Receiver Object