Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
1
Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
  • Martin C. Rinard
  • Pedro C. Diniz
  • University of California, Santa Barbara
  • Santa Barbara, California 93106
  • {martin,pedro}@cs.ucsb.edu
  • http://www.cs.ucsb.edu/{martin,pedro}

2
Goal
  • Develop a Parallelizing Compiler for
    Object-Oriented Computations
  • Current Focus
  • Irregular Computations
  • Dynamic Data Structures
  • Future
  • Persistent Data
  • Distributed Computations
  • New Analysis Technique
  • Commutativity Analysis

3
Structure of Talk
  • Model of Computation
  • Example
  • Commutativity Testing
  • Steps To Practicality
  • Experimental Results
  • Conclusion

4
Model of Computation
[Diagram: operations execute against objects; an executing operation reads the initial object state, produces a new object state, and invokes further operations]
5
Graph Traversal Example
class graph {
  int val, sum;
  graph *left, *right;
};
void graph::traverse(int v) {
  sum += v;
  if (left != NULL) left->traverse(val);
  if (right != NULL) right->traverse(val);
}

Goal: Execute left and right traverse operations
in parallel
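A minimal runnable sketch of this example (the 4-node diamond graph and the `shared_sum` helper are assumptions for illustration, not from the talk) shows why this goal is reasonable: because `sum += v` commutes, visiting the two subgraphs in either order leaves every node with the same final sum, even when they share a node.

```cpp
#include <cstddef>

// Sketch: the graph class from this slide, run sequentially on a
// hypothetical diamond-shaped graph whose two subgraphs share node d.
struct graph {
    int val, sum;
    graph *left, *right;
    void traverse(int v) {
        sum += v;
        if (left != NULL) left->traverse(val);
        if (right != NULL) right->traverse(val);
    }
};

int shared_sum(bool left_first) {
    graph d = {4, 0, NULL, NULL};               // reached from both b and c
    graph b = {2, 0, &d, NULL};
    graph c = {3, 0, &d, NULL};
    graph a = {1, 0, left_first ? &b : &c, left_first ? &c : &b};
    a.traverse(10);
    return d.sum;                               // b.val + c.val either way
}
```

Swapping the order of the two subtree traversals changes the interleaving of updates to the shared node but not its final sum.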
6
Parallel Traversal
7
Commuting Operations in Parallel Traversal
8
Model of Computation
  • Operations: Method Invocations
  • In Example: Invocations of graph::traverse
  • left->traverse(3)
  • right->traverse(2)
  • Objects: Instances of Classes
  • In Example: Graph Nodes
  • Instance Variables Implement Object State
  • In Example: val, sum, left, right

10
Separable Operations
  • Each Operation Consists of Two Sections

Object Section: Only Accesses Receiver Object
Invocation Section: Only Invokes Operations
Both Sections Can Access Parameters
11
Basic Approach
  • Compiler Chooses A Computation to Parallelize
  • In Example: Entire graph::traverse Computation
  • Compiler Computes Extent of the Computation
  • Representation of all Operations in Computation
  • Current Representation: Set of Methods
  • In Example: graph::traverse
  • Do All Pairs of Operations in Extent Commute?
  • No - Generate Serial Code
  • Yes - Generate Parallel Code
  • In Example: All Pairs Commute

12
Code Generation: For Each Method in Parallel
Computation
  • Augments Class Declaration With Mutual Exclusion
    Lock
  • Generates Driver Version of Method
  • Invoked from Serial Code to Start Parallel
    Execution
  • Invokes Parallel Version of Operation
  • Waits for Entire Parallel Computation to Finish
  • Generates Parallel Version of Method
  • Object Section
  • Lock Acquired at Beginning
  • Lock Released at End
  • Ensure Atomic Execution
  • Invocation Section
  • Invoked Operations
  • Execute in Parallel
  • Invokes Parallel Version

13
Code Generation In Example

Class Declaration:
class graph {
  lock mutex;
  int val, sum;
  graph *left, *right;
};

Driver Version:
void graph::traverse(int v) {
  parallel_traverse(v);
  wait();
}
14
Parallel Version In Example
void graph::parallel_traverse(int v) {
  mutex.acquire();
  sum += v;
  mutex.release();
  if (left != NULL) spawn(left->parallel_traverse(val));
  if (right != NULL) spawn(right->parallel_traverse(val));
}
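As a rough, runnable approximation of this generated code (std::mutex and std::thread stand in for the runtime's lock and spawn primitives, which the talk only sketches; the demo graph is hypothetical): the object section runs under the receiver's lock, and the invocation section runs the child traversals concurrently.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>

// Approximation of the generated parallel version: lock around the
// object section, threads for the invocation section.
struct graph {
    std::mutex mtx;
    int val, sum;
    graph *left, *right;
    void parallel_traverse(int v) {
        { std::lock_guard<std::mutex> g(mtx); sum += v; }  // object section
        std::thread lt, rt;                                // invocation section
        if (left != NULL) lt = std::thread(&graph::parallel_traverse, left, val);
        if (right != NULL) rt = std::thread(&graph::parallel_traverse, right, val);
        if (lt.joinable()) lt.join();  // joins stand in for the driver's wait()
        if (rt.joinable()) rt.join();
    }
};

int demo_shared_sum() {
    graph d{{}, 4, 0, NULL, NULL};   // shared node, updated by both subgraphs
    graph b{{}, 2, 0, &d, NULL};
    graph c{{}, 3, 0, &d, NULL};
    graph a{{}, 1, 0, &b, &c};
    a.parallel_traverse(10);
    return d.sum;                    // deterministic: both updates hold d's lock
}
```

The per-object lock makes the concurrent `sum +=` updates on the shared node atomic, so the result is deterministic even though the two subgraph traversals race.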
15
Compiler Structure
Computation Selection: Entire Computation of Each Method
Extent Computation: Traverse Call Graph to Extract Extent
Commutativity Testing: All Pairs of Operations In Extent
  • All Operations Commute → Generate Parallel Code
  • Operations May Not Commute → Generate Serial Code
16
Traditional Approach
  • Data Dependence Analysis
  • Analyzes Reads and Writes
  • Independent Pieces of Code Execute in Parallel
  • Demonstrated Success for Array-Based Programs

17
Data Dependence Analysis in Example
  • For Data Dependence Analysis To Succeed in
    Example
  • left and right traverse Must Be Independent
  • left and right Subgraphs Must Be Disjoint
  • Graph Must Be a Tree
  • Depends on Global Topology of Data Structure
  • Analyze Code that Builds Data Structure
  • Extract and Propagate Topology Information
  • Fails For Graphs

18
Properties of Commutativity Analysis
  • Oblivious to Data Structure Topology
  • Local Analysis
  • Simple Analysis
  • Wide Range of Computations
  • Lists, Trees and Graphs
  • Updates to Central Data Structure
  • General Reductions
  • Introduces Synchronization
  • Relies on Commuting Operations

19
  • Commutativity Testing

20
Commutativity Testing Conditions
  • Do Two Operations A and B Commute?
  • Compiler Considers Two Execution Orders
  • AB - A executes before B
  • BA - B executes before A
  • Compiler Must Check Two Conditions

Instance Variables: New values of instance
variables are the same in both execution orders
Invoked Operations: A and B together directly
invoke the same set of operations in both execution
orders
21
Commutativity Testing Conditions
22
Commutativity Testing Algorithm
  • Symbolic Execution
  • Compiler Executes Operations
  • Computes with Expressions not Values
  • Compiler Symbolically Executes Operations
  • In Both Execution Orders
  • Expressions for New Values of Instance Variables
  • Expressions for Multiset of Invoked Operations

23
Expression Simplification and Comparison
  • Compiler Applies Rewrite Rules to Simplify
    Expressions
  • a*(b+c) → (a*b)+(a*c)
  • b+(a+c) → (a+b+c)
  • a+if(b&lt;c,d,e) → if(b&lt;c,a+d,a+e)
  • Compiler Compares Corresponding Expressions
  • If All Equal - Operations Commute
  • If Not All Equal - Operations May Not Commute
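The talk does not show the simplifier's implementation; as an illustration only, the first rule can be applied to a tiny hand-rolled expression tree (the `Expr` representation and helper names here are assumptions):

```cpp
#include <memory>
#include <string>

// Illustrative sketch of one rewrite rule: distribute a*(b+c) into
// (a*b)+(a*c) over a small expression tree.
struct Expr {
    char op;                       // '*', '+', or 'v' for a variable leaf
    std::string name;              // leaf name when op == 'v'
    std::shared_ptr<Expr> l, r;
};
using E = std::shared_ptr<Expr>;

E var(const std::string& n) { return std::make_shared<Expr>(Expr{'v', n, nullptr, nullptr}); }
E mul(E a, E b) { return std::make_shared<Expr>(Expr{'*', "", a, b}); }
E add(E a, E b) { return std::make_shared<Expr>(Expr{'+', "", a, b}); }

// Apply a*(b+c) -> (a*b)+(a*c) once at the root.
E distribute(E e) {
    if (e->op == '*' && e->r->op == '+')
        return add(mul(e->l, e->r->l), mul(e->l, e->r->r));
    return e;
}

// Print the tree with full parenthesization for comparison.
std::string show(E e) {
    if (e->op == 'v') return e->name;
    return "(" + show(e->l) + e->op + show(e->r) + ")";
}
```

After rewriting both orders into such a canonical form, comparing the resulting expressions reduces to a structural equality check.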

24
Commutativity Testing Example
  • Two Operations
  • r->traverse(v1) and r->traverse(v2)
  • In Order r->traverse(v1); r->traverse(v2)

Instance Variables: New sum = (sum+v1)+v2
Invoked Operations: if(right != NULL, right->traverse(val)),
if(left != NULL, left->traverse(val)),
if(right != NULL, right->traverse(val)),
if(left != NULL, left->traverse(val))
  • In Order r->traverse(v2); r->traverse(v1)

Instance Variables: New sum = (sum+v2)+v1
Invoked Operations: if(right != NULL, right->traverse(val)),
if(left != NULL, left->traverse(val)),
if(right != NULL, right->traverse(val)),
if(left != NULL, left->traverse(val))
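The compiler performs this comparison symbolically; as a concrete illustration, a hypothetical harness (all names here are assumptions) can make the same two checks dynamically: run the pair in both orders on copies of one receiver, then compare the new instance variables and the multiset of directly invoked operations.

```cpp
#include <algorithm>
#include <vector>

// Sketch of the two commutativity conditions, checked dynamically.
struct node {
    int val, sum;
    std::vector<int> invoked;          // stand-in log of invoked operations
    void traverse(int v) {
        sum += v;
        invoked.push_back(val);        // models left->traverse(val)
        invoked.push_back(val);        // models right->traverse(val)
    }
};

bool orders_agree(node r, int v1, int v2) {
    node ab = r, ba = r;
    ab.traverse(v1); ab.traverse(v2);  // order A;B
    ba.traverse(v2); ba.traverse(v1);  // order B;A
    std::sort(ab.invoked.begin(), ab.invoked.end());   // compare as multisets
    std::sort(ba.invoked.begin(), ba.invoked.end());
    return ab.sum == ba.sum && ab.invoked == ba.invoked;
}

bool demo() { node r = {7, 0, {}}; return orders_agree(r, 1, 2); }
```

Here `(sum+v1)+v2 == (sum+v2)+v1` and the invoked multisets match, mirroring the equal expressions on the slide.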
25
Important Special Case
  • Independent Operations Commute
  • Analysis in Current Compiler
  • Dependence Analysis
  • Operations on Objects of Different Classes
  • Independent Operations on Objects of Same Class
  • Symbolic Commutativity Testing
  • Dependent Operations on Objects of Same Class
  • Future
  • Integrate Pointer or Alias Analysis
  • Integrate Array Data Dependence Analysis

26
Important Special Case
  • Independent Operations Commute
  • Conditions for Independence
  • Operations Have Different Receivers
  • Neither Operation Writes an Instance Variable
    that Other Operation Accesses
  • Detecting Independent Operations
  • In Type-Safe Languages
  • Class Declarations
  • Instance Variable Accesses
  • Pointer or Alias Analysis

28
  • Steps to Practicality

29
Programming Model Extensions
  • Extensions for Read-Only Data
  • Allow Operations to Freely Access Read-Only Data
  • Enhances Ability of Compiler to Represent
    Expressions
  • Increases Set of Programs that Compiler can
    Analyze
  • Analysis Granularity Extensions
  • Integrate Operations into Callers for Analysis
    Purposes
  • Coarsens Commutativity Testing Granularity
  • Reduces Number of Pairs Tested for Commutativity
  • Enhances Effectiveness of Commutativity Testing

30
Optimizations
  • Synchronization Optimizations
  • Eliminate Synchronization Constructs in Methods
    that Only Access Read-Only Data
  • Reduce Number of Acquire and Release Constructs
  • Parallel Loop Optimization
  • Suppress Exploitation of Excess Concurrency

31
Extent Constants
  • Motivation: Allow Parallel Operations to Freely
    Access Read-Only Data
  • Extent Constant Variable: Global variable or
    instance variable written by no operation in
    extent
  • Extent Constant Expression: Expression whose value
    depends only on extent constant variables or
    parameters
  • Extent Constant Value: Value computed by extent
    constant expression
  • Extent Constant: Automatically generated opaque
    constant used to represent an extent constant
    value
  • Requires Interprocedural Data Usage Analysis
  • Result: Summarizes How Operations Access Instance
    Variables
  • Interprocedural Pointer Analysis for Reference
    Parameters

32
Extent Constant Variables In Example
void graph::traverse(int v) {
  sum += v;
  if (left != NULL) left->traverse(val);    // val: extent constant variable
  if (right != NULL) right->traverse(val);  // val: extent constant variable
}
33
Advantages of Extent Constants
  • Extent Constants Extend Programming Model
  • Enable Direct Global Variable Access
  • Enable Direct Access of Objects other than
    Receiver
  • Extent Constants Make Compiler More Effective
  • Enable Compact Representations of Large
    Expressions
  • Enable Compiler to Represent Values Computed by
    Otherwise Unanalyzable Constructs

34
Auxiliary Operations
  • Motivation: Coarsen Granularity of Commutativity
    Testing
  • An Operation is an Auxiliary Operation if its
    Entire Computation
  • Only Computes Extent Constant Values
  • Only Externally Visible Writes are to Local
    Variables of Caller
  • Auxiliary Operations are Conceptually Part of
    Caller
  • Analysis Integrates Auxiliary Operations into
    Caller
  • Represents Computed Values using Extent Constants
  • Requires
  • Interprocedural Data Usage Analysis
  • Interprocedural Pointer Analysis for Reference
    Parameters
  • Intraprocedural Reaching Definition Analysis

35
Auxiliary Operation Example
int graph::square_and_add(int v) {
  return (val*val + v);   // val: extent constant variable; v: parameter
}                         // val*val + v: extent constant expression

void graph::traverse(int v) {
  sum += square_and_add(v);
  if (left != NULL) left->traverse(val);
  if (right != NULL) right->traverse(val);
}
36
Advantages of Auxiliary Operations
  • Coarsen Granularity of Commutativity Testing
  • Reduces Number of Pairs Tested for Commutativity
  • Enhances Effectiveness of Commutativity Testing
    Algorithm
  • Support Modular Programming

37
Synchronization Optimizations
  • Goal: Eliminate or Reduce Synchronization
    Overhead
  • Synchronization Elimination

If an Operation Only Computes Extent Constant
Values, Then the Compiler Does Not Generate Lock
Acquire and Release

  • Lock Coarsening

Data: Use One Lock for Multiple Objects
Computation: Generate One Lock Acquire and
Release for Multiple Operations on the Same Object
38
Data Lock Coarsening Example
Original Code:

class vector {
  lock mutex;
  double val[NDIM];
};
void vector::add(double *v) {
  mutex.acquire();
  for (int i = 0; i < NDIM; i++) val[i] += v[i];
  mutex.release();
}

class body {
  lock mutex;
  double phi;
  vector acc;
};
void body::gravsub(body *b) {
  double p, v[NDIM];
  mutex.acquire();
  p = computeInter(b, v);
  phi -= p;
  mutex.release();
  acc.add(v);
}

Optimized Code:

class vector {
  double val[NDIM];
};
void vector::add(double *v) {
  for (int i = 0; i < NDIM; i++) val[i] += v[i];
}

class body {
  lock mutex;
  double phi;
  vector acc;
};
void body::gravsub(body *b) {
  double p, v[NDIM];
  mutex.acquire();
  p = computeInter(b, v);
  phi -= p;
  acc.add(v);
  mutex.release();
}
39
Computation Lock Coarsening Example
Original Code:

class body {
  lock mutex;
  double phi;
  vector acc;
};
void body::gravsub(body *b) {
  double p, v[NDIM];
  mutex.acquire();
  p = computeInter(b, v);
  phi -= p;
  acc.add(v);
  mutex.release();
}
void body::loopsub(body *b) {
  int i;
  for (i = 0; i < N; i++) {
    this->gravsub(&b[i]);
  }
}

Optimized Code:

class body {
  lock mutex;
  double phi;
  vector acc;
};
void body::gravsub(body *b) {
  double p, v[NDIM];
  p = computeInter(b, v);
  phi -= p;
  acc.add(v);
}
void body::loopsub(body *b) {
  int i;
  mutex.acquire();
  for (i = 0; i < N; i++) {
    this->gravsub(&b[i]);
  }
  mutex.release();
}

40
Parallel Loops
  • Goal: Generate Efficient Code for Parallel Loops
  • If A Loop is in the Following Form
  • for (i = exp1; i < exp2; i += exp3)
      exp4->op(exp5, exp6, ...);
  • Where exp1, exp2, ... are Extent Constant
    Expressions
  • Then Compiler Generates Parallel Loop Code

41
Parallel Loop Optimization
  • Without Parallel Loop Optimization
  • Each Loop Iteration Generates a Task
  • Tasks are Created and Scheduled Sequentially
  • Each Iteration Incurs Task Creation and
    Scheduling Overhead
  • With Parallel Loop Optimization
  • Generated Code Immediately Exposes All Iterations
  • Scheduler Operates on Chunks of Loop Iterations
  • Each Chunk of Iterations Incurs Scheduling
    Overhead
  • Advantages
  • Enables Compact Representation for Loop
    Computation
  • Reduces Task Creation and Scheduling Overhead
  • Parallelizes Overhead
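The chunked-scheduling idea above can be sketched as follows; the chunk size, the two workers, and the atomic counter are illustrative assumptions, not the authors' runtime. Each worker grabs a whole chunk of iterations at a time, so scheduling overhead is paid once per chunk rather than once per iteration.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Sketch: workers claim chunks of loop iterations from a shared counter.
int chunked_parallel_sum(const std::vector<int>& data, int chunk) {
    std::atomic<int> next(0);
    std::atomic<int> total(0);
    const int n = static_cast<int>(data.size());
    auto worker = [&] {
        for (;;) {
            int start = next.fetch_add(chunk);     // one scheduling step per chunk
            if (start >= n) break;
            int local = 0;
            for (int i = start; i < std::min(start + chunk, n); ++i)
                local += data[i];
            total += local;                        // one synchronized update per chunk
        }
    };
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    return total.load();
}
```

With per-iteration task creation, every iteration would pay the `fetch_add` and update costs; chunking amortizes them across `chunk` iterations.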

42
Suppressing Excess Concurrency
  • Goal: Reduce Overhead of Exploiting Parallelism
  • Goal Achieved by Generating Computations that
  • Execute Operations Serially with No
    Parallelization Overhead
  • Use Synchronization Required to Execute Safely in
    Parallel Context
  • Mechanism Mutex Versions of Methods
  • Object Section
  • Acquires Lock at Beginning
  • Releases Lock at End
  • Invocation Section
  • Operations Execute Serially
  • Invokes Mutex Version
  • Current Policy
  • Each Parallel Loop Iteration Invokes Mutex
    Version of Operation
  • Suppresses Parallel Execution Within Iterations
    of Parallel Loops

43
  • Experimental Results

44
Methodology
  • Built Prototype Compiler
  • Built Run Time System
  • Concurrency Generation and Task Management
  • Dynamic Load Balancing
  • Synchronization
  • Acquired Two Complete Applications
  • Barnes-Hut N-Body Solver
  • Water Code
  • Automatically Parallelized Applications
  • Ran Applications on Stanford DASH Machine
  • Compare Performance with Highly Tuned, Explicitly
    Parallel Versions from SPLASH-2 Benchmark Suite

45
Prototype Compiler
  • Clean Subset of C++
  • Sage++ is the Front End
  • Structured As a Source-To-Source Translator
  • Analysis Finds Parallel Loops and Methods
  • Compiler Generates Annotation File
  • Identifies Parallel Loops and Methods
  • Classes to Augment with Locks
  • Code Generator Reads Annotation File
  • Generates Parallel Versions of Methods
  • Inserts Synchronization and Parallelization Code
  • Parallelizes Unannotated Programs

46
Major Restrictions
  • Motivation: Simplify Implementation of Prototype
  • No Virtual Methods
  • No Operator or Method Overloading
  • No Multiple Inheritance or Templates
  • No typedef, struct, union or enum types
  • Global Variables must be Class Types
  • No Static Members or Pointers to Members
  • No Default Arguments or Variable Numbers of
    Arguments
  • No Operation Accesses a Variable Declared in a
    Class from which its Receiver Class Inherits

47
Run Time Library
  • Motivation: Provide Basic Concurrency Management
  • Single Program, Multiple Data Execution Model
  • Single Address Space
  • Alternate Serial and Parallel Phases
  • Library Provides
  • Task Creation and Synchronization Primitives
  • Dynamic Load Balancing
  • Implemented
  • Stanford DASH Shared-Memory Multiprocessor
  • SGI Shared-Memory Multiprocessors

48
Applications
  • Barnes-Hut
  • O(N lg N) N-Body Solver
  • Space Subdivision Tree
  • 1500 Lines of C++ Code
  • Water
  • Simulates Liquid Water
  • O(N^2) Algorithm
  • 1850 Lines of C++ Code

49
Obtaining Serial C++ Version of Barnes-Hut
  • Started with Explicitly Parallel Version
    (SPLASH-2)
  • Removed Parallel Constructs to get Serial C
  • Converted to Clean Object-Based C++
  • Major Structural Changes
  • Eliminated Scheduling Code and Data Structures
  • Split a Loop in Force Computation Phase
  • Introduced New Field into Particle Data Structure

50
Obtaining Serial C++ Version of Water
  • Started with Serial C translated from FORTRAN
  • Converted to Clean Object-Based C++
  • Major Structural Change
  • Auxiliary Objects for O(N^2) Phases

51
Commutativity Statistics for Barnes-Hut
[Bar chart: number of pairs tested for commutativity (up to 20), split into independent pairs and symbolically executed pairs, for each parallel extent: Position (3 Methods), Force (6 Methods), Velocity (3 Methods)]
52
Auxiliary Operation Statistics for Barnes-Hut
[Bar chart: auxiliary operation call sites among all call sites (up to 15) for each parallel extent: Position (3 Methods), Force (6 Methods), Velocity (3 Methods)]
53
Performance Results for Barnes-Hut
54
Performance Analysis
  • Motivation Understand Behavior of Parallelized
    Program
  • Instrumented Code to Measure Execution Time
    Breakdowns
  • Parallel Idle - Time Spent Idle in Parallel
    Section
  • Serial Idle - Time Spent Idle in a Serial Section
  • Blocked - Time Spent Waiting to Acquire a Lock
    Held by Another Processor
  • Parallel Compute - Time Spent Doing Useful Work
    in a Parallel Section
  • Serial Compute - Time Spent Doing Useful Work in
    a Serial Section

55
Performance Analysis for Barnes-Hut
[Graphs: cumulative total time (seconds) vs. number of processors (1-32) for Barnes-Hut on DASH; data sets of 8K and 16K particles]
56
Performance Results for Water
[Speedup graphs vs. number of processors (up to 32) for Water on DASH; data sets of 343 and 512 molecules]
57
Performance Results for Computation Replication
Version of Water
[Speedup graphs vs. number of processors for the computation replication version of Water on DASH; data sets of 343 and 512 molecules]
58
Commutativity Statistics for Water
[Bar chart: number of pairs tested for commutativity (up to 15), split into independent pairs and symbolically executed pairs, for each parallel extent: Virtual (3 Methods), Forces (2 Methods), Loading (4 Methods), Momenta (2 Methods), Energy (5 Methods)]
59
Auxiliary Operation Statistics for Water
[Bar chart: auxiliary operation call sites among all call sites (up to 15) for each parallel extent: Virtual (3 Methods), Forces (2 Methods), Loading (4 Methods), Momenta (2 Methods), Energy (5 Methods)]
60
Performance Analysis for Water


[Graphs: cumulative total time (seconds) vs. number of processors (1-32) for Water on DASH; data sets of 343 and 512 molecules]
61
Future Work
  • Relative Commutativity
  • Integrate Other Analysis Frameworks
  • Pointer or Alias Analysis
  • Array Data Dependence Analysis
  • Analysis Problems
  • Synchronization Optimizations
  • Analysis Granularity Optimizations
  • Generation of Self-Tuning Code
  • Message Passing Implementation

62
Related Work
  • Bernstein (IEEE Transactions on Computers 1966)
  • Dependence Analysis for Pointer-Based Data
    Structures
  • Reduction Analysis
  • Ghuloum and Fisher (PPOPP 95)
  • Pinter and Pinter (POPL 92)
  • Callahan (LCPC 91)
  • Commuting Operations in Parallel Languages
  • Rinard and Lam (PPOPP 91)
  • Steele (POPL 90)
  • Barth, Nikhil and Arvind (FPCA 91)
  • Landi, Ryder and Zhang (PLDI 93)
  • Hendren, Hummel and Nicolau (PLDI 92)
  • Plevyak, Karamcheti and Chien (LCPC 93)
  • Chase, Wegman and Zadeck (PLDI 90)
  • Larus and Hilfinger (PLDI 88)
  • Ghiya and Hendren (POPL 96)
  • Ruf (PLDI 95)
  • Wilson and Lam (PLDI 95)
  • Deutsch (PLDI 94)
  • Choi, Burke and Carini (POPL 93)

63
Conclusions
64
Conclusion
  • Commutativity Analysis
  • New Analysis Framework for Parallelizing
    Compilers
  • Basic Idea
  • Recognize Commuting Operations
  • Generate Parallel Code
  • Current Focus
  • Dynamic, Pointer-Based Data Structures
  • Good Initial Results
  • Future
  • Persistent Data
  • Distributed Computations

65
Latest Version of Paper
  • http://www.cs.ucsb.edu/martin/paper/pldi96.ps

66
What if Operations Do Not Commute?
  • Parallel Tree Traversal
  • Example: Distance of Node from Root

class tree {
  int distance;
  tree *left;
  tree *right;
};
void tree::set_distance(int d) {
  distance = d;
  if (left != NULL) left->set_distance(d+1);
  if (right != NULL) right->set_distance(d+1);
}
67
Equivalent Computation with Commuting Operations
void tree::zero_distance() {
  distance = 0;
  if (left != NULL) left->zero_distance();
  if (right != NULL) right->zero_distance();
}

void tree::sum_distance(int d) {
  distance = distance + d;
  if (left != NULL) left->sum_distance(d+1);
  if (right != NULL) right->sum_distance(d+1);
}

void tree::set_distance(int d) {
  zero_distance();
  sum_distance(d);
}
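This rewrite can be checked with a runnable sketch (the 3-node chain, the `set_direct` baseline, and the `same_on_chain` helper are hypothetical, not from the talk): the zero/sum decomposition computes the same distances as the direct, non-commuting assignment.

```cpp
#include <cstddef>

// Sketch: compare the direct set_distance against the commuting
// zero_distance/sum_distance rewrite on a small chain of nodes.
struct tree {
    int distance;
    tree *left, *right;
    void set_direct(int d) {              // original, non-commuting version
        distance = d;
        if (left != NULL) left->set_direct(d + 1);
        if (right != NULL) right->set_direct(d + 1);
    }
    void zero_distance() {
        distance = 0;
        if (left != NULL) left->zero_distance();
        if (right != NULL) right->zero_distance();
    }
    void sum_distance(int d) {
        distance = distance + d;          // += updates commute with each other
        if (left != NULL) left->sum_distance(d + 1);
        if (right != NULL) right->sum_distance(d + 1);
    }
    void set_distance(int d) { zero_distance(); sum_distance(d); }
};

bool same_on_chain() {
    tree c1 = {9, NULL, NULL}, b1 = {9, &c1, NULL}, a1 = {9, &b1, NULL};
    tree c2 = {9, NULL, NULL}, b2 = {9, &c2, NULL}, a2 = {9, &b2, NULL};
    a1.set_direct(0);                     // distances become 0, 1, 2
    a2.set_distance(0);
    return a1.distance == a2.distance && b1.distance == b2.distance
        && c1.distance == c2.distance;
}
```

The decomposition trades one non-commuting assignment for a commuting `+=` phase preceded by a zeroing phase, which is what makes the traversal analyzable.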
68
Theoretical Result
  • For Any Tree Traversal on Data With
  • A Commutative Operator (for example, +) that has
  • A Zero Element (for example, 0)
  • There Exists A Program P such that
  • P Computes the Traversal
  • Commutativity Analysis Can Automatically
    Parallelize P
  • Complexity Results
  • Program P is Asymptotically Optimal if the Data
    Structure is a Perfectly Balanced Tree
  • Program P has Complexity O(N^2) if the Data
    Structure is a Linked List

69
Pure Object-Based Model of Computation
  • Goal
  • Obtain a Powerful, Clean Model of Computation
  • Enable Compiler to Analyze Program
  • Objects Instances of Classes
  • Implement State with Instance Variables
  • Primitive Types from Underlying Language (int,
    ...)
  • References to Other Objects
  • Nested Objects
  • Operations Invocations of Methods
  • Each Operation Has Single Receiver Object