Eliminating Synchronization Bottlenecks in Object-Based Programs Using Adaptive Replication

Martin Rinard
Laboratory for Computer Science
Massachusetts Institute of Technology

Pedro Diniz
Information Sciences Institute
University of Southern California
2. Context

[Diagram: Sequential Program → Parallelizing Compiler (Commutativity Analysis) → Parallel Program with Mutual Exclusion Synchronization]

3. Context

[Diagram repeated from slide 2]

- Basic Idea: View the computation as atomic operations on objects
- If all pairs of operations in a given phase commute (generate the same final result in either execution order), the compiler generates parallel code

4. Context

[Diagram: Sequential Program → Parallelizing Compiler (Commutativity Analysis) → Parallel Program with Mutual Exclusion Synchronization → Synchronization Optimization (Lock Coarsening, Adaptive Replication) → Optimized Parallel Program with Mutual Exclusion Synchronization and Data Replication]
5. Outline

- Example
- Model of Computation
- Basic Issues
- Interaction with Lock Coarsening
- Experimental Results
- Conclusion
6. Example

[Figure: a graph of nodes with weighted edges; the computation traverses the graph and accumulates the edge weights into a single sum object, whose final value is 14.]

7. Example

[Figure: the same graph with all node and edge weights shown.]
8. Outline of Algorithm

- Graph Traversal
- Acquire Lock in Object
- Update Sum
- Release Lock
- In Parallel, Recursively Traverse Left Child and Right Child
9. Parallel Program

class node {
  lock mutex;
  node *left, *right;
  int left_weight;
  int right_weight;
  int sum;
};

void node::traverse(int weight) {
  mutex.acquire();
  sum += weight;
  mutex.release();
  if (left != NULL) spawn left->traverse(left_weight);
  if (right != NULL) spawn right->traverse(right_weight);
}
10–22. Example (Traversal Steps)

[Figure sequence: thirteen animation frames of the parallel traversal. Processors contend for the lock on the shared sum object and update it one at a time; the running sum grows through 2, 3, 4, 6, 8, 9, and 12 to the final value 14.]
23. Synchronization Bottleneck

- Lots of Updates to One Object
- Because of Mutual Exclusion, Updates Execute Sequentially
- Processors Spend Time Waiting to Acquire the Lock in the Object
- Performance Suffers
24. Solution in Example

- Replicate Object that Causes Bottleneck
- Give Each Processor Its Own Local Copy
- Each Processor Updates Local Copy
- Combine Copies at End of Parallel Phase

[Figure: the example graph; the shared sum object (value 0) is marked "Replicate This Object".]
25. Example with Four Processors

[Figure: the shared sum object (value 0) and four local copies, one per processor, each initialized to 0.]

26. Add In First Number

[Figure: after one update each, the local copies for Processors 0–3 hold 2, 1, 2, and 3.]

27. Add In Second Number

[Figure: after a second update each, the local copies hold 3, 3, 3, and 5.]

28. Combine To Get Final Result

[Figure: the local copies are combined into the original object: 3 + 3 + 3 + 5 = 14.]
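The combining scheme on slides 25–28 can be sketched in plain C++ with std::thread (a hypothetical standalone sketch, not the compiler-generated code; `parallel_sum` and the work split are illustrative names):

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Each thread updates its own local copy of the sum object without locking;
// a serial combine phase at the end folds the copies into the final result.
int parallel_sum(const std::vector<std::vector<int>>& work) {
    // One local copy per "processor".
    std::vector<int> local(work.size(), 0);
    std::vector<std::thread> threads;
    for (size_t p = 0; p < work.size(); ++p)
        threads.emplace_back([&, p] {
            for (int w : work[p]) local[p] += w;  // update local copy only
        });
    for (auto& t : threads) t.join();
    // End of parallel phase: combine the copies.
    return std::accumulate(local.begin(), local.end(), 0);
}
```

Splitting the example's weights four ways, `parallel_sum({{2, 1}, {1, 2}, {2, 1}, {3, 2}})` reproduces the local copies 3, 3, 3, and 5 and the combined total 14 from slide 28.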
29. Goal

Automate the Technique of Replicating Objects to Eliminate Synchronization Bottlenecks
30. Object-Based Model of Computation

- Objects
  - Instance variables (left, right, sum, ...) represent the state of each object
- Operations on Receiver Objects
  - In the example, traverse is an operation
  - The updated graph node is the receiver object
- Operation Execution
  - Updates instance variables in the receiver
  - Invokes other operations

[Figure: two small example objects with their instance variable values.]
31. Parallel Execution

Execution of an application consists of an alternating sequence of serial phases and parallel phases:

Serial Phase → Parallel Phase → Serial Phase → Parallel Phase → Serial Phase
32. Operations in Parallel Phases
- Instance variable updates execute atomically
- Each object has mutual exclusion lock
- Lock acquired before updates
- Lock released after updates
- Invoked operations execute in parallel
33. Legality of Replicating Objects

- Is it always legal to replicate objects?
- No. All updates to an object in a parallel phase must be replicatable.
- Updates of the form v = v ⊕ exp are replicatable, where
  - ⊕ is a commutative, associative operator with a zero, and
  - the variables in exp are not updated during the parallel phase
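A minimal sketch of why this form is replicatable (the function name is illustrative, not from the paper): replicas start at the operator's zero, absorb updates independently, and are folded back with the same operator, so the result is independent of how updates were distributed.

```cpp
#include <vector>

// Here the operator is +, whose zero (identity) is 0. Because + is
// commutative and associative, folding the replicas into the original
// in any order yields the same final value.
int fold_replicas(int original, const std::vector<int>& replicas) {
    int v = original;
    for (int r : replicas) v = v + r;  // v = v (op) exp, with (op) = +
    return v;
}
```

For example, `fold_replicas(0, {5, 9})` and `fold_replicas(0, {9, 5})` both produce 14. An update such as v = exp - v would not qualify, since subtraction is neither commutative nor associative.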
34. Which Objects to Replicate?

- Why Not Just Replicate All Replicatable Objects?
- Some Objects Don't Cause Bottlenecks
- Replication Overhead
  - Space for Copies
  - Time to Create and Initialize Copies
- Goal
  - Identify Objects With High Contention
  - Replicate Only Those Objects
35. Basic Approach

- Dynamically Measure Contention At Each Object
- If Contention Is High
  - Replicate Object (Dynamically)
  - Perform Update on Local Copy
- If No Contention
  - Perform Update on Original Object
- Pay Replication Overhead Only When There is a Payoff in Parallelism
36. Details

- What is the replication policy?
  - The processor attempts to acquire the lock.
  - It creates a local copy only if it fails to acquire the lock.
- Where are replicas stored?
  - In a hash table.
- Can't the space overhead be too high?
  - No. Impose a space limit.
  - If a replication would exceed the space limit, don't replicate the object; wait for the lock.
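The policy above can be sketched in C++ (an illustrative sketch, not the compiler's output: `Obj`, `kLimit`, and the per-thread `replicas` table are hypothetical names):

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>

struct Obj { std::mutex mtx; int sum = 0; };

constexpr std::size_t kLimit = 1 << 20;               // space limit for replicas
thread_local std::unordered_map<Obj*, int> replicas;  // per-thread hash table
thread_local std::size_t allocated = 0;

void update(Obj* o, int w) {
    auto it = replicas.find(o);
    if (it != replicas.end()) { it->second += w; return; }  // existing copy
    if (o->mtx.try_lock()) {            // lock free: update the original
        o->sum += w;
        o->mtx.unlock();
    } else if (allocated + sizeof(int) <= kLimit) {  // contended: replicate
        allocated += sizeof(int);
        replicas[o] += w;               // new copy, zero-initialized
    } else {                            // over the space limit: wait for lock
        o->mtx.lock();
        o->sum += w;
        o->mtx.unlock();
    }
}

// At the end of the parallel phase, each thread folds its replicas back
// into the original object and deallocates them.
void combine(Obj* o) {
    auto it = replicas.find(o);
    if (it != replicas.end()) { o->sum += it->second; replicas.erase(it); }
}
```

The key property is that replication overhead is paid only on a failed try_lock, i.e., only when there is actual contention.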
37. More Details

- What happens at the end of a parallel phase?
  - Generated code traverses the hash tables
  - Finds replicas
  - Combines their contributions into the original objects
  - Deallocates the replicas

[Figure: Processors 0 and 1 each hold a hash table of replicas (values 6 and 8); the contributions are combined into the original object, which ends at 14.]
41. Generated Code

void node::traverse(int weight) {
  node *replica = lookup(this);          // Check for existing copy
  if (replica) {
    replica->replicaTraverse(weight);    // Update existing copy
  } else if (mutex.tryAcquire()) {       // Try to acquire lock
 1: sum += weight;                       // Perform update on original object
    mutex.release();
    if (left != NULL) spawn left->traverse(leftWeight);
    if (right != NULL) spawn right->traverse(rightWeight);
  } else {                               // No existing copy, failed to acquire lock
    replica = this->replicate();         // Try to replicate object
    if (replica) replica->replicaTraverse(weight);  // Update new copy
    else { mutex.acquire(); goto 1; }    // Replication failed, wait for lock
  }
}
42. Updating A Replica

void node::replicaTraverse(int weight) {
  sum += weight;
  if (left != NULL) spawn left->traverse(leftWeight);
  if (right != NULL) spawn right->traverse(rightWeight);
}

Updates Execute Without Synchronization
43. Replicating An Object

node *node::replicate() {
  // Check to see if the space limit would be exceeded
  if (allocated + sizeof(node) > limit) return NULL;
  // Allocate new copy
  node *replica = new node;
  allocated += sizeof(node);
  // Zero out updated fields
  replica->sum = 0;
  // Copy other fields
  replica->left = left;
  replica->leftWeight = leftWeight;
  replica->right = right;
  replica->rightWeight = rightWeight;
  // Insert replica into hash table
  insert(this, replica);
  return replica;
}
44. Adaptive Replication Summary

- Static Analysis to Discover Replicatable Objects
- Dynamic Measurement of Contention to Determine Which Objects to Replicate
- Generated Code
  - Measures Contention
  - Replicates Objects
  - Updates Original and Replica Objects
  - Combines Results in Replicas Back Into Original Objects
45. Lock Coarsening

Before:

  obj.mutex.acquire();
  update obj
  obj.mutex.release();
  unsynchronized computation
  obj.mutex.acquire();
  update obj
  obj.mutex.release();
  unsynchronized computation
  obj.mutex.acquire();
  update obj
  obj.mutex.release();

After:

  obj.mutex.acquire();
  update obj
  unsynchronized computation
  update obj
  unsynchronized computation
  update obj
  obj.mutex.release();
46. Lock Coarsening

Before:

  while (c) {
    unsynchronized computation
    obj.mutex.acquire();
    update obj
    obj.mutex.release();
  }

After:

  obj.mutex.acquire();
  while (c) {
    unsynchronized computation
    update obj
  }
  obj.mutex.release();
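The loop transformation can be sketched in plain C++ with std::mutex (an illustrative sketch with hypothetical names, not the compiler's output): coarsening hoists the acquire/release out of the loop, trading lock overhead for a longer critical section.

```cpp
#include <mutex>

std::mutex mtx;
int shared_sum = 0;

// Before coarsening: one acquire/release per iteration.
void fine_grained(int n) {
    for (int i = 0; i < n; ++i) {
        int local = i * i;   // unsynchronized computation
        mtx.lock();
        shared_sum += local; // update obj
        mtx.unlock();
    }
}

// After coarsening: one acquire/release for the whole loop. Lock overhead
// drops, but the critical section now also covers the unsynchronized
// computation, which is the serialization risk discussed on the next slide.
void coarsened(int n) {
    mtx.lock();
    for (int i = 0; i < n; ++i) {
        int local = i * i;
        shared_sum += local;
    }
    mtx.unlock();
}
```

Both versions compute the same result; they differ only in how many lock operations execute and how long the lock is held.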
47. Lock Coarsening Tradeoffs

- Advantage
  - Fewer Executed Lock Constructs (Acquires, Releases)
  - Less Lock Overhead
- Disadvantage
  - Critical Sections Larger
  - May Cause Additional Serialization
  - In Some Cases, Completely Serializes the Parallel Phase
48. Lock Coarsening Tradeoffs With Adaptive Replication

- Advantages
  - Fewer Executed Lock and Replication Constructs (Replica Lookups, Lock Acquires and Releases)
  - Less Lock and Replication Overhead
  - No Additional Serialization
- Disadvantage
  - Potential For Increased Memory Usage
49. Result

- Automatically Generated Code That Replicates Objects to Eliminate Synchronization Bottlenecks
- Replication Policy Dynamically Adapts to the Amount of Contention for Each Object on Each Processor
- Lock Coarsening Plus Adaptive Replication Increases Granularity and Reduces Overhead Without Increasing Serialization
50. Experimental Results

- Prototype Implementation
  - In the Context of a Parallelizing Compiler
  - Commutativity Analysis
  - Lock Coarsening, Adaptive Replication
- Four Versions
  - Adaptive Replication, Lock Coarsening
  - Adaptive Replication, No Lock Coarsening
  - No Replication, Best Lock Coarsening
  - Full Replication, Lock Coarsening
51. Applications and Hardware Platform

- Three Applications
  - Water
  - Barnes-Hut
  - String
- Hardware Platform
  - SGI Challenge XL
  - 24 100 MHz MIPS R4400 Processors
  - IRIX Operating System, Version 6.2
  - MipsPro Compiler, Version 7.1
52. Speedups for Water

[Chart: speedups for four versions: Adaptive Replication with Lock Coarsening; Adaptive Replication, No Lock Coarsening; No Replication, No Lock Coarsening; Always Replicate, with Lock Coarsening.]

53. Time Breakdowns for Water

[Chart: execution time breakdowns for the same four versions.]

54. Peak Memory for Water

[Chart: peak memory usage for the same four versions.]
55. Speedups for Barnes-Hut

[Chart: speedups for four versions: Adaptive Replication with Lock Coarsening; Adaptive Replication, No Lock Coarsening; No Replication, with Lock Coarsening; Always Replicate, with Lock Coarsening.]

56. Time Breakdowns for Barnes-Hut

[Chart: execution time breakdowns for the same four versions.]

57. Peak Memory for Barnes-Hut

[Chart: peak memory usage for the same four versions.]
58. Speedups for String

[Chart: speedups for four versions: Adaptive Replication with Lock Coarsening; Adaptive Replication, No Lock Coarsening; No Replication, No Lock Coarsening; Always Replicate, with Lock Coarsening.]

59. Time Breakdowns for String

[Chart: execution time breakdowns for the same four versions.]

60. Peak Memory for String

[Chart: peak memory usage for the same four versions.]
61. Related Work

- Reduction Analysis for Loop Nests
  - Pinter and Pinter (POPL 91)
  - Fisher and Ghuloum (PLDI 94)
  - Callahan (LCPC 91)
  - Hall, Amarasinghe, Murphy, Liao, and Lam (Supercomputing 95)
- Replication for Concurrent Reads (Caching)
62. Conclusion

- Basic Idea: Replicate Objects to Eliminate Synchronization Bottlenecks
- Adaptive: Dynamically Identifies and Replicates High-Contention Objects Only
- Synergistic Interaction with Lock Coarsening
- Robust: Enables Good Performance Without Running the Risk of Excessive Memory Consumption or Run-Time Overhead
- Algorithm for Analysis and Transformation of Explicitly Parallel Programs
63. Context

- Commutativity Analysis (IPPS 96, PLDI 96)
- Semantic Foundations (EuroPar 96)
- Lock Optimizations
  - Lock Coarsening (LCPC 96)
  - General Transformations (POPL 97)
  - Dynamic Feedback (PLDI 97)
- Optimistic Synchronization (PPoPP 97)
- Adaptive Replication (ICS 99)
64. Maximum Speedup Comparison

[Chart: maximum speedup comparison across applications and versions.]