Eliminating Synchronization Bottlenecks in Object-Based Programs Using Adaptive Replication

Martin Rinard
Laboratory for Computer Science
Massachusetts Institute of Technology

Pedro Diniz
Information Sciences Institute
University of Southern California
2. Context

[Diagram: Sequential Program → Parallelizing Compiler (Commutativity Analysis) → Parallel Program with Mutual Exclusion Synchronization]

3. Context

[Diagram repeated from slide 2]

- Basic Idea: View the computation as atomic operations on objects
- If all pairs of operations in a given phase commute (generate the same final result in either execution order), the compiler generates parallel code

4. Context

[Diagram: Sequential Program → Parallelizing Compiler (Commutativity Analysis) → Parallel Program with Mutual Exclusion Synchronization → Synchronization Optimization (Lock Coarsening, Adaptive Replication) → Optimized Parallel Program with Mutual Exclusion Synchronization and Data Replication]
5. Outline

- Example
- Model of Computation
- Basic Issues
- Interaction with Lock Coarsening
- Experimental Results
- Conclusion
6. Example

[Figure: a graph of nodes with weighted edges; the computation traverses the graph and accumulates the edge weights into a single sum object, whose final value is 14.]

7. Example

[Figure: the same graph with all node and edge weights shown.]
8. Outline of Algorithm

- Graph Traversal
- Acquire Lock in Object
- Update Sum
- Release Lock
- In Parallel, Recursively Traverse Left Child and Right Child
9. Parallel Program

class node {
  lock mutex;
  node *left, *right;
  int left_weight;
  int right_weight;
  int sum;
};

void node::traverse(int weight) {
  mutex.acquire();
  sum += weight;
  mutex.release();
  if (left != NULL) spawn left->traverse(left_weight);
  if (right != NULL) spawn right->traverse(right_weight);
}
10–22. Example (Traversal Steps)

[Figure sequence: thirteen animation frames of the parallel traversal. Processors contend for the lock on the shared sum object and update it one at a time; the running sum grows through 2, 3, 4, 6, 8, 9, and 12 to the final value 14.]
23. Synchronization Bottleneck

- Lots of Updates to One Object
- Because of Mutual Exclusion, Updates Execute Sequentially
- Processors Spend Time Waiting to Acquire the Lock in the Object
- Performance Suffers
24. Solution in Example

- Replicate Object that Causes Bottleneck
- Give Each Processor Its Own Local Copy
- Each Processor Updates Local Copy
- Combine Copies at End of Parallel Phase

[Figure: the example graph; the shared sum object (value 0) is marked "Replicate This Object".]
25. Example with Four Processors

[Figure: the shared sum object (value 0) and four local copies, one per processor, each initialized to 0.]

26. Add In First Number

[Figure: after one update each, the local copies for Processors 0–3 hold 2, 1, 2, and 3.]

27. Add In Second Number

[Figure: after a second update each, the local copies hold 3, 3, 3, and 5.]

28. Combine To Get Final Result

[Figure: the local copies are combined into the original object: 3 + 3 + 3 + 5 = 14.]
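The combining scheme on slides 25–28 can be sketched in plain C++ with std::thread (a hypothetical standalone sketch, not the compiler-generated code; `parallel_sum` and the work split are illustrative names):

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Each thread updates its own local copy of the sum object without locking;
// a serial combine phase at the end folds the copies into the final result.
int parallel_sum(const std::vector<std::vector<int>>& work) {
    // One local copy per "processor".
    std::vector<int> local(work.size(), 0);
    std::vector<std::thread> threads;
    for (size_t p = 0; p < work.size(); ++p)
        threads.emplace_back([&, p] {
            for (int w : work[p]) local[p] += w;  // update local copy only
        });
    for (auto& t : threads) t.join();
    // End of parallel phase: combine the copies.
    return std::accumulate(local.begin(), local.end(), 0);
}
```

Splitting the example's weights four ways, `parallel_sum({{2, 1}, {1, 2}, {2, 1}, {3, 2}})` reproduces the local copies 3, 3, 3, and 5 and the combined total 14 from slide 28.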
29. Goal

Automate the Technique of Replicating Objects to Eliminate Synchronization Bottlenecks
30. Object-Based Model of Computation

- Objects
  - Instance variables (left, right, sum, ...) represent the state of each object
- Operations on Receiver Objects
  - In the example, traverse is an operation
  - The updated graph node is the receiver object
- Operation Execution
  - Updates instance variables in the receiver
  - Invokes other operations

[Figure: two small example objects with their instance variable values.]
31. Parallel Execution

Execution of an application consists of an alternating sequence of serial phases and parallel phases:

Serial Phase → Parallel Phase → Serial Phase → Parallel Phase → Serial Phase
32. Operations in Parallel Phases
- Instance variable updates execute atomically
- Each object has mutual exclusion lock
- Lock acquired before updates
- Lock released after updates
- Invoked operations execute in parallel
33. Legality of Replicating Objects

- Is it always legal to replicate objects?
- No. All updates to an object in a parallel phase must be replicatable.
- Updates of the form v = v ⊕ exp are replicatable, where
  - ⊕ is a commutative, associative operator with a zero, and
  - the variables in exp are not updated during the parallel phase
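A minimal sketch of why this form is replicatable (the function name is illustrative, not from the paper): replicas start at the operator's zero, absorb updates independently, and are folded back with the same operator, so the result is independent of how updates were distributed.

```cpp
#include <vector>

// Here the operator is +, whose zero (identity) is 0. Because + is
// commutative and associative, folding the replicas into the original
// in any order yields the same final value.
int fold_replicas(int original, const std::vector<int>& replicas) {
    int v = original;
    for (int r : replicas) v = v + r;  // v = v (op) exp, with (op) = +
    return v;
}
```

For example, `fold_replicas(0, {5, 9})` and `fold_replicas(0, {9, 5})` both produce 14. An update such as v = exp - v would not qualify, since subtraction is neither commutative nor associative.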
34. Which Objects to Replicate?

- Why Not Just Replicate All Replicatable Objects?
- Some Objects Don't Cause Bottlenecks
- Replication Overhead
  - Space for Copies
  - Time to Create and Initialize Copies
- Goal
  - Identify Objects With High Contention
  - Replicate Only Those Objects
35. Basic Approach

- Dynamically Measure Contention At Each Object
- If Contention Is High
  - Replicate Object (Dynamically)
  - Perform Update on Local Copy
- If No Contention
  - Perform Update on Original Object
- Pay Replication Overhead Only When There is a Payoff in Parallelism
36. Details

- What is the replication policy?
  - The processor attempts to acquire the lock.
  - It creates a local copy only if it fails to acquire the lock.
- Where are replicas stored?
  - In a hash table.
- Can't the space overhead be too high?
  - No. Impose a space limit.
  - If a replication would exceed the space limit, don't replicate the object; wait for the lock.
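The policy above can be sketched in C++ (an illustrative sketch, not the compiler's output: `Obj`, `kLimit`, and the per-thread `replicas` table are hypothetical names):

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>

struct Obj { std::mutex mtx; int sum = 0; };

constexpr std::size_t kLimit = 1 << 20;               // space limit for replicas
thread_local std::unordered_map<Obj*, int> replicas;  // per-thread hash table
thread_local std::size_t allocated = 0;

void update(Obj* o, int w) {
    auto it = replicas.find(o);
    if (it != replicas.end()) { it->second += w; return; }  // existing copy
    if (o->mtx.try_lock()) {            // lock free: update the original
        o->sum += w;
        o->mtx.unlock();
    } else if (allocated + sizeof(int) <= kLimit) {  // contended: replicate
        allocated += sizeof(int);
        replicas[o] += w;               // new copy, zero-initialized
    } else {                            // over the space limit: wait for lock
        o->mtx.lock();
        o->sum += w;
        o->mtx.unlock();
    }
}

// At the end of the parallel phase, each thread folds its replicas back
// into the original object and deallocates them.
void combine(Obj* o) {
    auto it = replicas.find(o);
    if (it != replicas.end()) { o->sum += it->second; replicas.erase(it); }
}
```

The key property is that replication overhead is paid only on a failed try_lock, i.e., only when there is actual contention.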
37. More Details

- What happens at the end of a parallel phase?
  - Generated code traverses the hash tables
  - Finds replicas
  - Combines their contributions into the original objects
  - Deallocates the replicas

[Figure: Processors 0 and 1 each hold a hash table of replicas (values 6 and 8); the contributions are combined into the original object, which ends at 14.]
41. Generated Code

void node::traverse(int weight) {
  node *replica = lookup(this);          // Check for existing copy
  if (replica) {
    replica->replicaTraverse(weight);    // Update existing copy
  } else if (mutex.tryAcquire()) {       // Try to acquire lock
 1: sum += weight;                       // Perform update on original object
    mutex.release();
    if (left != NULL) spawn left->traverse(leftWeight);
    if (right != NULL) spawn right->traverse(rightWeight);
  } else {                               // No existing copy, failed to acquire lock
    replica = this->replicate();         // Try to replicate object
    if (replica) replica->replicaTraverse(weight);  // Update new copy
    else { mutex.acquire(); goto 1; }    // Replication failed, wait for lock
  }
}
42. Updating A Replica

void node::replicaTraverse(int weight) {
  sum += weight;
  if (left != NULL) spawn left->traverse(leftWeight);
  if (right != NULL) spawn right->traverse(rightWeight);
}

Updates Execute Without Synchronization
43. Replicating An Object

node *node::replicate() {
  // Check to see if the space limit would be exceeded
  if (allocated + sizeof(node) > limit) return NULL;
  // Allocate new copy
  node *replica = new node;
  allocated += sizeof(node);
  // Zero out updated fields
  replica->sum = 0;
  // Copy other fields
  replica->left = left;
  replica->leftWeight = leftWeight;
  replica->right = right;
  replica->rightWeight = rightWeight;
  // Insert replica into hash table
  insert(this, replica);
  return replica;
}
44. Adaptive Replication Summary

- Static Analysis to Discover Replicatable Objects
- Dynamic Measurement of Contention to Determine Which Objects to Replicate
- Generated Code
  - Measures Contention
  - Replicates Objects
  - Updates Original and Replica Objects
  - Combines Results in Replicas Back Into Original Objects
45. Lock Coarsening

Before:

  obj.mutex.acquire();
  update obj
  obj.mutex.release();
  unsynchronized computation
  obj.mutex.acquire();
  update obj
  obj.mutex.release();
  unsynchronized computation
  obj.mutex.acquire();
  update obj
  obj.mutex.release();

After:

  obj.mutex.acquire();
  update obj
  unsynchronized computation
  update obj
  unsynchronized computation
  update obj
  obj.mutex.release();
46. Lock Coarsening

Before:

  while (c) {
    unsynchronized computation
    obj.mutex.acquire();
    update obj
    obj.mutex.release();
  }

After:

  obj.mutex.acquire();
  while (c) {
    unsynchronized computation
    update obj
  }
  obj.mutex.release();
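The loop transformation can be sketched in plain C++ with std::mutex (an illustrative sketch with hypothetical names, not the compiler's output): coarsening hoists the acquire/release out of the loop, trading lock overhead for a longer critical section.

```cpp
#include <mutex>

std::mutex mtx;
int shared_sum = 0;

// Before coarsening: one acquire/release per iteration.
void fine_grained(int n) {
    for (int i = 0; i < n; ++i) {
        int local = i * i;   // unsynchronized computation
        mtx.lock();
        shared_sum += local; // update obj
        mtx.unlock();
    }
}

// After coarsening: one acquire/release for the whole loop. Lock overhead
// drops, but the critical section now also covers the unsynchronized
// computation, which is the serialization risk discussed on the next slide.
void coarsened(int n) {
    mtx.lock();
    for (int i = 0; i < n; ++i) {
        int local = i * i;
        shared_sum += local;
    }
    mtx.unlock();
}
```

Both versions compute the same result; they differ only in how many lock operations execute and how long the lock is held.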
47. Lock Coarsening Tradeoffs

- Advantage
  - Fewer Executed Lock Constructs (Acquires, Releases)
  - Less Lock Overhead
- Disadvantage
  - Critical Sections Larger
  - May Cause Additional Serialization
  - In Some Cases, Completely Serializes the Parallel Phase
48. Lock Coarsening Tradeoffs With Adaptive Replication

- Advantages
  - Fewer Executed Lock and Replication Constructs (Replica Lookups, Lock Acquires and Releases)
  - Less Lock and Replication Overhead
  - No Additional Serialization
- Disadvantage
  - Potential For Increased Memory Usage
49. Result

- Automatically Generated Code That Replicates Objects to Eliminate Synchronization Bottlenecks
- Replication Policy Dynamically Adapts to the Amount of Contention for Each Object on Each Processor
- Lock Coarsening Plus Adaptive Replication Increases Granularity and Reduces Overhead Without Increasing Serialization
50. Experimental Results

- Prototype Implementation
  - In the Context of a Parallelizing Compiler
  - Commutativity Analysis
  - Lock Coarsening, Adaptive Replication
- Four Versions
  - Adaptive Replication, Lock Coarsening
  - Adaptive Replication, No Lock Coarsening
  - No Replication, Best Lock Coarsening
  - Full Replication, Lock Coarsening
51. Applications and Hardware Platform

- Three Applications
  - Water
  - Barnes-Hut
  - String
- Hardware Platform
  - SGI Challenge XL
  - 24 100 MHz MIPS R4400 Processors
  - IRIX Operating System, Version 6.2
  - MipsPro Compiler, Version 7.1
52. Speedups for Water

[Chart: speedups for four versions: Adaptive Replication with Lock Coarsening; Adaptive Replication, No Lock Coarsening; No Replication, No Lock Coarsening; Always Replicate, with Lock Coarsening.]

53. Time Breakdowns for Water

[Chart: execution time breakdowns for the same four versions.]

54. Peak Memory for Water

[Chart: peak memory usage for the same four versions.]
55. Speedups for Barnes-Hut

[Chart: speedups for four versions: Adaptive Replication with Lock Coarsening; Adaptive Replication, No Lock Coarsening; No Replication, with Lock Coarsening; Always Replicate, with Lock Coarsening.]

56. Time Breakdowns for Barnes-Hut

[Chart: execution time breakdowns for the same four versions.]

57. Peak Memory for Barnes-Hut

[Chart: peak memory usage for the same four versions.]
58. Speedups for String

[Chart: speedups for four versions: Adaptive Replication with Lock Coarsening; Adaptive Replication, No Lock Coarsening; No Replication, No Lock Coarsening; Always Replicate, with Lock Coarsening.]

59. Time Breakdowns for String

[Chart: execution time breakdowns for the same four versions.]

60. Peak Memory for String

[Chart: peak memory usage for the same four versions.]
61. Related Work

- Reduction Analysis for Loop Nests
  - Pinter and Pinter (POPL 91)
  - Fisher and Ghuloum (PLDI 94)
  - Callahan (LCPC 91)
  - Hall, Amarasinghe, Murphy, Liao, and Lam (Supercomputing 95)
- Replication for Concurrent Reads (Caching)
62. Conclusion

- Basic Idea: Replicate Objects to Eliminate Synchronization Bottlenecks
- Adaptive: Dynamically Identifies and Replicates High-Contention Objects Only
- Synergistic Interaction with Lock Coarsening
- Robust: Enables Good Performance Without Running the Risk of Excessive Memory Consumption or Run-Time Overhead
- Algorithm for Analysis and Transformation of Explicitly Parallel Programs
63. Context

- Commutativity Analysis (IPPS 96, PLDI 96)
- Semantic Foundations (EuroPar 96)
- Lock Optimizations
  - Lock Coarsening (LCPC 96)
  - General Transformations (POPL 97)
  - Dynamic Feedback (PLDI 97)
- Optimistic Synchronization (PPoPP 97)
- Adaptive Replication (ICS 99)
64. Maximum Speedup Comparison

[Chart: maximum speedup comparison across applications and versions.]