Title: Analyses and Optimizations for Multithreaded Programs

1. Analyses and Optimizations for Multithreaded Programs
- Martin Rinard, Alex Salcianu, Brian Demsky (MIT Laboratory for Computer Science)
- John Whaley (IBM Tokyo Research Laboratory)
2. Motivation
- Threads are Ubiquitous
- Parallel Programming for Performance
- Manage Multiple Connections
- System Structuring Mechanism
- Overhead
- Thread Management
- Synchronization
- Opportunities
- Improved Memory Management
3. What This Talk is About
- New Abstraction: Parallel Interaction Graph
  - Points-To Information
  - Reachability and Escape Information
  - Interaction Information
    - Caller-Callee Interactions
    - Starter-Startee Interactions
  - Action Ordering Information
- Analysis Algorithm
- Analysis Uses (synchronization elimination, stack allocation, per-thread heap allocation)
4. Outline
- Example
- Analysis Representation and Algorithm
- Lightweight Threads
- Results
- Conclusion
5. Sum Sequence of Numbers
[Figure: the sequence 9, 8, 1, 5, 3, 7, 2, 6]

6. Group in Subsequences

7. Sum Subsequences (in Parallel)

8.-12. Add Sums Into Accumulator
[Figure sequence: the accumulator value grows 0, 17, 23, 33, 41 as each subsequence sum is added]
13. Common Schema
- Set of tasks
- Chunk tasks to increase granularity
- Tasks have both:
  - Independent computation
  - Updates to shared data
14. Realization in Java

    class Accumulator {
        int value = 0;
        synchronized void add(int v) { value += v; }
    }
15. Realization in Java

    class Task extends Thread {
        Vector work;
        Accumulator dest;
        Task(Vector w, Accumulator d) { work = w; dest = d; }
        public void run() {
            int sum = 0;
            Enumeration e = work.elements();
            while (e.hasMoreElements())
                sum += ((Integer) e.nextElement()).intValue();
            dest.add(sum);
        }
    }

[Figure: Task object whose work field points to a Vector and whose dest field points to an Accumulator]

16. Realization in Java
[Same code; the figure adds an Enumeration object over the Vector]
17. Realization in Java

    void generateTask(int l, int u, Accumulator a) {
        Vector v = new Vector();
        for (int j = l; j < u; j++) v.addElement(new Integer(j));
        Task t = new Task(v, a);
        t.start();
    }

    void generate(int n, int m, Accumulator a) {
        for (int i = 0; i < n; i++) generateTask(i*m, (i+1)*m, a);
    }
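Assembled into one runnable sketch: the classes and methods are from the slides, while runSum, the Task array, the return value of generateTask, and the join() calls are additions for this example.

```java
import java.util.Enumeration;
import java.util.Vector;

class Accumulator {
    int value = 0;
    synchronized void add(int v) { value += v; }
}

class Task extends Thread {
    Vector work;
    Accumulator dest;
    Task(Vector w, Accumulator d) { work = w; dest = d; }
    public void run() {
        int sum = 0;
        Enumeration e = work.elements();
        while (e.hasMoreElements())
            sum += ((Integer) e.nextElement()).intValue();
        dest.add(sum);
    }
}

class SumDemo {
    // As on the slide, but returns the started Task so the caller can join it.
    static Task generateTask(int l, int u, Accumulator a) {
        Vector v = new Vector();
        for (int j = l; j < u; j++) v.addElement(Integer.valueOf(j));
        Task t = new Task(v, a);
        t.start();
        return t;
    }

    // Chunk [0, n*m) into n tasks of m numbers each and wait for all of them.
    static int runSum(int n, int m) throws InterruptedException {
        Accumulator a = new Accumulator();
        Task[] tasks = new Task[n];
        for (int i = 0; i < n; i++) tasks[i] = generateTask(i * m, (i + 1) * m, a);
        for (Task t : tasks) t.join();
        return a.value;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runSum(4, 2)); // sum of 0..7 is 28
    }
}
```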
18.-25. Task Generation
[Figure sequence: generate() builds a Vector of Integers, wraps it in a Task whose work field points to the Vector and whose dest field points to the shared Accumulator, then starts the Task; the pattern repeats, producing several Tasks that all share the one Accumulator]
26. Analysis

27. Analysis Overview
- Interprocedural
- Interthread
- Flow-sensitive
  - Statement ordering within thread
  - Action ordering between threads
- Compositional, Bottom-Up
- Explicitly Represent Potential Interactions Between Analyzed and Unanalyzed Parts
- Partial Program Analysis
28. Analysis Result for run Method

    public void run() {
        int sum = 0;
        Enumeration e = work.elements();
        while (e.hasMoreElements())
            sum += ((Integer) e.nextElement()).intValue();
        dest.add(sum);
    }

[Figure: points-to graph in which this points to the Task node, whose work and dest fields point to Vector and Accumulator nodes; an Enumeration node points into the Vector]

- Abstraction: Points-to Graph
  - Nodes Represent Objects
  - Edges Represent References

29. Analysis Result for run Method (same code and figure)
- Inside Nodes
  - Objects Created Within Current Analysis Scope
  - One Inside Node Per Allocation Site
  - Represents All Objects Created At That Site

30. Analysis Result for run Method (same code and figure)
- Outside Nodes
  - Objects Created Outside Current Analysis Scope
  - Objects Accessed Via References Created Outside Current Analysis Scope

31. Analysis Result for run Method (same code and figure)
- Outside Nodes
  - One per Static Class Field
  - One per Parameter
  - One per Load Statement
    - Represents Objects Loaded at That Statement

32. Analysis Result for run Method (same code and figure)
- Inside Edges
  - References Created Inside Current Analysis Scope

33. Analysis Result for run Method (same code and figure)
- Outside Edges
  - References Created Outside Current Analysis Scope
  - Potential Interactions in Which Analyzed Part Reads Reference Created in Unanalyzed Part
34. Concept of Escaped Node
- Escaped Nodes Represent Objects Accessible Outside Current Analysis Scope
  - parameter nodes, load nodes
  - static class field nodes
  - nodes passed to unanalyzed methods
  - nodes reachable from unanalyzed but started threads
  - nodes reachable from escaped nodes
- Node is Captured if it is Not Escaped
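The distinction can be illustrated with a hedged example (EscapeDemo and its method names are illustrative, not from the paper):

```java
import java.util.Vector;

class EscapeDemo {
    static Vector global;  // static field: anything stored here escapes

    static int captured() {
        Vector v = new Vector();      // v never leaves this method: captured
        v.addElement(Integer.valueOf(1));
        return v.size();              // only a primitive leaves, not the object
    }

    static Vector escaped() {
        Vector v = new Vector();      // escapes: returned to the caller
        global = v;                   // also reachable from a static field
        return v;
    }

    public static void main(String[] args) {
        // The Vector inside captured() is a candidate for stack allocation;
        // the one inside escaped() is not.
        System.out.println(captured());
        System.out.println(escaped() == global);
    }
}
```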
35. Why Escaped Concept is Important
- Completeness of Analysis Information
  - Complete information for captured nodes
  - Potentially incomplete for escaped nodes
- Lifetime Implications
  - Captured nodes are inaccessible when analyzed part of the program terminates
- Memory Management Optimizations
  - Stack allocation
  - Per-Thread Heap Allocation
36. Intrathread Dataflow Analysis
- Computes a points-to escape graph for each program point
- Points-to escape graph is a triple <I, O, e>
  - I: set of inside edges
  - O: set of outside edges
  - e: escape information for each node
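A minimal sketch of that triple as a data structure, assuming string node ids (all names here are illustrative, not the Flex implementation):

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// One labeled reference in the graph: src.field may point to dst.
class Edge {
    final String src, field, dst;
    Edge(String s, String f, String d) { src = s; field = f; dst = d; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Edge)) return false;
        Edge e = (Edge) o;
        return src.equals(e.src) && field.equals(e.field) && dst.equals(e.dst);
    }
    @Override public int hashCode() { return Objects.hash(src, field, dst); }
}

// The <I, O, e> triple from the slide.
class PointsToEscapeGraph {
    final Set<Edge> inside = new HashSet<>();    // I: references created in scope
    final Set<Edge> outside = new HashSet<>();   // O: references read from outside
    final Set<String> escaped = new HashSet<>(); // e: nodes known to escape

    boolean captured(String node) { return !escaped.contains(node); }
}
```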
37. Dataflow Analysis
- Initial state:
  - I: formals point to parameter nodes, classes point to class nodes
  - O = Ø
- Transfer functions:
  - I' = (I - KillI) ∪ GenI
  - O' = O ∪ GenO
- Confluence operator is ∪
38. Intraprocedural Analysis
- Must define transfer functions for:
  - copy statement: l = v
  - load statement: l1 = l2.f
  - store statement: l1.f = l2
  - return statement: return l
  - object creation site: l = new cl
  - method invocation: l = l0.op(l1, ..., lk)
39. copy statement l = v
- KillI = edges(I, l)
- GenI = {l} × succ(I, v)
- I' = (I - KillI) ∪ GenI
[Figure: existing edges from l and v]

40. copy statement l = v (same rules)
[Figure: generated edge from l to the node v points to]
41. load statement l1 = l2.f
- SE = {n2 in succ(I, l2) . escaped(n2)}
- SI = ∪{succ(I, n2, f) . n2 in succ(I, l2)}
- case 1: l2 does not point to an escaped node (SE = Ø)
  - KillI = edges(I, l1)
  - GenI = {l1} × SI
[Figure: existing edges from l1 and from l2 through field f]

42. load statement l1 = l2.f (same rules)
[Figure: generated edges from l1 to the nodes l2.f points to]
43. load statement l1 = l2.f
- case 2: l2 does point to an escaped node (SE ≠ Ø)
  - KillI = edges(I, l1)
  - GenI = {l1} × (SI ∪ {n})
  - GenO = (SE × {f}) × {n}
  - where n is the load node for l1 = l2.f
[Figure: existing edges from l1 and l2]

44. load statement l1 = l2.f (same rules)
[Figure: generated outside edge labeled f from l2's escaped node to the load node n, plus an edge from l1 to n]
45. store statement l1.f = l2
- GenI = (succ(I, l1) × {f}) × succ(I, l2)
- I' = I ∪ GenI
[Figure: existing edges from l1 and l2]

46. store statement l1.f = l2 (same rules)
[Figure: generated f edges from l1's nodes to l2's nodes]
47. object creation site l = new cl
- KillI = edges(I, l)
- GenI = {<l, n>}
- where n is the inside node for l = new cl
[Figure: existing edges from l]

48. object creation site l = new cl (same rules)
[Figure: generated edge from l to the new inside node n]
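The kill/gen pattern of these rules can be sketched for the copy and object-creation statements, with fields, load nodes, and escape information elided for brevity (Transfer and its method names are illustrative):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy transfer functions over a map from variable to the set of node ids
// it may point to. Killing a variable's edges is replacing its entry.
class Transfer {
    final Map<String, Set<String>> I = new HashMap<>();
    int site = 0;  // counter that names inside nodes per allocation site

    Set<String> succ(String v) {
        return I.getOrDefault(v, Collections.emptySet());
    }

    // copy statement l = v: kill edges(I, l), gen {l} x succ(I, v)
    void copy(String l, String v) {
        I.put(l, new HashSet<>(succ(v)));
    }

    // object creation l = new cl: kill edges(I, l), gen edge to inside node n
    void alloc(String l) {
        I.put(l, new HashSet<>(Collections.singleton("inside" + site++)));
    }
}
```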
49. Method Call
- Analysis of a method call:
  - Start with points-to escape graph before the call site
  - Retrieve the points-to escape graph from analysis of callee
  - Map outside nodes of callee graph to nodes of caller graph
  - Combine callee graph into caller graph
  - Result is the points-to escape graph after the call site
50. Start With Graph Before Call
[Figure: points-to escape graph before the call to t = new Task(v, a)]

51. Retrieve Graph from Callee
[Figure: adds the graph from analysis of Task(w, d), in which this.work points to the node for w and this.dest to the node for d]

52. Map Parameters from Callee to Caller

53. Transfer Edges from Callee to Caller
[Figure: combined graph after the call, with the callee's work and dest edges projected onto the caller's nodes]

54. Discard Parameter Nodes from Callee
55. More General Example
[Figure: points-to escape graph before the call to x.foo() alongside the graph from analysis of foo()]

56. Initialize Mapping: Map Formals to Actuals

57. Extend Mapping: Match Inside and Outside Edges
- Mapping is Unidirectional: From Callee to Caller

58. Complete Mapping: Automap Load and Inside Nodes Reachable from Mapped Nodes

59. Combine Mapping: Project Edges from Callee Into Combined Graph

60. Discard Callee Graph

61. Discard Outside Edges From Captured Nodes
[Figure sequence: foo()'s this node is mapped to the node x points to, matched edges extend the mapping, the callee's edges are projected into the combined graph after x.foo(), and the callee graph and stale outside edges are discarded]
62. Interthread Analysis
- Augment Analysis Representation
  - Parallel Thread Set
  - Action Set (read, write, sync, create edge)
  - Action Ordering Information (relative to thread start actions)
- Thread Interaction Analysis
  - Combine points-to graphs
  - Induces combination of other information
- Can perform interthread analysis at any point to improve precision of results
63. Combining Points-to Graphs
[Figure: points-to escape graph sometime after the call to x.start() alongside the graph from analysis of run()]

64. Initialize Mapping: Map Startee Thread to Starter Thread

65.-67. Extend Mapping: Match Inside and Outside Edges
- Mapping is Bidirectional: From Startee to Starter, and From Starter to Startee

68. Complete Mapping: Automap Load and Inside Nodes Reachable from Mapped Nodes

69.-72. Combine Graphs: Project Edges Through Mappings Into Combined Graph

73.-74. Discard Startee Thread Node

75. Discard Outside Edges From Captured Nodes
[Figure sequence: run()'s this node is mapped to the node x points to, edges are matched in both directions, projected into the combined graph sometime after x.start(), and the startee thread node and stale outside edges are removed]
76. Life is not so Simple
- Dependences between phases
- Mapping is best framed as a constraint satisfaction problem
- Solved using a constraint satisfaction algorithm
77. Interthread Analysis With Actions and Ordering

78. Analysis Result for generateTask
[Figure: parallel interaction graph with four parts: the points-to graph (Task node a with work pointing to Vector b and dest pointing to the Accumulator, plus Enumeration node e), the parallel thread set {t}, the action set (sync b; rd b; wr a; wr b; wr c; wr d), and the action ordering: all actions happen before thread a starts executing]
79. Analysis Result for run
[Figure: parallel interaction graph for run(): the points-to graph as before, with nodes numbered 1-6; no parallel threads; actions rd 1, rd 2, rd 3, rd 4, rd 5, rd 6, wr 5, wr 6, sync 2, sync 5; action ordering none; outside edge actions edge(1,2), edge(1,5), edge(2,3), edge(3,4)]
80. Role of edge(1,2) Actions
- One edge action for each outside edge
- Action order for edge actions improves precision of interthread analysis
- If the starter thread reads a reference before the startee thread is started, then the reference was not created by the startee thread
- Outside edge actions record this order
- Inside edges from the startee are matched only against parallel outside edges
81.-82. Edge Actions in Combining Points-to Graphs
[Figure: graph sometime after the call to x.start() alongside the graph from analysis of run(); the action ordering information ("none") controls which inside and outside edges may be matched]
83. Analysis Result After Interaction
[Figure: combined parallel interaction graph: points-to graph as before; parallel thread set {t}; actions sync b, rd b, wr a, wr b, wr c, wr d paired with thread a (sync b, a; sync e, a; rd a, a; rd b, a; rd c, a; rd d, a; rd e, a; wr e, a); ordering: all actions from the current thread happen before thread a starts executing]
84. Roles of Intrathread and Interthread Analyses
- Basic Analysis
  - Intrathread analysis delivers a parallel interaction graph at each program point
    - records parallel threads
    - does not compute thread interaction
  - Choose a program point (end of method)
  - Interthread analysis delivers additional precision at that program point
- Does not exploit ordering information from thread join constructs
85. Join Ordering

    t = new Task();
    t.start();      // t.run(): computation from task t
    // ... computation that runs in parallel with task t ...
    t.join();
    // ... computation that runs after task t ...
86. Exploiting Join Ordering
- At the join point:
  - Interthread analysis delivers a new (more precise) parallel interaction graph
  - Intrathread analysis uses the new graph
- No parallel interactions between the thread and the computation after the join
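A minimal example of the ordering a join provides (JoinDemo is illustrative): after t.join() returns, the child has finished, so the parent's read of the shared field needs no synchronization with it.

```java
class JoinDemo {
    static int shared = 0;  // written only by the child thread

    static int compute() {
        Thread t = new Thread(() -> { shared = 42; });
        t.start();
        // ... computation that runs in parallel with t ...
        try {
            t.join();  // after this point there are no parallel interactions with t
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return shared;  // safe unsynchronized read: join orders it after the write
    }

    public static void main(String[] args) {
        System.out.println(compute()); // prints 42
    }
}
```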
87. Extensions
- Partial program analysis
  - can analyze a method independent of its callers
  - can analyze a method independent of the methods it invokes
  - can incrementally analyze callees to improve precision
- Dial down precision to improve efficiency
- Demand-driven formulations
88. Key Ideas
- Explicitly represent potential interactions between analyzed and unanalyzed parts
  - Inside versus outside nodes and edges
  - Escaped versus captured nodes
  - Precisely bound ignorance
- Exploit ordering information
  - intrathread (flow sensitive)
  - interthread (starts, edge orders, joins)
89. Analysis Uses
- Overheads in Standard Execution and How to Eliminate Them
90. Intrathread Analysis Result from End of run Method
- Enumeration object is captured
  - Does not escape to caller
  - Does not escape to parallel threads
- Lifetime of Enumeration object is bounded by lifetime of run
- Can allocate Enumeration object on the call stack instead of the heap
[Figure: points-to graph for run with nodes numbered 1-6; the Enumeration node is captured]
91. Interthread Analysis Result from End of generateTask Method
- Vector object is captured
- Multiple threads synchronize on the Vector object
- But synchronizations from different threads do not occur concurrently
- Can eliminate synchronization on the Vector object
[Figure: parallel interaction graph with the paired action set and the ordering: all actions from the current thread happen before thread a starts executing]
92. Interthread Analysis Result from End of generateTask Method
- Vectors, Tasks, Integers captured
- Parent and child access the objects
- Parent completes its accesses before child starts its accesses
- Can allocate the objects on the child's per-thread heap
[Figure: same parallel interaction graph as on the previous slide]
93. Thread Overhead
- Inefficient Thread Implementations
  - Thread Creation Overhead
  - Thread Management Overhead
  - Stack Overhead
- Use a more efficient thread implementation:
  - User-level thread management
  - Per-thread heaps
  - Event-driven form
94.-95. Standard Thread Implementation
- Call frames allocated on stack
- Context Switch
  - Save state on stack (in a save area)
  - Resume another thread
- One stack per thread
[Figure: a stack of call frames, each holding a return address, frame pointer, and locals (x, y; a, b, c); the second slide adds a save area on the suspended thread's stack]
96. Event-Driven Form
- Call frames allocated on stack
- Context Switch
  - Build continuation on heap
  - Copy out live variables
  - Return out of computation
  - Resume another continuation
- One stack per processor
[Figure: stack of call frames next to heap-allocated continuations, each holding a resume method and the copied-out live variables (x; c)]
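The mechanism above can be sketched with a heap-allocated continuation and a run queue; EventLoop, step, and the log are illustrative names for this sketch, not the Flex implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: instead of blocking on its own stack, a computation copies its live
// variables into a heap continuation and returns; a scheduler loop resumes
// ready continuations, so one stack serves many logical threads.
interface Continuation { void resume(); }

class EventLoop {
    static final Deque<Continuation> ready = new ArrayDeque<>();
    static final StringBuilder log = new StringBuilder();

    static void step(int x) {
        log.append("before:").append(x).append(' ');
        // "Context switch": capture the live variable x in a continuation
        // and return out of the computation instead of blocking.
        ready.add(() -> log.append("after:").append(x));
    }

    static String run(int x) {
        log.setLength(0);
        ready.clear();
        step(x);
        while (!ready.isEmpty()) ready.poll().resume();  // scheduler loop
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(7)); // prints before:7 after:7
    }
}
```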
97. Complications
- Standard thread models use blocking I/O
  - Automatically convert blocking I/O to asynchronous I/O
  - Scheduler manages interleaving of thread executions
- Stack-Allocatable Objects May Be Live Across Blocking Calls
  - Transfer allocation to per-thread heap
98. Opportunity
- On a uniprocessor, the compiler controls placement of context switch points
- If the program does not hold a lock across a blocking call, the lock can be eliminated
99. Experimental Results
- MIT Flex Compiler System
  - Static Compiler
  - Native code for StrongARM
- Server Benchmarks: http, phone, echo, time
- Scientific Computing Benchmarks: water, barnes
100. Server Benchmark Characteristics

    Benchmark | IR Size (instrs) | Methods | Pre Analysis (secs) | Intrathread Analysis (secs) | Interthread Analysis (secs)
    echo      |  4,639           | 131     |  28                 |  74                         |  73
    time      |  4,573           | 136     |  29                 |  70                         |  74
    http      | 10,643           | 292     | 103                 | 199                         | 269
    phone     |  9,547           | 267     |  75                 | 191                         | 256
101. Percentage of Eliminated Synchronization Operations
[Bar chart, 0-100%: intrathread-only versus interthread elimination for http, phone, time, echo, and mtrt]
102. Compilation Options for Performance Results
- Standard: kernel threads, synch included
- Event-Driven: event-driven, no synch at all
- Per-Thread Heap: event-driven, no synch at all, per-thread heap allocation
103. Throughput (Responses per Second)
[Bar chart, 0-400 responses/sec: Standard, Event-Driven, and Per-Thread Heap versions of http 2K, http 20K, echo, time, and phone]
104. Scientific Benchmark Characteristics

    Benchmark | IR Size (instrs) | Methods | Total Analysis (secs) | Pre Analysis (secs)
    water     | 25,583           | 335     | 1156                  | 380
    barnes    | 19,764           | 364     |  491                  | 129
105. Compiler Options
- 0: Sequential C
- 1: Baseline (Kernel Threads)
- 2: Lightweight Threads
- 3: Lightweight Threads + Stack Allocation
- 4: Lightweight Threads + Stack Allocation - Synchronization
106. Execution Times
[Bar chart: proportion of sequential C execution time (0 to 1) for the Baseline, Light, Stack, and -Synch versions of water small, water, and barnes]
107. Related Work
- Pointer Analysis for Sequential Programs
  - Chatterjee, Ryder, Landi (POPL 99)
  - Sathyanathan and Lam (LCPC 96)
  - Steensgaard (POPL 96)
  - Wilson and Lam (PLDI 95)
  - Emami, Ghiya, Hendren (PLDI 94)
  - Choi, Burke, Carini (POPL 93)

108. Related Work
- Pointer Analysis for Multithreaded Programs
  - Rugina and Rinard (PLDI 99) (fork-join parallelism, not compositional)
  - We have extended our points-to analysis for multithreaded programs (irregular, thread-based concurrency, compositional)
- Escape Analysis
  - Blanchet (POPL 98)
  - Deutsch (POPL 90, POPL 97)
  - Park and Goldberg (PLDI 92)

109. Related Work
- Synchronization Optimizations
  - Diniz and Rinard (LCPC 96, POPL 97)
  - Plevyak, Zhang, Chien (POPL 95)
  - Aldrich, Chambers, Sirer, Eggers (SAS 99)
  - Blanchet (OOPSLA 99)
  - Bogda and Hoelzle (OOPSLA 99)
  - Choi, Gupta, Serrano, Sreedhar, Midkiff (OOPSLA 99)
  - Ruf (PLDI 00)
110. Conclusion
- New Analysis Algorithm
  - Flow-sensitive, compositional
  - Multithreaded programs
  - Explicitly represent interactions between analyzed and unanalyzed parts
- Analysis Uses
  - Synchronization elimination
  - Stack allocation
  - Per-thread heap allocation
- Lightweight Threads