Analyses and Optimizations for Multithreaded Programs
PowerPoint presentation transcript (111 slides)
1
Analyses and Optimizations for Multithreaded Programs
  • Martin Rinard, Alex Salcianu, Brian Demsky
  • MIT Laboratory for Computer Science
  • John Whaley
  • IBM Tokyo Research Laboratory
2
Motivation
  • Threads are Ubiquitous
  • Parallel Programming for Performance
  • Manage Multiple Connections
  • System Structuring Mechanism
  • Overhead
  • Thread Management
  • Synchronization
  • Opportunities
  • Improved Memory Management

3
What This Talk is About
  • New Abstraction: Parallel Interaction Graph
  • Points-To Information
  • Reachability and Escape Information
  • Interaction Information
  • Caller-Callee Interactions
  • Starter-Startee Interactions
  • Action Ordering Information
  • Analysis Algorithm
  • Analysis Uses (synchronization elimination, stack
    allocation, per-thread heap allocation)

4
Outline
  • Example
  • Analysis Representation and Algorithm
  • Lightweight Threads
  • Results
  • Conclusion

5-7
Sum Sequence of Numbers; Group in Subsequences; Sum Subsequences (in Parallel)
[Figure: the sequence 9, 8, 1, 5, 3, 7, 2, 6 is grouped into pairs (9,8), (1,5), (3,7), (2,6), whose sums 17, 6, 10, 8 are computed in parallel]

8-12
Add Sums Into Accumulator
[Figure: the partial sums are added into a shared accumulator, whose value grows 0 -> 17 -> 23 -> 33 -> 41]
13
Common Schema
  • Set of tasks
  • Chunk tasks to increase granularity
  • Tasks have both
  • Independent computation
  • Updates to shared data

14
Realization in Java
    class Accumulator {
        int value = 0;
        synchronized void add(int v) { value += v; }
    }
15
Realization in Java
    class Task extends Thread {
        Vector work;
        Accumulator dest;
        Task(Vector w, Accumulator d) { work = w; dest = d; }
        public void run() {
            int sum = 0;
            Enumeration e = work.elements();
            while (e.hasMoreElements())
                sum += ((Integer) e.nextElement()).intValue();
            dest.add(sum);
        }
    }
[Figure: points-to diagram - Task node with work edge to Vector and dest edge to Accumulator]
16
Realization in Java
(same Task code as the previous slide)
[Figure: points-to diagram - as before, with an Enumeration node added for work.elements()]
17
Realization in Java
    void generateTask(int l, int u, Accumulator a) {
        Vector v = new Vector();
        for (int j = l; j < u; j++) v.addElement(new Integer(j));
        Task t = new Task(v, a);
        t.start();
    }
    void generate(int n, int m, Accumulator a) {
        for (int i = 0; i < n; i++)
            generateTask(i*m, (i+1)*m, a);
    }
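The schema can be exercised end to end. Below is a hypothetical, self-contained driver; the class name SumDemo and the added join calls are assumptions introduced for testing (the slides' code starts tasks without joining them), so this is a sketch, not the presentation's code:

```java
import java.util.Enumeration;
import java.util.Vector;

public class SumDemo {
    static class Accumulator {
        int value = 0;
        synchronized void add(int v) { value += v; }
    }

    static class Task extends Thread {
        Vector work; Accumulator dest;
        Task(Vector w, Accumulator d) { work = w; dest = d; }
        public void run() {
            int sum = 0;
            Enumeration e = work.elements();
            while (e.hasMoreElements())
                sum += ((Integer) e.nextElement()).intValue();
            dest.add(sum);   // single synchronized update of shared data
        }
    }

    // Sums 0 .. n*m-1 using n tasks of m numbers each (chunked tasks).
    static int sum(int n, int m) {
        Accumulator a = new Accumulator();
        Task[] tasks = new Task[n];
        for (int i = 0; i < n; i++) {
            Vector v = new Vector();
            for (int j = i * m; j < (i + 1) * m; j++)
                v.addElement(Integer.valueOf(j));
            tasks[i] = new Task(v, a);
            tasks[i].start();
        }
        try {
            for (Task t : tasks) t.join();   // joins added so the total is complete
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return a.value;
    }

    public static void main(String[] args) {
        System.out.println(sum(4, 2));   // prints 28 (0+1+...+7)
    }
}
```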
18-25
Task Generation
[Figure sequence: starting from the Accumulator (value 0), each loop iteration creates a Vector of Integers and a Task whose work edge points to the Vector and whose dest edge points to the shared Accumulator; successive slides add further Task/Vector pairs]
26
Analysis
27
Analysis Overview
  • Interprocedural
  • Interthread
  • Flow-sensitive
  • Statement ordering within thread
  • Action ordering between threads
  • Compositional, Bottom Up
  • Explicitly Represent Potential Interactions
    Between Analyzed and Unanalyzed Parts
  • Partial Program Analysis

28-33
Analysis Result for run Method

    public void run() {
        int sum = 0;
        Enumeration e = work.elements();
        while (e.hasMoreElements())
            sum += ((Integer) e.nextElement()).intValue();
        dest.add(sum);
    }

[Figure: points-to graph - this -> Task node; work edge -> Vector node; dest edge -> Accumulator node; Enumeration node]

  • Abstraction: Points-to Graph
  • Nodes Represent Objects
  • Edges Represent References

  • Inside Nodes
  • Objects Created Within Current Analysis Scope
  • One Inside Node Per Allocation Site
  • Represents All Objects Created At That Site

  • Outside Nodes
  • Objects Created Outside Current Analysis Scope
  • Objects Accessed Via References Created Outside Current Analysis Scope
  • One per Static Class Field, One per Parameter, One per Load Statement (Representing Objects Loaded at That Statement)

  • Inside Edges
  • References Created Inside Current Analysis Scope

  • Outside Edges
  • References Created Outside Current Analysis Scope
  • Potential Interactions in Which Analyzed Part Reads Reference Created in Unanalyzed Part
34
Concept of Escaped Node
  • Escaped Nodes Represent Objects Accessible
    Outside Current Analysis Scope
  • parameter nodes, load nodes
  • static class field nodes
  • nodes passed to unanalyzed methods
  • nodes reachable from unanalyzed but started
    threads
  • nodes reachable from escaped nodes
  • Node is Captured if it is Not Escaped
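A small hypothetical Java fragment (class and field names are illustrative, not from the slides) shows the distinction: an object reachable only from a local variable is captured, while one stored into a static field escapes:

```java
import java.util.Vector;

public class EscapeDemo {
    // Static class field: anything stored here is reachable outside
    // the current analysis scope and therefore escapes.
    static Vector shared = new Vector();

    // The Vector created here is captured: only the local variable v
    // reaches it, so its lifetime is bounded by this call and the
    // analysis has complete information about it.
    static int capturedSum() {
        Vector v = new Vector();
        v.addElement(Integer.valueOf(1));
        v.addElement(Integer.valueOf(2));
        int sum = 0;
        for (int i = 0; i < v.size(); i++)
            sum += ((Integer) v.elementAt(i)).intValue();
        return sum;
    }

    // The Vector created here escapes: it becomes reachable from the
    // static field, so unanalyzed code may access it later.
    static void escaping() {
        Vector v = new Vector();
        shared.addElement(v);
    }
}
```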

35
Why Escaped Concept is Important
  • Completeness of Analysis Information
  • Complete information for captured nodes
  • Potentially incomplete for escaped nodes
  • Lifetime Implications
  • Captured nodes are inaccessible when analyzed
    part of the program terminates
  • Memory Management Optimizations
  • Stack allocation
  • Per-Thread Heap Allocation

36
Intrathread Dataflow Analysis
  • Computes a points-to escape graph for each
    program point
  • Points-to escape graph is a triple ⟨I, O, e⟩
  • I - set of inside edges
  • O - set of outside edges
  • e - escape information for each node
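One plausible in-memory encoding of the ⟨I, O, e⟩ triple, sketched in Java with string-named variables, fields, and nodes. This is an illustration for the reader, not the Flex compiler's actual data structure:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A points-to escape graph: inside edges I, outside edges O, and
// per-node escape information e (here, a set of escaped node names).
public class PointsToEscapeGraph {
    // source -> field -> set of target node names
    final Map<String, Map<String, Set<String>>> inside = new HashMap<>();
    final Map<String, Map<String, Set<String>>> outside = new HashMap<>();
    final Set<String> escaped = new HashSet<>();

    void addInsideEdge(String src, String field, String dst) {
        inside.computeIfAbsent(src, k -> new HashMap<>())
              .computeIfAbsent(field, k -> new HashSet<>()).add(dst);
    }

    // succ(I, src, field): nodes reachable from src via an inside edge
    // labeled field.
    Set<String> succ(String src, String field) {
        return inside.getOrDefault(src, Map.of())
                     .getOrDefault(field, Set.of());
    }

    // A node is captured exactly when it is not escaped.
    boolean captured(String node) { return !escaped.contains(node); }
}
```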

37
Dataflow Analysis
  • Initial state
  • I = formals point to parameter nodes,
    classes point to class nodes
  • O = ∅
  • Transfer functions
  • I' = (I − KillI) ∪ GenI
  • O' = O ∪ GenO
  • Confluence operator is ∪

38
Intraprocedural Analysis
  • Must define transfer functions for
  • copy statement: l = v
  • load statement: l1 = l2.f
  • store statement: l1.f = l2
  • return statement: return l
  • object creation site: l = new cl
  • method invocation: l = l0.op(l1, ..., lk)

39-40
  • copy statement l = v
  • KillI = edges(I, l)
  • GenI = {l} × succ(I, v)
  • I' = (I − KillI) ∪ GenI
[Figure: existing edges from l and v; the generated edges make l point to the nodes v points to]
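The copy rule can be made concrete with a toy encoding of inside edges as "src->dst" strings. This is an illustrative sketch (names and the string encoding are assumptions, not the paper's implementation):

```java
import java.util.HashSet;
import java.util.Set;

// Transfer function for the copy statement "l = v" over inside edges.
public class CopyTransfer {
    static Set<String> transfer(Set<String> I, String l, String v) {
        Set<String> out = new HashSet<>();
        Set<String> succOfV = new HashSet<>();
        for (String edge : I) {
            String src = edge.split("->")[0];
            String dst = edge.split("->")[1];
            if (!src.equals(l)) out.add(edge);    // I - Kill_I: drop edges from l
            if (src.equals(v)) succOfV.add(dst);  // collect succ(I, v)
        }
        for (String n : succOfV)                  // Gen_I = {l} x succ(I, v)
            out.add(l + "->" + n);
        return out;
    }
}
```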
41-44
  • load statement l1 = l2.f
  • SE = {n2 ∈ succ(I, l2) | escaped(n2)}
  • SI = ∪ {succ(I, n2, f) | n2 ∈ succ(I, l2)}
  • case 1: l2 does not point to an escaped node (SE = ∅)
  • KillI = edges(I, l1)
  • GenI = {l1} × SI
  • case 2: l2 does point to an escaped node (SE ≠ ∅)
  • KillI = edges(I, l1)
  • GenI = {l1} × (SI ∪ {n})
  • GenO = (SE × {f}) × {n}
  • where n is the load node for l1 = l2.f
[Figure: in case 1, the generated edges make l1 point to the f-targets of l2's nodes; in case 2, an outside edge labeled f is added from each escaped node to the load node n, and l1 points to n]
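The load rule, with its escaped-node case, can be sketched in the same toy string encoding (variable edges keyed by variable name, field edges keyed by "node.field"; the fresh load node n is passed in by the caller). Names are illustrative, not the paper's implementation:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Transfer function for the load statement "l1 = l2.f".
public class LoadTransfer {
    static void transfer(Map<String, Set<String>> I,
                         Map<String, Set<String>> O,
                         Set<String> escaped,
                         String l1, String l2, String f, String loadNode) {
        Set<String> SI = new HashSet<>();   // field targets of l2's nodes
        Set<String> SE = new HashSet<>();   // escaped nodes l2 points to
        for (String n2 : I.getOrDefault(l2, Set.of())) {
            SI.addAll(I.getOrDefault(n2 + "." + f, Set.of()));
            if (escaped.contains(n2)) SE.add(n2);
        }
        Set<String> targets = new HashSet<>(SI);
        if (!SE.isEmpty()) {                // case 2: some node escaped
            targets.add(loadNode);          // Gen_I = {l1} x (S_I U {n})
            for (String n2 : SE)            // Gen_O = (S_E x {f}) x {n}
                O.computeIfAbsent(n2 + "." + f, k -> new HashSet<>())
                 .add(loadNode);
            escaped.add(loadNode);          // load nodes are escaped
        }
        I.put(l1, targets);                 // Kill_I = edges(I, l1)
    }
}
```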
45-46
  • store statement l1.f = l2
  • GenI = (succ(I, l1) × {f}) × succ(I, l2)
  • I' = I ∪ GenI
[Figure: generated inside edges labeled f from the nodes l1 points to, to the nodes l2 points to]
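The store rule can be sketched as a weak update over a toy edge map (string-named nodes, field edges keyed by "node.field"; illustrative only, nothing is killed):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Transfer function for the store statement "l1.f = l2":
// Gen_I = (succ(I, l1) x {f}) x succ(I, l2); I' = I U Gen_I.
public class StoreTransfer {
    static void transfer(Map<String, Set<String>> I,
                         String l1, String f, String l2) {
        for (String n1 : I.getOrDefault(l1, Set.of()))
            for (String n2 : I.getOrDefault(l2, Set.of()))
                I.computeIfAbsent(n1 + "." + f, k -> new HashSet<>())
                 .add(n2);
    }
}
```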
47-48
  • object creation site l = new cl
  • KillI = edges(I, l)
  • GenI = {⟨l, n⟩}
  • where n is the inside node for l = new cl
[Figure: generated edge from l to the fresh inside node n]
49
Method Call
  • Analysis of a method call
  • Start with points-to escape graph before the call
    site
  • Retrieve the points-to escape graph from analysis
    of callee
  • Map outside nodes of callee graph to nodes of
    caller graph
  • Combine callee graph into caller graph
  • Result is the points-to escape graph after the
    call site

50-54
Combining Graphs at a Method Call: t = new Task(v,a)
  • Start With Graph Before Call
  • Retrieve Graph from Callee (from the analysis of Task(w,d))
  • Map Parameters from Callee to Caller
  • Transfer Edges from Callee to Caller
  • Discard Parameter Nodes from Callee
[Figure sequence: points-to escape graph before the call, graph from the analysis of Task(w,d) with work and dest edges from this, and the combined graph after the call]
55-61
More General Example: Call to x.foo()
  • Initialize Mapping: Map Formals to Actuals
  • Extend Mapping: Match Inside and Outside Edges (mapping is unidirectional, from callee to caller)
  • Complete Mapping: Automap Load and Inside Nodes Reachable from Mapped Nodes
  • Combine Mapping: Project Edges from Callee Into Combined Graph
  • Discard Callee Graph
  • Discard Outside Edges From Captured Nodes
[Figure sequence: points-to escape graph before the call to x.foo(), graph from the analysis of foo(), and the combined graph after the call]
62
Interthread Analysis
  • Augment Analysis Representation
  • Parallel Thread Set
  • Action Set (read, write, sync, create edge)
  • Action Ordering Information (relative to thread
    start actions)
  • Thread Interaction Analysis
  • Combine points-to graphs
  • Induces combination of other information
  • Can perform interthread analysis at any point to
    improve precision of results

63-75
Combining Points-to Graphs at a Thread Start: x.start()
  • Initialize Mapping: Map Startee Thread to Starter Thread
  • Extend Mapping: Match Inside and Outside Edges (mapping is bidirectional: from startee to starter and from starter to startee)
  • Complete Mapping: Automap Load and Inside Nodes Reachable from Mapped Nodes
  • Combine Graphs: Project Edges Through Mappings Into Combined Graph
  • Discard Startee Thread Node
  • Discard Outside Edges From Captured Nodes
[Figure sequence: points-to escape graph sometime after the call to x.start(), graph from the analysis of run(), and the combined graph]
76
Life is not so Simple
  • Dependences between phases
  • Mapping is best framed as a constraint satisfaction problem
  • Solved using a constraint satisfaction algorithm

77
Interthread Analysis With Actions and Ordering
78
Analysis Result for generateTask
  • Parallel Threads: t
  • Actions: sync b; rd b; wr a; wr b; wr c; wr d
  • Action Ordering: all actions happen before thread a starts executing
[Figure: points-to graph with Task node a, Vector node b, Accumulator node e, and nodes c and d; work and dest edges from the Task]
79
Analysis Result for run
  • Parallel Threads: none
  • Actions: rd 1, rd 2, rd 3, rd 4, rd 5, rd 6; sync 2, sync 5; wr 5, wr 6; edge(1,2), edge(1,5), edge(2,3), edge(3,4)
  • Action Ordering: none
[Figure: points-to graph with numbered nodes 1-6: this -> Task node 1, work edge -> Vector node 2, dest edge -> Accumulator node 5, plus an Enumeration node]
80
Role of edge(1,2) Actions
  • One edge action for each outside edge
  • Action order for edge actions improves precision
    of interthread analysis
  • If starter thread reads a reference before
    startee thread is started
  • Then reference was not created by startee thread
  • Outside edge actions record order
  • Inside edges from startee matched only against
    parallel outside edges

81-82
Edge Actions in Combining Points-to Graphs
[Figure: points-to escape graph sometime after the call to x.start(), graph from the analysis of run(), and the associated action ordering (none)]
83
Analysis Result After Interaction
  • Parallel Threads: t
  • Actions: sync b; rd b; wr a; wr b; wr c; wr d; sync b, a; sync e, a; rd a, a; rd b, a; rd c, a; rd d, a; rd e, a; wr e, a
  • Action Ordering: all actions from the current thread happen before thread a starts executing
[Figure: points-to graph with Task node a, Vector node b, Accumulator node e, and nodes c and d; work and dest edges from the Task]
84
Roles of Intrathread and Interthread Analyses
  • Basic Analysis
  • Intrathread analysis delivers parallel
    interaction graph at each program point
  • records parallel threads
  • does not compute thread interaction
  • Choose program point (end of method)
  • Interthread analysis delivers additional
    precision at that program point
  • Does not exploit ordering information from thread
    join constructs

85
Join Ordering
    t = new Task();
    t.start();
    // computation that runs in parallel with task t
    t.join();
    // computation that runs after task t

(t.run() is the computation from task t)
86
Exploiting Join Ordering
  • At join point
  • Interthread analysis delivers new (more precise)
    parallel interaction graph
  • Intrathread analysis uses new graph
  • No parallel interactions between the joined thread
    and the computation after the join

87
Extensions
  • Partial program analysis
  • can analyze method independent of callers
  • can analyze method independent of methods it
    invokes
  • can incrementally analyze callees to improve
    precision
  • Dial down precision to improve efficiency
  • Demand-driven formulations

88
Key Ideas
  • Explicitly represent potential interactions
    between analyzed and unanalyzed parts
  • Inside versus outside nodes and edges
  • Escaped versus captured nodes
  • Precisely bound ignorance
  • Exploit ordering information
  • intrathread (flow sensitive)
  • interthread (starts, edge orders, joins)

89
Analysis Uses
  • Overheads in Standard Execution and How to
    Eliminate Them

90
Intrathread Analysis Result from End of run Method
  • Enumeration object is captured
  • Does not escape to caller
  • Does not escape to parallel threads
  • Lifetime of Enumeration object is bounded by
    lifetime of run
  • Can allocate Enumeration object on call stack
    instead of heap

[Figure: points-to graph - this -> Task node; work edge -> Vector node; dest edge -> Accumulator node; captured Enumeration node]
91
Interthread Analysis Result from End of generateTask Method
[Figure: parallel interaction graph with parallel threads, actions, points-to graph, and action ordering (all actions from the current thread happen before thread a starts executing)]
  • Vector object is captured
  • Multiple threads synchronize on Vector object
  • But synchronizations from different threads do not occur concurrently
  • Can eliminate synchronization on Vector object
92
Interthread Analysis Result from End of generateTask Method
[Figure: same parallel interaction graph as the previous slide]
  • Vectors, Tasks, Integers captured
  • Parent and child access the objects
  • Parent completes its accesses before child starts its accesses
  • Can allocate objects on child's per-thread heap
93
Thread Overhead
  • Inefficient Thread Implementations
  • Thread Creation Overhead
  • Thread Management Overhead
  • Stack Overhead
  • Use a more efficient thread implementation
  • User-level thread management
  • Per-thread heaps
  • Event-driven form

94-95
Standard Thread Implementation
  • Call frames allocated on stack
  • Context Switch
  • Save state on stack
  • Resume another thread
  • One stack per thread
[Figure: a thread's stack of call frames (return address, frame pointer, locals), with a save area pushed at a context switch]
96
Event-Driven Form
  • Call frames allocated on stack
  • Context Switch
  • Build continuation on heap
  • Copy out live variables
  • Return out of computation
  • Resume another continuation
  • One stack per processor
[Figure: call frames on the stack, with heap-allocated continuation records holding a resume method and copied-out live variables]
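The continuation-building step can be illustrated with a small hypothetical Java sketch. The Continuation interface and all names are assumptions for illustration (the Flex compiler performs this transformation automatically, at the IR level):

```java
// Sketch of the event-driven transformation: at a would-be context
// switch, live variables are copied into a heap-allocated continuation
// with a resume method, and the computation returns out of the stack.
public class EventDrivenSketch {
    interface Continuation { int resume(int input); }

    // Threaded form would block between the two halves:
    //   int c = ...; int x = blockingRead(); return c + x;
    // Event-driven form returns a continuation instead of blocking.
    static Continuation firstHalf(int c) {
        final int liveC = c;               // copy out the live variable c
        return new Continuation() {        // build continuation on heap
            public int resume(int x) {     // runs when the "I/O" completes
                return liveC + x;
            }
        };
    }
}
```

The scheduler would later invoke resume with the result of the asynchronous operation, on whichever processor stack is available.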
97
Complications
  • Standard thread models use blocking I/O
  • Automatically convert blocking I/O to
    asynchronous I/O
  • Scheduler manages interleaving of thread
    executions
  • Stack Allocatable Objects May Be Live Across
    Blocking Calls
  • Transfer allocation to per-thread heap

98
Opportunity
  • On a uniprocessor, compiler controls placement of
    context switch points
  • If program does not hold lock across blocking
    call, can eliminate lock

99
Experimental Results
  • MIT Flex Compiler System
  • Static Compiler
  • Native code for StrongARM
  • Server Benchmarks
  • http, phone, echo, time
  • Scientific Computing Benchmarks
  • water, barnes

100
Server Benchmark Characteristics

Benchmark  IR Size (instrs)  Number of Methods  Pre Analysis (secs)  Intrathread Analysis (secs)  Interthread Analysis (secs)
echo       4,639             131                28                   74                           73
time       4,573             136                29                   70                           74
http       10,643            292                103                  199                          269
phone      9,547             267                75                   191                          256
101
Percentage of Eliminated Synchronization Operations
[Bar chart: percentage of synchronization operations eliminated for http, phone, time, echo, and mtrt, comparing intrathread-only analysis against interthread analysis]
102
Compilation Options for Performance Results
  • Standard
  • kernel threads, synch included
  • Event-Driven
  • event-driven, no synch at all
  • Per-Thread Heap
  • event-driven, no synch at all, per-thread heap
    allocation

103
Throughput (Responses per Second)
[Bar chart: throughput for http 2K, http 20K, echo, time, and phone under the Standard, Event-Driven, and Per-Thread Heap configurations]
104
Scientific Benchmark Characteristics

Benchmark  IR Size (instrs)  Number of Methods  Total Analysis Time (secs)  Pre Analysis Time (secs)
water      25,583            335                1156                        380
barnes     19,764            364                491                         129
105
Compiler Options
  • 0: Sequential C
  • 1: Baseline (kernel threads)
  • 2: Lightweight Threads
  • 3: Lightweight Threads + Stack Allocation
  • 4: Lightweight Threads + Stack Allocation − Synchronization

106
Execution Times
[Bar chart: execution time of water small, water, and barnes as a proportion of sequential C execution time, for the Baseline, Light, Stack, and −Synch configurations]
107
Related Work
  • Pointer Analysis for Sequential Programs
  • Chatterjee, Ryder, Landi (POPL 99)
  • Sathyanathan and Lam (LCPC 96)
  • Steensgaard (POPL 96)
  • Wilson and Lam (PLDI 95)
  • Emami, Ghiya, Hendren (PLDI 94)
  • Choi, Burke, Carini (POPL 93)

108
Related Work
  • Pointer Analysis for Multithreaded Programs
  • Rugina and Rinard (PLDI 99) (fork-join
    parallelism, not compositional)
  • We have extended our points-to analysis for
    multithreaded programs (irregular, thread-based
    concurrency, compositional)
  • Escape Analysis
  • Blanchet (POPL 98)
  • Deutsch (POPL 90, POPL 97)
  • Park and Goldberg (PLDI 92)

109
Related Work
  • Synchronization Optimizations
  • Diniz and Rinard (LCPC 96, POPL 97)
  • Plevyak, Zhang, Chien (POPL 95)
  • Aldrich, Chambers, Sirer, Eggers (SAS99)
  • Blanchet (OOPSLA 99)
  • Bogda, Hoelzle (OOPSLA 99)
  • Choi, Gupta, Serrano, Sreedhar, Midkiff (OOPSLA
    99)
  • Ruf (PLDI 00)

110
Conclusion
  • New Analysis Algorithm
  • Flow-sensitive, compositional
  • Multithreaded programs
  • Explicitly represent interactions between
    analyzed and unanalyzed parts
  • Analysis Uses
  • Synchronization elimination
  • Stack allocation
  • Per-thread heap allocation
  • Lightweight Threads