Analyses and Optimizations for Multithreaded Programs
PowerPoint presentation transcript (111 slides)
1
Analyses and Optimizations for Multithreaded Programs
  • Martin Rinard, Alex Salcianu, Brian Demsky
  • MIT Laboratory for Computer Science
  • John Whaley
  • IBM Tokyo Research Laboratory
2
Motivation
  • Threads are Ubiquitous
  • Parallel Programming for Performance
  • Manage Multiple Connections
  • System Structuring Mechanism
  • Overhead
  • Thread Management
  • Synchronization
  • Opportunities
  • Improved Memory Management

3
What This Talk is About
  • New Abstraction: Parallel Interaction Graph
  • Points-To Information
  • Reachability and Escape Information
  • Interaction Information
  • Caller-Callee Interactions
  • Starter-Startee Interactions
  • Action Ordering Information
  • Analysis Algorithm
  • Analysis Uses (synchronization elimination, stack
    allocation, per-thread heap allocation)

4
Outline
  • Example
  • Analysis Representation and Algorithm
  • Lightweight Threads
  • Results
  • Conclusion

5-7
Sum Sequence of Numbers; Group in Subsequences; Sum Subsequences (in Parallel)
[Figure: the sequence 9, 8, 1, 5, 3, 7, 2, 6 is grouped into pairs (9,8), (1,5), (3,7), (2,6), whose sums 17, 6, 10, 8 are computed in parallel]

8-12
Add Sums Into Accumulator
[Figure: the partial sums are added into a shared accumulator, whose value grows 0 -> 17 -> 23 -> 33 -> 41]
13
Common Schema
  • Set of tasks
  • Chunk tasks to increase granularity
  • Tasks have both
  • Independent computation
  • Updates to shared data

14
Realization in Java
    class Accumulator {
        int value = 0;
        synchronized void add(int v) { value += v; }
    }
15
Realization in Java
    class Task extends Thread {
        Vector work;
        Accumulator dest;
        Task(Vector w, Accumulator d) { work = w; dest = d; }
        public void run() {
            int sum = 0;
            Enumeration e = work.elements();
            while (e.hasMoreElements())
                sum += ((Integer) e.nextElement()).intValue();
            dest.add(sum);
        }
    }
[Figure: points-to diagram - Task node with work edge to Vector and dest edge to Accumulator]
16
Realization in Java
(same Task code as the previous slide)
[Figure: points-to diagram - as before, with an Enumeration node added for work.elements()]
17
Realization in Java
    void generateTask(int l, int u, Accumulator a) {
        Vector v = new Vector();
        for (int j = l; j < u; j++) v.addElement(new Integer(j));
        Task t = new Task(v, a);
        t.start();
    }
    void generate(int n, int m, Accumulator a) {
        for (int i = 0; i < n; i++)
            generateTask(i*m, (i+1)*m, a);
    }
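The schema can be exercised end to end. Below is a hypothetical, self-contained driver; the class name SumDemo and the added join calls are assumptions introduced for testing (the slides' code starts tasks without joining them), so this is a sketch, not the presentation's code:

```java
import java.util.Enumeration;
import java.util.Vector;

public class SumDemo {
    static class Accumulator {
        int value = 0;
        synchronized void add(int v) { value += v; }
    }

    static class Task extends Thread {
        Vector work; Accumulator dest;
        Task(Vector w, Accumulator d) { work = w; dest = d; }
        public void run() {
            int sum = 0;
            Enumeration e = work.elements();
            while (e.hasMoreElements())
                sum += ((Integer) e.nextElement()).intValue();
            dest.add(sum);   // single synchronized update of shared data
        }
    }

    // Sums 0 .. n*m-1 using n tasks of m numbers each (chunked tasks).
    static int sum(int n, int m) {
        Accumulator a = new Accumulator();
        Task[] tasks = new Task[n];
        for (int i = 0; i < n; i++) {
            Vector v = new Vector();
            for (int j = i * m; j < (i + 1) * m; j++)
                v.addElement(Integer.valueOf(j));
            tasks[i] = new Task(v, a);
            tasks[i].start();
        }
        try {
            for (Task t : tasks) t.join();   // joins added so the total is complete
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return a.value;
    }

    public static void main(String[] args) {
        System.out.println(sum(4, 2));   // prints 28 (0+1+...+7)
    }
}
```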
18-25
Task Generation
[Figure sequence: starting from the Accumulator (value 0), each loop iteration creates a Vector of Integers and a Task whose work edge points to the Vector and whose dest edge points to the shared Accumulator; successive slides add further Task/Vector pairs]
26
Analysis
27
Analysis Overview
  • Interprocedural
  • Interthread
  • Flow-sensitive
  • Statement ordering within thread
  • Action ordering between threads
  • Compositional, Bottom Up
  • Explicitly Represent Potential Interactions
    Between Analyzed and Unanalyzed Parts
  • Partial Program Analysis

28-33
Analysis Result for run Method

    public void run() {
        int sum = 0;
        Enumeration e = work.elements();
        while (e.hasMoreElements())
            sum += ((Integer) e.nextElement()).intValue();
        dest.add(sum);
    }

[Figure: points-to graph - this -> Task node; work edge -> Vector node; dest edge -> Accumulator node; Enumeration node]

  • Abstraction: Points-to Graph
  • Nodes Represent Objects
  • Edges Represent References

  • Inside Nodes
  • Objects Created Within Current Analysis Scope
  • One Inside Node Per Allocation Site
  • Represents All Objects Created At That Site

  • Outside Nodes
  • Objects Created Outside Current Analysis Scope
  • Objects Accessed Via References Created Outside Current Analysis Scope
  • One per Static Class Field, One per Parameter, One per Load Statement (Representing Objects Loaded at That Statement)

  • Inside Edges
  • References Created Inside Current Analysis Scope

  • Outside Edges
  • References Created Outside Current Analysis Scope
  • Potential Interactions in Which Analyzed Part Reads Reference Created in Unanalyzed Part
34
Concept of Escaped Node
  • Escaped Nodes Represent Objects Accessible
    Outside Current Analysis Scope
  • parameter nodes, load nodes
  • static class field nodes
  • nodes passed to unanalyzed methods
  • nodes reachable from unanalyzed but started
    threads
  • nodes reachable from escaped nodes
  • Node is Captured if it is Not Escaped
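A small hypothetical Java fragment (class and field names are illustrative, not from the slides) shows the distinction: an object reachable only from a local variable is captured, while one stored into a static field escapes:

```java
import java.util.Vector;

public class EscapeDemo {
    // Static class field: anything stored here is reachable outside
    // the current analysis scope and therefore escapes.
    static Vector shared = new Vector();

    // The Vector created here is captured: only the local variable v
    // reaches it, so its lifetime is bounded by this call and the
    // analysis has complete information about it.
    static int capturedSum() {
        Vector v = new Vector();
        v.addElement(Integer.valueOf(1));
        v.addElement(Integer.valueOf(2));
        int sum = 0;
        for (int i = 0; i < v.size(); i++)
            sum += ((Integer) v.elementAt(i)).intValue();
        return sum;
    }

    // The Vector created here escapes: it becomes reachable from the
    // static field, so unanalyzed code may access it later.
    static void escaping() {
        Vector v = new Vector();
        shared.addElement(v);
    }
}
```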

35
Why Escaped Concept is Important
  • Completeness of Analysis Information
  • Complete information for captured nodes
  • Potentially incomplete for escaped nodes
  • Lifetime Implications
  • Captured nodes are inaccessible when analyzed
    part of the program terminates
  • Memory Management Optimizations
  • Stack allocation
  • Per-Thread Heap Allocation

36
Intrathread Dataflow Analysis
  • Computes a points-to escape graph for each
    program point
  • Points-to escape graph is a triple ⟨I, O, e⟩
  • I - set of inside edges
  • O - set of outside edges
  • e - escape information for each node
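One plausible in-memory encoding of the ⟨I, O, e⟩ triple, sketched in Java with string-named variables, fields, and nodes. This is an illustration for the reader, not the Flex compiler's actual data structure:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A points-to escape graph: inside edges I, outside edges O, and
// per-node escape information e (here, a set of escaped node names).
public class PointsToEscapeGraph {
    // source -> field -> set of target node names
    final Map<String, Map<String, Set<String>>> inside = new HashMap<>();
    final Map<String, Map<String, Set<String>>> outside = new HashMap<>();
    final Set<String> escaped = new HashSet<>();

    void addInsideEdge(String src, String field, String dst) {
        inside.computeIfAbsent(src, k -> new HashMap<>())
              .computeIfAbsent(field, k -> new HashSet<>()).add(dst);
    }

    // succ(I, src, field): nodes reachable from src via an inside edge
    // labeled field.
    Set<String> succ(String src, String field) {
        return inside.getOrDefault(src, Map.of())
                     .getOrDefault(field, Set.of());
    }

    // A node is captured exactly when it is not escaped.
    boolean captured(String node) { return !escaped.contains(node); }
}
```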

37
Dataflow Analysis
  • Initial state
  • I = formals point to parameter nodes,
    classes point to class nodes
  • O = ∅
  • Transfer functions
  • I' = (I − KillI) ∪ GenI
  • O' = O ∪ GenO
  • Confluence operator is ∪

38
Intraprocedural Analysis
  • Must define transfer functions for
  • copy statement: l = v
  • load statement: l1 = l2.f
  • store statement: l1.f = l2
  • return statement: return l
  • object creation site: l = new cl
  • method invocation: l = l0.op(l1, ..., lk)

39-40
  • copy statement l = v
  • KillI = edges(I, l)
  • GenI = {l} × succ(I, v)
  • I' = (I − KillI) ∪ GenI
[Figure: existing edges from l and v; the generated edges make l point to the nodes v points to]
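The copy rule can be made concrete with a toy encoding of inside edges as "src->dst" strings. This is an illustrative sketch (names and the string encoding are assumptions, not the paper's implementation):

```java
import java.util.HashSet;
import java.util.Set;

// Transfer function for the copy statement "l = v" over inside edges.
public class CopyTransfer {
    static Set<String> transfer(Set<String> I, String l, String v) {
        Set<String> out = new HashSet<>();
        Set<String> succOfV = new HashSet<>();
        for (String edge : I) {
            String src = edge.split("->")[0];
            String dst = edge.split("->")[1];
            if (!src.equals(l)) out.add(edge);    // I - Kill_I: drop edges from l
            if (src.equals(v)) succOfV.add(dst);  // collect succ(I, v)
        }
        for (String n : succOfV)                  // Gen_I = {l} x succ(I, v)
            out.add(l + "->" + n);
        return out;
    }
}
```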
41-44
  • load statement l1 = l2.f
  • SE = {n2 ∈ succ(I, l2) | escaped(n2)}
  • SI = ∪ {succ(I, n2, f) | n2 ∈ succ(I, l2)}
  • case 1: l2 does not point to an escaped node (SE = ∅)
  • KillI = edges(I, l1)
  • GenI = {l1} × SI
  • case 2: l2 does point to an escaped node (SE ≠ ∅)
  • KillI = edges(I, l1)
  • GenI = {l1} × (SI ∪ {n})
  • GenO = (SE × {f}) × {n}
  • where n is the load node for l1 = l2.f
[Figure: in case 1, the generated edges make l1 point to the f-targets of l2's nodes; in case 2, an outside edge labeled f is added from each escaped node to the load node n, and l1 points to n]
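The load rule, with its escaped-node case, can be sketched in the same toy string encoding (variable edges keyed by variable name, field edges keyed by "node.field"; the fresh load node n is passed in by the caller). Names are illustrative, not the paper's implementation:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Transfer function for the load statement "l1 = l2.f".
public class LoadTransfer {
    static void transfer(Map<String, Set<String>> I,
                         Map<String, Set<String>> O,
                         Set<String> escaped,
                         String l1, String l2, String f, String loadNode) {
        Set<String> SI = new HashSet<>();   // field targets of l2's nodes
        Set<String> SE = new HashSet<>();   // escaped nodes l2 points to
        for (String n2 : I.getOrDefault(l2, Set.of())) {
            SI.addAll(I.getOrDefault(n2 + "." + f, Set.of()));
            if (escaped.contains(n2)) SE.add(n2);
        }
        Set<String> targets = new HashSet<>(SI);
        if (!SE.isEmpty()) {                // case 2: some node escaped
            targets.add(loadNode);          // Gen_I = {l1} x (S_I U {n})
            for (String n2 : SE)            // Gen_O = (S_E x {f}) x {n}
                O.computeIfAbsent(n2 + "." + f, k -> new HashSet<>())
                 .add(loadNode);
            escaped.add(loadNode);          // load nodes are escaped
        }
        I.put(l1, targets);                 // Kill_I = edges(I, l1)
    }
}
```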
45-46
  • store statement l1.f = l2
  • GenI = (succ(I, l1) × {f}) × succ(I, l2)
  • I' = I ∪ GenI
[Figure: generated inside edges labeled f from the nodes l1 points to, to the nodes l2 points to]
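The store rule can be sketched as a weak update over a toy edge map (string-named nodes, field edges keyed by "node.field"; illustrative only, nothing is killed):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Transfer function for the store statement "l1.f = l2":
// Gen_I = (succ(I, l1) x {f}) x succ(I, l2); I' = I U Gen_I.
public class StoreTransfer {
    static void transfer(Map<String, Set<String>> I,
                         String l1, String f, String l2) {
        for (String n1 : I.getOrDefault(l1, Set.of()))
            for (String n2 : I.getOrDefault(l2, Set.of()))
                I.computeIfAbsent(n1 + "." + f, k -> new HashSet<>())
                 .add(n2);
    }
}
```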
47-48
  • object creation site l = new cl
  • KillI = edges(I, l)
  • GenI = {⟨l, n⟩}
  • where n is the inside node for l = new cl
[Figure: generated edge from l to the fresh inside node n]
49
Method Call
  • Analysis of a method call
  • Start with points-to escape graph before the call
    site
  • Retrieve the points-to escape graph from analysis
    of callee
  • Map outside nodes of callee graph to nodes of
    caller graph
  • Combine callee graph into caller graph
  • Result is the points-to escape graph after the
    call site

50-54
Combining Graphs at a Method Call: t = new Task(v,a)
  • Start With Graph Before Call
  • Retrieve Graph from Callee (from the analysis of Task(w,d))
  • Map Parameters from Callee to Caller
  • Transfer Edges from Callee to Caller
  • Discard Parameter Nodes from Callee
[Figure sequence: points-to escape graph before the call, graph from the analysis of Task(w,d) with work and dest edges from this, and the combined graph after the call]
55-61
More General Example: Call to x.foo()
  • Initialize Mapping: Map Formals to Actuals
  • Extend Mapping: Match Inside and Outside Edges (mapping is unidirectional, from callee to caller)
  • Complete Mapping: Automap Load and Inside Nodes Reachable from Mapped Nodes
  • Combine Mapping: Project Edges from Callee Into Combined Graph
  • Discard Callee Graph
  • Discard Outside Edges From Captured Nodes
[Figure sequence: points-to escape graph before the call to x.foo(), graph from the analysis of foo(), and the combined graph after the call]
62
Interthread Analysis
  • Augment Analysis Representation
  • Parallel Thread Set
  • Action Set (read, write, sync, create edge)
  • Action Ordering Information (relative to thread
    start actions)
  • Thread Interaction Analysis
  • Combine points-to graphs
  • Induces combination of other information
  • Can perform interthread analysis at any point to
    improve precision of results

63-75
Combining Points-to Graphs at a Thread Start: x.start()
  • Initialize Mapping: Map Startee Thread to Starter Thread
  • Extend Mapping: Match Inside and Outside Edges (mapping is bidirectional: from startee to starter and from starter to startee)
  • Complete Mapping: Automap Load and Inside Nodes Reachable from Mapped Nodes
  • Combine Graphs: Project Edges Through Mappings Into Combined Graph
  • Discard Startee Thread Node
  • Discard Outside Edges From Captured Nodes
[Figure sequence: points-to escape graph sometime after the call to x.start(), graph from the analysis of run(), and the combined graph]
76
Life is not so Simple
  • Dependences between phases
  • Mapping is best framed as a constraint satisfaction problem
  • Solved using a constraint satisfaction algorithm

77
Interthread Analysis With Actions and Ordering
78
Analysis Result for generateTask
  • Parallel Threads: t
  • Actions: sync b; rd b; wr a; wr b; wr c; wr d
  • Action Ordering: all actions happen before thread a starts executing
[Figure: points-to graph with Task node a, Vector node b, Accumulator node e, and nodes c and d; work and dest edges from the Task]
79
Analysis Result for run
  • Parallel Threads: none
  • Actions: rd 1, rd 2, rd 3, rd 4, rd 5, rd 6; sync 2, sync 5; wr 5, wr 6; edge(1,2), edge(1,5), edge(2,3), edge(3,4)
  • Action Ordering: none
[Figure: points-to graph with numbered nodes 1-6: this -> Task node 1, work edge -> Vector node 2, dest edge -> Accumulator node 5, plus an Enumeration node]
80
Role of edge(1,2) Actions
  • One edge action for each outside edge
  • Action order for edge actions improves precision
    of interthread analysis
  • If starter thread reads a reference before
    startee thread is started
  • Then reference was not created by startee thread
  • Outside edge actions record order
  • Inside edges from startee matched only against
    parallel outside edges

81-82
Edge Actions in Combining Points-to Graphs
[Figure: points-to escape graph sometime after the call to x.start(), graph from the analysis of run(), and the associated action ordering (none)]
83
Analysis Result After Interaction
  • Parallel Threads: t
  • Actions: sync b; rd b; wr a; wr b; wr c; wr d; sync b, a; sync e, a; rd a, a; rd b, a; rd c, a; rd d, a; rd e, a; wr e, a
  • Action Ordering: all actions from the current thread happen before thread a starts executing
[Figure: points-to graph with Task node a, Vector node b, Accumulator node e, and nodes c and d; work and dest edges from the Task]
84
Roles of Intrathread and Interthread Analyses
  • Basic Analysis
  • Intrathread analysis delivers parallel
    interaction graph at each program point
  • records parallel threads
  • does not compute thread interaction
  • Choose program point (end of method)
  • Interthread analysis delivers additional
    precision at that program point
  • Does not exploit ordering information from thread
    join constructs

85
Join Ordering
    t = new Task();
    t.start();
    // computation that runs in parallel with task t
    t.join();
    // computation that runs after task t

(t.run() is the computation from task t)
86
Exploiting Join Ordering
  • At join point
  • Interthread analysis delivers new (more precise)
    parallel interaction graph
  • Intrathread analysis uses new graph
  • No parallel interactions between the joined thread
    and the computation after the join

87
Extensions
  • Partial program analysis
  • can analyze method independent of callers
  • can analyze method independent of methods it
    invokes
  • can incrementally analyze callees to improve
    precision
  • Dial down precision to improve efficiency
  • Demand-driven formulations

88
Key Ideas
  • Explicitly represent potential interactions
    between analyzed and unanalyzed parts
  • Inside versus outside nodes and edges
  • Escaped versus captured nodes
  • Precisely bound ignorance
  • Exploit ordering information
  • intrathread (flow sensitive)
  • interthread (starts, edge orders, joins)

89
Analysis Uses
  • Overheads in Standard Execution and How to
    Eliminate Them

90
Intrathread Analysis Result from End of run Method
  • Enumeration object is captured
  • Does not escape to caller
  • Does not escape to parallel threads
  • Lifetime of Enumeration object is bounded by
    lifetime of run
  • Can allocate Enumeration object on call stack
    instead of heap

[Figure: points-to graph - this -> Task node; work edge -> Vector node; dest edge -> Accumulator node; captured Enumeration node]
91
Interthread Analysis Result from End of generateTask Method
[Figure: parallel interaction graph with parallel threads, actions, points-to graph, and action ordering (all actions from the current thread happen before thread a starts executing)]
  • Vector object is captured
  • Multiple threads synchronize on Vector object
  • But synchronizations from different threads do not occur concurrently
  • Can eliminate synchronization on Vector object
92
Interthread Analysis Result from End of generateTask Method
[Figure: same parallel interaction graph as the previous slide]
  • Vectors, Tasks, Integers captured
  • Parent and child access the objects
  • Parent completes its accesses before child starts its accesses
  • Can allocate objects on child's per-thread heap
93
Thread Overhead
  • Inefficient Thread Implementations
  • Thread Creation Overhead
  • Thread Management Overhead
  • Stack Overhead
  • Use a more efficient thread implementation
  • User-level thread management
  • Per-thread heaps
  • Event-driven form

94-95
Standard Thread Implementation
  • Call frames allocated on stack
  • Context Switch
  • Save state on stack
  • Resume another thread
  • One stack per thread
[Figure: a thread's stack of call frames (return address, frame pointer, locals), with a save area pushed at a context switch]
96
Event-Driven Form
  • Call frames allocated on stack
  • Context Switch
  • Build continuation on heap
  • Copy out live variables
  • Return out of computation
  • Resume another continuation
  • One stack per processor
[Figure: call frames on the stack, with heap-allocated continuation records holding a resume method and copied-out live variables]
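The continuation-building step can be illustrated with a small hypothetical Java sketch. The Continuation interface and all names are assumptions for illustration (the Flex compiler performs this transformation automatically, at the IR level):

```java
// Sketch of the event-driven transformation: at a would-be context
// switch, live variables are copied into a heap-allocated continuation
// with a resume method, and the computation returns out of the stack.
public class EventDrivenSketch {
    interface Continuation { int resume(int input); }

    // Threaded form would block between the two halves:
    //   int c = ...; int x = blockingRead(); return c + x;
    // Event-driven form returns a continuation instead of blocking.
    static Continuation firstHalf(int c) {
        final int liveC = c;               // copy out the live variable c
        return new Continuation() {        // build continuation on heap
            public int resume(int x) {     // runs when the "I/O" completes
                return liveC + x;
            }
        };
    }
}
```

The scheduler would later invoke resume with the result of the asynchronous operation, on whichever processor stack is available.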
97
Complications
  • Standard thread models use blocking I/O
  • Automatically convert blocking I/O to
    asynchronous I/O
  • Scheduler manages interleaving of thread
    executions
  • Stack Allocatable Objects May Be Live Across
    Blocking Calls
  • Transfer allocation to per-thread heap

98
Opportunity
  • On a uniprocessor, compiler controls placement of
    context switch points
  • If program does not hold lock across blocking
    call, can eliminate lock

99
Experimental Results
  • MIT Flex Compiler System
  • Static Compiler
  • Native code for StrongARM
  • Server Benchmarks
  • http, phone, echo, time
  • Scientific Computing Benchmarks
  • water, barnes

100
Server Benchmark Characteristics

Benchmark  IR Size (instrs)  Number of Methods  Pre Analysis (secs)  Intrathread Analysis (secs)  Interthread Analysis (secs)
echo       4,639             131                28                   74                           73
time       4,573             136                29                   70                           74
http       10,643            292                103                  199                          269
phone      9,547             267                75                   191                          256
101
Percentage of Eliminated Synchronization Operations
[Bar chart: percentage of synchronization operations eliminated for http, phone, time, echo, and mtrt, comparing intrathread-only analysis against interthread analysis]
102
Compilation Options for Performance Results
  • Standard
  • kernel threads, synch included
  • Event-Driven
  • event-driven, no synch at all
  • Per-Thread Heap
  • event-driven, no synch at all, per-thread heap
    allocation

103
Throughput (Responses per Second)
[Bar chart: throughput for http 2K, http 20K, echo, time, and phone under the Standard, Event-Driven, and Per-Thread Heap configurations]
104
Scientific Benchmark Characteristics

Benchmark  IR Size (instrs)  Number of Methods  Total Analysis Time (secs)  Pre Analysis Time (secs)
water      25,583            335                1156                        380
barnes     19,764            364                491                         129
105
Compiler Options
  • 0: Sequential C
  • 1: Baseline (kernel threads)
  • 2: Lightweight Threads
  • 3: Lightweight Threads + Stack Allocation
  • 4: Lightweight Threads + Stack Allocation − Synchronization

106
Execution Times
[Bar chart: execution time of water small, water, and barnes as a proportion of sequential C execution time, for the Baseline, Light, Stack, and −Synch configurations]
107
Related Work
  • Pointer Analysis for Sequential Programs
  • Chatterjee, Ryder, Landi (POPL 99)
  • Sathyanathan and Lam (LCPC 96)
  • Steensgaard (POPL 96)
  • Wilson and Lam (PLDI 95)
  • Emami, Ghiya, Hendren (PLDI 94)
  • Choi, Burke, Carini (POPL 93)

108
Related Work
  • Pointer Analysis for Multithreaded Programs
  • Rugina and Rinard (PLDI 99) (fork-join
    parallelism, not compositional)
  • We have extended our points-to analysis for
    multithreaded programs (irregular, thread-based
    concurrency, compositional)
  • Escape Analysis
  • Blanchet (POPL 98)
  • Deutsch (POPL 90, POPL 97)
  • Park and Goldberg (PLDI 92)

109
Related Work
  • Synchronization Optimizations
  • Diniz and Rinard (LCPC 96, POPL 97)
  • Plevyak, Zhang, Chien (POPL 95)
  • Aldrich, Chambers, Sirer, Eggers (SAS99)
  • Blanchet (OOPSLA 99)
  • Bogda, Hoelzle (OOPSLA 99)
  • Choi, Gupta, Serrano, Sreedhar, Midkiff (OOPSLA
    99)
  • Ruf (PLDI 00)

110
Conclusion
  • New Analysis Algorithm
  • Flow-sensitive, compositional
  • Multithreaded programs
  • Explicitly represent interactions between
    analyzed and unanalyzed parts
  • Analysis Uses
  • Synchronization elimination
  • Stack allocation
  • Per-thread heap allocation
  • Lightweight Threads