TRANSCRIPT
Analyses and Optimizations for Multithreaded Programs
Martin Rinard, Alex Salcianu, Brian Demsky
MIT Laboratory for Computer Science
John Whaley, IBM Tokyo Research Laboratory
Motivation
• Threads are Ubiquitous
  • Parallel Programming for Performance
  • Manage Multiple Connections
  • System Structuring Mechanism
• Overhead
  • Thread Management
  • Synchronization
• Opportunities
  • Improved Memory Management
What This Talk is About
• New Abstraction: Parallel Interaction Graph
  • Points-To Information
  • Reachability and Escape Information
  • Interaction Information
    • Caller-Callee Interactions
    • Starter-Startee Interactions
  • Action Ordering Information
• Analysis Algorithm
• Analysis Uses (synchronization elimination, stack allocation, per-thread heap allocation)
Outline
• Example
• Analysis Representation and Algorithm
• Lightweight Threads
• Results
• Conclusion
Sum Sequence of Numbers
[Figure: the sequence 9 8 1 5 3 7 2 6]
Group in Subsequences
[Figure: the sequence grouped into pairs: (9 8) (1 5) (3 7) (2 6)]
Sum Subsequences (in Parallel)
[Figure: each pair summed in parallel: 17, 6, 10, 8]
Add Sums Into Accumulator
[Figure: animation — the partial sums 17, 6, 10, 8 are added one at a time into a shared accumulator, whose value goes 0 → 17 → 23 → 33 → 41]
Common Schema
• Set of tasks
• Chunk tasks to increase granularity
• Tasks have both
  • Independent computation
  • Updates to shared data
Realization in Java
class Accumulator {
    int value = 0;
    synchronized void add(int v) { value += v; }
}
Realization in Java
class Task extends Thread {
    Vector work;
    Accumulator dest;
    Task(Vector w, Accumulator d) { work = w; dest = d; }
    public void run() {
        int sum = 0;
        Enumeration e = work.elements();
        while (e.hasMoreElements())
            sum += ((Integer) e.nextElement()).intValue();
        dest.add(sum);
    }
}
[Figure: a Task object whose work field points to a Vector of Integers and whose dest field points to an Accumulator (value 0); an Enumeration iterates over the Vector]
Realization in Java
void generateTask(int l, int u, Accumulator a) {
    Vector v = new Vector();
    for (int j = l; j < u; j++)
        v.addElement(new Integer(j));
    Task t = new Task(v, a);
    t.start();
}
void generate(int n, int m, Accumulator a) {
    for (int i = 0; i < n; i++)
        generateTask(i*m, (i+1)*m, a);
}
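To make the example concrete, here is a self-contained, runnable version of the slides' classes. The driver method `sumInParallel` and the task-joining logic are additions of mine (the slides' `generate` fires tasks and forgets them), included only so the result can be checked deterministically:

```java
import java.util.Enumeration;
import java.util.Vector;

class ParallelSum {
    static class Accumulator {
        int value = 0;
        synchronized void add(int v) { value += v; }
    }

    static class Task extends Thread {
        Vector work; Accumulator dest;
        Task(Vector w, Accumulator d) { work = w; dest = d; }
        public void run() {
            int sum = 0;
            Enumeration e = work.elements();
            while (e.hasMoreElements())
                sum += ((Integer) e.nextElement()).intValue();
            dest.add(sum);   // one synchronized update per task
        }
    }

    // n chunks of m consecutive integers each: chunk i covers [i*m, (i+1)*m)
    static int sumInParallel(int n, int m) throws InterruptedException {
        Accumulator a = new Accumulator();
        Vector tasks = new Vector();
        for (int i = 0; i < n; i++) {
            Vector v = new Vector();
            for (int j = i * m; j < (i + 1) * m; j++)
                v.addElement(new Integer(j));
            Task t = new Task(v, a);
            tasks.addElement(t);
            t.start();
        }
        // join all tasks so the accumulator's final value is well defined
        for (Enumeration e = tasks.elements(); e.hasMoreElements(); )
            ((Task) e.nextElement()).join();
        return a.value;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sumInParallel(4, 2)); // 0+1+...+7 = 28
    }
}
```

The legacy `Vector`/`Enumeration`/`new Integer` idioms are kept deliberately to match the slides' era.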
Task Generation
[Figure: animation — generate runs in a loop: each iteration allocates a fresh Vector, fills it with the Integers of one chunk, allocates a Task whose work field points to the Vector and whose dest field points to the shared Accumulator, and starts the Task]
Analysis
Analysis Overview
• Interprocedural
• Interthread
• Flow-sensitive
  • Statement ordering within thread
  • Action ordering between threads
• Compositional, Bottom Up
• Explicitly Represent Potential Interactions Between Analyzed and Unanalyzed Parts
• Partial Program Analysis
Analysis Result for run Method
public void run() {
    int sum = 0;
    Enumeration e = work.elements();
    while (e.hasMoreElements())
        sum += ((Integer) e.nextElement()).intValue();
    dest.add(sum);
}
• Abstraction: Points-to Graph
  • Nodes Represent Objects
  • Edges Represent References
• Inside Nodes
  • Objects Created Within Current Analysis Scope
  • One Inside Node Per Allocation Site
  • Represents All Objects Created At That Site
• Outside Nodes
  • Objects Created Outside Current Analysis Scope
  • Objects Accessed Via References Created Outside Current Analysis Scope
  • One per Static Class Field, One per Parameter, One per Load Statement
  • Load Node Represents Objects Loaded at That Statement
• Inside Edges
  • References Created Inside Current Analysis Scope
• Outside Edges
  • References Created Outside Current Analysis Scope
  • Potential Interactions in Which Analyzed Part Reads Reference Created in Unanalyzed Part
[Figure: points-to graph for run — this → Task, with work → Vector, dest → Accumulator, and an inside Enumeration node]
Concept of Escaped Node
• Escaped Nodes Represent Objects Accessible Outside Current Analysis Scope
  • parameter nodes, load nodes
  • static class field nodes
  • nodes passed to unanalyzed methods
  • nodes reachable from unanalyzed but started threads
  • nodes reachable from escaped nodes
• Node is Captured if it is Not Escaped
Why Escaped Concept is Important
• Completeness of Analysis Information
  • Complete information for captured nodes
  • Potentially incomplete for escaped nodes
• Lifetime Implications
  • Captured nodes are inaccessible when analyzed part of the program terminates
• Memory Management Optimizations
  • Stack allocation
  • Per-Thread Heap Allocation
Intrathread Dataflow Analysis
• Computes a points-to escape graph for each program point
• Points-to escape graph is a triple <I, O, e>
  • I – set of inside edges
  • O – set of outside edges
  • e – escape information for each node
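A minimal sketch of this analysis state in Java, assuming a string-named node representation of my own invention (not the paper's implementation): edges are stored as source → field → targets maps, and the escape predicate e is computed as reachability from a set of escaped roots (parameter/load/static nodes, nodes passed to unanalyzed code or started threads):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PointsToEscapeGraph {
    // edge sets: source node -> field -> set of target nodes
    final Map<String, Map<String, Set<String>>> inside = new HashMap<>();
    final Map<String, Map<String, Set<String>>> outside = new HashMap<>();
    // nodes that escape directly (parameters, loads, statics, unanalyzed calls, threads)
    final Set<String> escapedRoots = new HashSet<>();

    void addInsideEdge(String src, String field, String dst) {
        inside.computeIfAbsent(src, k -> new HashMap<>())
              .computeIfAbsent(field, k -> new HashSet<>()).add(dst);
    }

    // e(n): a node is escaped iff it is a root or reachable from one
    Set<String> escaped() {
        Set<String> esc = new HashSet<>(escapedRoots);
        Deque<String> work = new ArrayDeque<>(escapedRoots);
        while (!work.isEmpty()) {
            String n = work.pop();
            for (Map<String, Map<String, Set<String>>> edges : List.of(inside, outside))
                for (Set<String> targets : edges.getOrDefault(n, Map.of()).values())
                    for (String t : targets)
                        if (esc.add(t)) work.push(t);
        }
        return esc;
    }

    boolean captured(String n) { return !escaped().contains(n); }
}
```

For example, if a Task node escapes (it is reachable from a started thread), the Vector it points to escapes too, while a local Enumeration node with no path from any root stays captured.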
Dataflow Analysis
• Initial state:
  • I: formals point to parameter nodes, classes point to class nodes
  • O: Ø
• Transfer functions:
  • I′ = (I – Kill_I) ∪ Gen_I
  • O′ = O ∪ Gen_O
• Confluence operator is ∪
Intraprocedural Analysis
• Must define transfer functions for:
  • copy statement: l = v
  • load statement: l1 = l2.f
  • store statement: l1.f = l2
  • return statement: return l
  • object creation site: l = new cl
  • method invocation: l = l0.op(l1…lk)
copy statement l = v
Kill_I = edges(I, l)
Gen_I = {l} × succ(I, v)
I′ = (I – Kill_I) ∪ Gen_I
[Figure: l's existing edges are killed; generated edges make l point to every node v points to]
load statement l1 = l2.f
S_E = {n2 ∈ succ(I, l2) . escaped(n2)}
S_I = ∪ {succ(I, n2, f) . n2 ∈ succ(I, l2)}
case 1: l2 does not point to an escaped node (S_E = Ø)
Kill_I = edges(I, l1)
Gen_I = {l1} × S_I
[Figure: l1's existing edges are killed; generated edges make l1 point to every node reachable from l2 via an f edge]
load statement l1 = l2.f
case 2: l2 does point to an escaped node (S_E ≠ Ø)
Kill_I = edges(I, l1)
Gen_I = {l1} × (S_I ∪ {n})
Gen_O = (S_E × {f}) × {n}
where n is the load node for l1 = l2.f
[Figure: generated edges make l1 point to S_I and to the load node n; an outside f edge runs from each escaped node in S_E to n]
store statement l1.f = l2
Gen_I = (succ(I, l1) × {f}) × succ(I, l2)
I′ = I ∪ Gen_I
[Figure: a generated f edge runs from every node l1 points to, to every node l2 points to; no edges are killed]
object creation site l = new cl
Kill_I = edges(I, l)
Gen_I = {<l, n>}
where n is the inside node for l = new cl
[Figure: l's existing edges are killed; a generated edge makes l point to the site's inside node n]
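The transfer functions above can be sketched in Java over explicit edge sets. This is an illustrative model of my own, not the Flex implementation: variable edges are pairs ⟨l, n⟩, heap edges are triples ⟨n, f, n′⟩, and the escaped case of load (which would also generate outside edges) is omitted for brevity:

```java
import java.util.HashSet;
import java.util.Set;

class TransferFunctions {
    record VarEdge(String var, String node) {}
    record HeapEdge(String src, String field, String dst) {}

    final Set<VarEdge> varEdges = new HashSet<>();
    final Set<HeapEdge> heapEdges = new HashSet<>();
    int freshId = 0;   // distinguishes allocation sites

    Set<String> succ(String var) {                 // succ(I, v)
        Set<String> s = new HashSet<>();
        for (VarEdge e : varEdges) if (e.var().equals(var)) s.add(e.node());
        return s;
    }

    // copy statement l = v: kill l's edges, then l points to succ(I, v)
    void copy(String l, String v) {
        Set<String> targets = succ(v);
        varEdges.removeIf(e -> e.var().equals(l));
        for (String n : targets) varEdges.add(new VarEdge(l, n));
    }

    // load statement l1 = l2.f (captured case): l1 points to succ(I, n2, f)
    void load(String l1, String l2, String f) {
        Set<String> targets = new HashSet<>();
        for (String n2 : succ(l2))
            for (HeapEdge e : heapEdges)
                if (e.src().equals(n2) && e.field().equals(f)) targets.add(e.dst());
        varEdges.removeIf(e -> e.var().equals(l1));
        for (String n : targets) varEdges.add(new VarEdge(l1, n));
    }

    // store statement l1.f = l2: add heap edges, kill nothing (weak update)
    void store(String l1, String f, String l2) {
        for (String n1 : succ(l1))
            for (String n2 : succ(l2)) heapEdges.add(new HeapEdge(n1, f, n2));
    }

    // object creation l = new cl: l points to the site's inside node
    String newObject(String l, String cl) {
        String n = cl + "#" + freshId++;
        varEdges.removeIf(e -> e.var().equals(l));
        varEdges.add(new VarEdge(l, n));
        return n;
    }
}
```

Note the asymmetry visible in the code: copy, load, and new perform strong updates on the variable (kill then gen), while store only adds heap edges, matching the Gen-only rule on the store slide.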
Method Call
• Analysis of a method call:
  • Start with points-to escape graph before the call site
  • Retrieve the points-to escape graph from analysis of callee
  • Map outside nodes of callee graph to nodes of caller graph
  • Combine callee graph into caller graph
• Result is the points-to escape graph after the call site
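The core of the combination step can be sketched as follows. This is a deliberately simplified model of my own: it only maps the callee's parameter nodes to the nodes the actuals point to and projects the callee's inside edges through that mapping; the edge-matching and automapping phases described on the following slides are omitted:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CallSiteCombine {
    record Edge(String src, String field, String dst) {}

    static Set<Edge> combine(Set<Edge> callerHeap,
                             // parameter node -> caller nodes the actual points to
                             Map<String, Set<String>> actualTargets,
                             Set<Edge> calleeHeap) {
        Set<Edge> result = new HashSet<>(callerHeap);
        for (Edge e : calleeHeap) {
            // a node maps to itself unless it is a parameter node
            Set<String> srcs = actualTargets.getOrDefault(e.src(), Set.of(e.src()));
            Set<String> dsts = actualTargets.getOrDefault(e.dst(), Set.of(e.dst()));
            for (String s : srcs)
                for (String d : dsts) result.add(new Edge(s, e.field(), d));
        }
        return result;
    }
}
```

For the constructor call t = new Task(v,a), the callee edges this.work → w and this.dest → d project to work and dest edges between the caller's Task, Vector, and Accumulator nodes, after which the parameter nodes are discarded.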
Example: call to t = new Task(v,a)
• Start with graph before call
• Retrieve graph from analysis of callee Task(w,d)
• Map parameters from callee to caller (this → t's node, w → v's node, d → a's node)
• Transfer edges (work, dest) from callee to caller
• Discard parameter nodes from callee
[Figure: animation combining the caller's graph (locals v, t, a) with the callee's graph (this, w, d) into the combined graph after the call]
More General Example: call to x.foo()
• Initialize mapping: map formals to actuals
• Extend mapping: match inside and outside edges (mapping is unidirectional, from callee to caller)
• Complete mapping: automap load and inside nodes reachable from mapped nodes
• Combine: project edges from callee into combined graph
• Discard callee graph
• Discard outside edges from captured nodes
[Figure: animation combining the graph before x.foo() (locals x, y, z) with the graph from analysis of foo() (this) into the combined graph after the call]
Interthread Analysis
• Augment Analysis Representation
  • Parallel Thread Set
  • Action Set (read, write, sync, create edge)
  • Action Ordering Information (relative to thread start actions)
• Thread Interaction Analysis
  • Combine points-to graphs
  • Induces combination of other information
• Can perform interthread analysis at any point to improve precision of results
Combining Points-to Graphs at x.start()
• Initialize mapping: map startee thread node to starter thread node
• Extend mapping: match inside and outside edges (mapping is bidirectional: from startee to starter and from starter to startee)
• Complete mapping: automap load and inside nodes reachable from mapped nodes
• Combine graphs: project edges through mappings into combined graph
• Discard startee thread node
• Discard outside edges from captured nodes
[Figure: animation combining the points-to escape graph sometime after the call to x.start() with the graph from analysis of run() (this) into the combined graph]
Life is not so Simple
• Dependences between phases
• Mapping best framed as constraint satisfaction problem
• Solved using constraint satisfaction algorithm
Interthread Analysis With Actions and Ordering
[Figure: parallel interaction graph — points-to graph: local t → Task node a, with work → Vector (nodes c, d) and dest → Accumulator (nodes b, e); parallel threads: a; actions: wr a, wr b, wr c, wr d, sync b, rd b; action ordering: "All actions happen before thread a starts executing"]
Analysis Result for run
[Figure: points-to graph — this → Task node 1, with work → Vector node 2 (contents via nodes 3 and 4), dest → Accumulator node 5, and inside Enumeration node 6]
• Parallel threads: none
• Actions: rd 1, rd 2, rd 3, rd 4, rd 5, rd 6, wr 5, wr 6, sync 2, sync 5, edge(1,2), edge(1,5), edge(2,3), edge(3,4)
• Action ordering: none
Role of edge(1,2) Actions
• One edge action for each outside edge
• Action order for edge actions improves precision of interthread analysis
• If starter thread reads a reference before startee thread is started
  • Then reference was not created by startee thread
• Outside edge actions record order
• Inside edges from startee matched only against parallel outside edges
Edge Actions in Combining Points-to Graphs
[Figure: two frames — the starter's graph sometime after x.start() (nodes 1, 2, 3) and the startee's graph from analysis of run(); the starter's action ordering relates edge(1,2) to thread 1 (i.e., edge(1,2) was created before thread 1 started); the startee records none]
Analysis Result After Interaction
[Figure: parallel interaction graph — local t → Task node a, with work → Vector (nodes c, d) and dest → Accumulator (nodes b, e)]
• Parallel threads: a
• Actions from current thread: wr a, wr b, wr c, wr d, sync b, rd b
• Actions from thread a: rd a, rd b, rd c, rd d, rd e, wr e, sync b, sync e
• Action ordering: "All actions from current thread happen before thread a starts executing"
Roles of Intrathread and Interthread Analyses
• Basic Analysis
  • Intrathread analysis delivers parallel interaction graph at each program point
    • records parallel threads
    • does not compute thread interaction
  • Choose program point (end of method)
  • Interthread analysis delivers additional precision at that program point
• Does not exploit ordering information from thread join constructs
Join Ordering
t = new Task();
t.start();
    “computation that runs in parallel with task t”
t.join();
    “computation that runs after task t”
(t.run() is the “computation from task t”)
Exploiting Join Ordering
• At join point
  • Interthread analysis delivers new (more precise) parallel interaction graph
  • Intrathread analysis uses new graph
• No parallel interactions between
  • Thread
  • Computation after join
Extensions
• Partial program analysis
  • can analyze method independent of callers
  • can analyze method independent of methods it invokes
  • can incrementally analyze callees to improve precision
• Dial down precision to improve efficiency
• Demand-driven formulations
Key Ideas
• Explicitly represent potential interactions between analyzed and unanalyzed parts
  • Inside versus outside nodes and edges
  • Escaped versus captured nodes
  • Precisely bound ignorance
• Exploit ordering information
  • intrathread (flow sensitive)
  • interthread (starts, edge orders, joins)
Analysis Uses
Overheads in Standard Execution and How to Eliminate Them
Intrathread Analysis Result from End of run Method
[Figure: points-to graph — this → Task node 1, with work → Vector node 2 (contents via nodes 3 and 4), dest → Accumulator node 5, and captured inside Enumeration node 6]
• Enumeration object is captured
  • Does not escape to caller
  • Does not escape to parallel threads
• Lifetime of Enumeration object is bounded by lifetime of run
• Can allocate Enumeration object on call stack instead of heap
Interthread Analysis Result from End of generateTask Method
[Figure: parallel interaction graph — local t → Task node a, with work → Vector (nodes c, d) and dest → Accumulator (nodes b, e); parallel threads: a; actions from current thread: wr a, wr b, wr c, wr d, sync b, rd b; actions from thread a: rd a, rd b, rd c, rd d, rd e, wr e, sync b, sync e; ordering: "All actions from current thread happen before thread a starts executing"]
• Vector object is captured
• Multiple threads synchronize on Vector object
• But synchronizations from different threads do not occur concurrently
• Can eliminate synchronization on Vector object
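The elimination condition described here can be sketched as a check over the analysis results. The record shapes and the `mayRunConcurrently` relation are illustrative simplifications of mine, standing in for the action-ordering information of the parallel interaction graph:

```java
import java.util.List;
import java.util.Set;

class SyncElimination {
    record Sync(String node, String thread) {}

    // A sync on `node` can be removed when the node is captured and no two
    // sync actions on it from different threads can occur concurrently.
    static boolean canEliminate(String node,
                                Set<String> capturedNodes,
                                List<Sync> syncs,
                                // ordered pairs of threads whose actions may overlap
                                Set<List<String>> mayRunConcurrently) {
        if (!capturedNodes.contains(node)) return false;
        List<Sync> onNode = syncs.stream()
                .filter(s -> s.node().equals(node)).toList();
        for (Sync s1 : onNode)
            for (Sync s2 : onNode)
                if (!s1.thread().equals(s2.thread())
                        && mayRunConcurrently.contains(
                               List.of(s1.thread(), s2.thread())))
                    return false;
        return true;
    }
}
```

In the slides' example, the parent's syncs on the Vector all happen before the child thread starts, so no pair of syncs from different threads may overlap and the lock can be removed.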
Interthread Analysis Result from End of generateTask Method
[Figure: the same parallel interaction graph — nodes a–e, parallel thread a, actions, and ordering as on the previous slide]
• Vectors, Tasks, Integers captured
• Parent, child access objects
• Parent completes accesses before child starts accesses
• Can allocate objects on child's per-thread heap
Thread Overhead
• Inefficient Thread Implementations
  • Thread Creation Overhead
  • Thread Management Overhead
  • Stack Overhead
• Use a more efficient thread implementation
  • User-level thread management
  • Per-thread heaps
  • Event-driven form
Standard Thread Implementation
[Figure: one stack per thread, each holding call frames (return address, frame pointer, locals) plus a save area for context-switch state]
• Call frames allocated on stack
• Context Switch
  • Save state on stack
  • Resume another thread
• One stack per thread
Event-Driven Form
[Figure: one stack per processor holding call frames; continuations on the heap, each with a resume method and copied-out live variables]
• Call frames allocated on stack
• Context Switch
  • Build continuation on heap
  • Copy out live variables
  • Return out of computation
  • Resume another continuation
• One stack per processor
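The event-driven form can be sketched as follows. The scheduler and the shape of the continuations are illustrative (my own, not the compiler's actual transformation): a computation that reaches a would-be blocking point packages its live variables into a heap-allocated continuation, returns out, and a scheduler resumes it later:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class EventDriven {
    interface Continuation { void resume(); }

    // run queue of heap-allocated continuations (stands in for the scheduler)
    static final Deque<Continuation> ready = new ArrayDeque<>();
    static final StringBuilder log = new StringBuilder();

    // "first half" of a computation: runs up to the blocking point, then
    // builds a continuation capturing the live variable x and returns out
    static void firstHalf(final int x) {
        log.append("before(").append(x).append(") ");
        ready.add(() -> secondHalf(x));   // copy out live variables
        // return out of the computation instead of blocking
    }

    // "second half": the code after the blocking point, as a resume method
    static void secondHalf(int x) {
        log.append("after(").append(x).append(") ");
    }

    static String run() {
        firstHalf(1);
        firstHalf(2);                     // interleaves with the first
        while (!ready.isEmpty()) ready.pop().resume();
        return log.toString();
    }
}
```

Because each computation returns out at the blocking point, both first halves run before either second half, all on a single stack — the interleaving that a thread-per-computation implementation would achieve with two stacks and a context switch.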
Complications
• Standard thread models use blocking I/O
  • Automatically convert blocking I/O to asynchronous I/O
  • Scheduler manages interleaving of thread executions
• Stack Allocatable Objects May Be Live Across Blocking Calls
  • Transfer allocation to per-thread heap
Opportunity
• On a uniprocessor, compiler controls placement of context switch points
• If program does not hold lock across blocking call, can eliminate lock
Experimental Results
• MIT Flex Compiler System
  • Static Compiler
  • Native code for StrongARM
• Server Benchmarks
  • http, phone, echo, time
• Scientific Computing Benchmarks
  • water, barnes
Server Benchmark Characteristics
         IR Size   Number of   Pre-Analysis   Intrathread Analysis   Interthread Analysis
         (instrs)  Methods     Time (secs)    Time (secs)            Time (secs)
echo       4,639      131           28                 74                    73
time       4,573      136           29                 70                    74
http      10,643      292          103                199                   269
phone      9,547      267           75                191                   256
Percentage of Eliminated Synchronization Operations
[Figure: bar chart, 0–100%, for http, phone, time, echo, mtrt — one bar for "Intrathread only" and one for "Interthread" per benchmark]
Compilation Options for Performance Results
• Standard
  • kernel threads, synch included
• Event-Driven
  • event-driven, no synch at all
• +Per-Thread Heap
  • event-driven, no synch at all, per-thread heap allocation
Throughput (Responses per Second)
[Figure: bar chart, 0–400 responses per second, for echo, time, http2K, http20K, phone — bars for Standard, Event-Driven, and +Per-Thread Heap]
Scientific Benchmark Characteristics

         IR Size   Number of   Pre-Analysis   Total Analysis
         (instrs)  Methods     Time (secs)    Time (secs)
water     25,583      335          380             1156
barnes    19,764      364          129              491
Compiler Options
0: Sequential C++
1: Baseline – Kernel Threads
2: Lightweight Threads
3: Lightweight Threads + Stack Allocation
4: Lightweight Threads + Stack Allocation – Synchronization
Execution Times
[Figure: bar chart — proportion of sequential C++ execution time (0 to 1) for Baseline, +Light, +Stack, and –Synch on water small, water, and barnes]
Related Work
• Pointer Analysis for Sequential Programs
  • Chatterjee, Ryder, Landi (POPL 99)
  • Sathyanathan & Lam (LCPC 96)
  • Steensgaard (POPL 96)
  • Wilson & Lam (PLDI 95)
  • Emami, Ghiya, Hendren (PLDI 94)
  • Choi, Burke, Carini (POPL 93)
Related Work
• Pointer Analysis for Multithreaded Programs
  • Rugina and Rinard (PLDI 99) (fork-join parallelism, not compositional)
  • We have extended our points-to analysis for multithreaded programs (irregular, thread-based concurrency, compositional)
• Escape Analysis
  • Blanchet (POPL 98)
  • Deutsch (POPL 90, POPL 97)
  • Park & Goldberg (PLDI 92)
Related Work
• Synchronization Optimizations
  • Diniz & Rinard (LCPC 96, POPL 97)
  • Plevyak, Zhang, Chien (POPL 95)
  • Aldrich, Chambers, Sirer, Eggers (SAS 99)
  • Blanchet (OOPSLA 99)
  • Bogda, Hoelzle (OOPSLA 99)
  • Choi, Gupta, Serrano, Sreedhar, Midkiff (OOPSLA 99)
  • Ruf (PLDI 00)
Conclusion
• New Analysis Algorithm
  • Flow-sensitive, compositional
  • Multithreaded programs
  • Explicitly represent interactions between analyzed and unanalyzed parts
• Analysis Uses
  • Synchronization elimination
  • Stack allocation
  • Per-thread heap allocation
• Lightweight Threads