Title: Frameworks for domain-specific optimization at run-time
1. Frameworks for domain-specific optimization at run-time
- Paul Kelly (Imperial College London)
- Joint work with
- Kwok Cheung Yeung
- Milos Puzovic
September 2005
2. Where we're coming from
- I lead the Software Performance Optimisation group within Computer Systems
- Stuff I'd love to talk about another time:
- Scalable interactive fluid-flow visualisation
- FPGA and GPU accelerators
- Bounds-checking for C, links with unchecked code
- Is Morton-order layout for 2D arrays competitive?
- Efficient algorithms for scalable pointer alias analysis
- Domain-specific optimisation frameworks
- Instant-access cycle stealing
- Proxying in CC-NUMA cache-coherence protocols: adaptive randomisation and combining
[Map: Dept of Computing, with the V&A, Science Museum, Albert Hall and Hyde Park nearby]
3. Mission statement
- Extend optimising compiler technology to challenging contexts beyond the scope of conventional compilers
- Component-based software: cross-component optimisation
- For example, in distributed systems:
- Optimisation across network boundaries
- Between different security domains
- Maintaining proper semantics in the event of failures
- Emergent mission (mission creep):
- Design a domain-specific optimisation plug-in architecture for compiler/VM
4. Abstraction
- Most performance improvement opportunities come from adapting components to their context
- Most performance improvement measures break abstraction boundaries
- So the goal of performance programming tool support is to get performance without making a mess of your code
- Optimisations are cross-cutting
5. Slogan: Optimisations are features
- and features can be separately-deployable, separately-marketable components, or aspects
- How can this be made to work?
6. Open compilers
- Idea: implement optimisation features as compiler passes
- Need to design an open plug-in architecture for inserting new optimization passes
- Some interesting issues in how to design an extensible intermediate representation
- Feature composition raises research issues:
- Interference: can we verify that feature A doesn't interfere with feature B?
- Phase ordering problem: which should come first?
- Can feature B benefit from feature A's program analysis?
7. Open virtual machines
- How about an open optimizing VM?
- Fresh issues:
- Dynamic installation of optimisation features?
- Open access to instrumentation/profiling
- Exploit the opportunity to use dynamic information as well as static analysis
- This talk has three parts:
- Motivating example
- A framework for deploying optimisations as separately-deployable features, or components
- Support for optimisations that integrate static analysis with dynamic information
9. Project strategy
- Implement aggregation optimisation for .Net Remoting
- Do it with lower overheads than our Java version
- To do it, build general-purpose tools for:
- Domain-specific optimisation
- Run-time optimisation
- Results so far:
- Reflective dataflow analysis framework
- "Optimisations as aspects" framework prototype
- Plugin architecture for domain-specific optimisation features (DSOFs)
- Elementary Remoting aggregation works, with excellent performance
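10. Aggregating remote calls
void m(RemoteObject r, int a) {
    int x = r.f(a);
    int y = r.g(a, x);
    int z = r.h(a, y);
    System.Console.WriteLine(z);
}
[Diagram: without aggregation, each of the three calls sends its arguments and receives its result: six messages. With aggregation: two messages, no need to copy x and y back.]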
- Aggregation:
- A sequence of calls to the same server can be executed in a single message exchange
- Reduce number of messages
- Also reduce amount of data transferred:
- Common parameters
- Results passed from one call to another
- Non-trivial correctness issues: see Yoshida & Ahern, OOPSLA'05
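The message counts above can be illustrated with a small sketch. It is not the Veneer or .Net Remoting implementation: `Server`, `direct` and `aggregated` are hypothetical names, the bodies of `f`, `g` and `h` are invented, and the "network" is just a counter that treats every request or reply as one message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntBinaryOperator;

/** Hypothetical sketch: counts messages with and without call aggregation. */
class AggregationSketch {
    /** Stand-in for a remote server; the method bodies are invented for illustration. */
    static class Server {
        int messages = 0;
        int f(int a) { return a + 1; }
        int g(int a, int x) { return a * x; }
        int h(int a, int y) { return a + y; }
    }

    /** Unaggregated client: each call costs a request plus a reply. */
    static int direct(Server s, int a) {
        s.messages += 2; int x = s.f(a);
        s.messages += 2; int y = s.g(a, x);
        s.messages += 2; int z = s.h(a, y);
        return z;
    }

    /** Aggregated client: one request carries a and the plan, one reply carries z. */
    static int aggregated(Server s, int a) {
        List<IntBinaryOperator> plan = new ArrayList<>();
        plan.add((arg, prev) -> s.f(arg));
        plan.add((arg, prev) -> s.g(arg, prev));
        plan.add((arg, prev) -> s.h(arg, prev));
        s.messages += 1;                          // request
        int result = 0;
        for (IntBinaryOperator step : plan) {
            result = step.applyAsInt(a, result);  // x and y never leave the server
        }
        s.messages += 1;                          // reply
        return result;
    }

    public static void main(String[] args) {
        Server s1 = new Server();
        Server s2 = new Server();
        System.out.println(direct(s1, 3) + " in " + s1.messages + " messages");
        System.out.println(aggregated(s2, 3) + " in " + s2.messages + " messages");
    }
}
```

Both clients compute the same z; only the number of messages and the data copied back differ.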
11. Real-world benchmarks
- Simple example: Multi-user Dungeon (from Flanagan's "Java Examples in a Nutshell")
- Look method:
- String mudname = p.getServer().getMudName();
- String placename = p.getPlaceName();
- String description = p.getDescription();
- Vector things = p.getThings();
- Vector names = p.getNames();
- Vector exits = p.getExits();
- Seven aggregated calls
Time taken to execute "look"    Ethernet    ADSL
Without call aggregation        5.4ms       759.6ms
With call aggregation           5.8ms       164.9ms
Speedup                         0.93        4.61
Client: Athlon XP 1800. Servers: Pentium III 500MHz, 650MHz and dual 700MHz. Linux, Sun JDK 1.4.1_01 (Hotspot). Network: Ethernet 10.03 MB/s, ping 0.1ms; DSL 10.7KB/s, ping 98ms. Mean of 3 trials of 1000 iterations each.
12. Call aggregation: our first implementation
The Veneer "virtual JVM" intercepts class loading, and fragments each method. An interpretive executor inspects the local fragment following each remote call.
int m() {
    while (p < N) q = x1.m1(p);
    p = 0;
    p = x2.m2(p);
    System.out.println(p);
    return p;
}
Fragment W: p < N
Fragment X: q = x1.m1(p)    (possibly remote)
Fragment Y: p = 0
Fragment Z: p = x2.m2(p)    (possibly remote); println(p); return p
- Each fragment carries use/def and liveness info
- Y can be executed before X, but p must be copied
- Z cannot be delayed because p is printed
        X         Y    Z       B2
Defs    q         p    p
Uses    x1,p,q         x2,p    p
13. Call aggregation: our first implementation
- At this point, the executor has collected a sequence of delayed remote calls (fragments X and Z)
- But execution is now forced by the need to print
- Now we can inspect the delayed fragments and construct an optimised execution plan
Fragment Y is executed first; fragments X and Z are delayed:
Fragment W: p < N; porig = p
Fragment Y: p = 0
Fragment X: q = x1.m1(porig)
Fragment Z: p = x2.m2(p); println(p); return p
- If x1 and x2 are on the same server, send an aggregate call
- If x1 and x2 are on different servers, send the execution plan to x2's server, telling it to invoke x1.m1(porig) on x1's server
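The reasoning that lets Y run ahead of the delayed X, at the price of saving porig, can be sketched as a pair of set intersections over use/def information. This is an illustrative reconstruction, not Veneer's code: `Fragment`, `canHoist` and `copiesNeeded` are hypothetical names, and the use/def sets are read off the statements q = x1.m1(p) and p = 0.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the use/def test for running one fragment ahead of another. */
class FragmentReorderSketch {
    static class Fragment {
        final String name;
        final Set<String> defs;
        final Set<String> uses;
        Fragment(String name, Set<String> defs, Set<String> uses) {
            this.name = name;
            this.defs = defs;
            this.uses = uses;
        }
    }

    static Set<String> intersect(Set<String> a, Set<String> b) {
        Set<String> r = new HashSet<>(a);
        r.retainAll(b);
        return r;
    }

    /** True if 'later' neither reads nor overwrites anything that 'earlier' defines. */
    static boolean canHoist(Fragment later, Fragment earlier) {
        return intersect(later.uses, earlier.defs).isEmpty()
            && intersect(later.defs, earlier.defs).isEmpty();
    }

    /** Variables that 'earlier' reads and 'later' overwrites: each needs a saved copy. */
    static Set<String> copiesNeeded(Fragment later, Fragment earlier) {
        return intersect(later.defs, earlier.uses);
    }

    public static void main(String[] args) {
        Fragment x = new Fragment("X", Set.of("q"), Set.of("x1", "p"));
        Fragment y = new Fragment("Y", Set.of("p"), Set.of());
        System.out.println("Y before X allowed: " + canHoist(y, x));
        System.out.println("copies needed: " + copiesNeeded(y, x));
    }
}
```

For X and Y the check succeeds and reports that p must be copied, which corresponds to the porig = p assignment in fragment W.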
14. Aggregation with conditional forcing
- Runtime optimisation is justified for optimising heavyweight operations
- In this example, aggregation is valid if x > y
- If we intercept the fork, we can find out whether aggregation will be valid
- Original Veneer implementation intercepts all control-flow forks in methods containing aggregation opportunities
- We need a better analysis, that pays overheads only when a benefit might occur
15. Deferred DFA: motivating example
- Identifies lossy, predictable control-flow forks
- Rescues data-flow information thrown away by conservative analysis by deferring the meet operation
- Generates data-flow summary functions for the regions between
- Uses predicted control-flow to stitch together the summary function for the actual path, using the work list algorithm
Outcome known at run-time
Deferred data-flow analysis. Shamik Sharma, Anurag Acharya and Joel Saltz. UC Santa Barbara tech report TRCS98-38
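A minimal sketch of the idea of deferring the meet, under heavy simplification: each region is summarised by the variables it reads before writing and the variables it writes, and summaries are composed along a path. The class and field names, the variable names, and the straight-line composition (no work list) are all assumptions made for illustration; this is not the algorithm from the cited report.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: deferring the meet at a fork whose outcome is known at run-time. */
class DeferredMeetSketch {
    /** Summary of a region: variables read before being written, and variables written. */
    static class Summary {
        final Set<String> uses;
        final Set<String> defs;
        Summary(Set<String> uses, Set<String> defs) {
            this.uses = uses;
            this.defs = defs;
        }

        /** Summary of this region followed by 'next'. */
        Summary then(Summary next) {
            Set<String> u = new HashSet<>(next.uses);
            u.removeAll(defs);   // reads satisfied by our own writes are not exposed
            u.addAll(uses);
            Set<String> d = new HashSet<>(defs);
            d.addAll(next.defs);
            return new Summary(u, d);
        }

        /** Conservative meet: assume either branch may run (may-use, may-def). */
        static Summary meet(Summary a, Summary b) {
            Set<String> u = new HashSet<>(a.uses);
            u.addAll(b.uses);
            Set<String> d = new HashSet<>(a.defs);
            d.addAll(b.defs);
            return new Summary(u, d);
        }
    }

    // Invented regions: the fork tests x > y; only the not-taken branch reads r,
    // which stands for the result of a delayed remote call.
    static final Summary BEFORE = new Summary(Set.of("x", "y"), Set.of());
    static final Summary TAKEN = new Summary(Set.of("a"), Set.of("b"));
    static final Summary NOT_TAKEN = new Summary(Set.of("r"), Set.of("b"));

    /** Static analysis: the meet is applied at the fork. */
    static Summary conservative() {
        return BEFORE.then(Summary.meet(TAKEN, NOT_TAKEN));
    }

    /** Deferred analysis: stitch together the summaries for the path actually taken. */
    static Summary deferred(boolean branchTaken) {
        return BEFORE.then(branchTaken ? TAKEN : NOT_TAKEN);
    }

    public static void main(String[] args) {
        System.out.println("conservative uses r: " + conservative().uses.contains("r"));
        System.out.println("deferred (x > y) uses r: " + deferred(true).uses.contains("r"));
    }
}
```

The conservative summary reports a use of r and would force the delayed call; once the fork outcome is known, the stitched summary for the taken path shows no such use.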
16. DSOFs
- Domain-specific optimisation features
- Need a framework to plug the components into
- What does the framework need to achieve?
- Cross-cutting
- Separately-deployable
- Query language to select target sites
- Static access to dataflow/dependence information
- Dynamic access to dataflow/dependence information
- Let's start with AOP
17. RMI aggregation DSOF, based on Loom aspect weaver
public class RemoteCallDSOF : Loom.Aspect
{
    private OpDomains opDomains = new OpDomains();
    private DelayedCalls delayedCalls = new DelayedCalls();
    private Set delayedCallsDef = new HashedSet();

    public RemoteCallDSOF(DDFAAnalysis analysis)
    {
        this.opDomains = analysis.getOpDomains();
    }

    [Loom.ConnectionPoint.IncludeAll]
    [Loom.Call(Invoke.Instead)]
    public object AnyMethod(object[] args)
    {
        OpDomain thisOpDomain = opDomains.getOpDomain(Context.MethodName);
        OpNode opNode = thisOpDomain.OpNode;
        Set opNodeDef = opNode.getDefs();
        Set opDomainUse = thisOpDomain.getUses();
        // Force the delayed calls if a dependence is found
        if ((((Set) (opNodeDef & opDomainUse)).Count > 0) ||
            (((Set) (opDomainUse & delayedCallsDef)).Count > 0))
        {
            delayedCalls.Execute();
            object ret = Context.Invoke(args);
            return ret;
        }
        else
        {
            // Otherwise delay this call as well
            delayedCalls.Add(Context.MethodName, args);
            if (!opDomains.hasNext())
            {
                object ret = delayedCalls.Execute();
                return ret;
            }
            return null;
        }
    }
}
- Dynamic part of pointcut: refers to dataflow properties of control flow that can be predicted from this point
- getOpDomain() function stitches together summary functions
- thisOpDomain.getUses() function returns all variables that are used within the op-domain
- opNode.getDefs() function returns all variables that are defined by op-node
18. RMI Optimisation using a souped-up aspect weaver
public aspect OptimiseRMICall {
    public pointcut LikelyRMICall();
    public pointcut StaticDelayableRMI();

    void around(): LikelyRMICall() && StaticDelayableRMI() {
        if (DynamicDelayableRMI())
            DelayedCalls.add(thisJoinPoint.ProceedClosure());
    }

    void around(): LikelyRMICall() && StaticDelayableRMI() {
        if (!DynamicDelayableRMI()) {
            DelayedCalls.execute();
            proceed();
        }
    }

    void around(): LikelyRMICall() && !StaticDelayableRMI() {
        DelayedCalls.execute();
        proceed();
    }
}
Artist's impression of RMI aggregation DSOF, based on AspectJ
public pointcut LikelyRMICall(): call(void *(..) throws RemoteException);

public static bool DynamicDelayableRMI() {
    return thisOpDomain.getUses().intersects(DelayedCalls.getDefs());
}
public pointcut StaticDelayableRMI():
    thisOpDomainStaticPart.getUses().intersects(DelayedCalls.getDefs());
21. Remote call aggregation benchmark
public Double[] vectorAddition(DDFAAnalysis analysis, int size) {
    Double[] v1 = new Double[size];
    Double[] v2 = new Double[size];
    ArrayAdder adder = new ArrayAdder();
    Double[] ret1 = adder.Add(v1, v2);
    Double[] ret2 = adder.Add(ret1, v2);
    Double[] ret3 = adder.Add(ret2, v2);
    Double[] ret4 = adder.Add(ret3, v2);
    return ret4;
}
- Includes four consecutive calls to the same remote object
- There is a data dependency between the calls
22. Remote call aggregation benchmark
ILMethod method = CodeDatabase.GetMethod(new Function(Example.adder));
DDFAAnalysis analysis = new DDFAAnalysis();
analysis.Apply(method);

public Double[] vectorAddition(DDFAAnalysis analysis, int size) {
    Double[] v1 = new Double[size];
    Double[] v2 = new Double[size];
    RemoteCallDSOF opt = new RemoteCallDSOF(analysis);
    IAdder adder = (IAdder) Loom.Weaver.CreateInstance(typeof(ArrayAdder), null, opt);
    Double[] ret1 = adder.Add(v1, v2);
    Double[] ret2 = adder.Add(ret1, v2);
    Double[] ret3 = adder.Add(ret2, v2);
    Double[] ret4 = adder.Add(ret3, v2);
    return ret4;
}
- We deploy the DSOF using the Loom aspect weaver
- When adder is created, the DSOF is interposed
- Slightly clunky
- Aspect intercepts control flow at potential remote call sites
- Accesses results of static dataflow analysis
- Uses values of variables to determine whether future control flow will allow aggregation
24. Performance results
[Charts: modem, ping time 156.2ms (client 1.2GHz Pentium 4, server 2.6GHz Pentium 4, .Net V1.1); loopback device (3GHz Pentium 4, .Net V1.1)]
- Very preliminary results
- Vector addition benchmark
- Substantial speedup even on fast loopback connection
- By avoiding the interpretive mechanism, overheads are smaller than in our Java implementation
25. Observations: VM
- No change to VM
- Not needed for our work so far
- Though a more powerful dynamic interposition mechanism (i.e. aspect weaver) would be good
- More ambitiously:
- Access the VM's dataflow analysis?
- Access and control the VM's instrumentation
- Via a dynamic aspect weaver?
26. Observations: AOP
- What is the function of the aspect weaver here?
- Type-safe binary rewriting
- Pointcut language goes some way towards providing open access to intermediate representation
- We have built a reflective dataflow analysis library to extend this somewhat
27. Observations: DSI
- Our scheme for aggregating Remote calls is an example of a "Domain-Specific Interpreter" pattern:
- Delay execution of calls
- Execution of delayed calls is eventually forced by a dependence
- Inspect the list of delayed calls, plan efficient execution
- This idea is useful for optimising many APIs
- Example: parallelising VTK (Beckmann, Kelly et al., LCPC'05)
- Example: fusing MPI collective communications (Field, Kelly, Hansen, EuroPar'02)
- Example: data alignment in parallel linear algebra (Beckmann & Kelly, LCR'98)
28. Observations: other DSOFs
- We're interested in API-specific optimisations
- "Anti-pattern" rewriting
- Commonly heavyweight, so some runtime overhead can be justified
- But not all optimisations fit the Domain-Specific Interpreter pattern
- E.g. the "SELECT *" antipattern:
- Find all the uses of the result set
- Find all the columns that might actually be used
- Rewrite the query to select just the columns needed
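The last of the three steps above can be sketched as a string rewrite, assuming the set of used columns has already been found by analysing the uses of the result set. `SelectStarSketch`, `rewrite` and the example table and column names are hypothetical, and a real DSOF would work on the program's intermediate representation rather than on query text alone.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the final step: narrowing SELECT * to the columns in use. */
class SelectStarSketch {
    /** Rewrites "SELECT * FROM ..." to name only the columns in usedColumns. */
    static String rewrite(String query, Set<String> usedColumns) {
        String prefix = "SELECT * ";
        if (!query.startsWith(prefix) || usedColumns.isEmpty()) {
            return query;   // leave anything we do not recognise untouched
        }
        // Sort the names so the rewritten query is deterministic.
        String columns = String.join(", ", new TreeSet<>(usedColumns));
        return "SELECT " + columns + " " + query.substring(prefix.length());
    }

    public static void main(String[] args) {
        // Column names as they might be collected from the uses of the result set.
        Set<String> used = Set.of("name", "price");
        System.out.println(rewrite("SELECT * FROM products WHERE price > 10", used));
    }
}
```

Queries that do not start with SELECT * are returned unchanged, which keeps the rewrite conservative.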
29. Conclusions and future directions
- Implementation incomplete
- Needs to be embedded in aspect language
- Can deferred dataflow analysis work interprocedurally?
- How would we derive where lp-fork aspects have to be deployed in order to produce the dataflow data needed by the selected aspect?
- Apply optimisation statically where possible
- Represent optimisation more abstractly?
- Composition metaprogramming:
- Optimisation encapsulated as aspect
- Operates on code that composes functions from some API
- Exploits component metadata
30. Software products
- Our Adon ("Adaptive Optimisation for .Net") library is available at:
- http://www.doc.ic.ac.uk/~phjk/Software/Adon/
- Adon can be used interactively using the Adon Browser
- Or programmatically, for example to apply partial evaluation to specialize a method from your program
31. Programming with Adon: specialization
// Get the representation for the method Example.Power
ILMethod method = CodeDatabase.GetMethod(Example.Power);
// Create a specialising transformation, specialising the second
// parameter of the transformed method to the integer value 3
SpecialisingTransformation transform = new SpecialisingTransformation();
transform.Specialise(method.Parameters[1], 3);
// Apply the transformation to Example.Power
transform.Apply(method);
// Generate the modified method
MethodInfo dynamicMethod = method.Generate();
// Invoke the new method
Console.Out.WriteLine(dynamicMethod.Invoke(null, new object[] { 2 }));
- Allows us to extract and mess with any method of the running application's code
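To make the effect of such a specialisation concrete, here is a hand-written before-and-after pair. The body of Example.Power is not shown on the slide, so the loop-based `power` below is an assumption, and `powerSpecialisedTo3` is only what a partial evaluator could plausibly produce once the second parameter is fixed to 3; it is not output from Adon.

```java
/** Hypothetical sketch: a general method and a hand-specialised version for n == 3. */
class SpecialisationSketch {
    /** Assumed shape of the general method: x raised to a non-negative integer n. */
    static int power(int x, int n) {
        int result = 1;
        for (int i = 0; i < n; i++) {
            result *= x;
        }
        return result;
    }

    /** With n fixed to 3 the loop can be unrolled and the parameter disappears. */
    static int powerSpecialisedTo3(int x) {
        return x * x * x;
    }

    public static void main(String[] args) {
        System.out.println(power(2, 3));
        System.out.println(powerSpecialisedTo3(2));
    }
}
```

The two methods agree on every input for which n is 3; the specialised one simply has no loop control left to execute.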
32. The Adon Browser
- Example: let's mess with Bubblesort
- Browser GUI interfaces to Adon library
- Browse and analyse your app's bytecode
- Apply selected analyses
- Apply selected transformations