Title: The Big Picture
1. The Big Picture
- Problem: how to build responsive interactive applications on distributed systems
- Why interesting? Large group of applications, large numbers of users
- Why hard? Complex, dynamic nature of systems; can't easily change the systems
- Other approaches: dynamic load balancing, distributed soft real-time, quality of service
- My approach: dynamic mapping of activation trees using history-based prediction
- Success: a dynamic mapping algorithm that performs close to optimal for a benchmark set
- Thesis structure: outline and timeline as before
2. Example Activation Tree
[Figure: activation tree for one MoveListener(2) call. MoveListener(2)
calls ListenerResponses(2) and DisplayListenerResponses(2);
ListenerResponses(2) fans out to SourceListenerResponse(1,2) ...
SourceListenerResponse(S,2), each of which calls Init(), then Step(1)
... Step(M) interleaved with Extract(1) ... Extract(M), operating over
Block(0), Block(1), Block(2), .... Edge labels give data sizes (N, 2N,
3N, B, 3B). Legend: S = Sources, L = Listeners, N = Elements,
M = Steps, B = Blocksize.]
3. Tradeoff Example for ARM
[Figure: shared server (50 MFLOP/s) and user machine (25 MFLOP/s)
connected by a 10 MB/s shared network.]
- Problem size N = 2,744,000 (4 m^3 room, 3 kHz)
- Each step requires 26N FLOPs
- To run Forward(k) on the server: 3N doubles in, 3N doubles out
- k * 26N / (25 MFLOP/s) on the user machine
- k * 26N / (50 MFLOP/s) + 48N / (10 MB/s) on the server
- For fewer than k = 10 steps, it is faster to use the user machine
- Other load on the server or network changes the tradeoff point
- Wouldn't it be nicer to map steps?
4. Responsiveness Specification
- Are bounds the right specification?
  - It is very clear what they mean
- Utility functions
  - Make the problem much uglier
  - Restrict allowable functions for tractability
- Statistical specifications
  - (mean, variance), PDFs
  - Doesn't seem to make sense to say 95% should be faster than 100 ms in a best-effort system
5. Performance Metric
- Simplicity should make it clear to the programmer
- Programmer can also measure execution times to compute his own metrics and adjust his specifications
- May look at extensions
  - Variance, etc.
  - Treat mapping failures differently
6. Execution Time of a Node/Subtree
[Figure: timeline of one call, starting at t_now, with the deadline
window [t_now + t_min, t_now + t_max].]
  t_exec = t_map + t_incomm + t_compute + t_outcomm + t_update + t_slack
  t_compute = t_compute^local + sum over children of t_exec^child
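A minimal sketch of this timing decomposition (field and function names are mine, not LDOS's): a call's execution time is the sum of its phases, and its compute phase recurses over the child calls in the activation tree.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    t_map: float            # choosing where to run
    t_incomm: float         # moving arguments in
    t_compute_local: float  # local computation, excluding children
    t_outcomm: float        # moving results out
    t_update: float         # updating history/mapping state
    t_slack: float          # idle time remaining before the deadline
    children: List["Node"] = field(default_factory=list)

def t_exec(n: Node) -> float:
    """Sum the node's phases; children's whole t_exec nests in compute."""
    t_compute = n.t_compute_local + sum(t_exec(c) for c in n.children)
    return (n.t_map + n.t_incomm + t_compute +
            n.t_outcomm + n.t_update + n.t_slack)

# Hypothetical two-node tree (times in seconds).
leaf = Node(0.001, 0.01, 0.1, 0.01, 0.001, 0.0)
root = Node(0.001, 0.02, 0.05, 0.02, 0.001, 0.0, [leaf])
print(t_exec(root))  # -> 0.214
```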
7. Decomposition of Bounds
[Figure: activation tree rooted at foo(), partially executed with known
bounds (t_min, t_max); an unexecuted child bar() with bounds
(t_min', t_max') to be determined; other subtrees unexecuted and
unknown.]
- t_max' = a * t_max and t_min' = a * t_min, where a is a function of the foo->bar history b_1, b_2, ..., the b_i being the actual fractions of time spent in the bar() subtree on previous executions of foo()
- Choice of (t_min', t_max') for bar() depends on the unvisited portion of the tree
- Past execution history is encoded in the b_i
- Choose the fraction of the bounds to give to bar() based on that history
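The slide leaves the exact function of the history open; as one simple illustrative choice (my assumption, not the thesis's), take a to be the mean of the observed fractions b_1..b_k:

```python
def split_bounds(t_min, t_max, history):
    """Give bar() the fraction a of foo()'s remaining bounds, where a is
    a function of the observed fractions b_i; here, their mean."""
    a = sum(history) / len(history)   # one possible choice of a
    return a * t_min, a * t_max

# foo() has 40..100 ms of its bounds left; bar() historically took
# about 30% of foo()'s time.
b = [0.28, 0.31, 0.30, 0.33]
print(split_bounds(40.0, 100.0, b))  # -> (12.2, 30.5)
```

A windowed or weighted mean over the b_i would adapt faster to changing behavior; the scheme above is just the simplest instance.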
8. Distributed Soft Real-time Systems
- Context: RT CORBA (OMG-96, DiPippo-97)
  - End-to-end priorities
- Prediction
  - Application-level resource schedulers (Polze-96)
- Generalized deadlines (Jensen-96, OMG-96)
- Scalable information dissemination
  - Estimate remote scheduler's state from delayed information (Bestavros-WoRTA93)
  - Exploit application-specific properties (Hailperin-93)
9. Dynamic Load Balancing
- OS-centric: distribute load evenly
  - Value of process migration (Eager-sigmetrics88, Leland-sigmetrics86, Harchol-Balter-sigmetrics96)
  - Load balancing is detrimental to RT (Bestavros-ICDCS97)
- Application-centric: minimize execution time
  - Task parallelism: Jade (Rinard-Computer93)
  - Loop parallelism: Dome (Beguelin-95), (Siegell-HPDC-94)
- Scalable information dissemination
  - Adaptive communication (Shivaratri-ICDCS90)
  - Application domain-specific (Hailperin-93)
  - Neural net prediction (Mehra-93)
10. Quality of Service
- Communication resource reservation
  - Tenet, RSVP, ATM CBR/VBR, ...
- Extensions to application level
  - QuO (Zinky-DUTC-95, Zinky-TaPOS-97)
  - Contracts
  - QuAL (Florissi-94)
- Mobile adaptation: Odyssey (Noble-SOSP-97)
11. Mapping Algorithm Characteristics
- Introspection: measures itself
- Locality: uses only local information
  - No extra communication
  - Procedure call durations
- Independence: each procedure is separate
- History: decisions based on history (logs)
- Simplicity: measurement scheme, history, and decision making must all fit into an object reference and have overhead comparable to a remote call
12. Load Traces
- 1 Hz sample rate, one week
- Various hosts (39 machines)
  - PSC Alpha Supercluster (13 machines)
  - Local Alpha desktops (16 machines)
  - Manchester Testbed (8 machines)
  - Local compute servers (2 machines)
13. Result: Load Trace Analysis
- Self-similarity
  - H > 0.5, typically > 0.8
- Epochal behavior of frequency content
  - Frequency content remains the same for 150 to 450 seconds
- Simple phase space behavior
- These results suggest that load is predictable from its past behavior
14. Load Trace Characterization
15. Why Simulation to Develop and Evaluate Selectors?
- Easier to control the environment
  - Form groupings of machines at will
  - Change the mapping algorithm trivially
  - Compare against an optimal mapping algorithm
- MUCH faster than running on a testbed
  - 100,000 100 ms calls on the testbed: > 3 hours
  - 100,000 100 ms calls on the simulator: 1-5 minutes
16. Assumption of the Trace-based Simulation Approach
- Traces define the background work on the hosts and network
- We ask: if we had injected this extra work at this time, how long would it have taken to complete?
- The answer is valid if this additional work would not have significantly perturbed the background work
- The assumption is that we would not have significantly perturbed the background work
17. Structure of Simulator
[Figure: a Mapping Request Generator issues Mapping Requests to the
Mapping Algorithm; a Compute Time Estimator, driven by the load traces
(time, CPU demand), supplies execution time estimates; Mapping Results
flow to Statistics and Analysis.]
18. Execution Time from Load Traces
- Load trace: load(t)
- Given t_nominal seconds of CPU time to map at time t_now, t_compute^local is the smallest T such that the integral from t_now to t_now + T of dt / (load(t) + 1) equals t_nominal (the task receives a 1/(load + 1) share of the CPU)
- Experimental validation shows a strong (0.99) correlation between predicted and actual execution times
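The estimate above can be sketched as a discrete walk over the 1 Hz trace (my own sketch, under the usual assumption that a task gets a 1/(load(t)+1) share of the CPU each second):

```python
def t_compute_local(load, t_now, t_nominal):
    """Wall-clock seconds for t_nominal seconds of CPU work started at
    t_now, given a 1 Hz load trace; 1-second granularity for simplicity."""
    done, t = 0.0, t_now
    while done < t_nominal:
        share = 1.0 / (load[t % len(load)] + 1.0)  # CPU share this second
        done += share
        t += 1
    return t - t_now

trace = [0.0, 1.0, 1.0, 0.0, 3.0]       # hypothetical 1 Hz load samples
print(t_compute_local(trace, 0, 2.0))   # -> 3
```

2.0 seconds of work meets a full share in second 0 and half shares in seconds 1 and 2, so it completes after 3 wall-clock seconds.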
19. Load Trace Characterization
20. Why Use Traces Directly?
- They are very hard to fit a model to
- They exhibit interesting transient behavior we don't want to lose
- Examples
21. Mapping Algorithms
- Mean
- WinMean(W), WinVar(W), Confidence(W)
- RandWinMean(W,P), RandWinVar(W,P), RandConfidence(W,P)
- NeuralNet(W)
- RangeCounter(W)
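To convey the flavor of the windowed predictors, here is a sketch of a WinMean(W)-style selector (the interface and class names are mine; the slide only names the algorithms): keep the last W observed execution times per host and pick the host with the lowest windowed mean.

```python
from collections import deque
from statistics import mean

class WinMean:
    """Predict the next execution time on each host as the mean of the
    last W observations; select the host with the lowest prediction."""
    def __init__(self, W):
        self.W = W
        self.hist = {}   # host -> deque of recent exec times (seconds)

    def observe(self, host, t):
        self.hist.setdefault(host, deque(maxlen=self.W)).append(t)

    def predict(self, host):
        return mean(self.hist[host])

    def select(self):
        return min(self.hist, key=self.predict)

p = WinMean(W=3)
for t in (0.10, 0.12, 0.11):
    p.observe("server", t)
for t in (0.20, 0.05, 0.30):
    p.observe("user", t)
print(p.select())  # -> "server" (windowed mean 0.11 vs. ~0.18)
```

WinVar and the confidence-based variants would rank hosts by variance or by an upper confidence bound over the same window, and the Rand* variants would pick randomly with probability P to keep exploring other hosts.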
22. Activation Trees from Windows
- Already have a methodology for collecting activation trees from binaries with COFF debugging information
- Guard page manipulation or load/store instrumentation for reference behavior
- Two-phase data collection
23. Activation Tree Traces: Other Options
- Collect or derive trees by hand
  - Windows apps with source
  - Acoustic room modelling app, image editor
  - Existing distributed apps
- Derive trees from CORBA or DCOM IDL specifications
- Implement and focus on a specific class of application
  - Quake design optimization, for example
24. Capturing Data Movement
- Data reference traces tell us what data was referenced, not what data the remote execution facility would have to move
- Special case: page faults and simple DSM
- IDL extensions
- Compiler-based analysis
  - Need source...
25. Evaluation
- Empiricism vs. generalization
- Extensive evaluation study
  - Benchmarks
- Parameterizable models for traces
  - Generate test cases using the models and study the effects of the parameters
  - May be difficult
    - Load trace complexity
    - Algorithm parameters
26. Scalability Concerns
- Issue: local information on each host is proportional to the number of hosts and the number of procedures
- Number of hosts is likely to be low
  - Sequential execution model
- Data per procedure is on the order of the size of the code in an LDOS object reference for the preliminary work
- Opportunities for information sharing if necessary
27. Execution Time Dependencies
[Figure: t_compute depends on the host (load(t)); t_commin and
t_commout depend on the network (BW(t), latency(t)); the activation
tree structure (nominal(t)) determines the work; comparing t_exec
against the bounds (t_exec < bounds) determines deadline meets.]
28. Information Sharing
[Figure: application-level sharing between stubs (stub_a, stub_b)
versus lower-level sharing across processes and hosts.]
29. Comparison with a Shared Measurement System
- Potentially more scalable
- Information dissemination problem
- Lower-level information
  - Load, BW, latency
  - Can't concentrate measurements
  - Measurement granularity may not be what the application needs
- Communication overhead
30. Comparison with Reservations
- Determinism is appealing
- Translation of application demands to resource demands may be difficult
  - Could lead to over-reservation
- Significant OS and network modifications
  - Limits the sites where it is possible
31. Why Not Just Use the Least Loaded Machine?
- How do you know which is the least loaded machine?
  - Delayed information
  - Probably still need prediction!
- How do you account for communication overheads?
- Minimization of execution time may waste resources
  - Cost model
32. Implementing the Service in a Distributed Object System
[Figure: a local object reference, holding local history and
per-procedure mapinfo, invokes o.Foo(x) on one of several remote
object instances.]

  ObjectRef::foo(x) {
    if (mapinfo->InMap())
      ObjectRef::foo_map(x, mapinfo);
  }

  ObjectRef::foo_map(x, mapinfo) {
    bind(SelectInstance(foo, mapinfo, history));
    // do call, passing along mapping params, status...
  }
33. Method Invocation Overheads
- Commercial DCE and CORBA implementations: 0.1 to 1 ms (Schmidt-97)
- LDOS study: 200 MHz PPro, NT 4.0, VC 4.2
- Faster LANs: PAPERS, 3 microsecond application-to-application latency (Dietz-96)
34. LDOS Data Transfer Throughput
[Figure: process-to-process throughput, 256 KB argument, no conversion,
200 MHz PPro, NT 4.0, VC 4.2.]
35. Related Work
- Applications
  - CAVE, CUMULUS, DIS, Wolfe-Lau-95
- Remote execution
  - Systems: RPC, DSM, DCE, CORBA, DCOM, ...
  - Performance: Schmidt-sigcomm97, PAPERS, ...