Title: The Big Picture
1. The Big Picture
- Problem: how to build responsive interactive applications on distributed systems
- Why interesting? Large group of applications, large numbers of users
- Why hard? Complex, dynamic nature of systems; can't easily change the systems
- Other approaches: dynamic load balancing, distributed soft real-time, quality of service
- My approach: dynamic mapping of activation trees using history-based prediction
- Success: a dynamic mapping algorithm that performs close to optimal for a benchmark set
- Thesis structure: outline and timeline as before
2. Example Activation Tree
[Figure: activation tree for one MoveListener(2) call. MoveListener(2)
calls ListenerResponses(2) and DisplayListenerResponses(2);
ListenerResponses(2) fans out to SourceListenerResponse(1,2) ...
SourceListenerResponse(S,2), each of which calls Init(), then Step(1)
... Step(M) interleaved with Extract(1) ... Extract(M), operating over
Block(0), Block(1), Block(2), .... Edge labels give data sizes (N, 2N,
3N, B, 3B). Legend: S = Sources, L = Listeners, N = Elements,
M = Steps, B = Blocksize.]
3. Tradeoff Example for ARM
[Figure: shared server (50 MFLOP/s) and user machine (25 MFLOP/s)
connected by a 10 MB/s shared network.]
- Problem size N = 2,744,000 (4 m^3 room, 3 kHz)
- Each step requires 26N FLOPs
- To run Forward(k) on the server: 3N doubles in, 3N doubles out
- k * 26N / (25 MFLOP/s) on the user machine
- k * 26N / (50 MFLOP/s) + 48N / (10 MB/s) on the server
- For fewer than k = 10 steps, it is faster to use the user machine
- Other load on the server or network changes the tradeoff point
- Wouldn't it be nicer to map steps?
4. Responsiveness Specification
- Are bounds the right specification?
  - It is very clear what they mean
- Utility functions
  - Make the problem much uglier
  - Restrict allowable functions for tractability
- Statistical specifications
  - (mean, variance), PDFs
  - Doesn't seem to make sense to say 95% should be faster than 100 ms in a best-effort system
5. Performance Metric
- Simplicity should make it clear to the programmer
- Programmer can also measure execution times to compute his own metrics and adjust his specifications
- May look at extensions
  - Variance, etc.
  - Treat mapping failures differently
6. Execution Time of a Node/Subtree
[Figure: timeline of one call, starting at t_now, with the deadline
window [t_now + t_min, t_now + t_max].]
  t_exec = t_map + t_incomm + t_compute + t_outcomm + t_update + t_slack
  t_compute = t_compute^local + sum over children of t_exec^child
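A minimal sketch of this timing decomposition (field and function names are mine, not LDOS's): a call's execution time is the sum of its phases, and its compute phase recurses over the child calls in the activation tree.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    t_map: float            # choosing where to run
    t_incomm: float         # moving arguments in
    t_compute_local: float  # local computation, excluding children
    t_outcomm: float        # moving results out
    t_update: float         # updating history/mapping state
    t_slack: float          # idle time remaining before the deadline
    children: List["Node"] = field(default_factory=list)

def t_exec(n: Node) -> float:
    """Sum the node's phases; children's whole t_exec nests in compute."""
    t_compute = n.t_compute_local + sum(t_exec(c) for c in n.children)
    return (n.t_map + n.t_incomm + t_compute +
            n.t_outcomm + n.t_update + n.t_slack)

# Hypothetical two-node tree (times in seconds).
leaf = Node(0.001, 0.01, 0.1, 0.01, 0.001, 0.0)
root = Node(0.001, 0.02, 0.05, 0.02, 0.001, 0.0, [leaf])
print(t_exec(root))  # -> 0.214
```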
7. Decomposition of Bounds
[Figure: activation tree rooted at foo(), partially executed with known
bounds (t_min, t_max); an unexecuted child bar() with bounds
(t_min', t_max') to be determined; other subtrees unexecuted and
unknown.]
- t_max' = a * t_max and t_min' = a * t_min, where a is a function of the foo->bar history b_1, b_2, ..., the b_i being the actual fractions of time spent in the bar() subtree on previous executions of foo()
- Choice of (t_min', t_max') for bar() depends on the unvisited portion of the tree
- Past execution history is encoded in the b_i
- Choose the fraction of the bounds to give to bar() based on that history
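The slide leaves the exact function of the history open; as one simple illustrative choice (my assumption, not the thesis's), take a to be the mean of the observed fractions b_1..b_k:

```python
def split_bounds(t_min, t_max, history):
    """Give bar() the fraction a of foo()'s remaining bounds, where a is
    a function of the observed fractions b_i; here, their mean."""
    a = sum(history) / len(history)   # one possible choice of a
    return a * t_min, a * t_max

# foo() has 40..100 ms of its bounds left; bar() historically took
# about 30% of foo()'s time.
b = [0.28, 0.31, 0.30, 0.33]
print(split_bounds(40.0, 100.0, b))  # -> (12.2, 30.5)
```

A windowed or weighted mean over the b_i would adapt faster to changing behavior; the scheme above is just the simplest instance.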
8. Distributed Soft Real-time Systems
- Context: RT CORBA (OMG-96, DiPippo-97)
  - End-to-end priorities
- Prediction
  - Application-level resource schedulers (Polze-96)
- Generalized deadlines (Jensen-96, OMG-96)
- Scalable information dissemination
  - Estimate remote scheduler's state from delayed information (Bestavros-WoRTA93)
  - Exploit application-specific properties (Hailperin-93)
9. Dynamic Load Balancing
- OS-centric: distribute load evenly
  - Value of process migration (Eager-sigmetrics88, Leland-sigmetrics86, Harchol-Balter-sigmetrics96)
  - Load balancing is detrimental to RT (Bestavros-ICDCS97)
- Application-centric: minimize execution time
  - Task parallelism: Jade (Rinard-Computer93)
  - Loop parallelism: Dome (Beguelin-95), (Siegell-HPDC-94)
- Scalable information dissemination
  - Adaptive communication (Shivaratri-ICDCS90)
  - Application domain-specific (Hailperin-93)
  - Neural net prediction (Mehra-93)
10. Quality of Service
- Communication resource reservation
  - Tenet, RSVP, ATM CBR/VBR, ...
- Extensions to application level
  - QuO (Zinky-DUTC-95, Zinky-TaPOS-97)
  - Contracts
  - QuAL (Florissi-94)
- Mobile adaptation: Odyssey (Noble-SOSP-97)
11. Mapping Algorithm Characteristics
- Introspection: measures itself
- Locality: uses only local information
  - No extra communication
  - Procedure call durations
- Independence: each procedure is separate
- History: decisions based on history (logs)
- Simplicity: measurement scheme, history, and decision making must all fit into an object reference and have overhead comparable to a remote call
12. Load Traces
- 1 Hz sample rate, one week
- Various hosts (39 machines)
  - PSC Alpha Supercluster (13 machines)
  - Local Alpha desktops (16 machines)
  - Manchester Testbed (8 machines)
  - Local compute servers (2 machines)
13. Result: Load Trace Analysis
- Self-similarity
  - H > 0.5, typically > 0.8
- Epochal behavior of frequency content
  - Frequency content remains the same for 150 to 450 seconds
- Simple phase space behavior
- These results suggest that load is predictable from its past behavior
14. Load Trace Characterization
15. Why Simulation to Develop and Evaluate Selectors?
- Easier to control the environment
  - Form groupings of machines at will
  - Change the mapping algorithm trivially
  - Compare against an optimal mapping algorithm
- MUCH faster than running on a testbed
  - 100,000 100 ms calls on the testbed: > 3 hours
  - 100,000 100 ms calls on the simulator: 1-5 minutes
16. Assumption of the Trace-based Simulation Approach
- Traces define the background work on the hosts and network
- We ask: if we had injected this extra work at this time, how long would it have taken to complete?
- The answer is valid if this additional work would not have significantly perturbed the background work
- The assumption is that we would not have significantly perturbed the background work
17. Structure of Simulator
[Figure: a Mapping Request Generator issues Mapping Requests to the
Mapping Algorithm; a Compute Time Estimator, driven by the load traces
(time, CPU demand), supplies execution time estimates; Mapping Results
flow to Statistics and Analysis.]
18. Execution Time from Load Traces
- Load trace: load(t)
- Given t_nominal seconds of CPU time to map at time t_now, t_compute^local is the smallest T such that the integral from t_now to t_now + T of dt / (load(t) + 1) equals t_nominal (the task receives a 1/(load + 1) share of the CPU)
- Experimental validation shows a strong (0.99) correlation between predicted and actual execution times
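The estimate above can be sketched as a discrete walk over the 1 Hz trace (my own sketch, under the usual assumption that a task gets a 1/(load(t)+1) share of the CPU each second):

```python
def t_compute_local(load, t_now, t_nominal):
    """Wall-clock seconds for t_nominal seconds of CPU work started at
    t_now, given a 1 Hz load trace; 1-second granularity for simplicity."""
    done, t = 0.0, t_now
    while done < t_nominal:
        share = 1.0 / (load[t % len(load)] + 1.0)  # CPU share this second
        done += share
        t += 1
    return t - t_now

trace = [0.0, 1.0, 1.0, 0.0, 3.0]       # hypothetical 1 Hz load samples
print(t_compute_local(trace, 0, 2.0))   # -> 3
```

2.0 seconds of work meets a full share in second 0 and half shares in seconds 1 and 2, so it completes after 3 wall-clock seconds.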
19. Load Trace Characterization
20. Why Use Traces Directly?
- They are very hard to fit a model to
- They exhibit interesting transient behavior we don't want to lose
- Examples
21. Mapping Algorithms
- Mean
- WinMean(W), WinVar(W), Confidence(W)
- RandWinMean(W,P), RandWinVar(W,P), RandConfidence(W,P)
- NeuralNet(W)
- RangeCounter(W)
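To convey the flavor of the windowed predictors, here is a sketch of a WinMean(W)-style selector (the interface and class names are mine; the slide only names the algorithms): keep the last W observed execution times per host and pick the host with the lowest windowed mean.

```python
from collections import deque
from statistics import mean

class WinMean:
    """Predict the next execution time on each host as the mean of the
    last W observations; select the host with the lowest prediction."""
    def __init__(self, W):
        self.W = W
        self.hist = {}   # host -> deque of recent exec times (seconds)

    def observe(self, host, t):
        self.hist.setdefault(host, deque(maxlen=self.W)).append(t)

    def predict(self, host):
        return mean(self.hist[host])

    def select(self):
        return min(self.hist, key=self.predict)

p = WinMean(W=3)
for t in (0.10, 0.12, 0.11):
    p.observe("server", t)
for t in (0.20, 0.05, 0.30):
    p.observe("user", t)
print(p.select())  # -> "server" (windowed mean 0.11 vs. ~0.18)
```

WinVar and the confidence-based variants would rank hosts by variance or by an upper confidence bound over the same window, and the Rand* variants would pick randomly with probability P to keep exploring other hosts.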
22. Activation Trees from Windows
- Already have a methodology for collecting activation trees from binaries with COFF debugging information
- Guard page manipulation or load/store instrumentation for reference behavior
- Two-phase data collection
23. Activation Tree Traces: Other Options
- Collect or derive trees by hand
  - Windows apps with source
  - Acoustic room modelling app, image editor
  - Existing distributed apps
- Derive trees from CORBA or DCOM IDL specifications
- Implement and focus on a specific class of application
  - Quake design optimization, for example
24. Capturing Data Movement
- Data reference traces tell us what data was referenced, not what data the remote execution facility would have to move
- Special case: page faults and simple DSM
- IDL extensions
- Compiler-based analysis
  - Need source...
25. Evaluation
- Empiricism vs. generalization
- Extensive evaluation study
  - Benchmarks
- Parameterizable models for traces
  - Generate test cases using the models and study the effects of the parameters
  - May be difficult
    - Load trace complexity
    - Algorithm parameters
26. Scalability Concerns
- Issue: local information on each host is proportional to the number of hosts and the number of procedures
- Number of hosts is likely to be low
  - Sequential execution model
- Data per procedure is on the order of the size of the code in an LDOS object reference for the preliminary work
- Opportunities for information sharing if necessary
27. Execution Time Dependencies
[Figure: t_compute depends on the host (load(t)); t_commin and
t_commout depend on the network (BW(t), latency(t)); the activation
tree structure (nominal(t)) determines the work; comparing t_exec
against the bounds (t_exec < bounds) determines deadline meets.]
28. Information Sharing
[Figure: application-level sharing between stubs (stub_a, stub_b)
versus lower-level sharing across processes and hosts.]
29. Comparison with a Shared Measurement System
- Potentially more scalable
- Information dissemination problem
- Lower-level information
  - Load, BW, latency
  - Can't concentrate measurements
  - Measurement granularity may not be what the application needs
- Communication overhead
30. Comparison with Reservations
- Determinism is appealing
- Translation of application demands to resource demands may be difficult
  - Could lead to over-reservation
- Significant OS and network modifications
  - Limits the sites where it is possible
31. Why Not Just Use the Least Loaded Machine?
- How do you know which is the least loaded machine?
  - Delayed information
  - Probably still need prediction!
- How do you account for communication overheads?
- Minimization of execution time may waste resources
  - Cost model
32. Implementing the Service in a Distributed Object System
[Figure: a local object reference, holding local history and
per-procedure mapinfo, invokes o.Foo(x) on one of several remote
object instances.]

  ObjectRef::foo(x) {
    if (mapinfo->InMap())
      ObjectRef::foo_map(x, mapinfo);
  }

  ObjectRef::foo_map(x, mapinfo) {
    bind(SelectInstance(foo, mapinfo, history));
    // do call, passing along mapping params, status...
  }
33. Method Invocation Overheads
- Commercial DCE and CORBA implementations: 0.1 to 1 ms (Schmidt-97)
- LDOS study: 200 MHz PPro, NT 4.0, VC 4.2
- Faster LANs: PAPERS, 3 microsecond application-to-application latency (Dietz-96)
34. LDOS Data Transfer Throughput
[Figure: process-to-process throughput, 256 KB argument, no conversion,
200 MHz PPro, NT 4.0, VC 4.2.]
35. Related Work
- Applications
  - CAVE, CUMULUS, DIS, Wolfe-Lau-95
- Remote execution
  - Systems: RPC, DSM, DCE, CORBA, DCOM, ...
  - Performance: Schmidt-sigcomm97, PAPERS, ...