The Big Picture
Transcript and Presenter's Notes
1
The Big Picture
  • Problem: how to build responsive interactive
    applications on distributed systems
  • Why interesting? Large group of apps, large
    numbers of users
  • Why hard? Complex, dynamic nature of systems;
    can't easily change systems
  • Other approaches: dynamic load balancing,
    distributed soft real-time, quality of service
  • My approach: dynamic mapping of activation trees
    using history-based prediction
  • Success: dynamic mapping algorithm that performs
    close to optimal for benchmark set
  • Thesis structure: outline and timeline as before

2
Example Activation Tree
[Figure: example activation tree, transcribed from a diagram. Nodes include
MoveListener(2) at the root, ListenerResponses(2),
DisplayListenerResponses(2), SourceListenerResponse(1,2) ...
SourceListenerResponse(S,2), and per-source computations Init(), Step(1)
... Step(M), Extract(1) ... Extract(M), and Block(0), Block(1), Block(2);
edges carry data-size annotations such as 0, 1, N, 2N, 3N, B, 3B, and MS.
Legend: S sources, L listeners, N elements, M steps, B blocksize.]
3
Tradeoff Example for ARM
Setup: shared server (50 MFLOP/s) and user machine (25 MFLOP/s) connected
by a 10 MB/s shared network.
  • Problem size N = 2,744,000 (4 m^3 room, 3 kHz)
  • Each step requires 26N FLOPs
  • To run Forward(k) on the server: 3N doubles in, 3N
    doubles out
  • Time on user machine: k * 26N / 25 MFLOP/s
  • Time on server: k * 26N / 50 MFLOP/s + 48N bytes / 10 MB/s
  • For fewer than k = 10 steps, it is faster to use the
    user machine (the sketch below works this out)
  • Other load on the server or network changes the tradeoff
    point
  • Wouldn't it be nicer to map steps?
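
A quick check of that break-even point (a minimal sketch; the rates and
sizes are the ones on the slide, and the only added assumption is the
8-byte double, which makes the 6N transferred doubles 48N bytes):

  #include <cstdio>

  int main() {
      const double N = 2744000;        // problem size (4 m^3 room, 3 kHz)
      const double user_flops = 25e6;  // user machine, FLOP/s
      const double serv_flops = 50e6;  // shared server, FLOP/s
      const double net_bw = 10e6;      // shared network, bytes/s
      const double bytes = 6 * N * 8;  // 3N doubles in + 3N doubles out

      // Time for k steps on each side; each step costs 26N FLOPs.
      for (int k = 1; k <= 12; ++k) {
          double t_user = k * 26 * N / user_flops;
          double t_serv = k * 26 * N / serv_flops + bytes / net_bw;
          printf("k=%2d  user %.2fs  server %.2fs  %s\n", k, t_user,
                 t_serv, t_user < t_serv ? "user wins" : "server wins");
      }
      // Break-even: 26kN/25e6 = 26kN/50e6 + 48N/10e6  =>  k ~ 9.2,
      // so the crossover is at roughly k = 10 steps, as the slide says.
      return 0;
  }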

4
Responsiveness Specification
  • Are bounds the right specification?
    • It is very clear what they mean
  • Utility functions
    • Make the problem much uglier
    • Restrict allowable functions for tractability
  • Statistical specifications
    • (mean, variance), PDFs
    • Doesn't seem to make sense to say 95% should be
      faster than 100 ms in best effort

5
Performance Metric
  • Simplicity: should make it clear to the programmer
  • Programmer can also measure exec times to compute
    his own metrics and adjust his specifications
  • May look at extensions
    • Variance, etc.
    • Treat mapping failures differently

6
Execution Time of a Node/Subtree
[Figure: timeline of the execution of a node/subtree, transcribed from a
diagram. The components, in order, are
t_exec = t_map + t_incomm + t_compute + t_outcomm + t_update,
followed by slack t_slack; the allowable window runs from t_now + t_min to
t_now + t_max, and
t_compute = t_compute^local + sum of t_exec^child over the children.
A sketch in code follows.]
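
A minimal sketch of that accounting identity (the struct and method names
are illustrative, not LDOS interfaces):

  #include <numeric>
  #include <vector>

  // Components of a node's execution time, per the timeline above.
  struct ExecTime {
      double t_map, t_incomm, t_compute_local, t_outcomm, t_update;
      std::vector<double> t_exec_children;  // t_exec of each child subtree

      // t_compute = local compute time plus the execution of all children
      double t_compute() const {
          return t_compute_local +
                 std::accumulate(t_exec_children.begin(),
                                 t_exec_children.end(), 0.0);
      }
      double t_exec() const {
          return t_map + t_incomm + t_compute() + t_outcomm + t_update;
      }
      // Deadline window: finish no earlier than t_now + t_min and no
      // later than t_now + t_max; anything left over is t_slack.
      bool within_bounds(double t_min, double t_max) const {
          double t = t_exec();
          return t_min <= t && t <= t_max;
      }
  };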
7
Decomposition of Bounds
[Figure: foo() carries bounds (t_min, t_max) and is partially executed
(known); beneath it sit an unexecuted, known subtree and unexecuted,
unknown subtrees including bar(), whose bounds (t_min^bar, t_max^bar) must
be chosen.]
t_max^bar = a * t_max and t_min^bar = a * t_min, where a is a function of
the foo -> bar history b_1, b_2, ..., and the b's are the actual fractions
of time spent in the bar() subtree on previous executions of foo()
  • Choice of (t_min^bar, t_max^bar) for bar() depends on the
    unvisited portion of the tree
  • Past execution history is encoded in the b's
  • Choose the fraction of the bounds to give to bar() based
    on that history, as in the sketch below
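
A sketch of that split (choosing a as the plain mean of the b's is an
illustrative assumption; the slide only says a is some function of the
history):

  #include <vector>

  // Fraction of foo()'s time historically spent in the bar() subtree.
  // b[i] is the actual fraction observed on the i-th execution of foo().
  double estimate_a(const std::vector<double>& b) {
      if (b.empty()) return 0.5;  // no history: split evenly (assumption)
      double sum = 0;
      for (double x : b) sum += x;
      return sum / b.size();      // here: plain mean of the history
  }

  // Give bar() the fraction a of foo()'s bounds.
  void decompose(double t_min, double t_max, const std::vector<double>& b,
                 double& bar_t_min, double& bar_t_max) {
      double a = estimate_a(b);
      bar_t_min = a * t_min;
      bar_t_max = a * t_max;
  }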

8
Distributed Soft Real-time Systems
  • Context: RT CORBA (OMG-96, DiPippo-97)
  • End-to-end priorities
  • Prediction
    • Application-level resource schedulers (Polze-96)
    • Generalized deadlines (Jensen-96, OMG-96)
  • Scalable information dissemination
    • Estimate remote scheduler's state from delayed
      information (Bestavros-WoRTA93)
    • Exploit application-specific properties
      (Hailperin-93)

9
Dynamic Load Balancing
  • OS-centric: distribute load evenly
    • Value of process migration (Eager-Sigmetrics88,
      Leland-Sigmetrics86, Harchol-Balter-Sigmetrics96)
    • Load balancing is detrimental to RT
      (Bestavros-ICDCS97)
  • Application-centric: minimize exec time
    • Task parallelism: Jade (Rinard-Computer93)
    • Loop parallelism: Dome (Beguelin-95),
      (Siegell-HPDC-94)
  • Scalable information dissemination
    • Adaptive communication (Shivaratri-ICDCS90)
    • Application domain-specific (Hailperin-93)
    • Neural net prediction (Mehra-93)

10
Quality of Service
  • Communication resource reservation
    • Tenet, RSVP, ATM CBR/VBR, ...
  • Extensions to application level
    • QoS (Zinky-DUTC-95, Zinky-TaPOS-97)
    • Contracts
    • Qual (Florissi-94)
    • Mobile adaptation: Odyssey (Noble-SOSP-97)

11
Mapping Algorithm Characteristics
  • Introspection: measures self
  • Locality: use only local information
    • No extra communication
    • Procedure call durations
  • Independence: each procedure is separate
  • History: decisions based on history (logs)
  • Simplicity: measurement scheme, history, and
    decision making must all fit into an object
    reference and have overhead comparable to a remote
    call

12
Load Traces
  • 1 Hz sample rate, one week
  • Various hosts (39 machines)
    • PSC Alpha Supercluster (13 machines)
    • Local Alpha desktops (16 machines)
    • Manchester Testbed (8 machines)
    • Local compute servers (2 machines)

13
Result Load Trace Analysis
  • Self-similarity
    • H > .5, typically > .8
  • Epochal behavior of frequency content
    • Frequency content remains the same for 150 to 450
      seconds
  • Simple phase-space behavior
  • These results suggest that load is predictable
    from its past behavior

14
Load Trace Characterization
15
Why Simulation to Develop and Evaluate Selectors?
  • Easier to control the environment
    • Form groupings of machines at will
    • Change the mapping algorithm trivially
    • Compare against an optimal mapping algorithm
  • MUCH faster than running on a testbed
    • 100,000 100 ms calls on the testbed: > 3 hours
    • 100,000 100 ms calls on the simulator: 1-5 minutes

16
Assumption of Trace-based Simulation Approach
  • Traces define the background work on the hosts
    and network
  • We ask: if we had injected this extra work at
    this time, how long would it have taken to
    complete?
  • The answer is valid under the assumption that the
    additional work would not have significantly
    perturbed the background work

17
Structure of Simulator
[Figure: structure of the simulator. A Mapping Request Generator emits
mapping requests (time, CPU demand); the Mapping Algorithm, using the
Compute Time Estimator, turns each request into a mapping result, which
feeds Statistics and Analysis. A pipeline sketch follows.]
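
A minimal sketch of that pipeline (all of the type and method names here
are illustrative, not the simulator's actual interfaces):

  #include <vector>

  struct MappingRequest { double time, cpu_demand; };  // from the generator
  struct MappingResult  { int host; double t_exec; };  // from the algorithm

  // One simulation run: generate requests, let the mapping algorithm pick
  // hosts using the compute-time estimator, and record each outcome.
  template <class Gen, class Estimator, class Algorithm, class Stats>
  void simulate(Gen& gen, Estimator& est, Algorithm& alg, Stats& stats,
                int num_requests) {
      for (int i = 0; i < num_requests; ++i) {
          MappingRequest req = gen.next();
          MappingResult res = alg.map(req, est);  // est predicts per-host times
          stats.record(req, res);
      }
  }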
18
Execution time from load traces
  • Load trace: load(t)
  • Given t_nominal seconds of CPU time to map at time
    t_now, t_compute^local is found from load(t) (the
    equation appears as a figure on the original slide;
    see the sketch below)
  • Experimental validation shows strong (.99)
    correlation between predicted and actual
    execution times
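
The equation itself was a figure on the slide. The sketch below shows one
standard formulation, under the assumption that a task on a host with load
load(t) receives a 1/(1 + load(t)) share of the CPU; that availability
model is an assumption here, not a transcription of the slide's formula:

  #include <vector>

  // Given t_nominal seconds of CPU demand injected at index 'now' into a
  // 1 Hz load trace, walk forward accumulating the CPU share the task
  // would receive until the demand is satisfied.
  double compute_time(const std::vector<double>& load, size_t now,
                      double t_nominal) {
      double remaining = t_nominal, elapsed = 0;
      for (size_t t = now; t < load.size() && remaining > 0; ++t) {
          double share = 1.0 / (1.0 + load[t]);  // assumed availability model
          double used = remaining < share ? remaining : share;
          elapsed += used / share;  // wall-clock time to burn 'used' CPU-seconds
          remaining -= used;
      }
      return elapsed;  // t_compute^local, if the trace was long enough
  }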

19
Load Trace Characterization
20
Why use traces directly?
  • They are very hard to fit a model to
  • They exhibit interesting transient behavior we don't
    want to lose
  • Examples

21
Mapping Algorithms
  • Mean
  • WinMean(W), WinVar(W), Confidence(W) (WinMean is
    sketched below)
  • RandWinMean(W,P), RandWinVar(W,P),
    RandConfidence(W,P)
  • NeuralNet(W)
  • RangeCounter(W)
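
As an example of the simplest family, a WinMean(W)-style predictor (the
interface is hypothetical; the variance, confidence, and randomized
variants layer additional statistics and sampling on the same window):

  #include <deque>

  // WinMean(W): predict the next execution time as the mean of the
  // last W observed execution times.
  class WinMean {
      std::deque<double> window_;
      size_t w_;
  public:
      explicit WinMean(size_t w) : w_(w) {}
      void observe(double t) {          // record a measured duration
          window_.push_back(t);
          if (window_.size() > w_) window_.pop_front();
      }
      double predict() const {          // mean over the window
          if (window_.empty()) return 0;
          double sum = 0;
          for (double t : window_) sum += t;
          return sum / window_.size();
      }
  };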

22
Activation Trees from Windows
  • Already have a methodology for collecting
    activation trees from binaries with COFF
    debugging information
  • Guard-page manipulation or load/store
    instrumentation for reference behavior
  • Two-phase data collection

23
Activation Tree Traces Other Options
  • Collect or derive trees by hand
    • Windows apps with source
    • Acoustic room modelling app, image editor
    • Existing distributed apps
  • Derive trees from CORBA or DCOM IDL
    specifications
  • Implement and focus on a specific class of
    application
    • Quake design optimization, for example

24
Capturing Data Movement
  • Data reference traces tell us what data was
    referenced, not what data the remote execution
    facility would have to move
  • Special case: page faults and simple DSM
  • IDL extensions
  • Compiler-based analysis
    • Needs source...

25
Evaluation
  • Empiricism vs. generalization
  • Extensive evaluation study
    • Benchmarks
  • Parameterizable models for traces
    • Generate test cases using the models and study the
      effects of the parameters
    • May be difficult - load trace complexity
  • Algorithm parameters

26
Scalability Concerns
  • Issue: local information on each host is
    proportional to the number of hosts and the
    number of procedures
  • Number of hosts is likely to be low
    • Sequential execution model
  • Data per procedure is on the order of the size of
    the code in an LDOS object reference for the
    preliminary work
  • Opportunities for information sharing if necessary

27
Execution Time Dependencies
[Figure: dependencies of execution time. t_compute depends on the host
(load(t)) and on the activation tree structure (nominal(t)); t_comm^in and
t_comm^out depend on the network (BW(t), latency(t)). Together they
determine t_exec, and the deadline is met when t_exec < bounds.]
28
Information Sharing
[Figure: information sharing. At the application level, stubs (stub_a,
stub_b) share information directly; at lower levels, sharing happens
across processes and hosts.]
29
Comparison with Shared Measurement System
  • Potentially more scalable
    • Information dissemination problem
  • Lower-level information
    • Load, BW, latency
  • Can't concentrate measurements
    • Measurement granularity may not be what the
      application needs
  • Communication overhead

30
Comparison with Reservations
  • Determinism is appealing
  • Translation of application demands to resource
    demands may be difficult
    • Could lead to over-reservation
  • Significant OS and network modifications
    • Limits the sites where it is possible

31
Why not just use the least loaded machine?
  • How do you know which is the least loaded
    machine?
    • Delayed information
    • Probably still need prediction!
  • How do you account for communication overheads?
  • Minimization of exec time may waste resources
    • Cost model

32
Implementing the Service in a Distributed Object System
[Figure: a local object reference, holding per-procedure mapinfo and local
history, dispatches o.Foo(x) to one of several remote object instances.]
ObjectRef::foo(x) {
  if (mapinfo->InMap())
    ObjectRef::foo_map(x, mapinfo);
}
ObjectRef::foo_map(x, mapinfo) {
  bind(SelectInstance(foo, mapinfo, history));
  // do call, passing along mapping params, status...
}
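
SelectInstance is where the history-based prediction plugs in; the slide
names the function but not its body, so the following is a hedged sketch
only (a greedy choice over predicted total times, with illustrative
parameters):

  #include <cmath>
  #include <vector>

  // Hypothetical SelectInstance body: choose the remote instance with the
  // lowest predicted total time. 'predicted_compute' holds the
  // history-based compute-time prediction per instance; 'comm_cost' the
  // predicted cost of moving this call's data to and from that instance.
  int SelectInstance(const std::vector<double>& predicted_compute,
                     const std::vector<double>& comm_cost) {
      int best = 0;                 // defaults to instance 0 if lists are empty
      double best_t = INFINITY;
      for (size_t i = 0; i < predicted_compute.size(); ++i) {
          double t = predicted_compute[i] + comm_cost[i];
          if (t < best_t) { best_t = t; best = static_cast<int>(i); }
      }
      return best;                  // index of the instance to bind() to
  }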
33
Method Invocation Overheads
  • Commercial DCE and CORBA implementations: 0.1 to
    1 ms (Schmidt97)
  • LDOS study: 200 MHz Pentium Pro, NT 4.0, VC 4.2
  • Faster LANs: PAPERS, 3 microsecond
    application-to-application latency (Dietz96)

34
LDOS Data Transfer Throughput
[Graph: process-to-process data transfer throughput, 256 KB argument, no
conversion, 200 MHz Pentium Pro, NT 4.0, VC 4.2.]
35
Related Work
  • Applications
    • CAVE, CUMULUS, DIS, Wolfe-Lau-95
  • Remote execution
    • Systems: RPC, DSM, DCE, CORBA, DCOM, ...
    • Performance: Schmidt-sigcomm97, PAPERS, ...