Automatic Statistical Evaluation of Resources for Condor

1
Automatic Statistical Evaluation of Resources for
Condor
  • Daniel Nurmi, John Brevik, Rich Wolski
  • University of California, Santa Barbara

2
Motivation
  • Distributed System/Grid applications execute on
    wide variety of architectures
  • Clusters
  • Large SMP systems
  • Interactive workstation networks
  • Condor provides vast, easily accessible resource
    pool, but is best suited to Condor applications

3
Condor As Resource Pool
  • Provides many required features
  • Resource manager
  • Account manager
  • Scheduler
  • Resource availability very dynamic
  • Controlled by large number of variables including
    overall load, user priority, occupancy time,
    owner revocation, etc.
  • Resources free up and drop out frequently
  • Long running apps must be checkpointed

4
Checkpointing Schemes
  • Condor checkpointing
  • Standard Universe uses system call liftoff
  • Core file is used to capture process state for
    restart
  • Application-level checkpointing
  • Application developer must generate checkpoints
    from within the application
  • Disk storage may be limited (none available
    locally)

5
Condor Checkpointing
  • Checkpointing is invisible to application
    developer, but
  • No threads
  • No forking
  • Single architecture support
  • Must use compiler supported by Condor (e.g. no
    GMP)

6
Application-Level Checkpointing
  • No support from Condor for checkpointing in
    Vanilla universe
  • Left to the application
  • No restrictions on system calls or compilation
  • If it compiles it will run
  • No local disk storage
  • Checkpoints must traverse the network to a
    machine with stable storage
  • Checkpoint schedule major performance concern

7
Checkpoint Scheduling
  • Given a long-running application and a volatile
    resource, determine how long to perform useful
    computation between checkpoints such that the
    overhead of checkpointing is minimized
  • Well studied
  • K. M. Chandy, C. V. Ramamoorthy. Rollback and
    recovery strategies for computer systems.
  • M. Elnozahy, L. Alvisi, Y. M. Wang, D. B.
    Johnson. A survey of rollback-recovery protocols
    in message passing systems.
  • A. Duda. The effects of checkpointing on program
    execution time.
  • N. H. Vaidya. Impact of checkpoint latency on
    overhead ratio of a checkpointing scheme
  • We use the Markov-model-based approach proposed
    by N. H. Vaidya.
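
To make the scheduling problem on this slide concrete, here is a minimal Python sketch under simplifying assumptions that are not the model used in the talk: availability is exponential with rate lam, and checkpoint latency equals checkpoint overhead. Under those assumptions the expected time to finish T seconds of work plus a checkpoint of cost C, with restart cost R, has a closed form, and the interval that minimizes the overhead ratio can be found numerically. All function names and numeric values are illustrative and do not come from the authors' software.

```python
import math

def expected_segment_time(T, C, R, lam):
    """Expected wall-clock time to complete T seconds of work plus one
    checkpoint of cost C, with exponential failures at rate lam and a
    restart cost R paid after every failure (assumed model)."""
    return math.exp(lam * R) * (math.exp(lam * (T + C)) - 1.0) / lam

def overhead_ratio(T, C, R, lam):
    """Fraction of extra time spent per unit of useful work."""
    return expected_segment_time(T, C, R, lam) / T - 1.0

def optimal_interval(C, R, lam, lo=1e-3, hi=None, tol=1e-3):
    """Golden-section search for the interval T minimizing overhead_ratio.
    Relies on the ratio being unimodal in T, which holds for this formula."""
    if hi is None:
        hi = 20.0 / lam  # generous upper bracket: 20 mean uptimes
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if overhead_ratio(c, C, R, lam) < overhead_ratio(d, C, R, lam):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return 0.5 * (a + b)

if __name__ == "__main__":
    lam = 1.0 / 3600.0      # hypothetical: one failure per hour on average
    C, R = 10.0, 10.0       # hypothetical checkpoint and restart costs (s)
    T = optimal_interval(C, R, lam)
    print(f"interval {T:.1f} s, overhead {overhead_ratio(T, C, R, lam):.3f}")
```

When the checkpoint cost is small relative to the mean uptime, the result should land close to Young's first-order approximation sqrt(2 * C / lam). A solver for Weibull or hyperexponential availability would need the failure rate to depend on machine age, which this sketch does not attempt.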

8
Checkpoint Interval Selection
  • Model requires statistical distribution
    describing resource availability
  • Vaidya, and later Plank, assume exponential
    distributions

9
What is the Availability Distribution?
  • Weibull
  • T. Heath, P. M. Martin, T. D. Nguyen. The shape
    of failure
  • J. Xu, Z. Kalbarczyk, R. K. Iyer. Networked
    Windows NT system field failure data analysis
  • Hyperexponential
  • M. Mutka, M. Livny. Profiling workstations'
    available capacity for remote execution.
  • I. Lee, D. Tang, R. K. Iyer, M. C. Hsueh.
    Measurement-based evaluation of operating system
    fault tolerance.
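
The practical difference between these candidate distributions shows up in how the chance of staying available changes with machine age. The Python sketch below writes out the standard survival functions for the exponential, Weibull, and hyperexponential families and a helper for conditional survival. The function names and the parameter values in the comments are illustrative assumptions, not fitted values from the talk.

```python
import math

def exponential_survival(t, rate):
    """P(uptime > t) for an exponential distribution."""
    return math.exp(-rate * t)

def weibull_survival(t, shape, scale):
    """P(uptime > t) for a Weibull distribution."""
    return math.exp(-((t / scale) ** shape))

def hyperexponential_survival(t, weights, rates):
    """P(uptime > t) for a mixture of exponentials.
    weights must sum to 1; rates are the per-phase failure rates."""
    return sum(p * math.exp(-r * t) for p, r in zip(weights, rates))

def conditional_survival(survival, age, extra):
    """P(uptime > age + extra | uptime > age) for a survival function."""
    return survival(age + extra) / survival(age)

if __name__ == "__main__":
    # Hypothetical parameters chosen only to show the qualitative behavior.
    families = {
        "exponential": lambda t: exponential_survival(t, 0.5),
        "weibull (shape 0.5)": lambda t: weibull_survival(t, 0.5, 1.0),
        "hyperexponential": lambda t: hyperexponential_survival(
            t, [0.5, 0.5], [1.0, 0.01]),
    }
    for name, s in families.items():
        young = conditional_survival(s, 0.0, 1.0)
        old = conditional_survival(s, 100.0, 1.0)
        print(f"{name}: new machine {young:.3f}, aged machine {old:.3f}")
```

The exponential is memoryless, so the conditional survival does not depend on age. A Weibull with shape below 1 and a hyperexponential both make an already long-lived resource more likely to survive the next interval, which is why the choice of family can change the checkpoint schedule.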

10
Generating Statistical Models
  • Network Weather Service monitoring of Condor pool
    over 2 year period
  • 708 machines observed
  • Automatic model fitting software
  • Takes as input distribution type and historical
    Condor uptime values
  • Outputs best fit parameters for given
    distribution
  • Design experiment to test overall work efficiency
    of checkpointing scheme using four different
    distributions
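
As an illustration of what "takes a distribution type and historical uptime values, outputs best-fit parameters" can look like, here is a minimal maximum-likelihood sketch in Python for two of the families. It is not the authors' fitting software: the function names are hypothetical, the Weibull shape is found by plain bisection, and hyperexponential fitting (which needs an EM-style procedure) is left out.

```python
import math

def fit_exponential(uptimes):
    """MLE for the exponential rate: sample count divided by total uptime."""
    return len(uptimes) / sum(uptimes)

def fit_weibull(uptimes, lo=0.01, hi=50.0, iters=200):
    """MLE for Weibull (shape, scale) from strictly positive uptimes.
    Solves the shape likelihood equation by bisection, assuming the
    root lies between lo and hi."""
    n = len(uptimes)
    logs = [math.log(x) for x in uptimes]
    mean_log = sum(logs) / n
    biggest = max(uptimes)
    scaled = [x / biggest for x in uptimes]  # keeps s ** k from overflowing

    def score(k):
        # Derivative of the profile log-likelihood, up to a positive factor;
        # it is increasing in k, so bisection applies.
        powers = [s ** k for s in scaled]
        weighted = sum(p * l for p, l in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = biggest * (sum(s ** shape for s in scaled) / n) ** (1.0 / shape)
    return shape, scale

if __name__ == "__main__":
    import random
    random.seed(0)
    # Synthetic uptimes (hours) from a known Weibull, for a sanity check only.
    sample = [random.weibullvariate(10.0, 0.7) for _ in range(5000)]
    print("exponential rate:", fit_exponential(sample))
    print("weibull shape, scale:", fit_weibull(sample))
```

Fitting synthetic data drawn from known parameters, as in the usage above, is a cheap way to check the fitter before pointing it at real availability traces.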

11
Checkpoint Experiment
  • Test application is submitted to Condor; when it
    runs, the following steps occur
  • Sends resource information to central server
  • Model fitting software estimates model parameters
    using MLE or EMpht methods
  • Checkpoint scheduler solves the Markov model
    using tested distribution
  • Application uses schedule, checkpoints its
    memory, and records performance
  • Test different distributions
  • Checkpointing to disks at UCSB
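
The quantities this experiment records, total execution time and the number of checkpoints sent over the network, can also be explored offline. The Python sketch below is a simplified stand-in for that loop, not the test application itself: it assumes a fixed checkpoint interval, loses any work not yet checkpointed when an availability period ends, and charges a restart cost at the start of every period after the first. All names and parameters are hypothetical.

```python
import random

def simulate_run(total_work, interval, ckpt_cost, restart_cost,
                 draw_uptime, max_periods=1_000_000):
    """Return (elapsed_time, checkpoints_taken) for one simulated job.
    draw_uptime() yields the length of each availability period."""
    done = 0.0        # work that has been safely checkpointed
    elapsed = 0.0     # wall-clock time consumed by finished periods
    checkpoints = 0
    for period in range(max_periods):
        avail = draw_uptime()
        used = restart_cost if period > 0 else 0.0
        while done < total_work:
            segment = min(interval, total_work - done)
            need = segment + ckpt_cost
            if used + need > avail:
                break  # resource is revoked before this checkpoint lands
            used += need
            done += segment
            checkpoints += 1
        if done >= total_work:
            return elapsed + used, checkpoints
        elapsed += avail  # the whole period is spent, partial work is lost
    raise RuntimeError("job did not finish; interval may exceed uptimes")

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical workload: 10 h of work, Weibull uptimes measured in seconds.
    draw = lambda: rng.weibullvariate(7200.0, 0.7)
    for interval in (300.0, 1200.0, 3600.0):
        elapsed, n = simulate_run(36000.0, interval, 30.0, 30.0, draw)
        print(f"interval {interval:6.0f} s: efficiency "
              f"{36000.0 / elapsed:.2f}, checkpoints {n}")
```

Multiplying the checkpoint count by the checkpoint size gives a rough figure for network load, so the same run shows both sides of the trade-off: shorter intervals lose less work per failure but move more data.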

12
Empirical Results: Execution Time

13
Empirical Results: Network Utilization

14
Moral
  • We can determine optimal checkpoint schedules for
    Condor jobs automatically
  • Execution performance impact is about the same
    until checkpoint costs get big
  • Network load improvements are substantial
    (particularly useful in wide area)
  • Software is real, but non-NWS parts are in
    prototype
  • We want to bring them into the NWS release cycle
  • Paper in submission to HPDC

15
What's Next
  • Better Models
  • Brevik method: we can predict the percentiles of
    availability with provable confidence bounds
    using less data
  • Can't use it (yet) for the Markov model
  • Better Utility
  • Provide information to Condor itself
  • Automatic fault and anomaly detection
  • Better Information for users
  • Publish availability predictions in the matchmaker

16
Thanks
  • Rich Wolski
  • John Brevik
  • Miron Livny
  • NSF Next Generation Software program
  • VGrADS Project (NSF ITR, Ken Kennedy, PI)
  • NSF Middleware Initiative (NWS)
  • Questions?

17
Simulation Results: Execution Time

18
Simulation Results: Network Utilization