Bust a Move - PowerPoint PPT Presentation

About This Presentation

Title:

Bust a Move

Description:

Bust a Move – PowerPoint PPT presentation

Number of Views:55

Avg rating:3.0/5.0

Slides: 28

Provided by: lro80

Learn more at: https://research.cs.wisc.edu

Category:

more less

Transcript and Presenter's Notes

Title: Bust a Move

1

Bust a Move
Young MC

Modeling and Predicting Machine Availability in
Volatile Computing Environments
Rich Wolski
John Brevik
Dan Nurmi
University of California, Santa Barbara

3
Explorations In Grid Computing

Performance How can programs extract high
performance levels given that the resource pool
is heterogeneous and dynamically changing?
The Network Weather Service
On-line performance monitoring and prediction
Programming What programming abstractions are
needed to enable the Grid paradigm?
EveryWare
Toolkit for building global programs
Analysis How do we reason about the Grid
globally?
G-Commerce
Systemwide efficiency, stability, etc.

4
Fortune Telling

Grid resource performance varies dynamically
Machines, networks and storage systems are shared
by competing applications
Federation
Either the system or the application itself must
tolerate performance variation
Dynamic scheduling
Scheduling requires a prediction of future
performance levels
What performance level will be deliverable?

5
Skepticism

Is it really possible to predict future
performance levels?
Self-similarity
Non-stationarity
With what accuracy?
For how long into the future?
NWS On-line, semi non-parametric time series
techniques
Use running tabulation of forecast error to
choose between competing forecasters
Bandwidth, latency, CPU load, available memory,
battery power

6
What About Machine Availability?
7
The Normal Approach

Each measurement is modeled as a sample from a
random variable
Time invariant
IID (independent, identically distributed)
Stationary (IID forever)
Well studied in the literature
Exponential distributions
Compose well
Memoryless
Popular in database and fault-tolerance
communities
Pareto distributions
Potentially related to self-similarity
heavy-tailed implying non-predictability
Popular in networking, Internet, and Dist. System
communities

8
Our Abnormal Approach

Measure availability as lifetime in a variety
of settings
Student lab at UCSB, Condor pool
New NWS availability sensors
Data used in fault-tolerance community for
checkpointing research
Predicting optimal checkpoint
Develop robust software for MLE parameter
estimation
Fit Exponential, Pareto, and Weibull
distributions
Compare the fits
Visually
Goodness of fit tests
Goal is to provide an automated mechanism for the
NWS
Let the best distribution win

9
UCSB Student Computing Labs

Approximately 85 machines running Red Hat Linux
located in three separate buildings
Open to all Computer Science graduate and
undergraduates
Only graduates have building keys
Power-switch is not protected
Anyone with physical access to the machine can
reboot it by power cycling it
Students routinely clean off competing users or
intrusive processes to gain better performance
response
NWS deployed and monitoring duration between
restarts
Can we model the time-to-reboot?

10
UCSB Empirical CDF
11
MLE Weibull Fit to UCSB Data
12
Comparing Fits at UCSB
13
The Visual Acid Test
14
More Systems

Condor Cycle harvesting system (M. Livny, U.
Wisconsin)
Workstations in a pool run the (trusted) Condor
daemons
When a machine running a Condor job becomes
busy Job is terminated (vanilla universe)
Unknown and constantly changing number of
workstations in UWisc Condor Pool ( 1000 Linux
Workstations)
Long, Muir, Golding Internet Survey (1995)
Pinged the rpc.statd as a heartbeat
Used extensive in fault-tolerance community to
model host failure
1170 hosts covering 3 months of Spring

15
(No Transcript)
16
The Condor Picture
April 2003 through Oct 2004, 600 hosts
17
More Condor
April 2003 through July 2005, 900 hosts
18
Condor Clusters
April 2003 through July 2005, 730 hosts
19
Condor Non-cluster
April 2003 through July 2005, 170 hosts
20
Modeling Lessons

Machine availability looks like it is
well-modeled by a Weibull, but Condor process
lifetime is trickier
Hyper-exponentials do well, but hard to fit and
use
Log-normal looks better in the large (need to
investigate more)
May be able to do piece-wise fit for desktops
Who should care?
Grid simulators
Availability is critical
P2P systems
Oceanstore, CAN, TAPESTRY, etc. all assume very
basic availability distributions in their proofs
Replication systems
It does not mean, that model fitting works best
for predicting availability gt data shortage

21
Predicting Individual Machine Behavior

Estimating Mean Time to Failure (MTTF) is
relatively easy
Unless the data is Pareto, the mean is the
expected value
Probably not what is needed to support scheduling
The cost of being below the mean is not the same
as the cost of being above it
At least how much time will elapse before this
machine reboots with 95 certainty?
The answer is the 0.05 quantile (not an
expectation) from the cumulative distribution
function (CDF)

22
Certainty in an Uncertain World

Predictions of the form
For at least how long with this machine be
available with X certainty?
Requires two estimates if certainty is to be
quantified
Estimate the (1-X) quantile for the distribution
of availability gt Qx
Estimate the lower X confidence bound on the
statistic Qx gt Q(x,lb)
If the estimates are unbiased, and the
distribution is stationary, future availability
duration will be larger than Q(x,lb) X of the
time, guaranteed

23
Neo-classical Methods

The classical (parametric) method has some
drawbacks
Which distribution?
MLE is computationally challenging or impossible
for some distributions and/or data sets
Requires quite a bit of data to get a good fit
Quantiles near the tails are squeezed so fit
error is significant
Estimating confidence bounds for high-order
models is computationally (and theoretically)
difficult
Non-parametric techniques
Can usually only recover a statistic and not the
distribution
Those that appeal to the CLT may have an
asymptote problem
New non-parametric invention based on Binomial
assumptions

24
Experiments in Fortune Telling

CSIL, Condor (2 years), and Long data sets
Split into training and experimental periods
Use only machines with 20 training samples or
more
Using synthetic data we noticed that be best
method needed at least 20 samples
Use methods to estimate 95 confidence on 0.05
quantile from training period
Record success if 95-100 of the remaining
experimental availability durations gt estimate
Report success (want to see 95)

25
Non-parametric methods seem to work

Weibull over-estimates the tail for Condor data
Bootstrapping works okay, but is very
computationally expensive

26
On-going Work with Condor

Checkpoint scheduling
Parametric method reduces network load
dramatically
Applications
LDPC investigation gt lowest observed error rates
Ramsey search
GridSAT
UCSBGrid
Automatic Program Overlay
On-demand Condor as a grid programming middleware
NWS Condor Integration
Publishing NWS forecasts via Hawkeye
Incorporating machine availability predictor

27
Thanks and More

Miron Livny and the Condor group at the
University of Wisconsin
Darrell Long (UCSC) and James Plank (UTK)
UCSB Facilities Staff
NSF SCI and DOE
Middleware and Applications Yielding
Heterogeneous Environments for Metacomputing at
UCSB
Students Matthew Allen, Wahid Chrabakh, Ryan
Garver, Andrew Mutz, Dan Nurmi, Erik Peterson,
Fred Tu, Lamia Youseff, Ye Wen
Research Staff John Brevik, Graziano Obertelli
www.cs.ucsb.edu/rich
rich_at_cs.ucsb.edu