Title: Anomaly Management in Grid Environments
1. Anomaly Management in Grid Environments
- Candidate: Lingyun Yang
- Advisor: Ian Foster
- University of Chicago
2. Grid Environments
[Plots: CPU load trace (Mean = 6.9, SD = 3.1) and network load trace (Mean = 197.3, SD = 324.8)]
- Resource capabilities change dynamically due to the sharing of resources
- The performance of Grid applications is affected by the capabilities of these resources
- Applications want not only fast execution time but also stable behavior
- Significant deviation from the normal profile is defined as an anomaly [D. Denning]
3. Challenges
- The problem is more complex because:
- Resources are shared
- Resources are heterogeneous
- Resources are distributed
- Grid environments are complex
4. Solutions
- Avoid or reduce the anomaly before it happens
- Prediction
- Proper scheduling strategy
- Detect the application anomaly when it happens
- Distinguish the anomaly from the normal profiles of applications and resources
- Diagnose the anomaly after it happens
- Study the relationship between resource capabilities and application performance
5. Contributions
- Avoid or reduce the application anomaly before it happens:
- A set of new one-step-ahead prediction strategies
- A conservative scheduling strategy
- → More reliable application behavior
- Detect the application anomaly when it happens / diagnose the application anomaly after it happens:
- A statistical data reduction strategy
- → Simplify the data analysis
- An anomaly detection and diagnosis strategy using signal processing techniques
- → More accurate anomaly detection and diagnosis
6. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis
- Summary
7. Avoid or Reduce Anomalies
- Deliver fast and predictable behavior to applications
- Proper scheduling strategy
- Time balancing [F. Berman, H. Dail]: heterogeneous resources
- Assign more workload to more powerful resources
- Each resource finishes (roughly) at the same time
- Why is it not enough?
- Resource performance may change during execution
- Resources with a larger capacity may also show a higher variance
- Stochastic scheduling [J. Schopf]
- Use stochastic data (average and variation)
- My approaches:
- Prediction
- Conservative Scheduling
8. CPU Load Prediction
- Prediction: estimate future values using historical data
- Key: correctly model the relationship of the historical data with future values
- Time series modeling:
- Financial data prediction [A. Lendasse], earth and ocean sciences [L. Lawson], biomedical signal processing [H. Kato], networks [H. Braun, N. Groschwitz], etc.
- CPU load prediction:
- NWS [R. Wolski]: mean-based, median-based, and AR-model strategies; dynamically selects the next strategy
- Linear models study [P. Dinda]: the AR model is the best one
- These estimate the value directly
- Question: what if we estimate the direction and variation separately?
9. Direction: Two Families of Strategies
- Homeostatic: assume the values are self-correcting, i.e., the value will return to the mean of the previous values
- Tendency: assume that if the current value decreases, the next value will also decrease; if the current value increases, the next value will also increase
10. How Much is the Variation?
- IncrementValue (or DecrementValue) can be:
- Independent: IncrementValue = IncConstant
- Relative: IncrementValue = CurValue × IncFactor
- Or:
- Static: do not change the value at any step
- Dynamic: adapt the constant at each step using the real-time information
11. Prediction Strategies
- Create a set of prediction strategies from different combinations of the directions and variations
- Evaluate this set of prediction strategies on 12 CPU load time series
- Experimental results show that the dynamic tendency prediction strategy with mixed variation works best (see the sketch after this list)
- Dynamic tendency prediction with mixed variation:
- Independent IncrementValue
- Relative DecrementValue
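A minimal Python sketch of the dynamic tendency predictor with mixed variation, to make the winning strategy concrete. The parameter names and starting values (inc_constant, dec_factor, adapt_degree) are illustrative assumptions; the slides fix only the structure of the strategy.

```python
def dynamic_tendency_predict(series, inc_constant=0.1, dec_factor=0.05,
                             adapt_degree=0.5):
    """One-step-ahead predictions for a CPU load time series.

    Mixed variation: an independent (constant) IncrementValue when the
    load is rising, and a relative (CurValue * factor) DecrementValue
    when it is falling. The increment constant adapts dynamically.
    """
    predictions = []
    for t in range(1, len(series)):
        prev, cur = series[t - 1], series[t]
        if cur >= prev:
            # tendency: rising, so predict a further independent increment
            predictions.append(cur + inc_constant)
        else:
            # tendency: falling, so predict a further relative decrement
            predictions.append(cur - cur * dec_factor)
        # dynamic adaptation: pull inc_constant toward the real increment
        real_inc = cur - prev
        if real_inc > 0:
            inc_constant += (real_inc - inc_constant) * adapt_degree
    return predictions
```

Each prediction for step T+1 uses only values observed up to step T, matching the one-step-ahead setting.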
12. Comparison Results
- The dynamic tendency predictor with mixed variation outperforms NWS: 2-55% less error rate, 37% less error rate on average
[Homeostatic and Tendency-based CPU Load Predictions, L. Yang, I. Foster, and J. M. Schopf, Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2003.]
13. Network Load Prediction
- The tendency-based predictor with mixed variation does not work effectively on network data
- NWS works better on average
- Possible explanation:
- Network data usually has much higher variation and smaller correlation between two adjacent values than CPU load data
- Our strategies give high weight to the most recent data and thus cannot track the tendency of network traces
- NWS predictors take account of more statistical information
- → For network load prediction, we will use the NWS predictors
14. Mean Resource Capability Prediction
- Calculate a mean resource capability time series
- a_i: the average resource capability over a time interval that is approximately equal to the total execution time
- Mean resource capability prediction
15. Resource Capability Variation Prediction
- Calculate the standard deviation time series
- s_i: the average difference between the resource capability and the mean resource capability over the time interval
- Resource capability variation prediction (see the sketch after this list)
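The exact formulas for a_i and s_i did not survive the conversion; under the natural reading (the per-interval mean and standard deviation of the raw measurements), a sketch might look like the following, where interval_len is an assumed parameter approximating the application execution time.

```python
import numpy as np

def capability_series(samples, interval_len):
    """Aggregate raw capability measurements into per-interval series:
    the means (a_i) and standard deviations (s_i)."""
    n = len(samples) // interval_len
    chunks = np.reshape(samples[:n * interval_len], (n, interval_len))
    return chunks.mean(axis=1), chunks.std(axis=1)
```

The one-step-ahead strategies above are then applied to the a_i and s_i series themselves.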
16. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis
- Summary
17. Conservative Scheduling
- A resource with a larger capacity may also show a higher variance in performance
- Assign less work to less reliable (higher-variance) resources
- Avoid the peaks in application performance caused by variance in the resource capability
- Verified in two contexts:
- Computation-intensive application: Cactus
- Parallel data transfer: GridFTP
18. Effective CPU Capability: Cactus
- Conservative load prediction:
- Effective CPU load = predicted Mean + predicted SD
- SD ↑ → effective CPU load ↑ → less workload allocated (see the allocation sketch after this list)
- Other scheduling options:
- One-Step Scheduling (OSS) [H. Dail, C. Liu]
- Effective CPU load = predicted CPU load at the next step
- Predicted Mean Interval Scheduling (PMIS)
- Effective CPU load = predicted Mean
- History Mean Scheduling (HMS) [A. Turgeon, J. Weissman]
- Effective CPU load = mean of the historical load collected during a 5-minute period preceding the application start time
- History Conservative Scheduling (HCS) [J. Schopf]
- Effective CPU load = mean + SD of the historical load collected during a 5-minute period preceding the application run
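A hedged sketch of how conservative load prediction could drive a time-balancing allocation. The slides state only that a higher effective load means less workload; the mapping from effective load to available capability (1 / (1 + load)) is an illustrative assumption.

```python
def conservative_allocation(pred_mean, pred_sd, total_work):
    """Split total_work across machines; effective load per machine is
    predicted mean + predicted SD (conservative load prediction)."""
    effective_load = [m + s for m, s in zip(pred_mean, pred_sd)]
    # assumed model: available capability shrinks as 1 / (1 + load)
    capability = [1.0 / (1.0 + load) for load in effective_load]
    total = sum(capability)
    # time balancing: each machine's share is proportional to capability
    return [total_work * c / total for c in capability]

# Three machines with equal mean load: the high-variance one gets least.
print(conservative_allocation([0.5, 0.5, 0.5], [0.1, 0.5, 1.0], 300))
```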
19. An Example of Experimental Results
- Comparison of five strategies with Cactus running on the Chiba City cluster
- 32 machines used
20. Result Evaluation
- Average mean and s.d. of execution time
- Statistical analysis: T-test
- Results show a statistically significant improvement of our strategy over the other strategies
- The probability of the improvement happening by chance is quite small
[Conservative Scheduling: Using Predicted Variance to Improve Scheduling Decisions in Dynamic Environments, L. Yang, J. M. Schopf, and I. Foster, Proceedings of SuperComputing 2003. Improving Parallel Data Transfer Times Using Predicted Variances in Shared Networks, L. Yang, J. M. Schopf, and I. Foster, CCGrid 2005.]
21. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- → More stable application behavior
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis using signal processing techniques
- Summary
22. Detect and Diagnose Anomalies
- Anomaly detection sends an alarm on significant deviation from the established normal profile [D. Denning]
- How to build the application's normal profile?
- User-defined specifications [R. Sekar]
- Not efficient for novel anomalies
- Simple statistical methods (window average [G. Allen])
- Not efficient in a shared and distributed environment
- Problems:
- Detect the anomaly using only the application performance information
- Do not provide information for anomaly diagnosis
- → Detect and diagnose application anomalies by analyzing resource load information
23. Questions and Solutions
- Questions:
- What kind of resource information should be used?
- How to use the resource information to detect and diagnose application anomalous behavior?
- My approaches:
- Data reduction
- Select only the necessary system metrics
- Anomaly detection and diagnosis using signal-processing-based techniques
- Use the selected system metrics to detect and diagnose application anomalous behavior
24. Data Reduction
- The first step of anomaly detection and diagnosis is performance monitoring
- Computer systems and applications continue to increase in complexity and size
- Interactions among components are poorly understood
- Two ways to understand the relationship between application and resource performance:
- Performance models [A. Hoisie, M. Ripeanu, D. Kerbyson]
- → Application specific, expensive, etc.
- Instrumentation and data analysis [A. Malony]
- → Produces a tremendous amount of data
- Need mechanisms to select only the necessary metrics
- Two-step data reduction strategy
25. Reduce Redundant System Metrics
- Some system metrics capture the same (or similar) information
- Highly correlated (measured by the correlation coefficient r)
- Only one is necessary; the others are redundant
- Two questions:
- A threshold value t (determined experimentally)
- A method to compare
- Traditional method: mathematical comparison [M. Knop]
- Correlation coefficient r > t?
- Problem: only a limited number of sample data points are available, so r may vary from run to run
26. Redundant Metrics Reduction Algorithm
- Use the Z-test
- A statistical method
- Determines whether the correlation is statistically significantly larger than the threshold value (with 95% confidence in the results)
- Given a set of samples, we proceed as follows (see the sketch after this list):
- Perform the Z-test for r between every pair of system metrics
- Group two metrics into one cluster if the Z-test shows their r value is statistically larger than the threshold value
- The result is a set of system metric clusters
- Only one metric from each cluster is kept as the representative of the cluster, while the others are deleted as redundant
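A minimal sketch of this algorithm, assuming the standard Fisher z-transform for the Z-test (the slides name a Z-test at 95% confidence but not its exact form) and a greedy clustering against each cluster's representative.

```python
import math

def corr_significantly_above(r, t, n, z_crit=1.645):
    """One-sided Fisher z-test: is the sample correlation r, computed
    from n samples, statistically larger than the threshold t?
    z_crit = 1.645 corresponds to 95% one-sided confidence."""
    z = (math.atanh(r) - math.atanh(t)) * math.sqrt(n - 3)
    return z > z_crit

def cluster_redundant_metrics(metrics, corr, t, n):
    """corr(a, b) returns the sample correlation of two metrics. Each
    metric joins the first cluster whose representative it correlates
    with significantly; otherwise it starts a new cluster."""
    clusters = []
    for m in metrics:
        for cluster in clusters:
            if corr_significantly_above(corr(cluster[0], m), t, n):
                cluster.append(m)  # redundant: duplicates the representative
                break
        else:
            clusters.append([m])  # m becomes a new cluster's representative
    return clusters
```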
27. Select Necessary System Metrics
- Some system metrics may not relate to application performance
- Backward Elimination (BE) stepwise regression (see the sketch after this list):
- System metrics remaining after the first step: X = (x1, x2, ..., xn)
- The application performance metric: y
- Regress y on the set of X:
- y = β0 + β1·x1 + β2·x2 + ... + βn·xn
- Delete one metric at a time that either is irrelevant or that, given the other metrics, is not useful to the model
- Criterion: F value
- All metrics remaining are useful and necessary for capturing the variation of y
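A sketch of backward elimination using partial F values computed with plain least squares. The cutoff f_threshold = 4.0 is a common rule of thumb, not a value given in the slides.

```python
import numpy as np

def _rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit,
    with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return float(np.sum((y - X1 @ beta) ** 2))

def backward_elimination(X, y, f_threshold=4.0):
    """Repeatedly drop the metric with the smallest partial F value
    until every remaining metric clears the threshold."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        rss_full = _rss(X[:, cols], y)
        dof = len(y) - len(cols) - 1
        # partial F for metric j: increase in RSS when j is removed,
        # scaled by the full model's residual variance
        f_vals = [((_rss(X[:, [c for c in cols if c != j]], y) - rss_full)
                   / (rss_full / dof), j) for j in cols]
        f_min, j_min = min(f_vals)
        if f_min >= f_threshold:
            break  # every remaining metric is useful to the model
        cols.remove(j_min)
    return cols  # indices of the necessary metrics
```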
28. Evaluation
- Two criteria:
- Necessary: reduction degree (RD)
- The total percentage of system metrics eliminated
- Sufficient: coefficient of determination (R²)
- A statistical measurement (computed as in the sketch after this list)
- Indicates the proportion of variation in the application's performance explained by the selected system metrics
- A larger R² value means the selected system metrics better capture the variation in application performance
- Applications:
- Cactus (UCSD cluster)
- GridFTP (PlanetLab)
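For reference, R² can be computed directly from the fitted model's predictions; a minimal sketch:

```python
def r_squared(y, y_pred):
    """Coefficient of determination: the proportion of variance in the
    observed performance y that the selected metrics' model explains."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot
```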
29. Two-step Experiment
- 1st step: data reduction
- Use training data to select system metrics
- 2nd step: verification
- Are these system metrics sufficient?
- Is the result stable?
- How does this method compare with other strategies?
- RAND randomly picks a subset of system metrics equal in number to those selected by our strategy
- MAIN uses a subset of system metrics that are commonly used to model the performance of applications in other works [H. Dail, M. Ripeanu, S. Vazhkudai]
30. Data Reduction Results on Cactus Data
- Six machines, 600 system metrics
- Threshold ↑ → RD ↓, since fewer system metrics group into clusters and thus fewer are removed as redundant
- Threshold ↑ → R² ↑, since more information is available to model the application performance
- The 22 system metrics selected can capture 98% of the variance in the performance when the threshold value is 0.95
31. Verification Results on Cactus Data
- Using 11 chunks of data collected over a one-day period
- SDR exhibited an average R² value of 0.907
- 55.0% and 98.5% higher than those of RAND and MAIN
- Can better capture the application's behavior
- Results are stable over time (24-hour period)
[Statistical Data Reduction for Efficient Application Performance Monitoring, L. Yang, J. M. Schopf, C. L. Dumitrescu, and I. Foster, CCGrid 2006.]
32. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis using signal processing techniques
- Summary
33. Application and Resource Behavior
- Some resources may have periodic usage patterns
- → A performance decrement of the application caused by a periodic resource usage pattern is normal
- Challenges:
- Different resources may show different usage patterns
- Resource information is noisy
- Solution: signal processing techniques
34. Example: Detection
False alarms are reduced!
35. Example: Diagnosis
The anomaly is related to network load!
36. Summary
- Avoid or reduce the anomaly
- A set of new one-step-ahead prediction strategies [IPDPS03]
- Better CPU load prediction
- A conservative scheduling strategy [SC03, SY03, CCGrid05]
- Takes account of the predicted mean and variation of resource capability
- More reliable application behavior!
- Detect and diagnose the anomaly
- A statistical data reduction strategy [CCGrid06]
- Identifies the system metrics that are necessary and sufficient to capture application behavior
- Simplifies data analysis
- An anomaly detection and diagnosis strategy (partially done)
- Builds resource usage patterns automatically
- Detects and diagnoses application anomalies
37. Work Remaining to be Done
- Evaluate the detection and diagnosis strategy
- Calculate:
- Success rate
- False alarm rate
- Using:
- Sweep3D data (partially done)
- GridFTP data collected from machines on PlanetLab
- Web server data
38. Timetable
- Thesis work includes:
- Introduction
- Performance Prediction
- Conservative Scheduling
- Data Reduction
- Anomaly detection and diagnosis using signal processing techniques
- Discussion
- May: evaluate results of the anomaly detection and diagnosis strategy
- June: report on the anomaly detection and diagnosis work
- Aug 1: 1st draft of chapters 1-3 of the thesis
- Sep 1: 1st draft of chapters 4-6 of the thesis
- Oct 1: revised thesis
39. (Backup slides)
40. How Much is the Variation?
- IncrementValue (or DecrementValue) can be:
- Independent: IncrementValue = IncConstant
- Relative: IncrementValue = CurValue × IncFactor
- Or:
- Static: do not change the value at any step
- Dynamic: adjust the constant at each step using an adaptation process:
- Measure V_{T+1}
- RealIncValue_T = V_{T+1} − V_T
- IncConstant_{T+1} = IncConstant_T + (RealIncValue_T − IncConstant_T) × AdaptDegree
41. Prediction Strategies
- Four homeostatic prediction strategies:
- Independent static homeostatic prediction strategy
- Independent dynamic homeostatic prediction strategy
- Relative static homeostatic prediction strategy
- Relative dynamic homeostatic prediction strategy
- Three tendency-based prediction strategies:
- Independent dynamic tendency prediction
- Relative dynamic tendency prediction
- Dynamic tendency prediction with mixed variation:
- Independent IncrementValue
- Relative DecrementValue
- → Experimental results show that the dynamic tendency prediction strategy with mixed variation works best!
42. Autocorrelation Function
- Autocorrelation function from lag 0 to lag 10 on CPU load traces
- Autocorrelation function values at lag 1 and lag 2 on 74 network performance time series
43. Cactus Experiment
- Application: Cactus
- To compare different methods fairly, a load-trace playback tool generates a background workload from a trace of the CPU load
- 64 real load traces with different means and s.d.
- Execution time of Cactus: 1 minute to 10 minutes
- Three clusters: UIUC, UCSD, Chiba City
44. Parallel Data Transfer Scheduling
- Our Tuned Conservative Scheduling (TCS):
- EffectiveBW = BWMean − TF × BWSD
- SD ↑ → effective bandwidth ↓ → less workload allocated
- Other stochastic strategies:
- Best One Scheduling (BOS)
- Retrieve data from the source with the highest predicted mean bandwidth
- Equal Allocation Scheduling (EAS)
- Retrieve the same amount of data from each source
- Mean Scheduling (MS) (TF = 0)
- EffectiveBW = predicted BWMean
- Non-tuned Stochastic Scheduling (NTSS) (TF = 1)
- EffectiveBW = predicted BWMean − predicted BWSD
45. Tuning Factor Algorithm
- EffectiveBW = predicted BWMean − TF × predicted BWSD
- SD ↑ → effective bandwidth ↓ → less workload allocated
- If SD/Mean < 1, then TF ranges from ½ to ∞
- Lower variability, so a higher effective BW is desired
- If SD/Mean > 1, then TF ranges from 0 to ½
- SD is higher than the mean; network performance is changing greatly, so we want a small effective BW
- In both cases, the values of TF and TF × SD are inversely proportional to N
- Other formulas would also work just fine (see the sketch after this list):
- N = SD/Mean; if N < 1, TF = 1/N − N/2; else (N > 1), TF = 1/(2N²)
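The TF formula above, transcribed into a short sketch (assuming SD > 0; note the two branches agree at N = 1, where TF = ½):

```python
def tuning_factor(bw_mean, bw_sd):
    """N = SD/Mean; TF = 1/N - N/2 when N < 1, else 1/(2*N**2)."""
    n = bw_sd / bw_mean
    return 1.0 / n - n / 2.0 if n < 1 else 1.0 / (2.0 * n * n)

def effective_bw(bw_mean, bw_sd):
    """EffectiveBW = predicted BWMean - TF * predicted BWSD."""
    return bw_mean - tuning_factor(bw_mean, bw_sd) * bw_sd
```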
46. TF and TF × SD as BW SD Changes
- Mean = 5, SD varies from 0 to 10
- Both are inversely proportional to the BW standard deviation (and N)
- The maximum of the TF × SD value is equal to the mean of the BW
- Other functions are also feasible
47. Evaluation
- Applications:
- Cactus (UCSD cluster)
- GridFTP (PlanetLab)
- Data collected once every 30 seconds for 24 hours
- Every data point: ~100 system metric values per machine plus 1 application performance value
- System metrics collected on each machine using three utilities:
- The sar command of the SYSSTAT tool set
- Network Weather Service (NWS) sensors
- The Unix command ping
48. Verification Results on GridFTP Data
- Using data collected from 24 different clients
- SDR achieves a mean R² value of 0.947
- 92.5% and 28.1% higher than those of the RAND and MAIN strategies
- Results are stable across different machines with the same configuration
49. Resource Performance Analysis
- Solution: signal processing techniques
- Denoising:
- Fourier-transform-based method
- Difficult to choose the width and shape of the filter
- White noise is distributed across all frequencies and spatial scales; a Fourier-based filter is inefficient for filtering this kind of noise
- Wavelet analysis offers a scale-independent and robust method to filter out noise
- Able to remove noise without losing useful information
- Soft-threshold denoising technique (see the sketch after this list)
- Construct the normal profile:
- Construct the periodic usage pattern of resources
- Fourier transform: capable of dealing with periodic signals
- A performance decrement is tagged as an anomaly only when it is not caused by the resource usage pattern
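A sketch of the soft-threshold wavelet denoising step using the PyWavelets library. The slides name the technique but not the wavelet family or threshold rule; the Daubechies-4 wavelet and the Donoho-style universal threshold below are common default choices, not confirmed by the source.

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=4):
    """Soft-threshold denoising: decompose, shrink detail coefficients,
    reconstruct."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # noise scale estimated from the finest detail coefficients (MAD)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    # universal threshold
    thresh = sigma * np.sqrt(2.0 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft")
                              for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]
```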