Title: Anomaly Management in Grid Environments
1. Anomaly Management in Grid Environments
- Candidate: Lingyun Yang
- Advisor: Ian Foster
- University of Chicago
2. Grid Environments
[Plots: CPU load trace (Mean = 6.9, SD = 3.1) and network load trace (Mean = 197.3, SD = 324.8)]
- Resource capabilities change dynamically due to the sharing of resources
- The performance of Grid applications is affected by the capabilities of these resources
- Applications want not only fast execution time but also stable behavior
- Significant deviation from the normal profile is defined as an anomaly [D. Denning]
3. Challenges
- The problem is more complex because:
- Resources are shared
- Resources are heterogeneous
- Resources are distributed
- Grid environments are complex
4. Solutions
- Avoid or reduce the anomaly before it happens
- Prediction
- Proper scheduling strategy
- Detect the application anomaly when it happens
- Distinguish the anomaly from the normal profiles of applications and resources
- Diagnose the anomaly after it happens
- Study the relationship between resource capabilities and application performance
5. Contributions
- Avoid or reduce the application anomaly before it happens:
- A set of new one-step-ahead prediction strategies
- A conservative scheduling strategy
- → More reliable application behavior
- Detect the application anomaly when it happens / diagnose the application anomaly after it happens:
- A statistical data reduction strategy
- → Simplify the data analysis
- An anomaly detection and diagnosis strategy using signal processing techniques
- → More accurate anomaly detection and diagnosis
6. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis
- Summary
7. Avoid or Reduce Anomalies
- Deliver fast and predictable behavior to applications
- Proper scheduling strategy
- Time balancing [F. Berman, H. Dail]: heterogeneous resources
- Assign more workload to more powerful resources
- Each resource finishes (roughly) at the same time
- Why is it not enough?
- Resource performance may change during execution
- Resources with a larger capacity may also show a higher variance
- Stochastic scheduling [J. Schopf]
- Use stochastic data (average and variation)
- My approaches:
- Prediction
- Conservative Scheduling
8. CPU Load Prediction
- Prediction: estimate future values using historical data
- Key: correctly model the relationship of the historical data with future values
- Time series modeling:
- Financial data prediction [A. Lendasse], earth and ocean sciences [L. Lawson], biomedical signal processing [H. Kato], networks [H. Braun, N. Groschwitz], etc.
- CPU load prediction:
- NWS [R. Wolski]: mean-based, median-based, and AR-model strategies; dynamically selects the next strategy
- Linear models study [P. Dinda]: the AR model is the best one
- These estimate the value directly
- Question: what if we estimate the direction and variation separately?
9. Direction: Two Families of Strategies
- Homeostatic: assume the values are self-correcting, i.e., the value will return to the mean of the previous values
- Tendency: assume that if the current value decreases, the next value will also decrease; if the current value increases, the next value will also increase
10. How Much is the Variation?
- IncrementValue (or DecrementValue) can be:
- Independent: IncrementValue = IncConstant
- Relative: IncrementValue = CurValue × IncFactor
- Or:
- Static: do not change the value at any step
- Dynamic: adapt the constant at each step using the real-time information
11. Prediction Strategies
- Create a set of prediction strategies from different combinations of the directions and variations
- Evaluate this set of prediction strategies on 12 CPU load time series
- Experimental results show that the dynamic tendency prediction strategy with mixed variation works best (see the sketch after this list)
- Dynamic tendency prediction with mixed variation:
- Independent IncrementValue
- Relative DecrementValue
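A minimal Python sketch of the dynamic tendency predictor with mixed variation, to make the winning strategy concrete. The parameter names and starting values (inc_constant, dec_factor, adapt_degree) are illustrative assumptions; the slides fix only the structure of the strategy.

```python
def dynamic_tendency_predict(series, inc_constant=0.1, dec_factor=0.05,
                             adapt_degree=0.5):
    """One-step-ahead predictions for a CPU load time series.

    Mixed variation: an independent (constant) IncrementValue when the
    load is rising, and a relative (CurValue * factor) DecrementValue
    when it is falling. The increment constant adapts dynamically.
    """
    predictions = []
    for t in range(1, len(series)):
        prev, cur = series[t - 1], series[t]
        if cur >= prev:
            # tendency: rising, so predict a further independent increment
            predictions.append(cur + inc_constant)
        else:
            # tendency: falling, so predict a further relative decrement
            predictions.append(cur - cur * dec_factor)
        # dynamic adaptation: pull inc_constant toward the real increment
        real_inc = cur - prev
        if real_inc > 0:
            inc_constant += (real_inc - inc_constant) * adapt_degree
    return predictions
```

Each prediction for step T+1 uses only values observed up to step T, matching the one-step-ahead setting.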
12. Comparison Results
- The dynamic tendency predictor with mixed variation outperforms NWS: 2-55% less error rate, 37% less error rate on average
[Homeostatic and Tendency-based CPU Load Predictions, L. Yang, I. Foster, and J. M. Schopf, Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2003.]
13. Network Load Prediction
- The tendency-based predictor with mixed variation does not work effectively on network data
- NWS works better on average
- Possible explanation:
- Network data usually has much higher variation and smaller correlation between two adjacent values than CPU load data
- Our strategies give high weight to the most recent data and thus cannot track the tendency of network traces
- NWS predictors take account of more statistical information
- → For network load prediction, we will use the NWS predictors
14. Mean Resource Capability Prediction
- Calculate a mean resource capability time series
- a_i: the average resource capability over a time interval that is approximately equal to the total execution time
- Mean resource capability prediction
15. Resource Capability Variation Prediction
- Calculate the standard deviation time series
- s_i: the average difference between the resource capability and the mean resource capability over the time interval
- Resource capability variation prediction (see the sketch after this list)
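The exact formulas for a_i and s_i did not survive the conversion; under the natural reading (the per-interval mean and standard deviation of the raw measurements), a sketch might look like the following, where interval_len is an assumed parameter approximating the application execution time.

```python
import numpy as np

def capability_series(samples, interval_len):
    """Aggregate raw capability measurements into per-interval series:
    the means (a_i) and standard deviations (s_i)."""
    n = len(samples) // interval_len
    chunks = np.reshape(samples[:n * interval_len], (n, interval_len))
    return chunks.mean(axis=1), chunks.std(axis=1)
```

The one-step-ahead strategies above are then applied to the a_i and s_i series themselves.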
16. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis
- Summary
17. Conservative Scheduling
- A resource with a larger capacity may also show a higher variance in performance
- Assign less work to less reliable (higher-variance) resources
- Avoid the peaks in application performance caused by variance in the resource capability
- Verified in two contexts:
- Computation-intensive application: Cactus
- Parallel data transfer: GridFTP
18. Effective CPU Capability: Cactus
- Conservative load prediction:
- Effective CPU load = predicted Mean + predicted SD
- SD ↑ → effective CPU load ↑ → less workload allocated (see the allocation sketch after this list)
- Other scheduling options:
- One-Step Scheduling (OSS) [H. Dail, C. Liu]
- Effective CPU load = predicted CPU load at the next step
- Predicted Mean Interval Scheduling (PMIS)
- Effective CPU load = predicted Mean
- History Mean Scheduling (HMS) [A. Turgeon, J. Weissman]
- Effective CPU load = mean of the historical load collected during a 5-minute period preceding the application start time
- History Conservative Scheduling (HCS) [J. Schopf]
- Effective CPU load = mean + SD of the historical load collected during a 5-minute period preceding the application run
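A hedged sketch of how conservative load prediction could drive a time-balancing allocation. The slides state only that a higher effective load means less workload; the mapping from effective load to available capability (1 / (1 + load)) is an illustrative assumption.

```python
def conservative_allocation(pred_mean, pred_sd, total_work):
    """Split total_work across machines; effective load per machine is
    predicted mean + predicted SD (conservative load prediction)."""
    effective_load = [m + s for m, s in zip(pred_mean, pred_sd)]
    # assumed model: available capability shrinks as 1 / (1 + load)
    capability = [1.0 / (1.0 + load) for load in effective_load]
    total = sum(capability)
    # time balancing: each machine's share is proportional to capability
    return [total_work * c / total for c in capability]

# Three machines with equal mean load: the high-variance one gets least.
print(conservative_allocation([0.5, 0.5, 0.5], [0.1, 0.5, 1.0], 300))
```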
19. An Example of Experimental Results
- Comparison of five strategies with Cactus running on the Chiba City cluster
- 32 machines used
20. Result Evaluation
- Average mean and s.d. of execution time
- Statistical analysis: T-test
- Results show a statistically significant improvement of our strategy over the other strategies
- The probability of the improvement happening by chance is quite small
[Conservative Scheduling: Using Predicted Variance to Improve Scheduling Decisions in Dynamic Environments, L. Yang, J. M. Schopf, and I. Foster, Proceedings of SuperComputing 2003. Improving Parallel Data Transfer Times Using Predicted Variances in Shared Networks, L. Yang, J. M. Schopf, and I. Foster, CCGrid 2005.]
21. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- → More stable application behavior
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis using signal processing techniques
- Summary
22. Detect and Diagnose Anomalies
- Anomaly detection sends an alarm on significant deviation from the established normal profile [D. Denning]
- How to build the application's normal profile?
- User-defined specifications [R. Sekar]
- Not efficient for novel anomalies
- Simple statistical methods (window average [G. Allen])
- Not efficient in a shared and distributed environment
- Problems:
- Detect the anomaly using only the application performance information
- Do not provide information for anomaly diagnosis
- → Detect and diagnose application anomalies by analyzing resource load information
23. Questions and Solutions
- Questions:
- What kind of resource information should be used?
- How to use the resource information to detect and diagnose application anomalous behavior?
- My approaches:
- Data reduction
- Select only the necessary system metrics
- Anomaly detection and diagnosis using signal-processing-based techniques
- Use the selected system metrics to detect and diagnose application anomalous behavior
24. Data Reduction
- The first step of anomaly detection and diagnosis is performance monitoring
- Computer systems and applications continue to increase in complexity and size
- Interactions among components are poorly understood
- Two ways to understand the relationship between application and resource performance:
- Performance models [A. Hoisie, M. Ripeanu, D. Kerbyson]
- → Application specific, expensive, etc.
- Instrumentation and data analysis [A. Malony]
- → Produces a tremendous amount of data
- Need mechanisms to select only the necessary metrics
- Two-step data reduction strategy
25. Reduce Redundant System Metrics
- Some system metrics capture the same (or similar) information
- Highly correlated (measured by the correlation coefficient r)
- Only one is necessary; the others are redundant
- Two questions:
- A threshold value t (determined experimentally)
- A method to compare
- Traditional method: mathematical comparison [M. Knop]
- Correlation coefficient r > t?
- Problem: only a limited number of sample data points are available, so r may vary from run to run
26. Redundant Metrics Reduction Algorithm
- Use the Z-test
- A statistical method
- Determines whether the correlation is statistically significantly larger than the threshold value (with 95% confidence in the results)
- Given a set of samples, we proceed as follows (see the sketch after this list):
- Perform the Z-test for r between every pair of system metrics
- Group two metrics into one cluster if the Z-test shows their r value is statistically larger than the threshold value
- The result is a set of system metric clusters
- Only one metric from each cluster is kept as the representative of the cluster, while the others are deleted as redundant
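A minimal sketch of this algorithm, assuming the standard Fisher z-transform for the Z-test (the slides name a Z-test at 95% confidence but not its exact form) and a greedy clustering against each cluster's representative.

```python
import math

def corr_significantly_above(r, t, n, z_crit=1.645):
    """One-sided Fisher z-test: is the sample correlation r, computed
    from n samples, statistically larger than the threshold t?
    z_crit = 1.645 corresponds to 95% one-sided confidence."""
    z = (math.atanh(r) - math.atanh(t)) * math.sqrt(n - 3)
    return z > z_crit

def cluster_redundant_metrics(metrics, corr, t, n):
    """corr(a, b) returns the sample correlation of two metrics. Each
    metric joins the first cluster whose representative it correlates
    with significantly; otherwise it starts a new cluster."""
    clusters = []
    for m in metrics:
        for cluster in clusters:
            if corr_significantly_above(corr(cluster[0], m), t, n):
                cluster.append(m)  # redundant: duplicates the representative
                break
        else:
            clusters.append([m])  # m becomes a new cluster's representative
    return clusters
```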
27. Select Necessary System Metrics
- Some system metrics may not relate to application performance
- Backward Elimination (BE) stepwise regression (see the sketch after this list):
- System metrics remaining after the first step: X = (x1, x2, ..., xn)
- The application performance metric: y
- Regress y on the set of X:
- y = β0 + β1·x1 + β2·x2 + ... + βn·xn
- Delete one metric at a time that either is irrelevant or that, given the other metrics, is not useful to the model
- Criterion: F value
- All metrics remaining are useful and necessary for capturing the variation of y
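A sketch of backward elimination using partial F values computed with plain least squares. The cutoff f_threshold = 4.0 is a common rule of thumb, not a value given in the slides.

```python
import numpy as np

def _rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit,
    with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return float(np.sum((y - X1 @ beta) ** 2))

def backward_elimination(X, y, f_threshold=4.0):
    """Repeatedly drop the metric with the smallest partial F value
    until every remaining metric clears the threshold."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        rss_full = _rss(X[:, cols], y)
        dof = len(y) - len(cols) - 1
        # partial F for metric j: increase in RSS when j is removed,
        # scaled by the full model's residual variance
        f_vals = [((_rss(X[:, [c for c in cols if c != j]], y) - rss_full)
                   / (rss_full / dof), j) for j in cols]
        f_min, j_min = min(f_vals)
        if f_min >= f_threshold:
            break  # every remaining metric is useful to the model
        cols.remove(j_min)
    return cols  # indices of the necessary metrics
```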
28. Evaluation
- Two criteria:
- Necessary: reduction degree (RD)
- The total percentage of system metrics eliminated
- Sufficient: coefficient of determination (R²)
- A statistical measurement (computed as in the sketch after this list)
- Indicates the proportion of variation in the application's performance explained by the selected system metrics
- A larger R² value means the selected system metrics better capture the variation in application performance
- Applications:
- Cactus (UCSD cluster)
- GridFTP (PlanetLab)
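For reference, R² can be computed directly from the fitted model's predictions; a minimal sketch:

```python
def r_squared(y, y_pred):
    """Coefficient of determination: the proportion of variance in the
    observed performance y that the selected metrics' model explains."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot
```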
29. Two-step Experiment
- 1st step: data reduction
- Use training data to select system metrics
- 2nd step: verification
- Are these system metrics sufficient?
- Is the result stable?
- How does this method compare with other strategies?
- RAND randomly picks a subset of system metrics equal in number to those selected by our strategy
- MAIN uses a subset of system metrics that are commonly used to model the performance of applications in other works [H. Dail, M. Ripeanu, S. Vazhkudai]
30. Data Reduction Results on Cactus Data
- Six machines, 600 system metrics
- Threshold ↑ → RD ↓, since fewer system metrics group into clusters and thus fewer are removed as redundant
- Threshold ↑ → R² ↑, since more information is available to model the application performance
- The 22 system metrics selected can capture 98% of the variance in the performance when the threshold value is 0.95
31. Verification Results on Cactus Data
- Using 11 chunks of data collected over a one-day period
- SDR exhibited an average R² value of 0.907
- 55.0% and 98.5% higher than those of RAND and MAIN
- Can better capture the application's behavior
- Results are stable over time (24-hour period)
[Statistical Data Reduction for Efficient Application Performance Monitoring, L. Yang, J. M. Schopf, C. L. Dumitrescu, and I. Foster, CCGrid 2006.]
32. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis using signal processing techniques
- Summary
33. Application and Resource Behavior
- Some resources may have periodic usage patterns
- → A performance decrement of the application caused by a periodic resource usage pattern is normal
- Challenges:
- Different resources may show different usage patterns
- Resource information is noisy
- Solution: signal processing techniques
34. Example: Detection
False alarms are reduced!
35. Example: Diagnosis
The anomaly is related to network load!
36. Summary
- Avoid or reduce the anomaly
- A set of new one-step-ahead prediction strategies [IPDPS03]
- Better CPU load prediction
- A conservative scheduling strategy [SC03, SY03, CCGrid05]
- Takes account of the predicted mean and variation of resource capability
- More reliable application behavior!
- Detect and diagnose the anomaly
- A statistical data reduction strategy [CCGrid06]
- Identifies the system metrics that are necessary and sufficient to capture application behavior
- Simplifies data analysis
- An anomaly detection and diagnosis strategy (partially done)
- Builds resource usage patterns automatically
- Detects and diagnoses application anomalies
37. Work Remaining to be Done
- Evaluate the detection and diagnosis strategy
- Calculate:
- Success rate
- False alarm rate
- Using:
- Sweep3D data (partially done)
- GridFTP data collected from machines on PlanetLab
- Web server data
38. Timetable
- Thesis work includes:
- Introduction
- Performance Prediction
- Conservative Scheduling
- Data Reduction
- Anomaly detection and diagnosis using signal processing techniques
- Discussion
- May: evaluate results of the anomaly detection and diagnosis strategy
- June: report on the anomaly detection and diagnosis work
- Aug 1: 1st draft of chapters 1-3 of the thesis
- Sep 1: 1st draft of chapters 4-6 of the thesis
- Oct 1: revised thesis
39. (Backup slides)
40. How Much is the Variation?
- IncrementValue (or DecrementValue) can be:
- Independent: IncrementValue = IncConstant
- Relative: IncrementValue = CurValue × IncFactor
- Or:
- Static: do not change the value at any step
- Dynamic: adjust the constant at each step using an adaptation process:
- Measure V_{T+1}
- RealIncValue_T = V_{T+1} − V_T
- IncConstant_{T+1} = IncConstant_T + (RealIncValue_T − IncConstant_T) × AdaptDegree
41. Prediction Strategies
- Four homeostatic prediction strategies:
- Independent static homeostatic prediction strategy
- Independent dynamic homeostatic prediction strategy
- Relative static homeostatic prediction strategy
- Relative dynamic homeostatic prediction strategy
- Three tendency-based prediction strategies:
- Independent dynamic tendency prediction
- Relative dynamic tendency prediction
- Dynamic tendency prediction with mixed variation:
- Independent IncrementValue
- Relative DecrementValue
- → Experimental results show that the dynamic tendency prediction strategy with mixed variation works best!
42. Autocorrelation Function
- Autocorrelation function from lag 0 to lag 10 on CPU load traces
- Autocorrelation function values at lag 1 and lag 2 on 74 network performance time series
43. Cactus Experiment
- Application: Cactus
- To compare different methods fairly, a load-trace playback tool generates a background workload from a trace of the CPU load
- 64 real load traces with different means and s.d.
- Execution time of Cactus: 1 minute to 10 minutes
- Three clusters: UIUC, UCSD, Chiba City
44. Parallel Data Transfer Scheduling
- Our Tuned Conservative Scheduling (TCS):
- EffectiveBW = BWMean − TF × BWSD
- SD ↑ → effective bandwidth ↓ → less workload allocated
- Other stochastic strategies:
- Best One Scheduling (BOS)
- Retrieve data from the source with the highest predicted mean bandwidth
- Equal Allocation Scheduling (EAS)
- Retrieve the same amount of data from each source
- Mean Scheduling (MS) (TF = 0)
- EffectiveBW = predicted BWMean
- Non-tuned Stochastic Scheduling (NTSS) (TF = 1)
- EffectiveBW = predicted BWMean − predicted BWSD
45. Tuning Factor Algorithm
- EffectiveBW = predicted BWMean − TF × predicted BWSD
- SD ↑ → effective bandwidth ↓ → less workload allocated
- If SD/Mean < 1, then TF ranges from ½ to ∞
- Lower variability, so a higher effective BW is desired
- If SD/Mean > 1, then TF ranges from 0 to ½
- SD is higher than the mean; network performance is changing greatly, so we want a small effective BW
- In both cases, the values of TF and TF × SD are inversely proportional to N
- Other formulas would also work just fine (see the sketch after this list):
- N = SD/Mean; if N < 1, TF = 1/N − N/2; else (N > 1), TF = 1/(2N²)
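The TF formula above, transcribed into a short sketch (assuming SD > 0; note the two branches agree at N = 1, where TF = ½):

```python
def tuning_factor(bw_mean, bw_sd):
    """N = SD/Mean; TF = 1/N - N/2 when N < 1, else 1/(2*N**2)."""
    n = bw_sd / bw_mean
    return 1.0 / n - n / 2.0 if n < 1 else 1.0 / (2.0 * n * n)

def effective_bw(bw_mean, bw_sd):
    """EffectiveBW = predicted BWMean - TF * predicted BWSD."""
    return bw_mean - tuning_factor(bw_mean, bw_sd) * bw_sd
```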
46. TF and TF × SD as BW SD Changes
- Mean = 5, SD varies from 0 to 10
- Both are inversely proportional to the BW standard deviation (and N)
- The maximum of the TF × SD value is equal to the mean of the BW
- Other functions are also feasible
47. Evaluation
- Applications:
- Cactus (UCSD cluster)
- GridFTP (PlanetLab)
- Data collected once every 30 seconds for 24 hours
- Every data point: ~100 system metric values per machine plus 1 application performance value
- System metrics collected on each machine using three utilities:
- The sar command of the SYSSTAT tool set
- Network Weather Service (NWS) sensors
- The Unix command ping
48. Verification Results on GridFTP Data
- Using data collected from 24 different clients
- SDR achieves a mean R² value of 0.947
- 92.5% and 28.1% higher than those of the RAND and MAIN strategies
- Results are stable across different machines with the same configuration
49. Resource Performance Analysis
- Solution: signal processing techniques
- Denoising:
- Fourier-transform-based method
- Difficult to choose the width and shape of the filter
- White noise is distributed across all frequencies and spatial scales; a Fourier-based filter is inefficient for filtering this kind of noise
- Wavelet analysis offers a scale-independent and robust method to filter out noise
- Able to remove noise without losing useful information
- Soft-threshold denoising technique (see the sketch after this list)
- Construct the normal profile:
- Construct the periodic usage pattern of resources
- Fourier transform: capable of dealing with periodic signals
- A performance decrement is tagged as an anomaly only when it is not caused by the resource usage pattern
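A sketch of the soft-threshold wavelet denoising step using the PyWavelets library. The slides name the technique but not the wavelet family or threshold rule; the Daubechies-4 wavelet and the Donoho-style universal threshold below are common default choices, not confirmed by the source.

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=4):
    """Soft-threshold denoising: decompose, shrink detail coefficients,
    reconstruct."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # noise scale estimated from the finest detail coefficients (MAD)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    # universal threshold
    thresh = sigma * np.sqrt(2.0 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft")
                              for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]
```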