Title: Anomaly Management in Grid Environments
1. Anomaly Management in Grid Environments
- Candidate: Lingyun Yang
- Advisor: Ian Foster
- University of Chicago
2. Grid Environments
[Figures: network load trace (Mean = 197.3, SD = 324.8) and CPU load trace (Mean = 6.9, SD = 3.1)]
- Resource capabilities change dynamically due to the sharing of resources
- The performance of Grid applications is affected by the capabilities of these resources
- Applications want not only fast execution time but also stable behavior
- A significant deviation from the normal profile is defined as an anomaly [D. Denning]
3. Challenges
- The problem is more complex because:
- Resources are shared
- Resources are heterogeneous
- Resources are distributed
- Grid environments are complex
4. Solutions
- Avoid or reduce the anomaly before it happens
- Prediction
- Proper scheduling strategy
- Detect the application anomaly when it happens
- Distinguish the anomaly from the normal profiles of applications and resources
- Diagnose the anomaly after it happens
- Study the relationship between resource capabilities and application performance
5. Contributions
- A set of new one-step-ahead prediction strategies
- A conservative scheduling strategy
- → More reliable application behavior
- A statistical data reduction strategy
- → Simplify the data analysis
- An anomaly detection and diagnosis strategy using signal processing techniques
- → More accurate anomaly detection and diagnosis
- Together: avoid or reduce the application anomaly before it happens, detect it when it happens, and diagnose it after it happens
6. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis
- Summary
7. Avoid or Reduce Anomalies
- Deliver fast and predictable behavior to applications
- Proper scheduling strategy
- Time balancing [F. Berman, H. Dail] -- heterogeneous resources
- Assign more workload to more powerful resources
- Each resource finishes (roughly) at the same time
- Why is this not enough?
- Resource performance may change during execution
- Resources with a larger capacity may also show a higher variance
- Stochastic scheduling [J. Schopf]
- Use stochastic data (average and variation)
- My approaches
- Prediction
- Conservative Scheduling
8. CPU Load Prediction
- Prediction: estimate future values using historical data
- Key: correctly model the relationship of the historical data with future values
- Time series modeling
- Financial data prediction [A. Lendasse], earth and ocean sciences [L. Lawson], biomedical signal processing [H. Kato], etc.
- CPU load prediction
- NWS [R. Wolski]: mean-based, median-based, and AR models; dynamically selects the next strategy
- Linear models study [P. Dinda]: the AR model is the best one
- Both estimate the value directly
- Question: what if we estimate the direction and variation separately?
9. Direction: Two Families of Strategies
- Homeostatic: assume the values are self-correcting; the value will return to the mean of the previous values
- Tendency: assume that if the current value decreases, the next value will also decrease; if the current value increases, the next value will also increase
10. How Much is the Variation?
- IncrementValue (or DecrementValue) can be:
- Independent: IncrementValue = IncConstant
- Relative: IncrementValue = CurValue * IncFactor
- Or:
- Static: do not change the value at any step
- Dynamic: adapt the constant at each step using the real-time information
11. Prediction Strategies
- Create a set of prediction strategies from different combinations of the directions and variations
- Evaluate this set of prediction strategies on 12 CPU load time series
- Experimental results show that the dynamic tendency prediction strategy with mixed variation works best
- Dynamic tendency prediction with mixed variation (see the sketch below):
- Independent IncrementValue
- Relative DecrementValue
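As an illustration, here is a minimal Python sketch of the dynamic tendency predictor with mixed variation: an independent (additive) increment, a relative (multiplicative) decrement, and both constants adapted online with the rule from slide 46. The initial constants and the adapt_degree value are illustrative assumptions, not the tuned settings from the experiments.

```python
def predict_series(values, adapt_degree=0.5, inc_constant=1.0, dec_factor=0.1):
    """One-step-ahead forecasts: the forecast made after seeing values[t], t >= 1."""
    preds = []
    for t in range(1, len(values)):
        prev, cur = values[t - 1], values[t]
        if cur >= prev:
            pred = cur + inc_constant        # tendency up: independent increment
        else:
            pred = cur - cur * dec_factor    # tendency down: relative decrement
        preds.append(pred)
        # Dynamic variation (slide 46): pull each constant toward the observed step,
        # IncConstant_{T+1} = IncConstant_T + (RealIncValue_T - IncConstant_T) * AdaptDegree.
        real_step = cur - prev
        if real_step >= 0:
            inc_constant += (real_step - inc_constant) * adapt_degree
        elif prev > 0:
            dec_factor += (-real_step / prev - dec_factor) * adapt_degree
    return preds
```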
12. Comparison Results
- The dynamic tendency predictor with mixed variation outperforms NWS: 2% to 55% lower error rate, 37% lower on average
Homeostatic and Tendency-based CPU Load Predictions, L. Yang, I. Foster, and J. M. Schopf, Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2003.
13. Network Load Prediction
- The tendency-based predictor with mixed variation does not work effectively on network data
- NWS works better on average
- Possible explanation:
- Network data usually has much higher variation and smaller correlation between two adjacent values than CPU load data
- Our strategies give high weight to the most recent data and thus cannot track the tendency of network data
- NWS predictors take account of more statistical information
- → For network load prediction, we will use NWS predictors
14. Mean Resource Capability Prediction
- Calculate a mean resource capability time series A = a_1, a_2, ...
- a_i: the average resource capability over a time interval approximately equal to the total execution time
- Mean resource capability prediction: forecast the next value of this series
15. Resource Capability Variation Prediction
- Calculate the standard deviation time series S = s_1, s_2, ...
- s_i = sqrt( (1/n) * sum_j (v_j - a_i)^2 ): the average difference between the resource capability and the mean resource capability over the time interval
- Resource capability variation prediction: forecast the next value of this series (see the sketch below)
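A minimal sketch of building the two series from a raw capability trace. The fixed interval length and the use of the last observed value as the one-step-ahead forecast are illustrative simplifications; the thesis applies its tendency/NWS predictors to these series.

```python
import numpy as np

def capability_series(trace, interval_len):
    """Split a raw capability trace into intervals of roughly the execution time."""
    chunks = [trace[i:i + interval_len]
              for i in range(0, len(trace) - interval_len + 1, interval_len)]
    a = np.array([c.mean() for c in chunks])  # mean capability a_i per interval
    s = np.array([c.std() for c in chunks])   # capability variation s_i per interval
    return a, s

trace = np.random.rand(600) * 10              # stand-in for a measured load trace
a, s = capability_series(trace, interval_len=60)
next_mean, next_sd = a[-1], s[-1]             # simplest possible forecast of both series
```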
16. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis
- Summary
17. Conservative Scheduling
- A resource with a larger capacity may also show a higher variance in performance
- Assign less work to less reliable (higher-variance) resources
- Avoid spikes in application performance caused by variance in the resource capability
- Verified in two contexts:
- Computation-intensive application: Cactus
- Parallel data transfer: GridFTP
18. Effective CPU Capability -- Cactus
- Conservative load prediction (see the sketch below)
- Effective CPU load = predicted mean + predicted SD
- SD ↑ → effective CPU load ↑ → less workload allocated
- Other scheduling options:
- One-Step Scheduling (OSS) [H. Dail, C. Liu]
- Effective CPU load = predicted CPU load at the next step
- Predicted Mean Interval Scheduling (PMIS)
- Effective CPU load = predicted mean
- History Mean Scheduling (HMS) [A. Turgeon, J. Weissman]
- Effective CPU load = mean of the historical load collected during a 5-minute period preceding the application start time
- History Conservative Scheduling (HCS) [J. Schopf]
- Effective CPU load = mean + SD of the historical load collected during a 5-minute period preceding the application run
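A minimal sketch of conservative time balancing with the effective CPU load above. Converting a predicted load average into a capability share via 1 / (1 + load) is an illustrative assumption; the thesis's exact capability model may differ.

```python
def conservative_allocation(total_work, predicted_mean, predicted_sd):
    """Split total_work across machines given per-machine load predictions."""
    effective_load = [m + s for m, s in zip(predicted_mean, predicted_sd)]
    capability = [1.0 / (1.0 + load) for load in effective_load]
    total_cap = sum(capability)
    # Each machine gets work in proportion to its (conservative) capability,
    # so all machines finish at roughly the same time.
    return [total_work * c / total_cap for c in capability]

# Example: a high-variance machine (SD = 2.0) receives less work than an
# equally loaded but steadier one.
print(conservative_allocation(1000, [0.5, 0.5, 1.0], [0.1, 2.0, 0.2]))
```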
19. An Example of Experimental Results
- Comparison of five strategies on Cactus running on the Chiba City cluster
- 32 machines used
20. Result Evaluation
- Compare the average mean and s.d. of execution times
- Statistical analysis: T-test (see the sketch below)
- Results show a statistically significant improvement of our strategy over the other strategies
- The probability of the improvement happening by chance is quite small
Conservative Scheduling: Using Predicted Variance to Improve Scheduling Decisions in Dynamic Environments, L. Yang, J. M. Schopf, and I. Foster, Proceedings of SuperComputing 2003.
Improving Parallel Data Transfer Times Using Predicted Variances in Shared Networks, L. Yang, J. M. Schopf, and I. Foster, CCGrid 2005.
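For illustration, a minimal sketch of the T-test step using SciPy's two-sample test on hypothetical execution times; the thesis's exact test variant may differ.

```python
from scipy import stats

ours = [102.1, 98.4, 101.7, 99.2, 100.5]       # hypothetical execution times (s)
baseline = [118.9, 125.3, 110.2, 131.8, 122.4]

t_stat, p_value = stats.ttest_ind(ours, baseline)
# A small p-value means the improvement is unlikely to be due to chance.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```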
21. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- → more stable application behavior
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis using signal processing techniques
- Summary
22. Detect and Diagnose Anomalies
- Anomaly detection: send an alarm on significant deviation from the established normal profile [D. Denning]
- Anomaly diagnosis: relate anomalous application behaviors to anomalous resource behaviors
- Requires detecting resource anomalies
- Key: correctly define the normal profile of application and resource behaviors
[Figure: performance trace with a deviation marked "Anomaly?"]
23. Build Normal Profile
- Signature-based methods [K. Ilgun, T. Lunt]
- Known patterns of anomalous activities
- Specification-based methods [R. Sekar]
- Logic-based rules specifying the legitimate system and/or application behaviors
- Statistical methods [G. Allen]
- Use statistics to construct a reference model of normal application and system behaviors
24. Questions and Solutions
- Questions
- What kind of resource information should be used?
- How should the resource information be used to detect and diagnose anomalous application behaviors?
- My approaches
- Statistical data reduction
- Select only the necessary system metrics
- Anomaly detection and diagnosis using signal-processing-based techniques
- Extend the window-average-based method
- A statistical method
- Simple and efficient
25. Data Reduction
- The first step of anomaly detection and diagnosis is performance monitoring
- Computer systems and applications continue to increase in complexity and size
- Interactions among components are poorly understood
- Two ways to understand the relationship between application and resource performance:
- Performance models [A. Hoisie, M. Ripeanu, D. Kerbyson]
- → application specific, expensive, etc.
- Instrumentation and data analysis [A. Malony]
- → produces a tremendous amount of data
- Need mechanisms to select only the necessary metrics
- A two-step data reduction strategy
26. Reduce Redundant System Metrics
- Some system metrics capture the same (or similar) information
- Highly correlated (measured by the correlation coefficient r)
- Only one is necessary; the others are redundant
- Two questions:
- A threshold value t (determined experimentally)
- A method of comparison
- Traditional method: mathematical comparison [M. Knop]
- Is the correlation coefficient r > t?
- Problems: only a limited number of sample data are available, and r may vary from run to run
27. Redundant Metrics Reduction Algorithm
- Use the Z-test
- A statistical method
- Determines whether the correlation is statistically significantly larger than the threshold value (with 95% confidence)
- Given a set of samples, we proceed as follows (see the sketch below):
- Perform the Z-test on r between every pair of system metrics
- Group two metrics into one cluster if the Z-test shows their r value is statistically larger than the threshold value
- The result is a set of system metric clusters
- Select one metric from each cluster as the representative and delete the others as redundant
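A minimal sketch of the clustering step. The Fisher z-transform is a standard way to test whether a sample correlation statistically exceeds a threshold; the union-find clustering and the exact test statistic here are illustrative assumptions.

```python
import math
from itertools import combinations
import numpy as np

def r_exceeds_threshold(r, t, n, z_crit=1.645):
    """One-sided Z-test: is the true correlation larger than t (95% confidence)?"""
    fisher = lambda x: 0.5 * math.log((1 + x) / (1 - x))
    z = (fisher(r) - fisher(t)) * math.sqrt(n - 3)
    return z > z_crit

def cluster_redundant(metrics, t=0.95):
    """metrics: dict of name -> equal-length sample arrays; returns representatives."""
    names = list(metrics)
    n = len(next(iter(metrics.values())))
    parent = {m: m for m in names}                    # union-find cluster labels
    def find(m):
        while parent[m] != m:
            m = parent[m]
        return m
    for a, b in combinations(names, 2):
        r = min(abs(np.corrcoef(metrics[a], metrics[b])[0, 1]), 0.999999)
        if r_exceeds_threshold(r, t, n):              # statistically r > t
            parent[find(a)] = find(b)                 # same cluster: redundant
    return {find(m) for m in names}                   # keep one per cluster
```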
28. Select Necessary System Metrics
- Some system metrics may not relate to application performance
- Backward Elimination (BE) stepwise regression (see the sketch below)
- System metrics remaining after the first step: X = (x1, x2, ..., xn)
- The application performance metric: y
- Regress y on the set of x:
- y = β0 + β1*x1 + β2*x2 + ... + βn*xn
- Iteratively delete the metric that either is irrelevant or, given the other metrics, is not useful to the model
- Decided by the F value
- All remaining metrics are useful and necessary for capturing the variation of y
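A minimal sketch of backward elimination. Dropping the predictor whose coefficient has the largest p-value until all remaining ones are significant is the textbook formulation; for a single deletion this is equivalent to the partial F-test criterion named on the slide.

```python
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, names, alpha=0.05):
    """X: (samples, metrics) array; returns the names of the retained metrics."""
    keep = list(range(X.shape[1]))
    while keep:
        model = sm.OLS(y, sm.add_constant(X[:, keep])).fit()
        pvals = model.pvalues[1:]              # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break                              # every remaining metric is useful
        del keep[worst]                        # drop the least useful metric
    return [names[i] for i in keep]
```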
29. Evaluation
- Two criteria
- Necessary--Reduction degree (RD)
- Total percentage of system metrics eliminated
- Sufficient--coefficient of determination (R2)
- A statistical measurement
- Indicates the proportion of variation in application performance explained by the selected system metrics
- A larger R2 value means the selected system metrics better capture the variation in application performance
- Applications
- Cactus (UCSD cluster)
- GridFTP (PlanetLab)
30. Experimental Methodology
- 1st step: data reduction
- Use training data to select system metrics
- 2nd step: verification
- Are these system metrics sufficient?
- Is the result stable?
- How does this method compare with other strategies?
- RAND: randomly picks a subset of system metrics equal in number to those selected by our strategy
- MAIN: uses a subset of system metrics commonly used to model application performance in other works [H. Dail, M. Ripeanu, S. Vazhkudai]
31. Data Reduction Results on Cactus Data
- Six machines, 600 system metrics
- Threshold ↑ → RD ↓, since fewer system metrics are grouped into clusters and removed as redundant
- Threshold ↑ → R2 ↑, since more information is available to model the application performance
- The 22 system metrics selected capture 98% of the variance in performance when the threshold value is 0.95
32. Verification Results on Cactus Data
- Using 11 chunks of data collected over a one-day period
- SDR exhibited an average R2 value of 0.907
- 55.0% and 98.5% higher than those of RAND and MAIN
- Can better capture the application's behavior
- Results are stable over time (24-hour period)
Statistical Data Reduction for Efficient Application Performance Monitoring, L. Yang, J. M. Schopf, C. L. Dumitrescu, and I. Foster, CCGrid 2006.
33. Outline
- Avoid or reduce the anomaly
- Performance prediction
- Conservative Scheduling
- Detect and diagnose the anomaly
- Performance monitoring and data reduction
- Anomaly detection and diagnosis using signal processing techniques
- Summary
34. Anomaly Detection and Diagnosis
- Traditional window-average-based method [G. Allen, J. Brutlag, D. Gunter]
- Use the window average as the baseline for comparison
- Simple and efficient
- Some resources may have periodic usage patterns
- Slowdowns caused by periodic resource usage patterns are normal
- High false positives result if the periodic resource usage patterns are not properly taken into account
35. Challenges
- In Grid environments:
- Resource measurements are noisy
- Resources are distributed, with different administrative or access policies
- Different resources may show different usage patterns, with different frequencies, shapes, and amplitudes
- Need an approach that can identify the periodic resource usage patterns automatically and dynamically
- Solution: signal processing techniques (see the sketch below)
- Fourier-transform-based method
- Dominant capability in frequency-domain analysis
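A minimal sketch of identifying a periodic usage pattern with the Fourier transform; the peak-picking rule (largest spectral magnitude after removing the mean) is an illustrative assumption rather than the thesis's exact criterion.

```python
import numpy as np

def dominant_period(trace, sample_interval_s):
    """Return the period (in seconds) of the strongest cycle in the trace."""
    x = np.asarray(trace) - np.mean(trace)        # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=sample_interval_s)
    peak = np.argmax(spectrum[1:]) + 1            # skip the zero frequency
    return 1.0 / freqs[peak]

# Example: a trace sampled every 30 s with a half-hour cycle plus noise.
t = np.arange(4096)
trace = 5 + np.sin(2 * np.pi * t / 60) + 0.3 * np.random.randn(len(t))
print(dominant_period(trace, 30))                 # ~1800 s (half an hour)
```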
36. Example: Detection
False positives are reduced!
37. Example: Diagnosis
This anomaly is related to network load!
38. Evaluation
- Insert 100 anomalies randomly
- Compare the results of three strategies
- Traditional Window Average method (TWA)
- Modified Window Average method with denoising only (MWAD)
- Modified Window Average method (MWA)
- Two criteria
- Number of detected anomalies (HIT)
- Number of false positives (FP)
- Three Applications
- Cactus (UofC cluster)
- GridFTP (Emulab)
- Sweep3D (Emulab)
39. Experimental Methodology
- Ran Cactus on 4 shared Linux machines over two weeks
- CPU load shows a half-hour periodic usage pattern
- Performance of Cactus is influenced by the periodic CPU load pattern
- Inserted 100 anomalies randomly while the application was running
- By running resource consumption tools
- Collected 4 sets of data
- The first data set is used as training data to choose the window size and the data reduction threshold value
- The other three sets of data are used for verification
40. Window Size
- When the window size is small, FP is high and HIT is low
- Window size ↑ → FP ↓ and HIT ↑
- Window size > 32: FP < 60 and HIT > 90
- Window size = 128: FP = 53, HIT = 96
- For comparison, TWA achieves FP = 696, HIT = 99
41. Data Reduction Parameter
- Threshold value ↑ → HIT ↑ and FP ↑
- FP flattens out for threshold > 0.35
- Threshold value = 0.9: HIT = 97
42. Cactus Results
- Detection
- Eliminates 90% of the FP; HIT = 93 to 96
- Diagnosis
- Relates application anomalies to anomalous resource behaviors
- Reports the reasons for 82 to 87 anomalies correctly
43. Summary of Contributions
- Avoid or reduce the anomaly
- A set of new one-step-ahead prediction strategies [IPDPS03]
- Better CPU load prediction
- A conservative scheduling strategy [SC03, SY03, CCGrid05]
- Takes account of the predicted mean and variation of resource capability
- More reliable application behavior!
- Detect and diagnose the anomaly
- A statistical data reduction strategy [CCGrid06]
- Identifies the system metrics that are necessary and sufficient to capture application behavior
- Simplifies data analysis
- An anomaly detection and diagnosis strategy
- Identifies the periodic resource usage pattern automatically and dynamically
- Reduces false positives significantly
44. Future Work
- Anomaly prevention
- Multi-job scheduling
- More advanced detection and diagnosis methods
- Neural network methods
- Hidden Markov model methods
46. How Much is the Variation?
- IncrementValue (or DecrementValue) can be:
- Independent: IncrementValue = IncConstant
- Relative: IncrementValue = CurValue * IncFactor
- Or:
- Static: do not change the value at any step
- Dynamic: adjust the constant at each step using an adaptation process
- Measure V_{T+1}
- RealIncValue_T = V_{T+1} - V_T
- IncConstant_{T+1} = IncConstant_T + (RealIncValue_T - IncConstant_T) * AdaptDegree
47. Prediction Strategies
- Four homeostatic prediction strategies
- Independent static homeostatic prediction strategy
- Independent dynamic homeostatic prediction strategy
- Relative static homeostatic prediction strategy
- Relative dynamic homeostatic prediction strategy
- Three tendency-based prediction strategies
- Independent dynamic tendency prediction
- Relative dynamic tendency prediction
- Dynamic tendency prediction with mixed variation
- Independent IncrementValue
- Relative DecrementValue
- → Experimental results show that the dynamic tendency prediction strategy with mixed variation works best!
48. Autocorrelation Function
- Autocorrelation function from lag 0 to lag 10 on CPU load traces
- Autocorrelation function values at lag 1 and lag 2 on 74 network performance time series
49. Cactus Experiment
- Application: Cactus
- To compare different methods fairly, a load-trace playback tool generates a background workload from a trace of the CPU load
- 64 real load traces with different means and s.d.
- Execution time of Cactus: 1 minute to 10 minutes
- Three clusters: UIUC, UCSD, Chiba City
50. Parallel Data Transfer Scheduling
- Our Tuned Conservative Scheduling (TCS)
- EffectiveBW = BWMean - TF * BWSD
- SD ↑ → effective bandwidth ↓ → less workload allocated
- Other stochastic strategies:
- Best One Scheduling (BOS)
- Retrieve data from the source with the highest predicted mean bandwidth
- Equal Allocation Scheduling (EAS)
- Retrieve the same amount of data from each source
- Mean Scheduling (MS) (TF = 0)
- EffectiveBW = predicted BWMean
- Non-tuned Stochastic Scheduling (NTSS) (TF = 1)
- EffectiveBW = predicted BWMean - predicted BWSD
51. Tuning Factor Algorithm
- Effective BW = predicted BW mean - TF * predicted BW SD
- SD ↑ → effective bandwidth ↓ → less workload allocated
- SD/Mean < 1: TF ranges from ½ to 8
- Lower variability, so a higher effective BW is desired
- SD/Mean > 1: TF ranges from 0 to ½
- SD is higher than the mean; network performance is changing greatly, so a small effective BW is wanted
- In both cases, the values of TF and TF*SD are inversely proportional to N
- Other formulas would also work just fine (see the sketch below):
  N = SD/Mean; if N < 1, TF = 1/N - N/2; else TF = 1/(2*N^2)
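A minimal sketch of the tuning-factor rule as reconstructed above. Note that the minus sign in the N < 1 branch is inferred from slide 52's observation that the maximum of TF*SD equals the mean bandwidth, so treat it as an assumption.

```python
def tuning_factor(bw_mean, bw_sd):
    n = bw_sd / bw_mean                 # coefficient of variation, N = SD/Mean
    if n == 0:
        return float("inf")             # degenerate case: a perfectly steady link
    if n < 1:
        return 1.0 / n - n / 2.0        # low variability: TF >= 1/2
    return 1.0 / (2.0 * n * n)          # high variability: 0 < TF <= 1/2

# Reproduce the slide-52 setting: Mean = 5, SD varying from 0 to 10.
mean = 5.0
for sd in (0.5, 2.5, 5.0, 10.0):
    tf = tuning_factor(mean, sd)
    # Both TF and TF*SD decrease as SD grows; TF*SD peaks near the mean.
    print(f"SD={sd:4.1f}  TF={tf:5.2f}  TF*SD={tf * sd:5.2f}")
```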
52. TF and TF*SD as BW SD Changes
- Mean = 5, SD varies from 0 to 10
- Both are inversely proportional to the BW standard deviation (and N)
- The maximum TF*SD value is equal to the mean of the BW
- Other functions are also feasible
53. Evaluation
- Applications
- Cactus (UCSD cluster)
- GridFTP (PlanetLab)
- Data collected once every 30 seconds for 24 hours
- Every data point: 100 system metric values per machine plus 1 application performance value
- Collect system metrics on each machine using three utilities:
- The sar command of the SYSSTAT tool set,
- Network Weather Service (NWS) sensors, and
- The Unix command ping
54. Verification Results on GridFTP Data
- Using data collected from 24 different clients
- SDR achieves a mean R2 value of 0.947
- 92.5% and 28.1% higher than those of the RAND and MAIN strategies
- Results are stable across different machines with the same configuration
55. Resource Performance Analysis
- Solution: signal processing techniques
- Denoising
- Fourier-transform-based method
- Difficult to choose the width and shape of the filter
- White noise is distributed across all frequencies and spatial scales; a Fourier-based filter is inefficient at removing this kind of noise
- Wavelet analysis offers a scale-independent and robust method to filter out noise
- Able to remove noise without losing useful information
- Soft-threshold denoising technique (see the sketch below)
- Constructing the normal profile
- Construct the periodic usage pattern of resources
- Fourier transform: capable of dealing with periodic signals
- A performance decrement is tagged as an anomaly only when it is not caused by the resource usage pattern
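A minimal sketch of wavelet soft-threshold denoising using PyWavelets with Donoho's universal threshold; the wavelet choice ('db4') and the decomposition depth are illustrative assumptions, not necessarily the thesis's settings.

```python
import numpy as np
import pywt

def denoise(signal, wavelet="db4", level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Estimate the noise level from the finest-scale detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))
    # Shrink detail coefficients toward zero; keep the approximation intact.
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]
```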
56. Experimental Methodology
- Ran Cactus on 4 shared Linux machines over two weeks
- Inserted 100 anomalies manually
- By running resource consumption tools
- Consume more than 90% of CPU, bandwidth, or memory
- Anomalies caused by high CPU load
- Anomalies caused by high bandwidth load
- Anomalies caused by high memory load
57. Experimental Results: Diagnosis
- Classify the system metrics into three categories
- CPU related, memory related, and network related
- In total, 12 possible reasons across the four machines
- Relate application anomalies to anomalous resource behavior (see the sketch below)
- Check for anomalous resource behavior once an application anomaly has been identified
- Count the number of resource anomalies in each category
- Output all reasons occurring more than 10% of the time
- Result
- Of the 97 anomalies detected, our strategy reported reasons for 89
- For 80 anomalies, the correct reason was reported as the most likely
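A minimal sketch of the counting step: map flagged metrics to resource categories and report every reason implicated more than 10% of the time. The category map and the input format are hypothetical.

```python
from collections import Counter

CATEGORY = {"cpu_user": "CPU", "cpu_sys": "CPU",
            "mem_free": "Memory", "net_bw": "Network"}

def diagnose(anomalous_metrics, threshold=0.10):
    """anomalous_metrics: metric names flagged during the anomaly window."""
    counts = Counter(CATEGORY[m] for m in anomalous_metrics if m in CATEGORY)
    total = sum(counts.values()) or 1
    # Report each resource category implicated in more than 10% of the flags.
    return [cat for cat, c in counts.items() if c / total > threshold]

print(diagnose(["cpu_user", "cpu_sys", "net_bw", "cpu_user"]))
# -> ['CPU', 'Network']  (CPU is the most likely reason)
```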
58. Experimental Methodology for GridFTP
- Inserted 100 network anomalies on network links in the path
- No periodic resource usage pattern
59. GridFTP Results
- Detection
- Does not improve much when FP is small; HIT = 90 to 95
- Diagnosis
- Locates the problematic network links
- Reports the reasons for 73 to 81 anomalies correctly
60. Experimental Methodology for Sweep3D
- Inserted 100 network anomalies on network links
- Emulated various periodic CPU load patterns for machines from different domains
- Daily and hourly
- Varied the problem size to change the computation/communication ratio
- Small, medium, and large
61. Sweep3D Results: Detection
- Does not improve much when FP is small
- Reduces FP by about 85% when FP is large
- HIT = 89 to 95
62. Sweep3D Results: Diagnosis
- Locates the problematic network links
- Reports the reasons for 73 to 81 anomalies correctly