1
Anomaly Management in Grid Environments
  • Candidate: Lingyun Yang
  • Advisor: Ian Foster
  • University of Chicago

2
Grid Environments
  [Figures: CPU load and network load traces; Mean = 6.9, SD = 3.1 and Mean = 197.3, SD = 324.8]
  • Resource capabilities change dynamically due to
    the sharing of resources
  • The performance of Grid applications is affected
    by the capabilities of these resources
  • Applications want not only fast execution time,
    but also stable behavior
  • A significant deviation from the normal profile is
    defined as an anomaly [D. Denning]

3
Challenges
  • The problem is more complex because
  • Resources are shared
  • Resources are heterogeneous
  • Resources are distributed
  • Grid environments are complex

4
Solutions
  • Avoid or reduce the anomaly before it happens
    • Prediction
    • Proper scheduling strategy
  • Detect the application anomaly when it happens
    • Distinguish the anomaly from the normal profiles of
      applications and resources
  • Diagnose the anomaly after it happens
    • Study the relationship between resource
      capabilities and application performance

5
Contributions
  • Avoid or reduce the application anomaly before it
    happens
    • A set of new one-step-ahead prediction strategies
    • A conservative scheduling strategy
      → More reliable application behavior
  • Detect the application anomaly when it happens
  • Diagnose the application anomaly after it happens
    • A statistical data reduction strategy
      → Simplifies the data analysis
    • An anomaly detection and diagnosis strategy
      using signal processing techniques
      → More accurate anomaly detection and diagnosis
6
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis
  • Summary

7
Avoid or Reduce Anomalies
  • Deliver fast and predictable behavior to
    applications
  • Proper scheduling strategy
  • Time balancing [F. Berman, H. Dail] for heterogeneous
    resources
  • Assign more workload to more powerful resources
  • Each resource finishes (roughly) at the same time
  • Why is it not enough?
  • Resource performance may change during execution
  • Resources with a larger capacity may also show a
    higher variance
  • Stochastic scheduling [J. Schopf]
  • Use stochastic data (average and variation)
  • My approaches
  • Prediction
  • Conservative scheduling

8
CPU Load Prediction
  • Prediction: estimate future values using
    historical data
  • Key: correctly model the relationship of the
    historical data with future values
  • Time series modeling
  • Financial data prediction [A. Lendasse], earth
    and ocean sciences [L. Lawson], biomedical signal
    processing [H. Kato], etc.
  • CPU load prediction
  • NWS [R. Wolski]: mean-based, median-based, and AR
    models
  • Dynamically selects the strategy to use at each
    step
  • Linear models study [P. Dinda]: the AR model is the
    best one
  • These estimate the value directly
  • Question: what if we estimate the direction and
    variation separately?

9
Direction Two Families of Strategies
  • Homeostatic: assume the values are
    self-correcting; the value will return to the
    mean of the previous values
  • Tendency: assume that if the current value
    decreases, the next value will also decrease; if
    the current value increases, the next value will
    also increase

10
How Much is the Variation?
  • IncrementValue (or DecrementValue) can be
  • Independent: IncrementValue = IncConstant
  • Relative: IncrementValue = CurValue × IncFactor
  • Or
  • Static: do not change the value at any step
  • Dynamic: adapt the constant at each step using
    real-time information (see the sketch below)
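
A minimal Python sketch of the dynamic tendency predictor with mixed variation described above (the adaptation rule is detailed on slide 46). The parameter values and function names are illustrative assumptions, not the thesis's exact settings.

    # One-step-ahead tendency prediction with mixed variation:
    # an independent (constant) increment when the load is rising,
    # a relative (proportional) decrement when it is falling.
    def predict_next(prev, cur, inc_constant, dec_factor):
        if cur >= prev:
            return cur + inc_constant       # independent IncrementValue
        return cur - cur * dec_factor       # relative DecrementValue

    def predict_series(values, inc_constant=0.1, dec_factor=0.1,
                       adapt_degree=0.5):
        """Predict values[2:], adapting IncConstant as on slide 46."""
        preds = []
        for t in range(1, len(values) - 1):
            preds.append(predict_next(values[t - 1], values[t],
                                      inc_constant, dec_factor))
            # Dynamic adaptation: move IncConstant toward the increment
            # actually observed at this step.
            real_inc = values[t] - values[t - 1]
            if real_inc >= 0:
                inc_constant += (real_inc - inc_constant) * adapt_degree
        return preds

    cpu_load = [0.8, 1.1, 1.0, 1.3, 1.6, 1.2]
    print(predict_series(cpu_load))   # predictions for cpu_load[2:]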

11
Prediction Strategies
  • Create a set of prediction strategies by
    different combinations of the directions and
    variations
  • Evaluate this set of prediction strategies on 12
    CPU time series
  • Experimental results show that the dynamic tendency
    prediction strategy with mixed variation works
    best
  • Dynamic tendency prediction with mixed variation
  • Independent IncrementValue
  • Relative DecrementValue

12
Comparison Results
  • The dynamic tendency predictor with mixed variation
    outperforms NWS: 2% to 55% lower error rate, 37%
    lower on average

"Homeostatic and Tendency-based CPU Load
Predictions," L. Yang, I. Foster, and J. M.
Schopf, Proceedings of the International Parallel and
Distributed Processing Symposium (IPDPS), 2003.
13
Network Load Prediction
  • The tendency-based predictor with mixed variation
    does not work effectively on network data
  • NWS works better on average
  • Possible explanation
  • Network data usually has much higher variation
    and smaller correlation between two adjacent
    values than CPU load data
  • Our strategies give high weight to the most recent
    data, and thus cannot track the tendency of
    network traces
  • NWS predictors take more statistical information
    into account
  • → For network load prediction, we use NWS
    predictors

14
Mean Resource Capability Prediction
  • Calculate a mean resource capability time series
  • a_i = the average resource capability over a time
    interval approximately equal to the application's
    total execution time
  • Mean resource capability prediction

15
Resource Capability Variation Prediction
  • Calculate the standard deviation time series
  • S_i = the average difference between the resource
    capability and the mean resource capability over
    the time interval
  • Resource capability variation prediction
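
A minimal sketch of how the two derived series could be computed from a raw capability trace, assuming measurements arrive at a fixed rate and the interval length is given in samples (both assumptions; the exact windowing is not shown on the slides).

    import statistics

    def capability_series(trace, interval):
        """Split a capability trace into intervals roughly equal to the
        application's execution time; return the per-interval series
        (a_i = mean capability, S_i = standard deviation)."""
        means, sds = [], []
        for start in range(0, len(trace) - interval + 1, interval):
            window = trace[start:start + interval]
            means.append(statistics.mean(window))
            sds.append(statistics.pstdev(window))
        return means, sds

Each series is then forecast one step ahead (the tendency predictor for CPU data, NWS-style predictors for network data) to obtain the mean and variation of resource capability for the next run.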

16
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis
  • Summary

17
Conservative Scheduling
  • A resource with a larger capacity may also show a
    higher variance in performance
  • Assign less work to less reliable
    (higher-variance) resources
  • Avoid the peak in the application performance
    caused by variance in the resource capability
  • Verified in two contexts
  • Computation-intensive application: Cactus
  • Parallel data transfer: GridFTP

18
Effective CPU Capability--Cactus
  • Conservative load prediction (see the sketch after
    this list)
  • Effective CPU load = predicted Mean + predicted
    SD
  • SD ↑ → effective CPU load ↑ → less workload
    allocated
  • Other scheduling options
  • One-Step Scheduling (OSS) [H. Dail, C. Liu]
  • Effective CPU load = predicted CPU load at
    the next step
  • Predicted Mean Interval Scheduling (PMIS)
  • Effective CPU load = predicted Mean
  • History Mean Scheduling (HMS) [A. Turgeon, J.
    Weissman]
  • Effective CPU load = mean of the
    historical load collected during a 5-minute
    period preceding the application start time
  • History Conservative Scheduling (HCS) [J.
    Schopf]
  • Effective CPU load = mean + SD of the
    historical load collected during a 5-minute
    period preceding the application run
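
A sketch of conservative time balancing under these definitions. The 1/(1 + load) compute-rate model is an assumption (a common single-CPU queuing approximation), not the thesis's stated model.

    def allocate_work(total_work, pred_mean, pred_sd):
        """Give each host work proportional to its effective compute
        rate, so all hosts finish at (roughly) the same time.
        Effective CPU load = predicted mean + predicted SD; assume a
        host's delivered rate scales as 1 / (1 + effective load)."""
        rates = [1.0 / (1.0 + m + s) for m, s in zip(pred_mean, pred_sd)]
        total_rate = sum(rates)
        return [total_work * r / total_rate for r in rates]

    # The high-variance host gets less work than an equally loaded
    # but steadier host, avoiding stragglers.
    print(allocate_work(1000, [1.0, 1.0], [0.2, 2.0]))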

19
An Example of Experimental Results
  • Comparison of the five strategies for Cactus running
    on the Chiba City cluster
  • 32 machines used

20
Result Evaluation
  • Average mean and s.d. of execution time across
    repeated runs
  • Statistical analysis: the T-test
  • Results show a statistically significant improvement
    of our strategy over the other strategies
  • The probability of the improvement happening by
    chance is quite small

"Conservative Scheduling: Using Predicted Variance
to Improve Scheduling Decisions in Dynamic
Environments," L. Yang, J. M. Schopf, and I.
Foster, Proceedings of SuperComputing 2003.
"Improving Parallel Data Transfer Times Using
Predicted Variances in Shared Networks," L.
Yang, J. M. Schopf, and I. Foster, CCGrid 2005.
21
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • → more stable application behavior
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis using signal
    processing techniques
  • Summary

22
Detect and Diagnose Anomalies
  • Anomaly detection: send an alarm on significant
    deviation from the established normal profile
    [D. Denning]
  • Anomaly diagnosis: relate anomalous application
    behavior to anomalous resource behavior
  • Need to detect resource anomalies
  • Key: correctly define the normal profile of
    application and resource behaviors

Anomaly?
23
Build Normal Profile
  • Signature-based methods [K. Ilgun, T. Lunt]
  • Known patterns of anomalous activities
  • Specification-based methods [R. Sekar]
  • Logic-based rules specifying the legitimate
    system and/or application behaviors
  • Statistical methods [G. Allen]
  • Use statistics to construct a reference model of
    normal application and system behaviors

24
Questions and Solutions
  • Questions
  • What kind of resource information should be used?
  • How to use the resource information to detect and
    diagnose application anomalous behaviors?
  • My approaches
  • Statistical data reduction
  • Select only necessary system metrics
  • Anomaly detection and diagnosis using
    signal-processing-based techniques
  • Extends the window-average-based method
  • Statistical method
  • Simple and efficient

25
Data Reduction
  • The first step of anomaly detection and diagnosis is
    performance monitoring
  • Computer systems and applications continue to
    increase in complexity and size
  • Interactions among components are poorly
    understood
  • Two ways to understand the relationship between
    application and resource performance
  • Performance models [A. Hoisie, M. Ripeanu, D.
    Kerbyson]
  • → Application-specific, expensive, etc.
  • Instrumentation and data analysis [A. Malony]
  • → Produces a tremendous amount of data
  • Need mechanisms to select only the necessary metrics
  • Two-step data reduction strategy

26
Reduce Redundant System Metrics
  • Some system metrics capture the same (or similar)
    information
  • Highly correlated (measured by the correlation
    coefficient r)
  • Only one is necessary; the others are redundant
  • Two questions
  • A threshold value t (determined experimentally)
  • A method to compare
  • Traditional method: direct mathematical comparison
    [M. Knop]
  • Is the correlation coefficient r > t?
  • Problem: only a limited number of sample data points
    are available

r may vary from run to run
27
Redundant Metrics Reduction Algorithm
  • Use the Z-test
  • A statistical method
  • Determines whether the correlation is
    statistically significantly larger than the
    threshold value (with 95% confidence in the results)
  • Given a set of samples, we proceed as follows
  • Perform the Z-test for r between every pair of
    system metrics
  • Group two metrics into one cluster if the Z-test
    shows their r value is statistically larger than
    the threshold value
  • The result is a set of system metric clusters
  • Select one metric from each cluster as the
    representative and delete the others as redundant
    (see the sketch below)
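
A sketch of this clustering in Python, using the Fisher z-transform for the one-sided test. The single-link grouping and the 1.645 critical value (95%, one-sided) are assumptions; the slide does not spell out the exact clustering variant.

    import math

    def corr_exceeds(r, t, n, z_crit=1.645):
        """One-sided Z-test via the Fisher transform: is the true
        correlation statistically larger than threshold t, given a
        sample correlation r over n observations?"""
        if abs(r) >= 1.0:
            return True
        z = (math.atanh(r) - math.atanh(t)) * math.sqrt(n - 3)
        return z > z_crit

    def cluster_metrics(corr, n_samples, t=0.95):
        """corr[i][j] = sample correlation between metrics i and j.
        Greedily merge metrics whose correlation passes the Z-test;
        return one representative index per resulting cluster."""
        m = len(corr)
        label = list(range(m))
        for i in range(m):
            for j in range(i + 1, m):
                if corr_exceeds(corr[i][j], t, n_samples):
                    old, new = label[j], label[i]
                    for k in range(m):      # merge the two clusters
                        if label[k] == old:
                            label[k] = new
        return sorted(set(label))           # representatives, one per cluster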

28
Select Necessary System Metrics
  • Some system metrics may not relate to application
    performance
  • Backward Elimination (BE) stepwise regression
  • System metrics remaining after the first step:
    X = (x1, x2, ..., xn)
  • The application performance metric: y
  • Regress y on the set of x:
  • y = β0 + β1·x1 + β2·x2 + ... + βn·xn
  • Delete any metric that either is irrelevant or
    that, given the other metrics, is not useful to the
    model
  • Deletion criterion: the F value (see the sketch below)
  • All remaining metrics are useful and necessary
    for capturing the variation of y
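
A backward-elimination sketch with a partial F-test, using NumPy only. The cutoff f_drop = 4.0 (roughly the 5% critical value for moderate sample sizes) is an assumption; the slide does not give the exact stopping rule.

    import numpy as np

    def rss(X, y):
        """Residual sum of squares of the least-squares fit."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return float(np.sum((y - X @ beta) ** 2))

    def backward_eliminate(X, y, f_drop=4.0):
        """X: (n, p) metric matrix, y: (n,) performance vector.
        Repeatedly drop the metric with the smallest partial F value
        while that value stays below f_drop."""
        keep = list(range(X.shape[1]))
        while len(keep) > 1:
            Xf = np.column_stack([np.ones(len(y)), X[:, keep]])
            rss_full = rss(Xf, y)
            dof = len(y) - Xf.shape[1]
            f_vals = []
            for drop in range(len(keep)):
                cols = keep[:drop] + keep[drop + 1:]
                Xr = np.column_stack([np.ones(len(y)), X[:, cols]])
                f_vals.append((rss(Xr, y) - rss_full) / (rss_full / dof))
            worst = int(np.argmin(f_vals))
            if f_vals[worst] >= f_drop:
                break                 # every remaining metric is useful
            keep.pop(worst)
        return keep                   # indices of necessary metrics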

29
Evaluation
  • Two criteria
  • Necessary: reduction degree (RD)
  • Total percentage of system metrics eliminated
  • Sufficient: coefficient of determination (R²)
  • A statistical measurement
  • Indicates the proportion of variation in the
    application's performance explained by the
    selected system metrics
  • A larger R² value means the selected system metrics
    better capture the variation in application
    performance
  • Applications
  • Cactus (UCSD cluster)
  • GridFTP (PlanetLab)

30
Experimental Methodology
  • 1st step: data reduction
  • Use training data to select system metrics
  • 2nd step: verification
  • Are these system metrics sufficient?
  • Is the result stable?
  • How does this method compare with other
    strategies?
  • RAND: randomly picks a subset of system metrics
    equal in number to those selected by our strategy
  • MAIN: uses a subset of system metrics that are
    commonly used to model the performance of
    applications in other work [H. Dail, M. Ripeanu,
    S. Vazhkudai]

31
Data Reduction Result on Cactus Data
  • Six machines, 600 system metrics
  • Threshold ↑ → RD ↓, since fewer system metrics
    are grouped into clusters and removed as
    redundant
  • Threshold ↑ → R² ↑, since more information is
    available to model the application performance
  • The 22 system metrics selected can capture 98% of
    the variance in the performance when the threshold
    value = 0.95

32
Verification Result on Cactus Data
  • Using 11 chunks of data collected over a one-day
    period
  • SDR exhibited an average R² value of 0.907
  • 55.0% and 98.5% higher than those of RAND and
    MAIN
  • Can better capture the application's behavior
  • Results are stable over time (24-hour period)

"Statistical Data Reduction for Efficient
Application Performance Monitoring," L. Yang, J.
M. Schopf, C. L. Dumitrescu, and I. Foster,
CCGrid 2006.
33
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis using signal
    processing techniques
  • Summary

34
Anomaly Detection and Diagnosis
  • Traditional window-average-based methods [G.
    Allen, J. Brutlag, D. Gunter]
  • Use the window average as the baseline for
    comparison (see the sketch below)
  • Simple and efficient
  • But some resources may have periodic usage patterns
  • Slowdowns caused by periodic resource usage
    patterns are normal
  • Results in high false positives if the periodic
    resource usage patterns are not taken into
    account properly
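
A sketch of the traditional window-average detector for reference. Treating the monitored value as a throughput-like metric (so a drop signals a problem) and the 20% tolerance are illustrative assumptions.

    def window_average_alarms(perf, window=32, tolerance=0.2):
        """Flag index t as anomalous when perf[t] falls more than
        `tolerance` below the mean of the preceding `window` samples."""
        alarms = []
        for t in range(window, len(perf)):
            baseline = sum(perf[t - window:t]) / window
            if perf[t] < baseline * (1.0 - tolerance):
                alarms.append(t)
        return alarms

Applied to a resource with a periodic usage pattern, every trough of the cycle trips this detector, which is exactly the false-positive problem described above.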

35
Challenges
  • In Grid environments
  • Resource measurements are noisy
  • Resources are distributed, with different
    administrative or access policies
  • Different resources may show different usage
    patterns, with different frequencies, shapes, and
    amplitudes
  • Need an approach that can identify periodic
    resource usage patterns automatically and
    dynamically
  • Solution: signal processing techniques
  • Fourier-transform-based method
  • Strong capability for frequency-domain analysis
    (see the sketch below)
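
A sketch of identifying the dominant period of a usage trace with the FFT; the sampling-interval parameter is an assumption about how the trace is collected.

    import numpy as np

    def dominant_period(load, sample_interval_s):
        """Return the period (seconds) of the strongest periodic
        component: the non-DC frequency bin with maximum magnitude."""
        load = np.asarray(load, dtype=float)
        spectrum = np.abs(np.fft.rfft(load - load.mean()))
        freqs = np.fft.rfftfreq(len(load), d=sample_interval_s)
        k = 1 + int(np.argmax(spectrum[1:]))   # skip the DC component
        return 1.0 / freqs[k]

A slowdown that recurs at this period can then be folded into the normal profile instead of being raised as an anomaly.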

36
Example: Detection

False positives are reduced!
37
Example: Diagnosis
This anomaly is related to network load!
38
Evaluation
  • Insert 100 anomalies randomly
  • Compare the results of three strategies
  • Traditional Window Average method (TWA)
  • Modified Window Average method with denoising
    only (MWAD)
  • Modified Window Average method (MWA)
  • Two criteria
  • Number of detected anomalies (HIT)
  • Number of false positives (FP)
  • Three applications
  • Cactus (UofC cluster)
  • GridFTP (Emulab)
  • Sweep3d (Emulab)

39
Experimental Methodology
  • Run Cactus on 4 shared Linux machines over two
    weeks
  • The CPU load shows a half-hour periodic usage
    pattern
  • The performance of Cactus is influenced by the
    periodic CPU load pattern
  • Insert 100 anomalies randomly while the application
    runs
  • By running resource-consumption tools
  • Collect 4 sets of data
  • The first data set is used as training data
  • To choose the window size and the data reduction
    threshold value
  • The other three sets of data are used for
    verification

40
Window Size
  • When the window size is small, FP is high and HIT
    is low
  • Window size ↑ → FP ↓ and HIT ↑
  • Window size > 32: FP < 60 and HIT > 90
  • Window size = 128: FP = 53, HIT = 96
  • For comparison, TWA achieves FP = 696, HIT = 99

41
Data Reduction Parameter
  • Threshold value ↑ → HIT ↑
  • FP levels off for threshold > 0.35
  • Threshold value = 0.9: HIT = 97

42
Cactus Results
  • Detection
  • Eliminates 90% of the FP; HIT = 93 to 96
  • Diagnosis
  • Relates application anomalies to anomalous
    resource behaviors
  • Reports the reasons correctly for 82 to 87
    anomalies

43
Summary of Contributions
  • Avoid or reduce the anomaly
  • A set of new one-step-ahead prediction strategies
    [IPDPS03]
  • Better CPU load prediction
  • A conservative scheduling strategy [SC03, SY03,
    CCGrid05]
  • Takes the predicted mean and variation of resource
    capability into account
  • More reliable application behavior!
  • Detect and diagnose the anomaly
  • A statistical data reduction strategy [CCGrid06]
  • Identifies the system metrics that are necessary
    and sufficient to capture application behavior
  • Simplifies data analysis
  • An anomaly detection and diagnosis strategy
  • Identifies periodic resource usage patterns
    automatically and dynamically
  • Reduces false positives significantly

44
Future Work
  • Anomaly prevention
  • Multi-job Scheduling
  • More advanced detection and diagnosis methods
  • Neural networks methods
  • Hidden Markov model methods

45
  • Questions?
  • Thank you!

46
How Much is the Variation?
  • IncrementValue (or DecrementValue) can be
  • Independent: IncrementValue = IncConstant
  • Relative: IncrementValue = CurValue × IncFactor
  • Or
  • Static: do not change the value at any step
  • Dynamic: adjust the constant at each step using
    an adaptation process
  • Measure V(T+1)
  • RealIncValue(T) = V(T+1) - V(T)
  • IncConstant(T+1) = IncConstant(T) +
    (RealIncValue(T) - IncConstant(T)) × AdaptDegree

47
Prediction Strategies
  • Four homeostatic prediction strategies
  • Independent static homeostatic prediction
    strategy
  • Independent dynamic homeostatic prediction
    strategy
  • Relative static homeostatic prediction strategy
  • Relative dynamic homeostatic prediction strategy
  • Three tendency-based prediction strategies
  • Independent dynamic tendency prediction
  • Relative dynamic tendency prediction
  • Dynamic tendency prediction with mixed variation
  • Independent IncrementValue
  • Relative DecrementValue
  • → Experimental results show that the dynamic
    tendency prediction strategy with mixed variation
    works best!

48
Autocorrelation Function
  • Autocorrelation function from lag 0 to lag 10 on
    CPU load traces
  • Autocorrelation function values at lags 1 and 2
    on 74 network performance time series

49
Cactus Experiment
  • Application: Cactus
  • To compare different methods fairly, a load-trace
    playback tool generates a background workload
    from a trace of the CPU load
  • 64 real load traces with different means and s.d.
  • Execution time of Cactus: 1 minute to 10 minutes
  • Three clusters: UIUC, UCSD, Chiba City

50
Parallel Data Transfer Scheduling
  • Our Tuned Conservative Scheduling (TCS)
  • EffectiveBW = BWMean - TF × BWSD
  • SD ↑ → effective bandwidth ↓ → less workload
    allocated
  • Other stochastic strategies
  • Best One Scheduling (BOS)
  • Retrieve data from the source with the highest
    predicted mean bandwidth
  • Equal Allocation Scheduling (EAS)
  • Retrieve the same amount of data from each
    source
  • Mean Scheduling (MS) (TF = 0)
  • EffectiveBW = predicted BWMean
  • Non-tuned Stochastic Scheduling (NTSS) (TF = 1)
  • EffectiveBW = predicted BWMean - predicted
    BWSD

51
Tuning Factor Algorithm
  • EffectiveBW = predicted BWMean - TF × predicted
    BWSD
  • SD ↑ → effective bandwidth ↓ → less workload
    allocated
  • SD/Mean < 1: TF ranges from ½ to ∞
  • Lower variability, so a higher effective BW is
    desired
  • SD/Mean > 1: TF ranges from 0 to ½
  • SD is higher than the mean; network performance is
    changing greatly, so a small effective BW is wanted
  • In both cases, the values of TF and TF × SD are
    inversely proportional to N (see the sketch below)
  • Other formulas would also work just fine

N = SD/Mean. If N < 1: TF = 1/N - N/2; else
(N ≥ 1): TF = 1/(2N²)
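
A direct transcription of the footnote's tuning factor in Python, with a small table illustrating the property stated on the next slide (TF × SD approaches the BW mean as SD shrinks). The zero-SD guard is an added assumption for the degenerate case.

    def tuning_factor(bw_mean, bw_sd):
        """N = SD/Mean; TF = 1/N - N/2 for N < 1, else 1/(2*N**2)."""
        if bw_sd == 0:
            return float("inf")        # degenerate: no variability at all
        n = bw_sd / bw_mean
        return 1.0 / n - n / 2.0 if n < 1.0 else 1.0 / (2.0 * n * n)

    # Both TF and TF*SD shrink as relative variability N grows:
    for sd in (0.1, 1.0, 5.0, 10.0):
        tf = tuning_factor(5.0, sd)    # BW mean fixed at 5
        print(f"SD={sd:5.1f}  TF={tf:7.3f}  TF*SD={tf * sd:6.3f}")
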
52
TF and TF × SD as the BW SD Changes
  • Mean = 5, SD varies from 0 to 10
  • Both are inversely proportional to the BW standard
    deviation (and N)
  • The maximum of the TF × SD value equals the mean
    of the BW
  • Other functions are also feasible

53
Evaluation
  • Applications
  • Cactus (UCSD cluster)
  • GridFTP (PlanetLab)
  • Data collected once every 30 seconds for 24 hours
  • Every data point: 100 system metric values per
    machine, plus 1 application performance value
  • Collect system metrics on each machine using
    three utilities
  • The sar command of the SYSSTAT tool set,
  • Network Weather Service (NWS) sensors, and
  • The Unix command ping

54
Verification Result on GridFTP Data
  • Using data collected from 24 different clients
  • SDR achieves a mean R² value of 0.947
  • 92.5% and 28.1% higher than those of the RAND and
    MAIN strategies
  • Results are stable over different machines with
    the same configuration

55
Resource Performance Analysis
  • Solution: signal processing techniques
  • Denoising
  • Fourier-transform-based method
  • Difficult to choose the width and shape of the
    filter
  • White noise is distributed across all
    frequencies and spatial scales; a Fourier-based
    filter is inefficient for filtering this kind of
    noise
  • Wavelet analysis offers a scale-independent and
    robust method to filter out noise
  • It can remove noise without losing useful
    information
  • Soft-threshold denoising technique (see the sketch
    below)
  • Constructing the normal profile
  • Construct the periodic usage pattern of resources
  • Fourier transform: capable of dealing with periodic
    signals
  • A performance decrement is tagged as an anomaly
    only when it is not caused by the resource usage
    pattern
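
A sketch of the two signal-processing steps, assuming the PyWavelets (pywt) and NumPy packages; the wavelet, decomposition level, universal-threshold rule, and number of kept frequency components are illustrative assumptions rather than the thesis's exact settings.

    import numpy as np
    import pywt

    def denoise(signal, wavelet="db4", level=4):
        """Soft-threshold wavelet denoising: shrink the detail
        coefficients, then reconstruct the signal."""
        coeffs = pywt.wavedec(signal, wavelet, level=level)
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745   # noise estimate
        thr = sigma * np.sqrt(2 * np.log(len(signal)))   # universal threshold
        coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft")
                                for c in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)[:len(signal)]

    def periodic_profile(load, keep=3):
        """Normal profile via the Fourier transform: keep only the
        `keep` strongest frequency components (plus the mean), then
        invert back to the time domain."""
        spec = np.fft.rfft(np.asarray(load, dtype=float))
        mags = np.abs(spec)
        cutoff = np.sort(mags[1:])[-keep]     # k-th largest non-DC peak
        spec[1:][mags[1:] < cutoff] = 0       # drop all weaker components
        return np.fft.irfft(spec, n=len(load))

A measured slowdown is then compared against this periodic baseline (after denoising) rather than against a flat window average.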

56
Experimental Methodology
  • Ran Cactus on 4 shared Linux machines over two
    weeks
  • Inserted 100 anomalies manually
  • By running resource-consumption tools
  • Consuming more than 90% of CPU, bandwidth, or
    memory
  • Anomalies caused by high CPU load
  • Anomalies caused by high bandwidth load
  • Anomalies caused by high memory load

57
Experimental Result- Diagnosis
  • Classify the system metrics into three
    categories
  • CPU-related, memory-related, and network-related
  • A total of 12 possible reasons on the four machines
  • Relate application anomalies to anomalous resource
    behavior
  • Check for anomalous resource behavior when an
    application anomaly has been identified
  • Count the number of resource anomalies in each
    category
  • Output all reasons occurring more than 10% of the
    time
  • Result
  • Of the 97 anomalies detected, our strategy
    reports the reasons for 89 anomalies
  • For 80 anomalies, the correct reason is reported
    as the most likely one

58
Experimental Methodology for GridFTP
  • Insert 100 network anomalies on network links in
    the path
  • No periodic resource usage pattern

59
GridFTP Result
  • Detection
  • Does not improve much, since FP is already small;
    HIT = 90 to 95
  • Diagnosis
  • Locates the problematic network links
  • Reports the reasons correctly for 73 to 81
    anomalies

60
Experimental Methodology for Sweep3d
  • Insert 100 network anomalies on network links
  • Emulate various periodic CPU load patterns for
    machines from different domains
  • Daily and hourly
  • Vary the problem size to change the computation /
    communication ratio
  • Small, medium, and large

61
Sweep3d Result - Detection
  • Does not improve much when FP is small
  • Reduces about 85% of FP when FP is large
  • HIT = 89 to 95

62
Sweep3d Result - Diagnosis
  • Locates the problematic network links
  • Reports the reasons correctly for 73 to 81
    anomalies