Anomaly Management in Grid Environments - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Anomaly Management in Grid Environments


1
Anomaly Management in Grid Environments
  • Candidate: Lingyun Yang
  • Advisor: Ian Foster
  • University of Chicago

2
Grid Environments
  [Figure: CPU load trace (Mean = 6.9, SD = 3.1) and network load trace (Mean = 197.3, SD = 324.8)]
  • Resource capabilities change dynamically due to
    the sharing of resources
  • The performance of Grid applications is affected
    by the capabilities of these resources
  • Applications want not only fast execution time,
    but also stable behavior
  • Significant deviation from the normal profile is
    defined as an anomaly [D. Denning]

3
Challenges
  • The problem is more complex because
  • Resources are shared
  • Resources are heterogeneous
  • Resources are distributed
  • Grid environments are complex

4
Solutions
  • Avoid or reduce the anomaly before it happens
  • Prediction
  • Proper scheduling strategy
  • Detect the application anomaly when it happens
  • Distinguish the anomaly from normal profiles of
    applications and resources
  • Diagnose the anomaly after it happens
  • Study the relationship between resource
    capabilities and application performance

5
Contributions
  • A set of new one-step-ahead prediction strategies
  • A conservative scheduling strategy
  • → More reliable application behavior
  • A statistical data reduction strategy
  • → Simplify the data analysis
  • Avoid or reduce the application anomaly before it
    happens
  • Detect the application anomaly when it happens
  • An anomaly detection and diagnosis strategy
    using signal processing techniques
  • → More accurate anomaly detection and diagnosis

Diagnose the application anomaly after it happens
6
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis
  • Summary

7
Avoid or Reduce Anomalies
  • Deliver fast and predictable behavior to
    applications
  • Proper scheduling strategy
  • Time balancing [F. Berman, H. Dail] --
    heterogeneous resources
  • Assign more workload to more powerful resources
  • Each resource finishes (roughly) at the same time
  • Why is it not enough?
  • Resource performance may change during execution
  • Resources with a larger capacity may also show a
    higher variance
  • Stochastic scheduling [J. Schopf]
  • Use stochastic data (average and variation)
  • My approaches
  • Prediction
  • Conservative Scheduling

8
CPU Load Prediction
  • Prediction: estimate future values using
    historical data
  • Key: correctly model the relationship of the
    historical data with future values
  • Time series modeling
  • Financial data prediction [A. Lendasse], earth
    and ocean sciences [L. Lawson], biomedical signal
    processing [H. Kato], networks [H. Braun, N.
    Groschwitz], etc.
  • CPU load prediction
  • NWS [R. Wolski]: mean-based, median-based, AR
    model
  • Dynamically selects the strategy for the next
    step
  • Linear models study [P. Dinda]: the AR model is
    the best one
  • Estimates the value directly
  • Question: what if we estimate the direction and
    variation separately?

9
Direction: Two Families of Strategies
  • Homeostatic: assume the values are
    self-correcting -- the value will return to the
    mean of the previous values
  • Tendency: assume that if the current value
    decreases, the next value will also decrease; if
    the current value increases, the next value will
    also increase

10
How Much is the Variation?
  • IncrementValue (or DecrementValue) can be:
  • Independent: IncrementValue = IncConstant
  • Relative: IncrementValue = CurValue ×
    IncFactor
  • Or:
  • Static: do not change the value at any step
  • Dynamic: adapt the constant at each step using
    real-time information (sketched below)
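A minimal sketch of the two variation models (Python; names such as inc_constant and inc_factor are illustrative, not from the slides):

    def increment_value(cur_value, mode, inc_constant=0.1, inc_factor=0.05):
        # IncrementValue under the two variation models;
        # DecrementValue is computed analogously.
        if mode == "independent":
            return inc_constant            # fixed step, independent of the load
        if mode == "relative":
            return cur_value * inc_factor  # step proportional to the current value
        raise ValueError(mode)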

11
Prediction Strategies
  • Create a set of predication strategies by
    different combinations of the directions and
    variations
  • Evaluate this set of prediction strategies on 12
    CPU time series
  • Experimental results show that dynamic tendency
    prediction strategy with mixed variation works
    best
  • Dynamic tendency prediction with mixed variation
  • Independent IncrementValue
  • Relative DecrementValue
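A one-step sketch of the winning combination, assuming the tendency rule from slide 9 (variable names are illustrative):

    def predict_next(prev_value, cur_value, inc_constant, dec_factor):
        # Dynamic tendency prediction with mixed variation:
        # independent IncrementValue, relative DecrementValue.
        if cur_value >= prev_value:
            # Value is rising: predict a further rise by a fixed constant.
            return cur_value + inc_constant
        # Value is falling: predict a further fall proportional to the value.
        return cur_value - cur_value * dec_factor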

12
Comparison Results
  • The dynamic tendency predictor with mixed
    variation outperforms NWS: 2-55% less error rate,
    on average 37% less error rate

Homeostatic and Tendency-based CPU Load
Predictions, L. Yang, I. Foster, and J. M.
Schopf, Proceedings of the International Parallel
and Distributed Processing Symposium (IPDPS), 2003.
13
Network Load Prediction
  • The tendency-based predictor with mixed variation
    does not work effectively on network data
  • NWS works better on average
  • Possible explanation:
  • Network data usually has much higher variation
    and smaller correlation between two adjacent
    values than CPU load data
  • Our strategies give high weight to the most
    recent data, and thus cannot track the tendency
    of network traces
  • NWS predictors take account of more statistical
    information
  • → For network load prediction, we will use NWS
    predictors

14
Mean Resource Capability Prediction
  • Calculate a mean resource capability time series
  • ai: the average resource capability over the time
    interval, which is approximately equal to the
    total execution time
  • Mean resource capability prediction

15
Resource Capability Variation Prediction
  • Calculate the standard deviation time series
  • si: the average difference between the resource
    capability and the mean resource capability over
    the time interval
  • Resource capability variation prediction (a
    plausible reconstruction of the lost formulas
    follows)
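The formulas on slides 14-15 did not survive conversion. A plausible reconstruction from the surrounding definitions, with v_{i1}, ..., v_{in} the n capability measurements falling in interval i:

$$a_i = \frac{1}{n}\sum_{j=1}^{n} v_{ij}, \qquad s_i = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(v_{ij} - a_i\right)^2}$$

The predictors of slides 8-13 (or the NWS predictors, for network data) are then applied to the a_i and s_i series to obtain one-step-ahead predictions of the mean capability and its variation.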

16
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis
  • Summary

17
Conservative Scheduling
  • A resource with a larger capacity may also show a
    higher variance in performance
  • Assign less work to less reliable
    (higher-variance) resources
  • Avoid the peak in the application performance
    caused by variance in the resource capability
  • Verified in two contexts
  • Computation-intensive application: Cactus
  • Parallel data transfer: GridFTP

18
Effective CPU Capability--Cactus
  • Conservative load prediction:
  • Effective CPU load = predicted mean + predicted
    SD
  • SD ↑ → effective CPU load ↑ → less workload
    allocated (see the allocation sketch below)
  • Other scheduling options:
  • One-Step Scheduling (OSS) [H. Dail, C. Liu]
  • Effective CPU load = predicted CPU load at
    the next step
  • Predicted Mean Interval Scheduling (PMIS)
  • Effective CPU load = predicted mean
  • History Mean Scheduling (HMS) [A. Turgeon, J.
    Weissman]
  • Effective CPU load = mean of the
    historical load collected during a 5-minute
    period preceding the application start time
  • History Conservative Scheduling (HCS) [J.
    Schopf]
  • Effective CPU load = mean + SD of the
    historical load collected during a 5-minute
    period preceding the application run
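How the effective load feeds time balancing, as a rough sketch (Python; the capability model speed / (1 + load) is an assumption for illustration, not from the slides):

    def allocate_work(total_work, speeds, effective_loads):
        # Time balancing: give each host work proportional to its
        # effective capability, so all hosts finish at about the same time.
        caps = [s / (1.0 + l) for s, l in zip(speeds, effective_loads)]
        total_cap = sum(caps)
        return [total_work * c / total_cap for c in caps]

A host whose predicted load mean + SD is high receives a smaller share, which is exactly the conservative bias the slide describes.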

19
An Example of Experimental Results
  • Comparison of five strategies with Cactus running
    on the Chiba City cluster
  • 32 machines used

20
Result Evaluation
  • Average mean and s.d. of execution time
  • Statistical analysis: t-test
  • Results show a statistically significant
    improvement of our strategy over the other
    strategies
  • The probability of the improvement happening by
    chance is quite small

Conservative Scheduling: Using Predicted Variance
to Improve Scheduling Decisions in Dynamic
Environments, L. Yang, J. M. Schopf, and I.
Foster, Proceedings of SuperComputing 2003.
Improving Parallel Data Transfer Times
Using Predicted Variances in Shared Networks, L.
Yang, J. M. Schopf, and I. Foster, CCGrid 2005.
21
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • → more stable application behavior
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis using signal
    processing techniques
  • Summary

22
Detect and Diagnose Anomalies
  • Anomaly detection sends an alarm on a significant
    deviation from the established normal profile
    [D. Denning]
  • How to build the application's normal profile?
  • User-defined specifications [R. Sekar]
  • Not effective for novel anomalies
  • Simple statistical methods (window average
    [G. Allen])
  • Not effective in shared and distributed
    environments
  • Problems:
  • Detect the anomaly using only the application
    performance information
  • Do not provide information for anomaly diagnosis
  • → Detect and diagnose application anomalies by
    analyzing resource load information

23
Questions and Solutions
  • Questions:
  • What kind of resource information should be used?
  • How to use the resource information to detect and
    diagnose application anomalous behavior?
  • My approaches:
  • Data reduction
  • Select only the necessary system metrics
  • Anomaly detection and diagnosis using signal
    processing based techniques
  • Use the selected system metrics to detect and
    diagnose application anomalous behavior

24
Data Reduction
  • The first step of anomaly detection and diagnosis
    is performance monitoring
  • Computer systems and applications continue to
    increase in complexity and size
  • Interactions among components are poorly
    understood
  • Two ways to understand the relationship between
    application and resource performance:
  • Performance model [A. Hoisie, M. Ripeanu, D.
    Kerbyson]
  • → Application-specific, expensive, etc.
  • Instrumentation and data analysis [A. Malony]
  • → Produces a tremendous amount of data
  • Need mechanisms to select only the necessary
    metrics
  • Two-step data reduction strategy

25
Reduce Redundant System Metrics
  • Some system metrics capture the same (or similar)
    information
  • Highly correlated (measured by the correlation
    coefficient r)
  • Only one is necessary; the others are redundant
  • Two questions:
  • A threshold value t (determined experimentally)
  • A method to compare
  • Traditional method: mathematical comparison
    [M. Knop]
  • Is the correlation coefficient r > t?
  • Problem: only a limited number of sample data are
    available

r may vary from run to run
26
Redundant Metrics Reduction Algorithm
  • Using the Z-test:
  • A statistical method
  • Determines whether the correlation is
    statistically significantly larger than the
    threshold value (with 95% confidence in the
    results)
  • Given a set of samples, we proceed as follows:
  • Perform the Z-test for r between every pair of
    system metrics
  • Group two metrics into one cluster if the Z-test
    shows their r value is statistically larger than
    the threshold value
  • The result is a set of system metric clusters
  • Only one metric from each cluster is kept as the
    representative of the cluster, while the others
    are deleted as redundant (see the sketch below)
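A minimal sketch of the pairwise test, assuming the standard Fisher z-transformation is the Z-test meant here (SciPy; function and variable names are illustrative):

    import numpy as np
    from scipy import stats

    def corr_exceeds_threshold(x, y, t=0.95, alpha=0.05):
        # One-sided Z-test: is corr(x, y) statistically larger than t?
        n = len(x)
        r = np.corrcoef(x, y)[0, 1]
        # Fisher z-transform both r and t; the SE of z(r) is 1/sqrt(n - 3)
        z = (np.arctanh(r) - np.arctanh(t)) * np.sqrt(n - 3)
        return z > stats.norm.ppf(1 - alpha)

Metrics whose pairwise test succeeds are merged into one cluster, and a single representative per cluster survives.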

27
Select Necessary System Metrics
  • Some system metrics may not relate to application
    performance
  • Backward Elimination (BE) stepwise regression
  • System metrics remaining after first step X
    (x1, x2, xn)
  • The application performance metric y
  • Regress the y on the set of x
  • y ?0?1x1?2x2?nxn
  • Delete one metric that either is irrelevant or
    that, given other metrics, is not useful to the
    model
  • F value
  • All metrics remaining are useful and necessary
    for capturing the variation of y
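A compact sketch of backward elimination, using a p-value cutoff on the per-coefficient statistics (statsmodels; the threshold and names are illustrative):

    import numpy as np
    import statsmodels.api as sm

    def backward_eliminate(X, y, names, alpha=0.05):
        # Repeatedly drop the least significant metric until every
        # remaining coefficient is statistically significant.
        cols = list(range(X.shape[1]))
        while cols:
            fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            pvals = fit.pvalues[1:]          # skip the intercept
            worst = int(np.argmax(pvals))
            if pvals[worst] <= alpha:
                break                        # all remaining metrics are useful
            del cols[worst]
        return [names[c] for c in cols]

For a single coefficient the F statistic is the square of the t statistic, so thresholding these p-values is equivalent to the slide's F-value criterion.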

28
Evaluation
  • Two criteria:
  • Necessary -- reduction degree (RD)
  • Total percentage of system metrics eliminated
  • Sufficient -- coefficient of determination (R2)
  • A statistical measurement
  • Indicates the proportion of variation in
    application performance explained by the
    selected system metrics
  • A larger R2 value means the selected system
    metrics capture the variation in application
    performance better
  • Applications:
  • Cactus (UCSD cluster)
  • GridFTP (PlanetLab)
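For reference, the standard definition of the coefficient of determination (not spelled out on the slide), with ŷ_i the fitted values of the regression and ȳ the mean observed performance:

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}$$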

29
Two-step Experiment
  • 1st step: data reduction
  • Use training data to select system metrics
  • 2nd step: verification
  • Are these system metrics sufficient?
  • Is the result stable?
  • How does this method compare with other
    strategies?
  • RAND randomly picks a subset of system metrics
    equal in number to those selected by our strategy
  • MAIN uses a subset of system metrics that are
    commonly used to model the performance of
    applications in other works [H. Dail, M. Ripeanu,
    S. Vazhkudai]

30
Data Reduction Result on Cactus Data
  • Six machines, 600 system metrics
  • Threshold ↑ → RD ↓, since fewer system metrics
    group into clusters and thus fewer are removed as
    redundant
  • Threshold ↑ → R2 ↑, since more information is
    available to model the application performance
  • The 22 system metrics selected capture 98% of the
    variance in performance when the threshold value
    is 0.95

31
Verification Result on Cactus Data
  • Using 11 chunks of data collected over a one-day
    period
  • SDR exhibited an average R2 value of 0.907
  • 55.0% and 98.5% higher than those of RAND and
    MAIN
  • It can better capture the application's behavior
  • Results are stable over time (24-hour period)

Statistical Data Reduction for Efficient
Application Performance Monitoring, L. Yang, J.
M. Schopf, C. L. Dumitrescu, and I. Foster,
CCGrid 2006.
32
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis using signal
    processing techniques
  • Summary

33
Application and Resources Behavior
  • Some resources may have periodic usage patterns
  • → A performance decrement of the application
    caused by a periodic resource usage pattern is
    normal
  • Challenges:
  • Different resources may show different usage
    patterns
  • Resource information is noisy
  • Solution: signal processing techniques

34
Example-detection

False alarms are reduced!
35
Example-diagnosis
The anomaly is related to network load!
36
Summary
  • Avoid or reduce the anomaly
  • A set of new one-step-ahead prediction strategies
    [IPDPS03]
  • Better CPU load prediction
  • A conservative scheduling strategy [SC03, SY03,
    CCGrid05]
  • Takes account of the predicted mean and variation
    of resource capability
  • More reliable application behavior!
  • Detect and diagnose the anomaly
  • A statistical data reduction strategy [CCGrid06]
  • Identify only the system metrics that are
    necessary and sufficient to capture application
    behavior
  • Simplify data analysis
  • An anomaly detection and diagnosis strategy
    (partially done)
  • Build resource usage patterns automatically
  • Detect and diagnose application anomalies

37
Work Remaining to be Done
  • Evaluate the detection and diagnosis strategy
  • Calculate:
  • Success rate
  • False alarm rate
  • Using:
  • Sweep3D data (partially done)
  • GridFTP data collected from machines in PlanetLab
  • Web server data

38
Time Table
  • Thesis work includes:
  • Introduction
  • Performance Prediction
  • Conservative Scheduling
  • Data Reduction
  • Anomaly detection and diagnosis using signal
    processing techniques
  • Discussion
  • May: evaluate results of the anomaly detection
    and diagnosis strategy
  • June: report on the anomaly detection and
    diagnosis work
  • Aug 1: 1st draft of chapters 1-3 of the thesis
  • Sep 1: 1st draft of chapters 4-6 of the thesis
  • Oct 1: revised thesis

39
  • Questions?
  • Thank you!

40
How Much is the Variation?
  • IncrementValue (or DecrementValue) can be:
  • Independent: IncrementValue = IncConstant
  • Relative: IncrementValue = CurValue ×
    IncFactor
  • Or:
  • Static: do not change the value at any step
  • Dynamic: adjust the constant at each step using
    an adaptation process (runnable form below)
  • Measure VT+1
  • RealIncValueT = VT+1 - VT
  • IncConstantT+1 = IncConstantT +
    (RealIncValueT - IncConstantT) × AdaptDegree
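The adaptation step as runnable code (Python; the lost operators in the update rule are reconstructed as an exponential-smoothing adjustment, which matches the slide's structure):

    def adapt_inc_constant(inc_constant, prev_value, cur_value, adapt_degree=0.5):
        # Move IncConstant toward the increment actually observed.
        real_inc = cur_value - prev_value                       # RealIncValue_T
        return inc_constant + (real_inc - inc_constant) * adapt_degree

With adapt_degree = 0 the predictor stays static; with adapt_degree = 1 it fully trusts the latest observed increment.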

41
Prediction Strategies
  • Four homeostatic prediction strategies
  • Independent static homeostatic prediction
    strategy
  • Independent dynamic homeostatic prediction
    strategy
  • Relative static homeostatic prediction strategy
  • Relative dynamic homeostatic prediction strategy
  • Three tendency-based prediction strategies
  • Independent dynamic tendency prediction
  • Relative dynamic tendency prediction
  • Dynamic tendency prediction with mixed variation
  • Independent IncrementValue
  • Relative DecrementValue
  • → Experimental results show that the dynamic
    tendency prediction strategy with mixed variation
    works best!

42
Autocorrelation Function
  • Autocorrelation function from lag 0 to lag 10 on
    CPU load traces
  • Autocorrelation function value at lag 1 and lag 2
    on 74 network performance time series

43
Cactus Experiment
  • Application: Cactus
  • To compare different methods fairly, a load-trace
    playback tool generates a background workload
    from a trace of the CPU load
  • 64 real-load traces with different means and s.d.
  • Execution time of Cactus: 1 minute to 10 minutes
  • Three clusters: UIUC, UCSD, Chiba City

44
Parallel Data Transfer Scheduling
  • Our Tuned Conservative Scheduling (TCS):
  • EffectiveBW = BWMean - TF × BWSD
  • SD ↑ → effective bandwidth ↓, less workload
    allocated
  • Other stochastic strategies:
  • Best One Scheduling (BOS)
  • Retrieve data from the source with the highest
    predicted mean bandwidth
  • Equal Allocation Scheduling (EAS)
  • Retrieve the same amount of data from each
    source
  • Mean Scheduling (MS) (TF = 0)
  • EffectiveBW = predicted BWMean
  • Non-tuned Stochastic Scheduling (NTSS) (TF = 1)
  • EffectiveBW = predicted BWMean - predicted
    BWSD

45
Tuning Factor Algorithm
  • EffectiveBW = predicted BWMean - TF × predicted
    BWSD
  • SD ↑ → effective bandwidth ↓, less workload
    allocated
  • If SD/Mean < 1, then TF ranges from ½ to ∞
  • Lower variability, so a higher effective BW is
    desired
  • If SD/Mean > 1, then TF ranges from 0 to ½
  • SD is higher than the mean; network performance
    is changing greatly, so we want a small effective
    BW
  • In both cases, the values of TF and TF × SD are
    inversely proportional to N
  • Other formulas would also work just fine

N = SD/Mean. If (N < 1): TF = 1/N - N/2. Else
(N > 1): TF = 1/(2N²).
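The tuning-factor rule as code (Python; the operators are reconstructed from the stated ranges -- TF runs from ½ toward ∞ as N → 0, and from ½ toward 0 as N grows -- so treat this as a sketch):

    def tuning_factor(mean, sd):
        n = sd / mean                    # coefficient of variation N
        if n < 1:
            return 1.0 / n - n / 2.0     # ranges over (1/2, inf) for 0 < N < 1
        return 1.0 / (2.0 * n * n)       # ranges over (0, 1/2] for N >= 1

    def effective_bw(mean, sd):
        return mean - tuning_factor(mean, sd) * sd

Both branches give TF = ½ at N = 1, so the rule is continuous there.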
46
TF and TF × SD as BW SD changes
  • Mean = 5, SD varies from 0 to 10
  • Both are inversely proportional to the BW
    standard deviation (and N)
  • The maximum of the TF × SD value equals the mean
    of the BW
  • Other functions are also feasible

47
Evaluation
  • Applications:
  • Cactus (UCSD cluster)
  • GridFTP (PlanetLab)
  • Data collected once every 30 seconds for 24 hours
  • Every data point: 100 system metric values per
    machine plus 1 application performance value
  • Collect system metrics on each machine using
    three utilities:
  • The sar command of the SYSSTAT tool set,
  • Network Weather Service (NWS) sensors, and
  • The Unix command ping

48
Verification Result on GridFTP Data
  • Using data collected from 24 different clients
  • SDR achieves a mean R2 value of 0.947
  • 92.5% and 28.1% higher than those of the RAND and
    MAIN strategies
  • Results are stable over different machines with
    the same configuration

49
Resource Performance Analysis
  • Solution: signal processing techniques
  • Denoising:
  • Fourier-transform-based method
  • Difficult to choose the width and shape of the
    filter
  • White noise is distributed across all
    frequencies or spatial scales; a Fourier-based
    filter is inefficient for filtering this kind of
    noise
  • Wavelet analysis offers a scale-independent and
    robust method to filter out noise
  • It is able to remove noise without losing useful
    information
  • Soft-threshold denoising technique (see the
    sketch below)
  • Constructing the normal profile:
  • Construct the periodic usage pattern of resources
  • Fourier transform -- capable of dealing with
    periodic signals
  • A performance decrement is tagged as an anomaly
    only when it is not caused by the resource usage
    pattern
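A minimal sketch of soft-threshold wavelet denoising, assuming PyWavelets and the common universal-threshold recipe (the wavelet choice and decomposition level are illustrative):

    import numpy as np
    import pywt

    def wavelet_denoise(signal, wavelet="db4", level=4):
        coeffs = pywt.wavedec(signal, wavelet, level=level)
        # Estimate the noise level from the finest detail coefficients
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745
        thresh = sigma * np.sqrt(2 * np.log(len(signal)))
        # Soft-threshold every detail band; keep the approximation band
        den = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft")
                             for c in coeffs[1:]]
        return pywt.waverec(den, wavelet)

The denoised load trace can then be Fourier-analyzed for periodic usage patterns before deciding whether a performance decrement is an anomaly.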