1
Anomaly Management in Grid Environments
  • Candidate: Lingyun Yang
  • Advisor: Ian Foster
  • University of Chicago

2
Grid Environments
  [Figures: CPU load and network load traces; Mean = 6.9, SD = 3.1 and Mean = 197.3, SD = 324.8]
  • Resource capabilities change dynamically due to
    the sharing of resources
  • The performance of Grid applications is affected
    by the capabilities of these resources
  • Applications want not only fast execution time,
    but also stable behavior
  • A significant deviation from the normal profile is
    defined as an anomaly [D. Denning]

3
Challenges
  • The problem is more complex because
  • Resources are shared
  • Resources are heterogeneous
  • Resources are distributed
  • Grid environments are complex

4
Solutions
  • Avoid or reduce the anomaly before it happens
    • Prediction
    • Proper scheduling strategy
  • Detect the application anomaly when it happens
    • Distinguish the anomaly from the normal profiles of
      applications and resources
  • Diagnose the anomaly after it happens
    • Study the relationship between resource
      capabilities and application performance

5
Contributions
  • Avoid or reduce the application anomaly before it
    happens
    • A set of new one-step-ahead prediction strategies
    • A conservative scheduling strategy
      → More reliable application behavior
  • Detect the application anomaly when it happens
  • Diagnose the application anomaly after it happens
    • A statistical data reduction strategy
      → Simplifies the data analysis
    • An anomaly detection and diagnosis strategy
      using signal processing techniques
      → More accurate anomaly detection and diagnosis
6
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis
  • Summary

7
Avoid or Reduce Anomalies
  • Deliver fast and predictable behavior to
    applications
  • Proper scheduling strategy
  • Time balancing [F. Berman, H. Dail] for heterogeneous
    resources
  • Assign more workload to more powerful resources
  • Each resource finishes (roughly) at the same time
  • Why is it not enough?
  • Resource performance may change during execution
  • Resources with a larger capacity may also show a
    higher variance
  • Stochastic scheduling [J. Schopf]
  • Use stochastic data (average and variation)
  • My approaches
  • Prediction
  • Conservative scheduling

8
CPU Load Prediction
  • Prediction: estimate future values using
    historical data
  • Key: correctly model the relationship of the
    historical data with future values
  • Time series modeling
  • Financial data prediction [A. Lendasse], earth
    and ocean sciences [L. Lawson], biomedical signal
    processing [H. Kato], etc.
  • CPU load prediction
  • NWS [R. Wolski]: mean-based, median-based, and AR
    models
  • Dynamically selects the strategy to use at each
    step
  • Linear models study [P. Dinda]: the AR model is the
    best one
  • These estimate the value directly
  • Question: what if we estimate the direction and
    variation separately?

9
Direction Two Families of Strategies
  • Homeostatic: assume the values are
    self-correcting; the value will return to the
    mean of the previous values
  • Tendency: assume that if the current value
    decreases, the next value will also decrease; if
    the current value increases, the next value will
    also increase

10
How Much is the Variation?
  • IncrementValue (or DecrementValue) can be
  • Independent: IncrementValue = IncConstant
  • Relative: IncrementValue = CurValue × IncFactor
  • Or
  • Static: do not change the value at any step
  • Dynamic: adapt the constant at each step using
    real-time information (see the sketch below)
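
A minimal Python sketch of the dynamic tendency predictor with mixed variation described above (the adaptation rule is detailed on slide 46). The parameter values and function names are illustrative assumptions, not the thesis's exact settings.

    # One-step-ahead tendency prediction with mixed variation:
    # an independent (constant) increment when the load is rising,
    # a relative (proportional) decrement when it is falling.
    def predict_next(prev, cur, inc_constant, dec_factor):
        if cur >= prev:
            return cur + inc_constant       # independent IncrementValue
        return cur - cur * dec_factor       # relative DecrementValue

    def predict_series(values, inc_constant=0.1, dec_factor=0.1,
                       adapt_degree=0.5):
        """Predict values[2:], adapting IncConstant as on slide 46."""
        preds = []
        for t in range(1, len(values) - 1):
            preds.append(predict_next(values[t - 1], values[t],
                                      inc_constant, dec_factor))
            # Dynamic adaptation: move IncConstant toward the increment
            # actually observed at this step.
            real_inc = values[t] - values[t - 1]
            if real_inc >= 0:
                inc_constant += (real_inc - inc_constant) * adapt_degree
        return preds

    cpu_load = [0.8, 1.1, 1.0, 1.3, 1.6, 1.2]
    print(predict_series(cpu_load))   # predictions for cpu_load[2:]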

11
Prediction Strategies
  • Create a set of prediction strategies by
    different combinations of the directions and
    variations
  • Evaluate this set of prediction strategies on 12
    CPU time series
  • Experimental results show that the dynamic tendency
    prediction strategy with mixed variation works
    best
  • Dynamic tendency prediction with mixed variation
  • Independent IncrementValue
  • Relative DecrementValue

12
Comparison Results
  • The dynamic tendency predictor with mixed variation
    outperforms NWS: 2% to 55% lower error rate, 37%
    lower on average

"Homeostatic and Tendency-based CPU Load
Predictions," L. Yang, I. Foster, and J. M.
Schopf, Proceedings of the International Parallel and
Distributed Processing Symposium (IPDPS), 2003.
13
Network Load Prediction
  • The tendency-based predictor with mixed variation
    does not work effectively on network data
  • NWS works better on average
  • Possible explanation
  • Network data usually has much higher variation
    and smaller correlation between two adjacent
    values than CPU load data
  • Our strategies give high weight to the most recent
    data, and thus cannot track the tendency of
    network traces
  • NWS predictors take more statistical information
    into account
  • → For network load prediction, we use NWS
    predictors

14
Mean Resource Capability Prediction
  • Calculate a mean resource capability time series
  • a_i = the average resource capability over a time
    interval approximately equal to the application's
    total execution time
  • Mean resource capability prediction

15
Resource Capability Variation Prediction
  • Calculate the standard deviation time series
  • S_i = the average difference between the resource
    capability and the mean resource capability over
    the time interval
  • Resource capability variation prediction
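
A minimal sketch of how the two derived series could be computed from a raw capability trace, assuming measurements arrive at a fixed rate and the interval length is given in samples (both assumptions; the exact windowing is not shown on the slides).

    import statistics

    def capability_series(trace, interval):
        """Split a capability trace into intervals roughly equal to the
        application's execution time; return the per-interval series
        (a_i = mean capability, S_i = standard deviation)."""
        means, sds = [], []
        for start in range(0, len(trace) - interval + 1, interval):
            window = trace[start:start + interval]
            means.append(statistics.mean(window))
            sds.append(statistics.pstdev(window))
        return means, sds

Each series is then forecast one step ahead (the tendency predictor for CPU data, NWS-style predictors for network data) to obtain the mean and variation of resource capability for the next run.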

16
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis
  • Summary

17
Conservative Scheduling
  • A resource with a larger capacity may also show a
    higher variance in performance
  • Assign less work to less reliable
    (higher-variance) resources
  • Avoid the peak in the application performance
    caused by variance in the resource capability
  • Verified in two contexts
  • Computation-intensive application: Cactus
  • Parallel data transfer: GridFTP

18
Effective CPU Capability--Cactus
  • Conservative load prediction (see the sketch after
    this list)
  • Effective CPU load = predicted Mean + predicted
    SD
  • SD ↑ → effective CPU load ↑ → less workload
    allocated
  • Other scheduling options
  • One-Step Scheduling (OSS) [H. Dail, C. Liu]
  • Effective CPU load = predicted CPU load at
    the next step
  • Predicted Mean Interval Scheduling (PMIS)
  • Effective CPU load = predicted Mean
  • History Mean Scheduling (HMS) [A. Turgeon, J.
    Weissman]
  • Effective CPU load = mean of the
    historical load collected during a 5-minute
    period preceding the application start time
  • History Conservative Scheduling (HCS) [J.
    Schopf]
  • Effective CPU load = mean + SD of the
    historical load collected during a 5-minute
    period preceding the application run
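
A sketch of conservative time balancing under these definitions. The 1/(1 + load) compute-rate model is an assumption (a common single-CPU queuing approximation), not the thesis's stated model.

    def allocate_work(total_work, pred_mean, pred_sd):
        """Give each host work proportional to its effective compute
        rate, so all hosts finish at (roughly) the same time.
        Effective CPU load = predicted mean + predicted SD; assume a
        host's delivered rate scales as 1 / (1 + effective load)."""
        rates = [1.0 / (1.0 + m + s) for m, s in zip(pred_mean, pred_sd)]
        total_rate = sum(rates)
        return [total_work * r / total_rate for r in rates]

    # The high-variance host gets less work than an equally loaded
    # but steadier host, avoiding stragglers.
    print(allocate_work(1000, [1.0, 1.0], [0.2, 2.0]))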

19
An Example of Experimental Results
  • Comparison of the five strategies for Cactus running
    on the Chiba City cluster
  • 32 machines used

20
Result Evaluation
  • Average mean and s.d. of execution time across
    repeated runs
  • Statistical analysis: the T-test
  • Results show a statistically significant improvement
    of our strategy over the other strategies
  • The probability of the improvement happening by
    chance is quite small

"Conservative Scheduling: Using Predicted Variance
to Improve Scheduling Decisions in Dynamic
Environments," L. Yang, J. M. Schopf, and I.
Foster, Proceedings of SuperComputing 2003.
"Improving Parallel Data Transfer Times Using
Predicted Variances in Shared Networks," L.
Yang, J. M. Schopf, and I. Foster, CCGrid 2005.
21
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • → more stable application behavior
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis using signal
    processing techniques
  • Summary

22
Detect and Diagnose Anomalies
  • Anomaly detection: send an alarm on significant
    deviation from the established normal profile
    [D. Denning]
  • Anomaly diagnosis: relate anomalous application
    behavior to anomalous resource behavior
  • Need to detect resource anomalies
  • Key: correctly define the normal profile of
    application and resource behaviors

Anomaly?
23
Build Normal Profile
  • Signature-based methods [K. Ilgun, T. Lunt]
  • Known patterns of anomalous activities
  • Specification-based methods [R. Sekar]
  • Logic-based rules specifying the legitimate
    system and/or application behaviors
  • Statistical methods [G. Allen]
  • Use statistics to construct a reference model of
    normal application and system behaviors

24
Questions and Solutions
  • Questions
  • What kind of resource information should be used?
  • How to use the resource information to detect and
    diagnose application anomalous behaviors?
  • My approaches
  • Statistical data reduction
  • Select only necessary system metrics
  • Anomaly detection and diagnosis using
    signal-processing-based techniques
  • Extends the window-average-based method
  • Statistical method
  • Simple and efficient

25
Data Reduction
  • The first step of anomaly detection and diagnosis is
    performance monitoring
  • Computer systems and applications continue to
    increase in complexity and size
  • Interactions among components are poorly
    understood
  • Two ways to understand the relationship between
    application and resource performance
  • Performance models [A. Hoisie, M. Ripeanu, D.
    Kerbyson]
  • → Application-specific, expensive, etc.
  • Instrumentation and data analysis [A. Malony]
  • → Produces a tremendous amount of data
  • Need mechanisms to select only the necessary metrics
  • Two-step data reduction strategy

26
Reduce Redundant System Metrics
  • Some system metrics capture the same (or similar)
    information
  • Highly correlated (measured by the correlation
    coefficient r)
  • Only one is necessary; the others are redundant
  • Two questions
  • A threshold value t (determined experimentally)
  • A method to compare
  • Traditional method: direct mathematical comparison
    [M. Knop]
  • Is the correlation coefficient r > t?
  • Problem: only a limited number of sample data points
    are available

r may vary from run to run
27
Redundant Metrics Reduction Algorithm
  • Use the Z-test
  • A statistical method
  • Determines whether the correlation is
    statistically significantly larger than the
    threshold value (with 95% confidence in the results)
  • Given a set of samples, we proceed as follows
  • Perform the Z-test for r between every pair of
    system metrics
  • Group two metrics into one cluster if the Z-test
    shows their r value is statistically larger than
    the threshold value
  • The result is a set of system metric clusters
  • Select one metric from each cluster as the
    representative and delete the others as redundant
    (see the sketch below)
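
A sketch of this clustering in Python, using the Fisher z-transform for the one-sided test. The single-link grouping and the 1.645 critical value (95%, one-sided) are assumptions; the slide does not spell out the exact clustering variant.

    import math

    def corr_exceeds(r, t, n, z_crit=1.645):
        """One-sided Z-test via the Fisher transform: is the true
        correlation statistically larger than threshold t, given a
        sample correlation r over n observations?"""
        if abs(r) >= 1.0:
            return True
        z = (math.atanh(r) - math.atanh(t)) * math.sqrt(n - 3)
        return z > z_crit

    def cluster_metrics(corr, n_samples, t=0.95):
        """corr[i][j] = sample correlation between metrics i and j.
        Greedily merge metrics whose correlation passes the Z-test;
        return one representative index per resulting cluster."""
        m = len(corr)
        label = list(range(m))
        for i in range(m):
            for j in range(i + 1, m):
                if corr_exceeds(corr[i][j], t, n_samples):
                    old, new = label[j], label[i]
                    for k in range(m):      # merge the two clusters
                        if label[k] == old:
                            label[k] = new
        return sorted(set(label))           # representatives, one per cluster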

28
Select Necessary System Metrics
  • Some system metrics may not relate to application
    performance
  • Backward Elimination (BE) stepwise regression
  • System metrics remaining after the first step:
    X = (x1, x2, ..., xn)
  • The application performance metric: y
  • Regress y on the set of x:
  • y = β0 + β1·x1 + β2·x2 + ... + βn·xn
  • Delete any metric that either is irrelevant or
    that, given the other metrics, is not useful to the
    model
  • Deletion criterion: the F value (see the sketch below)
  • All remaining metrics are useful and necessary
    for capturing the variation of y
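
A backward-elimination sketch with a partial F-test, using NumPy only. The cutoff f_drop = 4.0 (roughly the 5% critical value for moderate sample sizes) is an assumption; the slide does not give the exact stopping rule.

    import numpy as np

    def rss(X, y):
        """Residual sum of squares of the least-squares fit."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return float(np.sum((y - X @ beta) ** 2))

    def backward_eliminate(X, y, f_drop=4.0):
        """X: (n, p) metric matrix, y: (n,) performance vector.
        Repeatedly drop the metric with the smallest partial F value
        while that value stays below f_drop."""
        keep = list(range(X.shape[1]))
        while len(keep) > 1:
            Xf = np.column_stack([np.ones(len(y)), X[:, keep]])
            rss_full = rss(Xf, y)
            dof = len(y) - Xf.shape[1]
            f_vals = []
            for drop in range(len(keep)):
                cols = keep[:drop] + keep[drop + 1:]
                Xr = np.column_stack([np.ones(len(y)), X[:, cols]])
                f_vals.append((rss(Xr, y) - rss_full) / (rss_full / dof))
            worst = int(np.argmin(f_vals))
            if f_vals[worst] >= f_drop:
                break                 # every remaining metric is useful
            keep.pop(worst)
        return keep                   # indices of necessary metrics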

29
Evaluation
  • Two criteria
  • Necessary: reduction degree (RD)
  • Total percentage of system metrics eliminated
  • Sufficient: coefficient of determination (R²)
  • A statistical measurement
  • Indicates the proportion of variation in the
    application's performance explained by the
    selected system metrics
  • A larger R² value means the selected system metrics
    better capture the variation in application
    performance
  • Applications
  • Cactus (UCSD cluster)
  • GridFTP (PlanetLab)

30
Experimental Methodology
  • 1st step: data reduction
  • Use training data to select system metrics
  • 2nd step: verification
  • Are these system metrics sufficient?
  • Is the result stable?
  • How does this method compare with other
    strategies?
  • RAND: randomly picks a subset of system metrics
    equal in number to those selected by our strategy
  • MAIN: uses a subset of system metrics that are
    commonly used to model the performance of
    applications in other work [H. Dail, M. Ripeanu,
    S. Vazhkudai]

31
Data Reduction Result on Cactus Data
  • Six machines, 600 system metrics
  • Threshold ↑ → RD ↓, since fewer system metrics
    are grouped into clusters and removed as
    redundant
  • Threshold ↑ → R² ↑, since more information is
    available to model the application performance
  • The 22 system metrics selected can capture 98% of
    the variance in the performance when the threshold
    value = 0.95

32
Verification Result on Cactus Data
  • Using 11 chunks of data collected over a one-day
    period
  • SDR exhibited an average R² value of 0.907
  • 55.0% and 98.5% higher than those of RAND and
    MAIN
  • Can better capture the application's behavior
  • Results are stable over time (24-hour period)

"Statistical Data Reduction for Efficient
Application Performance Monitoring," L. Yang, J.
M. Schopf, C. L. Dumitrescu, and I. Foster,
CCGrid 2006.
33
Outline
  • Avoid or reduce the anomaly
  • Performance prediction
  • Conservative Scheduling
  • Detect and diagnose the anomaly
  • Performance monitoring and data reduction
  • Anomaly detection and diagnosis using signal
    processing techniques
  • Summary

34
Anomaly Detection and Diagnosis
  • Traditional window-average-based methods [G.
    Allen, J. Brutlag, D. Gunter]
  • Use the window average as the baseline for
    comparison (see the sketch below)
  • Simple and efficient
  • But some resources may have periodic usage patterns
  • Slowdowns caused by periodic resource usage
    patterns are normal
  • Results in high false positives if the periodic
    resource usage patterns are not taken into
    account properly
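
A sketch of the traditional window-average detector for reference. Treating the monitored value as a throughput-like metric (so a drop signals a problem) and the 20% tolerance are illustrative assumptions.

    def window_average_alarms(perf, window=32, tolerance=0.2):
        """Flag index t as anomalous when perf[t] falls more than
        `tolerance` below the mean of the preceding `window` samples."""
        alarms = []
        for t in range(window, len(perf)):
            baseline = sum(perf[t - window:t]) / window
            if perf[t] < baseline * (1.0 - tolerance):
                alarms.append(t)
        return alarms

Applied to a resource with a periodic usage pattern, every trough of the cycle trips this detector, which is exactly the false-positive problem described above.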

35
Challenges
  • In Grid environments
  • Resource measurements are noisy
  • Resources are distributed, with different
    administrative or access policies
  • Different resources may show different usage
    patterns, with different frequencies, shapes, and
    amplitudes
  • Need an approach that can identify periodic
    resource usage patterns automatically and
    dynamically
  • Solution: signal processing techniques
  • Fourier-transform-based method
  • Strong capability for frequency-domain analysis
    (see the sketch below)
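
A sketch of identifying the dominant period of a usage trace with the FFT; the sampling-interval parameter is an assumption about how the trace is collected.

    import numpy as np

    def dominant_period(load, sample_interval_s):
        """Return the period (seconds) of the strongest periodic
        component: the non-DC frequency bin with maximum magnitude."""
        load = np.asarray(load, dtype=float)
        spectrum = np.abs(np.fft.rfft(load - load.mean()))
        freqs = np.fft.rfftfreq(len(load), d=sample_interval_s)
        k = 1 + int(np.argmax(spectrum[1:]))   # skip the DC component
        return 1.0 / freqs[k]

A slowdown that recurs at this period can then be folded into the normal profile instead of being raised as an anomaly.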

36
Example: Detection

False positives are reduced!
37
Example: Diagnosis
This anomaly is related to network load!
38
Evaluation
  • Insert 100 anomalies randomly
  • Compare the results of three strategies
  • Traditional Window Average method (TWA)
  • Modified Window Average method with denoising
    only (MWAD)
  • Modified Window Average method (MWA)
  • Two criteria
  • Number of detected anomalies (HIT)
  • Number of false positives (FP)
  • Three applications
  • Cactus (UofC cluster)
  • GridFTP (Emulab)
  • Sweep3d (Emulab)

39
Experimental Methodology
  • Run Cactus on 4 shared Linux machines over two
    weeks
  • The CPU load shows a half-hour periodic usage
    pattern
  • The performance of Cactus is influenced by the
    periodic CPU load pattern
  • Insert 100 anomalies randomly while the application
    runs
  • By running resource-consumption tools
  • Collect 4 sets of data
  • The first data set is used as training data
  • To choose the window size and the data reduction
    threshold value
  • The other three sets of data are used for
    verification

40
Window Size
  • When the window size is small, FP is high and HIT
    is low
  • Window size ↑ → FP ↓ and HIT ↑
  • Window size > 32: FP < 60 and HIT > 90
  • Window size = 128: FP = 53, HIT = 96
  • For comparison, TWA achieves FP = 696, HIT = 99

41
Data Reduction Parameter
  • Threshold value ↑ → HIT ↑
  • FP levels off for threshold > 0.35
  • Threshold value = 0.9: HIT = 97

42
Cactus Results
  • Detection
  • Eliminates 90% of the FP; HIT = 93 to 96
  • Diagnosis
  • Relates application anomalies to anomalous
    resource behaviors
  • Reports the reasons correctly for 82 to 87
    anomalies

43
Summary of Contributions
  • Avoid or reduce the anomaly
  • A set of new one-step-ahead prediction strategies
    [IPDPS03]
  • Better CPU load prediction
  • A conservative scheduling strategy [SC03, SY03,
    CCGrid05]
  • Takes the predicted mean and variation of resource
    capability into account
  • More reliable application behavior!
  • Detect and diagnose the anomaly
  • A statistical data reduction strategy [CCGrid06]
  • Identifies the system metrics that are necessary
    and sufficient to capture application behavior
  • Simplifies data analysis
  • An anomaly detection and diagnosis strategy
  • Identifies periodic resource usage patterns
    automatically and dynamically
  • Reduces false positives significantly

44
Future Work
  • Anomaly prevention
  • Multi-job Scheduling
  • More advanced detection and diagnosis methods
  • Neural networks methods
  • Hidden Markov model methods

45
  • Questions?
  • Thank you!

46
How Much is the Variation?
  • IncrementValue (or DecrementValue) can be
  • Independent: IncrementValue = IncConstant
  • Relative: IncrementValue = CurValue × IncFactor
  • Or
  • Static: do not change the value at any step
  • Dynamic: adjust the constant at each step using
    an adaptation process
  • Measure V(T+1)
  • RealIncValue(T) = V(T+1) - V(T)
  • IncConstant(T+1) = IncConstant(T) +
    (RealIncValue(T) - IncConstant(T)) × AdaptDegree

47
Prediction Strategies
  • Four homeostatic prediction strategies
  • Independent static homeostatic prediction
    strategy
  • Independent dynamic homeostatic prediction
    strategy
  • Relative static homeostatic prediction strategy
  • Relative dynamic homeostatic prediction strategy
  • Three tendency-based prediction strategies
  • Independent dynamic tendency prediction
  • Relative dynamic tendency prediction
  • Dynamic tendency prediction with mixed variation
  • Independent IncrementValue
  • Relative DecrementValue
  • → Experimental results show that the dynamic
    tendency prediction strategy with mixed variation
    works best!

48
Autocorrelation Function
  • Autocorrelation function from lag 0 to lag 10 on
    CPU load traces
  • Autocorrelation function values at lags 1 and 2
    on 74 network performance time series

49
Cactus Experiment
  • Application: Cactus
  • To compare different methods fairly, a load-trace
    playback tool generates a background workload
    from a trace of the CPU load
  • 64 real load traces with different means and s.d.
  • Execution time of Cactus: 1 minute to 10 minutes
  • Three clusters: UIUC, UCSD, Chiba City

50
Parallel Data Transfer Scheduling
  • Our Tuned Conservative Scheduling (TCS)
  • EffectiveBW = BWMean - TF × BWSD
  • SD ↑ → effective bandwidth ↓ → less workload
    allocated
  • Other stochastic strategies
  • Best One Scheduling (BOS)
  • Retrieve data from the source with the highest
    predicted mean bandwidth
  • Equal Allocation Scheduling (EAS)
  • Retrieve the same amount of data from each
    source
  • Mean Scheduling (MS) (TF = 0)
  • EffectiveBW = predicted BWMean
  • Non-tuned Stochastic Scheduling (NTSS) (TF = 1)
  • EffectiveBW = predicted BWMean - predicted
    BWSD

51
Tuning Factor Algorithm
  • EffectiveBW = predicted BWMean - TF × predicted
    BWSD
  • SD ↑ → effective bandwidth ↓ → less workload
    allocated
  • SD/Mean < 1: TF ranges from ½ to ∞
  • Lower variability, so a higher effective BW is
    desired
  • SD/Mean > 1: TF ranges from 0 to ½
  • SD is higher than the mean; network performance is
    changing greatly, so a small effective BW is wanted
  • In both cases, the values of TF and TF × SD are
    inversely proportional to N (see the sketch below)
  • Other formulas would also work just fine

N = SD/Mean. If N < 1: TF = 1/N - N/2; else
(N ≥ 1): TF = 1/(2N²)
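
A direct transcription of the footnote's tuning factor in Python, with a small table illustrating the property stated on the next slide (TF × SD approaches the BW mean as SD shrinks). The zero-SD guard is an added assumption for the degenerate case.

    def tuning_factor(bw_mean, bw_sd):
        """N = SD/Mean; TF = 1/N - N/2 for N < 1, else 1/(2*N**2)."""
        if bw_sd == 0:
            return float("inf")        # degenerate: no variability at all
        n = bw_sd / bw_mean
        return 1.0 / n - n / 2.0 if n < 1.0 else 1.0 / (2.0 * n * n)

    # Both TF and TF*SD shrink as relative variability N grows:
    for sd in (0.1, 1.0, 5.0, 10.0):
        tf = tuning_factor(5.0, sd)    # BW mean fixed at 5
        print(f"SD={sd:5.1f}  TF={tf:7.3f}  TF*SD={tf * sd:6.3f}")
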
52
TF and TF × SD as the BW SD Changes
  • Mean = 5, SD varies from 0 to 10
  • Both are inversely proportional to the BW standard
    deviation (and N)
  • The maximum of the TF × SD value equals the mean
    of the BW
  • Other functions are also feasible

53
Evaluation
  • Applications
  • Cactus (UCSD cluster)
  • GridFTP (PlanetLab)
  • Data collected once every 30 seconds for 24 hours
  • Every data point: 100 system metric values per
    machine, plus 1 application performance value
  • Collect system metrics on each machine using
    three utilities
  • The sar command of the SYSSTAT tool set,
  • Network Weather Service (NWS) sensors, and
  • The Unix command ping

54
Verification Result on GridFTP Data
  • Using data collected from 24 different clients
  • SDR achieves a mean R² value of 0.947
  • 92.5% and 28.1% higher than those of the RAND and
    MAIN strategies
  • Results are stable over different machines with
    the same configuration

55
Resource Performance Analysis
  • Solution: signal processing techniques
  • Denoising
  • Fourier-transform-based method
  • Difficult to choose the width and shape of the
    filter
  • White noise is distributed across all
    frequencies and spatial scales; a Fourier-based
    filter is inefficient for filtering this kind of
    noise
  • Wavelet analysis offers a scale-independent and
    robust method to filter out noise
  • It can remove noise without losing useful
    information
  • Soft-threshold denoising technique (see the sketch
    below)
  • Constructing the normal profile
  • Construct the periodic usage pattern of resources
  • Fourier transform: capable of dealing with periodic
    signals
  • A performance decrement is tagged as an anomaly
    only when it is not caused by the resource usage
    pattern
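
A sketch of the two signal-processing steps, assuming the PyWavelets (pywt) and NumPy packages; the wavelet, decomposition level, universal-threshold rule, and number of kept frequency components are illustrative assumptions rather than the thesis's exact settings.

    import numpy as np
    import pywt

    def denoise(signal, wavelet="db4", level=4):
        """Soft-threshold wavelet denoising: shrink the detail
        coefficients, then reconstruct the signal."""
        coeffs = pywt.wavedec(signal, wavelet, level=level)
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745   # noise estimate
        thr = sigma * np.sqrt(2 * np.log(len(signal)))   # universal threshold
        coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft")
                                for c in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)[:len(signal)]

    def periodic_profile(load, keep=3):
        """Normal profile via the Fourier transform: keep only the
        `keep` strongest frequency components (plus the mean), then
        invert back to the time domain."""
        spec = np.fft.rfft(np.asarray(load, dtype=float))
        mags = np.abs(spec)
        cutoff = np.sort(mags[1:])[-keep]     # k-th largest non-DC peak
        spec[1:][mags[1:] < cutoff] = 0       # drop all weaker components
        return np.fft.irfft(spec, n=len(load))

A measured slowdown is then compared against this periodic baseline (after denoising) rather than against a flat window average.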

56
Experimental Methodology
  • Ran Cactus on 4 shared Linux machines over two
    weeks
  • Inserted 100 anomalies manually
  • By running resource-consumption tools
  • Consuming more than 90% of CPU, bandwidth, or
    memory
  • Anomalies caused by high CPU load
  • Anomalies caused by high bandwidth load
  • Anomalies caused by high memory load

57
Experimental Result- Diagnosis
  • Classify the system metrics into three
    categories
  • CPU-related, memory-related, and network-related
  • A total of 12 possible reasons on the four machines
  • Relate application anomalies to anomalous resource
    behavior
  • Check for anomalous resource behavior when an
    application anomaly has been identified
  • Count the number of resource anomalies in each
    category
  • Output all reasons occurring more than 10% of the
    time
  • Result
  • Of the 97 anomalies detected, our strategy
    reports the reasons for 89 anomalies
  • For 80 anomalies, the correct reason is reported
    as the most likely one

58
Experimental Methodology for GridFTP
  • Insert 100 network anomalies on network links in
    the path
  • No periodic resource usage pattern

59
GridFTP Result
  • Detection
  • Does not improve much, since FP is already small;
    HIT = 90 to 95
  • Diagnosis
  • Locates the problematic network links
  • Reports the reasons correctly for 73 to 81
    anomalies

60
Experimental Methodology for Sweep3d
  • Insert 100 network anomalies on network links
  • Emulate various periodic CPU load patterns for
    machines from different domains
  • Daily and hourly
  • Vary the problem size to change the computation /
    communication ratio
  • Small, medium, and large

61
Sweep3d Result - Detection
  • Does not improve much when FP is small
  • Reduces about 85% of FP when FP is large
  • HIT = 89 to 95

62
Sweep3d Result - Diagnosis
  • Locates the problematic network links
  • Reports the reasons correctly for 73 to 81
    anomalies