1
A Statistical Learning Approach to Diagnosing eBay's Site
  • Mike Chen, Alice Zheng, Jim Lloyd,
  • Michael Jordan, Eric Brewer
  • mikechen@cs.berkeley.edu

2
Motivation
  • Fast failure detection and diagnosis are critical
    to high availability
  • But the exact root cause may not be required for
    many recovery techniques
  • Many potential causes of failures
  • Software bugs, hardware, configuration, network,
    database, etc.
  • Manual diagnosis is slow and inconsistent
  • Statistical approaches are ideal
  • Simultaneously examining many possible causes of
    failures
  • Robust to noise

3
Challenges
  • Lots of (noisy) data
  • Near real-time detection and diagnosis
  • Multiple independent failures
  • Root cause might not be captured in logs

4
Talk Outline
  • Introduction
  • eBay's infrastructure
  • 3 statistical approaches
  • Early results

5
eBay's Infrastructure
  • 2 physical tiers
  • Web server/app server + DB
  • Migrating to Java (WebSphere) from C
  • SuperCAL (Centralized Application Logging)
  • API for app developers to log anything to CAL
  • Runtime platform provides application-generic
    logging: cookie, host, URL, DB table(s), status,
    duration, etc.
  • Supports nested txns
  • A path can be identified via thread ID + host ID

6
SuperCAL Architecture
[Architecture diagram: LB switch, app servers, real-time message bus, and detection/diagnosis components]
  • Stats
  • 2K app servers, 40 SuperCAL machines
  • 1B URLs/day
  • 1TB raw logs/day (150GB gzipped), 200Mbps peak

7
Failure Analysis
  • Summarize each transaction into a set of features
    plus a class (status) label; see the sketch after
    the table below
  • What features are causing requests to fail?
  • Txn type, txn name, pool, host, version, DB, or a
    combination of these?
  • Different causes require different recovery
    techniques

ID  Type  Name          Pool  Host  Version  DB                  Status
1   URL   ViewFeedback  Cgi0  134   1.2.1    FeedbackDB, UserDB  NullPointer
2   URL   Bid           Cgi2  231   1.0.3    PriceDB             Success
3   XML
(Type through DB are the features; Status is the class)
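Not part of the original deck: a minimal Python sketch of this summarization
step, assuming a dict-shaped log record with the field names from the table
above; treating any status other than "Success" as a failure is an assumption
made here for illustration.

    def summarize(txn):
        """Turn one logged transaction into (feature dict, class label)."""
        features = {
            "Type": txn["Type"],
            "Name": txn["Name"],
            "Pool": txn["Pool"],
            "Host": txn["Host"],
            "Version": txn["Version"],
            "DB": tuple(txn.get("DB", [])),  # variable number of DBs per txn
        }
        label = "success" if txn["Status"] == "Success" else "failed"
        return features, label

    # Example record mirroring row 1 of the table:
    txn = {"Type": "URL", "Name": "ViewFeedback", "Pool": "Cgi0", "Host": "134",
           "Version": "1.2.1", "DB": ["FeedbackDB", "UserDB"], "Status": "NullPointer"}
    print(summarize(txn))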
8
3 Approaches
  • Machine learning
  • Decision trees
  • MinEntropy: eBay's greedy variant of decision
    trees
  • Data mining
  • Association rules

9
Decision Trees
  • Classifiers developed in the statistical machine
    learning field
  • Example: go skiing tomorrow?
  • learning => inferring the decision tree's rules
    from data

[Decision tree diagram: the root splits on new snow vs. no new snow, then on cloudy vs. sunny, with Y/N leaves]
10
Decision Trees
  • Feature selection
  • Look for the features that best separate the classes
  • Different algorithms use different metrics to
    measure skewness (e.g. C4.5 uses information
    gain; a small sketch follows the tables below)
  • The goal of the decision tree algorithm:
  • split nodes until leaves are pure enough or
    until no further split is possible
  • i.e. pure => all data points have the same class
    label
  • Use pruning heuristics to control over-fitting

TxnName Failed
MyEBay 636
MyEBaySeller 512
MyEBayLogin 736

Machine Failed
Attila 2985
Lenin 20
Marcus 4
Scipio 5
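A minimal sketch (editorial, not from the talk) of the information-gain
metric C4.5 uses to pick the split feature. The failed counts below come
from the two tables above; the success counts are invented here, since the
slides do not give them, and the two tables are separate examples.

    import math

    def entropy(failed, ok):
        """Entropy (in bits) of a two-class failed/ok distribution."""
        total = failed + ok
        e = 0.0
        for n in (failed, ok):
            if n:
                p = n / total
                e -= p * math.log2(p)
        return e

    def info_gain(counts):
        """Information gain of splitting on a feature; counts maps value -> (failed, ok)."""
        failed = sum(f for f, _ in counts.values())
        ok = sum(o for _, o in counts.values())
        total = failed + ok
        children = sum((f + o) / total * entropy(f, o) for f, o in counts.values())
        return entropy(failed, ok) - children

    by_txn_name = {"MyEBay": (636, 90000), "MyEBaySeller": (512, 45000),
                   "MyEBayLogin": (736, 60000)}                 # ok counts are made up
    by_machine = {"Attila": (2985, 50000), "Lenin": (20, 52000),
                  "Marcus": (4, 48000), "Scipio": (5, 51000)}   # ok counts are made up
    print(info_gain(by_txn_name), info_gain(by_machine))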

11
Decision Trees Sample Output
  • Pool = icgi1
  • TxnName = LeaveFeedback: failed (8,1)
  • TxnName = MyFeedback: failed (205,3)
  • Pool = icgi2
  • TxnName = Respond: failed (1)
  • TxnName = ViewFeedback: failed (3554,52)
  • Naïve diagnosis
  • Pool=icgi1 and TxnName=LeaveFeedback
  • Pool=icgi1 and TxnName=MyFeedback
  • Pool=icgi2 and TxnName=Respond
  • Pool=icgi2 and TxnName=ViewFeedback

[Tree diagram: icgi1 → LeaveFdbk (8), MyFdbk (205); icgi2 → Respond (1), ViewFdbk (3554)]
12
Feature Selection Heuristics
  • Ignore leaf nodes with no failed transactions
  • Problem: noisy leaves
  • keep the top N leaves, or ignore nodes with < M
    failures
  • Problem: features may not be independent
  • drop ancestor nodes that are subsumed by the
    leaves
  • Rank by impact
  • sort the predicted causes by failure count (see
    the sketch after the diagram below)

[Same tree diagram as before: icgi1 → LeaveFdbk (8), MyFdbk (205); icgi2 → Respond (1), ViewFdbk (3554)]
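An illustrative sketch (not eBay's code) of the leaf heuristics above, using
the leaves from the sample output: drop leaves with no failures, filter noise
with a threshold M, and rank the survivors by impact. The ancestor-subsumption
trimming is omitted here because it needs the tree structure; M = 10 matches
the value used later in the results.

    M = 10  # noise threshold; slide 18 uses M = 10
    leaves = [
        ({"Pool": "icgi1", "TxnName": "LeaveFeedback"}, 8),
        ({"Pool": "icgi1", "TxnName": "MyFeedback"}, 205),
        ({"Pool": "icgi2", "TxnName": "Respond"}, 1),
        ({"Pool": "icgi2", "TxnName": "ViewFeedback"}, 3554),
    ]

    # Keep leaves with at least M failed txns, then rank by failure count.
    candidates = [(path, failed) for path, failed in leaves if failed >= M]
    candidates.sort(key=lambda pf: pf[1], reverse=True)
    for path, failed in candidates:
        print(failed, path)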
13
MinEntropy
  • Entropy measures the randomness of data
  • E.g. if failures are evenly distributed (very
    random), entropy is high
  • Rank features by normalized entropy
  • Greedy search descends into the leaf node with the
    most failures (see the sketch after the example on
    the next slide)
  • Always produces exactly one diagnosis
  • Deployed on the entire eBay site
  • Sends real-time alerts to ops
  • Pros: fast (<1 s for 100K txns, and scales
    linearly)
  • Cons: optimized for single faults

14
MinEntropy example
TxnName Errors
MyEBay 636
MyEBaySeller 512
MyEBayLogin 736

Pool Errors
Cgi0 12
Cgi1 4002
Cgi2 30
Cgi3 8
Cgi4 5

TxnType Errors
URL 4350
SQL 47
EMAIL 12
XSLT 0

Version Errors
E293 3987
E291 15

Machine Errors
Attila 1985
Lenin 2002
Marcus 4
Scipio 0

Alert: Version E293 causing URL failures (not
specific to any URL) in pool Cgi1
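An editorial sketch of the MinEntropy idea using the error counts above:
compute the normalized entropy of the errors over each feature's values, rank
features from most to least skewed, and report the value holding the most
errors. The normalization shown (dividing by log2 of the number of values) and
the single greedy step are assumptions; eBay's implementation may differ in
its details.

    import math

    def normalized_entropy(counts):
        """Entropy of the error distribution, normalized to [0, 1]."""
        total = sum(counts.values())
        if total == 0 or len(counts) < 2:
            return 1.0
        h = -sum((c / total) * math.log2(c / total) for c in counts.values() if c)
        return h / math.log2(len(counts))

    errors = {
        "Pool": {"Cgi0": 12, "Cgi1": 4002, "Cgi2": 30, "Cgi3": 8, "Cgi4": 5},
        "TxnType": {"URL": 4350, "SQL": 47, "EMAIL": 12, "XSLT": 0},
        "Version": {"E293": 3987, "E291": 15},
        "Machine": {"Attila": 1985, "Lenin": 2002, "Marcus": 4, "Scipio": 0},
    }

    # Most skewed features first; for each, the value holding the most errors.
    for feature in sorted(errors, key=lambda f: normalized_entropy(errors[f])):
        suspect = max(errors[feature], key=errors[feature].get)
        print(f"{feature}: H_norm={normalized_entropy(errors[feature]):.3f} -> {suspect}")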
15
Association Rules
  • Data mining technique to compute item sets
  • e.g. "Shoppers who bought this item also shopped
    for ..."
  • Metrics
  • Confidence = (# of A and B) / (# of A)
  • Conditional probability of B given A
  • Support = (# of A and B) / (total # of txns)
  • Generates rules for all possible sets
  • e.g. machine=abc, txn=login =>
    status=NullPointer (conf=0.1, support=0.02)
  • Applied to failure diagnosis
  • Find all rules that have a failed status on the
    right, then rank by confidence (sketched in code
    below)
  • Pros: looks at combinations of features
  • Cons: generates many rules
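An illustrative brute-force sketch (not eBay's miner) of generating such
rules: enumerate feature-value itemsets over a handful of toy transaction
records and compute confidence and support for rules that predict a failure.
The records are invented; a real miner (e.g. Apriori) prunes this search
instead of enumerating every combination.

    from itertools import combinations

    txns = [  # toy records, not eBay data
        {"TxnType": "URL", "Pool": "icgi2", "TxnName": "LeaveFeedback", "failed": True},
        {"TxnType": "URL", "Pool": "icgi2", "TxnName": "LeaveFeedback", "failed": False},
        {"TxnType": "URL", "Pool": "icgi1", "TxnName": "Bid", "failed": False},
        {"TxnType": "URL", "Pool": "icgi2", "TxnName": "ViewFeedback", "failed": False},
    ]
    features = ["TxnType", "Pool", "TxnName"]

    rules = []
    for r in range(1, len(features) + 1):
        for combo in combinations(features, r):
            # Distinct value combinations that actually occur for this feature set.
            itemsets = {tuple((f, t[f]) for f in combo) for t in txns}
            for items in itemsets:
                match = [t for t in txns if all(t[f] == v for f, v in items)]
                fails = sum(t["failed"] for t in match)
                if fails:  # keep only rules with a failure on the right-hand side
                    rules.append((fails / len(match), fails / len(txns), items))

    for conf, supp, items in sorted(rules, reverse=True):
        print(f"{dict(items)} => failed  conf={conf:.2f} supp={supp:.2f}")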

16
Association Rules Sample Output
  • Sample output (rules containing failures)
  • TxnType=URL Pool=icgi2 TxnName=LeaveFeedback =>
    Status=Failed conf(0.28)
  • Pool=icgi2 TxnName=LeaveFeedback =>
    Status=Failed conf(0.28)
  • TxnType=URL TxnName=LeaveFeedback =>
    Status=Failed conf(0.28)
  • TxnName=LeaveFeedback => Status=Failed
    conf(0.28)
  • Problem: features may not be independent
  • e.g. all LeaveFeedback txns are of type URL
  • Drop rules that are subsumed by more specific
    rules
  • Diagnosis: TxnName=LeaveFeedback

17
Experimental Setup
  • Dataset
  • About 1/8 of the whole site
  • 10 one-minute traces, 4 with 2 concurrent faults
  • total of 14 independent faults
  • True faults identified through post-mortems, ops
    chat logs, application logs, etc.
  • Metrics (a worked example follows the tables
    below)
  • Precision = (# of correctly identified faults) /
    (# of predicted faults)
  • Recall = (# of correctly identified faults) /
    (# of true faults)

Feature          Type  Name  Pool  Machine  Version  Database  Status
Distinct values  10    300   15    260      7        40        8

Fault type  Host  DB  Host,Host  Host,DB  Host,SW  DB,SW
Traces      2     4   1          1        1        1
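A hypothetical worked example of the two metrics (numbers invented, not from
the experiments): suppose a trace contains 2 true faults and the algorithm
predicts 3 faults, of which 2 are real.

    # Hypothetical counts for one trace.
    identified, predicted, true = 2, 3, 2
    precision = identified / predicted   # 2/3 ~= 0.67
    recall = identified / true           # 2/2 = 1.0
    print(precision, recall)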
18
Results: DBs in Dataset
  • True causes for DB-related failures are captured
    in the dataset
  • Variable number of DBs used by each txn
  • Feature selection heuristics
  • Ignore leaf nodes with no failed transactions
  • Noise filtering
  • ignore nodes with < M failures (in this case,
    M = 10)
  • Path trimming
  • drop ancestor nodes subsumed by the leaf nodes

19
Results: DBs not in Dataset
  • True cause not captured for DB-related failures
  • C4.5 suffers from the unbalanced dataset
  • i.e. it produces a single rule that predicts every
    txn to be successful

20
What's next?
  • ROC curves
  • show tradeoff between precision and recall
  • Transient failures
  • Up-sample to balance the dataset or use a cost
    matrix (see the sketch after this list)
  • Some measure of the confidence of the
    prediction
  • More data points
    Have 20 hrs of logs that contain failures
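A rough sketch of the up-sampling idea (editorial; the cost-matrix route is
the alternative the slide mentions): replicate the rare failed transactions
until the classes are roughly balanced before training the classifier.

    import random

    def upsample(txns, label_key="failed"):
        """Replicate minority-class records until both classes are the same size."""
        pos = [t for t in txns if t[label_key]]
        neg = [t for t in txns if not t[label_key]]
        if not pos or not neg:
            return list(txns)
        minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
        extra = random.choices(minority, k=len(majority) - len(minority))
        return majority + minority + extra

    balanced = upsample([{"failed": True}] * 3 + [{"failed": False}] * 97)
    print(sum(t["failed"] for t in balanced), len(balanced))  # 97 194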

21
Open Questions
  • How to deal with multiple symptoms?
  • E.g. DB outage causing multiple types of requests
    to fail
  • Treat it as multiple failures?
  • Failure importance (count vs. rate)
  • Two failures may have similar failure count
  • Low volume and higher failure rate vs. high
    volume and lower failure rate