Title: A Statistical Learning Approach to Diagnosing eBay's Site
- Mike Chen, Alice Zheng, Jim Lloyd,
- Michael Jordan, Eric Brewer
- mikechen_at_cs.berkeley.edu
- Fast failure detection and diagnosis are critical to high availability
- But the exact root cause may not be required for many recovery techniques
- Many potential causes of failures
- Software bugs, hardware, configuration, network, database, etc.
- Manual diagnosis is slow and inconsistent
- Statistical approaches are ideal
- Simultaneously examine many possible causes of failures
- Robust to noise
- Lots of (noisy) data
- Near real-time detection and diagnosis
- Multiple independent failures
- Root cause might not be captured in logs
4. Talk Outline
- Introduction
- eBay's infrastructure
- 3 statistical approaches
- Early results
5. eBay's Infrastructure
- 2 physical tiers
- Web server/app server + DB
- Migrating to Java (WebSphere) from C++
- SuperCAL (Centralized Application Logging)
- API for app developers to log anything to CAL
- Runtime platform provides application-generic logging: cookie, host, URL, DB table(s), status, duration, etc.
- Supports nested txns
- A path can be identified via thread ID + host ID
6. SuperCAL Architecture
[Architecture diagram; labels: App Servers, LB Switch, Real-time msg bus]
- Stats
- 2K app servers, 40 SuperCAL machines
- 1B URLs/day
- 1TB raw logs/day (150GB gzipped), 200Mbps peak
7. Failure Analysis
- Summarize each transaction into a record of features (see the table below)
- What features are causing requests to fail?
- Txn type, txn name, pool, host, version, DB, or a combination of these?
- Different causes require different recovery techniques
ID  Type  Name          Pool  Host  Version  DB                  Status
1   URL   ViewFeedback  Cgi0  134   1.2.1    FeedbackDB, UserDB  NullPointer
2   URL   Bid           Cgi2  231   1.0.3    PriceDB             Success
8. 3 Approaches
- Machine learning
- Decision trees
- MinEntropy: eBay's greedy variant of decision trees
- Data mining
- Association rules
9. Decision Trees
- Classifiers developed in the statistical machine learning field
- Example: go skiing tomorrow?
- Learning => inferring the decision tree's rules from data
[Figure: example decision tree; branch labels: New snow, No new snow]
10. Decision Trees
- Feature selection
- Look for features that best separate the classes
- Different algorithms use different metrics to measure skewness (e.g. C4.5 uses information gain)
- The goal of the decision tree algorithm
- Split nodes until the leaves are pure enough or until no further split is possible
- i.e. pure => all data points have the same class label
- Use pruning heuristics to control over-fitting
TxnName Failed
MyEBay 636
MyEBaySeller 512
MyEBayLogin 736
Machine Failed
Attila 2985
Lenin 20
Marcus 4
Scipio 5
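The two tables above can be turned into a small worked example of information gain, the metric the slide attributes to C4.5. This is only a sketch: the failed counts come from the slide, but the succeeded counts (10,000 per value) are hypothetical, added so the calculation is possible, and the two tables are treated as independent illustrations because their failure totals differ.

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a list of class counts, e.g. [failed, succeeded]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(split):
    """split maps each feature value to its [failed, succeeded] counts."""
    parent = [sum(col) for col in zip(*split.values())]
    total = sum(parent)
    # Weighted average entropy of the children after splitting on the feature.
    remainder = sum(sum(c) / total * entropy(c) for c in split.values())
    return entropy(parent) - remainder

# Failed counts are from the slide; succeeded counts are HYPOTHETICAL.
by_txn_name = {
    "MyEBay": [636, 10000],
    "MyEBaySeller": [512, 10000],
    "MyEBayLogin": [736, 10000],
}
by_machine = {
    "Attila": [2985, 10000],
    "Lenin": [20, 10000],
    "Marcus": [4, 10000],
    "Scipio": [5, 10000],
}

gain_txn = information_gain(by_txn_name)
gain_machine = information_gain(by_machine)
print(f"TxnName gain: {gain_txn:.4f}  Machine gain: {gain_machine:.4f}")
```

Under these assumed counts, splitting on Machine gives a much larger gain than splitting on TxnName, because the failures are concentrated on one machine while they are spread almost evenly across the transaction names.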
11. Decision Trees Sample Output
- Pool = icgi1
- TxnName = LeaveFeedback: failed (8,1)
- TxnName = MyFeedback: failed (205,3)
- Pool = icgi2
- TxnName = Respond: failed (1)
- TxnName = ViewFeedback: failed (3554,52)
- Naïve diagnosis
- Pool=icgi1 and TxnName=LeaveFeedback
- Pool=icgi1 and TxnName=MyFeedback
- Pool=icgi2 and TxnName=Respond
- Pool=icgi2 and TxnName=ViewFeedback
12. Feature Selection Heuristics
- Ignore leaf nodes with no failed transactions
- Problem: noisy leaves
- Keep the top N leaves, or ignore nodes with < M failures
- Problem: features may not be independent
- Drop ancestor nodes that are subsumed by the leaves
- Rank by impact
- Sort the predicted causes by failure count
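The heuristics above can be sketched as a small post-processing pass over the leaves of a learned tree. This is one possible reading, not the talk's implementation: the function names are hypothetical, the first number in each leaf's parentheses from the sample output is assumed to be its failure count, and an ancestor is treated as "subsumed" when every transaction matching the leaf condition also matches the ancestor condition. The `sample_txns` list is a tiny made-up sample used only for that subsumption check.

```python
def trim_path(path, txns):
    """Drop ancestor conditions that the leaf condition already implies in the data."""
    *ancestors, (leaf_feature, leaf_value) = path
    leaf_txns = [t for t in txns if t[leaf_feature] == leaf_value]
    kept = [(f, v) for f, v in ancestors
            if not (leaf_txns and all(t[f] == v for t in leaf_txns))]
    return kept + [(leaf_feature, leaf_value)]

def diagnose(leaves, txns, min_failures=10):
    """leaves: list of (path, failed_count); path is a list of (feature, value)."""
    # Ignore leaves with no failures and noisy leaves with < M failures.
    candidates = [(p, n) for p, n in leaves if n > 0 and n >= min_failures]
    # Rank by impact: most failures first.
    candidates.sort(key=lambda c: c[1], reverse=True)
    # Trim ancestors that add no information beyond the leaf.
    return [(trim_path(p, txns), n) for p, n in candidates]

# Leaf paths follow the sample output; counts are an ASSUMED reading of it.
leaves = [
    ([("Pool", "icgi1"), ("TxnName", "LeaveFeedback")], 8),
    ([("Pool", "icgi1"), ("TxnName", "MyFeedback")], 205),
    ([("Pool", "icgi2"), ("TxnName", "Respond")], 1),
    ([("Pool", "icgi2"), ("TxnName", "ViewFeedback")], 3554),
]
# HYPOTHETICAL transactions: ViewFeedback runs only in icgi2, MyFeedback in two pools.
sample_txns = [
    {"Pool": "icgi2", "TxnName": "ViewFeedback"},
    {"Pool": "icgi2", "TxnName": "ViewFeedback"},
    {"Pool": "icgi1", "TxnName": "MyFeedback"},
    {"Pool": "icgi3", "TxnName": "MyFeedback"},
    {"Pool": "icgi1", "TxnName": "LeaveFeedback"},
    {"Pool": "icgi2", "TxnName": "Respond"},
]

diagnosis = diagnose(leaves, sample_txns)
for path, failed in diagnosis:
    print(" and ".join(f"{f}={v}" for f, v in path), failed)
```

With this sample, the pool is dropped from the ViewFeedback diagnosis because it carries no extra information, while it is kept for MyFeedback, which also runs in another pool.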
13. MinEntropy
- Entropy measures the randomness of data
- E.g. if failures are evenly distributed (very random), then entropy is high
- Rank features by normalized entropy
- Greedy approach: searches for the leaf node with the most failures
- Always produces exactly one diagnosis
- Deployed on the entire eBay site
- Sends real-time alerts to ops
- Pros: fast (<1s for 100K txns, and scales linearly)
- Cons: optimized for single faults
14. MinEntropy example
TxnName Errors
MyEBay 636
MyEBaySeller 512
MyEBayLogin 736
Pool Errors
Cgi0 12
Cgi1 4002
Cgi2 30
Cgi3 8
Cgi4 5
TxnType Errors
URL 4350
SQL 47
Version Errors
E293 3987
E291 15
Machine Errors
Attila 1985
Lenin 2002
Marcus 4
Scipio 0
Alert: Version E293 causing URL failures (not specific to any URL) in pool CGI1
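The ranking step behind this alert can be illustrated with the error counts from the tables above. This is a sketch of one plausible reading of "rank features by normalized entropy", not eBay's production code: dividing by log2 of the number of values is an assumed normalization, and the function name is hypothetical.

```python
import math

def normalized_entropy(counts):
    """Entropy of the error distribution over a feature's values, scaled to [0, 1]."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log2(p) for p in probs)
    # ASSUMPTION: normalize by the maximum entropy for this many values.
    return h / math.log2(len(counts)) if len(counts) > 1 else 0.0

# Error counts copied from the example tables.
errors = {
    "TxnName": {"MyEBay": 636, "MyEBaySeller": 512, "MyEBayLogin": 736},
    "Pool": {"Cgi0": 12, "Cgi1": 4002, "Cgi2": 30, "Cgi3": 8, "Cgi4": 5},
    "TxnType": {"URL": 4350, "SQL": 47},
    "Version": {"E293": 3987, "E291": 15},
    "Machine": {"Attila": 1985, "Lenin": 2002, "Marcus": 4, "Scipio": 0},
}

# Low entropy means the errors are concentrated on a few values of the feature.
ranking = sorted(errors, key=lambda f: normalized_entropy(list(errors[f].values())))
for feature in ranking:
    worst = max(errors[feature], key=errors[feature].get)
    score = normalized_entropy(list(errors[feature].values()))
    print(f"{feature:8s} entropy={score:.3f} worst value={worst}")
```

Under this reading, Version, Pool and TxnType have the lowest entropy, and their dominant values (E293, Cgi1, URL) are the ones named in the alert, whereas TxnName and Machine spread the errors across several values.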
15. Association Rules
- Data mining technique to compute item sets
- e.g. "Shoppers who bought this item also shopped for ..."
- Metrics
- Confidence = (# of A and B) / (# of A)
- Conditional probability of B given A
- Support = (# of A and B) / (total # of txns)
- Generates rules for all possible sets
- e.g. machine=abc, txn=login => status=NullPointer (conf=0.1, support=0.02)
- Applied to failure diagnosis
- Find all rules that have a failed status on the right, then rank by conf
- Pros: looks at combinations of features
- Cons: generates many rules
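The confidence and support metrics, applied to rules with a failed status on the right, can be sketched with a brute-force enumeration. This is for illustration only: a real miner would use an algorithm such as Apriori instead of enumerating every feature combination, and the function name, field names and transactions below are hypothetical.

```python
from itertools import combinations

def failure_rules(txns, features, status_key="Status", failed="Failed"):
    """Return (lhs, confidence, support) for every rule lhs => Status=Failed."""
    n = len(txns)
    counts = {}  # lhs itemset -> [# of A, # of A and B]
    for t in txns:
        items = sorted((f, t[f]) for f in features)
        for k in range(1, len(items) + 1):
            for lhs in combinations(items, k):
                c = counts.setdefault(lhs, [0, 0])
                c[0] += 1
                if t[status_key] == failed:
                    c[1] += 1
    rules = [(lhs, both / a, both / n)
             for lhs, (a, both) in counts.items() if both > 0]
    # Rank by confidence, highest first.
    return sorted(rules, key=lambda r: r[1], reverse=True)

# HYPOTHETICAL transactions.
txns = [
    {"Machine": "abc", "TxnName": "login", "Status": "Failed"},
    {"Machine": "abc", "TxnName": "login", "Status": "Success"},
    {"Machine": "abc", "TxnName": "bid", "Status": "Success"},
    {"Machine": "xyz", "TxnName": "login", "Status": "Success"},
]

rules = failure_rules(txns, ["Machine", "TxnName"])
for lhs, conf, support in rules:
    print(", ".join(f"{f}={v}" for f, v in lhs), f"=> Status=Failed (conf={conf:.2f}, support={support:.2f})")
```

Even this four-transaction example yields three overlapping rules for a single failure, which illustrates the "generates many rules" drawback.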
16. Association Rules Sample Output
- Sample output (rules containing failures)
- TxnType=URL, Pool=icgi2, TxnName=LeaveFeedback => Status=Failed, conf(0.28)
- Pool=icgi2, TxnName=LeaveFeedback => Status=Failed, conf(0.28)
- TxnType=URL, TxnName=LeaveFeedback => Status=Failed, conf(0.28)
- TxnName=LeaveFeedback => Status=Failed, conf(0.28)
- Problem: features may not be independent
- e.g. all LeaveFeedback txns are of type URL
- Drop rules that are subsumed by more specific rules
- Diagnosis: TxnName=LeaveFeedback
17. Experimental Setup
- Dataset
- About 1/8 of the whole site
- 10 one-minute traces, 4 with 2 concurrent faults
- Total of 14 independent faults
- True faults identified through post-mortems, ops chat logs, application logs, etc.
- Metrics
- Precision = (# of correctly identified faults) / (# of predicted faults)
- Recall = (# of correctly identified faults) / (# of true faults)
Feature:        Type | Name | Pool | Machine | Version | Database | Status
Values:         10   | 300  | 15   | 260     | 7       | 40       | 8
Fault type(s):  Host | DB | Host, Host | Host, DB | Host, SW | DB, SW
Traces:         2    | 4  | 1          | 1        | 1        | 1
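The two metrics can be computed per trace as in the sketch below, which uses the standard definitions: precision is the fraction of predicted faults that are true faults, and recall is the fraction of true faults that were predicted. The fault labels are hypothetical and are not taken from the dataset.

```python
def precision_recall(predicted, true):
    """Return (precision, recall) for one trace; faults are hashable labels."""
    predicted, true = set(predicted), set(true)
    correct = len(predicted & true)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(true) if true else 0.0
    return precision, recall

# HYPOTHETICAL trace with two concurrent true faults and three predictions.
true_faults = {("Host", "host-17"), ("DB", "db-3")}
predicted_faults = {("Host", "host-17"), ("Pool", "pool-9"), ("Version", "v-2")}

precision, recall = precision_recall(predicted_faults, true_faults)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here only one of the three predictions is right, so precision is 1/3, and one of the two true faults is found, so recall is 1/2.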
18. Results: DBs in Dataset
- True causes for DB-related failures are captured in the dataset
- Variable number of DBs used by each txn
- Feature selection heuristics
- Ignore leaf nodes with no failed transactions
- Noise filtering
- Ignore nodes with < M failures (in this case, M = 10)
- Path trimming
- Drop ancestor nodes subsumed by the leaf nodes
19. Results: DBs not in Dataset
- True cause not captured for DB-related failures
- C4.5 suffers from the unbalanced dataset
- i.e. it produces a single rule that predicts every txn to be successful
20. What's next?
- ROC curves
- Show the tradeoff between precision and recall
- Transient failures
- Up-sample to balance the dataset or use a cost matrix
- Some measure of the confidence of the prediction
- More data points
- Have 20hrs of logs that contain failures
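The up-sampling idea mentioned above can be sketched as random resampling of the minority class until every class is as large as the largest one. This is a generic illustration of the technique, not something the talk describes in detail; the function name, field names and data are hypothetical.

```python
import random
from collections import Counter

def upsample(txns, label_key="Status", seed=0):
    """Resample smaller classes with replacement until all classes are equal in size."""
    rng = random.Random(seed)
    groups = {}
    for t in txns:
        groups.setdefault(t[label_key], []).append(t)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw the missing number of examples from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# HYPOTHETICAL unbalanced trace: 98 successes and 2 failures.
trace = [{"TxnName": "Bid", "Status": "Success"} for _ in range(98)]
trace += [{"TxnName": "ViewFeedback", "Status": "Failed"} for _ in range(2)]

balanced = upsample(trace)
label_counts = Counter(t["Status"] for t in balanced)
print(label_counts)
```

With balanced classes, a classifier can no longer reach high accuracy by predicting that every transaction succeeds.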
21. Open Questions
- How to deal with multiple symptoms?
- E.g. a DB outage causing multiple types of requests to fail
- Treat it as multiple failures?
- Failure importance (count vs. rate)
- Two failures may have a similar failure count
- Low volume and higher failure rate vs. high volume and lower failure rate
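The count-versus-rate question can be made concrete with two made-up causes that have similar failure counts but very different volumes. The numbers below are hypothetical and only show that the two rankings can disagree.

```python
# HYPOTHETICAL causes: similar failure counts, very different request volumes.
causes = [
    {"name": "low-volume txn", "failed": 100, "total": 200},
    {"name": "high-volume txn", "failed": 110, "total": 100000},
]

by_count = sorted(causes, key=lambda c: c["failed"], reverse=True)
by_rate = sorted(causes, key=lambda c: c["failed"] / c["total"], reverse=True)

print("ranked by count:", [c["name"] for c in by_count])
print("ranked by rate: ", [c["name"] for c in by_rate])
```

Ranking by count puts the high-volume transaction first, while ranking by rate puts the low-volume transaction first, so the choice of importance measure changes which failure is reported as the top cause.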