1
A Statistical Learning Approach to Diagnosing eBay's Site
  • Mike Chen, Alice Zheng, Jim Lloyd,
  • Michael Jordan, Eric Brewer
  • mikechen@cs.berkeley.edu

2
Motivation
  • Fast failure detection and diagnosis are critical
    to high availability
  • But the exact root cause may not be required for
    many recovery techniques
  • Many potential causes of failures
  • Software bugs, hardware, configuration, network,
    database, etc.
  • Manual diagnosis is slow and inconsistent
  • Statistical approaches are ideal
  • Simultaneously examining many possible causes of
    failures
  • Robust to noise

3
Challenges
  • Lots of (noisy) data
  • Near real-time detection and diagnosis
  • Multiple independent failures
  • Root cause might not be captured in logs

4
Talk Outline
  • Introduction
  • eBay's infrastructure
  • 3 statistical approaches
  • Early results

5
eBay's Infrastructure
  • 2 physical tiers
  • Web server/app server + DB
  • Migrating to Java (WebSphere) from C
  • SuperCAL (Centralized Application Logging)
  • API for app developers to log anything to CAL
  • Runtime platform provides application-generic
    logging: cookie, host, URL, DB table(s), status,
    duration, etc.
  • Supports nested txns
  • A path can be identified via thread ID + host ID

6
SuperCAL Architecture
[Architecture diagram: LB switch, app servers, real-time message bus, and detection/diagnosis components]
  • Stats
  • 2K app servers, 40 SuperCAL machines
  • 1B URLs/day
  • 1TB raw logs/day (150GB gzipped), 200Mbps peak

7
Failure Analysis
  • Summarize each transaction into a set of features
    plus a class (status) label; see the sketch after
    the table below
  • What features are causing requests to fail?
  • Txn type, txn name, pool, host, version, DB, or a
    combination of these?
  • Different causes require different recovery
    techniques

ID  Type  Name          Pool  Host  Version  DB                  Status
1   URL   ViewFeedback  Cgi0  134   1.2.1    FeedbackDB, UserDB  NullPointer
2   URL   Bid           Cgi2  231   1.0.3    PriceDB             Success
3   XML
(Type through DB are the features; Status is the class)
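Not part of the original deck: a minimal Python sketch of this summarization
step, assuming a dict-shaped log record with the field names from the table
above; treating any status other than "Success" as a failure is an assumption
made here for illustration.

    def summarize(txn):
        """Turn one logged transaction into (feature dict, class label)."""
        features = {
            "Type": txn["Type"],
            "Name": txn["Name"],
            "Pool": txn["Pool"],
            "Host": txn["Host"],
            "Version": txn["Version"],
            "DB": tuple(txn.get("DB", [])),  # variable number of DBs per txn
        }
        label = "success" if txn["Status"] == "Success" else "failed"
        return features, label

    # Example record mirroring row 1 of the table:
    txn = {"Type": "URL", "Name": "ViewFeedback", "Pool": "Cgi0", "Host": "134",
           "Version": "1.2.1", "DB": ["FeedbackDB", "UserDB"], "Status": "NullPointer"}
    print(summarize(txn))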
8
3 Approaches
  • Machine learning
  • Decision trees
  • MinEntropy: eBay's greedy variant of decision
    trees
  • Data mining
  • Association rules

9
Decision Trees
  • Classifiers developed in the statistical machine
    learning field
  • Example: go skiing tomorrow?
  • learning => inferring the decision tree's rules
    from data

[Decision tree diagram: the root splits on new snow vs. no new snow, then on cloudy vs. sunny, with Y/N leaves]
10
Decision Trees
  • Feature selection
  • Look for the features that best separate the classes
  • Different algorithms use different metrics to
    measure skewness (e.g. C4.5 uses information
    gain; a small sketch follows the tables below)
  • The goal of the decision tree algorithm:
  • split nodes until leaves are pure enough or
    until no further split is possible
  • i.e. pure => all data points have the same class
    label
  • Use pruning heuristics to control over-fitting

TxnName Failed
MyEBay 636
MyEBaySeller 512
MyEBayLogin 736

Machine Failed
Attila 2985
Lenin 20
Marcus 4
Scipio 5
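A minimal sketch (editorial, not from the talk) of the information-gain
metric C4.5 uses to pick the split feature. The failed counts below come
from the two tables above; the success counts are invented here, since the
slides do not give them, and the two tables are separate examples.

    import math

    def entropy(failed, ok):
        """Entropy (in bits) of a two-class failed/ok distribution."""
        total = failed + ok
        e = 0.0
        for n in (failed, ok):
            if n:
                p = n / total
                e -= p * math.log2(p)
        return e

    def info_gain(counts):
        """Information gain of splitting on a feature; counts maps value -> (failed, ok)."""
        failed = sum(f for f, _ in counts.values())
        ok = sum(o for _, o in counts.values())
        total = failed + ok
        children = sum((f + o) / total * entropy(f, o) for f, o in counts.values())
        return entropy(failed, ok) - children

    by_txn_name = {"MyEBay": (636, 90000), "MyEBaySeller": (512, 45000),
                   "MyEBayLogin": (736, 60000)}                 # ok counts are made up
    by_machine = {"Attila": (2985, 50000), "Lenin": (20, 52000),
                  "Marcus": (4, 48000), "Scipio": (5, 51000)}   # ok counts are made up
    print(info_gain(by_txn_name), info_gain(by_machine))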

11
Decision Trees Sample Output
  • Pool = icgi1
  • TxnName = LeaveFeedback: failed (8,1)
  • TxnName = MyFeedback: failed (205,3)
  • Pool = icgi2
  • TxnName = Respond: failed (1)
  • TxnName = ViewFeedback: failed (3554,52)
  • Naïve diagnosis
  • Pool=icgi1 and TxnName=LeaveFeedback
  • Pool=icgi1 and TxnName=MyFeedback
  • Pool=icgi2 and TxnName=Respond
  • Pool=icgi2 and TxnName=ViewFeedback

[Tree diagram: icgi1 → LeaveFdbk (8), MyFdbk (205); icgi2 → Respond (1), ViewFdbk (3554)]
12
Feature Selection Heuristics
  • Ignore leaf nodes with no failed transactions
  • Problem: noisy leaves
  • keep the top N leaves, or ignore nodes with < M
    failures
  • Problem: features may not be independent
  • drop ancestor nodes that are subsumed by the
    leaves
  • Rank by impact
  • sort the predicted causes by failure count (see
    the sketch after the diagram below)

[Same tree diagram as before: icgi1 → LeaveFdbk (8), MyFdbk (205); icgi2 → Respond (1), ViewFdbk (3554)]
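An illustrative sketch (not eBay's code) of the leaf heuristics above, using
the leaves from the sample output: drop leaves with no failures, filter noise
with a threshold M, and rank the survivors by impact. The ancestor-subsumption
trimming is omitted here because it needs the tree structure; M = 10 matches
the value used later in the results.

    M = 10  # noise threshold; slide 18 uses M = 10
    leaves = [
        ({"Pool": "icgi1", "TxnName": "LeaveFeedback"}, 8),
        ({"Pool": "icgi1", "TxnName": "MyFeedback"}, 205),
        ({"Pool": "icgi2", "TxnName": "Respond"}, 1),
        ({"Pool": "icgi2", "TxnName": "ViewFeedback"}, 3554),
    ]

    # Keep leaves with at least M failed txns, then rank by failure count.
    candidates = [(path, failed) for path, failed in leaves if failed >= M]
    candidates.sort(key=lambda pf: pf[1], reverse=True)
    for path, failed in candidates:
        print(failed, path)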
13
MinEntropy
  • Entropy measures the randomness of data
  • E.g. if failures are evenly distributed (very
    random), entropy is high
  • Rank features by normalized entropy
  • Greedy search descends into the leaf node with the
    most failures (see the sketch after the example on
    the next slide)
  • Always produces exactly one diagnosis
  • Deployed on the entire eBay site
  • Sends real-time alerts to ops
  • Pros: fast (<1 s for 100K txns, and scales
    linearly)
  • Cons: optimized for single faults

14
MinEntropy example
TxnName Errors
MyEBay 636
MyEBaySeller 512
MyEBayLogin 736

Pool Errors
Cgi0 12
Cgi1 4002
Cgi2 30
Cgi3 8
Cgi4 5

TxnType Errors
URL 4350
SQL 47
EMAIL 12
XSLT 0

Version Errors
E293 3987
E291 15

Machine Errors
Attila 1985
Lenin 2002
Marcus 4
Scipio 0

Alert: Version E293 causing URL failures (not
specific to any URL) in pool Cgi1
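An editorial sketch of the MinEntropy idea using the error counts above:
compute the normalized entropy of the errors over each feature's values, rank
features from most to least skewed, and report the value holding the most
errors. The normalization shown (dividing by log2 of the number of values) and
the single greedy step are assumptions; eBay's implementation may differ in
its details.

    import math

    def normalized_entropy(counts):
        """Entropy of the error distribution, normalized to [0, 1]."""
        total = sum(counts.values())
        if total == 0 or len(counts) < 2:
            return 1.0
        h = -sum((c / total) * math.log2(c / total) for c in counts.values() if c)
        return h / math.log2(len(counts))

    errors = {
        "Pool": {"Cgi0": 12, "Cgi1": 4002, "Cgi2": 30, "Cgi3": 8, "Cgi4": 5},
        "TxnType": {"URL": 4350, "SQL": 47, "EMAIL": 12, "XSLT": 0},
        "Version": {"E293": 3987, "E291": 15},
        "Machine": {"Attila": 1985, "Lenin": 2002, "Marcus": 4, "Scipio": 0},
    }

    # Most skewed features first; for each, the value holding the most errors.
    for feature in sorted(errors, key=lambda f: normalized_entropy(errors[f])):
        suspect = max(errors[feature], key=errors[feature].get)
        print(f"{feature}: H_norm={normalized_entropy(errors[feature]):.3f} -> {suspect}")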
15
Association Rules
  • Data mining technique to compute item sets
  • e.g. "Shoppers who bought this item also shopped
    for ..."
  • Metrics
  • Confidence = (# of A and B) / (# of A)
  • Conditional probability of B given A
  • Support = (# of A and B) / (total # of txns)
  • Generates rules for all possible sets
  • e.g. machine=abc, txn=login =>
    status=NullPointer (conf=0.1, support=0.02)
  • Applied to failure diagnosis
  • Find all rules that have a failed status on the
    right, then rank by confidence (sketched in code
    below)
  • Pros: looks at combinations of features
  • Cons: generates many rules
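An illustrative brute-force sketch (not eBay's miner) of generating such
rules: enumerate feature-value itemsets over a handful of toy transaction
records and compute confidence and support for rules that predict a failure.
The records are invented; a real miner (e.g. Apriori) prunes this search
instead of enumerating every combination.

    from itertools import combinations

    txns = [  # toy records, not eBay data
        {"TxnType": "URL", "Pool": "icgi2", "TxnName": "LeaveFeedback", "failed": True},
        {"TxnType": "URL", "Pool": "icgi2", "TxnName": "LeaveFeedback", "failed": False},
        {"TxnType": "URL", "Pool": "icgi1", "TxnName": "Bid", "failed": False},
        {"TxnType": "URL", "Pool": "icgi2", "TxnName": "ViewFeedback", "failed": False},
    ]
    features = ["TxnType", "Pool", "TxnName"]

    rules = []
    for r in range(1, len(features) + 1):
        for combo in combinations(features, r):
            # Distinct value combinations that actually occur for this feature set.
            itemsets = {tuple((f, t[f]) for f in combo) for t in txns}
            for items in itemsets:
                match = [t for t in txns if all(t[f] == v for f, v in items)]
                fails = sum(t["failed"] for t in match)
                if fails:  # keep only rules with a failure on the right-hand side
                    rules.append((fails / len(match), fails / len(txns), items))

    for conf, supp, items in sorted(rules, reverse=True):
        print(f"{dict(items)} => failed  conf={conf:.2f} supp={supp:.2f}")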

16
Association Rules Sample Output
  • Sample output (rules containing failures)
  • TxnType=URL Pool=icgi2 TxnName=LeaveFeedback =>
    Status=Failed conf(0.28)
  • Pool=icgi2 TxnName=LeaveFeedback =>
    Status=Failed conf(0.28)
  • TxnType=URL TxnName=LeaveFeedback =>
    Status=Failed conf(0.28)
  • TxnName=LeaveFeedback => Status=Failed
    conf(0.28)
  • Problem: features may not be independent
  • e.g. all LeaveFeedback txns are of type URL
  • Drop rules that are subsumed by more specific
    rules
  • Diagnosis: TxnName=LeaveFeedback

17
Experimental Setup
  • Dataset
  • About 1/8 of the whole site
  • 10 one-minute traces, 4 with 2 concurrent faults
  • total of 14 independent faults
  • True faults identified through post-mortems, ops
    chat logs, application logs, etc.
  • Metrics (a worked example follows the tables
    below)
  • Precision = (# of correctly identified faults) /
    (# of predicted faults)
  • Recall = (# of correctly identified faults) /
    (# of true faults)

Feature          Type  Name  Pool  Machine  Version  Database  Status
Distinct values  10    300   15    260      7        40        8

Fault type  Host  DB  Host,Host  Host,DB  Host,SW  DB,SW
Traces      2     4   1          1        1        1
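A hypothetical worked example of the two metrics (numbers invented, not from
the experiments): suppose a trace contains 2 true faults and the algorithm
predicts 3 faults, of which 2 are real.

    # Hypothetical counts for one trace.
    identified, predicted, true = 2, 3, 2
    precision = identified / predicted   # 2/3 ~= 0.67
    recall = identified / true           # 2/2 = 1.0
    print(precision, recall)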
18
Results: DBs in Dataset
  • True causes for DB-related failures are captured
    in the dataset
  • Variable number of DBs used by each txn
  • Feature selection heuristics
  • Ignore leaf nodes with no failed transactions
  • Noise filtering
  • ignore nodes with < M failures (in this case,
    M = 10)
  • Path trimming
  • drop ancestor nodes subsumed by the leaf nodes

19
Results: DBs not in Dataset
  • True cause not captured for DB-related failures
  • C4.5 suffers from the unbalanced dataset
  • i.e. it produces a single rule that predicts every
    txn to be successful

20
What's next?
  • ROC curves
  • show tradeoff between precision and recall
  • Transient failures
  • Up-sample to balance the dataset or use a cost
    matrix (see the sketch after this list)
  • Some measure of the confidence of the
    prediction
  • More data points
    Have 20 hrs of logs that contain failures
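A rough sketch of the up-sampling idea (editorial; the cost-matrix route is
the alternative the slide mentions): replicate the rare failed transactions
until the classes are roughly balanced before training the classifier.

    import random

    def upsample(txns, label_key="failed"):
        """Replicate minority-class records until both classes are the same size."""
        pos = [t for t in txns if t[label_key]]
        neg = [t for t in txns if not t[label_key]]
        if not pos or not neg:
            return list(txns)
        minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
        extra = random.choices(minority, k=len(majority) - len(minority))
        return majority + minority + extra

    balanced = upsample([{"failed": True}] * 3 + [{"failed": False}] * 97)
    print(sum(t["failed"] for t in balanced), len(balanced))  # 97 194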

21
Open Questions
  • How to deal with multiple symptoms?
  • E.g. DB outage causing multiple types of requests
    to fail
  • Treat it as multiple failures?
  • Failure importance (count vs. rate)
  • Two failures may have similar failure count
  • Low volume and higher failure rate vs. high
    volume and lower failure rate