Failure Data Collection and Analysis - PowerPoint PPT Presentation

Title:

Failure Data Collection and Analysis

Slides: 26

Provided by: arch1

Learn more at: http://roc.cs.berkeley.edu

1
Failure Data Collection and Analysis

Archana Ganapathi
Peter Bodik
Wei Xu

2
Motivation (1): My machine crashes

Since 3/1/04
3 system crashes
18 application errors
96 application hangs
Who cares?
I do!
People who share similar experiences
In general, customer uproar

3
Motivation (2): An Internet service has failures

Who cares?
Internet service users
Internet service system administrators
Anyone affected by the Internet service's loss of revenue

Total: 61 user-visible failures in 12 months at Online Service
4
Motivation (3)

ROC/RADS needs real failure/attack information to
drive benchmarks
evaluate our prototypes
help us select which problems to attack

5
Data Sources

1000s of individual machines
Cory/Soda Hall, BOINC
Large clusters at real Internet services
Internet services
Distributed applications on 100s of machines
PlanetLab

6
Individual Machines
7
Data Collection

Collect minidumps that contain
The Stop message/parameters/data
Loaded drivers
Processor context for the processor that stopped
Process info/kernel context for the process/thread that stopped
The kernel-mode call stack for the thread that stopped
Frequency of collection
synchronized with application and system crashes on computers

8
Analysis results

What happened that is immediately responsible for the crash
exact error code
brief description, primarily for debugging
Bucketing info, e.g. "driver fault"
Details for debugging, e.g. stack contents
Use Microsoft's publicly available analysis tools
Caveat: significant variability in results between the internal and public versions of the tool!

9
How we collect minidumps (1)

Corporate Error Reporting
http://www.microsoft.com/resources/satech/cer/
Manage error reports/msgs generated by WER and other programs
Configure clients to redirect reports to the CER shared directory

10
Sample Statistics (25 nodes, 5 days)
11
Sample Statistics (25 nodes, 5 days)

Crashed Program  Version          Problem
BESConsole.exe   4.1.3.33         hungapp
CDCopier.exe     5.3.4.21         hungapp
CreateCD50.exe   5.3.4.21         hungapp
CreateCD50.exe   5.3.4.21         hungapp
explorer.exe     6.0.2800.1106    shlwapi.dll
firefox.exe      0.8.0.0          hungapp
IAMAPP.EXE       5.1.1.309        hungapp
iexplore.exe     6.0.2800.1106    hungapp
iexplore.exe     6.0.2800.1106    mshtml.dll
matlab.exe       1.0.0.1          hungapp
mozilla.exe      1.6.20040.11308  ntdll.dll
msmsgs.exe       4.7.0.2009       msmsgs.exe
OUTLOOK.EXE      10.0.4510.0      hungapp
thunde1.exe      0.6.0.0          xpc3250.dll
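
A minimal sketch of how the rows above could be tallied once reports land in one place. The rows are transcribed from the slide; the function name `tally` and the reading of `hungapp` as "application hang" (rather than a faulting module) are assumptions made for illustration, not part of the CER tooling.

```python
from collections import Counter

# (program, version, problem) rows transcribed from the slide above.
REPORTS = [
    ("BESConsole.exe", "4.1.3.33", "hungapp"),
    ("CDCopier.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("explorer.exe", "6.0.2800.1106", "shlwapi.dll"),
    ("firefox.exe", "0.8.0.0", "hungapp"),
    ("IAMAPP.EXE", "5.1.1.309", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "mshtml.dll"),
    ("matlab.exe", "1.0.0.1", "hungapp"),
    ("mozilla.exe", "1.6.20040.11308", "ntdll.dll"),
    ("msmsgs.exe", "4.7.0.2009", "msmsgs.exe"),
    ("OUTLOOK.EXE", "10.0.4510.0", "hungapp"),
    ("thunde1.exe", "0.6.0.0", "xpc3250.dll"),
]


def tally(reports):
    """Count reports per problem label and per crashed program."""
    by_problem = Counter(problem for _, _, problem in reports)
    by_program = Counter(program for program, _, _ in reports)
    return by_problem, by_program


by_problem, by_program = tally(REPORTS)
# Assumption: the "hungapp" label marks an application hang.
hangs = by_problem["hungapp"]
print(f"{hangs} of {len(REPORTS)} reports are hangs")
print("Programs with more than one report:",
      sorted(p for p, n in by_program.items() if n > 1))
```

Keeping duplicate rows (CreateCD50.exe appears twice) matters here, since each row is one report rather than one distinct program.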
12
How we collect minidumps (2)

BOINC
For SETI@home-esque apps that pool resources
Provides client API to send/receive data to/from the BOINC server
Write tools to read info in the minidump directory and send it to us
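
The "read info in minidump directory" step could look roughly like the sketch below. It only finds and describes dump files; the upload through BOINC is deliberately left out, because the slides do not show that API. The function names, the `*.dmp` pattern, and the example directory are assumptions for illustration.

```python
from pathlib import Path


def find_new_minidumps(dump_dir, since_mtime=0.0):
    """Return .dmp files in dump_dir modified after since_mtime, oldest first."""
    root = Path(dump_dir)
    if not root.is_dir():
        return []
    dumps = [
        p for p in root.glob("*.dmp")
        if p.is_file() and p.stat().st_mtime > since_mtime
    ]
    return sorted(dumps, key=lambda p: p.stat().st_mtime)


def describe(path):
    """Summarize one dump file so it can be queued for sending."""
    st = path.stat()
    return {"name": path.name, "size_bytes": st.st_size, "mtime": st.st_mtime}


# Assumed location; pass whatever directory the client is configured to use.
for dump in find_new_minidumps(r"C:\Windows\Minidump"):
    print(describe(dump))
```

Remembering the modification time of the last dump sent and passing it as `since_mtime` avoids resending old dumps on every run.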

13
Sample Statistics (50 system crashes)
Thread stuck in device driver 12
Page Fault in Non-Paged Area 10
System Thread Exception Not Handled 6
Unexpected Kernel Mode Trap 6
Kernel Mode Exception Not Handled 5
IRQL Not Less or Equal 3
Driver IRQL Not Less or Equal 3
NTFS File System 2
Bad Pool Caller 2
PFN List Corrupt 1
14
Sample Statistics (50 system crashes)

watchdog.sys 7
ar5211.sys 6
ibmpmdrv.sys 6
ati3duag.dll 5
SYMEVENT.SYS 3
ipsecw2k.sys 3
memory_corruption 3
ialmdev5.DLL 2
PSCRIPT4.DLL 2
ntoskrnl.exe 2
CLASSPNP.SYS 2
win32k.sys 2
SynTP.sys 1
TDI.SYS 1
ino_fltr.sys 1
ks.sys 1
drvnddm.sys 1
ntkrnlmp.exe 1
Pool_Corruption 1
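
One rough way to turn the per-module counts into coarse buckets is sketched below. The slide does not label this column, so the sketch assumes each name is the module the analysis tool blamed for the crash. Bucketing by file extension is a crude stand-in chosen for illustration, not the bucketing the Microsoft tools perform, and the bucket names are made up.

```python
from collections import Counter

# Counts transcribed from the slide above (they sum to 50 crashes).
CRASHES_BY_MODULE = {
    "watchdog.sys": 7, "ar5211.sys": 6, "ibmpmdrv.sys": 6,
    "ati3duag.dll": 5, "SYMEVENT.SYS": 3, "ipsecw2k.sys": 3,
    "memory_corruption": 3, "ialmdev5.DLL": 2, "PSCRIPT4.DLL": 2,
    "ntoskrnl.exe": 2, "CLASSPNP.SYS": 2, "win32k.sys": 2,
    "SynTP.sys": 1, "TDI.SYS": 1, "ino_fltr.sys": 1, "ks.sys": 1,
    "drvnddm.sys": 1, "ntkrnlmp.exe": 1, "Pool_Corruption": 1,
}


def bucket_for(module):
    """Map a module name to a coarse bucket based on its file extension."""
    name = module.lower()
    if name.endswith(".sys"):
        return "sys"
    if name.endswith(".dll"):
        return "dll"
    if name.endswith(".exe"):
        return "exe"
    return "other"  # labels such as memory_corruption that are not files


def bucket_counts(crashes):
    """Sum crash counts per bucket."""
    buckets = Counter()
    for module, count in crashes.items():
        buckets[bucket_for(module)] += count
    return buckets


buckets = bucket_counts(CRASHES_BY_MODULE)
total = sum(buckets.values())
for name, count in buckets.most_common():
    print(f"{name:6s} {count:3d}  {100 * count / total:.0f}%")
```

A `.sys` extension does not always mean a third-party device driver (win32k.sys, for example, ships with Windows), so a real bucketing scheme would need a vendor or component lookup on top of this.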

15
Metrics (Windows and Linux)

Availability
system uptime, time BOINC running
CPU(s)
processes, processor queue length, non-idle time
Memory
available physical memory, free swap space
Disk(s)
free space
Network(s)
IP address, packets/bytes sent/received per second, bandwidth to/from SETI@home server, first-hop bandwidth, network coordinates
Static
CPU type, count, and benchmarks; total memory; OS type
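
A few of the metrics listed above can be read portably from the Python standard library, as in the sketch below. It covers only the static fields and free disk space; uptime, memory, processor queue length, and network counters need platform-specific sources on Windows and Linux and are not attempted here. The function and field names are assumptions for illustration.

```python
import os
import platform
import shutil
import socket
import time


def collect_static():
    """Metrics that should not change between samples."""
    return {
        "os_type": platform.system(),
        "cpu_type": platform.machine(),
        "cpu_count": os.cpu_count(),
        "hostname": socket.gethostname(),
    }


def collect_dynamic(path="."):
    """Metrics worth re-sampling periodically; path picks the disk to check."""
    usage = shutil.disk_usage(path)
    return {
        "timestamp": time.time(),
        "disk_free_bytes": usage.free,
        "disk_total_bytes": usage.total,
    }


print(collect_static())
print(collect_dynamic())
```

Splitting static from dynamic metrics means the static record is sent once per host, while the dynamic one is sampled at whatever frequency is settled on.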

16
Questions

Other metrics?
Frequency with which to measure them?
What research questions can we answer with this data set?
original goal: workload to evaluate our node discovery service
evaluate effectiveness of network coordinates
evaluate potential to run more than just embarrassingly parallel apps on this type of infrastructure, depending on
machines' uptime
network connectivity
available disk space
distributed analysis?
security uses?

17
Internet Services
18
Data characteristics

Real companies
Multitude of users
Voluminous data (several terabytes)
Systems are complex
Treat as black box
Use SLT algorithms for analysis
More data => better models

19
Analysis Results

Study event logs
Not necessarily failures
Can derive models of good and bad behavior
Models with varying granularity
Use different algorithms
Vary boundary parameters
For more details, see the poster "Towards a General Approach for Event Log Analysis"

20
Distributed Apps
21
PlanetLab

An open platform for developing, deploying, and accessing planetary-scale services
392 nodes at 164 sites around the world
Per-site system administration
Applications: OceanStore, PIER

22
Why?