Title: Failure Data Collection and Analysis
1 Failure Data Collection and Analysis
- Archana Ganapathi
- Peter Bodik
- Wei Xu
2 Motivation (1): My machine crashes
- Since 3/1/04
- 3 system crashes
- 18 application errors
- 96 application hangs
- Who cares?
- I do!
- People who share similar experiences
- In general, customer uproar
3 Motivation (2): An Internet service has failures
- Who cares?
- Internet service users
- Internet service system administrators
- Anyone affected by the Internet service's loss of revenue
Total: 61 user-visible failures in 12 months at Online Service
4 Motivation (3)
- ROC/RADS needs real failure/attack information
- to drive benchmarks
- evaluate our prototypes
- help us select what to work on
5 Data Sources
- 1000s of individual machines
- Cory/Soda Hall, BOINC
- Large clusters at real Internet services
- Internet services
- Distributed applications on 100s of machines
- PlanetLab
6 Individual Machines
7 Data Collection
- Collect minidumps that contain:
- The Stop message/parameters/data
- Loaded drivers
- Processor context for the processor that stopped
- Process info/kernel context for the process/thread that stopped
- The kernel-mode call stack for the thread that stopped
- Frequency of collection: synchronized with application and system crashes on computers
8 Analysis results
- What happened that is immediately responsible for the crash:
- exact error code
- brief description, primarily for debugging
- Bucketing info, e.g. "driver fault"
- Details for debugging, e.g. stack contents
- Use Microsoft's publicly available analysis tools
- Caveat: significant variability in results between internal and public versions of the tool!
9 How we collect minidumps (1)
- Corporate Error Reporting (CER)
- http://www.microsoft.com/resources/satech/cer/
- Manage error reports/msgs generated by WER and other programs
- Configure clients to redirect reports to CER shared directory
10 Sample Statistics (25 nodes, 5 days)
11 Sample Statistics (25 nodes, 5 days)
Crashed Program   Version            Problem
BESConsole.exe    4.1.3.33           hungapp
CDCopier.exe      5.3.4.21           hungapp
CreateCD50.exe    5.3.4.21           hungapp
CreateCD50.exe    5.3.4.21           hungapp
explorer.exe      6.0.2800.1106      shlwapi.dll
firefox.exe       0.8.0.0            hungapp
IAMAPP.EXE        5.1.1.309          hungapp
iexplore.exe      6.0.2800.1106      hungapp
iexplore.exe      6.0.2800.1106      mshtml.dll
matlab.exe        1.0.0.1            hungapp
mozilla.exe       1.6.20040.11308    ntdll.dll
msmsgs.exe        4.7.0.2009         msmsgs.exe
OUTLOOK.EXE       10.0.4510.0        hungapp
thunde1.exe       0.6.0.0            xpc3250.dll
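The sample table can be tallied with a few lines of Python. This is only an illustrative sketch: the rows are transcribed from the table, and the function name `summarize` is made up for the example. It separates application hangs (`hungapp`) from crashes attributed to a faulting module.

```python
from collections import Counter

# Rows transcribed from the sample table: (program, version, problem).
REPORTS = [
    ("BESConsole.exe", "4.1.3.33", "hungapp"),
    ("CDCopier.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("explorer.exe", "6.0.2800.1106", "shlwapi.dll"),
    ("firefox.exe", "0.8.0.0", "hungapp"),
    ("IAMAPP.EXE", "5.1.1.309", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "mshtml.dll"),
    ("matlab.exe", "1.0.0.1", "hungapp"),
    ("mozilla.exe", "1.6.20040.11308", "ntdll.dll"),
    ("msmsgs.exe", "4.7.0.2009", "msmsgs.exe"),
    ("OUTLOOK.EXE", "10.0.4510.0", "hungapp"),
    ("thunde1.exe", "0.6.0.0", "xpc3250.dll"),
]


def summarize(reports):
    """Return (number of hangs, Counter of faulting modules)."""
    hangs = sum(1 for _, _, problem in reports if problem == "hungapp")
    faults = Counter(p for _, _, p in reports if p != "hungapp")
    return hangs, faults


if __name__ == "__main__":
    hang_count, faults = summarize(REPORTS)
    print(f"application hangs: {hang_count} of {len(REPORTS)} reports")
    for module, n in faults.most_common():
        print(f"fault in {module}: {n}")
```

Keeping hangs and module faults apart matters because a hang report names no culprit module, so the two groups need different follow-up analysis.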
12 How we collect minidumps (2)
- BOINC
- For SETI@home-esque apps that pool resources
- Provides client API to send/receive data to/from BOINC server
- Write tools to read info in minidump directory and send to us
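A tool that reads the minidump directory might start like the sketch below. It is an assumption-laden illustration, not the tool described on the slide: the `.dmp` extension, the manifest fields, and the names `describe_dump` and `build_manifest` are all chosen for the example, and the upload step through the BOINC client API is deliberately left out.

```python
import hashlib
from pathlib import Path


def describe_dump(path: Path) -> dict:
    """Collect basic metadata for one dump file (fields are illustrative)."""
    data = path.read_bytes()
    return {
        "name": path.name,
        "size_bytes": len(data),
        "modified": path.stat().st_mtime,
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def build_manifest(dump_dir, already_sent=frozenset()):
    """Describe every *.dmp file whose hash has not been reported yet."""
    entries = []
    for path in sorted(Path(dump_dir).glob("*.dmp")):
        entry = describe_dump(path)
        if entry["sha256"] not in already_sent:
            entries.append(entry)
    return entries
```

Tracking content hashes of dumps that were already sent keeps a periodic scan from reporting the same crash twice.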
13 Sample Statistics (50 system crashes)
Stop error                             Count
Thread stuck in device driver          12
Page Fault in Non-Paged Area           10
System Thread Exception Not Handled    6
Unexpected Kernel Mode Trap            6
Kernel Mode Exception Not Handled      5
IRQL Not Less or Equal                 3
Driver IRQL Not Less or Equal          3
NTFS File System                       2
Bad Pool Caller                        2
PFN List Corrupt                       1
14 Sample Statistics (50 system crashes)
- CLASSPNP.SYS 2
- win32k.sys 2
- SynTP.sys 1
- TDI.SYS 1
- ino_fltr.sys 1
- ks.sys 1
- drvnddm.sys 1
- ntkrnlmp.exe 1
- Pool_Corruption 1
- watchdog.sys 7
- ar5211.sys 6
- ibmpmdrv.sys 6
- ati3duag.dll 5
- SYMEVENT.SYS 3
- ipsecw2k.sys 3
- memory_corruption 3
- ialmdev5.DLL 2
- PSCRIPT4.DLL 2
- ntoskrnl.exe 2
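One question this per-module breakdown can answer is how concentrated the crashes are. The sketch below is illustrative only: the counts are transcribed from the list above, and `smallest_covering_set` is a hypothetical helper that picks the most frequent culprits until they account for a chosen fraction of all crashes.

```python
# Crash counts per culprit, transcribed from the list above.
CULPRITS = {
    "watchdog.sys": 7, "ar5211.sys": 6, "ibmpmdrv.sys": 6,
    "ati3duag.dll": 5, "SYMEVENT.SYS": 3, "ipsecw2k.sys": 3,
    "memory_corruption": 3, "ialmdev5.DLL": 2, "PSCRIPT4.DLL": 2,
    "ntoskrnl.exe": 2, "CLASSPNP.SYS": 2, "win32k.sys": 2,
    "SynTP.sys": 1, "TDI.SYS": 1, "ino_fltr.sys": 1, "ks.sys": 1,
    "drvnddm.sys": 1, "ntkrnlmp.exe": 1, "Pool_Corruption": 1,
}


def smallest_covering_set(counts, fraction):
    """Greedily take the most frequent culprits until they cover
    at least `fraction` of all crashes."""
    total = sum(counts.values())
    covered, chosen = 0, []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, n in ranked:
        if covered >= fraction * total:
            break
        chosen.append(name)
        covered += n
    return chosen, covered, total


if __name__ == "__main__":
    chosen, covered, total = smallest_covering_set(CULPRITS, 0.5)
    print(f"{len(chosen)} culprits cover {covered} of {total} crashes")
    print(", ".join(chosen))
```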
15 Metrics (Windows and Linux)
- Availability
- system uptime, time BOINC running
- CPU(s)
- processes, processor queue length, non-idle
- Memory
- available physical memory, free swap space
- Disk(s)
- free space
- Network(s)
- IP address, packets/bytes sent/received per second, bandwidth to/from SETI@home server, first-hop bandwidth, network coordinates
- Static
- CPU type and benchmarks, total memory, OS type
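A few of these metrics can be sampled portably with the Python standard library alone, as in the rough sketch below. The function name `sample_metrics` and the dictionary keys are assumptions made for the example. Uptime, processor queue length, memory, and network counters are not covered here because the standard library has no portable calls for them; a platform-specific API or third-party package would be needed.

```python
import os
import platform
import shutil
import socket
import time


def sample_metrics(path="."):
    """Take one sample of the metrics the standard library exposes
    on both Windows and Linux."""
    usage = shutil.disk_usage(path)  # total, used, free (bytes)
    return {
        "timestamp": time.time(),
        "hostname": socket.gethostname(),
        "os_type": platform.system(),     # e.g. "Windows" or "Linux"
        "cpu_type": platform.machine(),   # e.g. "x86_64"
        "cpu_count": os.cpu_count(),
        "disk_free_bytes": usage.free,
        "disk_total_bytes": usage.total,
    }


if __name__ == "__main__":
    for key, value in sample_metrics().items():
        print(f"{key}: {value}")
```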
16 Questions
- Other metrics?
- Frequency with which to measure them?
- What research questions can we answer with this data set?
- original goal: workload to evaluate our node discovery service
- evaluate effectiveness of network coordinates
- evaluate potential to run more than just embarrassingly parallel apps on this type of infrastructure, depending on machines' uptime, network connectivity, and available disk space
- distributed analysis?
- security uses?
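One way to evaluate the effectiveness of network coordinates is to compare the latency they predict with the latency actually measured. The sketch below assumes Euclidean coordinates and round-trip times in the same unit; both are assumptions for illustration, as are the function names, and the slides do not say which coordinate system is used.

```python
import math
from statistics import median


def relative_errors(coords, measured):
    """For each measured pair, compare coordinate distance to real RTT.

    coords:   {node: (x, y, ...)} coordinates per node (assumed Euclidean)
    measured: {(src, dst): rtt} measured round-trip times, rtt > 0
    """
    errors = []
    for (src, dst), rtt in measured.items():
        predicted = math.dist(coords[src], coords[dst])
        errors.append(abs(predicted - rtt) / rtt)
    return errors


def median_relative_error(coords, measured):
    """Single summary number: lower means better coordinates."""
    return median(relative_errors(coords, measured))
```

The median is used rather than the mean so that a handful of badly placed nodes does not dominate the summary.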
17 Internet Services
18 Data characteristics
- Real companies
- Multitude of users
- Voluminous data (several terabytes)
- Systems are complex
- Treat as black box
- Use SLT algorithms for analysis
- More data => better models
19 Analysis Results
- Study event logs
- Not necessarily failures
- Can derive models of good and bad behavior
- Models with varying granularity
- Use different algorithms
- Vary boundary parameters
- For more details see poster:
- Towards a General Approach for Event Log Analysis
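To make "varying granularity" and "boundary parameters" concrete, here is a deliberately simple stand-in, not the algorithms used in the study: events are counted per time window (the window size sets the granularity), a baseline mean and standard deviation are fitted on known-good windows, and a window is flagged when its count lies more than `k` standard deviations from the mean (`k` is the boundary parameter). All names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def counts_per_window(timestamps, window):
    """Bucket event timestamps (seconds) into windows of `window` seconds."""
    buckets = Counter(int(t // window) for t in timestamps)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    # Include empty windows so quiet periods are visible too.
    return [buckets.get(i, 0) for i in range(first, last + 1)]


def fit_baseline(good_counts):
    """Model of 'good' behavior: mean and std. deviation of event counts."""
    return mean(good_counts), pstdev(good_counts)


def flag_windows(counts, baseline, k):
    """Indices of windows whose count is more than k sigmas from the mean."""
    mu, sigma = baseline
    return [i for i, c in enumerate(counts) if abs(c - mu) > k * sigma]
```

Lowering `k` or shrinking the window makes the model more sensitive at the cost of more false alarms, which is the trade-off the two knobs expose.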
20 Distributed Apps
21 PlanetLab
- An open platform for developing, deploying, and accessing planetary-scale services
- 392 nodes at 164 sites around the world
- Per-site system administration
- Applications: OceanStore, PIER
22 Why?
- Platform for injecting faults and testing our algorithms
- Applications on RADS-like environment
- Research platform
- More accessible
- University-developed apps most likely to be tested on PlanetLab
23 Applications
- 1) OceanStore
- Global persistent data store.
- In the process of running prototype on PlanetLab
- Good source of failure data
- 2) PIER
- Distributed query processor
- Currently running on PlanetLab
- Good source of failure data and a candidate analysis engine
24 What do we do with these apps?
- Instrument applications to collect any type of information
- Choice of granularity
- Open source - no longer black box
- Can modify it as much as necessary
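Instrumentation with a choice of granularity could look like the sketch below. It is a generic Python illustration under assumed names (`Recorder`, `instrument`, and the three levels), not the instrumentation applied to OceanStore or PIER.

```python
import functools
import time


class Recorder:
    """Collects events from instrumented functions.

    granularity: "off"   records nothing,
                 "error" records only failed calls,
                 "call"  records every call with its duration.
    """

    LEVELS = ("off", "error", "call")

    def __init__(self, granularity="call"):
        if granularity not in self.LEVELS:
            raise ValueError(f"unknown granularity: {granularity!r}")
        self.granularity = granularity
        self.events = []

    def instrument(self, func):
        """Decorator that wraps `func` and records events about its calls."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                if self.granularity != "off":
                    self.events.append(
                        {"op": func.__name__, "ok": False, "error": repr(exc)})
                raise  # never hide the application's own failure
            if self.granularity == "call":
                self.events.append(
                    {"op": func.__name__, "ok": True,
                     "seconds": time.perf_counter() - start})
            return result
        return wrapper
```

Because the granularity is read at call time, it can be raised on a running system when a failure is being investigated and lowered again afterwards.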
25 Questions
- What other applications can we use?
- What should we measure and model?
- What information is useful for industry?
- Do you have any failure/attack data you are
willing to share with us?