Failure Data Collection and Analysis - PowerPoint PPT Presentation

1
Failure Data Collection and Analysis
  • Archana Ganapathi
  • Peter Bodik
  • Wei Xu

2
Motivation (1): My machine crashes
  • Since 3/1/04
  • 3 system crashes
  • 18 application errors
  • 96 application hangs
  • Who cares?
  • I do!
  • People who share similar experiences
  • In general, customer uproar

3
Motivation (2): An Internet service has failures
  • Who cares?
  • Internet service users
  • Internet service system administrators
  • Anyone affected by the IS's loss of revenue

Total: 61 user-visible failures in 12 months at Online Service
4
Motivation (3)
  • ROC/RADS needs real failure/attack information
  • to drive benchmarks
  • evaluate our prototypes
  • help us select what to work on

5
Data Sources
  • 1000s of individual machines
  • Cory/Soda Hall, BOINC
  • Large clusters at real Internet services
  • Internet services
  • Distributed applications on 100s of machines
  • PlanetLab

6
Individual Machines
7
Data Collection
  • Collect minidumps that contain
  • The Stop message/parameters/data
  • Loaded drivers
  • Processor context for processor that stopped
  • Process info/kernel context for process/thread
    stopped
  • The Kernel-mode call stack for thread that
    stopped
  • Frequency of collection
  • synchronized with application and system crashes
    on computers

8
Analysis results
  • What happened that is immediately responsible for
    the crash
  • exact error code
  • brief description, primarily for debugging
  • Bucketing info, e.g. "driver fault"
  • Details for debugging, e.g. stack contents
  • Use Microsoft's publicly available analysis tools
  • Caveat: significant variability in results
    between internal and public versions of the tool!

9
How we collect minidumps (1)
  • Corporate Error Reporting
  • http://www.microsoft.com/resources/satech/cer/
  • Manage error reports/msgs generated by WER and
    other programs
  • Configure clients to redirect reports to CER
    shared directory

10
Sample Statistics (25 nodes, 5 days)
11
Sample Statistics (25 nodes, 5 days)
Crashed Program   Version           Problem
BESConsole.exe    4.1.3.33          hungapp
CDCopier.exe      5.3.4.21          hungapp
CreateCD50.exe    5.3.4.21          hungapp
CreateCD50.exe    5.3.4.21          hungapp
explorer.exe      6.0.2800.1106     shlwapi.dll
firefox.exe       0.8.0.0           hungapp
IAMAPP.EXE        5.1.1.309         hungapp
iexplore.exe      6.0.2800.1106     hungapp
iexplore.exe      6.0.2800.1106     mshtml.dll
matlab.exe        1.0.0.1           hungapp
mozilla.exe       1.6.20040.11308   ntdll.dll
msmsgs.exe        4.7.0.2009        msmsgs.exe
OUTLOOK.EXE       10.0.4510.0       hungapp
thunde1.exe       0.6.0.0           xpc3250.dll
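
The table above can be summarized by its Problem column. The following is a minimal sketch, not the presenters' tooling: the rows are copied from the slide, while the names `REPORTS`, `tally_by_problem`, and `hang_fraction` are hypothetical and introduced only for illustration.

```python
from collections import Counter

# (program, version, problem) rows copied from the slide's table.
REPORTS = [
    ("BESConsole.exe", "4.1.3.33", "hungapp"),
    ("CDCopier.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("explorer.exe", "6.0.2800.1106", "shlwapi.dll"),
    ("firefox.exe", "0.8.0.0", "hungapp"),
    ("IAMAPP.EXE", "5.1.1.309", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "mshtml.dll"),
    ("matlab.exe", "1.0.0.1", "hungapp"),
    ("mozilla.exe", "1.6.20040.11308", "ntdll.dll"),
    ("msmsgs.exe", "4.7.0.2009", "msmsgs.exe"),
    ("OUTLOOK.EXE", "10.0.4510.0", "hungapp"),
    ("thunde1.exe", "0.6.0.0", "xpc3250.dll"),
]


def tally_by_problem(reports):
    """Count reports per value of the Problem column."""
    return Counter(problem for _, _, problem in reports)


def hang_fraction(reports):
    """Fraction of reports whose problem is an application hang ("hungapp")."""
    return tally_by_problem(reports)["hungapp"] / len(reports)
```

On these 14 rows, `tally_by_problem(REPORTS)` reports 9 hangs; the other 5 rows each name a distinct module.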
12
How we collect minidumps (2)
  • BOINC
  • For SETI@home-esque apps that pool resources
  • Provides client API to send/receive data to/from
    BOINC server
  • Write tools to read info in minidump directory
    and send to us
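
As a rough illustration of the "read info in minidump directory" step, the sketch below only inventories dump files. It assumes dumps are stored as `*.dmp` files in a single directory passed in by the caller; the function name `list_minidumps` is hypothetical. The upload step is deliberately left out because the slides do not describe the BOINC client calls that were used.

```python
from pathlib import Path


def list_minidumps(dump_dir):
    """Return (file name, size in bytes, modification time) for each *.dmp
    file directly inside dump_dir, sorted oldest first.

    The *.dmp pattern is an assumption; matching is case-sensitive on
    case-sensitive file systems.
    """
    entries = []
    for path in Path(dump_dir).glob("*.dmp"):
        stat = path.stat()
        entries.append((path.name, stat.st_size, stat.st_mtime))
    return sorted(entries, key=lambda entry: entry[2])
```

A collector could call this periodically and forward only the entries it has not seen before.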

13
Sample Statistics (50 system crashes)
Thread stuck in device driver 12
Page Fault in Non-Paged Area 10
System Thread Exception Not Handled 6
Unexpected Kernel Mode Trap 6
Kernel Mode Exception Not Handled 5
IRQL Not Less or Equal 3
Driver IRQL Not Less or Equal 3
NTFS File System 2
Bad Pool Caller 2
PFN List Corrupt 1
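
To make the relative weight of each stop message easier to read, the counts above can be converted to shares of the 50 crashes. This is a small sketch using only the numbers on the slide; `STOP_CODE_COUNTS` and `shares` are hypothetical names.

```python
# Counts copied from the slide (50 system crashes in total).
STOP_CODE_COUNTS = {
    "Thread stuck in device driver": 12,
    "Page Fault in Non-Paged Area": 10,
    "System Thread Exception Not Handled": 6,
    "Unexpected Kernel Mode Trap": 6,
    "Kernel Mode Exception Not Handled": 5,
    "IRQL Not Less or Equal": 3,
    "Driver IRQL Not Less or Equal": 3,
    "NTFS File System": 2,
    "Bad Pool Caller": 2,
    "PFN List Corrupt": 1,
}


def shares(counts):
    """Map each category to its fraction of the total count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```

For example, the two most frequent stop messages together account for 22 of the 50 crashes.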
14
Sample Statistics (50 system crashes)
  • watchdog.sys 7
  • ar5211.sys 6
  • ibmpmdrv.sys 6
  • ati3duag.dll 5
  • SYMEVENT.SYS 3
  • ipsecw2k.sys 3
  • memory_corruption 3
  • ialmdev5.DLL 2
  • PSCRIPT4.DLL 2
  • ntoskrnl.exe 2
  • CLASSPNP.SYS 2
  • win32k.sys 2
  • SynTP.sys 1
  • TDI.SYS 1
  • ino_fltr.sys 1
  • ks.sys 1
  • drvnddm.sys 1
  • ntkrnlmp.exe 1
  • Pool_Corruption 1

15
Metrics (Windows & Linux)
  • Availability
  • system uptime, time BOINC running
  • CPU(s)
  • number of processes, processor queue length, non-idle time
  • Memory
  • available physical memory, free swap space
  • Disk(s)
  • free space
  • Network(s)
  • IP address, packets/bytes sent/received per sec,
    bandwidth to/from SETI@home server, first-hop
    bandwidth, network coordinates
  • Static
  • CPU type, count, and benchmarks; total memory; OS
    type
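
A few of the metrics listed above can be sampled portably with the Python standard library alone. The sketch below is an assumption-laden illustration, not the collector used in the project: `collect_snapshot` and its field names are hypothetical, and metrics such as processor queue length, free swap space, or network coordinates are omitted because they need platform-specific sources.

```python
import os
import platform
import shutil
import socket
import time


def collect_snapshot(path="."):
    """Sample a handful of static and dynamic metrics for this machine.

    `path` selects the disk whose free space is reported.
    """
    usage = shutil.disk_usage(path)
    return {
        "timestamp": time.time(),          # seconds since the epoch
        "hostname": socket.gethostname(),
        "os_type": platform.system(),      # e.g. "Windows" or "Linux"
        "cpu_type": platform.machine(),    # hardware architecture string
        "cpu_count": os.cpu_count(),       # may be None if undetermined
        "disk_total_bytes": usage.total,
        "disk_free_bytes": usage.free,
    }
```

Calling `collect_snapshot()` at a fixed interval and appending the result to a log would give one time series per machine.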

16
Questions
  • Other metrics?
  • Frequency with which to measure them?
  • What research questions can we answer with this
    data set?
  • original goal: workload to evaluate our node
    discovery service
  • evaluate effectiveness of network coordinates
  • evaluate potential to run more than just
    embarrassingly parallel apps on this type of
    infrastructure depending on
  • machines' uptime
  • network connectivity
  • available disk space
  • distributed analysis?
  • security uses?

17
Internet Services
18
Data characteristics
  • Real companies
  • Multitude of users
  • Voluminous data (several terabytes)
  • Systems are complex
  • Treat as black box
  • Use SLT algorithms for analysis
  • More data => better models

19
Analysis Results
  • Study event logs
  • Not necessarily failures
  • Can derive models of good and bad behavior
  • Models with varying granularity
  • Use different algorithms
  • Vary boundary parameters
  • For more details see poster
  • Towards a General Approach for Event Log
    Analysis
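
The slides do not say which algorithms were applied to the event logs. Purely as an illustrative stand-in, the sketch below shows how "granularity" and "boundary parameters" can act as tunable knobs: events are counted per fixed time window (`window_seconds`), and a window is flagged when its count exceeds the mean by `k` standard deviations. Both function names and the thresholding rule are assumptions, not the method from the poster.

```python
from collections import Counter
from statistics import mean, pstdev


def windowed_counts(timestamps, window_seconds):
    """Count events per fixed-size time window (the granularity knob).

    Returns one count per window from the first to the last non-empty
    window, including empty windows in between.
    """
    counts = Counter(int(t // window_seconds) for t in timestamps)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    return [counts.get(w, 0) for w in range(first, last + 1)]


def flag_unusual(counts, k):
    """Return indices of windows whose count exceeds mean + k * stddev
    (k is the boundary parameter)."""
    if not counts:
        return []
    threshold = mean(counts) + k * pstdev(counts)
    return [i for i, c in enumerate(counts) if c > threshold]
```

Re-running the same log with a different `window_seconds` or `k` changes which windows are flagged, which is one simple way to compare models of differing granularity.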

20
Distributed Apps
21
PlanetLab
  • An open platform for developing, deploying, and
    accessing planetary-scale services
  • 392 nodes at 164 sites around the world
  • Per-site system administration
  • Applications: OceanStore, PIER

22
Why?
  • Platform for injecting faults and testing our
    algorithms
  • Applications on RADS-like environment
  • Research platform
  • More accessible
  • University-developed apps most likely to be
    tested on PlanetLab

23
Applications
  • 1) OceanStore
  • Global persistent data store.
  • In the process of running prototype on PlanetLab
  • Good source of failure data
  • 2) PIER
  • Distributed query processor
  • Currently running on PlanetLab
  • Good source of failure data and analysis engine

24
What do we do with these apps?
  • Instrument applications to collect any type of
    information
  • Choice of granularity
  • Open source - no longer black box
  • Can modify it as much as necessary

25
Questions
  • What other applications can we use?
  • What should we measure and model?
  • What information is useful for industry?
  • Do you have any failure/attack data you are
    willing to share with us?