Troubleshooting Distributed Systems via Data Mining - PowerPoint PPT Presentation

Slides: 14
Provided by: Douglas253
Learn more at: https://www3.nd.edu

Transcript and Presenter's Notes

Title: Troubleshooting Distributed Systems via Data Mining


1
Troubleshooting Distributed Systems via Data Mining
  • David Cieslak, Douglas Thain, and Nitesh Chawla
  • University of Notre Dame

2
It's Ugly in the Real World
  • Machine-related failures: power outages, network outages, faulty
    memory, corrupted file system, bad config files, expired certs,
    packet filters...
  • Job-related failures: crash on some args, bad executable, missing
    input files, mistake in args, missing components, failure to
    understand dependencies...
  • Incompatibilities between jobs and machines: missing libraries, not
    enough disk/cpu/mem, wrong software installed, wrong version
    installed, wrong memory layout...
  • Load-related failures: slow actions induce timeouts; kernel tables
    (files, sockets, procs); router tables (addresses, routes,
    connections); competition with other users...
  • Non-deterministic failures: multi-thread/CPU synchronization, event
    interleaving across systems, random number generators, interactive
    effects, cosmic rays...

3
Reports of Bad News
  • Grid2003: Thirty percent of ATLAS/CMS jobs failed!
  • Jobs often failed due to site configuration problems, or in groups
    from site service failures.
  • - R. Gardner et al., The Grid2003 Production Grid: Principles and
    Practice, HPDC 2003.
  • Users ... need tools to help debug why failures happen.
  • Need user-oriented diagnostic tools.
  • - J. Schopf, State of Grid Users: 25 Conversations with UK eScience
    Groups, Argonne Tech Report ANL/MCS-TM-278.

4
A Grand Challenge Problem
  • A user submits one million jobs to the grid.
  • Half of them fail.
  • Now what?
  • Examine the output of every failed job?
  • Log in to every site to examine the logs?
  • Resubmit and hope for the best?
  • We need some way of getting the big picture.
  • Need to identify problems not seen before.

5
The Wisdom of Secretary Rumsfeld
  • As we know, there are known knowns. There are things we know we
    know. We also know there are known unknowns. That is to say, we
    know there are some things we do not know. But there are also
    unknown unknowns, the ones we don't know we don't know.
  • - Donald Rumsfeld, 12 February 2002

6
An Idea
  • We have lots of structured information about the
    components of a grid.
  • Can we perform some form of data mining to
    discover the big picture of what is going on?
  • User: Your jobs work fine on RH Linux 12.1 and 12.3, but they
    always seem to crash on version 12.2.
  • Admin: User joe is running 1000s of jobs that transfer 10 TB of
    data and fail immediately; perhaps he needs help.
  • Can we act on this information to improve the system?
  • User: Avoid resources that are not working for you.
  • Admin: Assist the user in understanding and fixing the problem.

7
Job ClassAd
MyType = "Job"
TargetType = "Machine"
ClusterId = 11839
QDate = 1150231068
CompletionDate = 0
Owner = "dcieslak"
JobUniverse = 5
Cmd = "ripper-cost-can-9-50.sh"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
ExitStatus = 0
ImageSize = 40000
DiskUsage = 110000
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
ExitBySignal = FALSE
PoolName = "ccl00.cse.nd.edu"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
RootDir = "/"
Iwd = "/tmp/dcieslak/smotewrap1"
MinHosts = 1
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobPrio = 0
User = "dcieslak@nd.edu"
NiceUser = FALSE
Env = "LD_LIBRARY_PATH=."
EnvDelim = ""
JobNotification = 0
WantRemoteIO = TRUE
UserLog = "/tmp/dcieslak/smotewrap1/ripper-cost-can-9-50.log"
CoreSize = -1
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "ripper-cost-can-9-50.output"
StreamOut = FALSE
Err = "ripper-cost-can-9-50.error"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "YES"
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
TransferFiles = "ALWAYS"
TransferInput = "scripts.tar.gz,can-ripper.tar.gz"
TransferOutput = "ripper-cost-50-can-9.tar.gz"
ExecutableSize_RAW = 1
ExecutableSize = 10000
Requirements = (OpSys == "LINUX") && (Arch == "INTEL") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)
JobLeaseDuration = 1200
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Arguments = ""
GlobalJobId = "cclbuild02.cse.nd.edu115023106911839.0"
ProcId = 0
AutoClusterId = 0
AutoClusterAttrs = "Owner,Requirements"
JobStartDate = 1150256907
LastRejMatchReason = "no match found"
LastRejMatchTime = 1150815515
TotalSuspensions = 73
CumulativeSuspensionTime = 8179
RemoteWallClockTime = 432493.000000
LastRemoteHost = "hobbes.helios.nd.edu"
LastClaimId = "<129.74.221.1689359>11508117332"
MaxHosts = 1
WantMatchDiagnostics = TRUE
LastMatchTime = 1150817352
NumJobMatches = 34
OrigMaxHosts = 1
JobStatus = 2
EnteredCurrentStatus = 1150817354
LastSuspensionTime = 0
CurrentHosts = 1
ClaimId = "<129.74.20.209322>1150232335157"
RemoteHost = "vm2@sirius.cse.nd.edu"
RemoteVirtualMachineID = 2
ShadowBday = 1150817355
JobLastStartDate = 1150815519
JobCurrentStartDate = 1150817355
JobRunCount = 24
WallClockCheckpoint = 65927
RemoteSysCpu = 52.000000
ImageSize_RAW = 31324
DiskUsage_RAW = 102814
RemoteUserCpu = 62319.000000
LastJobLeaseRenewal = 11

Machine ClassAd
MyType = "Machine"
TargetType = "Job"
Name = "ccl00.cse.nd.edu"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
MachineGroup = "ccl"
MachineOwner = "dthain"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
VirtualMachineID = 1
ExecutableSize = 20000
JobUniverse = 1
NiceUser = FALSE
VirtualMemory = 962948
Memory = 498
Cpus = 1
Disk = 19072712
CondorLoadAvg = 1.000000
LoadAvg = 1.130000
KeyboardIdle = 817093
ConsoleIdle = 817093
StartdIpAddr = "<129.74.153.1649453>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "nd.edu"
FileSystemDomain = "nd.edu"
Subnet = "129.74.153"
HasIOProxy = TRUE
CheckpointPlatform = "LINUX INTEL 2.4.x normal"
TotalVirtualMemory = 962948
TotalDisk = 19072712
TotalCpus = 1
TotalMemory = 498
KFlops = 659777
Mips = 2189
LastBenchmark = 1150271600
TotalLoadAvg = 1.130000
TotalCondorLoadAvg = 1.000000
ClockMin = 347
ClockDay = 3
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Claimed"
EnteredCurrentState = 1150284871
Activity = "Busy"
EnteredCurrentActivity = 1150877237
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = (START) && (IsValidCheckpointPlatform)
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
CurrentRank = 1.000000
RemoteUser = "johanes@nd.edu"
RemoteOwner = "johanes@nd.edu"
ClientMachine = "cclbuild00.cse.nd.edu"
JobId = "2929.0"
GlobalJobId = "cclbuild00.cse.nd.edu11504255942929.0"
JobStart = 1150425941
LastPeriodicCheckpoint = 1150879661
ImageSize = 54196
TotalJobRunTime = 456222
TotalJobSuspendTime = 1080
TotalClaimRunTime = 597057
TotalClaimSuspendTime = 1271
MonitorSelfTime = 1150883051
MonitorSelfCPUUsage = 0.066660
MonitorSelfImageSize = 8244.000000
MonitorSelfResidentSetSize = 2036
MonitorSelfAge = 0
DaemonStartTime = 1150231320
UpdateSequenceNumber = 2208
MyAddress = "<129.74.153.1649453>"
LastHeardFrom = 1150883243
UpdatesTotal = 2785
UpdatesSequenced = 2784
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
Machine = "ccl00.cse.nd.edu"
Rank = ((Owner == "dthain") || (Owner == "psnowber") || (Owner == "cmoretti") || (Owner == "jhemmes") || (Owner == "gniederw")) * 2 + (PoolName =?= "ccl00.cse.nd.edu") * 1

User Job Log
Job 1 submitted.
Job 2 submitted.
Job 1 placed on ccl00.cse.nd.edu.
Job 1 evicted.
Job 1 placed on smarty.cse.nd.edu.
Job 1 completed.
Job 2 placed on dvorak.helios.nd.edu.
Job 2 suspended.
Job 2 resumed.
Job 2 exited normally with status 1.
...
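
The job and machine ClassAds above are flat lists of attribute/value pairs, which is what makes them usable as mining input. The following is a minimal sketch, not the authors' tooling: it assumes each ad is available as text with one `Name = Value` pair per line, and the helper names `convert`, `parse_classad`, and `feature_row` are hypothetical. Expressions such as `CpuBusy` are kept as raw text rather than evaluated.

```python
def convert(value):
    """Turn a ClassAd value string into a Python value where that is unambiguous."""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]              # quoted string
    if value in ("TRUE", "FALSE"):
        return value == "TRUE"          # boolean literal
    for cast in (int, float):
        try:
            return cast(value)          # plain number
        except ValueError:
            pass
    return value                        # expression: keep the raw text


def parse_classad(text):
    """Parse 'Name = Value' lines into a dict, splitting at the first '='."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = convert(value.strip())
    return ad


def feature_row(job_ad, machine_ad):
    """Join a job ad and the ad of the machine it ran on into one record."""
    row = {"job." + key: val for key, val in job_ad.items()}
    row.update({"machine." + key: val for key, val in machine_ad.items()})
    return row


# Illustrative input only; a few attributes in the style of the slide.
job = parse_classad('ClusterId = 11839\nOwner = "dcieslak"\nExitBySignal = FALSE')
machine = parse_classad(
    'Memory = 498\nLoadAvg = 1.130000\n'
    'CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.5)'
)
row = feature_row(job, machine)
```

Prefixing the keys keeps attributes that appear in both ads, such as `ImageSize`, from overwriting each other.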

8
Failure Criteria
  • exit != 0
  • core dump
  • evicted
  • suspended
  • bad output
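
The failure criteria can be read as a labeling function that turns each job into a training example. The sketch below is one possible reading, not the presenters' code: the `JobOutcome` record and its field names are hypothetical, and treating the criteria as independent switches (a job fails if any one of them holds) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class JobOutcome:
    """Hypothetical per-job summary, e.g. distilled from the user job log."""
    exit_code: int = 0
    core_dumped: bool = False
    evicted: bool = False
    suspended: bool = False
    bad_output: bool = False


def failure_reasons(outcome):
    """Return the names of the failure criteria that this job meets."""
    checks = [
        ("exit != 0", outcome.exit_code != 0),
        ("core dump", outcome.core_dumped),
        ("evicted", outcome.evicted),
        ("suspended", outcome.suspended),
        ("bad output", outcome.bad_output),
    ]
    return [name for name, hit in checks if hit]


def is_failure(outcome):
    """A job counts as failed if it meets at least one criterion."""
    return bool(failure_reasons(outcome))


# The two jobs from the user job log: job 1 was evicted once, then
# completed; job 2 was suspended and exited with status 1.
job1 = JobOutcome(evicted=True)
job2 = JobOutcome(exit_code=1, suspended=True)
clean = JobOutcome()
```

Which criteria to enable is a per-analysis choice: the rules on the next slide, for example, predict only the exit status.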

9
  • ------------------------- run 1 -------------------------
  • Hypothesis:
  • exit1 :- Memory>=1930, JobStart>=1.14626e09,
    MonitorSelfTime>=1.14626e09 (491/377).
  • exit1 :- Memory>=1930, Disk<=555320 (1670/1639).
  • default exit0 (11904/4503).
  • Error rate on holdout data is 30.9852%
  • Running average of error rate is 30.9852%
  • ------------------------- run 2 -------------------------
  • Hypothesis:
  • exit1 :- Memory>=1930, Disk<=541186 (2076/1812).
  • default exit0 (12090/4606).
  • Error rate on holdout data is 31.8791%
  • Running average of error rate is 31.4322%
  • ------------------------- run 3 -------------------------
  • Hypothesis:
  • exit1 :- Memory>=1930, MonitorSelfImageSize>=8.844e09 (1270/1050).
  • exit1 :- Memory>=1930, KeyboardIdle>=815995 (793/763).
  • exit1 :- Memory>=1927, EnteredCurrentState<=1.14625e09,
    VirtualMemory>=2.09646e06, LoadAvg>=30000,
    LastBenchmark<=1.14623e09, MonitorSelfImageSize<=7.836e09 (94/84).
  • exit1 :- Memory>=1927, TotalLoadAvg<=1.43e06, UpdatesTotal<=8069,
    LastBenchmark<=1.14619e09, UpdatesLost<=1 (77/61).
  • default exit0 (11940/4452).
  • Error rate on holdout data is 31.8111%
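
Each hypothesis above is an ordered rule list with a default class, scored on held-out data. The sketch below shows how such a list can be applied and scored. It is an illustration under stated assumptions, not the learner used in the talk: the thresholds are read as `>=` and `<=`, the rule is the one from run 2, the `HOLDOUT` rows are invented for the example, and the function names are hypothetical.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le}

# Rule list modeled on run 2: one rule, then the default class.
RULES = [("exit1", [("Memory", ">=", 1930), ("Disk", "<=", 541186)])]
DEFAULT = "exit0"


def predict(row):
    """Return the label of the first rule whose conditions all hold."""
    for label, conditions in RULES:
        if all(OPS[op](row[attr], threshold) for attr, op, threshold in conditions):
            return label
    return DEFAULT


def error_rate(labeled_rows):
    """Percentage of (row, actual_label) pairs that are misclassified."""
    wrong = sum(1 for row, actual in labeled_rows if predict(row) != actual)
    return 100.0 * wrong / len(labeled_rows)


def running_average(rates):
    """Mean of the per-run holdout error rates seen so far."""
    return sum(rates) / len(rates)


# Invented holdout rows, for illustration only.
HOLDOUT = [
    ({"Memory": 2048, "Disk": 500000}, "exit1"),
    ({"Memory": 2048, "Disk": 900000}, "exit0"),
    ({"Memory": 498, "Disk": 19072712}, "exit0"),
    ({"Memory": 1024, "Disk": 400000}, "exit1"),  # missed by the rule
]

holdout_error = error_rate(HOLDOUT)
average_after_run_2 = running_average([30.9852, 31.8791])
```

Averaging the run 1 and run 2 error rates reproduces the 31.4322 reported as the running average after run 2.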

10
Unexpected Discoveries
  • Purdue Teragrid (91343 jobs on 2523 CPUs)
  • Jobs fail on machines with (Memory > 1920 MB).
  • Diagnosis: Linux machines with > 3 GB have a different memory
    layout that breaks some programs that do inappropriate pointer
    arithmetic.
  • UND and UW (4005 jobs on 1460 CPUs)
  • Jobs fail on machines with less than 4 MB disk.
  • Diagnosis: Condor failed in an unusual way when the job transfers
    input files that don't fit.
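
One way a user could act on a discovery like the memory rule is to steer later jobs away from the machines it describes. The sketch below negates a learned conjunction and appends it to a job's requirements expression as text. It is only an illustration: `avoidance_clause` and `steer` are hypothetical helpers, the example requirement string is made up, and a correlation found this way may not be the real cause of the failures.

```python
# Comparison operators and their logical negations.
NEGATE = {">=": "<", "<=": ">", ">": "<=", "<": ">="}


def avoidance_clause(conditions):
    """Negate a conjunction: NOT (a AND b) becomes (NOT a) OR (NOT b)."""
    parts = [f"({attr} {NEGATE[op]} {threshold})" for attr, op, threshold in conditions]
    if len(parts) == 1:
        return parts[0]
    return "(" + " || ".join(parts) + ")"


def steer(requirements, conditions):
    """Append the avoidance clause to an existing requirements expression."""
    return f"({requirements}) && {avoidance_clause(conditions)}"


# The Purdue discovery, written as a single-condition failure rule.
memory_rule = [("Memory", ">", 1920)]
steered = steer('(OpSys == "LINUX")', memory_rule)
```

Here `steered` is `((OpSys == "LINUX")) && (Memory <= 1920)`. A rule with several conditions yields a disjunction, so a machine is excluded only when it matches every condition of the failure rule.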

11
Many Open Problems
  • Strengths and Weaknesses of Approach
  • Correlation != Causation -> could be enough?
  • Limits of reported data -> increase resolution?
  • Not enough data points -> direct job placement?
  • Acting on Information
  • Steering by the end user.
  • Applying learned rules back to the system.
  • Evaluating (and sometimes abandoning) changes.
  • Creating tools that assist with digging deeper.
  • Data Mining Research
  • Continuous intake and incremental construction.
  • Creating results that non-specialists can
    understand.

12
Just Getting Started
  • Douglas Thain
  • University of Notre Dame
  • dthain_at_cse.nd.edu
  • We like to collect things
  • Obscure failure modes.
  • War stories about how the bugs were found.
  • Log files from big batch runs.
