Title: Reliability and Troubleshooting with Condor


1
Reliability and Troubleshooting with Condor
  • Douglas Thain
  • Condor Project
  • University of Wisconsin
  • PPDG Troubleshooting Workshop
  • 12 December 2002

2
Condor Reliability
  • Condor was designed for idle machines.
      • Reclaim, reboot, crash, out of memory...
      • Sounds much like the grid!
  • US-CMS testbed
      • Distributed ownership, control, and resources.
      • (War stories abound.)
  • Condor tools add controlled reliability.
      • Not absolute reliability, but:
          • A finite amount of retry.
          • A notification/recovery strategy.
          • Logging and book-keeping.
          • Known state after a failure.

3
US-CMS Physical Structure
(Diagram labels: MOP Master; Public Internet; Head Node and Workers, repeated three times.)

4
US-CMS Logical Structure
(Diagram labels: Master Site, Worker, Impala, Globus, MOP, Condor, DAGMan, Real Work, Condor-G.)
Red items expect a reliable environment. Green items create a reliable environment.

5
Condor-G
(Diagram labels: End-User Tools (transaction interface), Condor-G Submitter, Job Queue, Job Log, System Log, Grid Managers, GAHP-Server, GRAM, Head Node, Gatekeeper, Job Managers, Local Resource Manager.)

6
Directed Acyclic Graph Manager (DAGMan)
  • Condor-G deals with system failures; DAGMan deals
    with application and user failures.
  • PRE and POST scripts may be used to validate inputs and
    outputs.
  • The Rescue DAG describes what is left unexecuted.
  • DAG nodes may themselves be DAGs.
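
The ideas on this slide (dependency order, PRE/POST validation, and a rescue list of unexecuted work) can be sketched in a few lines of Python. This is only an illustration, not DAGMan itself: `run_dag`, its arguments, and the rule that any failed check simply fails the node are assumptions made for the sketch, and real DAGMan semantics are richer.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_dag(jobs, parents, pre=None, post=None):
    """Run jobs in dependency order and report what is left unexecuted.

    jobs:     dict mapping node name -> callable returning True on success
    parents:  dict mapping node name -> set of parent node names
    pre/post: optional dicts mapping node name -> validation callable
    Returns (done, rescue): completed nodes, and nodes a rerun must cover.
    All names here are hypothetical; this is not the DAGMan interface.
    """
    pre = pre or {}
    post = post or {}
    done = []
    graph = {name: parents.get(name, set()) for name in jobs}
    for name in TopologicalSorter(graph).static_order():
        # A node is skipped when any of its parents did not complete.
        if any(p not in done for p in parents.get(name, ())):
            continue
        ok = (
            pre.get(name, lambda: True)()        # validate inputs
            and jobs[name]()                     # do the real work
            and post.get(name, lambda: True)()   # validate outputs
        )
        if ok:
            done.append(name)
    # The "rescue" list plays the role of a Rescue DAG: everything not finished.
    rescue = [name for name in jobs if name not in done]
    return done, rescue


# A diamond-shaped DAG: A -> (B, C) -> D, where B's output fails validation.
jobs = {name: (lambda: True) for name in "ABCD"}
parents = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
post = {"B": lambda: False}
done, rescue = run_dag(jobs, parents, post=post)
print("done:", done, "rescue:", rescue)
```

In this run, A and C complete, B fails its output check, and D is never started, so the rescue list holds B and D. That is the "known state after a failure" a rerun can start from.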

(Diagram: example DAG; the surviving node labels are A, B, D.)

7
Fault Tolerant Shell (FTSH)
  • Standard shell scripts are very error-prone.
  • FTSH adds time limits, retry, logging, and clean
    termination.
  • Exceptions for scripts: unexpected errors
    cannot accidentally be ignored.

    try 10 times
        try for 15 minutes
            globus_url_copy A B
        end
        try for 1 hour
            run-simulation < B > C
            gzip < C > D
        end
        try for 15 minutes
            globus_url_copy D E
        end
    end
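
For readers unfamiliar with FTSH, the bounded-retry idea behind `try ... end` can be approximated in Python. This is a rough sketch under stated assumptions, not FTSH's implementation: `try_block`, `TryFailed`, and `run` are hypothetical names, the doubling delay between attempts is a choice made here, and the time budget is only checked between attempts rather than interrupting a running command.

```python
import subprocess
import time


class TryFailed(Exception):
    """Raised when a try block runs out of attempts or time."""


def run(*argv):
    """Run an external command; a non-zero exit raises CalledProcessError."""
    subprocess.run(argv, check=True)


def try_block(body, times=None, seconds=None, delay=1.0):
    """Call body() until it succeeds, bounded by attempts and/or elapsed time.

    Hypothetical helper for illustration only. Any exception from body()
    counts as a failure, so errors cannot be silently ignored.
    """
    if times is None and seconds is None:
        raise ValueError("give at least one bound: times or seconds")
    deadline = None if seconds is None else time.monotonic() + seconds
    attempt = 0
    while True:
        attempt += 1
        try:
            return body()
        except Exception as exc:
            last_error = exc
        if times is not None and attempt >= times:
            break
        if deadline is not None and time.monotonic() >= deadline:
            break
        time.sleep(delay)
        delay = min(delay * 2, 60.0)  # assumed backoff policy, capped at 60 s
    raise TryFailed(f"gave up after {attempt} attempt(s)") from last_error


# Nested use would mirror the slide's structure, for example:
#   try_block(lambda: try_block(lambda: run("globus_url_copy", "A", "B"),
#                               seconds=15 * 60),
#             times=10)

# Self-contained demonstration with a function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = try_block(flaky, times=10, delay=0)
print(result, "after", calls["n"], "calls")
```

Because `TryFailed` is an ordinary exception, an inner block that gives up counts as one failed attempt of the enclosing block, which is what makes the nesting on the slide meaningful.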

8
Hawkeye
(Diagram labels: Hawkeye Manager, Policy Manager, Trigger Exprs, ClassAd Queries, ClassAd Data, Probe Modules (three instances), Submit Repair Job, Contact Sysadmin, Log Event.)
  • (Example Hawkeye Page)

9
For More Info...
  • Condor-G
      • http://www.cs.wisc.edu/condor/condorg
  • DAGMan
      • http://www.cs.wisc.edu/condor/dagman
  • Fault Tolerant Shell
      • http://www.cs.wisc.edu/thain/research/ftsh
  • Hawkeye
      • http://www.cs.wisc.edu/condor/hawkeye
  • Philosophy of Error Management
      • http://www.cs.wisc.edu/condor/doc/error-scope.pdf
  • The Condor Project
      • http://www.cs.wisc.edu/condor