What's New in Condor-G

http://www.cs.wisc.edu/condor

Provided by: Miron1

1
What's New in Condor-G
2
Outline
  • What is Condor-G
  • Released New Features
  • In Development

3
What Is Condor-G
  • Use Condor to run jobs on the Grid
  • Uses Globus Toolkit
    • GRAM (submit a remote job)
    • GASS (transfer a job's files)
  • Two components
    • Globus Universe
    • GlideIn

4
Globus Universe
  • Run a job on a Grid resource
  • Features
    • Job management
    • Fault tolerance
    • Credential management
  • Roughly equivalent to the vanilla universe

5-9
How It Works
(Diagram built up across five slides, labeled Condor-G and Grid
Resource. Slides 5 and 6 show the Schedd and LSF, slide 7 adds the
GridManager, slide 8 adds the JobManager, and slide 9 adds the User
Job.)
10
GlideIn
  • Run the Condor daemons on Grid resources as user
    jobs
  • Create your own personal Condor pool from
    temporarily-acquired Grid resources
  • Brings the full power of Condor to the Grid

11-12
Condor-G
13-17
(No Transcript)
18
Released New Features
  • Stuff we've added in the past year
  • Released and ready for use in Condor 6.6

19
Globus ASCII Helper Protocol (GAHP)
  • Encapsulates the Globus libraries in a separate process
  • Simple ASCII protocol
  • Easy for legacy applications to use Globus when
    they can't link directly with the libraries
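
The idea of hiding a library behind a line-oriented ASCII protocol can be sketched in a few lines of Python. This is only an illustration of the pattern: the command names (VERSION, SUBMIT, STATUS, QUIT) and the "S"/"E" reply prefixes are hypothetical and are not claimed to match the real GAHP command set.

```python
# Toy line-oriented ASCII helper, illustrating the pattern only.
# All command names and reply formats here are hypothetical; they are
# NOT taken from the real GAHP protocol.

def make_helper():
    """Return a handler that maps one ASCII request line to one reply line."""
    jobs = {}  # job id -> state; stands in for calls into a linked library

    def handle(line):
        parts = line.strip().split(" ")
        cmd, args = parts[0].upper(), parts[1:]
        if cmd == "VERSION":
            return "S toy-helper-0.1"
        if cmd == "SUBMIT" and len(args) == 1:
            job_id = str(len(jobs) + 1)
            jobs[job_id] = "PENDING"
            return "S " + job_id
        if cmd == "STATUS" and len(args) == 1:
            state = jobs.get(args[0])
            return "S " + state if state else "E unknown-job"
        if cmd == "QUIT":
            return "S"
        return "E bad-command"

    return handle


handle = make_helper()
requests = ["VERSION", "SUBMIT /bin/hostname", "STATUS 1", "STATUS 9", "FROB"]
replies = [handle(r) for r in requests]
for request, reply in zip(requests, replies):
    print(request, "->", reply)
```

In a real deployment the handler would sit behind the standard input and output of a separate process, so a client only needs to read and write text lines and never links against the library itself.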

20
How It Works - GAHP
(Diagram. Labels: Condor-G, Grid Resources, Schedd, GridManager,
GAHP Client, GAHP Server, and three JobManagers.)
21
File Staging
  • Arbitrary input and output files can be staged to
    and from the execution site
  • Same syntax as other universes
  • Limitation
    • Output files must be explicitly named

22
File Staging (cont.)
  • Input, Output, and Error can be URLs
  • Files will be transferred directly to and from
    the execution site
  • Output and Error can be staged or streamed

23
Credential Refresh
  • Renewed credentials are used by Condor-G and
    forwarded to the execution site automatically
  • No processes need to be restarted

24
Better Credential Management
  • One GridManager process can handle multiple
    credential files with the same subject
  • More efficient when you want different
    credential lifetimes for different jobs

25
Grid Match-Making
  • Globus jobs are matched with Globus resources by the
    Condor match-maker using ClassAds
  • Current limitation
    • User/admin must create the resource ads
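
A rough sketch of what matching a job against hand-written resource ads amounts to, in Python rather than ClassAd syntax. The site names, URLs, and attribute values are made up for illustration, and the sketch is one-sided: it checks only the job's requirements and rank, whereas the Condor match-maker also evaluates the resource's side of the match.

```python
# One-sided matchmaking sketch. Resource ads are plain dicts here; all
# names and values below are hypothetical examples, not real sites.

resource_ads = [
    {"Name": "siteA", "gatekeeper_url": "siteA.example.org/jobmanager",
     "OpSys": "LINUX", "Mflops": 300},
    {"Name": "siteB", "gatekeeper_url": "siteB.example.org/jobmanager",
     "OpSys": "SOLARIS", "Mflops": 900},
    {"Name": "siteC", "gatekeeper_url": "siteC.example.org/jobmanager",
     "OpSys": "LINUX", "Mflops": 450},
]

job_ad = {
    # Requirements: which resources are acceptable at all.
    "requirements": lambda target: target["OpSys"] == "LINUX",
    # Rank: among acceptable resources, higher is better.
    "rank": lambda target: target["Mflops"],
}


def match(job, resources):
    """Return the acceptable resource with the highest rank, or None."""
    candidates = [r for r in resources if job["requirements"](r)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


best = match(job_ad, resource_ads)
print(best["Name"], best["gatekeeper_url"])
```

Here siteB has the highest Mflops but fails the requirements, so the job is matched to siteC.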

26
Fault Tolerance
  • Condor-G does its best to automatically recover
    from failures
  • User can guide decisions with job policy
    expressions
    • PeriodicRelease
    • GlobusResubmit
    • Rematch

27
PeriodicRelease Expression
  • Condor-G puts problematic jobs on hold
  • This expression tells Condor-G when to release
    and retry such jobs

28
GlobusResubmit Expression
  • Tells Condor-G when a problematic job submission
    should be abandoned
  • When this expression becomes true
    • Best effort is made to clean up the current job
      submission
    • A new job submission is attempted

29
Rematch Expression
  • Tells Condor-G when a problematic resource should
    be abandoned
  • Evaluated when GlobusResubmit evaluates to true
  • When this expression becomes true
    • Best effort is made to clean up the current job
      submission
    • The job is rematched

30
Job Ad Example
  • GlobusContactString = TARGET.gatekeeper_url
  • Requirements = TARGET.Arch == "LINUX" &&
    TARGET.OpSys == "LINUX"
  • Rank = TARGET.Mflops
  • PeriodicRelease = ((NumMatches < 10) &&
    ((CurrentTime - EnteredCurrentStatus) > 600))
  • GlobusResubmit = NumSystemHolds > NumMatches
  • Rematch = True
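
To make the three policy expressions from this example concrete, here is a small Python sketch that evaluates them against a job ad held as a dict. The attribute names come from the slide; the `decide` helper and the order in which it combines the expressions are assumptions made for illustration, not a description of how the GridManager schedules these checks.

```python
# Python stand-ins for the policy expressions in the job ad example.
# Times are in seconds, so 600 means the job has spent ten minutes in
# its current state.

def periodic_release(ad, now):
    return ad["NumMatches"] < 10 and (now - ad["EnteredCurrentStatus"]) > 600


def globus_resubmit(ad):
    return ad["NumSystemHolds"] > ad["NumMatches"]


def rematch(ad):
    return True


def decide(ad, now):
    """Hypothetical helper: pick an action for a job that is on hold.

    Assumption of this sketch: the hold is lifted first, then
    GlobusResubmit is checked, and Rematch only if GlobusResubmit is true.
    """
    if not periodic_release(ad, now):
        return "stay on hold"
    if globus_resubmit(ad):
        return "rematch" if rematch(ad) else "resubmit"
    return "release and retry"


held_job = {"NumMatches": 1, "NumSystemHolds": 0, "EnteredCurrentStatus": 0}
print(decide(held_job, now=100))   # held for less than ten minutes
print(decide(held_job, now=700))   # held long enough to be released
print(decide(dict(held_job, NumSystemHolds=2), now=700))
```

With Rematch fixed to True, as in the slide, every abandoned submission leads to a fresh match rather than a resubmission to the same resource.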

31
Hardening
  • Regular testing on the CMS testbed with real
    applications
  • Many bugs and integration issues found and fixed
  • Hostile Environment

32
Hostile Environment
  • Full disks
  • Machine crashes
  • File server lock-ups
  • Network outages
  • Power outages

33
One CMS Dataset Run
  • 300 jobs
  • Last fall
    • 50 (16%) of the jobs stalled and required human
      recovery
    • Multiple service restarts (20 daemon crashes over
      6 hours)
  • Now
    • 0 jobs stalled
    • 0 service restarts

34
Integration Work
  • Dozens of Condor-G improvements and bug fixes
  • Over 40 Globus bugzilla incidents, many with
    patches
  • Globus 2.2.4 has 21 Advisories as of 4/11/04
  • Use latest version of both

35
Scalability
  • Submitting several hundred jobs produced high
    load on the server
    • Machine became unresponsive
    • We saw a load average of 1000 at one point
  • Caused by Globus JobManager processes

36
Grid Manager Monitor Agent
  • New tool that Condor-G can use to reduce this load
  • Efficient job status polling program
  • Allows Condor-G to shut down JobManager processes
    when they're not needed

37
Load Reduced
  • 400 jobs (/bin/sleep 900)
  • Without Grid Monitor
    • 42 hours to complete
    • Peak load average of 610
  • With Grid Monitor
    • 40 minutes
    • Peak load average of 104
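
The size of the improvement follows directly from the figures on this slide. The short Python calculation below uses only those numbers; the variable names are illustrative.

```python
# Ratios computed from the slide's figures for the 400-job test.
without_minutes = 42 * 60   # 42 hours without the Grid Monitor
with_minutes = 40           # 40 minutes with the Grid Monitor

speedup = without_minutes / with_minutes
load_ratio = 610 / 104      # peak load average, without vs. with

print(f"Completion time: {speedup:.0f}x faster")
print(f"Peak load average: {load_ratio:.1f}x lower")
```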

38
Miscellaneous Stuff
  • Email notification on job completion
  • Port range restrictions
  • Problem jobs put on hold

39
In Development
  • Stuff we're currently working on
  • Will be released sometime in the next year

40
Job Policy Expressions
  • PeriodicHold
  • PeriodicRemove
  • OnExitHold
  • OnExitRemove

41
Improved GlideIn
  • MDS use optional
    • User specifies necessary information
  • Automatic setup
    • GlideIn job transfers and installs binaries if
      needed
    • Binaries can come from the submit machine

42
New Job Types
  • Submit jobs directly to other schedulers (not
    through Globus)
  • Why?
    • Richer interface semantics
    • Not supported by Globus

43
NorduGrid
  • Grid batch system designed by Nordic countries
  • Globus GRAM didn't offer necessary semantics
    • Client control of file staging
    • Automatic cleanup of abandoned jobs

44
Oracle
  • Oracle DBMS supports a job queue
    • "Run this query in 5 hours"
    • "Run this query every Monday"
  • Condor can add more management features

45
Generic Job Interface
  • Re-arrange GridManager to allow easy addition of
    new job types
  • Define appropriate interface
  • Plug-ins for new job types?

46
Globus Toolkit 3.0
  • OGSA (Open Grid Services Architecture)
  • Submit jobs to GT3 sites
  • Grid Service client interface to Condor-G

47
Miscellaneous
  • Condor-G for Windows
  • MyProxy credential management
  • URLs for executable, staged files

48
Thank You!
  • Questions?
  • Also
    • Condor-G / Globus Q&A session
      • Wednesday, 9am-12pm, room TBA
    • E-mail: condor-admin@cs.wisc.edu