Workload Management WP Status and next steps

1
Workload Management WP: Status and next steps
  • Massimo Sgaravatto
  • INFN Padova

2
Where we are
  • CMS-HLT use case (Monte Carlo production and
    reconstruction) analyzed in terms of GRID
    requirements and GRID tools availability
  • Discussions with Globus team and Condor team
  • Definition of a prototype architecture of
    workload management system
  • Use of Globus and Condor mechanisms
  • But major developments needed

3
Prototype workload management system architecture
Diagram contents:
  • Jobs are submitted to the Master
  • Master performs resource discovery using the info
    published in the Grid Information Service (GIS)
  • Master chooses in which Globus resources the
    jobs must be submitted
  • Master submits with condor_submit (Globus
    Universe) to Condor-G
  • Condor-G able to provide a reliable/crashproof
    job submission service
  • Globus GRAM as uniform interface to different
    local resource management systems
  • One Globus GRAM per site (Site1, Site2, Site3), in
    front of the local resource management systems
    (CONDOR, LSF, PBS) that run the farms

4
Where we are
  • Evaluating the existing components (D1.1) and
    putting together the various building blocks
  • Evaluation of Globus
  • Collaboration with WP 1 of INFN-GRID project
    (Evaluation of the Globus toolkit)
    http://www.infn.it/globus
  • Evaluation of Globus GRAM
  • GRAM as uniform interface to different underlying
    resource management systems
  • Evaluation of RSL
  • Cooperation between GRAM and GIS
  • Evaluation of Condor-G
  • The current implementation is a prototype
  • It works, but some problems must be solved
  • Globus and Condor-G tested with a real CMS MC
    production
  • Many many many memory leaks found in the Globus
    jobmanager !!!
  • Fixes (provided by Francesco Prelz) submitted to
    Globus team
  • Feedback received only about the bugs in the
    GAA and GSS modules (new fixes merged with the
    original ones)

5
Layout for CMS production
Diagram contents:
  • Production manager (Ivano Lippi, Padova) submits
    jobs with condor_submit (Globus Universe)
  • Condor-G runs in Padova and submits with globusrun
  • Globus GRAM at Bologna and Pisa, in front of the
    local resource management systems (CONDOR, LSF)
    that run the farms

6
First deliverables
  • Month 3: Report on current technology (report),
    D1.1
  • Month 6: Definition of architecture for
    scheduling, resource management, security and job
    description (report), D1.2
  • Month 9: Components and documentation for the 1st
    release: initial workload management system
    (prototype), D1.3

7
Proposed work plan
  • Let's continue the implementation of the proposed
    prototype
  • Evaluation of current technologies (Globus,
    Condor) (D1.1)
  • Functionalities for the 1st release
  • First release
  • We can propose the functionalities that could be
    implemented
  • Negotiation in the ATF
  • To understand if these functionalities address
    the proposed use cases
  • To understand if our module can be plugged
    together with the other pieces
  • To understand if the other WPs can provide the
    required (by WP 1) functionalities

8
Proposed functionalities for the 1st release
  • First version of job description language (JDL)
  • First version of broker (master), that decides
    where to submit the jobs
  • Job submission service
  • First version of logging and bookkeeping services
  • First user interface

9
Job Description Language (JDL)
  • Used when the job is submitted, to specify:
  • The application
  • The input data set
  • File? Collection of files? Logical or
    physical names?
  • Needs to be discussed with WP 2, WP 8, ATF
  • Where the output data must be saved
  • (Required and preferable) resources
  • Info for bookkeeping
  • ???
  • Prototype: Condor ClassAds
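
As a rough illustration of the items listed on this slide, a job description could carry fields like the ones below. This is only a sketch in Python: every attribute name and value (Executable, InputData, OutputLocation and so on) is hypothetical, not an agreed JDL, and the rendering merely imitates the "name = value" style of Condor ClassAds.

```python
# Hypothetical job description; none of these attribute names are agreed JDL.
job_description = {
    "Executable": "my_application",         # the application (placeholder name)
    "InputData": ["input_file_001"],        # input data set (placeholder name)
    "OutputLocation": "/data/output",       # where the output data must be saved
    "Requirements": 'Arch == "INTEL" && ScratchSpaceMB >= 500',  # required resources
    "Rank": "FreeCpus",                     # preferable resources
    "BookkeepingInfo": "production=test",   # info for bookkeeping
}


def to_classad_text(description):
    """Render a dict as ClassAd-like 'name = value' lines (illustrative only)."""
    lines = []
    for name, value in description.items():
        if name in ("Requirements", "Rank"):
            rendered = value  # expressions are written unquoted
        elif isinstance(value, list):
            rendered = '"' + ",".join(value) + '"'
        else:
            rendered = '"' + str(value) + '"'
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines)


print(to_classad_text(job_description))
```

Keeping required resources (Requirements) separate from preferable ones (Rank) mirrors the "(Required and preferable) resources" item; whether the input data set is a file, a collection or a logical name is left open here, as on the slide.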

10
Broker/Master
  • Choice of resource (farm) where to submit job
  • Input: JDL expression
  • Output: computing resource choice
  • Published resource access lists (gridmap-files in
    the Globus-based prototype) are checked as a
    first step in the resource match-making

11
Broker/Master
  • The accessible computing resources are matched
    with the job request according to
  • Availability of the requested input data set
  • In the 1st release the broker will have to
    choose a resource where this input data set is
    already available (we are not going to trigger
    the replica of the input data set)
  • Availability of the appropriate application
    "sandbox"
  • It could be necessary to "copy" and install this
    sandbox if not already available in the
    executing farm (code migration) (in the 1st
    release ???)
  • Queue characteristics and status (architecture,
    etc.) vs. job requests
  • Let's start with a few, simple parameters
  • Availability of the requested amount of scratch
    space
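
A minimal sketch of this matching step, assuming only a few simple parameters are used. The Farm and JobRequest types and their fields are hypothetical. The sketch also assumes the sandbox must already be installed (no code migration), while the slide leaves that point open for the 1st release.

```python
from dataclasses import dataclass, field


@dataclass
class Farm:
    name: str
    datasets: set = field(default_factory=set)    # input data sets already present
    sandboxes: set = field(default_factory=set)   # application sandboxes installed
    architecture: str = ""
    free_scratch_mb: int = 0


@dataclass
class JobRequest:
    dataset: str
    sandbox: str
    architecture: str
    scratch_mb: int


def matches(farm, job):
    """True if the farm satisfies every criterion used in this sketch."""
    return (job.dataset in farm.datasets           # no replica is triggered
            and job.sandbox in farm.sandboxes      # assumption: no code migration
            and farm.architecture == job.architecture
            and farm.free_scratch_mb >= job.scratch_mb)


def candidate_farms(farms, job):
    """Names of the accessible farms that match the job request."""
    return [farm.name for farm in farms if matches(farm, job)]


farms = [
    Farm("farm-a", {"dataset-1"}, {"app-v1"}, "INTEL", 2000),
    Farm("farm-b", {"dataset-1"}, {"app-v1"}, "INTEL", 100),   # too little scratch
    Farm("farm-c", set(), {"app-v1"}, "INTEL", 5000),          # data set missing
]
job = JobRequest("dataset-1", "app-v1", "INTEL", 500)
print(candidate_farms(farms, job))
```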

12
Broker/Master
  • We assume that all the information needed by the
    broker is published in one Grid Information
    Space (GIS in the Globus-based prototype) by the
    other WPs
  • Prototype: Condor matchmaking library
  • Match between the info published in the GIS and
    the ClassAds defined in the JDL
  • A translator from GIS attributes to ClassAds is
    necessary
  • Some work already done by Globus team ???
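
One way such a translator might work is sketched below. It assumes the GIS returns each entry as attribute names mapped to lists of string values, as LDAP-style directories typically do. The attribute names are hypothetical, and the output only imitates ClassAd "name = value" text rather than using the Condor matchmaking library.

```python
def gis_entry_to_classad(entry):
    """Translate {attribute: [string values]} into ClassAd-like text."""
    lines = []
    for name, values in sorted(entry.items()):
        value = values[0]  # sketch: keep only the first value of each attribute
        try:
            rendered = str(int(value))  # numeric strings become integers
        except ValueError:
            rendered = '"' + value.replace('"', '\\"') + '"'
        # Hyphens are replaced so the name is a plain identifier.
        lines.append(f"{name.replace('-', '_')} = {rendered}")
    return "\n".join(lines)


# Hypothetical attributes for one queue.
entry = {"queue-name": ["short"], "free-cpus": ["12"]}
print(gis_entry_to_classad(entry))
```

A real translator would also have to decide how to handle multi-valued attributes and units, which this sketch ignores.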

13
Job submission service
  • Input: job to submit and computing resource choice
    (provided by broker)
  • Reliable, fault tolerant, crash proof service
  • Reliability in the executing machines is up to WP 4
  • Prototype: Condor-G
  • Submission of jobs to Globus resources (farms)
  • New implementation of Condor-G (and new Globus job
    manager) available soon
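
For the Condor-G prototype, the broker's resource choice could be turned into a submit description and handed to condor_submit. The sketch below only builds the text. The keywords shown are the ones commonly documented for the Globus universe and should be checked against the installed Condor-G version; the host name, job manager name and file names are placeholders.

```python
def globus_universe_submit(executable, gram_contact, log_file="job.log"):
    """Build a submit description for a job sent to one Globus resource."""
    lines = [
        "universe = globus",
        f"globusscheduler = {gram_contact}",  # resource chosen by the broker
        f"executable = {executable}",
        "output = job.out",
        "error = job.err",
        f"log = {log_file}",                  # Condor-G records job events here
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Placeholder GRAM contact string: host plus job manager of the local system.
text = globus_universe_submit("run_production.sh",
                              "gram.example.org/jobmanager-lsf")
print(text)
```

The resulting text would be written to a file and passed to condor_submit; the reliability of the service itself comes from Condor-G, not from this wrapper.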

14
Code migration
  • Not easy at all !!!
  • Necessary to install in the target farm a
    complex run time environment
  • A STRONG collaboration with WP 8 (and WP 4) is
    necessary to define an application sandbox that
    can easily be installed in one farm, and doesn't
    conflict with other sandboxes
  • Use of application repositories ???
  • When an application must be installed on one
    farm, the sandbox is downloaded from such
    repository

15
Bookkeeping
  • Necessary to record for each job:
  • Submitting user identity
  • Input data
  • Output data
  • Status of processing
  • Where and when the processing has been done
  • Other bookkeeping info specified in the JDL
  • ???
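
The per-job record could be sketched as a simple data structure. The field names and the JSON serialization are assumptions made for illustration; the slide fixes only what kind of information is recorded, not how it is stored.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class BookkeepingRecord:
    job_id: str                     # hypothetical identifier
    user_identity: str              # submitting user identity
    input_data: list
    output_data: list
    status: str                     # status of processing
    executed_at_site: str = ""      # where the processing has been done
    executed_at_time: str = ""      # when the processing has been done
    extra: dict = field(default_factory=dict)  # other info from the JDL


record = BookkeepingRecord(
    job_id="job-0001",
    user_identity="/O=Grid/CN=Alice Example",
    input_data=["input_file_001"],
    output_data=[],
    status="submitted",
)

# Update the record as the job progresses.
record.status = "done"
record.executed_at_site = "farm-a"
record.output_data.append("output_file_001")

print(json.dumps(asdict(record), indent=2))
```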

16
Logging
  • Necessary to keep track of the significant
    events that occurred in the system
  • Requests by users
  • Computing resource choice (by broker)
  • Submission to resource
  • ???
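
A possible shape for the logging service is an append-only stream of timestamped events, one per line. The event type names, the JSON layout and the log_event helper below are hypothetical and chosen only to mirror the three kinds of event listed on this slide.

```python
import io
import json
from datetime import datetime, timezone

# Event kinds taken from the list above; the names themselves are assumptions.
EVENT_TYPES = {"user_request", "resource_choice", "submission"}


def log_event(stream, event_type, job_id, **details):
    """Append one event as a JSON line and return it."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    event = {
        "time": datetime.now(timezone.utc).isoformat(),
        "type": event_type,
        "job_id": job_id,
        "details": details,
    }
    stream.write(json.dumps(event) + "\n")
    return event


log = io.StringIO()  # stands in for a log file
log_event(log, "user_request", "job-0001", user="/O=Grid/CN=Alice Example")
log_event(log, "resource_choice", "job-0001", resource="farm-a")
log_event(log, "submission", "job-0001", resource="farm-a")
print(log.getvalue())
```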

17
User Interface
  • Job management
  • Job submission
  • Job removal
  • Job status monitoring
  • Access to bookkeeping info
  • Access to logging info
  • ???