Resource Management Working Group - PowerPoint PPT Presentation

Provided by: scottmj
Learn more at: https://www.csm.ornl.gov

1
Resource Management Working Group
  • SSS Quarterly Meeting
  • November 28, 2001
  • Dallas, TX

2
Resource Management and Accounting Working Group
  • Working group scope and components
  • Progress made
  • Current and future issues
  • Next steps

3
Working Group Scope
  • The Resource Management Working Group encompasses
    the areas of resource management, scheduling and
    accounting.
  • This working group will focus on the following
    software components:
    • Queue Manager
    • Scheduler
    • Allocation Manager
    • Meta Scheduler
  • Our charter will also encompass the following
    capabilities:
    • Accounting
    • Usage Reports

4
Phase 1 Milestones
  • 6 months: Contribute to checkpoint/restart report
    with regard to scheduling-related aspects
  • 12 months: Establish and release initial resource
    management interface specifications
  • 12 months: Establish the CVS repository and
    module structure; agree on document conventions
  • 12 months: Finalize API for system-initiated
    checkpoint/restart of parallel MPI jobs on Linux
    systems
  • 18 months: Release v1.0 of the Center's resource
    management system based on existing open source
    code and the results of the scalability testing

5
High Level Progress
  • Establishing high level design covering initial
    component functionality and required interfaces
  • Determining inter-group requirements (GUI,
    security, IS, process management, etc.)
  • Preparing existing tools (Maui, Silver, QBank)
    for use within SSS
  • Creating infrastructure within which to develop
    and test RM deliverables
  • Creating infrastructure within which to develop
    and test intra- and inter-group interfaces

6
Proposed Component Architecture
  • Meta Scheduler
  • Security System
  • Information Service
  • Allocation Manager
  • Scheduler
  • Discovery Service
  • Queue Manager
  • Collector
  • Node Manager
  • Process Manager

7
Component Interaction Diagram: Job submitted to Queue Manager
  • [Diagram: User Interface, Collector, Meta Scheduler,
    Queue Manager, Allocation Manager, Scheduler and
    Process Manager, linked by interactions numbered
    1 to 11]

8
Component Interaction Trace: Job submitted to Queue Manager
  • A user submits a job to the Queue Manager
  • The Queue Manager does a sanity balance check
    with the Bank
  • The Queue Manager notifies the Scheduler that a
    new job has arrived
  • The Scheduler queries node and job status until
    the job can run
  • A bank reservation is made with the Allocation
    Manager
  • The Scheduler requests the Queue Manager to run
    the job
  • The Queue Manager passes job control to the
    Process Manager
  • The Process Manager notifies the Queue Manager of
    job completion
  • The Queue Manager notifies the Scheduler of job
    completion
  • A bank withdrawal is made with the Allocation
    Manager
  • The user is notified of job completion
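
The trace above can be walked through in code. The following is a minimal sketch, not the working group's design: the `Bank` class, the `run_local_job` function, and the idea that a reservation places a hold for an estimated cost while the withdrawal charges the actual cost are all assumptions made for illustration. The slides only say that a balance check, a reservation and a withdrawal happen at these points.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bank:
    """Hypothetical stand-in for the Allocation Manager's account ledger."""
    balance: float
    reserved: float = 0.0

    def can_afford(self, amount: float) -> bool:
        # Funds already on hold are not available to new jobs.
        return self.balance - self.reserved >= amount

    def reserve(self, amount: float) -> None:
        if not self.can_afford(amount):
            raise ValueError("insufficient funds")
        self.reserved += amount

    def withdraw(self, reserved_amount: float, actual: float) -> None:
        # Release the hold and charge what the job actually used.
        self.reserved -= reserved_amount
        self.balance -= actual


def run_local_job(bank: Bank, estimate: float, actual: float) -> List[str]:
    """Return the ordered steps for one job; stop early if the balance check fails."""
    trace = ["user submits job to Queue Manager"]
    if not bank.can_afford(estimate):
        trace.append("Queue Manager rejects job: balance check failed")
        return trace
    trace.append("Queue Manager balance check passed")
    trace.append("Queue Manager notifies Scheduler of new job")
    trace.append("Scheduler queries node and job status until job can run")
    bank.reserve(estimate)
    trace.append("bank reservation made with Allocation Manager")
    trace.append("Scheduler requests Queue Manager to run job")
    trace.append("Queue Manager passes job control to Process Manager")
    trace.append("Process Manager notifies Queue Manager of completion")
    trace.append("Queue Manager notifies Scheduler of completion")
    bank.withdraw(estimate, actual)
    trace.append("bank withdrawal made with Allocation Manager")
    trace.append("user notified of completion")
    return trace
```

A successful run yields eleven entries, one per step in the trace; a job that fails the initial balance check never reaches the Scheduler.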

9
Component Interaction Diagram: Job submitted to Meta Scheduler
  • [Diagram: User Interface, Collector, Meta Scheduler,
    Queue Manager, Allocation Manager, Scheduler and
    Process Manager, linked by interactions numbered
    1 to 15]

10
Component Interaction Trace: Job submitted to Meta Scheduler
  • A user submits a job to the Meta Scheduler
  • The Meta Scheduler contacts Schedulers to
    determine which systems could run the job the
    soonest
  • The Schedulers request quotes from Allocation
    Banks to determine which systems would run the
    job for the lowest cost
  • A Scheduler reservation is created for the job on
    the resource providing the best service -- this
    reservation can be moved or improved upon until
    the job is staged
  • The job is staged and queued at the system where
    it is to run
  • The Queue Manager notifies the Scheduler that a
    new job has arrived
  • The Scheduler queries node and job status until
    the job can run
  • A bank reservation is made with the Allocation
    Manager
  • The Scheduler requests the Queue Manager to run
    the job
  • The Queue Manager passes job control to the
    Process Manager
  • The Process Manager notifies the Queue Manager of
    job completion
  • The Queue Manager notifies the Scheduler of job
    completion
  • A bank withdrawal is made with the Allocation
    Manager
  • The Scheduler notifies the Meta Scheduler of job
    completion
  • The user is notified of job completion
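
The system-selection steps at the start of this trace can be sketched as follows. This is an illustration under stated assumptions: the `Offer` record, the `choose_system` and `improve` functions, and the ranking rule (earliest start first, lowest quote as tie-breaker) are hypothetical. The slides say only that start time and cost are both considered when picking the resource providing the best service.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Offer:
    """What one local Scheduler reports back (hypothetical fields)."""
    system: str
    earliest_start: float  # hours from now, as estimated by the Scheduler
    quote: float           # cost quoted by that system's allocation bank


def rank(offer: Offer) -> tuple:
    # Assumed policy: soonest start wins, cheaper quote breaks ties.
    return (offer.earliest_start, offer.quote)


def choose_system(offers: List[Offer]) -> Offer:
    """Pick the offer on which the Meta Scheduler creates its reservation."""
    if not offers:
        raise ValueError("no system can run the job")
    return min(offers, key=rank)


def improve(current: Offer, candidate: Offer) -> Offer:
    """Until the job is staged, move the reservation if a better offer appears."""
    return candidate if rank(candidate) < rank(current) else current
```

With offers from three systems, `choose_system` returns the one that can start soonest, and `improve` models the reservation being moved before staging.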

11
Design/Interface Progress
  • Initial high level RMS architecture defined
  • Resource management dictionary created defining
    objects within resource management world
  • Object tokens declared for major objects
  • Component functional interfaces identified
  • Initial XML request/response syntax proposed
  • Prototypes being constructed to test
    communication protocols
  • Initial detailed extra-group component
    requirements document created
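
To make the XML request/response idea concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`Request`, `Response`, `action`, `object`, `status`) are invented for illustration and are not the syntax the working group proposed, which is not reproduced in these slides.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def build_request(action: str, obj: str, fields: Dict[str, str]) -> str:
    """Serialize a request such as 'Query the Job with JobId 42' (hypothetical schema)."""
    root = ET.Element("Request", {"action": action, "object": obj})
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def parse_request(document: str) -> Tuple[str, str, Dict[str, str]]:
    """Recover the action, object type and fields from a serialized request."""
    root = ET.fromstring(document)
    fields = {child.tag: child.text or "" for child in root}
    return root.attrib["action"], root.attrib["object"], fields


def build_response(status: str, message: str) -> str:
    """Serialize a response carrying a status code and a human-readable message."""
    root = ET.Element("Response", {"status": status})
    root.text = message
    return ET.tostring(root, encoding="unicode")
```

A component receiving the request would call `parse_request`, act on the named object, and reply with `build_response`.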

12
Local Scheduler Rationale
  • The local scheduler interfaces with the majority
    of inter- and intra-RM components
  • Establish test platform from which interfaces can
    be tested
  • Leverage existing capabilities to accelerate SSS
    development
  • Establish infrastructure within which scheduling
    and metascheduling services and capabilities can
    be developed
  • Establish driver to evaluate other resource
    management components

13
Local Scheduler Progress
  • Baseline scheduler established (Maui 3.2) for SSS
    scheduling services integrating production and
    development capabilities
  • Prototype interface enabling XML communication
    with queue manager, metascheduler, and node
    manager
  • Extended QoS infrastructure integrated
  • Extended Job prioritization infrastructure
    integrated
  • Prototype created for object-oriented data access
  • Advanced metascheduling interface integrated

14
Meta Scheduler Progress
  • Initial distribution packaging created to allow
    collaborative development
  • Documentation enhanced and extended
  • Prototype XML scheduler-to-metascheduler query
    interface developed
  • Initial fault tolerance framework designed

15
Queue Manager Design
  • Established need for unified queue manager design
    common to Scheduler and Metascheduler
  • Queue manager will interface directly with
    Process manager
  • In process of refining the queue manager tasks
  • Queue manager will provide an interface to obtain
    information about any job regardless of job state,
    including completed jobs (i.e., it will maintain a
    job information archive)
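
The last point, a single query interface that covers completed jobs as well as live ones, can be sketched as below. The `QueueManager` class, its method names and the record fields are hypothetical; the sketch only shows one way an archive could sit behind the same lookup call.

```python
from typing import Dict, Optional


class QueueManager:
    """Toy queue manager that keeps completed jobs in an archive."""

    def __init__(self) -> None:
        self._active: Dict[str, dict] = {}
        self._archive: Dict[str, dict] = {}

    def submit(self, job_id: str, owner: str) -> None:
        self._active[job_id] = {"owner": owner, "state": "queued"}

    def complete(self, job_id: str) -> None:
        # Move the record to the archive instead of discarding it.
        record = self._active.pop(job_id)
        record["state"] = "completed"
        self._archive[job_id] = record

    def job_info(self, job_id: str) -> Optional[dict]:
        """Return a copy of the job record regardless of job state, or None if unknown."""
        for store in (self._active, self._archive):
            if job_id in store:
                return dict(store[job_id])
        return None
```

Callers such as the Scheduler or Meta Scheduler use `job_info` without needing to know whether the job is still in the queue.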

16
Allocation Manager Progress
  • QBank placed under revision control
  • Java prototype created which sends requests in
    XML
  • Experimenting with protocol frameworks (simple
    octet-counting, octet-stuffing, SOAP, BEEP)
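
For readers unfamiliar with the two simplest framing options named above, the sketch below contrasts them. Octet-counting prefixes each message with its length; octet-stuffing ends each message with a reserved marker and escapes any occurrence of that marker inside the payload. The concrete choices here (a decimal length followed by a newline, a zero byte as end marker, `0x1B` as escape byte, XOR with `0x20`) are assumptions for illustration, not the formats the group was testing.

```python
from typing import Tuple

END, ESC = 0x00, 0x1B  # assumed end-of-frame and escape bytes


def count_frame(payload: bytes) -> bytes:
    """Octet-counting: '<length>\\n' header followed by the raw payload."""
    return str(len(payload)).encode("ascii") + b"\n" + payload


def count_unframe(stream: bytes) -> Tuple[bytes, bytes]:
    """Return (first payload, remaining bytes) from an octet-counted stream."""
    header, sep, rest = stream.partition(b"\n")
    if not sep:
        raise ValueError("missing length header")
    length = int(header.decode("ascii"))
    if len(rest) < length:
        raise ValueError("truncated frame")
    return rest[:length], rest[length:]


def stuff_frame(payload: bytes) -> bytes:
    """Octet-stuffing: escape reserved bytes, then append the end marker."""
    out = bytearray()
    for b in payload:
        if b in (END, ESC):
            out.append(ESC)
            out.append(b ^ 0x20)
        else:
            out.append(b)
    out.append(END)
    return bytes(out)


def stuff_unframe(stream: bytes) -> Tuple[bytes, bytes]:
    """Return (first payload, remaining bytes) from an octet-stuffed stream."""
    out = bytearray()
    i = 0
    while i < len(stream):
        b = stream[i]
        if b == END:
            return bytes(out), stream[i + 1:]
        if b == ESC:
            i += 1
            if i >= len(stream):
                raise ValueError("truncated escape sequence")
            out.append(stream[i] ^ 0x20)
        else:
            out.append(b)
        i += 1
    raise ValueError("no end-of-frame marker")
```

Counting lets the receiver read an exact number of bytes but requires the sender to know the length up front; stuffing allows streaming at the cost of scanning and escaping every byte.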

17
Next Steps (In Progress)
  • Software Lifecycle Infrastructure
  • Online intra-RM schedule and dependencies
    document
  • Detailed extra-RM working group requirements
  • Coordinate creation of component level regression
    test suite
  • Bug tracking systems activated (used to track
    internal defects and development plans)
  • Interface
  • Produce validating intra-RM XML schema
  • Produce prototype RM components communicating in
    initial protocol
  • Feature Enhancements
  • Contribution to checkpoint/restart report
  • Creation of queue manager prototype

18
Next Steps (6 Months)
  • Usability
  • GUI-server interface, GUI format, security
    determined and prototypes created
  • Documentation of initial meta job
    constraints/features and specification language
  • Inter-group Collaboration
  • Creation of early scheduler XML implementation
    for use as RM driver
  • Development of initial dynamic job
    scheduler-queue manager interface
  • Extension of RM specifications/requirement
    document
  • Extension of internal component test
    infrastructure
  • Determination of best practices in
    documentation maintenance
  • Evaluation and adoption of web project management
    and collaboration tools
  • Creation of prototype queue manager with
    scheduler/task manager interfaces

19
Next Steps (6 Months)
  • Fault Tolerance
  • Enhance metascheduler to survive local daemon
    failure
  • Enhancement of threaded scheduling interface.
  • Development of threaded metascheduling interface.
  • Resource Optimization
  • Development of local optimization features of
    meta workload
  • Feature Enhancements
  • Creation of resource manager extension features.
  • Development of direct metascheduler to queue
    manager staging roadmap.
  • Interfaces
  • Specification of best guess security
    infrastructure and evaluation of impact on system
    internals and communication protocols

20
Next Steps (1 year)
  • Software Lifecycle Infrastructure
  • Create multi-component regression tests
  • Generate alpha package of scheduling,
    metascheduling, and allocation management
    packages.
  • Interfaces
  • Development of functional XML interfaces for all
    components
  • Early adoption of security infrastructure
  • Creation of optional information service
    interfaces
  • Admin and end-user GUIs proposed to enable use
    of new functionality
  • Inter-group Collaboration
  • Enhanced suspend/resume and checkpoint/restart
    features with detailed roadmap specified for all
    remaining suspend/resume and checkpoint/restart
    deliverables

21
Current Issues
  • Should there be an enveloping protocol framework
    which handles framing (where the XML document
    begins and ends), authentication, multiplexing,
    streaming data, etc.? (Should we look at something
    like BEEP, or start from scratch and invent
    something of our own?)
  • The functionality and data interface between the
    queue manager/collector and the node/process
    manager require further refinement.
  • Queue manager/collector and node/process manager
    development schedules must be determined and
    coordinated.

22
Issues
  • Continued effort is required to complete an
    intra-RM XML schema to handle initial RMS
    interaction needs. The boundary between the
    internal intra-RM schema and the global XML
    schema needs to be defined.
  • Understanding of open source requirements (i.e.,
    can software that requires registration and usage
    agreements be included in the SSS distribution?)

23
Inter-Group Issues
  • Need for coordination of resource management
    system across working groups so that the pieces
    all function together properly and no part is
    overlooked. Need to coordinate schedules for
    delivery of RMWG-dependent non-RMWG components.
  • Early vendor/industry collaborations (we'd better
    do this while it can still influence our design;
    we need to talk to decision makers and develop
    business plans)

24
Inter-group Issues
  • Information service: should we instead be looking
    for something existing (e.g., MDS2)?
  • Need to solidify SSS-wide standards for
    packaging, revision control, documentation
    content and format, and problem tracking, and
    establish mechanisms and places to host them.
  • Creation of regression and integration test suite
    (with the Validation and Testing WG; we need this
    from an early stage)

25
Conclusions
  • Questions