Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting

Slides: 21
Provided by: scottmj
Learn more at: https://www.csm.ornl.gov

Transcript and Presenter's Notes



1
Scalable Systems Software Center
Resource Management and Accounting Working Group
Face-to-Face Meeting
May 10-11, 2005
Argonne, IL
2
Resource Management and Accounting Working Group
  • Working group scope
  • Progress since last face-to-face
  • Future Work
  • Other issues

3
Working Group Scope
  • The Resource Management Working Group is involved
    in the areas of resource management, scheduling
    and accounting.
  • This working group will focus on the following
    software components:
      • Queue Manager
      • Scheduler
      • Accounting and Allocation Manager
      • Meta Scheduler
  • Other critical resource management components are
    being developed in the Process Management and
    Monitoring Working Group:
      • Process Manager
      • Cluster Monitor

4
Resource Management Component Architecture
Labels shown in the architecture diagram:
  • Grid Scheduler
  • Infrastructure Services
  • Allocation Manager
  • Cluster Scheduler
  • Discovery Service
  • Queue Manager
  • Node Monitor
  • Event Manager
  • Security System
  • Node Manager
  • Process Manager
5
Resource Management Prototype Demonstration
This demo runs a simple end-to-end test in which a
submitted job runs past its wallclock limit.
Components shown in the diagram: Job Submission Client,
Queue Manager, Cluster Scheduler, Allocation Manager,
Node Monitor, Process Manager, Discovery Service
Numbered messages, in order:
  0 Service-Lookup
  1 Submit-Job
  2 Query-Job
  3 Query-Node
  4 Create-Reservation
  5 Run-Job
  6 Exec-Process
  7 Query-Job
  8 Delete-Job
  9 Withdraw-Allocation
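
The numbered messages in this demo can be traced with a toy model. The sketch below is only an illustration: the message names come from the slide, but the ToyJob class, its fields, and the run_demo function are hypothetical, and the slide text does not say which component sends each message, so no sender or receiver is modeled.

```python
from dataclasses import dataclass

# The ten numbered messages from the demo slide, in step order (0 through 9).
DEMO_STEPS = [
    "Service-Lookup", "Submit-Job", "Query-Job", "Query-Node",
    "Create-Reservation", "Run-Job", "Exec-Process",
    "Query-Job", "Delete-Job", "Withdraw-Allocation",
]


@dataclass
class ToyJob:
    """Hypothetical job record; not a type from any SSS component."""
    job_id: str
    wallclock_limit: int  # seconds the job is allowed to run
    runtime: int          # seconds the job would run if left alone
    state: str = "Queued"


def run_demo(job: ToyJob) -> list[str]:
    """Return the message trace for one job, using the slide's message names."""
    # Steps 0-6: lookup, submission, scheduling queries, reservation, launch.
    trace = list(DEMO_STEPS[:7])
    job.state = "Running"
    # Step 7: the running job is queried again.
    trace.append("Query-Job")
    if job.runtime > job.wallclock_limit:
        # Step 8: the job overran its limit, so it is deleted.
        trace.append("Delete-Job")
        job.state = "Removed"
    else:
        # Assumption: a job that finishes in time needs no Delete-Job.
        job.state = "Completed"
    # Step 9: the allocation is withdrawn at the end of the job.
    trace.append("Withdraw-Allocation")
    return trace
```

Running it with a job whose runtime exceeds its limit, for example ToyJob("job.1", wallclock_limit=60, runtime=120), reproduces the ten-message sequence from the slide.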
6
General Progress
  • New release of RMWG components made available
    from SSS web site
  • Bamboo Queue Manager v1.1
  • Maui Scheduler v3.2.6p13
  • Gold Accounting and Allocation Manager v2.b2.10.2

7
General Progress
  • Continued Adoption of SSS components and
    interfaces
  • SSS suite running on additional systems in Ames
  • Gold being used in production on University of
    Utah's Icebox cluster

8
General Progress
  • Working on integration of SSSRMAP into ssslib
  • Bill Pitre -- implementing the SSSRMAP Message
    Format SDK (Python classes)
  • Craig Steffen -- integrating SSSRMAP Wire Level
    protocol into ssslib

9
General Progress
  • Paper accepted for presentation and publication
    at a conference
      • Title: Allocation Management Solutions for High
        Performance Computing
      • Conference: Parallel and Distributed Processing
        Techniques and Applications (PDPTA'05)
      • Workshop on Scheduling and Resource Management
        for Parallel and Distributed Systems

10
General Progress
  • New Documents in SSS RMWG Notebook
  • Considerations for using SOAP as the basis for
    SSSRMAP v4
  • Fault Tolerance with Gold
  • Last Quarter's Weekly RMWG Meeting Notes

11
Queue Manager Progress
  • V1.1 release of Bamboo made available
  • SSS suite running on several systems in Ames.
  • Support for Task Groups and Node Properties added
    to server.
  • Added a new mailing feature
  • New fountain component created to pull node
    information from multiple sources.
  • Simple node information now supported.
  • Working on adding support for SuperMon, Ganglia
    and NWPerf

12
Accounting and Allocation Manager Progress
  • New release of Gold available: 2nd Gold Beta
    v2.b2.10.2
  • v2.b2.7.0 incorporated into OSCAR release
  • Gold being used in production on University of
    Utah's Icebox cluster
  • Implemented and tested design for distributed
    accounting and multi-organizational negotiation
    in job launching
  • Implemented fault tolerance to 50% cluster loss
    by adding support for a backup Gold server.
      • Clients can fail over to a backup Gold server if
        one is defined
      • The database can be made fault tolerant by
        using a synchronous multi-master replication
        system such as pgcluster.
      • Documented in the RMWG notebook
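
The client-side failover described in this list can be sketched as follows. This is a minimal illustration, not Gold's client code: the function name, the injected send callable, and the host names are all hypothetical, and the slides do not say how Gold detects that the primary server is down. Here a ConnectionError stands in for that detection.

```python
from typing import Callable, Optional


def request_with_failover(
    send: Callable[[str, str], str],
    request: str,
    primary: str,
    backup: Optional[str] = None,
) -> str:
    """Send a request to the primary server; retry on the backup if one is defined."""
    try:
        return send(primary, request)
    except ConnectionError:
        if backup is None:
            # No backup configured: surface the original failure.
            raise
        return send(backup, request)


def fake_send(host: str, request: str) -> str:
    """Stand-in transport for the example: the primary host is always down."""
    if host == "gold-primary.example.org":
        raise ConnectionError("primary unreachable")
    return f"{host} handled {request}"
```

Passing the transport in as a callable keeps the retry logic separate from any particular wire protocol, which also makes it easy to exercise with a stub such as fake_send.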

13
Accounting and Allocation Manager Progress
  • Simplified allocation management for basic
    configurations by adding the ability to hide the
    account abstraction layer
      • Enabled account auto-generation, project-level
        deposits, etc.
  • Ported Gold to Tier3 and Tier4 OSs (OS-X, IRIX,
    HP-UX, Solaris); unable to get access to Unicos
  • Enabled support for MySQL database

14
Cluster Scheduler Progress
  • Migrated latest MCOM library into Maui
      • Includes support for encryption, scalability
        enhancements, SSS return codes, job description
        extensions, etc.
  • Enabled support for partitions, node features
  • Enhanced recovery modes for failures and
    unexpected conditions
  • Additional QOS modes for Allocation Manager
      • Fallback QOS, QOS requested vs. delivered
  • Fixed additional packaging bugs, buffer overflows
  • Started work on multi-taskgroup jobs

15
Grid Scheduler Progress
  • Added support for multi-site authentication (per
    peer-service symmetric keys)
  • Rolling X.509 credential management into MCOM
    library
  • Enabled support for Globus 3.x (had to work around
    a lot of Globus bugs)
  • Enhanced grid job queue and launch
  • Reliability - completed Globus failure
    diagnostics, logging and auto-recovery
  • Data Staging - completed Globus/non-Globus data
    staging failure auto-recovery
  • Fairness - implemented Priority, Fairshare, and
    Usage Limit based policy enforcement
  • Statistics - added credential, job, and cluster
    based usage statistics
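
The multi-site authentication bullet above mentions per-peer-service symmetric keys. One generic way to use such keys is to sign each message with an HMAC keyed by the sending peer. The sketch below shows that general technique only; it is not the MCOM or SSSRMAP scheme, which the slides do not describe. The peer names, key values, and function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key table: one shared secret per peer service.
PEER_KEYS = {
    "site-a": b"example-key-for-site-a",
    "site-b": b"example-key-for-site-b",
}


def sign(peer: str, message: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the message under the peer's key."""
    return hmac.new(PEER_KEYS[peer], message, hashlib.sha256).hexdigest()


def verify(peer: str, message: bytes, signature: str) -> bool:
    """Check a signature against the key registered for the claimed peer."""
    key = PEER_KEYS.get(peer)
    if key is None:
        # Unknown peers are rejected outright.
        return False
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

Because each peer has its own key, a signature produced for one site does not verify as another site, so a compromised key affects only that peer relationship.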

16
Future Work
  • General release of all components
  • Including new Silver Meta-scheduler
  • Increase deployment base
  • Integrate SSSRMAP into ssslib
  • Portability testing for all components
  • Fault Tolerance supporting 25% cluster loss

17
Future Work
  • Queue Manager
      • Add job group support (mainly for submission)
      • Add job submission filter
      • Finish final missing portions of PBS-style job
        language support

18
Future Work
  • Accounting and Allocation Manager
      • General release to be made available by mid-year
      • Production deployment of Gold on additional sites
      • Port Gold GUI from JSP to Perl CGI
      • Add support for multi-site authentication (each
        site having its own symmetric key)
      • Documentation to include object customization

19
Future Work
  • Cluster Scheduler
      • Add support for multi-taskgroup SSS jobs
      • Support SSS job extensions and job-level policies
      • Peer Diagnostics - add auto-recovery to failed
        service interfaces
      • Resource Utilization - complete development of
        all resource utilization objectives
      • Resource Limits - complete development of all
        resource limits objectives
      • Checkpoint/Restart - test with LBNL and optimize
        resource management for suspended jobs
      • Get X.509 credential management working

20
Future Work
  • Grid Scheduler
      • Release Silver meta-scheduler
          • Targeting end of June for alpha release
          • Need to test Maui/Silver interoperability with
            new MCOM lib
      • Need to test
          • Priority, Fairshare, and Usage Limit based policy
            enforcement
          • Credential, job, and cluster based usage
            statistics
      • Optimization - add network co-allocation
        reservation
      • General - mature client commands to provide
        status reporting in a more intuitive manner
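
Testing the Priority, Fairshare, and Usage Limit policy enforcement listed above needs small cases whose expected outcome can be worked out by hand. The sketch below is a toy model for building such cases, not the Maui or Silver algorithm: the formula (a base priority adjusted by the gap between a user's fairshare target and actual share), the UserPolicy class, and the weight parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserPolicy:
    """Hypothetical per-user policy record."""
    fairshare_target: float  # intended fraction of total usage, between 0 and 1
    usage_limit: float       # hard cap on consumed processor-hours


def job_priority(
    base: float,
    usage: dict[str, float],
    user: str,
    policies: dict[str, UserPolicy],
    fs_weight: float = 100.0,
) -> Optional[float]:
    """Return a job priority, or None if the user's usage limit is exhausted."""
    policy = policies[user]
    used = usage.get(user, 0.0)
    if used >= policy.usage_limit:
        # Usage limit: the job is not eligible to run at all.
        return None
    total = sum(usage.values())
    share = used / total if total > 0 else 0.0
    # Fairshare: users below their target gain priority, users above lose it.
    return base + fs_weight * (policy.fairshare_target - share)
```

For example, with two users who each have a 0.5 target, a user who has consumed 300 of 400 total processor-hours holds a 0.75 share and is pushed 25 points below the base priority, while a user who has already reached a 100-hour limit gets no priority at all.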