Title: Scalable Systems Software Center Al Geist, coordinating P.I.
1 Scalable Systems Software Center
Al Geist, coordinating P.I.
- Rusty Lusk
- Mathematics and Computer Science Division
- Argonne National Laboratory
2 Outline
- The Scalable Systems Software Center
- Goals
- Participants
- Scope
- Structure
- Issues
- Status
- Experiences
- Lessons
3 Current State of Systems Software for Large-Scale Machines
- Both proprietary and open-source systems
- Machine-specific, PBS, LSF, POE, SLURM, COOAE (Collections Of Odds And Ends)
- Many are monolithic resource management systems, combining multiple functions
- Job queuing, scheduling, process management, node monitoring, job monitoring, accounting, configuration management, etc.
- A few established separate components exist
- Maui scheduler
- QBank accounting system
- Many home-grown, local pieces of software
- Scalability often a weak point
4 The Scalable Systems Software SciDAC Project
- Research goal: to develop a component-based architecture for systems software for scalable machines
- Software goal: to demonstrate this architecture with prototype open-source components
- One powerful effect: forcing rigorous (and aggressive) definition of what components should do and what should be encapsulated in other components
- http://www.scidac.org/ScalableSystems
5 Participants
- Labs
- ORNL, ANL, LBNL, PNNL, Ames, SNL, LANL
- Universities
- NCSA, SDSC, PSC, Clemson
- Vendors
- Unlimited Scale, IBM, Cray, Intel
- Open to anyone who wants to participate
6 Project Concept
[Diagram of the planned component architecture. Labels: Meta Scheduler; Meta Monitor; Meta Manager; Access Control Security Manager; Meta Services; "Interacts with all components"; Node Configuration Build Manager; System Monitor; Accounting; Scheduler; Resource Allocation Management; Process Manager; Queue Manager; User DB; Data Migration; High Performance Communication I/O; Usage Reports; User Utilities; Checkpoint / Restart; File System; Testing Validation; Application Environment]
7 Structure of Project
- Working Groups
- Node build, configuration, and global infrastructure
- Job submission, queue management, scheduling, and accounting
- Process management, system monitoring, and checkpointing
- Validation and integration
- Quarterly project meetings, weekly working group conference calls
- Electronic notebooks for all working documents
- www.scidac.org/ScalableSystems
8 SSS Project Issues
- Put minimal constraints on component implementations
- Ease merging of existing components into SSS framework
- E.g., Maui scheduler
- Ease development of new components
- Encourage multiple implementations from vendors, others
- Define minimal global structure
- Components need to find one another
- Need common communication method
- Need common data format at some level
- Each component will compose messages others will read and parse
- Message-framing protocols
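Message framing is what lets a component read exactly one message off a stream socket before passing it to a parser. The following is a minimal sketch, assuming a 4-byte big-endian length prefix. That is one common framing choice; the slides do not say which framings SSS used. The helper names `send_framed` and `recv_framed` are hypothetical.

```python
import socket
import struct

# Assumed framing: 4-byte big-endian length, followed by the payload bytes.
HEADER = struct.Struct("!I")


def send_framed(sock: socket.socket, payload: bytes) -> None:
    """Send one message preceded by its length."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly `count` bytes, since recv() may return fewer."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before a full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_framed(sock: socket.socket) -> bytes:
    """Receive one complete message sent by send_framed()."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return _recv_exact(sock, length)


if __name__ == "__main__":
    # Two connected sockets stand in for two components.
    left, right = socket.socketpair()
    send_framed(left, b"<hypothetical-request/>")
    print(recv_framed(right))
    left.close()
    right.close()
```

Because the receiver knows each message's length up front, several messages can share one connection without the XML parser having to guess where one document ends and the next begins.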
9 SSS Project Status: Global
- Early decisions on inter-component communication
- Lowest level communication is over sockets (at least)
- Message content will be XML
- Parsers available in all languages
- Did not reach consensus on transport protocol (HTTP, SOAP, BEEP, assorted home grown), especially to cope with local security requirements, so multiple protocols are supported
- Early implementation work on global issues
- Service directory component defined and implemented
- SSSlib library for inter-component communication
- Handles interaction with service directory
- Hides details of transport protocols from component logic
- Anyone can add protocols to the library
- Bindings for C, C++, Java, Perl, and Python
- Event manager for asynchronous communication
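To make the service directory idea concrete, here is a minimal in-memory sketch of a component registering its location and another component looking it up, with XML as the message content. The element and attribute names (`register`, `lookup`, `location`, `component`, and so on) are assumptions made for illustration; they are not the SSS XML interfaces, and this is not the SSSlib API.

```python
import xml.etree.ElementTree as ET


def make_request(tag: str, **attrs: str) -> str:
    """Compose an XML request as a string (hypothetical message shape)."""
    return ET.tostring(ET.Element(tag, attrs), encoding="unicode")


class ServiceDirectory:
    """In-memory stand-in for a service directory component."""

    def __init__(self) -> None:
        self._locations = {}  # component name -> (host, port, protocol)

    def handle(self, request_xml: str) -> str:
        """Parse one XML request and return an XML reply."""
        request = ET.fromstring(request_xml)
        name = request.get("component")
        if request.tag == "register":
            self._locations[name] = (
                request.get("host"),
                request.get("port"),
                request.get("protocol"),
            )
            return ET.tostring(ET.Element("ok"), encoding="unicode")
        if request.tag == "lookup" and name in self._locations:
            host, port, protocol = self._locations[name]
            reply = ET.Element(
                "location",
                component=name, host=host, port=port, protocol=protocol,
            )
            return ET.tostring(reply, encoding="unicode")
        return ET.tostring(ET.Element("error"), encoding="unicode")


if __name__ == "__main__":
    directory = ServiceDirectory()
    directory.handle(make_request(
        "register", component="process-manager",
        host="node0", port="7000", protocol="basic",
    ))
    print(directory.handle(make_request("lookup", component="process-manager")))
```

Recording the protocol alongside host and port reflects the point above that multiple transport protocols are supported: a caller can learn how to talk to a component as well as where it is.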
10 SSS Project Status: Individual Component Prototypes
- Precise XML interfaces not settled on yet, pending experiments with component prototypes
- Both new and existing components
- Maui scheduler is existing full-featured scheduler, SSS communication added
- QBank accounting system has added SSS communication interface, evolving into Gold
- New Checkpoint Manager component being integrated now
- System-initiated checkpoints of LAM jobs
11 SSS Project Status: More Individual Component Prototypes
- New Build-and-Configuration Manager completed
- Controls how nodes are configured and built
- New Node State Manager
- Manages nodes as they are installed, reconfigured, added to active pool
- New Event Manager for asynchronous communication among components
- Components can register for notification of events supplied by other components (useful in monitoring, fault tolerance)
- New Queue Manager mediates among user (job submitter), Job Scheduler, and Process Manager
- Multiple monitoring components, both new and old
- Data warehouse
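The Event Manager's register-and-notify pattern can be sketched as a small publish/subscribe class. This is an in-process illustration under stated assumptions: the class and method names are hypothetical, handlers are plain Python callables, and delivery is synchronous, whereas the real component notifies other components asynchronously through messages.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A handler receives the event type and a dictionary of event data.
Handler = Callable[[str, dict], None]


class EventManager:
    """Minimal publish/subscribe hub (hypothetical interface)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def register(self, event_type: str, handler: Handler) -> None:
        """Ask to be notified whenever `event_type` is published."""
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, data: dict) -> int:
        """Deliver an event to its subscribers; return how many were notified."""
        handlers = list(self._subscribers.get(event_type, []))
        for handler in handlers:
            handler(event_type, data)
        return len(handlers)


if __name__ == "__main__":
    events = EventManager()
    # A monitoring component registers interest in node failures.
    events.register("node-down", lambda kind, data: print(kind, data["node"]))
    # Another component supplies the event.
    events.publish("node-down", {"node": "node17"})
```

The supplier of an event never needs to know who is listening, which is what makes the pattern useful for monitoring and fault tolerance.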
12 SSS Project Status: Still More Individual Component Prototypes
- New Process Manager component provides SSS interface to MPD scalable process manager
- Speaks XML through SSSlib to other SSS components
- Invokes MPD to implement SSS process management specification
- MPD itself is not an SSS component
- Allows MPD development, especially with respect to supporting MPI and MPI-2, to proceed independently
- SSS Process Manager abstract definitions have influenced addition of MPD functionality beyond what is needed to implement mpiexec from MPI-2 standard
- E.g., separate environment variables for separate processes
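The last point, giving each process of a job its own environment variables, can be illustrated with a local launcher. This sketch uses Python's standard `subprocess` module as a stand-in; it does not use MPD or the SSS Process Manager interface, and the spec layout (`argv`, `env`) and the variable name `DEMO_ROLE` are hypothetical.

```python
import os
import subprocess
import sys


def launch(process_specs):
    """Start one process per spec, each with its own environment additions.

    Each spec is a dict with "argv" (command line) and "env" (extra variables).
    Returns the stripped standard output of each process, in spec order.
    """
    procs = []
    for spec in process_specs:
        env = dict(os.environ)   # inherit the launcher's environment
        env.update(spec["env"])  # then apply this process's own variables
        procs.append(
            subprocess.Popen(
                spec["argv"], env=env, stdout=subprocess.PIPE, text=True
            )
        )
    return [proc.communicate()[0].strip() for proc in procs]


if __name__ == "__main__":
    # Each child prints the role it was given through its environment.
    show = [sys.executable, "-c", "import os; print(os.environ['DEMO_ROLE'])"]
    specs = [
        {"argv": show, "env": {"DEMO_ROLE": "master"}},
        {"argv": show, "env": {"DEMO_ROLE": "worker"}},
    ]
    print(launch(specs))
```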
13 Schematic of Process Management Component in Scalable Systems Software Context
[Diagram. SSS side: SSS components (NSM, SD, Sched, EM, QM, PM) exchanging SSS XML; simple scripts or hairy GUIs using SSS XML; QM's job submission language. Prototype MPD-based implementation side: PM, mpdrun (XML file), mpiexec (MPI Standard args), interactive, MPDs, application processes. Annotation: other managers could go here instead.]
14 Other Accomplishments
- APItest is a component test framework, capable of conducting unit tests on components
- Well-suited to complicated network such as this one
- Allows testing of one component at a time without testing all at once
- Used on SSS components this year
- SSS-OSCAR is a public, open-source release of the current state of the component system, tested for compatibility. (Get from web page)
- Tested subset of components on 5000-node cluster at NCSA
- SSS component architecture put into production on clusters at ORNL, PNNL, ANL (ANL story follows)
15 New Challenges on Chiba City
- Medium-sized, middle-aged cluster at Argonne
- Dedicated to computer science scalability research, not applications
- Also used by friendly, hungry applications
- New requirement: support research requiring specialized kernels and alternate operating systems, for OS scalability research
- Want to schedule jobs that require node rebuilds (for new OSs, kernel module tests, virtual nodes, etc.) as part of normal job scheduling
- Requires major upgrade of Chiba City systems software
16 Chiba Commits to SSS
- Fork in the road
- Major overhaul of old, crufty Chiba systems software (OpenPBS + Maui scheduler + homegrown stuff), OR
- Take leap forward and bet on all-new software architecture of SSS
- Problems with leaping approach
- SSS interfaces not finalized
- Some components don't yet use library (implement own protocols in open code, not encapsulated in library)
- Some components not fully functional yet
- Solutions to problems
- Collect components that are adequately functional and integrated (PM, SD, EM, BCM)
- Write stubs for other critical components (Sched, QM)
- Do without some components (CKPT, monitors, accounting) until ready
17 Features of Adopted Solution
- Stubs adequate, at least for time being
- Scheduler does FIFO + reservations + backfill, improving
- QM implements PBS compatibility mode (accepts user PBS scripts) as well as asking Process Manager to start parallel jobs directly
- Process Manager wraps MPD, as described above
- Single ring of daemons runs as root, managing all jobs for all users
- Daemons started by Build-and-Config manager at boot time
- An MPI program called MPISH (MPI Shell) wraps user jobs for handling file staging and multiple job steps
- Python implementation of most components
- Each component < 400 lines
- Demonstrates feasibility of using SSS component approach to systems software
- Running normal Chiba job mix for over six months now
- Only systems software on this machine
- Moving forward on meeting new requirements for research support
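The scheduler bullet above names three policies: FIFO order, a reservation for a job that cannot start yet, and backfill of smaller jobs around that reservation. The sketch below shows one way those three fit together in a single scheduling pass. It is an illustration under stated assumptions, not the Chiba City scheduler: the `Job` fields, the `schedule` signature, and the use of requested walltime as the job's end time are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    name: str
    nodes: int
    walltime: int  # requested run time, in arbitrary time units


def schedule(
    now: int, free: int, running: List[Tuple[int, int]], queue: List[Job]
) -> Tuple[List[str], Optional[int]]:
    """One scheduling pass.

    running holds (end_time, nodes) for jobs already executing.
    Returns the names started now and the reserved start time of the
    blocked head job (None if nothing is blocked or it can never fit).
    """
    started: List[str] = []
    running = sorted(running)
    pending = list(queue)

    # FIFO: start jobs from the head of the queue while they fit.
    while pending and pending[0].nodes <= free:
        job = pending.pop(0)
        free -= job.nodes
        running.append((now + job.walltime, job.nodes))
        started.append(job.name)
    if not pending:
        return started, None

    # Reservation: find when enough nodes will be free for the head job.
    head = pending[0]
    avail = free
    reserved_at = None
    for end_time, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            reserved_at = end_time
            break
    if reserved_at is None:
        return started, None  # head job needs more nodes than exist
    spare = avail - head.nodes  # nodes left over once the head job starts

    # Backfill: a later job may start now only if it cannot delay the
    # reservation: it ends before the reserved time, or it fits in the
    # nodes the head job will not need.
    for job in pending[1:]:
        if job.nodes > free:
            continue
        ends_in_time = now + job.walltime <= reserved_at
        if ends_in_time or job.nodes <= spare:
            free -= job.nodes
            if not ends_in_time:
                spare -= job.nodes
            started.append(job.name)
    return started, reserved_at
```

For example, with 2 free nodes, one running job holding 6 nodes until time 10, and a queue of a 4-node job followed by a 2-node job with walltime 5, the pass reserves time 10 for the 4-node job and starts the 2-node job immediately, since it finishes before the reservation.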
18 Lessons Learned: This Approach Really Works!
- Components can use one another's data
- Functionality only needs to be implemented once
- E.g., broadcast of messages
- Components are more robust, since they focus on one task
- Code volume shrinks because of less duplication of functionality
- Easy to add new functionality
- File staging
- MPISH
- Rich infrastructure on which to build new components
- Communication, logging, location services
- Need not be limited by existing subcomponents of existing systems
- Can replace just the functionality needed (get to solve the problem you want to solve, without re-implementing everything)
- E.g., having queue manager accept requests for rebuilt nodes before starting jobs
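The final example, a queue manager that accepts requests for rebuilt nodes before starting a job, can be sketched with two stub components. Everything here is hypothetical: the class names, the `ensure_image` and `start` methods, and the node and image names are illustrative stand-ins, and the components call each other directly rather than exchanging XML messages.

```python
class StubBuildManager:
    """Stand-in for a build-and-configuration manager."""

    def __init__(self, images):
        self.images = dict(images)  # node name -> currently installed image

    def ensure_image(self, node, image):
        """Rebuild the node only if needed; return True if it was rebuilt."""
        if self.images.get(node) == image:
            return False
        self.images[node] = image
        return True


class StubProcessManager:
    """Stand-in for a process manager; it only records start requests."""

    def __init__(self):
        self.started = []

    def start(self, job_name, nodes):
        self.started.append((job_name, tuple(nodes)))


class QueueManager:
    """Rebuilds nodes on request, then asks the process manager to start."""

    def __init__(self, build_manager, process_manager):
        self.bcm = build_manager
        self.pm = process_manager

    def run(self, job_name, nodes, image=None):
        """Start a job, rebuilding its nodes first if an image is requested."""
        rebuilt = []
        if image is not None:
            rebuilt = [n for n in nodes if self.bcm.ensure_image(n, image)]
        self.pm.start(job_name, nodes)
        return rebuilt


if __name__ == "__main__":
    bcm = StubBuildManager({"node1": "default", "node2": "test-kernel"})
    qm = QueueManager(bcm, StubProcessManager())
    print(qm.run("os-test", ["node1", "node2"], image="test-kernel"))
```

Only the queue manager changes to support the new request. The build manager and process manager keep their existing interfaces, which is the point of replacing just the functionality needed.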
19 Summary
- The Scalable Systems Software SciDAC project is addressing the problem of systems software for terascale systems.
- Component architecture for systems software
- Definitions of standard interfaces between components
- An infrastructure to support component implementations within this framework
- A set of component implementations, continuing to improve
- Prototype software suite released
- Experimental production use of the component architecture and some of the component implementations
- Encourages development of sharable tools and solutions
- Scalability testing under way