Title: Scalable Systems Software Enabling Technology Center
1SciDAC
Scalable Systems Software Center
August 14-15 Atlanta GA
2Agenda - August 14
8:00 Wireless setup
9:00 Introductions
9:30 Overview and Goals of the Center (Geist)
10:00 SciDAC ISIC Expectations (Johnson)
10:30 Discussion of meeting goals
11:30 Strawman proposal for an interface framework
12:00 Lunch (as a group at the hotel restaurant)
1:00 Other proposals/ideas for system interfaces
- Doug - runtime architecture (Scyld vs Cplant vs ???)
- Karl - rm-api, cpu sets, need to schedule fat nodes
- Don - Scyld boot method, multicast status info
- Paul - checkpoint/restart
- Sung - science appliance project
3:00 Enumerating key attributes
4:00 Discuss merits of attribute database
5:00 Break for dinner
3Agenda - August 15
8:00 Wireless setup
8:30 Decide on logistics of consensus
9:30 Overnight proposals?
10:00 Decide on working groups for key areas
- Begin initial discussion of interfaces integration
12:00 Next meeting dates, what happens until then
12:30 Meeting ends
- Lunch and further discussion for hangers-on
4Scalable Systems Software for Terascale Computer Centers
www.scidac.org/ScalableSystems
Problem
- Computer centers use incompatible, ad hoc sets of systems tools
- Present tools are not designed to scale to multi-Teraflop systems
Solution
- Collectively (with industry) define standard interfaces between systems components for interoperability
- Create scalable, standardized management tools for efficiently running our large computing centers
Impact
- Revolutionize the way system software is designed and used.
Figure labels: Resource Management; Accounting user mgmt; System Monitoring; System Build Configure; Job management
5Goal and Vision of the Center
Four Goals
- Collectively (with industry) agree on and specify standardized interfaces between system components in order to promote interoperability, portability, and long-term usability.
- Produce a fully integrated suite of systems software and tools for the effective management and utilization of terascale computational resources, particularly those at the DOE facilities.
- Research and develop more advanced versions of the components, as well as the OS modifications required to support the scalability and performance requirements of SciDAC applications.
- Carry out a software lifecycle plan for support and maintenance of the systems software suite.
6Scope: The Spaghetti and Meatballs Picture
[Figure: diagram of the system software components in scope and the connections between them. Labels recoverable from the figure: Access control, Security manager (interacts with all components), Meta Scheduler, Meta Monitor, Meta Manager, Node Configuration Build Manager, System Monitor, Accounting, Scheduler, Resource, Queue, and Allocation management, Job Manager Monitor, User DB, Data Migration, High Performance Communication I/O, Usage Reports, Checkpoint/Restart, File System, User utilities, Application Environment.]
7Working with Computer Centers
Our customers are the managers and system administrators at the terascale computer centers around the nation.
- their guidance
- their feedback
Working with other SciDAC Centers
- Common Component Architecture: parallel startup, event services, runtime framework
- Scalable Data Management
- others?
8Meeting Goals
- Decide logistics of reaching consensus on standard interfaces
  - MPI-like process, CCA-like process, other?
  - How to deal with errata
- Enumeration of key attributes common across system components
  - expect there are fewer than 30
- Discuss whether an attribute database should be part of the architecture
  - could be considered as just another component
- Begin defining interfaces and working groups for key areas
  - node configuration and building, resource management, parallel job startup, system and job monitoring
9Infrastructure
Project Web Page: www.scidac.org/ScalableSystems
- proposal, plan, overview slides
- links to individual sites and software downloads
Project Notebook: www.csm.ornl.gov/geist/enote/system.html
- meeting notes (like this meeting)
- progress reports
- draft standards for group to comment on
CVS when we begin to produce software suite
10Strawman: a common integrated interface framework
- Easy to swap components
- Vendor optimized, highly scalable version
- Common pool of attributes
- XML format for attributes
- Standardized request protocol
- Choose an existing transfer protocol: TCP
- Every component uses the same framework
Attribute database: User, Host, OS, Mem, Allocation, etc.
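The strawman's "XML format for attributes" plus "standardized request protocol" can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the element names (`request`, `attribute`), the `component` and `action` fields, and the helper functions are hypothetical and not part of any agreed interface. Only the attribute names come from the slide's example pool, and the TCP transport is left out.

```python
import xml.etree.ElementTree as ET

# Attribute names taken from the slide's example pool; the real pool
# would be whatever the working groups agree on.
COMMON_ATTRIBUTES = ("User", "Host", "OS", "Mem", "Allocation")


def build_request(component, action, attributes):
    """Serialize a request as XML (hypothetical schema, for illustration)."""
    unknown = sorted(set(attributes) - set(COMMON_ATTRIBUTES))
    if unknown:
        raise ValueError(f"attributes not in the common pool: {unknown}")
    root = ET.Element("request", {"component": component, "action": action})
    for name, value in attributes.items():
        item = ET.SubElement(root, "attribute", {"name": name})
        item.text = str(value)
    return ET.tostring(root, encoding="unicode")


def parse_request(xml_text):
    """Recover (component, action, attributes) from a serialized request."""
    root = ET.fromstring(xml_text)
    attributes = {a.get("name"): a.text for a in root.findall("attribute")}
    return root.get("component"), root.get("action"), attributes


if __name__ == "__main__":
    message = build_request("scheduler", "query", {"User": "alice", "Mem": 512})
    print(message)
    print(parse_request(message))
```

Because every component would build and parse messages the same way, swapping one component for another would not change the wire format. In a real system the serialized string would be sent over the chosen transfer protocol.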
11Objects and Components
Components: Job Manager, System Monitor, Accounting, Allocation management, Logging, Node Management, Process Management, Job monitor, Configuration management, Scheduler, Queue manager, Meta-services, Information service, System management
Components: Checkpoint, File staging, Security manager
Objects: Job, Node, Task, User, Group, Account/project, Queue???, Data store/IO, Interconnect partition
12Node, system, and configuration Services
Services Start job Signal job
Services Start job Signal job
Services Start job Signal job
Objects Job Node Task User Group Account/project
Queue??? Data store/IO Interconnect partition
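The service slides use "Start job" and "Signal job" as example services. As a rough sketch of how a shared service interface could make components easy to swap, the example below defines an abstract interface and one in-memory implementation. The class names, method signatures, and state strings are all hypothetical and are not taken from any agreed standard.

```python
from abc import ABC, abstractmethod


class JobService(ABC):
    """Hypothetical common interface that any job component would implement."""

    @abstractmethod
    def start_job(self, job_id: str) -> str:
        """Start a job and return its state."""

    @abstractmethod
    def signal_job(self, job_id: str, signal: str) -> str:
        """Deliver a signal to a job and return its state."""


class InMemoryJobManager(JobService):
    """Toy implementation that only tracks job state in a dictionary."""

    def __init__(self):
        self.jobs = {}

    def start_job(self, job_id: str) -> str:
        self.jobs[job_id] = "running"
        return self.jobs[job_id]

    def signal_job(self, job_id: str, signal: str) -> str:
        if job_id not in self.jobs:
            raise KeyError(f"unknown job: {job_id}")
        if signal == "TERM":  # illustrative signal name
            self.jobs[job_id] = "terminated"
        return self.jobs[job_id]


def run_and_stop(service: JobService, job_id: str) -> str:
    """Caller code depends only on the interface, not on the implementation."""
    service.start_job(job_id)
    return service.signal_job(job_id, "TERM")


if __name__ == "__main__":
    print(run_and_stop(InMemoryJobManager(), "job-1"))
```

A vendor-optimized component could replace `InMemoryJobManager` without changes to `run_and_stop`, as long as it implements the same two methods.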
13Job and System Monitor Services
Services Start job Signal job
Services Start job Signal job
Services Start job Signal job
Objects Job Node Task User Group Account/project
Queue??? Data store/IO Interconnect partition
14Accounting and logging Services
Services Start job Signal job
Services Start job Signal job
Services Start job Signal job
Objects Job Node Task User Group Account/project
Queue??? Data store/IO Interconnect partition
15Job and Process Mgmt (chkpt) Services
Services Start job Signal job
Services Start job Signal job
Services Start job Signal job
Objects Job Node Task User Group Account/project
Queue??? Data store/IO Interconnect partition
16Scheduler, Queue, and meta- Services
Services Start job Signal job
Services Start job Signal job
Services Start job Signal job
Objects Job Node Task User Group Account/project
Queue??? Data store/IO Interconnect partition
17Information Services
Objects Job Node Task User Group Account/project
Queue??? Data store/IO Interconnect partition
Slow Services Start job Signal job
Fast Services Start job Signal job
Static Services Start job Signal job
18Security Mgr Services
Objects Job Node Task User Group Account/project
Queue??? Data store/IO Interconnect partition
Services Start job Signal job
19Storage and I/O Services
Objects Job Node Task User Group Account/project
Queue??? Data store/IO Interconnect partition
Services Start job Signal job
20Consensus and Voting Rules
Written Documentation
- Written draft standards available to everyone in the Project Notebook
- Drafts must be presented a week to 10 days before a vote
- Errata or extensions: revisit interface standard every 6 months
Voting
- Pass with a simple majority of people voting yes/no
- Who can vote? Organizations with physical attendance at 2 of the last 3 meetings
- One vote per organization
- No email-in or phone votes accepted.
- Straw votes are non-binding and many can be used for guidance
- Two formal votes are required to accept a chapter for final vote. Both votes cannot occur at the same meeting.
- Global vote of whole document as standard interface
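The eligibility and majority rules above can be sketched as a small tally function. This is an illustration only: the data layout (attendance as a list of booleans, oldest meeting first) and the function names are assumptions, and treating a tie as a failed vote is also an assumption, since the slide says only "simple majority".

```python
def is_eligible(attendance):
    """An organization may vote if it attended at least 2 of the last 3 meetings.

    attendance: list of booleans, oldest meeting first (assumed layout).
    """
    return sum(attendance[-3:]) >= 2


def vote_passes(votes, attendance_by_org):
    """Return True if eligible yes votes outnumber eligible no votes.

    votes: dict mapping organization -> "yes", "no", or "abstain".
    Using a dict keyed by organization enforces one vote per organization.
    Abstentions are ignored, because only yes/no votes count toward the majority.
    """
    yes = no = 0
    for org, choice in votes.items():
        if not is_eligible(attendance_by_org.get(org, [])):
            continue  # not enough physical attendance to vote
        if choice == "yes":
            yes += 1
        elif choice == "no":
            no += 1
    return yes > no  # assumption: a tie does not pass


if __name__ == "__main__":
    attendance = {
        "LabA": [True, True, False],
        "LabB": [False, False, True],
        "LabC": [True, False, True],
    }
    print(vote_passes({"LabA": "yes", "LabB": "no", "LabC": "yes"}, attendance))
```

The rule that a chapter needs two formal votes at different meetings is not modeled here. It could be layered on top by recording the meeting at which each passing vote occurred.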
21Other organizational
- Weekly teleconference of Working Groups: up to the groups
- Video conference meetings: explore AG in future