1
NCCS User Forum
  • 24 March 2009

2
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
3
Key Accomplishments
  • Incorporation of SCU4 processors into general
    queue pool
  • Acquisition of analysis system

4
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
5
Key Accomplishments
  • SCU4 processors added to the general queue pool
    on Discover
  • SAN implementation
  • Improved data sharing between Discover and Data
    Portal
  • RAID 6 implementation

6
Discover Utilization Past Year, by Month
  • 9/4/08 SCU3 (2064 cores added)
  • 2/4/09 SCU4 (544 cores moved from test queue)
  • 2/19/09 SCU4 (240 cores moved from test queue)
  • 2/27/09 SCU4 (1280 cores moved from test queue)

7
Discover Utilization Past Quarter, by Week
  • 2/4/09 SCU4 (544 cores moved from test queue)
  • 2/19/09 SCU4 (240 cores moved from test queue)
  • 2/27/09 SCU4 (1280 cores moved from test queue)

8
Discover CPU Consumption, Past 6 Months (CPU Hours)
  • 9/4/08 SCU3 (2064 cores added)
  • 2/4/09 SCU4 (544 cores moved from test queue)
  • 2/19/09 SCU4 (240 cores moved from test queue)
  • 2/27/09 SCU4 (1280 cores moved from test queue)

9
Discover Queue Expansion Factor, December through February
Weighted over all queues for all jobs (Background
and Test queues excluded)
Expansion Factor = (Eligible Time + Run Time) / Run Time
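For example, under this definition a job that spends 2 hours
eligible in the queue and then runs for 4 hours has an expansion
factor of (2 + 4) / 4 = 1.5; values near 1.0 mean jobs start with
little queue delay.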
10
Discover Job Analysis, February 2009
11
Discover Availability
  • December through February availability
    • 4 outages
      • 2 unscheduled
        • 0 hardware failures
        • 1 user error
        • 1 extended maintenance window
      • 2 scheduled
    • 11.7 hours total downtime
      • 1.2 unscheduled
      • 10.5 scheduled
  • Outages
    • 2/11 Maintenance (InfiniBand and GPFS upgrades,
      node reprovisioning), 10.5 hours scheduled
      outage plus extension
    • 11/12 SPOOL filled due to user error, 45 minutes
    • 1/6 Network line card replacement, 30 minutes
      scheduled outage

(Chart: downtime hours by cause: maintenance (scheduled plus
extension) for the InfiniBand/GPFS upgrades and node
reprovisioning, network line card maintenance, SPOOL filled.)
12
Current Issues on Discover: GPFS Hangs
  • Symptom: GPFS hangs resulting from users running
    nodes out of memory.
  • Impact: Users cannot log in or use the filesystem.
    System admins reboot affected nodes.
  • Status: Implemented additional monitoring and
    reporting tools.

13
Current Issues on Discover: Problems with PBS -V
  • Symptom: Jobs with large environments not
    starting.
  • Impact: Jobs placed on hold by PBS.
  • Status: Awaiting PBS 10.0 upgrade. In the
    interim, don't use -V to pass the full environment;
    instead use -v or define the necessary variables
    within job scripts (see the sketch below).
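A minimal sketch of the interim workaround; the variable names and
paths are illustrative, not NCCS-mandated settings:

    # Avoid passing the entire environment until the PBS 10.0 upgrade:
    #   qsub -V myjob.pbs
    # Instead, pass only the variables the job needs:
    qsub -v RUNDIR=/discover/nobackup/$USER/run1,NSTEPS=100 myjob.pbs

    # ...or export them inside the job script itself:
    export RUNDIR=/discover/nobackup/$USER/run1
    export NSTEPS=100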

14
Future Enhancements
  • Discover Cluster
    • Hardware platform
    • Additional storage
  • Data Portal
    • Hardware platform
  • Analysis environment
    • Hardware platform
  • DMF
    • Hardware platform
    • Additional disk cache

15
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
16
FY09 Operating Plan: Breakdown of Major Initiatives
  • Analysis System Integration
    • Large-scale disk and interactive analysis nodes
    • Pioneer users in April; full production in June
  • FY09 Cluster Upgrade
    • Two scalable compute units (approximately 4K cores)
    • Additional 40 TF of Intel Nehalem processors
    • To be completed by July (subject to vendor
      availability of equipment)
  • Data Portal
    • Enhance services within the data portal to serve
      IPCC and other data to the Earth System Grid
      (ESG) and PCMDI
    • Actively looking for partners
    • To be completed by the end of FY09
  • Data Management
    • Concept of operations still being worked out
    • Actively looking for partners
    • Plan is to have some amount of capability based
      on iRODS rolled out by the end of FY09
  • DMF Migration from Irix to Linux
    • Move DMF equipment out of S100 into E100
    • SGI is dropping support for DMF on Irix; the
      Palm (SGI Linux) system will be re-used as the
      new DMF server
    • To be completed by June

17
Representative Architecture
(Architecture diagram, showing existing, planned-for-FY09, and
future components connected by the NCCS LAN (1 GbE and 10 GbE):
login nodes, data portal, the existing 65 TF Discover cluster, the
40 TF FY09 upgrade, future upgrades (TBD), analysis nodes, data
gateways, data management, viz, and direct-connect GPFS nodes; the
archive with 300 TB of disk and 8 PB of tape; GPFS I/O nodes in
front of 1.3 PB of GPFS disk subsystems; and internal services
such as management servers, license servers, GPFS management, and
PBS servers.)
18
Benefits of the Representative Architecture
  • Breakout of services
    • Separate highly available login, data mover, and
      visualization service nodes
    • These can be available even when upgrades are
      occurring elsewhere within the cluster
  • Data Mover Service: these service nodes allow for
    • Data to be moved between the Discover cluster
      and the archive
    • Data within the GPFS system to be served to the
      data portal
  • WAN-accessible nodes within the compute cluster
    • Users have requested that nodes within compute
      jobs have access to the network
    • The NCCS is currently configuring network-accessible
      nodes that can be scheduled in PBS jobs so users can
      run sentinel-type processes, easily move data via
      NFS mounts, etc.
  • Internal services run on dedicated nodes
    • Allows the vertical components of the
      architecture to go up and down independently
    • Critical services are run in a high-availability
      mode
    • Can even allow licenses to be served outside
      the NCCS

19
Analysis Requirements
  • Phase 1: Reproduce current SGI capabilities
    • Fast access to all GPFS and Archive file systems
    • FORTRAN, C, IDL, GrADS, Matlab, Quads, LATS4D,
      Python
    • Visibility and easy access to post data to the
      data portal
    • Interactive display of analysis results
  • Beyond Phase 1: Develop client/server capabilities
    • Extend analytic functions to the users'
      workstations
    • Subsetting functions
    • In-line and interactive visualization
    • Synchronize analysis with model execution
      • See the intermediate data as they are being
        generated
      • Generate images for display back to the users'
        workstations
      • Capture and store images during execution for
        later analysis

20
Analysis System Technical Solution
(Diagram: analysis nodes (16 cores, 256 GB each) alongside the
Discover compute cluster on a 10 GbE LAN; NFS, bbftp, and scp
transfers at 10-50 MB/sec per stream, 1-1.5 GB/sec aggregate;
multiple interfaces and large network pipes to the DMF archive;
I/O servers (4 MDS, 16 NSD) providing direct GPFS I/O connections
at 3 GB/sec per node; IP over IB at 250-300 MB/sec per stream,
600 GB/sec aggregate; Fibre Channel SANs connecting additional
storage, the archive file systems, and GPFS; a large staging area
to minimize data recall from the archive.)
21
Analysis System Technical Details
  • 8 IBM x3950 nodes
    • 4-socket, quad-core (16 cores per server, 128
      cores total)
    • Intel Dunnington E7440, 2.4 GHz cores with 1,066
      MHz FSB
    • 256 GB memory (16 GB/core)
    • 10 GbE network interface
    • Can be configured as a single system image across
      up to 4 servers (64 cores and 1 TB of RAM)
  • GPFS file system
    • Direct-connect I/O servers
    • 3 GB/sec per analysis node
    • Analysis nodes will see ALL GPFS file systems,
      including the nobackup areas currently in use;
      no need to move data into the analysis system
  • Additional disk capacity
    • 2 x DDN S2A9900 SATA disk subsystems
    • 900 TB raw capacity
    • Total of 6 GB/sec throughput

22
Analysis System Timeline
  • 1 April 2009: Pioneer/Early Access Users
    • If you would like to be one of the first, please
      let us know.
    • Contact User Services.
    • Provide us with some details as to what you may
      need.
  • 1 May 2009: Analysis System in Production
    • Continued support for analysis users migrating
      off of Dirac.
  • 1 June 2009: Dirac Transition
    • Dirac no longer used for analysis.
    • Migrate DMF from Irix to Linux.

23
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
24
What Happened to My Ticket?
25
Ticket Closure Percentiles for the Past Quarter
26
Issue: Commands to Access DMF
  • Implementation of dmget and dmput (usage sketch below)
  • Status: resolved
    • Enabled on Discover login nodes
    • Performance has been stable since installation
      on 11 Dec 08
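A minimal usage sketch of the two commands; the archive path is
illustrative:

    # stage a file's data back from tape to disk before reading it
    dmget /archive/u/username/run1/output.nc

    # release the online copy once it has been migrated to tape
    dmput -r /archive/u/username/run1/output.nc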

27
Issue: Parallel Jobs > 1500 CPUs
  • Many jobs won't run at > 1500 CPUs
  • Status: resolved
    • Requires a different version of the DAPL library
    • Since this is not the officially supported
      version, it is not the default

28
Issue: Enabling Sentinel Jobs
  • Need the capability to run a sentinel subjob that
    watches a main parallel compute subjob within a
    single PBS job
  • Status: in process
    • Requires an NFS mount of data portal file systems
      on Discover gateway nodes (done!)
    • Requires some special PBS usage to specify how
      subjobs will land on nodes (rough sketch below)
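A rough sketch of the kind of job layout involved, assuming
standard PBS Pro select syntax and pbsdsh; the chunk sizes, script
names, and MPI launch details are illustrative only, not the final
NCCS recipe:

    #PBS -l select=1:ncpus=8+32:ncpus=8:mpiprocs=8
    # place the watcher on the job's first node, in the background
    pbsdsh -n 0 $PBS_O_WORKDIR/sentinel.sh &
    # run the main computation on the remaining nodes
    # (exact host-list handling depends on the MPI stack)
    mpirun -np 256 ./model.exe
    wait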

29
Issue: Poor Interactive Response
  • Slow interactive response on Discover
  • Status: under investigation
    • Router line card replaced
    • Automatic monitoring instituted to promptly
      detect future problems
    • Seems to happen when filesystem usage is heavy
      (anecdotal)

30
Issue: Getting Jobs into Execution
  • Long wait for queued jobs before launching
  • Reasons
    • SCALI=TRUE is restrictive
    • Per-user, per-project limits on the number of
      eligible jobs (use qstat -is; see the example below)
    • Scheduling policy: first-fit on the job list
      ordered by queue priority and queue time
  • Status: under investigation
    • Individual job priorities available in PBS v10
      may help with this
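To see why a particular queued job has not started, the scheduler
comment can be checked along these lines (the job ID is
illustrative):

    # -i lists queued/held jobs, -s adds the scheduler's comment,
    # e.g. a note that an eligible-job limit has been reached
    qstat -is 123456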

31
Use of Project Shared Space
  • Please begin using $SHARE instead of /share, since
    the shared space may move (example below)
  • For the same reason, try to avoid soft links that
    explicitly point to /share/
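For instance, in a job script (the subdirectory name is
illustrative):

    # portable: follows the environment variable if the space moves
    cp results.nc $SHARE/myproject/

    # fragile: breaks if the shared space is relocated
    cp results.nc /share/myproject/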

32
Dirac Filesystems
  • Dirac's disks are being repurposed as primary
    archive cache
  • Hence, the SGI file systems on Dirac will be
    going away
  • Users will need to migrate data off of the SGI
    home, nobackup, and share file systems (one
    possible approach is sketched below)
  • Contact User Services if you need assistance.
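One possible way to copy data over to Discover, assuming scp
connectivity between the systems; the host name and paths are
illustrative:

    # from a Dirac session, copy a run directory to Discover nobackup
    scp -r /nobackup/username/run1 discover:/discover/nobackup/username/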

33
Integrated Performance Monitor (IPM)
  • Provides
    • A short report of resource consumption, and
    • A longer web-based presentation
  • Requires
    • Low runtime overhead (2-5%)
    • Linking with the MPI wrapper library (your job;
      see the example below)
    • A newer version of the OS for complete statistics
      (our job)
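A typical IPM link step and report generation, assuming IPM is
installed under an IPM_HOME-style path on Discover (the exact
path, module name, and log file name may differ):

    # relink the application against the IPM MPI wrapper library
    mpif90 -o model.exe model.o solver.o -L$IPM_HOME/lib -lipm

    # a short profile is printed at the end of the run; the fuller
    # web-based report is generated from IPM's XML log afterwards
    ipm_parse -html model_ipm_logfile.xml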

34
IPM Sample Output
35
IPM Sample Output
36
Access to Analysis System
  • Pioneer access scheduled for 1 April
  • All Dirac analysis users welcome as pioneers
  • Initially, no charge against your allocation
  • If you have no allocation in e-Books, contact USG
    and we will resolve it

37
Future User Forums
  • The next three NCCS User Forums
    • 23 June, 22 Sep, 8 Dec
    • All on Tuesday
    • All 2:00-3:30 PM
    • All in Building 33, Room H114
  • Published
    • On http://nccs.nasa.gov/
    • On GSFC-CAL-NCCS-Users

38
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
39
Feedback
  • Now: Open discussion to voice your
    • Praises
    • Complaints
    • Suggestions
  • Later, to NCCS Support
    • support@nccs.nasa.gov
    • (301) 286-9120
  • Later, to the USG Lead
    • William.A.Ward@nasa.gov
    • (301) 286-2954