Title: Project Athena: Technical Issues
1 Project Athena: Technical Issues
- Larry Marx and the Project Athena Team
2 Outline
- Project Athena Resources
- Models and Machine Usage
- Experiments
- Running Models
- Initial and Boundary Data Preparation
- Post Processing, Data Selection and Compression
- Data Management
3 Project Athena Resources
- Athena: 4,512 nodes @ 4 cores, 2 GB mem (dedicated, Oct09-Mar10, 79 million core-hours)
- Kraken: 8,256 nodes @ 12 cores, 16 GB mem (shared, Oct09-Mar10, 5 million core-hours)
- Verne: 5 nodes @ 32 cores, 128 GB mem (dedicated, Oct09-Mar10, post-processing)
- scratch: 78 TB (Lustre)
- homes: 8 TB (NFS)
- nakji: 360 TB (Lustre), read-only
- 800 TB HPSS tape archive
4 Models and Machine Usage
- NICAM initially was the primary focus of implementation
- Limited flexibility in scaling, due to the icosahedral grid
- Limited testing on multicore/cache processor architectures; production primarily on the vector-parallel (NEC SX) Earth Simulator
- Step 1: Port a low-resolution version with simple physics to Athena
- Step 2: Determine the highest resolution possible on Athena and the minimum and maximum number of cores to be used
- Unique solution: G-level 10, or 10,485,762 cells (7-km spacing), using exactly 2,560 cores (see the sketch below)
- Step 3: Initially, NICAM jobs failed frequently due to improper namelist settings. During a visit by U. Tokyo and JAMSTEC scientists to COLA, new settings were determined that generally ran with little trouble. However, the 2003 case could never be stabilized and was abandoned.
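- The cell and core counts above follow from the icosahedral grid's recursive refinement; a minimal arithmetic sketch (assuming the usual NICAM relations of 10 * 4^g + 2 cells at g-level g and 10 * 4^r regions at r-level r, with one core per region at r-level 4):

import math

def nicam_cells(glevel):
    # 10 base rhombuses, each refined by a factor of 4 per g-level, plus the 2 pole points
    return 10 * 4**glevel + 2

def nicam_regions(rlevel):
    # decomposition regions; one core per region is assumed here
    return 10 * 4**rlevel

EARTH_RADIUS_KM = 6371.0
cells = nicam_cells(10)                                                 # 10,485,762 cells
mean_spacing_km = math.sqrt(4 * math.pi * EARTH_RADIUS_KM**2 / cells)   # ~7 km
print(cells, nicam_regions(4), round(mean_spacing_km, 1))               # 10485762 2560 7.0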
5 Models and Machine Usage (cont'd)
- IFS: flexible scalability sustains good performance for the higher-resolution configurations (T1279 and T2047) using 2,560 processor cores
- We defined one slot as 2,560 cores and managed a mix of NICAM and IFS jobs at 1 job per slot, for maximally efficient use of the resource (see the sketch below)
- Having equal-size slots for both models permits either model to be queued and run in the event of a job failure
- Selected jobs given higher priority so that they continue to run ahead of others
- Machine partition: 7 slots of 2,560 cores = 17,920 cores out of 18,048 (99% machine utilization)
- 128 processors for pre- and post-processing and as spares (postpone reboot)
- Lower-resolution IFS experiments (T159 and T511) were run on Kraken
- IFS runs were initially made by COLA. When the ECMWF SMS model management system was installed, runs could be made by COLA or ECMWF.
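- A minimal sketch of the slot bookkeeping above (the counts come from this slide; the actual scheduling was handled by the batch system, not by code like this):

TOTAL_CORES = 18_048      # 4,512 Athena nodes x 4 cores
SLOT_CORES = 2_560        # one NICAM or high-resolution IFS job per slot
n_slots = TOTAL_CORES // SLOT_CORES          # 7 slots
used = n_slots * SLOT_CORES                  # 17,920 cores
spare = TOTAL_CORES - used                   # 128 cores for pre/post-processing and spares
print(f"{n_slots} slots, {used}/{TOTAL_CORES} cores, "
      f"{100 * used / TOTAL_CORES:.0f}% utilization, {spare} spare")
# 7 slots, 17920/18048 cores, 99% utilization, 128 spare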
6 Project Athena Experiments
7 Initial and Boundary Data Preparation
- IFS
- Most input data prepared by ECMWF; large files shipped by removable disk
- Time Slice experiment input data prepared by COLA
- NICAM
- Initial data from GDAS 1° files, available for all dates
- Boundary files other than SST included with NICAM
- SST from ¼° NCDC OI daily (version 2). Data starting 1 June 2002 include in situ, AVHRR (IR), and AMSR-E (microwave) observations; earlier data do not include AMSR-E
- All data interpolated to the icosahedral grid
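- For illustration only, a minimal sketch of the kind of lat/lon-to-icosahedral interpolation involved, using SciPy and synthetic stand-ins for the ¼° SST field and the cell-center coordinates (the project used NICAM's own preprocessing tools, and the names below are hypothetical):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

def regrid_to_icosahedral(field, lats, lons, cell_lats, cell_lons):
    # bilinear interpolation from a regular lat/lon grid to icosahedral cell centers
    interp = RegularGridInterpolator((lats, lons), field,
                                     bounds_error=False, fill_value=None)
    return interp(np.column_stack([cell_lats, cell_lons]))

# synthetic 1/4-degree field standing in for the NCDC OI daily SST
lats = np.linspace(-89.875, 89.875, 720)
lons = np.linspace(0.125, 359.875, 1440)
sst = 300.0 - 28.0 * np.abs(np.sin(np.deg2rad(lats)))[:, None] + 0.0 * lons
# a small random subset standing in for the 10,485,762 cell centers
cell_lats = np.random.uniform(-90.0, 90.0, 10_000)
cell_lons = np.random.uniform(0.0, 360.0, 10_000)
sst_on_cells = regrid_to_icosahedral(sst, lats, lons, cell_lats, cell_lons)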
8 Post Processing, Data Selection and Compression
- All IFS (GRIB-1) data interpolated (coarsened) to the N80 reduced grid for common comparison among the resolutions and with the ERA-40 data. All IFS spectral data truncated to T159 coefficients and transformed to the N80 full grid.
- Key fields at full model resolution were processed, including transforming spectral coefficients to grids and compressing to NetCDF-4 via GrADS (see the sketch below)
- Processing accomplished on Kraken, because Athena lacks sufficient memory and computing power on each node
- All of the common-comparison and selected high-resolution data electronically transferred to COLA via bbcp (up to 40 MB/s sustained)
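- The NetCDF-4 compression was done via GrADS; purely as a sketch of what that step amounts to, the same deflate compression can be written with the Python netCDF4 library (the grid and field below are placeholders, not the actual N80 Gaussian grid):

import numpy as np
from netCDF4 import Dataset

def write_compressed(path, name, data, lats, lons):
    # NetCDF-4 with zlib deflate and byte-shuffle, roughly what the GrADS step produced
    with Dataset(path, "w", format="NETCDF4") as nc:
        nc.createDimension("lat", lats.size)
        nc.createDimension("lon", lons.size)
        nc.createVariable("lat", "f4", ("lat",))[:] = lats
        nc.createVariable("lon", "f4", ("lon",))[:] = lons
        var = nc.createVariable(name, "f4", ("lat", "lon"),
                                zlib=True, complevel=4, shuffle=True)
        var[:] = data

# hypothetical field on a 160 x 320 placeholder grid (the N80 full grid has 160 latitude rows)
lats = np.linspace(-89.4, 89.4, 160)
lons = np.arange(0.0, 360.0, 1.125)
write_compressed("t159_field.nc", "t2m", np.random.rand(160, 320).astype("f4"), lats, lons)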
9 Post Processing, Data Selection and Compression (cont'd)
- Nearly all (91) NICAM diagnostic variables were saved. Each variable was saved in 2,560 separate files, one per model domain, resulting in over 230,000 files. The number of files quickly saturated the Lustre file system.
- The original program to interpolate data to a regular lat-lon grid had to be revised to use less I/O and to multithread, thereby eliminating a processing backlog
- Selected 3-D fields were interpolated from z-coordinate to p-coordinate levels (see the sketch below)
- Selected 2-D and 3-D fields were compressed (NetCDF-4) and electronically transferred to COLA
- All selected fields coarsened to the N80 full grid
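- A minimal sketch of the z-to-p interpolation for one field, assuming per-column pressure on the model z-levels is available (the synthetic profile and names are hypothetical; the operational NICAM post-processing code differed):

import numpy as np

def z_to_p(field_z, pressure_z, p_targets):
    # interpolate (nz, ncol) data from model z-levels to target pressure levels,
    # working in log-pressure; levels outside a column's range become NaN
    out = np.full((p_targets.size, field_z.shape[1]), np.nan)
    logp_t = np.log(p_targets)
    for i in range(field_z.shape[1]):
        logp = np.log(pressure_z[:, i])[::-1]    # np.interp needs increasing x
        out[:, i] = np.interp(logp_t, logp, field_z[::-1, i],
                              left=np.nan, right=np.nan)
    return out

# synthetic columns: 40 z-levels, pressure decreasing with height
nz, ncol = 40, 3
z = np.linspace(0.0, 16_000.0, nz)[:, None] * np.ones((1, ncol))
p = 100_000.0 * np.exp(-z / 7_500.0)                 # Pa, rough scale height
T = 288.0 - 6.5e-3 * z                               # K, constant lapse rate
print(z_to_p(T, p, np.array([50_000.0, 25_000.0])))  # ~500 hPa and 250 hPa temperatures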
10 Data Management: NICS
- All data archived to HPSS, approaching 1 PB
- Workflow required complex data movement:
- All model runs at high resolution done on Athena
- Model output stored on scratch or nakji and all copied to tape on HPSS
- IFS data interpolation/truncation done directly from retrieved HPSS files
- NICAM data processed using Verne and nakji (more capable CPUs and larger memory)
11 Data Management: COLA
- Project Athena was allocated 50 TB (26%) on COLA disk servers
- Considerable discussion and judgment were required to down-select variables from IFS and NICAM, based on factors including scientific use and data compressibility
- A large directory structure was needed to organize the data, particularly for IFS, with its many resolutions, sub-resolutions, data forms and ensemble members
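- One way to keep such a hierarchy manageable is a single path template; the sketch below is purely hypothetical, and none of the names reflect the actual COLA conventions:

from pathlib import Path

# every component in this template is hypothetical
TEMPLATE = "{model}/{resolution}/{subgrid}/{form}/{member}/{variable}"

def archive_path(root, **parts):
    # one deterministic location per dataset, so scripts and people agree on where data live
    return Path(root) / TEMPLATE.format(**parts)

print(archive_path("/shared/athena", model="ifs", resolution="T1279",
                   subgrid="N80", form="monthly_means", member="e01",
                   variable="precip"))
# /shared/athena/ifs/T1279/N80/monthly_means/e01/precip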
12 Data Management: Future
- New machines at COLA and NICS will permit further analysis not currently possible due to lack of memory and compute power
- Some or all of the data will eventually be made publicly available, once its long-term disposition is determined
- TeraGrid Science Portal??
- Earth System Grid??
13 Summary
- A large, international team of climate and computer scientists, using dedicated and shared resources, introduces many challenges for production computing, data analysis and data management
- The sheer volume and complexity of the data break everything:
- Disk capacity
- File name space
- Bandwidth connecting systems within NICS
- HPSS tape capacity
- Bandwidth to remote sites for collaborating groups
- Software for analysis and display of results (GrADS modifications)
- COLA overcame these difficulties as they were encountered in 24/7 production mode, preventing the dedicated computer from sitting idle