Title: The Impact of Global Petascale Plans on Geoscience Modeling
1. The Impact of Global Petascale Plans on Geoscience Modeling
- Richard Loft
- SCD Deputy Director for R&D
2. Outline
- Good news/bad news about petascale architectures.
- NSF Petascale Track-1 Solicitation.
- A sample problem description from NSF Track-1.
- NCAR's petascale science response.
- CCSM as a petascale application.
3. Good news: the race to petascale is on, and it is international
- Earth Simulator-2 (MEXT) in Japan is committed to regaining the ES leadership position by 2011.
- DOE is deploying a 1 PFLOPS peak system by 2008.
- Europe has the ENES EVElab program.
- The NSF Track-1 solicitation targets a 2010 system.
- Lots of opportunities for ES modeling on petascale systems, worldwide!
- But how does this impact ES research/application development plans?
4. Bad news: computer architecture is facing significant challenges
- Memory wall: memory speeds are not keeping up with CPU speeds.
- Memory is optimized for density, not speed,
- which causes CPU-to-memory latency, measured in CPU clocks per load, to increase,
- which causes more and more on-chip real estate to be used for cache,
- which causes cache lines to get longer,
- which causes microprocessors to become less forgiving of irregular memory access patterns.
- Microprocessor performance improvement has slowed since 2002, and processors are already 3x below the performance levels projected for 2006 from the pre-2002 historical rate of improvement.
- The key driver is power consumption.
- Future feature-size shrinks will be devoted to more CPUs per chip.
- Rumors at ISCA-33 of Japanese chips with 1024 CPUs per chip.
- Design verification of multi-billion-gate chips, fab-ability, reliability (MTBF), and fault tolerance are becoming serious issues.
- Can we program these things?
5. Best Guess about Architectures in 2010
- 5 GHz looks like an aggressive clock speed for 2010.
- For 8 CPUs per socket (chip), that is about 80 GFLOPS peak per socket.
- 2 PFLOPS peak is then 25,000 sockets with 200,000 CPUs (see the arithmetic sketched after this list).
- The key unknown is which cluster-on-a-chip architecture will be most effective (there are many ways to organize a CMP).
- Vector systems will be around, but at what price?
- Wildcards:
- Impact of DARPA HPCS program architectures.
- Exotics in the wings: MTAs, FPGAs, PIMs, GPUs, etc.
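A minimal Python sketch of the socket-count arithmetic above. The assumption that each CPU retires 2 floating-point operations per clock (e.g., one fused multiply-add) is mine, chosen so that 8 CPUs at 5 GHz reproduces the slide's 80 GFLOPS peak per socket.

    # Rough 2010 machine-size arithmetic from the estimates above. The
    # 2 FLOPs/clock/CPU figure is an assumption, not from the slide.
    clock_ghz = 5.0                 # aggressive 2010 clock speed
    cpus_per_socket = 8
    flops_per_clock = 2             # assumed: e.g., one fused multiply-add per cycle

    gflops_per_socket = clock_ghz * cpus_per_socket * flops_per_clock   # 80 GFLOPS
    target_pflops = 2.0
    sockets = target_pflops * 1e6 / gflops_per_socket                   # 25,000 sockets
    cpus = sockets * cpus_per_socket                                    # 200,000 CPUs

    print(f"{gflops_per_socket:.0f} GFLOPS/socket, "
          f"{sockets:,.0f} sockets, {cpus:,.0f} CPUs")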
6. NSF Track-1 System Background
- Source of funds: the Presidential Innovation Initiative announced in the SOTU.
- Performance goal: 1 PFLOPS sustained on interesting problems.
- Science goal: breakthroughs.
- Use model: 12 research teams per year using the whole system for days or weeks at a time.
- Capability system: large everything, fault tolerant.
- Single system in one location.
- Not a requirement that the machine be upgradable.
7. The NSF Track-1 petascale system solicitation is out: NSF 06-573
8. Track-1 Project Parameters
- Funds: $200M over 4 years, starting FY07.
- Single award.
- Money is for an end-to-end system (as in 625).
- Not intended to fund the facility.
- Release of funds tied to meeting hardware and software milestones.
- Deployment stages:
- Simulator
- Prototype
- Petascale system operates FY10-FY15.
- Operations funds for FY10-15 funded separately.
- Facility costs not included.
9. Two-Stage Award Process Timeline
- Solicitation out: June 6, 2006
- HPCS down-select: July 2006
- Preliminary Proposal due: September 8, 2006
- Down-selection (invitation to 3-4 to write a Full Proposal)
- Full Proposal due: February 2, 2007
- Site visits: Spring 2007
- Award: September 2007
10. NSF's view of the problem
- NSF recognizes the facility (power, cooling, space) challenge of this system.
- NSF recognizes the need for fault-tolerance features.
- NSF recognizes that applications will need significant modification to run on this system.
- NSF expects Track-1 proposers to discuss needs with application experts (many are in this room).
- NSF application support funds: expect a solicitation out in September 2006.
11. Sample benchmark problem (from the Solicitation; a toy sketch of the dealiased spectral building block follows the quote)
- A 12,288-cubed simulation of fully developed homogeneous turbulence in a periodic domain for one eddy turnover time at a value of R_lambda of O(2000). The model problem should be solved using a dealiased, pseudospectral algorithm, a fourth-order explicit Runge-Kutta time-stepping scheme, 64-bit floating point (or similar) arithmetic, and a time-step of 0.0001 eddy turnaround times. Full resolution snapshots of the three-dimensional vorticity, velocity and pressure fields should be saved to disk every 0.02 eddy turnaround times. The target wall-clock time for completion is 40 hours.
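The core kernel of such a solver is the dealiased spectral evaluation of derivatives and nonlinear terms. Below is a toy Python/NumPy sketch at a tiny resolution (nothing like 12,288^3) showing a spectral x-derivative and the 2/3-rule dealiasing mask a full pseudospectral code would apply after forming nonlinear products; the test field and resolution are illustrative assumptions, not part of the benchmark.

    # Toy illustration of the dealiased pseudospectral building block, at a
    # tiny resolution (the benchmark uses N = 12,288). Shown: a spectral
    # x-derivative plus the 2/3-rule mask used to dealias nonlinear products.
    import numpy as np

    n = 64                                   # toy grid, not the benchmark size
    x = 2 * np.pi * np.arange(n) / n
    # Illustrative periodic field u(x, y, z) = sin(x) * cos(2y)
    u = np.sin(x)[:, None, None] * np.cos(2 * x)[None, :, None] * np.ones((1, 1, n))

    kx = np.fft.fftfreq(n, d=1.0 / n)        # integer wavenumbers on a 2*pi box
    dealias = np.abs(kx) <= n // 3           # 2/3-rule cutoff along one axis

    uhat = np.fft.fftn(u)
    dudx_hat = 1j * kx[:, None, None] * uhat             # d/dx in spectral space
    dudx_hat *= dealias[:, None, None]                    # zero the aliased upper third
    dudx = np.real(np.fft.ifftn(dudx_hat))                # back to grid space

    # For this smooth, low-wavenumber field the derivative is exact.
    expected = np.cos(x)[:, None, None] * np.cos(2 * x)[None, :, None]
    assert np.allclose(dudx, expected)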
12. Back-of-the-envelope calculations for the turbulence example (reproduced in the sketch after this list)
- N = 12,288
- One 64-bit variable = 14.8 TB of memory
- 3 time levels x 5 variables (u, v, w, P, vorticity) = 222 TB of memory
- File output every 0.02 eddy turnover times = 74 TB/snapshot
- Total output in a 40-hour run = 3.7 PB
- I/O BW >> 256 GB/sec
- 3 x (N x N x N x 65) = 361.8 TFLOP/field/step
- Assume 4 variables (u, v, w, P): 1.447 PFLOP/step
- Must average 14.4 seconds/step
- Mean FLOP rate = 100 TFLOPS (what a coincidence)
- Real FLOP rate >> 100 TFLOPS because of I/O and communications overhead.
- Must communicate 3 x 8 x 4 x N x N x N = 178 TB per timestep
- Aggregate network BW >> 12.4 TB/sec
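A minimal Python sketch reproducing the estimates above; the 65 FLOP-per-point-per-field factor and the 3 x 8 x 4 x N^3-byte communication volume are taken from the slide as given, not independently derived.

    # Back-of-the-envelope resource estimates for the 12,288^3 benchmark.
    # The per-point FLOP count (65) and the communication-volume factor
    # (3 * 8 * 4 * N^3 bytes) are taken from the slide, not re-derived here.
    N = 12_288
    points = N ** 3                                   # ~1.86e12 grid points
    word = 8                                          # bytes per 64-bit value

    field_tb = points * word / 1e12                   # one variable: ~14.8 TB
    state_tb = 3 * 5 * field_tb                       # 3 levels x 5 variables: ~222 TB
    snapshot_tb = 5 * field_tb                        # u, v, w, P, vorticity: ~74 TB

    dt = 0.0001                                       # step size (eddy turnover times)
    steps = round(1.0 / dt)                           # 10,000 steps per turnover time
    snapshots = round(1.0 / 0.02)                     # 50 snapshots over the run
    output_pb = snapshots * snapshot_tb / 1e3         # ~3.7 PB written in 40 hours

    sec_per_step = 40 * 3600 / steps                  # ~14.4 s/step for 40-hour target
    flop_per_step = 4 * 3 * points * 65               # 4 fields: ~1.447 PFLOP/step
    mean_tflops = flop_per_step / sec_per_step / 1e12 # ~100 TFLOPS sustained

    comm_tb = 3 * 8 * 4 * points / 1e12               # ~178 TB moved per step
    net_tbs = comm_tb / sec_per_step                  # ~12.4 TB/s aggregate network

    print(f"{field_tb:.1f} TB/field, {state_tb:.0f} TB state, "
          f"{snapshot_tb:.0f} TB/snapshot, {output_pb:.1f} PB total output")
    print(f"{sec_per_step:.1f} s/step, {mean_tflops:.0f} TFLOPS mean, "
          f"{comm_tb:.0f} TB/step, {net_tbs:.1f} TB/s network")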
13. Turbulence problem: system resource estimates (per-socket arithmetic sketched after this list)
- >> 5 GFLOPS sustained per socket.
- Computations must scale well on chip.
- 10 GFLOPS sustained is probably more realistic.
- Probably doable with optimized RFFT calls (FFTW).
- > 8 GB memory/socket (1-2 GB/CPU).
- >> 0.5 GB/sec/socket sustained system bisection BW
- for a 100,000-byte message.
- Realistically ~2 GB/sec/socket.
- >> 670 disk spindles saturated during I/O.
- Realistically ~2k-4k disks.
- Multi-petabyte RAID.
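A rough Python sketch of where these per-socket numbers come from, dividing the aggregate estimates of the previous slide across the ~25,000 sockets assumed earlier; the per-spindle streaming rate is simply the value implied by the slide's ~670-spindle figure, an assumption rather than a measured number.

    # Per-socket requirements obtained by dividing the aggregate estimates over
    # the ~25,000-socket / 200,000-CPU machine assumed earlier.
    sockets = 25_000

    mean_tflops = 100.0        # sustained compute rate (previous slide)
    net_tb_s = 12.4            # aggregate network bandwidth (previous slide)
    io_gb_s = 256.0            # aggregate I/O bandwidth (previous slide)
    state_tb = 222.0           # resident model state (previous slide)

    gflops_per_socket = mean_tflops * 1e3 / sockets    # ~4 GFLOPS; >5 with overheads
    mem_gb_per_socket = state_tb * 1e3 / sockets       # ~9 GB of state per socket
    net_gb_per_socket = net_tb_s * 1e3 / sockets       # ~0.5 GB/s bisection BW/socket
    spindle_mb_s = io_gb_s / 670 * 1e3                 # rate implied by ~670 spindles

    print(f"{gflops_per_socket:.1f} GFLOPS, {mem_gb_per_socket:.1f} GB, "
          f"{net_gb_per_socket:.2f} GB/s per socket; "
          f"670 spindles imply ~{spindle_mb_s:.0f} MB/s per disk")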
14. UCAR Peta-Process
- Define UCAR petascale science goals.
- Develop system requirements.
- Make these available to Track-1 proposal writers.
- Define application development resource requirements.
- Fold these into proposals for additional resources.
15. Peta-process details
- A few strategic, specific and realistic science goals for exploiting petascale systems. What are the killer apps?
- The CI requirements (system attributes, data archive footprint, etc.) of the entire toolchain for the science workflow of each.
- A mechanism to provide part of this information to all the consortia competing for the $200M.
- The project retasking required to ultimately write viable proposals for time on petascale systems over the next 4 years.
- Resource requirements for:
- Staff augmentations
- Local equipment infrastructure enhancements
- Build university collaborations to support this effort.
16. Relevant Science Areas (from the NSF Track-1 Solicitation)
- The detailed structure of, and the nature of intermittency in, stratified and unstratified, rotating and non-rotating turbulence in classical and magnetic fluids, and in chemically reacting mixtures
- The nonlinear interactions between cloud systems, weather systems and the Earth's climate
- The dynamics of the Earth's coupled carbon, nitrogen and hydrologic cycles
- The decadal dynamics of the hydrology of large river basins
- The onset of coronal mass ejections and their interaction with the Earth's magnetic field, including modeling magnetic reconnection and geo-magnetic sub-storms
- The coupled dynamics of marine and terrestrial ecosystems and oceanic and atmospheric physics
- The interaction between chemical reactivity and fluid dynamics in complex systems such as combustion, atmospheric chemistry, and chemical processing.
17. Peta-science Ideas (slide 1 of 2)
- Topic 1. Across-scale modeling: simulation of the 21st century climate with a coupled atmosphere-ocean model at 0.1 degree resolution (eddy resolving in the ocean). For specific time periods of the integration, shorter-time simulations with higher spatial resolution: 1 km with a nonhydrostatic global atmospheric model and 100 m resolution in a nested regional model. Emphasis will be put on the explicit representation of moist turbulence, convection and the hydrological cycle.
- Topic 2. Interactions between atmospheric layers and the response of the atmosphere to solar variability: simulations of the atmospheric response to 10-15 solar cycles, using a high-resolution version of WACCM (with explicit simulation of the QBO) coupled to an ocean model.
18. Peta-science Ideas (slide 2 of 2)
- Topic 3. Simulation of chemical weather: a high-resolution chemical/dynamical model with a rather detailed chemical scheme. Study of global air pollution and the impact of mega-cities, wildfires, and other pollution sources.
- Topic 4. Solar MHD: a high-resolution model of turbulence in the solar photosphere.
19. Will CCSM qualify on the Track-1 system? Maybe!
- This system is designed to be 10x bigger than the Track-2 systems being funded with OCI money at $30M apiece over the next 5 years.
- Therefore, a case can be made for a qualifying application that could run on at least 10-25% of the system, even as an ensemble, in order to justify needing the resource.
- This means a good target for one instance of CCSM is 10,000 to 50,000 processors.
- John Dennis (with Bryan and Jones) has done some work on POP 2.1 at 0.1 degree resolution that looks encouraging at 30,000 CPUs.
20. Some petascale application dos and don'ts
- Don't assume the node is a rack-mounted server:
- No local disk drive.
- No full kernel; beware of relying on system calls.
- No serialization of memory or I/O:
- Global arrays read in (on one node) and then distributed.
- Serialized I/O of any kind (i.e., through node 0) will be a show stopper (see the parallel I/O sketch after this list).
- Multiple-executable applications may be problematic.
- Eschew giant look-up tables.
- No unaddressed load imbalances:
- Dynamic load-balancing schemes.
- Communication overlapping.
- Algorithms with irregular memory accesses will perform poorly.
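To illustrate the serialized-I/O point, here is a minimal mpi4py sketch (not from the original slides) in which every task writes its own block of a distributed field with collective MPI-IO rather than funneling data through node 0. The file name, decomposition and sizes are illustrative assumptions, and a production code would more likely get the same effect through a parallel I/O library such as pNetCDF.

    # Each MPI task writes its own block of a 1-D-decomposed field with
    # collective MPI-IO, so no single node ever holds or funnels the whole
    # array. File name and sizes are illustrative only.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    nlocal = 1_000_000                                # points owned by this task

    local = np.full(nlocal, rank, dtype=np.float64)   # this task's slab of the field

    fh = MPI.File.Open(comm, "field.bin",
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    offset = rank * local.nbytes                      # byte offset of this task's block
    fh.Write_at_all(offset, local)                    # collective write, all tasks at once
    fh.Close()

Run the script under mpiexec with several ranks; each rank writes concurrently at its own byte offset, so no single task serializes the output.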
21. Some design questions for discussion
- How can CCSM use different CMP and/or MTA architectures effectively? Consider:
- POWER5 (2 CPUs per chip, 2 threads per CPU)
- Niagara (8 CPUs per chip, 4 threads per CPU)
- Cell (asymmetrical: 1 scalar core plus 8 synergistic processors per chip)
- New coupling strategies?
- Time to scrap multiple executables?
- New ensemble strategies?
- Stacking instances as multiple threads on a CPU (see the thread-count sketch after this list)?
- Will the software we rely on scale?
- MPI
- ESMF
- pNetCDF
- NCL
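To put rough numbers on the thread-stacking question, a small Python sketch of the hardware contexts implied by the three example chips above; the 25,000-socket total is carried over from the earlier architecture guess, and treating every hardware thread (or Cell core) as a slot for one ensemble instance is a simplifying assumption.

    # Hardware contexts per chip for the example CMPs, and how many exist
    # system-wide on a hypothetical 25,000-socket machine. Counting each Cell
    # core as one context is a simplification; SPEs are not general-purpose.
    sockets = 25_000

    chips = {
        "POWER5":  2 * 2,   # 2 CPUs/chip x 2 SMT threads/CPU
        "Niagara": 8 * 4,   # 8 CPUs/chip x 4 threads/CPU
        "Cell":    1 + 8,   # 1 scalar core + 8 synergistic processors
    }

    for name, per_chip in chips.items():
        print(f"{name:8s}: {per_chip:2d} contexts/chip, "
              f"{per_chip * sockets:,} system-wide")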