Title: The Impact of Global Petascale Plans on Geoscience Modeling
1. The Impact of Global Petascale Plans on Geoscience Modeling
- Richard Loft
- SCD Deputy Director for R&D
2. Outline
- Good news/bad news about petascale architectures.
- NSF Petascale Track-1 Solicitation.
- A sample problem description from NSF Track-1.
- NCAR's petascale science response.
- CCSM as a petascale application.
3. Good news: the race to petascale is on, and it is international
- Earth Simulator-2 (MEXT) in Japan is committed to regaining the ES leadership position by 2011.
- DOE is deploying a 1 PFLOPS peak system by 2008.
- Europe has the ENES EVElab program.
- The NSF Track-1 solicitation targets a 2010 system.
- Lots of opportunities for ES modeling on petascale systems, worldwide!
- But how does this impact ES research/application development plans?
4. Bad news: computer architecture is facing significant challenges
- Memory wall: memory speeds are not keeping up with CPU speeds.
- Memory is optimized for density, not speed,
- which causes CPU-to-memory latency, measured in CPU clocks per load, to increase,
- which causes more and more on-chip real estate to be used for cache,
- which causes cache lines to get longer,
- which causes microprocessors to become less forgiving of irregular memory access patterns.
- Microprocessor performance improvement has slowed since 2002, and processors are already 3x below the performance levels projected for 2006 from the pre-2002 historical rate of improvement.
- The key driver is power consumption.
- Future feature-size shrinks will be devoted to more CPUs per chip.
- Rumors at ISCA-33 of Japanese chips with 1024 CPUs per chip.
- Design verification of multi-billion-gate chips, fab-ability, reliability (MTBF), and fault tolerance are becoming serious issues.
- Can we program these things?
5. Best Guess about Architectures in 2010
- 5 GHz looks like an aggressive clock speed for 2010.
- For 8 CPUs per socket (chip), that is about 80 GFLOPS peak per socket.
- 2 PFLOPS peak is then 25,000 sockets with 200,000 CPUs (see the arithmetic sketched after this list).
- The key unknown is which cluster-on-a-chip architecture will be most effective (there are many ways to organize a CMP).
- Vector systems will be around, but at what price?
- Wildcards:
- Impact of DARPA HPCS program architectures.
- Exotics in the wings: MTAs, FPGAs, PIMs, GPUs, etc.
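A minimal Python sketch of the socket-count arithmetic above. The assumption that each CPU retires 2 floating-point operations per clock (e.g., one fused multiply-add) is mine, chosen so that 8 CPUs at 5 GHz reproduces the slide's 80 GFLOPS peak per socket.

    # Rough 2010 machine-size arithmetic from the estimates above. The
    # 2 FLOPs/clock/CPU figure is an assumption, not from the slide.
    clock_ghz = 5.0                 # aggressive 2010 clock speed
    cpus_per_socket = 8
    flops_per_clock = 2             # assumed: e.g., one fused multiply-add per cycle

    gflops_per_socket = clock_ghz * cpus_per_socket * flops_per_clock   # 80 GFLOPS
    target_pflops = 2.0
    sockets = target_pflops * 1e6 / gflops_per_socket                   # 25,000 sockets
    cpus = sockets * cpus_per_socket                                    # 200,000 CPUs

    print(f"{gflops_per_socket:.0f} GFLOPS/socket, "
          f"{sockets:,.0f} sockets, {cpus:,.0f} CPUs")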
6. NSF Track-1 System Background
- Source of funds: the Presidential Innovation Initiative announced in the SOTU.
- Performance goal: 1 PFLOPS sustained on interesting problems.
- Science goal: breakthroughs.
- Use model: 12 research teams per year using the whole system for days or weeks at a time.
- Capability system: large everything, fault tolerant.
- Single system in one location.
- Not a requirement that the machine be upgradable.
7. The NSF Track-1 petascale system solicitation is out: NSF 06-573
8. Track-1 Project Parameters
- Funds: $200M over 4 years, starting FY07.
- Single award.
- Money is for an end-to-end system (as in 625).
- Not intended to fund the facility.
- Release of funds tied to meeting hardware and software milestones.
- Deployment stages:
- Simulator
- Prototype
- Petascale system operates FY10-FY15.
- Operations funds for FY10-15 funded separately.
- Facility costs not included.
9. Two-Stage Award Process Timeline
- Solicitation out: June 6, 2006
- HPCS down-select: July 2006
- Preliminary Proposal due: September 8, 2006
- Down-selection (invitation to 3-4 to write a Full Proposal)
- Full Proposal due: February 2, 2007
- Site visits: Spring 2007
- Award: September 2007
10. NSF's view of the problem
- NSF recognizes the facility (power, cooling, space) challenge of this system.
- NSF recognizes the need for fault-tolerance features.
- NSF recognizes that applications will need significant modification to run on this system.
- NSF expects Track-1 proposers to discuss needs with application experts (many are in this room).
- NSF application support funds: expect a solicitation out in September 2006.
11. Sample benchmark problem (from the Solicitation; a toy sketch of the dealiased spectral building block follows the quote)
- A 12,288-cubed simulation of fully developed homogeneous turbulence in a periodic domain for one eddy turnover time at a value of R_lambda of O(2000). The model problem should be solved using a dealiased, pseudospectral algorithm, a fourth-order explicit Runge-Kutta time-stepping scheme, 64-bit floating point (or similar) arithmetic, and a time-step of 0.0001 eddy turnaround times. Full resolution snapshots of the three-dimensional vorticity, velocity and pressure fields should be saved to disk every 0.02 eddy turnaround times. The target wall-clock time for completion is 40 hours.
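The core kernel of such a solver is the dealiased spectral evaluation of derivatives and nonlinear terms. Below is a toy Python/NumPy sketch at a tiny resolution (nothing like 12,288^3) showing a spectral x-derivative and the 2/3-rule dealiasing mask a full pseudospectral code would apply after forming nonlinear products; the test field and resolution are illustrative assumptions, not part of the benchmark.

    # Toy illustration of the dealiased pseudospectral building block, at a
    # tiny resolution (the benchmark uses N = 12,288). Shown: a spectral
    # x-derivative plus the 2/3-rule mask used to dealias nonlinear products.
    import numpy as np

    n = 64                                   # toy grid, not the benchmark size
    x = 2 * np.pi * np.arange(n) / n
    # Illustrative periodic field u(x, y, z) = sin(x) * cos(2y)
    u = np.sin(x)[:, None, None] * np.cos(2 * x)[None, :, None] * np.ones((1, 1, n))

    kx = np.fft.fftfreq(n, d=1.0 / n)        # integer wavenumbers on a 2*pi box
    dealias = np.abs(kx) <= n // 3           # 2/3-rule cutoff along one axis

    uhat = np.fft.fftn(u)
    dudx_hat = 1j * kx[:, None, None] * uhat             # d/dx in spectral space
    dudx_hat *= dealias[:, None, None]                    # zero the aliased upper third
    dudx = np.real(np.fft.ifftn(dudx_hat))                # back to grid space

    # For this smooth, low-wavenumber field the derivative is exact.
    expected = np.cos(x)[:, None, None] * np.cos(2 * x)[None, :, None]
    assert np.allclose(dudx, expected)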
12. Back-of-the-envelope calculations for the turbulence example (reproduced in the sketch after this list)
- N = 12,288
- One 64-bit variable = 14.8 TB of memory
- 3 time levels x 5 variables (u, v, w, P, vorticity) = 222 TB of memory
- File output every 0.02 eddy turnover times = 74 TB/snapshot
- Total output in a 40-hour run = 3.7 PB
- I/O BW >> 256 GB/sec
- 3 x (N x N x N x 65) = 361.8 TFLOP/field/step
- Assume 4 variables (u, v, w, P): 1.447 PFLOP/step
- Must average 14.4 seconds/step
- Mean FLOP rate = 100 TFLOPS (what a coincidence)
- Real FLOP rate >> 100 TFLOPS because of I/O and communications overhead.
- Must communicate 3 x 8 x 4 x N x N x N = 178 TB per timestep
- Aggregate network BW >> 12.4 TB/sec
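A minimal Python sketch reproducing the estimates above; the 65 FLOP-per-point-per-field factor and the 3 x 8 x 4 x N^3-byte communication volume are taken from the slide as given, not independently derived.

    # Back-of-the-envelope resource estimates for the 12,288^3 benchmark.
    # The per-point FLOP count (65) and the communication-volume factor
    # (3 * 8 * 4 * N^3 bytes) are taken from the slide, not re-derived here.
    N = 12_288
    points = N ** 3                                   # ~1.86e12 grid points
    word = 8                                          # bytes per 64-bit value

    field_tb = points * word / 1e12                   # one variable: ~14.8 TB
    state_tb = 3 * 5 * field_tb                       # 3 levels x 5 variables: ~222 TB
    snapshot_tb = 5 * field_tb                        # u, v, w, P, vorticity: ~74 TB

    dt = 0.0001                                       # step size (eddy turnover times)
    steps = round(1.0 / dt)                           # 10,000 steps per turnover time
    snapshots = round(1.0 / 0.02)                     # 50 snapshots over the run
    output_pb = snapshots * snapshot_tb / 1e3         # ~3.7 PB written in 40 hours

    sec_per_step = 40 * 3600 / steps                  # ~14.4 s/step for 40-hour target
    flop_per_step = 4 * 3 * points * 65               # 4 fields: ~1.447 PFLOP/step
    mean_tflops = flop_per_step / sec_per_step / 1e12 # ~100 TFLOPS sustained

    comm_tb = 3 * 8 * 4 * points / 1e12               # ~178 TB moved per step
    net_tbs = comm_tb / sec_per_step                  # ~12.4 TB/s aggregate network

    print(f"{field_tb:.1f} TB/field, {state_tb:.0f} TB state, "
          f"{snapshot_tb:.0f} TB/snapshot, {output_pb:.1f} PB total output")
    print(f"{sec_per_step:.1f} s/step, {mean_tflops:.0f} TFLOPS mean, "
          f"{comm_tb:.0f} TB/step, {net_tbs:.1f} TB/s network")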
13. Turbulence problem: system resource estimates (per-socket arithmetic sketched after this list)
- >> 5 GFLOPS sustained per socket.
- Computations must scale well on chip.
- 10 GFLOPS sustained is probably more realistic.
- Probably doable with optimized RFFT calls (FFTW).
- > 8 GB memory/socket (1-2 GB/CPU).
- >> 0.5 GB/sec/socket sustained system bisection BW
- for a 100,000-byte message.
- Realistically ~2 GB/sec/socket.
- >> 670 disk spindles saturated during I/O.
- Realistically ~2k-4k disks.
- Multi-petabyte RAID.
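A rough Python sketch of where these per-socket numbers come from, dividing the aggregate estimates of the previous slide across the ~25,000 sockets assumed earlier; the per-spindle streaming rate is simply the value implied by the slide's ~670-spindle figure, an assumption rather than a measured number.

    # Per-socket requirements obtained by dividing the aggregate estimates over
    # the ~25,000-socket / 200,000-CPU machine assumed earlier.
    sockets = 25_000

    mean_tflops = 100.0        # sustained compute rate (previous slide)
    net_tb_s = 12.4            # aggregate network bandwidth (previous slide)
    io_gb_s = 256.0            # aggregate I/O bandwidth (previous slide)
    state_tb = 222.0           # resident model state (previous slide)

    gflops_per_socket = mean_tflops * 1e3 / sockets    # ~4 GFLOPS; >5 with overheads
    mem_gb_per_socket = state_tb * 1e3 / sockets       # ~9 GB of state per socket
    net_gb_per_socket = net_tb_s * 1e3 / sockets       # ~0.5 GB/s bisection BW/socket
    spindle_mb_s = io_gb_s / 670 * 1e3                 # rate implied by ~670 spindles

    print(f"{gflops_per_socket:.1f} GFLOPS, {mem_gb_per_socket:.1f} GB, "
          f"{net_gb_per_socket:.2f} GB/s per socket; "
          f"670 spindles imply ~{spindle_mb_s:.0f} MB/s per disk")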
14. UCAR Peta-Process
- Define UCAR petascale science goals.
- Develop system requirements.
- Make these available to Track-1 proposal writers.
- Define application development resource requirements.
- Fold these into proposals for additional resources.
15. Peta-process details
- A few strategic, specific and realistic science goals for exploiting petascale systems. What are the killer apps?
- The CI requirements (system attributes, data archive footprint, etc.) of the entire toolchain for the science workflow of each.
- A mechanism to provide part of this information to all the consortia competing for the $200M.
- The project retasking required to ultimately write viable proposals for time on petascale systems over the next 4 years.
- Resource requirements for:
- Staff augmentations
- Local equipment infrastructure enhancements
- Build university collaborations to support this effort.
16. Relevant Science Areas (from the NSF Track-1 Solicitation)
- The detailed structure of, and the nature of intermittency in, stratified and unstratified, rotating and non-rotating turbulence in classical and magnetic fluids, and in chemically reacting mixtures
- The nonlinear interactions between cloud systems, weather systems and the Earth's climate
- The dynamics of the Earth's coupled carbon, nitrogen and hydrologic cycles
- The decadal dynamics of the hydrology of large river basins
- The onset of coronal mass ejections and their interaction with the Earth's magnetic field, including modeling magnetic reconnection and geo-magnetic sub-storms
- The coupled dynamics of marine and terrestrial ecosystems and oceanic and atmospheric physics
- The interaction between chemical reactivity and fluid dynamics in complex systems such as combustion, atmospheric chemistry, and chemical processing.
17. Peta-science Ideas (slide 1 of 2)
- Topic 1. Across-scale modeling: simulation of the 21st century climate with a coupled atmosphere-ocean model at 0.1 degree resolution (eddy resolving in the ocean). For specific time periods of the integration, shorter-time simulations with higher spatial resolution: 1 km with a nonhydrostatic global atmospheric model and 100 m resolution in a nested regional model. Emphasis will be put on the explicit representation of moist turbulence, convection and the hydrological cycle.
- Topic 2. Interactions between atmospheric layers and the response of the atmosphere to solar variability: simulations of the atmospheric response to 10-15 solar cycles, using a high-resolution version of WACCM (with explicit simulation of the QBO) coupled to an ocean model.
18. Peta-science Ideas (slide 2 of 2)
- Topic 3. Simulation of chemical weather: a high-resolution chemical/dynamical model with a rather detailed chemical scheme. Study of global air pollution and the impact of mega-cities, wildfires, and other pollution sources.
- Topic 4. Solar MHD: a high-resolution model of turbulence in the solar photosphere.
19. Will CCSM qualify on the Track-1 system? Maybe!
- This system is designed to be 10x bigger than the Track-2 systems being funded with OCI money at $30M apiece over the next 5 years.
- Therefore, a case can be made for a qualifying application that could run on at least 10-25% of the system, even as an ensemble, in order to justify needing the resource.
- This means a good target for one instance of CCSM is 10,000 to 50,000 processors.
- John Dennis (with Bryan and Jones) has done some work on POP 2.1 at 0.1 degree resolution that looks encouraging at 30,000 CPUs.
20. Some petascale application dos and don'ts
- Don't assume the node is a rack-mounted server:
- No local disk drive.
- No full kernel; beware of relying on system calls.
- No serialization of memory or I/O:
- Global arrays read in (on one node) and then distributed.
- Serialized I/O of any kind (i.e., through node 0) will be a show stopper (see the parallel I/O sketch after this list).
- Multiple-executable applications may be problematic.
- Eschew giant look-up tables.
- No unaddressed load imbalances:
- Dynamic load-balancing schemes.
- Communication overlapping.
- Algorithms with irregular memory accesses will perform poorly.
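To illustrate the serialized-I/O point, here is a minimal mpi4py sketch (not from the original slides) in which every task writes its own block of a distributed field with collective MPI-IO rather than funneling data through node 0. The file name, decomposition and sizes are illustrative assumptions, and a production code would more likely get the same effect through a parallel I/O library such as pNetCDF.

    # Each MPI task writes its own block of a 1-D-decomposed field with
    # collective MPI-IO, so no single node ever holds or funnels the whole
    # array. File name and sizes are illustrative only.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    nlocal = 1_000_000                                # points owned by this task

    local = np.full(nlocal, rank, dtype=np.float64)   # this task's slab of the field

    fh = MPI.File.Open(comm, "field.bin",
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    offset = rank * local.nbytes                      # byte offset of this task's block
    fh.Write_at_all(offset, local)                    # collective write, all tasks at once
    fh.Close()

Run the script under mpiexec with several ranks; each rank writes concurrently at its own byte offset, so no single task serializes the output.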
21. Some design questions for discussion
- How can CCSM use different CMP and/or MTA architectures effectively? Consider:
- POWER5 (2 CPUs per chip, 2 threads per CPU)
- Niagara (8 CPUs per chip, 4 threads per CPU)
- Cell (asymmetrical: 1 scalar core plus 8 synergistic processors per chip)
- New coupling strategies?
- Time to scrap multiple executables?
- New ensemble strategies?
- Stacking instances as multiple threads on a CPU (see the thread-count sketch after this list)?
- Will the software we rely on scale?
- MPI
- ESMF
- pNetCDF
- NCL
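To put rough numbers on the thread-stacking question, a small Python sketch of the hardware contexts implied by the three example chips above; the 25,000-socket total is carried over from the earlier architecture guess, and treating every hardware thread (or Cell core) as a slot for one ensemble instance is a simplifying assumption.

    # Hardware contexts per chip for the example CMPs, and how many exist
    # system-wide on a hypothetical 25,000-socket machine. Counting each Cell
    # core as one context is a simplification; SPEs are not general-purpose.
    sockets = 25_000

    chips = {
        "POWER5":  2 * 2,   # 2 CPUs/chip x 2 SMT threads/CPU
        "Niagara": 8 * 4,   # 8 CPUs/chip x 4 threads/CPU
        "Cell":    1 + 8,   # 1 scalar core + 8 synergistic processors
    }

    for name, per_chip in chips.items():
        print(f"{name:8s}: {per_chip:2d} contexts/chip, "
              f"{per_chip * sockets:,} system-wide")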