Title: Recent Developments for Parallel CMAQ
1. Recent Developments for Parallel CMAQ
- Jeff Young, AMDB/ASMD/ARL/NOAA
- David Wong, SAIC, NESCC/EPA
2. AQF-CMAQ
- Running in quasi-operational mode at NCEP
- 26 minutes for a 48-hour forecast (on 33 processors)
3. Code modifications to improve data locality
- Some vdiff optimization
- Jerry Gipson's MEBI/EBI chem solver
- Tried CGRID( SPC, LAYER, COLUMN, ROW ); tests indicated this ordering is not good for MPI communication of data
- Some background: MPI data communication and ghost (halo) regions
4. Ghost (Halo) Regions

      DO J = 1, N
         DO I = 1, M
            DATA( I,J ) = A( I+2,J ) + A( I,J-1 )   ! neighbor offsets reach into the ghost region
         END DO
      END DO
5. Horizontal Advection and Diffusion Data Requirements

   [Figure: the decomposed domain with numbered processor subdomains, a ghost region, the exterior boundary, and the stencil exchange data communication function]
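The stencil exchange that fills these ghost regions amounts to a paired send/receive of boundary slabs with each neighboring processor. The following is a minimal, hedged Fortran/MPI sketch of one east-west exchange of a single ghost column; it is not CMAQ's stencil exchange (stenex) library, and the patch size, tags, and one-cell-deep halo are assumptions:

      ! Minimal halo-exchange sketch; NOT CMAQ's stencil exchange code.
      ! Assumes a 1-D east-west decomposition with a one-column ghost region.
      PROGRAM HALO_SKETCH
         USE MPI
         IMPLICIT NONE
         INTEGER, PARAMETER :: NROWS = 142, MYCOLS = 21   ! assumed local patch size
         REAL    :: A( NROWS, 0:MYCOLS+1 )                ! columns 0 and MYCOLS+1 are ghost cells
         INTEGER :: MYPE, NPE, EAST, WEST, IERR
         INTEGER :: STATUS( MPI_STATUS_SIZE )

         CALL MPI_INIT( IERR )
         CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYPE, IERR )
         CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPE, IERR )

         EAST = MYPE + 1
         IF ( EAST .GE. NPE ) EAST = MPI_PROC_NULL        ! no neighbor beyond the exterior boundary
         WEST = MYPE - 1
         IF ( WEST .LT. 0 ) WEST = MPI_PROC_NULL

         A = REAL( MYPE )                                 ! stands in for real concentration data

         ! send my easternmost real column, receive into my west ghost column
         CALL MPI_SENDRECV( A( :, MYCOLS ), NROWS, MPI_REAL, EAST, 1,    &
                            A( :, 0 ),      NROWS, MPI_REAL, WEST, 1,    &
                            MPI_COMM_WORLD, STATUS, IERR )
         ! send my westernmost real column, receive into my east ghost column
         CALL MPI_SENDRECV( A( :, 1 ),        NROWS, MPI_REAL, WEST, 2,  &
                            A( :, MYCOLS+1 ), NROWS, MPI_REAL, EAST, 2,  &
                            MPI_COMM_WORLD, STATUS, IERR )

         CALL MPI_FINALIZE( IERR )
      END PROGRAM HALO_SKETCH

In the model itself the exchange depth and directions depend on the advection and diffusion stencils, which is what the figure above illustrates.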
6. Architectural Changes for Parallel I/O
- Tests confirmed that the latest releases (May, Sep 2003) are not very scalable
- Background: Original Version, Latest Release, AQF Version
7. Parallel I/O

   [Diagram of processor roles in each version]
- 2002 Release: computation only; read, write and computation; read and computation
- 2003 Release: write and computation; write only (asynchronous)
- AQF: data transfer by message passing
8. Modifications to the I/O API for Parallel I/O
   For 3D data, e.g.:

   Time interpolation:

      INTERP3 ( FileName, VarName, ProgName,
                Date, Time,
                Ncols*Nrows*Nlays,
                Data_Buffer )

   Spatial subset:

      XTRACT3 ( FileName, VarName,
                StartLay, EndLay,
                StartRow, EndRow,
                StartCol, EndCol,
                Date, Time,
                Data_Buffer )

      INTERPX ( FileName, VarName, ProgName,
                StartCol, EndCol,
                StartRow, EndRow,
                StartLay, EndLay,
                Date, Time,
                Data_Buffer )

      WRPATCH ( FileID, VarID,
                TimeStamp,
                Record_No )
      Called from PWRITE3 (pario)
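To see why INTERPX matters for the parallel code: each worker can time-interpolate just its own patch (plus ghost cells) rather than the whole 166 x 142 x 22 domain that an INTERP3 call returns on every processor. The sketch below is illustrative only; the file name, variable name, subroutine name, and patch bounds are placeholders, not values from the AQF system:

      ! Hedged sketch, not the AQF code: the file, variable, and patch bounds
      ! are illustrative placeholders.
      SUBROUTINE READ_LOCAL_PATCH( JDATE, JTIME )
         IMPLICIT NONE
         INTEGER, INTENT( IN ) :: JDATE, JTIME        ! e.g. 2002263, 120000
         LOGICAL, EXTERNAL     :: INTERPX
         REAL :: LOCAL_BUF( 23, 37, 22 )              ! 21 cols + 2 ghosts, 35 rows + 2 ghosts, 22 layers (assumed)

         ! Time-interpolate only this worker's columns and rows (plus ghost
         ! cells) instead of the full domain.
         IF ( .NOT. INTERPX( 'MET_CRO_3D', 'TA', 'READ_LOCAL_PATCH',     &
                             20, 42,                                     &   ! StartCol, EndCol (assumed patch)
                             34, 70,                                     &   ! StartRow, EndRow
                              1, 22,                                     &   ! StartLay, EndLay
                             JDATE, JTIME, LOCAL_BUF ) ) THEN
            CALL M3EXIT( 'READ_LOCAL_PATCH', JDATE, JTIME, 'INTERPX failed', 1 )
         END IF
      END SUBROUTINE READ_LOCAL_PATCH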
9. Standard (May 2003 Release)

   DRIVER
      read ICs into CGRID
      begin output timestep loop
         advstep (determine sync timestep)
         couple
         begin sync timestep loop
            SCIPROC
               X-Y-Z advect, adjadv
               hdiff
               decouple
               vdiff (writes DRYDEP)
               cloud (writes WETDEP)
               gas chem (aero writes VIS)
               couple
         end sync timestep loop
         decouple
         write conc and avg conc (CONC, ACONC)
      end output timestep loop
10. AQF CMAQ

   DRIVER
      set WORKERs, WRITER
      if WORKER, read ICs into CGRID
      begin output timestep loop
         if WORKER
            advstep (determine sync timestep)
            couple
            begin sync timestep loop
               SCIPROC
                  X-Y-Z advect
                  hdiff
                  decouple
                  vdiff
                  cloud
                  gas chem (aero)
                  couple
            end sync timestep loop
            decouple
            MPI send conc, aconc, drydep, wetdep, (vis)
         if WRITER
            completion-wait for conc, write conc (CONC)
            for aconc, write aconc, etc.
      end output timestep loop
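A minimal sketch of the worker/writer split outlined above, assuming one dedicated writer rank, a blocking send, and a single concentration field per output step (none of which is taken from the AQF DRIVER):

      ! Minimal worker/writer sketch; NOT the AQF DRIVER. Patch size, tags,
      ! and the single transferred field are assumptions for illustration.
      PROGRAM WORKER_WRITER_SKETCH
         USE MPI
         IMPLICIT NONE
         INTEGER, PARAMETER :: NCELLS = 21 * 35 * 22     ! one worker's conc patch (assumed)
         INTEGER :: MYPE, NPE, WRITER, IERR, STEP, PE
         INTEGER :: STATUS( MPI_STATUS_SIZE )
         REAL    :: CONC( NCELLS )

         CALL MPI_INIT( IERR )
         CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYPE, IERR )
         CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPE, IERR )
         WRITER = NPE - 1                                ! last rank is the dedicated writer

         DO STEP = 1, 24                                 ! output timestep loop
            IF ( MYPE .NE. WRITER ) THEN                 ! WORKER: science, then ship the patch
               CONC = REAL( MYPE + STEP )                ! stands in for SCIPROC
               CALL MPI_SEND( CONC, NCELLS, MPI_REAL, WRITER, STEP,      &
                              MPI_COMM_WORLD, IERR )
            ELSE                                         ! WRITER: gather patches, write them
               DO PE = 0, NPE - 2
                  CALL MPI_RECV( CONC, NCELLS, MPI_REAL, PE, STEP,       &
                                 MPI_COMM_WORLD, STATUS, IERR )
                  ! the real writer would assemble the patches and call the
                  ! pario write (e.g. WRPATCH) here
               END DO
            END IF
         END DO

         CALL MPI_FINALIZE( IERR )
      END PROGRAM WORKER_WRITER_SKETCH

Because the data leave the workers by message passing (slide 7), the workers can proceed to the next output step while the writer handles the file output.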
11. Power3 Cluster (NESCC's LPAR)

   [Diagram: three 16-CPU SP Nighthawk nodes (CPUs 0-15) sharing memory within each node, connected by a switch; labeled 2X slower than NCEP's]
- The platform (cypress00, cypress01, cypress02) consists of 3 SP Nighthawk nodes
- All CPUs share user applications with file servers, interactive use, etc.
12. Power4 p690 Servers (NCEP's LPAR)

   [Diagram: 4-CPU LPAR nodes (CPUs 0-3), each with its own memory, connected by a switch; 2 Colony switch connectors per node, 2X the performance of the cypress nodes]
- Each platform (snow, frost) is composed of 22 p690 (Regatta) servers
- Each server has 32 CPUs, LPAR-ed into 8 nodes per server (4 CPUs per node)
- Some nodes are dedicated to file servers, interactive use, etc.
- There are effectively 20 servers for general use (160 nodes, 640 CPUs)
13. ice: Beowulf Cluster, Pentium 3, 1.4 GHz

   [Diagram: six dual-CPU nodes, each with its own memory, connected by an internal network]
- Isolated from outside network traffic
14. global: MPICH Cluster, Pentium 4 Xeon, 2.4 GHz

   [Diagram: six dual-CPU nodes (global, global1-global5), each with its own memory, connected by a network]
15. RESULTS
- 5 hour (12Z-17Z) and 24 hour (12Z-12Z) runs
- 20 Sept 2002 test data set used for developing the AQF-CMAQ; input met from ETA, processed through PRDGEN and PREMAQ
- 166 columns x 142 rows x 22 layers at 12 km resolution
- Domain seen on the following slides
- CB4 mechanism, no aerosols; Pleim's Yamartino advection for AQF-CMAQ, PPM advection for the May 2003 Release
16. The Matrix

   Worker processors    Decomposition (columns x rows of subdomains)
   4                    2 x 2
   8                    4 x 2
   16                   4 x 4
   32                   8 x 4
   64                   8 x 8
   64                   16 x 4

- comparison of run times
- comparison of relative wall times for main science processes
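The subdomain sizes behind the aspect-ratio slides at the end follow directly from these decompositions. Below is a small sketch of the arithmetic for the 8 x 4 case; how the leftover cells are assigned to particular subdomains is an assumption for illustration:

      ! Sketch of how the 166 x 142 AQF domain splits across an NPCOL x NPROW
      ! worker decomposition; it reproduces the 21/20-column and 36/35-row
      ! patches of the 8 x 4 case shown on the aspect-ratio slides.
      PROGRAM SUBDOMAIN_SIZES
         IMPLICIT NONE
         INTEGER, PARAMETER :: GL_NCOLS = 166, GL_NROWS = 142
         INTEGER, PARAMETER :: NPCOL = 8, NPROW = 4      ! the 32-processor "Matrix" case

         PRINT *, 'columns per subdomain:', GL_NCOLS / NPCOL, 'plus one extra on',  &
                  MOD( GL_NCOLS, NPCOL ), 'of the subdomains'      ! 20 (+1 on 6 of them)
         PRINT *, 'rows per subdomain:   ', GL_NROWS / NPROW, 'plus one extra on',  &
                  MOD( GL_NROWS, NPROW ), 'of the subdomains'      ! 35 (+1 on 2 of them)
      END PROGRAM SUBDOMAIN_SIZES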
17. 24-Hour AQF vs. 2003 Release

   [Figure: side-by-side concentration plots for the 2003 Release and AQF-CMAQ; data shown at peak hour]
18. 24-Hour AQF vs. 2003 Release
- Less than 0.2 ppb difference between AQF on cypress and snow
- Almost 9 ppb max difference between Yamartino and PPM
19. AQF vs. 2003 Release

   [Chart: absolute run times (sec), 24 hours, 8 worker processors]
20. AQF vs. 2003 Release on cypress, and AQF on snow

   [Chart: absolute run times (sec), 24 hours, 32 worker processors]
21. AQF-CMAQ on Various Platforms

   [Chart: relative run times (% of slowest), 5 hours, 8 worker processors]
22. AQF-CMAQ: cypress vs. snow

   [Chart: relative run times (% of slowest), 5 hours]
23. AQF-CMAQ on Various Platforms

   [Chart: relative run times (% of slowest), 5 hours, by number of worker processors]
24. AQF-CMAQ on cypress

   [Chart: relative wall times, 5 hours]
25. AQF-CMAQ on snow

   [Chart: relative wall times, 5 hours]
26. AQF vs. 2003 Release on cypress

   [Chart: relative wall times, 24 hr, 8 worker processors; PPM vs. Yamartino (Yamo) advection]
27. AQF vs. 2003 Release on cypress

   [Chart: relative wall times, 24 hr, 8 worker processors; adds snow for 8 and 32 processors]
28. AQF Horizontal Advection

   [Chart: relative wall times, 24 hr, 8 worker processors]
   Legend: x-r-l = x-row-loop, y-c-l = y-column-loop; -hppm = kernel solver called in each loop
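The x-row-loop and y-column-loop refer to sweeps that call a 1-D kernel solver once per grid row (x direction) or once per grid column (y direction). A hedged sketch of that structure follows; HADV_KERNEL_1D is a hypothetical stand-in for the hppm (or Yamartino) kernel, and its first-order upwind body is only a placeholder:

      ! Sketch of the x-row-loop / y-column-loop structure timed above.
      ! HADV_KERNEL_1D and its upwind body are placeholders, not CMAQ routines.
      SUBROUTINE X_Y_SWEEPS( CONC, U, V, NCOLS, NROWS, DT )
         IMPLICIT NONE
         INTEGER, INTENT( IN )    :: NCOLS, NROWS
         REAL,    INTENT( IN )    :: U( NCOLS, NROWS ), V( NCOLS, NROWS ), DT
         REAL,    INTENT( INOUT ) :: CONC( NCOLS, NROWS )
         INTEGER :: R, C

         DO R = 1, NROWS                 ! x-row-loop: one 1-D solve per grid row
            CALL HADV_KERNEL_1D( CONC( :, R ), U( :, R ), NCOLS, DT )
         END DO

         DO C = 1, NCOLS                 ! y-column-loop: one 1-D solve per grid column
            CALL HADV_KERNEL_1D( CONC( C, : ), V( C, : ), NROWS, DT )
         END DO

      CONTAINS

         SUBROUTINE HADV_KERNEL_1D( Q, WIND, N, DT )
            INTEGER, INTENT( IN )    :: N
            REAL,    INTENT( IN )    :: WIND( N ), DT   ! DT is taken as dt/dx here
            REAL,    INTENT( INOUT ) :: Q( N )
            ! first-order upwind placeholder for the real hppm/Yamartino solver
            Q( 2:N ) = Q( 2:N ) - DT * MAX( WIND( 2:N ), 0.0 )   &
                                * ( Q( 2:N ) - Q( 1:N-1 ) )
         END SUBROUTINE HADV_KERNEL_1D

      END SUBROUTINE X_Y_SWEEPS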
29. AQF Horizontal Advection

   [Chart: relative wall times, 24 hr, 32 worker processors]
30. AQF and Release Horizontal Advection

   [Chart: relative wall times, 24 hr, 32 worker processors; the r- prefix denotes the release version]
31. Future Work
- Add aerosols back in
- TKE vdiff
- Improve horizontal advection/diffusion scalability
- Some I/O improvements
- Layer-variable horizontal advection time steps
32. Aspect Ratios

   [Figure: subdomain dimensions for decompositions of the 166-column x 142-row domain; patch edges of 83, 71, 41/42, 35/36, and 21/20 cells appear]
33. Aspect Ratios

   [Figure: subdomain dimensions for the 64-processor decompositions (8 x 8 and 16 x 4); patch edges of 21/20, 17/18, 10/11, and 35/36 cells appear]