Performance-Portability and the Weather Research and Forecast Model - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Performance-Portability and the Weather Research
and Forecast Model
  • John Michalakes, Richard Loft, Alfred Bourgeois
  • National Center for Atmospheric Research
  • Boulder, Colorado, U.S.A.
  • HPC Asia 2001
    September 25, 2001

2
Outline
  • WRF Model Project
  • Performance portability and why it matters
  • Studies
  • Storage order using whole code
  • Detailed profiling using kernels
  • Conclusions

3
WRF Project Overview
  • Develop advanced model and data-assimilation
    system for mesoscale NWP
  • Accurate, efficient, applicable over a broad
    range of scales
  • Focus on 1-10km resolution
  • Advanced dynamics, physics, data assimilation,
    nesting
  • Flexible, modular, performance-portable
  • Large, multi-institution effort
  • Promote closer ties between research and
    operations
  • Pool resources
  • Milestones and status
  • First release Nov. 2000
  • Current release May 2001
  • Research community version in 2002; full
    operational implementation in 2004
  • Dynamical cores
  • Time-split explicit Eulerian
  • 5th order advection
  • Height and mass coordinate
  • Semi-implicit Semi-Lagrangian
  • Full physics with multiple options for each type
  • 3DVAR, 4DVAR


7
WRF Project Overview
  • Principal Partners
  • NCAR Mesoscale and Microscale Meteorology
    Division
  • NOAA National Centers for Environmental
    Prediction
  • NOAA Forecast Systems Laboratory
  • U. Oklahoma Center for the Analysis and
    Prediction of Storms
  • U.S. Air Force Weather Agency
  • Additional Collaborators
  • NOAA Geophysical Fluid Dynamics Laboratory
  • NASA GSFC Atmospheric Sciences Division
  • NOAA National Severe Storms Laboratory
  • NRL Marine Meteorology Division
  • EPA Atmospheric Modeling Division
  • Aerospace Inc
  • University Community


9
Performance Portability
  • Conflicting concerns
  • Best possible performance
  • Maintainable software
  • What can we control?
  • Run-time and compiler options, libraries
  • Padding, alignment, sizing of arrays
  • Blocking factors, vector length
  • Storage order/loop nesting
  • Crucial issue in WRF design in 1999-2000: can we
    stay portable to vector systems?
  • Needed decision on storage/loop order while code
    still small
  • Trend in U.S. has been away from vector
  • Vector vendors dropping support; however, a
    significant installed base remains outside the U.S.
  • Expedited Storage Order study to support
    decision for WRF

10
WRF storage-order
  • What's the best order of three-dimensional array
    indices and corresponding loop nesting: IJK, KIJ,
    or IKJ?
  • Original study
  • M. Ashworth (Daresbury Laboratory, UK), ECMWF '98
    workshop presentation compared k-inner versus
    i-inner orderings for array dimensions and loop
    nesting in a test kernel.
  • (Optimisation for Vector and RISC Processors,
    in Towards Teracomputing, World Scientific, River
    Edge, NJ. 1999. pp. 353-359.)
  • RISC relatively insensitive (30% penalty)
  • Vector was very sensitive, by up to factors of 4
    on Fujitsu VPP300 and 8 on NEC SX-4
  • Reproduce Ashworth's study with WRF prototype
  • Generated three versions, IJK, KIJ, and IKJ
  • Benchmark various problem sizes on representative
    systems
  • Present results to WRF collaborators
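The storage-order question is ultimately about which loop index walks memory with unit stride. A minimal sketch of the index arithmetic (Python standing in for WRF's Fortran; Fortran is column-major, so the first declared dimension varies fastest; the array sizes here are illustrative, not the benchmarked patch sizes):

```python
# Column-major (Fortran-style) linear offset of element (i, j, k) in
# an array dimensioned (d1, d2, d3): the FIRST index varies fastest.
def offset(idx, dims):
    i, j, k = idx
    d1, d2, d3 = dims
    return i + d1 * (j + d2 * k)

# "IKJ" storage dimensions the arrays as (I, K, J): I is the stride-1
# dimension, so an i-innermost loop walks memory contiguously.
dims_ikj = (80, 41, 80)   # (I, K, J); sizes illustrative
stride_i = offset((1, 0, 0), dims_ikj) - offset((0, 0, 0), dims_ikj)
stride_k = offset((0, 1, 0), dims_ikj) - offset((0, 0, 0), dims_ikj)
print(stride_i)  # 1  -> unit stride: good for cache lines and vectors
print(stride_k)  # 80 -> a k-innermost loop jumps 80 elements per step
```

The same arithmetic explains why vector machines want long stride-1 I runs (IJK or IKJ) while short-X patches leave too little vector length to matter.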

11
IJK versus KIJ on RISC and Vector
  • IJK versus KIJ for all patch dimensions
    X, Y in {21, 41, 81}; 41 levels throughout
  • KIJ is the favored order on RISC; IJK is strongly
    favored on vector for adequate length of X
  • Surprise: vector prefers KIJ for short X (no
    physics in tested prototype)
  • RISC penalty for IJK decreases with increased
    length of minor dimension
  • Penalty is most severe for sizes typical of a DM
    patch


14
IJK and IKJ Penalties versus KIJ (EV6)
IJK vs KIJ
IKJ vs KIJ
  • Penalty for I-innermost is much less severe when
    K is middle dimension
    WRF adopted IKJ ordering in April 2000

15
Detailed Performance Studies
  • Follow-on to WRF storage order experiments to
    determine effects of controllable aspects
  • All MxN subdomain sizes, 2 < M < 80, 2 < N < 80
  • Padding: 0, 1, 2, and 3 extra cells around patch
  • Effect of shared memory tile size
  • Storage order
  • Computational kernel: ADVANCE_W
  • Implicit solver for vertical propagation of
    acoustic modes
  • Compute-intensive, fully 3D vertical recurrence
  • Three versions: IJK, IKJ, and KIJ
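Why storage order matters for this particular kernel: the implicit solver carries a true data dependence in K, so the k loop is serial no matter how the arrays are laid out, while the horizontal i (and j) loops are independent. A hedged sketch (not WRF code; `solve_column` and its coefficients are made up for illustration):

```python
# Hedged sketch (not WRF code): a first-order vertical recurrence of
# the kind ADVANCE_W's implicit acoustic solver performs.  w[k]
# depends on w[k-1], so the k loop cannot be reordered or vectorized;
# loops over the independent i and j columns can be, which is why an
# i-innermost (IKJ) nesting suits this kernel on both cache and
# vector machines.
def solve_column(a, b, nk):
    w = [0.0] * nk
    w[0] = b[0]
    for k in range(1, nk):       # serial: true dependence on level k-1
        w[k] = a[k] * w[k - 1] + b[k]
    return w

print(solve_column([0.0, 0.5, 0.5], [1.0, 1.0, 1.0], 3))
# [1.0, 1.5, 1.75]
```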

16
Detailed Performance Studies
  • Procedure
  • For each processor type
  • Run ADVANCE_W test code over 6400 different
    domain sizes and 4 array padding options
  • Generated detailed performance data using
    hardware performance counters available for
    processor
  • Architectures
  • IBM
  • 375 MHz POWER3, patched version of AIX
  • Hardware performance counters: Cycles, Flops,
    Loads, L1 cache misses, TLB misses, Prefetches,
    etc.
  • Silicon Graphics
  • 250 MHz MIPS R10000
  • Perfex and Speedshop to count Cycles, Flops, L1
    misses, L2 misses, TLB misses, etc.
  • Compaq, Fujitsu, NEC (In progress)

17
IBM Results
  • Overall performance: KIJ is the best storage
    order, but IKJ is nearly as good; IJK is worst
  • Effect of varying subdomain size
  • Most significant is the occurrence of
    pathological TLB behavior (10^6 misses or
    greater) for certain sub-domains
  • Sweet spot: 15 < N < 35
  • Not well correlated with L1 cache behavior (L2
    could not be measured reliably)
  • Prefetching appears more significant
  • Effect of varying tile sizes
  • Any tiling in minor I-dimension always degraded
    performance
  • Very slight benefit to having thin tiles in major
    J-dimension


20
TLB
  • ADVANCE_W has 20 3D arrays
  • What happens when an arbitrary index (I,J,K) is
    touched in every one of these?
  • Assuming arrays are allocated contiguously, the
    touches will be to pages at regular intervals
  • Hopefully these addresses will be scattered
    around memory and evenly distributed over TLB,
    but maybe not
  • Certain array sizes can cause clumping around a
    few entries, and because of the low associativity
    of the IBM TLB, there isn't much margin
  • Translation Look-aside Buffer (TLB)
  • Associates virtual addresses to pages in physical
    memory
  • IBM TLB
  • 256 page-entries
  • 4KB pages
  • 128 two-way set associative banks
  • Least recently used replacement
  • 128-byte cache-lines
  • Miss penalty: 25 to hundreds of cycles
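The clumping argument can be made concrete with the TLB parameters above (4 KB pages, 128 two-way sets). A hedged model; `tlb_sets_touched` and the example strides are hypothetical, not measured WRF layouts:

```python
# Model of the clumping argument.  Touching the same (i, j, k) in 20
# contiguously allocated arrays touches pages a fixed stride apart
# (one whole array).  With 4 KB pages and 128 two-way sets, a stride
# that is a multiple of 128 pages maps every touch into one set,
# which holds only 2 entries -> guaranteed TLB thrashing.
PAGE, SETS, NARRAYS = 4096, 128, 20

def tlb_sets_touched(array_bytes):
    pages = [n * array_bytes // PAGE for n in range(NARRAYS)]
    return {p % SETS for p in pages}

bad  = tlb_sets_touched(128 * PAGE)  # stride a multiple of 128 pages
good = tlb_sets_touched(131 * PAGE)  # stride co-prime to the set count
print(len(bad), len(good))  # 1 20
```

In the bad case all 20 page references compete for one 2-entry set; in the good case they spread over 20 different sets.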


22
Remediation
  • Observe that the pattern does not change with
    array padding; it simply moves
  • Detect subdomain sizes that are liable to TLB
    misses and pad accordingly when the arrays are
    allocated
  • Entirely driver-level
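A driver-level remediation along these lines could be sketched as follows; `pick_padding`, its sufficiency test, and the constants are illustrative assumptions, not the actual WRF driver code:

```python
# Sketch of driver-level padding selection.  Constants model the IBM
# TLB from the previous slide; ELEM assumes 4-byte reals; the "enough
# sets" criterion is an illustrative stand-in for whatever test the
# real driver would apply.
PAGE, SETS, WAYS, NARRAYS, ELEM = 4096, 128, 2, 20, 4

def sets_touched(nx, ny, nz):
    stride = nx * ny * nz * ELEM      # bytes between same-index pages
    return len({(n * stride // PAGE) % SETS for n in range(NARRAYS)})

def pick_padding(nx, ny, nz, max_pad=3):
    # Grow the minor dimension until same-index pages spread over
    # enough sets that WAYS entries per set can hold them all.
    for pad in range(max_pad + 1):
        if sets_touched(nx + pad, ny, nz) * WAYS >= NARRAYS:
            return pad
    return max_pad

print(pick_padding(64, 64, 32))  # 1: one extra cell breaks the pattern
```

Because the check runs when arrays are allocated, it stays entirely in the driver layer, as the slide notes; the numerics never see the padding.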



26
SGI Results
  • Storage order
  • KIJ is the best performer for N < 30; worst for
    N > 45
  • IKJ slightly better than IJK for N < 30
  • IJK is best for N > 45

27
SGI Results
  • Effect of varying subdomain size
  • L2 data cache miss behavior best explains storage
    order behavior

28
SGI Results
  • Effect of varying subdomain size
  • Pathological L1 data cache miss behavior for
    N = 64

29
SGI Results
  • Effect of varying subdomain size
  • TLB behavior interesting but not significant

30
Summary
  • WRF design provides flexible runtime control over
    tile and patch sizes, padding
  • Patch size is a significant factor, and most
    machines exhibit sweet spots
  • Padding can be used to avoid patch sizes that
    result in pathological array-memory layouts on
    the IBM
  • Some aspects of code, such as storage order, are
    rigid
  • Design decisions that can't be easily changed
    should be based on careful study
  • KIJ ordering is overall best for microprocessors
    (but fatal on vector)
  • IKJ ordering is best compromise between vector
    and microprocessors
  • General observations
  • Beyond these relatively high-level aspects,
    trying to outsmart compilers usually doesn't pay
  • Hyper-engineering, even when effective,
    introduces architecture dependencies
  • Obtaining detailed processor performance
    statistics is a problem, and is likely to get
    worse