Performance-Portability and the Weather Research and Forecast Model - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Performance-Portability and the Weather Research
and Forecast Model
  • John Michalakes, Richard Loft, Alfred Bourgeois
  • National Center for Atmospheric Research
  • Boulder, Colorado, U.S.A.
  • HPC Asia 2001
    September 25, 2001

2
Outline
  • WRF Model Project
  • Performance portability and why it matters
  • Studies
  • Storage order using whole code
  • Detailed profiling using kernels
  • Conclusions

3
WRF Project Overview
  • Develop advanced model and data-assimilation
    system for mesoscale NWP
  • Accurate, efficient, applicable over a broad
    range of scales
  • Focus on 1-10km resolution
  • Advanced dynamics, physics, data assimilation,
    nesting
  • Flexible, modular, performance-portable
  • Large, multi-institution effort
  • Promote closer ties between research and
    operations
  • Pool resources
  • Milestones and status
  • First release Nov. 2000
  • Current release May 2001
  • Research community version in 2002; full
    operational implementation in 2004
  • Dynamical cores
  • Time-split explicit Eulerian
  • 5th order advection
  • Height and mass coordinate
  • Semi-implicit Semi-Lagrangian
  • Full physics with multiple options for each type
  • 3DVAR, 4DVAR


7
WRF Project Overview
  • Principal Partners
  • NCAR Mesoscale and Microscale Meteorology
    Division
  • NOAA National Centers for Environmental
    Prediction
  • NOAA Forecast Systems Laboratory
  • U. Oklahoma Center for the Analysis and
    Prediction of Storms
  • U.S. Air Force Weather Agency
  • Additional Collaborators
  • NOAA Geophysical Fluid Dynamics Laboratory
  • NASA GSFC Atmospheric Sciences Division
  • NOAA National Severe Storms Laboratory
  • NRL Marine Meteorology Division
  • EPA Atmospheric Modeling Division
  • Aerospace Inc
  • University Community


9
Performance Portability
  • Conflicting concerns
  • Best possible performance
  • Maintainable software
  • What can we control?
  • Run-time and compiler options, libraries
  • Padding, alignment, sizing of arrays
  • Blocking factors, vector length
  • Storage order/loop nesting
  • Crucial issue in WRF design in 1999-2000: can we
    stay portable to vector systems?
  • Needed decision on storage/loop order while code
    still small
  • Trend in U.S. has been away from vector
  • Vector vendors dropping support; however, a
    significant installed base remains outside the U.S.
  • Expedited Storage Order study to support
    decision for WRF

10
WRF storage-order
  • What's the best order of three-dimensional array
    indices and corresponding loop nesting: IJK, KIJ,
    or IKJ?
  • Original study
  • M. Ashworth (Daresbury Laboratory, UK), ECMWF '98
    workshop presentation compared k-inner versus
    i-inner orderings for array dimensions and loop
    nesting in a test kernel.
  • (Optimisation for Vector and RISC Processors,
    in Towards Teracomputing, World Scientific, River
    Edge, NJ. 1999. pp. 353-359.)
  • RISC relatively insensitive (30% penalty)
  • Vector was very sensitive, by up to factors of 4
    on Fujitsu VPP300 and 8 on NEC SX-4
  • Reproduce Ashworth's study with WRF prototype
  • Generated three versions, IJK, KIJ, and IKJ
  • Benchmark various problem sizes on representative
    systems
  • Present results to WRF collaborators
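The storage-order question is ultimately about which loop index walks memory with unit stride. A minimal sketch of the index arithmetic (Python standing in for WRF's Fortran; Fortran is column-major, so the first declared dimension varies fastest; the array sizes here are illustrative, not the benchmarked patch sizes):

```python
# Column-major (Fortran-style) linear offset of element (i, j, k) in
# an array dimensioned (d1, d2, d3): the FIRST index varies fastest.
def offset(idx, dims):
    i, j, k = idx
    d1, d2, d3 = dims
    return i + d1 * (j + d2 * k)

# "IKJ" storage dimensions the arrays as (I, K, J): I is the stride-1
# dimension, so an i-innermost loop walks memory contiguously.
dims_ikj = (80, 41, 80)   # (I, K, J); sizes illustrative
stride_i = offset((1, 0, 0), dims_ikj) - offset((0, 0, 0), dims_ikj)
stride_k = offset((0, 1, 0), dims_ikj) - offset((0, 0, 0), dims_ikj)
print(stride_i)  # 1  -> unit stride: good for cache lines and vectors
print(stride_k)  # 80 -> a k-innermost loop jumps 80 elements per step
```

The same arithmetic explains why vector machines want long stride-1 I runs (IJK or IKJ) while short-X patches leave too little vector length to matter.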

11
IJK versus KIJ on RISC and Vector
  • IJK versus KIJ for all patch dimensions
    X, Y in {21, 41, 81}; 41 levels throughout
  • KIJ is the favored order on RISC; IJK is strongly
    favored on vector for adequate length of X
  • Surprise: vector prefers KIJ for short X (no
    physics in tested prototype)
  • RISC penalty for IJK decreases with increased
    length of minor dimension
  • Penalty is most severe for sizes typical of a DM
    patch


14
IJK and IKJ Penalties versus KIJ (EV6)
IJK vs KIJ
IKJ vs KIJ
  • Penalty for I-innermost is much less severe when
    K is middle dimension
    WRF adopted IKJ ordering in April 2000

15
Detailed Performance Studies
  • Follow-on to WRF storage order experiments to
    determine effects of controllable aspects
  • All MxN subdomain sizes, 2 < M < 80, 2 < N < 80
  • Padding: 0, 1, 2, and 3 extra cells around patch
  • Effect of shared memory tile size
  • Storage order
  • Computational kernel: ADVANCE_W
  • Implicit solver for vertical propagation of
    acoustic modes
  • Compute-intensive, fully 3D vertical recurrence
  • Three versions: IJK, IKJ, and KIJ
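Why storage order matters for this particular kernel: the implicit solver carries a true data dependence in K, so the k loop is serial no matter how the arrays are laid out, while the horizontal i (and j) loops are independent. A hedged sketch (not WRF code; `solve_column` and its coefficients are made up for illustration):

```python
# Hedged sketch (not WRF code): a first-order vertical recurrence of
# the kind ADVANCE_W's implicit acoustic solver performs.  w[k]
# depends on w[k-1], so the k loop cannot be reordered or vectorized;
# loops over the independent i and j columns can be, which is why an
# i-innermost (IKJ) nesting suits this kernel on both cache and
# vector machines.
def solve_column(a, b, nk):
    w = [0.0] * nk
    w[0] = b[0]
    for k in range(1, nk):       # serial: true dependence on level k-1
        w[k] = a[k] * w[k - 1] + b[k]
    return w

print(solve_column([0.0, 0.5, 0.5], [1.0, 1.0, 1.0], 3))
# [1.0, 1.5, 1.75]
```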

16
Detailed Performance Studies
  • Procedure
  • For each processor type
  • Run ADVANCE_W test code over 6400 different
    domain sizes and 4 array padding options
  • Generated detailed performance data using
    hardware performance counters available for
    processor
  • Architectures
  • IBM
  • 375 MHz POWER3, patched version of AIX
  • Hardware performance counters: Cycles, Flops,
    Loads, L1 cache misses, TLB misses, Prefetches,
    etc.
  • Silicon Graphics
  • 250 MHz MIPS R10000
  • Perfex and Speedshop to count Cycles, Flops, L1
    misses, L2 misses, TLB misses, etc.
  • Compaq, Fujitsu, NEC (In progress)

17
IBM Results
  • Overall performance: KIJ is the best storage
    order, but IKJ is nearly as good; IJK is worst
  • Effect of varying subdomain size
  • Most significant is the occurrence of
    pathological TLB behavior (10^6 misses or
    greater) for certain sub-domains
  • Sweet spot: 15 < N < 35
  • Not well correlated with L1 cache behavior (L2
    could not be measured reliably)
  • Prefetching appears more significant
  • Effect of varying tile sizes
  • Any tiling in minor I-dimension always degraded
    performance
  • Very slight benefit to having thin tiles in major
    J-dimension


20
TLB
  • ADVANCE_W has 20 3D arrays
  • What happens when an arbitrary index (I,J,K) is
    touched in every one of these?
  • Assuming arrays are allocated contiguously, the
    touches will be to pages at regular intervals
  • Hopefully these addresses will be scattered
    around memory and evenly distributed over TLB,
    but maybe not
  • Certain array sizes can cause clumping around a
    few entries, and because of the low associativity
    of the IBM TLB, there isn't much margin
  • Translation Look-aside Buffer (TLB)
  • Associates virtual addresses to pages in physical
    memory
  • IBM TLB
  • 256 page-entries
  • 4KB pages
  • 128 two-way set associative banks
  • Least recently used replacement
  • 128-byte cache-lines
  • Miss penalty: 25 to hundreds of cycles
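The clumping argument can be made concrete with the TLB parameters above (4 KB pages, 128 two-way sets). A hedged model; `tlb_sets_touched` and the example strides are hypothetical, not measured WRF layouts:

```python
# Model of the clumping argument.  Touching the same (i, j, k) in 20
# contiguously allocated arrays touches pages a fixed stride apart
# (one whole array).  With 4 KB pages and 128 two-way sets, a stride
# that is a multiple of 128 pages maps every touch into one set,
# which holds only 2 entries -> guaranteed TLB thrashing.
PAGE, SETS, NARRAYS = 4096, 128, 20

def tlb_sets_touched(array_bytes):
    pages = [n * array_bytes // PAGE for n in range(NARRAYS)]
    return {p % SETS for p in pages}

bad  = tlb_sets_touched(128 * PAGE)  # stride a multiple of 128 pages
good = tlb_sets_touched(131 * PAGE)  # stride co-prime to the set count
print(len(bad), len(good))  # 1 20
```

In the bad case all 20 page references compete for one 2-entry set; in the good case they spread over 20 different sets.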


22
Remediation
  • Observe that the pattern does not change with
    array padding; it simply moves
  • Detect subdomain sizes that are liable to TLB
    misses and pad accordingly when the arrays are
    allocated
  • Entirely driver-level
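A driver-level remediation along these lines could be sketched as follows; `pick_padding`, its sufficiency test, and the constants are illustrative assumptions, not the actual WRF driver code:

```python
# Sketch of driver-level padding selection.  Constants model the IBM
# TLB from the previous slide; ELEM assumes 4-byte reals; the "enough
# sets" criterion is an illustrative stand-in for whatever test the
# real driver would apply.
PAGE, SETS, WAYS, NARRAYS, ELEM = 4096, 128, 2, 20, 4

def sets_touched(nx, ny, nz):
    stride = nx * ny * nz * ELEM      # bytes between same-index pages
    return len({(n * stride // PAGE) % SETS for n in range(NARRAYS)})

def pick_padding(nx, ny, nz, max_pad=3):
    # Grow the minor dimension until same-index pages spread over
    # enough sets that WAYS entries per set can hold them all.
    for pad in range(max_pad + 1):
        if sets_touched(nx + pad, ny, nz) * WAYS >= NARRAYS:
            return pad
    return max_pad

print(pick_padding(64, 64, 32))  # 1: one extra cell breaks the pattern
```

Because the check runs when arrays are allocated, it stays entirely in the driver layer, as the slide notes; the numerics never see the padding.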



26
SGI Results
  • Storage order
  • KIJ is the best performer for N < 30; worst for
    N > 45
  • IKJ slightly better than IJK for N < 30
  • IJK is best for N > 45

27
SGI Results
  • Effect of varying subdomain size
  • L2 data cache miss behavior best explains storage
    order behavior

28
SGI Results
  • Effect of varying subdomain size
  • Pathological L1 data cache miss behavior for
    N = 64

29
SGI Results
  • Effect of varying subdomain size
  • TLB behavior interesting but not significant

30
Summary
  • WRF design provides flexible runtime control over
    tile and patch sizes, padding
  • Patch size is a significant factor, and most
    machines exhibit sweet spots
  • Padding can be used to avoid patch sizes that
    result in pathological array-memory layouts on
    the IBM
  • Some aspects of code, such as storage order, are
    rigid
  • Design decisions that can't be easily changed
    should be based on careful study
  • KIJ ordering is overall best for microprocessors
    (but fatal on vector)
  • IKJ ordering is best compromise between vector
    and microprocessors
  • General observations
  • Beyond these relatively high-level aspects,
    trying to outsmart compilers usually doesn't pay
  • Hyper-engineering, even when effective,
    introduces architecture dependencies
  • Obtaining detailed processor performance
    statistics is a problem, and is likely to get
    worse