Title: Partitioning the Cubed-Sphere for BG/L
1Partitioning the Cubed-Sphere for BG/L
- John M. Dennis, Henry M. Tufo, Richard D. Loft
- {dennis,tufo,loft}@ucar.edu
- National Center for Atmospheric Research
- Computational Science Section
- Boulder, Colorado USA
2Overview
- Dynamical core of high-res. Atmospheric Global Circulation Model (AGCM) on BG/L
- High Order Multi-scale Modeling Environment (HOMME)
- Spectral element based
- 9.5 km mesh at equator (6.7M grid points)
- Explicit time-stepping (dt = 5 sec)
- Project performance to 54K BG/L processors -> 23-39 Tflops
- Similar floating-point rate for 10 km mesh AGCM on the Earth Simulator (ES) (26.6 Tflops)
3Outline
- BG/L Hardware (final and prototype)
- Computational Mesh
- Cubed-sphere
- Performance Model
- Description
- IBM P690 results
- Prototype BG/L results
- Prediction for 9.5 km simulation
4BG/L Hardware
- 64k node supercomputer
- Two PowerPC 440 cores per node (750 MHz)
- 180/360 Tflops Peak
- Networks
- 3D torus network (64x32x32)
- (1.5 µs, 175 Mbyte/sec/link)
- Fat-Tree reduction network
- (5.0 µs, 350 Mbyte/sec)
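To make the torus topology concrete, here is a minimal Python sketch of nearest-neighbour addressing on a 64x32x32 torus. The coordinate scheme and the function name `torus_neighbors` are illustrative assumptions, not the BG/L system software's interface.

```python
# Hypothetical (x, y, z) node addressing on a 3D torus; not BG/L's own API.
TORUS_DIMS = (64, 32, 32)


def num_nodes(dims=TORUS_DIMS):
    """Number of nodes in the torus: the product of its three dimensions."""
    nx, ny, nz = dims
    return nx * ny * nz


def torus_neighbors(coord, dims=TORUS_DIMS):
    """Return the six nearest neighbours of a node, with wrap-around."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            nb = list(coord)
            # The modulo implements the wrap-around link that makes the
            # network a torus rather than a plain mesh.
            nb[axis] = (nb[axis] + step) % dims[axis]
            neighbors.append(tuple(nb))
    return neighbors
```

With these dimensions `num_nodes()` gives 65536, which matches the "64k node" figure, and a corner node such as (0, 0, 0) still has six distinct neighbours because of the wrap-around links.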
5Prototype BG/L Hardware
- 512 nodes
- 32 nodes per board
- 1/2 rack
- Two 500 MHz PowerPC per node
- Network
- 3D mesh (8x8x8)
6Computational Mesh: Cubed-Sphere
- Spectral Elements
- A quadrilateral patch of gridpoints: Np x Np
- Cube face
- A square collection of spectral elements: Ne x Ne
- Cube
- Total number of spectral elements: 6 x Ne x Ne
- Partitioning strategy
- Place one or more spectral elements on each processor
- Use space-filling curves
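The element count and the "one or more elements per processor" rule can be sketched in a few lines of Python. This assumes the elements have already been ordered along a space-filling curve, so that partitioning reduces to cutting the curve into contiguous segments. The helper names `num_elements` and `partition_curve` are hypothetical and are not taken from HOMME.

```python
def num_elements(ne):
    """Total spectral elements on the cubed-sphere: 6 faces of ne x ne."""
    return 6 * ne * ne


def partition_curve(k, nproc):
    """Split curve positions 0..k-1 into nproc contiguous segments.

    Segment sizes differ by at most one element. Returns a list of
    half-open (start, stop) index ranges, one per processor.
    """
    if not 1 <= nproc <= k:
        raise ValueError("need between 1 and k processors")
    base, extra = divmod(k, nproc)
    ranges = []
    start = 0
    for p in range(nproc):
        # The first `extra` processors take one additional element.
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For example, `num_elements(16)` is 1536, and `partition_curve(1536, 512)` assigns three consecutive curve positions to each processor.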
7Performance Model
$$T_k = D \cdot \mathrm{nelemd}_k + \sum_{l} \left( t_s + \frac{s(k,l)}{B_w} \right)$$
$$T_{serial} = D \cdot K$$
$$\mathrm{Speedup} = \frac{T_{serial}}{\max_k (T_k)}$$
Parameters
- K: Total number of spectral elements
- T_k: Time to execute on processor k
- D: Serial time to execute a single spectral element
- nelemd_k: Number of spectral elements on processor k
- t_s: Network latency
- B_w: Network bandwidth (including contention)
- s(k,l): Message volume between the kth and lth processor
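As a reading aid, the model can be written as a short Python sketch. It assumes the formula reads T_k = D * nelemd_k + sum over l of (ts + s(k,l) / Bw), with the sum taken over the processors l that exchange messages with k. The function names and argument layout are hypothetical.

```python
def processor_time(d, nelemd_k, msg_volumes, ts, bw):
    """Estimated execution time T_k on one processor.

    d           -- serial time for a single spectral element (seconds)
    nelemd_k    -- number of spectral elements on this processor
    msg_volumes -- message volume s(k,l) to each communicating processor l
    ts          -- network latency per message (seconds)
    bw          -- network bandwidth, including contention (volume/second)
    """
    compute = d * nelemd_k
    communicate = sum(ts + s_kl / bw for s_kl in msg_volumes)
    return compute + communicate


def speedup(d, nelemd, msg_volumes, ts, bw):
    """Speedup = Tserial / max_k(T_k), with Tserial = d * K."""
    k_total = sum(nelemd)  # K: total number of spectral elements
    t_serial = d * k_total
    t_parallel = max(
        processor_time(d, n, vols, ts, bw)
        for n, vols in zip(nelemd, msg_volumes)
    )
    return t_serial / t_parallel
```

Because the speedup is governed by the slowest processor, both load imbalance (a large nelemd_k) and a heavy communication pattern on a single processor lower the result.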
8Performance Model (cont)
- Explicit time stepping
- Very cache friendly
- Serial performance NOT dependent on problem size
- Semi-implicit time stepping
- Preconditioned conjugate gradient solver
- Preconditioner is not cache friendly
- Serial performance dependent on problem size
- Cache size
- Memory bandwidth per processor
- Memory bandwidth per SMP node
9HOMME on IBM P690 Cluster
- Validate performance model with lower resolution
- Spectral elements (Np = 6)
- K = 1536 elements (Ne = 16)
- 16 vertical levels
- Perf. Model accurate to 10%
- Model less accurate at 768 processors
- > 50% communication time
10HOMME on prototype BG/L Hardware
- Machine
- 512 nodes
- 8x8x8 mesh
- Single processor per node (500 MHz)
- HOMME configuration
- Spectral element (Np = 8)
- K = 1536 elements (Ne = 16)
- 16 vertical levels
- Impact of contention?
- 9 messages per link (experimental)
- Perf. Model accurate to 1%
- 23% communication time at 512 processors
11HOMME on BG/L
- Computational Mesh
- K = 55296 spectral elements (Np = 10, Ne = 96)
- 96 vertical levels
- 9.5 km mesh (6.7M velocity grid points)
- Give bounds for BG/L performance predictions
- Single processor: 450 to 750 Mflops
- Network
- 50% to 100% of projected
- Contention: 4.5 messages per link
12Possible BG/L configurations
13Projected HOMME Performance for K = 55296
14Conclusions
- Perf. Model accurately predicts execution time of explicit time-stepping on O(1000) processors (IBM P690, prototype BG/L)
- Perf. Model should accurately predict BG/L execution time
- Communication time 7% on 54K processors
- HOMME should achieve 23-39 Tflops on 54K processors
- 9.5 km mesh should achieve similar performance levels versus 10 km mesh AGCM on ES
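A rough consistency check on the 23-39 Tflops range is sketched below. It is not the authors' projection method. It assumes that "54K processors" means 55296 processors (one spectral element each, since K = 55296), that the single-processor bounds are 450 and 750 Mflops, and that the communication figure of 7 is a percentage of run time.

```python
def sustained_tflops(nproc, mflops_per_proc, comm_fraction):
    """Aggregate rate if each processor computes at mflops_per_proc
    during the non-communication share of the run."""
    return nproc * mflops_per_proc * (1.0 - comm_fraction) / 1.0e6


# Assumed inputs: 55296 processors, 7% communication time.
low = sustained_tflops(55296, 450.0, 0.07)
high = sustained_tflops(55296, 750.0, 0.07)
print(f"{low:.1f} to {high:.1f} Tflops")  # prints "23.1 to 38.6 Tflops"
```

Under these assumptions the estimate lands close to the quoted 23-39 Tflops. The check ignores the network bounds listed earlier.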
15Questions?
- Thanks
- IBM BlueGene/L development team
- Funding
- Department of Energy Climate Change Prediction Program
- Contact
- John Dennis dennis@ucar.edu
16Partitioning a cubed-sphere on 8 processors
17Partitioning a cubed-sphere on 8 processors
18Peano curve construction (P = 3^m)
19Hilbert curve construction (P = 2^n)
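For readers without the figure, a Hilbert curve on a 2^n x 2^n grid can be generated with the widely used iterative index-to-coordinate mapping sketched below in Python. This is a standard textbook formulation and may differ from the recursive construction drawn on the slide. The function names are hypothetical.

```python
def hilbert_d2xy(order, d):
    """Map position d along a Hilbert curve to (x, y) on a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_curve(order):
    """List every cell of the 2**order x 2**order grid in curve order."""
    n = 1 << order
    return [hilbert_d2xy(order, d) for d in range(n * n)]
```

Consecutive cells along the curve are always edge neighbours, which is why cutting the curve into contiguous segments tends to produce compact partitions.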
20Partitioning with Space-Filling Curves
- Meandering Peano curve: M = 3^m
- Hilbert curve: M = 2^n
- Hilbert-Peano curve: M = 2^n x 3^m
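The size constraints above can be checked mechanically. The Python sketch below factors a size M into powers of 2 and 3 and reports which curve family applies. It assumes M is the number of elements along one side of a cube face (Ne), which the slides do not state explicitly. The function names are hypothetical.

```python
def hilbert_peano_levels(m_size):
    """Return (n, m) with m_size == 2**n * 3**m, or None if impossible."""
    if m_size < 1:
        return None
    n = m = 0
    while m_size % 2 == 0:
        m_size //= 2
        n += 1
    while m_size % 3 == 0:
        m_size //= 3
        m += 1
    # Any leftover factor means the size is not of the form 2**n * 3**m.
    return (n, m) if m_size == 1 else None


def curve_type(m_size):
    """Name the curve family that covers an m_size x m_size domain."""
    levels = hilbert_peano_levels(m_size)
    if levels is None:
        return "unsupported"
    n, m = levels
    if m == 0:
        return "Hilbert"
    if n == 0:
        return "meandering Peano"
    return "Hilbert-Peano"
```

Under that assumption, Ne = 16 factors as 2^4 (a pure Hilbert curve), while Ne = 96 factors as 2^5 x 3 and needs the combined Hilbert-Peano curve.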
21Application of SFC to cubed-sphere (cont)