Title: Computational Chemistry for Large Numbers of Processors
Slide 1: Computational Chemistry for Large Numbers of Processors
Theresa Windus, NWChem task leader
Slide 2: Outline
- NWChem Overview
- Sequential Performance
- NWChem Architecture
- Global Arrays and ARMCI
- Performance Model
- Dynamic Load Balancing in MD
- Conclusions
Slide 3: Molecular Science Software Group
Edo Apra, Eric Bylaska, Bert de Jong, Michel Dupuis, George Fann, Maciej Gutowski, Mahin Hackler, Robert Harrison, So Hirata, Tjerk Straatsma, Theresa Windus
Gary Black, Brett Didier, Carina Lansing, Bruce Palmer, Karen Schuchardt, Eric Stephan, Erich Vorpagel
Jarek Nieplocha, Vinod Tipparaju
Slide 4: NWChem Overview
- Provides major new modeling and simulation capability for molecular science
  - Broad range of molecules, including biomolecules
  - Molecular dynamics, molecular mechanics
  - Electronic structure of molecules (non-relativistic, relativistic)
  - Solid-state capabilities
  - Solving DOE's grand-challenge environmental restoration problems
- Performance characteristics designed for MPPs
  - Single-node performance comparable to the best serial codes
  - Scalability to 1000s of processors
  - Uses the full resources of the hardware (memory, disk, and CPU)
  - Runs on a wide range of computers
- Extensible and long-lived
Slide 5: NWChem Capabilities
- Quantum mechanical capabilities
  - Hartree-Fock and density functional theory at the local and nonlocal levels (with N^3 and N^4 formal scaling): energies, gradients, second derivatives. Linear-scaling quadrature and exchange.
  - Multiconfiguration self-consistent field (MCSCF) energies and gradients.
  - Many-body perturbation theory energies and gradients.
  - Effective core potential energies, gradients, and second derivatives.
  - Coupled cluster (CCSD and CCSD(T)) and configuration interaction energies.
  - Segmented and generally contracted basis sets, including the correlation-consistent basis sets under development at EMSL.
  - Plane-wave pseudopotential codes (periodic and free-space) with dynamics.
- Classical mechanical capabilities
  - Energy minimization, molecular dynamics simulation, ab initio dynamics
  - Free energy calculation
  - Supports variations such as multiconfiguration thermodynamic integration or multiple-step thermodynamic perturbation, first-order or self-consistent electronic polarization, simple reaction field or particle-mesh Ewald, and quantum force-field dynamics
- Mixed QM/MM models and ONIOM
Slide 6: Current NWChem Capabilities
Slide 7: NWChem Platforms
- IBM SP
- Cray T3
- SGI SMP systems
- Fujitsu VX/VPP
- Sun and other workstations
- Tru64 and Linux Alpha workstations and SC clusters
- x86-based workstations running Linux
- x86-based workstations running NT or Win98
- Linux Intel clusters (Myrinet, Giganet, and Quadrics)
Slide 8: Sequential Performance is Important!
- Best time to solution is what we are really after
  - We don't want to increase the number of FLOPs just for scalability, only if it allows us to do problems we couldn't explore before
- Make maximum use of cache
- Optimized libraries (BLAS, LINPACK, FFTs, etc.)
- Profiling tools
- Compiler options
- Faster algorithms
  - Order-N methods, wavelets
Slide 9: Single-CPU Performance: NWChem vs. DGauss
- 33 atom molecule
- DFT using LDA
- Similar energy/density convergence
- Calculations run on SGI R10000
Slide 10: New Reduced-Scaling Methods Being Developed and Implemented
[Figure: log-log plot of performance (0.1 to 1,000 TFlops) versus number of atoms (10^2 to 10^7), comparing methods by formal scaling: CCSD(T) N^7, MP2 N^5, DFT N^3, MM/MD N^2.]
Slide 11: Scalability (Of Course!) is Important
Wall time needed to compute the converged Local Density Approximation (LDA) wavefunction of the Si28O67H30 zeolite fragment (basis set size of 1687 Gaussian functions).
Slide 12: NWChem Architecture
Layers: Generic Tasks, Molecular Calculation Modules, Molecular Modeling Toolkit, Molecular Software Development Toolkit
- Object-oriented design
  - abstraction, data hiding, handles, APIs
- Parallel programming model
  - non-uniform memory access, global arrays, MPI
- Infrastructure
  - GA, Parallel I/O, RTDB, MA, ...
- Program modules
  - communication only through the database
  - persistence for easy restart
Slide 13: Object-Oriented Design
- Objective
  - simplify program development and maintenance
  - encourage reuse of code
- Abstraction
  - Use higher-level concepts than subroutine/data
  - E.g., a basis set (exponents and coefficients)
- Data hiding
  - Hide complexity behind an interface
- Fortran does not support OO programming
Slide 14: Parallel Programming Model
- Non-uniform memory access (NUMA)
  - Your workstation is NUMA: registers, cache, main memory, virtual memory
  - Parallel computers just add non-local memory(s)
  - Unites sequential and parallel computation
- Distributed data
  - Do not limit the calculation by the resources of one node
  - Exploit the aggregate resources of the whole machine
  - SCF and DFT can distribute all data > O(N)
  - MP2 gradients distribute all data > O(N^2)
Slide 15: Global Arrays
- Shared-memory-like model
  - Fast local access
  - NUMA aware and easy to use
  - MIMD and data-parallel modes
  - Interoperates with MPI
- BLAS and linear algebra interface
- Ported to major parallel machines
  - IBM, Cray, SGI, clusters, ...
- Originated in an HPCC project
- Used by several major chemistry codes, financial futures forecasting, astrophysics, computer graphics
- Supported by DOE 2000, ACTS, SciDAC
Slide 16: GA vs. Other Models (a biased view!)
Slide 17: Global Array Communication via ARMCI
[Diagram: on node I, the application calls ga_put(x,y); the Global Array library maps this onto ARMCI_PutS in the ARMCI library, which carries the data across the transport layer (active messages) to node K, where an interrupt invokes the ARMCI handler.]
Other protocols are also used: remote memory copy, sockets, threads, shared memory.
Slide 18: Aggregate Remote Memory Copy Interface (ARMCI)
- Examples of specific interfaces
  - Cray SHMEM, IBM LAPI, Fujitsu MPlib, NEC Parlib/CJ, Hitachi RDMA, Quadrics Elan
  - VIA
  - System V shared memory
  - Sockets
- Capabilities of the above usually include some of the following
  - put, get, scatter, gather, read-modify-write, locks
  - memory consistency and ordering of operations
- Interoperates with MPI
Slide 19: Factors for SMPs
Latencies between SMP nodes are higher than within an SMP node. On the other hand, you may not be able to get full bandwidth because of competition between processes on a node. You need to use the best protocol available, and you may need to modify algorithms to get performance!
Slide 20: Performance Modeling
- Define objectives
  - e.g., enable calculations with several thousand basis functions, perform close to the minimum number of operations at near-peak speed on each processor, scale efficiently to at least 10,000 processors, and use 1 GB of memory per processor
- Minimize operations and communication, yet maintain a good T(comp)/T(comm) ratio
- Specialize the algorithm to suit a particular platform
- CCSD(T) example: m/f is the ratio between the speeds of remote memory access and floating-point multiply-add, b_v is the virtual block size, b_o is the occupied block size, and η is the efficiency.
Slide 21: CCSD(T) Performance Modeling
Efficiency versus virtual block size for occupied block sizes of 1, 2, 4, 6, 8, 10, 15 with m/f = 32 words/FLOP (the ratio for the MSCF IBM SP 120 MHz P2SC).
Memory (words) versus virtual block size for occupied block sizes of 1, 2, 4, 6, 8, 10, 15 with V = 1500, O = 40. The memory increases with the occupied block size.
Slide 22: Load Balancing with Domain Decomposition in Molecular Dynamics
- Locality of interactions → reduction of communication
- Distribution of data → reduction of memory
- Fluctuating number of atoms → frequent atom redistribution
- Inhomogeneous distribution → load balancing
Slide 23: Load Balancing
Slide 24: Dynamic Load Balancing
Slide 25: Effect of Dynamic Load Balancing
The times are for a 10,000-step molecular dynamics simulation of solvated haloalkane dehalogenase performed on 27 processors. They are shown for simulations without load balancing (blue curves), load balancing based on subbox redistribution only (green curves), subbox pair resizing only (black curves), and the combination of these (red curves). Measured total wall-clock times and accumulated synchronization times for a single molecular dynamics step are given in seconds versus simulation time in picoseconds (ps).
Slide 27: What Do We Need to Continue Improving?
- Better algorithms (always!!!)
- Better memory, communication, and I/O latency and bandwidth
- Better parallel profiling tools (ones that don't require a lot of new programming and are easy to use)
- Better parallel debugging tools (ones that scale to 1,000s of processors)
- Time on various supercomputers to perform porting, tuning, and real benchmarking
Slide 28: Some Conclusions
- Sequential performance is important
- Portability and scalability can be obtained by using a NUMA model and one-sided communication with ParSoft tools (GA and ARMCI)
- Performance models provide a way to test an algorithm before implementation
- Data locality and creative use of dynamic load balancing can drastically improve performance
Slide 29: Acknowledgements
- Thanks to Edo Apra, Eric Bylaska, George Fann, Robert Harrison, Jarek Nieplocha, TP Straatsma
- This research was performed in part using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory (PNNL). The MSCF is funded by the Office of Biological and Environmental Research in the U.S. Department of Energy. PNNL is operated by Battelle for the U.S. Department of Energy under contract DE-AC06-76RLO 1830.