Title: Computational Chemistry for Large Numbers of Processors
Slide 1: Computational Chemistry for Large Numbers of Processors
Theresa Windus, NWChem task leader
Slide 2: Outline
- NWChem Overview
- Sequential Performance
- NWChem Architecture
- Global Arrays and ARMCI
- Performance Model
- Dynamic Load Balancing in MD
- Conclusions
Slide 3: Molecular Science Software Group
Edo Apra, Eric Bylaska, Bert de Jong, Michel Dupuis, George Fann, Maciej Gutowski, Mahin Hackler, Robert Harrison, So Hirata, Tjerk Straatsma, Theresa Windus
Gary Black, Brett Didier, Carina Lansing, Bruce Palmer, Karen Schuchardt, Eric Stephan, Erich Vorpagel
Jarek Nieplocha, Vinod Tipparaju
Slide 4: NWChem Overview
- Provides major new modeling and simulation capability for molecular science
  - Broad range of molecules, including biomolecules
  - Molecular dynamics, molecular mechanics
  - Electronic structure of molecules (non-relativistic, relativistic)
  - Solid-state capabilities
  - Solving DOE's grand-challenge environmental restoration problems
- Performance characteristics designed for MPPs
  - Single-node performance comparable to the best serial codes
  - Scalability to 1000s of processors
  - Uses the full resources of the hardware (memory, disk, and CPU)
  - Runs on a wide range of computers
- Extensible and long-lived
Slide 5: NWChem Capabilities
- Quantum mechanical capabilities
  - Hartree-Fock and density functional theory at the local and nonlocal levels (with N^3 and N^4 formal scaling): energies, gradients, second derivatives. Linear-scaling quadrature and exchange.
  - Multiconfiguration self-consistent field (MCSCF) energies and gradients.
  - Many-body perturbation theory energies and gradients.
  - Effective core potential energies, gradients, and second derivatives.
  - Coupled cluster (CCSD and CCSD(T)) and configuration interaction energies.
  - Segmented and generally contracted basis sets, including the correlation-consistent basis sets under development at EMSL.
  - Plane-wave pseudopotential codes (periodic and free-space) with dynamics.
- Classical mechanical capabilities
  - Energy minimization, molecular dynamics simulation, ab initio dynamics
  - Free energy calculation
  - Supports variations such as multiconfiguration thermodynamic integration or multiple-step thermodynamic perturbation, first-order or self-consistent electronic polarization, simple reaction field or particle-mesh Ewald, and quantum force-field dynamics
- Mixed QM/MM models and ONIOM
Slide 6: Current NWChem Capabilities
Slide 7: NWChem Platforms
- IBM SP
- Cray T3
- SGI SMP systems
- Fujitsu VX/VPP
- Sun and other workstations
- Tru64 and Linux Alpha workstations and SC clusters
- x86-based workstations running Linux
- x86-based workstations running NT or Win98
- Linux Intel clusters (Myrinet, Giganet, and Quadrics)
Slide 8: Sequential Performance is Important!
- Best time to solution is what we are really after
  - We don't want to increase the number of FLOPs just for scalability, only if it allows us to do problems we couldn't explore before
- Make maximum use of cache
- Optimized libraries (BLAS, LINPACK, FFTs, etc.)
- Profiling tools
- Compiler options
- Faster algorithms
  - Order-N methods, wavelets
Slide 9: Single-CPU Performance: NWChem vs. DGauss
- 33 atom molecule
- DFT using LDA
- Similar energy/density convergence
- Calculations run on SGI R10000
Slide 10: New Reduced-Scaling Methods Being Developed and Implemented
[Figure: log-log plot of performance (0.1 to 1,000 TFlops) versus number of atoms (10^2 to 10^7), comparing methods by formal scaling: CCSD(T) N^7, MP2 N^5, DFT N^3, MM/MD N^2.]
Slide 11: Scalability (Of Course!) is Important
Wall time needed to compute the converged Local Density Approximation (LDA) wavefunction of the Si28O67H30 zeolite fragment (basis set size of 1687 Gaussian functions).
Slide 12: NWChem Architecture
Layers: Generic Tasks, Molecular Calculation Modules, Molecular Modeling Toolkit, Molecular Software Development Toolkit
- Object-oriented design
  - abstraction, data hiding, handles, APIs
- Parallel programming model
  - non-uniform memory access, global arrays, MPI
- Infrastructure
  - GA, Parallel I/O, RTDB, MA, ...
- Program modules
  - communication only through the database
  - persistence for easy restart
Slide 13: Object-Oriented Design
- Objective
  - simplify program development and maintenance
  - encourage reuse of code
- Abstraction
  - Use higher-level concepts than subroutine/data
  - E.g., a basis set (exponents and coefficients)
- Data hiding
  - Hide complexity behind an interface
- Fortran does not support OO programming
Slide 14: Parallel Programming Model
- Non-uniform memory access (NUMA)
  - Your workstation is NUMA: registers, cache, main memory, virtual memory
  - Parallel computers just add non-local memory(s)
  - Unites sequential and parallel computation
- Distributed data
  - Do not limit the calculation by the resources of one node
  - Exploit the aggregate resources of the whole machine
  - SCF and DFT can distribute all data > O(N)
  - MP2 gradients distribute all data > O(N^2)
Slide 15: Global Arrays
- Shared-memory-like model
  - Fast local access
  - NUMA aware and easy to use
  - MIMD and data-parallel modes
  - Interoperates with MPI
- BLAS and linear algebra interface
- Ported to major parallel machines
  - IBM, Cray, SGI, clusters, ...
- Originated in an HPCC project
- Used by several major chemistry codes, financial futures forecasting, astrophysics, computer graphics
- Supported by DOE 2000, ACTS, SciDAC
Slide 16: GA vs. Other Models (a biased view!)
Slide 17: Global Array Communication via ARMCI
[Diagram: on node I, the application calls ga_put(x,y); the Global Array library maps this onto ARMCI_PutS in the ARMCI library, which carries the data across the transport layer (active messages) to node K, where an interrupt invokes the ARMCI handler.]
Other protocols are also used: remote memory copy, sockets, threads, shared memory.
Slide 18: Aggregate Remote Memory Copy Interface (ARMCI)
- Examples of specific interfaces
  - Cray SHMEM, IBM LAPI, Fujitsu MPlib, NEC Parlib/CJ, Hitachi RDMA, Quadrics Elan
  - VIA
  - System V shared memory
  - Sockets
- Capabilities of the above usually include some of the following
  - put, get, scatter, gather, read-modify-write, locks
  - memory consistency and ordering of operations
- Interoperates with MPI
Slide 19: Factors for SMPs
Latencies between SMP nodes are higher than within an SMP node. On the other hand, you may not be able to get full bandwidth because of competition between processes on a node. You need to use the best protocol available, and you may need to modify algorithms to get performance!
Slide 20: Performance Modeling
- Define objectives
  - e.g., enable calculations with several thousand basis functions, perform close to the minimum number of operations at near-peak speed on each processor, scale efficiently to at least 10,000 processors, and use 1 GB of memory per processor
- Minimize operations and communication, yet maintain a good T(comp)/T(comm) ratio
- Specialize the algorithm to suit a particular platform
- CCSD(T) example: m/f is the ratio between the speeds of remote memory access and floating-point multiply-add, b_v is the virtual block size, b_o is the occupied block size, and η is the efficiency.
Slide 21: CCSD(T) Performance Modeling
Efficiency versus virtual block size for occupied block sizes of 1, 2, 4, 6, 8, 10, 15 with m/f = 32 words/FLOP (the ratio for the MSCF IBM SP 120 MHz P2SC).
Memory (words) versus virtual block size for occupied block sizes of 1, 2, 4, 6, 8, 10, 15 with V = 1500, O = 40. The memory increases with the occupied block size.
Slide 22: Load Balancing with Domain Decomposition in Molecular Dynamics
- Locality of interactions → reduction of communication
- Distribution of data → reduction of memory
- Fluctuating number of atoms → frequent atom redistribution
- Inhomogeneous distribution → load balancing
Slide 23: Load Balancing
Slide 24: Dynamic Load Balancing
Slide 25: Effect of Dynamic Load Balancing
The times are for a 10,000-step molecular dynamics simulation of solvated haloalkane dehalogenase performed on 27 processors. They are shown for simulations without load balancing (blue curves), load balancing based on subbox redistribution only (green curves), subbox pair resizing only (black curves), and the combination of these (red curves). Measured total wall-clock times and accumulated synchronization times for a single molecular dynamics step are given in seconds versus simulation time in picoseconds (ps).
Slide 27: What Do We Need to Continue Improving?
- Better algorithms (always!!!)
- Better memory, communication, and I/O latency and bandwidth
- Better parallel profiling tools (ones that don't require a lot of new programming and are easy to use)
- Better parallel debugging tools (ones that scale to 1,000s of processors)
- Time on various supercomputers to perform porting, tuning, and real benchmarking
Slide 28: Some Conclusions
- Sequential performance is important
- Portability and scalability can be obtained by using a NUMA model and one-sided communication with ParSoft tools (GA and ARMCI)
- Performance models provide a way to test an algorithm before implementation
- Data locality and creative use of dynamic load balancing can drastically improve performance
Slide 29: Acknowledgements
- Thanks to Edo Apra, Eric Bylaska, George Fann, Robert Harrison, Jarek Nieplocha, TP Straatsma
- This research was performed in part using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory (PNNL). The MSCF is funded by the Office of Biological and Environmental Research in the U.S. Department of Energy. PNNL is operated by Battelle for the U.S. Department of Energy under contract DE-AC06-76RLO 1830.