Computational Chemistry for Large Numbers of Processors

Scaling to New Heights Workshop, May 20-21

Transcript and Presenter's Notes
1
Computational Chemistry for Large Numbers of Processors
Theresa Windus, NWChem task leader
2
Outline
  • NWChem Overview
  • Sequential Performance
  • NWChem Architecture
  • Global Arrays and ARMCI
  • Performance Model
  • Dynamic Load Balancing in MD
  • Conclusions

3
Molecular Science Software Group
Edo Apra, Eric Bylaska, Bert de Jong, Michel Dupuis, George Fann, Maciej Gutowski, Mahin Hackler, Robert Harrison, So Hirata, Tjerk Straatsma, Theresa Windus
Gary Black, Brett Didier, Carina Lansing, Bruce Palmer, Karen Schuchardt, Eric Stephan, Erich Vorpagel
Jarek Nieplocha, Vinod Tipparaju
4
NWChem overview
  • Provides major new modeling and simulation
    capability for molecular science
  • Broad range of molecules, including biomolecules
  • Molecular dynamics, molecular mechanics
  • Electronic structure of molecules
    (non-relativistic, relativistic)
  • Solid state capabilities
  • Solving DOE's grand challenge environmental restoration problems
  • Performance characteristics designed for MPPs
  • Single node performance comparable to best serial
    codes
  • Scalability to 1000s of processors
  • Uses full resources of the hardware (memory,
    disk, and CPU)
  • Runs on a wide range of computers
  • Extensible and long-lived

5
NWChem Capabilities
  • Quantum Mechanical Capabilities
  • Hartree-Fock and density functional theory at the local and nonlocal levels (with N³ and N⁴ formal scaling): energies, gradients, second derivatives. Linear-scaling quadrature and exchange.
  • Multiconfiguration self consistent field (MCSCF)
    energies and gradients.
  • Many-body perturbation theory energies and
    gradients.
  • Effective core potential energies, gradients, and
    second derivatives.
  • Coupled cluster CCSD and CCSD(T) and
    configuration interaction energies.
  • Segmented and generally contracted basis sets
    including the correlation-consistent basis sets
    under development at EMSL.
  • Plane-wave pseudo-potential codes (periodic and
    free-space) with dynamics
  • Classical Mechanical Capabilities
  • Energy minimization, molecular dynamics simulation, ab initio dynamics
  • Free energy calculation
  • Supports variations such as multiconfiguration
    thermodynamic integration or multiple step
    thermodynamic perturbation, first order or self
    consistent electronic polarization, simple
    reaction field or particle mesh Ewald, and
    quantum force-field dynamics
  • Mixed QM/MM models and ONIOM

6
Current NWChem Capabilities
7
NWChem Platforms
  • IBM SP
  • CRAY T3
  • SGI SMP systems
  • Fujitsu VX/VPP
  • SUN and other workstations
  • Tru64 and Linux Alpha workstations and SC
    clusters
  • x86-based workstations running Linux
  • x86-based workstations running NT or Win98
  • Linux Intel clusters (Myrinet, Giganet, and
    Quadrics)

8
Sequential Performance is Important!
  • Best time to solution is what we are really after
  • We don't want to increase the number of FLOPS just for scalability; only if it allows us to tackle problems we couldn't explore before
  • Make maximum use of cache
  • Optimized libraries (BLAS, LINPACK, FFTs, etc.)
  • Profiling tools
  • Compiler options
  • Faster algorithms
  • Order N methods - wavelets
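
As a concrete illustration of the optimized-libraries point, here is a minimal sketch (assuming a CBLAS installation; this is not code from NWChem) contrasting a naive matrix multiply with the equivalent DGEMM call, which is blocked and tuned for the cache hierarchy:

    /* Same matrix multiply written naively and via optimized BLAS.
       Assumes a CBLAS implementation (link with -lcblas or a vendor BLAS). */
    #include <stdio.h>
    #include <cblas.h>

    /* Naive triple loop: correct, but makes poor use of cache. */
    void matmul_naive(int n, const double *a, const double *b, double *c) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
    }

    /* Same operation through DGEMM: blocked and tuned for the memory hierarchy. */
    void matmul_blas(int n, const double *a, const double *b, double *c) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    }

    int main(void) {
        double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, c[4];
        matmul_blas(2, a, b, c);          /* c = a * b */
        printf("c[0][0] = %g\n", c[0]);   /* prints 19 */
        return 0;
    }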

9
Single CPU performance: NWChem vs. DGauss
  • 33 atom molecule
  • DFT using LDA
  • Similar energy/density convergence
  • Calculations run on SGI R10000

10
New reduced scaling methods being developed and
implemented
[Figure: projected performance (TFlops, 0.1 to 1,000) versus number of atoms (10² to 10⁷) for methods with different formal scaling: MM/MD (N²), DFT (N³), MP2 (N⁵), CCSD(T) (N⁷).]
11
Scalability (of course!) is important
Wall time needed to compute the converged Local Density Approximation (LDA) wavefunction of the Si₂₈O₆₇H₃₀ zeolite fragment (basis set size of 1687 Gaussian functions).
12
NWChem Architecture
  • Object-oriented design
  • abstraction, data hiding, handles, APIs
  • Parallel programming model
  • non-uniform memory access, global arrays, MPI
  • Infrastructure
  • GA, Parallel I/O, RTDB, MA, ...
  • Program modules
  • communication only through the database (see the sketch below)
  • persistence for easy restart

[Architecture diagram: Generic Tasks, Molecular Calculation Modules, Molecular Modeling Toolkit, and the Molecular Software Development Toolkit as layered components.]
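
To make the "communication only through the database" idea concrete, here is a toy sketch of the pattern: each module writes results under named keys, and the next module reads them back, so a restart simply reopens the stored entries. The kv_* names and the in-memory table are invented stand-ins, not NWChem's actual RTDB API:

    /* Toy in-memory "run-time database": a fixed table of named double
       entries. Illustrative only; the real RTDB persists to disk. */
    #include <string.h>
    #include <stdio.h>

    typedef struct { char key[32]; double val; } kv_entry;
    typedef struct { kv_entry e[16]; int n; } kv_db;

    void kv_put(kv_db *db, const char *key, double val) {
        for (int i = 0; i < db->n; i++)
            if (strcmp(db->e[i].key, key) == 0) { db->e[i].val = val; return; }
        snprintf(db->e[db->n].key, sizeof db->e[db->n].key, "%s", key);
        db->e[db->n++].val = val;
    }

    int kv_get(const kv_db *db, const char *key, double *val) {
        for (int i = 0; i < db->n; i++)
            if (strcmp(db->e[i].key, key) == 0) { *val = db->e[i].val; return 1; }
        return 0;
    }

    /* Modules exchange data only through named database entries. */
    void scf_module(kv_db *db)      { kv_put(db, "scf:energy", -76.026765); }
    void gradient_module(kv_db *db) {
        double e;
        if (kv_get(db, "scf:energy", &e))   /* the modules' only coupling point */
            printf("gradient module read scf:energy = %.6f\n", e);
    }

    int main(void) {
        kv_db db = { .n = 0 };
        scf_module(&db);
        gradient_module(&db);
        return 0;
    }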
13
Object-oriented Design
  • Objective
  • simplify program development and maintenance
  • encourage reuse of code
  • Abstraction
  • Use higher-level concepts than subroutine/data
  • e.g., a basis set = exponents + coefficients
  • Data-hiding
  • Hide complexity behind an interface
  • Fortran does not directly support OO programming, so handles and APIs emulate it (sketched below)
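
A minimal sketch of handle-based data hiding, written in C for brevity; basis_t and these function names are hypothetical illustrations of the pattern, not NWChem's actual basis-set API:

    /* Clients manipulate an opaque handle; the layout stays hidden, mirroring
       what a Fortran 77 code does with integer handles behind an API. */
    #include <stdlib.h>
    #include <stdio.h>

    typedef struct basis basis_t;      /* clients never see the fields */

    struct basis {                     /* hidden implementation         */
        int     nprim;
        double *exponents;
        double *coeffs;
    };

    basis_t *basis_create(int nprim) {
        basis_t *b = malloc(sizeof *b);
        b->nprim     = nprim;
        b->exponents = calloc(nprim, sizeof(double));
        b->coeffs    = calloc(nprim, sizeof(double));
        return b;
    }

    double basis_exponent(const basis_t *b, int i) { return b->exponents[i]; }

    void basis_destroy(basis_t *b) {
        free(b->exponents);
        free(b->coeffs);
        free(b);
    }

    int main(void) {
        basis_t *b = basis_create(3);
        printf("exponent[0] = %f\n", basis_exponent(b, 0));
        basis_destroy(b);
        return 0;
    }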

14
Parallel Programming Model
  • Non-uniform memory access - NUMA
  • Your workstation is NUMA - registers, cache, main
    memory, virtual memory
  • Parallel computers just add non-local memory(s)
  • Unites sequential and parallel computation
  • Distributed data
  • Do not limit calculation by resources of one node
  • Exploit aggregate resources of the whole machine
  • SCF and DFT can distribute all data > O(N)
  • MP2 gradients distribute all data > O(N²)

15
Global Arrays
  • Shared-memory-like model
  • Fast local access
  • NUMA aware and easy to use
  • MIMD and data-parallel modes
  • Interoperates with MPI
  • BLAS and linear algebra interface
  • Ported to major parallel machines
  • IBM, Cray, SGI, clusters,...
  • Originated in an HPCC project
  • Used by several major chemistry codes, financial
    futures forecasting, astrophysics, computer
    graphics
  • Supported by DOE 2000, ACTS, SciDAC
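
A minimal sketch of the Global Arrays C interface described above, using the documented NGA_*/GA_* calls (headers and build details vary by installation; an illustration, not code from NWChem): create a distributed 2-D array, have each process fill the patch it owns, then read a remote section one-sidedly.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        int dims[2]  = {1000, 1000};
        int chunk[2] = {-1, -1};          /* let GA choose the distribution */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);
        GA_Zero(g_a);

        /* Each process discovers and fills its own patch (fast local access). */
        int lo[2], hi[2], ld[1];
        double *ptr;
        NGA_Distribution(g_a, GA_Nodeid(), lo, hi);
        NGA_Access(g_a, lo, hi, &ptr, ld);
        for (int i = 0; i <= hi[0] - lo[0]; i++)
            for (int j = 0; j <= hi[1] - lo[1]; j++)
                ptr[i * ld[0] + j] = GA_Nodeid();
        NGA_Release_update(g_a, lo, hi);
        GA_Sync();

        /* One-sided read of an arbitrary section, wherever it lives. */
        double buf[10 * 10];
        int rlo[2] = {0, 0}, rhi[2] = {9, 9}, rld[1] = {10};
        NGA_Get(g_a, rlo, rhi, buf, rld);

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }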

16
GA vs. other models (biased view!)
17
Global Array Communication via ARMCI
[Diagram: on node I, the application calls ga_put(x,y); the Global Array library implements it with ARMCI_PutS from the ARMCI library, which carries the data across the data-transport layer via active messages to an ARMCI handler raised by interrupt on node K.]
Other protocols are also used: remote memory copy, sockets, threads, shared memory.
18
Aggregate Remote Memory Copy Interface (ARMCI)
  • Examples of specific interfaces
  • Cray SHMEM, IBM LAPI, Fujitsu MPlib, NEC
    Parlib/CJ, Hitachi RDMA, Quadrics Elan
  • VIA
  • System V Shared Memory
  • Sockets
  • Capabilities of the above usually include some of
    the following
  • put, get, scatter, gather, read-modify-write,
    locks
  • memory consistency and ordering of operations
  • Interoperates with MPI
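
A minimal ARMCI sketch of a one-sided put, based on the documented ARMCI_Malloc/ARMCI_Put/ARMCI_Fence calls; build details vary by platform, and this is an illustration only:

    #include <stdlib.h>
    #include <mpi.h>
    #include "armci.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        ARMCI_Init();

        int me, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Collective allocation: ptrs[p] is the buffer living on process p. */
        int n = 100;
        void **ptrs = malloc(nproc * sizeof(void *));
        ARMCI_Malloc(ptrs, n * sizeof(double));

        double *local = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) local[i] = me;

        /* One-sided put into the next process's buffer; no receive is posted. */
        int next = (me + 1) % nproc;
        ARMCI_Put(local, ptrs[next], n * sizeof(double), next);
        ARMCI_Fence(next);               /* complete/order the transfer */
        MPI_Barrier(MPI_COMM_WORLD);

        ARMCI_Free(ptrs[me]);
        ARMCI_Finalize();
        MPI_Finalize();
        return 0;
    }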

19
Factors for SMPs
  • Latencies between SMP nodes are higher than within an SMP node.
  • On the other hand, full bandwidth may not be attainable because of competition between processes on a node.
  • Need to use the best protocol available, and may need to modify algorithms to get performance!
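
As a modern illustration of choosing the right protocol per level (using MPI-3's MPI_Comm_split_type, which postdates this talk; an assumption, not the deck's approach), a code can discover which ranks share a node and stage communication accordingly:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Ranks that can share memory land in the same "node" communicator. */
        MPI_Comm node;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        int node_rank, node_size;
        MPI_Comm_rank(node, &node_rank);
        MPI_Comm_size(node, &node_size);

        /* An algorithm can now stage data within the node (cheap) before
           node leaders (node_rank == 0) exchange over the network (costly). */
        printf("world rank %d is rank %d of %d on its node\n",
               world_rank, node_rank, node_size);

        MPI_Comm_free(&node);
        MPI_Finalize();
        return 0;
    }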
20
Performance Modeling
  • Define objectives
  • (e.g.) enable calculations with several thousand
    basis functions, perform close to the minimum
    number of operations at near peak speed on each
    processor, scale efficiently to at least 10,000
    processors, and use 1 GB of memory/processor
  • Minimize operations and communication, yet
    maintain good T(comp)/T(comm) ratio
  • Specialize algorithm to suit a particular
    platform
  • CCSD(T) example

Model parameters: m/f is the ratio between the speeds of remote memory access and floating-point multiply-add, b_v is the virtual block size, b_o is the occupied block size, and η is the efficiency.
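
The exact CCSD(T) model did not survive transcription; purely as an illustration of how such models are built from these symbols (an assumed generic form, not the deck's formula), efficiency can be written as the fraction of time spent computing:

    % Generic efficiency model (illustrative, not the deck's exact formula):
    \[
      \eta \;=\; \frac{T_{\mathrm{comp}}}{T_{\mathrm{comp}} + T_{\mathrm{comm}}}
           \;=\; \left(1 + \frac{m}{f}\cdot\frac{W}{F}\right)^{-1},
    \]
    % where W/F is remote words moved per floating-point operation. Larger
    % virtual blocks b_v reuse more data per word moved, shrinking W/F and
    % raising eta at the cost of memory, consistent with the next slide.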
21
CCSD(T) Performance Modeling

[Figure: efficiency versus virtual block size for occupied block sizes of 1, 2, 4, 6, 8, 10, and 15, with m/f = 32 words/FLOP (the ratio for the MSCF IBM SP 120-MHz P2SC).]
[Figure: memory (words) versus virtual block size for occupied block sizes of 1, 2, 4, 6, 8, 10, and 15, with V = 1500 and O = 40. The memory requirement increases with the occupied block size.]
22
Load Balancing with Domain Decomposition in
Molecular Dynamics
  • Locality of interactions → reduction of communication
  • Distribution of data → reduction of memory
  • Fluctuating number of atoms → frequent atom redistribution
  • Inhomogeneous distribution → load balancing (see the sketch below)
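
Slides 23-25 illustrate the scheme graphically. As a toy sketch of the first mechanism, subbox redistribution, consider moving one subbox from the slowest to the fastest process after each measured step (invented data structures; this is not NWChem's MD code):

    #include <stdio.h>

    #define NPROC 4

    typedef struct { double step_time; int nsubbox; } proc_stat;

    void rebalance(proc_stat p[], int nproc) {
        int slow = 0, fast = 0;
        for (int i = 1; i < nproc; i++) {
            if (p[i].step_time > p[slow].step_time) slow = i;
            if (p[i].step_time < p[fast].step_time) fast = i;
        }
        /* Shift work only when the imbalance outweighs redistribution cost. */
        if (p[slow].step_time > 1.1 * p[fast].step_time && p[slow].nsubbox > 1) {
            p[slow].nsubbox--;
            p[fast].nsubbox++;
            printf("moved one subbox from proc %d to proc %d\n", slow, fast);
        }
    }

    int main(void) {
        proc_stat p[NPROC] = {{1.0, 8}, {1.6, 8}, {0.9, 8}, {1.1, 8}};
        rebalance(p, NPROC);   /* would run after every measured MD step */
        return 0;
    }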
23
Load balancing
24
Dynamic Load Balancing
25
Effect of dynamic load balancing
The times are for a 10,000-step molecular dynamics simulation of solvated haloalkane dehalogenase performed on 27 processors. They are shown for simulations without load balancing (blue curves), with load balancing based on subbox redistribution only (green curves), subbox pair resizing only (black curves), and the combination of the two (red curves). Measured total wall-clock times and accumulated synchronization times for a single molecular dynamics step, in seconds, versus simulation time in picoseconds (ps).
26
(No Transcript)
27
What do we need to continue improving?
  • Better algorithms (always!!!)
  • Better memory, communication and I/O latency and
    bandwidth
  • Better parallel profiling tools (ones that don't require a lot of new programming and are easy to use)
  • Better parallel debugging tools (ones that scale
    to 1,000s of processors)
  • Time on various supercomputers to perform
    porting, tuning and real benchmarking

28
Some conclusions
  • Sequential performance is important
  • Portability and scalability can be obtained by
    using a NUMA model and one-sided communication
    with ParSoft tools (GA and ARMCI)
  • Performance models provide a way to test the
    algorithm before implementation
  • Data locality and creative use of dynamic
    load-balancing can drastically improve performance

29

Acknowledgements
  • Thanks to Edo Apra, Eric Bylaska, George Fann,
    Robert Harrison, Jarek Nieplocha, TP Straatsma
  • This research was performed in part using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory (PNNL). The MSCF is funded by the Office of Biological and Environmental Research in the U.S. Department of Energy. PNNL is operated by Battelle for the U.S. Department of Energy under contract DE-AC06-76RLO 1830.