Title: Matt Challacombe, Theoretical Division
1 An Irregular Approach to Parallelism in Linear Scaling Quantum Chemistry
LAUR01-4856
Linear Scaling Electronic Structure
Methods, IPAM, UCLA Spring 2002
Matt Challacombe, Theoretical Division
- Overview of MondoSCF
- Full O(N) examples
- Nearsightedness and the scalability of O(N) methods
- Data locality
- Scalable algorithms
- Data structures
- Tiling and dynamic load balancing
2 The Nearsighted Principle and Linear Scaling
- In a local basis φ, quantum effects are short ranged for non-metallic systems
- Locality is manifested in approximate exponential decay of P_AB with atomic separation
- Locality of P may be exploited to achieve O(N) algorithms for SCF theory
- The non-local Coulomb problem can be overcome with multiscale methods
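As a rough illustration of why exponential decay leads to O(N) storage, the sketch below uses a hypothetical model matrix on a 1-D chain, P_AB = exp(-gamma * |A - B|), and counts the elements that survive a drop tolerance. The model, the names `gamma` and `tau`, and the function name are assumptions made for illustration; they are not MondoSCF quantities.

```python
import math

def significant_per_row(n_atoms, gamma=1.0, tau=1e-6):
    """Count elements with |P_AB| >= tau in each row of a model matrix.

    Assumed model (illustration only): atoms on a 1-D chain with unit
    spacing and P_AB = exp(-gamma * |A - B|).
    """
    counts = []
    for a in range(n_atoms):
        kept = 0
        for b in range(n_atoms):
            if math.exp(-gamma * abs(a - b)) >= tau:
                kept += 1
        counts.append(kept)
    return counts

# The number of kept elements per row saturates, so the total grows
# linearly with the number of atoms rather than quadratically.
for n in (50, 100, 200):
    counts = significant_per_row(n)
    print(n, sum(counts), max(counts))
```

In this model the per-row count depends only on `gamma` and `tau`, not on the system size, which is the property an O(N) sparse algebra relies on.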
MondoSCF
4 MondoSCF Capabilities, Current and Future
Recently accomplished, unpublished
Currently in development at LANL
Currently in development at LLNL
- O(N) Periodic boundary conditions, C.J. Tymczak (LANL)
- O(N) Exact exchange, Eric Schwegler (LLNL)
- Parallel Fock builds and Space Filling Curves, C.K. Gan (LANL)
- O(N) methods for QM/MM, internal coordinate approaches to transition state and geometry optimization, Karoly Nemeth (LANL)
5 MondoSCF
6 Sequestration of Carbon Dioxide (Neil Henson and MC)
- Lizardite (L) is Mg3Si2O5(OH)4
- L-OH + H2CO3 → L-HCO3 + H2O
- Can carbon dioxide dissolved in water carbonate
the mineral surface?
7 QM/MM Crambin in a Water Droplet (Karoly Nemeth and MC)
MM from F90 Dynamo libraries (M. Field et al.).
Seamless QM/MM electrostatics via QCTC.
One simplified density.
8 Additional Thoughts on the Promise of Linear Scaling
We have sorted most of the "details" and have a fully O(N) ab initio code, but...
- Even with O(N), going large is costly and is not enough in itself to solve many real world problems.
- Size matters, but so does speed.
- Must have scalable algorithms to realize the promise of O(N) ab initio.
- This is challenging because, to scale, 5 leading edge methods must play well together in parallel.
- Need good, globally applicable paradigms.
9 Scalability of O(N) Methods
- Locality
  - The nearsighted principle suggests that communication costs can be reduced to O(1) with respect to p/N
- Load Balance
  - Domain decomposition after locality enhancement
  - Leads to irregular work loads
  - Requires
    - good data structures
    - good algorithms
    - dynamic load balancing
10 Data Locality in Sparse Matrix Algebra
- Eliminate all to all communications
- Reduction of sparse matrices to banded form: graph or geometric methods
- Radial Cutoffs
  - Identical graphs
  - Good for uniform systems (e.g. Si)
  - => Graph theoretical methods
- Thresholding
  - General, transferable
  - Dissimilar graphs
  - Good for inhomogeneous systems
  - Can exploit incremental matrices
  - => Geometric methods
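The slide does not say which graph theoretical method is meant. As one well-known example of that class, the sketch below uses reverse Cuthill-McKee ordering (a swapped-in illustration, not necessarily what MondoSCF uses) to reduce the bandwidth of a small sparsity graph. The graph, the helper names, and the tie-breaking rule are all assumptions.

```python
from collections import deque

def bandwidth(adj, order):
    """Largest |i - j| over all non-zero pairs when vertices follow `order`."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)

def reverse_cuthill_mckee(adj):
    """Breadth-first ordering from low-degree vertices, then reversed."""
    by_degree = lambda v: (len(adj[v]), v)   # ties broken by label
    visited, order = set(), []
    for start in sorted(adj, key=by_degree):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in sorted(adj[u], key=by_degree):
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
    return order[::-1]

# A chain of 8 "atoms" whose labels are scrambled relative to the chain.
path = [0, 4, 1, 5, 2, 6, 3, 7]
adj = {v: set() for v in range(8)}
for u, v in zip(path, path[1:]):
    adj[u].add(v)
    adj[v].add(u)

natural = list(range(8))
reordered = reverse_cuthill_mckee(adj)
print(bandwidth(adj, natural), bandwidth(adj, reordered))
```

Because the ordering is computed from the graph, it has to be recomputed whenever the graph changes, which is consistent with the slide pairing graph methods with the identical graphs produced by radial cutoffs.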
11 Geometric Ordering: Space Filling Curves (SFCs)
- SFCs map points in space onto a line such that points close in space are also close on the line.
- Loopback problem: the converse is not always true.
- Decay in matrix elements with atom-atom separation should lead to banded matrices
Loopback problem with Hilbert (H) and Morton (Z) orderings: points close in space can also be far apart on the line. Example: 2-D Z curve
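The 2-D Z curve example can be reproduced with a few lines of bit interleaving. This is a minimal sketch of Morton (Z) ordering on an integer grid; the function name `morton2d` and the grid size are assumptions for illustration.

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

# Order the cells of an 8 x 8 grid along the Z curve.
cells = [(x, y) for y in range(8) for x in range(8)]
curve = sorted(cells, key=lambda c: morton2d(*c))
print(curve[:4])   # the first "Z": lower-left 2 x 2 block

# Two cells that touch at a corner of the grid's central cross.
print(morton2d(3, 3), morton2d(4, 4))
```

Cells (3, 3) and (4, 4) are diagonal neighbours in space, yet their positions on the curve are 15 and 48 out of 64. That is the loopback problem in its simplest form.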
12 Band Width Reducing Curves (C. K. Gan and MC)
Hilbert and BWR orderings for the overlap matrix
of (H2O)350
- BWR curves
- Give up heuristics and self-similarity
- Avoid loopback, gain locality
[Figure: panels labeled BWR and Hilbert; axis label: Frequency]
13 Scalable Algorithms for Electrostatics
For a fixed error, all methods are O(N lg N)
- Fast Multipole Method
  - Tree to tree cannot be load balanced
- Particle Mesh Ewald
  - FFTs require all to all communication
  - Hard to treat mixed boundary conditions (e.g. 2-D periodic)
- Tree Code
  - Teraflop performance achieved (M. Warren at LANL)
  - Uses Space Filling Curves in domain decomposition
14 Scalability of O(N) Algorithms for Exact Exchange
- ONX
  - No permutational symmetry
  - Density driven
  - => Maintains data locality of K and P
- SONX/LinK
  - Permutational symmetry
  - Serial version 1 to 2 times faster than ONX
  - Integral driven
  - => Global communication of K and P
15 Data Structures for Early O(N) and Support of Parallelism
Trees, trees and more trees!
Fast Matrix DS
- Row-wise LL, column-wise k-d tree
- Leaf nodes contain dense atom-atom blocks
- Supports skip pointers for efficient leaf-wise traversal (no recursion)
- O(lg N) overhead. Useful for summation of fragment matrices over sub-volumes and from different processors
- Compare with CSR format which is O(N)
[Figure: matrix diagram with atom labels O, H, H, O, H, H]
16 k-d trees for Fast Sparse Matrix Access
- Skip pointers on
  - Fast access of leaf nodes
  - No recursion
- Skip pointers off
  - Fast recursive searching, insertion, deletion etc.
[Figure: tree diagrams for each mode, with node labels OH, HH, O-O, O-O]
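The two access modes can be sketched with a binary tree over the column indices of one matrix row, which is the 1-D special case of a k-d tree. Everything here (the `Node` class, `next_leaf`, the string payloads standing in for dense atom-atom blocks) is a hypothetical illustration, not the MondoSCF data structure itself.

```python
class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi        # inclusive column range covered
        self.left = self.right = None
        self.block = None                # payload, held by leaves only
        self.next_leaf = None            # skip pointer to the next leaf

def build(cols, blocks):
    """Balanced binary tree over a sorted list of column indices."""
    node = Node(cols[0], cols[-1])
    if len(cols) == 1:
        node.block = blocks[cols[0]]
        return node
    mid = len(cols) // 2
    node.left = build(cols[:mid], blocks)
    node.right = build(cols[mid:], blocks)
    return node

def link_leaves(root):
    """Thread the leaves left to right and return the first one."""
    leaves, stack = [], [root]
    while stack:
        n = stack.pop()
        if n.left is None:
            leaves.append(n)
        else:
            stack.append(n.right)        # pushed first, so visited second
            stack.append(n.left)
    for a, b in zip(leaves, leaves[1:]):
        a.next_leaf = b
    return leaves[0]

def find(node, col):
    """Recursive search, the 'skip pointers off' mode."""
    if col < node.lo or col > node.hi:
        return None
    if node.left is None:
        return node.block
    hit = find(node.left, col)
    return hit if hit is not None else find(node.right, col)

blocks = {2: "O-O", 5: "O-H", 9: "H-H", 11: "H-O"}
root = build(sorted(blocks), blocks)

# 'Skip pointers on': walk every leaf in order without recursion.
visited, leaf = [], link_leaves(root)
while leaf is not None:
    visited.append((leaf.lo, leaf.block))
    leaf = leaf.next_leaf
print(visited, find(root, 9), find(root, 7))
```

The threaded leaf list serves operations that touch every block (such as summing fragment matrices), while the recursive path serves lookups and updates of individual blocks.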
17 Hierarchical Representation of the Density
ρ-tree
- Hierarchical representation of the density using the k-d tree data structure allows for very efficient range queries.
- Recursive bisection on position, width and magnitude
- Range queries enable rapid access of all density elements with "overlap".
- Implemented in F95 with recursive subroutines and doubly linked lists using pointers.
[Figure labels: Bounding Box, Leaf Node]
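A range query over bounding boxes can be sketched as follows. This is a simplified illustration in Python rather than F95: it bisects on position only (not on width or magnitude), and the item format `(label, center, half_width)` and all function names are assumptions.

```python
def overlaps(a, b):
    """True if axis-aligned boxes a and b, each given as (lo, hi), intersect."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))

def merge(boxes):
    """Smallest box enclosing all the given boxes."""
    dims = range(len(boxes[0][0]))
    lo = tuple(min(b[0][d] for b in boxes) for d in dims)
    hi = tuple(max(b[1][d] for b in boxes) for d in dims)
    return (lo, hi)

def build(items, depth=0):
    """items: list of (label, center, half_width) density elements."""
    boxes = [(tuple(c - w for c in ctr), tuple(c + w for c in ctr))
             for _, ctr, w in items]
    node = {"box": merge(boxes), "item": None, "children": ()}
    if len(items) == 1:
        node["item"] = items[0]
        return node
    axis = depth % len(items[0][1])          # cycle through x, y, z
    items = sorted(items, key=lambda it: it[1][axis])
    mid = len(items) // 2
    node["children"] = (build(items[:mid], depth + 1),
                        build(items[mid:], depth + 1))
    return node

def range_query(node, box, found=None):
    """Collect labels of all elements whose bounding box overlaps `box`."""
    found = [] if found is None else found
    if not overlaps(node["box"], box):
        return found                          # prune the whole subtree
    if node["item"] is not None:
        found.append(node["item"][0])
    for child in node["children"]:
        range_query(child, box, found)
    return found

items = [("a", (0.0, 0.0, 0.0), 0.5), ("b", (1.0, 0.0, 0.0), 0.5),
         ("c", (5.0, 5.0, 5.0), 1.0), ("d", (5.0, 0.0, 0.0), 0.2)]
tree = build(items)
hits = range_query(tree, ((0.4, -0.1, -0.1), (0.6, 0.1, 0.1)))
print(sorted(hits))
```

The query touches only subtrees whose bounding box intersects the target box, so distant density elements are discarded with a single box-box test.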
18 k-d tree Data Structures Allow Early Onset Linear Scaling
- Use tests involving only box-box overlap
- Very fast access of minimally essential data
- True linear scaling for 3-D systems
Building a Hierarchical Grid for
Exchange-Correlation Cubature
Cube with integration grid and bounding box
Coulomb sums with a Tree Code
ρ-tree
Error in each leaf-cube below the threshold
Evaluate density on cube grid
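The idea of refining a grid until the error in each leaf falls under a tolerance can be shown with a 1-D analogue. The sketch below is generic adaptive bisection with a trapezoid error estimate, not the HiCu algorithm; the integrand, the tolerance name `tau`, and the function name are assumptions.

```python
import math

def adaptive_cells(f, lo, hi, tau, depth=0, max_depth=30):
    """Bisect [lo, hi] until each cell's error estimate is below tau.

    Returns (integral, cells), where cells is the list of leaf intervals.
    """
    mid = 0.5 * (lo + hi)
    coarse = 0.5 * (hi - lo) * (f(lo) + f(hi))                  # one trapezoid
    fine = 0.25 * (hi - lo) * (f(lo) + 2.0 * f(mid) + f(hi))    # two trapezoids
    if abs(fine - coarse) < tau or depth >= max_depth:
        return fine, [(lo, hi)]
    left, left_cells = adaptive_cells(f, lo, mid, tau, depth + 1, max_depth)
    right, right_cells = adaptive_cells(f, mid, hi, tau, depth + 1, max_depth)
    return left + right, left_cells + right_cells

# A sharply peaked model "density" (assumed for illustration).
f = lambda x: math.exp(-50.0 * x * x)
total, cells = adaptive_cells(f, -4.0, 4.0, tau=1e-6)
widths = [hi - lo for lo, hi in cells]
print(total, len(cells), min(widths), max(widths))
```

The leaf cells come out small near the peak and large where the function is negligible, which is the behaviour a hierarchical cubature grid is after.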
19 The HiCu grid for RB3LYP/6-31G (H2O)70
Early onset of O(N) for RB3LYP/6-31G (H2O)N
20 Paradigms for Irregular Parallel Computation
- ORB Decomposition
  - Recursive bisection to generate (ideally) equivalent units of work
- Tiling
  - Tethering data to a spatial neighborhood and processor
  - Limited overlapping of data between processors allows work to move between neighboring processors while retaining locality
Ideal work performed with all volumes equal
(ideal case)
Additional work that can be performed to achieve
balance
Locally essential data supporting encompassed
volumes
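Recursive bisection into roughly equal units of work can be sketched as below. This is a minimal illustration that assumes per-point work estimates are known in advance; the function name `orb` and the choice to cut along the axis of largest extent are assumptions, not details taken from the slides.

```python
def orb(points, weights, levels):
    """Split point indices into up to 2**levels groups of similar total weight."""
    def split(idx, level):
        if level == 0 or len(idx) < 2:
            return [idx]
        dims = range(len(points[idx[0]]))
        # Cut perpendicular to the axis with the largest spatial extent.
        axis = max(dims, key=lambda d: max(points[i][d] for i in idx)
                                       - min(points[i][d] for i in idx))
        idx = sorted(idx, key=lambda i: points[i][axis])
        half = 0.5 * sum(weights[i] for i in idx)
        running, cut = 0.0, len(idx)
        for k, i in enumerate(idx):
            running += weights[i]
            if running >= half:
                cut = k + 1
                break
        cut = min(cut, len(idx) - 1)          # keep both halves non-empty
        return split(idx[:cut], level - 1) + split(idx[cut:], level - 1)
    return split(list(range(len(points))), levels)

# A 4 x 4 grid of points with unit work each, split over 4 processors.
points = [(x, y) for x in range(4) for y in range(4)]
weights = [1.0] * len(points)
groups = orb(points, weights, levels=2)
print([[points[i] for i in g] for g in groups])
```

With uneven weights the cuts move so that each side carries about half the work, which is where the irregular volumes come from.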
21 Paradigms for Irregular Parallel Computation
- Space filling curves for atom ordering
  - BWR curve yields banded matrices for 3-D systems
  - One curve works for all matrices S, F, P, Z
- SFCs for volume ordering and decomposition
  - Sectioning yields a locality preserving decomposition
[Figure legend: ideal work load decomposition; work distribution possible with tiling]
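Sectioning amounts to cutting the curve-ordered list of work items into contiguous pieces of similar total work. The sketch below assumes the items are already sorted along the curve and that a work estimate per item is available; the function name `section` is hypothetical.

```python
def section(work, n_parts):
    """Cut a curve-ordered list of work estimates into contiguous sections.

    Returns a list of (start, stop) index pairs, one per section.
    """
    total = sum(work)
    bounds, running, part = [0], 0.0, 1
    for i, w in enumerate(work):
        running += w
        # Close a section each time the running sum passes its target.
        while part < n_parts and running >= total * part / n_parts:
            bounds.append(i + 1)
            part += 1
    while len(bounds) < n_parts:              # guard against rounding
        bounds.append(len(work))
    bounds.append(len(work))
    return [(bounds[k], bounds[k + 1]) for k in range(n_parts)]

work = [1, 1, 1, 1, 4, 1, 1, 1, 1]            # one expensive item mid-curve
parts = section(work, 3)
print(parts, [sum(work[a:b]) for a, b in parts])
```

Each section is a contiguous stretch of the curve, so it inherits whatever spatial locality the curve provides. A single item heavier than the per-section target can leave a neighbouring section empty, which is one reason a static cut like this is usually paired with dynamic balancing.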
22 Why Static Load Balancing Does Not Scale
- For example, static load balancing by attempting to evenly distribute non-zero matrix elements of a target matrix (F or P)
- One decomposition does not fit all (multiplies involving disparate graphs).
- Requires global algorithms that do not scale (e.g. for solving the bin packing problem).
23 Paradigms for Irregular Parallel Computation
- Diffusion Based Load Balancing
  - Distributed method involves only local communications between neighbors
  - Hydrodynamic-like: work percolates between processors with unequal loads
  - Proven scalability
  - Does not require exact work load estimation
  - Data locality enabled with tiling
  - Sender vs receiver initiated diffusion depends on granularity
[Figure: estimated work load on processors 0 to 3, shown against a low water mark. Caption: Redistribution of work load between processors 0 and 1 with receiver initiated diffusion]
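The local character of diffusion can be seen in a few lines. This sketch is a plain synchronous first-order diffusion step on a 1-D chain of processors with continuous loads; it does not model the receiver initiated, low water mark variant in the figure, and the coefficient `alpha` and function name are assumptions.

```python
def diffuse(loads, alpha=0.25, steps=1):
    """First-order diffusion of work along a 1-D chain of processors.

    Each step moves alpha * (load difference) across every neighbour pair,
    so only nearest-neighbour information is used.
    """
    loads = [float(x) for x in loads]
    for _ in range(steps):
        # Flows are computed from the loads at the start of the step.
        flows = [alpha * (loads[i] - loads[i + 1])
                 for i in range(len(loads) - 1)]
        for i, flow in enumerate(flows):
            loads[i] -= flow
            loads[i + 1] += flow
    return loads

start = [10, 2, 6, 6]
print(diffuse(start, steps=1))
print(diffuse(start, steps=200))
```

Total work is conserved at every step, and for this chain the loads relax toward the mean without any processor needing a global view. In practice work comes in discrete units, so the transferred amount would be rounded to whole tasks.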
24 Summary
- The promise of O(N) ab initio methods will only be realized with scalable parallelism.
- Data locality is central to achieving scaling with p/N
  - This competes with achieving a uniform work load
- We are developing and extending globally applicable strategies for irregular parallel computation to achieve scalability and interoperability of the following O(N) algorithms:
  - Exchange-Correlation (HiCu)
  - Exact Exchange (ONX)
  - Coulomb Summation (QCTC)
  - SCF Equations (SDMM)
  - Orthogonalization (BlokAINV)