Title: Edoardo Aprà
1 Edoardo Aprà
- Materials Chemistry Applications on the ORNL XT3
2 Acknowledgments
- Part of this work is funded by the U.S. Department of Energy, Office of Advanced Scientific Computing Research, and by the Division of Basic Energy Sciences, Office of Science, under contract DE-AC05-00OR22725 with Oak Ridge National Laboratory.
- This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory under contract DE-AC05-00OR22725.
3 Motivation
- CHM022 INCITE Allocation
- The Chemical Endstation brings together researchers from 12 academic and government laboratories, representing research projects funded by DOE, NSF, and other agencies, with a common interest in the rational design of catalysts
4 Scalable and Accurate First Principles Method
Atomistic Methods
5 Application Software
- Plane-wave basis
- CPMD, VASP, Dacapo, Espresso, Parsec
- Gaussian basis
- NWChem
6 The Quantum-ESPRESSO Software Distribution
- Quantum-ESPRESSO stands for Quantum opEn-Source Package for Research in Electronic Structure, Simulation, and Optimization
- An initiative by DEMOCRITOS (Trieste), in collaboration with CINECA Bologna, Princeton University, EPFL, MIT Boston, and many other individuals, aimed at the development of high-quality scientific software
- Released under a free license (GNU GPL)
- Written in Fortran 90, with a modern approach
- Efficient, parallelized (MPI), portable
- Integrated suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale
7 The Quantum ESPRESSO suite of ab-initio codes
- PWscf (Trieste, Lausanne, Pisa): self-consistent electronic structure, structural relaxation, BO molecular dynamics, linear response (phonons, dielectric properties)
- CP (Lausanne, Princeton): (variable-cell) Car-Parrinello molecular dynamics
- FPMD (Trieste, Bologna): also (variable-cell) Car-Parrinello molecular dynamics
- Plus a number of utilities for graphical input (GUI), molecular graphics (XCrysDen), and output postprocessing, including Wannier-function generation, pseudopotential generation, etc.
8 QE - Technical characteristics (algorithms)
- Use of iterative techniques: the Hamiltonian is stored as an operator, not as a matrix. All standard PW technicalities (FFT, dual space, etc.) are used.
- Iterative diagonalization is used whenever it is useful.
- Fast "double-grid" implementation for ultrasoft PPs: the cutoff for the augmentation part of the charge density can be larger (the corresponding FFT grid denser in real space) than the cutoff for its smooth part. CP only: very fast "box grid" implementation.
- Parallelization is performed on both PWs and FFT grids, using a parallel 3D FFT algorithm with good scaling in the number of processors (memory use also scales).
- Parallelization on k-points is also available, by dividing the processors into "pools" and distributing k-points across pools of processors (see the communicator sketch below).
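A minimal sketch of the pool idea, written with mpi4py purely for illustration (QE's actual implementation is Fortran/MPI; the number of pools and the round-robin k-point assignment below are assumptions):

```python
# Illustrative sketch: split MPI_COMM_WORLD into "pools" and assign
# k-points round-robin across pools; PWs/FFT planes are then distributed
# inside each pool. Assumes the rank count is divisible by npool.
from mpi4py import MPI

world = MPI.COMM_WORLD
npool = 4                                      # assumed number of pools
ranks_per_pool = world.Get_size() // npool
pool_id = world.Get_rank() // ranks_per_pool

# ranks with the same color end up in the same pool communicator
pool_comm = world.Split(color=pool_id, key=world.Get_rank())

nkpt = 16                                      # assumed number of k-points
my_kpoints = [k for k in range(nkpt) if k % npool == pool_id]
```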
9 QE - Technical features (implementation)
- Written mostly in Fortran 90, with some degree of sophistication (advanced F90 features)
- Portable optimization achieved by extensive use of standard (and free!) mathematical libraries (FFTW, BLAS, LAPACK)
- C-style preprocessing options allow various architecture-dependent features to be maintained in the same source tree
- Parallelization via MPI calls hidden in very few routines, allowing easy maintenance of a unified serial/parallel code
- Easy (or not-so-difficult) installation via the GNU configure utility
10 QE - CP code
- Car-Parrinello variable-cell molecular dynamics with ultrasoft PPs
- Developed by Alfredo Pasquarello (IRRMA, Lausanne), Kari Laasonen (Oulu), Andrea Trave (LLNL), Roberto Car (Princeton), Nicola Marzari (MIT), Paolo Giannozzi, Carlo Cavazzoni and Gerardo Ballabio (CINECA), Sandro Scandolo (ICTP), Guido Chiarotti (SISSA), Paolo Focher
- Verlet dynamics with mass preconditioning
- Temperature control: Nosé thermostat for both electrons and ions, velocity rescaling
- Variable-cell (Parrinello-Rahman) dynamics
- Damped dynamics for electronic and ionic minimization
- Modified kinetic functional for constant-pressure calculations
- "Grid box" for fast treatment of augmentation terms in ultrasoft PPs
- Metallic systems: variable-occupancy dynamics
- Nudged Elastic Band (NEB) for energy barriers and reaction paths
- Dynamics with Wannier functions
- Limitations
  - no k-points
11 PWSCF code
- Developed by S. Baroni, S. de Gironcoli, A. Dal Corso (SISSA), P. Giannozzi, and others
- Self-consistent ground-state energy and Kohn-Sham orbitals, forces, structural optimization
- Spin-orbit coupling and non-collinear magnetization
- Molecular dynamics on the ground-state Born-Oppenheimer surface
- Variable-cell molecular dynamics with modified kinetic functional
- Phonon frequencies and eigenvectors at a generic wave vector, interatomic force constants in real space, effective charges and dielectric tensors, electron-phonon interaction coefficients for metals
- Macroscopic polarization via Berry phase
- Nudged Elastic Band and Fourier String Method schemes for transition paths and energy barriers
- Third-order anharmonic phonon lifetimes, nonresonant Raman cross sections
- Limitations
  - no Car-Parrinello dynamics
  - very limited constrained minimization and dynamics
12 Flow chart
13 Basics of the Plane Wave Method
Core electrons are removed through pseudopotential generation, so only the valence electrons are treated explicitly.
- Important advantages
  - Plane-wave basis operations map onto a discrete numerical grid (the expansion is written out below)
  - Preconditioned conjugate-gradient iterative techniques can be used
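For reference, the standard textbook form of the expansion these codes rely on (not taken verbatim from the slides): each Kohn-Sham orbital is expanded in plane waves up to a kinetic-energy cutoff,

\[
\psi_{n\mathbf{k}}(\mathbf{r}) = \frac{1}{\sqrt{\Omega}} \sum_{\mathbf{G}} c_{n\mathbf{k}}(\mathbf{G})\, e^{i(\mathbf{k}+\mathbf{G})\cdot\mathbf{r}},
\qquad \frac{\hbar^{2}}{2m}\,\lvert\mathbf{k}+\mathbf{G}\rvert^{2} \le E_{\mathrm{cut}},
\]

so orbitals, densities, and potentials can be moved back and forth between the reciprocal-space coefficients and values on the real-space FFT grid.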
14 Computational Techniques and Libraries
- Fast Fourier Transforms
- Machine library or FFTW
- Diagonalization
- Direct methods
- Preconditioned iterative methods (CG)
- Dense Linear Algebra
- LAPACK and BLAS
15 FFT Decomposition
- Data structure
  - Global array: N x N x N
  - Local array: N x N x (N/p)
- Algorithm (see the sketch below)
  - Perform local FFT along X
  - Perform local FFT along Y
  - Transpose array to get Z values in local memory
  - Perform local FFT along Z
- Complexity
  - Number of FFTs per CPU: N^2/p
  - FLOPs per FFT: 5 N log2(N)
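A single-process sketch of this slab decomposition, assuming numpy; the explicit transpose stands in for the MPI all-to-all that a real parallel code would perform, and the result comes out in (Z, Y, X) order:

```python
import numpy as np

def slab_fft3d(slabs):
    """3D FFT of data distributed as N x N x (N/p) slabs along Z.
    Returns p slabs of the transform, distributed along X, in (Z, Y, X) order."""
    p = len(slabs)
    # steps 1-2: local FFTs along X and Y on each slab
    slabs = [np.fft.fft(np.fft.fft(s, axis=0), axis=1) for s in slabs]
    # step 3: "transpose" so that Z becomes local (MPI_Alltoall in a parallel code)
    full = np.concatenate(slabs, axis=2).transpose(2, 1, 0)
    slabs = np.array_split(full, p, axis=2)
    # step 4: local FFT along the now-local Z axis
    return [np.fft.fft(s, axis=0) for s in slabs]

# quick check against numpy's full 3D FFT
N, p = 8, 4
data = np.random.rand(N, N, N)
ref = np.fft.fftn(data).transpose(2, 1, 0)
out = np.concatenate(slab_fft3d(np.array_split(data, p, axis=2)), axis=2)
assert np.allclose(out, ref)
```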
16 Cray XT3 Scaling
- Vienna Ab-initio Simulation Package (VASP)
- Based on BCC Cu
- Single k-point (0.0,0.0,0.0)
- 400 eV Plane wave cutoff
- Small system sizes
- Forward and backward FFT transform
- Large system sizes
- Davidson diagonalization
- Forward and backward FFT transform
17 Cray XT3 Scaling (fixed system size)
- Fixed number of plane waves
- Changing the number of plane waves per processor
- Optimal density of atoms per node
  - For the buckyball: 1.9 atoms/node
  - Thus, to run a 1000-atom system optimally would require about 500 processors
18 Quantum Espresso Performance
Cray XT3 located in the NCCS at ORNL
(Scaling plot; data for 2-16, 16-54, and 54-128 processors)
19 Quantum Espresso Performance
Roberto Ansaloni, Cray Workshop on High-Performance Computing, ETH Zurich, September 4-5, 2006
20 Porphyrin-Functionalized Nanotube
- New materials for solar-energy applications
- Relatively simple, synthetically feasible (at ORNL-UT) mimics of light-harvesting antenna units
- Porphyrin molecules are the light-absorbing antennae, and the nanotube may provide a conducting channel (or at least an electron acceptor)
- Key questions
  - Do the covalently attached porphyrins undergo facile absorption of visible light and transfer electrons onto the tube?
  - What efficiencies does one obtain?
21 Porphyrin-Functionalized Nanotube
- Problem size ranges from 1532 atoms (~60 Å) to 6128 atoms (~202 Å by 60 Å)
- Key research questions to address
  - How does porphyrin attach to the nanotube?
  - How does the electronic structure change as porphyrin molecules are added to the nanotube (up to 22% by weight has been observed experimentally)?
  - How is the conductance affected by surface orientation and composition?
22 Problem Requirements
- Lattice constants: approximately z = 381.72 a.u., x,y = 113.38 a.u.
- Approximately 6128 atoms (C, N, O, H)
- Energy cutoff: 25 Ry
- Produces a mesh of 608 x 180 x 180 = 19,699,200 points
- Number of states: 10,464
- RAM for the Hamiltonian matrix: (10,464)^2 x 16 bytes ~ 1.6 GBytes (or 0.16 MBytes per processor on 10,000 processors)
- Storing the double-precision real wavefunction per k-point: 19,699,200 x 10,464 x 1 x 8 bytes ~ 1.6 TBytes
  - On 10,000 processors this would require 0.16 GBytes/processor
  - This would require a bandwidth of ~30 GBytes/sec to read in or write out the data in 1 minute
- Storing the double-precision real charge density would require 0.16 GBytes
- For a 1532-atom system, divide all numbers by 4
- (The arithmetic is reproduced in the sketch below)
23 Espresso - XT4 Benchmark
24 PARSEC
- A computer code that solves the Kohn-Sham equations by expressing electron wave functions directly in real space, without the use of explicit basis sets
- Uses norm-conserving pseudopotentials (e.g. Troullier-Martins)
- Designed for ab initio quantum-mechanical calculations of the electronic structure of matter, within Density Functional Theory (DFT)
25 http://www.ices.utexas.edu/parsec
Features
- High order finite difference expansion of
differential operators (Hamiltonian matrix is
sparse). - Scalability in parallel environment.
- Confined and periodic boundary conditions.
- Molecular dynamics.
- Open Source
26 Algorithm
- Wave functions ψ_i(r) are calculated on a grid
- Regular grid (Cartesian for simplicity)
- Laplacian evaluated using finite differences (see the sketch below)
- Code parallelized with respect to the number of grid points
- Source code: Fortran 95
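A toy sketch of the finite-difference Laplacian, assuming numpy, a 4th-order stencil, and periodic boundaries for brevity (PARSEC itself uses higher orders and either confined or periodic conditions):

```python
import numpy as np

# standard 4th-order central-difference coefficients for d^2/dx^2
coeffs = [(-2, -1/12), (-1, 4/3), (0, -5/2), (1, 4/3), (2, -1/12)]

def fd_laplacian(u, h):
    """Apply the finite-difference Laplacian to a 3D grid function u
    with spacing h, using periodic (wrap-around) boundaries."""
    out = np.zeros_like(u)
    for axis in range(3):
        for shift, c in coeffs:
            out += c * np.roll(u, shift, axis=axis)
    return out / h**2

# each grid point couples only to a few neighbours per axis,
# so the corresponding Hamiltonian matrix is sparse
```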
27 Chebyshev subspace filtering
Given a set of basis vectors ψ_i, filter according to ψ_i ← P_m(H) ψ_i, where P_m is a Chebyshev polynomial of degree m. The vectors ψ_i will have their projection onto the desired subspace enhanced (a minimal sketch follows).
Window for filtering: [figure]
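A minimal dense-matrix sketch of the filtering step using the standard three-term Chebyshev recurrence (this is an illustration with assumed inputs, not PARSEC's code); `a` and `b` bound the unwanted, higher part of the spectrum to be damped:

```python
import numpy as np

def chebyshev_filter(H, X, m, a, b):
    """Return approximately P_m(H) @ X, where P_m is the degree-m Chebyshev
    polynomial mapped so that eigencomponents in [a, b] are damped and
    those below a are amplified."""
    e = (b - a) / 2.0                          # half-width of the filtered interval
    c = (b + a) / 2.0                          # its center
    Y = (H @ X - c * X) / e                    # T_1 applied to the block of vectors
    for _ in range(2, m + 1):
        Y_new = 2.0 * (H @ Y - c * Y) / e - X  # T_k = 2 x T_{k-1} - T_{k-2}
        X, Y = Y, Y_new
    return Y
```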
28 Traditional Approach
Most of the time is spent in the diagonalization part. One can use ARPACK (a variant of the Lanczos algorithm) or other exact eigensolvers.
29 Kohn-Sham SCF with filtering
Exact diagonalization.
Chebyshev filtering, much faster than exact
diagonalization.
30 PARSEC model input
- Si2713PH828
- Silicon nanocrystal, a prototypical semiconductor nanocrystal
- Surface passivated with hydrogen atoms
- The P atom adds three degenerate levels very close to the LUMO, making the nanocrystal itself virtually gap-less. Study the properties of P-doped silicon nanocrystals.
31 Parsec XT4 Benchmark
32 Performance - Summary
  Code       TFlops   Cores   Sust   Scalab
  RMG        7.78     4,096   36.5   50
  Espresso   7.68     4,096   36     50
  PARSEC     7.45     4,096   35     50
33 Why Was NWChem Developed?
- Developed as part of the construction of the Environmental Molecular Sciences Laboratory (EMSL) at Pacific Northwest National Laboratory
- Designed and developed to be a highly efficient and portable massively parallel computational chemistry package
- Provides computational chemistry solutions that are scalable with respect to both chemical system size and MPP hardware size
34 What is NWChem used for?
- Provides major modeling and simulation capability for molecular science
- Broad range of molecules, including biomolecules, nanoparticles, and heavy elements
- Electronic structure of molecules (non-relativistic, relativistic, structural optimizations, and vibrational analysis)
- Increasingly extensive solid-state capability (plane-wave DFT, CPMD)
- Molecular dynamics, molecular mechanics
35 Molecular Science Software Group
36 GA Tools Overview
- Shared-memory programming model in the context of distributed dense arrays
- Complete environment for parallel code development
- Compatible with MPI
- Data-locality control similar to the distributed-memory / message-passing model
- Extensible and scalable
- Compatible with other libraries: ScaLAPACK, Peigs, ...
- Parallel and local I/O library (Pario)
37 Structure of GA
- Application programming language interface: Fortran 77, C, C++, Python, F90, Java, Babel
- Distributed arrays layer: memory management, index translation (see the sketch below)
- Message passing: global operations
- ARMCI: portable one-sided communication (put, get, locks, etc.)
- System-specific interfaces: LAPI, GM, threads, QSnet, IB, SHMEM, Portals
Global Arrays and MPI are completely interoperable; code can contain calls to both libraries.
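A toy illustration of the index-translation idea (this is not GA's actual API; the function name and blocking scheme are assumptions): map a global element index of a block-distributed 1D array to its owner rank and local offset.

```python
def owner_of(global_index, n, nprocs):
    """Owner rank and local index for element `global_index` of an
    n-element array block-distributed over nprocs ranks."""
    block = (n + nprocs - 1) // nprocs       # ceiling-division block size
    rank = global_index // block
    return rank, global_index - rank * block

# e.g. element 3500 of a 10000-element array over 8 ranks lives
# on rank 2 at local offset 1000
print(owner_of(3500, 10000, 8))              # (2, 1000)
```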
38 NWChem Architecture
- Object-oriented design
- abstraction, data hiding, APIs
- Parallel programming model
- non-uniform memory access, Global Arrays, MPI
- Infrastructure
- GA, Parallel I/O, RTDB, MA, Peigs, ...
- Program modules
- communication only through the database
- persistence for easy restart
39 Gaussian DFT computational kernel: evaluation of XC potential matrix elements

  my_next_task = SharedCounter()
  do i = 1, max_i
    if (i .eq. my_next_task) then
      call ga_get()
      ! (do work)
      call ga_acc()
      my_next_task = SharedCounter()
    endif
  enddo
  barrier()

\[
\rho(x_q) = \sum_{\mu\nu} D_{\mu\nu}\,\phi_\mu(x_q)\,\phi_\nu(x_q),
\qquad
F_{\lambda\sigma} = \sum_q w_q\,\phi_\lambda(x_q)\,V_{xc}[\rho(x_q)]\,\phi_\sigma(x_q)
\]

Both GA operations (ga_get and ga_acc) depend strongly on the communication latency (a stand-alone load-balancing sketch follows).
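The shared counter implements dynamic load balancing: whichever process finishes a task grabs the next global task id. A rough stand-alone analogue using Python multiprocessing (the counter, lock, and placeholder work stand in for SharedCounter() and the ga_get / XC quadrature / ga_acc steps):

```python
import multiprocessing as mp

def worker(rank, counter, lock, max_i):
    my_tasks = []
    while True:
        with lock:                    # atomic fetch-and-increment, like SharedCounter()
            i = counter.value
            counter.value += 1
        if i >= max_i:
            break
        my_tasks.append(i)            # placeholder for ga_get / do work / ga_acc
    print(f"rank {rank} handled tasks {my_tasks}")

if __name__ == "__main__":
    counter, lock = mp.Value("i", 0), mp.Lock()
    procs = [mp.Process(target=worker, args=(r, counter, lock, 16)) for r in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```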
40 NWChem porting issues on the XT3
- Re-use of the existing serial port for x86_64 (compilers)
- Communication library: GA/ARMCI
  - Two ARMCI ports: Cray SHMEM and Portals
- Made the Pario library compatible with Catamount (using glibc calls)
41 NWChem Porting - II
- Stability issues with the SHMEM port
  - Uncovered a Portals problem; the Cray SHMEM group provided a workaround
  - NWChem crashes the whole XT when running large jobs
- Performance of the SHMEM port
  - Latency: ~10 usec
  - Bandwidth
    - Contiguous put/get: ~2 GB/sec
    - Strided put/get: ~0.3 GB/sec
- ARMCI port using Portals in progress
42 (H2O)7, 287 basis functions, aug-cc-pVDZ, MP2 energy and gradient
43 Replicated Data vs. Distributed Data on XT3
Si28O67H30 1687 Basis f. LDA wavefunction
44 Parallel scaling of the DFT code of NWChem
Si28O67H30 1687 Basis f. LDA wavefunction
45 Si28O148H66 3554 Basis functions LDA
wavefunction
46 Si159O264H110 7108 Basis functions LDA
wavefunction
47 Si159O264H110 7108 Basis functions GGA
wavefunction
48 Thanks
- NWChem group and GA group - PNNL
- Bill Shelton, Bobby Sumpter - ORNL
- Roberto Ansaloni - Cray
- Murilo Tiago and Jim Chelikowsky - University of Texas
- Carlo Cavazzoni - CINECA, and the rest of the Quantum Espresso developers
49 Backup
50 Non-Blocking Communication
- Allows overlapping of communication and computation, resulting in latency hiding
- Non-blocking operations initiate a communication call and then return control to the application immediately
- The operation is completed locally by making a call to the wait routine (see the sketch below)
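A minimal sketch of the initiate/compute/wait pattern, assuming mpi4py for illustration (ARMCI's non-blocking put/get follow the same idea with one-sided operations):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

sendbuf = np.full(1_000_000, rank, dtype="d")
recvbuf = np.empty_like(sendbuf)

# initiate the transfers; control returns to the application immediately
reqs = [comm.Isend(sendbuf, dest=(rank + 1) % size, tag=0),
        comm.Irecv(recvbuf, source=(rank - 1) % size, tag=0)]

local_work = np.dot(sendbuf, sendbuf)   # computation overlapped with communication

MPI.Request.Waitall(reqs)               # complete the operations locally
```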
51 Si159O264H110, 7108 basis functions, LDA wavefunction, 2 SCF cycles; benchmark run on Itanium2/Elan4
52 Recent improvements in algorithm
- Use symmetry operations and reduce the real-space domain to an irreducible wedge
  - Tiago and Chelikowsky, in preparation
- Replace numerical diagonalization with Chebyshev subspace filtering
  - Zhou et al., J. Comp. Phys., in press (2006)
Impact in real calculations: in 1997, an SCF calculation of Si525H276 took 20 hours of CPU time on the Cray T3E (300 MHz) using 48 processors. Today, it takes 2 hours on one SGI Madison processor (1.3 GHz).