Title: Edoardo Aprà
1 Edoardo Aprà
- Materials Chemistry Applications on the ORNL XT3
2 Acknowledgments
- Part of this work is funded by the U.S. Department of Energy, Office of Advanced Scientific Computing Research, and by the Division of Basic Energy Sciences, Office of Science, under contract DE-AC05-00OR22725 with Oak Ridge National Laboratory.
- This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory under contract DE-AC05-00OR22725.
3 Motivation
- CHM022 INCITE Allocation
- The Chemical Endstation brings together researchers from 12 academic and government laboratories, representing research projects funded by DOE, NSF, and other agencies, with a common interest in the rational design of catalysts
4 Scalable and Accurate First Principles Method
Atomistic Methods
5 Application Software
- Plane-wave basis
- CPMD, VASP, Dacapo, Espresso, Parsec
- Gaussian basis
- NWChem
6 The Quantum-ESPRESSO Software Distribution
- Quantum-ESPRESSO stands for Quantum opEn-Source Package for Research in Electronic Structure, Simulation, and Optimization
- An initiative by DEMOCRITOS (Trieste), in collaboration with CINECA Bologna, Princeton University, EPFL, MIT Boston, and many other individuals, aimed at the development of high-quality scientific software
- Released under a free license (GNU GPL)
- Written in Fortran 90, with a modern approach
- Efficient, parallelized (MPI), portable
- Integrated suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale
7 The Quantum ESPRESSO suite of ab-initio codes
- PWscf (Trieste, Lausanne, Pisa): self-consistent electronic structure, structural relaxation, BO molecular dynamics, linear response (phonons, dielectric properties)
- CP (Lausanne, Princeton): (variable-cell) Car-Parrinello molecular dynamics
- FPMD (Trieste, Bologna): also (variable-cell) Car-Parrinello molecular dynamics
- Plus a number of utilities for graphical input (GUI), molecular graphics (XCrysDen), and output postprocessing, including Wannier-function generation, pseudopotential generation, etc.
8 QE - Technical characteristics (algorithms)
- Use of iterative techniques: the Hamiltonian is stored as an operator, not as a matrix. All standard PW technicalities (FFT, dual space, etc.) are used.
- Iterative diagonalization is used whenever it is useful.
- Fast "double-grid" implementation for ultrasoft PPs: the cutoff for the augmentation part of the charge density can be larger (the corresponding FFT grid denser in real space) than the cutoff for its smooth part. CP only: very fast "box grid" implementation.
- Parallelization is performed on both PWs and FFT grids, using a parallel 3D FFT algorithm with good scaling in the number of processors (memory use also scales).
- Parallelization on k-points is also available, by dividing the processors into "pools" and distributing k-points across pools of processors (see the communicator sketch below).
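A minimal sketch of the pool idea, written with mpi4py purely for illustration (QE's actual implementation is Fortran/MPI; the number of pools and the round-robin k-point assignment below are assumptions):

```python
# Illustrative sketch: split MPI_COMM_WORLD into "pools" and assign
# k-points round-robin across pools; PWs/FFT planes are then distributed
# inside each pool. Assumes the rank count is divisible by npool.
from mpi4py import MPI

world = MPI.COMM_WORLD
npool = 4                                      # assumed number of pools
ranks_per_pool = world.Get_size() // npool
pool_id = world.Get_rank() // ranks_per_pool

# ranks with the same color end up in the same pool communicator
pool_comm = world.Split(color=pool_id, key=world.Get_rank())

nkpt = 16                                      # assumed number of k-points
my_kpoints = [k for k in range(nkpt) if k % npool == pool_id]
```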
9 QE - Technical features (implementation)
- Written mostly in Fortran 90, with some degree of sophistication (advanced F90 features)
- Portable optimization achieved by extensive use of standard (and free!) mathematical libraries (FFTW, BLAS, LAPACK)
- C-style preprocessing options allow various architecture-dependent features to be maintained in the same source tree
- Parallelization via MPI calls hidden in very few routines, allowing easy maintenance of a unified serial/parallel code
- Easy (or not-so-difficult) installation via the GNU configure utility
10 QE - CP code
- Car-Parrinello variable-cell molecular dynamics with ultrasoft PPs
- Developed by Alfredo Pasquarello (IRRMA, Lausanne), Kari Laasonen (Oulu), Andrea Trave (LLNL), Roberto Car (Princeton), Nicola Marzari (MIT), Paolo Giannozzi, Carlo Cavazzoni and Gerardo Ballabio (CINECA), Sandro Scandolo (ICTP), Guido Chiarotti (SISSA), Paolo Focher
- Verlet dynamics with mass preconditioning
- Temperature control: Nosé thermostat for both electrons and ions, velocity rescaling
- Variable-cell (Parrinello-Rahman) dynamics
- Damped dynamics for electronic and ionic minimization
- Modified kinetic functional for constant-pressure calculations
- "Grid box" for fast treatment of augmentation terms in ultrasoft PPs
- Metallic systems: variable-occupancy dynamics
- Nudged Elastic Band (NEB) for energy barriers and reaction paths
- Dynamics with Wannier functions
- Limitations
  - no k-points
11 PWSCF code
- Developed by S. Baroni, S. de Gironcoli, A. Dal Corso (SISSA), P. Giannozzi, and others
- Self-consistent ground-state energy and Kohn-Sham orbitals, forces, structural optimization
- Spin-orbit coupling and non-collinear magnetization
- Molecular dynamics on the ground-state Born-Oppenheimer surface
- Variable-cell molecular dynamics with modified kinetic functional
- Phonon frequencies and eigenvectors at a generic wave vector, interatomic force constants in real space, effective charges and dielectric tensors, electron-phonon interaction coefficients for metals
- Macroscopic polarization via Berry phase
- Nudged Elastic Band and Fourier String Method schemes for transition paths and energy barriers
- Third-order anharmonic phonon lifetimes, nonresonant Raman cross sections
- Limitations
  - no Car-Parrinello dynamics
  - very limited constrained minimization and dynamics
12 Flow chart
13 Basics of the Plane Wave Method
Core electrons are removed through pseudopotential generation, so only the valence electrons are treated explicitly.
- Important advantages
  - Plane-wave basis operations map onto a discrete numerical grid (the expansion is written out below)
  - Preconditioned conjugate-gradient iterative techniques can be used
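For reference, the standard textbook form of the expansion these codes rely on (not taken verbatim from the slides): each Kohn-Sham orbital is expanded in plane waves up to a kinetic-energy cutoff,

\[
\psi_{n\mathbf{k}}(\mathbf{r}) = \frac{1}{\sqrt{\Omega}} \sum_{\mathbf{G}} c_{n\mathbf{k}}(\mathbf{G})\, e^{i(\mathbf{k}+\mathbf{G})\cdot\mathbf{r}},
\qquad \frac{\hbar^{2}}{2m}\,\lvert\mathbf{k}+\mathbf{G}\rvert^{2} \le E_{\mathrm{cut}},
\]

so orbitals, densities, and potentials can be moved back and forth between the reciprocal-space coefficients and values on the real-space FFT grid.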
14 Computational Techniques and Libraries
- Fast Fourier Transforms
- Machine library or FFTW
- Diagonalization
- Direct methods
- Preconditioned iterative methods (CG)
- Dense Linear Algebra
- LAPACK and BLAS
15 FFT Decomposition
- Data structure
  - Global array: N x N x N
  - Local array: N x N x (N/p)
- Algorithm (see the sketch below)
  - Perform local FFT along X
  - Perform local FFT along Y
  - Transpose array to get Z values in local memory
  - Perform local FFT along Z
- Complexity
  - Number of FFTs per CPU: N^2/p
  - FLOPs per FFT: 5 N log2(N)
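A single-process sketch of this slab decomposition, assuming numpy; the explicit transpose stands in for the MPI all-to-all that a real parallel code would perform, and the result comes out in (Z, Y, X) order:

```python
import numpy as np

def slab_fft3d(slabs):
    """3D FFT of data distributed as N x N x (N/p) slabs along Z.
    Returns p slabs of the transform, distributed along X, in (Z, Y, X) order."""
    p = len(slabs)
    # steps 1-2: local FFTs along X and Y on each slab
    slabs = [np.fft.fft(np.fft.fft(s, axis=0), axis=1) for s in slabs]
    # step 3: "transpose" so that Z becomes local (MPI_Alltoall in a parallel code)
    full = np.concatenate(slabs, axis=2).transpose(2, 1, 0)
    slabs = np.array_split(full, p, axis=2)
    # step 4: local FFT along the now-local Z axis
    return [np.fft.fft(s, axis=0) for s in slabs]

# quick check against numpy's full 3D FFT
N, p = 8, 4
data = np.random.rand(N, N, N)
ref = np.fft.fftn(data).transpose(2, 1, 0)
out = np.concatenate(slab_fft3d(np.array_split(data, p, axis=2)), axis=2)
assert np.allclose(out, ref)
```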
16 Cray XT3 Scaling
- Vienna Ab-initio Simulation Package (VASP)
- Based on BCC Cu
- Single k-point (0.0,0.0,0.0)
- 400 eV Plane wave cutoff
- Small system sizes
- Forward and backward FFT transform
- Large system sizes
- Davidson diagonalization
- Forward and backward FFT transform
17 Cray XT3 Scaling (fixed system size)
- Fixed number of plane waves
- Changing the number of plane waves per processor
- Optimal density of atoms per node
  - For the buckyball: 1.9 atoms/node
  - Thus, to run a 1000-atom system optimally would require about 500 processors
18 Quantum Espresso Performance
Cray XT3 located in the NCCS at ORNL
(Scaling plot; data for 2-16, 16-54, and 54-128 processors)
19 Quantum Espresso Performance
Roberto Ansaloni, Cray Workshop on High-Performance Computing, ETH Zurich, September 4-5, 2006
20 Porphyrin-Functionalized Nanotube
- New materials for solar-energy applications
- Relatively simple, synthetically feasible (at ORNL-UT) mimics of light-harvesting antenna units
- Porphyrin molecules are the light-absorbing antennae, and the nanotube may provide a conducting channel (or at least an electron acceptor)
- Key questions
  - Do the covalently attached porphyrins undergo facile absorption of visible light and transfer electrons onto the tube?
  - What efficiencies does one obtain?
21 Porphyrin-Functionalized Nanotube
- Problem size ranges from 1532 atoms (~60 Å) to 6128 atoms (~202 Å by 60 Å)
- Key research questions to address
  - How does porphyrin attach to the nanotube?
  - How does the electronic structure change as porphyrin molecules are added to the nanotube (up to 22% by weight has been observed experimentally)?
  - How is the conductance affected by surface orientation and composition?
22 Problem Requirements
- Lattice constants: approximately z = 381.72 a.u., x,y = 113.38 a.u.
- Approximately 6128 atoms (C, N, O, H)
- Energy cutoff: 25 Ry
- Produces a mesh of 608 x 180 x 180 = 19,699,200 points
- Number of states: 10,464
- RAM for the Hamiltonian matrix: (10,464)^2 x 16 bytes ~ 1.6 GBytes (or 0.16 MBytes per processor on 10,000 processors)
- Storing the double-precision real wavefunction per k-point: 19,699,200 x 10,464 x 1 x 8 bytes ~ 1.6 TBytes
  - On 10,000 processors this would require 0.16 GBytes/processor
  - This would require a bandwidth of ~30 GBytes/sec to read in or write out the data in 1 minute
- Storing the double-precision real charge density would require 0.16 GBytes
- For a 1532-atom system, divide all numbers by 4
- (The arithmetic is reproduced in the sketch below)
23 Espresso - XT4 Benchmark
24 PARSEC
- A computer code that solves the Kohn-Sham equations by expressing electron wave functions directly in real space, without the use of explicit basis sets
- Uses norm-conserving pseudopotentials (e.g. Troullier-Martins)
- Designed for ab initio quantum-mechanical calculations of the electronic structure of matter, within Density Functional Theory (DFT)
25 http://www.ices.utexas.edu/parsec
Features
- High order finite difference expansion of
differential operators (Hamiltonian matrix is
sparse). - Scalability in parallel environment.
- Confined and periodic boundary conditions.
- Molecular dynamics.
- Open Source
26 Algorithm
- Wave functions ψ_i(r) are calculated on a grid
- Regular grid (Cartesian for simplicity)
- Laplacian evaluated using finite differences (see the sketch below)
- Code parallelized with respect to the number of grid points
- Source code: Fortran 95
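A toy sketch of the finite-difference Laplacian, assuming numpy, a 4th-order stencil, and periodic boundaries for brevity (PARSEC itself uses higher orders and either confined or periodic conditions):

```python
import numpy as np

# standard 4th-order central-difference coefficients for d^2/dx^2
coeffs = [(-2, -1/12), (-1, 4/3), (0, -5/2), (1, 4/3), (2, -1/12)]

def fd_laplacian(u, h):
    """Apply the finite-difference Laplacian to a 3D grid function u
    with spacing h, using periodic (wrap-around) boundaries."""
    out = np.zeros_like(u)
    for axis in range(3):
        for shift, c in coeffs:
            out += c * np.roll(u, shift, axis=axis)
    return out / h**2

# each grid point couples only to a few neighbours per axis,
# so the corresponding Hamiltonian matrix is sparse
```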
27 Chebyshev subspace filtering
Given a set of basis vectors ψ_i, filter according to ψ_i ← P_m(H) ψ_i, where P_m is a Chebyshev polynomial of degree m. The vectors ψ_i will have their projection onto the desired subspace enhanced (a minimal sketch follows).
Window for filtering: [figure]
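A minimal dense-matrix sketch of the filtering step using the standard three-term Chebyshev recurrence (this is an illustration with assumed inputs, not PARSEC's code); `a` and `b` bound the unwanted, higher part of the spectrum to be damped:

```python
import numpy as np

def chebyshev_filter(H, X, m, a, b):
    """Return approximately P_m(H) @ X, where P_m is the degree-m Chebyshev
    polynomial mapped so that eigencomponents in [a, b] are damped and
    those below a are amplified."""
    e = (b - a) / 2.0                          # half-width of the filtered interval
    c = (b + a) / 2.0                          # its center
    Y = (H @ X - c * X) / e                    # T_1 applied to the block of vectors
    for _ in range(2, m + 1):
        Y_new = 2.0 * (H @ Y - c * Y) / e - X  # T_k = 2 x T_{k-1} - T_{k-2}
        X, Y = Y, Y_new
    return Y
```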
28 Traditional Approach
Most of the time is spent in the diagonalization part. One can use ARPACK (a variant of the Lanczos algorithm) or other exact eigensolvers.
29 Kohn-Sham SCF with filtering
Exact diagonalization.
Chebyshev filtering, much faster than exact
diagonalization.
30 PARSEC model input
- Si2713PH828
- Silicon nanocrystal, a prototypical semiconductor nanocrystal
- Surface passivated with hydrogen atoms
- The P atom adds three degenerate levels very close to the LUMO, making the nanocrystal itself virtually gap-less. Study the properties of P-doped silicon nanocrystals.
31 Parsec XT4 Benchmark
32 Performance - Summary
  Code       TFlops   Cores   Sust   Scalab
  RMG        7.78     4,096   36.5   50
  Espresso   7.68     4,096   36     50
  PARSEC     7.45     4,096   35     50
33 Why Was NWChem Developed?
- Developed as part of the construction of the Environmental Molecular Sciences Laboratory (EMSL) at Pacific Northwest National Laboratory
- Designed and developed to be a highly efficient and portable massively parallel computational chemistry package
- Provides computational chemistry solutions that are scalable with respect to both chemical system size and MPP hardware size
34 What is NWChem used for?
- Provides major modeling and simulation capability for molecular science
- Broad range of molecules, including biomolecules, nanoparticles, and heavy elements
- Electronic structure of molecules (non-relativistic, relativistic, structural optimizations, and vibrational analysis)
- Increasingly extensive solid-state capability (plane-wave DFT, CPMD)
- Molecular dynamics, molecular mechanics
35 Molecular Science Software Group
36 GA Tools Overview
- Shared-memory programming model in the context of distributed dense arrays
- Complete environment for parallel code development
- Compatible with MPI
- Data-locality control similar to the distributed-memory / message-passing model
- Extensible and scalable
- Compatible with other libraries: ScaLAPACK, Peigs, ...
- Parallel and local I/O library (Pario)
37 Structure of GA
- Application programming language interface: Fortran 77, C, C++, Python, F90, Java, Babel
- Distributed arrays layer: memory management, index translation (see the sketch below)
- Message passing: global operations
- ARMCI: portable one-sided communication (put, get, locks, etc.)
- System-specific interfaces: LAPI, GM, threads, QSnet, IB, SHMEM, Portals
Global Arrays and MPI are completely interoperable; code can contain calls to both libraries.
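A toy illustration of the index-translation idea (this is not GA's actual API; the function name and blocking scheme are assumptions): map a global element index of a block-distributed 1D array to its owner rank and local offset.

```python
def owner_of(global_index, n, nprocs):
    """Owner rank and local index for element `global_index` of an
    n-element array block-distributed over nprocs ranks."""
    block = (n + nprocs - 1) // nprocs       # ceiling-division block size
    rank = global_index // block
    return rank, global_index - rank * block

# e.g. element 3500 of a 10000-element array over 8 ranks lives
# on rank 2 at local offset 1000
print(owner_of(3500, 10000, 8))              # (2, 1000)
```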
38 NWChem Architecture
- Object-oriented design
- abstraction, data hiding, APIs
- Parallel programming model
- non-uniform memory access, Global Arrays, MPI
- Infrastructure
- GA, Parallel I/O, RTDB, MA, Peigs, ...
- Program modules
- communication only through the database
- persistence for easy restart
39 Gaussian DFT computational kernel: evaluation of XC potential matrix elements

  my_next_task = SharedCounter()
  do i = 1, max_i
    if (i .eq. my_next_task) then
      call ga_get()
      ! (do work)
      call ga_acc()
      my_next_task = SharedCounter()
    endif
  enddo
  barrier()

\[
\rho(x_q) = \sum_{\mu\nu} D_{\mu\nu}\,\phi_\mu(x_q)\,\phi_\nu(x_q),
\qquad
F_{\lambda\sigma} = \sum_q w_q\,\phi_\lambda(x_q)\,V_{xc}[\rho(x_q)]\,\phi_\sigma(x_q)
\]

Both GA operations (ga_get and ga_acc) depend strongly on the communication latency (a stand-alone load-balancing sketch follows).
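The shared counter implements dynamic load balancing: whichever process finishes a task grabs the next global task id. A rough stand-alone analogue using Python multiprocessing (the counter, lock, and placeholder work stand in for SharedCounter() and the ga_get / XC quadrature / ga_acc steps):

```python
import multiprocessing as mp

def worker(rank, counter, lock, max_i):
    my_tasks = []
    while True:
        with lock:                    # atomic fetch-and-increment, like SharedCounter()
            i = counter.value
            counter.value += 1
        if i >= max_i:
            break
        my_tasks.append(i)            # placeholder for ga_get / do work / ga_acc
    print(f"rank {rank} handled tasks {my_tasks}")

if __name__ == "__main__":
    counter, lock = mp.Value("i", 0), mp.Lock()
    procs = [mp.Process(target=worker, args=(r, counter, lock, 16)) for r in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```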
40 NWChem porting issues on the XT3
- Re-use of the existing serial port for x86_64 (compilers)
- Communication library: GA/ARMCI
  - Two ARMCI ports: Cray SHMEM and Portals
- Made the Pario library compatible with Catamount (using glibc calls)
41 NWChem Porting - II
- Stability issues with the SHMEM port
  - Uncovered a Portals problem; the Cray SHMEM group provided a workaround
  - NWChem crashes the whole XT when running large jobs
- Performance of the SHMEM port
  - Latency: ~10 usec
  - Bandwidth
    - Contiguous put/get: ~2 GB/sec
    - Strided put/get: ~0.3 GB/sec
- ARMCI port using Portals in progress
42 (H2O)7, 287 basis functions, aug-cc-pVDZ, MP2 energy and gradient
43 Replicated Data vs. Distributed Data on XT3
Si28O67H30 1687 Basis f. LDA wavefunction
44 Parallel scaling of the DFT code of NWChem
Si28O67H30 1687 Basis f. LDA wavefunction
45 Si28O148H66 3554 Basis functions LDA
wavefunction
46 Si159O264H110 7108 Basis functions LDA
wavefunction
47 Si159O264H110 7108 Basis functions GGA
wavefunction
48 Thanks
- NWChem group and GA group - PNNL
- Bill Shelton, Bobby Sumpter - ORNL
- Roberto Ansaloni - Cray
- Murilo Tiago and Jim Chelikowsky - University of Texas
- Carlo Cavazzoni - CINECA, and the rest of the Quantum Espresso developers
49 Backup
50 Non-Blocking Communication
- Allows overlapping of communication and computation, resulting in latency hiding
- Non-blocking operations initiate a communication call and then return control to the application immediately
- The operation is completed locally by making a call to the wait routine (see the sketch below)
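A minimal sketch of the initiate/compute/wait pattern, assuming mpi4py for illustration (ARMCI's non-blocking put/get follow the same idea with one-sided operations):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

sendbuf = np.full(1_000_000, rank, dtype="d")
recvbuf = np.empty_like(sendbuf)

# initiate the transfers; control returns to the application immediately
reqs = [comm.Isend(sendbuf, dest=(rank + 1) % size, tag=0),
        comm.Irecv(recvbuf, source=(rank - 1) % size, tag=0)]

local_work = np.dot(sendbuf, sendbuf)   # computation overlapped with communication

MPI.Request.Waitall(reqs)               # complete the operations locally
```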
51 Si159O264H110, 7108 basis functions, LDA wavefunction, 2 SCF cycles; benchmark run on Itanium2/Elan4
52 Recent improvements in algorithm
- Use symmetry operations and reduce the real-space domain to an irreducible wedge
  - Tiago and Chelikowsky, in preparation
- Replace numerical diagonalization with Chebyshev subspace filtering
  - Zhou et al., J. Comp. Phys., in press (2006)
Impact in real calculations: in 1997, an SCF calculation of Si525H276 took 20 hours of CPU time on the Cray T3E (300 MHz) using 48 processors. Today, it takes 2 hours on one SGI Madison processor (1.3 GHz).