Edoardo Aprà - PowerPoint PPT Presentation

1
Edoardo Aprà
  • Materials Chemistry Applications on the ORNL XT3

2
Acknowledgments
  • Part of this work is funded by the U.S.
    Department of Energy, Office of Advanced
    Scientific Computing Research and by the division
    of Basic Energy Science, Office of Science, under
    contract DE-AC05-00OR22725 with Oak Ridge
    National Laboratory.
  • This research used resources of the National
    Center for Computational Sciences at Oak Ridge
    National Laboratory under contract
    DE-AC05-00OR22725.

3
Motivation
  • CHM022 INCITE allocation
  • The Chemical Endstation brings together
    researchers from 12 academic and government
    laboratories representing research projects
    funded by DOE, NSF and other agencies, with a
    common interest in the rational design of
    catalysts

4
Scalable and Accurate First Principles Method
Atomistic Methods
5
Application Software
  • Plane-wave basis
  • CPMD, VASP, Dacapo, Espresso, Parsec
  • Gaussian basis
  • NWChem

6
The Quantum-ESPRESSO Software Distribution
  • Quantum-ESPRESSO stands for Quantum opEn-Source
    Package for Research in Electronic Structure,
    Simulation, and Optimization
  • An initiative by DEMOCRITOS (Trieste), in
    collaboration with CINECA Bologna, Princeton
    University, EPFL, MIT, and many other
    individuals, aimed at the development of
    high-quality scientific software
  • Released under a free license (GNU GPL)
  • Written in Fortran 90, with a modern approach
  • Efficient, parallelized (MPI), portable
  • Integrated suite of computer codes for
    electronic-structure calculations and materials
    modeling at the nanoscale

7
The Quantum ESPRESSO suite of ab-initio codes
  • PWscf (Trieste, Lausanne, Pisa) self-consistent
    electronic structure, structural relaxation, BO
    molecular dynamics, linear-response (phonons,
    dielectric properties)
  • CP (Lausanne, Princeton) (variable-cell)
    Car-Parrinello molecular dynamics
  • FPMD (Trieste, Bologna) also (variable-cell)
    Car-Parrinello molecular dynamics
  • Plus a number of utilities for graphical input
    (GUI), molecular graphics (XCrysDen), output
    postprocessing, including
  • Wannier-function generation, pseudopotential
    generation, etc.

8
QE - Technical characteristics (algorithms)
  • Use of iterative techniques: the Hamiltonian is
    stored as an operator, not as a matrix. All
    standard PW technicalities (FFT, dual-space,
    etc.) are used.
  • Iterative diagonalization used whenever it is
    useful.
  • Fast "double-grid" implementation for ultrasoft
    PPs: the cutoff for the augmentation part can be
    larger (the corresponding FFT grid denser in real
    space) than the cutoff for the smooth part of the
    charge density. CP only: very fast "box grid"
    implementation.
  • Parallelization is performed on both PWs and
    FFT grids, using a parallel 3D FFT algorithm
    with good scaling in the number of processors
    (memory use also scales)
  • Parallelization on k-points is also available by
    dividing the processors into "pools" and
    distributing k-points across pools of processors
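The pool scheme above can be sketched with simple index bookkeeping (a hypothetical illustration, not QE's actual code; nproc, npool, and nkpts are made-up values):

```python
# Hypothetical sketch of k-point "pool" parallelization (not QE's code):
# nproc ranks are split into npool pools; each pool handles a subset of
# the k-points, while PW/FFT parallelization runs inside each pool.
nproc, npool, nkpts = 16, 4, 10
ranks_per_pool = nproc // npool

# pool index of each rank
pool_of = [rank // ranks_per_pool for rank in range(nproc)]

# distribute k-points round-robin across pools
kpts_of_pool = [[k for k in range(nkpts) if k % npool == p]
                for p in range(npool)]

assert pool_of[0] == 0 and pool_of[15] == 3
assert kpts_of_pool[0] == [0, 4, 8]
```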

9
QE - Technical features (implementation)
  • Written mostly in Fortran 90, with some degree of
    sophistication (advanced F90 features).
  • Portable optimization achieved by extensive use
    of standard (and free!) mathematical libraries
    (FFTW, BLAS, LAPACK)
  • C-style preprocessing options allow maintaining
    various architecture-dependent features in the
    same source tree
  • Parallelization via MPI calls hidden in very few
    routines allows easy maintenance of a unified
    serial/parallel code
  • Easy (or not-so-difficult) installation via the
    GNU configure utility

10
QE - CP code
  • Car-Parrinello variable-cell molecular dynamics
    with Ultrasoft PPs.
  • Developed by Alfredo Pasquarello (IRRMA,
    Lausanne), Kari Laasonen (Oulu), Andrea Trave
    (LLNL), Roberto Car (Princeton), Nicola Marzari
    (MIT), Paolo Giannozzi, and Carlo Cavazzoni,
    Gerardo Ballabio (CINECA), Sandro Scandolo
    (ICTP), Guido Chiarotti (SISSA), Paolo Focher.
  • Verlet dynamics with mass preconditioning
  • Temperature control: Nosé thermostat for both
    electrons and ions, velocity rescaling
  • Variable-cell (Parrinello-Rahman) dynamics
  • Damped-dynamics minimization for electrons and
    ions
  • Modified kinetic functional for constant-pressure
    calculations
  • Grid box for fast treatment of augmentation
    terms in ultrasoft PPs
  • Metallic systems: variable-occupancy dynamics
  • Nudged Elastic Band (NEB) for energy barriers and
    reaction paths
  • Dynamics with Wannier functions
  • Limitations
  • no k-points

11
PWSCF code
  • Developed by S. Baroni, S. de Gironcoli, A. Dal
    Corso (SISSA), P. Giannozzi, and others.
  • Self-consistent ground-state energy and Kohn-Sham
    orbitals, forces, structural optimization
  • Spin-orbit and non-collinear magnetization
  • Molecular dynamics on the ground-state
    Born-Oppenheimer surface
  • Variable-cell molecular dynamics with modified
    kinetic functional
  • Phonon frequencies and eigenvectors at a generic
    wave vector; interatomic force constants in real
    space; effective charges and dielectric tensors;
    electron-phonon interaction coefficients for
    metals
  • Macroscopic polarization via Berry phase
  • Nudged Elastic Band and Fourier String Method
    schemes for transition paths and energy barriers
  • Third-order anharmonic phonon lifetimes,
    nonresonant Raman cross sections
  • Limitations
  • no Car-Parrinello dynamics
  • very limited constrained minimization and dynamics
12
Flow chart
13
Basics of Plane Wave Method
Core electrons are removed through pseudopotential
generation, so only the valence electrons are
treated explicitly
  • Important advantages
  • The plane-wave basis can be handled on a discrete
    numerical grid
  • Preconditioned conjugate-gradient iterative
    techniques can be used

14
Computational Techniques and Libraries
  • Fast Fourier Transforms
  • Machine library or FFTW
  • Diagonalization
  • Direct methods
  • Preconditioned iterative methods (CG)
  • Dense Linear Algebra
  • LAPACK and BLAS

15
FFT Decomposition
  • Data structure
  • Global array: N x N x N
  • Local array: N x N x (N/p)
  • Algorithm
  • Perform local FFT along X
  • Perform local FFT along Y
  • Transpose array to get Z values in local memory
  • Perform local FFT along Z
  • Complexity
  • No. of FFTs per cpu: N²/p
  • FFT FLOPS: 5 x N x log2(N)
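The slab-decomposed algorithm above can be mimicked serially with NumPy (a minimal sketch: the concatenate/slice steps stand in for the MPI all-to-all transpose, and N and p are made-up sizes):

```python
import numpy as np

N, p = 8, 4  # made-up grid size and "processor" count (N divisible by p)
rng = np.random.default_rng(0)
data = rng.standard_normal((N, N, N)) + 1j * rng.standard_normal((N, N, N))

# each "rank" owns a z-slab: data[:, :, r*(N/p):(r+1)*(N/p)]
slabs = [data[:, :, r*(N//p):(r+1)*(N//p)].copy() for r in range(p)]

# steps 1-2: local FFTs along X and Y (those axes are fully local)
slabs = [np.fft.fft(np.fft.fft(s, axis=0), axis=1) for s in slabs]

# step 3: transpose so each rank owns full Z columns
# (concatenate + re-slice stands in for the MPI all-to-all exchange)
tmp = np.concatenate(slabs, axis=2)
cols = [tmp[r*(N//p):(r+1)*(N//p), :, :].copy() for r in range(p)]

# step 4: local FFT along Z
cols = [np.fft.fft(c, axis=2) for c in cols]

result = np.concatenate(cols, axis=0)
assert np.allclose(result, np.fft.fftn(data))  # matches the full 3D FFT
```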

16
Cray XT3 Scaling
  • Vienna Ab-initio Simulation Package (VASP)
  • Based on BCC Cu
  • Single k-point (0.0,0.0,0.0)
  • 400 eV Plane wave cutoff
  • Small system sizes
  • Forward and backward FFT transform
  • Large system sizes
  • Davidson diagonalization
  • Forward and backward FFT transform

17
Cray XT3 Scaling (fixed system size)
  • Fixed number of plane waves
  • Changing the number of plane waves per processor
  • Optimal density of atoms/node
  • For the buckyball: ~1.9 atoms/node
  • Thus, to run a 1000-atom system optimally would
    require ~500 processors

18
Quantum Espresso Performance
Cray-XT3 located in the NCCS at ORNL
Scaling by processor count: 2-16: N; 16-54: N-N²; 54-128: N²
19
Quantum Espresso Performance
Roberto Ansaloni, Cray Workshop on
High-Performance Computing ETH-Zurich-September
4-5, 2006
20
Porphyrin Functionalized Nanotube
  • New materials for solar energy applications
  • Relatively simple, synthetically feasible (at
    ORNL-UT) mimics of light-harvesting antenna units
  • Porphyrin molecules are the light absorbing
    antenna and the nanotube may provide a conducting
    channel (or at least an electron acceptor)
  • Key questions
  • Do the covalently attached porphyrins undergo
    facile absorption of visible light and transfer
    electrons onto the tube?
  • What types of efficiencies does one obtain?

21
Porphyrin Functionalized Nanotube
  • Problem size ranges from 1532 atoms (~60 Å) to
    6128 atoms (~202 Å by 60 Å)
  • Key research questions to address are
  • How does porphyrin attach to the nanotube?
  • How does the electronic structure change as
    porphyrin molecules are added to the nanotube
    (up to 22% by weight has been observed
    experimentally)?
  • How is the conductance affected by surface
    orientation and composition?

22
Problem Requirements
  • Lattice constants: approximately z ≈ 381.72 a.u.,
    x,y ≈ 113.38 a.u.
  • Approximately 6128 atoms (C, N, O, H)
  • Energy cut-off: 25 Ryd.
  • Produces a mesh of (608x180x180) = 19,699,200
    points
  • Number of states: 10,464
  • RAM for the Hamiltonian matrix: (10,464)² x 16
    bytes ≈ 1.6 Gbytes (or ~0.16 Mbytes per processor
    on 10,000 processors)
  • Storing the double-precision real wavefunction
    per k-point: 19,699,200 x 10,464 x 8 bytes ≈
    1.6 Tbytes
  • On 10,000 processors this would require ~0.16
    Gbytes/processor
  • This would require a bandwidth of ~30
    Gbytes/sec to read in or write out the data in 1
    minute
  • Storing the double-precision real charge density
    would require ~0.16 Gbytes
  • For a 1532-atom system divide all numbers by 4
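The estimates above can be checked with simple arithmetic (a sketch; it assumes a complex double Hamiltonian at 16 bytes/element and real double wavefunctions at 8 bytes/value, consistent with the slide's totals):

```python
# back-of-the-envelope check of the slide's memory/bandwidth estimates
mesh = 608 * 180 * 180      # real-space FFT grid points
states = 10_464             # number of electronic states
procs = 10_000

assert mesh == 19_699_200

ham_gb = states**2 * 16 / 1e9        # Hamiltonian matrix, complex double
wf_tb = mesh * states * 8 / 1e12     # real double wavefunction per k-point
rho_gb = mesh * 8 / 1e9              # real double charge density

assert 1.5 < ham_gb < 1.8                        # ≈ the slide's 1.6 Gbytes
assert 1.5 < wf_tb < 1.7                         # ≈ 1.6 Tbytes
assert abs(wf_tb * 1e3 / procs - 0.16) < 0.01    # ≈ 0.16 GB/processor
assert 25 < wf_tb * 1e3 / 60 < 30                # ≈ 30 GB/s for 1-minute I/O
assert abs(rho_gb - 0.16) < 0.01                 # ≈ 0.16 Gbytes
```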

23
Espresso - XT4 Benchmark
24
PARSEC
  • computer code that solves the Kohn-Sham equations
    by expressing electron wave-functions directly in
    real space, without the use of explicit basis
    sets.
  • uses norm-conserving pseudopotentials (e.g.
    Troullier-Martins).
  • designed for ab initio quantum-mechanical
    calculations of the electronic structure of
    matter, within Density-Functional Theory (DFT).

25
http://www.ices.utexas.edu/parsec
Features
  • High order finite difference expansion of
    differential operators (Hamiltonian matrix is
    sparse).
  • Scalability in parallel environment.
  • Confined and periodic boundary conditions.
  • Molecular dynamics.
  • Open Source

26
Algorithm
  • Wave-functions ψi(r) are calculated on a grid.
  • Regular grid (Cartesian for simplicity).
  • Laplacian evaluated using finite differences.
  • Code parallelized with respect to the number of
    grid points.
  • Source code: Fortran 95.
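The high-order finite-difference idea can be illustrated in one dimension (a toy NumPy sketch, not PARSEC's Fortran: a 4th-order central stencil applied to sin(x) on a periodic grid, where the exact second derivative is -sin(x)):

```python
import numpy as np

n = 200
x = np.linspace(0, 2 * np.pi, n, endpoint=False)  # periodic grid
h = x[1] - x[0]
f = np.sin(x)

# 4th-order central stencil for d²/dx²:
# (-f[i-2] + 16 f[i-1] - 30 f[i] + 16 f[i+1] - f[i+2]) / (12 h²)
# np.roll implements the periodic wrap-around at the boundaries
lap = (-np.roll(f, 2) + 16 * np.roll(f, 1) - 30 * f
       + 16 * np.roll(f, -1) - np.roll(f, -2)) / (12 * h**2)

err = np.max(np.abs(lap + np.sin(x)))  # exact second derivative is -sin(x)
assert err < 1e-6                      # high-order stencil is very accurate
```

In 3D the same stencil applies along each axis, which keeps the Hamiltonian sparse, as the Features slide notes.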

27
Chebyshev subspace filtering
Given a set of basis vectors ψi, filter according
to ψi ← Pm(H) ψi, where Pm is a Chebyshev
polynomial of degree m. The vectors ψi will have
their projection onto the desired subspace
enhanced.
(Figure: window for filtering)
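A toy sketch of the filtering step (a minimal one-vector illustration under simplifying assumptions, not PARSEC's implementation): the unwanted part of the spectrum [a, b] is mapped into [-1, 1], where Chebyshev polynomials stay bounded while growing rapidly outside, so components below a are amplified.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
H = (A + A.T) / 2                      # symmetric "Hamiltonian"
evals, evecs = np.linalg.eigh(H)       # reference only, for the check below

a, b = evals[1], evals[-1]             # damp everything above the lowest state
e, c = (b - a) / 2, (b + a) / 2        # half-width and center of [a, b]

def cheb_filter(H, v, m, e, c):
    # three-term Chebyshev recurrence T_{k+1} = 2 H' T_k - T_{k-1},
    # applied to the shifted/scaled operator H' = (H - c I) / e
    v0, v1 = v, (H @ v - c * v) / e
    for _ in range(m - 1):
        v0, v1 = v1, 2 * (H @ v1 - c * v1) / e - v0
    return v1

v = rng.standard_normal(50)
w = cheb_filter(H, v, 12, e, c)
w /= np.linalg.norm(w)

# the filtered vector's overlap with the lowest eigenvector is enhanced
before = abs(evecs[:, 0] @ (v / np.linalg.norm(v)))
after = abs(evecs[:, 0] @ w)
assert after > before
```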
28
Traditional Approach
Most of the time is spent on the diagonalization
step. One can use ARPACK (a variant of the Lanczos
algorithm) or other exact eigensolvers.
29
Kohn-Sham SCF with filtering
Exact diagonalization.
Chebyshev filtering, much faster than exact
diagonalization.
30
PARSEC model input
  • Si2713PH828
  • Silicon nanocrystal, a prototypical semiconductor
    nanocrystal
  • Surface passivated with hydrogen atoms
  • The P atom adds three degenerate levels very
    close to the LUMO, making the nanocrystal
    virtually gap-less. Goal: study the properties of
    p-doped silicon nanocrystals

31
Parsec XT4 Benchmark
32
Performance - Summary
  • Code       TFlops   Cores   Sustained   Scalability
  • RMG         7.78    4,096     36.5          50
  • Espresso    7.68    4,096     36            50
  • PARSEC      7.45    4,096     35            50

33
Why Was NWChem Developed?
  • Developed as part of the construction of the
    Environmental Molecular Sciences Laboratory
    (EMSL) at Pacific Northwest National Lab
  • Designed and developed to be a highly efficient
    and portable Massively Parallel computational
    chemistry package
  • Provides computational chemistry solutions that
    are scalable with respect to chemical system size
    as well as MPP hardware size

34
What is NWChem used for?
  • Provides major modeling and simulation capability
    for molecular science
  • Broad range of molecules, including biomolecules,
    nanoparticles and heavy elements
  • Electronic structure of molecules
    (non-relativistic, relativistic, structural
    optimizations and vibrational analysis)
  • Increasingly extensive solid state capability
    (DFT plane-wave, CPMD)
  • Molecular dynamics, molecular mechanics

35
Molecular Science Software Group
36
GA Tools Overview
  • Shared-memory model in the context of
    distributed dense arrays
  • Complete environment for parallel code
    development
  • Compatible with MPI
  • Data locality control similar to the distributed
    memory/message passing model
  • Extensible and scalable
  • Compatible with other libraries: ScaLapack,
    Peigs, ...
  • Parallel and local I/O library (Pario)

37
Structure of GA
(Diagram: layered GA architecture)
  • Application programming language interfaces:
    Fortran 77, C, C++, F90, Java, Python (via Babel)
  • Distributed arrays layer: memory management,
    index translation
  • Message passing and global operations: Global
    Arrays and MPI are completely interoperable; code
    can contain calls to both libraries
  • ARMCI: portable 1-sided communication (put, get,
    locks, etc.)
  • System-specific interfaces: LAPI, GM, threads,
    QSnet, IB, SHMEM, Portals
38
NWChem Architecture
  • Object-oriented design
  • abstraction, data hiding, APIs
  • Parallel programming model
  • non-uniform memory access, Global Arrays, MPI
  • Infrastructure
  • GA, Parallel I/O, RTDB, MA, Peigs, ...
  • Program modules
  • communication only through the database
  • persistence for easy restart

39
Gaussian DFT computational kernel: evaluation of
the XC potential matrix element

my_next_task = SharedCounter()
do i = 1, max_i
   if (i .eq. my_next_task) then
      call ga_get()
      (do work)
      call ga_acc()
      my_next_task = SharedCounter()
   endif
enddo
barrier()

ρ(xq) = Σmn Dmn φm(xq) φn(xq)
Fls = Σq wq φl(xq) Vxc[ρ(xq)] φs(xq)

Both GA operations are strongly dependent on the
communication latency
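The dynamic load balancing in the kernel above can be mimicked with threads (a hedged sketch: a locked Python counter stands in for GA's SharedCounter, and plain list updates stand in for the ga_get/ga_acc data movement):

```python
import threading

counter = 0
lock = threading.Lock()

def shared_counter():
    # atomic fetch-and-increment, mimicking GA's SharedCounter()
    global counter
    with lock:
        val = counter
        counter += 1
    return val

results = [0] * 20

def worker():
    while True:
        task = shared_counter()        # my_next_task = SharedCounter()
        if task >= len(results):       # no tasks left
            break
        results[task] = task * task    # "(do work)" then accumulate result

# four workers grab tasks dynamically, so faster workers do more tasks
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert results == [i * i for i in range(20)]
```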
40
NWChem Porting issues on the XT3
  • Re-use of existing serial port for x86_64
    (compilers)
  • Communication library GA/ARMCI
  • Two ARMCI ports: Cray SHMEM and Portals
  • Made Pario library compatible with Catamount
    (using glibc calls)

41
NWChem Porting - II
  • Stability issues with the SHMEM port
  • Uncovered a Portals problem; the Cray SHMEM group
    provided a workaround
  • NWChem crashed the whole XT when running large
    jobs
  • Performance of the SHMEM port
  • Latency: ~10 μsec
  • Bandwidth
  • Contiguous put/get: 2 GB/sec
  • Strided put/get: 0.3 GB/sec
  • ARMCI port using Portals in progress

42

(H2O)7: 287 basis functions, aug-cc-pVDZ, MP2
energy and gradient
43

Replicated Data vs. Distributed Data on XT3

Si28O67H30: 1687 basis functions, LDA wavefunction
44

Parallel scaling of the DFT code of NWChem

Si28O67H30: 1687 basis functions, LDA wavefunction
45

Si28O148H66: 3554 basis functions, LDA wavefunction
46

Si159O264H110: 7108 basis functions, LDA
wavefunction
47

Si159O264H110: 7108 basis functions, GGA
wavefunction
48
Thanks
  • NWChem group and GA group - PNNL
  • Bill Shelton and Bobby Sumpter - ORNL
  • Roberto Ansaloni - Cray
  • Murilo Tiago and Jim Chelikowsky - University of
    Texas
  • Carlo Cavazzoni - CINECA and the rest of the
    Quantum Espresso developers

49
Backup
50
Non-Blocking Communication
  • Allows overlapping of communication and
    computation, resulting in latency hiding
  • Non-blocking operations initiate a communication
    call and then return control to the application
    immediately
  • The operation is completed locally by making a
    call to the wait routine
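A minimal sketch of the initiate/compute/wait pattern (illustrative only; a thread-pool future stands in for a non-blocking communication handle, and the sleep stands in for network latency):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_transfer():
    time.sleep(0.1)                # stands in for communication latency
    return "data"

with ThreadPoolExecutor() as pool:
    handle = pool.submit(fake_transfer)            # non-blocking initiate
    partial = sum(i * i for i in range(100_000))   # overlapped computation
    data = handle.result()                         # the "wait routine"

assert data == "data"              # transfer completed while we computed
```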

51

Si159O264H110: 7108 basis functions, LDA
wavefunction, 2 SCF cycles. Benchmark run on
Itanium2/Elan4
52
Recent improvements in algorithm
  • Use symmetry operations and reduce the real-space
    domain to an irreducible wedge.
  • Tiago and Chelikowsky, in preparation
  • Replace numerical diagonalization with Chebyshev
    subspace filtering
  • Zhou et al., J. Comp. Phys., in press (2006)

In 1997, an SCF calculation of Si525H276 took 20
hours of CPU time on a Cray T3E (300 MHz) using
48 processors. Today, it takes 2 hours on one SGI
Madison processor (1.3 GHz).
Impact in real calculations
53