Parallel Methods for Nano/Materials Science Applications - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Parallel Methods for Nano/Materials Science
Applications
(Electronic Structure Calculations)
  • Andrew Canning
  • Computational Research Division LBNL
    UC Davis, Applied Science Dept.

2
Outline
  • Introduction to Nano/Materials science
  • Electronic Structure Calculations (DFT)
  • Parallelism for Plane Wave Approach
  • Code performance on High Performance Parallel
    Computers
  • New Methods and Applications

3
Milestones in Parallel Calculations
1991: Silicon surface reconstruction (7x7), Phys. Rev. (Stich, Payne, King-Smith, Lin, Clarke), Meiko i860 64-processor Computing Surface; also (Brommer, Needels, Larson, Joannopoulos), Thinking Machines CM2, 16,384 bit processors
1998: FeMn alloys (exchange bias), Gordon Bell prize at SC98 (Ujfalussy, Stocks, Canning, Y. Wang, Shelton et al.), Cray T3E, 1500 procs.; first > 1 Tflop simulation
2006: 1000-atom molybdenum simulation with Qbox, SC06 Gordon Bell prize for peak performance (F. Gygi et al.), 207 Tflops on IBM BlueGene/L (LLNL)
2008: New algorithm to enable 400 Tflop/s sustained performance in simulations of disorder effects in high-Tc superconductors, SC08 Gordon Bell prize (Thomas C. Schulthess et al.), Cray XT5 (ORNL); first > 1 Pflop simulation. Also at SC08: linearly scaling divide-and-conquer electronic structure calculations (L.-W. Wang et al.), Gordon Bell prize for algorithmic innovation
4
Electronic Structure Calculations
  • Accurate Quantum Mechanical treatment for the
    electrons
  • Each electron represented on grid or with some
    basis functions (eg. Fourier components)
  • Compute intensive: each electron requires ~1
    million points/basis functions (and hundreds of
    electrons are needed)
  • 70-80% of NERSC Materials Science computer time
    (first-principles electronic structure)

Ba2YCl7:Ce excited state (new scintillator for
gamma-radiation detection)
5
Motivation for Electronic Structure Calculations
  • Most materials properties are only understood at a
    fundamental level through accurate electronic
    structure calculations (strength, cohesion, etc.)
  • Many properties are purely electronic, e.g.
    optical properties (lasers)
  • Complements experiments
  • Computer design of materials at the nanoscale

6
Materials Science Methods
  • Many Body Quantum Mechanical Approach (Quantum
    Monte Carlo) 20-30 atoms
  • Single Particle QM (Density Functional Theory)
    No free parameters. 100-1000 atoms
  • Empirical QM Models eg. Tight Binding
    1000-5000 atoms
  • Empirical classical potential methods:
    thousands to millions of atoms
  • Continuum methods

(Methods listed in order of increasing number of atoms.)
Single-particle DFT methods are the largest user of
supercomputer cycles of any scientific method in
any discipline.
7
Quantum Mechanics for molecules and crystals
Many-body Schrödinger equation for electrons (positions $r_i$) in a molecule or crystal with nuclei of charge $Z_I$ at positions $R_I$ (natural units, $e = \hbar = m = 1$):

$$H\Psi = \Bigg[\underbrace{-\tfrac{1}{2}\sum_i \nabla_i^2}_{\text{kinetic energy}}\;\underbrace{-\sum_{i,I}\frac{Z_I}{|r_i - R_I|}}_{\text{electron-nuclei interaction}}\;+\;\underbrace{\tfrac{1}{2}\sum_{i\neq j}\frac{1}{|r_i - r_j|}}_{\text{electron-electron interaction}}\Bigg]\Psi = E\Psi$$

  • Solution of this eigenfunction equation HΨ = EΨ
    (H is the Hamiltonian) for the wavefunction Ψ
    gives complete information about the system.
  • Observables take the form of operators, e.g.
    momentum p (classical form mv) becomes p_op = -iℏ∇;
    the observed p is then given by the solution of
    p_op Ψ = pΨ.
  • The lowest eigenpair (E0, Ψ0) is the ground
    state; higher eigenpairs correspond to excited
    states of the system.

8
Ab initio Method: Density Functional Theory
(Kohn, 1998 Nobel Prize)
Many-body Schrödinger equation (exact, but
exponential scaling)

Kohn-Sham equations (1965): the many-body ground-state
problem can be mapped onto a single-particle
non-linear problem with the same electron density
and a different effective potential (cubic scaling).

Use the Local Density Approximation (LDA) for the
exchange-correlation potential V_xc[ρ] (good for Si, C);
ρ = charge density
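The Kohn-Sham equations themselves were lost with the slide graphics; in their standard textbook form (same symbols as above) they read:

```latex
\Big[-\tfrac{1}{2}\nabla^2 + V_{\mathrm{ion}}(\mathbf{r})
     + V_H[\rho](\mathbf{r}) + V_{xc}[\rho](\mathbf{r})\Big]\,
\psi_i(\mathbf{r}) = \varepsilon_i\,\psi_i(\mathbf{r}),
\qquad
\rho(\mathbf{r}) = \sum_{i=1}^{N} |\psi_i(\mathbf{r})|^2
```

The effective potential depends on ρ, which in turn depends on the ψ_i; this is the non-linearity resolved by the selfconsistent loop on the next slide.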
9
Selfconsistent Solution
Fix V(ρ(r)) → a linear eigenproblem
N electrons → N wavefunctions: the lowest N
eigenfunctions
Iterate to selfconsistency until V_out = V_in
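The selfconsistency loop can be sketched in code. The following is a toy 1D illustration, not the plane-wave method itself: the potential is fixed, the resulting linear eigenproblem is solved for the lowest N eigenfunctions, a new density is built, and the cycle repeats with linear mixing until input and output agree. The grid size, model potential, and density-dependent term are invented for illustration.

```python
import numpy as np

def scf_toy(n_grid=64, n_elec=4, mix=0.3, tol=1e-6, max_iter=500):
    """Toy selfconsistent-field loop on a 1D grid (illustration only)."""
    x = np.linspace(-5.0, 5.0, n_grid)
    dx = x[1] - x[0]
    # Kinetic energy -1/2 d^2/dx^2 by second-order finite differences.
    off = np.full(n_grid - 1, 1.0)
    t = (-0.5 / dx**2) * (np.diag(off, -1) - 2.0 * np.eye(n_grid)
                          + np.diag(off, 1))
    v_ext = 0.5 * x**2                             # fixed "ionic" potential (toy)
    rho = np.full(n_grid, n_elec / (n_grid * dx))  # initial density guess
    for it in range(max_iter):
        # Fix V -> a linear eigenproblem for this step.
        h = t + np.diag(v_ext + rho)               # toy density-dependent term
        eps, psi = np.linalg.eigh(h)
        psi /= np.sqrt(dx)                         # normalize on the grid
        # N electrons -> occupy the lowest N eigenfunctions.
        rho_out = (psi[:, :n_elec] ** 2).sum(axis=1)
        if np.max(np.abs(rho_out - rho)) < tol:    # V_out == V_in: converged
            return rho_out, eps[:n_elec], it
        rho = (1.0 - mix) * rho + mix * rho_out    # linear mixing
    return rho, eps[:n_elec], max_iter

rho, eps, n_iter = scf_toy()
```

Linear mixing is the simplest stabilizer; production codes use more sophisticated charge mixing, but the fixed-point structure is the same.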
10
Choice of Basis for DFT (LDA)
[Diagram: basis size M increases from Gaussian to FLAPW to a Fourier grid, while the fraction of eigenpairs required (N/M) drops from roughly 30% to 2%.]
Eigensolvers: direct (ScaLAPACK) or iterative
11
Plane-wave Pseudopotential Method in DFT
Solve Kohn-Sham Equations self-consistently for
electron wavefunctions within the Local Density
Approximation


1. Plane-wave (Fourier) expansion for the
wavefunctions up to an energy cutoff, |g| < g_cut
(a sphere in Fourier space)
2. Replace the frozen atom core (core electrons
and nucleus) by a pseudopotential
(pseudo-wavefunctions in the core region allow the
cutoff to be reduced without loss of accuracy)
Different parts of the Hamiltonian are calculated in
different spaces (Fourier and real), connected via 3d FFTs
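Step 1 written out explicitly (standard plane-wave notation, not recovered from the slide):

```latex
\psi_{n,\mathbf{k}}(\mathbf{r})
  = \sum_{\frac{1}{2}|\mathbf{k}+\mathbf{g}|^{2} < E_{\mathrm{cut}}}
    c_{n,\mathbf{k}}(\mathbf{g})\,
    e^{\,i(\mathbf{k}+\mathbf{g})\cdot\mathbf{r}}
```

The cutoff condition selects the reciprocal-lattice vectors g inside a sphere, which is the data layout the parallelization slides below are built around.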
12
Computational Considerations
  • H is never computed explicitly (it is available
    through the mat-vec product)
  • Matrix-vector product costs N log N steps (not N²)
  • Orthogonalization is expensive; mat-vec is cheap
  • The matrix is dense (in Fourier or real space)
  • The real-space grid is approximately double the
    size of the Fourier sphere
  • At each selfconsistent step we have a good guess
    for the eigenvectors (from the previous step)
  • Typically stable CG-based iterative methods or
    Davidson are used
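Since H is only needed through mat-vec products, the product can be formed in N log N operations: the kinetic term is diagonal in Fourier space, the local potential is diagonal in real space, and 3d FFTs move between the two. A minimal serial numpy sketch (local potential only, no pseudopotential projectors):

```python
import numpy as np

def apply_h(c_g, v_real, k_sq_half):
    """Apply H = -(1/2)∇² + V(r) to a wavefunction given by its Fourier
    coefficients c_g, without ever forming the matrix H."""
    kinetic = k_sq_half * c_g                 # diagonal in Fourier space
    psi_r = np.fft.ifftn(c_g)                 # go to the real-space grid
    potential = np.fft.fftn(v_real * psi_r)   # V is diagonal in real space
    return kinetic + potential

n = 16
k = 2.0 * np.pi * np.fft.fftfreq(n)
kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
k_sq_half = 0.5 * (kx**2 + ky**2 + kz**2)

rng = np.random.default_rng(0)
v = rng.standard_normal((n, n, n))            # toy local potential
c = rng.standard_normal((n, n, n)) + 1j * rng.standard_normal((n, n, n))
hc = apply_h(c, v, k_sq_half)
```

With a real potential this operator is Hermitian, which is what the iterative eigensolvers on this slide rely on.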
13
Parallelization (load balance, minimize
communications)
Dataset: spheres of Fourier coefficients, one per
wavefunction (electron); 100s-1000s of spheres (N)
with M points per sphere (M ≈ 100N).

Calculations (most costly):
  1. Calculation of overlap matrices for
     orthogonalization; scales as MN²
     (written in BLAS3)
  2. Non-local part of the pseudopotential calculation
     (ion-electron interaction); scales as MN²
     (written in BLAS3), can also be done in real space
  3. 3d FFTs to construct the charge density in real
     space; scales as NM log M

Load balancing: calculations 1 and 2 want the same
number of sphere points on each processor, while 3
wants the same number of columns of data on each
processor. These are conflicting conditions: an
NP-complete bin-packing problem.
14
Parallelization (load balance, minimize
communications)
Load Balancing Scheme
  • Consider sphere as one dimensional list of
    columns ordered according to length.
  • Give out columns to processors in descending
    order of length, assigning each column to the
    processor that currently holds the fewest points
    of the sphere.

3 processor example
Close to the same number of points and columns go
to each processor
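This descending-order greedy scheme is a classic bin-packing heuristic; a small sketch with a hypothetical 3-processor, 9-column example:

```python
import heapq

def assign_columns(column_lengths, n_procs):
    """Greedy load balancing: hand out columns in descending order of
    length, always to the processor currently holding the fewest points."""
    heap = [(0, p) for p in range(n_procs)]   # (points held, proc id)
    heapq.heapify(heap)
    owner = {}
    for col, length in sorted(enumerate(column_lengths),
                              key=lambda t: -t[1]):
        points, proc = heapq.heappop(heap)    # proc with fewest points
        owner[col] = proc
        heapq.heappush(heap, (points + length, proc))
    return owner

# Hypothetical 3-processor example: 9 sphere columns of decreasing length.
cols = [9, 8, 7, 6, 5, 4, 3, 2, 1]
owner = assign_columns(cols, 3)
loads = [sum(cols[c] for c in owner if owner[c] == p) for p in range(3)]
```

The point counts per processor end up nearly equal, and because long columns are handed out first, the column counts stay close as well.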
3d FFT
Communications
  • With this data distribution the BLAS3 operations
    in Fourier space have limited communications
  • Main communications occur in 3d FFTs

Write specialized 3d FFTs for this data layout
15
Standard Parallel 3d FFT on an N×N×N (x,y,z) grid
1. Perform 1d FFTs on the N² x-direction columns
2. Transpose (x,y,z) → (y,z,x)
3. Perform 1d FFTs on the N² y-direction columns
4. Transpose (y,z,x) → (z,y,x)
5. Perform 1d FFTs on the N² z-direction columns
6. Transpose (z,y,x) → (x,y,z) (optional)
Scaling issues (bandwidth and latency):
computations/communications ≈ N² · N log N / N³ = log N ≈ O(10)
message size ∝ (number of processors)⁻² for a 1d
layout
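The six steps above map directly onto 1d FFT calls and axis transposes; a serial numpy sketch, checked against the library 3d FFT:

```python
import numpy as np

def fft3_by_columns(a):
    """3d FFT via the slide's recipe: 1d FFTs along one axis at a time,
    with a global transpose between each round."""
    a = np.fft.fft(a, axis=0)          # 1. 1d FFTs on x-direction columns
    a = np.transpose(a, (1, 2, 0))     # 2. (x,y,z) -> (y,z,x)
    a = np.fft.fft(a, axis=0)          # 3. 1d FFTs on y-direction columns
    a = np.transpose(a, (1, 0, 2))     # 4. (y,z,x) -> (z,y,x)
    a = np.fft.fft(a, axis=0)          # 5. 1d FFTs on z-direction columns
    return np.transpose(a, (2, 1, 0))  # 6. (z,y,x) -> (x,y,z) (optional)
```

In the parallel code each transpose becomes a global all-to-all communication, which is why the next slide restructures the data to move only non-zero columns.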
16
Specialized 3d FFT for Electronic Structure Codes
(Plane Waves/Fourier)
  • Works for any grid size on any number of
    processors
  • Only non-zero elements communicated/calculated
  • Most communication occurs in the global transpose,
    (b) to (c); little communication from (d) to (e)
  • Many 3d FFTs done at the same time to avoid
    latency issues
  • Much faster than vendor supplied 3d-FFT (no grid
    size limitations)
  • Used in many codes (PARATEC, PETot, ESCAN, GIT
    code etc. )

[Diagram: panels (a)-(f) show the non-zero Fourier columns through the sequence FFT, transpose, FFT, transpose, FFT.]
17
Results: 3d FFT, 512³ grid, on Cray XT4 (Franklin, NERSC, LBNL)
  • Cray XT4 (NERSC)
  • Quad-core 2.3 GHz Opteron processors
  • 38,640 processor cores, 9,660 nodes

Time in seconds for forward+reverse 3d FFT on a 512³ grid (blank entries: not reported); weak-scaling column uses Nprocs × 512² grids:

Procs | isend/recv | isend/recv (1 core) | alltoallv | isend/recv block 40 | alltoallv block 40 | weak scaling
128   | 0.6175     | 0.3350              | 0.5560    |                     |                    |
256   | 0.3752     | 0.1858              | 0.3033    | 0.3292              | 0.3013             | 0.1604
512   | 0.3417     | 0.1326              | 0.2007    | 0.1770              | 0.1609             | 0.1609
1024  | 6.3045     | 0.1221              | 0.1992    | 0.1913              | 0.0772             | 0.1711
2048  | 7.6232     | 6.2814              | 0.2767    | 0.5552              | 0.0477             | 0.1857
4096  | 9.4510     | 6.8414              | 0.4272    |                     | 0.0439             | 0.2019
8192  |            |                     | 0.3746    | 0.6888              | 0.0369             | 0.2415

Uses FFTW 1d FFT libraries.

18
PARATEC (PARAllel Total Energy Code)
  • PARATEC performs first-principles quantum
    mechanical total energy calculations using
    pseudopotentials and a plane-wave basis set
  • Written in F90 and MPI
  • Designed to run on large parallel machines (IBM
    SP etc.) but also runs on PCs
  • PARATEC uses an all-band CG approach to obtain the
    wavefunctions of the electrons (blocked
    communications, specialized 3d FFTs)
  • Generally obtains a high percentage of peak on
    different platforms (uses BLAS3 and 1d FFT libraries)
  • Developed with Louie's and Cohen's groups (UCB,
    LBNL)

Overall ratio of calculations to communications ∝ N
(not log N)
19
PARATEC Code Details
  • Code written in F90 and MPI (50,000 lines)
  • Roughly 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
  • Global communications in the 3D FFT (transpose)
  • Parallel 3D FFT handwritten to minimize
    communications and reduce latency (written on top
    of vendor-supplied 1D complex FFTs)

20
PARATEC Performance
488-atom CdSe quantum dot, Gflops per processor (percentage of peak); blank entries were not reported:

Procs | Bassi NERSC (IBM Power5) | Jacquard NERSC (Opteron) | Thunder (Itanium2) | Franklin NERSC (Cray XT4) | NEC ES (SX6) | IBM BG/L
128   | 5.49 (72%) |            | 2.8 (51%) |            | 5.1 (64%) |
256   | 5.52 (73%) | 1.98 (45%) | 2.6 (47%) | 3.36 (65%) | 5.0 (62%) | 1.21 (43%)
512   | 5.13 (67%) | 0.95 (21%) | 2.4 (44%) | 3.15 (61%) | 4.4 (55%) | 1.00 (35%)
1024  | 3.74 (49%) |            | 1.8 (32%) | 2.93 (56%) | 3.6 (46%) |
2048  |            |            |           | 2.37 (46%) | 2.7 (35%) |

  • Grid size 250³
  • All architectures generally achieve high
    performance due to the computational intensity of
    the code (BLAS3, FFT)
  • The ES achieves the highest overall performance,
    5.5 Tflop/s on 2048 procs (5.3 Tflops on the XT4
    on 2048 procs in single-proc-per-node mode)
  • The FFT is used as a benchmark for NERSC
    procurements (run on up to 18K procs on the Cray
    XT4, weak scaling)
  • Vectorisation directives and multiple simultaneous
    1d FFTs were required for the NEC SX6

Developed with Louie's and Cohen's groups (UCB, LBNL);
also work with L. Oliker, J. Carter
21
Application: Gamma-Ray Detection (Cerium-Activated
Scintillators)
  • Applications in cosmology, particle physics,
    medicine, homeland security
  • Scintillators produce light in the optical range
    when exposed to gamma rays
  • The brightest scintillators are cerium-doped
    materials (lanthanum halides)
  • Theoretical screening of candidate materials

[Diagram: Ce energy states, ordered by increasing energy.]
Project funded by the DHS, in collaboration with
S. Derenzo, M. Weber, A. Chaudhry in Life
Sciences (LBNL)
22
Scintillation Process Model (Ce-activated)
1. γ-ray excitation produces a hole and an electron
in the host material
2. The hole is captured by the dopant Ce site
(Ce3+ → Ce4+)
3. The electron diffuses down through the conduction
band into the Ce 5d level
4. A 5d → 4f transition on the Ce site emits an
optical photon
Key assumption: the 4f and 5d states of the Ce atom
must lie in the energy gap of the host crystal
[Energy-level diagram: conduction band, Ce 5d level,
Ce 4f level, valence band.]
23
Density of States Plot for Ce in LaBr3
[Plot: Ce density of states vs. energy for a single Ce
atom in a large unit cell, with the 4f and 5d peaks
marked.]
24
Criteria to Determine Bright Ce-Activated
Scintillators
  1. Size of the host material bandgap: the smaller
     the bandgap, the more electron-hole pairs, but it
     must still accommodate the Ce 4f and 5d levels
  2. Energy difference between the VBM and the Ce 4f
     level: the larger this value, the less probable
     hole capture at the Ce3+ site (it then requires
     multiple or high-energy phonons)
  3. Energy difference between the CBM and the Ce 5d
     level: if the 5d level is too close, thermal
     excitations may excite the electron into the
     conduction band
  4. Level of localisation of the (Ce3+) state on the
     Ce atom
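The four criteria amount to a simple screening filter. A sketch follows; the numeric thresholds and the CBM-Ce 5d input value are invented placeholders, not values from the slides:

```python
def screen_candidate(bandgap_ev, ce4f_minus_vbm_ev, cbm_minus_ce5d_ev,
                     localisation_pct, min_gap=2.0, min_4f_vbm=0.3,
                     min_cbm_5d=0.5, min_loc=30.0):
    """Apply the four screening criteria to one candidate compound.
    Thresholds are hypothetical defaults for illustration."""
    return (bandgap_ev >= min_gap                # 1. gap fits Ce 4f/5d levels
            and ce4f_minus_vbm_ev >= min_4f_vbm  # 2. hole capture at Ce site
            and cbm_minus_ce5d_ev >= min_cbm_5d  # 3. no thermal escape to CB
            and localisation_pct >= min_loc)     # 4. localisation on Ce atom
```

For example, with these placeholder thresholds the Ba2YCl7:Ce numbers from the next slide pass, while La2Ca2O5:Ce fails on criteria 2 and 4.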
25
Predictions for some new materials

Compound    | Bandgap EG (eV) | Ce 4f - VBM (eV) | (Ce3+) localisation (%)
Ba2YCl7:Ce  | 4.05            | 0.87             | 64
BaY2Br8:Ce  | 2.94            | 0.45             | 39
BaY2I8:Ce   | 2.50            | 0.0              | 42
LaFS:Ce     | 3.27            | 0.71             | 45
La2Ca2O5:Ce | 2.38            | 0.26             | 28
La2Mg2O5:Ce | 2.80            | 0.1              | 12

Ba2YCl7:Ce was made by experimentalists and found to
be a very bright scintillator. An initial patent
filing was taken out for the material Ba2YCl7:Ce.
26
The Quantization Condition of Quantum-well States
in Cu/Co(100)
  • Theoretical investigation of quantum-well
    states in Cu films using our codes (PARATEC,
    PEtot) to compare with experiments at the ALS (E.
    Rotenberg, Y.Z. Wu, Z.Q. Qiu)
  • New computational methods for metallic systems
    were used in the calculations
  • Led to an understanding of surface effects on
    the quantum-well states; improves on the simple
    Phase Accumulation Model used previously

[Figure: QW states in a copper wedge. The difference
between theory and experiment is improved by taking
surface effects into account.]
27
Summary
  • Plane-wave electronic structure methods can
    scale to the 1K-10K processor regime
  • Higher processor counts require different
    parallelization methods (state parallelism,
    k-point parallelism, e.g. the Qbox code)
  • This research was funded by the DOE as part of
    the Modeling and Simulation in Nanoscience
    Initiative (MICS and BES) contract No
    DE-AC03-76SF00098
  • Gamma ray detection funded by DHS

28
Future Directions
  • O(N)-based methods (exploiting locality) can give
    a sparse-matrix problem
  • Excited-state calculations
  • Transport calculations (conductivity,
    dynamical properties)