Transcript and Presenter's Notes

Title: Massively parallel atomistic simulations of thermal properties of Silicon on the Teragrid


1
Massively parallel atomistic simulations of thermal properties of Silicon on the Teragrid
Lin Sun, Chinh Le, Faisal Saied, Jayathi Murthy, David McWilliams
TeraGrid '06, Indianapolis, June 12-15
2
Author affiliations
Lin Sun (1), Chinh Le (2), Faisal Saied (2,3), Jayathi Y. Murthy (1), David McWilliams (4)
(1) School of Mechanical Engineering, Purdue University
(2) Rosen Center for Advanced Computing, Purdue University
(3) Computing Research Institute, Purdue University
(4) National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
3
Acknowledgements
  • NSF Teragrid resources at NCSA, PSC, SDSC, and RCAC

4
Motivation
  • Can HPC simulations match experiments (verify,
    validate, predict)?
  • Can scientific codes be designed to scale up to
    new levels of massive parallelism?

5
Molecular dynamics simulations
  • MD simulations have been used to predict the
    thermal properties of various bulk materials and
    to study phonon transport at the nanoscale.

[Images: nanotubes in a polymer matrix (http://depts.washington.edu/polylab/cn.html), FinFET transistors, CNT thin-film transistor]
6
Silicon study using EDIP
  • The purpose of this paper is to calculate the
    thermal conductivities of bulk silicon and its
    thin films.
  • Environment Dependent Inter-atomic Potential
    (EDIP) is chosen to describe the interaction
    between silicon atoms.
  • Inter-atomic forces are considered only when the
    distance between atoms is within a short cutoff range.

Silicon lattice structure
7
Molecular dynamics
Flowchart: Initial Condition → Build Neighbor List → Compute Forces → Update r, v → Evaluate Properties → Output Data → Finish
  • I.C.: crystal lattice positions with random initial velocities
  • B.C.: periodic
  • NVE ensemble
  • Leapfrog Verlet integrator
  • Time step: 0.1-5 fs
  • Total simulated time: 1-3 ns
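To make the time-stepping loop concrete, here is a minimal sketch of one leapfrog Verlet update in C. The names (natoms, pos, vel, force, mass, dt, rcut) are illustrative placeholders, not the authors' code, and the cutoff test stands in for the short-range force evaluation described for EDIP.

```c
/* Minimal sketch of one leapfrog Verlet step (illustrative names only). */
void leapfrog_step(int natoms, double pos[][3], double vel[][3],
                   double force[][3], double mass, double dt)
{
    for (int i = 0; i < natoms; i++) {
        for (int d = 0; d < 3; d++) {
            vel[i][d] += (force[i][d] / mass) * dt; /* v(t+dt/2) = v(t-dt/2) + F(t)/m * dt */
            pos[i][d] += vel[i][d] * dt;            /* r(t+dt)   = r(t) + v(t+dt/2) * dt  */
        }
    }
}

/* Pairs farther apart than the cutoff contribute no force: the short-range
 * assumption noted for EDIP, and the reason the neighbor list pays off. */
int within_cutoff(const double ri[3], const double rj[3], double rcut)
{
    double dx = ri[0] - rj[0], dy = ri[1] - rj[1], dz = ri[2] - rj[2];
    return dx * dx + dy * dy + dz * dz < rcut * rcut;
}
```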
8
Thermal conductivity prediction
  • Equilibrium MD (Green-Kubo) method
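The slide's equation is not preserved in this transcript; for reference, the standard Green-Kubo expression relates the thermal conductivity to the equilibrium heat-current autocorrelation (the notation here is the usual one, not necessarily the slide's):

```latex
% Standard Green-Kubo relation for thermal conductivity (reference form).
k = \frac{1}{3 V k_B T^2} \int_0^{\infty} \langle \mathbf{J}(0) \cdot \mathbf{J}(t) \rangle \, dt
% V: system volume, k_B: Boltzmann constant, T: temperature,
% J(t): instantaneous heat-current vector.
```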

9
Thermal conductivity prediction
  • Non-equilibrium MD
  • By adding energy in one region and removing it in another, a heat
    flux is imposed on the system and a temperature gradient forms.
    Fourier's law is then applied to extract the thermal conductivity.
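Again the slide's own equation is not preserved here; the reference relation is Fourier's law, rearranged for the conductivity (illustrative notation):

```latex
% Fourier's law and its rearrangement used to extract k from NEMD data.
q = -k \, \frac{dT}{dx}
\qquad \Longrightarrow \qquad
k = -\frac{q}{\,dT/dx\,}
% q: imposed heat flux, dT/dx: measured temperature gradient.
```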

10
The Need for High Performance
  • It takes more than one month to complete an MD run
    with 1M atoms and 1M time steps when running serially.

11
Advances in HPC
  • Based on current plans, there will be petascale
    resources on NSF's Teragrid with tens of thousands
    of processors (cores) within a few years. (A 200
    Teraflops machine will happen sooner.)
  • As we build these larger systems and seek a higher
    degree of scalability, it is essential to identify
    and eliminate limits to scalability.

12
Flowchart of parallel MD program
13
Performance engineering for scalability
  • Data distribution
  • Ghost zone updates
  • Asynchronous message passing
  • Message coalescing
  • Overlapping arithmetic and communication
  • Latency-bound vs. Bandwidth-bound

14
Data distribution
A 3D domain decomposition is preferable to a 1D or 2D decomposition
because one can use more processors and obtain a more favorable
surface-area-to-volume ratio per subdomain.
[Figure: 1D processor mesh vs. 3D processor mesh]
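A rough estimate (illustrative notation, assuming a cubic domain of N^3 cells distributed over P processors) makes the surface-to-volume argument concrete:

```latex
% Ghost-zone (surface) data per processor for an N^3 domain on P processors.
S_{\mathrm{1D}} \approx 2N^2 \quad (\text{two full cross-sections; also requires } P \le N)
\qquad
S_{\mathrm{3D}} \approx 6\left(\frac{N}{P^{1/3}}\right)^{2} = \frac{6N^2}{P^{2/3}}
% Only in the 3D case does per-processor communication shrink as P grows.
```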
15
2-D illustration of ghost zone updates
[Figure: 2-D ghost zone exchange in the x and y directions]
By appropriately ordering the nearest-neighbor communication, the ghost
zone update can be done in 4 exchanges in 2D (6 exchanges in 3D) rather
than 8 (26 in 3D), because corner data is forwarded along with the
second set of exchanges.
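A sketch of this ordering in 2D is shown below, assuming a scalar field stored with a one-cell ghost layer on a periodic 2-D Cartesian communicator; the data layout, names, and use of MPI_Sendrecv are illustrative assumptions, not the authors' implementation.

```c
/* Ordered 2-D ghost-zone update: exchange in x first, then in y including
 * the x-ghost columns just received, so corner values reach the diagonal
 * neighbours without separate diagonal messages. Illustrative layout:
 * field is (nx+2) x (ny+2), ghost width 1, idx(i,j) = i*(ny+2)+j. */
#include <mpi.h>
#include <stdlib.h>

static int idx(int i, int j, int ny) { return i * (ny + 2) + j; }

void ghost_exchange_2d(double *field, int nx, int ny, MPI_Comm cart)
{
    int left, right, down, up;
    MPI_Cart_shift(cart, 0, 1, &left, &right);   /* x neighbours */
    MPI_Cart_shift(cart, 1, 1, &down, &up);      /* y neighbours */

    /* Phase 1: x faces (contiguous in this layout), interior rows only. */
    MPI_Sendrecv(&field[idx(nx, 1, ny)], ny, MPI_DOUBLE, right, 0,
                 &field[idx(0,  1, ny)], ny, MPI_DOUBLE, left,  0,
                 cart, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&field[idx(1,      1, ny)], ny, MPI_DOUBLE, left,  1,
                 &field[idx(nx + 1, 1, ny)], ny, MPI_DOUBLE, right, 1,
                 cart, MPI_STATUS_IGNORE);

    /* Phase 2: y faces over ALL nx+2 columns, including the x-ghosts filled
     * in phase 1; this forwards the corner values to diagonal neighbours. */
    double *sbuf = malloc((nx + 2) * sizeof(double));
    double *rbuf = malloc((nx + 2) * sizeof(double));

    for (int i = 0; i <= nx + 1; i++) sbuf[i] = field[idx(i, ny, ny)];
    MPI_Sendrecv(sbuf, nx + 2, MPI_DOUBLE, up,   2,
                 rbuf, nx + 2, MPI_DOUBLE, down, 2, cart, MPI_STATUS_IGNORE);
    for (int i = 0; i <= nx + 1; i++) field[idx(i, 0, ny)] = rbuf[i];

    for (int i = 0; i <= nx + 1; i++) sbuf[i] = field[idx(i, 1, ny)];
    MPI_Sendrecv(sbuf, nx + 2, MPI_DOUBLE, down, 3,
                 rbuf, nx + 2, MPI_DOUBLE, up,   3, cart, MPI_STATUS_IGNORE);
    for (int i = 0; i <= nx + 1; i++) field[idx(i, ny + 1, ny)] = rbuf[i];

    free(sbuf);
    free(rbuf);
}
```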
16
MPI communication issues
  • Asynchronous (non-blocking) message passing with
    MPI_Isend / MPI_Irecv (see the sketch after this list)
  • Buffered MPI communication
  • Eager/Rendezvous protocols
  • TCP buffer size
  • Monitoring MPI performance
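A minimal sketch of the non-blocking pattern in C: post all receives, then all sends, then wait. The six-neighbour layout, buffer names, and direction-pairing tag convention are illustrative assumptions (symmetric face sizes between neighbours), not the authors' code.

```c
/* Non-blocking neighbour exchange: post receives first, then sends, then
 * wait for all. Directions are paired (0<->1, 2<->3, 4<->5) so a receive
 * from direction k matches the neighbour's send tagged with the opposite
 * direction (k ^ 1). Illustrative sketch only. */
#include <mpi.h>

void exchange_with_neighbours(double *sendbuf[6], double *recvbuf[6],
                              const int count[6], const int nbr[6],
                              MPI_Comm comm)
{
    MPI_Request req[12];

    for (int k = 0; k < 6; k++)   /* 1. post receives early (helps the eager path) */
        MPI_Irecv(recvbuf[k], count[k], MPI_DOUBLE, nbr[k], k ^ 1, comm, &req[k]);

    for (int k = 0; k < 6; k++)   /* 2. post the matching sends */
        MPI_Isend(sendbuf[k], count[k], MPI_DOUBLE, nbr[k], k, comm, &req[6 + k]);

    /* 3. (independent computation could go here to overlap with transfers) */

    MPI_Waitall(12, req, MPI_STATUSES_IGNORE);   /* 4. complete all requests */
}
```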

17
Message coalescing
  • At each step, all the data being sent from one processor to a given
    neighbor is packed into a single message buffer and unpacked by the
    receiving processor.
  • This results in fewer, longer messages, e.g. the x, y, z coordinates
    travel together in one buffer (see the sketch below).
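A sketch of the packing step in C; names such as atom_id, buf, and nbr are hypothetical placeholders, not the authors' identifiers.

```c
/* Message coalescing: the x, y, z coordinates of the n atoms destined for
 * one neighbour are packed into a single buffer, so one message is sent
 * instead of three. Illustrative sketch only. */
#include <mpi.h>

void send_coalesced(const double *x, const double *y, const double *z,
                    const int *atom_id, int n, int nbr, MPI_Comm comm,
                    double *buf /* size >= 3*n */, MPI_Request *req)
{
    for (int k = 0; k < n; k++) {        /* pack: [x0 y0 z0 x1 y1 z1 ...] */
        int a = atom_id[k];
        buf[3 * k + 0] = x[a];
        buf[3 * k + 1] = y[a];
        buf[3 * k + 2] = z[a];
    }
    /* one long message instead of three shorter ones */
    MPI_Isend(buf, 3 * n, MPI_DOUBLE, nbr, /*tag=*/0, comm, req);
}
```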

18
Overlapping arithmetic and communication
  • We do not see any significant overlap between
    arithmetic and communication in our MD code.
  • However, this remains a very important goal when
    designing highly scalable codes.

19
Latency-bound vs. Bandwidth-bound
  • In the strong-scaling context, as we scale out to a
    large number of processors, the message sizes
    decrease, and there can be a transition from a
    bandwidth-bound mode to a latency-bound mode (see
    the simple cost model below).
  • In our MD code, with message packing and fewer,
    longer messages, we are not in the latency-bound
    regime.
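A standard latency-bandwidth cost model (not taken from the slides) makes the transition explicit:

```latex
% Cost of one point-to-point message of m bytes: latency plus bandwidth term.
t_{\mathrm{msg}} \approx \alpha + \beta\, m
% Under strong scaling of an N^3 domain on P processors (3-D decomposition),
% each face message shrinks as m \propto N^2 / P^{2/3}. Once \beta m \ll \alpha
% the exchange is latency-bound; coalescing keeps m large and delays that point.
```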

20
Latency-bound vs. Bandwidth-bound
21
Teragrid resources used
Tungsten
Cobalt
Mercury
Lear
Bigben
BlueGene
DataStar
22
Specifications of the computers used in this study
[Specification table not preserved in this transcript]
NCSA: National Center for Supercomputing Applications
PSC: Pittsburgh Supercomputing Center
RCAC: Rosen Center for Advanced Computing, Purdue University
SDSC: San Diego Supercomputer Center
23
Timing experiments
Timing results presented here are for a test case
with 1 million atoms. We ran a number of time
steps and took the average time per time
step. The arithmetic and communication (MPI)
times were measured and are presented
separately, along with the total (average) time,
per time step.
24
Timing results (in seconds)
arith ..... average arithmetic time for one time step
comm ...... average communication time for one time step
total ..... average total time for one time step, i.e. the sum of the
            arithmetic and communication times
ratio (R) . arithmetic time divided by communication time
25
Time on 64 processors
26
Time on 512 processors
27
Average arithmetic time per step
28
Average arithmetic time per step
29
Average communication time per step
30
Average communication time per step
31
Average total time
32
Specifications of the computers used in this study
[Specification table not preserved in this transcript]
NCSA: National Center for Supercomputing Applications
PSC: Pittsburgh Supercomputing Center
RCAC: Rosen Center for Advanced Computing, Purdue University
SDSC: San Diego Supercomputer Center
33
Average total time
34
Speedup
Speedup relative to the 16-processor time.
35
A note on metrics
Time to completion and speedup are two different
metrics, and each has a role, depending on the
question we are asking. Notice that the BlueGene
is the slowest in the time to completion chart,
but it is the best architecture for
speedup/scalability.
36
Completion times
  • DataStar
      39.3 days on 1 proc (est.)
      4.3 hours using 512 procs (speedup 219)
      2.7 hours using 1024 procs (speedup 349)
  • Tungsten
      29.6 days on 1 proc (est.)
      3.0 hours using 512 procs (speedup 236)
      1.95 hours using 1024 procs (speedup 364)
  • For reference, a speedup of 100 means 30 days is reduced to 7.2 hours
    (the speedup arithmetic is shown below).
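The speedup figures quoted above follow directly from the single-processor estimates; for example, for DataStar on 512 processors:

```latex
% Speedup = estimated serial time / measured parallel time.
S_{512} = \frac{39.3\ \mathrm{days} \times 24\ \mathrm{h/day}}{4.3\ \mathrm{h}}
        = \frac{943.2\ \mathrm{h}}{4.3\ \mathrm{h}} \approx 219
```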

37
Completion times
  • The following processor counts require roughly the
    same time to complete:
[Table not preserved in this transcript]

38
Cost-performance issues
  • Clusters based on commodity components offer a very
    attractive price/performance ratio.
  • However, they do have some drawbacks, such as the
    mismatch between processor speed and memory and
    network speeds.

39
Bulk Silicon Thermal Conductivity
Predicted thermal conductivity of bulk silicon
using the Green-Kubo method, as a function of the
number of atoms. The experimental value is 148 W/m·K.
The deviation is attributed to the inaccuracy of the
chosen inter-atomic potential.
40
Silicon Thin Films
Out-of-plane thermal conductivity as a function
of film thickness.
A typical temperature profile.
41
Summary
  • We have demonstrated that it is possible to
    design simulations that have predictive power,
    and can reproduce experimental results.
  • We have demonstrated that a carefully designed
    parallel code can scale up to over 1000
    processors on current supercomputers.
  • In doing so, we have shown how an
    interdisciplinary team can work effectively to
    tackle problems in HPC.