Title: Massively parallel atomistic simulations of thermal properties of silicon on the Teragrid
1Massively parallel atomistic simulations of thermal properties of silicon on the Teragrid
Lin Sun, Chinh Le, Faisal Saied, Jayathi Murthy, David McWilliams
Teragrid06, Indianapolis, June 12-15
2Author affiliations
Lin Sun (1), Chinh Le (2), Faisal Saied (2,3), Jayathi Y. Murthy (1), David McWilliams (4)
(1) School of Mechanical Engineering, Purdue University
(2) Rosen Center for Advanced Computing, Purdue University
(3) Computing Research Institute, Purdue University
(4) National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
3Acknowledgements
- NSF Teragrid resources at NCSA, PSC, SDSC, and RCAC
4Motivation
- Can HPC simulations match experiments (verify, validate, predict)?
- Can scientific codes be designed to scale up to new levels of massive parallelism?
5Molecular dynamics simulations
- MD simulations have been used to predict the thermal properties of various bulk materials and to study phonon transport at the nanoscale.
(Images) Nanotubes in a polymer matrix (http://depts.washington.edu/polylab/cn.html); FinFET transistors; CNT thin-film transistor
6Silicon study using EDIP
- The purpose of this study is to calculate the thermal conductivities of bulk silicon and silicon thin films.
- The Environment-Dependent Interatomic Potential (EDIP) is chosen to describe the interaction between silicon atoms.
- Inter-atomic forces are computed only when the distance between atoms lies within a short cutoff range.
(Image) Silicon lattice structure
7Molecular dynamics
Flowchart: Initial condition -> Build neighbor list -> Compute forces -> Update positions and velocities (r, v) -> Evaluate properties -> Output data -> Finish
- I.C.: crystal lattice with random initial velocities
- B.C.: periodic
- NVE ensemble
- Leapfrog Verlet integrator (see the sketch below)
- Time step: 0.1-5 fs
- Total running time: 1-3 ns
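As an illustration of the integrator step, here is a minimal C sketch of a leapfrog (velocity-Verlet form) update. It is not the production code: the Atom struct, compute_forces(), and all names are assumptions, and the real force routine evaluates the EDIP potential over the neighbor list.

```c
/* Minimal sketch of one leapfrog / velocity-Verlet time step.
 * Atom, compute_forces(), and all names are illustrative assumptions. */
typedef struct { double x[3], v[3], f[3], m; } Atom;

void compute_forces(Atom *a, int n);   /* assumed: fills a[i].f from the potential */

void leapfrog_step(Atom *a, int n, double dt)
{
    for (int i = 0; i < n; ++i)
        for (int d = 0; d < 3; ++d) {
            a[i].v[d] += 0.5 * dt * a[i].f[d] / a[i].m;  /* half kick with old forces */
            a[i].x[d] += dt * a[i].v[d];                 /* drift (periodic wrap omitted) */
        }

    compute_forces(a, n);                                /* forces at new positions */

    for (int i = 0; i < n; ++i)
        for (int d = 0; d < 3; ++d)
            a[i].v[d] += 0.5 * dt * a[i].f[d] / a[i].m;  /* second half kick */
}
```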
8Thermal conductivity prediction
- Equilibrium MD (Green-Kubo) method
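In standard form, the Green-Kubo relation for the thermal conductivity is given below, where V is the system volume, k_B the Boltzmann constant, T the temperature, and J the heat current vector:

```latex
k \;=\; \frac{1}{3\,V\,k_B T^{2}} \int_0^{\infty} \bigl\langle \mathbf{J}(0)\cdot\mathbf{J}(t) \bigr\rangle \, dt
```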
9Thermal conductivity prediction
- Non-equilibrium MD
- By adding energy in one region and removing it in another, a heat flux is imposed on the system and a temperature gradient develops. Fourier's law is then applied to calculate the thermal conductivity (written out below).
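The Fourier-law step, with q'' the imposed heat flux and dT/dx the measured temperature gradient along the transport direction:

```latex
q'' = -k\,\frac{dT}{dx}
\quad\Longrightarrow\quad
k = -\,\frac{q''}{dT/dx}
```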
10The Need for High Performance
- It takes more than one month to complete an MD run with 1 million atoms and 1 million time steps when running serially (a rough estimate follows).
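A rough consistency check, using the serial estimate reported later in the deck (39.3 days on one DataStar processor for this problem size); the per-step time is inferred from that figure, not measured here:

```latex
\frac{39.3 \times 24 \times 3600\ \mathrm{s}}{10^{6}\ \mathrm{steps}} \approx 3.4\ \mathrm{s/step}
\quad\Longrightarrow\quad
10^{6}\ \text{steps take roughly 39 days serially}
```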
11Advances in HPC
- Based on current plans, there will be petascale resources on NSF's Teragrid with tens of thousands of processors (cores) within a few years. (A 200-teraflops machine will arrive sooner.)
- As we build these larger systems and seek a higher degree of scalability, it is essential to identify and eliminate limits to scalability.
12Flowchart of parallel MD program
13Performance engineering for scalability
- Data distribution
- Ghost zone updates
- Asynchronous message passing
- Message coalescing
- Overlapping arithmetic and communication
- Latency-bound vs. Bandwidth-bound
14Data distribution
A 3D domain decomposition is preferable to 1D or 2D decompositions because one can use more processors and obtain a more favorable surface-area-to-volume ratio for each subdomain (an MPI setup sketch follows the figure).
(Figures) 1D processor mesh vs. 3D processor mesh
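A minimal sketch of setting up such a 3D processor mesh with MPI. The MPI calls are standard, but the structure is illustrative and not taken from the authors' code:

```c
#include <mpi.h>

/* Sketch: build a 3D periodic Cartesian process grid and find the six
 * face neighbors used later for ghost-zone exchange (illustrative only). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);                /* e.g. 512 ranks -> 8 x 8 x 8 */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    int lo[3], hi[3];
    for (int d = 0; d < 3; ++d)
        MPI_Cart_shift(cart, d, 1, &lo[d], &hi[d]);  /* lower/upper neighbor per axis */

    /* ... each rank owns a box of atoms and exchanges ghost atoms with
       lo[d] / hi[d] along each axis ... */

    MPI_Finalize();
    return 0;
}
```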
15 2-D illustration of ghost zone updates
By appropriately ordering the nearest-neighbor communication, the ghost zone update can be done in 4 exchanges in 2D (6 exchanges in 3D), rather than 8 (26 in 3D).
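A C/MPI sketch of the ordered exchange: one send/receive pair per direction along each axis (6 messages in 3D). Ghosts received along earlier axes are included in the later sends, so edge and corner neighbors need no extra messages. pack_boundary(), unpack_ghosts(), and the buffers are assumed helpers, not the authors' routines.

```c
#include <mpi.h>

extern double sendbuf[], recvbuf[];                       /* assumed buffers */
int  pack_boundary(int axis, int dir, double *buf);       /* assumed helpers */
void unpack_ghosts(int axis, int dir, const double *buf, int n);

/* Ordered ghost-zone update: 2 exchanges per axis = 6 messages in 3D. */
void ghost_update(MPI_Comm cart, const int lo[3], const int hi[3])
{
    for (int axis = 0; axis < 3; ++axis)
        for (int dir = -1; dir <= 1; dir += 2) {
            int dest = (dir < 0) ? lo[axis] : hi[axis];
            int src  = (dir < 0) ? hi[axis] : lo[axis];

            int nsend = pack_boundary(axis, dir, sendbuf), nrecv;
            MPI_Sendrecv(&nsend, 1, MPI_INT, dest, 0,
                         &nrecv, 1, MPI_INT, src,  0, cart, MPI_STATUS_IGNORE);
            MPI_Sendrecv(sendbuf, nsend, MPI_DOUBLE, dest, 1,
                         recvbuf, nrecv, MPI_DOUBLE, src,  1, cart,
                         MPI_STATUS_IGNORE);
            unpack_ghosts(axis, dir, recvbuf, nrecv);
        }
}
```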
16MPI communication issues
- Asynchronous (non-blocking) message passing: MPI_Isend, MPI_Irecv (sketch below)
- Buffered MPI communication
- Eager/Rendezvous protocols
- TCP buffer size
- Monitoring MPI performance
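A sketch of the non-blocking pattern. Posting the receives before the sends lets large messages that fall under the rendezvous protocol complete without extra buffering; cart, nbrs[], the buffers, and pack_face() are assumptions, not names from the actual code.

```c
#include <mpi.h>

int pack_face(int k, double *buf);   /* assumed: coalesces data for neighbor k */

/* Post all receives, then all sends, then wait for the 12 requests. */
void exchange_nonblocking(MPI_Comm cart, const int nbrs[6],
                          double *sendbuf[6], double *recvbuf[6], int maxcount)
{
    MPI_Request req[12];
    int nreq = 0;

    /* Tag by face index; face k on this rank pairs with face k^1 on the
       neighbor, so messages match unambiguously even on small process grids. */
    for (int k = 0; k < 6; ++k)
        MPI_Irecv(recvbuf[k], maxcount, MPI_DOUBLE, nbrs[k], k ^ 1,
                  cart, &req[nreq++]);

    for (int k = 0; k < 6; ++k) {
        int n = pack_face(k, sendbuf[k]);
        MPI_Isend(sendbuf[k], n, MPI_DOUBLE, nbrs[k], k, cart, &req[nreq++]);
    }

    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
}
```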
17Message coalescing
- At each step, all the data being sent from one processor to a given neighbor is packed into a single message buffer and unpacked by the receiving processor.
- This results in fewer, longer messages.
- E.g., the x, y, z coordinates of all boundary atoms travel in one message rather than three (packing sketch below).
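A sketch of the packing step: the coordinates of every atom destined for one neighbor go into a single contiguous buffer, so one MPI message replaces several. The Atom struct and the ids[] list are the same illustrative assumptions used in the earlier sketches.

```c
typedef struct { double x[3], v[3], f[3], m; } Atom;   /* illustrative */

/* Pack the coordinates of the n atoms listed in ids[] into buf[];
 * returns the number of doubles to send in the single coalesced message. */
int pack_atoms(const Atom *a, const int *ids, int n, double *buf)
{
    int m = 0;
    for (int i = 0; i < n; ++i) {
        buf[m++] = a[ids[i]].x[0];   /* x */
        buf[m++] = a[ids[i]].x[1];   /* y */
        buf[m++] = a[ids[i]].x[2];   /* z */
    }
    return m;
}
```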
18Overlapping arithmetic and communication
- We do not see any significant overlap between arithmetic and communication in our MD code.
- However, this remains a very important goal when designing highly scalable codes; a common pattern is sketched below.
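For reference, one common pattern for pursuing such overlap (explicitly not what the current code achieves, per the slide; all routine names are assumptions):

```c
#include <mpi.h>

void start_ghost_exchange(MPI_Request *req, int *nreq);  /* assumed: posts Isend/Irecv  */
void compute_forces_interior(void);                      /* assumed: needs no ghost data */
void compute_forces_boundary(void);                      /* assumed: uses ghost atoms    */

void force_step_with_overlap(void)
{
    MPI_Request req[12];
    int nreq;

    start_ghost_exchange(req, &nreq);             /* 1. start communication        */
    compute_forces_interior();                    /* 2. overlap with interior work */
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);  /* 3. complete the exchange      */
    compute_forces_boundary();                    /* 4. finish boundary atoms      */
}
```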
19Latency-bound vs. Bandwidth-bound
- In the strong-scaling context, as we scale out to a large number of processors, the message sizes decrease, and there can be a transition from a bandwidth-bound mode to a latency-bound mode (a simple cost model is given below).
- In our MD code, with message packing and fewer, longer messages, we are not in the latency-bound regime.
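A simple cost model makes the distinction concrete (alpha is the per-message latency and beta the bandwidth; the symbols are ours, not from the slides):

```latex
T_{\mathrm{msg}}(n) = \alpha + \frac{n}{\beta},
\qquad
\text{latency-bound if } n \ll \alpha\beta,
\qquad
\text{bandwidth-bound if } n \gg \alpha\beta
```

Under strong scaling with a 3D decomposition, the per-message size shrinks roughly as the subdomain surface, i.e. as (N/P)^(2/3), which is what can push a run from the bandwidth-bound toward the latency-bound regime.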
20Latency-bound vs. Bandwidth-bound
21Teragrid resources used
Tungsten
Cobalt
Mercury
Lear
Bigben
BlueGene
DataStar
22Specifications of the computers used in this study
NCSA - National Center for Supercomputing Applications
PSC - Pittsburgh Supercomputing Center
RCAC - Rosen Center for Advanced Computing, Purdue University
SDSC - San Diego Supercomputer Center
23Timing experiments
Timing results presented here are for a test case
with 1 million atoms. We ran a number of time
steps and took the average time per time
step. The arithmetic and communication (MPI)
times were measured and are presented
separately, along with the total (average) time,
per time step.
24Timing results (in seconds)
arith - average arithmetic time for one time step
comm - average communication time for one time step
total - average total time for one time step, i.e. the sum of the arithmetic and communication times
ratio (R) - arithmetic time divided by communication time
25Time on 64 processors
26Time on 512 processors
27Average arithmetic time per step
28Average arithmetic time per step
29Average communication time per step
30Average communication time per step
31Average total time
32Specifications of the computers used in this study
NCSA - National Center for Supercomputing Applications
PSC - Pittsburgh Supercomputing Center
RCAC - Rosen Center for Advanced Computing, Purdue University
SDSC - San Diego Supercomputer Center
33Average total time
34Speedup
Speedup over the 16 processor times.
35A note on metrics
Time to completion and speedup are two different
metrics, and each has a role, depending on the
question we are asking. Notice that the BlueGene
is the slowest in the time to completion chart,
but it is the best architecture for
speedup/scalability.
36Completion times
- DataStar: 39.3 days on 1 proc (est.); 4.3 hours using 512 procs (speedup 219); 2.7 hours using 1024 procs (speedup 349)
- Tungsten: 29.6 days on 1 proc (est.); 3.0 hours using 512 procs (speedup 236); 1.95 hours using 1024 procs (speedup 364)
- For reference, a speedup of 100 means 30 days reduced to 7.2 hours.
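The quoted speedups are simply the ratio of the estimated serial time to the measured parallel time, e.g.:

```latex
\frac{39.3\ \text{days} \times 24\ \text{h/day}}{4.3\ \text{h}} \approx 219,
\qquad
\frac{29.6 \times 24}{1.95} \approx 364
```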
37Completion times
- The following processor counts require roughly the same time to complete:
38Cost-performance issues
- Clusters based on commodity components offer a very attractive price/performance ratio.
- However, they do have some drawbacks, such as the mismatch between processor speed and memory and network speeds.
39Bulk Silicon Thermal Conductivity
Predicted thermal conductivity of bulk silicon using the Green-Kubo method, as a function of the number of atoms. The experimental value is 148 W/mK. The deviation is attributed to the inaccuracy of the chosen inter-atomic potential.
40Silicon Thin Films
Out-of-plane thermal conductivity as a function
of film thickness.
A typical temperature profile.
41Summary
- We have demonstrated that it is possible to design simulations that have predictive power and can reproduce experimental results.
- We have demonstrated that a carefully designed parallel code can scale to over 1000 processors on current supercomputers.
- In doing so, we have shown how an interdisciplinary team can work effectively to tackle problems in HPC.