Title: Massively parallel atomistic simulations of thermal properties of silicon on the Teragrid
1Massively parallel atomistic simulations of thermal properties of silicon on the Teragrid
Lin Sun, Chinh Le, Faisal Saied, Jayathi Murthy, David McWilliams
Teragrid06, Indianapolis, June 12-15
2Author affiliations
Lin Sun (1), Chinh Le (2), Faisal Saied (2,3), Jayathi Y. Murthy (1), David McWilliams (4)
(1) School of Mechanical Engineering, Purdue University
(2) Rosen Center for Advanced Computing, Purdue University
(3) Computing Research Institute, Purdue University
(4) National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
3Acknowledgements
- NSF Teragrid resources at NCSA, PSC, SDSC, and RCAC
4Motivation
- Can HPC simulations match experiments (verify, validate, predict)?
- Can scientific codes be designed to scale up to new levels of massive parallelism?
5Molecular dynamics simulations
- MD simulations have been used to predict the thermal properties of various bulk materials and to study phonon transport at the nanoscale.
(Images) Nanotubes in a polymer matrix (http://depts.washington.edu/polylab/cn.html); FinFET transistors; CNT thin-film transistor
6Silicon study using EDIP
- The purpose of this study is to calculate the thermal conductivities of bulk silicon and silicon thin films.
- The Environment-Dependent Interatomic Potential (EDIP) is chosen to describe the interaction between silicon atoms.
- Inter-atomic forces are computed only when the distance between atoms lies within a short cutoff range.
(Image) Silicon lattice structure
7Molecular dynamics
Flowchart: Initial condition -> Build neighbor list -> Compute forces -> Update positions and velocities (r, v) -> Evaluate properties -> Output data -> Finish
- I.C.: crystal lattice with random initial velocities
- B.C.: periodic
- NVE ensemble
- Leapfrog Verlet integrator (see the sketch below)
- Time step: 0.1-5 fs
- Total running time: 1-3 ns
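As an illustration of the integrator step, here is a minimal C sketch of a leapfrog (velocity-Verlet form) update. It is not the production code: the Atom struct, compute_forces(), and all names are assumptions, and the real force routine evaluates the EDIP potential over the neighbor list.

```c
/* Minimal sketch of one leapfrog / velocity-Verlet time step.
 * Atom, compute_forces(), and all names are illustrative assumptions. */
typedef struct { double x[3], v[3], f[3], m; } Atom;

void compute_forces(Atom *a, int n);   /* assumed: fills a[i].f from the potential */

void leapfrog_step(Atom *a, int n, double dt)
{
    for (int i = 0; i < n; ++i)
        for (int d = 0; d < 3; ++d) {
            a[i].v[d] += 0.5 * dt * a[i].f[d] / a[i].m;  /* half kick with old forces */
            a[i].x[d] += dt * a[i].v[d];                 /* drift (periodic wrap omitted) */
        }

    compute_forces(a, n);                                /* forces at new positions */

    for (int i = 0; i < n; ++i)
        for (int d = 0; d < 3; ++d)
            a[i].v[d] += 0.5 * dt * a[i].f[d] / a[i].m;  /* second half kick */
}
```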
8Thermal conductivity prediction
- Equilibrium MD (Green-Kubo) method
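In standard form, the Green-Kubo relation for the thermal conductivity is given below, where V is the system volume, k_B the Boltzmann constant, T the temperature, and J the heat current vector:

```latex
k \;=\; \frac{1}{3\,V\,k_B T^{2}} \int_0^{\infty} \bigl\langle \mathbf{J}(0)\cdot\mathbf{J}(t) \bigr\rangle \, dt
```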
9Thermal conductivity prediction
- Non-equilibrium MD
- By adding energy in one region and removing it in another, a heat flux is imposed on the system and a temperature gradient develops. Fourier's law is then applied to calculate the thermal conductivity (written out below).
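The Fourier-law step, with q'' the imposed heat flux and dT/dx the measured temperature gradient along the transport direction:

```latex
q'' = -k\,\frac{dT}{dx}
\quad\Longrightarrow\quad
k = -\,\frac{q''}{dT/dx}
```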
10The Need for High Performance
- It takes more than one month to complete an MD run with 1 million atoms and 1 million time steps when running serially (a rough estimate follows).
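A rough consistency check, using the serial estimate reported later in the deck (39.3 days on one DataStar processor for this problem size); the per-step time is inferred from that figure, not measured here:

```latex
\frac{39.3 \times 24 \times 3600\ \mathrm{s}}{10^{6}\ \mathrm{steps}} \approx 3.4\ \mathrm{s/step}
\quad\Longrightarrow\quad
10^{6}\ \text{steps take roughly 39 days serially}
```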
11Advances in HPC
- Based on current plans, there will be petascale resources on NSF's Teragrid with tens of thousands of processors (cores) within a few years. (A 200-teraflops machine will arrive sooner.)
- As we build these larger systems and seek a higher degree of scalability, it is essential to identify and eliminate limits to scalability.
12Flowchart of parallel MD program
13Performance engineering for scalability
- Data distribution
- Ghost zone updates
- Asynchronous message passing
- Message coalescing
- Overlapping arithmetic and communication
- Latency-bound vs. Bandwidth-bound
14Data distribution
A 3D domain decomposition is preferable to 1D or 2D decompositions because one can use more processors and obtain a more favorable surface-area-to-volume ratio for each subdomain (an MPI setup sketch follows the figure).
(Figures) 1D processor mesh vs. 3D processor mesh
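A minimal sketch of setting up such a 3D processor mesh with MPI. The MPI calls are standard, but the structure is illustrative and not taken from the authors' code:

```c
#include <mpi.h>

/* Sketch: build a 3D periodic Cartesian process grid and find the six
 * face neighbors used later for ghost-zone exchange (illustrative only). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);                /* e.g. 512 ranks -> 8 x 8 x 8 */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    int lo[3], hi[3];
    for (int d = 0; d < 3; ++d)
        MPI_Cart_shift(cart, d, 1, &lo[d], &hi[d]);  /* lower/upper neighbor per axis */

    /* ... each rank owns a box of atoms and exchanges ghost atoms with
       lo[d] / hi[d] along each axis ... */

    MPI_Finalize();
    return 0;
}
```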
15 2-D illustration of ghost zone updates
By appropriately ordering the nearest-neighbor communication, the ghost zone update can be done in 4 exchanges in 2D (6 exchanges in 3D), rather than 8 (26 in 3D).
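A C/MPI sketch of the ordered exchange: one send/receive pair per direction along each axis (6 messages in 3D). Ghosts received along earlier axes are included in the later sends, so edge and corner neighbors need no extra messages. pack_boundary(), unpack_ghosts(), and the buffers are assumed helpers, not the authors' routines.

```c
#include <mpi.h>

extern double sendbuf[], recvbuf[];                       /* assumed buffers */
int  pack_boundary(int axis, int dir, double *buf);       /* assumed helpers */
void unpack_ghosts(int axis, int dir, const double *buf, int n);

/* Ordered ghost-zone update: 2 exchanges per axis = 6 messages in 3D. */
void ghost_update(MPI_Comm cart, const int lo[3], const int hi[3])
{
    for (int axis = 0; axis < 3; ++axis)
        for (int dir = -1; dir <= 1; dir += 2) {
            int dest = (dir < 0) ? lo[axis] : hi[axis];
            int src  = (dir < 0) ? hi[axis] : lo[axis];

            int nsend = pack_boundary(axis, dir, sendbuf), nrecv;
            MPI_Sendrecv(&nsend, 1, MPI_INT, dest, 0,
                         &nrecv, 1, MPI_INT, src,  0, cart, MPI_STATUS_IGNORE);
            MPI_Sendrecv(sendbuf, nsend, MPI_DOUBLE, dest, 1,
                         recvbuf, nrecv, MPI_DOUBLE, src,  1, cart,
                         MPI_STATUS_IGNORE);
            unpack_ghosts(axis, dir, recvbuf, nrecv);
        }
}
```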
16MPI communication issues
- Asynchronous (non-blocking) message passing: MPI_Isend, MPI_Irecv (sketch below)
- Buffered MPI communication
- Eager/Rendezvous protocols
- TCP buffer size
- Monitoring MPI performance
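A sketch of the non-blocking pattern. Posting the receives before the sends lets large messages that fall under the rendezvous protocol complete without extra buffering; cart, nbrs[], the buffers, and pack_face() are assumptions, not names from the actual code.

```c
#include <mpi.h>

int pack_face(int k, double *buf);   /* assumed: coalesces data for neighbor k */

/* Post all receives, then all sends, then wait for the 12 requests. */
void exchange_nonblocking(MPI_Comm cart, const int nbrs[6],
                          double *sendbuf[6], double *recvbuf[6], int maxcount)
{
    MPI_Request req[12];
    int nreq = 0;

    /* Tag by face index; face k on this rank pairs with face k^1 on the
       neighbor, so messages match unambiguously even on small process grids. */
    for (int k = 0; k < 6; ++k)
        MPI_Irecv(recvbuf[k], maxcount, MPI_DOUBLE, nbrs[k], k ^ 1,
                  cart, &req[nreq++]);

    for (int k = 0; k < 6; ++k) {
        int n = pack_face(k, sendbuf[k]);
        MPI_Isend(sendbuf[k], n, MPI_DOUBLE, nbrs[k], k, cart, &req[nreq++]);
    }

    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
}
```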
17Message coalescing
- At each step, all the data being sent from one processor to a given neighbor is packed into a single message buffer and unpacked by the receiving processor.
- This results in fewer, longer messages.
- E.g., the x, y, z coordinates of all boundary atoms travel in one message rather than three (packing sketch below).
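A sketch of the packing step: the coordinates of every atom destined for one neighbor go into a single contiguous buffer, so one MPI message replaces several. The Atom struct and the ids[] list are the same illustrative assumptions used in the earlier sketches.

```c
typedef struct { double x[3], v[3], f[3], m; } Atom;   /* illustrative */

/* Pack the coordinates of the n atoms listed in ids[] into buf[];
 * returns the number of doubles to send in the single coalesced message. */
int pack_atoms(const Atom *a, const int *ids, int n, double *buf)
{
    int m = 0;
    for (int i = 0; i < n; ++i) {
        buf[m++] = a[ids[i]].x[0];   /* x */
        buf[m++] = a[ids[i]].x[1];   /* y */
        buf[m++] = a[ids[i]].x[2];   /* z */
    }
    return m;
}
```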
18Overlapping arithmetic and communication
- We do not see any significant overlap between arithmetic and communication in our MD code.
- However, this remains a very important goal when designing highly scalable codes; a common pattern is sketched below.
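For reference, one common pattern for pursuing such overlap (explicitly not what the current code achieves, per the slide; all routine names are assumptions):

```c
#include <mpi.h>

void start_ghost_exchange(MPI_Request *req, int *nreq);  /* assumed: posts Isend/Irecv  */
void compute_forces_interior(void);                      /* assumed: needs no ghost data */
void compute_forces_boundary(void);                      /* assumed: uses ghost atoms    */

void force_step_with_overlap(void)
{
    MPI_Request req[12];
    int nreq;

    start_ghost_exchange(req, &nreq);             /* 1. start communication        */
    compute_forces_interior();                    /* 2. overlap with interior work */
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);  /* 3. complete the exchange      */
    compute_forces_boundary();                    /* 4. finish boundary atoms      */
}
```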
19Latency-bound vs. Bandwidth-bound
- In the strong-scaling context, as we scale out to a large number of processors, the message sizes decrease, and there can be a transition from a bandwidth-bound mode to a latency-bound mode (a simple cost model is given below).
- In our MD code, with message packing and fewer, longer messages, we are not in the latency-bound regime.
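A simple cost model makes the distinction concrete (alpha is the per-message latency and beta the bandwidth; the symbols are ours, not from the slides):

```latex
T_{\mathrm{msg}}(n) = \alpha + \frac{n}{\beta},
\qquad
\text{latency-bound if } n \ll \alpha\beta,
\qquad
\text{bandwidth-bound if } n \gg \alpha\beta
```

Under strong scaling with a 3D decomposition, the per-message size shrinks roughly as the subdomain surface, i.e. as (N/P)^(2/3), which is what can push a run from the bandwidth-bound toward the latency-bound regime.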
20Latency-bound vs. Bandwidth-bound
21Teragrid resources used
Tungsten
Cobalt
Mercury
Lear
Bigben
BlueGene
DataStar
22Specifications of the computers used in this study
NCSA - National Center for Supercomputing Applications
PSC - Pittsburgh Supercomputing Center
RCAC - Rosen Center for Advanced Computing, Purdue University
SDSC - San Diego Supercomputer Center
23Timing experiments
Timing results presented here are for a test case
with 1 million atoms. We ran a number of time
steps and took the average time per time
step. The arithmetic and communication (MPI)
times were measured and are presented
separately, along with the total (average) time,
per time step.
24Timing results (in seconds)
arith - average arithmetic time for one time step
comm - average communication time for one time step
total - average total time for one time step, i.e. the sum of the arithmetic and communication times
ratio (R) - arithmetic time divided by communication time
25Time on 64 processors
26Time on 512 processors
27Average arithmetic time per step
28Average arithmetic time per step
29Average communication time per step
30Average communication time per step
31Average total time
32Specifications of the computers used in this study
NCSA - National Center for Supercomputing Applications
PSC - Pittsburgh Supercomputing Center
RCAC - Rosen Center for Advanced Computing, Purdue University
SDSC - San Diego Supercomputer Center
33Average total time
34Speedup
Speedup over the 16 processor times.
35A note on metrics
Time to completion and speedup are two different
metrics, and each has a role, depending on the
question we are asking. Notice that the BlueGene
is the slowest in the time to completion chart,
but it is the best architecture for
speedup/scalability.
36Completion times
- DataStar: 39.3 days on 1 proc (est.); 4.3 hours using 512 procs (speedup 219); 2.7 hours using 1024 procs (speedup 349)
- Tungsten: 29.6 days on 1 proc (est.); 3.0 hours using 512 procs (speedup 236); 1.95 hours using 1024 procs (speedup 364)
- For reference, a speedup of 100 means 30 days reduced to 7.2 hours.
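The quoted speedups are simply the ratio of the estimated serial time to the measured parallel time, e.g.:

```latex
\frac{39.3\ \text{days} \times 24\ \text{h/day}}{4.3\ \text{h}} \approx 219,
\qquad
\frac{29.6 \times 24}{1.95} \approx 364
```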
37Completion times
- The following processor counts require roughly the same time to complete:
38Cost-performance issues
- Clusters based on commodity components offer a very attractive price/performance ratio.
- However, they do have some drawbacks, such as the mismatch between processor speed and memory and network speeds.
39Bulk Silicon Thermal Conductivity
Predicted thermal conductivity of bulk silicon using the Green-Kubo method, as a function of the number of atoms. The experimental value is 148 W/mK. The deviation is attributed to the inaccuracy of the chosen inter-atomic potential.
40Silicon Thin Films
Out-of-plane thermal conductivity as a function
of film thickness.
A typical temperature profile.
41Summary
- We have demonstrated that it is possible to design simulations that have predictive power and can reproduce experimental results.
- We have demonstrated that a carefully designed parallel code can scale to over 1000 processors on current supercomputers.
- In doing so, we have shown how an interdisciplinary team can work effectively to tackle problems in HPC.