MS 15: Data-Aware Parallel Computing


Slides: 25

Provided by: asri9

Learn more at: http://www.cs.fsu.edu

1
MS 15: Data-Aware Parallel Computing

Data-Driven Parallelization in Multi-Scale Applications
Ashok Srinivasan, Florida State University
Dynamic Data Driven Finite Element Modeling of Brain Shape Deformation During Neurosurgery
Amitava Majumdar, San Diego Supercomputer Center
Dynamic Computations in Large-Scale Graphs
David Bader, Georgia Tech
Tackling Obesity in Children
Radha Nandkumar, NCSA

www.cs.fsu.edu/asriniva/presentations/siampp06
2
Data-Driven Parallelization in Multi-Scale Applications

Ashok Srinivasan
Computer Science, Florida State University
http://www.cs.fsu.edu/asriniva

Aim: Simulate for long time spans
Solution features: Use data from prior simulations to parallelize the time domain
Acknowledgements: NSF, ORNL, NERSC, NCSA
Collaborators: Yanan Yu and Namas Chandra
3
Outline

Background
Limitations of Conventional Parallelization
Example Application: Carbon Nanotube Tensile Test
Small Time Step Size in Molecular Dynamics Simulations
Data-Driven Time Parallelization
Experimental Results
Scaled efficiently to 1000 processors, for a problem where conventional parallelization scales to just 2-3 processors
Other time parallelization approaches
Conclusions

4
Background

Limitations of Conventional Parallelization
Example Application: Carbon Nanotube Tensile Test
Molecular Dynamics Simulations
Problems with Multiple Time-Scales

5
Limitations of Conventional Parallelization

Conventional parallelization decomposes the state space across processors
It is effective for a large state space
It is not effective when the computational effort arises from a large number of time steps, or when granularity becomes very fine due to a large number of processors

6
Example Application: Carbon Nanotube Tensile Test

Pull the CNT at a constant velocity
Determine the stress-strain response and the yield strain (when the CNT starts breaking) using MD
The response is strain-rate dependent

7
A Drawback of Molecular Dynamics

Molecular dynamics
In each time step, the forces of atoms on each other are modeled using some potential
After the forces are computed, update the positions
Repeat for the desired number of time steps
The time step size is about $10^{-15}$ seconds, due to physical and numerical considerations
The desired time range is much larger
A million time steps are required to reach $10^{-9}$ s
Around a day of computing for a 3000-atom CNT
As a result, MD uses unrealistically large strain rates
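
The loop described on this slide can be sketched as follows. This is a minimal illustration, not the speakers' code: it assumes a velocity Verlet integrator and a 1-D chain of harmonic springs as a stand-in potential, since the slides name neither the integrator nor the potential used for the CNT. All function names are hypothetical.

```python
import numpy as np

def spring_forces(x, k=1.0, rest=1.0):
    """Forces on a 1-D chain of atoms joined by harmonic springs (stand-in potential)."""
    stretch = np.diff(x) - rest      # extension of each bond
    f = np.zeros_like(x)
    f[:-1] += k * stretch            # a stretched bond pulls its left atom to the right
    f[1:] -= k * stretch             # and pulls its right atom to the left
    return f

def run_md(x, v, dt, n_steps, mass=1.0):
    """Velocity Verlet: compute forces, update positions, repeat."""
    f = spring_forces(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass
        x = x + dt * v_half
        f = spring_forces(x)
        v = v_half + 0.5 * dt * f / mass
    return x, v

def steps_needed(time_span, dt):
    """Number of time steps required to cover time_span with step size dt."""
    return round(time_span / dt)

# Step count quoted on the slide: 1e-9 s of simulated time at 1e-15 s per step.
n_steps_for_nanosecond = steps_needed(1e-9, 1e-15)
```

Every step depends on the result of the previous one, which is why the time loop itself resists conventional parallelization.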

8
Problems with multiple time-scales

Fine-scale computations (such as MD) are more accurate, but more time consuming
Many of the details at the finer scale are unimportant, but some are important

(Figure: a simple schematic of multiple time scales)
9
Data-Driven Time Parallelization

Time parallelization
Data-Driven Prediction
Dimensionality Reduction
Relate Simulation Parameters
Static Prediction
Dynamic Prediction
Verification

10
Time Parallelization

Each processor simulates a different time interval
The initial state is obtained by prediction, except for processor 0
Verify that the prediction for the end state is close to that computed by MD
Prediction is based on dynamically determining a relationship between the current simulation and those in a database of prior results

If the time interval is sufficiently large, then the communication overhead is small
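
One round of the scheme on this slide can be sketched serially as below. It is only an illustration under stated assumptions: the toy equation dx/dt = -x stands in for MD, `predict` is a hypothetical callable standing in for the database-driven predictor, and the intervals are simulated in a loop rather than on separate processors.

```python
import math

def simulate(x0, t0, t1, dt=1e-3):
    """Stand-in for MD: integrate dx/dt = -x from t0 to t1 with small Euler steps."""
    n = round((t1 - t0) / dt)
    x = x0
    for _ in range(n):
        x += dt * (-x)
    return x

def time_parallel_round(x_start, boundaries, predict, tol):
    """Return (number of verified intervals, state at the last verified boundary).

    Interval i runs from boundaries[i] to boundaries[i + 1].
    """
    # Interval 0 starts from the known state; the others start from predictions.
    starts = [x_start] + [predict(t) for t in boundaries[1:-1]]
    # In a real run, each interval is simulated concurrently on its own processor.
    ends = [simulate(s, boundaries[i], boundaries[i + 1]) for i, s in enumerate(starts)]
    # Interval i is valid only if its predicted start matches the computed end
    # of interval i - 1, and interval i - 1 is itself valid.
    verified = 1
    while verified < len(starts) and abs(ends[verified - 1] - starts[verified]) <= tol:
        verified += 1
    return verified, ends[verified - 1]

# An accurate predictor lets all four intervals be accepted in one round.
n_ok, state = time_parallel_round(1.0, [0, 1, 2, 3, 4], lambda t: math.exp(-t), tol=1e-2)
```

With a poor predictor only the first interval is accepted, and the next round restarts from its end state, so the quality of the prediction determines the speedup.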
11
Dimensionality Reduction

Movement of atoms in a 1000-atom CNT can be considered the motion of a point in 3000-dimensional space
Find a lower dimensional subspace close to which the points lie
We use proper orthogonal decomposition (POD)
Find a low dimensional affine subspace
Motion may, however, be complex in this subspace
Use results for different strain rates
Velocity: 10 m/s, 5 m/s, and 1 m/s
At five different time points
$[U, S, V] = \mathrm{svd}(\text{Shifted Data})$
$\text{Shifted Data} = U S V^T$
States of the CNT are expressed as
$\mu + c_1 u_1 + c_2 u_2$

(Figure: schematic labeled with $\mu$ and two basis vectors)
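
The decomposition on this slide can be sketched with NumPy as follows. This is a minimal illustration, assuming the prior states are stacked as columns of a snapshot matrix and that "shifted" means subtracting the mean state; the function names are hypothetical.

```python
import numpy as np

def pod_basis(snapshots, n_modes=2):
    """snapshots has shape (n_coords, n_snapshots), one state per column."""
    mean = snapshots.mean(axis=1, keepdims=True)
    shifted = snapshots - mean                    # the "Shifted Data"
    U, S, Vt = np.linalg.svd(shifted, full_matrices=False)
    return mean[:, 0], U[:, :n_modes], S          # mean state, leading basis vectors, singular values

def project(state, mean, basis):
    """Coefficients (c1, c2, ...) of a state in the affine subspace."""
    return basis.T @ (state - mean)

def reconstruct(coeffs, mean, basis):
    """State expressed as mean + c1*u1 + c2*u2 + ..."""
    return mean + basis @ coeffs
```

For three velocities at five time points each, `snapshots` would have 15 columns. The singular values in `S` indicate how many basis vectors are worth keeping.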
12
Basis Vectors from POD

CNT of 100 Å with 1000 atoms at 300 K

$u_1$ (blue) and $u_2$ (red) for z; $u_1$ (green) for x is not significant
Blue: z. Green, red: x, y
13
Relate strain rate and time

Coefficients of $u_1$
Blue: 1 m/s
Red: 5 m/s
Green: 10 m/s
Dotted line: same strain
Suggests that behavior is similar at similar strains
In general, clustering similar coefficients can give parameter-time relationships

14
Prediction: When v is the only parameter

Direct Predictor
Independently predict the change in each coordinate
Use precomputed results for 40 different time points each for three different velocities
To predict for (t, v) not in the database:
Determine coefficients for nearby v at nearby strains
Fit a linear surface and interpolate/extrapolate to get coefficients $c_1$ and $c_2$ for (t, v)
Get the state as $\mu + c_1 u_1 + c_2 u_2$

Green: 10 m/s, Red: 5 m/s, Blue: 1 m/s, Magenta: 0.1 m/s, Black: 0.1 m/s through direct prediction

Dynamic Prediction
Correct the above coefficients by determining the error between the previously predicted and computed states
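
The "fit a linear surface" step of the direct predictor might look like the sketch below. It assumes the nearby database entries are given as arrays of strain, velocity, and coefficient values, and that the surface is a plane in (strain, v) fitted by least squares. The slides do not give the exact parameterization, so treat the choice of variables and the function names as assumptions.

```python
import numpy as np

def fit_linear_surface(strain, v, c):
    """Least-squares fit of c ~ p0 + p1*strain + p2*v; returns (p0, p1, p2)."""
    A = np.column_stack([np.ones_like(strain), strain, v])
    params, *_ = np.linalg.lstsq(A, c, rcond=None)
    return params

def evaluate_surface(params, strain, v):
    """Interpolate or extrapolate the coefficient at a new (strain, v)."""
    return params[0] + params[1] * strain + params[2] * v

# Hypothetical database entries: two strains at each of three velocities.
strain = np.array([0.02, 0.04, 0.02, 0.04, 0.02, 0.04])
v = np.array([1.0, 1.0, 5.0, 5.0, 10.0, 10.0])
c1 = 2.0 + 3.0 * strain - 0.5 * v        # synthetic coefficients for illustration
params = fit_linear_surface(strain, v, c1)
c1_new = evaluate_surface(params, 0.05, 0.1)   # extrapolates to v = 0.1 m/s
```

The same fit would be repeated for $c_2$, after which the predicted state is assembled as $\mu + c_1 u_1 + c_2 u_2$.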

15
Verification of prediction

Definition of equivalence of two states
Atoms vibrate around their mean position
Consider states equivalent if the differences in position, potential energy, and temperature are within the normal range of fluctuations

(Figure labels: displacement (from mean), mean position)
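
A check of this kind could be sketched as follows. The tolerances are hypothetical parameters that would have to be estimated from the normal fluctuations of an actual simulation. Comparing the largest per-atom displacement is one possible reading of "difference in position", not necessarily the criterion used in the talk.

```python
import numpy as np

def states_equivalent(pos_a, pos_b, pe_a, pe_b, temp_a, temp_b,
                      pos_tol, pe_tol, temp_tol):
    """Return True if two states differ by no more than the given tolerances.

    pos_a and pos_b are (n_atoms, 3) arrays of atom positions.
    """
    max_disp = np.max(np.linalg.norm(pos_a - pos_b, axis=1))
    return bool(max_disp <= pos_tol
                and abs(pe_a - pe_b) <= pe_tol
                and abs(temp_a - temp_b) <= temp_tol)
```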
16
Experimental Results

Relate simulations with different strain rates: use the above strategy directly
Relate simulations with different strain rates and different CNT sizes: express basis vectors in a different functional form
Relate simulations with different temperatures and strain rates: dynamically identify different simulations that are similar in current behavior

17
Stress-strain response at 0.1 m/s

Blue: Exact result
Green: Direct prediction with interpolation/extrapolation
Points close to yield involve extrapolation in velocity and strain
Red: Time-parallel results

18
Speedup

Red line: Ideal speedup
Blue: v = 0.1 m/s
Green: The next predictor
v = 1 m/s, using v = 10 m/s
CNT with 1000 atoms
Xeon/Myrinet cluster

19
CNTs of varying sizes

Use a 1000-atom CNT result
Parallelize 1200-, 1600-, and 2000-atom CNT runs
Observe that the dominant mode is approximately a linear function of the initial z-coordinate
Normalize coordinates to be in [0, 1]
$z_{t+\Delta t} = z_t + \dot{z}_{t+\Delta t}\,\Delta t$; predict $\dot{z}$

Speedup plot: curves for 2000, 1600, and 1200 atoms, compared against linear speedup
Stress-strain plot: Blue: exact result for 2000 atoms; Red: 200 processors

20
Predict change in coordinates

Express x in terms of basis functions
Example:
$x_{t+\Delta t} = a_{0,t+\Delta t} + a_{1,t+\Delta t}\, x_t$
$a_{0,t+\Delta t}$ and $a_{1,t+\Delta t}$ are unknown
Express changes, y, for the base (old) simulation similarly, in terms of coefficients b, and perform a least squares fit
Predict $a_{i,t+\Delta t}$ as $b_{i,t+\Delta t} + R_{t+\Delta t}$
$R_{t+\Delta t} = (1-\beta) R_t + \beta (a_{i,t} - b_{i,t})$
Intuitively, the difference between the base coefficient and the current coefficient is predicted as a weighted combination of previous differences
We use $\beta = 0.5$
Gives more weight to the latest results
Does not let random fluctuations affect the predictor too much
Velocity is estimated as the latest accurate result known
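
The correction term on this slide is an exponentially weighted average of past differences between the current and base coefficients. The sketch below restates the two formulas for a single coefficient. The function names are hypothetical, and the sequence of differences is made up for illustration.

```python
def update_correction(R_prev, a_prev, b_prev, beta=0.5):
    """R_{t+dt} = (1 - beta) * R_t + beta * (a_t - b_t)."""
    return (1.0 - beta) * R_prev + beta * (a_prev - b_prev)

def predict_current_coefficient(b_next, R_next):
    """Predict a_{t+dt} as the base coefficient b_{t+dt} plus the correction R_{t+dt}."""
    return b_next + R_next

# If the current simulation consistently runs 1.0 above the base simulation,
# the correction approaches 1.0 geometrically: 0.5, 0.75, 0.875, ...
R = 0.0
history = []
for a_t, b_t in [(3.0, 2.0), (3.5, 2.5), (4.0, 3.0)]:
    R = update_correction(R, a_t, b_t)
    history.append(R)
```

With beta = 0.5, the most recent difference carries half the weight, and older differences decay by a factor of two per step.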

21
Temperature and velocity vary

Use 1000-atom CNT results
Temperatures: 300 K, 600 K, 900 K, 1200 K
Velocities: 1 m/s, 5 m/s, 10 m/s
Dynamically choose the closest simulation for prediction

Speedup plot: 450 K, 2 m/s, compared against linear speedup
Stress-strain plot: Blue: exact result at 450 K; Red: 200 processors
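
Choosing the closest prior simulation could be sketched as a nearest-neighbor lookup. The slides do not say how closeness is measured. The example below assumes each database entry is summarized by a vector describing its recent behavior (for instance, recent basis coefficients) and uses Euclidean distance. The database contents and names are hypothetical.

```python
import numpy as np

def closest_simulation(current, database):
    """Return the key of the database entry whose behavior vector is nearest to `current`.

    database maps (temperature_K, velocity_m_per_s) to a behavior vector.
    """
    return min(database, key=lambda key: np.linalg.norm(database[key] - current))

# Hypothetical summaries of three prior runs.
database = {
    (300, 1.0): np.array([0.1, 0.2]),
    (600, 5.0): np.array([0.5, 0.1]),
    (900, 10.0): np.array([0.9, 0.4]),
}
best = closest_simulation(np.array([0.48, 0.12]), database)
```

Because the choice is re-evaluated as the run progresses, the base simulation can change from one time interval to the next.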
22
Other time parallelization approaches

Waveform relaxation
Repeatedly solve for the entire time domain
Parallelizes well but convergence can be slow
Several variants to improve convergence
Parareal approach
Features similar to ours and to waveform relaxation
Precedes our approach
Not data-driven
Sequential phase for prediction
Not very effective in practice so far
Has much potential to be improved
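
For comparison, the parareal iteration in its commonly stated form is sketched below. This is the standard textbook update, not code from the talk. `coarse` and `fine` are hypothetical callables that each advance a state across one time interval.

```python
def parareal(u0, n_intervals, coarse, fine, n_iters):
    """Return the states at the interval boundaries after n_iters parareal iterations."""
    # Sequential prediction phase using the cheap coarse solver.
    U = [u0]
    for n in range(n_intervals):
        U.append(coarse(U[n]))
    for _ in range(n_iters):
        # The fine solves are independent across intervals, so they can run in parallel.
        F = [fine(U[n]) for n in range(n_intervals)]
        G_old = [coarse(U[n]) for n in range(n_intervals)]
        # Sequential correction sweep: new coarse value + (fine - old coarse).
        U_new = [u0]
        for n in range(n_intervals):
            U_new.append(coarse(U_new[n]) + F[n] - G_old[n])
        U = U_new
    return U

# Toy propagators: the fine solver halves the state, the coarse one is less accurate.
states = parareal(1.0, 4, coarse=lambda u: 0.6 * u, fine=lambda u: 0.5 * u, n_iters=4)
```

The coarse sweep is the "sequential phase for prediction" noted above. After k iterations the first k boundaries match the serial fine solution, so a useful speedup requires convergence in far fewer iterations than there are intervals.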

23
Conclusions

Data-driven time parallelization shows significant improvement in speed, without sacrificing accuracy significantly
Direct prediction is very effective when applicable
The 980-processor simulation attained a flop rate of 420 Gflops
Its rate of 420 Mflops/atom is likely the largest flop-per-atom rate in classical MD simulations
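
The per-atom figure follows from the aggregate rate, assuming the 980-processor run used the 1000-atom CNT from the speedup experiments (the slides do not state the atom count for this run explicitly):

```latex
\frac{420 \times 10^{9}\ \text{flop/s}}{1000\ \text{atoms}}
  = 420 \times 10^{6}\ \text{flop/s per atom}
  = 420\ \text{Mflops/atom}
```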

24
Future Work

More complex problems
Better prediction
POD is good for representing data, but not necessarily for identifying patterns
Use better dimensionality reduction / reduced order modeling techniques
Use experimental data for prediction
Better learning
Better verification
In CP8: Application of Dimensionality Reduction Techniques to Time Parallelization, Yanan Yu
Tomorrow, 2:30 - 3:00 pm