1
Parallelization of CPAIMD using Charm++
  • Parallel Programming Lab

2
CPAIMD
  • Collaboration with Glenn Martyna and Mark
    Tuckerman
  • Existing MPI code: PINY
  • Scalability problems
  • When processors > orbitals
  • Charm++ approach
  • Better scalability using virtualization
  • Further subdivide the orbitals

3
The Iteration
4
The Iteration (contd.)
  • Start with 128 states
  • State: the spatial representation of an electron
  • FFT each of the 128 states
  • In parallel
  • Planar decomposition -> transposes required
  • Compute densities (DFT)
  • Compute energies using the density
  • Compute forces and move the electrons
  • Orthonormalize the states
  • Start over (a structural sketch of this loop follows below)
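The control flow above can be summarized as a loop. Below is a minimal structural sketch in C++, not the PINY or Charm++ implementation: the constants (NSTATES, NPLANES, PLANEPTS) and the step functions are illustrative placeholders standing in for the real parallel phases.

```cpp
// Structural sketch of one CPAIMD iteration (illustrative only; constants and
// step functions are placeholders, not the PINY/Charm++ implementation).
#include <complex>
#include <iostream>
#include <vector>

using Plane = std::vector<std::complex<double>>;  // one plane of a state's 3D grid
using State = std::vector<Plane>;                 // planar decomposition of a state

constexpr int NSTATES  = 128;  // number of electronic states (from the slides)
constexpr int NPLANES  = 64;   // planes per state (placeholder value)
constexpr int PLANEPTS = 64;   // grid points per plane (placeholder value)

// Stubs standing in for the real parallel phases.
void fft_state(State& s) { /* 3D FFT: per-plane FFTs + transpose */ }
std::vector<double> compute_density(const std::vector<State>& states) {
    return std::vector<double>(NPLANES * PLANEPTS, 0.0);  // sum of |psi|^2
}
double compute_energy(const std::vector<double>& rho) { return 0.0; }  // DFT functional
void move_electrons(std::vector<State>& states, double e) {}  // forces + update
void orthonormalize(std::vector<State>& states) {}            // all-pairs step

int main() {
    std::vector<State> states(NSTATES, State(NPLANES, Plane(PLANEPTS)));
    for (int iter = 0; iter < 3; ++iter) {
        for (auto& s : states) fft_state(s);  // FFT each state (in parallel in the real code)
        auto rho = compute_density(states);   // densities from all states
        double e = compute_energy(rho);       // energies from the density
        move_electrons(states, e);            // compute forces; move electrons
        orthonormalize(states);               // keep the states orthonormal
        std::cout << "iteration " << iter << " done\n";
    }
}
```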

5
Parallel View
6
Optimized Parallel 3D FFT
  • To perform the 3D FFT
  • Do a 1D FFT followed by a 2D FFT, instead of a 2D FFT
    followed by a 1D FFT (see the sketch below)
  • Less computation
  • Less communication
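The idea can be illustrated with a serial sketch: the 3D transform is split into 1D transforms along one axis, a transpose (an all-to-all in the parallel code), and then 2D transforms within each plane. A naive O(N^2) DFT stands in for a real FFT library; the names and grid size are assumptions for illustration.

```cpp
// Serial sketch of the "1D then 2D" 3D FFT decomposition (illustrative; a
// naive DFT replaces the real FFT routine, and the transpose is only marked
// by a comment since all data is local here).
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

std::vector<cd> dft1d(const std::vector<cd>& in) {  // placeholder for a real FFT
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            out[k] += in[j] * std::polar(1.0, -2.0 * pi * double(k * j) / double(n));
    return out;
}

using Grid = std::vector<std::vector<std::vector<cd>>>;  // grid[x][y][z]

void fft3d_1d_then_2d(Grid& g) {
    const std::size_t N = g.size();
    // Phase 1: 1D transforms along z (each (x,y) pencil is local to one chare).
    for (std::size_t x = 0; x < N; ++x)
        for (std::size_t y = 0; y < N; ++y)
            g[x][y] = dft1d(g[x][y]);
    // Transpose: in the parallel code this is the all-to-all that gathers each
    // z plane onto one virtual processor.
    // Phase 2: 2D transform within each z plane = 1D along y, then 1D along x.
    for (std::size_t z = 0; z < N; ++z) {
        for (std::size_t x = 0; x < N; ++x) {
            std::vector<cd> line(N);
            for (std::size_t y = 0; y < N; ++y) line[y] = g[x][y][z];
            line = dft1d(line);
            for (std::size_t y = 0; y < N; ++y) g[x][y][z] = line[y];
        }
        for (std::size_t y = 0; y < N; ++y) {
            std::vector<cd> line(N);
            for (std::size_t x = 0; x < N; ++x) line[x] = g[x][y][z];
            line = dft1d(line);
            for (std::size_t x = 0; x < N; ++x) g[x][y][z] = line[x];
        }
    }
}

int main() {
    const std::size_t N = 4;
    Grid g(N, std::vector<std::vector<cd>>(N, std::vector<cd>(N, cd(1.0, 0.0))));
    fft3d_1d_then_2d(g);
    std::printf("g[0][0][0] = %.1f\n", g[0][0][0].real());  // constant input -> N^3 = 64 at k = 0
}
```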

7
Orthonormalization
  • All-pairs operation
  • The data of each state has to meet the data
    of every other state
  • Our approach (picture follows)
  • A virtual processor (VP) acts as the meeting point for
    several pairs of states
  • Create lots of these VPs
  • Let n be the number of pairs meeting at a VP
  • Communication decreases with n
  • Computation increases with n
  • A balance is required (see the sketch below)
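One way to picture the VP scheme is as a blocking of the state-pair matrix. The sketch below is an assumed illustration (the grain size g and all names are not from the original): pairs (i, j) map to block-indexed VPs, the pair count per VP grows as g*g, and the number of VPs shrinks accordingly.

```cpp
// Illustrative sketch of mapping state pairs to virtual processors; the grain
// size g and all names are assumptions, not the application's actual indexing.
#include <cstdio>

struct VPIndex { int bi, bj; };  // block-row and block-column of the VP

// Each VP handles all pairs (i, j), i <= j, whose states fall in its blocks,
// so roughly n = g*g pairs meet at one VP.
VPIndex vp_for_pair(int i, int j, int g) { return VPIndex{ i / g, j / g }; }

int main() {
    const int S = 128;  // number of states (from the slides)
    const int g = 16;   // states per block: n = g*g pairs per VP (tunable)
    const int nblocks = S / g;
    // Only the upper triangle of blocks is needed: (i, j) and (j, i) carry the
    // same information. More VPs (smaller g) -> more parallelism but more
    // messages per state; fewer VPs (larger g) -> less communication but more
    // computation per VP, hence the balance mentioned above.
    const int vps = nblocks * (nblocks + 1) / 2;
    std::printf("%d VPs, about %d pairs each\n", vps, g * g);
    VPIndex v = vp_for_pair(5, 100, g);
    std::printf("pair (5,100) meets at VP block (%d,%d)\n", v.bi, v.bj);
}
```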

8
VP-based approach
9
Performance
  • Existing MPI code: PINY
  • Does not scale beyond 128 processors
  • Best per-iteration time: 1.7 s
  • Our performance (table below)

  Processors   Time (s)
  128          2.07
  256          1.18
  512          0.65
  1024         0.48
  1536         0.39
10
Load balancing
  • Load imbalance is due to the distribution of data within
    the orbitals
  • Planes are sections of a sphere
  • Hence the imbalance
  • Computation: larger planes have more points
  • Communication: larger planes have more data to send
    (a sketch of the plane-size variation follows)
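A small sketch of the assumed geometry shows why plane sizes differ: counting the grid points of a sphere of radius R that fall in each z plane gives much larger counts near the middle planes. The radius and step size are illustrative values, not from the original.

```cpp
// Illustrative sketch: counting the lattice points of a sphere of radius R
// that fall in each z plane shows why some planes are much heavier than
// others. R and the sampling step are placeholder values.
#include <cstdio>

int points_in_plane(int z, int R) {
    // Count points (x, y) with x^2 + y^2 + z^2 <= R^2.
    int count = 0;
    for (int x = -R; x <= R; ++x)
        for (int y = -R; y <= R; ++y)
            if (x * x + y * y + z * z <= R * R) ++count;
    return count;
}

int main() {
    const int R = 20;  // illustrative cutoff radius in grid units
    for (int z = 0; z <= R; z += 5)
        std::printf("plane z = %2d: %4d points\n", z, points_in_plane(z, R));
    // The central planes hold several times more points than the outer ones,
    // so mapping an equal number of planes to each processor is imbalanced.
}
```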

11
Load Imbalance
Iteration time: 900 ms on 1024 processors
12
Improvement - I
Improvement by pairing heavily loaded planes with
lightly loaded planes. Iteration time: 590 ms
(a sketch of the pairing idea follows).
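A minimal sketch of the pairing idea, with assumed per-plane loads: sort the planes by load and co-locate the heaviest remaining plane with the lightest remaining one.

```cpp
// Minimal sketch of the pairing idea with assumed per-plane loads: sort planes
// by load and co-locate the heaviest remaining plane with the lightest one.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> load = {5, 40, 90, 120, 130, 118, 85, 38};  // assumed loads
    std::vector<int> order(load.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return load[a] < load[b]; });
    // Pair lightest with heaviest, second-lightest with second-heaviest, ...
    for (std::size_t k = 0; k < order.size() / 2; ++k) {
        int light = order[k];
        int heavy = order[order.size() - 1 - k];
        std::printf("processor %zu: planes %d and %d, combined load %d\n",
                    k, light, heavy, load[light] + load[heavy]);
    }
}
```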
13
Charm++ Load Balancing
Load balancing provided by the Charm++ runtime system.
Iteration time: 600 ms
14
Improvement - II
Improvement by using a load-vector-based scheme to map
planes to processors: the number of planes assigned to a
processor varies with their load, so processors holding
heavy planes receive correspondingly fewer planes.
Iteration time: 480 ms (a sketch of the mapping follows).
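A minimal sketch of a load-vector-based mapping, under assumed loads and processor count: planes are assigned greedily, heaviest first, to the processor whose accumulated load is currently smallest.

```cpp
// Minimal sketch of a load-vector-based mapping with assumed loads: assign
// each plane, heaviest first, to the processor whose accumulated load is
// currently smallest, so processors holding heavy planes receive fewer planes.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> load = {5, 40, 90, 120, 130, 118, 85, 38, 60, 75, 15, 110};
    const int nprocs = 4;

    std::vector<int> order(load.size());                  // plane indices ...
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),                 // ... sorted by decreasing load
              [&](int a, int b) { return load[a] > load[b]; });

    std::vector<int> procLoad(nprocs, 0);
    std::vector<int> assignment(load.size());
    for (int plane : order) {
        int p = static_cast<int>(std::min_element(procLoad.begin(), procLoad.end())
                                 - procLoad.begin());     // least-loaded processor
        assignment[plane] = p;
        procLoad[p] += load[plane];
    }
    for (int p = 0; p < nprocs; ++p)
        std::printf("processor %d total load: %d\n", p, procLoad[p]);
}
```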
15
Scope for Improvement
  • Load balancing
  • The Charm++ load balancer shows encouraging results
    on 512 PEs
  • Combination of automated and manual
    load balancing
  • Avoiding copying when sending messages
  • In the FFTs
  • When sending large read-only messages
  • FFTs can be made more efficient
  • Use double packing (a sketch of the idea follows below)
  • Make assumptions about the data distribution when
    performing FFTs
  • Alternative implementation of orthonormalization
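Double packing exploits the fact that the state data are real in real space: two real sequences can share one complex transform. The sketch below is illustrative only (a naive DFT stands in for the application's FFT routine) and shows how both spectra are recovered from a single transform, halving the number of FFTs.

```cpp
// Sketch of double packing (illustrative; a naive DFT stands in for the
// application's FFT routine): two real sequences a and b are packed into one
// complex sequence c = a + i*b, one complex transform is run, and both spectra
// are recovered from the symmetry of the result.
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

std::vector<cd> dft(const std::vector<cd>& in) {  // placeholder FFT
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            out[k] += in[j] * std::polar(1.0, -2.0 * pi * double(k * j) / double(n));
    return out;
}

// Recover A = FFT(a) and B = FFT(b) from C = FFT(a + i*b) using
// A_k = (C_k + conj(C_{N-k})) / 2 and B_k = (C_k - conj(C_{N-k})) / (2i).
void unpack(const std::vector<cd>& C, std::vector<cd>& A, std::vector<cd>& B) {
    const std::size_t N = C.size();
    A.assign(N, cd{});
    B.assign(N, cd{});
    for (std::size_t k = 0; k < N; ++k) {
        cd ck  = C[k];
        cd cnk = std::conj(C[(N - k) % N]);
        A[k] = 0.5 * (ck + cnk);
        B[k] = cd(0.0, -0.5) * (ck - cnk);
    }
}

int main() {
    std::vector<double> a = {1, 2, 3, 4}, b = {4, 3, 2, 1};  // two real inputs
    std::vector<cd> packed(a.size());
    for (std::size_t j = 0; j < a.size(); ++j) packed[j] = cd(a[j], b[j]);
    std::vector<cd> A, B;
    unpack(dft(packed), A, B);  // one transform instead of two
    for (std::size_t k = 0; k < A.size(); ++k)
        std::printf("A[%zu] = (%.2f, %.2f)   B[%zu] = (%.2f, %.2f)\n",
                    k, A[k].real(), A[k].imag(), k, B[k].real(), B[k].imag());
}
```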