1
Parallelization of CPAIMD using Charm++
  • Parallel Programming Lab

2
CPAIMD
  • Collaboration with Glenn Martyna and Mark
    Tuckerman
  • Existing MPI code: PINY
  • Scalability problems
  • When processors > orbitals
  • Charm++ approach
  • Better scalability using virtualization
  • Further subdivide the orbitals

3
The Iteration
4
The Iteration (contd.)
  • Start with 128 states
  • State: the spatial representation of an electron
  • FFT each of the 128 states
  • In parallel
  • Planar decomposition -> transposes required
  • Compute densities (DFT)
  • Compute energies using the density
  • Compute forces and move the electrons
  • Orthonormalize the states
  • Start over (a structural sketch of this loop follows below)
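The control flow above can be summarized as a loop. Below is a minimal structural sketch in C++, not the PINY or Charm++ implementation: the constants (NSTATES, NPLANES, PLANEPTS) and the step functions are illustrative placeholders standing in for the real parallel phases.

```cpp
// Structural sketch of one CPAIMD iteration (illustrative only; constants and
// step functions are placeholders, not the PINY/Charm++ implementation).
#include <complex>
#include <iostream>
#include <vector>

using Plane = std::vector<std::complex<double>>;  // one plane of a state's 3D grid
using State = std::vector<Plane>;                 // planar decomposition of a state

constexpr int NSTATES  = 128;  // number of electronic states (from the slides)
constexpr int NPLANES  = 64;   // planes per state (placeholder value)
constexpr int PLANEPTS = 64;   // grid points per plane (placeholder value)

// Stubs standing in for the real parallel phases.
void fft_state(State& s) { /* 3D FFT: per-plane FFTs + transpose */ }
std::vector<double> compute_density(const std::vector<State>& states) {
    return std::vector<double>(NPLANES * PLANEPTS, 0.0);  // sum of |psi|^2
}
double compute_energy(const std::vector<double>& rho) { return 0.0; }  // DFT functional
void move_electrons(std::vector<State>& states, double e) {}  // forces + update
void orthonormalize(std::vector<State>& states) {}            // all-pairs step

int main() {
    std::vector<State> states(NSTATES, State(NPLANES, Plane(PLANEPTS)));
    for (int iter = 0; iter < 3; ++iter) {
        for (auto& s : states) fft_state(s);  // FFT each state (in parallel in the real code)
        auto rho = compute_density(states);   // densities from all states
        double e = compute_energy(rho);       // energies from the density
        move_electrons(states, e);            // compute forces; move electrons
        orthonormalize(states);               // keep the states orthonormal
        std::cout << "iteration " << iter << " done\n";
    }
}
```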

5
Parallel View
6
Optimized Parallel 3D FFT
  • To perform the 3D FFT
  • Do a 1D FFT followed by a 2D FFT, instead of a 2D FFT
    followed by a 1D FFT (see the sketch below)
  • Less computation
  • Less communication
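The idea can be illustrated with a serial sketch: the 3D transform is split into 1D transforms along one axis, a transpose (an all-to-all in the parallel code), and then 2D transforms within each plane. A naive O(N^2) DFT stands in for a real FFT library; the names and grid size are assumptions for illustration.

```cpp
// Serial sketch of the "1D then 2D" 3D FFT decomposition (illustrative; a
// naive DFT replaces the real FFT routine, and the transpose is only marked
// by a comment since all data is local here).
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

std::vector<cd> dft1d(const std::vector<cd>& in) {  // placeholder for a real FFT
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            out[k] += in[j] * std::polar(1.0, -2.0 * pi * double(k * j) / double(n));
    return out;
}

using Grid = std::vector<std::vector<std::vector<cd>>>;  // grid[x][y][z]

void fft3d_1d_then_2d(Grid& g) {
    const std::size_t N = g.size();
    // Phase 1: 1D transforms along z (each (x,y) pencil is local to one chare).
    for (std::size_t x = 0; x < N; ++x)
        for (std::size_t y = 0; y < N; ++y)
            g[x][y] = dft1d(g[x][y]);
    // Transpose: in the parallel code this is the all-to-all that gathers each
    // z plane onto one virtual processor.
    // Phase 2: 2D transform within each z plane = 1D along y, then 1D along x.
    for (std::size_t z = 0; z < N; ++z) {
        for (std::size_t x = 0; x < N; ++x) {
            std::vector<cd> line(N);
            for (std::size_t y = 0; y < N; ++y) line[y] = g[x][y][z];
            line = dft1d(line);
            for (std::size_t y = 0; y < N; ++y) g[x][y][z] = line[y];
        }
        for (std::size_t y = 0; y < N; ++y) {
            std::vector<cd> line(N);
            for (std::size_t x = 0; x < N; ++x) line[x] = g[x][y][z];
            line = dft1d(line);
            for (std::size_t x = 0; x < N; ++x) g[x][y][z] = line[x];
        }
    }
}

int main() {
    const std::size_t N = 4;
    Grid g(N, std::vector<std::vector<cd>>(N, std::vector<cd>(N, cd(1.0, 0.0))));
    fft3d_1d_then_2d(g);
    std::printf("g[0][0][0] = %.1f\n", g[0][0][0].real());  // constant input -> N^3 = 64 at k = 0
}
```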

7
Orthonormalization
  • All-pairs operation
  • The data of each state has to meet the data
    of every other state
  • Our approach (picture follows)
  • A virtual processor (VP) acts as the meeting point for
    several pairs of states
  • Create lots of these VPs
  • Let n be the number of pairs meeting at a VP
  • Communication decreases with n
  • Computation increases with n
  • A balance is required (see the sketch below)
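One way to picture the VP scheme is as a blocking of the state-pair matrix. The sketch below is an assumed illustration (the grain size g and all names are not from the original): pairs (i, j) map to block-indexed VPs, the pair count per VP grows as g*g, and the number of VPs shrinks accordingly.

```cpp
// Illustrative sketch of mapping state pairs to virtual processors; the grain
// size g and all names are assumptions, not the application's actual indexing.
#include <cstdio>

struct VPIndex { int bi, bj; };  // block-row and block-column of the VP

// Each VP handles all pairs (i, j), i <= j, whose states fall in its blocks,
// so roughly n = g*g pairs meet at one VP.
VPIndex vp_for_pair(int i, int j, int g) { return VPIndex{ i / g, j / g }; }

int main() {
    const int S = 128;  // number of states (from the slides)
    const int g = 16;   // states per block: n = g*g pairs per VP (tunable)
    const int nblocks = S / g;
    // Only the upper triangle of blocks is needed: (i, j) and (j, i) carry the
    // same information. More VPs (smaller g) -> more parallelism but more
    // messages per state; fewer VPs (larger g) -> less communication but more
    // computation per VP, hence the balance mentioned above.
    const int vps = nblocks * (nblocks + 1) / 2;
    std::printf("%d VPs, about %d pairs each\n", vps, g * g);
    VPIndex v = vp_for_pair(5, 100, g);
    std::printf("pair (5,100) meets at VP block (%d,%d)\n", v.bi, v.bj);
}
```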

8
VP-based approach
9
Performance
  • Existing MPI code: PINY
  • Does not scale beyond 128 processors
  • Best per-iteration time: 1.7 s
  • Our performance (table below)

  Processors   Time (s)
  128          2.07
  256          1.18
  512          0.65
  1024         0.48
  1536         0.39
10
Load balancing
  • Load imbalance is due to the distribution of data within
    the orbitals
  • Planes are sections of a sphere
  • Hence the imbalance
  • Computation: larger planes have more points
  • Communication: larger planes have more data to send
    (a sketch of the plane-size variation follows)
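A small sketch of the assumed geometry shows why plane sizes differ: counting the grid points of a sphere of radius R that fall in each z plane gives much larger counts near the middle planes. The radius and step size are illustrative values, not from the original.

```cpp
// Illustrative sketch: counting the lattice points of a sphere of radius R
// that fall in each z plane shows why some planes are much heavier than
// others. R and the sampling step are placeholder values.
#include <cstdio>

int points_in_plane(int z, int R) {
    // Count points (x, y) with x^2 + y^2 + z^2 <= R^2.
    int count = 0;
    for (int x = -R; x <= R; ++x)
        for (int y = -R; y <= R; ++y)
            if (x * x + y * y + z * z <= R * R) ++count;
    return count;
}

int main() {
    const int R = 20;  // illustrative cutoff radius in grid units
    for (int z = 0; z <= R; z += 5)
        std::printf("plane z = %2d: %4d points\n", z, points_in_plane(z, R));
    // The central planes hold several times more points than the outer ones,
    // so mapping an equal number of planes to each processor is imbalanced.
}
```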

11
Load Imbalance
Iteration time: 900 ms on 1024 processors
12
Improvement - I
Improvement by pairing heavily loaded planes with
lightly loaded planes. Iteration time: 590 ms
(a sketch of the pairing idea follows).
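A minimal sketch of the pairing idea, with assumed per-plane loads: sort the planes by load and co-locate the heaviest remaining plane with the lightest remaining one.

```cpp
// Minimal sketch of the pairing idea with assumed per-plane loads: sort planes
// by load and co-locate the heaviest remaining plane with the lightest one.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> load = {5, 40, 90, 120, 130, 118, 85, 38};  // assumed loads
    std::vector<int> order(load.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return load[a] < load[b]; });
    // Pair lightest with heaviest, second-lightest with second-heaviest, ...
    for (std::size_t k = 0; k < order.size() / 2; ++k) {
        int light = order[k];
        int heavy = order[order.size() - 1 - k];
        std::printf("processor %zu: planes %d and %d, combined load %d\n",
                    k, light, heavy, load[light] + load[heavy]);
    }
}
```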
13
Charm++ Load Balancing
Load balancing provided by the Charm++ runtime system.
Iteration time: 600 ms
14
Improvement - II
Improvement by using a load-vector-based scheme to map
planes to processors: the number of planes assigned to a
processor varies with their load, so processors holding
heavy planes receive correspondingly fewer planes.
Iteration time: 480 ms (a sketch of the mapping follows).
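A minimal sketch of a load-vector-based mapping, under assumed loads and processor count: planes are assigned greedily, heaviest first, to the processor whose accumulated load is currently smallest.

```cpp
// Minimal sketch of a load-vector-based mapping with assumed loads: assign
// each plane, heaviest first, to the processor whose accumulated load is
// currently smallest, so processors holding heavy planes receive fewer planes.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> load = {5, 40, 90, 120, 130, 118, 85, 38, 60, 75, 15, 110};
    const int nprocs = 4;

    std::vector<int> order(load.size());                  // plane indices ...
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),                 // ... sorted by decreasing load
              [&](int a, int b) { return load[a] > load[b]; });

    std::vector<int> procLoad(nprocs, 0);
    std::vector<int> assignment(load.size());
    for (int plane : order) {
        int p = static_cast<int>(std::min_element(procLoad.begin(), procLoad.end())
                                 - procLoad.begin());     // least-loaded processor
        assignment[plane] = p;
        procLoad[p] += load[plane];
    }
    for (int p = 0; p < nprocs; ++p)
        std::printf("processor %d total load: %d\n", p, procLoad[p]);
}
```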
15
Scope for Improvement
  • Load balancing
  • The Charm++ load balancer shows encouraging results
    on 512 PEs
  • Combination of automated and manual
    load balancing
  • Avoiding copying when sending messages
  • In the FFTs
  • When sending large read-only messages
  • FFTs can be made more efficient
  • Use double packing (a sketch of the idea follows below)
  • Make assumptions about the data distribution when
    performing FFTs
  • Alternative implementation of orthonormalization
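Double packing exploits the fact that the state data are real in real space: two real sequences can share one complex transform. The sketch below is illustrative only (a naive DFT stands in for the application's FFT routine) and shows how both spectra are recovered from a single transform, halving the number of FFTs.

```cpp
// Sketch of double packing (illustrative; a naive DFT stands in for the
// application's FFT routine): two real sequences a and b are packed into one
// complex sequence c = a + i*b, one complex transform is run, and both spectra
// are recovered from the symmetry of the result.
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

std::vector<cd> dft(const std::vector<cd>& in) {  // placeholder FFT
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            out[k] += in[j] * std::polar(1.0, -2.0 * pi * double(k * j) / double(n));
    return out;
}

// Recover A = FFT(a) and B = FFT(b) from C = FFT(a + i*b) using
// A_k = (C_k + conj(C_{N-k})) / 2 and B_k = (C_k - conj(C_{N-k})) / (2i).
void unpack(const std::vector<cd>& C, std::vector<cd>& A, std::vector<cd>& B) {
    const std::size_t N = C.size();
    A.assign(N, cd{});
    B.assign(N, cd{});
    for (std::size_t k = 0; k < N; ++k) {
        cd ck  = C[k];
        cd cnk = std::conj(C[(N - k) % N]);
        A[k] = 0.5 * (ck + cnk);
        B[k] = cd(0.0, -0.5) * (ck - cnk);
    }
}

int main() {
    std::vector<double> a = {1, 2, 3, 4}, b = {4, 3, 2, 1};  // two real inputs
    std::vector<cd> packed(a.size());
    for (std::size_t j = 0; j < a.size(); ++j) packed[j] = cd(a[j], b[j]);
    std::vector<cd> A, B;
    unpack(dft(packed), A, B);  // one transform instead of two
    for (std::size_t k = 0; k < A.size(); ++k)
        std::printf("A[%zu] = (%.2f, %.2f)   B[%zu] = (%.2f, %.2f)\n",
                    k, A[k].real(), A[k].imag(), k, B[k].real(), B[k].imag());
}
```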