Improving LeanMD Performance on Blue Gene/L

1
Improving LeanMD Performance on Blue Gene/L
  • CS533 Class Project
  • Aaron Becker, Abhinav Bhatele, Chao Mei

2
LeanMD Introduction
  • A molecular dynamics simulation taking advantage
    of Charm++ features
  • Modular code meant to allow many types of
    physical interaction simulations
  • Kind of legacy code (untouched for the last two
    years)

3
Basic Structure
  • The basic data structures are distributed across
    different processors

4
Sequential Performance
  • Hotspot functions had already been inlined
  • A two-level nested loop over two atom sets
  • Complex control flow and highly interdependent
    calculations
  • We reorganized some of the control flow
  • Result: hardly any improvement

5
PME Calculation
  • Particle Mesh Ewald (PME)
  • Calculate the long-range electrostatic energy
  • Problems:
  • Uses too fine-grained a grid
  • 3D grid data is not decomposed
  • Every Cell object is associated with a whole 3D
    grid for the convenience of communication and
    computation

6
Solutions to PME Problems
  • Adjust grid size to the appropriate level
  • Parallelize the 3D grid along one dimension
  • Associate each cell object with the exact part of
    the grid it touches
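
A minimal sketch of the one-dimensional decomposition idea, not the LeanMD implementation. It assumes the grid is split into contiguous slabs of planes along one axis, and that a cell covers a half-open range of planes along that axis. The names `slab_bounds` and `slabs_touched` are hypothetical.

```python
def slab_bounds(grid_size, num_procs):
    """Split planes 0..grid_size-1 into num_procs contiguous slabs.

    Returns a list of half-open (start, end) plane ranges, one per rank.
    Leftover planes go to the lowest-numbered ranks.
    """
    base, extra = divmod(grid_size, num_procs)
    bounds = []
    start = 0
    for rank in range(num_procs):
        width = base + (1 if rank < extra else 0)
        bounds.append((start, start + width))
        start += width
    return bounds


def slabs_touched(cell_lo, cell_hi, bounds):
    """Ranks whose slab overlaps the half-open plane range [cell_lo, cell_hi)."""
    return [rank for rank, (lo, hi) in enumerate(bounds)
            if lo < cell_hi and cell_lo < hi]
```

With this layout a cell only needs to hold and exchange the slabs returned by `slabs_touched`, rather than a copy of the whole 3D grid.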

7
PME Performance (3D Grid Data Decomposition)

8
PME Performance (Cell Associated with Exact Part)
  • On 1 processor:
  • Speedup: 1.69
  • Memory usage before: 543.7491 MB
  • Memory usage afterwards: 308.4071 MB (down by
    42.38%)
  • On 8 processors:
  • Speedup: 4.12

9
Average memory usage down by 30.92%

10
LeanMD Communication Patterns
  • Depending on the k we choose in k-away, the
    Cells interact with their neighbors through Cell
    Pairs
  • Each cell sends data to all its cell pairs and
    receives results from them

Cells in a 1-away scheme
11
LeanMD Communication Patterns
  • For a typical system like HCA and 2-away
    decomposition, we have
  • 12 x 12 x 12 = 1728 cells
  • 124 x 12 x 12 x 12 / 2 + 12 x 12 x 12 = 108864
    Cell Pairs
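
The counts above can be reproduced with a short sketch. It assumes periodic boundaries, so every cell has the full (2k+1)^3 - 1 neighbors, and one self-interaction pair per cell, which is what the arithmetic on the slide implies. The function name is hypothetical.

```python
def count_cells_and_pairs(cells_per_dim, k):
    """Count cells and cell pairs for a cubic k-away decomposition.

    Assumes periodic boundaries and cells_per_dim >= 2*k + 1, so that
    all neighbors of a cell are distinct.
    """
    cells = cells_per_dim ** 3
    neighbors = (2 * k + 1) ** 3 - 1      # 26 for 1-away, 124 for 2-away
    # Each neighbor pair is shared by two cells; add one self pair per cell.
    pairs = neighbors * cells // 2 + cells
    return cells, pairs


print(count_cells_and_pairs(12, 2))  # (1728, 108864)
```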

Cells in a 2-away scheme
12
Blue Gene/L
  • IBM's Blue Gene/L architecture is currently one
    of the fastest in the world
  • Network topology on Blue Gene/L:
  • 3D mesh for < 512 processors
  • 3D torus for > 512 processors
  • Tests on the machine show that latency increases
    dramatically in presence of heavy contention
  • We intend to minimize hop count of messages to
    reduce contention
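
As an illustration of the hop count being minimized, here is a small sketch for processors addressed by (x, y, z) coordinates. It assumes shortest-path, dimension-by-dimension routing. The function names are hypothetical and are not part of any Blue Gene/L or Charm++ API.

```python
def mesh_hops(a, b):
    """Hop count between coordinates a and b on a 3D mesh (no wraparound)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def torus_hops(a, b, dims):
    """Hop count on a 3D torus: each dimension may wrap around."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))


# Opposite ends of one axis: far apart on a mesh, adjacent on a torus.
print(mesh_hops((0, 0, 0), (7, 0, 0)))              # 7
print(torus_hops((0, 0, 0), (7, 0, 0), (8, 8, 8)))  # 1
```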

13
Motivation
  • Reduction in the hop count reduces the latency
    and also the contention in the network
  • If we can reduce contention on the network, we
    reduce latency, so processes waiting on messages
    are triggered sooner
  • The current code doesn't have any knowledge of
    the topology
  • It uses the insert function call provided by
    Charm++ to randomly insert objects on different
    processors (though ensuring load balance)

14
Topology Aware Mapping
  • Instead, we wish to map communicating objects on
    same or nearby processors
  • Optimal mapping of objects to processors is an
    NP-hard problem
  • We wish to come up with a good-enough mapping
    which minimizes the communication and gives a
    good overlap with computation at the same time
  • We also have to ensure load balance ourselves
    since it is no longer taken care of by Charm++

15
Different Mapping Schemes
  • Blocked Cells, Random Cell Pairs: 1.87s
  • Blocked Cells, Cell Pairs on first cell: 2.46s
  • Blocked Cells, Cell Pairs alternately on first
    and second cell: 1.20s

16
Different Mapping Schemes
  • Cells round robin, and Cell Pairs alternate:
    1.73s
  • Blocked Cells, Cell Pairs around smaller cell
    distributed: 1.26s
  • Blocked Cells, Cell Pairs distributed around
    smaller cell: 1.43s
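
A rough sketch of one of the schemes listed, blocked cells with cell pairs placed alternately on the first and second cell's processor. The slides do not say how the alternation is done; here it is assumed to follow the pair's position in the list. Cells and processors are plain integer indices, and both function names are hypothetical.

```python
def blocked_cell_map(num_cells, num_procs):
    """Assign consecutive blocks of cells to consecutive processors."""
    block = -(-num_cells // num_procs)  # ceiling division
    return [cell // block for cell in range(num_cells)]


def alternate_pair_map(pairs, cell_to_proc):
    """Place each cell pair on the processor of its first or second cell.

    Assumption: even-indexed pairs follow the first cell and odd-indexed
    pairs follow the second, which spreads pair work over both owners.
    """
    placement = {}
    for i, (first, second) in enumerate(pairs):
        owner = first if i % 2 == 0 else second
        placement[(first, second)] = cell_to_proc[owner]
    return placement


cells = blocked_cell_map(8, 4)  # [0, 0, 1, 1, 2, 2, 3, 3]
print(alternate_pair_map([(0, 2), (0, 4), (2, 6)], cells))
```

Placing a pair on a processor that already holds one of its two cells means only the other cell's data has to cross the network.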

17
Performance Results

18
Possible Reasons
  • Topology Mapping might not be a good one!
  • LeanMD may not have enough communication for a
    better mapping to improve upon
  • Test system (HCA with 12 x 12 x 12 cells) is not
    large enough
  • LeanMD is efficient enough that topology mapping
    does not make a difference

19
Modifications
  • Enable PME to increase communication
  • Run on more processors to give more pronounced
    topological effects
  • Use larger molecular systems to expose
    inefficiencies in default mapping

20
Thank You!
  • Questions?