Title: Optimizing Quantum Chemistry using Charm++
1. Optimizing Quantum Chemistry using Charm++
Eric Bohm
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
2. Overview
- Decomposition
- State Planes
- 3d FFT
- 3d matrix multiply
- Utilizing Charm
- Prioritized nonlocal
- Commlib
- Projections
- CPMD
- 9 phases
- Charm applicability
- Overlap
- Decomposition
- Portability
- Communication Optimization
3. Quantum Chemistry
- LeanCP Collaboration
- Glenn Martyna (IBM TJ Watson)
- Mark Tuckerman (NYU)
- Nick Nystrom (PSU)
- PPL: Kale, Shi, Bohm, Pauli, Kumar (now at IBM), Vadali
- CPMD Method
- Plane wave QM
- Charm++ Parallelization
- PINY MD Physics engine
4. CPMD on Charm++
- 11 Charm++ Arrays
- 4 Charm++ Modules
- 13 Charm++ Groups
- 3 Commlib strategies
- BLAS
- FFTW
- PINY MD
- Adaptive Overlap
- Prioritized computation for phased application
- Communication optimization
- Load balancing
- Group caches
- Rth Threads
5. Practical Scaling
- Single Wall Carbon Nanotube Field Effect Transistor
- BG/L Performance
6. Computation Flow
7. Charm++
- Uses the approach of virtualization
- Divide the work into VPs
- Typically many more VPs than processors
- Schedule each VP for execution
- Advantages
- Computation and communication can be overlapped (between VPs)
- Number of VPs can be independent of the processor count
- Other benefits: load balancing, checkpointing, etc.
8. Decomposition
- A higher degree of virtualization is better for Charm++
- Real Space State Planes, GSpace State Planes, Rho Real and Rho G, and S-Calculators for each GSpace state plane (see the sketch after this list)
- Tens of thousands of chares for a 32-molecule problem
- Careful scheduling to maximize efficiency
- Most of the computation is in FFTs and matrix multiplies
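As a rough illustration of how such a decomposition looks in Charm++ (a sketch only; GSpacePlane, nstates, and nplanes are hypothetical names, not the actual LeanCP classes), a 2-D chare array indexed by (state, plane) gives the runtime far more objects than processors to schedule:

    // Hypothetical .ci interface declaration for this sketch:
    //   array [2D] GSpacePlane {
    //     entry GSpacePlane();
    //     entry void startFFT();
    //   };
    #include "gspaceplane.decl.h"      // generated from the .ci file above

    class GSpacePlane : public CBase_GSpacePlane {
     public:
      GSpacePlane() {
        // thisIndex.x = electronic state, thisIndex.y = plane number
      }
      void startFFT() { /* transform this state's plane */ }
    };

    // One chare per (state, plane) pair: typically tens of thousands of
    // objects, independent of the processor count, so the runtime can
    // overlap their communication and computation and balance load.
    void createPlanes(int nstates, int nplanes) {
      CProxy_GSpacePlane planes = CProxy_GSpacePlane::ckNew(nstates, nplanes);
      planes.startFFT();               // broadcast to every element
    }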
9. 3-D FFT Implementation
Dense 3-D FFT
Sparse 3-D FFT
10. Parallel FFT Library
- Slab-based parallelization
- We do not re-implement the sequential routine
- Utilize 1-D and 2-D FFT routines provided by FFTW
- Allow for
- Multiple 3-D FFTs simultaneously
- Multiple data sets within the same set of slab objects
- Useful as 3-D FFTs are frequently used in CP computations (a slab-based sketch follows this list)
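As background for how the slab scheme uses FFTW's sequential routines, here is a serial sketch; the array layout and function name are assumptions, and the real library replaces the gather/scatter below with an all-to-all transpose between slab objects.

    #include <fftw3.h>

    // Serial sketch of a slab-decomposed 3-D FFT built from FFTW's 2-D and
    // 1-D plans (illustrative only).
    void fft3d_slabs(fftw_complex* data, int nx, int ny, int nz) {
      // Phase 1: each x-slab is a contiguous ny*nz plane; 2-D FFT it in place.
      fftw_plan p2d = fftw_plan_dft_2d(ny, nz, data, data,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
      for (int x = 0; x < nx; ++x)
        fftw_execute_dft(p2d, data + x * ny * nz, data + x * ny * nz);
      fftw_destroy_plan(p2d);

      // Phase 2: transpose so whole x-lines become local (here, a gather),
      // then Phase 3: 1-D FFT each x-line.
      fftw_complex* line = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * nx);
      fftw_plan p1d = fftw_plan_dft_1d(nx, line, line, FFTW_FORWARD, FFTW_ESTIMATE);
      for (int y = 0; y < ny; ++y)
        for (int z = 0; z < nz; ++z) {
          for (int x = 0; x < nx; ++x) {            // gather one x-line
            line[x][0] = data[(x * ny + y) * nz + z][0];
            line[x][1] = data[(x * ny + y) * nz + z][1];
          }
          fftw_execute(p1d);
          for (int x = 0; x < nx; ++x) {            // scatter it back
            data[(x * ny + y) * nz + z][0] = line[x][0];
            data[(x * ny + y) * nz + z][1] = line[x][1];
          }
        }
      fftw_destroy_plan(p1d);
      fftw_free(line);
    }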
11. Multiple Parallel 3-D FFTs
12. Matrix Multiply
- a.k.a. S-Calculator or Pair Calculator
- Decompose state-plane values into smaller objects
- Use DGEMM on the smaller sub-matrices (as shown below)
- Sum the partial products via reduction back to GSpace
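A hedged sketch of one pair-calculator step under these assumptions (PairCalculator, GSpacePlane, acceptResult, and gspaceProxy are illustrative names, not the actual LeanCP code): DGEMM forms a partial product, and a Charm++ sum reduction combines the pieces and hands the result back to GSpace.

    #include <vector>
    #include "paircalc.decl.h"         // generated from the .ci interface file

    extern "C" void dgemm_(const char* transa, const char* transb,
                           const int* m, const int* n, const int* k,
                           const double* alpha, const double* a, const int* lda,
                           const double* b, const int* ldb,
                           const double* beta, double* c, const int* ldc);

    class PairCalculator : public CBase_PairCalculator {
     public:
      CProxy_GSpacePlane gspaceProxy;  // set at construction (not shown)

      void multiplyBlock(const double* A, const double* B, int m, int n, int k) {
        std::vector<double> C(m * n);
        const double one = 1.0, zero = 0.0;
        // Column-major DGEMM: C(m x n) = A(m x k) * B(k x n)
        dgemm_("N", "N", &m, &n, &k, &one, A, &m, B, &k, &zero, C.data(), &m);

        // Sum the partial products from all pair calculators; the reduction
        // callback broadcasts the combined matrix back to the GSpace planes
        // (acceptResult takes a CkReductionMsg in this sketch).
        CkCallback done(CkIndex_GSpacePlane::acceptResult(NULL), gspaceProxy);
        contribute(m * n * sizeof(double), C.data(), CkReduction::sum_double, done);
      }
    };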
13. Matrix Multiply: VP-based approach
14. Charm++ Tricks and Tips
- Message-driven execution and a high degree of virtualization present tuning challenges
- Flow of control using Rth-Threads
- Prioritized messages
- Commlib framework
- Charm arrays vs groups
- Problem identification with projections
- Problem isolation techniques
15. Flow Control in Parallel
- Rth Threads
- Based on Duff's device, these are user-level threads with negligible overhead
- Essentially goto and return without the loss of readability
- Allow for an event-loop style of programming
- Makes the flow of control explicit
- Uses familiar threading semantics (a Duff's-device sketch follows)
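Rth's actual macros are not reproduced here; the minimal sketch below shows the Duff's-device idea they rely on: a switch on a saved line number lets a routine return to the scheduler and later resume exactly where it left off, with no separate stack or context switch.

    // Not the actual Rth implementation: a minimal Duff's-device coroutine
    // in the same spirit. The saved resume_point is the line number of the
    // last yield, so the switch jumps straight back to it on the next call.
    #define THREAD_BEGIN(state)  switch (state) { case 0:
    #define THREAD_YIELD(state)  do { state = __LINE__; return; case __LINE__:; } while (0)
    #define THREAD_END(state)    } state = 0

    // Hypothetical driver object: each call to drive() advances one phase,
    // returning to the scheduler in between (the phase bodies are stubs).
    struct GSpaceStep {
      int resume_point = 0;

      void drive() {
        THREAD_BEGIN(resume_point);
          startForwardFFT();                 // phase 1: kick off communication
          THREAD_YIELD(resume_point);        // give control back until data arrives
          computeEnergies();                 // phase 2: runs on the next drive() call
          THREAD_YIELD(resume_point);
          startOrthonormalization();         // phase 3
        THREAD_END(resume_point);
      }

      void startForwardFFT() {}
      void computeEnergies() {}
      void startOrthonormalization() {}
    };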
16. Rth Threads for Flow Control
17. Prioritized Messages for Overlap
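For reference, a hedged sketch of attaching a priority to a Charm++ entry-method send (CProxy_NonLocalCompute and compute() are hypothetical names); smaller integer priorities run first under integer-FIFO queueing, which is how an urgent phase can overtake bulk work already queued on a processor.

    #include "charm++.h"

    // Hedged sketch: prioritize one invocation so it is scheduled ahead of
    // lower-priority messages already waiting in the queue.
    void sendPrioritized(CProxy_NonLocalCompute nlProxy, int plane) {
      CkEntryOptions opts;
      opts.setQueueing(CK_QUEUEING_IFIFO);   // integer-FIFO prioritized queueing
      opts.setPriority(-10);                 // smaller value = runs earlier
      nlProxy[plane].compute(&opts);         // options ride along with the send
    }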
18. Communication Library
- Fine-grained decomposition can result in many small messages
- Message combining via the Commlib framework in Charm++ addresses this problem (illustrated after this list)
- The streaming protocol optimizes many-to-many personalized communication
- Forwarding protocols like Ring or Multiring can be beneficial
- But not on BG/L
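The Commlib API itself is not shown here; the sketch below is a plain C++ illustration of the message-combining idea it implements (the CombiningSender class and deliverBatch hook are hypothetical): small updates bound for the same processor are buffered and shipped as one larger message.

    #include <cstddef>
    #include <map>
    #include <vector>

    struct Update { int index; double value; };   // a typical small payload

    class CombiningSender {
     public:
      explicit CombiningSender(std::size_t bucketSize) : bucketSize_(bucketSize) {}

      // Queue one small update for a destination processor.
      void send(int destPe, const Update& u) {
        std::vector<Update>& bucket = buckets_[destPe];
        bucket.push_back(u);
        if (bucket.size() >= bucketSize_) flushOne(destPe);
      }

      // Flush everything, e.g. at the end of a computation phase.
      void flushAll() {
        for (auto& kv : buckets_)
          if (!kv.second.empty()) flushOne(kv.first);
      }

     private:
      void flushOne(int destPe) {
        deliverBatch(destPe, buckets_[destPe]);   // one combined message
        buckets_[destPe].clear();
      }
      // Stand-in for the real transport of a combined message.
      void deliverBatch(int /*destPe*/, const std::vector<Update>& /*batch*/) {}

      std::size_t bucketSize_;
      std::map<int, std::vector<Update>> buckets_;
    };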
19. Commlib Strategy Selection
20. Bound Arrays
- Why?
- Efficiency and clarity of expression.
- Two arrays of the same dimensionality where like indices are co-placed
- GSpace and the non-local computation both have plane-based computations and share many data elements
- Use ckLocal() to access co-placed elements through ordinary local function calls
- Remain distinct parallel objects (sketched below)
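A hedged sketch of the bound-array pattern under these assumptions (GSpacePlane, NonLocalPlane, gspaceProxy, and shareStructureFactor are illustrative names): binding keeps like-indexed elements co-resident, so ckLocal() yields a direct pointer.

    #include "boundexample.decl.h"     // generated from a hypothetical .ci file
                                       // declaring GSpacePlane and NonLocalPlane

    // Bind the non-local array to the GSpace array: like indices are always
    // placed (and migrated) together.
    void createBoundArrays(int nplanes) {
      CProxy_GSpacePlane gspace = CProxy_GSpacePlane::ckNew(nplanes);
      CkArrayOptions opts(nplanes);
      opts.bindTo(gspace);
      CProxy_NonLocalPlane nonlocal = CProxy_NonLocalPlane::ckNew(opts);
    }

    // Inside a NonLocalPlane method: because of the binding, the like-indexed
    // GSpacePlane element is on this processor, so ckLocal() returns a usable
    // pointer and its data can be reached by an ordinary local call.
    void NonLocalPlane::useSharedData() {
      GSpacePlane* partner = gspaceProxy[thisIndex].ckLocal();
      if (partner != NULL)
        partner->shareStructureFactor(this);   // hypothetical local accessor
    }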
21. Group Caching Techniques
- Group objects have one element per processor
- Making them excellent cache points for arrays, which may have many chares per processor
- Place low-volatility data in the group
- Array elements use ckLocalBranch() to access it
- In CPMD, the Structure Factor for all chares that hold plane P uses the same memory (sketched after this list)
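A hedged sketch of such a per-processor cache group (SFCache and its members are illustrative names, not the actual LeanCP code): array elements on a processor reach the shared structure-factor data through ckLocalBranch().

    #include <map>
    #include <vector>
    #include "sfcache.decl.h"          // generated from a .ci file containing:
                                       //   group SFCache { entry SFCache(); };

    // One SFCache object per processor; low-volatility data lives here once
    // and is shared by every chare on that processor.
    class SFCache : public CBase_SFCache {
     public:
      const std::vector<double>& structureFactor(int plane) {
        return factors_[plane];        // filled elsewhere, reused by all chares
      }
     private:
      std::map<int, std::vector<double> > factors_;
    };

    // From any array element: ckLocalBranch() returns the group's local
    // member, so every chare holding plane P reads the same buffer.
    void GSpacePlane::useCache(CProxy_SFCache cacheProxy, int plane) {
      const std::vector<double>& sf = cacheProxy.ckLocalBranch()->structureFactor(plane);
      (void)sf;                        // use sf in this plane's computation
    }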
22. Charm++ Performance Debugging
- Complex parallel applications are hard to debug
- An event-based model with a high degree of virtualization presents new challenges
- Projections and the Charm++ debugger tools
- Bottleneck identification
- Using the Projections Usage Profile tool
23. Old S->T Orthonormalization
24. After Parallel S->T
25. Problem isolation techniques
- Using Rth threads it is easy to isolate phases by adding a barrier
- Contribute to reduction -> suspend
- Reduction proxy is the broadcast client -> resume
- In the following example we break up the GSpace IFFT into computation and communication entry methods
- We then insert a barrier between them to highlight a specific performance problem (sketched below)
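A hedged sketch of that barrier pattern (entry-method names are illustrative): the IFFT is split into compute and communicate entry methods, with an empty reduction between them whose client broadcasts the resume to the whole array, so the two phases show up separately in Projections.

    #include "gspaceplane.decl.h"      // generated from the .ci interface file

    void GSpacePlane::computeIFFT() {
      // ... local FFT computation for this plane ...

      // Barrier: every element contributes to an empty reduction; when all
      // have, the callback broadcasts communicateIFFT() to the array (resume).
      CkCallback resume(CkIndex_GSpacePlane::communicateIFFT(), thisProxy);
      contribute(resume);              // suspend this phase until all are done
    }

    void GSpacePlane::communicateIFFT() {
      // ... send the transformed data on to RealSpace ...
    }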
26. Projections Timeline Analysis
27. Future Work
- Scaling to 20k processors on BG/L
- Density pencil FFTs
- RhoSpace real->complex doublepack optimization
- New FFT based algorithm for Structure Factor
- More systems
- Topology aware chare mapping
- HLL Orchestration expression
28. What time is it in Scotland?
- There is a 1024-node BG/L in Edinburgh
- Time there is 6 hours ahead of CT
- During this non-production time we can run on the full rack at night
- Thank you EPCC!