Title: Massively Parallel Cosmological Simulations with ChaNGa
1Massively Parallel Cosmological Simulations with ChaNGa
- Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn
2Simulations and Scientific Discovery
- Help reconcile observation and theory
- Calculate final states of theories of structure formation
- Direct observational programs
- What should we look for in space?
- Help determine underlying structures and masses
3Computational Challenges
- N ~ 10^12
- Direct summation of forces would take 10^10 Teraflop-years
- Need efficient, scalable algorithms
- Large dynamic ranges
- Need multiple timestepping
- Irregular domains
- Balance load across processors
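To make the cost of direct summation concrete, here is a minimal Python sketch of the pairwise force loop. The function names, the unit choice G = 1 and the softening length `eps` are illustrative assumptions, not taken from ChaNGa.

```python
import math

def direct_forces(pos, mass, eps=0.0, G=1.0):
    """Acceleration on every particle by direct pairwise summation.

    pos  : list of (x, y, z) tuples
    mass : list of masses
    eps  : softening length (an assumption, used to avoid singularities)
    Visits N*(N-1)/2 pairs, so the cost grows as O(N^2).
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + eps ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3  # pull of j on i
                acc[j][k] -= G * mass[i] * d[k] * inv_r3  # equal and opposite
    return acc

def pair_count(n):
    """Number of particle pairs visited by one direct force evaluation."""
    return n * (n - 1) // 2
```

For N = 10^12, `pair_count` gives about 5 x 10^23 pair interactions per force evaluation, which is why a tree-based method is needed.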
4ChaNGa
- Uses Barnes-Hut algorithm
- Based on Charm++
- Processor virtualization
- Asynchronous message-driven model
- Computation and communication overlap
- Intelligent, adaptive runtime system
- Load balancing
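The Barnes-Hut idea can be sketched in a few lines: distant groups of particles are replaced by their total mass at their centre of mass whenever the cell looks small from the target point. The sketch below is a serial 2D quadtree in Python, not ChaNGa's parallel Charm++ implementation. The names `Node`, `build_tree` and `accel`, the opening angle `theta`, the softening `eps` and G = 1 are all assumptions for illustration. It assumes positive masses, distinct positions inside the root square, and `theta` below about 0.7.

```python
class Node:
    """Square quadtree cell that tracks total mass and mass-weighted position."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # cell centre and half-width
        self.mass = 0.0
        self.mx = self.my = 0.0   # sums of m*x and m*y
        self.body = None          # (x, y, m) when this is a leaf with one body
        self.children = None      # four child cells once subdivided

    def _child(self, x, y):
        return self.children[(x >= self.cx) + 2 * (y >= self.cy)]

    def insert(self, x, y, m):
        if self.mass == 0.0 and self.children is None:
            self.body = (x, y, m)                    # empty leaf: store the body
        else:
            if self.children is None:                # occupied leaf: subdivide
                h = self.half / 2
                self.children = [Node(self.cx + sx * h, self.cy + sy * h, h)
                                 for sy in (-1, 1) for sx in (-1, 1)]
                old, self.body = self.body, None
                self._child(old[0], old[1]).insert(*old)
            self._child(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y

def build_tree(bodies, cx=0.0, cy=0.0, half=1.0):
    """Build a quadtree over bodies given as (x, y, m) tuples."""
    root = Node(cx, cy, half)
    for x, y, m in bodies:
        root.insert(x, y, m)
    return root

def accel(node, x, y, theta=0.5, eps=1e-3):
    """Approximate acceleration at (x, y) from all bodies in the tree."""
    if node.mass == 0.0:
        return 0.0, 0.0
    if node.children is None:
        px, py, m = node.body
    else:
        px, py, m = node.mx / node.mass, node.my / node.mass, node.mass
    dx, dy = px - x, py - y
    d2 = dx * dx + dy * dy
    # Accept the cell if it is a leaf or looks small: size / distance < theta.
    if node.children is None or (2 * node.half) ** 2 < theta * theta * d2:
        if d2 == 0.0:
            return 0.0, 0.0       # the target body itself
        f = m / (d2 + eps * eps) ** 1.5
        return f * dx, f * dy
    ax = ay = 0.0
    for child in node.children:   # otherwise open the cell and recurse
        cax, cay = accel(child, x, y, theta, eps)
        ax += cax
        ay += cay
    return ax, ay
```

With `theta=0` every cell is opened and the result reduces to direct summation; larger values trade accuracy for fewer interactions.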
8Major Optimizations
- Pipelined computation
- Prefetch tree chunk before starting traversal
- Tree-in-Cache
- Aggregate trees from all chares on processor
- Tunable computation granularity
- Response time for data requests vs scheduling overhead
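One way to picture a per-processor cache of remote tree data is a table that sends at most one request per remote node and resumes every waiting traversal when the reply arrives. The Python sketch below is an illustration under that assumption; `NodeCache`, `request`, `receive` and `send_request` are hypothetical names, not ChaNGa's cache interface.

```python
class NodeCache:
    """Per-processor cache of remote tree nodes shared by all local tree pieces."""

    def __init__(self, send_request):
        self.send_request = send_request  # callable(key): issues one remote fetch
        self.data = {}                    # key -> node data already received
        self.waiting = {}                 # key -> callbacks for an in-flight fetch

    def request(self, key, callback):
        if key in self.data:              # hit: resume the traversal immediately
            callback(self.data[key])
        elif key in self.waiting:         # fetch already in flight: just queue up
            self.waiting[key].append(callback)
        else:                             # miss: send exactly one request
            self.waiting[key] = [callback]
            self.send_request(key)

    def receive(self, key, node):
        """Called when the remote reply arrives; wakes every waiting requester."""
        self.data[key] = node
        for callback in self.waiting.pop(key, []):
            callback(node)
```

Because requesters are resumed through callbacks, other work can proceed while a fetch is outstanding, which is the overlap of computation and communication that the message-driven model is meant to provide.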
9Experimental Setup
- dwarf: 5 and 50 million particles
- lambs: 3 million particles
- drgas: 700 million particles
- hrwh_LCDMs: 16 million particles
10Experimental Setup (contd.)
11Parallel Performance
A comparison of parallel performance with PKDGRAV ('dwarf' dataset on Tungsten).
12Scaling Tests
[Figure: scaling results on IBM BG/L and Cray XT3, annotated "Poor scaling"]
13Towards Greater Scalability
- Load Imbalance causes poor scaling
- Static balancing not good enough
- An even number of particles does not imply an even work distribution
- Must balance both computation and communication
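A small Python sketch shows why equal particle counts can still leave processors unevenly loaded. The per-particle costs are made-up numbers standing in for, say, the number of interactions each particle needs, and both helper functions are hypothetical.

```python
def split_by_count(costs, nproc):
    """Load per processor when each gets the same number of particles."""
    n = len(costs)
    bounds = [i * n // nproc for i in range(nproc + 1)]
    return [sum(costs[bounds[i]:bounds[i + 1]]) for i in range(nproc)]

def split_by_work(costs, nproc):
    """Load per processor when contiguous chunks are cut at equal cumulative cost."""
    total = sum(costs)
    loads, current, acc, cut = [], 0, 0, 1
    for c in costs:
        current += c
        acc += c
        if cut < nproc and acc >= cut * total / nproc:
            loads.append(current)        # close this chunk at the work target
            current = 0
            cut += 1
    loads.append(current)
    loads += [0] * (nproc - len(loads))  # pad if fewer chunks were produced
    return loads

# 8 expensive particles (clustered region) followed by 56 cheap ones.
costs = [9] * 8 + [1] * 56
by_count = split_by_count(costs, 4)   # [80, 16, 16, 16]
by_work = split_by_work(costs, 4)     # [36, 36, 24, 32]
```

With equal counts the busiest processor carries 80 units against an average of 32; cutting by work brings the maximum down to 36. This sketch ignores communication cost, which also has to be balanced.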
16Results with OrbRefineLB
- Different datasets
- OrbRefineLB
19Balancing Load in Multistepped (MS) Runs
- Different strategies for different phases
- Multiphase instrumentation
- Model-based load estimation (first few small steps)
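To see why different phases of a multistepped run carry different loads, the Python sketch below assumes a power-of-two hierarchy of timestep "rungs", a common multistepping scheme; the slides do not spell out the scheme, and the function names and particle counts are illustrative.

```python
def active_rungs(substep, max_rung):
    """Rungs whose particles get a force update at this substep.

    Rung r uses a timestep of dt_max / 2**r, so one big step contains
    2**max_rung substeps and rung r is active every 2**(max_rung - r) of them.
    """
    return [r for r in range(max_rung + 1)
            if substep % (2 ** (max_rung - r)) == 0]

def work_per_substep(rung_counts):
    """Number of active particles at each substep of one big step."""
    max_rung = len(rung_counts) - 1
    return [sum(rung_counts[r] for r in active_rungs(s, max_rung))
            for s in range(2 ** max_rung)]

# Hypothetical population: most particles on the slowest rung.
phases = work_per_substep([1000, 100, 10])   # [1110, 10, 110, 10]
```

Only the first substep touches every particle; the others update a small, spatially clustered subset, so a distribution that balances the full step can be badly unbalanced for the small ones.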
20Preliminary Results
- Dwarf dataset
- 32 BG/L processors
- Different timestepping schemes
- Singlestepped: 613 s
- Multistepped: 429 s
- Multistepped with load balancing: 228 s
21Preliminary Results
- 50% reduction in execution time
- Lambb dataset
- 512 and 1024 BG/L processors
- Singlestepped vs load-balanced multistepped
- Multistepping and overdecomposition
- Lambb dataset
- 1024 BG/L processors
- Varying number of TreePieces
- More TreePieces lead to better load balance
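The effect of overdecomposition can be illustrated with a toy model in Python: a domain with one expensive region is cut into pieces, and the pieces are placed on processors largest-first onto the least loaded processor. The greedy placement is a stand-in chosen for brevity, not the load balancer used in the runs above, and the cell loads are made-up numbers.

```python
import heapq

def make_pieces(cell_loads, npieces):
    """Group contiguous cells into npieces pieces and return each piece's load."""
    n = len(cell_loads)
    return [sum(cell_loads[i * n // npieces:(i + 1) * n // npieces])
            for i in range(npieces)]

def greedy_assign(piece_loads, nproc):
    """Place pieces largest-first on the least loaded processor."""
    heap = [(0, p) for p in range(nproc)]
    for load in sorted(piece_loads, reverse=True):
        total, p = heapq.heappop(heap)
        heapq.heappush(heap, (total + load, p))
    return sorted(total for total, _ in heap)

def imbalance(loads):
    """Maximum load divided by average load (1.0 is perfect balance)."""
    return max(loads) * len(loads) / sum(loads)

# 32 cells: 8 expensive ones (a clustered region) and 24 cheap ones.
cells = [5] * 8 + [1] * 24
results = {k: imbalance(greedy_assign(make_pieces(cells, k), 4))
           for k in (4, 8, 16)}
# results == {4: 2.5, 8: 1.25, 16: 1.0}
```

With one piece per processor the expensive region cannot be shared out; with four pieces per processor the same four processors reach perfect balance in this toy case. Real runs also pay a per-piece overhead, so the piece count cannot be raised indefinitely.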
22Future Work
- SPH
- Alternative decomposition schemes
- Runtime optimizations to reduce communication cost
- More sophisticated load balancing algorithms
- Account for complete simulation space topology
- Account for processor topology (reduce hop-bytes)
23Conclusions
- Introduced ChaNGa
- Optimizations to reduce simulation time
- Load imbalance issues tackled
- Multiple timestepping beneficial
- Balancing load in multistepped simulations