Title: Massively Parallel Cosmological Simulations with ChaNGa

1
Massively Parallel Cosmological Simulations with ChaNGa

Pritish Jetley, Filippo Gioachin, Celso Mendes,
Laxmikant V. Kale and Thomas Quinn

2
Simulations and Scientific Discovery

Help reconcile observation and theory
Calculate final states of theories of structure
formation
Direct observational programs
What should we look for in space?
Help determine underlying structures and masses

3
Computational Challenges

N ~ 10^12
Direct summation of forces would take 10^10 teraflop-years
Need efficient, scalable algorithms
Large dynamic ranges
Need multiple timestepping
Irregular domains
Balance load across processors
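
A rough sketch of why direct summation is out of reach at this particle count. The flop count per pairwise interaction (30) and the number of timesteps (10,000) are assumptions picked for illustration, not figures from the slides, and the tree-code line ignores the constant factor in front of N log N.

```python
import math

SECONDS_PER_YEAR = 3.156e7   # approximate
TERAFLOP_PER_SEC = 1e12      # floating-point operations per second


def teraflop_years(interactions_per_step, flops_per_interaction=30.0, steps=10_000):
    """Convert an interaction count per step into teraflop-years of work.

    flops_per_interaction and steps are illustrative assumptions.
    """
    flops = interactions_per_step * flops_per_interaction * steps
    return flops / (TERAFLOP_PER_SEC * SECONDS_PER_YEAR)


n = 1e12
direct = teraflop_years(n * n)            # all pairs: O(N^2)
tree = teraflop_years(n * math.log2(n))   # tree code: O(N log N), constant factor ignored

print(f"direct summation: {direct:.1e} teraflop-years")
print(f"tree (N log N):   {tree:.1e} teraflop-years")
```

Under these assumptions the all-pairs cost lands near the order of magnitude quoted on the slide, while the N log N cost is many orders of magnitude smaller.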

4
ChaNGa

Uses Barnes-Hut algorithm
Based on Charm++
Processor virtualization
Asynchronous message-driven model
Computation and communication overlap
Intelligent, adaptive runtime system
Load balancing
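
A minimal serial sketch of the Barnes-Hut idea, assuming a 2D unit square, monopole-only cells, G = 1, and distinct particle positions. The names (`Node`, `build_tree`, `accel`, `direct`) and the parameters `theta` and `eps` are hypothetical and are not taken from ChaNGa, whose tree is three-dimensional and distributed across processors.

```python
import math
import random


class Node:
    """Square cell of a 2D Barnes-Hut quadtree."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # cell centre and half-width
        self.mass = 0.0
        self.mx = self.my = 0.0   # mass-weighted position sums
        self.body = None          # (x, y, m) when this is a leaf holding one body
        self.children = None

    def _split(self):
        h = self.half / 2
        self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                         for dy in (-1, 1) for dx in (-1, 1)]

    def _child(self, x, y):
        return self.children[(x >= self.cx) + 2 * (y >= self.cy)]

    def insert(self, x, y, m):
        if self.mass == 0.0 and self.children is None:
            self.body = (x, y, m)                 # empty leaf: store the body
        else:
            if self.children is None:             # occupied leaf: split it
                self._split()
                bx, by, bm = self.body
                self.body = None
                self._child(bx, by).insert(bx, by, bm)
            self._child(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y


def build_tree(bodies):
    root = Node(0.5, 0.5, 0.5)   # covers the unit square
    for x, y, m in bodies:
        root.insert(x, y, m)
    return root


def accel(node, x, y, theta=0.5, eps=1e-3):
    """Softened acceleration at (x, y) from all bodies under node."""
    if node.mass == 0.0:
        return 0.0, 0.0
    dx = node.mx / node.mass - x
    dy = node.my / node.mass - y
    r2 = dx * dx + dy * dy
    # Use the cell's centre of mass if it is a leaf or looks small enough
    # from (x, y): cell width / distance < theta.
    if node.children is None or (2 * node.half) ** 2 < theta * theta * r2:
        inv = node.mass / (r2 + eps * eps) ** 1.5
        return dx * inv, dy * inv
    ax = ay = 0.0
    for child in node.children:
        cax, cay = accel(child, x, y, theta, eps)
        ax += cax
        ay += cay
    return ax, ay


def direct(bodies, x, y, eps=1e-3):
    """O(N) direct sum for one target, used as a reference."""
    ax = ay = 0.0
    for bx, by, m in bodies:
        dx, dy = bx - x, by - y
        inv = m / (dx * dx + dy * dy + eps * eps) ** 1.5
        ax += dx * inv
        ay += dy * inv
    return ax, ay


random.seed(0)
bodies = [(random.random(), random.random(), 1.0) for _ in range(500)]
root = build_tree(bodies)
print("tree:  ", accel(root, 0.25, 0.75))
print("direct:", direct(bodies, 0.25, 0.75))
```

Setting `theta=0.0` forces the traversal down to the leaves and reproduces the direct sum; larger values trade accuracy for fewer cell visits.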

5
(No Transcript)
6
(No Transcript)
7
(No Transcript)
8
Major Optimizations

Pipelined computation
Prefetch tree chunk before starting traversal
Tree-in-Cache
Aggregate trees from all chares on processor
Tunable computation granularity
Response time for data requests vs Scheduling
overhead

9
Experimental Setup
dwarf: 5 and 50 million particles
lambs: 3 million particles
drgas: 700 million particles
hrwh_LCDMs: 16 million particles
10
Experimental Setup (contd.)

Platforms

11
Parallel Performance
A comparison of parallel performance with
PKDGRAV ('dwarf' dataset on Tungsten)
12
Scaling Tests
IBM BG/L
Cray XT3
Poor scaling
13
Towards Greater Scalability

Load Imbalance causes poor scaling
Static balancing not good enough
Even number of particles ≠ even work
distribution
Must balance both computation and communication
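
A toy sketch of why equal particle counts do not give equal work. The per-particle costs are invented for illustration (a dense region where each particle is 20 times as expensive), and the two splitting functions are hypothetical helpers, not ChaNGa's or Charm++'s balancers. Only computation is modelled here; communication cost is left out.

```python
def split_by_count(costs, parts):
    """Contiguous chunks holding (nearly) equal numbers of particles."""
    n = len(costs)
    return [costs[i * n // parts:(i + 1) * n // parts] for i in range(parts)]


def split_by_work(costs, parts):
    """Contiguous chunks with roughly equal summed cost (cut on the running total)."""
    total = sum(costs)
    chunks, current, acc = [], [], 0.0
    for c in costs:
        current.append(c)
        acc += c
        # Cut each time the running total crosses the next multiple of total/parts.
        while len(chunks) < parts - 1 and acc >= total * (len(chunks) + 1) / parts:
            chunks.append(current)
            current = []
    chunks.append(current)
    return chunks


def imbalance(chunks):
    """Maximum chunk load divided by mean chunk load (1.0 is perfect)."""
    loads = [sum(chunk) for chunk in chunks]
    return max(loads) / (sum(loads) / len(loads))


# 900 cheap particles followed by 100 expensive ones in a dense clump.
costs = [1.0] * 900 + [20.0] * 100
print("equal count:", round(imbalance(split_by_count(costs, 4)), 3))
print("equal work: ", round(imbalance(split_by_work(costs, 4)), 3))
```

With equal counts, the processor that owns the dense clump does most of the work while the others idle; cutting on accumulated cost brings the ratio close to 1.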

14
(No Transcript)
15
(No Transcript)
16
Results with OrbRefineLB

Different datasets
OrbRefineLB

17
(No Transcript)
18
(No Transcript)
19
Balancing Load in Multistepped (MS) Runs

Different strategies for different phases
Multiphase instrumentation
Model-based load estimation (first few small
steps)
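
A small sketch of why a multistepped run has distinct phases with very different loads. It assumes a common power-of-two rung scheme in which rung r uses a timestep of dt_max / 2^r; the rung populations are invented, and the function names are hypothetical.

```python
def active_rungs(substep, max_rung):
    """Rungs whose particles are updated at this smallest-size substep.

    Rung r has timestep dt_max / 2**r, so it is active once every
    2**(max_rung - r) substeps.
    """
    return [r for r in range(max_rung + 1)
            if substep % (2 ** (max_rung - r)) == 0]


def force_evaluations(rung_counts):
    """Particle updates over one largest step: (singlestepped, multistepped)."""
    max_rung = len(rung_counts) - 1
    substeps = 2 ** max_rung
    single = substeps * sum(rung_counts)   # everyone moves at the smallest dt
    multi = sum(rung_counts[r]
                for s in range(substeps)
                for r in active_rungs(s, max_rung))
    return single, multi


# Assumed populations: most particles on the slowest rung, a few on the fastest.
rung_counts = [900, 90, 10]
for s in range(4):
    rungs = active_rungs(s, 2)
    print(f"substep {s}: rungs {rungs}, active particles",
          sum(rung_counts[r] for r in rungs))
print("updates (single, multi):", force_evaluations(rung_counts))
```

The number of active particles swings widely from one substep to the next, which is why a single load distribution tuned for the full step fits the small substeps poorly.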

20
Preliminary Results
Dwarf dataset
32 BG/L processors
Different timestepping schemes:
Singlestepped (613 s)
Multistepped (429 s)
Multistepped with load balancing (228 s)
21
Preliminary Results

50% reduction in execution time

Lambb dataset
512 and 1024 BG/L processors
Singlestepped vs load-balanced multistepped

Multistepping and overdecomposition

Lambb dataset
1024 BG/L processors
Varying num. TreePieces

More TreePieces → better load balance
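
A toy illustration of the overdecomposition effect, assuming invented per-particle costs and a simple largest-first greedy mapping of pieces onto processors. This is a stand-in chosen for brevity, not the load balancer used in the runs above, and `piece_loads` and `greedy_imbalance` are hypothetical names.

```python
import heapq


def piece_loads(costs, pieces):
    """Summed cost of each of `pieces` contiguous, equal-count pieces."""
    n = len(costs)
    return [sum(costs[i * n // pieces:(i + 1) * n // pieces])
            for i in range(pieces)]


def greedy_imbalance(loads, procs):
    """Map pieces largest-first onto the least-loaded processor; return max/mean."""
    heap = [(0.0, p) for p in range(procs)]
    heapq.heapify(heap)
    for load in sorted(loads, reverse=True):
        least, p = heapq.heappop(heap)
        heapq.heappush(heap, (least + load, p))
    totals = [total for total, _ in heap]
    return max(totals) / (sum(totals) / procs)


# 900 cheap particles plus a dense clump of 100 expensive ones (assumed costs).
costs = [1.0] * 900 + [20.0] * 100
for pieces in (4, 8, 40):
    ratio = greedy_imbalance(piece_loads(costs, pieces), procs=4)
    print(f"{pieces:3d} pieces on 4 processors: max/mean load = {ratio:.3f}")
```

With one piece per processor the expensive clump cannot be spread out; with many small pieces the balancer has enough freedom to even out the load.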
22
Future Work

SPH
Alternative decomposition schemes
Runtime optimizations to reduce communication
cost
More sophisticated load balancing algorithms
Account for
Complete simulation space topology
Processor topology (reduce hop-bytes)

23
Conclusions

Introduced ChaNGa
Optimizations to reduce simulation time
Load imbalance issues tackled
Multiple timestepping beneficial
Balancing load in multistepped simulations
