Title: Scalability Study of S3D on Jaguar CNL using TAU
1Scalability Study of S3D on Jaguar CNL using TAU
- Sameer Shende
- tau-team_at_cs.uoregon.edu
2Acknowledgements
- Alan Morris UO
- Kevin Huck UO
- Allen D. Malony UO
- Bronis R. de Supinski LLNL
- The performance data presented here is available at http://tau.uoregon.edu/s3d
3TAU Parallel Performance System
- http://tau.uoregon.edu/
- Multi-level performance instrumentation
- Multi-language automatic source instrumentation
- Flexible and configurable performance measurement
- Widely-ported parallel performance profiling system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid
4Scalability Study
- C2H4 Benchmark
- Platform: Jaguar Cray CNL at ORNL
- 1p
- 4p
- 64p
- 512p
- 1728p
- 4096p
- 8000p
- 12000p
- Goal: to evaluate scaling properties of code regions
- Scalability of MPI operations
5PerfDMF: Performance Data Management Framework
6PerfExplorer - Comparative Analysis
- Relative speedup, efficiency
- total runtime, by event, one event, by phase
- Breakdown of total runtime
- Group fraction of total runtime
- Correlating events to total runtime
- Timesteps per second
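The relative speedup and efficiency metrics listed above can be sketched as follows. This is a minimal illustration using the common textbook definitions, not PerfExplorer's own code; the function names are hypothetical, and the weak-scaling form assumes the work per process is held constant so that ideal runtime stays flat.

```python
def relative_speedup(base_time, time):
    """Speedup of a run relative to the baseline run (same units for both)."""
    if time <= 0:
        raise ValueError("time must be positive")
    return base_time / time


def strong_scaling_efficiency(base_procs, base_time, procs, time):
    """Fixed total problem size: ideal speedup equals procs / base_procs."""
    ideal = procs / base_procs
    return relative_speedup(base_time, time) / ideal


def weak_scaling_efficiency(base_time, time):
    """Fixed work per process: ideal runtime equals the baseline runtime."""
    return relative_speedup(base_time, time)
```

The same functions can be applied to total runtime or to the time of a single event to obtain per-event curves.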
7PerfExplorer
TAU's PerfDMF database
8PerfExplorer Select Experiment Analysis
9Total Execution Time
10Relative Efficiency For S3D - Weak Scaling
11Relative Efficiency by Event
12Relative Speedup by Event
13Data Mining Event Correlation to Total Time
r = 1 implies direct correlation
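A correlation coefficient of this kind can be computed as sketched below. The sketch assumes r is the Pearson coefficient between an event's time and the total time across the runs at different core counts; the function name is hypothetical and this is not PerfExplorer's implementation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

An event whose time grows in lockstep with total runtime gives r close to 1; an event whose time shrinks as total runtime grows gives r close to -1.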
14MPI Scaling (Total time in MPI/Total Time)
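The ratio in this slide title can be illustrated with a short sketch. It assumes a profile is available as a mapping from event name to exclusive time, that the exclusive times sum to the total time, and that MPI events are those whose names start with "MPI_"; the function name and input layout are hypothetical.

```python
def mpi_fraction(exclusive_times):
    """Fraction of total time spent in MPI events.

    exclusive_times: dict mapping event name -> exclusive time (any one unit).
    """
    total = sum(exclusive_times.values())
    if total <= 0:
        raise ValueError("total time must be positive")
    mpi = sum(t for name, t in exclusive_times.items()
              if name.startswith("MPI_"))
    return mpi / total
```

Evaluating this fraction for each core count gives one point per run on an MPI scaling curve.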
15Total Runtime Breakdown by Events
16Floating Point Instructions
17Level 1 Data Cache Misses
18ParaProf 12000 core job
19ParaProf Mean across all nodes
20ParaProf 3D Correlation Cube MPI_Wait!
21ParaProf MPI_Wait variation!
22ParaProf MPI_Wait Histogram
23ParaProf Mflops in Code Regions
24ParaProf Mflops Sorted by Exclusive Time
low mflops?
25S3D - Building with TAU
- Change name of compiler in build/make.XT3
  - ftn => tau_f90.sh
  - cc => tau_cc.sh
- Set compile time environment variables
  - setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/craycnl/lib/Makefile.tau-callpath-multiplecounters-mpi-papi-pdt-pgi
    - Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
  - setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
    - Selective instrumentation file eliminates instrumentation in lightweight routines
    - Pre-process Fortran source code using cpp before compiling
- Set runtime environment variables for instrumentation control and PAPI event counter selection in job submission script
  - export TAU_THROTTLE=1
  - export COUNTER1=GET_TIME_OF_DAY
  - export COUNTER2=PAPI_FP_INS
  - export COUNTER3=PAPI_L1_DCM
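The time and floating point counters selected above can be combined into a derived Mflops value per code region. The sketch below assumes the wall-clock time is reported in microseconds and treats each counted floating point instruction as one operation, so the result is an approximation; the function name is hypothetical.

```python
def mflops(fp_instructions, time_usec):
    """Approximate Mflops for a code region.

    fp_instructions: PAPI_FP_INS count for the region.
    time_usec: wall-clock time for the region in microseconds (assumed unit).
    Instructions per microsecond equals millions of instructions per second.
    """
    if time_usec <= 0:
        raise ValueError("time_usec must be positive")
    return fp_instructions / time_usec
```

Regions with a large time but a small value from this function are the ones worth a closer look.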
26Concluding Discussion
- Identified scaling trends for S3D up to 12k cores
- Identified two loops that take a significant amount of time and have relatively low Mflops
  - TRANSPORT_MCOMPUTESPECIESDIFFFLUX 630-656
  - INTEGRATE 73-93
- MPI_Wait time has jagged edges (see the MPI_Wait variation and histogram slides)
27Support Acknowledgements
- Department of Energy (DOE)
- Office of Science
- LLNL, LANL, ORNL, ASC
- PERI