Title: Application Performance Profiling and Prediction in Grid Environment
1. Application Performance Profiling and Prediction in Grid Environment
- Presented by Marlon Bright
- 1 August 2008
- Advisor: Masoud Sadjadi, Ph.D.
- REU, Florida International University
2. Outline
- Grid Enablement of Weather Research and Forecasting Code (WRF)
- Profiling and Prediction Tools
- Research Goals
- Project Timeline
- Project Status
- Challenges Overcome
- Remaining Work
3. Motivation: Weather Research and Forecasting Code (WRF)
- Goal: Improved Weather Prediction
- Accurate and Timely Results
- Precise Location Information
- WRF Status
- Over 160,000 lines (mostly FORTRAN and C)
- Single Machine/Cluster compatible
- Single Domain
- Fine Resolution -> High Resource Requirements
- How to Overcome this?
- Through Grid Enablement
- Expected Benefits to WRF
- More available resources, different domains
- Faster results
- Improved Accuracy
4. System Overview
- Web-Based Portal
- Grid Middleware (Plumbing)
- Job-Flow Management
- Meta-Scheduling
- Performance Prediction
- Profiling and Benchmarking
- Development Tools and Environments
- Transparent Grid Enablement (TGE)
- TRAP: Static and Dynamic adaptation of programs
- TRAP/BPEL, TRAP/J, TRAP.NET, etc.
- GRID superscalar: Programming Paradigm for parallelizing a sequential application dynamically in a Computational Grid
5. Performance Prediction
- IMPORTANT part of Meta-Scheduling
- Allows for:
- Optimal usage of grid resources through smarter meta-scheduling
- Many users overestimate job requirements
- Reduced idle time for compute resources
- Could save costs and energy
- Optimal resource selection for most expedient job return time
- Tools: Amon/Aprof and Paraver/Dimemas
6. Research Goals
- Extend Amon/Aprof research to a larger number of nodes, a different architecture, and a different version of WRF (Version 2.2.1).
- Compare/contrast Aprof predictions to Dimemas predictions in terms of accuracy and prediction computation time.
- Analyze if/how Amon/Aprof could be used in conjunction with Dimemas/Paraver for optimized application performance prediction and, ultimately, meta-scheduling.
7. Timeline
- End of June
- Get MPItrace linking properly with the WRF version compiled on GCB, then Mind. COMPLETE
- a) Install Amon and Aprof on MareNostrum and ensure proper functioning. AMON COMPLETE; APROF FINAL STAGES
- b) Run Amon benchmarks on MareNostrum. COMPLETE
- Early/Mid July
- Use and analyze Aprof predictions within MareNostrum (and possibly between MareNostrum, GCB, and Mind). MN COMPLETE
- Use generated MPI/OpenMP tracefiles (Paraver/Dimemas) to predict within (and possibly between) Mind, GCB, and MareNostrum. IN PROGRESS
- Late July/Early August
- Experiment with how well Amon and Aprof relate to/could possibly be combined with Dimemas. IN PROGRESS
- Compose paper presenting significant findings. IN PROGRESS
- Analyze how findings relate to the bigger picture. Make optimizations on grid enablement of WRF.
8. The Tools: Amon/Aprof and Dimemas/Paraver
9. Amon / Aprof
- Amon: a monitoring program that runs on each compute node, recording new processes
- Aprof: a regression analysis program running on the head node; receives input from Amon to make execution time predictions (within a cluster and between clusters)
10. Amon / Aprof: Monitoring and Prediction
11. Amon / Aprof: Approach to Modeling Resource Usage
- Diagram: WRF resource usage characteristics (network latency, CPU speed, hard disk I/O, network bandwidth, number of nodes, FSB bandwidth, RAM size, L2 cache) feed the application resource usage model (a sketch of the model form follows).
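- A minimal sketch of the model form these inputs suggest (the exact regressors Aprof fits are an assumption here, inferred from the inverse-CPU and inverse-clock fields shown on slide 16, "Extreme Makeover"):

    elapsed_time ~ b0 + b1*(1/num_cpus) + b2*(1/cpu_MHz) + b3*(1/(num_cpus*cpu_MHz)) + ...

  where the b terms are the regression parameters Aprof estimates from the Amon measurements.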
12. Previous Findings for Amon / Aprof
- Experiments were performed on two clusters at FIU: Mind (16 nodes) and GCB (8 nodes)
- Experiments were run to predict for different numbers of nodes and CPU loads (i.e. 2, 3, ..., 14, 15 nodes and 20, 30, ..., 90, 100 percent)
- Aprof predictions were within 10% error versus actual recorded runtimes within Mind and GCB and between Mind and GCB
- Conclusion: the first-step assumption was valid -> move to extending the research to a higher number of nodes.
13. How'd they do that?
- Developed a benchmarking script that edits and submits a job file to the MareNostrum (MN) scheduler (a sketch follows this list)
- Runs for each number of nodes (8, 16, 32, 64, 96, 128)
- Runs for each CPU percentage (100, 75, 50, 25)
- Records execution time, average CPU utilization, participating nodes, etc.
- Job file
- Requests desired number of nodes from MN
- Starts Amon on each returned node to monitor and return processes
- Starts cpulimit on each returned node, limiting the effective power given to the WRF process
- Executes WRF as a parallel job across the returned nodes
- Developed modification script
- Combines Amon output into one file
- Filters processes down to solely WRF processes
- Edits processes into an Aprof-friendly format
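- A minimal sketch of what such a benchmarking loop could look like (the template name, placeholder fields, and submission command below are assumptions, not the actual MareNostrum tooling):

    #!/bin/bash
    # Hypothetical benchmarking loop: for every node count and CPU cap,
    # fill in a job template and hand it to the scheduler.
    SUBMIT_CMD="mnsubmit"                    # assumption: substitute the real MN submit command
    for nodes in 8 16 32 64 96 128; do
        for cpu in 100 75 50 25; do
            job="wrf_${nodes}n_${cpu}pct.job"
            # @NODES@ / @CPU@ are hypothetical placeholders in the job template
            sed -e "s/@NODES@/${nodes}/" -e "s/@CPU@/${cpu}/" wrf.job.template > "$job"
            $SUBMIT_CMD "$job"
        done
    done

  Inside the (assumed) template, the job would start Amon and cpulimit on each allocated node before launching WRF, as described above.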
14. How'd they do that? (cont'd)
- Start Aprof, loading the input file as data
- Executed Aprof query automation script (a sketch follows this list)
- Starts a telnet session querying Aprof for benchmarked scenarios
- Compares predicted values to actual values returned in the run
- Outputs a text file and a graphing plot file of comparison statistics
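- A hedged sketch of such query automation; the host, port, and query text below are placeholders, since Aprof's actual query syntax is not reproduced in these slides:

    #!/bin/bash
    # Hypothetical query loop: send one query per benchmarked node count to a
    # running Aprof instance over telnet and collect the replies.
    HOST=localhost          # assumption: Aprof listening on the head node
    PORT=9000               # assumption: Aprof's query port
    out=aprof_predictions.txt
    : > "$out"
    for nodes in 8 16 32 64 96 128; do
        # the query string is an illustrative placeholder, not Aprof's real syntax
        { echo "wrf.exe cpus ${nodes}"; sleep 2; } | telnet "$HOST" "$PORT" >> "$out"
    done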
15. Experimental Process
16. Extreme Makeover
- Amon record as reported on MareNostrum:
    --- (464) ---
    name wrf.exe
    cpus 4
    cpu MHz 1/0.000 MHz
    cache size 1/0 KB
    elapsed time 957952 msec
    utime 956370 msec 957810 msec
    stime 570 msec 860 msec
    intr 18783
    ctxt switch 58290
    fork 95
    storage R 0 blocks 0 blocks
    storage W 0 blocks
    network Rx 19547308 bytes
    network Tx 1434925 bytes
- A record in the Aprof-friendly format (after the modification script):
    --- (464) ---
    name wrf.exe
    inv cpu 1/16
    inv clock 1/574
    cache size 1/1024 KB
    elapsed time 1990992 msec
    inv clockcpu 1/(36763)
- Why? The version of Linux on MN does not report some characteristics (e.g. cache size), and, from its initial design, Amon reports in a different format than Aprof reads.
17. Aprof Prediction
    name           wrf.exe
    elapsed time   5.783787e+06

    explanatory value   parameter       std. dev
    -----------------   -------------   -------------
    1.000000e+00        5.783787e+06    1.982074e+05

                        predicted value   residue rms     std. dev
                        ---------------   -------------   -------------
    elapsed time        5.783787e+06      4.246451e+06    1.982074e+05
18. Query Automation Script Output
    adj. cpu speed, processors, actual, predicted, rms, std. dev, % difference
    3591.363, 1, 5222, 5924.82, 1592.459, 415.3491, 13.4588280352
    3591.363, 2, 2881, 3246.283, 1592.459, 181.5382, 12.6790350573
    3591.363, 3, 2281, 2353.438, 1592.459, 105.334, 3.17571240684
    3591.363, 4, 1860, 1907.015, 1592.459, 69.19778, 2.52768817204
    3591.363, 5, 1681, 1639.161, 1592.459, 49.83672, 2.48893515764
    3591.363, 6, 1440, 1460.592, 1592.459, 39.5442, 1.43
    3591.363, 7, 1380, 1333.043, 1592.459, 34.76459, 3.40268115942
    3591.363, 8, 1200, 1237.381, 1592.459, 33.27651, 3.11508333333
    3591.363, 9, 1200, 1162.977, 1592.459, 33.56231, 3.08525
    3591.363, 10, 1080, 1103.454, 1592.459, 34.68943, 2.17166666667
    3591.363, 11, 1200, 1054.753, 1592.459, 36.15324, 12.1039166667
    3591.363, 12, 1080, 1014.169, 1592.459, 37.70271, 6.09546296296
    3591.363, 13, 1200, 979.8292, 1592.459, 39.22018, 18.3475666667
    3591.363, 14, 1021, 950.3947, 1592.459, 40.65455, 6.91530852106
    3591.363, 15, 1020, 924.8848, 1592.459, 41.9872, 9.32501960784
19. Paraver / Dimemas
- Dimemas: a simulation tool for the parametric analysis of the behavior of message-passing applications on a configurable parallel platform
- Paraver: a tool that allows for performance visualization and analysis of trace files generated from actual executions and by Dimemas
- Tracefiles are generated by MPItrace, which is linked into the execution code
20. Dimemas Simulation Process Overview
- Link MPItrace into the application source code; it dynamically generates tracefiles for each node the application is running on
- Identify computation iterations in Paraver; compose a smaller trace file by selecting a few iterations, preserving communications and eliminating initialization phases
- Convert the new tracefile to Dimemas format (.trf) using the CEPBA-provided prv2trf tool
- Load the tracefile into the Dimemas simulator, configure the target machine, and with this information generate the Dimemas configuration file
- Call the simulator with or without the option of generating a Paraver (.prv) tracefile for viewing (a command-line sketch of these steps follows)
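- The command-line side of these steps might look roughly like the sketch below; the file names are placeholders and the Dimemas invocation should be checked against the CEPBA/BSC documentation:

    #!/bin/bash
    # Hypothetical pipeline: convert the cut Paraver trace and run the simulator.
    # cut.prv is assumed to be the reduced trace exported from Paraver.
    prv2trf cut.prv cut.trf                          # CEPBA tool named on this slide
    # wrf_sim.cfg is assumed to reference cut.trf and describe the target machine
    # (relative CPU speed, latency, bandwidth, number of buses, ...).
    Dimemas wrf_sim.cfg > dimemas_prediction.txt     # exact flags omitted; see the Dimemas manual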
21. Paraver/Dimemas: DiP Environment
22. How'd they do that?
- Generated Paraver tracefiles, Dimemas tracefiles, and simulation configuration files for each number of nodes
- Developed Dimemas simulation script, simulation_automater.sh (a sketch follows this list)
- Selects the configuration file for the desired number of nodes
- Edits the configuration file for the desired CPU percentage
- Records execution time, average CPU utilization
- Finalizing development of prediction validation script, which will:
- Compare Dimemas predicted values to actual run values
- Output a text file and a graphing plot file of comparison statistics
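- A minimal sketch of what simulation_automater.sh could look like; the configuration placeholder and the simulator invocation are assumptions:

    #!/bin/bash
    # Hypothetical simulation_automater.sh: pick the configuration for the
    # requested node count, scale the CPU speed for the requested CPU
    # percentage, run the simulation, and record the wall time.
    nodes=$1                                   # e.g. 64
    cpu=$2                                     # e.g. 75
    cfg="wrf_${nodes}n.cfg"                    # assumed per-node-count config file
    tmp="wrf_${nodes}n_${cpu}pct.cfg"
    ratio=$(awk "BEGIN { printf \"%.2f\", ${cpu} / 100 }")
    # @CPU_RATIO@ is a hypothetical placeholder for the relative CPU speed field
    sed "s/@CPU_RATIO@/${ratio}/" "$cfg" > "$tmp"
    start=$(date +%s)
    Dimemas "$tmp" > "sim_${nodes}n_${cpu}pct.out"   # invocation hedged as above
    end=$(date +%s)
    echo "${nodes}, ${cpu}, $((end - start))" >> simulation_times.csv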
23. Dimemas Prediction
- Execution time: 36.354146
- Speedup: 5.34
- CPU Time: 194.066431

    Id.   Computation time   %       Communication
    1     31.224017          91.21   3.008552
    2     20.089440          78.20   5.599083
    3     19.305673          76.84   5.818317
    4     28.672368          83.27   5.762332
    5     29.058603          85.36   4.982049
    6     19.488003          77.63   5.614155
    7     18.727851          78.57   5.108366
    8     27.500476          84.29   5.123971

    Id.   Mess. sent     Bytes sent     Immediate recv   Waiting recv   Bytes recv     Coll. op.      Block time   Comm. time   Wait link time   Wait buses time   I/O time
    1     7.577000e+03   1.583659e+08   3.539000e+03     4.080000e+03   1.671666e+08   1.475000e+03   0.247092     0.383663     0.319859         0.000000          0.000000
    2     8.948000e+03   2.200029e+08   8.797000e+03     1.440000e+02   2.186629e+08   1.475000e+03   3.710867     0.383663     0.098868         0.000000          0.000000
    3     8.948000e+03   2.176712e+08   6.904000e+03     2.037000e+03   2.163992e+08   1.475000e+03   3.453668     0.383663     0.243052         0.000000          0.000000
24. Project Status
25. Amon / Aprof
- Software installed and tailored to MareNostrum
- Proficient in executing software
- Amon benchmarking completed
- Aprof query automation complete and results generated
- Lessons learned on extending Amon/Aprof to a different architecture
26. Dimemas / Paraver
- Proficient in executing software
- Paraver and Dimemas tracefiles generated for each number of nodes (8, 16, 32, 64, 96, 128)
- Benchmarking script complete
- Simulations generated
- Comparison script being finalized
27. Quick Comparison
- Amon / Aprof
- Pros
- Simpler to deploy in comparison
- Scalability of the model is promising with first results
- Feasible solution for performance prediction purposes
- Cons
- Requires more base executions for accurate performance in comparison
- Dimemas / Paraver
- Pros
- More features; could be more useful to an experienced user (i.e. adjustment of system characteristics)
- Visualization and analysis of execution for analysis purposes
- Graphical User Interface
- Cons
- Requires special compilation of applications
- Requires a non-trivial-to-install kernel patch
- Large tracefiles (sometimes gigabytes)
28. Aprof Results: 100% CPU Utilization
29. Aprof Results: 100% CPU Utilization
30. Significant Challenges Overcome
- Amon
- Adjustment of source code for proper functioning on MareNostrum (MN)
- Development of benchmarking script to conform to the system architecture of MareNostrum (i.e. going through its scheduler, one process per node, etc.)
- Proper functioning of cpulimit for accurate CPU percentage
- Job termination by the MN scheduler due to execution surpassing the wall clock limit
31. Significant Challenges Overcome (cont'd)
- Aprof
- Adjustment of source code for less complex, more consistent data input
- Development of prediction and comparison scripts for MareNostrum
- Dimemas/Paraver
- MPItrace properly linked in with WRF on GCB and Mind
- Generation of trace and configuration files
- WRF
- Version 2.2 installed and compiled on Mind
32. Challenges Remaining
- Lengthy Amon benchmarking runs due to time jobs spend in the queue
- Complexities in preparing Dimemas tracefiles for simulation purposes
- Extracting accurate predictions from Dimemas: trace files are reduced in order to speed up the prediction process; therefore, predicted times must be multiplied by a determined factor (see the example below)
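- For example (hypothetical numbers): if the cut trace keeps 3 of 30 computation iterations, the simulated time for the traced portion would be scaled by roughly 10, with any excluded initialization cost accounted for separately.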
33. Remaining Work
- Next Week
- Finalize scripting of Dimemas prediction simulations for the same scenarios as those of Amon and Aprof
- Fall 2008
- Experiment with how well Amon and Aprof relate to/could possibly be combined with Dimemas
- Decide if and how to compare results from MareNostrum, GCB, and Mind (i.e. the same version of WRF would have to be running in all three locations)
- Compose paper presenting significant results and submit it to a conference
- Future Work
- Work with the meta-scheduling team on implementation of the tools
34. References
- S. Masoud Sadjadi, Liana Fong, Rosa M. Badia, Javier Figueroa, Javier Delgado, Xabriel J. Collazo-Mojica, Khalid Saleem, Raju Rangaswami, Shu Shimizu, Hector A. Duran Limon, Pat Welsh, Sandeep Pattnaik, Anthony Praino, David Villegas, Selim Kalayci, Gargi Dasgupta, Onyeka Ezenwoye, Juan Carlos Martinez, Ivan Rodero, Shuyi Chen, Javier Muñoz, Diego Lopez, Julita Corbalan, Hugh Willoughby, Michael McFail, Christine Lisetti, and Malek Adjouadi. Transparent grid enablement of weather research and forecasting. In Proceedings of the Mardi Gras Conference 2008 - Workshop on Grid-Enabling Applications, Baton Rouge, Louisiana, USA, January 2008. http://www.cs.fiu.edu/sadjadi/Presentations/Mardi-Gras-GEA-2008-TGE-WRF.ppt
- S. Masoud Sadjadi, Shu Shimizu, Javier Figueroa, Raju Rangaswami, Javier Delgado, Hector Duran, and Xabriel Collazo. A modeling approach for estimating execution time of long-running scientific applications. In Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS-2008), the Fifth High-Performance Grid Computing Workshop (HPGC-2008), Miami, Florida, April 2008. http://www.cs.fiu.edu/sadjadi/Presentations/HPGC-2008-WRF%20Modeling%20Paper%20Presentationl.ppt
- Performance/Profiling. Presented by Javier Figueroa in the Special Topics in Grid Enablement of Scientific Applications class, 13 May 2008.
35. Acknowledgements
- REU
- Partnerships for International Research and Education (PIRE)
- The Barcelona Supercomputing Center (BSC)
- Masoud Sadjadi, Ph.D. - FIU
- Rosa Badia, Ph.D. - BSC
- Javier Delgado - FIU
- Javier Figueroa - Univ. of Miami