Title: Application Performance Profiling and Prediction in Grid Environment
1. Application Performance Profiling and Prediction in Grid Environment
- Presented by Marlon Bright
- 14 July 2008
- Advisor: Masoud Sadjadi, Ph.D.
- REU Florida International University
2. Outline
- Grid Enablement of Weather Research and Forecasting Code (WRF)
- Profiling and Prediction Tools
- Research Goals
- Project Timeline
- Current Progress
- Challenges
- Remaining Work
3. Motivation: Weather Research and Forecasting Code (WRF)
- Goal: Improved Weather Prediction
- Accurate and Timely Results
- Precise Location Information
- WRF Status
- Over 160,000 lines (mostly FORTRAN and C)
- Single Machine/Cluster compatible
- Single Domain
- Fine Resolution -> Higher Resource Requirements
- How to overcome this?
- Through Grid Enablement
- Expected Benefits to WRF
- More available resources; different domains
- Faster results
- Improved Accuracy
4. System Overview
- Web-Based Portal
- Grid Middleware (Plumbing)
- Job-Flow Management
- Meta-Scheduling
- Performance Prediction
- Profiling and Benchmarking
- Development Tools and Environments
- Transparent Grid Enablement (TGE)
- TRAP: static and dynamic adaptation of programs
- TRAP/BPEL, TRAP/J, TRAP.NET, etc.
- GRID superscalar: programming paradigm for parallelizing a sequential application dynamically in a Computational Grid
5. Performance Prediction
- IMPORTANT part of Meta-Scheduling
- Allows for:
- Optimal usage of grid resources through smarter meta-scheduling (many users overestimate their job requirements)
- Reduced idle time for compute resources, which could save costs and energy
- Optimal resource selection for the most expedient job return time
6. The Tools: Amon / Aprof and Dimemas / Paraver
7. Amon / Aprof
- Amon: a monitoring program that runs on each compute node, recording new processes
- Aprof: a regression analysis program running on the head node; it receives input from Amon to make execution time predictions, within a cluster and between clusters (see the regression sketch below)
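As a rough illustration of the regression idea behind Aprof (a sketch, not Aprof's actual code or model), the Python snippet below fits a least-squares model of elapsed time against resource terms such as inverse clock speed and inverse node count. The feature choice and all timings except the 1234232 msec value taken from the Amon sample later in these slides are invented for the example.

# Minimal sketch of regression-based runtime prediction in the spirit of Aprof.
# The feature set and most of the sample data are illustrative assumptions.
import numpy as np

# Each row: [intercept, 1/clock_MHz, 1/num_nodes]; target: elapsed time in msec.
observations = np.array([
    [1.0, 1 / 2297.7, 1 / 8],
    [1.0, 1 / 2297.7, 1 / 16],
    [1.0, 1 / 3591.4, 1 / 8],
    [1.0, 1 / 3591.4, 1 / 16],
])
elapsed_ms = np.array([1234232.0, 660000.0, 820000.0, 450000.0])  # mostly hypothetical runs

# Ordinary least squares: coefficients minimizing the squared prediction error.
coeffs, residuals, rank, _ = np.linalg.lstsq(observations, elapsed_ms, rcond=None)

# Predict the runtime of a new configuration (3591.4 MHz CPUs, 32 nodes).
new_run = np.array([1.0, 1 / 3591.4, 1 / 32])
print("predicted elapsed time (msec):", new_run @ coeffs)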
8. Amon / Aprof: Monitoring and Prediction
9. Amon / Aprof: Approach to Modeling Resource Usage
- [Diagram: WRF feeding an Application Resource Usage Model]
- Modeled resources: CPU Speed, Number of Nodes, L2 Cache, RAM Size, FSB Bandwidth, Hard Disk I/O, Network Latency, Network Bandwidth
10. Sample Amon Output: Process
- --- (464) ---
- name: wrf.exe
- cpus: 8
- inv. clock: 1/2297.700 MHz
- inv. cache size: 1/1024 KB
- elapsed time: 1234232 msec
- utime: 1233890 msec / 1236360 msec
- stime: 560 msec / 1420 msec
- intr: 44959
- ctxt switch: 84394
- fork: 89
- storage R: 0 blocks / 0 blocks
- storage W: 0 blocks
- network Rx: 4188840 bytes
- network Tx: 2106854 bytes
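A small, hypothetical illustration of consuming a record like the one above: the parser below assumes a plain "key: value" per-line layout with a "--- (pid) ---" separator, as suggested by the sample shown, not by any documented Amon format.

# Hedged sketch: parse one Amon-style process record into a dictionary.
# The "key: value" layout is assumed from the sample above, not from an Amon spec.
def parse_amon_record(text: str) -> dict:
    record = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("---"):
            continue  # skip blank lines and the "--- (pid) ---" record separator
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record

sample = """--- (464) ---
name: wrf.exe
cpus: 8
elapsed time: 1234232 msec
network Rx: 4188840 bytes"""

print(parse_amon_record(sample))
# {'name': 'wrf.exe', 'cpus': '8', 'elapsed time': '1234232 msec', 'network Rx': '4188840 bytes'}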
11. Sample Aprof Output
- name: wrf_arw_DM.exe
- elapsed time: 5.783787e+06
- explanatory value: 1.000000e+00; parameter: 5.783787e+06; std. dev: 1.982074e+05
- predicted value (elapsed time): 5.783787e+06; residue rms: 4.246451e+06; std. dev: 1.982074e+05
12. Sample Query Automation Script Output
- adj. cpu speed, processors, actual, predicted, rms, std. dev, actual % difference
- 3591.363, 1, 5222, 5924.82, 1592.459, 415.3491, 13.4588280352
- 3591.363, 2, 2881, 3246.283, 1592.459, 181.5382, 12.6790350573
- 3591.363, 3, 2281, 2353.438, 1592.459, 105.334, 3.17571240684
- 3591.363, 4, 1860, 1907.015, 1592.459, 69.19778, 2.52768817204
- 3591.363, 5, 1681, 1639.161, 1592.459, 49.83672, 2.48893515764
- 3591.363, 6, 1440, 1460.592, 1592.459, 39.5442, 1.43
- 3591.363, 7, 1380, 1333.043, 1592.459, 34.76459, 3.40268115942
- 3591.363, 8, 1200, 1237.381, 1592.459, 33.27651, 3.11508333333
- 3591.363, 9, 1200, 1162.977, 1592.459, 33.56231, 3.08525
- 3591.363, 10, 1080, 1103.454, 1592.459, 34.68943, 2.17166666667
- 3591.363, 11, 1200, 1054.753, 1592.459, 36.15324, 12.1039166667
- 3591.363, 12, 1080, 1014.169, 1592.459, 37.70271, 6.09546296296
- 3591.363, 13, 1200, 979.8292, 1592.459, 39.22018, 18.3475666667
- 3591.363, 14, 1021, 950.3947, 1592.459, 40.65455, 6.91530852106
- 3591.363, 15, 1020, 924.8848, 1592.459, 41.9872, 9.32501960784
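The last column above is the percentage difference between the predicted and actual runtimes. As a minimal sketch (assuming the comma-separated layout shown; this is not the actual query automation script), it can be computed as follows:

# Hedged sketch: percent difference between predicted and actual runtimes,
# assuming the comma-separated columns shown above.
import csv, io

sample = """adj_cpu_speed,processors,actual,predicted,rms,std_dev
3591.363,1,5222,5924.82,1592.459,415.3491
3591.363,6,1440,1460.592,1592.459,39.5442"""

for row in csv.DictReader(io.StringIO(sample)):
    actual = float(row["actual"])
    predicted = float(row["predicted"])
    pct_diff = abs(predicted - actual) / actual * 100  # difference from actual, in percent
    print(f'{row["processors"]} processors: {pct_diff:.2f}% difference')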
13. Previous Findings for Amon / Aprof
- Experiments were performed on two clusters at FIU: Mind (16 nodes) and GCB (8 nodes)
- Experiments were run to predict for different numbers of nodes and CPU loads (i.e. 2, 3, ..., 14, 15 nodes and 20, 30, ..., 90, 100% load)
- Aprof predictions were within 10% error versus actual recorded runtimes, both within Mind and GCB and between Mind and GCB
- Conclusion: the first-step assumption was valid -> move to extending the research to a higher number of nodes
14. Paraver / Dimemas
- Dimemas: a simulation tool for the parametric analysis of the behavior of message-passing applications on a configurable parallel platform
- Paraver: a tool that allows for performance visualization and analysis of trace files generated from actual executions and by Dimemas
- Tracefiles are generated by MPItrace, which is linked into the execution code
15. Dimemas Simulation Process Overview
- Link MPItrace into the application source code; it dynamically generates tracefiles (.mpit) for each node the application runs on
- Use the CEPBA tool mpi2prv to convert the .mpit files into one .prv file
- Load the file into Paraver using an XML filtering file (provided by CEPBA) to reduce the tracefile, eliminating perturbed regions (i.e. much of the initialization)
- Open the tracefile in Paraver using the useful_duration configuration file and adjust the scales to fit the events
- Identify computation iterations; compose a smaller trace file by selecting a few iterations, preserving communications and eliminating initialization phases
16. Paraver tracefile with iterations selected, cut, and ready for Dimemas conversion
17. Simulation Process (cont'd)
- Convert the new tracefile to Dimemas format (.trf) using the CEPBA-provided prv2trf tool (see the pipeline sketch below)
- Load the tracefile into the Dimemas simulator, configure the target machine, and with this information generate a Dimemas configuration file
- Call the simulator with or without the option of generating a Paraver (.prv) tracefile for viewing
- Great news: you only have to go through this process once, if it is done for the maximum number of nodes you will simulate! Once the configuration file is generated, different numbers of nodes can be simulated through alterations to the file.
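The two command-line conversions in this pipeline can be scripted; the sketch below shows one possible driver. The tool names (mpi2prv, prv2trf) come from these slides, but the exact argument forms are assumptions, so consult the CEPBA tool documentation before relying on them.

# Hedged sketch of the trace-conversion pipeline, driven from Python.
# Argument forms are assumptions; the Paraver cutting step remains manual.
import glob
import subprocess

# 1) Merge the per-node .mpit files produced by MPItrace into one Paraver trace (.prv).
mpit_files = sorted(glob.glob("TRACE*.mpit"))
subprocess.run(["mpi2prv", *mpit_files, "-o", "wrf_full.prv"], check=True)

# 2) Manual step in Paraver: filter the trace, select a few representative iterations,
#    and cut them into a smaller tracefile (assumed here to be saved as wrf_cut.prv).

# 3) Convert the reduced Paraver trace to Dimemas format (.trf) for simulation.
subprocess.run(["prv2trf", "wrf_cut.prv", "wrf_cut.trf"], check=True)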
18. Dimemas Simulator Results
19. Goals
- Extend the Amon/Aprof research to a larger number of nodes, a different architecture, and a different version of WRF (Version 2.2.1)
- Compare/contrast Aprof predictions with Dimemas predictions in terms of accuracy and prediction computation time
- Analyze if/how Amon/Aprof could be used in conjunction with Dimemas/Paraver for optimized application performance prediction and, ultimately, meta-scheduling
20. Timeline
- End of June
- Get MPItrace linking properly with the WRF version compiled on GCB, then Mind: COMPLETE
- a) Install Amon and Aprof on MareNostrum and ensure proper functioning: AMON COMPLETE, APROF IN FINAL STAGES
- b) Run Amon benchmarks on MareNostrum: COMPLETE
- Early/Mid July
- Use and analyze Aprof predictions within MareNostrum (and possibly between MareNostrum, GCB, and Mind): IN PROGRESS
- Use generated MPI/OpenMP tracefiles (Paraver/Dimemas) to predict within (and possibly between) Mind, GCB, and MareNostrum: IN PROGRESS
- Late July/Early August
- Experiment with how well Amon and Aprof relate to / could possibly be combined with Dimemas
- Analyze how findings relate to the bigger picture; make optimizations to the grid enablement of WRF
- Compose a paper presenting significant findings
21. Current Progress
22. General
- Completed reading of related works papers
- Well advanced in Linux studies
- Established effective collaboration/working
relationship with developers of Dimemas and
Paraver
23. Amon
- Installed on MareNostrum
- Adjusted source code to properly read node information from MareNostrum (this will be documented on the Wiki so it can be considered when configuring Amon on new architectures)
24. Amon (cont'd)
- Automated benchmarking shell script developed (see the sketch below)
- Starts Amon on each compute node returned by the system scheduler
- Executes WRF with one process per node for:
- Node counts of 8, 16, 32, 64, 96, and 128
- CPU percentage (%) loads of 25, 50, 75, and 100 (done through the CPULimit program)
- Writes results (to be used as Aprof input) to an organized results directory of <cpu load percentage>/<number of nodes>/<timestamp of run>/<amon output by node>
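A sketch of the sweep performed by the benchmarking script, written here in Python rather than as the actual shell script; the directory layout follows the slide above, while command and file names are illustrative assumptions.

# Hedged sketch of the benchmarking sweep: node counts x CPU loads, with results
# written to <cpu load percentage>/<number of nodes>/<timestamp of run>/.
import datetime
import pathlib

NODE_COUNTS = [8, 16, 32, 64, 96, 128]
CPU_LOADS = [25, 50, 75, 100]  # percent, enforced in the real script via CPULimit

results_root = pathlib.Path("results")

for load in CPU_LOADS:
    for nodes in NODE_COUNTS:
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        run_dir = results_root / str(load) / str(nodes) / stamp
        run_dir.mkdir(parents=True, exist_ok=True)
        # In the real script: start Amon on every node returned by the scheduler,
        # launch WRF with one process per node under the chosen CPU limit, then
        # collect each node's Amon output into run_dir. Here we only record the plan.
        (run_dir / "plan.txt").write_text(
            f"nodes={nodes} cpu_load={load}% command=wrf.exe\n"
        )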
25. Aprof
- Installed on MareNostrum
- Adjusted source code to change the way Aprof reads in information
- Before: input files had to specify the number of bytes of each process listing in the process header (this was very complicated and error prone, and Aprof was inconsistent in loading MareNostrum data)
- Now: input files simply need to separate process entries with one or more blank lines
26. Aprof (cont'd)
- Script developed that combines the Amon output from all nodes and edits it into the necessary read-in format for Aprof (a minimal sketch follows below)
- Aprof query automation script adjusted/developed for MareNostrum
- Queries Aprof for prediction information for different cases (number of nodes, CPU percentage loads)
- Compares predicted values to the actual values returned by each run
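A minimal sketch of the combining step (not the actual script): per-node Amon output files are concatenated with blank lines between process entries, which is the separator the modified Aprof expects. The directory layout and the .amon file suffix are assumptions.

# Hedged sketch: merge per-node Amon output into a single Aprof input file,
# separating entries with blank lines as described above.
import glob
import pathlib

def combine_amon_output(run_dir: str, output_file: str = "aprof_input.txt") -> None:
    parts = []
    for node_file in sorted(glob.glob(f"{run_dir}/*.amon")):  # assumed one file per node
        text = pathlib.Path(node_file).read_text().strip()
        if text:
            parts.append(text)
    # One or more blank lines between process entries is what Aprof now expects.
    pathlib.Path(output_file).write_text("\n\n".join(parts) + "\n")

combine_amon_output("results/100/8/20080714-120000")  # hypothetical run directory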
27. Dimemas / Paraver
- Paraver tracefile successfully generated and visualized with the GUI on MareNostrum
- Dimemas tracefile successfully generated from the Paraver trace on MareNostrum
- Configuration file for MareNostrum developed
- Prediction simulations will begin shortly
28. Significant Challenges Overcome
- Amon
- Adjustment of source code for proper functioning on MareNostrum
- Development of a benchmarking script that conforms to the system architecture of MareNostrum (i.e. going through its scheduler, one process per node, etc.)
- Aprof
- Adjustment of source code for less complex, more consistent data input
- Development of prediction and comparison scripts for MareNostrum
29. Significant Challenges Overcome (cont'd)
- Dimemas/Paraver
- MPItrace properly linked in with WRF on GCB and Mind
- Paraver and Dimemas tracefiles successfully generated, and the configuration file set up for MareNostrum
- WRF
- Version 2.2 installed and compiled on Mind
30. Remaining Work
- Scripting Dimemas prediction simulations for the same scenarios as those of Amon and Aprof
- Finalizing the Aprof prediction/comparison script so that Aprof's performance on the new MareNostrum architecture can be analyzed
- Deciding if and how to compare results from MareNostrum, GCB, and Mind (i.e. the same versions of WRF would have to be running in all three locations)
- Experimenting with how well Amon and Aprof relate to / could possibly be combined with Dimemas
31. References
- S. Masoud Sadjadi, Liana Fong, Rosa M. Badia, Javier Figueroa, Javier Delgado, Xabriel J. Collazo-Mojica, Khalid Saleem, Raju Rangaswami, Shu Shimizu, Hector A. Duran Limon, Pat Welsh, Sandeep Pattnaik, Anthony Praino, David Villegas, Selim Kalayci, Gargi Dasgupta, Onyeka Ezenwoye, Juan Carlos Martinez, Ivan Rodero, Shuyi Chen, Javier Muñoz, Diego Lopez, Julita Corbalan, Hugh Willoughby, Michael McFail, Christine Lisetti, and Malek Adjouadi. Transparent grid enablement of weather research and forecasting. In Proceedings of the Mardi Gras Conference 2008, Workshop on Grid-Enabling Applications, Baton Rouge, Louisiana, USA, January 2008.
  http://www.cs.fiu.edu/sadjadi/Presentations/Mardi-Gras-GEA-2008-TGE-WRF.ppt
- S. Masoud Sadjadi, Shu Shimizu, Javier Figueroa, Raju Rangaswami, Javier Delgado, Hector Duran, and Xabriel Collazo. A modeling approach for estimating execution time of long-running scientific applications. In Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS-2008), Fifth High-Performance Grid Computing Workshop (HPGC-2008), Miami, Florida, April 2008.
  http://www.cs.fiu.edu/sadjadi/Presentations/HPGC-2008-WRF%20Modeling%20Paper%20Presentationl.ppt
- Performance/Profiling. Presented by Javier Figueroa in the Special Topics in Grid Enablement of Scientific Applications class, 13 May 2008.
32. Acknowledgements
- REU
- PIRE
- BSC
- Masoud Sadjadi, Ph.D. - FIU
- Rosa Badia, Ph.D. - BSC
- Javier Delgado - FIU
- Javier Figueroa - UM