Title: MPI performance prediction by DIMEMAS
1MPI performance prediction by DIMEMAS
- Rosa M. Badia, Jesús Labarta, Judit Giménez and Francesc Escalé
- CEPBA-IBM Research Institute
- rosab_at_ciri.upc.es
2Outline
- EU Project Damien Overview
- Dimemas Overview
- Communication model
- Point to point
- Collective operations
- Examples
- Validation of collective operations
- RNAfold
- Qualitative analysis
- Summary
3DAMIEN project
- Funded by the European Commission
- Motivation
- Industrial and academic applications have high requirements for memory and CPUs
- Typical problems: multi-physics coupled applications (e.g. fluid-structure interaction)
- Large companies have sites and computing resources all over the world
- Goals
- Provide a toolbox, starting from existing and widely accepted tools, to support Grid environments
- Test the toolbox with real applications from industry in a testbed based on high-speed networks
- Demonstration on industrial applications
4DAMIEN project
- 30-month project (Jan 2001 - October 2003)
- Partners
- CEPBA-UPC (Spain)
- CRIHAN (France)
- EADS (France)
- HLRS (Germany)
- PALLAS (Germany)
5DAMIEN structure
7Dimemas overview
- Application performance analysis tool for message-passing programs
- Event-based simulator
- Tracefile based
- In development since 1992
- Runs on any PC or workstation (UNIX, Linux, Windows)
- Distributed by CEPBA
8Dimemas overview
[Figure: workflow diagram with the labels Sequential machine, MPI, Message Passing Code, PACX-MPI, Computational Grid, Dimemas simulation, and Visualization tools (MetaVampir, Paraver)]
- Double use
- Application tuning in development phase
- Computational Grid selection for production
9Dimemas overview
- Architecture model: from networks of SMPs to computational Grids
10Dimemas overview
- Graphic interface
- Architecture configuration
- Parameter settings (different latency, L, and bandwidth, BW, at different levels)
- Configuration file loading/saving
11Dimemas extensions for the GRID
- One connection from each machine to the network
- This resource is used on a first-come, first-served (FCFS) basis
- External network: influence of traffic
- Estimation of the traffic in the wide area network
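The single-connection, FCFS behaviour described on this slide can be illustrated with a short sketch. It is not Dimemas code: the `Transfer` record, the `schedule_fcfs` function and the demo values are hypothetical names and numbers chosen for illustration, and external traffic is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    machine: str      # machine whose network connection is used
    ready_at: float   # time (s) at which the message is ready to leave
    duration: float   # time (s) the message occupies the connection


def schedule_fcfs(transfers):
    """Return (machine, start, end) per transfer, served first come, first served."""
    free_at = {}   # machine -> time at which its single connection is free again
    schedule = []
    # sorted() is stable, so transfers that become ready together keep input order.
    for t in sorted(transfers, key=lambda t: t.ready_at):
        start = max(t.ready_at, free_at.get(t.machine, 0.0))
        end = start + t.duration
        free_at[t.machine] = end
        schedule.append((t.machine, start, end))
    return schedule


# Hypothetical demo values (not measurements from the slides).
demo = [
    Transfer("A", 0.0, 2.0),
    Transfer("A", 1.0, 1.0),   # waits until A's connection is free at t = 2.0
    Transfer("B", 0.5, 1.0),   # B has its own connection, so it does not wait
]
fcfs_schedule = schedule_fcfs(demo)
```

In this sketch the second message from machine A is delayed by one second because the first one still holds the connection, which is the kind of contention effect the Grid extension is meant to expose.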
13Communication model: point to point
- Latency
- Node, SMP or remote level
- Resource consuming
14Communication model: point to point
- Machine resources contention
- Simulated by Dimemas
15Communication model: point to point
- Transfer
- BW at node, SMP or remote level
- Process may resume
16Communication model: point to point
- WAN contention: at remote level only
17Communication model: point to point
- Flight time: at remote level only
- Non-resource-consuming latency
- f(distance)
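Read together, the point-to-point slides suggest that the time of a message is the sum of a latency, a contention wait, a transfer term (size over bandwidth) and, at remote level only, a flight time. The sketch below assumes exactly that additive form. The `Level` class, the `p2p_time` function and all numeric values are hypothetical, and the contention wait is passed in as a number rather than simulated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """Parameters of one communication level (node, SMP or remote)."""
    latency: float             # seconds, resource consuming
    bandwidth: float           # bytes per second
    flight_time: float = 0.0   # seconds, nonzero at remote level only


def p2p_time(size_bytes, level, contention_wait=0.0):
    """Assumed additive model: latency + contention wait + transfer + flight time."""
    if level.bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    transfer = size_bytes / level.bandwidth
    return level.latency + contention_wait + transfer + level.flight_time


# Hypothetical parameter values, chosen only to make the arithmetic easy to follow.
smp = Level(latency=0.0001, bandwidth=100_000_000.0)
remote = Level(latency=0.001, bandwidth=1_000_000.0, flight_time=0.010)

t_smp = p2p_time(500_000, smp)         # 0.0001 + 0.005 s
t_remote = p2p_time(500_000, remote)   # 0.001 + 0.5 + 0.010 s
```

With these made-up values the same 500,000-byte message costs about 5 ms inside an SMP and about half a second across the remote link, where the bandwidth term dominates.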
18Communication model: collective operations
- Four phases: external and internal phases
[Figure: collective operation phases across Machine 1 and Machine 2]
19Communication model: collective operations
20Communication model: collective operations
- Example of a collective operations configuration file
21[Figure: diagram with the labels L1, L2, L3, "L, BW" and BW]
23Example of validation: collective operations
- Benchmarks: PMB (Pallas MPI Benchmarks)
- Allgather, Allreduce, Alltoall, Barrier, and Bcast
- Communication size from 0 to 512 Kbytes
- 50 iterations per size (except Barrier with 1000 iterations and Alltoall with 10 iterations)
- Goal of the experiment: identify a set of parameters for a given target configuration
24Example of validation: collective operations
- Methodology
- Local execution of the benchmark (8 processors, IBM-SP2).
- Results: tracefile obtained with mpidtrace
- Execution of the benchmark in a mini-GRID
- 2 processors IBM-SP
- 2 processors IBM Power4
- 4 processors on an SGI O2000
- Measuring execution time for reference after MPI_Init and before MPI_Finalize
- Results: set of time measurements for each of the benchmarks.
- Execution of hundreds of Dimemas simulations varying
- flight time
- latency
- bandwidth
- Result: range of values that fit the measured executions.
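The last step of this methodology, running many simulations while varying flight time, latency and bandwidth and keeping the values that fit the measurements, can be sketched as a plain parameter sweep. In the sketch a toy formula (`toy_predict`) stands in for a Dimemas simulation. That formula, the 10% tolerance, the function names and every number are assumptions made for illustration, not values from the experiment.

```python
from itertools import product


def toy_predict(latency, bandwidth, flight_time, n_msgs=100, total_bytes=1_000_000):
    """Stand-in for one simulation run: a made-up cost formula, not the Dimemas model."""
    return n_msgs * (latency + flight_time) + total_bytes / bandwidth


def fitting_parameters(measured, latencies, bandwidths, flight_times,
                       tolerance=0.10, predict=toy_predict):
    """Return every (latency, bandwidth, flight_time) predicted within tolerance."""
    fits = []
    for lat, bw, ft in product(latencies, bandwidths, flight_times):
        predicted = predict(lat, bw, ft)
        if abs(predicted - measured) <= tolerance * measured:
            fits.append((lat, bw, ft))
    return fits


def value_ranges(fits):
    """Collapse the fitting combinations into one (min, max) range per parameter."""
    if not fits:
        return None
    return [(min(column), max(column)) for column in zip(*fits)]


# Hypothetical sweep: one measured time (s) and a small grid of candidate values.
fits = fitting_parameters(
    measured=2.0,
    latencies=[0.001],
    bandwidths=[1_000_000.0, 2_000_000.0],
    flight_times=[0.005, 0.009],
)
ranges = value_ranges(fits)
```

Passing a different `predict` callable, for example one that launches a simulator and parses its output, would keep the sweep logic unchanged.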
25Example of validation: collective operations
26Example of validation: collective operations
27Example of validation: collective operations
[Figure: Allgather (similar in Alltoall, Allreduce, Bcast)]
28Example of validation: collective operations
[Figure: Barrier]
29Example of validation: collective operations
- External globalop 0 LIN 2MAX LIN 2MAX (Barrier)
- External globalop 1 LIN MEAN 0 MIN (Bcast)
- External globalop 2 LIN MEAN 0 MAX
- External globalop 3 LIN MEAN 0 MAX
- External globalop 4 LIN MEAN 0 MAX
- External globalop 5 LIN MEAN 0 MAX
- External globalop 6 LIN MIN LIN MEAN (Allgather)
- External globalop 7 LIN MEAN LIN MEAN
- External globalop 8 LIN 2MAX LIN 2MAX (Alltoall)
- External globalop 9 LIN MEAN LIN MAX
- External globalop 10 LIN 2MAX 0 MAX
- External globalop 11 LIN MIN LIN MIN (Allreduce)
- External globalop 12 LIN 2MAX LIN MIN
- External globalop 13 LIN MAX LIN MAX
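Each configuration line above has the shape `External globalop <id> <model> <size> <model> <size>`. The sketch below parses such a line and estimates the cost of one phase. Several points are assumptions made for illustration and are not stated on the slides: that the two (model, size) pairs describe two successive phases (named `first` and `second` here), that `LIN` means the cost grows linearly with the number of machines, that `0` means the phase costs nothing, and that the size rules `MIN`, `MAX`, `MEAN` and `2MAX` are computed from the send and receive sizes. All function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalOp:
    op_id: int
    first_model: str    # "LIN" or "0" in the lines shown on the slide
    first_size: str     # "MIN", "MAX", "MEAN" or "2MAX"
    second_model: str
    second_size: str


def parse_globalop(line):
    """Parse 'External globalop <id> <model> <size> <model> <size>'."""
    fields = line.split()
    if len(fields) != 7 or fields[:2] != ["External", "globalop"]:
        raise ValueError(f"unexpected line: {line!r}")
    return GlobalOp(int(fields[2]), *fields[3:7])


def phase_size(rule, send_bytes, recv_bytes):
    """Assumed meaning of the size rules, based on send and receive sizes."""
    if rule == "MIN":
        return min(send_bytes, recv_bytes)
    if rule == "MAX":
        return max(send_bytes, recv_bytes)
    if rule == "MEAN":
        return (send_bytes + recv_bytes) / 2
    if rule == "2MAX":
        return 2 * max(send_bytes, recv_bytes)
    raise ValueError(f"unknown size rule: {rule!r}")


def phase_time(model, rule, send_bytes, recv_bytes, n_machines, latency, bandwidth):
    """Assumed cost: '0' is free, 'LIN' repeats one transfer per machine."""
    if model == "0":
        return 0.0
    if model != "LIN":
        raise ValueError(f"model not covered by this sketch: {model!r}")
    size = phase_size(rule, send_bytes, recv_bytes)
    return n_machines * (latency + size / bandwidth)


bcast = parse_globalop("External globalop 1 LIN MEAN 0 MIN")
# Hypothetical sizes and network parameters: 2 machines, 10 ms latency, 1 MB/s.
t_first = phase_time(bcast.first_model, bcast.first_size, 1000, 3000, 2, 0.01, 1_000_000.0)
t_second = phase_time(bcast.second_model, bcast.second_size, 1000, 3000, 2, 0.01, 1_000_000.0)
```

Under these assumptions the first phase of the example costs 2 x (0.01 + 2000 / 1,000,000) seconds and the second phase is free.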
30Application - RNAfold
- RNAfold computes the minimum free energy secondary structure of long RNA sequences.
- Derived from the Vienna RNA package of Ivo Hofacker.
- Tightly coupled MPI-parallelized version.
- Version was improved for HPC Challenge 2002
- Includes the newest free energy parameters
- Better communication pattern
- Integration into Virtual Environment
31Application - RNAfold
- Machines involved
- Cray T3E/900 at HLRS
- IBM SP-3 at CEPBA
- SGI O2000 at CEPBA
- Application
- RNAfold
- Configurations
- 44 processors
- 66 processors
- 1414 processors
[Figure labels: BW, Flight time]
32Application - RNAfold
Yeast RNA 44 processors
33Test 6000, 1414 processors, BW 70 KB/s, Flight time 10 ms
34Qualitative analysis: Uranus
- 64 processes
- 4 16-way SMPs
- Flight times: 0, 1, 10 and 50 ms
35Qualitative analysis: Linpack
- 256 processes
- 16 16-way SMPs
- BW (MB/s) / Flight time (ms)
- 50 / 1
- 100 / 1
- 200 / 1
- 200 / 0.1
- 200 / 0.01
- 500 / 0.01
36Qualitative analysis: exploring the response surface
- Linpack
- 256 processes
- 16 16-way SMPs
- Flight time and bandwidth exploration
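Exploring a response surface over flight time and bandwidth amounts to evaluating the predicted execution time on a grid of (BW, flight time) pairs. The sketch below uses the bandwidth and flight-time values listed on the Linpack slide as grid axes, but the cost formula (`toy_predict`) and its compute time, message count and data volume are invented placeholders standing in for a simulation run, so the resulting numbers carry no meaning for Linpack itself.

```python
def toy_predict(bw_mb_s, flight_ms, compute_s=10.0, n_msgs=1000, volume_mb=2000.0):
    """Placeholder cost formula (not the Dimemas model): compute + transfer + flight."""
    return compute_s + volume_mb / bw_mb_s + n_msgs * flight_ms / 1000.0


def response_surface(bandwidths, flight_times, predict=toy_predict):
    """Map every (bandwidth, flight time) pair to its predicted time in seconds."""
    return {(bw, ft): predict(bw, ft) for bw in bandwidths for ft in flight_times}


def print_surface(surface, bandwidths, flight_times):
    """Print the surface as a small table: one row per bandwidth."""
    print("BW (MB/s)  " + "  ".join(f"{ft:>8} ms" for ft in flight_times))
    for bw in bandwidths:
        row = "  ".join(f"{surface[(bw, ft)]:>10.2f}" for ft in flight_times)
        print(f"{bw:>9}  {row}")


bandwidths = [50, 100, 200, 500]    # MB/s, axis values taken from the Linpack slide
flight_times = [0.01, 0.1, 1]       # ms, axis values taken from the Linpack slide
surface = response_surface(bandwidths, flight_times)
print_surface(surface, bandwidths, flight_times)
```

Replacing `toy_predict` with a function that runs one simulation per grid point would give the surface for a real tracefile.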
38Summary
- The performance analysis of parallel programs is a must, but
- Execution of production runs with the objective of doing performance analysis may be very expensive.
- Worse if the target architecture is a computational Grid instead of a parallel machine!
- Dimemas is a valuable tool, since
- It allows predicting the execution of MPI programs
- Without requiring the use of the target platform.
- It helps the development of MPI applications by showing where the communication bottlenecks are when running on the Grid, the impact of contention on the network, ...
- It can also be used before production to select the optimum Grid configuration of the target architecture (which machines to use, how many processors in each machine, ...)
39Summary
- Communication model
- Simple and fast
- Current version is able to predict for computational Grids
- Very easy to use. It has a Java-based graphical user interface. A user manual exists to help users with the parameter-setting process.
- The Grid version of Dimemas was built from the initial version, a performance prediction tool for parallel platforms.
- The utility of Dimemas has been demonstrated within the DAMIEN project with several applications
40Summary
- Dimemas predicts for Grid architectures, but does not really run on the Grid, so we do not run into many Grid problems.
- The main problem was how to model MPI communication through the Grid.
- Initial modeling, tuning, and then reformulation of those aspects that do not fit reality.