Title: MPI performance prediction by DIMEMAS

1
MPI performance prediction by DIMEMAS

Rosa M. Badia, Jesús Labarta, Judit Giménez and
Francesc Escalé
CEPBA-IBM Research Institute
rosab_at_ciri.upc.es

2
Outline

EU Project Damien Overview
Dimemas Overview
Communication model
Point to point
Collective operations
Examples
Validation of collective operations
RNAfold
Qualitative analysis
Summary

3
DAMIEN project

Funded by the European Commission
Motivation:
Industrial and academic applications have high memory and CPU requirements
Typical problems: multi-physics coupled applications (e.g. fluid-structure interaction)
Large companies have sites and computing resources all over the world
Goals:
Provide a toolbox, starting from existing and widely accepted tools, to support Grid environments
Test the toolbox with real applications from industry in a testbed based on high-speed networks
Demonstration on industrial applications

4
DAMIEN project

30-month project (Jan 2001 - October 2003)
Partners
CEPBA-UPC (Spain)
CRIHAN (France)
EADS (France)
HLRS (Germany)
PALLAS (Germany)

5
DAMIEN structure
6
Outline

EU Project Damien Overview
Dimemas Overview
Communication model
Point to point
Collective operations
Examples
Validation of collective operations
RNAfold
Qualitative analysis
Summary

7
Dimemas overview

Application performance analysis tool for message-passing programs
Event-based simulator
Tracefile based
In development since 1992
Runs on any PC or workstation (UNIX, Linux, Windows)
Distributed by CEPBA

8
Dimemas overview
[Figure labels: Sequential machine; MPI; Message Passing Code; PACX-MPI; Computational Grid; Visualization tools (MetaVampir, Paraver); Dimemas simulation]

Double use:
Application tuning in development phase
Computational Grid selection for production

9
Dimemas overview

Architecture model: from networks of SMPs to computational Grids

10
Dimemas overview

Graphic interface
Architecture configuration
Parameter settings (different latency L and bandwidth BW at different levels)
Configuration file loading/saving

11
Dimemas extensions for the GRID

One connection from each machine to the network
This resource is used on an FCFS (first come, first served) basis
External network: influence of traffic
Estimation of the traffic in the wide area network
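
The FCFS use of the single machine-to-network connection can be illustrated with a minimal sketch. It is not Dimemas code: the `Message` type, the function name and the single-link simplification are assumptions made for illustration, and traffic from the external network is ignored.

```python
from dataclasses import dataclass


@dataclass
class Message:
    ready_time: float   # time (s) at which the message is ready to use the link
    size_bytes: float   # payload size in bytes


def simulate_fcfs_link(messages, bandwidth_bytes_per_s):
    """Serve messages one at a time on a single link, first come, first served.

    Returns a list of (start, finish) times in the same order as `messages`.
    Hypothetical helper: one shared link, no latency, no external traffic.
    """
    # sorted() is stable, so messages that are ready at the same time
    # keep their original relative order.
    order = sorted(range(len(messages)), key=lambda i: messages[i].ready_time)
    link_free_at = 0.0
    schedule = [None] * len(messages)
    for i in order:
        msg = messages[i]
        # A message waits until it is ready AND the link is free.
        start = max(msg.ready_time, link_free_at)
        finish = start + msg.size_bytes / bandwidth_bytes_per_s
        schedule[i] = (start, finish)
        link_free_at = finish
    return schedule
```

Two messages that become ready at the same time are serialized: the second one starts only when the first has left the link, which is the contention effect the slide refers to.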

12
Outline

EU Project Damien Overview
Dimemas Overview
Communication model
Point to point
Collective operations
Examples
Validation of collective operations
RNAfold
Qualitative analysis
Summary

13
Communication model: point to point

Latency
Node, SMP or remote level
Resource consuming

14
Communication model: point to point

Machine resources contention
Simulated by Dimemas

15
Communication model: point to point

Transfer
BW at node, SMP or remote level
Process may resume

16
Communication model: point to point

WAN contention: at remote level only

17
Communication model: point to point

Flight time: at remote level only
Non-resource-consuming latency
f(distance)
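
The point-to-point components listed on these slides (latency, transfer at a given BW, flight time at remote level) can be combined in a rough sketch. It assumes the components simply add up and leaves out the machine-resource and WAN contention that Dimemas simulates; the class and function names are hypothetical. The remote bandwidth and flight time reuse the 70 KB/s and 10 ms values quoted for the RNAfold test, while the latency values and the SMP bandwidth are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelParams:
    """Parameters of one communication level (node, SMP or remote)."""
    latency_s: float                 # resource-consuming start-up latency
    bandwidth_bytes_per_s: float     # BW at this level
    flight_time_s: float = 0.0       # non-zero at remote level only


def point_to_point_time(size_bytes, level):
    """Uncontended time for one message: latency + flight time + size / BW.

    Assumption: the phases are purely additive; contention is not modelled.
    """
    transfer_s = size_bytes / level.bandwidth_bytes_per_s
    return level.latency_s + level.flight_time_s + transfer_s


# Illustrative parameter sets (latencies and SMP bandwidth are assumptions).
smp = LevelParams(latency_s=20e-6, bandwidth_bytes_per_s=200e6)
remote = LevelParams(latency_s=1e-3, bandwidth_bytes_per_s=70e3,
                     flight_time_s=10e-3)

for name, level in (("SMP", smp), ("remote", remote)):
    print(name, point_to_point_time(64 * 1024, level))
```

Keeping flight time separate from latency mirrors the slides: latency occupies resources, whereas flight time only delays the arrival and depends on distance.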

18
Communication model: collective operations

Four phases: external and internal phases

Machine 1
Machine 2
19
Communication model: collective operations
20
Communication model: collective operations

Example of a collective operations configuration file

21
[Figure labels: L1; L2; L,BW; BW; L3]
22
Outline

EU Project Damien Overview
Dimemas Overview
Communication model
Point to point
Collective operations
Examples
Validation of collective operations
RNAfold
Qualitative analysis
Summary

23
Example of validation: collective operations

Benchmarks: PMB (Pallas MPI Benchmarks)
Allgather, Allreduce, Alltoall, Barrier, and Bcast
Communication size from 0 to 512 Kbytes
50 iterations per size (except Barrier, 1000 iterations, and Alltoall, 10 iterations)
Goal of the experiment: identify a set of parameters for a given target configuration

24
Example of validation: collective operations

Methodology
Local execution of the benchmark (8 processors, IBM-SP2).
Result: tracefile obtained with mpidtrace
Execution of the benchmark in a mini-GRID:
2 processors IBM-SP
2 processors IBM Power4
4 processors on an SGI O2000
Measuring execution time for reference after MPI_Init and before MPI_Finalize
Result: set of time measurements for each of the benchmarks.
Execution of hundreds of Dimemas simulations varying:
flight time
latency
bandwidth
Result: range of values that fit the measured executions.
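
The last step of the methodology, sweeping flight time, latency and bandwidth and keeping the combinations that fit the measurements, can be sketched as a grid search. This is only an illustration under stated assumptions: `toy_predicted_time` is a stand-in linear cost model, not the Dimemas simulator, and the function names, the 10% tolerance and any measurements passed in are hypothetical.

```python
from itertools import product


def toy_predicted_time(size_bytes, latency_s, bandwidth_bytes_per_s, flight_time_s):
    """Stand-in for one simulation run (assumption: simple additive model)."""
    return latency_s + flight_time_s + size_bytes / bandwidth_bytes_per_s


def fitting_parameters(measured, latencies, bandwidths, flight_times, rel_tol=0.10):
    """Return every (latency, bandwidth, flight_time) whose predictions stay
    within rel_tol of all measured times.

    `measured` maps message size in bytes to a measured time in seconds.
    """
    fits = []
    for lat, bw, ft in product(latencies, bandwidths, flight_times):
        if all(abs(toy_predicted_time(size, lat, bw, ft) - t) <= rel_tol * t
               for size, t in measured.items()):
            fits.append((lat, bw, ft))
    return fits


def parameter_ranges(fits):
    """Collapse the fitting combinations into a (min, max) range per variable."""
    if not fits:
        return None
    lats, bws, fts = zip(*fits)
    return {
        "latency_s": (min(lats), max(lats)),
        "bandwidth_bytes_per_s": (min(bws), max(bws)),
        "flight_time_s": (min(fts), max(fts)),
    }
```

In a real study each call to the stand-in model would be replaced by a simulation of the recorded tracefile, and `measured` would hold the reference timings from the mini-GRID runs.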

25
Example of validation: collective operations

Variable ranges

26
Example of validation: collective operations
27
Example of validation: collective operations
Allgather (similar for Alltoall, Allreduce, Bcast)
28
Example of validation: collective operations
Barrier
29
Example of validation: collective operations

External globalop 0 LIN 2MAX LIN 2MAX (Barrier)
External globalop 1 LIN MEAN 0 MIN (BCast)
External globalop 2 LIN MEAN 0 MAX
External globalop 3 LIN MEAN 0 MAX
External globalop 4 LIN MEAN 0 MAX
External globalop 5 LIN MEAN 0 MAX
External globalop 6 LIN MIN LIN MEAN (Allgather)
External globalop 7 LIN MEAN LIN MEAN
External globalop 8 LIN 2MAX LIN 2MAX (Alltoall)
External globalop 9 LIN MEAN LIN MAX
External globalop 10 LIN 2MAX 0 MAX
External globalop 11 LIN MIN LIN MIN (Allreduce)
External globalop 12 LIN 2MAX LIN MIN
External globalop 13 LIN MAX LIN MAX
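
Lines in this format can be read mechanically. The sketch below assumes each line carries an operation id followed by two (model, size) pairs, one per phase, and an optional name in parentheses; the field names are an interpretation for illustration, not taken from the Dimemas documentation.

```python
from typing import NamedTuple


class GlobalOpModel(NamedTuple):
    op_id: int
    phase1_model: str   # e.g. "LIN" or "0" (assumed meaning: cost model)
    phase1_size: str    # e.g. "MIN", "MAX", "MEAN", "2MAX" (assumed: message size rule)
    phase2_model: str
    phase2_size: str


def parse_globalop(line):
    """Parse one 'External globalop ...' line into a GlobalOpModel."""
    tokens = line.split()
    # Drop a trailing annotation such as "(Barrier)".
    if tokens and tokens[-1].startswith("(") and tokens[-1].endswith(")"):
        tokens = tokens[:-1]
    if len(tokens) != 7 or tokens[:2] != ["External", "globalop"]:
        raise ValueError(f"unexpected globalop line: {line!r}")
    return GlobalOpModel(int(tokens[2]), *tokens[3:7])


for text in ("External globalop 0 LIN 2MAX LIN 2MAX (Barrier)",
             "External globalop 1 LIN MEAN 0 MIN (BCast)"):
    print(parse_globalop(text))
```

Parsing the fitted settings into records makes it easy to compare them across target configurations or to regenerate a configuration file after a new parameter sweep.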

30
Application - RNAfold

RNAfold computes the minimum free energy secondary structure of long RNA sequences.
Derived from the Vienna RNA package of Ivo Hofacker.
Tightly coupled MPI-parallelized version.
Version was improved for HPC Challenge 2002:
Includes newest free energy parameters
Better communication pattern
Integration into Virtual Environment

31
Application - RNAfold

Machines involved
Cray T3E/900 at HLRS
IBM SP-3 at CEPBA
SGI O2000 at CEPBA
Application
RNAfold
Configurations
44 processors
66 processors
1414 processors

BW, Flight time
32
Application - RNAfold
Yeast RNA 44 processors
33
Test 6000, 1414 processors, BW 70 KB/s, flight time 10 ms
34
Qualitative analysis: Uranus

64 processes
4 16-way SMPs
Flight times: 0,1,10 and 50 ms.

35
Qualitative analysis: Linpack

256 processes
16 16-way SMPs
BW (MB/s) / Flight time (ms)
50 / 1
100 / 1
200 / 1
200 / 0.1
200 / 0.01
500 / 0.01

36
Qualitative analysis: explore response surface

Linpack
256 processes
16 16-way SMPs

Flight time and bandwidth exploration

37
Outline

EU Project Damien Overview
Dimemas Overview
Communication model
Point to point
Collective operations
Examples
Validation of collective operations
RNAfold
Qualitative analysis
Summary

38
Summary

The performance analysis of parallel programs is a must, but:
Execution of production runs with the objective of doing performance analysis may be very expensive.
Worse if the target architecture is a computational Grid instead of a parallel machine!
Dimemas is a valuable tool, since it:
Allows predicting the execution of MPI programs without requiring the use of the target platform.
Helps development of MPI applications by showing where the communication bottlenecks are when running on the Grid, the impact of contention on the network, ...
Can also be used before production to select the optimum Grid configuration of the target architecture (which machines to use, how many processors in each machine, ...)

39
Summary

Communication model
Simple, fast
Current version is able to predict for computational Grids
Very easy to use. It has a Java-based graphical user interface. A user manual exists to help users with the parameter setting process.
The Grid version of Dimemas was built from the initial version, a performance prediction tool for parallel platforms.
Utility of Dimemas has been demonstrated inside project DAMIEN with several applications

40
Summary

Dimemas predicts for Grid architectures, but does not really run on the Grid, so we do not run into many Grid problems.
Main problem was how to model the MPI communication through the Grid.
Initial modeling, tuning and then reformulation for those aspects that do not fit reality.
