1
MPI performance prediction by DIMEMAS
  • Rosa M. Badia, Jesús Labarta, Judit Giménez and
    Francesc Escalé
  • CEPBA-IBM Research Institute
  • rosab_at_ciri.upc.es

2
Outline
  • EU Project Damien Overview
  • Dimemas Overview
  • Communication model
  • Point to point
  • Collective operations
  • Examples
  • Validation of collective operations
  • RNAfold
  • Qualitative analysis
  • Summary

3
DAMIEN project
  • Funded by European Commission
  • Motivation
  • Industrial and academic applications have high
    requirements for memory and CPUs
  • Typical problems: multi-physics coupled
    applications (e.g. fluid-structure interaction)
  • Large companies have sites and computing
    resources all over the world
  • Goals
  • Provide a toolbox starting from existing and
    widely accepted tools to support
    Grid-environments
  • Test the toolbox with real applications from
    industry in a testbed based on high-speed
    networks
  • Demonstration on industrial applications

4
DAMIEN project
  • 30-month project (Jan 2001 to October 2003)
  • Partners
  • CEPBA-UPC (Spain)
  • CRIHAN (France)
  • EADS (France)
  • HLRS (Germany)
  • PALLAS (Germany)

5
DAMIEN structure
6
Outline
  • EU Project Damien Overview
  • Dimemas Overview
  • Communication model
  • Point to point
  • Collective operations
  • Examples
  • Validation of collective operations
  • RNAfold
  • Qualitative analysis
  • Summary

7
Dimemas overview
  • Application performance analysis tool for message
    passing programs
  • Event-based simulator
  • Tracefile based
  • In development since 1992
  • Runs on any PC or workstation (UNIX, Linux,
    Windows)
  • Distributed by CEPBA

8
Dimemas overview
[Diagram labels: Sequential machine; MPI; Message Passing Code; PACX-MPI; Computational Grid; Visualization tools: MetaVampir, Paraver; Dimemas simulation]
  • Double use
  • Application tuning in development phase
  • Computational Grid selection for production

9
Dimemas overview
  • Architecture model: from networks of SMPs to
    computational Grids

10
Dimemas overview
  • Graphic interface
  • Architecture configuration
  • Parameter settings (different L and BW at
    different levels)
  • Configuration file loading/saving

11
Dimemas extensions for the GRID
  • One connection from each machine to the network
  • This resource is used on an FCFS (first come,
    first served) basis
  • External network: influence of traffic
  • Estimation of the traffic in the wide area
    network
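
The FCFS use of the single machine-to-network connection can be illustrated with a minimal sketch. It is not Dimemas code: the function name `fcfs_link` and the (arrival time, duration) request format are assumptions made for this example only.

```python
def fcfs_link(requests):
    """Serve (arrival_time, duration) requests on one shared link, FCFS.

    Hypothetical helper for illustration. Requests are served in order
    of arrival time (ties are broken by duration because of tuple
    sorting). Returns one (start, finish) pair per request.
    """
    schedule = []
    link_free_at = 0.0
    for arrival, duration in sorted(requests):
        start = max(arrival, link_free_at)  # wait while the link is busy
        finish = start + duration
        schedule.append((start, finish))
        link_free_at = finish
    return schedule


# Three messages compete for the single connection of one machine.
messages = [(0.0, 3.0), (1.0, 2.0), (6.0, 1.0)]
schedule = fcfs_link(messages)
for (arrival, _), (start, finish) in zip(sorted(messages), schedule):
    print(f"arrived {arrival}, started {start}, finished {finish}")
```

The second message arrives while the link is busy and has to wait; the third finds the link idle and starts immediately.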

12
Outline
  • EU Project Damien Overview
  • Dimemas Overview
  • Communication model
  • Point to point
  • Collective operations
  • Examples
  • Validation of collective operations
  • RNAfold
  • Qualitative analysis
  • Summary

13
Communication model: point to point
  • Latency
  • Node, SMP or remote level
  • Resource consuming

14
Communication model: point to point
  • Machine resources contention
  • Simulated by Dimemas

15
Communication model: point to point
  • Transfer
  • BW at node, SMP or remote level
  • Process may resume

16
Communication model: point to point
  • WAN contention at remote level only

17
Communication model: point to point
  • Flight time at remote level only
  • Non resource consuming latency
  • f(distance)
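
The point-to-point phases listed on the preceding slides (latency, resource contention, transfer, and flight time at remote level) can be combined in a small sketch. This is a reading of the slides, not the simulator's implementation: the class `Link`, the function `p2p_time`, and the example latency are assumptions, and contention is passed in as a precomputed wait rather than simulated.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Parameters of one communication level (node, SMP or remote)."""
    latency_s: float              # resource-consuming startup latency
    bandwidth_bytes_per_s: float  # BW at this level
    flight_time_s: float = 0.0    # non-resource-consuming; remote level only


def p2p_time(size_bytes, link, contention_wait_s=0.0):
    """Estimated time of one message: latency + wait + transfer + flight.

    contention_wait_s stands in for the time spent waiting for machine
    or WAN resources, which a simulator would compute itself.
    """
    transfer_s = size_bytes / link.bandwidth_bytes_per_s
    return link.latency_s + contention_wait_s + transfer_s + link.flight_time_s


# Remote link with BW 70 KB/s and flight time 10 ms, as in the RNAfold
# test later in the deck. 1 KB = 1000 bytes and the 1 ms latency are
# assumptions for this example.
remote = Link(latency_s=0.001, bandwidth_bytes_per_s=70_000, flight_time_s=0.010)
t = p2p_time(7_000, remote)
print(f"{t:.3f} s")
```

With these numbers the transfer term (0.1 s) dominates; for small messages the flight time would dominate instead.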

18
Communication model: collective operations
  • Four phases: external and internal phases

[Diagram labels: Machine 1, Machine 2]
19
Communication model: collective operations

20
Communication model: collective operations
  • Example of a collective operations configuration
    file

21
[Diagram labels: L1; L2; L, BW; BW; L3]
22
Outline
  • EU Project Damien Overview
  • Dimemas Overview
  • Communication model
  • Point to point
  • Collective operations
  • Examples
  • Validation of collective operations
  • RNAfold
  • Qualitative analysis
  • Summary

23
Example of validation: collective operations
  • Benchmarks: PMB (Pallas MPI Benchmarks)
  • Allgather, Allreduce, Alltoall, Barrier, and
    Bcast
  • Communication size from 0 to 512 Kbytes
  • 50 iterations per size (except Barrier, 1000
    iterations, and Alltoall, 10 iterations)
  • Goal of the experiment: identify a set of
    parameters for a given target configuration

24
Example of validation: collective operations
  • Methodology
  • Local execution of the benchmark (8 processors,
    IBM-SP2).
  • Result: tracefile obtained with mpidtrace
  • Execution of the benchmark in a mini-GRID
  • 2 processors on an IBM-SP
  • 2 processors on an IBM Power4
  • 4 processors on an SGI O2000
  • Reference execution time measured after
    MPI_Init and before MPI_Finalize
  • Result: set of time measurements for each of the
    benchmarks.
  • Execution of hundreds of Dimemas simulations,
    varying
  • flight time
  • latency
  • bandwidth
  • Result: range of values that fit the measured
    executions.
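
The last step of the methodology, sweeping flight time, latency and bandwidth and keeping the combinations that fit the measurements, might look like the sketch below. The function `simulate` is a toy stand-in for a Dimemas run, not the simulator itself, and the 5% tolerance, message size and parameter values are assumptions chosen for the example.

```python
from itertools import product


def simulate(flight_s, latency_s, bw_bytes_per_s, size_bytes=64_000, n_msgs=50):
    """Toy stand-in for one simulation: n_msgs sequential messages."""
    return n_msgs * (latency_s + flight_s + size_bytes / bw_bytes_per_s)


def fitting_parameters(measured_s, tolerance, flights, latencies, bandwidths):
    """Return every (flight, latency, bandwidth) whose prediction is
    within the given relative tolerance of the measured time."""
    fits = []
    for flight, latency, bw in product(flights, latencies, bandwidths):
        predicted = simulate(flight, latency, bw)
        if abs(predicted - measured_s) <= tolerance * measured_s:
            fits.append((flight, latency, bw))
    return fits


# Pretend this value was measured on the mini-GRID (hypothetical number).
measured = 3.75
fits = fitting_parameters(
    measured,
    tolerance=0.05,
    flights=[0.001, 0.01, 0.05],
    latencies=[0.0001, 0.001],
    bandwidths=[500_000, 1_000_000, 2_000_000],
)
for flight, latency, bw in fits:
    print(f"flight={flight} s  latency={latency} s  bw={bw} B/s")
```

Several combinations usually fit a single measurement, which is why the result is a range of values rather than one parameter set.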

25
Example of validation: collective operations
  • Variable ranges

26
Example of validation: collective operations

27
Example of validation: collective operations
Allgather (similar for Alltoall, Allreduce, Bcast)

28
Example of validation: collective operations
Barrier

29
Example of validation: collective operations
  • External globalop 0 LIN 2MAX LIN 2MAX (Barrier)
  • External globalop 1 LIN MEAN 0 MIN (BCast)
  • External globalop 2 LIN MEAN 0 MAX
  • External globalop 3 LIN MEAN 0 MAX
  • External globalop 4 LIN MEAN 0 MAX
  • External globalop 5 LIN MEAN 0 MAX
  • External globalop 6 LIN MIN LIN MEAN (Allgather)
  • External globalop 7 LIN MEAN LIN MEAN
  • External globalop 8 LIN 2MAX LIN 2MAX (Alltoall)
  • External globalop 9 LIN MEAN LIN MAX
  • External globalop 10 LIN 2MAX 0 MAX
  • External globalop 11 LIN MIN LIN MIN (Allreduce)
  • External globalop 12 LIN 2MAX LIN MIN
  • External globalop 13 LIN MAX LIN MAX
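
The configuration lines above can be read mechanically, as in the sketch below. The field meanings are assumptions, not taken from the slides: the four tokens after the operation id are read as (model, size) for one phase followed by (model, size) for the other, `LIN` is read as a factor equal to the number of participants, `0` as a phase that costs nothing, and the size keywords as rules applied to the send and receive sizes. All function and field names are hypothetical.

```python
from typing import NamedTuple


class GlobalOp(NamedTuple):
    op_id: int
    model_in: str   # assumed: model of the first phase ("0" or "LIN" here)
    size_in: str    # assumed: size rule of the first phase
    model_out: str  # assumed: model of the second phase
    size_out: str   # assumed: size rule of the second phase


def parse_globalop(line: str) -> GlobalOp:
    """Parse 'External globalop <id> <model> <size> <model> <size> (Name)'."""
    body = line.split("(")[0]  # drop the trailing "(Barrier)"-style label
    fields = body.split()
    if len(fields) != 7 or fields[:2] != ["External", "globalop"]:
        raise ValueError(f"unexpected globalop line: {line!r}")
    return GlobalOp(int(fields[2]), *fields[3:])


def pick_size(rule: str, send_bytes: float, recv_bytes: float) -> float:
    """Assumed meaning of the size keywords used on the slide."""
    sizes = {
        "MIN": min(send_bytes, recv_bytes),
        "MAX": max(send_bytes, recv_bytes),
        "MEAN": (send_bytes + recv_bytes) / 2,
        "2MAX": 2 * max(send_bytes, recv_bytes),
    }
    return sizes[rule]


def phase_time(model, size_bytes, n_participants, latency_s, bw_bytes_per_s):
    """Assumed cost of one phase: factor * (latency + size / bandwidth)."""
    factors = {"0": 0.0, "LIN": float(n_participants)}
    return factors[model] * (latency_s + size_bytes / bw_bytes_per_s)


op = parse_globalop("External globalop 6 LIN MIN LIN MEAN (Allgather)")
print(op)
```

Only the two models that appear on the slide (`0` and `LIN`) are handled; any other keyword raises a `KeyError` in this sketch.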

30
Application RNAfold
  • RNAfold computes the secondary structure of
    minimal free energy of long RNA sequences.
  • Derived from the Vienna-RNA package of Ivo
    Hofäcker.
  • Tightly coupled MPI-parallelized version.
  • Version was improved for HPC Challenge 2002
  • Include newest free energy parameters
  • Better communication pattern
  • Integration into Virtual Environment

31
Application - RNAfold
  • Machines involved
  • Cray T3E/900 at HLRS
  • IBM SP-3 at CEPBA
  • SGI O2000 at CEPBA
  • Application
  • RNAfold
  • Configurations
  • 4+4 processors
  • 6+6 processors
  • 14+14 processors

BW, Flight time
32
Application - RNAfold
Yeast RNA, 4+4 processors

33
Test 6000, 14+14 processors, BW 70 KB/s, flight
time 10 ms

34
Qualitative analysis: Uranus
  • 64 processes
  • 4 16-way SMPs
  • Flight times
  • 0, 1, 10 and 50 ms

35
Qualitative analysis: Linpack
  • 256 processes
  • 16 16-way SMPs
  • BW (MB/s) / Flight time (ms)
  • 50 / 1
  • 100 / 1
  • 200 / 1
  • 200 / 0.1
  • 200 / 0.01
  • 500 / 0.01
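
To see why both parameters are varied, the sketch below compares the transfer time of a single message against the flight time for each configuration listed above. The 64,000-byte message size and the convention 1 MB = 10^6 bytes are assumptions for illustration; the slides do not state Linpack's message sizes.

```python
# (bandwidth in MB/s, flight time in ms), taken from the list above.
CONFIGS = [(50, 1), (100, 1), (200, 1), (200, 0.1), (200, 0.01), (500, 0.01)]

MESSAGE_BYTES = 64_000  # hypothetical message size, not from the slides


def breakdown(bw_mb_s, flight_ms, size_bytes=MESSAGE_BYTES):
    """Return (transfer time in ms, name of the larger cost term)."""
    transfer_ms = size_bytes / (bw_mb_s * 1e6) * 1e3
    dominant = "bandwidth" if transfer_ms > flight_ms else "flight time"
    return transfer_ms, dominant


rows = [(bw, ft, *breakdown(bw, ft)) for bw, ft in CONFIGS]
for bw, ft, transfer_ms, dominant in rows:
    print(f"{bw:>4} MB/s, {ft:>5} ms flight: "
          f"transfer {transfer_ms:.3f} ms, dominated by {dominant}")
```

For this message size, raising the bandwidth beyond 100 MB/s helps little while the flight time stays at 1 ms, which is the kind of effect the exploration is meant to expose.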

36
Qualitative analysis: exploring the response surface
  • Linpack
  • 256 processes
  • 16 16-way SMPs
  • Flight time and bandwidth exploration

37
Outline
  • EU Project Damien Overview
  • Dimemas Overview
  • Communication model
  • Point to point
  • Collective operations
  • Examples
  • Validation of collective operations
  • RNAfold
  • Qualitative analysis
  • Summary

38
Summary
  • The performance analysis of parallel programs is
    a must, but
  • Executing production runs for the sole purpose
    of performance analysis may be very expensive.
  • It is even worse if the target architecture is a
    computational Grid instead of a parallel machine!
  • Dimemas is a valuable tool, since
  • It predicts the execution of MPI programs
  • Without requiring the use of the target platform.
  • It helps the development of MPI applications by
    showing the communication bottlenecks when
    running on the Grid, the impact of contention on
    the network, ...
  • It can also be used before production to select
    the optimum Grid configuration of the target
    architecture (which machines to use, how many
    processors in each machine, ...)

39
Summary
  • Communication model
  • Simple and fast
  • Current version is able to predict for
    computational Grids
  • Very easy to use. It has a Java-based graphical
    user interface. A user manual exists to help
    users with the parameter setting process.
  • The Grid version of Dimemas was built from the
    initial version, a performance prediction tool
    for parallel platforms.
  • Utility of Dimemas has been demonstrated inside
    project DAMIEN with several applications

40
Summary
  • Dimemas predicts for Grid architectures, but does
    not really run on the Grid, so we do not run into
    many grid problems.
  • Main problem was how to model the MPI
    communication through the Grid.
  • Initial modeling, tuning and then reformulation
    for those aspects that do not fit reality.