ParCo 2003 Presentation

Delivering High Performance to Parallel Applications Using Advanced Scheduling. Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris

1
Delivering High Performance to Parallel Applications Using Advanced Scheduling
Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
{ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
2
Overview
  • Introduction
  • Background
  • Code Generation
  • Computation/Data Distribution
  • Communication Schemes
  • Summary
  • Experimental Results
  • Conclusions & Future Work

3
Introduction
  • Motivation
  • A lot of theoretical work has been done on
    arbitrary tiling but there are no actual
    experimental results!
  • There is no complete method to generate code for
    non-rectangular tiles

4
Introduction
  • Contribution
  • Complete end-to-end SPMD code generation method
    for arbitrarily tiled iteration spaces
  • Simulation of blocking and non-blocking
    communication primitives
  • Experimental evaluation of proposed scheduling
    scheme

5
Overview
  • Introduction
  • Background
  • Code Generation
  • Computation/Data Distribution
  • Communication Schemes
  • Summary
  • Experimental Results
  • Conclusions & Future Work

6
Background
  • Algorithmic Model
      FOR j1 = min1 TO max1 DO
        ...
        FOR jn = minn TO maxn DO
          Computation(j1, ..., jn)
        ENDFOR
        ...
      ENDFOR
  • Perfectly nested loops
  • Constant flow data dependencies (D)

7
Background
  • Tiling
  • Popular loop transformation
  • Groups iterations into atomic units
  • Enhances locality in uniprocessors
  • Enables coarse-grain parallelism in distributed
    memory systems
  • Valid tiling matrix H
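
To make "valid tiling matrix H" concrete, here is a minimal sketch. It assumes the usual legality condition from the tiling literature (every component of H·d is non-negative for each dependence vector d) and the usual tile index floor(H·j). The dependence vectors and the three matrices are hypothetical illustrations, not the ones shown on the slides.

```python
from fractions import Fraction
from math import floor

def mat_vec(H, v):
    """Multiply matrix H (a list of rows) by vector v."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def is_valid_tiling(H, deps):
    """Legality test: every component of H*d must be >= 0 for each dependence d."""
    return all(c >= 0 for d in deps for c in mat_vec(H, d))

def tile_of(H, j):
    """Tile coordinates of iteration point j: floor(H*j), component-wise."""
    return tuple(floor(c) for c in mat_vec(H, j))

# Hypothetical 2D dependence vectors (not taken from the slides).
DEPS = [(1, 0), (1, 1)]

third = Fraction(1, 3)
H_RECT = [[third, 0], [0, third]]        # 3x3 rectangular tiles
H_SKEW = [[third, -third], [0, third]]   # parallelogram tiles, sides along DEPS
H_BAD = [[-third, 0], [0, third]]        # reverses j1, so it violates (1, 0)
```

Exact fractions are used so that the floor operation is not affected by floating-point rounding at tile boundaries.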

8
Tiling Transformation
Example
    FOR j1 = 0 TO 11 DO
      FOR j2 = 0 TO 8 DO
        A[j1,j2] = A[j1-1,j2] + A[j1-1,j2-1]
      ENDFOR
    ENDFOR
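
As a rough illustration of what tiling does to this example, the sketch below runs the loop nest once in its original order and once tile by tile, and the two results can be compared. It assumes that the loop body adds the two neighbouring elements, that the tiles are 3x3 rectangles, and that reads outside the array return 1. None of these values is confirmed by the transcript; the function names are made up for the example.

```python
def get(A, j1, j2):
    """Read A[j1][j2]; out-of-range reads return 1 (an assumed boundary value)."""
    if j1 < 0 or j2 < 0:
        return 1
    return A[j1][j2]

def run_original(n1=12, n2=9):
    """Execute the loop nest in its original, row-by-row order."""
    A = [[0] * n2 for _ in range(n1)]
    for j1 in range(n1):
        for j2 in range(n2):
            A[j1][j2] = get(A, j1 - 1, j2) + get(A, j1 - 1, j2 - 1)
    return A

def run_tiled(n1=12, n2=9, t1=3, t2=3):
    """Execute the same iterations grouped into t1 x t2 rectangular tiles."""
    A = [[0] * n2 for _ in range(n1)]
    for b1 in range(0, n1, t1):                        # loops over tiles
        for b2 in range(0, n2, t2):
            for j1 in range(b1, min(b1 + t1, n1)):     # loops inside one tile
                for j2 in range(b2, min(b2 + t2, n2)):
                    A[j1][j2] = get(A, j1 - 1, j2) + get(A, j1 - 1, j2 - 1)
    return A
```

Both orders give the same array because every value a tile reads comes either from the same tile or from a tile that was executed earlier, which is what makes each tile an atomic unit.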
9
Rectangular Tiling Transformation
10
Non-rectangular Tiling Transformation
11
Why Non-rectangular Tiling?
  • Reduces communication
    (rectangular tiling: 8 communication points,
    non-rectangular tiling: 6 communication points)
  • Enables more efficient scheduling schemes
    (rectangular tiling: 6 time steps,
    non-rectangular tiling: 5 time steps)
12
Overview
  • Introduction
  • Background
  • Code Generation
  • Computation/Data Distribution
  • Communication Schemes
  • Summary
  • Experimental Results
  • Conclusions & Future Work

13
Computation Distribution
  • We map tiles along the longest dimension to the
    same processor because:
  • It reduces the number of processors required
  • It simplifies message-passing
  • It reduces total execution times when
    overlapping computation with communication
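
The mapping rule above can be sketched as follows. This is only an illustration under stated assumptions: the tile space is taken to be a rectangular box, and the function name `assign_tiles` and the 4 x 3 tile space are hypothetical.

```python
from itertools import product

def assign_tiles(tile_extents):
    """Group tiles into chains along the longest tile-space dimension.

    Returns (m, chains), where m is the mapping dimension and chains maps a
    processor id (the tile coordinates without dimension m) to its tiles.
    """
    m = max(range(len(tile_extents)), key=lambda d: tile_extents[d])
    chains = {}
    for tile in product(*(range(e) for e in tile_extents)):
        pid = tile[:m] + tile[m + 1:]      # drop the mapping dimension
        chains.setdefault(pid, []).append(tile)
    return m, chains

# Hypothetical tile space: 4 tiles along j1, 3 tiles along j2.
m, chains = assign_tiles((4, 3))
```

In this example the 12 tiles need only 3 processors, each executing a chain of 4 tiles, and no messages are needed along the mapping dimension because consecutive tiles of a chain live on the same processor.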

14
Computation Distribution
15
Data Distribution
  • Computer-owns rule: each processor owns the data
    it computes
  • Arbitrary convex iteration space, arbitrary
    tiling
  • Rectangular local iteration and data spaces

16
Data Distribution
17
Data Distribution
18
Data Distribution
19
Overview
  • Introduction
  • Background
  • Code Generation
  • Computation/Data Distribution
  • Communication Schemes
  • Summary
  • Experimental Results
  • Conclusions & Future Work

20
Communication Schemes
  • With whom do I communicate?

21
Communication Schemes
  • With whom do I communicate?

22
Communication Schemes
  • What do I send?

23
Blocking Scheme
(Figure: tiles in the (j1, j2) plane scheduled on processors P1, P2, P3; 12 time steps)
24
Non-blocking Scheme
(Figure: tiles in the (j1, j2) plane scheduled on processors P1, P2, P3; 6 time steps)
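
The effect of overlapping computation with communication can be illustrated with a toy cost model. This is not the simulation used by the authors and it does not reproduce the step counts on the slides. It assumes a linear pipeline of processors, one message per tile to the next processor, fixed costs `t_comp` and `t_comm`, and transfers that do not delay each other. All names are hypothetical.

```python
def makespan(num_procs, tiles_per_proc, t_comp, t_comm, overlap):
    """Finish time of a pipeline of processors, each running a chain of tiles.

    Tile k on processor p needs tile k-1 on p and the data that processor
    p-1 produced for tile k. With overlap=False the sender stays busy for
    t_comm after each tile; with overlap=True the transfer runs in the
    background while the sender computes its next tile.
    """
    free = [0.0] * num_procs            # time at which each processor is idle
    arrival = [0.0] * tiles_per_proc    # when the input of tile k is available
    for p in range(num_procs):
        next_arrival = []
        last = p == num_procs - 1       # the last processor sends nothing
        for k in range(tiles_per_proc):
            start = max(free[p], arrival[k])
            comp_end = start + t_comp
            sent = comp_end if last else comp_end + t_comm
            free[p] = comp_end if (overlap or last) else sent
            next_arrival.append(sent)
        arrival = next_arrival
    return max(free)

blocking_time = makespan(3, 4, t_comp=1, t_comm=1, overlap=False)
overlapped_time = makespan(3, 4, t_comp=1, t_comm=1, overlap=True)
```

With these made-up costs the overlapped schedule finishes earlier because a processor starts its next tile while the previous result is still in transit.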
25
Overview
  • Introduction
  • Background
  • Code Generation
  • Computation/Data Distribution
  • Communication Schemes
  • Summary
  • Experimental Results
  • Conclusions & Future Work

26
Code Generation Summary
Parallelization
Tiling
  • Computation Distribution
  • Data Distribution
  • Communication Primitives

Dependence Analysis
Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme
27
Code Summary: Blocking Scheme
28
Code Summary: Non-blocking Scheme
29
Overview
  • Introduction
  • Background
  • Code Generation
  • Computation/Data Distribution
  • Communication Schemes
  • Summary
  • Experimental Results
  • Conclusions & Future Work

30
Experimental Results
  • 8-node SMP Linux Cluster (800 MHz PIII, 128 MB
    RAM, kernel 2.4.20)
  • MPICH v.1.2.5 (--with-device=p4,
    --with-comm=shared)
  • g++ compiler v.2.95.4 (-O3)
  • FastEthernet interconnection
  • 2 micro-kernel benchmarks (3D)
  • Gauss Successive Over-Relaxation (SOR)
  • Texture Smoothing Code (TSC)
  • Simulation of communication schemes

31
SOR
  • Iteration space M x N x N
  • Dependence matrix
  • Rectangular Tiling
  • Non-rectangular Tiling
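
For readers unfamiliar with the benchmark, the sketch below shows a textbook Gauss-Seidel SOR sweep for Laplace's equation. It is an assumption that the benchmark kernel has this general shape. The transcript does not show the authors' code or dependence matrix, and the grid size, boundary values and `omega` are made up for the example.

```python
def sor(grid, sweeps, omega=1.5):
    """In-place SOR sweeps for Laplace's equation on a square grid.

    Boundary cells are kept fixed; each interior cell is relaxed towards the
    average of its four neighbours. Cells above and to the left were already
    updated in the current sweep, which creates the loop-carried dependencies.
    """
    n = len(grid)
    for _ in range(sweeps):                 # sweep dimension
        for i in range(1, n - 1):           # grid rows
            for j in range(1, n - 1):       # grid columns
                avg = (grid[i - 1][j] + grid[i][j - 1]
                       + grid[i + 1][j] + grid[i][j + 1]) / 4.0
                grid[i][j] = (1.0 - omega) * grid[i][j] + omega * avg
    return grid

# Hypothetical input: boundary held at 1.0, interior starting at 0.0.
N = 6
grid = [[1.0 if r in (0, N - 1) or c in (0, N - 1) else 0.0
         for c in range(N)] for r in range(N)]
sor(grid, sweeps=50)
```

The three nested loops form a 3D iteration space (sweeps by rows by columns), which is the kind of perfectly nested loop with constant dependencies that the method targets.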

32
SOR
33
SOR
34
TSC
  • Iteration space T x N x N
  • Dependence matrix
  • Rectangular Tiling
  • Non-rectangular Tiling

35
TSC
36
TSC
37
Overview
  • Introduction
  • Background
  • Code Generation
  • Computation/Data Distribution
  • Communication Schemes
  • Summary
  • Experimental Results
  • Conclusions & Future Work

38
Conclusions
  • Automatic code generation for arbitrarily tiled
    spaces can be efficient
  • High performance can be achieved by means of:
  • a suitable tiling transformation
  • overlapping computation with communication

39
Future Work
  • Application of methodology to imperfectly nested
    loops and non-constant dependencies
  • Investigation of hybrid programming models
    (MPI+OpenMP)
  • Performance evaluation on advanced
    interconnection networks (SCI, Myrinet)

40
Questions?
http://www.cslab.ece.ntua.gr/ndros