Title: ParCo 2003 Presentation
1. Delivering High Performance to Parallel Applications Using Advanced Scheduling
Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris
National Technical University of Athens
Computing Systems Laboratory
{ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
2. Overview
- Introduction
- Background
- Code Generation
  - Computation/Data Distribution
  - Communication Schemes
  - Summary
- Experimental Results
- Conclusions and Future Work
3. Introduction
- Motivation
  - A lot of theoretical work has been done on arbitrary tiling, but there are no actual experimental results!
  - There is no complete method to generate code for non-rectangular tiles
4. Introduction
- Contribution
  - Complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces
  - Simulation of blocking and non-blocking communication primitives
  - Experimental evaluation of the proposed scheduling scheme
6. Background
- Algorithmic Model
    FOR j1 = min1 TO max1 DO
      ...
        FOR jn = minn TO maxn DO
          Computation(j1, ..., jn)
        ENDFOR
      ...
    ENDFOR
- Perfectly nested loops
- Constant flow data dependencies (D)
7. Background
- Tiling
  - Popular loop transformation
  - Groups iterations into atomic units
  - Enhances locality in uniprocessors
  - Enables coarse-grain parallelism in distributed memory systems
  - Valid tiling matrix H
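A minimal sketch of how a tiling matrix H can be used, assuming the usual convention that iteration point j belongs to tile floor(H j) and that a tiling is legal when every entry of H d is non-negative for each dependence vector d. The matrices H_RECT, H_SKEW and H_BAD and the dependence set D are illustrative assumptions, not values taken from the slides.

```python
from fractions import Fraction as F
from math import floor

def is_valid_tiling(H, D):
    """True if every entry of H*d is non-negative for each dependence vector d."""
    return all(sum(h * x for h, x in zip(row, d)) >= 0 for row in H for d in D)

def tile_of(H, j):
    """Tile coordinates of iteration point j, computed as floor(H j)."""
    return tuple(floor(sum(h * x for h, x in zip(row, j))) for row in H)

# Hypothetical dependence vectors of a 2D loop nest
D = [(1, 0), (1, 1)]

# Hypothetical tiling matrices (rows are scaled normals to the tile sides)
H_RECT = [[F(1, 3), F(0)], [F(0), F(1, 3)]]      # 3x3 rectangular tiles
H_SKEW = [[F(1, 3), F(-1, 3)], [F(0), F(1, 3)]]  # parallelogram tiles
H_BAD = [[F(-1, 3), F(0)], [F(0), F(1, 3)]]      # violates H d >= 0

print(is_valid_tiling(H_RECT, D), is_valid_tiling(H_SKEW, D), is_valid_tiling(H_BAD, D))
print(tile_of(H_RECT, (4, 7)), tile_of(H_SKEW, (4, 7)))
```

Exact fractions are used so that points lying on a tile boundary are not misplaced by floating-point rounding.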
8. Tiling Transformation
Example:
FOR j1 = 0 TO 11 DO
  FOR j2 = 0 TO 8 DO
    A[j1,j2] = A[j1-1,j2] + A[j1-1,j2-1]
  ENDFOR
ENDFOR
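The example loop can be run in its original order and in a rectangularly tiled order to check that both produce the same array. This is only a sketch: the 3x3 tile size and the boundary value 1 for elements outside the iteration space are assumptions, since the slide gives neither.

```python
from collections import defaultdict

N1, N2 = 12, 9   # j1 = 0..11, j2 = 0..8, as in the example loop
T1, T2 = 3, 3    # tile sides: an assumption, not given on the slide

def fresh_array():
    # Assumed boundary data: every element outside the iteration space reads as 1
    return defaultdict(lambda: 1)

def interior(A):
    # Keep only the elements that belong to the iteration space
    return {(a, b): A[a, b] for a in range(N1) for b in range(N2)}

def run_original():
    A = fresh_array()
    for j1 in range(N1):
        for j2 in range(N2):
            A[j1, j2] = A[j1 - 1, j2] + A[j1 - 1, j2 - 1]
    return interior(A)

def run_tiled():
    A = fresh_array()
    for t1 in range(0, N1, T1):                      # loops over tile origins
        for t2 in range(0, N2, T2):
            for j1 in range(t1, min(t1 + T1, N1)):   # loops inside one tile
                for j2 in range(t2, min(t2 + T2, N2)):
                    A[j1, j2] = A[j1 - 1, j2] + A[j1 - 1, j2 - 1]
    return interior(A)

print(run_original() == run_tiled())
```

The tiled order is legal here because both dependence vectors, (1,0) and (1,1), have non-negative components, so no tile ever needs data from a tile that is executed later.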
9. Rectangular Tiling Transformation
10. Non-rectangular Tiling Transformation
11. Why Non-rectangular Tiling?
- Rectangular tiling: 8 communication points; non-rectangular tiling: 6 communication points
- Enables more efficient scheduling schemes
- Rectangular tiling: 6 time steps; non-rectangular tiling: 5 time steps
13. Computation Distribution
- We map tiles along the longest dimension to the same processor because:
  - It reduces the number of processors required
  - It simplifies message-passing
  - It reduces total execution time when overlapping computation with communication
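The mapping rule above can be sketched as follows. The sketch assumes a rectangular tile space described only by its extents (an arbitrary tiling generally gives a non-rectangular tile space), and the function name distribute is hypothetical.

```python
from itertools import product
from collections import defaultdict

def distribute(tile_extents):
    """Assign each tile to a processor by dropping the longest tile-space dimension."""
    # Index of the longest dimension (the first one wins on a tie)
    longest = max(range(len(tile_extents)), key=lambda d: tile_extents[d])
    chains = defaultdict(list)
    for tile in product(*(range(n) for n in tile_extents)):
        # Processor id = tile coordinates without the longest dimension
        pid = tuple(c for d, c in enumerate(tile) if d != longest)
        chains[pid].append(tile)
    return longest, dict(chains)

# Hypothetical 4 x 3 tile space
longest, chains = distribute((4, 3))
print(longest, len(chains), chains[(1,)])
```

With a 4 x 3 tile space, the 12 tiles are spread over 3 processors, and each processor executes its 4 tiles one after the other along the longest dimension.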
14. Computation Distribution
15. Data Distribution
- Computer-owns rule: each processor owns the data it computes
- Arbitrary convex iteration space, arbitrary tiling
- Rectangular local iteration and data spaces
16. Data Distribution
17. Data Distribution
18. Data Distribution
20. Communication Schemes
- With whom do I communicate?
21. Communication Schemes
- With whom do I communicate?
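One way to answer this question is to derive tile-to-tile dependencies from the loop dependencies and then drop the dimension along which tiles are mapped to the same processor. The sketch below does this by brute force over a small iteration space; the tiling matrix H, the dependence set D, the 12 x 9 space and the choice of mapping dimension are all illustrative assumptions, and the brute-force search is not necessarily how the generated code determines its neighbours.

```python
from fractions import Fraction
from math import floor
from itertools import product

def tile_of(H, j):
    """Tile coordinates of iteration point j, computed as floor(H j)."""
    return tuple(floor(sum(h * x for h, x in zip(row, j))) for row in H)

def tile_dependencies(H, D, points):
    """Distinct non-zero tile offsets crossed by any dependence inside the space."""
    points = set(points)
    deps = set()
    for j in points:
        for d in D:
            target = tuple(a + b for a, b in zip(j, d))
            if target not in points:
                continue  # the dependence leaves the iteration space
            diff = tuple(t - s for s, t in zip(tile_of(H, j), tile_of(H, target)))
            if any(diff):
                deps.add(diff)
    return deps

def processor_dependencies(tile_deps, mapping_dim):
    """Project tile dependencies onto processor space by dropping mapping_dim."""
    procs = set()
    for td in tile_deps:
        pd = tuple(c for k, c in enumerate(td) if k != mapping_dim)
        if any(pd):
            procs.add(pd)  # a non-zero offset means a message to another processor
    return procs

H = [[Fraction(1, 3), 0], [0, Fraction(1, 3)]]   # hypothetical 3x3 rectangular tiling
D = [(1, 0), (1, 1)]                             # hypothetical dependence vectors
SPACE = list(product(range(12), range(9)))
TILE_DEPS = tile_dependencies(H, D, SPACE)
PROC_DEPS = processor_dependencies(TILE_DEPS, mapping_dim=0)
print(sorted(TILE_DEPS), sorted(PROC_DEPS))
```

Under these assumptions every processor only sends to the next processor along the second dimension; tile dependencies that lie purely along the mapping dimension stay local and need no message.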
22. Communication Schemes
23. Blocking Scheme
[Figure: axes j1 and j2, processors P1 to P3; 12 time steps]
24. Non-blocking Scheme
[Figure: axes j1 and j2, processors P1 to P3; 6 time steps]
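The effect of overlapping computation with communication can be illustrated with a toy timing model. It assumes a 2D tile space in which tile k on processor p needs tile k on processor p-1 and tile k-1 on the same processor, that a blocking send keeps the sender's CPU busy for t_comm, and that a non-blocking send costs the CPU nothing. The function name makespan and all parameters are hypothetical, and the model is not meant to reproduce the step counts shown on the two slides.

```python
def makespan(num_procs, tiles_per_proc, t_comp, t_comm, overlap):
    """Completion time of a pipelined tile schedule under a simple cost model."""
    # ready[k]: time at which the input for tile k arrives from the previous processor
    ready = [0.0] * tiles_per_proc
    finish = 0.0
    for p in range(num_procs):
        cpu_free = 0.0
        arrivals = []
        is_last = (p == num_procs - 1)   # the last processor has nobody to send to
        for k in range(tiles_per_proc):
            start = max(cpu_free, ready[k])
            end = start + t_comp
            sent = end if is_last else end + t_comm
            # Blocking: the CPU waits for the send; non-blocking: it moves on
            cpu_free = end if (overlap or is_last) else sent
            arrivals.append(sent)
            finish = max(finish, end)
        ready = arrivals
    return finish

blocking = makespan(3, 4, t_comp=1, t_comm=1, overlap=False)
non_blocking = makespan(3, 4, t_comp=1, t_comm=1, overlap=True)
print(blocking, non_blocking)
```

For 3 processors with 4 tiles each and unit computation and communication costs, this model gives 11 time units for the blocking scheme and 8 for the non-blocking one; with zero communication cost both reduce to the same pipeline length.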
26. Code Generation Summary
Parallelization
Tiling
- Computation Distribution
- Data Distribution
- Communication Primitives
Dependence Analysis
Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme
27. Code Summary: Blocking Scheme
28. Code Summary: Non-blocking Scheme
30. Experimental Results
- 8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
- MPICH v.1.2.5 (--with-device=p4, --with-comm=shared)
- g++ compiler v.2.95.4 (-O3)
- FastEthernet interconnection
- 2 micro-kernel benchmarks (3D):
  - Gauss Successive Over-Relaxation (SOR)
  - Texture Smoothing Code (TSC)
- Simulation of communication schemes
31. SOR
- Iteration space M x N x N
- Dependence matrix
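As a rough illustration of a kernel with an M x N x N iteration space, the sketch below uses the textbook SOR update for the 2D Laplace equation, with the time step kept as an explicit array dimension. The kernel that was actually benchmarked may differ; the boundary values, the relaxation factor omega and the list SOR_DEPS (read off this sketch's array references) are assumptions.

```python
def sor(M, N, omega=1.5):
    """Run M SOR sweeps over an N x N interior grid; returns the last time plane."""
    # A[t][i][j]: plane t = 0 is the initial guess, rows/columns 0 and N+1 are boundary
    A = [[[0.0] * (N + 2) for _ in range(N + 2)] for _ in range(M + 1)]
    for t in range(M + 1):
        for j in range(N + 2):
            A[t][0][j] = 1.0   # assumed boundary condition: top edge held at 1
    for t in range(1, M + 1):
        for i in range(1, N + 1):
            for j in range(1, N + 1):
                A[t][i][j] = (omega / 4.0) * (
                    A[t][i - 1][j] + A[t][i][j - 1]            # already updated in sweep t
                    + A[t - 1][i + 1][j] + A[t - 1][i][j + 1]  # values from sweep t-1
                ) + (1.0 - omega) * A[t - 1][i][j]
    return A[M]

# Dependence vectors (t, i, j) implied by the array references of this sketch
SOR_DEPS = [(0, 1, 0), (0, 0, 1), (1, -1, 0), (1, 0, -1), (1, 0, 0)]

grid = sor(50, 4, omega=1.0)
print(round(grid[1][1], 4), round(grid[1][4], 4))
```

The negative components in (1,-1,0) and (1,0,-1) mean that a rectangular tiling of the original space is not legal for this sketch, which is the kind of situation where a non-rectangular (or skewed) tiling is needed.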
32. SOR
33. SOR
34. TSC
- Iteration space T x N x N
- Dependence matrix
35. TSC
36. TSC
38. Conclusions
- Automatic code generation for arbitrarily tiled spaces can be efficient
- High performance can be achieved by means of
  - a suitable tiling transformation
  - overlapping computation with communication
39. Future Work
- Application of the methodology to imperfectly nested loops and non-constant dependencies
- Investigation of hybrid programming models (MPI+OpenMP)
- Performance evaluation on advanced interconnection networks (SCI, Myrinet)
40. Questions?
http://www.cslab.ece.ntua.gr/ndros