Title: High Performance LU Factorization for Non-dedicated Clusters


1
High Performance LU Factorization for
Non-dedicated Clusters
and the future Grid
  • Toshio Endo, Kenji Kaneda,
  • Kenjiro Taura, Akinori Yonezawa
  • (University of Tokyo)

2
Background
  • Computing nodes on clusters/the Grid are shared by
    multiple applications
  • To obtain good performance, HPC applications
    must cope with:
  • Background processes
  • Dynamically changing sets of available nodes
  • Large latencies on the Grid

3
Performance limiting factor: background processes
  • Other processes may run in the background
  • Network daemons, interactive shells, etc.
  • Many typical applications are written in a
    synchronous style
  • In such applications, the delay of a single node
    degrades the overall performance

4
Performance limiting factor: large latencies on
the Grid
  • In future Grid environments, bandwidth will
    accommodate HPC applications
  • Large latencies (> 100 ms) will remain an obstacle
  • Synchronous applications suffer from large
    latencies

5
Available nodes change dynamically
  • Many HPC applications assume that computing
    nodes are fixed
  • If applications support dynamically changing
    nodes, we can harness computing resources more
    efficiently!

6
Goal of this work
  • An LU factorization algorithm that
  • Tolerates background processes and large latencies
  • Supports dynamically changing nodes

A fast HPC application for non-dedicated clusters
and the Grid
7
Outline of this talk
  • The Phoenix model
  • Our LU Algorithm
  • Overlapping multiple iterations
  • Data mapping for dynamically changing nodes
  • Performance of our LU and HPL
  • Related work
  • Summary

8
Phoenix model [Taura et al. '03]
  • A message passing model for dynamically changing
    environments
  • Concept of virtual nodes
  • Virtual nodes as destinations of messages

[Figure: virtual nodes mapped onto physical nodes]
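A minimal sketch of the virtual-node idea in C. The lookup table and
transport call below are illustrative assumptions, not the actual
Phoenix API:

  /* Messages are addressed to virtual nodes; a lookup table maps each
     virtual node to the physical node that currently owns it. */
  #define NUM_VIRTUAL 1024
  int owner_of[NUM_VIRTUAL];        /* virtual node -> physical node */

  extern void raw_send(int phys, const void *msg, int len);  /* assumed transport */

  void vsend(int vnode, const void *msg, int len) {
      raw_send(owner_of[vnode], msg, len);  /* resolve the current owner */
  }

  /* When nodes join or leave, only owner_of[] changes; senders keep
     addressing the same virtual-node IDs. */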
9
Overview of our LU
  • Like typical implementations,
  • Based on message passing
  • The matrix is decomposed into small blocks
  • A block is updated by its owner node
  • Unlike typical implementations,
  • Asynchronous data-driven style for overlapping
    multiple iterations
  • Cyclic-like data mapping for a dynamically
    changing number of nodes
  • (Currently, pivoting is not performed)

10
LU factorization
for (k = 0; k < B; k++)
  A[k][k] = fact(A[k][k])
  for (i = k+1; i < B; i++)
    A[i][k] = update_L(A[i][k], A[k][k])
  for (j = k+1; j < B; j++)
    A[k][j] = update_U(A[k][j], A[k][k])
  for (i = k+1; i < B; i++)
    for (j = k+1; j < B; j++)
      A[i][j] = A[i][j] - A[i][k] * A[k][j]

[Matrix figure: diagonal block, L part, U part, trail part]
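As a concrete point of reference, here is a minimal C sketch of what
fact() does to a single b x b diagonal block: an in-place LU
factorization without pivoting (the slides note that pivoting is
currently not performed). An illustration, not the talk's actual code:

  /* In-place LU of one b x b block, row-major, no pivoting.
     After the loops, the unit-lower L and upper U share A's storage. */
  void fact(double *A, int b) {
      for (int k = 0; k < b; k++) {
          for (int i = k + 1; i < b; i++) {
              A[i*b + k] /= A[k*b + k];                   /* column of L */
              for (int j = k + 1; j < b; j++)
                  A[i*b + j] -= A[i*b + k] * A[k*b + j];  /* trailing update */
          }
      }
  }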
11
Naïve implementation and its problem
[Timeline figure: # of executable tasks vs. time; the k-th, (k+1)-th,
and (k+2)-th iterations (diagonal, L, U, trail tasks) do not overlap]
  • Iterations are separated
  • Not tolerant to latencies/background processes!

12
Latency Hiding Techniques
  • Overlapping iterations hides latencies
  • Computation of the diagonal/L/U parts is advanced
  • If computations of the trail parts are kept separate,
    only two adjacent iterations are overlapped

There is room for further improvement
13
Overlapping multiple iterations for more tolerance
  • We overlap multiple iterations
  • by computing all blocks, including trail parts,
    asynchronously
  • A data-driven style and prioritized task scheduling
    are used (see the sketch below)
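A schematic C sketch of such a data-driven task loop. The queue and
helper functions are hypothetical reconstructions from the bullets
above; only the overall shape is implied by the slides:

  /* A block update becomes "ready" once the blocks it depends on have
     arrived, so updates from several iterations can be in flight at
     once. 'ready' is a priority queue ordered by the priority defined
     on the next slide. */
  while (!all_updates_done()) {
      task_t t = pop_highest_priority(&ready);  /* hypothetical queue op */
      execute_update(&t);         /* fact / update_L / update_U / trail */
      enqueue_newly_ready(&ready, &t);  /* dependents whose inputs arrived */
  }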

14
Prioritized task scheduling
  • We assign a priority to the updating task of each
    block
  • The k-th update of block A[i][j] has a priority of
  • min(i - S, j - S, k)  (a smaller number is higher)
  • where S is the desired overlap depth
  • We can control overlapping by changing the value
    of S (see the sketch below)
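The priority formula transcribed directly into C (a one-to-one
rendering of the slide's min(i-S, j-S, k)):

  /* Priority of the k-th update of block A[i][j]; a smaller value
     means higher priority. S is the desired overlap depth. */
  int task_priority(int i, int j, int k, int S) {
      int m = (i - S < j - S) ? (i - S) : (j - S);
      return (m < k) ? m : k;   /* min(i - S, j - S, k) */
  }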

15
Typical data mapping and its problem
  • Two-dimensional block-cyclic distribution

[Figure: matrix distributed in 2D block-cyclic fashion]
  • Good load balance and small communication, but
  • The number of nodes must be fixed and factorable
    into two small numbers (see the owner sketch below)
  • How to support dynamically changing nodes?
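For contrast, the textbook 2D block-cyclic owner computation (the
standard formula, not code from the talk): on a fixed P x Q process
grid, block (i, j) belongs to process row i mod P and process column
j mod Q, which is exactly why the node count must factor into P x Q:

  int block_owner(int i, int j, int P, int Q) {
      return (i % P) * Q + (j % Q);   /* linear rank of the owning node */
  }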

16
Our data mapping for dynamically changing nodes
[Figure: the original matrix and its permuted block mapping]
  • The permutation is common among all nodes (see the
    sketch below)
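A sketch of a cyclic-like mapping built from a shared permutation. The
exact formula and the use of virtual-node IDs are assumptions; the
slide states only that all nodes share the permutation:

  /* All nodes agree on one permutation perm[] of the block indices;
     block (i, j) is then assigned cyclically to a virtual node. */
  int block_to_vnode(int i, int j, const int *perm,
                     int nblocks, int nvnodes) {
      int pi = perm[i], pj = perm[j];        /* permuted coordinates */
      return (pi * nblocks + pj) % nvnodes;  /* cyclic over virtual nodes */
  }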

17
Dynamically joining nodes
  • A new node sends a "steal" message to one of the
    existing nodes
  • The receiver abandons some virtual nodes, and
    sends their blocks to the new node
  • The new node takes over those virtual nodes and blocks
  • For better load balance, the stealing process is
    repeated (see the sketch below)
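The joining protocol sketched in C; every function name here is
hypothetical, reconstructed from the bullets above:

  /* A new node repeatedly steals virtual nodes, and the blocks mapped
     to them, from existing nodes until its share of work is adequate. */
  void join_computation(void) {
      while (need_more_load()) {         /* assumed balance criterion */
          int victim = pick_existing_node();
          send_steal_message(victim);    /* the "steal" message */
          /* the victim abandons some virtual nodes and ships their
             blocks; the new node adopts both on arrival */
          receive_vnodes_and_blocks();
      }
  }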

18
Experimental environments (1)
  • 112-node IBM BladeCenter cluster
  • Dual 2.4 GHz Xeon: 70 nodes
  • Dual 2.8 GHz Xeon: 42 nodes
  • 1 CPU per node is used
  • The slower CPU (2.4 GHz) determines the overall
    performance
  • Gigabit Ethernet

19
Experimental environments (2)
  • Ours (S=0): no explicit overlap
  • Ours (S=1): overlap with an adjacent iteration
  • Ours (S=5): overlap multiple (5) iterations
  • High Performance Linpack (HPL) is by Petitet et al.
  • GOTO BLAS is by Kazushige Goto (UT-Austin)

20
Scalability
[Speedup chart; annotations: x72, x65]
  • Matrix size N = 61440
  • Block size NB = 240
  • Overlap depth S = 0 or 5
  • Ours (S=5) achieves 190 GFlops with 108 nodes
  • a 65x speedup

21
Tolerance to background processes (1)
  • We run LU/HPL with background processes
  • We run 3 background processes per randomly
    chosen node
  • The background processes are short-term
  • They move to other random nodes every 10 seconds

22
Tolerance to background processes (2)
[Chart annotations: -16%, -26%, -31%, -36% slowdowns]
  • 108 nodes for computation
  • N = 46080
  • HPL slows down heavily
  • Ours (S=0) and Ours (S=1) also suffer
  • By overlapping multiple iterations (S=5), our LU
    becomes more tolerant!

23
Tolerance to large latencies (1)
  • We emulate future Grid environments with high
    bandwidth and large latencies
  • Experiments are done on a cluster
  • Large latencies are emulated in software
  • 0 ms, 200 ms, 500 ms

24
Tolerance to large latencies (2)
[Chart annotations: -19%, -20%, -28% slowdowns]
  • 108 nodes for computation
  • N = 46080
  • S=0 suffers by 28%
  • Overlapping iterations makes our LU more
    tolerant
  • Both S=1 and S=5 work well

25
Performance with joining nodes (1)
  • 16 nodes at first, then 48 nodes are added
    dynamically

26
Performance with joining nodes (2)
[Chart annotation: 1.9x faster]
  • N = 30720
  • S = 5
  • Flexibility in the number of nodes is useful for
    obtaining higher performance
  • Compared with Fixed-64, Dynamic suffers migration
    overhead, etc.

27
Related Work
  • Dyn-MPI [Weatherly et al. '03]
  • An extended MPI library that supports dynamically
    changing nodes

                          Dyn-MPI                     Our approach
  Redistribution method   Synchronous                 Asynchronous
  2D matrix distribution  Only the first dimension    Arbitrary (left to the programmers)
28
Summary
  • An LU implementation suitable for non-dedicated
    clusters and the Grid
  • Scalable
  • Supports dynamically changing nodes
  • Tolerates background processes and large latencies

29
Future Work
  • Perform pivoting
  • More data dependencies are introduced
  • Is our LU still tolerant?
  • Improve dynamic load balancing
  • Choose better target nodes for stealing
  • Take CPU speeds into account
  • Apply our approach to other HPC applications
  • CFD applications

30
Thank you!
31
Typical task scheduling
  • Each node updates blocks synchronously
  • Not tolerant to background processes

32
Our task scheduling to tolerate delays (1)
  • Each block is updated asynchronously
  • Blocks may have different iteration numbers

33
Our task scheduling to tolerate delays (2)
  • Not only allow skew, but create skew deliberately
  • Introduce prioritized task scheduling
  • Give higher priority to upper-left blocks

Target skew = 3 (similar to pipeline depth)
34
Performance with joining processes (3)
[Chart annotations: procs added; good peak speed; suffers from
migration; longer tail end]