Title: High Performance LU Factorization for Non-dedicated Clusters
1. High Performance LU Factorization for Non-dedicated Clusters and the Future Grid
- Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa
- (University of Tokyo)
2. Background
- Computing nodes on clusters and the Grid are shared by multiple applications
- To obtain good performance, HPC applications must cope with
  - Background processes
  - Dynamically changing sets of available nodes
  - Large latencies on the Grid
3. Performance limiting factor: background processes
- Other processes may run in the background
  - Network daemons, interactive shells, etc.
- Many typical applications are written in a synchronous style
  - In such applications, the delay of a single node degrades the overall performance
4. Performance limiting factor: large latencies on the Grid
- In future Grid environments, bandwidth will be sufficient for HPC applications
- Large latencies (> 100 ms) will remain an obstacle
  - Synchronous applications suffer from large latencies
5. Available nodes change dynamically
- Many HPC applications assume that the set of computing nodes is fixed
- If applications support dynamically changing nodes, we can harness computing resources more efficiently!
6. Goal of this work
- An LU factorization algorithm that
  - Tolerates background processes and large latencies
  - Supports dynamically changing nodes
A fast HPC application on non-dedicated clusters and the Grid
7. Outline of this talk
- The Phoenix model
- Our LU Algorithm
- Overlapping multiple iterations
- Data mapping for dynamically changing nodes
- Performance of our LU and HPL
- Related work
- Summary
8. Phoenix model [Taura et al. 03]
- A message passing model for dynamically changing environments
- Concept of virtual nodes
  - Virtual nodes as destinations of messages
(Figure: virtual nodes mapped onto physical nodes)
9. Overview of our LU
- Like typical implementations,
  - Based on message passing
  - The matrix is decomposed into small blocks
  - A block is updated by its owner node
- Unlike typical implementations,
  - Asynchronous, data-driven style for overlapping multiple iterations
  - Cyclic-like data mapping for any dynamically changing number of nodes
  - (Currently, pivoting is not performed)
10. LU factorization
  for (k = 0; k < B; k++) {
    A[k][k] = fact(A[k][k]);
    for (i = k+1; i < B; i++)
      A[i][k] = update_L(A[i][k], A[k][k]);
    for (j = k+1; j < B; j++)
      A[k][j] = update_U(A[k][j], A[k][k]);
    for (i = k+1; i < B; i++)
      for (j = k+1; j < B; j++)
        A[i][j] = A[i][j] - A[i][k] * A[k][j];
  }
(Figure: matrix blocks labeled Diagonal, L part, U part, Trail part)
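The slide does not spell out what fact, update_L, update_U, and the trailing update do to each b x b block, so the following is a minimal sequential sketch in C without pivoting and without a tuned BLAS (the real implementation uses GOTO BLAS); the block layout, the update_trail name, and the driver are assumptions for illustration only.

  #include <stdio.h>
  #include <stdlib.h>

  #define NB 4            /* number of block rows/columns (B on the slide) */
  #define BS 3            /* block size b */

  /* blocks stored as A[i][j], each a contiguous row-major BS x BS tile */
  static double A[NB][NB][BS * BS];

  /* Diagonal: in-place LU of the b x b block (unit lower L and U share storage) */
  static void fact(double *Akk, int b) {
      for (int k = 0; k < b; k++)
          for (int i = k + 1; i < b; i++) {
              Akk[i*b + k] /= Akk[k*b + k];                  /* L(i,k) */
              for (int j = k + 1; j < b; j++)
                  Akk[i*b + j] -= Akk[i*b + k] * Akk[k*b + j];
          }
  }

  /* L part: solve X * U_kk = A_ik in place (U_kk = upper triangle of Akk) */
  static void update_L(double *Aik, const double *Akk, int b) {
      for (int j = 0; j < b; j++)
          for (int i = 0; i < b; i++) {
              double s = Aik[i*b + j];
              for (int p = 0; p < j; p++)
                  s -= Aik[i*b + p] * Akk[p*b + j];
              Aik[i*b + j] = s / Akk[j*b + j];
          }
  }

  /* U part: solve L_kk * X = A_kj in place (L_kk = unit lower triangle of Akk) */
  static void update_U(double *Akj, const double *Akk, int b) {
      for (int i = 0; i < b; i++)
          for (int j = 0; j < b; j++) {
              double s = Akj[i*b + j];
              for (int p = 0; p < i; p++)
                  s -= Akk[i*b + p] * Akj[p*b + j];
              Akj[i*b + j] = s;                              /* diag(L) = 1 */
          }
  }

  /* Trail part: A_ij -= A_ik * A_kj */
  static void update_trail(double *Aij, const double *Aik, const double *Akj, int b) {
      for (int i = 0; i < b; i++)
          for (int j = 0; j < b; j++) {
              double s = 0.0;
              for (int p = 0; p < b; p++)
                  s += Aik[i*b + p] * Akj[p*b + j];
              Aij[i*b + j] -= s;
          }
  }

  int main(void) {
      /* diagonally dominant random matrix so that skipping pivoting is safe */
      for (int i = 0; i < NB; i++)
          for (int j = 0; j < NB; j++)
              for (int e = 0; e < BS * BS; e++)
                  A[i][j][e] = (double)rand() / RAND_MAX
                             + (i == j && e % (BS + 1) == 0 ? NB * BS : 0.0);

      /* the loop nest from the slide above */
      for (int k = 0; k < NB; k++) {
          fact(A[k][k], BS);
          for (int i = k + 1; i < NB; i++) update_L(A[i][k], A[k][k], BS);
          for (int j = k + 1; j < NB; j++) update_U(A[k][j], A[k][k], BS);
          for (int i = k + 1; i < NB; i++)
              for (int j = k + 1; j < NB; j++)
                  update_trail(A[i][j], A[i][k], A[k][j], BS);
      }
      printf("U(0,0) after factorization: %f\n", A[0][0][0]);
      return 0;
  }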
11. Naïve implementation and its problem
(Figure: number of executable tasks over time; the diagonal/L/U/trail tasks of the k-th, (k+1)-th, and (k+2)-th iterations do not overlap)
- Iterations are separated
- Not tolerant to latencies/background processes!
12. Latency Hiding Techniques
- Overlapping iterations hides latencies
  - Computation of the diagonal/L/U parts is advanced
  - If computation of the trail parts is kept separate, only two adjacent iterations are overlapped
There is room for further improvement
13. Overlapping multiple iterations for more tolerance
- We overlap multiple iterations
  - by computing all blocks, including trail parts, asynchronously
- A data-driven style and prioritized task scheduling are used
14. Prioritized task scheduling
- We assign a priority to the updating task of each block
- The k-th update of block A(i,j) has priority min(i - S, j - S, k) (a smaller number means higher priority)
  - where S is the desired overlap depth
- We can control overlapping by changing the value of S
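As a concrete reading of the formula, here is a small C sketch; the names block_priority and pick_task are illustrative, not taken from the actual implementation, and a real scheduler would use a priority queue rather than a linear scan over the ready tasks.

  /* Priority of the k-th update of block A(i,j); smaller value = higher priority.
     S is the desired overlap depth (S = 0, 1, 5 in the experiments). */
  static int block_priority(int i, int j, int k, int S) {
      int a = i - S;
      int b = j - S;
      int m = (a < b) ? a : b;
      return (m < k) ? m : k;          /* min(i - S, j - S, k) */
  }

  typedef struct { int i, j, k; } task;

  /* pick the ready task with the highest priority (smallest value) */
  static int pick_task(const task *ready, int n, int S) {
      int best = -1, best_prio = 0;
      for (int t = 0; t < n; t++) {
          int p = block_priority(ready[t].i, ready[t].j, ready[t].k, S);
          if (best < 0 || p < best_prio) { best = t; best_prio = p; }
      }
      return best;                      /* index into ready[], or -1 if none */
  }

Raising S lowers the priority value of blocks near the upper-left corner of the trailing submatrix, so their updates from later iterations can overtake far-away updates of the current one; this is how the depth of overlap is controlled.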
15. Typical data mapping and its problem
- Two-dimensional block-cyclic distribution of the matrix
- Good load balance and small communication volume, but
  - The number of nodes must be fixed and factored into two small numbers
- How to support dynamically changing nodes?
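For reference, the conventional mapping referred to here assigns block (i, j) to process (i mod P, j mod Q) of a fixed P x Q process grid; a one-line sketch (the helper name is hypothetical):

  /* Owner of block (i, j) under a P x Q two-dimensional block-cyclic
     distribution; returns a rank in [0, P*Q).  P and Q must be fixed in
     advance, which is exactly the limitation discussed on this slide. */
  static int owner_2d_block_cyclic(int i, int j, int P, int Q) {
      return (i % P) * Q + (j % Q);
  }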
16. Our data mapping for dynamically changing nodes
(Figure: block mapping of the original matrix)
- The permutation is common among all nodes
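The slide does not spell out the exact mapping, so the following is only a plausible sketch of a "cyclic-like" mapping with a shared permutation: every node derives the same permutation of block indices from a common seed, and the permuted blocks are dealt out cyclically over the virtual node space. The function names and the precise scheme are assumptions, not the authors' implementation.

  #include <stdlib.h>

  /* Fisher-Yates shuffle with a seed shared by all nodes, so every node
     computes the identical permutation sigma of the block indices.
     (In practice a portable PRNG would be used so all nodes agree.) */
  static void shared_permutation(int *sigma, int nblocks, unsigned seed) {
      srand(seed);
      for (int i = 0; i < nblocks; i++) sigma[i] = i;
      for (int i = nblocks - 1; i > 0; i--) {
          int r = rand() % (i + 1);
          int t = sigma[i]; sigma[i] = sigma[r]; sigma[r] = t;
      }
  }

  /* Cyclic-like assignment of permuted block (i, j) to one of V virtual
     nodes; V can be any number, independent of the physical node count. */
  static int virtual_owner(const int *sigma, int i, int j, int nblocks, int V) {
      return (sigma[i] * nblocks + sigma[j]) % V;
  }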
17. Dynamically joining nodes
- A new node sends a steal message to one of the existing nodes
- The receiver abandons some virtual nodes and sends their blocks to the new node
- The new node takes over those virtual nodes and blocks
- For better load balance, the stealing process is repeated
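A minimal sketch of the receiver's side of the steal protocol, under the assumption that it simply hands over half of its virtual nodes; the split policy, the handle_steal name, and the data structure are illustrative, and the message passing itself is elided.

  /* Virtual nodes currently owned by this physical node. */
  typedef struct {
      int *vnodes;
      int  count;
  } vnode_set;

  /* On receiving a steal message: give up roughly half of our virtual nodes.
     The blocks mapped to the surrendered virtual nodes are then sent to the
     new node, which takes over both the virtual nodes and the blocks. */
  static int handle_steal(vnode_set *mine, vnode_set *surrendered) {
      int give = mine->count / 2;
      surrendered->count  = give;
      surrendered->vnodes = mine->vnodes + (mine->count - give);
      mine->count -= give;
      return give;
  }

The joining node can repeat this exchange with several existing nodes, which corresponds to the "stealing process is repeated" step above.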
18. Experimental environments (1)
- 112-node IBM BladeCenter cluster
  - Dual 2.4 GHz Xeon: 70 nodes
  - Dual 2.8 GHz Xeon: 42 nodes
  - 1 CPU per node is used
  - The slower CPUs (2.4 GHz) determine the overall performance
- Gigabit Ethernet
19. Experimental environments (2)
- Ours (S=0): don't overlap explicitly
- Ours (S=1): overlap with an adjacent iteration
- Ours (S=5): overlap multiple (5) iterations
- High Performance Linpack (HPL) by Petitet et al.
- GOTO BLAS by Kazushige Goto (UT-Austin)
20. Scalability
(Figure: speedup vs. number of nodes; annotations: x72, x65)
- Matrix size N = 61440
- Block size NB = 240
- Overlap depth S = 0 or 5
- Ours (S=5) achieves 190 GFlops with 108 nodes
  - 65-fold speedup
21. Tolerance to background processes (1)
- We run LU/HPL with background processes
- We run 3 background processes per randomly chosen node
- The background processes are short-lived
  - They move to other random nodes every 10 seconds
22. Tolerance to background processes (2)
(Figure: performance with background processes; annotations: -16, -26, -31, -36)
- 108 nodes for computation
- N = 46080
- HPL slows down heavily
- Ours (S=0) and Ours (S=1) also suffer
- By overlapping multiple iterations (S=5), our LU becomes more tolerant!
23. Tolerance to large latencies (1)
- We emulate future Grid environments with high bandwidth and large latencies
  - Experiments are done on a cluster
  - Large latencies are emulated by software
  - 0 ms, 200 ms, 500 ms
24. Tolerance to large latencies (2)
(Figure: performance under emulated latencies; annotations: -19, -20, -28)
- 108 nodes for computation
- N = 46080
- S=0 suffers by 28%
- Overlapping iterations makes our LU more tolerant
  - Both S=1 and S=5 work well
25. Performance with joining nodes (1)
- 16 nodes at first, then 48 nodes are added dynamically
26. Performance with joining nodes (2)
(Figure: performance with joining nodes; annotation: x1.9 faster)
- Flexibility in the number of nodes is useful for obtaining higher performance
- Compared with Fixed-64, Dynamic suffers from migration overhead, etc.
27. Related Work
- Dyn-MPI [Weatherly et al. 03]
  - An extended MPI library that supports dynamically changing nodes

                               Dyn-MPI                    Our approach
  Redistribution method        Synchronous                Asynchronous
  Distribution of 2D matrices  Only the first dimension   Arbitrary (left to the programmer)
28. Summary
- An LU implementation suitable for non-dedicated clusters and the Grid
  - Scalable
  - Supports dynamically changing nodes
  - Tolerates background processes and large latencies
29. Future Work
- Perform pivoting
  - More data dependencies are introduced
  - Is our LU still tolerant?
- Improve dynamic load balancing
  - Choose better target nodes for stealing
  - Take CPU speeds into account
- Apply our approach to other HPC applications
  - CFD applications
30. Thank you!
31. Typical task scheduling
- Each node updates blocks synchronously
- Not tolerant to background processes
32. Our task scheduling to tolerate delays (1)
- Each block is updated asynchronously
- Blocks may have different iteration numbers
33. Our task scheduling to tolerate delays (2)
- Not only allow skew, but create skew explicitly
- Introduce prioritized task scheduling
  - Give higher priority to upper-left blocks
- Target skew: 3 (similar to pipeline depth)
34. Performance with joining processes (3)
(Figure: performance timeline; annotations: procs added, suffer from migration, good peak speed, longer tail end)