Title: High Performance LU Factorization for Non-dedicated Clusters
1. High Performance LU Factorization for Non-dedicated Clusters and the Future Grid
- Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa
- (University of Tokyo)
2. Background
- Computing nodes on clusters and the Grid are shared by multiple applications
- To obtain good performance, HPC applications must cope with
  - Background processes
  - Dynamically changing sets of available nodes
  - Large latencies on the Grid
3. Performance limiting factor: background processes
- Other processes may run in the background
  - Network daemons, interactive shells, etc.
- Many typical applications are written in a synchronous style
  - In such applications, the delay of a single node degrades the overall performance
4. Performance limiting factor: large latencies on the Grid
- In future Grid environments, bandwidth will be sufficient for HPC applications
- Large latencies (> 100 ms) will remain an obstacle
  - Synchronous applications suffer from large latencies
5. Available nodes change dynamically
- Many HPC applications assume that the set of computing nodes is fixed
- If applications support dynamically changing nodes, we can harness computing resources more efficiently!
6. Goal of this work
- An LU factorization algorithm that
  - Tolerates background processes and large latencies
  - Supports dynamically changing nodes
A fast HPC application on non-dedicated clusters and the Grid
7. Outline of this talk
- The Phoenix model
- Our LU Algorithm
- Overlapping multiple iterations
- Data mapping for dynamically changing nodes
- Performance of our LU and HPL
- Related work
- Summary
8. Phoenix model [Taura et al. 03]
- A message passing model for dynamically changing environments
- Concept of virtual nodes
  - Virtual nodes as destinations of messages
(Figure: virtual nodes mapped onto physical nodes)
9. Overview of our LU
- Like typical implementations,
  - Based on message passing
  - The matrix is decomposed into small blocks
  - A block is updated by its owner node
- Unlike typical implementations,
  - Asynchronous, data-driven style for overlapping multiple iterations
  - Cyclic-like data mapping for any dynamically changing number of nodes
  - (Currently, pivoting is not performed)
10. LU factorization
  for (k = 0; k < B; k++) {
    A[k][k] = fact(A[k][k]);
    for (i = k+1; i < B; i++)
      A[i][k] = update_L(A[i][k], A[k][k]);
    for (j = k+1; j < B; j++)
      A[k][j] = update_U(A[k][j], A[k][k]);
    for (i = k+1; i < B; i++)
      for (j = k+1; j < B; j++)
        A[i][j] = A[i][j] - A[i][k] * A[k][j];
  }
(Figure: matrix blocks labeled Diagonal, L part, U part, Trail part)
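The slide does not spell out what fact, update_L, update_U, and the trailing update do to each b x b block, so the following is a minimal sequential sketch in C without pivoting and without a tuned BLAS (the real implementation uses GOTO BLAS); the block layout, the update_trail name, and the driver are assumptions for illustration only.

  #include <stdio.h>
  #include <stdlib.h>

  #define NB 4            /* number of block rows/columns (B on the slide) */
  #define BS 3            /* block size b */

  /* blocks stored as A[i][j], each a contiguous row-major BS x BS tile */
  static double A[NB][NB][BS * BS];

  /* Diagonal: in-place LU of the b x b block (unit lower L and U share storage) */
  static void fact(double *Akk, int b) {
      for (int k = 0; k < b; k++)
          for (int i = k + 1; i < b; i++) {
              Akk[i*b + k] /= Akk[k*b + k];                  /* L(i,k) */
              for (int j = k + 1; j < b; j++)
                  Akk[i*b + j] -= Akk[i*b + k] * Akk[k*b + j];
          }
  }

  /* L part: solve X * U_kk = A_ik in place (U_kk = upper triangle of Akk) */
  static void update_L(double *Aik, const double *Akk, int b) {
      for (int j = 0; j < b; j++)
          for (int i = 0; i < b; i++) {
              double s = Aik[i*b + j];
              for (int p = 0; p < j; p++)
                  s -= Aik[i*b + p] * Akk[p*b + j];
              Aik[i*b + j] = s / Akk[j*b + j];
          }
  }

  /* U part: solve L_kk * X = A_kj in place (L_kk = unit lower triangle of Akk) */
  static void update_U(double *Akj, const double *Akk, int b) {
      for (int i = 0; i < b; i++)
          for (int j = 0; j < b; j++) {
              double s = Akj[i*b + j];
              for (int p = 0; p < i; p++)
                  s -= Akk[i*b + p] * Akj[p*b + j];
              Akj[i*b + j] = s;                              /* diag(L) = 1 */
          }
  }

  /* Trail part: A_ij -= A_ik * A_kj */
  static void update_trail(double *Aij, const double *Aik, const double *Akj, int b) {
      for (int i = 0; i < b; i++)
          for (int j = 0; j < b; j++) {
              double s = 0.0;
              for (int p = 0; p < b; p++)
                  s += Aik[i*b + p] * Akj[p*b + j];
              Aij[i*b + j] -= s;
          }
  }

  int main(void) {
      /* diagonally dominant random matrix so that skipping pivoting is safe */
      for (int i = 0; i < NB; i++)
          for (int j = 0; j < NB; j++)
              for (int e = 0; e < BS * BS; e++)
                  A[i][j][e] = (double)rand() / RAND_MAX
                             + (i == j && e % (BS + 1) == 0 ? NB * BS : 0.0);

      /* the loop nest from the slide above */
      for (int k = 0; k < NB; k++) {
          fact(A[k][k], BS);
          for (int i = k + 1; i < NB; i++) update_L(A[i][k], A[k][k], BS);
          for (int j = k + 1; j < NB; j++) update_U(A[k][j], A[k][k], BS);
          for (int i = k + 1; i < NB; i++)
              for (int j = k + 1; j < NB; j++)
                  update_trail(A[i][j], A[i][k], A[k][j], BS);
      }
      printf("U(0,0) after factorization: %f\n", A[0][0][0]);
      return 0;
  }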
11. Naïve implementation and its problem
(Figure: number of executable tasks over time; the diagonal/L/U/trail tasks of the k-th, (k+1)-th, and (k+2)-th iterations do not overlap)
- Iterations are separated
- Not tolerant to latencies/background processes!
12. Latency Hiding Techniques
- Overlapping iterations hides latencies
  - Computation of the diagonal/L/U parts is advanced
  - If computation of the trail parts is kept separate, only two adjacent iterations are overlapped
There is room for further improvement
13. Overlapping multiple iterations for more tolerance
- We overlap multiple iterations
  - by computing all blocks, including trail parts, asynchronously
- A data-driven style and prioritized task scheduling are used
14. Prioritized task scheduling
- We assign a priority to the updating task of each block
- The k-th update of block A(i,j) has priority min(i - S, j - S, k) (a smaller number means higher priority)
  - where S is the desired overlap depth
- We can control overlapping by changing the value of S
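As a concrete reading of the formula, here is a small C sketch; the names block_priority and pick_task are illustrative, not taken from the actual implementation, and a real scheduler would use a priority queue rather than a linear scan over the ready tasks.

  /* Priority of the k-th update of block A(i,j); smaller value = higher priority.
     S is the desired overlap depth (S = 0, 1, 5 in the experiments). */
  static int block_priority(int i, int j, int k, int S) {
      int a = i - S;
      int b = j - S;
      int m = (a < b) ? a : b;
      return (m < k) ? m : k;          /* min(i - S, j - S, k) */
  }

  typedef struct { int i, j, k; } task;

  /* pick the ready task with the highest priority (smallest value) */
  static int pick_task(const task *ready, int n, int S) {
      int best = -1, best_prio = 0;
      for (int t = 0; t < n; t++) {
          int p = block_priority(ready[t].i, ready[t].j, ready[t].k, S);
          if (best < 0 || p < best_prio) { best = t; best_prio = p; }
      }
      return best;                      /* index into ready[], or -1 if none */
  }

Raising S lowers the priority value of blocks near the upper-left corner of the trailing submatrix, so their updates from later iterations can overtake far-away updates of the current one; this is how the depth of overlap is controlled.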
15. Typical data mapping and its problem
- Two-dimensional block-cyclic distribution of the matrix
- Good load balance and small communication volume, but
  - The number of nodes must be fixed and factored into two small numbers
- How to support dynamically changing nodes?
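For reference, the conventional mapping referred to here assigns block (i, j) to process (i mod P, j mod Q) of a fixed P x Q process grid; a one-line sketch (the helper name is hypothetical):

  /* Owner of block (i, j) under a P x Q two-dimensional block-cyclic
     distribution; returns a rank in [0, P*Q).  P and Q must be fixed in
     advance, which is exactly the limitation discussed on this slide. */
  static int owner_2d_block_cyclic(int i, int j, int P, int Q) {
      return (i % P) * Q + (j % Q);
  }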
16. Our data mapping for dynamically changing nodes
(Figure: block mapping of the original matrix)
- The permutation is common among all nodes
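The slide does not spell out the exact mapping, so the following is only a plausible sketch of a "cyclic-like" mapping with a shared permutation: every node derives the same permutation of block indices from a common seed, and the permuted blocks are dealt out cyclically over the virtual node space. The function names and the precise scheme are assumptions, not the authors' implementation.

  #include <stdlib.h>

  /* Fisher-Yates shuffle with a seed shared by all nodes, so every node
     computes the identical permutation sigma of the block indices.
     (In practice a portable PRNG would be used so all nodes agree.) */
  static void shared_permutation(int *sigma, int nblocks, unsigned seed) {
      srand(seed);
      for (int i = 0; i < nblocks; i++) sigma[i] = i;
      for (int i = nblocks - 1; i > 0; i--) {
          int r = rand() % (i + 1);
          int t = sigma[i]; sigma[i] = sigma[r]; sigma[r] = t;
      }
  }

  /* Cyclic-like assignment of permuted block (i, j) to one of V virtual
     nodes; V can be any number, independent of the physical node count. */
  static int virtual_owner(const int *sigma, int i, int j, int nblocks, int V) {
      return (sigma[i] * nblocks + sigma[j]) % V;
  }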
17. Dynamically joining nodes
- A new node sends a steal message to one of the existing nodes
- The receiver abandons some virtual nodes and sends their blocks to the new node
- The new node takes over those virtual nodes and blocks
- For better load balance, the stealing process is repeated
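A minimal sketch of the receiver's side of the steal protocol, under the assumption that it simply hands over half of its virtual nodes; the split policy, the handle_steal name, and the data structure are illustrative, and the message passing itself is elided.

  /* Virtual nodes currently owned by this physical node. */
  typedef struct {
      int *vnodes;
      int  count;
  } vnode_set;

  /* On receiving a steal message: give up roughly half of our virtual nodes.
     The blocks mapped to the surrendered virtual nodes are then sent to the
     new node, which takes over both the virtual nodes and the blocks. */
  static int handle_steal(vnode_set *mine, vnode_set *surrendered) {
      int give = mine->count / 2;
      surrendered->count  = give;
      surrendered->vnodes = mine->vnodes + (mine->count - give);
      mine->count -= give;
      return give;
  }

The joining node can repeat this exchange with several existing nodes, which corresponds to the "stealing process is repeated" step above.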
18. Experimental environments (1)
- 112-node IBM BladeCenter cluster
  - Dual 2.4 GHz Xeon: 70 nodes
  - Dual 2.8 GHz Xeon: 42 nodes
  - 1 CPU per node is used
  - The slower CPUs (2.4 GHz) determine the overall performance
- Gigabit Ethernet
19. Experimental environments (2)
- Ours (S=0): don't overlap explicitly
- Ours (S=1): overlap with an adjacent iteration
- Ours (S=5): overlap multiple (5) iterations
- High Performance Linpack (HPL) by Petitet et al.
- GOTO BLAS by Kazushige Goto (UT-Austin)
20. Scalability
(Figure: speedup vs. number of nodes; annotations: x72, x65)
- Matrix size N = 61440
- Block size NB = 240
- Overlap depth S = 0 or 5
- Ours (S=5) achieves 190 GFlops with 108 nodes
  - 65-fold speedup
21. Tolerance to background processes (1)
- We run LU/HPL with background processes
- We run 3 background processes per randomly chosen node
- The background processes are short-lived
  - They move to other random nodes every 10 seconds
22. Tolerance to background processes (2)
(Figure: performance with background processes; annotations: -16, -26, -31, -36)
- 108 nodes for computation
- N = 46080
- HPL slows down heavily
- Ours (S=0) and Ours (S=1) also suffer
- By overlapping multiple iterations (S=5), our LU becomes more tolerant!
23. Tolerance to large latencies (1)
- We emulate future Grid environments with high bandwidth and large latencies
  - Experiments are done on a cluster
  - Large latencies are emulated by software
  - 0 ms, 200 ms, 500 ms
24. Tolerance to large latencies (2)
(Figure: performance under emulated latencies; annotations: -19, -20, -28)
- 108 nodes for computation
- N = 46080
- S=0 suffers by 28%
- Overlapping iterations makes our LU more tolerant
  - Both S=1 and S=5 work well
25. Performance with joining nodes (1)
- 16 nodes at first, then 48 nodes are added dynamically
26. Performance with joining nodes (2)
(Figure: performance with joining nodes; annotation: x1.9 faster)
- Flexibility in the number of nodes is useful for obtaining higher performance
- Compared with Fixed-64, Dynamic suffers from migration overhead, etc.
27. Related Work
- Dyn-MPI [Weatherly et al. 03]
  - An extended MPI library that supports dynamically changing nodes

                               Dyn-MPI                    Our approach
  Redistribution method        Synchronous                Asynchronous
  Distribution of 2D matrices  Only the first dimension   Arbitrary (left to the programmer)
28. Summary
- An LU implementation suitable for non-dedicated clusters and the Grid
  - Scalable
  - Supports dynamically changing nodes
  - Tolerates background processes and large latencies
29. Future Work
- Perform pivoting
  - More data dependencies are introduced
  - Is our LU still tolerant?
- Improve dynamic load balancing
  - Choose better target nodes for stealing
  - Take CPU speeds into account
- Apply our approach to other HPC applications
  - CFD applications
30. Thank you!
31. Typical task scheduling
- Each node updates blocks synchronously
- Not tolerant to background processes
32. Our task scheduling to tolerate delays (1)
- Each block is updated asynchronously
- Blocks may have different iteration numbers
33. Our task scheduling to tolerate delays (2)
- Not only allow skew, but create skew explicitly
- Introduce prioritized task scheduling
  - Give higher priority to upper-left blocks
- Target skew: 3 (similar to pipeline depth)
34. Performance with joining processes (3)
(Figure: performance timeline; annotations: procs added, suffer from migration, good peak speed, longer tail end)